# Relationship network extraction

## References

- https://archive.org/stream/OneHundredYearsOfSolitude_201710/One_Hundred_Years_of_Solitude_djvu.txt
- https://spacy.io/usage/spacy-101
- https://spacy.io/api/annotation#named-entities

In [1]:
import pandas as pd
import re
import spacy

In [2]:
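def get_chapters(book: str) -> list[str]:
    # Split on chapter headings; \d+ also consumes two-digit chapter numbers.
    # The first piece (text before "Chapter 1") is dropped.
    return [c.strip() for c in re.split(r'Chapter \d+', book)][1:]

def get_pages(book):
    # A page marker is a line holding a page number followed by a space.
    # Note: the trailing newline is optional here but required in the split
    # below, so the two results may not align one-to-one.
    pagination_map = [int(p) for p in re.findall(r'\n(\d+) \n?', book)]
    pages = re.split(r'\n\d+ \n', book)
    return pages, pagination_map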

def get_token_to_page_map(book):
    pass
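
`get_token_to_page_map` is left as a stub above. A minimal sketch of one way to approach it, assuming the same page-marker pattern that `get_pages` splits on: record the character offset where each page starts, then look up any character offset with a binary search. The names `PAGE_MARKER`, `page_start_offsets` and `page_index_for_offset` are hypothetical helpers, not part of the notebook.

```python
import bisect
import re

# Same page-marker pattern as the split in get_pages:
# newline, digits, a space, newline.
PAGE_MARKER = re.compile(r'\n\d+ \n')

def page_start_offsets(book: str) -> list[int]:
    """Character offsets at which each page's text begins (page 0 starts at 0)."""
    return [0] + [m.end() for m in PAGE_MARKER.finditer(book)]

def page_index_for_offset(starts: list[int], char_offset: int) -> int:
    """Index into the `pages` list of the page that contains `char_offset`.

    Offsets that fall inside a marker itself are assigned to the preceding page.
    """
    return bisect.bisect_right(starts, char_offset) - 1
```

With spaCy, `token.idx` is the token's character offset in the original text, so `page_index_for_offset(starts, token.idx)` would give the position of that token's page in `pages`. Whether that index lines up with the printed numbers in `pagination_map` depends on the source file and is not checked here.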

## Load book

In [3]:
book_path = 'data/one_hundred_years_of_solitude_EN.txt'

with open(book_path, 'r') as f:
    book = f.read()

## Parsing

In [4]:
nlp = spacy.load("en_core_web_sm")

In [5]:
book_doc = nlp(book)

In [6]:
pages, pagination_map = get_pages(book)
chapters = get_chapters(book)

In [7]:
print(f'Nº chapters: {len(chapters)}')
print(f'Nº pages: {len(pages)}')
print(f'Nº sentences: {len(list(book_doc.sents))}')
print(f'Vocabulary size: {book_doc.vocab.length}')
print(f'Nº tokens: {len(book_doc)}')
print(f'Nº characters: {len(book)}')

Nº chapters: 20
Nº pages: 192
Nº sentences: 6687
Vocabulary size: 10006
Nº tokens: 171099
Nº characters: 821643


## Token identification

In [8]:
idx = 171094
book_doc[idx], book_doc[idx].orth, book_doc[idx].idx

(THE, 6398231955146299758, 821629)

In [9]:
idx = 171095
book_doc[idx], book_doc[idx].orth, book_doc[idx].idx

(END, 14349458037152488889, 821633)

In [10]:
book_doc.vocab.strings[14349458037152488889]

'END'

In [11]:
book_doc.vocab.strings['Aureliano']

3228277545756960083

In [12]:
found = 0
for i, t in enumerate(book_doc):
    if t.text == 'Aureliano':
        print(t.orth, i)
        found += 1
    if found == 2:
        break

3228277545756960083 14
3228277545756960083 2168


In [13]:
book_doc[14], book_doc[2168]

(Aureliano, Aureliano)

## Entities

### Most mentioned characters

In [24]:
from collections import defaultdict

entities = defaultdict(list)
entities_count = defaultdict(int)

for ent in book_doc.ents:
    if ent.label_ == 'PERSON':
        entities[ent.text].append(ent)
        entities_count[ent.text] += 1

In [25]:
character_counts = pd.DataFrame(list(entities_count.items()), columns=['name', 'count'])
character_counts.sort_values('count', ascending=False).head(20)

|     | name | count |
| --- | --- | --- |
| 10 | Ursula | 314 |
| 20 | Aureliano | 244 |
| 113 | Aureliano Segundo | 142 |
| 19 | Aureliano Buendia | 132 |
| 12 | Jose Arcadio | 131 |
| 31 | Rebeca | 114 |
| 2 | Jose Arcadio Buendia | 98 |
| 1 | Melquiades | 75 |
| 154 | Amaranta Ursula | 71 |
| 98 | Jose Arcadio Segundo | 69 |
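
The counts are keyed on the entity's surface text, which is why "Aureliano", "Aureliano Buendia" and "Aureliano Segundo" appear as separate rows. The same tally can be sketched with `collections.Counter`. The `sample_ents` list below is an illustrative stand-in for `(ent.text, ent.label_)` pairs taken from `book_doc.ents`, and `count_person_mentions` is a hypothetical helper name.

```python
from collections import Counter

# Stand-in for (ent.text, ent.label_) pairs; illustrative values only,
# not taken from the parsed book.
sample_ents = [
    ("Ursula", "PERSON"),
    ("Macondo", "GPE"),
    ("Aureliano", "PERSON"),
    ("Ursula", "PERSON"),
]

def count_person_mentions(ents):
    """Count mentions per surface form, keeping only PERSON entities."""
    return Counter(text for text, label in ents if label == "PERSON")

counts = count_person_mentions(sample_ents)
print(counts.most_common(2))  # [('Ursula', 2), ('Aureliano', 1)]
```

`Counter.most_common(n)` returns the top `n` entries already sorted by count, so no separate sort is needed.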


### Entity references in the text

In [30]:
[e.start for e in entities['Melquiades'][:10]]

[159, 260, 667, 774, 1706, 1788, 2323, 2399, 2482, 2818]

In [37]:
span = entities['Melquiades'][0]
token = span[0]

In [48]:
book_doc[span.start]

Melquiades

### POS tags in the text

In [58]:
pos_tags = [
    'ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM', 
    'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SPACE', 'SYM', 'VERB', 'X'
]

In [79]:
i = 0
for token in book_doc:
    if token.pos_ == 'PRON':
        print(token)
        i += 1
    if i == 10:
        break

he
him
them
it
they
they
who
himself
what
he


In [112]:
i = 0
for nc in book_doc.noun_chunks:
    print(nc, list(nc.subtree))
    i += 1
    if i == 10:
        break

he [he]
the firing squad [the, firing, squad]
Colonel Aureliano Buendfa [Colonel, Aureliano, Buendfa]
distant afternoon [distant, afternoon, when, his, father, took, him, to, discover, ice]
his father [his, father]
him [him]
ice [ice]
that time [that, time]
Macondo [Macondo]
a village [a, village, of, 
, twenty, adobe, houses, ,, built, on, the, bank, of, a, river, of, clear, water, that, ran, along, a, bed, of, polished, 
, stones, ,, which, were, white, and, enormous, ,, like, prehistoric, eggs]


In [116]:
i = 0
for s in book_doc.sents:
    print(i, s)
    i += 1
    if i == 10:
        break

0 Chapter 1 


MANY YEARS LATER as he faced the firing squad.
1 Colonel Aureliano Buendfa was to remember that 
distant afternoon when his father took him to discover ice.
2 At that time Macondo was a village of 
twenty adobe houses, built on the bank of a river of clear water that ran along a bed of polished 
stones, which were white and enormous, like prehistoric eggs.
3 The world was so recent that many 
things lacked names, and in order to indicate them it was necessary to point.
4 Every year during the 
month of March a family of ragged gypsies would set up their tents near the village, and with a great 
uproar of pipes and kettledrums they would display new inventions.
5 First they brought the magnet. 

6 A heavy gypsy with an untamed beard and sparrow hands, who introduced himself as Melquiades, 
put on a bold public demonstration of what he himself called the eighth wonder of the learned al¬ 
chemists of Macedonia.
7 He went from house to house dragging two metal ingots and eve

In [118]:
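# `s` is the last sentence reached by the loop above (index 9).
# Span.noun_chunks is a generator, so wrap it in list() to display it.
list(s.noun_chunks)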


[Things, a life, the gypsy, a harsh accent]
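
The notebook stops at sentence-level noun chunks. One possible next step toward the relationship network named in the title, offered as an assumption rather than something shown above, is to count how often two character names occur in the same sentence and treat each pair as a weighted edge. `cooccurrence_edges` is a hypothetical helper and the `sample` data is made up for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(names_per_sentence):
    """Count unordered name pairs that appear together in a sentence.

    `names_per_sentence` is a list of lists of names, one inner list per sentence.
    Each pair is stored as an alphabetically sorted tuple so (a, b) == (b, a).
    """
    edges = Counter()
    for names in names_per_sentence:
        # set() ignores repeated mentions of one name within a sentence.
        for a, b in combinations(sorted(set(names)), 2):
            edges[(a, b)] += 1
    return edges

# Illustrative input only; not derived from the parsed book.
sample = [
    ["Ursula", "Aureliano"],
    ["Aureliano", "Ursula", "Rebeca"],
    ["Melquiades"],
]
edges = cooccurrence_edges(sample)
print(edges[("Aureliano", "Ursula")])  # 2
```

With the parsed document, the input could be built as `[[e.text for e in s.ents if e.label_ == 'PERSON'] for s in book_doc.sents]`. Name variants such as "Aureliano" and "Aureliano Buendia" would still be treated as different nodes.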