# Relationship network extraction

### References:

- https://archive.org/stream/OneHundredYearsOfSolitude_201710/One_Hundred_Years_of_Solitude_djvu.txt
- https://spacy.io/usage/spacy-101
- https://spacy.io/api/annotation#named-entities

In [63]:
import pandas as pd
import re
import spacy

In [4]:
#!python -m spacy download en_core_web_sm

In [64]:
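def get_chapters(book: str) -> list[str]:
    # Split on chapter headings; \d+ keeps two-digit numbers ("Chapter 10") intact.
    # The first element (text before "Chapter 1") is dropped.
    return [c.strip() for c in re.split(r'Chapter \d+', book)][1:]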

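def get_pages(book):
    # A page marker is a line holding only the page number followed by a space.
    # Note: findall treats the trailing newline as optional while split requires it,
    # so the two results are not guaranteed to have matching lengths.
    pagination_map = [int(p) for p in re.findall(r'\n(\d+) \n?', book)]
    pages = re.split(r'\n\d+ \n', book)
    return pages, pagination_map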

def get_token_to_page_map(book):
    pass
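
`get_token_to_page_map` is left as a stub above. The following is a minimal sketch of one way it could work, not the notebook's actual implementation: it records the character offset where each page starts (using the same marker pattern as `get_pages`) and looks up each token's character offset with a binary search. The helper names `page_start_offsets` and `token_to_page_index` are hypothetical, and the sample text is made up for illustration.

```python
import re
from bisect import bisect_right

# Same page-marker pattern that get_pages passes to re.split.
PAGE_MARKER = re.compile(r'\n\d+ \n')

def page_start_offsets(book):
    """Character offset at which each page's text begins (hypothetical helper)."""
    starts = [0]
    for match in PAGE_MARKER.finditer(book):
        starts.append(match.end())
    return starts

def token_to_page_index(token_offsets, page_starts):
    """Map each token's character offset to the index of its page in `pages`."""
    return [bisect_right(page_starts, offset) - 1 for offset in token_offsets]

# Made-up sample: three pages separated by two page markers.
sample = "first page text\n1 \nsecond page\n2 \nthird"
starts = page_start_offsets(sample)
# Whitespace tokens stand in for spaCy tokens here.
offsets = [m.start() for m in re.finditer(r'\S+', sample)]
page_of_token = token_to_page_index(offsets, starts)
print(page_of_token)
```

With a spaCy document, the offsets would come from `[t.idx for t in book_doc]`. Tokens that belong to a page marker itself are assigned to the page that precedes the marker. The result indexes into `pages`; it is not the printed page number.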

## Load book

In [65]:
book_path = 'data/one_hundred_years_of_solitude_EN.txt'

with open(book_path, 'r') as f:
    book = f.read()

In [70]:
chapters = get_chapters(book)

In [72]:
len(book), len(chapters)

(821643, 20)

## Parsing chapters and pages

In [92]:
nlp = spacy.load("en_core_web_sm")

In [93]:
book_doc = nlp(book)

In [94]:
pages, pagination_map = get_pages(book)
chapters = get_chapters(book)

In [95]:
print(f'Nº chapters: {len(chapters)}')
print(f'Nº pages: {len(pages)}')
print(f'Nº sentences: {len(list(book_doc.sents))}')
print(f'Vocabulary size: {book_doc.vocab.length}')
print(f'Nº tokens: {len(book_doc)}')
print(f'Nº characters: {len(book)}')

Nº chapters: 20
Nº pages: 192
Nº sentences: 6687
Vocabulary size: 10006
Nº tokens: 171099
Nº characters: 821643


## Token identification

In [11]:
idx = 171094
book_doc[idx], book_doc[idx].orth, book_doc[idx].idx

(THE, 6398231955146299758, 821629)

In [12]:
idx = 171095
book_doc[idx], book_doc[idx].orth, book_doc[idx].idx

(END, 14349458037152488889, 821633)

In [13]:
book_doc.vocab.strings[14349458037152488889]

'END'

In [14]:
book_doc.vocab.strings['Aureliano']

3228277545756960083

In [15]:
found = 0
for i, t in enumerate(book_doc):
    if t.text == 'Aureliano':
        print(t.orth, i)
        found += 1
    if found == 2:
        break

3228277545756960083 14
3228277545756960083 2168


In [16]:
book_doc[14], book_doc[2168]

(Aureliano, Aureliano)

In [17]:
book_doc[14].i, book_doc[2168].i

(14, 2168)

### Observations

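- Use the attribute `i` to get a token's index within the document; `book_doc[token.i]` returns the same token.
- Use the attribute `idx` to get the character offset of the token in the original text.
- Use the attribute `orth` to get the ID of the token's text in the vocabulary; `book_doc.vocab.strings` maps between that ID and the string.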

## Entities

### Most mentioned characters

In [18]:
entities_lst = []

for ent in book_doc.ents:
    e = {'entity': ent, 'text': ent.text, 'label': ent.label_, 'pos': ent.start }
    entities_lst.append(e)

In [19]:
entities_df = pd.DataFrame(entities_lst)

In [20]:
entities_df.label.unique()

array(['LAW', 'PERSON', 'TIME', 'ORG', 'DATE', 'ORDINAL', 'GPE',
       'CARDINAL', 'LOC', 'NORP', 'WORK_OF_ART', 'FAC', 'EVENT', 'MONEY',
       'PRODUCT', 'LANGUAGE', 'QUANTITY'], dtype=object)

In [21]:
entities_df[entities_df.label == 'PERSON'].groupby('text').size().reset_index(name='total').sort_values('total', ascending=False).head(20)

| | text | total |
|---|---|---|
| 192 | Ursula | 314 |
| 26 | Aureliano | 244 |
| 36 | Aureliano Segundo | 142 |
| 31 | Aureliano Buendia | 132 |
| 102 | Jose Arcadio | 131 |
| 163 | Rebeca | 114 |
| 107 | Jose Arcadio Buendia | 98 |
| 131 | Melquiades | 75 |
| 6 | Amaranta Ursula | 71 |
| 110 | Jose Arcadio Segundo | 69 |


In [22]:
entities_df[entities_df.label == 'LOC'].groupby('text').size().reset_index(name='total').sort_values('total', ascending=False).head()

| | text | total |
|---|---|---|
| 20 | earth | 21 |
| 4 | Caribbean | 6 |
| 1 | Arcadio | 4 |
| 3 | Aurelianos | 4 |
| 7 | Europe | 3 |


In [23]:
# GPE stands for Geopolitical entity
entities_df[entities_df.label == 'GPE'].groupby('text').size().reset_index(name='total').sort_values('total', ascending=False).head()

| | text | total |
|---|---|---|
| 3 | Amaranta | 176 |
| 53 | Macondo | 128 |
| 12 | Aureliano | 44 |
| 93 | Santa Sofia de la Piedad | 38 |
| 23 | Buendia | 29 |


### Entity reference in the text

In [24]:
[e['entity'].start for i, e in entities_df[entities_df.text == 'Melquiades'][:10].iterrows()]

[159, 260, 667, 774, 1706, 1788, 2291, 2323, 2399, 2482]

In [25]:
span = entities_df[entities_df.text == 'Melquiades'].iloc[0].entity
token = span[0]

In [26]:
span.start, token.i

(159, 159)

In [27]:
book_doc[span.start], book_doc[159], book_doc[260], token

(Melquiades, Melquiades, Melquiades, Melquiades)

### POS tags in the text

In [28]:
pos_tags = [
    'ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM', 
    'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SPACE', 'SYM', 'VERB', 'X'
]

In [29]:
i = 0
for token in book_doc:
    if token.pos_ == 'PRON':
        print (token)
        i += 1
    if i == 10:
        break

he
him
them
it
they
they
who
himself
what
he


## Noun-chunks

In [30]:
i = 0
for nc in book_doc.noun_chunks:
    print (nc, list(nc.subtree))
    i += 1
    if i == 10:
        break

he [he]
the firing squad [the, firing, squad]
Colonel Aureliano Buendfa [Colonel, Aureliano, Buendfa]
distant afternoon [distant, afternoon, when, his, father, took, him, to, discover, ice]
his father [his, father]
him [him]
ice [ice]
that time [that, time]
Macondo [Macondo]
a village [a, village, of, 
, twenty, adobe, houses, ,, built, on, the, bank, of, a, river, of, clear, water, that, ran, along, a, bed, of, polished, 
, stones, ,, which, were, white, and, enormous, ,, like, prehistoric, eggs]


## Matching misclassified entities

In [31]:
import unidecode

def process_text(text):
    text = text.strip()
    text = text.lower()
    text = unidecode.unidecode(text)
    return text

In [32]:
entities_df.shape

(6464, 4)

In [33]:
entities_df[entities_df.label == 'PERSON'].shape

(2452, 4)

In [34]:
entities_df['text_clean'] = entities_df.text.apply(process_text)

In [35]:
entities_df[entities_df.text_clean == 'jose arcadio buendia'].label.unique()

array(['PERSON', 'FAC', 'ORG'], dtype=object)

In [36]:
entities_df[entities_df.text_clean == 'ursula'].label.unique()

array(['PERSON', 'NORP', 'PRODUCT', 'LANGUAGE', 'GPE', 'ORG'],
      dtype=object)

- There are 6464 entities, of which 2452 are labeled `PERSON`.
- Some occurrences of these characters were misclassified (they should always be labeled `PERSON`).
- To correct this, we'll use a heuristic that relabels each entity based on the labels of its other occurrences.
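
The heuristic can be sketched in plain Python before turning to the pandas version: normalize each mention's text, count the labels seen for each normalized text, and keep the most frequent one. This is only an illustrative sketch. The function names `normalize` and `majority_labels` and the toy mentions are hypothetical, and `unicodedata` is used here as a standard-library stand-in for `unidecode`.

```python
from collections import Counter, defaultdict
import unicodedata

def normalize(text):
    """Trim, lowercase, and strip accents (stdlib stand-in for unidecode)."""
    decomposed = unicodedata.normalize('NFKD', text.strip().lower())
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

def majority_labels(mentions):
    """Return {normalized text: most frequent label} for (text, label) pairs."""
    counts = defaultdict(Counter)
    for text, label in mentions:
        counts[normalize(text)][label] += 1
    return {text: c.most_common(1)[0][0] for text, c in counts.items()}

# Toy mentions, made up for illustration (not taken from the book).
mentions = [
    ('Úrsula', 'PERSON'), ('Ursula', 'PERSON'), ('ursula ', 'NORP'),
    ('José Arcadio Buendía', 'PERSON'), ('Jose Arcadio Buendia', 'PERSON'),
    ('Jose Arcadio Buendia', 'FAC'),
]
labels = majority_labels(mentions)
print(labels)
```

Every mention whose normalized text appears in `labels` would then take that label, regardless of what the model originally predicted for it.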

In [37]:
text_entities = entities_df.groupby(['text_clean', 'label']).size().reset_index(name='total').sort_values('total', ascending=False)

In [38]:
text_entities.shape

(1508, 3)

In [39]:
text_entities.head()

| | text_clean | label | total |
|---|---|---|---|
| 1463 | ursula | PERSON | 314 |
| 346 | aureliano | PERSON | 244 |
| 566 | first | ORDINAL | 198 |
| 263 | amaranta | GPE | 176 |
| 918 | one | CARDINAL | 175 |


In [40]:
text_entities[text_entities.text_clean == 'ursula'] 

| | text_clean | label | total |
|---|---|---|---|
| 1463 | ursula | PERSON | 314 |
| 1461 | ursula | NORP | 21 |
| 1459 | ursula | GPE | 8 |
| 1464 | ursula | PRODUCT | 3 |
| 1462 | ursula | ORG | 1 |
| 1460 | ursula | LANGUAGE | 1 |


In [41]:
text_entities[text_entities.text_clean == 'amaranta'] 

| | text_clean | label | total |
|---|---|---|---|
| 263 | amaranta | GPE | 176 |
| 265 | amaranta | PERSON | 44 |
| 264 | amaranta | NORP | 7 |


In [42]:
text_entities[text_entities.text_clean == 'jose arcadio buendia'] 

| | text_clean | label | total |
|---|---|---|---|
| 705 | jose arcadio buendia | PERSON | 98 |
| 703 | jose arcadio buendia | FAC | 12 |
| 704 | jose arcadio buendia | ORG | 4 |


In [43]:
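# Row index of the most frequent label for each cleaned text.
# Selecting 'total' first avoids calling idxmax on the string column 'label'.
max_ids = text_entities.groupby('text_clean')['total'].idxmax()

In [44]:
entities_true_label = text_entities.loc[max_ids]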

In [45]:
entities_true_label.shape

(1387, 3)

In [46]:
entities_true_label[entities_true_label.text_clean == 'jose arcadio buendia'] 

| | text_clean | label | total |
|---|---|---|---|
| 705 | jose arcadio buendia | PERSON | 98 |


In [47]:
entities_true_label[entities_true_label.text_clean == 'ursula'] 

| | text_clean | label | total |
|---|---|---|---|
| 1463 | ursula | PERSON | 314 |


In [48]:
entities_true_label[entities_true_label.text_clean == 'amaranta'] 

| | text_clean | label | total |
|---|---|---|---|
| 263 | amaranta | GPE | 176 |


In [49]:
entities_true_label.sort_values('total', ascending=False).head(10)

| | text_clean | label | total |
|---|---|---|---|
| 1463 | ursula | PERSON | 314 |
| 346 | aureliano | PERSON | 244 |
| 566 | first | ORDINAL | 198 |
| 263 | amaranta | GPE | 176 |
| 918 | one | CARDINAL | 175 |
| 362 | aureliano segundo | PERSON | 142 |
| 699 | jose arcadio | PERSON | 136 |
| 355 | aureliano buendia | PERSON | 132 |
| 763 | macondo | GPE | 128 |
| 547 | fernanda | ORG | 116 |


- There are 1508 unique entities when distinguished by `text_clean` and `label`.
- If we assume that, for each text with more than one label, the most frequent label is the correct one, we get a total of 1387 unique entities.
- This fixes the labels for the characters `Ursula` and `Jose Arcadio Buendia`, but it also classifies the character `Amaranta` as a geopolitical entity (`GPE`) and `Fernanda` as an organization (`ORG`).
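
To make the side effect concrete, the sketch below takes the label counts shown in the `text_entities` outputs above and reports, for each character, which label wins the vote and how many mentions would be overwritten by it. The function name `majority_summary` is hypothetical, and the counts are copied by hand from the outputs rather than computed from `entities_df`.

```python
# Label counts copied from the text_entities outputs above.
label_counts = {
    'ursula': {'PERSON': 314, 'NORP': 21, 'GPE': 8,
               'PRODUCT': 3, 'ORG': 1, 'LANGUAGE': 1},
    'amaranta': {'GPE': 176, 'PERSON': 44, 'NORP': 7},
    'jose arcadio buendia': {'PERSON': 98, 'FAC': 12, 'ORG': 4},
}

def majority_summary(counts):
    """Return (winning label, mentions that would be relabeled, majority share)."""
    label = max(counts, key=counts.get)
    total = sum(counts.values())
    return label, total - counts[label], counts[label] / total

for text, counts in label_counts.items():
    label, relabeled, share = majority_summary(counts)
    print(f'{text}: {label}, {relabeled} mentions relabeled, '
          f'majority share {share:.0%}')
```

For `amaranta`, the vote would overwrite the 44 mentions that were correctly labeled `PERSON`, which is why a pure majority vote needs some additional check before it is applied to character names.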