In [4]:
import warnings
warnings.filterwarnings('ignore')

# NER with spaCy - Basic entity extraction, finding relations and some data analysis

## NER with spaCy

In this notebook I will use spaCy's named entity recognition (NER) algorithm to find relations between different entities in the Brown corpus.

### Basic entity extraction

The Brown corpus is a well-known corpus of English developed at Brown University, containing text from many different sources. We will use entity extraction on a subset of the Brown corpus covering a few categories.

We can use spaCy to find entities in a basic sentence as follows:

In [5]:
import spacy
nlp = spacy.load('en_core_web_sm')
sample_sentence = "The White House is located in Washington D.C."
sample_doc = nlp(sample_sentence)
print([(ent.text, ent.label_) for ent in sample_doc.ents])

[('The White House', 'ORG'), ('Washington', 'GPE'), ('D.C.', 'GPE')]


To see what an entity label means:

In [6]:
spacy.explain("ORG")

'Companies, agencies, institutions, etc.'

And to display the entities in a document using displaCy:

In [7]:
from spacy import displacy
displacy.render(sample_doc, style='ent', jupyter = True)

### Using the Brown corpus
Now let's load sentences from the Brown corpus for a few categories:

In [8]:
import nltk
nltk.download('brown')
from nltk.corpus import brown
sentences = brown.sents(categories = ['news', 'editorial', 'reviews'])

[nltk_data] Downloading package brown to /home/jrock/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


Let's use displaCy to display the entities in the first three sentences of this corpus. Which entities are tagged, and what do their entity labels mean?

In [9]:
for i in range(3):
    displacy.render(nlp(' '.join(sentences[i])), style='ent', jupyter = True)

Friday and September-October were tagged as DATE:

In [10]:
spacy.explain("DATE")

'Absolute or relative dates or periods'

Durwood Pye and Ivan Allen Jr. were tagged as PERSON:

In [11]:
spacy.explain("PERSON")

'People, including fictional'

### Some data analysis

Who are the five most frequently mentioned people in the corpus for these categories?

In [13]:
import tqdm
from collections import Counter
persons_list = [word.text for i in tqdm.tqdm(range(0, len(sentences)), position=0) 
 for word in nlp(' '.join(sentences[i])).ents if word.label_ == 'PERSON']
print('The five most common people in the corpus are:')
print(Counter(persons_list).most_common(5))

100%|██████████| 9371/9371 [00:34<00:00, 271.72it/s]

The five most common people in the corpus are:
[('Kennedy', 114), ('Khrushchev', 51), ('Maris', 36), ('Mantle', 28), ('Eisenhower', 26)]





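What are the five most common buildings? spaCy tags these with the FAC label, which covers buildings, airports, highways, bridges and similar facilities.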

In [15]:
buildings_list = [word.text for i in tqdm.tqdm(range(0, len(sentences)), position=0) 
 for word in nlp(' '.join(sentences[i])).ents if word.label_ == 'FAC']
print('The five most common buildings in the corpus are:')
print(Counter(buildings_list).most_common(5))

100%|██████████| 9371/9371 [00:30<00:00, 308.90it/s]

The five most common buildings in the corpus are:
[('Broadway', 15), ('the White House', 7), ('Pennsylvania Avenue', 4), ('Notre Dame', 4), ('Dreadnought', 4)]
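The counts above are based on the exact entity text, so 'the White House' and 'The White House' would be counted as different strings. Below is a minimal sketch of merging such variants before counting. The `normalize` helper and the toy mentions are assumptions for illustration, not part of the notebook or of spaCy.

```python
from collections import Counter

def normalize(mention):
    """Hypothetical helper: collapse whitespace and drop a leading 'the'."""
    words = mention.split()
    if words and words[0].lower() == "the":
        words = words[1:]
    return " ".join(words)

# Toy mentions (made up, not the corpus counts shown above)
mentions = ["the White House", "The White House", "White House", "Broadway", "Broadway"]

raw_counts = Counter(mentions)
merged_counts = Counter(normalize(m) for m in mentions)

print(raw_counts["Broadway"], raw_counts["the White House"])
print(merged_counts.most_common(2))
```

Whether this kind of merging is appropriate depends on the data: dropping the article is harmless for 'the White House' but could be wrong for names where 'The' is part of the title.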





## Finding relations

Now we will look at pairs of entities in sentences in the corpus and try to identify relations between them.

Let's say I would like to know where organizations are located.

Let's try to find all occurrences of an organization-location pair where the organization (ORG) comes before the location (GPE) in the sentence, with no other entity in between, and the word "in" appears somewhere between them.

I will put the results in a pandas DataFrame with three columns: ORG (organization name), GPE (location name), and context (the words between the organization and the location).

I use each entity's `start` and `end` attributes to get the token indices where the entity begins and ends in the sentence (`end` is exclusive). Note that these are indices into spaCy's own tokenization of the sentence, which can differ from the Brown token list.
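Entity offsets are token indices, so they only line up with a word list that was tokenized the same way. The sketch below shows how the same word can sit at different positions under two tokenizations. The regex tokenizer is a stand-in chosen for illustration and is not spaCy's actual tokenizer; the sentence fragment is made up.

```python
import re

# A made-up fragment in Brown style: punctuation separated, hyphenated word kept whole
brown_style = ["Smith", ",", "an", "ex-teacher", "from", "New", "York"]
text = " ".join(brown_style)

# Stand-in tokenizer (an assumption, NOT spaCy's rules): split off every punctuation mark
fine_grained = re.findall(r"\w+|[^\w\s]", text)

print(fine_grained)
print(brown_style.index("New"), fine_grained.index("New"))
```

Here "ex-teacher" becomes three tokens in the finer tokenization, so every later index shifts by two. Offsets taken from one token list and applied to the other will therefore point at the wrong words unless the shift is accounted for.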

In [16]:
# For every sentence, store each entity as a (text, start, end, label) tuple
word_tokens = [[(ent.text, ent.start, ent.end, ent.label_)
                for ent in nlp(' '.join(sentences[i])).ents]
               for i in tqdm.tqdm(range(0, len(sentences)), position=0)]

100%|██████████| 9371/9371 [00:31<00:00, 300.45it/s]


In [19]:
from itertools import tee
import pandas as pd

def pairwise(iterable):
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)
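To make the behaviour of `pairwise` concrete, here is a small usage sketch on toy entity tuples in the same (text, start, end, label) layout. The tuples and their offsets are made up for illustration.

```python
from itertools import tee

def pairwise(iterable):
    # Yield consecutive overlapping pairs: (s0, s1), (s1, s2), ...
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

# Toy entities for one sentence (values are illustrative only)
entities = [("NATO", 0, 1, "ORG"), ("Angola", 5, 6, "GPE"), ("Friday", 8, 9, "DATE")]

pairs = list(pairwise(entities))
for left, right in pairs:
    print(left[3], "->", right[3])
# ORG -> GPE
# GPE -> DATE

# A sentence with fewer than two entities produces no pairs
print(list(pairwise(entities[:1])))
# []
```

Because only consecutive entities are paired, the "no other entity in between" condition is satisfied by construction.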

In [22]:
relation_rows = []
for i, token in enumerate(word_tokens):
    for v, w in pairwise(token):
        # v and w are consecutive entities, each stored as (text, start, end, label)
        if v[3] == 'ORG' and w[3] == 'GPE':
            # Scan the words between the two entities for the keyword
            for word_index in range(v[2], w[1]-2):
                if sentences[i][word_index] == 'in':
                    relation_rows.append({'ORG': v[0], 'GPE': w[0],
                                          'context': ' '.join(sentences[i][v[2]:w[1]-2])})
                    break
# DataFrame.append was removed in pandas 2.0, so collect the rows first and build the frame once
df_relations = pd.DataFrame(relation_rows, columns=['ORG', 'GPE', 'context'])

print(f"There are {df_relations.shape[0]} ORG-GPE pairs where the organization (ORG) comes before the location (GPE) in the sentence, with no other entity in between, and the word 'in' appears somewhere between them.")

There are 17 ORG-GPE pairs where the organization (ORG) comes before the location (GPE) in the sentence, with no other entity in between, and the word 'in' appears somewhere between them.


In [13]:
df_relations.head()

| | ORG | GPE | context |
|---|---|---|---|
| 0 | the State Welfare Department | Fulton County | has seen fit to distribute these funds through... |
| 1 | NATO | Angola | has been set up so that in the future such topics |
| 2 | State Department | Laos | officials explain , now is mainly interested i... |
| 3 | Wellsley College | Columbus | , in the National A.A.U. |
| 4 | Paree , Genevieve | Empire | opening in |


**How much does this data tell us about what organizations are located where? In what cases can we be more or less certain?**

This data doesn't really tell us which organizations are located where. When only a few words separate the organization name from the location, there is a better chance that the word 'in' really does indicate a relation between the two. As the distance between them grows, the context usually shifts, and the word 'in' alone is no longer enough to conclude that the organization and the location are related. For example, in the NATO/Angola row above the 'in' belongs to the phrase 'in the future' and says nothing about where NATO is located.
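One way to act on this observation is to keep only the pairs whose context is short. The following is a minimal sketch on toy rows. The `MAX_GAP` threshold and the example rows are assumptions for illustration, and a suitable threshold would have to be chosen by inspecting real results.

```python
# Toy rows in the same shape as the relation table (contents are made up)
rows = [
    {"ORG": "Acme Corp", "GPE": "Boston", "context": "in"},
    {"ORG": "Acme Corp", "GPE": "Chicago", "context": "has said that in the future it may leave"},
]

MAX_GAP = 3  # hypothetical cutoff: maximum number of words allowed between the entities

def gap_length(row):
    """Number of whitespace-separated words in the context."""
    return len(row["context"].split())

close_pairs = [row for row in rows if gap_length(row) <= MAX_GAP]
print([(row["ORG"], row["GPE"]) for row in close_pairs])
# [('Acme Corp', 'Boston')]
```

A short gap raises the chance that the keyword links the two entities, but it is still only a heuristic: it cannot detect entities that were mis-tagged in the first place.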

#### Another example  

We can try to find a relation between a person and a place (the GPE label covers countries, cities and states) using 'from' as the context word:

In [23]:
relation_rows = []
for i, token in enumerate(word_tokens):
    for v, w in pairwise(token):
        # v and w are consecutive entities, each stored as (text, start, end, label)
        if v[3] == 'PERSON' and w[3] == 'GPE':
            # Scan the words between the two entities for the keyword
            for word_index in range(v[2], w[1]-2):
                if sentences[i][word_index] == 'from':
                    relation_rows.append({'PERSON': v[0], 'GPE': w[0],
                                          'context': ' '.join(sentences[i][v[2]:w[1]-2])})
                    break
# DataFrame.append was removed in pandas 2.0, so collect the rows first and build the frame once
df_relations = pd.DataFrame(relation_rows, columns=['PERSON', 'GPE', 'context'])

print(f"There are {df_relations.shape[0]} PERSON-GPE pairs that have the word 'from' between the person and the place.")

There are 3 PERSON-GPE pairs that have the word 'from' between the person and the place.


In [24]:
df_relations.head()

| | PERSON | GPE | context |
|---|---|---|---|
| 0 | Berry | San Antonio | , an ex-gambler from |
| 1 | Al Fike | Colorado | , an ex-schoolteacher from |
| 2 | Ella Fitzgerald | Australia | from |
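The ORG/'in' and PERSON/'from' searches differ only in the two labels and the keyword, so they could share one function. The sketch below is one possible shape for it, not the notebook's method. The name `find_relations` is hypothetical, and unlike the loops above it does not apply the `- 2` adjustment, so it assumes that the entity offsets index directly into the token list passed alongside them.

```python
from itertools import tee

def pairwise(iterable):
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def find_relations(entities_per_sentence, tokens_per_sentence,
                   first_label, second_label, keyword):
    """Return (first_text, second_text, context) for consecutive entity pairs
    with the given labels and `keyword` among the tokens between them.

    Assumption: each entity is (text, start, end, label) and start/end are
    indices into the token list of the same sentence.
    """
    relations = []
    for entities, tokens in zip(entities_per_sentence, tokens_per_sentence):
        for first, second in pairwise(entities):
            if first[3] != first_label or second[3] != second_label:
                continue
            between = tokens[first[2]:second[1]]
            if keyword in between:
                relations.append((first[0], second[0], " ".join(between)))
    return relations

# Toy input (made up): one sentence with a PERSON followed by a GPE
tokens = [["Jane", "Doe", "from", "Canada"]]
entities = [[("Jane Doe", 0, 2, "PERSON"), ("Canada", 3, 4, "GPE")]]

print(find_relations(entities, tokens, "PERSON", "GPE", "from"))
# [('Jane Doe', 'Canada', 'from')]
```

The returned tuples can be passed to `pd.DataFrame(..., columns=[...])` to obtain the same table layout as above.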
