This tutorial is based on work developed by Elizabeth Cary with Pacific Northwest National Lab.
POC: Elizabeth Cary, elizabeth.cary@pnnl.gov

# Applying NER and Coreference Resolution with spaCy and AllenNLP


## Load spaCy model
[en_core_web_sm](https://spacy.io/models/en#en_core_web_sm) is typically considered spaCy's default English model and comes pre-loaded with a number of components: tok2vec, tagger, parser, senter, ner, attribute_ruler, and lemmatizer. For this demo, we'll be focusing on the NER component, though you can check out the linked documentation for more information on this model and its offerings.

> Note: Take a look at the information included in the model documentation. What should we keep in mind when using this model? In particular, what type of training data was used to train these components? How will this affect how we use this model?

In [1]:
# Import packages
import json
import os

import pandas as pd
import spacy

# Load the small English pipeline
nlp = spacy.load('en_core_web_sm')

In [2]:
# Pick a dataset and load it into memory
dataset = 3
data_file = f'../Data/Dataset_{dataset}/Documents/Documents_Dataset_{dataset}.json'
df = pd.read_json(data_file, orient='records')
df.head()

| Unnamed: 0 | id | date | title | contents |
|---|---|---|---|---|
| 0 | disappearance1 | Jan 2014 Meeting announcement | | Athena Speaks <br><br>TO MEET the WHOLE FOR T... |
| 1 | disappearance2 | Jan 2014 | Centrum Sentinel | Centrum Sentinel <br><br> VOICES - a blog app... |
| 2 | disappearance3 | Oct 1995 | MAGNIFICENT OPENING GASTECH-KRONOS | The General Post MAGNIFICENT OPENING GASTECH-... |
| 3 | disappearance4 | Jan 2014 | GASTech Employees Kidnapped in Kronos | International Times GASTech Employees Kidnappe... |
| 4 | disappearance5 | Jan 2014 | Homeland Illumination | Homeland Illumination VOICES - a blog about wh... |


## Named Entity Recognition
Now that we have our data and spaCy model loaded, let's explore the model in a little more detail.

To better understand what we're being shown, here are the entity labels the NER component can assign, each with a short explanation:

In [3]:
for label in nlp.get_pipe('ner').labels:
    print(label, '|', spacy.explain(label))

CARDINAL | Numerals that do not fall under another type
DATE | Absolute or relative dates or periods
EVENT | Named hurricanes, battles, wars, sports events, etc.
FAC | Buildings, airports, highways, bridges, etc.
GPE | Countries, cities, states
LANGUAGE | Any named language
LAW | Named documents made into laws.
LOC | Non-GPE locations, mountain ranges, bodies of water
MONEY | Monetary values, including unit
NORP | Nationalities or religious or political groups
ORDINAL | "first", "second", etc.
ORG | Companies, agencies, institutions, etc.
PERCENT | Percentage, including "%"
PERSON | People, including fictional
PRODUCT | Objects, vehicles, foods, etc. (not services)
QUANTITY | Measurements, as of weight or distance
TIME | Times smaller than a day
WORK_OF_ART | Titles of books, songs, etc.


Let's test how this works on the first document:

In [5]:
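# Run the pipeline on the first document
doc = nlp(df['contents'][0])
# Print each entity span, its text, and its predicted label
# (a span prints as its text, so the text appears twice in the output)
for ent in doc.ents:
    print(ent, ent.text, ent.label_)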

Athena Speaks Athena Speaks ORG
Tomorrow Tomorrow DATE
2014/01/19 2014/01/19 CARDINAL
Kronos Kronos PERSON
tomorrow tomorrow DATE
8 AM 8 AM TIME
Sten St George Sten St George PERSON
GAStech GAStech ORG
Kronos Kronos ORG
Abila Abila GPE
Haneson Ngohebo Haneson Ngohebo PERSON
Blog Blog PERSON
St. George St. George GPE
GAStech GAStech ORG
Kronos Kronos ORG
IPO IPO ORG
GAStech GAStech ORG
Kronos Kronos ORG
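
Long entity listings like the one above are easier to review once they are summarized. The following is a minimal sketch, not part of the original notebook, that tallies entities per label and flags any text that received more than one label. The `pairs` list is copied by hand from the output above; with a live pipeline it could presumably be built as `[(ent.text, ent.label_) for ent in doc.ents]`. The names `pairs`, `label_counts`, and `inconsistent` are illustrative.

```python
from collections import Counter, defaultdict

# (text, label) pairs copied from the printed output above (assumption: with a
# loaded pipeline these would come from doc.ents instead of being hardcoded).
pairs = [
    ("Athena Speaks", "ORG"), ("Tomorrow", "DATE"), ("2014/01/19", "CARDINAL"),
    ("Kronos", "PERSON"), ("tomorrow", "DATE"), ("8 AM", "TIME"),
    ("Sten St George", "PERSON"), ("GAStech", "ORG"), ("Kronos", "ORG"),
    ("Abila", "GPE"), ("Haneson Ngohebo", "PERSON"), ("Blog", "PERSON"),
    ("St. George", "GPE"), ("GAStech", "ORG"), ("Kronos", "ORG"),
    ("IPO", "ORG"), ("GAStech", "ORG"), ("Kronos", "ORG"),
]

# Count how many entity mentions were assigned each label
label_counts = Counter(label for _, label in pairs)

# Collect every label assigned to each distinct entity text
labels_by_text = defaultdict(set)
for text, label in pairs:
    labels_by_text[text].add(label)

# Keep only the texts that were labeled inconsistently
inconsistent = {
    text: sorted(labels)
    for text, labels in labels_by_text.items()
    if len(labels) > 1
}

print(label_counts.most_common())
print(inconsistent)
```

In this document "Kronos" is tagged both PERSON and ORG, which is the kind of inconsistency worth keeping in mind given the training data question raised in the note earlier.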


In [6]:
spacy.displacy.render(doc, style='ent')

In [7]:
# Highlight only PERSON and GPE entities
options = {'ents': ['PERSON', 'GPE']}
spacy.displacy.render(doc, style='ent', options=options)

## Getting entities for all the documents

In [34]:
# For each document, collect its ID plus the unique GPE and PERSON entities found in its text
documentIDs = []
documentGeos = []
documentPeople = []

for doc_id, text in zip(df['id'], df['contents']):
    doc = nlp(text)
    documentIDs.append(doc_id)
    # A set drops repeated mentions; sorting gives a stable, reproducible order
    documentGeos.append(sorted({ent.text for ent in doc.ents if ent.label_ == 'GPE'}))
    documentPeople.append(sorted({ent.text for ent in doc.ents if ent.label_ == 'PERSON'}))
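
Collecting entities through a set removes duplicates but also discards the order in which they first appear in the text. If that order matters, one possible alternative is sketched below. It is an illustration only: `unique_entities` is a hypothetical helper, and `sample` is a hand-written list of (text, label) pairs taken from the first document's output rather than produced by running the pipeline.

```python
def unique_entities(pairs, label):
    """Return the texts tagged with `label`, de-duplicated in first-seen order."""
    # dict.fromkeys keeps the first occurrence of each key and preserves
    # insertion order, so it works as an ordered set.
    return list(dict.fromkeys(text for text, ent_label in pairs if ent_label == label))


# Hand-written sample (assumption: with a loaded pipeline this would be
# [(ent.text, ent.label_) for ent in doc.ents]).
sample = [
    ("GAStech", "ORG"), ("Kronos", "ORG"), ("Abila", "GPE"),
    ("Haneson Ngohebo", "PERSON"), ("St. George", "GPE"),
    ("GAStech", "ORG"), ("Kronos", "ORG"),
]

orgs = unique_entities(sample, "ORG")
geos = unique_entities(sample, "GPE")
print(orgs)
print(geos)
```

Note that exact string matching treats variants such as "Sten St George" and "St. George" as unrelated entries, so neither approach merges different spellings of the same name.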

In [35]:
# Build one JSON-serializable record per document
output = [
    {"id": doc_id, "Geos": geos, "People": people}
    for doc_id, geos, people in zip(documentIDs, documentGeos, documentPeople)
]

# outJSON = {}
# outJSON['Documents'] = output
outJSON = output.copy()

In [36]:
# Save to file
filename = f'../Data/Dataset_{dataset}/Documents/Entities_Dataset_{dataset}.json'

def write_json_data_to_file(file_path, data):
    """Write data to file_path as JSON, creating parent directories if needed."""
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    # The with block closes the file automatically; UTF-8 keeps non-ASCII names intact
    with open(file_path, 'w', encoding='utf-8') as file:
        json.dump(data, file, ensure_ascii=False)
    print("File written to", file_path)

write_json_data_to_file(filename, outJSON)
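
After writing a file like this, it can be worth reading it back to confirm that the records survive the round trip. The sketch below is a hedged illustration rather than part of the original workflow: it writes to a temporary directory instead of the dataset folder, and `records` is a small hand-made example in the same `id` / `Geos` / `People` shape, with values taken from the first document's output.

```python
import json
import os
import tempfile

# Illustrative record in the same shape as the entries in outJSON
records = [
    {
        "id": "disappearance1",
        "Geos": ["Abila", "St. George"],
        "People": ["Haneson Ngohebo", "Sten St George"],
    }
]

# Write to a throwaway directory so the real dataset folder is untouched
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "Entities_example.json")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False)

    # Read the file back using the same encoding
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

# The loaded data should match what was written
round_trip_ok = loaded == records
print(round_trip_ok)
```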