# Named entity recognition in spaCy
Kate Riesbeck  
19 May 2020  
  
   
This notebook reviews named entity recognition (NER) in spaCy with:
* a pretrained spaCy model
* spacy-lookup (dictionary-based entity matching)
* a custom model

## Setup

pip install -r requirements.txt

In [1]:
import spacy

## Default

In [2]:
nlp = spacy.load('en_core_web_sm')

In [3]:
# default pipeline
nlp.pipe_names

['tagger', 'parser', 'ner']

In [6]:
text = """George Washington (February 22, 1732[b] – December 14, 1799) was an American political leader, military general, statesman, and founding father who served as the first president of the United States from 1789 to 1797. Previously, he led Patriot forces to victory in the nation's War for Independence. He presided at the Constitutional Convention of 1787, which established the U.S. Constitution and a federal government. Washington has been called the "Father of His Country" for his manifold leadership in the formative days of the new nation.
"""

In [7]:
doc = nlp(text)

In [10]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

George Washington 0 17 PERSON
February 22 19 30 DATE
December 14, 1799 42 59 DATE
American 68 76 NORP
first 162 167 ORDINAL
the United States 181 198 GPE
1789 to 1797 204 216 DATE
Patriot 237 244 PERSON
War for Independence 279 299 EVENT
the Constitutional Convention 316 345 LAW
1787 349 353 DATE
the U.S. Constitution 373 394 LAW
Washington 421 431 GPE
the "Father of His Country" 448 475 LAW
the formative days 507 525 DATE
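
The `start_char` and `end_char` values printed above are character offsets into the original string, with the end offset exclusive, so slicing the text with them recovers each entity. A minimal plain-Python sketch, assuming a shortened copy of the text and four spans copied from the output above (the variable names are illustrative only):

```python
from collections import defaultdict

# Opening of the text from the cell above (shortened for the example).
sample = "George Washington (February 22, 1732[b] – December 14, 1799) was an American political leader"

# (start_char, end_char, label) triples copied from the printed output.
spans = [(0, 17, "PERSON"), (19, 30, "DATE"), (42, 59, "DATE"), (68, 76, "NORP")]

by_label = defaultdict(list)
for start, end, label in spans:
    # end_char is exclusive, so an ordinary slice returns the entity text.
    by_label[label].append(sample[start:end])

for label, texts in by_label.items():
    print(label, texts)
```

Grouping by label this way makes questionable predictions easier to spot, such as the several unrelated phrases tagged `LAW` in the full output.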


In [None]:
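# displacy: highlight the recognized entities inline in the notebook
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)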



## Add entities with spaCy lookup

pip install spacy-lookup




spacy-lookup matches on token text against a list of keywords; it is a dictionary lookup, not a statistical prediction.

It can be used alone or added to a pipeline alongside an existing model.

https://github.com/mpuig/spacy-lookup

In [38]:
from spacy_lookup import Entity

In [4]:
# current pipeline
nlp.pipe_names

['tagger', 'parser', 'ner']

In [40]:
presidents = ["Donald Trump" , "Barack Obama" , "George W. Bush" , "Bill Clinton" , "George H.W. Bush" , "Ronald Reagan" , "Jimmy Carter" , "Gerald Ford" , "Richard Nixon" , "Lyndon B. Johnson" , "John F. Kennedy" , "Dwight D. Eisenhower", "Harry S. Truman" , "Franklin D. Roosevelt" , "Herbert Hoover" , "Calvin Coolidge" , "Warren G. Harding" , "Woodrow Wilson" , "Howard Taft" , "Theodore Roosevelt" , "William McKinley" , "Grover Cleveland" , "Benjamin Harrison" , "Grover Cleveland" , "Chester A. Arthur" , "James Garfield" , "Rutherford B. Hayes" , "Ulysses S. Grant" , "Andrew Johnson" , "Abraham Lincoln" , "James Buchanan" , "Franklin Pierce" , "Millard Fillmore", "Zachary Taylor" , "James K. Polk" , "John Tyler" , "William Henry Harrison" , "Martin Van Buren" , "Andrew Jackson" , "John Quincy Adams" , "James Monroe" , "James Madison" , "Thomas Jefferson" , "John Adams" , "George Washington"]

In [42]:
# create a new "entity" pipeline component

# new labels can be added via a list, dictionary, or file

new_entities = Entity(keywords_list=presidents, label='PRES')

In [43]:
# add the new entity component before the existing 'ner' component
nlp.add_pipe(new_entities, before='ner', name='presidents')

In [5]:
nlp.pipe_names

['tagger', 'parser', 'presidents', 'ner']

In [45]:
doc = nlp(u"When George H.W. Bush was elected.")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

George H.W. Bush 5 21 PRES


In [46]:
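# limitation -- only exact matches are found: "George Bush" is not in the keyword list,
# so the statistical 'ner' component labels it PERSON instead of PRES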

doc = nlp(u"When George Bush was elected.")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

George Bush 5 16 PERSON
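
The miss above follows from how a dictionary lookup works: a keyword only matches when its exact text appears. Below is a minimal plain-Python sketch of that behavior. It is not spacy-lookup's actual implementation, it ignores token boundaries, and `find_keywords` and the alias list are hypothetical names introduced for illustration. It suggests that adding name variants to the keyword list is one way to catch "George Bush".

```python
def find_keywords(text, keywords, label):
    """Return (matched_text, start_char, end_char, label) for exact keyword hits."""
    hits = []
    for keyword in keywords:
        start = text.find(keyword)
        while start != -1:
            hits.append((keyword, start, start + len(keyword), label))
            start = text.find(keyword, start + 1)
    # Report hits in reading order.
    return sorted(hits, key=lambda hit: hit[1])


presidents = ["George W. Bush", "George H.W. Bush"]
text = "When George Bush was elected."

# Neither keyword appears verbatim, so nothing is found.
print(find_keywords(text, presidents, "PRES"))

# Adding the shorter variant as an alias makes the lookup succeed.
aliases = presidents + ["George Bush"]
print(find_keywords(text, aliases, "PRES"))
```

The trade-off is that a short alias is ambiguous (here it could refer to either president), which is the kind of case where a statistical model or extra context is needed.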


## Train a custom model

In [None]:
# Note: If you're using an existing model, make sure to mix in examples of
# other entity types that spaCy correctly recognized before. Otherwise, your
# model might learn the new type, but "forget" what it previously knew.
# https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting
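
One way to act on the note above is to interleave "revision" examples, meaning entities the existing model already labels correctly, with the annotations for the new type. The sketch below covers only the mixing step in plain Python. `mix_examples`, the 3:1 ratio, and the sample sentences are assumptions for illustration; in practice the revision annotations would come from running the original model over raw text.

```python
import random


def mix_examples(new_examples, revision_examples, revision_per_new=3, seed=0):
    """Combine new annotations with revision examples and shuffle them together."""
    rng = random.Random(seed)
    wanted = min(len(revision_examples), revision_per_new * len(new_examples))
    mixed = list(new_examples) + rng.sample(revision_examples, wanted)
    rng.shuffle(mixed)
    return mixed


# Annotations for the new entity type (hypothetical training data).
new_examples = [
    ("When George H.W. Bush was elected.", {"entities": [(5, 21, "PRES")]}),
]

# Stand-ins for predictions the existing model already gets right.
revision_examples = [
    ("He served from 1789 to 1797.", {"entities": [(15, 27, "DATE")]}),
    ("He was an American general.", {"entities": [(10, 18, "NORP")]}),
    ("George Washington led the army.", {"entities": [(0, 17, "PERSON")]}),
    ("He was the first president.", {"entities": [(11, 16, "ORDINAL")]}),
]

mixed = mix_examples(new_examples, revision_examples)
print(len(mixed), "training examples after mixing")
```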

In [None]:
# https://github.com/DataTurks-Engg/Entity-Recognition-In-Resumes-SpaCy

# spaCy’s models are statistical, and every “decision” they make (for example, whether a word is a named entity) is a prediction.
# This prediction is based on the examples the model has seen during training.

# During training, the model is shown the unlabelled text and makes a prediction.
# Because we know the correct answer, we can give the model feedback on its prediction in the form of an error gradient of the loss function, which measures the difference between the model's prediction and the expected output from the training example.
# The greater the difference, the more significant the gradient and the updates to our model.

# When training a model, we don’t just want it to memorise our examples;
# we want it to come up with a theory that can be generalised across other examples.
# After all, we don’t just want the model to learn that this one instance of “Amazon” right here is a company;
# we want it to learn that “Amazon”, in contexts like this, is most likely a company.
# To tune the accuracy, we process our training examples in batches
# and experiment with minibatch sizes and dropout rates.

# Of course, it’s not enough to show a model a single example only once.
# Especially if you only have a few examples, you’ll want to train for a number of iterations.
# At each iteration, the training data is shuffled to ensure the model doesn’t make any generalisations
# based on the order of examples.

# Another technique to improve the learning results is to set a dropout rate,
# a rate at which to randomly “drop” individual features and representations.
# This makes it harder for the model to memorise the training data.
# For example, a 0.25 dropout means that each feature or internal representation has a 1/4 likelihood of being dropped.
# We train the model for 10 epochs with a dropout rate of 0.2.
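
The ideas in the notes above (several epochs, reshuffling each epoch, minibatches, and dropout) can be sketched in plain Python. This is a skeleton under stated assumptions, not a spaCy training script: `minibatches` and `apply_dropout` are hypothetical helpers, the examples are integer stand-ins, and the model update is left as a counter. In spaCy itself the analogous pieces are `spacy.util.minibatch` and the `drop` argument of `nlp.update`.

```python
import random


def minibatches(items, size):
    """Yield consecutive batches of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def apply_dropout(features, rate, rng):
    """Zero each feature independently with probability `rate`.

    Simplified: real implementations typically also rescale the kept values.
    """
    return [0.0 if rng.random() < rate else value for value in features]


rng = random.Random(0)
examples = list(range(10))  # stand-ins for annotated training examples
n_epochs, batch_size, dropout = 10, 4, 0.2

n_updates = 0
for epoch in range(n_epochs):
    rng.shuffle(examples)  # new order every epoch
    for batch in minibatches(examples, batch_size):
        n_updates += 1  # a real loop would update the model on `batch` here

print(n_updates, "updates")

# With a rate of 0.2, roughly one feature in five is dropped.
features = [1.0] * 10_000
dropped_fraction = apply_dropout(features, dropout, rng).count(0.0) / len(features)
print(round(dropped_fraction, 2))
```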

In [None]:
# Results and evaluation of the model (from the resume NER example linked above):
# The model is tested on 20 resumes, and the predicted summarized resumes are stored as separate .txt files, one per resume.

# For each test resume, we calculate the accuracy, precision, recall, and F-score for each entity type that the model recognizes.
# These per-entity values are summed and averaged to give an overall score on the 20-resume test set.
# The entity-wise results are not reproduced in this notebook; the source describes the accuracy as commendable.
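
The per-entity metrics described above can be computed from gold and predicted spans. The sketch below assumes exact-match scoring on `(start_char, end_char, label)` triples. `per_label_scores` is a hypothetical helper, and the gold annotations are made up for illustration: they treat the "Washington" span at 421-431 as a PERSON, which the pretrained model labelled GPE in the earlier output.

```python
from collections import Counter


def per_label_scores(gold, predicted):
    """Precision, recall and F1 per label, counting only exact span+label matches."""
    gold, predicted = set(gold), set(predicted)
    true_pos = Counter(label for _, _, label in gold & predicted)
    n_pred = Counter(label for _, _, label in predicted)
    n_gold = Counter(label for _, _, label in gold)
    scores = {}
    for label in sorted(set(n_gold) | set(n_pred)):
        p = true_pos[label] / n_pred[label] if n_pred[label] else 0.0
        r = true_pos[label] / n_gold[label] if n_gold[label] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"precision": p, "recall": r, "f1": f}
    return scores


# Hypothetical gold annotations vs. spans taken from the model output above.
gold = [(0, 17, "PERSON"), (19, 30, "DATE"), (42, 59, "DATE"), (421, 431, "PERSON")]
predicted = [(0, 17, "PERSON"), (19, 30, "DATE"), (42, 59, "DATE"), (421, 431, "GPE")]

scores = per_label_scores(gold, predicted)
for label, metrics in scores.items():
    print(label, metrics)

# Averaging the per-label values gives one overall (macro) score.
macro_f1 = sum(m["f1"] for m in scores.values()) / len(scores)
print("macro F1:", round(macro_f1, 3))
```

Exact-match scoring is strict: a prediction with the right label but a slightly different boundary counts as both a false positive and a false negative.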



## Prodigy

In [None]:
# replace requirements.txt