# spaCy NER Intro

In [385]:
import spacy

nlp = spacy.load("en_core_web_md")
print(nlp.pipe_names)

['tagger', 'parser', 'ner']


## spaCy Tokenizer

In [386]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.pos_, token.dep_)

Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj


## spaCy NER

In [387]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY


Let's see how the pre-trained model does at pulling entities from this recent article about the Giants' Leonard Williams.

In [388]:
nyg_article = '''New York Giants defensive lineman Leonard Williams finished the best season of his NFL career by being named NFC Defensive Player of the Week for his dominant effort Sunday against the Dallas Cowboys.

Williams had three sacks, five additional pressures, three tackles for loss and six total stops against Dallas, along with a pass defensed in the Giants’ 23-19 victory.

Williams also won the Defensive Player of the Week honor after a 2.5-sack game against the Seattle Seahawks.

The six-year veteran finished with 11.5 sacks, the first double-digit sack season of his career. Williams’ total is the most by a Giant since Jason Pierre-Paul had 14.5 in 2014.

The Giants finished the season with 40 sacks, their highest total since they had 47 in 2014.

I also like the Dallas Stars.

Williams is heading to free agency after playing the 2020 season on the franchise tag. Best guess is he will be looking for a contract that will put him in the top 10 among defensive linemen, which means an average annual value of at least $17.5 million.'''

In [389]:
nyg_tokens = nlp(nyg_article)

for ent in nyg_tokens.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

New York Giants 0 15 ORG
Leonard Williams 34 50 PERSON
NFL 83 86 ORG
NFC 109 112 ORG
the Week 133 141 DATE
Sunday 166 172 DATE
Dallas 185 191 GPE
Cowboys 192 199 ORG
Williams 202 210 PERSON
three 215 220 CARDINAL
five 228 232 CARDINAL
three 255 260 CARDINAL
six 282 285 CARDINAL
Dallas 306 312 GPE
Giants 348 354 ORG
23 356 358 CARDINAL
Williams 372 380 PERSON
2.5-sack 437 445 QUANTITY
the Seattle Seahawks 459 479 ORG
six-year 486 494 DATE
11.5 517 521 CARDINAL
first 533 538 ORDINAL
Williams 579 587 PERSON
Giant 612 617 ORG
Jason Pierre-Paul 624 641 PERSON
14.5 646 650 CARDINAL
2014 654 658 DATE
Giants 665 671 ORG
season 685 691 DATE
40 697 699 CARDINAL
47 742 744 CARDINAL
2014 748 752 DATE
Dallas 771 777 GPE
Williams 786 794 PERSON
the 2020 season 835 850 DATE
10 950 952 CARDINAL
annual 1001 1007 DATE
at least $17.5 million 1017 1039 MONEY


The pre-trained model does fairly well, identifying several key entities: Leonard Williams, New York Giants, Cowboys.

It did, however, label team names inconsistently: New York Giants is tagged as a single ORG, while Dallas Cowboys is split into Dallas (GPE) and Cowboys (ORG), so the team entity loses its city.

Let's see if we can update the model to overcome this issue.
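To make the inconsistency easier to spot, one option is to group the predicted entities by label. The sketch below is illustrative only: `sample_ents` is a handful of rows copied from the output above, and `group_by_label` is a hypothetical helper, not part of spaCy.

```python
from collections import defaultdict

# A few (text, start_char, end_char, label) rows copied from the output above.
sample_ents = [
    ("New York Giants", 0, 15, "ORG"),
    ("Dallas", 185, 191, "GPE"),
    ("Cowboys", 192, 199, "ORG"),
    ("the Seattle Seahawks", 459, 479, "ORG"),
    ("Dallas", 771, 777, "GPE"),
]

def group_by_label(ents):
    """Hypothetical helper: map each label to the entity texts that carry it."""
    grouped = defaultdict(list)
    for text, _start, _end, label in ents:
        grouped[label].append(text)
    return dict(grouped)

by_label = group_by_label(sample_ents)
print(by_label["ORG"])
print(by_label["GPE"])
```

With live output, the same helper could be fed `[(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in nyg_tokens.ents]`.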

## Creating Training Data

In order to update our model, we will need labeled examples of the pattern we want the model to identify.

### Training Data Format

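The spaCy v2 training API used in this notebook expects training data in the form below: a list of `(text, annotations)` tuples, where `annotations["entities"]` lists `(start_char, end_char, label)` triples.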

In [415]:
train_data = [
    (
        "New York Giants defensive lineman Leonard Williams finished the best season of his NFL career by being named NFC Defensive Player of the Week for his dominant effort Sunday against the Dallas Cowboys.",
        {"entities": [(185, 199, "ORG")]},
    ),
    (
        "The Dallas Cowboys did not make the playoffs.",
        {"entities": [(4, 18, "ORG")]},
    ),
    (
        "When will the Dallas Cowboys learn.",
        {"entities": [(14, 28, "ORG")]},
    ),
    (
        "I am betting on the Cowboys this weekend",
        {"entities": [(20, 27, "ORG")]},
    ),
    (
        "Williams is heading to free agency after playing the 2020 season on the franchise tag.",
        {"entities": []},
    ),
]
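
Each entity is a `(start_char, end_char, label)` triple. The offsets are character positions into the text and the end is exclusive, as with Python slicing. A quick way to catch off-by-one mistakes in hand-labeled data is to slice the text with each span. `show_spans` below is a hypothetical helper written for this illustration.

```python
def show_spans(example):
    """Hypothetical helper: return the substring and label for each entity span."""
    text, annotations = example
    return [(text[start:end], label) for start, end, label in annotations["entities"]]

# Two of the hand-labeled examples from above.
examples = [
    ("The Dallas Cowboys did not make the playoffs.", {"entities": [(4, 18, "ORG")]}),
    ("I am betting on the Cowboys this weekend", {"entities": [(20, 27, "ORG")]}),
]

for example in examples:
    print(show_spans(example))
```

If a printed substring is cut off or includes a stray space, the offsets for that example need another look.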

Note: We added an entry where Williams is referenced by his last name only. When we ran this small set of examples through training with only the ORG labels, the model started overfitting and labeled Williams as an ORG. In reality, this is just too few examples to update the model reliably.

I labeled this data by hand, but in some situations we may be able to create labeled data using a regular expression.  Let's give it a try.

Let's start with our sentences in a list.

In [416]:
unlabeled_text = [x[0] for x in train_data]
unlabeled_text

['New York Giants defensive lineman Leonard Williams finished the best season of his NFL career by being named NFC Defensive Player of the Week for his dominant effort Sunday against the Dallas Cowboys.',
 'The Dallas Cowboys did not make the playoffs.',
 'When will the Dallas Cowboys learn.',
 'I am betting on the Cowboys this weekend',
 'Williams is heading to free agency after playing the 2020 season on the franchise tag.']

### Toy Problem

In [417]:
import re

# We changed the start of the first sentence to "The Dallas Cowboys" so the
# entity occurs twice; this checks that every occurrence is found.
sentences = ["The Dallas Cowboys defensive lineman Leonard Williams finished the best season of his NFL career by being named NFC Defensive Player of the Week for his dominant effort Sunday against the Dallas Cowboys."]
ent = "Dallas Cowboys|Cowboys"
label = "Org"

matches = [(sentence, 
            {"entities":
                       [(match.start(), match.end(), label) 
                        for match in re.finditer(ent, sentence)]}) 
           for sentence in sentences]
matches

[('The Dallas Cowboys defensive lineman Leonard Williams finished the best season of his NFL career by being named NFC Defensive Player of the Week for his dominant effort Sunday against the Dallas Cowboys.',
  {'entities': [(4, 18, 'Org'), (188, 202, 'Org')]})]

### Create a Function to Label Examples

Create a function that finds the start and stop positions of each regular expression match and uses them as data labels.

In [418]:
def label_sentences(sentences, ent, label):
    """Label every regex match of `ent` in each sentence with `label`."""
    labeled_sentences = [
        (sentence,
         {"entities": [(match.start(), match.end(), label)
                       for match in re.finditer(ent, sentence)]})
        for sentence in sentences
    ]
    return labeled_sentences


label_sentences(sentences, ent, label)


[('The Dallas Cowboys defensive lineman Leonard Williams finished the best season of his NFL career by being named NFC Defensive Player of the Week for his dominant effort Sunday against the Dallas Cowboys.',
  {'entities': [(4, 18, 'Org'), (188, 202, 'Org')]})]
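
One caveat with labeling by regular expression is that a bare pattern can also match inside a longer word. The sketch below shows the problem and one possible fix using word boundaries and `re.escape`. `build_pattern` is a hypothetical helper and the sentence is made up for the illustration; neither comes from the notebook.

```python
import re

def build_pattern(names):
    """Hypothetical helper: escape each name and require word boundaries around it."""
    # Longest first only matters when one name is a prefix of another.
    ordered = sorted(names, key=len, reverse=True)
    return r"\b(?:" + "|".join(re.escape(name) for name in ordered) + r")\b"

text = "The Dallas Stars sold Starship toys."  # made-up sentence

# The bare pattern also matches the "Stars" inside "Starship".
naive = [m.group() for m in re.finditer("Dallas Stars|Stars", text)]
# With word boundaries, only the whole-word match survives.
safer = [m.group() for m in re.finditer(build_pattern(["Dallas Stars", "Stars"]), text)]

print(naive)
print(safer)
```

The resulting pattern string can be passed as the `ent` argument of `label_sentences` unchanged.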

### Test Label Function

In [419]:
test_labeled_data = label_sentences(unlabeled_text, ent, "ORG")
test_labeled_data

[('New York Giants defensive lineman Leonard Williams finished the best season of his NFL career by being named NFC Defensive Player of the Week for his dominant effort Sunday against the Dallas Cowboys.',
  {'entities': [(185, 199, 'ORG')]}),
 ('The Dallas Cowboys did not make the playoffs.',
  {'entities': [(4, 18, 'ORG')]}),
 ('When will the Dallas Cowboys learn.', {'entities': [(14, 28, 'ORG')]}),
 ('I am betting on the Cowboys this weekend', {'entities': [(20, 27, 'ORG')]}),
 ('Williams is heading to free agency after playing the 2020 season on the franchise tag.',
  {'entities': []})]

In [420]:
test_labeled_data == train_data

True

In [421]:
doc_test = nlp(sentences[0])

# Keep the model's existing predictions, except the team-name pieces we relabel by hand
current_labels = [
    (ent.start_char, ent.end_char, ent.label_)
    for ent in doc_test.ents
    if ent.text not in ["Dallas", "Cowboys"]
]
    
current_labels

[(37, 53, 'PERSON'),
 (86, 89, 'ORG'),
 (112, 115, 'ORG'),
 (136, 144, 'DATE'),
 (169, 175, 'DATE'),
 (184, 202, 'FAC')]

In [422]:
final_labels = []

for sentence in unlabeled_text:
    doc_test = nlp(sentence)

    # Keep the model's existing predictions, except the team-name pieces we relabel by hand
    current_labels = [
        (ent.start_char, ent.end_char, ent.label_)
        for ent in doc_test.ents
        if ent.text not in ["Dallas", "Cowboys"]
    ]
    
    result = (sentence, 
            {"entities": current_labels})
    
    final_labels.append(result)

In [423]:
final_labels

[('New York Giants defensive lineman Leonard Williams finished the best season of his NFL career by being named NFC Defensive Player of the Week for his dominant effort Sunday against the Dallas Cowboys.',
  {'entities': [(0, 15, 'ORG'),
    (34, 50, 'PERSON'),
    (83, 86, 'ORG'),
    (109, 112, 'ORG'),
    (133, 141, 'DATE'),
    (166, 172, 'DATE'),
    (181, 199, 'FAC')]}),
 ('The Dallas Cowboys did not make the playoffs.', {'entities': []}),
 ('When will the Dallas Cowboys learn.', {'entities': []}),
 ('I am betting on the Cowboys this weekend',
  {'entities': [(28, 40, 'DATE')]}),
 ('Williams is heading to free agency after playing the 2020 season on the franchise tag.',
  {'entities': [(0, 8, 'PERSON'), (49, 64, 'DATE')]})]

In [424]:
# Add the hand-labeled ORG spans to the model's predictions for each sentence
for (_, predicted), (_, manual) in zip(final_labels, train_data):
    predicted["entities"].extend(manual["entities"])

final_labels

[('New York Giants defensive lineman Leonard Williams finished the best season of his NFL career by being named NFC Defensive Player of the Week for his dominant effort Sunday against the Dallas Cowboys.',
  {'entities': [(0, 15, 'ORG'),
    (34, 50, 'PERSON'),
    (83, 86, 'ORG'),
    (109, 112, 'ORG'),
    (133, 141, 'DATE'),
    (166, 172, 'DATE'),
    (181, 199, 'FAC'),
    (185, 199, 'ORG')]}),
 ('The Dallas Cowboys did not make the playoffs.',
  {'entities': [(4, 18, 'ORG')]}),
 ('When will the Dallas Cowboys learn.', {'entities': [(14, 28, 'ORG')]}),
 ('I am betting on the Cowboys this weekend',
  {'entities': [(28, 40, 'DATE'), (20, 27, 'ORG')]}),
 ('Williams is heading to free agency after playing the 2020 season on the franchise tag.',
  {'entities': [(0, 8, 'PERSON'), (49, 64, 'DATE')]})]

In [425]:
# Drop the model's FAC span (181, 199), which overlaps our ORG label for "Dallas Cowboys"
final_labels[0][1]['entities'].pop(6)
final_labels

[('New York Giants defensive lineman Leonard Williams finished the best season of his NFL career by being named NFC Defensive Player of the Week for his dominant effort Sunday against the Dallas Cowboys.',
  {'entities': [(0, 15, 'ORG'),
    (34, 50, 'PERSON'),
    (83, 86, 'ORG'),
    (109, 112, 'ORG'),
    (133, 141, 'DATE'),
    (166, 172, 'DATE'),
    (185, 199, 'ORG')]}),
 ('The Dallas Cowboys did not make the playoffs.',
  {'entities': [(4, 18, 'ORG')]}),
 ('When will the Dallas Cowboys learn.', {'entities': [(14, 28, 'ORG')]}),
 ('I am betting on the Cowboys this weekend',
  {'entities': [(28, 40, 'DATE'), (20, 27, 'ORG')]}),
 ('Williams is heading to free agency after playing the 2020 season on the franchise tag.',
  {'entities': [(0, 8, 'PERSON'), (49, 64, 'DATE')]})]
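
The `pop(6)` step depends on knowing the index of the overlapping FAC span in advance. A more general approach, sketched below under the assumption that hand labels should always win, drops any predicted span that overlaps a manual one. `overlaps` and `merge_labels` are hypothetical helpers, and the spans are taken from the first sentence above.

```python
def overlaps(a, b):
    """True if two (start, end, label) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

def merge_labels(predicted, manual):
    """Hypothetical helper: keep manual spans, drop predicted spans that overlap them."""
    kept = [p for p in predicted if not any(overlaps(p, m) for m in manual)]
    return sorted(kept + list(manual))

# Two predicted spans and the one hand-labeled span for the first sentence.
predicted = [(166, 172, "DATE"), (181, 199, "FAC")]
manual = [(185, 199, "ORG")]

merged = merge_labels(predicted, manual)
print(merged)
```

Sorting by start offset also keeps the merged list in reading order, unlike `extend`, which appends the manual spans at the end.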

In [426]:
train_data = final_labels

## Update NER Model

In [427]:
nlp_update = spacy.load("en_core_web_md")
ner = nlp_update.get_pipe('ner')

In [428]:
# `ner` is the same component that `nlp_update.entity` returns
optimizer = ner.create_optimizer()

In [429]:
other_pipes = [pipe for pipe in nlp_update.pipe_names if pipe != 'ner']

In [430]:
import random
from spacy.util import minibatch, compounding
from pathlib import Path

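n_iter = 30

with nlp_update.disable_pipes(*other_pipes):  # only train NER
    for itn in range(n_iter):
        # Reshuffle every iteration so the batches differ between passes
        random.shuffle(train_data)
        losses = {}  # reset each iteration, accumulated across its batches
        # Batch size starts at 4 and grows by a factor of 1.001 per batch, capped at 32
        batches = minibatch(train_data,
                            size=compounding(4., 32., 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            # Updating the weights
            nlp_update.update(texts, annotations, sgd=optimizer,
                              drop=0.35, losses=losses)
            print('Losses', losses)
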

Losses {'ner': 19.93783422850538}
Losses {'ner': 21.999592515322462}
Losses {'ner': 15.287014350295067}
Losses {'ner': 18.386359574193975}
Losses {'ner': 8.176636432763189}
Losses {'ner': 12.995391554579891}
Losses {'ner': 9.997786432504654}
Losses {'ner': 12.718017380244474}
Losses {'ner': 7.522971128491918}
Losses {'ner': 10.323486998941867}
Losses {'ner': 4.7784134699904826}
Losses {'ner': 7.748930032512159}
Losses {'ner': 5.092487674672157}
Losses {'ner': 5.257104549390811}
Losses {'ner': 4.365809489041567}
Losses {'ner': 6.4774227751826015}
Losses {'ner': 2.637667201110162}
Losses {'ner': 2.9098194269142406}
Losses {'ner': 1.058930950875947}
Losses {'ner': 1.0595719018936134}
Losses {'ner': 2.8254172340232344}
Losses {'ner': 2.932186745800185}
Losses {'ner': 1.3096429991181822}
Losses {'ner': 1.5036012075539027}
Losses {'ner': 0.0046084951327429735}
Losses {'ner': 0.1314652077349314}
Losses {'ner': 0.10067412133139442}
Losses {'ner': 0.8919648306674282}
Losses {'ner': 0.3539400098
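
The losses above are printed once per batch, and `losses` is reset at the start of every iteration, which is consistent with the values rising and falling in pairs. The sketch below re-creates the batching in plain Python to show why five examples with a starting size of 4 give two batches per pass. `compounding_sizes` and `batch_lengths` are hypothetical stand-ins written for this illustration, not spaCy's own `compounding` and `minibatch`.

```python
from itertools import islice

def compounding_sizes(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    current = float(start)
    while True:
        yield min(current, stop)
        current *= compound

def batch_lengths(items, sizes):
    """Split items into consecutive batches, taking int(size) items per batch."""
    iterator = iter(items)
    lengths = []
    while True:
        batch = list(islice(iterator, int(next(sizes))))
        if not batch:
            break
        lengths.append(len(batch))
    return lengths

examples = ["a", "b", "c", "d", "e"]  # stand-ins for the five training examples
lengths = batch_lengths(examples, compounding_sizes(4.0, 32.0, 1.001))
print(lengths)
```

With so few examples the batch size never grows in practice; the compounding schedule only starts to matter on larger training sets.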

In [431]:
test_doc = nlp_update("The Dallas Cowboys are the worst team in Dallas.  The other Dallas team is the Dallas Stars.")
print("Entities", [(ent.text, ent.label_) for ent in test_doc.ents])

Entities [('Dallas Cowboys', 'ORG'), ('Dallas', 'GPE'), ('Dallas', 'GPE'), ('Dallas Stars', 'ORG')]


In [432]:
import os
os.getcwd()

'C:\\Users\\nickr\\Documents\\Projects\\NLP\\spacy'

In [433]:
import os

# Directory, relative to the working directory
directory = os.path.join("content", "team_model")

# Parent directory path
parent_dir = os.getcwd()

# Full output path
output_dir = os.path.join(parent_dir, directory)

# Create the directory if it does not exist yet
os.makedirs(output_dir, exist_ok=True)
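
The same step can also be written with `pathlib`, which builds the path from parts instead of a hard-coded separator. This is a minimal sketch: `make_output_dir` is a hypothetical helper, and the temporary directory only stands in for the notebook's working directory.

```python
import tempfile
from pathlib import Path

def make_output_dir(parent, *parts):
    """Hypothetical helper: create parent/parts... if needed and return the path."""
    output_dir = Path(parent).joinpath(*parts)
    output_dir.mkdir(parents=True, exist_ok=True)
    return output_dir

# A temporary directory stands in for os.getcwd() so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as parent_dir:
    output_dir = make_output_dir(parent_dir, "content", "team_model")
    created = output_dir.is_dir()
    print(output_dir.name, created)
```

In the notebook, the call would be `make_output_dir(Path.cwd(), "content", "team_model")`.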

### Save the Model

In [434]:
# Save the  model to directory
# output_dir = './content/team_model'
nlp_update.to_disk(output_dir)
print("Saved model to", output_dir)

Saved model to C:\Users\nickr\Documents\Projects\NLP\spacy\content\team_model


### Load and Test New Model

In [435]:
# Load the saved model and predict
print("Loading from", output_dir)
nlp_updated = spacy.load(output_dir)
doc = nlp_updated(nyg_article)


Loading from C:\Users\nickr\Documents\Projects\NLP\spacy\content\team_model


In [436]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

New York Giants 0 15 ORG
Leonard Williams 34 50 PERSON
NFL 83 86 ORG
NFC 109 112 ORG
the Week 133 141 DATE
Sunday 166 172 DATE
Dallas Cowboys 185 199 ORG
Williams 202 210 PERSON
three 215 220 CARDINAL
five 228 232 CARDINAL
three 255 260 CARDINAL
six 282 285 CARDINAL
Dallas 306 312 GPE
Giants 348 354 ORG
23 356 358 CARDINAL
Williams 372 380 PERSON
the Week 414 422 DATE
Seattle Seahawks 463 479 ORG
six-year 486 494 DATE
11.5 517 521 CARDINAL
first 533 538 ORDINAL
Williams 579 587 PERSON
Giant 612 617 ORG
Jason Pierre-Paul 624 641 PERSON
14.5 646 650 CARDINAL
2014 654 658 DATE
Giants 665 671 ORG
40 697 699 CARDINAL
47 742 744 CARDINAL
2014 748 752 DATE
Dallas Stars 771 783 ORG
Williams 786 794 PERSON
the 2020 season 835 850 DATE
10 950 952 CARDINAL
annual 1001 1007 DATE
at least $17.5 million 1017 1039 MONEY
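
Comparing the two runs makes the effect of the update easier to see. The sketch below uses a few rows copied from the outputs before and after training, limited to the rows around the team names. `changed_entities` is a hypothetical helper.

```python
# Rows copied from the pre-trained output (before) and the updated output (after).
before = {
    ("Dallas", 185, 191, "GPE"),
    ("Cowboys", 192, 199, "ORG"),
    ("Dallas", 306, 312, "GPE"),
    ("the Seattle Seahawks", 459, 479, "ORG"),
    ("Dallas", 771, 777, "GPE"),
}
after = {
    ("Dallas Cowboys", 185, 199, "ORG"),
    ("Dallas", 306, 312, "GPE"),
    ("Seattle Seahawks", 463, 479, "ORG"),
    ("Dallas Stars", 771, 783, "ORG"),
}

def changed_entities(before, after):
    """Hypothetical helper: entities that disappeared and entities that appeared."""
    removed = sorted(before - after, key=lambda e: e[1])
    added = sorted(after - before, key=lambda e: e[1])
    return removed, added

removed, added = changed_entities(before, after)
print("removed:", [e[0] for e in removed])
print("added:  ", [e[0] for e in added])
```

On these rows, the updated model merges Dallas and Cowboys into a single ORG and also tags Dallas Stars as an ORG, even though no Stars example was in the training data. The full outputs differ in a few other places as well; for example, 2.5-sack is no longer tagged after the update.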
