# NER With Traditional spaCy and spaCy Transformers

## Traditional spaCy

In [46]:
import spacy
from spacy import displacy

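# Load the large English pipeline (tokenizer, tagger, parser and NER).
# Note: despite the variable name, this loads en_core_web_lg, not the small model.
nlp_sm = spacy.load("en_core_web_lg")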

In [47]:
def print_entities(pipeline, text):
    """Run the pipeline on the text and print each entity with its label."""
    # Create a document
    document = pipeline(text)

    # Print the text and label of each entity
    for entity in document.ents:
        print(entity.text + '->', entity.label_)

In [66]:
short_text = """
(CNN) - Amy Schneider,  an engineering manager from Oakland, California, became the first woman  and the fourth person on "Jeopardy!" to earn more than $1 million in  winnings on Friday's episode
"""


long_text = """
Good news for consumers, undoubtedly, and good news also for investors. Apple’s recent results, covering the three months to December 31 2016, saw the company’s chief financial officer Luca Maestri announce: ‘We returned nearly $15 billion to investors through share re-purchases and dividends during the quarter.’ The quarterly dividend itself was 57 cents a share, identical to the dividend for the previous three quarters and up on the 52 cents paid for each of the four quarters before that.

Business is brisk at Apple. On January 31, Tim Cook, Apple’s chief executive, said of the last three months of 2016: ‘We’re thrilled to report that our holiday quarter results generated Apple’s highest quarterly revenue ever, and broke multiple records along the way. We sold more iPhones than ever before and set all-time revenue records for iPhone, Services, Mac and Apple Watch.’
"""

In [56]:
#long_text

In [49]:
# Entities on the short text
print_entities(nlp_sm, short_text)

CNN-> ORG
Amy Schneider-> PERSON
Oakland-> GPE
California-> GPE
first-> ORDINAL
fourth-> ORDINAL
Jeopardy-> WORK_OF_ART
more than $1 million-> MONEY
Friday-> DATE


In [65]:
# Entities on the long text
print_entities(nlp_sm, long_text)

Apple-> ORG
the three months to December 31 2016-> DATE
Luca Maestri-> PERSON
nearly $15 billion-> MONEY
the quarter-> DATE
quarterly-> DATE
57 cents-> MONEY
the previous three quarters-> DATE
52 cents-> MONEY
the four quarters-> DATE
Apple-> ORG
January 31-> DATE
Tim Cook-> PERSON
Apple-> ORG
the last three months of 2016-> DATE
our holiday quarter-> DATE
Apple-> ORG
quarterly-> DATE
iPhones-> PRODUCT
Apple Watch-> ORG


Replacing the print loop in print_entities with a call to displacy.render gives a visualization in which every entity is highlighted directly in the text.

In [51]:
def visualize_entities(model, text):
    """Run the model on the text and render its entities inline with displacy."""
    # Create a document
    document = model(text)

    # Highlight the entities in the notebook output
    displacy.render(document, jupyter=True, style='ent')

In [52]:
# On the short text
visualize_entities(nlp_sm, short_text)

In [62]:
# On the long text
visualize_entities(nlp_sm, long_text)

## spaCy Transformers - RoBERTa

In [40]:
# Uncomment to download the transformer pipeline on first use
#!python3 -m spacy download en_core_web_trf

# Load the spaCy transformer pipeline (based on roberta-base)
roberta_nlp = spacy.load("en_core_web_trf")

In [41]:
# Entities on short text
print_entities(roberta_nlp, short_text)

CNN-> ORG
Amy Schneider-> PERSON
Oakland-> GPE
California-> GPE
first-> ORDINAL
fourth-> ORDINAL
Jeopardy-> WORK_OF_ART
more than $1 million-> MONEY
Friday-> DATE




In [64]:
# Entities on long text
print_entities(roberta_nlp, long_text)

Apple-> ORG
the three months to December 31 2016-> DATE
Luca Maestri-> PERSON
nearly $15 billion-> MONEY
the quarter-> DATE
quarterly-> DATE
57 cents-> MONEY
the previous three quarters-> DATE
the 52 cents-> MONEY
the four quarters-> DATE
Apple-> ORG
January 31-> DATE
Tim Cook-> PERSON
Apple-> ORG
the last three months of 2016-> DATE
holiday quarter-> DATE
Apple-> ORG
quarterly-> DATE
iPhones-> PRODUCT
iPhone-> PRODUCT
Apple-> ORG


In [43]:
# On the short text
visualize_entities(roberta_nlp, short_text)

So far, so good: on the short text, the transformer pipeline returns exactly the same entities as the traditional spaCy pipeline.
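
The comparison between the two pipelines can also be made programmatically rather than by eye. The following is a minimal sketch, not part of the original notebook: entity_pairs and diff_entities are hypothetical helper names, and the two lists are transcribed by hand from the short-text outputs printed above so that the snippet runs without loading any model.

```python
from collections import Counter


def entity_pairs(doc):
    """Return (text, label) pairs for every entity in a spaCy Doc."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def diff_entities(pairs_a, pairs_b):
    """Return the pairs found only in pairs_a and only in pairs_b.

    Counters are used instead of sets so that repeated entities
    (the same text and label occurring several times) are counted.
    """
    count_a, count_b = Counter(pairs_a), Counter(pairs_b)
    only_a = sorted((count_a - count_b).elements())
    only_b = sorted((count_b - count_a).elements())
    return only_a, only_b


# Pairs transcribed from the two short-text outputs printed above.
# With the pipelines loaded, they could instead be obtained with
# entity_pairs(nlp_sm(short_text)) and entity_pairs(roberta_nlp(short_text)).
traditional_short = [
    ("CNN", "ORG"),
    ("Amy Schneider", "PERSON"),
    ("Oakland", "GPE"),
    ("California", "GPE"),
    ("first", "ORDINAL"),
    ("fourth", "ORDINAL"),
    ("Jeopardy", "WORK_OF_ART"),
    ("more than $1 million", "MONEY"),
    ("Friday", "DATE"),
]
transformer_short = [
    ("CNN", "ORG"),
    ("Amy Schneider", "PERSON"),
    ("Oakland", "GPE"),
    ("California", "GPE"),
    ("first", "ORDINAL"),
    ("fourth", "ORDINAL"),
    ("Jeopardy", "WORK_OF_ART"),
    ("more than $1 million", "MONEY"),
    ("Friday", "DATE"),
]

only_traditional, only_transformer = diff_entities(traditional_short, transformer_short)
print(only_traditional, only_transformer)  # both lists are empty for the short text
```

The same helper can be applied to the long text, where the two printed outputs are no longer identical.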

In [63]:
# On the long text
visualize_entities(roberta_nlp, long_text)