# Parts of Speech & Named Entity Recognition

![nlp](https://wrm5sysfkg-flywheel.netdna-ssl.com/wp-content/uploads/2019/01/NLP-Technology-in-Healthcare.jpg)

# 4.1.0 - Parts of Speech tagging

**What is it?** Tokenisation can itself be difficult and poses problems in many languages. While some problems can be solved starting from the raw characters alone, it is generally better to use linguistic knowledge to add useful meaning and information.

**Why?** Most words are rare, and words that look completely different often mean almost the same thing. The same words in a different order can mean something wildly different. There is a lot of nuance in how words are used and in the order they appear. This is why it's important to look at the `part of speech` a word takes in context rather than only at the word or `token` itself.

This is exactly what spaCy was designed to do. You submit the raw text and get back an annotated `Doc` object. In the process spaCy provides two tiers of annotation: coarse POS tags such as `noun`, `verb` and `adjective`, and fine-grained tags such as `plural noun`, `past tense verb` and `superlative adjective`.
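To make the two tiers concrete, here is a minimal plain-Python sketch. The `FINE_TO_COARSE` table and the `group_by_coarse` helper are hypothetical names introduced for illustration; spaCy predicts both tiers itself rather than looking them up in a table like this one.

```python
from collections import defaultdict

# Illustrative subset only: fine-grained Penn Treebank tags and the coarse
# category each one falls under in the tables shown in these notes.
FINE_TO_COARSE = {
    "NN": "NOUN",   # noun, singular or mass
    "NNS": "NOUN",  # noun, plural
    "VBD": "VERB",  # verb, past tense
    "VBN": "VERB",  # verb, past participle
    "JJ": "ADJ",    # adjective
    "JJS": "ADJ",   # adjective, superlative
}


def group_by_coarse(fine_tags):
    """Group fine-grained tags under their coarse tag, keeping input order."""
    groups = defaultdict(list)
    for tag in fine_tags:
        groups[FINE_TO_COARSE.get(tag, "UNKNOWN")].append(tag)
    return dict(groups)


grouped = group_by_coarse(["NNS", "VBD", "JJS", "NN"])
print(grouped)
```

Several fine-grained tags collapse into one coarse tag, which is why the tables below carry both a `POS` and a `TAG` column.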


In [1]:
import spacy 

In [2]:
nlp = spacy.load('en_core_web_sm')

In [3]:
def doc_table(doc):
    # create the header 
    print(f"{'Token':{12}} {'POS_ID':>7} {'POS':{10}} {'TAG':{10}} {'Explanation':{25}}")
    print("-" * 100)

    # create the table for the document
    # str() guards against spacy.explain returning None for an unknown tag
    for token in doc:
        print(f"{token.text:{12}} {token.pos:>7} {token.pos_:{10}} {token.tag_:{10}} {str(spacy.explain(token.tag_)):{45}}")

In [4]:
def word_table(token):
    # create the header 
    print(f"{'Token':{12}} {'POS_ID':{7}} {'POS':{10}} {'TAG':{10}} {'Explanation':{25}}")
    print("-" * 100)

    # str() guards against spacy.explain returning None for an unknown tag
    print(f"{token.text:{12}} {token.pos:>7} {token.pos_:{10}} {token.tag_:{10}} {str(spacy.explain(token.tag_)):{45}}")

In [5]:
doc = nlp(u"The quick brown fox jumped over the lazy dog's back.")

In [6]:
doc_table(doc)

Token         POS_ID POS        TAG        Explanation              
----------------------------------------------------------------------------------------------------
The               90 DET        DT         determiner                                   
quick             84 ADJ        JJ         adjective                                    
brown             84 ADJ        JJ         adjective                                    
fox               96 PROPN      NNP        noun, proper singular                        
jumped           100 VERB       VBD        verb, past tense                             
over              85 ADP        IN         conjunction, subordinating or preposition    
the               90 DET        DT         determiner                                   
lazy              84 ADJ        JJ         adjective                                    
dog               92 NOUN       NN         noun, singular or mass                       
's                94 PART    

#### Working with tense in spaCy

In [7]:
# highlight a present tense example that is interpreted as 
# a past tense example by the lib.

doc = nlp(u"I read books on NLP.")

In [8]:
word = doc[1]

In [9]:
# show the analysis 
word_table(word)

Token        POS_ID  POS        TAG        Explanation              
----------------------------------------------------------------------------------------------------
read             100 VERB       VBD        verb, past tense                             


In [10]:
# print the table for the whole sentence to show the tag in context
doc_table(doc)

Token         POS_ID POS        TAG        Explanation              
----------------------------------------------------------------------------------------------------
I                 95 PRON       PRP        pronoun, personal                            
read             100 VERB       VBD        verb, past tense                             
books             92 NOUN       NNS        noun, plural                                 
on                85 ADP        IN         conjunction, subordinating or preposition    
NLP               96 PROPN      NNP        noun, proper singular                        
.                 97 PUNCT      .          punctuation mark, sentence closer            


In [11]:
# create a past tense example 
doc = nlp(u"I read a book on NLP")

In [12]:
word = doc[1]
word_table(word)

Token        POS_ID  POS        TAG        Explanation              
----------------------------------------------------------------------------------------------------
read             100 VERB       VBD        verb, past tense                             


In [13]:
doc = nlp(u"read books to be well read")
doc_table(doc)

Token         POS_ID POS        TAG        Explanation              
----------------------------------------------------------------------------------------------------
read             100 VERB       VB         verb, base form                              
books             92 NOUN       NNS        noun, plural                                 
to                94 PART       TO         infinitival "to"                             
be                87 AUX        VB         verb, base form                              
well              86 ADV        RB         adverb                                       
read             100 VERB       VBN        verb, past participle                        


The tables above show that each POS tag has an integer identifier. spaCy stores strings as hashed integer IDs internally to maximise performance, so you need not concern yourself with how those IDs are ordered. Instead, you can look up the POS identifier for each of the 50k+ language references and map it back to a readable name through the vocabulary.
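The idea of swapping strings for integer IDs can be sketched in plain Python. `ToyStringStore` is a hypothetical class written for these notes: it hands out sequential integers, whereas spaCy uses hashes, so treat it as an illustration of the two-way lookup only.

```python
class ToyStringStore:
    """Toy two-way mapping between strings and integer IDs (not spaCy's code)."""

    def __init__(self):
        self._id_by_text = {}
        self._text_by_id = []

    def add(self, text):
        """Return the ID for `text`, assigning the next free integer if new."""
        if text not in self._id_by_text:
            self._id_by_text[text] = len(self._text_by_id)
            self._text_by_id.append(text)
        return self._id_by_text[text]

    def text(self, id_):
        """Return the string stored under `id_`."""
        return self._text_by_id[id_]


store = ToyStringStore()
pos_ids = [store.add(tag) for tag in ["DET", "ADJ", "ADJ", "NOUN"]]
print(pos_ids)
print([store.text(i) for i in pos_ids])
```

Repeated strings share one ID, so comparing or counting tags becomes integer work, and the readable name is only looked up when it is printed.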

In [14]:
doc = nlp(u"The quick brown fox jumped over the lazy dog's back.")

In [15]:
pos_count = doc.count_by(spacy.attrs.POS)
pos_count

{90: 2, 84: 3, 96: 1, 100: 1, 85: 1, 92: 2, 94: 1, 97: 1}

In [16]:
for k,v in sorted(pos_count.items()):
    print(f"{k:>3} {doc.vocab[k].text:{8}} {v}")

 84 ADJ      3
 85 ADP      1
 90 DET      2
 92 NOUN     2
 94 PART     1
 96 PROPN    1
 97 PUNCT    1
100 VERB     1


# 4.2.0 - Visualizing Parts of Speech

A quick review of displaCy, spaCy's built-in visualiser.

In [17]:
from spacy import displacy

In [18]:
doc = nlp(u"The quick brown fox jumped over the lazy dog")

In [19]:
displacy.render(doc, style='dep', jupyter=True)

In [20]:
options = {'distance': 70, 'compact': True, 'color': 'yellow', 'bg': '#888', 'font': 'Times'}

In [21]:
displacy.render(doc, style='dep', jupyter=True, options=options)

# 4.3.0 - Named Entity Recognition

NER seeks to locate and classify named entity mentions in unstructured text into predefined categories such as person names, organisations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

In [22]:
def show_ents(doc):
    if doc.ents:
        # print a table header
        print(f"{'Entity':{20}}{'Label':{12}}{'Explanation':{50}} ")
        print("-" * 100)
        
        for ent in doc.ents:
            print(f"{ent.text:<20}{ent.label_:<12}{str(spacy.explain(ent.label_)):50} ")
    else:
        print("No entities identified.")

In [23]:
# random negative sample. 
doc = nlp(u"Hi how are you?")
show_ents(doc)

No entities identified.


In [24]:
# entity present example 
doc = nlp(u"The market opened in New York city with Apple at an all time high of $1124 per share.")
show_ents(doc)

Entity              Label       Explanation                                        
----------------------------------------------------------------------------------------------------
New York            GPE         Countries, cities, states                          
Apple               ORG         Companies, agencies, institutions, etc.            
1124                MONEY       Monetary values, including unit                    


In [25]:
doc_table(doc)

Token         POS_ID POS        TAG        Explanation              
----------------------------------------------------------------------------------------------------
The               90 DET        DT         determiner                                   
market            92 NOUN       NN         noun, singular or mass                       
opened           100 VERB       VBD        verb, past tense                             
in                85 ADP        IN         conjunction, subordinating or preposition    
New               96 PROPN      NNP        noun, proper singular                        
York              96 PROPN      NNP        noun, proper singular                        
city              92 NOUN       NN         noun, singular or mass                       
with              85 ADP        IN         conjunction, subordinating or preposition    
Apple             96 PROPN      NNP        noun, proper singular                        
at                85 ADP     

In [26]:
doc = nlp(u"I would like to go to London next June to visit the Tower of London")
show_ents(doc)

Entity              Label       Explanation                                        
----------------------------------------------------------------------------------------------------
London              GPE         Countries, cities, states                          
next June           DATE        Absolute or relative dates or periods              
the Tower of London FAC         Buildings, airports, highways, bridges, etc.       


In [27]:
doc = nlp(u"Is 500 dollars the same thing as $500 as by next Christmas I should have saved that to buy a new Microsoft laptop")
show_ents(doc)

Entity              Label       Explanation                                        
----------------------------------------------------------------------------------------------------
500 dollars         MONEY       Monetary values, including unit                    
500                 MONEY       Monetary values, including unit                    
next Christmas      DATE        Absolute or relative dates or periods              
Microsoft           ORG         Companies, agencies, institutions, etc.            


In [28]:
doc = nlp(u"Tesla to build Glasgow factory creating 460 jobs and costing 290 million")
show_ents(doc)

Entity              Label       Explanation                                        
----------------------------------------------------------------------------------------------------
Tesla               ORDINAL     "first", "second", etc.                            
Glasgow             GPE         Countries, cities, states                          
460                 CARDINAL    Numerals that do not fall under another type       
290 million         CARDINAL    Numerals that do not fall under another type       


#### Defining custom data sets - Adding Entities to a Span

In [29]:
from spacy.tokens import Span

In [36]:
# look up the hash value of the built-in ORG label
ORG = doc.vocab.strings[u"ORG"]
ORG

383

In [38]:
# doc = document, 0 = start position, 1 = exclusive end position, label = entity label
new_entity = Span(doc,0,1, label=ORG)
new_entity

Tesla

In [39]:
# add new entity
doc.ents = list(doc.ents) + [new_entity]
doc.ents

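ValueError: [E103] Trying to set conflicting doc.ents: '(0, 1, 'ORDINAL')' and '(0, 1, 'ORG')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.

The assignment fails because the model has already labelled `Tesla` as an `ORDINAL` entity, and a token can only belong to one entity. `doc.ents` is therefore left unchanged, as the next table shows.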

In [33]:
show_ents(doc)

Entity              Label       Explanation                                        
----------------------------------------------------------------------------------------------------
Tesla               ORDINAL     "first", "second", etc.                            
Glasgow             GPE         Countries, cities, states                          
460                 CARDINAL    Numerals that do not fall under another type       
290 million         CARDINAL    Numerals that do not fall under another type       
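One way around the E103 conflict is to drop any existing entity that overlaps the new span before assigning. Below is a minimal plain-Python sketch of that filtering on `(start, end, label)` tuples. The `replace_overlapping` helper is hypothetical, and the token offsets assume the sentence is split on whitespace.

```python
def replace_overlapping(existing, new):
    """Return the entities with `new` added and overlapping old ones dropped.

    Each entity is a (start, end, label) tuple with an exclusive end index.
    """
    new_start, new_end, _ = new
    kept = [
        (start, end, label)
        for start, end, label in existing
        if end <= new_start or start >= new_end  # keep non-overlapping spans
    ]
    return sorted(kept + [new])


# Assumed offsets: Tesla(0) to(1) build(2) Glasgow(3) factory(4) creating(5)
# 460(6) jobs(7) and(8) costing(9) 290(10) million(11)
existing = [(0, 1, "ORDINAL"), (3, 4, "GPE"), (6, 7, "CARDINAL"), (10, 12, "CARDINAL")]
resolved = replace_overlapping(existing, (0, 1, "ORG"))
print(resolved)
```

With spaCy objects the same idea would presumably be to keep only the spans in `doc.ents` that end at or before `new_entity.start` or start at or after `new_entity.end`, and then assign that list plus `new_entity` back to `doc.ents`.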


#### Adding entities to ALL Spans

Above we tried to tag a single span as an entity, but we may need to tag several terms at once. For example, a custom set of terms is a very real requirement when an organisation's items or product sets need to be recognised by NER.

In [40]:
doc = nlp(u"Our company created a brand new vacuum cleaner. " u"This new vacuum-cleaner is the best in show.")

In [41]:
show_ents(doc)

No entities identified.


In [42]:
from spacy.matcher import PhraseMatcher

In [43]:
matcher = PhraseMatcher(nlp.vocab)

In [44]:
phrase_list = ['vacuum cleaner', 'vacuum-cleaner']

In [45]:
# use a list comprehension to turn each phrase into a Doc pattern
phrase_patterns = [nlp(text) for text in phrase_list]

In [46]:
# register the patterns with the matcher under the 'newproduct' match ID
matcher.add('newproduct', None, *phrase_patterns)

In [47]:
found_matches = matcher(doc)

In [48]:
found_matches

[(2689272359382549672, 6, 8), (2689272359382549672, 11, 14)]

In [49]:
from spacy.tokens import Span

In [50]:
PROD = doc.vocab.strings[u"PRODUCT"]

In [51]:
new_ents = [Span(doc, match[1], match[2], label=PROD) for match in found_matches]

In [52]:
doc.ents = list(doc.ents) + new_ents

In [53]:
show_ents(doc)

Entity              Label       Explanation                                        
----------------------------------------------------------------------------------------------------
vacuum cleaner      PRODUCT     Objects, vehicles, foods, etc. (not services)      
vacuum-cleaner      PRODUCT     Objects, vehicles, foods, etc. (not services)      


#### Counting match types in a document or string

Let's suppose you want to know how many times a specific entity type is mentioned in a doc.

In [54]:
doc = nlp(u"Originally I paid £3000 for the car but now it is only worth £1800")

In [57]:
len([ent for ent in doc.ents if ent.label_ == "MONEY"])

2

In [74]:
doc = nlp(u"of course Engineers want to work at Google, Apple, Netflix, Facebook and Amazon because of salary and reputation within FAANG")

len([ent for ent in doc.ents if ent.label_ == "ORG"])

6

In [75]:
doc.ents

(Google, Apple, Netflix, Facebook, Amazon, FAANG)

In [76]:
show_ents(doc)

Entity              Label       Explanation                                        
----------------------------------------------------------------------------------------------------
Google              ORG         Companies, agencies, institutions, etc.            
Apple               ORG         Companies, agencies, institutions, etc.            
Netflix             ORG         Companies, agencies, institutions, etc.            
Facebook            ORG         Companies, agencies, institutions, etc.            
Amazon              ORG         Companies, agencies, institutions, etc.            
FAANG               ORG         Companies, agencies, institutions, etc.            


# 4.4.0 - Visualizing NER

In [90]:
doc = nlp(u"Over the last business quarter, Apple sold over 300,000 units of the Macbook Air. " u"By contrast, Google have sold around 50 thousand units of the Pixel Slate 2020")

In [91]:
displacy.render(doc, style='ent', jupyter=True)

In [92]:
for sent in doc.sents:
    displacy.render(nlp(sent.text), style='ent', jupyter=True)

# 4.5.0 - Sentence Segmentation

In the earlier introductory examples with spaCy we saw that docs were broken into sentences. We can also set custom segmentation rules to split documents into sentences on our own terms.

In [93]:
doc = nlp(u"This is the first sentence we see. This is a follow up sentence. This one could feasibly be described as a third, but not a Richard the Third")

In [94]:
for sent in doc.sents:
    print(sent)

This is the first sentence we see.
This is a follow up sentence.
This one could feasibly be described as a third, but not a Richard the Third


Note that `doc.sents` is implemented as a generator, so it cannot be indexed directly.

In [95]:
list(doc.sents)

[This is the first sentence we see.,
 This is a follow up sentence.,
 This one could feasibly be described as a third, but not a Richard the Third]

**Note** Wrapping `doc.sents` in `list()` lets you index the sentences, but each item is still a spaCy `Span` object rather than a plain string.
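The generator behaviour is plain Python rather than anything specific to spaCy. A minimal sketch, where the `sentences` function is a hypothetical stand-in for `doc.sents`:

```python
def sentences():
    """Stand-in generator that yields sentences one at a time."""
    yield "This is the first sentence we see."
    yield "This is a follow up sentence."


try:
    sentences()[0]  # generators do not support indexing
except TypeError as err:
    print(f"Cannot index a generator: {err}")

# materialise the generator first, then index as usual
sent_list = list(sentences())
print(sent_list[1])
```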

In [101]:
doc = nlp(u'"Management is doing the right things; leadership is doing the right things." - Peter Drucker')

In [102]:
doc.text

'"Management is doing the right things; leadership is doing the right things." - Peter Drucker'

In [103]:
for sent in doc.sents:
    print(sent)
    print('\n')

"Management is doing the right things; leadership is doing the right things."


- Peter Drucker




In [120]:
# adding a new rule to the pipeline by adding a segmentation rule
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ";":
            doc[token.i+1].is_sent_start = True
    return doc

In [127]:
if 'set_custom_boundaries' not in nlp.pipe_names:
    nlp.add_pipe(set_custom_boundaries, before='parser')

In [128]:
nlp.pipe_names

['tagger', 'set_custom_boundaries', 'parser', 'ner']

In [129]:
doc = nlp(u'"Management is doing the right things; leadership is doing the right things." - Peter Drucker')

In [130]:
for sent in doc.sents:
    print(sent)

"Management is doing the right things;
leadership is doing the right things."
- Peter Drucker
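The rule itself only marks the token after each `;` as a sentence start. Here is a plain-Python sketch of that rule on a list of token strings. `sentence_starts` and `split_sentences` are hypothetical helpers, and the sketch leaves out the further boundaries that spaCy's parser adds.

```python
def sentence_starts(tokens, boundary=";"):
    """Flag the first token and every token that follows `boundary`."""
    flags = [False] * len(tokens)
    if tokens:
        flags[0] = True
    for i, tok in enumerate(tokens[:-1]):  # skip the last token, nothing follows it
        if tok == boundary:
            flags[i + 1] = True
    return flags


def split_sentences(tokens, flags):
    """Cut the token list wherever a start flag is set."""
    sents = []
    for tok, is_start in zip(tokens, flags):
        if is_start:
            sents.append([])
        sents[-1].append(tok)
    return sents


tokens = ["Management", "is", "doing", "the", "right", "things", ";",
          "leadership", "is", "doing", "the", "right", "things", "."]
sents = split_sentences(tokens, sentence_starts(tokens))
for sent in sents:
    print(" ".join(sent))
```

Iterating over `tokens[:-1]` mirrors `doc[:-1]` in the cell above: the final token has no successor to flag, so skipping it avoids an index error.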


In [105]:
# change the segmentation rules

In [131]:
nlp = spacy.load('en_core_web_sm')

In [136]:
# take a poetic example where the end of a sentence is just as likely
# to be marked by a line break as by a period.

mystring = u"This is a first line. And this one will follow.\n\nDon't stall on the third, the fourth's an Apollo"
print(mystring)

This is a first line. And this one will follow.

Don't stall on the third, the fourth's an Apollo


In [137]:
doc = nlp(mystring)

for sent in doc.sents:
    print(sent)

This is a first line.
And this one will follow.


Don't stall on the third, the fourth's an Apollo


In [138]:
from spacy.pipeline import SentenceSegmenter

In [141]:
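# create a custom segmenter that starts a new sentence after each newline token

def split_on_newline(doc):
    start = 0
    seen_newline = False
    for word in doc:
        if seen_newline:
            # the previous token was a newline, so close the current span here
            yield doc[start:word.i]
            start = word.i
            seen_newline = False
        elif word.text.startswith('\n'):
            seen_newline = True

    # yield whatever remains as the final sentence
    yield doc[start:]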

In [142]:
sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_newline)

In [143]:
nlp.add_pipe(sbd)

In [149]:
mystring = u"This is a first\nline. And this one is two.\n\nDon't stall on the\nthird, the fourth's meant for you"
print(mystring)

This is a first
line. And this one is two.

Don't stall on the
third, the fourth's meant for you


In [150]:
doc = nlp(mystring)

In [151]:
for sent in doc.sents:
    print(sent)

This is a first

line. And this one is two.


Don't stall on the

third, the fourth's meant for you
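The blank lines in the output appear because each newline token stays attached to the end of the span before it. A plain-Python trace of the same strategy on a list of token strings makes that visible. The `split_tokens_on_newline` name and the token list are assumptions for illustration.

```python
def split_tokens_on_newline(tokens):
    """Same strategy as split_on_newline, applied to a list of token strings."""
    start = 0
    seen_newline = False
    for i, tok in enumerate(tokens):
        if seen_newline:
            # previous token was a newline: emit everything up to this token
            yield tokens[start:i]
            start = i
            seen_newline = False
        elif tok.startswith("\n"):
            seen_newline = True
    # emit the remaining tokens as the last chunk
    yield tokens[start:]


# Assumed tokenisation: each run of newlines is a single token
tokens = ["This", "is", "a", "first", "\n", "line", ".", "\n\n", "Don't", "stall"]
chunks = list(split_tokens_on_newline(tokens))
for chunk in chunks:
    print(chunk)
```

Each chunk except the last ends with its newline token, which is why printing the spans in the cell above shows an extra blank line after every sentence.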
