# Parts of Speech & Named Entity Recognition

![nlp](https://wrm5sysfkg-flywheel.netdna-ssl.com/wp-content/uploads/2019/01/NLP-Technology-in-Healthcare.jpg)

# 4.1.0 - Parts of Speech tagging

**What is it?** Tokenisation can itself be difficult, and in many languages it poses real problems. So while some problems can be solved starting from the raw characters alone, it is generally better to use linguistic knowledge to add useful meaning and information. Part-of-speech (POS) tagging is one way to do that: each token is labelled with its grammatical role.

**Why?** Most words are rare, and words that look completely different often mean almost the same thing. The same words in a different order can mean something wildly different. There is a lot of nuance in how words are used and in the order they appear, which is why it is important to look at the `part of speech` a word takes in context rather than only the word or `token` itself.
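
To make the word-order point concrete, here is a minimal sketch using only the standard library. The helper `bag_of_words` and the two sentences are illustrative assumptions, not part of the original notebook. It shows that two sentences with opposite meanings are indistinguishable once order is thrown away.

```python
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Count lower-cased, whitespace-separated tokens, ignoring their order."""
    return Counter(sentence.lower().split())


# Hypothetical example sentences: same words, opposite meaning.
a = "the dog bit the man"
b = "the man bit the dog"

# The token counts are identical even though the sentences differ.
same_bag = bag_of_words(a) == bag_of_words(b)
print(same_bag)  # True
print(a == b)    # False
```

A representation built from tokens alone cannot tell these sentences apart, which is one reason grammatical annotations are useful.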

This is exactly what spaCy was designed to do: you submit raw text and get back an annotated `Doc` object. In the process, spaCy provides two tiers of annotation: coarse POS tags such as `noun`, `verb` and `adjective`, and fine-grained tags such as `plural noun`, `past tense verb` and `superlative adjective`.


In [1]:
import spacy 

In [2]:
nlp = spacy.load('en_core_web_sm')

In [37]:
def doc_table(doc):
    # create the header 
    print(f"{'Token':{12}} {'POS_ID':>7} {'POS':{10}} {'TAG':{10}} {'Explanation':{25}}")
    print("-" * 100)

    # create the table for the document
    for token in doc:
        # spacy.explain returns None for labels it does not know, so fall back to ""
        explanation = spacy.explain(token.tag_) or ""
        print(f"{token.text:{12}} {token.pos:>7} {token.pos_:{10}} {token.tag_:{10}} {explanation:{45}}")

In [38]:
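def word_table(token):
    # create the header (POS_ID is right-aligned to match the row below)
    print(f"{'Token':{12}} {'POS_ID':>7} {'POS':{10}} {'TAG':{10}} {'Explanation':{25}}")
    print("-" * 100)

    # spacy.explain returns None for labels it does not know, so fall back to ""
    explanation = spacy.explain(token.tag_) or ""
    print(f"{token.text:{12}} {token.pos:>7} {token.pos_:{10}} {token.tag_:{10}} {explanation:{45}}")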

In [39]:
doc = nlp(u"The quick brown fox jumped over the lazy dog's back.")

In [40]:
doc_table(doc)

Token         POS_ID POS        TAG        Explanation              
----------------------------------------------------------------------------------------------------
The               90 DET        DT         determiner                                   
quick             84 ADJ        JJ         adjective                                    
brown             84 ADJ        JJ         adjective                                    
fox               96 PROPN      NNP        noun, proper singular                        
jumped           100 VERB       VBD        verb, past tense                             
over              85 ADP        IN         conjunction, subordinating or preposition    
the               90 DET        DT         determiner                                   
lazy              84 ADJ        JJ         adjective                                    
dog               92 NOUN       NN         noun, singular or mass                       
's                94 PART    

#### Working with tense in spaCy

In [15]:
# "read" is ambiguous here: it is meant as present tense,
# but the library tags it as past tense (VBD).

doc = nlp(u"I read books on NLP.")

In [16]:
word = doc[1]

In [17]:
# show the analysis 
word_table(word)

Token        POS        TAG        Explanation              
----------------------------------------------------------------------------------------------------
read         VERB       VBD        verb, past tense                             


In [18]:
# print the table for the whole sentence
doc_table(doc)

Token        POS        TAG        Explanation              
----------------------------------------------------------------------------------------------------
I            PRON       PRP        pronoun, personal                            
read         VERB       VBD        verb, past tense                             
books        NOUN       NNS        noun, plural                                 
on           ADP        IN         conjunction, subordinating or preposition    
NLP          PROPN      NNP        noun, proper singular                        
.            PUNCT      .          punctuation mark, sentence closer            


In [19]:
# create a past tense example 
doc = nlp(u"I read a book on NLP")

In [20]:
word = doc[1]
word_table(word)

Token        POS        TAG        Explanation              
----------------------------------------------------------------------------------------------------
read         VERB       VBD        verb, past tense                             


In [21]:
doc = nlp(u"read books to be well read")
doc_table(doc)

Token        POS        TAG        Explanation              
----------------------------------------------------------------------------------------------------
read         VERB       VB         verb, base form                              
books        NOUN       NNS        noun, plural                                 
to           PART       TO         infinitival "to"                             
be           AUX        VB         verb, base form                              
well         ADV        RB         adverb                                       
read         VERB       VBN        verb, past participle                        
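
The last table tags the two occurrences of "read" differently. As a rough sketch of how that could be checked programmatically, the snippet below works on plain `(text, tag)` pairs copied from the table above rather than on a live `Doc`. The names `tagged`, `VERB_FORMS` and `verb_forms` are hypothetical helpers introduced for illustration; the descriptions are the ones printed in the outputs above.

```python
# (token text, fine-grained tag) pairs copied from the table above
tagged = [
    ("read", "VB"),
    ("books", "NNS"),
    ("to", "TO"),
    ("be", "VB"),
    ("well", "RB"),
    ("read", "VBN"),
]

# Verb tag descriptions as they appear in the outputs above
VERB_FORMS = {
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBN": "verb, past participle",
}


def verb_forms(pairs, word):
    """Return the verb-form description for each occurrence of `word`."""
    return [
        VERB_FORMS[tag]
        for text, tag in pairs
        if text == word and tag in VERB_FORMS
    ]


read_forms = verb_forms(tagged, "read")
print(read_forms)  # ['verb, base form', 'verb, past participle']
```

With a real document, the same pairs could be built with `[(t.text, t.tag_) for t in doc]`.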


As the first table shows, each POS tag has an integer identifier (`token.pos`) as well as a text label (`token.pos_`). Internally, spaCy stores strings as integer hashes and orders its identifiers for performance; you do not need to concern yourself with that ordering. What matters is that you can look up the POS identifier, and its label, for each of the 50k+ entries the language model references.

In [41]:
doc = nlp(u"The quick brown fox jumped over the lazy dog's back.")

In [42]:
pos_count = doc.count_by(spacy.attrs.POS)
pos_count

{90: 2, 84: 3, 96: 1, 100: 1, 85: 1, 92: 2, 94: 1, 97: 1}

In [43]:
for k,v in sorted(pos_count.items()):
    print(f"{k:>3} {doc.vocab[k].text:{8}} {v}")

 84 ADJ      3
 85 ADP      1
 90 DET      2
 92 NOUN     2
 94 PART     1
 96 PROPN    1
 97 PUNCT    1
100 VERB     1
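
As a follow-up sketch, the ID-keyed counts can be turned into a label-keyed summary and ranked by frequency. To stay runnable without a loaded model, this example hard-codes the counts and the ID-to-label pairs exactly as printed above; the names `pos_labels`, `label_count` and `ranked` are assumptions made for illustration, and the integer IDs may differ in other spaCy versions.

```python
# Counts keyed by POS identifier, copied from the output of doc.count_by above
pos_count = {90: 2, 84: 3, 96: 1, 100: 1, 85: 1, 92: 2, 94: 1, 97: 1}

# Identifier -> label pairs, copied from the printed table above
pos_labels = {
    84: "ADJ", 85: "ADP", 90: "DET", 92: "NOUN",
    94: "PART", 96: "PROPN", 97: "PUNCT", 100: "VERB",
}

# Re-key the counts by readable label
label_count = {pos_labels[k]: v for k, v in pos_count.items()}

# Rank by descending count, breaking ties alphabetically
ranked = sorted(label_count.items(), key=lambda kv: (-kv[1], kv[0]))

total = sum(pos_count.values())
for label, n in ranked:
    print(f"{label:8} {n:>2} {n / total:.0%}")
```

With a live `Doc`, `pos_labels` would not be needed, since `doc.vocab[k].text` performs the same lookup, as in the cell above.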


In [None]:
# 4.2.0