# NLP - Practical case 1

Using the NLP chain which we used in the [last notebook](01-first-steps-nltk.ipynb), we have to do:

- A NLP chain with the following elements:

    `Word tokenization -> Lemmatization -> Syntactic analysis`


- A NLP chain with the following elements:

    `Word tokenization -> Lemmatization -> POS tagger -> Syntactic analysis`
    

In [1]:
%pylab
%matplotlib inline

%config InlineBackend.figure_format = 'retina'

import numpy as np

Using matplotlib backend: Qt5Agg
Populating the interactive namespace from numpy and matplotlib


### Part 1: `Word tokenization -> Lemmatization -> Syntactic analysis`

In [2]:
# Import all packages which we need 
import nltk
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import wordnet

In [3]:
# 1. Create a input text for our NLP chain
text = "I didn't notice my animals were uglier than yours! I'm sorry..."
print ("1. Text:",text)

1. Text: I didn't notice my animals were uglier than yours! I'm sorry...


In [4]:
# 2. Tokenization: Tokenize the text, i.e: split the text on tokens
tokens = nltk.word_tokenize(text)
print ("2. Tokens:",tokens)

2. Tokens: ['I', 'did', "n't", 'notice', 'my', 'animals', 'were', 'uglier', 'than', 'yours', '!', 'I', "'m", 'sorry', '...']


In [5]:
# 3. Morphology/Lexical analysis: set a morphology tag for each token
tagged = nltk.pos_tag(tokens)
print ("3. Morphology analysis:",tagged)

3. Morphology analysis: [('I', 'PRP'), ('did', 'VBD'), ("n't", 'RB'), ('notice', 'VB'), ('my', 'PRP$'), ('animals', 'NNS'), ('were', 'VBD'), ('uglier', 'JJR'), ('than', 'IN'), ('yours', 'JJR'), ('!', '.'), ('I', 'PRP'), ("'m", 'VBP'), ('sorry', 'JJ'), ('...', ':')]


In [6]:
# Create the lemmatizer
lemmatizer = WordNetLemmatizer()

# WordNetLemmatizer only knows 4 POS tags: a (adjetive), r (adverb), n (noun) and v (verb)
# For that, we should convert Penn Tree Bank format to WordNet format 
# (e.g: N->n, J->a, R->r, V->V, ...)

wnTags = {'N':wordnet.NOUN, 'J':wordnet.ADJ, 'V':wordnet.VERB, 'R':wordnet.ADV} 

# Create a lemmas array for storage the lemmas:
lemmas = []

print ("4. Lemmas: ")
# For each token and its tag:
for (tok,tag) in tagged:
    # WordNet has not the short forms: 'm, n't, so we should introduce them for the good lemmatization
    if tok=='\'m':
        tok = 'am'
    if tok=='\'s':
        tok = 'is'
    if tok=='n\'t':
        tok = 'not'
        
    # We only get the first char of the tag because we use it to convert it to WordNet format
    tag = tag[:1]
    
    # Lemmatize the tokens
    lemma = lemmatizer.lemmatize(tok.lower(), wnTags.get(tag, wordnet.NOUN))

    # Other alternative for get the lemma can be use the wordnet.morphy() function 
    #lemma = wordnet.morphy(tok.lower(), wnTags.get(tag, wordnet.NOUN))
    
    
    if lemma is None: # If WordNet has not the word, we assign its token to the lemma
       lemma = tok.lower() 
    lemmas.append(lemma)

print(lemmas)

4. Lemmas: 
['i', 'do', 'not', 'notice', 'my', 'animal', 'be', 'ugly', 'than', 'yours', '!', 'i', 'be', 'sorry', '...']


In [7]:
# Create our CFG using the lemmas got in the last step
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Noun | Noun Verb Adj Punt 
VP -> Verb Adv PP | Verb Adj Conj Det Punt
PP -> Noun Det Noun VP
Punt -> '!' NP | '...'
Adv -> 'not'
Adj -> 'ugly' | 'sorry'
Det -> 'yours' | 'my'
Noun -> 'i' | 'animal' | 'notice'
Verb -> 'do' | 'be'
Conj -> 'than'
""")

# Generate a syntactic parser be able to recognize the grammar
parser = nltk.ChartParser(grammar, trace=1)
print ('5. Syntactic analysis:\n')
for tree in parser.parse(lemmas):
    print(tree,'\n')
    tree.draw()

5. Syntactic analysis:

|.i .do.no.no.my.an.be.ug.th.yo.! .i .be.so....|
|[--]  .  .  .  .  .  .  .  .  .  .  .  .  .  .| [0:1] 'i'
|.  [--]  .  .  .  .  .  .  .  .  .  .  .  .  .| [1:2] 'do'
|.  .  [--]  .  .  .  .  .  .  .  .  .  .  .  .| [2:3] 'not'
|.  .  .  [--]  .  .  .  .  .  .  .  .  .  .  .| [3:4] 'notice'
|.  .  .  .  [--]  .  .  .  .  .  .  .  .  .  .| [4:5] 'my'
|.  .  .  .  .  [--]  .  .  .  .  .  .  .  .  .| [5:6] 'animal'
|.  .  .  .  .  .  [--]  .  .  .  .  .  .  .  .| [6:7] 'be'
|.  .  .  .  .  .  .  [--]  .  .  .  .  .  .  .| [7:8] 'ugly'
|.  .  .  .  .  .  .  .  [--]  .  .  .  .  .  .| [8:9] 'than'
|.  .  .  .  .  .  .  .  .  [--]  .  .  .  .  .| [9:10] 'yours'
|.  .  .  .  .  .  .  .  .  .  [--]  .  .  .  .| [10:11] '!'
|.  .  .  .  .  .  .  .  .  .  .  [--]  .  .  .| [11:12] 'i'
|.  .  .  .  .  .  .  .  .  .  .  .  [--]  .  .| [12:13] 'be'
|.  .  .  .  .  .  .  .  .  .  .  .  .  [--]  .| [13:14] 'sorry'
|.  .  .  .  .  .  .  .  .  .  .  .  .  .  [--]| [14:15] '...'

### Part 2: Word tokenization -> Lemmatization -> POS tagger -> Syntactic analysis

In [8]:
# Let's get the POS tags of each lemma:
lemmas_tagged = nltk.pos_tag(lemmas)
print ("Lemmas tagged:", lemmas_tagged)

Lemmas tagged: [('i', 'NNS'), ('do', 'VBP'), ('not', 'RB'), ('notice', 'VB'), ('my', 'PRP$'), ('animal', 'JJ'), ('be', 'VB'), ('ugly', 'RB'), ('than', 'IN'), ('yours', 'UH'), ('!', '.'), ('i', 'NN'), ('be', 'VB'), ('sorry', 'JJ'), ('...', ':')]


In [9]:
# Create our CFG using the lemmas got in the last step
grammar_POS = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'NNS' 'VBP' 'RB' | 'RB' 'IN' 'UH' Punt | 'NN' VP
VP -> 'VB' 'PRP$' 'JJ' 'VB' NP |  'VB' 'JJ' Punt
Punt -> '.' NP | ':'
""")

# Generate a syntactic parser be able to recognize the grammar
parser = nltk.ChartParser(grammar, trace=1)
print ('Syntactic analysis:\n')
for tree in parser.parse(lemmas):
    print(tree,'\n')
    tree.draw()

Syntactic analysis:

|.i .do.no.no.my.an.be.ug.th.yo.! .i .be.so....|
|[--]  .  .  .  .  .  .  .  .  .  .  .  .  .  .| [0:1] 'i'
|.  [--]  .  .  .  .  .  .  .  .  .  .  .  .  .| [1:2] 'do'
|.  .  [--]  .  .  .  .  .  .  .  .  .  .  .  .| [2:3] 'not'
|.  .  .  [--]  .  .  .  .  .  .  .  .  .  .  .| [3:4] 'notice'
|.  .  .  .  [--]  .  .  .  .  .  .  .  .  .  .| [4:5] 'my'
|.  .  .  .  .  [--]  .  .  .  .  .  .  .  .  .| [5:6] 'animal'
|.  .  .  .  .  .  [--]  .  .  .  .  .  .  .  .| [6:7] 'be'
|.  .  .  .  .  .  .  [--]  .  .  .  .  .  .  .| [7:8] 'ugly'
|.  .  .  .  .  .  .  .  [--]  .  .  .  .  .  .| [8:9] 'than'
|.  .  .  .  .  .  .  .  .  [--]  .  .  .  .  .| [9:10] 'yours'
|.  .  .  .  .  .  .  .  .  .  [--]  .  .  .  .| [10:11] '!'
|.  .  .  .  .  .  .  .  .  .  .  [--]  .  .  .| [11:12] 'i'
|.  .  .  .  .  .  .  .  .  .  .  .  [--]  .  .| [12:13] 'be'
|.  .  .  .  .  .  .  .  .  .  .  .  .  [--]  .| [13:14] 'sorry'
|.  .  .  .  .  .  .  .  .  .  .  .  .  .  [--]| [14:15] '...'
|[