# NLP - Practical case 2

Using the taggers from the [last notebook](03-machine-learning-in-taggers.ipynb), we will train a Catalan trigram tagger.

The taggers are chained through backoff: if the trigram tagger cannot tag a word, it falls back to a bigram tagger; if that one cannot tag it either, it falls back to a unigram tagger; and if the unigram tagger also fails, a default tagger assigns a fixed tag.
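The backoff idea can be illustrated without NLTK. The following is a minimal pure-Python sketch, not NLTK's implementation: the class names `ToyDefaultTagger` and `ToyNgramTagger`, the one-sentence training set, and the short tags (`da`, `nc`, `vm`, `UNK`) are all hypothetical and exist only for illustration. Each toy n-gram tagger looks up the pair (previous n-1 tags, current word) and hands the word to its backoff tagger when that context was never seen.

```python
from collections import Counter, defaultdict


class ToyDefaultTagger:
    """Hypothetical stand-in for a default tagger: always returns one tag."""

    def __init__(self, tag):
        self.default = tag

    def tag_one(self, word, history):
        return self.default


class ToyNgramTagger:
    """Hypothetical n-gram tagger: context = (previous n-1 tags, current word)."""

    def __init__(self, n, train_sents, backoff=None):
        self.n = n
        self.backoff = backoff
        counts = defaultdict(Counter)
        for sent in train_sents:
            tags = []
            for word, tag in sent:
                counts[self._context(word, tags)][tag] += 1
                tags.append(tag)
        # Keep only the most frequent tag for every context seen in training.
        self.table = {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

    def _context(self, word, history):
        previous = tuple(history[-(self.n - 1):]) if self.n > 1 else ()
        return (previous, word)

    def tag_one(self, word, history):
        context = self._context(word, history)
        if context in self.table:
            return self.table[context]
        # Unseen context: delegate to the next tagger in the chain.
        return self.backoff.tag_one(word, history) if self.backoff else None

    def tag(self, words):
        history = []
        for word in words:
            history.append(self.tag_one(word, history))
        return list(zip(words, history))


# Toy training data (made up for this sketch).
toy_training = [[('el', 'da'), ('gos', 'nc'), ('dorm', 'vm')]]

toy_default = ToyDefaultTagger('UNK')
toy_unigram = ToyNgramTagger(1, toy_training, backoff=toy_default)
toy_bigram = ToyNgramTagger(2, toy_training, backoff=toy_unigram)
toy_trigram = ToyNgramTagger(3, toy_training, backoff=toy_bigram)

result = toy_trigram.tag(['el', 'chuchete', 'dorm'])
print(result)
```

In this sketch the unknown word `chuchete` falls through the whole chain and receives `UNK`, while `dorm` is resolved by the unigram level because its trigram and bigram contexts now contain the unseen `UNK` tag.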

Once the tagger is trained, we test it on the Catalan translation of the sentence `'My dog is a very cute chuchete, but the word chuchete is not recognized.'`:
       
    'El meu gos és un chuchete molt bonic, però la paraula chuchete no es reconeix.'

In [1]:
%pylab
%matplotlib inline

%config InlineBackend.figure_format = 'retina'

import numpy as np

Using matplotlib backend: Qt5Agg
Populating the interactive namespace from numpy and matplotlib


### Import NLTK and the Catalan CESS corpus

In [2]:
import nltk
from nltk.corpus import cess_cat

# Load all tagged sentences of CESS corpus
sents = cess_cat.tagged_sents()

### Make the training and test datasets

In [3]:
# 90% -> training (sentences whose index is not a multiple of 10)
# 10% -> test (every 10th sentence, starting with index 0)
training = []
test = []
for i, sent in enumerate(sents):
    if i % 10:
        training.append(sent)
    else:
        test.append(sent)
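To see what the modulo test does, here is a small sketch on made-up data. The list `demo_sents` and the names `demo_training` and `demo_test` are hypothetical placeholders, not part of the corpus code. Because `i % 10` is zero (falsy) only for indices that are multiples of 10, those items go to the test set and all others go to training.

```python
# Hypothetical stand-in for the corpus: 25 placeholder "sentences".
demo_sents = [f"sentence {i}" for i in range(25)]

# Indices 0, 10, 20 -> test; every other index -> training.
demo_training = [s for i, s in enumerate(demo_sents) if i % 10]
demo_test = [s for i, s in enumerate(demo_sents) if i % 10 == 0]

print(len(demo_training), len(demo_test))
print(demo_test)
```

The split is deterministic, so re-running the cell always yields the same partition; it only approximates 90/10 when the number of sentences is not a multiple of 10.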

### Import the four types of morphological analyzers (taggers)

In [4]:
# DefaultTagger -> tags every word with the same tag (given in the constructor)
# UnigramTagger -> learns the most frequent tag of each word in the CESS corpus
# BigramTagger  -> learns from each word together with the tag of the previous word
# TrigramTagger -> learns from each word together with the tags of the two previous words
from nltk import DefaultTagger, UnigramTagger, BigramTagger, TrigramTagger

### Training the taggers 

In [5]:
default_tagger = DefaultTagger('Practical-Case2')
unigram_tagger = UnigramTagger(training, backoff = default_tagger)
bigram_tagger = BigramTagger(training, backoff=unigram_tagger) 

trigram_tagger = TrigramTagger(training, backoff=bigram_tagger)

### Evaluate the tagger

In [6]:
print ('Success:',trigram_tagger.evaluate(test)*100)

Success: 91.83444970437448
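The score above can be read as a token-level accuracy: the share of test tokens whose predicted tag equals the gold tag. The sketch below shows that computation in plain Python, assuming this is what `evaluate` measures; the helper `token_accuracy` and the two tiny example sentences are hypothetical, and the gold tag given to `chuchete` is invented for illustration.

```python
def token_accuracy(gold_sents, predicted_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = 0
    total = 0
    for gold, predicted in zip(gold_sents, predicted_sents):
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
            total += 1
            if gold_tag == predicted_tag:
                correct += 1
    return correct / total if total else 0.0


# Made-up example: two of the three tokens are tagged correctly.
gold_demo = [[('El', 'da0ms0'), ('gos', 'ncms000'), ('chuchete', 'ncms000')]]
predicted_demo = [[('El', 'da0ms0'), ('gos', 'ncms000'), ('chuchete', 'Practical-Case2')]]

accuracy_demo = token_accuracy(gold_demo, predicted_demo)
print('Success:', accuracy_demo * 100)
```

Under this reading, every test token that ends up with the default tag counts as an error, since `Practical-Case2` never appears among the gold tags.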


### Test the tagger with the sentence

In [7]:
# 'chuchete' is not in the corpus
sentence = "El meu gos és un chuchete molt bonic, però la paraula chuchete no es reconeix."

# Tokenize the sentence
tokens = nltk.word_tokenize(sentence)

# Tag the tokens
tagged = trigram_tagger.tag(tokens)

print (tagged)

[('El', 'da0ms0'), ('meu', 'dp1mss'), ('gos', 'ncms000'), ('és', 'vsip3s0'), ('un', 'di0ms0'), ('chuchete', 'Practical-Case2'), ('molt', 'rg'), ('bonic', 'aq0ms0'), (',', 'Fc'), ('però', 'cc'), ('la', 'da0fs0'), ('paraula', 'ncfs000'), ('chuchete', 'Practical-Case2'), ('no', 'rn'), ('es', 'p0000000'), ('reconeix', 'vmip3s0'), ('.', 'Fp')]


As we can see, the token `chuchete` is not in the corpus, so none of the n-gram taggers could tag it and the chain fell back to the default tagger, which assigned the `Practical-Case2` tag, as intended.
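Since the default tag is only assigned when every other tagger in the chain fails, it also works as a marker for unknown words. A short sketch, reusing the output printed above (the variable names `tagged_output`, `DEFAULT_TAG` and `unknown_tokens` are chosen here for illustration):

```python
# Output of the trigram tagger, copied from the cell above.
tagged_output = [
    ('El', 'da0ms0'), ('meu', 'dp1mss'), ('gos', 'ncms000'), ('és', 'vsip3s0'),
    ('un', 'di0ms0'), ('chuchete', 'Practical-Case2'), ('molt', 'rg'),
    ('bonic', 'aq0ms0'), (',', 'Fc'), ('però', 'cc'), ('la', 'da0fs0'),
    ('paraula', 'ncfs000'), ('chuchete', 'Practical-Case2'), ('no', 'rn'),
    ('es', 'p0000000'), ('reconeix', 'vmip3s0'), ('.', 'Fp'),
]

DEFAULT_TAG = 'Practical-Case2'

# Tokens that no n-gram tagger could handle.
unknown_tokens = [word for word, tag in tagged_output if tag == DEFAULT_TAG]
print(unknown_tokens)
```

Both occurrences of `chuchete` are collected, and no other token in the sentence reached the default tagger.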