# Practical case - Food Orders - Details

This notebook contains all the details and explanations of the practical case analysis. The [01-food-orders](./01-food-orders.ipynb) notebook contains only the function required by the statement, implemented with all models.

### Statement (translated from Spanish)

For this exercise, imagine that you work for a food delivery company that operates nationwide and handles thousands of orders every day. The company has a historical file with all the food requests that customers have made through the chat on its website over the last few months. It needs to analyze in real time which dishes users are ordering and which ingredients they contain, because the food stock chain requires forecasting to avoid running out of cooked dishes.

The estimated impact on sales each time a dish becomes unavailable is a 7% loss for that week, caused by customers abandoning the ordering website. It is therefore vital to be able to produce these estimates automatically.

The goal is to program a function that receives a user text as input and returns the text fragments (chunks) that refer to the foods and quantities requested. It is neither necessary nor the goal of this exercise to build an intent classifier that runs before this function; we simply assume the function receives a sentence with the `Pedir_comida` intent. Normalizing the output is not a goal either (e.g., there is no need to convert 'tres' to '3' or 'pizzas' to 'pizza'). It is, therefore, a bare-minimum exercise.

    For example: “quiero 3 bocadillos de anchoas y 2 pizzas” →
    [
        {comida:'bocadillo', ingrediente:'anchoas', cantidad:3},
        {comida:'pizza', ingrediente:'null', cantidad:2}
    ]

Therefore, the output of the function will be an array of dictionaries with 2 elements (`comida` and `cantidad`). When a quantity is not detected, its value will be set to '1' by default.

You should start with the most basic difficulty level (`RegexParser`) and, once that works, add the following levels one after another. In this way, the deliverable will contain each and every one of the three ways of solving the problem. It is therefore not enough to include, for example, only a `NaiveBayesClassifier`; the other two approaches must also be included to obtain the maximum score. This is just a practice exercise, so the expected result is not a high-precision, production-ready system but a basic approximation that makes it possible to run the three ways of solving the problem.

This exercise must be done with Spanish training texts, bearing in mind that the accuracy of NLTK's POS taggers for Spanish is very poor. The student should therefore not be frustrated by poor results: as stated above, this is just a theoretical exercise, and we can assume that a better analyzer would give better results.

To carry out the exercise, an NLP pipeline must be built with NLTK, with the following elements:

- sentence segmentation,
- tokenization,
- POS tagger (morphological analyzer for Spanish).

The resulting POS tags will then be used by the `RegexParser`, the `UnigramParser`, the `BigramParser` and the `NaiveBayesClassifier`.

In [1]:
%pylab
%matplotlib inline

%config InlineBackend.figure_format = 'retina'

import numpy as np

Using matplotlib backend: Qt5Agg
Populating the interactive namespace from numpy and matplotlib


### Import NLTK and Spanish CESS Corpus 

In [2]:
import nltk
from nltk.corpus import cess_esp

# Load all tagged sentences of Spanish CESS corpus
sents = cess_esp.tagged_sents()

## POS tagger training 

Before building the versions required by the practical case, we have to train a POS tagger on the Spanish CESS corpus. We are going to use `HiddenMarkovModelTagger`, because it models tag transitions and word emissions jointly and decodes the whole sentence at once, whereas `UnigramTagger`, `BigramTagger` and `TrigramTagger` only look up a fixed-size context.

First, we build the training (90%) and test (10%) datasets so that we can evaluate the trained model on sentences it has not seen:

In [3]:
training = []
testing = []

for i in range(len(sents)) :
    if i % 10 :
        training.append(sents[i])
    else :
        testing.append(sents[i])

In [4]:
print('len(sents) =', len(sents))
print('len(training) =', len(training))
print('len(testing) =', len(testing))

len(sents) = 6030
len(training) = 5427
len(testing) = 603
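
The split above sends every tenth sentence (indices 0, 10, 20, ...) to the test set. As a minimal sketch, the same rule can be written as a small helper; `split_every_nth` is a hypothetical name, and the demo uses plain integers as stand-ins for tagged sentences:

```python
def split_every_nth(items, n=10):
    """Send items at indices 0, n, 2n, ... to testing and the rest to training."""
    training = [item for i, item in enumerate(items) if i % n]
    testing = [item for i, item in enumerate(items) if i % n == 0]
    return training, testing

# Stand-in data: 6030 integers instead of the 6030 tagged sentences
training_demo, testing_demo = split_every_nth(list(range(6030)))
print(len(training_demo), len(testing_demo))  # 5427 603
```

Taking every tenth item spreads the test sentences across the whole corpus instead of using only its last block.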


In [5]:
# Import HiddenMarkovModelTagger
from nltk.tag.hmm import HiddenMarkovModelTagger

# Create the Spanish POS tagger using HMM Tagger
spanish_pos_tagger = HiddenMarkovModelTagger.train(training)

# Evaluate that tagger
print('Evaluation:', spanish_pos_tagger.evaluate(testing)*100)

Evaluation: 89.88905831011094


As we can see, the tagger reaches about 89.9% accuracy on the held-out test set, so it generalizes reasonably well to unseen sentences. We will use it as our POS tagger.

## Get Part-of-Speech Tagset of Spanish CESS Corpus

Before using `RegexpParser`, we need to know what the Spanish tags produced by our tagger look like.

In [6]:
# Create a set which will contain all tags
tagset = set()

for sent in sents :
    tagset.update([ tag for (word, tag) in sent ])

In [7]:
# Print sorted tagset
print(sorted(tagset))

['Faa', 'Fat', 'Fc', 'Fd', 'Fe', 'Fg', 'Fh', 'Fia', 'Fit', 'Fp', 'Fpa', 'Fpt', 'Fs', 'Fx', 'Fz', 'I', 'W', 'X', 'Y', 'Z', 'Zm', 'Zp', 'ao0fp0', 'ao0fs0', 'ao0mp0', 'ao0ms0', 'aq00000', 'aq0cn0', 'aq0cp0', 'aq0cs0', 'aq0fp0', 'aq0fpp', 'aq0fs0', 'aq0fsp', 'aq0mp0', 'aq0mpp', 'aq0ms0', 'aq0msp', 'cc', 'cs', 'da0fp0', 'da0fs0', 'da0mp0', 'da0ms0', 'da0ns0', 'dd0cp0', 'dd0cs0', 'dd0fp0', 'dd0fs0', 'dd0mp0', 'dd0ms0', 'de0cn0', 'di0cp0', 'di0cs0', 'di0fp0', 'di0fs0', 'di0mp0', 'di0ms0', 'dn0cp0', 'dn0cs0', 'dn0fp0', 'dn0fs0', 'dn0mp0', 'dn0ms0', 'dp1cps', 'dp1css', 'dp1fpp', 'dp1fsp', 'dp1mpp', 'dp1msp', 'dp1mss', 'dp2cps', 'dp2css', 'dp3cp0', 'dp3cs0', 'dp3fs0', 'dp3mp0', 'dp3ms0', 'dt0cn0', 'dt0fs0', 'dt0ms0', 'i', 'nc00000', 'nccn000', 'nccp000', 'nccs000', 'ncfn000', 'ncfp000', 'ncfs000', 'ncmn000', 'ncmp000', 'ncms000', 'np00000', 'np0000a', 'np0000l', 'np0000o', 'np0000p', 'p0000000', 'p010p000', 'p010s000', 'p020s000', 'p0300000', 'pd0cp000', 'pd0cs000', 'pd0fp000', 'pd0fs000', 'pd0m

The tagset alone does not tell us what each tag means. However, we can see that many tags share a prefix, so we will show some examples for each tag beginning:

In [8]:
[ (word, tag) for (word, tag) in sents[100] if tag.startswith('a') ]

[('dedicada', 'aq0fsp'),
 ('democrático', 'aq0ms0'),
 ('político', 'aq0ms0'),
 ('fundamentales', 'aq0cp0'),
 ('eficiente', 'aq0cs0'),
 ('participativa', 'aq0fs0')]

In [9]:
[ (word, tag) for (word, tag) in sents[100] if tag.startswith('d') ]

[('la', 'da0fs0'),
 ('la', 'da0fs0'),
 ('el', 'da0ms0'),
 ('el', 'da0ms0'),
 ('el', 'da0ms0'),
 ('el', 'da0ms0'),
 ('los', 'da0mp0'),
 ('las', 'da0fp0'),
 ('la', 'da0fs0'),
 ('una', 'di0fs0')]

In [10]:
[ (word, tag) for (word, tag) in sents[100] if tag.startswith('n') ]

[('La_Declaración_de_Viña_del_Mar', 'np0000a'),
 ('Gobernabilidad', 'ncfs000'),
 ('democracia', 'ncfs000'),
 ('compromiso', 'ncms000'),
 ('sistema', 'ncms000'),
 ('Estado_de_Derecho', 'np0000a'),
 ('pluralismo', 'ncms000'),
 ('derechos_humanos', 'ncmp000'),
 ('libertades', 'ncfp000'),
 ('marco', 'ncms000'),
 ('gobernabilidad', 'ncfs000'),
 ('democracia', 'ncfs000')]

In [11]:
[ (word, tag) for (word, tag) in sents[300] if tag.startswith('p') ]

[('que', 'pr0cn000'), ('se', 'p0300000'), ('que', 'pr0cn000')]

Now, we are going to run our POS tagger on an example to see how it tags it:

In [12]:
sentence = 'me pones una pizza de cuatro quesos?, un buen bocata con pepinillos y chorizo'

tokens = nltk.word_tokenize(sentence)

tagged = spanish_pos_tagger.tag(tokens)
tagged

[('me', 'pp1cs000'),
 ('pones', 'vmip3s0'),
 ('una', 'di0fs0'),
 ('pizza', 'ncfs000'),
 ('de', 'sps00'),
 ('cuatro', 'dn0cp0'),
 ('quesos', 'ncmp000'),
 ('?', 'Fit'),
 (',', 'Fc'),
 ('un', 'di0ms0'),
 ('buen', 'aq0ms0'),
 ('bocata', 'ncms000'),
 ('con', 'sps00'),
 ('pepinillos', 'np0000l'),
 ('y', 'cc'),
 ('chorizo', 'sn.e-SUJ')]

From these examples, we can infer the meaning of some tags:

| Tag beginning | Meaning    |
|---------------|------------|
| a             | adjective  |
| d             | determiner |
| n             | noun       |

This is useful for writing the grammar of our `RegexpParser`.
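
As a rough illustration of this first-letter rule, the sketch below groups tagged tokens by the category suggested by the first character of their tag. Only `a`, `d` and `n` come from the table above; the other prefixes are assumptions based on the EAGLES tagging convention, and `CATEGORY_BY_PREFIX` and `group_by_category` are hypothetical names. The rule is only a heuristic: a tag such as `sn.e-SUJ`, seen in the example above, does not follow it.

```python
from collections import defaultdict

# 'a', 'd' and 'n' come from the table; the rest are assumptions
# based on the EAGLES convention.
CATEGORY_BY_PREFIX = {
    "a": "adjective",
    "d": "determiner",
    "n": "noun",
    "p": "pronoun",
    "v": "verb",
    "s": "preposition",
    "c": "conjunction",
    "F": "punctuation",
    "Z": "number",
}

def group_by_category(tagged_tokens):
    """Group words by the category suggested by the first letter of their tag."""
    groups = defaultdict(list)
    for word, tag in tagged_tokens:
        groups[CATEGORY_BY_PREFIX.get(tag[0], "unknown")].append(word)
    return dict(groups)

# Part of the tagged example sentence shown above
tagged_demo = [
    ("me", "pp1cs000"), ("pones", "vmip3s0"), ("una", "di0fs0"),
    ("pizza", "ncfs000"), ("de", "sps00"), ("cuatro", "dn0cp0"),
    ("quesos", "ncmp000"), ("?", "Fit"),
]
groups_demo = group_by_category(tagged_demo)
print(groups_demo["determiner"])  # ['una', 'cuatro']
print(groups_demo["noun"])        # ['pizza', 'quesos']
```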

## Version 1: `RegexpParser`

First, we define the grammar and then use it to create the `RegexpParser`.

In [13]:
# Define the grammar
grammar = r"""
    Comida: {<s?n[cp\.].*><s.*>?<s?[na][cp\.].*>*}
            {<s?n[cp\.].*>}
    Cantidad: {<d[in].*>}
              {<Z.*>}  
"""


# Create the RegexpParser
regex_parser = nltk.RegexpParser(grammar)
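
Each `<...>` element in the grammar is a regular expression that is matched against one POS tag. The sketch below approximates that per-tag check with Python's `re` module so we can see which tags each pattern accepts. It is a simplification, not NLTK's actual matching code (NLTK compiles the patterns into an expression over the whole tag sequence), and the helper names are hypothetical.

```python
import re

# Tag patterns copied from the grammar, without the angle brackets
COMIDA_HEAD = r"s?n[cp\.].*"
CANTIDAD_PATTERNS = [r"d[in].*", r"Z.*"]

def matches(pattern, tag):
    """Return True if the whole tag matches the pattern."""
    return re.fullmatch(pattern, tag) is not None

def is_cantidad(tag):
    """Return True if the tag matches any of the Cantidad patterns."""
    return any(matches(p, tag) for p in CANTIDAD_PATTERNS)

for tag in ["ncfp000", "np0000l", "sn.e-SUJ", "sps00", "vmip1s0"]:
    print(tag, "-> Comida head:", matches(COMIDA_HEAD, tag))

for tag in ["di0fp0", "dn0cp0", "Z", "da0fs0"]:
    print(tag, "-> Cantidad:", is_cantidad(tag))
```

Common and proper nouns (`nc...`, `np...`) and the odd `sn.e-SUJ` tag can start a `Comida` chunk, while indefinite and numeral determiners (`di...`, `dn...`) and `Z` tags form a `Cantidad` chunk. Definite articles (`da...`) are left out on purpose.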

Then, let's use our parser with an example:

In [14]:
# Create the example sentence
sentence = 'yo quiero 3 bocadillos con pimiento, 3 pizzas y ensalada'

# Tokenize the sentence
tokens = nltk.word_tokenize(sentence)

# Tag the sentence
tagged = spanish_pos_tagger.tag(tokens)

# Chunk the tagged sentence
chunked = regex_parser.parse(tagged)
print(chunked)
nltk.chunk.tree2conlltags(chunked)

(S
  yo/pp1csn00
  quiero/vmip1s0
  (Cantidad 3/di0fp0)
  (Comida bocadillos/ncfp000 con/sps00 pimiento/np0000l)
  ,/Fc
  (Cantidad 3/Z)
  (Comida pizzas/ncmp000)
  y/cc
  (Comida ensalada/sn.e-SUJ))


[('yo', 'pp1csn00', 'O'),
 ('quiero', 'vmip1s0', 'O'),
 ('3', 'di0fp0', 'B-Cantidad'),
 ('bocadillos', 'ncfp000', 'B-Comida'),
 ('con', 'sps00', 'I-Comida'),
 ('pimiento', 'np0000l', 'I-Comida'),
 (',', 'Fc', 'O'),
 ('3', 'Z', 'B-Cantidad'),
 ('pizzas', 'ncmp000', 'B-Comida'),
 ('y', 'cc', 'O'),
 ('ensalada', 'sn.e-SUJ', 'B-Comida')]

However, the statement asks which food is ordered and in what amount. So, let's write a function that extracts this information from a chunked sentence. Note that this function assumes one of the following structures in a sentence:

1. [...] Cantidad Comida [...]
2. [...] Comida [...]

In case 2, we assume that `Cantidad = 1`. 

In [15]:
# Extract the ordered food items and their amounts from a chunked sentence.
# Assumed sentence structures:
#     1. [...] Cantidad Comida [...]
#     2. [...] Comida [...]
# In case 2, `cantidad` defaults to 1.
def getFoodOrders(chunked_sentence) :
    delivery = []
    dic = {}
    default_cantidad = 1

    chunks = [ chunk for chunk in nltk.chunk.tree2conlltags(chunked_sentence) 
                  if 'Cantidad' in chunk[2] or 'Comida' in chunk[2] ]
    
    i = 0
    while (i < len(chunks)):
        chunk = chunks[i]
        w, t, c = chunk
        
        if c == 'B-Cantidad' :
            dic['cantidad'] = w

        if c == 'B-Comida' :
            if 'cantidad' not in dic :
                dic['cantidad'] = default_cantidad

            dic['comida'] = w
            
            j = i+1
                  
            condition = (j < len(chunks))
            while (condition) :
                w, t, c = chunks[j]
                
                if c == 'I-Comida' :
                    if 'ingredientes' not in dic :
                        dic['ingredientes'] = w
                    else :
                        dic['ingredientes'] = dic['ingredientes'] + " " + w
        
                j += 1
                condition = (j < len(chunks) and c == 'I-Comida')
   
            delivery.append(dic)
            dic = {}

        i += 1
        
    return delivery

In [16]:
# Let's get the food orders of our example
print(sentence)
getFoodOrders(chunked)

yo quiero 3 bocadillos con pimiento, 3 pizzas y ensalada


[{'cantidad': '3', 'comida': 'bocadillos', 'ingredientes': 'con pimiento'},
 {'cantidad': '3', 'comida': 'pizzas'},
 {'cantidad': 1, 'comida': 'ensalada'}]

## Version 2: `UnigramChunker` and `BigramChunker`

### Create a Corpus

To build this version, note that `UnigramChunker`, `BigramChunker` and `NaiveBayesClassifier` all need a training corpus. That corpus has to be in IOB format, so we are going to run our `RegexpParser` over a corpus of sentences to produce it.
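
In IOB format, every token gets one tag: `B-<label>` for the first token of a chunk, `I-<label>` for the following tokens of the same chunk, and `O` for tokens outside any chunk. The sketch below shows the idea in plain Python. `chunks_to_iob` is a hypothetical helper, not part of NLTK, and the chunk spans are written by hand to reproduce the example from Version 1.

```python
def chunks_to_iob(tokens, chunks):
    """Tag tokens in IOB format.

    tokens: list of words.
    chunks: list of (label, start, end) spans, with `end` exclusive.
    """
    tags = ["O"] * len(tokens)
    for label, start, end in chunks:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return list(zip(tokens, tags))

tokens_demo = ["yo", "quiero", "3", "bocadillos", "con", "pimiento",
               ",", "3", "pizzas", "y", "ensalada"]
chunks_demo = [("Cantidad", 2, 3), ("Comida", 3, 6),
               ("Cantidad", 7, 8), ("Comida", 8, 9), ("Comida", 10, 11)]

iob_demo = chunks_to_iob(tokens_demo, chunks_demo)
for token, tag in iob_demo:
    print(token, tag)
```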

So, let's create our sentence corpus:

In [17]:
# Create a corpus with sentences
corpus = [
    "Me gustaría comer una tortilla de patatas",
    "¿Nos pones 3 tocinos, 4 pechugas?",
    "5 repollos, 12 sardinas y 8 jalapeños",
    "Estamos indecisos, pero puedes ponernos mientras tres raciones de patatas fritas",
    "¿Sería tan amable de servirnos catorce tostadas?",
    "Queremos 2 pollos con caracoles, una ternera, tres patos y 2 pimientos con arroz",
    "Nuestro pedido es: 10 hamburguesas con queso, 20 pizzas y 3 patatas",
    "Yo quiero 9 yogures, 12 ensaladas, un pimiento, 4 chorizos, 12 empanadillas de atún, 8 crespillos y 3 alubias",
    "Me gustaría comer: kiwi con ensalada",
    "Él quiere pedir 3 tostadas, 2 tomates y 6 uvas",
    "Ellos han pedido 2 higos, 1 salmón con caracoles y arroz",
    "Tú has pedido 4 fajitas de pollo, dos emperadores con patatas y tres rúculas",
    "2 macarrones, 3 quesos, una guindilla, cuatro pulpos y quince verduras",
    "Mi abuela quiere emperador con ensalada",
    "Mi tía quiere comer marisco y carne",
    "2 cebollas, una calabaza y ocho tomates",
    "tres pollos, dos corderos, cuatro cerdos y tres macarrones",
    "Pepinillos, calabaza, navajas, sepia y gulas",
    "Me gustaría comer bonito a la plancha con patatas asadas",
    "Pídete cinco pomelos, tres calabacines, 8 mangos, seis melocotones"
    "Langosta, langostinos y almejas"
]
# Note: there is no comma after the "Pídete ..." string, so Python concatenates
# it with the "Langosta ..." string. The list therefore holds 20 sentences,
# which is the size reported in the outputs below.

Let's use our `RegexpParser` together with our Spanish POS tagger to create an IOB corpus:

In [18]:
def createIOBCorpus(corpus, pos_tagger, regex_parser) :

    iob_corpus = []

    # For each sentence in corpus
    for sent in corpus :
        # Tokenize the sentence
        tokens = nltk.word_tokenize(sent)

        # Tag the sentence
        tagged_sent = pos_tagger.tag(tokens)

        # Parse tagged_sent
        chunked_sent = regex_parser.parse(tagged_sent)

        iob_corpus.append(chunked_sent)
    
    return iob_corpus

In [19]:
iob_corpus = createIOBCorpus(corpus, spanish_pos_tagger, regex_parser)

### Class definitions

Let's define the `UnigramChunker` and `BigramChunker` classes:

In [20]:
class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents): 
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data) 

    def parse(self, sentence):
        pos_tags = [pos for (word,pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

In [21]:
class BigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents): 
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.BigramTagger(train_data) 

    def parse(self, sentence):
        pos_tags = [pos for (word,pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)
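
Conceptually, the `UnigramChunker` learns, for every POS tag, the chunk tag it was most often paired with in training. The plain-Python sketch below illustrates that idea. It is not NLTK's implementation, and the function names and the toy training data are made up for the example.

```python
from collections import Counter, defaultdict

def train_most_frequent(train_sents):
    """Map each POS tag to its most frequent chunk tag.

    train_sents: list of sentences, each a list of (pos_tag, chunk_tag) pairs.
    """
    counts = defaultdict(Counter)
    for sent in train_sents:
        for pos, chunk in sent:
            counts[pos][chunk] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}

def tag_pos_sequence(model, pos_tags):
    """Tag a POS sequence; unseen POS tags get None (no backoff)."""
    return [(pos, model.get(pos)) for pos in pos_tags]

# Toy training data (hypothetical)
train_demo = [
    [("Z", "B-Cantidad"), ("ncmp000", "B-Comida"),
     ("sps00", "I-Comida"), ("ncms000", "I-Comida")],
    [("di0fs0", "B-Cantidad"), ("ncfs000", "B-Comida"),
     ("cc", "O"), ("ncms000", "B-Comida")],
    [("ncms000", "B-Comida")],
]
model_demo = train_most_frequent(train_demo)
print(tag_pos_sequence(model_demo, ["Z", "ncms000", "vmip1s0"]))
```

Here `ncms000` appears twice as `B-Comida` and once as `I-Comida`, so the model always predicts `B-Comida` for it, whatever the neighbouring tags are.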

Now, we are going to split the `iob_corpus` into training and test sets:

In [22]:
training = iob_corpus[0:14]
testing = iob_corpus[14:]

In [23]:
print('len(iob_corpus) =', len(iob_corpus))
print('len(training) =', len(training))
print('len(testing) =', len(testing))

len(iob_corpus) = 20
len(training) = 14
len(testing) = 6


Let's evaluate the `UnigramChunker` and test its parse with a sentence:

In [24]:
unigram_chunker = UnigramChunker(training)
print(unigram_chunker.evaluate(testing))

tokens = nltk.word_tokenize(corpus[4])
tagged = spanish_pos_tagger.tag(tokens)

print()
print(corpus[4])
print(getFoodOrders(unigram_chunker.parse(tagged)))

ChunkParse score:
    IOB Accuracy:  96.7%%
    Precision:     90.6%%
    Recall:        96.7%%
    F-Measure:     93.5%%

¿Sería tan amable de servirnos catorce tostadas?
[{'cantidad': 1, 'comida': '¿Sería'}, {'cantidad': 1, 'comida': 'de'}, {'cantidad': 'catorce', 'comida': 'tostadas'}]


Now, let's try `BigramChunker`:

In [25]:
bigram_chunker = BigramChunker(training)
print(bigram_chunker.evaluate(testing))

tokens = nltk.word_tokenize(corpus[5])
tagged = spanish_pos_tagger.tag(tokens)

print()
print(corpus[5])
print(getFoodOrders(bigram_chunker.parse(tagged)))

ChunkParse score:
    IOB Accuracy:  82.0%%
    Precision:    100.0%%
    Recall:        70.0%%
    F-Measure:     82.4%%

Queremos 2 pollos con caracoles, una ternera, tres patos y 2 pimientos con arroz
[{'cantidad': '2', 'comida': 'pollos', 'ingredientes': 'con caracoles'}, {'cantidad': 'una', 'comida': 'ternera'}, {'cantidad': 'tres', 'comida': 'patos'}, {'cantidad': '2', 'comida': 'pimientos', 'ingredientes': 'con arroz'}]
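
One thing that may contribute to the lower recall is how `nltk.BigramTagger` treats contexts it has never seen: without a backoff tagger it returns `None`, and because that `None` becomes part of the next context, the following tokens tend to get `None` as well. The sketch below shows this on a tiny hand-made training set and compares it with a bigram tagger that backs off to a `UnigramTagger`. The data is hypothetical, and the backoff variant is only a possible extension, not something the original chunker uses.

```python
import nltk

# Toy training data: sentences of (pos_tag, chunk_tag) pairs (hypothetical)
train_demo = [
    [("di0fs0", "B-Cantidad"), ("ncfs000", "B-Comida")],
    [("Z", "B-Cantidad"), ("ncmp000", "B-Comida"),
     ("cc", "O"), ("ncfs000", "B-Comida")],
]

# 'ncfs000' never starts a training sentence, so the first context is unseen
pos_tags = ["ncfs000", "cc", "ncmp000"]

plain = nltk.BigramTagger(train_demo)
with_backoff = nltk.BigramTagger(train_demo,
                                 backoff=nltk.UnigramTagger(train_demo))

plain_result = plain.tag(pos_tags)
backoff_result = with_backoff.tag(pos_tags)

print(plain_result)
print(backoff_result)
```

In this toy case the plain tagger cannot assign any chunk tag, while the backoff version falls back to the per-tag decision of the unigram model.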


## Version 3: `NaiveBayesChunker` 

Let's define the `NaiveBayesChunker` class, which uses the `ConsecutivePosTagger` class to tag sentences with a `NaiveBayesClassifier`.

We also define the feature extractor function, `pos_features`. For now, this function returns only the part-of-speech tag of the current token.

In [26]:
class ConsecutivePosTagger(nltk.TaggerI):

    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = pos_features(untagged_sent, i, history) 
                train_set.append( (featureset, tag) )
                history.append(tag)
                
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = pos_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return list(zip(sentence, history))

    
    
class NaiveBayesChunker(nltk.ChunkParserI): 
    def __init__(self, train_sents):
        tagged_sents = [[((w,t),c) for (w,t,c) in
                         nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ConsecutivePosTagger(tagged_sents)

    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w,t,c) for ((w,t),c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)

In [27]:
def pos_features(sentence, i, history):
    word, pos = sentence[i]
    return {"pos": pos}

Let's evaluate the `NaiveBayesChunker` and test its parse with a sentence:

In [28]:
naive_chunker = NaiveBayesChunker(training)
print(naive_chunker.evaluate(testing))

tokens = nltk.word_tokenize(corpus[7])
tagged = spanish_pos_tagger.tag(tokens)

parsed = naive_chunker.parse(tagged)

print()
print(corpus[7])
print()
print(getFoodOrders(parsed))

ChunkParse score:
    IOB Accuracy:  96.7%%
    Precision:     90.6%%
    Recall:        96.7%%
    F-Measure:     93.5%%

Yo quiero 9 yogures, 12 ensaladas, un pimiento, 4 chorizos, 12 empanadillas de atún, 8 crespillos y 3 alubias

[{'cantidad': 1, 'comida': 'yogures'}, {'cantidad': '12', 'comida': 'ensaladas'}, {'cantidad': 'un', 'comida': 'pimiento'}, {'cantidad': '12', 'comida': 'empanadillas', 'ingredientes': 'de atún'}, {'cantidad': '8', 'comida': 'crespillos'}, {'cantidad': '3', 'comida': 'alubias'}]


Now, we are going to update the `pos_features` function to also use the previous POS tag in the sentence:

In [29]:
def pos_features(sentence, i, history):
    word, pos = sentence[i]
    
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    
    return {"pos": pos, "prevpos": prevpos}
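
To make the difference between the two feature extractors concrete, the sketch below applies both to a short hand-tagged sentence and prints the feature dictionaries the classifier would receive. The two functions are renamed copies of `pos_features` (the names are hypothetical) so that both versions can live side by side.

```python
def pos_features_current(sentence, i, history):
    """First version: only the POS tag of the current token."""
    word, pos = sentence[i]
    return {"pos": pos}

def pos_features_with_previous(sentence, i, history):
    """Second version: current POS tag plus the previous one."""
    word, pos = sentence[i]
    prevpos = "<START>" if i == 0 else sentence[i - 1][1]
    return {"pos": pos, "prevpos": prevpos}

# Hand-tagged (word, POS) pairs, using tags seen earlier in this notebook
tagged_demo = [("quiero", "vmip1s0"), ("3", "Z"), ("pizzas", "ncmp000")]

features_current = [pos_features_current(tagged_demo, i, [])
                    for i in range(len(tagged_demo))]
features_previous = [pos_features_with_previous(tagged_demo, i, [])
                     for i in range(len(tagged_demo))]

for current, previous in zip(features_current, features_previous):
    print(current, "|", previous)
```

With the second version, the same POS tag produces a different feature set depending on what precedes it, so the classifier needs more training examples to see each combination.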

In [30]:
naive_chunker = NaiveBayesChunker(training)
print(naive_chunker.evaluate(testing))

tokens = nltk.word_tokenize(corpus[5])
tagged = spanish_pos_tagger.tag(tokens)

parsed = naive_chunker.parse(tagged)

print()
print(corpus[5])
print()
print(getFoodOrders(parsed))

ChunkParse score:
    IOB Accuracy:  90.2%%
    Precision:     81.2%%
    Recall:        86.7%%
    F-Measure:     83.9%%

Queremos 2 pollos con caracoles, una ternera, tres patos y 2 pimientos con arroz

[{'cantidad': '2', 'comida': 'pollos', 'ingredientes': 'con caracoles'}, {'cantidad': 'una', 'comida': 'ternera'}, {'cantidad': 'tres', 'comida': 'patos'}, {'cantidad': '2', 'comida': 'pimientos', 'ingredientes': 'con arroz'}]


## Models evaluation

We now have all the models that the statement requires, so we are going to evaluate them to find out which one works best in this practical case. The evaluation uses our `iob_corpus` of 20 sentences: 14 for training and the other 6 for testing.

In [31]:
# Create a corpus with sentences
corpus = [
    "Me gustaría comer una tortilla de patatas",
    "¿Nos pones 3 tocinos, 4 pechugas?",
    "5 repollos, 12 sardinas y 8 jalapeños",
    "Estamos indecisos, pero puedes ponernos mientras tres raciones de patatas fritas",
    "¿Sería tan amable de servirnos catorce tostadas?",
    "Queremos 2 pollos con caracoles, una ternera, tres patos y 2 pimientos con arroz",
    "Nuestro pedido es: 10 hamburguesas con queso, 20 pizzas y 3 patatas",
    "Yo quiero 9 yogures, 12 ensaladas, un pimiento, 4 chorizos, 12 empanadillas de atún, 8 crespillos y 3 alubias",
    "Me gustaría comer: kiwi con ensalada",
    "Él quiere pedir 3 tostadas, 2 tomates y 6 uvas",
    "Ellos han pedido 2 higos, 1 salmón con caracoles y arroz",
    "Tú has pedido 4 fajitas de pollo, dos emperadores con patatas y tres rúculas",
    "2 macarrones, 3 quesos, una guindilla, cuatro pulpos y quince verduras",
    "Mi abuela quiere emperador con ensalada",
    "Mi tía quiere comer marisco y carne",
    "2 cebollas, una calabaza y ocho tomates",
    "tres pollos, dos corderos, cuatro cerdos y tres macarrones",
    "Pepinillos, calabaza, navajas, sepia y gulas",
    "Me gustaría comer bonito a la plancha con patatas asadas",
    "Pídete cinco pomelos, tres calabacines, 8 mangos, seis melocotones"
    "Langosta, langostinos y almejas"
]
# Note: there is no comma after the "Pídete ..." string, so Python concatenates
# it with the "Langosta ..." string and the list holds 20 sentences.

# Function which creates the IOB corpus using the pos_tagger and regex_parser params
def createIOBCorpus(corpus, pos_tagger, regex_parser) :

    iob_corpus = []

    # For each sentence in corpus
    for sent in corpus :
        # Tokenize the sentence
        tokens = nltk.word_tokenize(sent)

        # Tag the sentence
        tagged_sent = pos_tagger.tag(tokens)

        # Parse tagged_sent
        chunked_sent = regex_parser.parse(tagged_sent)

        iob_corpus.append(chunked_sent)
    
    return iob_corpus

# Create the IOB corpus using our corpus
iob_corpus = createIOBCorpus(corpus, spanish_pos_tagger, regex_parser)

# Split the IOB corpus between training and testing sets
training = iob_corpus[0:14]
testing = iob_corpus[14:]

print('len(iob_corpus) =', len(iob_corpus))
print('len(training) =', len(training))
print('len(testing) =', len(testing))

len(iob_corpus) = 20
len(training) = 14
len(testing) = 6


In [32]:
print("UnigramChunker evaluation")
print(unigram_chunker.evaluate(testing))

UnigramChunker evaluation
ChunkParse score:
    IOB Accuracy:  96.7%%
    Precision:     90.6%%
    Recall:        96.7%%
    F-Measure:     93.5%%


In [33]:
print("BigramChunker evaluation")
print(bigram_chunker.evaluate(testing))

BigramChunker evaluation
ChunkParse score:
    IOB Accuracy:  82.0%%
    Precision:    100.0%%
    Recall:        70.0%%
    F-Measure:     82.4%%


In [34]:
def pos_features(sentence, i, history):
    word, pos = sentence[i]
        
    return {"pos": pos}


naive_chunker = NaiveBayesChunker(training)
print("NaiveBayesChunker evaluation - Only current POS tag")
print(naive_chunker.evaluate(testing))

NaiveBayesChunker evaluation - Only current POS tag
ChunkParse score:
    IOB Accuracy:  96.7%%
    Precision:     90.6%%
    Recall:        96.7%%
    F-Measure:     93.5%%


In [35]:
def pos_features(sentence, i, history):
    word, pos = sentence[i]
    
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
        
    return {"pos": pos, "prevpos": prevpos}


naive_chunker = NaiveBayesChunker(training)
print("NaiveBayesChunker evaluation - Current POS tag and previous POS tag")
print(naive_chunker.evaluate(testing))

NaiveBayesChunker evaluation - Current POS tag and previous POS tag
ChunkParse score:
    IOB Accuracy:  90.2%%
    Precision:     81.2%%
    Recall:        86.7%%
    F-Measure:     83.9%%
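
The F-measure in these reports is the harmonic mean of precision and recall, F = 2PR / (P + R). As a quick sanity check, the sketch below recomputes it from the rounded precision and recall values printed above; `f_measure` is a hypothetical helper, not an NLTK function. Because the inputs are already rounded, the last digit can differ slightly from the reported value.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) copied from the evaluation outputs above
reported = {
    "UnigramChunker": (0.906, 0.967),
    "BigramChunker": (1.000, 0.700),
    "NaiveBayesChunker (pos + prevpos)": (0.812, 0.867),
}

for name, (p, r) in reported.items():
    print(f"{name}: F = {100 * f_measure(p, r):.1f}%")
```

The `BigramChunker` case shows how the harmonic mean penalizes imbalance: perfect precision with 70% recall still gives an F-measure close to 82%.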


As we can observe, all scores are high because the corpus is small and has little variety, so the models are likely somewhat overfitted. In addition, the reference chunks were produced by our own `RegexpParser`, so the scores measure agreement with that grammar rather than with hand-made annotations, and with only 6 test sentences they should be read with caution. We can also observe that the best models are `UnigramChunker` and `NaiveBayesChunker` (with the current POS tag only). This is because, in this practical case, the context of a word is not essential for tagging it. That is why `BigramChunker` and `NaiveBayesChunker` (with the previous POS tag), which rely on that context, score lower.

## Conclusions

The conclusions of this practical case are:

- A large and varied training corpus is very important: if the corpus is small and has little variety, the models overfit and do not train well.
- In theory, `NaiveBayesClassifier` should give better results, because it lets us train the model with more information about the sentence, such as the word context. However, in this practical case, the word context is not essential for tagging a word.
- In summary, for this practical case the best model is `UnigramChunker`, because it is simpler than `NaiveBayesChunker` and achieves the same results.

With these conclusions, the [01-food-orders](./01-food-orders.ipynb) notebook provides the function required by the statement with all models.