# Example of different information needs
For the subset of recipes used in this example, see the [Recipe1M+](http://pic2recipe.csail.mit.edu/) dataset.
> Marin, J., Biswas, A., Ofli, F., Hynes, N., Salvador, A., Aytar, Y., ... & Torralba, A. (2019). Recipe1M+: A dataset for learning cross-modal embeddings for cooking recipes and food images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(1), 187-203.

See the extracted dataset sample [here](https://unimi2013.sharepoint.com/:u:/s/InformationRetrieval/EaL7kid2qzdCmAA8RO-m5iQBsvCl5cuNIdn0rsJN1FUhSg?e=fdXkkB)

In [1]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

In [2]:
import os
import nltk

In [3]:
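folder = "../text-sample/"
# Collect the plain-text recipe files from the sample folder.
files = [f for f in os.listdir(folder) if f.endswith('.txt')]
recipes = []
for file in files:
    # Assumption: the sample files are UTF-8 encoded.
    with open(os.path.join(folder, file), 'r', encoding='utf-8') as data:
        recipes.append(data.read())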

## Tokenizers

### NLTK tokenizer
[nltk.tokenize package](https://www.nltk.org/api/nltk.tokenize.html)

In [4]:
def nltk_tokenize(text):
    return [x.lower() for x in nltk.word_tokenize(text)]

### spaCy tokenizer and parser
See the overview of [spaCy linguistic features](https://spacy.io/usage/linguistic-features).

**Important: before using spaCy, download the model with `python3 -m spacy download en_core_web_lg`.**

In [5]:
import spacy
nlp = spacy.load('en_core_web_lg')

#### Example
Recipes do not follow a regular sentence structure, so we split the text into chunks on blank lines (`\n\n`) instead of relying only on spaCy's sentence segmenter. The last chunk is typically the set of instructions.
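As a dependency-free illustration of the blank-line splitting described above, here is a minimal sketch. The helper name `split_recipe_blocks` and the toy recipe text are hypothetical and are not part of the dataset sample; the sketch only assumes that blocks are separated by at least one blank line.

```python
def split_recipe_blocks(recipe_text):
    """Split a recipe into non-empty blocks separated by blank lines."""
    blocks = [block.strip() for block in recipe_text.split("\n\n")]
    return [block for block in blocks if block]


# Hypothetical recipe text, made up for illustration only.
toy_recipe = (
    "Broccoli couscous\n\n"
    "1 3/4 cups water\n1 cup couscous\n\n"
    "Bring water to a boil. Stir in couscous."
)

blocks = split_recipe_blocks(toy_recipe)
instructions = blocks[-1]  # the last block is assumed to hold the instructions
print(len(blocks))
print(instructions)
```

Taking the last block is a heuristic: it only works when the instructions really are the final chunk of the file.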

In [6]:
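# Sentences as detected by spaCy's segmenter (kept for comparison).
spacy_sentences = list(nlp(recipes[0]).sents)
# Chunks separated by blank lines; the last one typically holds the instructions.
newline_sentences = [r.strip('\n') for r in recipes[0].split('\n\n')]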

In [7]:
instructions_text = newline_sentences[-1]
print(instructions_text)
instructions_sentences = list(nlp(instructions_text).sents)

Bring 1 3/4 cups water to a boil in a medium saucepan; gradually stir in couscous. Remove from heat; cover and let stand 5 minutes. Fluff with a fork. While couscous stands, steam broccoli florets, covered, for 3 minutes or until tender. Combine couscous, broccoli, onion, and next 10 ingredients (onion through chickpeas), tossing gently. Sprinkle with cheese.


In [8]:
sentence = instructions_sentences[0]
print(type(sentence))
print(sentence)

<class 'spacy.tokens.span.Span'>
Bring 1 3/4 cups water to a boil in a medium saucepan; gradually stir in couscous.


In [9]:
tokens = []
for token in sentence:
    tokens.append({
        'position': token.idx, 'text': token.text, 'pos': token.pos_, 'tag':token.tag_, 'lemma': token.lemma_,
        'alpha': token.is_alpha, 'stop': token.is_stop, 'dep': token.dep_, 'morph': token.morph
    })
S = pd.DataFrame(tokens)

In [10]:
S

| | position | text | pos | tag | lemma | alpha | stop | dep | morph |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Bring | VERB | VB | bring | True | False | advcl | (VerbForm=Inf) |
| 1 | 6 | 1 | NUM | CD | 1 | False | False | compound | (NumType=Card) |
| 2 | 8 | 3/4 | NUM | CD | 3/4 | False | False | nummod | (NumType=Card) |
| 3 | 12 | cups | NOUN | NNS | cup | True | False | compound | (Number=Plur) |
| 4 | 17 | water | NOUN | NN | water | True | False | dobj | (Number=Sing) |
| 5 | 23 | to | ADP | IN | to | True | True | prep | () |
| 6 | 26 | a | DET | DT | a | True | True | det | (Definite=Ind, PronType=Art) |
| 7 | 28 | boil | NOUN | NN | boil | True | False | pobj | (Number=Sing) |
| 8 | 33 | in | ADP | IN | in | True | True | prep | () |
| 9 | 36 | a | DET | DT | a | True | True | det | (Definite=Ind, PronType=Art) |


**Text**: The original word text.  
**Lemma**: The base form of the word.  
**POS**: The simple UPOS part-of-speech tag (e.g. verb, adjective).  
**Tag**: The detailed part-of-speech tag.  
**Dep**: Syntactic dependency, i.e. the relation between tokens.  
**Shape**: The word shape – capitalization, punctuation, digits.  
**is alpha**: Does the token consist only of alphabetic characters?  
**is stop**: Is the token part of a stop list, i.e. the most common words of the language?
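To make the shape, alpha, and stop attributes more concrete, here is a rough pure-Python sketch. It only approximates what spaCy exposes as `token.shape_`, `token.is_alpha`, and `token.is_stop`: the `word_shape` helper and the three-word `STOP_WORDS` set are assumptions made for illustration, not spaCy's implementation or its stop list.

```python
def word_shape(text):
    """Approximate a word shape: letters become x/X, digits become d.

    Runs of the same symbol are cut off after four characters.
    """
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char  # punctuation is kept unchanged
        if symbol == last:
            run += 1
        else:
            run = 1
            last = symbol
        if run <= 4:
            shape.append(symbol)
    return "".join(shape)


# Hypothetical mini stop list; spaCy ships a much larger one.
STOP_WORDS = {"a", "in", "to"}

for word in ["Bring", "3/4", "saucepan", "a"]:
    print(word, word_shape(word), word.isalpha(), word.lower() in STOP_WORDS)
```

With real spaCy tokens the same information could be collected by adding `token.shape_` to the dictionary built for each token.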

Using spaCy’s built-in displaCy visualizer, **here’s what our example sentence and its dependencies look like:**

In [11]:
from spacy.displacy import render

In [12]:
render(sentence)

**TIP: UNDERSTANDING TAGS AND LABELS**  
Most of the tags and labels look pretty abstract, and they vary between languages. `spacy.explain` will show you a short description. For example, `spacy.explain("VBZ")` returns “verb, 3rd person singular present”.

In [13]:
spacy.explain("det")

'determiner'
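The same call can be applied to several labels at once. The sketch below looks up a handful of tags and dependency labels that appear in the token table and in the tip; the particular selection is arbitrary, and a label that is missing from spaCy's glossary would come back as `None`.

```python
import spacy

# Labels picked from the token table and the tip; the choice is illustrative.
labels = ["VBZ", "det", "NUM", "dobj", "pobj"]

# spacy.explain needs no loaded model: it is a plain glossary lookup.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:>5}: {description}")
```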