# Example of different information needs
For the subset of recipes used in this example, see the [Recipe1M+](http://pic2recipe.csail.mit.edu/) dataset.
> Marin, J., Biswas, A., Ofli, F., Hynes, N., Salvador, A., Aytar, Y., ... & Torralba, A. (2019). Recipe1M+: A dataset for learning cross-modal embeddings for cooking recipes and food images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(1), 187-203.

The extracted dataset sample is available [here](https://unimi2013.sharepoint.com/:u:/s/InformationRetrieval/EaL7kid2qzdCmAA8RO-m5iQBsvCl5cuNIdn0rsJN1FUhSg?e=fdXkkB).

In [1]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

In [2]:
import os
import nltk

In [3]:
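# Load every .txt recipe from the sample folder into a list of strings.
folder = "../text-sample/"
files = sorted(f for f in os.listdir(folder) if f.endswith('.txt'))  # sorted for a reproducible order
recipes = []
for file in files:
    # Assumes UTF-8 files; adjust the encoding if the sample uses another one.
    with open(os.path.join(folder, file), 'r', encoding='utf-8') as data:
        recipes.append(data.read())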

## Tokenizers

### NLTK tokenizer
[nltk.tokenize package](https://www.nltk.org/api/nltk.tokenize.html)

In [None]:
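# Lowercased word tokens. nltk.word_tokenize relies on NLTK's Punkt tokenizer data,
# which has to be installed once with nltk.download.
def nltk_tokenize(text):
    return [x.lower() for x in nltk.word_tokenize(text)]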

### spaCy tokenizer and parser
See the overview of [spaCy linguistic features](https://spacy.io/usage/linguistic-features).

In [None]:
import spacy
# The model must be installed first: python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')

#### Example
Recipes lack regular sentence structure, so a standard sentence segmenter is of limited use on the full text. We therefore also split each recipe into chunks on blank lines and treat each chunk as a "sentence". The last chunk typically contains the instructions.

In [None]:
# spaCy's own sentence segmentation, kept for comparison
spacy_sentences = list(nlp(recipes[0]).sents)
# Newline-based "sentences": chunks separated by blank lines
newline_sentences = [r.strip('\n') for r in recipes[0].split('\n\n')]

In [None]:
instructions_text = newline_sentences[-1]
print(instructions_text)
instructions_sentences = list(nlp(instructions_text).sents)
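
The blank-line splitting used above can be sketched on its own, without spaCy. The recipe text, the `split_on_blank_lines` helper, and the title/ingredients/instructions layout below are hypothetical and only serve as an illustration; the real files in the sample may be organized differently.

```python
import re

# Hypothetical recipe text (an assumption, not taken from the dataset sample).
sample_recipe = (
    "Simple Tomato Pasta\n\n"
    "200 g pasta\n2 tomatoes\n1 tbsp olive oil\n\n"
    "Boil the pasta. Chop the tomatoes. Toss everything with the oil."
)

def split_on_blank_lines(text):
    """Split text into chunks separated by one or more blank lines."""
    chunks = re.split(r"\n\s*\n", text.strip())
    # Drop empty chunks and surrounding whitespace.
    return [chunk.strip() for chunk in chunks if chunk.strip()]

chunks = split_on_blank_lines(sample_recipe)
instructions = chunks[-1]  # the last chunk is assumed to hold the instructions
print(len(chunks))
print(instructions)
```

Compared with `split('\n\n')`, the regular expression also tolerates runs of several blank lines and blank lines that contain stray whitespace, which may or may not matter for this sample.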

In [None]:
sentence = instructions_sentences[0]
type(sentence)

In [None]:
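# Collect per-token linguistic attributes into a table.
tokens = []
for token in sentence:
    tokens.append({
        'position': token.idx,  # character offset of the token in the parsed text
        'text': token.text, 'pos': token.pos_, 'lemma': token.lemma_,
        'alpha': token.is_alpha, 'stop': token.is_stop, 'dep': token.dep_,
        'morph': str(token.morph)  # string form of the morphological features
    })
S = pd.DataFrame(tokens)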

In [None]:
S

In [None]:
from spacy.displacy import render

In [None]:
render(sentence)