This is a notebook showing a modification of the original [NYT Ingredient Phrase tagger](https://github.com/NYTimes/ingredient-phrase-tagger). [Here](http://open.blogs.nytimes.com/2016/04/27/structured-ingredients-data-tagging/) is the article where they talk about it.

That GitHub repository contains The New York Times' tool for performing Named Entity Recognition via Conditional Random Fields (CRFs) on recipes, extracting the ingredients used in each recipe as well as their quantities.

Their implementation uses [CRF++](https://taku910.github.io/crfpp/) as the extractor.

Here I will use pycrfsuite instead of CRF++, the main reasons being:

* by using a full Python solution (even though pycrfsuite is just a wrapper around [crfsuite](http://www.chokkan.org/software/crfsuite/)) we can deploy the model more easily, and

* installing CRF++ proved to be a challenge on Ubuntu 14.04.

You can install pycrfsuite by doing:

`pip install python-crfsuite`

We load the train_file with features, produced by running the following command *(as it appears in the README)*:

```
bin/generate_data --data-path=input.csv --count=180000 --offset=0 > tmp/train_file
```

In [79]:
import re
import json

from itertools import chain
import nltk
import pycrfsuite

from lib.training import utils

In [80]:
with open('tmp/train_file') as train_file:
    lines = train_file.readlines()

# each line holds six tab-separated fields; blank or malformed lines are dropped
items = [line.strip('\n').split('\t') for line in lines]
items = [item for item in items if len(item) == 6]

In [81]:
items[:10]

[['1$1/4', 'I1', 'L20', 'NoCAP', 'NoPAREN', 'B-QTY'],
 ['cups', 'I2', 'L20', 'NoCAP', 'NoPAREN', 'B-UNIT'],
 ['cooked', 'I3', 'L20', 'NoCAP', 'NoPAREN', 'B-COMMENT'],
 ['and', 'I4', 'L20', 'NoCAP', 'NoPAREN', 'I-COMMENT'],
 ['pureed', 'I5', 'L20', 'NoCAP', 'NoPAREN', 'I-COMMENT'],
 ['fresh', 'I6', 'L20', 'NoCAP', 'NoPAREN', 'I-COMMENT'],
 ['butternut', 'I7', 'L20', 'NoCAP', 'NoPAREN', 'B-NAME'],
 ['squash', 'I8', 'L20', 'NoCAP', 'NoPAREN', 'I-NAME'],
 [',', 'I9', 'L20', 'NoCAP', 'NoPAREN', 'OTHER'],
 ['or', 'I10', 'L20', 'NoCAP', 'NoPAREN', 'I-COMMENT']]

As we can see, each line of the train_file has six tab-separated fields:

- token
- position in the phrase (I1 is the first word, I2 the second, and so on)
- LX, the length group of the phrase the token belongs to (defined by [LengthGroup](https://github.com/NYTimes/ingredient-phrase-tagger/blob/master/lib/training/utils.py#L140))
- NoCAP or YesCAP, whether the token is capitalized or not
- YesPAREN or NoPAREN, whether the token is inside parentheses or not
- the label, for example B-QTY, I-NAME or OTHER
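
As a minimal sketch of this layout, the snippet below splits one tab-separated line into named fields. The `TrainRow` container and the `parse_train_line` helper are hypothetical names used only for illustration; they are not part of the NYT tagger. The last column is taken to be the label, as in the sample output above.

```python
from collections import namedtuple

# Hypothetical container for one row of tmp/train_file (not part of the NYT code).
TrainRow = namedtuple(
    "TrainRow", ["token", "position", "length_group", "cap", "paren", "label"]
)

def parse_train_line(line):
    """Split one tab-separated line; return None for blank or malformed lines."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        return None
    return TrainRow(*fields)

row = parse_train_line("cups\tI2\tL20\tNoCAP\tNoPAREN\tB-UNIT\n")
print("%s %s" % (row.token, row.label))
```

Named fields make later code easier to read than positional indexing such as `word[-1]`, at the cost of one extra conversion step.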

pycrfsuite expects each training sequence as a list of per-token feature items together with a parallel list of tags. So we process the items from the train file and bucket them into sentences (one ingredient phrase each).

In [82]:
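sentences = []

# a row whose position field is 'I1' marks the start of a new phrase
sent = [items[0]]
for item in items[1:]:
    if 'I1' in item:
        sentences.append(sent)
        sent = [item]
    else:
        sent.append(item)
# note: the phrase still held in `sent` after the loop is never appended,
# so the final phrase of the file is left out of `sentences`
len(sentences)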

19948
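
The loop above closes a phrase only when the next `I1` row arrives, so the last phrase in the file is never added. A small variant that also flushes the final phrase could look like the sketch below. `group_into_phrases` is a hypothetical helper, and it compares the position field directly (`row[1]`) rather than searching the whole row for `'I1'`.

```python
def group_into_phrases(rows):
    """Group rows into phrases; a row whose position field is 'I1' starts a new phrase."""
    phrases = []
    current = []
    for row in rows:
        if row[1] == "I1" and current:
            phrases.append(current)
            current = []
        current.append(row)
    if current:  # flush the final phrase, which the loop alone would drop
        phrases.append(current)
    return phrases

rows = [
    ["1", "I1", "L4", "NoCAP", "NoPAREN", "B-QTY"],
    ["cup", "I2", "L4", "NoCAP", "NoPAREN", "B-UNIT"],
    ["Salt", "I1", "L4", "YesCAP", "NoPAREN", "B-NAME"],
]
print(len(group_into_phrases(rows)))  # 2
```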

In [83]:
import random
random.shuffle(sentences)
test_size = 0.1
data_size = len(sentences)

test_data = sentences[:int(test_size*data_size)]
train_data = sentences[int(test_size*data_size):]
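
Because `random.shuffle` is called without a seed, the split above differs on every run. If a reproducible split is wanted, one option is sketched below. `split_train_test` and its `seed` parameter are assumptions introduced for illustration, not something the notebook defines.

```python
import random

def split_train_test(phrases, test_size=0.1, seed=0):
    """Shuffle a copy of `phrases` with a fixed seed and split off the test fraction."""
    shuffled = list(phrases)
    random.Random(seed).shuffle(shuffled)
    n_test = int(test_size * len(shuffled))
    # same convention as the cell above: the first n_test items are the test set
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_train_test(list(range(20)), test_size=0.1, seed=42)
print("%d %d" % (len(train), len(test)))  # 18 2
```

Shuffling a copy leaves the original list untouched, so the same data can be split again with a different seed.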

In [84]:
def sent2labels(sent):
    return [word[-1] for word in sent]

def sent2features(sent):
    return [word[:-1] for word in sent]

def sent2tokens(sent):
    return [word[0] for word in sent]   

y_train = [sent2labels(s) for s in train_data]
X_train = [sent2features(s) for s in train_data]
X_train[1]

[['Freshly', 'I1', 'L8', 'YesCAP', 'NoPAREN'],
 ['ground', 'I2', 'L8', 'NoCAP', 'NoPAREN'],
 ['pepper', 'I3', 'L8', 'NoCAP', 'NoPAREN'],
 ['to', 'I4', 'L8', 'NoCAP', 'NoPAREN'],
 ['taste', 'I5', 'L8', 'NoCAP', 'NoPAREN']]

We set up the CRF trainer and append every training sequence. Apart from the parameters set further below, we keep the default values and include all the possible joint features.

In [85]:
trainer = pycrfsuite.Trainer(verbose=False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

I obtained the following hyperparameters by running a GridSearchCV with the scikit-learn-compatible wrapper around pycrfsuite.

In [86]:
trainer.set_params({
    'c1': 0.43,
    'c2': 0.012,
    'max_iterations': 100,
    'feature.possible_transitions': True,
    'feature.possible_states': True,
    'linesearch': 'StrongBacktracking'
})
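
In CRFsuite, `c1` and `c2` are the coefficients for L1 and L2 regularization. The grid search itself is not shown in this notebook, so the following is only a sketch of the general idea: a plain exhaustive search over parameter combinations, without the cross-validation folds that GridSearchCV adds. `grid_search` and `toy_score` are hypothetical; in a real run the scoring function would train a CRF with the given parameters and return a validation metric. The grid values and the peak of the toy objective are arbitrary.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Score every combination in param_grid and return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    # Placeholder objective with an arbitrary peak; it stands in for
    # "train a model with these params and measure validation quality".
    return -((params["c1"] - 0.5) ** 2 + (params["c2"] - 0.1) ** 2)

grid = {"c1": [0.1, 0.5, 1.0], "c2": [0.01, 0.1]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params)
```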

We train the model (this might take a while).

In [87]:
trainer.train('tmp/trained_pycrfsuite')

Now we have a trained model that we can deploy.

In [88]:
tagger = pycrfsuite.Tagger()
tagger.open('tmp/trained_pycrfsuite')

<contextlib.closing at 0x115b07e50>
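
The `test_data` split created earlier is not used anywhere else in this notebook. One simple way to put it to work is token-level accuracy on the held-out phrases, sketched below. `token_accuracy` is a hypothetical helper, and the tag sequences in the example are made up to show the calculation; they are not model output.

```python
def token_accuracy(gold_sequences, predicted_sequences):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for gold, predicted in zip(gold_sequences, predicted_sequences):
        for g, p in zip(gold, predicted):
            correct += (g == p)
            total += 1
    return float(correct) / total if total else 0.0

# With the notebook's objects this would be called as (not run here):
#   y_test = [sent2labels(s) for s in test_data]
#   y_pred = [tagger.tag(sent2features(s)) for s in test_data]
#   token_accuracy(y_test, y_pred)
gold = [["B-QTY", "B-UNIT", "B-NAME"], ["B-NAME", "OTHER"]]
pred = [["B-QTY", "B-UNIT", "B-COMMENT"], ["B-NAME", "OTHER"]]
print(token_accuracy(gold, pred))  # 0.8
```

Token accuracy is a coarse measure; a per-tag breakdown would show more clearly which labels the model confuses.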

Now we add a wrapper function around the script found in **lib/testing/convert_to_json.py** and create a convenient way to parse an ingredient sentence.

In [89]:
import re
import json
from lib.training import utils
from string import punctuation
import spacy.en
from nltk.tokenize import PunktSentenceTokenizer

nlp_engine = spacy.en.English()

# sentence splitter used by parse_recipe_ingredients below
tokenizer = PunktSentenceTokenizer()

def get_sentence_features(sent):
    """Gets the features of the sentence"""
    sent_tokens = utils.tokenize(utils.cleanUnicodeFractions(sent))

    sent_features = []
    for i, token in enumerate(sent_tokens):
        token_features = [token]
        token_features.extend(utils.getFeatures(token, i+1, sent_tokens))
        sent_features.append(token_features)
    return sent_features

def format_ingredient_output(tagger_output):
    """Formats the tagger output into a more convenient dictionary"""
    data = [{}]
    display = [[]]
    prevTag = None

    for token, tag in tagger_output:
        # turn B-NAME or I-NAME back into "name"
        tag = re.sub(r'^[BI]\-', "", tag).lower()

        # ---- DISPLAY ----
        # build a structure which groups each token by its tag, so we can
        # rebuild the original display name later.
        if prevTag != tag:
            display[-1].append((tag, [token]))
            prevTag = tag
        else:
            display[-1][-1][1].append(token)
            #               ^- token
            #            ^---- tag
            #        ^-------- ingredient

        # ---- DATA ----
        # build a dict grouping tokens by their tag;
        # initialize this attribute if this is the first token of its kind
        if tag not in data[-1]:
            data[-1][tag] = []

        # HACK: If this token is a unit, singularize it so Scoop accepts it.
        if tag == "unit":
            token = utils.singularize(token)

        data[-1][tag].append(token)

    # reassemble the output into a list of dicts.
    output = [
        {k: utils.smartJoin(tokens) for k, tokens in ingredient.items()}
        for ingredient in data
        if len(ingredient)
    ]

    # Add the raw ingredient phrase
    for i, v in enumerate(output):
        output[i]["input"] = utils.smartJoin(
            [" ".join(tokens) for k, tokens in display[i]])

    return output

def parse_ingredient(sent):
    """ingredient parsing logic"""
    sentence_features = get_sentence_features(sent)
    tags = tagger.tag(sentence_features)
    tagger_output = zip(sent2tokens(sentence_features), tags)
    parsed_ingredient =  format_ingredient_output(tagger_output)
    if parsed_ingredient:
        parsed_ingredient[0]['name'] = parsed_ingredient[0].get('name','').strip('.')
    return parsed_ingredient
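
To make the grouping step in `format_ingredient_output` easier to follow, here is a stripped-down sketch of its core idea: drop the `B-`/`I-` prefix from each tag and collect the tokens per tag. `group_tokens_by_tag` is a hypothetical helper. It uses a plain `" ".join` where the notebook uses `utils.smartJoin` (which, as far as I can tell, also tidies spacing around commas and parentheses), and it does not singularize units.

```python
import re
from collections import OrderedDict

def group_tokens_by_tag(tagged_tokens):
    """Strip the B-/I- prefix from each tag and collect tokens per tag, in first-seen order."""
    grouped = OrderedDict()
    for token, tag in tagged_tokens:
        key = re.sub(r"^[BI]-", "", tag).lower()
        grouped.setdefault(key, []).append(token)
    # plain join as a simplified stand-in for utils.smartJoin
    return OrderedDict((key, " ".join(tokens)) for key, tokens in grouped.items())

tagged = [
    ("2", "B-QTY"),
    ("cups", "B-UNIT"),
    ("all-purpose", "B-NAME"),
    ("flour", "I-NAME"),
]
print(dict(group_tokens_by_tag(tagged)))
```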

In [90]:

with open('../chocolate_chip_cookie/ButterySugarCookies.json') as data_file:
    # TODO: the path is hard-coded for now; eventually this should loop over
    # every file in the directory
    data = json.load(data_file)

ingre_text = data["ingredients"]
steps_text = data["instructions"]


In [91]:
def parse_recipe_ingredients(ingredient_list):
    """Wrapper around parse_ingredient so we can call it on an ingredient list"""
    # split the text into sentences and parse each one separately
    sentences = tokenizer.tokenize(ingredient_list)
    sentences = [sent.strip('\n') for sent in sentences]
    ingredients = []
    for sent in sentences:
        ingredients.extend(parse_ingredient(sent))
    return ingredients

In [92]:
ingred_dict = []
for ingr in ingre_text:
    ingr_parse = parse_recipe_ingredients(ingr)
    for element in ingr_parse:
        # use each parsed element's own name (empty string if no NAME tag was found)
        ingred_dict.append(element.get('name', ''))

In [93]:
ingred_dict

[u'unsalted butter',
 u'sugar',
 u'salt',
 u'egg',
 u'vanilla extract',
 u'all-purpose flour',
 u'sugar',
 u'**Special equipment:** Wax paper; 2']

In [97]:
for step in steps_text:
    doc = nlp_engine(step)
    for sent in doc.sents:
        for token in sent:
            print token, token.pos, token.pos_
        print "===================="

Beat 97 VERB
together 84 ADV
butter 89 NOUN
, 94 PUNCT
sugar 89 NOUN
, 94 PUNCT
and 86 CONJ
salt 89 NOUN
in 83 ADP
a 87 DET
large 82 ADJ
bowl 89 NOUN
with 83 ADP
an 87 DET
electric 82 ADJ
mixer 89 NOUN
at 83 ADP
medium 82 ADJ
- 94 PUNCT
high 82 ADJ
speed 89 NOUN
until 83 ADP
pale 82 ADJ
and 86 CONJ
fluffy 82 ADJ
, 94 PUNCT
about 84 ADV
3 90 NUM
minutes 89 NOUN
in 83 ADP
a 87 DET
stand 89 NOUN
mixer 89 NOUN
( 94 PUNCT
preferably 84 ADV
fitted 97 VERB
with 83 ADP
paddle 89 NOUN
attachment 89 NOUN
) 94 PUNCT
or 86 CONJ
6 90 NUM
with 83 ADP
a 87 DET
handheld 89 NOUN
. 94 PUNCT
Beat 89 NOUN
in 83 ADP
egg 89 NOUN
and 86 CONJ
vanilla 89 NOUN
. 94 PUNCT
Reduce 82 ADJ
speed 89 NOUN
to 91 PART
low 84 ADV
, 94 PUNCT
then 84 ADV
mix 97 VERB
in 83 ADP
flour 89 NOUN
. 94 PUNCT
Halve 97 VERB
dough 89 NOUN
and 86 CONJ
form 97 VERB
each 87 DET
half 89 NOUN
into 83 ADP
a 87 DET
disk 89 NOUN
, 94 PUNCT
then 84 ADV
wrap 97 VERB
in 83 ADP
wax 89 NOUN
paper 89 NOUN
. 94 PUNCT
Put 97 VERB
each 87 DET
disk 89