In [1]:
import argparse
import nltk
import pandas as pd
import pycrfsuite
import numpy as np
import re

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import MultiLabelBinarizer
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer

nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/markishab/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

## Read in Simply Recipes and take a look at what we have

In [2]:
recipe_sr = pd.read_pickle('../../data/02_intermediate/recipes_sr_final.pickle')

In [4]:
recipe_sr.head()

Unnamed: 0,title,prep_time,cook_time,recipe_yield,tags,ingredients,entire_card,recipe_links
0,Grilled Cheese BLT,10 minutes,10 minutes,4 sandwiches,"'Dinner', 'Lunch', 'Sandwich', 'Favorite Summe...","[8 slices sourdough bread, 4 tablespoon unsalt...","['\n\n ', '\n ...",https://www.simplyrecipes.com/recipes/grilled_...
1,Pulled Pork Sandwich,10 minutes,"2 hours, 45 minutes",Serves 6 to 8,"'Dinner', 'Sandwich', 'Budget', 'Comfort Food'...","[1 large onion, chopped, 6 garlic cloves, peel...","['\n\n ', '\n ...",https://www.simplyrecipes.com/recipes/pulled_p...
2,How to Make Bacon in the Oven,5 minutes,20 minutes,12 strips,"'Tips', 'Breakfast and Brunch', 'Baking', 'How...","[12 strips bacon, 1/2 teaspoon ground black pe...","['\n\n ', '\n ...",https://www.simplyrecipes.com/recipes/how_to_m...
3,Sausage Stuffed Zucchini,15 minutes,1 hour,Serves 4,"'Dinner', 'Favorite Summer', 'Make-ahead', 'It...","[2 tablespoons extra virgin olive oil, 1/2 pou...","['\n\n ', '\n ...",https://www.simplyrecipes.com/recipes/italian_...
4,The Best Dry Rub for Ribs,5 minutes,,,"'Favorite Fall', 'Favorite Summer', 'Game Day'...",[3/4 cup packed dark brown sugar (or 1/2 cup i...,"['\n\n ', '\n ...",https://www.simplyrecipes.com/recipes/the_best...


Although our recipes are clean, we cannot yet search them in any meaningful way to figure out which ingredients each recipe contains. We need to turn each unstructured ingredient string (e.g., '8 slices sourdough bread') into something more structured (quantity: 8, unit: slices, food: sourdough bread). For this task we are going to need natural language processing.

## Tagging Ingredients through a CRF Model 

After a good deal of research, I found that this problem is often solved by building and deploying a conditional random field (CRF) model. The New York Times has published some wonderful resources on how to build such a model, and reusing parts of their approach could be really helpful in figuring out how to tag my data.

In [5]:
ingredients_lists = list(recipe_sr.ingredients)

Let's tokenize each ingredient and make sure that the punctuation is removed

In [6]:
# let's tokenize all the words and get rid of punctuation
tokenizer = RegexpTokenizer(r'(\d\/\d |\w+)')
token_sr = []
for recipe in ingredients_lists:
    sub_list = []
    for ingredient in recipe: 
        sub_list.append(tokenizer.tokenize(ingredient))
    token_sr.append(sub_list)
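
To make the tokenizer's behavior concrete, here is a minimal sketch that applies the same pattern with `re.findall`. It assumes `RegexpTokenizer` with its default settings returns what `re.findall` returns for this pattern; the helper name `tokenize` and the sample strings are made up for illustration.

```python
import re

# Same pattern as the RegexpTokenizer cell above; re.findall stands in
# for the tokenizer here (an assumption, see the note above).
PATTERN = re.compile(r'(\d\/\d |\w+)')

def tokenize(ingredient):
    """Return fraction and word tokens, dropping punctuation."""
    return PATTERN.findall(ingredient)

print(tokenize('1/3 cup milk, any kind'))
print(tokenize('1 1/2 cups flour'))
print(tokenize('1/2-inch piece ginger'))
```

Two things are worth noticing: a fraction token keeps its trailing space (which is why values like `'1/3 '` show up later), and a fraction that is not followed by a space, as in `1/2-inch`, is split into separate digit tokens.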

### Feature Creation 

Now that we have tokenized our ingredients, we need to put them in a form that our CRF model can handle. This means grouping each recipe's ingredients into a list, and turning each ingredient sentence into its own list of (token, POS tag) tuples.

In [7]:
crf_data = []
for recipe in token_sr:
    sub_list = []
    for ingredient in recipe:
        # pos_tag returns a list of (token, POS tag) tuples
        pos = nltk.pos_tag(ingredient)
        sub_list.append(pos)
    crf_data.append(sub_list)

In [8]:
# Source: NYT Github Page 
def singularize(word):
    """
    A poor replacement for the pattern.en singularize function, but ok for now.
    """

    units = {
        "cups": u"cup",
        "tablespoons": u"tablespoon",
        "teaspoons": u"teaspoon",
        "pounds": u"pound",
        "ounces": u"ounce",
        "cloves": u"clove",
        "sprigs": u"sprig",
        "pinches": u"pinch",
        "bunches": u"bunch",
        "slices": u"slice",
        "grams": u"gram",
        "heads": u"head",
        "quarts": u"quart",
        "stalks": u"stalk",
        "pints": u"pint",
        "pieces": u"piece",
        "sticks": u"stick",
        "dashes": u"dash",
        "fillets": u"fillet",
        "cans": u"can",
        "ears": u"ear",
        "packages": u"package",
        "strips": u"strip",
        "bulbs": u"bulb",
        "bottles": u"bottle"
    }

    if word in units.keys():
        return units[word]
    else:
        return word

In [9]:
def word2features(doc, i):
    word = singularize(doc[i][0])
    postag = doc[i][1]

    # Common features for all words
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'postag=' + postag,
    ]
    if i > 0:
        word1 = doc[i-1][0]
        postag1 = doc[i-1][1]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'beginning of a document'
        features.append('BOS')
    # Features for words that are not
    # at the end of a document
    if i < len(doc)-1:
        word1 = doc[i+1][0]
        postag1 = doc[i+1][1]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'end of a document'
        features.append('EOS')

    return features

In [10]:
# A function for extracting features in documents
def extract_features(doc):
    return [word2features(doc, i) for i in range(len(doc))]

# A function for generating the list of labels for each document
def get_labels(doc):
    return [label for (token, postag, label) in doc]
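
To see what the model actually receives, here is a trimmed sketch of the feature extraction on one ingredient. `sketch_features` is a hypothetical, shortened stand-in for `word2features` (no suffix features, no neighbour POS tags, no singularizing), and the POS tags in `doc` are typed by hand as an assumption rather than taken from `nltk.pos_tag`.

```python
def sketch_features(doc, i):
    """Hypothetical, trimmed version of word2features for illustration."""
    word, postag = doc[i]
    features = ['bias', 'word.lower=' + word.lower(), 'postag=' + postag]
    if i > 0:
        # Previous token gives the model left context
        features.append('-1:word.lower=' + doc[i - 1][0].lower())
    else:
        features.append('BOS')  # beginning of the ingredient sentence
    if i < len(doc) - 1:
        # Next token gives the model right context
        features.append('+1:word.lower=' + doc[i + 1][0].lower())
    else:
        features.append('EOS')  # end of the ingredient sentence
    return features

# Hand-tagged example; real tags would come from nltk.pos_tag
doc = [('8', 'CD'), ('slices', 'NNS'), ('sourdough', 'NN'), ('bread', 'NN')]
features = [sketch_features(doc, i) for i in range(len(doc))]
for token_features in features:
    print(token_features)
```

Each token becomes a list of string features, and each ingredient becomes a list of those lists, which is the sequence shape passed to the tagger further down.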

In [11]:
crf_data_final = []
for recipe in crf_data:
    X = [extract_features(doc) for doc in recipe]
    crf_data_final.append(X)

Load our trained tagger

In [12]:
tagger = pycrfsuite.Tagger()
tagger.open('../../data/04_models/crf_ing_final.model')

<contextlib.closing at 0x1a21ca9748>

Let's tag the Simply Recipes dataset

In [13]:
sr_labels = []
for recipe in crf_data_final:
    y_pred = [tagger.tag(xseq) for xseq in recipe]
    sr_labels.append(y_pred)

In [64]:
recipe_titles = list(recipe_sr.title)

In [66]:
len(recipe_titles)

1763

In [67]:
len(token_sr)

1763

In [68]:
len(sr_labels)

1763

### Let's match up the tokens with their tags

One idea is to create each recipe as a nested dictionary

In [111]:
def ingredient_tagger(ingredient_sentence, ingredient_label_sentence):
    qty = []
    unit = []
    name = []
    comment = []

    for word, label in zip(ingredient_sentence, ingredient_label_sentence):
        if label == 'qty':
            qty.append(word)
        if label == 'unit':
            unit.append(word)
        if label == 'name':
            name.append(word)
        if label == 'comment':
            comment.append(word)
    return {'qty': " ".join(qty), 'unit': " ".join(unit), 'name': " ".join(name), 'comment': " ".join(comment)}

In [123]:
def recipe_tagger(single_recipe, matching_recipe_labels):
    ret = []
    for ingredient, ingredient_label in zip(single_recipe, matching_recipe_labels):
        ret.append(ingredient_tagger(ingredient, ingredient_label))
    return ret

In [148]:
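def token_labels_to_dict(tokens, labels, recipe_titles):
    final_dict = {}
    # Use the function arguments rather than the notebook-level globals
    for recipe, label, title in zip(tokens, labels, recipe_titles):
        ing = recipe_tagger(recipe, label)
        # Recipes that share a lowercased title overwrite one another
        final_dict[str(title).lower()] = ing
    return final_dict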

In [147]:
final_dict = token_labels_to_dict(token_sr, sr_labels, recipe_titles)
final_dict['grab-and-go oatmeal chia cups']

[{'qty': '2',
  'unit': 'tablespoons',
  'name': 'old fashioned rolled oats',
  'comment': 'gluten free if needed'},
 {'qty': '1', 'unit': 'tablespoon', 'name': 'chia seeds', 'comment': ''},
 {'qty': '1/3 ',
  'unit': 'cup',
  'name': 'milk',
  'comment': 'any kind including non dairy'},
 {'qty': '1/3 ',
  'unit': 'cup',
  'name': 'plain or vanilla yogurt',
  'comment': 'any kind including non dairy'},
 {'qty': '1',
  'unit': 'teaspoon',
  'name': 'honey',
  'comment': 'maple syrup or sweetener of your choice to taste optional'},
 {'qty': '1/4 ',
  'unit': 'cup',
  'name': 'blueberries',
  'comment': 'sliced strawberries raspberries or other chopped fruit of your choice'},
 {'qty': '1',
  'unit': 'tablespoon',
  'name': 'raisins',
  'comment': 'or other dried fruit'},
 {'qty': '1',
  'unit': 'tablespoons',
  'name': 'chopped or sliced almonds cashews walnuts',
  'comment': 'or other nut of your choice'}]
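
With ingredients in this shape, searching recipes by ingredient becomes straightforward. The sketch below builds a simple word-to-recipes index; `build_name_index` and `sample_dict` are hypothetical names, and the second recipe is invented sample data, not something from the dataset.

```python
from collections import defaultdict

# Small stand-in for final_dict; the second recipe is made up
sample_dict = {
    'grab-and-go oatmeal chia cups': [
        {'qty': '1', 'unit': 'tablespoon', 'name': 'chia seeds', 'comment': ''},
        {'qty': '1/3 ', 'unit': 'cup', 'name': 'milk',
         'comment': 'any kind including non dairy'},
    ],
    'hypothetical vanilla milkshake': [
        {'qty': '1', 'unit': 'cup', 'name': 'milk', 'comment': ''},
        {'qty': '2', 'unit': 'scoops', 'name': 'vanilla ice cream', 'comment': ''},
    ],
}

def build_name_index(tagged_recipes):
    """Map each word in an ingredient name to the recipe titles that use it."""
    index = defaultdict(set)
    for title, ingredients in tagged_recipes.items():
        for ingredient in ingredients:
            for word in ingredient['name'].split():
                index[word].add(title)
    return index

index = build_name_index(sample_dict)
print(sorted(index['milk']))
print(sorted(index['chia']))
```

Matching on single words is a deliberate simplification: it would not connect `tomato` with `tomatoes`, so some normalization would probably be needed on real data.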

Can create a metric of user to purchased item, like in a movie recommendation system.
This would be an implicit feedback recommendation system (we don't have stars or anything; all we have is the fact that they bought this thing once). We are saying that because they bought it, we have some confidence that they like it.

ALS recommendations with the carts

The sticking point is how we get around the recommendation part. We don't have actual ratings for things in a cart; all we have is bought or not bought (there is no scale). There could be a million different reasons why you buy something that you don't like. There are algorithms for this (extensions of the ALS models; could start with implicit feedback recommendations using ALS). Look for a package that does this quickly. This approach builds a model to make predictions.
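
As a rough illustration of the "bought or not bought" idea, one common formulation of implicit-feedback ALS (Hu, Koren, and Volinsky) turns purchase counts `r` into a binary preference (1 if `r > 0`, else 0) and a confidence `1 + alpha * r`. The sketch below only builds those two tables and does not fit a model; the users, items, function name, and the `alpha` value are all assumptions for illustration.

```python
from collections import Counter

# Hypothetical purchase log: one (user, item) pair per purchase
purchases = [
    ('ana', 'tomato'), ('ana', 'tomato'), ('ana', 'garlic'),
    ('ben', 'eggplant'),
]

def interaction_tables(purchases, alpha=40.0):
    """Build binary preferences and confidence weights from purchase counts."""
    counts = Counter(purchases)
    users = sorted({user for user, _ in purchases})
    items = sorted({item for _, item in purchases})
    preference, confidence = {}, {}
    for user in users:
        for item in items:
            r = counts.get((user, item), 0)
            preference[(user, item)] = 1 if r > 0 else 0
            # Unbought items keep a small baseline confidence of 1.0
            confidence[(user, item)] = 1.0 + alpha * r
    return preference, confidence

preference, confidence = interaction_tables(purchases)
print(preference[('ana', 'tomato')], confidence[('ana', 'tomato')])
print(preference[('ben', 'tomato')], confidence[('ben', 'tomato')])
```

An ALS solver would then factorize the preference table while weighting each cell by its confidence, so repeat purchases count for more than a single one.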

Other approaches: don't use the supervised side of things; just find similarities between carts. You can then find maybe the top three carts that are most similar to your "new cart". We are assuming that these people have similar tastes to you, so the things in their carts might be good suggestions for you as well. This approach uses heuristics to try to figure out what is similar. Maybe I could even ignore the carts if I go this route.
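
A minimal sketch of the cart-similarity idea, assuming Jaccard similarity (shared items divided by all items across both carts) as the heuristic. The notes do not commit to a specific measure, so this choice, the function names, and the sample carts are all assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between two carts treated as sets of items."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def suggest_from_similar_carts(new_cart, past_carts, top_n=3):
    """Suggest items from the top_n most similar past carts."""
    ranked = sorted(past_carts, key=lambda cart: jaccard(new_cart, cart),
                    reverse=True)
    suggestions = []
    for cart in ranked[:top_n]:
        for item in cart:
            # Skip items already in the new cart or already suggested
            if item not in new_cart and item not in suggestions:
                suggestions.append(item)
    return suggestions

# Hypothetical carts
past_carts = [
    ['tomato', 'garlic', 'basil'],
    ['tomato', 'onion'],
    ['milk', 'eggs'],
    ['garlic', 'tomato', 'pasta', 'olive oil'],
]
new_cart = ['tomato', 'garlic']
print(suggest_from_similar_carts(new_cart, past_carts, top_n=2))
```

No model is trained here; the quality of the suggestions depends entirely on how well the similarity heuristic reflects shared taste.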

**content based recommendations**
**implicit feedback recommendations using ALS**

1. The user says they like tomatoes, don't like eggplant, like this.
1. take that 
---
Actually matching the things a person likes to a recipe will be incredibly challenging in and of itself. Deciding on the rules for the matching will take time to think through and code. For example, let's say I like eggplant and garlic. There are so many recipes with eggplant and garlic.
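
One possible starting rule, sketched under assumptions: drop any recipe containing a disliked ingredient, then rank the rest by how many liked ingredients they contain. The function name, the rule itself, and the sample recipes are hypothetical; the input follows the tagged-ingredient dictionary shape produced earlier.

```python
def rank_recipes(tagged_recipes, likes, dislikes):
    """Rank recipe titles by liked-ingredient count, skipping disliked ones."""
    likes, dislikes = set(likes), set(dislikes)
    ranked = []
    for title, ingredients in tagged_recipes.items():
        words = {word for ing in ingredients for word in ing['name'].split()}
        if words & dislikes:
            continue  # any disliked ingredient rules the recipe out
        score = len(words & likes)
        if score:
            ranked.append((score, title))
    # Highest score first, ties broken alphabetically by title
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in ranked]

# Hypothetical tagged recipes
recipes = {
    'roasted eggplant dip': [{'name': 'eggplant'}, {'name': 'garlic'}],
    'garlic bread': [{'name': 'garlic'}, {'name': 'bread'}],
    'tomato eggplant bake': [{'name': 'tomato'}, {'name': 'eggplant'}],
}
print(rank_recipes(recipes, likes=['eggplant', 'garlic'], dislikes=['tomato']))
```

Even this simple rule shows the problem raised above: with a large recipe collection, many recipes will tie on the same score, so additional ranking signals would be needed.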