# Using Machine Learning to Find Meals that Conform to Your Taste 

## Introduction 

According to a recent [survey](https://www.iol.co.za/lifestyle/food-drink/many-cookbooks-yet-same-old-supper-1824825), the average Briton has a rotation of nine different recipes. Part of this lack of diversity is likely the fear of preparing a meal that doesn't conform to your tastes and being forced to throw it out. 

### Project Objective

For this project, we will build a recipe recommendation engine that learns your tastes and customizes your weekly menu within seconds. 

Our aim is to diversify your menu by helping you discover new meals that incorporate flavors that you already love, using items you already buy. 

### Datasets

1. **Instacart Market Basket Analysis Datasets**
    - order
    - aisles: 132 unique store aisles
    - departments: 24 unique departments
    - products: 49.7k unique products. 
    
    
2. **Mariano's Grocery Prices**

3. **Simply Recipes recipe database**

4. **Model Training Datasets**: 
    - nyt-ingredients-snapshot-2015
    - marianos_product_train_complete
    - instacart_product_train

### Summary of Results

blah blah blah blah 

## Data Collection - Web Crawling

In order to start our project we will need to collect three different types of data: 
* First, we will need a dataset of users' food preferences. Because food rating data is difficult to come by, we can instead use point-of-purchase grocery store data and rely on implicit feedback (i.e., assume that customers who bought an item liked it). 

* Second, we need a repository of diverse recipes. Although a number of recipe datasets are available online, none that I found has all the required attributes. Because of this lack of suitable data, the recipes will need to be collected from a website via web scraping. 

* Third, we will need a database of grocery prices for pricing our recipes. Because datasets with prices are few and far between, and quickly become outdated, we will need to collect grocery pricing data ourselves from a store website. 
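
As a minimal sketch of the implicit-feedback idea in the first bullet, the snippet below turns purchase records into per-user item counts and treats any purchased item as a positive signal. The sample records and the `build_implicit_feedback` helper are hypothetical illustrations and are not part of this project's `src` code.

```python
from collections import Counter, defaultdict

# Hypothetical purchase records: (user_id, product_name)
purchases = [
    (1, "kale"),
    (1, "garlic powder"),
    (1, "kale"),
    (2, "egg whites"),
    (2, "kale"),
]


def build_implicit_feedback(records):
    """Count purchases per user and product.

    A count above zero is read as "the user liked this item".
    """
    feedback = defaultdict(Counter)
    for user_id, product in records:
        feedback[user_id][product] += 1
    return feedback


feedback = build_implicit_feedback(purchases)
liked_by_user_1 = sorted(feedback[1])
print(liked_by_user_1)
print(feedback[1]["kale"])
```

Repeat purchases raise the count, which a recommender could use as a rough measure of how strongly a user prefers an item.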

In [1]:
%load_ext autoreload
%autoreload 2

import argparse
import matplotlib.pyplot as plt
import nltk
import numpy as np
import os
import pandas as pd
import pickle
import pycrfsuite
import random
import re
import sys
import warnings
nltk.download('averaged_perceptron_tagger')
warnings.filterwarnings('ignore')

from selenium import webdriver
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import MultiLabelBinarizer

src_dir = os.path.join(os.getcwd(), '..', '..', 'src')
sys.path.append(src_dir)

from d00_utils import utils
from d01_data import clean_data
from d01_data.web_scraping import sr_scraping, marianos_insta_scraping
from d02_features.feature_creation import nyt_ingredients_crf_feature_creation
from d02_features.feature_creation import instacart_prod_crf_feature_creation
from d03_models.crf_model_recipes import crf_model_recipe_tagger
from d03_models.crf_model_baskets import crf_basket_feature_creation, crf_basket_dataset_creation
from d03_models.app_functions import *

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/markishab/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


### Recipe Web Scraping - simplyrecipes.com

After conducting research on a number of sites, I chose simplyrecipes.com to scrape for a number of reasons. 

* The website contained diverse recipes that could appeal to a number of different palates
* Each recipe came pre-tagged with meal and dietary preferences

If you would like to scrape the website yourself please run ```sr_scraping()``` in a cell within this notebook. The full script takes around 1.5 hours to run and all files are saved to the ```data/01_raw/simply_recipes``` folder. I'm reading in the scraped and concatenated dataset for convenience. 

In [41]:
recipes_sr_orig = utils.read_multiple_csv_and_concat('../../data/01_raw/simply_recipes/simply_recipes*')
recipes_sr_orig.drop(columns='Unnamed: 0', inplace=True)

In [42]:
recipes_sr_inter = clean_data.intermediate_clean_recipes_sr(recipes_sr_orig)

At the moment we have 1752 different recipe entries. From a quick glance, some of the entries marked as recipes are actually how-to guides. These will need to come out because they don't provide any ingredients for our model to work with. 
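
One way to drop the how-to guides is to keep only entries that list at least one ingredient. This is a sketch under the assumption that a how-to guide shows up with an empty ingredient list; the sample entries and the `has_ingredients` helper are hypothetical, and the project's own cleaning code may use a different rule.

```python
# Hypothetical scraped entries; the real rows live in recipes_sr_inter
entries = [
    {"title": "Grilled Cheese BLT", "ingredients": ["8 slices sourdough bread"]},
    {"title": "How to Peel Garlic", "ingredients": []},
    {"title": "Pulled Pork Sandwich", "ingredients": ["1 large onion, chopped"]},
]


def has_ingredients(entry):
    """Return True when the entry lists at least one non-blank ingredient."""
    return any(line.strip() for line in entry["ingredients"])


recipes_only = [entry for entry in entries if has_ingredients(entry)]
recipe_titles = [entry["title"] for entry in recipes_only]
print(recipe_titles)
```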

In [43]:
recipes_sr_inter.head(2)

Unnamed: 0,title,prep_time,cook_time,tags,ingredients,recipe_yield,byline,link_food
0,Grilled Cheese BLT,10 minutes,10 minutes,"['Dinner', 'Lunch', 'Sandwich', 'Favorite Summ...","[8 slices sourdough bread, 4 tablespoon unsalt...",4 sandwiches,Aaron Hutcherson,https://www.simplyrecipes.com/recipes/grilled_...
1,Pulled Pork Sandwich,10 minutes,"2 hours, 45 minutes","['Dinner', 'Sandwich', 'Budget', 'Comfort Food...","[For the sauce:, 1 large onion, chopped, 6 gar...",Serves 6 to 8,Elise Bauer,https://www.simplyrecipes.com/recipes/pulled_p...


### Grocery Price Web Scraping - Chicago's Marianos

If you would like to scrape the website yourself, please run ```marianos_insta_scraping()``` in a cell within this notebook. The full script takes around 5 hours to run, and all files are saved to the ```data/01_raw``` folder as ```prod_aile_*```. Once you run the script, Selenium will open a browser window and you'll need to sign in to Instacart with your username and password. After 80 seconds the script will begin scraping the site on its own. 

Let's go ahead and read in the concatenated files from Mariano's. 

In [44]:
grocery_prices_orig = utils.read_multiple_csv_and_concat('../../data/01_raw/grocery_prices_marianos/prod_aile*')

In [45]:
grocery_prices_orig.head(2)

Unnamed: 0,product,unit_price,item_size,prod_aile
0,"Halls Defense Dietary Supplement Drops, Assort...",$1.79,"<li class=""item-card"" data-radium=""true""><div ...","Cold, Flu & Allergy"
1,Halls Suppressant/Oral Anesthetic Halls Relief...,$1.79,"<li class=""item-card"" data-radium=""true""><div ...","Cold, Flu & Allergy"


In [46]:
grocery_prices_inter = clean_data.intermediate_clean_marianos_prices(grocery_prices_orig)

In [47]:
grocery_prices_inter.head(2)

Unnamed: 0,product,main_price,prod_aile,price_per_lb,measure_words_main_price,item_weight_count_vol,date_collected,store,location
0,"Halls Defense Dietary Supplement Drops, Assort...",$1.79,"Cold, Flu & Allergy",,,30 count,2019-08-28,Marianos,60615
1,Halls Suppressant/Oral Anesthetic Halls Relief...,$1.79,"Cold, Flu & Allergy",,,30 count,2019-08-28,Marianos,60615
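
The cleaned table keeps prices as text (`$1.79`) and sizes as phrases (`30 count`). To price a recipe, these eventually need to become numbers. The sketch below shows one way to parse both; `parse_price` and `parse_size` are hypothetical helpers and are not functions from `clean_data`.

```python
import re


def parse_price(price_text):
    """Convert a price string such as '$1.79' to a float, or None if absent."""
    match = re.search(r"\d+(?:\.\d+)?", price_text.replace(",", ""))
    return float(match.group()) if match else None


def parse_size(size_text):
    """Split a size string such as '30 count' into (30.0, 'count')."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z ]+?)\s*", size_text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2).lower()


print(parse_price("$1.79"))
print(parse_size("30 count"))
```

Returning `None` for text that does not match keeps unparseable rows visible instead of silently turning them into zeros.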


## Exploratory Data Analysis

### Instacart Market Basket Analysis Datasets

Let's read in our datasets and merge them so that we can see what each user purchased. 

In [48]:
aisles = pd.read_csv('../../data/01_raw/instacart_2017_05_01/aisles.csv')
departments = pd.read_csv('../../data/01_raw/instacart_2017_05_01/departments.csv')
order = pd.read_csv('../../data/01_raw/instacart_2017_05_01/orders.csv')
order_products__prior = pd.read_csv('../../data/01_raw/instacart_2017_05_01/order_products__prior.csv')
products = pd.read_csv('../../data/01_raw/instacart_2017_05_01/products.csv')

In [49]:
instacart_baskets = clean_data.combine_instacart_kaggle_datasets(aisles, departments, order, 
                                                                 order_products__prior, products)
instacart_baskets.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,product_name,aisle_id,department_id,aisle,department
0,2,33120,1,1,202279,prior,3,5,9,8.0,Organic Egg Whites,86,16,eggs,dairy eggs
1,2,28985,2,1,202279,prior,3,5,9,8.0,Michigan Organic Kale,83,4,fresh vegetables,produce
2,2,9327,3,0,202279,prior,3,5,9,8.0,Garlic Powder,104,13,spices seasonings,pantry
3,2,45918,4,1,202279,prior,3,5,9,8.0,Coconut Butter,19,13,oils vinegars,pantry
4,2,30035,5,0,202279,prior,3,5,9,8.0,Natural Sweetener,17,13,baking ingredients,pantry


In [50]:
instacart_baskets.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32434489 entries, 0 to 32434488
Data columns (total 15 columns):
order_id                  int64
product_id                int64
add_to_cart_order         int64
reordered                 int64
user_id                   int64
eval_set                  object
order_number              int64
order_dow                 int64
order_hour_of_day         int64
days_since_prior_order    float64
product_name              object
aisle_id                  int64
department_id             int64
aisle                     object
department                object
dtypes: float64(1), int64(10), object(4)
memory usage: 3.9+ GB


The highest number of baskets by customer is 99 (this means 99 different orders for each of the 5 customers below). 

In [51]:
pd.DataFrame(instacart_baskets.groupby('user_id')['order_id']\
             .nunique()).sort_values('order_id', ascending=False)\
             .head(5)

user_id,order_id
152340,99
185641,99
185524,99
81678,99
70922,99


### Simply Recipes' Recipe Dataset

In [52]:
recipes_sr_inter.head(2)

Unnamed: 0,title,prep_time,cook_time,tags,ingredients,recipe_yield,byline,link_food
0,Grilled Cheese BLT,10 minutes,10 minutes,"['Dinner', 'Lunch', 'Sandwich', 'Favorite Summ...","[8 slices sourdough bread, 4 tablespoon unsalt...",4 sandwiches,Aaron Hutcherson,https://www.simplyrecipes.com/recipes/grilled_...
1,Pulled Pork Sandwich,10 minutes,"2 hours, 45 minutes","['Dinner', 'Sandwich', 'Budget', 'Comfort Food...","[For the sauce:, 1 large onion, chopped, 6 gar...",Serves 6 to 8,Elise Bauer,https://www.simplyrecipes.com/recipes/pulled_p...


In [53]:
recipes_sr_inter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1752 entries, 0 to 1751
Data columns (total 8 columns):
title           1738 non-null object
prep_time       1583 non-null object
cook_time       1408 non-null object
tags            1752 non-null object
ingredients     1752 non-null object
recipe_yield    1242 non-null object
byline          1752 non-null object
link_food       1752 non-null object
dtypes: object(8)
memory usage: 109.6+ KB


In [54]:
print('Number of unique recipes: ', len(recipes_sr_inter))

Number of unique recipes:  1752


## Natural Language Processing 

### Simply Recipes

In order to sort recipes by ingredient (and ultimately price recipes), we first need to tag each recipe with its list of ingredients. In most recipes, ingredients are embedded in longer phrases (i.e., a recipe will call for "8 slices sourdough bread" and not just "sourdough bread"). Because of this, we will need to build a model that can identify the quantity (8), unit (slices), item (sourdough bread), and any additional comments.

After researching a number of natural language processing models, I settled on a conditional random field (CRF) model. CRFs are used for pattern recognition in situations that call for structured prediction, where the label of one token depends on its neighbors. Because recipe ingredients usually contain the same elements in a similar order, a CRF is a good match for our purposes. 
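
To make the idea concrete, a CRF tagger is usually fed a list of features per token that describe the word and its neighbors. The sketch below is a hypothetical illustration of such features; it is not the implementation of `nyt_ingredients_crf_feature_creation`, and the feature names are assumptions.

```python
def token_features(tokens, i):
    """Describe token i using the word itself and its neighbors."""
    word = tokens[i]
    features = [
        "bias",
        "word.lower=" + word.lower(),
        "word.isdigit=%s" % word.isdigit(),
        "word.istitle=%s" % word.istitle(),
    ]
    if i > 0:
        features.append("prev.lower=" + tokens[i - 1].lower())
    else:
        features.append("BOS")  # beginning of sequence
    if i < len(tokens) - 1:
        features.append("next.lower=" + tokens[i + 1].lower())
    else:
        features.append("EOS")  # end of sequence
    return features


tokens = "8 slices sourdough bread".split()
sequence = [token_features(tokens, i) for i in range(len(tokens))]
print(sequence[0])
```

The neighbor features are what let the model learn patterns such as "a number at the start of the line is usually a quantity, and the word after it is often a unit".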

#### Step 1: Construct Training Data

In an article from the New York Times titled "[Extracting Structured Data From Recipes Using Conditional Random Fields](https://open.blogs.nytimes.com/2015/04/09/extracting-structured-data-from-recipes-using-conditional-random-fields/)", the author details the process the Times used to tag the recipes for its recipe website. The Times generously shared its code and training dataset on its GitHub page, so I had access to a training set and a clear example of how the code works. 

#### Step 2: Feature Creation 

In [55]:
nyt_ing = pd.read_csv('../../data/01_raw/nyt-ingredients-snapshot-2015.csv')
nyt_ing.drop(columns=['index'], inplace=True)
print('Number of Handlabeled Ingredients: ', len(nyt_ing))
nyt_ing.head()

Number of Handlabeled Ingredients:  179207


Unnamed: 0,input,name,qty,range_end,unit,comment
0,1 1/4 cups cooked and pureed fresh butternut s...,butternut squash,1.25,0.0,cup,"cooked and pureed fresh, or 1 10-ounce package..."
1,1 cup peeled and cooked fresh chestnuts (about...,chestnuts,1.0,0.0,cup,"peeled and cooked fresh (about 20), or 1 cup c..."
2,"1 medium-size onion, peeled and chopped",onion,1.0,0.0,,"medium-size, peeled and chopped"
3,"2 stalks celery, chopped coarse",celery,2.0,0.0,stalk,chopped coarse
4,1 1/2 tablespoons vegetable oil,vegetable oil,1.5,0.0,tablespoon,


The NYT training dataset contains over 170K hand-labeled ingredients. Let's put the dataset in the correct format for feeding into the model. 
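
In the table above, the `qty` column stores decimals (1.25) while the raw `input` text uses mixed fractions ("1 1/4"), so matching labels to tokens likely needs some normalization of quantities. The sketch below shows one possible approach; `parse_quantity` is a hypothetical helper and is not taken from the project's feature-creation code.

```python
from fractions import Fraction


def parse_quantity(text):
    """Sum whitespace-separated whole numbers and fractions, e.g. '1 1/4' -> 1.25."""
    total = Fraction(0)
    for part in text.split():
        total += Fraction(part)  # accepts '1', '1/4', and '0.5'
    return float(total)


print(parse_quantity("1 1/4"))
print(parse_quantity("1 1/2"))
```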

In [56]:
nyt_ing.fillna("missing", inplace=True)

In [57]:
X, y = nyt_ingredients_crf_feature_creation(nyt_ing)

In [58]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#### Model Building & Testing

In [59]:
trainer = pycrfsuite.Trainer(verbose=True)

# Submit training data to the trainer
for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

# Set the parameters of the model
trainer.set_params({
    # coefficient for L1 penalty
    'c1': 0.1,

    # coefficient for L2 penalty
    'c2': 0.01,  

    # maximum number of iterations
    'max_iterations': 200,

    # whether to include transitions that
    # are possible, but not observed
    'feature.possible_transitions': True
})

# Provide a file name as a parameter to the train function, such that
# the model will be saved to the file when training is finished
trainer.train('../../data/04_models/crf_nyt_initial_model.model')
# let's read back in our model 
tagger = pycrfsuite.Tagger()
tagger.open('../../data/04_models/crf_nyt_initial_model.model')

Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 40539
Seconds required: 0.971

L-BFGS optimization
c1: 0.100000
c2: 0.010000
num_memories: 6
max_iterations: 200
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 727832.498245
Feature norm: 1.000000
Error norm: 293479.302848
Active features: 39832
Line search trials: 1
Line search step: 0.000003
Seconds required for this iteration: 0.365

***** Iteration #2 *****
Loss: 247662.046524
Feature norm: 4.931286
Error norm: 153833.049896
Active features: 38518
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.195

***** Iteration #3 *****
Loss: 227418.143738
Feature norm: 5.429338
Error norm: 185553.296887
Active features: 39392
Line search trials: 2
Line search step: 0.500000
Seconds requir

***** Iteration #41 *****
Loss: 28336.856307
Feature norm: 74.472494
Error norm: 4248.638433
Active features: 25794
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.188

***** Iteration #42 *****
Loss: 27642.056735
Feature norm: 79.398770
Error norm: 4648.555584
Active features: 25580
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.191

***** Iteration #43 *****
Loss: 27092.892619
Feature norm: 83.319787
Error norm: 4120.842306
Active features: 25586
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.194

***** Iteration #44 *****
Loss: 26588.099542
Feature norm: 87.382243
Error norm: 3745.735763
Active features: 25084
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.192

***** Iteration #45 *****
Loss: 26088.717715
Feature norm: 91.630090
Error norm: 3175.451798
Active features: 25135
Line search trials: 1
Line search step: 1.000000

***** Iteration #82 *****
Loss: 21479.445174
Feature norm: 160.598182
Error norm: 820.421917
Active features: 20614
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.189

***** Iteration #83 *****
Loss: 21448.223548
Feature norm: 160.935211
Error norm: 793.100814
Active features: 20527
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.191

***** Iteration #84 *****
Loss: 21415.023653
Feature norm: 161.519153
Error norm: 2095.963609
Active features: 20398
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.193

***** Iteration #85 *****
Loss: 21392.062806
Feature norm: 161.829850
Error norm: 1061.178165
Active features: 20426
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.196

***** Iteration #86 *****
Loss: 21372.339479
Feature norm: 162.084178
Error norm: 913.213500
Active features: 20370
Line search trials: 1
Line search step: 1.0000

***** Iteration #121 *****
Loss: 20919.933971
Feature norm: 171.273264
Error norm: 1172.835401
Active features: 19688
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.194

***** Iteration #122 *****
Loss: 20915.105958
Feature norm: 171.437508
Error norm: 1607.951828
Active features: 19696
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.196

***** Iteration #123 *****
Loss: 20908.478599
Feature norm: 171.564332
Error norm: 984.979370
Active features: 19702
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.195

***** Iteration #124 *****
Loss: 20903.723885
Feature norm: 171.716390
Error norm: 1480.201306
Active features: 19698
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.192

***** Iteration #125 *****
Loss: 20897.608486
Feature norm: 171.833430
Error norm: 972.739716
Active features: 19694
Line search trials: 1
Line search step: 

***** Iteration #162 *****
Loss: 20728.153672
Feature norm: 175.911019
Error norm: 1511.653457
Active features: 19476
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.187

***** Iteration #163 *****
Loss: 20722.175304
Feature norm: 175.992005
Error norm: 712.950486
Active features: 19491
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.191

***** Iteration #164 *****
Loss: 20719.043624
Feature norm: 176.075610
Error norm: 1213.349691
Active features: 19479
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.195

***** Iteration #165 *****
Loss: 20713.889038
Feature norm: 176.161298
Error norm: 932.549457
Active features: 19486
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.192

***** Iteration #166 *****
Loss: 20713.032999
Feature norm: 176.258410
Error norm: 1740.834822
Active features: 19457
Line search trials: 1
Line search step: 

<contextlib.closing at 0x16153ca10>

Let's test our model. 

In [60]:
# Note: the kernel sometimes dies when tagging with the tagger
y_pred = [tagger.tag(xseq) for xseq in X_test]

In [61]:
# Fit the binarizer once on both sets so the label columns line up
mlb = MultiLabelBinarizer()
mlb.fit(list(y_test) + list(y_pred))
print(classification_report(y_true=mlb.transform(y_test), y_pred=mlb.transform(y_pred)))

              precision    recall  f1-score   support

           0       0.95      0.98      0.97     19746
           1       1.00      1.00      1.00     35762
           2       1.00      1.00      1.00     35842
           3       1.00      1.00      1.00     24612

   micro avg       0.99      1.00      0.99    115962
   macro avg       0.99      0.99      0.99    115962
weighted avg       0.99      1.00      0.99    115962
 samples avg       0.99      1.00      0.99    115962



Every label scores at least 0.95 on precision and recall, so the model is close to 100% accurate. That's good enough. Let's train the model on all of our data and move on to using it for our ingredients.
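
One caveat: `MultiLabelBinarizer` turns each sequence into a row of label-presence indicators, so the report above measures whether each label appears in an ingredient line rather than whether each token is tagged correctly. A token-level check could look like the sketch below. The `per_label_scores` helper and the label names in the demo data are hypothetical.

```python
from collections import Counter


def per_label_scores(y_true, y_pred):
    """Compute token-level precision and recall for each label."""
    true_pos, pred_count, true_count = Counter(), Counter(), Counter()
    for true_seq, pred_seq in zip(y_true, y_pred):
        for true_label, pred_label in zip(true_seq, pred_seq):
            true_count[true_label] += 1
            pred_count[pred_label] += 1
            if true_label == pred_label:
                true_pos[true_label] += 1
    scores = {}
    for label in sorted(set(true_count) | set(pred_count)):
        precision = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        recall = true_pos[label] / true_count[label] if true_count[label] else 0.0
        scores[label] = (precision, recall)
    return scores


# Hypothetical label sequences for two ingredient lines
y_true_demo = [["QTY", "UNIT", "NAME", "NAME"], ["QTY", "NAME"]]
y_pred_demo = [["QTY", "NAME", "NAME", "NAME"], ["QTY", "NAME"]]
scores = per_label_scores(y_true_demo, y_pred_demo)
print(scores)
```

In the notebook, the same function could be called with `y_test` and `y_pred` to see how often individual tokens are mislabeled.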

In [62]:
trainer = pycrfsuite.Trainer(verbose=True)

# Submit training data to the trainer
for xseq, yseq in zip(X, y):
    trainer.append(xseq, yseq)

# Set the parameters of the model
trainer.set_params({
    # coefficient for L1 penalty
    'c1': 0.1,

    # coefficient for L2 penalty
    'c2': 0.01,  

    # maximum number of iterations
    'max_iterations': 200,

    # whether to include transitions that
    # are possible, but not observed
    'feature.possible_transitions': True
})

# Provide a file name as a parameter to the train function, such that
# the model will be saved to the file when training is finished
trainer.train('../../data/0/crf_ing_final.model')

Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 44319
Seconds required: 1.128

L-BFGS optimization
c1: 0.100000
c2: 0.010000
num_memories: 6
max_iterations: 200
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 910350.519284
Feature norm: 1.000000
Error norm: 367077.603587
Active features: 43524
Line search trials: 1
Line search step: 0.000002
Seconds required for this iteration: 0.474

***** Iteration #2 *****
Loss: 309213.712708
Feature norm: 4.927604
Error norm: 190852.363998
Active features: 42108
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.244

***** Iteration #3 *****
Loss: 284765.219526
Feature norm: 5.423585
Error norm: 233315.887163
Active features: 43052
Line search trials: 2
Line search step: 0.500000
Seconds requir

***** Iteration #44 *****
Loss: 33696.866447
Feature norm: 86.568089
Error norm: 5311.219949
Active features: 27563
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.237

***** Iteration #45 *****
Loss: 33022.421846
Feature norm: 91.902957
Error norm: 4807.225503
Active features: 27499
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.239

***** Iteration #46 *****
Loss: 32397.253091
Feature norm: 97.852255
Error norm: 5302.330110
Active features: 27256
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.236

***** Iteration #47 *****
Loss: 31924.279201
Feature norm: 102.910935
Error norm: 5344.729790
Active features: 27358
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.252

***** Iteration #48 *****
Loss: 31474.246729
Feature norm: 107.892744
Error norm: 5176.793471
Active features: 27270
Line search trials: 1
Line search step: 1.0000

***** Iteration #85 *****
Loss: 26934.363982
Feature norm: 170.720777
Error norm: 1898.610946
Active features: 22632
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.242

***** Iteration #86 *****
Loss: 26905.022073
Feature norm: 170.997434
Error norm: 1998.979407
Active features: 22561
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.242

***** Iteration #87 *****
Loss: 26880.511630
Feature norm: 171.297767
Error norm: 1999.488149
Active features: 22525
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.243

***** Iteration #88 *****
Loss: 26854.980477
Feature norm: 171.546808
Error norm: 2319.251118
Active features: 22471
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.242

***** Iteration #89 *****
Loss: 26833.547593
Feature norm: 171.818551
Error norm: 1810.955058
Active features: 22463
Line search trials: 1
Line search step: 1.0

***** Iteration #124 *****
Loss: 26345.891858
Feature norm: 180.894527
Error norm: 2064.612072
Active features: 21801
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.245

***** Iteration #125 *****
Loss: 26338.292892
Feature norm: 181.078550
Error norm: 1485.477988
Active features: 21782
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.241

***** Iteration #126 *****
Loss: 26333.600140
Feature norm: 181.266021
Error norm: 2491.642370
Active features: 21747
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.235

***** Iteration #127 *****
Loss: 26323.217669
Feature norm: 181.451755
Error norm: 1216.229937
Active features: 21753
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.240

***** Iteration #128 *****
Loss: 26317.668619
Feature norm: 181.619784
Error norm: 1997.383272
Active features: 21745
Line search trials: 1
Line search step

***** Iteration #164 *****
Loss: 26108.542358
Feature norm: 186.510247
Error norm: 1907.014617
Active features: 21701
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.236

***** Iteration #165 *****
Loss: 26100.502355
Feature norm: 186.604953
Error norm: 1038.841155
Active features: 21713
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.241

***** Iteration #166 *****
Loss: 26098.642334
Feature norm: 186.704643
Error norm: 1806.807095
Active features: 21705
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.236

***** Iteration #167 *****
Loss: 26091.887207
Feature norm: 186.796958
Error norm: 1130.279893
Active features: 21706
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.242

***** Iteration #168 *****
Loss: 26089.621419
Feature norm: 186.900487
Error norm: 1669.063942
Active features: 21715
Line search trials: 1
Line search step

#### Let's use our model to tag our Simply Recipes dataset 

In [63]:
recipe_ing_dict, recipe_links_dict, recipe_tags_dict = crf_model_recipe_tagger(recipes_sr_inter)

In [64]:
recipe_ing_dict

{'grilled cheese blt': [{'qty': '8',
   'unit': '',
   'name': 'slices sourdough bread',
   'comment': ''},
  {'qty': '4',
   'unit': 'tablespoon',
   'name': 'unsalted butter',
   'comment': 'at room temperature'},
  {'qty': '8',
   'unit': 'ounces',
   'name': '2 cups shredded cheddar cheese',
   'comment': ''},
  {'qty': '2',
   'unit': '',
   'name': 'slicing tomatoes',
   'comment': 'such as beefsteak Brandywine or Cherokee purple sliced 1 4 inch thick'},
  {'qty': '8', 'unit': 'to 12', 'name': 'slices', 'comment': ''},
  {'qty': 'cooked', 'unit': '', 'name': 'bacon', 'comment': ''},
  {'qty': '12',
   'unit': 'leaves',
   'name': 'butterhead or other crispy lettuce',
   'comment': ''},
  {'qty': 'Kosher',
   'unit': '',
   'name': 'salt and black pepper',
   'comment': ''}],
 'pulled pork sandwich': [{'qty': '1',
   'unit': '',
   'name': 'large onion',
   'comment': 'chopped'},
  {'qty': '6', 'unit': '', 'name': 'garlic cloves', 'comment': 'peeled'},
  {'qty': '1',
   'unit': ''

We now have a dictionary of recipes with labeled ingredients, a dictionary of links, and a dictionary of tags (breakfast, lunch, dinner, vegetarian, etc.). The labels are not perfect: in the output above, "cooked" and "Kosher" are tagged as quantities. Let's now build and use a model for our Instacart products. 
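
Since the goal is to sort recipes by ingredient, one natural next step is an index from ingredient name to the recipes that use it. The sketch below assumes a dictionary shaped like the `recipe_ing_dict` output above; `index_by_ingredient` is a hypothetical helper, not a function from this project.

```python
from collections import defaultdict

# Small sample shaped like the recipe_ing_dict output above
sample_recipe_ing = {
    "grilled cheese blt": [
        {"qty": "4", "unit": "tablespoon", "name": "unsalted butter", "comment": ""},
        {"qty": "cooked", "unit": "", "name": "bacon", "comment": ""},
    ],
    "pulled pork sandwich": [
        {"qty": "1", "unit": "", "name": "large onion", "comment": "chopped"},
        {"qty": "6", "unit": "", "name": "garlic cloves", "comment": "peeled"},
    ],
}


def index_by_ingredient(recipe_ing):
    """Map each ingredient name to the sorted list of recipes that use it."""
    index = defaultdict(set)
    for recipe, ingredients in recipe_ing.items():
        for ingredient in ingredients:
            name = ingredient["name"].strip().lower()
            if name:
                index[name].add(recipe)
    return {name: sorted(recipes) for name, recipes in index.items()}


ingredient_index = index_by_ingredient(sample_recipe_ing)
print(ingredient_index["bacon"])
```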

### Instacart Products

#### Step 1: Construct Training Data

I hand-labeled over 1,000 rows of the combined Instacart dataset in order to make a training dataset. This dataset will be available in the sample data folder. 

In [26]:
instacart_prod_train = pd.read_csv('../../data/01_raw/instacart_product_train.csv')

In [27]:
instacart_prod_train.head()

Unnamed: 0,products,pre_description,food,post_description
0,Organic Egg Whites,Organic,Egg Whites,
1,Michigan Organic Kale,Michigan Organic,Kale,
2,Garlic Powder,,Garlic Powder,
3,Coconut Butter,,Coconut Butter,
4,Natural Sweetener,,natural sweetener,
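
Each hand-labeled row splits a product name into a pre-description, a food, and a post-description. Before training, those columns have to become one label per token. The sketch below shows one way to do that; the `label_product_tokens` helper and the label names (`PRE`, `FOOD`, `POST`, `OTHER`) are assumptions and may differ from what `instacart_prod_crf_feature_creation` actually does. Missing cells are assumed to be empty strings.

```python
def label_product_tokens(product, pre_description, food, post_description):
    """Assign PRE, FOOD, POST, or OTHER to each token of a product name."""
    pre = set(pre_description.lower().split())
    core = set(food.lower().split())
    post = set(post_description.lower().split())
    labels = []
    for token in product.split():
        word = token.lower()
        if word in core:
            labels.append("FOOD")
        elif word in pre:
            labels.append("PRE")
        elif word in post:
            labels.append("POST")
        else:
            labels.append("OTHER")
    return labels


kale_labels = label_product_tokens("Michigan Organic Kale", "Michigan Organic", "Kale", "")
print(kale_labels)
```

Comparing in lowercase means rows such as "Natural Sweetener" labeled as "natural sweetener" still line up with their tokens.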


In [28]:
X, y = instacart_prod_crf_feature_creation(instacart_prod_train)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [29]:
trainer = pycrfsuite.Trainer(verbose=True)

# Submit training data to the trainer
for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

# Set the parameters of the model
trainer.set_params({
    # coefficient for L1 penalty
    'c1': 0.1,

    # coefficient for L2 penalty
    'c2': 0.01,  

    # maximum number of iterations
    'max_iterations': 200,

    # whether to include transitions that
    # are possible, but not observed
    'feature.possible_transitions': True
})

# Provide a file name as a parameter to the train function, such that
# the model will be saved to the file when training is finished
trainer.train('../../data/04_models/crf_ingredients_initial.model')
tagger = pycrfsuite.Tagger()
tagger.open('../../data/04_models/crf_ingredients_initial.model')

Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 4135
Seconds required: 0.014

L-BFGS optimization
c1: 0.100000
c2: 0.010000
num_memories: 6
max_iterations: 200
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 2180.158655
Feature norm: 1.000000
Error norm: 871.559661
Active features: 3964
Line search trials: 1
Line search step: 0.000524
Seconds required for this iteration: 0.002

***** Iteration #2 *****
Loss: 1866.827822
Feature norm: 1.389581
Error norm: 493.596214
Active features: 3861
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.001

***** Iteration #3 *****
Loss: 1622.939764
Feature norm: 2.065423
Error norm: 383.175015
Active features: 3952
Line search trials: 1
Line search step: 1.000000
Seconds required for this iterati

<contextlib.closing at 0x12b4c02d0>

In [31]:
labels = [tagger.tag(xseq) for xseq in X_test]

In [43]:
# Fit the binarizer once on both sets so the label columns line up
mlb = MultiLabelBinarizer()
mlb.fit(list(y_test) + list(labels))
print(classification_report(y_true=mlb.transform(y_test), y_pred=mlb.transform(labels)))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       194
           1       0.78      0.78      0.78        41
           2       0.97      0.98      0.98       167

   micro avg       0.97      0.97      0.97       402
   macro avg       0.92      0.92      0.92       402
weighted avg       0.97      0.97      0.97       402
 samples avg       0.97      0.97      0.96       402



The micro-averaged F1 score is 0.97, although label 1 is noticeably weaker at 0.78 on only 41 examples. That's good enough for now. Let's train the model on all of our data and move on to using it for our Instacart products. 

In [44]:
trainer = pycrfsuite.Trainer(verbose=True)

# Submit training data to the trainer
for xseq, yseq in zip(X, y):
    trainer.append(xseq, yseq)

# Set the parameters of the model
trainer.set_params({
    # coefficient for L1 penalty
    'c1': 0.1,

    # coefficient for L2 penalty
    'c2': 0.01,  

    # maximum number of iterations
    'max_iterations': 200,

    # whether to include transitions that
    # are possible, but not observed
    'feature.possible_transitions': True
})

# Provide a file name as a parameter to the train function, such that
# the model will be saved to the file when training is finished
trainer.train('../../data/04_models/crf_instacart_products_final.model')

Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 4719
Seconds required: 0.016

L-BFGS optimization
c1: 0.100000
c2: 0.010000
num_memories: 6
max_iterations: 200
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 2746.490798
Feature norm: 1.000000
Error norm: 1104.380669
Active features: 4506
Line search trials: 1
Line search step: 0.000414
Seconds required for this iteration: 0.002

***** Iteration #2 *****
Loss: 2369.340357
Feature norm: 1.378173
Error norm: 646.726482
Active features: 4439
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.001

***** Iteration #3 *****
Loss: 2054.669438
Feature norm: 2.107407
Error norm: 504.188817
Active features: 4489
Line search trials: 1
Line search step: 1.000000
Seconds required for this iterat

***** Iteration #196 *****
Loss: 231.452865
Feature norm: 51.406628
Error norm: 2.346117
Active features: 1326
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.007

***** Iteration #197 *****
Loss: 231.445259
Feature norm: 51.409060
Error norm: 1.848426
Active features: 1326
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.002

***** Iteration #198 *****
Loss: 231.442324
Feature norm: 51.406364
Error norm: 2.271858
Active features: 1324
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.002

***** Iteration #199 *****
Loss: 231.435884
Feature norm: 51.410878
Error norm: 1.951832
Active features: 1322
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.002

***** Iteration #200 *****
Loss: 231.432412
Feature norm: 51.407457
Error norm: 2.257016
Active features: 1318
Line search trials: 1
Line search step: 1.000000
Seconds required for thi
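
The log above shows the number of active features falling from 4,506 in the first iteration to 1,318 in the last, which is the kind of sparsity an L1 penalty tends to produce. As a sketch of the regularized objective that `c1` and `c2` control, written in the general elastic-net form (the exact scaling constants used inside CRFsuite are an assumption here):

```latex
\min_{w} \; -\sum_{i=1}^{N} \log p\left(y^{(i)} \mid x^{(i)}; w\right)
\;+\; c_1 \lVert w \rVert_1 \;+\; c_2 \lVert w \rVert_2^2
```

Here $w$ is the vector of feature weights, $(x^{(i)}, y^{(i)})$ are the training sequences and their tags, and in this run $c_1 = 0.1$ and $c_2 = 0.01$.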

#### Let's use our new model to tag our Instacart Basket Dataset

```python
# If you would like to run the tagging yourself, add this code to a cell.
# Otherwise, read in the final output file that is provided.
X, token_sr, products_list = crf_basket_feature_creation(instacart_baskets)

tagger = pycrfsuite.Tagger()
tagger.open('../../data/04_models/crf_instacart_products_final.model')

labels = [tagger.tag(xseq) for xseq in X]

instacart_baskets_update = crf_basket_dataset_creation(token_sr, labels, products_list, instacart_baskets)
```
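
The helper functions `crf_basket_feature_creation` and `crf_basket_dataset_creation` are not shown in this write-up. As a rough illustration of the kind of post-processing the second one might perform, the sketch below joins the tokens tagged as part of the core product name into a simplified label. The tag name `'NAME'` and the function `simplify_product_name` are hypothetical and are not taken from the project code.

```python
def simplify_product_name(tokens, tags, keep_tag='NAME'):
    """Join the tokens whose predicted tag equals keep_tag into a lowercase name."""
    kept = [tok.lower() for tok, tag in zip(tokens, tags) if tag == keep_tag]
    return ' '.join(kept)

# Example with made-up tags, mirroring 'Organic Egg Whites' -> 'egg whites'
tokens = ['Organic', 'Egg', 'Whites']
tags = ['O', 'NAME', 'NAME']
simplified = simplify_product_name(tokens, tags)
print(simplified)
```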

In [2]:
instacart_baskets_update = pd.read_csv('../../data/05_model_output/baskets_newprodlist_2.csv')

In [3]:
filename = '../../data/05_model_output/baskets_newprodlist_2'
outfile = open(filename,'wb')

In [4]:
pickle.dump(instacart_baskets_update, outfile)
outfile.close()

In [24]:
instacart_baskets_update

Unnamed: 0,order_id,product_name,user_id,all_ones,new_prod_list
0,2,Organic Egg Whites,202279,1,egg whites
1,2,Michigan Organic Kale,202279,1,kale
2,2,Garlic Powder,202279,1,garlic powder
3,2,Coconut Butter,202279,1,coconut butter
4,2,Natural Sweetener,202279,1,natural sweetener
...,...,...,...,...,...
24890358,3421082,Raspberries,175185,1,raspberries
24890359,3421082,Original Whipped Cream,175185,1,whipped cream
24890360,3421082,Special K Red Berries Cereal,175185,1,cereal
24890361,3421083,All Natural French Toast Sticks,25247,1,french toast


## Recommendation System Construction

In [25]:
# A cursory look at the dataset shows a number of entries tagged as food that are not food.
# Let's remove them so that they don't distort our results.
non_food_labels = ['1', '100', '11', '118', '2', '24', '3', '3 cheese', '30', '328', '4', '5', '50', '6',
                   '6 cheese', '60', '7', '70', '8', '85', '9', '95', '97', '98', 'a', 'a garlic butter sauce',
                   'nan']

# Comparing against np.nan with != is always True, so missing values are dropped with notna() instead
mask = (~instacart_baskets_update['new_prod_list'].isin(non_food_labels)
        & instacart_baskets_update['new_prod_list'].notna())

instacart_baskets_filtered = instacart_baskets_update[mask]
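
A detail worth keeping in mind for filters like this one: NaN never compares equal to anything, including itself, so a test such as `column != np.nan` is true for every row and does not remove missing values. A minimal sketch using only the standard library (for a DataFrame column, pandas' `notna()` and `isna()` are the usual tools):

```python
import math

nan = float('nan')
values = ['kale', nan, 'garlic powder']

# Inequality against NaN is always True, even for NaN itself, so nothing is dropped
kept_by_comparison = [v for v in values if v != nan]

# An explicit NaN check is needed to drop missing entries
def is_missing(value):
    return isinstance(value, float) and math.isnan(value)

kept_by_check = [v for v in values if not is_missing(v)]
print(len(kept_by_comparison), len(kept_by_check))
```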

The new product list produced by the CRF model reduces the number of unique products from roughly 24k to just over 4k.

In [26]:
print('Number of Products After Running Names through CRF Model: ', instacart_baskets_filtered.new_prod_list.nunique())
print('Number of products in the original list: ', instacart_baskets_filtered.product_name.nunique())
print('Number of unique users: ', instacart_baskets_filtered.user_id.nunique())

Number of Products After Running Names through CRF Model:  4060
Number of products in the original list:  24434
Number of unique users:  204454


We have over 200k unique users. Since this is more than my computer can handle, I am going to take a random subsample of 100k users and go from there. 

In [27]:
instacart_users_lst = list(instacart_baskets_filtered.user_id.unique())
len(instacart_users_lst)

204454

In [29]:
random_usrids_100k = random.sample(instacart_users_lst, 100000)
mask = instacart_baskets_filtered['user_id'].isin(random_usrids_100k)
baskets_100k = instacart_baskets_filtered.loc[mask]
print('Number of User IDs: ', baskets_100k.user_id.nunique())

Number of User IDs:  100000
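
Because `random.sample` draws a different subset on every run, the downstream results will change between runs unless the generator is seeded. Here is a small sketch of a reproducible draw, assuming a fixed seed is acceptable. The seed value and the toy ID list are arbitrary.

```python
import random

user_ids = list(range(1, 1001))  # toy stand-in for instacart_users_lst

# A dedicated, seeded generator makes the subsample repeatable
rng = random.Random(42)
sample_a = rng.sample(user_ids, 100)

# Re-creating the generator with the same seed reproduces the same draw
rng = random.Random(42)
sample_b = rng.sample(user_ids, 100)

print(sample_a == sample_b, len(set(sample_a)))
```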


In [30]:
baskets_100k

Unnamed: 0,order_id,product_name,user_id,all_ones,new_prod_list
0,2,Organic Egg Whites,202279,1,egg whites
1,2,Michigan Organic Kale,202279,1,kale
2,2,Garlic Powder,202279,1,garlic powder
3,2,Coconut Butter,202279,1,coconut butter
4,2,Natural Sweetener,202279,1,natural sweetener
...,...,...,...,...,...
24890356,3421082,Original Spray,175185,1,spray
24890357,3421082,Strawberries,175185,1,strawberries
24890358,3421082,Raspberries,175185,1,raspberries
24890359,3421082,Original Whipped Cream,175185,1,whipped cream


We keep getting an overflow error from `unstack` because the dataset has too many order and product combinations. Let's break up the dataset further into types of products. 

In [32]:
baskets_complete = baskets_100k.drop(columns=['product_name', 'user_id'])
baskets_complete.head()

This is how to get the dataframe into matrix format (one row per order, one column per product):
```python
basket_matrix_usr = baskets_complete.groupby(['order_id', 'new_prod_list'])['all_ones']\
                    .sum().unstack().reset_index().fillna(0)\
                    .set_index('order_id')
```
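
To make the reshaping concrete, here is a dependency-free sketch of what the `groupby(...).sum().unstack().fillna(0)` chain produces: one row per order, one column per product, and a count in each cell. The toy rows are invented for illustration.

```python
from collections import defaultdict

# (order_id, new_prod_list, all_ones) rows, invented for illustration
rows = [
    (2, 'kale', 1),
    (2, 'egg whites', 1),
    (3, 'kale', 1),
    (3, 'kale', 1),
    (3, 'potato', 1),
]

# Sum all_ones per (order, product), like groupby(...).sum()
counts = defaultdict(int)
for order_id, product, one in rows:
    counts[(order_id, product)] += one

# Spread products into columns and fill gaps with 0, like unstack().fillna(0)
products = sorted({product for _, product, _ in rows})
orders = sorted({order_id for order_id, _, _ in rows})
basket_matrix = {
    order_id: [counts.get((order_id, product), 0) for product in products]
    for order_id in orders
}

print(products)
print(basket_matrix)
```

The dense result has one cell for every order-product pair, which is why it grows so quickly as the number of orders increases.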

Run ```similarities_model.py``` (located in the ```src/d02_features``` folder) from the command line in order to get the final similarity matrix. 
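
The script itself is not reproduced in this write-up. As an illustration only, one common way to build an item-item similarity matrix from an order-by-product matrix is cosine similarity between product columns. Whether `similarities_model.py` uses exactly this measure is an assumption. A minimal pure-Python sketch on toy data:

```python
import math

# Toy order-by-product matrix: rows are orders, columns are products
products = ['potato', 'onion', 'milk']
basket_rows = [
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
]

def column(matrix, j):
    """Extract column j of a list-of-lists matrix."""
    return [row[j] for row in matrix]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# similarity[p][q] holds the cosine similarity between products p and q
similarity = {
    p: {q: cosine(column(basket_rows, i), column(basket_rows, j))
        for j, q in enumerate(products)}
    for i, p in enumerate(products)
}

print(similarity['potato'])
```

Each product has a similarity of 1.0 with itself, which matches the first row of the potato lookup later in this section.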

In [35]:
data_matrix = pd.read_csv('../../data/05_model_output/data_matrix_sim.csv')
data_matrix.set_index('Unnamed: 0', inplace=True)

Now let's try out our new model. When we look up `potato`, we get the ten products most strongly associated with it (the first entry is potato itself, with a similarity of 1.0). The list is below. 

In [39]:
print(data_matrix.loc['potato'].nlargest(11))

potato     1.000000
onion      0.120991
milk       0.107052
tomato     0.106734
carrots    0.101938
cheese     0.098204
garlic     0.097276
avocado    0.096195
eggs       0.092582
apple      0.092447
butter     0.092278
Name: potato, dtype: float64
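
Note that `nlargest(11)` returns the query item itself in first place, followed by its ten nearest neighbours. Below is a small sketch of a lookup that drops the query item, written against a plain dictionary of scores copied from the output above. The helper name `top_similar` is hypothetical.

```python
def top_similar(scores, item, n=10):
    """Return the n highest-scoring items, excluding the query item itself."""
    others = [(name, score) for name, score in scores.items() if name != item]
    others.sort(key=lambda pair: pair[1], reverse=True)
    return others[:n]

# A few similarity values copied from the potato row printed above
potato_scores = {
    'potato': 1.000000,
    'onion': 0.120991,
    'milk': 0.107052,
    'tomato': 0.106734,
    'carrots': 0.101938,
}

print(top_similar(potato_scores, 'potato', n=3))
```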


## Application

### Basic Application 

In [85]:
print('Choose your meal by inputting either 1, 2 or 3')
meal_input = input("Breakfast: Input 1 || Lunch: Input 2 || Dinner: Input 3: ")
print('\n')
print('Choose your dietary preferences by inputting either 1 or 2: ')
dietary_preference_input = input("Vegetarian: Input 1 || Omnivore: Input 2: ")
print('Type in 3 foods you already like')
item1 = input("Item 1: ")
item2 = input("Item 2: ")
item3 = input("Item 3: ")
print('\n')
print('Searching for five recipe recommendations based both on your inputs and similar foods.')

# Note: only Breakfast is handled explicitly; both Lunch (2) and Dinner (3) currently map to 'Dinner'
if meal_input == "1":
    meal = 'Breakfast'
else:
    meal = 'Dinner'

# Any answer other than "1" is treated as omnivore (no dietary filter)
if dietary_preference_input == "1":
    dietary_preference = 'Vegetarian'
else:
    dietary_preference = None
shopping_basket = [item1, item2, item3]
recipe_recommendations_app(shopping_basket, recipe_ing_dict, recipe_tags_dict, meal, dietary_preference, recipe_links_dict)


Choose your meal by inputting either 1, 2 or 3
Breakfast: Input 1 || Lunch: Input 2 || Dinner: Input 3: 3


Choose your dietary preferences by inputting either 1 or 2: 
Vegetarian: Input 1 || Omnivore: Input 2: 2
Type in 3 foods you already like
Item 1: onion
Item 2: pepper
Item 3: potato


Searching for five recipe recommendations based both on your inputs and similar foods.

Recipe 1:  persian pomegranate chicken (fesenjan) https://www.simplyrecipes.com/recipes/fesenjan_persian_chicken_stew_with_walnut_and_pomegranate_sauce/

Recipe 2:  albondigas soup (mexican meatball soup) https://www.simplyrecipes.com/recipes/albondigas_soup/

Recipe 3:  teriyaki chicken lettuce wraps https://www.simplyrecipes.com/recipes/teriyaki_chicken_lettuce_wraps/

Recipe 4:  shaved asparagus and potato pizza https://www.simplyrecipes.com/recipes/shaved_asparagus_and_potato_pizza/

Recipe 5:  how to grill the best burgers https://www.simplyrecipes.com/recipes/how_to_grill_the_best_burgers/
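
The function `recipe_recommendations_app` is not shown in this write-up. As a rough sketch of how such a recommender could work, the example below expands the shopping basket with similar items and ranks recipes by how many ingredients they share with the expanded basket. The function `rank_recipes`, its scoring rule, and the toy data are all hypothetical and should not be read as the actual implementation.

```python
def rank_recipes(basket, recipe_ingredients, similar_items, top_n=5):
    """Rank recipes by how many ingredients overlap with the expanded basket."""
    # Expand the basket with items similar to what the user already likes
    expanded = set(basket)
    for item in basket:
        expanded.update(similar_items.get(item, []))

    scored = []
    for name, ingredients in recipe_ingredients.items():
        overlap = len(expanded & set(ingredients))
        if overlap:
            scored.append((overlap, name))

    # Highest overlap first; ties broken alphabetically for determinism
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_n]]

# Invented toy data, not taken from the Simply Recipes dataset
recipe_ingredients = {
    'potato pizza': ['potato', 'cheese', 'onion'],
    'meatball soup': ['onion', 'carrots', 'tomato'],
    'fruit salad': ['apple', 'strawberries'],
}
similar_items = {'potato': ['onion', 'carrots'], 'onion': ['tomato']}

ranking = rank_recipes(['potato', 'onion'], recipe_ingredients, similar_items)
print(ranking)
```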


## Conclusion, Limitations & Next Steps
### Limitations
* Milk, eggs and bread appear in the majority of item similarity searches (these items are purchased so often that they co-occur with almost everything). 
* Item-to-item collaborative filtering recommends items that are similar to an input item (e.g., input basil, and output mint, thyme, fennel) instead of items that go with the item (e.g., pasta, cheese, bacon). 
### Next Steps
* Build a user-user recommendation system in order to surface more ingredients that go together
* Weight common ingredients lower (e.g., milk, eggs and bread)
* Deploy the application via Heroku
* Expand the recipe database and allow more dietary inputs. 
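
As one possible way to weight ubiquitous items such as milk, eggs and bread lower, each product could be scaled by an inverse-document-frequency factor before similarities are computed. This is a sketch of the idea on toy baskets, not part of the current pipeline. The plain `log(N / df)` form used here is one of several common variants.

```python
import math

# Toy baskets, invented for illustration
baskets = [
    {'milk', 'eggs', 'basil'},
    {'milk', 'bread', 'pasta'},
    {'milk', 'eggs', 'pasta'},
    {'milk', 'basil', 'pasta'},
]

def idf_weights(baskets):
    """Weight each product by log(N / number of baskets containing it)."""
    n = len(baskets)
    doc_freq = {}
    for basket in baskets:
        for product in basket:
            doc_freq[product] = doc_freq.get(product, 0) + 1
    return {product: math.log(n / count) for product, count in doc_freq.items()}

weights = idf_weights(baskets)
print(sorted(weights.items(), key=lambda pair: pair[1]))
```

A product that appears in every basket, like milk in this toy data, receives a weight of zero and so stops dominating the similarity scores.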