# Finding Meals that Conform to Your Taste 

## Introduction 

According to a recent [survey](https://www.iol.co.za/lifestyle/food-drink/many-cookbooks-yet-same-old-supper-1824825), the average Briton has a rotation of nine different recipes. Part of this lack of diversity is likely the fear of preparing a meal that doesn't conform to your tastes and being forced to throw it out. 

### Project Objective

For this project, we will create a recipe recommendation engine that learns your tastes and customizes your weekly menu within seconds. 

Our aim is to diversify your menu by helping you discover new meals that incorporate flavors that you already love, using items you already buy. 

### Datasets

1. **Instacart Market Basket Analysis Datasets**
    - orders
    - aisles: 132 unique store aisles
    - departments: 24 unique departments
    - products: 49.7k unique products. 
    
    
2. **Mariano's Grocery Prices**

3. **Simply Recipes recipe database**

4. **Model Training Datasets**: 
    - nyt-ingredients-snapshot-2015
    - marianos_product_train_complete
    - instacart_product_train

### Summary of Results

blah blah blah blah 

## Data Collection - Web Crawling

In order to start our project we will need to collect three different types of data: 
* First, we will need a dataset of different users' food preferences. Because food rating data is difficult to come by, we can instead use point-of-purchase grocery store data and rely on implicit feedback (i.e., assume that customers who bought an item liked it). 

* Second, we need a repository of diverse recipes. Although a number of recipe datasets are available online, none that I found has all the required attributes. Because no suitable dataset is available, the recipes will need to be collected from a website via web scraping. 

* Third, we will need a database of grocery prices for pricing our recipes. Because datasets with prices are few and far between, and quickly outdated, we will need to collect grocery pricing data from a store website ourselves. 
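
The implicit-feedback idea in the first bullet can be sketched as follows. This is a minimal illustration on made-up purchases, not the project's actual pipeline; the frame `purchases` and the function `to_implicit_feedback` are hypothetical names, and the rows are invented.

```python
import pandas as pd

# Toy point-of-purchase log (made-up rows, for illustration only)
purchases = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "product_name": ["Bananas", "Bananas", "Sourdough Bread", "Bananas", "Cheddar"],
})

def to_implicit_feedback(df):
    """Count how often each user bought each product.

    A count above zero is read as a sign that the user likes the item.
    """
    return pd.crosstab(df["user_id"], df["product_name"])

feedback = to_implicit_feedback(purchases)
print(feedback)
```

Each row is a user, each column a product, and each cell a purchase count that stands in for an explicit rating.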

In [60]:
%load_ext autoreload
%autoreload 2

import argparse
import matplotlib.pyplot as plt
import nltk
import numpy as np
import os
import pandas as pd
import pycrfsuite
import re
import sys
import warnings
nltk.download('averaged_perceptron_tagger')
warnings.filterwarnings('ignore')

from selenium import webdriver
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import MultiLabelBinarizer

src_dir = os.path.join(os.getcwd(), '..', '..', 'src')
sys.path.append(src_dir)

from d00_utils import utils
from d01_data import clean_data
from d01_data.web_scraping import sr_scraping, marianos_insta_scraping
from d02_features.feature_creation import nyt_ingredients_crf_feature_creation

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/markishab/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


### Recipe Web Scraping - simplyrecipes.com

After conducting research on a number of sites, I chose simplyrecipes.com to scrape for a number of reasons. 

* The website contained diverse recipes that could appeal to a number of different palates
* Each recipe came pre-tagged with meal and dietary preferences

If you would like to scrape the website yourself, run ```sr_scraping()``` in a cell within this notebook. The full script takes around 1.5 hours to run and saves all files to the ```data/01_raw/simply_recipes``` folder. For convenience, I read in the already scraped and concatenated dataset below. 

In [10]:
recipes_sr_orig = utils.read_multiple_csv_and_concat('../../data/01_raw/simply_recipes/simply_recipes*')
recipes_sr_orig.drop(columns='Unnamed: 0', inplace=True)
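
`utils.read_multiple_csv_and_concat` lives in the project's `src` folder and is not shown in this notebook. A helper of this kind might look like the sketch below, which assumes it expands a glob pattern and concatenates the matching files; the name `read_csvs_matching` is hypothetical.

```python
import glob

import pandas as pd

def read_csvs_matching(pattern):
    """Read every CSV whose path matches `pattern` and stack them into one DataFrame."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"No files match {pattern!r}")
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index gives the combined frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

An `Unnamed: 0` column like the one dropped above typically appears when a DataFrame was written to CSV together with its index.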

In [18]:
recipes_sr_inter = clean_data.intermediate_clean_recipes_sr(recipes_sr_orig)

At the moment we have 1752 different recipe entries. From a quick glance, some of the entries marked as recipes are actually how-to guides. These will need to come out because they don't provide ingredients for our model to work with. 
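
One way to drop the how-to guides is sketched below, assuming they can be recognized by an empty ingredient list. That is an assumption about the scraped data, not something verified here. The toy frame, its rows, and the name `drop_recipes_without_ingredients` are hypothetical.

```python
import pandas as pd

def drop_recipes_without_ingredients(recipes):
    """Keep only rows whose `ingredients` list has at least one entry."""
    has_ingredients = recipes["ingredients"].apply(len) > 0
    return recipes[has_ingredients].reset_index(drop=True)

# Toy data (made-up rows): one real recipe, one how-to guide
toy_recipes = pd.DataFrame({
    "title": ["Grilled Cheese BLT", "How to Chop an Onion"],
    "ingredients": [["8 slices sourdough bread"], []],
})
filtered = drop_recipes_without_ingredients(toy_recipes)
print(filtered)
```

If the column was read back from CSV it holds strings rather than lists, so it would first need parsing, for example with `ast.literal_eval`.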

In [44]:
recipes_sr_inter.head(2)

Unnamed: 0,title,prep_time,cook_time,tags,ingredients,recipe_yield,byline,link_food
0,Grilled Cheese BLT,10 minutes,10 minutes,"['Dinner', 'Lunch', 'Sandwich', 'Favorite Summ...","[8 slices sourdough bread, 4 tablespoon unsalt...",4 sandwiches,Aaron Hutcherson,https://www.simplyrecipes.com/recipes/grilled_...
1,Pulled Pork Sandwich,10 minutes,"2 hours, 45 minutes","['Dinner', 'Sandwich', 'Budget', 'Comfort Food...","[For the sauce:, 1 large onion, chopped, 6 gar...",Serves 6 to 8,Elise Bauer,https://www.simplyrecipes.com/recipes/pulled_p...


### Grocery Price Web Scraping - Mariano's (Chicago)

If you would like to scrape the website yourself, run ```marianos_insta_scraping()``` in a cell within this notebook. The full script takes around 5 hours to run and saves all files to the ```data/01_raw``` folder as ```prod_aile_*```. Once you run the script, Selenium will open a browser window and you'll need to sign in to Instacart with your username and password. After 80 seconds the script will begin scraping the site on its own. 

Let's go ahead and read in the concatenated files from Mariano's. 

In [38]:
grocery_prices_orig = utils.read_multiple_csv_and_concat('../../data/01_raw/grocery_prices_marianos/prod_aile*')

In [45]:
grocery_prices_orig.head(2)

Unnamed: 0,product,main_price,prod_aile,price_per_lb,measure_words_main_price,item_weight_count_vol,date_collected,store,location
0,"Halls Defense Dietary Supplement Drops, Assort...",$1.79,"Cold, Flu & Allergy",,,30 count,2019-08-28,Marianos,60615
1,Halls Suppressant/Oral Anesthetic Halls Relief...,$1.79,"Cold, Flu & Allergy",,,30 count,2019-08-28,Marianos,60615


In [40]:
grocery_prices_inter = clean_data.intermediate_clean_marianos_prices(grocery_prices_orig)

In [46]:
grocery_prices_inter.head(2)

Unnamed: 0,product,main_price,prod_aile,price_per_lb,measure_words_main_price,item_weight_count_vol,date_collected,store,location
0,"Halls Defense Dietary Supplement Drops, Assort...",$1.79,"Cold, Flu & Allergy",,,30 count,2019-08-28,Marianos,60615
1,Halls Suppressant/Oral Anesthetic Halls Relief...,$1.79,"Cold, Flu & Allergy",,,30 count,2019-08-28,Marianos,60615
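
To price recipes later, string columns such as `main_price` and `item_weight_count_vol` have to become numbers. A minimal sketch of that conversion follows, assuming prices always look like `$1.79` and sizes like `30 count`; `parse_price` and `parse_size` are hypothetical helpers, not part of `clean_data`.

```python
import re

def parse_price(text):
    """Turn a price string such as '$1.79' into a float; return None if it does not match."""
    match = re.fullmatch(r"\$(\d+(?:\.\d+)?)", text.strip())
    return float(match.group(1)) if match else None

def parse_size(text):
    """Split a size string such as '30 count' into (quantity, unit); return None if it does not match."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(.+)", text.strip())
    if not match:
        return None
    return float(match.group(1)), match.group(2)

print(parse_price("$1.79"))
print(parse_size("30 count"))
```

Returning `None` for anything unexpected keeps malformed rows visible instead of silently turning them into wrong prices.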


## Exploratory Data Analysis

### Instacart Market Basket Analysis Datasets

Let's read in our datasets and merge them so that we can see what each user purchased. 

In [2]:
aisles = pd.read_csv('../../data/01_raw/instacart_2017_05_01/aisles.csv')
departments = pd.read_csv('../../data/01_raw/instacart_2017_05_01/departments.csv')
order = pd.read_csv('../../data/01_raw/instacart_2017_05_01/orders.csv')
order_products__prior = pd.read_csv('../../data/01_raw/instacart_2017_05_01/order_products__prior.csv')
products = pd.read_csv('../../data/01_raw/instacart_2017_05_01/products.csv')

In [6]:
instacart_baskets = clean_data.combine_instacart_kaggle_datasets(aisles, departments, order, 
                                                                 order_products__prior, products)
instacart_baskets.head()
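
`clean_data.combine_instacart_kaggle_datasets` is defined in the project source and not shown here. Assuming it joins the tables on their shared id columns (`order_id`, `product_id`, `aisle_id`, `department_id`), one way to do that with `DataFrame.merge` is sketched below on toy frames. The function name `combine_baskets` and all sample rows are made up.

```python
import pandas as pd

def combine_baskets(aisles, departments, orders, order_products, products):
    """Attach order, product, aisle and department details to every purchased item."""
    return (
        order_products
        .merge(orders, on="order_id", how="left")
        .merge(products, on="product_id", how="left")
        .merge(aisles, on="aisle_id", how="left")
        .merge(departments, on="department_id", how="left")
    )

# Toy inputs (made-up rows)
toy_aisles = pd.DataFrame({"aisle_id": [24], "aisle": ["fresh fruits"]})
toy_departments = pd.DataFrame({"department_id": [4], "department": ["produce"]})
toy_orders = pd.DataFrame({"order_id": [1], "user_id": [7]})
toy_order_products = pd.DataFrame({"order_id": [1, 1], "product_id": [10, 11]})
toy_products = pd.DataFrame({
    "product_id": [10, 11],
    "product_name": ["Banana", "Strawberries"],
    "aisle_id": [24, 24],
    "department_id": [4, 4],
})

toy_baskets = combine_baskets(toy_aisles, toy_departments, toy_orders,
                              toy_order_products, toy_products)
print(toy_baskets)
```

Starting from the order lines and using left joins keeps one row per purchased item, which matches the goal of seeing what each user bought.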

In [8]:
instacart_baskets.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32434489 entries, 0 to 32434488
Data columns (total 15 columns):
order_id                  int64
product_id                int64
add_to_cart_order         int64
reordered                 int64
user_id                   int64
eval_set                  object
order_number              int64
order_dow                 int64
order_hour_of_day         int64
days_since_prior_order    float64
product_name              object
aisle_id                  int64
department_id             int64
aisle                     object
department                object
dtypes: float64(1), int64(10), object(4)
memory usage: 3.9+ GB


The largest number of baskets for any single customer is 99; each of the five customers below placed 99 different orders. 

In [9]:
pd.DataFrame(instacart_baskets.groupby('user_id')['order_id']\
             .nunique()).sort_values('order_id', ascending=False)\
             .head(5)

Unnamed: 0_level_0,order_id
user_id,Unnamed: 1_level_1
152340,99
185641,99
185524,99
81678,99
70922,99


### Simply Recipes' Recipe Dataset

In [54]:
recipes_sr_inter.head(2)

Unnamed: 0,title,prep_time,cook_time,tags,ingredients,recipe_yield,byline,link_food
0,Grilled Cheese BLT,10 minutes,10 minutes,"['Dinner', 'Lunch', 'Sandwich', 'Favorite Summ...","[8 slices sourdough bread, 4 tablespoon unsalt...",4 sandwiches,Aaron Hutcherson,https://www.simplyrecipes.com/recipes/grilled_...
1,Pulled Pork Sandwich,10 minutes,"2 hours, 45 minutes","['Dinner', 'Sandwich', 'Budget', 'Comfort Food...","[For the sauce:, 1 large onion, chopped, 6 gar...",Serves 6 to 8,Elise Bauer,https://www.simplyrecipes.com/recipes/pulled_p...


In [48]:
recipes_sr_inter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1752 entries, 0 to 1751
Data columns (total 8 columns):
title           1738 non-null object
prep_time       1583 non-null object
cook_time       1408 non-null object
tags            1752 non-null object
ingredients     1752 non-null object
recipe_yield    1242 non-null object
byline          1752 non-null object
link_food       1752 non-null object
dtypes: object(8)
memory usage: 109.6+ KB


In [49]:
print('Number of unique recipes: ', len(recipes_sr_inter))

Number of unique recipes:  1752


### Mariano's Grocery Prices Dataset

## Model Exploration & Feature Construction

### Simply Recipes

In order to sort recipes by ingredient (and ultimately price them), we first need to extract each recipe's list of ingredients. In most recipes, ingredients are listed as full phrases (e.g., a recipe will call for "8 slices sourdough bread" and not just "sourdough bread"). Because of this, we need to build a model that can identify the quantity (8), unit (slices), item (sourdough bread), and any additional comments.
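
To make the target concrete, here is how the example line could be represented once every token carries a label. The label names (`QTY`, `UNIT`, `NAME`) are an assumption chosen for illustration and may differ from the tags the trained model emits; `group_by_label` is a hypothetical helper.

```python
# One ingredient line, split into tokens, with one label per token
tokens = ["8", "slices", "sourdough", "bread"]
labels = ["QTY", "UNIT", "NAME", "NAME"]

def group_by_label(tokens, labels):
    """Collect the tokens that share a label into one space-separated string."""
    fields = {}
    for token, label in zip(tokens, labels):
        fields.setdefault(label, []).append(token)
    return {label: " ".join(words) for label, words in fields.items()}

print(group_by_label(tokens, labels))
```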

After researching a number of natural language processing models, I settled on a conditional random field (CRF) model. CRFs are used for structured prediction, where the label assigned to one token depends on the labels around it. Because recipe ingredients tend to follow the same pattern (quantity, unit, item, comment), a CRF is a good match for our purposes. 
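
For reference, the standard linear-chain formulation defines the probability of a label sequence $y$ given the tokens $x$ as shown below. The notation ($f_k$ for feature functions, $\lambda_k$ for their learned weights, $T$ for the number of tokens) is the common textbook one and is not taken from the project code.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \right)
```

Each feature function can look at the current label, the previous label, and the whole input line, which is how the model can learn that a unit tends to follow a quantity.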

#### Step 1: Construct Training Data

In a New York Times article titled "[Extracting Structured Data From Recipes Using Conditional Random Fields](https://open.blogs.nytimes.com/2015/04/09/extracting-structured-data-from-recipes-using-conditional-random-fields/)", the author details the process the Times used to classify recipes for its recipe website. The Times generously shared its code and training dataset on GitHub, so I had access to a training set as well as a clear example of how the code works. 

#### Step 2: Feature Creation 

In [52]:
nyt_ing = pd.read_csv('../../data/01_raw/nyt-ingredients-snapshot-2015.csv')
nyt_ing.drop(columns=['index'], inplace=True)
print('Number of Handlabeled Ingredients: ', len(nyt_ing))
nyt_ing.head()

Number of Handlabeled Ingredients:  179207


Unnamed: 0,input,name,qty,range_end,unit,comment
0,1 1/4 cups cooked and pureed fresh butternut s...,butternut squash,1.25,0.0,cup,"cooked and pureed fresh, or 1 10-ounce package..."
1,1 cup peeled and cooked fresh chestnuts (about...,chestnuts,1.0,0.0,cup,"peeled and cooked fresh (about 20), or 1 cup c..."
2,"1 medium-size onion, peeled and chopped",onion,1.0,0.0,,"medium-size, peeled and chopped"
3,"2 stalks celery, chopped coarse",celery,2.0,0.0,stalk,chopped coarse
4,1 1/2 tablespoons vegetable oil,vegetable oil,1.5,0.0,tablespoon,


The NYT training dataset contains over 170K hand-labeled ingredients. Let's put the dataset into the format the model expects. 

In [62]:
nyt_ing.fillna("missing", inplace=True)

In [63]:
X, y = nyt_ingredients_crf_feature_creation(nyt_ing)
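
`nyt_ingredients_crf_feature_creation` is part of the project source and is not reproduced here. As a rough idea of what CRF features for one ingredient line can look like, the sketch below builds one feature dictionary per token, a format `pycrfsuite` accepts. The specific features and the names `token_features` and `line_features` are assumptions, not the author's feature set.

```python
def token_features(tokens, i):
    """Describe token i using itself and its immediate neighbours."""
    token = tokens[i]
    features = {
        "bias": 1.0,
        "lower": token.lower(),
        "is_digit": token.isdigit(),
        "has_slash": "/" in token,  # catches fractions such as 1/2
    }
    if i > 0:
        features["prev_lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True      # beginning of the line
    if i < len(tokens) - 1:
        features["next_lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True      # end of the line
    return features

def line_features(line):
    """Split an ingredient line on whitespace and featurize every token."""
    tokens = line.split()
    return [token_features(tokens, i) for i in range(len(tokens))]

xseq = line_features("8 slices sourdough bread")
print(xseq[0])
```

Including the neighbouring tokens gives the model context, so that "slices" right after a number looks like a unit rather than part of the item name.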

In [66]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#### Model Building & Testing

In [None]:
trainer = pycrfsuite.Trainer(verbose=True)

# Submit training data to the trainer
for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

# Set the parameters of the model
trainer.set_params({
    # coefficient for L1 penalty
    'c1': 0.1,

    # coefficient for L2 penalty
    'c2': 0.01,  

    # maximum number of iterations
    'max_iterations': 200,

    # whether to include transitions that
    # are possible, but not observed
    'feature.possible_transitions': True
})

# Provide a file name as a parameter to the train function, such that
# the model will be saved to the file when training is finished
trainer.train('../../data/04_models/crf_nyt_initial_model.model')
# let's read back in our model 
tagger = pycrfsuite.Tagger()
tagger.open('../../data/04_models/crf_nyt_initial_model.model')

Let's test our model. 

In [None]:
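# Load the model trained above; a fresh Tagger cannot tag until a model is opened
tagger = pycrfsuite.Tagger()
tagger.open('../../data/04_models/crf_nyt_initial_model.model')
y_pred = [tagger.tag(xseq) for xseq in X_test]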

In [None]:
# Fit the binarizer once so y_true and y_pred share the same label columns
mlb = MultiLabelBinarizer()
mlb.fit(y_test)
print(classification_report(y_true=mlb.transform(y_test), y_pred=mlb.transform(y_pred)))
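
`MultiLabelBinarizer` treats each ingredient line as an unordered set of labels, so the report above checks which labels occur in a line rather than whether each individual token was tagged correctly. A token-level check can be sketched by flattening the sequences first. The toy labels below are made up, and `flatten` and `token_accuracy` are hypothetical helper names.

```python
from itertools import chain

def flatten(sequences):
    """Turn a list of label sequences into one flat list of labels."""
    return list(chain.from_iterable(sequences))

def token_accuracy(y_true, y_pred):
    """Share of tokens whose predicted label equals the true label."""
    true_flat, pred_flat = flatten(y_true), flatten(y_pred)
    if len(true_flat) != len(pred_flat):
        raise ValueError("y_true and y_pred must contain the same number of tokens")
    correct = sum(t == p for t, p in zip(true_flat, pred_flat))
    return correct / len(true_flat)

# Toy example: two ingredient lines, one wrong token out of five
y_true_toy = [["QTY", "UNIT", "NAME"], ["QTY", "NAME"]]
y_pred_toy = [["QTY", "UNIT", "NAME"], ["QTY", "COMMENT"]]
print(token_accuracy(y_true_toy, y_pred_toy))
```

The flattened lists can also be passed to `classification_report(flatten(y_test), flatten(y_pred))` to get per-label precision and recall.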

Our model is almost 100% accurate, which is good enough. Let's train the model on all of our data and move on to using it for our ingredients.

In [None]:
trainer = pycrfsuite.Trainer(verbose=True)

# Submit training data to the trainer
for xseq, yseq in zip(X, y):
    trainer.append(xseq, yseq)

# Set the parameters of the model
trainer.set_params({
    # coefficient for L1 penalty
    'c1': 0.1,

    # coefficient for L2 penalty
    'c2': 0.01,  

    # maximum number of iterations
    'max_iterations': 200,

    # whether to include transitions that
    # are possible, but not observed
    'feature.possible_transitions': True
})

# Provide a file name as a parameter to the train function, such that
# the model will be saved to the file when training is finished
trainer.train('../../data/04_models/crf_ing_final.model')

### Let's use our model to tag our Simply Recipes dataset 

### Instacart Products

## Application Testing 

## Conclusion & Next Steps