# Finding Meals that Conform to Your Taste 

## Introduction 

According to a recent [survey](https://www.iol.co.za/lifestyle/food-drink/many-cookbooks-yet-same-old-supper-1824825), the average Briton has a rotation of nine different recipes. Part of this lack of variety likely stems from the fear of preparing a meal that doesn't suit your tastes and having to throw it out. 

### Project Objective

For this project, we will create a recipe recommendation engine that learns your tastes and customizes your weekly menu within seconds. 

Our aim is to diversify your menu by helping you discover new meals that incorporate flavors that you already love, using items you already buy. 

### Datasets

1. **Instacart Market Basket Analysis Datasets**
    - order
    - aisles: 132 unique store aisles
    - departments: 24 unique departments
    - products: 49.7k unique products. 
    
    
2. **Mariano's Grocery Prices**

3. **Simply Recipes recipe database**

4. **Model Training Datasets**: 
    - nyt-ingredients-snapshot-2015
    - marianos_product_train_complete
    - instacart_product_train

### Summary of Results


## Data Collection - Web Crawling

In order to start our project we will need to collect three different types of data: 
* First, we will need a dataset of users' food preferences. Because food rating data is difficult to come by, we can instead use point-of-purchase grocery store data and rely on implicit feedback (i.e., assume that customers who bought an item liked it). 

* Second, we need a repository of diverse recipes. Although a number of recipe datasets are available online, none that I found has all the required attributes, so this data will need to be collected from a website via web scraping. 

* Third, we will need a database of grocery prices for pricing our recipes. Because datasets with prices are few and far between, and quickly become outdated, we will need to manually collect grocery pricing data from a store website. 
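
The implicit-feedback assumption in the first bullet can be sketched as follows. This is a minimal illustration in plain Python, not the project's code: the `(user_id, product_name)` pair format, the sample records, and the function name `build_implicit_feedback` are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_implicit_feedback(purchases):
    """Count how often each user bought each product.

    `purchases` is an iterable of (user_id, product_name) pairs. The purchase
    count stands in for an explicit rating: more purchases, stronger signal.
    """
    feedback = defaultdict(Counter)
    for user_id, product_name in purchases:
        feedback[user_id][product_name] += 1
    return feedback


# Hypothetical point-of-purchase records, for illustration only
purchases = [
    (1, "banana"), (1, "banana"), (1, "sourdough bread"),
    (2, "banana"), (2, "almond milk"),
]
feedback = build_implicit_feedback(purchases)
print(dict(feedback[1]))
```

A product the user never bought simply has a count of zero, which is the usual weakness of implicit feedback: absence of a purchase does not necessarily mean dislike.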

In [None]:
%load_ext autoreload
%autoreload 2

import argparse
import matplotlib.pyplot as plt
import nltk
import numpy as np
import os
import pandas as pd
import pycrfsuite
import re
import sys
import warnings
nltk.download('averaged_perceptron_tagger')
warnings.filterwarnings('ignore')

from selenium import webdriver
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import MultiLabelBinarizer

src_dir = os.path.join(os.getcwd(), '..', '..', 'src')
sys.path.append(src_dir)

from d00_utils import utils
from d01_data import clean_data
from d01_data.web_scraping import sr_scraping, marianos_insta_scraping
from d02_features.feature_creation import nyt_ingredients_crf_feature_creation
from d02_features.feature_creation import instacart_prod_crf_feature_creation
from d03_models.crf_model_recipes import crf_model_recipe_tagger
from d03_models.crf_model_baskets import crf_basket_feature_creation, crf_basket_dataset_creation

### Recipe Web Scraping - simplyrecipes.com

After researching a number of sites, I chose to scrape simplyrecipes.com for two reasons: 

* The website contains diverse recipes that could appeal to a number of different palates
* Each recipe comes pre-tagged with meal and dietary preferences

If you would like to scrape the website yourself, run ```sr_scraping()``` in a cell within this notebook. The full script takes around 1.5 hours to run, and all files are saved to the ```data/01_raw/simply_recipes``` folder. For convenience, I read in the scraped and concatenated dataset below. 

In [None]:
recipes_sr_orig = utils.read_multiple_csv_and_concat('../../data/01_raw/simply_recipes/simply_recipes*')
recipes_sr_orig.drop(columns='Unnamed: 0', inplace=True)

In [None]:
recipes_sr_inter = clean_data.intermediate_clean_recipes_sr(recipes_sr_orig)

At the moment we have 1752 different recipe entries. At a quick glance, some of the entries marked as recipes are actually how-to guides. These will need to be removed because they don't list ingredients for our model to use. 

In [None]:
recipes_sr_inter.head(2)

### Grocery Price Web Scraping - Chicago's Mariano's

If you would like to scrape the website yourself, run ```marianos_insta_scraping()``` in a cell within this notebook. The full script takes around 5 hours to run, and all files are saved to the ```data/01_raw``` folder as ```prod_aile_*```. Once you run the script, Selenium will open a browser window and you'll need to sign in to Instacart with your username and password. After 80 seconds, the script will begin scraping the site on its own. 

Let's go ahead and read in the concatenated files from Mariano's. 

In [None]:
grocery_prices_orig = utils.read_multiple_csv_and_concat('../../data/01_raw/grocery_prices_marianos/prod_aile*')

In [None]:
grocery_prices_orig.head(2)

In [None]:
grocery_prices_inter = clean_data.intermediate_clean_marianos_prices(grocery_prices_orig)

In [None]:
grocery_prices_inter.head(2)

## Exploratory Data Analysis

### Instacart Market Basket Analysis Datasets

Let's read in our datasets and merge them so that we can see what each user purchased. 

In [None]:
aisles = pd.read_csv('../../data/01_raw/instacart_2017_05_01/aisles.csv')
departments = pd.read_csv('../../data/01_raw/instacart_2017_05_01/departments.csv')
order = pd.read_csv('../../data/01_raw/instacart_2017_05_01/orders.csv')
order_products__prior = pd.read_csv('../../data/01_raw/instacart_2017_05_01/order_products__prior.csv')
products = pd.read_csv('../../data/01_raw/instacart_2017_05_01/products.csv')

In [None]:
instacart_baskets = clean_data.combine_instacart_kaggle_datasets(aisles, departments, order, 
                                                                 order_products__prior, products)
instacart_baskets.head()

In [None]:
instacart_baskets.info()

The highest number of baskets by customer is 99 (this means 99 different orders for each of the 5 customers below). 

In [None]:
pd.DataFrame(instacart_baskets.groupby('user_id')['order_id']\
             .nunique()).sort_values('order_id', ascending=False)\
             .head(5)

### Simply Recipes' Recipe Dataset

In [None]:
recipes_sr_inter.head(2)

In [None]:
recipes_sr_inter.info()

In [None]:
print('Number of unique recipes: ', len(recipes_sr_inter))

### Mariano's Grocery Prices Dataset

## Model Exploration & Feature Construction

### Simply Recipes

In order to sort recipes by ingredient (and ultimately price recipes), we first need to label each recipe with its list of ingredients. In most recipes, ingredients are listed as full phrases (i.e., a recipe will call for "8 slices sourdough bread" and not just "sourdough bread"). Because of this, we need to build a model that can identify the quantity (8), unit (slices), item (sourdough bread), and any additional comments.
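
To make the labeling task concrete, here is a deliberately naive rule-based tagger. It is only an illustration of the target output, not the approach used in this project; the tag names (`QTY`, `UNIT`, `NAME`) and the small unit list are assumptions made for the example.

```python
import re

# Hypothetical, intentionally incomplete list of units
KNOWN_UNITS = {
    "slice", "slices", "cup", "cups",
    "tablespoon", "tablespoons", "teaspoon", "teaspoons",
}


def naive_tag(ingredient_line):
    """Assign one tag per whitespace-separated token using fixed rules."""
    tagged = []
    for token in ingredient_line.split():
        if re.fullmatch(r"\d+([./]\d+)?", token):
            tagged.append((token, "QTY"))    # e.g. "8", "1/2", "1.5"
        elif token.lower() in KNOWN_UNITS:
            tagged.append((token, "UNIT"))
        else:
            tagged.append((token, "NAME"))   # everything else is lumped together
    return tagged


print(naive_tag("8 slices sourdough bread"))
print(naive_tag("2 cups flour, sifted"))
```

The first line is tagged sensibly, but in the second the trailing comment "sifted" is labeled as part of the item name. Fixed rules like these break down quickly on real ingredient lines, which is why a learned sequence model is a better fit.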

After researching a number of natural language processing models, I settled on a conditional random field (CRF) model. CRFs are used for pattern recognition in situations that call for structured predictions. Because recipe ingredients usually share the same elements in a similar order, a CRF is a good match for our purposes. 

#### Step 1: Construct Training Data

In a New York Times article titled "[Extracting Structured Data From Recipes Using Conditional Random Fields](https://open.blogs.nytimes.com/2015/04/09/extracting-structured-data-from-recipes-using-conditional-random-fields/)", the author details the process the Times used to classify recipes for its recipe website. The Times generously shared its code and training dataset on its GitHub page, so I had access to a training set and a clear example of how the code works. 

#### Step 2: Feature Creation 

In [None]:
nyt_ing = pd.read_csv('../../data/01_raw/nyt-ingredients-snapshot-2015.csv')
nyt_ing.drop(columns=['index'], inplace=True)
print('Number of Hand-labeled Ingredients: ', len(nyt_ing))
nyt_ing.head()

The NYT training dataset contains over 170K hand-labeled ingredients. Let's put the dataset in the correct format for feeding into the model. 
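
As the training cells in this notebook show, `trainer.append(xseq, yseq)` takes one sequence at a time: a list of per-token feature dicts and a parallel list of labels. The sketch below shows what that format can look like. The features actually built by `nyt_ingredients_crf_feature_creation` are not shown in this notebook, so the feature names, the helper functions, and the tag labels here are all assumptions for illustration.

```python
def token_features(tokens, i):
    """Build a feature dict for the token at position i, including its neighbors."""
    token = tokens[i]
    features = {
        "bias": 1.0,
        "token.lower": token.lower(),
        "token.isdigit": token.isdigit(),
        "token.istitle": token.istitle(),
    }
    if i > 0:
        features["prev.lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True   # beginning of sequence
    if i < len(tokens) - 1:
        features["next.lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True   # end of sequence
    return features


def sequence_features(tokens):
    """Turn a list of tokens into a list of feature dicts, one per token."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# One hypothetical training example: features and labels have the same length
demo_tokens = "8 slices sourdough bread".split()
demo_xseq = sequence_features(demo_tokens)
demo_yseq = ["QTY", "UNIT", "NAME", "NAME"]
print(demo_xseq[0])
```

Including the previous and next tokens is what lets the model learn patterns such as "a number is usually followed by a unit".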

In [None]:
nyt_ing.fillna("missing", inplace=True)

In [None]:
X, y = nyt_ingredients_crf_feature_creation(nyt_ing)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#### Model Building & Testing

In [None]:
trainer = pycrfsuite.Trainer(verbose=True)

# Submit training data to the trainer
for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

# Set the parameters of the model
trainer.set_params({
    # coefficient for L1 penalty
    'c1': 0.1,

    # coefficient for L2 penalty
    'c2': 0.01,  

    # maximum number of iterations
    'max_iterations': 200,

    # whether to include transitions that
    # are possible, but not observed
    'feature.possible_transitions': True
})

# Provide a file name as a parameter to the train function, such that
# the model will be saved to the file when training is finished
trainer.train('../../data/04_models/crf_nyt_initial_model.model')
# let's read back in our model 
tagger = pycrfsuite.Tagger()
tagger.open('../../data/04_models/crf_nyt_initial_model.model')

Let's test our model. 

In [None]:
# Note: the kernel keeps dying when tagging with the tagger
y_pred = [tagger.tag(xseq) for xseq in X_test]

In [None]:
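# Fit the binarizer once on all labels so both matrices share the same columns
mlb = MultiLabelBinarizer()
mlb.fit(list(y_test) + list(y_pred))
print(classification_report(y_true=mlb.transform(y_test), y_pred=mlb.transform(y_pred)))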
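
Note that `MultiLabelBinarizer` reduces each sequence to the set of labels it contains, so the report above scores whether a label appears somewhere in an ingredient line, not whether each token was tagged correctly. A token-level check can complement it. The sketch below is one possible way to do that in plain Python; the function name and the toy tag sequences are assumptions for illustration.

```python
from collections import Counter


def token_level_accuracy(y_true, y_pred):
    """Return overall and per-label accuracy over flattened tag sequences."""
    correct = Counter()
    total = Counter()
    for true_seq, pred_seq in zip(y_true, y_pred):
        for true_tag, pred_tag in zip(true_seq, pred_seq):
            total[true_tag] += 1
            if true_tag == pred_tag:
                correct[true_tag] += 1
    per_label = {tag: correct[tag] / total[tag] for tag in total}
    n_tokens = sum(total.values())
    overall = sum(correct.values()) / n_tokens if n_tokens else 0.0
    return overall, per_label


# Toy sequences with made-up tags, for illustration only
y_true_demo = [["QTY", "UNIT", "NAME", "NAME"], ["QTY", "NAME"]]
y_pred_demo = [["QTY", "UNIT", "NAME", "COMMENT"], ["QTY", "NAME"]]
overall_demo, per_label_demo = token_level_accuracy(y_true_demo, y_pred_demo)
print(overall_demo)
print(per_label_demo)
```

In the notebook, the same function could be called with `y_test` and `y_pred` in place of the toy sequences.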

The report shows that our model is almost 100% accurate, which is good enough. Let's retrain the model on all of our data and move on to using it for our ingredients.

In [None]:
trainer = pycrfsuite.Trainer(verbose=True)

# Submit training data to the trainer
for xseq, yseq in zip(X, y):
    trainer.append(xseq, yseq)

# Set the parameters of the model
trainer.set_params({
    # coefficient for L1 penalty
    'c1': 0.1,

    # coefficient for L2 penalty
    'c2': 0.01,  

    # maximum number of iterations
    'max_iterations': 200,

    # whether to include transitions that
    # are possible, but not observed
    'feature.possible_transitions': True
})

# Provide a file name as a parameter to the train function, such that
# the model will be saved to the file when training is finished
trainer.train('../../data/0/crf_ing_final.model')

#### Let's use our model to tag our Simply Recipes dataset 

In [None]:
recipe_ing_dict, recipe_links_dict, recipe_tags_dict = crf_model_recipe_tagger(recipes_sr_inter)

In [None]:
recipe_ing_dict

We now have a dictionary of recipes with ingredients labeled, a dictionary of links, and a dictionary of tags (breakfast, lunch, dinner, vegetarian, etc.). Let's now build and use a model for our Instacart products. 

### Instacart Products

#### Step 1: Construct Training Data

I hand-labeled over 1000 rows of the combined Instacart dataset to make a training dataset. This dataset will be available in the sample data folder. 

In [None]:
instacart_prod_train = pd.read_csv('../../data/01_raw/instacart_product_train.csv')

In [None]:
instacart_prod_train.head()

In [None]:
X, y = instacart_prod_crf_feature_creation(instacart_prod_train)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
trainer = pycrfsuite.Trainer(verbose=True)

# Submit training data to the trainer
for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

# Set the parameters of the model
trainer.set_params({
    # coefficient for L1 penalty
    'c1': 0.1,

    # coefficient for L2 penalty
    'c2': 0.01,  

    # maximum number of iterations
    'max_iterations': 200,

    # whether to include transitions that
    # are possible, but not observed
    'feature.possible_transitions': True
})

# Provide a file name as a parameter to the train function, such that
# the model will be saved to the file when training is finished
trainer.train('../../data/04_models/crf_ingredients_initial.model')
tagger = pycrfsuite.Tagger()
tagger.open('../../data/04_models/crf_ingredients_initial.model')

In [None]:
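# Tag the held-out set with the new model (y_pred from the recipe model is stale)
y_pred = [tagger.tag(xseq) for xseq in X_test]
# Fit the binarizer once on all labels so both matrices share the same columns
mlb = MultiLabelBinarizer()
mlb.fit(list(y_test) + list(y_pred))
print(classification_report(y_true=mlb.transform(y_test), y_pred=mlb.transform(y_pred)))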

The report shows that this model is also almost 100% accurate, which is good enough. Let's retrain the model on all of our data and move on to tagging the Instacart products. 

In [None]:
trainer = pycrfsuite.Trainer(verbose=True)

# Submit training data to the trainer
for xseq, yseq in zip(X, y):
    trainer.append(xseq, yseq)

# Set the parameters of the model
trainer.set_params({
    # coefficient for L1 penalty
    'c1': 0.1,

    # coefficient for L2 penalty
    'c2': 0.01,  

    # maximum number of iterations
    'max_iterations': 200,

    # whether to include transitions that
    # are possible, but not observed
    'feature.possible_transitions': True
})

# Provide a file name as a parameter to the train function, such that
# the model will be saved to the file when training is finished
trainer.train('../../data/04_models/crf_instacart_products_final.model')

#### Let's use our new model to tag our Instacart Basket Dataset

In [None]:
X, token_sr, products_list = crf_basket_feature_creation(instacart_baskets)

In [None]:
tagger = pycrfsuite.Tagger()
tagger.open('../../data/04_models/crf_instacart_products_final.model')

In [None]:
labels = [tagger.tag(xseq) for xseq in X]

In [None]:
instacart_baskets_update = crf_basket_dataset_creation(token_sr, labels, products_list, instacart_baskets)

## Application Testing 

## Conclusion & Next Steps