# Model and Evaluation for SmartRecipes (a recipe recommender)

## Table of Contents
- [Model and Evaluation for SmartRecipes (a recipe recommender)](#model-and-evaluation-for-smartrecipes-a-recipe-recommender)
  - [Introduction](#introduction)
  - [Goal](#goal)
  - [Process](#process)
    - [Load the cleaned dataset](#load-the-cleaned-dataset)
    - [Feature Engineering `featuredCol` = `categories` + `ingredients`](#feature-engineering-featuredcol-=-categories-+-ingredients)
    - [Vectorize `ingredientsStr` to get basic vocabulary](#vectorize-ingredientsstr-to-get-basic-vocabulary)
    - [Create custom vocabulary](#create-custom-vocabulary)
    - [Get the transformed sparse matrix by vectorizing `featuredCol`](#get-the-transformed-sparse-matrix-by-vectorizing-featuredcol)
    - [Create the `NearestNeighbors` model](#create-the-nearestneighbors-model)
    - [Test and Evaluate the model](#test-and-evaluate-the-model)
    - [Save for Streamlit](#save-for-streamlit)
  - [Next Steps](#next-steps)


## Introduction

So far we have:

1. Cleaned the dataset.
2. Experimented and iterated to arrive at a custom tokenizer for our `CountVectorizer`.
3. Worked out how to match recipes against a list of ingredients that the user has entered.
4. Decided to use a custom vocabulary to include two-word ingredients, since using bigrams in the vectorizer's hyperparameters was not giving us satisfactory results.
5. Seen that the results are returned in order of number of ingredients, which matches our goal of reducing food waste.
6. Found that we get better results if we include the categories string in the vectorization too.
7. Shortlisted some categories that are of interest to us: ["Drink", "Dairy Free", "Gluten free", "Dessert", "Vegetarian"].

## Goal

In this notebook, our goal is to combine these findings into a sequence of steps:

1. Feature engineering: combine the `categoriesStr` and `ingredientsStr` columns into a new column named `featuredCol`.
2. Vectorize the `ingredientsStr` column using the custom tokenizer. We want most of the words from this column, so this step gives us the base vocabulary.
3. Create the custom vocabulary `custom_vocab` (`new_vocab_list` in the code) by adding two-word ingredients such as "peanut butter" and any category words of interest such as "vegetarian".
4. Get the sparse matrix of ingredients plus relevant categories by calling `transform()` on `featuredCol` with a second `CountVectorizer` that has `custom_vocab` and the custom tokenizer. Note: we do not call `fit()` on this second vectorizer because we pass it a custom vocabulary; we call `transform()` directly.
5. Create a `NearestNeighbors` model and fit it on the sparse matrix from the previous step.
6. Use the second vectorizer to transform any input before giving it to the model.
7. Call `model.kneighbors()` to get the results.
8. Save the following for use in Streamlit:
    1. The `custom_vocab`
    2. The `custom tokenizer`
    3. The second vectorizer
    4. The `NearestNeighbors` model
    5. The dataset with the `featuredCol`

## Process

Let's start by loading the cleaned dataset.

### Load the cleaned dataset

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import seaborn as sns
# import statsmodels.api as sm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

#(NOTE: joblib had to be installed the first time I used it)
import joblib
# custom tokenizer (moved to a separate module due to Streamlit requirements)
import cust_tokenizer


In [2]:
df = pd.read_csv('../data/interim/full_recipes_cleaned_2.csv')
df.shape

(14526, 7)

In [3]:
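# Preview with recipeId as the index. The result is not assigned, so df keeps
# its default RangeIndex, which the positional lookups further down rely on.
df.set_index('recipeId')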

| recipeId | calories | rating | title | directionsStr | categoriesStr | ingredientsStr |
|---|---|---|---|---|---|---|
| 0 | 426.0 | 2.500 | Lentil, Apple, and Turkey Wrap | ['1. Place the stock, lentils, celery, carrot,... | ['Sandwich', 'Bean', 'Fruit', 'Tomato', 'turke... | ['4 cups low-sodium vegetable or chicken stock... |
| 1 | 403.0 | 4.375 | Boudin Blanc Terrine with Red Onion Confit | ['Combine first 9 ingredients in heavy medium ... | ['Food Processor', 'Onion', 'Pork', 'Bake', 'B... | ['1 1/2 cups whipping cream', '2 medium onions... |
| 2 | 165.0 | 3.750 | Potato and Fennel Soup Hodge | ['In a large heavy saucepan cook diced fennel ... | ['Soup/Stew', 'Dairy', 'Potato', 'Vegetable', ... | ['1 fennel bulb (sometimes called anise), stal... |
| 4 | 547.0 | 3.125 | Spinach Noodle Casserole | ['Preheat oven to 350°F. Lightly grease 8x8x2-... | ['Cheese', 'Dairy', 'Pasta', 'Vegetable', 'Sid... | ['1 12-ounce package frozen spinach soufflé, t... |
| 5 | 948.0 | 4.375 | The Best Blts | ['Mix basil, mayonnaise and butter in processo... | ['Sandwich', 'Food Processor', 'Tomato', 'Kid-... | ['2 1/2 cups (lightly packed) fresh basil leav... |
| ... | ... | ... | ... | ... | ... | ... |
| 20125 | 28.0 | 3.125 | Parmesan Puffs | ['Beat whites in a bowl with an electric mixer... | ['Mixer', 'Cheese', 'Egg', 'Fry', 'Cocktail Pa... | ['2 large egg whites', '3 oz Parmigiano-Reggia... |
| 20126 | 671.0 | 4.375 | Artichoke and Parmesan Risotto | ['Bring broth to simmer in saucepan.Remove fro... | ['Side', 'Kid-Friendly', 'High Fiber', 'Dinner... | ['5 1/2 cups (or more) low-salt chicken broth'... |
| 20127 | 563.0 | 4.375 | Turkey Cream Puff Pie | ['Using a sharp knife, cut a shallow X in bott... | ['Onion', 'Poultry', 'turkey', 'Vegetable', 'B... | ['1 small tomato', '1 small onion, finely chop... |
| 20128 | 631.0 | 4.375 | Snapper on Angel Hair with Citrus Cream | ['Heat 2 tablespoons oil in heavy medium skill... | ['Milk/Cream', 'Citrus', 'Dairy', 'Fish', 'Gar... | ['4 tablespoons olive oil', '4 shallots, thinl... |


In [4]:
# confirm that there are no null values or duplicated values
print(f"Null values: {df.isna().sum().sum()}")
print(f"Duplicated rows: {df.duplicated().sum()}")

Null values: 0
Duplicated rows: 0


### Feature Engineering `featuredCol` = `categories` + `ingredients`

In [5]:
df["featuredCol"] = df['categoriesStr'] + df['ingredientsStr']

In [6]:
type(df.loc[0,['featuredCol']].values[0])

str
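
As a small illustration of what the `+` above does, the sketch below concatenates two list-like strings element-wise on a one-row toy frame (the row is made up). The two strings meet with no separator; whether that matters depends on how `my_tokenizer` splits text, which is not shown here. The spaced variant is only a hypothetical alternative.

```python
import pandas as pd

# One-row toy frame with the same column names as df (hypothetical values).
toy = pd.DataFrame({
    "categoriesStr": ["['Dessert', 'Vegetarian']"],
    "ingredientsStr": ["['1 cup peanut butter', '2 bananas']"],
})

# Element-wise string concatenation, as in the cell above.
toy["featuredCol"] = toy["categoriesStr"] + toy["ingredientsStr"]
print(toy.loc[0, "featuredCol"])

# Hypothetical variant: an explicit space keeps the boundary between the two lists visible.
toy["featuredColSpaced"] = toy["categoriesStr"] + " " + toy["ingredientsStr"]
print(toy.loc[0, "featuredColSpaced"])
```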

### Vectorize `ingredientsStr` to get basic vocabulary

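Fit a first `CountVectorizer` on `ingredientsStr` to get the base vocabulary. Setting `min_df=5` drops tokens that appear in fewer than five recipes.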

In [7]:
# CountVectorizer was already imported in the first cell.
# First vectorizer: learns the base vocabulary from the ingredients only.
vectorizer = CountVectorizer(tokenizer=cust_tokenizer.my_tokenizer, min_df=5)
ingredients_matrix = vectorizer.fit_transform(df['ingredientsStr'])
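
To make the effect of `min_df` concrete, here is a minimal sketch on a made-up three-document corpus, using scikit-learn's default tokenizer rather than `my_tokenizer`. A token is kept only if it occurs in at least `min_df` documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini corpus: "okra" appears in every document, the rest appear once.
docs = ["okra tomato", "okra onion", "okra saffron"]

keep_all = CountVectorizer(min_df=1).fit(docs)
keep_common = CountVectorizer(min_df=2).fit(docs)

print(sorted(keep_all.vocabulary_))     # every distinct token
print(sorted(keep_common.vocabulary_))  # only tokens found in 2 or more documents
```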




### Create custom vocabulary

Add two-word ingredients and the categories of interest to the base vocabulary to build the custom vocabulary.

In [8]:
new_vocab_list = list(vectorizer.get_feature_names_out())
new_vocab_list.append("Peanut butter")
new_vocab_list.extend(["Drink", "Dairy Free", "Gluten free", "Dessert", "Vegetarian"])
len(new_vocab_list)

2166
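
Whether the capitalised and multi-word entries added above are ever matched depends on what `my_tokenizer` emits, which is not shown in this notebook. As a hedged check, the sketch below uses scikit-learn's default analyzer instead (an assumption): it lower-cases the text and produces single words only, so an entry such as "Peanut butter" stays at zero unless the entry is lower-cased and the analyzer also generates bigrams. With a fixed vocabulary, `ngram_range` only controls which candidates are generated; anything outside the vocabulary is still ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Peanut butter and banana sandwich"]

# Same shape of vocabulary as above: unigrams plus a capitalised two-word entry.
vocab_as_added = ["peanut", "butter", "Peanut butter"]
as_added = CountVectorizer(vocabulary=vocab_as_added).transform(docs).toarray()[0]

# Variant: lower-cased entry, with the analyzer producing unigrams and bigrams.
vocab_lower = ["peanut", "butter", "peanut butter"]
with_bigrams = (
    CountVectorizer(vocabulary=vocab_lower, ngram_range=(1, 2))
    .transform(docs)
    .toarray()[0]
)

print(dict(zip(vocab_as_added, as_added)))   # the two-word column stays at 0
print(dict(zip(vocab_lower, with_bigrams)))  # the two-word column is counted
```

Running the same kind of check with `cust_tokenizer.my_tokenizer` on a known recipe would show whether the extra vocabulary columns are populated in `ingredients_matrix_mod`.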

### Get the transformed sparse matrix by vectorizing `featuredCol`

Now transform `featuredCol` with a second vectorizer that uses the custom vocabulary.

In [9]:
# Second vectorizer: pass the custom vocabulary as a hyperparameter.
# Note: min_df has no effect when a vocabulary is supplied.
vectorizer_mod = CountVectorizer(tokenizer=(cust_tokenizer.my_tokenizer),
                       min_df=5, vocabulary=new_vocab_list)
# No fit() needed: with a fixed vocabulary we can call transform() directly.
ingredients_matrix_mod = vectorizer_mod.transform(df['featuredCol'])


In [10]:
# Adding a function to print the list of ingredients given a recipe title
def printIngredients(recipeName):
    sel = df['title'] == recipeName
    print(recipeName)
    print(df.loc[sel, ['featuredCol']].values)


### Create the `NearestNeighbors` model

In [11]:
model = NearestNeighbors(n_neighbors=11, metric='cosine')
model.fit(ingredients_matrix_mod)
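
For reference, the `cosine` metric reports one minus the cosine of the angle between two count vectors, so smaller values mean closer matches. The sketch below works this out by hand for a made-up four-term vocabulary; the vectors are illustrative and are not taken from the dataset.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(angle) between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical vocabulary: [peanut, butter, banana, bread]
query = [1, 1, 0, 0]
recipe_a = [1, 1, 1, 0]  # contains both query terms plus one extra term
recipe_b = [0, 1, 0, 1]  # shares only "butter" with the query

print(round(cosine_distance(query, recipe_a), 3))  # 0.184
print(round(cosine_distance(query, recipe_b), 3))  # 0.5
```

A recipe that shares more of the query terms, and carries fewer unrelated terms, ends up with a smaller distance.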

### Test and Evaluate the model

In [12]:
# Using various ingredient lists to test the results

ingInputList = [
    # "Chicken, Parmesan, Breadcrumbs",  # something familiar
    # "Artichoke Pesto",
    # "Chicken thighs, potatoes",  # compare results of potatoes vs potato
    # "Chicken thighs, potato",
    # "Okra",  # single ingredient
    # "Bhindi",  # unknown ingredient - does not exist in the vocabulary
    "Peanut butter"  # This is a 2 word ingredient
]

for ingInput in ingInputList:
    print(f"\n Input ingredients: {ingInput}")
    # Convert the string to a series
    ingInputSeries = pd.Series(ingInput)

    # Let's try to use the vectorizer on this
    ingTransformed = vectorizer_mod.transform(ingInputSeries)

    # pass this to NearestNeighbors trained model
    distOfRes, indicesOfRes = model.kneighbors(ingTransformed)

    # print the output
    print("\n Result")

    for i in range(0, 11):  # TODO: 11 should be made configurable and match the n-neighbors number
        name = df.loc[indicesOfRes[0][i], ['title']].values[0]
        distance = (distOfRes[0][i]).round(3)
        rating = df.loc[indicesOfRes[0][i], ['rating']].values[0]

        # print(f"{name}  :  {distance}")
        print(f"{name}  :  {distance}  :  {rating}")



 Input ingredients: Peanut butter

 Result
Peanut Butter and Banana Sandwiches  :  0.333  :  3.75
Peanut Punch  :  0.36  :  0.0
Peanut Butter, Banana, and Jelly "Ice Cream"  :  0.368  :  0.0
Frozen Peanut Butter Pie with Candied Bacon  :  0.452  :  0.0
Peanut Butter and Jelly Bars  :  0.457  :  4.375
Milk Chocolate Peanut Butter Sauce  :  0.475  :  2.5
Peanut Butter and Jelly Layered Sandwiches  :  0.486  :  4.375
Peanut Butter Chocolate Ripple Ice Cream  :  0.486  :  4.375
Peanut Butter Chocolate Chip Breads  :  0.492  :  3.75
Giant Chocolate Candy Bar With Peanuts and Nougat  :  0.5  :  5.0
Indian Clarified Butter  :  0.5  :  5.0
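
The query loop is repeated in several cells with a hard-coded 11. One possible way to address the TODO is a small helper that takes the number of results as a parameter. The function name `recommend` and the toy data are hypothetical; the sketch relies on `kneighbors` accepting an `n_neighbors` argument and on selecting rows by position with `iloc`.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

def recommend(query, vectorizer, model, recipes, k=None):
    """Return the k nearest recipes to `query` as a DataFrame (hypothetical helper)."""
    k = k or model.n_neighbors
    dist, idx = model.kneighbors(vectorizer.transform([query]), n_neighbors=k)
    # kneighbors returns row positions, so select with iloc rather than loc.
    out = recipes.iloc[idx[0]][["title", "rating"]].copy()
    out["distance"] = dist[0].round(3)
    return out.reset_index(drop=True)

# Toy data standing in for df, vectorizer_mod and model.
toy = pd.DataFrame({
    "title": ["PB Sandwich", "Okra Stew", "Chicken Bake"],
    "rating": [3.75, 4.375, 2.5],
    "featuredCol": ["peanut butter bread", "okra tomato onion", "chicken potato onion"],
})
toy_vec = CountVectorizer().fit(toy["featuredCol"])
toy_model = NearestNeighbors(n_neighbors=2, metric="cosine").fit(
    toy_vec.transform(toy["featuredCol"])
)

print(recommend("peanut butter", toy_vec, toy_model, toy, k=1))
```

With the notebook's objects, the call would look like `recommend("Peanut butter", vectorizer_mod, model, df)`, which defaults to the model's own `n_neighbors`.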


In [13]:
ingInputList = [
    # "Chicken, Parmesan, Breadcrumbs",  # something familiar
    # "Artichoke Pesto",
    "Chicken thighs, potatoes",  # compare results of potatoes vs potato
    "Chicken thighs, potatoes, gluten free",  # same ingredients plus a category of interest
    # "Chicken thighs, potato",
    "Okra",  # single ingredient
    "Okra, vegetarian",
    # "Bhindi",  # unknown ingredient - does not exist in the vocabulary
    # "Peanut butter"  # This is a 2 word ingredient
]

for ingInput in ingInputList:
    print(f"\n Input ingredients: {ingInput}")
    # Convert the string to a series
    ingInputSeries = pd.Series(ingInput)

    # Let's try to use the vectorizer on this
    ingTransformed = vectorizer_mod.transform(ingInputSeries)

    # pass this to NearestNeighbors trained model
    distOfRes, indicesOfRes = model.kneighbors(ingTransformed)

    # print the output
    print("\n Result")

    for i in range(0, 11):  # TODO: 11 should be made configurable and match the n-neighbors number
        name = df.loc[indicesOfRes[0][i], ['title']].values[0]
        distance = (distOfRes[0][i]).round(3)
        rating = df.loc[indicesOfRes[0][i], ['rating']].values[0]

        # print(f"{name}  :  {distance}")
        print(f"{name}  :  {distance}  :  {rating}")



 Input ingredients: Chicken thighs, potatoes

 Result
Chicken Vegetable Soup with Lime and Cilantro  :  0.512  :  3.75
Chicken with Baby Onions  :  0.538  :  3.75
Roasted Chicken, Ramps, and Potatoes  :  0.547  :  4.375
Greek Chicken and Potatoes  :  0.564  :  4.375
Maple-Barbecued Chicken  :  0.564  :  4.375
Cane Vinegar Chicken with Pearl Onions, Orange & Spinach  :  0.574  :  4.375
Chicken Vindaloo  :  0.588  :  3.125
Old-Fashioned Chicken and Corn Stew  :  0.588  :  4.375
Chilled Corn Soup with Herbed Chicken  :  0.592  :  1.875
Chicken Soup  :  0.594  :  3.75
Tamales de Mole Poblano  :  0.604  :  0.0

 Input ingredients: Chicken thighs, potatoes, gluten free

 Result
Cane Vinegar Chicken with Pearl Onions, Orange & Spinach  :  0.446  :  4.375
Grilled Chicken Moroccan Style  :  0.462  :  4.375
Southern-Style Fried Chicken  :  0.484  :  0.0
Roast Chicken With Harissa And Schmaltz  :  0.489  :  5.0
Chili and Honey Chicken Legs  :  0.489  :  3.75
Chicken and Artichoke Fricassée with 

The results look satisfactory. A comparison of the various runs is in this spreadsheet: https://docs.google.com/spreadsheets/d/1hkekdJCZBJqC5hkHKQFFM8sBONUf3CIsv9ekPxqa1AQ/edit?gid=0#gid=0

In [None]:
printIngredients("Chicken Vegetable Soup with Lime and Cilantro")

In [None]:
printIngredients("Cane Vinegar Chicken with Pearl Onions, Orange & Spinach")

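We can see that adding "gluten free" to the input changes the results: "Cane Vinegar Chicken with Pearl Onions, Orange & Spinach" moves from sixth place (distance 0.574) to first place (distance 0.446).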

In [16]:
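yes_ing_series = pd.Series("Broccoli")   # ingredient to include
no_ing_series = pd.Series("Asparagus")   # ingredient to exclude
# Results look a little strange for this combo.

yes_ing_tx = vectorizer_mod.transform(yes_ing_series)
# Negate the excluded ingredient's vector; try a stronger weight such as -10 and see what happens
no_ing_tx = (vectorizer_mod.transform(no_ing_series)) * -1

# The query vector is +1 for wanted ingredients and -1 for unwanted ones
updated_ing_tx = yes_ing_tx + no_ing_tx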

distOfRes, indicesOfRes = model.kneighbors(updated_ing_tx)

# print the output
print("\n Result")

for i in range(0, 11):  # TODO: 11 should be made configurable and match the n-neighbors number
    name = df.loc[indicesOfRes[0][i], ['title']].values[0]
    distance = (distOfRes[0][i]).round(3)
    rating = df.loc[indicesOfRes[0][i], ['rating']].values[0]

    # print(f"{name}  :  {distance}")
    print(f"{name}  :  {distance}  :  {rating}")


 Result
Broccoli and Broccoli Rabe with Roasted Red Peppers  :  0.667  :  3.125
Creamy Broccoli and Carrot Slaw  :  0.705  :  3.75
Broccoli Gratin with Mustard-Cheese Streusel  :  0.721  :  3.75
Cream of Broccoli Soup with Wild Mushrooms  :  0.723  :  2.5
Broccoli Rabe with Garlic and Pecorino Romano Cheese  :  0.731  :  4.375
Broccoli Cheddar Cornbread  :  0.742  :  4.375
Beef and Broccoli Stir Fry  :  0.742  :  3.75
Linguine Primavera  :  0.743  :  3.75
Italian Sausages with Broccoli Rabe and Polenta  :  0.746  :  3.75
Spaghetti with Broccoli Rabe and Garlic  :  0.746  :  4.375
Steamed Broccoli with Olive Oil and Parmesan  :  0.746  :  4.375
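
To see how a negative weight interacts with cosine distance, here is a hand-worked sketch on a made-up two-term vocabulary. It is an illustration of the idea, not a reproduction of the results above, where each recipe vector also contains many other tokens.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(angle) between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical vocabulary: [broccoli, asparagus]
recipes = {
    "broccoli only": [1, 0],
    "both": [1, 1],
    "asparagus only": [0, 1],
}

results = {}
for penalty in (1, 10):
    query = [1, -penalty]  # +1 for the wanted term, -penalty for the unwanted one
    results[penalty] = {
        name: round(cosine_distance(query, vec), 3) for name, vec in recipes.items()
    }
    print(penalty, results[penalty])
```

In this toy case the ordering is the same for both penalties, with recipes containing the unwanted term pushed further away. Because the query is normalised, a larger penalty also pushes the wanted-only recipe closer to a distance of 1, which may be part of why the distances in the output above are fairly large.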


### Save for Streamlit

1. Save the `custom_vocab`
2. Save the `custom tokenizer` (done already as a module cust_tokenizer)
3. Save the 2nd vectorizer.
4. Save the NearestNeighbors model.
5. Save the dataset with the `featuredCol`

In [17]:
# Save the custom vocab
joblib.dump(new_vocab_list, '../model/custom_vocab.pkl')

# Save the 2nd vectorizer.
joblib.dump(vectorizer_mod, "../model/vect_mod.pkl")

# Save the NearestNeighbors model.
joblib.dump(model, "../model/model_final.pkl")

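# Save the dataset with the featuredCol.
# Note: df still has its default RangeIndex and a 'recipeId' column (set_index above
# was not assigned), so the row position is written under the label 'recipeId'
# alongside the existing 'recipeId' column.
df.to_csv("../data/final/full_recipes.csv",index=True,index_label='recipeId')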
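
On the Streamlit side, the saved artifacts would be read back with `joblib.load`. The sketch below shows the round trip on toy objects written to a temporary directory; the toy corpus and vocabulary are made up, and only the file names mirror the ones used above. Since pickling stores functions by reference, the `cust_tokenizer` module has to be importable wherever the real vectorizer is loaded.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy stand-ins for vectorizer_mod and model (hypothetical data).
docs = ["okra tomato", "onion tomato"]
vec = CountVectorizer(vocabulary=["okra", "onion", "tomato"])
nn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(vec.transform(docs))

with tempfile.TemporaryDirectory() as tmp:
    vec_path = os.path.join(tmp, "vect_mod.pkl")
    model_path = os.path.join(tmp, "model_final.pkl")
    joblib.dump(vec, vec_path)
    joblib.dump(nn, model_path)

    # Load the artifacts back, as the app would do at start-up.
    vec_loaded = joblib.load(vec_path)
    nn_loaded = joblib.load(model_path)

# The loaded pair behaves like the original one.
dist, idx = nn_loaded.kneighbors(vec_loaded.transform(["okra tomato"]))
print(idx[0][0], round(float(dist[0][0]), 3))
```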

## Next Steps

- This was an unsupervised learning problem, so the results had to be evaluated manually.

- Recipes with the least relevant ingredients show up in the top 5 results about 80% of the time.

- Build a Streamlit app for the recipe recommender.