# Pre-processing for the dataset from Epicurious.com for SmartRecipes (a recipe recommender)

# Table of Contents
- [Pre-processing for the dataset from Epicurious.com for the Recipe Recommendation System](#pre-processing-for-the-dataset-from-epicurious.com-for-the-recipe-recommendation-system)
  - [Introduction](#introduction)
  - [Load the cleaned data](#load-the-cleaned-data)
  - [Vectorize](#vectorize)
    - [Define a custom tokenizer](#define-a-custom-tokenizer)
    - [Vectorize ingredientsStr using the custom tokenizer](#vectorize-ingredientsstr-using-the-custom-tokenizer)
  - [Just for fun: Run Logistic Regression to see if we can predict high ratings based on ingredients.](#just-for-fun:-run-logistic-regression-to-see-if-we-can-predict-high-ratings-based-on-ingredients.)
  - [Initial Modeling](#initial-modeling)
    - [Method 1 : Similarity matrix](#method-1-:-similarity-matrix)
    - [Method 2 : Using NearestNeighbors](#method-2-:-using-nearestneighbors)
      - [Matching on "Exact Recipe Title" as Input](#matching-on-"exact-recipe-title"-as-input)
      - [Matching on a "list of ingredients" as Input](#matching-on-a-"list-of-ingredients"-as-input)
      - [Matching on "exclude list of ingredients" as Input](#matching-on-"exclude-list-of-ingredients"-as-input)
      - [Matching on 'include ingredients' + 'exclude ingredients' as input](#matching-on-'include-ingredients'-+-'exclude-ingredients'-as-input)
      - [Trying a vectorizer with a stemmer](#trying-a-vectorizer-with-a-stemmer)
    - [Trying NearestNeighbors on categoriesStr](#trying-nearestneighbors-on-categoriesstr)
      - [Vectorize categoriesStr using the default tokenizer](#vectorize-categoriesstr-using-the-default-tokenizer)
      - [Run NearestNeigbor](#run-nearestneigbor)
  - [For Streamlit](#for-streamlit)
  - [Summary](#summary)
  - [Next Steps and Questions](#next-steps-and-questions)


## Introduction

We have chosen the dataset at https://www.kaggle.com/datasets/hugodarwood/epirecipes?select=full_format_recipes.json for our project.  

We have cleaned this dataset and saved it as `full_recipes_cleaned_2.csv`.  

In this notebook, we will do preprocessing on the data to get it ready for modeling. Since we are dealing with text data, we will use CountVectorizer to preprocess.

We will also do some preliminary modeling.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

from sklearn.feature_extraction.text import CountVectorizer

# custom tokenizer (moved to a separate module due to Streamlit requirements)
import cust_tokenizer


In [2]:
f"pandas version: {pd.__version__}"

'pandas version: 2.1.4'

## Load the cleaned data

In [3]:
df = pd.read_csv('../data/interim/full_recipes_cleaned_2.csv')
df.shape

(14526, 7)

In [4]:
df.head()

Unnamed: 0,recipeId,calories,rating,title,directionsStr,categoriesStr,ingredientsStr
0,0,426.0,2.5,"Lentil, Apple, and Turkey Wrap","['1. Place the stock, lentils, celery, carrot,...","['Sandwich', 'Bean', 'Fruit', 'Tomato', 'turke...",['4 cups low-sodium vegetable or chicken stock...
1,1,403.0,4.375,Boudin Blanc Terrine with Red Onion Confit,['Combine first 9 ingredients in heavy medium ...,"['Food Processor', 'Onion', 'Pork', 'Bake', 'B...","['1 1/2 cups whipping cream', '2 medium onions..."
2,2,165.0,3.75,Potato and Fennel Soup Hodge,['In a large heavy saucepan cook diced fennel ...,"['Soup/Stew', 'Dairy', 'Potato', 'Vegetable', ...","['1 fennel bulb (sometimes called anise), stal..."
3,4,547.0,3.125,Spinach Noodle Casserole,['Preheat oven to 350°F. Lightly grease 8x8x2-...,"['Cheese', 'Dairy', 'Pasta', 'Vegetable', 'Sid...","['1 12-ounce package frozen spinach soufflé, t..."
4,5,948.0,4.375,The Best Blts,"['Mix basil, mayonnaise and butter in processo...","['Sandwich', 'Food Processor', 'Tomato', 'Kid-...",['2 1/2 cups (lightly packed) fresh basil leav...


In [5]:
df.set_index('recipeId')

Unnamed: 0_level_0,calories,rating,title,directionsStr,categoriesStr,ingredientsStr
recipeId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,426.0,2.500,"Lentil, Apple, and Turkey Wrap","['1. Place the stock, lentils, celery, carrot,...","['Sandwich', 'Bean', 'Fruit', 'Tomato', 'turke...",['4 cups low-sodium vegetable or chicken stock...
1,403.0,4.375,Boudin Blanc Terrine with Red Onion Confit,['Combine first 9 ingredients in heavy medium ...,"['Food Processor', 'Onion', 'Pork', 'Bake', 'B...","['1 1/2 cups whipping cream', '2 medium onions..."
2,165.0,3.750,Potato and Fennel Soup Hodge,['In a large heavy saucepan cook diced fennel ...,"['Soup/Stew', 'Dairy', 'Potato', 'Vegetable', ...","['1 fennel bulb (sometimes called anise), stal..."
4,547.0,3.125,Spinach Noodle Casserole,['Preheat oven to 350°F. Lightly grease 8x8x2-...,"['Cheese', 'Dairy', 'Pasta', 'Vegetable', 'Sid...","['1 12-ounce package frozen spinach soufflé, t..."
5,948.0,4.375,The Best Blts,"['Mix basil, mayonnaise and butter in processo...","['Sandwich', 'Food Processor', 'Tomato', 'Kid-...",['2 1/2 cups (lightly packed) fresh basil leav...
...,...,...,...,...,...,...
20125,28.0,3.125,Parmesan Puffs,['Beat whites in a bowl with an electric mixer...,"['Mixer', 'Cheese', 'Egg', 'Fry', 'Cocktail Pa...","['2 large egg whites', '3 oz Parmigiano-Reggia..."
20126,671.0,4.375,Artichoke and Parmesan Risotto,['Bring broth to simmer in saucepan.Remove fro...,"['Side', 'Kid-Friendly', 'High Fiber', 'Dinner...",['5 1/2 cups (or more) low-salt chicken broth'...
20127,563.0,4.375,Turkey Cream Puff Pie,"['Using a sharp knife, cut a shallow X in bott...","['Onion', 'Poultry', 'turkey', 'Vegetable', 'B...","['1 small tomato', '1 small onion, finely chop..."
20128,631.0,4.375,Snapper on Angel Hair with Citrus Cream,['Heat 2 tablespoons oil in heavy medium skill...,"['Milk/Cream', 'Citrus', 'Dairy', 'Fish', 'Gar...","['4 tablespoons olive oil', '4 shallots, thinl..."


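We have 14526 rows and 7 columns; with `recipeId` as the index, 6 data columns remain. Note that `set_index` returns a new DataFrame, so `df` itself keeps its default integer index in the cells that follow.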

In [6]:
# confirm that there are no null values or duplicated values
print(f"Null values: {df.isna().sum().sum()}")
print(f"Duplicated rows: {df.duplicated().sum()}")

Null values: 0
Duplicated rows: 0


Verified that there are no null or duplicate values

## Vectorize

### Define a custom tokenizer

<font color='yellow'>NOTE - MOVED my_tokenizer to a module called cust_tokenizer, because of Streamlit</font>
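Since the tokenizer itself is not shown in this notebook, here is a hypothetical sketch of what such a tokenizer might look like. It is not the actual `cust_tokenizer.my_tokenizer`; the name `sketch_tokenizer` and the two tiny word lists are assumptions, chosen only to show the idea of dropping quantities, units, and descriptive adjectives so that mostly ingredient nouns remain.

```python
import re

# Tiny illustrative stop lists (assumptions); the real project reads longer lists from files
MEASUREMENTS = {"cup", "cups", "tablespoon", "tablespoons",
                "teaspoon", "teaspoons", "pound", "ounce"}
EXTRA_ADJECTIVES = {"fine", "dry", "large", "small", "fresh", "chopped"}

def sketch_tokenizer(text):
    """Lowercase the text, keep alphabetic words, and drop units and adjectives."""
    words = re.findall(r"[a-z]+", text.lower())  # digits and punctuation are discarded
    return [w for w in words
            if len(w) > 1
            and w not in MEASUREMENTS
            and w not in EXTRA_ADJECTIVES]

print(sketch_tokenizer("['2 cups fine dry breadcrumbs', '4 large eggs']"))
```

A function like this can be passed to `CountVectorizer(tokenizer=...)`; keeping it in an importable module means the same function is available wherever the fitted vectorizer is reused.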

### Vectorize ingredientsStr using the custom tokenizer

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(tokenizer=(cust_tokenizer.my_tokenizer),
                       min_df=5)
ingredients_matrix = vectorizer.fit_transform(df['ingredientsStr'])



In [8]:
# IGNORE COMMENTED CODE BELOW

#bigram_series = pd.Series(["chicken breast", "boneless chicken", "chicken stock",
#                           "egg yolks", "chicken thighs", "cocoa powder",
#                           "cilantro leaves", "boneless pork", "sundried tomatoes"])

# join this with the ingredientsStr series
# custom_vocab = pd.concat([df['ingredientsStr'],bigram_series], ignore_index=True)

# specifying bigrams made matters worse, so I will add only the most sensible bigrams manually
# I inspected the results using the code snippet labeled - checking word / bigram frequency

# ingredients_matrix = vectorizer.fit_transform(custom_vocab)
# vectorizer.fit(df['ingredientsStr'])
# ingredients_matrix = vectorizer.transform(df['ingredientsStr'])

In [9]:
# checking word / bigram frequency - used when ngrams was specified in the CountVectorizer
# sum_words = ingredients_matrix.sum(axis = 0)
# words_freq = [(word, sum_words[0, i]) for word, i in vectorizer.vocabulary_.items()]
# words_freq = sorted(words_freq, key = lambda x: x[1], reverse = True)
# words_freq = sorted(words_freq, key = lambda x: x[1])
# words_freq[1000:1500]

In [10]:
ingredients_matrix.shape

(14526, 2160)

In [11]:
ingredients_matrix

<14526x2160 sparse matrix of type '<class 'numpy.int64'>'
	with 313400 stored elements in Compressed Sparse Row format>

In [12]:
# 'Artichoke and Parmesan Risotto' and 'Chicken Parmesan' are two recipes we will use to test for now
df[df['title'].str.contains('Chicken Parmesan', na=False)]

Unnamed: 0,recipeId,calories,rating,title,directionsStr,categoriesStr,ingredientsStr
4918,6368,1842.0,5.0,Chicken Parmesan,['Place breadcrumbs and flour in 2 separate sh...,"['Chicken', 'Tomato', 'Broil', 'Kid-Friendly',...","['2 cups fine dry breadcrumbs', '1 cup all-pur..."
7482,9873,3392.0,3.75,Chicken Parmesan Heros,['Heat olive oil in a 4- to 5-quart heavy sauc...,"['Sandwich', 'Cheese', 'Chicken', 'Poultry', '...","['3 tablespoons olive oil', '1 small onion, fi..."
9982,13372,610.0,4.375,New Chicken Parmesan,['Preheat oven to 500° F. Whisk first 3 ingred...,"['Chicken', 'Tomato', 'Roast', 'Kid-Friendly',...","['1/3 cup extra-virgin olive oil', '2 large ga..."
12825,17557,917.0,5.0,Quick Baked Chicken Parmesan,['Arrange racks in top and bottom of oven and ...,"['22-Minute Meals', 'Chicken', 'Parmesan', 'To...","['2 large eggs', '1 1/2 cups breadcrumbs or pa..."


In [13]:
df[df['title'].str.contains('Chicken', na=False)]

Unnamed: 0,recipeId,calories,rating,title,directionsStr,categoriesStr,ingredientsStr
25,35,625.0,3.750,Aztec Chicken,['Melt 2 tablespoons butter with vegetable oil...,"['Chicken', 'Olive', 'Onion', 'Sauté', 'Dinner...","['6 tablespoons (3/4 stick) chilled butter', '..."
37,53,1203.0,5.000,Pancetta Roast Chicken with Walnut Stuffing,['Preheat oven to 400°F. Melt 1/4 cup butter i...,"['Chicken', 'Roast', 'High Fiber', 'Dinner', '...","['8 tablespoons (1 stick) butter, divided', 'C..."
61,80,1172.0,4.375,"Braised Chicken and Rice with Orange, Saffron,...",['Rinse the rice in a sieve under cold running...,"['Chicken', 'Citrus', 'Fruit', 'Nut', 'Poultry...","['1 1/2 cups brown basmati rice', '1/4 cup oli..."
63,82,682.0,4.375,Chicken in Green Pumpkin-Seed Sauce,['Bring all ingredients to boil in large pot. ...,"['Chicken', 'Low/No Sugar', 'Cinco de Mayo', '...","['5 cups water', '6 chicken thighs with skin a..."
67,87,1143.0,0.000,Roast Chicken With Sorghum and Squash,['Bring 5 cups water to a boil in a medium pot...,"['Bon Appétit', 'Dinner', 'Chicken', 'Grains',...","['Kosher salt', '1 cup sorghum', '1/2 large bu..."
...,...,...,...,...,...,...,...
14445,20007,936.0,4.375,"Braised Chicken with Smoked Ham, Chestnuts, an...","['Bring water to a simmer in a small saucepan,...","['Chicken', 'Ginger', 'Braise', 'Marinate', 'D...","['2 3/4 cups water', '12 dried Chinese black m..."
14473,20050,356.0,3.750,Hot Chicken Salad,['1. Preheat the oven to 375°F. Spray a 13-by-...,"['Cheese', 'Chicken', 'Nut', 'Poultry', 'Bake'...","['2 cups cooked chicken breast meat, cubed (Yo..."
14475,20052,878.0,4.375,Chicken Tetrazzini,"['Bring chicken bones, broth, carrot, onion, c...","['Chicken', 'Mushroom', 'Pasta', 'Bake', 'Supe...",['1 to 1 1/2 pound chicken bones (from 2 cooke...
14497,20091,1096.0,3.750,Chicken with Raisins and Lemon,['Arrange chicken in single layer in large Dut...,"['Chicken', 'Potato', 'Poultry', 'Lemon', 'Rai...","['1 3 1/2-pound chicken, cut into 8 pieces', '..."


## Just for fun: Run Logistic Regression to see if we can predict high ratings based on ingredients.

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Data setup
X = df["ingredientsStr"]
y = df["rating"] >= 4 # binarize our target variable
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,
                                                 stratify=y,random_state=42)

# Vectorization
vect_log = CountVectorizer(tokenizer=cust_tokenizer.my_tokenizer,
                       min_df=5)
X_train_vect = vect_log.fit_transform(X_train)
X_test_vect = vect_log.transform(X_test)





In [15]:
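# Classifier training: sweep the regularization strength C.
# NOTE: each C is scored on the test split, so this is not cross-validation
# and the accuracy of the selected model is an optimistic estimate.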
best_model = None
best_score = -np.inf
for cval in np.logspace(-3,3,10):
    lr = LogisticRegression(C=cval,max_iter=1000)
    lr.fit(X_train_vect,y_train)
    score = lr.score(X_test_vect,y_test)
    print("Accuracy = {:.4f} for C={:.4f}".format(score,cval))
    if score > best_score:
        best_score = score
        best_model = lr

# finally, evaluate on the test data after fitting with the best hyperparams
best_model.fit(X_train_vect,y_train)
test_acc = best_model.score(X_test_vect,y_test)
print("Our classifier test accuracy is {:.4f}".format(test_acc))

Accuracy = 0.5957 for C=0.0010
Accuracy = 0.6026 for C=0.0046
Accuracy = 0.5980 for C=0.0215
Accuracy = 0.5879 for C=0.1000
Accuracy = 0.5792 for C=0.4642
Accuracy = 0.5709 for C=2.1544
Accuracy = 0.5652 for C=10.0000
Accuracy = 0.5608 for C=46.4159


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Accuracy = 0.5622 for C=215.4435
Accuracy = 0.5633 for C=1000.0000
Our classifier test accuracy is 0.6026


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


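Based on the ingredients alone, we can predict a high rating (>= 4.0) with only about 60% accuracy (0.6026 at best). This is only slightly better than the baseline of 54%. (Earlier we saw that 54% of the dataset has ratings > 4.0.) Because C was selected using the test split, even this figure is slightly optimistic.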
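For reference, a minimal sketch of how the regularization strength could be chosen with cross-validation on the training split only, leaving the test split for a single final evaluation. It runs on synthetic data from `make_classification` (an assumption, used so the example is self-contained); the names `X_demo`, `search`, and the grid size are illustrative, not taken from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the vectorized ingredients and the binarized rating
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42)

# 3-fold cross-validation over C, using the training split only
param_grid = {"C": np.logspace(-3, 3, 5)}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=3, scoring="accuracy")
search.fit(X_tr, y_tr)

# The test split is touched exactly once, after C has been fixed
test_acc = search.score(X_te, y_te)
print(f"Best C: {search.best_params_['C']:.4f}, test accuracy: {test_acc:.4f}")
```

With text data, the vectorizer would ideally sit inside a `Pipeline` so that the vocabulary is also refit within each fold.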

## Initial Modeling

### Method 1 : Similarity matrix



<font color="yellow">NOTE: We will use NearestNeighbors, this section of the notebook can be skipped.</font>

Got these steps from the Notebook for the upcoming Recommendation Systems class.

In [16]:
ingredients_matrix[(df['title'] == 'Chicken Parmesan').values].todense().squeeze()

matrix([[0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [17]:
from sklearn.metrics.pairwise import cosine_similarity

recipe_1 = ingredients_matrix[(df['title'] == 'Chicken Parmesan').values,]
recipe_2 = ingredients_matrix[(df['title'] == 'Chicken Parmesan Heros').values,]

print("Similarity:", cosine_similarity(recipe_1, recipe_2)) # Notice the result is a 2D 1X1 array, so to grab
                                                          # the number we will need to index

Similarity: [[0.31980107]]


The cosine similarity between the 'Chicken Parmesan' and 'Chicken Parmesan Heros' recipes is about 0.32 (32%).

In [18]:
recipe_3 = ingredients_matrix[(df['title'] == 'Artichoke and Parmesan Risotto').values,]
print("Similarity:", cosine_similarity(recipe_1, recipe_3))

Similarity: [[0.18844459]]


The cosine similarity between the 'Chicken Parmesan' and 'Artichoke and Parmesan Risotto' recipes is about 0.19 (19%).

Looking at the actual recipes in the dataset this output makes sense.
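To make the similarity numbers concrete, here is a small worked example of the cosine similarity formula, cos(a, b) = (a · b) / (‖a‖ ‖b‖), on two invented count vectors. The vocabulary and vectors are assumptions for illustration only.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical vocabulary: [chicken, parmesan, onion, tomato]
recipe_a = np.array([1, 1, 0, 1])  # chicken, parmesan, tomato
recipe_b = np.array([1, 0, 1, 1])  # chicken, onion, tomato

# Two shared ingredients out of three each: 2 / (sqrt(3) * sqrt(3)) = 2/3
similarity = cosine_sim(recipe_a, recipe_b)
print(round(similarity, 4))
```

Because count vectors are non-negative, the similarity always falls between 0 (no shared tokens) and 1 (identical direction).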

Let's create a similarity matrix by doing cosine_similarity on the entire ingredients sparse matrix.

In [19]:
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(ingredients_matrix, dense_output=False)

In [20]:
# Check the shape
# rows and columns should be equal, and match the number of recipes we started with (rows)
similarities.shape

(14526, 14526)

In [21]:
# Test with a sample recipe
df[df['title'] == 'Chicken Parmesan']

Unnamed: 0,recipeId,calories,rating,title,directionsStr,categoriesStr,ingredientsStr
4918,6368,1842.0,5.0,Chicken Parmesan,['Place breadcrumbs and flour in 2 separate sh...,"['Chicken', 'Tomato', 'Broil', 'Kid-Friendly',...","['2 cups fine dry breadcrumbs', '1 cup all-pur..."


In [22]:
# Get the column based upon the index
recipe_index = df[df['title'] == 'Chicken Parmesan'].index

# Create a dataframe with the recipe titles
sim_df = pd.DataFrame({'recipe': df['title'],
                       'similarity': np.array(similarities[recipe_index, :].todense()).squeeze()})

In [23]:
# Return the top 10 most similar recipes
sim_df.sort_values(by='similarity', ascending=False).head(10)

Unnamed: 0,recipe,similarity
4918,Chicken Parmesan,1.0
12825,Quick Baked Chicken Parmesan,0.586939
11718,Fried Chicken Biscuits,0.565752
7471,Fried Chicken Thighs with Cheesy Grits,0.540295
7475,3-Ingredient Shakshuka,0.533002
5405,Spanish-Style Fried Chicken With Grilled Avocado,0.528525
6860,Chicken and Dumplings with Mushrooms,0.523785
13533,Fusilli with Shrimp and Paneed Chicken,0.522233
7749,Lamb and Eggplant Casserole (Moussaka),0.50884
4639,Spiced Labneh,0.506024


With TFIDF Vectorizer:

|        |                                           recipe | similarity |
|-------:|-------------------------------------------------:|----------|
|   4918 |                                 Chicken Parmesan | 1.000000 |
|  11718 |                           Fried Chicken Biscuits | 0.424981 |
|  12825 |                     Quick Baked Chicken Parmesan | 0.424046 |
|   1165 |                          Mozzarella Pesto Spread | 0.412968 |
|   9877 |         Rigatoni with Cheese and Italian Sausage | 0.360720 |
|  14180 |                      BA's Best Eggplant Parmesan | 0.356189 |
|   5786 |                         Horseradish-Yogurt Sauce | 0.351933 |
|   3927 | Chunky Two-Cheese Potatoes with Garlic and Pesto | 0.333565 |
|  13533 |           Fusilli with Shrimp and Paneed Chicken | 0.331478 |
|   3123 |    Spicy Lamb Pizza With Parsley–Red Onion Salad | 0.328433 |

Results with CountVectorizer seem better:

|       |                                            recipe | similarity |
|------:|--------------------------------------------------:|------------|
|  4918 |                                  Chicken Parmesan |   1.000000 |
| 13533 |            Fusilli with Shrimp and Paneed Chicken |   0.587137 |
| 12480 |                          Spicy Oven-Fried Chicken |   0.574641 |
| 11718 |                            Fried Chicken Biscuits |   0.559149 |
| 14180 |                       BA's Best Eggplant Parmesan |   0.557207 |
|  7749 |            Lamb and Eggplant Casserole (Moussaka) |   0.556499 |
| 11245 |                                             Pinon |   0.548443 |
|  9785 | Breaded Skinless Fish Fillets with Red Pepper ... |   0.540222 |
|  6860 |              Chicken and Dumplings with Mushrooms |   0.539425 |
| 11336 | Crispy Chicken Sandwich with Buttermilk Slaw a... |   0.539164 |

TODO: Ana check these results - something does not make sense, should I try with CountVectorizer? Still getting some weird results like Pinon.

Tried with custom tokenizer that removed most measurements and adjectives - TODO see if there is a library for this e.g. https://stackoverflow.com/questions/33587667/extracting-all-nouns-from-a-text-file-using-nltk

The results are better now:

|       |                                 recipe | similarity |
|------:|---------------------------------------:|------------|
|  4918 |                       Chicken Parmesan |   1.000000 |
| 12825 |           Quick Baked Chicken Parmesan |   0.553191 |
|  6860 |   Chicken and Dumplings with Mushrooms |   0.522931 |
| 11718 |                 Fried Chicken Biscuits |   0.520939 |
| 13533 | Fusilli with Shrimp and Paneed Chicken |   0.517769 |
|  1937 |            East-West Barbecued Chicken |   0.506168 |
|  8877 |     Pepper, Rosemary, and Cheese Bread |   0.505992 |
|  7471 | Fried Chicken Thighs with Cheesy Grits |   0.503038 |
|  4490 |                       Parmesan Muffins |   0.501956 |
| 14180 |            BA's Best Eggplant Parmesan |   0.500428 |

In [24]:
selector = (df['title'] == 'Chicken Parmesan') | (df['title'] == 'Fried Chicken Biscuits') | (df['title'] == 'Pinon')
df.loc[selector,['title', 'ingredientsStr']].values

array([['Chicken Parmesan',
        '[\'2 cups fine dry breadcrumbs\', \'1 cup all-purpose flour\', \'4 large eggs\', \'1 cup whole milk\', \'8 small skinless, boneless chicken thighs, pounded to 1/2" thickness\', \'Kosher salt, freshly ground pepper\', \'N/A freshly ground pepper\', \'8 tablespoons olive oil\', \'8 tablespoons prepared sun-dried tomato pesto\', \'1 pound fresh mozzarella, cut into 8 slices\', \'1/2 teaspoon crushed red pepper flakes\', \'4 cups prepared marinara sauce, warmed\', \'Finely grated Parmesan (for serving)\']'],
       ['Pinon',
        "['1 medium onion', '1/2 small green bell pepper', '1/2 small red bell pepper', 'a 14- to 16-ounce can whole tomatoes', '1/3 cup drained pimiento-stuffed green olives', '1 pound ground beef chuck', '1/4 teaspoon salt', '1/4 teaspoon freshly ground black pepper', '1/2 cup tomato sauce', '2 tablespoons raisins', '1 tablespoon cider vinegar', '2 bay leaves', '1/4 teaspoon ground achiote (optional)', '6 semi-ripe (yellow with so

In [25]:
# Check for Artichoke
# Get the row based upon the index
recipe_index_2 = df[df['title'] == 'Artichoke and Parmesan Risotto'].index

# Create a dataframe with the recipe titles
sim_df_2 = pd.DataFrame({'recipe': df['title'],
                       'similarity': np.array(similarities[recipe_index_2, :].todense()).squeeze()})

# Return the top 10 most similar recipes
sim_df_2.sort_values(by='similarity', ascending=False).head(10)

Unnamed: 0,recipe,similarity
14522,Artichoke and Parmesan Risotto,1.0
6459,Asparagus Risotto,0.746203
10936,Risotto with Squash and Pancetta,0.677672
5060,Shrimp Risotto with Baby Spinach and Basil,0.67352
12142,Mock Risotto,0.649519
8366,Fontina Risotto Cakes with Fresh Chives,0.640712
11060,Risotto Primavera,0.628619
4857,Radicchio Risotto,0.625543
10112,Porcini Mushroom Risotto,0.625463
8493,"Butternut Squash, Rosemary, and Blue Cheese Ri...",0.625463


Results look satisfactory.

### Method 2 : Using NearestNeighbors

In [26]:
# Adding a function to print the list of ingredients given a recipe title
def printIngredients(recipeName):
    sel = df['title'] == recipeName
    print(recipeName)
    print(df.loc[sel, ['ingredientsStr']].values)


In [62]:
def printRecipeSteps(recipeName):
    sel = df['title'] == recipeName
    print(recipeName)
    print(df.loc[sel, ['directionsStr']].values)

In [70]:
def printCategories(recipeName):
    sel = df['title'] == recipeName
    print(recipeName)
    print(df.loc[sel, ['categoriesStr']].values)

Our first goal is to find the recipes most similar to a given recipe title that already exists in the dataset. For this we will use the NearestNeighbors unsupervised learner.

We will set the following hyperparameters:
- n_neighbors = 11, to find the 10 most similar recipes (the query recipe itself comes back as its own nearest neighbor).
- metric = cosine, so that neighbors are ranked by the cosine distance between recipes.
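The reason for asking for one extra neighbor can be seen in a minimal sketch on a toy matrix. The four rows and the names `toy_X`, `k`, and `similar` are assumptions for illustration: the query row is returned first at distance 0, so it is dropped to leave the `k` genuinely similar rows.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy count matrix: 4 "recipes" over a 4-word vocabulary (invented)
toy_X = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
])

k = 2  # number of similar recipes we actually want
toy_model = NearestNeighbors(n_neighbors=k + 1, metric="cosine")
toy_model.fit(toy_X)

# Query with row 0; it is its own nearest neighbor at distance ~0
toy_dist, toy_idx = toy_model.kneighbors(toy_X[[0]])

# Drop the first hit (the query itself) to keep the k most similar rows
similar = toy_idx[0][1:]
print(similar, toy_dist[0][1:].round(3))
```

Deriving the loop bounds from `k` in this way would also keep the printing code in step with `n_neighbors`.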

In [27]:
from sklearn.neighbors import NearestNeighbors

model = NearestNeighbors(n_neighbors=11, metric='cosine')
model.fit(ingredients_matrix)


#### Matching on "Exact Recipe Title" as Input

In [28]:
recipe_index = df[df['title'] == 'Chicken Parmesan'].index

# Look up this recipe's row in the ingredients matrix and query the model for its nearest neighbors
distances, indices = model.kneighbors(ingredients_matrix[recipe_index])

In [29]:
# recipe_title = df.loc[4918, ['title']]
recipe_titles = []
for id in indices[0]:
    recipe_titles.append(df.loc[id, ['title']])
print(recipe_titles)

[title    Chicken Parmesan
Name: 4918, dtype: object, title    Quick Baked Chicken Parmesan
Name: 12825, dtype: object, title    Fried Chicken Biscuits
Name: 11718, dtype: object, title    Fried Chicken Thighs with Cheesy Grits
Name: 7471, dtype: object, title    3-Ingredient Shakshuka
Name: 7475, dtype: object, title    Spanish-Style Fried Chicken With Grilled Avocado
Name: 5405, dtype: object, title    Chicken and Dumplings with Mushrooms
Name: 6860, dtype: object, title    Fusilli with Shrimp and Paneed Chicken
Name: 13533, dtype: object, title    Lamb and Eggplant Casserole (Moussaka)
Name: 7749, dtype: object, title    Spiced Labneh
Name: 4639, dtype: object, title    BA's Best Eggplant Parmesan
Name: 14180, dtype: object]


In [30]:
recipe_index_2 = df[df['title'] == 'Artichoke and Parmesan Risotto'].index

distances2, indices2 = model.kneighbors(ingredients_matrix[recipe_index_2])
recipe_titles2 = []
for id2 in indices2[0]:
    recipe_titles2.append(df.loc[id2, ['title']])
print(recipe_titles2)

[title    Artichoke and Parmesan Risotto
Name: 14522, dtype: object, title    Asparagus Risotto
Name: 6459, dtype: object, title    Risotto with Squash and Pancetta
Name: 10936, dtype: object, title    Shrimp Risotto with Baby Spinach and Basil
Name: 5060, dtype: object, title    Mock Risotto
Name: 12142, dtype: object, title    Fontina Risotto Cakes with Fresh Chives
Name: 8366, dtype: object, title    Risotto Primavera
Name: 11060, dtype: object, title    Radicchio Risotto
Name: 4857, dtype: object, title    Porcini Mushroom Risotto
Name: 10112, dtype: object, title    Butternut Squash, Rosemary, and Blue Cheese Ri...
Name: 8493, dtype: object, title    Crispy Garlic Risotto Cakes
Name: 2138, dtype: object]


Results seem to be going in the right direction.
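The printed lists above are noisy because each element is a one-item Series. A more compact way to collect the titles might look like the sketch below; the small DataFrame and the `demo_*` names are assumptions for illustration. It uses positional lookup with `iloc`, since `kneighbors` returns row positions rather than index labels.

```python
import numpy as np
import pandas as pd

# Stand-in for the recipes DataFrame (titles invented for the example)
demo_df = pd.DataFrame({"title": ["Recipe A", "Recipe B", "Recipe C", "Recipe D"]})

# Shape (1, n_neighbors), the same layout that kneighbors returns
demo_indices = np.array([[2, 0, 3]])

# Positional lookup preserves the neighbor order and yields plain strings
demo_titles = demo_df["title"].iloc[demo_indices[0]].tolist()
print(demo_titles)
```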

#### Matching on a "list of ingredients" as Input

In [73]:
# Using various ingredient lists to test the results

ingInputList = [
    "Chicken, Parmesan, Breadcrumbs",  # something familiar
    # "Artichoke Pesto",
    # "Chicken thighs, potatoes",  # compare results of potatoes vs potato
    # "Chicken thighs, potato",
    "Okra",  # single ingredient
    # "Bhindi",  # unknown ingredient - does not exist in the vocabulary
    # "Peanut butter",  # This is a 2 word ingredient
    # "Eggs, Egg, Eggs, Egg"
]

for ingInput in ingInputList:
    print(f"\n Input ingredients: {ingInput}")
    # Convert the string to a series
    ingInputSeries = pd.Series(ingInput)

    # Let's try to use the vectorizer on this
    ingTransformed = vectorizer.transform(ingInputSeries)

    # pass this to NearestNeighbors trained model
    distOfRes, indicesOfRes = model.kneighbors(ingTransformed)

    # print the output
    print("\n Result")

    for i in range(0, 11):  # TODO: 11 should be made configurable and match the n-neighbors number
        name = df.loc[indicesOfRes[0][i], ['title']].values[0]
        distance = (distOfRes[0][i]).round(3)
        rating = df.loc[indicesOfRes[0][i], ['rating']].values[0]

        # print(f"{name}  :  {distance}")
        print(f"{name}  :  {distance}  :  {rating}")
        printCategories(name)



 Input ingredients: Chicken, Parmesan, Breadcrumbs

 Result
Parmesan Chicken with Mixed Baby Greens  :  0.496  :  4.375
Parmesan Chicken with Mixed Baby Greens
[["['Cheese', 'Chicken', 'Leafy Green', 'Poultry', 'Sauté', 'Quick & Easy', 'Parmesan', 'Fall', 'Lettuce', 'Bon Appétit']"]]
Parmesan Polenta  :  0.52  :  4.375
Parmesan Polenta
[["['Side', 'Quick & Easy', 'Parmesan', 'Cornmeal', 'Bon Appétit']"]]
Chicken Soup  :  0.592  :  3.75
Chicken Soup
[["['Soup/Stew', 'Chicken', 'Poultry', 'Vegetable', 'Passover', 'Low/No Sugar', 'Carrot', 'Parsnip', 'Spring', 'Kosher', 'Dill', 'Simmer']"]]
Fettucine with Chicken and Bell Pepper Cream Sauce  :  0.611  :  4.375
Fettucine with Chicken and Bell Pepper Cream Sauce
[["['Milk/Cream', 'Chicken', 'Pasta', 'Sauté', 'Basil', 'Bell Pepper', 'Summer', 'Bon Appétit']"]]
Quick Baked Chicken Parmesan  :  0.625  :  5.0
Quick Baked Chicken Parmesan
[["['22-Minute Meals', 'Chicken', 'Parmesan', 'Tomato', 'Mozzarella', 'Quick & Easy']"]]
Parmesan-Crusted L

In [32]:
printIngredients("Onion Oil")
printIngredients("Red-Chile Oil")
printIngredients("Beef Pot Stickers")

Onion Oil
[["['1 3/4 cups peanut oil', '1 pound onions, very thinly sliced (4 cups)']"]]
Red-Chile Oil
[["['1/4 cup peanut oil', '1 tablespoon dried hot red-pepper flakes']"]]
Beef Pot Stickers
[["['1 3/4 to 2 cups all-purpose flour', '3/4 cup boiling-hot water', '1/4 lb ground beef chuck (1/2 cup)', '1 1/2 tablespoons soy sauce', '1 tablespoon Asian sesame oil', '1 tablespoon peanut oil', '1 tablespoon minced peeled fresh ginger', '1 teaspoon Chinese sweet bean paste', '2 cups finely chopped yellow or green garlic chives (6 oz)', '1 tablespoon peanut oil', '1/3 cup warm water', 'Special equipment: a 6-inch (3/4-inch-diameter) rolling pin or dowel']"]]


In [33]:
printIngredients("Red-Chile Oil")

Red-Chile Oil
[["['1/4 cup peanut oil', '1 tablespoon dried hot red-pepper flakes']"]]


Stored the results in https://docs.google.com/spreadsheets/d/1hkekdJCZBJqC5hkHKQFFM8sBONUf3CIsv9ekPxqa1AQ/edit?gid=0#gid=0 for easier readability

Results are in the right direction, but:  
- we need to do more tokenization or handle plurals in some way (e.g. 'potatoes' vs. 'potato').
- we may want to add phrases like 'chicken thighs' and 'egg yolks' to the tokenizer in some way.
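One possible way to handle the multi-word ingredients mentioned above is to join known phrases into single tokens before vectorizing. The sketch below is hypothetical: `merge_phrases` and the short `PHRASES` list are assumptions, not part of the existing tokenizer, and whether the joined token survives depends on how the downstream tokenizer treats underscores.

```python
import re

# Hand-picked phrases to treat as one ingredient (illustrative list)
PHRASES = ["chicken thighs", "egg yolks"]

def merge_phrases(text, phrases=PHRASES):
    """Lowercase the text and replace each known phrase with an underscore-joined token."""
    merged = text.lower()
    for phrase in phrases:
        pattern = r"\b" + re.escape(phrase) + r"\b"  # match whole words only
        merged = re.sub(pattern, phrase.replace(" ", "_"), merged)
    return merged

print(merge_phrases("8 small boneless Chicken Thighs, 2 egg yolks"))
```

This keeps the vocabulary under manual control, which fits the earlier observation that blanket bigrams made results worse.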

#### Matching on "exclude list of ingredients" as Input

In [34]:
ingInputList = [
    "Chicken, Parmesan, Breadcrumbs",  # something familiar
]

for ingInput in ingInputList:
    print(f"\n Input ingredients: {ingInput}")
    # Convert the string to a series
    ingInputSeries = pd.Series(ingInput)

    # Let's try to use the vectorizer on this
    ingTransformed = (vectorizer.transform(ingInputSeries)) * -1


    # pass this to NearestNeighbors trained model
    distOfRes, indicesOfRes = model.kneighbors(ingTransformed)

    # print the output
    print("\n Result")

    for i in range(0, 11):  # TODO: 11 should be made configurable and match the n-neighbors number
        name = df.loc[indicesOfRes[0][i], ['title']].values[0]
        distance = (distOfRes[0][i]).round(3)
        rating = df.loc[indicesOfRes[0][i], ['rating']].values[0]

        # print(f"{name}  :  {distance}")
        print(f"{name}  :  {distance}  :  {rating}")

    # print ingredient lists to verify
    # for i in range(0, 11):  # TODO: 11 should be made configurable and match the n-neighbors number
    #     name = df.loc[indicesOfRes[0][i], ['title']].values[0]
    #     printIngredients(name)


 Input ingredients: Chicken, Parmesan, Breadcrumbs

 Result
Double-Chocolate Almond Brownies  :  1.0  :  3.75
Sauteed Radishes and Sugar Snap Peas with Dill  :  1.0  :  3.75
Mom's Noodle Kugel  :  1.0  :  3.125
Spicy Tomato Chutney  :  1.0  :  4.375
Pomegranate Molasses-Glazed Carrots  :  1.0  :  4.375
Coconut Pineapple Cake  :  1.0  :  3.75
Dried Apple and Cheddar Strudel  :  1.0  :  3.75
Sweet-Potato Coconut Purée  :  1.0  :  4.375
Spiced Brown Butter Linzer Cookies  :  1.0  :  0.0
Fruit Salad with Ginger Syrup  :  1.0  :  4.375
Skillet Polenta with Tomatoes and Gorgonzola  :  1.0  :  4.375


This worked, but one recipe listed 'bread crumbs' instead of 'breadcrumbs', so it made it through in the output.
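The distances of exactly 1.0 follow from the arithmetic of negating the query. Cosine distance is 1 - cos(q, x), so for a negated query it becomes 1 + cos(q, x). Since count vectors are non-negative, the smallest possible value is 1.0, reached by every recipe that shares no token with the excluded list. The sketch below checks this on invented vectors (the vocabulary and recipes are assumptions).

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine similarity."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical vocabulary: [chicken, parmesan, breadcrumbs, sugar]
query = np.array([1.0, 1.0, 1.0, 0.0])      # ingredients to exclude
with_all = np.array([1.0, 1.0, 1.0, 0.0])   # contains every excluded ingredient
with_one = np.array([1.0, 0.0, 0.0, 1.0])   # contains one excluded ingredient
with_none = np.array([0.0, 0.0, 0.0, 1.0])  # contains none of them

# Negating the query turns "most similar" into "least similar"
for name, recipe in [("all", with_all), ("one", with_one), ("none", with_none)]:
    print(name, round(cosine_distance(-query, recipe), 3))
```

All recipes without any excluded ingredient tie at 1.0, so which of them appear in the top 11 is essentially arbitrary.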

#### Matching on 'include ingredients' + 'exclude ingredients' as input

In [67]:
yes_ing_series = pd.Series("Eggs Egg")
no_ing_series = pd.Series("Spinach")
# NOTE RESULTS FOR THIS COMBO ARE VERY WEIRD

yes_ing_tx = vectorizer.transform(yes_ing_series)
no_ing_tx = (vectorizer.transform(no_ing_series)) * -1  # try increasing to -10 and see what happens

updated_ing_tx = yes_ing_tx + no_ing_tx

distOfRes, indicesOfRes = model.kneighbors(updated_ing_tx)

# print the output
print("\n Result")

for i in range(0, 11):  # TODO: 11 should be made configurable and match the n-neighbors number
    name = df.loc[indicesOfRes[0][i], ['title']].values[0]
    distance = (distOfRes[0][i]).round(3)
    rating = df.loc[indicesOfRes[0][i], ['rating']].values[0]

    # print(f"{name}  :  {distance}")
    print(f"{name}  :  {distance}  :  {rating}")


 Result
Scrambled Eggs  :  0.529  :  3.75
Lemon Crème Caramel  :  0.592  :  3.75
Sherry and Egg  :  0.592  :  1.25
Orange Crème Caramels  :  0.615  :  3.75
Lemon Sabayon with Grapefruit  :  0.635  :  5.0
Chocolate and Vanilla Parfaits with Cherry Sauce  :  0.635  :  5.0
Pasta Dough for Agnolotti  :  0.635  :  4.375
Ginger Custard  :  0.635  :  3.75
Cheese Ravioli with Fresh Tomato Sauce  :  0.639  :  3.75
Maple Pumpkin Pie  :  0.639  :  4.375
Lemon Zabaglione  :  0.652  :  5.0


Seems to be working, although I am concerned about the distance values (they keep getting larger as I exclude more ingredients).  
Results stored in https://docs.google.com/spreadsheets/d/1hkekdJCZBJqC5hkHKQFFM8sBONUf3CIsv9ekPxqa1AQ/edit?gid=0#gid=0
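A possible explanation for the growing distances: every excluded ingredient adds a non-zero (negative) entry to the query vector, which increases the query's norm. Cosine distance divides by that norm, so the distance to a recipe goes up even when the recipe does not contain the excluded ingredient at all. The toy example below uses a made-up three-word vocabulary and plain Python rather than the notebook's `vectorizer` and `model`.

```python
import math

def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy vocabulary (an assumption for illustration): [egg, milk, spinach]
recipe = [1, 1, 0]            # contains egg and milk, no spinach

include_only = [1, 0, 0]      # query: +egg
with_exclude = [1, 0, -1]     # query: +egg, -1 * spinach
heavy_exclude = [1, 0, -10]   # query: +egg, -10 * spinach

d_include = cosine_distance(recipe, include_only)
d_exclude = cosine_distance(recipe, with_exclude)
d_heavy = cosine_distance(recipe, heavy_exclude)

print(round(d_include, 3), round(d_exclude, 3), round(d_heavy, 3))
```

For recipes that contain none of the excluded ingredients, the extra norm is the same scale factor for all of them, so their relative order should not change; only the absolute distance values grow. Recipes that do contain an excluded ingredient get a smaller (possibly negative) dot product and are pushed further away.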

In [63]:
printIngredients("Chicken Soup")
printRecipeSteps("Chicken Soup")

Chicken Soup
[['[\'1 pound chicken parts\', \'2 stalks celery, including leafy tops, cut into 3-inch pieces\', \'1 whole chicken, thoroughly rinsed\', \'Salt to rub inside chicken\', \'1 large whole onion, unpeeled (find one with a firm, golden-brown peel)\', \'1 large whole carrot, peeled\', \'1 medium whole parsnip, peeled\', \'2 teaspoons salt\', \'1/4 teaspoon pepper\', \'1 bunch of dill, cleaned and tied with a string\', "Note: The Deli\'s recipe calls for both a whole chicken plus 1 pound of chicken parts. You can, however, use just 1 large chicken and cut off both wings, the neck, and a leg to use as parts."]']]
Chicken Soup
[['[\'1. Pour 12 cups of cold water into a large stockpot, and throw in the chicken parts and celery. Bring to a boil. While water is heating, rub the inside of the whole chicken with salt.\', "2. Add the chicken to the pot, cover, reduce heat, and simmer for 30 minutes. Test chicken with a fork to see if it\'s tender and fully cooked; then remove it from th

#### Trying a vectorizer with a stemmer

In [36]:
import nltk
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  # public import path

# Remove units of measurements such as teaspoons, cups, ounces etc. Full list at https://en.wikibooks.org/wiki/Cookbook:Units_of_measurement
measurements = set(line.strip() for line in open('../data/interim/measurement_list.txt'))

# Remove extra adjectives like 'baked', 'thawed', 'cleaned' etc.
extra_adjectives = set(line.strip() for line in open('../data/interim/extra_adjectives_list.txt'))

# Remove some extra words like 'assorted', 'approximately' etc. QUESTION: Is there a smart way to remove the top 100 such words?
extra_words = set(line.strip() for line in open('../data/interim/extra_words_list.txt'))

my_stops = set(ENGLISH_STOP_WORDS) | measurements | extra_adjectives | extra_words
stemmer = nltk.stem.PorterStemmer()

# from 0531_Text_Data
def my_tokenizer_with_stemmer(text):

    # convert to lowercase
    text = text.lower()
    # break into characters and weed out punctuation etc.  (include space!)
    chars = list(char for char in text if char in "abcdefghijklmnopqrstuvwxyz ")
    # make back into a single string
    text = "".join(chars)

    # split sentence into words
    listofwords = text.split(' ')
    listofstemmed_words = []

    # remove stopwords and any tokens that are just empty strings
    # NOTE: only ENGLISH_STOP_WORDS is checked here; my_stops (built above) is never applied
    for word in listofwords:
        if word and word not in ENGLISH_STOP_WORDS:
            # Stem words
            stemmed_word = stemmer.stem(word)
            listofstemmed_words.append(stemmed_word)

    return listofstemmed_words

In [37]:
vect_stemmed = CountVectorizer(tokenizer=my_tokenizer_with_stemmer,
                       min_df=5)
ingredients_matrix_stemmed = vect_stemmed.fit_transform(df['ingredientsStr'])

# THIS IS 12 seconds slower than the non-stemmed tokenizer :-(



In [38]:
from sklearn.neighbors import NearestNeighbors

model_stemmed = NearestNeighbors(n_neighbors=11, metric='cosine')
model_stemmed.fit(ingredients_matrix_stemmed)


In [39]:
yes_ing_series = pd.Series("Okra")
no_ing_series = pd.Series("Tomato Cilantro")

yes_ing_tx = vect_stemmed.transform(yes_ing_series)
no_ing_tx = (vect_stemmed.transform(no_ing_series)) * -1

updated_ing_tx = yes_ing_tx + no_ing_tx

distOfRes, indicesOfRes = model_stemmed.kneighbors(updated_ing_tx)

# print the output
print("\n Result")

for i in range(indicesOfRes.shape[1]):  # one column per neighbor, so this always matches n_neighbors
    name = df.loc[indicesOfRes[0][i], ['title']].values[0]
    distance = (distOfRes[0][i]).round(3)
    rating = df.loc[indicesOfRes[0][i], ['rating']].values[0]

    # print(f"{name}  :  {distance}")
    print(f"{name}  :  {distance}  :  {rating}")

# THE RESULTS ARE TOTALLY MESSY ! NO NEED TO STEM THIS WAY


 Result
Chicken, Sausage, and Okra Gumbo  :  0.868  :  2.5
Okra with Scallion, Lime, and Ginger  :  0.88  :  3.75
Crisp Okra in Yogurt Sauce  :  0.938  :  0.0
Chive Shortcakes with Smoky Corn and Okra Stew  :  0.941  :  3.125
Vegetable Rundown  :  0.958  :  3.125
Okra Beignets with Cilantro Sour Cream Sauce  :  0.959  :  3.75
Gallette of Sweet Potato-Crusted Tobago Crab Cake  :  0.97  :  3.75
Frozen Mango Daiquiri  :  1.0  :  4.375
Cauliflower Maque Choux  :  1.0  :  3.125
Brined and Barbecued Turkey  :  1.0  :  4.375
Plum and Red-Wine Sorbet  :  1.0  :  3.75


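The results with the stemmer are very messy: the best match is already at distance 0.868, the last four results are at distance 1.0 (zero cosine similarity with the query), and a recipe with cilantro in its title still shows up even though cilantro was excluded. Not good.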

### Trying NearestNeighbors on categoriesStr

This section is a work in progress to see whether we can use the categoriesStr for anything. I have not found anything that works properly so far.

#### Vectorize categoriesStr using the default tokenizer

In [40]:
# NOTE Using default tokenizer here
vect2 = CountVectorizer(stop_words="english")
vect2.fit(df['categoriesStr'])
vocab_categories = vect2.get_feature_names_out()
categories_matrix = vect2.transform(df['categoriesStr'])

In [41]:
len(vocab_categories)

701

In [None]:
vocab_categories

In [43]:
type(vect2.vocabulary_)

dict

In [44]:
categories_matrix.shape

(14526, 701)

In [45]:
# Most common categories

sum_cat_words = categories_matrix.sum(axis = 0)
words_freq_cat = [(word, sum_cat_words[0, i]) for word, i in vect2.vocabulary_.items()]
words_freq_cat = sorted(words_freq_cat, key = lambda x: x[1], reverse = True)
words_freq_cat[0:100]

[('free', 23228),
 ('bon', 6823),
 ('appétit', 6816),
 ('peanut', 6289),
 ('soy', 6237),
 ('nut', 6109),
 ('tree', 5180),
 ('gourmet', 5143),
 ('vegetarian', 4971),
 ('kosher', 4552),
 ('pescatarian', 4430),
 ('sugar', 4164),
 ('quick', 3823),
 ('easy', 3782),
 ('wheat', 3545),
 ('gluten', 3521),
 ('bake', 3489),
 ('dairy', 3379),
 ('summer', 3034),
 ('friendly', 2947),
 ('dessert', 2939),
 ('winter', 2365),
 ('cream', 2332),
 ('fall', 2294),
 ('added', 2186),
 ('fruit', 2088),
 ('cheese', 1834),
 ('dinner', 1798),
 ('onion', 1741),
 ('low', 1731),
 ('conscious', 1714),
 ('kidney', 1677),
 ('vegetable', 1665),
 ('sauté', 1586),
 ('party', 1584),
 ('milk', 1583),
 ('tomato', 1581),
 ('pepper', 1495),
 ('egg', 1342),
 ('herb', 1271),
 ('kid', 1270),
 ('vegan', 1241),
 ('spring', 1229),
 ('salad', 1221),
 ('garlic', 1213),
 ('healthy', 1209),
 ('chill', 1162),
 ('cocktail', 1122),
 ('grill', 1115),
 ('thanksgiving', 1082),
 ('potato', 1055),
 ('stew', 1013),
 ('appetizer', 1009),
 ('chick

#### Run NearestNeigbor

In [46]:
model_cat = NearestNeighbors(n_neighbors=11, metric='cosine')
model_cat.fit(categories_matrix)
distances_cat1, indices_cat1 = model_cat.kneighbors(categories_matrix[recipe_index])
recipe_cat_titles1 = []
for idc1 in indices_cat1[0]:
    recipe_cat_titles1.append(df.loc[idc1, ['title']])
print(recipe_cat_titles1)


[title    Chicken Parmesan
Name: 4918, dtype: object, title    New Chicken Parmesan
Name: 9982, dtype: object, title    Chicken Schnitzel with Chile Cherry Tomatoes a...
Name: 7219, dtype: object, title    Chorizo Bolognese with Buffalo Mozzarella
Name: 6822, dtype: object, title    Broiled Chicken, Romaine, and Tomato Bruschetta
Name: 4994, dtype: object, title    Green Mountain Maple Barbecued Chicken
Name: 292, dtype: object, title    Baked Beans with Slab Bacon and Breadcrumbs
Name: 139, dtype: object, title    Turkey Burritos with Salsa and Cilantro
Name: 2568, dtype: object, title    Eggplant Parmesan With Fresh Mozzarella
Name: 10016, dtype: object, title    Lemon Chicken Cutlets
Name: 4441, dtype: object, title    Fried-Egg Caesar with Sun-Dried Tomatoes and P...
Name: 9787, dtype: object]


In [47]:
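# THIS IS NOT WORKING PROPERLY
# Likely cause: "vegeterian" is misspelled (the vocabulary token is "vegetarian"),
# so the query vector is all zeros and every cosine distance comes out as 1.0
catInput = "vegeterian"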
catInputSeries = pd.Series(catInput)
catInputTransformed = vect2.transform(catInputSeries)
dcat, icat = model_cat.kneighbors(catInputTransformed)
# print the output
print("\n Result")

for i in range(icat.shape[1]):  # one column per neighbor, so this always matches n_neighbors
    namecat = df.loc[icat[0][i], ['title']].values[0]
    distancecat = (dcat[0][i]).round(3)
    ratingcat = df.loc[icat[0][i], ['categoriesStr']].values[0]

    print(f"{namecat}  :  {distancecat}")
    # print(f"{namecat}  :  {distancecat}  :  {ratingcat}")


 Result
Sour Cream Chocolate Cake  :  1.0
Zucchini-Blossom Quesadillas  :  1.0
Summer Seafood Stew  :  1.0
Baked Sea Bass with Walnut-Breadcrumb Crust and Lemon-Dill Sauce  :  1.0
Maple-Walnut Espresso Torte  :  1.0
Turkey Cheddar Sandwiches with Honey Mustard  :  1.0
Poached Eggs in Pipérade  :  1.0
Farfalle with Tomatoes and Feta Cheese  :  1.0
Potato, Cucumber and Dill Salad  :  1.0
Chocolate and Mixed Nut Tart in Cookie Crust  :  1.0
Scallop Potatoes with Gouda and Fennel  :  1.0


In [78]:
distances_cat2, indices_cat2 = model_cat.kneighbors(categories_matrix[recipe_index_2])
recipe_cat_titles2 = []
for idc2 in indices_cat2[0]:
    recipe_cat_titles2.append(df.loc[idc2, ['title']])
print(recipe_cat_titles2)
for recipe in recipe_cat_titles2:
    printCategories(recipe.iloc[0])  # positional access; recipe[0] on a label-indexed Series is deprecated

[title    Artichoke and Parmesan Risotto
Name: 14522, dtype: object, title    Poached Salmon with Artichoke Confit
Name: 1848, dtype: object, title    Roast Chicken with Rosemary, Lemon, and Honey
Name: 11438, dtype: object, title    Potato Salad with 7-Minute Eggs and Mustard Vi...
Name: 11357, dtype: object, title    Roast Chicken With Harissa And Schmaltz
Name: 12453, dtype: object, title    Slow-Roasted Char with Fennel Salad
Name: 1021, dtype: object, title    Beans with Kale and Portuguese Sausage
Name: 13014, dtype: object, title    Pot-Roasted Artichokes With White Wine and Capers
Name: 7465, dtype: object, title    Roasted Asparagus and Baby Artichokes with Lem...
Name: 6549, dtype: object, title    Dai Due's Master Brined Chicken
Name: 5189, dtype: object, title    Milk Pudding with Rose Water Caramel and Figs
Name: 2105, dtype: object]
Artichoke and Parmesan Risotto
[["['Side', 'Kid-Friendly', 'High Fiber', 'Dinner', 'Parmesan', 'Artichoke', 'Spring', 'Summer', 'Simmer', 'Bo



## Summary

- To check whether there is a relation between ingredient lists and ratings, we ran a Logistic Regression. It predicts a high rating (> 4.0) with only 59.66% accuracy, which is slightly better than the 54% baseline.

- We tried TFIDF and CountVectorizer with the default tokenizer and found that CountVectorizer suits our purpose better. TFIDF gives higher weights to words that occur in fewer recipes, which is not what we want.

- As a first step toward our recommendation system, we wanted to answer the question "Given a recipe name, can we find the most similar recipes in the dataset?" We tried 2 ways to do this:
    - Similarity matrix
        - We followed the steps shown in the `Recommendation Systems` notebook. During this process, we found that a custom tokenizer gives us better results, especially when we remove the measurement values (cups, teaspoons etc.) from the vocabulary.
        - We built a custom tokenizer by iteratively comparing results of the similarity matrix.
    - NearestNeighbors
        - We figured out how to use this and saw that the results were the same as with the Similarity matrix method, but with far fewer headaches while coding. We will use this method going forward.

- We were able to find similar recipes satisfactorily. A few issues that need further investigation / fixing:
    - handling plurals - potato / potatoes
    - handling 2 word ingredients - "peanut butter"

- Next, we updated the code to accept a list of ingredients instead of a recipe name, and added an option to exclude ingredients. The quality of these results needs more investigation. This is a work in progress.
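The observation that the Similarity matrix and NearestNeighbors methods return the same results can be checked on a small example. The sketch below uses a made-up 4 x 4 count matrix instead of the notebook's ingredient matrix, and assumes the cosine metric for both methods.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors

# Toy ingredient-count matrix (illustrative data): rows are recipes, columns are terms.
X = np.array([
    [1, 1, 0, 0],   # recipe 0
    [1, 1, 1, 0],   # recipe 1
    [0, 1, 1, 1],   # recipe 2
    [0, 0, 0, 1],   # recipe 3
], dtype=float)

query = 0

# Method 1: full similarity matrix, then sort one row from most to least similar.
sims = cosine_similarity(X)[query]
order_from_matrix = np.argsort(-sims, kind="stable")

# Method 2: NearestNeighbors with cosine distance (1 - cosine similarity).
nn = NearestNeighbors(n_neighbors=X.shape[0], metric="cosine").fit(X)
dist, idx = nn.kneighbors(X[[query]])

print(order_from_matrix.tolist(), idx[0].tolist())
print(np.round(1 - sims[order_from_matrix], 3), np.round(dist[0], 3))
```

The main practical difference is memory: the similarity matrix stores one value per pair of recipes, while `kneighbors` only returns the requested neighbors for each query.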

## Next Steps and Questions

In [52]:
vocab = vectorizer.get_feature_names_out()

In [53]:
vocab

array(['achiote', 'acorn', 'adobo', ..., 'zinfandel', 'ziti', 'zucchini'],
      dtype=object)

In [54]:
custom_list = list(vocab) + ["peanut butter"]
custom_list

['achiote',
 'acorn',
 'adobo',
 'african',
 'agave',
 'aged',
 'ahi',
 'aji',
 'ale',
 'aleppo',
 'allbutter',
 'allpurpose',
 'allspice',
 'almond',
 'almonds',
 'alsatian',
 'amaranth',
 'amaretti',
 'amaretto',
 'amarillo',
 'amber',
 'american',
 'anaheim',
 'ancho',
 'anchovies',
 'anchovy',
 'andor',
 'andouille',
 'angel',
 'anglaise',
 'angostura',
 'anise',
 'aniseed',
 'aniseflavored',
 'anjou',
 'annatto',
 'aperol',
 'apple',
 'applejack',
 'apples',
 'applesauce',
 'applewoodsmoked',
 'apricot',
 'apricots',
 'aquavit',
 'arbol',
 'arborio',
 'arctic',
 'area',
 'armagnac',
 'arrowroot',
 'artichoke',
 'artichokes',
 'arugula',
 'asiago',
 'asian',
 'asparagus',
 'avocado',
 'avocados',
 'backbone',
 'backbones',
 'backs',
 'bacon',
 'baguette',
 'baguettes',
 'bakers',
 'balm',
 'balsamic',
 'bamboo',
 'banana',
 'bananas',
 'bar',
 'barbecue',
 'barley',
 'bars',
 'bartlett',
 'base',
 'basic',
 'basil',
 'basmati',
 'bass',
 'bay',
 'bbq',
 'bean',
 'beans',
 'beanthre

Tokenization questions  
- Is there a way to manually include two-word ingredients in the vocabulary? For example, "peanut butter" is vastly different from just "peanuts".
- How to handle plurals? Stemming is slow and didn't really help.
- Interesting article about tokenization - https://www.kaggle.com/code/shivanirana63/beginner-s-guide-to-word-tokenization
- Can I use categories to pick vegetarian recipes too? From categories, how to tokenize ONLY specific ones like 'vegetarian', 'pescatarian', 'wheat', 'gluten' ?

Ranking  
- When several recipes have the same distance, how should they be ranked?
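One simple possibility is to break ties with the rating: sort by distance ascending, then by rating descending. Using the rating as the tie-breaker is an assumption, not something decided in this notebook. The rows below are copied from the include/exclude output earlier in this notebook.

```python
# (title, cosine distance, rating) rows taken from the results printed earlier.
results = [
    ("Lemon Crème Caramel", 0.592, 3.75),
    ("Sherry and Egg", 0.592, 1.25),
    ("Lemon Sabayon with Grapefruit", 0.635, 5.0),
    ("Pasta Dough for Agnolotti", 0.635, 4.375),
    ("Ginger Custard", 0.635, 3.75),
]

# Primary key: smaller distance first. Tie-breaker: higher rating first.
ranked = sorted(results, key=lambda row: (row[1], -row[2]))

for title, distance, rating in ranked:
    print(f"{title}  :  {distance}  :  {rating}")
```

Because distances are floats, ties may need rounding (for example to 3 decimals, as in the printed output) before the comparison treats two recipes as equally close.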

General  
- Should I create a column with the number of ingredients? May not be as straightforward.
- Do fuzzy matching or spell checking on the input, using the vectorizer's vocabulary as the dictionary.
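For the fuzzy-matching idea, the standard library's `difflib.get_close_matches` may be enough as a first attempt. The sketch below uses a tiny made-up vocabulary; in the notebook the list would come from `vectorizer.get_feature_names_out()`. The `suggest` helper and the 0.8 cutoff are illustrative choices.

```python
import difflib

# Stand-in vocabulary (assumption); in practice use the vectorizer's feature names.
vocabulary = ["egg", "spinach", "vegan", "vegetable", "vegetarian"]

def suggest(token, vocabulary, cutoff=0.8):
    """Return the token itself if known, else the closest vocabulary entry, else None."""
    token = token.lower()
    if token in vocabulary:
        return token
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for raw in ["vegeterian", "Egg", "xyzzy"]:
    print(raw, "->", suggest(raw, vocabulary))
```

Returning `None` for tokens with no close match makes it possible to warn the user instead of silently querying with an all-zero vector.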