TODO: Add Title
TODO: Add TOC
TODO: Add Goals

In this notebook, we load the cleaned dataset created in the first notebook (the cell below reads full_recipes_cleaned_2.csv) and preprocess it for modeling.

In [81]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

from sklearn.feature_extraction.text import CountVectorizer


In [82]:
f"pandas version: {pd.__version__}"

'pandas version: 2.1.4'

## Load the cleaned data

In [83]:
df = pd.read_csv('../data/interim/full_recipes_cleaned_2.csv')
df.shape

(14526, 7)

In [84]:
df.head()

Unnamed: 0,recipeId,calories,rating,title,directionsStr,categoriesStr,ingredientsStr
0,0,426.0,2.5,"Lentil, Apple, and Turkey Wrap","['1. Place the stock, lentils, celery, carrot,...","['Sandwich', 'Bean', 'Fruit', 'Tomato', 'turke...",['4 cups low-sodium vegetable or chicken stock...
1,1,403.0,4.375,Boudin Blanc Terrine with Red Onion Confit,['Combine first 9 ingredients in heavy medium ...,"['Food Processor', 'Onion', 'Pork', 'Bake', 'B...","['1 1/2 cups whipping cream', '2 medium onions..."
2,2,165.0,3.75,Potato and Fennel Soup Hodge,['In a large heavy saucepan cook diced fennel ...,"['Soup/Stew', 'Dairy', 'Potato', 'Vegetable', ...","['1 fennel bulb (sometimes called anise), stal..."
3,4,547.0,3.125,Spinach Noodle Casserole,['Preheat oven to 350°F. Lightly grease 8x8x2-...,"['Cheese', 'Dairy', 'Pasta', 'Vegetable', 'Sid...","['1 12-ounce package frozen spinach soufflé, t..."
4,5,948.0,4.375,The Best Blts,"['Mix basil, mayonnaise and butter in processo...","['Sandwich', 'Food Processor', 'Tomato', 'Kid-...",['2 1/2 cups (lightly packed) fresh basil leav...


In [85]:
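# Preview only: the result is not assigned, so df keeps its default RangeIndex.
# Later cells rely on that index matching the row positions of the sparse matrices.
df.set_index('recipeId')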

Unnamed: 0_level_0,calories,rating,title,directionsStr,categoriesStr,ingredientsStr
recipeId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,426.0,2.500,"Lentil, Apple, and Turkey Wrap","['1. Place the stock, lentils, celery, carrot,...","['Sandwich', 'Bean', 'Fruit', 'Tomato', 'turke...",['4 cups low-sodium vegetable or chicken stock...
1,403.0,4.375,Boudin Blanc Terrine with Red Onion Confit,['Combine first 9 ingredients in heavy medium ...,"['Food Processor', 'Onion', 'Pork', 'Bake', 'B...","['1 1/2 cups whipping cream', '2 medium onions..."
2,165.0,3.750,Potato and Fennel Soup Hodge,['In a large heavy saucepan cook diced fennel ...,"['Soup/Stew', 'Dairy', 'Potato', 'Vegetable', ...","['1 fennel bulb (sometimes called anise), stal..."
4,547.0,3.125,Spinach Noodle Casserole,['Preheat oven to 350°F. Lightly grease 8x8x2-...,"['Cheese', 'Dairy', 'Pasta', 'Vegetable', 'Sid...","['1 12-ounce package frozen spinach soufflé, t..."
5,948.0,4.375,The Best Blts,"['Mix basil, mayonnaise and butter in processo...","['Sandwich', 'Food Processor', 'Tomato', 'Kid-...",['2 1/2 cups (lightly packed) fresh basil leav...
...,...,...,...,...,...,...
20125,28.0,3.125,Parmesan Puffs,['Beat whites in a bowl with an electric mixer...,"['Mixer', 'Cheese', 'Egg', 'Fry', 'Cocktail Pa...","['2 large egg whites', '3 oz Parmigiano-Reggia..."
20126,671.0,4.375,Artichoke and Parmesan Risotto,['Bring broth to simmer in saucepan.Remove fro...,"['Side', 'Kid-Friendly', 'High Fiber', 'Dinner...",['5 1/2 cups (or more) low-salt chicken broth'...
20127,563.0,4.375,Turkey Cream Puff Pie,"['Using a sharp knife, cut a shallow X in bott...","['Onion', 'Poultry', 'turkey', 'Vegetable', 'B...","['1 small tomato', '1 small onion, finely chop..."
20128,631.0,4.375,Snapper on Angel Hair with Citrus Cream,['Heat 2 tablespoons oil in heavy medium skill...,"['Milk/Cream', 'Citrus', 'Dairy', 'Fish', 'Gar...","['4 tablespoons olive oil', '4 shallots, thinl..."


In [86]:
# confirm that there are no null values or duplicated values
print(f"Null values: {df.isna().sum().sum()}")
print(f"Duplicated rows: {df.duplicated().sum()}")

Null values: 0
Duplicated rows: 0


## Vectorize

### Define a custom tokenizer

In [87]:
from pathlib import Path

def load_word_set(path):
    """Read one word per line from a text file into a set, skipping blank lines."""
    return {line.strip() for line in Path(path).read_text().splitlines() if line.strip()}

# Remove units of measurements such as teaspoons, cups, ounces etc. Full list at https://en.wikibooks.org/wiki/Cookbook:Units_of_measurement
measurements = load_word_set('../data/interim/measurement_list.txt')

# Remove extra adjectives like 'baked', 'thawed', 'cleaned' etc.
extra_adjectives = load_word_set('../data/interim/extra_adjectives_list.txt')

# Remove some extra words like 'assorted', 'approximately' etc. QUESTION: Is there a smart way to remove the top 100 such words?
extra_words = load_word_set('../data/interim/extra_words_list.txt')
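One possible answer to the question above is to rank tokens by document frequency (how many recipes they appear in) and review the top of that list by hand. The sketch below is pure Python on made-up token lists; `top_document_frequency` and `toy_docs` are hypothetical names, not part of this notebook.

```python
from collections import Counter

def top_document_frequency(docs_tokens, n=100):
    """Return the n tokens that appear in the most documents, with their counts."""
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))  # count each token at most once per document
    return doc_freq.most_common(n)

# Made-up tokenized ingredient lists, for illustration only
toy_docs = [
    ["fresh", "basil", "chopped", "tomato"],
    ["chopped", "onion", "fresh", "garlic"],
    ["chopped", "parsley"],
]
print(top_document_frequency(toy_docs, n=2))  # [('chopped', 3), ('fresh', 2)]
```

The list still needs a manual pass, since real ingredients such as salt are also very frequent. `CountVectorizer` also has a `max_df` parameter that drops terms above a document-frequency threshold, which may be worth trying as a built-in alternative.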

In [88]:
from sklearn.feature_extraction._stop_words import ENGLISH_STOP_WORDS

"""
custom tokenizer examples are in 0604_nlp_part2a_beta and 0531_text_vect_redux and 0531_Text_Data
"""
my_stops = set(ENGLISH_STOP_WORDS) | measurements | extra_adjectives | extra_words

def my_tokenizer(text):
    # convert to lowercase
    text = text.lower()
    # break into characters and weed out punctuation etc.  (include space!)
    chars = list(char for char in text if char in "abcdefghijklmnopqrstuvwxyz ")
    # make back into a single string
    text = "".join(chars)
    # break into words and weed out stop words and short words < 3 characters
    text = list(word for word in text.split() if word not in my_stops and len(word) >=3)
    return text

my_tokenizer(df['ingredientsStr'][1])

['whipping',
 'cream',
 'onions',
 'salt',
 'bay',
 'leaves',
 'cloves',
 'garlic',
 'clove',
 'pepper',
 'nutmeg',
 'thyme',
 'shallots',
 'boneless',
 'center',
 'pork',
 'loin',
 'sinew',
 'chunks',
 'chilled',
 'eggs',
 'purpose',
 'flour',
 'tawny',
 'port',
 'currants',
 'lettuce',
 'leaves',
 'peppercorns',
 'parsley',
 'bay',
 'leaves',
 'french',
 'bread',
 'baguette',
 'slices',
 'olive',
 'red',
 'onions',
 'halved',
 'currants',
 'red',
 'wine',
 'vinegar',
 'canned',
 'chicken',
 'broth',
 'thyme',
 'sugar']

### Vectorize ingredientsStr using the custom tokenizer

In [89]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(tokenizer=my_tokenizer,
                       min_df=5)
ingredients_matrix = vectorizer.fit_transform(df['ingredientsStr'])

#bigram_series = pd.Series(["chicken breast", "boneless chicken", "chicken stock",
#                           "egg yolks", "chicken thighs", "cocoa powder",
#                           "cilantro leaves", "boneless pork", "sundried tomatoes"])

# join this with the ingredientsStr series
# custom_vocab = pd.concat([df['ingredientsStr'],bigram_series], ignore_index=True)
# specifying bigrams made matters worse, so I will add only the most sensible bigrams manually
# I inspected the results using the code snippet labeled - checking word / bigram frequency

# ingredients_matrix = vectorizer.fit_transform(custom_vocab)
# vectorizer.fit(df['ingredientsStr'])
# ingredients_matrix = vectorizer.transform(df['ingredientsStr'])



In [90]:
vocab = vectorizer.get_feature_names_out()
len(vocab)

2158

In [91]:
# checking word / bigram frequency - used when ngrams was specified in the CountVectorizer
# sum_words = ingredients_matrix.sum(axis = 0)
# words_freq = [(word, sum_words[0, i]) for word, i in vectorizer.vocabulary_.items()]
# words_freq = sorted(words_freq, key = lambda x: x[1], reverse = True)
# words_freq = sorted(words_freq, key = lambda x: x[1])
# words_freq[1000:1500]

In [92]:
ingredients_matrix.shape

(14526, 2158)

In [93]:
ingredients_matrix

<14526x2158 sparse matrix of type '<class 'numpy.int64'>'
	with 301451 stored elements in Compressed Sparse Row format>

In [94]:
# 'Artichoke and Parmesan Risotto' and 'Chicken Parmesan' are two recipes we will use to test for now
df[df['title'].str.contains('Chicken Parmesan', na=False)]

Unnamed: 0,recipeId,calories,rating,title,directionsStr,categoriesStr,ingredientsStr
4918,6368,1842.0,5.0,Chicken Parmesan,['Place breadcrumbs and flour in 2 separate sh...,"['Chicken', 'Tomato', 'Broil', 'Kid-Friendly',...","['2 cups fine dry breadcrumbs', '1 cup all-pur..."
7482,9873,3392.0,3.75,Chicken Parmesan Heros,['Heat olive oil in a 4- to 5-quart heavy sauc...,"['Sandwich', 'Cheese', 'Chicken', 'Poultry', '...","['3 tablespoons olive oil', '1 small onion, fi..."
9982,13372,610.0,4.375,New Chicken Parmesan,['Preheat oven to 500° F. Whisk first 3 ingred...,"['Chicken', 'Tomato', 'Roast', 'Kid-Friendly',...","['1/3 cup extra-virgin olive oil', '2 large ga..."
12825,17557,917.0,5.0,Quick Baked Chicken Parmesan,['Arrange racks in top and bottom of oven and ...,"['22-Minute Meals', 'Chicken', 'Parmesan', 'To...","['2 large eggs', '1 1/2 cups breadcrumbs or pa..."


In [95]:
df[df['title'].str.contains('Chicken', na=False)]

Unnamed: 0,recipeId,calories,rating,title,directionsStr,categoriesStr,ingredientsStr
25,35,625.0,3.750,Aztec Chicken,['Melt 2 tablespoons butter with vegetable oil...,"['Chicken', 'Olive', 'Onion', 'Sauté', 'Dinner...","['6 tablespoons (3/4 stick) chilled butter', '..."
37,53,1203.0,5.000,Pancetta Roast Chicken with Walnut Stuffing,['Preheat oven to 400°F. Melt 1/4 cup butter i...,"['Chicken', 'Roast', 'High Fiber', 'Dinner', '...","['8 tablespoons (1 stick) butter, divided', 'C..."
61,80,1172.0,4.375,"Braised Chicken and Rice with Orange, Saffron,...",['Rinse the rice in a sieve under cold running...,"['Chicken', 'Citrus', 'Fruit', 'Nut', 'Poultry...","['1 1/2 cups brown basmati rice', '1/4 cup oli..."
63,82,682.0,4.375,Chicken in Green Pumpkin-Seed Sauce,['Bring all ingredients to boil in large pot. ...,"['Chicken', 'Low/No Sugar', 'Cinco de Mayo', '...","['5 cups water', '6 chicken thighs with skin a..."
67,87,1143.0,0.000,Roast Chicken With Sorghum and Squash,['Bring 5 cups water to a boil in a medium pot...,"['Bon Appétit', 'Dinner', 'Chicken', 'Grains',...","['Kosher salt', '1 cup sorghum', '1/2 large bu..."
...,...,...,...,...,...,...,...
14445,20007,936.0,4.375,"Braised Chicken with Smoked Ham, Chestnuts, an...","['Bring water to a simmer in a small saucepan,...","['Chicken', 'Ginger', 'Braise', 'Marinate', 'D...","['2 3/4 cups water', '12 dried Chinese black m..."
14473,20050,356.0,3.750,Hot Chicken Salad,['1. Preheat the oven to 375°F. Spray a 13-by-...,"['Cheese', 'Chicken', 'Nut', 'Poultry', 'Bake'...","['2 cups cooked chicken breast meat, cubed (Yo..."
14475,20052,878.0,4.375,Chicken Tetrazzini,"['Bring chicken bones, broth, carrot, onion, c...","['Chicken', 'Mushroom', 'Pasta', 'Bake', 'Supe...",['1 to 1 1/2 pound chicken bones (from 2 cooke...
14497,20091,1096.0,3.750,Chicken with Raisins and Lemon,['Arrange chicken in single layer in large Dut...,"['Chicken', 'Potato', 'Poultry', 'Lemon', 'Rai...","['1 3 1/2-pound chicken, cut into 8 pieces', '..."


## Method 1 : Similarity matrix

Got these steps from the Notebook for the upcoming Recommendation Systems class.

In [96]:
ingredients_matrix[(df['title'] == 'Chicken Parmesan').values].todense().squeeze()

matrix([[0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [97]:
from sklearn.metrics.pairwise import cosine_similarity

recipe_1 = ingredients_matrix[(df['title'] == 'Chicken Parmesan').values,]
recipe_2 = ingredients_matrix[(df['title'] == 'Chicken Parmesan Heros').values,]

print("Similarity:", cosine_similarity(recipe_1, recipe_2)) # Notice the result is a 2D 1X1 array, so to grab
                                                          # the number we will need to index

Similarity: [[0.29230094]]


Using cosine_similarity, the 'Chicken Parmesan' and 'Chicken Parmesan Heros' recipes are about 29% similar (0.292).

In [98]:
recipe_3 = ingredients_matrix[(df['title'] == 'Artichoke and Parmesan Risotto').values,]
print("Similarity:", cosine_similarity(recipe_1, recipe_3))

Similarity: [[0.16302783]]


Using cosine_similarity, the 'Chicken Parmesan' and 'Artichoke and Parmesan Risotto' recipes are about 16% similar (0.163).

Looking at the actual recipes in the dataset this output makes sense.
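To make the numbers above concrete, here is a minimal pure-Python sketch of what cosine similarity computes on token counts. The two token lists are made up for illustration (they are not rows from the dataset), and `cosine_sim` is a hypothetical helper, not the scikit-learn function used above.

```python
import math
from collections import Counter

def cosine_sim(tokens_a, tokens_b):
    """Cosine similarity between two bags of tokens (term-count vectors)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an empty bag of tokens
    return dot / (norm_a * norm_b)

# Made-up examples: 2 shared tokens, 4 tokens each
recipe_a = ["chicken", "parmesan", "breadcrumbs", "eggs"]
recipe_b = ["chicken", "parmesan", "rolls", "onion"]
print(round(cosine_sim(recipe_a, recipe_b), 3))  # 0.5
```

Only shared tokens contribute to the numerator, so long ingredient lists with a few words in common score low even when the dishes are related.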

Let's create a similarity matrix by doing cosine_similarity on the entire ingredients sparse matrix.

In [99]:
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(ingredients_matrix, dense_output=False)

In [100]:
# Check the shape
# rows and columns should be equal, and match the number of recipes we started with (rows)
similarities.shape

(14526, 14526)

In [101]:
# Test with a sample recipe
df[df['title'] == 'Chicken Parmesan']

Unnamed: 0,recipeId,calories,rating,title,directionsStr,categoriesStr,ingredientsStr
4918,6368,1842.0,5.0,Chicken Parmesan,['Place breadcrumbs and flour in 2 separate sh...,"['Chicken', 'Tomato', 'Broil', 'Kid-Friendly',...","['2 cups fine dry breadcrumbs', '1 cup all-pur..."


In [102]:
# Get the column based upon the index
recipe_index = df[df['title'] == 'Chicken Parmesan'].index

# Create a dataframe with the recipe titles
sim_df = pd.DataFrame({'recipe': df['title'],
                       'similarity': np.array(similarities[recipe_index, :].todense()).squeeze()})

In [103]:
# Return the top 10 most similar recipes
sim_df.sort_values(by='similarity', ascending=False).head(10)

Unnamed: 0,recipe,similarity
4918,Chicken Parmesan,1.0
12825,Quick Baked Chicken Parmesan,0.576624
11718,Fried Chicken Biscuits,0.562117
7471,Fried Chicken Thighs with Cheesy Grits,0.536461
6860,Chicken and Dumplings with Mushrooms,0.52984
5405,Spanish-Style Fried Chicken With Grilled Avocado,0.518811
7475,3-Ingredient Shakshuka,0.518751
13533,Fusilli with Shrimp and Paneed Chicken,0.508426
5786,Horseradish-Yogurt Sauce,0.50322
7749,Lamb and Eggplant Casserole (Moussaka),0.496373


With TFIDF Vectorizer:

|        |                                           recipe | similarity |
|-------:|-------------------------------------------------:|-----------:|
|   4918 |                                 Chicken Parmesan | 1.000000 |
|  11718 |                           Fried Chicken Biscuits | 0.424981 |
|  12825 |                     Quick Baked Chicken Parmesan | 0.424046 |
|   1165 |                          Mozzarella Pesto Spread | 0.412968 |
|   9877 |         Rigatoni with Cheese and Italian Sausage | 0.360720 |
|  14180 |                      BA's Best Eggplant Parmesan | 0.356189 |
|   5786 |                         Horseradish-Yogurt Sauce | 0.351933 |
|   3927 | Chunky Two-Cheese Potatoes with Garlic and Pesto | 0.333565 |
|  13533 |           Fusilli with Shrimp and Paneed Chicken | 0.331478 |
|   3123 |    Spicy Lamb Pizza With Parsley–Red Onion Salad | 0.328433 |

Results with CountVectorizer seem better:

|       |                                            recipe | similarity |
|------:|--------------------------------------------------:|------------|
|  4918 |                                  Chicken Parmesan |   1.000000 |
| 13533 |            Fusilli with Shrimp and Paneed Chicken |   0.587137 |
| 12480 |                          Spicy Oven-Fried Chicken |   0.574641 |
| 11718 |                            Fried Chicken Biscuits |   0.559149 |
| 14180 |                       BA's Best Eggplant Parmesan |   0.557207 |
|  7749 |            Lamb and Eggplant Casserole (Moussaka) |   0.556499 |
| 11245 |                                             Pinon |   0.548443 |
|  9785 | Breaded Skinless Fish Fillets with Red Pepper ... |   0.540222 |
|  6860 |              Chicken and Dumplings with Mushrooms |   0.539425 |
| 11336 | Crispy Chicken Sandwich with Buttermilk Slaw a... |   0.539164 |

TODO: Ana check these results - something does not make sense, should I try with CountVectorizer? Still getting some weird results like Pinon.

Tried with custom tokenizer that removed most measurements and adjectives - TODO see if there is a library for this e.g. https://stackoverflow.com/questions/33587667/extracting-all-nouns-from-a-text-file-using-nltk

The results are better now:

|       |                                 recipe | similarity |
|------:|---------------------------------------:|-----------:|
|  4918 |                       Chicken Parmesan |   1.000000 |
| 12825 |           Quick Baked Chicken Parmesan |   0.553191 |
|  6860 |   Chicken and Dumplings with Mushrooms |   0.522931 |
| 11718 |                 Fried Chicken Biscuits |   0.520939 |
| 13533 | Fusilli with Shrimp and Paneed Chicken |   0.517769 |
|  1937 |            East-West Barbecued Chicken |   0.506168 |
|  8877 |     Pepper, Rosemary, and Cheese Bread |   0.505992 |
|  7471 | Fried Chicken Thighs with Cheesy Grits |   0.503038 |
|  4490 |                       Parmesan Muffins |   0.501956 |
| 14180 |            BA's Best Eggplant Parmesan |   0.500428 |

In [104]:
selector = (df['title'] == 'Chicken Parmesan') | (df['title'] == 'Fried Chicken Biscuits') | (df['title'] == 'Pinon')
df.loc[selector,['title', 'ingredientsStr']].values

array([['Chicken Parmesan',
        '[\'2 cups fine dry breadcrumbs\', \'1 cup all-purpose flour\', \'4 large eggs\', \'1 cup whole milk\', \'8 small skinless, boneless chicken thighs, pounded to 1/2" thickness\', \'Kosher salt, freshly ground pepper\', \'N/A freshly ground pepper\', \'8 tablespoons olive oil\', \'8 tablespoons prepared sun-dried tomato pesto\', \'1 pound fresh mozzarella, cut into 8 slices\', \'1/2 teaspoon crushed red pepper flakes\', \'4 cups prepared marinara sauce, warmed\', \'Finely grated Parmesan (for serving)\']'],
       ['Pinon',
        "['1 medium onion', '1/2 small green bell pepper', '1/2 small red bell pepper', 'a 14- to 16-ounce can whole tomatoes', '1/3 cup drained pimiento-stuffed green olives', '1 pound ground beef chuck', '1/4 teaspoon salt', '1/4 teaspoon freshly ground black pepper', '1/2 cup tomato sauce', '2 tablespoons raisins', '1 tablespoon cider vinegar', '2 bay leaves', '1/4 teaspoon ground achiote (optional)', '6 semi-ripe (yellow with so

In [105]:
# Check for Artichoke
# Get the column based upon the index
recipe_index_2 = df[df['title'] == 'Artichoke and Parmesan Risotto'].index

# Create a dataframe with the recipe titles
sim_df_2 = pd.DataFrame({'recipe': df['title'],
                       'similarity': np.array(similarities[recipe_index_2, :].todense()).squeeze()})

# Return the top 10 most similar recipes
sim_df_2.sort_values(by='similarity', ascending=False).head(10)

Unnamed: 0,recipe,similarity
14522,Artichoke and Parmesan Risotto,1.0
6459,Asparagus Risotto,0.717137
5060,Shrimp Risotto with Baby Spinach and Basil,0.68313
12142,Mock Risotto,0.644658
10936,Risotto with Squash and Pancetta,0.641533
11060,Risotto Primavera,0.634335
8493,"Butternut Squash, Rosemary, and Blue Cheese Ri...",0.628971
8366,Fontina Risotto Cakes with Fresh Chives,0.599145
11013,Spinach Risotto,0.597614
800,Asparagus Risotto,0.587975


Results look satisfactory.

## Method 2 : Using NearestNeighbors

In [None]:
# Adding a function to print the list of ingredients given a recipe title
def printIngredients(recipeName):
    sel = df['title'] == recipeName
    print(recipeName)
    print(df.loc[sel, ['ingredientsStr']].values)


In [106]:
from sklearn.neighbors import NearestNeighbors

model = NearestNeighbors(n_neighbors=11, metric='cosine')
model.fit(ingredients_matrix)


### Matching on "Exact Recipe Title" as Input

In [107]:
recipe_index = df[df['title'] == 'Chicken Parmesan'].index

# Look up the nearest neighbors of this recipe's row in the ingredients matrix
distances, indices = model.kneighbors(ingredients_matrix[recipe_index])

In [108]:
# recipe_title = df.loc[4918, ['title']]
recipe_titles = []
for id in indices[0]:
    recipe_titles.append(df.loc[id, ['title']])
print(recipe_titles)

[title    Chicken Parmesan
Name: 4918, dtype: object, title    Quick Baked Chicken Parmesan
Name: 12825, dtype: object, title    Fried Chicken Biscuits
Name: 11718, dtype: object, title    Fried Chicken Thighs with Cheesy Grits
Name: 7471, dtype: object, title    Chicken and Dumplings with Mushrooms
Name: 6860, dtype: object, title    Spanish-Style Fried Chicken With Grilled Avocado
Name: 5405, dtype: object, title    3-Ingredient Shakshuka
Name: 7475, dtype: object, title    Fusilli with Shrimp and Paneed Chicken
Name: 13533, dtype: object, title    Horseradish-Yogurt Sauce
Name: 5786, dtype: object, title    Lamb and Eggplant Casserole (Moussaka)
Name: 7749, dtype: object, title    Spicy Oven-Fried Chicken
Name: 12480, dtype: object]


In [109]:
recipe_index_2 = df[df['title'] == 'Artichoke and Parmesan Risotto'].index

distances2, indices2 = model.kneighbors(ingredients_matrix[recipe_index_2])
recipe_titles2 = []
for id2 in indices2[0]:
    recipe_titles2.append(df.loc[id2, ['title']])
print(recipe_titles2)

[title    Artichoke and Parmesan Risotto
Name: 14522, dtype: object, title    Asparagus Risotto
Name: 6459, dtype: object, title    Shrimp Risotto with Baby Spinach and Basil
Name: 5060, dtype: object, title    Mock Risotto
Name: 12142, dtype: object, title    Risotto with Squash and Pancetta
Name: 10936, dtype: object, title    Risotto Primavera
Name: 11060, dtype: object, title    Butternut Squash, Rosemary, and Blue Cheese Ri...
Name: 8493, dtype: object, title    Fontina Risotto Cakes with Fresh Chives
Name: 8366, dtype: object, title    Spinach Risotto
Name: 11013, dtype: object, title    Porcini Mushroom Risotto
Name: 10112, dtype: object, title    Asparagus Risotto
Name: 800, dtype: object]


Results seem to be going in the right direction.

### Matching on a "list of ingredients" as Input

In [119]:
# Using various ingredient lists to test the results

ingInputList = [
    "Chicken, Parmesan, Breadcrumbs",  # something familiar
    "Artichoke Pesto",
    "Chicken thighs, potatoes",  # compare results of potatoes vs potato
    "Chicken thighs, potato",
    "Okra",  # single ingredient
    "Bhindi",  # unknown ingredient - does not exist in the vocabulary
]

for ingInput in ingInputList:
    print(f"\n Input ingredients: {ingInput}")
    # Convert the string to a series
    ingInputSeries = pd.Series(ingInput)

    # Let's try to use the vectorizer on this
    ingTransformed = vectorizer.transform(ingInputSeries)

    # pass this to NearestNeighbors trained model
    distOfRes, indicesOfRes = model.kneighbors(ingTransformed)

    # print the output
    print("\n Result")

    for i in range(0, 11):  # TODO: 11 should be made configurable and match the n-neighbors number
        name = df.loc[indicesOfRes[0][i], ['title']].values[0]
        distance = (distOfRes[0][i]).round(3)
        rating = df.loc[indicesOfRes[0][i], ['rating']].values[0]

        # print(f"{name}  :  {distance}")
        print(f"{name}  :  {distance}  :  {rating}")



 Input ingredients: Chicken, Parmesan, Breadcrumbs

 Result
Parmesan Chicken with Mixed Baby Greens  :  0.484  :  4.375
Parmesan Polenta  :  0.5  :  4.375
Chicken Soup  :  0.592  :  3.75
Fettucine with Chicken and Bell Pepper Cream Sauce  :  0.607  :  4.375
Quick Baked Chicken Parmesan  :  0.62  :  5.0
Parmesan-Crusted Lemon Chicken  :  0.622  :  3.75
Risotto with Squash and Pancetta  :  0.622  :  4.375
Breaded Chicken Cutlets with Chunky Vegetable Sauce  :  0.625  :  3.75
Fontina Risotto Cakes with Fresh Chives  :  0.63  :  4.375
Easy Arancini  :  0.63  :  4.375
Chicken Schnitzel with Anchovy-Chive Butter Sauce  :  0.631  :  3.75

 Input ingredients: Artichoke Pesto

 Result
Rack of Lamb with Pesto Crumbs  :  0.75  :  4.375
Mozzarella Pesto Spread  :  0.75  :  3.75
Pesto Pizza with Crabmeat and Artichoke Hearts  :  0.757  :  4.375
Burgers with Artichokes, Gorgonzola, and Tomatoes  :  0.761  :  3.75
Artichoke Hearts with Garlic, Olive Oil and Parsley  :  0.764  :  3.125
Roast Turkey w

Stored the results in https://docs.google.com/spreadsheets/d/1hkekdJCZBJqC5hkHKQFFM8sBONUf3CIsv9ekPxqa1AQ/edit?gid=0#gid=0 for easier readability

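Results are heading in the right direction, but:  
- plurals need handling (for example, 'potato' and 'potatoes' are separate tokens), either with more tokenization rules or with a stemmer or lemmatizer.
- phrases such as 'chicken thighs' and 'egg yolks' should probably be kept together as single tokens.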
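As a rough sketch of both ideas, the snippet below post-processes a token list with a deliberately naive singularizer and a small phrase list. `naive_singular`, `merge_phrases`, and `PHRASES` are hypothetical names; the suffix rules are simplistic (for example, they mangle 'leaves' and 'molasses'), so a real lemmatizer would likely do better.

```python
# Hypothetical phrase list: adjacent (singular) tokens to keep together
PHRASES = {("chicken", "thigh"), ("egg", "yolk")}

def naive_singular(word):
    """Strip common English plural suffixes. Naive on purpose; it will get some words wrong."""
    if len(word) <= 3:
        return word
    if word.endswith("ies"):
        return word[:-3] + "y"
    if word.endswith(("oes", "ches", "shes")):
        return word[:-2]
    if word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]
    return word

def merge_phrases(tokens, phrases=PHRASES):
    """Join adjacent tokens that form a known phrase, e.g. 'chicken_thigh'."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ["chicken", "thighs", "potatoes", "egg", "yolks"]
normalized = merge_phrases([naive_singular(t) for t in tokens])
print(normalized)  # ['chicken_thigh', 'potato', 'egg_yolk']
```

If this were adopted, the same two steps would have to run at the end of `my_tokenizer` so that the recipes and the query text are normalized identically.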

### Matching on "exclude list of ingredients" as Input

In [133]:
ingInputList = [
    "Chicken, Parmesan, Breadcrumbs",  # something familiar
]

for ingInput in ingInputList:
    print(f"\n Input ingredients: {ingInput}")
    # Convert the string to a series
    ingInputSeries = pd.Series(ingInput)

    # Let's try to use the vectorizer on this
    ingTransformed = (vectorizer.transform(ingInputSeries)) * -1


    # pass this to NearestNeighbors trained model
    distOfRes, indicesOfRes = model.kneighbors(ingTransformed)

    # print the output
    print("\n Result")

    for i in range(0, 11):  # TODO: 11 should be made configurable and match the n-neighbors number
        name = df.loc[indicesOfRes[0][i], ['title']].values[0]
        distance = (distOfRes[0][i]).round(3)
        rating = df.loc[indicesOfRes[0][i], ['rating']].values[0]

        # print(f"{name}  :  {distance}")
        print(f"{name}  :  {distance}  :  {rating}")

    # print ingredient lists to verify
    # for i in range(0, 11):  # TODO: 11 should be made configurable and match the n-neighbors number
    #     name = df.loc[indicesOfRes[0][i], ['title']].values[0]
    #     printIngredients(name)


 Input ingredients: Chicken, Parmesan, Breadcrumbs

 Result
Double-Chocolate Almond Brownies  :  1.0  :  3.75
Sauteed Radishes and Sugar Snap Peas with Dill  :  1.0  :  3.75
Mom's Noodle Kugel  :  1.0  :  3.125
Spicy Tomato Chutney  :  1.0  :  4.375
Pomegranate Molasses-Glazed Carrots  :  1.0  :  4.375
Coconut Pineapple Cake  :  1.0  :  3.75
Dried Apple and Cheddar Strudel  :  1.0  :  3.75
Sweet-Potato Coconut Purée  :  1.0  :  4.375
Spiced Brown Butter Linzer Cookies  :  1.0  :  0.0
Fruit Salad with Ginger Syrup  :  1.0  :  4.375
Skillet Polenta with Tomatoes and Gorgonzola  :  1.0  :  4.375
Double-Chocolate Almond Brownies
[["['1/3 cup sliced almonds (about 1 ounce)', '1/4 cup all-purpose flour', '1/2 teaspoon baking powder', '2 ounces unsweetened chocolate', '1/2 stick (1/4 cup) unsalted butter', '1/2 cup sugar', '1/2 teaspoon vanilla', '1 large egg', '1/3 cup semisweet chocolate chips']"]]
Sauteed Radishes and Sugar Snap Peas with Dill
[["['1 tablespoon butter', '1 tablespoon oliv

This worked, but one recipe listed 'bread crumbs' rather than 'breadcrumbs', so it still made it through into the output.
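Note that with a negated query the dot product with any recipe is at most zero, so every recipe that shares none of the excluded tokens ties at a cosine distance of exactly 1.0, as in the output above, and the order among them carries no information. A different technique, sketched here under my own assumptions, is to treat exclusion as a filter on candidates and then break ties by rating. The candidate dictionaries are made up and `exclude_and_rank` is a hypothetical helper.

```python
def exclude_and_rank(candidates, excluded):
    """Drop candidates containing any excluded token, then rank the rest.

    candidates: list of dicts with 'title', 'tokens', 'distance', 'rating'.
    """
    excluded = {word.lower() for word in excluded}
    kept = [c for c in candidates if not excluded & set(c["tokens"])]
    # Smaller distance first; a higher rating breaks ties
    return sorted(kept, key=lambda c: (c["distance"], -c["rating"]))

# Made-up candidates, for illustration only
toy_candidates = [
    {"title": "A", "tokens": ["chicken", "rice"], "distance": 0.4, "rating": 4.0},
    {"title": "B", "tokens": ["okra", "corn"], "distance": 0.6, "rating": 3.75},
    {"title": "C", "tokens": ["okra", "lime"], "distance": 0.6, "rating": 4.375},
]
print([c["title"] for c in exclude_and_rank(toy_candidates, ["Chicken"])])  # ['C', 'B']
```

The filter only sees exact tokens, so spelling variants like 'bread crumbs' would still slip through unless they are normalized first.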

### Matching on 'include ingredients' + 'exclude ingredients' as input

In [134]:
yes_ing_series = pd.Series("Okra")
no_ing_series = pd.Series("Tomato")

yes_ing_tx = vectorizer.transform(yes_ing_series)
no_ing_tx = (vectorizer.transform(no_ing_series)) * -1

distOfRes, indicesOfRes = model.kneighbors(yes_ing_tx)

# print the output
print("\n Result")

for i in range(0, 11):  # TODO: 11 should be made configurable and match the n-neighbors number
    name = df.loc[indicesOfRes[0][i], ['title']].values[0]
    distance = (distOfRes[0][i]).round(3)
    rating = df.loc[indicesOfRes[0][i], ['rating']].values[0]

    # print(f"{name}  :  {distance}")
    print(f"{name}  :  {distance}  :  {rating}")




 Result
Sauteed Okra with Tomato and Corn  :  0.646  :  4.375
Okra with Scallion, Lime, and Ginger  :  0.684  :  3.75
Chicken, Sausage, and Okra Gumbo  :  0.702  :  2.5
Okra Beignets with Cilantro Sour Cream Sauce  :  0.72  :  3.75
Broiled Tomato, Corn, and Okra  :  0.723  :  4.375
Stewed Corn and Tomatoes with Okra  :  0.757  :  4.375
Creole Chicken and Okra Gumbo  :  0.787  :  3.75
Succotash  :  0.8  :  4.375
Catfish and Okra with Pecan Butter Sauce  :  0.817  :  4.375
Crisp Okra in Yogurt Sauce  :  0.82  :  0.0
Chive Shortcakes with Smoky Corn and Okra Stew  :  0.82  :  3.125


In [137]:
print(yes_ing_tx.shape)
print(no_ing_tx.shape)
print(type(yes_ing_tx))

updated_ing_tx = yes_ing_tx + no_ing_tx

print(updated_ing_tx.shape)

(1, 2158)
(1, 2158)
<class 'scipy.sparse._csr.csr_matrix'>
(1, 2158)


In [140]:
yes_ing_series = pd.Series("Okra")
no_ing_series = pd.Series("Tomato Cilantro")

yes_ing_tx = vectorizer.transform(yes_ing_series)
no_ing_tx = (vectorizer.transform(no_ing_series)) * -1

updated_ing_tx = yes_ing_tx + no_ing_tx

distOfRes, indicesOfRes = model.kneighbors(updated_ing_tx)

# print the output
print("\n Result")

for i in range(0, 11):  # TODO: 11 should be made configurable and match the n-neighbors number
    name = df.loc[indicesOfRes[0][i], ['title']].values[0]
    distance = (distOfRes[0][i]).round(3)
    rating = df.loc[indicesOfRes[0][i], ['rating']].values[0]

    # print(f"{name}  :  {distance}")
    print(f"{name}  :  {distance}  :  {rating}")


 Result
Okra with Scallion, Lime, and Ginger  :  0.817  :  3.75
Chicken, Sausage, and Okra Gumbo  :  0.828  :  2.5
Broiled Tomato, Corn, and Okra  :  0.84  :  4.375
Stewed Corn and Tomatoes with Okra  :  0.86  :  4.375
Creole Chicken and Okra Gumbo  :  0.877  :  3.75
Succotash  :  0.885  :  4.375
Catfish and Okra with Pecan Butter Sauce  :  0.895  :  4.375
Chive Shortcakes with Smoky Corn and Okra Stew  :  0.896  :  3.125
Crisp Okra in Yogurt Sauce  :  0.896  :  0.0
Corn and Okra Stew  :  0.904  :  3.75
Spicy Gumbo-Laya  :  0.908  :  4.375


Seems to be working, although I am concerned about the distance values (these are cosine distances, not similarities, and they keep getting larger as I exclude ingredients).  
Results stored in https://docs.google.com/spreadsheets/d/1hkekdJCZBJqC5hkHKQFFM8sBONUf3CIsv9ekPxqa1AQ/edit?gid=0#gid=0
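The growth in distance has a simple arithmetic explanation. Each excluded token adds a -1 entry to the query, which lengthens the query vector, while the dot product with a recipe that contains none of the excluded tokens stays the same. The sketch below works this through, assuming every query word is a single vocabulary term with count 1; `distance_with_exclusions` is a hypothetical helper.

```python
import math

def distance_with_exclusions(include_only_distance, n_include, n_exclude):
    """Cosine distance after adding n_exclude negated terms to the query.

    Assumes the recipe contains none of the excluded terms and that each
    query term has count 1, so the query norm grows from sqrt(n_include)
    to sqrt(n_include + n_exclude) while the dot product is unchanged.
    """
    similarity = 1.0 - include_only_distance
    shrink = math.sqrt(n_include) / math.sqrt(n_include + n_exclude)
    return 1.0 - similarity * shrink

# 'Okra with Scallion, Lime, and Ginger': 0.684 for "Okra" alone,
# then "Tomato" and "Cilantro" are excluded
print(round(distance_with_exclusions(0.684, n_include=1, n_exclude=2), 3))  # 0.818
```

This is close to the 0.817 reported above (the small gap comes from starting with a rounded 0.684), so the larger values look like a side effect of the query norm rather than a sign that the matches got worse.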

## Trying NearestNeighbors on categoriesStr

### Vectorize categoriesStr using the default tokenizer

In [121]:
# NOTE Using default tokenizer here
vect2 = CountVectorizer(stop_words="english")
vect2.fit(df['categoriesStr'])
vocab_categories = vect2.get_feature_names_out()
categories_matrix = vect2.transform(df['categoriesStr'])

In [122]:
len(vocab_categories)

701

In [None]:
vocab_categories

In [124]:
type(vect2.vocabulary_)

dict

In [None]:
categories_matrix.shape

(14526, 701)

In [125]:
# Most common categories

sum_cat_words = categories_matrix.sum(axis = 0)
words_freq_cat = [(word, sum_cat_words[0, i]) for word, i in vect2.vocabulary_.items()]
words_freq_cat = sorted(words_freq_cat, key = lambda x: x[1], reverse = True)
words_freq_cat[0:100]

[('free', 23228),
 ('bon', 6823),
 ('appétit', 6816),
 ('peanut', 6289),
 ('soy', 6237),
 ('nut', 6109),
 ('tree', 5180),
 ('gourmet', 5143),
 ('vegetarian', 4971),
 ('kosher', 4552),
 ('pescatarian', 4430),
 ('sugar', 4164),
 ('quick', 3823),
 ('easy', 3782),
 ('wheat', 3545),
 ('gluten', 3521),
 ('bake', 3489),
 ('dairy', 3379),
 ('summer', 3034),
 ('friendly', 2947),
 ('dessert', 2939),
 ('winter', 2365),
 ('cream', 2332),
 ('fall', 2294),
 ('added', 2186),
 ('fruit', 2088),
 ('cheese', 1834),
 ('dinner', 1798),
 ('onion', 1741),
 ('low', 1731),
 ('conscious', 1714),
 ('kidney', 1677),
 ('vegetable', 1665),
 ('sauté', 1586),
 ('party', 1584),
 ('milk', 1583),
 ('tomato', 1581),
 ('pepper', 1495),
 ('egg', 1342),
 ('herb', 1271),
 ('kid', 1270),
 ('vegan', 1241),
 ('spring', 1229),
 ('salad', 1221),
 ('garlic', 1213),
 ('healthy', 1209),
 ('chill', 1162),
 ('cocktail', 1122),
 ('grill', 1115),
 ('thanksgiving', 1082),
 ('potato', 1055),
 ('stew', 1013),
 ('appetizer', 1009),
 ('chick

In [127]:
model_cat = NearestNeighbors(n_neighbors=11, metric='cosine')
model_cat.fit(categories_matrix)
distances_cat1, indices_cat1 = model_cat.kneighbors(categories_matrix[recipe_index])
recipe_cat_titles1 = []
for idc1 in indices_cat1[0]:
    recipe_cat_titles1.append(df.loc[idc1, ['title']])
print(recipe_cat_titles1)


[title    Chicken Parmesan
Name: 4918, dtype: object, title    New Chicken Parmesan
Name: 9982, dtype: object, title    Chicken Schnitzel with Chile Cherry Tomatoes a...
Name: 7219, dtype: object, title    Chorizo Bolognese with Buffalo Mozzarella
Name: 6822, dtype: object, title    Broiled Chicken, Romaine, and Tomato Bruschetta
Name: 4994, dtype: object, title    Green Mountain Maple Barbecued Chicken
Name: 292, dtype: object, title    Baked Beans with Slab Bacon and Breadcrumbs
Name: 139, dtype: object, title    Turkey Burritos with Salsa and Cilantro
Name: 2568, dtype: object, title    Eggplant Parmesan With Fresh Mozzarella
Name: 10016, dtype: object, title    Lemon Chicken Cutlets
Name: 4441, dtype: object, title    Fried-Egg Caesar with Sun-Dried Tomatoes and P...
Name: 9787, dtype: object]


In [131]:
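# THIS IS NOT WORKING PROPERLY: 'vegeterian' is misspelled. The vocabulary term is 'vegetarian',
# so the query most likely vectorizes to all zeros, which would explain every distance being 1.0.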
catInput = "vegeterian"
catInputSeries = pd.Series(catInput)
catInputTransformed = vect2.transform(catInputSeries)
dcat, icat = model_cat.kneighbors(catInputTransformed)
# print the output
print("\n Result")

for i in range(0, 11):  # TODO: 11 should be made configurable and match the n-neighbors number
    namecat = df.loc[icat[0][i], ['title']].values[0]
    distancecat = (dcat[0][i]).round(3)
    ratingcat = df.loc[icat[0][i], ['categoriesStr']].values[0]

    print(f"{namecat}  :  {distancecat}")
    # print(f"{namecat}  :  {distancecat}  :  {ratingcat}")


 Result
Sour Cream Chocolate Cake  :  1.0
Zucchini-Blossom Quesadillas  :  1.0
Summer Seafood Stew  :  1.0
Baked Sea Bass with Walnut-Breadcrumb Crust and Lemon-Dill Sauce  :  1.0
Maple-Walnut Espresso Torte  :  1.0
Turkey Cheddar Sandwiches with Honey Mustard  :  1.0
Poached Eggs in Pipérade  :  1.0
Farfalle with Tomatoes and Feta Cheese  :  1.0
Potato, Cucumber and Dill Salad  :  1.0
Chocolate and Mixed Nut Tart in Cookie Crust  :  1.0
Scallop Potatoes with Gouda and Fennel  :  1.0


In [37]:
distances_cat2, indices_cat2 = model_cat.kneighbors(categories_matrix[recipe_index_2])
recipe_cat_titles2 = []
for idc2 in indices_cat2[0]:
    recipe_cat_titles2.append(df.loc[idc2, ['title']])
print(recipe_cat_titles2)

[title    Artichoke and Parmesan Risotto
Name: 14522, dtype: object, title    Poached Salmon with Artichoke Confit
Name: 1848, dtype: object, title    Roast Chicken with Rosemary, Lemon, and Honey
Name: 11438, dtype: object, title    Potato Salad with 7-Minute Eggs and Mustard Vi...
Name: 11357, dtype: object, title    Roast Chicken With Harissa And Schmaltz
Name: 12453, dtype: object, title    Slow-Roasted Char with Fennel Salad
Name: 1021, dtype: object, title    Beans with Kale and Portuguese Sausage
Name: 13014, dtype: object, title    Pot-Roasted Artichokes With White Wine and Capers
Name: 7465, dtype: object, title    Roasted Asparagus and Baby Artichokes with Lem...
Name: 6549, dtype: object, title    Dai Due's Master Brined Chicken
Name: 5189, dtype: object, title    Milk Pudding with Rose Water Caramel and Figs
Name: 2105, dtype: object]


## Extracting nouns with NLTK (experiment)

In [38]:
import nltk

lines = 'lines is some string of words'
# function to test if something is a noun
is_noun = lambda pos: pos[:2] == 'NN'
# do the nlp stuff
tokenized = nltk.word_tokenize(lines)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]

print(nouns)

['lines', 'string', 'words']


## TODO LIST

- Fuzzy matching and Streamlit tutorial
- The input can be a list of ingredients or a full recipe name; test a fuzzy matcher. How can this work for an ingredient list instead of a recipe name?
- How to use categories for pulling out vegetarian recipes - try a query to see how many recipes have this category in them
- Also ask about tokenization
- QUESTION - Can I use categories to pick vegetarian recipes too? From categories, how to tokenize ONLY specific ones like 'vegetarian', 'pescatarian', 'wheat', 'gluten'?
- Should I create a column with number of ingredients?
- Interesting article about tokenization - https://www.kaggle.com/code/shivanirana63/beginner-s-guide-to-word-tokenization
- When similarity is the same, how to rank the recipes?
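For the category questions above, one option that avoids tokenization entirely is to parse `categoriesStr` back into a Python list and test for whole labels. This is a sketch that assumes each value is a valid Python list literal, as the previews suggest; the rows below are made up and `has_category` is a hypothetical helper.

```python
import ast

def has_category(categories_str, wanted):
    """True if the stringified category list contains `wanted` (case-insensitive)."""
    categories = ast.literal_eval(categories_str)
    return wanted.lower() in {c.lower() for c in categories}

# Made-up (title, categoriesStr) rows, for illustration only
toy_rows = [
    ("Spinach Risotto", "['Rice', 'Vegetarian', 'Dinner']"),
    ("Chicken Parmesan", "['Chicken', 'Tomato', 'Broil']"),
    ("Mock Risotto", "['Vegetarian', 'Peanut Free', 'Soy Free']"),
]
vegetarian = [title for title, cats in toy_rows if has_category(cats, "vegetarian")]
print(len(vegetarian), vegetarian)  # 2 ['Spinach Risotto', 'Mock Risotto']
```

On the real data, something like `df['categoriesStr'].apply(has_category, wanted='vegetarian').sum()` should give the count. Working with whole labels also keeps multi-word categories such as 'Peanut Free' intact, which is probably why 'free' tops the token counts from the default tokenizer.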


In [40]:
# fuzzy matching, let's put recipe names in a list
recipeNames = list(df['title'])
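As a starting point for fuzzy matching on recipe names, the standard library's `difflib.get_close_matches` can be tried before reaching for a dedicated package. The sketch below is case-insensitive; `closest_titles` is a hypothetical helper and `toy_titles` is a small sample standing in for `recipeNames`.

```python
import difflib

def closest_titles(query, titles, n=3, cutoff=0.6):
    """Return up to n titles that best match the query, ignoring case."""
    by_lower = {title.lower(): title for title in titles}  # duplicate titles collapse to one
    matches = difflib.get_close_matches(query.lower(), list(by_lower), n=n, cutoff=cutoff)
    return [by_lower[m] for m in matches]

toy_titles = ["Chicken Parmesan", "Quick Baked Chicken Parmesan",
              "Asparagus Risotto", "Artichoke and Parmesan Risotto"]
print(closest_titles("chiken parmesan", toy_titles, n=1))  # ['Chicken Parmesan']
```

The matched title can then be used to look up the row index, as in the exact-title cells earlier. Because titles such as 'Asparagus Risotto' appear more than once in the dataset, a title alone does not always identify a single recipe.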

In [41]:
# %pip install joblib

## For Streamlit

In [42]:
# https://stackoverflow.com/questions/10592605/save-classifier-to-disk-in-scikit-learn

"""
import pickle
# now you can save it to a file
with open('model1.pkl', 'wb') as f:
    pickle.dump(model, f)
"""
import joblib
# now you can save it to a file
joblib.dump(model, 'model_joblib.pkl')

['model_joblib.pkl']

In [43]:
joblib.dump(ingredients_matrix, 'ing_mat.pkl')

['ing_mat.pkl']