# Recipe Recommender System

In this notebook, I expand on my previous findings in "Recipe_Analysis.ipynb" and begin building a recommender system for Asian-inspired recipes from thewoksoflife.com. Here, I continue with basic exploratory data analysis and use machine learning and NLP for recipe recommendation. I plan to use this data to build a full-stack web application, deployed on GitHub, that lets people enter parameters such as their available time, calorie preferences, and of course choice of protein, and receive a list of the top 5 recipes matching their criteria.

## Data

This CSV file contains 1450 recipes with 10 features, listed as follows:

- Average Rating    **(float64)**
- Calories          **(object)**
- Category          **(object)**
- Cook Time         **(object)**
- Ingredients       **(object)**
- Name              **(object)**
- Prep Time         **(object)**
- Review Count      **(float64)**
- Total Time        **(object)**
- URL               **(object)**

In [1]:
import pandas as pd

df = pd.read_csv("cleaned_asian_recipes.csv")
display(df)

Unnamed: 0.1,Unnamed: 0,Name,Category,Prep Time,Cook Time,Total Time,Ingredients,Calories,Average Rating,Review Count,URL
0,0,Slow Roasted Tomato Pasta,Main Course,15.0,195.0,210.0,"tomatoes , extra virgin olive oil, salt, ...",576.0,5.00,2.0,https://thewoksoflife.com/roasted-tomato-pasta/
1,1,Yaki Udon,Noodles,15.0,25.0,40.0,"frozen udon noodles , butter , clove garl...",312.0,5.00,5.0,https://thewoksoflife.com/yaki-udon/
2,2,Drunken Noodles (Pad Kee Mao),Noodles,20.0,10.0,30.0,"water, sliced chicken thighs or chicken br...",444.0,5.00,26.0,https://thewoksoflife.com/drunken-noodles-pad-...
3,3,Yunnan Rice Noodle Soup (云南小锅米线),Soup,60.0,,75.0,"g ground pork , shaoxing wine , dark soy ...",565.0,5.00,6.0,https://thewoksoflife.com/yunnan-rice-noodle-s...
4,4,Poor Man’s Thai Noodles,Noodles,10.0,10.0,20.0,"fresh wide rice noodles , brown sugar, h...",201.0,5.00,2.0,https://thewoksoflife.com/poor-mans-thai-noodles/
...,...,...,...,...,...,...,...,...,...,...,...
801,1221,"Carrot Pea Soup with Pancetta, Basil & Mint",Soup,20.0,40.0,60.0,"butter, olive oil, onions , large carrot...",268.0,,,https://thewoksoflife.com/springtime-carrot-pe...
802,1252,Chinese Tomato Egg Stir-fry,Main Course,5.0,5.0,10.0,"small to medium tomatoes , scallion, egg...",333.0,4.87,15.0,https://thewoksoflife.com/stir-fried-tomato-an...
803,1360,Ratatouille Grilled Cheese,Bread,20.0,50.0,70.0,"each of diced zucchini, eggplant, and onion,...",,5.00,2.0,https://thewoksoflife.com/ratatouille-grilled-...
804,1384,Chinese Chive Frittata with Tomatoes,Vegetarian,5.0,10.0,15.0,"eggs, water, chinese chives , small t...",129.0,5.00,1.0,https://thewoksoflife.com/chinese-chive-frittata/


# Step 1: Criteria for Recommendation

Unfortunately, there are no user IDs, so it is not possible to use a collaborative-filtering model that recommends recipes based on prior users' ratings of different recipes. Instead, I will be using the metrics "Average Rating" and "Review Count". For testing purposes, I will select the top 5 recipes based on 1) average rating and, if tied, then 2) review count.

In [2]:
def get_top_recipes(time, calories, category):
    """
    Gets top 5 recipes based on rating and review count from data that matches specified parameters.
    
    :param time: Maximum time allotted for recipe (both preparation and cook time)
    :param calories: Maximum calories desired for recipe
    :param category: Category of dish, e.g. 'Beef' or 'Noodles'
    :returns: Dataframe containing the top 5 recipes that match the specified parameters
    
    """
    
    return df[(df.Category == category) & (df.Calories <= calories) & (df["Total Time"] <= time)] \
           .nlargest(5, columns = ['Average Rating', 'Review Count'])


In [3]:
# Example usage of get_top_recipes function
time = int(input("Enter max cook time: "))
calories = int(input("Enter max calories: "))
category = input("What type of dish? Choose from Soup, Poultry, Noodles, Appetizers, Vegetarian, Rice, Bread, Beef, Seafood, Pork, Dessert: ")

get_top_recipes(time,calories,category)

Enter max cook time: 500
Enter max calories: 500
What type of dish? Choose from Soup, Poultry, Noodles, Appetizers, Vegetarian, Rice, Bread, Beef, Seafood, Pork, Dessert: Beef


Unnamed: 0.1,Unnamed: 0,Name,Category,Prep Time,Cook Time,Total Time,Ingredients,Calories,Average Rating,Review Count,URL
165,172,Classic Beef Fried Rice,Beef,20.0,15.0,35.0,"flank steak , salt, water, baking soda...",360.0,5.0,18.0,https://thewoksoflife.com/classic-beef-fried-r...
326,370,Instant Pot Braised Curry Beef,Beef,15.0,60.0,60.0,"beef outside flankrough flank , vegetable...",440.0,5.0,18.0,https://thewoksoflife.com/braised-curry-beef-i...
349,407,Beef Tomato Stir-fry,Beef,70.0,5.0,75.0,"flank steak , cornstarch, oil, salt, ...",329.0,5.0,13.0,https://thewoksoflife.com/beef-tomato-stir-fry/
336,386,Bison Chili,Beef,15.0,150.0,165.0,"ground bison , onions, garlic, green b...",284.0,5.0,8.0,https://thewoksoflife.com/bison-chili/
342,394,Sichuan Boiled Beef (Shuizhu Niurou),Beef,70.0,20.0,90.0,"flank steak , baking soda, water, corn...",428.0,5.0,8.0,https://thewoksoflife.com/sichuan-boiled-beef/


# Step 2: Using Cosine Similarity & NLP to develop recommendations based on similar ingredients

Comparing recipes by ingredients means finding the pairs of recipes whose ingredient lists overlap the most. At first, I considered using TF-IDF for this, but since it reweights each word depending on its frequency, it seems antithetical to the task. Instead, I am opting for cosine similarity on unweighted vectors.

Cosine similarity is a method often used in text analysis that measures the similarity between two vectors in a given space. It works by computing the cosine of the angle between the two vectors, which indicates whether they are "pointing" in the same direction. Here, the vectors are term-frequency vectors, which store the number of occurrences of each word; stacking one vector per sentence gives a 2D array. For example, given the following two sentences:

**I like cats, cats like me.**

**Cats are really fun!**

An example term frequency vector might look like:
<pre>
Sentence | I | like | cats | me | are | really | fun |
S1       | 1 |  2   |  2   | 1  |  0  |   0    |  0  |
S2       | 0 |  0   |  1   | 0  |  1  |   1    |  1  |
</pre>

In practice these vectors are often extremely long and sparse: most columns are 0 because any one sentence uses only a small part of the vocabulary.
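
As a rough illustration, the table above can be reproduced with a few lines of standard-library Python. The `tokenize` helper and its lowercase, letters-only rule are assumptions made for this sketch; the notebook itself uses NLTK's `word_tokenize`, which also keeps punctuation marks as tokens.

```python
import re
from collections import Counter

def tokenize(sentence):
    # Assumption for this sketch: lowercase the text and keep alphabetic runs only
    return re.findall(r"[a-z]+", sentence.lower())

s1 = "I like cats, cats like me."
s2 = "Cats are really fun!"

counts1 = Counter(tokenize(s1))
counts2 = Counter(tokenize(s2))

# Shared vocabulary, in order of first appearance across both sentences
vocab = list(dict.fromkeys(tokenize(s1) + tokenize(s2)))

# One count per vocabulary word; a Counter returns 0 for words it has not seen
vector1 = [counts1[word] for word in vocab]
vector2 = [counts2[word] for word in vocab]

print(vocab)
print(vector1)
print(vector2)
```

The two printed vectors correspond to rows S1 and S2 of the table.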
 
Cosine similarity is given by:

$$\text{sim}(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$

where sim(x, y) measures the similarity between the two vectors x and y, x · y is their dot product, and ||x|| and ||y|| are the Euclidean norms of x and y (which can also be described as their lengths).

The more similar x and y are, the closer the cosine similarity is to 1. If the cosine similarity approaches 0, the two vectors are close to orthogonal (90 degrees apart) and have little to no similarity.
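
To make the formula concrete, here is a small worked sketch that applies it to the count vectors of the two example sentences. The function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions of this sketch, not part of the notebook's own code.

```python
import math

def cosine_similarity(x, y):
    # Dot product of the two vectors
    dot = sum(a * b for a, b in zip(x, y))
    # Euclidean norms (lengths) of each vector
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen for this sketch: a zero vector matches nothing
    return dot / (norm_x * norm_y)

# Count vectors for "I like cats, cats like me." and "Cats are really fun!"
s1 = [1, 2, 2, 1, 0, 0, 0]
s2 = [0, 0, 1, 0, 1, 1, 1]

similarity = cosine_similarity(s1, s2)
print(round(similarity, 4))  # 0.3162

# Vectors with no shared non-zero entries are orthogonal
print(cosine_similarity([1, 0], [0, 1]))  # 0.0
```

Here the dot product is 2 (only "cats" is shared), the norms are √10 and 2, and the result is 2 / (2√10) ≈ 0.3162.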

In [6]:
from nltk.tokenize import word_tokenize

def cosine_sim(sentence1, sentence2):
    """
    Returns cosine similarity between two sentences.
    
    :param sentence1: the first sentence to be compared
    :param sentence2: the second sentence to be compared
    :returns: a float between 0 and 1 describing how similar the given inputs are
    
    """
    # Reduce each sentence to its set of unique tokens (presence only, not counts)
    s1_token = set(word_tokenize(sentence1))
    s2_token = set(word_tokenize(sentence2))

    # Build one binary vector per sentence over the combined vocabulary
    combined_vect = s1_token.union(s2_token)
    TFVector1 = [1 if word in s1_token else 0 for word in combined_vect]
    TFVector2 = [1 if word in s2_token else 0 for word in combined_vect]

    # Dot product of the two vectors, i.e. the number of shared tokens
    c = sum(a * b for a, b in zip(TFVector1, TFVector2))

    # For binary vectors, the sum of the entries equals the squared Euclidean norm
    norm_product = (sum(TFVector1) * sum(TFVector2)) ** 0.5
    if norm_product == 0:
        return 0.0  # at least one input had no tokens, so avoid dividing by zero

    return c / norm_product


In [5]:
# Example usage of cosine_sim() function

X = df.iloc[0].Ingredients
Y = df.iloc[1].Ingredients

print("similarity: " + str(cosine_sim(X,Y)))

print(X, '\n')
print(Y)


similarity: 0.1336306209562122
     tomatoes , extra virgin olive oil, salt,   red chilies ,  head garlic ,   anchovies,   tomato paste,    parsley,   angel hair pasta  

  frozen udon noodles ,   butter ,  clove garlic ,   dashi powder,   oil,   pork shoulder ,   oyster or shiitake mushrooms ,   mirin,   cabbage ,   medium carrot ,   black pepper,   low sodium soy sauce,   water,   scallions 


# Step 3: Recommendations based on similar recipes

The next step is, given a recipe that the user may like, to search for other recipes that closely match it. For example, let's say a user was initially recommended "Crab Fried Rice" and wants to find recipes with a similar taste profile. Since we've already built a comparator, we can use it to scan our dataframe for the recipe that is most similar to the one the user prefers.

In [7]:
def find_closest_recipe(df, index):
    """
    Finds the recipe that is most similar to the given recipe using cosine similarity
    
    :param df: the dataframe to gather data from
    :param index: the index of the recipe that the user wants to find the most similar recipe to
    :prints: the name and the similarity rating of the most similar recipe
    
    """
    
    target = df.iloc[index]
    ingredients = target.Ingredients

    # Filter once: only recipes in the same category are candidates
    same_category = df[df.Category == target.Category]

    sim = 0
    highest = None
    for _, row in same_category.iterrows():
        # Skip the target itself (and any recipe with an identical ingredient list)
        if row.Ingredients == ingredients:
            continue
        score = cosine_sim(ingredients, row.Ingredients)
        if score > sim:
            sim = score
            highest = row

    if highest is None:
        print("No similar recipe found in this category.")
        return

    print("Closest recipe: " + str(highest.Name) + "\n Similarity Rating: ", str(sim))


In [8]:
# Example usage of find_closest_recipe()
# Let's say we want to find the most similar recipe to "Instant Pot Braised Curry Beef", which
# came up among our top beef recipes

# Recall that its index was 326.

find_closest_recipe(df, 326)

Closest recipe: Beef Curry, A Hong Kong Style Recipe
 Similarity Rating:  0.6366550033321674
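
Printing only the single closest match hides the runners-up. A minimal sketch of a top-`k` variant follows. It works on plain `(name, ingredients)` pairs rather than the dataframe and compares whole comma-separated ingredients instead of calling `word_tokenize`; the names `ingredient_set`, `set_cosine`, and `top_k_similar`, as well as the sample recipes, are made up for illustration.

```python
import heapq

def ingredient_set(ingredients):
    # Assumption: ingredients arrive as one comma-separated string
    return {item.strip().lower() for item in ingredients.split(",") if item.strip()}

def set_cosine(a, b):
    # Cosine similarity of two binary vectors, written in terms of sets:
    # shared items divided by the product of the vector lengths
    if not a or not b:
        return 0.0
    return len(a & b) / (len(a) * len(b)) ** 0.5

def top_k_similar(target, candidates, k=3):
    # Score every candidate against the target and keep the k best
    target_set = ingredient_set(target)
    scored = [(set_cosine(target_set, ingredient_set(ing)), name)
              for name, ing in candidates]
    return heapq.nlargest(k, scored)

# Toy data, not taken from the dataset
recipes = [
    ("Toy Recipe A", "beef, curry powder, onion, potato"),
    ("Toy Recipe B", "beef, soy sauce, garlic"),
    ("Toy Recipe C", "shrimp, broccoli, garlic"),
]

results = top_k_similar("beef, curry powder, onion, carrot", recipes, k=2)
for score, name in results:
    print(f"{name}: {score:.3f}")
```

With this toy data, Toy Recipe A shares three of four ingredients with the target and scores 0.750, ahead of Toy Recipe B at 0.289.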


# Step 4: Combining it all together

Let's take a look at what end-to-end usage of this recommender system might look like. Of course, it's far from perfect, and I'll delve into how I might improve it in the near future and make it more scalable.

In [9]:
# I'm feeling seafood tonight, but I also want to stay low on calories. I have a lot of time, so that isn't an issue.

time = int(input("Enter max cook time: "))
calories = int(input("Enter max calories: "))
category = input("What type of dish? Choose from Soup, Poultry, Noodles, Appetizers, Vegetarian, Rice, Bread, Beef, Seafood, Pork, Dessert: ")

get_top_recipes(time,calories,category)

Enter max cook time: 9999
Enter max calories: 350
What type of dish? Choose from Soup, Poultry, Noodles, Appetizers, Vegetarian, Rice, Bread, Beef, Seafood, Pork, Dessert: Seafood


Unnamed: 0.1,Unnamed: 0,Name,Category,Prep Time,Cook Time,Total Time,Ingredients,Calories,Average Rating,Review Count,URL
370,434,Shrimp and Broccoli,Seafood,20.0,15.0,35.0,"sized to count shrimp , broccoli florets...",206.0,5.0,12.0,https://thewoksoflife.com/shrimp-and-broccoli/
403,482,Pan Fried Fish: Chinese Whole Fish Recipe,Seafood,30.0,15.0,45.0,"porgies , salt, vegetable oil, ginger ...",300.0,5.0,11.0,https://thewoksoflife.com/pan-fried-fish/
416,501,Cantonese-Style Ginger Scallion Lobster,Seafood,35.0,15.0,50.0,"live lobsters , all purpose flour, corns...",260.0,5.0,10.0,https://thewoksoflife.com/ginger-scallion-lobs...
406,487,Salt and Pepper Shrimp,Seafood,30.0,15.0,45.0,"parts whole peppercorns, part sea salt, ...",312.0,5.0,9.0,https://thewoksoflife.com/salt-and-pepper-shrimp/
408,489,Chinese Stuffed Peppers,Seafood,20.0,10.0,30.0,"shrimp , scallions, salt, vegetable oi...",277.0,5.0,7.0,https://thewoksoflife.com/chinese-stuffed-pepp...


In [11]:
# I made the ginger-scallion lobster, and I really liked it! Are there any recipes similar to it?
find_closest_recipe(df, 416)

Closest recipe: Scallion Ginger Cantonese Crab
 Similarity Rating:  0.8346223261119858


# Future Action Items

There are a number of things I'd like to work on in the near future.

First, certain ingredients should be given higher priority than others. For example, the choice of protein is most important. In certain categories, like "Poultry", which may involve things like duck or chicken, the "chicken" ingredient should be assigned a larger weight to account for its importance in the dish.
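
One possible way to do this, sketched under assumptions, is to scale each ingredient's entry in the vector by a weight before taking the cosine. The function `weighted_cosine`, the weight values, and the sample ingredient sets below are all hypothetical and only meant to show the effect.

```python
import math

def weighted_cosine(a, b, weights, default=1.0):
    # a, b: sets of ingredient names
    # weights: maps an ingredient to its importance; unlisted ingredients get `default`
    def w(item):
        return weights.get(item, default)

    # Each present ingredient contributes its weight as the vector entry
    dot = sum(w(item) ** 2 for item in a & b)
    norm_a = math.sqrt(sum(w(item) ** 2 for item in a))
    norm_b = math.sqrt(sum(w(item) ** 2 for item in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical weights: proteins count three times as much as other ingredients
weights = {"chicken": 3.0, "duck": 3.0}

target = {"chicken", "soy sauce", "ginger"}
same_protein = {"chicken", "garlic", "scallion"}
same_sauce = {"duck", "soy sauce", "ginger"}

for label, other in [("same protein", same_protein), ("same sauce", same_sauce)]:
    plain = weighted_cosine(target, other, {})
    boosted = weighted_cosine(target, other, weights)
    print(f"{label}: unweighted={plain:.2f}, weighted={boosted:.2f}")
```

With these toy weights the recipe sharing the protein scores about 0.82 and the one sharing only the sauce about 0.18, whereas without weights the order is reversed (0.33 versus 0.67).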

Second, the similar-recipe search should also take into account user preferences, like the constraints on cook time and calories.

Third, the user should be able to specify any ingredients that they don't have and/or any food allergies, for a more robust system that actually reflects what the user will be able to make.
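
As a rough sketch of the third item, recipes could be screened against a list of excluded ingredients before any ranking happens. The helper names `contains_excluded` and `filter_recipes` and the sample recipes are hypothetical, and the simple substring match used here is an assumption that a real system would need to refine (for example, "oil" would also match "boiled").

```python
def contains_excluded(ingredients, excluded):
    # Assumption: ingredients arrive as one comma-separated string.
    # A recipe is flagged if any excluded term appears inside any ingredient.
    items = [item.strip().lower() for item in ingredients.split(",")]
    return any(term.lower() in item for item in items for term in excluded)

def filter_recipes(recipes, excluded):
    # Keep only the (name, ingredients) pairs that avoid every excluded term
    return [(name, ing) for name, ing in recipes
            if not contains_excluded(ing, excluded)]

# Toy data, not taken from the dataset
recipes = [
    ("Toy Recipe A", "shrimp, broccoli, garlic"),
    ("Toy Recipe B", "flank steak, peanut oil, scallions"),
    ("Toy Recipe C", "tofu, soy sauce, ginger"),
]

safe = filter_recipes(recipes, excluded=["shrimp", "peanut"])
print([name for name, _ in safe])  # ['Toy Recipe C']
```

Filtering first keeps the ranking and similarity steps unchanged, since they only ever see recipes the user can actually cook.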