<p align="center">
  <h1 align="center">Recipe recommendation system</h1>
  <h4 align="center">
    <strong>Jelle Huibregtse</strong> and <strong>Aron Hemmes</strong>
  </h4>
</p>

## The Assignment
The goal is to ultimately create a personalized recipe recommendation system that learns from the choices of its users.

### Loading in some libraries

In [1]:
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
plt.rcParams['figure.figsize'] = 10, 5



## Data Cleaning and Aggregation (DCA)
First, we load each dataset from its `.csv` file into a `DataFrame`, named after the dataset. For each `DataFrame` we then show the first 5 rows and some general information.


### The ingredient dataset
The first dataset is the ingredient dataset:

In [2]:
df_ingredient = pd.read_csv('dataset/ingredient.csv')
# Set the index to ingredient id.
df_ingredient = df_ingredient.set_index('ingredient_id')
df_ingredient.head()

ingredient_id,category,name,plural
1,dairy,1% lowfat cottage cheese,
6,dairy,1% lowfat milk,
10,Mexican products,10-inch flour tortilla,s
11,cereals,100% bran cereal,
12,dairy,2% lowfat milk,


In [3]:
df_ingredient.dtypes

category    object
name        object
plural      object
dtype: object

Let's get all the rows where `plural` is missing:

In [4]:
df_ingredient[df_ingredient['plural'].isnull()]

ingredient_id,category,name,plural
1,dairy,1% lowfat cottage cheese,
6,dairy,1% lowfat milk,
11,cereals,100% bran cereal,
12,dairy,2% lowfat milk,
20,vinegars,aceto balsamico vinegar,
...,...,...,...
4641,cereals,cooked oatmeal,
4642,hot beverages,instant coffee granules,
4643,grains,long grain enriched white rice,
4644,frozen fruit,frozen grapefruit juice concentrate,
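
Missing `plural` values matter once ingredient names are displayed. The sketch below assumes that `plural` holds a suffix to append to `name` (as `s` does for `10-inch flour tortilla`) and that a missing value means the name has no plural form. `display_name` is a hypothetical helper, not part of the dataset or of pandas.

```python
def display_name(name, plural, quantity):
    """Return the ingredient name, pluralised when quantity exceeds one."""
    # Missing values arrive from pandas as NaN (a float), so only real
    # strings are treated as a usable plural suffix.
    if quantity > 1 and isinstance(plural, str):
        return name + plural
    return name


print(display_name("10-inch flour tortilla", "s", 2))       # 10-inch flour tortillas
print(display_name("1% lowfat milk", float("nan"), 2))      # 1% lowfat milk
```
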


Let's get all the unique categories. This is useful when you want to filter ingredients.

In [5]:
unique = df_ingredient.category.unique()
pd.DataFrame(unique, columns=["category"]).head()

,category
0,dairy
1,Mexican products
2,cereals
3,breads
4,fresh vegetables


Since we are building a recipe recommendation system, it is key that we have both the ingredients of the recipes and their categories. The categories can feed directly into the recommendation: for example, if someone doesn't like dairy, we can exclude recipes that contain ingredients from the `dairy` category. One could make a case for removing the `plural` column, but we'd argue that it might come in handy when we want to display a recipe.
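
Excluding recipes by ingredient category could look roughly like the sketch below. It uses two tiny made-up frames that only mimic the shape of `df_ingredient` and `df_quantity` (the `carrot` row and the recipe ids are invented), and `recipes_without_category` is a hypothetical helper name.

```python
import pandas as pd

# Made-up stand-ins for df_ingredient and df_quantity.
df_ingredient_demo = pd.DataFrame(
    {
        "category": ["dairy", "cereals", "fresh vegetables"],
        "name": ["1% lowfat milk", "100% bran cereal", "carrot"],
    },
    index=pd.Index([6, 11, 50], name="ingredient_id"),
)
df_quantity_demo = pd.DataFrame(
    {"recipe_id": [1, 1, 2, 3, 3], "ingredient_id": [6, 11, 50, 11, 50]}
)


def recipes_without_category(df_quantity, df_ingredient, category):
    """Return the ids of recipes that use no ingredient from `category`."""
    # Ingredient ids that belong to the unwanted category.
    banned = df_ingredient.index[df_ingredient["category"] == category]
    # Recipes that use at least one of those ingredients.
    uses_banned = df_quantity["ingredient_id"].isin(banned)
    excluded = set(df_quantity.loc[uses_banned, "recipe_id"])
    # Everything else is safe to recommend.
    return sorted(int(r) for r in set(df_quantity["recipe_id"]) - excluded)


dairy_free = recipes_without_category(df_quantity_demo, df_ingredient_demo, "dairy")
print(dairy_free)  # [2, 3]
```

Working on sets of recipe ids keeps the rule strict: a single dairy ingredient is enough to exclude the whole recipe.
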

### The nutrition dataset
Next is the nutrition dataset:

In [6]:
df_nutrition = pd.read_csv('dataset/nutrition.csv')
# Set the index to recipe id.
df_nutrition = df_nutrition.set_index('recipe_id')
df_nutrition.head()

recipe_id,protein,carbo,alcohol,total_fat,sat_fat,cholestrl,sodium,iron,vitamin_c,vitamin_a,fiber,pcnt_cal_carb,pcnt_cal_fat,pcnt_cal_prot,calories
214,5.47,41.29,0.0,11.53,2.21,1.39,260.78,0.81,8.89,586.2,0.87,56.8,35.68,7.53,290.79
215,5.7,23.75,1.93,1.08,0.58,3.48,46.17,0.57,13.02,2738.24,0.62,67.38,6.89,16.17,141.01
216,4.9,26.88,0.0,1.1,0.58,3.46,41.79,0.37,6.13,1521.1,0.34,78.45,7.24,14.3,137.06
217,1.77,18.17,0.0,0.21,0.06,0.0,14.01,0.19,8.79,478.09,0.69,88.98,2.35,8.67,81.7
218,1.38,36.63,0.0,5.47,3.46,10.36,50.22,0.66,0.16,229.16,1.05,72.81,24.46,2.73,201.23


In [7]:
df_nutrition.dtypes

protein          float64
carbo            float64
alcohol          float64
total_fat        float64
sat_fat          float64
cholestrl        float64
sodium           float64
iron             float64
vitamin_c        float64
vitamin_a        float64
fiber            float64
pcnt_cal_carb    float64
pcnt_cal_fat     float64
pcnt_cal_prot    float64
calories         float64
dtype: object

The data types seem to be correct. Furthermore, for someone concerned with health, we can use each of these data points to improve the recommendation. Say someone does not want alcohol in their recipe; we can then keep only the recipes without alcohol:

In [8]:
df_nutrition_without_alcohol = df_nutrition[df_nutrition['alcohol'] == 0.00]
df_nutrition_without_alcohol['alcohol'].to_frame().head()

recipe_id,alcohol
214,0.0
216,0.0
217,0.0
218,0.0
220,0.0


Now, we have all recipes without alcohol!
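
The same idea extends to several nutrition columns at once. The sketch below is one possible generalisation, not the notebook's method: `filter_by_limits` is a hypothetical helper, and the demo frame copies only the `alcohol` and `calories` values of recipes 214, 215 and 217 from the table above.

```python
import pandas as pd

# A small slice of the nutrition data shown earlier.
df_nutrition_demo = pd.DataFrame(
    {"alcohol": [0.0, 1.93, 0.0], "calories": [290.79, 141.01, 81.7]},
    index=pd.Index([214, 215, 217], name="recipe_id"),
)


def filter_by_limits(df, limits):
    """Keep the rows where every listed column is <= its upper limit."""
    mask = pd.Series(True, index=df.index)
    for column, upper in limits.items():
        # Combine the conditions with a logical AND.
        mask &= (df[column] <= upper)
    return df[mask]


light_no_alcohol = filter_by_limits(
    df_nutrition_demo, {"alcohol": 0.0, "calories": 200.0}
)
print(light_no_alcohol.index.tolist())  # [217]
```
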

### The quantity dataset
Next is the quantity dataset:

In [9]:
df_quantity = pd.read_csv('dataset/quantity.csv')
# Set the index to quantity id.
df_quantity = df_quantity.set_index('quantity_id')
df_quantity.head()

quantity_id,recipe_id,ingredient_id,max_qty,min_qty,unit,preparation,optional
1,214,1613,2.0,2.0,cup(s),,False
2,214,3334,0.25,0.25,cup(s),,False
3,214,2222,0.5,0.5,cup(s),melted,False
4,214,2797,0.25,0.25,cup(s),or water,False
5,214,3567,3.0,3.0,teaspoon(s),,False


In [10]:
df_quantity.dtypes

recipe_id          int64
ingredient_id      int64
max_qty          float64
min_qty          float64
unit              object
preparation       object
optional            bool
dtype: object

Again, it seems that all columns are relevant, since in the end we want to be able to show the recipe to the user.
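
To illustrate how these columns could be combined when showing a recipe, here is a minimal sketch. `format_quantity` is a hypothetical helper, the ingredient names `butter` and `sugar` are invented for the example, and the output layout is only one of many reasonable choices.

```python
def format_quantity(min_qty, max_qty, unit, name, preparation=None, optional=False):
    """Build a display line such as '0.5 cup(s) butter, melted'."""
    # Show a range only when the minimum and maximum differ.
    if min_qty == max_qty:
        amount = f"{max_qty:g}"
    else:
        amount = f"{min_qty:g}-{max_qty:g}"
    parts = [amount]
    # Missing units and preparations come from pandas as NaN, not as strings.
    if isinstance(unit, str) and unit:
        parts.append(unit)
    parts.append(name)
    line = " ".join(parts)
    if isinstance(preparation, str) and preparation:
        line += f", {preparation}"
    if optional:
        line += " (optional)"
    return line


print(format_quantity(0.5, 0.5, "cup(s)", "butter", "melted"))  # 0.5 cup(s) butter, melted
```
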

### The recipe dataset
Finally, we have the recipe dataset:

In [11]:
df_recipe = pd.read_csv('dataset/recipe.csv')
# Set the index to recipe id.
df_recipe = df_recipe.set_index('recipe_id')
df_recipe.head()

recipe_id,title,subtitle,servings,yield_unit,prep_min,cook_min,stnd_min,source,intro,directions
214,Raspberry Chiffon Pie,,10,1 pie,20,8,305,The California Tree Fruit Agreement,,"For crust, preheat oven to 375 degrees F.\nIn..."
215,Apricot Yogurt Parfaits,,4,,5,2,65,Produce for Better Health Foundation and 5 a Day,,"Drain canned apricots, pour 1/4 cup of the ju..."
216,Fresh Apricot Bavarian,,8,,5,13,0,The California Apricot Advisory Board,Serve in stemmed glasses and top with sliced a...,Drop apricots into boiling water to cover. R...
217,Fresh Peaches,with Banana Cream Whip,4,,10,0,0,Produce for Better Health Foundation and 5 a Day,"For a quick, low-cal dessert, serve this on o...","In a small bowl, beat egg white until foamy. ..."
218,Canned Cherry Crisp,,6,,10,5,0,The Cherry Marketing Institute,Your microwave turns a can of cherry pie filli...,"Pour cherry pie filling into an 8-inch, round..."


In [12]:
df_recipe.dtypes

title         object
subtitle      object
servings       int64
yield_unit    object
prep_min       int64
cook_min       int64
stnd_min       int64
source        object
intro         object
directions    object
dtype: object

As with the nutrients, the person we are recommending recipes to might want a short cooking or preparation time. We can, for example, select all recipes that take less than 30 minutes in total to make.

In [13]:
df_recipe_less_than_30_minutes = df_recipe[df_recipe['prep_min'] + df_recipe['cook_min'] < 30]
df_recipe_less_than_30_minutes.head()

recipe_id,title,subtitle,servings,yield_unit,prep_min,cook_min,stnd_min,source,intro,directions
214,Raspberry Chiffon Pie,,10,1 pie,20,8,305,The California Tree Fruit Agreement,,"For crust, preheat oven to 375 degrees F.\nIn..."
215,Apricot Yogurt Parfaits,,4,,5,2,65,Produce for Better Health Foundation and 5 a Day,,"Drain canned apricots, pour 1/4 cup of the ju..."
216,Fresh Apricot Bavarian,,8,,5,13,0,The California Apricot Advisory Board,Serve in stemmed glasses and top with sliced a...,Drop apricots into boiling water to cover. R...
217,Fresh Peaches,with Banana Cream Whip,4,,10,0,0,Produce for Better Health Foundation and 5 a Day,"For a quick, low-cal dessert, serve this on o...","In a small bowl, beat egg white until foamy. ..."
218,Canned Cherry Crisp,,6,,10,5,0,The Cherry Marketing Institute,Your microwave turns a can of cherry pie filli...,"Pour cherry pie filling into an 8-inch, round..."


Now we only have recipes that take less than 30 minutes in total to make. Why not add the total recipe time as a column? We also considered removing `stnd_min`, since we do not know what the column means; for now that drop is commented out and the column is kept. Furthermore, the source does not seem relevant for our recommendation system, so we drop it.

In [14]:
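# Dropping `stnd_min` is left disabled for now, so the column stays in the frame.
#df_recipe.drop('stnd_min', inplace=True, axis=1)
# The source is not needed for recommendations.
df_recipe.drop('source', inplace=True, axis=1)
# Insert the total time (prep + cook) right after `cook_min`.
df_recipe.insert(loc=6, column='total_min', value=(df_recipe['prep_min'] + df_recipe['cook_min']))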
df_recipe.head()

recipe_id,title,subtitle,servings,yield_unit,prep_min,cook_min,total_min,stnd_min,intro,directions
214,Raspberry Chiffon Pie,,10,1 pie,20,8,28,305,,"For crust, preheat oven to 375 degrees F.\nIn..."
215,Apricot Yogurt Parfaits,,4,,5,2,7,65,,"Drain canned apricots, pour 1/4 cup of the ju..."
216,Fresh Apricot Bavarian,,8,,5,13,18,0,Serve in stemmed glasses and top with sliced a...,Drop apricots into boiling water to cover. R...
217,Fresh Peaches,with Banana Cream Whip,4,,10,0,10,0,"For a quick, low-cal dessert, serve this on o...","In a small bowl, beat egg white until foamy. ..."
218,Canned Cherry Crisp,,6,,10,5,15,0,Your microwave turns a can of cherry pie filli...,"Pour cherry pie filling into an 8-inch, round..."


Let's try joining the recipe and nutrition datasets!

In [15]:
df_recipe_nutrition = pd.merge(df_recipe, df_nutrition, on='recipe_id', how='inner')
df_recipe_nutrition.head()

recipe_id,title,subtitle,servings,yield_unit,prep_min,cook_min,total_min,stnd_min,intro,directions,...,cholestrl,sodium,iron,vitamin_c,vitamin_a,fiber,pcnt_cal_carb,pcnt_cal_fat,pcnt_cal_prot,calories
214,Raspberry Chiffon Pie,,10,1 pie,20,8,28,305,,"For crust, preheat oven to 375 degrees F.\nIn...",...,1.39,260.78,0.81,8.89,586.2,0.87,56.8,35.68,7.53,290.79
215,Apricot Yogurt Parfaits,,4,,5,2,7,65,,"Drain canned apricots, pour 1/4 cup of the ju...",...,3.48,46.17,0.57,13.02,2738.24,0.62,67.38,6.89,16.17,141.01
216,Fresh Apricot Bavarian,,8,,5,13,18,0,Serve in stemmed glasses and top with sliced a...,Drop apricots into boiling water to cover. R...,...,3.46,41.79,0.37,6.13,1521.1,0.34,78.45,7.24,14.3,137.06
217,Fresh Peaches,with Banana Cream Whip,4,,10,0,10,0,"For a quick, low-cal dessert, serve this on o...","In a small bowl, beat egg white until foamy. ...",...,0.0,14.01,0.19,8.79,478.09,0.69,88.98,2.35,8.67,81.7
218,Canned Cherry Crisp,,6,,10,5,15,0,Your microwave turns a can of cherry pie filli...,"Pour cherry pie filling into an 8-inch, round...",...,10.36,50.22,0.66,0.16,229.16,1.05,72.81,24.46,2.73,201.23
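
An inner join silently drops recipes that have no nutrition row. One way to check for that, sketched here on made-up frames (recipe 999 is invented), is to merge with `how='left'` and use the `validate` and `indicator` parameters of `pd.merge`. This is an optional sanity check, not something the notebook itself runs.

```python
import pandas as pd

recipe_demo = pd.DataFrame(
    {"title": ["Raspberry Chiffon Pie", "Apricot Yogurt Parfaits", "Hypothetical Recipe"]},
    index=pd.Index([214, 215, 999], name="recipe_id"),
)
nutrition_demo = pd.DataFrame(
    {"calories": [290.79, 141.01]},
    index=pd.Index([214, 215], name="recipe_id"),
)

# how="left" keeps every recipe, validate raises if a recipe_id is duplicated
# on either side, and indicator adds a `_merge` column describing each row.
checked = pd.merge(
    recipe_demo,
    nutrition_demo,
    on="recipe_id",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Recipes that an inner join would have dropped.
missing_nutrition = checked.index[checked["_merge"] == "left_only"].tolist()
print(missing_nutrition)  # [999]
```
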


### Export the dataframes

In [16]:
# Write each DataFrame to its own file so the exports do not overwrite each other.
df_recipe_nutrition.to_csv('export/recipe.csv', encoding='utf-8')
df_quantity.to_csv('export/quantity.csv', encoding='utf-8')
df_ingredient.to_csv('export/ingredient.csv', encoding='utf-8')
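
`to_csv` does not create missing directories, and writing several frames to one path keeps only the last of them. The sketch below shows one way to guard against both. `export_frames` is a hypothetical helper, and it writes to a temporary directory here rather than to `export/` so that it can be run anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_frames(frames, directory):
    """Write each DataFrame to `<directory>/<name>.csv` and return the paths."""
    out_dir = Path(directory)
    # Create the target directory if it does not exist yet.
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for name, frame in frames.items():
        path = out_dir / f"{name}.csv"
        frame.to_csv(path, encoding="utf-8")
        paths[name] = path
    return paths


demo = pd.DataFrame(
    {"title": ["Raspberry Chiffon Pie"]},
    index=pd.Index([214], name="recipe_id"),
)
with tempfile.TemporaryDirectory() as tmp:
    written = export_frames({"recipe": demo, "quantity": demo}, Path(tmp) / "export")
    file_names = sorted(p.name for p in written.values())
    # Reading a file back with the same index column recovers the rows.
    restored = pd.read_csv(written["recipe"], index_col="recipe_id")

print(file_names)  # ['quantity.csv', 'recipe.csv']
```

Keying the files by name makes it hard to export two frames to the same path by accident.
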