# Recipe Recommendation System: Data Preparation

By: Kelly Li


## Table of contents:
* [1 Introduction](#one)
* [2 Dataset](#two)
    * [2.1 Data Sources](#twoone)
    * [2.2 Data Loading](#twotwo)
* [3 Data Cleaning](#three)
    * [3.1 Data Types](#threeone) 
    * [3.2 Missing Data](#threetwo)
    * [3.3 Duplicate Data](#threethree)
* [4 Findings Summary](#four)
* [5 Conclusion](#five)

-------------------------------------------------------------------------------------------------------------------------------

## 1 Introduction <a class="anchor" id="one"></a>

Cooking enthusiasts often struggle to find personalized and diverse recipes that align with their tastes, dietary restrictions, and ingredient preferences. The existing search process is time-consuming, lacks inspiration, and fails to cater to specific dietary needs. To address these challenges, we have developed a recipe recommendation system that aids recipe discovery by providing inspiration and catering to specific dietary needs. By leveraging advanced algorithms, it curates personalized recipe suggestions that ignite culinary creativity while accounting for individual tastes and dietary requirements. Welcome to a world of culinary inspiration.

## 2 Dataset <a class="anchor" id="two"></a>

### 2.1 Data Sources <a class="anchor" id="twoone"></a>

Here's a data dictionary for the columns in the raw recipe dataset:

| Column Name   | Description                                               |
| ------------- | --------------------------------------------------------- |
| name          | The name of the recipe.                                   |
| id            | The unique identifier of the recipe.                       |
| minutes       | The total cooking and preparation time in minutes.         |
| contributor_id | The unique identifier of the user who submitted the recipe. |
| submitted     | The date when the recipe was submitted.                    |
| tags          | Tags or labels associated with the recipe (e.g., vegetarian, vegan, dessert). |
| nutrition     | Nutritional information stored as a list: calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), and carbohydrates (PDV), where PDV is percent daily value. |
| n_steps       | The total number of steps or instructions in the recipe.   |
| steps         | The step-by-step instructions for preparing the recipe.    |
| description   | A brief description or summary of the recipe.              |
| ingredients   | The list of ingredients required for the recipe.           |
| n_ingredients | The total number of ingredients used in the recipe.        |

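The sample rows shown later in this notebook store 'nutrition' as a string that looks like a list of seven numbers. A minimal sketch of how that string could be expanded into numeric columns is given here. The column names are hypothetical, and the assumed order (with carbohydrates as the seventh value) should be checked against the dataset's own documentation before use.

```python
import ast
import pandas as pd

# Toy frame mimicking the raw 'nutrition' column, which stores each list as a string
toy_recipes = pd.DataFrame({
    "id": [137739, 31490],
    "nutrition": [
        "[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",
        "[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",
    ],
})

# Hypothetical column names; the order is an assumption, not confirmed by the source
nutrition_cols = [
    "calories", "total_fat_pdv", "sugar_pdv", "sodium_pdv",
    "protein_pdv", "saturated_fat_pdv", "carbohydrates_pdv",
]

# Safely parse each string into a Python list, then expand the lists into columns
parsed = toy_recipes["nutrition"].apply(ast.literal_eval)
nutrition_df = pd.DataFrame(parsed.tolist(), columns=nutrition_cols, index=toy_recipes.index)
print(nutrition_df)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text read from a file. The same approach would likely apply to the 'tags', 'steps', and 'ingredients' columns, which also appear to hold list-like strings.
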
Here's a data dictionary for the columns in the raw user interactions dataset:

| Column Name | Description                                          |
| ----------- | ---------------------------------------------------- |
| user_id     | The unique identifier of the user.                    |
| recipe_id   | The unique identifier of the recipe associated with the user's interaction. |
| date        | The date when the user's interaction took place.      |
| rating      | The rating given by the user for the recipe.          |
| review      | The review or feedback provided by the user for the recipe. |

### 2.2 Data Loading <a class="anchor" id="twotwo"></a>

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Read the raw datasets
raw_recipes_df = pd.read_csv('RAW_recipes.csv')
raw_users_df = pd.read_csv('RAW_interactions.csv')

In [3]:
# Shape of the data
print('The shape of the recipe dataset is:', raw_recipes_df.shape)
print('The shape of the user interactions dataset is:', raw_users_df.shape)

The shape of the recipe dataset is: (231637, 12)
The shape of the user interactions dataset is: (1132367, 5)


In [4]:
# View recipes dataset
raw_recipes_df.head()

|     | name | id | minutes | contributor_id | submitted | tags | nutrition | n_steps | steps | description | ingredients | n_ingredients |
| --- | ---- | -- | ------- | -------------- | --------- | ---- | --------- | ------- | ----- | ----------- | ----------- | ------------- |
| 0 | arriba baked winter squash mexican style | 137739 | 55 | 47892 | 2005-09-16 | ['60-minutes-or-less', 'time-to-make', 'course... | [51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0] | 11 | ['make a choice and proceed with recipe', 'dep... | autumn is my favorite time of year to cook! th... | ['winter squash', 'mexican seasoning', 'mixed ... | 7 |
| 1 | a bit different breakfast pizza | 31490 | 30 | 26278 | 2002-06-17 | ['30-minutes-or-less', 'time-to-make', 'course... | [173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0] | 9 | ['preheat oven to 425 degrees f', 'press dough... | this recipe calls for the crust to be prebaked... | ['prepared pizza crust', 'sausage patty', 'egg... | 6 |
| 2 | all in the kitchen chili | 112140 | 130 | 196586 | 2005-02-25 | ['time-to-make', 'course', 'preparation', 'mai... | [269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0] | 6 | ['brown ground beef in large pot', 'add choppe... | this modified version of 'mom's' chili was a h... | ['ground beef', 'yellow onions', 'diced tomato... | 13 |
| 3 | alouette potatoes | 59389 | 45 | 68585 | 2003-04-14 | ['60-minutes-or-less', 'time-to-make', 'course... | [368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0] | 11 | ['place potatoes in a large pot of lightly sal... | this is a super easy, great tasting, make ahea... | ['spreadable cheese with garlic and herbs', 'n... | 11 |
| 4 | amish tomato ketchup for canning | 44061 | 190 | 41706 | 2002-10-25 | ['weeknight', 'time-to-make', 'course', 'main-... | [352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0] | 5 | ['mix all ingredients& boil for 2 1 / 2 hours ... | my dh's amish mother raised him on this recipe... | ['tomato juice', 'apple cider vinegar', 'sugar... | 8 |


In [5]:
# View user interactions dataset
raw_users_df.head()

|     | user_id | recipe_id | date | rating | review |
| --- | ------- | --------- | ---- | ------ | ------ |
| 0 | 38094 | 40893 | 2003-02-17 | 4 | Great with a salad. Cooked on top of stove for... |
| 1 | 1293707 | 40893 | 2011-12-21 | 5 | So simple, so delicious! Great for chilly fall... |
| 2 | 8937 | 44394 | 2002-12-01 | 4 | This worked very well and is EASY. I used not... |
| 3 | 126440 | 85009 | 2010-02-27 | 5 | I made the Mexican topping and took it to bunk... |
| 4 | 57222 | 85009 | 2011-10-01 | 5 | Made the cheddar bacon topping, adding a sprin... |


## 3 Data Cleaning <a class="anchor" id="three"></a>

### 3.1 Data Types <a class="anchor" id="threeone"></a>

To understand the data types and the number of non-null values in each column, we will use the `.info()` method.

In [6]:
# Info about the recipe dataset
raw_recipes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 231637 entries, 0 to 231636
Data columns (total 12 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   name            231636 non-null  object
 1   id              231637 non-null  int64 
 2   minutes         231637 non-null  int64 
 3   contributor_id  231637 non-null  int64 
 4   submitted       231637 non-null  object
 5   tags            231637 non-null  object
 6   nutrition       231637 non-null  object
 7   n_steps         231637 non-null  int64 
 8   steps           231637 non-null  object
 9   description     226658 non-null  object
 10  ingredients     231637 non-null  object
 11  n_ingredients   231637 non-null  int64 
dtypes: int64(5), object(7)
memory usage: 21.2+ MB


The 'id' and 'contributor_id' columns are currently of `int64` data type, while the 'submitted' column is of `object` data type. Because the identifiers are labels rather than quantities, we will convert 'id' and 'contributor_id' to strings (stored as `object` data type). We will also convert the 'submitted' column to `datetime` data type for more convenient date-based operations.

In [7]:
# Convert 'id' and 'contributor_id' columns to object data type
raw_recipes_df['id'] = raw_recipes_df['id'].astype(str)
raw_recipes_df['contributor_id'] = raw_recipes_df['contributor_id'].astype(str)

# Convert 'submitted' column to datetime data type
raw_recipes_df['submitted'] = pd.to_datetime(raw_recipes_df['submitted'], format='%Y-%m-%d')

# Print updated data types of the columns
raw_recipes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 231637 entries, 0 to 231636
Data columns (total 12 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   name            231636 non-null  object        
 1   id              231637 non-null  object        
 2   minutes         231637 non-null  int64         
 3   contributor_id  231637 non-null  object        
 4   submitted       231637 non-null  datetime64[ns]
 5   tags            231637 non-null  object        
 6   nutrition       231637 non-null  object        
 7   n_steps         231637 non-null  int64         
 8   steps           231637 non-null  object        
 9   description     226658 non-null  object        
 10  ingredients     231637 non-null  object        
 11  n_ingredients   231637 non-null  int64         
dtypes: datetime64[ns](1), int64(3), object(8)
memory usage: 21.2+ MB

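The conversion above passes an explicit `format`, so by default `pd.to_datetime` raises an error if any value does not match it. Where malformed dates are a concern, one option is to coerce them to `NaT` and count them instead. The following is a minimal sketch on a toy series (the values are made up for illustration), not a step the notebook itself performs.

```python
import pandas as pd

# Toy series with one malformed date (illustrative values only)
toy_dates = pd.Series(["2005-09-16", "2002-06-17", "not a date"])

# errors="coerce" turns values that do not match the format into NaT
parsed = pd.to_datetime(toy_dates, format="%Y-%m-%d", errors="coerce")

# Count how many values failed to parse
n_bad = int(parsed.isna().sum())
print("Unparseable dates:", n_bad)
```
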

Let us now examine the data types within the user interactions dataset.

In [25]:
# Info about the user interactions dataset
raw_users_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1132367 entries, 0 to 1132366
Data columns (total 5 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   user_id    1132367 non-null  int64 
 1   recipe_id  1132367 non-null  int64 
 2   date       1132367 non-null  object
 3   rating     1132367 non-null  int64 
 4   review     1132198 non-null  object
dtypes: int64(3), object(2)
memory usage: 43.2+ MB


As in the recipes dataset, the 'user_id' and 'recipe_id' columns are currently of `int64` data type, while the 'date' column is of `object` data type. For consistency, we will convert 'user_id' and 'recipe_id' to strings (stored as `object` data type) and convert the 'date' column to `datetime` data type for more convenient date-based operations.

In [26]:
# Convert 'user_id' and 'recipe_id' columns to object data type
raw_users_df['user_id'] = raw_users_df['user_id'].astype(str)
raw_users_df['recipe_id'] = raw_users_df['recipe_id'].astype(str)

# Convert 'date' column to datetime data type
raw_users_df['date'] = pd.to_datetime(raw_users_df['date'], format='%Y-%m-%d')

# Print updated data types of the columns
raw_users_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1132367 entries, 0 to 1132366
Data columns (total 5 columns):
 #   Column     Non-Null Count    Dtype         
---  ------     --------------    -----         
 0   user_id    1132367 non-null  object        
 1   recipe_id  1132367 non-null  object        
 2   date       1132367 non-null  datetime64[ns]
 3   rating     1132367 non-null  int64         
 4   review     1132198 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 43.2+ MB


### 3.2 Missing Data <a class="anchor" id="threetwo"></a>

Now that we have gained a deeper understanding of the dataset, let us proceed to explore any potential missing values within it.

In [9]:
# Checking for missing values
raw_recipes_df.isna().sum()

name                 1
id                   0
minutes              0
contributor_id       0
submitted            0
tags                 0
nutrition            0
n_steps              0
steps                0
description       4979
ingredients          0
n_ingredients        0
dtype: int64

There is 1 missing value in the 'name' column and 4,979 missing values in the 'description' column. Let us examine the rows that contain missing values to identify any discernible patterns and explore potential strategies for data imputation.

In [10]:
# Row with missing name
raw_recipes_df[raw_recipes_df['name'].isna()]

|     | name | id | minutes | contributor_id | submitted | tags | nutrition | n_steps | steps | description | ingredients | n_ingredients |
| --- | ---- | -- | ------- | -------------- | --------- | ---- | --------- | ------- | ----- | ----------- | ----------- | ------------- |
| 721 | NaN | 368257 | 10 | 779451 | 2009-04-27 | ['15-minutes-or-less', 'time-to-make', 'course... | [1596.2, 249.0, 155.0, 0.0, 2.0, 112.0, 14.0] | 6 | ['in a bowl , combine ingredients except for o... | ------------- | ['lemon', 'honey', 'horseradish mustard', 'gar... | 10 |


The row with the missing 'name' value has only a placeholder description (a string of dashes). Consequently, we cannot infer a name from the description, as there is no textual information to guide the imputation. Since only one row has a missing 'name', we can safely drop it from the dataset.

In [None]:
# Drop row with missing name
raw_recipes_df.dropna(subset=['name'], inplace=True)

In [14]:
# Sanity check
raw_recipes_df.isna().sum()

name                 0
id                   0
minutes              0
contributor_id       0
submitted            0
tags                 0
nutrition            0
n_steps              0
steps                0
description       4979
ingredients          0
n_ingredients        0
dtype: int64

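The row dropped above had a description made up only of dashes, which `isna()` does not count as missing. A minimal sketch of how such placeholder descriptions could be flagged is shown here on a toy frame. The rule used (no letters or digits anywhere in the text) is an assumption for illustration, not a check the notebook performs.

```python
import pandas as pd

# Toy frame: one placeholder description, one real one, one truly missing
toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "description": ["-------------", "a quick weeknight chili", None],
})

# True where the description contains at least one letter or digit;
# na=False makes missing values evaluate to False instead of NaN
has_alnum = toy["description"].str.contains(r"[A-Za-z0-9]", na=False)

# Placeholder = present (not NaN) but without any letters or digits
is_placeholder = toy["description"].notna() & ~has_alnum
print("Placeholder descriptions:", int(is_placeholder.sum()))
```
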
Let us take a look at the rows with a missing description.

In [12]:
# Sample of rows with a missing description
raw_recipes_df[raw_recipes_df['description'].isna()].sample(10)

|     | name | id | minutes | contributor_id | submitted | tags | nutrition | n_steps | steps | description | ingredients | n_ingredients |
| --- | ---- | -- | ------- | -------------- | --------- | ---- | --------- | ------- | ----- | ----------- | ----------- | ------------- |
| 129194 | mammies oatmeal cookies | 17283 | 30 | 17803 | 2002-01-17 | ['30-minutes-or-less', 'time-to-make', 'course... | [151.1, 9.0, 55.0, 5.0, 4.0, 17.0, 7.0] | 8 | ['cream butter and sugars add eggs', 'sift flo... | NaN | ['butter', 'sugar', 'brown sugar', 'eggs', 'fl... | 12 |
| 231534 | zucchini garlic pasta | 49456 | 40 | 37779 | 2002-12-22 | ['60-minutes-or-less', 'time-to-make', 'course... | [519.4, 31.0, 16.0, 19.0, 38.0, 36.0, 21.0] | 11 | ['prepare pasta according to package direction... | NaN | ['wagon wheel macaroni', 'bacon', 'onion', 'ga... | 8 |
| 34991 | cantaloupe cobbler | 49224 | 70 | 58886 | 2002-12-19 | ['weeknight', 'time-to-make', 'course', 'main-... | [3261.4, 308.0, 976.0, 132.0, 66.0, 620.0, 116.0] | 4 | ['place melon in a casserole dish', 'combine a... | NaN | ['cantaloupe', 'sugar', 'milk', 'self rising f... | 6 |
| 34581 | canadian cream | 78376 | 5 | 106624 | 2003-12-09 | ['15-minutes-or-less', 'time-to-make', 'course... | [124.7, 5.0, 49.0, 2.0, 6.0, 10.0, 4.0] | 4 | ['combine all well', 'bottle', 'refrigerate', ... | NaN | ['sweetened condensed milk', 'carnation evapor... | 6 |
| 116002 | just oatmeal cookies | 43929 | 20 | 27443 | 2002-10-22 | ['30-minutes-or-less', 'time-to-make', 'course... | [122.1, 9.0, 35.0, 3.0, 3.0, 10.0, 5.0] | 7 | ['cream shortening , butter and sugars', 'add ... | NaN | ['shortening', 'butter', 'brown sugar', 'white... | 12 |
| 136715 | minute rice s spanish rice with beef | 69958 | 30 | 93698 | 2003-08-28 | ['30-minutes-or-less', 'time-to-make', 'course... | [410.4, 21.0, 15.0, 24.0, 43.0, 25.0, 17.0] | 6 | ['brown meat , breaking the pieces and stirrin... | NaN | ['ground beef', 'frozen corn', 'water', 'stewe... | 10 |
| 17816 | barbecued lobster tails | 31309 | 22 | 23302 | 2002-06-17 | ['30-minutes-or-less', 'time-to-make', 'course... | [800.7, 89.0, 1.0, 20.0, 117.0, 40.0, 2.0] | 8 | ['split the tails lengthwise with a large knif... | NaN | ['salt substitute', 'paprika', 'white pepper',... | 7 |
| 118352 | kiwi commotion smoothie | 16869 | 5 | 8728 | 2002-01-06 | ['15-minutes-or-less', 'time-to-make', 'course... | [278.1, 5.0, 196.0, 2.0, 7.0, 9.0, 20.0] | 2 | ['blend kiwis , banana , and frozen yogurt in ... | NaN | ['kiwi fruits', 'banana', 'vanilla frozen yogu... | 4 |
| 31907 | butter sticks | 57670 | 45 | 52476 | 2003-04-02 | ['60-minutes-or-less', 'time-to-make', 'course... | [130.8, 12.0, 9.0, 12.0, 3.0, 20.0, 4.0] | 10 | ['heat oven to 425', 'place butter into 13x9 b... | NaN | ['butter', 'bisquick baking mix', 'water'] | 3 |
| 41170 | cherry almond chews | 46753 | 32 | 26399 | 2002-11-18 | ['60-minutes-or-less', 'time-to-make', 'course... | [82.7, 6.0, 32.0, 2.0, 1.0, 6.0, 3.0] | 9 | ['in a mixing bowl cream shortening and sugar'... | NaN | ['shortening', 'sugar', 'brown sugar', 'eggs',... | 11 |


Since the descriptions are written by the recipe contributors themselves and are not available elsewhere in the source data, it is not feasible to impute missing values for the 'description' column. Let us look at how much of the overall dataset these missing values represent.

In [19]:
print('Missing values make up', round(raw_recipes_df.isna().sum().sum()/raw_recipes_df.shape[0]*100, 2), "%", "of the overall data.")

Missing values make up 2.15 % of the overall data.

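The percentage above divides the total count of missing cells by the number of rows, which matches the share of affected rows only when each such row has exactly one missing cell. A per-column view makes the same check explicit. The following is a minimal sketch on a toy frame (values made up for illustration), using `isna().mean()` to get the fraction of missing entries.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing description out of four rows
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "description": ["x", np.nan, "y", "z"],
})

# Share of missing values per column, in percent
missing_pct = toy.isna().mean().mul(100).round(2)

# Share of rows with at least one missing value, in percent
rows_with_missing_pct = round(toy.isna().any(axis=1).mean() * 100, 2)

print(missing_pct)
print("Rows with any missing value (%):", rows_with_missing_pct)
```
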

Because 'description' is the only column with missing values, this figure is also the share of rows affected. At about 2.15% of the recipes, it is a small subset of the dataset, so we can safely drop these rows.

In [22]:
# Drop rows with missing description
raw_recipes_df.dropna(subset=['description'], inplace=True)

In [24]:
# Sanity check
raw_recipes_df.isna().sum()

name              0
id                0
minutes           0
contributor_id    0
submitted         0
tags              0
nutrition         0
n_steps           0
steps             0
description       0
ingredients       0
n_ingredients     0
dtype: int64

We will now take a look at the user interactions dataset to identify any missing values.

In [13]:
raw_users_df.isna().sum()

user_id        0
recipe_id      0
date           0
rating         0
review       169
dtype: int64

There are 169 missing values in the 'review' column. As with the 'description' column in the recipes dataset, reviews are written by the users themselves and are not available elsewhere in the source data, so it is not feasible to impute them. Let us now assess the proportion of these missing values in relation to the overall dataset.

In [27]:
print('Missing values make up', round(raw_users_df.isna().sum().sum()/raw_users_df.shape[0]*100, 2), "%", "of the overall data.")

Missing values make up 0.01 % of the overall data.


Given that the missing values in the 'review' column constitute a very small subset of the overall dataset, we can safely drop these rows.

In [28]:
# Drop rows with missing review
raw_users_df.dropna(subset=['review'], inplace=True)

In [29]:
# Sanity check
raw_users_df.isna().sum()

user_id      0
recipe_id    0
date         0
rating       0
review       0
dtype: int64

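Rows were dropped from the two datasets independently, so some interactions may now point to recipes that are no longer in the recipes table. A minimal sketch of how such orphaned interactions could be found and filtered is shown here on toy frames. Whether this filter is wanted is an assumption that depends on how the two tables are used downstream.

```python
import pandas as pd

# Toy tables: recipe "3" is referenced by an interaction but absent from the recipes
toy_recipes = pd.DataFrame({"id": ["1", "2"]})
toy_interactions = pd.DataFrame({
    "user_id": ["10", "11", "12"],
    "recipe_id": ["1", "3", "2"],
    "rating": [5, 4, 3],
})

# True where the interaction's recipe_id still exists in the recipes table
known = toy_interactions["recipe_id"].isin(toy_recipes["id"])

# Keep only interactions that refer to a known recipe
filtered = toy_interactions[known]
n_orphaned = len(toy_interactions) - len(filtered)
print(n_orphaned, "orphaned interactions removed")
```
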
### 3.3 Duplicate Data <a class="anchor" id="threethree"></a>

Now that we have addressed the missing values, let us proceed to identify any potential duplicates within the datasets.

In [30]:
# Duplicated rows
print("duplicated rows in recipes dataset:", raw_recipes_df.duplicated().sum())
print("duplicated rows in user interactions dataset:", raw_users_df.duplicated().sum())

duplicated rows in recipes dataset: 0
duplicated rows in user interactions dataset: 0


Great! It appears that there are no duplicated rows within the datasets.

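`duplicated()` with no arguments only flags rows that match in every column. A stricter check, sketched here on a toy frame, looks for repeats of the identifying columns alone. Treating ('user_id', 'recipe_id') as a key is an assumption for illustration, since the source does not say whether a user may review the same recipe more than once.

```python
import pandas as pd

# Toy interactions: the same user rates the same recipe twice with different ratings
toy_interactions = pd.DataFrame({
    "user_id": ["10", "10", "11"],
    "recipe_id": ["1", "1", "1"],
    "rating": [5, 4, 5],
})

# Rows identical across all columns
full_dupes = int(toy_interactions.duplicated().sum())

# Rows repeating an earlier (user_id, recipe_id) pair, regardless of other columns
key_dupes = int(toy_interactions.duplicated(subset=["user_id", "recipe_id"]).sum())

print("Fully duplicated rows:", full_dupes)
print("Repeated user/recipe pairs:", key_dupes)
```
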
#### Saving the data

In [None]:
# Save clean datasets
raw_recipes_df.to_csv("clean_recipes.csv", index=False)
raw_users_df.to_csv("clean_interactions.csv", index=False)
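
CSV files do not store data types, so the string identifiers and `datetime` columns set earlier are inferred afresh when the files are read back. A minimal sketch of a round trip on a toy frame is shown here, using an in-memory buffer in place of the saved files. The `dtype` and `parse_dates` arguments are one way to restore the types on load.

```python
import io
import pandas as pd

# Toy frame with a string identifier and a datetime column
toy = pd.DataFrame({
    "id": ["137739", "31490"],
    "submitted": pd.to_datetime(["2005-09-16", "2002-06-17"]),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reading without hints: identifiers are inferred as integers
buffer.seek(0)
naive = pd.read_csv(buffer)

# Reading with hints: identifiers stay strings and dates are parsed
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"id": str}, parse_dates=["submitted"])

print(naive.dtypes)
print(restored.dtypes)
```
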