# Recipe Recommendation System: Data Preprocessing

By: Kelly Li


## Table of contents:
* [1 Introduction](#one)
* [2 Data Preparation](#two)
    * [2.1 Data Sources](#twoone)
    * [2.2 Initial Data Exploration](#twotwo)
* [3 Data Cleaning](#three)
    * [3.1 Datatypes](#threeone) 
    * [3.2 Duplicate Data](#threetwo)
    * [3.3 Missing Data](#threethree)
    * [3.4 Data Aggregation](#threefour)
* [4 Findings Summary](#four)
* [5 Conclusion](#five)

-------------------------------------------------------------------------------------------------------------------------------

## 1 Introduction <a class="anchor" id="one"></a>

Cooking enthusiasts often struggle to find personalized, diverse recipes that match their tastes, dietary restrictions, and ingredient preferences. The existing search process is time-consuming, offers little inspiration, and fails to cater to specific dietary needs. To address these challenges, we have developed a recipe recommendation system that supports recipe discovery. It uses recommendation algorithms to curate personalized suggestions that spark culinary creativity while respecting each user's tastes and dietary requirements. Welcome to a world of culinary inspiration.

## 2 Data Preparation <a class="anchor" id="two"></a>

### 2.1 Data Sources <a class="anchor" id="twoone"></a>

Here's a data dictionary for the columns in the raw recipe dataset:

| Column Name   | Description                                               |
| ------------- | --------------------------------------------------------- |
| name          | The name of the recipe.                                   |
| id            | The unique identifier of the recipe.                       |
| minutes       | The total cooking and preparation time in minutes.         |
| contributor_id | The unique identifier of the user who submitted the recipe. |
| submitted     | The date when the recipe was submitted.                    |
| tags          | Tags or labels associated with the recipe (e.g., vegetarian, vegan, dessert). |
| nutrition     | Nutritional information: calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), and carbohydrates (PDV), where PDV is percent daily value. |
| n_steps       | The total number of steps or instructions in the recipe.   |
| steps         | The step-by-step instructions for preparing the recipe.    |
| description   | A brief description or summary of the recipe.              |
| ingredients   | The list of ingredients required for the recipe.           |
| n_ingredients | The total number of ingredients used in the recipe.        |

Here's a data dictionary for the columns in the raw user interactions dataset:

| Column Name | Description                                          |
| ----------- | ---------------------------------------------------- |
| user_id     | The unique identifier of the user.                    |
| recipe_id   | The unique identifier of the recipe associated with the user's interaction. |
| date        | The date when the user's interaction took place.      |
| rating      | The rating given by the user for the recipe.          |
| review      | The review or feedback provided by the user for the recipe. |

## Data Loading

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Read the raw datasets
raw_recipes_df = pd.read_csv('RAW_recipes.csv')
raw_users_df = pd.read_csv('RAW_interactions.csv')

In [4]:
# Shape of the data
print('The shape of the recipe dataset is:', raw_recipes_df.shape)
print('The shape of the user interactions dataset is:', raw_users_df.shape)

The shape of the recipe dataset is: (231637, 12)
The shape of the user interactions dataset is: (1132367, 5)


In [3]:
# View recipes dataset
raw_recipes_df.head()

|  | name | id | minutes | contributor_id | submitted | tags | nutrition | n_steps | steps | description | ingredients | n_ingredients |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | arriba baked winter squash mexican style | 137739 | 55 | 47892 | 2005-09-16 | ['60-minutes-or-less', 'time-to-make', 'course... | [51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0] | 11 | ['make a choice and proceed with recipe', 'dep... | autumn is my favorite time of year to cook! th... | ['winter squash', 'mexican seasoning', 'mixed ... | 7 |
| 1 | a bit different breakfast pizza | 31490 | 30 | 26278 | 2002-06-17 | ['30-minutes-or-less', 'time-to-make', 'course... | [173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0] | 9 | ['preheat oven to 425 degrees f', 'press dough... | this recipe calls for the crust to be prebaked... | ['prepared pizza crust', 'sausage patty', 'egg... | 6 |
| 2 | all in the kitchen chili | 112140 | 130 | 196586 | 2005-02-25 | ['time-to-make', 'course', 'preparation', 'mai... | [269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0] | 6 | ['brown ground beef in large pot', 'add choppe... | this modified version of 'mom's' chili was a h... | ['ground beef', 'yellow onions', 'diced tomato... | 13 |
| 3 | alouette potatoes | 59389 | 45 | 68585 | 2003-04-14 | ['60-minutes-or-less', 'time-to-make', 'course... | [368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0] | 11 | ['place potatoes in a large pot of lightly sal... | this is a super easy, great tasting, make ahea... | ['spreadable cheese with garlic and herbs', 'n... | 11 |
| 4 | amish tomato ketchup for canning | 44061 | 190 | 41706 | 2002-10-25 | ['weeknight', 'time-to-make', 'course', 'main-... | [352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0] | 5 | ['mix all ingredients& boil for 2 1 / 2 hours ... | my dh's amish mother raised him on this recipe... | ['tomato juice', 'apple cider vinegar', 'sugar... | 8 |


In [5]:
# View user interactions dataset
raw_users_df.head()

|  | user_id | recipe_id | date | rating | review |
| --- | --- | --- | --- | --- | --- |
| 0 | 38094 | 40893 | 2003-02-17 | 4 | Great with a salad. Cooked on top of stove for... |
| 1 | 1293707 | 40893 | 2011-12-21 | 5 | So simple, so delicious! Great for chilly fall... |
| 2 | 8937 | 44394 | 2002-12-01 | 4 | This worked very well and is EASY. I used not... |
| 3 | 126440 | 85009 | 2010-02-27 | 5 | I made the Mexican topping and took it to bunk... |
| 4 | 57222 | 85009 | 2011-10-01 | 5 | Made the cheddar bacon topping, adding a sprin... |


To inspect the data type of each column and spot columns with missing values, we use the `.info()` method, which reports the non-null count and dtype for every column.

In [6]:
# Info about the recipe dataset
raw_recipes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 231637 entries, 0 to 231636
Data columns (total 12 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   name            231636 non-null  object
 1   id              231637 non-null  int64 
 2   minutes         231637 non-null  int64 
 3   contributor_id  231637 non-null  int64 
 4   submitted       231637 non-null  object
 5   tags            231637 non-null  object
 6   nutrition       231637 non-null  object
 7   n_steps         231637 non-null  int64 
 8   steps           231637 non-null  object
 9   description     226658 non-null  object
 10  ingredients     231637 non-null  object
 11  n_ingredients   231637 non-null  int64 
dtypes: int64(5), object(7)
memory usage: 21.2+ MB


We can see that the 'id' and 'contributor_id' columns are stored as integers, while the 'submitted' column is stored as object (string) data. Because the identifiers are labels rather than quantities, we will convert 'id' and 'contributor_id' to strings (object data type) so they are not treated as numbers. We will also convert the 'submitted' column to a datetime data type to simplify date-based operations.

In [9]:
# Convert 'id' and 'contributor_id' columns to object data type
raw_recipes_df['id'] = raw_recipes_df['id'].astype(str)
raw_recipes_df['contributor_id'] = raw_recipes_df['contributor_id'].astype(str)

# Convert 'submitted' column to date data type
raw_recipes_df['submitted'] = pd.to_datetime(raw_recipes_df['submitted'], format='%Y-%m-%d')

# Print updated data types of the columns
raw_recipes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 231637 entries, 0 to 231636
Data columns (total 12 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   name            231636 non-null  object        
 1   id              231637 non-null  object        
 2   minutes         231637 non-null  int64         
 3   contributor_id  231637 non-null  object        
 4   submitted       231637 non-null  datetime64[ns]
 5   tags            231637 non-null  object        
 6   nutrition       231637 non-null  object        
 7   n_steps         231637 non-null  int64         
 8   steps           231637 non-null  object        
 9   description     226658 non-null  object        
 10  ingredients     231637 non-null  object        
 11  n_ingredients   231637 non-null  int64         
dtypes: datetime64[ns](1), int64(3), object(8)
memory usage: 21.2+ MB
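As an alternative to converting after loading, the same types can be requested when the file is read. The following is a minimal sketch, assuming a small in-memory stand-in for `RAW_recipes.csv`; the name `sample_recipes_df` is hypothetical and is not used elsewhere in this notebook.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for RAW_recipes.csv (values copied from the preview above)
sample_csv = io.StringIO(
    "name,id,contributor_id,submitted\n"
    "alouette potatoes,59389,68585,2003-04-14\n"
    "amish tomato ketchup for canning,44061,41706,2002-10-25\n"
)

# dtype keeps the identifiers as strings; parse_dates converts 'submitted' to datetime
sample_recipes_df = pd.read_csv(
    sample_csv,
    dtype={"id": str, "contributor_id": str},
    parse_dates=["submitted"],
)

print(sample_recipes_df.dtypes)
```

Setting the types at load time avoids a separate conversion cell, at the cost of a slightly longer `read_csv` call.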


In [7]:
# Info about the user interactions dataset
raw_users_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1132367 entries, 0 to 1132366
Data columns (total 5 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   user_id    1132367 non-null  int64 
 1   recipe_id  1132367 non-null  int64 
 2   date       1132367 non-null  object
 3   rating     1132367 non-null  int64 
 4   review     1132198 non-null  object
dtypes: int64(3), object(2)
memory usage: 43.2+ MB
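The user interactions dataset has the same pattern: 'user_id' and 'recipe_id' are integers and 'date' is an object column. A sketch of the analogous conversion is shown below on a small stand-in frame; `sample_users_df` is a hypothetical name, and applying this step to `raw_users_df` is an assumption rather than something this notebook does above.

```python
import pandas as pd

# Small stand-in for raw_users_df (rows copied from the preview above)
sample_users_df = pd.DataFrame({
    "user_id": [38094, 1293707],
    "recipe_id": [40893, 40893],
    "date": ["2003-02-17", "2011-12-21"],
    "rating": [4, 5],
})

# Identifiers are labels, not quantities, so store them as strings
sample_users_df = sample_users_df.astype({"user_id": str, "recipe_id": str})

# Parse the date strings into a datetime column
sample_users_df["date"] = pd.to_datetime(sample_users_df["date"], format="%Y-%m-%d")

print(sample_users_df.dtypes)
```

Note that if 'recipe_id' is converted to a string here, any later comparison or join against the recipe 'id' column needs both sides to use the same type.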


In [10]:
# Count missing values per column in the recipe dataset
raw_recipes_df.isna().sum()

name                 1
id                   0
minutes              0
contributor_id       0
submitted            0
tags                 0
nutrition            0
n_steps              0
steps                0
description       4979
ingredients          0
n_ingredients        0
dtype: int64

In [11]:
# Show the single recipe that has no name
raw_recipes_df[raw_recipes_df['name'].isna()]

|  | name | id | minutes | contributor_id | submitted | tags | nutrition | n_steps | steps | description | ingredients | n_ingredients |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 721 | NaN | 368257 | 10 | 779451 | 2009-04-27 | ['15-minutes-or-less', 'time-to-make', 'course... | [1596.2, 249.0, 155.0, 0.0, 2.0, 112.0, 14.0] | 6 | ['in a bowl , combine ingredients except for o... | ------------- | ['lemon', 'honey', 'horseradish mustard', 'gar... | 10 |


In [51]:
# Print the full ingredient list of the unnamed recipe
print(raw_recipes_df[raw_recipes_df['name'].isna()]['ingredients'])

721    ['lemon', 'honey', 'horseradish mustard', 'garlic clove', 'dried parsley', 'dried basil', 'dried thyme', 'garlic salt', 'black pepper', 'olive oil']
Name: ingredients, dtype: object


In [43]:
# Search for other recipes that share the unnamed recipe's distinctive ingredients
ingredients = raw_recipes_df['ingredients']
raw_recipes_df[
    ingredients.str.contains('horseradish')
    & ingredients.str.contains('lemon')
    & ingredients.str.contains('honey')
    & ingredients.str.contains('garlic')
]


|  | name | id | minutes | contributor_id | submitted | tags | nutrition | n_steps | steps | description | ingredients | n_ingredients |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 721 | NaN | 368257 | 10 | 779451 | 2009-04-27 | ['15-minutes-or-less', 'time-to-make', 'course... | [1596.2, 249.0, 155.0, 0.0, 2.0, 112.0, 14.0] | 6 | ['in a bowl , combine ingredients except for o... | ------------- | ['lemon', 'honey', 'horseradish mustard', 'gar... | 10 |
| 20557 | beef ribs a beautiful barbecue | 64711 | 90 | 74558 | 2003-06-17 | ['weeknight', 'time-to-make', 'course', 'main-... | [187.6, 3.0, 113.0, 33.0, 4.0, 1.0, 11.0] | 20 | ['barbecue sauce / marinade: skin tomato and c... | the aroma wafting through your neighborhood of... | ['beef ribs', 'tomatoes', 'vidalia onion', 'ga... | 19 |
| 84821 | flavored butters 8 variations | 451664 | 5 | 57042 | 2011-03-26 | ['weeknight', '15-minutes-or-less', 'time-to-m... | [432.6, 71.0, 17.0, 20.0, 2.0, 146.0, 1.0] | 12 | ['lemon basil: combine melted butter , lemon j... | flavored butters for corn on the cob. this cam... | ['butter', 'lemon juice', 'dried basil', 'prep... | 15 |
| 160368 | pineapple salsa mahi | 298073 | 20 | 453604 | 2008-04-11 | ['30-minutes-or-less', 'time-to-make', 'course... | [210.8, 2.0, 46.0, 23.0, 65.0, 1.0, 5.0] | 10 | ['chop pineapple into small pieces', 'combine ... | got this recipe off a recipe card from publix.... | ['mahi mahi fillets', 'pineapple chunk', 'sals... | 8 |
| 208527 | tangy sweet and sour pork shoulder steak bake | 151215 | 120 | 89831 | 2006-01-11 | ['time-to-make', 'course', 'main-ingredient', ... | [686.9, 34.0, 206.0, 49.0, 123.0, 37.0, 19.0] | 15 | ['set oven to 350 degrees', 'grease a 13 x 9-i... | you can also make this using country-style por... | ['pork shoulder steaks', 'seasoning salt', 'fr... | 15 |
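The search above matches substrings of the stringified ingredient list, so 'lemon' also matches 'lemon juice' and 'garlic' also matches 'garlic salt'. The sketch below contrasts that with an exact match on parsed lists. It assumes the 'ingredients' strings are valid Python list literals (as the previews suggest), and `sample_df`, `substring_hits`, and `exact_hits` are hypothetical names used only for illustration.

```python
import ast

import pandas as pd

# Stand-in rows; 'ingredients' is stored as a stringified Python list
sample_df = pd.DataFrame({
    "name": ["recipe a", "recipe b"],
    "ingredients": [
        "['lemon', 'honey', 'horseradish mustard', 'garlic clove']",
        "['butter', 'lemon juice', 'dried basil']",
    ],
})

# Parse each string into a real list so ingredients can be compared exactly
parsed = sample_df["ingredients"].apply(ast.literal_eval)

# Substring search: 'lemon' also matches 'lemon juice'
substring_hits = sample_df["ingredients"].str.contains("lemon")

# Exact search: only rows whose ingredient list contains the item 'lemon'
exact_hits = parsed.apply(lambda items: "lemon" in items)

print(substring_hits.tolist())  # [True, True]
print(exact_hits.tolist())      # [True, False]
```

Which behavior is preferable depends on the goal: substring matching casts a wider net, while exact matching avoids false positives from compound ingredient names.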


In [4]:
# Show full column contents, then look up the reviews for the unnamed recipe (id 368257)
pd.set_option('display.max_colwidth', None)
print(raw_users_df.loc[raw_users_df['recipe_id'] == 368257]['review'])

369401    This was great! Thanx. It was the only one without vinegar which\r\nI can't take.  And the quantity was a small batch to fine tune.
Name: review, dtype: object


In [6]:
# Inspect one nutrition entry: it is stored as a string, not as a list
raw_recipes_df['nutrition'][0]

'[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]'
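Because each nutrition entry is a string that looks like a list, it has to be parsed before the individual values can be used. The sketch below shows one way to split it into numeric columns. The column names are illustrative labels, assumed to follow the order given in the data dictionary, with the seventh value assumed to be carbohydrates (PDV); `sample_df` and `nutrition_cols` are hypothetical names.

```python
import ast

import pandas as pd

# Stand-in rows (values copied from the recipe preview above)
sample_df = pd.DataFrame({
    "id": ["137739", "31490"],
    "nutrition": [
        "[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",
        "[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",
    ],
})

# Assumed column order; these names are not part of the raw dataset
nutrition_cols = [
    "calories", "total_fat_pdv", "sugar_pdv", "sodium_pdv",
    "protein_pdv", "saturated_fat_pdv", "carbohydrates_pdv",
]

# literal_eval safely turns the string '[51.5, 0.0, ...]' into a list of floats
parsed = sample_df["nutrition"].apply(ast.literal_eval)

# Expand the lists into one column per value, aligned on the original index
nutrition_df = pd.DataFrame(parsed.tolist(), columns=nutrition_cols, index=sample_df.index)
sample_df = sample_df.drop(columns="nutrition").join(nutrition_df)

print(sample_df)
```

`ast.literal_eval` is used instead of `eval` because it only accepts Python literals and does not execute arbitrary code.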

In [12]:
# Show the recipes that have no description
raw_recipes_df[raw_recipes_df['description'].isna()]

|  | name | id | minutes | contributor_id | submitted | tags | nutrition | n_steps | steps | description | ingredients | n_ingredients |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | apple a day milk shake | 5289 | 0 | 1533 | 1999-12-06 | ['15-minutes-or-less', 'time-to-make', 'course... | [160.2, 10.0, 55.0, 3.0, 9.0, 20.0, 7.0] | 4 | ['combine ingredients in blender', 'cover and ... | NaN | ['milk', 'vanilla ice cream', 'frozen apple ju... | 4 |
| 8 | bananas 4 ice cream pie | 70971 | 180 | 102353 | 2003-09-10 | ['weeknight', 'time-to-make', 'course', 'main-... | [4270.8, 254.0, 1306.0, 111.0, 127.0, 431.0, 2... | 8 | ['crumble cookies into a 9-inch pie plate , or... | NaN | ['chocolate sandwich style cookies', 'chocolat... | 6 |
| 74 | philly waldorf salad | 5060 | 60 | 1534 | 1999-12-01 | ['60-minutes-or-less', 'time-to-make', 'course... | [180.7, 22.0, 29.0, 3.0, 6.0, 33.0, 3.0] | 4 | ['combine softened cream cheese , orange juice... | NaN | ['philadelphia cream cheese', 'orange juice', ... | 7 |
| 76 | pizza stuffed potato | 52443 | 25 | 1533 | 2003-01-28 | ['30-minutes-or-less', 'time-to-make', 'course... | [183.3, 12.0, 9.0, 15.0, 21.0, 23.0, 6.0] | 8 | ['preheat oven to 450 degrees', 'cut potato in... | NaN | ['baking potato', 'mozzarella cheese', 'tomato... | 7 |
| 99 | the woiks dilly burgers | 34930 | 32 | 23302 | 2002-07-24 | ['bacon', '60-minutes-or-less', 'time-to-make'... | [449.2, 50.0, 16.0, 19.0, 58.0, 67.0, 2.0] | 8 | ['in a bowl , mix together the ground beef , m... | NaN | ["mccormick's montreal brand steak seasoning",... | 9 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 231449 | zucchini with onions and tomatoes | 33602 | 50 | 23302 | 2002-07-08 | ['60-minutes-or-less', 'time-to-make', 'course... | [84.4, 5.0, 28.0, 1.0, 6.0, 9.0, 4.0] | 5 | ['melt the butter in a large saucepan and cook... | NaN | ['unsalted butter', 'onion', 'garlic', 'no-sal... | 6 |
| 231492 | zucchini potato and parmesan soup | 73668 | 45 | 29300 | 2003-10-20 | ['60-minutes-or-less', 'time-to-make', 'course... | [97.8, 4.0, 24.0, 15.0, 12.0, 3.0, 3.0] | 14 | ['heat the oil in a large pot over medium heat... | NaN | ['extra virgin olive oil', 'onion', 'celery ri... | 11 |
| 231493 | zucchini red pepper leek frittata | 41429 | 65 | 1533 | 2002-09-30 | ['weeknight', 'time-to-make', 'course', 'main-... | [141.7, 6.0, 17.0, 9.0, 33.0, 4.0, 2.0] | 11 | ['steam or microwave the vegetables together u... | NaN | ['zucchini', 'red pepper', 'leek', 'vegetable ... | 7 |
| 231534 | zucchini garlic pasta | 49456 | 40 | 37779 | 2002-12-22 | ['60-minutes-or-less', 'time-to-make', 'course... | [519.4, 31.0, 16.0, 19.0, 38.0, 36.0, 21.0] | 11 | ['prepare pasta according to package direction... | NaN | ['wagon wheel macaroni', 'bacon', 'onion', 'ga... | 8 |


In [13]:
# Count missing values per column in the user interactions dataset
raw_users_df.isna().sum()

user_id        0
recipe_id      0
date           0
rating         0
review       169
dtype: int64

In [14]:
# Count fully duplicated rows in the recipe dataset
raw_recipes_df.duplicated().sum()

0

In [17]:
# Count fully duplicated rows in the user interactions dataset
raw_users_df.duplicated().sum()

0
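`duplicated()` with no arguments only flags rows that are identical in every column. A complementary check, sketched below on made-up data, is to look for repeated keys, for example the same user rating the same recipe more than once. The frame `sample_users_df` and its second row are invented for illustration, and whether such repeats exist in the real data is not established here.

```python
import pandas as pd

# Made-up rows: the first two share user_id and recipe_id but differ in rating
sample_users_df = pd.DataFrame({
    "user_id": [38094, 38094, 8937],
    "recipe_id": [40893, 40893, 44394],
    "rating": [4, 5, 4],
})

# Rows identical in every column
full_row_dupes = sample_users_df.duplicated().sum()

# Rows that repeat an earlier (user_id, recipe_id) pair
key_dupes = sample_users_df.duplicated(subset=["user_id", "recipe_id"]).sum()

print(full_row_dupes, key_dupes)  # 0 1
```

The same idea applies to the recipe dataset, where `subset=["id"]` would confirm that each recipe identifier appears only once.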

In [18]:
# Number of missing cells as a percentage of the number of rows (recipe dataset)
raw_recipes_df.isna().sum().sum()/raw_recipes_df.shape[0]*100

2.149915600702824

Missing values amount to about 2.15% of the rows in the recipe dataset.

In [19]:
# Number of missing cells as a percentage of the number of rows (user interactions dataset)
raw_users_df.isna().sum().sum()/raw_users_df.shape[0]*100

0.014924490028409516

Missing values amount to about 0.01% of the rows in the user interactions dataset.
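The percentages above divide the total number of missing cells by the number of rows. A per-column breakdown, together with the share of rows that have at least one missing value, can make the picture clearer. The following is a small sketch on made-up data; `sample_df`, `missing_summary`, and `rows_with_missing_pct` are hypothetical names.

```python
import pandas as pd

# Made-up frame with a few missing values
sample_df = pd.DataFrame({
    "name": ["a", None, "c", "d"],
    "description": ["x", "y", None, None],
    "minutes": [10, 20, 30, 40],
})

# Missing count and percentage for each column
missing_summary = pd.DataFrame({
    "missing": sample_df.isna().sum(),
    "percent": (sample_df.isna().mean() * 100).round(2),
})

# Share of rows that contain at least one missing value
rows_with_missing_pct = sample_df.isna().any(axis=1).mean() * 100

print(missing_summary)
print(rows_with_missing_pct)  # 75.0
```

The two measures agree only when no row has more than one missing value, so reporting the row-level figure separately avoids overcounting.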

In [20]:
# Check how many rows remain if every row with a missing value is dropped
test = raw_recipes_df.dropna()
test.shape

(226657, 12)
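Dropping every incomplete row removes 4,980 recipes, almost all of them only because the description is missing. One possible alternative, shown here as a sketch rather than as the approach this notebook necessarily takes, is to drop only the rows without a name and fill missing descriptions with an empty string. `sample_df` and `cleaned_df` are hypothetical names, and the third description is a placeholder.

```python
import pandas as pd

# Stand-in rows: one missing description, one missing name
sample_df = pd.DataFrame({
    "name": ["apple a day milk shake", None, "alouette potatoes"],
    "description": [None, "-------------", "placeholder description"],
})

cleaned_df = (
    sample_df
    .dropna(subset=["name"])        # a recipe without a name is dropped
    .fillna({"description": ""})    # a missing description becomes an empty string
)

print(cleaned_df.shape)  # (2, 2)
```

This keeps recipes whose only gap is the free-text description, which matters if the description is not a required input for the recommender.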

In [21]:
# List the column names of the recipe dataset
raw_recipes_df.columns

Index(['name', 'id', 'minutes', 'contributor_id', 'submitted', 'tags',
       'nutrition', 'n_steps', 'steps', 'description', 'ingredients',
       'n_ingredients'],
      dtype='object')

In [22]:
# List the column names of the user interactions dataset
raw_users_df.columns

Index(['user_id', 'recipe_id', 'date', 'rating', 'review'], dtype='object')