# Recipe Recommendation System: Data Preparation

Author: Kelly Li


## Table of contents:
* [1 Introduction](#one)
* [2 Datasets](#two)
    * [2.1 Data Sources](#twoone)
    * [2.2 Data Loading](#twotwo)
* [3 Data Cleaning](#three)
    * [3.1 Data Types](#threeone) 
    * [3.2 Missing Data](#threetwo)
    * [3.3 Duplicate Data](#threethree)
* [4 Data Preprocessing](#four)
    * [4.1 Column Splitting](#fourone) 
* [5 Conclusion](#five)

-------------------------------------------------------------------------------------------------------------------------------

## 1 Introduction <a class="anchor" id="one"></a>

Cooking enthusiasts often struggle to find personalized and diverse recipes that align with their tastes, dietary restrictions, and ingredient preferences. The existing search process is time-consuming, lacks inspiration, and fails to cater to specific dietary needs. To address these challenges, we have developed a recipe recommendation system that aids recipe discovery, with a focus on providing inspiration and catering to specific dietary needs. By leveraging advanced algorithms, it curates personalized recipe suggestions that ignite culinary creativity while considering unique tastes and dietary requirements. Welcome to a world of culinary inspiration.

## 2 Datasets <a class="anchor" id="two"></a>

### 2.1 Data Sources <a class="anchor" id="twoone"></a>

Here's the data dictionary for the columns in the raw recipe dataset:

| Column Name   | Description                                               |
| ------------- | --------------------------------------------------------- |
| name          | The name of the recipe.                                   |
| id            | The unique identifier of the recipe.                       |
| minutes       | The total cooking and preparation time in minutes.         |
| contributor_id| The unique identifier of the user who submitted the recipe|
| submitted     | The date when the recipe was submitted.                    |
| tags          | Tags or labels associated with the recipe (e.g., vegetarian, vegan, dessert). |
| nutrition     | Nutritional information (i.e. calories (#), total fat (PDV*), sugar (PDV*), sodium (PDV*), protein (PDV*), saturated fat). |
| n_steps       | The total number of steps or instructions in the recipe.   |
| steps         | The step-by-step instructions for preparing the recipe.    |
| description   | A brief description or summary of the recipe.              |
| ingredients   | The list of ingredients required for the recipe.           |
| n_ingredients | The total number of ingredients used in the recipe.        |

*PDV = % daily value

Here's the data dictionary for the columns in the raw user interactions dataset:

| Column Name | Description                                          |
| ----------- | ---------------------------------------------------- |
| user_id     | The unique identifier of the user.                    |
| recipe_id   | The unique identifier of the recipe associated with the user's interaction. |
| date        | The date when the user's interaction took place.      |
| rating      | The rating given by the user for the recipe.          |
| review      | The review or feedback provided by the user for the recipe. |

### 2.2 Data Loading <a class="anchor" id="twotwo"></a>

In [2]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# Read the raw datasets
raw_recipes_df = pd.read_csv('RAW_recipes.csv')
raw_users_df = pd.read_csv('RAW_interactions.csv')

In [4]:
# Shape of the data
print('The shape of the recipe dataset is:', raw_recipes_df.shape)
print('The shape of the user interactions dataset is:', raw_users_df.shape)

The shape of the recipe dataset is: (231637, 12)
The shape of the user interactions dataset is: (1132367, 5)


In [5]:
# View recipes dataset
raw_recipes_df.head()

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"['60-minutes-or-less', 'time-to-make', 'course...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,"['make a choice and proceed with recipe', 'dep...",autumn is my favorite time of year to cook! th...,"['winter squash', 'mexican seasoning', 'mixed ...",7
1,a bit different breakfast pizza,31490,30,26278,2002-06-17,"['30-minutes-or-less', 'time-to-make', 'course...","[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",9,"['preheat oven to 425 degrees f', 'press dough...",this recipe calls for the crust to be prebaked...,"['prepared pizza crust', 'sausage patty', 'egg...",6
2,all in the kitchen chili,112140,130,196586,2005-02-25,"['time-to-make', 'course', 'preparation', 'mai...","[269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0]",6,"['brown ground beef in large pot', 'add choppe...",this modified version of 'mom's' chili was a h...,"['ground beef', 'yellow onions', 'diced tomato...",13
3,alouette potatoes,59389,45,68585,2003-04-14,"['60-minutes-or-less', 'time-to-make', 'course...","[368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0]",11,['place potatoes in a large pot of lightly sal...,"this is a super easy, great tasting, make ahea...","['spreadable cheese with garlic and herbs', 'n...",11
4,amish tomato ketchup for canning,44061,190,41706,2002-10-25,"['weeknight', 'time-to-make', 'course', 'main-...","[352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0]",5,['mix all ingredients& boil for 2 1 / 2 hours ...,my dh's amish mother raised him on this recipe...,"['tomato juice', 'apple cider vinegar', 'sugar...",8


In [6]:
# View user interactions dataset
raw_users_df.head()

Unnamed: 0,user_id,recipe_id,date,rating,review
0,38094,40893,2003-02-17,4,Great with a salad. Cooked on top of stove for...
1,1293707,40893,2011-12-21,5,"So simple, so delicious! Great for chilly fall..."
2,8937,44394,2002-12-01,4,This worked very well and is EASY. I used not...
3,126440,85009,2010-02-27,5,I made the Mexican topping and took it to bunk...
4,57222,85009,2011-10-01,5,"Made the cheddar bacon topping, adding a sprin..."


## 3 Data Cleaning <a class="anchor" id="three"></a>

### 3.1 Data Types <a class="anchor" id="threeone"></a>

To understand the data type of each column and see how many non-null values it holds, we will use the `.info()` method.

In [7]:
# Info about the recipe dataset
raw_recipes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 231637 entries, 0 to 231636
Data columns (total 12 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   name            231636 non-null  object
 1   id              231637 non-null  int64 
 2   minutes         231637 non-null  int64 
 3   contributor_id  231637 non-null  int64 
 4   submitted       231637 non-null  object
 5   tags            231637 non-null  object
 6   nutrition       231637 non-null  object
 7   n_steps         231637 non-null  int64 
 8   steps           231637 non-null  object
 9   description     226658 non-null  object
 10  ingredients     231637 non-null  object
 11  n_ingredients   231637 non-null  int64 
dtypes: int64(5), object(7)
memory usage: 21.2+ MB


The 'id' and 'contributor_id' columns are currently of `integer` data type, while the 'submitted' column is of `object` data type. To ensure consistency and facilitate data manipulation, we will convert the 'id' and 'contributor_id' columns to `object` data type. Additionally, we will convert the 'submitted' column to `datetime` data type for more convenient date-based operations.

In [8]:
# Convert 'id' and 'contributor_id' columns to object data type
raw_recipes_df['id'] = raw_recipes_df['id'].astype(str)
raw_recipes_df['contributor_id'] = raw_recipes_df['contributor_id'].astype(str)

# Convert 'submitted' column to date data type
raw_recipes_df['submitted'] = pd.to_datetime(raw_recipes_df['submitted'], format='%Y-%m-%d')

# Print updated data types of the columns
raw_recipes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 231637 entries, 0 to 231636
Data columns (total 12 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   name            231636 non-null  object        
 1   id              231637 non-null  object        
 2   minutes         231637 non-null  int64         
 3   contributor_id  231637 non-null  object        
 4   submitted       231637 non-null  datetime64[ns]
 5   tags            231637 non-null  object        
 6   nutrition       231637 non-null  object        
 7   n_steps         231637 non-null  int64         
 8   steps           231637 non-null  object        
 9   description     226658 non-null  object        
 10  ingredients     231637 non-null  object        
 11  n_ingredients   231637 non-null  int64         
dtypes: datetime64[ns](1), int64(3), object(8)
memory usage: 21.2+ MB


Let us now examine the data types within the user interactions dataset.

In [9]:
# Info about the user interactions dataset
raw_users_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1132367 entries, 0 to 1132366
Data columns (total 5 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   user_id    1132367 non-null  int64 
 1   recipe_id  1132367 non-null  int64 
 2   date       1132367 non-null  object
 3   rating     1132367 non-null  int64 
 4   review     1132198 non-null  object
dtypes: int64(3), object(2)
memory usage: 43.2+ MB


Similar to the recipes dataset, the 'user_id' and 'recipe_id' columns are currently of `integer` data type, while the 'date' column is of `object` data type. To ensure consistency and facilitate data manipulation, we will convert the 'user_id' and 'recipe_id' columns to `object` data type. Additionally, we will convert the 'date' column to `datetime` data type for more convenient date-based operations.

In [10]:
# Convert 'user_id' and 'recipe_id' columns to object data type
raw_users_df['user_id'] = raw_users_df['user_id'].astype(str)
raw_users_df['recipe_id'] = raw_users_df['recipe_id'].astype(str)

# Convert 'date' column to date data type
raw_users_df['date'] = pd.to_datetime(raw_users_df['date'], format='%Y-%m-%d')

# Print updated data types of the columns
raw_users_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1132367 entries, 0 to 1132366
Data columns (total 5 columns):
 #   Column     Non-Null Count    Dtype         
---  ------     --------------    -----         
 0   user_id    1132367 non-null  object        
 1   recipe_id  1132367 non-null  object        
 2   date       1132367 non-null  datetime64[ns]
 3   rating     1132367 non-null  int64         
 4   review     1132198 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 43.2+ MB


### 3.2 Missing Data <a class="anchor" id="threetwo"></a>

Now that we have gained a deeper understanding of the dataset, let us proceed to explore any potential missing values within it.

In [11]:
# Checking for missing values
raw_recipes_df.isna().sum()

name                 1
id                   0
minutes              0
contributor_id       0
submitted            0
tags                 0
nutrition            0
n_steps              0
steps                0
description       4979
ingredients          0
n_ingredients        0
dtype: int64

There is 1 missing value in the 'name' column and 4,979 missing values in the 'description' column. Let us examine the specific rows that contain missing values to identify any discernible patterns and explore potential strategies for data imputation.

In [12]:
# Row with missing name
raw_recipes_df[raw_recipes_df['name'].isna()]

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients
721,,368257,10,779451,2009-04-27,"['15-minutes-or-less', 'time-to-make', 'course...","[1596.2, 249.0, 155.0, 0.0, 2.0, 112.0, 14.0]",6,"['in a bowl , combine ingredients except for o...",-------------,"['lemon', 'honey', 'horseradish mustard', 'gar...",10


The row with the missing 'name' value has a description consisting only of dashes, so it effectively lacks a description as well. Consequently, we cannot infer a name from the description, as there is no textual information to guide the imputation process. Since only one row has a missing 'name' value, we can safely drop it from the dataset.

In [13]:
# Drop row with missing name
raw_recipes_df.dropna(subset=['name'], inplace=True)

In [14]:
# Sanity check
raw_recipes_df.isna().sum()

name                 0
id                   0
minutes              0
contributor_id       0
submitted            0
tags                 0
nutrition            0
n_steps              0
steps                0
description       4979
ingredients          0
n_ingredients        0
dtype: int64
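
Note that `isna()` only flags true nulls. The row dropped above carried a description made up entirely of dashes, which pandas treats as an ordinary string. The following is a minimal sketch of how such placeholder text could be surfaced, assuming a placeholder is any non-null string that contains no letter or digit. The toy frame and the `is_placeholder` name are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the recipes data (values are illustrative)
toy_df = pd.DataFrame({
    "name": ["chili", None, "fry bread"],
    "description": ["a hearty winter chili", "-------------", None],
})

# Assumption: a placeholder is a non-null string with no letter or digit in it
is_placeholder = (
    toy_df["description"].notna()
    & ~toy_df["description"].str.contains(r"[A-Za-z0-9]", na=False)
)

# Show the rows whose description is present but carries no real text
print(toy_df[is_placeholder])
```

Whether such rows should be treated as missing is a judgment call that depends on how the descriptions are used later.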

Let us take a look at the rows with a missing description.

In [15]:
# Sample of rows with missing description
raw_recipes_df[raw_recipes_df['description'].isna()].sample(10)

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients
202104,stuffed mexican meatloaf,6501,115,67395,2000-03-06,"['weeknight', 'time-to-make', 'main-ingredient...","[452.6, 51.0, 12.0, 30.0, 59.0, 81.0, 2.0]",14,"['combine ground beef , tomato sauce , taco se...",,"['beef chuck', 'tomato sauce', 'taco seasoning...",11
61188,crantastic baked chicken breast,29198,45,31695,2002-05-23,"['60-minutes-or-less', 'time-to-make', 'course...","[297.2, 10.0, 102.0, 33.0, 56.0, 3.0, 9.0]",6,"['combine the cranberry sauce , thousand islan...",,"['whole berry cranberry sauce', 'reduced-fat t...",4
178560,sake steamed halibut with dilled carrots,45310,40,59389,2002-11-04,"['60-minutes-or-less', 'time-to-make', 'course...","[524.9, 9.0, 11.0, 6.0, 94.0, 6.0, 5.0]",20,"['spray steamer rack with nonstick spray', 'sa...",,"['vegetable oil cooking spray', 'halibut fille...",8
6227,apple apricot smoothie,36675,10,1533,2002-08-09,"['15-minutes-or-less', 'time-to-make', 'course...","[305.9, 2.0, 241.0, 2.0, 11.0, 3.0, 23.0]",1,['place all ingredients in a blender and puree...,,"['apple', 'apple juice', 'apricots', 'banana',...",7
89937,garlic and oregano sweet potato wedges,66796,60,79036,2003-07-15,"['60-minutes-or-less', 'time-to-make', 'main-i...","[196.2, 13.0, 22.0, 7.0, 6.0, 6.0, 9.0]",8,"['heat oven to 450 degrees', 'on a baking shee...",,"['sweet potatoes', 'fresh oregano', 'kosher sa...",7
200433,strawberries with cheesecake cream,34211,20,29300,2002-07-15,"['30-minutes-or-less', 'time-to-make', 'course...","[317.2, 24.0, 103.0, 6.0, 8.0, 43.0, 14.0]",13,"['heat oven to 350 f', 'place vanilla wafers i...",,"['vanilla wafers', 'butter', 'cream cheese', '...",8
177815,rum cream apple pie,49775,20,62727,2003-01-03,"['30-minutes-or-less', 'time-to-make', 'course...","[3392.4, 149.0, 1891.0, 31.0, 60.0, 287.0, 205.0]",14,"['combine the oats , butter , and brown sugar'...",,"['rolled oats', 'butter', 'brown sugar', 'wate...",13
142919,navajo fry bread,2774,160,1547,1999-08-16,"['weeknight', 'time-to-make', 'course', 'cuisi...","[1856.1, 256.0, 33.0, 33.0, 28.0, 332.0, 24.0]",12,"['combine the flour , powdered milk , baking p...",,"['unsifted flour', 'lard', 'powdered milk', 'd...",6
230914,zucchini and cheese stuffed mushrooms,2913,35,1587,1999-09-02,"['60-minutes-or-less', 'time-to-make', 'course...","[119.5, 2.0, 45.0, 19.0, 22.0, 1.0, 7.0]",15,"['remove stems from mushrooms and discard', 's...",,"['fresh mushrooms', 'zucchini', 'lowfat parmes...",9
194078,spanish vegetables,25384,30,21399,2002-04-16,"['30-minutes-or-less', 'time-to-make', 'course...","[111.6, 6.0, 24.0, 24.0, 7.0, 3.0, 5.0]",3,['cook onion and garlic in oil in a skillet un...,,"['frozen corn', 'onion', 'garlic clove', 'oliv...",9


Since the descriptions are written by the recipe contributors themselves and not available in the source, it is not feasible to impute missing values for the 'description' column. Let us look at the extent to which these missing values constitute the overall dataset.

In [16]:
print('Missing values make up', round(raw_recipes_df.isna().sum().sum()/raw_recipes_df.shape[0]*100, 2), "%", "of the overall data.")

Missing values make up 2.15 % of the overall data.
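
The figure above divides the total number of missing cells by the number of rows, which matches the share of affected rows only when no row has more than one missing value. A per-column breakdown can be obtained with `isna().mean()`. The sketch below uses a small illustrative frame, and the names `missing_share` and `rows_affected` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing description out of four rows (illustrative)
toy_df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "description": ["x", np.nan, "y", "z"],
})

# isna().mean() gives the fraction of missing entries in each column
missing_share = toy_df.isna().mean().mul(100).round(2)
print(missing_share)

# Share of rows that contain at least one missing value
rows_affected = toy_df.isna().any(axis=1).mean() * 100
print(f"{rows_affected:.2f}% of rows contain a missing value")
```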


Given that rows with a missing 'description' make up only about 2% of the recipes dataset, we can safely drop them.

In [17]:
# Drop rows with missing description
raw_recipes_df.dropna(subset=['description'], inplace=True)

In [18]:
# Sanity check
raw_recipes_df.isna().sum()

name              0
id                0
minutes           0
contributor_id    0
submitted         0
tags              0
nutrition         0
n_steps           0
steps             0
description       0
ingredients       0
n_ingredients     0
dtype: int64

We will now take a look at the user interactions dataset to identify any missing values.

In [19]:
raw_users_df.isna().sum()

user_id        0
recipe_id      0
date           0
rating         0
review       169
dtype: int64

There are 169 missing values in the 'review' column. As with the 'description' column in the recipes dataset, the reviews are written by users themselves and are not available in the source, so it is not feasible to impute missing values for the 'review' column. Let us now assess the proportion of these missing values in relation to the overall dataset.

In [20]:
print('Missing values make up', round(raw_users_df.isna().sum().sum()/raw_users_df.shape[0]*100, 2), "%", "of the overall data.")

Missing values make up 0.01 % of the overall data.


Given that the missing values in the 'review' column constitute a very small subset of the overall dataset, we can safely drop these rows.

In [21]:
# Drop rows with missing review
raw_users_df.dropna(subset=['review'], inplace=True)

In [22]:
# Sanity check
raw_users_df.isna().sum()

user_id      0
recipe_id    0
date         0
rating       0
review       0
dtype: int64
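
Dropping recipes can leave interactions that point at recipes no longer present in the recipes dataset. The notebook does not check for this. The sketch below shows one way it could be done on toy data, assuming `recipe_id` in the interactions is meant to match `id` in the recipes. The frames and the `orphaned` and `matched` names are illustrative.

```python
import pandas as pd

# Toy data: recipe "3" has been removed from the recipes frame
recipes = pd.DataFrame({"id": ["1", "2"]})
interactions = pd.DataFrame({
    "user_id": ["10", "11", "12"],
    "recipe_id": ["1", "3", "2"],
    "rating": [5, 4, 3],
})

# Flag interactions whose recipe_id has no matching recipe id
orphaned = ~interactions["recipe_id"].isin(recipes["id"])
print("Interactions without a matching recipe:", orphaned.sum())

# Keep only interactions that still reference an existing recipe
matched = interactions[~orphaned]
```

Because both identifier columns were converted to strings earlier, the comparison works without further casting.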

### 3.3 Duplicate Data <a class="anchor" id="threethree"></a>

Now that we have addressed the missing values, let us proceed to identify any potential duplicates within the datasets.

In [23]:
# Duplicated rows
print("duplicated rows in recipes dataset:", raw_recipes_df.duplicated().sum())
print("duplicated rows in user interactions dataset:", raw_users_df.duplicated().sum())

duplicated rows in recipes dataset: 0
duplicated rows in user interactions dataset: 0


Great! It appears that there are no duplicated rows within the datasets.
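
`duplicated()` without arguments compares entire rows, so two rows that share an identifier but differ in any other column are not counted. A complementary check on the key columns is sketched below on toy data. Treating 'id' as the recipe key and the ('user_id', 'recipe_id') pair as the interaction key is an assumption based on the data dictionary, not something the notebook verifies.

```python
import pandas as pd

# Toy frames in which the keys repeat but the full rows do not (illustrative)
recipes = pd.DataFrame({
    "id": ["1", "2", "2"],
    "name": ["chili", "pizza", "pizza (v2)"],
})
interactions = pd.DataFrame({
    "user_id": ["10", "10", "11"],
    "recipe_id": ["1", "1", "1"],
    "rating": [5, 4, 5],
})

# Full-row check: finds nothing, because every row differs somewhere
full_row_dups = (recipes.duplicated().sum(), interactions.duplicated().sum())

# Key-based check: counts rows whose key has already appeared
dup_recipe_ids = recipes.duplicated(subset=["id"]).sum()
dup_user_recipe = interactions.duplicated(subset=["user_id", "recipe_id"]).sum()

print(full_row_dups, dup_recipe_ids, dup_user_recipe)
```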

## 4 Data Preprocessing <a class="anchor" id="four"></a>

After completing the data cleaning phase, we will proceed to the data preprocessing stage. Let's revisit the recipes dataframe to review its current state.

In [24]:
# View recipes dataset
raw_recipes_df.head()

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"['60-minutes-or-less', 'time-to-make', 'course...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,"['make a choice and proceed with recipe', 'dep...",autumn is my favorite time of year to cook! th...,"['winter squash', 'mexican seasoning', 'mixed ...",7
1,a bit different breakfast pizza,31490,30,26278,2002-06-17,"['30-minutes-or-less', 'time-to-make', 'course...","[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",9,"['preheat oven to 425 degrees f', 'press dough...",this recipe calls for the crust to be prebaked...,"['prepared pizza crust', 'sausage patty', 'egg...",6
2,all in the kitchen chili,112140,130,196586,2005-02-25,"['time-to-make', 'course', 'preparation', 'mai...","[269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0]",6,"['brown ground beef in large pot', 'add choppe...",this modified version of 'mom's' chili was a h...,"['ground beef', 'yellow onions', 'diced tomato...",13
3,alouette potatoes,59389,45,68585,2003-04-14,"['60-minutes-or-less', 'time-to-make', 'course...","[368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0]",11,['place potatoes in a large pot of lightly sal...,"this is a super easy, great tasting, make ahea...","['spreadable cheese with garlic and herbs', 'n...",11
4,amish tomato ketchup for canning,44061,190,41706,2002-10-25,"['weeknight', 'time-to-make', 'course', 'main-...","[352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0]",5,['mix all ingredients& boil for 2 1 / 2 hours ...,my dh's amish mother raised him on this recipe...,"['tomato juice', 'apple cider vinegar', 'sugar...",8


It appears that splitting the values in the 'nutrition' column and creating new columns for each nutritional metric would be beneficial. In the next section, we will proceed with this data transformation step. 

In [25]:
# View user interactions dataset
raw_users_df.head()

Unnamed: 0,user_id,recipe_id,date,rating,review
0,38094,40893,2003-02-17,4,Great with a salad. Cooked on top of stove for...
1,1293707,40893,2011-12-21,5,"So simple, so delicious! Great for chilly fall..."
2,8937,44394,2002-12-01,4,This worked very well and is EASY. I used not...
3,126440,85009,2010-02-27,5,I made the Mexican topping and took it to bunk...
4,57222,85009,2011-10-01,5,"Made the cheddar bacon topping, adding a sprin..."


The user interactions dataset looks good (for now).

For both datasets, we have deliberately postponed preprocessing of the text columns until the modeling stage. Text preprocessing is intricate and can be approached in many ways, so deferring it lets us explore the available techniques and choose the ones best suited to the requirements of each modeling task.

### 4.1 Column Splitting <a class="anchor" id="fourone"></a>

The 'nutrition' column stores each recipe's nutritional metrics as the string representation of a list: calories, total fat, sugar, sodium, protein, and saturated fat. The sample rows above show a seventh value in each list that the data dictionary does not name; it is not extracted here. To facilitate further analysis, we split these values into new columns, one per metric, and then drop the original 'nutrition' column.

In [26]:
# Split the values in the 'nutrition' column into separate columns
import ast

nutrition_columns = ['calories', 'total_fat', 'sugar', 'sodium', 'protein', 'saturated_fat']

# Parse each string once; ast.literal_eval only accepts Python literals, unlike eval
parsed_nutrition = raw_recipes_df['nutrition'].apply(ast.literal_eval)

for position, column in enumerate(nutrition_columns):
    raw_recipes_df[column] = parsed_nutrition.apply(lambda values, position=position: values[position])

# Drop the original 'nutrition' column
raw_recipes_df.drop('nutrition', axis=1, inplace=True)
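
A variant of the split, sketched on a toy frame: each string is parsed once with `ast.literal_eval` and the resulting lists are expanded into columns in a single step. The two toy rows are copied from the sample output above, and `nutrition_df` is a hypothetical helper name. Only the six named values are kept, so the unnamed seventh value is discarded.

```python
import ast

import pandas as pd

# Two rows copied from the sample output above
toy_df = pd.DataFrame({
    "name": ["arriba baked winter squash mexican style", "a bit different breakfast pizza"],
    "nutrition": [
        "[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",
        "[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",
    ],
})

nutrition_columns = ["calories", "total_fat", "sugar", "sodium", "protein", "saturated_fat"]

# literal_eval accepts only Python literals, so it cannot execute arbitrary code
parsed = toy_df["nutrition"].apply(ast.literal_eval)

# Expand the lists into columns and keep only the values that have a name
nutrition_df = pd.DataFrame(parsed.tolist(), index=toy_df.index).iloc[:, :len(nutrition_columns)]
nutrition_df.columns = nutrition_columns

# Replace the original column with the expanded ones
toy_df = toy_df.drop(columns="nutrition").join(nutrition_df)
print(toy_df.dtypes)
```

Passing `index=toy_df.index` keeps the new columns aligned with the original rows, which matters once rows have been dropped and the index is no longer contiguous.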

In [27]:
# Sanity check
raw_recipes_df.head()

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,n_steps,steps,description,ingredients,n_ingredients,calories,total_fat,sugar,sodium,protein,saturated_fat
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"['60-minutes-or-less', 'time-to-make', 'course...",11,"['make a choice and proceed with recipe', 'dep...",autumn is my favorite time of year to cook! th...,"['winter squash', 'mexican seasoning', 'mixed ...",7,51.5,0.0,13.0,0.0,2.0,0.0
1,a bit different breakfast pizza,31490,30,26278,2002-06-17,"['30-minutes-or-less', 'time-to-make', 'course...",9,"['preheat oven to 425 degrees f', 'press dough...",this recipe calls for the crust to be prebaked...,"['prepared pizza crust', 'sausage patty', 'egg...",6,173.4,18.0,0.0,17.0,22.0,35.0
2,all in the kitchen chili,112140,130,196586,2005-02-25,"['time-to-make', 'course', 'preparation', 'mai...",6,"['brown ground beef in large pot', 'add choppe...",this modified version of 'mom's' chili was a h...,"['ground beef', 'yellow onions', 'diced tomato...",13,269.8,22.0,32.0,48.0,39.0,27.0
3,alouette potatoes,59389,45,68585,2003-04-14,"['60-minutes-or-less', 'time-to-make', 'course...",11,['place potatoes in a large pot of lightly sal...,"this is a super easy, great tasting, make ahea...","['spreadable cheese with garlic and herbs', 'n...",11,368.1,17.0,10.0,2.0,14.0,8.0
4,amish tomato ketchup for canning,44061,190,41706,2002-10-25,"['weeknight', 'time-to-make', 'course', 'main-...",5,['mix all ingredients& boil for 2 1 / 2 hours ...,my dh's amish mother raised him on this recipe...,"['tomato juice', 'apple cider vinegar', 'sugar...",8,352.9,1.0,337.0,23.0,3.0,0.0


In [28]:
# Check datatypes of new columns
raw_recipes_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 226657 entries, 0 to 231636
Data columns (total 17 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   name            226657 non-null  object        
 1   id              226657 non-null  object        
 2   minutes         226657 non-null  int64         
 3   contributor_id  226657 non-null  object        
 4   submitted       226657 non-null  datetime64[ns]
 5   tags            226657 non-null  object        
 6   n_steps         226657 non-null  int64         
 7   steps           226657 non-null  object        
 8   description     226657 non-null  object        
 9   ingredients     226657 non-null  object        
 10  n_ingredients   226657 non-null  int64         
 11  calories        226657 non-null  float64       
 12  total_fat       226657 non-null  float64       
 13  sugar           226657 non-null  float64       
 14  sodium          226657 non-null  flo

Wonderful! As a result of the data transformation, we have successfully created separate columns to represent each specific nutritional value. These columns now store the respective nutritional metrics as floating-point values, which facilitates further analysis and computations involving these nutritional attributes.

#### Saving the data

With the datasets cleaned, we save them to files so that the processed data is readily available for the exploratory data analysis (EDA) phase.

In [29]:
# Save the clean datasets
raw_recipes_df.to_csv("clean_recipes.csv", index=False)
raw_users_df.to_csv("clean_interactions.csv", index=False)
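
CSV files do not store data types, so the string identifiers and `datetime` columns set up earlier are inferred afresh when the files are read back. The sketch below illustrates this with an in-memory buffer in place of a file; the toy frame and variable names are illustrative. It also shows how `dtype` and `parse_dates` in `pd.read_csv` can restore the intended types.

```python
import io

import pandas as pd

# Toy frame with a string identifier and a datetime column (illustrative)
clean_df = pd.DataFrame({
    "id": ["137739", "31490"],
    "submitted": pd.to_datetime(["2005-09-16", "2002-06-17"]),
})

# Write to an in-memory buffer instead of a file for this example
buffer = io.StringIO()
clean_df.to_csv(buffer, index=False)

# Default reload: 'id' is inferred as an integer and 'submitted' is left as text
buffer.seek(0)
default_df = pd.read_csv(buffer)
print(default_df.dtypes)

# Reload with explicit types to recover the cleaned schema
buffer.seek(0)
typed_df = pd.read_csv(buffer, dtype={"id": str}, parse_dates=["submitted"])
print(typed_df.dtypes)
```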

## 5 Conclusion <a class="anchor" id="five"></a>

During the data preparation stage of the recipe recommendation system, the following changes were made:

- Converted the identifier columns ('id', 'contributor_id', 'user_id', 'recipe_id') from integer to string data type, and the 'submitted' and 'date' columns to `datetime`.
- Dropped the one recipe with a missing 'name', the 4,979 recipes with a missing 'description', and the 169 interactions with a missing 'review'.
- Verified that there were no duplicate rows in either dataset.
- Split the values in the 'nutrition' column, creating new columns for each individual nutritional value (calories, total fat, sugar, sodium, protein, saturated fat).
- Saved the cleaned datasets to `clean_recipes.csv` and `clean_interactions.csv`.

With the datasets now clean and appropriately processed, we are well-prepared for the next phase of our project, which involves EDA and modeling. 