# Business Understanding

Preparing meals is often a challenge due to individual preferences, dietary needs, and ingredient availability. This project aims to develop a Personalized Recipe Recommendation System that uses machine learning and NLP to suggest relevant recipes tailored to each user. The system is designed to enhance convenience, promote healthier eating habits, and reduce food waste. It has potential applications in health tech, food delivery platforms, and smart kitchen systems.

## Problem Statement
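To develop a Personalized Recipe Recommendation System that leverages machine learning and NLP to suggest relevant recipes tailored to each user's preferences, dietary needs, and available ingredients.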

## Objectives

1.   To develop a content-based model using NLP to recommend recipes based on ingredients and instructions.
2.   To build a collaborative filtering model using user ratings and interactions.
3.   To combine both approaches into a hybrid recommendation system.
4.   To evaluate model performance.
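
Objective 3 leaves open how the two models are combined. One common option is a weighted blend of their scores; the sketch below assumes each model already yields one score per candidate recipe. The names `min_max`, `blend_scores`, and the weight `alpha` are hypothetical and are not part of this notebook.

```python
import numpy as np

def min_max(scores):
    """Rescale scores to [0, 1]; a constant vector maps to all zeros."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span

def blend_scores(content_scores, collab_scores, alpha=0.5):
    """Weighted average of two score vectors after rescaling each to [0, 1]."""
    return alpha * min_max(content_scores) + (1 - alpha) * min_max(collab_scores)

content = [0.2, 0.9, 0.4]  # e.g. cosine similarities from a content-based model
collab = [3.0, 4.0, 5.0]   # e.g. predicted ratings from collaborative filtering
hybrid = blend_scores(content, collab, alpha=0.6)
ranking = np.argsort(-hybrid)  # recipe positions, best first
print(ranking.tolist())        # [1, 2, 0]
```

Rescaling first keeps one model from dominating simply because its scores live on a larger scale; `alpha` would normally be tuned on held-out data.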

In [1]:

import kagglehub
from kagglehub import KaggleDatasetAdapter

In [2]:
%pip install isodate


Collecting isodate
  Downloading isodate-0.7.2-py3-none-any.whl.metadata (11 kB)
Downloading isodate-0.7.2-py3-none-any.whl (22 kB)
Installing collected packages: isodate
Successfully installed isodate-0.7.2


In [3]:
import pandas as pd
import numpy as np
import ast
import matplotlib.pyplot as plt
import seaborn as sns
from isodate import parse_duration
import warnings
warnings.filterwarnings("ignore")

In [4]:
file_path = "recipes.parquet"

df_recipes = kagglehub.load_dataset(
    KaggleDatasetAdapter.PANDAS,
    "irkaal/foodcom-recipes-and-reviews",
    file_path,
)


Downloading from https://www.kaggle.com/api/v1/datasets/download/irkaal/foodcom-recipes-and-reviews?dataset_version_number=2&file_name=recipes.parquet...


100%|██████████| 170M/170M [00:02<00:00, 80.7MB/s]

Extracting zip of recipes.parquet...





In [5]:
file_path2 = "reviews.parquet"

df_reviews = kagglehub.load_dataset(
    KaggleDatasetAdapter.PANDAS,
    "irkaal/foodcom-recipes-and-reviews",
    file_path2,
)


Downloading from https://www.kaggle.com/api/v1/datasets/download/irkaal/foodcom-recipes-and-reviews?dataset_version_number=2&file_name=reviews.parquet...


100%|██████████| 164M/164M [00:01<00:00, 130MB/s]

Extracting zip of reviews.parquet...





# Data Understanding


In [6]:
print("Recipes:", df_recipes.shape)
print("Reviews:", df_reviews.shape)

Recipes: (522517, 28)
Reviews: (1401982, 8)


In [7]:
df_recipes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 522517 entries, 0 to 522516
Data columns (total 28 columns):
 #   Column                      Non-Null Count   Dtype              
---  ------                      --------------   -----              
 0   RecipeId                    522517 non-null  float64            
 1   Name                        522517 non-null  object             
 2   AuthorId                    522517 non-null  int32              
 3   AuthorName                  522517 non-null  object             
 4   CookTime                    439972 non-null  object             
 5   PrepTime                    522517 non-null  object             
 6   TotalTime                   522517 non-null  object             
 7   DatePublished               522517 non-null  datetime64[us, UTC]
 8   Description                 522512 non-null  object             
 9   Images                      522516 non-null  object             
 10  RecipeCategory              521766 non-null 

In [8]:
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1401982 entries, 0 to 1401981
Data columns (total 8 columns):
 #   Column         Non-Null Count    Dtype              
---  ------         --------------    -----              
 0   ReviewId       1401982 non-null  int32              
 1   RecipeId       1401982 non-null  int32              
 2   AuthorId       1401982 non-null  int32              
 3   AuthorName     1401982 non-null  object             
 4   Rating         1401982 non-null  int32              
 5   Review         1401982 non-null  object             
 6   DateSubmitted  1401982 non-null  datetime64[us, UTC]
 7   DateModified   1401982 non-null  datetime64[us, UTC]
dtypes: datetime64[us, UTC](2), int32(4), object(2)
memory usage: 64.2+ MB


In [9]:
df_recipes.head()

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,CookTime,PrepTime,TotalTime,DatePublished,Description,Images,...,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions
0,38.0,Low-Fat Berry Blue Frozen Dessert,1533,Dancer,PT24H,PT45M,PT24H45M,1999-08-09 21:46:00+00:00,Make and share this Low-Fat Berry Blue Frozen ...,[https://img.sndimg.com/food/image/upload/w_55...,...,1.3,8.0,29.8,37.1,3.6,30.2,3.2,4.0,,"[Toss 2 cups berries with sugar., Let stand fo..."
1,39.0,Biryani,1567,elly9812,PT25M,PT4H,PT4H25M,1999-08-29 13:12:00+00:00,Make and share this Biryani recipe from Food.com.,[https://img.sndimg.com/food/image/upload/w_55...,...,16.6,372.8,368.4,84.4,9.0,20.4,63.4,6.0,,[Soak saffron in warm milk for 5 minutes and p...
2,40.0,Best Lemonade,1566,Stephen Little,PT5M,PT30M,PT35M,1999-09-05 19:52:00+00:00,This is from one of my first Good House Keepi...,[https://img.sndimg.com/food/image/upload/w_55...,...,0.0,0.0,1.8,81.5,0.4,77.2,0.3,4.0,,"[Into a 1 quart Jar with tight fitting lid, pu..."
3,41.0,Carina's Tofu-Vegetable Kebabs,1586,Cyclopz,PT20M,PT24H,PT24H20M,1999-09-03 14:54:00+00:00,This dish is best prepared a day in advance to...,[https://img.sndimg.com/food/image/upload/w_55...,...,3.8,0.0,1558.6,64.2,17.3,32.1,29.3,2.0,4 kebabs,"[Drain the tofu, carefully squeezing out exces..."
4,42.0,Cabbage Soup,1538,Duckie067,PT30M,PT20M,PT50M,1999-09-19 06:19:00+00:00,Make and share this Cabbage Soup recipe from F...,[https://img.sndimg.com/food/image/upload/w_55...,...,0.1,0.0,959.3,25.1,4.8,17.7,4.3,4.0,,"[Mix everything together and bring to a boil.,..."


In [10]:
df_reviews.head()

Unnamed: 0,ReviewId,RecipeId,AuthorId,AuthorName,Rating,Review,DateSubmitted,DateModified
0,2,992,2008,gayg msft,5,better than any you can get at a restaurant!,2000-01-25 21:44:00+00:00,2000-01-25 21:44:00+00:00
1,7,4384,1634,Bill Hilbrich,4,"I cut back on the mayo, and made up the differ...",2001-10-17 16:49:59+00:00,2001-10-17 16:49:59+00:00
2,9,4523,2046,Gay Gilmore ckpt,2,i think i did something wrong because i could ...,2000-02-25 09:00:00+00:00,2000-02-25 09:00:00+00:00
3,13,7435,1773,Malarkey Test,5,easily the best i have ever had. juicy flavor...,2000-03-13 21:15:00+00:00,2000-03-13 21:15:00+00:00
4,14,44,2085,Tony Small,5,An excellent dish.,2000-03-28 12:51:00+00:00,2000-03-28 12:51:00+00:00


In [11]:
df_recipes.isnull().sum()

Unnamed: 0,0
RecipeId,0
Name,0
AuthorId,0
AuthorName,0
CookTime,82545
PrepTime,0
TotalTime,0
DatePublished,0
Description,5
Images,1


In [12]:
df_reviews.isnull().sum()

Unnamed: 0,0
ReviewId,0
RecipeId,0
AuthorId,0
AuthorName,0
Rating,0
Review,0
DateSubmitted,0
DateModified,0


# Data Cleaning

In [13]:
# Handle missing values
df_recipes['AggregatedRating'] = df_recipes['AggregatedRating'].fillna(0)
df_recipes['ReviewCount'] = df_recipes['ReviewCount'].fillna(0)
df_recipes['RecipeServings'] = df_recipes['RecipeServings'].fillna(df_recipes['RecipeServings'].median())
df_recipes['RecipeCategory'] = df_recipes['RecipeCategory'].fillna("Unknown").str.lower().str.strip()

df_reviews.dropna(subset=['Review'], inplace=True)

In [14]:
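# Convert ISO 8601 durations (e.g. "PT24H45M") to minutes
def safe_parse_minutes(x):
    # Missing, non-string, or malformed values are mapped to 0 minutes
    if pd.isnull(x) or not isinstance(x, str) or not x.startswith('P'):
        return 0
    try:
        return parse_duration(x).total_seconds() / 60
    except Exception:
        # Unparseable durations are also treated as 0 minutes
        return 0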

df_recipes['CookTimeMinutes'] = df_recipes['CookTime'].apply(safe_parse_minutes)
df_recipes['PrepTimeMinutes'] = df_recipes['PrepTime'].apply(safe_parse_minutes)
df_recipes['TotalTimeMinutes'] = df_recipes['TotalTime'].apply(safe_parse_minutes)
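
To make the conversion concrete, here is a standard-library-only sketch of what turning an ISO 8601 duration such as `PT24H45M` into minutes involves. It is a simplified stand-in for `isodate.parse_duration`, not a replacement: it handles only days, hours, minutes, and seconds, and the helper name `iso_duration_to_minutes` is hypothetical.

```python
import re

# Matches durations such as "PT45M", "PT24H45M" or "P1DT2H"; years, months
# and weeks are deliberately not supported in this simplified sketch.
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)

def iso_duration_to_minutes(value):
    """Return the duration in minutes, or None if the value cannot be parsed."""
    if not isinstance(value, str):
        return None
    match = _DURATION.match(value)
    if match is None or value in ("P", "PT"):
        return None
    parts = {name: int(num) if num else 0 for name, num in match.groupdict().items()}
    return (parts["days"] * 1440 + parts["hours"] * 60
            + parts["minutes"] + parts["seconds"] / 60)

print(iso_duration_to_minutes("PT24H45M"))  # 1485.0
print(iso_duration_to_minutes("PT45M"))     # 45.0
print(iso_duration_to_minutes(None))        # None
```

This sketch returns `None` for unparseable input so that missing values stay distinguishable from a genuine zero; the notebook's `safe_parse_minutes` returns 0 instead.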

In [15]:
# Fill any remaining missing times with 0 (a safeguard: safe_parse_minutes already returns 0 for missing values)
df_recipes[['CookTimeMinutes', 'PrepTimeMinutes', 'TotalTimeMinutes']] = df_recipes[
    ['CookTimeMinutes', 'PrepTimeMinutes', 'TotalTimeMinutes']
].fillna(0)

In [16]:
# Keep only recipes with a positive total time (drops rows where it is 0)
df_recipes = df_recipes[df_recipes['TotalTimeMinutes'] > 0].copy()


In [None]:
# Convert numpy arrays to regular lists
df_recipes['Ingredients'] = df_recipes['RecipeIngredientParts'].apply(lambda x: x.tolist() if isinstance(x, np.ndarray) else x)
df_recipes['Quantities'] = df_recipes['RecipeIngredientQuantities'].apply(lambda x: x.tolist() if isinstance(x, np.ndarray) else x)

In [18]:
# Lowercase the text columns and strip everything except a-z and whitespace (digits and punctuation are removed)
for text_col in ['Name', 'Description', 'RecipeInstructions', 'Keywords']:
    df_recipes[text_col] = df_recipes[text_col].astype(str).str.lower().str.replace(r'[^a-z\s]', '', regex=True)


In [19]:
# Tokenize Keywords into a list of single words (split on whitespace)
df_recipes['KeywordList'] = df_recipes['Keywords'].apply(lambda x: x.split())
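
Because the cleaning step casts each value to a single string and the tokenizer then splits on whitespace, a multi-word keyword such as "low protein" ends up as two separate tokens. If whole phrases should be preserved, one option is to clean each phrase individually. The sketch below assumes `Keywords` holds a list or array of phrases per recipe (as the ingredient columns appear to); `clean_phrases` and the toy data are hypothetical.

```python
import re
import pandas as pd

def clean_phrases(values):
    """Lowercase each phrase and strip non-letters, keeping one list item per phrase."""
    if values is None or isinstance(values, float):  # None or NaN
        return []
    cleaned = (re.sub(r"[^a-z\s]", "", str(v).lower()).strip() for v in values)
    return [c for c in cleaned if c]

# Toy stand-in for df_recipes['Keywords']; NumPy arrays iterate the same way
keywords = pd.Series([["Low Protein", "< 4 Hours"], None], dtype=object)
keyword_lists = keywords.apply(clean_phrases)
print(keyword_lists.tolist())  # [['low protein', 'hours'], []]
```

Whether phrase-level or word-level tokens work better depends on the downstream vectorizer, so this is an alternative to compare rather than a required change.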

In [20]:
df_reviews['Rating'] = df_reviews['Rating'].astype(float)

In [21]:
# Drop duplicate recipes and reviews
df_recipes.drop_duplicates(subset=['RecipeId'], inplace=True)
df_reviews.drop_duplicates(subset=['ReviewId'], inplace=True)


In [22]:
# Drop recipes with fewer than MIN_REVIEWS reviews
MIN_REVIEWS = 5
popular_recipes = df_reviews['RecipeId'].value_counts()
popular_recipes = popular_recipes[popular_recipes >= MIN_REVIEWS].index
df_recipes = df_recipes[df_recipes['RecipeId'].isin(popular_recipes)]
df_reviews = df_reviews[df_reviews['RecipeId'].isin(popular_recipes)]
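
The effect of the review-count threshold is easiest to see on a small example. The toy frame and the lower threshold below are made up for illustration; the pattern is the same `value_counts` plus `isin` filter used in the cell above.

```python
import pandas as pd

reviews = pd.DataFrame({
    "RecipeId": [1, 1, 1, 2, 3, 3],
    "Rating":   [5, 4, 5, 3, 4, 2],
})
MIN_REVIEWS = 2  # toy threshold; the notebook uses 5

# Count reviews per recipe, then keep only recipes that meet the threshold
counts = reviews["RecipeId"].value_counts()
kept_ids = counts[counts >= MIN_REVIEWS].index
filtered = reviews[reviews["RecipeId"].isin(kept_ids)]

print(sorted(kept_ids.tolist()))  # [1, 3]
print(len(filtered))              # 5
```

Recipe 2 has a single review, so it and its review are removed, while every review of recipes 1 and 3 is kept.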


In [23]:
# Drop unnecessary columns
drop_cols = ['AuthorName', 'TotalTime', 'PrepTime', 'CookTime', 'RecipeIngredientParts', 'RecipeIngredientQuantities', 'RecipeYield', 'Keywords']
df_recipes.drop(columns=drop_cols, inplace=True, errors='ignore')

drop_cols2 = ['AuthorName']
df_reviews.drop(columns=drop_cols2, inplace=True, errors='ignore')

In [24]:
# These names are aliases of the cleaned frames, not copies
recipes_clean = df_recipes
reviews_clean = df_reviews

In [25]:
missing_recipe_ids = reviews_clean[~reviews_clean['RecipeId'].isin(recipes_clean['RecipeId'])]
print(f"Number of reviews with RecipeId not in recipes: {len(missing_recipe_ids)}")


Number of reviews with RecipeId not in recipes: 4079


In [26]:
missing_author_ids = reviews_clean[~reviews_clean['AuthorId'].isin(recipes_clean['AuthorId'])]
print(f"Number of reviews with AuthorId not in recipes: {len(missing_author_ids)}")


Number of reviews with AuthorId not in recipes: 532306


In [27]:
# Create a set of valid (RecipeId, AuthorId) pairs from the recipes dataset
valid_pairs = set(zip(recipes_clean['RecipeId'], recipes_clean['AuthorId']))

# Check which rows in reviews don't have a matching pair
invalid_pairs = reviews_clean[~reviews_clean.apply(lambda row: (row['RecipeId'], row['AuthorId']) in valid_pairs, axis=1)]

print(f"Number of reviews with unmatched RecipeId & AuthorId pairs: {len(invalid_pairs)}")


Number of reviews with unmatched RecipeId & AuthorId pairs: 1028061
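
The row-wise `apply` above calls a Python function once per review, which can be slow on a million rows. A vectorized alternative is a left merge with `indicator=True`, sketched here on toy data. In the real frames `RecipeId` is `float64` in recipes and `int32` in reviews, so the key dtypes would likely need aligning first; that step is an assumption and is not shown.

```python
import pandas as pd

recipes = pd.DataFrame({"RecipeId": [10, 11], "AuthorId": [1, 2]})
reviews = pd.DataFrame({"RecipeId": [10, 10, 11], "AuthorId": [1, 9, 2]})

# Left-join the reviews onto the distinct (RecipeId, AuthorId) pairs; the
# `_merge` column records whether each review found a matching pair.
pairs = recipes[["RecipeId", "AuthorId"]].drop_duplicates()
checked = reviews.merge(pairs, on=["RecipeId", "AuthorId"], how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

print(len(unmatched))  # 1
```

Since `AuthorId` in the reviews table appears to identify the reviewer rather than the recipe's author, a pair only matches when someone reviewed their own recipe, which would explain why most reviews are unmatched.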


In [28]:
# Keep only reviews with RecipeIds that exist in recipes
valid_reviews = reviews_clean[reviews_clean['RecipeId'].isin(recipes_clean['RecipeId'])].copy()

print(f"Remaining reviews after filtering: {len(valid_reviews)}")


Remaining reviews after filtering: 1024428


In [None]:
# Merge on RecipeId
merged_df = pd.merge(
    valid_reviews,
    recipes_clean,
    on='RecipeId',
    how='inner',
    suffixes=('_review', '_recipe')
)

print(f"Merged dataset shape: {merged_df.shape}")
print(merged_df[['RecipeId', 'AuthorId_review', 'AuthorId_recipe']].head())


Merged dataset shape: (1024428, 32)
   RecipeId  AuthorId_review  AuthorId_recipe
0       992             2008             1545
1      4523             2046             1932
2      7435             1773             1986
3        44             2085             1596
4     13307             2046            20914
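
The merged row count equals the number of valid reviews, which is what a many-to-one join should produce. That expectation can also be enforced at merge time with the `validate` argument of `pd.merge`. The toy frames below are illustrative only.

```python
import pandas as pd

reviews = pd.DataFrame({"RecipeId": [1, 1, 2], "Rating": [5.0, 4.0, 3.0]})
recipes = pd.DataFrame({"RecipeId": [1, 2], "Name": ["biryani", "cabbage soup"]})

# many_to_one: each review may match at most one recipe row
merged = pd.merge(reviews, recipes, on="RecipeId", how="inner", validate="many_to_one")
print(merged.shape)  # (3, 3)

# With a duplicated recipe key the same merge raises instead of silently
# multiplying review rows.
dupes = pd.concat([recipes, recipes.iloc[[0]]], ignore_index=True)
duplicate_error = None
try:
    pd.merge(reviews, dupes, on="RecipeId", how="inner", validate="many_to_one")
except pd.errors.MergeError as exc:
    duplicate_error = exc
    print("duplicate recipe keys detected")
```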


In [32]:
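# Drop AuthorId_review (the reviewer's ID); the focus here is on the authors of the recipes.
# Note: a user-item collaborative filtering model needs a reviewer identifier, so retain this column if it is required later.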
drop_cols3 = ['AuthorId_review']
merged_df.drop(columns=drop_cols3, inplace=True, errors='ignore')

In [33]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1024428 entries, 0 to 1024427
Data columns (total 31 columns):
 #   Column               Non-Null Count    Dtype              
---  ------               --------------    -----              
 0   ReviewId             1024428 non-null  int32              
 1   RecipeId             1024428 non-null  int32              
 2   Rating               1024428 non-null  float64            
 3   Review               1024428 non-null  object             
 4   DateSubmitted        1024428 non-null  datetime64[us, UTC]
 5   DateModified         1024428 non-null  datetime64[us, UTC]
 6   Name                 1024428 non-null  object             
 7   AuthorId_recipe      1024428 non-null  int32              
 8   DatePublished        1024428 non-null  datetime64[us, UTC]
 9   Description          1024428 non-null  object             
 10  Images               1024428 non-null  object             
 11  RecipeCategory       1024428 non-null  object     

In [34]:
from sklearn.preprocessing import LabelEncoder

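recipe_encoder = LabelEncoder()
author_encoder = LabelEncoder()  # created here but not fitted in this cell

# Map each RecipeId to a consecutive integer label (0 .. n_recipes - 1)
merged_df['RecipeId_encoded'] = recipe_encoder.fit_transform(merged_df['RecipeId'])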
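
As a quick illustration of what the encoder produces, the sketch below fits `LabelEncoder` on a few made-up IDs. Labels are assigned in sorted order of the original values, and `inverse_transform` recovers the original IDs, which is useful when turning model output back into recipes.

```python
from sklearn.preprocessing import LabelEncoder

recipe_ids = [992, 4523, 992, 44]  # toy IDs, with one repeat
encoder = LabelEncoder()
encoded = encoder.fit_transform(recipe_ids)

print(encoded.tolist())                                # [1, 2, 1, 0]
print(encoder.inverse_transform([0, 1, 2]).tolist())   # [44, 992, 4523]
```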


In [35]:
# Normalize nutritional features to the [0, 1] range
from sklearn.preprocessing import MinMaxScaler

nutritional_cols = [
    'Calories', 'FatContent', 'SaturatedFatContent', 'CholesterolContent',
    'SodiumContent', 'CarbohydrateContent', 'FiberContent',
    'SugarContent', 'ProteinContent'
]

scaler = MinMaxScaler()
merged_df[nutritional_cols] = scaler.fit_transform(merged_df[nutritional_cols])
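
With its default settings, `MinMaxScaler` rescales each column independently as `(x - min) / (max - min)`. The pandas-only sketch below reproduces that formula on made-up values to show the effect; it is an illustration, not a replacement for the scaler.

```python
import pandas as pd

toy = pd.DataFrame({
    "Calories": [100.0, 300.0, 500.0],
    "ProteinContent": [2.0, 2.0, 2.0],  # constant column
})

col_min = toy.min()
# A zero range would divide by zero, so it is replaced by 1; constant
# columns then map to 0.
col_range = (toy.max() - col_min).replace(0, 1.0)
scaled = (toy - col_min) / col_range

print(scaled["Calories"].tolist())        # [0.0, 0.5, 1.0]
print(scaled["ProteinContent"].tolist())  # [0.0, 0.0, 0.0]
```

Because the minimum and maximum define the scale, a single extreme value in a column compresses all other values toward 0, which is worth checking for in the nutritional columns.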
