# **Milestone 1 - Data Preprocessing**

## Importing Libraries and Modules

The cell below contains all imported libraries that are used in this notebook.

In [None]:
! pip install transformers
import pandas as pd
import re
import numpy as np
import transformers
from ast import literal_eval



## Loading data set

The data set was saved in my Google Drive as a CSV file and is loaded as a pandas DataFrame as shown below:

In [None]:
dataset_file_path = "/content/drive/MyDrive/recipe_data/full_dataset.csv"
df = pd.read_csv(dataset_file_path)
df

Unnamed: 0.1,Unnamed: 0,title,ingredients,directions,link,source,NER
0,0,No-Bake Nut Cookies,"[""1 c. firmly packed brown sugar"", ""1/2 c. eva...","[""In a heavy 2-quart saucepan, mix brown sugar...",www.cookbooks.com/Recipe-Details.aspx?id=44874,Gathered,"[""brown sugar"", ""milk"", ""vanilla"", ""nuts"", ""bu..."
1,1,Jewell Ball'S Chicken,"[""1 small jar chipped beef, cut up"", ""4 boned ...","[""Place chipped beef on bottom of baking dish....",www.cookbooks.com/Recipe-Details.aspx?id=699419,Gathered,"[""beef"", ""chicken breasts"", ""cream of mushroom..."
2,2,Creamy Corn,"[""2 (16 oz.) pkg. frozen corn"", ""1 (8 oz.) pkg...","[""In a slow cooker, combine all ingredients. C...",www.cookbooks.com/Recipe-Details.aspx?id=10570,Gathered,"[""frozen corn"", ""cream cheese"", ""butter"", ""gar..."
3,3,Chicken Funny,"[""1 large whole chicken"", ""2 (10 1/2 oz.) cans...","[""Boil and debone chicken."", ""Put bite size pi...",www.cookbooks.com/Recipe-Details.aspx?id=897570,Gathered,"[""chicken"", ""chicken gravy"", ""cream of mushroo..."
4,4,Reeses Cups(Candy),"[""1 c. peanut butter"", ""3/4 c. graham cracker ...","[""Combine first four ingredients and press in ...",www.cookbooks.com/Recipe-Details.aspx?id=659239,Gathered,"[""peanut butter"", ""graham cracker crumbs"", ""bu..."
...,...,...,...,...,...,...,...
2231137,2231137,Sunny's Fake Crepes,"[""1/2 cup chocolate hazelnut spread (recommend...","[""Spread hazelnut spread on 1 side of each tor...",www.foodnetwork.com/recipes/sunny-anderson/sun...,Recipes1M,"[""chocolate hazelnut spread"", ""tortillas"", ""bu..."
2231138,2231138,Devil Eggs,"[""1 dozen eggs"", ""1 paprika"", ""1 salt and pepp...","[""Boil eggs on medium for 30mins."", ""Then cool...",cookpad.com/us/recipes/355411-devil-eggs,Recipes1M,"[""eggs"", ""paprika"", ""salt"", ""choice"", ""miracle..."
2231139,2231139,Extremely Easy and Quick - Namul Daikon Salad,"[""150 grams Daikon radish"", ""1 tbsp Sesame oil...","[""Julienne the daikon and squeeze out the exce...",cookpad.com/us/recipes/153324-extremely-easy-a...,Recipes1M,"[""radish"", ""Sesame oil"", ""White sesame seeds"",..."
2231140,2231140,Pan-Roasted Pork Chops With Apple Fritters,"[""1 cup apple cider"", ""6 tablespoons sugar"", ""...","[""In a large bowl, mix the apple cider with 4 ...",cooking.nytimes.com/recipes/1015164,Recipes1M,"[""apple cider"", ""sugar"", ""kosher salt"", ""bay l..."


There are a total of 2,231,142 recipes. Despite the square brackets, the entries in the ingredients, directions and NER columns are not arrays of strings; each entry is a single string.
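
To make the point about string-typed entries concrete, here is a minimal sketch on a toy DataFrame (the row below is illustrative, not read from the real file). It shows that an entry is a plain `str` until `literal_eval` parses it into a Python list:

```python
from ast import literal_eval

import pandas as pd

# Toy stand-in for one row of the real data set (values are illustrative).
toy_df = pd.DataFrame({"NER": ['["brown sugar", "milk", "vanilla"]']})

raw_entry = toy_df.NER[0]
print(type(raw_entry).__name__)     # str: the brackets are just characters

parsed_entry = literal_eval(raw_entry)
print(type(parsed_entry).__name__)  # list
print(len(parsed_entry))            # 3
```

This is why the filtering cells further down call `literal_eval` on each entry before counting ingredients.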

## Filtering data set

To ensure that the model is able to generate a suitable number of complementary ingredients, the following recipes were filtered out:

* Recipes with more than 10 ingredients. This cap keeps the model from generating too many complementary ingredients and limits the number of ingredient compatibilities it has to learn within a single recipe.
* Recipes with fewer than 2 ingredients. No ingredient compatibilities would be present within such recipes.
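
The two criteria above can be sketched on toy data as a single boolean mask. This is only an illustration under stated assumptions: the toy rows and the names `toy_df`, `ingredient_counts` and `keep_mask` are hypothetical, and the notebook itself applies the two bounds in separate steps.

```python
import json
from ast import literal_eval

import pandas as pd

# Toy data (illustrative): recipes with 1, 3 and 11 ingredients,
# stored as strings just like the real ingredients column.
toy_df = pd.DataFrame({
    "title": ["too short", "just right", "too long"],
    "ingredients": [
        json.dumps(["1 egg"]),
        json.dumps(["1 c. flour", "2 eggs", "1 c. milk"]),
        json.dumps([f"item {i}" for i in range(11)]),
    ],
})

# Parse each string once and count its ingredients.
ingredient_counts = toy_df["ingredients"].apply(lambda s: len(literal_eval(s)))

# Keep recipes with at least 2 and at most 10 ingredients (both bounds inclusive).
keep_mask = ingredient_counts.between(2, 10)
toy_filtered = toy_df[keep_mask]
print(toy_filtered["title"].tolist())  # ['just right']
```

Counting once and reusing the counts avoids parsing every entry a second time, which may matter on a data set of this size.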

In [None]:
# iterate over all ingredient lists in ingredients column
# return true for those with two or more ingredients
# convert each entry into a numpy array of strings, where each string is an ingredient
ingdt_filter = [np.array(literal_eval(ingredients)).size >= 2 for ingredients in df.ingredients]
filtered_df = df[ingdt_filter]
filtered_df

Unnamed: 0.1,Unnamed: 0,title,ingredients,directions,link,source,NER
0,0,No-Bake Nut Cookies,"[""1 c. firmly packed brown sugar"", ""1/2 c. eva...","[""In a heavy 2-quart saucepan, mix brown sugar...",www.cookbooks.com/Recipe-Details.aspx?id=44874,Gathered,"[""brown sugar"", ""milk"", ""vanilla"", ""nuts"", ""bu..."
1,1,Jewell Ball'S Chicken,"[""1 small jar chipped beef, cut up"", ""4 boned ...","[""Place chipped beef on bottom of baking dish....",www.cookbooks.com/Recipe-Details.aspx?id=699419,Gathered,"[""beef"", ""chicken breasts"", ""cream of mushroom..."
2,2,Creamy Corn,"[""2 (16 oz.) pkg. frozen corn"", ""1 (8 oz.) pkg...","[""In a slow cooker, combine all ingredients. C...",www.cookbooks.com/Recipe-Details.aspx?id=10570,Gathered,"[""frozen corn"", ""cream cheese"", ""butter"", ""gar..."
3,3,Chicken Funny,"[""1 large whole chicken"", ""2 (10 1/2 oz.) cans...","[""Boil and debone chicken."", ""Put bite size pi...",www.cookbooks.com/Recipe-Details.aspx?id=897570,Gathered,"[""chicken"", ""chicken gravy"", ""cream of mushroo..."
4,4,Reeses Cups(Candy),"[""1 c. peanut butter"", ""3/4 c. graham cracker ...","[""Combine first four ingredients and press in ...",www.cookbooks.com/Recipe-Details.aspx?id=659239,Gathered,"[""peanut butter"", ""graham cracker crumbs"", ""bu..."
...,...,...,...,...,...,...,...
2231137,2231137,Sunny's Fake Crepes,"[""1/2 cup chocolate hazelnut spread (recommend...","[""Spread hazelnut spread on 1 side of each tor...",www.foodnetwork.com/recipes/sunny-anderson/sun...,Recipes1M,"[""chocolate hazelnut spread"", ""tortillas"", ""bu..."
2231138,2231138,Devil Eggs,"[""1 dozen eggs"", ""1 paprika"", ""1 salt and pepp...","[""Boil eggs on medium for 30mins."", ""Then cool...",cookpad.com/us/recipes/355411-devil-eggs,Recipes1M,"[""eggs"", ""paprika"", ""salt"", ""choice"", ""miracle..."
2231139,2231139,Extremely Easy and Quick - Namul Daikon Salad,"[""150 grams Daikon radish"", ""1 tbsp Sesame oil...","[""Julienne the daikon and squeeze out the exce...",cookpad.com/us/recipes/153324-extremely-easy-a...,Recipes1M,"[""radish"", ""Sesame oil"", ""White sesame seeds"",..."
2231140,2231140,Pan-Roasted Pork Chops With Apple Fritters,"[""1 cup apple cider"", ""6 tablespoons sugar"", ""...","[""In a large bowl, mix the apple cider with 4 ...",cooking.nytimes.com/recipes/1015164,Recipes1M,"[""apple cider"", ""sugar"", ""kosher salt"", ""bay l..."


After filtering out all the recipes with fewer than 2 ingredients, we are left with 2,227,040 recipes. In other words, 4,102 recipes were filtered out of the full data set.

In [None]:
# iterate over all ingredient lists in the ingredients column
# convert each entry into a numpy array of strings, where each string is an ingredient
# return True for those with ten or fewer ingredients
ingdt_filter2 = [np.array(literal_eval(ingredients)).size <= 10 for ingredients in filtered_df.ingredients]
filtered_df = filtered_df[ingdt_filter2]
filtered_df

Unnamed: 0.1,Unnamed: 0,title,ingredients,directions,link,source,NER
0,0,No-Bake Nut Cookies,"[""1 c. firmly packed brown sugar"", ""1/2 c. eva...","[""In a heavy 2-quart saucepan, mix brown sugar...",www.cookbooks.com/Recipe-Details.aspx?id=44874,Gathered,"[""brown sugar"", ""milk"", ""vanilla"", ""nuts"", ""bu..."
1,1,Jewell Ball'S Chicken,"[""1 small jar chipped beef, cut up"", ""4 boned ...","[""Place chipped beef on bottom of baking dish....",www.cookbooks.com/Recipe-Details.aspx?id=699419,Gathered,"[""beef"", ""chicken breasts"", ""cream of mushroom..."
2,2,Creamy Corn,"[""2 (16 oz.) pkg. frozen corn"", ""1 (8 oz.) pkg...","[""In a slow cooker, combine all ingredients. C...",www.cookbooks.com/Recipe-Details.aspx?id=10570,Gathered,"[""frozen corn"", ""cream cheese"", ""butter"", ""gar..."
3,3,Chicken Funny,"[""1 large whole chicken"", ""2 (10 1/2 oz.) cans...","[""Boil and debone chicken."", ""Put bite size pi...",www.cookbooks.com/Recipe-Details.aspx?id=897570,Gathered,"[""chicken"", ""chicken gravy"", ""cream of mushroo..."
4,4,Reeses Cups(Candy),"[""1 c. peanut butter"", ""3/4 c. graham cracker ...","[""Combine first four ingredients and press in ...",www.cookbooks.com/Recipe-Details.aspx?id=659239,Gathered,"[""peanut butter"", ""graham cracker crumbs"", ""bu..."
...,...,...,...,...,...,...,...
2231134,2231134,Cheese-and-Salmon Quesadilla,"[""2 can salmon"", ""1 c. Monterey Jack cheese"", ...","[""In a bowl, stir together two 6-oz."", ""cans s...",www.delish.com/recipefinder/cheese-and-salmon-...,Recipes1M,"[""salmon"", ""cheese"", ""flour tortilla"", ""green ..."
2231135,2231135,Mozzarella Meatball Sandwiches,"[""1 loaf pepperidge farm frozen mozzarella gar...","[""Heat the oven to 400F."", ""Remove the bread f...",www.food.com/recipe/mozzarella-meatball-sandwi...,Recipes1M,"[""bread"", ""Italian sauce"", ""frozen meatballs""]"
2231137,2231137,Sunny's Fake Crepes,"[""1/2 cup chocolate hazelnut spread (recommend...","[""Spread hazelnut spread on 1 side of each tor...",www.foodnetwork.com/recipes/sunny-anderson/sun...,Recipes1M,"[""chocolate hazelnut spread"", ""tortillas"", ""bu..."
2231138,2231138,Devil Eggs,"[""1 dozen eggs"", ""1 paprika"", ""1 salt and pepp...","[""Boil eggs on medium for 30mins."", ""Then cool...",cookpad.com/us/recipes/355411-devil-eggs,Recipes1M,"[""eggs"", ""paprika"", ""salt"", ""choice"", ""miracle..."


After filtering out all the recipes with more than 10 ingredients, we are left with 1,634,328 recipes. In other words, a further 592,712 recipes were filtered out.

Now, clean up the DataFrame by removing the unnamed first column (it is not needed, because the DataFrame's own index already identifies each row) and resetting the index so that the row labels are consecutive numbers.

In [None]:
# drop the first column; assigning the result instead of using inplace=True
# avoids pandas' SettingWithCopyWarning on the filtered slice
filtered_df = filtered_df.drop(columns=filtered_df.columns[0])

# reset the index so that the row labels are consecutive
filtered_df = filtered_df.reset_index(drop=True)
filtered_df


Unnamed: 0,title,ingredients,directions,link,source,NER
0,No-Bake Nut Cookies,"[""1 c. firmly packed brown sugar"", ""1/2 c. eva...","[""In a heavy 2-quart saucepan, mix brown sugar...",www.cookbooks.com/Recipe-Details.aspx?id=44874,Gathered,"[""brown sugar"", ""milk"", ""vanilla"", ""nuts"", ""bu..."
1,Jewell Ball'S Chicken,"[""1 small jar chipped beef, cut up"", ""4 boned ...","[""Place chipped beef on bottom of baking dish....",www.cookbooks.com/Recipe-Details.aspx?id=699419,Gathered,"[""beef"", ""chicken breasts"", ""cream of mushroom..."
2,Creamy Corn,"[""2 (16 oz.) pkg. frozen corn"", ""1 (8 oz.) pkg...","[""In a slow cooker, combine all ingredients. C...",www.cookbooks.com/Recipe-Details.aspx?id=10570,Gathered,"[""frozen corn"", ""cream cheese"", ""butter"", ""gar..."
3,Chicken Funny,"[""1 large whole chicken"", ""2 (10 1/2 oz.) cans...","[""Boil and debone chicken."", ""Put bite size pi...",www.cookbooks.com/Recipe-Details.aspx?id=897570,Gathered,"[""chicken"", ""chicken gravy"", ""cream of mushroo..."
4,Reeses Cups(Candy),"[""1 c. peanut butter"", ""3/4 c. graham cracker ...","[""Combine first four ingredients and press in ...",www.cookbooks.com/Recipe-Details.aspx?id=659239,Gathered,"[""peanut butter"", ""graham cracker crumbs"", ""bu..."
...,...,...,...,...,...,...
1634323,Cheese-and-Salmon Quesadilla,"[""2 can salmon"", ""1 c. Monterey Jack cheese"", ...","[""In a bowl, stir together two 6-oz."", ""cans s...",www.delish.com/recipefinder/cheese-and-salmon-...,Recipes1M,"[""salmon"", ""cheese"", ""flour tortilla"", ""green ..."
1634324,Mozzarella Meatball Sandwiches,"[""1 loaf pepperidge farm frozen mozzarella gar...","[""Heat the oven to 400F."", ""Remove the bread f...",www.food.com/recipe/mozzarella-meatball-sandwi...,Recipes1M,"[""bread"", ""Italian sauce"", ""frozen meatballs""]"
1634325,Sunny's Fake Crepes,"[""1/2 cup chocolate hazelnut spread (recommend...","[""Spread hazelnut spread on 1 side of each tor...",www.foodnetwork.com/recipes/sunny-anderson/sun...,Recipes1M,"[""chocolate hazelnut spread"", ""tortillas"", ""bu..."
1634326,Devil Eggs,"[""1 dozen eggs"", ""1 paprika"", ""1 salt and pepp...","[""Boil eggs on medium for 30mins."", ""Then cool...",cookpad.com/us/recipes/355411-devil-eggs,Recipes1M,"[""eggs"", ""paprika"", ""salt"", ""choice"", ""miracle..."
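
An unnamed first column like the one dropped above commonly appears when a DataFrame is written to CSV together with its index and later read back without declaring that column as the index. Whether `full_dataset.csv` was produced this way is an assumption; the sketch below only reproduces the effect on toy data with an in-memory buffer.

```python
import io

import pandas as pd

# Toy data (illustrative).
toy_df = pd.DataFrame({"title": ["Creamy Corn", "Devil Eggs"]})

# Writing with the default index=True stores the row labels
# as a first column that has no header name.
csv_text = toy_df.to_csv()

# Reading it back without index_col turns that column into "Unnamed: 0".
reloaded = pd.read_csv(io.StringIO(csv_text))
print(reloaded.columns.tolist())  # ['Unnamed: 0', 'title']

# Passing index_col=0 treats the first column as the index instead.
reloaded_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(reloaded_indexed.columns.tolist())  # ['title']
```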


Drop all columns from the DataFrame except for the NER column:

In [None]:
# keep only the NER column (title, ingredients, directions, link and source are dropped)
filtered_df = filtered_df[["NER"]]

In [None]:
filtered_df

Unnamed: 0,NER
0,"[""brown sugar"", ""milk"", ""vanilla"", ""nuts"", ""bu..."
1,"[""beef"", ""chicken breasts"", ""cream of mushroom..."
2,"[""frozen corn"", ""cream cheese"", ""butter"", ""gar..."
3,"[""chicken"", ""chicken gravy"", ""cream of mushroo..."
4,"[""peanut butter"", ""graham cracker crumbs"", ""bu..."
...,...
1634323,"[""salmon"", ""cheese"", ""flour tortilla"", ""green ..."
1634324,"[""bread"", ""Italian sauce"", ""frozen meatballs""]"
1634325,"[""chocolate hazelnut spread"", ""tortillas"", ""bu..."
1634326,"[""eggs"", ""paprika"", ""salt"", ""choice"", ""miracle..."


Save the filtered DataFrame as a CSV file so that the time-consuming filtering process does not have to be repeated every time the training data is accessed.

In [None]:
filtered_dataset_destination = '/content/drive/MyDrive/recipe_data/filtered_ingdt_dataset.csv'
filtered_df.to_csv(filtered_dataset_destination, index=False)
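
When the saved file is read back later, the NER entries are plain strings again, so they need to be parsed before use. The following is a minimal sketch of that round trip. It assumes toy data and an in-memory buffer in place of the Drive path, and the `converters` argument is one possible way to parse on load, not necessarily what later milestones do.

```python
import io
from ast import literal_eval

import pandas as pd

# Toy data (illustrative), shaped like the filtered DataFrame.
toy_df = pd.DataFrame(
    {"NER": ['["eggs", "paprika", "salt"]', '["bread", "frozen meatballs"]']}
)

# Write without the index, to an in-memory buffer standing in for the Drive path.
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)

# The converter parses each NER string into a list while the file is read.
reloaded = pd.read_csv(buffer, converters={"NER": literal_eval})
print(reloaded.columns.tolist())  # ['NER']
print(reloaded.NER[0])            # ['eggs', 'paprika', 'salt']
```

Because the index is not written, no extra unnamed column appears when the file is reloaded.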
