# **Milestone 1 - Data Preprocessing**

## Importing Libraries and Modules

The cell below contains all imported libraries that are used in this notebook.

In [None]:
! pip install transformers
import pandas as pd
import re
import numpy as np
import transformers
from ast import literal_eval



## Loading data set

The data set is stored on Google Drive as a CSV file and is loaded into a pandas DataFrame as shown below:

In [None]:
dataset_file_path="/content/drive/MyDrive/recipe_data/full_dataset.csv"
df=pd.read_csv(dataset_file_path)
df

Unnamed: 0.1,Unnamed: 0,title,ingredients,directions,link,source,NER
0,0,No-Bake Nut Cookies,"[""1 c. firmly packed brown sugar"", ""1/2 c. eva...","[""In a heavy 2-quart saucepan, mix brown sugar...",www.cookbooks.com/Recipe-Details.aspx?id=44874,Gathered,"[""brown sugar"", ""milk"", ""vanilla"", ""nuts"", ""bu..."
1,1,Jewell Ball'S Chicken,"[""1 small jar chipped beef, cut up"", ""4 boned ...","[""Place chipped beef on bottom of baking dish....",www.cookbooks.com/Recipe-Details.aspx?id=699419,Gathered,"[""beef"", ""chicken breasts"", ""cream of mushroom..."
2,2,Creamy Corn,"[""2 (16 oz.) pkg. frozen corn"", ""1 (8 oz.) pkg...","[""In a slow cooker, combine all ingredients. C...",www.cookbooks.com/Recipe-Details.aspx?id=10570,Gathered,"[""frozen corn"", ""cream cheese"", ""butter"", ""gar..."
3,3,Chicken Funny,"[""1 large whole chicken"", ""2 (10 1/2 oz.) cans...","[""Boil and debone chicken."", ""Put bite size pi...",www.cookbooks.com/Recipe-Details.aspx?id=897570,Gathered,"[""chicken"", ""chicken gravy"", ""cream of mushroo..."
4,4,Reeses Cups(Candy),"[""1 c. peanut butter"", ""3/4 c. graham cracker ...","[""Combine first four ingredients and press in ...",www.cookbooks.com/Recipe-Details.aspx?id=659239,Gathered,"[""peanut butter"", ""graham cracker crumbs"", ""bu..."
...,...,...,...,...,...,...,...
2231137,2231137,Sunny's Fake Crepes,"[""1/2 cup chocolate hazelnut spread (recommend...","[""Spread hazelnut spread on 1 side of each tor...",www.foodnetwork.com/recipes/sunny-anderson/sun...,Recipes1M,"[""chocolate hazelnut spread"", ""tortillas"", ""bu..."
2231138,2231138,Devil Eggs,"[""1 dozen eggs"", ""1 paprika"", ""1 salt and pepp...","[""Boil eggs on medium for 30mins."", ""Then cool...",cookpad.com/us/recipes/355411-devil-eggs,Recipes1M,"[""eggs"", ""paprika"", ""salt"", ""choice"", ""miracle..."
2231139,2231139,Extremely Easy and Quick - Namul Daikon Salad,"[""150 grams Daikon radish"", ""1 tbsp Sesame oil...","[""Julienne the daikon and squeeze out the exce...",cookpad.com/us/recipes/153324-extremely-easy-a...,Recipes1M,"[""radish"", ""Sesame oil"", ""White sesame seeds"",..."
2231140,2231140,Pan-Roasted Pork Chops With Apple Fritters,"[""1 cup apple cider"", ""6 tablespoons sugar"", ""...","[""In a large bowl, mix the apple cider with 4 ...",cooking.nytimes.com/recipes/1015164,Recipes1M,"[""apple cider"", ""sugar"", ""kosher salt"", ""bay l..."


There are a total of 2,231,141 recipes. Despite the square brackets, the entries in the `ingredients`, `directions` and `NER` columns are not arrays of strings; each entry is a single string.
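
To make the point about string-typed cells concrete, here is a minimal sketch using `literal_eval`, which the notebook already imports. The sample value is illustrative (assembled from ingredient strings visible in the output above), not an actual cell from the data set.

```python
from ast import literal_eval

# Illustrative cell value: a single string that merely looks like a list.
raw_ingredients = '["1 c. peanut butter", "1 dozen eggs"]'

# literal_eval safely parses the string into a real Python list.
parsed_ingredients = literal_eval(raw_ingredients)

print(type(raw_ingredients).__name__)     # str
print(type(parsed_ingredients).__name__)  # list
print(len(parsed_ingredients))            # 2
```

Parsing is only needed when the individual items matter (for example, when counting ingredients); substring searches can run on the raw string directly.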

## Filtering data set

To help the model generate detailed and precise recipes, only high-quality recipes are kept in the data set. Recipes with any of the following features are removed:

* Recipes whose source is not "Gathered"
* Recipes with fewer than 2 ingredients, 2 directions, 2 words in the title, or 25 words in the directions
* Recipes that enumerate steps in the directions, as this over-complicates the structure of a recipe and may hinder the learning process. These recipes are detected by looking for the word "step" in the directions, which indicates that enumerated steps were cross-referenced
* Recipes containing the phrase "mix all", since the model should learn to generate detailed recipes with explicit directions
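
The criteria above can be summarised as a single predicate. This is only a sketch for checking individual recipes by hand: the helper names `count_words` and `is_high_quality` and the sample recipe are made up here, and the notebook itself applies the filters one at a time on the DataFrame.

```python
import re
from ast import literal_eval


def count_words(text):
    """Count runs of word characters, mirroring re.findall(r'\\w+', ...)."""
    return len(re.findall(r"\w+", text))


def is_high_quality(title, ingredients, directions, source):
    """Return True if a recipe passes every criterion listed above.

    `ingredients` and `directions` are the raw string cells from the CSV.
    """
    lowered = directions.lower()
    return (
        source == "Gathered"
        and len(literal_eval(ingredients)) >= 2
        and len(literal_eval(directions)) >= 2
        and count_words(title) >= 2
        and count_words(directions) >= 25
        and "step" not in lowered
        and "mix all" not in lowered
    )


# Hypothetical recipe used only to exercise the predicate.
sample_ingredients = '["1 c. sugar", "1 c. nuts"]'
sample_directions = (
    '["Melt the butter in a heavy saucepan over low heat and stir in the '
    'sugar until dissolved.", "Pour the mixture over the nuts and let it '
    'cool for ten minutes before serving."]'
)

print(is_high_quality("No-Bake Nut Cookies", sample_ingredients,
                      sample_directions, "Gathered"))   # True
print(is_high_quality("No-Bake Nut Cookies", sample_ingredients,
                      sample_directions, "Recipes1M"))  # False
```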

In [None]:
# build a boolean mask (True where the source is 'Gathered') and use it to select rows of the dataframe
gathered_filter = (df.source == 'Gathered')
filtered_df = df[gathered_filter]
filtered_df

Unnamed: 0.1,Unnamed: 0,title,ingredients,directions,link,source,NER
0,0,No-Bake Nut Cookies,"[""1 c. firmly packed brown sugar"", ""1/2 c. eva...","[""In a heavy 2-quart saucepan, mix brown sugar...",www.cookbooks.com/Recipe-Details.aspx?id=44874,Gathered,"[""brown sugar"", ""milk"", ""vanilla"", ""nuts"", ""bu..."
1,1,Jewell Ball'S Chicken,"[""1 small jar chipped beef, cut up"", ""4 boned ...","[""Place chipped beef on bottom of baking dish....",www.cookbooks.com/Recipe-Details.aspx?id=699419,Gathered,"[""beef"", ""chicken breasts"", ""cream of mushroom..."
2,2,Creamy Corn,"[""2 (16 oz.) pkg. frozen corn"", ""1 (8 oz.) pkg...","[""In a slow cooker, combine all ingredients. C...",www.cookbooks.com/Recipe-Details.aspx?id=10570,Gathered,"[""frozen corn"", ""cream cheese"", ""butter"", ""gar..."
3,3,Chicken Funny,"[""1 large whole chicken"", ""2 (10 1/2 oz.) cans...","[""Boil and debone chicken."", ""Put bite size pi...",www.cookbooks.com/Recipe-Details.aspx?id=897570,Gathered,"[""chicken"", ""chicken gravy"", ""cream of mushroo..."
4,4,Reeses Cups(Candy),"[""1 c. peanut butter"", ""3/4 c. graham cracker ...","[""Combine first four ingredients and press in ...",www.cookbooks.com/Recipe-Details.aspx?id=659239,Gathered,"[""peanut butter"", ""graham cracker crumbs"", ""bu..."
...,...,...,...,...,...,...,...
1643093,1643093,Tuna 'N Egg Salad In Pitas,"[""6 ounces tuna drained"", ""1 cook egg hard"", ""...","[""Mix tuna, egg, tomato, red onion and Hellman...",www.yummly.com/recipe/Tuna-_n-egg-salad-in-pit...,Gathered,"[""tuna"", ""egg hard"", ""tomatoes"", ""torn romain ..."
1643094,1643094,Croque Monsieur Panini,"[""2 tablespoons unsalted butter"", ""2 tablespoo...","[""In a small sauce pan, melt butter and add fl...",www.yummly.com/recipe/Croque-Monsieur-Panini-1...,Gathered,"[""unsalted butter"", ""flour"", ""milk"", ""black pe..."
1643095,1643095,Croque Monsieur With Cucumber Salad,"[""1/4 cup white wine vinegar"", ""1 teaspoon sug...","[""For the cucumber salad, mix the vinegar and ...",www.yummly.com/recipe/Croque-Monsieur-with-Cuc...,Gathered,"[""white wine vinegar"", ""sugar"", ""olive oil"", ""..."
1643096,1643096,Baked Pork Chops,"[""1 egg whites"", ""1 cup evaporated skim milk"",...","[""2. Beat egg white with evaporated skim milk....",www.yummly.com/recipe/Baked-Pork-Chops-1646800,Gathered,"[""egg whites"", ""milk"", ""center"", ""cornflake cr..."


After filtering out all the recipes not within the "Gathered" subset, we are left with 1,643,098 recipes. In other words, 588,043 recipes are not within the "Gathered" subset.

In [None]:
# iterate over all entries in the ingredients column
# parse each entry into a numpy array of strings, where each string is one ingredient
# the mask is True for entries with two or more ingredients
ingdt_filter = [np.array(literal_eval(ingredients)).size >= 2 for ingredients in filtered_df.ingredients]
filtered_df = filtered_df[ingdt_filter]
filtered_df

Unnamed: 0.1,Unnamed: 0,title,ingredients,directions,link,source,NER
0,0,No-Bake Nut Cookies,"[""1 c. firmly packed brown sugar"", ""1/2 c. eva...","[""In a heavy 2-quart saucepan, mix brown sugar...",www.cookbooks.com/Recipe-Details.aspx?id=44874,Gathered,"[""brown sugar"", ""milk"", ""vanilla"", ""nuts"", ""bu..."
1,1,Jewell Ball'S Chicken,"[""1 small jar chipped beef, cut up"", ""4 boned ...","[""Place chipped beef on bottom of baking dish....",www.cookbooks.com/Recipe-Details.aspx?id=699419,Gathered,"[""beef"", ""chicken breasts"", ""cream of mushroom..."
2,2,Creamy Corn,"[""2 (16 oz.) pkg. frozen corn"", ""1 (8 oz.) pkg...","[""In a slow cooker, combine all ingredients. C...",www.cookbooks.com/Recipe-Details.aspx?id=10570,Gathered,"[""frozen corn"", ""cream cheese"", ""butter"", ""gar..."
3,3,Chicken Funny,"[""1 large whole chicken"", ""2 (10 1/2 oz.) cans...","[""Boil and debone chicken."", ""Put bite size pi...",www.cookbooks.com/Recipe-Details.aspx?id=897570,Gathered,"[""chicken"", ""chicken gravy"", ""cream of mushroo..."
4,4,Reeses Cups(Candy),"[""1 c. peanut butter"", ""3/4 c. graham cracker ...","[""Combine first four ingredients and press in ...",www.cookbooks.com/Recipe-Details.aspx?id=659239,Gathered,"[""peanut butter"", ""graham cracker crumbs"", ""bu..."
...,...,...,...,...,...,...,...
1643093,1643093,Tuna 'N Egg Salad In Pitas,"[""6 ounces tuna drained"", ""1 cook egg hard"", ""...","[""Mix tuna, egg, tomato, red onion and Hellman...",www.yummly.com/recipe/Tuna-_n-egg-salad-in-pit...,Gathered,"[""tuna"", ""egg hard"", ""tomatoes"", ""torn romain ..."
1643094,1643094,Croque Monsieur Panini,"[""2 tablespoons unsalted butter"", ""2 tablespoo...","[""In a small sauce pan, melt butter and add fl...",www.yummly.com/recipe/Croque-Monsieur-Panini-1...,Gathered,"[""unsalted butter"", ""flour"", ""milk"", ""black pe..."
1643095,1643095,Croque Monsieur With Cucumber Salad,"[""1/4 cup white wine vinegar"", ""1 teaspoon sug...","[""For the cucumber salad, mix the vinegar and ...",www.yummly.com/recipe/Croque-Monsieur-with-Cuc...,Gathered,"[""white wine vinegar"", ""sugar"", ""olive oil"", ""..."
1643096,1643096,Baked Pork Chops,"[""1 egg whites"", ""1 cup evaporated skim milk"",...","[""2. Beat egg white with evaporated skim milk....",www.yummly.com/recipe/Baked-Pork-Chops-1646800,Gathered,"[""egg whites"", ""milk"", ""center"", ""cornflake cr..."


After filtering out all the recipes with fewer than 2 ingredients, we are left with 1,640,710 recipes. In other words, 2,388 recipes were removed from the "Gathered" subset.

In [None]:
# analogous to ingdt_filter, removes all entries with fewer than 2 directions
dir_filter_1 = [np.array(literal_eval(directions)).size >= 2 for directions in filtered_df.directions]
filtered_df = filtered_df[dir_filter_1]
filtered_df

Unnamed: 0.1,Unnamed: 0,title,ingredients,directions,link,source,NER
0,0,No-Bake Nut Cookies,"[""1 c. firmly packed brown sugar"", ""1/2 c. eva...","[""In a heavy 2-quart saucepan, mix brown sugar...",www.cookbooks.com/Recipe-Details.aspx?id=44874,Gathered,"[""brown sugar"", ""milk"", ""vanilla"", ""nuts"", ""bu..."
1,1,Jewell Ball'S Chicken,"[""1 small jar chipped beef, cut up"", ""4 boned ...","[""Place chipped beef on bottom of baking dish....",www.cookbooks.com/Recipe-Details.aspx?id=699419,Gathered,"[""beef"", ""chicken breasts"", ""cream of mushroom..."
3,3,Chicken Funny,"[""1 large whole chicken"", ""2 (10 1/2 oz.) cans...","[""Boil and debone chicken."", ""Put bite size pi...",www.cookbooks.com/Recipe-Details.aspx?id=897570,Gathered,"[""chicken"", ""chicken gravy"", ""cream of mushroo..."
4,4,Reeses Cups(Candy),"[""1 c. peanut butter"", ""3/4 c. graham cracker ...","[""Combine first four ingredients and press in ...",www.cookbooks.com/Recipe-Details.aspx?id=659239,Gathered,"[""peanut butter"", ""graham cracker crumbs"", ""bu..."
5,5,Cheeseburger Potato Soup,"[""6 baking potatoes"", ""1 lb. of extra lean gro...","[""Wash potatoes; prick several times with a fo...",www.cookbooks.com/Recipe-Details.aspx?id=20115,Gathered,"[""baking potatoes"", ""extra lean ground beef"", ..."
...,...,...,...,...,...,...,...
1643091,1643091,Sweet Potato Pie,"[""1 cup butter or margarine, softened"", ""1/2 c...","[""In a mixing bowl, cream butter and sugar."", ...",www.yummly.com/recipe/Sweet-Potato-Pie-1674117,Gathered,"[""butter"", ""sugar"", ""eggs"", ""milk"", ""sweet pot..."
1643094,1643094,Croque Monsieur Panini,"[""2 tablespoons unsalted butter"", ""2 tablespoo...","[""In a small sauce pan, melt butter and add fl...",www.yummly.com/recipe/Croque-Monsieur-Panini-1...,Gathered,"[""unsalted butter"", ""flour"", ""milk"", ""black pe..."
1643095,1643095,Croque Monsieur With Cucumber Salad,"[""1/4 cup white wine vinegar"", ""1 teaspoon sug...","[""For the cucumber salad, mix the vinegar and ...",www.yummly.com/recipe/Croque-Monsieur-with-Cuc...,Gathered,"[""white wine vinegar"", ""sugar"", ""olive oil"", ""..."
1643096,1643096,Baked Pork Chops,"[""1 egg whites"", ""1 cup evaporated skim milk"",...","[""2. Beat egg white with evaporated skim milk....",www.yummly.com/recipe/Baked-Pork-Chops-1646800,Gathered,"[""egg whites"", ""milk"", ""center"", ""cornflake cr..."


After filtering out all the recipes with fewer than 2 directions, we are left with 1,472,742 recipes. In other words, a further 167,968 recipes were removed.

To find the number of words in the directions column of each entry, we use `len(re.findall(r'\w+', sentence))`, which counts the matches of the regex `\w+` in the sentence. Each match is a run of word characters: letters, digits and the underscore. For ASCII text this is `[a-zA-Z0-9_]+`; in Python 3, `\w` also matches non-ASCII letters and digits unless the `re.ASCII` flag is set.
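
A short sketch of how this word count behaves on a few strings. The helper name `count_words` is introduced here for illustration, and the inputs are sample strings rather than a definitive test of the data set.

```python
import re


def count_words(text, flags=0):
    """Count non-overlapping matches of \\w+ in text."""
    return len(re.findall(r"\w+", text, flags))


# The hyphen is not a word character, so "No-Bake" counts as two words.
print(count_words("No-Bake Nut Cookies"))             # 4

# Brackets, quotes and commas in the raw list string are ignored.
print(count_words('["1 dozen eggs", "1 paprika"]'))   # 5

# Accented letters count as word characters by default in Python 3 ...
accented = "Cr\u00e8me br\u00fbl\u00e9e"
print(count_words(accented))                          # 2

# ... but split the words when matching is restricted to ASCII.
print(count_words(accented, re.ASCII))                # 5
```

Because the count runs on the raw string cell, digits and abbreviations such as "oz" each count as a word.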

In [None]:
# removes all entries with fewer than 25 words in the directions
dir_filter_2 = [len(re.findall(r'\w+', directions)) >= 25 for directions in filtered_df.directions]
filtered_df = filtered_df[dir_filter_2]
filtered_df

Unnamed: 0.1,Unnamed: 0,title,ingredients,directions,link,source,NER
0,0,No-Bake Nut Cookies,"[""1 c. firmly packed brown sugar"", ""1/2 c. eva...","[""In a heavy 2-quart saucepan, mix brown sugar...",www.cookbooks.com/Recipe-Details.aspx?id=44874,Gathered,"[""brown sugar"", ""milk"", ""vanilla"", ""nuts"", ""bu..."
1,1,Jewell Ball'S Chicken,"[""1 small jar chipped beef, cut up"", ""4 boned ...","[""Place chipped beef on bottom of baking dish....",www.cookbooks.com/Recipe-Details.aspx?id=699419,Gathered,"[""beef"", ""chicken breasts"", ""cream of mushroom..."
3,3,Chicken Funny,"[""1 large whole chicken"", ""2 (10 1/2 oz.) cans...","[""Boil and debone chicken."", ""Put bite size pi...",www.cookbooks.com/Recipe-Details.aspx?id=897570,Gathered,"[""chicken"", ""chicken gravy"", ""cream of mushroo..."
4,4,Reeses Cups(Candy),"[""1 c. peanut butter"", ""3/4 c. graham cracker ...","[""Combine first four ingredients and press in ...",www.cookbooks.com/Recipe-Details.aspx?id=659239,Gathered,"[""peanut butter"", ""graham cracker crumbs"", ""bu..."
5,5,Cheeseburger Potato Soup,"[""6 baking potatoes"", ""1 lb. of extra lean gro...","[""Wash potatoes; prick several times with a fo...",www.cookbooks.com/Recipe-Details.aspx?id=20115,Gathered,"[""baking potatoes"", ""extra lean ground beef"", ..."
...,...,...,...,...,...,...,...
1643091,1643091,Sweet Potato Pie,"[""1 cup butter or margarine, softened"", ""1/2 c...","[""In a mixing bowl, cream butter and sugar."", ...",www.yummly.com/recipe/Sweet-Potato-Pie-1674117,Gathered,"[""butter"", ""sugar"", ""eggs"", ""milk"", ""sweet pot..."
1643094,1643094,Croque Monsieur Panini,"[""2 tablespoons unsalted butter"", ""2 tablespoo...","[""In a small sauce pan, melt butter and add fl...",www.yummly.com/recipe/Croque-Monsieur-Panini-1...,Gathered,"[""unsalted butter"", ""flour"", ""milk"", ""black pe..."
1643095,1643095,Croque Monsieur With Cucumber Salad,"[""1/4 cup white wine vinegar"", ""1 teaspoon sug...","[""For the cucumber salad, mix the vinegar and ...",www.yummly.com/recipe/Croque-Monsieur-with-Cuc...,Gathered,"[""white wine vinegar"", ""sugar"", ""olive oil"", ""..."
1643096,1643096,Baked Pork Chops,"[""1 egg whites"", ""1 cup evaporated skim milk"",...","[""2. Beat egg white with evaporated skim milk....",www.yummly.com/recipe/Baked-Pork-Chops-1646800,Gathered,"[""egg whites"", ""milk"", ""center"", ""cornflake cr..."


After filtering out all the recipes with fewer than 25 words in their directions, we are left with 1,312,534 recipes. In other words, a further 160,208 recipes were removed.

In [None]:
# str.contains returns True where the directions contain "step"
# the "~" operator inverts the mask, so those entries are the ones filtered out
# case=False makes the search case-insensitive
step_filter = ~filtered_df['directions'].str.contains("step", case=False)
filtered_df = filtered_df[step_filter]
filtered_df

Unnamed: 0.1,Unnamed: 0,title,ingredients,directions,link,source,NER
0,0,No-Bake Nut Cookies,"[""1 c. firmly packed brown sugar"", ""1/2 c. eva...","[""In a heavy 2-quart saucepan, mix brown sugar...",www.cookbooks.com/Recipe-Details.aspx?id=44874,Gathered,"[""brown sugar"", ""milk"", ""vanilla"", ""nuts"", ""bu..."
1,1,Jewell Ball'S Chicken,"[""1 small jar chipped beef, cut up"", ""4 boned ...","[""Place chipped beef on bottom of baking dish....",www.cookbooks.com/Recipe-Details.aspx?id=699419,Gathered,"[""beef"", ""chicken breasts"", ""cream of mushroom..."
3,3,Chicken Funny,"[""1 large whole chicken"", ""2 (10 1/2 oz.) cans...","[""Boil and debone chicken."", ""Put bite size pi...",www.cookbooks.com/Recipe-Details.aspx?id=897570,Gathered,"[""chicken"", ""chicken gravy"", ""cream of mushroo..."
4,4,Reeses Cups(Candy),"[""1 c. peanut butter"", ""3/4 c. graham cracker ...","[""Combine first four ingredients and press in ...",www.cookbooks.com/Recipe-Details.aspx?id=659239,Gathered,"[""peanut butter"", ""graham cracker crumbs"", ""bu..."
5,5,Cheeseburger Potato Soup,"[""6 baking potatoes"", ""1 lb. of extra lean gro...","[""Wash potatoes; prick several times with a fo...",www.cookbooks.com/Recipe-Details.aspx?id=20115,Gathered,"[""baking potatoes"", ""extra lean ground beef"", ..."
...,...,...,...,...,...,...,...
1643091,1643091,Sweet Potato Pie,"[""1 cup butter or margarine, softened"", ""1/2 c...","[""In a mixing bowl, cream butter and sugar."", ...",www.yummly.com/recipe/Sweet-Potato-Pie-1674117,Gathered,"[""butter"", ""sugar"", ""eggs"", ""milk"", ""sweet pot..."
1643094,1643094,Croque Monsieur Panini,"[""2 tablespoons unsalted butter"", ""2 tablespoo...","[""In a small sauce pan, melt butter and add fl...",www.yummly.com/recipe/Croque-Monsieur-Panini-1...,Gathered,"[""unsalted butter"", ""flour"", ""milk"", ""black pe..."
1643095,1643095,Croque Monsieur With Cucumber Salad,"[""1/4 cup white wine vinegar"", ""1 teaspoon sug...","[""For the cucumber salad, mix the vinegar and ...",www.yummly.com/recipe/Croque-Monsieur-with-Cuc...,Gathered,"[""white wine vinegar"", ""sugar"", ""olive oil"", ""..."
1643096,1643096,Baked Pork Chops,"[""1 egg whites"", ""1 cup evaporated skim milk"",...","[""2. Beat egg white with evaporated skim milk....",www.yummly.com/recipe/Baked-Pork-Chops-1646800,Gathered,"[""egg whites"", ""milk"", ""center"", ""cornflake cr..."


After filtering out all the recipes with the word "step" in their directions, we are left with 1,301,548 recipes. In other words, a further 10,986 recipes were removed.
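
The effect of the inverted `str.contains` mask can be seen on a toy Series. The three direction strings below are made up for illustration.

```python
import pandas as pd

# Made-up direction cells; only the second and third mention "step".
directions = pd.Series([
    '["Boil the pasta.", "Drain and serve."]',
    '["Repeat Step 2 until the sauce thickens."]',
    '["Follow the steps on the package."]',
])

# True where the text contains "step" in any letter case.
has_step = directions.str.contains("step", case=False)

# "~" flips the mask so that True marks the rows to keep.
keep = ~has_step

print(has_step.tolist())  # [False, True, True]
print(keep.tolist())      # [True, False, False]
```

Note that this is substring matching, so "Step", "steps" and any longer word containing "step" are all caught.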

In [None]:
# analogous to dir_filter_2
title_filter = [len(re.findall(r'\w+', title)) >= 2 for title in filtered_df.title]
filtered_df = filtered_df[title_filter]
filtered_df

Unnamed: 0.1,Unnamed: 0,title,ingredients,directions,link,source,NER
0,0,No-Bake Nut Cookies,"[""1 c. firmly packed brown sugar"", ""1/2 c. eva...","[""In a heavy 2-quart saucepan, mix brown sugar...",www.cookbooks.com/Recipe-Details.aspx?id=44874,Gathered,"[""brown sugar"", ""milk"", ""vanilla"", ""nuts"", ""bu..."
1,1,Jewell Ball'S Chicken,"[""1 small jar chipped beef, cut up"", ""4 boned ...","[""Place chipped beef on bottom of baking dish....",www.cookbooks.com/Recipe-Details.aspx?id=699419,Gathered,"[""beef"", ""chicken breasts"", ""cream of mushroom..."
3,3,Chicken Funny,"[""1 large whole chicken"", ""2 (10 1/2 oz.) cans...","[""Boil and debone chicken."", ""Put bite size pi...",www.cookbooks.com/Recipe-Details.aspx?id=897570,Gathered,"[""chicken"", ""chicken gravy"", ""cream of mushroo..."
4,4,Reeses Cups(Candy),"[""1 c. peanut butter"", ""3/4 c. graham cracker ...","[""Combine first four ingredients and press in ...",www.cookbooks.com/Recipe-Details.aspx?id=659239,Gathered,"[""peanut butter"", ""graham cracker crumbs"", ""bu..."
5,5,Cheeseburger Potato Soup,"[""6 baking potatoes"", ""1 lb. of extra lean gro...","[""Wash potatoes; prick several times with a fo...",www.cookbooks.com/Recipe-Details.aspx?id=20115,Gathered,"[""baking potatoes"", ""extra lean ground beef"", ..."
...,...,...,...,...,...,...,...
1643091,1643091,Sweet Potato Pie,"[""1 cup butter or margarine, softened"", ""1/2 c...","[""In a mixing bowl, cream butter and sugar."", ...",www.yummly.com/recipe/Sweet-Potato-Pie-1674117,Gathered,"[""butter"", ""sugar"", ""eggs"", ""milk"", ""sweet pot..."
1643094,1643094,Croque Monsieur Panini,"[""2 tablespoons unsalted butter"", ""2 tablespoo...","[""In a small sauce pan, melt butter and add fl...",www.yummly.com/recipe/Croque-Monsieur-Panini-1...,Gathered,"[""unsalted butter"", ""flour"", ""milk"", ""black pe..."
1643095,1643095,Croque Monsieur With Cucumber Salad,"[""1/4 cup white wine vinegar"", ""1 teaspoon sug...","[""For the cucumber salad, mix the vinegar and ...",www.yummly.com/recipe/Croque-Monsieur-with-Cuc...,Gathered,"[""white wine vinegar"", ""sugar"", ""olive oil"", ""..."
1643096,1643096,Baked Pork Chops,"[""1 egg whites"", ""1 cup evaporated skim milk"",...","[""2. Beat egg white with evaporated skim milk....",www.yummly.com/recipe/Baked-Pork-Chops-1646800,Gathered,"[""egg whites"", ""milk"", ""center"", ""cornflake cr..."


After filtering out all the recipes with fewer than 2 words in their titles, we are left with 1,276,159 recipes. In other words, a further 25,389 recipes were removed.

In [None]:
# analogous to step_filter
mix_all_filter = ~filtered_df['directions'].str.contains("mix all", case=False)
filtered_df = filtered_df[mix_all_filter]
filtered_df

Unnamed: 0.1,Unnamed: 0,title,ingredients,directions,link,source,NER
0,0,No-Bake Nut Cookies,"[""1 c. firmly packed brown sugar"", ""1/2 c. eva...","[""In a heavy 2-quart saucepan, mix brown sugar...",www.cookbooks.com/Recipe-Details.aspx?id=44874,Gathered,"[""brown sugar"", ""milk"", ""vanilla"", ""nuts"", ""bu..."
1,1,Jewell Ball'S Chicken,"[""1 small jar chipped beef, cut up"", ""4 boned ...","[""Place chipped beef on bottom of baking dish....",www.cookbooks.com/Recipe-Details.aspx?id=699419,Gathered,"[""beef"", ""chicken breasts"", ""cream of mushroom..."
3,3,Chicken Funny,"[""1 large whole chicken"", ""2 (10 1/2 oz.) cans...","[""Boil and debone chicken."", ""Put bite size pi...",www.cookbooks.com/Recipe-Details.aspx?id=897570,Gathered,"[""chicken"", ""chicken gravy"", ""cream of mushroo..."
4,4,Reeses Cups(Candy),"[""1 c. peanut butter"", ""3/4 c. graham cracker ...","[""Combine first four ingredients and press in ...",www.cookbooks.com/Recipe-Details.aspx?id=659239,Gathered,"[""peanut butter"", ""graham cracker crumbs"", ""bu..."
5,5,Cheeseburger Potato Soup,"[""6 baking potatoes"", ""1 lb. of extra lean gro...","[""Wash potatoes; prick several times with a fo...",www.cookbooks.com/Recipe-Details.aspx?id=20115,Gathered,"[""baking potatoes"", ""extra lean ground beef"", ..."
...,...,...,...,...,...,...,...
1643091,1643091,Sweet Potato Pie,"[""1 cup butter or margarine, softened"", ""1/2 c...","[""In a mixing bowl, cream butter and sugar."", ...",www.yummly.com/recipe/Sweet-Potato-Pie-1674117,Gathered,"[""butter"", ""sugar"", ""eggs"", ""milk"", ""sweet pot..."
1643094,1643094,Croque Monsieur Panini,"[""2 tablespoons unsalted butter"", ""2 tablespoo...","[""In a small sauce pan, melt butter and add fl...",www.yummly.com/recipe/Croque-Monsieur-Panini-1...,Gathered,"[""unsalted butter"", ""flour"", ""milk"", ""black pe..."
1643095,1643095,Croque Monsieur With Cucumber Salad,"[""1/4 cup white wine vinegar"", ""1 teaspoon sug...","[""For the cucumber salad, mix the vinegar and ...",www.yummly.com/recipe/Croque-Monsieur-with-Cuc...,Gathered,"[""white wine vinegar"", ""sugar"", ""olive oil"", ""..."
1643096,1643096,Baked Pork Chops,"[""1 egg whites"", ""1 cup evaporated skim milk"",...","[""2. Beat egg white with evaporated skim milk....",www.yummly.com/recipe/Baked-Pork-Chops-1646800,Gathered,"[""egg whites"", ""milk"", ""center"", ""cornflake cr..."


After filtering out all the recipes with the phrase "mix all" in their directions, we are left with 1,234,476 recipes. In other words, a further 41,683 recipes were removed. This should be more than enough to train the ML-Chef model.
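
The row counts reported after each stage can be cross-checked with a few lines of arithmetic. The counts are copied from this notebook; the stage labels are shorthand introduced here.

```python
# (stage label, rows remaining after that stage), as reported in this notebook
stage_counts = [
    ("full data set", 2_231_141),
    ("source is Gathered", 1_643_098),
    ("at least 2 ingredients", 1_640_710),
    ("at least 2 directions", 1_472_742),
    ("at least 25 words in directions", 1_312_534),
    ("no 'step' in directions", 1_301_548),
    ("at least 2 words in title", 1_276_159),
    ("no 'mix all' in directions", 1_234_476),
]

# Rows removed by each stage = previous count minus current count.
removed = [
    previous - current
    for (_, previous), (_, current) in zip(stage_counts, stage_counts[1:])
]

for (label, remaining), dropped in zip(stage_counts[1:], removed):
    print(f"{label}: kept {remaining:,}, removed {dropped:,}")

# Share of the original data set that survives all filters.
kept_fraction = stage_counts[-1][1] / stage_counts[0][1]
print(f"kept {kept_fraction:.1%} of the original recipes")
```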

Now, clean up the dataframe by removing the unnamed first column (it only duplicates the row index, so it is not needed) and resetting the index so that the rows are numbered consecutively.

In [None]:
# drop the first column; assigning the result back (rather than using inplace=True
# on a filtered slice) avoids pandas' SettingWithCopyWarning
filtered_df = filtered_df.drop(columns=filtered_df.columns[0])

# reset index
filtered_df = filtered_df.reset_index(drop=True)
filtered_df

Unnamed: 0,title,ingredients,directions,link,source,NER
0,No-Bake Nut Cookies,"[""1 c. firmly packed brown sugar"", ""1/2 c. eva...","[""In a heavy 2-quart saucepan, mix brown sugar...",www.cookbooks.com/Recipe-Details.aspx?id=44874,Gathered,"[""brown sugar"", ""milk"", ""vanilla"", ""nuts"", ""bu..."
1,Jewell Ball'S Chicken,"[""1 small jar chipped beef, cut up"", ""4 boned ...","[""Place chipped beef on bottom of baking dish....",www.cookbooks.com/Recipe-Details.aspx?id=699419,Gathered,"[""beef"", ""chicken breasts"", ""cream of mushroom..."
2,Chicken Funny,"[""1 large whole chicken"", ""2 (10 1/2 oz.) cans...","[""Boil and debone chicken."", ""Put bite size pi...",www.cookbooks.com/Recipe-Details.aspx?id=897570,Gathered,"[""chicken"", ""chicken gravy"", ""cream of mushroo..."
3,Reeses Cups(Candy),"[""1 c. peanut butter"", ""3/4 c. graham cracker ...","[""Combine first four ingredients and press in ...",www.cookbooks.com/Recipe-Details.aspx?id=659239,Gathered,"[""peanut butter"", ""graham cracker crumbs"", ""bu..."
4,Cheeseburger Potato Soup,"[""6 baking potatoes"", ""1 lb. of extra lean gro...","[""Wash potatoes; prick several times with a fo...",www.cookbooks.com/Recipe-Details.aspx?id=20115,Gathered,"[""baking potatoes"", ""extra lean ground beef"", ..."
...,...,...,...,...,...,...
1234471,Sweet Potato Pie,"[""1 cup butter or margarine, softened"", ""1/2 c...","[""In a mixing bowl, cream butter and sugar."", ...",www.yummly.com/recipe/Sweet-Potato-Pie-1674117,Gathered,"[""butter"", ""sugar"", ""eggs"", ""milk"", ""sweet pot..."
1234472,Croque Monsieur Panini,"[""2 tablespoons unsalted butter"", ""2 tablespoo...","[""In a small sauce pan, melt butter and add fl...",www.yummly.com/recipe/Croque-Monsieur-Panini-1...,Gathered,"[""unsalted butter"", ""flour"", ""milk"", ""black pe..."
1234473,Croque Monsieur With Cucumber Salad,"[""1/4 cup white wine vinegar"", ""1 teaspoon sug...","[""For the cucumber salad, mix the vinegar and ...",www.yummly.com/recipe/Croque-Monsieur-with-Cuc...,Gathered,"[""white wine vinegar"", ""sugar"", ""olive oil"", ""..."
1234474,Baked Pork Chops,"[""1 egg whites"", ""1 cup evaporated skim milk"",...","[""2. Beat egg white with evaporated skim milk....",www.yummly.com/recipe/Baked-Pork-Chops-1646800,Gathered,"[""egg whites"", ""milk"", ""center"", ""cornflake cr..."
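
The cleanup step can be checked on a toy frame. The data below is invented; only the column name `Unnamed: 0` mirrors the real data set.

```python
import pandas as pd

# Toy frame with a leftover index column, like the one read from the CSV.
toy = pd.DataFrame(
    {"Unnamed: 0": [0, 1, 2, 3], "title": ["A", "B", "C", "D"]}
)

# Filtering leaves gaps in the index (here: 0, 2, 3).
subset = toy[toy["title"] != "B"]

# Drop the first column and renumber the rows consecutively.
cleaned = subset.drop(columns=subset.columns[0]).reset_index(drop=True)

print(subset.index.tolist())     # [0, 2, 3]
print(cleaned.index.tolist())    # [0, 1, 2]
print(cleaned.columns.tolist())  # ['title']
```

Passing `drop=True` to `reset_index` discards the old index instead of inserting it as a new column.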


Save the filtered dataframe as a CSV file. The filtering process is time-consuming, and saving the result avoids repeating it every time the training data is accessed.

In [None]:
filtered_dataset_destination = '/content/drive/MyDrive/recipe_data/filtered_dataset.csv'
filtered_df.to_csv(filtered_dataset_destination, index=False)
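
As a sanity check on the save step, the sketch below round-trips a toy frame through CSV using an in-memory buffer instead of Google Drive. The frame contents are illustrative. It shows that writing without the index avoids an extra `Unnamed: 0` column on reload, and that list-like cells come back as strings that still need `literal_eval`.

```python
import io
from ast import literal_eval

import pandas as pd

# Illustrative one-row frame with a list-like string cell.
toy = pd.DataFrame({
    "title": ["Devil Eggs"],
    "ingredients": ['["1 dozen eggs", "1 paprika"]'],
})

# Write to an in-memory buffer without the index, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.columns.tolist())  # ['title', 'ingredients']

# The cell is still a string after reloading; parse it to get a list.
parsed = literal_eval(reloaded.loc[0, "ingredients"])
print(parsed)  # ['1 dozen eggs', '1 paprika']
```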