# Part 1: Cleaning the Data

In [1]:
import pandas as pd

##### Looking at the data: 


In [2]:
df = pd.read_csv('./Data/RecipeNLG_dataset.csv')

In [3]:
df

Unnamed: 0.1,Unnamed: 0,title,ingredients,directions,link,source,NER
0,0,No-Bake Nut Cookies,"[""1 c. firmly packed brown sugar"", ""1/2 c. eva...","[""In a heavy 2-quart saucepan, mix brown sugar...",www.cookbooks.com/Recipe-Details.aspx?id=44874,Gathered,"[""brown sugar"", ""milk"", ""vanilla"", ""nuts"", ""bu..."
1,1,Jewell Ball'S Chicken,"[""1 small jar chipped beef, cut up"", ""4 boned ...","[""Place chipped beef on bottom of baking dish....",www.cookbooks.com/Recipe-Details.aspx?id=699419,Gathered,"[""beef"", ""chicken breasts"", ""cream of mushroom..."
2,2,Creamy Corn,"[""2 (16 oz.) pkg. frozen corn"", ""1 (8 oz.) pkg...","[""In a slow cooker, combine all ingredients. C...",www.cookbooks.com/Recipe-Details.aspx?id=10570,Gathered,"[""frozen corn"", ""cream cheese"", ""butter"", ""gar..."
3,3,Chicken Funny,"[""1 large whole chicken"", ""2 (10 1/2 oz.) cans...","[""Boil and debone chicken."", ""Put bite size pi...",www.cookbooks.com/Recipe-Details.aspx?id=897570,Gathered,"[""chicken"", ""chicken gravy"", ""cream of mushroo..."
4,4,Reeses Cups(Candy),"[""1 c. peanut butter"", ""3/4 c. graham cracker ...","[""Combine first four ingredients and press in ...",www.cookbooks.com/Recipe-Details.aspx?id=659239,Gathered,"[""peanut butter"", ""graham cracker crumbs"", ""bu..."
...,...,...,...,...,...,...,...
2231137,2231137,Sunny's Fake Crepes,"[""1/2 cup chocolate hazelnut spread (recommend...","[""Spread hazelnut spread on 1 side of each tor...",www.foodnetwork.com/recipes/sunny-anderson/sun...,Recipes1M,"[""chocolate hazelnut spread"", ""tortillas"", ""bu..."
2231138,2231138,Devil Eggs,"[""1 dozen eggs"", ""1 paprika"", ""1 salt and pepp...","[""Boil eggs on medium for 30mins."", ""Then cool...",cookpad.com/us/recipes/355411-devil-eggs,Recipes1M,"[""eggs"", ""paprika"", ""salt"", ""choice"", ""miracle..."
2231139,2231139,Extremely Easy and Quick - Namul Daikon Salad,"[""150 grams Daikon radish"", ""1 tbsp Sesame oil...","[""Julienne the daikon and squeeze out the exce...",cookpad.com/us/recipes/153324-extremely-easy-a...,Recipes1M,"[""radish"", ""Sesame oil"", ""White sesame seeds"",..."
2231140,2231140,Pan-Roasted Pork Chops With Apple Fritters,"[""1 cup apple cider"", ""6 tablespoons sugar"", ""...","[""In a large bowl, mix the apple cider with 4 ...",cooking.nytimes.com/recipes/1015164,Recipes1M,"[""apple cider"", ""sugar"", ""kosher salt"", ""bay l..."


##### Annotation:
The output above shows the size of the data: the data frame has 2,231,142 rows × 7 columns. A useful feature of this data set is the NER column, which contains only the simple ingredient names. However, the ingredients and directions columns appear to be stored as lists, which will have to change before using count/TF-IDF vectorizers and Word2Vec.

In [3]:
df.drop(columns = ['Unnamed: 0','source'],inplace = True)

df.drop(columns = ['ingredients'],inplace = True)

In [4]:
df.rename(columns = {'NER': 'raw_ingredients'},inplace= True)

In [5]:
df = df[['title','raw_ingredients','directions','link']]

In [6]:
#df.to_csv('reduced_recipe_data.csv')

##### Annotation:
Unnecessary columns have been dropped and the NER column has been renamed to raw_ingredients. The columns were also reordered and the data frame was exported to a CSV file. Because of my limited computational power and the size of the data frame, my kernel died often, so I exported CSV files frequently during cleaning.
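One way to ease the memory pressure described above is to load only the needed columns in the first place. The sketch below is an illustration under assumptions, not what the notebook actually ran: the two CSV rows are made up, and `keep`, `small`, and `checkpoint` are hypothetical names. It uses the `usecols` parameter of `pd.read_csv` and writes the checkpoint with `index=False`.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV; the rows are made up for illustration.
sample_csv = io.StringIO(
    ',title,ingredients,directions,link,source,NER\n'
    '0,Toast,"[""1 slice bread""]","[""Toast the bread.""]",example.com/toast,Gathered,"[""bread""]"\n'
    '1,Tea,"[""1 tea bag""]","[""Steep the tea.""]",example.com/tea,Gathered,"[""tea""]"\n'
)

# Hypothetical list of the columns worth keeping.
keep = ['title', 'NER', 'directions', 'link']

# usecols makes pandas parse only the listed columns.
small = pd.read_csv(sample_csv, usecols=keep)
small = small.rename(columns={'NER': 'raw_ingredients'})
small = small[['title', 'raw_ingredients', 'directions', 'link']]

# index=False avoids writing the index as an extra unnamed column,
# so the checkpoint can be re-read without index_col=0.
checkpoint = small.to_csv(index=False)
print(small.shape)
```

With the real file, the same `usecols` list would be passed to the `pd.read_csv` call on the full dataset, so the dropped columns are never held in memory.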

In [2]:
# Kernel died; read the new, smaller CSV
df = pd.read_csv('reduced_recipe_data.csv',index_col=0)



In [3]:
new_list = []
for word in df['raw_ingredients']:
    new_list.append([x.strip('\"') for x in word.strip("[ ]").split(", ")])

df.loc[:,'raw_ingredients'] = new_list

df.loc[:,'raw_ingredients'] = [', '.join(i) for i in df['raw_ingredients']]

In [8]:
#df.to_csv('reduced_recipe_data_2.csv')

In [4]:
# Kernel died; read the new CSV
df = pd.read_csv('reduced_recipe_data_2.csv',index_col=0)



In [5]:
new_list_2 = []
for word in df['directions']:
    new_list_2.append([x.strip('\"') for x in word.strip("[ ]").split(", ")])

df.loc[:,'directions'] = new_list_2

df.loc[:,'directions'] = [', '.join(i) for i in df['directions']]

In [9]:
#df.to_csv('reduced_recipe_data_3.csv')

##### Annotation:
I converted the ingredients from a list format into a plain comma-separated string, which is easier to read and makes count and TF-IDF vectorization possible. This was done for both the raw_ingredients and directions columns.
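The list-to-string conversion can also be sketched with the standard `json` module instead of manual `strip` and `split` calls. This assumes every cell is a valid JSON list of strings, which the sample rows suggest but which is not verified here; `list_string_to_text` and the sample value `raw` are hypothetical names used only for illustration.

```python
import json

def list_string_to_text(cell):
    # Parse the JSON-style list, then join its items with ", ".
    return ', '.join(json.loads(cell))

# Made-up cell value containing escaped quotes inside one item.
raw = '["brown sugar", "milk", "1 \\"large\\" egg"]'
print(list_string_to_text(raw))  # brown sugar, milk, 1 "large" egg
```

Parsing rather than splitting means escaped quotes inside an item are decoded properly instead of being left in the text as backslash sequences.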

In [6]:
# Let's reduce the data to about half its size (1.1 million recipes) in case the CSV becomes too big for modelling
df_half = df[0:1100000]

In [8]:
df_half.dropna(inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_half.dropna(inplace = True)
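The SettingWithCopyWarning above appears because df_half was created by slicing df, so pandas cannot tell whether later modifications are meant for the slice or for the original. A minimal sketch of one common way to avoid it, using made-up rows in place of the recipe data, is to take an explicit copy and reassign results rather than modifying in place:

```python
import pandas as pd

# Made-up rows standing in for the recipe data.
df = pd.DataFrame({
    'title': ['Toast', 'Tea', 'Mystery'],
    'raw_ingredients': ['Bread, Butter', 'Tea, Milk', None],
})

# .copy() gives df_half its own data, so later assignments are unambiguous.
df_half = df.iloc[0:3].copy()

# Reassigning instead of using inplace=True keeps each step explicit.
df_half = df_half.dropna()
df_half['raw_ingredients'] = df_half['raw_ingredients'].str.lower()

print(df_half['raw_ingredients'].tolist())  # ['bread, butter', 'tea, milk']
```

The original df is left untouched, at the cost of holding a second copy of the sliced rows in memory.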


In [9]:
#Making sure all characters are lower case
df_half['raw_ingredients'] = df_half['raw_ingredients'].str.lower()



In [10]:
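from nltk.stem import WordNetLemmatizer
import nltk
# Note: word_tokenize and WordNetLemmatizer rely on NLTK data packages
# (such as 'punkt' and 'wordnet'), installed once via nltk.download(...)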

In [11]:
lemmatizer = WordNetLemmatizer()

In [12]:
word_remover = ['finely','chopped','small','and','to','medium','diced','fresh','serving','sliced','extra-virgin',
                'cherry', 'grape','refrigerated','peeled', 'diced','cooked','or to taste','skinless','boneless',
                'halves','slices','shredded','semisweet','unsweetened','melted','room temperature','pound',
                'fluid','optional','minced','cloves','divided','degrees','half','fryer','red','white', 'piece','meat',
                'thigh','tender','grated','processed','catsup','choc','of','confectioner','semi-sweet','all-purpose',
                'kidney','paste','sharp','food','weed','miracle','nonfat','low-fat','style','english','uncle','ben',
                'your','pack','handful','leftover','additional','unpeeled','nonstick','pan','jar','fried','medal',
                'healthy','well','armour','love','for','you','without','any','understanding','made','several','xxxx',
                'stir-fry','uncooked','including','tenderness','fry','tsp']
word_remover = [lemmatizer.lemmatize(w) for w in word_remover]

In [13]:
def lemmatize_text(text):

    return ' '.join([lemmatizer.lemmatize(w) for w in nltk.word_tokenize(text)]).replace(" ,",",")


In [15]:
df_half['raw_ingredients'] = df_half['raw_ingredients'].apply(lemmatize_text)



In [16]:
pat = r'\b(?:{})\b'.format('|'.join(word_remover))
# regex=True is explicit because pandas 2.0+ treats the pattern as literal text by default
df_half['raw_ingredients'] = df_half['raw_ingredients'].str.replace(pat, '', regex=True)


In [35]:
#df_half.to_csv('reduced_by_half_recipe_data2.csv')

##### Annotation:
The next step was to reduce the data by half, creating a data frame of a little over 1 million recipes. Later in the investigation, I realized that having 2 million recipes was slowing my model drastically, so halving the data was a good idea. I came back to this part of the notebook and conducted further cleaning. I explored the ingredients column and targeted words that were unnecessary and might not help during modeling. Most of these were verbs such as chopped, diced, and sliced, so I added them to the word_remover list. Next I added words that describe the state of the ingredients, such as fresh, melted, and unpeeled. I then lowercased and lemmatized all the ingredients. Finally, I lemmatized the contents of word_remover and removed all of those words from the raw_ingredients column.
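The word-removal step can be sketched on a single string with the standard `re` module. This is a variation on the notebook's pattern, not the code that was run: `words_to_remove` is a hypothetical, shortened list, and `build_pattern` and `remove_words` are illustrative helpers. Compared with the notebook's version, it escapes each entry, tries longer phrases first, and tidies the leftover spaces and empty items.

```python
import re

# Hypothetical, shortened version of the word_remover list.
words_to_remove = ["finely", "chopped", "fresh", "or to taste", "to", "room temperature"]

def build_pattern(words):
    # Longest entries first so multi-word phrases win over their sub-words,
    # and re.escape so entries are always treated as literal text.
    ordered = sorted(set(words), key=len, reverse=True)
    return re.compile(r"\b(?:{})\b".format("|".join(re.escape(w) for w in ordered)))

def remove_words(text, pattern):
    # Clean each comma-separated ingredient separately, then drop empties.
    items = []
    for item in text.split(","):
        cleaned = " ".join(pattern.sub(" ", item).split())
        if cleaned:
            items.append(cleaned)
    return ", ".join(items)

pattern = build_pattern(words_to_remove)
result = remove_words("finely chopped onion, fresh garlic, salt or to taste, fresh", pattern)
print(result)  # onion, garlic, salt
```

On the full data frame, a function like `remove_words` could be passed to `.apply` on the raw_ingredients column, though on about a million rows that would likely be slower than the vectorized `str.replace` call.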