# Data Preparation of the Substitute Recommender Model

This notebook documents the preparation of the relevant data for the implementation of the Substitute Recommender Model (SRM). 

In chapter 1.0, the dataset for training the word2vec model is described and prepared.
In chapter 2.0, the ingredient list needed for filtering the results is described and prepared.

- install all required libraries

In [1]:
pip install gensim

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install flashtext

Note: you may need to restart the kernel to use updated packages.


In [3]:
import gensim
import gensim.parsing.preprocessing as pp
from gensim.parsing.preprocessing import STOPWORDS
from gensim import utils
from gensim.models.phrases import Phrases

import pandas as pd
import numpy as np

import re

from flashtext import KeywordProcessor

## 1.0 Training Data 

The final training dataset consists of the following five individual recipe datasets: 

    (1) RecipeNLG_dataset.csv, sourced from Kaggle
    (2) recipes.csv, sourced from Kaggle
    (3) recipes_raw_nosource_ar.json, sourced from GitHub
    (4) recipes_raw_nosource_epi.json, sourced from GitHub
    (5) recipes_raw_nosource_fn.json, sourced from GitHub

### 1.1 Building the Training Dataset
#### 1.1.1 Individual Datasets

In the following, the individual training datasets are examined and prepared for further use.

##### RecipeNLG Dataset

- read (1) RecipeNLG_dataset.csv as DataFrame

In [4]:
data_RecipeNLG = pd.read_csv("RecipeNLG_dataset.csv")

In [5]:
data_RecipeNLG.head()

Unnamed: 0.1,Unnamed: 0,title,ingredients,directions,link,source,NER
0,0,No-Bake Nut Cookies,"[""1 c. firmly packed brown sugar"", ""1/2 c. eva...","[""In a heavy 2-quart saucepan, mix brown sugar...",www.cookbooks.com/Recipe-Details.aspx?id=44874,Gathered,"[""brown sugar"", ""milk"", ""vanilla"", ""nuts"", ""bu..."
1,1,Jewell Ball'S Chicken,"[""1 small jar chipped beef, cut up"", ""4 boned ...","[""Place chipped beef on bottom of baking dish....",www.cookbooks.com/Recipe-Details.aspx?id=699419,Gathered,"[""beef"", ""chicken breasts"", ""cream of mushroom..."
2,2,Creamy Corn,"[""2 (16 oz.) pkg. frozen corn"", ""1 (8 oz.) pkg...","[""In a slow cooker, combine all ingredients. C...",www.cookbooks.com/Recipe-Details.aspx?id=10570,Gathered,"[""frozen corn"", ""cream cheese"", ""butter"", ""gar..."
3,3,Chicken Funny,"[""1 large whole chicken"", ""2 (10 1/2 oz.) cans...","[""Boil and debone chicken."", ""Put bite size pi...",www.cookbooks.com/Recipe-Details.aspx?id=897570,Gathered,"[""chicken"", ""chicken gravy"", ""cream of mushroo..."
4,4,Reeses Cups(Candy),"[""1 c. peanut butter"", ""3/4 c. graham cracker ...","[""Combine first four ingredients and press in ...",www.cookbooks.com/Recipe-Details.aspx?id=659239,Gathered,"[""peanut butter"", ""graham cracker crumbs"", ""bu..."


- rename the relevant column for the training dataset and drop irrelevant columns

In [6]:
data_RecipeNLG.rename(columns={'directions':'instructions'}, inplace=True)

In [7]:
data_RecipeNLG_instructions = data_RecipeNLG.drop(['title', 'link', 'source', 'NER', 'ingredients', 'Unnamed: 0'], axis=1)
data_RecipeNLG_instructions.head()

Unnamed: 0,instructions
0,"[""In a heavy 2-quart saucepan, mix brown sugar..."
1,"[""Place chipped beef on bottom of baking dish...."
2,"[""In a slow cooker, combine all ingredients. C..."
3,"[""Boil and debone chicken."", ""Put bite size pi..."
4,"[""Combine first four ingredients and press in ..."


Examine the RecipeNLG dataset:

In [8]:
data_RecipeNLG_instructions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2231142 entries, 0 to 2231141
Data columns (total 1 columns):
 #   Column        Dtype 
---  ------        ----- 
 0   instructions  object
dtypes: object(1)
memory usage: 17.0+ MB


- check for missing values

In [9]:
data_RecipeNLG_instructions.isnull().sum().sum()

0

- check the length of the dataset, i.e., the number of recipes it contains

In [10]:
len(data_RecipeNLG_instructions)

2231142

##### recipes Dataset

- read (2) recipes.csv as DataFrame

In [11]:
data_recipes = pd.read_csv("recipes.csv")

In [12]:
data_recipes.head()

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,CookTime,PrepTime,TotalTime,DatePublished,Description,Images,...,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions
0,38,Low-Fat Berry Blue Frozen Dessert,1533,Dancer,PT24H,PT45M,PT24H45M,1999-08-09T21:46:00Z,Make and share this Low-Fat Berry Blue Frozen ...,"c(""https://img.sndimg.com/food/image/upload/w_...",...,1.3,8.0,29.8,37.1,3.6,30.2,3.2,4.0,,"c(""Toss 2 cups berries with sugar."", ""Let stan..."
1,39,Biryani,1567,elly9812,PT25M,PT4H,PT4H25M,1999-08-29T13:12:00Z,Make and share this Biryani recipe from Food.com.,"c(""https://img.sndimg.com/food/image/upload/w_...",...,16.6,372.8,368.4,84.4,9.0,20.4,63.4,6.0,,"c(""Soak saffron in warm milk for 5 minutes and..."
2,40,Best Lemonade,1566,Stephen Little,PT5M,PT30M,PT35M,1999-09-05T19:52:00Z,This is from one of my first Good House Keepi...,"c(""https://img.sndimg.com/food/image/upload/w_...",...,0.0,0.0,1.8,81.5,0.4,77.2,0.3,4.0,,"c(""Into a 1 quart Jar with tight fitting lid, ..."
3,41,Carina's Tofu-Vegetable Kebabs,1586,Cyclopz,PT20M,PT24H,PT24H20M,1999-09-03T14:54:00Z,This dish is best prepared a day in advance to...,"c(""https://img.sndimg.com/food/image/upload/w_...",...,3.8,0.0,1558.6,64.2,17.3,32.1,29.3,2.0,4 kebabs,"c(""Drain the tofu, carefully squeezing out exc..."
4,42,Cabbage Soup,1538,Duckie067,PT30M,PT20M,PT50M,1999-09-19T06:19:00Z,Make and share this Cabbage Soup recipe from F...,"""https://img.sndimg.com/food/image/upload/w_55...",...,0.1,0.0,959.3,25.1,4.8,17.7,4.3,4.0,,"c(""Mix everything together and bring to a boil..."


- rename the relevant column for the training dataset and drop irrelevant columns

In [13]:
data_recipes.rename(columns={'RecipeInstructions':'instructions'}, inplace=True)

In [14]:
data_recipes_instructions = data_recipes.drop(['Name', 'RecipeId', 'AuthorId', 'AuthorName', 'CookTime', 'PrepTime', 'TotalTime', 'DatePublished', 'Description', 'Images', 'RecipeIngredientParts', 'RecipeCategory', 'Keywords', 'RecipeIngredientQuantities', 'AggregatedRating', 'ReviewCount', 'Calories', 'FatContent', 'SaturatedFatContent', 'CholesterolContent', 'SodiumContent', 'CarbohydrateContent', 'FiberContent', 'SugarContent', 'ProteinContent', 'RecipeServings', 'RecipeYield'], axis=1)
data_recipes_instructions.head()

Unnamed: 0,instructions
0,"c(""Toss 2 cups berries with sugar."", ""Let stan..."
1,"c(""Soak saffron in warm milk for 5 minutes and..."
2,"c(""Into a 1 quart Jar with tight fitting lid, ..."
3,"c(""Drain the tofu, carefully squeezing out exc..."
4,"c(""Mix everything together and bring to a boil..."


Examine the recipes dataset:

In [15]:
data_recipes_instructions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 522517 entries, 0 to 522516
Data columns (total 1 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   instructions  522517 non-null  object
dtypes: object(1)
memory usage: 4.0+ MB


- check for missing values

In [16]:
data_recipes_instructions.isnull().sum().sum()

0

- check the length of the dataset, i.e., the number of recipes it contains

In [17]:
len(data_recipes_instructions)

522517

##### recipes_raw_nosource_{recipe_sources}.json Datasets

As these three datasets are identical in structure, they are combined before being processed further.

- read (3) recipes_raw_nosource_ar.json as a DataFrame

In [18]:
data_recipes_raw_nosource_ar = pd.read_json('recipes_raw_nosource_ar.json')

In [19]:
data_recipes_raw_nosource_ar = data_recipes_raw_nosource_ar.transpose()
data_recipes_raw_nosource_ar.head()

Unnamed: 0,title,ingredients,instructions,picture_link
rmK12Uau.ntP510KeImX506H6Mr6jTu,Slow Cooker Chicken and Dumplings,"[4 skinless, boneless chicken breast halves AD...","Place the chicken, butter, soup, and onion in ...",55lznCYBbs2mT8BTx6BTkLhynGHzM.S
5ZpZE8hSVdPk2ZXo1mZTyoPWJRSCPSm,Awesome Slow Cooker Pot Roast,[2 (10.75 ounce) cans condensed cream of mushr...,"In a slow cooker, mix cream of mushroom soup, ...",QyrvGdGNMBA2lDdciY0FjKu.77MM0Oe
clyYQv.CplpwJtjNaFGhx0VilNYqRxu,Brown Sugar Meatloaf,"[1/2 cup packed brown sugar ADVERTISEMENT, 1/2...",Preheat oven to 350 degrees F (175 degrees C)....,LVW1DI0vtlCrpAhNSEQysE9i/7rJG56
BmqFAmCrDHiKNwX.IQzb0U/v0mLlxFu,Best Chocolate Chip Cookies,"[1 cup butter, softened ADVERTISEMENT, 1 cup w...",Preheat oven to 350 degrees F (175 degrees C)....,0SO5kdWOV94j6EfAVwMMYRM3yNN8eRi
N.jCksRjB4MFwbgPFQU8Kg.yF.XCtOi,Homemade Mac and Cheese Casserole,[8 ounces whole wheat rotini pasta ADVERTISEME...,Preheat oven to 350 degrees F. Line a 2-quart ...,YCnbhplMgiraW4rUXcybgSEZinSgljm


- check for missing values

In [20]:
data_recipes_raw_nosource_ar.isnull().sum().sum()

1120

- check the length of the dataset, i.e., the number of recipes it contains

In [21]:
len(data_recipes_raw_nosource_ar)

39802

- read (4) recipes_raw_nosource_epi.json as a DataFrame

In [22]:
data_recipes_raw_nosource_epi = pd.read_json('recipes_raw_nosource_epi.json')

In [23]:
data_recipes_raw_nosource_epi = data_recipes_raw_nosource_epi.transpose()
data_recipes_raw_nosource_epi.head()

Unnamed: 0,ingredients,picture_link,instructions,title
05zEpbSqcs9E0rcnCJWyZ9OgdH0MLby,"[12 egg whites, 12 egg yolks, 1 1/2 cups sugar...",,"Beat the egg whites until stiff, gradually add...",Christmas Eggnog
mF5SZmoqxF4WtIlhLRvzuKk.z6s7P2S,"[18 fresh chestnuts, 2 1/2 pounds veal stew me...",,Preheat oven to 400°F. Using small sharp knife...,"Veal, Carrot and Chestnut Ragoût"
oQV5D7cVbCFwmrDs3pBUv2y.AG0WV26,"[2 tablespoons unsalted butter, softened, 4 or...",3xjktRst3I5lDZ2Z5kTOtqQyzZFFN9u,Preheat the oven to 350°F. Spread the softened...,Caramelized Bread Pudding with Chocolate and C...
Z9seBJWaB5NkSp4DQHDnCAUBTwov/1u,"[3/4 pound Stilton, crumbled (about 3 cups) an...",,"In a food processor blend the Stilton, the cre...",Sherried Stilton and Green Peppercorn Spread
bB3GxoAplVZeoX3fzWNWyeECtQFxw6G,"[2 cups (about 9 1/2 ounces) whole almonds, to...",,Position rack in center of oven and preheat to...,Almond-Chocolate Macaroons


- check for missing values

In [24]:
data_recipes_raw_nosource_epi.isnull().sum().sum()

13204

- check the length of the dataset, i.e., the number of recipes it contains

In [25]:
len(data_recipes_raw_nosource_epi)

25323

- read (5) recipes_raw_nosource_fn.json as a DataFrame

In [26]:
data_recipes_raw_nosource_fn = pd.read_json('recipes_raw_nosource_fn.json')

In [27]:
data_recipes_raw_nosource_fn = data_recipes_raw_nosource_fn.transpose()
data_recipes_raw_nosource_fn.head()

Unnamed: 0,instructions,ingredients,title,picture_link
p3pKOD6jIHEcjf20CCXohP8uqkG5dGi,Toss ingredients lightly and spoon into a butt...,"[1/2 cup celery, finely chopped, 1 small green...",Grammie Hamblet's Deviled Crab,
S7aeOIrsrgT0jLP32jKGg4j.o9zi2DO,Watch how to make this recipe.\nSprinkle the s...,"[2 pounds skirt steak, cut into 1/2-inch dice,...",Infineon Raceway Baked Beans,Ja5uaD8Q7m7vvtWwk2.48dr1eCre/qi
o9MItV9txfoPsUQ4v8b0vh1.VdjwfsK,"In a large saucepan, let the beans soak in eno...","[1 1/2 cups dried black beans, picked over and...",Southwestern Black Bean Dip,
5l1yTSYFifF/M2dfbD6DX28WWQpLWNK,Watch how to make this recipe.\nPreheat the ov...,"[1 1/4 pounds ground chuck, One 15-ounce can t...",Sour Cream Noodle Bake,nm/WxalB.VjEZSa0iX9RuZ8xI51Y7bS
kRBQSWtqYWqtkb34FGeenBSbC32gIdO,Special equipment: sushi mat\nCook the brown r...,"[1 cup rice, brown, medium-grain, cooked, 1/2-...",Sushi Renovation,


- check for missing values

In [28]:
data_recipes_raw_nosource_fn.isnull().sum().sum()

30024

- check the length of the dataset, i.e., the number of recipes it contains

In [29]:
len(data_recipes_raw_nosource_fn)

60039

- combine all three recipes_raw_nosource_{recipe_sources} Datasets

In [30]:
data_recipes_raw_nosource = [data_recipes_raw_nosource_ar, data_recipes_raw_nosource_epi, data_recipes_raw_nosource_fn]
data_recipes_raw_nosource_instructions = pd.concat(data_recipes_raw_nosource)
data_recipes_raw_nosource_instructions = data_recipes_raw_nosource_instructions.reset_index(drop=True)

- drop irrelevant columns for training dataset

In [31]:
data_recipes_raw_nosource_instructions = data_recipes_raw_nosource_instructions.drop(['title', 'picture_link', 'ingredients'], axis=1)

In [32]:
data_recipes_raw_nosource_instructions.head()

Unnamed: 0,instructions
0,"Place the chicken, butter, soup, and onion in ..."
1,"In a slow cooker, mix cream of mushroom soup, ..."
2,Preheat oven to 350 degrees F (175 degrees C)....
3,Preheat oven to 350 degrees F (175 degrees C)....
4,Preheat oven to 350 degrees F. Line a 2-quart ...


Examine the combined recipes_raw_nosource_{recipe_sources} dataset:

In [33]:
data_recipes_raw_nosource_instructions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125164 entries, 0 to 125163
Data columns (total 1 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   instructions  124473 non-null  object
dtypes: object(1)
memory usage: 978.0+ KB


- check for missing values

In [34]:
data_recipes_raw_nosource_instructions.isnull().sum().sum()

691

- check the length of the combined dataset, i.e., the number of recipes it contains

In [35]:
len(data_recipes_raw_nosource_instructions)

125164

In [36]:
sum_len_data_recipes_raw_nosource = len(data_recipes_raw_nosource_ar) + len(data_recipes_raw_nosource_epi) + len(data_recipes_raw_nosource_fn)
sum_len_data_recipes_raw_nosource

125164

#### 1.1.2 Initial Training Dataset

The initial training dataset consists of the three DataFrames created above:

       (1) data_RecipeNLG_instructions
       (2) data_recipes_raw_nosource_instructions
       (3) data_recipes_instructions
   
- combine the three datasets

In [37]:
datasets = [data_RecipeNLG_instructions, data_recipes_raw_nosource_instructions, data_recipes_instructions]

datasets_combined = pd.concat(datasets)
datasets_combined.head()

Unnamed: 0,instructions
0,"[""In a heavy 2-quart saucepan, mix brown sugar..."
1,"[""Place chipped beef on bottom of baking dish...."
2,"[""In a slow cooker, combine all ingredients. C..."
3,"[""Boil and debone chicken."", ""Put bite size pi..."
4,"[""Combine first four ingredients and press in ..."


- check the length of the combined dataset 

In [38]:
len(datasets_combined)

2878823

### 1.2 Data Preprocessing 

This chapter preprocesses the Training Data for further use. This includes the following steps: 

1. Handle missing values
2. Convert letters to lower case
3. Remove punctuation and numerical digits
4. Remove stopwords
5. Remove symbols
6. Remove duplicates
7. Detect n-grams
8. Replace individual words in the instruction text with n-grams

#### 1.2.1 Handle missing values

- search for NaN/NA/None values in the training data 

In [39]:
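# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions no longer propagate to the DataFrame.
datasets_combined['instructions'] = datasets_combined['instructions'].replace('', np.nan)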

In [40]:
datasets_combined['instructions'].isnull().values.any()

True

In [41]:
datasets_combined['instructions'].isnull().sum().sum()

714

- remove NaN/NA/None values

In [42]:
datasets_combined.dropna(subset = ["instructions"], inplace=True)

In [43]:
datasets_combined[datasets_combined['instructions'].notnull()]

Unnamed: 0,instructions
0,"[""In a heavy 2-quart saucepan, mix brown sugar..."
1,"[""Place chipped beef on bottom of baking dish...."
2,"[""In a slow cooker, combine all ingredients. C..."
3,"[""Boil and debone chicken."", ""Put bite size pi..."
4,"[""Combine first four ingredients and press in ..."
...,...
522512,"c(""Preheat oven to 350&deg;F Grease an 8x8 cak..."
522513,"c(""Position rack in center of oven and preheat..."
522514,"c(""heat half and half and heavy cream to a sim..."
522515,"c(""In a small bowl, combine mayo and wasabi pa..."


- check if NaN/NA/None values were removed successfully

In [44]:
is_NaN = datasets_combined.isnull()
row_has_NaN = is_NaN.any(axis=1)
rows_with_NaN = datasets_combined[row_has_NaN]

In [45]:
rows_with_NaN

Unnamed: 0,instructions


In [46]:
datasets_combined.isnull().values.any()

False

- check the length of the dataset

In [47]:
len(datasets_combined)

2878109

#### 1.2.2 Remove Stopwords

- define individual stopwords

In [48]:
to_remove_instructions = [
    'c',
    'cup',
    'cups',
    'f',
    'gram',
    'grams',
    'inch',
    'inches',
    'kg',
    'kilogram',
    'kilograms',
    'mg',
    'miligram',
    'miligrams',
    'o',
    'ounce',
    'ounces',
    'oz',
    'pkg',
    'pound',
    'pounds',
    'pwdr',
    't',
    'tablespoons',
    'tablespoon',
    'tblsp',
    'tbs',
    'tbsp',
    'tbspful',
    'tbspn',
    'tbspoon',
    'tbsps',
    'teaspoon',
    'teaspoons',
    'tsp',
    'tsps',
    'ub',
    'x'
]

In [49]:
instructions_stopwords = STOPWORDS.union(set(to_remove_instructions))

- define function for stopword removal 

In [50]:
def remove_instructions_stopwords(s):
    s = utils.to_unicode(s)
    return " ".join(w for w in s.split() if w not in instructions_stopwords)
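
As a quick illustration of what the stopword filter does, the following minimal sketch applies the same split-and-filter logic to a sample sentence. The small `sample_stopwords` set and the helper name `remove_stopwords_demo` are hypothetical stand-ins for `instructions_stopwords` and the function above, and the input is assumed to be lowercased already.

```python
# Hypothetical, much smaller stand-in for instructions_stopwords.
sample_stopwords = {"the", "of", "in", "a", "cup", "cups", "tbsp"}

def remove_stopwords_demo(text):
    # Keep only whitespace-separated tokens that are not stopwords.
    return " ".join(w for w in text.split() if w not in sample_stopwords)

print(remove_stopwords_demo("melt a cup of butter in the pan"))
# prints: melt butter pan
```

Because the comparison is an exact token match, the text has to be lowercased and stripped of punctuation before this step, otherwise tokens such as "Cup" or "cup," would pass through.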

#### 1.2.3 Remove Symbols

- define individual symbols 

In [51]:
to_remove_symbols = [
    '½',
    '⅓',
    '¼',
    '¾',
    '⅞',
    '©',
    '®',
    '™',
    '°f'
]

- define function for symbol removal 

In [52]:
# Compile the pattern once instead of on every call.
symbols_regex = re.compile("(%s)" % "|".join(map(re.escape, to_remove_symbols)))

def remove_symbols(text):
    return symbols_regex.sub("", text)
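
The effect of the symbol removal can be checked on a short string. This is a minimal sketch that uses an illustrative subset of the symbol list; `strip_symbols_demo` is a hypothetical helper, not part of the notebook.

```python
import re

symbols = ["½", "°f", "™"]  # illustrative subset of to_remove_symbols
pattern = re.compile("|".join(map(re.escape, symbols)))

def strip_symbols_demo(text):
    # Delete every occurrence of a listed symbol; surrounding text is untouched.
    return pattern.sub("", text)

print(strip_symbols_demo("bake at 350°f with ½ cup sugar"))
# prints: bake at 350 with  cup sugar
print(strip_symbols_demo("350°F"))
# prints: 350°F
```

The second call shows that the match is case-sensitive: an entry such as '°f' only matches once the text has been lowercased.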

#### 1.2.3 Preprocess Training Data 

In [53]:
def preprocess_instructions(text):
    text = text.lower()
    text = pp.strip_punctuation(text)
    text = pp.strip_numeric(text)
    text = remove_instructions_stopwords(text)
    text = remove_symbols(text)
    
    return text

- apply preprocessing function to the training data

In [54]:
%%time
datasets_combined['instructions_pp'] = datasets_combined['instructions'].apply(preprocess_instructions)

CPU times: user 2min 33s, sys: 864 ms, total: 2min 34s
Wall time: 2min 35s


- drop the unprocessed column

In [55]:
dataset_pp = datasets_combined.drop(['instructions'], axis=1)
dataset_pp

Unnamed: 0,instructions_pp
0,heavy quart saucepan mix brown sugar nuts evap...
1,place chipped beef baking dish place chicken b...
2,slow cooker combine ingredients cover cook low...
3,boil debone chicken bite size pieces average s...
4,combine ingredients press ungreased pan melt c...
...,...
522512,preheat oven deg grease cake pan recipe uses m...
522513,position rack center oven preheat place beef ...
522514,heat half half heavy cream simmer add sugar re...
522515,small bowl combine mayo wasabi paste stir addi...


#### 1.2.4 Remove Duplicates

- remove duplicates from the training data, based on the preprocessed instruction text (instructions_pp)

In [56]:
dataset_pp = dataset_pp.drop_duplicates(subset=['instructions_pp'], keep='first')
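
The keep-first behaviour used here can be illustrated without pandas. The sketch below is only an analogy on invented sample texts: it keeps the first occurrence of each string and discards later repeats.

```python
texts = [
    "preheat oven grease pan",
    "boil debone chicken",
    "preheat oven grease pan",  # duplicate of the first entry
]

# dict.fromkeys keeps the first occurrence of each key and preserves order.
unique_texts = list(dict.fromkeys(texts))

print(unique_texts)
# prints: ['preheat oven grease pan', 'boil debone chicken']
print(len(texts) - len(unique_texts))  # 1
```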

In [57]:
sum_duplicates = len(datasets_combined) - len(dataset_pp)
sum_duplicates

524667

#### 1.2.5 Handle missing values

- check again for missing values, which may appear after the preprocessing

In [None]:
dataset_pp['instructions_pp'] = dataset_pp['instructions_pp'].replace('', np.nan)

In [59]:
dataset_pp['instructions_pp'].isnull().values.any()

True

In [60]:
dataset_pp['instructions_pp'].isnull().sum().sum()

1

- remove NaN/NA/None values

In [None]:
dataset_pp.dropna(subset = ["instructions_pp"], inplace=True)

In [62]:
len(dataset_pp)

2353441

#### 1.2.6 Detect n-grams

For the final training data, n-grams in the instructions text need to be generated in order to identify multi-word ingredients.

In [63]:
dataset_pp

Unnamed: 0,instructions_pp
0,heavy quart saucepan mix brown sugar nuts evap...
1,place chipped beef baking dish place chicken b...
2,slow cooker combine ingredients cover cook low...
3,boil debone chicken bite size pieces average s...
4,combine ingredients press ungreased pan melt c...
...,...
522511,beat eggs add oil water pumpkin mix add baking...
522512,preheat oven deg grease cake pan recipe uses m...
522513,position rack center oven preheat place beef ...
522514,heat half half heavy cream simmer add sugar re...



Train two phrases models on the instructions of the training data in order to identify n-gram ingredients:

The parameter threshold controls how readily adjacent words are merged into phrases: a candidate pair is accepted only if its score exceeds the threshold, so a higher threshold yields fewer phrases.
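
To make the role of the threshold more concrete, the sketch below reimplements the default scoring formula that gensim's `Phrases` is documented to use, `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))`. The counts are invented for illustration, and `phrase_score` is a hypothetical helper, not a gensim function.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    # Score of the candidate bigram "a b"; higher means a stronger collocation.
    denom = count_a * count_b
    if denom == 0:
        return float("-inf")
    return (count_ab - min_count) * vocab_size / denom

# Invented counts: the two words occur 1000 and 800 times, 105 times together,
# in a vocabulary of 50,000 distinct tokens.
score = phrase_score(count_a=1000, count_b=800, count_ab=105, vocab_size=50_000)
print(score)       # 6.25
print(score > 4)   # True: would be joined with threshold=4
print(score > 10)  # False: would not be joined with threshold=10
```

Under these assumed counts the pair is merged at the low thresholds used in this notebook but not at gensim's default threshold of 10, which is why low thresholds produce many more n-grams.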

- first phrases model training

In [64]:
instructions_pp_split = dataset_pp['instructions_pp'].apply(lambda x: x.split())

In [65]:
%%time
phrases_bigrams = Phrases(instructions_pp_split, threshold=4)

CPU times: user 2min 46s, sys: 15 s, total: 3min 1s
Wall time: 3min 14s


- second phrases model training 

In [66]:
%%time
phrases_ngrams = Phrases(phrases_bigrams[instructions_pp_split], threshold = 2)

CPU times: user 6min 42s, sys: 19.5 s, total: 7min 1s
Wall time: 7min 18s


#### 1.2.7 Replace individual words in instruction text with n-grams  

Extract the generated n-grams and replace them within the instruction text in the training data:

- export the generated n-grams

In [67]:
exported_phrases = phrases_ngrams.export_phrases()
ngram_ingredients = list(exported_phrases.keys())

In [68]:
len(ngram_ingredients)

97241

- create a dictionary for the n-grams

In [69]:
ngram_dict = {}

for ngram in ngram_ingredients:
    ngram_dict[ngram] = [ngram.replace("_", " ")]

In [70]:
ngram_dict

{'quart_saucepan': ['quart saucepan'],
 'evaporated_milk': ['evaporated milk'],
 'wax_paper': ['wax paper'],
 'let_stand': ['let stand'],
 'wax_paper_let_stand': ['wax paper let stand'],
 'chipped_beef': ['chipped beef'],
 'baking_dish': ['baking dish'],
 'bake_uncovered': ['bake uncovered'],
 'slow_cooker': ['slow cooker'],
 'yields_servings': ['yields servings'],
 'debone_chicken': ['debone chicken'],
 'bite_size': ['bite size'],
 'bite_size_pieces': ['bite size pieces'],
 'average_size': ['average size'],
 'casserole_dish': ['casserole dish'],
 'mushroom_soup': ['mushroom soup'],
 'according_instructions': ['according instructions'],
 'according_instructions_box': ['according instructions box'],
 'shredded_cheese': ['shredded cheese'],
 'chocolate_chips': ['chocolate chips'],
 'melt_chocolate_chips': ['melt chocolate chips'],
 'gets_hard': ['gets hard'],
 'prick_times': ['prick times'],
 'times_fork': ['times fork'],
 'paper_towel': ['paper towel'],
 'ready_eat': ['ready eat'],
 'le

- define function for replacing the individual words in the instruction text with the n-grams

In [71]:
ngram_replacer = KeywordProcessor()
ngram_replacer.add_keywords_from_dict(ngram_dict)

def replace_ngrams(text):
    text = ngram_replacer.replace_keywords(text)
    return text
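
The replacement step can be approximated with the standard library to show what happens to a single instruction text. This is a simplified sketch, not flashtext's algorithm: it uses a regular expression with the longest phrases listed first, an illustrative subset of the n-gram dictionary, and the hypothetical helper `replace_ngrams_demo`.

```python
import re

# Illustrative subset of ngram_dict: joined n-gram -> plain-text phrase.
demo_ngrams = {
    "bite_size": "bite size",
    "bite_size_pieces": "bite size pieces",
    "slow_cooker": "slow cooker",
}

# Longest phrases first, so "bite size pieces" wins over "bite size".
phrases = sorted(demo_ngrams.values(), key=len, reverse=True)
pattern = re.compile(r"\b(" + "|".join(map(re.escape, phrases)) + r")\b")

def replace_ngrams_demo(text):
    # Swap each matched phrase for its underscore-joined form.
    return pattern.sub(lambda m: m.group(1).replace(" ", "_"), text)

print(replace_ngrams_demo("cut chicken bite size pieces add slow cooker"))
# prints: cut chicken bite_size_pieces add slow_cooker
```

With roughly 97,000 n-grams a single regular expression of this kind would be slow, which is a plausible reason for using a trie-based keyword processor in the notebook.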

- apply replace function to the training data 

In [72]:
%%time
dataset_pp['instructions_pp_ngrams'] = dataset_pp['instructions_pp'].apply(replace_ngrams)

CPU times: user 5min 31s, sys: 7.18 s, total: 5min 38s
Wall time: 5min 47s


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


- drop the column without n-gram replacements

In [73]:
training_data = dataset_pp.drop(['instructions_pp'], axis=1)

In [74]:
training_data

Unnamed: 0,instructions_pp_ngrams
0,heavy quart_saucepan mix brown sugar nuts evap...
1,place chipped_beef baking_dish place chicken b...
2,slow_cooker combine ingredients cover cook low...
3,boil debone_chicken bite_size_pieces average_s...
4,combine ingredients press ungreased pan melt_c...
...,...
522511,beat_eggs add oil water pumpkin mix add baking...
522512,preheat_oven_deg grease cake pan recipe_uses m...
522513,position_rack_center oven preheat place beef ...
522514,heat half half heavy_cream simmer add sugar re...


- save the final training data as pickle file for training the word2vec model 

In [77]:
training_data.to_pickle('SRM_training_data.pkl')

## 2.0 Ingredient List 

### 2.1 Build Ingredient List

The ingredient list for filtering the final substitute results is retrieved from the NER column of the RecipeNLG Dataset (data_RecipeNLG):

In [78]:
ingredients_data = [data_RecipeNLG["NER"]]

pd.set_option("display.max_colwidth", 100)
ingredients_data = pd.concat(ingredients_data, axis=1) 
ingredients_data

Unnamed: 0,NER
0,"[""brown sugar"", ""milk"", ""vanilla"", ""nuts"", ""butter"", ""bite size shredded rice biscuits""]"
1,"[""beef"", ""chicken breasts"", ""cream of mushroom soup"", ""sour cream""]"
2,"[""frozen corn"", ""cream cheese"", ""butter"", ""garlic powder"", ""salt"", ""pepper""]"
3,"[""chicken"", ""chicken gravy"", ""cream of mushroom soup"", ""shredded cheese""]"
4,"[""peanut butter"", ""graham cracker crumbs"", ""butter"", ""powdered sugar"", ""chocolate chips""]"
...,...
2231137,"[""chocolate hazelnut spread"", ""tortillas"", ""butter"", ""marshmallows"", ""hazelnuts""]"
2231138,"[""eggs"", ""paprika"", ""salt"", ""choice"", ""miracle whip"", ""relish""]"
2231139,"[""radish"", ""Sesame oil"", ""White sesame seeds"", ""Salt"", ""Soy sauce""]"
2231140,"[""apple cider"", ""sugar"", ""kosher salt"", ""bay leaves"", ""arbol"", ""berries"", ""caraway seeds"", ""must..."


- check for missing values 

In [79]:
ingredients_data['NER'].isnull().sum()

0

Build a list of every ingredient used in all recipes:

- convert the *string* values in the NER column to *lists* and store them in a new ingredients column

In [80]:
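import ast

# ast.literal_eval only parses Python literals, so it is safer than eval on file contents.
ingredients_data["ingredients"] = ingredients_data["NER"].apply(ast.literal_eval)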
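
The NER column stores each ingredient list as its string representation. The following minimal sketch parses one such string with the standard library's `ast.literal_eval`, which accepts only Python literals and is therefore a more restrictive alternative to `eval`. The sample string is taken from the output above.

```python
import ast

ner_string = '["beef", "chicken breasts", "cream of mushroom soup", "sour cream"]'

# literal_eval parses Python literals only; arbitrary expressions raise an error.
ingredients = ast.literal_eval(ner_string)

print(type(ingredients).__name__)  # list
print(ingredients[1])              # chicken breasts
```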

In [81]:
ingredients_data = ingredients_data.drop(['NER'], axis=1)
ingredients_data

Unnamed: 0,ingredients
0,"[brown sugar, milk, vanilla, nuts, butter, bite size shredded rice biscuits]"
1,"[beef, chicken breasts, cream of mushroom soup, sour cream]"
2,"[frozen corn, cream cheese, butter, garlic powder, salt, pepper]"
3,"[chicken, chicken gravy, cream of mushroom soup, shredded cheese]"
4,"[peanut butter, graham cracker crumbs, butter, powdered sugar, chocolate chips]"
...,...
2231137,"[chocolate hazelnut spread, tortillas, butter, marshmallows, hazelnuts]"
2231138,"[eggs, paprika, salt, choice, miracle whip, relish]"
2231139,"[radish, Sesame oil, White sesame seeds, Salt, Soy sauce]"
2231140,"[apple cider, sugar, kosher salt, bay leaves, arbol, berries, caraway seeds, mustard seeds, cori..."


Examine the ingredient list:

In [82]:
ingredients_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2231142 entries, 0 to 2231141
Data columns (total 1 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   ingredients  object
dtypes: object(1)
memory usage: 17.0+ MB


- list every ingredient used in each row, i.e., recipe, individually

In [83]:
def to_1D(series):
    return pd.Series([x for _list in series for x in _list])
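
What the flattening step does can be seen on a few invented rows. The sketch below uses `itertools.chain` from the standard library instead of a pandas Series, purely for illustration.

```python
from itertools import chain

recipes = [
    ["brown sugar", "milk", "vanilla"],
    ["beef", "sour cream"],
    [],  # a recipe without recognised ingredients contributes nothing
]

# Concatenate the per-recipe lists into a single flat list, preserving order.
flat = list(chain.from_iterable(recipes))

print(flat)
# prints: ['brown sugar', 'milk', 'vanilla', 'beef', 'sour cream']
print(len(flat))  # 5
```

Each ingredient mention becomes its own entry, so an ingredient used in many recipes appears many times. This is what makes the later frequency count possible.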

In [84]:
ingredient_list_individual = to_1D(ingredients_data['ingredients'])
ingredient_list_individual

0                                     brown sugar
1                                            milk
2                                         vanilla
3                                            nuts
4                                          butter
                            ...                  
18921245                             tomato paste
18921246    freshly grated Pecorino Romano cheese
18921247                                     Salt
18921248                             tomato sauce
18921249                               red pepper
Length: 18921250, dtype: object

In [85]:
total_ingredient_list = pd.DataFrame(ingredient_list_individual, columns = ['ingredient'])
total_ingredient_list

Unnamed: 0,ingredient
0,brown sugar
1,milk
2,vanilla
3,nuts
4,butter
...,...
18921245,tomato paste
18921246,freshly grated Pecorino Romano cheese
18921247,Salt
18921248,tomato sauce


### 2.2 Data Preprocessing

The following chapter preprocesses the ingredient list. The preprocessing includes:
1. Convert letters to lower case
2. Remove punctuation and numerical digits
3. Remove stopwords
4. Remove symbols
5. Handle missing values
6. Extend ingredient list with frequency count
7. Drop ingredients with frequency lower than five
8. Count frequency of ingredients in training data
9. Revise ingredient list manually

#### 2.2.1 Remove Stopwords

- define individual stopwords

In [86]:
to_remove_ingredients = [
    'additional',
    'barilla',
    'best',
    'bite',
    'bites',
    'boiled',
    'bottled',
    'bought',
    'bowl',
    'bowls',
    'breakfast', 
    'c',
    'chopped',
    'chunk',
    'chunks',
    'cold',
    'cooked',
    'cooking',
    'cool',
    'creamy',
    'crusty',
    'cup',
    'cups',
    'dish',
    'drained',
    'f',  
    'favorite',
    'fine',
    'firm',
    'fresh',
    'freshly',
    'fried',
    'frozen',
    'fry',
    'fryer',
    'frying',
    'fully',
    'gluten',
    'gram',
    'grams',
    'half',
    'halves',
    'handful',
    'hard',
    'hardboiled',
    'health',
    'healthy',
    'home',
    'homemade',
    'hot',
    'inch',
    'inches',
    'kg',
    'kilogram',
    'kilograms',
    'large',
    'leaf',
    'leaves',
    'leftover',
    'leftovers',
    'lightly',
    'like',
    'low',
    'lunch',
    'measures',
    'measuring',
    'melting',
    'mg',
    'miligram',
    'miligrams',
    'mixing',
    'new',
    'non',
    'o',
    'optional',
    'original',
    'ounce',
    'ounces',
    'oz',
    'pack',
    'packed',
    'pan',
    'piece',
    'pieces',
    'pkg',
    'pound',
    'pounds',
    'premade',
    'prepared',
    'pwdr',
    'quality',
    'recipe',
    'ripe',
    'shelled',
    'short',
    'size',
    'sized',
    'slice',
    'sliced',
    'slices',
    'small',
    'soft',
    'spray',
    'steamed',
    'stir',
    'store',
    'strong',
    'style',
    'substitute',
    'substitutes',
    't',
    'tablespoon',
    'tablespoons',
    'tasting',
    'tblsp',
    'tbs',
    'tbsp',
    'tbspful',
    'tbspn',
    'tbspoon',
    'tbsps',
    'teaspoon',
    'teaspoons',
    'tightly',
    'torn',
    'tsp',
    'tsps',
    'ub',
    'unbleached',
    'uncooked',
    'vital',
    'warm',
    'washed',
    'weight',
    'whisk',
    'x'
]

In [87]:
ingredients_stopwords = STOPWORDS.union(set(to_remove_ingredients))

- define function for stopword removal 

In [88]:
def remove_ingredients_stopwords(s):
    s = utils.to_unicode(s)
    return " ".join(w for w in s.split() if w not in ingredients_stopwords)

#### 2.2.2 Preprocessing Function

In [89]:
def preprocess_ingredients(text):
    text = text.lower()
    text = pp.strip_punctuation(text)
    text = pp.strip_numeric(text)
    text = remove_ingredients_stopwords(text)
    text = remove_symbols(text)
  
    return text

- apply preprocessing function to the ingredient list

In [90]:
%%time
total_ingredient_list['ingredient_pp'] = total_ingredient_list.ingredient.apply(preprocess_ingredients)
total_ingredient_list

CPU times: user 2min 12s, sys: 1.84 s, total: 2min 14s
Wall time: 2min 15s


Unnamed: 0,ingredient,ingredient_pp
0,brown sugar,brown sugar
1,milk,milk
2,vanilla,vanilla
3,nuts,nuts
4,butter,butter
...,...,...
18921245,tomato paste,tomato paste
18921246,freshly grated Pecorino Romano cheese,grated pecorino romano cheese
18921247,Salt,salt
18921248,tomato sauce,tomato sauce


- drop the unprocessed column

In [91]:
ingredient_list_pp = total_ingredient_list.drop(['ingredient'], axis=1)
ingredient_list_pp

Unnamed: 0,ingredient_pp
0,brown sugar
1,milk
2,vanilla
3,nuts
4,butter
...,...
18921245,tomato paste
18921246,grated pecorino romano cheese
18921247,salt
18921248,tomato sauce


#### 2.2.3 Missing Values 

- replace empty values with NaN

In [92]:
ingredient_list_pp['ingredient_pp'] = ingredient_list_pp['ingredient_pp'].replace('', np.nan)

In [93]:
ingredient_list_pp['ingredient_pp'].isnull().sum()

116565

- drop empty rows 

In [94]:
ingredient_list_pp.dropna(subset = ["ingredient_pp"], inplace=True)

In [95]:
ingredient_list_pp = ingredient_list_pp[ingredient_list_pp['ingredient_pp'].notnull()]

In [96]:
ingredient_list_pp

Unnamed: 0,ingredient_pp
0,brown sugar
1,milk
2,vanilla
3,nuts
4,butter
...,...
18921245,tomato paste
18921246,grated pecorino romano cheese
18921247,salt
18921248,tomato sauce
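
The missing-value handling in this section can be sketched on a small example. This is a minimal illustration with made-up rows (`demo` is a hypothetical DataFrame, not the real ingredient list); it assigns the result of `replace` back to the column, since in-place modification of a column selection may not propagate in newer pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the preprocessed ingredient list
demo = pd.DataFrame({"ingredient_pp": ["salt", "", "milk", ""]})

# Turn empty strings into NaN so they count as missing values
demo["ingredient_pp"] = demo["ingredient_pp"].replace("", np.nan)
n_missing = int(demo["ingredient_pp"].isnull().sum())

# Drop the rows whose preprocessed ingredient is missing
cleaned = demo.dropna(subset=["ingredient_pp"])

print(n_missing)
print(cleaned["ingredient_pp"].tolist())
```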


#### 2.2.4 Extend ingredient list with frequency count

- count the frequency of each ingredient in the ingredient list

In [97]:
ingredient_list_pp = (ingredient_list_pp['ingredient_pp']).value_counts()
ingredient_list_pp

salt                   1015689
sugar                   666053
butter                  545664
flour                   489085
eggs                    422622
                        ...   
malagkit                     1
extra ginger                 1
brown organic eggs           1
danya                        1
white string cheese          1
Name: ingredient_pp, Length: 154844, dtype: int64

In [98]:
ingredient_frequency_list = pd.DataFrame({'ingredient':ingredient_list_pp.index, 'frequency':ingredient_list_pp.values})
ingredient_frequency_list.index = np.arange(1, (len(ingredient_list_pp) + 1))
ingredient_frequency_list

Unnamed: 0,ingredient,frequency
1,salt,1015689
2,sugar,666053
3,butter,545664
4,flour,489085
5,eggs,422622
...,...,...
154840,malagkit,1
154841,extra ginger,1
154842,brown organic eggs,1
154843,danya,1
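
As a possible alternative to building the frequency table by hand, the counting and the conversion to a DataFrame can be chained. The sketch below uses made-up data (`demo` is hypothetical), and the threshold of 2 stands in for the threshold of five applied to the real list.

```python
import pandas as pd

# Hypothetical toy data standing in for the preprocessed ingredient column
demo = pd.Series(
    ["salt", "salt", "salt", "sugar", "sugar", "butter"], name="ingredient_pp"
)

# Count occurrences, then turn the Series into a two-column DataFrame
freq = (
    demo.value_counts()
        .rename_axis("ingredient")
        .reset_index(name="frequency")
)
freq.index = range(1, len(freq) + 1)  # 1-based index, as in the notebook

# Keep only ingredients that reach the (demo) frequency threshold
frequent = freq[freq["frequency"] >= 2]
print(frequent)
```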


- check for missing values

In [99]:
ingredient_frequency_list['ingredient'] = ingredient_frequency_list['ingredient'].replace('', np.nan)

In [100]:
ingredient_frequency_list['ingredient'].isnull().sum()

0

- drop ingredients with frequency lower than five

In [101]:
ingredient_frequency_list = ingredient_frequency_list[ingredient_frequency_list["frequency"] >= 5]

In [102]:
ingredient_frequency_list

Unnamed: 0,ingredient,frequency
1,salt,1015689
2,sugar,666053
3,butter,545664
4,flour,489085
5,eggs,422622
...,...,...
27197,cheesy scalloped potatoes,5
27198,bug,5
27199,venison chops,5
27200,international italian,5


#### 2.2.5 Count frequency of ingredients in training data 

- convert the ingredient column into a list and replace spaces with underscores

In [103]:
ingredient_list = ingredient_frequency_list["ingredient"].to_list()

In [104]:
ingredient_list

['salt',
 'sugar',
 'butter',
 'flour',
 'eggs',
 'onion',
 'garlic',
 'water',
 'milk',
 'vanilla',
 'olive oil',
 'pepper',
 'brown sugar',
 'egg',
 'tomatoes',
 'baking powder',
 'lemon juice',
 'cinnamon',
 'parsley',
 'cream cheese',
 'sour cream',
 'baking soda',
 'celery',
 'chicken',
 'margarine',
 'cheddar cheese',
 'oil',
 'vegetable oil',
 'onions',
 'mayonnaise',
 'parmesan cheese',
 'pecans',
 'potatoes',
 'kosher salt',
 'ground black pepper',
 'basil',
 'carrots',
 'nuts',
 'soy sauce',
 'black pepper',
 'pineapple',
 'thyme',
 'oregano',
 'honey',
 'bacon',
 'mushrooms',
 'unsalted butter',
 'chicken broth',
 'extra virgin olive oil',
 'mustard',
 'ground beef',
 'cilantro',
 'worcestershire sauce',
 'cornstarch',
 'paprika',
 'red pepper',
 'ginger',
 'powdered sugar',
 'green pepper',
 'vinegar',
 'lemon',
 'green onions',
 'nutmeg',
 'ground cinnamon',
 'rice',
 'shortening',
 'walnuts',
 'chicken breasts',
 'heavy cream',
 'buttermilk',
 'chili powder',
 'cheese',
 

In [105]:
ingredient_list = [x.replace(' ', '_') for x in ingredient_list]

In [106]:
ingredient_list

['salt',
 'sugar',
 'butter',
 'flour',
 'eggs',
 'onion',
 'garlic',
 'water',
 'milk',
 'vanilla',
 'olive_oil',
 'pepper',
 'brown_sugar',
 'egg',
 'tomatoes',
 'baking_powder',
 'lemon_juice',
 'cinnamon',
 'parsley',
 'cream_cheese',
 'sour_cream',
 'baking_soda',
 'celery',
 'chicken',
 'margarine',
 'cheddar_cheese',
 'oil',
 'vegetable_oil',
 'onions',
 'mayonnaise',
 'parmesan_cheese',
 'pecans',
 'potatoes',
 'kosher_salt',
 'ground_black_pepper',
 'basil',
 'carrots',
 'nuts',
 'soy_sauce',
 'black_pepper',
 'pineapple',
 'thyme',
 'oregano',
 'honey',
 'bacon',
 'mushrooms',
 'unsalted_butter',
 'chicken_broth',
 'extra_virgin_olive_oil',
 'mustard',
 'ground_beef',
 'cilantro',
 'worcestershire_sauce',
 'cornstarch',
 'paprika',
 'red_pepper',
 'ginger',
 'powdered_sugar',
 'green_pepper',
 'vinegar',
 'lemon',
 'green_onions',
 'nutmeg',
 'ground_cinnamon',
 'rice',
 'shortening',
 'walnuts',
 'chicken_breasts',
 'heavy_cream',
 'buttermilk',
 'chili_powder',
 'cheese',
 

- identify presence of ingredients in n-gram modified instruction text (training data)

In [107]:
ingredient_count_ngrams = {}

def count_ingredients_ngrams(text):
    # Tally every token of one instruction text that appears in ingredient_list.
    # Note: this updates the global dict and returns that same dict for every row.
    word_list = text.split(" ")

    for word in word_list:
        if word in ingredient_list:
            ingredient_count_ngrams[word] = ingredient_count_ngrams.get(word, 0) + 1

    return ingredient_count_ngrams

In [108]:
%%time
training_data['instructions_pp_ngrams'].apply(count_ingredients_ngrams)

CPU times: user 4h 36min 18s, sys: 20.7 s, total: 4h 36min 39s
Wall time: 4h 37min 1s


0         {'heavy': 48867, 'mix': 1310617, 'brown': 381996, 'sugar': 862752, 'nuts': 112238, 'evaporated_m...
1         {'heavy': 48867, 'mix': 1310617, 'brown': 381996, 'sugar': 862752, 'nuts': 112238, 'evaporated_m...
2         {'heavy': 48867, 'mix': 1310617, 'brown': 381996, 'sugar': 862752, 'nuts': 112238, 'evaporated_m...
3         {'heavy': 48867, 'mix': 1310617, 'brown': 381996, 'sugar': 862752, 'nuts': 112238, 'evaporated_m...
4         {'heavy': 48867, 'mix': 1310617, 'brown': 381996, 'sugar': 862752, 'nuts': 112238, 'evaporated_m...
                                                         ...                                                 
522511    {'heavy': 48867, 'mix': 1310617, 'brown': 381996, 'sugar': 862752, 'nuts': 112238, 'evaporated_m...
522512    {'heavy': 48867, 'mix': 1310617, 'brown': 381996, 'sugar': 862752, 'nuts': 112238, 'evaporated_m...
522513    {'heavy': 48867, 'mix': 1310617, 'brown': 381996, 'sugar': 862752, 'nuts': 112238, 'evaporated_m...
522514    
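
The long runtime above is likely dominated by the `word in ingredient_list` check, which scans a Python list for every token. A possible faster variant, sketched here under the assumption that only the final counts are needed, converts the list to a set and tallies with `collections.Counter`. The function name `count_ingredient_tokens` and the sample data are hypothetical.

```python
from collections import Counter


def count_ingredient_tokens(texts, ingredients):
    """Count how often each ingredient token occurs across all texts."""
    ingredient_set = set(ingredients)  # set membership is constant time on average
    counts = Counter()
    for text in texts:
        counts.update(w for w in text.split() if w in ingredient_set)
    return counts


# Hypothetical n-gram modified instruction texts and ingredient list
texts = [
    "mix brown_sugar and butter",
    "cream butter with brown_sugar",
    "add olive_oil",
]
ingredients = ["brown_sugar", "butter", "olive_oil", "salt"]

counts = count_ingredient_tokens(texts, ingredients)
print(counts.most_common())
```

With the notebook's data, a call such as `count_ingredient_tokens(training_data['instructions_pp_ngrams'], ingredient_list)` would return a single counter instead of one dictionary reference per row.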

In [109]:
ingredient_count_ngrams

{'heavy': 48867,
 'mix': 1310617,
 'brown': 381996,
 'sugar': 862752,
 'nuts': 112238,
 'evaporated_milk': 11251,
 'butter': 680399,
 'margarine': 81923,
 'medium': 620670,
 'heat': 1503923,
 'mixture': 1181963,
 'boil': 282856,
 'minutes': 2723346,
 'vanilla': 240369,
 'cereal': 16746,
 'drop': 57024,
 'clusters': 894,
 'place': 1038182,
 'chipped_beef': 543,
 'chicken': 523162,
 'beef': 115613,
 'soup': 177889,
 'cream': 268812,
 'pour': 763702,
 'hours': 248578,
 'slow_cooker': 20815,
 'combine': 787439,
 'ingredients': 636308,
 'cover': 643493,
 'cook': 1172832,
 'cheese': 438173,
 'serving': 219907,
 'square': 30872,
 'gravy': 28893,
 'mushroom_soup': 11500,
 'level': 10034,
 'stuffing': 33760,
 'shredded_cheese': 13914,
 'bake': 974039,
 'approximately': 54982,
 'golden': 100026,
 'spread': 349233,
 'refrigerate': 128313,
 'chocolate': 147135,
 'refrigerator': 69190,
 'wash': 45450,
 'potatoes': 245180,
 'fork': 93289,
 'microwave': 68631,
 'wet': 20521,
 'ready_eat': 3650,
 'han

In [110]:
ingredient_count = pd.DataFrame(ingredient_count_ngrams.items(), columns=['ingredient', 'frequency'])

In [111]:
ingredient_count = ingredient_count.replace(regex=['_'], value=' ')
ingredient_count = ingredient_count.sort_values(by=['frequency'], ascending=False)
ingredient_count

Unnamed: 0,ingredient,frequency
12,minutes,2723346
9,heat,1503923
1,mix,1310617
10,mixture,1181963
29,cook,1172832
...,...,...
9366,pickerel,1
9485,chedasharp,1
9453,maids,1
8750,haddie,1


- export as Excel file for manual revision

In [112]:
ingredient_count.to_excel('SRM_ingredient_list_unedited.xlsx')

- load the manually revised ingredient list and save it as a pickle file for modeling the SRM

In [113]:
df_ingredient_list = pd.read_excel('SRM_ingredient_list_final.xlsx')
df_ingredient_list.index = np.arange(1, (len(df_ingredient_list) + 1))

In [114]:
df_ingredient_list

Unnamed: 0,ingredient,frequency
1,sugar,862777
2,water,819858
3,butter,680430
4,flour,652974
5,salt,618946
...,...,...
4567,haddie,1
4568,panang curry,1
4569,striper,1
4570,tonka,1


In [115]:
len(df_ingredient_list[df_ingredient_list.frequency < 5])

31

In [116]:
df_ingredient_list.to_pickle('SRM_ingredient_list.pkl')
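
Before the pickle file is used for modeling, a quick round-trip check can confirm that it loads back unchanged. The following is a minimal sketch with a two-row stand-in DataFrame and a hypothetical temporary file name; it does not touch the real `SRM_ingredient_list.pkl`.

```python
import os
import tempfile

import pandas as pd

# Two-row stand-in for the final ingredient list
demo = pd.DataFrame({"ingredient": ["sugar", "water"], "frequency": [862777, 819858]})
demo.index = range(1, len(demo) + 1)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_ingredient_list.pkl")  # hypothetical file name
    demo.to_pickle(path)
    restored = pd.read_pickle(path)

# True when values, dtypes and index survived the round trip
round_trip_ok = restored.equals(demo)
print(round_trip_ok)
```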