# Substitute Recommender Model

This notebook documents the training and final application of the Substitute Recommender Model (SRM).

Chapter 1.0 documents the training of the word2vec model based on recipe instructions.

Chapter 2.0 documents the final application of the SRM.

## 1.0 Model Training

- import all required libraries

In [1]:
import gensim
import pandas as pd
import numpy as np

from difflib import get_close_matches

### 1.1 Training Data

- load the training data, which is prepared in the CAI_data notebook

In [2]:
training_data = pd.read_pickle("SRM_training_data.pkl")

In [3]:
training_data

index,instructions_pp_ngrams
0,heavy quart_saucepan mix brown sugar nuts evap...
1,place chipped_beef baking_dish place chicken b...
2,slow_cooker combine ingredients cover cook low...
3,boil debone_chicken bite_size_pieces average_s...
4,combine ingredients press ungreased pan melt_c...
...,...
522511,beat_eggs add oil water pumpkin mix add baking...
522512,preheat_oven_deg grease cake pan recipe_uses m...
522513,position_rack_center oven preheat place beef ...
522514,heat half half heavy_cream simmer add sugar re...


### 1.2 Word2Vec Model Training

The model is trained with gensim Word2Vec.

The main parameters are:
- Choice of Model: Continuous Bag of Words (default)
- Window Size = 5 (default)
- Negative Sampling Size = 5 (default)
- Vector Size = 100 (default)
- Minimum Count of Words = 5 (default)
- Epochs = 20
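
Since most of these values are gensim defaults, the training call further below only passes `epochs`. As a sketch, the same configuration can be spelled out explicitly. The keyword names assume gensim 4.x (where `vector_size` and `epochs` replaced the older `size` and `iter`), and `w2v_params` is a name introduced here for illustration only.

```python
# Explicit version of the parameter list above (gensim 4.x keyword names).
# `w2v_params` is a hypothetical helper name, not part of the original notebook.
w2v_params = {
    "sg": 0,             # 0 = Continuous Bag of Words, 1 = skip-gram
    "window": 5,         # context window size
    "negative": 5,       # number of negative samples
    "vector_size": 100,  # dimensionality of the word vectors
    "min_count": 5,      # ignore tokens that occur fewer than 5 times
    "epochs": 20,        # passes over the corpus (the gensim default is 5)
}

# Usage, once `sentences` has been defined:
# SRM_w2v_model = gensim.models.Word2Vec(sentences=sentences, **w2v_params)
```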


- define a class for the training corpus

In [4]:
class TrainingCorpus:
    def __init__(self, data):
        self.data = data

    def __iter__(self):
        for line in self.data:
            yield line.split()
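
The corpus is wrapped in a class with `__iter__` instead of being passed as a generator because Word2Vec walks over the sentences more than once (one pass to build the vocabulary, then one per epoch), and a generator would be exhausted after the first pass. The sketch below illustrates the difference on two made-up lines; the sample strings are placeholders, not rows from the training data.

```python
class TrainingCorpus:
    """Restartable iterable that yields one tokenized line at a time."""

    def __init__(self, data):
        self.data = data

    def __iter__(self):
        for line in self.data:
            yield line.split()


# Made-up sample lines for illustration only.
lines = ["beat_eggs add oil", "preheat oven grease pan"]

corpus = TrainingCorpus(lines)
first_pass = list(corpus)
second_pass = list(corpus)    # a new iterator is created, so this is full again

generator = (line.split() for line in lines)
gen_first = list(generator)
gen_second = list(generator)  # the generator is already exhausted here

print(first_pass == second_pass)  # True
print(gen_second)                 # []
```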

- define the sentences the model is trained on

In [5]:
sentences = TrainingCorpus(training_data.instructions_pp_ngrams)

- train the word2vec model

In [6]:
%%time
SRM_w2v_model = gensim.models.Word2Vec(sentences=sentences, epochs=20)

CPU times: user 56min 24s, sys: 31.5 s, total: 56min 55s
Wall time: 19min 10s


- save the word2vec model 

In [7]:
SRM_w2v_model.save("SRM_w2v_model.model")

## 2.0 Application of SRM

### 2.1 Preparation for Application 

- load the word2vec model (the original version from the project, stored as SRM_final_model.model rather than the file saved in 1.2)

In [8]:
SRM_w2v_model = gensim.models.Word2Vec.load("SRM_final_model.model")

- load the prepared and revised ingredient list, which is used to match the user input and to filter the recommendations

In [9]:
df_ingredient_list = pd.read_pickle('SRM_ingredient_list.pkl')

In [10]:
df_ingredient_list

index,ingredient,frequency
1,sugar,862777
2,water,819858
3,butter,680430
4,flour,652974
5,salt,618946
...,...,...
4567,haddie,1
4568,panang curry,1
4569,striper,1
4570,tonka,1


### 2.1 Function for Substitute Recommendations

- convert the ingredient column of df_ingredient_list into a list, replacing spaces with underscores to match the tokens of the model

In [11]:
def create_ingredient_list(df_ingredient_list):
    return [x.replace(' ', '_') for x in df_ingredient_list['ingredient'].to_list()]

In [12]:
ingredient_list = create_ingredient_list(df_ingredient_list)
ingredient_list

['sugar',
 'water',
 'butter',
 'flour',
 'salt',
 'oil',
 'chicken',
 'sauce',
 'onion',
 'garlic',
 'dough',
 'milk',
 'spread',
 'pepper',
 'eggs',
 'cake',
 'potatoes',
 'vanilla',
 'egg',
 'onions',
 'rice',
 'tomatoes',
 'batter',
 'soup',
 'bread',
 'chocolate',
 'pasta',
 'cream_cheese',
 'lemon_juice',
 'mushrooms',
 'season_salt',
 'vinegar',
 'bacon',
 'parsley',
 'beans',
 'broth',
 'baking_powder',
 'beef',
 'cinnamon',
 'shrimp',
 'dressing',
 'pork',
 'cookies',
 'juice',
 'margarine',
 'carrots',
 'stock',
 'baking_soda',
 'apples',
 'syrup',
 'sausage',
 'peppers',
 'tomato',
 'wine',
 'corn',
 'olive_oil',
 'spinach',
 'pie',
 'coconut',
 'honey',
 'marinade',
 'mustard',
 'cornstarch',
 'pecans',
 'pudding',
 'mayonnaise',
 'sour_cream',
 'crisp',
 'ginger',
 'celery',
 'topping',
 'pastry',
 'fat',
 'spices',
 'basil',
 'ice_cream',
 'pineapple',
 'broccoli',
 'turkey',
 'ice',
 'zucchini',
 'potato',
 'cilantro',
 'ham',
 'shortening',
 'bread_crumbs',
 'wrap',
 'y

- function for identifying close matches within the ingredient list in case the user misspelled the initial ingredient

In [13]:
def find_ingredient(ingredient):
    if ingredient in ingredient_list:
        return ingredient
    else:
        matched_ingredient = (get_close_matches(ingredient, ingredient_list, n=1, cutoff=0.85) or [None])[0]
        if matched_ingredient:
            return matched_ingredient
        else:
            return ingredient
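
To make the effect of `cutoff=0.85` concrete, here is a small sketch of the same lookup on a hypothetical three-item list (the function name `resolve` and the list are illustrative, not part of the notebook). A near miss is mapped to the listed ingredient, while an input with no close match is passed through unchanged.

```python
from difflib import get_close_matches

# Hypothetical mini ingredient list for illustration.
sample_ingredients = ["cornstarch", "corn", "olive_oil"]


def resolve(ingredient, candidates, cutoff=0.85):
    """Return the exact or closest listed ingredient, else the input itself."""
    if ingredient in candidates:
        return ingredient
    matches = get_close_matches(ingredient, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else ingredient


print(resolve("corn", sample_ingredients))        # exact hit: corn
print(resolve("cornstrach", sample_ingredients))  # typo resolved: cornstarch
print(resolve("xyz", sample_ingredients))         # no close match: xyz
```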

- function for removing substitutes from the recommendations that duplicate the initial ingredient (e.g., spelling mistakes, plural/singular forms of the initial ingredient)

In [14]:
def remove_same_ingredients(ingredient, substitutes_list, remove_count=100, cutoff=0.8):
    ingredients_to_remove = get_close_matches(ingredient, substitutes_list, remove_count, cutoff)
    cleaned_substitutes_list = [x for x in substitutes_list if x not in ingredients_to_remove]
    return cleaned_substitutes_list
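
The `cutoff` is compared against the similarity ratio that `difflib` computes between two strings (twice the number of matching characters divided by the total length of both). The sketch below prints that ratio for a few hand-picked pairs. It suggests that simple plural forms clear the 0.8 threshold, whereas an irregular plural such as berry/berries does not and would therefore stay in the list.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity ratio in [0, 1] as used by difflib.get_close_matches."""
    return SequenceMatcher(None, a, b).ratio()


# Hand-picked example pairs, not taken from the model output.
for a, b in [("eggs", "egg"), ("onions", "onion"), ("berries", "berry")]:
    print(a, b, round(ratio(a, b), 3))
```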

- function for removing duplicate substitutes from the recommendations (e.g., spelling mistakes, plural/singular forms of the substitutes)

In [15]:
def remove_same_substitutes(substitutes_list):
    cleaned_substitutes_list = substitutes_list
    
    for substitute in substitutes_list:
        if substitute in cleaned_substitutes_list:
            cleaned_substitutes_list = remove_same_ingredients(substitute, cleaned_substitutes_list, remove_count=len(cleaned_substitutes_list), cutoff=0.8)
            cleaned_substitutes_list.append(substitute)
        
    return cleaned_substitutes_list
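
As a quick check of the behaviour, the sketch below runs the two helper functions on a made-up list. Because each kept substitute is removed together with its look-alikes and then appended again, the first spelling encountered survives and the original ranking order is preserved.

```python
from difflib import get_close_matches


def remove_same_ingredients(ingredient, substitutes_list, remove_count=100, cutoff=0.8):
    # Drop every entry that is a close string match of `ingredient`.
    ingredients_to_remove = get_close_matches(ingredient, substitutes_list, remove_count, cutoff)
    return [x for x in substitutes_list if x not in ingredients_to_remove]


def remove_same_substitutes(substitutes_list):
    cleaned = substitutes_list
    for substitute in substitutes_list:
        if substitute in cleaned:
            # Remove the substitute and its variants, then re-add the substitute.
            cleaned = remove_same_ingredients(substitute, cleaned, remove_count=len(cleaned), cutoff=0.8)
            cleaned.append(substitute)
    return cleaned


# Made-up example list for illustration only.
example = ["tomato", "tomatoes", "basil", "onions", "onion"]
print(remove_same_substitutes(example))  # ['tomato', 'basil', 'onions']
```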



- final function for finding the substitutes based on a defined initial ingredient

In [16]:
def find_substitutes(ingredient, wv_topn=100, suggested_substitutes=10, sort_by='similarity'):
    ingredient = find_ingredient(ingredient.strip().replace(' ', '_').lower())
    similar_substitutes = SRM_w2v_model.wv.most_similar(ingredient, topn=wv_topn)
        
    df_substitutes = pd.DataFrame(similar_substitutes, columns = ['ingredient', 'similarity'])
    
    substitutes_list = df_substitutes['ingredient'].to_list()
    substitutes_list = remove_same_ingredients(ingredient, substitutes_list, remove_count=wv_topn, cutoff=0.8)
    substitutes_list = remove_same_substitutes(substitutes_list)
    df_substitutes['ingredient'] = pd.Series(substitutes_list)
    
    df_substitutes = df_substitutes[['ingredient', 'similarity']]
    df_substitutes['nr_substitute'] = np.arange(start=1, stop=(wv_topn+1), step=1)

    df_substitutes_index = df_substitutes.set_index('nr_substitute')

    df_substitutes_final = df_substitutes_index.replace('_', ' ', regex=True)
    
    possible_substitutes = df_ingredient_list.merge(df_substitutes_final, on='ingredient', how='inner')
    possible_substitutes = possible_substitutes.sort_values(by=[sort_by], ascending=False)

    r = len(possible_substitutes)
    possible_substitutes['nr_substitute'] = np.arange(start=1, stop=(r+1), step=1)
    possible_substitutes = possible_substitutes.set_index('nr_substitute')
    
    return possible_substitutes.head(suggested_substitutes)
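
One point worth checking in a pipeline like this is that each similarity score stays attached to its own ingredient after filtering. In the function above, the cleaned name list is written back into the `ingredient` column by position, so whenever names are dropped, the remaining names move up and line up with scores that were computed for earlier rows. A possible alternative, sketched here in plain Python with made-up scores, is to filter the `(word, score)` pairs themselves; the names `pairs`, `keep`, `shifted` and `aligned` are illustrative assumptions.

```python
# Made-up (word, score) pairs in the shape returned by most_similar.
pairs = [("bread_crumbs", 0.9), ("panko", 0.8), ("cracker_crumbs", 0.7)]

# Suppose the cleaning step decided to keep only these names.
keep = ["panko", "cracker_crumbs"]
keep_set = set(keep)

# Positional write-back: kept names are paired with the first scores.
shifted = list(zip(keep, [score for _, score in pairs]))

# Pair-wise filter: every name keeps the score it was returned with.
aligned = [(word, score) for word, score in pairs if word in keep_set]

print(shifted)  # [('panko', 0.9), ('cracker_crumbs', 0.8)]
print(aligned)  # [('panko', 0.8), ('cracker_crumbs', 0.7)]
```

With a DataFrame, the same idea would correspond to selecting rows with `df_substitutes['ingredient'].isin(substitutes_list)` instead of assigning the list to the column.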

### 2.2 Exemplary Substitute Recommendations 

This chapter presents 10 examples of substitute recommendations calculated by the SRM.
Five of the ten ingredients are randomly chosen from a list of commonly substituted ingredients (via FOOD52: https://food52.com/blog/25199-common-ingredient-substitutions).
The other five initial ingredients are randomly selected from the final ingredient list (df_ingredient_list).

#### 2.2.1 List of Commonly Substituted Ingredients (CIL)

- build DataFrame with the commonly substituted ingredients 

In [17]:
CIL = {'ingredient': ['worcestershire sauce', 'shortening', 'evaporated milk', 'buttermilk', 'molasses', 'maple syrup', 'cornstarch', 'vanilla extract', 'soy sauce', 'eggs', 'thyme', 'heavy cream', 'flour', 'broth', 'parmesan cheese', 'fish sauce', 'bread crumbs', 'lemon zest', 'syrup']}
df_CIL = pd.DataFrame(data=CIL)

In [18]:
df_CIL.index = np.arange(1, 20)
df_CIL

index,ingredient
1,worcestershire sauce
2,shortening
3,evaporated milk
4,buttermilk
5,molasses
6,maple syrup
7,cornstarch
8,vanilla extract
9,soy sauce
10,eggs


- randomly select five samples

In [19]:
df_CIL.sample(5)

index,ingredient
17,bread crumbs
12,heavy cream
1,worcestershire sauce
3,evaporated milk
6,maple syrup
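
Because `sample` draws a new random subset on every execution, the table above does not have to match an earlier run. If reproducibility is wanted, a fixed seed can be used; in pandas this is the `random_state` argument of `DataFrame.sample`. The sketch below shows the same idea with the standard library so that it runs without the DataFrame. The seed value 42 and the helper name `draw` are arbitrary assumptions.

```python
import random

# Same ingredients as in the CIL dictionary above.
cil = ['worcestershire sauce', 'shortening', 'evaporated milk', 'buttermilk',
       'molasses', 'maple syrup', 'cornstarch', 'vanilla extract', 'soy sauce',
       'eggs', 'thyme', 'heavy cream', 'flour', 'broth', 'parmesan cheese',
       'fish sauce', 'bread crumbs', 'lemon zest', 'syrup']


def draw(seed, k=5):
    """Draw k distinct ingredients; the same seed gives the same draw."""
    return random.Random(seed).sample(cil, k)


print(draw(42) == draw(42))  # True

# pandas equivalent (assumes df_CIL from the cell above):
# df_CIL.sample(5, random_state=42)
```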


The five ingredients randomly selected from df_CIL in a previous run (which differs from the sample shown above) are:
   
       (1) breadcrumbs
       (2) cornstarch
       (3) eggs 
       (4) parmesan cheese
       (5) soy sauce
   

- apply the find_substitutes function to the five CIL ingredients

In [20]:
find_substitutes('breadcrumbs')

nr_substitute,ingredient,frequency,similarity
1,panko,6744,0.956892
2,cracker crumbs,9746,0.783435
3,cracker meal,331,0.752334
4,saltine crumbs,57,0.691029
5,parmesan cheese,44371,0.649269
6,parlsey,103,0.615784
7,cornflakes,2694,0.565192
8,paprika,36204,0.547163
9,dukkah,264,0.545685
10,pecorino,1450,0.538059
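
The `similarity` column holds the score returned by `most_similar`, which gensim computes as the cosine similarity between word vectors. A minimal sketch of that measure on made-up two-dimensional vectors (the vectors of this model have 100 dimensions):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors for illustration only.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # 0.0 (orthogonal)
```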


In [21]:
find_substitutes('cornstarch') 

nr_substitute,ingredient,frequency,similarity
1,arrowroot,1123,0.901605
2,cornflour,2210,0.731578
3,tapioca,3994,0.651171
4,potato starch,1291,0.640615
5,pineapple juice,14776,0.579175
6,flour,652974,0.572905
7,soy sauce,34686,0.572015
8,oyster sauce,2639,0.569741
9,orange juice,31854,0.540068
10,tapioca starch,495,0.530462


In [22]:
find_substitutes('eggs')

nr_substitute,ingredient,frequency,similarity
1,egg yolks,32133,0.747251
2,egg beaters,809,0.698374
3,egg whites,29930,0.654914
4,milk,389315,0.584376
5,mashed banana,1329,0.549757
6,nutmeg,33378,0.497066
7,baking powder,115741,0.482713
8,buttermilk,47067,0.479808
9,pumpkin puree,2862,0.460623


In [23]:
find_substitutes('parmesan cheese')

nr_substitute,ingredient,frequency,similarity
1,romano cheese,1338,0.89733
2,asiago cheese,678,0.832239
3,cheddar cheese,17189,0.73259
4,parsley,118671,0.722792
5,italian seasoning,8753,0.71244
6,mozzarella,16643,0.7042
7,shredded cheese,13914,0.696656
8,pecorino,1450,0.683676
9,swiss cheese,4125,0.675707
10,parsley flakes,2683,0.672849


In [24]:
find_substitutes('soy sauce')

nr_substitute,ingredient,frequency,similarity
1,oyster sauce,2639,0.866365
2,teriyaki sauce,2510,0.823103
3,sesame oil,15730,0.820751
4,tamari,1782,0.814498
5,shoyu,443,0.807435
6,mirin,1369,0.801489
7,chili paste,1634,0.780719
8,hoisin sauce,1791,0.773081
9,gochujang,427,0.718714
10,shaoxing wine,129,0.708817


#### 2.2.2 Final Ingredient List (FIL)

- randomly select five samples

In [25]:
df_ingredient_list.sample(5)

index,ingredient,frequency
3304,lumpia wrappers,39
3740,longans,23
2373,italian meringue,116
1199,frankfurters,640
2022,borscht,183


The five ingredients randomly selected from df_ingredient_list in a previous run (which differs from the sample shown above) are:
   
       (1) gooseberries
       (2) mayonnaise
       (3) orange curacao
       (4) piment d espelette
       (5) yuba
       
- apply the find_substitutes function to the five FIL ingredients

In [26]:
find_substitutes('gooseberries')

nr_substitute,ingredient,frequency,similarity
1,blackberries,3469,0.716166
2,rhubarb,11583,0.691985
3,plums,4786,0.688396
4,huckleberries,154,0.678308
5,blackcurrants,53,0.662755
6,mashed berries,170,0.653818
7,quince,900,0.653296
8,elderberries,39,0.643856
9,saskatoon berries,22,0.64084
10,raspberries,11719,0.636381


In [27]:
find_substitutes('mayonnaise')

nr_substitute,ingredient,frequency,similarity
1,miracle whip,1123,0.830621
2,salad dressing,11706,0.806647
3,pickle relish,640,0.726012
4,horseradish,7649,0.663828
5,pimento,4060,0.661308
6,dressing,95841,0.632429
7,mustard,62992,0.627702
8,mashed avocado,383,0.623929
9,vegenaise,75,0.623153
10,ranch dressing,3748,0.61848


In [28]:
find_substitutes('orange curacao')

nr_substitute,ingredient,frequency,similarity
1,curacao,137,0.818325
2,sloe gin,40,0.813751
3,green chartreuse,22,0.80794
4,pisco,154,0.805085
5,triple sec,580,0.801631
6,falernum,23,0.800061
7,gold rum,34,0.79829
8,cynar,52,0.798125
9,bacardi,85,0.792673
10,maraschino liqueur,73,0.792603


In [29]:
find_substitutes('piment d espelette')

nr_substitute,ingredient,frequency,similarity
1,fennel pollen,128,0.667633
2,aleppo pepper,388,0.650705
3,pink peppercorns,262,0.621165
4,black pepper,40895,0.605996
5,smoked paprika,2197,0.581804
6,togarashi,131,0.554624
7,seafood seasoning,885,0.547431
8,spanish paprika,65,0.545493
9,chipotle powder,490,0.536309
10,cayenne pepper,11275,0.533391


In [30]:
find_substitutes('yuba')

nr_substitute,ingredient,frequency,similarity
1,spring roll,922,0.61243
2,kanpyo,27,0.518669
3,nori,1985,0.51249
4,lumpia wrappers,39,0.496727
5,wonton skins,192,0.49482
6,gyoza,385,0.486374
7,sushi,727,0.480534
8,seaweed,1237,0.479801
9,mochi,804,0.479685
