# Recommending the Ingredients
In this task, we explore a recommender system (RecSys) that, given a list of ingredients specified by the user, returns a list of additional ingredients for the user to choose from. We find this task intriguing and practically relevant. For example, an "intelligent" refrigerator could guess the type of cuisine we are likely to make and recommend the ingredients we will still need, based on what is already in the fridge or kitchen. \
Specifically, we break this part of the project into two approaches: a recommender system built on vectorizers, and one built directly on the output of the `text_preprocess` method that we defined earlier. We then examine the performance of the RecSys under each approach.


## Approach 1. RecSys under Vectorizers
We will first use vectorizers to transform the recipe corpus into sparse matrices. Let's begin with some basic data exploration as follows.

In [0]:
# Import packages and modules
import pandas as pd
from text_preprocess import *
import sys
import json
import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import Normalizer
from ingredient_recommendation import *
from random import sample 
seed = 208

#import nltk
#nltk.download('wordnet')

In [4]:
# Load data
# Use a context manager so the file handle is closed after reading
with open('../data/train.json', 'r', encoding='utf-8') as f:
    data = pd.read_json(f)
data

|  | id | cuisine | ingredients |
|---|---|---|---|
| 0 | 10259 | greek | [romaine lettuce, black olives, grape tomatoes... |
| 1 | 25693 | southern_us | [plain flour, ground pepper, salt, tomatoes, g... |
| 2 | 20130 | filipino | [eggs, pepper, salt, mayonaise, cooking oil, g... |
| 3 | 22213 | indian | [water, vegetable oil, wheat, salt] |
| 4 | 13162 | indian | [black pepper, shallots, cornflour, cayenne pe... |
| ... | ... | ... | ... |
| 39769 | 29109 | irish | [light brown sugar, granulated sugar, butter, ... |
| 39770 | 11462 | italian | [KRAFT Zesty Italian Dressing, purple onion, b... |
| 39771 | 2238 | irish | [eggs, citrus fruit, raisins, sourdough starte... |
| 39772 | 41882 | chinese | [boneless chicken skinless thigh, minced garli... |


In [0]:
# Clean the ingredient text data
corpus = text_preprocess(data)
# Flatten a list of lists for bag of words model
bow_corpus = [" ".join(doc) for doc in corpus]

In [6]:
# Convert text to word count vectors with CountVectorizer.
vec = CountVectorizer()
X = vec.fit_transform(bow_corpus)
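# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
dtm = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())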
dtm

Unnamed: 0,abalone,abbamele,absinthe,abura,acai,accent,accompaniment,achiote,acid,acinus,ackee,acorn,active,added,adobo,adzuki,agar,agave,age,aged,ahi,aioli,ajinomoto,ajwain,aka,alaskan,albacore,alcohol,ale,aleppo,alexia,alfalfa,alfredo,all,allpurpose,allspice,almond,almondmilk,aloe,alphabet,...,wolfberries,won,wondra,wonton,wood,worcestershire,world,wrap,wrapper,xanthan,xuxu,yakinori,yakisoba,yam,yardlong,yeast,yellow,yellowfin,yellowtail,yoghurt,yogurt,yolk,yoplait,york,young,yu,yuca,yucca,yukon,yum,yuzu,yuzukosho,zaatar,zatarains,zero,zest,zesty,zinfandel,ziti,zucchini
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39769,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
39770,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
39771,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
39772,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [7]:
np.mean(np.sum(dtm!=0,axis=1))

18.739025494041332

In [8]:
np.median(np.sum(dtm!=0,axis=1))

18.0

Since the mean and the median number of ingredient tokens per recipe are both around 18, we will randomly generate 17 ingredients and recommend 1 more.

We recommend new ingredients by finding the existing recipe most similar to the given ingredients and proposing the ingredients in that recipe. We then compute a recommendation index for each recommended ingredient: we count how many recipes contain it together with each of the 17 given ingredients, and add those 17 counts up. This sum measures how strongly we recommend the ingredient. \
The larger the index, the stronger the recommendation, since a large index indicates that the ingredient co-occurs frequently with the 17 given ingredients and is therefore more strongly associated with them. For example, if apple has an index of 1000 and salt an index of 500, then we recommend apple over salt.
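The recommendation index described above can be sketched in a few lines. This is a minimal illustration on toy data, not the code in `ingredient_recommendation.py`; the function name `recommendation_index` and the toy recipes are assumptions made for the example.

```python
from collections import Counter

def recommendation_index(candidates, have, recipes):
    """Score each candidate by its summed co-occurrence with the ingredients on hand.

    For a candidate c, the index is the sum over every on-hand ingredient h of the
    number of recipes that contain both c and h.
    """
    have = set(have)
    scores = Counter({cand: 0 for cand in candidates})
    for recipe in recipes:
        tokens = set(recipe)
        overlap = len(tokens & have)  # on-hand ingredients present in this recipe
        if overlap == 0:
            continue
        for cand in candidates:
            if cand in tokens:
                # this recipe adds one co-occurrence per shared on-hand ingredient
                scores[cand] += overlap
    return scores.most_common()  # highest index first

# Toy data (hypothetical)
recipes = [
    ["salt", "pepper", "garlic"],
    ["salt", "garlic", "onion"],
    ["sugar", "butter", "flour"],
    ["pepper", "onion", "tomato"],
]
ranked = recommendation_index(["garlic", "onion", "flour"], ["salt", "pepper"], recipes)
print(ranked)
```

In this toy corpus, garlic co-occurs with the on-hand ingredients more often than onion does, so it receives the larger index; flour never co-occurs with them and scores zero.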

Now, we will prepare the corpus and feed it to different vectorizers to compare the recommendations under different choices of vectorizers with $4$ examples.

In [0]:
# Prepare the recipe corpus
d=vectorizer_preparation(bow_corpus)

In [10]:
# Example 1
index=sample(range(2849),17)
print("Using Countvectorizer to predict:","\n")
recommendation1=ingredient_recommendation(index,d,Vectorizer="Count")
print("\n")
print("Using Tfidfvectorizer to predict:","\n")
recommendation2=ingredient_recommendation(index,d,Vectorizer="Tfidf")

Using Countvectorizer to predict: 

The ingredients we have are: 

queso
romana
crema
muscavado
fume
mexicorn
edam
erythritol
ring
chimichurri
lentilles
classico
season
cracked
marmite
poolish
cannellini


The similarity socre of our recommendation is: 
 [0.14625448]


The recommended ingredients are: 

garlic: 630
onion: 575
tomato: 365
chicken: 297
chipotle: 48
lettuce: 43
adobo: 25
in: 23
tostada: 4


Using Tfidfvectorizer to predict: 

The ingredients we have are: 

queso
romana
crema
muscavado
fume
mexicorn
edam
erythritol
ring
chimichurri
lentilles
classico
season
cracked
marmite
poolish
cannellini


The similarity socre of our recommendation is: 
 [0.1917534]


The recommended ingredients are: 

pepper: 747
salt: 732
black: 605
olive: 453
ground: 298
leaf: 229
kosher: 185
parsley: 141
italian: 55
steak: 44
skirt: 10


In [11]:
# Example 2
index=sample(range(2849),17)
print("Using Countvectorizer to predict:","\n")
recommendation1=ingredient_recommendation(index,d,Vectorizer="Count")
print("\n")
print("Using Tfidfvectorizer to predict:","\n")
recommendation2=ingredient_recommendation(index,d,Vectorizer="Tfidf")

Using Countvectorizer to predict: 

The ingredients we have are: 

candied
ghee
uncook
gnocchetti
liquorice
roast
menta
sharp
madeleine
juice
fig
recip
bonito
rusk
polish
creamer
jalapeno


The similarity socre of our recommendation is: 
 [0.19802951]


The recommended ingredients are: 

lemon: 3762
sugar: 2857


Using Tfidfvectorizer to predict: 

The ingredients we have are: 

candied
ghee
uncook
gnocchetti
liquorice
roast
menta
sharp
madeleine
juice
fig
recip
bonito
rusk
polish
creamer
jalapeno


The similarity socre of our recommendation is: 
 [0.21511941]


The recommended ingredients are: 

sugar: 2857
powder: 1920
butter: 1824
flour: 1632
egg: 1364
milk: 1245
bean: 1034
allpurpose: 874
unsalted: 766
honey: 458
baking: 413
vanilla: 295


In [12]:
# Example 3
index=sample(range(2849),17)
print("Using Countvectorizer to predict:","\n")
recommendation1=ingredient_recommendation(index,d,Vectorizer="Count")
print("\n")
print("Using Tfidfvectorizer to predict:","\n")
recommendation2=ingredient_recommendation(index,d,Vectorizer="Tfidf")

Using Countvectorizer to predict: 

The ingredients we have are: 

wingettes
grit
sangiovese
chicory
ragu
bocconcini
hickoryflavored
hop
montreal
bresaola
tokyo
katakuriko
mora
chiffonade
bonein
thousand
giblet


The similarity socre of our recommendation is: 
 [0.14002801]


The recommended ingredients are: 

water: 174
coffee: 4


Using Tfidfvectorizer to predict: 

The ingredients we have are: 

wingettes
grit
sangiovese
chicory
ragu
bocconcini
hickoryflavored
hop
montreal
bresaola
tokyo
katakuriko
mora
chiffonade
bonein
thousand
giblet


The similarity socre of our recommendation is: 
 [0.21725728]


The recommended ingredients are: 

water: 174
coffee: 4


In [13]:
# Example 4
index=sample(range(2849),17)
print("Using Countvectorizer to predict:","\n")
recommendation1=ingredient_recommendation(index,d,Vectorizer="Count")
print("\n")
print("Using Tfidfvectorizer to predict:","\n")
recommendation2=ingredient_recommendation(index,d,Vectorizer="Tfidf")

Using Countvectorizer to predict: 

The ingredients we have are: 

tricolor
de
gao
provence
tree
andouille
weed
mam
vegan
doenzang
tipo
shortcrust
gomashio
sel
paccheri
persimmon
foster


The similarity socre of our recommendation is: 
 [0.17149859]


The recommended ingredients are: 

salt: 453
fleur: 50
sea: 43
zucchini: 32
fine: 16
dressing: 7


Using Tfidfvectorizer to predict: 

The ingredients we have are: 

tricolor
de
gao
provence
tree
andouille
weed
mam
vegan
doenzang
tipo
shortcrust
gomashio
sel
paccheri
persimmon
foster


The similarity socre of our recommendation is: 
 [0.1898315]


The recommended ingredients are: 

fresh: 237
sugar: 79
ginger: 39
pinenuts: 7


The examples above show that the two vectorizers sometimes produce identical recommendations (Example 3) and sometimes produce lists with little or no overlap (Examples 1 and 4). While it is unclear what causes this behavior, we have noticed a critical flaw of the vectorizers: they split ingredient phrases on whitespace. Consequently, some of the tokens that appear among the inputs and the recommendations do not make intuitive sense on their own, such as "tree", "sea", and "foster".
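To make the phrase-splitting issue concrete, the sketch below applies the default token pattern documented for scikit-learn's `CountVectorizer` (two or more word characters per token) to a toy recipe, and contrasts it with keeping each ingredient phrase whole. The toy recipe is an assumption made for illustration; the regular expression mimics the vectorizer's default rather than calling it.

```python
import re

# Default token pattern documented for CountVectorizer: runs of two or more
# word characters, so whitespace and punctuation act as token boundaries.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Toy recipe (hypothetical), given as a list of ingredient phrases
recipe = ["fleur de sel", "sea salt", "olive oil"]

# Approach 1 style: join the phrases into one string, then tokenize word by word
word_tokens = TOKEN_PATTERN.findall(" ".join(recipe).lower())

# Phrase-preserving alternative: each ingredient phrase is a single token
phrase_tokens = [phrase.lower() for phrase in recipe]

print(word_tokens)    # isolated words such as 'de' and 'sea'
print(phrase_tokens)  # whole phrases such as 'fleur de sel'
```

If one wanted to keep the vectorizer but preserve phrases, one option would be to pass a callable `analyzer` to `CountVectorizer` that returns the list of phrases unchanged; we have not tried this in the experiments here.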

## Approach 2. RecSys under `text_preprocess`
Now we explore the same task using a different text-processing method. Instead of splitting the ingredients into single words, we preserve them as they are and work with whole ingredient phrases. We also introduce a performance measure, "top $n$ accuracy", for the models we build below: the percentage of test recipes for which the model ranks the actual ingredient within the top $n$ items.
Note that in this approach, we always tell the model to ignore the last ingredient in the recipe, which facilitates both the recommendation and the "top $n$" calculation.

In [0]:
# Load packages
import numpy as np
import pandas as pd
import itertools
from collections import Counter
from gensim.models import Word2Vec
import sys
sys.path.append('../code/')
from text_preprocess import *
from RecSys import *

In [0]:
# Load data
# Use a context manager so the file handle is closed after reading
with open('../data/train.json', 'r', encoding='utf-8') as f:
    data = pd.read_json(f)

As in the first approach, we clean the data with `text_preprocess`, but this time we stop there and work directly with the processed text data, i.e. a list of lists.

In [0]:
# Clean the ingredient text data
corpus = text_preprocess(data)

In [17]:
# Number of ingredients in each recipe
lengths = [len(recipe) for recipe in corpus]

np.max(lengths), data['cuisine'][np.argmax(lengths)]

(65, 'italian')

We see that the longest recipe in our data set is an Italian recipe, which contains $65$ ingredients.

In [18]:
all_ingredients = list(itertools.chain.from_iterable(corpus))
len(np.unique(all_ingredients))

6687

And importantly, there are $6687$ unique ingredients under the `text_preprocess` method. This tells us that the expanded recipe matrix (as defined in `RecSys.py`) will have $6687$ columns.
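As a rough picture of what such an expanded recipe matrix looks like, the sketch below builds a 0/1 matrix with one column per unique ingredient phrase from a toy corpus. The function name `expand_recipes` and the toy data are assumptions for illustration; the actual construction lives in `RecSys.py` and may differ.

```python
def expand_recipes(corpus):
    """Return (vocabulary, matrix), where matrix[i][j] is 1 if recipe i contains ingredient j."""
    vocab = sorted({ing for recipe in corpus for ing in recipe})
    column = {ing: j for j, ing in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in corpus]
    for i, recipe in enumerate(corpus):
        for ing in recipe:
            matrix[i][column[ing]] = 1
    return vocab, matrix

# Toy corpus (hypothetical): two recipes, three unique ingredients
toy_corpus = [["salt", "olive oil"], ["salt", "sugar"]]
vocab, matrix = expand_recipes(toy_corpus)
print(vocab)
print(matrix)
```

With thousands of columns and only a handful of ones per row, a sparse representation is worth considering at full scale.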


**We will build the model evaluator below, based on the idea of the "top $n$ accuracy".** \
In a nutshell, we randomly select $50$ ingredients that are not in the recipe given by the user, and ask the model to rank them together with an actual ingredient from the recipe (i.e. the last ingredient). Here we use $n=5$; that is, we compute the percentage of recipes for which the model ranks the actual ingredient within the top $5$ of the $51$ ingredients. A more detailed description and the code can be found in `RecSys.py`.
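The evaluation procedure just described can be sketched as follows. This is a simplified stand-in for `ModelEvaluator`, not its actual code: the names `top_n_hit`, `top_n_accuracy`, and the `rank_fn(known, candidates)` interface are assumptions made for the example.

```python
import random

def top_n_hit(rank_fn, recipe, all_items, n=5, n_negatives=50, rng=random):
    """Return True if the held-out last ingredient is ranked within the top n."""
    *known, held_out = recipe
    # Negatives are drawn from ingredients that do not appear in the recipe
    pool = sorted(set(all_items) - set(recipe))
    negatives = rng.sample(pool, n_negatives)
    ranking = rank_fn(known, negatives + [held_out])
    return held_out in ranking[:n]

def top_n_accuracy(rank_fn, recipes, all_items, n=5, n_negatives=50, seed=0):
    """Fraction of recipes whose held-out ingredient lands in the top n."""
    rng = random.Random(seed)
    hits = [top_n_hit(rank_fn, r, all_items, n, n_negatives, rng) for r in recipes]
    return sum(hits) / len(hits)

# Toy ranker (hypothetical): ignores the recipe and sorts candidates alphabetically
def alphabetical(known, candidates):
    return sorted(candidates)

vocabulary = [f"item{i:02d}" for i in range(60)] + ["apple", "zucchini"]
test_recipes = [["item00", "item01", "apple"], ["item02", "zucchini"]]
accuracy = top_n_accuracy(alphabetical, test_recipes, vocabulary)
print(accuracy)
```

The alphabetical ranker always places "apple" first and "zucchini" last among the $51$ candidates, so it scores one hit out of two on this toy test set regardless of which negatives are sampled.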


In [0]:
evaluator = ModelEvaluator(all_items = all_ingredients, n = 5)

In [0]:
# Split the corpus into training and test corpuses
from sklearn.model_selection import train_test_split
corpus_tr, corpus_te, cuisine_tr, cuisine_te = train_test_split(corpus, data['cuisine'],
                                                                test_size = 0.2,
                                                                random_state = 0)
all_tr = list(itertools.chain.from_iterable(corpus_tr))
all_te = list(itertools.chain.from_iterable(corpus_te))

### 2.1 Popularity Model (Baseline)
To establish a baseline for how well a RecSys model should perform under `text_preprocess` and the "top $n$" metric, we construct a popularity model. As the name suggests, this model counts how often each ingredient in the list to be ranked occurs in the training recipes, and ranks the ingredients from most to least popular.
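A popularity ranker of this kind can be sketched as below. The class name `PopularityRanker` and the toy counts are assumptions for illustration, and the sketch assumes that the last ingredient of the recipe is set aside and appended to the list to be ranked; the actual `popularity` class in `RecSys.py` may be organized differently.

```python
from collections import Counter

class PopularityRanker:
    """Rank ingredients by how often they occur in the training recipes."""

    def __init__(self, all_items):
        # all_items is the flattened list of ingredients from the training corpus
        self.counts = Counter(all_items)

    def recommend_items(self, recipe, candidates):
        # Assumption: the last ingredient is held out and ranked with the candidates.
        # The remaining ingredients are not used, since popularity ignores context.
        pool = list(candidates) + [recipe[-1]]
        return sorted(pool, key=lambda item: self.counts[item], reverse=True)

# Toy training ingredients (hypothetical counts)
toy_items = (["salt"] * 5 + ["water"] * 4 + ["garlic"] * 3
             + ["pepper"] * 2 + ["romaine lettuce"])
toy_model = PopularityRanker(toy_items)
toy_ranking = toy_model.recommend_items(["salt", "water"],
                                        ["romaine lettuce", "pepper", "garlic"])
print(toy_ranking)
```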

In [21]:
# Fit the popularity model with ingredients in the training corpus
popular_model = popularity(all_items = all_tr)  

# Example
popular_model.recommend_items(['salt','water'],['romaine lettuce','pepper','garlic'])

['water', 'garlic', 'pepper', 'romaine lettuce']

In [22]:
evaluator.evaluate_model(popular_model, corpus_te)

Iteration: 0
Iteration: 500
Iteration: 1000
Iteration: 1500
Iteration: 2000
Iteration: 2500
Iteration: 3000
Iteration: 3500
Iteration: 4000
Iteration: 4500
Iteration: 5000
Iteration: 5500
Iteration: 6000
Iteration: 6500
Iteration: 7000
Iteration: 7500


0.8067881835323696

As we can see, the popularity model performs quite well: over $80\%$ of the time, it ranks the actual ingredient among the top $5$ of the $51$ ingredients. For comparison, a uniformly random ranking would do so only $5/51 \approx 9.8\%$ of the time.

### 2.2 Collaborative Filtering Model
Now we consider a more sophisticated RecSys model for our task. The idea of the collaborative filtering model is to exploit the similarity between recipes and to base its recommendations only on the recipes similar to the one given by the user. Specifically, the similarity between $2$ recipe vectors $u$ and $v$ is measured by *cosine similarity*, $\cos(u, v) = \frac{u \cdot v}{\|u\|\,\|v\|}$. (This is related to, but not the same as, the Pearson correlation, which equals the cosine similarity of the mean-centered vectors.) Of course, to compute the cosine similarity, we need to expand the recipes into recipe vectors/matrices (i.e. a matrix with mostly $0$'s and some $1$'s corresponding to the ingredients present in the recipes).

Once the prep work is completed, we will fetch the $1,000$ recipes that are the most "similar" to the user-specified recipe, and then rank the list of $51$ ingredients based on their occurrences in these $1,000$ recipes. 
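The two steps above (find the most similar recipes, then rank by occurrence within them) can be sketched on toy data as follows. This is an illustration under assumptions, not the `collaborative` class from `RecSys.py`: the function names are hypothetical, and recipes are treated as sets, which is equivalent to cosine similarity on 0/1 vectors because the dot product equals the size of the intersection and each norm equals the square root of the set size.

```python
import math
from collections import Counter

def cosine_binary(a, b):
    """Cosine similarity of two recipes viewed as 0/1 ingredient vectors (given as sets)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def collaborative_rank(known, candidates, train_recipes, k=1000):
    """Rank candidates by how often they occur in the k recipes most similar to `known`."""
    known = set(known)
    train_sets = [set(recipe) for recipe in train_recipes]
    neighbours = sorted(train_sets,
                        key=lambda recipe: cosine_binary(known, recipe),
                        reverse=True)[:k]
    counts = Counter()
    for recipe in neighbours:
        counts.update(recipe)  # each neighbour contributes one count per ingredient
    return sorted(candidates, key=lambda item: counts[item], reverse=True)

# Toy training recipes (hypothetical); k=2 instead of 1,000 to suit the tiny corpus
train = [
    ["salt", "water", "flour", "yeast"],
    ["salt", "water", "rice"],
    ["sugar", "butter", "flour"],
    ["beef", "onion", "garlic"],
]
toy_rank = collaborative_rank(["salt", "water"], ["butter", "flour", "garlic"], train, k=2)
print(toy_rank)
```

Only the two recipes containing salt and water are selected as neighbours, so flour (present in one of them) is ranked ahead of butter and garlic, which appear in neither.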


In [23]:
# Fit the collaborative filtering model on the training corpus
collab_model = collaborative(all_items = all_tr, corpus = corpus_tr)

# Example
collab_model.recommend_items(['salt','water'],['romaine lettuce','pepper','garlic'])

['water', 'garlic', 'pepper', 'romaine lettuce']

In [24]:
evaluator.evaluate_model(collab_model, corpus_te)

Iteration: 0
Iteration: 500
Iteration: 1000
Iteration: 1500
Iteration: 2000
Iteration: 2500
Iteration: 3000
Iteration: 3500
Iteration: 4000
Iteration: 4500
Iteration: 5000
Iteration: 5500
Iteration: 6000
Iteration: 6500
Iteration: 7000
Iteration: 7500


0.9862979258328095

Even though the evaluation takes dramatically more time than for the popularity model, because of the complexity of the algorithm (as well as the dense matrix operations), the collaborative filtering model achieves a much better "top $n$" accuracy: about $98.6\%$ versus $80.7\%$.