<a href="https://colab.research.google.com/github/map222/Kibbeh/blob/deploy_space/ingredient_finder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ingredient recommender
In cooking, one unexpected ingredient can turn a boring dish into something exciting. For example, one of my best recipes is mashed potatoes with parsnip; the potatoes provide familiarity, but the parsnip provides a little spice, and makes the dish more memorable. Being better at ML than cooking, I created an ingredient recommender that takes an ingredient as input, and outputs something that should go well with it, but isn't used often. If you would like to see how I trained the model, you can follow along in this notebook. Or if you'd like to get some ingredient recommendations, you can [try it here](https://huggingface.co/spaces/map222/recipe-spice).

## Summary of this notebook
* This notebook trains a Word2Vec model on 2M recipes. It includes some light pre-processing and deduplication.
* I explore the optimal number of dimensions for the word embedding. For ingredients, it appears that using lower-dimensional embeddings provides better results. This could be because the "taste space" is relatively low dimensional; having more dimensions could allow for overfitting.
* Finally, this notebook contains the function to recommend ingredients, based on ingredients that are similar to each other in "flavor space," but don't get used together often.

# Setup:
These are just data and tool imports. To get to more interesting stuff, search for "Load recipes"
### Note to self:
This notebook may not auto-save. To save, you need to go to File menu, then save to github. You can either push directly to master, or create a branch

### Google drive
First, mount files from Google drive (to copy the file to your share folder, click [this link](https://drive.google.com/drive/folders/1fh5C0Wlda0QMzBXqOj6znhS8SlsCZm9O?usp=sharing)):

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
folder_loc = r'/content/gdrive/My Drive/Colab Notebooks/data/recipes/'

### Hugging Face

In [3]:
!pip install huggingface_hub

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting huggingface_hub
  Downloading huggingface_hub-0.13.3-py3-none-any.whl (199 kB)
     199.8/199.8 KB 4.4 MB/s eta 0:00:00
Installing collected packages: huggingface_hub
Successfully installed huggingface_hub-0.13.3


In [4]:
from huggingface_hub import notebook_login
notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


### Imports

In [5]:
import pandas as pd
import numpy as np
from gensim.models import KeyedVectors, Word2Vec
import warnings
warnings.filterwarnings('ignore')
from collections import defaultdict, Counter
import json

In [6]:
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from ast import literal_eval
fs = 22
font = {'family' : 'sans-serif',
        'weight' : 'normal',
        'size'   : fs}

matplotlib.rc('font', **font)
plt.rc('xtick', labelsize=fs-6)
plt.rc('ytick', labelsize=fs-6)

## Load recipes
I downloaded a set of over [2M recipes from Kaggle](https://www.kaggle.com/code/paultimothymooney/explore-recipe-nlg-dataset). The dataset has titles for the recipes, as well as pre-processed named entity recognition (NER).

In [7]:
test = pd.read_csv(folder_loc + 'RecipeNLG_dataset.csv', nrows = 5, sep = ',', index_col = 0)

In [8]:
test.head()

Unnamed: 0,title,ingredients,directions,link,source,NER
0,No-Bake Nut Cookies,"[""1 c. firmly packed brown sugar"", ""1/2 c. eva...","[""In a heavy 2-quart saucepan, mix brown sugar...",www.cookbooks.com/Recipe-Details.aspx?id=44874,Gathered,"[""brown sugar"", ""milk"", ""vanilla"", ""nuts"", ""bu..."
1,Jewell Ball'S Chicken,"[""1 small jar chipped beef, cut up"", ""4 boned ...","[""Place chipped beef on bottom of baking dish....",www.cookbooks.com/Recipe-Details.aspx?id=699419,Gathered,"[""beef"", ""chicken breasts"", ""cream of mushroom..."
2,Creamy Corn,"[""2 (16 oz.) pkg. frozen corn"", ""1 (8 oz.) pkg...","[""In a slow cooker, combine all ingredients. C...",www.cookbooks.com/Recipe-Details.aspx?id=10570,Gathered,"[""frozen corn"", ""cream cheese"", ""butter"", ""gar..."
3,Chicken Funny,"[""1 large whole chicken"", ""2 (10 1/2 oz.) cans...","[""Boil and debone chicken."", ""Put bite size pi...",www.cookbooks.com/Recipe-Details.aspx?id=897570,Gathered,"[""chicken"", ""chicken gravy"", ""cream of mushroo..."
4,Reeses Cups(Candy),"[""1 c. peanut butter"", ""3/4 c. graham cracker ...","[""Combine first four ingredients and press in ...",www.cookbooks.com/Recipe-Details.aspx?id=659239,Gathered,"[""peanut butter"", ""graham cracker crumbs"", ""bu..."


In [7]:
%%time

recipes_pdf = pd.read_csv(folder_loc + 'RecipeNLG_dataset.csv', sep = ',',usecols = ['title', 'NER'], converters={"NER": literal_eval})

CPU times: user 1min 16s, sys: 5.87 s, total: 1min 22s
Wall time: 1min 45s


In [18]:
# there are just over 2M recipes. Here we only keep the title and NER column to save time / space
recipes_pdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2231142 entries, 0 to 2231141
Data columns (total 2 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   title   object
 1   NER     object
dtypes: object(2)
memory usage: 34.0+ MB


# Pre-processing
As in any NLP application, we need to do some pre-processing. As I am focusing on the embedding aspect of this project, I only did some light stemming (removing plurals), and deduplication of recipes that had the same ingredients.

### Stemming
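First, strip trailing `s` characters as a crude way of removing plurals, which reduces the token count. Note that `rstrip('s')` removes every trailing `s`, so a few non-plural words get shortened too.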

In [9]:
def pre_process(token_list):
  return [word.rstrip('s').lower() for word in token_list]
recipes_pdf['NER'] = recipes_pdf['NER'].apply(pre_process)

In [10]:
recipes_pdf['NER'].head()

0    [brown sugar, milk, vanilla, nut, butter, bite...
1    [beef, chicken breast, cream of mushroom soup,...
2    [frozen corn, cream cheese, butter, garlic pow...
3    [chicken, chicken gravy, cream of mushroom sou...
4    [peanut butter, graham cracker crumb, butter, ...
Name: NER, dtype: object
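
As a side note on the stemming step: `rstrip('s')` removes every trailing `s`, not just one, and it runs before `.lower()`, so a capital `S` is left in place. The sketch below shows that behavior on a few made-up tokens, next to a hypothetical, slightly more conservative helper. `strip_plural` is my own name, it is not part of this notebook, and it was not used to train the models here.

```python
def pre_process(token_list):
    # Same logic as the notebook cell above
    return [word.rstrip('s').lower() for word in token_list]

def strip_plural(word):
    # Hypothetical alternative: lowercase first, then drop a single trailing "s",
    # leaving words that end in "ss" (e.g. "swiss") untouched
    word = word.lower()
    if word.endswith('s') and not word.endswith('ss'):
        return word[:-1]
    return word

# Made-up tokens, chosen to show the edge cases
tokens = ['carrots', 'swiss', 'molasses', 'EGGS']
original = pre_process(tokens)
conservative = [strip_plural(token) for token in tokens]
print(original)
print(conservative)
```

Neither version is a real stemmer (both still turn `molasses` into `molasse`), which is in keeping with the "light pre-processing" goal.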

## Remove duplicate recipes
There are many duplicate recipes. We can see that at first by looking at recipe names. Below are the titles ranked 1996th to 2000th most common; even this far down the rankings, each title appears 59 times.

In [13]:
recipes_pdf['title'].value_counts().head(2000).tail()

Tuna Ball                   59
Easy Coconut Cake           59
Lemon Poppy Seed Muffins    59
Tomato Pudding              59
Mrs. Field'S Cookies        59
Name: title, dtype: int64

### Janky way to remove duplicates
One way to remove duplicate recipes would be to find all recipes with the same set of ingredients, which would involve comparing sets of tokens. My lazy way to do this is to concatenate all the characters of a recipe's ingredients, sort them, and use pandas `drop_duplicates` on the result.

In [20]:
recipes_pdf['sorted_char'] = recipes_pdf['NER'].apply(lambda row: ''.join(sorted(''.join(row).replace(' ', ''))))


Recipes look funny when you concatenate all their letters. It's kinda fun to figure out what the ingredients are. E.g. `aaegrrstuw` is `sugar water`. `aefggiklllmorstu` is `flour milk salt egg`.
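
To make the trick concrete, here is a small sketch of the sorted-character key on toy recipes. It also shows one limitation: two different ingredients that are anagrams of each other (like `lemon` and `melon`) produce the same key. The `ingredient_set_key` helper is a hypothetical alternative I did not use or benchmark on the full dataset.

```python
def sorted_char_key(ingredients):
    # Join all ingredient names, drop spaces, and sort the characters
    return ''.join(sorted(''.join(ingredients).replace(' ', '')))

def ingredient_set_key(ingredients):
    # Hypothetical alternative: an order-independent key that keeps tokens intact
    return frozenset(ingredients)

recipe_a = ['flour', 'milk', 'salt', 'egg']
recipe_b = ['egg', 'salt', 'milk', 'flour']  # same ingredients, different order

print(sorted_char_key(recipe_a))
print(sorted_char_key(recipe_a) == sorted_char_key(recipe_b))

# Anagram ingredients collide under the character key, but not under the set key
print(sorted_char_key(['lemon']) == sorted_char_key(['melon']))
print(ingredient_set_key(['lemon']) == ingredient_set_key(['melon']))
```

For deduplication this kind of collision is probably rare enough not to matter, but it is worth knowing the key is not strictly one-to-one.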


In [15]:
recipes_pdf['sorted_char'].value_counts().head()

                                                  574
aaaaabbdeeefggggiiikklllllmnnooprrrrsstttuuuvw    449
aefggiklllmorstu                                  336
aaegrrstuw                                        311
aaabbdeeefggggiikklllmnooprrrrsstttuuuw           307
Name: sorted_char, dtype: int64

In [16]:
recipes_pdf.loc[recipes_pdf['sorted_char'] == 'aaabbdeeefggggiikklllmnooprrrrsstttuuuw'].head()

Unnamed: 0,title,NER,sorted_char
175,Mom'S Pancakes,"[flour, baking powder, salt, sugar, egg, milk,...",aaabbdeeefggggiikklllmnooprrrrsstttuuuw
6031,Homemade Pancakes,"[flour, salt, sugar, egg, baking powder, milk,...",aaabbdeeefggggiikklllmnooprrrrsstttuuuw
7300,Plain Biscuits,"[flour, baking powder, sugar, salt, butter, mi...",aaabbdeeefggggiikklllmnooprrrrsstttuuuw
11873,Picklelets,"[egg, milk, baking powder, butter, sugar, flou...",aaabbdeeefggggiikklllmnooprrrrsstttuuuw
18121,Pancakes,"[baking powder, flour, sugar, salt, milk, egg,...",aaabbdeeefggggiikklllmnooprrrrsstttuuuw


In [17]:
recipes_pdf = recipes_pdf.drop_duplicates(subset = 'sorted_char')
recipes_pdf.shape

(1976599, 3)

This removed about 250k recipes (from 2,231,142 down to 1,976,599).

# Modeling
To build a model for ingredients, I wanted a way to embed ingredients in space. The first thing that came to mind was to use Word2Vec, as it does just that. Here I train embeddings with different numbers of dimensions, ranging from 16 to 100. This is doable on a Google Colab CPU in about 3 minutes per model, for 2M recipes.

### Train embeddings with different number of dimensions

In [13]:
%%time
#recipe_w2v_100 = Word2Vec(recipes_pdf['NER'], min_count=10)
#recipe_w2v_50 = Word2Vec(recipes_pdf['NER'], vector_size=50, min_count=10)
#recipe_w2v_25 = Word2Vec(recipes_pdf['NER'], vector_size=25, min_count=10)
recipe_w2v_16 = Word2Vec(recipes_pdf['NER'], vector_size=16, min_count=10)

CPU times: user 3min 12s, sys: 1.05 s, total: 3min 13s
Wall time: 2min


We have about 19k ingredients after training.

In [15]:
# number of ingredients
len(recipe_w2v_16.wv.key_to_index)

18998

## Lower-dimensional embeddings look better on the surface than high-dimensions
When I initially trained these ingredient-embeddings, I found some surprising results, like `bone` showing up in the most similar ingredients for `carrot`. While this is somewhat understandable since soups are popular, this would make for some surprising ingredient recommendations.

One hypothesis I had for this was that the default number of embedding dimensions is too high. Anthropic has an [interesting paper about how embeddings are able to represent more features than there are dimensions](https://transformer-circuits.pub/2022/toy_model/index.html#motivation) by using "superposition." The default number of dimensions for Gensim is 100, which implies it could encode hundreds of features in flavor space. So I tested reducing the number of dimensions of the embeddings, and lo and behold it seemed to fix the `carrot-bone` similarity problem. Below are examples of this for carrots and barley.

### Carrots
With the default number of dimensions (100), the most similar ingredients to `carrot` are `carrot` variants, but `bone` and `beef bone` show up in the top 10.

In [None]:
recipe_w2v_100.wv.most_similar('carrot')

[('baby carrot', 0.8065677881240845),
 ('fresh carrot', 0.6772737503051758),
 ('peeled carrot', 0.6647123694419861),
 ('fresh baby carrot', 0.5763121843338013),
 ('grated carrot', 0.5363171696662903),
 ('beef bone', 0.5336546897888184),
 ('bone', 0.522489070892334),
 ('pearl barley', 0.4999176263809204),
 ('barley', 0.4961257576942444),
 ('only', 0.48451071977615356)]

Reducing to 25 dimensions pushes `bone` out of the top 10, but it is still surprisingly high (12th).

In [None]:
recipe_w2v_25.wv.most_similar('carrot' , topn=15)

[('baby carrot', 0.8326777219772339),
 ('fresh carrot', 0.7855518460273743),
 ('white turnip', 0.7434546947479248),
 ('pearl barley', 0.7420206665992737),
 ('parsley root', 0.7365614175796509),
 ('turnip', 0.7275972962379456),
 ('barley', 0.7271861433982849),
 ('sweet potato', 0.7265936136245728),
 ('swede', 0.7255061864852905),
 ('green lentil', 0.7225844860076904),
 ('parsnip', 0.7192715406417847),
 ('bone', 0.715190589427948),
 ('chunk', 0.7009220123291016),
 ('stalks of celery', 0.6996918320655823),
 ('lentil', 0.6953086256980896)]

Reducing to 16 dimensions gets `bone` out of the top 20.

In [17]:
recipe_w2v_16.wv.most_similar('carrot' , topn=20)

[('baby carrot', 0.8764892220497131),
 ('stalks celery', 0.8321566581726074),
 ('pearl barley', 0.8292515277862549),
 ('barley', 0.823881208896637),
 ('green lentil', 0.816692590713501),
 ('fresh carrot', 0.8118476867675781),
 ('dried lentil', 0.81089848279953),
 ('lentil', 0.8044099807739258),
 ('potato', 0.8017277121543884),
 ('parsnip', 0.7993506193161011),
 ('celery', 0.7979180216789246),
 ('stalks of celery', 0.7857601046562195),
 ('brown lentil', 0.7810222506523132),
 ('white cabbage', 0.7796421051025391),
 ('dry lentil', 0.7789967060089111),
 ('turnip', 0.7745665311813354),
 ('swede', 0.7739119529724121),
 ('long grain brown rice', 0.7688688039779663),
 ('celery stalk', 0.7670461535453796),
 ('turkey kielbasa', 0.7657486200332642)]
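
To make the comparison across dimensions a bit less eyeball-based, here is a small sketch that looks up the rank of `bone` in the neighbor lists printed above. The helper name `rank_of` is my own; the name lists are copied from the 100- and 25-dimension outputs.

```python
def rank_of(name, neighbors):
    # 1-based position of `name` in a list of neighbor names, or None if absent
    return neighbors.index(name) + 1 if name in neighbors else None

# Names copied from the most_similar('carrot') outputs above
neighbors_100 = ['baby carrot', 'fresh carrot', 'peeled carrot', 'fresh baby carrot',
                 'grated carrot', 'beef bone', 'bone', 'pearl barley', 'barley', 'only']
neighbors_25 = ['baby carrot', 'fresh carrot', 'white turnip', 'pearl barley',
                'parsley root', 'turnip', 'barley', 'sweet potato', 'swede',
                'green lentil', 'parsnip', 'bone', 'chunk', 'stalks of celery', 'lentil']

rank_100 = rank_of('bone', neighbors_100)
rank_25 = rank_of('bone', neighbors_25)
print(rank_100, rank_25)
```

With a live model, the name list could be built from the `(name, similarity)` tuples that `most_similar` returns, e.g. `[name for name, _ in result]`.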

### Barley
We can see a similar situation with barley: with 100 dimensions, bones make up 4 of the 20 most similar ingredients (`ham bone`, `beef soup bone`, `beef bone`, and `bone`). If we reduce the number of dimensions to 16, only 2 of the 20 are meat (`lamb stew meat` and `lamb shank`), and neither is a bone.

In [None]:
recipe_w2v_100.wv.most_similar('barley', topn=20)

[('pearl barley', 0.9185601472854614),
 ('dried lentil', 0.7291703820228577),
 ('lentil', 0.7216474413871765),
 ('dry lentil', 0.6877376437187195),
 ('long grain brown rice', 0.6726171970367432),
 ('green lentil', 0.669456422328949),
 ('quick-cooking barley', 0.6392723321914673),
 ('brown lentil', 0.6360408067703247),
 ('frozen green bean', 0.6338297128677368),
 ('green split pea', 0.6311508417129517),
 ('ham bone', 0.628990888595581),
 ('beef soup bone', 0.6211417317390442),
 ('brown rice', 0.6149815320968628),
 ('alphabet pasta', 0.6042261719703674),
 ('cooking barley', 0.5910148024559021),
 ('frozen lima bean', 0.5896640419960022),
 ('beef bone', 0.5829190015792847),
 ('white bean', 0.5824376344680786),
 ('bone', 0.5795758962631226),
 ('vegetable bouillon', 0.5782933235168457)]

In [None]:
recipe_w2v_16.wv.most_similar('barley', topn=20)

[('pearl barley', 0.9637734889984131),
 ('dried lentil', 0.925638735294342),
 ('lentil', 0.9177876710891724),
 ('dry lentil', 0.8902276754379272),
 ('brown lentil', 0.8776227235794067),
 ('green lentil', 0.8661330938339233),
 ('quick-cooking barley', 0.8623511791229248),
 ('carrot', 0.8284534215927124),
 ('lamb stew meat', 0.8240671753883362),
 ('long grain brown rice', 0.8050515651702881),
 ('stalks celery', 0.7979966402053833),
 ('dried great northern bean', 0.7950857281684875),
 ('yellow pea', 0.7908360362052917),
 ('white bean', 0.7860153913497925),
 ('split pea', 0.7850393056869507),
 ('chunk', 0.7847321033477783),
 ('green split pea', 0.7832866907119751),
 ('lamb shank', 0.7812250256538391),
 ('bulgur', 0.7739902138710022),
 ('stalks of celery', 0.771740734577179)]

### Word2Vec addition
One fun thing to do with word embeddings is to see what you get with the standard vector addition. (In general, vector subtraction didn't make much sense in ingredient space).

For example, if you ask for the most similar ingredient to `kiwi + banana`, you get suggestions to add yogurt (in addition to some nonsense in `fruit` and `frozen banana`).

In [45]:
recipe_w2v_16.wv.most_similar(['kiwi', 'banana'], topn=3, )

[('fruit', 0.9203987717628479),
 ('vanilla yogurt', 0.888529360294342),
 ('frozen banana', 0.885556697845459)]

Or maybe you have a bunch of ingredients lying around and want to know what will go well with them. In that case, you can throw them all into the model. The suggestions for `lettuce, feta, and tortilla chip` are surprisingly good: add an olive, or make a pita pocket.

In [53]:
recipe_w2v_16.wv.most_similar(['lettuce', 'feta', 'tortilla chip'], topn=3, )

[('shredded lettuce', 0.8785420656204224),
 ('black olive', 0.8727546334266663),
 ('pita pocket', 0.8659997582435608)]
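
For intuition on what passing a list to `most_similar` does: as far as I understand gensim's behavior, it averages the unit-normalized input vectors, ranks every other key by cosine similarity to that average, and leaves the inputs themselves out of the results. Below is a minimal pure-Python sketch of that idea. The 2-d vectors are made up for illustration and `most_similar_sketch` is a hypothetical name, not a gensim function.

```python
import math

def unit(vector):
    # Scale a vector to length 1
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def most_similar_sketch(vectors, positive, topn=3):
    # Average the unit-normalized query vectors
    dim = len(next(iter(vectors.values())))
    mean = [0.0] * dim
    for name in positive:
        for i, x in enumerate(unit(vectors[name])):
            mean[i] += x / len(positive)
    query = unit(mean)
    # Cosine similarity to every key that is not one of the inputs
    scores = {name: sum(q * x for q, x in zip(query, unit(vector)))
              for name, vector in vectors.items() if name not in positive}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 2-d "flavor space" vectors
toy_vectors = {
    'kiwi': [1.0, 0.2],
    'banana': [0.8, 0.6],
    'yogurt': [0.9, 0.4],
    'beef': [-0.5, 1.0],
}
result = most_similar_sketch(toy_vectors, ['kiwi', 'banana'])
print(result)
```

In this toy space `yogurt` sits between `kiwi` and `banana`, so it ranks first, while `beef` points in a different direction and ranks last.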

# Finding new ingredients
Now that we have an ingredient embedding, we can build our ingredient recommender. Given a source ingredient `Ing_S`, and a potential replacement ingredient `Ing_R`, I created a novelty score:

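$$NoveltyScore = \frac{CosineSimilarity(Ing_S, Ing_R)}{(\#\ of\ recipes\ with\ both\ Ing_S\ and\ Ing_R + 1) \times (\#\ of\ recipes\ with\ Ing_R)}$$

This means that as similarity goes up, ingredients are more likely to be recommended, while candidates that frequently co-occur with the source ingredient, or that are common across all recipes, get a lower score. The `+1` guards against dividing by zero when two ingredients never co-occur.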
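
As a worked example of the arithmetic, the sketch below mirrors what the `get_fusion_ingredients` code in this notebook computes for each candidate: similarity divided by (co-occurrence count + 1), divided by the candidate's overall count. The function name `novelty_score` and all of the numbers are hypothetical, chosen only to show how a rare, rarely paired candidate can outrank a more similar but common one.

```python
def novelty_score(similarity, cooccurrence_count, candidate_count):
    # similarity / (co-occurrences + 1) / total occurrences of the candidate
    return similarity / (cooccurrence_count + 1) / candidate_count

# Hypothetical numbers, not taken from the dataset
rare_candidate = novelty_score(similarity=0.88, cooccurrence_count=10, candidate_count=200)
common_candidate = novelty_score(similarity=0.94, cooccurrence_count=300, candidate_count=5000)

print(rare_candidate, common_candidate)
print(rare_candidate > common_candidate)
```

Because both counts sit in the denominator, the scores come out as very small numbers, so only their relative order is useful.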

### Get ingredient counts, for understanding how common an ingredient is
One key component of the above score is the counts of how often each ingredient occurs. We can pre-calculate this to save time at inference.

In [59]:
ingredient_count = Counter([token for row in recipes_pdf['NER'].values for token in row])
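
Here is the same counting pattern on three toy recipes, to show what the resulting `Counter` looks like. The recipes are made up; note that a `Counter` returns 0 for ingredients it has never seen, rather than raising an error.

```python
from collections import Counter

# Made-up recipes, each a list of ingredient names
toy_recipes = [
    ['flour', 'egg', 'milk'],
    ['egg', 'butter'],
    ['flour', 'butter', 'egg'],
]

# Flatten all recipes into one stream of tokens and count them
toy_count = Counter(token for recipe in toy_recipes for token in recipe)

print(toy_count)
print(toy_count['egg'], toy_count['saffron'])
```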

Now that we have similarity scores and counts, we can start recommending ingredients. The code below calculates ingredient similarity, and the novelty scores for a given ingredient.

In [56]:
def get_fusion_ingredients(ingredient: str,
                           recipe_model, # gensim Word2Vec model
                           recipes, # pandas Series of recipes (each a list of ingredients)
                           ingredient_count: dict,
                           max_candidates = 20,
                           min_occurence_factor = 100 # candidates must co-occur more than (max co-occurrence / this factor) times
                           ):

  # keep only the recipes that contain the source ingredient
  ingredient_recipes = recipes.loc[recipes.apply(lambda row: ingredient in row)]

  ingredient_candidates = recipe_model.wv.most_similar(ingredient, topn=50) # get top similar ingredients
  candidate_names = list(zip(*ingredient_candidates))[0] # get the names from the first column

  # clean up candidates to remove variants of the ingredient (e.g. for apple, remove "gala apple")
  pruned_candidates = [candidate for candidate in candidate_names if ingredient not in candidate][:max_candidates]

  # count how often these ingredients occur together for novelty score
  cooccurrence_counts = calc_cooccurrence(ingredient, candidate_names, ingredient_recipes)

  # co-occurrence threshold: drop candidates that almost never appear with the ingredient
  min_occurences = max(cooccurrence_counts.values()) / min_occurence_factor

  # build a dictionary of novelty scores: similarity / (co-occurrences + 1) / total occurrences of the candidate
  novelty_scores = {name: similarity / (cooccurrence_counts[name] + 1) / ingredient_count[name]
                    for name, similarity in ingredient_candidates
                    if name in pruned_candidates and cooccurrence_counts[name] > min_occurences}

  # get top 5 candidates (sorted by ascending score), and return as list rather than dictionary
  top_candidates = sorted(novelty_scores.items(), key=lambda x: x[1])[-5:]
  return top_candidates, cooccurrence_counts, ingredient_candidates # return multiple for debugging
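
One thing to be aware of in the pruning step: `ingredient not in candidate` is a substring check, so for `apple` it removes `gala apple` as intended, but it also removes `pineapple`. The sketch below shows this on made-up candidate names, next to a hypothetical word-level variant (`prune_candidates_by_word` is my own name and is not what the function above uses).

```python
def prune_candidates(ingredient, candidate_names, max_candidates=20):
    # Same substring check as in get_fusion_ingredients
    return [c for c in candidate_names if ingredient not in c][:max_candidates]

def prune_candidates_by_word(ingredient, candidate_names, max_candidates=20):
    # Hypothetical variant: only drop candidates containing the ingredient as a whole word
    return [c for c in candidate_names if ingredient not in c.split()][:max_candidates]

# Made-up candidate names
candidates = ['gala apple', 'pear', 'pineapple', 'apple cider', 'quince']

by_substring = prune_candidates('apple', candidates)
by_word = prune_candidates_by_word('apple', candidates)
print(by_substring)
print(by_word)
```

The word-level variant only handles single-word ingredients as written; a multi-word ingredient such as `brown sugar` would need a different check.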

This helper function calculates co-occurrences.

In [55]:
def calc_cooccurrence(ingredient: str,
                      candidates,
                      recipes):
  ''' Count how often the source ingredient co-occurs with each candidate
    ingredient: str name of an ingredient ("apple")
    candidates: potential other ingredients ("orange")
    recipes: iterable of recipes, each a list of ingredient names
    returns: dict mapping each candidate to its co-occurrence count
  '''


  co_count = {}
  for candidate in candidates:
    co_count[candidate] = sum([candidate in recipe and ingredient in recipe for recipe in recipes])
  return co_count
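
The helper above scans the recipe list once per candidate. A possible speed-up, sketched below under the assumption that each recipe is a list of ingredient names, is to make a single pass over the recipes and intersect each one with the candidate set. `calc_cooccurrence_single_pass` is a hypothetical name; I have not timed it on the full dataset.

```python
from collections import Counter

def calc_cooccurrence_single_pass(ingredient, candidates, recipes):
    # Start every candidate at zero so absent pairs still appear in the result
    co_count = Counter({candidate: 0 for candidate in candidates})
    candidate_set = set(candidates)
    for recipe in recipes:
        if ingredient in recipe:
            # Each candidate is counted at most once per recipe
            co_count.update(candidate_set.intersection(recipe))
    return dict(co_count)

# Made-up recipes for illustration
toy_recipes = [
    ['orange', 'kumquat', 'sugar'],
    ['orange', 'sugar'],
    ['kumquat', 'sugar'],
    ['orange', 'plum'],
]
toy_counts = calc_cooccurrence_single_pass('orange', ['kumquat', 'plum', 'pear'], toy_recipes)
print(toy_counts)
```

On the toy data this gives the same counts as the per-candidate loop: one recipe pairs `orange` with `kumquat`, one pairs it with `plum`, and none pair it with `pear`.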

### Example recommendations
#### Orange
Running the function takes about two seconds, primarily due to calculating co-occurrences.

In [62]:
%%time
top_candidates, cooccurrence_counts, ingredient_candidates = get_fusion_ingredients('orange', recipe_w2v_16, recipes_pdf['NER'], ingredient_count)


37.84
CPU times: user 2.22 s, sys: 115 ms, total: 2.33 s
Wall time: 2.24 s


Below are the raw cosine similarities between orange and the top ingredients. The most similar fruits to orange are tangerine and clementine, with kumquat coming in 3rd.

In [66]:
ingredient_candidates[:5]

[('tangerine', 0.944675087928772),
 ('clementine', 0.8985823392868042),
 ('kumquat', 0.8837090730667114),
 ('pomegranate', 0.8730214834213257),
 ('pear', 0.842303991317749)]

Below are the top five novelty scores. After accounting for occurrence counts, kumquat rises to the top, as it is less commonly found with oranges. Note that the absolute novelty scores are not interpretable, as they are generally very small numbers; only the ranking matters.

In [63]:
top_candidates[::-1]

[('kumquat', 3.585899501163413e-05),
 ('clementine', 3.4800446895426364e-05),
 ('plum', 2.6533793546761154e-05),
 ('red grapefruit', 1.914712324948378e-05),
 ('tangerine', 1.5240873915892616e-05)]

# Serving the model
To allow everyone to spice up their recipes, I built a simple web interface so you can [try the model yourself here](https://huggingface.co/spaces/map222/recipe-spice).

To build the web interface, I used Hugging Face's Spaces framework. I have [one repo, `recipe-spice-model`](https://huggingface.co/map222/recipe-spice-model/tree/main) that stores the `KeyedVectors` and other binary files. For serving, there is a [separate repo, `recipe-spice`](https://huggingface.co/spaces/map222/recipe-spice/tree/main) whose [`app.py`](https://huggingface.co/spaces/map222/recipe-spice/blob/main/app.py) contains the code for calculating similarity and launching the app.

Overall, I found it fairly straightforward to build the interface. However, given it was my first time using the framework, I ended up using the web interface to try to debug filenames, which does not seem ideal.

# Save outputs
The rest of this notebook is just some ETL to save the models to Google Drive and HuggingFace.

## To drive

In [None]:
recipe_w2v_file = 'recipe_w2v.gensim'
recipe_w2v_100.save(folder_loc + recipe_w2v_file)

In [None]:
recipe_w2v_file_50 = 'recipe_w2v_50.gensim'
recipe_w2v_50.save(folder_loc + recipe_w2v_file_50)

In [None]:
recipe_w2v_file_25 = 'recipe_w2v_25.gensim'
recipe_w2v_25.save(folder_loc + recipe_w2v_file_25)

In [67]:
recipe_w2v_file_16 = 'recipe_w2v_16.gensim'
recipe_w2v_16.save(folder_loc + recipe_w2v_file_16)

In [None]:
ingredient_file = 'ingredient_count.json'
with open(folder_loc + ingredient_file, 'w') as ingredient_stream:
  json.dump(ingredient_count, ingredient_stream)
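
One detail to keep in mind when reading this file back: `json.load` returns a plain `dict`, not a `Counter`, so looking up an unseen ingredient raises `KeyError` unless the result is wrapped again. The sketch below round-trips a toy counter through a temporary file; the counts and the `toy_count` name are made up for illustration.

```python
import json
import os
import tempfile
from collections import Counter

# Toy counts standing in for the real ingredient_count
toy_count = Counter({'egg': 3, 'flour': 2})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'ingredient_count.json')
    with open(path, 'w') as stream:
        json.dump(toy_count, stream)
    with open(path) as stream:
        loaded = json.load(stream)  # comes back as a plain dict

# Wrap in Counter again so missing ingredients count as 0
restored = Counter(loaded)
print(type(loaded).__name__, restored['saffron'])
```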

In [None]:
NER_tsv = 'recipe_NER.tsv'
recipes_pdf['NER'].to_csv(folder_loc + NER_tsv, sep='\t')

## To huggingface

In [69]:
from huggingface_hub import HfApi
api = HfApi()

In [71]:
recipe_w2v_file = recipe_w2v_file_16

In [None]:
api.upload_file(
    path_or_fileobj=folder_loc + recipe_w2v_file,
    path_in_repo=recipe_w2v_file,
    repo_id="map222/recipe-spice",
    repo_type="space",
    create_pr=True
)

In [None]:
api.upload_file(
    path_or_fileobj=folder_loc + ingredient_file,
    path_in_repo=ingredient_file,
    repo_id="map222/recipe-spice",
    repo_type="space",
)

In [None]:
api.upload_file(
    path_or_fileobj=folder_loc + NER_tsv,
    path_in_repo=NER_tsv,
    repo_id="map222/recipe-spice",
    repo_type="space",
)