<a href="https://colab.research.google.com/github/map222/Kibbeh/blob/deploy_space/ingredient_finder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ingredient recommender
One of my best recipes is mashed potatoes with parsnip; the potatoes provide familiarity, but the parsnip adds a little spice and makes the dish more memorable. Often when cooking, we want to spice up our recipes with new ingredients. Being better at ML than cooking, I created an ingredient recommender that, given an ingredient, suggests something similar to it that doesn't often occur with it. This project has two parts:
1. This training notebook, where I develop the model.
2. A Hugging Face deployment that lets anyone input an ingredient and get back similar ingredients, as well as "spicy" ingredients.

# Setup:
### Note to self:
This notebook may not auto-save. To save, go to the File menu, then save to GitHub. You can either push directly to master or create a branch.

### Google Drive
First, mount files from Google Drive (to copy the file to your share folder, click [this link](https://drive.google.com/drive/folders/1fh5C0Wlda0QMzBXqOj6znhS8SlsCZm9O?usp=sharing)):

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
folder_loc = r'/content/gdrive/My Drive/Colab Notebooks/data/recipes/'

### Hugging Face

In [None]:
!pip install huggingface_hub

In [4]:
from huggingface_hub import notebook_login
notebook_login()

Token is valid.
Your token has been saved to /root/.cache/huggingface/token
Login successful


### Imports

In [5]:
import pandas as pd
import numpy as np
from gensim.models import KeyedVectors, Word2Vec
import warnings
warnings.filterwarnings('ignore')
from collections import defaultdict, Counter
import json

In [6]:
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from ast import literal_eval
fs = 22
font = {'family' : 'sans-serif', # 'normal' is not a valid matplotlib font family
        'weight' : 'normal',
        'size'   : fs}

matplotlib.rc('font', **font)
plt.rc('xtick', labelsize=fs-6)
plt.rc('ytick', labelsize=fs-6)

## Load recipes
I downloaded a set of over [2M recipes from Kaggle](https://www.kaggle.com/code/paultimothymooney/explore-recipe-nlg-dataset). The dataset has titles for the recipes, as well as a pre-processed ingredient list for each recipe from named entity recognition (the `NER` column).

In [7]:
test = pd.read_csv(folder_loc + 'RecipeNLG_dataset.csv', nrows = 5, sep = ',', index_col = 0)

In [8]:
test.head()

Unnamed: 0,title,ingredients,directions,link,source,NER
0,No-Bake Nut Cookies,"[""1 c. firmly packed brown sugar"", ""1/2 c. eva...","[""In a heavy 2-quart saucepan, mix brown sugar...",www.cookbooks.com/Recipe-Details.aspx?id=44874,Gathered,"[""brown sugar"", ""milk"", ""vanilla"", ""nuts"", ""bu..."
1,Jewell Ball'S Chicken,"[""1 small jar chipped beef, cut up"", ""4 boned ...","[""Place chipped beef on bottom of baking dish....",www.cookbooks.com/Recipe-Details.aspx?id=699419,Gathered,"[""beef"", ""chicken breasts"", ""cream of mushroom..."
2,Creamy Corn,"[""2 (16 oz.) pkg. frozen corn"", ""1 (8 oz.) pkg...","[""In a slow cooker, combine all ingredients. C...",www.cookbooks.com/Recipe-Details.aspx?id=10570,Gathered,"[""frozen corn"", ""cream cheese"", ""butter"", ""gar..."
3,Chicken Funny,"[""1 large whole chicken"", ""2 (10 1/2 oz.) cans...","[""Boil and debone chicken."", ""Put bite size pi...",www.cookbooks.com/Recipe-Details.aspx?id=897570,Gathered,"[""chicken"", ""chicken gravy"", ""cream of mushroo..."
4,Reeses Cups(Candy),"[""1 c. peanut butter"", ""3/4 c. graham cracker ...","[""Combine first four ingredients and press in ...",www.cookbooks.com/Recipe-Details.aspx?id=659239,Gathered,"[""peanut butter"", ""graham cracker crumbs"", ""bu..."


In [9]:
%%time

recipes_pdf = pd.read_csv(folder_loc + 'RecipeNLG_dataset.csv', sep = ',',usecols = ['title', 'NER'], converters={"NER": literal_eval})

CPU times: user 1min 20s, sys: 3.84 s, total: 1min 24s
Wall time: 3min 8s


In [10]:
recipes_pdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2231142 entries, 0 to 2231141
Data columns (total 2 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   title   object
 1   NER     object
dtypes: object(2)
memory usage: 34.0+ MB


# Pre-processing
As in any NLP application, we need to do some pre-processing. As I am focusing on the embedding aspect of this project, I only did some light stemming (removing plurals), and deduplication of recipes that had the same ingredients.

### Stemming
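First, strip trailing `s` characters and lowercase each ingredient to reduce the token count. Note that `rstrip('s')` removes every trailing `s`, not just one, so some words are over-trimmed (e.g. `cranberries` becomes `cranberrie`).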

In [11]:
def pre_process(token_list):
  # strip all trailing lowercase 's' characters, then lowercase the token
  return [word.rstrip('s').lower() for word in token_list]
recipes_pdf['NER'] = recipes_pdf['NER'].apply(pre_process)

In [12]:
recipes_pdf['NER'].head()

0    [brown sugar, milk, vanilla, nut, butter, bite...
1    [beef, chicken breast, cream of mushroom soup,...
2    [frozen corn, cream cheese, butter, garlic pow...
3    [chicken, chicken gravy, cream of mushroom sou...
4    [peanut butter, graham cracker crumb, butter, ...
Name: NER, dtype: object
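
To see what the stemming step does to individual tokens, here is a minimal sketch comparing `rstrip('s')` with a more conservative variant that drops at most one trailing `s`. The helper `strip_one_trailing_s` is hypothetical and is not used anywhere in this notebook; it only illustrates the trade-off.

```python
def strip_all_trailing_s(token):
    # Same logic as pre_process above: remove every trailing 's', then lowercase.
    return token.rstrip('s').lower()


def strip_one_trailing_s(token):
    # Hypothetical alternative: lowercase first, then drop a single trailing 's',
    # leaving words that end in a double 's' untouched.
    token = token.lower()
    if token.endswith('s') and not token.endswith('ss'):
        return token[:-1]
    return token


examples = ['carrots', 'swiss', 'cranberries']
all_stripped = [strip_all_trailing_s(t) for t in examples]
one_stripped = [strip_one_trailing_s(t) for t in examples]
print(all_stripped)
print(one_stripped)
```

Neither version is a real stemmer: both still turn `cranberries` into `cranberrie`. The difference only shows up for words like `swiss`.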

## Remove duplicate recipes
There are many duplicate recipes. We can see this first by looking at recipe titles. Below are the titles ranked just under 2000th most common; even this far down the rankings, each title appears 59 times.

In [13]:
recipes_pdf['title'].value_counts().head(2000).tail()

Tuna Ball                   59
Easy Coconut Cake           59
Lemon Poppy Seed Muffins    59
Tomato Pudding              59
Mrs. Field'S Cookies        59
Name: title, dtype: int64

### Janky way to remove duplicates
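One way to remove duplicate recipes would be to find all recipes with the same set of ingredients. This would involve tokenizing and comparing ingredient lists. My lazy way is to concatenate all the characters of a recipe's ingredients, sort them, and use pandas `drop_duplicates` on the resulting string. Recipes with the same ingredients in a different order get the same string, at the cost of occasionally merging different recipes that happen to use the same letters.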

In [14]:
recipes_pdf['sorted_char'] = recipes_pdf['NER'].apply(lambda row: ''.join(sorted(''.join(row).replace(' ', ''))))


In [15]:
recipes_pdf['sorted_char'].value_counts().head()

                                                  574
aaaaabbdeeefggggiiikklllllmnnooprrrrsstttuuuvw    449
aefggiklllmorstu                                  336
aaegrrstuw                                        311
aaabbdeeefggggiikklllmnooprrrrsstttuuuw           307
Name: sorted_char, dtype: int64

In [16]:
recipes_pdf.loc[recipes_pdf['sorted_char'] == 'aaabbdeeefggggiikklllmnooprrrrsstttuuuw'].head()

Unnamed: 0,title,NER,sorted_char
175,Mom'S Pancakes,"[flour, baking powder, salt, sugar, egg, milk,...",aaabbdeeefggggiikklllmnooprrrrsstttuuuw
6031,Homemade Pancakes,"[flour, salt, sugar, egg, baking powder, milk,...",aaabbdeeefggggiikklllmnooprrrrsstttuuuw
7300,Plain Biscuits,"[flour, baking powder, sugar, salt, butter, mi...",aaabbdeeefggggiikklllmnooprrrrsstttuuuw
11873,Picklelets,"[egg, milk, baking powder, butter, sugar, flou...",aaabbdeeefggggiikklllmnooprrrrsstttuuuw
18121,Pancakes,"[baking powder, flour, sugar, salt, milk, egg,...",aaabbdeeefggggiikklllmnooprrrrsstttuuuw


In [17]:
recipes_pdf = recipes_pdf.drop_duplicates(subset = 'sorted_char')
recipes_pdf.shape

(1976599, 3)

Removed about 255k recipes (from 2,231,142 down to 1,976,599).
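
As a rough illustration of the trade-off in the sorted-character trick, the sketch below compares it with an order-independent key built from whole ingredient names. `ingredient_set_key` and the toy recipes are hypothetical; the notebook itself only uses the sorted-character key.

```python
def sorted_char_key(ingredients):
    # Same key as the notebook: join all names, drop spaces, sort the characters.
    return ''.join(sorted(''.join(ingredients).replace(' ', '')))


def ingredient_set_key(ingredients):
    # Hypothetical alternative: a hashable, order-independent set of whole names.
    return frozenset(ingredients)


recipe_a = ['flour', 'salt', 'egg']
recipe_b = ['egg', 'flour', 'salt']   # same ingredients, different order
recipe_c = ['peanut', 'butter']       # toy recipe
recipe_d = ['pea', 'nut butter']      # different ingredients, same letters

# Both keys treat reordered recipes as duplicates.
order_insensitive = sorted_char_key(recipe_a) == sorted_char_key(recipe_b)
# The character key cannot tell recipe_c from recipe_d, the set key can.
char_collision = sorted_char_key(recipe_c) == sorted_char_key(recipe_d)
set_collision = ingredient_set_key(recipe_c) == ingredient_set_key(recipe_d)
print(order_insensitive, char_collision, set_collision)
```

Collisions like the toy one above are probably rare in practice, which is why the lazy approach is good enough here.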

#### Get ingredient counts, for understanding how common an ingredient is

In [18]:
ingredient_count = Counter([token for row in recipes_pdf['NER'].values for token in row])

# Modeling
To build a model for ingredients, I wanted a way to embed ingredients in a vector space. The first thing that came to mind was Word2Vec, which does just that: each recipe is treated as a sentence and each ingredient as a word. Here I train embeddings with 16, 25, 50, and 100 dimensions. This is doable on a Google Colab CPU in a few minutes for the ~2M recipes (about 4 minutes of wall time for all four models in the run below).

### Train embeddings with different number of dimensions

In [59]:
%%time
# Note: `size` is the gensim 3.x argument name; gensim 4 renamed it to `vector_size`
recipe_w2v_100 = Word2Vec(recipes_pdf['NER'], min_count=10) # default is 100 dimensions
recipe_w2v_50 = Word2Vec(recipes_pdf['NER'], size=50, min_count=10)
recipe_w2v_25 = Word2Vec(recipes_pdf['NER'], size=25, min_count=10)
recipe_w2v_16 = Word2Vec(recipes_pdf['NER'], size=16, min_count=10)

CPU times: user 6min 36s, sys: 2.27 s, total: 6min 38s
Wall time: 4min 17s


We have about 19k ingredients after training:

In [21]:
# number of ingredients
len(recipe_w2v_16.wv.vocab)

18877

## Lower-dimensional embeddings look better on the surface than high-dimensions
When I initially trained these ingredient-embeddings, I found some surprising results, like `bone` showing up in the most similar ingredients for `carrot`. While this is somewhat understandable since people make lots of soups, this would make for some surprising ingredient replacements.

One hypothesis I had for this was that the default number of embedding dimensions is too high. Anthropic has an [interesting paper about how embeddings are able to represent more features than there are dimensions](https://transformer-circuits.pub/2022/toy_model/index.html#motivation) by using "superposition." The default number of dimensions in Gensim is 100, which implies the embedding could encode hundreds of features of flavor space. So I tested reducing the number of dimensions, and lo and behold, it seemed to fix the `carrot`-`bone` similarity problem. Below are examples of this for carrots and barley.
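
For some intuition on why more dimensions leave room for more features, here is a small sketch of my own (not taken from the paper, and not a test of the hypothesis): it draws random unit vectors and measures how much they overlap on average. The function names, vector count, and seed are arbitrary choices for illustration.

```python
import math
import random


def random_unit_vector(dim, rng):
    # Sample a Gaussian vector and normalize it to length 1.
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_abs_cosine(dim, n_vectors=60, seed=0):
    # Average |cosine similarity| over all pairs of random unit vectors.
    rng = random.Random(seed)
    vectors = [random_unit_vector(dim, rng) for _ in range(n_vectors)]
    total, pairs = 0.0, 0
    for i in range(n_vectors):
        for j in range(i + 1, n_vectors):
            total += abs(sum(a * b for a, b in zip(vectors[i], vectors[j])))
            pairs += 1
    return total / pairs


# Same dimensions as the embeddings trained above.
overlap_by_dim = {dim: mean_abs_cosine(dim) for dim in (16, 25, 50, 100)}
for dim, overlap in overlap_by_dim.items():
    print(dim, round(overlap, 3))
```

Random directions overlap less as the dimension grows, so a 100-dimensional space can hold many more nearly independent directions than a 16-dimensional one. Whether that is what produces the `bone` neighbors is still only a hypothesis.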

### Carrots
With the default number of dimensions (100), the most similar ingredients to `carrot` are `carrot` variants, but `beef bone` and `bone` show up in the top 10.

In [None]:
recipe_w2v_100.most_similar('carrot')

[('baby carrot', 0.8065677881240845),
 ('fresh carrot', 0.6772737503051758),
 ('peeled carrot', 0.6647123694419861),
 ('fresh baby carrot', 0.5763121843338013),
 ('grated carrot', 0.5363171696662903),
 ('beef bone', 0.5336546897888184),
 ('bone', 0.522489070892334),
 ('pearl barley', 0.4999176263809204),
 ('barley', 0.4961257576942444),
 ('only', 0.48451071977615356)]
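
The scores returned by `most_similar` are cosine similarities between embedding vectors. As a minimal sketch of that ranking, here is a pure-Python version on made-up 3-dimensional vectors; the vectors and the helper names are hypothetical, and the real embeddings have 16 to 100 dimensions.

```python
import math


def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_most_similar(query, vectors, topn=10):
    # Score every other ingredient against the query and sort best first.
    scores = [(name, cosine_similarity(vectors[query], vec))
              for name, vec in vectors.items() if name != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors, for illustration only.
toy_vectors = {
    'carrot': [1.0, 0.2, 0.0],
    'baby carrot': [0.9, 0.3, 0.1],
    'bone': [0.3, 0.1, 0.9],
}
ranked = toy_most_similar('carrot', toy_vectors)
print(ranked)
```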

With 25 dimensions, `bone` drops out of the top 10, but at rank 12 it is still surprisingly high.

In [None]:
recipe_w2v_25.most_similar('carrot' , topn=15)

[('baby carrot', 0.8326777219772339),
 ('fresh carrot', 0.7855518460273743),
 ('white turnip', 0.7434546947479248),
 ('pearl barley', 0.7420206665992737),
 ('parsley root', 0.7365614175796509),
 ('turnip', 0.7275972962379456),
 ('barley', 0.7271861433982849),
 ('sweet potato', 0.7265936136245728),
 ('swede', 0.7255061864852905),
 ('green lentil', 0.7225844860076904),
 ('parsnip', 0.7192715406417847),
 ('bone', 0.715190589427948),
 ('chunk', 0.7009220123291016),
 ('stalks of celery', 0.6996918320655823),
 ('lentil', 0.6953086256980896)]

With 16 dimensions, `bone` drops out of the top 20.

In [None]:
recipe_w2v_16.most_similar('carrot' , topn=20)

[('stalks celery', 0.8890523910522461),
 ('baby carrot', 0.8877623081207275),
 ('celery', 0.8522034287452698),
 ('pearl barley', 0.8518322706222534),
 ('barley', 0.8432409167289734),
 ('stalks of celery', 0.8340665698051453),
 ('celery stalk', 0.8294845223426819),
 ('lentil', 0.8264703154563904),
 ('stalk celery', 0.8254605531692505),
 ('green bean', 0.8253151178359985),
 ('green lentil', 0.8252203464508057),
 ('parsnip', 0.8249714374542236),
 ('dried lentil', 0.817329466342926),
 ('white cabbage', 0.8002745509147644),
 ('turnip', 0.7969602346420288),
 ('potato', 0.7907851338386536),
 ('fresh carrot', 0.7867233753204346),
 ('chunk', 0.7820314168930054),
 ('fresh green bean', 0.7777174711227417),
 ('white turnip', 0.7760671377182007)]

### Barley
We can see a similar situation with barley: with 100 dimensions, 4 of the 20 most similar ingredients are meat bones (`ham bone`, `beef soup bone`, `beef bone`, `bone`). With 16 dimensions, only 2 of the 20 are meat (`lamb stew meat`, `lamb shank`).

In [79]:
recipe_w2v_100.most_similar('barley', topn=20)

[('pearl barley', 0.9185601472854614),
 ('dried lentil', 0.7291703820228577),
 ('lentil', 0.7216474413871765),
 ('dry lentil', 0.6877376437187195),
 ('long grain brown rice', 0.6726171970367432),
 ('green lentil', 0.669456422328949),
 ('quick-cooking barley', 0.6392723321914673),
 ('brown lentil', 0.6360408067703247),
 ('frozen green bean', 0.6338297128677368),
 ('green split pea', 0.6311508417129517),
 ('ham bone', 0.628990888595581),
 ('beef soup bone', 0.6211417317390442),
 ('brown rice', 0.6149815320968628),
 ('alphabet pasta', 0.6042261719703674),
 ('cooking barley', 0.5910148024559021),
 ('frozen lima bean', 0.5896640419960022),
 ('beef bone', 0.5829190015792847),
 ('white bean', 0.5824376344680786),
 ('bone', 0.5795758962631226),
 ('vegetable bouillon', 0.5782933235168457)]

In [80]:
recipe_w2v_16.most_similar('barley', topn=20)

[('pearl barley', 0.9637734889984131),
 ('dried lentil', 0.925638735294342),
 ('lentil', 0.9177876710891724),
 ('dry lentil', 0.8902276754379272),
 ('brown lentil', 0.8776227235794067),
 ('green lentil', 0.8661330938339233),
 ('quick-cooking barley', 0.8623511791229248),
 ('carrot', 0.8284534215927124),
 ('lamb stew meat', 0.8240671753883362),
 ('long grain brown rice', 0.8050515651702881),
 ('stalks celery', 0.7979966402053833),
 ('dried great northern bean', 0.7950857281684875),
 ('yellow pea', 0.7908360362052917),
 ('white bean', 0.7860153913497925),
 ('split pea', 0.7850393056869507),
 ('chunk', 0.7847321033477783),
 ('green split pea', 0.7832866907119751),
 ('lamb shank', 0.7812250256538391),
 ('bulgur', 0.7739902138710022),
 ('stalks of celery', 0.771740734577179)]
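
The counts above were done by eye. A small sketch of how they could be checked programmatically, using the ingredient names from the two `barley` outputs; the helpers and the `MEAT_WORDS` list are hypothetical and hand-picked for this example.

```python
barley_100 = ['pearl barley', 'dried lentil', 'lentil', 'dry lentil',
              'long grain brown rice', 'green lentil', 'quick-cooking barley',
              'brown lentil', 'frozen green bean', 'green split pea', 'ham bone',
              'beef soup bone', 'brown rice', 'alphabet pasta', 'cooking barley',
              'frozen lima bean', 'beef bone', 'white bean', 'bone',
              'vegetable bouillon']
barley_16 = ['pearl barley', 'dried lentil', 'lentil', 'dry lentil',
             'brown lentil', 'green lentil', 'quick-cooking barley', 'carrot',
             'lamb stew meat', 'long grain brown rice', 'stalks celery',
             'dried great northern bean', 'yellow pea', 'white bean', 'split pea',
             'chunk', 'green split pea', 'lamb shank', 'bulgur', 'stalks of celery']

# Hand-picked keywords, an assumption made for this example only.
MEAT_WORDS = {'bone', 'meat', 'lamb', 'beef', 'ham'}


def count_meat(names):
    # Count names that contain at least one meat keyword as a whole word.
    return sum(1 for name in names if MEAT_WORDS & set(name.split(' ')))


def rank_of(target, names):
    # 1-based rank of target in the list, or None if it is absent.
    return names.index(target) + 1 if target in names else None


print(count_meat(barley_100), count_meat(barley_16))
print(rank_of('ham bone', barley_100), rank_of('ham bone', barley_16))
```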

# Finding new ingredients
Find ingredients that are similar in embedding space but not commonly used together.

In [22]:
def calc_cooccurrence(ingredient: str,
                      candidates,
                      recipes):
  ''' Count how many recipes contain both the ingredient and each candidate
    ingredient: str name of an ingredient ("apple")
    candidates: potential other ingredients ("orange")
    recipes: iterable of recipes, each a list of ingredient names
    returns: dict mapping each candidate to its co-occurrence count
  '''
  co_count = {}
  for candidate in candidates:
    co_count[candidate] = sum(candidate in recipe and ingredient in recipe for recipe in recipes)
  return co_count

In [23]:
def get_fusion_ingredients(ingredient: str,
                           recipe_model, # gensim model
                           recipes, # pandas Series of recipes (each a list of ingredient names)
                           ingredient_count: dict,
                           max_candidates = 20,
                           min_occurence_factor = 100 # a candidate must co-occur more than (max co-occurrence / this factor) times
                           ):
  ''' Return the top 5 ingredients that are similar to `ingredient` but rarely used with it '''
  # keep only the recipes that contain the ingredient
  ingredient_recipes = recipes.loc[recipes.apply(lambda row: ingredient in row)]

  ingredient_candidates = recipe_model.most_similar(ingredient, topn=50) # get top similar ingredients
  candidate_names = list(zip(*ingredient_candidates))[0]
  # drop re-phrasings whose name contains the ingredient (e.g. "gala apple" for "apple")
  pruned_candidates = [candidate for candidate in candidate_names if ingredient not in candidate][:max_candidates]
  cooccurrence_counts = calc_cooccurrence(ingredient, candidate_names, ingredient_recipes) # get counts for normalization
  # co-occurrence threshold, relative to the most frequently co-occurring candidate
  min_occurences = max(cooccurrence_counts.values()) / min_occurence_factor
  print(min_occurences) # debugging output
  # final score for sorting: similarity / (co-occurrence count + 1) / total occurrences
  freq_norm_candidates = {name: similarity / (cooccurrence_counts[name] + 1) / ingredient_count[name]
                          for name, similarity in ingredient_candidates
                          if name in pruned_candidates and cooccurrence_counts[name] > min_occurences}
  top_candidates = sorted(freq_norm_candidates.items(), key=lambda x: x[1])[-5:] # ascending, so the best score is last
  return top_candidates, cooccurrence_counts, ingredient_candidates # return multiple for debugging

In [None]:
%%time
get_fusion_ingredients('orange', recipe_w2v_25, recipes_pdf['NER'], ingredient_count)

37.84
CPU times: user 2.8 s, sys: 122 ms, total: 2.92 s
Wall time: 3.18 s


([('tangerine', 1.4669418242644563e-05),
  ('red grapefruit', 1.9075492737998423e-05),
  ('plum', 2.4063287866286987e-05),
  ('clementine', 3.4001893315020326e-05),
  ('kumquat', 3.4480510623283614e-05)],
 {'tangerine': 96,
  'clementine': 56,
  'kumquat': 60,
  'grapefruit': 178,
  'red grapefruit': 119,
  'fresh squeezed orange juice': 58,
  'pear': 224,
  'pink grapefruit': 152,
  'tangerine juice': 16,
  'pomegranate': 90,
  'pomegranate aril': 11,
  'carrot juice': 14,
  'orange slice': 113,
  'pomegranate juice': 108,
  'firm pear': 9,
  'green apple': 99,
  'orange juice': 3784,
  'cranberrie': 1693,
  'valencia': 3,
  'fresh orange': 9,
  'freshly squeezed orange juice': 222,
  'pomegranate seed': 123,
  'persimmon': 15,
  'plum': 46,
  'bosc pear': 6,
  'cranberry juice': 269,
  'valencia orange': 5,
  'red plum': 6,
  'fresh juice': 4,
  'pear nectar': 0,
  'purple': 16,
  'grape juice': 76,
  'orange blossom honey': 27,
  'mandarin': 10,
  'fresh-squeezed orange juice': 5,
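
To make the scoring concrete, here is a small worked sketch of the threshold and the score used in `get_fusion_ingredients`. The co-occurrence counts are copied from the `orange` output above; the similarity of 0.8 and the total count of 1000 are made-up numbers, since the real values are not shown.

```python
def passes_threshold(co_count, max_co_count, min_occurence_factor=100):
    # A candidate must co-occur more often than max co-occurrence / factor.
    return co_count > max_co_count / min_occurence_factor


def fusion_score(similarity, co_count, total_count):
    # Same formula as above: similarity / (co-occurrence + 1) / total occurrences.
    return similarity / (co_count + 1) / total_count


# Co-occurrence counts with 'orange', taken from the output above.
co_counts = {'orange juice': 3784, 'tangerine': 96, 'plum': 46, 'persimmon': 15}
max_co = max(co_counts.values())
threshold = max_co / 100  # matches the 37.84 printed by the function
kept = [name for name, count in co_counts.items() if passes_threshold(count, max_co)]

# Hypothetical similarity and total count, identical for both candidates,
# to isolate the effect of the co-occurrence count on the score.
score_tangerine = fusion_score(0.8, co_counts['tangerine'], 1000)
score_plum = fusion_score(0.8, co_counts['plum'], 1000)
print(threshold, kept, score_plum > score_tangerine)
```

In the real function `orange juice` passes the threshold but is still excluded, because its name contains `orange`. With everything else equal, the candidate that co-occurs less often (`plum`) gets the higher score.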
 

# Save outputs

## To drive

In [None]:
recipe_w2v_file = 'recipe_w2v.gensim'
recipe_w2v_100.save(folder_loc + recipe_w2v_file)

In [None]:
recipe_w2v_file_50 = 'recipe_w2v_50.gensim'
recipe_w2v_50.save(folder_loc + recipe_w2v_file_50)

In [None]:
recipe_w2v_file_25 = 'recipe_w2v_25.gensim'
recipe_w2v_25.save(folder_loc + recipe_w2v_file_25)

In [None]:
recipe_w2v_file_16 = 'recipe_w2v_16.gensim'
recipe_w2v_16.save(folder_loc + recipe_w2v_file_16)

In [None]:
ingredient_file = 'ingredient_count.json'
with open(folder_loc + ingredient_file, 'w') as ingredient_stream:
  json.dump(ingredient_count, ingredient_stream)
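
One thing to keep in mind when reading this file back: JSON has no `Counter` type, so the counts come back as a plain `dict`. A minimal sketch of the round trip, using a toy counter in place of the real `ingredient_count`:

```python
import json
from collections import Counter

# Toy stand-in for the real ingredient_count.
toy_count = Counter(['salt', 'salt', 'sugar'])

# Counter is a dict subclass, so it serializes like a regular dict.
serialized = json.dumps(toy_count)

# Loading gives a plain dict, which has no most_common method.
loaded = json.loads(serialized)

# Wrapping it in Counter restores methods such as most_common.
restored = Counter(loaded)
print(type(loaded).__name__, restored.most_common(1))
```

This matters for code like `get_w2v_duplicate_map`, which calls `most_common` on the counts.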

In [None]:
NER_tsv = 'recipe_NER.tsv'
recipes_pdf['NER'].to_csv(folder_loc + NER_tsv, sep='\t')

KeyboardInterrupt: ignored

## To Hugging Face

In [None]:
from huggingface_hub import HfApi
api = HfApi()

In [None]:
api.upload_file(
    path_or_fileobj=folder_loc + recipe_w2v_file,
    path_in_repo=recipe_w2v_file,
    repo_id="map222/recipe-spice",
    repo_type="space",
)

In [None]:
api.upload_file(
    path_or_fileobj=folder_loc + ingredient_file,
    path_in_repo=ingredient_file,
    repo_id="map222/recipe-spice",
    repo_type="space",
)

In [None]:
api.upload_file(
    path_or_fileobj=folder_loc + NER_tsv,
    path_in_repo=NER_tsv,
    repo_id="map222/recipe-spice",
    repo_type="space",
)

# Some cleaning that seems too finicky

In [24]:
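def get_w2v_duplicate_map(ingredient_embedding, ingredient_count_local):
  ''' Map each of the 100 most common ingredients to its near-duplicate variants
    (similar ingredients whose name contains it as a whole word, e.g. "kosher salt" for "salt")
  '''
  top_ingredients = list(zip(*ingredient_count_local.most_common(100)))[0]
  ingredient_map = {}
  for top_ingredient in top_ingredients:
    similar_ingredients = list(zip(*ingredient_embedding.most_similar(top_ingredient, topn=20)))[0]
    # multi-word ingredients like "olive oil" never match, since names are compared against single words
    ingredient_map[top_ingredient] = [ingredient for ingredient in similar_ingredients if top_ingredient in ingredient.split(' ')]
  return ingredient_map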

In [25]:
test = get_w2v_duplicate_map(recipe_w2v_16, ingredient_count)

In [27]:
test

{'salt': ['coarse salt', 'kosher salt'],
 'sugar': ['white sugar',
  'granulated sugar',
  'cane sugar',
  'turbinado sugar',
  'light brown sugar',
  'demerara sugar',
  'brown sugar',
  'muscovado sugar',
  'powdered sugar'],
 'egg': ['egg white', 'egg substitute'],
 'butter': ['unsalted butter',
  'sweet butter',
  'cold butter',
  'light butter',
  'sweet cream butter'],
 'onion': ['yellow onion',
  'white onion',
  'spanish onion',
  'sweet onion',
  'brown onion',
  'cooking onion',
  'vidalia onion',
  'white pearl onion',
  'sweet yellow onion'],
 'flour': ['all-purpose flour', 'white flour', 'pastry flour'],
 'garlic': ['clove garlic',
  'cloves garlic',
  'fresh garlic',
  'clove of garlic'],
 'water': ['boiling water', 'hot water', 'cold water'],
 'milk': ['low-fat milk',
  'hot milk',
  'sweet milk',
  'nonfat milk',
  'warm milk'],
 'vanilla': ['vanilla extract'],
 'olive oil': [],
 'pepper': ['black pepper',
  'ground pepper',
  'ground black pepper',
  'lemon pepper',
  

In [47]:
reverse_map = {token:key for key,tokens in test.items() for token in tokens}

In [48]:
recipes_pdf['NER_deduped'] = recipes_pdf['NER'].apply(lambda row: [reverse_map.get(token, token) for token in row])
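
The two cells above can be sketched on a tiny example. The variant map below is a subset of the output shown earlier; the helper names and the toy recipe are hypothetical.

```python
def build_reverse_map(variant_map):
    # Invert {canonical: [variants]} into {variant: canonical}.
    return {variant: canonical
            for canonical, variants in variant_map.items()
            for variant in variants}


def canonicalize(recipe, reverse_map):
    # Replace known variants with their canonical name, keep everything else.
    return [reverse_map.get(token, token) for token in recipe]


# Subset of the duplicate map printed above.
variant_map = {
    'salt': ['coarse salt', 'kosher salt'],
    'pepper': ['black pepper', 'lemon pepper'],
    'olive oil': [],
}
toy_reverse_map = build_reverse_map(variant_map)
toy_recipe = ['kosher salt', 'lemon pepper', 'olive oil']
deduped = canonicalize(toy_recipe, toy_reverse_map)
print(deduped)
```

Note that `lemon pepper` collapses to plain `pepper`, which throws away real information. Cases like this are a good part of why this cleaning feels finicky.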

In [50]:
recipes_pdf['sorted_char'] = recipes_pdf['NER_deduped'].apply(lambda row: ''.join(sorted(''.join(row).replace(' ', ''))))

In [51]:
recipes_pdf = recipes_pdf.drop_duplicates(subset = 'sorted_char')
recipes_pdf.shape

(1908628, 4)

In [52]:
recipe_w2v_16_dedupe = Word2Vec(recipes_pdf['NER_deduped'], size=16, min_count=10)

In [77]:
recipe_w2v_16_dedupe.most_similar('barley', topn=20)

[('pearl barley', 0.9491546154022217),
 ('lentil', 0.8735455274581909),
 ('dry lentil', 0.871431827545166),
 ('green split pea', 0.8645519614219666),
 ('dried lentil', 0.8634966611862183),
 ('long grain brown rice', 0.8455582857131958),
 ('dried great northern bean', 0.8414177298545837),
 ('green lentil', 0.8296129107475281),
 ('yellow pea', 0.8275840282440186),
 ('brown lentil', 0.822412371635437),
 ('split pea', 0.8048617839813232),
 ('cranberry bean', 0.7999374866485596),
 ('alphabet pasta', 0.7965269088745117),
 ('potato', 0.7870755195617676),
 ('swede', 0.78580641746521),
 ('quick-cooking barley', 0.7856820225715637),
 ('ditalini', 0.7836834192276001),
 ('white bean', 0.7797808647155762),
 ('carrot', 0.7792209982872009),
 ('beef soup bone', 0.7789251804351807)]

In [78]:
recipe_w2v_16.most_similar('barley', topn=20)

[('pearl barley', 0.9637734889984131),
 ('dried lentil', 0.925638735294342),
 ('lentil', 0.9177876710891724),
 ('dry lentil', 0.8902276754379272),
 ('brown lentil', 0.8776227235794067),
 ('green lentil', 0.8661330938339233),
 ('quick-cooking barley', 0.8623511791229248),
 ('carrot', 0.8284534215927124),
 ('lamb stew meat', 0.8240671753883362),
 ('long grain brown rice', 0.8050515651702881),
 ('stalks celery', 0.7979966402053833),
 ('dried great northern bean', 0.7950857281684875),
 ('yellow pea', 0.7908360362052917),
 ('white bean', 0.7860153913497925),
 ('split pea', 0.7850393056869507),
 ('chunk', 0.7847321033477783),
 ('green split pea', 0.7832866907119751),
 ('lamb shank', 0.7812250256538391),
 ('bulgur', 0.7739902138710022),
 ('stalks of celery', 0.771740734577179)]