<a href="https://colab.research.google.com/github/map222/Kibbeh/blob/deploy_space/ingredient_finder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ingredient recommender
This notebook may not auto-save. To save, you need to go to File menu, then save to github. You can either push directly to master, or create a branch


## Setup:


### Google drive
First, mount files from Google drive (to copy the file to your share folder, click [this link](https://drive.google.com/drive/folders/1fh5C0Wlda0QMzBXqOj6znhS8SlsCZm9O?usp=sharing)):

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [3]:
folder_loc = r'/content/gdrive/My Drive/Colab Notebooks/data/recipes/'

### Hugging Face

In [5]:
!pip install huggingface_hub

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting huggingface_hub
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
[K     |████████████████████████████████| 163 kB 5.4 MB/s 
Installing collected packages: huggingface-hub
Successfully installed huggingface-hub-0.10.1


In [6]:
from huggingface_hub import notebook_login
notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token


### Imports

In [33]:
import pandas as pd
import numpy as np
from gensim.models import KeyedVectors, Word2Vec
import warnings
warnings.filterwarnings('ignore')
from collections import defaultdict, Counter
import json

In [12]:
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from ast import literal_eval
fs = 22
font = {'family' : 'normal',
        'weight' : 'normal',
        'size'   : fs}

matplotlib.rc('font', **font)
plt.rc('xtick', labelsize=fs-6)
plt.rc('ytick', labelsize=fs-6)

## Load recipes

In [13]:
test = pd.read_csv(folder_loc + 'RecipeNLG_dataset.csv', nrows = 5, sep = ',', index_col = 0)

In [14]:
test.head()

Unnamed: 0,title,ingredients,directions,link,source,NER
0,No-Bake Nut Cookies,"[""1 c. firmly packed brown sugar"", ""1/2 c. eva...","[""In a heavy 2-quart saucepan, mix brown sugar...",www.cookbooks.com/Recipe-Details.aspx?id=44874,Gathered,"[""brown sugar"", ""milk"", ""vanilla"", ""nuts"", ""bu..."
1,Jewell Ball'S Chicken,"[""1 small jar chipped beef, cut up"", ""4 boned ...","[""Place chipped beef on bottom of baking dish....",www.cookbooks.com/Recipe-Details.aspx?id=699419,Gathered,"[""beef"", ""chicken breasts"", ""cream of mushroom..."
2,Creamy Corn,"[""2 (16 oz.) pkg. frozen corn"", ""1 (8 oz.) pkg...","[""In a slow cooker, combine all ingredients. C...",www.cookbooks.com/Recipe-Details.aspx?id=10570,Gathered,"[""frozen corn"", ""cream cheese"", ""butter"", ""gar..."
3,Chicken Funny,"[""1 large whole chicken"", ""2 (10 1/2 oz.) cans...","[""Boil and debone chicken."", ""Put bite size pi...",www.cookbooks.com/Recipe-Details.aspx?id=897570,Gathered,"[""chicken"", ""chicken gravy"", ""cream of mushroo..."
4,Reeses Cups(Candy),"[""1 c. peanut butter"", ""3/4 c. graham cracker ...","[""Combine first four ingredients and press in ...",www.cookbooks.com/Recipe-Details.aspx?id=659239,Gathered,"[""peanut butter"", ""graham cracker crumbs"", ""bu..."


In [15]:
%%time

recipes_pdf = pd.read_csv(folder_loc + 'RecipeNLG_dataset.csv', sep = ',',usecols = ['title', 'NER'], converters={"NER": literal_eval})

CPU times: user 1min 2s, sys: 6.84 s, total: 1min 9s
Wall time: 1min 21s


In [16]:
recipes_pdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2231142 entries, 0 to 2231141
Data columns (total 2 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   title   object
 1   NER     object
dtypes: object(2)
memory usage: 34.0+ MB


## Pre-processing

### Stemming
First, remove trailing `s`s to reduce token count

In [17]:
def pre_process(token_list):
  return [word.rstrip('s').lower() for word in token_list]
recipes_pdf['NER'] = recipes_pdf['NER'].apply(pre_process)

In [18]:
recipes_pdf['NER'].head()

0    [brown sugar, milk, vanilla, nut, butter, bite...
1    [beef, chicken breast, cream of mushroom soup,...
2    [frozen corn, cream cheese, butter, garlic pow...
3    [chicken, chicken gravy, cream of mushroom sou...
4    [peanut butter, graham cracker crumb, butter, ...
Name: NER, dtype: object

### Remove duplicate recipes
There are many duplicate recipes. We can see that at first by looking at recipe names. Below are the 2000th most common recipe titles, and we can see even this low in the rankings there are many duplicates.

In [19]:
recipes_pdf['title'].value_counts().head(2000).tail()

Tuna Ball                   59
Easy Coconut Cake           59
Lemon Poppy Seed Muffins    59
Tomato Pudding              59
Mrs. Field'S Cookies        59
Name: title, dtype: int64

#### Janky way to remove duplicates
One way to remove duplicate recipes would be to find all recipes with the same set of ingredients. This would involve tokenizing stuff. My lazy way to do this is to just concatenate all the character together, and use pandas `drop_duplicates`.

In [20]:
recipes_pdf['sorted_char'] = recipes_pdf['NER'].apply(lambda row: ''.join(sorted(''.join(row).replace(' ', ''))))


In [21]:
recipes_pdf['sorted_char'].value_counts().head()

                                                  574
aaaaabbdeeefggggiiikklllllmnnooprrrrsstttuuuvw    449
aefggiklllmorstu                                  336
aaegrrstuw                                        311
aaabbdeeefggggiikklllmnooprrrrsstttuuuw           307
Name: sorted_char, dtype: int64

In [22]:
recipes_pdf.loc[recipes_pdf['sorted_char'] == 'aaabbdeeefggggiikklllmnooprrrrsstttuuuw'].head()

Unnamed: 0,title,NER,sorted_char
175,Mom'S Pancakes,"[flour, baking powder, salt, sugar, egg, milk,...",aaabbdeeefggggiikklllmnooprrrrsstttuuuw
6031,Homemade Pancakes,"[flour, salt, sugar, egg, baking powder, milk,...",aaabbdeeefggggiikklllmnooprrrrsstttuuuw
7300,Plain Biscuits,"[flour, baking powder, sugar, salt, butter, mi...",aaabbdeeefggggiikklllmnooprrrrsstttuuuw
11873,Picklelets,"[egg, milk, baking powder, butter, sugar, flou...",aaabbdeeefggggiikklllmnooprrrrsstttuuuw
18121,Pancakes,"[baking powder, flour, sugar, salt, milk, egg,...",aaabbdeeefggggiikklllmnooprrrrsstttuuuw


In [23]:
recipes_pdf = recipes_pdf.drop_duplicates(subset = 'sorted_char')
recipes_pdf.shape

(1976599, 3)

Removed 20k recipes

#### Get ingredient counts, for understanding how common an ingredient is

In [28]:
ingredient_count = Counter([token for row in recipes_pdf['NER'].values for token in row])

## Build W2V model

### Train embeddings with different number of dimensions

In [75]:
%%time
recipe_w2v_100 = Word2Vec(recipes_pdf['NER'], min_count=10)
recipe_w2v_50 = Word2Vec(recipes_pdf['NER'], size=50, min_count=10)
recipe_w2v_25 = Word2Vec(recipes_pdf['NER'], size=25, min_count=10)
recipe_w2v_16 = Word2Vec(recipes_pdf['NER'], size=16, min_count=10)

CPU times: user 3min 4s, sys: 1.05 s, total: 3min 5s
Wall time: 1min 46s


In [72]:
# number of ingredients
len(recipe_w2v.wv.vocab)

31124

### Lower-dimensional embeddings look better on the surface than high-dimensions
When I initially trained these ingredient-embeddings, I found some surprising results, like `bone` showing up in the most similar ingredients for `carrot`. While this is somewhat understandable since people make lots of soups, this would make for some surprising ingredient replacements.

One hypothesis I had for this was that the default number of embeddings dimensions is too high. Anthropic AI has an [interesting paper about how embeddings are able to represent more features than there are dimensions](https://transformer-circuits.pub/2022/toy_model/index.html#motivation) by using "superposition." The default number of dimensions for GenSim is 100, which implies it could encode hundreds of dimensions in flavor space. So I tested reducing the number of dimensions of the embeddings, and lo and behold it seemed to fix the `carrot-bone` similarity problem. Below are examples of this for carrots and apples. 

#### Carrots
With the default number of dimensions (100), the most similar ingredients  to `carrot` are `carrot` variants, but `bone` and `beef bone` show up in the top 10.

In [55]:
recipe_w2v.most_similar('carrot')

[('baby carrot', 0.8065677881240845),
 ('fresh carrot', 0.6772737503051758),
 ('peeled carrot', 0.6647123694419861),
 ('fresh baby carrot', 0.5763121843338013),
 ('grated carrot', 0.5363171696662903),
 ('beef bone', 0.5336546897888184),
 ('bone', 0.522489070892334),
 ('pearl barley', 0.4999176263809204),
 ('barley', 0.4961257576942444),
 ('only', 0.48451071977615356)]

25 dimensions pushes `bone` out of the top 10, but still surprisingly high.

In [77]:
recipe_w2v_25.most_similar('carrot' , topn=15)

[('baby carrot', 0.8326777219772339),
 ('fresh carrot', 0.7855518460273743),
 ('white turnip', 0.7434546947479248),
 ('pearl barley', 0.7420206665992737),
 ('parsley root', 0.7365614175796509),
 ('turnip', 0.7275972962379456),
 ('barley', 0.7271861433982849),
 ('sweet potato', 0.7265936136245728),
 ('swede', 0.7255061864852905),
 ('green lentil', 0.7225844860076904),
 ('parsnip', 0.7192715406417847),
 ('bone', 0.715190589427948),
 ('chunk', 0.7009220123291016),
 ('stalks of celery', 0.6996918320655823),
 ('lentil', 0.6953086256980896)]

16 dimensions gets it out of the top 20

In [76]:
recipe_w2v_16.most_similar('carrot' , topn=20)

[('stalks celery', 0.8890523910522461),
 ('baby carrot', 0.8877623081207275),
 ('celery', 0.8522034287452698),
 ('pearl barley', 0.8518322706222534),
 ('barley', 0.8432409167289734),
 ('stalks of celery', 0.8340665698051453),
 ('celery stalk', 0.8294845223426819),
 ('lentil', 0.8264703154563904),
 ('stalk celery', 0.8254605531692505),
 ('green bean', 0.8253151178359985),
 ('green lentil', 0.8252203464508057),
 ('parsnip', 0.8249714374542236),
 ('dried lentil', 0.817329466342926),
 ('white cabbage', 0.8002745509147644),
 ('turnip', 0.7969602346420288),
 ('potato', 0.7907851338386536),
 ('fresh carrot', 0.7867233753204346),
 ('chunk', 0.7820314168930054),
 ('fresh green bean', 0.7777174711227417),
 ('white turnip', 0.7760671377182007)]

### Another ingredient

In [120]:
recipe_w2v_100.most_similar('oregano', topn=20)

[('ground oregano', 0.7718523740768433),
 ('leaf oregano', 0.7263194918632507),
 ('oregano flake', 0.7059062719345093),
 ('fresh oregano', 0.6925216913223267),
 ('oregano leave', 0.6920971870422363),
 ('oregeno', 0.6829726696014404),
 ('italian seasoning', 0.6229325532913208),
 ('italian spice', 0.5812777876853943),
 ('oregano dried', 0.5680622458457947),
 ('marjoram', 0.5490894913673401),
 ('italian herb seasoning', 0.5356825590133667),
 ('italian herb', 0.513263463973999),
 ('dried oregano', 0.5126845240592957),
 ('tomato paste', 0.5118139982223511),
 ('tomato sauce', 0.5104940533638),
 ('mixed italian herb', 0.5072267651557922),
 ('tomatoe', 0.4902517795562744),
 ('italian tomatoe', 0.47530269622802734),
 ('dried marjoram', 0.4749998450279236),
 ('fennel seed', 0.47368210554122925)]

In [121]:
recipe_w2v_16.most_similar('oregano', topn=20)

[('oregano dried', 0.9083280563354492),
 ('sweet basil', 0.904662013053894),
 ('dried basil', 0.9041186571121216),
 ('dried leaf basil', 0.8976918458938599),
 ('tomato paste', 0.8945822715759277),
 ('tomato sauce', 0.8814541101455688),
 ('basil', 0.8676785230636597),
 ('ground oregano', 0.8636078834533691),
 ('italian seasoning', 0.8266404271125793),
 ('tomatoe', 0.8259135484695435),
 ('oregano flake', 0.8163722157478333),
 ('tomato puree', 0.8131784200668335),
 ('ground basil', 0.8062314987182617),
 ('oregano ground', 0.7994402647018433),
 ('leaf oregano', 0.7947260141372681),
 ('fresh oregano', 0.7930879592895508),
 ('dried marjoram', 0.7881389260292053),
 ('lean ground lamb', 0.7843124866485596),
 ('italian plum', 0.7773176431655884),
 ('marjoram', 0.7710994482040405)]

# The ingredient finding part
Find ingredients that are similar in embedding space, but not used together commonly

In [29]:
def calc_cooccurrence(ingredient: str,
                      candidates,
                      recipes):
  ''' Calc how often the top ingredient co-occurs with the candidates
    - also removes candidates that are re-phrase of ingredient (e.g. "beef" and "ground beef")
    ingredient: str name of an ingredient ("apple")
    candidates: potential other ingredients ("orange")
    recipes: iterable of possible ingredients
  '''


  co_count = {}
  for candidate in candidates:
    co_count[candidate] = sum([candidate in recipe and ingredient in recipe for recipe in recipes])
  return co_count

In [30]:
def get_fusion_ingredients(ingredient: str,
                           recipe_model, #gensim model
                           recipes, #iterable of recipes
                           ingredient_count: dict,
                           max_candidates = 20,
                           min_occurence_factor = 100 # minimum number of recipes an ingredient has to be in
                           ):

  ingredient_recipes = recipes.loc[recipes.apply(lambda row: ingredient in row)]

  ingredient_candidates = recipe_model.most_similar(ingredient, topn=50) # get top similar ingredients
  candidate_names = list(zip(*ingredient_candidates))[0]
  pruned_candidates = [candidate for candidate in candidate_names if ingredient not in candidate][:max_candidates] # clean up candidates to remove duplicates (e.g. "gala apple")
  cooccurrence_counts = calc_cooccurrence(ingredient, candidate_names, ingredient_recipes) # get counts for normalization
  # final score for sorting: similarity / how often co-occur / total occurences
  min_occurences = max(cooccurrence_counts.values()) / min_occurence_factor
  print(min_occurences)
  freq_norm_candidates = {candidate[0]: candidate[1] / (cooccurrence_counts[candidate[0]]+1) / ingredient_count[candidate[0]] for candidate in ingredient_candidates if candidate[0] in pruned_candidates and cooccurrence_counts[candidate[0]] > min_occurences}
  top_candidates = sorted([(k,v) for k,v in freq_norm_candidates.items()], key=lambda x: x[1])[-5:]
  return top_candidates, cooccurrence_counts, ingredient_candidates # return multiple for debugging

In [60]:
%%time
get_fusion_ingredients('orange', recipe_w2v_25, recipes_pdf['NER'], ingredient_count)

37.84
CPU times: user 2.8 s, sys: 122 ms, total: 2.92 s
Wall time: 3.18 s


([('tangerine', 1.4669418242644563e-05),
  ('red grapefruit', 1.9075492737998423e-05),
  ('plum', 2.4063287866286987e-05),
  ('clementine', 3.4001893315020326e-05),
  ('kumquat', 3.4480510623283614e-05)],
 {'tangerine': 96,
  'clementine': 56,
  'kumquat': 60,
  'grapefruit': 178,
  'red grapefruit': 119,
  'fresh squeezed orange juice': 58,
  'pear': 224,
  'pink grapefruit': 152,
  'tangerine juice': 16,
  'pomegranate': 90,
  'pomegranate aril': 11,
  'carrot juice': 14,
  'orange slice': 113,
  'pomegranate juice': 108,
  'firm pear': 9,
  'green apple': 99,
  'orange juice': 3784,
  'cranberrie': 1693,
  'valencia': 3,
  'fresh orange': 9,
  'freshly squeezed orange juice': 222,
  'pomegranate seed': 123,
  'persimmon': 15,
  'plum': 46,
  'bosc pear': 6,
  'cranberry juice': 269,
  'valencia orange': 5,
  'red plum': 6,
  'fresh juice': 4,
  'pear nectar': 0,
  'purple': 16,
  'grape juice': 76,
  'orange blossom honey': 27,
  'mandarin': 10,
  'fresh-squeezed orange juice': 5,
 

In [61]:
a,b,c = get_fusion_ingredients('apple', recipe_w2v_25, recipes_pdf['NER'], ingredient_count,20)

46.19


In [62]:
a

[('yam', 4.598554950495624e-06),
 ('craisin', 1.3405399475123224e-05),
 ('bartlett', 1.4835682855074244e-05),
 ('acorn squash', 2.186774596344618e-05),
 ('frozen cranberrie', 2.3645754286065995e-05)]

In [63]:
fish_sauce = get_fusion_ingredients('fish sauce', recipe_w2v, recipes_pdf['NER'],ingredient_count, 10)

6.76


In [64]:
fish_sauce[0]

[('rice vermicelli', 1.4720903420300217e-05),
 ('fresh lemongra', 1.538415095900283e-05),
 ('lemongra', 1.8311396896240826e-05),
 ('sweet soy sauce', 0.00011322733184419209),
 ('mani', 0.00011857803910970688)]

In [None]:
fish_sauce[2]

[('asian fish sauce', 0.7992143630981445),
 ('rice noodle', 0.7358037829399109),
 ('nuoc', 0.7349714040756226),
 ('lemon gra', 0.727480411529541),
 ('mani', 0.7231417298316956),
 ('sweet soy sauce', 0.7184907793998718),
 ('stalk lemongra', 0.7184152603149414),
 ('ketjap mani', 0.7102898359298706),
 ('stalks lemongra', 0.710004448890686),
 ('rice vermicelli', 0.7045285701751709),
 ('fresh lemongra', 0.7023175954818726),
 ('lemongra', 0.700231671333313),
 ('nuoc nam', 0.6911463141441345),
 ('lime leaf', 0.6892822980880737),
 ('red chili paste', 0.6854043006896973),
 ('fresh galangal', 0.6847156882286072),
 ('bird chile', 0.6795110106468201),
 ('lime leave', 0.6791095733642578),
 ('fishsauce', 0.677825927734375),
 ('red curry', 0.6775223016738892),
 ('shrimp paste', 0.6735805869102478),
 ('tamarind juice', 0.672806441783905),
 ('oyster sauce', 0.6725000739097595),
 ('red bird', 0.6667814254760742),
 ('cilantro root', 0.6667782068252563),
 ('ﬁsh sauce', 0.6667441725730896),
 ('lemongrass s

In [None]:
a,b,c = get_fusion_ingredients('carrot', recipe_w2v, recipes_pdf['NER'], ingredient_count, 10)

23.51


In [None]:
a

[('bone', 3.4262489299384916e-05),
 ('beef bone', 6.591575503591108e-05),
 ('kale leave', 7.309549723339305e-05),
 ('chunk', 8.086041287258938e-05),
 ('soup bone', 0.00023302498216531717)]

In [69]:
a,b,c = get_fusion_ingredients('celery', recipe_w2v_25, recipes_pdf['NER'], ingredient_count, 10, 200)

145.195


In [70]:
a

[('pea', 2.6550650463943598e-08),
 ('red potatoe', 7.504685987482834e-08),
 ('pimiento', 8.571710535817896e-08),
 ('head cabbage', 3.9551559040775317e-07),
 ('green cabbage', 5.220314260075893e-07)]

In [None]:
ingredient_count['kielbasa'], ingredient_count['rutabaga']

(1680, 356)

In [None]:
b['kielbasa'], b['rutabaga']

(204, 33)

In [67]:
c

[('stalks celery', 0.9407098889350891),
 ('celery stalk', 0.8794925212860107),
 ('stalk celery', 0.8570525646209717),
 ('stalks of celery', 0.7485243082046509),
 ('celery rib', 0.5456163883209229),
 ('stalk of celery', 0.5421972274780273),
 ('fresh celery', 0.5250746607780457),
 ('celery heart', 0.5132310390472412),
 ('head green cabbage', 0.49086734652519226),
 ('cabbage', 0.4817184507846832),
 ('green cabbage', 0.47366559505462646),
 ('celery root', 0.47208115458488464),
 ('celery salt', 0.4491557776927948),
 ('celery top', 0.4421187937259674),
 ('head of cabbage', 0.43540793657302856),
 ('sweet pea', 0.4314376711845398),
 ('head of green cabbage', 0.4291992783546448),
 ('pea', 0.4287971258163452),
 ('kielbasa', 0.4215185046195984),
 ('wild rice', 0.41996079683303833),
 ('red potato', 0.419281929731369),
 ('turkey kielbasa', 0.4128633141517639),
 ('head cabbage', 0.41173380613327026),
 ('red cabbage', 0.411342978477478),
 ('rutabaga', 0.4067332148551941),
 ('turkey', 0.40327382087707

In [68]:
recipes_pdf.loc[(recipes_pdf['NER'].apply(lambda row: 'celery' in row)) & (recipes_pdf['NER'].apply(lambda row: 'kielbasa' in row)), 'title'].value_counts()

Jambalaya                         7
Kielbasa Bean Soup                4
Bean Soup                         3
Kielbasa Soup                     2
Sausage Stew                      2
                                 ..
Oven Baked Split Pea Soup         1
Turkey And Sausage Jambalaya      1
Kielbasa And Lentil Soup          1
Creole Stuffed Shrimp             1
Incredible 20 Minute Bean Soup    1
Name: title, Length: 184, dtype: int64

# Save outputs

## To drive

In [122]:
recipe_w2v_file = 'recipe_w2v.gensim'
recipe_w2v.save(folder_loc + recipe_w2v_file)

In [123]:
recipe_w2v_file_50 = 'recipe_w2v_50.gensim'
recipe_w2v_50.save(folder_loc + recipe_w2v_file_50)

In [124]:
recipe_w2v_file_25 = 'recipe_w2v_25.gensim'
recipe_w2v_25.save(folder_loc + recipe_w2v_file_25)

In [127]:
recipe_w2v_file_16 = 'recipe_w2v_16.gensim'
recipe_w2v_16.save(folder_loc + recipe_w2v_file_16)

In [125]:
ingredient_file = 'ingredient_count.json'
with open(folder_loc + ingredient_file, 'w') as ingredient_stream:
  json.dump(ingredient_count, ingredient_stream)

In [126]:
NER_tsv = 'recipe_NER.tsv'
recipes_pdf['NER'].to_csv(folder_loc + NER_tsv, sep='\t')

KeyboardInterrupt: ignored

## To huggingface

In [None]:
from huggingface_hub import HfApi
api = HfApi()

In [None]:
api.upload_file(
    path_or_fileobj=folder_loc + recipe_w2v_file,
    path_in_repo=recipe_w2v_file,
    repo_id="map222/recipe-spice",
    repo_type="space",
)

In [None]:
api.upload_file(
    path_or_fileobj=folder_loc + ingredient_file,
    path_in_repo=ingredient_file,
    repo_id="map222/recipe-spice",
    repo_type="space",
)

In [None]:
api.upload_file(
    path_or_fileobj=folder_loc + NER_tsv,
    path_in_repo=NER_tsv,
    repo_id="map222/recipe-spice",
    repo_type="space",
)

# Some cleaning that seems too finnicky

In [95]:
def get_w2v_duplicate_map(ingredient_embedding, ingredient_count_local):
  top_ingredients = list(zip(*ingredient_count_local.most_common(100)))[0]
  ingredient_map = {}
  for top_ingredient in top_ingredients:
    similar_ingredients = list(zip(*ingredient_embedding.most_similar(top_ingredient, topn=20)))[0]
    ingredient_map[top_ingredient] = [ingredient for ingredient in similar_ingredients if top_ingredient in ingredient.split(' ')]
  return ingredient_map

In [96]:
test = get_w2v_duplicate_map(recipe_w2v_16, ingredient_count)

In [101]:
test

{'salt': ['coarse salt', 'kosher salt'],
 'sugar': ['white sugar',
  'granulated sugar',
  'cane sugar',
  't sugar',
  'demerara sugar',
  'light brown sugar',
  'turbinado sugar',
  'brown sugar',
  'powdered sugar',
  'weight sugar',
  'sugar substitute'],
 'egg': ['egg white', 'egg substitute'],
 'butter': ['unsalted butter', 'sweet butter', 'cold butter'],
 'onion': ['yellow onion', 'spanish onion', 'cooking onion', 'white onion'],
 'flour': ['all-purpose flour', 'white flour'],
 'garlic': ['clove garlic', 'fresh garlic', 'cloves garlic', 'garlic smashed'],
 'water': ['hot water',
  'boiling water',
  'cold water',
  'water boiling',
  'water cold'],
 'milk': ['low-fat milk', 'hot milk', 'sweet milk', 'warm milk'],
 'vanilla': ['vanilla extract'],
 'olive oil': [],
 'pepper': ['black pepper',
  'ground black pepper',
  'ground pepper',
  'black ground pepper',
  'fresh ground pepper',
  'lemon pepper',
  'freshly grnd pepper',
  'freshly grnd black pepper'],
 'tomatoe': ['fresh to