# Recipe Cleaner

This notebook takes scraped recipes and rewrites each ingredient name as the closest matching food name in the FooDB database. Matching uses pretrained BERT sentence embeddings: each recipe ingredient string is embedded and mapped to the most similar FooDB name.

This process is lengthy. If you do not need to rerun it yourself, the output files are available:
* __all_recipes_cleaned.pkl__
* __all_recipes_cleaned.csv__
* __all_recipes_cleaned_indices.pkl__
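
If you only need the outputs, they can be read back without rerunning the notebook. The following is a minimal sketch, assuming the two pickle files sit in the current working directory; `load_pickle` is a hypothetical helper, not part of the original notebook. Only unpickle files you trust, since `pickle.load` can execute arbitrary code.

```python
import os
import pickle

def load_pickle(path):
    """Read one pickled object from disk (hypothetical helper)."""
    with open(path, "rb") as f:
        return pickle.load(f)

recipes_path = "all_recipes_cleaned.pkl"
indices_path = "all_recipes_cleaned_indices.pkl"

# Guarded so the cell is harmless when the files are not present.
if os.path.exists(recipes_path) and os.path.exists(indices_path):
    recipes = load_pickle(recipes_path)   # list of [[name, weight], ...] recipes
    indices = load_pickle(indices_path)   # row indices into All_Recipes_Simplified.csv
    print(len(recipes), len(indices))
```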

## Imports and Data

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import ast
import time
import pickle
import csv
import zipfile
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances
sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')




In [64]:
with open('content_dict_weight.pkl', 'rb') as f:
    content_dict_weight = pickle.load(f)
with open('content_dict_presence.pkl', 'rb') as f:
    content_dict_presence = pickle.load(f)
with open('content_dict_complete.pkl', 'rb') as f:
    content_dict_complete = pickle.load(f)

In [89]:
all_recipes_simplified = pd.read_csv("All_Recipes_Simplified.csv")

In [8]:
weight_keys = content_dict_weight.keys()
presence_keys = content_dict_presence.keys()
complete_keys = content_dict_complete.keys() #presence and complete have the same keys. weight has fewer keys

print(len(weight_keys), len(presence_keys), len(complete_keys))

9209 9461 9461


In [20]:
# Keep only string keys; any non-string keys are skipped because they cannot be embedded.
ingredient_names = [name for name in complete_keys if isinstance(name, str)]

In [21]:
#weight_embeddings = sbert_model.encode(weight_keys)
complete_embeddings = sbert_model.encode(ingredient_names)

## Mapping names to FooDB Names

In [38]:
def best_match(ingredient, cutoff=0.65):
    '''
    Take any ingredient name (e.g. from Food.com) and return the most similar name that is acceptable within Content.csv.
    Returns "Bad Match" when no candidate's cosine similarity exceeds cutoff.
    '''
    best_score = cutoff  # raise this to implement a stricter cutoff
    best_match_name = "Bad Match"  # could be changed to a name with zero compounds
    ingredient_embedding = sbert_model.encode(ingredient).reshape(1, -1)
    for i in range(len(complete_embeddings)):
        # cosine_similarity returns a 1x1 array; take the scalar explicitly
        score = cosine_similarity(ingredient_embedding,
                                  complete_embeddings[i].reshape(1, -1))[0, 0]
        if score > best_score:
            best_score = score
            best_match_name = ingredient_names[i]
    return best_match_name

In [39]:
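matched_ingredients = {}  # cache: raw ingredient string -> matched FooDB name
def content_map(ingredient):
    '''Return the FooDB name for an ingredient, reusing an earlier match when one exists.'''
    if ingredient in ingredient_names:
        # Already a valid FooDB name; no embedding lookup needed
        matched_ingredients[ingredient] = ingredient
        return ingredient
    elif ingredient in matched_ingredients:
        return matched_ingredients[ingredient]
    else:
        match = best_match(ingredient)
        matched_ingredients[ingredient] = match
        return match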
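
The caching idea in `content_map` can be illustrated in isolation. This is a minimal sketch, not the notebook's implementation: `make_cached_mapper` and `fake_match` are hypothetical names, and `fake_match` merely stands in for the slow embedding lookup. It also stores the known names in a `set`, which is an assumption about how one might speed up the membership test.

```python
def make_cached_mapper(known_names, slow_match):
    """Build a mapper that calls slow_match at most once per unseen string."""
    known = set(known_names)  # constant-time membership test instead of scanning a list
    cache = {}                # raw ingredient string -> matched name

    def mapper(ingredient):
        if ingredient in known:
            return ingredient
        if ingredient not in cache:
            cache[ingredient] = slow_match(ingredient)
        return cache[ingredient]

    return mapper

calls = []  # records every time the expensive matcher actually runs

def fake_match(ingredient):
    # Stand-in for the embedding-based matcher (hypothetical).
    calls.append(ingredient)
    return ingredient.capitalize()

mapper = make_cached_mapper(["Apple", "Soy sauce"], fake_match)
results = [mapper(x) for x in ["apple", "apple", "Apple", "soy sauce", "apple"]]
print(results)     # ['Apple', 'Apple', 'Apple', 'Soy sauce', 'Apple']
print(len(calls))  # 2
```

Five lookups trigger only two calls to the slow matcher, which is why repeated ingredients become cheap as more recipes are processed.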

In [78]:
def matcher(ingredient):
    s = time.time()
    a = best_match(ingredient)
    print(time.time() - s, a)
    s = time.time()
    b = content_map(ingredient)
    print(time.time() - s, b)
    
matcher("apple")

2.828763723373413 Apple
2.1678812503814697 Apple
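
The timings above, roughly 2 to 3 seconds per uncached lookup, are likely dominated by the Python loop that calls `cosine_similarity` once per FooDB name. One possible alternative is to score all candidates in a single NumPy operation. The sketch below is an assumption about how that could look, not the method used to produce the saved outputs; `best_match_vectorized` is a hypothetical name, and toy 3-dimensional vectors stand in for real sentence embeddings.

```python
import numpy as np

def best_match_vectorized(query_vec, candidate_vecs, candidate_names,
                          cutoff=0.65, fallback="Bad Match"):
    """Return the name with the highest cosine similarity, or fallback if none exceeds cutoff."""
    query = np.asarray(query_vec, dtype=float)
    candidates = np.asarray(candidate_vecs, dtype=float)
    # Product of norms for every candidate; zero norms are replaced by 1 to avoid dividing by zero.
    norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(query)
    norms = np.where(norms == 0, 1.0, norms)
    scores = candidates @ query / norms  # one cosine similarity per candidate
    best = int(np.argmax(scores))
    return candidate_names[best] if scores[best] > cutoff else fallback

# Toy vectors standing in for real embeddings (illustrative values only).
toy_names = ["Apple", "Soy sauce", "Sugar, brown"]
toy_vecs = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])

print(best_match_vectorized([0.9, 0.1, 0.0], toy_vecs, toy_names))  # Apple
print(best_match_vectorized([1.0, 1.0, 1.0], toy_vecs, toy_names))  # Bad Match
```

The second query is equally similar to all three names (about 0.58 each), so nothing clears the 0.65 cutoff and the fallback is returned, mirroring the behavior of `best_match`.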


In [104]:
start = time.time()

cleaned_recipes_full = []
cleaned_recipes_full_indices = []
#for i in range(10):
for i in range(all_recipes_simplified.shape[0]):
    recipe = []
    bad_matches = 0
    try:
        # Parse the stringified ingredient list once per recipe
        ingredients = ast.literal_eval(all_recipes_simplified._c0[i])
        for item in ingredients:
            # item[0] is the ingredient name, item[1] is its weight
            matched_name = content_map(item[0])
            if matched_name == "Bad Match":
                bad_matches += 1
            recipe.append([matched_name, item[1]])
        # Keep only recipes in which every ingredient found a match
        if bad_matches == 0:
            cleaned_recipes_full.append(recipe)
            cleaned_recipes_full_indices.append(i)
    except Exception:
        # Skip recipes whose ingredient list cannot be parsed or processed
        pass
    if i in [1, 10, 100, 500, 1000, 5000, 10000, 20000, 50000, 100000, 170000]:
        progress = time.time() - start
        print(i, "recipes processed", progress)

1 recipes processed 0.01658177375793457
10 recipes processed 4.260057687759399
100 recipes processed 570.0142889022827
500 recipes processed 1685.5871918201447
1000 recipes processed 2583.545888900757
5000 recipes processed 5839.20241189003
10000 recipes processed 7819.295591831207
20000 recipes processed 10090.306432008743
50000 recipes processed 13766.288568735123
100000 recipes processed 17049.469326734543
170000 recipes processed 19847.147416830063


In [105]:
with open('all_recipes_cleaned.pkl', 'wb') as f:
    pickle.dump(cleaned_recipes_full, f, protocol=pickle.HIGHEST_PROTOCOL)

In [106]:
with open("all_recipes_cleaned.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(cleaned_recipes_full)

In [124]:
with open('all_recipes_cleaned_indices.pkl', 'wb') as f:
    pickle.dump(cleaned_recipes_full_indices, f, protocol=pickle.HIGHEST_PROTOCOL)

### Final Thoughts

* In all, all but 1,699 of the recipes were matched to FooDB-friendly names. This leaves us with a cleaned dataset, __all_recipes_cleaned__, containing 176,286 recipes.

* __all_recipes_cleaned_indices__, saved above as a pkl, is a list of the indices from __All_Recipes_Simplified.csv__ that were not removed due to poor ingredient matches. This can be used if you want to reconcile the cleaned recipes with the originals.

* __recipe_match__, a function included below, is an example of how one could compare the recipes before and after cleaning using these indices.

In [128]:
#cleaned_recipes_full_indices

In [127]:
print(len(all_recipes_simplified) -  len(cleaned_recipes_full), len(cleaned_recipes_full))

#1,699 recipes were removed due to bad ingredient matches. If you want to pair up the cleaned recipes with
#the originals, use the function below

1699 176286


In [129]:
def recipe_match(index):
    '''Take in index from all_recipes_cleaned, output original recipe as seen in all_recipes_simplified.'''
    return ast.literal_eval(all_recipes_simplified.iloc[cleaned_recipes_full_indices[index]]._c0)

In [121]:
recipe_match(2)

[['chicken legs', 1360.78],
 ['onion', 150.0],
 ['soy sauce', 232.0],
 ['brown sugar', 220.0],
 ['ground ginger', 4.0],
 ['minced garlic cloves', 6.0],
 ['dry sherry', 60.0]]

In [122]:
cleaned_recipes_full[2]

[['Chicken spread', 1360.78],
 ['onion', 150.0],
 ['Soy sauce', 232.0],
 ['Sugar, brown', 220.0],
 ['Spices, ginger, ground', 4.0],
 ['Spices, garlic powder', 6.0],
 ['Sherry, dry', 60.0]]
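
To see at a glance which names changed during cleaning, the two lists can be compared pairwise. This is a minimal sketch using the example recipe shown above; `renamed_ingredients` is a hypothetical helper and assumes both lists keep their ingredients in the same order, as the cleaning loop does.

```python
original = [['chicken legs', 1360.78], ['onion', 150.0], ['soy sauce', 232.0],
            ['brown sugar', 220.0], ['ground ginger', 4.0],
            ['minced garlic cloves', 6.0], ['dry sherry', 60.0]]
cleaned = [['Chicken spread', 1360.78], ['onion', 150.0], ['Soy sauce', 232.0],
           ['Sugar, brown', 220.0], ['Spices, ginger, ground', 4.0],
           ['Spices, garlic powder', 6.0], ['Sherry, dry', 60.0]]

def renamed_ingredients(original_recipe, cleaned_recipe):
    """Return (original_name, cleaned_name) pairs whose names differ."""
    return [(o[0], c[0])
            for o, c in zip(original_recipe, cleaned_recipe)
            if o[0] != c[0]]

changes = renamed_ingredients(original, cleaned)
for before, after in changes:
    print(f"{before!r} -> {after!r}")
```

Listing the pairs this way makes it easier to spot matches that may deserve a manual check, such as 'chicken legs' being mapped to 'Chicken spread'.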