# Clean and Combine Recipes

This notebook contains functions to clean recipes and output them to a dataframe for downstream analysis <br><br>

The majority of recipes come from the eightportions dataset, which consists of pre-scraped recipes from allrecipes, epicurious, and food network. It is available at https://eightportions.com/datasets/Recipes/.<br><br>

This dataset is spread across three json files, with each file corresponding to a different repository. Features consist of:
1. recipe name
2. ingredients
3. cooking instructions
4. picture link (not used)

The remaining recipes were obtained using the spoonacular food api. Features for these recipes include:
1. recipe name
2. ingredients
3. cooking instructions

as well as additional features not incorporated here: ingredient units and amounts, sourceURL, health score, rating, and likes.<br>
Only recipes tagged with staple foods missing from the eightportions datasets are included in the spoonacular recipes.


In [1]:
import os
import json
import pickle

import itertools
import string
import unicodedata
from collections import OrderedDict

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import inflect

import pandas as pd

### general stopwords and stopwords specific to food/recipes ###

The ingredient_stops list is primarily used to clean the eightportions recipes, which have a multitude of descriptors in the ingredients.

In [2]:
stop_words = set(stopwords.words('english'))

with open('ingredient_stops.pickle', 'rb') as f:
    ingredient_stops = pickle.load(f)

### for stemming and punctuation

In [3]:
table = str.maketrans('', '', string.punctuation)
porter = PorterStemmer()

### staple foods - names modified for compatibility

Names of some staples have been modified to fit their more conventional names. Several words are also abnormally truncated to accomodate errors in the .singular_noun() method employed in the cleaning step. These can be string replaced further downstream.

In [4]:
df_staples = pd.read_csv('staples_tagged_singular.csv')
food_staples = df_staples['AbbrvName']

## functions to clean ingredients and instructions

In [5]:
def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])

In [6]:
deplural = inflect.engine()

def clean_ingredients(ingredient):
    
    dup_words = ['stock', 'steak', 'chuck', 'crawfish']
    fixed_words = ['broth', 'beef', 'beef', 'crayfish']
    
    ## tokenize and lowercase ##
    ingrd_tokens = word_tokenize(ingredient)
    words = [w.lower() for w in ingrd_tokens if w.isalpha()]
    
    clean_words = []
    for word in words:
        
        ## remove stop words ##
        if word not in stop_words and word not in ingredient_stops:
            
            ## remove accents ##
            word = remove_accents(word)
            
            ## words to singular form ##
            singular = deplural.singular_noun(word)
            if singular:
                word = singular
                
            ## adjust niche cases for certain staples ##
            if word in dup_words:
                word = fixed_words[dup_words.index(word)]
            clean_words.append(word)
        else:
            continue
    
    ingredient_clean = ' '.join(words)
    
    ## adjust additional niche cases ##
    ingredient_clean = ingredient_clean.replace('game hen', 'hen')
    
    ## ignore instances of 'salt and pepper'.
    ## can be modified to replace with salt, pepper, or return both
    ingredient_clean = ingredient_clean.replace('salt pepper', '')
    
    ## call staple foods consistently ##
    for staple in food_staples:
        if staple in ingredient_clean:
            ingredient_clean = staple
            break     
            
    return ingredient_clean

In [7]:
def clean_instructions(doc):
    try:
        
        ## remove punctuation and stopwords, and stem ##
        tokens = word_tokenize(doc.lower())
        stripped = [w.translate(table) for w in tokens]
        words = [word for word in stripped if word.isalpha()]
        words = [w for w in words if not w in stop_words]
        stemmed = [porter.stem(word) for word in words]
        return stemmed
    
    ## if no instructions are present
    except AttributeError:
        return []

## clean spoonacular recipes 

In [8]:
with open('spoonacular_recipes.json') as infile:
    recipe_list = json.loads(infile.read())

example spoonacular recipe:

In [9]:
recipe_list[0]

{'ingredient_names': ['baby spinach',
  'coleslaw mix',
  'dijon mustard',
  'havarti cheese',
  'horseradish',
  'kosher salt',
  'mayonnaise',
  'pickled beets',
  'roast beef deli slices',
  'rye bread',
  'sour cream'],
 'source': 'http://www.foodnetwork.com/recipes/food-network-kitchens/10-minute-beef-and-beet-salad-with-horseradish-dressing.html',
 'rating': 83.0,
 'name': '10-Minute Beef-and-Beet Salad with Horseradish Dressing',
 'ingredient_amts': [4.0, 7.0, 2.0, 4.0, 3.0, 4.0, 2.0, 0.75, 0.5, 2.0, 0.25],
 'health_score': 22.0,
 'instructions': ['Toast the rye bread.',
  'Meanwhile, whisk together the sour cream, horseradish, mayonnaise, mustard, 3 tablespoons water, 3/4 teaspoon salt and 1/4 teaspoon pepper in a large bowl.',
  'Add the spinach, coleslaw mix and cheese and toss to combine.',
  'Cut the toasted bread, crusts and all, into 1/2-inch pieces Divide dressed greens among 4 salad plates and top each with roast beef, beets and rye croutons.'],
 'ingredient_units': ['c

In [10]:
ingrds_instr = []
for recipe in recipe_list:
    ingrds_in = recipe['ingredient_names']
    cleaned_ingr_list = [clean_ingredients(item) for item in ingrds_in]
    cleaned_ingr_list = [x for x in cleaned_ingr_list if x]
    
    ## remove duplicate instances of an ingredient
    cleaned_ingr_list = list(OrderedDict.fromkeys(cleaned_ingr_list))
    
    ## spoonacular instructions come in a list of steps and clean_instructions() takes a single string
    instructions_doc = ' '.join(recipe['instructions'])
    instructions_doc = clean_instructions(instructions_doc)
    
    ingrds_instr.append((recipe['name'], cleaned_ingr_list, instructions_doc))
    
df_spoon = pd.DataFrame(ingrds_instr, columns = ['names','ingredients', 'instructions'])    

In [11]:
df_spoon.head()

Unnamed: 0,names,ingredients,instructions
0,10-Minute Beef-and-Beet Salad with Horseradish...,"[spinach, coleslaw mix, dijon mustard, cheese,...","[toast, rye, bread, meanwhil, whisk, togeth, s..."
1,100% Whole Wheat Nut & Seed Bread,"[egg, flour, honey, molass, olive oil, orange,...","[add, yeast, water, salt, honey, molass, stand..."
2,3-Cheese Eggplant Lasagna,"[tomato, canola oil, carrot, basil, egg, spina...","[sprinkl, side, eggplant, slice, tablespoon, s..."
3,3-Layer Almond Coconut Chocolate Bars,"[almond, coconut, rice, sea salt, maple syrup,...","[add, almond, food, processor, process, fine, ..."
4,30-Minute Garlic Ginger Chicken Stir Fry,"[braggs liquid aminos, broccoli, carrot, chili...","[note, serv, meal, rice, start, cook, follow, ..."


## clean eightportions recipes

In [None]:
path = 'eight_portions_recipes/'
recipe_lists = os.listdir(path)
recipe_lists = [x for x in recipe_lists if x.endswith('.json')]

In [None]:
recipe_attrib = []
for recipe_list in recipe_lists:
    with open(path+recipe_list, encoding = 'utf-8') as infile:
        recipes = json.loads(infile.read())
    for k in recipes.keys():
        recipe = recipes[k]
        try:
            name = recipe['title']
            instructions_out = clean_instructions(recipe['instructions'])
            ingredients_out = [clean_ingredients(i) for i in recipe['ingredients']]
        except KeyError:
            continue
        
        if instructions_out:
            recipe_attrib.append((name, ingredients_out, instructions_out))
                

In [None]:
df_eightportion = pd.DataFrame(recipe_attrib, columns = ['names', 'ingredients', 'instructions'])

#### remove duplicate ingredients and empty strings from eight_portion recipes

In [None]:
unique_ingrds_list = [list(OrderedDict.fromkeys(x)) for x in df_eightportion['ingredients']]
for i in range(len(unique_ingrds_list)):
    unique_ingrds_list[i] = [x for x in unique_ingrds_list[i] if x]
df_eightportion['ingredients'] = unique_ingrds_list

### concatenate dataframes and pickle

In [None]:
df_tot = pd.concat([df_eightportion, df_spoon])

## drop duplicates as some recipes are posted to multiple sites
df_tot.drop_duplicates(subset = 'names', inplace = True)

In [None]:
'''
with open('compiled_recipes.pickle', 'wb') as f:
    pickle.dump(df_tot, f)
'''

The final output is a single dataframe consisting the name, ingredients, and instructions for each recipe. Ingredients and instructions are both stored as lists. 

The saved dataframe is available at https://tinyurl.com/yyanydd4. The dataframe is pickled to retain list functionality for ingredients and instructions.