<a href="https://colab.research.google.com/github/mscholl96/mad-recime/blob/documentation-stuff/data/recipe1M/parser/parser.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recipe1M parser

In [5]:
!pip install quantulum3==0.7.9
!pip install stemming
!pip install pattern

Collecting quantulum3==0.7.9
  Downloading quantulum3-0.7.9-py3-none-any.whl (10.7 MB)
Installing collected packages: quantulum3
  Attempting uninstall: quantulum3
    Found existing installation: quantulum3 0.7.10
    Uninstalling quantulum3-0.7.10:
      Successfully uninstalled quantulum3-0.7.10
Successfully installed quantulum3-0.7.9




In [2]:
# Add GDrive
from google.colab import drive
import sys
drive.mount('/content/drive/')
sys.path.append('/content/drive/My Drive/Datasets/Recipe1M/')

Mounted at /content/drive/


In [2]:
import pandas as pd
import re
from recipe import Recipe
import requests
from bs4 import BeautifulSoup

#FILE_DIR = '../'
FILE_DIR = '/content/drive/My Drive/Datasets/Recipe1M/'

## Recipe1M data
Recipe1M comes with several JSON files containing recipes crawled from the web. Two of them are relevant for our project:
* layer1.json: Contains all recipes in full
  
  ![layer1](https://github.com/mscholl96/mad-recime/blob/recipe1M-parser/data/recipe1M/dataset-analysis/layer1_puml.png?raw=1)

* det_ingrs.json: Contains only the recipe ID, the parsed ingredients, and a validity flag for each parsed ingredient
  
  ![det_ingrs](https://github.com/mscholl96/mad-recime/blob/recipe1M-parser/data/recipe1M/dataset-analysis/det_ingrs_puml.png?raw=1)

In our first attempt, we use the parsed ingredient list and only consider recipes in which all ingredients are marked valid. The parsed ingredients don't contain amounts, so our parser has to merge the content of both files: it takes the ingredient names from det_ingrs.json and their amounts and units from layer1.json.
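The merge described above can be sketched on toy data with plain dictionaries. The records and the helper name `merge_valid` are made up for illustration; only the field names (`id`, `ingredients`, `valid`) follow the two files.

```python
# Toy records mimicking the two files; the values are invented for illustration.
layer1 = [
    {'id': 'a1', 'title': 'Pancakes',
     'ingredients': [{'text': '2 cups flour'}, {'text': '1 egg'}]},
    {'id': 'b2', 'title': 'Mystery stew',
     'ingredients': [{'text': '1 lb beef'}, {'text': 'see note'}]},
]
det_ingrs = [
    {'id': 'a1', 'valid': [True, True],
     'ingredients': [{'text': 'flour'}, {'text': 'egg'}]},
    {'id': 'b2', 'valid': [True, False],
     'ingredients': [{'text': 'beef'}, {'text': ''}]},
]


def merge_valid(layer1, det_ingrs):
    """Attach parsed ingredients to full recipes, keeping only fully valid ones."""
    # Keep the parsed ingredients of recipes whose flags are all True
    parsed_by_id = {d['id']: d['ingredients'] for d in det_ingrs if all(d['valid'])}
    merged = []
    for recipe in layer1:
        parsed = parsed_by_id.get(recipe['id'])
        if parsed is None:
            continue  # recipe had at least one invalid ingredient
        merged.append({**recipe, 'ingredients_parsed': parsed})
    return merged


merged = merge_valid(layer1, det_ingrs)
```

Here only the recipe `a1` survives, and it carries both the original ingredient texts (with amounts) and the parsed ingredient names.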

## Preprocessing
Remove all invalid entries from the ingredient data and the full recipe data to reduce memory usage. The result is stored as pickle instead of JSON.

### Helper functions

In [5]:
# Replace double numbers (such as "1 12"), include trailing space to make sure it is an actual number
# Raw strings avoid invalid escape sequence warnings for \d and \s
doubleNumbersRegex = re.compile(r'\d+ \d+ ')
# Replace fractions (such as "1/2")
fractionRegex = re.compile(r'\d+/\d+')
# Replace mixed fractions (such as "1 1/2" or "1-1/2")
mixedFractionRegex = re.compile(r'\d+(?:\s|\s?\-\s?)\d+/\d+')
# Needed to extract numbers
numberRegex = re.compile(r'\d+')

def isEmpty(string):
    return not (string and string.strip())

assert(isEmpty('') == True)
assert(isEmpty('a b') == False)
assert(isEmpty('    ') == True)

def numberReplacement(text):
    # First all mixed fractions
    mixedFractions = re.findall(mixedFractionRegex, text)
    for mixedFraction in mixedFractions:
        numbers = [int(s) for s in re.findall(numberRegex, mixedFraction)]
        float_representation = round(numbers[0] + numbers[1]/numbers[2], 2)
        text = text.replace(mixedFraction, str(float_representation))
    # Then the remaining fractions
    fractions = re.findall(fractionRegex, text)
    for fraction in fractions:
        numbers = fraction.split('/')
        float_representation = round(int(numbers[0])/int(numbers[1]), 2)
        text = text.replace(fraction, str(float_representation))
    # Last: Use first number in case of double number (food.com will be corrected later)
    doubleNumbers = re.findall(doubleNumbersRegex, text)
    for doubleNumber in doubleNumbers:
        numbers = [int(s) for s in re.findall(numberRegex, doubleNumber)]
        float_representation = numbers[0]
        text = text.replace(doubleNumber, str(float_representation) + ' ')
    return text

assert(numberReplacement("1 12 1 1/4 1/2") == "1 1.25 0.5")
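As a possible alternative to `numberReplacement`, the same three passes can be written with `re.sub` and a callback, so that only the matched positions are rewritten instead of every occurrence of the matched substring. This is a minimal sketch, not the notebook's implementation; the names `normalize_numbers`, `MIXED`, `FRACTION` and `DOUBLE` are hypothetical.

```python
import re

# Same patterns as above, with groups so the callbacks can read the numbers
MIXED = re.compile(r'(\d+)(?:\s|\s?-\s?)(\d+)/(\d+)')
FRACTION = re.compile(r'(\d+)/(\d+)')
DOUBLE = re.compile(r'(\d+) \d+ ')


def _fmt(value):
    # Two decimals, as in numberReplacement
    return str(round(value, 2))


def normalize_numbers(text):
    # Mixed fractions first, so "1 1/4" is not split into "1" and "1/4"
    text = MIXED.sub(lambda m: _fmt(int(m[1]) + int(m[2]) / int(m[3])), text)
    # Then the remaining plain fractions
    text = FRACTION.sub(lambda m: _fmt(int(m[1]) / int(m[2])), text)
    # Finally keep only the first number of a double number
    text = DOUBLE.sub(lambda m: m[1] + ' ', text)
    return text


print(normalize_numbers("1-1/2 cups sugar"))
```

Like the original, this sketch raises a `ZeroDivisionError` for a denominator of zero.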

### Bringing the two json files together

In [6]:
ingredient_data = None
recipe_data = None
ingredient_file = FILE_DIR + 'det_ingrs.json'
layer1_file = FILE_DIR + 'layer1.json'

layer1_out = FILE_DIR + '2022_03_30/layer1_valid.pkl'

# Get data, set id as used index and drop unnecessary information
ingredient_data = pd.read_json(ingredient_file).set_index('id')
recipe_data = pd.read_json(layer1_file).drop(columns=['partition']).set_index('id')

# Drop recipes with more than 20 ingredients
indices = ingredient_data[ingredient_data['valid'].map(len) > 20].index
ingredient_data = ingredient_data.drop(indices)
recipe_data = recipe_data.drop(indices)
print(f'Removed {len(indices)} recipes with more than 20 ingredients')

# Drop recipes with more than 30 instructions
indices = recipe_data[recipe_data['instructions'].map(len) > 30].index
ingredient_data = ingredient_data.drop(indices)
recipe_data = recipe_data.drop(indices)
print(f'Removed {len(indices)} recipes with more than 30 instructions')

# Removal of all elements in ingredient json which contain invalid entries according to the data set
# Get indices of recipes with at least one False valid flag
indices = ingredient_data[ingredient_data['valid'].map(lambda flags: False in flags)].index
ingredient_data = ingredient_data.drop(indices).drop(columns=['valid'])
recipe_data = recipe_data.drop(indices)
print(f'Removed {len(indices)} recipes with invalid ingredients')

# Bring both datasets together
recipe_data['ingredients_parsed'] = ingredient_data['ingredients']

Removed 18626 recipes with more than 20 ingredients
Removed 14270 recipes with more than 30 instructions
Removed 149732 recipes with invalid ingredients


### Cleaning up the data

In [7]:

indices = []
# Scraped amounts of food.com
df_food_com_amounts = pd.read_pickle(FILE_DIR + 'food_com_amounts_conv.pkl')

j = 0
for idx, recipe in recipe_data.iterrows():
    
    # Determine and print progress
    j+=1
    if j % 100000 == 0:
        print(f'Progress: {j}')
    
    # Remove recipes from food.com that have not been scraped
    if 'www.food.com' in recipe['url'] and idx not in df_food_com_amounts.index:
        # No scraped amounts available, remove later and continue
        indices.append(idx)
        continue
    
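    # Remove empty elements. Note: depending on the dtypes, iterrows() may yield a copy of the row,
    # so these reassignments are not guaranteed to be written back to recipe_data.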
    recipe['ingredients'] = [elem for elem in recipe['ingredients'] if not isEmpty(elem['text'])]
    recipe['ingredients_parsed'] = [elem for elem in recipe['ingredients_parsed'] if not isEmpty(elem['text'])]

    # Remove recipes where amount of parsed ingredients doesn't match amount of original ingredients
    if len(recipe['ingredients_parsed']) != len(recipe['ingredients']):
        indices.append(idx)
        continue

    # Replace numbers in ingredient texts
    for ingredient in recipe['ingredients']:
        ingredient['text'] = numberReplacement(ingredient['text'])

# Remove the previously determined recipes
recipe_data = recipe_data.drop(indices).drop(columns=['url'])

# Save data to pickle (it's faster)
recipe_data.to_pickle(layer1_out)


Progress: 100000
Progress: 200000
Progress: 300000
Progress: 400000
Progress: 500000
Progress: 600000
Progress: 700000
Progress: 800000


## Plausibility check (postprocessing)
The postprocessing steps are:
1. Check each recipe for ingredient units that obviously don't fit the context of recipes.
2. Check for implausible amounts and remove the affected recipes.
3. Convert metric units to a common base (litres, centilitres and decilitres to ml; kilograms and milligrams to g; millimetres to cm) and dessertspoons to teaspoons. All other units, such as cups and spoons, are left as they are.

The postprocessing function returns None if a recipe is implausible; otherwise it returns the processed recipe.

In [2]:
import re

# Raw strings avoid invalid escape sequence warnings for \d
numberRegex = re.compile(r'\d+')
floatRegex = re.compile(r'\d+\.\d+')

def str2num(string):
    amounts = string.split('-')
    # Only a single number
    if len(amounts) == 1:
        amount = amounts[0]
    # Range of amounts, we take the upper one (the more the better;))
    else:
        amount = amounts[1]

    if re.findall(floatRegex, amount):
        return float(amount)
    
    numbers = [int(s) for s in re.findall(numberRegex, amount)]
    if not numbers:
        return 1
    if len(numbers) == 3:
        return round(numbers[0] + numbers[1]/numbers[2], 2)
    elif len(numbers) == 2:
        return round(int(numbers[0])/int(numbers[1]), 2)
    else:
        return int(numbers[0])

assert(str2num('1/2') == 0.5)
assert(str2num('1 1/2') == 1.5)
assert(str2num('1/4 - 1/2') == 0.5)
assert(str2num('1 1/2 - 2') == 2)
assert(str2num('1 - 1 1/2') == 1.5)
assert(str2num('5') == 5)
assert(str2num('42') == 42)
assert(str2num(' ') == 1)
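Instead of counting regex matches as `str2num` does, the amount strings could also be parsed with `fractions.Fraction` from the standard library. The following is a sketch under the assumption that every whitespace-separated token is a plain number, decimal or fraction; the name `parse_amount` is hypothetical. Unlike `str2num`, it raises a `ValueError` on stray text instead of skipping it, and it takes the last part of a range rather than the second.

```python
from fractions import Fraction


def parse_amount(text):
    """Parse '1 1/2', '1/4 - 1/2' or '0.75' into a number (upper bound of a range)."""
    # For ranges, keep the part after the last hyphen
    upper = text.split('-')[-1].strip()
    if not upper:
        # Same default as str2num for empty input
        return 1
    # '1 1/2' becomes Fraction(1) + Fraction(1, 2)
    total = sum(Fraction(token) for token in upper.split())
    return round(float(total), 2)


print(parse_amount('1 - 1 1/2'))
```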

### Converting scraped amounts (only executed once)

In [None]:
# Load scraped amounts for recipes from food.com
df_amounts = pd.read_csv(FILE_DIR + 'food_com_amounts.csv').set_index('id')
# Work on a copy so the raw scraped strings stay untouched
df_amounts_conv = df_amounts.copy()

def convert_amount_list(raw):
    # The CSV stores each list as a string: strip brackets and quotes, then split
    amount_list = raw.replace('[', '').replace(']', '').replace("'", "").split(',')
    return [str2num(elem) for elem in amount_list]

# Assign the whole column at once; chained assignment via .loc[idx]['amounts']
# may write to a temporary copy and silently leave the frame unchanged
df_amounts_conv['amounts'] = df_amounts['amounts'].apply(convert_amount_list)

df_amounts_conv.to_csv(FILE_DIR + 'food_com_amounts_conv.csv')
df_amounts_conv.to_pickle(FILE_DIR + 'food_com_amounts_conv.pkl')
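The cell above strips brackets and quotes by hand. If each CSV cell is a valid Python list literal such as `"['1/2', '1 1/2']"` (an assumption, the raw file is not shown here), `ast.literal_eval` could do the parsing and would also cope with commas inside an element. The helper name `parse_amount_list` is hypothetical.

```python
import ast


def parse_amount_list(cell):
    """Turn a stringified list of amounts into a list of stripped strings."""
    # literal_eval only evaluates Python literals, never arbitrary code
    values = ast.literal_eval(cell)
    return [str(value).strip() for value in values]


print(parse_amount_list("['1/2', ' 1 1/2', '2-3']"))
```

Each resulting string could then be passed to `str2num` as before. Malformed cells raise a `ValueError` or `SyntaxError`, whereas the manual approach accepts them silently.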

### Postprocessing function

In [5]:

allowed_units = ['cup', 'tablespoon', 'teaspoon', '', 'pound-mass', 'ounce', 'millilitre', 'gram', 'package',
                 'pinch', 'quart', 'drop', 'pint', 'inch', 'litre', 'fluid ounce', 'kilogram', 'centimetre',
                 'gallon', 'centilitre', 'decilitre', 'millimetre', 'dessertspoon', 'milligram']

# Upper limit per unit; recipes containing a larger amount are discarded
allowed_amounts = {'cup': 20, 'tablespoon': 20, 'teaspoon': 20, '': 50, 'pound-mass': 80, 'ounce': 50,
                   'millilitre': 5000, 'gram': 5000, 'package': 20, 'pinch': 20, 'quart': 20, 'drop': 20,
                   'pint': 20, 'inch': 50, 'litre': 10, 'fluid ounce': 100, 'kilogram': 10, 'centimetre': 300,
                   'gallon': 10, 'centilitre': 1000, 'decilitre': 100, 'millimetre': 3000, 'dessertspoon': 20,
                   'milligram': 5000}

df_food_com_amounts = pd.read_pickle(FILE_DIR + 'food_com_amounts_conv.pkl')

def postprocess(recipe):
    try:
        amounts = df_food_com_amounts.loc[recipe.id]['amounts']
    except KeyError:
        amounts = None
    if amounts and len(amounts) != len(recipe.ingredients):
        print(f'Removed recipe {recipe.id} because the number of ingredients does not match the scraped number')
        return None

    for index, ingredient in enumerate(recipe.ingredients):
        # Replace all amounts that are allowed to be replaced (if it is a food.com recipe)
        if amounts and ingredient['replaceable']:
            ingredient['amount'] = amounts[index]

        # Some of the resulting units don't fit well to usual recipe units
        if ingredient['unit'] not in allowed_units:
            print(f'Recipe {recipe.id} is not added bc {ingredient} is filtered!')
            return None
        
        # If amount is too high, we don't consider the recipe
        if ingredient['amount'] > allowed_amounts[ingredient['unit']]:
            print(f'Recipe {recipe.id} is not added bc {ingredient} exceeds limits!')
            return None

        # Convert units
        if ingredient['unit'] == 'litre':
            ingredient['unit'] = 'millilitre'
            ingredient['amount'] = ingredient['amount'] * 1000
        elif ingredient['unit'] == 'kilogram':
            ingredient['unit'] = 'gram'
            ingredient['amount'] = ingredient['amount'] * 1000
        elif ingredient['unit'] == 'centilitre':
            ingredient['unit'] = 'millilitre'
            ingredient['amount'] = ingredient['amount'] * 10
        elif ingredient['unit'] == 'decilitre':
            ingredient['unit'] = 'millilitre'
            ingredient['amount'] = ingredient['amount'] * 100
        elif ingredient['unit'] == 'millimetre':
            ingredient['unit'] = 'centimetre'
            ingredient['amount'] = ingredient['amount'] / 10
        elif ingredient['unit'] == 'dessertspoon':
            ingredient['unit'] = 'teaspoon'
            ingredient['amount'] = ingredient['amount'] * 2
        elif ingredient['unit'] == 'milligram':
            ingredient['unit'] = 'gram'
            ingredient['amount'] = ingredient['amount'] / 1000

    return recipe
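The if/elif chain in `postprocess` could also be expressed as a lookup table, which keeps the conversion factors in one place. This is a sketch of that design choice, not the code used to produce the data; `CONVERSIONS` and `convert_unit` are hypothetical names, and the factors are copied from the function above.

```python
from operator import mul, truediv

# source unit -> (target unit, operation, factor)
CONVERSIONS = {
    'litre': ('millilitre', mul, 1000),
    'kilogram': ('gram', mul, 1000),
    'centilitre': ('millilitre', mul, 10),
    'decilitre': ('millilitre', mul, 100),
    'millimetre': ('centimetre', truediv, 10),
    'dessertspoon': ('teaspoon', mul, 2),
    'milligram': ('gram', truediv, 1000),
}


def convert_unit(ingredient):
    """Convert the ingredient in place if its unit has a rule; return it either way."""
    rule = CONVERSIONS.get(ingredient['unit'])
    if rule is not None:
        target_unit, operation, factor = rule
        ingredient['amount'] = operation(ingredient['amount'], factor)
        ingredient['unit'] = target_unit
    return ingredient


print(convert_unit({'amount': 2, 'unit': 'litre', 'ingredient': 'milk'}))
```

Units without an entry, such as `cup`, pass through unchanged.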

## Actual parsing

In [3]:
recipes = []
# Load the preprocessed recipes saved by the cleanup step
recipe_data = pd.read_pickle(FILE_DIR + '2022_03_30/layer1_valid.pkl')
print(f'Total number of recipes: {len(recipe_data)}')
recipe_data.head(5)


Total number of recipes: 846117


| id | ingredients | title | instructions | ingredients_parsed |
|---|---|---|---|---|
| 000033e39b | `[{'text': '1 c. elbow macaroni'}, {'text': '1 ...` | Dilly Macaroni Salad Recipe | `[{'text': 'Cook macaroni according to package ...` | `[{'text': 'elbow macaroni'}, {'text': 'America...` |
| 000035f7ed | `[{'text': '8 tomatoes, quartered'}, {'text': '...` | Gazpacho | `[{'text': 'Add the tomatoes to a food processo...` | `[{'text': 'tomatoes'}, {'text': 'kosher salt'}...` |
| 00003a70b1 | `[{'text': '2 cups milk'}, {'text': '1 cups wat...` | Crunchy Onion Potato Bake | `[{'text': 'Preheat oven to 350 degrees Fahrenh...` | `[{'text': 'milk'}, {'text': 'water'}, {'text':...` |
| 00004320bb | `[{'text': '1 (3 ounce) package watermelon gela...` | Cool 'n Easy Creamy Watermelon Pie | `[{'text': 'Dissolve Jello in boiling water.'},...` | `[{'text': 'watermelon gelatin'}, {'text': 'boi...` |
| 0000631d90 | `[{'text': '12 cup shredded coconut'}, {'text':...` | Easy Tropical Beef Skillet | `[{'text': 'In a large skillet, toast the cocon...` | `[{'text': 'shredded coconut'}, {'text': 'lean ...` |


In [6]:
j = 0
fileIndex = 0
for idx, raw_recipe in recipe_data.iterrows():
    
    recipe = Recipe(idx, raw_recipe['title'])
    recipe.parse_ingredients(raw_recipe['ingredients_parsed'])
    recipe.parse_instructions(raw_recipe['instructions'])
    recipe.get_ingredient_amounts(raw_recipe['ingredients'])

    # Append postprocessed recipe (postprocess returns None for implausible recipes)
    if postprocess(recipe) is not None:
        recipes.append(recipe)

    j += 1
    if j % 10000 == 0:
        print(f'Progress: {j}')

    # Save every 100000 recipes
    if len(recipes) == 100000:
        # Create data frame in the end (according to Stackoverflow this is faster)                
        df = pd.DataFrame([r.to_dict() for r in recipes]).set_index('id')
        path = FILE_DIR + f'2022_03_30/recipes_valid_{fileIndex}'
        fileIndex += 1
        df.to_pickle(path + '.pkl')
        del df
        recipes = []

# Save remaining data
if len(recipes) > 0:
    df = pd.DataFrame([r.to_dict() for r in recipes]).set_index('id')
    path = FILE_DIR + f'2022_03_30/recipes_valid_{fileIndex}'
    df.to_pickle(path + '.pkl')
    df.to_json(path + '.json', orient='records')
    
df.head(20)


Streaming output truncated to the last 5000 lines.
Recipe 897b2e9a65 is not added bc {'amount': 24.0, 'unit': 'cup', 'ingredient': 'popped popcorn', 'replaceable': True} exceeds limits!
Recipe 897b784e38 is not added bc {'amount': 97.0, 'unit': 'cup', 'ingredient': 'ice cream', 'replaceable': False} exceeds limits!
Recipe 89872728a3 is not added bc {'amount': 23.0, 'unit': 'cup', 'ingredient': 'brown sugar', 'replaceable': True} exceeds limits!
Recipe 898d15760e is not added bc {'amount': 23.0, 'unit': 'cup', 'ingredient': 'cocoa powder', 'replaceable': True} exceeds limits!
Recipe 899aeb1c9d is not added bc {'amount': 8000.0, 'unit': '', 'ingredient': 'key limes', 'replaceable': True} exceeds limits!
Recipe 89aef3aadc is not added bc {'amount': 64, 'unit': 'ounce', 'ingredient': 'apple juice', 'replaceable': True} exceeds limits!
Recipe 89b268ab19 is not added bc {'amount': 60, 'unit': '', 'ingredient': 'ritz crackers', 'replaceable': True} exceeds limits!
Recipe 89bacd2

| id | title | ingredients | instructions |
|---|---|---|---|
| f51fd3dece | Shakuti | amount unit ingredi... | 0 Cut chicken into serviing... |
| f51fdd2eff | Chicken Riggies | amount unit ingredie... | 0 Combine the salt, pepper, and flour in a... |
| f52003d0ea | Desperation Beef | amount unit ingredient 0 1.... | 0 Cut beef into very smal... |
| f5203e1b02 | Easter Pie (pizza piena) | amount unit ingredient 0 ... | 0 To make the crust: Mix the flour and sal... |
| f52041d765 | Sauteed Bananas in Praline Sauce with Vanilla ... | amount unit ingredient 0 ... | 0 Heat grill ... |
| f5206114a5 | German Baked Apple Pancake | amount unit ingredient 0 ... | 0 Preheat the oven to 450... |
| f52075dc7e | Beef Stew | amount unit ingredie... | 0 In a plastic bag combine flour and 1 tsp ... |
| f520815f40 | Oven Roasted Tomato and Mozzarella Salad with ... | amount unit ingredient 0 ... | 0 Start by preparing the tomatoes cut the ... |
| f520e89628 | Must Have Muffins for Pups | amount unit ingredient 0 1.50... | 0 Preheat oven to 425F Spray a 12 muffin pa... |
| f520eebe0f | Skillet Beef Enchiladas | amount unit ingredient ... | 0 Brown meat and onion in large ... |
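The parsing loop above saves the accepted recipes in files of 100000 each and handles the remainder separately. The batching itself can be sketched as a small generator built on `itertools.islice`; the name `chunked` is hypothetical, and the integers stand in for accepted recipes.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items; the last one may be shorter."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Seven stand-in items in batches of three
chunks = list(chunked(range(7), 3))
print(chunks)
```

Applied to the stream of recipes that pass `postprocess`, each chunk would become one `recipes_valid_{fileIndex}` file, and the shorter final chunk needs no special case.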
