<a href="https://colab.research.google.com/github/mscholl96/mad-recime/blob/recipe1M-parser/data/recipe1M/parser/parser.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recipe1M parser

In [1]:
!pip install quantulum3
!pip install stemming



In [2]:
# Mount Google Drive and add the dataset directory to the module search path
from google.colab import drive
import sys
drive.mount('/content/drive/')
sys.path.append('/content/drive/My Drive/Datasets/Recipe1M/')

Mounted at /content/drive/


In [1]:
import pandas as pd
import re
from recipe import Recipe

#FILE_DIR = '../'
FILE_DIR = '/content/drive/My Drive/Datasets/Recipe1M/'

## Recipe1M data
Recipe1M ships with several JSON files containing recipes crawled from the web. Two of them are relevant for our project:
* layer1.json: Contains all recipes to their full extent
  
  ![layer1](https://github.com/mscholl96/mad-recime/blob/recipe1M-parser/data/recipe1M/dataset-analysis/layer1_puml.png?raw=1)

* det_ingrs.json: Contains only the recipe ID, the parsed ingredients, and a validity flag for each parsed ingredient
  
  ![det_ingrs](https://github.com/mscholl96/mad-recime/blob/recipe1M-parser/data/recipe1M/dataset-analysis/det_ingrs_puml.png?raw=1)

In our first attempt, we use the parsed ingredient list and only consider recipes in which all ingredients are marked valid. The parsed ingredients do not contain amounts, so our parser has to merge the content of both files: the ingredient names come from det_ingrs.json, while their amounts and units are extracted from layer1.json.
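
To illustrate the merge, the sketch below pairs each parsed ingredient name with the raw ingredient line of the same recipe. The two entries are made-up values shaped like the JSON files, and the code assumes that both ingredient lists are aligned by position. `pair_ingredients` is a hypothetical helper, not part of the parser.

```python
# Made-up entries that mimic the structure of layer1.json and det_ingrs.json.
layer1_entry = {
    'id': 'example01',
    'title': 'Gazpacho',
    'ingredients': [{'text': '8 tomatoes, quartered'},
                    {'text': '1 teaspoon kosher salt'}],
}
det_entry = {
    'id': 'example01',
    'ingredients': [{'text': 'tomatoes'}, {'text': 'kosher salt'}],
    'valid': [True, True],
}


def pair_ingredients(layer1_entry, det_entry):
    """Return (raw line, parsed name) pairs, or None if any ingredient is invalid."""
    if layer1_entry['id'] != det_entry['id']:
        raise ValueError('entries belong to different recipes')
    if not all(det_entry['valid']):
        return None
    # Assumption: both ingredient lists are aligned by position.
    return [
        (raw['text'], parsed['text'])
        for raw, parsed in zip(layer1_entry['ingredients'], det_entry['ingredients'])
    ]


pairs = pair_ingredients(layer1_entry, det_entry)
for raw, name in pairs:
    print(f'{name!r} <- {raw!r}')
```

The raw line still carries the amount and unit, which is why both files are needed.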

## Preprocessing
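Remove all recipes with invalid ingredient entries from both the ingredient JSON and the full-data JSON to reduce memory usage. Recipes with more than 20 ingredients or more than 30 instructions are dropped as well. The results are stored as pickle instead of JSON.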
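
As a rough illustration, the filtering rules used in the following cell can be expressed as a single predicate over plain dictionaries. The limits of 20 ingredients and 30 instructions are taken from that cell; `is_usable` and the sample entries are hypothetical and only mirror the structure of the JSON files.

```python
MAX_INGREDIENTS = 20
MAX_INSTRUCTIONS = 30


def is_usable(det_entry, layer1_entry):
    """Return True if a recipe passes all three preprocessing filters."""
    valid_flags = det_entry['valid']  # one flag per ingredient
    return (
        len(valid_flags) <= MAX_INGREDIENTS
        and len(layer1_entry['instructions']) <= MAX_INSTRUCTIONS
        and all(valid_flags)
    )


# Made-up sample entries: (det_ingrs entry, layer1 entry)
good = ({'valid': [True, True]}, {'instructions': [{'text': 'Mix.'}]})
too_long = ({'valid': [True] * 21}, {'instructions': [{'text': 'Mix.'}]})
invalid = ({'valid': [True, False]}, {'instructions': [{'text': 'Mix.'}]})

results = [is_usable(*sample) for sample in (good, too_long, invalid)]
print(results)
```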

In [3]:
ingredient_data = None
recipe_data = None
ingredient_file = FILE_DIR + 'det_ingrs.json'
layer1_file = FILE_DIR + 'layer1.json'

ingredient_out = FILE_DIR + '2022_01_19/det_ingrs_valid.pkl'
layer1_out = FILE_DIR + '2022_01_19/layer1_valid.pkl'

# Load the data, use id as the index, and drop unneeded columns
ingredient_data = pd.read_json(ingredient_file).set_index('id')
recipe_data = pd.read_json(layer1_file).drop(columns=['url', 'partition']).set_index('id')

# Drop recipes with more than 20 ingredients
indices = ingredient_data[ingredient_data['valid'].map(len) > 20].index
ingredient_data = ingredient_data.drop(indices)
recipe_data = recipe_data.drop(indices)
print(f'Removed {len(indices)} recipes with more than 20 ingredients')

# Drop recipes with more than 30 instructions
indices = recipe_data[recipe_data['instructions'].map(len) > 30].index
ingredient_data = ingredient_data.drop(indices)
recipe_data = recipe_data.drop(indices)
print(f'Removed {len(indices)} recipes with more than 30 instructions')

# Remove all recipes whose ingredient list contains at least one entry
# that the data set flags as invalid
indices = ingredient_data[ingredient_data['valid'].map(lambda flags: not all(flags))].index
ingredient_data = ingredient_data.drop(indices).drop(columns=['valid'])
recipe_data = recipe_data.drop(indices)
print(f'Removed {len(indices)} recipes with invalid ingredients')

# Convert fractions such as 1/2 in the raw ingredient lines to decimals.
# Rows yielded by iterrows() may be copies, so the result is written back via apply().
fractionRegex = re.compile("[0-9]+/[0-9]+")

def convert_fractions(ingredients):
    ingredients_mod = []
    for ingredient in ingredients:
        ingredient_mod = ""
        for word in ingredient['text'].split(' '):
            match = fractionRegex.match(word)
            if match:
                numbers = match.group(0).split('/')
                float_representation = int(numbers[0])/int(numbers[1])
                ingredient_mod += f'{float_representation} '
            else:
                ingredient_mod += f'{word} '
        ingredients_mod.append({'text': ingredient_mod})
    return ingredients_mod

recipe_data['ingredients'] = recipe_data['ingredients'].apply(convert_fractions)

# Save data as pickle (faster to load than JSON)
ingredient_data.to_pickle(ingredient_out)
recipe_data.to_pickle(layer1_out)

recipe_data.head(5)



Removed 18626 recipes with more than 20 ingredients
Removed 14270 recipes with more than 30 instructions
Removed 149732 recipes with invalid ingredients


| id | ingredients | title | instructions |
|---|---|---|---|
| 000033e39b | [{'text': '1 c. elbow macaroni '}, {'text': '1... | Dilly Macaroni Salad Recipe | [{'text': 'Cook macaroni according to package ... |
| 000035f7ed | [{'text': '8 tomatoes, quartered '}, {'text': ... | Gazpacho | [{'text': 'Add the tomatoes to a food processo... |
| 00003a70b1 | [{'text': '2 12 cups milk '}, {'text': '1 12 c... | Crunchy Onion Potato Bake | [{'text': 'Preheat oven to 350 degrees Fahrenh... |
| 00004320bb | [{'text': '1 (3 ounce) package watermelon gela... | Cool 'n Easy Creamy Watermelon Pie | [{'text': 'Dissolve Jello in boiling water.'},... |
| 0000631d90 | [{'text': '12 cup shredded coconut '}, {'text'... | Easy Tropical Beef Skillet | [{'text': 'In a large skillet, toast the cocon... |
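
The token-wise fraction handling in the cell above has a few edge cases worth knowing. The sketch below repeats the same logic in a standalone helper and runs it on made-up ingredient lines. The name `convert_tokens` is chosen here for illustration only.

```python
import re

FRACTION = re.compile(r'[0-9]+/[0-9]+')


def convert_tokens(text):
    """Replace tokens that start with a fraction by their decimal value."""
    out = ''
    for word in text.split(' '):
        match = FRACTION.match(word)  # anchored at the start of the token only
        if match:
            numerator, denominator = match.group(0).split('/')
            out += f'{int(numerator) / int(denominator)} '
        else:
            out += f'{word} '
    return out


print(repr(convert_tokens('1/2 cup sugar')))     # '0.5 cup sugar '
print(repr(convert_tokens('1 1/2 cups milk')))   # '1 0.5 cups milk '
print(repr(convert_tokens('1/2-inch ginger')))   # '0.5 ginger '
print(repr(convert_tokens('(1/4 cup) butter')))  # '(1/4 cup) butter '
```

Three things stand out: mixed numbers are not summed, any suffix attached to a fraction (such as `-inch`) is dropped, and a fraction preceded by another character (such as an opening parenthesis) is left unchanged. Every converted line also ends with a trailing space.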


## Actual parsing

In [2]:
recipes = []
ingredient_data = None
recipe_data = None
# Load the preprocessed recipes (indexed by recipe ID)
data = pd.read_pickle(FILE_DIR + '2022_01_19/layer1_valid.pkl')
print(f'Total number of recipes: {len(data)}')
data.head(5)


Total number of recipes: 847092


| id | ingredients | title | instructions |
|---|---|---|---|
| 000033e39b | [{'text': '1 c. elbow macaroni '}, {'text': '1... | Dilly Macaroni Salad Recipe | [{'text': 'Cook macaroni according to package ... |
| 000035f7ed | [{'text': '8 tomatoes, quartered '}, {'text': ... | Gazpacho | [{'text': 'Add the tomatoes to a food processo... |
| 00003a70b1 | [{'text': '2 12 cups milk '}, {'text': '1 12 c... | Crunchy Onion Potato Bake | [{'text': 'Preheat oven to 350 degrees Fahrenh... |
| 00004320bb | [{'text': '1 (3 ounce) package watermelon gela... | Cool 'n Easy Creamy Watermelon Pie | [{'text': 'Dissolve Jello in boiling water.'},... |
| 0000631d90 | [{'text': '12 cup shredded coconut '}, {'text'... | Easy Tropical Beef Skillet | [{'text': 'In a large skillet, toast the cocon... |


In [3]:
ingredient_data = pd.read_pickle(FILE_DIR + '2022_01_19/det_ingrs_valid.pkl')
# One row per recipe, so this counts ingredient lists rather than individual ingredients
print(f'Total number of ingredients: {len(ingredient_data)}')
ingredient_data.head(5)

Total number of ingredients: 847092


| id | ingredients |
|---|---|
| 000033e39b | [{'text': 'elbow macaroni'}, {'text': 'America... |
| 000035f7ed | [{'text': 'tomatoes'}, {'text': 'kosher salt'}... |
| 00003a70b1 | [{'text': 'milk'}, {'text': 'water'}, {'text':... |
| 00004320bb | [{'text': 'watermelon gelatin'}, {'text': 'boi... |
| 0000631d90 | [{'text': 'shredded coconut'}, {'text': 'lean ... |


In [4]:
MAX_RECIPES = 300000  # Stop after this many recipes

for j, (idx, ingredients) in enumerate(ingredient_data.iterrows(), start=1):
    raw_recipe = data.loc[idx]
    recipe = Recipe(idx)

    # Ingredient names come from det_ingrs, amounts and units from layer1
    recipe.parse_ingredients(ingredients['ingredients'])
    recipe.parse_instructions(raw_recipe['instructions'])
    recipe.get_ingredient_amounts(raw_recipe['ingredients'])

    recipe.title = raw_recipe['title']
    recipes.append(recipe)

    if j % 5000 == 0:
        print(f'Progress: {j}')
    if j == MAX_RECIPES:
        print(f'Parsed {j} recipes')
        break


https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations
2022-01-22 13:21:09,233 --- The classifier was built using a different scikit-learn version (=0.24.2, !=1.0.2). The disambiguation tool could behave unexpectedly. Consider running classifier.train_classfier()


Progress: 5000
Progress: 10000
Progress: 15000
Progress: 20000
Progress: 25000
Progress: 30000
Progress: 35000
Progress: 40000
Progress: 45000
Progress: 50000
Progress: 55000
Progress: 60000
Progress: 65000
Progress: 70000
Progress: 75000
Progress: 80000
Progress: 85000
Progress: 90000
Progress: 95000
Progress: 100000
Progress: 105000
Progress: 110000
Progress: 115000
Progress: 120000
Progress: 125000
Progress: 130000
Progress: 135000
Progress: 140000
Progress: 145000
Progress: 150000
Progress: 155000
Progress: 160000
Progress: 165000
Progress: 170000
Progress: 175000
Progress: 180000
Progress: 185000
Progress: 190000
Progress: 195000
Progress: 200000
Progress: 205000
Progress: 210000
Progress: 215000
Progress: 220000
Progress: 225000
Progress: 230000
Progress: 235000
Progress: 240000
Progress: 245000
Progress: 250000
Progress: 255000
Progress: 260000
Progress: 265000
Progress: 270000
Progress: 275000
Progress: 280000
Progress: 285000
Progress: 290000
Progress: 295000
Progress: 300000
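
The `Recipe` class is imported from a separate module and is not shown in this notebook. Purely to illustrate the idea of combining both sources, the sketch below splits a raw ingredient line into amount, unit, and name once the parsed name is known. It is a simplified, regex-based stand-in written for this note, not the logic of `Recipe.get_ingredient_amounts`, and `split_amount` is a hypothetical helper.

```python
import re

# Leading number such as "8", "0.5" or "2.25"
NUMBER = re.compile(r'\s*(\d+(?:\.\d+)?)\s*')


def split_amount(raw_line, ingredient_name):
    """Split a raw line into (amount, unit, name) using the known ingredient name."""
    match = NUMBER.match(raw_line)
    amount = float(match.group(1)) if match else None
    rest = raw_line[match.end():] if match else raw_line
    # Assumption: everything between the number and the name is the unit
    position = rest.find(ingredient_name)
    unit = rest[:position].strip() if position > 0 else ''
    return amount, unit, ingredient_name


first = split_amount('1 c. elbow macaroni ', 'elbow macaroni')
second = split_amount('8 tomatoes, quartered ', 'tomatoes')
print(first)   # (1.0, 'c.', 'elbow macaroni')
print(second)  # (8.0, '', 'tomatoes')
```

A real parser also has to handle ranges, parenthesized package sizes, and unit normalization, which this sketch ignores.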


In [5]:
# Free memory that is no longer needed
del ingredient_data, data

# Build the DataFrame once at the end (faster than appending row by row)
df = pd.DataFrame([r.to_dict() for r in recipes]).set_index('id')
del recipes

path = FILE_DIR + '2022_01_19/recipes_valid'

df.to_pickle(path + '.pkl')
df.to_json(path + '.json', indent=2, orient='records')
df.head(10)

| id | title | ingredients | instructions |
|---|---|---|---|
| 000033e39b | Dilly Macaroni Salad Recipe | amount unit ingredient 0 1.... | 0 Cook macaroni according to package direct... |
| 000035f7ed | Gazpacho | amount unit ingredient 0 8.0 ... | 0 Add the tomatoes to a food processor with... |
| 00003a70b1 | Crunchy Onion Potato Bake | amount unit ingredient 0 1.... | 0 Preheat oven to 350 degrees Fah... |
| 00004320bb | Cool 'n Easy Creamy Watermelon Pie | amount unit ingredient 0 1.0... | 0 Dissolve Jello in boiling water. 1 ... |
| 0000631d90 | Easy Tropical Beef Skillet | amount unit ingredient 0... | 0 In a large skillet, toast the coconut ove... |
| 000075604a | Kombu Tea Grilled Chicken Thigh | amount unit ingredient 0 2.0 ... | 0 Pierce the skin of the chicken with a for... |
| 00007bfd16 | Strawberry Rhubarb Dump Cake | amount unit ing... | 0 Put ingredients in a buttered 9 x 12 x 2-... |
| 000095fc1d | Yogurt Parfaits | amount unit ingredient 0 ... | 0 Layer all ingredients in a serving dish. ... |
| 0000973574 | Zucchini Nut Bread | amount unit ingredient 0 2... | 0 Sift dry ingr... |
| 0000b1e2b5 | Fennel-Rubbed Pork Tenderloin with Roasted Fen... | amount unit ing... | 0 Preheat oven to 350F with rack i... |
