<a href="https://colab.research.google.com/github/mscholl96/mad-recime/blob/recipe1M-parser/data/recipe1M/parser/parser.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recipe1M parser

In [1]:
!pip install quantulum3
!pip install stemming
!pip install pattern


In [2]:
# Add GDrive
from google.colab import drive
import sys
drive.mount('/content/drive/')
sys.path.append('/content/drive/My Drive/Datasets/Recipe1M/')

Mounted at /content/drive/


In [1]:
import pandas as pd
import re
from recipe import Recipe

#FILE_DIR = '../'
FILE_DIR = '/content/drive/My Drive/Datasets/Recipe1M/'

## Recipe1M data
Recipe1M ships several JSON files containing recipes crawled from the web. Two of them are relevant for our project:
* layer1.json: Contains every recipe in full
  
  ![layer1](https://github.com/mscholl96/mad-recime/blob/recipe1M-parser/data/recipe1M/dataset-analysis/layer1_puml.png?raw=1)

* det_ingrs.json: Contains only the recipe ID, the parsed ingredients, and a validity flag for each parsed ingredient
  
  ![det_ingrs](https://github.com/mscholl96/mad-recime/blob/recipe1M-parser/data/recipe1M/dataset-analysis/det_ingrs_puml.png?raw=1)

In our first attempt, we use the parsed ingredient list and only consider recipes in which all ingredients are marked valid. The parsed ingredients don't contain amounts, so our parser has to merge the content of both files: it takes the ingredient names from det_ingrs.json and their amounts and units from layer1.json.
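The merge described above can be illustrated on a toy scale. The following is a minimal sketch in plain Python, not the notebook's pandas code: the two record lists are made-up stand-ins for layer1.json and det_ingrs.json, and `merge_valid` is a hypothetical helper name. The record layout (an `id`, a list of `{'text': ...}` entries, and a `valid` list with one flag per ingredient) is assumed from the columns used in the cells below.

```python
# Toy stand-ins for layer1.json and det_ingrs.json (assumed record layout).
layer1 = [
    {"id": "a1", "title": "Pancakes",
     "ingredients": [{"text": "2 cups flour"}, {"text": "1 cup milk"}],
     "instructions": [{"text": "Mix and fry."}]},
    {"id": "b2", "title": "Mystery Stew",
     "ingredients": [{"text": "1 lb beef"}, {"text": "some stuff"}],
     "instructions": [{"text": "Simmer."}]},
]
det_ingrs = [
    {"id": "a1", "ingredients": [{"text": "flour"}, {"text": "milk"}],
     "valid": [True, True]},
    {"id": "b2", "ingredients": [{"text": "beef"}, {"text": "stuff"}],
     "valid": [True, False]},
]


def merge_valid(layer1_records, det_ingr_records):
    """Join both record lists on 'id', keeping only fully valid recipes."""
    parsed_by_id = {rec["id"]: rec for rec in det_ingr_records}
    merged = {}
    for recipe in layer1_records:
        parsed = parsed_by_id.get(recipe["id"])
        # Skip recipes without parsed data or with any invalid ingredient.
        if parsed is None or not all(parsed["valid"]):
            continue
        merged[recipe["id"]] = {
            "title": recipe["title"],
            "ingredients": recipe["ingredients"],         # raw text incl. amounts
            "ingredients_parsed": parsed["ingredients"],  # ingredient names only
            "instructions": recipe["instructions"],
        }
    return merged


merged = merge_valid(layer1, det_ingrs)
print(sorted(merged))
```

Only the first toy recipe survives, because the second one carries an invalid flag. Each surviving entry holds both the raw ingredient text and the parsed ingredient names, which is what the parser needs to recover amounts and units.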

## Preprocessing
Remove all invalid entries from the ingredient data and the full recipe data to reduce memory usage, and store the result as pickle instead of JSON.

In [2]:
ingredient_data = None
recipe_data = None
ingredient_file = FILE_DIR + 'det_ingrs.json'
layer1_file = FILE_DIR + 'layer1.json'

ingredient_out = FILE_DIR + '2022_01_29/det_ingrs_valid.pkl'
layer1_out = FILE_DIR + '2022_01_29/layer1_valid.pkl'

# Get data, set id as used index and drop unnecessary information
ingredient_data = pd.read_json(ingredient_file).set_index('id')
recipe_data = pd.read_json(layer1_file).drop(columns=['url', 'partition']).set_index('id')

# Drop recipes with more than 20 ingredients
indices = ingredient_data[ingredient_data['valid'].map(len) > 20].index
ingredient_data = ingredient_data.drop(indices)
recipe_data = recipe_data.drop(indices)
print(f'Removed {len(indices)} recipes with more than 20 ingredients')

# Drop recipes with more than 30 instructions
indices = recipe_data[recipe_data['instructions'].map(len) > 30].index
ingredient_data = ingredient_data.drop(indices)
recipe_data = recipe_data.drop(indices)
print(f'Removed {len(indices)} recipes with more than 30 instructions')

# Drop recipes in which at least one ingredient is flagged invalid by the data set
indices = ingredient_data[ingredient_data['valid'].map(lambda flags: False in flags)].index
ingredient_data = ingredient_data.drop(indices).drop(columns=['valid'])
recipe_data = recipe_data.drop(indices)
print(f'Removed {len(indices)} recipes with invalid ingredients')

indices = []
# Replace fractions in raw ingredients
fractionRegex = re.compile('[0-9]+/[0-9]+')
# Replace patterns like "1 12" which is intended to be 1 1/2 (1.5)
doubleRegex = re.compile('[1-9] [1-9][2-9]')

# Bring both datasets together
recipe_data['ingredients_parsed'] = ingredient_data['ingredients']

for idx, recipe in recipe_data.iterrows():
    # Normalise fractional amounts in the raw ingredient text
    for ingredient in recipe['ingredients']:
        text = ingredient['text']
        if text:
            fractions = re.findall(fractionRegex, text)
            doubleNumbers = re.findall(doubleRegex, text)
            if fractions:
                numbers = fractions[0].split('/')
                float_representation = int(numbers[0])/int(numbers[1])
                text = text.replace(fractions[0], str(float_representation))
                ingredient['text'] = text
            elif doubleNumbers:
                numbers = doubleNumbers[0].split(' ')
                float_representation = int(numbers[0]) + int(numbers[1][0])/int(numbers[1][1])
                text = text.replace(doubleNumbers[0], str(float_representation))
                ingredient['text'] = text
        elif idx not in indices:
            # To be removed later
            indices.append(idx)



# Remove empty ingredients from data
# (.at writes into the frame; chained .loc[...][...] assignment may only modify a copy)
for index in indices:
    recipe_data.at[index, 'ingredients'] = [elem for elem in recipe_data.at[index, 'ingredients'] if elem['text']]
    recipe_data.at[index, 'ingredients_parsed'] = [elem for elem in recipe_data.at[index, 'ingredients_parsed'] if elem['text']]

# Save data to pickle (it's faster)
recipe_data.to_pickle(layer1_out)

recipe_data.head(10)



Removed 18626 recipes with more than 20 ingredients
Removed 14270 recipes with more than 30 instructions
Removed 149732 recipes with invalid ingredients


id,ingredients,title,instructions,ingredients_parsed
000033e39b,"[{'text': '1 c. elbow macaroni'}, {'text': '1 ...",Dilly Macaroni Salad Recipe,[{'text': 'Cook macaroni according to package ...,"[{'text': 'elbow macaroni'}, {'text': 'America..."
000035f7ed,"[{'text': '8 tomatoes, quartered'}, {'text': '...",Gazpacho,[{'text': 'Add the tomatoes to a food processo...,"[{'text': 'tomatoes'}, {'text': 'kosher salt'}..."
00003a70b1,"[{'text': '2.5 cups milk'}, {'text': '1.5 cups...",Crunchy Onion Potato Bake,[{'text': 'Preheat oven to 350 degrees Fahrenh...,"[{'text': 'milk'}, {'text': 'water'}, {'text':..."
00004320bb,[{'text': '1 (3 ounce) package watermelon gela...,Cool 'n Easy Creamy Watermelon Pie,"[{'text': 'Dissolve Jello in boiling water.'},...","[{'text': 'watermelon gelatin'}, {'text': 'boi..."
0000631d90,"[{'text': '12 cup shredded coconut'}, {'text':...",Easy Tropical Beef Skillet,"[{'text': 'In a large skillet, toast the cocon...","[{'text': 'shredded coconut'}, {'text': 'lean ..."
000075604a,"[{'text': '2 Chicken thighs'}, {'text': '2 tsp...",Kombu Tea Grilled Chicken Thigh,[{'text': 'Pierce the skin of the chicken with...,"[{'text': 'chicken thighs'}, {'text': 'tea'}, ..."
00007bfd16,"[{'text': '6 -8 cups fresh rhubarb, or'}, {'te...",Strawberry Rhubarb Dump Cake,[{'text': 'Put ingredients in a buttered 9 x 1...,"[{'text': 'fresh rhubarb'}, {'text': 'frozen r..."
000095fc1d,"[{'text': '8 ounces, weight Light Fat Free Van...",Yogurt Parfaits,[{'text': 'Layer all ingredients in a serving ...,"[{'text': 'non - fat vanilla yogurt'}, {'text'..."
0000973574,"[{'text': '2 cups flour'}, {'text': '1 tablesp...",Zucchini Nut Bread,"[{'text': 'Sift dry ingredients.'}, {'text': '...","[{'text': 'flour'}, {'text': 'cinnamon'}, {'te..."
0000b1e2b5,"[{'text': '1 teaspoon fennel seeds'}, {'text':...",Fennel-Rubbed Pork Tenderloin with Roasted Fen...,[{'text': 'Preheat oven to 350F with rack in m...,"[{'text': 'fennel seeds'}, {'text': 'pork tend..."
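The amount normalisation from the preprocessing cell can be isolated into a small function for experimenting with single ingredient strings. This is a sketch that mirrors the two regular expressions used above. The function name `normalize_amount` and the zero-denominator guard are our additions for illustration and are not part of the notebook.

```python
import re

# Same patterns as in the preprocessing cell.
FRACTION_RE = re.compile(r"[0-9]+/[0-9]+")
# Matches patterns like "1 12", which are intended to be 1 1/2 (1.5).
DOUBLE_RE = re.compile(r"[1-9] [1-9][2-9]")


def normalize_amount(text):
    """Replace the first fraction or 'double number' in text by a float."""
    fraction = FRACTION_RE.search(text)
    if fraction:
        numerator, denominator = fraction.group().split("/")
        if int(denominator) == 0:
            # Assumption: leave malformed fractions such as "1/0" untouched.
            return text
        value = int(numerator) / int(denominator)
        return text.replace(fraction.group(), str(value))
    double = DOUBLE_RE.search(text)
    if double:
        whole, frac = double.group().split(" ")
        value = int(whole) + int(frac[0]) / int(frac[1])
        return text.replace(double.group(), str(value))
    return text


print(normalize_amount("1/2 cup sugar"))    # 0.5 cup sugar
print(normalize_amount("2 12 cups milk"))   # 2.5 cups milk
print(normalize_amount("12 cup shredded coconut"))  # unchanged
```

Two limitations are visible with this logic: a lone `12` that was meant as 1/2 stays unchanged (see the shredded coconut row above), and a mixed number such as `1 1/2 cups` becomes `1 0.5 cups` because only the fraction part is replaced.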


## Actual parsing

In [2]:
recipes = []
# Load the preprocessed recipes (indexed by ID)
recipe_data = pd.read_pickle(FILE_DIR + '2022_01_29/layer1_valid.pkl')
print(f'Total number of recipes: {len(recipe_data)}')
recipe_data.head(5)


Total number of recipes: 847092


id,ingredients,title,instructions,ingredients_parsed
000033e39b,"[{'text': '1 c. elbow macaroni'}, {'text': '1 ...",Dilly Macaroni Salad Recipe,[{'text': 'Cook macaroni according to package ...,"[{'text': 'elbow macaroni'}, {'text': 'America..."
000035f7ed,"[{'text': '8 tomatoes, quartered'}, {'text': '...",Gazpacho,[{'text': 'Add the tomatoes to a food processo...,"[{'text': 'tomatoes'}, {'text': 'kosher salt'}..."
00003a70b1,"[{'text': '2.5 cups milk'}, {'text': '1.5 cups...",Crunchy Onion Potato Bake,[{'text': 'Preheat oven to 350 degrees Fahrenh...,"[{'text': 'milk'}, {'text': 'water'}, {'text':..."
00004320bb,[{'text': '1 (3 ounce) package watermelon gela...,Cool 'n Easy Creamy Watermelon Pie,"[{'text': 'Dissolve Jello in boiling water.'},...","[{'text': 'watermelon gelatin'}, {'text': 'boi..."
0000631d90,"[{'text': '12 cup shredded coconut'}, {'text':...",Easy Tropical Beef Skillet,"[{'text': 'In a large skillet, toast the cocon...","[{'text': 'shredded coconut'}, {'text': 'lean ..."


In [None]:
for j, (idx, raw_recipe) in enumerate(recipe_data.iterrows(), start=1):
    recipe = Recipe(idx, raw_recipe['title'])
    recipe.parse_ingredients(raw_recipe['ingredients_parsed'])
    recipe.parse_instructions(raw_recipe['instructions'])
    recipe.get_ingredient_amounts(raw_recipe['ingredients'])
    recipes.append(recipe)

    if j % 5000 == 0:
        print(f'Progress: {j}')
        # Save a chunk every 100000 recipes and start a fresh list
        if j % 100000 == 0:
            print(f'Parsed {j} recipes')
            # Build the data frame from the collected list in one go
            # (according to Stackoverflow this is faster than appending rows)
            df = pd.DataFrame([r.to_dict() for r in recipes]).set_index('id')
            path = FILE_DIR + f'2022_01_29/recipes_valid_{j // 100000}'
            df.to_pickle(path + '.pkl')
            del df
            recipes = []


https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations
2022-01-29 14:13:52,076 --- The classifier was built using a different scikit-learn version (=0.24.2, !=1.0.2). The disambiguation tool could behave unexpectedly. Consider running classifier.train_classfier()


Progress: 5000
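The chunked saving in the parsing loop keeps memory bounded by flushing the collected recipes to disk at fixed intervals. The following sketch shows the same pattern with the standard library only. `save_in_chunks`, the toy records, and the chunk size of 3 are assumptions for illustration; the notebook itself works with `Recipe` objects, pandas data frames, and chunks of 100000.

```python
import pickle
import tempfile
from pathlib import Path


def save_in_chunks(items, out_dir, chunk_size, prefix="recipes_valid"):
    """Pickle items in files of chunk_size each; return the unsaved remainder."""
    out_dir = Path(out_dir)
    buffer = []
    chunk_no = 0
    for item in items:
        buffer.append(item)
        if len(buffer) == chunk_size:
            chunk_no += 1
            with open(out_dir / f"{prefix}_{chunk_no}.pkl", "wb") as fh:
                pickle.dump(buffer, fh)
            # Start a fresh list so the saved chunk can be garbage collected.
            buffer = []
    return buffer


with tempfile.TemporaryDirectory() as tmp:
    fake_recipes = [{"id": i, "title": f"recipe {i}"} for i in range(7)]
    remainder = save_in_chunks(fake_recipes, tmp, chunk_size=3)
    written = sorted(p.name for p in Path(tmp).iterdir())
    print(written)
    print(len(remainder))
```

With seven toy records and a chunk size of 3, two chunk files are written and one record is left over. In the notebook, the leftover list appears to be what the final cell writes to `recipes_valid.pkl`, so the complete result would be the chunk files plus that last file.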


In [None]:
# Try to clean up
del recipe_data

# Create data frame in the end (according to Stackoverflow this is faster)                
df = pd.DataFrame([r.to_dict() for r in recipes]).set_index('id')
del recipes

path = FILE_DIR + '2022_01_29/recipes_valid'

df.to_pickle(path + '.pkl')
df.to_json(path + '.json', indent=2, orient='records')
df.head(10)

id,title,ingredients,instructions
000033e39b,Dilly Macaroni Salad Recipe,amount unit ingredient 0 1....,0 Cook macaroni according to package direct...
000035f7ed,Gazpacho,amount unit ingredient 0 8.0 ...,0 Add the tomatoes to a food processor with...
00003a70b1,Crunchy Onion Potato Bake,amount unit ingredient 0 1....,0 Preheat oven to 350 degrees Fah...
00004320bb,Cool 'n Easy Creamy Watermelon Pie,amount unit ingredient 0 1.0...,0 Dissolve Jello in boiling water. 1 ...
0000631d90,Easy Tropical Beef Skillet,amount unit ingredient 0...,"0 In a large skillet, toast the coconut ove..."
000075604a,Kombu Tea Grilled Chicken Thigh,amount unit ingredient 0 2.0 ...,0 Pierce the skin of the chicken with a for...
00007bfd16,Strawberry Rhubarb Dump Cake,amount unit ing...,0 Put ingredients in a buttered 9 x 12 x 2-...
000095fc1d,Yogurt Parfaits,amount unit ingredient 0 ...,0 Layer all ingredients in a serving dish. ...
0000973574,Zucchini Nut Bread,amount unit ingredient 0 2...,0 Sift dry ingr...
0000b1e2b5,Fennel-Rubbed Pork Tenderloin with Roasted Fen...,amount unit ing...,0 Preheat oven to 350F with rack i...
