# Recipe1M parser

In [5]:
import pandas as pd
import re
from recipe import Recipe

## Recipe1M data
Recipe1M ships with several JSON files containing recipes crawled from the web. Two of them are relevant to our project:
* layer1.json: Contains all recipes in full
  
  ![layer1](../dataset-analysis/layer1_puml.png)

* det_ingrs.json: Contains only the recipe ID, the parsed ingredients and a validity flag for each parsed ingredient
  
  ![det_ingrs](../dataset-analysis/det_ingrs_puml.png)

In our first attempt we use the parsed ingredient list and only consider recipes in which all ingredients are marked valid. The parsed ingredients do not contain amounts, so our parser has to merge the content of both files: it takes the ingredients from det_ingrs.json and their amounts and units from layer1.json.
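
The merge described above can be illustrated on plain Python records. This is only a minimal sketch: the two toy records are hypothetical, the helper `pair_ingredients` is not part of the project, and it assumes that both files list the ingredients of a recipe in the same order and store each one under a `text` key.

```python
# Hypothetical toy records in the assumed layout of the two files
layer1_record = {
    "id": "toy-1",
    "ingredients": [{"text": "1 c. elbow macaroni"}, {"text": "1/2 c. sliced celery"}],
}
det_ingrs_record = {
    "id": "toy-1",
    "valid": [True, True],
    "ingredients": [{"text": "elbow macaroni"}, {"text": "celery"}],
}


def pair_ingredients(raw_recipe, parsed_recipe):
    """Pair each parsed ingredient name with its raw text line.

    Returns None if any ingredient of the recipe is flagged invalid.
    Assumes both records list the ingredients in the same order.
    """
    if raw_recipe["id"] != parsed_recipe["id"]:
        raise ValueError("records belong to different recipes")
    if not all(parsed_recipe["valid"]):
        return None
    return [
        (parsed["text"], raw["text"])
        for parsed, raw in zip(parsed_recipe["ingredients"], raw_recipe["ingredients"])
    ]


pairs = pair_ingredients(layer1_record, det_ingrs_record)
for name, raw_line in pairs:
    print(f"{name!r} <- {raw_line!r}")
```

The raw line of each pair is where amount and unit would then be extracted from.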

## Preprocessing
Remove all invalid entries from both the ingredient JSON and the full recipe JSON to reduce memory usage, and store the result as pickle instead of JSON.
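
As a rough illustration of this step without pandas, the sketch below filters toy records and round-trips the result through pickle. The records, the helper `split_valid` and the temporary file location are assumptions made for the example; only the rule itself (drop a recipe as soon as one ingredient is flagged invalid) comes from the notebook.

```python
import os
import pickle
import tempfile

# Hypothetical toy records standing in for det_ingrs.json and layer1.json
det_ingrs = [
    {"id": "a1", "valid": [True, True]},
    {"id": "b2", "valid": [True, False]},
]
layer1 = [
    {"id": "a1", "title": "Salad"},
    {"id": "b2", "title": "Soup"},
]


def split_valid(det_records):
    """Return (kept_records, dropped_ids); a recipe is kept only if every flag is True."""
    kept = [record for record in det_records if all(record["valid"])]
    dropped = {record["id"] for record in det_records if not all(record["valid"])}
    return kept, dropped


kept, dropped_ids = split_valid(det_ingrs)
# Drop the same ids from the full recipes
layer1_valid = [record for record in layer1 if record["id"] not in dropped_ids]

# Store the reduced data as pickle and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "layer1_valid.pkl")
    with open(path, "wb") as handle:
        pickle.dump(layer1_valid, handle)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)
```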

In [12]:
# Remove all recipes from the ingredient JSON that contain at least one ingredient flagged invalid by the data set
ingredient_data = pd.read_json('../ingSub.json')
recipe_raw_data = pd.read_json('../layer1_stripped.json')

# Collect the positions of all recipes that contain at least one invalid ingredient
indices = [i for i, row in enumerate(ingredient_data.valid) if not all(row)]

# Frame of ids that have to be dropped from raw data
drop_ids = pd.DataFrame(ingredient_data.iloc[indices]['id'])

# Drop indices from ingredient data
ingredient_data.drop(indices, inplace=True)
ingredient_data.info()

# Remove data from raw recipes where id matches
recipe_mod = recipe_raw_data[~recipe_raw_data.id.isin(drop_ids.id)]
recipe_mod.info()

# Replace fractions such as "1/2" in the raw ingredient text with floats ("0.5")
fraction_regex = re.compile(r"[0-9]+/[0-9]+")

def convert_fractions(ingredients):
    ingredients_mod = []
    for ingredient in ingredients:
        ingredient_mod = ""
        for word in ingredient['text'].split(' '):
            # fullmatch skips tokens like "1/2-inch", which int() could not parse
            if fraction_regex.fullmatch(word):
                numerator, denominator = word.split('/')
                ingredient_mod += f'{int(numerator) / int(denominator)} '
            else:
                ingredient_mod += f'{word} '
        ingredients_mod.append({'text': ingredient_mod})
    return ingredients_mod

# Rows yielded by iterrows() may be copies, so write the result back via assign()
recipe_mod = recipe_mod.assign(ingredients=recipe_mod['ingredients'].apply(convert_fractions))

recipe_mod.info()

# TODO: Replace unparseable abbreviations such as "c." with "cup"

# Save data to pickle (it's faster)
ingredient_data.to_pickle('../det_ingrs_valid.pkl')
recipe_mod.to_pickle('../layer1_valid.pkl')


<class 'pandas.core.frame.DataFrame'>
Int64Index: 21 entries, 1 to 28
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   valid        21 non-null     object
 1   id           21 non-null     object
 2   ingredients  21 non-null     object
dtypes: object(3)
memory usage: 672.0+ bytes
<class 'pandas.core.frame.DataFrame'>
Int64Index: 22 entries, 1 to 29
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   ingredients   22 non-null     object
 1   url           22 non-null     object
 2   partition     22 non-null     object
 3   title         22 non-null     object
 4   id            22 non-null     object
 5   instructions  22 non-null     object
dtypes: object(6)
memory usage: 1.2+ KB

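The fraction conversion from the preprocessing cell can also be tried in isolation. The following is an alternative sketch based on `re.sub` rather than the word-by-word loop used in the notebook; the function name `fractions_to_decimal` and the exact pattern are assumptions made for this example.

```python
import re

# A fraction is "digits/digits" that is not part of a longer number or chain like 1/2/3
FRACTION = re.compile(r"(?<![\d/.])(\d+)/(\d+)(?![\d/])")


def fractions_to_decimal(text):
    """Replace every simple fraction in text with its float representation."""

    def replace(match):
        denominator = int(match.group(2))
        if denominator == 0:
            # Leave "1/0" untouched instead of raising ZeroDivisionError
            return match.group(0)
        return str(int(match.group(1)) / denominator)

    return FRACTION.sub(replace, text)


print(fractions_to_decimal("1/2 c. sliced celery"))
print(fractions_to_decimal("3/4 cup sugar, cut into 1/2-inch cubes"))
```

Unlike the loop, this variant also converts fractions that are glued to other characters (for example `1/2-inch`) and does not append a trailing space.
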
## Actual parsing

In [14]:
recipes = []
data = pd.read_pickle('../layer1_valid.pkl')
num_recipes = len(data)
print(f'Total number of recipes: {num_recipes}')
# Use id as index for easy access
data = data.set_index('id')

ingredient_data = pd.read_pickle('../det_ingrs_valid.pkl')
for _, row in ingredient_data.iterrows():
    recipe = Recipe(row['id'])
    
    # Skip this recipe if the ingredients could not be parsed
    if False == recipe.parse_ingredients(row['ingredients']):
        continue
    
    # Find raw recipe by id
    raw_recipe = data.loc[recipe.id]
    recipe.get_ingredient_amounts(raw_recipe['ingredients'])
    
    # Skip this recipe if the instructions could not be parsed
    if False == recipe.parse_instructions(raw_recipe['instructions']):
        continue

    recipe.title = raw_recipe['title']
    recipes.append(recipe)

# Create the data frame once at the end (according to Stack Overflow this is faster than appending row by row)
df = pd.DataFrame([vars(r) for r in recipes])
df = df.set_index('id')
df.to_pickle('../recipes_valid.pkl')
df.head(10)

Total number of recipes: 22
Determined entity: currency, from unit: centavo of text: 1 c. elbow macaroni 
Determined entity: currency, from unit: centavo of text: 0.5 c. sliced celery 
Determined entity: currency, from unit: centavo of text: 0.5 c. mayonnaise or possibly salad dressing 
Determined entity: unknown, from unit: cubic cup of text: 2 cups cubed seedless watermelon 


| id | title | ingredients | instructions |
|---|---|---|---|
| 000033e39b | Dilly Macaroni Salad Recipe | amount unit ingredient 0 1.0... | 0 Cook macaroni according to package direct... |
| 000035f7ed | Gazpacho | amount unit ingredient 0 8.0 ... | 0 Add the tomatoes to a food processor with... |
| 00003a70b1 | Crunchy Onion Potato Bake | amount unit ingredient 0 1 ... | 0 Preheat oven to 350 degrees Fah... |
| 00004320bb | Cool 'n Easy Creamy Watermelon Pie | amount unit ingredient 0 ... | 0 Dissolve Jello in boiling water. 1 ... |
| 0000631d90 | Easy Tropical Beef Skillet | amount unit ingredient 0 ... | 0 In a large skillet, toast the coconut ove... |
| 000075604a | Kombu Tea Grilled Chicken Thigh | amount unit ingredient 0 2.0 ... | 0 Pierce the skin of the chicken with a for... |
| 00007bfd16 | Strawberry Rhubarb Dump Cake | amount unit ingred... | 0 Put ingredients in a buttered 9 x 12 x 2-... |
| 000095fc1d | Yogurt Parfaits | amount unit ingredient 0 ... | 0 Layer all ingredients in a serving dish. ... |
| 0000973574 | Zucchini Nut Bread | amount unit ingredient 0 2.0... | 0 Sift dry ingr... |
| 0000b1e2b5 | Fennel-Rubbed Pork Tenderloin with Roasted Fen... | amount unit ingr... | 0 Preheat oven to 350F with rack i... |
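
The log output of the parsing cell shows `c.` being resolved to the currency unit centavo instead of cup, which is what the TODO in the preprocessing cell refers to. One possible normalization step is sketched below. The table `UNIT_ABBREVIATIONS` and the function `expand_units` are hypothetical; only the `c.` to `cup` mapping is taken from the notebook, the other entries are assumptions.

```python
# Hypothetical abbreviation table; only "c." -> "cup" comes from the notebook's TODO
UNIT_ABBREVIATIONS = {
    "c.": "cup",
    "tsp.": "teaspoon",
    "tbsp.": "tablespoon",
}


def expand_units(text):
    """Replace known unit abbreviations word by word; unknown words are kept as they are."""
    return " ".join(UNIT_ABBREVIATIONS.get(word.lower(), word) for word in text.split())


print(expand_units("1 c. elbow macaroni "))
print(expand_units("0.5 c. mayonnaise or possibly salad dressing "))
```

Running such a step before the unit parser sees the text would keep it from guessing at the abbreviation. It does not address the `cups cubed` case from the log, where the unit itself is spelled out.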
