**Citation**

The data for this project comes from N. Sakib, G. Shahariar, M. M. Kabir, M. K. Hasan, and H. Mahmud, “Assorted, archetypal and annotated two million (3a2m) cooking recipes dataset based on active learning” (https://www.kaggle.com/datasets/nazmussakibrupol/3a2m-cooking-recipe-dataset).

In [1]:
import pandas as pd
from pathlib import Path
import json

In [2]:
current_dir = Path.cwd()  # .../src/data_processing
project_root = current_dir.parent.parent  # Go up 2 levels to main dir

output_parent = project_root / "data" / "processed"
raw_data_folder = project_root / "data" / "raw"

In [3]:
df = pd.read_csv(raw_data_folder / "3A2M.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,title,directions,NER,genre,label
0,0,Reeses Cups(Candy),"[""Combine first four ingredients and press in ...","[""peanut butter"", ""graham cracker crumbs"", ""bu...",drinks,2
1,1,Rhubarb Coffee Cake,"[""Cream sugar and butter."", ""Add egg and beat ...","[""sugar"", ""butter"", ""egg"", ""buttermilk"", ""flou...",drinks,2
2,2,Quick Barbecue Wings,"[""Clean wings."", ""Flour and fry until done."", ...","[""chicken"", ""flour"", ""barbecue sauce""]",nonveg,3
3,3,Chocolate Frango Mints,"[""Mix ingredients together for 5 minutes."", ""S...","[""cake mix"", ""chocolate fudge pudding"", ""sour ...",drinks,2
4,4,Corral Barbecued Beef Steak Strips,"[""Brown strips in cooking oil."", ""Pour off dri...","[""long"", ""cooking oil"", ""tomato sauce"", ""water...",drinks,2


In [4]:
df.shape

(2231143, 6)

The dataset has over 2 million rows (2,231,143), far more than this project needs, so we use stratified random sampling on `genre` and draw 500 random rows from each genre. With the 9 genres we expected, that would give 4500 rows; the sampled frame below has 5000 rows, which means the `genre` column holds 10 distinct values in this file.
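As a rough illustration of the sampling step, the sketch below runs the same idea on a tiny made-up frame. It uses `DataFrameGroupBy.sample` rather than the `groupby(...).apply(...)` call in this notebook, so it would not pick the same rows as the cells here. The helper name `stratified_sample` and the toy data are assumptions for illustration only.

```python
import pandas as pd


def stratified_sample(frame: pd.DataFrame, by: str, n: int, seed: int) -> pd.DataFrame:
    """Draw n rows from every group of `by` (hypothetical helper, not from the notebook)."""
    counts = frame[by].value_counts()
    too_small = counts[counts < n]
    if not too_small.empty:
        # Sampling without replacement needs at least n rows in every group
        raise ValueError(f"groups with fewer than {n} rows: {list(too_small.index)}")
    return frame.groupby(by).sample(n=n, random_state=seed)


# Toy data: three genres with 4, 3 and 2 rows
toy = pd.DataFrame(
    {
        "title": [f"recipe {i}" for i in range(9)],
        "genre": ["drinks"] * 4 + ["nonveg"] * 3 + ["Fusion"] * 2,
    }
)

toy_sample = stratified_sample(toy, by="genre", n=2, seed=36)
print(toy_sample["genre"].value_counts())
```

Running `value_counts()` on the real `genre` column before sampling would also show how many distinct genre labels the file contains.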

In [5]:
df.drop(['Unnamed: 0', 'label'], axis=1, inplace=True)
df.head()

Unnamed: 0,title,directions,NER,genre
0,Reeses Cups(Candy),"[""Combine first four ingredients and press in ...","[""peanut butter"", ""graham cracker crumbs"", ""bu...",drinks
1,Rhubarb Coffee Cake,"[""Cream sugar and butter."", ""Add egg and beat ...","[""sugar"", ""butter"", ""egg"", ""buttermilk"", ""flou...",drinks
2,Quick Barbecue Wings,"[""Clean wings."", ""Flour and fry until done."", ...","[""chicken"", ""flour"", ""barbecue sauce""]",nonveg
3,Chocolate Frango Mints,"[""Mix ingredients together for 5 minutes."", ""S...","[""cake mix"", ""chocolate fudge pudding"", ""sour ...",drinks
4,Corral Barbecued Beef Steak Strips,"[""Brown strips in cooking oil."", ""Pour off dri...","[""long"", ""cooking oil"", ""tomato sauce"", ""water...",drinks


In [6]:
stratified_sampled_df = (
    df
    .groupby("genre", group_keys=False)
    .apply(lambda x: x.sample(n=500, random_state=36))
)

stratified_sampled_df.head()


Unnamed: 0,title,directions,NER,genre
2108573,Breakfast Speciality,Butter one or both sides of bread. Cut in cube...,"[""eggs"", ""bread"", ""pork sausage"", ""Cheddar che...",Fusion
2003055,Hello Dollies,Combine ingredients. Bake for 30 minutes at 35...,"[""oleo"", ""graham crackers"", ""milk"", ""chocolate...",Fusion
2059741,Linguine A La Margarita D.,Put the vegetable oil in saucepan or melt the ...,"[""linguine"", ""vegetable oil"", ""garlic"", ""lime ...",Fusion
2178316,Stuffed Mirlitons,Peel mirlitons and boil until tender. Save som...,"[""mirlitons"", ""onions"", ""celery"", ""bell pepper...",Fusion
2215443,Noodle Squeal,Cut and cook sausage pepper and onion together...,"[""noodles"", ""onion"", ""green pepper"", ""tomato s...",Fusion


In [7]:
stratified_sampled_df.shape

(5000, 4)

In [8]:
stratified_sampled_df.rename(columns={'title': 'recipe_name', 'NER': 'ingredients', 'genre': 'recipe_category'}, inplace=True)
stratified_sampled_df.head()

Unnamed: 0,recipe_name,directions,ingredients,recipe_category
2108573,Breakfast Speciality,Butter one or both sides of bread. Cut in cube...,"[""eggs"", ""bread"", ""pork sausage"", ""Cheddar che...",Fusion
2003055,Hello Dollies,Combine ingredients. Bake for 30 minutes at 35...,"[""oleo"", ""graham crackers"", ""milk"", ""chocolate...",Fusion
2059741,Linguine A La Margarita D.,Put the vegetable oil in saucepan or melt the ...,"[""linguine"", ""vegetable oil"", ""garlic"", ""lime ...",Fusion
2178316,Stuffed Mirlitons,Peel mirlitons and boil until tender. Save som...,"[""mirlitons"", ""onions"", ""celery"", ""bell pepper...",Fusion
2215443,Noodle Squeal,Cut and cook sausage pepper and onion together...,"[""noodles"", ""onion"", ""green pepper"", ""tomato s...",Fusion


In [9]:
# Convert each row to a dict so the sample can be serialized as a JSON array
recipe_list = stratified_sampled_df.to_dict(orient="records")

print(recipe_list[0:5])

[{'recipe_name': 'Breakfast Speciality', 'directions': 'Butter one or both sides of bread. Cut in cubes and lay in pan (9 x 13-inch). Sprinkle cooked sausage over bread cubes. Grate cheese over sausage. In a bowl combine eggs half and half salt pepper and dry mustard. Mix at slow speed until well blended. Pour over other ingredients and refrigerate overnight. Bake at 350\\u00b0 about 40 to 45 minutes in the morning and do not drain.', 'ingredients': '["eggs", "bread", "pork sausage", "Cheddar cheese", "dry mustard", "salt"]', 'recipe_category': 'Fusion'}, {'recipe_name': 'Hello Dollies', 'directions': 'Combine ingredients. Bake for 30 minutes at 350\\u00b0 in a 9 x 13-inch pan.', 'ingredients': '["oleo", "graham crackers", "milk", "chocolate chips", "coconut", "nuts"]', 'recipe_category': 'Fusion'}, {'recipe_name': 'Linguine A La Margarita D.', 'directions': "Put the vegetable oil in saucepan or melt the butter. Add the squeezed garlic lime juice lemon juice vermouth salt and pepper co
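In the records printed above, `ingredients` is a single string containing a JSON-style list, not a Python list. If nested arrays are wanted in the exported JSON, one option is to decode that column with `json.loads` before building the records. The sketch below does this on two made-up rows; it assumes every value is valid JSON, which has not been checked for the full dataset.

```python
import json

import pandas as pd

# Made-up rows that mimic the string-encoded ingredient lists
toy = pd.DataFrame(
    {
        "recipe_name": ["Toast", "Tea"],
        "ingredients": ['["bread", "butter"]', '["water", "tea leaves"]'],
    }
)

# Decode each JSON string into a real Python list
toy["ingredients"] = toy["ingredients"].apply(json.loads)

records = toy.to_dict(orient="records")
print(json.dumps(records[0], indent=2))
```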

In [10]:
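# Create the output folder if it does not exist yet, then write the records as JSON
output_parent.mkdir(parents=True, exist_ok=True)
with open(output_parent / "3a2m_recipe_data.json", "w", encoding="utf-8") as f:
    json.dump(recipe_list, f, indent=2, ensure_ascii=False)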
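
A quick way to confirm the export is to read the file back and compare it with what was written. The sketch below does so in a temporary directory with made-up records, so the file name `recipes_check.json` and the data are placeholders rather than the project's actual output.

```python
import json
import tempfile
from pathlib import Path

# Made-up records, including a non-ASCII degree sign
records = [
    {
        "recipe_name": "Toast",
        "directions": "Bake at 350° for 5 minutes.",
        "ingredients": '["bread", "butter"]',
        "recipe_category": "bakery",
    }
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "recipes_check.json"  # placeholder file name
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)

    # Read the file back both as raw text and as parsed JSON
    raw_text = out_path.read_text(encoding="utf-8")
    with open(out_path, encoding="utf-8") as f:
        loaded = json.load(f)

round_trip_ok = loaded == records
degree_kept = "°" in raw_text
print(round_trip_ok, degree_kept)
```

With `ensure_ascii=False`, non-ASCII characters such as the degree sign are written as-is rather than as `\u00b0` escapes, which is why the file is opened with an explicit UTF-8 encoding.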