Up until now, the data has been downloaded and preprocessed to remove unnecessary columns and features. Then, a set of rules-based regex functions was created to produce soft labels for common restricted-ingredient classes (allergens, cultural restrictions, etc.). Now, these functions will be used to seed-label a subset of the cleaned training examples (~10,000), which will eventually be used to train an XGBoost classifier that labels all classes more robustly. This will allow all of the data to be flexibly labeled, creating a high-quality training set for LLM fine-tuning.
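
To make the idea of a regex-based soft label concrete, here is a minimal sketch. The pattern names, the patterns themselves, and the `soft_label` helper are hypothetical and chosen only for illustration; they are not the labeling functions built earlier in this project.

```python
import re
from typing import Dict, Pattern

# Hypothetical patterns for illustration only; the project's real rules are not shown here.
HYPOTHETICAL_PATTERNS: Dict[str, Pattern[str]] = {
    "dairy": re.compile(r"\b(milk|butter|cheese|cream|yogurt)\b", re.IGNORECASE),
    "egg": re.compile(r"\beggs?\b", re.IGNORECASE),
}


def soft_label(ingredients_text: str) -> Dict[str, int]:
    """Return a 0/1 flag per class, set to 1 when the class pattern matches the text."""
    return {
        name: int(bool(pattern.search(ingredients_text)))
        for name, pattern in HYPOTHETICAL_PATTERNS.items()
    }


labels = soft_label("['3/4 cup buckwheat flour', '2 large eggs']")
print(labels)
```

Rules like these are noisy by design. For example, the hypothetical dairy pattern also fires on "peanut butter", which is one reason the seed labels are later used to train a classifier rather than being treated as ground truth.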

In [8]:
# Importing required dependencies and referencing cleaned CSV file

import pandas as pd
import os

CLEANED_RECIPES_PATH = '../CSV_data/recipes_cleaned.csv'

# Ensuring correct path
print(f'Path exists?: {os.path.exists(CLEANED_RECIPES_PATH)} ')

Path exists?: True 


In [23]:
# Great! We're referencing the correct path. 

# Now, we need to pick a random subset of training examples to label.

# This will be done using a technique called reservoir sampling, which chooses 10,000 random examples in a single O(n) pass
# The algorithm works as follows:
# * Select the first k examples to populate the reservoir
# * Iterate over the remaining examples (starting at i = k + 1)
# * For each remaining example, choose a random integer j between 1 and i (inclusive)
# * If j <= k, replace reservoir[j-1] with the new, ith item.

import random
from tqdm import tqdm


total_rows = 0
with open(CLEANED_RECIPES_PATH, "r") as f:
    for _ in f:
        total_rows += 1
total_rows -= 1 # Accounting for CSV header
progress_bar = tqdm(total=total_rows, leave=False)

# Setting a seed for reproducibility
random.seed(42)

k = 10000 # Seed subset size
chunksize = 50000
reservoir_rows = []

chunks_iterator = pd.read_csv(CLEANED_RECIPES_PATH, chunksize=chunksize)

count = 0 # Global tracker of the number of rows seen so far across all chunks

for chunk in chunks_iterator:
    for idx, row in chunk.iterrows():
        count += 1
        row_dict = row.to_dict()
        row_dict["original_index"] = idx
        if len(reservoir_rows) < k:
            reservoir_rows.append(row_dict)
        else:
            j = random.randint(1, count)
            if j <= k:
                reservoir_rows[j-1] = row_dict
        progress_bar.update(1)

progress_bar.close()

reservoir_df = pd.DataFrame(reservoir_rows)
display(reservoir_df)

Unnamed: 0,title,ingredients,directions,original_index
0,brown rice,"['22 oz. water', '6 beef bouillon cubes', '1 c...","['dissolve bouillon cubes in water.', 'saute o...",635717
1,rotini with broccoli,"['1 pound pasta, rotini 16 ounces', '2 cups br...",['bring a large pot of salted water to a boil ...,2054112
2,nthochi (banana) bread,"['1/2 cup margarine', '1 cup sugar', '2 cups f...","['grease a loaf pan well.', 'preheat oven to 3...",971971
3,veal and lemon saltimbocca,"['4 veal chops, pounded 1/2 to 1/4-inch thick'...",['place the veal chops on a work surface and s...,1819976
4,cheese cake,"['1/2 stick margarine', '4 tbsp. sugar', '1/2 ...","['beat margarine and sugar; beat in egg.', 'ad...",810518
...,...,...,...,...
9995,punch,"['2 (46 oz.) cans orange juice', '1 (46 oz.) c...","['combine all ingredients and chill.', 'some o...",712451
9996,homemade gently sweet sakura denbu,"['1 piece cod', '1/2 tbsp plus sugar', '1 tbsp...","['mix the ingredients together.', ""i've made a...",1969705
9997,buckwheat crepes with cashew-chive pesto and m...,"['3/4 cup buckwheat flour', '2 large eggs', '1...","['in a medium bowl, whisk the buckwheat flour ...",2154153
9998,its a paleo chicken biryani,"['2 cups hot water', '1 1/2 cups shredded unsw...",['combine the hot water and coconut flakes in ...,1822422
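
For reference, the sampling logic in the cell above can be isolated into a small standalone function. This is a sketch of the same basic reservoir sampling scheme over any iterable, not a replacement for the chunked pandas loop; the function name `reservoir_sample` and its parameters are assumptions introduced here for illustration.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int = 42) -> List[T]:
    """Draw up to k items uniformly at random from a stream in one pass."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    reservoir: List[T] = []
    for count, item in enumerate(items, start=1):
        if count <= k:
            # Fill the reservoir with the first k items
            reservoir.append(item)
        else:
            # randint is inclusive on both ends, so j is in [1, count]
            j = rng.randint(1, count)
            if j <= k:
                # The new item replaces slot j-1 with probability k / count
                reservoir[j - 1] = item
    return reservoir


sample = reservoir_sample(range(1_000), k=10)
print(sample)
```

Because each incoming item is kept with probability k / count, every item in the stream ends up in the final sample with equal probability, and the full dataset never has to be held in memory at once.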
