So far, the seed-labeling subset has been extracted with reservoir sampling, giving a random sample of 10,000 examples. This subset, which was written to seed_labeling_unlabeled.csv, will now be labeled using the regex functions in diet_classification.py.
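
For readers who have not met reservoir sampling, here is a minimal sketch of the classic single-pass version (often called Algorithm R). It is not the code that produced the seed subset; the function name `reservoir_sample` and its `seed` parameter are assumptions made for illustration.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return up to k items drawn uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# Hypothetical usage: sample 10 row numbers from a stream of 100,000.
sample = reservoir_sample(range(100_000), k=10, seed=0)
```

The appeal of this approach is that the full dataset never has to fit in memory: each row is seen once and either kept or discarded.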

In [2]:
# Let's begin by importing required functions and dependencies.

import pandas as pd
from diet_classification import PATTERN_FUNCTIONS

seed_df = pd.read_csv('../CSV_data/seed_labeling_unlabeled.csv')

seed_df.head(1)


Unnamed: 0,title,ingredients,directions,original_index
0,brown rice,"['22 oz. water', '6 beef bouillon cubes', '1 c...","['dissolve bouillon cubes in water.', 'saute o...",635717


In [3]:
# We're on the right track! Next, let's isolate a single row to work with.

seed_df_rows = seed_df.iterrows() # Creates an iterator over (index, row) pairs

# What does one row look like?
first_row = next(seed_df_rows)[1] # iterrows() yields (index, row) tuples, so we take element 1 to get the row itself.
display(first_row)

title                                                    brown rice
ingredients       ['22 oz. water', '6 beef bouillon cubes', '1 c...
directions        ['dissolve bouillon cubes in water.', 'saute o...
original_index                                               635717
Name: 0, dtype: object

In [4]:
# Let's see how to apply all of our pattern functions to the ingredient list of a single row and store the resulting labels.
# Then we can generalize the same steps to every row.

first_row_ingredients = first_row['ingredients']
display(first_row_ingredients)
display(type(first_row_ingredients))

"['22 oz. water', '6 beef bouillon cubes', '1 c. rice', '1 stick butter', '1/2 c. chopped onion', '1/2 c. chopped mushrooms']"

str

In [5]:
# Wait, this isn't a list, it's a string! We need to convert it to the correct data type before we proceed.
# Since the string has the form of a Python list literal (see above), we can parse it as follows:

import ast

first_row_ingredients_list = ast.literal_eval(first_row_ingredients)

display(first_row_ingredients_list)
display(type(first_row_ingredients_list))

['22 oz. water',
 '6 beef bouillon cubes',
 '1 c. rice',
 '1 stick butter',
 '1/2 c. chopped onion',
 '1/2 c. chopped mushrooms']

list
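
`ast.literal_eval` raises an exception on input that is not a valid Python literal, so a single malformed cell could interrupt a pass over many rows. The sketch below shows one defensive way to handle that. The helper name `parse_list_literal` and the choice to fall back to an empty list are assumptions for illustration, not part of diet_classification.py.

```python
import ast

def parse_list_literal(value):
    """Parse a stringified Python list; return [] when the value cannot be parsed."""
    if isinstance(value, list):
        return value  # already parsed
    if not isinstance(value, str):
        return []  # e.g. NaN read from an empty CSV cell
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []  # malformed literal
    return parsed if isinstance(parsed, list) else []

examples = ["['1 c. rice', '1 stick butter']", "not a list", float("nan")]
parsed_examples = [parse_list_literal(v) for v in examples]
```

Whether an empty list is the right fallback depends on the downstream labeler; dropping or logging the offending rows may be preferable.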

In [6]:
# Now, let's label this single example, then generalize the approach to the whole dataframe.

from diet_classification import label_food_classes

first_row_food_classes = label_food_classes(first_row_ingredients_list)
display(first_row_food_classes)

{'pork': 0,
 'beef': 1,
 'chicken': 0,
 'fish': 0,
 'eggs': 0,
 'dairy': 1,
 'honey': 0,
 'nuts': 0,
 'peanuts': 0,
 'high_carb': 1,
 'gluten': 0,
 'soy': 0,
 'shellfish': 0,
 'sesame': 0,
 'alcohol': 0,
 'processed_meats': 0,
 'legumes': 0,
 'sugar': 0}
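
The source of `label_food_classes` is not shown in this notebook. To make the idea of regex-based multi-hot labeling concrete, here is a rough sketch of how such a function might work. Everything in it is hypothetical: the patterns in `EXAMPLE_PATTERNS` and the name `label_food_classes_sketch` are invented for illustration and are not the patterns in diet_classification.py.

```python
import re

# Hypothetical patterns for illustration only; the real ones live in diet_classification.py.
EXAMPLE_PATTERNS = {
    "beef": re.compile(r"\b(beef|steak|brisket)\b", re.IGNORECASE),
    "dairy": re.compile(r"\b(butter|milk|cheese|cream)\b", re.IGNORECASE),
    "eggs": re.compile(r"\beggs?\b", re.IGNORECASE),
}

def label_food_classes_sketch(ingredients):
    """Return a multi-hot dict: 1 if any ingredient matches the class pattern, else 0."""
    return {
        name: int(any(pattern.search(item) for item in ingredients))
        for name, pattern in EXAMPLE_PATTERNS.items()
    }

labels = label_food_classes_sketch(
    ["6 beef bouillon cubes", "1 c. rice", "1 stick butter"]
)
```

Simple patterns like these produce false positives: in this sketch, "peanut butter" would be flagged as dairy. That limitation is typical of rule-based labeling.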

In [9]:
# Not bad for rule-based labeling! More importantly, it runs as expected. Now we can safely apply this strategy to all
# rows in the seed-labeling set, write the multi-hot encodings to the dataframe, and save the result to a CSV.

# Although we created a row iterator earlier, it is more concise and usually faster to apply the function to the whole column with pandas.

# First, however, we need a function that parses the string into a list and then applies the labeling to that list.
def parse_ingredients(x):
    if isinstance(x, str):
        return ast.literal_eval(x)
    return x

def label_food_classes_safe(ingredients):
    return label_food_classes(parse_ingredients(ingredients))

df_labels = seed_df["ingredients"].apply(label_food_classes_safe)
df_labels_expanded = pd.json_normalize(df_labels)
seed_df = pd.concat([seed_df, df_labels_expanded], axis=1)

display(seed_df)
    

Unnamed: 0,title,ingredients,directions,original_index,pork,beef,chicken,fish,eggs,dairy,...,peanuts,high_carb,gluten,soy,shellfish,sesame,alcohol,processed_meats,legumes,sugar
0,brown rice,"['22 oz. water', '6 beef bouillon cubes', '1 c...","['dissolve bouillon cubes in water.', 'saute o...",635717,0,1,0,0,0,1,...,0,1,0,0,0,0,0,0,0,0
1,rotini with broccoli,"['1 pound pasta, rotini 16 ounces', '2 cups br...",['bring a large pot of salted water to a boil ...,2054112,0,0,1,0,0,1,...,0,1,0,0,0,0,0,0,0,0
2,nthochi (banana) bread,"['1/2 cup margarine', '1 cup sugar', '2 cups f...","['grease a loaf pan well.', 'preheat oven to 3...",971971,0,0,0,0,1,1,...,0,1,1,0,0,0,0,0,0,1
3,veal and lemon saltimbocca,"['4 veal chops, pounded 1/2 to 1/4-inch thick'...",['place the veal chops on a work surface and s...,1819976,1,0,1,0,0,1,...,0,0,0,0,0,0,1,0,0,0
4,cheese cake,"['1/2 stick margarine', '4 tbsp. sugar', '1/2 ...","['beat margarine and sugar; beat in egg.', 'ad...",810518,0,0,0,0,1,0,...,0,1,1,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,punch,"['2 (46 oz.) cans orange juice', '1 (46 oz.) c...","['combine all ingredients and chill.', 'some o...",712451,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1
9996,homemade gently sweet sakura denbu,"['1 piece cod', '1/2 tbsp plus sugar', '1 tbsp...","['mix the ingredients together.', ""i've made a...",1969705,0,0,0,1,0,0,...,0,1,0,0,0,0,0,0,0,1
9997,buckwheat crepes with cashew-chive pesto and m...,"['3/4 cup buckwheat flour', '2 large eggs', '1...","['in a medium bowl, whisk the buckwheat flour ...",2154153,0,0,0,0,1,1,...,0,1,1,0,0,0,0,0,0,0
9998,its a paleo chicken biryani,"['2 cups hot water', '1 1/2 cups shredded unsw...",['combine the hot water and coconut flakes in ...,1822422,0,0,1,0,0,1,...,0,1,0,0,0,0,0,0,0,0
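
The apply, expand, and concatenate pattern above can be tried on a toy dataframe, which also makes it easy to check how often each label fires. This is a small sketch: `toy_df` and `toy_labeler` are made-up stand-ins, and the explicit index assignment is a precaution that only matters when the dataframe's index is not the default 0..n-1 range.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {"ingredients": [["1 stick butter"], ["2 large eggs", "1 cup milk"]]}
)

def toy_labeler(ingredients):
    """Hypothetical stand-in for label_food_classes, using substring checks."""
    text = " ".join(ingredients)
    return {
        "dairy": int("butter" in text or "milk" in text),
        "eggs": int("egg" in text),
    }

labels = toy_df["ingredients"].apply(toy_labeler)     # Series of dicts
labels_expanded = pd.json_normalize(labels.tolist())  # one column per dict key
labels_expanded.index = toy_df.index                  # align rows explicitly before concat
toy_labeled = pd.concat([toy_df, labels_expanded], axis=1)

# Fraction of rows in which each label is set.
prevalence = toy_labeled[["dairy", "eggs"]].mean()
```

Running the same prevalence check on the real label columns is a quick way to spot classes that never fire or that fire suspiciously often.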


In [11]:
# Now that all 10,000 examples have been successfully labeled, we can write them to CSV and use them to train our model

seed_df.to_csv('../CSV_data/seed_labeled_XGBoost_training_data.csv', index=False)
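
The `index=False` argument matters when the file is read back later. The sketch below uses in-memory buffers and a made-up two-column dataframe to show what happens with and without it.

```python
import io
import pandas as pd

df = pd.DataFrame({"title": ["brown rice"], "beef": [1]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# With index=False, only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)
```

Here `cols_with_index` contains an extra `Unnamed: 0` column, while `cols_without_index` holds only `title` and `beef`.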