To reduce noise in the data, the ingredients are categorized: each ingredient is matched to an item in the USDA database.
This has three benefits:
#### Fewer unique ingredients
Very similar ingredients collapse into one, whether they are near-identical ingredients (large egg, egg), different words for the same ingredient (eggplant, aubergine), or spelling or plural variants (eggs, egg).
#### A food category for each ingredient
Each USDA item is linked to a category, so linking an ingredient to a USDA item also gives us the ingredient's category.
#### Access to nutritional information
The USDA database holds nutritional information for all of its items, which can be fetched through its API.
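
As a small illustration of the first two points, the sketch below collapses the example variants onto canonical names and looks up a category for each. The `CANONICAL` and `CATEGORY` dictionaries are hand-written stand-ins for the USDA matching, not part of the actual pipeline.

```python
# Hypothetical lookup tables standing in for the USDA matching step.
CANONICAL = {
    "large egg": "egg",        # near-identical ingredient
    "eggs": "egg",             # plural variant
    "aubergine": "eggplant",   # different word, same ingredient
}
CATEGORY = {
    "egg": "Dairy and Egg Products",
    "eggplant": "Vegetables and Vegetable Products",
}

raw_ingredients = ["large egg", "egg", "eggs", "eggplant", "aubergine"]

# Map every raw name to its canonical form (names without an entry stay as they are).
canonical = [CANONICAL.get(name, name) for name in raw_ingredients]

unique_before = len(set(raw_ingredients))
unique_after = len(set(canonical))
categories = {name: CATEGORY[name] for name in set(canonical)}

print(unique_before, "unique names before,", unique_after, "after")
print(categories)
```

Five distinct strings reduce to two canonical ingredients, each of which now carries a food category.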

In [1]:
import pandas as pd
from ast import literal_eval

#Converter for columns that store Python literals (e.g. lists) as strings
generic = lambda x: literal_eval(x)
conv = {'nutrition' : generic, 'steps' : generic, 'ingredients' : generic, 'id_column' : generic, 'jaccard_similarity' : generic}
#Raw strings, so that backslashes in the Windows paths are not read as escape sequences
df = pd.read_csv(r"C:/Users/01din\Documents/University\BSc thesis\data\RAW_recipes.csv/ingredients/ingredients.csv")
df = df[['ingredient', 'frequency']]
cats = pd.read_csv(r"C:/Users/01din\Documents/University\BSc thesis\pap\input/food_category.csv")

In [23]:
df = df[df.frequency>49]

In [24]:
df

Unnamed: 0,ingredient,frequency
0,salt,85746
1,butter,54975
2,sugar,44535
3,onion,39065
4,water,34914
...,...,...
2712,low-sodium low-fat chicken broth,50
2713,scotch whisky,50
2714,sugar-free strawberry gelatin,50
2715,frying chickens,50


In [72]:
import re
from tqdm import tqdm
import numpy as np

def preprocess(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    tokens = text.split()
    return tokens

def jaccard_similarity(a, b):
    a, b = set(a), set(b)
    intersection = a.intersection(b)
    union = a.union(b)
    #Two empty token lists would otherwise divide by zero
    if not union:
        return 0.0
    return len(intersection) / len(union)

#Preprocess the ingredients and category descriptions
df['ingredient_tokens'] = df['ingredient'].apply(preprocess)
cats['Long_Desc_tokens'] = cats['Long_Desc'].apply(preprocess)

#Create a function to find the best matching category for each ingredient
def find_best_match(ingredient_tokens, threshold=0.0):
    best_score = threshold
    best_category = None

    #Group the categories by FdGrp_Desc
    grouped_cats = cats.groupby('FdGrp_Desc')

    for category, group in grouped_cats:
        #Compute average similarity for each category
        similarities = []
        for idx, row in group.iterrows():
            first_four_tokens = row['Long_Desc_tokens'][:4]
            similarity = jaccard_similarity(ingredient_tokens, first_four_tokens)
            similarities.append(similarity)
        avg_similarity = np.mean(similarities)

        if avg_similarity > best_score:
            best_score = avg_similarity
            best_category = category

    return best_category

#For progress
tqdm.pandas(desc="Finding best match")
df['category'] = df['ingredient_tokens'].progress_apply(find_best_match)

#Drop temp columns
df.drop(columns=['ingredient_tokens'], inplace=True)
cats.drop(columns=['Long_Desc_tokens'], inplace=True)


Finding best match: 100%|██████████| 50/50 [00:09<00:00,  5.39it/s]


The results are quite poor, so let's try a different similarity measure instead.
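
One possible contributor to the weak results is the averaging step: a single strong match inside a large food group is diluted by all the unrelated items in that group. The sketch below shows the effect on made-up token lists (the two groups and their items are hypothetical, not taken from the USDA file) by comparing the mean score per group with the best score per group.

```python
from statistics import mean

def jaccard_similarity(a, b):
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical food groups, each item already tokenized and lowercased.
toy_groups = {
    "Spices and Herbs": [
        ["salt", "table"],
        ["spices", "pepper", "black"],
        ["spices", "basil", "dried"],
        ["spices", "onion", "powder"],
    ],
    "Snacks": [
        ["snacks", "pretzels", "hard", "salt"],
        ["snacks", "popcorn", "salt", "added"],
    ],
}

def best_group(ingredient_tokens, groups, aggregate):
    # Score every group with the given aggregate (mean or max) and return the top one.
    scores = {
        name: aggregate([jaccard_similarity(ingredient_tokens, item) for item in items])
        for name, items in groups.items()
    }
    return max(scores, key=scores.get), scores

best_by_mean, mean_scores = best_group(["salt"], toy_groups, mean)
best_by_max, max_scores = best_group(["salt"], toy_groups, max)

print("mean:", best_by_mean, mean_scores)
print("max: ", best_by_max, max_scores)
```

With the mean, "salt" lands in the toy "Snacks" group (0.25 against 0.125), even though its closest single item, "salt table", sits in "Spices and Herbs"; taking the maximum per group recovers that match.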

In [7]:
df

Unnamed: 0,ingredient,frequency
0,salt,85746
1,butter,54975
2,sugar,44535
3,onion,39065
4,water,34914
...,...,...
2712,low-sodium low-fat chicken broth,50
2713,scotch whisky,50
2714,sugar-free strawberry gelatin,50
2715,frying chickens,50


In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

#Assuming the recipe ingredients and the USDA items are in pandas DataFrames
#df['ingredient'] contains ingredients from my dataset
#cats['Long_Desc'] contains ingredients from USDA
#cats['FdGrp_Desc'] contains the food group descriptions for USDA ingredients

#Use sklearn TFIDF
vectorizer = TfidfVectorizer(ngram_range=(1, 4))

#Combine data into a matrix
combined_data = pd.concat([df['ingredient'], cats['Long_Desc']])

#Vectorize
vectorized_data = vectorizer.fit_transform(combined_data)

#Split back into two vectors
ingredient_vectors = vectorized_data[:len(df['ingredient'])]
usda_vectors = vectorized_data[len(df['ingredient']):]

#Compute cosine sim
cosine_sim_matrix = cosine_similarity(ingredient_vectors, usda_vectors)

#Set threshold (low as any similarity is better than no similarity)
threshold = 0.1

#Find the most similar USDA ingredient for each ingredient from my dataset
df['Long_Desc'] = None
df['FdGrp_Desc'] = None
for i, row in enumerate(cosine_sim_matrix):
    max_similarity_index = row.argmax()
    max_similarity = row[max_similarity_index]
    if max_similarity >= threshold:
        #i and max_similarity_index are positions, so index by position rather than by label
        df.loc[df.index[i], 'Long_Desc'] = cats['Long_Desc'].iloc[max_similarity_index]
        df.loc[df.index[i], 'FdGrp_Desc'] = cats['FdGrp_Desc'].iloc[max_similarity_index]


In [21]:
df

Unnamed: 0,ingredient,frequency,Long_Desc,FdGrp_Desc
0,salt,85746,"Salt, table",Spices and Herbs
1,butter,54975,"Butter, salted",Dairy and Egg Products
2,sugar,44535,"Sugar, turbinado",Sweets
3,onion,39065,"Spices, onion powder",Spices and Herbs
4,water,34914,"Crackers, water biscuits",Baked Products
...,...,...,...,...
2712,low-sodium low-fat chicken broth,50,"Fat, chicken",Fats and Oils
2713,scotch whisky,50,"Kale, scotch, raw",Vegetables and Vegetable Products
2714,sugar-free strawberry gelatin,50,"Syrups, sugar free",Sweets
2715,frying chickens,50,"Shortening frying (heavy duty), palm (hydrogen...",Fats and Oils
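
Mismatches such as `scotch whisky` being matched to `Kale, scotch, raw` are consistent with how token-based TF-IDF behaves: only exact shared tokens count, so a spelling difference contributes nothing while an incidental shared word does. The sketch below reproduces this on toy data; the third USDA-style description is a made-up example, and the setup otherwise mirrors the cell above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data: the last description is hypothetical and only illustrates a spelling mismatch.
ingredients = ["salt", "scotch whisky"]
usda_items = [
    "Salt, table",
    "Kale, scotch, raw",
    "Alcoholic beverage, distilled, whiskey",
]

# Fit one vocabulary over both lists, then split the matrix back into its two parts.
vectorizer = TfidfVectorizer(ngram_range=(1, 4))
vectors = vectorizer.fit_transform(ingredients + usda_items)
ingredient_vectors = vectors[:len(ingredients)]
usda_vectors = vectors[len(ingredients):]

similarity = cosine_similarity(ingredient_vectors, usda_vectors)
best_match = [usda_items[row.argmax()] for row in similarity]

for ingredient, match in zip(ingredients, best_match):
    print(ingredient, "->", match)
```

Here "whisky" and "whiskey" are different tokens, so the similarity to the beverage is exactly zero and the shared word "scotch" decides the match.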


In [16]:
df.to_csv(r"C:/Users/01din\Documents/University\BSc thesis\data\RAW_recipes.csv/ingredients/ingredients.csv")

In [19]:
df = pd.read_csv(r"C:/Users/01din\Documents/University\BSc thesis\data\RAW_recipes.csv/ingredients/ingredients_labels.csv", index_col=['id'])

In [20]:
df

id,ingredient,frequency,Long_Desc,FdGrp_Desc,label
0,salt,85746,"Salt, table",Spices and Herbs,1
1,butter,54975,"Butter, without salt",Dairy and Egg Products,1
2,sugar,44535,"Sugars, granulated",Sweets,1
3,onion,39065,"Soup, onion, dry, mix","Soups, Sauces, and Gravies",0
4,water,34914,"Crackers, water biscuits",Baked Products,0
...,...,...,...,...,...
2712,low-sodium low-fat chicken broth,50,"Soup, chicken broth, low sodium, canned","Soups, Sauces, and Gravies",1
2713,scotch whisky,50,"Kale, scotch, raw",Vegetables and Vegetable Products,0
2714,sugar-free strawberry gelatin,50,"Gelatin desserts, dry mix, with added ascorbic...",Sweets,1
2715,frying chickens,50,"Shortening frying (heavy duty), palm (hydrogen...",Fats and Oils,0


In [21]:
df_false = df[df.label==0]
df_false

id,ingredient,frequency,Long_Desc,FdGrp_Desc,label
3,onion,39065,"Soup, onion, dry, mix","Soups, Sauces, and Gravies",0
4,water,34914,"Crackers, water biscuits",Baked Products,0
5,eggs,33761,"Babyfood, cereal, with eggs, strained",Baby Foods,0
7,flour,26266,Potato flour,Vegetables and Vegetable Products,0
8,milk,25786,"Crackers, milk",Baked Products,0
...,...,...,...,...,...
2709,chevre cheese,50,"Cheese, cream",Dairy and Egg Products,0
2710,turkey pepperoni,50,"Fat, turkey",Fats and Oils,0
2713,scotch whisky,50,"Kale, scotch, raw",Vegetables and Vegetable Products,0
2715,frying chickens,50,"Shortening frying (heavy duty), palm (hydrogen...",Fats and Oils,0


After this, I manually labeled every row (the `label` column: 1 for a correct match, 0 for an incorrect one), and then manually relabeled all the incorrectly matched entries.
This took a lot of time, but by this point I had tried many different methods (Jaccard similarity, NLP, TF-IDF) without much improvement, so I decided to bite the bullet.
Now I can fetch the nutritional data from the USDA API.
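
The notebook does not show the request itself, so the following is only a sketch of what the lookup could look like. It assumes the USDA FoodData Central search endpoint and its `query`, `pageSize` and `api_key` parameters, and it assumes that each returned food carries a `foodNutrients` list with `nutrientName`, `value` and `unitName` fields; all of these should be checked against the current API documentation. The helper names `build_search_url` and `extract_nutrients` are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint of the FoodData Central search API.
FDC_SEARCH_URL = "https://api.nal.usda.gov/fdc/v1/foods/search"

def build_search_url(description, api_key, page_size=1):
    """Build the search URL for one USDA description (no request is sent here)."""
    params = {"query": description, "pageSize": page_size, "api_key": api_key}
    return f"{FDC_SEARCH_URL}?{urlencode(params)}"

def extract_nutrients(food):
    """Turn one food record into {nutrient name: (value, unit)}.

    The field names are assumptions about the response shape.
    """
    return {
        nutrient["nutrientName"]: (nutrient["value"], nutrient["unitName"])
        for nutrient in food.get("foodNutrients", [])
    }

url = build_search_url("Salt, table", api_key="YOUR_API_KEY")
print(url)

# Hand-made record in the assumed shape, with placeholder numbers,
# used here instead of a live response.
example_food = {
    "description": "Example item",
    "foodNutrients": [
        {"nutrientName": "Energy", "value": 100.0, "unitName": "KCAL"},
        {"nutrientName": "Protein", "value": 2.5, "unitName": "G"},
    ],
}
print(extract_nutrients(example_food))
```

Sending the request (for example with `urllib.request.urlopen` or `requests.get`) and caching the responses per `Long_Desc` would avoid repeated calls for ingredients that map to the same USDA item.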