# __Preprocessing Recipes Datasets__
This notebook handles the preprocessing of recipe ingredient data. This step is important for mapping ingredients to visual classes for object detection training and for improving model accuracy. <br>
<br>
The recipe dataset is downloaded from Kaggle: [Food.com Recipes with Search Terms and Tags](https://www.kaggle.com/datasets/shuyangli94/foodcom-recipes-with-search-terms-and-tags/data). It consists of recipes uploaded to Food.com, with 494,963 entries and the following 10 columns: <br>
1. **_id_**: Recipe ID
2. **_name_**: Recipe Name
3. **_description_**: Recipe Description
4. **_ingredients_**: List of Normalized Recipe Ingredients
5. **_ingredients_raw_str_**: List of Ingredients with Quantities
6. **_serving_size_**: Serving size (with grams)
7. **_servings_**: Number of Servings
8. **_steps_**: Recipe Instructions in Order
9. **_tags_**: User-assigned Tags
10. **_search_terms_**: Search Terms on Food.com that Return the Recipe

In [None]:
import pandas as pd
import ast

# Load CSV file
df = pd.read_csv("recipes.csv")

### <b>Step 1: Extracting Malaysian Cuisine (without pork) </b>
Recipes relevant to __Malaysian Cuisine__ are extracted by keeping the rows whose _'search_terms'_ column contains the keyword **'Malaysian'**, then dropping any row where **'pork'** appears in either the _'search_terms'_ or the _'ingredients'_ column. The results are saved to a CSV file named _'filtered_recipes.csv'_.

In [None]:
# Keep rows whose 'search_terms' contain "Malaysian" and where neither
# 'search_terms' nor 'ingredients' contain "pork" (case-insensitive; NaN counts as no match)
filtered_df = df[
    df["search_terms"].str.contains("Malaysian", case=False, na=False) & 
    ~df["search_terms"].str.contains("pork", case=False, na=False) &
    ~df["ingredients"].str.contains("pork", case=False, na=False)
]

# Save to a new CSV
filtered_df.to_csv("filtered_recipes.csv", index=False)

print("Filtered recipes saved to filtered_recipes.csv")
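
The filter logic above can be sanity-checked on a toy frame before running it on the full dataset. The sketch below is only illustrative: the four rows are made up and do not come from the Kaggle file, and the helper names `is_malaysian` and `has_pork` are introduced here for readability.

```python
import pandas as pd

# Hypothetical rows for illustration only (not taken from the Kaggle dataset)
toy = pd.DataFrame({
    "name": ["nasi lemak", "char siu rice", "pasta", "untagged"],
    "search_terms": ["{'malaysian', 'rice'}", "{'malaysian', 'pork'}", "{'italian'}", None],
    "ingredients": ["['rice', 'coconut milk']", "['pork belly', 'rice']", "['pasta']", "['rice']"],
})

# Same conditions as the notebook cell, split into two named masks
is_malaysian = toy["search_terms"].str.contains("malaysian", case=False, na=False)
has_pork = (
    toy["search_terms"].str.contains("pork", case=False, na=False)
    | toy["ingredients"].str.contains("pork", case=False, na=False)
)

kept = toy[is_malaysian & ~has_pork]
print(kept["name"].tolist())  # ['nasi lemak']
```

Because `str.contains` does substring matching, any term that merely contains the letters "pork" is also excluded, and rows with a missing `search_terms` value are dropped by `na=False`.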

### <b> Step 2: Normalizing the Recipe Dataset </b>
To ensure smooth and accurate word-pattern matching between pantry items and recipe ingredients, the recipe dataset is normalized to clean and standardize the data. The steps are:
1. **Lowercasing**: Converting all ingredient names to lowercase to avoid mismatches caused by capitalization. 
2. **Whitespace Stripping**: Removing extra spaces before or after ingredients.
3. **Punctuation Removal**: Removing characters such as periods and other special symbols for consistency. 
4. **Duplication Removal**: Removing duplicate ingredients within the same recipe to prevent redundant comparisons.
5. **Sorting**: Sorting the ingredients alphabetically after cleaning to ensure consistent pattern analysis. 

<br>
These cleaning steps prepare the recipe dataset for accurate ingredient matching and reliable ingredient-based recipe recommendations. 

In [None]:
import re

# Function to normalize ingredients
def normalize_ingredients(ingredient_str):
    if pd.isna(ingredient_str):
        return ""
    
    # Convert string to list
    ingredients = ast.literal_eval(ingredient_str)

    # Clean each ingredient
    cleaned = []
    for ing in ingredients:
        ing = ing.lower()  # Lowercase
        ing = ing.strip()  # Remove whitespace
        ing = re.sub(r'[^\w\s]', '', ing)  # Remove punctuation
        cleaned.append(ing)

    # Remove duplicates & sort alphabetically
    cleaned = sorted(set(cleaned))

    # Join ingredients back into a comma-separated string
    return ", ".join(cleaned)

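# Apply normalization to ingredients.
# filtered_df is a slice of df, so work on an explicit copy to avoid
# pandas' SettingWithCopyWarning on the assignment below.
filtered_df = filtered_df.copy()
filtered_df["ingredients"] = filtered_df["ingredients"].apply(normalize_ingredients)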

# Save result (optional)
filtered_df.to_csv("final_recipes.csv", index=False)

print("Recipe dataset normalization complete and saved to final_recipes.csv.")
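
To see what the five steps do to a single entry, the sketch below restates them in a standalone helper and applies it to a made-up string. The name `normalize_demo` and the sample value are assumptions for illustration; the helper follows the same order of operations as `normalize_ingredients` but is not a replacement for it.

```python
import ast
import re

def normalize_demo(ingredient_str):
    """Standalone restatement of the normalization steps, for illustration only."""
    items = ast.literal_eval(ingredient_str)   # stringified list -> Python list
    cleaned = []
    for item in items:
        item = item.lower()                    # 1. lowercase
        item = item.strip()                    # 2. strip surrounding whitespace
        item = re.sub(r"[^\w\s]", "", item)    # 3. remove punctuation
        cleaned.append(item)
    # 4. remove duplicates, 5. sort alphabetically
    return ", ".join(sorted(set(cleaned)))

# Hypothetical raw value in the same stringified-list format as the 'ingredients' column
sample = "['Garlic ', 'garlic', 'Chili-Pepper', 'Coconut Milk.']"
print(normalize_demo(sample))  # chilipepper, coconut milk, garlic
```

Note that removing punctuation also removes hyphens, so a hyphenated name collapses into one word (`chili-pepper` becomes `chilipepper`). If that is undesirable, hyphens could be replaced with spaces before the punctuation step.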

### <b>Step 3: Extracting Ingredients Frequency </b>
The frequency of each ingredient in the normalized recipe data is calculated to find the most commonly used ingredients in the recipe dataset. Selecting pantry item classes from the most commonly used ingredients gives them a higher impact on recipe recommendation and pantry tracking.

In [None]:
from collections import Counter

# Step 1: Flatten all ingredients into a single list
all_ingredients = []

for ing_list in filtered_df["ingredients"].dropna():
    all_ingredients.extend([i.strip() for i in ing_list.split(",") if i.strip() != ""])

# Step 2: Count frequency
ingredient_counts = Counter(all_ingredients)

# Step 3: Convert to DataFrame and sort alphabetically
ingredient_freq_df = pd.DataFrame.from_dict(ingredient_counts, orient='index', columns=['frequency'])
ingredient_freq_df = ingredient_freq_df.sort_index()  # Sort alphabetically by ingredient name

# Step 4: Save or display
ingredient_freq_df.to_csv("ingredient_frequency.csv")

print("Ingredient frequency table saved to ingredient_frequency.csv.")
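
The saved table is sorted alphabetically, which is convenient for manual review. To rank ingredients by how often they occur, `Counter.most_common` can be used instead. The sketch below runs on three made-up recipe strings in the same comma-separated format produced by the normalization step; the `recipes` list is hypothetical.

```python
from collections import Counter

# Hypothetical normalized ingredient strings, one per recipe
recipes = [
    "chicken, garlic, ginger",
    "garlic, onion, tomato",
    "chicken, garlic, lime",
]

# Flatten and count in one pass, skipping empty fragments
counts = Counter(
    name.strip()
    for row in recipes
    for name in row.split(",")
    if name.strip()
)

# The two most frequent ingredients, highest count first
top = counts.most_common(2)
print(top)  # [('garlic', 3), ('chicken', 2)]
```

Since duplicates are removed within each recipe during normalization, each count equals the number of recipes that contain the ingredient.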

### <b> Step 4: Selecting the Ingredients Manually </b>
The full dataset includes many uncommon ingredients, so the list of extracted ingredients is reviewed and the classes are selected manually. Commonly used and visually distinct items are chosen to create a cleaned list of key ingredients for object detection. This ensures the model can learn features beyond colour, such as shape and texture. <br>
<br>
The ingredients are selected based on the following criteria: 
1. Frequency: Appears in at least 10 recipes in the dataset. 
2. Visual Distinctiveness: Has a clear, unique appearance (e.g. flour, salt and sugar are not selected as they are not visually distinguishable). 
3. Availability: Can be found in public image datasets or collected easily. 

<br>
Automated selection based on frequency alone would include ambiguous and visually similar items, so a manual review is used to ensure that the selected classes have distinct visual features, which should help the YOLOv5 object detection model. 
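
Only the first criterion can be applied automatically. A minimal sketch of that shortlisting step is shown below, assuming a frequency table shaped like `ingredient_frequency.csv`; the counts and the constant name `MIN_RECIPES` are hypothetical.

```python
import pandas as pd

MIN_RECIPES = 10  # threshold from criterion 1

# Hypothetical counts standing in for ingredient_frequency.csv
freq = pd.DataFrame(
    {"frequency": [42, 37, 9, 15, 3]},
    index=["garlic", "salt", "galangal", "lime", "saffron"],
)

# Keep ingredients that meet the threshold, most frequent first
candidates = freq[freq["frequency"] >= MIN_RECIPES].sort_values(
    "frequency", ascending=False
)
print(candidates.index.tolist())  # ['garlic', 'salt', 'lime']
```

In this toy example `salt` passes the frequency threshold even though it fails the visual distinctiveness criterion, which is the kind of case the manual review is meant to catch.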

In [None]:
selected_ingredients = ['beef', 'cabbage', 'chicken', 'chili_pepper', 'cilantro', 'egg', 'fish', 'garlic', 'ginger', 'green_onion', 'lime', 'mango', 'noodles', 'onion', 'potato', 'tomato']

The selected ingredient classes (such as beef, cabbage, chicken, egg, garlic, and tomato) were chosen based on their common usage in everyday cooking and their availability in the VegFru dataset. The list includes a mix of proteins (e.g., beef, chicken, fish), vegetables (e.g., cabbage, onion, potato), aromatics (e.g., garlic, ginger, green onion), and flavor enhancers (e.g., chili pepper, lime, cilantro), making them suitable for generating a wide range of common household recipes. This curated selection helps to ensure that the system can demonstrate meaningful functionality within a manageable scope during the initial prototype phase.
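
The class labels use underscores (`chili_pepper`, `green_onion`), while the normalized recipe ingredients use spaces. One possible way to map between the two is sketched below. This is an assumption about how the matching could work, not the project's actual matching code: the helpers `label_pattern` and `classes_in` are hypothetical, underscores are treated as spaces, and a trailing "s" is allowed as a crude plural.

```python
import re

# Subset of the selected class list, for illustration
selected = ["chili_pepper", "egg", "green_onion"]

def label_pattern(label):
    """Whole-word pattern for a class label; underscores become spaces (assumption)."""
    words = label.replace("_", " ")
    return re.compile(rf"\b{re.escape(words)}s?\b")

def classes_in(ingredient_str):
    """Return the class labels found in a normalized ingredient string."""
    return [lab for lab in selected if label_pattern(lab).search(ingredient_str)]

# Hypothetical normalized ingredient string
print(classes_in("eggs, green onion, eggplant, red chili pepper"))
# ['chili_pepper', 'egg', 'green_onion']
```

The word boundaries keep `egg` from matching `eggplant`, which a plain substring test would not.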