# __Preprocessing Recipes Datasets__
This notebook handles the preprocessing of recipe ingredient data. This step matters because the ingredients are later mapped to visual classes for object detection training, which helps improve model accuracy. <br>
<br>
The recipes dataset is downloaded from Kaggle: [Food.com Recipes with Search Terms and Tags](https://www.kaggle.com/datasets/shuyangli94/foodcom-recipes-with-search-terms-and-tags/data). It consists of recipes uploaded to Food.com, with 494963 entries and the following 10 columns: <br>
1. **_id_**: Recipe ID
2. **_name_**: Recipe Name
3. **_description_**: Recipe Description
4. **_ingredients_**: List of Normalized Recipe Ingredients
5. **_ingredients_raw_str_**: List of Ingredients with Quantities
6. **_serving_size_**: Serving size (with grams)
7. **_servings_**: Number of Servings
8. **_steps_**: Recipe Instructions in Order
9. **_tags_**: User-assigned Tags
10. **_search_terms_**: Search Terms on Food.com that Return the Recipe

### <b>Extracting Malaysian Cuisine: </b>
This step extracts only the recipes relevant to __Malaysian cuisine__ by looking for the keyword **'Malaysian'** in the _'search_terms'_ column. The results are saved to a CSV file named _'filtered_recipes.csv'_.

In [27]:
import pandas as pd

In [None]:
# Load your CSV file
df = pd.read_csv("recipes.csv")

# Filter rows where the 'search_terms' column contains the keyword (case-insensitive; missing values are treated as non-matches)
keyword = "Malaysian"
filtered_df = df[df["search_terms"].str.contains(keyword, case=False, na=False)]

# Save to a new CSV
filtered_df.to_csv("filtered_recipes.csv", index=False)

print(f"Filtered recipes containing '{keyword}' saved to filtered_recipes.csv")
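
The filter above relies on `str.contains` with `case=False` and `na=False`. As a minimal sketch of how those two options behave, the toy DataFrame below stands in for `recipes.csv`; its rows and the set-like formatting of `search_terms` are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for recipes.csv; these rows are assumptions made for illustration.
toy_df = pd.DataFrame({
    "name": ["nasi lemak", "beef rendang", "pancakes", "untitled"],
    "search_terms": [
        "{'malaysian', 'rice'}",
        "{'Malaysian', 'dinner'}",
        "{'breakfast'}",
        None,  # a missing value, which na=False turns into a non-match
    ],
})

# Same call as in the cell above: case-insensitive substring match
mask = toy_df["search_terms"].str.contains("Malaysian", case=False, na=False)
matched = toy_df[mask]

print(f"{len(matched)} of {len(toy_df)} rows matched")
```

Printing the matched count next to the total is a cheap sanity check before writing the filtered CSV. Note that this is a substring match, so any search term that merely contains the keyword would also be kept.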

### <b>Extracting Ingredients Frequency </b>
A frequency list of ingredients is generated from the filtered Malaysian recipes to guide the ingredient selection step that follows.

In [None]:
import ast
from collections import Counter

# Load the filtered CSV file
filtered_df = pd.read_csv("filtered_recipes.csv")

# Counter to hold ingredient frequencies
ingredient_counter = Counter()

# Iterate through each row in the 'ingredients' column
for item in filtered_df["ingredients"]:
    try:
        # Convert string representation of list into actual list
        ingredients = ast.literal_eval(item)

        # Make sure it's a list before counting
        if isinstance(ingredients, list):
            # Normalize (e.g., lowercase and strip whitespace) and count
            normalized = [ing.strip().lower() for ing in ingredients]
            ingredient_counter.update(normalized)
    except Exception as e:
        print(f"Skipping row due to error: {e}")

# Convert to DataFrame for viewing and saving
ingredient_df = pd.DataFrame(ingredient_counter.items(), columns=["ingredient", "count"])
ingredient_df = ingredient_df.sort_values(by="count", ascending=False)

# Save to CSV
ingredient_df.to_csv("ingredient_frequencies.csv", index=False)

print("Ingredient frequency counts saved to 'ingredient_frequencies.csv'")
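
One possible way to read the frequency table when choosing ingredients is to look at each ingredient's share of all mentions and the running total. The sketch below is not part of the original pipeline; the `toy_counter` values and the `share` and `cumulative_share` column names are assumptions introduced for illustration.

```python
from collections import Counter

import pandas as pd

# Toy counts standing in for ingredient_counter; the numbers are invented.
toy_counter = Counter({
    "garlic": 40,
    "coconut milk": 25,
    "lemongrass": 20,
    "galangal": 10,
    "candlenut": 5,
})

# Build and sort the table the same way as the cell above
freq = pd.DataFrame(toy_counter.items(), columns=["ingredient", "count"])
freq = freq.sort_values(by="count", ascending=False).reset_index(drop=True)

# Share of all ingredient mentions, plus a running total down the sorted table
freq["share"] = freq["count"] / freq["count"].sum()
freq["cumulative_share"] = freq["share"].cumsum()

print(freq.head(3))
```

A cumulative share makes it easier to see how few ingredients account for most mentions, which can help decide where to stop reviewing the list.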

### <b> Selecting the Ingredients Manually </b>
The full dataset includes many uncommon ingredients, so the extracted list is reviewed and filtered manually. Commonly used, visually distinct items are selected to create a cleaned list of key ingredients for object detection. <br>
<br>
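Once the manual review is done, the chosen names could be applied back to the frequency table. The following is a minimal sketch under stated assumptions: the `selected` set is a hypothetical hand-picked list, and the toy `freq` table stands in for `ingredient_frequencies.csv`. Neither reflects the actual selection made for this project.

```python
import pandas as pd

# Toy frequency table standing in for ingredient_frequencies.csv (invented values)
freq = pd.DataFrame({
    "ingredient": ["garlic", "salt", "coconut milk", "water", "lemongrass"],
    "count": [40, 38, 25, 22, 20],
})

# Hypothetical hand-picked ingredients; the real list comes from manual review
selected = {"garlic", "lemongrass", "chili"}

# Keep only the selected ingredients, preserving the frequency ordering
key_ingredients = freq[freq["ingredient"].isin(selected)].reset_index(drop=True)

# Flag selected names that never appear in the table (e.g. spelling mismatches)
missing = sorted(selected - set(freq["ingredient"]))

print(key_ingredients)
print("Not found in frequency table:", missing)
```

Reporting the names that were not found helps catch mismatches between the hand-written list and the normalized ingredient strings.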