In [1]:
import pandas as pd
data = pd.read_json("json_files/train.json")
data_cleaned = pd.read_json("json_files/train.json")
print(data.head())

      id      cuisine                                        ingredients
0  10259        greek  [romaine lettuce, black olives, grape tomatoes...
1  25693  southern_us  [plain flour, ground pepper, salt, tomatoes, g...
2  20130     filipino  [eggs, pepper, salt, mayonaise, cooking oil, g...
3  22213       indian                [water, vegetable oil, wheat, salt]
4  13162       indian  [black pepper, shallots, cornflour, cayenne pe...


## 1. Data Understanding

### Total number of rows, cuisines, ingredients

In [2]:
df = pd.DataFrame(data)

total_rows = df.shape[0]
total_cuisines = df['cuisine'].nunique()
unique_ingredients = set(ingredient for ingredients_list in df['ingredients'] for ingredient in ingredients_list)
total_ingredients = len(unique_ingredients)

print(f"Total number of rows: {total_rows}")
print(f"Total number of cuisines: {total_cuisines}")
print(f"Total number of ingredients: {total_ingredients}")

Total number of rows: 39774
Total number of cuisines: 20
Total number of ingredients: 6714


### Cuisine Distribution

In [3]:
cuisine_counts = df['cuisine'].value_counts()
print(cuisine_counts)

cuisine
italian         7838
mexican         6438
southern_us     4320
indian          3003
chinese         2673
french          2646
cajun_creole    1546
thai            1539
japanese        1423
greek           1175
spanish          989
korean           830
vietnamese       825
moroccan         821
british          804
filipino         755
irish            667
jamaican         526
russian          489
brazilian        467
Name: count, dtype: int64


### Ingredient Counts

In [4]:
from collections import Counter

all_ingredients = [ingredient for ingredients_list in df['ingredients'] for ingredient in ingredients_list]
ingredient_counts = Counter(all_ingredients)

most_common_10 = ingredient_counts.most_common(10)

least_common_10 = ingredient_counts.most_common()[:-11:-1]

print("Most Common 10 Ingredients:")
print(most_common_10)

print("\nLeast Common 10 Ingredients:")
print(least_common_10)

Most Common 10 Ingredients:
[('salt', 18049), ('onions', 7972), ('olive oil', 7972), ('water', 7457), ('garlic', 7380), ('sugar', 6434), ('garlic cloves', 6237), ('butter', 4848), ('ground black pepper', 4785), ('all-purpose flour', 4632)]

Least Common 10 Ingredients:
[('crushed cheese crackers', 1), ('tomato garlic pasta sauce', 1), ('lop chong', 1), ('Hidden Valley® Greek Yogurt Original Ranch® Dip Mix', 1), ('Lipton® Iced Tea Brew Family Size Tea Bags', 1), ('ciabatta loaf', 1), ('cholesterol free egg substitute', 1), ('orange glaze', 1), ('Challenge Butter', 1), ('Oscar Mayer Cotto Salami', 1)]


## 2-1. Data PreProcessing: Cleaning

### First-Hand Mapping of Frequently Used Last Words

To support two goals:

1. **Initial data cleaning**, and
2. **Identifying potential data errors**,

we took the last word of every unique (lowercased) ingredient name and kept the last words shared by more than 15 distinct ingredient names.

The results of this analysis are as follows:

In [5]:
unique_ingredients_list = set(ingredient.lower() for ingredients_list in df['ingredients'] for ingredient in ingredients_list)

sorted_unique_ingredients = sorted(unique_ingredients_list)

last_word_counts = Counter(ingredient.split()[-1] for ingredient in sorted_unique_ingredients)

frequent_last_words = {word: count for word, count in last_word_counts.items() if count > 15}

# Print the filtered dictionary
print("Words used more than 15 times:")
print(frequent_last_words)
print("count: ", len(frequent_last_words))

Words used more than 15 times:
{'sauce': 188, 'paste': 70, 'milk': 50, 'tomatoes': 51, 'beans': 80, 'cheese': 163, 'yogurt': 43, 'broth': 65, 'ham': 38, 'seasoning': 67, 'mix': 119, 'beef': 20, 'noodles': 45, 'juice': 71, 'powder': 83, 'squash': 17, 'flakes': 29, 'vinegar': 45, 'steaks': 29, 'water': 28, 'pepper': 50, 'flour': 70, 'butter': 38, 'extract': 30, 'liqueur': 30, 'oil': 70, 'syrup': 43, 'almonds': 16, 'pasta': 31, 'slices': 22, 'chile': 21, 'fillets': 55, 'sausage': 52, 'seeds': 37, 'apples': 18, 'bacon': 22, 'halves': 19, 'rice': 56, 'bread': 45, 'dressing': 44, 'leaves': 67, 'tortillas': 16, 'corn': 17, 'mushrooms': 32, 'potatoes': 24, 'salt': 41, 'chips': 29, 'crust': 20, 'chocolate': 18, 'spray': 16, 'sugar': 48, 'peppers': 21, 'mayonnaise': 17, 'soup': 45, 'roast': 32, 'steak': 36, 'meat': 34, 'stock': 44, 'onion': 18, 'cream': 56, 'dough': 21, 'salsa': 22, 'garlic': 17, 'chicken': 25, 'olives': 30, 'peas': 19, 'chops': 20, 'crumbs': 23, 'rolls': 33, 'buns': 19, 'mustar

### Factors Considered for Merging Ingredients

The following factors were taken into account when merging ingredients:

- **Frequency of Usage**
    - Frequently used modifiers were **not merged** when they likely contained **important contextual information**.
- **Meaning of Modifier**
    - Although *"Old Bay Seasoning"* appeared in **98 recipes** (against **2,212 other ingredients** ending with the word *"seasoning"*), it was still merged, because the modifier *"Old Bay"* is a **brand name** rather than a descriptive term.
- **Spacing Issues**
    - Example: *"poppyseeds"* → *"poppy seeds"*

Based on these factors, a mapping was created for each last word, and the ingredients were cleaned accordingly.
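
Spacing issues such as *"poppyseeds"* vs. *"poppy seeds"* can be surfaced by grouping names that become identical once spaces are removed. The following is a minimal sketch on a toy list; `find_spacing_variants` is a hypothetical helper and is not part of the original notebook.

```python
from collections import defaultdict

def find_spacing_variants(names):
    """Group ingredient names that are identical once spaces are removed."""
    groups = defaultdict(set)
    for name in names:
        lowered = name.lower()
        groups[lowered.replace(" ", "")].add(lowered)
    # Keep only the keys that have more than one spelling
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}

# Toy input (hypothetical), standing in for the unique ingredient names
sample = ["poppyseeds", "poppy seeds", "sesame seeds", "Poppy Seeds"]
print(find_spacing_variants(sample))
# {'poppyseeds': ['poppy seeds', 'poppyseeds']}
```

Each reported group is only a candidate for merging; whether the variants really mean the same ingredient still needs a manual check.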

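#### Filter Unique Ingredients Ending with a Given Last Word

The function **`cleaned_with_frequency`** was used to help create the ingredient mapping: for each candidate ingredient, it reports the share of recipes containing it relative to the recipes containing its last word.  
To understand the data better, the candidates were also reviewed **manually**; the cell after the function lists every unique ingredient ending with "mushrooms".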

In [6]:
def cleaned_with_frequency(ingredients, data):
    """Return each ingredient's recipe share (%) among recipes containing the shared last word."""
    # The last word of the first candidate defines the reference group
    last_word = ingredients[0].split()[-1].lower()
    rows_with_last_word = data[data['ingredients'].apply(
        lambda x: any(last_word in ingredient.lower() for ingredient in x)
    )]
    total_last_word_count = len(rows_with_last_word)
    ingredient_frequencies = {}
    for ingredient in ingredients:
        # Recipes in which the candidate occurs as a substring of any ingredient
        rows_with_ingredient = data[data['ingredients'].apply(
            lambda x: any(ingredient.lower() in ingredient_in_row.lower() for ingredient_in_row in x)
        )]
        total_ingredient_count = len(rows_with_ingredient)

        if total_last_word_count > 0:
            percentage_frequency = (total_ingredient_count / total_last_word_count) * 100
        else:
            percentage_frequency = 0

        ingredient_frequencies[ingredient] = round(percentage_frequency, 2)

    return ingredient_frequencies

In [7]:
ingredients_with_ = [ingredient for ingredient in sorted_unique_ingredients if ingredient.endswith("mushrooms")]
print(ingredients_with_)

['baby portobello mushrooms', 'black mushrooms', 'black trumpet mushrooms', 'brown beech mushrooms', 'button mushrooms', 'chestnut mushrooms', 'chinese black mushrooms', 'cremini mushrooms', 'crimini mushrooms', 'diced mushrooms', 'dried black mushrooms', 'dried mushrooms', 'dried porcini mushrooms', 'dried shiitake mushrooms', 'dried wood ear mushrooms', 'fresh mushrooms', 'fresh shiitake mushrooms', 'green giant™ sliced mushrooms', 'maitake mushrooms', 'matsutake mushrooms', 'mixed mushrooms', 'mushrooms', 'oyster mushrooms', 'shimeji mushrooms', 'sliced mushrooms', 'straw mushrooms', 'tree ear mushrooms', 'white button mushrooms', 'white mushrooms', 'wild mushrooms', 'wood ear mushrooms', 'wood mushrooms']
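
To illustrate what the percentages returned by `cleaned_with_frequency` mean, here is a condensed restatement of the same calculation on toy data. It is a sketch only: `share_of_last_word` is a hypothetical name, and a plain list of ingredient lists stands in for the `ingredients` column.

```python
def share_of_last_word(candidates, recipes):
    """Percentage of recipes containing each candidate, relative to recipes containing the shared last word."""
    last_word = candidates[0].split()[-1].lower()

    def count_recipes(term):
        # A recipe counts if the term is a substring of any of its ingredients
        return sum(any(term in item.lower() for item in items) for items in recipes)

    total = count_recipes(last_word)
    return {
        candidate: round(count_recipes(candidate.lower()) / total * 100, 2) if total else 0
        for candidate in candidates
    }

# Toy recipes (hypothetical): four of the five contain some kind of mushrooms
toy_recipes = [
    ["button mushrooms", "salt"],
    ["sliced mushrooms", "butter"],
    ["black mushrooms", "soy sauce"],
    ["mushrooms", "garlic"],
    ["salt", "water"],
]
shares = share_of_last_word(["black mushrooms", "sliced mushrooms"], toy_recipes)
print(shares)
# {'black mushrooms': 25.0, 'sliced mushrooms': 25.0}
```

A low share suggests that the variant is rare within its group and is a candidate for merging into the generalized name.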


### Function: Ingredient Cleaner

The **`clean_ingredients_with_mapping`** function processes a list of ingredient names by **standardizing** them using a **`mapping`** dictionary that maps each generalized name to a list of keywords.

#### **How It Works:**
1. **Normalization**: Each ingredient name is converted to **lowercase** to ensure case-insensitive matching.
2. **Keyword Mapping**: The function checks whether any **keyword** from `mapping` occurs as a substring of the lowercased ingredient name.
   - If a keyword is found, the ingredient name is replaced with the corresponding **standardized term** (the dictionary key); the first matching entry wins.
   - If no keywords match, the **original ingredient name is preserved**.
3. **Redundancy Reduction**: Variations of an ingredient name are **grouped under a standardized name**, and duplicates within the list are removed while keeping the original order.

#### **Output:**
The function returns a **processed list of ingredients**, where:
- Each name is either **mapped to a generalized term**, or  
- **Left unchanged** if no match is found.

In [8]:
def clean_ingredients_with_mapping(ingredients, mapping):
    cleaned = []
    for ingredient in ingredients:
        lower_ingredient = ingredient.lower()
        found = False
        for generalized_name, keywords in mapping.items():
            if any(keyword in lower_ingredient for keyword in keywords):
                cleaned.append(generalized_name)
                found = True
                break
        if not found:
            cleaned.append(ingredient)
    return list(dict.fromkeys(cleaned))
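
A small usage example may help show the behavior, including one pitfall of substring matching. The function is repeated so the snippet runs on its own; `toy_mapping` and the sample recipe are hypothetical and are not the notebook's actual mapping.

```python
def clean_ingredients_with_mapping(ingredients, mapping):
    # Same logic as the notebook cell, repeated so the snippet runs on its own
    cleaned = []
    for ingredient in ingredients:
        lower_ingredient = ingredient.lower()
        for generalized_name, keywords in mapping.items():
            if any(keyword in lower_ingredient for keyword in keywords):
                cleaned.append(generalized_name)
                break
        else:
            cleaned.append(ingredient)  # no keyword matched: keep the original name
    return list(dict.fromkeys(cleaned))  # drop duplicates, keep first-seen order

# Toy mapping (hypothetical): generalized name -> substrings that trigger it
toy_mapping = {
    "mushrooms": ["sliced mushrooms", "fresh mushrooms"],
    "soy sauce": ["soy", "tamari"],
}

recipe = ["Sliced Mushrooms", "fresh mushrooms", "tamari", "Salt", "soy milk"]
result = clean_ingredients_with_mapping(recipe, toy_mapping)
print(result)
# ['mushrooms', 'soy sauce', 'Salt']
```

Note that "soy milk" is also rewritten to "soy sauce" because matching is by substring, so keyword lists need to be specific enough to avoid such collisions.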

#### Function: Merge or Leave Distinguisher

The **`analyze_ingredient_usage`** function (the "merge or leave" distinguisher) analyzes the usage of a **specific ingredient** in relation to a **generalized ingredient** within a recipe dataset. Its purpose is to determine whether the **specific ingredient** should be merged with the **generalized one** or kept distinct.

#### **How It Works:**
1. **Counts Recipe Occurrences**:  
   - Computes the number of recipes containing the **generalized ingredient** and, separately, the number containing the **specific ingredient** (case-insensitive substring match).  
2. **Percentage Calculation**:  
   - Calculates the percentage of recipes that include the **specific ingredient** relative to the **generalized ingredient**.
3. **Threshold-Based Decision**:  
   - Compares the computed percentage against a **user-defined threshold**.  
   - If the percentage is **below the threshold**, the ingredient is **merged**; otherwise, it remains **distinct**.

#### **Output:**
The function returns a **dictionary** containing:
- **Total recipe counts** for both the **generalized and specific ingredient**.
- **The computed percentage** of recipes containing the specific ingredient.
- **A decision** (`"merge"` or `"leave"`) based on the threshold.

This approach is particularly useful for **cleaning and simplifying ingredient data** while **preserving significant variations** in recipe datasets.

In [11]:
import ast

def analyze_ingredient_usage(data, specific_ingredient, generalized_ingredient, threshold=5):
    """
    Analyzes the usage of a specific ingredient compared to its generalized ingredient.
    """
    # Parse stringified lists if present; note that this updates `data` in place
    data['ingredients'] = data['ingredients'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
    
    recipes_with_generalized = data[data['ingredients'].apply(
        lambda x: any(generalized_ingredient.lower() in ingredient.lower() for ingredient in x)
    )]
    total_with_generalized = len(recipes_with_generalized)
    
    recipes_with_specific = data[data['ingredients'].apply(
        lambda x: any(specific_ingredient.lower() in ingredient.lower() for ingredient in x)
    )]
    total_with_specific = len(recipes_with_specific)
    
    if total_with_generalized > 0:
        percentage = (total_with_specific / total_with_generalized) * 100
    else:
        percentage = 0

    decision = "merge" if percentage < threshold else "leave"
    
    return {
        "total_recipes_with_generalized": total_with_generalized,
        "total_recipes_with_specific": total_with_specific,
        "percentage": percentage,
        "decision": decision
    }

# Example usage
data = pd.read_json("json_files/train.json")  # Load the dataset


specific_ingredient = "black mushrooms"
generalized_ingredient = "mushrooms"
threshold = 1  # Percentage threshold

result = analyze_ingredient_usage(data, specific_ingredient, generalized_ingredient, threshold)

# Print results
print(f"Total recipes that use '{generalized_ingredient}': {result['total_recipes_with_generalized']}")
print(f"Total recipes that use '{specific_ingredient}': {result['total_recipes_with_specific']}")
print(f"Percentage of '{specific_ingredient}': {result['percentage']:.2f}%")
print(f"Decision: {result['decision']}")

Total recipes that use 'mushrooms': 1981
Total recipes that use 'black mushrooms': 22
Percentage of 'black mushrooms': 1.11%
Decision: leave
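
The decision above follows directly from the counts: 22 / 1981 × 100 ≈ 1.11%, which is not below the threshold of 1, so the result is "leave". The sketch below reproduces only that arithmetic; `merge_decision` is a hypothetical helper, and the second call uses the function's default threshold of 5 to show how sensitive the outcome is to that parameter.

```python
def merge_decision(specific_count, generalized_count, threshold):
    """Apply the same percentage rule as analyze_ingredient_usage to two counts."""
    percentage = specific_count / generalized_count * 100 if generalized_count else 0
    decision = "merge" if percentage < threshold else "leave"
    return percentage, decision

# Counts taken from the output above
pct, decision = merge_decision(22, 1981, threshold=1)
print(f"{pct:.2f}% -> {decision}")   # 1.11% -> leave

# Same counts with the default threshold of 5
pct_default, decision_default = merge_decision(22, 1981, threshold=5)
print(f"{pct_default:.2f}% -> {decision_default}")   # 1.11% -> merge
```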


#### Mapping Applier Function

The **`mapping_applier`** function applies predefined ingredient mappings to clean the dataset and logs any changes.  

In [13]:
import pandas as pd
import json

def mapping_applier(data, mappings, output_json="json_files/train_cleaned.json", log_file="txt_files/mapping_changes.txt"):
    """
    Applies ingredient mappings to clean the dataset and logs changes.

    Args:
        data (pd.DataFrame): The dataset containing an 'ingredients' column with lists of ingredients.
        mappings (dict): A dictionary mapping last words to standardized ingredients.
        output_json (str): Filename for the cleaned dataset.
        log_file (str): Filename for logging changes.
    """
    changes = []

    def map_ingredient(ingredient):
        if not isinstance(ingredient, str):
            return ingredient
        words = ingredient.lower().split() 
        last_word = words[-1] if words else ""

        if last_word in mappings:  
            word_mapping = mappings[last_word]
            for key, value in word_mapping.items():
                for i in value:
                    if i in words:
                        if ingredient != key: 
                            changes.append(f'"{ingredient}" > "{key}"')
                        return key

        return ingredient

    # Apply the mapping to the DataFrame passed in, not to a global variable
    data["ingredients"] = data["ingredients"].apply(
        lambda ingredient_list: [map_ingredient(ingredient) for ingredient in ingredient_list]
        if isinstance(ingredient_list, list) else ingredient_list
    )

    # Save the cleaned data
    data.to_json(output_json, orient="records", indent=4)

    if changes:
        with open(log_file, "w") as log:
            log.write("\n".join(changes))
        print(f"Mapping applied. Cleaned dataset saved to {output_json}. Changes logged in {log_file}.")
    else:
        print("No mappings applied. Check your mapping dictionary or input data.")
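
One detail of `map_ingredient` is worth noting: `i in words` tests membership in the list of tokens, so multi-word keywords (for example `"soy bean"`) and the empty-string keyword used in entries such as `"ham": ['']` never match. If phrase matching and an explicit catch-all are wanted, one possible approach is sketched below. This is an assumption about the intent, not the notebook's method, and `keyword_matches` is a hypothetical helper.

```python
def keyword_matches(keyword, ingredient):
    """Whole-token match for single words, contiguous-token match for phrases, '' as catch-all."""
    words = ingredient.lower().split()
    key_words = keyword.lower().split()
    if not key_words:
        # Empty keyword: treat it as "matches everything"
        return True
    n = len(key_words)
    # Slide a window of n tokens over the ingredient and compare
    return any(words[i:i + n] == key_words for i in range(len(words) - n + 1))

print(keyword_matches("soy bean", "fermented soy bean paste"))  # True
print(keyword_matches("hot", "shot sauce"))                     # False
print(keyword_matches("", "ham"))                               # True
```

Because the catch-all matches everything, an entry that uses it would need to come last within its group so that more specific entries are tried first.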



In [14]:
mappings = {
    "sauce": {
        "tomato sauce" : ["tomato", "pasta", "spaghetti", "ragu", "prego", "marinara", "italian"],
        "soy sauce" : ["soy", "tamari"],
        "barbeque sauce" : ["barbecue", "barbeque", "bbq"],
        "worcestershire sauce" : ["worcestershire"],
        "hot sauce" : ["pepper", "hot", "chili", "chill", "chile"],
        "enchilada sauce" : ["enchilada"],
        "bean sauce" : ["bean"],
        "teriyaki sauce" : ["teriyaki", "kung"],
        "fish sauce" : ["fish"],
        "oyster sauce" : ["oyster"],
        "picante sauce" : ["picante"],
        "tartar sauce" : ["tartar"],
        "cranberry sauce" : ["cranberry"],
        "alfredo sauce" : ["alfredo"],
        "sweet and sour sauce" : ["sour"],
        "apple sauce": ["apple"],
        "plum sauce" : ["plum"],
        "arrabbiata sauce" : ["arrabbiata"]
    },
    "paste": {
        "tomato paste": ["tomato"],
        "hot paste": ["hot"] ,
        "chili paste": ["chilli", "chile", "chili"] ,
        "curry paste": ["curry"],
        "soy bean paste": ["soy bean"] ,
        "red bean paste" : ["sweet bean paste", "sweet red bean paste", "bean paste", "fermented"],
        "sesame paste" : ["sesame"]
    },
    "milk": {
        "chocolate milk" : ["chocolate"],
        "coconut milk" : ["coconut"],
        "buttermilk" : ["buttermilk"],
        "almond milk" : ["almond"],
        "soy milk" : ["soy"],
        "evaporated milk" : ["evaporated", "condensed", "carnation"],
        "cashew milk" : ["cashew"],
        "powdered milk" : ["powdered", "dry", "dried"],
        "sour milk" : ["sour"],
        "oat milk" : ["skim"]
    },
    "tomatoes": {
        "plum/cherry tomatoes": ["plum tomatoes", "italian plum tomatoes","cherry tomatoes"],
        "grape tomatoes": ["grape tomatoes"],
        "green tomatoes": ["green tomatoes"]
    },
    "beans": {
        "refried beans": [
        "refried beans", "fat-free refried beans", "reduced sodium refried beans",
        "vegetarian refried beans", "low-fat refried beans", "old el paso™ refried beans", "refried black beans"
        ],
        "black beans": [
        "black beans", "canned black beans", "dried black beans", "reduced sodium black beans",
        "seasoned black beans", "no-salt-added black beans", "progresso black beans", "kroger black beans",
        "low sodium black beans", "refried black beans"
        ],
        "kidney beans": [
        "kidney beans", "red kidney beans", "light kidney beans", "light red kidney beans",
        "dried kidney beans", "reduced sodium kidney beans", "white kidney beans"
        ],
        "lima beans": [
        "lima beans", "baby lima beans", "fresh lima beans", "frozen lima beans"
        ],
        "pinto beans": [
        "pinto beans", "dried pinto beans", "low sodium pinto beans"
        ],
        "red beans": [
        "red beans", "small red beans", "sweetened red beans"
        ],
        "white beans": [
        "white beans", "white cannellini beans", "giant white beans", "small white beans", "great northern beans", "cannellini beans"
        ],
        "green beans": [
        "green beans", "long green beans", "frozen green beans", "string beans", "wax beans"
        ],
        "edamame beans": [
        "green soybeans", "edamame beans", "frozen edamame beans"
        ],
        "coffee beans": [
        "chocolate covered coffee beans", "chocolatecovered espresso beans", "coffee beans", "espresso beans"
        ]
    },
    "cheese": {
        "cheddar cheese": ["cheddar"],
        "mozzarella cheese": ["mozzarella", "italian", "string"],
        "cream cheese": ["cream"],
        "cottage cheese" : ["cottage"],
        "jack cheese" : ["jack"],
        "goat cheese" : ["goat"],
        "swiss cheese" :["swiss", "jarlsberg"],
        "manchego cheese" : ["manchego"],
        "blue cheese" : ["blue", "stilton", "roquefort"],
        "parmesan cheese" : ["parmesan cheese", "parmigiano"],
        "gruyere cheese": ["gruyere", "grated gruyère"],
        "ricotta cheese" : ["ricotta"],
        "romano cheese" : ["romano"],
        "mexican cheese" : ["mexican"],
        "american cheese" : ["american"],
        "cheese" : ["low-fat cheese", "shredded cheese", "reduced-fat cheese", "semi-soft cheese", "hard cheese", "herb cheesse", "processed cheese", "fresh cheese", "garlic herb spreadable cheese", "crumbled cheese", "vegan cheese"],
        "taco cheese" : ["taco"],
        "provolone cheese" : ["provolone"],
        "gouda cheese" : ["gouda"]
    },
    "yogurt" : {
        "greek yogurt": ["greek yogurt", "low-fat greek yogurt", "full-fat plain yogurt", "nonfat greek yogurt",
        "fat free greek yogurt", "whole milk greek yogurt", "yoplait® greek 100 blackberry pie yogurt",'greek style plain yogurt',
        "yoplait® greek 2% caramel yogurt", "plain low fat greek yogurt","low-fat greek yogurt","honey-flavored greek style yogurt",
        "lowfat plain greekstyle yogurt","strained yogurt"],
        "frozen yogurt": ['coffee low-fat frozen yogurt', 'nonfat frozen yogurt','vanilla low-fat frozen yogurt',
        'nonfat vanilla frozen yogurt', 'vanilla frozen yogurt'],
        "vanilla yogurt":['vanilla lowfat yogurt','nonfat vanilla yogurt','low-fat vanilla yogurt'],
        "yogurt":['plain low-fat yogurt','low-fat plain yogurt','homemade yogurt','cream yogurt',
        "fat free yogurt",'low-fat natural yogurt','low-fat yogurt','natural yogurt', 'non dairy yogurt','plain yogurt'
        ,'vegan yogurt','strawberry yogurt','nonfat yogurt','plain whole-milk yogurt'],
        "soy yogurt":['plain soy yogurt','soy yogurt']
    },
    "broth" : {
        "chicken broth" : ["chicken"],
        "beef broth" : ["beef", "bone broth"],
        "vegetable broth" : ["vegetable"],
        "broth" : ["gluten-free broth", "low sodium broth"]
    },
    "ham": {
        "smoked ham": ['smoked','black forest ham'],
        "cooked ham": ['cooked','baked','boiled ham'],
        "ham":['']
    },
    "seasoning": {
        "taco seasoning" : ["taco"],
        "creole seasoning" :["creole"],
        "steak seasoning" : ["steak", "meat"],
        "jerk seasoning" : ["jerk"],
        "cajun seasoning" : ["cajun"],
        "adobo seasoning" : ["adobo"],
        "barbeque seasoning" :["bbq", "barbeque", "barbecue"],
        "blackening seasoning" : ["black"],
        "greek seasoning" :["greek"],
        "seafood seasoning" : ["crab", "seafood"],
        "pepper seasoning" : ["lemon pepper", "garlic pepper"],
        "italian seasoning" : ["italian"],
        "poultry seasoning" :["poultry"]
    },
    "mix": {
        "seasoning mix": ["seasoning","creole spice mix",'cajun spice mix','tandoori masala mix'],
        "cake mix":["cake"],
        "stuffing mix":["stuffing"],
        "baking mix":["baking","biscuit","cookie","brownie",'jiffy corn muffin mix','cornbread','pizza','pie','muffin'
        ,"bread","gravy mix",'cornmeal mix'],
        "drink mix":["hot chocolate mix", "hot cocoa mix", "margarita mix", "bacardi® mixers margarita mix",
        "pina colada mix", "sour mix", "lipton lemon iced tea mix",'chocolate milk mix','bloody mary mix'],
        "dessert mix":["instant pudding mix", "instant butterscotch pudding mix", "powdered vanilla pudding mix",
        "custard dessert mix", "icing mix",'chocolate ice cream mix'],
        "soup mix":["soup",'knorr'],
        "ranch/salad mix":["ranch","salad",'dip mix','cole slaw mix','dressing','coleslaw','slaw'],
        "gravy mix":["gravy"],
        "curry":['curry'],
        "jambalaya mix":['jambalaya'],
        "rice mix":['rice'],
        "alfredo sauce":["alfredo sauce mix"]
    },
    "beef": {
        "beef fillet" : ["fillet"]
    },
    "noodles": {
        'rice noodles':['rice','vermicelli'],
        'egg noodles':['egg'],
        'buckwheat noodles':['buckwheat','soba'],
        'ramen noodles':['ramen','chuka'],
        'spaghetti noodles':['spaghetti'],
        'chinese noodles':['chinese','mein','hong','shanghai'],
        'lasagna noodles':['lasagna']
    },
    "juice": {
        "orange juice" : ["orange"],
        "clam juice" : ["clam"],
        "lemon juice" : ["lemon"],
        "lime juice" : ["lime"],
        "pineapple juice" : ["pineapple"],
        "vegetable juice" : ["vegetable juice", "v8", "v 8"],
        "tomato juice" : ["tomato"],
        "pickle juice" : ["pickle"],
        "sugarcane juice" : ["cane"],
        "apple juice" : ["unsweetened apple juice"],
        "grapefruit juice" : ["grapefruit"],
        "calamansi juice" : ["kalamansi"]
    },
    "powder":{
        "cocoa powder":['cocoa','cacao','vanilla'],
        "chili powder": [
        "achiote powder", "ancho powder", "chipotle chile powder", "chili powder", "hot chili powder",
        "red chile powder", "red chili powder", "salt free chili powder", "guajillo chile powder",
        "new mexico red chile powder", "habanero powder",'chile'],
        'curry powder':['curry'],
        'baking powder':['baking','meringue'],
        'milk powder':['milk','cream'],
        'bouillon powder':['bouillon','chicken-flavored soup powder'],
        'tea powder':['tea','espresso'],
        'asafetida powder':['asafetida','asafoetida'],
        'rice powder':['rice'],
        'five-spice powder':['five-spice'],
        'mustard powder':['mustard'],
        'file powder':['file'],
        'coriander powder':['coriander','dhaniya'],
        'mushroom powder':['porcini','mushroom']
    },
    "squash":{
        "squash" : ["banana squash", "winter squash", "buttercup squash", "delicata squash", "heirloom squash", "hubbard squash", "opo squash", "pattypan squash", "summer squash", "yellow crookneck squash", "yellow squash", "yellow summer squash"]
    },
    "flakes":{
        'chili flakes':['chili','chile'],
        'red pepper flakes':['red pepper'],
        'bonito flakes':['bonito'],
        'parsley flakes':['parsley'],
        'coconut flakes':['coconut'],
        'potato flakes':['potato'],
        'corn flakes':['cornflakes','corn'],
        'fish flakes':['fish'],
        'mint flakes':['mint']
    },
    "vinegar":{
        "rice vinegar" : ["rice vinegar"],
        "cider vinegar" : ["apple cider vinegar"],
        "wine vinegar" : ["wine vinegar", "sherry", "champagne"],
        "black vinegar" : ["black", "chinkiang"],
        "white vinegar" : ["white"],
        "red vinegar" : ["chinese red"],
        "balsamic vinegar" : ["balsamic"],
        "malt vinegar" : ["malt"]
    },
    "steaks":{
        'rib eye steaks':['rib eye'],
        'tuna steaks':['tuna'],
        'lamb steaks':['lamb'],
        'rump steaks':['rump'],
        'tenderloin steaks':['tenderloin'],
        'fillet steaks':['fillet','filet'],
        'fish steaks':['salmon'],
        'top loin steaks':['strip']
    },
    "water":{
        "water" : ["boiling water", "cold water", "hot water", "ice water", "mineral water", "spring", "tap water", "warm water"],
        "tuna in water" : ["tuna"],
        "rose water" : ["rose"],
        "carbonated water" : ["carbonated water", "seltzer water", "soda water", "sparkling"]
    },
    "pepper":{
        "chilli pepper":["chil",'ground pepper'],
        "black pepper":['black'],
        "dr pepper":['diet dr. pepper', 'dr pepper', 'dr. pepper'],
        "pepper":['long green pepper', 'long pepper','fresno','sichuan','chinese','bird','chipotle','poblano','cherry','pasilla','habanero'],
        "green bell pepper":['green bell pepper', 'green bellpepper','yellow bell pepper'],
        "bell pepper":['chopped bell pepper','diced bell pepper','orange bell pepper'],
        "cayenne pepper":['cayenne'],
        "white pepper":['white'],
        'red pepper':['crushed red pepper','ground red pepper']
    },
    "flour":{
        "all purpose flour" : ["purpose"],
        "barley flour" : ["barley"],
        "bread flour" : ["bread"],
        "corn flour" : ["corn"],
        "almond flour" : ["almond"],
        "wheat flour" : ["wheat"],
        "cake flour" : ["cake"],
        "self-raising flour" : ["self"],
        "pastry flour" : ["pastry"],
        "semolina flour" : ["semolina"],
        "rye flour" : ["rye"]
    },
    "butter":{
        'peanut butter':['peanut'],
        'unsalted butter':['unsalted'],
        'butter' : ['']
    },
    "extract":{
        "vanilla extract" : ["vanilla"],
        "maple extract" : ["maple"]
    },
    "liqueur":{
        'cream liqueur':['cream'],
        'chocolate liqueur':['chocolate'],
        'liqueur':['kirschenliqueur'],
        'southern liqueur':['southern'],
        'raspberry liqueur':['framboise']
    },
    "oil": {
        "olive oil" : ["olive"],
        "truffle oil" : ["truffle"],
        "coconut oil" : ["coconut", "palm"],
        "canola oil" : ["canola"],
        "vegetable oil" : ["vegetable"],
        "corn oil" : ["corn"],
        "almond oil" : ["almond"],
        "sesame oil" : ["sesame"],
        "tuna in oil" : ["tuna"]
    },
    "syrup":{
        'caramel syrup':['caramel'],
        'dark corn syrup':['dark'],
        'corn syrup':['karo','corn'],
        'syrup':['flavored','simple','pancake','table'],
        'maple syrup':['maple'],
        'ginger syrup':['ginger'],
        'peaches syrup':['peaches'],
        'brown rice syrup':['rice'],
        'barley malt syrup':['malt']
    },
    "almonds":{
        "marcona almonds" : ["marcona"],
        "tamari almonds" : ["tamari"]
    },
    "pasta":{
        'bow tie pasta':['bow-tie pasta'],
        'penne pasta':['penne'],
        'pasta' : ['']
    },
    "slices":{
        "sausage" : ["sausage"],
        "jack cheese" : ["jack cheese"],
        "mozzarella cheese" : ["mozzarella"],
        "american cheese" : ["american cheese"],
        "bacon" : ["bacon"],
        "pepperoni" : ["pepperoni"]
    },
    "chile":{
        'green chile':['green'],
        'arbol chile':['arbol'],
        'chile':['dried']
    },
    "fillets":{
        "bass fillets" : ["bass"],
        "flounder fillets" : ["flounder"],
        "cod fillets" : ["cod"]
    },
    "sausage":{
        'andouille sausage':['andouille'],
        'smoked sausage':['smoked'],
        'italian sausage':['italian'],
        'pork sausage':['pork'],
        'sausage':['hillshire farms low fat sausage'],
        'hot sausage':['hot','spicy'],
        'turkey sausage':['turkey'],
        'chicken sausage':['chicken']
    },
    "seeds":{
        "pumpkin seeds" : ["pumpkin"],
        "sesame seeds" : ["sesame"],
        "sunflower seeds" : ["sunflower"],
        "poppy seeds" : ["poppy"],
        "mustard seeds" : ["yellow mustard"]
    },
    "apples":{
        "apples" : ["cooking apples", "diced apples", "red", "sliced apples", "delicious"]
    },
    "bacon":{
        "bacon" : ["center cut bacon", "chopped bacon", "cooked bacon", "crispy bacon", "diced bacon", "oscar mayer bacon", "thick-cut bacon"],
        "streaky bacon" : ["streaky"]
    },
    "halves":{
        'chicken breast halves':['chicken'],
        'duck breast halves':['duck'],
        'pecan halves':['pecan'],
        'turkey breast halves':['turkey']
    },
    "rice":{
        "brown rice" : ["brown rice"],
        "white rice" : ["white rice"],
        "mexican rice" : ["mexican rice"],
        "spanish rice" : ["spanish", "risotto", "paella"],
        "rice" : ["sushi", "uncle ben's converted brand rice"]
    },
    "bread":{
        'country bread':['country'],
        'corn bread':['corn'],
        'bread':['gluten-free bread'],
        'pumpernickel bread':['pumpernickel'],
        'flatbread':['flat'],
        'sourdough bread':['sourdough'],
        'wheat bread':['wheat'],
        'white bread':['white']
    },
    "dressing":{
        "italian dressing" : ["hidden valley® farmhouse originals italian with herbs dressing", "italian"],
        "ranch dressing" : ["ranch"],
        "sesame ginger dressing" : ["sesame"],
        "vinaigrette dressing" : ["vinaigrette"],
        "caesar salad dressing" : ["caesar"],
        "salad dressing" : ["low-fat salad dressing"]
    },
    "leaves":{
        'basil leaves':['basil'],
        'leaves':['chopped leaves'],
        'collard green leaves':['collard'],
        'grape leaves':['grape','vine leaves']
    },
    "tortillas":{
        "corn tortillas" : ["corn"],
        "flour tortillas" : ["flour"],
        "wheat tortillas" : ["wheat"]
    },
    "corn":{
        'baby corn':['baby'],
        'popcorn':['popcorn'],
        'sweet corn':['sweet'],
        'pepper':['crushed peppercorn','ground peppercorn','whole peppercorn'],
        'corn':['ear of corn','yellow','frozen','fresh','canned']
    },
    "mushrooms": {
        "mushrooms" : ["diced mushrooms", "fresh mushrooms", "sliced mushrooms"],
        "shiitake mushrooms" : ["shiitake"]
    },
    "potatoes":{
        'sweet potatoes':['sweet'],
        'potatoes' : ['']
    },
    "salt":{
        "kosher salt" : ["kosher"],
        "coarse salt" : ["coarse"],
        "salt" : ["fine", "low sodium"]
    },
    "chips":{
        'tortilla chips' :['tortilla'],
        'miniature chocolate chips' :['miniature'],
        'pita chips':['pita'],
        'semisweet chocolate chips':['semi sweet','semisweet'],
        'chocolate chips':['mini']
    },
    "crust":{
        "pie crust" : ["pie", "double crust"],
        "pizza crust" : ["pizza"]
    },
    "chocolate":{
        'dark chocolate':['plain'] 
    },
    "spray":{
        "vegetable oil spray" : ["vegetable"],
        "olive oil spray" : ["olive oil"],
        "butter cooking spray" : ["butter"],
        "cooking spray" : ["cooking spray", "stick spray"]
    },
    "sugar":{
        'powdered sugar':['domino confectioners sugar'],
        'light brown sugar':['light brown sugar','golden brown sugar','light brown'],
        'sugar':['granulated sugar','superfine','cane','granulated','confectioners'],
        'brown sugar':['brown sugar'],
        'organic sugar':['organic']
    },
    "peppers":{
        "jalapeno peppers" : ["jalapeno"],
        "serrano peppers" : ["serrano"],
        "chili peppers" : ["chile", "chili"],
        "bell peppers" : ["bell"],
        "sichuan peppers" : ["sichuan"]
    },
    "mayonnaise":{
        'basil mayonnaise':['basil'],
        'garlic mayonnaise':['garlic'],
        'japanese mayonnaise':['japanese','kewpie'],
        'mayonnaise' : ['']
    },
    "soup":{
        "cheese soup" : ["cheese"],
        "cream of chicken soup" : ["cream of chicken"],
        "cream of mushroom soup" : ["cream of mushroom"],
        "cream of tomato soup" : ["cream of tomato"],
        "tomato soup" : ["tomato"],
        "cream of broccoli soup" : ["cream of broccoli"],
        "cream of celery soup" : ["cream of celery"],
        "cream of potato soup" : ["cream of potato"],
        "onion soup" : ["onion soup"]
    },
    "roast":{
        'chuck roast':['chuck'],
        'beef roast':['beef roast'],
        'pork shoulder roast':['pork shoulder roast'],
        'pork roast':['center cut pork roast']
    },
    "steak":{
        "sirloin steak" : ["sirloin"],
        "rib eye steak" : ["ribeye"],
        "round steak" : ["round"],
        "steak" : ["leftover steak", "cooked steak"],
        "flank steak" : ["flank"],
        "lean steak" : ["lean"]
    },
    "meat":{
        'meat':['cooked meat','cubed meat','dark meat','chop','sliced'],
        'crab meat':['crab'],
        'luncheon meat':['luncheon'],
        'ground meat':['ground'],
        'minced meat':['minced','mincemeat'],
        'turkey meat':['turkey breast deli meat'],
        'duck meat':['duck'],
        'coconut meat':['coconut meat'],
        'beef meat':['beef']
    },
    "stock":{
        "chicken stock" : ["chicken"],
        "vegetable stock" : ["vegetable"],
        "beef stock" : ["beef"],
        "stock" : ["low sodium stock", "homemade stock"],
        "turkey stock" : ["turkey"]
    },
    "onion":{
        'sweet onion':['maui','bermuda'],
        'yellow onion':['yellow onion'],
        'purple onion':['purple onion'],
        'onion' : ['']
    },
    "cream":{
        "vanilla ice cream" : ["vanilla ice cream", "vanilla bean ice cream", "vanilla low-fat ic cream"],
        "sour cream" : ["sour cream"],
        "whipping cream" : ["whip"],
        "coconut ice cream" : ["coconut ice cream"],
        "coffee ice cream" : ["coffee ice cream"],
        "cream" : ["low fat cream"],
        "ice cream" : ["fat free ice cream"]
    },
    "dough":{
        'bread dough':['bread'],
        'pizza dough':['pizza'],
        'croissant dough':['croissant','crescent'],
        'wheat dough':['wheat'],
        'cookie dough':['cookie']
    },
    "salsa":{
        "salsa" : ["bottled low sodium", "herdez salsa", "medium salsa", "tomato"],
        "chunky salsa" : ["chunky"]
    },
    "garlic":{
        'black garlic':['black garlic'],
        'garlic' : ['']
    },
    "chicken":{
        "chicken" : ["low sodium chicken", "minced chicken", "whole chicken", "cut up", "sliced chicken", "diced chicken"],
        "broiler chicken" : ["broiler"],
        "popcorn chicken" : ["popcorn"],
        "chicken stock" : ["stock"]
    },
    "olives":{
        'black olives':['black'],
        'kalamata olives':['kalamata','calamata'],
        'olives':['oil cured olives','brine-cured olives','pitted','sliced'],
        'pimento olives':['pimento'],
        'green olives':['green']
    },
    "peas":{
        "peas" : ["fresh peas", "garden peas", "green peas"]
    },
    "chops":{
        'loin pork chops':['loin pork chops','center cut pork loin chops','pork loin chops'],
        'pork chops':['pork chops'],
        'lamb shoulder chops':['lamb shoulder chops','shoulder lamb chops']
    },
    "crumbs":{
        "bread crumbs" : ["bread"],
        "cracker crumbs" : ["cracker"]
    },
    "rolls":{
        'french rolls':['french'],
        'italian rolls':['italian'],
        'crescent rolls':['crescent'],
        'dinner rolls':['dinner'],
        'sandwich rolls':['sandwich']
    },
    "buns":{
        "burger buns" : ["burger"]
    },
    "mustard":{
        'dijon mustard':['dijon'],
        'chinese mustard':['chinese'],
        'coarse mustard':['coarse'],
        'ground mustard':['ground','dry'],
        'mustard':['prepared','yellow'],
        'spicy brown mustard':['brown']
    },
    "shrimp":{
        "dried shrimp" : ["dried shrimp"],
        "shrimp" : ["deveined", "shelled"]
    },
    "wine":{
        'rice wine':['rice'],
        'cooking wine':['cooking','flavored'],
        'wine':['table'],
        'kosher wine':['passover']
    },
    "shells":{
        "pasta shells" : ["pasta shells"],
        "tostada shells" : ["tostada"]
    },
    "spread":{
        'spread' :['country crock® spread'],
        'buttery spread':['buttery',"i can't believ it' not butter! made with olive oil spread",
        "i can't believe it's not butter!® spread"],
        'cheese spread':['process cheese spread','velveeta cheese spread'],
        'raspberry spread':['raspberry'],
        'honey spread':['honey']
    },
    "tofu":{
        "silken tofu" : ["silken tofu"],
        "firm tofu" : ["firm tofu"],
        "tofu" : ["regular tofu"]
    }
}

mapping_applier(data_cleaned, mappings)

Mapping applied. Cleaned dataset saved to json_files/train_cleaned.json. Changes logged in txt_files/mapping_changes.txt.


### Plural To Singular Mapping

In [16]:
data_mapped = pd.read_json("json_files/train_cleaned.json")
data_singular_changed = pd.read_json("json_files/train_cleaned.json")

In [15]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk import pos_tag

nltk.download('punkt')
nltk.download('punkt_tab')  
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet') 

[nltk_data] Downloading package punkt to /Users/jun-
[nltk_data]     seoyang/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /Users/jun-
[nltk_data]     seoyang/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/jun-seoyang/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /Users/jun-
[nltk_data]     seoyang/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Function: Plural-to-Singular Mapping Applier

The **`p_to_s_mapping_applier`** function converts **plural ingredient names** into their **singular forms** using **lemmatization**. This process helps standardize ingredient names, reducing variations in the dataset.

---

### **Function Purpose**
- Converts **plural nouns** (e.g., `"mushrooms"` → `"mushroom"`) to their **singular form**.
- Uses **part-of-speech tagging (POS tagging)** to identify **plural nouns (NNS, NNPS)**.
- Logs all changes made during the conversion process.
- Saves the **cleaned dataset** and a **log file** for tracking modifications.

---

### **How It Works**
1. **Tokenization & POS Tagging**  
   - Splits each ingredient name into **individual words**.
   - Uses **POS tagging** to identify **plural nouns (NNS, NNPS)**.
  
2. **Singularization using Lemmatization**  
   - Converts **plural nouns** to their **singular form**.
   - Example: `"shiitake mushrooms"` → `"shiitake mushroom"`

3. **Apply Transformation to the Dataset**  
   - Iterates over the **"ingredients"** column, applying the singularization process.

4. **Save Results**  
   - The cleaned dataset is saved as **JSON**.
   - A **log file** records all plural-to-singular changes for reference.

---

### **Example Transformation**
| Original Ingredient | Singularized Ingredient |
|---------------------|------------------------|
| Shiitake Mushrooms | Shiitake Mushroom |
| Dried Cranberries | Dried Cranberry |
| Fresh Tomatoes | Fresh Tomato |
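
The tag-gated step can be sketched without NLTK installed. In the snippet below, the `(word, tag)` pairs are hand-written stand-ins for `pos_tag` output, and `toy_lemmatize` is a hypothetical placeholder for `WordNetLemmatizer.lemmatize`. Only the selection logic is meant to mirror the description above.

```python
PLURAL_TAGS = {"NNS", "NNPS"}  # Penn Treebank tags for plural nouns


def toy_lemmatize(word):
    """Hypothetical stand-in for a real lemmatizer; handles a few endings only."""
    if word.endswith("ies"):
        return word[:-3] + "y"
    if word.endswith("oes"):
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def singularize_tagged(tagged_words, lemmatize=toy_lemmatize):
    """Lemmatize only tokens tagged as plural nouns and rejoin with spaces."""
    return " ".join(
        lemmatize(word) if tag in PLURAL_TAGS else word
        for word, tag in tagged_words
    )


# Hand-written tags (assumed for illustration, not actual tagger output).
examples = [
    [("shiitake", "NN"), ("mushrooms", "NNS")],
    [("dried", "JJ"), ("cranberries", "NNS")],
    [("fresh", "JJ"), ("tomatoes", "NNS")],
    [("swiss", "JJ"), ("cheese", "NN")],
]
for tagged in examples:
    print(singularize_tagged(tagged))
```

Gating on the tag means a token is only rewritten when the tagger considers it a plural noun, so the result depends on tagger accuracy for short, context-free ingredient names.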

In [17]:
def p_to_s_mapping_applier(data, output_json="json_files/train_singular_cleaned.json", log_file="txt_files/singular_changes.txt"):
    changes = []  # One '"before" > "after"' entry per modified ingredient

    lemmatizer = nltk.WordNetLemmatizer()

    def convert_to_singular(ingredient):
        tokens = nltk.word_tokenize(ingredient)
        tagged_words = pos_tag(tokens)

        # Lemmatize only tokens tagged as plural nouns; keep all others as-is.
        singular_words = [
            lemmatizer.lemmatize(word, pos='n') if tag in ('NNS', 'NNPS') else word
            for word, tag in tagged_words
        ]

        cleaned = ' '.join(singular_words)

        if ingredient != cleaned:
            changes.append(f'"{ingredient}" > "{cleaned}"')
        return cleaned

    # Work on a copy of the argument instead of a module-level DataFrame.
    data_singular_changed = data.copy()
    data_singular_changed["ingredients"] = data_singular_changed["ingredients"].apply(
        lambda ingredient_list: [convert_to_singular(ingredient) for ingredient in ingredient_list]
        if isinstance(ingredient_list, list) else ingredient_list  # Leave non-list values untouched
    )

    data_singular_changed.to_json(output_json, orient="records", indent=4)

    if changes:
        with open(log_file, "w") as log:
            log.write("\n".join(changes))
        print(f"Singular mapping applied. Cleaned dataset saved to {output_json}. Changes logged in {log_file}.")
    else:
        print("No changes applied.")

In [18]:
p_to_s_mapping_applier(data_mapped)

Singular mapping applied. Cleaned dataset saved to json_files/train_singular_cleaned.json. Changes logged in txt_files/singular_changes.txt.


### Remove Descriptive Modifiers from Ingredients

This function cleans the dataset by removing descriptive modifiers, measurement units, digits, and stop words from ingredient names. It saves the cleaned dataset to a JSON file and logs the changes made.

### **Key Features**
1. **Modifier Removal**:
   - Removes common measurement units (e.g., `"cup"`, `"tbsp"`, `"kg"`).
   - Removes descriptive modifiers (e.g., `"chopped"`, `"fresh"`, `"minced"`).
   - Removes unnecessary stop words (e.g., `"and"`, `"with"`).

2. **Logging Changes**:
   - Tracks each change made to ingredient names in a log file.

3. **Outputs**:
   - Saves the cleaned dataset to `json_files/train_no_modifiers.json`.
   - Logs all changes to `txt_files/modifier_changes.txt`.
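
A minimal sketch of the cleaning order described above, using only the standard library. It splits on whitespace instead of calling `word_tokenize`, and the three word sets are shortened, illustrative subsets rather than the full lists, so results may differ from the real function on punctuation-heavy names. The name `strip_modifiers` is hypothetical.

```python
import re

# Shortened, illustrative subsets (assumptions, not the full lists).
MEASUREMENT_UNITS = {"cup", "tbsp", "kg", "oz"}
DESCRIPTIVE_MODIFIERS = {"chopped", "fresh", "minced", "sundried", "lowfat"}
STOP_WORDS = {"and", "with"}
DROP = MEASUREMENT_UNITS | DESCRIPTIVE_MODIFIERS | STOP_WORDS


def strip_modifiers(ingredient):
    """Drop digits, slashes, periods, hyphens and commas, then filter words."""
    text = re.sub(r"[\d/.\-]+", "", ingredient).replace(",", "")
    words = text.lower().split()
    return " ".join(word for word in words if word not in DROP)


print(strip_modifiers("1/2 cup chopped fresh basil"))
print(strip_modifiers("sun-dried tomato"))
print(strip_modifiers("low-fat sour cream"))
```

Because hyphens are removed before filtering, hyphenated modifiers reach the filter in joined form (`"sun-dried"` becomes `"sundried"`), so the word lists need to contain the joined spelling.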

In [20]:
import re

def remove_descriptive_modifiers(data, output_json="json_files/train_no_modifiers.json", log_file="txt_files/modifier_changes.txt"):

    changes = []  

    measurement_units = [
        "cup", "g", "grams", "kg", "lb", "liter", "milliliter", "ml", 
        "ounce", "oz", "pound", "quart", "tablespoon", "tbsp", "teaspoon", "tsp"]
    # Hyphens are stripped by the regex in clean_ingredient before tokenizing, so
    # modifiers are listed in joined form (e.g. "sun-dried" arrives as "sundried").
    descriptive_modifiers = [
        "boneless", "chopped", "cooked", "crushed", "diced", "dried", "fatfree", 
        "freezedried", "fresh", "freshly", "frozen", "ground", "groundnut", 
        "minced", "skinless", "sliced", "stoneground", "sundried", "whole",
        "allpurpose", "broilerfryer", "longgrain", "uncook", "medium", "peeled", "reducedfat",
        "shellon", "lowfat", "hardboiled", "leav", "selfraising", "%"]
    stop_words = ["and", "for", "from", "in", "or", "piece", "to", "with"]
    
    def clean_ingredient(ingredient):
        """Removes descriptive words, measurements, and modifiers from a single ingredient."""
        if not isinstance(ingredient, str):
            return ingredient
        
        cleaned = re.sub(r"[\d\/\.\-]+", "", ingredient)  # Remove digits, slashes, periods, and hyphens
        cleaned = cleaned.replace(",", "")  # Remove commas

        words = word_tokenize(cleaned.lower())

        filtered_words = [word for word in words if word not in measurement_units 
                          and word not in descriptive_modifiers and word not in stop_words]

        cleaned_ingredient = " ".join(filtered_words)

        if ingredient != cleaned_ingredient:
            changes.append(f'"{ingredient}" > "{cleaned_ingredient}"')

        return cleaned_ingredient

    data_cleaned = data.copy()

    data_cleaned["ingredients"] = data_cleaned["ingredients"].apply(
        lambda ingredient_list: [clean_ingredient(ingredient) for ingredient in ingredient_list]
        if isinstance(ingredient_list, list) else ingredient_list
    )

    data_cleaned.to_json(output_json, orient="records", indent=4)

    if changes:
        with open(log_file, "w") as log:
            log.write("\n".join(changes))
        print(f"Modifier cleanup complete. Cleaned dataset saved to {output_json}. Changes logged in {log_file}.")
    else:
        print("No modifier changes applied.")

In [21]:
data_singular_changed = pd.read_json("json_files/train_singular_cleaned.json")
remove_descriptive_modifiers(data_singular_changed)

Modifier cleanup complete. Cleaned dataset saved to json_files/train_no_modifiers.json. Changes logged in txt_files/modifier_changes.txt.


### Spacing Cleaner

The **`spacing_cleaner`** function normalizes ingredient names by **removing extra spaces** while ensuring words remain properly separated. This helps standardize ingredient formatting across the dataset.

---

### **Function Purpose**
- **Removes unnecessary spaces** in ingredient names (e.g., `"  green  onions  "` → `"green onions"`).
- **Ensures words remain correctly spaced** without merging unintended terms.
- **Logs all spacing corrections** for tracking changes.
- **Saves the cleaned dataset** for further processing.
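
The normalization itself is a one-liner: `str.split()` with no arguments splits on any run of whitespace and discards leading and trailing whitespace, and `" ".join(...)` puts single spaces back. A small sketch, where `normalize_spacing` is a hypothetical name used only for illustration:

```python
def normalize_spacing(ingredient):
    """Collapse runs of whitespace to single spaces and trim both ends."""
    return " ".join(ingredient.split())


samples = ["  green  onions  ", "olive\toil", "salt"]
for raw in samples:
    print(repr(raw), "->", repr(normalize_spacing(raw)))
```

Names that are already normalized come back unchanged, so they would not produce a log entry.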

In [22]:
data_no_modifiers = pd.read_json("json_files/train_no_modifiers.json")
data_spacing_changed = pd.read_json("json_files/train_no_modifiers.json")

In [23]:
def spacing_cleaner(data, output_json="json_files/train_final.json", log_file="txt_files/spacing_changes.txt"):
    changes = []
    def clean_spacing(ingredient):
        if not isinstance(ingredient, str): 
            return ingredient
        
        cleaned = " ".join(ingredient.split()) 

        if ingredient != cleaned: 
            changes.append(f'"{ingredient}" > "{cleaned}"')

        return cleaned

    # Work on a copy of the argument instead of module-level DataFrames.
    data_spacing_changed = data.copy()
    data_spacing_changed["ingredients"] = data_spacing_changed["ingredients"].apply(
        lambda ingredient_list: [clean_spacing(ingredient) for ingredient in ingredient_list]
        if isinstance(ingredient_list, list) else ingredient_list  # Leave non-list values untouched
    )

    data_spacing_changed.to_json(output_json, orient="records", indent=4)

    if changes:
        with open(log_file, "w") as log:
            log.write("\n".join(changes))
        print(f"Spacing cleaned. Cleaned dataset saved to {output_json}. Changes logged in {log_file}.")
    else:
        print("No spacing changes applied.")

In [24]:
spacing_cleaner(data_no_modifiers)

No spacing changes applied.
