# Meal Classification: From Random Forests to Transformers

In this notebook, we'll build a classifier that predicts which cuisine a meal comes from based on its ingredients.

**Learning Journey:**
1. Generate synthetic meal data
2. Build a Random Forest classifier (traditional ML)
3. Build a Transformer classifier (modern deep learning)
4. Compare the approaches

**The Question:** Given a list of ingredients, can we predict the cuisine?

In [None]:
import random
import numpy as np
from collections import Counter
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Seed Python's RNG so the synthetic meals are the same on every run
random.seed(42)

## Part 1: The Data - Fridge Ingredients by Cuisine

In [None]:
indian_fridge_ingredients = ['Chicken','Lentils (Dal)','Yogurt','Cumin','Tomatoes','Garlic','Onions','Ginger','Cilantro (Coriander leaves)','Green chilies','Turmeric','Mustard seeds','Garam masala','Curry leaves','Paneer (Indian cottage cheese)','Ghee (Clarified butter)','Fresh mint','Fenugreek leaves (Kasuri methi)','Spinach','Eggplant (Brinjal)','Okra (Bhindi)','Potatoes','Cauliflower','Green peas','Bell peppers','Carrots','Fresh coconut','Tamarind paste','Rice','Whole wheat flour (Atta)','Chickpeas (Chana)','Black beans (Rajma)','Butter','Milk','Eggs','Mangoes','Lemons','Lime','Jaggery','Cardamom','Cloves','Cinnamon sticks','Bay leaves','Fennel seeds','Red chili powder','Coriander powder','Black mustard oil','Asafoetida (Hing)','Pickles (Achar)','Pomegranate']
american_fridge_ingredients = ['Milk','Eggs','Butter','Cheddar cheese','Mozzarella cheese','Yogurt','Chicken breast','Ground beef','Bacon','Ham','Sausage','Turkey','Lettuce','Spinach','Carrots','Broccoli','Bell peppers','Tomatoes','Cucumbers','Zucchini','Mushrooms','Onions','Garlic','Potatoes','Sweet potatoes','Corn','Peas','Green beans','Apples','Bananas','Grapes','Oranges','Strawberries','Blueberries','Bread','Tortillas','Pasta','Rice','Ketchup','Mustard','Mayonnaise','Ranch dressing','Barbecue sauce','Soy sauce','Hot sauce','Butter','Cream cheese','Orange juice','Apple juice','Jam']
french_fridge_ingredients = ['Butter','Cream','Milk','Eggs','Cheese (Camembert)','Cheese (Brie)','Cheese (Roquefort)','Cheese (Gruyère)','Cheese (Comté)','Cheese (Goat cheese)','Yogurt','Chicken','Duck','Ham','Sausage','Pâté','Smoked salmon','Fresh herbs (Thyme)','Fresh herbs (Rosemary)','Fresh herbs (Tarragon)','Fresh herbs (Parsley)','Garlic','Onions','Shallots','Leeks','Carrots','Celery','Potatoes','Tomatoes','Zucchini','Eggplant','Bell peppers','Green beans','Lettuce','Spinach','Mushrooms','Baguette','Croissants','Wine (Red)','Wine (White)','Champagne','Olive oil','Balsamic vinegar','Dijon mustard','Crème fraîche','Anchovies','Capers','Cornichons','Puff pastry','Pears','Apples']
korean_fridge_ingredients = ['Kimchi','Soy sauce','Gochujang (Korean red chili paste)','Doenjang (Fermented soybean paste)','Gochugaru (Korean red chili flakes)','Sesame oil','Sesame seeds','Garlic','Ginger','Green onions (Scallions)','Onions','Korean radish (Mu)','Napa cabbage','Spinach','Carrots','Zucchini','Cucumber','Bean sprouts','Bell peppers','Potatoes','Sweet potatoes','Tofu','Fish sauce','Oyster sauce','Rice vinegar','Mirin','Rice cakes (Tteok)','Rice','Rice noodles','Glass noodles (Dangmyeon)','Seaweed (Gim/Nori)','Dried anchovies','Beef','Pork','Chicken','Eggs','Milk','Cheese','Butter','Mushrooms (Enoki)','Mushrooms (Shiitake)','Korean pears','Apples','Asian pears','Persimmons','Chili peppers','Perilla leaves','Ssamjang (Korean dipping sauce)','Kimchi base (Mak kimchi)']
mexican_fridge_ingredients = ['Chicken','Beef','Pork','Chorizo','Fish','Shrimp','Eggs','Milk','Cheese (Queso fresco)','Cheese (Queso Oaxaca)','Cheese (Cotija)','Butter','Crema (Mexican sour cream)','Limes','Cilantro','Jalapeños','Serrano peppers','Poblano peppers','Habanero peppers','Tomatoes','Tomatillos','Avocados','Onions','Garlic','Bell peppers','Cucumbers','Carrots','Radishes','Zucchini','Corn','Black beans','Pinto beans','Lettuce','Cabbage','Spinach','Chayote','Nopales (Cactus)','Tortillas','Tortilla chips','Salsa','Guacamole','Hot sauce','Adobo sauce','Chipotle peppers in adobo','Pickled jalapeños','Mole sauce','Mexican chocolate','Tequila','Beer']

fridges = {
    'Indian': indian_fridge_ingredients,
    'American': american_fridge_ingredients,
    'French': french_fridge_ingredients,
    'Korean': korean_fridge_ingredients,
    'Mexican': mexican_fridge_ingredients
}

print("Cuisines available:", list(fridges.keys()))
print("\nIngredient counts:")
for cuisine, ingredients in fridges.items():
    print(f"  {cuisine}: {len(ingredients)} ingredients")

## Part 2: Generate Synthetic Meal Data

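We'll build each meal one ingredient at a time. Every ingredient is drawn independently:
- With 90% probability, from the meal's own cuisine
- With 10% probability, from a randomly chosen other cuisine (this adds noise)

Ingredients are sampled with replacement, so a meal can contain the same ingredient twice.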
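
To get a feel for how much noise the 90/10 rule injects, here is a small sketch of the arithmetic. It assumes each ingredient is drawn independently, as in the generator in the next cell; the helper names `prob_all_primary` and `expected_off_cuisine` are made up for this illustration.

```python
def prob_all_primary(num_ingredients, p_primary=0.9):
    """Probability that every draw comes from the primary cuisine."""
    return p_primary ** num_ingredients

def expected_off_cuisine(num_ingredients, p_primary=0.9):
    """Expected number of draws that come from another cuisine."""
    return num_ingredients * (1 - p_primary)

# Meal sizes in this notebook range from 4 to 8 ingredients
for k in (4, 6, 8):
    print(f"{k} ingredients: P(all primary) = {prob_all_primary(k):.3f}, "
          f"expected off-cuisine draws = {expected_off_cuisine(k):.1f}")
```

The share of ingredients that look "on-cuisine" is a little higher than these numbers suggest, because the cuisines share common items such as Garlic or Eggs: an off-cuisine draw can still return an ingredient that the primary cuisine also lists.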

In [None]:
def generate_meal(cuisine, num_ingredients=6):
    """Generate a meal primarily from one cuisine."""
    primary_ingredients = fridges[cuisine]
    
    # 90% chance to pick from primary cuisine, 10% from others
    meal_ingredients = []
    for _ in range(num_ingredients):
        if random.random() < 0.9:
            # Pick from primary cuisine
            meal_ingredients.append(random.choice(primary_ingredients))
        else:
            # Pick from a random other cuisine (adds noise/realism)
            other_cuisine = random.choice([c for c in fridges.keys() if c != cuisine])
            meal_ingredients.append(random.choice(fridges[other_cuisine]))
    
    return meal_ingredients

# Test it
print("Sample Indian meal:", generate_meal('Indian', 5))
print("Sample Korean meal:", generate_meal('Korean', 5))
print("Sample Mexican meal:", generate_meal('Mexican', 5))

In [None]:
def create_meal_dataset(meals_per_cuisine=500):
    """Create a dataset of meals with labels."""
    meals = []
    labels = []
    
    for cuisine in fridges.keys():
        for _ in range(meals_per_cuisine):
            # Vary meal size between 4-8 ingredients
            num_ingredients = random.randint(4, 8)
            meal = generate_meal(cuisine, num_ingredients)
            meals.append(meal)
            labels.append(cuisine)
    
    return meals, labels

# Generate dataset
print("Generating meals...")
meals, labels = create_meal_dataset(meals_per_cuisine=500)
print(f"✓ Created {len(meals)} meals")
print(f"✓ Cuisines: {Counter(labels)}")

# Show some examples
print("\nFirst 3 meals:")
for i in range(3):
    print(f"{labels[i]:10} → {meals[i]}")

## Part 3: Feature Engineering for Random Forest

Random Forests need numerical features. We'll create:
- **Bag of Words:** Binary vector indicating which ingredients are present
- Each ingredient becomes a feature (0 or 1)
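
As a quick illustration of the encoding, here is a minimal sketch on a toy vocabulary. The names `toy_vocab`, `toy_index` and `encode_meal` are hypothetical and exist only in this example; the notebook's real vocabulary is built in the next cell.

```python
# Toy vocabulary (hypothetical, much smaller than the notebook's real one)
toy_vocab = sorted(["Butter", "Garlic", "Kimchi", "Rice", "Tortillas"])
toy_index = {ingredient: i for i, ingredient in enumerate(toy_vocab)}

def encode_meal(meal):
    """Return a 0/1 list with a 1 at the position of every ingredient present."""
    vector = [0] * len(toy_vocab)
    for ingredient in meal:
        if ingredient in toy_index:  # ingredients outside the vocabulary are skipped
            vector[toy_index[ingredient]] = 1
    return vector

print(toy_vocab)
print(encode_meal(["Rice", "Kimchi", "Rice", "Garlic"]))
```

Two things are lost in this representation: the order of the ingredients and how often each one appears ("Rice" is listed twice above but still produces a single 1).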

In [None]:
# Build vocabulary of all possible ingredients
all_ingredients = set()
for ingredients in fridges.values():
    all_ingredients.update(ingredients)

vocab = sorted(list(all_ingredients))
ingredient_to_idx = {ing: idx for idx, ing in enumerate(vocab)}

print(f"Total unique ingredients: {len(vocab)}")
print(f"Sample ingredients: {vocab[:5]}")

In [None]:
def meal_to_features(meal):
    """Convert a meal (list of ingredients) to a binary feature vector."""
    features = np.zeros(len(vocab))
    for ingredient in meal:
        if ingredient in ingredient_to_idx:
            features[ingredient_to_idx[ingredient]] = 1
    return features

# Convert all meals to feature vectors
X = np.array([meal_to_features(meal) for meal in meals])
y = np.array(labels)

print(f"Feature matrix shape: {X.shape}")
print(f"Labels shape: {y.shape}")
print(f"\nExample feature vector (first meal):")
print(f"Non-zero features: {int(X[0].sum())} out of {len(vocab)}")

## Part 4: Train Random Forest Classifier

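Random Forests work by:
1. Building many decision trees, each trained on a bootstrap sample of the training data
2. Letting each tree make its own prediction
3. Combining the predictions (scikit-learn averages the trees' class probabilities, which usually agrees with a majority vote)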
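
The combining step can be sketched in plain Python. The per-tree probabilities below are made up for illustration, and `hard_vote` and `soft_vote` are hypothetical helper names. The sketch shows both a majority vote and the probability averaging that scikit-learn's `RandomForestClassifier` uses, since the two can disagree.

```python
from collections import Counter

# Hypothetical class probabilities for one meal, from three trees.
# Real forests have many more trees; these numbers are invented.
classes = ["French", "Korean", "Mexican"]
tree_probas = [
    [0.55, 0.45, 0.00],  # tree 1 leans French
    [0.60, 0.40, 0.00],  # tree 2 leans French
    [0.05, 0.95, 0.00],  # tree 3 is very sure about Korean
]

def argmax_class(probas):
    """Name of the class with the highest probability."""
    return classes[max(range(len(classes)), key=lambda i: probas[i])]

def hard_vote(all_probas):
    """Each tree casts one vote for its top class; the most common vote wins."""
    votes = Counter(argmax_class(p) for p in all_probas)
    return votes.most_common(1)[0][0]

def soft_vote(all_probas):
    """Average the probabilities across trees, then pick the top class."""
    n = len(all_probas)
    mean = [sum(p[i] for p in all_probas) / n for i in range(len(classes))]
    return argmax_class(mean), mean

print("Majority vote:        ", hard_vote(tree_probas))
print("Averaged probabilities:", soft_vote(tree_probas))
```

In this toy case two lukewarm trees outvote one confident tree under a majority vote, while averaging lets the confident tree tip the result.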

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

In [None]:
# Train Random Forest
print("Training Random Forest...")
rf_model = RandomForestClassifier(
    n_estimators=100,  # Number of trees
    max_depth=20,
    random_state=42
)
rf_model.fit(X_train, y_train)
print("✓ Training complete")

# Evaluate
train_pred = rf_model.predict(X_train)
test_pred = rf_model.predict(X_test)

train_acc = accuracy_score(y_train, train_pred)
test_acc = accuracy_score(y_test, test_pred)

print(f"\nRandom Forest Results:")
print(f"  Train Accuracy: {train_acc:.3f}")
print(f"  Test Accuracy:  {test_acc:.3f}")

In [None]:
# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, test_pred))

In [None]:
# Confusion matrix
cm = confusion_matrix(y_test, test_pred, labels=list(fridges.keys()))
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=fridges.keys(),
            yticklabels=fridges.keys())
plt.title('Random Forest - Confusion Matrix')
plt.ylabel('True Cuisine')
plt.xlabel('Predicted Cuisine')
plt.tight_layout()
plt.show()

In [None]:
# Feature importance - which ingredients are most predictive?
importance_data = list(zip(vocab, rf_model.feature_importances_))
importance_data.sort(key=lambda x: x[1], reverse=True)

print("\nTop 20 Most Important Ingredients for Classification:")
print(f"{'Ingredient':<40} {'Importance':>10}")
print("-" * 52)
for ingredient, importance in importance_data[:20]:
    print(f"{ingredient:<40} {importance:>10.6f}")

## Part 5: Build a Transformer Classifier

Now let's use a Transformer! 

**Why Transformers?**
- They can learn relationships between ingredients (self-attention)
- Don't need manual feature engineering
- Work directly with sequences
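
To make "self-attention" a bit more concrete, here is a minimal pure-Python sketch of scaled dot-product attention. It is a simplification under stated assumptions: the 2-d embeddings are invented, and the learned query/key/value projections and multiple heads of a real Transformer layer are left out.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections."""
    d = len(vectors[0])
    outputs, weights = [], []
    for q in vectors:
        # Similarity of this ingredient to every ingredient in the meal
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in vectors]
        w = softmax(scores)
        weights.append(w)
        # New representation: weighted average of all ingredient vectors
        outputs.append([sum(w[j] * vectors[j][i] for j in range(len(vectors)))
                        for i in range(d)])
    return outputs, weights

# Made-up 2-d embeddings for three ingredients
embeddings = {
    "Kimchi":    [1.0, 0.0],
    "Gochujang": [0.9, 0.1],
    "Baguette":  [0.0, 1.0],
}
names = list(embeddings)
outputs, weights = self_attention([embeddings[n] for n in names])
for name, row in zip(names, weights):
    print(f"{name:10}", [round(w, 2) for w in row])
```

Each printed row shows how strongly one ingredient attends to the others. With these toy vectors, "Kimchi" puts more weight on "Gochujang" than on "Baguette" because their embeddings point in similar directions.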

**Architecture:**
1. Embedding layer (ingredient → vector)
2. Transformer encoder layer
3. Classification head
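
Between steps 2 and 3, meals of different lengths have to be padded to a common length and then pooled into a single vector while ignoring the padding. The sketch below shows that idea in plain Python. The helper names `pad_batch` and `masked_mean` and the encoder outputs are made up for illustration; the notebook does the same thing with PyTorch tensors.

```python
PAD = 0  # ID reserved for padding, matching the '<PAD>' entry used later in this notebook

def pad_batch(sequences, pad_id=PAD):
    """Right-pad every ID sequence to the length of the longest one."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_id] * (max_len - len(s)) for s in sequences]

def masked_mean(token_vectors, token_ids, pad_id=PAD):
    """Average the vectors of real tokens only, ignoring padded positions."""
    kept = [vec for vec, tid in zip(token_vectors, token_ids) if tid != pad_id]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Two meals as ingredient IDs: one with three ingredients, one with a single ingredient
batch = pad_batch([[5, 9, 2], [7]])
print(batch)

# Made-up encoder outputs for the second meal; the padded slots hold arbitrary values
encoded = [[2.0, 4.0], [100.0, 100.0], [-50.0, 3.0]]
pooled = masked_mean(encoded, batch[1])
print(pooled)
```

The key point is that the mask is derived from the ingredient IDs, not from the embedded vectors: once positional encodings are added, a padded position no longer looks like a row of zeros.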

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

In [None]:
# Create ingredient and label vocabularies
ingredient_to_id = {ing: idx+1 for idx, ing in enumerate(vocab)}  # Start from 1
ingredient_to_id['<PAD>'] = 0  # Padding token

cuisine_to_id = {cuisine: idx for idx, cuisine in enumerate(fridges.keys())}
id_to_cuisine = {idx: cuisine for cuisine, idx in cuisine_to_id.items()}

print(f"Ingredient vocab size: {len(ingredient_to_id)}")
print(f"Number of cuisines: {len(cuisine_to_id)}")
print(f"Cuisine mapping: {cuisine_to_id}")

In [None]:
class MealDataset(Dataset):
    """Dataset for meals."""
    def __init__(self, meals, labels):
        self.meals = meals
        self.labels = labels
    
    def __len__(self):
        return len(self.meals)
    
    def __getitem__(self, idx):
        meal = self.meals[idx]
        label = self.labels[idx]
        
        # Convert ingredients to IDs
        ingredient_ids = [ingredient_to_id.get(ing, 0) for ing in meal]
        
        return (
            torch.tensor(ingredient_ids, dtype=torch.long),
            torch.tensor(cuisine_to_id[label], dtype=torch.long)
        )

def collate_fn(batch):
    """Pad sequences to same length in batch."""
    meals, labels = zip(*batch)
    meals_padded = pad_sequence(meals, batch_first=True, padding_value=0)
    labels = torch.stack(labels)
    return meals_padded, labels

In [None]:
# Split meals into train/test. The same test_size, random_state and stratification
# as the Random Forest split are used, so both models see the same train/test meals.

meals_train, meals_test, labels_train, labels_test = train_test_split(
    meals, labels, test_size=0.2, random_state=42, stratify=labels
)

# Create datasets
train_dataset = MealDataset(meals_train, labels_train)
test_dataset = MealDataset(meals_test, labels_test)

# Create dataloaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False, collate_fn=collate_fn)

print(f"Train batches: {len(train_loader)}")
print(f"Test batches: {len(test_loader)}")

In [None]:
class MealTransformer(nn.Module):
    """Transformer model for meal classification."""
    def __init__(self, vocab_size, num_classes, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        
        # Embedding layer: ingredient ID -> dense vector
        self.embedding = nn.Embedding(vocab_size, d_model, padding_idx=0)
        
        # Positional encoding (simple learned version)
        self.pos_encoding = nn.Embedding(20, d_model)  # Max 20 ingredients
        
        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=512,
            dropout=0.1,
            batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        
        # Classification head
        self.classifier = nn.Sequential(
            nn.Linear(d_model, 128),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(128, num_classes)
        )
    
    def forward(self, x):
        # x shape: (batch_size, seq_len), holding ingredient IDs (0 = <PAD>)
        batch_size, seq_len = x.shape

        # Create padding mask from the IDs (True for padding tokens).
        # This has to happen before embedding: once positional encodings are
        # added, padded positions no longer sum to zero.
        padding_mask = (x == 0)

        # Create position indices
        positions = torch.arange(seq_len, device=x.device).unsqueeze(0).expand(batch_size, -1)

        # Embed ingredients and add positional encoding
        x = self.embedding(x) + self.pos_encoding(positions)

        # Apply transformer
        x = self.transformer(x, src_key_padding_mask=padding_mask)

        # Take mean of non-padded tokens (clamp avoids division by zero)
        mask = (~padding_mask).unsqueeze(-1).float()
        x = (x * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        
        # Classify
        return self.classifier(x)

# Create model
model = MealTransformer(
    vocab_size=len(ingredient_to_id),
    num_classes=len(cuisine_to_id),
    d_model=128,
    nhead=4,
    num_layers=2
).to(device)

print(f"Model created with {sum(p.numel() for p in model.parameters())} parameters")

In [None]:
# Training setup
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

def train_epoch(model, loader, criterion, optimizer, device):
    """Train for one epoch."""
    model.train()
    total_loss = 0
    correct = 0
    total = 0
    
    for meals, labels in loader:
        meals, labels = meals.to(device), labels.to(device)
        
        # Forward pass
        outputs = model(meals)
        loss = criterion(outputs, labels)
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Track metrics
        total_loss += loss.item()
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()
    
    return total_loss / len(loader), correct / total

def evaluate(model, loader, criterion, device):
    """Evaluate model."""
    model.eval()
    total_loss = 0
    correct = 0
    total = 0
    all_preds = []
    all_labels = []
    
    with torch.no_grad():
        for meals, labels in loader:
            meals, labels = meals.to(device), labels.to(device)
            
            outputs = model(meals)
            loss = criterion(outputs, labels)
            
            total_loss += loss.item()
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()
            
            all_preds.extend(predicted.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
    
    return total_loss / len(loader), correct / total, all_preds, all_labels

In [None]:
# Train the model
print("Training Transformer...\n")
num_epochs = 20
best_acc = 0

for epoch in range(num_epochs):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    test_loss, test_acc, _, _ = evaluate(model, test_loader, criterion, device)
    
    if test_acc > best_acc:
        best_acc = test_acc
    
    if (epoch + 1) % 5 == 0:
        print(f"Epoch {epoch+1:2d}: Train Loss={train_loss:.3f} Acc={train_acc:.3f} | "
              f"Test Loss={test_loss:.3f} Acc={test_acc:.3f}")

print(f"\n✓ Training complete! Best test accuracy: {best_acc:.3f}")

In [None]:
# Final evaluation
_, test_acc, test_preds, test_labels = evaluate(model, test_loader, criterion, device)

# Convert IDs back to cuisine names
pred_cuisines = [id_to_cuisine[p] for p in test_preds]
true_cuisines = [id_to_cuisine[l] for l in test_labels]

print("\nTransformer Classification Report:")
print(classification_report(true_cuisines, pred_cuisines))

In [None]:
# Confusion matrix for Transformer
cm_transformer = confusion_matrix(true_cuisines, pred_cuisines, labels=list(fridges.keys()))
plt.figure(figsize=(8, 6))
sns.heatmap(cm_transformer, annot=True, fmt='d', cmap='Greens',
            xticklabels=fridges.keys(),
            yticklabels=fridges.keys())
plt.title('Transformer - Confusion Matrix')
plt.ylabel('True Cuisine')
plt.xlabel('Predicted Cuisine')
plt.tight_layout()
plt.show()

## Part 6: Compare Random Forest vs Transformer

In [None]:
print("\n" + "="*60)
print("MODEL COMPARISON")
print("="*60)
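# test_acc was overwritten by the Transformer evaluation above, so recompute
# the Random Forest accuracy from its own test predictions
rf_test_acc = accuracy_score(y_test, test_pred)
print(f"Random Forest Test Accuracy:  {rf_test_acc:.3f}")
print(f"Transformer Test Accuracy:    {test_acc:.3f}  (final epoch; best epoch: {best_acc:.3f})")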
print("="*60)

print("\nKey Differences:")
print("\n1. Random Forest:")
print("   - Uses a fixed bag-of-words vector (one 0/1 flag per ingredient, order ignored)")
print("   - Fast to train")
print("   - Provides feature importance")
print("   - Good for tabular data")

print("\n2. Transformer:")
print("   - Learns relationships between ingredients (attention)")
print("   - Can understand ingredient combinations")
print("   - More flexible for sequence data")
print("   - Requires more data to shine")

## Part 7: Test on New Meals

Let's create some test meals and see what both models predict!
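
Both models only know the exact ingredient strings from the vocabulary. Anything else is silently ignored by the bag-of-words encoder and mapped to the padding ID for the Transformer. A small check like the sketch below can catch typos before predicting; `split_known_unknown` and the toy vocabulary are hypothetical names used only for this example.

```python
def split_known_unknown(meal, known_ingredients):
    """Separate a meal into ingredients the models know and ones they don't."""
    known = [ing for ing in meal if ing in known_ingredients]
    unknown = [ing for ing in meal if ing not in known_ingredients]
    return known, unknown

# Toy vocabulary for illustration; with the notebook's data you would pass its full vocabulary
toy_vocab = {"Kimchi", "Rice", "Cheese (Brie)"}

known, unknown = split_known_unknown(["Kimchi", "rice", "Brie"], toy_vocab)
print("Known:  ", known)
print("Unknown:", unknown)
```

Matching is case-sensitive and exact, so "rice" and "Brie" are both treated as unknown here even though "Rice" and "Cheese (Brie)" are in the vocabulary.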

In [None]:
def predict_with_both_models(meal_ingredients):
    """Predict cuisine using both models."""
    print(f"\nMeal: {meal_ingredients}")
    print("-" * 60)
    
    # Random Forest prediction
    rf_features = meal_to_features(meal_ingredients)
    rf_pred = rf_model.predict([rf_features])[0]
    rf_proba = rf_model.predict_proba([rf_features])[0]
    
    print("Random Forest Prediction:")
    print(f"  Predicted: {rf_pred}")
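    # predict_proba columns follow rf_model.classes_ (alphabetical order),
    # not the insertion order of the fridges dict
    for cuisine, proba in zip(rf_model.classes_, rf_proba):
        print(f"    {cuisine:10} {proba:.3f}")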
    
    # Transformer prediction
    meal_ids = [ingredient_to_id.get(ing, 0) for ing in meal_ingredients]
    meal_tensor = torch.tensor([meal_ids], dtype=torch.long).to(device)
    
    model.eval()
    with torch.no_grad():
        outputs = model(meal_tensor)
        probs = F.softmax(outputs, dim=1)[0]
        pred_idx = outputs.argmax(1).item()
    
    transformer_pred = id_to_cuisine[pred_idx]
    
    print("\nTransformer Prediction:")
    print(f"  Predicted: {transformer_pred}")
    for i, cuisine in enumerate(fridges.keys()):
        print(f"    {cuisine:10} {probs[i].item():.3f}")

# Test some meals
test_meals = [
    ['Kimchi', 'Rice', 'Gochujang (Korean red chili paste)', 'Garlic', 'Sesame oil'],
    ['Tortillas', 'Avocados', 'Limes', 'Cilantro', 'Tomatoes'],
    ['Butter', 'Cheese (Brie)', 'Wine (Red)', 'Garlic', 'Fresh herbs (Thyme)'],
    ['Cumin', 'Turmeric', 'Garam masala', 'Yogurt', 'Chicken'],
    ['Bacon', 'Eggs', 'Cheddar cheese', 'Bread', 'Butter']
]

for meal in test_meals:
    predict_with_both_models(meal)

## Summary

We built two classifiers for cuisine prediction:

**Random Forest:**
- Traditional ML approach
- Uses hand-crafted features (bag-of-words)
- Fast and interpretable
- Works well for this task

**Transformer:**
- Modern deep learning approach
- Learns features automatically
- Uses attention to find ingredient relationships
- More flexible and scalable

Both approaches are reasonable choices for this problem. The choice depends on:
- Data size (Transformers generally need more data)
- Interpretability needs (Random Forest feature importances are easier to read)
- Complexity of relationships (Transformers can model interactions between ingredients through attention)
- Computational resources (Random Forest is faster to train)