# Simple Cuisine Classification ML Training Pipeline

A straightforward ML training pipeline for cuisine classification using ResNet-50.

## Pipeline Overview
1. **Data Loading**: Load processed image bytes and cuisine labels from the gold layer table
2. **Simple Preprocessing**: Decode the bytes into PIL images, then convert them to normalized tensors
3. **Model Training**: Fine-tune ResNet-50 with the standard Transformers `Trainer`
4. **MLflow Integration**: Log the run and register the model in Unity Catalog

*Based on the patterns of a reference notebook, kept deliberately simple.*

In [0]:
# Simple installation - only what we need
%pip install torch torchvision transformers datasets mlflow scikit-learn


In [0]:
dbutils.library.restartPython()

In [0]:
# Simple imports - clean and minimal
import mlflow
import torch
import pandas as pd
import numpy as np
from transformers import AutoImageProcessor, AutoModelForImageClassification, TrainingArguments, Trainer
from PIL import Image
import io
from torchvision.transforms import Compose, Normalize, ToTensor, Lambda
from datasets import Dataset
from sklearn.metrics import accuracy_score, f1_score
# from sklearn.preprocessing import LabelEncoder

print("✅ Simple imports loaded successfully")


✅ Simple imports loaded successfully


In [0]:
# Simple configuration - no complex widgets
CATALOG = "cuisine_vision_catalog"
MODEL_CHECKPOINT = "microsoft/resnet-50"
EXPERIMENT_NAME = "/cuisine_classifier"
NUM_EPOCHS = 3
BATCH_SIZE = 8
LEARNING_RATE = 5e-5

print(f"🔧 Configuration:")
print(f"   📊 Catalog: {CATALOG}")
print(f"   🧠 Model: {MODEL_CHECKPOINT}")
print(f"   🔄 Epochs: {NUM_EPOCHS}")
print(f"   📦 Batch Size: {BATCH_SIZE}")
print(f"   📈 Learning Rate: {LEARNING_RATE}")

🔧 Configuration:
   📊 Catalog: cuisine_vision_catalog
   🧠 Model: microsoft/resnet-50
   🔄 Epochs: 3
   📦 Batch Size: 8
   📈 Learning Rate: 5e-05


In [0]:
# Simple data loading - direct from gold table
print("📊 Loading data from gold layer...")

# Load data directly - no complex joins
dataset_df = (
    spark.table(f"{CATALOG}.gold.ml_dataset")
    .select("processed_image_data", "cuisine_category")
    .filter("processed_image_data IS NOT NULL")
    .toPandas()
)

print(f"✅ Loaded {len(dataset_df)} samples")
print(f"   🍽️ Cuisines: {sorted(dataset_df['cuisine_category'].unique())}")

# Create HuggingFace dataset - simple rename
dataset = Dataset.from_pandas(
    dataset_df.rename(columns={
        "processed_image_data": "image", 
        "cuisine_category": "label"
    })
)

# Simple train/test split
splits = dataset.train_test_split(test_size=0.2, seed=42)
train_ds = splits['train']
val_ds = splits['test']

print(f"✅ Data splits:")
print(f"   🏋️ Training: {len(train_ds)} samples")
print(f"   ✅ Validation: {len(val_ds)} samples")

📊 Loading data from gold layer...
✅ Loaded 1065 samples
   🍽️ Cuisines: ['american', 'chinese', 'french', 'international', 'italian', 'japanese', 'mediterranean', 'mexican']
✅ Data splits:
   🏋️ Training: 852 samples
   ✅ Validation: 213 samples
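
The split above is a plain random split, so the share of each cuisine in the validation set can drift from its share in the full table. As a minimal sketch, assuming a stratified split is wanted, the helper below splits indices class by class using only the standard library. The name `stratified_split` and the toy labels are hypothetical and not part of the notebook.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly `test_size` of each class held out."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Hold out at least one sample per class
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels standing in for dataset_df["cuisine_category"]
labels = ["italian"] * 10 + ["japanese"] * 5 + ["mexican"] * 5
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

The two index lists could then be passed to `Dataset.select` to build the training and validation sets.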


In [0]:
# Simple preprocessing - exactly like reference notebook
print("🔄 Setting up simple preprocessing...")

# Load image processor
image_processor = AutoImageProcessor.from_pretrained(MODEL_CHECKPOINT)

# Simple transform pipeline
transforms = Compose([
    Lambda(lambda b: Image.open(io.BytesIO(b)).convert("RGB")),
    ToTensor(),
    Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
])

def preprocess(batch):
    """Simple preprocessing function"""
    batch["image"] = [transforms(image) for image in batch["image"]]
    return batch

# Apply transforms
train_ds.set_transform(preprocess)
val_ds.set_transform(preprocess)

print("✅ Simple preprocessing setup complete")

🔄 Setting up simple preprocessing...



Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


✅ Simple preprocessing setup complete
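
To make the transform concrete: `ToTensor` scales 8-bit channel values to the range [0, 1], and `Normalize` then subtracts a per-channel mean and divides by a per-channel standard deviation. The sketch below reproduces that arithmetic for a single RGB pixel in plain Python. The mean and std values are the usual ImageNet statistics and are an assumption here; the notebook reads the real ones from `image_processor.image_mean` and `image_processor.image_std`. The helper name `normalize_pixel` is hypothetical.

```python
# Assumed ImageNet statistics; the notebook takes these from the image processor.
IMAGE_MEAN = (0.485, 0.456, 0.406)
IMAGE_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb, mean=IMAGE_MEAN, std=IMAGE_STD):
    """Apply the ToTensor scaling (divide by 255), then per-channel normalization."""
    scaled = [channel / 255.0 for channel in rgb]
    return [(value - m) / s for value, m, s in zip(scaled, mean, std)]

black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
print([round(v, 3) for v in black])
print([round(v, 3) for v in white])
```

Under these assumed statistics, every normalized value falls roughly between -2.1 and 2.6, which is the input range the pretrained backbone was trained on.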


In [0]:
# Simple model setup - no complex wrappers
print("🧠 Setting up simple model...")

# Create simple label mappings
unique_labels = sorted(set(dataset['label']))
label2id = {label: i for i, label in enumerate(unique_labels)}
id2label = {i: label for label, i in label2id.items()}
num_labels = len(unique_labels)

print(f"✅ Labels: {id2label}")

# Load model - simple and direct
model = AutoModelForImageClassification.from_pretrained(
    MODEL_CHECKPOINT,
    label2id=label2id,
    id2label=id2label,
    num_labels=num_labels,
    ignore_mismatched_sizes=True
)

print(f"✅ Model loaded with {num_labels} classes")

🧠 Setting up simple model...
✅ Labels: {0: 'american', 1: 'chinese', 2: 'french', 3: 'international', 4: 'italian', 5: 'japanese', 6: 'mediterranean', 7: 'mexican'}



Some weights of ResNetForImageClassification were not initialized from the model checkpoint at microsoft/resnet-50 and are newly initialized because the shapes did not match:
- classifier.1.bias: found shape torch.Size([1000]) in the checkpoint and torch.Size([8]) in the model instantiated
- classifier.1.weight: found shape torch.Size([1000, 2048]) in the checkpoint and torch.Size([8, 2048]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Model loaded with 8 classes
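
The shape-mismatch warning above is expected: the checkpoint's 1000-class head is discarded and a new 8-class head is initialized at random, while the rest of the network keeps its pretrained weights. As a rough illustration of how small the new head is, the sketch below counts the parameters of both linear layers from the shapes printed in the warning. The helper name `linear_param_count` is hypothetical.

```python
def linear_param_count(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

HIDDEN_SIZE = 2048  # second dimension of classifier.1.weight in the warning

old_head = linear_param_count(HIDDEN_SIZE, 1000)  # ImageNet head from the checkpoint
new_head = linear_param_count(HIDDEN_SIZE, 8)     # cuisine head, randomly initialized
print(f"old head: {old_head:,} parameters")
print(f"new head: {new_head:,} parameters")
```

Only these newly initialized parameters start from scratch, which is why the model needs training before its predictions mean anything.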


In [0]:
# Simple training - no complex custom trainers
print("🏋️ Starting simple training...")

# Setup MLflow
mlflow.set_experiment(EXPERIMENT_NAME)

with mlflow.start_run() as run:
    print(f"🔄 MLflow run: {run.info.run_id}")
    
    # Simple training arguments
    args = TrainingArguments(
        output_dir="/dbfs/tmp/cuisine-classifier-simple",
        remove_unused_columns=False,
        eval_strategy="epoch",  # named evaluation_strategy in older Transformers releases
        save_strategy="epoch",
        learning_rate=LEARNING_RATE,
        per_device_train_batch_size=BATCH_SIZE,
        per_device_eval_batch_size=BATCH_SIZE,
        num_train_epochs=NUM_EPOCHS,
        weight_decay=0.01,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        logging_steps=10,
        report_to=[]
    )
    
    # Simple data collator - like reference
    def collate_fn(examples):
        pixel_values = torch.stack([e["image"] for e in examples])
        labels = torch.tensor([label2id[e["label"]] for e in examples], dtype=torch.long)
        return {"pixel_values": pixel_values, "labels": labels}
    
    # Simple metrics
    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        predictions = predictions.argmax(axis=-1)
        accuracy = accuracy_score(labels, predictions)
        f1 = f1_score(labels, predictions, average='weighted')
        return {'accuracy': accuracy, 'f1': f1}
    
    # Simple trainer - standard Transformers
    trainer = Trainer(
        model=model, 
        args=args, 
        train_dataset=train_ds, 
        eval_dataset=val_ds, 
        processing_class=image_processor,  # replaces the deprecated tokenizer argument
        data_collator=collate_fn,
        compute_metrics=compute_metrics
    )
    
    # Train the model
    print("🚀 Training started...")
    trainer.train()
    print("✅ Training completed!")
    
    # Evaluate
    print("📊 Evaluating model...")
    eval_results = trainer.evaluate()
    print(f"✅ Final metrics: {eval_results}")
    
    # Log parameters
    mlflow.log_param("model_checkpoint", MODEL_CHECKPOINT)
    mlflow.log_param("num_epochs", NUM_EPOCHS)
    mlflow.log_param("batch_size", BATCH_SIZE)
    mlflow.log_param("learning_rate", LEARNING_RATE)
    mlflow.log_param("num_labels", num_labels)
    
    # Log metrics
    for key, value in eval_results.items():
        if isinstance(value, (int, float)):
            mlflow.log_metric(key, value)

🏋️ Starting simple training...
🔄 MLflow run: 8cd6515b2c234a74a3b20d33448bed25




🚀 Training started...




| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 2.0493 | 2.027872 | 0.248826 | 0.101408 |
| 2 | 1.9533 | 2.013881 | 0.253521 | 0.102548 |
| 3 | 1.9778 | 2.017588 | 0.253521 | 0.102548 |




✅ Training completed!
📊 Evaluating model...




✅ Final metrics: {'eval_loss': 2.013880968093872, 'eval_accuracy': 0.2535211267605634, 'eval_f1': 0.10254787149865484, 'eval_runtime': 10.1883, 'eval_samples_per_second': 20.906, 'eval_steps_per_second': 2.65, 'epoch': 3.0}
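
These metrics are worth comparing against two simple baselines. A classifier that spreads probability uniformly over 8 classes has a cross-entropy of ln 8, and a classifier that assigns every validation image to one class gets an accuracy equal to that class's share of the 213 validation samples. The sketch below computes both. The count of 54 validation images in the predicted class is an assumption (the notebook does not print per-class counts), chosen because it reproduces the reported numbers; the helper name `constant_predictor_metrics` is hypothetical.

```python
import math

def constant_predictor_metrics(n_total, n_in_class):
    """Accuracy and support-weighted F1 when every sample is assigned to one class."""
    accuracy = n_in_class / n_total
    precision = n_in_class / n_total  # every prediction is the same class
    recall = 1.0                      # all members of that class are found
    f1_predicted_class = 2 * precision * recall / (precision + recall)
    # Every other class has F1 = 0, so only the predicted class contributes.
    weighted_f1 = f1_predicted_class * (n_in_class / n_total)
    return accuracy, weighted_f1

uniform_loss = math.log(8)
accuracy, weighted_f1 = constant_predictor_metrics(n_total=213, n_in_class=54)
print(f"uniform-guess loss: {uniform_loss:.4f}")
print(f"constant-predictor accuracy: {accuracy:.6f}, weighted F1: {weighted_f1:.6f}")
```

Under that assumption, the values match the final `eval_accuracy` and `eval_f1`, and the validation loss of about 2.01 sits only slightly below ln 8 (about 2.08). This suggests the new head has learned little in three epochs and may be predicting a single class for every image. More epochs or a higher learning rate might be worth trying.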


In [0]:
# Simple model wrapper for MLflow - like reference
print("📦 Creating simple model wrapper...")

from transformers import pipeline

# Create pipeline from trained model
classifier = pipeline(
    "image-classification", 
    model=trainer.model, 
    image_processor=image_processor  # preferred over feature_extractor for image models
)

class SimpleCuisineClassifier(mlflow.pyfunc.PythonModel):
    """Simple wrapper for cuisine classification - like reference notebook"""
    
    def __init__(self, pipeline):
        self.pipeline = pipeline
        self.pipeline.model.eval()
    
    def predict(self, context, model_input):
        """Simple prediction method"""
        # Handle DataFrame input
        if isinstance(model_input, pd.DataFrame):
            # Convert bytes to PIL images
            images = model_input['processed_image_data'].apply(
                lambda b: Image.open(io.BytesIO(b)).convert("RGB")
            ).tolist()
            
            # Get predictions
            with torch.no_grad():
                predictions = self.pipeline(images)
            
            # Return top prediction for each image
            return pd.DataFrame([
                max(pred, key=lambda x: x['score']) 
                for pred in predictions
            ])
        
        # Handle single image bytes
        else:
            image = Image.open(io.BytesIO(model_input)).convert("RGB")
            with torch.no_grad():
                prediction = self.pipeline(image)
            return max(prediction, key=lambda x: x['score'])

# Create wrapped model
wrapped_model = SimpleCuisineClassifier(classifier)
print("✅ Simple model wrapper created")

📦 Creating simple model wrapper...


Device set to use cpu


✅ Simple model wrapper created
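
The wrapper depends on the shape of the pipeline's return value. For a list of images, the `image-classification` pipeline returns one list of `{label, score}` dictionaries per image; for a single image it returns just one such list. Below is a small sketch of the selection step, using hand-written dictionaries in place of real pipeline output. The scores and the helper name `top_prediction` are made up for illustration.

```python
def top_prediction(candidates):
    """Pick the highest-scoring {label, score} dict from one image's candidates."""
    return max(candidates, key=lambda c: c["score"])

# Hypothetical pipeline output for a batch of two images
batch_output = [
    [{"label": "american", "score": 0.17}, {"label": "italian", "score": 0.14}],
    [{"label": "french", "score": 0.12}, {"label": "japanese", "score": 0.21}],
]

top_per_image = [top_prediction(candidates) for candidates in batch_output]
print([p["label"] for p in top_per_image])
```

Because each result is a flat dictionary, a list of them converts directly into the two-column `label`/`score` DataFrame that the wrapper returns.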




In [0]:
# Simple MLflow logging and registration
print("📊 Logging model to MLflow...")

# Import signature utilities
from mlflow.models.signature import infer_signature

with mlflow.start_run(run_id=run.info.run_id):
    # Test model with sample data and create signature
    test_df = dataset_df[['processed_image_data']].head(3)
    test_predictions = wrapped_model.predict(None, test_df)
    print(f"✅ Test predictions: {test_predictions}")
    
    # Create model signature - required for Unity Catalog
    signature = infer_signature(test_df, test_predictions)
    print(f"✅ Model signature created: {signature}")
    
    # Log the model; Unity Catalog requires a signature
    model_info = mlflow.pyfunc.log_model(
        name="model",  # MLflow 3 parameter; older releases use artifact_path
        python_model=wrapped_model,
        signature=signature,
        pip_requirements=[
            "torch", 
            "transformers", 
            "pillow", 
            "pandas",
            "numpy"
        ]
    )
    
    print(f"✅ Model logged with signature: {model_info.model_uri}")

# Register to Unity Catalog - simple registration
full_model_name = f"{CATALOG}.ml_models.cuisine_classifier_simple"
registered_model = mlflow.register_model(
    model_uri=model_info.model_uri, 
    name=full_model_name,
    tags={
        "stage": "development",
        "task": "image_classification",
        "architecture": "ResNet-50",
        "approach": "simple"
    }
)

print(f"🎉 Model registered successfully!")
print(f"   📦 Model: {full_model_name}")
print(f"   🏷️ Version: {registered_model.version}")

📊 Logging model to MLflow...




✅ Test predictions:       label     score
0  american  0.172005
1  american  0.196499
2  american  0.173358
✅ Model signature created: inputs: 
  ['processed_image_data': binary (required)]
outputs: 
  ['label': string (required), 'score': double (required)]
params: 
  None



🔗 View Logged Model at: https://adb-2867553723712000.0.azuredatabricks.net/ml/experiments/2328462332528265/models/m-7054c1d213f7468cb7dfc192bbf3fe68?o=2867553723712000


✅ Model logged with signature: models:/m-7054c1d213f7468cb7dfc192bbf3fe68


Registered model 'cuisine_vision_catalog.ml_models.cuisine_classifier_simple' already exists. Creating a new version of this model...



🔗 Created version '1' of model 'cuisine_vision_catalog.ml_models.cuisine_classifier_simple': https://adb-2867553723712000.0.azuredatabricks.net/explore/data/models/cuisine_vision_catalog/ml_models/cuisine_classifier_simple/version/1?o=2867553723712000


🎉 Model registered successfully!
   📦 Model: cuisine_vision_catalog.ml_models.cuisine_classifier_simple
   🏷️ Version: 1
   🎯 Approach: Simple & Reliable
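
Once a version is registered, it can be addressed with a `models:/<name>/<version>` URI. The sketch below only builds that string from the three-level Unity Catalog name. The helper name `model_version_uri` is hypothetical, and actually loading the model (for example with `mlflow.pyfunc.load_model`) is left out because it needs workspace access.

```python
def model_version_uri(catalog, schema, model, version):
    """Build a models:/ URI from a three-level Unity Catalog name and a version."""
    full_name = f"{catalog}.{schema}.{model}"
    return f"models:/{full_name}/{version}"

uri = model_version_uri(
    "cuisine_vision_catalog", "ml_models", "cuisine_classifier_simple", 1
)
print(uri)
```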


In [0]:
# Simple testing - verify everything works
print("🧪 Final testing...")

# Spot-check a few samples (drawn from the full dataset, so some may have been seen during training)
test_samples = dataset_df.sample(n=3)
for idx, row in test_samples.iterrows():
    true_label = row['cuisine_category']
    image_bytes = row['processed_image_data']
    
    # Make prediction
    prediction = wrapped_model.predict(None, image_bytes)
    
    print(f"Sample {idx}:")
    print(f"   ✅ True: {true_label}")
    print(f"   🎯 Predicted: {prediction['label']} (score: {prediction['score']:.3f})")
    print()

print("🎉 Simple pipeline completed successfully!")
print("\n📋 Summary:")
print(f"   📊 Total samples: {len(dataset_df)}")
print(f"   🏷️ Classes: {num_labels}")
print(f"   🔄 Epochs: {NUM_EPOCHS}")
print(f"   📦 Model: {full_model_name} v{registered_model.version}")

🧪 Final testing...
Sample 31:
   ✅ True: american
   🎯 Predicted: american (score: 0.209)

Sample 832:
   ✅ True: japanese
   🎯 Predicted: american (score: 0.170)

Sample 413:
   ✅ True: french
   🎯 Predicted: american (score: 0.169)

🎉 Simple pipeline completed successfully!

📋 Summary:
   📊 Total samples: 1065
   🏷️ Classes: 8
   🔄 Epochs: 3
   📦 Model: cuisine_vision_catalog.ml_models.cuisine_classifier_simple v1
   💡 Approach: Clean, simple, and reliable!
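
Three random samples give only a rough impression of model quality. As a sketch of how the same spot check could be summarized, the snippet below takes the three (true, predicted) pairs printed above and reports the accuracy together with the distribution of predicted labels. The helper name `summarize` is hypothetical.

```python
from collections import Counter

def summarize(pairs):
    """Return overall accuracy and the distribution of predicted labels."""
    correct = sum(1 for true, pred in pairs if true == pred)
    predicted_counts = Counter(pred for _, pred in pairs)
    return correct / len(pairs), predicted_counts

# (true label, predicted label) pairs from the spot check above
pairs = [("american", "american"), ("japanese", "american"), ("french", "american")]
accuracy, predicted_counts = summarize(pairs)
print(f"accuracy: {accuracy:.2f}")
print(predicted_counts)
```

A predicted-label distribution with a single entry, as here, is a quick hint that the model may be collapsing to one class and that the validation metrics deserve a closer look.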
