# MLflow Hyperparameter Tuning Tutorial

This tutorial demonstrates how to use MLflow for experiment tracking and hyperparameter optimization with the MNIST dataset.

**MNIST Dataset**: A collection of 28×28 pixel grayscale images of handwritten digits (0-9), containing 70,000 samples total.

**Prerequisites**:
- MLflow server running (default: `http://127.0.0.1:5000`)
- Required libraries: `mlflow`, `hyperopt`, `scikit-learn`, `numpy`, `pandas`

Let's get started!

In [None]:
import os
import numpy as np
import mlflow
import pandas as pd

# Configure MLflow tracking
# os.environ['MLFLOW_TRACKING_URI'] = 'http://127.0.0.1:5000'
# mlflow.set_tracking_uri('http://127.0.0.1:5000/')

## Overview

This tutorial covers:

1. **Data Management**: Proper train/validation/test splits with reproducibility
2. **Baseline Model**: Training and logging a baseline logistic regression model
3. **Hyperparameter Tuning**: Using Hyperopt for automated hyperparameter optimization
4. **Best Practices**: 
   - Reproducible experiments with fixed random seeds
   - Proper data splits to avoid overfitting
   - Comprehensive MLflow logging (parameters, metrics, models, examples)
   - Test set evaluation for final model comparison

### Key Concepts

- **Reproducibility**: Fixed random seeds make data splits and model fits repeatable across runs in the same environment
- **Data Splitting Tradeoffs**: Larger validation/test sets provide more reliable estimates but reduce training data
- **Hyperparameter Optimization**: Systematic search for optimal model configurations
- **MLflow Tracking**: Comprehensive logging of experiments for comparison and reproducibility
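
To make the data-splitting tradeoff concrete, the sketch below uses the binomial standard error, sqrt(p(1-p)/n), as a rough measure of how noisy an accuracy estimate is at different evaluation-set sizes. The 90% accuracy figure and the helper name `accuracy_standard_error` are illustrative assumptions, not results or code from this tutorial.

```python
import math


def accuracy_standard_error(accuracy: float, n_samples: int) -> float:
    """Binomial standard error of an accuracy estimated from n_samples examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


assumed_accuracy = 0.90  # hypothetical value, for illustration only
for n in (100, 1_000, 10_000):
    se = accuracy_standard_error(assumed_accuracy, n)
    # 1.96 standard errors gives an approximate 95% interval
    print(f"n={n:>6}: accuracy ~ {assumed_accuracy:.2f} +/- {1.96 * se:.3f}")
```

With only 100 evaluation samples the interval is roughly ±0.06, so a difference of a point or two between models is within noise at that size.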


In [None]:
# Import required libraries
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
import mlflow
import mlflow.sklearn
from mlflow.models import infer_signature
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
import warnings

warnings.filterwarnings("ignore")  # To keep the output clean

# Set random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

## Step 1: Load and Preprocess Data

In this step, we:
- Load the MNIST dataset from OpenML
- Draw a 1,000-image subsample so the tutorial runs quickly
- Split the sample into train/validation/test sets (800/100/100) with stratification
- Standardize the features using StandardScaler
- Ensure reproducibility using fixed random seeds

**Key Points:**
- Using `random_state` ensures anyone running this code gets identical data splits
- Stratified splitting maintains class distribution across splits
- Test set is reserved for final evaluation only (not used during training or tuning)
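
The two properties above (stratification and seeded reproducibility) can be checked on a small synthetic example. This is a minimal sketch with made-up, deliberately imbalanced labels; it does not use MNIST.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, deliberately imbalanced labels: 80% class 0, 20% class 1
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=20, random_state=42, stratify=y
)
# A second call with the same seed reproduces the split exactly
_, X_te_again, _, _ = train_test_split(
    X, y, test_size=20, random_state=42, stratify=y
)

print("test class counts:", np.bincount(y_te))
print("same split on rerun:", np.array_equal(X_te, X_te_again))
```

The 20-sample test set keeps the 80/20 class ratio (16 samples of class 0 and 4 of class 1), and rerunning with the same `random_state` returns the same rows.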


In [None]:
def load_data():
    """
    Load and preprocess MNIST dataset with proper train/validation/test splits.
    
    Returns:
        X_train, X_val, X_test, y_train, y_val, y_test, scaler, split_info
    """
    print("Loading MNIST dataset...")
    mnist = fetch_openml('mnist_784', version=1, as_frame=False)
    X, y = mnist['data'], mnist['target'].astype(np.int64)
    
    # Subsample to keep the tutorial fast (seeded locally so the sample is reproducible)
    n_samples = 1_000
    rng = np.random.default_rng(RANDOM_STATE)
    indices = rng.choice(len(X), size=n_samples, replace=False)
    X_sample = X[indices]
    y_sample = y[indices]

    # First split: hold out the test set (100 of the 1,000 samples)
    # Using random_state ensures reproducibility - anyone running this code gets the same splits
    X_temp, X_test, y_temp, y_test = train_test_split(
        X_sample, y_sample, test_size=100, random_state=RANDOM_STATE, stratify=y_sample
    )
    
    # Second split: separate train and validation sets (800 train, 100 val)
    X_train, X_val, y_train, y_val = train_test_split(
        X_temp, y_temp, test_size=100, random_state=RANDOM_STATE, stratify=y_temp
    )

    # Normalize features: fit the scaler on the training split only, then apply
    # the same transform to validation and test data to avoid leakage
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_val = scaler.transform(X_val)
    X_test = scaler.transform(X_test)
    
    print(f"Data splits - Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")
    
    # Log split sizes for reproducibility
    split_info = {
        'train_size': len(X_train),
        'val_size': len(X_val),
        'test_size': len(X_test),
        'random_state': RANDOM_STATE
    }
    
    return X_train, X_val, X_test, y_train, y_val, y_test, scaler, split_info

# Load the data
X_train, X_val, X_test, y_train, y_val, y_test, scaler, split_info = load_data()

## Step 2: Train and Log Baseline Model

We establish a baseline model to:
- Set a performance benchmark for comparison
- Demonstrate proper MLflow logging practices
- Log all parameters, metrics, and model artifacts

**What we log:**
- Model hyperparameters (C, max_iter, solver, etc.)
- Data split information for reproducibility
- Validation and test set accuracies
- Model signature and input examples
- Tags for easy filtering in MLflow UI


In [None]:
def train_and_log_baseline(X_train, X_val, y_train, y_val, X_test, y_test, split_info):
    """
    Train a baseline logistic regression model and log it to MLflow.
    
    Note: We use the same parameter values for both model instantiation and logging
    to ensure consistency and avoid discrepancies.
    """
    experiment_name = "MNIST_LogisticRegression"
    mlflow.set_experiment(experiment_name)
    
    with mlflow.start_run(run_name="baseline_logreg") as run:
        # Model: multinomial logistic regression with lbfgs solver (best for multinomial)
        C = 1.0
        max_iter = 100
        
        # Use the same parameters for both model and logging to ensure consistency
        model = LogisticRegression(
            multi_class='multinomial', 
            solver='lbfgs', 
            max_iter=max_iter, 
            C=C, 
            random_state=RANDOM_STATE
        )
        
        print("Training baseline logistic regression model...")
        model.fit(X_train, y_train)
        
        # Evaluate on validation set
        y_pred_val = model.predict(X_val)
        acc_val = accuracy_score(y_val, y_pred_val)
        print(f"Baseline Validation Accuracy: {acc_val:.4f}")
        
        # Evaluate on test set (for final comparison)
        y_pred_test = model.predict(X_test)
        acc_test = accuracy_score(y_test, y_pred_test)
        print(f"Baseline Test Accuracy: {acc_test:.4f}")
        
        # Log parameters (ensuring they match the model's actual parameters)
        mlflow.log_param("model_type", "LogisticRegression")
        mlflow.log_param("solver", "lbfgs")
        mlflow.log_param("multi_class", "multinomial")
        mlflow.log_param("max_iter", max_iter)
        mlflow.log_param("C", C)
        mlflow.log_param("random_state", RANDOM_STATE)
        
        # Log data split information
        for key, value in split_info.items():
            mlflow.log_param(f"data_{key}", value)
        
        # Log metrics
        mlflow.log_metric("accuracy_val", acc_val)
        mlflow.log_metric("accuracy_test", acc_test)
        
        # Log tags
        mlflow.set_tag("dataset", "MNIST")
        mlflow.set_tag("task", "multiclass_classification")
        mlflow.set_tag("model_version", "baseline")
        
        # Log input examples (prevents warnings and documents expected input format)
        # Collect the first training example of each digit class
        example_indices = []
        for digit in range(10):
            indices = np.where(y_train == digit)[0]
            if len(indices) > 0:
                example_indices.append(indices[0])
        
        examples = X_train[example_indices[:5]]  # Log 5 examples
        
        # Create model signature with input examples
        signature = infer_signature(examples, model.predict(examples))
        
        # Log model with signature and input examples
        mlflow.sklearn.log_model(model, artifact_path="model", signature=signature, input_example=examples)
        
        return run.info.run_id, model

# Train and log baseline model
print("="*60)
print("Training Baseline Model")
print("="*60)
baseline_run_id, baseline_model = train_and_log_baseline(
    X_train, X_val, y_train, y_val, X_test, y_test, split_info
)
print(f"Baseline model run ID: {baseline_run_id}")

## Step 3: Hyperparameter Tuning with Hyperopt

We use Hyperopt's Tree-structured Parzen Estimator (TPE) algorithm to efficiently search the hyperparameter space.

**Hyperparameters being tuned:**
- **C**: Inverse of regularization strength (log-uniform distribution: 1e-4 to 10)
- **max_iter**: Maximum iterations for solver convergence (quantized uniform: 50-300)
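
To show what these two distributions produce, here is a NumPy-only sketch that mirrors the documented definitions of `hp.loguniform` (exp of a uniform draw between the log bounds) and `hp.quniform` (a uniform draw rounded to a multiple of `q`). The helper names `sample_loguniform` and `sample_quniform` are hypothetical and take raw bounds, whereas `hp.loguniform` itself expects the bounds already in log space.

```python
import numpy as np

rng = np.random.default_rng(42)


def sample_loguniform(low: float, high: float) -> float:
    """Draw exp(uniform(log(low), log(high)))."""
    return float(np.exp(rng.uniform(np.log(low), np.log(high))))


def sample_quniform(low: float, high: float, q: float) -> float:
    """Draw round(uniform(low, high) / q) * q."""
    return float(np.round(rng.uniform(low, high) / q) * q)


samples = [
    {"C": sample_loguniform(1e-4, 10), "max_iter": sample_quniform(50, 300, 10)}
    for _ in range(5)
]
for s in samples:
    print(f"C={s['C']:.5f}, max_iter={int(s['max_iter'])}")
```

Sampling `C` on a log scale spreads draws evenly across orders of magnitude, so values near 1e-3 are as likely as values near 1.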

**Key Concepts:**
- TPE is a Bayesian optimization method that learns from previous trials
- We optimize on the validation set (test set remains untouched)
- **All trials are logged to MLflow as nested runs** in a parent run group for live comparison
- You can watch the hyperparameter search progress in real-time in the MLflow UI
- Each trial is logged with its hyperparameters and validation accuracy for easy comparison


In [None]:
def hyperopt_train_eval(params, X_train, X_val, y_train, y_val, trial_num, parent_run_id, split_info):
    """
    Objective function for hyperopt with MLflow logging for each trial.
    
    Hyperparameters:
    - C: Inverse of regularization strength (smaller = stronger regularization)
    - max_iter: Maximum number of iterations for solver convergence
    
    Returns:
        Dictionary with loss (negative accuracy), status, model, and accuracy
    """
    C = params['C']
    max_iter = int(params['max_iter'])
    
    # Log each trial as a child run in the parent run group
    with mlflow.start_run(run_name=f"trial_{trial_num}", nested=True) as child_run:
        model = LogisticRegression(
            multi_class='multinomial', 
            solver='lbfgs', 
            max_iter=max_iter, 
            C=C, 
            random_state=RANDOM_STATE
        )
        
        model.fit(X_train, y_train)
        y_pred = model.predict(X_val)
        acc = accuracy_score(y_val, y_pred)
        
        # Log hyperparameters
        mlflow.log_param("C", float(C))
        mlflow.log_param("max_iter", int(max_iter))
        mlflow.log_param("trial_num", trial_num)
        
        # Log data split information
        for key, value in split_info.items():
            mlflow.log_param(f"data_{key}", value)
        
        # Log metrics
        mlflow.log_metric("accuracy_val", acc)
        mlflow.log_metric("loss", -acc)  # Loss for hyperopt (negative accuracy)
        
        # Log tags
        mlflow.set_tag("dataset", "MNIST")
        mlflow.set_tag("task", "multiclass_classification")
        mlflow.set_tag("tuning", "hyperopt")
        mlflow.set_tag("trial_type", "hyperparameter_search")
        
        # Log model (optional - can be commented out to save space)
        # mlflow.sklearn.log_model(model, artifact_path="model")
    
    # Hyperopt minimizes the loss, so return negative accuracy to maximize accuracy
    return {'loss': -acc, 'status': STATUS_OK, 'model': model, 'accuracy': acc}

def tune_hyperparameters(X_train, X_val, y_train, y_val, split_info, max_evals=30):
    """
    Tune hyperparameters using Hyperopt with Tree-structured Parzen Estimator (TPE).
    All trials are logged to MLflow in a grouped run for live comparison.
    
    Search space:
    - C: Log-uniform distribution from 1e-4 to 10 (common range for regularization)
    - max_iter: Uniform quantized distribution from 50 to 300 (step size 10)
    
    All trial results are logged to MLflow as child runs for real-time comparison.
    """
    experiment_name = "MNIST_LogisticRegression"
    mlflow.set_experiment(experiment_name)
    
    # Create a parent run to group all hyperparameter tuning trials
    with mlflow.start_run(run_name="hyperopt_tuning_group") as parent_run:
        parent_run_id = parent_run.info.run_id
        
        # Log parent run metadata
        mlflow.set_tag("dataset", "MNIST")
        mlflow.set_tag("task", "multiclass_classification")
        mlflow.set_tag("tuning", "hyperopt")
        mlflow.set_tag("run_type", "hyperparameter_tuning_group")
        mlflow.log_param("max_evals", max_evals)
        mlflow.log_param("optimization_algorithm", "TPE")
        
        # Define search space
        # Using loguniform for C since regularization strength varies over orders of magnitude
        space = {
            'C': hp.loguniform('C', np.log(1e-4), np.log(10)),
            'max_iter': hp.quniform('max_iter', 50, 300, 10)
        }
        
        # Note: Selecting hyperparameters on a single validation set can overfit to that set.
        # Cross-validation gives a more robust estimate; the test set stays out of tuning either way.
        
        # Track trial number for run naming
        trial_counter = [0]  # Use list to allow modification in nested function
        
        def objective_with_logging(params):
            trial_counter[0] += 1
            return hyperopt_train_eval(
                params, X_train, X_val, y_train, y_val, 
                trial_counter[0], parent_run_id, split_info
            )
        
        trials = Trials()
        best = fmin(
            fn=objective_with_logging,
            space=space,
            algo=tpe.suggest,  # Tree-structured Parzen Estimator algorithm
            max_evals=max_evals,
            trials=trials,
            rstate=np.random.default_rng(RANDOM_STATE)  # Reproducibility
        )
        
        # Extract best model from trials
        best_trial = trials.best_trial
        best_acc = -best_trial['result']['loss']
        best_model = best_trial['result']['model']
        
        # Log summary metrics to parent run
        mlflow.log_metric("best_accuracy_val", best_acc)
        mlflow.log_metric("n_trials_completed", len(trials.trials))
        mlflow.log_param("best_C", float(best['C']))
        mlflow.log_param("best_max_iter", int(best['max_iter']))
        
        # Summarize the search for the caller (returned, not logged to MLflow)
        trial_results = {
            'n_trials': len(trials.trials),
            'best_accuracy': float(best_acc),
            'best_params': {k: float(v) if isinstance(v, (int, float, np.number)) else v 
                           for k, v in best.items()},
            'parent_run_id': parent_run_id
        }
    
    return best, best_acc, best_model, trial_results

# Run hyperparameter tuning
print("="*60)
print("Starting Hyperparameter Tuning with Hyperopt")
print("="*60)
print("All trials will be logged to MLflow for live comparison...")
best_params, best_acc, best_model, trial_results = tune_hyperparameters(
    X_train, X_val, y_train, y_val, split_info, max_evals=30
)
print(f"\nBest hyperparameters: {best_params}")
print(f"Best validation accuracy: {best_acc:.4f}")
print(f"Total trials: {trial_results['n_trials']}")
print(f"Parent run ID: {trial_results['parent_run_id']}")
print("View all trials grouped together in MLflow UI!")

## Step 4: Log Best Tuned Model

After hyperparameter tuning, we log the best model to MLflow with:
- The best hyperparameters found by the search
- Validation and test set performance
- Model signature and input examples
- Tags indicating this is a tuned model

This allows easy comparison with the baseline model in the MLflow UI.


In [None]:
def log_best_model(best_params, best_acc, best_model, X_train, X_val, X_test, 
                   y_train, y_val, y_test, split_info, trial_results):
    """
    Log the best model from hyperparameter tuning to MLflow.
    """
    experiment_name = "MNIST_LogisticRegression"
    mlflow.set_experiment(experiment_name)
    
    with mlflow.start_run(run_name="tuned_logreg") as run:
        # Evaluate on test set
        y_pred_test = best_model.predict(X_test)
        acc_test = accuracy_score(y_test, y_pred_test)
        
        # Log parameters
        mlflow.log_param("model_type", "LogisticRegression")
        mlflow.log_param("solver", "lbfgs")
        mlflow.log_param("multi_class", "multinomial")
        mlflow.log_param("C", float(best_params['C']))
        mlflow.log_param("max_iter", int(best_params['max_iter']))
        mlflow.log_param("random_state", RANDOM_STATE)
        
        # Log data split information
        for key, value in split_info.items():
            mlflow.log_param(f"data_{key}", value)
        
        # Log hyperopt trial summary
        mlflow.log_param("n_trials", trial_results['n_trials'])
        
        # Log metrics
        mlflow.log_metric("accuracy_val", best_acc)
        mlflow.log_metric("accuracy_test", acc_test)
        
        # Log tags
        mlflow.set_tag("dataset", "MNIST")
        mlflow.set_tag("task", "multiclass_classification")
        mlflow.set_tag("tuning", "hyperopt")
        mlflow.set_tag("model_version", "tuned")
        
        # Log input examples
        example_indices = []
        for digit in range(10):
            indices = np.where(y_train == digit)[0]
            if len(indices) > 0:
                example_indices.append(indices[0])
        
        examples = X_train[example_indices[:5]]
        
        # Create model signature with input examples
        signature = infer_signature(examples, best_model.predict(examples))
        
        # Log model with signature and input examples
        mlflow.sklearn.log_model(best_model, artifact_path="model", signature=signature, input_example=examples)
        
        print(f"Tuned model logged - Val Accuracy: {best_acc:.4f}, Test Accuracy: {acc_test:.4f}")
        return run.info.run_id, best_model

# Log the best tuned model
print("="*60)
print("Logging Best Tuned Model")
print("="*60)
tuned_run_id, tuned_model = log_best_model(
    best_params, best_acc, best_model, X_train, X_val, X_test,
    y_train, y_val, y_test, split_info, trial_results
)
print(f"Tuned model run ID: {tuned_run_id}")

## Step 5: Compare Models on Test Set

Finally, we compare both models on the held-out test set to get an unbiased estimate of their performance. Test accuracy was computed and logged in Steps 2 and 4, but the test set was never used to fit a model or to choose hyperparameters.

**Why this matters:**
- The test set provides an unbiased estimate of model performance
- Comparing models on the same test set ensures fair comparison
- The classification report shows per-class performance metrics
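
To illustrate how per-class metrics can differ from overall accuracy, here is a small sketch on hand-made labels (not MNIST predictions), using `classification_report` with `output_dict=True` so individual values can be read out.

```python
from sklearn.metrics import accuracy_score, classification_report

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]  # one class-1 sample misclassified as class 2

report = classification_report(y_true, y_pred, output_dict=True)
print(f"accuracy:          {accuracy_score(y_true, y_pred):.3f}")
print(f"class 1 recall:    {report['1']['recall']:.3f}")
print(f"class 2 precision: {report['2']['precision']:.3f}")
```

Overall accuracy is 5/6, yet class 1 recall is only 0.5 and class 2 precision is 2/3: a single confusion between two digits shows up in the per-class rows even when the headline number looks fine.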


In [None]:
def compare_models(baseline_model, tuned_model, X_test, y_test):
    """
    Compare baseline and tuned models on the test set.
    """
    print("\n" + "="*60)
    print("Final Model Comparison on Test Set")
    print("="*60)
    
    baseline_pred = baseline_model.predict(X_test)
    tuned_pred = tuned_model.predict(X_test)
    
    baseline_acc = accuracy_score(y_test, baseline_pred)
    tuned_acc = accuracy_score(y_test, tuned_pred)
    
    print(f"\nBaseline Model Test Accuracy: {baseline_acc:.4f}")
    print(f"Tuned Model Test Accuracy: {tuned_acc:.4f}")
    print(f"Improvement: {tuned_acc - baseline_acc:.4f} ({((tuned_acc - baseline_acc) / baseline_acc * 100):.2f}%)")
    
    print("\nTuned Model Classification Report:")
    print(classification_report(y_test, tuned_pred))
    
    return baseline_acc, tuned_acc

# Compare models on test set
baseline_test_acc, tuned_test_acc = compare_models(baseline_model, tuned_model, X_test, y_test)

print("\n" + "="*60)
print("Experiment Complete!")
print("="*60)
print(f"View your experiments at: {mlflow.get_tracking_uri()}")

## Summary

Congratulations! You've successfully completed a comprehensive MLflow hyperparameter tuning workflow. Here's what we accomplished:

### Key Takeaways

1. **Reproducibility**: 
   - Fixed random seeds make results repeatable across runs in the same environment
   - Data split sizes are logged for transparency
   - All hyperparameters are tracked in MLflow

2. **Proper Data Management**:
   - Separate train/validation/test splits prevent data leakage
   - Test set is only used for final evaluation, not during tuning
   - Stratified splits maintain class distribution

3. **Comprehensive Logging**:
   - All model parameters are logged and match actual model configuration
   - Input examples are logged to document expected data format
   - Both validation and test metrics are tracked
   - Hyperparameter search space and results are documented

4. **Best Practices**:
   - Baseline model establishes a performance benchmark
   - Hyperparameter tuning uses validation set (test set reserved for final evaluation)
   - Model comparison on held-out test set provides unbiased performance estimates

### Next Steps

- Explore MLflow UI to compare runs and visualize metrics
- Experiment with different hyperparameter search spaces
- Try other Hyperopt search algorithms (e.g., random search with `hyperopt.rand.suggest` or simulated annealing with `hyperopt.anneal.suggest`)
- Consider cross-validation for more robust hyperparameter selection
- Extend to other model types (neural networks, ensemble methods)
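
As one possible starting point for the cross-validation suggestion, the sketch below scores a few candidate `C` values with stratified 5-fold cross-validation. It assumes scikit-learn's bundled `load_digits` dataset (8×8 images) as a small stand-in for MNIST, and the candidate values are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# load_digits (bundled with scikit-learn) stands in for MNIST here
X, y = load_digits(return_X_y=True)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
mean_scores = {}
for C in (0.01, 0.1, 1.0):  # illustrative candidate values
    # The pipeline refits the scaler inside each fold, so held-out folds
    # never contribute to the scaling statistics
    pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
    mean_scores[C] = scores.mean()
    print(f"C={C}: mean accuracy {scores.mean():.4f} (+/- {scores.std():.4f})")

best_C = max(mean_scores, key=mean_scores.get)
print(f"Best C by cross-validation: {best_C}")
```

Averaging over folds reduces the dependence on any single validation split; the same loop could serve as the objective inside `fmin` in place of the single-split accuracy.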

This workflow ensures scientific rigor and reproducibility in your machine learning experiments!