# ML Experiment Tracking with MLflow

In this notebook we train three benchmark models (the same ones used in Assignment 1), track every run with MLflow, register the models, and compare them using AUCPR (area under the precision-recall curve) as the model selection metric.

In [1]:
import pandas as pd
import numpy as np
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_recall_curve, auc)
import mlflow
import mlflow.sklearn
import warnings
warnings.filterwarnings('ignore')

## 1. Load Data

In [2]:
train_df = pd.read_csv('train.csv')
val_df = pd.read_csv('validation.csv')
test_df = pd.read_csv('test.csv')

print(f"Train: {len(train_df)}, Validation: {len(val_df)}, Test: {len(test_df)}")

Train: 3900, Validation: 836, Test: 836


In [3]:
X_train = train_df['message']
y_train = train_df['label']
X_val = val_df['message']
y_val = val_df['label']
X_test = test_df['message']
y_test = test_df['label']

## 2. Feature Extraction

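We use TF-IDF to convert the text messages into numerical features. The vectorizer is fit on the training set only and then applied unchanged to the validation and test sets.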

In [4]:
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_val_tfidf = vectorizer.transform(X_val)
X_test_tfidf = vectorizer.transform(X_test)

print(f"TF-IDF feature shape: {X_train_tfidf.shape}")

TF-IDF feature shape: (3900, 5000)
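To make the fit/transform split concrete, here is a minimal sketch on a few made-up messages (the strings and variable names are hypothetical, not taken from the SMS dataset). It illustrates that `transform` reuses the vocabulary learned during `fit_transform`, so words that appear only in the validation text get no feature.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy messages, not taken from the SMS dataset.
train_msgs = [
    "win a free prize now",
    "are we meeting for lunch",
    "free entry win cash",
]
val_msgs = ["claim your free lottery prize"]

vec = TfidfVectorizer(stop_words="english")
train_mat = vec.fit_transform(train_msgs)  # learns the vocabulary and IDF weights
val_mat = vec.transform(val_msgs)          # reuses them; unseen words are dropped

print("train shape:", train_mat.shape)
print("val shape:  ", val_mat.shape)
print("'lottery' in vocabulary:", "lottery" in vec.vocabulary_)
print("non-zero features in the validation row:", val_mat.nnz)
```

Both matrices have the same number of columns, which is what lets a model trained on `train_mat` score `val_mat` directly.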


## 3. Set Up MLflow

In [5]:
# use local file-based tracking; Path.as_uri() builds a valid file:// URI on any OS
from pathlib import Path
mlflow.set_tracking_uri((Path.cwd() / "mlruns").as_uri())
mlflow.set_experiment("sms-spam-classification")

2026/02/15 14:19:26 INFO mlflow.tracking.fluent: Experiment with name 'sms-spam-classification' does not exist. Creating a new experiment.


In [6]:
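def compute_aucpr(model, X, y):
    """
    Compute the area under the precision-recall curve (AUCPR).
    Uses predict_proba if available, otherwise decision_function.
    """
    if hasattr(model, 'predict_proba'):
        # column 1 holds the probability of the positive class (model.classes_[1])
        y_scores = model.predict_proba(X)[:, 1]
    else:
        # e.g. LinearSVC: signed distance to the decision boundary
        y_scores = model.decision_function(X)
    precision, recall, _ = precision_recall_curve(y, y_scores)
    # trapezoidal integration of precision over recall
    return auc(recall, precision)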
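As a small illustration of what this metric measures, the sketch below computes the trapezoidal area on a hypothetical four-sample example and compares it with scikit-learn's `average_precision_score`, a step-wise summary of the same curve. The labels and scores are invented for the example. The two numbers generally differ slightly, so it helps to state which one is being reported.

```python
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Hypothetical toy labels and scores, chosen so the two summaries differ.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

precision, recall, _ = precision_recall_curve(y_true, y_score)

# Trapezoidal rule: linear interpolation between points on the curve
# (this is what auc(recall, precision) does).
trapezoid_aucpr = auc(recall, precision)

# Average precision: sum of precision values weighted by the gain in recall.
avg_precision = average_precision_score(y_true, y_score)

print(f"trapezoidal AUCPR: {trapezoid_aucpr:.4f}")
print(f"average precision: {avg_precision:.4f}")
```

This notebook uses the trapezoidal version throughout, so the values logged as `val_aucpr` are comparable across the three models.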

## 4. Train and Track Models

We train the same three benchmark models from Assignment 1, but now we log the parameters, validation metrics, and fitted model of each run to MLflow and register each model:

1. Logistic Regression
2. Naive Bayes (Multinomial)
3. Linear SVM
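The three training cells below compute the same three validation metrics. One possible way to factor that out is sketched here, using a hypothetical helper `validation_metrics` and made-up toy data in place of the TF-IDF matrices. The helper returns a dict keyed by the metric names used in this notebook; inside an active run, such a dict could be passed to `mlflow.log_metrics` in a single call.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, auc, f1_score, precision_recall_curve
from sklearn.svm import LinearSVC


def validation_metrics(model, X, y):
    """Return accuracy, F1, and AUCPR as a dict keyed by metric name (hypothetical helper)."""
    y_pred = model.predict(X)
    # Same score selection as compute_aucpr: probabilities if available, else margins.
    if hasattr(model, "predict_proba"):
        scores = model.predict_proba(X)[:, 1]
    else:
        scores = model.decision_function(X)
    precision, recall, _ = precision_recall_curve(y, scores)
    return {
        "val_accuracy": accuracy_score(y, y_pred),
        "val_f1": f1_score(y, y_pred),
        "val_aucpr": auc(recall, precision),
    }


# Hypothetical toy data standing in for the TF-IDF matrices.
X_toy = [[0.0, 1.0], [0.2, 0.8], [0.9, 0.1], [1.0, 0.0]]
y_toy = [0, 0, 1, 1]

results = {}
for model in (LogisticRegression(), LinearSVC()):
    model.fit(X_toy, y_toy)
    # Inside an active run, mlflow.log_metrics(...) could log this dict in one call.
    results[type(model).__name__] = validation_metrics(model, X_toy, y_toy)

for name, metrics in results.items():
    print(name, metrics)
```

The cells in this notebook keep the explicit per-metric calls, which behave the same way and are easier to read line by line.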

### Model 1: Logistic Regression

In [7]:
with mlflow.start_run(run_name="logistic_regression") as run:
    lr = LogisticRegression(max_iter=1000, C=1.0)
    lr.fit(X_train_tfidf, y_train)
    
    y_val_pred = lr.predict(X_val_tfidf)
    val_acc = accuracy_score(y_val, y_val_pred)
    val_f1 = f1_score(y_val, y_val_pred)
    val_aucpr = compute_aucpr(lr, X_val_tfidf, y_val)
    
    # log parameters and metrics
    mlflow.log_param("model_type", "LogisticRegression")
    mlflow.log_param("max_iter", 1000)
    mlflow.log_param("C", 1.0)
    mlflow.log_metric("val_accuracy", val_acc)
    mlflow.log_metric("val_f1", val_f1)
    mlflow.log_metric("val_aucpr", val_aucpr)
    
    # log and register the model
    mlflow.sklearn.log_model(lr, "model", registered_model_name="logistic_regression")
    lr_run_id = run.info.run_id

print("Logistic Regression:")
print(f"  Validation Accuracy: {val_acc:.4f}")
print(f"  Validation F1:       {val_f1:.4f}")
print(f"  Validation AUCPR:    {val_aucpr:.4f}")

Logistic Regression:
  Validation Accuracy: 0.9689
  Validation F1:       0.8687
  Validation AUCPR:    0.9830


Successfully registered model 'logistic_regression'.
Created version '1' of model 'logistic_regression'.


### Model 2: Naive Bayes

In [8]:
with mlflow.start_run(run_name="naive_bayes") as run:
    nb = MultinomialNB(alpha=1.0)
    nb.fit(X_train_tfidf, y_train)
    
    y_val_pred = nb.predict(X_val_tfidf)
    val_acc = accuracy_score(y_val, y_val_pred)
    val_f1 = f1_score(y_val, y_val_pred)
    val_aucpr = compute_aucpr(nb, X_val_tfidf, y_val)
    
    mlflow.log_param("model_type", "MultinomialNB")
    mlflow.log_param("alpha", 1.0)
    mlflow.log_metric("val_accuracy", val_acc)
    mlflow.log_metric("val_f1", val_f1)
    mlflow.log_metric("val_aucpr", val_aucpr)
    
    mlflow.sklearn.log_model(nb, "model", registered_model_name="naive_bayes")
    nb_run_id = run.info.run_id

print("Naive Bayes:")
print(f"  Validation Accuracy: {val_acc:.4f}")
print(f"  Validation F1:       {val_f1:.4f}")
print(f"  Validation AUCPR:    {val_aucpr:.4f}")

Naive Bayes:
  Validation Accuracy: 0.9797
  Validation F1:       0.9179
  Validation AUCPR:    0.9765


Successfully registered model 'naive_bayes'.
Created version '1' of model 'naive_bayes'.


### Model 3: Linear SVM

In [9]:
with mlflow.start_run(run_name="linear_svm") as run:
    svm = LinearSVC(max_iter=1000)
    svm.fit(X_train_tfidf, y_train)
    
    y_val_pred = svm.predict(X_val_tfidf)
    val_acc = accuracy_score(y_val, y_val_pred)
    val_f1 = f1_score(y_val, y_val_pred)
    val_aucpr = compute_aucpr(svm, X_val_tfidf, y_val)
    
    mlflow.log_param("model_type", "LinearSVC")
    mlflow.log_param("max_iter", 1000)
    mlflow.log_metric("val_accuracy", val_acc)
    mlflow.log_metric("val_f1", val_f1)
    mlflow.log_metric("val_aucpr", val_aucpr)
    
    mlflow.sklearn.log_model(svm, "model", registered_model_name="linear_svm")
    svm_run_id = run.info.run_id

print("Linear SVM:")
print(f"  Validation Accuracy: {val_acc:.4f}")
print(f"  Validation F1:       {val_f1:.4f}")
print(f"  Validation AUCPR:    {val_aucpr:.4f}")

Linear SVM:
  Validation Accuracy: 0.9892
  Validation F1:       0.9585
  Validation AUCPR:    0.9874


Successfully registered model 'linear_svm'.
Created version '1' of model 'linear_svm'.


## 5. View All Experiments

Pull all runs from MLflow and compare them side by side.

In [10]:
# pull all runs from the experiment
experiment = mlflow.get_experiment_by_name("sms-spam-classification")
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])

display_cols = ['run_id', 'params.model_type', 'metrics.val_accuracy', 
                'metrics.val_f1', 'metrics.val_aucpr']
print("All MLflow runs:")
print(runs[display_cols].to_string())

All MLflow runs:
                             run_id          model_type  val_accuracy    val_f1  val_aucpr
0  3088d547e9394baeb91cf2d4010ab98f           LinearSVC      0.989234  0.958525   0.987426
1  06cc20cf06664adc8b3966e96297db6c       MultinomialNB      0.979665  0.917874   0.976503
2  c9689029e9b04822a41137d244c5792e  LogisticRegression      0.968900  0.868687   0.982976
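Because `search_runs` returns a pandas DataFrame, the runs can also be ranked by the selection metric directly. The sketch below rebuilds a small DataFrame from the values in the table above (the run IDs are omitted, and the column names are assumed to follow the `params.` / `metrics.` prefixes used in `display_cols`) and sorts it by AUCPR.

```python
import pandas as pd

# Values copied from the run table above; run IDs omitted for brevity.
runs_demo = pd.DataFrame({
    "params.model_type": ["LinearSVC", "MultinomialNB", "LogisticRegression"],
    "metrics.val_accuracy": [0.989234, 0.979665, 0.968900],
    "metrics.val_f1": [0.958525, 0.917874, 0.868687],
    "metrics.val_aucpr": [0.987426, 0.976503, 0.982976],
})

# Rank runs by the model selection metric, best first.
ranked = runs_demo.sort_values("metrics.val_aucpr", ascending=False).reset_index(drop=True)
best_type = ranked.loc[0, "params.model_type"]

print(ranked[["params.model_type", "metrics.val_aucpr"]].to_string(index=False))
print("Best by AUCPR:", best_type)
```

Note that the ranking depends on the metric: by AUCPR, Logistic Regression comes second, while by accuracy and F1 it comes last.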


## 6. Check Out and Print AUCPR for Each Model

Retrieve the AUCPR metric from each model's MLflow run.

In [11]:
# retrieve AUCPR for each model from their MLflow runs
model_runs = {
    "Logistic Regression": lr_run_id,
    "Naive Bayes": nb_run_id,
    "Linear SVM": svm_run_id
}

print("Model Selection Metric (AUCPR) for each model:")
print("-" * 45)

best_aucpr = 0
best_model = ""

for name, run_id in model_runs.items():
    run = mlflow.get_run(run_id)
    aucpr = run.data.metrics['val_aucpr']
    print(f"  {name:25s} AUCPR: {aucpr:.4f}")
    if aucpr > best_aucpr:
        best_aucpr = aucpr
        best_model = name

print(f"\nBest model: {best_model} (AUCPR = {best_aucpr:.4f})")

Model Selection Metric (AUCPR) for each model:
---------------------------------------------
  Logistic Regression       AUCPR: 0.9830
  Naive Bayes               AUCPR: 0.9765
  Linear SVM                AUCPR: 0.9874

Best model: Linear SVM (AUCPR = 0.9874)


Linear SVM has the highest validation AUCPR (0.9874), ahead of Logistic Regression (0.9830) and Naive Bayes (0.9765). This matches Assignment 1, where Linear SVM was also the best model, and is consistent with SVMs working well on this text classification task.