# MLflow for Production AI and ML

## The Scenario

🛩️ **Leadership just gave you the order:** Your team has IoT sensor data streaming in from aircraft engines across 5 factories. By the end of the week, you need to deploy a predictive model to identify potential defects before they cause failures. This notebook gets you from raw data to a production-ready model in 30 minutes.

## What You'll Learn

✅ **Experiment** - Track model training with MLflow autologging  
✅ **Register** - Version control models in Unity Catalog  
✅ **Compare** - Train multiple models and compare results  

**Key Concepts:**
- **MLflow Tracking**: Automatically log parameters, metrics, and models
- **Unity Catalog Model Registry**: Enterprise-grade model versioning and governance
- **Model Aliases**: Tag models as "Champion" or "Challenger" for deployment

---

**References:**
- [MLflow Tracking](https://docs.databricks.com/aws/en/mlflow/tracking)
- [Databricks Autologging](https://docs.databricks.com/aws/en/mlflow/databricks-autologging)
- [Unity Catalog Model Registry](https://docs.databricks.com/aws/en/machine-learning/manage-model-lifecycle/index.html)

## Why MLflow?

Without MLflow, data scientists face challenges like:
- **Lost experiments** - "Which hyperparameters gave us that 95% accuracy?"
- **Model chaos** - "Where's the model we deployed last week?"
- **No reproducibility** - "I can't recreate these results"

MLflow solves this by providing:
- **Experiment Tracking**: Automatic logging of parameters, metrics, and artifacts
- **Model Registry**: Centralized model versioning with Unity Catalog
- **Deployment**: Seamless path from experiment to production

**The MLOps Workflow (This Notebook):**
```
1. EXPERIMENT → Train models, MLflow tracks everything
2. REGISTER   → Save best model to Unity Catalog
3. COMPARE    → Review experiments and pick the best

(Next notebook covers: DEPLOY → Inference patterns)
```

In [0]:
%pip install mlflow scikit-learn 
%restart_python

## Setup: Configuration

Update these values with your catalog and schema.

In [0]:
# Configuration
import re

catalog = "dwx_airops_insights_platform_dev_working"
source_schema = "db_crash_course"  # Shared schema to read from
username = spark.sql("SELECT current_user()").collect()[0][0]
username_base = username.split('@')[0]  # Extract username before @ symbol
target_schema = re.sub(r'[^a-zA-Z0-9_]', '_', username_base)  # Replace special chars with _

# Create target schema if it doesn't exist
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{target_schema}")

print(f"✅ Using catalog: {catalog}")
print(f"📖 Reading from schema: {source_schema} (shared)")
print(f"✍️  Writing to schema: {target_schema} (your personal schema)")
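
To see what the schema-name logic above produces, the following standalone sketch applies the same two steps (take the part before `@`, then replace anything other than letters, digits, and underscores) to a few made-up email addresses. The helper name `schema_name_for` and the sample addresses are hypothetical; only the regular expression is taken from the setup cell.

```python
import re

def schema_name_for(user_email: str) -> str:
    """Derive a schema-safe name from an email, mirroring the setup cell."""
    username_base = user_email.split("@")[0]
    return re.sub(r"[^a-zA-Z0-9_]", "_", username_base)

# Hypothetical example addresses, for illustration only
for email in ["jane.doe@example.com", "sam-lee+ml@example.com", "alex_99@example.com"]:
    print(f"{email} -> {schema_name_for(email)}")
```

Note that two different users can map to the same schema name under this rule (for example `jane.doe` and `jane_doe`), which is worth keeping in mind in a shared workspace.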

## AutoML vs MLflow in Notebooks: What's the Difference?

In the previous notebook, you used **AutoML** - Databricks' point-and-click UI for training models. In this notebook, you'll use **MLflow** within notebooks for custom model training.

### Key Differences:

| Aspect | AutoML (Previous Notebook) | MLflow in Notebooks (This Notebook) |
|--------|---------------------------|-------------------------------------|
| **Interface** | Point-and-click UI | Code in notebooks |
| **Control** | Automated decisions | Full control over everything |
| **Speed** | Fast, no code needed | Requires writing code |
| **Customization** | Limited to provided options | Unlimited customization |
| **Best For** | Quick experiments, baselines | Custom models, fine-tuning |

### Both Use MLflow Under the Hood!

**Important:** AutoML automatically uses MLflow to track experiments. Whether you use AutoML or write custom code, **all experiments are tracked in MLflow**.

**When to use which:**
- **AutoML**: Fast proof-of-concept, baseline models, learning
- **MLflow in Notebooks**: Custom algorithms, specific architectures, production models

---

## Load and Prepare Training Data

Let's jump straight into preparing data for model training. You can explore tables in Catalog Explorer as you learned in Day 1.

In [0]:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Load sensor and inspection data from source schema (shared)
sensor_df = spark.table(f"{catalog}.{source_schema}.sensor_bronze")
inspection_df = spark.table(f"{catalog}.{source_schema}.inspection_bronze")

# Join sensor data with inspection labels
# For each device and inspection, keep the most recent sensor reading
# taken at or before that inspection
window_spec = (
    Window.partitionBy("device_id", "inspection_timestamp")
    .orderBy(F.col("sensor_timestamp").desc())
)

training_data = (
    sensor_df
    .withColumnRenamed("timestamp", "sensor_timestamp")
    .join(
        inspection_df.withColumnRenamed("timestamp", "inspection_timestamp"),
        ["device_id"]
    )
    .filter(F.col("sensor_timestamp") <= F.col("inspection_timestamp"))
    .withColumn("row_num", F.row_number().over(window_spec))
    .filter(F.col("row_num") == 1)
    .select(
        "device_id",
        "factory_id", 
        "model_id",
        "airflow_rate",
        "rotation_speed",
        "air_pressure",
        "temperature",
        "delay",
        "density",
        F.col("defect").cast("int").alias("defect")
    )
)

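# Count once and reuse the results instead of recomputing the join for each print
total_records = training_data.count()
defect_records = training_data.filter("defect = 1").count()

print(f"Training dataset size: {total_records:,} records")
print(f"Defect rate: {defect_records / total_records * 100:.2f}%")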

display(training_data.limit(10))
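
The join-and-window pattern above is meant to pair an inspection with the latest sensor reading taken at or before it. The plain-Python sketch below walks through the same idea on a handful of rows so the selection rule is easy to check by eye. All rows and the helper `latest_reading_before` are hypothetical and exist only for illustration; the Spark code in the cell above is what actually runs in this notebook.

```python
from datetime import datetime

# Hypothetical rows, for illustration only
sensor_rows = [
    {"device_id": 1, "sensor_timestamp": datetime(2024, 1, 1, 8), "temperature": 70.1},
    {"device_id": 1, "sensor_timestamp": datetime(2024, 1, 1, 9), "temperature": 71.4},
    {"device_id": 1, "sensor_timestamp": datetime(2024, 1, 1, 11), "temperature": 75.0},
    {"device_id": 2, "sensor_timestamp": datetime(2024, 1, 1, 7), "temperature": 68.2},
]
inspection_rows = [
    {"device_id": 1, "inspection_timestamp": datetime(2024, 1, 1, 10), "defect": 1},
    {"device_id": 2, "inspection_timestamp": datetime(2024, 1, 1, 6), "defect": 0},
]

def latest_reading_before(inspection, readings):
    """Return the device's most recent reading at or before the inspection, or None."""
    candidates = [
        r for r in readings
        if r["device_id"] == inspection["device_id"]
        and r["sensor_timestamp"] <= inspection["inspection_timestamp"]
    ]
    return max(candidates, key=lambda r: r["sensor_timestamp"], default=None)

labeled = []
for inspection in inspection_rows:
    reading = latest_reading_before(inspection, sensor_rows)
    if reading is not None:  # like the filter above: drop inspections with no earlier reading
        labeled.append({**reading, "defect": inspection["defect"]})

print(labeled)
```

Device 1 keeps its 09:00 reading (the 11:00 reading comes after the inspection), and device 2 drops out because its only reading is later than its inspection. Excluding later readings keeps information from after the inspection out of the features.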

## Convert to Pandas for Sklearn

For this quick example, we'll use scikit-learn. For larger datasets, consider using Spark MLlib or distributed training.

In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Convert to Pandas
pdf = training_data.toPandas()

# Prepare features and target
feature_cols = ["airflow_rate", "rotation_speed", "air_pressure", "temperature", "delay", "density"]
X = pdf[feature_cols]
y = pdf["defect"]

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training set: {len(X_train):,} samples")
print(f"Test set: {len(X_test):,} samples")
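
The `stratify=y` argument in the split above keeps the defect rate roughly equal in the training and test sets, which matters when defects are rare. The sketch below is a simplified, standard-library illustration of that idea, not scikit-learn's actual implementation; the function `stratified_split` and the label counts are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) with each class split at the same ratio."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    train_indices, test_indices = [], []
    for class_indices in indices_by_class.values():
        rng.shuffle(class_indices)
        n_test = round(len(class_indices) * test_size)
        test_indices.extend(class_indices[:n_test])
        train_indices.extend(class_indices[n_test:])
    return sorted(train_indices), sorted(test_indices)

# Hypothetical labels: 10 defective engines (1) and 90 healthy ones (0)
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"Train defect rate: {train_rate:.2%}, test defect rate: {test_rate:.2%}")
```

Without stratification, a random 20% sample of a small, imbalanced dataset could end up with very few defects, or none, which would make the test metrics unreliable.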

## 1️⃣ EXPERIMENT: Train Model with MLflow Autologging

**Key Point:** Call `mlflow.autolog()` once before training and MLflow records each run for you. No need to manually log parameters, metrics, or models.

**What gets auto-logged (scikit-learn):**
- Estimator class and hyperparameters
- Training metrics (accuracy, precision, recall, etc.)
- Model artifacts
- Diagnostic plots for classifiers, such as the confusion matrix and ROC curve
- Model signature inferred from the training data

In [0]:
import mlflow
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Enable autologging - this is the magic! ✨
mlflow.autolog()

# Train model - MLflow automatically tracks everything
with mlflow.start_run(run_name="IoT Defect Prediction - RF") as run:
    # Train Random Forest
    rf_model = RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        random_state=42
    )
    rf_model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = rf_model.predict(X_test)
    y_pred_proba = rf_model.predict_proba(X_test)[:, 1]
    
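    # Evaluate on the held-out test set
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_pred_proba)

    # Log the test metrics explicitly under clear names, alongside what autolog records
    mlflow.log_metrics({
        "test_accuracy": accuracy,
        "test_precision": precision,
        "test_recall": recall,
        "test_f1": f1,
        "test_auc": auc,
    })

    # Keep the run ID: the registration step later in this notebook uses it
    run_id = run.info.run_id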

### 🔍 Explore the Databricks MLflow UI

**Click the "Experiment" button at the top right of this notebook** to open the MLflow UI. You'll see:

1. **Runs table** - All your experiments in one place
2. **Parameters** - Hyperparameters used (n_estimators, max_depth, etc.)
3. **Metrics** - Model performance (accuracy, precision, recall, etc.)
4. **Artifacts** - Saved model files, diagnostic plots, and more
5. **Charts** - Visualize metric comparisons across runs

Try clicking on your run to see all the details that were automatically logged!
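
When reading the metrics for a run, it helps to recall how they derive from the confusion matrix, especially for a rare event like a defect. The sketch below computes accuracy, precision, recall, and F1 from raw counts. The counts are hypothetical and chosen only to illustrate the arithmetic; they are not results from this notebook.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute common binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for 100 inspected engines, 12 of which are truly defective
metrics = classification_metrics(tp=8, fp=2, fn=4, tn=86)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 0.940 even though recall is only 0.667, meaning a third of the defective engines are missed. For defect detection, recall and precision usually tell you more than accuracy alone.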

## Train Another Model to Compare

Let's train a Gradient Boosting model to compare performance.

In [0]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer

# Impute missing values (e.g., in the temperature column) with the training median;
# GradientBoostingClassifier does not accept NaN inputs
imputer = SimpleImputer(strategy='median')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# Autologging is still enabled from earlier
with mlflow.start_run(run_name="IoT Defect Prediction - GBM") as run:
    # Train Gradient Boosting
    gbm_model = GradientBoostingClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=5,
        random_state=42
    )
    gbm_model.fit(X_train_imputed, y_train)
    
    # Make predictions
    y_pred = gbm_model.predict(X_test_imputed)
    y_pred_proba = gbm_model.predict_proba(X_test_imputed)[:, 1]
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_pred_proba)
    
    print(f"✅ Run ID: {run.info.run_id}")
    print(f"📊 Accuracy: {accuracy:.4f}")
    print(f"🎯 Precision: {precision:.4f}")
    print(f"🔍 Recall: {recall:.4f}")
    print(f"📈 F1 Score: {f1:.4f}")
    print(f"📉 AUC: {auc:.4f}")
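
The imputer in the cell above is fitted on the training data and then reused on the test data. The sketch below shows the same fit-then-transform idea for a single column using only the standard library. The helpers `fit_median` and `impute` and the temperature values are hypothetical; `SimpleImputer` does the equivalent work per column in the real code.

```python
from statistics import median

def fit_median(values):
    """Learn the median from the observed (non-missing) training values."""
    return median(v for v in values if v is not None)

def impute(values, fill_value):
    """Replace missing entries with the learned fill value."""
    return [fill_value if v is None else v for v in values]

# Hypothetical temperature readings; None marks a missing value
train_temperature = [70.0, None, 72.0, 95.0, 71.0]
test_temperature = [None, 69.5]

train_median = fit_median(train_temperature)           # learned from training data only
train_imputed = impute(train_temperature, train_median)
test_imputed = impute(test_temperature, train_median)   # reuse the training median

print(train_median, train_imputed, test_imputed)
```

Fitting on the training set only keeps test-set statistics from leaking into the model, so the test metrics stay an honest estimate. The median is also less sensitive than the mean to outliers such as the 95.0 reading here.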

💡 **Pro Tip:** Go back to the MLflow UI and compare the two runs side-by-side. Which model performs better?
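
"Better" depends on the metric you care about. One possible selection rule, sketched below in plain Python, is to maximize recall among runs that meet a minimum precision. The metric names, the numbers, and the `pick_best` helper are all made up for illustration and are not results from this notebook. In practice you could pull real values from the UI, or from `mlflow.search_runs()`, which returns runs as a pandas DataFrame.

```python
# Hypothetical run summaries with made-up numbers; real values come from your own MLflow runs
runs = [
    {"run_name": "IoT Defect Prediction - RF", "test_recall": 0.71, "test_precision": 0.83},
    {"run_name": "IoT Defect Prediction - GBM", "test_recall": 0.76, "test_precision": 0.79},
]

def pick_best(runs, metric, min_precision=0.0):
    """Return the run with the highest `metric` among runs meeting a precision floor."""
    eligible = [r for r in runs if r["test_precision"] >= min_precision]
    return max(eligible, key=lambda r: r[metric], default=None)

best = pick_best(runs, metric="test_recall", min_precision=0.75)
print(best["run_name"] if best else "No run meets the precision floor")
```

Raising the floor to `min_precision=0.80` would change the choice in this toy example, which is why it pays to agree on the selection rule before comparing runs.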

## 2️⃣ REGISTER: Save Model to Unity Catalog

The **Unity Catalog Model Registry** is your enterprise model store. It provides:
- **Versioning**: Every model update creates a new version
- **Lineage**: Track which data and code produced each model
- **Governance**: Control who can access and deploy models
- **Aliases**: Tag models as "Champion", "Challenger", "Staging", etc.

In [0]:
# Register the Random Forest model (run_id was captured from that run earlier)
# The model name is unique to you because it lives in your personal schema
mlflow.set_registry_uri("databricks-uc")  # Target Unity Catalog rather than the workspace registry
model_name = f"{catalog}.{target_schema}.iot_defect_predictor"
model_uri = f"runs:/{run_id}/model"

print(f"📦 Registering model: {model_name}")
model_details = mlflow.register_model(model_uri=model_uri, name=model_name)

print(f"✅ Registered model version: {model_details.version}")

### Set Model Alias to "Champion"

Model aliases let you tag specific versions for deployment (e.g., "Champion" for production, "Challenger" for testing).

In [0]:
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Add model description
client.update_registered_model(
    name=model_name,
    description="Random Forest model to predict defects in aircraft engine IoT sensors. Trained on sensor readings (airflow, rotation speed, temperature, pressure) and inspection results."
)

# Set the "Champion" alias to this version
client.set_registered_model_alias(
    name=model_name,
    alias="Champion",
    version=model_details.version
)

print(f"✅ Model version {model_details.version} tagged as 'Champion'")
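
An alias is a movable pointer to one model version, which is what makes promotion cheap: consumers keep asking for "Champion" while you change what it points to. The toy class below simulates that behavior in memory. `AliasRegistry` is entirely hypothetical and is not the MLflow or Unity Catalog API; it only illustrates the pointer semantics.

```python
class AliasRegistry:
    """Toy in-memory stand-in for a model registry, for illustration only."""

    def __init__(self):
        self.versions = []   # version N is stored at index N - 1
        self.aliases = {}    # alias name -> version number

    def register(self, description: str) -> int:
        """Store a new version and return its version number."""
        self.versions.append(description)
        return len(self.versions)

    def set_alias(self, alias: str, version: int) -> None:
        """Point an alias at a version, replacing any previous target."""
        if not 1 <= version <= len(self.versions):
            raise ValueError(f"Unknown version: {version}")
        self.aliases[alias] = version

    def resolve(self, alias: str) -> int:
        """Return the version number an alias currently points to."""
        return self.aliases[alias]

registry = AliasRegistry()
v1 = registry.register("Random Forest")
registry.set_alias("Champion", v1)

v2 = registry.register("Gradient Boosting")
registry.set_alias("Challenger", v2)

# Promote the challenger: consumers that ask for "Champion" now get version 2
registry.set_alias("Champion", v2)
print(registry.resolve("Champion"))
```

In the real registry the equivalent promotion is another call to `client.set_registered_model_alias` with the new version number; downstream code that loads the model by alias does not need to change.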

🎯 **View your model in Unity Catalog:**
1. Click "Catalog" in the left sidebar
2. Navigate to your catalog → schema → "iot_defect_predictor"
3. See model versions, lineage, and metadata

## 3️⃣ PREDICT: Load and Use the Model

Load the "Champion" model and use it for predictions. This is how you'd use the model in production.

In [0]:
import mlflow.pyfunc

# Load the Champion model by alias
champion_model_uri = f"models:/{model_name}@Champion"
print(f"📥 Loading model from: {champion_model_uri}")

champion_model = mlflow.pyfunc.load_model(champion_model_uri)

print("✅ Model loaded successfully!")
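
This notebook uses two kinds of model URI: `runs:/...` when registering and `models:/...@alias` when loading. MLflow also accepts a `models:/name/version` form that pins one fixed version. The small helpers below only build the strings so the three forms can be compared side by side; the helper names and the identifiers passed to them are hypothetical.

```python
def run_model_uri(run_id: str, artifact_path: str = "model") -> str:
    """URI for a model logged under a run, as used when registering."""
    return f"runs:/{run_id}/{artifact_path}"

def version_model_uri(model_name: str, version: int) -> str:
    """URI pinned to one fixed registered version."""
    return f"models:/{model_name}/{version}"

def alias_model_uri(model_name: str, alias: str) -> str:
    """URI that follows whichever version the alias currently points to."""
    return f"models:/{model_name}@{alias}"

# Hypothetical identifiers, for illustration only
name = "my_catalog.my_schema.iot_defect_predictor"
print(run_model_uri("abc123"))
print(version_model_uri(name, 1))
print(alias_model_uri(name, "Champion"))
```

Loading by alias suits production jobs that should always pick up the current champion, while a pinned version is the safer choice when you need to reproduce a specific result.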

### Make Batch Predictions

Use the loaded model to predict defects on new sensor data.

In [0]:
# Make predictions on test set
predictions = champion_model.predict(X_test)

# Create results DataFrame
results_df = pd.DataFrame({
    "actual_defect": y_test.values,
    "predicted_defect": predictions,
    "airflow_rate": X_test["airflow_rate"].values,
    "rotation_speed": X_test["rotation_speed"].values,
    "temperature": X_test["temperature"].values
})

print("🔮 Predictions:")
display(results_df.head(20))

# Calculate accuracy
accuracy = (results_df["actual_defect"] == results_df["predicted_defect"]).mean()
print(f"\n✅ Prediction Accuracy: {accuracy:.2%}")

## ✅ Mission Accomplished!

**What you just did:**
1. ✅ **EXPERIMENT** - Trained models with automatic MLflow tracking
2. ✅ **REGISTER** - Saved the best model to Unity Catalog with Champion alias
3. ✅ **PREDICT (Demo)** - Loaded and tested the model on sample data

**You're now ready to:**
- Show leadership you have a working predictive model ✨
- Deploy this model to production (next notebook covers batch, streaming, and real-time inference)
- Track model performance over time
- Iterate and improve with new model versions

**Next Up:** Move to notebook 6 (ML and AI Inference) to learn production deployment patterns!

## 🚀 Try This Out: Next Steps

Now that you have the MLOps basics down, here are ways to level up:

### 1. Train XGBoost with Databricks Assistant

Try training an XGBoost model and compare it to the Random Forest:

**Steps:**
1. Create a new code cell
2. Ask Databricks Assistant: "Train an XGBoost classifier using the same training data with MLflow autologging"
3. Compare the results in the MLflow UI
4. Which performs better - Random Forest or XGBoost?

**Bonus:** Try LightGBM too!

---

### 2. Experiment with Different Models

Try these other algorithms using the same training code pattern:
- **Logistic Regression** - Simple baseline
- **LightGBM** - Fast gradient boosting
- **XGBoost** - Popular gradient boosting
- **Neural Network** - sklearn MLPClassifier

**Tip:** Use Databricks Assistant to help with the code!

---

### 3. Hyperparameter Tuning with Optuna

Automatically search for the best parameters with [Hyperparameter Tuning](https://docs.databricks.com/aws/en/machine-learning/automl-hyperparam-tuning/optuna).

---


### 4. Move to the Next Notebook

Ready to deploy your models? The **next notebook (6 ML and AI Inference)** covers:
- **Batch predictions** - Score large datasets
- **Streaming predictions** - Real-time monitoring
- **Model serving APIs** - REST endpoints for applications
- **AI Query** - SQL-based inference

This is where your trained models go to production! 🚀

## 📚 Additional Resources

- [MLflow Quickstart](https://docs.databricks.com/aws/en/mlflow/quick-start.html)
- [MLflow 3 Migration Guide](https://docs.databricks.com/aws/en/mlflow/mlflow-3-install.html)
- [Unity Catalog Model Registry](https://docs.databricks.com/aws/en/machine-learning/manage-model-lifecycle/index.html)
- [Databricks Autologging](https://docs.databricks.com/aws/en/mlflow/databricks-autologging.html)
- [Model Deployment Guide](https://docs.databricks.com/aws/en/machine-learning/model-serving/index.html)

&copy; 2024 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>