# 🎯 Notebook: 04_Model_Training_And_Registration

This notebook is responsible for training a machine learning model to detect potential aircraft anomalies based on enriched sensor data, and registering the trained model in **Unity Catalog** using MLflow.

## 🧩 Key Steps Covered

- 📥 **Load features** from the registered feature table (`sensor_features_table`), which includes:
  - 7-day rolling averages for engine metrics
  - Anomaly history (`prev_anomaly`)
  - Days since last maintenance
- 🧹 **Data cleaning** to ensure no missing values and enforce schema compliance
- 🤖 **Model training** using a `RandomForestClassifier` with scikit-learn
- 📊 **Model evaluation** with precision, recall, and F1 score metrics
- 📝 **Model registration** in Unity Catalog with:
  - Signature: clearly defined input/output schema
  - Version control and metadata tracking
  - Compatibility with inference workflows
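The engineered features listed above are computed upstream of this notebook. As a rough illustration of what they represent, the sketch below derives a 7-day rolling average and a days-since-maintenance value from a toy daily series. The sample values and the `last_maint` date are assumptions made for the example, not the actual feature pipeline.

```python
import pandas as pd

# Toy daily readings for one aircraft (values are illustrative only)
readings = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=10, freq="D"),
    "engine_temp": [600.0, 602.0, 604.0, 606.0, 608.0,
                    610.0, 612.0, 614.0, 616.0, 618.0],
}).set_index("timestamp")

# Time-based 7-day rolling mean; the first rows average over fewer days
readings["avg_engine_temp_7d"] = readings["engine_temp"].rolling("7D").mean()

# Days elapsed since a hypothetical last maintenance date
last_maint = pd.Timestamp("2023-12-25")
readings["days_since_maint"] = (readings.index - last_maint).days

print(readings.tail(3))
```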

## 📎 Output

- Registered model: `AircraftAnomalyPredictor` (UC registered with signature)
- Logged experiment run with MLflow, including parameters, metrics, and artifacts

## 📥 Load Feature Store Data
We load features from a registered feature table to ensure consistency across training and inference.

In [0]:
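from databricks.feature_store import FeatureLookup, FeatureStoreClient

# Client used below to assemble the training set from the feature table
fs = FeatureStoreClient()

feature_lookups = [
    FeatureLookup(
        table_name="arao.aerodemo.sensor_features_table",
        lookup_key=["aircraft_id", "timestamp"],
        # List features explicitly so the label ('anomaly_score') is not pulled in as a feature
        feature_names=[
            "engine_temp", "fuel_efficiency", "vibration", "altitude", "airspeed",
            "oil_pressure", "engine_rpm", "battery_voltage",
            "avg_engine_temp_7d", "avg_vibration_7d", "avg_rpm_7d",
            "prev_anomaly", "days_since_maint"
        ]
    )
]

# NOTE: labels_df is assumed to be defined before this cell. It must contain the
# lookup keys (aircraft_id, timestamp) and the label column (anomaly_score).
training_set = fs.create_training_set(
    df=labels_df,
    feature_lookups=feature_lookups,
    label="anomaly_score"
)

# Materialize the joined Spark DataFrame as pandas for scikit-learn
training_df = training_set.load_df().toPandas()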
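Conceptually, building the training set attaches the listed feature columns to each label row by matching on the lookup keys. The pandas sketch below mimics that join on made-up data. It is an analogy under that assumption, not the Feature Store implementation, and both toy frames are hypothetical.

```python
import pandas as pd

# Hypothetical label rows: lookup keys plus the label column
labels = pd.DataFrame({
    "aircraft_id": ["A1", "A1", "B2"],
    "timestamp": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-01"]),
    "anomaly_score": [0, 1, 0],
})

# Hypothetical feature table with one column we do not want as a feature
features = pd.DataFrame({
    "aircraft_id": ["A1", "A1", "B2"],
    "timestamp": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-01"]),
    "engine_temp": [600.0, 640.0, 598.0],
    "vibration": [5.0, 7.2, 4.9],
    "unused_column": ["x", "y", "z"],
})

lookup_key = ["aircraft_id", "timestamp"]
feature_names = ["engine_temp", "vibration"]

# Left join: every label row is kept and the selected features are attached by key
toy_training_df = labels.merge(
    features[lookup_key + feature_names], on=lookup_key, how="left"
)
print(toy_training_df)
```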

## 📊 Prepare Training Data
We extract selected features and define the target label (`anomaly_score`).

In [0]:
import numpy as np

X = training_df[[
    "engine_temp", "fuel_efficiency", "vibration", "altitude", "airspeed",
    "oil_pressure", "engine_rpm", "battery_voltage", "prev_anomaly", 
    "avg_engine_temp_7d", "avg_vibration_7d", "avg_rpm_7d", "days_since_maint"
]]

# ✅ Enforce correct types before training & model signature
X = X.astype({
    "engine_temp": float,
    "fuel_efficiency": float,
    "vibration": float,
    "altitude": float,
    "airspeed": float,
    "oil_pressure": float,
    "engine_rpm": np.int32,         # Important for schema enforcement
    "battery_voltage": float,
    "prev_anomaly": float,
    "avg_engine_temp_7d": float,
    "avg_vibration_7d": float,
    "avg_rpm_7d": float,
    "days_since_maint": int
})

# ✅ Ensure labels are clean integers
y = training_df["anomaly_score"].astype(float).astype(int)
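One plausible reason for the two-step cast on the label is that the column may arrive as floats or as float-formatted strings. The sketch below uses hypothetical string-encoded labels to show that going through `float` first yields clean integers, whereas casting such strings straight to `int` is typically rejected.

```python
import pandas as pd

# Hypothetical labels that arrived as float-formatted strings
raw_labels = pd.Series(["0.0", "1.0", "1.0", "0.0"])

# Attempt the direct cast and record whether it is accepted
try:
    raw_labels.astype(int)
    direct_cast_ok = True
except (ValueError, TypeError):
    direct_cast_ok = False

# Two-step cast, as used for the training labels
clean_labels = raw_labels.astype(float).astype(int)

print("direct cast accepted:", direct_cast_ok)
print(clean_labels.tolist())
```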

## ✂️ Train/Test Split and Scaling
We hold out 30% of the data with a stratified split, then fit the scaler on the training portion only and apply it to both sets.

In [0]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
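To see what `stratify=y` and the fit-on-train-only scaler accomplish, the sketch below repeats the same steps on synthetic data with a 10% anomaly rate. The toy arrays and their names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_toy = np.array([1] * 20 + [0] * 180)  # 10% anomalies

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Fit on the training split only, so no test-set statistics leak into training
toy_scaler = StandardScaler().fit(X_tr)
X_tr_s = toy_scaler.transform(X_tr)
X_te_s = toy_scaler.transform(X_te)

print("train anomaly rate:", y_tr.mean())
print("test anomaly rate:", y_te.mean())
print("train column means after scaling:", X_tr_s.mean(axis=0).round(6))
```

Both splits keep the same share of anomalies, and the scaled training columns are centered at zero, while the test columns are only approximately centered because they reuse the training statistics.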

## 🤖 Train and Evaluate Model
Train a Random Forest Classifier and log classification metrics.

In [0]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
import mlflow
import mlflow.sklearn
from mlflow.models.signature import infer_signature

with mlflow.start_run(run_name="Aircraft_Anomaly_RF_Model"):
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train_scaled, y_train)

    preds = model.predict(X_test_scaled)
    report = classification_report(y_test, preds, output_dict=True)
    print(report)

    mlflow.log_params(model.get_params())

    # With output_dict=True the per-class keys are strings, so class 1 is "1".
    # Log the anomaly-class metrics once, falling back to 0.0 if class 1 is absent.
    anomaly_metrics = report.get("1", {})
    mlflow.log_metrics({
        "precision": anomaly_metrics.get("precision", 0.0),
        "recall": anomaly_metrics.get("recall", 0.0),
        "f1-score": anomaly_metrics.get("f1-score", 0.0)
    })

    # Bundle the already-fitted scaler with the classifier so the logged model
    # accepts raw (unscaled) features, matching the signature inferred from X_train.
    pipeline = Pipeline([("scaler", scaler), ("rf", model)])
    signature = infer_signature(X_train, pipeline.predict(X_train))
    mlflow.sklearn.log_model(
        sk_model=pipeline,
        artifact_path="model",
        signature=signature,
        registered_model_name="AircraftAnomalyPredictor"
    )
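The metric lookup in the cell above uses the string key `"1"` rather than the integer `1`. The short sketch below, on made-up predictions, shows why: with `output_dict=True`, scikit-learn keys the per-class entries by the label's string form.

```python
from sklearn.metrics import classification_report

# Toy ground truth and predictions (illustrative values)
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

toy_report = classification_report(y_true, y_pred, output_dict=True)

# Per-class entries sit next to the aggregate entries in the same dict
print(sorted(toy_report.keys()))

# Metrics for the anomaly class are stored under the string key "1"
anomaly = toy_report["1"]
print(anomaly["precision"], anomaly["recall"], anomaly["f1-score"])
```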

    

### 🔁 Assigning "champion" Alias to Latest Registered Model Version

After logging the trained model with MLflow, we assign the alias `"champion"` to the latest version of the `AircraftAnomalyPredictor` model in Unity Catalog.

Using aliases like `"champion"` provides a consistent and flexible way to reference models during inference, avoiding hardcoding of version numbers. This allows downstream pipelines or applications to always use the most recent approved version of the model, improving maintainability and deployment flexibility.
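The indirection an alias provides can be pictured as a small lookup table from alias name to version number. The toy in-memory registry below illustrates that idea only. `ToyRegistry` and its methods are hypothetical and are not the MLflow or Unity Catalog API.

```python
class ToyRegistry:
    """In-memory stand-in for a model registry (illustrative only)."""

    def __init__(self):
        self.versions = {}  # version number -> model object
        self.aliases = {}   # alias name -> version number

    def register(self, model):
        version = max(self.versions, default=0) + 1
        self.versions[version] = model
        return version

    def set_alias(self, alias, version):
        if version not in self.versions:
            raise KeyError(f"unknown version {version}")
        self.aliases[alias] = version

    def load(self, ref):
        # "@name" resolves through the alias table; anything else is a version number
        if isinstance(ref, str) and ref.startswith("@"):
            return self.versions[self.aliases[ref[1:]]]
        return self.versions[int(ref)]


registry = ToyRegistry()
v1 = registry.register("model-v1")
registry.set_alias("champion", v1)
before = registry.load("@champion")

v2 = registry.register("model-v2")
registry.set_alias("champion", v2)  # promote without touching consumer code
after = registry.load("@champion")

print(before, "->", after)
```

Consumers that load `"@champion"` pick up the promoted version automatically, while a pinned version number keeps returning the same model.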

This alias will later be used in the inference notebook to load the model as:
```python
model_uri = "models:/AircraftAnomalyPredictor@champion"
```

In [0]:
from mlflow import MlflowClient

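client = MlflowClient()
# This must match the name the model was registered under. Unity Catalog uses
# three-level names of the form <catalog>.<schema>.<model>.
model_name = "main.default.AircraftAnomalyPredictor"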

# Use search_model_versions for Unity Catalog compatibility
latest_versions = client.search_model_versions(f"name='{model_name}'")

# Get the highest version number
if latest_versions:
    latest_version = max(int(m.version) for m in latest_versions)
    client.set_registered_model_alias(
        name=model_name,
        alias="champion",
        version=latest_version
    )
    print(f"✅ Assigned 'champion' alias to version {latest_version} of model '{model_name}'")
else:
    print(f"❌ No versions found for model '{model_name}'")

### 🧪 Inference Example 1: Load Model by Version

This cell demonstrates how to load a specific version of the `AircraftAnomalyPredictor` model from Unity Catalog and run inference on a sample data point.

- `model_uri = "models:/AircraftAnomalyPredictor/2"`: Loads version 2 of the registered model.
- The input `DataFrame` includes all features expected by the model, such as rolling averages and maintenance metrics.
- The model outputs a binary prediction: `0` (Normal) or `1` (Anomalous).

In [0]:
# import pandas as pd
# import numpy as np
# import mlflow
# import mlflow.pyfunc

# # Disable autologging before inference
# mlflow.sklearn.autolog(disable=True)

# # Sample input (cast engine_rpm to match model's int32 schema)
# sample_input = pd.DataFrame([{
#     "engine_temp": 612.5,
#     "fuel_efficiency": 76.0,
#     "vibration": 5.1,
#     "altitude": 31000.0,
#     "airspeed": 460.0,
#     "oil_pressure": 58.5,
#     "engine_rpm": np.int32(3900),  # Critical: cast to match schema
#     "battery_voltage": 25.0,
#     "prev_anomaly": 0.0,
#     "days_since_maint": 20,
#     "avg_engine_temp_7d": 608.3,
#     "avg_vibration_7d": 5.05,
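#     "avg_rpm_7d": 3850.0  # float, to match the float column in the schema
# }]).astype({"engine_rpm": "int32"})  # explicit cast so the column dtype is int32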

# # Load using the pyfunc flavor (sklearn flavor causes issues sometimes)
# model_uri = "models:/AircraftAnomalyPredictor/2"
# loaded_model = mlflow.pyfunc.load_model(model_uri)

# # Predict
# prediction = loaded_model.predict(sample_input)
# print("Predicted label (0 = Normal, 1 = Anomaly):")
# print(prediction)

### 🧪 Inference Example 2: Load Model by Alias (Recommended)

Instead of referencing a model by version number, this approach uses a **named alias** (`@champion`) which allows for flexible model lifecycle management.

- Aliases make it easier to swap production models without changing consuming code.
- Ensure an alias such as `champion` has been set using the Unity Catalog Model Registry.

In [0]:
model_uri = "models:/AircraftAnomalyPredictor@champion"
loaded_model = mlflow.pyfunc.load_model(model_uri)

# Predict using sample_input from Example 1 (uncomment and run that cell first,
# otherwise sample_input is undefined)
print(loaded_model.predict(sample_input))

### 🧪 Inference Example 3: Batch Scoring on Recent Data

This cell demonstrates how to run the model against a batch of real feature data from the `sensor_features` table.

- Sample a few rows from the full feature set.
- Cast columns to the dtypes in the model signature (e.g., `engine_rpm` as `int32`).
- Generate predictions for the sampled batch.

Use this pattern for scoring new incoming data at scale.

In [0]:
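# Simulate scoring on a small batch drawn from training_df (loaded above),
# dropping the label so only feature columns are passed to the model
batch_df = training_df.sample(5).drop(columns=["anomaly_score"])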

# Ensure columns match schema
batch_df = batch_df.astype({
    "engine_rpm": "int32",
    "prev_anomaly": "float64",
    "days_since_maint": "int64"
})

predictions = loaded_model.predict(batch_df)
print("Batch Predictions:")
print(predictions)
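For inputs too large to score in one call, the same `predict` pattern can be applied chunk by chunk. The sketch below is one way to do that under stated assumptions: `score_in_chunks` is a hypothetical helper, and `fake_predict` is a stand-in for `loaded_model.predict` so the example runs without a registered model.

```python
import numpy as np
import pandas as pd


def score_in_chunks(predict_fn, df, chunk_size):
    """Apply predict_fn to df in row chunks and concatenate the results."""
    outputs = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        outputs.append(np.asarray(predict_fn(chunk)))
    return np.concatenate(outputs) if outputs else np.array([])


def fake_predict(chunk):
    # Stand-in for loaded_model.predict: flags high vibration (made-up rule)
    return (chunk["vibration"] > 6.0).astype(int).to_numpy()


toy = pd.DataFrame({"vibration": [5.0, 6.5, 7.1, 4.2, 6.1, 5.9, 8.0]})
scores = score_in_chunks(fake_predict, toy, chunk_size=3)
print(scores)
```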

### 🧪 Inference Example 4: End-to-End Inference Using the "champion" Alias

This example demonstrates how to load whichever model version currently carries the Unity Catalog alias `@champion` (the latest registered version, per the assignment above), which is well suited to production inference.

- ✅ **Model URI** is resolved using the alias instead of a static version number.
- 🧾 **Input features** must match the schema registered during training.
- 📈 **Output** is a predicted anomaly classification:
  - `0` = Normal behavior
  - `1` = Potential anomaly requiring attention

This is the preferred approach for deploying and serving models in production environments, ensuring smooth upgrades without code changes.

In [0]:
# 📦 Import necessary libraries
import pandas as pd
import mlflow

# 🔄 Load model from Unity Catalog using the "champion" alias
model_uri = "models:/AircraftAnomalyPredictor@champion"
loaded_model = mlflow.pyfunc.load_model(model_uri)

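# 🛫 Create a new sample input DataFrame
# This must contain every feature in the model signature, with matching dtypes
sample_input = pd.DataFrame([{
    "engine_temp": 610.0,
    "fuel_efficiency": 76.2,
    "vibration": 5.3,
    "altitude": 29950.0,
    "airspeed": 452.0,
    "oil_pressure": 61.0,
    "engine_rpm": 3900,
    "battery_voltage": 25.0,
    "prev_anomaly": 1.0,
    # Remaining signature features (values reused from Example 1)
    "avg_engine_temp_7d": 608.3,
    "avg_vibration_7d": 5.05,
    "avg_rpm_7d": 3850.0,
    "days_since_maint": 20
}]).astype({"engine_rpm": "int32"})  # training cast engine_rpm to int32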

# 🔍 Run inference
prediction = loaded_model.predict(sample_input)

# 📢 Display result
print("🧠 Predicted Anomaly (0 = Normal, 1 = Anomalous):", prediction[0])
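Because the input must match the schema registered during training, a lightweight pre-flight check can surface missing columns or dtype mismatches before the model is called. The sketch below is a minimal version of that idea. `check_schema` is a hypothetical helper, `EXPECTED_DTYPES` covers only a subset of the features for brevity, and none of this is MLflow's own validation logic.

```python
import pandas as pd

# Expected dtypes for a subset of features, mirroring the casts applied before training
EXPECTED_DTYPES = {
    "engine_temp": "float64",
    "engine_rpm": "int32",
    "prev_anomaly": "float64",
    "days_since_maint": "int64",
}


def check_schema(df, expected):
    """Return a list of human-readable schema problems (empty if none)."""
    problems = []
    for col, dtype in expected.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    return problems


# A frame with one missing column and one integer column of the wrong width
incomplete = pd.DataFrame([{"engine_temp": 610.0, "engine_rpm": 3900, "prev_anomaly": 1.0}])
print(check_schema(incomplete, EXPECTED_DTYPES))

# Add the missing column and cast explicitly
fixed = incomplete.assign(days_since_maint=20).astype(
    {"engine_rpm": "int32", "days_since_maint": "int64"}
)
print(check_schema(fixed, EXPECTED_DTYPES))
```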