# 03 – Model Training (Predictive Maintenance)

**Goal:**  
Train and evaluate machine-learning models to predict whether a machine
will fail within the next 72 hours using the engineered dataset
from Week 2.

We will:
- Load `dataset_ready_for_model.csv`
- Split into train/test sets
- Train Logistic Regression, Random Forest, and Gradient Boosting models
- Compare Accuracy, Precision, Recall, F1, and ROC–AUC
- Select the best model
- Save the trained model and metrics for later use

In [8]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score
)

import joblib
import json
import os

## 1. Load Engineered Dataset

We use the Week 2 output:

- `../data/dataset_ready_for_model.csv` (path relative to this notebook)

This dataset already contains:
- engineered rolling and trend features
- the target column `failure_within_72h`.


In [9]:
data_path = "../data/dataset_ready_for_model.csv"
df = pd.read_csv(data_path)

print("Shape:", df.shape)
df.head()


Shape: (42740, 25)


| | machine_id | reading_time | temperature | vibration | pressure | current | rpm | status_code | temp_mean_6h | temp_std_6h | ... | vib_std_12h | temp_mean_24h | temp_std_24h | vib_mean_24h | vib_std_24h | temp_delta_1h | vib_delta_1h | temp_delta_6h | vib_delta_6h | failure_within_72h |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2023-01-01 23:00:00 | 57.150504 | 0.937167 | 30.24298 | 9.870475 | 1563.794434 | 0 | 59.187442 | 2.202235 | ... | 0.106564 | 59.704723 | 1.947517 | 1.060162 | 0.089745 | -2.984553 | -0.133819 | -3.477991 | 0.036166 | 0 |
| 1 | 1 | 2023-01-02 00:00:00 | 58.911235 | 1.12646 | 29.902485 | 10.259461 | 1503.926682 | 0 | 59.308655 | 2.155474 | ... | 0.100237 | 59.617965 | 1.933944 | 1.065017 | 0.090061 | 1.760731 | 0.189293 | 0.727283 | 0.056084 | 0 |
| 2 | 1 | 2023-01-02 01:00:00 | 60.221845 | 0.997305 | 30.119151 | 9.905707 | 1437.663984 | 0 | 59.816397 | 1.895608 | ... | 0.101074 | 59.638731 | 1.937798 | 1.060546 | 0.090671 | 1.310611 | -0.129155 | 3.046453 | 0.035122 | 0 |
| 3 | 1 | 2023-01-02 02:00:00 | 57.698013 | 0.987546 | 29.613272 | 9.712392 | 1431.948007 | 0 | 58.944183 | 1.279613 | ... | 0.102235 | 59.488841 | 1.943205 | 1.053577 | 0.089533 | -2.523832 | -0.009759 | -5.233285 | -0.284031 | 0 |
| 4 | 1 | 2023-01-02 03:00:00 | 60.751396 | 0.989948 | 30.191727 | 10.337799 | 1480.568773 | 0 | 59.144675 | 1.472871 | ... | 0.099032 | 59.39323 | 1.812636 | 1.053666 | 0.089465 | 3.053383 | 0.002401 | 1.202949 | -0.035349 | 0 |


## 2. Define Features and Target

Target:
- `failure_within_72h` (1 = failure within the next 72 hours, 0 = otherwise)

Features:
- All remaining numeric columns.

In [14]:
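# Name of the label column (defined here so the cell runs on its own)
target_col = "failure_within_72h"

X = df.drop(columns=[target_col])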

# Remove datetime column (not numeric)
if "reading_time" in X.columns:
    X = X.drop(columns=["reading_time"])

y = df[target_col]
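
The cell above removes `reading_time` by name. A more general way to keep "all remaining numeric columns" is `select_dtypes`. The following is a minimal sketch on a tiny made-up frame (`toy`, `X_toy`, and `y_toy` are illustrative names, and the values are not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for df; values are made up for illustration.
toy = pd.DataFrame({
    "machine_id": [1, 1, 2],
    "reading_time": ["2023-01-01 23:00:00", "2023-01-02 00:00:00", "2023-01-02 01:00:00"],
    "temperature": [57.2, 58.9, 60.2],
    "failure_within_72h": [0, 0, 1],
})

target_col = "failure_within_72h"

# Keep numeric columns only, then remove the target from the feature matrix.
X_toy = toy.select_dtypes(include="number").drop(columns=[target_col])
y_toy = toy[target_col]

print(list(X_toy.columns))
```

Note that `machine_id` is numeric and therefore stays in the feature matrix, as it does in the cell above. Whether an identifier should be used as a feature is a separate modelling decision.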

## 3. Train/Test Split

We split the data into:

- 80% training
- 20% testing

We use `stratify=y` so the failure rate is similar in both sets.

In [15]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

X_train.shape, X_test.shape

((34192, 23), (8548, 23))
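
To see what `stratify=y` buys us, it can help to compare the positive rate in both splits. The sketch below uses synthetic labels with an assumed 5% failure rate (the real rate is not shown in this notebook), so the names `X_demo` and `y_demo` are stand-ins:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in: 1,000 rows with 5% positives (assumed rate, for illustration).
y_demo = np.zeros(1000, dtype=int)
y_demo[:50] = 1
rng.shuffle(y_demo)
X_demo = rng.normal(size=(1000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification both rates should stay close to the overall 5%.
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

In this notebook the equivalent check would be `y_train.mean(), y_test.mean()`.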

## 4. Define Models

We compare three models:

1. **Logistic Regression**
   - Linear baseline model
   - Uses `StandardScaler` on features

2. **Random Forest**
   - Tree ensemble, good for non-linear patterns

3. **Gradient Boosting**
   - Boosted trees, often strong performance on tabular data

In [16]:
models = {
    "log_reg": Pipeline(steps=[
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(
            max_iter=1000,
            class_weight="balanced",
            random_state=42
        ))
    ]),
    "random_forest": RandomForestClassifier(
        n_estimators=200,
        max_depth=None,
        n_jobs=-1,
        class_weight="balanced",
        random_state=42
    ),
    "grad_boost": GradientBoostingClassifier(
        random_state=42
    )
}

list(models.keys())

['log_reg', 'random_forest', 'grad_boost']
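
Wrapping `StandardScaler` and `LogisticRegression` in one `Pipeline` means the scaling statistics are learned from the rows passed to `fit` and then reused at prediction time, so test rows never influence the scaler. A minimal sketch on synthetic data (the feature values and the label rule are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales (illustrative only).
X_tr = rng.normal(loc=[60.0, 1500.0], scale=[2.0, 50.0], size=(200, 2))
y_tr = (X_tr[:, 0] > 60.0).astype(int)
X_te = rng.normal(loc=[60.0, 1500.0], scale=[2.0, 50.0], size=(50, 2))

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

# The scaler's mean comes from the training rows only.
scaler = pipe.named_steps["scaler"]
print(np.allclose(scaler.mean_, X_tr.mean(axis=0)))

# predict() applies those training statistics to the test rows before classifying.
test_pred = pipe.predict(X_te)
print(test_pred[:5])
```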

## 5. Train and Evaluate Models

For each model we compute:

- Accuracy
- Precision
- Recall
- F1-score
- ROC–AUC

ROC–AUC uses the predicted probabilities of class 1.

In [17]:
results = []

for name, model in models.items():
    print(f"Training model: {name}")
    
    # Fit model
    model.fit(X_train, y_train)
    
    # Predictions
    y_pred = model.predict(X_test)
    
    # Predicted probabilities for ROC–AUC
    if hasattr(model, "predict_proba"):
        y_proba = model.predict_proba(X_test)[:, 1]
    else:
        # Fallback for models without predict_proba
        y_scores = model.decision_function(X_test)
        y_proba = (y_scores - y_scores.min()) / (y_scores.max() - y_scores.min())
    
    # Metrics
    acc  = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, zero_division=0)
    rec  = recall_score(y_test, y_pred, zero_division=0)
    f1   = f1_score(y_test, y_pred, zero_division=0)
    auc  = roc_auc_score(y_test, y_proba)
    
    results.append({
        "model": name,
        "accuracy": acc,
        "precision": prec,
        "recall": rec,
        "f1": f1,
        "roc_auc": auc
    })

results_df = pd.DataFrame(results)
results_df


Training model: log_reg
Training model: random_forest
Training model: grad_boost


| | model | accuracy | precision | recall | f1 | roc_auc |
|---|---|---|---|---|---|---|
| 0 | log_reg | 0.991343 | 0.932039 | 0.987429 | 0.958935 | 0.9994 |
| 1 | random_forest | 0.995788 | 1.0 | 0.958857 | 0.978996 | 0.99799 |
| 2 | grad_boost | 0.995905 | 0.992958 | 0.966857 | 0.979734 | 0.998539 |
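
ROC–AUC depends only on how the scores rank the test rows, which is why the min-max rescaling in the fallback branch of the loop does not change it, and why it is computed from probabilities rather than from the hard 0/1 predictions. A small sketch with made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
# Made-up decision scores (any real-valued ranking would do).
scores = np.array([-2.0, 0.3, 0.1, 1.5, -0.5, 2.2])

# Min-max rescaling to [0, 1], as in the fallback branch of the loop.
scaled = (scores - scores.min()) / (scores.max() - scores.min())

auc_raw = roc_auc_score(y_true, scores)
auc_scaled = roc_auc_score(y_true, scaled)

# Thresholding first throws away the ordering within each predicted class.
auc_hard = roc_auc_score(y_true, (scores > 0).astype(int))

print(auc_raw, auc_scaled, auc_hard)
```

The raw and rescaled scores give the same value, while the thresholded predictions give a lower one in this example.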


## 6. Select Best Model (by ROC–AUC)

We sort all models by ROC–AUC and choose the best one.

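ROC–AUC is threshold-independent and, unlike accuracy, is not inflated by always predicting the majority class, which makes it a reasonable choice when failures are rare. Note that the three scores are very close (0.9980 to 0.9994), so the ranking is decided by a small margin.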

In [18]:
results_df_sorted = results_df.sort_values("roc_auc", ascending=False)
results_df_sorted

| | model | accuracy | precision | recall | f1 | roc_auc |
|---|---|---|---|---|---|---|
| 0 | log_reg | 0.991343 | 0.932039 | 0.987429 | 0.958935 | 0.9994 |
| 2 | grad_boost | 0.995905 | 0.992958 | 0.966857 | 0.979734 | 0.998539 |
| 1 | random_forest | 0.995788 | 1.0 | 0.958857 | 0.978996 | 0.99799 |
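
Because the ROC–AUC values differ only in the third decimal place on a single split, one way to check whether the ranking is stable is stratified cross-validation. The sketch below is an assumption-laden illustration: it uses `make_classification` as a stand-in for the training data and a reduced `candidates` dict rather than the notebook's `models`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the training data (not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=42
)

candidates = {
    "log_reg": Pipeline(steps=[
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
    ]),
    "random_forest": RandomForestClassifier(
        n_estimators=50, class_weight="balanced", random_state=42
    ),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Collect mean and standard deviation of ROC-AUC across the folds.
cv_summary = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, scoring="roc_auc", cv=cv)
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.4f} std={scores.std():.4f}")
```

In this notebook the same loop could be run over `models.items()` with `X_train` and `y_train`. If the fold-to-fold standard deviation is larger than the gap between models, the choice of "best" model is not strongly supported by ROC–AUC alone.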


In [19]:
best_row = results_df_sorted.iloc[0]
best_model_name = best_row["model"]
print("Best model:", best_model_name)
best_row

Best model: log_reg


model         log_reg
accuracy     0.991343
precision    0.932039
recall       0.987429
f1           0.958935
roc_auc        0.9994
Name: 0, dtype: object

In [20]:
best_model = models[best_model_name]

# Refit on the training set (redundant, since the loop above already fitted it, but harmless)
best_model.fit(X_train, y_train)

## 7. Feature Importance (Tree-Based Models Only)

If the best model is tree-based (Random Forest or Gradient Boosting),
we plot the top 15 most important features to understand which signals
contribute most to predicting failures.

In [23]:
import matplotlib.pyplot as plt

importances = None

# Check if best model is a tree model
if best_model_name in ["random_forest", "grad_boost"]:
    clf = best_model  # estimator is not in a pipeline for these models
    
    # Check that the model has feature_importances_
    if hasattr(clf, "feature_importances_"):
        importances = pd.DataFrame({
            "feature": X_train.columns,  # feature names in training order
            "importance": clf.feature_importances_
        }).sort_values("importance", ascending=False)
        
        # Show top 15 in table
        display(importances.head(15))
        
        # Plot top 15 features
        top_n = 15
        top_features = importances.head(top_n)
        
        plt.figure(figsize=(10, 6))
        plt.barh(top_features["feature"], top_features["importance"])
        plt.title("Top 15 Important Features")
        plt.xlabel("Importance")
        plt.ylabel("Feature")
        plt.gca().invert_yaxis()  # highest at top
        plt.tight_layout()
        plt.show()
        
    else:
        print("Best model has no feature_importances_ attribute.")
else:
    print(f"Best model ({best_model_name}) is not tree-based; skipping feature importance.")


Best model (log_reg) is not tree-based; skipping feature importance.
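
Since the selected model is the logistic-regression pipeline, the tree-based plot is skipped. One possible substitute is to look at the coefficients of the fitted classifier, which are roughly comparable in magnitude because the features were standardized. The sketch below runs on synthetic data: the column names are borrowed from the dataset, but the values and the label rule are made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for X_train / y_train; values and label rule are made up.
X_demo = pd.DataFrame({
    "temperature": rng.normal(60, 2, 500),
    "vibration": rng.normal(1.0, 0.1, 500),
    "pressure": rng.normal(30, 0.3, 500),
})
y_demo = (X_demo["vibration"] + rng.normal(0, 0.05, 500) > 1.05).astype(int)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42)),
])
pipe.fit(X_demo, y_demo)

# coef_ has shape (1, n_features) for a binary problem.
coefs = pd.DataFrame({
    "feature": X_demo.columns,
    "coefficient": pipe.named_steps["clf"].coef_[0],
})
coefs["abs_coefficient"] = coefs["coefficient"].abs()
coefs = coefs.sort_values("abs_coefficient", ascending=False).reset_index(drop=True)

print(coefs)
```

With the notebook's objects, the same idea would read `best_model.named_steps["clf"].coef_[0]` paired with `X_train.columns`. Coefficients of correlated features (such as overlapping rolling windows) should be interpreted with care.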


## 8. Save Best Model and Evaluation Metrics

We save:

- Best model object → `../models/best_model.pkl`
- All metrics for each model → `../models/model_metrics.json`

These will be used for documentation, deployment, or future dashboards.


In [24]:
# Ensure models directory exists
os.makedirs("../models", exist_ok=True)

model_path = "../models/best_model.pkl"
metrics_path = "../models/model_metrics.json"

# Save trained best model
joblib.dump(best_model, model_path)

# Save full metrics table (sorted)
metrics_dict = results_df_sorted.to_dict(orient="records")

with open(metrics_path, "w") as f:
    json.dump({
        "best_model": best_model_name,
        "results": metrics_dict
    }, f, indent=2)

print("Saved model to:", model_path)
print("Saved metrics to:", metrics_path)

Saved model to: ../models/best_model.pkl
Saved metrics to: ../models/model_metrics.json
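
A later notebook or service can reload both artifacts with `joblib.load` and `json.load`. The following is a self-contained sketch of that round trip. It writes a tiny stand-in model to a temporary directory rather than touching `../models`, and the metrics content is a placeholder.

```python
import json
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model so the example is self-contained.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "best_model.pkl")
    metrics_path = os.path.join(tmp_dir, "model_metrics.json")

    # Save in the same way as the cell above (placeholder metrics).
    joblib.dump(model, model_path)
    with open(metrics_path, "w") as f:
        json.dump({"best_model": "log_reg", "results": []}, f, indent=2)

    # Reload both artifacts.
    loaded_model = joblib.load(model_path)
    with open(metrics_path) as f:
        loaded_metrics = json.load(f)

# The reloaded model should reproduce the original predictions.
same_predictions = np.array_equal(model.predict(X_demo), loaded_model.predict(X_demo))
print(loaded_metrics["best_model"], same_predictions)
```

Pickled scikit-learn models are tied to the library version they were saved with, so loading them in an environment with the same scikit-learn version is the safest option. Pickle files should only be loaded from trusted sources.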


# Week 3 Summary – Model Training

In this notebook, we:

- Loaded the engineered dataset: `dataset_ready_for_model.csv`
- Defined features and the target `failure_within_72h`
- Split the data into train (80%) and test (20%) sets
- Trained and evaluated:
  - Logistic Regression (with scaling)
  - Random Forest
  - Gradient Boosting
- Compared models using:
  - Accuracy, Precision, Recall, F1-score, ROC–AUC
- Selected the best model based on ROC–AUC
- Saved:
  - Trained best model → `../models/best_model.pkl`
  - Evaluation metrics → `../models/model_metrics.json`

Next step (Week 4) – Optional:
- Integrate results with Azure SQL and/or a reporting dashboard (Power BI, Flask, etc.).