# DS3000 Baseline Modeling

This notebook documents the Milestone 3 **baseline models** for our inverter thermal-safety problem.  
We work with pre-cleaned **1 Hz inverter telemetry** and construct two simple reference models:

1. **Logistic regression** with class balancing to detect **overheating events** (hot-spot temperature at or above 65 °C).
2. **Ridge regression** to forecast the **30-second hot-spot temperature delta** (ΔT₃₀s) as a short‑horizon proxy for thermal risk.

The goal here is **not** to find the best possible model, but to:

- Define a **reproducible baseline pipeline** using scikit‑learn.
- Make all feature/label choices explicit and documented.
- Produce metrics that later, more complex models (trees, ensembles, deep nets, etc.) must **beat** to be considered useful.


In [3]:
# Imports, path setup, and label engineering
# - Import core Python, pandas/numpy, plotting, and scikit‑learn utilities.
# - Locate the cleaned 1 Hz inverter CSV in the `clean/` data directory.
# - Load the time‑ordered dataframe and ensure it is sorted by `timestamp`.
# - If needed, create convenience label columns:
#     * `overheat_label` (binary) from a 65 °C hot‑spot threshold.
#     * `inv_hot_spot_temp_future` (30 s look‑ahead temperature).
#     * `delta_T_30s` (future minus current hot‑spot temperature).

from pathlib import Path

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    mean_absolute_error,
    mean_squared_error,
)

plt.style.use('seaborn-v0_8')
sns.set_theme(style='whitegrid')

CWD = Path.cwd()
if (CWD / 'clean').exists():
    CLEAN_DIR = CWD / 'clean'
elif (CWD / 'src' / 'notebooks' / 'clean').exists():
    CLEAN_DIR = CWD / 'src' / 'notebooks' / 'clean'
elif (CWD / 'src' / 'clean').exists():
    CLEAN_DIR = CWD / 'src' / 'clean'
else:
    raise FileNotFoundError('Could not locate clean/ directory')

PRIMARY_PATH = CLEAN_DIR / 'inverter_labeled_1hz.csv'
FALLBACK_PATH = CLEAN_DIR / 'inverter_merged_1hz.csv'
DATA_PATH = PRIMARY_PATH if PRIMARY_PATH.exists() else FALLBACK_PATH

print('Loading', DATA_PATH)
df = pd.read_csv(DATA_PATH, parse_dates=['timestamp']).sort_values('timestamp')

if 'overheat_label' not in df.columns:
    df['overheat_label'] = (df['inv_hot_spot_temp_mean'] >= 65).astype(int)
if 'inv_hot_spot_temp_future' not in df.columns:
    # shift(-30) is a 30 s look-ahead only if rows are exactly 1 s apart with no gaps
    df['inv_hot_spot_temp_future'] = df['inv_hot_spot_temp_mean'].shift(-30)
if 'delta_T_30s' not in df.columns:
    df['delta_T_30s'] = df['inv_hot_spot_temp_future'] - df['inv_hot_spot_temp_mean']


Loading /Users/omarramadan/Desktop/Western_Stuff/2025-26/SE3000 - Intro to Machine Learning/GroupProject/DS3000-Battery-Analyzer/src/notebooks/clean/inverter_labeled_1hz.csv


| Unnamed: 0 | dc_raw_sample_count | inv_control_board_temp_count | inv_control_board_temp_max | inv_control_board_temp_mean | inv_control_board_temp_min | inv_coolant_temp_count | inv_coolant_temp_max | inv_coolant_temp_mean | inv_coolant_temp_min | inv_dc_bus_current_count | ... | inv_phase_c_current_count | inv_phase_c_current_max | inv_phase_c_current_mean | inv_phase_c_current_min | phase_raw_sample_count | temps_raw_sample_count | timestamp | overheat_label | inv_hot_spot_temp_future | delta_T_30s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 4.0 | 4.0 | 30.0 | 29.925 | 29.9 | 4.0 | 2.1 | 1.95 | 1.8 | 4.0 | ... | 4.0 | 0.2 | 0.09375 | 0.04 | 4.0 | 4.0 | 2025-06-03 18:56:10 | 0 | 31.25 | 29.3 |
| 1 | 10.0 | 10.0 | 30.5 | 30.39 | 30.2 | 10.0 | 4.3 | 3.21 | 2.3 | 10.0 | ... | 10.0 | 0.0833 | -0.07967 | -0.52 | 10.0 | 10.0 | 2025-06-03 18:56:11 | 0 | 31.34 | 28.13 |
| 2 | 10.0 | 10.0 | 30.8 | 30.71 | 30.6 | 10.0 | 7.5 | 6.07 | 4.6 | 10.0 | ... | 10.0 | 0.2 | -0.00133 | -0.22 | 10.0 | 10.0 | 2025-06-03 18:56:12 | 0 | 31.4 | 25.33 |
| 3 | 10.0 | 8.0 | 31.0 | 30.8625 | 30.7 | 8.0 | 10.1 | 8.99375 | 7.8 | 10.0 | ... | 10.0 | 0.225 | 0.01317 | -0.14 | 10.0 | 9.0 | 2025-06-03 18:56:13 | 0 | 31.5 | 22.50625 |
| 4 | 10.0 | 10.0 | 31.0 | 30.89 | 30.8 | 10.0 | 12.7 | 11.58 | 10.4 | 10.0 | ... | 10.0 | 0.14 | -0.085 | -0.3 | 10.0 | 10.0 | 2025-06-03 18:56:14 | 0 | 31.52 | 19.94 |
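
The fallback `shift(-30)` in the cell above corresponds to a 30-second look-ahead only when consecutive rows are exactly one second apart. The following is a minimal sketch of a timestamp-based alternative, assuming unique timestamps. The helper name `add_time_based_delta` and the synthetic series are illustrative and not part of the project code.

```python
import numpy as np
import pandas as pd


def add_time_based_delta(frame, temp_col="inv_hot_spot_temp_mean", horizon_s=30):
    """Hypothetical helper: look up the temperature horizon_s seconds ahead by timestamp."""
    out = frame.sort_values("timestamp").reset_index(drop=True)
    lookup = out.set_index("timestamp")[temp_col]  # assumes unique timestamps
    future_times = pd.DatetimeIndex(out["timestamp"] + pd.Timedelta(seconds=horizon_s))
    # reindex yields NaN wherever no row exists exactly horizon_s seconds later
    out["inv_hot_spot_temp_future"] = lookup.reindex(future_times).to_numpy()
    out["delta_T_30s"] = out["inv_hot_spot_temp_future"] - out[temp_col]
    return out


# Synthetic 1 Hz series with a 10-second gap after t = 59 s (illustrative data only)
seconds = np.r_[np.arange(0, 60), np.arange(70, 100)]
demo = pd.DataFrame({
    "timestamp": pd.Timestamp("2025-01-01") + pd.to_timedelta(seconds, unit="s"),
    "inv_hot_spot_temp_mean": 20.0 + 0.5 * seconds,
})

by_time = add_time_based_delta(demo)
by_rows = demo["inv_hot_spot_temp_mean"].shift(-30) - demo["inv_hot_spot_temp_mean"]

print(by_time.loc[35, "delta_T_30s"])  # NaN: no sample exists at t = 65 s
print(by_rows.loc[35])                 # the row offset silently spans 40 s instead of 30 s
```

If the cleaned file has no gaps, both approaches give the same values; the timestamp-based version simply makes a gap visible as a missing target instead of a mislabeled one.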


## 1. Feature Selection & Chronological Split

We start by defining which columns are allowed as **inputs** and how we split the data into train/test sets.

- We restrict predictors to telemetry columns ending in **`_mean`** (e.g., temperatures, voltages, currents).  
  These are relatively smooth, aggregated signals that are easier to model and less noisy than raw samples.
- We **explicitly exclude** columns that directly encode the target:
  - `inv_hot_spot_temp_mean` (current hot-spot temperature)
  - `inv_hot_spot_temp_future` (look‑ahead hot-spot temperature)
  - `delta_T_30s` (the 30-second temperature delta)
- Any row that is missing either the features or the label(s) is dropped with `dropna`.  
  This keeps the training set internally consistent and avoids silent NaN propagation.

For the split:

- We perform an **80/20 chronological split** by row position (first ~80% of the time-sorted rows for training, last ~20% for testing).
- We do **not shuffle** the data, because time ordering matters and random shuffling would leak information from the future into the past.
- The same feature set is later reused for both:
  - the **classification** task (`overheat_label`), and
  - the **regression** task (`delta_T_30s`).

This gives us a clean, time‑respecting partition that we can reuse across multiple models.


In [4]:
# Feature selection and chronological train/test split
# - Build `feature_cols` from all `_mean` telemetry columns, excluding target‑related fields.
# - Drop any rows that are missing features or labels to keep the modeling dataframe clean.
# - Perform an 80/20 **chronological** split (no shuffling) to avoid future‑to‑past leakage.
# - Extract `X_train`, `X_test` and the `overheat_label` targets for the classification baseline.
# - The same feature set will later be reused for the regression target `delta_T_30s`.

exclude_cols = {'inv_hot_spot_temp_mean', 'inv_hot_spot_temp_future', 'delta_T_30s'}
feature_cols = [c for c in df.columns if c.endswith('_mean') and c not in exclude_cols]

model_df = df.dropna(subset=feature_cols + ['overheat_label'])
split_idx = int(len(model_df) * 0.8)
train_df = model_df.iloc[:split_idx]
test_df = model_df.iloc[split_idx:]

X_train = train_df[feature_cols]
X_test = test_df[feature_cols]
y_train = train_df['overheat_label']
y_test = test_df['overheat_label']

print(f"Features: {len(feature_cols)}")
print(f"Train rows: {len(train_df)} | Test rows: {len(test_df)}")
print('First few features:', feature_cols[:10])


Features: 11
Train rows: 2658 | Test rows: 665


['inv_control_board_temp_mean',
 'inv_coolant_temp_mean',
 'inv_dc_bus_current_mean',
 'inv_dc_bus_voltage_mean',
 'inv_gate_driver_board_temp_mean',
 'inv_module_a_temp_mean',
 'inv_module_b_temp_mean',
 'inv_module_c_temp_mean',
 'inv_phase_a_current_mean',
 'inv_phase_b_current_mean']
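
For targets built with a look-ahead (such as `delta_T_30s`), the last training rows have targets computed from readings that fall inside the test period. One possible refinement, sketched here under the assumption that the frame is already sorted by time, is to leave a gap between the two splits. The helper `chrono_split` and its `gap_rows` parameter are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd


def chrono_split(frame, train_frac=0.8, gap_rows=0):
    """Hypothetical helper: positional chronological split with an optional gap.

    Assumes `frame` is already sorted by time. The `gap_rows` rows right after
    the training block are assigned to neither split.
    """
    split_idx = int(len(frame) * train_frac)
    train = frame.iloc[:split_idx]
    test = frame.iloc[split_idx + gap_rows:]
    return train, test


# Synthetic, time-sorted frame standing in for the modeling dataframe
demo = pd.DataFrame({
    "timestamp": pd.date_range("2025-01-01", periods=200, freq="s"),
    "x": range(200),
})

train_demo, test_demo = chrono_split(demo, train_frac=0.8, gap_rows=30)

# Every training timestamp precedes every test timestamp
assert train_demo["timestamp"].max() < test_demo["timestamp"].min()
print(len(train_demo), len(test_demo))  # 160 10
```

With `gap_rows=0` the helper reproduces the plain 80/20 split used above; a gap equal to the look-ahead horizon trades a few test rows for a cleaner separation.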

## 2. Logistic Regression Baseline

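Our first model is a **logistic regression classifier** for `overheat_label`. Note that the fallback label in the first cell marks rows whose *current* hot-spot temperature is at or above 65 °C, so with that definition the model flags overheating as it happens; predicting an event ahead of time would require a label shifted forward in time.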

Pipeline details:

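- **StandardScaler**: each feature is standardized to zero mean and unit variance.  
  This matters because scikit-learn's `LogisticRegression` applies L2 regularization by default, which penalizes coefficients unevenly when features sit on different scales, and because the solver converges more reliably on standardized inputs.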
- **LogisticRegression** with `class_weight='balanced'`:
  - Overheating events are typically **rare** compared to normal operation.
  - Using a balanced class weight up‑weights the minority (overheat) class, so the model does not simply predict “no overheat” all the time.

Training & evaluation:

- The model is trained on the chronologically earlier **train split** and evaluated on the held‑out **test split**.
- We compute:
  - `classification_report` (precision, recall, F1‑score per class),
  - **ROC AUC** using the predicted probabilities for the positive class,
  - the **confusion matrix**, which shows the trade‑off between missed overheat events (false negatives) and false alarms (false positives).

This model serves as a **simple, interpretable baseline**; any future classifier should beat its ROC AUC and F1‑score to justify added complexity.
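
To make the effect of `class_weight='balanced'` concrete, the sketch below reproduces the weights scikit-learn derives from label counts, namely `n_samples / (n_classes * count_per_class)`. The label array is synthetic and chosen only for illustration; it is not the notebook's training data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels: 90 normal rows, 10 overheat rows (illustrative only)
y_demo = np.array([0] * 90 + [1] * 10)

# 'balanced' weights follow n_samples / (n_classes * count_per_class)
manual = len(y_demo) / (2 * np.bincount(y_demo))
from_sklearn = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)

print(manual)        # about 0.556 for class 0 and 5.0 for class 1
print(from_sklearn)  # same values, computed by scikit-learn
```

Each overheat row therefore counts roughly nine times as much as a normal row in the loss, which pushes the model toward higher recall on the rare class at the cost of more false alarms.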


In [8]:
# Logistic regression baseline: training and evaluation
# - Wrap preprocessing and the classifier in a scikit‑learn `Pipeline`:
#     * `StandardScaler` for feature normalization.
#     * `LogisticRegression` with `class_weight='balanced'` to handle class imbalance.
# - Fit on the training portion of the time series.
# - Use the trained model to obtain class predictions and probabilities on the test set.
# - Report precision/recall/F1 per class, ROC‑AUC, and the confusion matrix
#   as reference metrics for future, more advanced classifiers.

clf = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=1000, class_weight='balanced')),
])

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred, digits=3))
print('ROC AUC:', roc_auc_score(y_test, y_prob))
print('Confusion matrix:', confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           0      0.998     0.857     0.922       602
           1      0.419     0.984     0.588        63

    accuracy                          0.869       665
   macro avg      0.708     0.921     0.755       665
weighted avg      0.943     0.869     0.891       665

ROC AUC: 0.9937509887676
Confusion matrix: [[516  86]
 [  1  62]]
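
The confusion matrix can be read directly: scikit-learn puts true classes in rows and predicted classes in columns, so the test split contains 1 missed overheat row and 86 false alarms. The short worked example below recomputes the overheat-class metrics from those four counts; only the variable names are new.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = true class, columns = predicted class)
cm = np.array([[516, 86],
               [1, 62]])
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)        # 62 / 148
recall_1 = tp / (tp + fn)           # 62 / 63
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
false_alarm_rate = fp / (fp + tn)   # 86 / 602

print(f"precision={precision_1:.3f} recall={recall_1:.3f} f1={f1_1:.3f}")
# precision=0.419 recall=0.984 f1=0.588
print(f"false alarm rate={false_alarm_rate:.3f}")
# false alarm rate=0.143
```

These values match the `classification_report` row for class 1, and they show where the baseline is weak: recall is high, but fewer than half of the raised alarms correspond to a labeled overheat row.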


## 3. Ridge Regression for 30s Delta

The second baseline is a **regression** model that predicts the **30-second hot‑spot temperature change**, `delta_T_30s`.

Problem setup:

- We define the target column `delta_T_30s` as the difference between:
  - the hot‑spot temperature **30 seconds in the future**, and  
  - the **current** hot‑spot temperature.
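- Rows where the future value is not available (e.g., the final rows of the time series, where the 30-second look-ahead runs past the end) have a missing target and are removed by the `dropna` call in the regression cell.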
- We reuse the same `_mean` feature set as in the classification baseline, so both models operate on an identical input space.

Model:

- We use a **Ridge regression** pipeline:
  - `StandardScaler` to normalize inputs,
  - `Ridge(alpha=1.0)` as a simple L2‑regularized linear regressor.
- L2 regularization shrinks coefficients and reduces overfitting, especially when features are correlated.

Metrics:

- We evaluate on the held‑out test split using:
  - **MAE** (mean absolute error) in °C, which is easy to interpret as an average absolute miss in temperature,
  - **RMSE** (root mean squared error) in °C, which penalizes large errors more strongly.
- Together, MAE and RMSE give a first sense of how precise our short‑horizon temperature forecasts are.

Again, this is only a **baseline**: later models (e.g., tree‑based or sequence models) should deliver lower MAE/RMSE on the same target.
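
The difference between the two metrics is easiest to see on a toy case. In the sketch below, two hypothetical error vectors have the same MAE, but the one that concentrates the error in a single large miss has twice the RMSE. The helper `mae_rmse` and the numbers are illustrative only.

```python
import numpy as np


def mae_rmse(errors):
    """Return (MAE, RMSE) for a vector of prediction errors in °C."""
    errors = np.asarray(errors, dtype=float)
    mae = float(np.mean(np.abs(errors)))
    rmse = float(np.sqrt(np.mean(errors ** 2)))
    return mae, rmse


spread_out = [1.0, 1.0, 1.0, 1.0]     # four 1 °C misses
concentrated = [0.0, 0.0, 0.0, 4.0]   # one 4 °C miss, three perfect predictions

print(mae_rmse(spread_out))    # (1.0, 1.0)
print(mae_rmse(concentrated))  # (1.0, 2.0)
```

A large gap between RMSE and MAE on real data is therefore a hint that a few large misses, rather than uniformly mediocre predictions, dominate the error.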


In [9]:
# Ridge regression baseline: 30 s temperature delta
# - Use `delta_T_30s` as the continuous regression target.
# - Construct train/test splits (chronological) for this target using the same feature set.
# - Build a `Pipeline` consisting of `StandardScaler` + `Ridge(alpha=1.0)`.
# - Fit the model on training data and predict on the held‑out test split.
# - Compute MAE and RMSE in °C to quantify short‑horizon forecast error.

reg_target = 'delta_T_30s'
target_df = df.dropna(subset=feature_cols + [reg_target])
split_idx_reg = int(len(target_df) * 0.8)
train_reg = target_df.iloc[:split_idx_reg]
test_reg = target_df.iloc[split_idx_reg:]

X_train_reg = train_reg[feature_cols]
X_test_reg = test_reg[feature_cols]
y_train_reg = train_reg[reg_target]
y_test_reg = test_reg[reg_target]

reg = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge(alpha=1.0))
])
reg.fit(X_train_reg, y_train_reg)
pred_reg = reg.predict(X_test_reg)

mae = mean_absolute_error(y_test_reg, pred_reg)
rmse = mean_squared_error(y_test_reg, pred_reg) ** 0.5
print(f'MAE: {mae:.3f} °C')
print(f'RMSE: {rmse:.3f} °C')


MAE: 4.669 °C
RMSE: 9.270 °C
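
An RMSE roughly twice the MAE suggests that a small number of large misses dominate the squared error. One way to put such numbers in context is to compare against a naive reference that always predicts zero temperature change. The sketch below illustrates the comparison on synthetic data; `y_true_demo`, `y_model_demo`, and the skill score are assumptions for illustration and say nothing about how the Ridge model above actually compares.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
y_true_demo = rng.normal(loc=0.0, scale=2.0, size=500)        # synthetic temperature deltas
y_model_demo = y_true_demo + rng.normal(scale=1.0, size=500)  # stand-in for model predictions
y_naive = np.zeros_like(y_true_demo)                          # "no change" reference

mae_model = mean_absolute_error(y_true_demo, y_model_demo)
mae_naive = mean_absolute_error(y_true_demo, y_naive)

# Skill > 0 means the model beats the naive reference; <= 0 means it does not
skill = 1.0 - mae_model / mae_naive
print(f"model MAE={mae_model:.3f}  naive MAE={mae_naive:.3f}  skill={skill:.3f}")
```

Applied to the notebook, the same comparison would use `y_test_reg` and `pred_reg` in place of the synthetic arrays.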


## 4. Next Steps
- Persist trained pipelines (`joblib.dump`) under `models/baseline/`.
- Explore hyperparameter sweeps and alternate algorithms (tree-based, sequential) for Milestone 4.
- Integrate scripts into automation to track metrics across commits.
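
A minimal sketch of the persistence step, assuming `joblib` is available and using a temporary directory in place of the real `models/baseline/` folder. The file name `ridge_baseline.joblib` and the synthetic training data are hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the notebook's training split
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5])

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge(alpha=1.0)),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "models" / "baseline"
    out_dir.mkdir(parents=True, exist_ok=True)
    model_path = out_dir / "ridge_baseline.joblib"  # hypothetical file name
    joblib.dump(pipe, model_path)      # saves scaler and regressor together
    restored = joblib.load(model_path)

same_predictions = np.allclose(pipe.predict(X_demo), restored.predict(X_demo))
print(same_predictions)  # True
```

Saving the whole `Pipeline` keeps the fitted scaler and the model in one artifact, so inference code cannot accidentally skip the scaling step. Pickled models are generally tied to the library versions used to create them, so recording those versions alongside the file is a sensible precaution.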

## 5. Automation & Metrics
- The Makefile `baseline` target runs these data prep + training steps and saves models under `models/baseline/`.
- `metrics/logreg_metrics.json` and `metrics/ridge_metrics.json` record the ROC AUC and MAE history and are updated every time the training CLIs run.
- CI (`.github/workflows/baseline.yml`) executes `make baseline`, `make test`, and `make infer` on each push/PR, so notebook results match the automated pipeline.
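
The metrics files could be maintained along the following lines. This is only a sketch: the JSON schema (a list of records with `metric`, `value`, and `recorded_at` fields) and the helper `append_metric` are assumptions, since the actual format written by the training CLIs is not shown here.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_metric(path, name, value):
    """Hypothetical helper: append one timestamped metric record to a JSON list."""
    path = Path(path)
    history = json.loads(path.read_text()) if path.exists() else []
    history.append({
        "metric": name,
        "value": float(value),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    })
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(history, indent=2))
    return history


with tempfile.TemporaryDirectory() as tmp:
    metrics_path = Path(tmp) / "metrics" / "ridge_metrics.json"
    append_metric(metrics_path, "mae", 4.669)            # first run
    history = append_metric(metrics_path, "mae", 4.669)  # second run appends, not overwrites

print(len(history))  # 2
```

Appending rather than overwriting is what allows metrics to be compared across commits.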