# 04. Modeling Pipeline Setup: Train-Test Split & Scaling

**Objective:** Load preprocessed data, perform a stratified train-test split with a fixed random seed, scale numerical features, initialize the performance logging infrastructure, and rebalance the training set with SMOTE.

**PRD References:** 3.1.5, 3.1.7, 9.3, 10.5, 11.2; **NFR2**

## 1. Imports and Utility Functions

In [12]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE  # added for class imbalance handling
from modeling_utils import (
    compute_classification_metrics,
    init_performance_excel,
    append_performance_record
)

## 2. Load Preprocessed Data

In [13]:
# Load the fully preprocessed dataset
data_path = '../data/processed/preprocessed_data.csv'
df = pd.read_csv(data_path)
print(f"Loaded preprocessed data: {df.shape[0]} rows, {df.shape[1]} columns")

Loaded preprocessed data: 22072 rows, 42 columns


## 3. Define Features and Target

In [8]:
# Separate features and target
target_col = 'SEVERITY'
feature_cols = [c for c in df.columns if c != target_col]
X = df[feature_cols]
y = df[target_col]

## 4. Stratified Train-Test Split

In [14]:
# Perform stratified split to preserve class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)
print(f"Train set: {X_train.shape[0]} rows")
print(f"Test set:  {X_test.shape[0]} rows")

Train set: 17657 rows
Test set:  4415 rows
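
A quick way to confirm that `stratify=y` did its job is to compare class proportions across the full data, the training set, and the test set. The sketch below does this on a small synthetic stand-in (the `*_demo` names, the single `feature` column, and the class probabilities are illustrative assumptions, not the project data); the same comparison could be run on `y`, `y_train`, and `y_test`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: three imbalanced classes (illustrative only).
rng = np.random.default_rng(42)
y_demo = pd.Series(
    rng.choice(["Property", "Injury", "Fatal"], size=5000, p=[0.93, 0.06, 0.01]),
    name="SEVERITY",
)
X_demo = pd.DataFrame({"feature": rng.normal(size=len(y_demo))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Put the class proportions side by side; with stratify set they should nearly match.
proportions = pd.DataFrame({
    "full": y_demo.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
})
max_gap = (proportions["train"] - proportions["test"]).abs().max()

print(proportions.round(4))
print(f"Largest train/test proportion gap: {max_gap:.4f}")
```

Any remaining gap comes from rounding class counts to whole rows, so it matters most for very rare classes such as `Fatal`.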


## 5. Numerical Feature Scaling

In [10]:
# Identify numerical features for scaling
num_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Fit the scaler on the training data only, then reuse its statistics for the
# test set so that no test information leaks into preprocessing
scaler = StandardScaler()
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
X_train_scaled[num_features] = scaler.fit_transform(X_train[num_features])
X_test_scaled[num_features] = scaler.transform(X_test[num_features])

print("Scaled numerical features on training and test sets.")

Scaled numerical features on training and test sets.
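
Fitting the scaler on the training set only means the training columns end up with mean 0 and unit variance, while the test columns are merely close to that, because they are transformed with the training statistics. The sketch below illustrates this on synthetic data; the column names `speed` and `age` and the `*_demo` variables are hypothetical and not taken from the project dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features drawn from the same distributions for train and test.
rng = np.random.default_rng(0)
train_demo = pd.DataFrame({
    "speed": rng.normal(50, 10, 800),
    "age": rng.normal(40, 12, 800),
})
test_demo = pd.DataFrame({
    "speed": rng.normal(50, 10, 200),
    "age": rng.normal(40, 12, 200),
})

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learns mean and std from train only
test_scaled = scaler_demo.transform(test_demo)        # reuses the train statistics

train_means = train_scaled.mean(axis=0)
train_stds = train_scaled.std(axis=0)
test_means = test_scaled.mean(axis=0)

print("Train means:", train_means.round(6))
print("Train stds: ", train_stds.round(6))
print("Test means: ", test_means.round(4))
```

Test-set means that drift far from 0 would suggest the two sets come from different distributions, which is worth checking before training.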


## 6. Initialize Performance Logging

In [11]:
# Create the Excel workbook used to log model performance
performance_file = '../reports/model_performance_summary.xlsx'
init_performance_excel(performance_file)
print(f"Initialized performance log at {performance_file}")

Initialized performance log at ../reports/model_performance_summary.xlsx


## 7. Class Imbalance Handling

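Apply SMOTE to the training data only, generating synthetic samples for the minority classes (`Injury` and `Fatal`) until each class has as many rows as the majority class. The test set is left untouched so that evaluation reflects the real class distribution. (PRD 3.1.7, 11.2)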

In [16]:
# Before resampling, display class distribution
print("Class distribution before SMOTE:")
print(y_train.value_counts(normalize=True))

# SMOTE interpolates between feature vectors, so every column must be numeric.
# Synthetic rows have no source values for non-numeric columns, and concatenating
# the original columns back would leave NaN in those rows, so fail fast instead.
non_numeric_cols = X_train_scaled.select_dtypes(exclude=['number', 'bool']).columns.tolist()
if non_numeric_cols:
    raise ValueError(f"Encode these columns before SMOTE: {non_numeric_cols}")

# Apply SMOTE to the scaled training data only
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)

# After resampling, display new distribution
print("\nClass distribution after SMOTE:")
print(y_train_resampled.value_counts(normalize=True))

Class distribution before SMOTE:
SEVERITY
Property    0.931245
Injury      0.067735
Fatal       0.001019
Name: proportion, dtype: float64

Class distribution after SMOTE:
SEVERITY
Property    0.333333
Injury      0.333333
Fatal       0.333333
Name: proportion, dtype: float64
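
SMOTE builds each synthetic row by interpolating between a minority-class sample and one of its nearest minority-class neighbours. The NumPy sketch below reproduces only that interpolation step as a simplified illustration; it is not imblearn's implementation, and the array `minority`, the helper `synthesize_one`, and the choice of 18 samples with 3 features are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical minority class: 18 samples with 3 already-scaled numeric features.
minority = rng.normal(size=(18, 3))

def synthesize_one(samples, k, rng):
    """Return one synthetic point on the segment between a random sample
    and one of its k nearest neighbours (Euclidean distance)."""
    i = rng.integers(len(samples))
    distances = np.linalg.norm(samples - samples[i], axis=1)
    neighbours = np.argsort(distances)[1:k + 1]  # position 0 is the sample itself
    j = rng.choice(neighbours)
    gap = rng.uniform(0.0, 1.0)  # how far to move from sample i towards sample j
    return samples[i] + gap * (samples[j] - samples[i])

synthetic = np.array([synthesize_one(minority, k=5, rng=rng) for _ in range(100)])
print(synthetic.shape)
```

Because every synthetic value is a weighted average of two existing rows, the technique only makes sense for numeric columns. For data that mixes numeric and categorical features, imblearn also provides `SMOTENC`.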


**Next Steps:**
- Use `X_train_resampled` and `y_train_resampled` for model training and hyperparameter tuning (Commit 14+).
- Evaluate models on the untouched `X_test_scaled` and `y_test`; the test set is never resampled.