# 04. Modeling Pipeline Setup: Train-Test Split & Scaling

**Objective:** Load preprocessed data, perform a stratified train-test split with a fixed random seed, apply numerical scaling, and initialize the performance logging infrastructure.

**PRD References:** 3.1.5, 3.1.7, 9.3, 10.5; **NFR2**

## 1. Imports and Utility Functions

In [4]:
import sys
import os

print("--- sys.path contents ---")
for i, path in enumerate(sys.path):
    print(f"{i}: {path}")

# Optional: check that the src directory exists relative to the project root.
# Assumes the notebook's working directory is one level below the project root.
project_root = os.path.abspath(os.pardir)
src_path_check = os.path.join(project_root, 'src')
print("\n--- Project Structure Check ---")
print(f"Calculated Project Root: {project_root}")
print(f"Checking for src at: {src_path_check}")
print(f"Does src directory exist? {os.path.isdir(src_path_check)}")
print(f"Does modeling_utils.py exist? {os.path.exists(os.path.join(src_path_check, 'modeling_utils.py'))}")
print(f"Does __init__.py exist in src? {os.path.exists(os.path.join(src_path_check, '__init__.py'))}")

print("\n--- End of sys.path and Structure Check ---")

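# Run the import in the *next* cell. The src directory itself is on sys.path
# and has no __init__.py, so the flat form is the one expected to resolve:
# from modeling_utils import ...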


--- sys.path contents ---
0: /home/cmark/.pyenv/versions/3.12.9/lib/python312.zip
1: /home/cmark/.pyenv/versions/3.12.9/lib/python3.12
2: /home/cmark/.pyenv/versions/3.12.9/lib/python3.12/lib-dynload
3: 
4: /home/cmark/Projects/TrafficAccidentSeverity/.venv/lib/python3.12/site-packages
5: /home/cmark/Projects/TrafficAccidentSeverity/src

--- Project Structure Check ---
Calculated Project Root: /home/cmark/Projects/TrafficAccidentSeverity
Checking for src at: /home/cmark/Projects/TrafficAccidentSeverity/src
Does src directory exist? True
Does modeling_utils.py exist? True
Does __init__.py exist in src? False

--- End of sys.path and Structure Check ---


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from modeling_utils import (
    compute_classification_metrics,
    init_performance_excel,
    append_performance_record
)

ModuleNotFoundError: No module named 'src'
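
The `sys.path` listing shows the `src` directory itself on the path, while the project root is not and `src` has no `__init__.py`. Under those conditions the flat `from modeling_utils import ...` form should resolve, whereas a `src.`-prefixed import would likely raise the `ModuleNotFoundError` shown. If the notebook is started in an environment where `src` is missing from the path, one option is to add it explicitly. The sketch below is a minimal illustration; `add_src_to_path` is a hypothetical helper, and it assumes the working directory sits one level below the project root.

```python
import os
import sys


def add_src_to_path(project_root: str) -> str:
    """Prepend <project_root>/src to sys.path if it is not already present."""
    src_dir = os.path.abspath(os.path.join(project_root, "src"))
    if src_dir not in sys.path:
        sys.path.insert(0, src_dir)
    return src_dir


# Assumption: the notebook's working directory is one level below the project root.
src_dir = add_src_to_path(os.pardir)
print(f"src on sys.path: {src_dir in sys.path}")
```

Calling the helper more than once is harmless, because the path is only inserted when absent.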

## 2. Load Preprocessed Data

In [None]:
# Load the fully preprocessed dataset
data_path = '../data/processed/preprocessed_data.csv'
df = pd.read_csv(data_path)
print(f"Loaded preprocessed data: {df.shape[0]} rows, {df.shape[1]} columns")
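
Since later steps assume the data is fully preprocessed, a quick structural check right after loading can catch leftover missing values or unexpected dtypes. The following is a sketch only: `summarize_frame` is a hypothetical helper and the toy frame stands in for the real CSV.

```python
import pandas as pd


def summarize_frame(df: pd.DataFrame) -> dict:
    """Return row/column counts, total missing cells, and number of columns per dtype."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing_cells": int(df.isna().sum().sum()),
        "dtype_counts": df.dtypes.astype(str).value_counts().to_dict(),
    }


# Toy frame standing in for the preprocessed CSV (values are illustrative only).
toy = pd.DataFrame({"speed_limit": [30, 50, 70], "rain": [0.0, 1.5, None]})
summary = summarize_frame(toy)
print(summary)
```

A non-zero `missing_cells` count on the real data would suggest that preprocessing is incomplete, since `StandardScaler` passes NaNs through rather than imputing them.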

## 3. Define Features and Target

In [None]:
# Separate features and target
target_col = 'is_severe_accident'
feature_cols = [c for c in df.columns if c != target_col]
X = df[feature_cols]
y = df[target_col]
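
Before splitting, it can be useful to look at how the target classes are distributed, since an imbalanced target is the motivation for stratifying. A minimal sketch, using a toy series in place of `y`:

```python
import pandas as pd

# Toy target standing in for df['is_severe_accident'] (values are illustrative only).
y_toy = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], name="is_severe_accident")

# Share of each class, ordered by class label.
class_share = y_toy.value_counts(normalize=True).sort_index()
print(class_share)
```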

## 4. Stratified Train-Test Split

In [None]:
# Perform stratified split to preserve class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)
print(f"Train set: {X_train.shape[0]} rows")
print(f"Test set:  {X_test.shape[0]} rows")
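
To confirm that stratification preserved the class distribution, the positive rate can be compared across the full, training, and test sets. The sketch below uses synthetic data (the `_toy` names and values are assumptions made for illustration) with the same `test_size`, `stratify`, and `random_state` settings as the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for X and y (illustrative only).
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({"f1": rng.normal(size=1000), "f2": rng.normal(size=1000)})
y_toy = pd.Series([1] * 100 + [0] * 900, name="is_severe_accident")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# With stratification, the positive rate should match across all three sets.
rates = {"full": y_toy.mean(), "train": y_tr.mean(), "test": y_te.mean()}
print(rates)
```

On the real data the rates may differ slightly when class counts do not divide evenly by the split ratio.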

## 5. Numerical Feature Scaling

In [None]:
# Identify numerical features for scaling.
# Note: 0/1 indicator columns stored as int64 are also selected and will be standardized.
num_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Fit the scaler on the training data only, then reuse its statistics on the
# test data so that no information from the test set leaks into training.
scaler = StandardScaler()
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
X_train_scaled[num_features] = scaler.fit_transform(X_train[num_features])
X_test_scaled[num_features] = scaler.transform(X_test[num_features])

print("Scaled numerical features on training and test sets.")
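
Two things are worth checking after scaling: that the training columns end up with zero mean and unit variance, and whether 0/1 indicator columns should be scaled at all. The sketch below illustrates both on a toy split. Excluding binary columns is a hypothetical refinement, not something the notebook does, and the column names are invented for the example.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy split standing in for X_train / X_test (illustrative only).
X_tr = pd.DataFrame({"speed": [30.0, 50.0, 70.0, 90.0], "is_night": [0, 1, 0, 1]})
X_te = pd.DataFrame({"speed": [60.0, 110.0], "is_night": [1, 0]})

# Hypothetical refinement: skip numeric columns whose only values are 0 and 1.
numeric = X_tr.select_dtypes(include="number").columns
continuous = [c for c in numeric if not set(X_tr[c].unique()) <= {0, 1}]

scaler = StandardScaler()
X_tr_s, X_te_s = X_tr.copy(), X_te.copy()
X_tr_s[continuous] = scaler.fit_transform(X_tr[continuous])  # statistics from train only
X_te_s[continuous] = scaler.transform(X_te[continuous])      # reuse train statistics

# Train columns are centred with unit (population) variance; test columns generally are not.
print(X_tr_s[continuous].mean().round(6).to_dict())
print(X_tr_s[continuous].std(ddof=0).round(6).to_dict())
```

`StandardScaler` uses the population standard deviation, which is why the check passes `ddof=0` rather than relying on the pandas default of `ddof=1`.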

## 6. Initialize Performance Logging

In [None]:
# Create the Excel workbook used to log model performance
performance_file = '../reports/model_performance_summary.xlsx'
init_performance_excel(performance_file)
print(f"Initialized performance log at {performance_file}")
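
The internals of the `modeling_utils` helpers are not shown in this notebook. As a rough illustration of the kind of record such a log might hold, the sketch below builds one row of standard classification metrics with scikit-learn. `build_metrics_record` and its field names are hypothetical and should not be read as the project's `compute_classification_metrics`.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def build_metrics_record(model_name, y_true, y_pred):
    """Hypothetical record layout; not the project's compute_classification_metrics."""
    return {
        "model": model_name,
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }


# Illustrative labels only.
record = build_metrics_record("baseline", [0, 0, 1, 1, 1, 0], [0, 1, 1, 1, 0, 0])
print(record)
```

For an imbalanced target, precision, recall, and F1 on the positive class are usually more informative than accuracy alone.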

**Next Steps:**
- Implement class imbalance handling on `X_train_scaled`, `y_train` (Commit 13).
- Build model training and hyperparameter tuning workflows in subsequent notebooks.