# 🔧 Road Accident Analysis - Phase 3: Feature Engineering & Preprocessing

---

## 📋 Phase 3 Objectives:
1. **Feature Creation**: Time-based features, age groups, risk scores
2. **Handling Missing Values**: Imputation strategies
3. **Encoding Categorical Variables**: Label, One-Hot, Target encoding
4. **Feature Scaling**: StandardScaler, MinMaxScaler
5. **Handling Class Imbalance**: SMOTE, class weights
6. **Feature Selection**: Correlation analysis, feature importance
7. **Train-Test Split**: Stratified splitting

---

In [2]:
# ============================================================================
# CELL 1: Import Required Libraries
# ============================================================================

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from datetime import datetime
from scipy import stats

# Preprocessing libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler, RobustScaler
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.utils.class_weight import compute_class_weight

# Configure display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format', '{:.2f}'.format)

# Suppress warnings
warnings.filterwarnings('ignore')

# Set visualization style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print('✓ Libraries imported successfully!')
print(f'Pandas Version: {pd.__version__}')
print(f'NumPy Version: {np.__version__}')
print(f'Scikit-learn & Imbalanced-learn ready!')

✓ Libraries imported successfully!
Pandas Version: 2.3.2
NumPy Version: 1.26.4
Scikit-learn & Imbalanced-learn ready!


In [4]:
# ============================================================================
# CELL 2: Load Dataset and Recreate Feature Lists
# ============================================================================

# Load the CSV file (a forward-slash path works on Windows, macOS, and Linux)
df = pd.read_csv('dataset/accident_prediction_india.csv')

print('='*80)
print('DATASET LOADED SUCCESSFULLY')
print('='*80)
print(f'Dataset Shape: {df.shape[0]} rows × {df.shape[1]} columns')
print('='*80)

# Create a copy for preprocessing
df_processed = df.copy()

print('\n✓ Working copy created: df_processed')

DATASET LOADED SUCCESSFULLY
Dataset Shape: 3000 rows × 22 columns

✓ Working copy created: df_processed


---
## 🆕 FEATURE CREATION
Creating new features from existing data.

---

In [5]:
# ============================================================================
# CELL 3: Feature Engineering - Time-Based Features
# ============================================================================

print('='*80)
print('FEATURE ENGINEERING: TIME-BASED FEATURES')
print('='*80)

# 1. Extract Hour from Time of Day (if in HH:MM format)
def extract_hour(time_str):
    try:
        if pd.isna(time_str):
            return np.nan
        if ':' in str(time_str):
            return int(str(time_str).split(':')[0])
        return np.nan
    except (ValueError, TypeError):  # non-numeric hour or unexpected value type
        return np.nan

df_processed['Hour'] = df_processed['Time of Day'].apply(extract_hour)

# 2. Create Time Period categories
def categorize_time_period(hour):
    if pd.isna(hour):
        return 'Unknown'
    elif 5 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 17:
        return 'Afternoon'
    elif 17 <= hour < 21:
        return 'Evening'
    else:
        return 'Night'

df_processed['Time_Period'] = df_processed['Hour'].apply(categorize_time_period)

# 3. Weekend flag
df_processed['Is_Weekend'] = df_processed['Day of Week'].apply(
    lambda x: 1 if x in ['Saturday', 'Sunday'] else 0
)

# 4. Season from Month
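def get_season(month):
    if month in ['December', 'January', 'February']:
        return 'Winter'
    elif month in ['March', 'April', 'May']:
        return 'Spring'
    elif month in ['June', 'July', 'August']:
        return 'Summer'
    elif month in ['September', 'October', 'November']:
        return 'Autumn'
    else:
        # Missing or unexpected month labels are not silently counted as Autumn
        return 'Unknown'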

df_processed['Season'] = df_processed['Month'].apply(get_season)

print('\n✓ Time-based features created:')
print('  • Hour (0-23)')
print('  • Time_Period (Morning/Afternoon/Evening/Night)')
print('  • Is_Weekend (0/1)')
print('  • Season (Winter/Spring/Summer/Autumn)')

print(f'\n  Time_Period Distribution:\n{df_processed["Time_Period"].value_counts()}')
print('='*80)

FEATURE ENGINEERING: TIME-BASED FEATURES

✓ Time-based features created:
  • Hour (0-23)
  • Time_Period (Morning/Afternoon/Evening/Night)
  • Is_Weekend (0/1)
  • Season (Winter/Spring/Summer/Autumn)

  Time_Period Distribution:
Time_Period
Night        1039
Morning       894
Afternoon     597
Evening       470
Name: count, dtype: int64
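
The four periods are half-open hour intervals of unequal width: Night spans 8 hours (21:00 to 04:59), Morning 7, Afternoon 5, and Evening 4, which matches the ordering of the counts above. The boundary behaviour can be checked with a small standalone sketch. The helper names and the sample time strings below are illustrative assumptions, not part of the dataset.

```python
from typing import Optional

def extract_hour(time_str: Optional[str]) -> Optional[int]:
    """Return the hour from an 'HH:MM' string, or None if it cannot be parsed."""
    if time_str is None or ':' not in time_str:
        return None
    try:
        return int(time_str.split(':')[0])
    except ValueError:
        return None

def time_period(hour: Optional[int]) -> str:
    """Map an hour (0-23) to the four periods used above; None maps to 'Unknown'."""
    if hour is None:
        return 'Unknown'
    if 5 <= hour < 12:
        return 'Morning'
    if 12 <= hour < 17:
        return 'Afternoon'
    if 17 <= hour < 21:
        return 'Evening'
    return 'Night'  # 21:00-04:59, wrapping around midnight

# Made-up times chosen to sit on the bin edges
samples = ['04:59', '05:00', '11:59', '12:00', '16:30', '17:00', '20:59', '21:00', 'n/a']
periods = {t: time_period(extract_hour(t)) for t in samples}
for t, p in periods.items():
    print(f'{t} -> {p}')
```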


In [6]:
# ============================================================================
# CELL 4: Feature Engineering - Age Groups & Demographics
# ============================================================================

print('='*80)
print('FEATURE ENGINEERING: AGE GROUPS & DEMOGRAPHICS')
print('='*80)

# 1. Driver Age Groups
def categorize_age(age):
    if pd.isna(age):
        return 'Unknown'
    elif age < 25:
        return 'Young'
    elif 25 <= age < 45:
        return 'Adult'
    elif 45 <= age < 60:
        return 'Middle-Aged'
    else:
        return 'Senior'

df_processed['Driver_Age_Group'] = df_processed['Driver Age'].apply(categorize_age)

# 2. Speed Category
def categorize_speed(speed):
    if pd.isna(speed):
        return 'Unknown'
    elif speed < 50:
        return 'Low_Speed'
    elif 50 <= speed < 80:
        return 'Medium_Speed'
    else:
        return 'High_Speed'

df_processed['Speed_Category'] = df_processed['Speed Limit (km/h)'].apply(categorize_speed)

print('\n✓ Demographic features created:')
print('  • Driver_Age_Group (Young/Adult/Middle-Aged/Senior)')
print('  • Speed_Category (Low/Medium/High)')

print(f'\n  Driver Age Group Distribution:\n{df_processed["Driver_Age_Group"].value_counts()}')
print(f'\n  Speed Category Distribution:\n{df_processed["Speed_Category"].value_counts()}')
print('='*80)

FEATURE ENGINEERING: AGE GROUPS & DEMOGRAPHICS

✓ Demographic features created:
  • Driver_Age_Group (Young/Adult/Middle-Aged/Senior)
  • Speed_Category (Low/Medium/High)

  Driver Age Group Distribution:
Driver_Age_Group
Adult          1069
Middle-Aged     886
Senior          628
Young           417
Name: count, dtype: int64

  Speed Category Distribution:
Speed_Category
High_Speed      1359
Medium_Speed     943
Low_Speed        698
Name: count, dtype: int64
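
The age and speed rules are both threshold binnings where each lower edge is inclusive. As a compact alternative view of the same cut points, the sketch below expresses them with the standard-library `bisect_right`. The helper names are hypothetical, and missing values are not handled here (the notebook maps them to 'Unknown').

```python
from bisect import bisect_right

AGE_EDGES = [25, 45, 60]                  # lower edges of Adult, Middle-Aged, Senior
AGE_LABELS = ['Young', 'Adult', 'Middle-Aged', 'Senior']

SPEED_EDGES = [50, 80]                    # lower edges of Medium_Speed, High_Speed
SPEED_LABELS = ['Low_Speed', 'Medium_Speed', 'High_Speed']

def age_group(age: float) -> str:
    """Return the age group; an age equal to an edge falls into the upper bin."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]

def speed_category(speed: float) -> str:
    """Return the speed category; a speed equal to an edge falls into the upper bin."""
    return SPEED_LABELS[bisect_right(SPEED_EDGES, speed)]

for age in (18, 24, 25, 44, 45, 59, 60):
    print(age, age_group(age))
for speed in (30, 49, 50, 79, 80, 120):
    print(speed, speed_category(speed))
```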


In [7]:
# ============================================================================
# CELL 5: Feature Engineering - Risk Indicators & Interaction Features
# ============================================================================

print('='*80)
print('FEATURE ENGINEERING: RISK INDICATORS & INTERACTIONS')
print('='*80)

# 1. High Risk Weather Flag
high_risk_weather = ['Stormy', 'Foggy', 'Hazy']
df_processed['High_Risk_Weather'] = df_processed['Weather Conditions'].apply(
    lambda x: 1 if x in high_risk_weather else 0
)

# 2. Poor Visibility Flag (Dark + Bad Weather)
df_processed['Poor_Visibility'] = ((df_processed['Lighting Conditions'] == 'Dark') & 
                                    (df_processed['High_Risk_Weather'] == 1)).astype(int)

# 3. High Risk Road Condition
df_processed['High_Risk_Road'] = df_processed['Road Condition'].apply(
    lambda x: 1 if x in ['Wet', 'Under Construction', 'Damaged'] else 0
)

# 4. Outcome flags: high casualty count (5 or more) and at least one fatality
df_processed['High_Casualty_Count'] = (df_processed['Number of Casualties'] >= 5).astype(int)
df_processed['Has_Fatality'] = (df_processed['Number of Fatalities'] > 0).astype(int)

# 5. Multiple Vehicle Accident
df_processed['Multi_Vehicle_Accident'] = (df_processed['Number of Vehicles Involved'] > 2).astype(int)

# 6. Composite Risk Score (0-5 scale)
df_processed['Risk_Score'] = (
    df_processed['High_Risk_Weather'] +
    df_processed['Poor_Visibility'] +
    df_processed['High_Risk_Road'] +
    (df_processed['Alcohol Involvement'] == 'Yes').astype(int) +
    (df_processed['Speed_Category'] == 'High_Speed').astype(int)
)

print('\n✓ Risk indicator features created:')
print('  • High_Risk_Weather (0/1)')
print('  • Poor_Visibility (0/1)')
print('  • High_Risk_Road (0/1)')
print('  • High_Casualty_Count (0/1)')
print('  • Has_Fatality (0/1)')
print('  • Multi_Vehicle_Accident (0/1)')
print('  • Risk_Score (0-5 composite score)')

print(f'\n  Risk Score Distribution:\n{df_processed["Risk_Score"].value_counts().sort_index()}')
print('='*80)

FEATURE ENGINEERING: RISK INDICATORS & INTERACTIONS

✓ Risk indicator features created:
  • High_Risk_Weather (0/1)
  • Poor_Visibility (0/1)
  • High_Risk_Road (0/1)
  • High_Casualty_Count (0/1)
  • Has_Fatality (0/1)
  • Multi_Vehicle_Accident (0/1)
  • Risk_Score (0-5 composite score)

  Risk Score Distribution:
Risk_Score
0      91
1     460
2    1016
3     939
4     413
5      81
Name: count, dtype: int64
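
Because `Poor_Visibility` can only be 1 when `High_Risk_Weather` is already 1, bad weather in the dark contributes two points to `Risk_Score`. The sketch below recomputes the score for single records to make that visible. The function name is hypothetical, and the category values 'Clear' and 'Dry' are assumed placeholders for any low-risk label.

```python
HIGH_RISK_WEATHER = {'Stormy', 'Foggy', 'Hazy'}
HIGH_RISK_ROAD = {'Wet', 'Under Construction', 'Damaged'}

def risk_score(record: dict) -> int:
    """Sum the five 0/1 risk indicators for one accident record."""
    bad_weather = record['Weather Conditions'] in HIGH_RISK_WEATHER
    poor_visibility = bad_weather and record['Lighting Conditions'] == 'Dark'
    bad_road = record['Road Condition'] in HIGH_RISK_ROAD
    alcohol = record['Alcohol Involvement'] == 'Yes'
    high_speed = record['Speed Limit (km/h)'] >= 80  # same cut-off as High_Speed
    return sum([bad_weather, poor_visibility, bad_road, alcohol, high_speed])

# Illustrative records, not rows from the dataset
foggy_night = {
    'Weather Conditions': 'Foggy', 'Lighting Conditions': 'Dark',
    'Road Condition': 'Dry', 'Alcohol Involvement': 'No', 'Speed Limit (km/h)': 90,
}
clear_night = dict(foggy_night, **{'Weather Conditions': 'Clear', 'Speed Limit (km/h)': 40})

print(risk_score(foggy_night))  # weather + visibility + speed
print(risk_score(clear_night))  # darkness alone adds nothing
```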


---
## 🔧 HANDLING MISSING VALUES
Imputation strategies for missing data.

---

In [8]:
# ============================================================================
# CELL 6: Handling Missing Values
# ============================================================================

print('='*80)
print('HANDLING MISSING VALUES')
print('='*80)

# Check missing values before
missing_before = df_processed.isnull().sum()
missing_before = missing_before[missing_before > 0].sort_values(ascending=False)

print('\n📊 Missing Values Before Imputation:')
if len(missing_before) > 0:
    for col, count in missing_before.items():
        percentage = (count / len(df_processed)) * 100
        print(f'  • {col:<35} : {count:>5} ({percentage:>5.2f}%)')
else:
    print('  ✓ No missing values found!')

# Strategy: Fill missing values
# Assign the result back to the column. Calling fillna(inplace=True) on a column
# selection is chained assignment, which warns in recent pandas 2.x releases and
# will no longer update df_processed in pandas 3.0.
# 1. Driver License Status - Fill with 'Unknown'
if 'Driver License Status' in df_processed.columns:
    df_processed['Driver License Status'] = df_processed['Driver License Status'].fillna('Unknown')

# 2. Traffic Control Presence - Fill with 'Unknown'
if 'Traffic Control Presence' in df_processed.columns:
    df_processed['Traffic Control Presence'] = df_processed['Traffic Control Presence'].fillna('Unknown')

# 3. City Name - Fill any remaining gaps with the existing 'Unknown' label
if 'City Name' in df_processed.columns:
    df_processed['City Name'] = df_processed['City Name'].fillna('Unknown')

# 4. Numerical features - Fill with median
numerical_cols = ['Speed Limit (km/h)', 'Driver Age', 'Hour']
for col in numerical_cols:
    if col in df_processed.columns:
        df_processed[col] = df_processed[col].fillna(df_processed[col].median())

# Check missing values after
missing_after = df_processed.isnull().sum()
missing_after = missing_after[missing_after > 0].sort_values(ascending=False)

print('\n📊 Missing Values After Imputation:')
if len(missing_after) > 0:
    for col, count in missing_after.items():
        percentage = (count / len(df_processed)) * 100
        print(f'  • {col:<35} : {count:>5} ({percentage:>5.2f}%)')
else:
    print('  ✓ All missing values handled!')

print('\n✓ Imputation Strategy Applied:')
print('  • Categorical: Filled with "Unknown"')
print('  • Numerical: Filled with median values')
print('='*80)

HANDLING MISSING VALUES

📊 Missing Values Before Imputation:
  • Driver License Status               :   975 (32.50%)
  • Traffic Control Presence            :   716 (23.87%)

📊 Missing Values After Imputation:
  ✓ All missing values handled!

✓ Imputation Strategy Applied:
  • Categorical: Filled with "Unknown"
  • Numerical: Filled with median values
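
The two imputation rules can be illustrated on a toy frame. This is a minimal sketch with made-up values, assuming only the column names from the dataset. It writes the result of `fillna` back to the column rather than modifying it in place.

```python
import numpy as np
import pandas as pd

# Toy data, not the accident dataset
toy = pd.DataFrame({
    'Driver License Status': ['Valid', None, 'Expired', None],
    'Driver Age': [25.0, np.nan, 40.0, 61.0],
})

# Categorical: keep missingness as its own category
toy['Driver License Status'] = toy['Driver License Status'].fillna('Unknown')

# Numerical: median of the observed values (25, 40, 61)
age_median = toy['Driver Age'].median()
toy['Driver Age'] = toy['Driver Age'].fillna(age_median)

print(toy)
```

In a stricter setup the median would be computed on the training rows only and then reused for the test rows. In this notebook the numerical columns had no missing values, so the choice does not affect the result.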


---
## 🔤 ENCODING CATEGORICAL VARIABLES
Converting categorical features to numerical format.

---

In [9]:
# ============================================================================
# CELL 7: Label Encoding for Target Variable
# ============================================================================

print('='*80)
print('ENCODING: TARGET VARIABLE (ACCIDENT SEVERITY)')
print('='*80)

# Label encode target variable (ordinal encoding)
severity_mapping = {'Minor': 0, 'Serious': 1, 'Fatal': 2}
df_processed['Severity_Encoded'] = df_processed['Accident Severity'].map(severity_mapping)

print('\n✓ Target Variable Encoded:')
print('  • Minor    → 0')
print('  • Serious  → 1')
print('  • Fatal    → 2')

print(f'\nEncoded Distribution:')
print(df_processed['Severity_Encoded'].value_counts().sort_index())
print('='*80)

ENCODING: TARGET VARIABLE (ACCIDENT SEVERITY)

✓ Target Variable Encoded:
  • Minor    → 0
  • Serious  → 1
  • Fatal    → 2

Encoded Distribution:
Severity_Encoded
0    1034
1     981
2     985
Name: count, dtype: int64


In [10]:
# ============================================================================
# CELL 8: Label Encoding for Ordinal Categorical Features
# ============================================================================

print('='*80)
print('ENCODING: ORDINAL CATEGORICAL FEATURES')
print('='*80)

# Define ordinal features with order
ordinal_mappings = {
    'Speed_Category': {'Low_Speed': 0, 'Medium_Speed': 1, 'High_Speed': 2, 'Unknown': -1},
    'Driver_Age_Group': {'Young': 0, 'Adult': 1, 'Middle-Aged': 2, 'Senior': 3, 'Unknown': -1},
    'Time_Period': {'Morning': 0, 'Afternoon': 1, 'Evening': 2, 'Night': 3, 'Unknown': -1}
}

for col, mapping in ordinal_mappings.items():
    if col in df_processed.columns:
        df_processed[f'{col}_Encoded'] = df_processed[col].map(mapping)
        print(f'\n✓ {col} encoded:')
        for key, val in mapping.items():
            print(f'  • {key:<15} → {val}')

print('='*80)

ENCODING: ORDINAL CATEGORICAL FEATURES

✓ Speed_Category encoded:
  • Low_Speed       → 0
  • Medium_Speed    → 1
  • High_Speed      → 2
  • Unknown         → -1

✓ Driver_Age_Group encoded:
  • Young           → 0
  • Adult           → 1
  • Middle-Aged     → 2
  • Senior          → 3
  • Unknown         → -1

✓ Time_Period encoded:
  • Morning         → 0
  • Afternoon       → 1
  • Evening         → 2
  • Night           → 3
  • Unknown         → -1


In [11]:
# ============================================================================
# CELL 9: Binary Encoding for Binary Categorical Features
# ============================================================================

print('='*80)
print('ENCODING: BINARY CATEGORICAL FEATURES')
print('='*80)

# Binary features
binary_mappings = {
    'Alcohol Involvement': {'Yes': 1, 'No': 0},
    'Driver Gender': {'Male': 0, 'Female': 1}
}

for col, mapping in binary_mappings.items():
    if col in df_processed.columns:
        df_processed[f'{col}_Encoded'] = df_processed[col].map(mapping)
        print(f'\n✓ {col} encoded:')
        for key, val in mapping.items():
            print(f'  • {key:<10} → {val}')

print('='*80)

ENCODING: BINARY CATEGORICAL FEATURES

✓ Alcohol Involvement encoded:
  • Yes        → 1
  • No         → 0

✓ Driver Gender encoded:
  • Male       → 0
  • Female     → 1


In [12]:
# ============================================================================
# CELL 10: One-Hot Encoding for Nominal Categorical Features
# ============================================================================

print('='*80)
print('ENCODING: ONE-HOT ENCODING FOR NOMINAL FEATURES')
print('='*80)

# Select nominal features for one-hot encoding (exclude high cardinality)
nominal_features = [
    'Weather Conditions',
    'Road Type',
    'Road Condition',
    'Lighting Conditions',
    'Vehicle Type Involved',
    'Day of Week',
    'Season',
    'Accident Location Details'
]

# Filter existing features
nominal_features = [col for col in nominal_features if col in df_processed.columns]

print(f'\n📋 Features to be one-hot encoded: {len(nominal_features)}')
for feat in nominal_features:
    print(f'  • {feat} ({df_processed[feat].nunique()} categories)')

# Perform one-hot encoding
# dtype=int stores the dummies as 0/1 instead of True/False
df_encoded = pd.get_dummies(
    df_processed,
    columns=nominal_features,
    prefix=nominal_features,
    drop_first=True,
    dtype=int
)

print(f'\n✓ One-hot encoding complete!')
print(f'  Original features: {len(df_processed.columns)}')
print(f'  Encoded features: {len(df_encoded.columns)}')
print(f'  New columns added: {len(df_encoded.columns) - len(df_processed.columns)}')
print('='*80)

ENCODING: ONE-HOT ENCODING FOR NOMINAL FEATURES

📋 Features to be one-hot encoded: 8
  • Weather Conditions (5 categories)
  • Road Type (4 categories)
  • Road Condition (4 categories)
  • Lighting Conditions (4 categories)
  • Vehicle Type Involved (7 categories)
  • Day of Week (7 categories)
  • Season (4 categories)
  • Accident Location Details (4 categories)

✓ One-hot encoding complete!
  Original features: 41
  Encoded features: 64
  New columns added: 23
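
The net increase of 23 columns follows from the category counts printed above. With `drop_first=True` a feature with k categories yields k - 1 dummy columns, and `get_dummies` removes the 8 source columns it encodes. The arithmetic, using only the numbers from the output:

```python
# Category counts as printed by the cell above
n_categories = {
    'Weather Conditions': 5, 'Road Type': 4, 'Road Condition': 4,
    'Lighting Conditions': 4, 'Vehicle Type Involved': 7, 'Day of Week': 7,
    'Season': 4, 'Accident Location Details': 4,
}
columns_before = 41

# drop_first=True drops one level per feature
dummies_created = sum(k - 1 for k in n_categories.values())
# the encoded source columns are replaced, not kept
originals_removed = len(n_categories)

columns_after = columns_before + dummies_created - originals_removed
print(f'dummy columns created: {dummies_created}')
print(f'net new columns:       {columns_after - columns_before}')
print(f'columns after:         {columns_after}')
```

So 31 dummy columns are created, and the reported 23 is the net change after the 8 original columns are dropped.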


---
## 📐 FEATURE SCALING
Normalizing numerical features for model training.

---

In [13]:
# ============================================================================
# CELL 11: Feature Selection & Preparation for Scaling
# ============================================================================

print('='*80)
print('FEATURE SELECTION FOR MODELING')
print('='*80)

# Exclude features not needed for modeling
exclude_cols = [
    'State Name',  # High cardinality
    'City Name',  # High cardinality (71% unknown)
    'Time of Day',  # Already extracted Hour
    'Month',  # Converted to Season
    'Year',  # Optional - can keep for temporal analysis
    'Accident Severity',  # Original target (use encoded version)
    'Driver_Age_Group',  # Already encoded
    'Speed_Category',  # Already encoded
    'Time_Period',  # Already encoded
    'Alcohol Involvement',  # Already encoded
    'Driver Gender',  # Already encoded
    'Traffic Control Presence',  # 23.87% missing, low importance
    'Driver License Status'  # 32.5% missing, low importance
]

# Create modeling dataframe
df_model = df_encoded.copy()

# Drop excluded columns if they exist
cols_to_drop = [col for col in exclude_cols if col in df_model.columns]
df_model = df_model.drop(columns=cols_to_drop)

print(f'\n✓ Features selected for modeling: {len(df_model.columns)}')
print(f'  Excluded features: {len(cols_to_drop)}')

# Identify numerical features that need scaling
numerical_features_to_scale = [
    'Speed Limit (km/h)',
    'Driver Age',
    'Number of Vehicles Involved',
    'Number of Casualties',
    'Number of Fatalities',
    'Hour',
    'Risk_Score'
]

numerical_features_to_scale = [col for col in numerical_features_to_scale if col in df_model.columns]

print(f'\n📊 Numerical features to be scaled: {len(numerical_features_to_scale)}')
for feat in numerical_features_to_scale:
    print(f'  • {feat}')
print('='*80)

FEATURE SELECTION FOR MODELING

✓ Features selected for modeling: 51
  Excluded features: 13

📊 Numerical features to be scaled: 7
  • Speed Limit (km/h)
  • Driver Age
  • Number of Vehicles Involved
  • Number of Casualties
  • Number of Fatalities
  • Hour
  • Risk_Score


In [14]:
# ============================================================================
# CELL 12: Train-Test Split (Before Scaling)
# ============================================================================

print('='*80)
print('TRAIN-TEST SPLIT')
print('='*80)

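# Separate features and targets
# NOTE: High_Casualty_Count and Has_Fatality remain in X but are derived from
# 'Number of Casualties' and 'Number of Fatalities'. Drop them before fitting the
# regression models, otherwise the features leak those targets.
X = df_model.drop(columns=['Severity_Encoded', 'Number of Casualties', 'Number of Fatalities'])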
y_classification = df_model['Severity_Encoded']
y_casualties = df_model['Number of Casualties']
y_fatalities = df_model['Number of Fatalities']

# Stratified split for classification
X_train, X_test, y_train_class, y_test_class = train_test_split(
    X, y_classification, 
    test_size=0.2, 
    random_state=42, 
    stratify=y_classification
)

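# Same split for regression targets: reuse the row indices of the stratified split.
# A second, unstratified train_test_split call is not guaranteed to select the
# same rows, even with the same random_state.
y_train_casualties = y_casualties.loc[X_train.index]
y_test_casualties = y_casualties.loc[X_test.index]

y_train_fatalities = y_fatalities.loc[X_train.index]
y_test_fatalities = y_fatalities.loc[X_test.index]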

print(f'\n✓ Data split complete:')
print(f'  Training set: {X_train.shape[0]} samples ({(X_train.shape[0]/len(X))*100:.1f}%)')
print(f'  Test set: {X_test.shape[0]} samples ({(X_test.shape[0]/len(X))*100:.1f}%)')
print(f'  Number of features: {X_train.shape[1]}')

print(f'\n📊 Class Distribution in Training Set:')
train_dist = y_train_class.value_counts().sort_index()
for severity, count in train_dist.items():
    severity_name = {0: 'Minor', 1: 'Serious', 2: 'Fatal'}[severity]
    print(f'  • {severity_name:<10} : {count:>5} ({(count/len(y_train_class))*100:>5.2f}%)')

print(f'\n📊 Class Distribution in Test Set:')
test_dist = y_test_class.value_counts().sort_index()
for severity, count in test_dist.items():
    severity_name = {0: 'Minor', 1: 'Serious', 2: 'Fatal'}[severity]
    print(f'  • {severity_name:<10} : {count:>5} ({(count/len(y_test_class))*100:>5.2f}%)')
print('='*80)

TRAIN-TEST SPLIT

✓ Data split complete:
  Training set: 2400 samples (80.0%)
  Test set: 600 samples (20.0%)
  Number of features: 48

📊 Class Distribution in Training Set:
  • Minor      :   827 (34.46%)
  • Serious    :   785 (32.71%)
  • Fatal      :   788 (32.83%)

📊 Class Distribution in Test Set:
  • Minor      :   207 (34.50%)
  • Serious    :   196 (32.67%)
  • Fatal      :   197 (32.83%)
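
When several targets share one feature matrix, every target has to follow the same row partition as `X_train` and `X_test`. Stratified and unstratified calls to `train_test_split` use different splitters, so repeating the call with the same `random_state` but without `stratify` generally picks different rows. One way to keep everything aligned, sketched here on synthetic data with hypothetical variable names, is to split once and index the other targets by the resulting row labels.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and two targets
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({'feature': rng.normal(size=90)})
y_class = pd.Series(np.repeat([0, 1, 2], 30), name='severity')
y_count = pd.Series(rng.integers(0, 5, size=90), name='casualties')

# One stratified split defines the partition
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_class, test_size=0.2, random_state=42, stratify=y_class
)

# Any additional target reuses the row labels of that split
y_count_tr = y_count.loc[X_tr.index]
y_count_te = y_count.loc[X_te.index]

print(X_tr.shape, y_count_tr.shape)
print(y_tr.value_counts().sort_index())
```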


In [15]:
# ============================================================================
# CELL 13: Feature Scaling - StandardScaler
# ============================================================================

print('='*80)
print('FEATURE SCALING: STANDARDSCALER')
print('='*80)

# Initialize scaler
scaler_standard = StandardScaler()

# Identify numerical columns in X_train
numerical_cols_in_X = [col for col in numerical_features_to_scale if col in X_train.columns]

# Fit on training data and transform both train and test
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

X_train_scaled[numerical_cols_in_X] = scaler_standard.fit_transform(X_train[numerical_cols_in_X])
X_test_scaled[numerical_cols_in_X] = scaler_standard.transform(X_test[numerical_cols_in_X])

print(f'\n✓ StandardScaler applied to {len(numerical_cols_in_X)} features')
print('  Formula: z = (x - μ) / σ')
print('  Where: μ = mean, σ = standard deviation')

print(f'\n📊 Scaling Statistics (Training Set):')
for col in numerical_cols_in_X[:5]:  # Show first 5
    mean_val = X_train[col].mean()
    std_val = X_train[col].std()
    print(f'  • {col:<30} | Mean: {mean_val:>8.2f} | Std: {std_val:>8.2f}')

print('='*80)

FEATURE SCALING: STANDARDSCALER

✓ StandardScaler applied to 5 features
  Formula: z = (x - μ) / σ
  Where: μ = mean, σ = standard deviation

📊 Scaling Statistics (Training Set):
  • Speed Limit (km/h)             | Mean:    74.99 | Std:    26.75
  • Driver Age                     | Mean:    44.16 | Std:    15.30
  • Number of Vehicles Involved    | Mean:     2.99 | Std:     1.43
  • Hour                           | Mean:    11.36 | Std:     6.96
  • Risk_Score                     | Mean:     2.46 | Std:     1.08
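
As a worked instance of the formula, a 120 km/h speed limit with the training statistics above gives z = (120 - 74.99) / 26.75, roughly 1.68. One detail worth knowing: `StandardScaler` divides by the population standard deviation (ddof=0), whereas pandas `.std()` defaults to the sample version (ddof=1), so the "Std" column above is very slightly larger than the divisor the scaler uses. With 2,400 rows the gap is negligible. The sketch below shows both on made-up values using only the standard library.

```python
import statistics

speeds = [30, 40, 60, 80, 100, 120]  # made-up speed limits, not the dataset

mu = statistics.fmean(speeds)
sigma_pop = statistics.pstdev(speeds)    # population std (ddof=0), the divisor StandardScaler uses
sigma_sample = statistics.stdev(speeds)  # sample std (ddof=1), the pandas .std() default

# Standardize: the result has mean 0 and population std 1
z_scores = [(x - mu) / sigma_pop for x in speeds]
print(f'population std: {sigma_pop:.3f}, sample std: {sigma_sample:.3f}')
print([round(z, 3) for z in z_scores])

# Worked example with the training statistics printed above
z_120 = (120 - 74.99) / 26.75
print(f'z(120 km/h) = {z_120:.2f}')
```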


In [16]:
# ============================================================================
# CELL 14: Feature Scaling - MinMaxScaler (Alternative)
# ============================================================================

print('='*80)
print('FEATURE SCALING: MINMAXSCALER (Alternative for Neural Networks)')
print('='*80)

# Initialize MinMaxScaler
scaler_minmax = MinMaxScaler()

# Create normalized versions
X_train_normalized = X_train.copy()
X_test_normalized = X_test.copy()

X_train_normalized[numerical_cols_in_X] = scaler_minmax.fit_transform(X_train[numerical_cols_in_X])
X_test_normalized[numerical_cols_in_X] = scaler_minmax.transform(X_test[numerical_cols_in_X])

print(f'\n✓ MinMaxScaler applied to {len(numerical_cols_in_X)} features')
print('  Formula: x_scaled = (x - x_min) / (x_max - x_min)')
print('  Range: [0, 1]')

print(f'\n📊 Scaling Range (Training Set):')
for col in numerical_cols_in_X[:5]:  # Show first 5
    min_val = X_train[col].min()
    max_val = X_train[col].max()
    print(f'  • {col:<30} | Min: {min_val:>8.2f} | Max: {max_val:>8.2f}')

print('\n💡 Note: Use X_train_scaled/X_test_scaled for scale-sensitive models (linear, SVM, k-NN)')
print('         Use X_train_normalized/X_test_normalized for neural networks')
print('         Tree-based models do not require scaling and can use either version')
print('='*80)

FEATURE SCALING: MINMAXSCALER (Alternative for Neural Networks)

✓ MinMaxScaler applied to 5 features
  Formula: x_scaled = (x - x_min) / (x_max - x_min)
  Range: [0, 1]

📊 Scaling Range (Training Set):
  • Speed Limit (km/h)             | Min:    30.00 | Max:   120.00
  • Driver Age                     | Min:    18.00 | Max:    70.00
  • Number of Vehicles Involved    | Min:     1.00 | Max:     5.00
  • Hour                           | Min:     0.00 | Max:    23.00
  • Risk_Score                     | Min:     0.00 | Max:     5.00

💡 Note: Use X_train_scaled/X_test_scaled for scale-sensitive models (linear, SVM, k-NN)
         Use X_train_normalized/X_test_normalized for neural networks
         Tree-based models do not require scaling and can use either version
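
The min-max formula can be checked by hand with the training ranges printed above. The helper below is a hypothetical one-liner, not the scikit-learn implementation. It also shows that a value outside the training range maps outside [0, 1], which is what `MinMaxScaler` does for unseen data with its default settings.

```python
def min_max(x: float, x_min: float, x_max: float) -> float:
    """Rescale x using a minimum and maximum learned from the training set."""
    return (x - x_min) / (x_max - x_min)

# Ranges taken from the training-set output above
speed_mid = min_max(75, 30, 120)      # midpoint of the speed range
age_mid = min_max(44, 18, 70)         # midpoint of the age range
speed_outside = min_max(130, 30, 120) # hypothetical unseen value above the training maximum

print(speed_mid, age_mid, round(speed_outside, 3))
```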


---
## ⚖️ HANDLING CLASS IMBALANCE
Techniques to balance the target variable distribution.

---

In [17]:
# ============================================================================
# CELL 15: Class Imbalance Analysis
# ============================================================================

print('='*80)
print('CLASS IMBALANCE ANALYSIS')
print('='*80)

# Calculate class distribution
class_counts = y_train_class.value_counts().sort_index()
class_percentages = (class_counts / len(y_train_class)) * 100

print('\n📊 Current Class Distribution (Training Set):')
for severity, count in class_counts.items():
    severity_name = {0: 'Minor', 1: 'Serious', 2: 'Fatal'}[severity]
    percentage = class_percentages[severity]
    print(f'  • {severity_name:<10} : {count:>5} ({percentage:>5.2f}%)')

# Calculate imbalance ratio
imbalance_ratio = class_counts.max() / class_counts.min()
print(f'\n📈 Imbalance Ratio: {imbalance_ratio:.2f}:1')

if imbalance_ratio < 1.5:
    print('  ✓ Classes are well balanced')
elif imbalance_ratio < 3:
    print('  ⚠ Slight imbalance - consider using class weights')
else:
    print('  ⚠ Significant imbalance - SMOTE recommended')

print('='*80)

CLASS IMBALANCE ANALYSIS

📊 Current Class Distribution (Training Set):
  • Minor      :   827 (34.46%)
  • Serious    :   785 (32.71%)
  • Fatal      :   788 (32.83%)

📈 Imbalance Ratio: 1.05:1
  ✓ Classes are well balanced


In [18]:
# ============================================================================
# CELL 16: SMOTE - Synthetic Minority Oversampling
# ============================================================================

print('='*80)
print('APPLYING SMOTE (Synthetic Minority Over-sampling Technique)')
print('='*80)

# Initialize SMOTE
smote = SMOTE(random_state=42, k_neighbors=5)

# Apply SMOTE to training data
X_train_smote, y_train_smote = smote.fit_resample(X_train_scaled, y_train_class)

print('\n✓ SMOTE applied successfully!')
print(f'  Before SMOTE: {X_train_scaled.shape[0]} samples')
print(f'  After SMOTE: {X_train_smote.shape[0]} samples')
print(f'  New samples created: {X_train_smote.shape[0] - X_train_scaled.shape[0]}')

print(f'\n📊 Class Distribution After SMOTE:')
smote_dist = y_train_smote.value_counts().sort_index()
for severity, count in smote_dist.items():
    severity_name = {0: 'Minor', 1: 'Serious', 2: 'Fatal'}[severity]
    print(f'  • {severity_name:<10} : {count:>5} ({(count/len(y_train_smote))*100:>5.2f}%)')

print('\n💡 Note: Use X_train_smote & y_train_smote for model training with balanced data')
print('         Keep X_test_scaled & y_test_class unchanged for evaluation')
print('='*80)

APPLYING SMOTE (Synthetic Minority Over-sampling Technique)

✓ SMOTE applied successfully!
  Before SMOTE: 2400 samples
  After SMOTE: 2481 samples
  New samples created: 81

📊 Class Distribution After SMOTE:
  • Minor      :   827 (33.33%)
  • Serious    :   827 (33.33%)
  • Fatal      :   827 (33.33%)

💡 Note: Use X_train_smote & y_train_smote for model training with balanced data
         Keep X_test_scaled & y_test_class unchanged for evaluation
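
The 81 new samples are exactly what is needed to bring Serious and Fatal up to the 827 rows of the majority class. Each synthetic row is a random point on the line segment between a minority sample and one of its nearest same-class neighbours. The sketch below reproduces the count and a single interpolation step with the standard library. It is a simplified illustration, not the imbalanced-learn implementation, and the two-column rows (including the `Road Type_Urban` flag) are hypothetical.

```python
import random

# Training counts from the class distribution above
train_counts = {'Minor': 827, 'Serious': 785, 'Fatal': 788}
target = max(train_counts.values())
synthetic_needed = {name: target - n for name, n in train_counts.items()}
print(synthetic_needed, sum(synthetic_needed.values()))

def interpolate(x, neighbor, rng):
    """Return one SMOTE-style point between x and a same-class neighbour."""
    lam = rng.random()  # uniform in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]

rng = random.Random(42)
# Hypothetical rows: [scaled speed limit, one-hot 'Road Type_Urban' flag]
x = [0.2, 0.0]
neighbor = [1.0, 1.0]
synthetic = interpolate(x, neighbor, rng)
print(synthetic)
```

The second component of the synthetic row will generally be fractional, so interpolated one-hot columns no longer hold valid 0/1 flags. For mixed numerical and categorical data, imbalanced-learn also provides `SMOTENC`. With an imbalance ratio of 1.05:1, training on the original rows is a reasonable option as well.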


In [19]:
# ============================================================================
# CELL 17: Compute Class Weights (Alternative to SMOTE)
# ============================================================================

print('='*80)
print('COMPUTING CLASS WEIGHTS (Alternative to SMOTE)')
print('='*80)

# Compute class weights
class_weights = compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train_class),
    y=y_train_class
)

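# Create class weight dictionary keyed by the actual class labels
# (enumerate() would only be correct while the labels happen to be 0, 1, 2, ...)
class_weight_dict = {
    int(label): float(weight)
    for label, weight in zip(np.unique(y_train_class), class_weights)
}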

print('\n✓ Class weights computed:')
for severity, weight in class_weight_dict.items():
    severity_name = {0: 'Minor', 1: 'Serious', 2: 'Fatal'}[severity]
    print(f'  • {severity_name:<10} : {weight:.4f}')

print('\n💡 Usage: Pass class_weight parameter to models')
print('  Example: RandomForestClassifier(class_weight=class_weight_dict)')
print('='*80)

COMPUTING CLASS WEIGHTS (Alternative to SMOTE)

✓ Class weights computed:
  • Minor      : 0.9674
  • Serious    : 1.0191
  • Fatal      : 1.0152

💡 Usage: Pass class_weight parameter to models
  Example: RandomForestClassifier(class_weight=class_weight_dict)
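
The 'balanced' weights follow the formula n_samples / (n_classes × class_count), so the largest class (Minor) gets a weight just below 1 and the two smaller classes get weights just above 1. Recomputing them from the training counts reproduces the values above:

```python
# Training counts per encoded class (0 = Minor, 1 = Serious, 2 = Fatal)
train_counts = {0: 827, 1: 785, 2: 788}

n_samples = sum(train_counts.values())
n_classes = len(train_counts)

# 'balanced' heuristic: n_samples / (n_classes * class_count)
weights = {
    label: n_samples / (n_classes * count)
    for label, count in train_counts.items()
}
for label, weight in weights.items():
    print(f'class {label}: {weight:.4f}')
```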


---
## 💾 SAVE PREPROCESSED DATA
Export processed datasets for model training.

---

In [20]:
# ============================================================================
# CELL 18: Save Preprocessed Datasets
# ============================================================================

print('='*80)
print('SAVING PREPROCESSED DATASETS')
print('='*80)

# Save as CSV
X_train_scaled.to_csv('X_train_scaled.csv', index=False)
X_test_scaled.to_csv('X_test_scaled.csv', index=False)
X_train_normalized.to_csv('X_train_normalized.csv', index=False)
X_test_normalized.to_csv('X_test_normalized.csv', index=False)

# Save SMOTE data
pd.DataFrame(X_train_smote, columns=X_train_scaled.columns).to_csv('X_train_smote.csv', index=False)
pd.DataFrame(y_train_smote, columns=['Severity_Encoded']).to_csv('y_train_smote.csv', index=False)

# Save target variables
y_train_class.to_csv('y_train_classification.csv', index=False)
y_test_class.to_csv('y_test_classification.csv', index=False)
y_train_casualties.to_csv('y_train_casualties.csv', index=False)
y_test_casualties.to_csv('y_test_casualties.csv', index=False)
y_train_fatalities.to_csv('y_train_fatalities.csv', index=False)
y_test_fatalities.to_csv('y_test_fatalities.csv', index=False)

# Save class weights
import pickle
with open('class_weights.pkl', 'wb') as f:
    pickle.dump(class_weight_dict, f)

# Save scalers
with open('scaler_standard.pkl', 'wb') as f:
    pickle.dump(scaler_standard, f)
with open('scaler_minmax.pkl', 'wb') as f:
    pickle.dump(scaler_minmax, f)

print('\n✓ All datasets saved successfully!')
print('\n📁 Files created:')
print('  Classification Data:')
print('    • X_train_scaled.csv, X_test_scaled.csv')
print('    • X_train_normalized.csv, X_test_normalized.csv')
print('    • X_train_smote.csv, y_train_smote.csv')
print('    • y_train_classification.csv, y_test_classification.csv')
print('  Regression Data:')
print('    • y_train_casualties.csv, y_test_casualties.csv')
print('    • y_train_fatalities.csv, y_test_fatalities.csv')
print('  Model Objects:')
print('    • class_weights.pkl')
print('    • scaler_standard.pkl, scaler_minmax.pkl')
print('='*80)

SAVING PREPROCESSED DATASETS

✓ All datasets saved successfully!

📁 Files created:
  Classification Data:
    • X_train_scaled.csv, X_test_scaled.csv
    • X_train_normalized.csv, X_test_normalized.csv
    • X_train_smote.csv, y_train_smote.csv
    • y_train_classification.csv, y_test_classification.csv
  Regression Data:
    • y_train_casualties.csv, y_test_casualties.csv
    • y_train_fatalities.csv, y_test_fatalities.csv
  Model Objects:
    • class_weights.pkl
    • scaler_standard.pkl, scaler_minmax.pkl
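
A later phase can restore the pickled objects with `pickle.load`. The round trip below is a minimal sketch that uses a temporary directory and the rounded class weights from the earlier output, so it does not touch the files written by this notebook.

```python
import pickle
import tempfile
from pathlib import Path

# Rounded values from the class-weight output above
class_weight_dict = {0: 0.9674, 1: 1.0191, 2: 1.0152}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'class_weights.pkl'
    with open(path, 'wb') as f:
        pickle.dump(class_weight_dict, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == class_weight_dict)
```

The same pattern applies to `scaler_standard.pkl` and `scaler_minmax.pkl`. Pickled scikit-learn objects should be loaded with the same library version that saved them, and pickle files should only be loaded from trusted sources.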


In [21]:
# ============================================================================
# CELL 19: Phase 3 Summary Report
# ============================================================================

print('='*80)
print('PHASE 3 SUMMARY REPORT: FEATURE ENGINEERING & PREPROCESSING')
print('='*80)

print('\n🎯 TASKS COMPLETED:')
print('-' * 80)

print('\n1. FEATURE ENGINEERING:')
print('  ✓ Time-based features: Hour, Time_Period, Is_Weekend, Season')
print('  ✓ Demographic features: Driver_Age_Group, Speed_Category')
print('  ✓ Risk indicators: High_Risk_Weather, Poor_Visibility, Risk_Score')
print('  ✓ Interaction features: Multi_Vehicle_Accident, Has_Fatality')

print('\n2. MISSING VALUE TREATMENT:')
print('  ✓ Categorical: Filled with "Unknown"')
print('  ✓ Numerical: Filled with median values')
print('  ✓ Zero missing values after imputation')

print('\n3. FEATURE ENCODING:')
print('  ✓ Target variable: Label encoded (0-2)')
print('  ✓ Ordinal features: Label encoded with order')
print('  ✓ Binary features: Binary encoded (0/1)')
print(f'  ✓ Nominal features: One-hot encoded ({len(nominal_features)} features)')

print('\n4. FEATURE SCALING:')
print(f'  ✓ StandardScaler applied to {len(numerical_cols_in_X)} numerical features')
print('  ✓ MinMaxScaler created for neural network models')

print('\n5. TRAIN-TEST SPLIT:')
print(f'  ✓ Training samples: {X_train.shape[0]} (80%)')
print(f'  ✓ Test samples: {X_test.shape[0]} (20%)')
print(f'  ✓ Total features: {X_train.shape[1]}')
print('  ✓ Stratified split maintained class distribution')

print('\n6. CLASS IMBALANCE HANDLING:')
print(f'  ✓ SMOTE applied: {X_train_scaled.shape[0]} → {X_train_smote.shape[0]} samples')
print('  ✓ Class weights computed for weighted learning')
print('  ✓ Balanced dataset ready for training')

print('\n7. DATA EXPORT:')
print('  ✓ 12 CSV files saved')
print('  ✓ 3 pickle files saved (scalers & weights)')
print('  ✓ Ready for Phase 4: Model Training')

print('\n' + '='*80)
print('📊 FINAL DATASET STATISTICS:')
print('-' * 80)
print(f'Original Dataset: {df.shape[0]} rows × {df.shape[1]} columns')
print(f'Processed Dataset: {X_train_smote.shape[0]} training samples × {X_train_smote.shape[1]} features')
print(f'Feature Engineering Added: {len(df_processed.columns) - len(df.columns)} new features')
print(f'One-Hot Encoding Added: {len(df_encoded.columns) - len(df_processed.columns)} net new columns')

print('\n✓ PHASE 3: FEATURE ENGINEERING & PREPROCESSING COMPLETED!')
print('='*80)
print('\n🚀 Ready for Phase 4: Model Development & Training')

PHASE 3 SUMMARY REPORT: FEATURE ENGINEERING & PREPROCESSING

🎯 TASKS COMPLETED:
--------------------------------------------------------------------------------

1. FEATURE ENGINEERING:
  ✓ Time-based features: Hour, Time_Period, Is_Weekend, Season
  ✓ Demographic features: Driver_Age_Group, Speed_Category
  ✓ Risk indicators: High_Risk_Weather, Poor_Visibility, Risk_Score
  ✓ Interaction features: Multi_Vehicle_Accident, Has_Fatality

2. MISSING VALUE TREATMENT:
  ✓ Categorical: Filled with "Unknown"
  ✓ Numerical: Filled with median values
  ✓ Zero missing values after imputation

3. FEATURE ENCODING:
  ✓ Target variable: Label encoded (0-2)
  ✓ Ordinal features: Label encoded with order
  ✓ Binary features: Binary encoded (0/1)
  ✓ Nominal features: One-hot encoded (8 features)

4. FEATURE SCALING:
  ✓ StandardScaler applied to 5 numerical features
  ✓ MinMaxScaler created for neural network models

5. TRAIN-TEST SPLIT:
  ✓ Training samples: 2400 (80%)
  ✓ Test samples: 600 (20%)
  