> # **Predicting Road Accident Severity using Machine Learning**

In [3]:
# Import necessary libraries

import numpy as np
import pandas as pd

In [4]:
# Read data

data = pd.read_csv("road_accident_dataset.csv")
data.head()

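|   | Country | Year | Month | Day of Week | Time of Day | Urban/Rural | Road Type | Weather Conditions | Visibility Level | Number of Vehicles Involved | ... | Number of Fatalities | Emergency Response Time | Traffic Volume | Road Condition | Accident Cause | Insurance Claims | Medical Cost | Economic Loss | Region | Population Density |
|---|---------|------|-------|-------------|-------------|-------------|-----------|--------------------|------------------|-----------------------------|-----|----------------------|-------------------------|----------------|----------------|----------------|------------------|--------------|---------------|--------|--------------------|
| 0 | USA | 2002 | October | Tuesday | Evening | Rural | Street | Windy | 220.414651 | 1 | ... | 2 | 58.62572 | 7412.75276 | Wet | Weather | 4 | 40499.856982 | 22072.878502 | Europe | 3866.273014 |
| 1 | UK | 2014 | December | Saturday | Evening | Urban | Street | Windy | 168.311358 | 3 | ... | 1 | 58.04138 | 4458.62882 | Snow-covered | Mechanical Failure | 3 | 6486.600073 | 9534.399441 | North America | 2333.916224 |
| 2 | USA | 2012 | July | Sunday | Afternoon | Urban | Highway | Snowy | 341.286506 | 4 | ... | 4 | 42.374452 | 9856.915064 | Wet | Speeding | 4 | 29164.412982 | 58009.145124 | South America | 4408.889129 |
| 3 | UK | 2017 | May | Saturday | Evening | Urban | Main Road | Clear | 489.384536 | 2 | ... | 3 | 48.554014 | 4958.646267 | Icy | Distracted Driving | 3 | 25797.212566 | 20907.151302 | Australia | 2810.822423 |
| 4 | Canada | 2002 | July | Tuesday | Afternoon | Rural | Highway | Rainy | 348.34485 | 1 | ... | 4 | 18.31825 | 3843.191463 | Icy | Distracted Driving | 8 | 15605.293921 | 13584.060759 | South America | 3883.645634 |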


> ## **Goal:** 
> #### Train a machine learning model that can **accurately classify the severity of accidents** based on available features.

- **Target Variable**: `Accident Severity` (Categorical: `Minor`, `Moderate`, `Severe`)
- **Feature Selection**: We'll use relevant independent variables such as:
    - Road & Traffic Factors: `Road Type`, `Speed Limit`, `Traffic Volume`, `Road Condition`
    - Driver Factors: `Driver Age Group`, `Driver Gender`, `Driver Alcohol Level`, `Driver Fatigue`
    - Accident Conditions: `Weather Conditions`, `Visibility Level`, `Time of Day`, `Urban/Rural`
    - Impact Factors: `Number of Vehicles Involved`, `Pedestrians Involved`, `Cyclists Involved`
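
As a minimal sketch, the feature groups listed above could be collected into a single list of column names. The names `feature_groups` and `selected_features` are illustrative and do not appear in the notebook; the column names are copied from the list above.

```python
# Hypothetical grouping of the candidate features named in the goal statement.
feature_groups = {
    "road_traffic": ["Road Type", "Speed Limit", "Traffic Volume", "Road Condition"],
    "driver": ["Driver Age Group", "Driver Gender", "Driver Alcohol Level", "Driver Fatigue"],
    "conditions": ["Weather Conditions", "Visibility Level", "Time of Day", "Urban/Rural"],
    "impact": ["Number of Vehicles Involved", "Pedestrians Involved", "Cyclists Involved"],
}

target = "Accident Severity"

# Flatten the groups into one ordered list of column names.
selected_features = [col for cols in feature_groups.values() for col in cols]

print(len(selected_features), "candidate features")
```

With the dataset loaded, `data[selected_features]` would then give a feature matrix restricted to these columns.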

## **1. Logistic Regression**

### **Data Preprocessing**

In [5]:
# Check metadata
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132000 entries, 0 to 131999
Data columns (total 30 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Country                      132000 non-null  object 
 1   Year                         132000 non-null  int64  
 2   Month                        132000 non-null  object 
 3   Day of Week                  132000 non-null  object 
 4   Time of Day                  132000 non-null  object 
 5   Urban/Rural                  132000 non-null  object 
 6   Road Type                    132000 non-null  object 
 7   Weather Conditions           132000 non-null  object 
 8   Visibility Level             132000 non-null  float64
 9   Number of Vehicles Involved  132000 non-null  int64  
 10  Speed Limit                  132000 non-null  int64  
 11  Driver Age Group             132000 non-null  object 
 12  Driver Gender                132000 non-null  object 
 13 

In [6]:
# Check Missing Values
data.isna().sum()

Country                        0
Year                           0
Month                          0
Day of Week                    0
Time of Day                    0
Urban/Rural                    0
Road Type                      0
Weather Conditions             0
Visibility Level               0
Number of Vehicles Involved    0
Speed Limit                    0
Driver Age Group               0
Driver Gender                  0
Driver Alcohol Level           0
Driver Fatigue                 0
Vehicle Condition              0
Pedestrians Involved           0
Cyclists Involved              0
Accident Severity              0
Number of Injuries             0
Number of Fatalities           0
Emergency Response Time        0
Traffic Volume                 0
Road Condition                 0
Accident Cause                 0
Insurance Claims               0
Medical Cost                   0
Economic Loss                  0
Region                         0
Population Density             0
dtype: int

In [7]:
# Check duplicated records
data.duplicated().sum()

np.int64(0)

#### **Note:**
- The dataset has no duplicated records or missing values.
- The columns also have appropriate datatypes.

### **1.1 Select Relevant Features**

In [8]:
# target feature
target_feature= 'Accident Severity'

# independent features
independent_features = [i for i in data.columns if i != target_feature]

### **1.2 Encoding categorical features**

In [9]:
from sklearn.preprocessing import LabelEncoder

In [10]:
categorical_features = [i for i in data.columns if data[i].dtype == 'object']

# To store codes
label_encoders= {}

for i in categorical_features:
    le = LabelEncoder()
    data[i] = le.fit_transform(data[i])
    label_encoders[i] = le

### **1.3 Split the data into X and y**

In [11]:
X = data[independent_features]
y = LabelEncoder().fit_transform(data[target_feature])

### **1.4 Eliminate Unnecessary variables by checking Multicollinearity**

In [12]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif(data):
    data_vif = pd.DataFrame()
    data_vif['Feature'] = data.columns
    data_vif['VIF'] = [ variance_inflation_factor(data.values, i) for i in range(data.shape[1])]
    return data_vif

vif_df = calculate_vif(X)
vif_df

|   | Feature                     | VIF       |
|---|-----------------------------|-----------|
| 0 | Country                     | 3.430366  |
| 1 | Year                        | 73.119451 |
| 2 | Month                       | 3.550592  |
| 3 | Day of Week                 | 3.256135  |
| 4 | Time of Day                 | 2.801412  |
| 5 | Urban/Rural                 | 1.985246  |
| 6 | Road Type                   | 2.503156  |
| 7 | Weather Conditions          | 3.003488  |
| 8 | Visibility Level            | 5.482425  |
| 9 | Number of Vehicles Involved | 6.012174  |


In [13]:
while True:
    vif_df = calculate_vif(X)
    max_vif = vif_df["VIF"].max()
    if max_vif > 5:  # Threshold for multicollinearity
        remove_feature = vif_df.sort_values(by="VIF", ascending=False).iloc[0]["Feature"]
        X = X.drop(columns=[remove_feature])
        print(f"Removed {remove_feature} with VIF {max_vif}")
    else:
        break

Removed Year with VIF 73.11945100534781
Removed Speed Limit with VIF 8.193433874019304
Removed Number of Vehicles Involved with VIF 5.549488957202939
Removed Visibility Level with VIF 5.077798232965627
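
The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. To my understanding, statsmodels' `variance_inflation_factor` fits that regression on the matrix exactly as given, so without an added constant column the R² is uncentered and a feature with a large mean and small spread (such as `Year`) can receive a large VIF without being collinear with anything. This is worth checking against the statsmodels documentation. The sketch below illustrates the effect with plain NumPy on synthetic data; the function `vif` and the data are illustrative and not part of the notebook.

```python
import numpy as np

def vif(X, j, add_intercept):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    if add_intercept:
        others = np.column_stack([np.ones(len(y)), others])
        total = np.sum((y - y.mean()) ** 2)  # centered total sum of squares
    else:
        total = np.sum(y ** 2)               # uncentered, as in a no-intercept fit
    coef, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ coef
    r_squared = 1.0 - np.sum(resid ** 2) / total
    return 1.0 / (1.0 - r_squared)

# Synthetic, mutually independent columns (assumption: stand-ins for real features).
rng = np.random.default_rng(0)
n = 2000
year = rng.integers(2000, 2024, size=n).astype(float)  # large mean, small spread
a = rng.normal(10, 2, size=n)
b = rng.normal(10, 2, size=n)
X_demo = np.column_stack([year, a, b])

vif_raw = vif(X_demo, 0, add_intercept=False)
vif_centered = vif(X_demo, 0, add_intercept=True)
print(f"VIF of year without intercept: {vif_raw:.1f}")
print(f"VIF of year with intercept:    {vif_centered:.3f}")
```

If the same effect applies here, some of the features removed above may have been dropped because of their scale rather than because of genuine multicollinearity.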


### **1.5 Splitting dataset**

In [14]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

### **1.6 Scaling numerical features**

In [15]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

### **1.7 Train Logistic Regression Model**

In [16]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

### **1.8 Predictions**

In [17]:
y_pred = model.predict(X_test)

### **1.9 Evaluation**

In [18]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Accuracy: 0.3355681818181818
Classification Report:
               precision    recall  f1-score   support

           0       0.33      0.37      0.35      8813
           1       0.34      0.34      0.34      8800
           2       0.33      0.29      0.31      8787

    accuracy                           0.34     26400
   macro avg       0.34      0.34      0.33     26400
weighted avg       0.34      0.34      0.33     26400

Confusion Matrix:
 [[3294 2939 2580]
 [3266 2995 2539]
 [3344 2873 2570]]


### **`Reading Evaluation Metrics`: Logistic Regression**

**Accuracy: 33.56%**
- This is poor performance for a three-class problem with balanced classes.
- Since there are three classes, a random guess would yield ~33.33% accuracy, meaning the model isn’t much better than random guessing.
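
One way to make the chance-level comparison concrete is a majority-class baseline. The sketch below is an illustration, not part of the notebook: it rebuilds a label vector from the support counts in the classification report above (8813, 8800, 8787) and scores scikit-learn's `DummyClassifier` on it. In the notebook itself, the baseline would be fit on `X_train`, `y_train` and scored on the test split.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Labels rebuilt from the per-class support counts in the classification report.
y_test_like = np.repeat([0, 1, 2], [8813, 8800, 8787])
# Placeholder features: the dummy model ignores them entirely.
X_placeholder = np.zeros((len(y_test_like), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_test_like)
baseline_acc = accuracy_score(y_test_like, baseline.predict(X_placeholder))

print(f"Majority-class baseline accuracy: {baseline_acc:.4f}")
```

Always predicting the most frequent class already gives about 33.4% on labels with this distribution, which is the bar a trained model needs to clear.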

**Precision, Recall, and F1-Score**
- All three classes (0 = Minor, 1 = Moderate, 2 = Severe) have similar precision and recall (~33-34%).
- Low Recall (0.29 - 0.37) => The model is missing actual cases of some severity levels.
- Poor F1-scores (~0.33) => The model struggles to distinguish between classes effectively.
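
The mapping 0 = Minor, 1 = Moderate, 2 = Severe follows from how `LabelEncoder` assigns codes: it sorts the distinct labels and numbers them in that order. A small sketch on a toy label list (the list itself is made up for illustration):

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels in arbitrary order; the real column holds the same three values.
le = LabelEncoder()
codes = le.fit_transform(["Severe", "Minor", "Moderate", "Minor"])

# classes_ is sorted, and a label's position in it is its integer code.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from integer codes.
decoded = list(le.inverse_transform([0, 1, 2]))
print(decoded)
```

Keeping the fitted encoder for the target (rather than creating a throwaway one) makes it possible to decode predictions back to severity names later.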

**Confusion Matrix Analysis**  

| Actual → Predicted | Minor (0)        | Moderate (1)  | Severe (2) |
|--------------------|------------------|---------------|------------|
| Minor (0)          | 3,294            | 2,939         | 2,580      |
| Moderate (1)       | 3,266            | 2,995         | 2,539      |
| Severe (2)         | 3,344            | 2,873         | 2,570      |

- The model cannot differentiate between classes.
- Predictions are spread almost equally across all three classes.
- Potential Reason: poor feature separability.


---
#### **Note:** Let's try a Random Forest Classifier, which can capture non-linear relationships that Logistic Regression cannot.
---

## **2. Random Forest Classifier**

### **2.1 Train Random Forest Model**

In [19]:
# Import Random Forest Model

from sklearn.ensemble import RandomForestClassifier

# Train Random Forest Model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

### **2.2 Predictions**

In [20]:
y_pred = model.predict(X_test)

### **2.3 Evaluation**

In [21]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Feature Importance
feature_importances = pd.DataFrame({"Feature": X.columns, "Importance": model.feature_importances_})
feature_importances = feature_importances.sort_values(by="Importance", ascending=False)
print("Top 10 Important Features:\n", feature_importances.head(10))

Accuracy: 0.33515151515151514
Classification Report:
               precision    recall  f1-score   support

           0       0.33      0.36      0.34      8813
           1       0.34      0.34      0.34      8800
           2       0.33      0.31      0.32      8787

    accuracy                           0.34     26400
   macro avg       0.34      0.34      0.33     26400
weighted avg       0.34      0.34      0.33     26400

Confusion Matrix:
 [[3159 2901 2753]
 [3137 2958 2705]
 [3214 2842 2731]]
Top 10 Important Features:
                     Feature  Importance
22            Economic Loss    0.077341
9      Driver Alcohol Level    0.077242
24       Population Density    0.077126
17           Traffic Volume    0.077120
16  Emergency Response Time    0.077076
21             Medical Cost    0.076872
14       Number of Injuries    0.053712
1                     Month    0.044454
20         Insurance Claims    0.043574
0                   Country    0.043521


### **`Reading Evaluation Metrics`: Random Forest Classifier**

**Accuracy (33.52%)**
- The model performs similarly to Logistic Regression (~33%).  
- Since there are three classes, a random guess would yield 33.33% accuracy.  
- This suggests that the model is not effectively learning meaningful patterns.

**Classification Report Breakdown**  
| Class (Accident Severity)| Precision | Recall | F1-Score| Support (Test Data)|
|--------------------------|-----------|--------|---------|--------------------|
| Minor (0)            | 0.33      | 0.36   | 0.34    | 8813               |
| Moderate (1)         | 0.34      | 0.34   | 0.34    | 8800               |
| Severe (2)           | 0.33      | 0.31   | 0.32    | 8787               |

- Low Precision & Recall across all classes means the model struggles to classify accidents accurately.  
- F1-Scores are around 0.33, confirming that the model is nearly guessing randomly.  

**Confusion Matrix Interpretation**  
| Actual → Predicted | Minor (0) | Moderate (1) | Severe (2) |
|--------------------|-----------|--------------|------------|
| Minor (0)          | 3159      | 2901         | 2753       |
| Moderate (1)       | 3137      | 2958         | 2705       |
| Severe (2)         | 3214      | 2842         | 2731       |
  
- A lot of misclassifications between all three classes.
- The model is not able to distinguish accident severity well.

**Feature Importance**  
| Rank | Feature               | Importance |
|------|------------------------|------------|
| 1    | Economic Loss       | 0.077341   |
| 2    | Driver Alcohol Level | 0.077242   |
| 3    | Population Density   | 0.077126   |
| 4    | Traffic Volume       | 0.077120   |
| 5    | Emergency Response Time | 0.077076   |
| 6    | Medical Cost         | 0.076872   |
| 7    | Number of Injuries   | 0.053712   |
| 8    | Month                | 0.044454   |
| 9    | Insurance Claims     | 0.043574   |
| 10   | Country              | 0.043521   |

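- No feature dominates: the six highest-ranked features (`Economic Loss`, `Driver Alcohol Level`, `Population Density`, `Traffic Volume`, `Emergency Response Time`, `Medical Cost`) all score about 0.077, so the ordering among them carries little information.
- These top features appear to be continuous variables, and impurity-based importance tends to favor features with many distinct values, so the ranking may reflect feature type rather than predictive power.
- Road/Weather conditions are missing from the top features → These might not be strong predictors.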
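
Impurity-based importances from a Random Forest should be read with care when the model is at chance level, because they tend to favor features with many distinct values even when no feature carries signal. The sketch below illustrates this on purely random synthetic data (two continuous and two binary columns, random three-class target); none of it comes from the accident dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 3000

# All four columns are pure noise with respect to the target.
X_noise = np.column_stack([
    rng.normal(size=n),          # continuous noise
    rng.normal(size=n),          # continuous noise
    rng.integers(0, 2, size=n),  # binary noise
    rng.integers(0, 2, size=n),  # binary noise
])
y_noise = rng.integers(0, 3, size=n)  # random three-class target

forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X_noise, y_noise)
importances = forest.feature_importances_

for name, value in zip(["cont_1", "cont_2", "bin_1", "bin_2"], importances):
    print(f"{name}: {value:.3f}")
```

The continuous noise columns receive most of the importance simply because they offer more candidate split points. Permutation importance on held-out data (`sklearn.inspection.permutation_importance`) is a common alternative that is less affected by this.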

---
#### **Note:** Let's try XGBoost, a gradient-boosted tree model that may capture complex patterns better than Random Forest.
---

## **3. XGBoost**

### **3.1 Feature Engineering**

In [22]:
# Creating interaction features
data["Speed_Traffic_Interaction"] = data["Speed Limit"] * data["Traffic Volume"]
# Note: "Time of Day" was label-encoded in 1.2, so this product uses arbitrary integer codes
data["Alcohol_Time_Interaction"] = data["Driver Alcohol Level"] * data["Time of Day"]

# Binning continuous variables
data["Speed_Category"] = pd.cut(data["Speed Limit"], bins=[0, 40, 80, np.inf], labels=["Low", "Medium", "High"])
data["Response_Time_Category"] = pd.cut(data["Emergency Response Time"], bins=[0, 5, 15, np.inf], labels=["Fast", "Moderate", "Slow"])
data = pd.get_dummies(data, columns=["Speed_Category", "Response_Time_Category"], drop_first=True)

# Extracting temporal features
# "Day of Week" was label-encoded in 1.2, so compare against the encoded weekend codes
# (comparing against the strings "Saturday"/"Sunday" would always give 0)
weekend_codes = label_encoders["Day of Week"].transform(["Saturday", "Sunday"])
data["Is_Weekend"] = data["Day of Week"].isin(weekend_codes).astype(int)

### **3.2 Split the data into X and y**

In [23]:
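# Note: independent_features was defined in 1.1, before the columns in 3.1 were created,
# so the engineered features are not part of X here
X = data[independent_features]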
y = LabelEncoder().fit_transform(data[target_feature])
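
A feature list built before new columns are added does not pick them up automatically. The sketch below shows the effect on a tiny made-up frame and one way to rebuild the list; `features_before` and `features_after` are illustrative names, not variables from the notebook.

```python
import pandas as pd

# Tiny stand-in frame with two original features and the target.
df = pd.DataFrame({
    "Speed Limit": [30, 90],
    "Traffic Volume": [100.0, 250.0],
    "Accident Severity": [0, 2],
})
target = "Accident Severity"

# List built before feature engineering.
features_before = [c for c in df.columns if c != target]

# Add an engineered column, as done in 3.1.
df["Speed_Traffic_Interaction"] = df["Speed Limit"] * df["Traffic Volume"]

# Rebuilding the list afterwards includes the new column.
features_after = [c for c in df.columns if c != target]

print(features_before)
print(features_after)
```

Applying the same rebuild to `independent_features` before selecting `X` would let the engineered columns reach the model; the VIF and evaluation outputs below would then need to be regenerated.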

### **3.3 Eliminate Unnecessary variables by checking Multicollinearity**

In [24]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif(data):
    data_vif = pd.DataFrame()
    data_vif['Feature'] = data.columns
    data_vif['VIF'] = [ variance_inflation_factor(data.values, i) for i in range(data.shape[1])]
    return data_vif

vif_df = calculate_vif(X)
vif_df

|   | Feature                     | VIF       |
|---|-----------------------------|-----------|
| 0 | Country                     | 3.430366  |
| 1 | Year                        | 73.119451 |
| 2 | Month                       | 3.550592  |
| 3 | Day of Week                 | 3.256135  |
| 4 | Time of Day                 | 2.801412  |
| 5 | Urban/Rural                 | 1.985246  |
| 6 | Road Type                   | 2.503156  |
| 7 | Weather Conditions          | 3.003488  |
| 8 | Visibility Level            | 5.482425  |
| 9 | Number of Vehicles Involved | 6.012174  |


In [25]:
while True:
    vif_df = calculate_vif(X)
    max_vif = vif_df["VIF"].max()
    if max_vif > 5: 
        remove_feature = vif_df.sort_values(by="VIF", ascending=False).iloc[0]["Feature"]
        X = X.drop(columns=[remove_feature])
        print(f"Removed {remove_feature} with VIF {max_vif}")
    else:
        break

Removed Year with VIF 73.11945100534781
Removed Speed Limit with VIF 8.193433874019304
Removed Number of Vehicles Involved with VIF 5.549488957202939
Removed Visibility Level with VIF 5.077798232965627


### **3.4 Splitting dataset**

In [26]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

### **3.5 Scaling numerical features**

In [27]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

### **3.6 Train XGBoost Model**

In [28]:
from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=200, learning_rate=0.05, max_depth=8, subsample=0.8, colsample_bytree=0.8, random_state=42)
model.fit(X_train, y_train)


### **3.7 Predictions**

In [29]:
y_pred = model.predict(X_test)

### **3.8 Evaluation**

In [30]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Accuracy: 0.33223484848484847
Classification Report:
               precision    recall  f1-score   support

           0       0.34      0.35      0.34      8813
           1       0.33      0.33      0.33      8800
           2       0.33      0.32      0.32      8787

    accuracy                           0.33     26400
   macro avg       0.33      0.33      0.33     26400
weighted avg       0.33      0.33      0.33     26400

Confusion Matrix:
 [[3082 2864 2867]
 [3015 2918 2867]
 [3058 2958 2771]]


### **`Reading Evaluation Metrics`: XGBoost**

- The XGBoost model is still performing similarly to previous models (~33% accuracy). 
- It suggests that the **features may not be effectively distinguishing accident severity**.

1. **Accuracy (33.22%)**
- Nearly the same as random guessing (since there are 3 classes).  

2. **Classification Report**
- Low precision & recall (~33%) across all classes => Model isn't learning meaningful patterns.  

3. **Confusion Matrix**  
- High misclassification => The model struggles to separate classes. 

## **Final Takeaways from the Project**
- After training three models (`Logistic Regression`, `Random Forest`, and `XGBoost`), **no model significantly outperformed a random guess** (~33% accuracy).
- **Features in the dataset may not be strong predictors of accident severity**.
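
The claim that no model beats a random guess can be checked with a simple normal-approximation test. The sketch below compares each reported test accuracy with the 1/3 expected from uniform guessing over three classes, using the test-set size of 26,400 from the classification reports. It is an illustrative check, not part of the original analysis.

```python
import math

n_test = 26400   # test-set size from the classification reports
chance = 1 / 3   # expected accuracy of uniform random guessing over 3 classes

# Accuracies as printed by the three evaluation cells.
accuracies = {
    "Logistic Regression": 0.3355681818181818,
    "Random Forest": 0.33515151515151514,
    "XGBoost": 0.33223484848484847,
}

# Standard error of an accuracy estimate under the chance hypothesis.
std_err = math.sqrt(chance * (1 - chance) / n_test)

z_scores = {}
for name, acc in accuracies.items():
    z = (acc - chance) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    z_scores[name] = z
    print(f"{name}: z = {z:+.2f}, p = {p_value:.2f}")
```

All three z-scores fall well inside ±1.96, so none of the accuracies is statistically distinguishable from chance at the 5% level.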

### **Key Findings**
- **Feature Importance & Engineering:**
  - Economic factors (`Economic Loss`, `Medical Cost`, `Insurance Claims`) ranked among the top features but may not be directly linked to real-time accident severity.
  - Road conditions, weather, and visibility were not among the most influential factors.
  - Interaction terms (Speed × Traffic Volume, Alcohol Level × Time of Day) were created, but `X` was still built from the original feature list, so their effect on performance was not actually tested.

- **Model Performance:**
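  - **Logistic Regression:** No better than chance; since the non-linear models did no better, its linear decision boundary is unlikely to be the main limitation.
  - **Random Forest:** Provided feature importance but had **high misclassification rates**.
  - **XGBoost:** Performed similarly, even with a larger hand-configured model (200 trees, depth 8); no hyperparameter search was run.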