## Casualty Prediction

Lindy Castellaw

The goal of this project is to investigate casualty class (driver or rider, pedestrian, or passenger) in road accidents, in the hope of informing additional safety measures for these classes. The data set was prepared from manual records of road traffic accidents from 2017-20, with sensitive information already removed. It has 33 features and 12,316 accident instances, covering weather conditions, vehicle types, the number of casualties, and information about them, so there are many features available for analysis. I hope to show casualty traits through visualizations and to create an algorithm that can predict the severity of accidents.


There are some questions that can be answered using this data such as:
- Does lighting affect class of casualty?
- Does gender affect severity?
- Which age groups are most likely to be involved in accidents?
- What are the areas with higher accident severity or lower accident severity?

We will answer a few of the questions listed above. We will also look for a way to apply machine learning to this dataset and see what we can come up with.

In [None]:
# Mathematical functions
import math
from scipy import stats 
# Data manipulation
import numpy as np
import pandas as pd

# Plotting and visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Missing data imputation
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from io import StringIO 

# Categorical data encoding
from sklearn.preprocessing import LabelEncoder

# Train-test split and k-fold cross validation
from sklearn.model_selection import train_test_split

# Feature selection
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif


# Classification algorithms
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# Model evaluation
from sklearn import metrics
from sklearn.metrics import f1_score
from sklearn.metrics import mean_squared_error
# Explainable AI
!pip install --quiet shap==0.39.0
import shap

# Warning suppression
import warnings
warnings.filterwarnings('ignore')
import acquire
import prepare 
from functions import split, feature_chi2

Acquire the data from the CSV and run it through the prep_data function. We prepared the df by:

- Filling missing values with the mode of the column
- Grouping outliers
- Encoding categorical columns
- Dropping columns we don't need
- Encoding Casualty_class as 1 - 'Driver or rider', 2 - 'Pedestrian', 3 - 'Passenger'

After preparing the data we are left with 32 columns and 12,316 rows to explore.
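
The `prepare` module itself is not shown here, so the following is only a minimal sketch of what the mode-filling, column-dropping, and encoding steps might look like. The function name `prep_sketch`, the toy column values, and the use of pandas category codes are all assumptions for illustration, not the project's actual code. Outlier grouping is omitted because its rules are not described.

```python
import pandas as pd

# Mapping taken from the list above
CASUALTY_CODES = {"Driver or rider": 1, "Pedestrian": 2, "Passenger": 3}

def prep_sketch(df: pd.DataFrame, drop_cols=()) -> pd.DataFrame:
    """Hypothetical stand-in for prepare.prep_data (assumed behavior)."""
    out = df.drop(columns=list(drop_cols))
    # Fill missing values in each column with that column's mode
    for col in out.columns:
        if out[col].isna().any():
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    # Encode the target with the explicit mapping
    out["Casualty_class"] = out["Casualty_class"].map(CASUALTY_CODES)
    # Encode every remaining non-numeric column as integer category codes
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].astype("category").cat.codes
    return out

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "Light_conditions": ["Daylight", None, "Daylight", "Darkness"],
    "Casualty_class": ["Driver or rider", "Pedestrian", None, "Driver or rider"],
    "Time": ["17:02", "17:02", "1:06", "1:06"],
})
prepped = prep_sketch(toy, drop_cols=["Time"])
print(prepped)
```

Category codes are assigned in sorted label order, so the integers produced by this sketch will not necessarily match the encoding the real `prep_data` uses.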


In [None]:
get = acquire.get_data()

In [None]:
df = prepare.prep_data(get)
df.head()

### Explore 

First we must split the data into train, validate, and test sets using the split function.

In [None]:
train, X_train, X_validate, X_test, y_train, y_validate, y_test = split(df, stratify_by='Casualty_class')
train.head()
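
The `split` helper lives in the project's `functions` module and is not shown. As a rough sketch, a stratified three-way split with the same return signature could be built from two calls to `train_test_split`. The name `split_sketch`, the 20% and 30% proportions, and the seed are assumptions, not values taken from the project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def split_sketch(df, stratify_by, seed=123):
    """Hypothetical stand-in for functions.split (proportions are assumed)."""
    # Hold out 20% for test, stratified on the target
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[stratify_by])
    # Hold out 30% of the remainder for validate
    train, validate = train_test_split(
        train_validate, test_size=0.3, random_state=seed,
        stratify=train_validate[stratify_by])
    # Separate the features from the target in each subset
    X_train, y_train = train.drop(columns=[stratify_by]), train[stratify_by]
    X_validate, y_validate = validate.drop(columns=[stratify_by]), validate[stratify_by]
    X_test, y_test = test.drop(columns=[stratify_by]), test[stratify_by]
    return train, X_train, X_validate, X_test, y_train, y_validate, y_test

# Toy frame with an imbalanced three-class target, invented for illustration
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Light_conditions": rng.integers(0, 4, size=100),
    "Casualty_class": [1] * 60 + [2] * 30 + [3] * 10,
})
parts = split_sketch(toy, stratify_by="Casualty_class")
toy_train, toy_X_train, toy_X_validate, toy_X_test = parts[:4]
print(len(toy_train), len(toy_X_validate), len(toy_X_test))
```

Stratifying on `Casualty_class` keeps the class proportions roughly equal across the three subsets, which matters when one class is much rarer than the others.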

In [None]:
sns.countplot(x="Age_band_of_casualty", data=train, hue="Casualty_class")
plt.title('Age of casualty vs. Class')

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(15, 8))
sns.despine(top=True, right=True, left=False, bottom=False)
# Pass the column as x=; recent seaborn versions reject positional data arguments
ax1 = sns.countplot(x="Casualty_class", hue="Sex_of_casualty_Male",
                    palette="magma", data=train, ax=ax[0])

ax2 = sns.countplot(x="Accident_severity", hue="Sex_of_casualty_Male",
                    palette="magma", data=train, ax=ax[1])

In [None]:
var = ["Age_band_of_casualty","Area_accident_occured", "Light_conditions"]
for v in var:
    sns.set(style="darkgrid")
    sns.countplot(x=v, data=train)
    plt.show()

From these charts we can answer some of our questions. In general, there are more women than men among casualties, especially among drivers or riders, and the count for women is also higher across accident severity levels. The age band with the most accidents is 18-30. In the last chart, category 3, which represents daylight, is where most accidents occur. Among locations, Office Areas and Other have the highest counts.

### Testing

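Chi2 test on light conditions and casualty class:


 - Ho, light conditions and casualty class are independent (light conditions do not affect casualty class)
 - Ha, light conditions and casualty class are associated (light conditions affect casualty class)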

In [None]:
a = train.Casualty_class
b = train.Light_conditions
observed = pd.crosstab(a, b)
# Named chi2_stat so it does not shadow chi2 imported from sklearn.feature_selection
chi2_stat, p, degf, expected = stats.chi2_contingency(observed)
alpha = 0.05
print(f'chi2 = {chi2_stat:.2f}')
print(f'p value: {p:.4f}')
if p < alpha:
    print('We can reject the null hypothesis')
else:
    print('We fail to reject the null hypothesis')

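Chi2 test on driving experience and casualty class:

- Ho, driving experience and casualty class are independent (driving experience does not affect casualty class)

- Ha, driving experience and casualty class are associated (driving experience affects casualty class)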

In [None]:
a2 = train.Casualty_class
b2 = train.Driving_experience
observed2 = pd.crosstab(a2, b2)
# Named chi2_stat so it does not shadow chi2 imported from sklearn.feature_selection
chi2_stat, p, degf, expected = stats.chi2_contingency(observed2)
alpha = 0.05
print(f'chi2 = {chi2_stat:.2f}')
print(f'p value: {p:.4f}')
if p < alpha:
    print('We can reject the null hypothesis')
else:
    print('We fail to reject the null hypothesis')
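
The two chi-square cells above follow the same pattern, so they could be folded into one small helper. The sketch below is one possible shape for it. The name `chi2_independence` and the toy counts are hypothetical and were chosen so that the two columns are clearly associated.

```python
import pandas as pd
from scipy import stats

def chi2_independence(df, col_a, col_b, alpha=0.05):
    """Chi-square test of independence between two categorical columns."""
    observed = pd.crosstab(df[col_a], df[col_b])
    chi2_stat, p, dof, expected = stats.chi2_contingency(observed)
    # The null hypothesis is that the two columns are independent
    return {"chi2": chi2_stat, "p": p, "dof": dof, "reject_null": bool(p < alpha)}

# Toy data, invented for illustration: class 1 mostly in light 3, class 2 mostly in light 1
toy = pd.DataFrame({
    "Casualty_class": [1] * 40 + [2] * 40,
    "Light_conditions": [3] * 30 + [1] * 10 + [3] * 10 + [1] * 30,
})
result = chi2_independence(toy, "Casualty_class", "Light_conditions")
print(result)
```

For a 2x2 table, `chi2_contingency` applies Yates' continuity correction by default, so the statistic is slightly smaller than the uncorrected value.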

One-sample t-test

- Ho, the mean age band of driver or rider casualties (Casualty_class 1) <= the mean age band of all casualties
- Ha, the mean age band of driver or rider casualties (Casualty_class 1) > the mean age band of all casualties

In [None]:
alpha = 0.05
# Age bands of 'Driver or rider' casualties (Casualty_class == 1)
driver_sample = train[train.Casualty_class == 1].Age_band_of_casualty
overall_mean = train.Age_band_of_casualty.mean()

# ttest_1samp returns a two-tailed p value; halve it for the one-tailed test
t, p = stats.ttest_1samp(driver_sample, overall_mean)

print(t, p/2, alpha)
if p/2 > alpha:
    print("We fail to reject null")
elif t < 0:
    print("We fail to reject null")
else:
    print("We reject null")
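
The cell above halves the two-tailed p value and checks the sign of `t` by hand. In SciPy 1.6 and later, `ttest_1samp` accepts an `alternative` argument that returns the one-tailed p value directly. The sketch below uses toy data (the sample values and `popmean` are invented) to show how the two approaches relate. It assumes a SciPy version that supports `alternative`.

```python
import numpy as np
from scipy import stats

# Toy sample standing in for the age bands of one casualty class (invented values)
rng = np.random.default_rng(0)
sample = rng.normal(loc=2.3, scale=1.0, size=200)
popmean = 2.0

# Two-tailed test, as in the cell above
t_two, p_two = stats.ttest_1samp(sample, popmean)

# One-tailed test of "sample mean > popmean" (requires SciPy >= 1.6)
t_one, p_one = stats.ttest_1samp(sample, popmean, alternative="greater")

# The one-tailed p value is p/2 when t > 0, and 1 - p/2 otherwise
p_manual = p_two / 2 if t_two > 0 else 1 - p_two / 2
print(t_one, p_one, p_manual)
```

Using `alternative="greater"` removes the need for the separate `t < 0` branch, because a negative `t` already yields a p value above 0.5.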


We tested age band of casualty, light conditions, and driving experience. We failed to reject the null hypothesis in every test, so we will move on to chi-square feature selection.

### Feature picking
Now we will run feature selection on train, setting k to 25 to keep the 25 best features for modeling. I wonder if any of the variables in the questions will be selected.

In [None]:
X_train_fs, X_validate_fs, X_test_fs = feature_chi2(X_train, X_validate, X_test, k = 25)
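
The `feature_chi2` helper is imported from the project's `functions` module and is not shown. A minimal sketch of chi-square feature selection with scikit-learn's `SelectKBest` might look like the following. The name `feature_chi2_sketch` and the explicit `y_train` argument are assumptions, since the real helper's signature does not take the target. The toy columns are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

def feature_chi2_sketch(X_train, y_train, X_validate, X_test, k=25):
    """Hypothetical stand-in for functions.feature_chi2 (assumed behavior)."""
    # Fit on train only so validate and test do not leak into the selection
    selector = SelectKBest(score_func=chi2, k=k)
    selector.fit(X_train, y_train)
    kept = X_train.columns[selector.get_support()]
    return X_train[kept], X_validate[kept], X_test[kept]

# Toy data: one feature tied to the target and two unrelated features
rng = np.random.default_rng(0)
y = pd.Series(rng.integers(1, 4, size=120), name="Casualty_class")
X = pd.DataFrame({
    "signal": 5 * (y - 1),
    "noise_a": rng.integers(0, 5, size=120),
    "noise_b": rng.integers(0, 5, size=120),
})
X_tr_fs, X_val_fs, X_te_fs = feature_chi2_sketch(X[:80], y[:80], X[80:100], X[100:], k=1)
print(list(X_tr_fs.columns))
```

The chi-square scorer requires non-negative feature values, which holds for the integer-encoded categorical columns described earlier.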