## Importing Libraries

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
import pickle 

### Loading the dataset

In [8]:
df = pd.read_csv('../data/train.csv')
df.head()

Unnamed: 0,id,road_type,num_lanes,curvature,speed_limit,lighting,weather,road_signs_present,public_road,time_of_day,holiday,school_season,num_reported_accidents,accident_risk
0,0,urban,2,0.06,35,daylight,rainy,False,True,afternoon,False,True,1,0.13
1,1,urban,4,0.99,35,daylight,clear,True,False,evening,True,True,0,0.35
2,2,rural,4,0.63,70,dim,clear,False,True,morning,True,False,2,0.3
3,3,highway,4,0.07,35,dim,rainy,True,True,morning,False,False,1,0.21
4,4,rural,1,0.58,60,daylight,foggy,False,False,evening,True,False,1,0.56


### Inspecting the dataset

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 517754 entries, 0 to 517753
Data columns (total 14 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   id                      517754 non-null  int64  
 1   road_type               517754 non-null  object 
 2   num_lanes               517754 non-null  int64  
 3   curvature               517754 non-null  float64
 4   speed_limit             517754 non-null  int64  
 5   lighting                517754 non-null  object 
 6   weather                 517754 non-null  object 
 7   road_signs_present      517754 non-null  bool   
 8   public_road             517754 non-null  bool   
 9   time_of_day             517754 non-null  object 
 10  holiday                 517754 non-null  bool   
 11  school_season           517754 non-null  bool   
 12  num_reported_accidents  517754 non-null  int64  
 13  accident_risk           517754 non-null  float64
dtypes: bool(4), float64(

We see that the dataset contains categorical, boolean and numerical columns. There are no missing values, which saves us the trouble of having to impute or drop them. The presence of non-numerical columns makes some form of encoding necessary.
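As a rough illustration of what such an encoding could look like, the sketch below one-hot encodes the text columns and casts the boolean columns to 0/1. The helper name `encode_features` and the small `sample` frame (built from a few of the `df.head()` values) are assumptions made for this example; other encodings, such as ordinal or target encoding, may suit the data better.

```python
import pandas as pd

# Toy stand-in built from the df.head() rows above; in the notebook, pass df instead.
sample = pd.DataFrame({
    "road_type": ["urban", "urban", "rural", "highway", "rural"],
    "lighting": ["daylight", "daylight", "dim", "dim", "daylight"],
    "road_signs_present": [False, True, False, True, False],
    "speed_limit": [35, 35, 70, 35, 60],
})


def encode_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: booleans become 0/1, text columns become one-hot columns."""
    # Text columns are whatever is neither numeric nor boolean.
    cat_cols = list(frame.select_dtypes(exclude=["number", "bool"]).columns)
    bool_cols = list(frame.select_dtypes(include="bool").columns)
    out = frame.astype({col: int for col in bool_cols})
    return pd.get_dummies(out, columns=cat_cols, dtype=int)


encoded = encode_features(sample)
print(encoded.columns.tolist())
```

One-hot encoding adds one column per category, which stays manageable here because each categorical column appears to have only a handful of distinct values.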

In [6]:
df.describe()

Unnamed: 0,id,num_lanes,curvature,speed_limit,num_reported_accidents,accident_risk
count,517754.0,517754.0,517754.0,517754.0,517754.0,517754.0
mean,258876.5,2.491511,0.488719,46.112575,1.18797,0.352377
std,149462.849974,1.120434,0.272563,15.788521,0.895961,0.166417
min,0.0,1.0,0.0,25.0,0.0,0.0
25%,129438.25,1.0,0.26,35.0,1.0,0.23
50%,258876.5,2.0,0.51,45.0,1.0,0.34
75%,388314.75,3.0,0.71,60.0,2.0,0.46
max,517753.0,4.0,1.0,70.0,7.0,1.0


In [7]:
df.isna().sum()

id                        0
road_type                 0
num_lanes                 0
curvature                 0
speed_limit               0
lighting                  0
weather                   0
road_signs_present        0
public_road               0
time_of_day               0
holiday                   0
school_season             0
num_reported_accidents    0
accident_risk             0
dtype: int64

## Visualisations

We will look into visualisations to inspect our dataset and understand patterns within the data.

In [9]:
road_type_stats = df.groupby(['road_type'])['accident_risk'].mean().reset_index()
road_type_stats

Unnamed: 0,road_type,accident_risk
0,highway,0.349734
1,rural,0.349997
2,urban,0.357456


In [None]:
### Make bar chart for road_type_stats
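
One possible way to fill in this cell is a plain matplotlib bar chart. To keep the sketch self-contained, `road_type_stats` is rebuilt here from the values printed above; in the notebook the existing frame can be reused directly. The figure size and labels are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Values copied from the road_type_stats output above (stand-in for the real frame).
road_type_stats = pd.DataFrame({
    "road_type": ["highway", "rural", "urban"],
    "accident_risk": [0.349734, 0.349997, 0.357456],
})

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(road_type_stats["road_type"], road_type_stats["accident_risk"])
ax.set_xlabel("road_type")
ax.set_ylabel("Mean accident_risk")
ax.set_title("Mean accident risk by road type")
# With the inline notebook backend the figure renders at the end of the cell.
fig.tight_layout()
```

Because the three means differ only in the third decimal place, the bars will look almost identical; that in itself suggests road type alone explains little of the variation in risk.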



In [10]:
weather_stats = df.groupby(['weather'])['accident_risk'].mean().reset_index()
print(weather_stats)

  weather  accident_risk
0   clear       0.310060
1   foggy       0.386305
2   rainy       0.361494


In [11]:
### Make bar chart for weather_stats
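
For the weather means, a sorted horizontal bar chart with a reference line at the overall mean is one option. The sketch below assumes the values printed above and takes the overall mean (0.352377) from the `df.describe()` output; in the notebook, `weather_stats` and `df['accident_risk'].mean()` could be used instead.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Values copied from the weather_stats output above (stand-in for the real frame).
weather_stats = pd.DataFrame({
    "weather": ["clear", "foggy", "rainy"],
    "accident_risk": [0.310060, 0.386305, 0.361494],
})
overall_mean = 0.352377  # mean of accident_risk reported by df.describe()

fig, ax = plt.subplots(figsize=(6, 3))
# Sort so the bars run from lowest to highest mean risk.
weather_stats.sort_values("accident_risk").plot.barh(
    x="weather", y="accident_risk", ax=ax, legend=False
)
ax.axvline(overall_mean, color="black", linestyle="--")  # overall mean for reference
ax.set_xlabel("Mean accident_risk")
ax.set_title("Mean accident risk by weather")
fig.tight_layout()
```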

In [12]:
### Make visualisations showing diff in accident risk based on all the boolean variables
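
A possible approach for the boolean variables is to compute the mean `accident_risk` for the `False` and `True` groups of each flag and draw them side by side. The helper `mean_risk_by_flag` and the five-row `sample` frame (taken from the `df.head()` output) are assumptions for this sketch; in the notebook the function would be called on `df`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in built from the df.head() rows; in the notebook, use df instead.
sample = pd.DataFrame({
    "road_signs_present": [False, True, False, True, False],
    "public_road": [True, False, True, True, False],
    "holiday": [False, True, True, False, True],
    "school_season": [True, True, False, False, False],
    "accident_risk": [0.13, 0.35, 0.30, 0.21, 0.56],
})
flag_cols = ["road_signs_present", "public_road", "holiday", "school_season"]


def mean_risk_by_flag(frame, flags, target="accident_risk"):
    """Hypothetical helper: one row per flag, mean target for its False and True groups."""
    rows = []
    for col in flags:
        mask = frame[col].astype(bool)
        rows.append({
            "flag": col,
            "False": frame.loc[~mask, target].mean(),
            "True": frame.loc[mask, target].mean(),
        })
    return pd.DataFrame(rows).set_index("flag")


table = mean_risk_by_flag(sample, flag_cols)

fig, ax = plt.subplots(figsize=(8, 4))
table.plot.bar(ax=ax, rot=0)  # two bars (False/True) per flag
ax.set_xlabel("")
ax.set_ylabel("Mean accident_risk")
fig.tight_layout()
```

The numbers from the five-row sample are not meaningful in themselves; the point is only the shape of the computation.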

In [1]:
### Find out how to make a correlation matrix using sns.heatmap and use it on this data
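
A correlation heatmap can be sketched by passing the result of `DataFrame.corr()` to `sns.heatmap`. The example below uses the numeric columns of the five `df.head()` rows as a stand-in, so the coefficients shown are illustrative only; in the notebook, `sample` would be replaced by `df`. The colour map and annotation format are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in built from the df.head() rows; in the notebook, use df instead.
sample = pd.DataFrame({
    "num_lanes": [2, 4, 4, 4, 1],
    "curvature": [0.06, 0.99, 0.63, 0.07, 0.58],
    "speed_limit": [35, 35, 70, 35, 60],
    "num_reported_accidents": [1, 0, 2, 1, 1],
    "accident_risk": [0.13, 0.35, 0.30, 0.21, 0.56],
})

# Keep numeric and boolean columns only, then compute pairwise Pearson correlations.
numeric = sample.select_dtypes(include=["number", "bool"]).astype(float)
corr = numeric.corr()

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation matrix")
fig.tight_layout()
```

Fixing `vmin=-1` and `vmax=1` keeps the colour scale symmetric around zero, so positive and negative correlations of equal strength get equally intense colours.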

### Inspecting the target variable 

We saw in the df.describe() output that the mean accident risk across all rows is around 0.35, on a scale that runs from 0 to 1. Now let us look more closely at this target variable and at how the numerical variables relate to it. We will do this by first converting the target into a binary variable with two classes: high-risk instances (accident_risk > 0.5) and low-risk instances (accident_risk <= 0.5).

In [17]:
df['accident_risk_binary'] = (df['accident_risk'] > 0.5).astype(int)

In [19]:
df['accident_risk_binary'].value_counts()

accident_risk_binary
0    426581
1     91173
Name: count, dtype: int64

We see that there are almost 5 times as many low-risk instances as high-risk instances (426,581 versus 91,173), so the binary target is imbalanced.
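
The ratio can be checked directly from the counts. The short sketch below hard-codes the two values printed above; in the notebook, `df['accident_risk_binary'].value_counts()` would supply them.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({0: 426581, 1: 91173}, name="count")

ratio = counts[0] / counts[1]          # low-risk instances per high-risk instance
share_high = counts[1] / counts.sum()  # fraction of rows labelled high risk

print(f"low:high ratio = {ratio:.2f}")
print(f"high-risk share = {share_high:.1%}")
```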

In [20]:
curvature_stats = df.groupby(['accident_risk_binary'])['curvature'].mean().reset_index()
print(curvature_stats)

   accident_risk_binary  curvature
0                     0   0.440490
1                     1   0.714374


This gives us a useful insight into accident_risk. On average, high-risk cases have a much higher curvature (about 0.71) than low-risk cases (about 0.44). This is expected, as roads with a lot of curvature are generally harder to drive on.
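
The same comparison can be extended to the other numerical columns in one step by grouping on the binary target and averaging several columns at once. This is a minimal sketch on the five `df.head()` rows, where only one row exceeds the 0.5 threshold, so the resulting means are purely illustrative; in the notebook, `sample` would be replaced by `df`.

```python
import pandas as pd

# Toy stand-in built from the df.head() rows; in the notebook, use df instead.
sample = pd.DataFrame({
    "num_lanes": [2, 4, 4, 4, 1],
    "curvature": [0.06, 0.99, 0.63, 0.07, 0.58],
    "speed_limit": [35, 35, 70, 35, 60],
    "num_reported_accidents": [1, 0, 2, 1, 1],
    "accident_risk": [0.13, 0.35, 0.30, 0.21, 0.56],
})
numeric_cols = ["num_lanes", "curvature", "speed_limit", "num_reported_accidents"]

# Same threshold as in the notebook: 1 = high risk, 0 = low risk.
sample["accident_risk_binary"] = (sample["accident_risk"] > 0.5).astype(int)

# One row per risk class, one column per numerical feature.
group_means = sample.groupby("accident_risk_binary")[numeric_cols].mean()
print(group_means)
```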