## You're here! 
Welcome to your first competition in the [ITI's AI Pro training program](https://ai.iti.gov.eg/epita/ai-engineer/)! We hope you enjoy it and learn as much as we did while preparing this competition.


## Introduction

In the competition, it's required to predict the `Severity` of a car crash given info about the crash, e.g., location.

This is the getting started notebook. Things are kept simple so that it's easier to understand the steps and modify it.

Feel free to `Fork` this notebook and share it with your modifications **OR** use it to create your submissions.

### Prerequisites
You should know how to use Python and a little bit of Machine Learning. You can apply the techniques you learned in the training program and submit the new solutions!

### Checklist
You can participate in this competition the way you prefer. However, I recommend following these steps if this is your first time joining a competition on Kaggle.

* Fork this notebook and run the cells in order.
* Submit this solution.
* Make changes to the data processing step as you see fit.
* Submit the new solutions.

*You can make up to 5 submissions per day. Only one of your submissions can be selected to count toward the final ranking.*


Don't hesitate to leave a comment or contact me if you have any questions!

## Import the libraries

We'll use `pandas` to load and manipulate the data. Other libraries will be imported in the relevant sections.

In [1]:
import pandas as pd
import os

## Exploratory Data Analysis
In this step, one should load the data and analyze it. However, I'll load the data and do minimal analysis. You are encouraged to do thorough analysis!

Let's load the data using `pandas` and have a look at the generated `DataFrame`.

In [2]:
dataset_path = '/kaggle/input/car-crashes-severity-prediction/'
df = pd.read_csv(os.path.join(dataset_path, 'train.csv'))
weather_df = pd.read_csv(os.path.join(dataset_path, 'weather-sfcsv.csv'))

print("The shape of the dataset is {}.\n\n".format(df.shape))

The shape of the dataset is (6407, 16).




In [3]:
df.head()

Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Stop,Amenity,Side,Severity,timestamp
0,0,37.76215,-122.40566,False,0.044,False,False,False,False,False,False,False,True,R,2,2016-03-25 15:13:02
1,1,37.719157,-122.448254,False,0.0,False,False,False,False,False,False,False,False,R,2,2020-05-05 19:23:00
2,2,37.808498,-122.366852,False,0.0,False,False,False,False,False,False,True,False,R,3,2016-09-16 19:57:16
3,3,37.78593,-122.39108,False,0.009,False,False,True,False,False,False,False,False,R,1,2020-03-29 19:48:43
4,4,37.719141,-122.448457,False,0.0,False,False,False,False,False,False,False,False,R,2,2019-10-09 08:47:00


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6407 entries, 0 to 6406
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   ID            6407 non-null   int64  
 1   Lat           6407 non-null   float64
 2   Lng           6407 non-null   float64
 3   Bump          6407 non-null   bool   
 4   Distance(mi)  6407 non-null   float64
 5   Crossing      6407 non-null   bool   
 6   Give_Way      6407 non-null   bool   
 7   Junction      6407 non-null   bool   
 8   No_Exit       6407 non-null   bool   
 9   Railway       6407 non-null   bool   
 10  Roundabout    6407 non-null   bool   
 11  Stop          6407 non-null   bool   
 12  Amenity       6407 non-null   bool   
 13  Side          6407 non-null   object 
 14  Severity      6407 non-null   int64  
 15  timestamp     6407 non-null   object 
dtypes: bool(9), float64(3), int64(2), object(2)
memory usage: 406.8+ KB


In [5]:
weather_df.head()

Unnamed: 0,Year,Day,Month,Hour,Weather_Condition,Wind_Chill(F),Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),Selected
0,2020,27,7,18,Fair,64.0,0.0,64.0,70.0,20.0,10.0,No
1,2017,30,9,17,Partly Cloudy,,,71.1,57.0,9.2,10.0,No
2,2017,27,6,5,Overcast,,,57.9,87.0,15.0,9.0,No
3,2016,7,9,9,Clear,,,66.9,73.0,4.6,10.0,No
4,2019,19,10,2,Fair,52.0,0.0,52.0,89.0,0.0,9.0,No


In [6]:
weather_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6901 entries, 0 to 6900
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Year               6901 non-null   int64  
 1   Day                6901 non-null   int64  
 2   Month              6901 non-null   int64  
 3   Hour               6901 non-null   int64  
 4   Weather_Condition  6900 non-null   object 
 5   Wind_Chill(F)      3292 non-null   float64
 6   Precipitation(in)  3574 non-null   float64
 7   Temperature(F)     6899 non-null   float64
 8   Humidity(%)        6899 non-null   float64
 9   Wind_Speed(mph)    6556 non-null   float64
 10  Visibility(mi)     6900 non-null   float64
 11  Selected           6901 non-null   object 
dtypes: float64(6), int64(4), object(2)
memory usage: 647.1+ KB


In [7]:
def convert_time(data_frame):
    data_frame["timestamp"] = data_frame["timestamp"].str.split(":", n = 1, expand = True)[0]
    data_frame[['Year','Month','timestamp']] = data_frame["timestamp"].str.split("-", expand = True)
    data_frame[['Day','Hour']] = data_frame["timestamp"].str.split(" ", expand = True)
    data_frame['Year' ] = data_frame['Year' ].astype(int)
    data_frame['Month'] = data_frame['Month'].astype(int)
    data_frame['Day'  ] = data_frame['Day'  ].astype(int)
    data_frame['Hour' ] = data_frame['Hour' ].astype(int)
    return data_frame.drop(columns=['timestamp'])
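
The string splitting in `convert_time` can also be expressed with pandas' datetime parsing. The following is a minimal sketch, not the notebook's method: `convert_time_dt` and the two-row `sample` frame are hypothetical names introduced for illustration, and it assumes every timestamp parses cleanly with `pd.to_datetime`.

```python
import pandas as pd

def convert_time_dt(data_frame, column="timestamp"):
    """Split a timestamp column into Year, Month, Day and Hour integer columns."""
    parsed = pd.to_datetime(data_frame[column])
    # Work on a copy so the caller's frame is left untouched
    out = data_frame.drop(columns=[column]).copy()
    out["Year"] = parsed.dt.year
    out["Month"] = parsed.dt.month
    out["Day"] = parsed.dt.day
    out["Hour"] = parsed.dt.hour
    return out

# Hypothetical two-row sample in the same format as the `timestamp` column
sample = pd.DataFrame({
    "ID": [0, 1],
    "timestamp": ["2016-03-25 15:13:02", "2019-10-09 08:47:00"],
})
converted = convert_time_dt(sample)
print(converted)
```

Unlike the version above, this sketch does not modify the frame passed in, which can make it easier to re-run cells out of order.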

In [8]:
df=convert_time(df)
df.head()

Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Stop,Amenity,Side,Severity,Year,Month,Day,Hour
0,0,37.76215,-122.40566,False,0.044,False,False,False,False,False,False,False,True,R,2,2016,3,25,15
1,1,37.719157,-122.448254,False,0.0,False,False,False,False,False,False,False,False,R,2,2020,5,5,19
2,2,37.808498,-122.366852,False,0.0,False,False,False,False,False,False,True,False,R,3,2016,9,16,19
3,3,37.78593,-122.39108,False,0.009,False,False,True,False,False,False,False,False,R,1,2020,3,29,19
4,4,37.719141,-122.448457,False,0.0,False,False,False,False,False,False,False,False,R,2,2019,10,9,8


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6407 entries, 0 to 6406
Data columns (total 19 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   ID            6407 non-null   int64  
 1   Lat           6407 non-null   float64
 2   Lng           6407 non-null   float64
 3   Bump          6407 non-null   bool   
 4   Distance(mi)  6407 non-null   float64
 5   Crossing      6407 non-null   bool   
 6   Give_Way      6407 non-null   bool   
 7   Junction      6407 non-null   bool   
 8   No_Exit       6407 non-null   bool   
 9   Railway       6407 non-null   bool   
 10  Roundabout    6407 non-null   bool   
 11  Stop          6407 non-null   bool   
 12  Amenity       6407 non-null   bool   
 13  Side          6407 non-null   object 
 14  Severity      6407 non-null   int64  
 15  Year          6407 non-null   int64  
 16  Month         6407 non-null   int64  
 17  Day           6407 non-null   int64  
 18  Hour          6407 non-null 

In [10]:
weather_df=weather_df.drop_duplicates(subset=['Year', 'Month','Day','Hour'])

In [11]:
from lxml import objectify

xml_data = objectify.parse(os.path.join(dataset_path, 'holidays.xml'))  # Parse XML data
root = xml_data.getroot()  # Root element

data = []

# Each child of the root holds a date followed by the holiday name
for child in root.getchildren():
    fields = child.getchildren()
    data.append(dict(timestamp=fields[0].text, Holiday=fields[1].text))

df_holidays = pd.DataFrame(data)  # One row per holiday entry
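
To see what the loop does without the competition files, here is a small sketch that uses the standard library's `xml.etree.ElementTree` instead of `lxml`. The XML string is an assumption: the tag names `data`, `row`, `date` and `description` are made up for illustration, and the only thing taken from the cell above is that each entry has the date as its first child and the holiday name as its second.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Hypothetical sample: the real holidays.xml layout and tag names are assumed
sample_xml = """
<data>
  <row><date>2016-01-01</date><description>New Year's Day</description></row>
  <row><date>2016-07-04</date><description>Independence Day</description></row>
</data>
"""

root = ET.fromstring(sample_xml)

# First child of each entry is the date, second child is the holiday name
records = [
    {"timestamp": row[0].text, "Holiday": row[1].text}
    for row in root
]
holidays_sample = pd.DataFrame(records)
print(holidays_sample)
```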

In [12]:
df_holidays[['Year','Month','Day']] = df_holidays["timestamp"].str.split("-", expand = True)
df_holidays['Year' ] = df_holidays['Year' ].astype(int)
df_holidays['Month'] = df_holidays['Month'].astype(int)
df_holidays['Day'  ] = df_holidays['Day'  ].astype(int)
df_holidays = df_holidays.drop(columns=['Year','timestamp'])
df_holidays['Holiday'] = df_holidays['Holiday'].astype('category').cat.codes
df_holidays = df_holidays.drop_duplicates(subset=['Month','Day'])
df_holidays.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44 entries, 0 to 62
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Holiday  44 non-null     int8 
 1   Month    44 non-null     int64
 2   Day      44 non-null     int64
dtypes: int64(2), int8(1)
memory usage: 1.1 KB


In [13]:
df_merge_weather = pd.merge(df, weather_df, how="left", on=['Year','Month','Day','Hour'])
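
Dropping duplicate weather rows before this merge matters because a left merge repeats a crash once for every matching weather row. The sketch below shows the effect on toy frames (`crashes` and `weather` are hypothetical). It also uses the `validate="many_to_one"` argument of `pd.merge`, which raises an error if the right-hand keys are not unique.

```python
import pandas as pd

keys = ["Year", "Month", "Day", "Hour"]

# Toy data: two crashes in the same hour, and two weather readings for that hour
crashes = pd.DataFrame({
    "Year": [2020, 2020], "Month": [5, 5], "Day": [5, 5], "Hour": [19, 19],
    "ID": [1, 2],
})
weather = pd.DataFrame({
    "Year": [2020, 2020], "Month": [5, 5], "Day": [5, 5], "Hour": [19, 19],
    "Temperature(F)": [57.0, 58.0],
})

# Without deduplication, each crash matches both readings and the rows multiply
inflated = pd.merge(crashes, weather, how="left", on=keys)

# After deduplication the merge keeps one row per crash; `validate` raises
# pandas.errors.MergeError if the weather keys are still not unique
weather_unique = weather.drop_duplicates(subset=keys)
merged = pd.merge(crashes, weather_unique, how="left", on=keys, validate="many_to_one")

print(len(crashes), len(inflated), len(merged))
```

Comparing the row count before and after a merge is a cheap sanity check for this kind of join.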

In [14]:
df_merge = pd.merge(df_merge_weather, df_holidays, how='left', on=['Month','Day'])

In [15]:
df_merge.head()

Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,...,Hour,Weather_Condition,Wind_Chill(F),Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),Selected,Holiday
0,0,37.76215,-122.40566,False,0.044,False,False,False,False,False,...,15,Scattered Clouds,,,64.0,58.0,23.0,10.0,No,
1,1,37.719157,-122.448254,False,0.0,False,False,False,False,False,...,19,Mostly Cloudy / Windy,57.0,0.0,57.0,83.0,22.0,10.0,No,
2,2,37.808498,-122.366852,False,0.0,False,False,False,False,False,...,19,Clear,,,62.1,80.0,9.2,10.0,No,
3,3,37.78593,-122.39108,False,0.009,False,False,True,False,False,...,19,Fair,58.0,0.0,58.0,70.0,10.0,10.0,No,
4,4,37.719141,-122.448457,False,0.0,False,False,False,False,False,...,8,Fair,58.0,0.0,58.0,65.0,3.0,10.0,No,1.0


In [16]:
df_merge['Holiday'] = df_merge['Holiday'] + 1
df_merge['Holiday'] = df_merge['Holiday'].fillna(0)
df_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6407 entries, 0 to 6406
Data columns (total 28 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ID                 6407 non-null   int64  
 1   Lat                6407 non-null   float64
 2   Lng                6407 non-null   float64
 3   Bump               6407 non-null   bool   
 4   Distance(mi)       6407 non-null   float64
 5   Crossing           6407 non-null   bool   
 6   Give_Way           6407 non-null   bool   
 7   Junction           6407 non-null   bool   
 8   No_Exit            6407 non-null   bool   
 9   Railway            6407 non-null   bool   
 10  Roundabout         6407 non-null   bool   
 11  Stop               6407 non-null   bool   
 12  Amenity            6407 non-null   bool   
 13  Side               6407 non-null   object 
 14  Severity           6407 non-null   int64  
 15  Year               6407 non-null   int64  
 16  Month              6407 

We've got 6407 examples in the dataset with 14 features, 1 ID, and the `Severity` of the crash.

By looking at the features and a sample from the data, the features appear to be of numerical and categorical types. What about some descriptive statistics?

In [17]:
df.drop(columns='ID').describe()

Unnamed: 0,Lat,Lng,Distance(mi),Severity,Year,Month,Day,Hour
count,6407.0,6407.0,6407.0,6407.0,6407.0,6407.0,6407.0,6407.0
mean,37.765653,-122.40599,0.135189,2.293429,2018.407835,6.744498,15.656626,12.873888
std,0.032555,0.028275,0.39636,0.521225,1.375794,3.568445,8.750849,5.824203
min,37.609619,-122.51044,0.0,1.0,2016.0,1.0,1.0,0.0
25%,37.737096,-122.41221,0.0,2.0,2017.0,4.0,8.0,8.0
50%,37.768238,-122.404835,0.0,2.0,2019.0,7.0,16.0,14.0
75%,37.787813,-122.392477,0.041,3.0,2020.0,10.0,23.0,17.0
max,37.825626,-122.349734,6.82,4.0,2020.0,12.0,31.0,23.0


In [18]:
#df['Side'] = df['Side'] == 'L'
#df.head()

The output shows descriptive statistics for the numerical columns: `Lat`, `Lng`, `Distance(mi)`, and `Severity`, plus the `Year`, `Month`, `Day`, and `Hour` columns derived from the timestamp. I'll use the numerical features to demonstrate how to train the model and make submissions. **However, you shouldn't use the numerical features only to make the final submission if you want to make it to the top of the leaderboard.**

In [19]:
corr = df_merge.corr()
corr.style.background_gradient(cmap='coolwarm')



Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Stop,Amenity,Severity,Year,Month,Day,Hour,Wind_Chill(F),Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),Holiday
ID,1.0,0.012043,0.009135,,0.024395,0.010125,-0.021899,0.007417,-0.009012,0.004945,,0.003892,0.003176,0.020703,-0.005391,0.010793,-0.000216,-0.013887,-0.004933,0.001146,-0.010452,0.012566,-0.009141,-0.028679,-0.013748
Lat,0.012043,1.0,0.388177,,0.03976,0.040971,0.011296,0.012027,0.010412,-0.002388,,0.39053,0.088219,0.099581,0.00255,-0.031227,0.006292,0.02207,0.030113,-0.054282,0.037806,0.015215,-0.05578,-0.005805,-0.003785
Lng,0.009135,0.388177,1.0,,0.074003,-0.12317,-0.032626,0.191309,-0.000733,-0.030877,,0.385575,-0.102261,0.145313,-0.016629,-0.020087,-0.007517,0.009822,0.03148,-0.014761,0.028449,0.018423,-0.068629,-0.027623,0.005016
Bump,,,,,,,,,,,,,,,,,,,,,,,,,
Distance(mi),0.024395,0.03976,0.074003,,1.0,-0.020309,-0.007164,-0.028275,0.000341,-0.033987,,-0.054533,-0.033825,-0.013141,0.040211,0.08352,0.008718,0.00581,-0.019257,-0.020088,-0.018042,0.003105,-0.025099,-0.000339,0.015591
Crossing,0.010125,0.040971,-0.12317,,-0.020309,1.0,0.072222,-0.160848,-0.003744,0.430823,,-0.037446,0.319284,-0.090314,0.015127,-0.00606,0.025577,-0.037465,-0.048528,-0.019649,-0.040751,0.015003,-0.039408,0.006442,0.006782
Give_Way,-0.021899,0.011296,-0.032626,,-0.007164,0.072222,1.0,-0.012378,-0.00027,0.041317,,0.041475,-0.004251,-0.012186,-0.001172,0.01166,-0.004922,0.007902,0.009616,-0.00526,0.00843,-0.014525,-0.010231,0.007352,-0.007022
Junction,0.007417,0.012027,0.191309,,-0.028275,-0.160848,-0.012378,1.0,-0.007145,-0.094416,,0.07529,-0.089347,-0.068328,0.073461,-0.041369,-0.022799,0.019163,0.026493,0.001821,0.014891,-0.030626,-0.007079,-0.010522,-0.009559
No_Exit,-0.009012,0.010412,-0.000733,,0.000341,-0.003744,-0.00027,-0.007145,1.0,-0.002063,,-0.004111,-0.002454,-0.007034,-0.021868,-0.006108,-0.015216,0.000271,,,0.010947,-0.007096,0.008134,0.004244,-0.004053
Railway,0.004945,-0.002388,-0.030877,,-0.033987,0.430823,0.041317,-0.094416,-0.002063,1.0,,-0.034703,0.126759,-0.033322,0.037886,0.02162,0.02235,-0.013434,-0.024541,-0.015629,-0.003378,-0.02617,-0.010384,0.002874,0.0056


In [20]:
#for col in df.columns:
    #print(df[str(col)].value_counts())

In [21]:
df_merge=df_merge.drop(columns=['ID','Bump', 'Roundabout','Give_Way','No_Exit','Selected','Year','Month','Day','Hour'])

In [22]:
df_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6407 entries, 0 to 6406
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Lat                6407 non-null   float64
 1   Lng                6407 non-null   float64
 2   Distance(mi)       6407 non-null   float64
 3   Crossing           6407 non-null   bool   
 4   Junction           6407 non-null   bool   
 5   Railway            6407 non-null   bool   
 6   Stop               6407 non-null   bool   
 7   Amenity            6407 non-null   bool   
 8   Side               6407 non-null   object 
 9   Severity           6407 non-null   int64  
 10  Weather_Condition  6406 non-null   object 
 11  Wind_Chill(F)      3274 non-null   float64
 12  Precipitation(in)  3529 non-null   float64
 13  Temperature(F)     6405 non-null   float64
 14  Humidity(%)        6405 non-null   float64
 15  Wind_Speed(mph)    6111 non-null   float64
 16  Visibility(mi)     6406 

In [23]:
df_means=df_merge.mean()

In [24]:
df_merge=df_merge.fillna(df_means)
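
Column means exist only for numeric columns, so a text column such as `Weather_Condition` keeps any missing values after this step. One possible way to handle it, shown here on a hypothetical toy frame, is to fill text columns with their most frequent value. This is a sketch of one option, not a recommendation specific to this dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing entry per column
frame = pd.DataFrame({
    "Weather_Condition": ["Fair", "Clear", "Fair", np.nan],
    "Wind_Chill(F)": [64.0, np.nan, 52.0, 58.0],
})

# Numeric columns: fill with the column mean
numeric_means = frame.mean(numeric_only=True)
filled = frame.fillna(numeric_means)

# Text columns: fill with the most frequent value instead
most_common = filled["Weather_Condition"].mode().iloc[0]
filled["Weather_Condition"] = filled["Weather_Condition"].fillna(most_common)

print(filled)
```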

In [25]:
df_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6407 entries, 0 to 6406
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Lat                6407 non-null   float64
 1   Lng                6407 non-null   float64
 2   Distance(mi)       6407 non-null   float64
 3   Crossing           6407 non-null   bool   
 4   Junction           6407 non-null   bool   
 5   Railway            6407 non-null   bool   
 6   Stop               6407 non-null   bool   
 7   Amenity            6407 non-null   bool   
 8   Side               6407 non-null   object 
 9   Severity           6407 non-null   int64  
 10  Weather_Condition  6406 non-null   object 
 11  Wind_Chill(F)      6407 non-null   float64
 12  Precipitation(in)  6407 non-null   float64
 13  Temperature(F)     6407 non-null   float64
 14  Humidity(%)        6407 non-null   float64
 15  Wind_Speed(mph)    6407 non-null   float64
 16  Visibility(mi)     6407 

In [26]:
def onehot(data_frame, col_name):
    # Get the one-hot encoding of the given column
    one_hot = pd.get_dummies(data_frame[col_name], prefix=col_name)
    # Drop the original column as it is now encoded
    data_frame = data_frame.drop(col_name, axis=1)
    # Join the encoded columns to the rest of the frame
    data_frame = data_frame.join(one_hot)
    return data_frame
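
One thing to watch with `pd.get_dummies` is that it only creates columns for the values it sees. If a category is missing from the test set, the encoded test frame will have fewer columns than the training frame. A minimal sketch of one way to align them, using hypothetical toy frames `train` and `test`:

```python
import pandas as pd

train = pd.DataFrame({"Side": ["L", "R", "R"]})
test = pd.DataFrame({"Side": ["R", "R"]})  # no 'L' rows in this toy test set

train_encoded = pd.get_dummies(train["Side"], prefix="Side")
test_encoded = pd.get_dummies(test["Side"], prefix="Side")

# Reindex the test columns to the training layout; absent categories become 0
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(test_encoded.columns))
print(list(test_aligned.columns))
```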

In [27]:
df_after=onehot(df_merge,'Side')
df_after['Weather_Condition'] = df_after['Weather_Condition'].astype('category').cat.codes
#df=onehot(df,'Weather_Condition')

In [28]:
df_after.head()

Unnamed: 0,Lat,Lng,Distance(mi),Crossing,Junction,Railway,Stop,Amenity,Severity,Weather_Condition,Wind_Chill(F),Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),Holiday,Side_L,Side_R
0,37.76215,-122.40566,0.044,False,False,False,False,True,2,22,59.936836,0.006228,64.0,58.0,23.0,10.0,0.0,0,1
1,37.719157,-122.448254,0.0,False,False,False,False,False,2,15,57.0,0.0,57.0,83.0,22.0,10.0,0.0,0,1
2,37.808498,-122.366852,0.0,False,False,False,True,False,3,0,59.936836,0.006228,62.1,80.0,9.2,10.0,0.0,0,1
3,37.78593,-122.39108,0.009,False,True,False,False,False,1,3,58.0,0.0,58.0,70.0,10.0,10.0,0.0,0,1
4,37.719141,-122.448457,0.0,False,False,False,False,False,2,3,58.0,0.0,58.0,65.0,3.0,10.0,2.0,0,1


In [29]:
df_after.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6407 entries, 0 to 6406
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Lat                6407 non-null   float64
 1   Lng                6407 non-null   float64
 2   Distance(mi)       6407 non-null   float64
 3   Crossing           6407 non-null   bool   
 4   Junction           6407 non-null   bool   
 5   Railway            6407 non-null   bool   
 6   Stop               6407 non-null   bool   
 7   Amenity            6407 non-null   bool   
 8   Severity           6407 non-null   int64  
 9   Weather_Condition  6407 non-null   int8   
 10  Wind_Chill(F)      6407 non-null   float64
 11  Precipitation(in)  6407 non-null   float64
 12  Temperature(F)     6407 non-null   float64
 13  Humidity(%)        6407 non-null   float64
 14  Wind_Speed(mph)    6407 non-null   float64
 15  Visibility(mi)     6407 non-null   float64
 16  Holiday            6407 

In [30]:
# import matplotlib.pyplot as plt

# plt.scatter(df_after['Lng'], df_after['Lat'])

In [31]:
# from sklearn.cluster import KMeans

# cities = KMeans(n_clusters=6).fit(df[['Lng', 'Lat']])
# plt.scatter(df_after['Lng'], df_after['Lat'])
# plt.scatter(cities.cluster_centers_.T[0], cities.cluster_centers_.T[1], color='red')
# df_cities = df_after.copy()
# df_cities['City'] = cities.labels_
# df_cities = df_cities.drop(columns=['Lng', 'Lat'])

In [32]:
df_after.info()
columns = df_after.columns

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6407 entries, 0 to 6406
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Lat                6407 non-null   float64
 1   Lng                6407 non-null   float64
 2   Distance(mi)       6407 non-null   float64
 3   Crossing           6407 non-null   bool   
 4   Junction           6407 non-null   bool   
 5   Railway            6407 non-null   bool   
 6   Stop               6407 non-null   bool   
 7   Amenity            6407 non-null   bool   
 8   Severity           6407 non-null   int64  
 9   Weather_Condition  6407 non-null   int8   
 10  Wind_Chill(F)      6407 non-null   float64
 11  Precipitation(in)  6407 non-null   float64
 12  Temperature(F)     6407 non-null   float64
 13  Humidity(%)        6407 non-null   float64
 14  Wind_Speed(mph)    6407 non-null   float64
 15  Visibility(mi)     6407 non-null   float64
 16  Holiday            6407 

In [33]:
df_after.corr()

Unnamed: 0,Lat,Lng,Distance(mi),Crossing,Junction,Railway,Stop,Amenity,Severity,Weather_Condition,Wind_Chill(F),Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),Holiday,Side_L,Side_R
Lat,1.0,0.388177,0.03976,0.040971,0.012027,-0.002388,0.39053,0.088219,0.099581,-0.00231,0.021321,-0.040242,0.037799,0.015212,-0.054576,-0.005805,-0.003785,-0.029349,0.029349
Lng,0.388177,1.0,0.074003,-0.12317,0.191309,-0.030877,0.385575,-0.102261,0.145313,-0.002475,0.022693,-0.011067,0.028444,0.01842,-0.067363,-0.027623,0.005016,-0.082084,0.082084
Distance(mi),0.03976,0.074003,1.0,-0.020309,-0.028275,-0.033987,-0.054533,-0.033825,-0.013141,0.001113,-0.015525,-0.016415,-0.018038,0.003104,-0.024452,-0.000339,0.015591,-0.002264,0.002264
Crossing,0.040971,-0.12317,-0.020309,1.0,-0.160848,0.430823,-0.037446,0.319284,-0.090314,-0.016173,-0.035506,-0.01461,-0.04075,0.015002,-0.038431,0.006442,0.006782,0.256629,-0.256629
Junction,0.012027,0.191309,-0.028275,-0.160848,1.0,-0.094416,0.07529,-0.089347,-0.068328,-0.023545,0.01976,0.001406,0.014887,-0.030618,-0.006923,-0.010519,-0.009559,-0.123946,0.123946
Railway,-0.002388,-0.030877,-0.033987,0.430823,-0.094416,1.0,-0.034703,0.126759,-0.033322,-0.037469,-0.020179,-0.012706,-0.003378,-0.02617,-0.010288,0.002874,0.0056,0.116734,-0.116734
Stop,0.39053,0.385575,-0.054533,-0.037446,0.07529,-0.034703,1.0,-0.039619,0.229269,-0.002933,0.010077,-0.009832,0.00428,0.025161,-0.0406,-0.028883,0.009712,-0.040135,0.040135
Amenity,0.088219,-0.102261,-0.033825,0.319284,-0.089347,0.126759,-0.039619,1.0,-0.078915,0.002051,-0.006411,-0.025517,-0.005286,-0.006624,-0.001628,0.015618,0.029934,0.247947,-0.247947
Severity,0.099581,0.145313,-0.013141,-0.090314,-0.068328,-0.033322,0.229269,-0.078915,1.0,0.104039,0.000464,0.033137,-0.018302,0.076726,0.021992,-0.015714,-0.004193,-0.060545,0.060545
Weather_Condition,-0.00231,-0.002475,0.001113,-0.016173,-0.023545,-0.037469,-0.002933,0.002051,0.104039,1.0,-0.036486,0.082308,-0.029927,0.091386,0.157044,-0.015485,-0.025568,0.014551,-0.014551


## Data Splitting

Now it's time to split the dataset for the training step. Typically the dataset is split into 3 subsets, namely, the training, validation and test sets. In our case, the test set is already predefined, so we'll split the "training" set into training and validation sets with a 0.8:0.2 ratio.

*Note: a good way to generate reproducible results is to set the seed of any algorithm that depends on randomization. This is done with the `random_state` argument in the following command.*

In [34]:
df_after.sort_index(inplace = True, axis = 1)

In [35]:
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn import preprocessing
from sklearn.decomposition import PCA

train_df, val_df = train_test_split(df_after, test_size=0.2, random_state=42) # Try adding `stratify` here

X_train = train_df.drop(columns=['Severity'])
y_train = train_df['Severity']

# x = X_train.values #returns a numpy array
# min_max_scaler = preprocessing.MinMaxScaler()
# x_scaled = min_max_scaler.fit_transform(x)
# X_train = pd.DataFrame(x_scaled)
# X_train = pd.DataFrame(PCA(n_components=4).fit(X_train).transform(X_train))

X_val = val_df.drop(columns=['Severity'])
y_val = val_df['Severity']

# x = X_val.values #returns a numpy array
# min_max_scaler = preprocessing.MinMaxScaler()
# x_scaled = min_max_scaler.fit_transform(x)
# X_val = pd.DataFrame(x_scaled)
# X_val = pd.DataFrame(PCA(n_components=4).fit(X_val).transform(X_val))
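
The comment in the split above suggests trying `stratify`. Here is a small sketch of what that argument does, on a hypothetical toy frame with an imbalanced `Severity` column: passing the target to `stratify` keeps the class proportions the same in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an imbalanced target (hypothetical data, not the competition set)
toy = pd.DataFrame({
    "feature": range(100),
    "Severity": [2] * 70 + [3] * 20 + [1] * 10,
})

train_part, val_part = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["Severity"]
)

# Class proportions in both parts should mirror the full frame
print(train_part["Severity"].value_counts(normalize=True).sort_index())
print(val_part["Severity"].value_counts(normalize=True).sort_index())
```

With rare classes, an unstratified split can leave the validation set with very few or no examples of a class, which makes the validation accuracy harder to trust.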



In [36]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5125 entries, 748 to 860
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Amenity            5125 non-null   bool   
 1   Crossing           5125 non-null   bool   
 2   Distance(mi)       5125 non-null   float64
 3   Holiday            5125 non-null   float64
 4   Humidity(%)        5125 non-null   float64
 5   Junction           5125 non-null   bool   
 6   Lat                5125 non-null   float64
 7   Lng                5125 non-null   float64
 8   Precipitation(in)  5125 non-null   float64
 9   Railway            5125 non-null   bool   
 10  Side_L             5125 non-null   uint8  
 11  Side_R             5125 non-null   uint8  
 12  Stop               5125 non-null   bool   
 13  Temperature(F)     5125 non-null   float64
 14  Visibility(mi)     5125 non-null   float64
 15  Weather_Condition  5125 non-null   int8   
 16  Wind_Chill(F)      5125

As pointed out earlier, I'll use the numerical features to train the classifier. **However, you shouldn't use the numerical features only to make the final submission if you want to make it to the top of the leaderboard.**

## Model Training

Let's train a model with the data! We'll train a Random Forest Classifier to demonstrate the process of making submissions. 

In [37]:
from sklearn.ensemble import RandomForestClassifier

# Create an instance of the classifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)

# Train the classifier
classifier = classifier.fit(X_train, y_train)

Now let's test our classifier on the validation dataset and see the accuracy.

In [38]:
print("The accuracy of the classifier on the training set is ", (classifier.score(X_train, y_train)))
print("The accuracy of the classifier on the validation set is ", (classifier.score(X_val, y_val)))

The accuracy of the classifier on the training set is  0.735609756097561
The accuracy of the classifier on the validation set is  0.7449297971918877


Well, that's a good start, right? A classifier that predicts every example's `Severity` as 2 will get around 0.63. You should get a better score as you add more features and do better data preprocessing.
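
The majority-class comparison can be checked with scikit-learn's `DummyClassifier`. The sketch below uses hypothetical toy labels in which 63% of the examples are class 2. On the real data you would fit it on `X_train`, `y_train` and score it on `X_val`, `y_val`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels (hypothetical): 63% of the examples belong to class 2
y = np.array([2] * 63 + [3] * 30 + [1] * 7)
X = np.zeros((len(y), 1))  # the features are ignored by this dummy strategy

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
baseline_accuracy = baseline.score(X, y)

print(baseline_accuracy)
```

Any real model should beat this number on the validation set; if it doesn't, the features are not adding information yet.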

## Submission File Generation

We have built a model and we'd like to submit our predictions on the test set! In order to do that, we'll load the test set, predict the class and save the submission file. 

First, we'll load the data.

In [39]:
test_df = pd.read_csv(os.path.join(dataset_path, 'test.csv'))
test_df.head()

Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Stop,Amenity,Side,timestamp
0,6407,37.78606,-122.3909,False,0.039,False,False,True,False,False,False,False,False,R,2016-04-04 19:20:31
1,6408,37.769609,-122.415057,False,0.202,False,False,False,False,False,False,False,False,R,2020-10-28 11:51:00
2,6409,37.807495,-122.476021,False,0.0,False,False,False,False,False,False,False,False,R,2019-09-09 07:36:45
3,6410,37.761818,-122.405869,False,0.0,False,False,True,False,False,False,False,False,R,2019-08-06 15:46:25
4,6411,37.73235,-122.4141,False,0.67,False,False,False,False,False,False,False,False,R,2018-10-17 09:54:58


In [40]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1601 entries, 0 to 1600
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   ID            1601 non-null   int64  
 1   Lat           1601 non-null   float64
 2   Lng           1601 non-null   float64
 3   Bump          1601 non-null   bool   
 4   Distance(mi)  1601 non-null   float64
 5   Crossing      1601 non-null   bool   
 6   Give_Way      1601 non-null   bool   
 7   Junction      1601 non-null   bool   
 8   No_Exit       1601 non-null   bool   
 9   Railway       1601 non-null   bool   
 10  Roundabout    1601 non-null   bool   
 11  Stop          1601 non-null   bool   
 12  Amenity       1601 non-null   bool   
 13  Side          1601 non-null   object 
 14  timestamp     1601 non-null   object 
dtypes: bool(9), float64(3), int64(1), object(2)
memory usage: 89.2+ KB


In [41]:
test_df=convert_time(test_df)
print('convert_time')
display(test_df.info())
test_df=pd.merge(test_df, weather_df, how="left", on=['Year','Month','Day','Hour'])
test_df=pd.merge(test_df, df_holidays, how="left", on=['Month','Day'])
print('merge')
display(test_df.info())
test_df=test_df.drop(columns=['Bump', 'Roundabout','Give_Way','No_Exit','Selected','Year','Month','Day','Hour'])
print('drop')
display(test_df.info())
test_df=test_df.fillna(df_means)
print('fillna')
display(test_df.info())
test_df=onehot(test_df,'Side')
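# Reuse the training categories so each weather condition gets the same code as in training
weather_categories = df_merge['Weather_Condition'].astype('category').cat.categories
test_df['Weather_Condition'] = pd.Categorical(test_df['Weather_Condition'], categories=weather_categories).codes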
print('onehot')
display(test_df.info())

convert_time
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1601 entries, 0 to 1600
Data columns (total 18 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   ID            1601 non-null   int64  
 1   Lat           1601 non-null   float64
 2   Lng           1601 non-null   float64
 3   Bump          1601 non-null   bool   
 4   Distance(mi)  1601 non-null   float64
 5   Crossing      1601 non-null   bool   
 6   Give_Way      1601 non-null   bool   
 7   Junction      1601 non-null   bool   
 8   No_Exit       1601 non-null   bool   
 9   Railway       1601 non-null   bool   
 10  Roundabout    1601 non-null   bool   
 11  Stop          1601 non-null   bool   
 12  Amenity       1601 non-null   bool   
 13  Side          1601 non-null   object 
 14  Year          1601 non-null   int64  
 15  Month         1601 non-null   int64  
 16  Day           1601 non-null   int64  
 17  Hour          1601 non-null   int64  
dtypes: bool(9), flo

None

merge
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1601 entries, 0 to 1600
Data columns (total 27 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ID                 1601 non-null   int64  
 1   Lat                1601 non-null   float64
 2   Lng                1601 non-null   float64
 3   Bump               1601 non-null   bool   
 4   Distance(mi)       1601 non-null   float64
 5   Crossing           1601 non-null   bool   
 6   Give_Way           1601 non-null   bool   
 7   Junction           1601 non-null   bool   
 8   No_Exit            1601 non-null   bool   
 9   Railway            1601 non-null   bool   
 10  Roundabout         1601 non-null   bool   
 11  Stop               1601 non-null   bool   
 12  Amenity            1601 non-null   bool   
 13  Side               1601 non-null   object 
 14  Year               1601 non-null   int64  
 15  Month              1601 non-null   int64  
 16  Day               

None

drop
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1601 entries, 0 to 1600
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ID                 1601 non-null   int64  
 1   Lat                1601 non-null   float64
 2   Lng                1601 non-null   float64
 3   Distance(mi)       1601 non-null   float64
 4   Crossing           1601 non-null   bool   
 5   Junction           1601 non-null   bool   
 6   Railway            1601 non-null   bool   
 7   Stop               1601 non-null   bool   
 8   Amenity            1601 non-null   bool   
 9   Side               1601 non-null   object 
 10  Weather_Condition  1601 non-null   object 
 11  Wind_Chill(F)      818 non-null    float64
 12  Precipitation(in)  881 non-null    float64
 13  Temperature(F)     1601 non-null   float64
 14  Humidity(%)        1601 non-null   float64
 15  Wind_Speed(mph)    1528 non-null   float64
 16  Visibility(mi)     

None

fillna
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1601 entries, 0 to 1600
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ID                 1601 non-null   int64  
 1   Lat                1601 non-null   float64
 2   Lng                1601 non-null   float64
 3   Distance(mi)       1601 non-null   float64
 4   Crossing           1601 non-null   bool   
 5   Junction           1601 non-null   bool   
 6   Railway            1601 non-null   bool   
 7   Stop               1601 non-null   bool   
 8   Amenity            1601 non-null   bool   
 9   Side               1601 non-null   object 
 10  Weather_Condition  1601 non-null   object 
 11  Wind_Chill(F)      1601 non-null   float64
 12  Precipitation(in)  1601 non-null   float64
 13  Temperature(F)     1601 non-null   float64
 14  Humidity(%)        1601 non-null   float64
 15  Wind_Speed(mph)    1601 non-null   float64
 16  Visibility(mi)   

None

onehot
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1601 entries, 0 to 1600
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ID                 1601 non-null   int64  
 1   Lat                1601 non-null   float64
 2   Lng                1601 non-null   float64
 3   Distance(mi)       1601 non-null   float64
 4   Crossing           1601 non-null   bool   
 5   Junction           1601 non-null   bool   
 6   Railway            1601 non-null   bool   
 7   Stop               1601 non-null   bool   
 8   Amenity            1601 non-null   bool   
 9   Weather_Condition  1601 non-null   int8   
 10  Wind_Chill(F)      1601 non-null   float64
 11  Precipitation(in)  1601 non-null   float64
 12  Temperature(F)     1601 non-null   float64
 13  Humidity(%)        1601 non-null   float64
 14  Wind_Speed(mph)    1601 non-null   float64
 15  Visibility(mi)     1601 non-null   float64
 16  Holiday          

None

In [42]:
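# Add every column that exists in the processed training frame but not in the test frame.
for df_col in df_after.columns:
    if df_col not in test_df.columns:
        test_df[df_col] = 0  # placeholder value
        print(df_col)
# Sort the columns alphabetically by name.
test_df.sort_index(axis=1, inplace=True)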

Severity
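
The loop above can also be written with `DataFrame.reindex`. The following is a minimal sketch on toy frames, not the notebook's own method: `train_like` and `test_like` are hypothetical stand-ins for `df_after` and `test_df`, and their values are made up. Unlike the loop, `reindex` also drops columns the training frame lacks and keeps the training column order instead of sorting alphabetically.

```python
import pandas as pd

# Toy frames with hypothetical values; the real ones are `df_after` and `test_df`.
train_like = pd.DataFrame(
    {"ID": [1, 2], "Side_L": [0, 1], "Side_R": [1, 0], "Severity": [2, 3]}
)
test_like = pd.DataFrame({"ID": [3, 4], "Side_R": [1, 1]})

# Columns present in the training frame but absent from the test frame.
missing = [c for c in train_like.columns if c not in test_like.columns]
print(missing)

# Reindex to the training columns: missing columns are created and filled with 0,
# and columns that the training frame does not have are dropped.
aligned = test_like.reindex(columns=train_like.columns, fill_value=0)
print(aligned)
```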


In [43]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1601 entries, 0 to 1600
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Amenity            1601 non-null   bool   
 1   Crossing           1601 non-null   bool   
 2   Distance(mi)       1601 non-null   float64
 3   Holiday            1601 non-null   float64
 4   Humidity(%)        1601 non-null   float64
 5   ID                 1601 non-null   int64  
 6   Junction           1601 non-null   bool   
 7   Lat                1601 non-null   float64
 8   Lng                1601 non-null   float64
 9   Precipitation(in)  1601 non-null   float64
 10  Railway            1601 non-null   bool   
 11  Severity           1601 non-null   int64  
 12  Side_L             1601 non-null   uint8  
 13  Side_R             1601 non-null   uint8  
 14  Stop               1601 non-null   bool   
 15  Temperature(F)     1601 non-null   float64
 16  Visibility(mi)     1601 

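Note that the test set has the same features as the training set, but it originally comes without the `Severity` column. The `Severity` column listed above is only the placeholder of zeros added in the previous cell.
At this stage, one must **NOT** forget to apply to the test features the same processing that was applied to the training set.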
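
One way to keep the two sets consistent is to compute any fill values on the training set once and reuse them for the test set. This is a minimal sketch that assumes mean imputation, which may differ from the fill strategy used in this notebook. The frames `train` and `test` are toy stand-ins with made-up values.

```python
import pandas as pd

# Toy frames with hypothetical values standing in for the real train/test data.
train = pd.DataFrame(
    {"Wind_Speed(mph)": [10.0, None, 6.0], "Visibility(mi)": [10.0, 9.0, None]}
)
test = pd.DataFrame(
    {"Wind_Speed(mph)": [None, 4.0], "Visibility(mi)": [None, 10.0]}
)

# Compute the fill values from the training set only.
fill_values = train.mean()

# Reuse the same values for both frames so they are processed identically.
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)
print(test_filled)
```

Computing the statistics on the training set alone means the test set does not influence the preprocessing.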

Now we'll predict the class of each test row and store the predictions in the `Severity` column of the test `DataFrame`.

**I'll select the numerical features here as I did for the training set. DO NOT forget to update this step whenever you change the preprocessing of the training data.**

In [44]:
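# Drop the identifier and the placeholder target so that only the features remain.
X_test = test_df.drop(columns=['ID', 'Severity'])

# Predict the severity class of every test row with the trained classifier.
y_test_predicted = classifier.predict(X_test)

# Overwrite the placeholder column with the predictions.
test_df['Severity'] = y_test_predicted

test_df.head()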

Unnamed: 0,Amenity,Crossing,Distance(mi),Holiday,Humidity(%),ID,Junction,Lat,Lng,Precipitation(in),Railway,Severity,Side_L,Side_R,Stop,Temperature(F),Visibility(mi),Weather_Condition,Wind_Chill(F),Wind_Speed(mph)
0,False,False,0.039,0.660371,60.0,6407,True,37.78606,-122.3909,0.006228,False,2,0,1,False,63.0,10.0,14,59.936836,10.4
1,False,False,0.202,0.660371,56.0,6408,False,37.769609,-122.415057,0.0,False,2,0,1,False,65.0,9.0,3,65.0,5.0
2,False,False,0.0,0.660371,90.0,6409,False,37.807495,-122.476021,0.0,False,2,0,1,False,58.0,10.0,11,58.0,18.0
3,False,False,0.0,0.660371,59.0,6410,True,37.761818,-122.405869,0.0,False,2,0,1,False,72.0,10.0,3,72.0,17.0
4,False,False,0.67,0.660371,77.0,6411,False,37.73235,-122.4141,0.006228,False,2,0,1,False,57.0,10.0,18,59.936836,5.8
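
All five rows shown above are predicted as class 2. A quick way to see how the predictions are spread across classes is `value_counts`. This is a minimal sketch on a hypothetical list of predictions; in the notebook the input would be `y_test_predicted`.

```python
import pandas as pd

# Hypothetical predictions standing in for `y_test_predicted`.
y_pred = pd.Series([2, 2, 2, 3, 2, 2], name="Severity")

# Share of each predicted class, largest first.
class_share = y_pred.value_counts(normalize=True)
print(class_share)
```

If nearly all predictions fall into one class, that may be a hint to revisit the features or the model.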


Now we're ready to generate the submission file. The submission file needs the columns `ID` and `Severity` only.

In [45]:
test_df[['ID', 'Severity']].to_csv('/kaggle/working/submission.csv', index=False)
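
Before submitting, it can help to read the file back and check it. The following is a minimal sketch: `check_submission` is a hypothetical helper, the toy IDs are only illustrative, and the check that the IDs match the test set is an assumption about what the competition expects.

```python
import os
import tempfile

import pandas as pd


def check_submission(path, expected_ids):
    """Return a list of problems found in a submission CSV (empty if none)."""
    sub = pd.read_csv(path)
    problems = []
    if list(sub.columns) != ["ID", "Severity"]:
        problems.append("columns must be exactly ID and Severity")
        return problems
    if sub["Severity"].isna().any():
        problems.append("Severity has missing values")
    if sorted(sub["ID"]) != sorted(expected_ids):
        problems.append("IDs do not match the test set")
    return problems


# Toy submission written to a temporary folder (hypothetical values).
toy = pd.DataFrame({"ID": [6407, 6408, 6409], "Severity": [2, 2, 3]})
path = os.path.join(tempfile.mkdtemp(), "submission.csv")
toy.to_csv(path, index=False)

problems = check_submission(path, expected_ids=[6407, 6408, 6409])
print(problems)
```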

The remaining step is to submit the generated file, as follows.

1. Press `Save Version` in the upper-right corner of this notebook.
2. Write a `Version Name` of your choice and choose `Save & Run All (Commit)` then click `Save`.
3. Wait for the saved notebook to finish running, then go to the saved notebook.
4. Scroll down until you see the output files then select the `submission.csv` file and click `Submit`.

Now your submission will be evaluated and your score will be updated on the leaderboard! CONGRATULATIONS!!

## Conclusion

In this notebook, we have demonstrated the essential steps that one should take in order to get "slightly" familiar with the data and the submission process. We chose not to go into detail in each step, to keep the welcoming notebook simple and leave room for improvement.

You're encouraged to `Fork` the notebook, edit it, add your insights, and use it to create your submission.