# IBM Data Science Capstone: Car Accident Severity Report – Week 1

### Table of Contents:

01.	Introduction

     01.01.	 Business Problem

     01.02.	 Target Audience / Stakeholders
     

02.	Data

    02.01.	 Data Source

    02.02.	 Data Understanding

    02.03.	Data Loading

    02.04.	 Data Pre-Processing

            02.04.1.	Removing unnecessary columns

            02.04.2.	Removing records with empty data

            02.04.3.	Label Encoding 

            02.04.4.	Balancing the data

            02.04.5.	Define the Feature Set

            02.04.6.	Normalizing the data

            02.04.7.	Split the data (into: Train/Test Sets)


## 01.	 Introduction

This section describes the business problem (through a scenario), identifies who would be interested in the project (the target audience / stakeholders) and explains what kind of algorithm, in my opinion, needs to be developed.


### 01.01	 Business Problem

To better understand the problem and its background, let me start with the following scenario description (as described in the introduction video of the class): 

    Say you are driving to another city for work or to visit some friends.

    It is rainy and windy, and on the way, you come across a terrible traffic jam on the other side of the highway. Long lines of cars barely moving. 

    As you keep driving, police cars start appearing from afar shutting down the highway. 

    Oh, it is an accident and there is a helicopter transporting the ones involved in the crash to the nearest hospital. 

    They must be in critical condition for all of this to be happening. 

    Now, wouldn't it be great if there is something in place that could warn you, 

    given the weather and the road conditions about the possibility of you getting into a car accident and how severe it would be, 

    so that you would drive more carefully or even change your travel if you are able to.


So, the problem we have at hand here is: **How to reduce the frequency of car collisions in a community.**


### 01.02. Target Audience / Stakeholders

Every data science problem targets an audience and is meant to help a group of stakeholders. Let me describe who would benefit from a solution to our problem and why they would care about it.
Recall that our problem is how to reduce the frequency of car collisions in a community. The target audience / stakeholders are therefore:

-	First and foremost, the community itself: drivers who want to avoid both being involved in such accidents and being stuck in the long lines and delays that accidents cause on the route ahead. More often than not, we use GPS software such as Waze, which instructs us to change our usual route because of an accident ahead, and that alone adds to our drive time. What if we could reduce the number of those accidents in the first place?


-	The traffic police. The fewer traffic accidents there are, the more time traffic officers have for their day-to-day work. Also, long lines of cars behind an accident make it harder for a police car to reach the accident scene.


-	Traffic Command and Control Center. Especially in large cities, more accidents mean more “headaches” for the people at the Command and Control Center, who must resolve the traffic jams, handle re-routes and restore normal order.


-	Hospitals and ambulance teams. As the accident rate drops, so does the number of people hurt in accidents. Fewer ambulances and helicopters need to be sent to accident locations, and fewer accident victims need hospital treatment, which allows medical professionals to give more attention and care to other patients.


-	Insurance companies. This is another given. The fewer accidents there are, the less the insurance companies need to pay out, both for car damage and for health-related claims (including life insurance in cases of death).


-	Another beneficiary of a solution to the problem is the workplaces of the people who wait in the long accident lines on their way to work. Had there been no accident, those people would already be working and producing profit for the companies they work at.


## 02.	 Data

In this section I describe the data that I use, its source, how it is used to solve the problem and the pre-processing it needs before it becomes useful. The explanations are accompanied by examples to make the points clearer.


### 02.01. Data Source

The data source I used is the one provided to the students of this class in the project description:

    https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv


### 02.02. Data Understanding

To better understand the data, I performed the following tasks: 

a.	Observe the csv file that was provided and the meta data file that describes it

(can be located under: https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Metadata.pdf )

b.	Decide on the target variable. It did not take long to see that this should be the ‘SEVERITYCODE’ field, because it is used to measure the severity of an accident. According to the metadata file, there are several severity codes:
- 0 (Unknown)
<font color=green>
- 1 (Property Damage Only Collision)
- 2 (Injury Collision)
</font>
- 2b (Serious Injury)
- 3 (Fatality)

However, the csv file provided contains only records with a SEVERITYCODE of either 1 or 2. So, I decided to check how well the attributes listed in item d below can predict a severity code of either 1 or 2.

c.	Decide which fields are irrelevant. When browsing through the csv data, I found quite a few columns that I would not use for this model. Below are those columns, with the reasoning:

-	‘SEVERITYCODE’ – This column appears twice in the file; only one copy is used.
-	‘X’, ‘Y’, ‘OBJECTID’, ‘INCKEY’, ‘COLDETKEY’, ‘REPORTNO’, ‘STATUS’, ‘INTKEY’, ‘LOCATION’, ‘EXCEPTRSNCODE’, ‘EXCEPTRSNDESC’, ‘SEVERITYDESC’, ‘INCDATE’, ‘INCDTTM’, ‘SDOT_COLCODE’, ‘SDOT_COLDESC’, ‘SDOTCOLNUM’, ‘ST_COLCODE’, ‘ST_COLDESC’, ‘SEGLANEKEY’, ‘CROSSWALKKEY’ – These columns are not used because they do not provide any value to the classification.
-	‘INATTENTIONIND’, ‘PEDROWNOTGRNT’, ‘SPEEDING’ – These three columns could provide valuable data to the algorithm (respectively: “Whether or not collision was due to inattention (Y/N)”, “Whether or not the pedestrian right of way was not granted. (Y/N)” and “Whether or not speeding was a factor in the collision. (Y/N)”). However, I could not use them because in the csv data their values are either blank (empty) or ‘Y’, and I could not tell whether a blank represents ‘N’ or missing data.
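The ambiguity in the Y/N indicator columns can be checked with a per-column value count that keeps missing values as their own bucket. The sketch below uses a small hypothetical frame (`toy`, with made-up rows) rather than the real csv; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the three indicator columns:
# only 'Y' or a missing value ever appears, never an explicit 'N'.
toy = pd.DataFrame({
    "INATTENTIONIND": ["Y", np.nan, np.nan, "Y"],
    "PEDROWNOTGRNT": [np.nan, np.nan, "Y", np.nan],
    "SPEEDING": [np.nan, "Y", np.nan, np.nan],
})

# Count every distinct value per column, keeping NaN as its own bucket.
for col in toy.columns:
    print(toy[col].value_counts(dropna=False))

# Columns whose only non-missing value is 'Y' cannot tell 'N' apart from "not recorded".
ambiguous = [col for col in toy.columns if set(toy[col].dropna().unique()) == {"Y"}]
print(ambiguous)
```

Running the same loop on the real data frame would show whether any explicit ‘N’ values exist in these columns.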

d.	By the process of elimination, I was left with the relevant fields, which became the attributes used to predict the severity of an accident (either ‘1’ or ‘2’). In other words, to reduce the frequency of car collisions in a community, an algorithm must be developed that predicts the severity of an accident given:

-	'ADDRTYPE' - The collision address type (Possible values: ‘Alley’, ‘Block’, ‘Intersection’)
-	'COLLISIONTYPE' – Collision Type (Possible values: 'Angles', 'Cycles', 'Head On', 'Left Turn', 'Parked Car', 'Pedestrian', 'Rear Ended', 'Right Turn', 'Sideswipe')
-	'PERSONCOUNT' - The total number of people involved in the collision
-	'PEDCOUNT' - The number of pedestrians involved in the collision
-	'PEDCYLCOUNT' - The number of bicycles involved in the collision
-	'VEHCOUNT' - The number of vehicles involved in the collision
-	'JUNCTIONTYPE' - Category of junction at which collision took place (Possible values: 'At Intersection (but not related to intersection)', 'At Intersection (intersection related)', 'Driveway Junction', 'Mid-Block (but intersection related)', 'Mid-Block (not related to intersection)', 'Ramp Junction')
-	'UNDERINFL' - Whether a driver involved was under the influence of drugs or alcohol (Possible values: ‘0’, ‘1’, ‘Y’, ‘N’)
-	'WEATHER' - The weather conditions during the time of the collision (Possible values: 'Blowing Sand/Dirt', 'Clear', 'Fog/Smog/Smoke', 'Overcast', 'Partly Cloudy', 'Raining', 'Severe Crosswind', 'Sleet/Hail/Freezing Rain', 'Snowing')
-	'ROADCOND' - The condition of the road during the collision (Possible values: 'Dry', 'Ice', 'Oil', 'Sand/Mud/Dirt', 'Snow/Slush', 'Standing Water', 'Wet')
-	'LIGHTCOND' - The light conditions during the collision (Possible values: 'Dark - No Street Lights', 'Dark - Street Lights Off', 'Dark - Street Lights On', 'Dawn', 'Daylight', 'Dusk')
-	'HITPARKEDCAR' - Whether the collision involved hitting a parked car (Possible values: ‘Y’, ‘N’)


### 02.03 Data Loading

Before we start dealing with the data, let’s import the necessary libraries/packages which will be needed for this project: 


In [74]:
import itertools
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import urllib.request
from sklearn import metrics
from sklearn import preprocessing
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import jaccard_score  # jaccard_similarity_score was removed in scikit-learn 0.23
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

Let us download the data from the data source: 

In [75]:
# Method-1: Jupyter Notebook Compatible
!wget -O Data-Collisions.csv https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv

# Method-2: Non-Jupyter Notebook Compatible
# url = 'https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv'
# with urllib.request.urlopen(url) as testfile, open('Data-Collisions.csv', 'w', encoding='utf-8') as f:
#     f.write(testfile.read().decode())

--2020-10-03 13:19:22--  https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv
Resolving s3.us.cloud-object-storage.appdomain.cloud (s3.us.cloud-object-storage.appdomain.cloud)... 67.228.254.196
Connecting to s3.us.cloud-object-storage.appdomain.cloud (s3.us.cloud-object-storage.appdomain.cloud)|67.228.254.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 73917638 (70M) [text/csv]
Saving to: ‘Data-Collisions.csv’


2020-10-03 13:19:25 (35.9 MB/s) - ‘Data-Collisions.csv’ saved [73917638/73917638]



Load the csv data into a data frame:

In [76]:
df = pd.read_csv("Data-Collisions.csv", delimiter=",", low_memory=False)

Observe the dataframe head: 

In [77]:
df.head()

|   | SEVERITYCODE | X | Y | OBJECTID | INCKEY | COLDETKEY | REPORTNO | STATUS | ADDRTYPE | INTKEY | ... | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SDOTCOLNUM | SPEEDING | ST_COLCODE | ST_COLDESC | SEGLANEKEY | CROSSWALKKEY | HITPARKEDCAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | -122.323148 | 47.70314 | 1 | 1307 | 1307 | 3502005 | Matched | Intersection | 37475.0 | ... | Wet | Daylight |  |  |  | 10 | Entering at angle | 0 | 0 | N |
| 1 | 1 | -122.347294 | 47.647172 | 2 | 52200 | 52200 | 2607959 | Matched | Block |  | ... | Wet | Dark - Street Lights On |  | 6354039.0 |  | 11 | From same direction - both going straight - bo... | 0 | 0 | N |
| 2 | 1 | -122.33454 | 47.607871 | 3 | 26700 | 26700 | 1482393 | Matched | Block |  | ... | Dry | Daylight |  | 4323031.0 |  | 32 | One parked--one moving | 0 | 0 | N |
| 3 | 1 | -122.334803 | 47.604803 | 4 | 1144 | 1144 | 3503937 | Matched | Block |  | ... | Dry | Daylight |  |  |  | 23 | From same direction - all others | 0 | 0 | N |
| 4 | 2 | -122.306426 | 47.545739 | 5 | 17700 | 17700 | 1807429 | Matched | Intersection | 34387.0 | ... | Wet | Daylight |  | 4028032.0 |  | 10 | Entering at angle | 0 | 0 | N |


### 02.04 Data Pre-Processing

In its original form, this data is not fit for analysis, for the following reasons: 
1.	As mentioned above, there are quite a few columns that I will not use for this model.
2.	Almost all the columns contain empty values (blanks).
3.	Label encoding is required. Most of the features are of type object, whereas they should be of type category, each with an additional column that holds the matching numerical values.
4.	The data is not balanced.
5.	The data is not normalized.
6.	The data needs to be split into train data and test data.

Let us go over each of these reasons below and see how I resolved them.


### 02.04.1	Removing unnecessary columns

As mentioned previously, we need to take out columns that are either:

- Not useful (such as EXCEPTRSNCODE or EXCEPTRSNDESC), or

- Uninformative (such as SPEEDING, where the values are either blank or 'Y' and we cannot know from a blank whether the information is missing or the value should be 'N')


In [78]:
col_df = df[['ADDRTYPE', 'SEVERITYCODE', 'COLLISIONTYPE', 'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT',
             'JUNCTIONTYPE', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND', 'HITPARKEDCAR']].copy()  # copy, so later edits do not act on a slice of df
col_df.head()

|   | ADDRTYPE | SEVERITYCODE | COLLISIONTYPE | PERSONCOUNT | PEDCOUNT | PEDCYLCOUNT | VEHCOUNT | JUNCTIONTYPE | UNDERINFL | WEATHER | ROADCOND | LIGHTCOND | HITPARKEDCAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Intersection | 2 | Angles | 2 | 0 | 0 | 2 | At Intersection (intersection related) | N | Overcast | Wet | Daylight | N |
| 1 | Block | 1 | Sideswipe | 2 | 0 | 0 | 2 | Mid-Block (not related to intersection) | 0 | Raining | Wet | Dark - Street Lights On | N |
| 2 | Block | 1 | Parked Car | 4 | 0 | 0 | 3 | Mid-Block (not related to intersection) | 0 | Overcast | Dry | Daylight | N |
| 3 | Block | 1 | Other | 3 | 0 | 0 | 3 | Mid-Block (not related to intersection) | N | Clear | Dry | Daylight | N |
| 4 | Intersection | 2 | Angles | 2 | 0 | 0 | 2 | At Intersection (intersection related) | 0 | Raining | Wet | Daylight | N |


### 02.04.2	Removing records with empty data

Drop every row that has a null value in any of the columns:


In [80]:
col_df = col_df.replace('', np.nan)  # treat empty strings as missing (avoids in-place edits on a slice)
col_df = col_df.dropna()
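Before dropping rows, it can be useful to see how many values are missing in each column and how many rows survive. This is a minimal sketch on a hypothetical three-column frame (`toy`, with made-up rows), not the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one empty string and two NaN values spread over two columns.
toy = pd.DataFrame({
    "WEATHER": ["Clear", "", "Raining", np.nan],
    "ROADCOND": ["Dry", "Wet", np.nan, "Wet"],
    "PERSONCOUNT": [2, 3, 2, 4],
})

# Treat empty strings as missing, then count the missing values per column.
cleaned = toy.replace("", np.nan)
missing_per_column = cleaned.isna().sum()
print(missing_per_column)

# Keep only the rows that are complete in every column.
complete_rows = cleaned.dropna()
print(len(toy), "->", len(complete_rows))  # 4 -> 1
```

On the real data frame, the same `isna().sum()` call shows which columns account for most of the dropped records.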

### 02.04.3	Label Encoding

We must use label encoding to convert the features to our desired data type.

The label encoding here is done using two approaches (each on different fields): 

a.	The fields: UNDERINFL, HITPARKEDCAR and SEVERITYCODE
    
    The UNDERINFL column has values of ‘0’, ‘1’, ‘Y’ and ‘N’. Change all ‘Y’ and ‘N’ to ‘1’ and ‘0’ respectively. 
    While we are at it, the HITPARKEDCAR column has values of ‘Y’ and ‘N’. Change them to ‘1’ and ‘0’ (respectively).


In [81]:
col_df = col_df.replace({'UNDERINFL': {'N': '0', 'Y': '1'}})
col_df = col_df.replace({'HITPARKEDCAR': {'N': '0', 'Y': '1'}})

Changing the SEVERITYCODE, UNDERINFL and HITPARKEDCAR columns from type object to type int

In [82]:
col_df['SEVERITYCODE'] = col_df['SEVERITYCODE'].astype('int')
col_df['UNDERINFL'] = col_df['UNDERINFL'].astype('int')
col_df['HITPARKEDCAR'] = col_df['HITPARKEDCAR'].astype('int')
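Since UNDERINFL mixes two encodings, it may be worth confirming that only ‘0’ and ‘1’ remain after the replacement and before the cast to int. The sketch below does this on a hypothetical sample Series (`underinfl`); the check itself is my addition and not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample of the mixed encodings described for UNDERINFL.
underinfl = pd.Series(["N", "0", "Y", "1", "N"], name="UNDERINFL")

# Map the letter flags onto the digit strings used by the rest of the column.
mapped = underinfl.replace({"N": "0", "Y": "1"})

# Any value outside {'0', '1'} would make the int cast fail or mislead.
unexpected = set(mapped.unique()) - {"0", "1"}
if unexpected:
    raise ValueError(f"Unexpected UNDERINFL values: {unexpected}")

as_int = mapped.astype(int)
print(as_int.tolist())  # [0, 0, 1, 1, 0]
```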

b.	The fields: ‘ADDRTYPE’, ‘COLLISIONTYPE’, ‘JUNCTIONTYPE’, ‘WEATHER’, ‘ROADCOND’, ‘LIGHTCOND’

    For each of those fields, create a white list of the values that we want to keep (excluding values like ‘Other’ or ‘Unknown’). 
    Convert the columns to type category and add another column to the data-frame that would have the numerical value of the category field assigned.


In [83]:
working_dict = {}
working_dict["ADDRTYPE"] = [['Alley', 'Block', 'Intersection'], "ADDRTYPE_CAT"]
working_dict["COLLISIONTYPE"] = [['Angles', 'Cycles', 'Head On', 'Left Turn', 'Parked Car', 'Pedestrian', 'Rear Ended', 'Right Turn', 'Sideswipe'], "COLLISIONTYPE_CAT"]
working_dict["JUNCTIONTYPE"] = [['At Intersection (but not related to intersection)', 'At Intersection (intersection related)', 'Driveway Junction', 'Mid-Block (but intersection related)', 'Mid-Block (not related to intersection)', 'Ramp Junction'], "JUNCTIONTYPE_CAT"]
working_dict["WEATHER"] = [['Blowing Sand/Dirt', 'Clear', 'Fog/Smog/Smoke', 'Overcast', 'Partly Cloudy', 'Raining',
'Severe Crosswind', 'Sleet/Hail/Freezing Rain', 'Snowing'], "WEATHER_CAT"]
working_dict["ROADCOND"] = [['Dry', 'Ice', 'Oil', 'Sand/Mud/Dirt', 'Snow/Slush', 'Standing Water', 'Wet'],
"ROADCOND_CAT"]
working_dict["LIGHTCOND"] = [['Dark - No Street Lights', 'Dark - Street Lights Off', 'Dark - Street Lights On', 'Dawn',
'Daylight', 'Dusk'], "LIGHTCOND_CAT"]
for k in working_dict:
    # Keep only required category values (i.e.: no Unknowns or Others)
    col_df = col_df[col_df[k].isin(working_dict[k][0])].copy()  # .copy() avoids SettingWithCopyWarning on the assignments below
    # Converting column type to 'category'
    col_df[k] = col_df[k].astype('category')
    # Assigning numerical values and storing in another column
    col_df[working_dict[k][1]] = col_df[k].cat.codes

Let us see how the data-frame (head) looks after the label encoding

In [84]:
col_df[['SEVERITYCODE', 'ADDRTYPE', 'COLLISIONTYPE', 'JUNCTIONTYPE', 'UNDERINFL', 'HITPARKEDCAR', 'WEATHER', 'ROADCOND', 'LIGHTCOND', 'ADDRTYPE_CAT', 'COLLISIONTYPE_CAT', 
        'JUNCTIONTYPE_CAT', 'WEATHER_CAT', 'ROADCOND_CAT', 'LIGHTCOND_CAT']].head()

|   | SEVERITYCODE | ADDRTYPE | COLLISIONTYPE | JUNCTIONTYPE | UNDERINFL | HITPARKEDCAR | WEATHER | ROADCOND | LIGHTCOND | ADDRTYPE_CAT | COLLISIONTYPE_CAT | JUNCTIONTYPE_CAT | WEATHER_CAT | ROADCOND_CAT | LIGHTCOND_CAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Intersection | Angles | At Intersection (intersection related) | 0 | 0 | Overcast | Wet | Daylight | 2 | 0 | 1 | 3 | 6 | 4 |
| 1 | 1 | Block | Sideswipe | Mid-Block (not related to intersection) | 0 | 0 | Raining | Wet | Dark - Street Lights On | 1 | 8 | 4 | 5 | 6 | 2 |
| 2 | 1 | Block | Parked Car | Mid-Block (not related to intersection) | 0 | 0 | Overcast | Dry | Daylight | 1 | 4 | 4 | 3 | 0 | 4 |
| 4 | 2 | Intersection | Angles | At Intersection (intersection related) | 0 | 0 | Raining | Wet | Daylight | 2 | 0 | 1 | 5 | 6 | 4 |
| 5 | 1 | Intersection | Angles | At Intersection (intersection related) | 0 | 0 | Clear | Dry | Daylight | 2 | 0 | 1 | 1 | 0 | 4 |


With the new columns, we can now use this data in our analysis and machine learning models (part of Week 2).
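To read the numeric codes back as labels, the category-to-code mapping can be recovered from the `cat` accessor. The sketch below repeats the whitelist, convert and encode steps on a hypothetical five-row frame (`toy`); pandas orders string categories alphabetically, which is why ‘Intersection’ receives code 2.

```python
import pandas as pd

# Whitelisted values for ADDRTYPE; 'Unknown' stands in for any value we exclude.
allowed = ["Alley", "Block", "Intersection"]
toy = pd.DataFrame({"ADDRTYPE": ["Block", "Intersection", "Unknown", "Alley", "Block"]})

# Same three steps as in the loop: filter, convert to category, store the codes.
toy = toy[toy["ADDRTYPE"].isin(allowed)].copy()
toy["ADDRTYPE"] = toy["ADDRTYPE"].astype("category")
toy["ADDRTYPE_CAT"] = toy["ADDRTYPE"].cat.codes

# Recover the code -> label mapping for later interpretation of the model inputs.
code_to_label = dict(enumerate(toy["ADDRTYPE"].cat.categories))
print(code_to_label)  # {0: 'Alley', 1: 'Block', 2: 'Intersection'}
```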

### 02.04.4	Balancing the data

First, let us find out how many distinct values the SEVERITYCODE (our target) field has and how many records carry each value:


In [85]:
col_df['SEVERITYCODE'].value_counts()

1    95918
2    49445
Name: SEVERITYCODE, dtype: int64

As we can see, there are two values in total: 
-	1's (Property Damage Only Collision)
-	2's (Injury Collision)

There are 95,918 records with SEVERITYCODE=1 and 49,445 records with SEVERITYCODE=2.
That is close to a 2:1 ratio (1.94:1) in favor of the majority class.

Let us down-sample the majority class so that it is equal in size to the minority class.


In [86]:
# Separate majority and minority classes
df_majority = col_df[col_df.SEVERITYCODE == 1]
df_minority = col_df[col_df.SEVERITYCODE == 2]
# Down-sample majority class
df_majority_downsampled = resample(df_majority,
                                   replace=False,     # sample without replacement
                                   n_samples=len(df_minority),  # to match minority class (49,445 records)
                                   random_state=123)  # reproducible results 
# Combine minority class with down-sampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])


Let us now see whether the data is balanced or not:

In [87]:
df_downsampled['SEVERITYCODE'].value_counts()

2    49445
1    49445
Name: SEVERITYCODE, dtype: int64

Now the dataset is beautifully balanced.
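The same balancing can also be written without naming the majority and minority classes explicitly. The sketch below is an alternative to `resample`, not what the notebook runs: it assumes pandas 1.1 or newer (for `groupby(...).sample`) and uses a hypothetical frame (`toy`) with 8 records of class 1 and 3 of class 2.

```python
import pandas as pd

# Hypothetical imbalanced frame: 8 records of class 1, 3 records of class 2.
toy = pd.DataFrame({
    "SEVERITYCODE": [1] * 8 + [2] * 3,
    "VEHCOUNT": list(range(11)),
})

# Size of the smallest class determines how many rows to keep per class.
minority_size = int(toy["SEVERITYCODE"].value_counts().min())

# Draw the same number of rows from every class, without replacement.
balanced = toy.groupby("SEVERITYCODE").sample(n=minority_size, random_state=123)
print(balanced["SEVERITYCODE"].value_counts())
```

Because the sample size is derived from the data, the code keeps working if the pre-processing steps change the class counts.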

Let us discover how many rows and columns are in this dataset in total:

In [88]:
df_downsampled.shape

(98890, 19)

Let us see the column names and their types:

In [89]:
df_downsampled.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 98890 entries, 66075 to 194671
Data columns (total 19 columns):
ADDRTYPE             98890 non-null category
SEVERITYCODE         98890 non-null int64
COLLISIONTYPE        98890 non-null category
PERSONCOUNT          98890 non-null int64
PEDCOUNT             98890 non-null int64
PEDCYLCOUNT          98890 non-null int64
VEHCOUNT             98890 non-null int64
JUNCTIONTYPE         98890 non-null category
UNDERINFL            98890 non-null int64
WEATHER              98890 non-null category
ROADCOND             98890 non-null category
LIGHTCOND            98890 non-null category
HITPARKEDCAR         98890 non-null int64
ADDRTYPE_CAT         98890 non-null int8
COLLISIONTYPE_CAT    98890 non-null int8
JUNCTIONTYPE_CAT     98890 non-null int8
WEATHER_CAT          98890 non-null int8
ROADCOND_CAT         98890 non-null int8
LIGHTCOND_CAT        98890 non-null int8
dtypes: category(6), int64(7), int8(6)
memory usage: 7.2 MB


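Notice, in the output above, the original category columns alongside the newly added int8 columns that hold their numerical codes.

From this point onwards, we will use only the new numeric columns in place of the category columns for our analysis.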

### 02.04.5	Define the Feature Set

The scikit-learn library works with NumPy arrays, so we extract the feature columns from the pandas data frame as a NumPy array.

In [90]:
X = df_downsampled[['ADDRTYPE_CAT', 'COLLISIONTYPE_CAT', 'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT',
                    'JUNCTIONTYPE_CAT', 'UNDERINFL', 'WEATHER_CAT', 'ROADCOND_CAT', 'LIGHTCOND_CAT',
                    'HITPARKEDCAR']].values

Display head of the features set (before normalization):

In [91]:
X[0:5]

array([[1, 6, 2, 0, 0, 2, 3, 0, 1, 0, 4, 0],
       [2, 8, 3, 0, 0, 2, 1, 0, 1, 0, 4, 0],
       [1, 4, 2, 0, 0, 2, 4, 0, 5, 1, 2, 0],
       [1, 6, 3, 0, 0, 3, 3, 0, 5, 6, 4, 0],
       [1, 8, 2, 0, 0, 2, 4, 0, 1, 0, 4, 0]])

Define label (target) set:

In [92]:
y = df_downsampled['SEVERITYCODE'].values

Display head of the label (target) set:

In [93]:
y[0:5]

array([1, 1, 1, 1, 1])

### 02.04.6	Normalizing the data

Data standardization gives the data zero mean and unit variance. It is good practice, especially for algorithms such as KNN, which are based on the distance between cases.

In [94]:
X = preprocessing.StandardScaler().fit_transform(X.astype(float))



Display head of the features set (after normalization):

In [95]:
X[0:5]

array([[-0.85876841,  0.83441066, -0.4709329 , -0.25259959, -0.22821246,
        -0.05122548,  0.35782358, -0.22074884, -0.67431245, -0.6105998 ,
         0.55116918, -0.15978737],
       [ 1.15870695,  1.58085096,  0.23185974, -0.25259959, -0.22821246,
        -0.05122548, -1.07107215, -0.22074884, -0.67431245, -0.6105998 ,
         0.55116918, -0.15978737],
       [-0.85876841,  0.08797036, -0.4709329 , -0.25259959, -0.22821246,
        -0.05122548,  1.07227144, -0.22074884,  1.78523953, -0.23362798,
        -1.50706469, -0.15978737],
       [-0.85876841,  0.83441066,  0.23185974, -0.25259959, -0.22821246,
         1.66945662,  0.35782358, -0.22074884,  1.78523953,  1.65123107,
         0.55116918, -0.15978737],
       [-0.85876841,  1.58085096, -0.4709329 , -0.25259959, -0.22821246,
        -0.05122548,  1.07227144, -0.22074884, -0.67431245, -0.6105998 ,
         0.55116918, -0.15978737]])
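Standardization replaces each value x in a column with z = (x - mean) / std, where the mean and the standard deviation are computed per column. The sketch below checks, on a hypothetical 4x3 array (`X_toy`), that this manual computation matches `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: 4 records, 3 numeric features.
X_toy = np.array([[1, 6, 2], [2, 8, 3], [1, 4, 2], [1, 6, 3]], dtype=float)

# Manual z-score per column; StandardScaler uses the population std (ddof=0).
mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)
manual = (X_toy - mean) / std

# The library result should match the manual computation.
scaled = StandardScaler().fit_transform(X_toy)
print(np.allclose(manual, scaled))  # True
```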

### 02.04.7	Split the data (into: Train/Test Sets)

Out-of-sample accuracy is the percentage of correct predictions that the model makes on data that the model has NOT been trained on. 

Training and testing on the same dataset will most likely yield low out-of-sample accuracy, because the model is likely to be over-fit.

It is important that our models have high out-of-sample accuracy, because the purpose of any model is, of course, to make correct predictions on unknown data. 

So how can we improve out-of-sample accuracy?

One way is to use an evaluation approach called Train/Test Split.

Train/Test Split involves splitting the dataset into training and testing sets, which are mutually exclusive. 

After that, you train with the training set and test with the testing set.

This provides a more accurate evaluation of out-of-sample accuracy, because the testing dataset is not part of the data that was used to train the model. 

It is more realistic for real-world problems.


In [96]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

Display the dimensions (shape) of the Train/Test sets:

In [97]:
print('Train set shape:', X_train.shape,  y_train.shape)
print('Test set shape:', X_test.shape,  y_test.shape)

Train set shape: (79112, 12) (79112,)
Test set shape: (19778, 12) (19778,)
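A common variation, shown here only as a sketch and not as the method used above, is to split first and then fit the scaler on the training set alone, so that no statistics from the test set reach the training data. It also passes `stratify` so that both sets keep the 1:1 class balance. The arrays `X_raw` and `y_toy` are hypothetical random data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical balanced data: 100 records, 3 integer-coded features, two classes.
rng = np.random.default_rng(4)
X_raw = rng.integers(0, 9, size=(100, 3)).astype(float)
y_toy = np.array([1] * 50 + [2] * 50)

# Split first (80/20), keeping the class proportions equal in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_toy, test_size=0.2, random_state=4, stratify=y_toy
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)  # (80, 3) (20, 3)
```

With this ordering, the test set is scaled using the training mean and standard deviation, which mirrors how unseen data would be handled after deployment.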
