# Motorized Vehicles: Data Science Challenge

Based on the E-Bike Survey Response Results from TO Open Data, the goal is to train a model to predict whether the responder will answer "No - I do not have access to a private motorized vehicle" to the question "Does your household have access to any of the following types of private motorized vehicles?". 

Since the goal is to predict whether the response to this question falls into a given category, this is a classification problem. The dataset is labelled (we have access to the survey responses to that question), so it is a supervised learning classification problem.

The following notebook will cover the entire Data Science process, starting from loading the data, cleaning it, doing exploratory data analysis, creating a classification model and finally, evaluating the model.

## Data Import & First Look

Let's start by importing the modules we need for data manipulation.

In [189]:
# Import necessary modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [190]:
# Read in survey data
responses = pd.read_csv("E-Bike_Survey_Responses.csv")

In [191]:
# Take a look at available features + responses
responses.head(10)

Unnamed: 0,Timestamp,What age range do you fall in?,Sex,How would you describe your level of physical health?,What level of education have you reached?,What is your household income?,Which category best describes your employment?,What Toronto district is your primary address located in?,On average what distance do you travel most days of the week?,On average how long is your commute?,...,Do you support any of the following statements?,When you use Toronto's Multi-Use Trails do you mostly,Are you aware that the Multi-Use Paths have a speed limit of 20 km/h?,Have you witnessed a collision or conflict on a trail between,Do you think more should be done to manage trail users who do not respect the 20 km/h speed limit?,Currently any kind of e-bike may use a multi-use path if they are propelled by pedaling only and those propelled by motor power may be fined,When you use Toronto's bicycle lanes do you mostly,Currently any kind of e-bike may use a bicycle lane if they are propelled by pedaling only and those propelled by motor power may be fined,With regards to illegal use of bicycles and e-bikes on sidewalks should the City,Toronto Bylaws consider personal mobility devices (such as electric wheel chairs) to be pedestrians In your opinion should the City
0,2013-04-10 12:10,35 to 49,Male,Good,Post graduate,$100K+,Self Employed,Central Toronto York or East York,Under 2 km,15 minutes or less,...,On scooter type e-bikes the pedals are unneces...,drive a motor propelled e-bike,No,pedestrians and/or runners a conflict relating...,No - the trails are fine as they are,The bylaw should be modified to allow any kind...,drive a scooter type e-bike propelled by an el...,The bylaw should be modified to allow e-bikes ...,be tolerant of bikes and e-bikes on the walksi...,Institute a speed limit for sidewalks
1,2013-04-10 12:30,18 to 34,Male,Excellent,University degree,$40K to $59K,Full Time,Central Toronto York or East York,10 - 20 km,30 - 44 minutes,...,On scooter type e-bikes the pedals are unneces...,I very rarely use any of Toronto's Multi-Use P...,No,I am not aware of any conflicts on the trails,No - the trails are fine as they are,No changes are necessary to the existing bylaw,drive a scooter type e-bike propelled by an el...,The bylaw should be modified to allow e-bikes ...,maintain the existing programs for signage edu...,Do nothing
2,2013-04-10 12:33,50 to 64,Male,Good,University degree,$40K to $59K,Self Employed,Central Toronto York or East York,10 - 20 km,15 minutes or less,...,Most scooter type e-bikes are wider than a bic...,cycle I very rarely use any of Toronto's Multi...,No,I am not aware of any conflicts on the trails,Yes - more signage Yes - more enforcement (tic...,No changes are necessary to the existing bylaw,ride a road bicycle or a fixie,No changes are necessary to the existing bylaw,increase signage increase enforcement increase...,only wheelchairs at walking speed
3,2013-04-10 12:52,50 to 64,Male,Good,4 years university no degree,$80K to $99K,Self Employed,Central Toronto York or East York,Under 2 km,15 minutes or less,...,On scooter type e-bikes the pedals are unneces...,cycle,No,a conflict between cyclists and pedestrians a ...,Yes - more signage Yes - more enforcement (tic...,Motorized vehicles should generally not be all...,ride a commuter or cruiser style bicycle,Motorized vehicles should generally not be all...,increase signage increase enforcement,Update the definition of a personal mobility d...
4,2013-04-10 13:24,18 to 34,Male,Very good,College or trade school diploma,$40K to $59K,Self Employed,Central Toronto York or East York,5 - 10 km,15 minutes or less,...,Most scooter type e-bikes are wider than a bic...,cycle I very rarely use any of Toronto's Multi...,No,I am not aware of any conflicts on the trails,No - the trails are fine as they are,Motorized vehicles should generally not be all...,ride a road bicycle or a fixie,Motorized vehicles should generally not be all...,maintain the existing programs for signage edu...,Update the definition of a personal mobility d...
5,2013-04-10 13:26,35 to 49,Male,Very good,College or trade school diploma,$40K to $59K,Full Time,Central Toronto York or East York,10 - 20 km,16 - 29 minutes,...,scooter style e-bikes are different than pedal...,cycle,No,a conflict between e-biker and a cyclist,Yes - more educational programs,Motorized vehicles should generally not be all...,ride a road bicycle or a fixie,Motorized vehicles should generally not be all...,increase education,Update the definition of a personal mobility d...
6,2013-04-10 13:27,18 to 34,Male,Good,College or trade school diploma,$100K+,Full Time,Central Toronto York or East York,5 - 10 km,15 minutes or less,...,scooter style e-bikes are different than pedal...,cycle,No,a conflict relating to a dog(s) a conflict bet...,Yes - more signage Yes - more educational prog...,The bylaw should be modified to allow any kind...,ride a commuter or cruiser style bicycle,An e-bike should never be in a bike lane,increase signage increase enforcement increase...,Update the definition of a personal mobility d...
7,2013-04-10 13:28,35 to 49,Male,Very good,Post graduate,$100K+,Full Time,Central Toronto York or East York,Under 2 km,15 minutes or less,...,scooter style e-bikes are different than pedal...,run,No,pedestrians and/or runners a conflict relating...,Yes - more signage,Motorized vehicles should generally not be all...,ride a mountain downhill or BMX bicycle,Motorized vehicles should generally not be all...,maintain the existing programs for signage edu...,Update the definition of a personal mobility d...
8,2013-04-10 13:31,35 to 49,Female,Fairly good,Post graduate,$80K to $99K,Unemployed,Central Toronto York or East York,5 - 10 km,15 minutes or less,...,scooter style e-bikes are different than pedal...,walk,No,a conflict relating to a dog(s) a conflict bet...,Yes - more signage Yes - more enforcement (tic...,No changes are necessary to the existing bylaw,ride a mountain downhill or BMX bicycle,The bylaw should be modified to allow e-bikes ...,increase signage increase enforcement increase...,Update the definition of a personal mobility d...
9,2013-04-10 13:33,18 to 34,Male,Good,College or trade school diploma,$60K to $79K,Full Time,Central Toronto York or East York,20 -35 km,15 minutes or less,...,scooter style e-bikes are different than pedal...,walk,No,I am not aware of any conflicts on the trails,Yes - more signage Yes - more enforcement (tic...,No changes are necessary to the existing bylaw,ride a commuter or cruiser style bicycle,No changes are necessary to the existing bylaw,increase signage increase enforcement increase...,Update the definition of a personal mobility d...


**Note:** Looking at the above dataframe, we can see that there are 22 different columns/features of data, each one corresponding to a different survey question. Since Pandas truncates the dataframe to fit the given view, let's look at the complete list of columns below.



In [192]:
responses.columns

Index([u'Timestamp', u'What age range do you fall in?', u'Sex',
       u'How would you describe your level of physical health?',
       u'What level of education have you reached?',
       u'What is your household income?',
       u'Which category best describes your employment?',
       u'What Toronto district is your primary address located in?',
       u'On average what distance do you travel most days of the week?',
       u'On average how long is your commute?',
       u'Which transportation option do you end up using most often?',
       u'Does your household have access to any of the following private motorized vehicles?',
       u'Do you support any of the following statements?',
       u'When you use Toronto's Multi-Use Trails do you mostly',
       u'Are you aware that the Multi-Use Paths have a speed limit of 20 km/h?',
       u'Have you witnessed a collision or conflict on a trail between',
       u'Do you think more should be done to manage trail users who do not respect t

The target/output variable ("Does your household have access to any of the following private motorized vehicles?") is one of the given columns (Column # 10 to be precise). The remaining variables will be the *feature variables* used to train the model. 
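As a minimal sketch of that split, the snippet below separates a target column from the feature columns. The `toy` frame and the `TARGET`, `X` and `y` names are illustrative assumptions; only the column name itself comes from the survey data.

```python
import pandas as pd

# Column name as it appears in responses.columns
TARGET = "Does your household have access to any of the following private motorized vehicles?"

# Toy stand-in for the survey data (hypothetical rows, real column names)
toy = pd.DataFrame({
    "Sex": ["Male", "Female", "Male"],
    "What age range do you fall in?": ["18 to 34", "35 to 49", "50 to 64"],
    TARGET: [
        "Yes - a car SUV truck or van",
        "No - I do not have access to a private motorized vehicle",
        "Yes a motorcycle",
    ],
})

y = toy[TARGET].copy()          # target variable (Y)
X = toy.drop(columns=[TARGET])  # every other column is a candidate feature (X)

print(X.columns.tolist())
print(toy.columns.get_loc(TARGET))  # zero-based position of the target column
```

Working on a `.copy()` of the target keeps later in-place edits from touching the original dataframe, which is a design choice rather than something the notebook itself does.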

Noticing that almost all the questions are *categorical* in nature (meaning they take on a value from a fixed set of possibilities), let's take a look at a summary of each column: the number of responses, the number of unique values, and the most frequent value:

In [193]:
responses.describe()

Unnamed: 0,Timestamp,What age range do you fall in?,Sex,How would you describe your level of physical health?,What level of education have you reached?,What is your household income?,Which category best describes your employment?,What Toronto district is your primary address located in?,On average what distance do you travel most days of the week?,On average how long is your commute?,...,Do you support any of the following statements?,When you use Toronto's Multi-Use Trails do you mostly,Are you aware that the Multi-Use Paths have a speed limit of 20 km/h?,Have you witnessed a collision or conflict on a trail between,Do you think more should be done to manage trail users who do not respect the 20 km/h speed limit?,Currently any kind of e-bike may use a multi-use path if they are propelled by pedaling only and those propelled by motor power may be fined,When you use Toronto's bicycle lanes do you mostly,Currently any kind of e-bike may use a bicycle lane if they are propelled by pedaling only and those propelled by motor power may be fined,With regards to illegal use of bicycles and e-bikes on sidewalks should the City,Toronto Bylaws consider personal mobility devices (such as electric wheel chairs) to be pedestrians In your opinion should the City
count,2238,2234,2221,2227,2220,2178,2223,2237,2237,2237,...,2238,2238,2238,2238,2237,2238,2238,2238,2238,2232
unique,1722,5,12,12,24,6,33,68,5,6,...,534,89,2,221,134,25,69,258,224,81
top,2013-04-12 11:19,35 to 49,Male,Very good,University degree,$100K+,Full Time,Central Toronto York or East York,5 - 10 km,16 - 29 minutes,...,scooter style e-bikes are different than pedal...,cycle,No,I am not aware of any conflicts on the trails,No - the trails are fine as they are,No changes are necessary to the existing bylaw,ride a commuter or cruiser style bicycle,No changes are necessary to the existing bylaw,maintain the existing programs for signage edu...,Update the definition of a personal mobility d...
freq,7,863,1554,891,895,831,1405,1634,847,782,...,84,738,1131,847,646,811,705,664,493,1379


There are a few things we can immediately observe:

**1) Missing Values**

The total number of values/responses should be 2238, but many columns have a total # of responses less than this. This will need to be dealt with during cleaning.

**2) Unique Values**

Notice that the number of distinct responses varies widely between questions. Although the survey probably lists only a few options per question, respondents have likely written in answers outside of those options. For example, a few questions have a small, contained set of answers (Age Range has 5 unique values, likely the different age ranges), whereas questions like Education (24 unique values) or Support of Statements (534 unique values) have far more. This will also need to be dealt with during cleaning.
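Both observations can be read off directly, without scanning the `describe()` table by eye. The sketch below uses a small made-up frame (the rows are hypothetical; only the column names come from the survey) and builds a per-column summary of missing and unique values.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; column names taken from the survey
toy = pd.DataFrame({
    "Sex": ["Male", "Female", np.nan, "Male", "trans"],
    "What age range do you fall in?": ["18 to 34", "35 to 49", "35 to 49", np.nan, np.nan],
})

summary = pd.DataFrame({
    "missing": toy.isnull().sum(),  # number of NaN entries per column
    "unique": toy.nunique(),        # distinct non-missing values per column
})
print(summary)
```

On the real data, the same two calls on `responses` would list every column at once, including the ones that the truncated view hides.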

Having identified these issues and taken a first look at the data, we can now start the data cleaning process.

## Data Cleaning

As in every Data Science process, the most important and time-consuming step is cleaning the data and getting it ready for modelling. Since there are 22 features in total (21 for X, 1 for Y), this will take some time, but it is vital.


### Target Variable

Let's start with the target variable, which is the response to the question *"Does your household have access to any of the following private motorized vehicles?"*

Looking at the range of input responses:

In [194]:
responses["Does your household have access to any of the following private motorized vehicles?"].value_counts()

Yes - a car SUV truck or van                                                                                                                 1146
No - I do not have access to a private motorized vehicle                                                                                      522
Yes a motorcycle                                                                                                                              162
Yes - a pedal assist type e-bike                                                                                                               99
Yes - a scooter style e-bike                                                                                                                   85
Yes - a car SUV truck or van Yes a motorcycle                                                                                                  25
Yes - a car SUV truck or van Yes - a scooter style e-bike                                                                   

We can see that although a majority of responses fall under a small set of values (either No, pedal assist type e-bike, scooter style e-bike, car SUV truck or van, Autoshare or limited speed motorcycle), there are a large number of values (approx. 140 values) that do **not** fit the given standard.

These need to be filtered intelligently, using some heuristics. Before we start, it's important to remember that we want to predict how likely the responder is to say **No**, so a negative response will be assigned a class of 1, and all positive responses (regardless of type of vehicle) will be assigned a class of 0.

Regarding filtering heuristics, let's start with the basics:

1) If the response is "No - I do not have access to a private motorized vehicle", we assign it a value of 1

2) One hypothesis I have is that if the response contains the word "Yes", it implies that they have access to some type of private motorized vehicle, and the resulting value is a 0. To test this hypothesis, let's look at the above *value_counts()* call, and all the values that have the word "Yes" in them:

In [195]:
Y = responses["Does your household have access to any of the following private motorized vehicles?"]
Y[Y.str.contains("Yes")]

0                            Yes - a scooter style e-bike
1                            Yes - a scooter style e-bike
2                            Yes - a car SUV truck or van
3                            Yes - a car SUV truck or van
4                            Yes - a car SUV truck or van
6                                        Yes a motorcycle
7                            Yes - a car SUV truck or van
8                            Yes - a car SUV truck or van
9                            Yes - a car SUV truck or van
11                           Yes - a car SUV truck or van
13                           Yes - a car SUV truck or van
15                           Yes - a car SUV truck or van
16                           Yes - a car SUV truck or van
18                           Yes - a car SUV truck or van
19                           Yes - a car SUV truck or van
20                           Yes - a car SUV truck or van
21                           Yes - a car SUV truck or van
22            

Given the above list and the original *value_counts()*, it's fair to say that the hypothesis holds. Note that this rule assigns a "Yes" to responses that contain both options (Yes - a car SUV truck or van No - I do not have access to a private motorized vehicle). This kind of response only occurs twice (as seen in *value_counts()*), so setting it to a Yes should be okay.
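One way to check how often a response contains both a "Yes" and a "No" option is to combine two boolean masks. This is only a sketch on a few hand-picked strings; the `answers` series and mask names are assumptions made for illustration.

```python
import pandas as pd

# A few answer strings in the style of the survey responses
answers = pd.Series([
    "Yes - a car SUV truck or van",
    "No - I do not have access to a private motorized vehicle",
    "Yes - a car SUV truck or van No - I do not have access to a private motorized vehicle",
    "Autoshare",
])

# regex=False treats the pattern as a plain substring; na=False maps missing values to False
has_yes = answers.str.contains("Yes", regex=False, na=False)
has_no = answers.str.contains("No -", regex=False, na=False)

conflicting = answers[has_yes & has_no]
print(len(conflicting))
```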

After applying this rule, there are still quite a few values left over (probably to do with car rentals and people who didn't know how to deal with multiple options). Let's take a look:

In [196]:
# Rule 1: If they said no -> Set value to 1
Y[Y == "No - I do not have access to a private motorized vehicle"] = 1
Y[Y == "No - I do not own a private motorized vehicle"] = 1

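# Rule 2: If their response contains the word Yes -> Set value to 0
# (na=False: entries already set to 1 are no longer strings and must not match)
Y[Y.str.contains("Yes", na=False)] = 0

# Check out remaining data (anything not yet encoded as 0 or 1)
Y[(Y != 0) & (Y != 1)]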

40      also motorcycle - not allowing more than one c...
44                                   No - only autoshare 
45                   I can borrow a car from time to time
72                                              Autoshare
132     The author of this survey is a fuk'g idiot how...
143                                             Car share
156                               Car van motorcycle bike
168     will only let you choose one so i will choose ...
306                                             Autoshare
316                                         Car and Vespa
341                                    AutoShare vehicles
372                      car and limited speed motorcycle
393                                             autoshare
422     Please note that this form does not allow the ...
461                                                zipcar
499                                           Car-sharing
503                     can't choose ore than on the form
518           

Having gone through the dataset and cleaned a bulk of the data, it's clear that from what's remaining, there are a few standouts:

1. Car Rental (Autoshare, Car2Go, Car Sharing, Rent/Borrow)
2. People who have listed multiple vehicles

For the purpose of this study, I'm assuming that a car rental/sharing service does **not count as the household having access to a motorized vehicle**. Thus, it will be mapped to a value of 1 (representing a "No"). Detecting this as a response also needs to be done intelligently, since spelling and capitalization variants of the terms are present throughout the dataset (e.g. Autoshare vs. autoshare, Car2Go vs. Cars2Go etc.)

For those that have listed multiple vehicles, this should be mapped to a value of 0 (representing a "Yes" to the question). Clearly from the numerous mentions regarding not being able to submit multiple options, this is a point of improvement for the form that will be discussed in the analysis section.
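One possible way to catch the spelling and capitalization variants in a single pass is a case-insensitive regular expression. The sketch below is an assumption-laden illustration: `SHARING_PATTERN` is a hypothetical pattern, not the one used in this notebook, and the sample answers are hand-picked.

```python
import pandas as pd

# Sample answers, plus an already-encoded value (1) and a missing value
answers = pd.Series([
    "Autoshare", "AutoShare vehicles", "zip car", "Cars2Go",
    "Car-sharing", "car and motorcycle", 1, None,
])

# Hypothetical pattern: \s? allows "zip car"/"zipcar", s? allows "Car2Go"/"Cars2Go"
SHARING_PATTERN = r"autoshare|zip\s?car|cars?2go|rent|shar|borrow"

# case=False ignores capitalization; na=False turns non-string entries into False
is_sharing = answers.str.contains(SHARING_PATTERN, case=False, na=False, regex=True)
print(is_sharing.tolist())
```

Short fragments such as "rent" also match unrelated words (for example "different"), so the matched rows are worth inspecting before overwriting them.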

Let's try to clean this remaining data:

In [197]:
Y.value_counts()

0                                                                                                                                            1605
1                                                                                                                                             523
Autoshare                                                                                                                                       9
car share                                                                                                                                       4
car and motorcycle                                                                                                                              3
SUV and motorcycle                                                                                                                              2
car rental                                                                                                                  

In [198]:
# Entries already encoded as 0/1 are not strings, so .str.contains yields NaN for them;
# na=False keeps those rows out of every mask.

# Autoshare
Y[Y.str.contains("Autoshare|AutoShare|autoshare|AUTOSHARE", na=False)] = 1

# Zipcar
Y[Y.str.contains("Zipcar|zipcar|zip car|ZIPCAR", na=False)] = 1

# Car2Go
Y[Y.str.contains("Car2Go|car2go|Cars2Go|CAR2GO|cars2go", na=False)] = 1

# Car rental
Y[Y.str.contains("rent", na=False)] = 1

# Car sharing
Y[Y.str.contains("shar", na=False)] = 1

# Car borrowing
Y[Y.str.contains("borrow", na=False)] = 1

Y.value_counts()

0                                                                                                                                            1605
1                                                                                                                                             568
car and motorcycle                                                                                                                              3
SUV and motorcycle                                                                                                                              2
car & motorcycle                                                                                                                                2
Car and motorcycle                                                                                                                              2
2 Cars 2 motorcycles and 1 scooter                                                                                          

To complete this cleaning, a fair assumption based on the above responses is that the remaining answers all correspond to the multiple vehicles category. Some responders left comments about the unavailability of multiple options, and for them, I'm assuming that they needed multiple options (and hence have access to a private motorized vehicle).

This assumption works well for the remaining values. The only possible edge case is the entry "Considering an e-bike", which would be labelled as having a vehicle when the responder probably does not. That single mislabelled row (a false negative for the "No" class) is statistically insignificant in a dataset of 2238 responses.

Completing the target variable cleaning:

In [208]:
# Assume remaining values correspond to multiple vehicles -> YES (class 0)
Y[(Y != 0) & (Y != 1)] = 0
Y.value_counts()

0    1670
1     568
Name: Does your household have access to any of the following private motorized vehicles?, dtype: int64
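The rules described in this section can also be collected into a single function, which makes the order of the checks explicit. The following is a sketch under stated assumptions rather than the code used above: `encode_access`, `NO_ANSWERS` and `SHARING` are hypothetical names, the sharing pattern is case-insensitive (the cells above match case-sensitively), and missing answers are left as `None`.

```python
import re
import pandas as pd

# Exact "No" answers handled by Rule 1
NO_ANSWERS = {
    "No - I do not have access to a private motorized vehicle",
    "No - I do not own a private motorized vehicle",
}

# Hypothetical pattern for rental / sharing / borrowing answers
SHARING = re.compile(r"autoshare|zip\s?car|cars?2go|rent|shar|borrow", re.IGNORECASE)

def encode_access(answer):
    """Map one raw answer to 1 (no private vehicle) or 0 (has one)."""
    if not isinstance(answer, str):
        return None                 # assumption: leave missing answers missing
    if answer in NO_ANSWERS:
        return 1                    # Rule 1: explicit "No"
    if "Yes" in answer:
        return 0                    # Rule 2: any "Yes" option selected
    if SHARING.search(answer):
        return 1                    # rental/sharing does not count as access
    return 0                        # remaining free text: assumed multiple vehicles

sample = pd.Series([
    "Yes a motorcycle",
    "No - I do not have access to a private motorized vehicle",
    "zipcar",
    "I can borrow a car from time to time",
    "Car and Vespa",
])
print(sample.map(encode_access).tolist())
```

Keeping the rules in one function makes it easier to test individual answers and to change a single assumption (such as how car sharing is treated) without re-running a chain of in-place assignments.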

### Feature Variables

Having prepared the target variable, we now need to go through each feature variable and clean it accordingly. Let's start with age range, and make our way through the different features.

#### Age Range

Looking at the different age ranges, and the given number of values:

In [211]:
ageRange = responses["What age range do you fall in?"]
ageRange.value_counts()

35 to 49            863
18 to 34            789
50 to 64            463
65 years or more    111
17 or younger         8
Name: What age range do you fall in?, dtype: int64

In [212]:
ageRange.describe()

count         2234
unique           5
top       35 to 49
freq           863
Name: What age range do you fall in?, dtype: object

From the above information, we can see that the 5 available age ranges are:

1. 17 or younger
2. 18 to 34
3. 35 to 49
4. 50 to 64
5. 65 years or more

Since there are only 5 unique values in the dataset, we know that there isn't an issue with erroneous / invalid age ranges. However, instead of 2238 values we have 2234. Thus, we're **missing 4 values**. These can be filled in simply by following the existing distribution of ages.

As seen above, a majority of survey responders fall within the 18 to 34 / 35 to 49 categories (it's almost an equal split, 863 - 789). Thus, in order to fill the missing values we can just follow this distribution and assign 2 values to the 18 to 34 bucket, and another 2 to the 35 to 49 bucket. This ensures that we aren't meddling with the inherent distribution in the data.

This can also be done in other ways (e.g. creating a pivot table and looking at other factors like Education & Employment to determine Age Range); however, since it's only 4 values within a dataset of 2238, this simple method will work well!

In [225]:
# Locate missing values
ageRange[ageRange.isnull()]

# Fill them in - 18 to 34 (row labels come from the isnull() lookup above)
ageRange.loc[[782, 845]] = "18 to 34"

# Fill them in - 35 to 49
ageRange.loc[[868, 1931]] = "35 to 49"

ageRange.describe()

count         2238
unique           5
top       35 to 49
freq           865
Name: What age range do you fall in?, dtype: object
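A more general variant of the fill above, which avoids hard-coding row labels, is to draw the missing entries at random from the observed distribution. This is a sketch on made-up data: the `ages` series, the `filled` name and the seed value are all assumptions, and random sampling is a swapped-in technique rather than the manual 2/2 split used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical age-range column with two missing entries
ages = pd.Series(["18 to 34", "35 to 49", np.nan, "35 to 49", "18 to 34", np.nan, "50 to 64"])

missing = ages.isnull()
dist = ages.value_counts(normalize=True)  # observed proportions (NaN excluded)

rng = np.random.default_rng(seed=0)       # arbitrary fixed seed for reproducibility
filled = ages.copy()
filled.loc[missing] = rng.choice(
    dist.index.to_numpy(),                # categories to draw from
    size=int(missing.sum()),              # one draw per missing entry
    p=dist.to_numpy(),                    # weighted by observed frequency
)

print(filled.isnull().sum())
```

With only a handful of missing values the two approaches should give practically the same result; the sampling version mainly helps when many more values are missing.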

#### Sex

Similarly, let's take a look at the distribution of values for the gender/sex category.

In [226]:
gender = responses["Sex"]
gender.value_counts()

Male                       1554
Female                      657
Genderqueer                   1
Irrelevant                    1
please                        1
Transgender                   1
trans                         1
Trans Female                  1
they                          1
unspecified                   1
prefer not to disclose        1
fifth                         1
Name: Sex, dtype: int64

In [227]:
gender.describe()

count     2221
unique      12
top       Male
freq      1554
Name: Sex, dtype: object