# Motorized Vehicles: Data Science Challenge

Based on the E-Bike Survey Response Results from TO Open Data, the goal is to train a model to predict whether the responder will answer "No - I do not have access to a private motorized vehicle" to the question "Does your household have access to any of the following types of private motorized vehicles?". 

Since the goal is to predict whether the output for this feature is a given category, this is a classification problem. The dataset is labelled (we have access to the survey responses to that question), and hence it is a supervised learning classification problem.

The following notebook will cover the entire Data Science process, starting from loading the data, cleaning it, doing exploratory data analysis, creating a classification model and finally, evaluating the model.

## Data Import & First Look

Let's start by importing any necessary modules we need for data manipulation.

In [572]:
# Import necessary modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [573]:
# Read in survey data
responses = pd.read_csv("E-Bike_Survey_Responses.csv")

In [574]:
# Take a look at available features + responses
responses.head(10)

Unnamed: 0,Timestamp,What age range do you fall in?,Sex,How would you describe your level of physical health?,What level of education have you reached?,What is your household income?,Which category best describes your employment?,What Toronto district is your primary address located in?,On average what distance do you travel most days of the week?,On average how long is your commute?,...,Do you support any of the following statements?,When you use Toronto's Multi-Use Trails do you mostly,Are you aware that the Multi-Use Paths have a speed limit of 20 km/h?,Have you witnessed a collision or conflict on a trail between,Do you think more should be done to manage trail users who do not respect the 20 km/h speed limit?,Currently any kind of e-bike may use a multi-use path if they are propelled by pedaling only and those propelled by motor power may be fined,When you use Toronto's bicycle lanes do you mostly,Currently any kind of e-bike may use a bicycle lane if they are propelled by pedaling only and those propelled by motor power may be fined,With regards to illegal use of bicycles and e-bikes on sidewalks should the City,Toronto Bylaws consider personal mobility devices (such as electric wheel chairs) to be pedestrians In your opinion should the City
0,2013-04-10 12:10,35 to 49,Male,Good,Post graduate,$100K+,Self Employed,Central Toronto York or East York,Under 2 km,15 minutes or less,...,On scooter type e-bikes the pedals are unneces...,drive a motor propelled e-bike,No,pedestrians and/or runners a conflict relating...,No - the trails are fine as they are,The bylaw should be modified to allow any kind...,drive a scooter type e-bike propelled by an el...,The bylaw should be modified to allow e-bikes ...,be tolerant of bikes and e-bikes on the walksi...,Institute a speed limit for sidewalks
1,2013-04-10 12:30,18 to 34,Male,Excellent,University degree,$40K to $59K,Full Time,Central Toronto York or East York,10 - 20 km,30 - 44 minutes,...,On scooter type e-bikes the pedals are unneces...,I very rarely use any of Toronto's Multi-Use P...,No,I am not aware of any conflicts on the trails,No - the trails are fine as they are,No changes are necessary to the existing bylaw,drive a scooter type e-bike propelled by an el...,The bylaw should be modified to allow e-bikes ...,maintain the existing programs for signage edu...,Do nothing
2,2013-04-10 12:33,50 to 64,Male,Good,University degree,$40K to $59K,Self Employed,Central Toronto York or East York,10 - 20 km,15 minutes or less,...,Most scooter type e-bikes are wider than a bic...,cycle I very rarely use any of Toronto's Multi...,No,I am not aware of any conflicts on the trails,Yes - more signage Yes - more enforcement (tic...,No changes are necessary to the existing bylaw,ride a road bicycle or a fixie,No changes are necessary to the existing bylaw,increase signage increase enforcement increase...,only wheelchairs at walking speed
3,2013-04-10 12:52,50 to 64,Male,Good,4 years university no degree,$80K to $99K,Self Employed,Central Toronto York or East York,Under 2 km,15 minutes or less,...,On scooter type e-bikes the pedals are unneces...,cycle,No,a conflict between cyclists and pedestrians a ...,Yes - more signage Yes - more enforcement (tic...,Motorized vehicles should generally not be all...,ride a commuter or cruiser style bicycle,Motorized vehicles should generally not be all...,increase signage increase enforcement,Update the definition of a personal mobility d...
4,2013-04-10 13:24,18 to 34,Male,Very good,College or trade school diploma,$40K to $59K,Self Employed,Central Toronto York or East York,5 - 10 km,15 minutes or less,...,Most scooter type e-bikes are wider than a bic...,cycle I very rarely use any of Toronto's Multi...,No,I am not aware of any conflicts on the trails,No - the trails are fine as they are,Motorized vehicles should generally not be all...,ride a road bicycle or a fixie,Motorized vehicles should generally not be all...,maintain the existing programs for signage edu...,Update the definition of a personal mobility d...
5,2013-04-10 13:26,35 to 49,Male,Very good,College or trade school diploma,$40K to $59K,Full Time,Central Toronto York or East York,10 - 20 km,16 - 29 minutes,...,scooter style e-bikes are different than pedal...,cycle,No,a conflict between e-biker and a cyclist,Yes - more educational programs,Motorized vehicles should generally not be all...,ride a road bicycle or a fixie,Motorized vehicles should generally not be all...,increase education,Update the definition of a personal mobility d...
6,2013-04-10 13:27,18 to 34,Male,Good,College or trade school diploma,$100K+,Full Time,Central Toronto York or East York,5 - 10 km,15 minutes or less,...,scooter style e-bikes are different than pedal...,cycle,No,a conflict relating to a dog(s) a conflict bet...,Yes - more signage Yes - more educational prog...,The bylaw should be modified to allow any kind...,ride a commuter or cruiser style bicycle,An e-bike should never be in a bike lane,increase signage increase enforcement increase...,Update the definition of a personal mobility d...
7,2013-04-10 13:28,35 to 49,Male,Very good,Post graduate,$100K+,Full Time,Central Toronto York or East York,Under 2 km,15 minutes or less,...,scooter style e-bikes are different than pedal...,run,No,pedestrians and/or runners a conflict relating...,Yes - more signage,Motorized vehicles should generally not be all...,ride a mountain downhill or BMX bicycle,Motorized vehicles should generally not be all...,maintain the existing programs for signage edu...,Update the definition of a personal mobility d...
8,2013-04-10 13:31,35 to 49,Female,Fairly good,Post graduate,$80K to $99K,Unemployed,Central Toronto York or East York,5 - 10 km,15 minutes or less,...,scooter style e-bikes are different than pedal...,walk,No,a conflict relating to a dog(s) a conflict bet...,Yes - more signage Yes - more enforcement (tic...,No changes are necessary to the existing bylaw,ride a mountain downhill or BMX bicycle,The bylaw should be modified to allow e-bikes ...,increase signage increase enforcement increase...,Update the definition of a personal mobility d...
9,2013-04-10 13:33,18 to 34,Male,Good,College or trade school diploma,$60K to $79K,Full Time,Central Toronto York or East York,20 -35 km,15 minutes or less,...,scooter style e-bikes are different than pedal...,walk,No,I am not aware of any conflicts on the trails,Yes - more signage Yes - more enforcement (tic...,No changes are necessary to the existing bylaw,ride a commuter or cruiser style bicycle,No changes are necessary to the existing bylaw,increase signage increase enforcement increase...,Update the definition of a personal mobility d...


**Note:** Looking at the above dataframe, we can see that there are 22 different columns/features of data, each one corresponding to a different survey question. Since Pandas shortens the df to fit the given view, let's look at the complete list of columns below.



In [575]:
responses.columns

Index([u'Timestamp', u'What age range do you fall in?', u'Sex',
       u'How would you describe your level of physical health?',
       u'What level of education have you reached?',
       u'What is your household income?',
       u'Which category best describes your employment?',
       u'What Toronto district is your primary address located in?',
       u'On average what distance do you travel most days of the week?',
       u'On average how long is your commute?',
       u'Which transportation option do you end up using most often?',
       u'Does your household have access to any of the following private motorized vehicles?',
       u'Do you support any of the following statements?',
       u'When you use Toronto's Multi-Use Trails do you mostly',
       u'Are you aware that the Multi-Use Paths have a speed limit of 20 km/h?',
       u'Have you witnessed a collision or conflict on a trail between',
       u'Do you think more should be done to manage trail users who do not respect t

The target/output variable ("Does your household have access to any of the following private motorized vehicles?") is one of the given columns (the 12th column listed above, i.e. position 11 when counting from zero). The remaining variables will be the *feature variables* used to train the model.
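Rather than counting columns by hand, the position of a column can be looked up programmatically. The snippet below is a minimal sketch on a toy frame; the column names in `toy` are placeholders, not the survey's actual questions.

```python
import pandas as pd

# Toy frame standing in for the survey data; only the column order matters here.
toy = pd.DataFrame(columns=["Timestamp", "Sex", "Has vehicle?", "Commute"])

# Index.get_loc returns the zero-based position of a (unique) column label.
target_position = toy.columns.get_loc("Has vehicle?")
print(target_position)
```

On the real data, the same call with the full question text as the label would give the zero-based position of the target column.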

Noticing that almost all the questions are *categorical* in nature (meaning they take on a value from a fixed set of possibilities), let's look at a summary of each column, including how many distinct values it takes:

In [576]:
responses.describe()

Unnamed: 0,Timestamp,What age range do you fall in?,Sex,How would you describe your level of physical health?,What level of education have you reached?,What is your household income?,Which category best describes your employment?,What Toronto district is your primary address located in?,On average what distance do you travel most days of the week?,On average how long is your commute?,...,Do you support any of the following statements?,When you use Toronto's Multi-Use Trails do you mostly,Are you aware that the Multi-Use Paths have a speed limit of 20 km/h?,Have you witnessed a collision or conflict on a trail between,Do you think more should be done to manage trail users who do not respect the 20 km/h speed limit?,Currently any kind of e-bike may use a multi-use path if they are propelled by pedaling only and those propelled by motor power may be fined,When you use Toronto's bicycle lanes do you mostly,Currently any kind of e-bike may use a bicycle lane if they are propelled by pedaling only and those propelled by motor power may be fined,With regards to illegal use of bicycles and e-bikes on sidewalks should the City,Toronto Bylaws consider personal mobility devices (such as electric wheel chairs) to be pedestrians In your opinion should the City
count,2238,2234,2221,2227,2220,2178,2223,2237,2237,2237,...,2238,2238,2238,2238,2237,2238,2238,2238,2238,2232
unique,1722,5,12,12,24,6,33,68,5,6,...,534,89,2,221,134,25,69,258,224,81
top,2013-04-12 11:19,35 to 49,Male,Very good,University degree,$100K+,Full Time,Central Toronto York or East York,5 - 10 km,16 - 29 minutes,...,scooter style e-bikes are different than pedal...,cycle,No,I am not aware of any conflicts on the trails,No - the trails are fine as they are,No changes are necessary to the existing bylaw,ride a commuter or cruiser style bicycle,No changes are necessary to the existing bylaw,maintain the existing programs for signage edu...,Update the definition of a personal mobility d...
freq,7,863,1554,891,895,831,1405,1634,847,782,...,84,738,1131,847,646,811,705,664,493,1379


There are a few things we can immediately observe:

**1) Missing Values**

The total number of values/responses should be 2238, but many columns have a total # of responses less than this. This will need to be dealt with during cleaning.

**2) Unique Values**

Notice that each question has a different number of distinct responses. The survey probably lists only a few options per question, but respondents have often answered outside of those options. For example, a few questions have a small, contained set of answers (Age Range has 5 distinct values, likely the different age ranges), whereas questions like Education or Support of Statements show far more variety in their input. This will also need to be dealt with during cleaning.
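Both observations can be read off directly, without scanning the `describe()` table by eye. The following is a small sketch on a made-up frame (the `toy` data and its column names are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for two survey columns.
toy = pd.DataFrame({
    "age_range": ["18 to 34", "35 to 49", np.nan, "18 to 34"],
    "sex": ["Male", "Female", "Male", "they"],
})

# One row per column: how many answers are missing, how many distinct answers exist.
summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "unique": toy.nunique(),  # nunique ignores missing values by default
})
print(summary)
```

Columns with a non-zero `missing` count or an unexpectedly large `unique` count are the ones that need attention during cleaning.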

Having identified these issues and taken a first look at the data, we can now start the data cleaning process.

## Data Cleaning

As in every data science process, the most important and time-consuming step is cleaning the data and getting it ready for modelling. Since there are 22 columns in total (21 features for X, 1 target for Y), this will take some time, but it is vital.


### Target Variable

Let's start with the target variable, which is the set of responses to the question *"Does your household have access to any of the following private motorized vehicles?"*

Looking at the range of input responses:

In [577]:
responses["Does your household have access to any of the following private motorized vehicles?"].value_counts()

Yes - a car SUV truck or van                                                                                                                 1146
No - I do not have access to a private motorized vehicle                                                                                      522
Yes a motorcycle                                                                                                                              162
Yes - a pedal assist type e-bike                                                                                                               99
Yes - a scooter style e-bike                                                                                                                   85
Yes - a car SUV truck or van Yes a motorcycle                                                                                                  25
Yes - a car SUV truck or van Yes - a scooter style e-bike                                                                   

We can see that although a majority of responses fall under a small set of values (either No, pedal assist type e-bike, scooter style e-bike, car SUV truck or van, Autoshare or limited speed motorcycle), there are a large number of values (approx. 140 values) that do **not** fit the given standard.

These need to be dealt with/filtered intelligently, using some heuristics. Before we start, it's important to remember that we want to predict how likely the responder is to say **No**, so a negative response will be assigned a class of 1, and all positive responses (regardless of type of vehicle) will be assigned a class of 0.

Regarding filtering heuristics, let's start with the basics:

1) If the response is "No - I do not have access to a private motorized vehicle", we assign it a value of 1

2) One hypothesis I have is that if the response contains the word "Yes", it implies that they have access to some type of private motorized vehicle, and the resulting value is a 0. To test this hypothesis, let's look at the above *value_counts()* call, and all the values that have the word "Yes" in them:

In [578]:
Y = responses["Does your household have access to any of the following private motorized vehicles?"]
Y[Y.str.contains("Yes")]

0                            Yes - a scooter style e-bike
1                            Yes - a scooter style e-bike
2                            Yes - a car SUV truck or van
3                            Yes - a car SUV truck or van
4                            Yes - a car SUV truck or van
6                                        Yes a motorcycle
7                            Yes - a car SUV truck or van
8                            Yes - a car SUV truck or van
9                            Yes - a car SUV truck or van
11                           Yes - a car SUV truck or van
13                           Yes - a car SUV truck or van
15                           Yes - a car SUV truck or van
16                           Yes - a car SUV truck or van
18                           Yes - a car SUV truck or van
19                           Yes - a car SUV truck or van
20                           Yes - a car SUV truck or van
21                           Yes - a car SUV truck or van
22            

Given the above list and the original *value_counts()*, it's fair to say that the hypothesis holds. Note the underlying assumption: a response that selects both options (Yes - a car SUV truck or van No - I do not have access to a private motorized vehicle) is treated as a "Yes". This response occurs only twice (as seen in *value_counts()*), so setting it to a Yes should be okay.
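The size of this edge case can be checked by counting answers that contain both a "Yes" and the "No" option. This is a minimal sketch on invented answers, so the count it prints says nothing about the real survey:

```python
import pandas as pd

# Invented answers illustrating the three situations of interest.
answers = pd.Series([
    "Yes - a car SUV truck or van",
    "No - I do not have access to a private motorized vehicle",
    "Yes - a car SUV truck or van No - I do not have access to a private motorized vehicle",
    "Autoshare",
])

# regex=False treats the search text literally; na=False maps missing answers to False.
has_yes = answers.str.contains("Yes", regex=False, na=False)
has_no = answers.str.contains("No - I do not", regex=False, na=False)

conflicting = answers[has_yes & has_no]
print(len(conflicting))
```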

After applying this rule, there's still quite a few values that are left out (probably to do with car rentals + people who didn't know how to deal with multiple options). Let's take a look:

In [579]:
# Rule 1: If they said no -> Set value to 1
Y[Y == "No - I do not have access to a private motorized vehicle"] = 1
Y[Y == "No - I do not own a private motorized vehicle"] = 1

# Rule 2: If their response contains the word Yes -> Set value to 0
Y[Y.str.contains("Yes").fillna(False)] = 0

# Check out remaining data
Y[(Y != 0) & (Y != 1)]

40      also motorcycle - not allowing more than one c...
44                                   No - only autoshare 
45                   I can borrow a car from time to time
72                                              Autoshare
132     The author of this survey is a fuk'g idiot how...
143                                             Car share
156                               Car van motorcycle bike
168     will only let you choose one so i will choose ...
306                                             Autoshare
316                                         Car and Vespa
341                                    AutoShare vehicles
372                      car and limited speed motorcycle
393                                             autoshare
422     Please note that this form does not allow the ...
461                                                zipcar
499                                           Car-sharing
503                     can't choose ore than on the form
518           

Having gone through the dataset and cleaned a bulk of the data, it's clear that from what's remaining, there are a few standouts:

1. Car Rental (Autoshare, Car2Go, Car Sharing, Rent/Borrow)
2. People who have listed multiple vehicles

For the purpose of this study, I'm assuming that a car rental/sharing service does **not count as the household having access to a motorized vehicle**. Thus, it will be mapped to a value of 1 (representing a "No"). Detecting this as a response also needs to be done intelligently, since there are string variants of the terms present throughout the dataset (e.g. Autoshare vs. autoshare, Car2Go vs. Cars2Go etc.).

For those that have listed multiple vehicles, this should be mapped to a value of 0 (representing a "Yes" to the question). Clearly from the numerous mentions regarding not being able to submit multiple options, this is a point of improvement for the form that will be discussed in the analysis section.

Let's clean this remaining data:

In [580]:
Y.value_counts()

0                                                                                                                                            1605
1                                                                                                                                             523
Autoshare                                                                                                                                       9
car share                                                                                                                                       4
car and motorcycle                                                                                                                              3
SUV and motorcycle                                                                                                                              2
car rental                                                                                                                  

In [581]:
# Autoshare (na=False skips rows that are already encoded as 0/1)
Y[Y.str.contains("Autoshare|AutoShare|autoshare|AUTOSHARE", na=False)] = 1

# Zipcar
Y[Y.str.contains("Zipcar|zipcar|zip car|ZIPCAR", na=False)] = 1

# Car2Go
Y[Y.str.contains("Car2Go|car2go|Cars2Go|CAR2GO|cars2go", na=False)] = 1

# Car rental
Y[Y.str.contains("rent").fillna(False)] = 1

# Car sharing
Y[Y.str.contains("shar").fillna(False)] = 1

# Car borrowing
Y[Y.str.contains("borrow").fillna(False)] = 1

Y.value_counts()

0                                                                                                                                            1605
1                                                                                                                                             568
car and motorcycle                                                                                                                              3
SUV and motorcycle                                                                                                                              2
car & motorcycle                                                                                                                                2
Car and motorcycle                                                                                                                              2
2 Cars 2 motorcycles and 1 scooter                                                                                          
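The case variants enumerated in the cell above could also be matched in a single pass with a case-insensitive pattern. This is a sketch on a toy, partially encoded series; the `pattern` below is an assumption and is slightly broader than the exact spellings listed in the notebook.

```python
import pandas as pd

# Toy, partially encoded target: the integers stand for rows that are already labelled.
partial = pd.Series(
    [0, 1, "Autoshare", "AUTOSHARE", "zip car", "Cars2Go", "car and motorcycle"],
    dtype=object,
)

# case=False ignores capitalisation; na=False treats non-string rows as "no match".
pattern = r"auto\s?share|zip\s?car|cars?2go"
is_sharing = partial.str.contains(pattern, case=False, na=False)

partial[is_sharing] = 1
print(partial.tolist())
```

The trade-off is that a broader pattern may match answers the explicit spellings would not, so it is worth inspecting the matched rows before overwriting them.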

To complete this cleaning, based on the above responses a fair assumption is that the remaining answers all correspond to the multiple vehicles category. There are responders who have left comments regarding the unavailability of multiple options, and for them, I'm assuming that they require multiple options (and hence, have access to a private motorized vehicle).

This assumption works well for the remaining values. The only possible edge case is the entry "Considering an e-bike", which can be written off as a false negative (statistically, the error is insignificant, since it is a single entry among 2238 responses).

Completing the target variable cleaning:

In [582]:
# Assume remaining values correspond to multiple vehicles -> YES (class 0)
Y[(Y != 0) & (Y != 1)] = 0
Y.value_counts()

0    1670
1     568
Name: Does your household have access to any of the following private motorized vehicles?, dtype: int64
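The order in which the rules are applied matters, so it can help to see them as a single function. The sketch below restates the rules described in this section; `encode_vehicle_access` is a hypothetical helper (not part of the notebook), it lower-cases answers before keyword matching, and it assumes every answer is a string.

```python
NO_ANSWERS = {
    "No - I do not have access to a private motorized vehicle",
    "No - I do not own a private motorized vehicle",
}
# Keywords for rental, sharing and borrowing, compared against the lower-cased answer.
SHARING_KEYWORDS = ("autoshare", "zipcar", "zip car", "car2go", "cars2go",
                    "rent", "shar", "borrow")


def encode_vehicle_access(answer):
    """Return 1 for 'no private vehicle' and 0 otherwise (hypothetical helper)."""
    if answer in NO_ANSWERS:            # Rule 1: an explicit "No"
        return 1
    if "Yes" in answer:                 # Rule 2: any "Yes" wins, even next to a "No"
        return 0
    lowered = answer.lower()
    if any(keyword in lowered for keyword in SHARING_KEYWORDS):
        return 1                        # Rule 3: rental/sharing is not household access
    return 0                            # Rule 4: everything else is read as multiple vehicles


examples = [
    "No - I do not have access to a private motorized vehicle",
    "Yes a motorcycle",
    "Yes - a car SUV truck or van No - I do not have access to a private motorized vehicle",
    "No - only autoshare ",
    "Car and Vespa",
    "I can borrow a car from time to time",
]
encoded = [encode_vehicle_access(answer) for answer in examples]
print(encoded)
```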

### Feature Variables

Having prepared the target variable, we now need to go through each feature variable and clean it accordingly. Let's start with age range, and make our way through the different features.

#### Age Range

Looking at the different age ranges, and the given number of values:

In [583]:
ageRange = responses["What age range do you fall in?"]
ageRange.value_counts()

35 to 49            863
18 to 34            789
50 to 64            463
65 years or more    111
17 or younger         8
Name: What age range do you fall in?, dtype: int64

In [584]:
ageRange.describe()

count         2234
unique           5
top       35 to 49
freq           863
Name: What age range do you fall in?, dtype: object

From the above information, we can see that the 5 available age ranges are:

1. 17 or younger
2. 18 to 34
3. 35 to 49
4. 50 to 64
5. 65 years or more

Since there are only 5 unique values in the dataset, we know that there isn't an issue with erroneous / invalid age ranges. However, instead of 2238 values we have 2234. Thus, we're **missing 4 values**. These can be filled in simply by following the existing distribution in ages.

As seen above, a majority of survey responders fall within the 18 to 34 and 35 to 49 categories (it's almost an equal split, 789 vs. 863). Thus, in order to fill the missing values we can just follow this distribution and assign 2 values to the 18 to 34 bucket, and another 2 to the 35 to 49 bucket. This ensures that we aren't meddling with the inherent distribution in the data.

This can also be done in other ways (e.g. creating a pivot table and looking at other factors like Education & Employment to determine Age Range), however since it's only 4 values within a dataset of 2238, this simple method will work well!

In [585]:
# Locate missing values
ageRange[ageRange.isnull()]

# Fill them in - 18 to 34
ageRange[782] = "18 to 34"
ageRange[845] = "18 to 34"

# Fill them in - 35 to 49
ageRange[868] = "35 to 49"
ageRange[1931] = "35 to 49"

ageRange.describe()

count         2238
unique           5
top       35 to 49
freq           865
Name: What age range do you fall in?, dtype: object
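An alternative to hard-coding row labels is to draw the missing entries at random, using the observed category frequencies as probabilities. This is a sketch of that idea and not the method used above: `fill_from_distribution` is a hypothetical helper, the seed is arbitrary, and random draws will not reproduce the exact 2/2 split chosen by hand.

```python
import numpy as np
import pandas as pd


def fill_from_distribution(series, seed=0):
    """Fill missing entries by sampling from the observed category frequencies."""
    probabilities = series.value_counts(normalize=True)  # excludes missing values
    missing = series.isnull()
    rng = np.random.default_rng(seed)
    draws = rng.choice(
        probabilities.index.to_numpy(),
        size=int(missing.sum()),
        p=probabilities.to_numpy(),
    )
    filled = series.copy()
    filled[missing] = draws
    return filled


# Toy age column with two gaps.
toy_age = pd.Series(["18 to 34", "35 to 49", None, "35 to 49", None, "50 to 64"])
filled_age = fill_from_distribution(toy_age)
print(filled_age.isnull().sum())
```

With only a handful of gaps, the manual assignment and the sampled one should be practically equivalent; sampling mainly helps when the number of missing values is larger.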

#### Sex

Similarly, let's take a look at the distribution of values for the gender/sex category.

In [586]:
gender = responses["Sex"]
gender.value_counts()

Male                       1554
Female                      657
Genderqueer                   1
Irrelevant                    1
please                        1
Transgender                   1
trans                         1
Trans Female                  1
they                          1
unspecified                   1
prefer not to disclose        1
fifth                         1
Name: Sex, dtype: int64

In [587]:
gender.describe()

count     2221
unique      12
top       Male
freq      1554
Name: Sex, dtype: object

There are two issues here:

1. Missing values (2221 total values instead of 2238)
2. Unspecified/Other genders

Missing values can be dealt with in the same way as above, by following the distribution in the data. In this case, the survey is heavily male-dominated (1554 Males to 657 Females), so the 17 missing values can be filled in as "Male" without noticeably skewing the existing data.

There are 10 other unique values that responders have provided, ranging from "Transgender" to "Prefer not to disclose". All of these can be grouped under one "Other" category to simplify analysis.

In [588]:
# Fill in missing values
gender[gender.isnull()] = "Male"
gender.describe()

count     2238
unique      12
top       Male
freq      1571
Name: Sex, dtype: object

In [589]:
# Group all other genders (other than Male/Female) into their own "Other" category
gender[(gender != "Male") & (gender != "Female")] = "Other"
gender.value_counts()

Male      1571
Female     657
Other       10
Name: Sex, dtype: int64
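The same two steps (fill with the most common value, then collapse everything outside the two main categories) can be written without in-place boolean assignment. A minimal sketch on invented answers:

```python
import pandas as pd

# Invented answers, including one gap and two free-text responses.
toy_sex = pd.Series(["Male", "Female", None, "Genderqueer", "prefer not to disclose", "Male"])

# Step 1: fill gaps with the most frequent answer (the mode).
filled = toy_sex.fillna(toy_sex.mode()[0])

# Step 2: keep Male/Female as they are and replace everything else with "Other".
grouped = filled.where(filled.isin(["Male", "Female"]), "Other")
print(grouped.value_counts())
```

`Series.where` keeps values where the condition holds and substitutes the second argument elsewhere, which leaves the original series untouched.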

#### Physical Health

Let's observe the distribution of values for physical health:

In [590]:
physicalHealth = responses["How would you describe your level of physical health?"]
physicalHealth.value_counts()

Very good                                    891
Good                                         667
Excellent                                    411
Fairly good                                  210
Poor                                          41
healing fractured heel                         1
Disabled                                       1
need knee replacement                          1
In poor shape but active & improving           1
healthy but with arthritis mobilty issues      1
Back Injury                                    1
Obese diabetic but trying to be healthy        1
Name: How would you describe your level of physical health?, dtype: int64

In [591]:
physicalHealth.describe()

count          2227
unique           12
top       Very good
freq            891
Name: How would you describe your level of physical health?, dtype: object

Judging from the above categories, there seems to be the following hierarchy of buckets/values (this is the assumed ordering moving forward):

Disabled -> Poor -> Fairly good -> Good -> Very good -> Excellent

The disabled category is assumed to include responders that are either permanently or temporarily disabled (via an injury / illness / health impairment). 

With regard to issues with this feature, the same two problems persist:

1. Missing Values (2227 as opposed to 2238)
2. Unique Values (most of which fall under the disabled category -> easy to deal with)

Dealing with the missing values can be done in the same manner as before, by following the given distribution in the data (the number of missing values is small, only 11). As seen above, a majority of responders report themselves in "Very good" or "Good" shape. Approximately twice as many people reported being in "Excellent" shape as in "Fairly good" shape. Following this distribution, we can fill in 4 of the 11 values as "Very good", 3 as "Good", 2 as "Excellent", and one each as "Fairly good" and "Poor". The goal is to distribute values in a way that produces the least amount of skew in the existing distribution. Let's do this:

In [592]:
# Locate missing values
physicalHealth[physicalHealth.isnull()]

# Update missing values based on distribution
physicalHealth[215] = "Very good"
physicalHealth[357] = "Very good"
physicalHealth[555] = "Very good"
physicalHealth[782] = "Very good"

physicalHealth[868] = "Good"
physicalHealth[1079] = "Good"
physicalHealth[1131] = "Good"

physicalHealth[1656] = "Excellent"
physicalHealth[1693] = "Excellent"

physicalHealth[1931] = "Fairly good"

physicalHealth[2076] = "Poor"

physicalHealth.describe()

count          2238
unique           12
top       Very good
freq            895
Name: How would you describe your level of physical health?, dtype: object

In [593]:
physicalHealth.value_counts()

Very good                                    895
Good                                         670
Excellent                                    413
Fairly good                                  211
Poor                                          42
healing fractured heel                         1
Disabled                                       1
need knee replacement                          1
In poor shape but active & improving           1
healthy but with arthritis mobilty issues      1
Back Injury                                    1
Obese diabetic but trying to be healthy        1
Name: How would you describe your level of physical health?, dtype: int64

In [594]:
# Single edge case: belongs to Poor category
physicalHealth[physicalHealth == "In poor shape but active & improving"] = "Poor"

# Remaining entries can be filed as Disabled
valid_levels = ["Very good", "Fairly good", "Good", "Excellent", "Poor"]
physicalHealth[~physicalHealth.isin(valid_levels)] = "Disabled"
physicalHealth.value_counts()

Very good      895
Good           670
Excellent      413
Fairly good    211
Poor            43
Disabled         6
Name: How would you describe your level of physical health?, dtype: int64
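Since the six levels are assumed to form a hierarchy (Disabled -> Poor -> Fairly good -> Good -> Very good -> Excellent), one way to preserve that ordering for modelling is an ordered categorical. This is a sketch on toy values; whether an ordinal encoding is the right choice for the final model is a separate decision.

```python
import pandas as pd

# Assumed ordering, lowest to highest, as stated in this section.
HEALTH_ORDER = ["Disabled", "Poor", "Fairly good", "Good", "Very good", "Excellent"]

toy_health = pd.Series(["Good", "Excellent", "Poor", "Disabled", "Very good"])

# An ordered categorical remembers the ranking; .codes gives each value's position in it.
ordered = pd.Categorical(toy_health, categories=HEALTH_ORDER, ordered=True)
health_codes = pd.Series(ordered.codes, index=toy_health.index)
print(health_codes.tolist())
```

Any value not present in `HEALTH_ORDER` would receive the code -1, which makes leftover uncleaned answers easy to spot.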

#### Education


In [595]:
education = responses["What level of education have you reached?"]
education.value_counts()

University degree                                    895
Post graduate                                        591
College or trade school diploma                      521
High school diploma                                  192
Some university                                        2
Professional Certifications                            1
college degree (not university and not a diploma)      1
Grade 9                                                1
Graduate                                               1
still in school                                        1
in HS                                                  1
Professional degree                                    1
University Student                                     1
university                                             1
M D                                                    1
working towards OSSD                                   1
some university                                        1
some uni                       

In [596]:
education.describe()

count                  2220
unique                   24
top       University degree
freq                    895
Name: What level of education have you reached?, dtype: object

As we can see above, there is quite a bit of variability in the answers provided. There are also 18 missing values in the data, although these can be filled in fairly easily by following the existing distribution.

The primary issue with the Education category is dealing with the variability in responses. Some responses come from people who dropped out of high school, while others come from current students (either in high school or in university).

Here, an assumption is made that the question "What level of education have you reached?" implies/equals "What is the highest degree of education/schooling you have **attained/completed?**". The latter is typical wording for a question of this nature, and it allows you to group existing students into buckets as well (for example: a University student has attained a High School Diploma). The wording of the survey question is therefore another candidate for improvement.

Note that we're also assuming that professional/online certifications do **not** count as a formal "education" level/category.

Given this information, the set of categories that can be created are:

1. Postgraduate (implies a Masters, PhD, M.D., Law etc)
2. University degree
3. College or Trade School Diploma
4. High School Diploma (OSSD or equivalent)
5. No formal educational credential (implies that the responder is still a high school student, or had dropped out)

Let's see if we can get this:

In [597]:
# Filter existing students
education[(education.str.contains("student", False).fillna(False)) & (education.str.contains("University", False))] = "High school diploma "
education[(education.str.contains("student", False).fillna(False)) & (education.str.contains("High school", False))] = "No formal educational credential"

education[education.str.contains("in", False).fillna(False) & ((education.str.contains("HS", False)) | (education.str.contains("school", False)) | (education.str.contains("OSSD", False)))] = "No formal educational credential"
education.value_counts()

University degree                                    895
Post graduate                                        591
College or trade school diploma                      521
High school diploma                                  193
No formal educational credential                       4
Some university                                        2
Professional Certifications                            1
college degree (not university and not a diploma)      1
Grade 9                                                1
Graduate                                               1
Professional degree                                    1
university                                             1
M D                                                    1
Law School                                             1
some university                                        1
some uni                                               1
PhD                                                    1
4 years university no degree   

In [598]:
# Look at university graduates
education[((education.str.contains("university", False)) | (education.str.contains("uni", False))) & (education != "University degree")]

3                            4 years university no degree
630                                       Some university
764     college degree (not university and not a diploma)
880                                       Some University
1712                                      some university
1784                                           university
1787                                      Some university
2048                                             some uni
Name: What level of education have you reached?, dtype: object

From the above breakdown, we can see two edge cases:

1. 4 years university no degree (should be classified as a high school diploma)
2. College Degree (not university and not a diploma)

The second case is a tricky response, and which category it falls under is essentially a design decision. Since degrees are generally a higher level of education than diplomas (regardless of whether they are earned at a university or a college), I'm assuming that this should fall under the "University degree" category.

Let's deal with these edge cases. Ideally the survey question would be reworded to elicit cleaner responses, or the edge-case filtering would be done more intelligently, but for the sake of brevity we can handle these 2 cases individually.

In [599]:
# Deal with one-offs / edge cases
education[3] = "High school diploma "
education[764] = "University degree"
education[((education.str.contains("university", False)) | (education.str.contains("uni", False)) | (education.str.contains("graduate", False))) & (education != "University degree") & (education != "Post graduate")]

630     Some university
689            Graduate
880     Some University
1712    some university
1784         university
1787    Some university
2048           some uni
Name: What level of education have you reached?, dtype: object

In [600]:
# Assuming that responders that have put "some university/uni" or "graduate" are graduates from that/a university
education[((education.str.contains("university", False)) | (education.str.contains("uni", False)) | (education.str.contains("graduate", False))) & (education != "University degree") & (education != "Post graduate")] = "University degree"
education.value_counts()

University degree                   903
Post graduate                       591
College or trade school diploma     521
High school diploma                 194
No formal educational credential      4
Professional Certifications           1
Law School                            1
PhD                                   1
M D                                   1
Professional degree                   1
Grade 9                               1
Grade 8                               1
Name: What level of education have you reached?, dtype: int64

Having done the bulk of the filtering, the only unique values left correspond either to post graduate programs (e.g. PhD, M.D., LLB etc.) or to responders with no formal educational credential (e.g. professional certifications, high school dropouts etc.).

Note that ideally, doctorate programs should be their own category. It doesn't make sense to add that category now however, since it's impossible to tell how many responders that responded with "Post graduate" held doctorate degrees (and wrote that down as the best possible answer). This is another point of improvement for the survey.

In [601]:
# Classify doctorate/post graduate programs
education[(education.str.contains("PhD", False)) | (education.str.contains("M D", False)) | (education.str.contains("Law School", False))] = "Post graduate"
education.value_counts()

University degree                   903
Post graduate                       594
College or trade school diploma     521
High school diploma                 194
No formal educational credential      4
Professional Certifications           1
Professional degree                   1
Grade 9                               1
Grade 8                               1
Name: What level of education have you reached?, dtype: int64

In [602]:
# Classify remaining as "No formal educational credential"
education[~education.isin(["University degree", "Post graduate", "College or trade school diploma", "High school diploma ", "No formal educational credential"]) & education.notnull()] = "No formal educational credential"
education.value_counts()

University degree                   903
Post graduate                       594
College or trade school diploma     521
High school diploma                 194
No formal educational credential      8
Name: What level of education have you reached?, dtype: int64
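The step-by-step filtering above can also be sketched as a single ordered rule table, where the first matching pattern decides the category. This is only an illustration of the idea: the helper name `bucket_education`, the `RULES` list and the regular expressions are assumptions made for this sketch (they are not part of the notebook), and the category labels are written without the trailing space that appears in the raw data.

```python
import re

# Ordered (pattern, category) pairs; the first match wins.
# These patterns are illustrative assumptions that mirror the decisions made above.
RULES = [
    (r"university student", "High school diploma"),
    (r"\bin hs\b|still in school|ossd|grade \d+", "No formal educational credential"),
    (r"no degree", "High school diploma"),
    (r"post ?graduate|phd|\bm ?d\b|law school", "Post graduate"),
    (r"universit|\buni\b|graduate", "University degree"),
    (r"college|trade", "College or trade school diploma"),
    (r"high school", "High school diploma"),
]

def bucket_education(raw):
    """Map one free-text answer to one of the five categories (hypothetical helper)."""
    text = raw.strip().lower()
    for pattern, category in RULES:
        if re.search(pattern, text):
            return category
    # Anything unrecognised (e.g. professional certifications) has no formal credential
    return "No formal educational credential"

for answer in ["4 years university no degree", "some uni", "M D", "working towards OSSD"]:
    print(answer, "->", bucket_education(answer))
```

Keeping the rules in one ordered list makes the precedence explicit (for example, "no degree" is checked before the generic "university" pattern), which is easy to get wrong when the same logic is spread across several cells.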

Having dealt with the variability in input, we now need to fill in the missing values. Again, this can be done in numerous ways, such as looking at other categories (i.e. Income or Employment) and inferring the education from them. However, inference-based methods are complex (which is hard to justify for only 18 missing values) and also require the other categories to be clean data (which isn't true).

Hence, we'll follow the method of using the existing distribution of data to our advantage. Currently, the largest group of responders holds university degrees (903), roughly 1.5 to 1.7 times as many as hold post graduate degrees (594) or college diplomas (521), and over four times as many as hold a high school diploma (194). Following this distribution approximately, we get:

In [603]:
# Locate missing values
education[education.isnull()]

232     NaN
245     NaN
491     NaN
782     NaN
805     NaN
845     NaN
868     NaN
877     NaN
887     NaN
1058    NaN
1197    NaN
1340    NaN
1376    NaN
1656    NaN
1841    NaN
1931    NaN
1958    NaN
1991    NaN
Name: What level of education have you reached?, dtype: object

In [604]:
# Fill missing values
education.fillna("University degree", limit = 9, inplace=True)
education.fillna("Post graduate", limit = 4, inplace=True)
education.fillna("College or trade school diploma", limit = 4, inplace=True)
education.fillna("High school diploma ", limit = 1, inplace=True)
education.describe()

count                  2238
unique                    5
top       University degree
freq                    912
Name: What level of education have you reached?, dtype: object
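The 9/4/4/1 split used above was chosen by hand. As an alternative, the split can be derived from the observed counts with largest-remainder rounding, so that the allocations always sum to the number of missing values. The sketch below is not the method used in this notebook; the function name `allocate_missing` is hypothetical, and the counts are copied from the cleaned `value_counts()` output above.

```python
def allocate_missing(counts, n_missing):
    """Split n_missing across categories in proportion to counts (largest remainder)."""
    total = sum(counts.values())
    # Exact (fractional) share of the missing values for each category
    quotas = {k: v * n_missing / total for k, v in counts.items()}
    # Start from the integer part of every quota
    alloc = {k: int(q) for k, q in quotas.items()}
    leftover = n_missing - sum(alloc.values())
    # Hand the remaining units to the categories with the largest fractional parts
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc

observed = {
    "University degree": 903,
    "Post graduate": 594,
    "College or trade school diploma": 521,
    "High school diploma": 194,
    "No formal educational credential": 8,
}
print(allocate_missing(observed, 18))
# {'University degree': 7, 'Post graduate': 5, 'College or trade school diploma': 4,
#  'High school diploma': 2, 'No formal educational credential': 0}
```

With only 18 missing values either split has a negligible effect on the overall distribution, so this mainly matters if the same approach is reused on a column with many more gaps.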

#### Household Income


In [605]:
income = responses["What is your household income?"]
income.value_counts()

$100K+          831
$60K to $79K    372
$80K to $99K    335
$40K to $59K    298
$20K to $39K    235
Under $20K      107
Name: What is your household income?, dtype: int64

In [606]:
income.describe()

count       2178
unique         6
top       $100K+
freq         831
Name: What is your household income?, dtype: object

The only issue here is **missing values**: we're missing 60 values, to be precise.

Rather than just trying to mimic the distribution, let's see if we can be more intelligent here. Intuitively, a primary hypothesis that makes sense is that the **higher your education level, the greater your household income**. Thus, in order to assign the missing incomes, we can do it based on the responder's education level.

Let's take a look at the education level of those missing values:

In [607]:
# Education Level of missing income values
responses["What level of education have you reached?"][income[income.isnull()].index]

23                        Post graduate
60                    University degree
245                   University degree
361                   University degree
417                   University degree
530     College or trade school diploma
609                   University degree
659                   University degree
678                       Post graduate
696                   University degree
706     College or trade school diploma
755                       Post graduate
782                   University degree
810                       Post graduate
827                       Post graduate
845                   University degree
868                   University degree
877                   University degree
887                   University degree
908     College or trade school diploma
917                   University degree
939                   University degree
1008    College or trade school diploma
1089                      Post graduate
1136                      Post graduate


In [608]:
# Education level of people with 100K+ incomes
responses["What level of education have you reached?"][income[income == "$100K+"].index].value_counts()

University degree                   346
Post graduate                       294
College or trade school diploma     138
High school diploma                  50
No formal educational credential      3
Name: What level of education have you reached?, dtype: int64

In [609]:
# Education level of people with $80K to $99K incomes
responses["What level of education have you reached?"][income[income == "$80K to $99K"].index].value_counts()

University degree                   147
College or trade school diploma      86
Post graduate                        76
High school diploma                  24
No formal educational credential      2
Name: What level of education have you reached?, dtype: int64

In [610]:
# Education level of people with $20K to $39K incomes
responses["What level of education have you reached?"][income[income == "$20K to $39K"].index].value_counts()

University degree                  81
College or trade school diploma    61
High school diploma                47
Post graduate                      46
Name: What level of education have you reached?, dtype: int64

Based on the above information, it's clear that the hypothesis mentioned earlier generally holds true. There are definitely exceptions (where people with no formal educational credential have a $100K+ income, or postgraduate degree holders have a lower income), but generally the rule holds. This can be seen above, where $100K+ incomes are dominated by Education levels of "University degree" or "Post graduate". In contrast, $60K to $79K incomes are dominated by Education levels of "University degree" or "College/Trade School diploma".
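The per-bracket `value_counts()` calls above can also be condensed into a single table. A minimal sketch, assuming `pd.crosstab` with `normalize="index"` is an acceptable way to view the share of each income bracket within each education level; the toy data below is made up for illustration and is not taken from the survey.

```python
import pandas as pd

# Toy data (invented for this sketch)
toy = pd.DataFrame({
    "education": [
        "Post graduate", "Post graduate",
        "University degree", "University degree",
        "High school diploma", "High school diploma",
    ],
    "income": [
        "$100K+", "$100K+",
        "$100K+", "$80K to $99K",
        "$20K to $39K", "$100K+",
    ],
})

# Each row shows how one education level is spread across income brackets (rows sum to 1)
shares = pd.crosstab(toy["education"], toy["income"], normalize="index")
print(shares.round(2))
```

Normalizing by row rather than by column answers the question that matters for imputation: given an education level, which income bracket is most likely?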

Based on the above information, we can make the following split:

Post graduate -> 100K+ income

1/2 of University degree holders -> 100K+ income

1/2 of University degree holders -> 80K - 99K

College or Trade school diploma -> 60K - 79K

High school diploma -> 20K - 39K
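The split above can be sketched as a lookup table plus one special case for university degree holders. This is an illustrative alternative to the `temp` pattern used in the cells that follow, not a copy of it: the toy Series, the `FILL_BY_EDUCATION` dictionary and the variable names are assumptions for this sketch, and the half/half split is applied to the *missing* rows only.

```python
import pandas as pd

# Toy data (invented for this sketch; not taken from the survey)
education = pd.Series([
    "Post graduate",
    "University degree",
    "University degree",
    "College or trade school diploma",
    "High school diploma",
    "University degree",
])
income = pd.Series([None, None, None, None, None, "$40K to $59K"], dtype=object)

# Hypothetical lookup for the groups that receive a single bracket
FILL_BY_EDUCATION = {
    "Post graduate": "$100K+",
    "College or trade school diploma": "$60K to $79K",
    "High school diploma": "$20K to $39K",
}

filled = income.copy()

# University degree holders: split the rows with a missing income in half
uni_missing = filled.index[filled.isnull() & (education == "University degree")]
half = len(uni_missing) // 2
filled.loc[uni_missing[:half]] = "$100K+"
filled.loc[uni_missing[half:]] = "$80K to $99K"

# Everyone else: look the bracket up from the education level
still_missing = filled.isnull()
filled.loc[still_missing] = education[still_missing].map(FILL_BY_EDUCATION)

print(filled.tolist())
# ['$100K+', '$100K+', '$80K to $99K', '$60K to $79K', '$20K to $39K', '$40K to $59K']
```

Rows whose education level is not in the lookup (for example "No formal educational credential") stay missing in this sketch, which makes any remaining gaps easy to spot.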

In [611]:
# Post graduate (have to use temp because of the way pandas does inplace with variables)
temp = income[responses["What level of education have you reached?"] == "Post graduate"]
temp.fillna("$100K+", inplace=True)
income[responses["What level of education have you reached?"] == "Post graduate"] = temp
income.describe()

count       2195
unique         6
top       $100K+
freq         848
Name: What is your household income?, dtype: object

In [612]:
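# University degree
temp = income[responses["What level of education have you reached?"] == "University degree"]
# NOTE: `limit` caps the number of NaN values that get filled. Half the group size is far
# larger than the number of missing incomes here, so every missing value receives "$100K+"
# (freq rises from 848 to 874 below). A true half/half split would need
# limit = temp.isnull().sum() // 2. Integer division keeps `limit` an int on Python 3.
temp.fillna("$100K+", limit = len(temp) // 2, inplace=True)
temp.fillna("$80K to $99K", inplace=True)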
income[responses["What level of education have you reached?"] == "University degree"] = temp
income.describe()

count       2221
unique         6
top       $100K+
freq         874
Name: What is your household income?, dtype: object

In [613]:
# College or Trade School diploma
temp = income[responses["What level of education have you reached?"] == "College or trade school diploma"]
temp.fillna("$60K to $79K", inplace=True)
income[responses["What level of education have you reached?"] == "College or trade school diploma"] = temp
income.describe()

count       2235
unique         6
top       $100K+
freq         874
Name: What is your household income?, dtype: object

In [614]:
# High school diploma
temp = income[responses["What level of education have you reached?"] == "High school diploma "]
temp.fillna("$20K to $39K", inplace=True)
income[responses["What level of education have you reached?"] == "High school diploma "] = temp
income.describe()

count       2238
unique         6
top       $100K+
freq         874
Name: What is your household income?, dtype: object

#### Employment

In [617]:
employment = responses['Which category best describes your employment?']
employment.value_counts()

Full Time                                                                                                                            1405
Self Employed                                                                                                                         361
Retired                                                                                                                               136
Part Time                                                                                                                             121
Student                                                                                                                               118
Unemployed                                                                                                                             37
Home Maker                                                                                                                             17
disabled                          

In [618]:
employment.describe()

count          2223
unique           33
top       Full Time
freq           1405
Name: Which category best describes your employment?, dtype: object

Similar to other categories, there are:

1. 15 Missing Values
2. Unique values that need to be categorized

Since there is an overwhelmingly large majority of Full Time workers (> 60%), we will categorize the 15 missing values as Full Time as well.
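The majority share quoted above can be checked directly, and the fill value can be taken from the data rather than hard-coded. A minimal sketch on made-up data, assuming `value_counts(normalize=True)` (which ignores missing values by default) is used to compute the shares:

```python
import pandas as pd

# Toy data (invented for this sketch)
employment = pd.Series(["Full Time", "Full Time", "Full Time", "Part Time", None])

# Share of each answer among the non-missing responses
shares = employment.value_counts(normalize=True)
majority = shares.idxmax()

# Fill the gaps with the majority category
filled = employment.fillna(majority)

print(majority, shares[majority])
print(filled.tolist())
```

Filling with the most common answer slightly inflates the majority class, which is a reasonable trade-off when only a handful of values are missing.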

Regarding the unique values, judging from the responses, it's reasonable to assume that Free Lance / Contract falls under the "Self Employed" category. For answers like "None of your business", we will assume "Unemployed". For responders whose answer spans more than one category, placing them under any one of the corresponding labels is treated as sufficient to record their "employment" (this may need to change when we manipulate features during model selection and prediction).

In [620]:
# Missing values
employment.fillna("Full Time", inplace = True)
employment.describe()

count          2238
unique           33
top       Full Time
freq           1420
Name: Which category best describes your employment?, dtype: object

In [640]:
# Filter unique values and categorize accordingly

employment[(employment.str.contains("student", False)) & (employment != "Student")] = "Student"

employment[(employment.str.contains("disab", False) | (employment.str.contains("dissab", False)) | employment.str.contains("odsp", False))] = "Disabled"

employment[employment.str.contains("contract", False) | employment.str.contains("free", False) | employment.str.contains("self")] = "Self Employed"

employment[(employment.str.contains("occas", False) | employment.str.contains("part")) & (employment != "Part Time")] = "Part Time"

employment.value_counts()

Full Time                                                                                                                            1420
Self Employed                                                                                                                         369
Retired                                                                                                                               136
Part Time                                                                                                                             124
Student                                                                                                                               121
Unemployed                                                                                                                             37
Home Maker                                                                                                                             17
Disabled                          

In [647]:
# Assuming casual/seasonal employees are still full-time (based on Google search of Employment Rules/Categories)
employment[employment.str.contains("full", False) & (employment != "Full Time")] = "Full Time"

# Classify remaining as Unemployed (only 2 values, an incorrect assumption doesn't necessarily hurt us)
employment[~employment.isin(["Full Time", "Part Time", "Self Employed", "Retired", "Student", "Home Maker", "Disabled", "Unemployed"])] = "Unemployed"

employment.value_counts()

Full Time        1424
Self Employed     369
Retired           136
Part Time         124
Student           121
Unemployed         39
Home Maker         17
Disabled            8
Name: Which category best describes your employment?, dtype: int64

#### Toronto District

In [648]:
torontoDistrict = responses["What Toronto district is your primary address located in?"]
torontoDistrict.value_counts()

Central Toronto York or East York                                                1634
Etobicoke                                                                         236
North York                                                                        129
Scarborough                                                                       120
Mississauga                                                                        24
Brampton                                                                            9
Waterloo                                                                            4
Pickering                                                                           4
Guelph                                                                              3
oshawa                                                                              3
Richmond Hill                                                                       3
Oakville                                              

In [649]:
torontoDistrict.describe()

count                                  2237
unique                                   68
top       Central Toronto York or East York
freq                                   1634
Name: What Toronto district is your primary address located in?, dtype: object

In [651]:
# Only one missing value -> assign to the majority (Central Toronto York or East York)
torontoDistrict.fillna("Central Toronto York or East York", inplace = True)
torontoDistrict.describe()

count                                  2238
unique                                   68
top       Central Toronto York or East York
freq                                   1635
Name: What Toronto district is your primary address located in?, dtype: object

Regarding the number of unique answers that exist, it's important to remember that the question is *What Toronto District is your primary address located in*.

Toronto has 4 primary districts:

1. Central Toronto York or East York
2. Etobicoke
3. Scarborough
4. North York

This is seen in the image below:
<img src = "https://www1.toronto.ca/City%20Of%20Toronto/Toronto%20Building/Shared%20Content/Images/Ward%20Images/torontoWard_1540x1140.jpg" />

Thus, all cities outside of these 4 entries can be grouped together into a common "Outside the above 4 districts" entry! This includes the GTA, and any other residences outside of the 4 primary districts of the city. This should also be how the survey is structured, and thus acts as an additional improvement recommendation.
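One way to sketch this grouping is an exact-match whitelist: any answer that is not one of the four district labels (after trimming whitespace and ignoring case) is sent to the catch-all entry. This differs from the substring filters used in the cells below, and the names `DISTRICTS`, `OUTSIDE` and `lookup` are assumptions for this sketch. The sample answers are a small made-up selection in the style of the survey responses.

```python
import pandas as pd

DISTRICTS = [
    "Central Toronto York or East York",
    "Etobicoke",
    "North York",
    "Scarborough",
]
OUTSIDE = "Outside of the above 4 primary districts of Toronto"

answers = pd.Series([
    "Etobicoke",
    "Mississauga",
    "York Region",
    "North York",
    "i live in hamilton partner lives in north york",
    " scarborough ",
])

# Case-insensitive exact match against the four district labels
lookup = {d.lower(): d for d in DISTRICTS}
cleaned = answers.str.strip().str.lower().map(lookup).fillna(OUTSIDE)

print(cleaned.tolist())
```

Because the match is exact, answers such as "York Region" fall into the catch-all entry without a separate follow-up step. Note that a genuinely missing answer would also end up in the catch-all entry here, so missing values should be filled first, as is done above.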

In [659]:
# Get values outside of Toronto
torontoDistrict[~torontoDistrict.str.contains("York", False) & ~torontoDistrict.str.contains("Etobicoke", False) & ~torontoDistrict.str.contains("Scarborough", False)]

21                         Mississauga
49                            Hamilton
61                         out of town
79                         Mississauga
96                         mississauga
130                              other
135                             oshawa
171                           Hamilton
218                          Pickering
235                           Brampton
241                             Guelph
248                           Waterloo
274                             oshawa
321                        Mississauga
323                            Markham
330                      Richmomd Hill
360                             Durham
364                        Leslieville
391                              Ajax 
395                             ottawa
447                           Oakville
451                          Thornhill
458                        Mississauga
463                        Mississauga
465                           Brampton
485                      

In [660]:
# Assign to new category
torontoDistrict[~torontoDistrict.str.contains("York", False) & ~torontoDistrict.str.contains("Etobicoke", False) & ~torontoDistrict.str.contains("Scarborough", False)] = "Outside of the above 4 primary districts of Toronto"
torontoDistrict.value_counts()

Central Toronto York or East York                      1635
Etobicoke                                               236
North York                                              129
Scarborough                                             120
Outside of the above 4 primary districts of Toronto     115
York Region                                               1
YORK REGION                                               1
i live in hamilton partner lives in north york            1
Name: What Toronto district is your primary address located in?, dtype: int64

In [663]:
# Assign remaining values to appropriate region -> York Region is NOT a part of Toronto
torontoDistrict[torontoDistrict.str.contains("york region", False)] = "Outside of the above 4 primary districts of Toronto"

# The person responding has their primary residence in Hamilton -> not in Toronto
torontoDistrict[torontoDistrict.str.contains("hamilton")] = "Outside of the above 4 primary districts of Toronto"

torontoDistrict.value_counts()

Central Toronto York or East York                      1635
Etobicoke                                               236
North York                                              129
Scarborough                                             120
Outside of the above 4 primary districts of Toronto     118
Name: What Toronto district is your primary address located in?, dtype: int64