# SMTO 2015 Data Transformation

In this notebook, I select relevant data from the StudentMoveTO 2015 survey and transform it into the format that will be used by the machine learning classification algorithm.

This data will be used as training data for the algorithm, a binary classifier that determines whether a given student is living with their family or not. The data was selected and transformed so that it is directly comparable with the TTS data, since that is the data we want to apply the classifier to.

First, let's load the raw data from the `Households.csv` and `Respondents.csv` files in the `SMTO_2015` data folder.

In [1]:
import pandas as pd
hh_df = pd.read_csv('../Data/SMTO_2015/Households.csv')
ps_df = pd.read_csv('../Data/SMTO_2015/Respondents.csv')

Now, let us take the columns we want and put them in a new dataframe that will hold the input for the machine learning algorithm. We give the columns shorter names in the process.

In [2]:
# Load Households columns
hh_columns = {'hhlivingsituation': 'Family', 'HmTTS2006': 'HomeZone', 'hhcarnumber': 'Cars', 'hhnumyoungerthan18': 'Children', 'hhnumolderorequalto18': 'Adults', 'hhincomelevel': 'Income'}
ML_df = hh_df[list(hh_columns.keys())]
ML_df = ML_df.rename(columns=hh_columns)

# Load Respondents columns
ps_columns = {'pscmpgenderkey': 'Gender', 'psage': 'Age', 'personstatustime': 'Status', 'psdrivinglicenseownerflag': 'License', 'psmainmodefalltypical': 'Old_Mode', 'PsFallAllModesInclBus': 'Bus?', 'PsFallAllModesInclGOBus': 'Go Bus?', 'PsFallAllModesInclGOTrain': 'Go Train?', 'PsFallAllModesInclStreetcar': 'Streetcar?', 'PsFallAllModesInclSubway': 'Subway?', 'HomeToMainCampusKM': 'Distance', 'psworknumhoursperweek': 'Employment'}
ML_df = ML_df.join(ps_df[list(ps_columns.keys())])
ML_df = ML_df.rename(columns=ps_columns)
 
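# Treat missing employment answers as 0 (unknown)
ML_df = ML_df.fillna(value={'Employment': 0})
# Drop respondents whose enrolment status is 'Other'
ML_df = ML_df[ML_df['Status'] != 'Other']
# Drop rows that lack a home zone, a distance to campus, or the target label
ML_df = ML_df.dropna(subset=['HomeZone', 'Distance', 'Family'])
# Store zone numbers as integers
ML_df['HomeZone'] = ML_df['HomeZone'].astype(int)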
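
When `DataFrame.join` is called without an `on` argument, it matches rows by index label. The join above therefore relies on row *i* of the households file and row *i* of the respondents file describing the same student. The sketch below illustrates this behaviour on small hypothetical frames (`hh_toy`, `ps_toy` and their values are invented, not SMTO data):

```python
import pandas as pd

# Hypothetical stand-ins for the two survey files (not real SMTO data).
hh_toy = pd.DataFrame({'Cars': [1, 0, 2]}, index=[0, 1, 2])
ps_toy = pd.DataFrame({'Age': [20, 25, 23]}, index=[0, 1, 2])

# join() matches rows by index label, not by position.
joined = hh_toy.join(ps_toy)

# If the index labels differ, unmatched rows get NaN instead of raising an error.
ps_shifted = ps_toy.copy()
ps_shifted.index = [1, 2, 3]
joined_shifted = hh_toy.join(ps_shifted)

print(joined)
print(joined_shifted)
```

If the two files were ever sorted differently, the join would still run but would pair the wrong rows, so it may be worth confirming that both files share the same row order.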

### Employment Status

We need to determine a student's employment status and classify the student according to the following legend:

0 - Unknown  
1 - Unemployed  
2 - Part-time worker (up to 30 hours a week)  
3 - Full-time worker (31 or more hours a week)

In [3]:
employment_conversion = {'Employment': {"No, I don't work": 1, "Yes, I work part time (<10 hours per week)": 2, "Yes, I work part time (11-20 hours per week)": 2, "Yes, I work part time (21-30 hours per week)": 2, "Yes, I work 31-40 hours per week": 3, "Yes, I work > 40 hours per week": 3}}
ML_df.replace(employment_conversion, inplace=True)
ML_df['Employment'].value_counts()

0    11267
2     1586
1     1567
3      237
Name: Employment, dtype: int64
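
The large count for code 0 comes from the earlier `fillna` step, which turns missing answers into 0. As a minimal sketch of how the legend is applied, the example below uses `Series.map` on a hypothetical four-row frame. This is a substitute for the notebook's `replace` call, not the same method: with `map`, an unrecognised string would also end up as 0, whereas `replace` would leave it unchanged.

```python
import pandas as pd

# Hypothetical responses; the category strings are copied from the mapping above.
toy = pd.DataFrame({'Employment': [
    "No, I don't work",
    "Yes, I work part time (11-20 hours per week)",
    "Yes, I work > 40 hours per week",
    None,  # missing answer
]})

codes = {
    "No, I don't work": 1,
    "Yes, I work part time (11-20 hours per week)": 2,
    "Yes, I work > 40 hours per week": 3,
}

# Known strings become their code; anything without a code becomes 0 (Unknown).
toy['Employment'] = toy['Employment'].map(codes).fillna(0).astype(int)

print(toy['Employment'].tolist())
```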

### Enrolment Status

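We need to determine a student's enrolment status and classify the student according to the following legend. Respondents whose status is `Other` were already removed in the cleaning step, so code 0 does not occur in the result: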

0 - Other  
1 - Part-time student  
2 - Full-time student

In [4]:
ML_df.replace({'Status': {'FT': 2, 'PT': 1, 'Other': 0}}, inplace=True)
ML_df['Status'].value_counts()

2    13593
1     1064
Name: Status, dtype: int64

### Household Income

We need to classify students according to their reported household income. To align the income ranges with those from TTS, we partition the students into three brackets by income level as follows:

0 - Unknown/unreported  
1 - Low (under $60,000)  
2 - High ($60,000 or more)

In [5]:
# The "I donâ€™t know" key is kept exactly as it appears in the source; "I don’t know" covers a correctly decoded file
income_conversion = {'Income': {'Unknown': 0, 'Less than $ 30,000': 1, "I donâ€™t know": 0, "I don’t know": 0, "$ 30,000 - 59,999": 1, "$ 60,000 - 89,999": 2, "$ 90,000 - 119,999": 2, "$ 120,000 - 149,999": 2, "$ 150,000 - 179,999": 2, "$ 240,000 +": 2, "$ 180,000 - 209,999": 2, "$ 210,000 - 239,999": 2}}
ML_df.replace(income_conversion, inplace=True)
ML_df['Income'].value_counts()

0    8756
1    3404
2    2497
Name: Income, dtype: int64
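
Because `replace` silently leaves unrecognised strings untouched, it can help to list the categories that a mapping does not cover before applying it. The sketch below shows one way to do that on hypothetical data; `'Prefer not to say'` is an invented category that is deliberately absent from the mapping.

```python
import pandas as pd

# Hypothetical income answers (not real SMTO data).
toy = pd.Series([
    'Less than $ 30,000',
    '$ 60,000 - 89,999',
    'Unknown',
    'Prefer not to say',  # invented category with no code
])

income_codes = {'Unknown': 0, 'Less than $ 30,000': 1, '$ 60,000 - 89,999': 2}

# Categories present in the data but absent from the mapping.
unmapped = sorted(set(toy.dropna()) - set(income_codes))

print(unmapped)
```

An empty list would indicate that every answer in the column has a code.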

### Commute Mode Choice

We need to classify students according to their typical commute mode. To align the modes with those from TTS, we use the following partition:

0 - Unknown/Do not travel  
1 - Public transit (not GO Train)  
2 - Bicycle  
3 - Driver (alone or with a passenger)  
4 - Only GO Train  
5 - GO Train and other public transit  
6 - Motorcycle  
7 - Other (including paratransit and intercampus shuttle)  
8 - Car passenger  
9 - Taxi  
10 - Walking

In [6]:
mode_conversion = {'Old_Mode': {'Subway/RT': 1, "Transit Bus": 1, "Walk": 10, "Bicycle": 2, "Car - Driver alone": 3, "GO Bus": 1, "Streetcar": 1, "Car - Passenger": 8, "Car - Driver with passenger(s)": 3, "Intercampus Shuttle": 7, "Other": 7, "I do not travel to the university (distance learners only)": 0, "Motorcycle, moped or scooter": 6, "Taxi": 9, "Paratransit": 7}}
ML_df.replace(mode_conversion, inplace=True)
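# True if the student also uses a bus, GO Bus, streetcar or subway
ML_df['OtherTransit?'] = ML_df['Bus?'] | ML_df['Go Bus?'] | ML_df['Streetcar?'] | ML_df['Subway?']
# 'GO Train' is still a string at this point: split it into 5 (with other transit) or 4 (GO Train only)
ML_df['Mode'] = ML_df.apply(lambda row: row['Old_Mode'] if row['Old_Mode'] != 'GO Train' else (5 if row['OtherTransit?'] else 4), axis=1)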
ML_df = ML_df.drop(columns=['Bus?', 'Go Bus?', 'Streetcar?', 'Subway?', 'Go Train?', 'OtherTransit?', 'Old_Mode'])
ML_df['Mode'].value_counts()

1     8182
10    2805
3     1066
2     1002
5      938
8      313
4      200
7      109
6       15
0       15
9       12
Name: Mode, dtype: int64
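
The row-wise `apply` above can also be expressed column-wise. The sketch below is one possible alternative, not the notebook's method; it runs on a hypothetical four-row frame and uses `Series.mask` to overwrite only the `'GO Train'` entries.

```python
import pandas as pd

# Hypothetical rows: modes already converted to codes, except 'GO Train'.
toy = pd.DataFrame({
    'Old_Mode': [1, 'GO Train', 'GO Train', 10],
    'OtherTransit?': [True, True, False, False],
})

# Rows whose main mode is GO Train.
is_go = toy['Old_Mode'] == 'GO Train'

# Code those rows would receive: 5 with other transit, 4 for GO Train only.
go_code = toy['OtherTransit?'].map({True: 5, False: 4})

# Keep the existing code everywhere else; substitute go_code on GO Train rows.
toy['Mode'] = toy['Old_Mode'].mask(is_go, go_code).astype(int)

print(toy['Mode'].tolist())
```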

### Living Arrangement

The label that the machine learning algorithm will predict is whether or not a student is living with their family. We store this label in a binary column: 1 if the student is living with their family and 0 otherwise.

In [7]:
ML_df['Family'] = (ML_df['Family'] == "Live with family/parents") * 1
ML_df['Family'].value_counts()

1    8333
0    6324
Name: Family, dtype: int64
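
As a small illustration of the encoding, the comparison produces a Boolean series, and converting it to integers gives the 1/0 label. The answers below are hypothetical except for the first string, which is taken from the cell above.

```python
import pandas as pd

# Hypothetical answers; only the first string appears in the notebook.
answers = pd.Series([
    "Live with family/parents",
    "Live with roommates",
    "Live alone",
])

# The comparison yields booleans; casting to int gives the 1/0 label.
label = (answers == "Live with family/parents").astype(int)

print(label.tolist())
```

Any answer other than the exact string `"Live with family/parents"` is coded as 0.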

## Exporting to File

We have the following dataframe:

In [8]:
ML_df

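| | Family | HomeZone | Cars | Children | Adults | Income | Gender | Age | Status | License | Distance | Employment | Mode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 261 | 1 | 3 | 4 | 0 | 1 | 20 | 2 | 0 | 14.00 | 0 | 1 |
| 1 | 0 | 71 | 0 | 0 | 2 | 2 | 1 | 25 | 2 | 1 | 0.75 | 0 | 10 |
| 2 | 1 | 3714 | 1 | 0 | 4 | 0 | 1 | 23 | 2 | 1 | 29.50 | 0 | 1 |
| 3 | 0 | 74 | 0 | 0 | 4 | 0 | 2 | 20 | 2 | 1 | 0.75 | 0 | 10 |
| 4 | 0 | 71 | 0 | 0 | 2 | 1 | 2 | 27 | 2 | 1 | 0.75 | 0 | 10 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15221 | 0 | 212 | 2 | 0 | 2 | 1 | 1 | 27 | 2 | 1 | 11.75 | 0 | 1 |
| 15222 | 1 | 233 | 1 | 2 | 3 | 0 | 1 | 20 | 2 | 1 | 15.00 | 0 | 1 |
| 15223 | 0 | 95 | 0 | 0 | 3 | 0 | 1 | 25 | 2 | 1 | 13.75 | 0 | 1 |
| 15224 | 1 | 2221 | 2 | 1 | 2 | 0 | 1 | 22 | 2 | 0 | 17.50 | 0 | 1 |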


We export this dataframe, without its index, into a file called `SMTO_2015_Input.csv`.

In [9]:
ML_df.to_csv('SMTO_2015_Input.csv', index=False)
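
Because `index=False` is passed, the row labels (which have gaps after the filtering steps) are not written to the file. The sketch below shows the effect on a hypothetical two-row frame, writing to an in-memory buffer instead of a file on disk.

```python
import io
import pandas as pd

# Hypothetical frame with a non-contiguous index, as left by row filtering.
toy = pd.DataFrame({'Family': [1, 0], 'Mode': [1, 10]}, index=[0, 3])

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # the index labels 0 and 3 are not written
buffer.seek(0)

# Reading the data back gives a fresh 0..n-1 index and the same columns.
restored = pd.read_csv(buffer)

print(restored)
```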