# 2016 TTS Data Transformation

In this notebook, we take the __2016 TTS__ data and transform it into a format that matches the dataframe in `2015_SMTO_ML_Format.csv`, which was created from the __2015 StudentMoveTO__ data. The resulting dataframe will be used for machine learning.

In [1]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

We read the data from `TTS_2016_Filtered_With_Unknowns.csv`. This file contains all 2016 TTS data after filtering out a large number of secondary students and adding a column with the distance between _HomeZone_ and _SchoolZone_. The dataframe still includes the rows where respondents did not specify their mode of transportation or their trip-to-school time.

In [2]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r'..\Data\TTS_2016\TTS_2016_Filtered_With_Unknowns.csv')
df.head()

Unnamed: 0,HomeZone,SchoolZone,SchoolCode,Age,Sex,License,Occupation,EmploymentStatus,StudentStatus,Vehicles,IncomeClass,Adults,Children,ExpansionFactor,ModeTakenToSchool,TripToSchoolTime,Distance
0,264,25,9244,19,M,N,O,O,S,2,3,5,0,42.03,B,1300,5801.714
1,38,25,9243,22,M,Y,S,P,S,0,3,2,0,21.01,W,1000,1529.49
2,1039,25,9244,26,M,Y,O,O,S,3,7,5,1,16.17,D,700,32200.97
3,613,25,9244,19,M,Y,O,O,S,1,2,4,0,44.06,B,645,23502.05
4,4160,4160,9998,21,M,Y,M,P,P,4,7,3,0,32.75,9,-1,0.0


__Removing unnecessary columns:__

The following columns are unnecessary since the __SMTO data__ does not have this information:

- School Zone
- School Code
- Occupation
- Expansion Factor
- Trip To School Time

In [3]:
df = df.drop(columns = ['SchoolZone', 'SchoolCode', 'Occupation', 'ExpansionFactor', 'TripToSchoolTime'])

__Renaming certain columns__ (to match the chosen naming convention):

In [4]:
renaming_columns = {'HomeZone': 'Home Zone', 'EmploymentStatus':'Employment', 'StudentStatus':'Status', 'IncomeClass':'Income', 'ModeTakenToSchool':'Mode', 'Vehicles':'Cars', 'Sex':'Gender'}
df = df.rename(columns=renaming_columns)

__Replacing entries:__

In [5]:
# Dictionary with code numbers
codes_obj = {"Gender":     {"F": 1, "M": 2},
             "License": {"N": 0, "Y": 1},
             "Employment": {"F": 3, "H": 3, "P": 2, "J": 2, "O": 1, "9":0},
             "Status": {"P": 1, "S": 2},
             "Mode": {"B": 1, "C": 2, "D": 3, "G": 4, "J": 5, "M": 6, "O": 7, "U": 7, "P": 8, "T": 9, "W": 10, "9": 0}}

# Replacing entries of object type columns with values from dictionary above
df.replace(codes_obj, inplace=True)

# Replacing entries from income (int type) with our chosen values (int type as well)
df['Income'] = df['Income'].replace({3: 1, 4: 2, 5: 2, 6: 2, 7: 0})

# Dividing distance by 1000 to obtain data in km (originally in m)
df['Distance'] = df['Distance'] / 1000
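
To make the recoding step easier to follow, here is a minimal sketch on a toy dataframe. The two rows reuse values from the first and last rows of the preview; the `toy` and `codes` names are illustrative and not part of the notebook. It shows how a nested dictionary passed to `DataFrame.replace` recodes each column with its own mapping, followed by the income recoding and the metre-to-kilometre conversion.

```python
import pandas as pd

# Toy frame (illustrative only): values borrowed from rows 0 and 4 of the preview
toy = pd.DataFrame({
    "Gender": ["M", "M"],
    "Mode": ["B", "9"],
    "Income": [3, 7],
    "Distance": [5801.714, 0.0],
})

# Outer keys are column names, inner dictionaries map old values to new codes
codes = {"Gender": {"F": 1, "M": 2}, "Mode": {"B": 1, "9": 0}}
toy = toy.replace(codes)

# Integer income classes are recoded with a plain value-to-value dictionary
toy["Income"] = toy["Income"].replace({3: 1, 7: 0})

# Metres to kilometres
toy["Distance"] = toy["Distance"] / 1000

print(toy)
```

Values that do not appear in a mapping are left unchanged by `replace`, so any code missing from the dictionaries would survive as its original letter.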

The final dataframe looks like this:

In [6]:
df.head()

Unnamed: 0,Home Zone,Age,Gender,License,Employment,Status,Cars,Income,Adults,Children,Mode,Distance
0,264,19,2,0,1,2,2,1,5,0,1,5.801714
1,38,22,2,1,2,2,0,1,2,0,10,1.52949
2,1039,26,2,1,1,2,3,0,5,1,3,32.20097
3,613,19,2,1,1,2,1,2,4,0,1,23.50205
4,4160,21,2,1,2,1,4,0,3,0,0,0.0
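
Since `replace` leaves any value that is not in the mapping untouched, it may be worth checking that the recoded columns contain only the expected codes. The sketch below is one possible check, not part of the original notebook: `unmapped_values` is a hypothetical helper, the allowed sets are derived from the dictionaries used in this notebook, and the toy frame uses a made-up `"X"` to stand in for an unmapped code.

```python
import pandas as pd

def unmapped_values(frame, allowed):
    """Hypothetical helper: return {column: unexpected values} for each column with leftovers."""
    leftovers = {}
    for column, valid in allowed.items():
        extra = set(frame[column].dropna().unique()) - set(valid)
        if extra:
            leftovers[column] = sorted(extra, key=str)
    return leftovers

# Target codes taken from the replacement dictionaries in this notebook
allowed_codes = {
    "Gender": {1, 2},
    "License": {0, 1},
    "Employment": {0, 1, 2, 3},
    "Status": {1, 2},
    "Mode": set(range(11)),
}

# Toy frame (illustrative only): "X" is a made-up code that no mapping covers
sample = pd.DataFrame({
    "Gender": [2, 1],
    "License": [0, 1],
    "Employment": [1, 2],
    "Status": [2, 1],
    "Mode": [1, "X"],
})

report = unmapped_values(sample, allowed_codes)
print(report)
```

On the real data, the same call would be `unmapped_values(df, allowed_codes)`; an empty dictionary would indicate that every entry was recoded.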


__Saving dataframe to .csv file:__

The dataframe is saved to a .csv file named `TTS_2016_ML_Format.csv`. Adjust the path as needed.

In [7]:
df.to_csv(r'..\Data\TTS_2016\TTS_2016_ML_Format.csv', index=False)
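
Because the goal is to match the format of `2015_SMTO_ML_Format.csv`, a quick comparison of column names can help confirm the result. The following is a sketch under assumptions: `column_differences` is a hypothetical helper, and the two toy frames are stand-ins, since the actual SMTO column list is not shown in this notebook (the `Campus` column is invented for illustration).

```python
import pandas as pd

def column_differences(new_frame, reference_frame):
    """Hypothetical helper: list the columns present in one frame but not the other."""
    new_cols = set(new_frame.columns)
    ref_cols = set(reference_frame.columns)
    return {
        "missing_from_new": sorted(ref_cols - new_cols),
        "extra_in_new": sorted(new_cols - ref_cols),
    }

# Toy stand-ins (illustrative only); "Campus" is a made-up reference column
tts_like = pd.DataFrame(columns=["Unnamed: 0", "Home Zone", "Age", "Mode"])
smto_like = pd.DataFrame(columns=["Home Zone", "Age", "Mode", "Campus"])

diff = column_differences(tts_like, smto_like)
print(diff)
```

With the real files, both frames would be loaded with `pd.read_csv` and passed to the helper in the same way.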