## Feature Engineering and Selection: Distance

This notebook details another attempt at feature engineering and selection for distance-related variables. It is motivated by the new descriptive analysis results on distance and by the idea of using zone centroid coordinates as features in the Random Forest model.

First, let's load the data, taking care to remove `Other` students.

In [208]:
import pandas as pd

df = pd.read_csv('../Data/SMTO_2015/SMTO_2015_Complete_Input.csv')
df = df[df['Level'] != 'Other']
df.head()

Unnamed: 0,Campus,Level,Status,Mode_Actual,Gender,Licence,Work,Age,HomeZone,Family,...,Domestic.OC,Admission_Avg.SG,Admission_Avg.SC,Admission_Avg.MI,Admission_Avg.YK,Admission_Avg.YG,Admission_Avg.RY,Admission_Avg.OC,Exp_Segment,Exp_Level
0,Scarborough (UTSC),UG,FT,Transit Bus,Female,0,Unknown,20,261,1,...,0.8998,0.893,0.841,0.83,0.817,0.817,0.84,0.824,0.944738,0.944738
1,Downtown Toronto (St. George),Grad,FT,Walk,Female,1,Unknown,25,71,0,...,0.6786,0.893,0.841,0.83,0.817,0.817,0.84,0.824,0.986085,0.986085
2,Downtown Toronto (St. George),UG,FT,Transit Bus,Female,1,Unknown,23,3714,1,...,0.8998,0.893,0.841,0.83,0.817,0.817,0.84,0.824,0.91927,0.91927
3,Downtown Toronto (St. George),UG,FT,Walk,Male,1,Unknown,20,74,0,...,0.8998,0.893,0.841,0.83,0.817,0.817,0.84,0.824,0.91927,0.91927
4,Downtown Toronto (St. George),Grad,FT,Walk,Male,1,Unknown,27,71,0,...,0.6786,0.893,0.841,0.83,0.817,0.817,0.84,0.824,0.986085,0.986085


In [209]:
df.columns

Index(['Campus', 'Level', 'Status', 'Mode_Actual', 'Gender', 'Licence', 'Work',
       'Age', 'HomeZone', 'Family', 'Cars', 'Children', 'Adults', 'Income',
       'Mode', 'School_Codes', 'Segment', 'Dist.SG', 'Dist.SC', 'Dist.MI',
       'Dist.YK', 'Dist.YG', 'Dist.RY', 'Dist.OC', 'WTT.SG', 'WTT.SC',
       'WTT.MI', 'WTT.YK', 'WTT.YG', 'WTT.RY', 'WTT.OC', 'AIVTT.SG',
       'AIVTT.SC', 'AIVTT.MI', 'AIVTT.YK', 'AIVTT.YG', 'AIVTT.RY', 'AIVTT.OC',
       'TPTT.SG', 'TPTT.SC', 'TPTT.MI', 'TPTT.YK', 'TPTT.YG', 'TPTT.RY',
       'TPTT.OC', 'Total.SG', 'Total.SC', 'Total.MI', 'Total.YK', 'Total.YG',
       'Total.RY', 'Total.OC', 'UG.SG', 'UG.SC', 'UG.MI', 'UG.YK', 'UG.YG',
       'UG.RY', 'UG.OC', 'Grad.SG', 'Grad.SC', 'Grad.MI', 'Grad.YK', 'Grad.YG',
       'Grad.RY', 'Grad.OC', 'Tuition.SG', 'Tuition.SC', 'Tuition.MI',
       'Tuition.YK', 'Tuition.YG', 'Tuition.RY', 'Tuition.OC', 'Domestic.SG',
       'Domestic.SC', 'Domestic.MI', 'Domestic.YK', 'Domestic.YG',
       'Domestic.RY', 'Dome

Let us quickly look at the correlation of the distance columns.

In [210]:
df.iloc[:, 17:24].corr()

Unnamed: 0,Dist.SG,Dist.SC,Dist.MI,Dist.YK,Dist.YG,Dist.RY,Dist.OC
Dist.SG,1.0,0.584396,0.588353,0.731334,0.864714,0.99741,0.998566
Dist.SC,0.584396,1.0,-0.139397,0.51181,0.847196,0.618299,0.56758
Dist.MI,0.588353,-0.139397,1.0,0.464808,0.2728,0.549017,0.600958
Dist.YK,0.731334,0.51181,0.464808,1.0,0.817802,0.715375,0.71221
Dist.YG,0.864714,0.847196,0.2728,0.817802,1.0,0.878822,0.848379
Dist.RY,0.99741,0.618299,0.549017,0.715375,0.878822,1.0,0.996756
Dist.OC,0.998566,0.56758,0.600958,0.71221,0.848379,0.996756,1.0


We notice that `Dist.SG`, `Dist.RY`, and `Dist.OC` are very highly correlated. This is not surprising as these campuses are in close proximity. Hence, we can try running models with only one of these columns included. Furthermore, the `Dist.YG` column is correlated with those three columns and with `Dist.SC`. We also try excluding that column from the model input.
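To make this screening step repeatable, one option is to list every column pair whose absolute correlation exceeds a cutoff. The sketch below is illustrative only: the helper name `high_corr_pairs`, the 0.95 cutoff, and the toy data are assumptions and are not part of the original analysis.

```python
import pandas as pd

def high_corr_pairs(frame, threshold=0.95):
    """Return (col_a, col_b, r) for each column pair with |r| >= threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        # Only look at columns after `a` so each pair is reported once
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, float(r)))
    return pairs

# Toy data (hypothetical values): A and B are perfectly collinear, C is not.
toy = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, 4.0],
    'B': [2.0, 4.0, 6.0, 8.0],
    'C': [4.0, 1.0, 3.0, 2.0],
})
pairs = high_corr_pairs(toy, threshold=0.95)
print(pairs)
```

Applied to the distance columns, a call such as `high_corr_pairs(df.iloc[:, 17:24])` should flag the SG/RY/OC pairs, whose correlations in the table above all exceed 0.99.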

In addition to changing the number of distance columns passed, we also engineer some additional features. We add six "flag" columns which indicate whether a student's HomeZone is within a certain distance of particular campuses. These thresholds were determined by observing trends found in the descriptive analysis. Finally, we also add flags for whether a student lives in the same zone as each campus.

In [211]:
# Adding flag columns for distances
df['SC<25'] = df['Dist.SC'] < 25
df['SG<10'] = df['Dist.SG'] < 10
df['SG<20'] = df['Dist.SG'] < 20
df['MI<10'] = df['Dist.MI'] < 10
df['MI<20'] = df['Dist.MI'] < 20
df['YK<20'] = df['Dist.YK'] < 20

# Adding flags for same zone as campus
school_codes = list(df['School_Codes'].unique())
for school in school_codes:
    if school == 'YG':
        continue
    df[school + '0'] = df['Dist.' + school] == 0
print(list(df.columns))

['Campus', 'Level', 'Status', 'Mode_Actual', 'Gender', 'Licence', 'Work', 'Age', 'HomeZone', 'Family', 'Cars', 'Children', 'Adults', 'Income', 'Mode', 'School_Codes', 'Segment', 'Dist.SG', 'Dist.SC', 'Dist.MI', 'Dist.YK', 'Dist.YG', 'Dist.RY', 'Dist.OC', 'WTT.SG', 'WTT.SC', 'WTT.MI', 'WTT.YK', 'WTT.YG', 'WTT.RY', 'WTT.OC', 'AIVTT.SG', 'AIVTT.SC', 'AIVTT.MI', 'AIVTT.YK', 'AIVTT.YG', 'AIVTT.RY', 'AIVTT.OC', 'TPTT.SG', 'TPTT.SC', 'TPTT.MI', 'TPTT.YK', 'TPTT.YG', 'TPTT.RY', 'TPTT.OC', 'Total.SG', 'Total.SC', 'Total.MI', 'Total.YK', 'Total.YG', 'Total.RY', 'Total.OC', 'UG.SG', 'UG.SC', 'UG.MI', 'UG.YK', 'UG.YG', 'UG.RY', 'UG.OC', 'Grad.SG', 'Grad.SC', 'Grad.MI', 'Grad.YK', 'Grad.YG', 'Grad.RY', 'Grad.OC', 'Tuition.SG', 'Tuition.SC', 'Tuition.MI', 'Tuition.YK', 'Tuition.YG', 'Tuition.RY', 'Tuition.OC', 'Domestic.SG', 'Domestic.SC', 'Domestic.MI', 'Domestic.YK', 'Domestic.YG', 'Domestic.RY', 'Domestic.OC', 'Admission_Avg.SG', 'Admission_Avg.SC', 'Admission_Avg.MI', 'Admission_Avg.YK', 'Admiss

Now, let us load coordinate and planning district information. To avoid scaling issues, we min-max normalize the coordinates so that the values lie between 0 and 1, inclusive. We also compute their correlations.

In [212]:
# Load zone coordinates
zones = pd.read_csv('../Data/Zones.csv')
zones.set_index('Zone#', inplace=True)

# Normalize from 0 to 1
zones['X'] = (zones['X'] - zones['X'].min()) / (zones['X'].max() - zones['X'].min())
zones['Y'] = (zones['Y'] - zones['Y'].min()) / (zones['Y'].max() - zones['Y'].min())
zones.corr()

Unnamed: 0,PD,X,Y
PD,1.0,-0.456036,-0.371999
X,-0.456036,1.0,0.633247
Y,-0.371999,0.633247,1.0


Interestingly, the X- and Y-coordinates are moderately positively correlated. This might be indicative of the shape of the GTA being tilted from the southwest to the northeast due to Lake Ontario, as well as York Region's large extent into the northeast.

Let us add this zone information to our dataframe.

In [213]:
zones.head()

Unnamed: 0_level_0,PD,X,Y
Zone#,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,1,0.456789,0.445361
2,1,0.45368,0.44218
3,1,0.451497,0.442861
4,1,0.449033,0.44102
5,1,0.446403,0.439285


In [214]:
# Add zone information to df
# Look up PD, X and Y for each student's HomeZone (zones is indexed by Zone#)
df = df.join(zones[['PD', 'X', 'Y']], on='HomeZone')

In [215]:
df.head()

Unnamed: 0,Campus,Level,Status,Mode_Actual,Gender,Licence,Work,Age,HomeZone,Family,...,YK<20,SC0,SG0,MI0,OC0,RY0,YK0,PD,X,Y
0,Scarborough (UTSC),UG,FT,Transit Bus,Female,0,Unknown,20,261,1,...,False,False,False,False,False,False,False,6,0.45215,0.469128
1,Downtown Toronto (St. George),Grad,FT,Walk,Female,1,Unknown,25,71,0,...,True,False,False,False,False,False,False,1,0.428078,0.447426
2,Downtown Toronto (St. George),UG,FT,Transit Bus,Female,1,Unknown,23,3714,1,...,False,False,False,False,False,False,False,36,0.322351,0.385293
3,Downtown Toronto (St. George),UG,FT,Walk,Male,1,Unknown,20,74,0,...,True,False,False,False,False,False,False,1,0.426631,0.44036
4,Downtown Toronto (St. George),Grad,FT,Walk,Male,1,Unknown,27,71,0,...,True,False,False,False,False,False,False,1,0.428078,0.447426


In [216]:
#df = pd.get_dummies(df, columns = ['Level', 'Status','Work','Mode','Campus','Income', 'Gender'])

In [217]:
# Flag households with two or more cars (1 if Cars > 1, else 0)
df['Cars_2+'] = (df['Cars'] > 1).astype(int)

In [218]:
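school_codes = df['School_Codes'].unique().tolist()

# Read each campus's enrolment value from a single row (label 10)
log_enrolls = [df.loc[10, 'Total.' + code] for code in school_codes]
print(log_enrolls)

# Inverted min-max scaling: the largest campus gets weight 0, the smallest gets weight 1
amin, amax = min(log_enrolls), max(log_enrolls)
norm_enrolls = [1 - (val - amin) / (amax - amin) for val in log_enrolls]
print(norm_enrolls)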

[9.37339415841248, 10.895460730714523, 9.495444123413165, 8.158229916959494, 10.24565781027198, 10.624809082278963, 7.807103290125981]
[0.4928401590756235, 0.0, 0.4533207811057517, 0.8863063510010682, 0.21040405229736325, 0.08763611519785186, 1.0]
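The scaling loop above can also be written as a single vectorized expression. This is a minimal sketch, assuming NumPy is available; the function name `inverted_min_max` is hypothetical, and the input is the first list printed above.

```python
import numpy as np

def inverted_min_max(values):
    """Scale values to [0, 1], then flip so the largest input maps to 0."""
    arr = np.asarray(values, dtype=float)
    return 1.0 - (arr - arr.min()) / (arr.max() - arr.min())

# Enrolment values copied from the printed output above
log_enrolls = [9.37339415841248, 10.895460730714523, 9.495444123413165,
               8.158229916959494, 10.24565781027198, 10.624809082278963,
               7.807103290125981]
norm = inverted_min_max(log_enrolls)
print(norm.round(4))
```

Note that this form, like the loop, divides by zero if all inputs are equal, so it assumes at least two distinct values.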


In [219]:
for i in range(len(school_codes)):
    df['W_Dist_' + school_codes[i]] = df['Dist.' + school_codes[i]]*norm_enrolls[i]

In [220]:
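school_codes = df['School_Codes'].unique().tolist()

# NOTE: this reads the 'Total.' (enrolment) columns, not 'Tuition.', so the values
# printed below repeat the enrolment figures; use 'Tuition.' + code to weight by tuition.
tuitions = [df.loc[10, 'Total.' + code] for code in school_codes]
print(tuitions)

# Min-max scaling to [0, 1]
amin, amax = min(tuitions), max(tuitions)
norm_tuitions = [(val - amin) / (amax - amin) for val in tuitions]
print(norm_tuitions)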

[9.37339415841248, 10.895460730714523, 9.495444123413165, 8.158229916959494, 10.24565781027198, 10.624809082278963, 7.807103290125981]
[0.5071598409243765, 1.0, 0.5466792188942483, 0.11369364899893172, 0.7895959477026367, 0.9123638848021481, 0.0]


In [221]:
for i in range(len(school_codes)):
    df['Wtuit_Dist_' + school_codes[i]] = df['Dist.' + school_codes[i]]*norm_tuitions[i]

In [222]:
dist_df = df[['Dist.SG','Dist.SC','Dist.MI','Dist.YK','Dist.YG','Dist.RY','Dist.OC']]       
df['Closest_School'] = dist_df.idxmin(axis = 1)
names = df['Closest_School'].unique().tolist()
print(names)

# Integer code for each distance column name
school_to_num = {
    'Dist.SG': 1, 'Dist.SC': 2, 'Dist.MI': 3, 'Dist.YK': 4,
    'Dist.YG': 5, 'Dist.RY': 6, 'Dist.OC': 7,
}
df['Cl_Sch_Num'] = df['Closest_School'].map(school_to_num)

['Dist.YG', 'Dist.SG', 'Dist.MI', 'Dist.SC', 'Dist.YK', 'Dist.OC', 'Dist.RY']
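Integer codes such as `Cl_Sch_Num` impose an arbitrary ordering on the campuses, which a tree-based model may split on as if it were meaningful. One alternative, sketched here on hypothetical toy distances, is to one-hot encode the closest campus; the column prefix `Closest` is an assumption made for illustration.

```python
import pandas as pd

# Toy distances (hypothetical values) for three students and three campuses
toy = pd.DataFrame({
    'Dist.SG': [1.1, 25.0, 12.0],
    'Dist.SC': [20.0, 5.8, 30.0],
    'Dist.MI': [22.0, 40.0, 2.5],
})

# Name of the nearest campus per row, with the 'Dist.' prefix removed
closest = toy.idxmin(axis=1).str.replace('Dist.', '', regex=False)

# One indicator column per campus instead of a single integer code
one_hot = pd.get_dummies(closest, prefix='Closest')
print(one_hot)
```

Whether this helps the classifier is an empirical question; it would need to be tested as another entry in the model list.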


In [223]:
df

Unnamed: 0,Campus,Level,Status,Mode_Actual,Gender,Licence,Work,Age,HomeZone,Family,...,W_Dist_YG,Wtuit_Dist_SC,Wtuit_Dist_SG,Wtuit_Dist_MI,Wtuit_Dist_OC,Wtuit_Dist_RY,Wtuit_Dist_YK,Wtuit_Dist_YG,Closest_School,Cl_Sch_Num
0,Scarborough (UTSC),UG,FT,Transit Bus,Female,0,Unknown,20,261,1,...,9.218413,7.547035,10.256060,15.966625,1.278113,7.564831,20.612253,0.0,Dist.YG,5
1,Downtown Toronto (St. George),Grad,FT,Walk,Female,1,Unknown,25,71,0,...,11.211150,11.684557,1.132351,10.738365,0.309683,2.112306,14.487481,0.0,Dist.SG,1
2,Downtown Toronto (St. George),UG,FT,Transit Bus,Female,1,Unknown,23,3714,1,...,32.555200,23.143078,23.319230,2.469580,2.692965,19.711473,26.075770,0.0,Dist.MI,3
3,Downtown Toronto (St. George),UG,FT,Walk,Male,1,Unknown,20,74,0,...,12.830410,12.232462,0.699414,10.627072,0.175233,1.827131,15.338534,0.0,Dist.SG,1
4,Downtown Toronto (St. George),Grad,FT,Walk,Male,1,Unknown,27,71,0,...,11.211150,11.684557,1.132351,10.738365,0.309683,2.112306,14.487481,0.0,Dist.SG,1
5,Downtown Toronto (St. George),UG,FT,Walk,Female,0,PT,20,72,0,...,11.887390,12.093916,1.595176,10.410494,0.330553,2.780505,14.361228,0.0,Dist.SG,1
6,Scarborough (UTSC),UG,FT,Transit Bus,Female,0,Unknown,18,600,1,...,15.841280,5.832115,25.838390,22.567536,3.049725,19.868571,17.163134,0.0,Dist.SC,2
7,Scarborough (UTSC),UG,FT,Car - Driver alone,Female,1,Unknown,21,3420,1,...,32.631030,23.471388,31.593750,13.774879,3.643132,26.143751,17.626797,0.0,Dist.YK,4
8,Downtown Toronto (St. George),Grad,FT,Bicycle,Female,1,Unknown,33,113,0,...,17.716080,14.677013,6.260033,8.623515,0.572565,5.320172,14.804682,0.0,Dist.OC,7
9,Downtown Toronto (St. George),Grad,FT,GO Train,Female,1,Unknown,31,1031,0,...,23.421850,3.878390,29.734180,26.537460,3.462287,22.944671,31.142054,0.0,Dist.SC,2


In [225]:
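# Most common mode for each (PD, campus) pair; this result is overwritten on the next line
multi_df = df.groupby(['PD', 'School_Codes'])['Mode'].agg(lambda x: x.value_counts().index[0])
# Share of each mode within each (PD, campus) pair, one column per mode
multi_df = df.groupby(['PD', 'School_Codes'])['Mode'].value_counts(normalize=True).unstack().fillna(0)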
multi_df

Unnamed: 0_level_0,Mode,Active,Auto,Transit
PD,School_Codes,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,SG,0.000000,0.333333,0.666667
0,YK,0.000000,0.250000,0.750000
1,MI,0.500000,0.277778,0.222222
1,OC,0.800000,0.008000,0.192000
1,RY,0.843750,0.006250,0.150000
1,SC,0.083333,0.145833,0.770833
1,SG,0.874664,0.008505,0.116831
1,YG,0.000000,0.090909,0.909091
1,YK,0.021429,0.071429,0.907143
2,MI,0.200000,0.333333,0.466667


In [239]:
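# Travel-time column prefix for each mode label
cols = {'Transit': 'TPTT', 'Auto': 'AIVTT', 'Active': 'WTT', 'T': 'TPTT', 'O': 'TPTT', 'A': 'AIVTT'}

# PD_df is not defined in the cells shown here; it is indexed as PD_df[school][PD]
def get_mode(school, PD):
    try:
        return PD_df[school][PD]
    except (KeyError, IndexError):
        # No entry for this combination: fall back to 'O', which maps to TPTT
        return 'O'

def func(x, code):
    # NOTE: get_mode is declared as (school, PD) but is called with (PD, code) here
    return x[cols[get_mode(x.PD, code)] + '.' + code]

# One travel-time column per campus, taken from the looked-up mode
for i in PD_df.columns.tolist()[1:8]:
    df['TT_' + i] = df.apply(lambda x: func(x, i), axis=1)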

In [254]:
three_weights = multi_df.loc[0].loc['YK'].values
print(three_weights)
three_times = [8, 9 ,7]

sum([x*y for x, y in zip(three_weights, three_times)])

[0.   0.25 0.75]


7.5
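The weighted sum above pairs weights and times by position, which only works if both lists follow the same mode order. A minimal sketch that pairs them by mode name instead is given below; the mapping `MODE_TO_PREFIX` mirrors the `cols` dictionary defined earlier, while the function name and the toy times are assumptions.

```python
# Mode label -> travel-time column prefix (mirrors the `cols` dictionary)
MODE_TO_PREFIX = {'Active': 'WTT', 'Auto': 'AIVTT', 'Transit': 'TPTT'}

def weighted_time(shares, times_by_prefix):
    """Expected travel time: sum over modes of (mode share * that mode's time)."""
    return sum(share * times_by_prefix[MODE_TO_PREFIX[mode]]
               for mode, share in shares.items())

# Mode shares for PD 0 / YK as printed above; the times are hypothetical
shares = {'Active': 0.0, 'Auto': 0.25, 'Transit': 0.75}
times = {'WTT': 8.0, 'AIVTT': 9.0, 'TPTT': 7.0}
expected_time = weighted_time(shares, times)
print(expected_time)
```

Keying both inputs by name means the result does not depend on the column order of the share table.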

In [260]:
std_weights = df['Mode'].value_counts(normalize = True).sort_index()

In [261]:
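def w_time(x):
    for school in PD_df.columns.tolist()[1:8]:
        try:
            weights = multi_df.loc[x['PD']].loc[school].values
        except KeyError:
            # Fall back to the overall mode shares when this (PD, campus) pair is absent
            weights = std_weights
        # NOTE: weights are ordered (Active, Auto, Transit) but times are (TPTT, AIVTT, WTT);
        # use ['WTT.', 'AIVTT.', 'TPTT.'] to pair each share with its own mode's time
        times = [x[prefix + school] for prefix in ['TPTT.', 'AIVTT.', 'WTT.']]

        x['TTw_' + school] = sum(w * t for w, t in zip(weights, times))
    return x

df = df.apply(w_time, axis=1)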

In [164]:
PD_df.columns.tolist()[1:8]

['MI', 'OC', 'RY', 'SC', 'SG', 'YG', 'YK']

In [262]:
df.head()

Unnamed: 0,Campus,Level,Status,Mode_Actual,Gender,Licence,Work,Age,HomeZone,Family,...,TTw_SG,TTw_YG,TTw_YK,TT_MI,TT_OC,TT_RY,TT_SC,TT_SG,TT_YG,TT_YK
0,Scarborough (UTSC),UG,FT,Transit Bus,Female,0,Unknown,20,261,1,...,133.294534,113.658165,305.947022,129.369386,81.104413,75.379996,75.468478,73.276483,64.076936,197.858689
1,Downtown Toronto (St. George),Grad,FT,Walk,Female,1,Unknown,25,71,0,...,23.113516,154.651475,221.004787,77.13505,27.804764,31.74921,78.020223,24.128386,66.918003,144.674281
2,Downtown Toronto (St. George),UG,FT,Transit Bus,Female,1,Unknown,23,3714,1,...,326.908639,388.277414,379.467915,42.439563,146.732427,147.452825,124.359872,155.55117,186.874439,85.084262
3,Downtown Toronto (St. George),UG,FT,Walk,Male,1,Unknown,20,74,0,...,15.820283,176.987726,233.912527,83.728866,19.897948,21.273948,88.246135,16.675728,77.003957,152.561525
4,Downtown Toronto (St. George),Grad,FT,Walk,Male,1,Unknown,27,71,0,...,23.113516,154.651475,221.004787,77.13505,27.804764,31.74921,78.020223,24.128386,66.918003,144.674281


In [169]:
list(df.columns)

['Mode_Actual',
 'Licence',
 'Age',
 'HomeZone',
 'Family',
 'Cars',
 'Children',
 'Adults',
 'School_Codes',
 'Segment',
 'Dist.SG',
 'Dist.SC',
 'Dist.MI',
 'Dist.YK',
 'Dist.YG',
 'Dist.RY',
 'Dist.OC',
 'WTT.SG',
 'WTT.SC',
 'WTT.MI',
 'WTT.YK',
 'WTT.YG',
 'WTT.RY',
 'WTT.OC',
 'AIVTT.SG',
 'AIVTT.SC',
 'AIVTT.MI',
 'AIVTT.YK',
 'AIVTT.YG',
 'AIVTT.RY',
 'AIVTT.OC',
 'TPTT.SG',
 'TPTT.SC',
 'TPTT.MI',
 'TPTT.YK',
 'TPTT.YG',
 'TPTT.RY',
 'TPTT.OC',
 'Total.SG',
 'Total.SC',
 'Total.MI',
 'Total.YK',
 'Total.YG',
 'Total.RY',
 'Total.OC',
 'UG.SG',
 'UG.SC',
 'UG.MI',
 'UG.YK',
 'UG.YG',
 'UG.RY',
 'UG.OC',
 'Grad.SG',
 'Grad.SC',
 'Grad.MI',
 'Grad.YK',
 'Grad.YG',
 'Grad.RY',
 'Grad.OC',
 'Tuition.SG',
 'Tuition.SC',
 'Tuition.MI',
 'Tuition.YK',
 'Tuition.YG',
 'Tuition.RY',
 'Tuition.OC',
 'Domestic.SG',
 'Domestic.SC',
 'Domestic.MI',
 'Domestic.YK',
 'Domestic.YG',
 'Domestic.RY',
 'Domestic.OC',
 'Admission_Avg.SG',
 'Admission_Avg.SC',
 'Admission_Avg.MI',
 'Admission_Avg.Y

With the data prepared, we can begin preparing the Random Forest classifier. First, we import the relevant packages.

In [263]:
# Machine learning packages
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
#from sklearn.inspection import permutation_importance
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import matthews_corrcoef

def average(l):
    return sum(l) / len(l)

We declare variables for the distance-related columns so we can pass them into the model.

In [None]:
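# Select columns by name rather than by position, so the selections survive column reordering
dist_cols = ['Dist.SG', 'Dist.SC', 'Dist.MI', 'Dist.YK', 'Dist.YG', 'Dist.RY', 'Dist.OC']
std_dists = df[dist_cols]
dist_flags = df[['SC<25', 'SG<10', 'SG<20', 'MI<10', 'MI<20', 'YK<20']]
same_zone_flags = df[['SG0', 'MI0', 'OC0', 'RY0', 'YK0']]  # SC0 is not included
four_dists = df[dist_cols[:4]]  # No OC, RY, YG
five_dists = df[dist_cols[:5]]  # No OC, RY
coords = df[['X', 'Y']]
three_dists = df[dist_cols[1:4]]  # SC, MI, YK
three_dists_b = df[dist_cols[:3]]  # SG, SC, MI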
coords_fam = df[['X', 'Y', 'Family']]
dist_fam = df[['Dist.MI','Dist.SC','Dist.YK', 'Family']]
coords_level = df[['X', 'Y', 'Level_Grad', 'Level_UG']] 
dist_level = df[['Dist.MI','Dist.SC','Dist.YK',  'Level_Grad', 'Level_UG']]
coords_status = df[['X', 'Y', 'Status_FT', 'Status_PT']]
dist_status = df[['Dist.MI','Dist.SC','Dist.YK', 'Status_FT', 'Status_PT']]
dist_level_family = df[['Dist.MI','Dist.SC','Dist.YK',  'Level_Grad', 'Level_UG', 'Family']]
coords_level_family = df[['X', 'Y', 'Level_Grad', 'Level_UG', 'Family']] 
dist_status_family = df[['Dist.MI','Dist.SC','Dist.YK', 'Family', 'Status_FT', 'Status_PT']]
coords_status_family = df[['X', 'Y', 'Status_FT', 'Status_PT', 'Family']] 
dist_status_level = df[['Dist.MI','Dist.SC','Dist.YK',  'Level_Grad', 'Level_UG','Status_FT', 'Status_PT']]
coords_status_level = df[['X', 'Y', 'Status_FT', 'Status_PT', 'Level_Grad', 'Level_UG']] 
dist_status_level_family = df[['Dist.MI','Dist.SC','Dist.YK',  'Level_Grad', 'Level_UG','Status_FT', 'Status_PT', 'Family']]
coords_status_level_family = df[['X', 'Y', 'Status_FT', 'Status_PT', 'Level_Grad', 'Level_UG','Family']] 

cars = df['Cars']
cars_flag = df['Cars_2+']
licence = df['Licence']

campus = df[['Campus_Downtown Toronto (St. George)','Campus_Glendon','Campus_Keele','Campus_Mississauga (UTM)','Campus_OCADu','Campus_RyersonU','Campus_Scarborough (UTSC)']]
mode = df[['Mode_Active', 'Mode_Transit', 'Mode_Auto']]

aivtt = df[['AIVTT.SG','AIVTT.SC','AIVTT.MI','AIVTT.YK','AIVTT.YG','AIVTT.RY','AIVTT.OC']]
tptt = df[['TPTT.SG','TPTT.SC','TPTT.MI','TPTT.YK','TPTT.YG','TPTT.RY','TPTT.OC']]
wtt = df[['WTT.SG','WTT.SC','WTT.MI','WTT.YK','WTT.YG','WTT.RY','WTT.OC']]
aivtt_3 = df[['AIVTT.SC','AIVTT.MI','AIVTT.YK']]
tptt_3 = df[['TPTT.SC','TPTT.MI','TPTT.YK']]
wtt_3 = df[['WTT.SC','WTT.MI','WTT.YK']]

income = df[['Income_High', 'Income_Low', 'Income_Unknown']]
gender = df[['Gender_Female', 'Gender_Male', 'Gender_Other']]

domestics = df[['Domestic.SG','Domestic.SC','Domestic.MI','Domestic.YK','Domestic.YG','Domestic.RY','Domestic.OC',]]

wtuit_distances = df[['Wtuit_Dist_SG','Wtuit_Dist_SC','Wtuit_Dist_MI','Wtuit_Dist_OC','Wtuit_Dist_RY','Wtuit_Dist_YG','Wtuit_Dist_YK']]

In [265]:
good_times = df[['TTw_SG','TTw_SC','TTw_MI','TTw_RY','TTw_OC','TTw_YK','TTw_YG']]

We also prepare a dataframe to store the model results, including the following metrics:
- Accuracy: The accuracy of the model as calculated by `sklearn`, also the micro precision/recall/F-1 score  
- Prec, Rec, F1: The macro precision, recall, and F-1 score  
- MCC: The Matthews Correlation Coefficient
- APO (Average Probabilities of Observations): The average predicted probabilities produced by the classifier of observed campus choices for the testing data.
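Since APO is the least standard of these metrics, a small worked example may help. The sketch below assumes a probability matrix shaped like the output of `predict_proba`, with columns in the order of the classifier's `classes_`; the function name and the toy numbers are hypothetical.

```python
import numpy as np

def average_probability_of_observed(proba, classes, y_true):
    """Mean predicted probability assigned to each observation's true class."""
    proba = np.asarray(proba, dtype=float)
    # Column position of every class label
    col = {label: j for j, label in enumerate(classes)}
    idx = np.array([col[label] for label in y_true])
    # Pick, for each row, the probability in the column of its observed class
    return float(proba[np.arange(len(idx)), idx].mean())

classes = ['MI', 'SG', 'YK']
proba = [[0.2, 0.7, 0.1],   # observed SG -> 0.7
         [0.6, 0.3, 0.1],   # observed SG -> 0.3
         [0.1, 0.1, 0.8]]   # observed YK -> 0.8
y_true = ['SG', 'SG', 'YK']
apo = average_probability_of_observed(proba, classes, y_true)
print(apo)
```

Unlike accuracy, this rewards a model for being confident when it is right and penalizes it for being confident when it is wrong, even if the predicted label is the same.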

In [266]:
# Prepare results dataframe
metric_names = ['Acc', 'Prec', 'Rec', 'F1', 'MCC', 'APO']
results = pd.DataFrame(columns=['Model'] + metric_names)

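Finally, let us run our models. Each model is fitted on `num_trials` independent train/test splits (three in the run shown here); the metrics of every trial are recorded and averaged per model afterwards.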

In [267]:
# Prepare classifier
rf = RandomForestClassifier(n_estimators=100)
y = df['School_Codes']
num_trials = 3

# Run models
for x_temp, name in ((std_dists, 'All_Dists'),
#(campus, 'Campus'),
#(pd.concat((coords, df['Segment'], df['Age']), axis=1), "Coords + Segment + Age"),
#(pd.concat((coords, df['Segment'], gender), axis=1), "Coords + Segment + Gender"),
#(pd.concat((coords, df['Segment'], df['Children']), axis=1), "Coords + Segment + Children"),
#(pd.concat((coords, df['Segment'], df['Adults']), axis=1), "Coords + Segment + Adults"),
#(pd.concat((coords, df['Segment'], df['Age'], gender), axis=1), "Coords + Segment + Age + Gender"),
#(pd.concat((coords, df['Segment'], df['Age'], df['Children']), axis=1), "Coords + Segment + Age + Children"),
#(pd.concat((coords, df['Segment'], df['Age'], df['Adults']), axis=1), "Coords + Segment + Age + Adults"),
#(pd.concat((coords, df['Segment'], df['Adults'], df['Children']), axis=1), "Coords + Segment + Adults + Children"),
#(pd.concat((coords, df['Segment'], df['Adults'], df['Children'], df['Age'], gender), axis=1), "Coords + Segment + Age + Gender + Adults + Children"),
#(pd.concat((coords, df['Segment'],campus), axis=1), "Coords + Segment + Campus"),
#(pd.concat((coords, df['Segment'],campus,mode), axis=1), "Coords + Segment + Campus + Mode"),
#(pd.concat((std_dists, df['Segment']), axis=1), "All_Dists + Segment"),
#(pd.concat((three_dists, df['Segment']), axis=1), "Three_Dists + Segment"),
(pd.concat((coords, df['Segment']), axis=1), "Coords + Segment"),
(good_times, 'TTs_w'),
#(pd.concat((coords, df['Segment'], df['Cl_Sch_Num']), axis=1), "Coords + Segment + closest_school"),
#(pd.concat((coords, df['Segment'], aivtt_3), axis=1), "Coords + Segment + aivtt"),
#(pd.concat((coords, df['Segment'], tptt_3), axis=1), "Coords + Segment + tptt"),
#(pd.concat((coords, df['Segment'], wtt_3), axis=1), "Coords + Segment + wtt"),
#(pd.concat((coords, df['Segment'], aivtt_3, tptt_3), axis=1), "Coords + Segment + aivtt + tptt"),
#(pd.concat((coords, df['Segment'], aivtt_3, tptt_3, wtt_3), axis=1), "Coords + Segment + aivtt + tptt + wtt"),
#(pd.concat((coords, df['Segment'], df['Mode_Active'],df['Mode_Auto'],df['Mode_Transit']), axis=1), "Coords + Segment + Mode"),
#(pd.concat((std_dists, df['Work_FT'],df['Work_PT'],df['Work_NW'],df['Work_Unknown']), axis=1), "All_Dists + Work"),
#(pd.concat((three_dists, df['Work_FT'],df['Work_PT'],df['Work_NW'],df['Work_Unknown']), axis=1), "Three_Dists + Work"),
#(pd.concat((coords, df['Work_FT'],df['Work_PT'],df['Work_NW'],df['Work_Unknown']), axis=1), "Coords + Work"),
#(pd.concat((std_dists, df['Family'], df['Work_FT'],df['Work_PT'],df['Work_NW'],df['Work_Unknown']), axis=1), "All_Dists + Family + Work"),
#(pd.concat((three_dists, df['Family'], df['Work_FT'],df['Work_PT'],df['Work_NW'],df['Work_Unknown']), axis=1), "Three_Dists + Family + Work"),
#(pd.concat((coords,  df['Family'], df['Work_FT'],df['Work_PT'],df['Work_NW'],df['Work_Unknown']), axis=1), "Coords + Family + Work"),  
#(pd.concat((std_dists, df['Level_UG'], df['Level_Grad'], df['Work_FT'],df['Work_PT'],df['Work_NW'],df['Work_Unknown']), axis=1), "All_Dists + Level + Work"),
#(pd.concat((three_dists, df['Level_UG'], df['Level_Grad'], df['Work_FT'],df['Work_PT'],df['Work_NW'],df['Work_Unknown']), axis=1), "Three_Dists + Level + Work"),
#(pd.concat((coords,  df['Level_UG'], df['Level_Grad'], df['Work_FT'],df['Work_PT'],df['Work_NW'],df['Work_Unknown']), axis=1), "Coords + Level + Work"),
#(coords_fam, 'Coords + Family'),
#(coords_level, 'Coords + Level'),
#(coords_status, 'Coords + Status'),
#(dist_fam, 'Three_Dists + Family'),
#(dist_level, 'Three_Dists + Level'),
#(dist_status, 'Three_Dists + Status'),
#(three_dists, 'Three_Dists'),
#(dist_level_family,'Three_Dists + Family + Level'),
#(pd.concat((dist_level_family, df['Segment']), axis=1), 'Three_Dists + Family + Level + Segment'),
#(coords_level_family,'Coords + Family + Level'),
#(pd.concat((coords_level_family, df['Segment']), axis=1), 'Coords + Family + Level + Segment'),
#(dist_status_family,'Three_Dists + Family + Status'),
#(coords_status_family,'Coords + Family + Status'),
#(dist_status_level,'Three_Dists + Level + Status'),
#(coords_status_level,'Coords + Level + Status'),
#(dist_status_level_family ,'Three_Dists + Family + Level + Status'),
#(pd.concat((dist_status_level_family, df['Segment']), axis=1), 'Three_Dists + Family + Level + Status + Segment'),
#(coords_status_level_family ,'Coords + Family + Level + Status'),
#(pd.concat((coords_status_level_family, df['Segment']), axis=1), 'Coords + Status + Family + Level + Segment'),
#(coords, 'Coords'),
#(pd.concat((std_dists, df['Family']), axis=1), "All_Dists + Family"),
#(pd.concat((std_dists, df['Level_Grad'], df['Level_UG']), axis=1), "All_Dists + Level"),
#(pd.concat((std_dists, df['Status_FT'], df['Status_PT']), axis=1), "All_Dists + Status"),
#(pd.concat((std_dists, df['Family'], df['Level_UG'], df['Level_Grad']), axis=1), "All_Dists + Family + Level"),
#(pd.concat((std_dists, df['Family'], df['Status_FT'], df['Status_PT']), axis=1), "All_Dists + Family + Status"), 
#(pd.concat((std_dists, df['Status_FT'], df['Status_PT'], df['Level_UG'], df['Level_Grad']), axis=1), "All_Dists + Status + Level"),
#(pd.concat((std_dists, df['Status_FT'], df['Status_PT'], df['Level_UG'], df['Level_Grad'], df['Family']), axis=1), "All_Dists + Family + Status + Level"),
#(pd.concat((std_dists, df['Status_FT'], df['Status_PT'], df['Level_UG'], df['Level_Grad'], df['Segment']), axis=1), "All_Dists + Status + Level + Segment"),
#(pd.concat((std_dists, df['Status_FT'], df['Status_PT'], df['Level_UG'], df['Level_Grad'], df['Family'], df['Segment']), axis=1), "All_Dists + Family + Status + Level + Segment"),   
#(pd.concat((std_dists, df['Cars']), axis=1), "All_Dists + Cars"),
#(pd.concat((std_dists, df['Licence']), axis=1), "All_Dists + Licence"),
#(pd.concat((three_dists, df['Cars']), axis=1), "Three_Dists + Cars"),
#(pd.concat((three_dists, df['Licence']), axis=1), "Three_Dists + Licence"),
#(pd.concat((coords, df['Cars']), axis=1), "Coords + Cars"),
#(pd.concat((coords, df['Licence']), axis=1), "Coords + Licence"),
#(pd.concat((std_dists, df['Cars'], df['Licence']), axis=1), "All_Dists + Cars + Licence"),
#(pd.concat((three_dists, df['Cars'], df['Licence']), axis=1), "Three_Dists + Cars + Licence"),
#(pd.concat((coords, df['Cars'], df['Licence']), axis=1), "Coords + Cars + Licence"),
#(three_dists_b, 'Three_b'),
#(five_dists, 'Five'),
#(four_dists, 'Four'),
#(dist_flags, 'Flags'),
#(same_zone_flags, 'SameZone'),
#(df['HomeZone'].values.reshape(-1, 1), 'HZ'),
#(df['PD'].values.reshape(-1, 1), 'PD'),
#(df['HomeZone'].isnull().values.reshape(-1,1), "Null"),
#(pd.concat((four_dists, df['PD']), axis=1), "Four+PD"),
#(pd.concat((coords, std_dists), axis=1), "Std+Coords"),
#(pd.concat((coords, five_dists), axis=1), "Five+Coords"),
#(pd.concat((coords, four_dists), axis=1), "Four+Coords"),
#(pd.concat((coords, std_dists, df['HomeZone']), axis=1), "Std+Coords+HZ"),
#(pd.concat((coords, five_dists, df['HomeZone']), axis=1), "Five+Coords+HZ"),
#(pd.concat((coords, four_dists, df['HomeZone']), axis=1), "Four+Coords+HZ"),
#(pd.concat((coords, four_dists), axis=1), "Four+Coords+HZ+PD"),
#(pd.concat((coords, four_dists, df['HomeZone']), axis=1), "Four+Coords+PD"),
#(pd.concat((df['PD'], dist_flags), axis=1), "PD+Flags"),
#(pd.concat((four_dists, dist_flags), axis=1), "Four+Flags"),
#(pd.concat((dist_flags, std_dists), axis=1), "Std+Flags"),
#(pd.concat((dist_flags, five_dists), axis=1), "Five+Flags"),
(pd.concat((std_dists, dist_flags, same_zone_flags, coords, df['HomeZone'], df['PD']), axis=1), "Full")):
    # Prepare metrics
    #metrics = {}
    #metrics_list = [None,None,None,None,None,None]
    #for metric in metric_names:
    #    metrics[metric] = []
    
    # Run trials
    for i in range(num_trials):
        metrics_list = [None,None,None,None,None,None]
        
        # Split data and run model
        X_train, X_test, y_train, y_test = train_test_split(x_temp, y, test_size=0.3)
        rf.fit(X_train, y_train, sample_weight = df['Exp_Segment'].loc[X_train.index])
        y_pred = rf.predict(X_test)
        
        # Metrics
        # Compute macro precision, recall and F1 in a single call
        prec, rec, f1, _ = precision_recall_fscore_support(y_test, y_pred, average='macro')
        metrics_list[0] = rf.score(X_test, y_test)
        metrics_list[1] = prec
        metrics_list[2] = rec
        metrics_list[3] = f1
        metrics_list[4] = matthews_corrcoef(y_test, y_pred)
        
        # APO
        schools = list(rf.classes_)
        probs = pd.concat((y_test.reset_index(drop=True), pd.DataFrame(rf.predict_proba(X_test))), axis=1)
        metrics_list[5] = (probs.apply(lambda z: z[schools.index(z.School_Codes)], axis=1).mean())
    
        # Add results to dataframe
        #ave_metrics = [average(v) for v in metrics.values()]
        #results.loc[len(results)] = [name, num_trials] + ave_metrics
        results.loc[len(results)] = [name] + metrics_list

The last model we ran was the "Full" model which included all the relevant columns. We can take a look at the feature importances from the last trial of this model:

In [25]:
#pd.DataFrame({'Features' : x_temp.columns, 'Importance': rf.feature_importances_}).sort_values('Importance', ascending=False)

It seems that the `HomeZone` column, coordinates, and distances were most important in this model. The planning district and flags were less important.

We can look at a correlation matrix for the different metrics.

In [26]:
#results.corr()

Interestingly, the correlation between Accuracy and APO is only 0.657. Macro F-1 score was quite closely correlated with APO so we may prioritize it as a computationally faster and more well-known indicator of effectiveness in the future.

At this point, we should look at the results for our different models:

In [268]:
results

Unnamed: 0,Model,Acc,Prec,Rec,F1,MCC,APO
0,All_Dists,0.476555,0.377252,0.321936,0.324502,0.264604,0.396069
1,All_Dists,0.468431,0.370397,0.318018,0.313288,0.257758,0.39304
2,All_Dists,0.459378,0.336048,0.312827,0.307671,0.252224,0.389166
3,Coords + Segment,0.474698,0.342227,0.322648,0.322173,0.272702,0.414513
4,Coords + Segment,0.467502,0.34743,0.312036,0.312333,0.262828,0.413656
5,Coords + Segment,0.469127,0.340963,0.322203,0.32261,0.271501,0.413845
6,TTs_w,0.46611,0.393789,0.310064,0.305498,0.245671,0.395863
7,TTs_w,0.456592,0.387065,0.306089,0.309382,0.242113,0.390751
8,TTs_w,0.468199,0.400745,0.315205,0.31809,0.252902,0.393429
9,Full,0.457985,0.367033,0.313947,0.319069,0.249098,0.390656


In [269]:
results.groupby('Model').mean().sort_values('APO', ascending=False)

Unnamed: 0_level_0,Acc,Prec,Rec,F1,MCC,APO
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Coords + Segment,0.470443,0.34354,0.318962,0.319039,0.26901,0.414004
TTs_w,0.463634,0.393866,0.310452,0.31099,0.246895,0.393348
All_Dists,0.468121,0.361233,0.317593,0.315154,0.258195,0.392759
Full,0.46139,0.368957,0.312915,0.312271,0.249426,0.392525


In [253]:
results.groupby('Model').mean().to_csv('Variable_Runs_Results.csv')

PermissionError: [Errno 13] Permission denied: 'Variable_Runs_Results.csv'

Notice that there are significant fluctuations in these metrics across trials. Because of this, we are considering all the models with 0.38 < APO < 0.39 to be similarly effective. These models include any combination of distance columns, home zone, and/or coordinates. The different combinations of these variables, or the inclusion of flags, planning district, etc., did not seem to cause significant increases in effectiveness.
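One way to quantify this trial-to-trial fluctuation is to report the standard deviation next to the mean for each model. The sketch below uses hypothetical APO values and a hypothetical frame name `trial_results`; on the real data the same aggregation could be applied to the `results` dataframe.

```python
import pandas as pd

# Hypothetical per-trial APO values for two models
trial_results = pd.DataFrame({
    'Model': ['A', 'A', 'A', 'B', 'B', 'B'],
    'APO':   [0.40, 0.39, 0.41, 0.38, 0.39, 0.40],
})

# Mean and sample standard deviation of APO for each model
summary = trial_results.groupby('Model')['APO'].agg(['mean', 'std'])
print(summary)
```

If the gap between two models' means is comparable to their standard deviations, as in this toy case, the difference should not be read as a real improvement without more trials.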

Interestingly, the "PD + Flags" model offered a significant improvement over both the "PD" and "Flags" models individually, approaching the accuracy of the stronger models. Of note, however, was that this model had a very high micro accuracy.

Finally, it seems that models with only four or five distance columns are no less effective than models with all seven. We may keep this reduced set of distance columns in the future.