## Feature Engineering and Selection: Distance

This notebook details another attmpt at feature engineering and selection for distance-related variables. It is motivated by the new descriptive analysis results on distance and the idea to use zone centroid coordinates as a features in the Random Forest model.

First, let's load the data, taking care to remove `Other` students.

In [1]:
import pandas as pd

df = pd.read_csv('../../Data/SMTO_2015/SMTO_2015_Complete_Input.csv')
df = df[df['Level'] != 'Other']
df.head()

Unnamed: 0,Campus,Level,Status,Mode_Actual,Gender,Licence,Work,Age,HomeZone,Family,...,Domestic.OC,Admission_Avg.SG,Admission_Avg.SC,Admission_Avg.MI,Admission_Avg.YK,Admission_Avg.YG,Admission_Avg.RY,Admission_Avg.OC,Exp_Segment,Exp_Level
0,Scarborough (UTSC),UG,FT,Transit Bus,Female,0,Unknown,20,261,1,...,0.8998,0.893,0.841,0.83,0.817,0.817,0.84,0.824,0.944738,0.944738
1,Downtown Toronto (St. George),Grad,FT,Walk,Female,1,Unknown,25,71,0,...,0.6786,0.893,0.841,0.83,0.817,0.817,0.84,0.824,0.986085,0.986085
2,Downtown Toronto (St. George),UG,FT,Transit Bus,Female,1,Unknown,23,3714,1,...,0.8998,0.893,0.841,0.83,0.817,0.817,0.84,0.824,0.91927,0.91927
3,Downtown Toronto (St. George),UG,FT,Walk,Male,1,Unknown,20,74,0,...,0.8998,0.893,0.841,0.83,0.817,0.817,0.84,0.824,0.91927,0.91927
4,Downtown Toronto (St. George),Grad,FT,Walk,Male,1,Unknown,27,71,0,...,0.6786,0.893,0.841,0.83,0.817,0.817,0.84,0.824,0.986085,0.986085


Let us quickly look at the correlation of the distance columns.

In [2]:
df.iloc[:, 17:24].corr()

Unnamed: 0,Dist.SG,Dist.SC,Dist.MI,Dist.YK,Dist.YG,Dist.RY,Dist.OC
Dist.SG,1.0,0.584396,0.588353,0.731334,0.864714,0.99741,0.998566
Dist.SC,0.584396,1.0,-0.139397,0.51181,0.847196,0.618299,0.56758
Dist.MI,0.588353,-0.139397,1.0,0.464808,0.2728,0.549017,0.600958
Dist.YK,0.731334,0.51181,0.464808,1.0,0.817802,0.715375,0.71221
Dist.YG,0.864714,0.847196,0.2728,0.817802,1.0,0.878822,0.848379
Dist.RY,0.99741,0.618299,0.549017,0.715375,0.878822,1.0,0.996756
Dist.OC,0.998566,0.56758,0.600958,0.71221,0.848379,0.996756,1.0


We notice that `Dist.SG`, `Dist.RY`, and `Dist.OC` are very highly correlated. This is not surprising as these campuses are in close proximity. Hence, we can try running models with only one of these columns included. Furthermore, the `Dist.YG` column is correlated with those three columns and with `Dist.SC`. We also try excluding that column from the model input.

In addition to changing the number of distance columns passed, we also engineer some additional features. We add six "flag" columns which indicate whether a student's HomeZone is within a certain distance of particular campuses. These thresholds were determined by observing trends found in the descriptive analysis. Finally, we also add flags for whether a student lives in the same zone as each campus.

In [3]:
# Adding flag columns for distances
df['SC<25'] = df['Dist.SC'] < 25
df['SG<10'] = df['Dist.SG'] < 10
df['SG<20'] = df['Dist.SG'] < 20
df['MI<10'] = df['Dist.MI'] < 10
df['MI<20'] = df['Dist.MI'] < 20
df['YK<20'] = df['Dist.YK'] < 20

# Adding flags for same zone as campus
school_codes = list(df['School_Codes'].unique())
for school in school_codes:
    if school == 'YG':
        continue
    df[school + '0'] = df['Dist.' + school] == 0
print(list(df.columns))

['Campus', 'Level', 'Status', 'Mode_Actual', 'Gender', 'Licence', 'Work', 'Age', 'HomeZone', 'Family', 'Cars', 'Children', 'Adults', 'Income', 'Mode', 'School_Codes', 'Segment', 'Dist.SG', 'Dist.SC', 'Dist.MI', 'Dist.YK', 'Dist.YG', 'Dist.RY', 'Dist.OC', 'WTT.SG', 'WTT.SC', 'WTT.MI', 'WTT.YK', 'WTT.YG', 'WTT.RY', 'WTT.OC', 'AIVTT.SG', 'AIVTT.SC', 'AIVTT.MI', 'AIVTT.YK', 'AIVTT.YG', 'AIVTT.RY', 'AIVTT.OC', 'TPTT.SG', 'TPTT.SC', 'TPTT.MI', 'TPTT.YK', 'TPTT.YG', 'TPTT.RY', 'TPTT.OC', 'Total.SG', 'Total.SC', 'Total.MI', 'Total.YK', 'Total.YG', 'Total.RY', 'Total.OC', 'UG.SG', 'UG.SC', 'UG.MI', 'UG.YK', 'UG.YG', 'UG.RY', 'UG.OC', 'Grad.SG', 'Grad.SC', 'Grad.MI', 'Grad.YK', 'Grad.YG', 'Grad.RY', 'Grad.OC', 'Tuition.SG', 'Tuition.SC', 'Tuition.MI', 'Tuition.YK', 'Tuition.YG', 'Tuition.RY', 'Tuition.OC', 'Domestic.SG', 'Domestic.SC', 'Domestic.MI', 'Domestic.YK', 'Domestic.YG', 'Domestic.RY', 'Domestic.OC', 'Admission_Avg.SG', 'Admission_Avg.SC', 'Admission_Avg.MI', 'Admission_Avg.YK', 'Admiss

Now, let us load coordinate and planning district information. To avoid scaling issues, we normalize the coordinates so that the values are between 0 and 1, inclusive. We also plot their correlations.

In [4]:
# Load zone coordinates
zones = pd.read_csv('../../Data/Zones.csv')
zones.set_index('Zone#', inplace=True)

# Normalize from 0 to 1
zones['X'] = (zones['X'] - zones['X'].min()) / (zones['X'].max() - zones['X'].min())
zones['Y'] = (zones['Y'] - zones['Y'].min()) / (zones['Y'].max() - zones['Y'].min())
zones.corr()

Unnamed: 0,PD,X,Y
PD,1.0,-0.456036,-0.371999
X,-0.456036,1.0,0.633247
Y,-0.371999,0.633247,1.0


Interestingly, the X- and Y-coordinates are moderately positively correlated. This might be indicative of the shape of the GTA being tilted from the southwest to the northeast due to Lake Ontario, as well as York Region's large extent into the northeast.

Let us add this zone information to our dataframe.

In [5]:
# Add zone information to df
temp = pd.DataFrame([[a[i] for a in (zones['PD'], zones['X'], zones['Y'])] for i in df['HomeZone']], columns=['PD', 'X', 'Y'], index=df.index)
df = pd.concat((df, temp), axis=1)

With the data prepared, we can begin preparing the Random Forest classifier. First, we import the relevant packages.

In [6]:
# Machine learning packages
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import matthews_corrcoef

def average(l):
    return sum(l) / len(l)

We declare variables for the distance-related columns so we can pass them into the model.

In [7]:
std_dists = df.iloc[:, 17:24]
dist_flags = df.iloc[:, 89:95]
same_zone_flags = df.iloc[:, 96:101]
four_dists = df.iloc[:, 17:21] # No OC, RY, YG
five_dists = df.iloc[:, 17:22] # No OC, RY
coords = df[['X', 'Y']]

We also prepare a dataframe to store the model results, including the following metrics:
- Accuracy: The accuracy of the model as calculated by `sklearn`, also the micro precision/recall/F-1 score  
- Prec, Rec, F1: The macro precision, recall, and F-1 score  
- MCC: The Matthews Correlation Coefficient
- APO (Average Probabilities of Observations): The average predicted probabilities produced by the classifier of observed campus choices for the testing data.

In [8]:
# Prepare results dataframe
metric_names = ['Acc', 'Prec', 'Rec', 'F1', 'MCC', 'APO']
results = pd.DataFrame(columns=['Model', 'Trials'] + metric_names)

Finally, let us run our models. For each model, we average the metrics across 10 trials.

In [9]:
# Prepare classifier
rf = RandomForestClassifier(n_estimators=100)
y = df['School_Codes']
num_trials = 10

# Run models
for x_temp, name in ((std_dists, 'Std'),
(five_dists, 'Five'),
(four_dists, 'Four'),
(dist_flags, 'Flags'),
(same_zone_flags, 'SameZone'),
(df['HomeZone'].values.reshape(-1, 1), 'HZ'),
(df['PD'].values.reshape(-1, 1), 'PD'),
(coords, 'Coords'),
(df['HomeZone'].isnull().values.reshape(-1,1), "Null"),
(pd.concat((four_dists, df['PD']), axis=1), "Four+PD"),
(pd.concat((coords, std_dists), axis=1), "Std+Coords"),
(pd.concat((coords, five_dists), axis=1), "Five+Coords"),
(pd.concat((coords, four_dists), axis=1), "Four+Coords"),
(pd.concat((coords, std_dists, df['HomeZone']), axis=1), "Std+Coords+HZ"),
(pd.concat((coords, five_dists, df['HomeZone']), axis=1), "Five+Coords+HZ"),
(pd.concat((coords, four_dists, df['HomeZone']), axis=1), "Four+Coords+HZ"),
(pd.concat((coords, four_dists), axis=1), "Four+Coords+HZ+PD"),
(pd.concat((coords, four_dists, df['HomeZone']), axis=1), "Four+Coords+PD"),
(pd.concat((df['PD'], dist_flags), axis=1), "PD+Flags"),
(pd.concat((four_dists, dist_flags), axis=1), "Four+Flags"),
(pd.concat((dist_flags, std_dists), axis=1), "Std+Flags"),
(pd.concat((dist_flags, five_dists), axis=1), "Five+Flags"),
(pd.concat((std_dists, dist_flags, same_zone_flags, coords, df['HomeZone'], df['PD']), axis=1), "Full")):
    # Prepare metrics
    metrics = {}
    for metric in metric_names:
        metrics[metric] = []
    
    # Run trials
    for i in range(num_trials):
        # Split data and run model
        X_train, X_test, y_train, y_test = train_test_split(x_temp, y, test_size=0.3)
        rf.fit(X_train, y_train)
        y_pred = rf.predict(X_test)
        
        # Metrics
        metrics['Acc'].append(rf.score(X_test, y_test))
        [a.append(b) for a,b in zip([metrics['Prec'], metrics['Rec'], metrics['F1']], precision_recall_fscore_support(y_test, y_pred, average = 'macro')[:3])]
        metrics['MCC'].append(matthews_corrcoef(y_test, y_pred))
        
        # APO
        schools = list(rf.classes_)
        probs = pd.concat((y_test.reset_index(drop=True), pd.DataFrame(rf.predict_proba(X_test))), axis=1)
        metrics['APO'].append(probs.apply(lambda z: z[schools.index(z.School_Codes)], axis=1).mean())
    
    # Add results to dataframe
    ave_metrics = [average(v) for v in metrics.values()]
    results.loc[len(results)] = [name, num_trials] + ave_metrics

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_pr

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
  _warn_prf(average, modifier, msg_start, len(result))
  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
  _warn_prf(average, modifier, msg_start, len(result))
  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
  _warn_prf(average, modifier, msg_start, len(result))
  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
  _warn_prf(average, modifier, msg_start, len(result))
  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
  _warn_prf(average, modifier, msg_start, len(result))
  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
  _warn_prf(average, modifier, msg_start, len(result))
  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
  _warn_prf(average, modifier, msg_start, len(result))
  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
  _warn_prf

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


The last model we ran was the "Full" model which included all the relevant columns. We can take a look at the feature importances from the last trial of this model:

In [10]:
pd.DataFrame({'Features' : x_temp.columns, 'Importance': rf.feature_importances_}).sort_values('Importance', ascending=False)

Unnamed: 0,Features,Importance
20,HomeZone,0.10567
18,X,0.099743
0,Dist.SG,0.096465
5,Dist.RY,0.095083
3,Dist.YK,0.092927
19,Y,0.090098
6,Dist.OC,0.089544
1,Dist.SC,0.083458
2,Dist.MI,0.075097
4,Dist.YG,0.070717


It seems that the `HomeZone` column, coordinates, and distances were most important in this model. The planning district and flags were less important.

We can look at a correlation matrix for the different metrics.

In [11]:
results.corr()

Unnamed: 0,Acc,Prec,Rec,F1,MCC,APO
Acc,1.0,0.330238,0.908178,0.767161,0.932443,0.657454
Prec,0.330238,1.0,0.411571,0.536721,0.555626,0.551773
Rec,0.908178,0.411571,1.0,0.952044,0.973933,0.889206
F1,0.767161,0.536721,0.952044,1.0,0.918007,0.985582
MCC,0.932443,0.555626,0.973933,0.918007,1.0,0.84449
APO,0.657454,0.551773,0.889206,0.985582,0.84449,1.0


Interestingly, the correlation between Accuracy and APO is only 0.657. Macro F-1 score was quite closely correlated with APO so we may prioritize it as a computationally faster and more well-known indicator of effectiveness in the future.

At this point, we should look at the results for our different models:

In [12]:
results.groupby('Model').mean().to_csv('ModelResults.csv')
results.groupby('Model').mean().sort_values('APO', ascending=False)

Unnamed: 0_level_0,Acc,Prec,Rec,F1,MCC,APO
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Five,0.463533,0.346844,0.306282,0.311087,0.246908,0.388892
Four+Flags,0.461351,0.335119,0.299525,0.303699,0.242973,0.388706
Four+Coords+PD,0.461513,0.341621,0.301287,0.306288,0.243453,0.388592
Four,0.461653,0.342845,0.303704,0.307152,0.243648,0.387946
Four+Coords,0.459285,0.33419,0.297033,0.301418,0.23797,0.387832
Coords,0.462604,0.341306,0.301837,0.306506,0.243907,0.38771
Full,0.462233,0.347666,0.30251,0.307094,0.242642,0.387631
Four+PD,0.459123,0.340834,0.298733,0.303926,0.240495,0.387495
Four+Coords+HZ+PD,0.460237,0.336564,0.299432,0.304184,0.240297,0.387486
Five+Coords,0.461537,0.349905,0.306027,0.309709,0.244731,0.387361


Notice that there are significant fluctuations in these metrics across trials. Because of this, we are considering all the models with 0.38 < APO < 0.39 to be similarly effective. These models include any combination of distance columns, home zone, and/or coordinates. The different combinations of these variables, or the inclusion of flags, planning district, etc., did not seem to cause significant increases in effectiveness.

Interestingly, the "PD + Flags" model offered a significant imporvement on both the "PD" and "Flags" models individually, approaching the accuracy of the stronger models. Of note, however, was that this model had a very high micro accuracy.

Finally, it seems that including only four or five distance columns results in models that are no less effective than for all seven distance models. This might be an approach we will keep in the future.