# Distance Feature Engineering and Selection for Location Choice Model

In this notebook, we analyze the distance information to be used for our location choice model. In particular, we engineer several input formats for the distance data and test them with the random forest classifier. We then identify the features we wish to keep.

### Load Data

First, let's load our data.

In [1]:
import pandas as pd

df = pd.read_csv('../Data/SMTO_2015/SMTO_2015_Complete_Input.csv')
df.head()

| | Campus | Level | Status | Mode_Actual | Gender | Licence | Work | Age | Family | Cars | ... | Domestic.OC | Admission_Avg.SG | Admission_Avg.SC | Admission_Avg.MI | Admission_Avg.YK | Admission_Avg.YG | Admission_Avg.RY | Admission_Avg.OC | Exp_Segment | Exp_Level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Scarborough (UTSC) | UG | FT | Transit Bus | Female | 0 | Unknown | 20 | Family | 1 | ... | 0.8998 | 0.893 | 0.841 | 0.83 | 0.817 | 0.817 | 0.84 | 0.824 | 0.383705 | 0.383705 |
| 1 | Downtown Toronto (St. George) | Grad | FT | Walk | Female | 1 | Unknown | 25 | Other | 0 | ... | 0.6786 | 0.893 | 0.841 | 0.83 | 0.817 | 0.817 | 0.84 | 0.824 | 0.986085 | 0.986085 |
| 2 | Downtown Toronto (St. George) | UG | FT | Transit Bus | Female | 1 | Unknown | 23 | Family | 1 | ... | 0.8998 | 0.893 | 0.841 | 0.83 | 0.817 | 0.817 | 0.84 | 0.824 | 0.91927 | 0.91927 |
| 3 | Downtown Toronto (St. George) | UG | FT | Walk | Male | 1 | Unknown | 20 | Roommates | 0 | ... | 0.8998 | 0.893 | 0.841 | 0.83 | 0.817 | 0.817 | 0.84 | 0.824 | 0.91927 | 0.91927 |
| 4 | Downtown Toronto (St. George) | Grad | FT | Walk | Male | 1 | Unknown | 27 | Other | 0 | ... | 0.6786 | 0.893 | 0.841 | 0.83 | 0.817 | 0.817 | 0.84 | 0.824 | 0.986085 | 0.986085 |


### Feature Engineering

There are several formats in which we can pass distance information into the random forest model. These include:

- Standard: Columns containing the distances from each student's home zone to each campus zone  
- Closest labels: Columns n from 0 to 6 containing the label of the campus that is nth closest to the student's home zone  
- Closest distances: Columns n from 0 to 6 containing the distance to the campus that is nth closest to the student's home zone 

Note that the closest labels are categorical, so they are one-hot encoded using `get_dummies()`.
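
To make the three formats concrete, here is a minimal sketch on a toy table. The distances and the three-campus subset are hypothetical; only the `Dist.*` naming and the `nsmallest`/`get_dummies` pattern follow this notebook.

```python
import pandas as pd

# Hypothetical distances from two home zones to three campuses.
toy = pd.DataFrame(
    {"Dist.SG": [5.0, 30.0], "Dist.SC": [20.0, 4.0], "Dist.MI": [12.0, 45.0]}
)

# Column n holds the label of the n-th closest campus for each row.
closest_labels = pd.DataFrame(
    toy.apply(lambda row: row.nsmallest(3).index.tolist(), axis=1).tolist(),
    index=toy.index,
)

# Column n holds the distance to the n-th closest campus for each row.
closest_distances = pd.DataFrame(
    toy.apply(lambda row: row.nsmallest(3).tolist(), axis=1).tolist(),
    index=toy.index,
)

# One-hot encode the labels: "0_Dist.SG" marks rows whose closest campus is SG.
encoded_labels = pd.get_dummies(closest_labels)

print(closest_labels)
print(closest_distances)
print(encoded_labels.columns.tolist())
```

Each label column expands into one indicator column per campus that appears in it, which is why the one-hot format produces many more columns than the other two.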

In [2]:
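# Target: the school code for each student
y = df['School_Codes']

# Columns 16 to 22 hold the standard distances (Dist.SG ... Dist.OC).
# For each row, rank the campuses from closest to farthest.
closest_labels = pd.DataFrame(df.iloc[:, 16:23].apply(lambda x: x.nsmallest(7).index.tolist(), axis=1).tolist(), index=df.index)
closest_distances = pd.DataFrame(df.iloc[:, 16:23].apply(lambda x: x.nsmallest(7).tolist(), axis=1).tolist(), index=df.index)
x = pd.concat((df.iloc[:, 16:23], closest_labels, closest_distances), axis=1)
# One-hot encode the label columns; numeric columns are left unchanged
x = pd.get_dummies(x)
x.columns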

Index([  'Dist.SG',   'Dist.SC',   'Dist.MI',   'Dist.YK',   'Dist.YG',
         'Dist.RY',   'Dist.OC',           0,           1,           2,
                 3,           4,           5,           6, '0_Dist.MI',
       '0_Dist.OC', '0_Dist.RY', '0_Dist.SC', '0_Dist.SG', '0_Dist.YG',
       '0_Dist.YK', '1_Dist.MI', '1_Dist.OC', '1_Dist.RY', '1_Dist.SC',
       '1_Dist.SG', '1_Dist.YG', '1_Dist.YK', '2_Dist.MI', '2_Dist.OC',
       '2_Dist.RY', '2_Dist.SC', '2_Dist.SG', '2_Dist.YG', '2_Dist.YK',
       '3_Dist.MI', '3_Dist.OC', '3_Dist.RY', '3_Dist.SC', '3_Dist.SG',
       '3_Dist.YG', '3_Dist.YK', '4_Dist.MI', '4_Dist.OC', '4_Dist.RY',
       '4_Dist.SC', '4_Dist.SG', '4_Dist.YG', '4_Dist.YK', '5_Dist.MI',
       '5_Dist.OC', '5_Dist.RY', '5_Dist.SC', '5_Dist.YG', '5_Dist.YK',
       '6_Dist.MI', '6_Dist.SC'],
      dtype='object')

### Running Model

Now, let us run the model with all of the extracted features. We report the following performance metrics:

- Accuracy: The accuracy of the model, also the micro precision/recall/f-1 score  
- PRF1 Mac: The macro precision, recall, and f-1 score  
- Matthews: The Matthews Correlation Coefficient (MCC)
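
The relationships between these metrics can be checked on a small example. This is a sketch with hypothetical labels for six students, not data from the survey. It illustrates that micro-averaged precision, recall, and F1 all equal accuracy for single-label classification, and that macro recall equals balanced accuracy.

```python
from math import isclose

from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    matthews_corrcoef,
    precision_recall_fscore_support,
)

# Hypothetical true and predicted campuses for six students.
y_true = ["SG", "SG", "SG", "SC", "SC", "MI"]
y_pred = ["SG", "SG", "SC", "SC", "SG", "SG"]

acc = accuracy_score(y_true, y_pred)

# Micro averaging pools all predictions, so P = R = F1 = accuracy.
micro_p, micro_r, micro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro", zero_division=0
)

# Macro averaging gives every campus equal weight, whatever its size.
macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)

bal_acc = balanced_accuracy_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)

print("accuracy vs micro P/R/F1:", acc, micro_p, micro_r, micro_f1)
print("macro recall vs balanced accuracy:", macro_r, bal_acc)
print("MCC:", mcc)
```

Because the macro scores weight small campuses as heavily as large ones, they tend to sit below accuracy when the classes are imbalanced.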

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import matthews_corrcoef

rf = RandomForestClassifier(n_estimators=100)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

print("Accuracy\t", rf.score(X_test, y_test)) # Acc
print("PRF1 Mac\t", precision_recall_fscore_support(y_test, y_pred, average = 'macro')[:3]) # Rec = Bal Acc
print("Matthews\t", matthews_corrcoef(y_test, y_pred))

Accuracy	 0.4586175534333788
PRF1 Mac	 (0.37854012613995025, 0.31216688125447584, 0.321015143183594)
Matthews	 0.24392244699637025


### Feature Selection

Now, we can take a look at which features were most important. We analyze two such metrics:

- FeatImportance: The impurity-based feature importances  
- PermImportance: The permutation importance of each feature, averaged over 5 trials  

Including the permutation importance helps mitigate the bias of the impurity-based feature importance towards high-cardinality features.
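
As a rough illustration of how permutation importance is assembled, the sketch below uses synthetic data from `make_classification` rather than the survey data. It writes out `n_repeats=5` (the scikit-learn default) and scores on a held-out split, which is a common choice; the split and seeds here are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the distance features (not the SMTO data).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Each feature is shuffled n_repeats times; the score drop is recorded each time.
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)

# One row per feature, one column per shuffle.
print(result.importances.shape)
# importances_mean averages the score drops over the shuffles.
print(result.importances_mean)
print(result.importances_std)
```

The standard deviation across shuffles gives a sense of how stable each feature's importance estimate is.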

In [4]:
features = pd.DataFrame(index = X_test.columns)
features['FeatImportance'] = rf.feature_importances_

result = permutation_importance(rf, X_train, y_train)
features['PermImportance'] = result.importances_mean

features.head()

| | FeatImportance | PermImportance |
|---|---|---|
| Dist.SG | 0.071306 | 0.002709 |
| Dist.SC | 0.062737 | 0.003937 |
| Dist.MI | 0.056874 | 0.002495 |
| Dist.YK | 0.066629 | 0.005282 |
| Dist.YG | 0.04753 | 0.001813 |


We've generated a dataframe containing the importance metrics for each feature. Let's see which features were identified as most important by each metric.

In [5]:
# Names of the 14 highest-ranked features under each metric, sorted alphabetically
print(sorted(map(str, features['PermImportance'].sort_values(ascending=False)[:14].index)))
print(sorted(map(str, features['FeatImportance'].sort_values(ascending=False)[:14].index)))

['0', '0_Dist.MI', '1', '2', '5', '5_Dist.MI', '6', 'Dist.MI', 'Dist.OC', 'Dist.RY', 'Dist.SC', 'Dist.SG', 'Dist.YG', 'Dist.YK']
['0', '1', '2', '3', '4', '5', '6', 'Dist.MI', 'Dist.OC', 'Dist.RY', 'Dist.SC', 'Dist.SG', 'Dist.YG', 'Dist.YK']


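From this we see that the two metrics largely agree: 12 of the top 14 features are shared. The impurity-based ranking selects exactly the standard distances and closest distances, while the permutation ranking replaces closest distances `3` and `4` with the one-hot labels `0_Dist.MI` and `5_Dist.MI`.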
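
A quick set comparison of the two printed lists makes the overlap explicit. The lists below are copied from the output above; the variable names are only for illustration.

```python
# Top-14 features by permutation importance (copied from the output above).
perm_top = {
    "0", "0_Dist.MI", "1", "2", "5", "5_Dist.MI", "6",
    "Dist.MI", "Dist.OC", "Dist.RY", "Dist.SC", "Dist.SG", "Dist.YG", "Dist.YK",
}
# Top-14 features by impurity-based importance (copied from the output above).
feat_top = {
    "0", "1", "2", "3", "4", "5", "6",
    "Dist.MI", "Dist.OC", "Dist.RY", "Dist.SC", "Dist.SG", "Dist.YG", "Dist.YK",
}

shared = perm_top & feat_top
only_perm = sorted(perm_top - feat_top)
only_feat = sorted(feat_top - perm_top)

print(len(shared), "features appear in both lists")
print("permutation only:", only_perm)
print("impurity only:", only_feat)
```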

### Verification

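To cross-check this finding, we can use `sklearn`'s `SelectFromModel`. Notice that it selects the same 14 features as the impurity-based ranking, which is expected because it thresholds the forest's impurity-based importances.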

In [6]:
from sklearn.feature_selection import SelectFromModel
sel = SelectFromModel(RandomForestClassifier(n_estimators = 100))
sel.fit(X_train, y_train)
print(sorted(map(str, X_train.columns[sel.get_support()])))

['0', '1', '2', '3', '4', '5', '6', 'Dist.MI', 'Dist.OC', 'Dist.RY', 'Dist.SC', 'Dist.SG', 'Dist.YG', 'Dist.YK']
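
For context on how this selection is made: when no `threshold` is given and the estimator is a random forest, `SelectFromModel` keeps the features whose importance is at least the mean importance. The sketch below checks this on synthetic data; the dataset and seeds are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (not the SMTO data).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)

sel = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0))
sel.fit(X, y)

# Impurity-based importances of the forest fitted inside the selector.
importances = sel.estimator_.feature_importances_

# With the default threshold, the cutoff is the mean importance.
print(sel.threshold_, importances.mean())

# A feature is kept when its importance reaches the cutoff.
manual_mask = importances >= sel.threshold_
print(manual_mask)
print(sel.get_support())
```

Since the selector relies on the same kind of importances as `feature_importances_`, agreement with the impurity-based ranking is a consistency check rather than independent evidence.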


We also compare the results using all the extracted features with the results using only the selected ones, and using only standard distances (the benchmark model).

In [7]:
from statistics import stdev
def average(l):
    return sum(l) / len(l)


schools = list(rf.classes_)

for x_temp, name in ((x.iloc[:, :7], "Benchmark"), (x.iloc[:, :14], "Selected"), (x, "Full")):
    acc = []
    prec = []
    rec = []
    f_1 = []
    mcc = []
    apo = []
    for _ in range(10):  # 10 random train/test splits
        X_train, X_test, y_train, y_test = train_test_split(x_temp, y, test_size=0.3)
        rf.fit(X_train, y_train)
        y_pred = rf.predict(X_test)        
        acc.append(rf.score(X_test, y_test))
        p, r, f = precision_recall_fscore_support(y_test, y_pred, average = 'macro')[:3]
        prec.append(p)
        rec.append(r)
        f_1.append(f)
        mcc.append(matthews_corrcoef(y_test, y_pred))
        probs = rf.predict_proba(X_test)
        results = pd.concat((y_test.reset_index(drop=True), pd.DataFrame(probs)), axis=1)
        apo.append(results.apply(lambda z: z[schools.index(z.School_Codes)], axis=1).mean())
        
    
    print("Results for " + name + " model:")
    print("Accuracy\t", average(acc), "\t", stdev(acc))
    print("Prec Mac\t", average(prec), "\t", stdev(prec))
    print("Rec  Mac\t", average(rec), "\t", stdev(rec))
    print("F-1  Mac\t", average(f_1), "\t", stdev(f_1))
    print("Matthews\t", average(mcc), "\t", stdev(mcc))
    print("Ave Prob\t", average(apo), "\t", stdev(apo))
    print()

Results for Benchmark model:
Accuracy	 0.4560936789449749 	 0.005224099798199897
Prec Mac	 0.3371474593284371 	 0.0206911857546598
Rec  Mac	 0.29799048903053954 	 0.006142840523763042
F-1  Mac	 0.3000913243560917 	 0.00844401896986435
Matthews	 0.23602959667377524 	 0.006795194377594011
Ave Prob	 0.3831216755576636 	 0.0032419106926779196

Results for Selected model:
Accuracy	 0.45798090040927686 	 0.006641907457464536
Prec Mac	 0.3409157638330924 	 0.021410468845564315
Rec  Mac	 0.30083099769117055 	 0.008598171519038312
F-1  Mac	 0.30532814260275487 	 0.00896481143771878
Matthews	 0.24030387413679843 	 0.009463891424651873
Ave Prob	 0.3850053870512864 	 0.003163422738580068

Results for Full model:
Accuracy	 0.45784447476125517 	 0.005324032479858166
Prec Mac	 0.33970505934700773 	 0.017845542887049638
Rec  Mac	 0.3001656351003855 	 0.0075434028388277116
F-1  Mac	 0.30516473768253694 	 0.0075150201381538875
Matthews	 0.23967684827829921 	 0.008589817638841996
Ave Prob	 0.384518982311
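
The `Ave Prob` row is the mean probability that the forest assigns to each student's observed campus, which is what the `predict_proba` lookup in the loop computes. A minimal sketch of that calculation, using hypothetical classes and probabilities:

```python
import numpy as np

# Hypothetical class order, as a fitted classifier would expose in `classes_`.
classes = ["MI", "SC", "SG"]

# Hypothetical predicted probabilities: one row per student, one column per class.
probs = np.array([
    [0.1, 0.2, 0.7],
    [0.3, 0.5, 0.2],
    [0.6, 0.3, 0.1],
])
y_true = ["SG", "SC", "SG"]

# Column position of each student's observed campus.
col = np.array([classes.index(label) for label in y_true])

# Probability assigned to the observed campus, then its mean over students.
prob_observed = probs[np.arange(len(y_true)), col]
ave_prob = prob_observed.mean()

print(prob_observed)
print(ave_prob)
```

Unlike accuracy, this measure rewards a model for placing more probability on the correct campus even when that campus is not the top prediction.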

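From these results, we see that the model with the selected features scores slightly higher than the benchmark model on every metric, although the gaps (for example, 0.458 versus 0.456 accuracy) are smaller than the standard deviation across the 10 splits. The Full model performs almost identically to the Selected model, which suggests that the one-hot-encoded closest labels add little beyond the distance columns.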
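
One rough way to gauge whether the gap between two of these models is meaningful is to compare the difference in mean accuracy with its standard error. The sketch below uses the Benchmark and Selected accuracy figures printed above and the Welch form of the standard error. Because the random splits overlap, the runs are not independent, so this is only indicative.

```python
from math import sqrt

n_splits = 10  # each model was evaluated on 10 random splits

# Mean and standard deviation of accuracy, copied from the printed results.
bench_mean, bench_sd = 0.4560936789449749, 0.005224099798199897
sel_mean, sel_sd = 0.45798090040927686, 0.006641907457464536

diff = sel_mean - bench_mean

# Standard error of the difference between two means (Welch form).
se = sqrt(bench_sd**2 / n_splits + sel_sd**2 / n_splits)
t_stat = diff / se

print(f"difference = {diff:.4f}, standard error = {se:.4f}, t = {t_stat:.2f}")
```

Here the difference is smaller than its standard error, so the ordering of the two models could easily change with a different set of splits.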

To further refine this feature selection process, let us look at how performance changes as the closest distances are added one at a time, starting from the standard distances plus the distance to the single closest campus.

In [8]:
for i in range(7):
    x_temp = x.iloc[:, :8+i]
    name = str(i+1) + " Closest"
    acc = []
    prec = []
    rec = []
    f_1 = []
    mcc = []
    apo = []
    for _ in range(10):  # avoid reusing the outer loop variable `i`
        X_train, X_test, y_train, y_test = train_test_split(x_temp, y, test_size=0.3)
        rf.fit(X_train, y_train)
        y_pred = rf.predict(X_test)        
        acc.append(rf.score(X_test, y_test))
        p, r, f = precision_recall_fscore_support(y_test, y_pred, average = 'macro')[:3]
        prec.append(p)
        rec.append(r)
        f_1.append(f)
        mcc.append(matthews_corrcoef(y_test, y_pred))
        probs = rf.predict_proba(X_test)
        results = pd.concat((y_test.reset_index(drop=True), pd.DataFrame(probs)), axis=1)
        apo.append(results.apply(lambda z: z[schools.index(z.School_Codes)], axis=1).mean())
        
    
    print("Results for " + name + " model:")
    print("Accuracy\t", average(acc), "\t", stdev(acc))
    print("Prec Mac\t", average(prec), "\t", stdev(prec))
    print("Rec  Mac\t", average(rec), "\t", stdev(rec))
    print("F-1  Mac\t", average(f_1), "\t", stdev(f_1))
    print("Matthews\t", average(mcc), "\t", stdev(mcc))
    print("Ave Prob\t", average(apo), "\t", stdev(apo))
    print()

Results for 1 Closest model:
Accuracy	 0.45691223283310595 	 0.007393705467715972
Prec Mac	 0.3431081815867248 	 0.022133931822934542
Rec  Mac	 0.3008328188626208 	 0.006522013598438416
F-1  Mac	 0.30459932104693993 	 0.007560828819467769
Matthews	 0.23833230197475572 	 0.009908071560663267
Ave Prob	 0.38400987753248045 	 0.0035961423344140317

Results for 2 Closest model:
Accuracy	 0.4561164165529787 	 0.007725583556660303
Prec Mac	 0.33734415955832453 	 0.008342404141081042
Rec  Mac	 0.2989671190485113 	 0.006487775375254521
F-1  Mac	 0.30454782103032324 	 0.006563971090816411
Matthews	 0.23531341719025728 	 0.008425495947551534
Ave Prob	 0.3835732171187555 	 0.003511782275478776

Results for 3 Closest model:
Accuracy	 0.4592314688494771 	 0.0042131778012676705
Prec Mac	 0.3392717315629014 	 0.01465343493861074
Rec  Mac	 0.3007991932065454 	 0.005219789859541177
F-1  Mac	 0.30584096549855916 	 0.005164391142772159
Matthews	 0.23972168364603025 	 0.005744416403541396
Ave Prob	 0.38573

It seems that while the effectiveness of the model fluctuates for different numbers of closest distances included, these changes are small relative to the spread across splits and are not significant.
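
The two evaluation loops above repeat the same body, so they could be folded into a helper. The sketch below is one possible refactoring, not the notebook's own code: the `evaluate` function, its parameters, and the synthetic demo data are all hypothetical, and it fixes the split seeds so runs are repeatable.

```python
from statistics import mean, stdev

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef, precision_recall_fscore_support
from sklearn.model_selection import train_test_split


def evaluate(features, target, n_splits=10, test_size=0.3, n_estimators=100):
    """Return {metric: (mean, stdev)} over repeated random train/test splits."""
    scores = {
        "Accuracy": [], "Prec Mac": [], "Rec  Mac": [], "F-1  Mac": [], "Matthews": [],
    }
    for seed in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            features, target, test_size=test_size, random_state=seed
        )
        model = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
        model.fit(X_tr, y_tr)
        y_hat = model.predict(X_te)
        p, r, f, _ = precision_recall_fscore_support(
            y_te, y_hat, average="macro", zero_division=0
        )
        scores["Accuracy"].append(model.score(X_te, y_te))
        scores["Prec Mac"].append(p)
        scores["Rec  Mac"].append(r)
        scores["F-1  Mac"].append(f)
        scores["Matthews"].append(matthews_corrcoef(y_te, y_hat))
    return {name: (mean(vals), stdev(vals)) for name, vals in scores.items()}


# Demo on synthetic three-class data (not the SMTO data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_classes=3, random_state=0
)
summary = evaluate(X_demo, y_demo, n_splits=3, n_estimators=50)
for name, (avg, sd) in summary.items():
    print(name, "\t", avg, "\t", sd)
```

With such a helper, each comparison reduces to one call per feature subset, for example `evaluate(x.iloc[:, :14], y)` for the Selected model.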

### Next Steps

Some next steps include:

- Tune hyperparameters (tree depth, splitting criterion, etc.)
- Introduce additional variables and perform feature engineering and selection
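
As a starting point for the tuning step, a cross-validated grid search over tree depth and splitting criterion might look like the sketch below. The grid values, the `f1_macro` scoring choice, and the synthetic data are assumptions made for illustration, not settings chosen in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the SMTO data).
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)

# Hypothetical search grid over tree depth and splitting criterion.
param_grid = {
    "max_depth": [3, 6, None],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_grid,
    scoring="f1_macro",  # macro F1, in line with the macro metrics used above
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```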