# Model Training and Evaluation

The ultimate goal of this project is to predict new areas that would be suitable for building trails. Areas that allow mountain biking are treated as suitable observations, and areas that do not allow biking are treated as unsuitable. This makes the problem a binary classification task.

# Setup

In [101]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV # Parameter Tuning
from sklearn.utils import resample # For up sampling MTB Trails

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

FILE_PATH = "G:/UNCG_Capstone/"
TRAIN_PATH = FILE_PATH + "Output/Training_Data/"

# Random Sample
SEED = 42

## Data

In [2]:
# Training Data
trails_std = pd.read_csv(TRAIN_PATH + "training_std.csv")
trails_reg = pd.read_csv(TRAIN_PATH + "training_non.csv")
trails_std["mtb"] = np.where(trails_std.id.str.contains("trail_row", regex = False), "MTB", "No MTB")
trails_reg["mtb"] = np.where(trails_reg.id.str.contains("trail_row", regex = False), "MTB", "No MTB")

In [3]:
trails_std[trails_std.isna().any(axis = 1)]

Unnamed: 0,L0,L1,L2,L3,L4,L5,L6,L7,L8,L9,...,L12,L13,L14,id,County,mean,min,max,sd,mtb
311,0.147815,0.393641,0.634145,0.849844,1.109429,1.359835,1.610185,1.750925,2.418184,4.118827,...,,0.161999,0.147315,MA_ID-13568,Brunswick,4.743326,1.352,7.94,1.917174,No MTB


In [4]:
trails_reg[trails_reg.isna().any(axis = 1)]

Unnamed: 0,L0,L1,L2,L3,L4,L5,L6,L7,L8,L9,...,L12,L13,L14,id,County,mean,min,max,sd,mtb
311,0.543304,1.446848,2.330833,3.123649,4.07777,4.99815,5.918326,6.435622,8.888168,15.138975,...,,0.595435,0.541463,MA_ID-13568,Brunswick,4.743326,1.352,7.94,1.917174,No MTB


There is one observation in Brunswick County (along the NC coast) that has several missing values. This is likely due to the vector shape being divided into multiple shapes. It will be excluded prior to model training.

In [5]:
trails_std.drop(index = trails_std[trails_std.isna().any(axis = 1)].index, inplace = True)
trails_reg.drop(index = trails_reg[trails_reg.isna().any(axis = 1)].index, inplace = True)

## Upsample Data



In [6]:
trails_std.mtb.value_counts()

mtb
No MTB    296
MTB       104
Name: count, dtype: int64

Because of the class imbalance, the MTB class will be upsampled from 104 to 250 observations by drawing 146 additional rows with replacement.

In [7]:
trails_std = pd.concat([trails_std, resample(trails_std[trails_std.mtb == "MTB"], replace = True, n_samples = 146, random_state = SEED)])
trails_reg = pd.concat([trails_reg, resample(trails_reg[trails_reg.mtb == "MTB"], replace = True, n_samples = 146, random_state = SEED)])
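Because resampling with replacement duplicates rows, upsampling before the train/test split can place copies of the same MTB observation in both the training and holdout sets. The sketch below shows one alternative ordering (split first, then upsample only the training rows). It is not what this notebook does; the helper name `split_then_upsample`, the `stratify` argument, and the toy data are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample


def split_then_upsample(df, label_col, minority, target_n, test_size=0.33, seed=42):
    """Hypothetical helper: hold out data first, then upsample the
    minority class inside the training portion only."""
    train, test = train_test_split(
        df, test_size=test_size, random_state=seed, stratify=df[label_col]
    )
    minority_rows = train[train[label_col] == minority]
    n_extra = target_n - len(minority_rows)
    if n_extra > 0:
        # Duplicates are drawn from training rows only, so none reach the holdout set
        extra = resample(minority_rows, replace=True, n_samples=n_extra, random_state=seed)
        train = pd.concat([train, extra])
    return train, test


# Toy data with made-up values, only to exercise the helper
toy = pd.DataFrame({"L0": range(40), "mtb": ["MTB"] * 10 + ["No MTB"] * 30})
train, test = split_then_upsample(toy, "mtb", "MTB", target_n=20)
print(train.mtb.value_counts())
```

With this ordering the holdout metrics describe genuinely unseen observations, at the cost of a smaller and still imbalanced holdout set.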

# Train Models

Models will be trained and validated, then a holdout set will be used to assess performance on unseen data. Approximately one third of the data is held out.

In [8]:
# Features to Train Models on
feature_cols = ['L0', 'L1', 'L2', 'L3', 'L4', 'L5', 'L6', 'L7', 'L8', 'L9', 'L10','L11', 'L12', 'L13', 'L14']
resp_col = "mtb"

# Model type used for analysis
models = dict(
    log = "logistic",
    rf = "RandomForest",
)

datasets = dict(
    std = "Standardized",
    reg = "Untransformed",
)

## Split Training Data

In [44]:
# Split up standardized data 
x_train, x_test, y_train, y_test = train_test_split(trails_std[feature_cols], trails_std[resp_col], test_size = 0.33, random_state = SEED)
# Re-split the non-transformed data with the same seed; this selects the
# same observations as long as both data frames share the same row order
xr_train, xr_test, yr_train, yr_test = train_test_split(trails_reg[feature_cols], trails_reg[resp_col], test_size = 0.33, random_state = SEED)
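The two calls above rely on `train_test_split` producing the same row selection when the seed and the number of rows match. A minimal check of that assumption is sketched below on made-up data; the variable names are hypothetical, and in the notebook the equivalent comparison would be between `x_train.index` and `xr_train.index`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-ins for the untransformed and standardized feature tables
rng = np.random.default_rng(0)
raw = pd.DataFrame({"L0": rng.uniform(0, 10, 30), "L1": rng.uniform(0, 5, 30)})
std = (raw - raw.mean()) / raw.std()
labels = pd.Series(np.where(np.arange(30) % 3 == 0, "MTB", "No MTB"))

a_train, a_test, _, _ = train_test_split(std, labels, test_size=0.33, random_state=42)
b_train, b_test, _, _ = train_test_split(raw, labels, test_size=0.33, random_state=42)

# True when both splits picked the same rows in the same order
same_split = a_train.index.equals(b_train.index) and a_test.index.equals(b_test.index)
print(same_split)
```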

## Standardized Data

### Logistic Regression
This will act as a baseline. No parameter tuning will be performed.

In [111]:
lg_model = LogisticRegression().fit(x_train, y_train)
y_test_lg_pred = lg_model.predict(x_test)
print(classification_report(y_test, y_test_lg_pred))

              precision    recall  f1-score   support

         MTB       0.70      0.70      0.70        91
      No MTB       0.70      0.70      0.70        90

    accuracy                           0.70       181
   macro avg       0.70      0.70      0.70       181
weighted avg       0.70      0.70      0.70       181



In [112]:
# Adding results
result_std_logistic = classification_report(y_test, y_test_lg_pred, output_dict = True)
result_std_logistic["model"] = models["log"]
result_std_logistic["dataset"] = datasets["std"]
result_std_logistic["auc"] = roc_auc_score(y_test, lg_model.predict_proba(x_test)[:,1])

### Random Forest

A grid search will be used to evaluate the following parameters:
- 100, 200, 400 trees
- 5, 10, 15 max features
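To make the size of the search concrete, the sketch below enumerates the combinations in the `params` dictionary used for the grid search. It assumes `GridSearchCV` is left at its default of 5-fold cross-validation, as in the call in this notebook. Since there are 15 feature columns, `max_features = 15` means every split considers all features.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the params dictionary passed to GridSearchCV
params = {"n_estimators": (100, 200, 400), "max_features": (5, 10, 15)}

combos = list(ParameterGrid(params))
n_folds = 5  # assumed: GridSearchCV default when cv is not set
n_fits = len(combos) * n_folds  # cross-validation fits, excluding the final refit

for combo in combos:
    print(combo)
print(len(combos), "combinations,", n_fits, "cross-validation fits")
```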

In [113]:
params = {'n_estimators' : (100, 200, 400),
          'max_features' : (5, 10, 15)}

rf_model = RandomForestClassifier(random_state = SEED)
clf = GridSearchCV(rf_model, params, scoring = 'roc_auc')
clf.fit(x_train, y_train)

In [114]:
clf.best_params_

{'max_features': 15, 'n_estimators': 100}

In [115]:
rf_final = clf.best_estimator_
y_test_pred = clf.best_estimator_.predict(x_test)
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

         MTB       0.82      0.82      0.82        91
      No MTB       0.82      0.81      0.82        90

    accuracy                           0.82       181
   macro avg       0.82      0.82      0.82       181
weighted avg       0.82      0.82      0.82       181



In [116]:
# Add Results
result_std_rf = classification_report(y_test, y_test_pred, output_dict = True)
result_std_rf["model"] = models["rf"]
result_std_rf["dataset"] = datasets["std"]
result_std_rf["auc"] = roc_auc_score(y_test, clf.best_estimator_.predict_proba(x_test)[:,1])

## Non-Standardized Data

### Logistic Regression

In [117]:
lg_model_r = LogisticRegression().fit(xr_train, yr_train)
yr_test_lg_pred = lg_model_r.predict(xr_test)
print(classification_report(yr_test, yr_test_lg_pred))

              precision    recall  f1-score   support

         MTB       0.83      0.57      0.68        91
      No MTB       0.67      0.88      0.76        90

    accuracy                           0.72       181
   macro avg       0.75      0.72      0.72       181
weighted avg       0.75      0.72      0.72       181



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [118]:
# Add Results
result_reg_logistic = classification_report(yr_test, yr_test_lg_pred, output_dict = True)
result_reg_logistic["model"] = models["log"]
result_reg_logistic["dataset"] = datasets["reg"]
result_reg_logistic["auc"] = roc_auc_score(yr_test, lg_model_r.predict_proba(xr_test)[:,1])

### Random Forest

In [119]:
params = {'n_estimators' : (100, 200, 400),
          'max_features' : (5, 10, 15)}

rf_model_r = RandomForestClassifier(random_state = SEED)
clf = GridSearchCV(rf_model_r, params, scoring = 'roc_auc')
clf.fit(xr_train, yr_train)

In [120]:
clf.best_params_

{'max_features': 5, 'n_estimators': 400}

In [121]:
yr_test_pred = clf.best_estimator_.predict(xr_test)
print(classification_report(yr_test, yr_test_pred))

              precision    recall  f1-score   support

         MTB       0.83      0.80      0.82        91
      No MTB       0.81      0.83      0.82        90

    accuracy                           0.82       181
   macro avg       0.82      0.82      0.82       181
weighted avg       0.82      0.82      0.82       181



In [122]:
# Adding Results
result_reg_rf = classification_report(yr_test, yr_test_pred, output_dict = True)
result_reg_rf["model"] = models["rf"]
result_reg_rf["dataset"] = datasets["reg"]
result_reg_rf["auc"] = roc_auc_score(yr_test, clf.best_estimator_.predict_proba(xr_test)[:,1])

# Results

Here the results from the models on each data set are shown. They have been ranked according to their Area Under the Curve (AUC), which acts as a general way of comparing their predictive performance.
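The AUC values in this notebook are computed from column 1 of `predict_proba`, which corresponds to "No MTB" because the classes are sorted alphabetically. With string labels, `roc_auc_score` is expected to treat the label that sorts last as the positive class, so the two line up. The toy example below, with made-up probabilities, also illustrates that for a binary problem the AUC is the same whichever class is scored as positive.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities for illustration only
y_true = np.array(["MTB", "No MTB", "MTB", "No MTB", "MTB", "No MTB"])
p_no_mtb = np.array([0.2, 0.9, 0.4, 0.6, 0.7, 0.3])
p_mtb = np.array([0.8, 0.1, 0.6, 0.4, 0.3, 0.7])  # complement of p_no_mtb

# String labels passed directly, as done in this notebook
auc_string = roc_auc_score(y_true, p_no_mtb)
# Explicit encodings with each class in turn as the positive class
auc_no_mtb = roc_auc_score((y_true == "No MTB").astype(int), p_no_mtb)
auc_mtb = roc_auc_score((y_true == "MTB").astype(int), p_mtb)

print(auc_string, auc_no_mtb, auc_mtb)
```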

In [154]:
results_df = pd.concat([
    pd.DataFrame(result_std_logistic),
    pd.DataFrame(result_std_rf),
    pd.DataFrame(result_reg_logistic),
    pd.DataFrame(result_reg_rf)
])

# Pivot For easier reading
eval_table = (results_df.loc[["recall", "precision"],["model", "dataset", "MTB", "auc"]]
    .reset_index()
    .pivot(index = ["model", "dataset"], columns = ["index"], values = ["MTB", "auc"])
)
# Drop Extra column
eval_table.drop(('auc', 'precision'), axis = 1, inplace = True)
# Flatten multi index 
eval_table.columns = eval_table.columns.to_flat_index()
# Rename columns for readability
eval_table.rename(columns = {("MTB", "precision") : "precision", 
                             ("MTB", "recall") : "recall", 
                             ("auc", "recall") : "auc"}, inplace = True)
eval_table.sort_values(by = "auc", ascending = False)


model,dataset,precision,recall,auc
RandomForest,Untransformed,0.829545,0.802198,0.907875
RandomForest,Standardized,0.815217,0.824176,0.886996
logistic,Untransformed,0.825397,0.571429,0.798901
logistic,Standardized,0.703297,0.703297,0.786203


## Exporting Prediction Results

In [163]:
pred_df = pd.DataFrame(y_test)
pred_df["rf_reg"] = yr_test_pred # Random Forest on regular data
pred_df["rf_std"] = y_test_pred # Random Forest on standardized data
pred_df["missed"] = pred_df.apply(lambda x: (x.rf_reg == x.rf_std) and (x.rf_reg != x.mtb), axis = 1)
pred_df.to_csv(TRAIN_PATH + "prediction_results.csv") # Keeping index for mapping
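The `missed` column flags observations where both random forests agree with each other but disagree with the true label. A vectorized equivalent of the row-wise `apply` is sketched below on a made-up four-row frame; the toy values are hypothetical.

```python
import pandas as pd

# Made-up predictions: only the second row is misclassified by both models
toy = pd.DataFrame({
    "mtb":    ["MTB", "No MTB", "MTB", "No MTB"],
    "rf_reg": ["MTB", "MTB", "No MTB", "No MTB"],
    "rf_std": ["MTB", "MTB", "MTB", "No MTB"],
})

# Both models agree, and their shared prediction differs from the truth
toy["missed"] = (toy.rf_reg == toy.rf_std) & (toy.rf_reg != toy.mtb)
print(toy)
```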

# Conclusion

The random forest achieved the best results and had similar performance on both the standardized and non-standardized data. In both cases it had an overall accuracy of around 82%. Precision and recall were close for both classes, which shows that the model is not systematically failing to predict either class.

Ultimately the goal of any one of these models is to predict whether a given piece of land would be suitable for building mountain bike trails. In that context, with an MTB precision of roughly 0.82 to 0.83, for every 100 areas correctly predicted as suitable for mountain biking, the random forest model is going to make roughly 20 to 23 poor suggestions.
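The number of poor suggestions per 100 correct ones follows directly from precision: if precision is p, the ratio of false positives to true positives is (1 - p) / p. The sketch below applies this to the MTB precision values from the results table; the function name is hypothetical.

```python
def false_positives_per_100_hits(precision):
    """Expected false positives for every 100 true positives,
    given precision = TP / (TP + FP)."""
    return 100 * (1 - precision) / precision


# MTB precision of the random forest on each data set, from the results table
rf_precision = {"Untransformed": 0.829545, "Standardized": 0.815217}

for dataset, p in rf_precision.items():
    print(dataset, round(false_positives_per_100_hits(p), 1))
```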