# Model Tuning

The purpose of this notebook is to tune the KNN, Logistic Regression, and Random Forest classification models with RandomizedSearchCV and determine which model will be my final choice.

Importing packages:

In [1]:
import pandas as pd
from importlib import reload
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns
%config InlineBackend.figure_formats = ['retina']
%matplotlib inline

plt.rcParams['figure.figsize'] = (9, 6)
sns.set(context='notebook', style='whitegrid', font_scale=1.2)

from collections import OrderedDict
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder
from sklearn import datasets
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score, precision_recall_curve, fbeta_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import log_loss
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

Loading in Functions:

In [2]:
from knn_model_eval import *

In [3]:
from logistic_reg_model_eval import *

In [4]:
from random_forest_evaluator import *

## Loading and Cleaning Data:

In [5]:
training_horses_3_cleaned = pd.read_pickle('./Data/training_horses_3_cleaned.pkl')

In [6]:
training_horses_3_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 59605 entries, 1 to 79203
Data columns (total 40 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   race_id                                       59605 non-null  int64  
 1   horse_id                                      59605 non-null  int64  
 2   result                                        59605 non-null  int64  
 3   lengths_behind                                59605 non-null  float64
 4   horse_age                                     59605 non-null  int64  
 5   horse_type                                    59605 non-null  object 
 6   horse_country                                 59605 non-null  object 
 7   horse_rating                                  59605 non-null  int64  
 8   horse_gear                                    59605 non-null  object 
 9   declared_weight                               59605 non-null 

### One Hot Encoding of Gender (horse_type):

In [7]:
# Note: scikit-learn 1.2+ names this argument `sparse_output` instead of `sparse`.
ohe = OneHotEncoder(drop='first', sparse=False)

In [8]:
training_model_gender_cat = training_horses_3_cleaned.loc[:, ['horse_type']]

In [9]:
ohe.fit(training_model_gender_cat)

ohe_X = ohe.transform(training_model_gender_cat)

# Note: scikit-learn 1.0+ provides get_feature_names_out; get_feature_names was removed in 1.2.
columns = ohe.get_feature_names(['horse_type'])

ohe_X_df = pd.DataFrame(ohe_X, columns=columns, index=training_model_gender_cat.index)

ohe_X_df.sample(20)

Unnamed: 0,horse_type_Filly,horse_type_Gelding,horse_type_Horse,horse_type_Mare,horse_type_Rig
62569,0.0,1.0,0.0,0.0,0.0
75691,0.0,1.0,0.0,0.0,0.0
74427,0.0,1.0,0.0,0.0,0.0
73550,0.0,1.0,0.0,0.0,0.0
77978,0.0,1.0,0.0,0.0,0.0
46575,0.0,1.0,0.0,0.0,0.0
28670,0.0,1.0,0.0,0.0,0.0
36939,0.0,1.0,0.0,0.0,0.0
11895,0.0,1.0,0.0,0.0,0.0
2918,0.0,1.0,0.0,0.0,0.0
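To see what `drop='first'` does to the encoded columns, here is a minimal sketch on a toy frame. The toy values (including the `Colt` category) are made up for illustration and are not taken from the dataset. The column names are built from `categories_` so the snippet does not depend on which feature-name method the installed scikit-learn version provides.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical values, not from the horse dataset).
toy = pd.DataFrame({'horse_type': ['Gelding', 'Colt', 'Mare', 'Gelding']})

# drop='first' removes the first category in sorted order ('Colt' here),
# so a row of all zeros stands for the dropped category.
ohe = OneHotEncoder(drop='first')
encoded = ohe.fit_transform(toy[['horse_type']]).toarray()

# Build column names from the fitted categories, skipping the dropped one.
kept = ohe.categories_[0][1:]
columns = ['horse_type_' + c for c in kept]

toy_ohe = pd.DataFrame(encoded, columns=columns, index=toy.index)
print(toy_ohe)
```

Dropping one level avoids perfectly collinear dummy columns, which matters most for the logistic regression model.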


Concatenating back into the original data frame:

In [10]:
comb_training_horses_3_cleaned = pd.concat([training_horses_3_cleaned, ohe_X_df], axis=1)

In [11]:
comb_training_horses_3_cleaned.head()

Unnamed: 0,race_id,horse_id,result,lengths_behind,horse_age,horse_type,horse_country,horse_rating,horse_gear,declared_weight,...,diff_from_field_handicap_wgt_avg,field_handicap_wgt_rank,career_races,career_shows,shows_in_last_5_races,horse_type_Filly,horse_type_Gelding,horse_type_Horse,horse_type_Mare,horse_type_Rig
1,6348,1698,2,0.2,6,Gelding,AUS,92,TT/B,1172.0,...,4.214286,4,36,13.0,2.0,0.0,1.0,0.0,0.0,0.0
3,6348,833,4,0.75,4,Gelding,IRE,89,CP,1154.0,...,1.214286,7,19,9.0,2.0,0.0,1.0,0.0,0.0,0.0
4,6348,3368,5,1.25,3,Gelding,GER,91,TT/B/H,1147.0,...,3.214286,5,4,4.0,4.0,0.0,1.0,0.0,0.0,0.0
5,6348,1238,6,1.25,5,Gelding,AUS,87,TT/V,1191.0,...,-2.785714,10,25,12.0,1.0,0.0,1.0,0.0,0.0,0.0
6,6348,985,7,2.25,5,Gelding,NZ,84,--,1070.0,...,-5.785714,12,28,8.0,0.0,0.0,1.0,0.0,0.0,0.0


In [12]:
#Removing original horse_type column:
comb_training_horses_3_cleaned.drop(columns='horse_type', inplace=True)
comb_training_horses_3_cleaned.head()

Unnamed: 0,race_id,horse_id,result,lengths_behind,horse_age,horse_country,horse_rating,horse_gear,declared_weight,actual_weight,...,diff_from_field_handicap_wgt_avg,field_handicap_wgt_rank,career_races,career_shows,shows_in_last_5_races,horse_type_Filly,horse_type_Gelding,horse_type_Horse,horse_type_Mare,horse_type_Rig
1,6348,1698,2,0.2,6,AUS,92,TT/B,1172.0,129,...,4.214286,4,36,13.0,2.0,0.0,1.0,0.0,0.0,0.0
3,6348,833,4,0.75,4,IRE,89,CP,1154.0,126,...,1.214286,7,19,9.0,2.0,0.0,1.0,0.0,0.0,0.0
4,6348,3368,5,1.25,3,GER,91,TT/B/H,1147.0,128,...,3.214286,5,4,4.0,4.0,0.0,1.0,0.0,0.0,0.0
5,6348,1238,6,1.25,5,AUS,87,TT/V,1191.0,122,...,-2.785714,10,25,12.0,1.0,0.0,1.0,0.0,0.0,0.0
6,6348,985,7,2.25,5,NZ,84,--,1070.0,119,...,-5.785714,12,28,8.0,0.0,0.0,1.0,0.0,0.0,0.0


OHE Complete.

### Feature Selection:

Adding one more feature for now: career show rate (career_shows / career_races):

In [13]:
comb_training_horses_3_cleaned['career_show_rate'] = comb_training_horses_3_cleaned['career_shows'] / comb_training_horses_3_cleaned['career_races']
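
This ratio is undefined for a horse with zero career races. Whether such rows exist in this dataset is not shown here, so the following is only a sketch, on made-up numbers, of how pandas handles the division and one possible way to guard it.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; the last one has no prior races.
toy = pd.DataFrame({'career_shows': [13.0, 4.0, 0.0],
                    'career_races': [36, 15, 0]})

# Plain division: 0 / 0 yields NaN rather than raising an error.
raw_rate = toy['career_shows'] / toy['career_races']

# One possible guard (an assumption, not the author's choice):
# treat horses with no races as having a show rate of 0.
safe_rate = raw_rate.replace([np.inf, -np.inf], np.nan).fillna(0.0)

print(raw_rate.round(6).tolist())
print(safe_rate.round(6).tolist())
```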

In [14]:
comb_training_horses_3_cleaned.sample(5)

Unnamed: 0,race_id,horse_id,result,lengths_behind,horse_age,horse_country,horse_rating,horse_gear,declared_weight,actual_weight,...,field_handicap_wgt_rank,career_races,career_shows,shows_in_last_5_races,horse_type_Filly,horse_type_Gelding,horse_type_Horse,horse_type_Mare,horse_type_Rig,career_show_rate
43423,2877,3274,3,0.5,3,AUS,60,--,1048.0,119,...,4,11,7.0,4.0,0.0,1.0,0.0,0.0,0.0,0.636364
20165,4740,3867,4,1.5,4,NZ,76,--,1106.0,129,...,3,15,4.0,1.0,0.0,1.0,0.0,0.0,0.0,0.266667
20037,4750,152,6,4.75,3,AUS,60,--,1068.0,120,...,7,26,5.0,1.0,0.0,1.0,0.0,0.0,0.0,0.192308
27724,4131,2960,6,8.0,5,IRE,63,TT/SR,1145.0,116,...,12,12,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.083333
13115,5300,1295,11,6.5,5,IRE,64,TT,1123.0,118,...,11,23,5.0,1.0,0.0,1.0,0.0,0.0,0.0,0.217391


In [15]:
comb_training_horses_3_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 59605 entries, 1 to 79203
Data columns (total 45 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   race_id                                       59605 non-null  int64  
 1   horse_id                                      59605 non-null  int64  
 2   result                                        59605 non-null  int64  
 3   lengths_behind                                59605 non-null  float64
 4   horse_age                                     59605 non-null  int64  
 5   horse_country                                 59605 non-null  object 
 6   horse_rating                                  59605 non-null  int64  
 7   horse_gear                                    59605 non-null  object 
 8   declared_weight                               59605 non-null  float64
 9   actual_weight                                 59605 non-null 

In [16]:
columns_1 = [
    'horse_age', 'distance', 'horses_in_field', 'declared_weight',
    'actual_weight', 'horse_rating', 'draw',
    'three_race_rolling_avg_finish', 'three_race_rolling_average_lengths',
    'three_race_rolling_average_time',
    'three_race_rolling_average_distance_per_time',
    'field_rating_rank', 'diff_from_field_rating_avg',
    'field_age_rank', 'diff_from_field_age_avg',
    'diff_from_field_declared_wgt_avg', 'field_dec_wgt_rank',
    'diff_from_field_handicap_wgt_avg', 'field_handicap_wgt_rank',
    'career_races', 'career_shows', 'shows_in_last_5_races',
    'career_show_rate',
    'horse_type_Filly', 'horse_type_Gelding', 'horse_type_Horse',
    'horse_type_Mare', 'horse_type_Rig',
]

In [17]:
for i, feature in enumerate(columns_1, 1):
    print("Feature {}: {}".format(i, feature))

Feature 1: horse_age
Feature 2: distance
Feature 3: horses_in_field
Feature 4: declared_weight
Feature 5: actual_weight
Feature 6: horse_rating
Feature 7: draw
Feature 8: three_race_rolling_avg_finish
Feature 9: three_race_rolling_average_lengths
Feature 10: three_race_rolling_average_time
Feature 11: three_race_rolling_average_distance_per_time
Feature 12: field_rating_rank
Feature 13: diff_from_field_rating_avg
Feature 14: field_age_rank
Feature 15: diff_from_field_age_avg
Feature 16: diff_from_field_declared_wgt_avg
Feature 17: field_dec_wgt_rank
Feature 18: diff_from_field_handicap_wgt_avg
Feature 19: field_handicap_wgt_rank
Feature 20: career_races
Feature 21: career_shows
Feature 22: shows_in_last_5_races
Feature 23: career_show_rate
Feature 24: horse_type_Filly
Feature 25: horse_type_Gelding
Feature 26: horse_type_Horse
Feature 27: horse_type_Mare
Feature 28: horse_type_Rig


## Running Baseline Models:

All baseline models are evaluated with K-fold cross-validation (k=5 folds).
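
The evaluation helpers used below are imported from local modules that are not shown in this notebook. As a rough sketch of what a 5-fold evaluation of this kind typically involves (the function name `kfold_scores`, the synthetic data, and the shuffling settings are assumptions for illustration, not the author's implementation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, fbeta_score,
                             precision_score, recall_score)
from sklearn.model_selection import KFold


def kfold_scores(model, X, y, k=5, beta=0.5, seed=0):
    """Fit `model` on each of k folds and collect validation metrics.

    Hypothetical helper: returns a dict mapping metric name to a list
    of per-fold scores.
    """
    scores = {'accuracy': [], 'precision': [], 'recall': [], 'fbeta': []}
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[val_idx])
        y_val = y[val_idx]
        scores['accuracy'].append(accuracy_score(y_val, pred))
        scores['precision'].append(precision_score(y_val, pred, zero_division=0))
        scores['recall'].append(recall_score(y_val, pred, zero_division=0))
        scores['fbeta'].append(fbeta_score(y_val, pred, beta=beta, zero_division=0))
    return scores


# Synthetic, imbalanced stand-in data (not the horse-racing data).
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     weights=[0.75, 0.25], random_state=0)
demo = kfold_scores(LogisticRegression(max_iter=1000), X_demo, y_demo)
for name, vals in demo.items():
    print(f"{name}: {np.mean(vals):.3f} +- {np.std(vals):.3f}")
```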

In [18]:
X = comb_training_horses_3_cleaned[columns_1]
y = comb_training_horses_3_cleaned['show']

In [19]:
knn = KNN_accuracy_scorer_f_fold(X, y, n = 21, k=5)

Confusion Matrix for Fold 1
[[8654  350]
 [2556  361]]


Confusion Matrix for Fold 2
[[8642  357]
 [2577  345]]


Confusion Matrix for Fold 3
[[8658  365]
 [2551  347]]


Confusion Matrix for Fold 4
[[8612  355]
 [2612  342]]


Confusion Matrix for Fold 5
[[8740  347]
 [2484  350]]


KNN Classification w/ KFOLD CV Results (k=5):
KNN Accuracy scores:  [0.7562285043201074, 0.7538797080781814, 0.7553896485194195, 0.7511114839359114, 0.7625199228252664] 

Simple mean cv accuracy: 0.756 + 0.004 

KNN Precision scores:  [0.5077355836849508, 0.49145299145299143, 0.48735955056179775, 0.49067431850789095, 0.5021520803443329] 

Simple mean cv precision: 0.496 +- 0.008 

KNN Recall scores:  [0.1237572848817278, 0.11806981519507187, 0.11973775017253278, 0.11577522004062288, 0.12350035285815103] 

Simple mean cv recall: 0.120 +- 0.003 

KNN Fbeta (beta=0.5) scores:  [0.3133136608227738, 0.3010471204188481, 0.30194918203967974, 0.29780564263322884, 0.31127712557808607] 

Simple mean cv Fbeta (beta=0

In [20]:
logreg = log_accuracy_scorer_k_fold(X, y, k=5, threshold=0.5, C=1)

Confusion Matrix for Fold 1
[[8724  280]
 [2617  300]]


Confusion Matrix for Fold 2
[[8743  256]
 [2585  337]]


Confusion Matrix for Fold 3
[[8758  265]
 [2573  325]]


Confusion Matrix for Fold 4
[[8698  269]
 [2628  326]]


Confusion Matrix for Fold 5
[[8831  256]
 [2534  300]]


Logistic Regression Classification w/ KFOLD CV Results (k=5):
Log. Reg Accuracy scores:  [0.7569834745407265, 0.7616810670245785, 0.7619327237647848, 0.7569834745407265, 0.7659592316080865] 

Simple mean cv accuracy: 0.761 + 0.003 

Log. Reg Precision scores:  [0.5172413793103449, 0.5682967959527825, 0.5508474576271186, 0.5478991596638656, 0.539568345323741] 

Simple mean cv precision: 0.545 +- 0.017 

Log. Reg Recall scores:  [0.10284538909838875, 0.11533196440793976, 0.11214630779848171, 0.1103588354773189, 0.1058574453069866] 

Simple mean cv recall: 0.109 +- 0.004 

Log. Reg Fbeta (beta=0.5) scores:  [0.28642352491884665, 0.31828485077446167, 0.30905287181437807, 0.3055868016497938, 0.2965599051008304]

In [21]:
random_forest_eval_kfold(X,y, k=5)

Confusion Matrix for Fold 1
[[8702  302]
 [2568  349]]


Confusion Matrix for Fold 2
[[8669  330]
 [2563  359]]


Confusion Matrix for Fold 3
[[8698  325]
 [2555  343]]


Confusion Matrix for Fold 4
[[8622  345]
 [2585  369]]


Confusion Matrix for Fold 5
[[8748  339]
 [2485  349]]


Random Forest Classification w/ KFOLD CV Results (k=5, threshold = 0.5):
Random Forest Accuracy scores:  [0.7584095294018958, 0.7577384447613456, 0.7590806140424461, 0.754383021558594, 0.7637782065262981] 

Simple mean cv accuracy: 0.759 + 0.003
Random Forest Precision scores:  [0.5360983102918587, 0.521044992743106, 0.5134730538922155, 0.5168067226890757, 0.5072674418604651] 

Simple mean cv precision: 0.519 +- 0.010
Random Forest Recall scores:  [0.11964346931779225, 0.12286105407255304, 0.11835748792270531, 0.12491536899119837, 0.12314749470712774] 

Simple mean cv recall: 0.122 +- 0.002
Random Forest Fbeta (beta=0.5) scores:  [0.3160659300851295, 0.31613244100035226, 0.307899461400359, 0.31755593803786

RandomForestClassifier()

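Based on the mean CV results, Logistic Regression (precision 0.545) and Random Forest (0.519) both outperform KNN (0.496) on precision. On F-beta (beta=0.5) the picture is closer: Random Forest's per-fold scores shown above (about 0.308 to 0.318) sit slightly above KNN's (0.298 to 0.313), while Logistic Regression's (0.286 to 0.318) overlap them.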
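
As a check on how these metrics relate to the confusion matrices printed above, the sketch below recomputes precision, recall, and F-beta (beta=0.5) for KNN fold 1 by hand. It assumes the matrices follow scikit-learn's layout, `[[TN, FP], [FN, TP]]`; the helper name is hypothetical.

```python
def metrics_from_confusion(tn, fp, fn, tp, beta=0.5):
    """Return (precision, recall, fbeta) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta


# KNN, fold 1: [[8654, 350], [2556, 361]]
precision, recall, fbeta = metrics_from_confusion(tn=8654, fp=350, fn=2556, tp=361)
print(round(precision, 4), round(recall, 4), round(fbeta, 4))
# 0.5077 0.1238 0.3133
```

With beta=0.5, precision is weighted more heavily than recall, so models with similarly low recall are separated mainly by their precision.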

## RandomizedSearchCV Hyperparameter Tuning

### KNN

In [None]:
k_range = list(range(1, 40))

In [None]:
weight_options = ['uniform', 'distance']

In [None]:
param_dist = dict(n_neighbors=k_range, weights=weight_options)

In [None]:
knn = KNeighborsClassifier()

In [None]:
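rand = RandomizedSearchCV(knn, param_dist, cv=3, scoring='precision', n_iter=10, n_jobs=-1)
# The search must be fit before best_params_ / best_score_ exist (uses X and y defined above).
rand.fit(X, y)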

In [None]:
rand.best_params_

In [None]:
rand.best_score_

### Logistic Regression
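
This section is left empty in the notebook. One way it might be filled in, mirroring the KNN search above, is sketched here on synthetic data. The search space (a log-uniform range for `C`) and all variable names are assumptions, not the author's settings.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data; in the notebook this would be X and y.
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     weights=[0.75, 0.25], random_state=0)

# Assumed search space: inverse regularization strength on a log scale.
log_param_dist = {'C': loguniform(1e-3, 1e2)}

log_search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    log_param_dist,
    cv=3,
    scoring='precision',
    n_iter=10,
    random_state=0,
)
log_search.fit(X_demo, y_demo)  # fit before reading the results

print(log_search.best_params_)
print(log_search.best_score_)
```

Sampling `C` on a log scale spreads the draws evenly across orders of magnitude, which a uniform range would not do.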

### Random Forest
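
This section is also empty in the notebook. A possible starting point, again on synthetic data, is sketched below. The parameter ranges and variable names are assumptions chosen to keep the example fast, not tuned values or the author's settings.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data; in the notebook this would be X and y.
X_demo, y_demo = make_classification(n_samples=400, n_features=10,
                                     weights=[0.75, 0.25], random_state=0)

# Assumed search space over tree count, depth, and leaf size.
rf_param_dist = {
    'n_estimators': randint(20, 60),      # integers in [20, 60)
    'max_depth': [3, 5, 10, None],
    'min_samples_leaf': randint(1, 10),   # integers in [1, 10)
    'max_features': ['sqrt', 'log2'],
}

rf_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    rf_param_dist,
    cv=3,
    scoring='precision',
    n_iter=5,
    random_state=0,
)
rf_search.fit(X_demo, y_demo)  # fit before reading the results

print(rf_search.best_params_)
print(rf_search.best_score_)
```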