## Comparing Models
In this notebook, I will compare the default and optimized performance of several popular classification algorithms in order to create the best final model.

Here are the algorithms that I will compare:
- Random Forest
- AdaBoost
- Support Vector Machines
- XGBoost

In [23]:
# Import necessary libraries
import datetime
import pickle
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
from IPython.display import display
from sklearn import svm, tree
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, auc, roc_curve
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

%matplotlib inline
warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 225)

# Load the cleaned data set; the with block closes the file once it has been read
with open("cleaned_data.pickle","rb") as pickle_in:
    df=pickle.load(pickle_in)
# I need to drop two columns that I left in for the visualization notebook
df.drop(['home_score','away_score'],axis=1,inplace=True)
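# This is the same function from the previous notebook and it will be used to
# evaluate model performance
def calc_return(X_analyse):
    """Return (natural_ror, equal_bet_ror) in percent for a frame of bets.

    X_analyse needs the columns preds, real, home_money and away_money, where
    the money columns hold American moneyline odds. Risk is stored as a negative
    number: the stake needed to win 100 on a favorite, or 100 on an underdog.
    """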
    total_risk=[]
    total_reward=[]
    equal_bet_return=[]
    for i in range(len(X_analyse)):
        k=pd.DataFrame(X_analyse.iloc[i]).transpose()
        k.reset_index(drop=True,inplace=True)
        if int(k.preds[0])==1:
            if int(k.real[0])==1:
                if int(k.home_money[0])<0:
                    risk=k.home_money[0]
                    reward=100
                else:
                    risk=-100
                    reward=k.home_money[0]
            else:
                if k.home_money[0]<0:
                    risk=k.home_money[0]
                    reward=k.home_money[0]
                else:
                    risk=-100
                    reward=-100
        else:
            if int(k.real[0])==0:
                if k.away_money[0]<0:
                    risk=k.away_money[0]
                    reward=100
                else:
                    risk=-100
                    reward=k.away_money[0]
            else:
                if k.away_money[0]<0:
                    risk=k.away_money[0]
                    reward=k.away_money[0]
                else:
                    risk=-100
                    reward=-100
        total_risk.append(risk)
        total_reward.append(reward)
        equal_bet_winnings=reward/-risk*100
        equal_bet_return.append(equal_bet_winnings)
    natural_ror=round(-np.mean(total_reward)/np.mean(total_risk)*100,2)
    equal_bet_ror=round(np.mean(equal_bet_return),2)
    return natural_ror,equal_bet_ror
# This is a function for creating train-test splits that will work with my way of
# scoring model performance based on real world return on risk
def test_split(data,test_size,random_state):
    # Shuffle the rows reproducibly and renumber them from 0
    shuf_df=data.sample(frac=1,random_state=random_state)
    shuf_df.reset_index(drop=True,inplace=True)
    df2=shuf_df.copy()
    # This separates the dataframe into data and target
    # (home_win is assumed to be the first column, so every other column is a feature)
    X_temp=df2[df2.columns[1:]]
    y=df2.home_win
    # This standardizes the data. Note that the scaler is fit on all rows,
    # including the rows that later become the test set.
    scaler=StandardScaler()
    X_s = scaler.fit_transform(X_temp)
    X=pd.DataFrame(X_s)
    # This does the train-test split in a way that I can carry through the odds values in order to calculate
    # the real-world usefulness of the model
    if len(X)!=len(y):
        raise ValueError('Features and target must have the same number of rows')
    split_value=int(round(len(X)*(1-test_size),0))
    X_train=X.iloc[0:split_value]
    X_test=X.iloc[split_value:len(X)]
    y_train=y.iloc[0:split_value]
    y_test=y.iloc[split_value:len(X)]
    # The last two columns are taken to hold the unscaled home and away moneyline odds
    home_money=shuf_df.iloc[:,-2]
    away_money=shuf_df.iloc[:,-1]
    return X_train,X_test,y_train,y_test,home_money,away_money,split_value

To carry the betting odds through for post-modeling analysis, I have created a framework for doing train-test splits that is slightly different from the usual approach.  For all cross-validation tests I will use the same randomness settings, so that differences in the results reflect the model rather than the split.
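
The idea behind this split can be sketched without any libraries. The snippet below is a minimal illustration, not the `test_split` function itself: the helper name `shuffle_and_slice` and the toy rows are made up for the example. Because the rows are shuffled once and then sliced by position, the odds for test row `i` sit at position `i + split_value` in the shuffled data, which is the lookup the modeling loops rely on.

```python
import random

def shuffle_and_slice(rows, test_size, seed):
    """Shuffle rows reproducibly, then return them with the train/test cut point."""
    shuffled = rows[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # same seed -> same order every time
    split_value = int(round(len(shuffled) * (1 - test_size), 0))
    return shuffled, split_value

# Hypothetical games: (home_win, home_money, away_money)
games = [(1, -150, 130), (0, 120, -140), (1, -110, -110), (0, -200, 170), (1, 105, -125)]

shuffled, split_value = shuffle_and_slice(games, test_size=0.2, seed=0)
train_rows = shuffled[:split_value]
test_rows = shuffled[split_value:]

# Test row i and its odds are both found at position i + split_value
for i, row in enumerate(test_rows):
    home_win, home_money, away_money = shuffled[i + split_value]
    assert row == (home_win, home_money, away_money)
```

Reusing the same seed reproduces the same split, which is what keeps the comparison between models fair.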

## Random Forest Classifier:
I will begin by examining the Random Forest Classifier, both with default settings and with some tuning.

In [30]:
results=[]
for j in range(0,200):
    X_train,X_test,y_train,y_test,home_money,away_money,split_value=test_split(df,.2,j)
    forest=RandomForestClassifier()
    forest.fit(X_train,y_train)
    train_pred=forest.predict(X_train)
    test_pred=forest.predict(X_test)
    # Pair each test prediction with the real outcome and the odds for that game
    bets=pd.DataFrame({'preds':(test_pred>.5).astype(int),'real':y_test.values,
                       'home_money':home_money.iloc[split_value:].values,
                       'away_money':away_money.iloc[split_value:].values})
    train_acc=round(accuracy_score(y_train,train_pred)*100,1)
    test_acc = round(accuracy_score(y_test,test_pred)*100,1)
    nat,equal=calc_return(bets)
    results.append({'fold':j+1,'train':train_acc,'test':test_acc,'nat':nat,'equal':equal})
# Build the frame once; DataFrame.append was removed in pandas 2.0
results=pd.DataFrame(results)
print('Mean Results')
print('Training Accuracy: ',round(results.train.mean(),1))
print("Test Accuracy: ",round(results.test.mean(),1))
print('Natural Return on Risk:', round(results.nat.mean(),1),'%')
print('Equal Wager Return on Risk:',round(results.equal.mean(),1),'%')

Mean Results
Training Accuracy:  98.5
Test Accuracy:  53.9
Natural Return on Risk: -0.9 %
Equal Wager Return on Risk: -1.6 %


In [64]:
results=[]
for crit in ['gini','entropy']:
    for depth in [3,10,20]:
        for samp_split in [10,25]:
            for num in [20,50,100]:
                holder=[]
                for j in range(0,10):
                    X_train,X_test,y_train,y_test,home_money,away_money,split_value=test_split(df,.2,j)
                    forest=RandomForestClassifier(n_estimators=num,criterion=crit,max_depth=depth,
                                                 min_samples_split=samp_split,random_state=j+10)
                    forest.fit(X_train,y_train)
                    train_pred=forest.predict(X_train)
                    test_pred=forest.predict(X_test)
                    # Pair each test prediction with the real outcome and the odds for that game
                    bets=pd.DataFrame({'preds':(test_pred>.5).astype(int),'real':y_test.values,
                                       'home_money':home_money.iloc[split_value:].values,
                                       'away_money':away_money.iloc[split_value:].values})
                    train_acc=round(accuracy_score(y_train,train_pred)*100,1)
                    test_acc = round(accuracy_score(y_test,test_pred)*100,1)
                    nat,equal=calc_return(bets)
                    holder.append({'train':train_acc,'test':test_acc,'nat':nat,'equal':equal})
                # Average the 10 splits for this parameter combination
                holder=pd.DataFrame(holder)
                results.append({'criteria':crit,'depth':depth,'samp_split':samp_split,
                                'num':num,'train':holder.train.mean(),
                                'test':holder.test.mean(),
                                'nat':holder.nat.mean(),
                                'equal':holder.equal.mean()})
# Build the frame once; DataFrame.append was removed in pandas 2.0
results=pd.DataFrame(results)
print('Best Test Score:')
display(results[results.test==results.test.max()])
print('Best Natural Return on Risk:')
display(results[results.nat==results.nat.max()])
print('Best Equal Wager Return on Risk:')
display(results[results.equal==results.equal.max()])

Best Test Score:

| |criteria|depth|samp_split|num|train|test|nat|equal|
|----|----|----|----|----|----|----|----|----|
|35|entropy|20|25|100|96.74|60.39|3.676|3.411|

Best Natural Return on Risk:

| |criteria|depth|samp_split|num|train|test|nat|equal|
|----|----|----|----|----|----|----|----|----|
|6|gini|10|10|20|97.74|59.37|4.339|4.389|

Best Equal Wager Return on Risk:

| |criteria|depth|samp_split|num|train|test|nat|equal|
|----|----|----|----|----|----|----|----|----|
|6|gini|10|10|20|97.74|59.37|4.339|4.389|

By tuning parameters we were able to achieve a model that predicts about 59.4% of games correctly and earns a return on risk of roughly 4.3% to 4.4%, averaged over 10 random splits.  This is a clear improvement over the default scores of 53.9% test accuracy and a slightly negative return on risk.
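
The four nested loops in the tuning cell enumerate every combination of the candidate values. As a minimal sketch (the variable names here are illustrative, not part of the notebook), `itertools.product` produces the same combinations in the same order, which makes it easy to see that the grid holds 36 settings and to map a row index in `results` back to its parameters.

```python
from itertools import product

# Candidate values copied from the Random Forest grid
criteria = ['gini', 'entropy']
depths = [3, 10, 20]
samp_splits = [10, 25]
nums = [20, 50, 100]

# product varies the last argument fastest, like the innermost for loop
grid = list(product(criteria, depths, samp_splits, nums))

print(len(grid))   # 36
print(grid[6])     # ('gini', 10, 10, 20)
print(grid[35])    # ('entropy', 20, 25, 100)
```

Rows 6 and 35 are the two settings reported in the tuning output.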

## AdaBoost algorithm:

In [65]:
results=[]
for j in range(0,200):
    X_train,X_test,y_train,y_test,home_money,away_money,split_value=test_split(df,.2,j)
    ada=AdaBoostClassifier()
    ada.fit(X_train,y_train)
    train_pred=ada.predict(X_train)
    test_pred=ada.predict(X_test)
    # Pair each test prediction with the real outcome and the odds for that game
    bets=pd.DataFrame({'preds':(test_pred>.5).astype(int),'real':y_test.values,
                       'home_money':home_money.iloc[split_value:].values,
                       'away_money':away_money.iloc[split_value:].values})
    train_acc=round(accuracy_score(y_train,train_pred)*100,1)
    test_acc = round(accuracy_score(y_test,test_pred)*100,1)
    nat,equal=calc_return(bets)
    results.append({'fold':j+1,'train':train_acc,'test':test_acc,'nat':nat,'equal':equal})
# Build the frame once; DataFrame.append was removed in pandas 2.0
results=pd.DataFrame(results)
print('Mean Results')
print('Training Accuracy: ',round(results.train.mean(),1))
print("Test Accuracy: ",round(results.test.mean(),1))
print('Natural Return on Risk:', round(results.nat.mean(),1),'%')
print('Equal Wager Return on Risk:',round(results.equal.mean(),1),'%')

Mean Results
Training Accuracy:  84.5
Test Accuracy:  52.5
Natural Return on Risk: -3.1 %
Equal Wager Return on Risk: -4.1 %


In [66]:
results=[]
for rate in [1,5,10,20]:
    for num in [20,50,100]:
        holder=[]
        for j in range(0,10):
            X_train,X_test,y_train,y_test,home_money,away_money,split_value=test_split(df,.2,j)
            ada=AdaBoostClassifier(learning_rate=rate,n_estimators=num)
            ada.fit(X_train,y_train)
            train_pred=ada.predict(X_train)
            test_pred=ada.predict(X_test)
            # Pair each test prediction with the real outcome and the odds for that game
            bets=pd.DataFrame({'preds':(test_pred>.5).astype(int),'real':y_test.values,
                               'home_money':home_money.iloc[split_value:].values,
                               'away_money':away_money.iloc[split_value:].values})
            train_acc=round(accuracy_score(y_train,train_pred)*100,1)
            test_acc = round(accuracy_score(y_test,test_pred)*100,1)
            nat,equal=calc_return(bets)
            holder.append({'train':train_acc,'test':test_acc,'nat':nat,'equal':equal})
        # Average the 10 splits for this parameter combination
        holder=pd.DataFrame(holder)
        results.append({'learning_rate':rate,'num':num,
                        'train':holder.train.mean(),
                        'test':holder.test.mean(),
                        'nat':holder.nat.mean(),
                        'equal':holder.equal.mean()})
# Build the frame once; DataFrame.append was removed in pandas 2.0
results=pd.DataFrame(results)
print('Best Test Score:')
display(results[results.test==results.test.max()])
print('Best Natural Return on Risk:')
display(results[results.nat==results.nat.max()])
print('Best Equal Wager Return on Risk:')
display(results[results.equal==results.equal.max()])

Best Test Score:

| |learning_rate|num|train|test|nat|equal|
|----|----|----|----|----|----|----|
|0|1|20|73.68|52.45|-3.373|-5.171|

Best Natural Return on Risk:

| |learning_rate|num|train|test|nat|equal|
|----|----|----|----|----|----|----|
|0|1|20|73.68|52.45|-3.373|-5.171|

Best Equal Wager Return on Risk:

| |learning_rate|num|train|test|nat|equal|
|----|----|----|----|----|----|----|
|0|1|20|73.68|52.45|-3.373|-5.171|


Even the best tuned configuration reached only 52.45% test accuracy with a natural return on risk of -3.37%, so AdaBoost performed substantially worse than the Random Forest.  We can throw out AdaBoost as it is not useful for our purposes.

## Support Vector Machines:

In [68]:
results=[]
for j in range(0,100):
    X_train,X_test,y_train,y_test,home_money,away_money,split_value=test_split(df,.2,j)
    svm_clf=svm.SVC()
    svm_clf.fit(X_train,y_train)
    train_pred=svm_clf.predict(X_train)
    test_pred=svm_clf.predict(X_test)
    # Pair each test prediction with the real outcome and the odds for that game
    bets=pd.DataFrame({'preds':(test_pred>.5).astype(int),'real':y_test.values,
                       'home_money':home_money.iloc[split_value:].values,
                       'away_money':away_money.iloc[split_value:].values})
    train_acc=round(accuracy_score(y_train,train_pred)*100,1)
    test_acc = round(accuracy_score(y_test,test_pred)*100,1)
    nat,equal=calc_return(bets)
    results.append({'fold':j+1,'train':train_acc,'test':test_acc,'nat':nat,'equal':equal})
# Build the frame once; DataFrame.append was removed in pandas 2.0
results=pd.DataFrame(results)
print('Mean Results')
print('Training Accuracy: ',round(results.train.mean(),1))
print("Test Accuracy: ",round(results.test.mean(),1))
print('Natural Return on Risk:', round(results.nat.mean(),1),'%')
print('Equal Wager Return on Risk:',round(results.equal.mean(),1),'%')

Mean Results
Training Accuracy:  91.0
Test Accuracy:  58.4
Natural Return on Risk: 1.8 %
Equal Wager Return on Risk: 2.0 %


In [72]:
results=[]
for kern in ['linear','poly','sigmoid','rbf']:
    for c in [3,5,7,9]:
        for gam in ['auto',10,25]:
            holder=[]
            for j in range(0,10):
                X_train,X_test,y_train,y_test,home_money,away_money,split_value=test_split(df,.2,j)
                svm_clf=svm.SVC(kernel=kern,C=c,gamma=gam,random_state=j+10)
                svm_clf.fit(X_train,y_train)
                train_pred=svm_clf.predict(X_train)
                test_pred=svm_clf.predict(X_test)
                # Pair each test prediction with the real outcome and the odds for that game
                bets=pd.DataFrame({'preds':(test_pred>.5).astype(int),'real':y_test.values,
                                   'home_money':home_money.iloc[split_value:].values,
                                   'away_money':away_money.iloc[split_value:].values})
                train_acc=round(accuracy_score(y_train,train_pred)*100,1)
                test_acc = round(accuracy_score(y_test,test_pred)*100,1)
                nat,equal=calc_return(bets)
                holder.append({'train':train_acc,'test':test_acc,'nat':nat,'equal':equal})
            # Average the 10 splits for this parameter combination
            holder=pd.DataFrame(holder)
            results.append({'kernel':kern,'C':c,'gamma':gam,
                            'train':holder.train.mean(),
                            'test':holder.test.mean(),
                            'nat':holder.nat.mean(),
                            'equal':holder.equal.mean()})
# Build the frame once; DataFrame.append was removed in pandas 2.0
results=pd.DataFrame(results)
print('Best Test Score:')
display(results[results.test==results.test.max()])
print('Best Natural Return on Risk:')
display(results[results.nat==results.nat.max()])
print('Best Equal Wager Return on Risk:')
display(results[results.equal==results.equal.max()])

Best Test Score:

| |kernel|C|gamma|train|test|nat|equal|
|----|----|----|----|----|----|----|----|
|3|linear|5|auto|84.42|57.09|5.94|5.361|
|4|linear|5|10|84.42|57.09|5.94|5.361|
|5|linear|5|25|84.42|57.09|5.94|5.361|

Best Natural Return on Risk:

| |kernel|C|gamma|train|test|nat|equal|
|----|----|----|----|----|----|----|----|
|3|linear|5|auto|84.42|57.09|5.94|5.361|
|4|linear|5|10|84.42|57.09|5.94|5.361|
|5|linear|5|25|84.42|57.09|5.94|5.361|

Best Equal Wager Return on Risk:

| |kernel|C|gamma|train|test|nat|equal|
|----|----|----|----|----|----|----|----|
|3|linear|5|auto|84.42|57.09|5.94|5.361|
|4|linear|5|10|84.42|57.09|5.94|5.361|
|5|linear|5|25|84.42|57.09|5.94|5.361|


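This result is interesting because there is a very clear winner: the linear kernel with C=5 is the best configuration on all three metrics.  Gamma has no impact on these rows, which is expected because gamma is a kernel coefficient for the rbf, poly and sigmoid kernels and is ignored by the linear kernel.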
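
To see why gamma cannot matter for a linear kernel, it helps to write two of the kernels out. The sketch below uses plain Python rather than scikit-learn, and the function names and sample points are made up for illustration; the formulas are the standard linear and RBF kernel definitions.

```python
import math

def linear_kernel(x, y, gamma=None):
    """Linear kernel: the dot product of x and y. gamma is accepted but unused."""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Two hypothetical feature vectors
x, y = [1.0, 2.0], [0.5, -1.0]

# The linear value stays the same for every gamma, while the RBF value changes
for gamma in [0.1, 10, 25]:
    print(gamma, linear_kernel(x, y, gamma), rbf_kernel(x, y, gamma))
```

Since the linear kernel never uses gamma, models trained with different gamma values are identical, which matches the three identical rows in the output.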

## XGBoost algorithm:

In [77]:
results=[]
for j in range(0,100):
    X_train,X_test,y_train,y_test,home_money,away_money,split_value=test_split(df,.2,j)
    boost=xgb.XGBClassifier()
    boost.fit(X_train,y_train)
    train_pred=boost.predict(X_train)
    test_pred=boost.predict(X_test)
    # Pair each test prediction with the real outcome and the odds for that game
    bets=pd.DataFrame({'preds':(test_pred>.5).astype(int),'real':y_test.values,
                       'home_money':home_money.iloc[split_value:].values,
                       'away_money':away_money.iloc[split_value:].values})
    train_acc=round(accuracy_score(y_train,train_pred)*100,1)
    test_acc = round(accuracy_score(y_test,test_pred)*100,1)
    nat,equal=calc_return(bets)
    results.append({'fold':j+1,'train':train_acc,'test':test_acc,'nat':nat,'equal':equal})
# Build the frame once; DataFrame.append was removed in pandas 2.0
results=pd.DataFrame(results)
print('Mean Results')
print('Training Accuracy: ',round(results.train.mean(),1))
print("Test Accuracy: ",round(results.test.mean(),1))
print('Natural Return on Risk:', round(results.nat.mean(),1),'%')
print('Equal Wager Return on Risk:',round(results.equal.mean(),1),'%')

Mean Results
Training Accuracy:  99.2
Test Accuracy:  54.0
Natural Return on Risk: -2.7 %
Equal Wager Return on Risk: -3.6 %


In [102]:
# Note: Because there are so many tunable parameters in the XGBoost model,
# I did a large amount of testing outside of this notebook and have only included
# a handful of potential values in order to save runtime
results=[]
for learn in [.01,.001]:
    for depth in [20,50]:
        for child in [3,10]:
            for num in [50,100,200]:
                for sub in [.4]:
                    for gam in [10]:
                        holder=[]
                        for j in range(0,10):
                            X_train,X_test,y_train,y_test,home_money,away_money,split_value=test_split(df,.2,j)
                            boost=xgb.XGBClassifier(learning_rate=learn,max_depth=depth,
                                                    min_child_weight=child,n_estimators=num,
                                                    subsample=sub,gamma=gam)
                            boost.fit(X_train,y_train)
                            train_pred=boost.predict(X_train)
                            test_pred=boost.predict(X_test)
                            # Pair each test prediction with the real outcome and the odds for that game
                            bets=pd.DataFrame({'preds':(test_pred>.5).astype(int),'real':y_test.values,
                                               'home_money':home_money.iloc[split_value:].values,
                                               'away_money':away_money.iloc[split_value:].values})
                            train_acc=round(accuracy_score(y_train,train_pred)*100,1)
                            test_acc = round(accuracy_score(y_test,test_pred)*100,1)
                            nat,equal=calc_return(bets)
                            holder.append({'train':train_acc,'test':test_acc,'nat':nat,'equal':equal})
                        # Average the 10 splits for this parameter combination
                        holder=pd.DataFrame(holder)
                        results.append({'learning_rate':learn,'depth':depth,
                                        'child':child,'num':num,'sub':sub,
                                        'gamma':gam,'train':holder.train.mean(),
                                        'test':holder.test.mean(),
                                        'nat':holder.nat.mean(),
                                        'equal':holder.equal.mean()})
# Build the frame once; DataFrame.append was removed in pandas 2.0
results=pd.DataFrame(results)
print('Best Test Score:')
display(results[results.test==results.test.max()])
print('Best Natural Return on Risk:')
display(results[results.nat==results.nat.max()])
print('Best Equal Wager Return on Risk:')
display(results[results.equal==results.equal.max()])

Best Test Score:

| |learning_rate|depth|child|num|sub|gamma|train|test|nat|equal|
|----|----|----|----|----|----|----|----|----|----|----|
|17|0.001|20|10|200|0.4|10|62.88|60.94|3.392|3.11|
|23|0.001|50|10|200|0.4|10|62.88|60.94|3.392|3.11|

Best Natural Return on Risk:

| |learning_rate|depth|child|num|sub|gamma|train|test|nat|equal|
|----|----|----|----|----|----|----|----|----|----|----|
|17|0.001|20|10|200|0.4|10|62.88|60.94|3.392|3.11|
|23|0.001|50|10|200|0.4|10|62.88|60.94|3.392|3.11|

Best Equal Wager Return on Risk:

| |learning_rate|depth|child|num|sub|gamma|train|test|nat|equal|
|----|----|----|----|----|----|----|----|----|----|----|
|17|0.001|20|10|200|0.4|10|62.88|60.94|3.392|3.11|
|23|0.001|50|10|200|0.4|10|62.88|60.94|3.392|3.11|


Again, parameter tuning substantially improved the results of the XGBoost model: test accuracy rose from 54.0% to 60.9%, and the natural return on risk went from -2.7% to 3.4%.

To recap, we have found a solid model using three of the four algorithms tested.  Here is how they compare to the baseline of simply betting with the Vegas odds:

|Model|Accuracy|Natural Return on Risk|Equal Wager Return on Risk|
|----|----|----|----|
|Vegas Odds|61.1%|2.53%|3.11%|
|Random Forest|59.4%|4.3%|4.4%|
|Support Vector Machine|57.1%|5.9%|5.4%|
|XGBoost|60.9%|3.4%|3.1%|

Interestingly, none of the three models matched the accuracy of the Vegas odds, but all three generated a higher natural return on risk.  The Random Forest and Support Vector Machine also beat the Vegas baseline on equal wager return, while XGBoost roughly matched it at 3.1%.  This suggests that the models are better at picking underdogs and benefiting from the higher payout.  In the next section, I will take the parameters for these three models and attempt to combine them into an even more predictive model.
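
A small worked example shows how a model can win less often than the favorites and still earn more. The odds and win rates below are hypothetical, chosen only to illustrate the arithmetic, and `profit_per_100_risked` and `expected_return` are helpers written for this example. They follow the same moneyline convention as `calc_return`: a favorite at -150 risks 150 to win 100, and an underdog at +150 risks 100 to win 150.

```python
def profit_per_100_risked(moneyline, won):
    """Profit on one bet, scaled to 100 units risked, for American moneyline odds."""
    if not won:
        return -100.0                       # a losing bet loses the whole stake
    if moneyline < 0:
        return 100.0 * 100 / -moneyline     # favorite: risk |moneyline| to win 100
    return float(moneyline)                 # underdog: risk 100 to win moneyline

def expected_return(moneyline, win_rate):
    """Average profit per 100 risked when bets at these odds win at win_rate."""
    win = profit_per_100_risked(moneyline, True)
    loss = profit_per_100_risked(moneyline, False)
    return win_rate * win + (1 - win_rate) * loss

# Hypothetical favorite at -200 that wins 65% of the time
print(round(expected_return(-200, 0.65), 2))   # -2.5
# Hypothetical underdog at +150 that wins 45% of the time
print(round(expected_return(150, 0.45), 2))    # 12.5
```

In this made-up case the favorite wins far more often yet loses money on average, while the underdog wins less than half the time and still returns a profit.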

#### For the next section please see the notebook titled "Combining_Models"