## Combining Models
In this notebook I will take the three best models from the previous section and combine them in various ways in order to create a more powerful model for predicting MLB games.  For details on what the parameters are, please see the previous notebook.

Below I have imported the necessary libraries along with the functions created in the previous notebook.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import datetime
%matplotlib inline
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.preprocessing import StandardScaler
from scipy import stats
import xgboost as xgb
import pickle
from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import accuracy_score, roc_curve, auc
from sklearn.model_selection import cross_val_score
warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 225)
pickle_in=open("cleaned_data.pickle","rb")
df=pickle.load(pickle_in)
# I need to drop two columns that I left in for the visualization notebook
df.drop(['home_score','away_score'],axis=1,inplace=True)
# This is the same function from the previous notebook and it will be used to
# evaluate model performance
def calc_return(X_analyse):
    total_risk=[]
    total_reward=[]
    equal_bet_return=[]
    for i in range(len(X_analyse)):
        k=pd.DataFrame(X_analyse.iloc[i]).transpose()
        k.reset_index(drop=True,inplace=True)
        if int(k.preds[0])==1:
            if int(k.real[0])==1:
                if int(k.home_money[0])<0:
                    risk=k.home_money[0]
                    reward=100
                else:
                    risk=-100
                    reward=k.home_money[0]
            else:
                if k.home_money[0]<0:
                    risk=k.home_money[0]
                    reward=k.home_money[0]
                else:
                    risk=-100
                    reward=-100
        else:
            if int(k.real[0])==0:
                if k.away_money[0]<0:
                    risk=k.away_money[0]
                    reward=100
                else:
                    risk=-100
                    reward=k.away_money[0]
            else:
                if k.away_money[0]<0:
                    risk=k.away_money[0]
                    reward=k.away_money[0]
                else:
                    risk=-100
                    reward=-100
        total_risk.append(risk)
        total_reward.append(reward)
        equal_bet_winnings=reward/-risk*100
        equal_bet_return.append(equal_bet_winnings)
    natural_ror=round(-np.mean(total_reward)/np.mean(total_risk)*100,2)
    equal_bet_ror=round(np.mean(equal_bet_return),2)
    return natural_ror,equal_bet_ror
# This is a function for creating train-test splits that will work with my way of
# scoring model performance based on real world return on risk
def test_split(data,test_size,random_state):
    shuf_df=data.sample(frac=1,random_state=random_state)
    shuf_df.reset_index(drop=True,inplace=True)
    df2=shuf_df.copy()
    # This separates the dataframe into data and target
    X_temp=df2[df2.columns[1:]]
    y=df2.home_win
    # This standardizes the data (note: the scaler is fit on all rows,
    # including the rows that later become the test set)
    scaler=StandardScaler()
    X_s = scaler.fit_transform(X_temp)
    X=pd.DataFrame(X_s)
    # This does the train-test split in a way that I can carry through the odds values in order to calculate
    # the real-world usefulness of the model
    if len(X)==len(y):
        split_value=int(round(len(X)*(1-test_size),0))
        X_train=X.iloc[0:split_value]
        X_test=X.iloc[split_value:len(X)]
        y_train=y.iloc[0:split_value]
        y_test=y.iloc[split_value:len(X)]
        home_money=shuf_df.iloc[:,-2]
        away_money=shuf_df.iloc[:,-1]
    return X_train,X_test,y_train,y_test,home_money,away_money,split_value
# This function visualizes the results of a model combination
def plot_results(results):
    nat_plmean=np.mean(results.nat)
    nat_plu=nat_plmean+np.std(results.nat)
    nat_pll=nat_plmean-np.std(results.nat)
    eq_plmean=np.mean(results.equal)
    eq_plu=eq_plmean+np.std(results.equal)
    eq_pll=eq_plmean-np.std(results.equal)
    scmean=np.mean(results.combo_acc)
    scu=scmean+np.std(results.combo_acc)
    scl=scmean-np.std(results.combo_acc)
    plt.figure(figsize=(20,10))
    plt.plot(results.fold,results.nat,label="Natural Wagers Return on Risk: {0}%".format(round(results.nat.mean(),2)))
    plt.plot(results.fold,results.equal,label="Equal Wagers Return on Risk: {0}%".format(round(results.equal.mean(),2)))
    plt.plot(results.fold,results.vegas_nat,label="Vegas Natural Wagers Return on Risk: {0}%".format(round(results.vegas_nat.mean(),2)))
    plt.plot(results.fold,results.vegas_equal,label="Vegas Equal Wagers Return on Risk: {0}%".format(round(results.vegas_equal.mean(),2)))
    plt.legend()
    plt.title('Real World Return on Risk')
    plt.xlabel('Trial Number')
    plt.ylabel('Percent Return')
    plt.figure(figsize=(20,10))
    plt.plot(results.fold,results.combo_acc,label="Combined Accuracy Score: {0}".format(round(results.combo_acc.mean(),2)))
    plt.plot(results.fold,results.vegas_acc,label="Vegas Accuracy Score: {0}".format(round(results.vegas_acc.mean(),2)))
    # Plot each individual model's accuracy only if its column is present
    if any("boost" in s for s in list(results.columns.values)):
        plt.plot(results.fold,results.boost_acc,alpha=.5,label="Boost Accuracy Score: {0}".format(round(results.boost_acc.mean(),2)))
    if any("svm" in s for s in list(results.columns.values)):
        plt.plot(results.fold,results.svm_acc,alpha=.5,label="SVM Accuracy Score: {0}".format(round(results.svm_acc.mean(),2)))
    if any("forest" in s for s in list(results.columns.values)):
        plt.plot(results.fold,results.forest_acc,alpha=.5,label="Random Forest Accuracy Score: {0}".format(round(results.forest_acc.mean(),2)))
    plt.legend() 
    plt.title('Visualization of Accuracy Scores')
    plt.xlabel('Trial Number')
    plt.ylabel('Accuracy Score')

## Random Forest and the Support Vector Machine combo:
Each model predicts a winner, and a bet is placed only on games where both models agree.
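A minimal sketch of this agreement rule, using made-up predictions and odds rather than the notebook's data. The helper name `agreed_bets` is hypothetical, and the sketch assumes binary 0/1 predictions where 1 means a home win:

```python
import numpy as np
import pandas as pd

def agreed_bets(pred_a, pred_b, real, home_money, away_money):
    """Keep only the games where both models pick the same side."""
    pred_a = np.asarray(pred_a)
    pred_b = np.asarray(pred_b)
    agree = pred_a == pred_b  # True where the two models agree
    return pd.DataFrame({
        'preds': pred_a[agree],  # the shared pick
        'real': np.asarray(real)[agree],
        'home_money': np.asarray(home_money)[agree],
        'away_money': np.asarray(away_money)[agree],
    })

# Made-up example: the models disagree on the second game only
forest_pred = [1, 0, 1, 0]
svm_pred = [1, 1, 1, 0]
real = [1, 0, 0, 0]
home_money = [-150, 120, -110, 130]
away_money = [130, -140, -110, -150]

bets = agreed_bets(forest_pred, svm_pred, real, home_money, away_money)
print(bets)
```

The resulting frame has the same four columns (`preds`, `real`, `home_money`, `away_money`) that `calc_return` reads.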

In [30]:
results_rf_svm=pd.DataFrame([])
for j in range(0,50):
    X_train,X_test,y_train,y_test,home_money,away_money,split_value=test_split(df,.1,j*3)
    forest=RandomForestClassifier(n_estimators=20,criterion='gini',max_depth=10,
                                                 min_samples_split=10,random_state=j*3)
    forest.fit(X_train,y_train)
    forest_pred=forest.predict(X_test)
    svm_clf=svm.SVC(kernel='linear',C=6,random_state=j*3)
    svm_clf.fit(X_train,y_train)
    svm_pred=svm_clf.predict(X_test)
    # Bet only on games where both models agree; the shared pick is the prediction
    agree=forest_pred==svm_pred
    bets=pd.DataFrame({'preds':forest_pred[agree],'real':y_test.values[agree],
                       'home_money':home_money.iloc[split_value:].values[agree],
                       'away_money':away_money.iloc[split_value:].values[agree]})
    forest_acc=round(accuracy_score(y_test,forest_pred)*100,1)
    svm_acc=round(accuracy_score(y_test,svm_pred)*100,1)
    combo_acc=round(accuracy_score(bets.real,bets.preds)*100,1)
    nat,equal=calc_return(bets)
    results_rf_svm=pd.concat([results_rf_svm,pd.DataFrame({'fold':j+1,'forest_acc':forest_acc,
                                                       'svm_acc':svm_acc,'combo_acc':combo_acc,
                                                       'nat':nat,'equal':equal},index=[0])],ignore_index=True)
print('Average Random Forest Accuracy Score: ',round(results_rf_svm.forest_acc.mean(),2))
print('Average SVM Accuracy Score: ',round(results_rf_svm.svm_acc.mean(),2))
print('Average Combined Accuracy Score: ',round(results_rf_svm.combo_acc.mean(),2))
print('Average Natural Wager Return on Risk: ',round(results_rf_svm.nat.mean(),2))
print('Average Equal Wager Return on Risk: ',round(results_rf_svm.equal.mean(),2))

Average Random Forest Accuracy Score:  57.03
Average SVM Accuracy Score:  55.99
Average Combined Accuracy Score:  60.36
Average Natural Wager Return on Risk:  4.2
Average Equal Wager Return on Risk:  4.76


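As we can see, restricting bets to the games where both models agree gives a higher accuracy (60.36%) than either model achieves on the full test set (57.03% and 55.99%). Note that the combined score is measured only on the agreed-upon games, so it is not a like-for-like comparison.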
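The two return-on-risk figures printed above come from `calc_return`. As a rough illustration of the convention that function follows (a favorite with negative odds risks the absolute odds to win 100, an underdog risks 100 to win the odds), here is a simplified sketch on three made-up bets. `bet_outcome` and `return_on_risk` are hypothetical helper names, not functions from the notebook:

```python
def bet_outcome(odds, won):
    """Return (risk, reward) for one bet at American moneyline odds."""
    risk = abs(odds) if odds < 0 else 100
    if won:
        reward = 100 if odds < 0 else odds
    else:
        reward = -risk  # a losing bet forfeits the full stake
    return risk, reward

def return_on_risk(bets):
    """bets: list of (odds, won) pairs. Returns (natural, equal) in percent."""
    outcomes = [bet_outcome(odds, won) for odds, won in bets]
    risks = [risk for risk, _ in outcomes]
    rewards = [reward for _, reward in outcomes]
    # Natural wagers: total winnings over total amount risked
    natural = 100 * sum(rewards) / sum(risks)
    # Equal wagers: average per-bet return, so every bet carries the same weight
    equal = 100 * sum(rw / rk for rk, rw in outcomes) / len(outcomes)
    return round(natural, 2), round(equal, 2)

# Made-up bets: a winning favorite, a winning underdog, a losing favorite
example_bets = [(-150, True), (120, True), (-200, False)]
natural, equal = return_on_risk(example_bets)
print(natural, equal)
```

The natural figure weights each bet by the size of its stake, while the equal figure treats every bet as the same size, which is why the two can differ noticeably when heavy favorites lose.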

## Random Forest and the XG Boost combo:

In [32]:
results_rf_xg=pd.DataFrame([])
for j in range(0,50):
    X_train,X_test,y_train,y_test,home_money,away_money,split_value=test_split(df,.1,j*3)
    forest=RandomForestClassifier(n_estimators=20,criterion='gini',max_depth=10,
                                                 min_samples_split=10,random_state=j*3)
    forest.fit(X_train,y_train)
    forest_pred=forest.predict(X_test)
    boost=xgb.XGBClassifier(learning_rate=.001,max_depth=20,
                            min_child_weight=10,n_estimators=200,subsample=.4,gamma=10,random_state=j*3)
    boost.fit(X_train,y_train)
    boost_pred=boost.predict(X_test)
    # Bet only on games where both models agree; the shared pick is the prediction
    agree=forest_pred==boost_pred
    bets=pd.DataFrame({'preds':forest_pred[agree],'real':y_test.values[agree],
                       'home_money':home_money.iloc[split_value:].values[agree],
                       'away_money':away_money.iloc[split_value:].values[agree]})
    forest_acc=round(accuracy_score(y_test,forest_pred)*100,1)
    boost_acc=round(accuracy_score(y_test,boost_pred)*100,1)
    combo_acc=round(accuracy_score(bets.real,bets.preds)*100,1)
    nat,equal=calc_return(bets)
    results_rf_xg=pd.concat([results_rf_xg,pd.DataFrame({'fold':j+1,'forest_acc':forest_acc,
                                                       'boost_acc':boost_acc,'combo_acc':combo_acc,
                                                       'nat':nat,'equal':equal},index=[0])],ignore_index=True)
print('Average Random Forest Accuracy Score: ',round(results_rf_xg.forest_acc.mean(),2))
print('Average XG Boost Accuracy Score: ',round(results_rf_xg.boost_acc.mean(),2))
print('Average Combined Accuracy Score: ',round(results_rf_xg.combo_acc.mean(),2))
print('Average Natural Wager Return on Risk: ',round(results_rf_xg.nat.mean(),2))
print('Average Equal Wager Return on Risk: ',round(results_rf_xg.equal.mean(),2))

Average Random Forest Accuracy Score:  57.03
Average XG Boost Accuracy Score:  61.26
Average Combined Accuracy Score:  61.78
Average Natural Wager Return on Risk:  2.95
Average Equal Wager Return on Risk:  3.35


Again, combining the models improved the accuracy score, although this time only slightly (61.78% on the agreed-upon games versus 61.26% for XG Boost alone).

## SVM and XG Boost combo:

In [33]:
results_svm_xg=pd.DataFrame([])
for j in range(0,50):
    X_train,X_test,y_train,y_test,home_money,away_money,split_value=test_split(df,.1,j*3)
    svm_clf=svm.SVC(C=6,kernel='linear',random_state=j*3)
    svm_clf.fit(X_train,y_train)
    svm_pred=svm_clf.predict(X_test)
    boost=xgb.XGBClassifier(learning_rate=.001,max_depth=50,
                            min_child_weight=10,n_estimators=200,subsample=.4,gamma=10,random_state=j*3)
    boost.fit(X_train,y_train)
    boost_pred=boost.predict(X_test)
    # Bet only on games where both models agree; the shared pick is the prediction
    agree=svm_pred==boost_pred
    bets=pd.DataFrame({'preds':svm_pred[agree],'real':y_test.values[agree],
                       'home_money':home_money.iloc[split_value:].values[agree],
                       'away_money':away_money.iloc[split_value:].values[agree]})
    svm_acc=round(accuracy_score(y_test,svm_pred)*100,1)
    boost_acc=round(accuracy_score(y_test,boost_pred)*100,1)
    combo_acc=round(accuracy_score(bets.real,bets.preds)*100,1)
    nat,equal=calc_return(bets)
    results_svm_xg=pd.concat([results_svm_xg,pd.DataFrame({'fold':j+1,'svm_acc':svm_acc,
                                                       'boost_acc':boost_acc,'combo_acc':combo_acc,
                                                       'nat':nat,'equal':equal},index=[0])],ignore_index=True)
print('Average SVM Accuracy Score: ',round(results_svm_xg.svm_acc.mean(),2))
print('Average XG Boost Accuracy Score: ',round(results_svm_xg.boost_acc.mean(),2))
print('Average Combined Accuracy Score: ',round(results_svm_xg.combo_acc.mean(),2))
print('Average Natural Wager Return on Risk: ',round(results_svm_xg.nat.mean(),2))
print('Average Equal Wager Return on Risk: ',round(results_svm_xg.equal.mean(),2))

Average SVM Accuracy Score:  55.99
Average XG Boost Accuracy Score:  61.26
Average Combined Accuracy Score:  63.65
Average Natural Wager Return on Risk:  6.24
Average Equal Wager Return on Risk:  6.88


Once again, combining the models improves the accuracy score: 63.65% on the agreed-upon games, versus 55.99% and 61.26% for the individual models on the full test set.
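Because the combined score is computed only on games where the models agree, it helps to look at how many games that covers alongside the accuracy. The sketch below does this on made-up predictions; `agreement_summary` is a hypothetical helper, not part of the notebook, and it assumes binary 0/1 predictions:

```python
import numpy as np

def agreement_summary(preds, real):
    """preds: one 0/1 prediction array per model. Returns (coverage, accuracy)."""
    preds = np.asarray(preds)
    real = np.asarray(real)
    # A game counts as agreed when every model matches the first model's pick
    agree = (preds == preds[0]).all(axis=0)
    coverage = agree.mean()  # fraction of games that would be bet on
    if agree.any():
        accuracy = (preds[0][agree] == real[agree]).mean()
    else:
        accuracy = float('nan')  # no agreed games, so accuracy is undefined
    return coverage, accuracy

# Made-up predictions for eight games
svm_pred = [1, 0, 1, 1, 0, 0, 1, 0]
boost_pred = [1, 0, 0, 1, 0, 1, 1, 0]
real = [1, 0, 1, 0, 0, 1, 1, 0]

coverage, accuracy = agreement_summary([svm_pred, boost_pred], real)
print('Share of games bet on:', coverage)
print('Accuracy on those games:', round(accuracy, 3))
```

Passing three prediction arrays instead of two gives the same summary for a unanimous three-model vote, which will generally cover fewer games.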

## All three model combo:

In [34]:
results_all3=pd.DataFrame([])
for j in range(0,50):
    X_train,X_test,y_train,y_test,home_money,away_money,split_value=test_split(df,.1,j*3)
    forest=RandomForestClassifier(n_estimators=20,criterion='gini',max_depth=10,
                                                 min_samples_split=10,random_state=j*3)
    forest.fit(X_train,y_train)
    forest_pred=forest.predict(X_test)
    svm_clf=svm.SVC(C=6,kernel='linear',random_state=j*3)
    svm_clf.fit(X_train,y_train)
    svm_pred=svm_clf.predict(X_test)
    boost=xgb.XGBClassifier(learning_rate=.001,max_depth=50,
                            min_child_weight=10,n_estimators=200,subsample=.4,gamma=10,random_state=j*3)
    boost.fit(X_train,y_train)
    boost_pred=boost.predict(X_test)
    # Bet only on games where all three models agree; the shared pick is the prediction
    agree=(svm_pred==boost_pred)&(boost_pred==forest_pred)
    bets=pd.DataFrame({'preds':svm_pred[agree],'real':y_test.values[agree],
                       'home_money':home_money.iloc[split_value:].values[agree],
                       'away_money':away_money.iloc[split_value:].values[agree]})
    forest_acc=round(accuracy_score(y_test,forest_pred)*100,1)
    svm_acc=round(accuracy_score(y_test,svm_pred)*100,1)
    boost_acc=round(accuracy_score(y_test,boost_pred)*100,1)
    combo_acc=round(accuracy_score(bets.real,bets.preds)*100,1)
    nat,equal=calc_return(bets)
    results_all3=pd.concat([results_all3,pd.DataFrame({'fold':j+1,'svm_acc':svm_acc,'boost_acc':boost_acc,
                                         'forest_acc':forest_acc,'combo_acc':combo_acc,
                                         'nat':nat,'equal':equal},index=[0])],ignore_index=True)
print('Average Random Forest Accuracy Score: ',round(results_all3.forest_acc.mean(),2))
print('Average SVM Accuracy Score: ',round(results_all3.svm_acc.mean(),2))
print('Average XG Boost Accuracy Score: ',round(results_all3.boost_acc.mean(),2))
print('Average Combined Accuracy Score: ',round(results_all3.combo_acc.mean(),2))
print('Average Natural Wager Return on Risk: ',round(results_all3.nat.mean(),2))
print('Average Equal Wager Return on Risk: ',round(results_all3.equal.mean(),2))

Average Random Forest Accuracy Score:  57.03
Average SVM Accuracy Score:  55.99
Average XG Boost Accuracy Score:  61.26
Average Combined Accuracy Score:  63.77
Average Natural Wager Return on Risk:  5.13
Average Equal Wager Return on Risk:  6.21


As in the three previous examples, the combined accuracy (63.77% on the games where all three models agree) is higher than that of each individual model. Below is a table comparing the performance of all of the combinations:

|Model|Accuracy|Natural Return on Risk|Equal Wager Return on Risk|
|----|----|----|----|
|Vegas Odds|61.1%|2.53%|3.11%|
|Random Forest + Support Vector Machines|60.36%|4.2%|4.76%|
|Random Forest + XG Boost|61.78%|2.95%|3.35%|
|Support Vector Machines + XG Boost|63.65%|6.24%|6.88%|
|All Three|63.77%|5.13%|6.21%|

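In light of these results, I am choosing the combination of Support Vector Machines and XG Boost as my final model: its accuracy is within 0.12 percentage points of the three-model combination, and it has the highest return on risk under both wagering schemes. In the next section, I will evaluate how good a predictor this model is.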

#### For the next section please see the notebook titled "Evaluting_Final_Model"