Part three of my attempt to predict NCAA tourney results based on past game data. With cleaning and variable creation out of the way, this segment focuses on fitting the data to different classifiers in scikit-learn and tweaking parameters to find the classification model with the best predictive power... without overfitting.

In [7]:
import os
import numpy as np 
import pandas as pd 
import math
from sklearn.utils import shuffle
from sklearn.preprocessing import Imputer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectPercentile

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm

The import list is long because I threw in a large number of the classifiers included in sklearn. As you will see later, once the train and test data are split I toss them into several different models and use the accuracy score of each to fine-tune and pick a final model.  Data importing next:

In [8]:
df = pd.read_csv('../data/train_data_diff.csv')
data_dir = '../data/'
sample_sub = pd.read_csv(data_dir + 'SampleSubmissionStage1.csv')
n_test_games = len(sample_sub)

df.head()

Unnamed: 0,AST,AST_Diff,BLK,BLK_Diff,Coach,DR,DR_Diff,FGP,FGP3,FGP3_Diff,...,PPG_Diff,Rank,Rank_Diff,Result,SEED_Diff,STL,STL_Diff,Season,Seed,TeamID
0,14.0,4.666667,4.176471,0.676471,mark_gottfried,26.411765,3.911765,0.444393,0.347418,0.04075,...,18.539216,3.0,35,1,9.0,7.235294,1.401961,2003,10.0,1104
1,15.380952,6.047619,4.095238,0.595238,rick_stansbury,26.380952,3.880952,0.495357,0.36198,0.055311,...,17.5,3.0,21,1,4.0,9.285714,3.452381,2003,5.0,1280
2,13.4,4.066667,5.75,2.25,eddie_sutton,24.5,2.0,0.474798,0.382793,0.076124,...,17.233333,3.0,19,1,5.0,9.65,3.816667,2003,6.0,1329
3,14.590909,5.257576,3.818182,0.318182,rick_barnes,26.636364,4.136364,0.456127,0.343126,0.036458,...,22.060606,3.0,1,1,0.0,6.954545,1.121212,2003,1.0,1400
4,14.590909,5.257576,3.818182,0.318182,rick_barnes,26.636364,4.136364,0.456127,0.343126,0.036458,...,22.060606,3.0,1,1,0.0,6.954545,1.121212,2003,1.0,1400


Next is a handful of helper functions that facilitate extracting the year and teams from Kaggle's sample submission. This will allow me to use their sample submission to set predictions and submit to the competition. The get_year_t1_t2 function came from the Basic Starter Kernel on the competition site by Julie Elliott.

The other two functions are get_count, used to create a win/loss ratio that I missed in earlier manipulation, and get_stat, which is used to iterate over the sample submission teams and retrieve the appropriate stat differentials, as these are what the model is trained on.
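
As a rough illustration of what these helpers do together, here is a minimal sketch that splits a submission ID into its parts and takes a stat differential. The ID value, the `parse_game_id` name, and the `ppg` lookup table are made up for the example; the real code reads the stats from `df`.

```python
def parse_game_id(game_id):
    """Split an ID of the form 'year_team1_team2' into three ints."""
    year, team1, team2 = (int(part) for part in game_id.split('_'))
    return year, team1, team2

# Hypothetical season averages keyed by (season, team ID); values are invented.
ppg = {(2014, 1107): 70.5, (2014, 1110): 64.0}

year, t1, t2 = parse_game_id('2014_1107_1110')

# The model is trained on differentials: team 1's stat minus team 2's.
ppg_diff = ppg[(year, t1)] - ppg[(year, t2)]
print(year, t1, t2, ppg_diff)
```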

In [9]:
def get_count(teamID, year, wl):
    """Count the games `teamID` won (wl == 1) or lost (any other wl) in `year`."""
    result = 1 if wl == 1 else 0
    games = df[(df.TeamID == teamID) & (df.Season == year) & (df.Result == result)]
    try:
        return games.TeamID.value_counts().iloc[0]
    except IndexError:
        # No matching games for this team and season.
        return 0

'''FUNCTIONS FOR TEST DATA'''
def get_year_t1_t2(ID):
    """Return a tuple with ints `year`, `team1` and `team2`."""
    return tuple(int(x) for x in ID.split('_'))


def get_stat(stat, t1, t2, year):
    """Difference in mean `stat` (t1 minus t2) for `year`; falls back to all seasons if NaN."""
    season_diff = (df[(df.TeamID == t1) & (df.Season == year)][stat].mean()
                   - df[(df.TeamID == t2) & (df.Season == year)][stat].mean())
    if not math.isnan(season_diff):
        return season_diff
    # No usable rows for one of the teams that season: use every season instead.
    return df[(df.TeamID == t1)][stat].mean() - df[(df.TeamID == t2)][stat].mean()


A little more housekeeping before actual model experimentation.  This sets the win/loss ratio differential for each row of the training dataframe (the team's share of games won that season minus its opponent's) and turns the location variable into categorical dummies.


In [10]:
'''WIN LOSS RATIO'''
df['WLRatio'] = df.apply(lambda row: get_count(row.TeamID, row.Season, 1)/ (get_count(row.TeamID, row.Season, 0) + get_count(row.TeamID, row.Season, 1)).astype('float') - \
      get_count(row.OtherTeamID, row.Season, 1)/ (get_count(row.OtherTeamID, row.Season, 0) + get_count(row.OtherTeamID, row.Season, 1)).astype('float'), axis = 1)

'''SET DUMMIES'''
loc_dummies = pd.get_dummies(df.Loc)
df = pd.concat([df, loc_dummies], axis = 1)
df.head()

Unnamed: 0,AST,AST_Diff,BLK,BLK_Diff,Coach,DR,DR_Diff,FGP,FGP3,FGP3_Diff,...,SEED_Diff,STL,STL_Diff,Season,Seed,TeamID,WLRatio,A,H,N
0,14.0,4.666667,4.176471,0.676471,mark_gottfried,26.411765,3.911765,0.444393,0.347418,0.04075,...,9.0,7.235294,1.401961,2003,10.0,1104,-0.258929,0,0,1
1,15.380952,6.047619,4.095238,0.595238,rick_stansbury,26.380952,3.880952,0.495357,0.36198,0.055311,...,4.0,9.285714,3.452381,2003,5.0,1280,-0.133929,0,0,1
2,13.4,4.066667,5.75,2.25,eddie_sutton,24.5,2.0,0.474798,0.382793,0.076124,...,5.0,9.65,3.816667,2003,6.0,1329,-0.214286,0,1,0
3,14.590909,5.257576,3.818182,0.318182,rick_barnes,26.636364,4.136364,0.456127,0.343126,0.036458,...,0.0,6.954545,1.121212,2003,1.0,1400,-0.171429,0,1,0
4,14.590909,5.257576,3.818182,0.318182,rick_barnes,26.636364,4.136364,0.456127,0.343126,0.036458,...,0.0,6.954545,1.121212,2003,1.0,1400,-0.171429,1,0,0
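
The row-by-row `apply` above recounts wins and losses for every game. A vectorized version of the same idea might look like the sketch below, which computes each team's win share once per season and merges it back in. The toy `games` frame and the `WinShare` / `OtherWinShare` column names are assumptions for illustration, not part of the real dataset.

```python
import pandas as pd

# Hypothetical toy data: one row per game, seen from TeamID's side.
games = pd.DataFrame({
    'Season':      [2003] * 8,
    'TeamID':      [1104, 1104, 1104, 1104, 1280, 1280, 1280, 1280],
    'OtherTeamID': [1280, 1280, 1280, 1280, 1104, 1104, 1104, 1104],
    'Result':      [1, 1, 1, 0, 0, 0, 0, 1],
    'Loc':         ['H', 'N', 'A', 'H', 'A', 'N', 'H', 'A'],
})

# Share of games won per team and season, computed once.
win_share = (games.groupby(['Season', 'TeamID'])['Result']
                  .mean()
                  .rename('WinShare')
                  .reset_index())

# Attach each team's own share, then its opponent's share.
games = games.merge(win_share, on=['Season', 'TeamID'], how='left')
opp_share = win_share.rename(columns={'TeamID': 'OtherTeamID',
                                      'WinShare': 'OtherWinShare'})
games = games.merge(opp_share, on=['Season', 'OtherTeamID'], how='left')
games['WLRatio'] = games['WinShare'] - games['OtherWinShare']

# One indicator column per location code (A, H, N).
loc_dummies = pd.get_dummies(games['Loc'])
games = pd.concat([games, loc_dummies], axis=1)
print(games[['TeamID', 'WLRatio', 'A', 'H', 'N']])
```

On a frame the size of the training data this avoids filtering `df` several times per row, which is where most of the time in the `apply` version goes.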


At last, the interesting part: setting the training data and splitting off a test set.  This allows a good share (about 67%) of the data to be used in training the model, with the remainder available to test the accuracy of the fit.

I had tried a feature selector with a few of the final models, yet including all 11 features yielded better results than the versions with only the top 20% of influencers, so this is commented out. While it might be useful in explaining the predictions (PPG and FGP are by far the two most influential variables), I opted to keep even minimally influential variables, since that resulted in higher model accuracy.

The only unexpected change is the drop of offensive rebounds from the variable list. They were all over the place and had no correlation with winning the game.  Similarly, game location cannot be deciphered from the Kaggle submission file, so I drop those dummy variables as well.  I hope to include them back in for the final test when the tourney teams are selected.
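
For reference, the percentile-based feature selection mentioned above can be inspected to see which columns it keeps. This is only a sketch on synthetic data from `make_classification`: the feature names are borrowed as labels, the scores say nothing about the real stats, and `f_classif` is simply scikit-learn's default scoring function for `SelectPercentile`.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif

# Labels only: the data below is synthetic, not the tournament data.
feature_names = ['PPG_Diff', 'FGP_Diff', 'AST_Diff', 'FGP3_Diff', 'SEED_Diff',
                 'FTP_Diff', 'DR_Diff', 'STL_Diff', 'BLK_Diff', 'Rank_Diff',
                 'WLRatio']

X, y = make_classification(n_samples=300, n_features=11, n_informative=3,
                           n_redundant=0, random_state=0)

# Keep the top 20% of features ranked by their ANOVA F-score.
selector = SelectPercentile(score_func=f_classif, percentile=20).fit(X, y)
kept = [name for name, keep in zip(feature_names, selector.get_support()) if keep]
X_new = selector.transform(X)

print('kept columns:', kept)
print('reduced shape:', X_new.shape)
```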

In [11]:
'''FEATURE SELECTION'''
X_train = df[['PPG_Diff', 'FGP_Diff', 'AST_Diff', 'FGP3_Diff', 'SEED_Diff',
             'FTP_Diff', 'DR_Diff', 'STL_Diff', 'BLK_Diff', 'Rank_Diff', 'WLRatio']]
y_train = df['Result']
X_train, y_train = shuffle(X_train, y_train)

'''SELECTION SCORES'''
#X_new = SelectPercentile(percentile = 20).fit_transform(X_train, y_train)

'''TRAIN-TEST SPLIT - (for testing locally)'''
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.33, random_state=42)
X_train.head()
y_train.head()

15740    0
7233     1
14898    0
12298    0
1371     1
Name: Result, dtype: int64

And finally to the model fitting.  Most are commented out, which indicates that they were not as powerful and accurate as the final selection: SVC with a linear kernel. I worry that judging by accuracy alone is not the most apt method for deciding between fits, but it appears to have a direct relation to how highly the final submission scores on Kaggle as well.

While Random Forest and AdaBoost both have great accuracy, as expected with a dataset like this, I was excited to see SVC be so efficient at prediction based on the inputs.  The clf.score reported .9993, an insanely accurate score which actually ends up worrying me.  I fear this could mean it is largely overfit, but as the test data was shuffled and so large, I find that it might be acceptable?   Regardless, this is the final model I went with!

In [12]:
'''GAUSSIAN NAIVE BAYES'''
#clf = GaussianNB()
#clf.fit(X_train, y_train)
#clf.score(X_test, y_test)


'''LOGISTIC REGRESSION'''
#logreg = LogisticRegression()
#params = {'C': np.logspace(start=-5, stop=3, num=9)}
#clf = GridSearchCV(logreg, params, scoring='neg_log_loss', refit=True)
#clf.fit(X_train, y_train)
#print('Best log_loss: {:.4}, with best C: {}'
#      .format(clf.best_score_, clf.best_params_['C']))


'''LOG REG CV'''
#clf = LogisticRegressionCV()
#clf.fit(X_train, y_train)


'''TRAIN MODEL - RANDOM FOREST'''
#clf = RandomForestClassifier(random_state=42)
#clf.fit(X_train, y_train)
#clf.feature_importances_
#clf.score(X_test, y_test)


'''ADABOOST'''
#ada = AdaBoostClassifier()
#parameters = {'n_estimators':[10,50,100], 'random_state': [None, 0, 42, 138]}
#clf = GridSearchCV(ada, parameters)
#clf = AdaBoostClassifier(n_estimators=50, random_state=138)
#clf.fit(X_train, y_train)

#clf.score(X_test, y_test)


'''K Nearest Neighbor'''
#clf = KNeighborsClassifier(n_neighbors=5)
#clf.fit(X_train, y_train) 

#clf.score(X_test, y_test)


'''Support Vector Machine''' 
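# gamma has no effect with kernel='linear'; it only applies to the 'rbf', 'poly' and 'sigmoid' kernels.
clf = svm.SVC(kernel='linear', gamma=0.7, C=1.0, probability = True)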
clf.fit(X_train, y_train)

clf.score(X_test, y_test)


0.99894273127753308
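
One way to probe the overfitting worry beyond a single train/test split is k-fold cross-validation, which scores the model on several different held-out slices. The sketch below runs on synthetic data from `make_classification` as a stand-in for the 11 differential features, so the numbers it prints say nothing about the real model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the 11 differential features (not the real data).
X, y = make_classification(n_samples=300, n_features=11, n_informative=4,
                           n_redundant=0, random_state=42)

model = SVC(kernel='linear', C=1.0)

# Accuracy on each of 5 held-out folds; the model is refit for every fold.
scores = cross_val_score(model, X, y, cv=5)
print('fold accuracies:', scores.round(3))
print('mean: {:.3f}, std: {:.3f}'.format(scores.mean(), scores.std()))
```

If the fold scores stay close to each other and to the single-split score, that is at least some evidence the result is not a fluke of one particular split.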

The next segment is partially thanks to Julie Elliott again and her starter kernel.  I was at a loss as to how to easily and quickly iterate through the sample submission and report the model's predicted probabilities. I tweaked it to use the stat differential function created earlier to build the appropriate variables to run the model on.

In [None]:
'''SET TEST DATAFRAME'''
X_sub = np.zeros(shape=(n_test_games, 11))


'''SETTING FEATURES'''
stat_list = ['PPG', 'FGP', 'AST', 'FGP3', 'Seed', 'FTP', 'DR', 'STL', 'BLK', 'Rank'] 

for ii, row in sample_sub.iterrows():
    year, t1, t2 = get_year_t1_t2(row.ID)
    col_num = 0
    
    for team_stat in stat_list:
        X_sub[ii, col_num] = get_stat(team_stat, t1, t2, year)
        col_num += 1
        
    X_sub[ii, col_num] =  get_count(t1, year, 1)/ (get_count(t1, year, 0) + get_count(t1, year, 1)).astype('float') - \
     get_count(t2, year, 1)/ (get_count(t2, year, 0) + get_count(t2, year, 1)).astype('float')

      
'''EDIT NaN's and infinite values'''
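# Note: Imputer was removed in scikit-learn 0.22; newer versions provide sklearn.impute.SimpleImputer instead.
# axis=1 fills each gap with the median of its own row.
imp = Imputer(missing_values='NaN', strategy='median', axis=1)
X_sub = imp.fit_transform(X_sub)
X_sub[:5]  # X_sub is a NumPy array, so slice it instead of calling .head()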

And to the final part of the project: using the model on the Kaggle sample submission and turning it in on the competition site to see how I fare.  predict_proba is used here because the competition is graded by log loss, which scores the probability assigned to each game. I then clip the extreme ends of the predictions, as log loss heavily penalizes confident predictions that turn out wrong.

In [None]:
'''MAKE PREDICTIONS'''
preds = clf.predict_proba(X_sub)[:,1]

'''CLIP PREDICTIONS'''
clipped_preds = np.clip(preds, 0.05, 0.95)
sample_sub.Pred = clipped_preds

'''WRITE TO CSV'''
#sample_sub.to_csv('SVC_11FD_clipped_sub.csv', index=False)
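
To see why the clipping step matters, here is a small sketch of the per-game log loss for a very confident prediction, with and without the 0.05 / 0.95 clip used above. The helper names `game_log_loss` and `clip` are made up for the example.

```python
import math

def game_log_loss(pred, outcome):
    """Log loss for one game: minus the log of the probability given to what happened."""
    return -math.log(pred if outcome == 1 else 1.0 - pred)

def clip(pred, low=0.05, high=0.95):
    """Pull a probability back inside [low, high]."""
    return min(max(pred, low), high)

confident_miss = game_log_loss(0.999, 0)        # about 6.91
clipped_miss = game_log_loss(clip(0.999), 0)    # about 3.00
confident_hit = game_log_loss(0.999, 1)         # about 0.001
clipped_hit = game_log_loss(clip(0.999), 1)     # about 0.051

print(confident_miss, clipped_miss)
print(confident_hit, clipped_hit)
```

Clipping costs about 0.05 on every game the model gets right, but saves nearly 4 points on each confident miss, so it pays off whenever the model is overconfident on more than a small fraction of games.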

###Kaggle Results
So the extremely high accuracy of the model here does not translate directly to a log loss environment.  I scored 0.633 on the competition site. Log loss does not convert directly to an accuracy figure, but for reference a constant prediction of 0.5 for every game scores ln 2 ≈ 0.693, so 0.633 is only a modest improvement over guessing. It is slightly worse than the Basic Starter Kernel, which only used seed differential in predicting probabilities.
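
To put a log loss of 0.633 in context, a quick calculation helps. The sketch below compares it to the score of predicting 0.5 for every game, and converts it to the geometric-mean probability the model gave to the outcomes that actually happened. The variable names are just for illustration.

```python
import math

score = 0.633  # log loss reported by the competition site

# Predicting 0.5 for every game scores -ln(0.5) = ln 2, regardless of results.
coin_flip_baseline = -math.log(0.5)

# exp(-log loss) is the geometric mean of the probabilities assigned to the
# outcomes that actually occurred.
typical_prob_on_truth = math.exp(-score)

print('coin-flip baseline: {:.3f}'.format(coin_flip_baseline))
print('typical probability on the true outcome: {:.3f}'.format(typical_prob_on_truth))
```

This gives a baseline of about 0.693 and a typical probability of about 0.531 on the true outcome, which is a different quantity from accuracy.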

This is not to say it will not be effective in round two of the competition, where current tourney data is unavailable.  I believe my model's use of only regular season data is both a pro and a con.  It allows me to fit each game in the 2018 bracket based solely on regular season data from this year. Conversely, it does not take into account the randomness seen within tourney games, hence the 'madness' of the bracket.

If time becomes available again before the beginning of the tourney, I plan to find a way to include betting market odds for the games in the training data. I also think utilizing those dummy location variables and possibly including some lagged past-season tourney placements could help.

###How Did My Bracket Do: (UPCOMING)