# Selecting a Model

The goal of this notebook is to choose the best model for predicting the outcomes of NCAA Tournament games. We will train several machine learning models on the training data and compare how they perform. Finally, using those results, we select a model to use for our NCAA Tournament predictions.

In [1]:
# Import packages
import sys
sys.path.append('../')

import datetime
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

import collegebasketball as cbb
cbb.__version__

## Load the Data

First, we need to load the data that we previously retrieved and cleaned. More information about how this was done can be found in the Data Preparation and Cleaning notebook.

In [2]:
# Load the csv files that contain the training data for each data set
path = '../Data/Training/'
kenpom_season = cbb.load_csv('{}kenpom_season.csv'.format(path))
kenpom_march = cbb.load_csv('{}kenpom_march.csv'.format(path))
TRank_season = cbb.load_csv('{}TRank_season.csv'.format(path))
TRank_march = cbb.load_csv('{}TRank_march.csv'.format(path))
stats_season = cbb.load_csv('{}stats_season.csv'.format(path))
stats_march = cbb.load_csv('{}stats_march.csv'.format(path))
all_season = cbb.load_csv('{}all_season.csv'.format(path))
all_march = cbb.load_csv('{}all_march.csv'.format(path))

# Get a sense for the size of each data set
print('Length of kenpom data: {}'.format(len(kenpom_season) + len(kenpom_march)))
print('Length of TRank data: {}'.format(len(TRank_season) + len(TRank_march)))
print('Length of basic stats data: {}'.format(len(stats_season) + len(stats_march)))
print('Length of all data: {}'.format(len(all_season) + len(all_march)))

Length of kenpom data: 60252
Length of TRank data: 63670
Length of basic stats data: 36512
Length of all data: 21788


## Selecting an ML Model

Next, we will train the models on the season data and test them on the march data to select the ML model that best predicts upsets in the NCAA Tournament for each training data set. We will be choosing between K Nearest Neighbors, Decision Tree, Random Forest, and Logistic Regression, using the classifiers from scikit-learn.

In [3]:
# Create the models
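knn = KNeighborsClassifier()
dt = DecisionTreeClassifier(min_samples_leaf=5)
rf = RandomForestClassifier(n_estimators=100, min_samples_split=5)
# The l1 penalty needs a solver that supports it; recent scikit-learn
# versions default to lbfgs, which does not, so set liblinear explicitly
log = LogisticRegression(penalty='l1', C=10, solver='liblinear')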

cls = [knn, dt, rf, log]
cl_names = ['KNN', 'Decision Tree', 'Random Forest', 'Logistic Regression']
exclude = ['Favored', 'Underdog', 'Year', 'Label']

In [4]:
# Kenpom Data
cbb.evaluate(kenpom_season, kenpom_march, exclude, cls, cl_names)

| | Classifier | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|
| 0 | KNN | 0.383133 | 0.270869 | 0.317365 | 0.686239 |
| 1 | Decision Tree | 0.329134 | 0.356048 | 0.342062 | 0.631193 |
| 2 | Random Forest | 0.432039 | 0.303237 | 0.356356 | 0.705046 |
| 3 | Logistic Regression | 0.547059 | 0.316865 | 0.401294 | 0.745413 |
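
The source of `cbb.evaluate` is not shown in this notebook. As a rough sketch only, a helper with the same call shape might fit each classifier on the first data frame, predict on the second, and report the four metrics in the table above. Everything below is an assumption made for illustration: the function name `evaluate_sketch`, the `label` argument, the synthetic data from `make_games`, and its column names `RatingGap` and `TempoGap`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier


def evaluate_sketch(train, test, exclude, classifiers, names, label='Label'):
    """Fit each classifier on `train` and score it on `test` (hypothetical helper)."""
    # Every column not listed in `exclude` is treated as a feature.
    features = [col for col in train.columns if col not in exclude]
    rows = []
    for name, clf in zip(names, classifiers):
        clf.fit(train[features], train[label])
        predicted = clf.predict(test[features])
        rows.append({
            'Classifier': name,
            'Precision': precision_score(test[label], predicted, zero_division=0),
            'Recall': recall_score(test[label], predicted, zero_division=0),
            'F1': f1_score(test[label], predicted, zero_division=0),
            'Accuracy': accuracy_score(test[label], predicted),
        })
    return pd.DataFrame(rows)


def make_games(n_games, seed):
    """Build synthetic rows; the column names are made up for illustration."""
    rng = np.random.default_rng(seed)
    rating_gap = rng.normal(size=n_games)
    tempo_gap = rng.normal(size=n_games)
    # Label 1 stands in for an upset, made more likely when RatingGap is low.
    upset = (rating_gap + rng.normal(scale=0.5, size=n_games) < -0.5).astype(int)
    return pd.DataFrame({'Year': 2020, 'RatingGap': rating_gap,
                         'TempoGap': tempo_gap, 'Label': upset})


train = make_games(500, seed=0)   # stands in for the season data
test = make_games(200, seed=1)    # stands in for the march data
models = [DecisionTreeClassifier(min_samples_leaf=5, random_state=0),
          LogisticRegression(C=10)]
results = evaluate_sketch(train, test, ['Favored', 'Underdog', 'Year', 'Label'],
                          models, ['Decision Tree', 'Logistic Regression'])
print(results)
```

The metrics here are computed for the class labeled 1, which is scikit-learn's default for binary targets; whether `cbb.evaluate` makes the same choice is not confirmed by the notebook.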


In [5]:
# TRank Data
cbb.evaluate(TRank_season, TRank_march, exclude, cls, cl_names)

| | Classifier | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|
| 0 | KNN | 0.331536 | 0.28341 | 0.30559 | 0.643267 |
| 1 | Decision Tree | 0.319635 | 0.322581 | 0.321101 | 0.622208 |
| 2 | Random Forest | 0.374558 | 0.24424 | 0.295676 | 0.677728 |
| 3 | Logistic Regression | 0.362319 | 0.230415 | 0.28169 | 0.674537 |


In [6]:
# Basic Stats Data
cbb.evaluate(stats_season, stats_march, exclude, cls, cl_names)

| | Classifier | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|
| 0 | KNN | 0.385827 | 0.239609 | 0.295626 | 0.620016 |
| 1 | Decision Tree | 0.321321 | 0.261614 | 0.28841 | 0.570382 |
| 2 | Random Forest | 0.340909 | 0.036675 | 0.066225 | 0.655818 |
| 3 | Logistic Regression | 0.549451 | 0.122249 | 0.2 | 0.674532 |


In [7]:
# All Data
cbb.evaluate(all_season, all_march, exclude, cls, cl_names)

| | Classifier | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|
| 0 | KNN | 0.298429 | 0.215909 | 0.250549 | 0.647727 |
| 1 | Decision Tree | 0.31686 | 0.412879 | 0.358553 | 0.597107 |
| 2 | Random Forest | 0.415205 | 0.268939 | 0.326437 | 0.697314 |
| 3 | Logistic Regression | 0.378378 | 0.265152 | 0.311804 | 0.680785 |
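
To compare the four tables side by side, the snippet below copies their values into one dictionary, checks that each F1 is the harmonic mean of its precision and recall, F1 = 2PR / (P + R), and then picks the top combination by F1 and by accuracy. The dictionary layout and the helper name `f1_from` are illustrative choices, not part of `collegebasketball`.

```python
# (precision, recall, F1, accuracy) copied from the four result tables above.
results = {
    ('Kenpom', 'KNN'): (0.383133, 0.270869, 0.317365, 0.686239),
    ('Kenpom', 'Decision Tree'): (0.329134, 0.356048, 0.342062, 0.631193),
    ('Kenpom', 'Random Forest'): (0.432039, 0.303237, 0.356356, 0.705046),
    ('Kenpom', 'Logistic Regression'): (0.547059, 0.316865, 0.401294, 0.745413),
    ('TRank', 'KNN'): (0.331536, 0.28341, 0.30559, 0.643267),
    ('TRank', 'Decision Tree'): (0.319635, 0.322581, 0.321101, 0.622208),
    ('TRank', 'Random Forest'): (0.374558, 0.24424, 0.295676, 0.677728),
    ('TRank', 'Logistic Regression'): (0.362319, 0.230415, 0.28169, 0.674537),
    ('Basic Stats', 'KNN'): (0.385827, 0.239609, 0.295626, 0.620016),
    ('Basic Stats', 'Decision Tree'): (0.321321, 0.261614, 0.28841, 0.570382),
    ('Basic Stats', 'Random Forest'): (0.340909, 0.036675, 0.066225, 0.655818),
    ('Basic Stats', 'Logistic Regression'): (0.549451, 0.122249, 0.2, 0.674532),
    ('All', 'KNN'): (0.298429, 0.215909, 0.250549, 0.647727),
    ('All', 'Decision Tree'): (0.31686, 0.412879, 0.358553, 0.597107),
    ('All', 'Random Forest'): (0.415205, 0.268939, 0.326437, 0.697314),
    ('All', 'Logistic Regression'): (0.378378, 0.265152, 0.311804, 0.680785),
}


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Sanity check: every reported F1 matches its precision and recall.
for key, (precision, recall, f1, _accuracy) in results.items():
    assert abs(f1_from(precision, recall) - f1) < 1e-4, key

best_by_f1 = max(results, key=lambda key: results[key][2])
best_by_accuracy = max(results, key=lambda key: results[key][3])
print(best_by_f1)        # ('Kenpom', 'Logistic Regression')
print(best_by_accuracy)  # ('Kenpom', 'Logistic Regression')
```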


The results above show that the Logistic Regression model trained on just the Kenpom data works best: it has the highest F1 (0.401) and the highest accuracy (0.745) of any model and data set combination. For now, we will use this model with this data set. In the future, I plan to use multiple models on each of these data sets, weighted by their performance.
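
One possible shape for that future weighting is sketched below: average each model's predicted upset probability, weighting each model by its F1 score. This is only an illustration of the idea, not the planned implementation. The function `weighted_vote`, the 0.5 threshold, and the probabilities are all hypothetical; only the weights come from the tables above (the Logistic Regression F1 on each data set).

```python
import numpy as np


def weighted_vote(probabilities, weights, threshold=0.5):
    """Combine per-model upset probabilities with a weighted average."""
    probabilities = np.asarray(probabilities, dtype=float)  # shape (n_models, n_games)
    weights = np.asarray(weights, dtype=float)               # shape (n_models,)
    # Weighted average over the model axis, normalized so the weights sum to 1.
    combined = weights @ probabilities / weights.sum()
    return combined, combined >= threshold


# Logistic Regression F1 on the Kenpom, TRank, basic stats, and all data sets.
f1_weights = [0.401294, 0.28169, 0.2, 0.311804]

# Hypothetical probabilities: one row per model, one column per game.
probabilities = [
    [0.70, 0.20, 0.45],
    [0.55, 0.30, 0.60],
    [0.40, 0.10, 0.50],
    [0.65, 0.25, 0.40],
]

combined, is_upset = weighted_vote(probabilities, f1_weights)
print(combined.round(3), is_upset)
```

Averaging probabilities by hand, rather than using a single voting ensemble over one feature matrix, keeps the sketch compatible with models that were each trained on a different data set.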