# Selecting a Model

The goal of this notebook is to choose the best model for predicting scores of NCAA Tournament games. We will generate features and then run several machine learning models for our data. Finally, using the results, we select a modle to use for our NCAA Tournament predictions.

In [1]:
# Import packages
import sys
sys.path.append('/Users/phil/Documents/Documents/College Basketball')

import pandas as pd
import collegebasketball as cbb
cbb.__version__

import warnings
warnings.filterwarnings('ignore')

## Load the Data

First, we need to load the data that we previously retrieved and cleaned. More information about how this was done can be found in the Data Preparation and Cleaning notebook.

In [2]:
# Load the csv files that contain the scores/kenpom data
season = cbb.load_csv('/Users/phil/Documents/Documents/College Basketball/Data/regular_season.csv')
march = cbb.load_csv('/Users/phil/Documents/Documents/College Basketball/Data/march.csv')

# Take a look at the data
season.head()

Unnamed: 0,Year,Home,Away,Home_Score,Away_Score,Rank_Home,Team_Home,Conf_Home,Wins_Home,Losses_Home,...,Luck_Away,Luck Rank_Away,OppAdjEM_Away,OppAdjEM Rank_Away,OppO_Away,OppO Rank_Away,OppD_Away,OppD Rank_Away,NCSOS AdjEM_Away,NCSOS AdjEM Rank_Away
0,2002,Maryland,Arizona,67,71,3,Maryland,ACC,32,4,...,0.079,15,14.22,1,111.5,1,97.3,3,17.56,1
1,2002,Florida,Arizona,71,75,7,Florida,SEC,22,9,...,0.079,15,14.22,1,111.5,1,97.3,3,17.56,1
2,2002,Washington,Arizona,81,92,99,Washington,P10,10,18,...,0.079,15,14.22,1,111.5,1,97.3,3,17.56,1
3,2002,Washington,Arizona,82,91,99,Washington,P10,10,18,...,0.079,15,14.22,1,111.5,1,97.3,3,17.56,1
4,2002,University of California,Arizona,53,99,37,University of California,P10,23,9,...,0.079,15,14.22,1,111.5,1,97.3,3,17.56,1


## Feature Generation

Now that we have our data, we need to create some features for the ML algorithms. For each kenpom attribute, there is a feature to show the attribute for the favored team, the attribute for the underdog, and the difference between the two. The favored team is defined as the team with a higher AdjEM on kenpom. Using this system, a label of '1' represents an upset and a label of '0' means that the favored team won the game.

In [3]:
# Generate the Features
season_vecs, features = cbb.gen_features(season)
march_vecs, features = cbb.gen_features(march)

season_vecs.head()

Unnamed: 0,Favored,Underdog,Year,Win_Loss_Fav,Win_Loss,Win_Loss_Diff,Rank_Fav,Rank,Rank_Diff,AdjEM_Fav,...,OppD Rank_Fav,OppD Rank,OppD Rank_Diff,NCSOS AdjEM_Fav,NCSOS AdjEM,NCSOS AdjEM_Diff,NCSOS AdjEM Rank_Fav,NCSOS AdjEM Rank,NCSOS AdjEM Rank_Diff,Label
0,Maryland,Arizona,2002,0.888889,0.705882,0.183007,3,13,-10,29.25,...,32,3,29,1.62,17.56,-15.94,120,1,119,1
1,Florida,Arizona,2002,0.709677,0.705882,0.003795,7,13,-6,24.72,...,25,3,22,-0.56,17.56,-18.12,173,1,172,1
2,Arizona,Washington,2002,0.705882,0.357143,0.348739,13,99,-86,20.54,...,3,18,-15,17.56,7.6,9.96,1,25,-24,0
3,Arizona,Washington,2002,0.705882,0.357143,0.348739,13,99,-86,20.54,...,3,18,-15,17.56,7.6,9.96,1,25,-24,0
4,Arizona,University of California,2002,0.705882,0.71875,-0.012868,13,37,-24,20.54,...,3,53,-50,17.56,-5.85,23.41,1,285,-284,0


## Blocking

We now have features for every game played in division one from the 2002 season to the 2017 season. However, we can improve the accuracy of our models if we remove results unrealated to our test set. Since the goal of this project is to predict specifically games for the NCAA Tournament, we will remove any games with teams that are not good enough.

In [4]:
print('We have data for ' + str(len(season_vecs) + len(march_vecs)) + ' games.')
print(str(len(season_vecs[season_vecs['Label'] == 1]) + len(march_vecs[march_vecs['Label'] == 1])) + ' of those games are upsets')

We have data for 85683 games.
25685 of those games are upsets


In [5]:
# Block the feature vector tables
train = cbb.block_table(season_vecs)
test = cbb.block_table(march_vecs)
print('We have data for ' + str(len(season_vecs) + len(march_vecs)) + ' games.')
print(str(len(season_vecs[season_vecs['Label'] == 1]) + len(march_vecs[march_vecs['Label'] == 1])) + ' of those games are upsets')

We have data for 85683 games.
25685 of those games are upsets


## Selecting a ML Model

Next, we will train the models and test them using the march data to select a ML model that can best predict upsets in the NCAA Tournament. We will be choosing between K Nearest Neighbors, Decision Tree, Random Forest, SVM, Logistic Regression, and AdaBoost classifiers. We will be using classifiers from scikit learn.

In [6]:
# Create the models
knn = cbb.KNeighborsClassifier()
dt = cbb.DecisionTreeClassifier()
rf = cbb.RandomForestClassifier()
log = cbb.LogisticRegression()
ada = cbb.AdaBoostClassifier()

cls = [knn, dt, rf,log, ada]
cl_names = ['KNN', 'Decision Tree', 'Random Forest', 'Logistic Regression', 'AdaBoost']

In [7]:
cbb.evaluate(train, test, features, cls, cl_names)

Unnamed: 0,Classifier,Precision,Recall,F1,Accuracy
0,KNN,0.327451,0.347917,0.337374,0.6017
1,Decision Tree,0.329134,0.435417,0.374888,0.576806
2,Random Forest,0.364008,0.370833,0.367389,0.627808
3,Logistic Regression,0.492126,0.260417,0.340599,0.706132
4,AdaBoost,0.437333,0.341667,0.383626,0.680024


Since the AdaBoost classifier has the best F1 score and the second best Precision score, we will be using it to generate predictions in the NCAA Tournament. When predicting upsets, it is important to choose a classifier with high precision because we want to be a sure as possible that we are correct when we predict an upset. This is the case because any predition we make will affect our preditions int he next round. If we are making too many upset preditions, we will most likly fail to predict the correct teams later in the bracket.