# Generating Predictions

Using the logistic regression model that we chose in the Selecting a Model notebook, we will create predictions for the 2023 NCAA Tournament.

In [1]:
# Import packages
import sys
sys.path.append('../')

import pandas as pd
from sklearn.linear_model import LogisticRegression
import collegebasketball as cbb

import warnings
warnings.filterwarnings('ignore')

cbb.__version__

'2023'

## Train the Model

We will train the model using the same method as before. To understand how I arrived at this model, please look at the Selecting a Model notebook.

However, there is one major difference in how we will train the model this time. Before, we split the data into training and testing sets. Since we are now predicting new games, we will fit the model on all of the labeled data.
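
The difference between the two training regimes can be sketched on synthetic data. This is only an illustration: `make_classification` stands in for the real training table, and the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training table (not the real scores/kenpom data)
X, y = make_classification(n_samples=500, n_features=8, random_state=77)

# Model selection: hold out a test set to estimate accuracy
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=77)
eval_model = LogisticRegression(C=10, solver='liblinear', random_state=77)
eval_model.fit(X_train, y_train)
holdout_accuracy = eval_model.score(X_test, y_test)

# Prediction time: refit the same configuration on every labeled row
final_model = LogisticRegression(C=10, solver='liblinear', random_state=77)
final_model.fit(X, y)

print('Rows used for evaluation fit: {}'.format(len(X_train)))
print('Rows used for final fit: {}'.format(len(X)))
```

The held-out score is only meaningful for the evaluation model. Once the configuration is chosen, refitting on all rows gives the final model more games to learn from.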

In [2]:
# Load the CSV file that contains the scores/kenpom data
path = '../Data/Training/training.csv'
train = pd.read_csv(path)

# Get a sense for the size of the training set
print('Length of training data: {}'.format(len(train)))

Length of training data: 16239


In [3]:
train.head(3)

Unnamed: 0,Favored,Underdog,Year,Tournament,Label,Win_Loss_Fav,Win_Loss,AdjEM_Fav,AdjEM,AdjO_Fav,...,FT%_opp_Fav,FT%_opp,AST_Fav,AST,AST_opp_Fav,AST_opp,BLK_Fav,BLK,BLK_opp_Fav,BLK_opp
0,UNC,William & Mary,2010,NIT,0,0.540541,0.666667,13.39,6.58,107.4,...,0.699,0.681,15.594595,14.030303,14.378378,13.151515,5.675676,2.545455,4.486486,3.333333
1,Wake Forest,UNC,2010,,0,0.645161,0.540541,14.12,13.39,107.1,...,0.687,0.699,11.903226,15.594595,11.419355,14.378378,5.225806,5.675676,3.903226,4.486486
2,UNC,Nevada,2010,,0,0.540541,0.617647,13.39,10.2,107.4,...,0.699,0.713,15.594595,14.117647,14.378378,15.029412,5.675676,4.382353,4.486486,2.588235


In [4]:
# Get feature names
exclude = ['Favored', 'Underdog', 'Year', 'Tournament', 'Label']

features = list(train.columns)
for col in exclude:
    features.remove(col)
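
An equivalent way to build the feature list is a list comprehension. The sketch below uses a one-row toy frame (the column subset is chosen for illustration) rather than the real training data.

```python
import pandas as pd

# Toy frame with a few of the training columns (illustrative subset only)
toy = pd.DataFrame({
    'Favored': ['UNC'], 'Underdog': ['Nevada'], 'Year': [2010],
    'Tournament': ['NIT'], 'Label': [0],
    'AdjEM_Fav': [13.39], 'AdjEM': [10.2],
})

exclude = ['Favored', 'Underdog', 'Year', 'Tournament', 'Label']

# Keep every column that is not an identifier or the label
features = [col for col in toy.columns if col not in exclude]
print(features)
```

Unlike `list.remove`, the comprehension does not raise a `ValueError` when one of the excluded names is missing from the frame.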

In [5]:
# Train the classifier
log = LogisticRegression(penalty='l2', C=10, solver='liblinear', random_state=77)
log.fit(train[features], train['Label'])  # pass the labels as a 1-D Series

LogisticRegression(C=10, random_state=77, solver='liblinear')
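
Once fitted, a scikit-learn classifier can return class probabilities through `predict_proba`. The sketch below shows the shape of that output on made-up data. The labelling rule (1 when the second team rates higher) is an assumption for illustration, not the rule used to build the real `Label` column.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up ratings for 200 games (not real kenpom values)
rng = np.random.default_rng(77)
toy = pd.DataFrame({
    'AdjEM_Fav': rng.normal(15, 5, 200),
    'AdjEM': rng.normal(10, 5, 200),
})
# Hypothetical labelling rule, for illustration only
toy['Label'] = (toy['AdjEM'] > toy['AdjEM_Fav']).astype(int)

features = ['AdjEM_Fav', 'AdjEM']
model = LogisticRegression(C=10, solver='liblinear', random_state=77)
model.fit(toy[features], toy['Label'])

# One row per game, one column per class, ordered as in model.classes_
proba = model.predict_proba(toy[features])
print(model.classes_)
print(proba.shape)
```

The columns of `proba` follow the order of `model.classes_`, so it is worth checking that attribute before reading a column as "the probability of class 1".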

## Get Input Data for this Year

Next, we need this year's input data so we can predict the results of tournament games. We retrieve the data from each source (Sports Reference, Kenpom, and T-Rank), clean it, and combine it into a single data set.

In [6]:
year = 2023
stats_path = '../Data/SportsReference/' + str(year) + '_stats.csv'
# stats = cbb.load_stats_dataframe(year=year, csv_file_path=stats_path)
stats = pd.read_csv(stats_path)
stats = cbb.update_basic(stats.rename(index=str, columns={'School': 'Team'}))

# Convert season totals to per-game averages
cols_to_fix = ['3PA', '3PA_opp', 'AST', 'AST_opp', 'BLK', 'BLK_opp']
for c in cols_to_fix:
    stats[c] = stats[c] / stats['G']

stats[stats['Team'] == 'Marquette']

Unnamed: 0,Team,G,SRS,SOS,Tm.,Opp.,MP,FG_opp,FGA_opp,FG%_opp,...,FT,FTA,FT%,ORB,TRB,AST,STL,BLK,TOV,PF
159,Marquette,34,17.23,7.87,2718,2400,1380,872,1960,0.445,...,403,559,0.721,303,1084,17.588235,320,3.147059,365,553
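
The per-game conversion can also be written without a loop. This is a minimal sketch on a toy frame with hypothetical teams and totals, using `DataFrame.div` to divide each row by that team's games played.

```python
import pandas as pd

# Hypothetical season totals for two teams
stats_toy = pd.DataFrame({
    'Team': ['A', 'B'],
    'G': [34, 30],
    'AST': [598, 450],
    'BLK': [107, 90],
})

cols_to_fix = ['AST', 'BLK']

# Divide every listed column by games played, aligned row by row
stats_toy[cols_to_fix] = stats_toy[cols_to_fix].div(stats_toy['G'], axis=0)
print(stats_toy)
```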


In [7]:
kp_path = '../Data/Kenpom/' + str(year) + '_kenpom.csv'
# kenpom = cbb.load_kenpom_dataframe(year=year, csv_file_path=kp_path)
kenpom = pd.read_csv(kp_path)
kenpom = cbb.update_kenpom(kenpom)
kenpom[kenpom['Team'] == 'Marquette']

Unnamed: 0,Rank,Team,Seed,Conf,Wins,Losses,AdjEM,AdjO,AdjO Rank,AdjD,...,Luck,Luck Rank,OppAdjEM,OppAdjEM Rank,OppO,OppO Rank,OppD,OppD Rank,NCSOS AdjEM,NCSOS AdjEM Rank
11,12,Marquette,2.0,BE,28,6,21.83,119.3,8,97.5,...,0.014,150,8.67,39,109.5,35,100.9,51,-1.14,196


In [8]:
TRank_path = '../Data/TRank/' + str(year) + '_TRank.csv'
# TRank = cbb.load_TRank_dataframe(year=year, csv_file_path=TRank_path)
TRank = pd.read_csv(TRank_path)
TRank = cbb.update_TRank(TRank)
TRank[TRank['Team'] == 'Marquette']

Unnamed: 0,Rk,Team,Conf,G,Wins,Losses,AdjOE,AdjOE Rank,AdjDE,AdjDE Rank,...,2P%D,2P%D Rank,3P%,3P% Rank,3P%D,3P%D Rank,Adj T.,Adj T. Rank,WAB,WAB Rank
12,13,Marquette,BE,34,28,6,119.0,9,97.1,44,...,50.0,169,34.8,134,35.2,253,68.6,115,7.5,8


In [9]:
# Merge the data from each source (and drop columns that are repeats)
team_stats = pd.merge(kenpom, TRank.drop(['Conf', 'Wins', 'Losses'], axis=1), on='Team', sort=False)
team_stats = pd.merge(team_stats, stats.drop(['G', 'ORB', '3P%'], axis=1), on='Team', sort=False)
team_stats[team_stats['Team'] == 'Marquette']

Unnamed: 0,Rank,Team,Seed,Conf,Wins,Losses,AdjEM,AdjO,AdjO Rank,AdjD,...,3PA,FT,FTA,FT%,TRB,AST,STL,BLK,TOV,PF
11,12,Marquette,2.0,BE,28,6,21.83,119.3,8,97.5,...,25.441176,403,559,0.721,1084,17.588235,320,3.147059,365,553
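
Because `pd.merge` defaults to an inner join, a team whose name is spelled differently in two sources silently disappears from the result. One way to check for this, sketched here on toy frames with a deliberately mismatched name and mostly made-up values, is an outer merge with `indicator=True`.

```python
import pandas as pd

# Toy frames; the second team's spelling differs between sources on purpose
kenpom_toy = pd.DataFrame({
    'Team': ['Marquette', "Saint Mary's"],
    'AdjEM': [21.83, 20.0],  # second value is made up
})
trank_toy = pd.DataFrame({
    'Team': ['Marquette', "St. Mary's"],
    'WAB': [7.5, 4.0],  # second value is made up
})

# The outer merge keeps every row and records where each one came from
check = pd.merge(kenpom_toy, trank_toy, on='Team', how='outer', indicator=True)
unmatched = check.loc[check['_merge'] != 'both', ['Team', '_merge']]
print(unmatched)

# The default inner merge would keep only the team present in both frames
inner = pd.merge(kenpom_toy, trank_toy, on='Team', sort=False)
print(len(inner))
```

Any rows listed in `unmatched` point to names that need cleaning before the real merge.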


In [10]:
# Load Tournament games
games_path = '../Data/Tourney/{}.csv'.format(year)
games = pd.read_csv(games_path)
games.head(3)

Unnamed: 0,Home,Away
0,Alabama,Texas A&M-Corpus Christi
1,Maryland,West Virginia
2,San Diego State,College of Charleston


In [11]:
# Join the team data with the game data
data = pd.merge(games, team_stats, left_on='Home', right_on='Team', sort=False)
data = pd.merge(data, team_stats, left_on='Away', right_on='Team', suffixes=('_Home', '_Away'), sort=False)
data.insert(0, 'Year', year)
data.insert(3, 'Tournament', 'NCAA Tournament')
data.head(3)

Unnamed: 0,Year,Home,Away,Tournament,Rank_Home,Team_Home,Seed_Home,Conf_Home,Wins_Home,Losses_Home,...,3PA_Away,FT_Away,FTA_Away,FT%_Away,TRB_Away,AST_Away,STL_Away,BLK_Away,TOV_Away,PF_Away
0,2023,Alabama,Texas A&M-Corpus Christi,NCAA Tournament,3,Alabama,1.0,SEC,29,5,...,21.058824,577,731,0.789,1262,15.529412,292,1.676471,442,647
1,2023,Maryland,West Virginia,NCAA Tournament,22,Maryland,8.0,B10,21,12,...,20.878788,558,753,0.741,1117,13.181818,221,3.090909,428,608
2,2023,San Diego State,College of Charleston,NCAA Tournament,14,San Diego State,5.0,MWC,27,6,...,30.205882,522,704,0.741,1377,13.794118,250,3.147059,408,572
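
The two-step join above relies on the `suffixes` argument to tell the home team's columns from the away team's. The sketch below reproduces the pattern on a toy schedule with made-up ratings and adds a row-count check, since an inner merge drops any game whose team name is missing from the stats table.

```python
import pandas as pd

# Toy team table with made-up ratings
team_toy = pd.DataFrame({
    'Team': ['Alabama', 'Maryland', 'West Virginia'],
    'AdjEM': [25.0, 15.0, 16.0],
})
# Toy schedule (illustrative pairings)
games_toy = pd.DataFrame({
    'Home': ['Alabama', 'Maryland'],
    'Away': ['Maryland', 'West Virginia'],
})

# First merge attaches the home team's stats (no name clash yet)
merged = pd.merge(games_toy, team_toy, left_on='Home', right_on='Team', sort=False)
# Second merge attaches the away team's stats; clashing names get suffixes
merged = pd.merge(merged, team_toy, left_on='Away', right_on='Team',
                  suffixes=('_Home', '_Away'), sort=False)

# Guard against games dropped by a failed name match
assert len(merged) == len(games_toy), 'some games lost in the merge'
print(list(merged.columns))
```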


## Predict Games Using the Classifier

Now that we have a trained model and data for this year's tournament games, we can use the model to predict games in the 2023 NCAA Tournament.

In [12]:
# Make Predictions
predictions = cbb.predict(log, data, features)
predictions.to_csv('../Data/predictions/predictions_2023.csv', index=False)
predictions['Upset'] = predictions['Underdog'] == predictions['Predicted Winner']

In [13]:
# First Round
predictions.iloc[0:32,:]

Unnamed: 0,Favored,Underdog,Predicted Winner,Probabilities,Upset
0,Alabama,Texas A&M-Corpus Christi,Alabama,0.008773,False
1,West Virginia,Maryland,West Virginia,0.486833,False
2,San Diego State,College of Charleston,San Diego State,0.148697,False
3,Virginia,Furman,Virginia,0.160343,False
4,Creighton,NC State,Creighton,0.370027,False
5,Baylor,UCSB,Baylor,0.111844,False
6,Utah State,Missouri,Missouri,0.519677,True
7,Arizona,Princeton,Arizona,0.061443,False
8,Purdue,Fairleigh Dickinson,Purdue,0.002217,False
9,Memphis,Florida Atlantic,Florida Atlantic,0.500202,True


In [14]:
# Second Round
predictions.iloc[32:48,:]

Unnamed: 0,Favored,Underdog,Predicted Winner,Probabilities,Upset
32,Alabama,West Virginia,Alabama,0.194988,False
33,San Diego State,Virginia,San Diego State,0.360758,False
34,Creighton,Baylor,Baylor,0.687204,True
35,Arizona,Missouri,Missouri,0.345528,True
36,Purdue,Florida Atlantic,Purdue,0.239604,False
37,Duke,Louisiana,Duke,0.19704,False
38,Kansas State,Providence,Providence,0.305677,True
39,Marquette,Michigan State,Michigan State,0.33076,True
40,Houston,Auburn,Houston,0.238223,False
41,Indiana,Drake,Indiana,0.187885,False


In [15]:
# Later Rounds
predictions.iloc[48:,:]

Unnamed: 0,Favored,Underdog,Predicted Winner,Probabilities,Upset
48,Alabama,San Diego State,Alabama,0.316853,False
49,Baylor,Missouri,Missouri,0.428894,True
50,Purdue,Duke,Purdue,0.278314,False
51,Michigan State,Providence,Providence,0.40116,True
52,Houston,Indiana,Indiana,0.340967,True
53,Texas,Iowa State,Texas,0.29541,False
54,Kansas,Saint Mary's,Kansas,0.24135,False
55,UCLA,Gonzaga,Gonzaga,0.380177,True
56,Alabama,Missouri,Alabama,0.235016,False
57,Purdue,Providence,Purdue,0.174273,False


Congratulations to all Kansas fans, because the model has predicted the Jayhawks to repeat as the 2023 NCAA Tournament champion!