# Generating Predictions

Using the logistic regression model that we chose in the Selecting a Model notebook, we will create predictions for the 2024 NCAA Tournament.

In [1]:
# Import packages
import sys
sys.path.append('../')

import pandas as pd
from sklearn.linear_model import LogisticRegression
import collegebasketball as cbb

import warnings
warnings.filterwarnings('ignore')

cbb.__version__

'2024'

## Train the Model

We will train the model using the same method as before. To understand how I arrived at this model, see the Selecting a Model notebook.

However, there is one major difference in how we train the model this time. Before, we split the data into training and testing sets. Because we are now predicting new games, we will use all of the available data to train the model.

In [2]:
# Load the csv files that contain the scores/kenpom data
year = 2024
path = f'../Data/Training/training_{year}.csv'
train = pd.read_csv(path)

# Get a sense for the size of each data set
print('Length of training data: {}'.format(len(train)))

Length of training data: 17943


In [3]:
train.head(3)

Unnamed: 0,Favored,Underdog,Year,Tournament,Label,Win_Loss_Fav,Win_Loss,AdjEM_Fav,AdjEM,AdjO_Fav,...,FT%_opp_Fav,FT%_opp,AST_Fav,AST,AST_opp_Fav,AST_opp,BLK_Fav,BLK,BLK_opp_Fav,BLK_opp
0,UNC,William & Mary,2010,NIT,0,0.540541,0.666667,13.39,6.58,107.4,...,0.699,0.681,15.594595,14.030303,14.378378,13.151515,5.675676,2.545455,4.486486,3.333333
1,Wake Forest,UNC,2010,,0,0.645161,0.540541,14.12,13.39,107.1,...,0.687,0.699,11.903226,15.594595,11.419355,14.378378,5.225806,5.675676,3.903226,4.486486
2,UNC,Nevada,2010,,0,0.540541,0.617647,13.39,10.2,107.4,...,0.699,0.713,15.594595,14.117647,14.378378,15.029412,5.675676,4.382353,4.486486,2.588235


In [4]:
# Get feature names
exclude = ['Favored', 'Underdog', 'Year', 'Tournament', 'Label']

features = list(train.columns)
for col in exclude:
    features.remove(col)
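
The loop above relies on `list.remove`, which raises `ValueError` if one of the excluded columns is missing from the file. As a minimal sketch, an equivalent selection that tolerates missing columns could look like the following. The toy frame and its values are made up for illustration; only the column names mirror the notebook.

```python
import pandas as pd

# Toy frame with illustrative values (not taken from the training data).
toy = pd.DataFrame({
    'Favored': ['Team A'], 'Underdog': ['Team B'], 'Year': [2024],
    'Label': [0], 'AdjEM_Fav': [10.0], 'AdjEM': [5.0],
})

exclude = ['Favored', 'Underdog', 'Year', 'Tournament', 'Label']

# Keep every column that is not excluded, preserving column order.
# 'Tournament' is absent from the toy frame, and this still does not raise.
toy_features = [col for col in toy.columns if col not in exclude]
print(toy_features)
```

Either form gives the same feature list when all excluded columns are present.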

In [5]:
# Train the classifier
log = LogisticRegression(penalty='l2', C=10, solver='liblinear', random_state=77)
log.fit(train[features], train['Label'])  # pass a 1-D target to avoid a column-vector warning

LogisticRegression(C=10, random_state=77, solver='liblinear')
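
For readers who want to see how a fitted `LogisticRegression` exposes class probabilities, here is a small sketch on synthetic data. The data and variable names are invented for illustration, and this is not a description of `cbb.predict`, whose internals are not shown in this notebook. The columns of `predict_proba` follow the order of `classes_`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: one feature, label 1 is more likely when x is large.
rng = np.random.default_rng(77)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Same C, solver and random_state as the notebook's classifier; y is 1-D.
clf = LogisticRegression(C=10, solver='liblinear', random_state=77)
clf.fit(X, y)

proba = clf.predict_proba(X)  # shape (n_samples, n_classes)

# Look up the column for label 1 via classes_ instead of assuming its position.
p_label_1 = proba[:, list(clf.classes_).index(1)]
print(clf.classes_, p_label_1[:3])
```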

## Get Input Data for this Year

Next, we'll need to get the input data for this year so we can use it to predict game results for tournament games. We'll retrieve data from each source for this year, clean the data and combine it into a single data set.

In [6]:
stats_path = '../Data/SportsReference/' + str(year) + '_stats.csv'
stats = pd.read_csv(stats_path)
stats = cbb.update_basic(stats.rename(index=str, columns={'School': 'Team'}))

# Fix absolute stats to be per game
cols_to_fix = ['3PA', '3PA_opp', 'AST', 'AST_opp', 'BLK', 'BLK_opp']
for c in cols_to_fix:
    stats[c] = stats[c] / stats['G']

stats[stats['Team'] == 'Marquette']

Unnamed: 0,Team,G,SRS,SOS,Tm.,Opp.,MP,FG_opp,FGA_opp,FG%_opp,...,FT,FTA,FT%,ORB,TRB,AST,STL,BLK,TOV,PF
159,Marquette,34,19.57,10.98,2662,2370,1365,834,1965,0.424,...,379,530,0.715,298,1115,15.823529,291,3.235294,338,522
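
The per-game fix above divides season totals by games played. The same idea can be sketched in vectorized form on a toy frame. The team names and totals below are hypothetical.

```python
import pandas as pd

# Hypothetical season totals for two made-up teams.
season = pd.DataFrame({
    'Team': ['Team A', 'Team B'],
    'G': [34, 32],
    'AST': [510, 480],
    'BLK': [102, 96],
})

per_game_cols = ['AST', 'BLK']

# Divide each total column by that row's games played.
season[per_game_cols] = season[per_game_cols].div(season['G'], axis=0)
print(season)
```

Using `div(..., axis=0)` aligns the divisor on the row index, so every listed column is converted in one step.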


In [7]:
kp_path = '../Data/Kenpom/' + str(year) + '_kenpom.csv'
kenpom = pd.read_csv(kp_path)
kenpom = cbb.update_kenpom(kenpom)
kenpom[kenpom['Team'] == 'Marquette']

Unnamed: 0,Rank,Team,Seed,Conf,Wins,Losses,AdjEM,AdjO,AdjO Rank,AdjD,...,Luck,Luck Rank,OppAdjEM,OppAdjEM Rank,OppO,OppO Rank,OppD,OppD Rank,NCSOS AdjEM,NCSOS AdjEM Rank
11,12,Marquette,2.0,BE,25,9,22.74,118.3,21,95.6,...,0.036,98,13.13,7,113.3,6,100.2,8,8.27,23


In [8]:
TRank_path = '../Data/TRank/' + str(year) + '_TRank.csv'
TRank = pd.read_csv(TRank_path)
TRank = cbb.update_TRank(TRank)
TRank[TRank['Team'] == 'Marquette']

Unnamed: 0,Rk,Team,Conf,G,Wins,Losses,AdjOE,AdjOE Rank,AdjDE,AdjDE Rank,...,3P%D,3P%D Rank,3PR,3PR Rank,3PRD,3PRD Rank,Adj T.,Adj T. Rank,WAB,WAB Rank
7,8,Marquette,BE,34,25,9,118.9,19,94.6,17,...,33.6,155,40.5,96,43.1,340,69.1,86,6.5,6


In [9]:
# Merge the data from each source (and drop columns that are repeats)
team_stats = pd.merge(kenpom, TRank.drop(['Conf', 'Wins', 'Losses'], axis=1), on='Team', sort=False)
team_stats = pd.merge(team_stats, stats.drop(['G', 'ORB', '3P%'], axis=1), on='Team', sort=False)
team_stats[team_stats['Team'] == 'Marquette']

Unnamed: 0,Rank,Team,Seed,Conf,Wins,Losses,AdjEM,AdjO,AdjO Rank,AdjD,...,3PA,FT,FTA,FT%,TRB,AST,STL,BLK,TOV,PF
11,12,Marquette,2.0,BE,25,9,22.74,118.3,21,95.6,...,24.705882,379,530,0.715,1115,15.823529,291,3.235294,338,522
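
Because `pd.merge` defaults to an inner join, a team whose name is spelled differently in one source would silently drop out of the merged result. One way to surface such mismatches is sketched below with toy frames (the names and values are made up). `indicator=True` and `validate` are standard `pd.merge` parameters.

```python
import pandas as pd

# Two toy sources that disagree on one team's spelling.
source_a = pd.DataFrame({'Team': ['Team A', 'State U'], 'AdjEM': [30.0, 20.0]})
source_b = pd.DataFrame({'Team': ['Team A', 'State Univ.'], 'WAB': [10.0, 4.0]})

# An outer join with an indicator column records which side each row came from.
# validate='one_to_one' raises if a team name is duplicated in either source.
check = pd.merge(source_a, source_b, on='Team', how='outer',
                 indicator=True, validate='one_to_one')

# Rows that did not match on both sides point to naming differences.
unmatched = check.loc[check['_merge'] != 'both', ['Team', '_merge']]
print(unmatched)
```

An empty `unmatched` frame would suggest that the inner join keeps every team.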


In [10]:
# Load Tournament games
games_path = '../Data/Tourney/{}.csv'.format(year)
games = pd.read_csv(games_path)
games.head(3)

Unnamed: 0,Home,Away
0,UConn,Stetson
1,Florida Atlantic,Northwestern
2,San Diego State,UAB


In [11]:
# Join the team data with the game data
data = pd.merge(games, team_stats, left_on='Home', right_on='Team', sort=False, how='left')
data = pd.merge(data, team_stats, left_on='Away', right_on='Team', suffixes=('_Home', '_Away'), sort=False, how='left')
data.insert(0, 'Year', year)
data.insert(3, 'Tournament', 'NCAA Tournament')

# Confirm school names are correct
assert len(data[(data['Rank_Home'].isna()) | (data['Rank_Away'].isna())]) == 0

data.head(3)

Unnamed: 0,Year,Home,Away,Tournament,Rank_Home,Team_Home,Seed_Home,Conf_Home,Wins_Home,Losses_Home,...,3PA_Away,FT_Away,FTA_Away,FT%_Away,TRB_Away,AST_Away,STL_Away,BLK_Away,TOV_Away,PF_Away
0,2024,UConn,Stetson,NCAA Tournament,1,UConn,1.0,BE,31,3,...,24.617647,459,599,0.766,1186,13.5,181,3.058824,355,504
1,2024,Florida Atlantic,Northwestern,NCAA Tournament,41,Florida Atlantic,8.0,Amer,25,8,...,21.09375,410,548,0.748,997,15.6875,222,3.28125,280,568
2,2024,San Diego State,UAB,NCAA Tournament,21,San Diego State,5.0,MWC,24,10,...,18.764706,604,810,0.746,1296,13.617647,229,4.617647,392,537
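
The two joins above attach one copy of the team columns per side. In the first merge no column names collide, so nothing is renamed. In the second, every team column collides, which is when `suffixes=('_Home', '_Away')` takes effect. The toy sketch below (made-up names and ranks) shows the resulting column names and lists any unmatched schools by name, which can be easier to debug than a bare assertion.

```python
import pandas as pd

# Toy team table and a single matchup; 'Team C' is deliberately missing.
teams = pd.DataFrame({'Team': ['Team A', 'Team B'], 'Rank': [1, 40]})
matchups = pd.DataFrame({'Home': ['Team A'], 'Away': ['Team C']})

# First join: no overlapping column names, so no suffixes are applied.
joined = pd.merge(matchups, teams, left_on='Home', right_on='Team', how='left')

# Second join: 'Team' and 'Rank' now exist on both sides and get suffixed.
joined = pd.merge(joined, teams, left_on='Away', right_on='Team',
                  suffixes=('_Home', '_Away'), how='left')
print(list(joined.columns))

# A left join leaves NaN where a name has no match; report those names.
missing_home = joined.loc[joined['Rank_Home'].isna(), 'Home'].tolist()
missing_away = joined.loc[joined['Rank_Away'].isna(), 'Away'].tolist()
print(missing_home, missing_away)
```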


## Predict Games Using the Classifier

Now that we have a trained model and data for this year's tournament games, we can use the model to predict games in the 2024 NCAA Tournament.

In [12]:
# Make Predictions
predictions = cbb.predict(log, data, features)
predictions.to_csv(f'../Data/predictions/predictions_{year}.csv', index=False)
predictions['Upset'] = predictions['Underdog'] == predictions['Predicted Winner']

In [14]:
# First Round
predictions.iloc[0:32,:]

Unnamed: 0,Favored,Underdog,Predicted Winner,Probabilities,Upset
0,UConn,Stetson,UConn,0.009171,False
1,Florida Atlantic,Northwestern,Northwestern,0.594426,True
2,San Diego State,UAB,San Diego State,0.254881,False
3,Auburn,Yale,Auburn,0.091385,False
4,BYU,Duquesne,BYU,0.236502,False
5,Illinois,Morehead State,Illinois,0.040837,False
6,Washington State,Drake,Washington State,0.416305,False
7,Iowa State,South Dakota State,Iowa State,0.026823,False
8,UNC,Wagner,UNC,0.006887,False
9,Michigan State,Mississippi State,Mississippi State,0.554902,True


In [16]:
# Second Round
predictions.iloc[32:48,:]

Unnamed: 0,Favored,Underdog,Predicted Winner,Probabilities,Upset
32,UConn,Northwestern,UConn,0.113092,False
33,Auburn,San Diego State,Auburn,0.327519,False
34,Illinois,BYU,Illinois,0.372588,False
35,Iowa State,Washington State,Iowa State,0.209334,False
36,UNC,Mississippi State,Mississippi State,0.316553,True
37,Alabama,Grand Canyon,Alabama,0.243838,False
38,Baylor,New Mexico,New Mexico,0.30927,True
39,Arizona,Nevada,Nevada,0.344166,True
40,Houston,Nebraska,Houston,0.187894,False
41,Duke,Wisconsin,Wisconsin,0.475039,True


In [17]:
# Later Rounds
predictions.iloc[48:,:]

Unnamed: 0,Favored,Underdog,Predicted Winner,Probabilities,Upset
48,UConn,Auburn,UConn,0.248864,False
49,Iowa State,Illinois,Illinois,0.406706,True
50,Alabama,Mississippi State,Mississippi State,0.40581,True
51,New Mexico,Nevada,Nevada,0.512902,True
52,Houston,Wisconsin,Houston,0.187625,False
53,Texas Tech,Florida,Florida,0.458171,True
54,Purdue,Kansas,Purdue,0.256461,False
55,Tennessee,Oregon,Tennessee,0.274501,False
56,UConn,Illinois,UConn,0.23816,False
57,Mississippi State,Nevada,Nevada,0.533423,True


Congratulations to all UConn fans: the model predicts that the Huskies will repeat as NCAA Tournament champions in 2024!