In [1]:
import numpy as np
import pandas as pd

from basketball import get_column_correlation, normed_error, march_madness, basketball_team, linear_regression

bbData = pd.read_csv("cbb20.csv")

# College Basketball Match Predictor

In this project, we are going to formulate a predictor model for college basketball matches based on data from https://www.kaggle.com/datasets/andrewsundberg/college-basketball-dataset

The training data is the match result in year 2020. We first remove the ranking column since that's what we plan to predict.
And we produce a custom class called "basketball_team" which is a subclass of dataframe .

In [2]:
bbDataCleaned = bbData.drop(['RK', 'WAB'], axis = 1)
basketball = basketball_team(bbDataCleaned)

In [3]:
basketball

Unnamed: 0,TEAM,CONF,G,W,ADJOE,ADJDE,BARTHAG,EFG_O,EFG_D,TOR,TORD,ORB,DRB,FTR,FTRD,2P_O,2P_D,3P_O,3P_D,ADJ_T
0,Kansas,B12,30,28,116.1,87.7,0.9616,53.7,43.7,18.7,18.6,32.6,26.4,35.8,23.2,54.9,42.4,34.1,30.5,67.4
1,Baylor,B12,30,26,114.5,88.4,0.9513,49.4,45.2,17.8,22.7,35.8,29.8,30.8,30.8,47.5,44.4,35.1,31.1,66.2
2,Gonzaga,WCC,33,31,121.3,94.3,0.9472,57.5,47.6,15.3,18.4,33.6,22.7,38.8,21.8,57.4,47.4,38.6,32.0,72.0
3,Dayton,A10,31,29,119.5,93.4,0.9445,59.7,46.6,18.0,18.8,26.4,26.6,33.9,30.9,62.3,45.1,37.1,33.0,67.5
4,Michigan St.,B10,31,22,114.8,91.3,0.9326,52.6,43.3,18.1,15.8,32.8,26.0,30.8,29.3,52.9,43.4,34.8,28.7,69.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
348,Arkansas Pine Bluff,SWAC,29,4,80.1,104.3,0.0461,43.1,48.9,26.2,20.6,27.1,30.7,37.4,46.3,44.6,50.2,26.3,31.1,66.1
349,Maryland Eastern Shore,MEAC,31,5,82.1,108.0,0.0411,40.5,51.0,21.1,20.4,26.2,31.0,25.7,36.6,42.5,51.3,25.0,33.7,67.7
350,Mississippi Valley St.,SWAC,30,4,89.2,119.0,0.0350,43.0,54.0,18.2,17.6,23.5,35.9,22.9,36.1,42.2,55.5,29.4,34.1,77.4
351,Kennesaw St.,ASun,29,1,82.2,112.2,0.0269,39.3,55.3,21.7,18.5,25.8,31.2,32.5,30.4,40.2,55.0,24.8,37.1,68.2


The new "basketball_team" class also has the properties of a python dataframe. For example, iloc() can be called

In [4]:
basketball.iloc[1]

TEAM       Baylor
CONF          B12
G              30
W              26
ADJOE       114.5
ADJDE        88.4
BARTHAG    0.9513
EFG_O        49.4
EFG_D        45.2
TOR          17.8
TORD         22.7
ORB          35.8
DRB          29.8
FTR          30.8
FTRD         30.8
2P_O         47.5
2P_D         44.4
3P_O         35.1
3P_D         31.1
ADJ_T        66.2
Name: 1, dtype: object

## get_column_correlation

This is a function that produces the correlation between column data. By specifying the parameters as column names, we can get their relavent correlations, which can be used to determine which columns should be included in the regression model.

In [5]:
col = list(basketball.columns[2:])
col.remove("BARTHAG")
print(col)

['G', 'W', 'ADJOE', 'ADJDE', 'EFG_O', 'EFG_D', 'TOR', 'TORD', 'ORB', 'DRB', 'FTR', 'FTRD', '2P_O', '2P_D', '3P_O', '3P_D', 'ADJ_T']


Since the goal is to predict rankings and match outcome, "BARTHAG" is an important column as it represents one team's probability of winning against an average team. Therefore, we use the correlation of a column with "BARTHAG" to determine its importance in the model.

In [6]:
get_column_correlation(basketball, ["BARTHAG","W","ADJOE","ADJDE"])

The correlation between BARTHAG and W is 0.76.
The correlation between BARTHAG and ADJOE is 0.86.
The correlation between BARTHAG and ADJDE is -0.84.
The correlation between W and ADJOE is 0.7.
The correlation between W and ADJDE is -0.64.
The correlation between ADJOE and ADJDE is -0.49.



# obtain_average_team

Since ontaining average team is important in determining one team's ranking, we also included it in the custom class of basketball_team.

In [7]:
avg=basketball.obtain_average_team()

In [8]:
avg["G"]

30.186968838526912

## The model prediction
This next section creates a new index by which the basketball teams are now ranked using a linear regression model. Using the desired metrics (wins, adjusted offense, and adjusted defense), an index from 2 to 0 is used for our new personalizing rankings.

In [9]:
lr, train_X, train_y = linear_regression(basketball)

In [10]:
basketball.add_index(lr, train_X)

The new team rankings:

         TEAM CONF   G   W  ADJOE  ADJDE  BARTHAG  EFG_O  EFG_D   TOR  ...  \
0     Kansas  B12  30  28  116.1   87.7   0.9616   53.7   43.7  18.7  ...   
2    Gonzaga  WCC  33  31  121.3   94.3   0.9472   57.5   47.6  15.3  ...   
1     Baylor  B12  30  26  114.5   88.4   0.9513   49.4   45.2  17.8  ...   
3     Dayton  A10  31  29  119.5   93.4   0.9445   59.7   46.6  18.0  ...   
6  Creighton   BE  30  24  120.6   96.4   0.9289   55.2   48.4  15.9  ...   

    ORB   DRB   FTR  FTRD  2P_O  2P_D  3P_O  3P_D  ADJ_T  newIndex  
0  32.6  26.4  35.8  23.2  54.9  42.4  34.1  30.5   67.4  1.630288  
2  33.6  22.7  38.8  21.8  57.4  47.4  38.6  32.0   72.0  1.598295  
1  35.8  29.8  30.8  30.8  47.5  44.4  35.1  31.1   66.2  1.581119  
3  26.4  26.6  33.9  30.9  62.3  45.1  37.1  33.0   67.5  1.579760  
6  23.9  30.2  28.8  23.4  53.0  48.9  38.7  31.8   68.3  1.542090  

[5 rows x 21 columns]


We can also verify our model using data from year 2021

In [12]:
bbData21 = pd.read_csv("cbb21.csv")

In [13]:
bbDataCleaned21 = bbData.drop(['WAB'], axis = 1)
basketball21 = basketball_team(bbDataCleaned21)

In [14]:
lr.score(basketball21[["W","ADJOE","ADJDE"]], basketball21["BARTHAG"])

0.9745098845573505

The score is 0.9745 and it's a relatively good prediction

In [15]:
basketball21.add_index(lr, basketball21[["W","ADJOE","ADJDE"]])

The new team rankings:

    RK       TEAM CONF   G   W  ADJOE  ADJDE  BARTHAG  EFG_O  EFG_D  ...   ORB  \
0   1     Kansas  B12  30  28  116.1   87.7   0.9616   53.7   43.7  ...  32.6   
2   3    Gonzaga  WCC  33  31  121.3   94.3   0.9472   57.5   47.6  ...  33.6   
1   2     Baylor  B12  30  26  114.5   88.4   0.9513   49.4   45.2  ...  35.8   
3   4     Dayton  A10  31  29  119.5   93.4   0.9445   59.7   46.6  ...  26.4   
6   7  Creighton   BE  30  24  120.6   96.4   0.9289   55.2   48.4  ...  23.9   

    DRB   FTR  FTRD  2P_O  2P_D  3P_O  3P_D  ADJ_T  newIndex  
0  26.4  35.8  23.2  54.9  42.4  34.1  30.5   67.4  1.630288  
2  22.7  38.8  21.8  57.4  47.4  38.6  32.0   72.0  1.598295  
1  29.8  30.8  30.8  47.5  44.4  35.1  31.1   66.2  1.581119  
3  26.6  33.9  30.9  62.3  45.1  37.1  33.0   67.5  1.579760  
6  30.2  28.8  23.4  53.0  48.9  38.7  31.8   68.3  1.542090  

[5 rows x 22 columns]


## march_madness
Finally, this is our implementation of the previous code. The function march_madness does primarily three things:

   1. Picks the top team of every conference using our new index and then the next 32 teams not already chosen by index. These 64 teams are split into 4 regions: north, south, east and west. Seeds are set from 1 to 16 by index.
   2. Simulates regional games for each group. In march madness, a 1 seed plays a 16 seed, a 2 plays a 15, 3 and 14... One winner is chosen from each region. The game winners are determined by their index but there is also some random chance involved, every team has the chance to upset or be upset.
   3. With one team from each region, a final four is played and a single march madness champion is crowned!

In [11]:
march_madness(basketball)

Final Four Teams:
	 North Winner: Villanova
	 South Winner: Dayton
	 East Winner: Gonzaga
	 West Winner: Michigan


The Winner of March Madness is:
-------------------
Michigan!!!
-------------------
