# Machine Learning Sample Notebook | Predicting Tournament Seeding
Author: Glen Joy (c) 2024
A sample Jupyter notebook illustrating a basic application of machine-learning models using scikit-learn. We will attempt to predict team seeding in the NCAA tournament.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

## Importing Data

In [2]:
df = pd.read_csv('./data/archive1/cbb.csv') # reading in data
df.head() # previewing data

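|   | TEAM | CONF | G | W | ADJOE | ADJDE | BARTHAG | EFG_O | EFG_D | TOR | ... | FTRD | 2P_O | 2P_D | 3P_O | 3P_D | ADJ_T | WAB | POSTSEASON | SEED | YEAR |
|---|------|------|---|---|-------|-------|---------|-------|-------|-----|-----|------|------|------|------|------|-------|-----|------------|------|------|
| 0 | North Carolina | ACC | 40 | 33 | 123.3 | 94.9 | 0.9531 | 52.6 | 48.1 | 15.4 | ... | 30.4 | 53.9 | 44.6 | 32.7 | 36.2 | 71.7 | 8.6 | 2ND | 1.0 | 2016 |
| 1 | Wisconsin | B10 | 40 | 36 | 129.1 | 93.6 | 0.9758 | 54.8 | 47.7 | 12.4 | ... | 22.4 | 54.8 | 44.7 | 36.5 | 37.5 | 59.3 | 11.3 | 2ND | 1.0 | 2015 |
| 2 | Michigan | B10 | 40 | 33 | 114.4 | 90.4 | 0.9375 | 53.9 | 47.7 | 14.0 | ... | 30.0 | 54.7 | 46.8 | 35.2 | 33.2 | 65.9 | 6.9 | 2ND | 3.0 | 2018 |
| 3 | Texas Tech | B12 | 38 | 31 | 115.2 | 85.2 | 0.9696 | 53.5 | 43.0 | 17.7 | ... | 36.6 | 52.8 | 41.9 | 36.5 | 29.7 | 67.5 | 7.0 | 2ND | 3.0 | 2019 |
| 4 | Gonzaga | WCC | 39 | 37 | 117.8 | 86.3 | 0.9728 | 56.6 | 41.1 | 16.2 | ... | 26.9 | 56.3 | 40.0 | 38.2 | 29.0 | 71.5 | 7.7 | 2ND | 1.0 | 2017 |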


## Specifying Training Features and Test Features

In [3]:
# array of team seeds that we're trying to predict
labels = np.nan_to_num(np.array(df['SEED']))  # teams that weren't seeded have NaN here, so we fill those with 0 instead
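
To make the fill step concrete, here is a minimal sketch on a tiny made-up `SEED` column. The values and the names `toy`, `toy_labels`, and `toy_labels_pd` are invented for illustration and are not taken from `cbb.csv`. It also shows what should be an equivalent pandas spelling, `fillna(0)`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: two seeded teams and two unseeded (NaN) teams
toy = pd.DataFrame({"SEED": [1.0, np.nan, 16.0, np.nan]})

# Same approach as the cell above: NaN becomes 0.0, other values are unchanged
toy_labels = np.nan_to_num(np.array(toy["SEED"]))

# Equivalent pandas spelling of the same fill
toy_labels_pd = toy["SEED"].fillna(0).to_numpy()

print(toy_labels)
print(np.array_equal(toy_labels, toy_labels_pd))
```

After this step, a label of 0 stands for "not seeded", so the classifier treats it as one more class alongside the actual seed numbers.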

In [4]:
# dropping the seed column (hiding the answer for training)
# we're also dropping the columns that hold strings, since RandomForestClassifier only works with numbers
# in practice, we could encode each string as a number, but I'm lazy.
df = df.drop(['SEED', 'POSTSEASON', 'TEAM', 'CONF'], axis=1)
print(df.size)  # total number of values (rows x columns), not the number of rows

70460
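
The comment above mentions that string columns could be encoded as numbers instead of dropped. One common way to do that is one-hot encoding with `pd.get_dummies`. The sketch below uses a small made-up frame (the name `toy` and its values are assumptions for illustration), and it is only one possible encoding, not what this notebook does.

```python
import pandas as pd

# Hypothetical toy frame with one string column and one numeric column
toy = pd.DataFrame({
    "CONF": ["ACC", "B10", "B10", "WCC"],
    "W": [33, 36, 33, 37],
})

# One-hot encode CONF: each conference becomes its own 0/1 column
encoded = pd.get_dummies(toy, columns=["CONF"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

One-hot columns avoid implying an order between conferences, which a plain integer code (ACC=0, B10=1, ...) would do.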


In [5]:
# splitting our data into training data and test data, specifically using 25% for testing
train_features, test_features, train_labels, test_labels = train_test_split(df, labels, test_size = 0.25, random_state = 42)
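
As a rough illustration of what the split produces, the sketch below runs `train_test_split` on synthetic data (the arrays `X` and `y` are made up). It also passes `stratify=y`, which the cell above does not use. That is an optional variation that keeps the label proportions the same in both splits, and whether it helps on the real data is an assumption that would need checking.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 3 features, imbalanced labels (80 zeros, 20 ones)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = np.array([0] * 80 + [1] * 20)

# test_size=0.25 holds out a quarter of the rows;
# stratify=y keeps the 80/20 label ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_train), np.bincount(y_test))
```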

## Training Model and Testing Model

In [6]:
# Instantiate model with 1000 decision trees
rf = RandomForestClassifier(n_estimators = 1000, random_state = 42)
# Train the model on training data
rf.fit(train_features, train_labels);
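
Once a forest is fitted, its `feature_importances_` attribute gives a rough view of which columns the trees relied on. The sketch below shows the idea on synthetic data where only one column matters. The column names (`signal`, `noise_a`, `noise_b`) and the variable `forest` are hypothetical, and it uses fewer trees than the notebook to keep it quick.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: the label depends only on the "signal" column
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(200, 3)), columns=["signal", "noise_a", "noise_b"]
)
y = (X["signal"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)

# Importances are normalized to sum to 1; higher means more influential
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

The same two lines at the end should work on the notebook's `rf` with `train_features.columns` as the index.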

In [7]:
# Use the forest's predict method on the test data
predictions = rf.predict(test_features)

In [8]:
# Checking Accuracy
print("Accuracy:",metrics.accuracy_score(test_labels, predictions))

Accuracy: 0.8251986379114642
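
Accuracy can look high when one class dominates. Assuming most rows in this dataset are unseeded teams (label 0), a model that always predicts 0 would already score well, so it helps to compare against that baseline. The sketch below shows the idea with scikit-learn's `DummyClassifier` on made-up labels; the 80/20 split is an invented example, not a measurement of `cbb.csv`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn import metrics

# Hypothetical labels: 80 unseeded teams (0) and 20 seeded teams (seeds 1-4)
y = np.array([0] * 80 + [1, 2, 3, 4] * 5)
X = np.zeros((len(y), 1))  # the baseline ignores the features entirely

# Always predict the most common label seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
baseline_preds = baseline.predict(X)

baseline_acc = metrics.accuracy_score(y, baseline_preds)
print("Baseline accuracy:", baseline_acc)
```

If the baseline on the real labels lands close to the forest's score, most of the accuracy comes from the unseeded rows, and a per-class view such as `metrics.classification_report` would be more informative.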


## Running Predictions

In [19]:
# for the sake of illustration, I'm going to have the model predict the seeding of a team that is already in our dataset (so the model may have seen it during training)
# in practice, you would be running predictions on data the model hasn't seen before
rf.predict(df.iloc[[0]]) # predicting the 1st row in the data (2016 North Carolina team) as an example; the double brackets keep it a one-row DataFrame with its column names



array([1.])

The model predicts a 1 seed, which matches North Carolina's actual seed in 2016, so it looks like the model got the seeding right!
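
A fairer spot check is to predict a row the model did not train on. The sketch below does this on synthetic data (the names `f1`, `f2`, `model`, and `one_row` are hypothetical). It picks a single row from the held-out split and also prints `predict_proba`, which shows how confident the forest is in each class.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical data: the label is 1 when the two features sum to a positive number
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(200, 2)), columns=["f1", "f2"])
y = (X["f1"] + X["f2"] > 0).astype(int).to_numpy()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Double brackets select a one-row DataFrame, so column names are preserved
one_row = X_test.iloc[[0]]
pred = model.predict(one_row)
proba = model.predict_proba(one_row)  # one probability per class in model.classes_

print("predicted:", pred[0], "actual:", y_test[0])
print("class probabilities:", dict(zip(model.classes_, proba[0])))
```

In this notebook, the analogous call would be `rf.predict(test_features.iloc[[0]])`, compared against `test_labels[0]`.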