# March Madness Machine Learning - Development Summary and Results

# Competition Overview

This document is based on work for a Kaggle competition: [March Machine Learning Mania 2022](https://www.kaggle.com/competitions/mens-march-mania-2022)

The goal of this project was to predict the results of match-ups in the College Basketball March Madness Tournament.

Model performance is scored by a cost function that accumulates based on the correctness and confidence of each prediction. A confident, correct prediction incurs a minimal cost, whereas a confident, incorrect prediction incurs a high cost. The confidence of the model's prediction for a game outcome is passed as a value from 0 to 1.
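
As a minimal sketch of how such a cost behaves, assuming the metric is binary log loss (the helper name `log_loss_score` and the sample probabilities are illustrative and are not taken from the competition code):

```python
import numpy as np

def log_loss_score(y_true, y_prob, eps=1e-15):
    """Mean binary log loss; lower is better."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs for predictions of exactly 0 or 1.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
    return float(-np.mean(losses))

# One game that the first team actually won (label 1), scored at three confidence levels.
confident_correct = log_loss_score([1], [0.95])
coin_flip = log_loss_score([1], [0.50])
confident_wrong = log_loss_score([1], [0.05])
print(confident_correct, coin_flip, confident_wrong)
```

Under that assumption, predicting 0.5 for every game costs about 0.693 per game, which is a useful reference point when reading the scores later in this notebook.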

## Development Process

Based on time constraints, I broke up the development of this project into two phases:
* Phase 1: Models were built using the 'compact game results' from past seasons. This data included which team won and the points scored by the winning and losing team.
* Phase 2: Models were built using the 'detailed game results' from past seasons. This included stats such as the number of field goals made, blocks, steals, free throws, etc.

For each feature set during development, I tested both a Logistic Regression and a Random Forest Classifier model.

I chose these two models to primarily work with because:
* The logistic regression is simple, easy to interpret, and at the very least will serve as a performance baseline for more advanced models.
* Random Forests generally perform well and are robust. In particular, they can use a large number of features and determine which are the most relevant to the problem.

## Additional Notebooks

This notebook summarizes key trained models and the ensuing results for the actual 2022 NCAA Men's Tournament.

For a detailed look at my process for developing a model for Phase 1:
https://github.com/jhowenstein/march-madness-ML/blob/main/March%20Madness%20-%20Phase%201%20-%20Process%20Write-Up.ipynb

For a detailed look at my process for developing a model for Phase 2:
https://github.com/jhowenstein/march-madness-ML/blob/main/March%20Madness%20-%20Phase%201%20-%20Post%20Tournament%20Review.ipynb

For a complete review of my Phase 1 model performance in the 2022 NCAA Tournament:
https://github.com/jhowenstein/march-madness-ML/blob/main/March%20Madness%20-%20Initial%20Submission%20-%20Post%20Tournament%20Review.ipynb

For a complete review of my Phase 2 model performance in the 2022 NCAA Tournament:
https://github.com/jhowenstein/march-madness-ML/blob/main/March%20Madness%20-%20Phase%202%20-%20Post%20Tournament%20Review.ipynb

For the code I used to calculate the model features for each team and season:
https://github.com/jhowenstein/march-madness-ML/blob/main/march_madness.py

For the notebook I used to calculate and save a CSV file with all the feature data:
https://github.com/jhowenstein/march-madness-ML/blob/main/March%20Madness%20-%20Feature%20Calculation.ipynb

Analysis Code:
https://github.com/jhowenstein/march-madness-ML

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

import os

import march_madness as mm
import random

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import PolynomialFeatures

In [2]:
analysis = mm.Analysis()

In [3]:
analysis.load_training_data('MDataFiles_Stage2/calculated features 1985-2021.csv')

In [4]:
analysis.load_validation_data('MDataFiles_Stage2/calculated features 2022.csv')

# Phase 1 Results

## Final Training Models

The four models below are my final training models for Phase 1 of development. Note: After model selection, I did retrain the selected model on all the training data before generating predictions for the 2022 Tournament. 

The Logistic Regression models had the best training scores but I was concerned they may not generalize as well given the highly correlated nature of some of the features. I did try pruning the feature set of some of the highly correlated or similar features and saw some modest improvements. Unfortunately I didn't have a lot of time to explore this before the submission deadline and ultimately decided to see how my submission using the Logistic Regression model would perform using the full feature set.
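
The notebook does not show how the pruned feature list was chosen. One possible way to flag highly correlated features is sketched below; the helper `drop_correlated`, the 0.9 threshold, and the synthetic columns are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop the later column of any pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    'last10 win pct': base,
    # Nearly a copy of the column above, so it should be flagged.
    'last10 weighted win pct': base + rng.normal(scale=0.05, size=200),
    # Independent of the other two columns.
    'close wins': rng.normal(size=200),
})
pruned, dropped = drop_correlated(demo)
print(dropped)
```

Which column of a correlated pair to keep is a judgment call; this sketch simply keeps whichever appears first.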

The Random Forest models performed slightly worse, and in the case of the model controlled by limiting 'max_features', over-fitting of the training set was still occurring, which was a concern for generalization. After the submission deadline, I explored using 'max_depth' to control the model complexity, and it appeared to be very effective for limiting over-fitting.
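
The effect of 'max_depth' on over-fitting can be illustrated on synthetic data. This is only a sketch: the data comes from `make_classification` with deliberately noisy labels, not from the tournament features, and the forest sizes are smaller than the ones used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, noisy stand-in for the tournament data (not the real features).
X_demo, y_demo = make_classification(n_samples=1000, n_features=30, n_informative=5,
                                     flip_y=0.3, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=0)

gaps = {}
for depth in (None, 5):
    rf = RandomForestClassifier(n_estimators=200, max_depth=depth, random_state=0)
    rf.fit(Xd_train, yd_train)
    train_acc = rf.score(Xd_train, yd_train)
    test_acc = rf.score(Xd_test, yd_test)
    # A large train/test gap is the usual sign of over-fitting.
    gaps[depth] = train_acc - test_acc
    print(f"max_depth={depth}: train={train_acc:.3f} test={test_acc:.3f}")
```

The unconstrained forest typically memorizes the training set, while the depth-limited forest shows a much smaller gap between training and test accuracy.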

### Using All Calculated Features

In [5]:
feature_keys = ['tourney seed','weighted win pct','owp','oowp','avg win margin','std win margin','avg loss margin',
                'std loss margin','capped avg win margin','capped std win margin','capped avg loss margin',
                'capped std loss margin','close wins','close losses','weighted top64 wins','weighted top32 wins',
                'weighted top16 wins','weighted top8 wins','weighted top64 losses','weighted top32 losses',
                'weighted top16 losses','weighted top8 losses','last10 win pct','last10 weighted win pct',
                'last5 win pct','last5 weighted win pct','conference tourney wins','conference champ']

In [6]:
X, y = analysis.extract_training_data(feature_keys=feature_keys)

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=100)

In [8]:
X_train.shape

(1737, 56)

In [9]:
y_train.shape

(1737,)

In [10]:
X_test.shape

(580, 56)

In [11]:
y_test.shape

(580,)

### Logistic Regression

In [12]:
logreg = LogisticRegression().fit(X_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [13]:
logreg.score(X_train,y_train)

0.7276914219919401

In [14]:
logreg.score(X_test,y_test)

0.696551724137931

In [15]:
logreg_pred = mm.bound_predictions(logreg.predict_proba(X_test)[:,1])

In [16]:
analysis.score_model_predictions(y_test,logreg_pred)

0.5416041357699954
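
The convergence warning printed when fitting the Logistic Regression suggests scaling the data or raising `max_iter`. A minimal sketch of that suggestion on synthetic data follows; the pipeline and the variable names are assumptions, and the scores reported in this notebook come from the unscaled fits.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one column is put on a much larger scale on purpose.
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)
X_demo[:, 0] *= 1000.0

# Standardize each feature, then fit with a higher iteration cap.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_demo, y_demo)

# n_iter_ holds the number of solver iterations actually used.
n_iter = int(scaled_logreg.named_steps['logisticregression'].n_iter_[0])
print(n_iter)
```

Putting the scaler inside the pipeline means it is fitted only on the data passed to `fit`, so the same object can be reused on a held-out set without leaking its statistics.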

### Random Forest Classifier (max_features controlled)

In [17]:
forest = RandomForestClassifier(n_estimators=1000,max_features=3,random_state=100).fit(X_train,y_train)

In [18]:
forest.score(X_train,y_train)

1.0

In [19]:
forest.score(X_test,y_test)

0.7155172413793104

In [20]:
forest_pred = mm.bound_predictions(forest.predict_proba(X_test)[:,1])

In [21]:
analysis.score_model_predictions(y_test,forest_pred)

0.5509841947134376

### Random Forest Classifier (max_depth controlled)

In [22]:
forest = RandomForestClassifier(n_estimators=1000,
                                max_features='sqrt',max_depth=5,random_state=100).fit(X_train,y_train)

In [23]:
forest.score(X_train,y_train)

0.7852619458837076

In [24]:
forest.score(X_test,y_test)

0.7189655172413794

In [25]:
forest_pred = mm.bound_predictions(forest.predict_proba(X_test)[:,1])

In [26]:
analysis.score_model_predictions(y_test,forest_pred)

0.5495379202820174

### Using Pruned Feature Set (Logistic Regression Only)

In [27]:
feature_keys = ['tourney seed','weighted win pct','owp','oowp','capped avg win margin','capped std win margin',
                'capped avg loss margin','capped std loss margin','close wins','close losses','weighted top64 wins',
                'weighted top16 wins','weighted top64 losses','weighted top16 losses','last10 weighted win pct',
                'last5 weighted win pct','conference tourney wins','conference champ']

In [28]:
X, y = analysis.extract_training_data(feature_keys=feature_keys)

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=100)

In [30]:
X_train.shape

(1737, 36)

In [31]:
y_train.shape

(1737,)

In [32]:
X_test.shape

(580, 36)

In [33]:
y_test.shape

(580,)

In [34]:
logreg = LogisticRegression().fit(X_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [35]:
logreg.score(X_train,y_train)

0.7184801381692574

In [36]:
logreg.score(X_test,y_test)

0.7155172413793104

In [37]:
logreg_pred = mm.bound_predictions(logreg.predict_proba(X_test)[:,1])

In [38]:
analysis.score_model_predictions(y_test,logreg_pred)

0.5385360241344801

## Tournament Results

As mentioned above, for all of these models, I retrained them using the entire training set before testing their predictions on the 2022 NCAA Tournament results.

I had a bug in my submission code (🤦🏻‍♂️), so my actual score on the leaderboard ended up not being accurate. The result below for 'Logistic Regression (all features)' is what my score should've been. A total of 930 submissions were entered into the competition.

Model Performance: 
* Logistic Regression (all features)
    * Competition Score: 0.67371
    * Leaderboard Ranking: 581st
* Random Forest ('max_features' controlled)
    * Competition Score: 0.63990
    * Leaderboard Ranking: 343rd
* Random Forest ('max_depth' controlled)
    * Competition Score: 0.60753
    * Leaderboard Ranking: 71st
* Logistic Regression (pruned feature set)
    * Competition Score: 0.61613
    * Leaderboard Ranking: 143rd
    
In the end, the Logistic Regression model using all the training features performed the worst. The version trained on the pruned feature set performed considerably better, which suggests that the smaller feature set helped it generalize. The Random Forest using the 'max_depth' parameter performed the best and actually ended up ranking pretty high compared to the competition results.

### Using All Calculated Features

In [41]:
feature_keys = ['tourney seed','weighted win pct','owp','oowp','avg win margin','std win margin','avg loss margin',
                'std loss margin','capped avg win margin','capped std win margin','capped avg loss margin',
                'capped std loss margin','close wins','close losses','weighted top64 wins','weighted top32 wins',
                'weighted top16 wins','weighted top8 wins','weighted top64 losses','weighted top32 losses',
                'weighted top16 losses','weighted top8 losses','last10 win pct','last10 weighted win pct',
                'last5 win pct','last5 weighted win pct','conference tourney wins','conference champ']

In [42]:
X_train, y_train = analysis.extract_training_data(feature_keys=feature_keys)

In [43]:
X_test, y_test = analysis.extract_validation_data(feature_keys=feature_keys)

### Logistic Regression

In [44]:
logreg = LogisticRegression().fit(X_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [45]:
logreg.score(X_train,y_train)

0.7285282693137678

In [46]:
logreg.score(X_test,y_test)

0.6417910447761194

In [48]:
logreg_pred = mm.bound_predictions(logreg.predict_proba(X_test)[:,1])

In [49]:
analysis.score_model_predictions(y_test,logreg_pred)

0.6737052990506176

### Random Forest Classifier (max_features controlled)

In [50]:
forest = RandomForestClassifier(n_estimators=1000,max_features=3,random_state=100).fit(X_train,y_train)

In [51]:
forest.score(X_train,y_train)

1.0

In [52]:
forest.score(X_test,y_test)

0.6268656716417911

In [53]:
forest_pred = mm.bound_predictions(forest.predict_proba(X_test)[:,1])

In [55]:
analysis.score_model_predictions(y_test,forest_pred)

0.6398966806324071

### Random Forest Classifier (max_depth controlled)

In [56]:
forest = RandomForestClassifier(n_estimators=1000,max_features='sqrt',
                                max_depth=5,random_state=100).fit(X_train,y_train)

In [57]:
forest.score(X_train,y_train)

0.7734138972809668

In [58]:
forest.score(X_test,y_test)

0.6268656716417911

In [59]:
forest_pred = mm.bound_predictions(forest.predict_proba(X_test)[:,1])

In [60]:
analysis.score_model_predictions(y_test,forest_pred)

0.60753019317656

### Using Pruned Feature Set (Logistic Regression Only)

In [61]:
feature_keys = ['tourney seed','weighted win pct','owp','oowp','capped avg win margin','capped std win margin',
                'capped avg loss margin','capped std loss margin','close wins','close losses','weighted top64 wins',
                'weighted top16 wins','weighted top64 losses','weighted top16 losses','last10 weighted win pct',
                'last5 weighted win pct','conference tourney wins','conference champ']

In [62]:
X_train, y_train = analysis.extract_training_data(feature_keys=feature_keys)

In [63]:
X_test, y_test = analysis.extract_validation_data(feature_keys=feature_keys)

In [64]:
logreg = LogisticRegression().fit(X_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [65]:
logreg.score(X_train,y_train)

0.7229175658178679

In [66]:
logreg.score(X_test,y_test)

0.6716417910447762

In [67]:
logreg_pred = mm.bound_predictions(logreg.predict_proba(X_test)[:,1])

In [68]:
analysis.score_model_predictions(y_test,logreg_pred)

0.6161292833078834

# Phase 2 Results

## Final Training Models

For Phase 2, I added a significant number of features using the detailed game results. With the additional features, some of which are highly correlated, I went with the Random Forest model due to its general performance and ability to auto-select the most important features.
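
A fitted Random Forest exposes which features it relied on through `feature_importances_`. The sketch below uses synthetic data in which only one column carries signal; the column names are made up for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 600
# Synthetic data: only the first column carries signal, the other four are noise.
X_demo = rng.normal(size=(n, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)
names = ['signal', 'noise a', 'noise b', 'noise c', 'noise d']

rf = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)
rf.fit(X_demo, y_demo)

# Impurity-based importances sum to 1; a larger value means the feature
# contributed more to the splits.
ranked = sorted(zip(names, rf.feature_importances_), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

With correlated features, importance tends to be shared among the correlated columns, so a low value for one of them does not necessarily mean it is uninformative.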

In [79]:
feature_keys = ['tourney seed','weighted win pct','owp','oowp','avg win margin','std win margin','avg loss margin',
                'std loss margin','capped avg win margin','capped std win margin','capped avg loss margin',
                'capped std loss margin','close wins','close losses','weighted top64 wins','weighted top32 wins',
                'weighted top16 wins','weighted top8 wins','weighted top64 losses','weighted top32 losses',
                'weighted top16 losses','weighted top8 losses','last10 win pct','last10 weighted win pct',
                'last5 win pct','last5 weighted win pct','conference tourney wins','conference champ','Team Avg FGM',
                'Team Avg FGA','Team Avg FGM3','Team Avg FGA3','Team Avg FTM','Team Avg FTA',
                'Team Avg OR','Team Avg DR','Team Avg Ast','Team Avg TO%','Team Avg Stl%','Team Avg Blk%',
                'Team Avg PF','Team Avg TR','Team Avg FGM2','Team Avg FGA2','Team Avg FG%','Team Avg FG2%',
                'Team Avg FG3%','Team Avg FGA3%','Team Avg FT%','Team Avg Pos','Team Avg OEff','Opp Avg FGM',
                'Opp Avg FGA','Opp Avg FGM3','Opp Avg FGA3','Opp Avg FTM','Opp Avg FTA','Opp Avg OR','Opp Avg DR',
                'Opp Avg Ast','Opp Avg TO%','Opp Avg Stl%','Opp Avg Blk%','Opp Avg PF','Opp Avg TR','Opp Avg FGM2',
                'Opp Avg FGA2','Opp Avg FG%','Opp Avg FG2%','Opp Avg FG3%','Opp Avg FGA3%','Opp Avg FT%',
                'Opp Avg Pos','Opp Avg OEff','Net Team Avg FGM','Net Team Avg FGA','Net Team Avg FGM3',
                'Net Team Avg FGA3','Net Team Avg FTM','Net Team Avg FTA','Net Team Avg OR','Net Team Avg DR',
                'Net Team Avg Ast','Net Team Avg TO%','Net Team Avg Stl%','Net Team Avg Blk%','Net Team Avg PF',
                'Net Team Avg TR','Net Team Avg FGM2','Net Team Avg FGA2','Net Team Avg FG%','Net Team Avg FG2%',
                'Net Team Avg FG3%','Net Team Avg FGA3%','Net Team Avg FT%','Net Team Avg Pos','Net Team Avg OEff',
                'Net Opp Avg FGM','Net Opp Avg FGA','Net Opp Avg FGM3','Net Opp Avg FGA3','Net Opp Avg FTM',
                'Net Opp Avg FTA','Net Opp Avg OR','Net Opp Avg DR','Net Opp Avg Ast','Net Opp Avg TO%',
                'Net Opp Avg Stl%','Net Opp Avg Blk%','Net Opp Avg PF','Net Opp Avg TR','Net Opp Avg FGM2',
                'Net Opp Avg FGA2','Net Opp Avg FG%','Net Opp Avg FG2%','Net Opp Avg FG3%','Net Opp Avg FGA3%',
                'Net Opp Avg FT%','Net Opp Avg Pos','Net Opp Avg OEff']

In [80]:
X, y = analysis.extract_training_data(feature_keys=feature_keys)

In [81]:
poly = PolynomialFeatures(2)

In [82]:
X = poly.fit_transform(X)

In [83]:
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=100)

In [84]:
X_train.shape

(1737, 29161)

In [85]:
y_train.shape

(1737,)

In [86]:
X_test.shape

(580, 29161)

In [87]:
y_test.shape

(580,)
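
The column count of 29161 can be checked by hand. The sketch assumes that each match-up row holds one set of features per team, which is inferred from the Phase 1 shapes (28 keys giving 56 columns) and is not confirmed by the code shown here.

```python
from math import comb

n_keys = 120            # entries in feature_keys above
n_inputs = 2 * n_keys   # assumption: one set of features per team in the match-up
degree = 2

# Number of monomials of total degree <= `degree` in `n_inputs` variables,
# including the constant (bias) column that PolynomialFeatures adds by default.
n_poly = comb(n_inputs + degree, degree)
print(n_poly)  # 29161
```

With only 1737 training rows, that is roughly 17 columns per training example, which is one reason to keep the trees shallow.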

### Random Forest

In [88]:
forest = RandomForestClassifier(n_estimators=10000,max_features='sqrt',
                                max_depth=5,random_state=100,n_jobs=2).fit(X_train,y_train)

In [89]:
forest.score(X_train,y_train)

0.8485895221646517

In [90]:
forest.score(X_test,y_test)

0.7155172413793104

In [91]:
forest_pred = mm.bound_predictions(forest.predict_proba(X_test)[:,1])

In [92]:
analysis.score_model_predictions(y_test,forest_pred)

0.5556406069420509

## Tournament Results

Surprisingly, the additional information provided by the 'detailed game results' didn't improve the model's performance compared to the Random Forest in Phase 1. However, all four of the Random Forest models I trained using the additional features performed very similarly, providing some evidence that the additional information made the models more robust.

All four of the Logistic Regression models I trained actually performed significantly worse than the Random Forests and the models from Phase 1. This supports the trend we saw in Phase 1, where the Logistic Regression's performance tended to decrease as more features were added.

Model Performance: 
* Random Forest
    * Competition Score: 0.61260
    * Leaderboard Ranking: 126th

In [69]:
feature_keys = ['tourney seed','weighted win pct','owp','oowp','avg win margin','std win margin','avg loss margin',
                'std loss margin','capped avg win margin','capped std win margin','capped avg loss margin',
                'capped std loss margin','close wins','close losses','weighted top64 wins','weighted top32 wins',
                'weighted top16 wins','weighted top8 wins','weighted top64 losses','weighted top32 losses',
                'weighted top16 losses','weighted top8 losses','last10 win pct','last10 weighted win pct',
                'last5 win pct','last5 weighted win pct','conference tourney wins','conference champ','Team Avg FGM',
                'Team Avg FGA','Team Avg FGM3','Team Avg FGA3','Team Avg FTM','Team Avg FTA',
                'Team Avg OR','Team Avg DR','Team Avg Ast','Team Avg TO%','Team Avg Stl%','Team Avg Blk%',
                'Team Avg PF','Team Avg TR','Team Avg FGM2','Team Avg FGA2','Team Avg FG%','Team Avg FG2%',
                'Team Avg FG3%','Team Avg FGA3%','Team Avg FT%','Team Avg Pos','Team Avg OEff','Opp Avg FGM',
                'Opp Avg FGA','Opp Avg FGM3','Opp Avg FGA3','Opp Avg FTM','Opp Avg FTA','Opp Avg OR','Opp Avg DR',
                'Opp Avg Ast','Opp Avg TO%','Opp Avg Stl%','Opp Avg Blk%','Opp Avg PF','Opp Avg TR','Opp Avg FGM2',
                'Opp Avg FGA2','Opp Avg FG%','Opp Avg FG2%','Opp Avg FG3%','Opp Avg FGA3%','Opp Avg FT%',
                'Opp Avg Pos','Opp Avg OEff','Net Team Avg FGM','Net Team Avg FGA','Net Team Avg FGM3',
                'Net Team Avg FGA3','Net Team Avg FTM','Net Team Avg FTA','Net Team Avg OR','Net Team Avg DR',
                'Net Team Avg Ast','Net Team Avg TO%','Net Team Avg Stl%','Net Team Avg Blk%','Net Team Avg PF',
                'Net Team Avg TR','Net Team Avg FGM2','Net Team Avg FGA2','Net Team Avg FG%','Net Team Avg FG2%',
                'Net Team Avg FG3%','Net Team Avg FGA3%','Net Team Avg FT%','Net Team Avg Pos','Net Team Avg OEff',
                'Net Opp Avg FGM','Net Opp Avg FGA','Net Opp Avg FGM3','Net Opp Avg FGA3','Net Opp Avg FTM',
                'Net Opp Avg FTA','Net Opp Avg OR','Net Opp Avg DR','Net Opp Avg Ast','Net Opp Avg TO%',
                'Net Opp Avg Stl%','Net Opp Avg Blk%','Net Opp Avg PF','Net Opp Avg TR','Net Opp Avg FGM2',
                'Net Opp Avg FGA2','Net Opp Avg FG%','Net Opp Avg FG2%','Net Opp Avg FG3%','Net Opp Avg FGA3%',
                'Net Opp Avg FT%','Net Opp Avg Pos','Net Opp Avg OEff']

In [70]:
X_train, y_train = analysis.extract_training_data(feature_keys=feature_keys)

In [71]:
X_test, y_test = analysis.extract_validation_data(feature_keys=feature_keys)

In [72]:
poly = PolynomialFeatures(2)

In [73]:
X_train = poly.fit_transform(X_train)
# Reuse the transformer fitted on the training data rather than refitting on the test data.
X_test = poly.transform(X_test)

### Random Forest

In [74]:
forest = RandomForestClassifier(n_estimators=10000,max_features='sqrt',
                                max_depth=5,random_state=100,n_jobs=2).fit(X_train,y_train)

In [75]:
forest.score(X_train,y_train)

0.8118256365990505

In [76]:
forest.score(X_test,y_test)

0.6567164179104478

In [77]:
forest_pred = mm.bound_predictions(forest.predict_proba(X_test)[:,1])

In [78]:
analysis.score_model_predictions(y_test,forest_pred)

0.6125969406878569

# Discussion and Conclusions

Overall, this competition ended up being much more challenging to solve with machine learning than I expected it to be. Going into development, I expected performance to improve as I added more features and information to the models, which was certainly not the case. Based on the leaderboard and discussion in the forums, simple models seemed to perform very well this year. FiveThirtyEight also posted an [article](https://fivethirtyeight.com/features/the-stats-led-our-brackets-astray-this-march-madness-that-doesnt-happen-often/) on how some of the more advanced statistics weren't very predictive this year, which might explain why Phase 2 performance really wasn't any better than just knowing the final scores of games.

In fact, just using the historical win percentages for seed-to-seed match-ups yielded my best score (0.60450) and would've placed 52nd overall out of 930 submissions.
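
The code for this baseline is not shown in this notebook. A minimal sketch of the idea follows, using a handful of made-up results; `past_games` and `seed_win_prob` are hypothetical names, and the numbers are not real tournament data.

```python
from collections import Counter

# Illustrative past results as (winning seed, losing seed); NOT real tournament data.
past_games = [(1, 16), (1, 16), (1, 16), (16, 1),
              (5, 12), (12, 5), (5, 12), (5, 12)]

# wins[(a, b)] is the number of games in which seed a beat seed b.
wins = Counter(past_games)

def seed_win_prob(seed_a, seed_b, default=0.5):
    """Historical fraction of games in which seed_a beat seed_b."""
    a_wins = wins[(seed_a, seed_b)]
    b_wins = wins[(seed_b, seed_a)]
    total = a_wins + b_wins
    # Fall back to a coin flip for match-ups that never occurred.
    return a_wins / total if total else default

print(seed_win_prob(1, 16), seed_win_prob(12, 5), seed_win_prob(2, 15))
```

Because the probabilities come straight from observed frequencies, they are rarely extreme unless a match-up has been completely one-sided, which keeps the cost of an upset modest.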

Another interesting finding was that the Logistic Regression model often performed better on the 'test' set during model training than the Random Forest models. However, the Random Forests performed better in the 2022 Tournament almost entirely across the board, indicating a better ability to generalize.

Approaching this competition next year, I think it will be more important to keep the 'score' of the model in perspective. Many of my models produced scores on the test set during training that would've been top 5 in the competition. Given the incredible variability of results in the tournament, it might be better to focus on improving generalization than on optimizing the third decimal place of an already very good training score.
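
One way to keep a single score in perspective is to look at how much it moves across cross-validation folds. The sketch below does this on synthetic data, assuming log loss as the metric; it is not part of the original analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, noisy stand-in data (not the tournament features).
X_demo, y_demo = make_classification(n_samples=600, n_features=20, n_informative=5,
                                     flip_y=0.3, random_state=0)
model = LogisticRegression(max_iter=1000)

# scikit-learn reports negated log loss so that higher is better; flip the sign back.
fold_losses = -cross_val_score(model, X_demo, y_demo, cv=10, scoring='neg_log_loss')
print(f"mean={fold_losses.mean():.3f} std={fold_losses.std():.3f} "
      f"min={fold_losses.min():.3f} max={fold_losses.max():.3f}")
```

If the spread between folds is larger than the difference between two candidate models, that difference is probably not meaningful.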