# March Madness Machine Learning - Development Summary and Results

# Competition Overview

This document is based on work for a Kaggle competition: [March Machine Learning Mania 2022](https://www.kaggle.com/competitions/mens-march-mania-2022).

The goal of this project is to predict the results of match-ups in the College Basketball March Madness Tournament.

Model performance is scored by a cost function that accumulates based on the correctness and confidence of each prediction. A confident, correct prediction incurs a minimal cost, while a confident, incorrect prediction incurs a high cost. The model's confidence in a game outcome is passed as a value from 0 to 1.
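
To make the scoring concrete, here is a minimal sketch that assumes the cost function is the standard binary log loss. The helper `log_loss` and the example probabilities are illustrative only; this is not the scoring code used elsewhere in this notebook.

```python
import math

def log_loss(y_true, y_prob):
    """Mean binary log loss.

    y_true: iterable of 0/1 outcomes.
    y_prob: iterable of predicted probabilities that the outcome is 1
            (each strictly between 0 and 1).
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Cost is -log(p) when the outcome is 1 and -log(1 - p) when it is 0.
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# One game that the first team actually won (y = 1), under three predictions.
confident_correct = log_loss([1], [0.95])  # confident and right: small cost
coin_flip = log_loss([1], [0.50])          # no opinion: cost of ln(2)
confident_wrong = log_loss([1], [0.05])    # confident and wrong: large cost
```

Under this assumption, always predicting 0.5 costs ln 2 (about 0.693) per game, which gives a reference point for judging a model's score.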

## Development Process

Because of time constraints, I broke up the development of this project into two phases:
* Phase 1: Models were built using the 'compact game results' from past seasons. This data included which team won and the points scored by the winning and losing teams.
* Phase 2: Models were built using the 'detailed game results' from past seasons. This included stats such as the number of field goals made, blocks, steals, free throws, etc.

For each feature set during development, I tested both a Logistic Regression and a Random Forest Classifier model.

I chose to work primarily with these two models because:
* Logistic regression is simple, easy to interpret, and at the very least serves as a performance baseline for more advanced models.
* Random Forests generally offer high performance and robustness and, in particular, can utilize a large number of features and determine which are the most relevant to the problem.

## Additional Notebooks

This notebook summarizes key trained models and the ensuing results for the actual 2022 NCAA Men's Tournament.

For a detailed look at my process for developing a model for Phase 1:
https://github.com/jhowenstein/march-madness-ML/blob/main/March%20Madness%20-%20Initial%20Submission%20-%20Process%20Write-Up.ipynb

For a detailed look at my process for developing a model for Phase 2:
https://github.com/jhowenstein/march-madness-ML/blob/main/March%20Madness%20-%20Phase%202%20-%20Process%20Write-Up.ipynb

For a complete review of my Phase 1 model performance in the 2022 NCAA Tournament:
https://github.com/jhowenstein/march-madness-ML/blob/main/March%20Madness%20-%20Initial%20Submission%20-%20Post%20Tournament%20Review.ipynb

For a complete review of my Phase 2 model performance in the 2022 NCAA Tournament:
https://github.com/jhowenstein/march-madness-ML/blob/main/March%20Madness%20-%20Phase%202%20-%20Post%20Tournament%20Review.ipynb

For the code I used to calculate the model features for each team and season:
https://github.com/jhowenstein/march-madness-ML/blob/main/march_madness.py

For the notebook I used to calculate and save a CSV file with all the feature data:
https://github.com/jhowenstein/march-madness-ML/blob/main/March%20Madness%20-%20Feature%20Calculation.ipynb

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

import os

import march_madness as mm
import random

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import PolynomialFeatures

In [2]:
analysis = mm.Analysis()

In [3]:
analysis.load_training_data('MDataFiles_Stage2/calculated features 1985-2021.csv')

In [4]:
analysis.load_validation_data('MDataFiles_Stage2/calculated features 2022.csv')

# Phase 1 Results

## Final Training Models

### Using All Calculated Features

In [5]:
feature_keys = ['tourney seed','weighted win pct','owp','oowp','avg win margin','std win margin','avg loss margin',
                'std loss margin','capped avg win margin','capped std win margin','capped avg loss margin',
                'capped std loss margin','close wins','close losses','weighted top64 wins','weighted top32 wins',
                'weighted top16 wins','weighted top8 wins','weighted top64 losses','weighted top32 losses',
                'weighted top16 losses','weighted top8 losses','last10 win pct','last10 weighted win pct',
                'last5 win pct','last5 weighted win pct','conference tourney wins','conference champ']

In [6]:
X, y = analysis.extract_training_data(feature_keys=feature_keys)

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=100)

In [8]:
X_train.shape

(1737, 56)

In [9]:
y_train.shape

(1737,)

In [10]:
X_test.shape

(580, 56)

In [11]:
y_test.shape

(580,)
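
The 56 columns are twice the 28 feature keys, which suggests that each game row holds both teams' features side by side. As a rough sketch of one way such a row could be assembled, assuming simple concatenation of the two teams' feature vectors: `build_matchup_row`, the team dictionaries, and their values are all hypothetical, and the actual construction is in `march_madness.py`.

```python
def build_matchup_row(team_a, team_b, feature_keys):
    """Concatenate two teams' features into a single row for one game.

    team_a, team_b: dicts mapping feature name -> value for one team-season.
    feature_keys:   ordered list of feature names to include.
    """
    row_a = [team_a[key] for key in feature_keys]
    row_b = [team_b[key] for key in feature_keys]
    return row_a + row_b

# Hypothetical three-feature example (values are made up for illustration).
demo_keys = ['tourney seed', 'weighted win pct', 'owp']
team_a = {'tourney seed': 1, 'weighted win pct': 0.91, 'owp': 0.58}
team_b = {'tourney seed': 16, 'weighted win pct': 0.62, 'owp': 0.47}

row = build_matchup_row(team_a, team_b, demo_keys)
# The row has 2 * len(demo_keys) entries, mirroring 28 keys -> 56 columns.
```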

### Logistic Regression

In [12]:
logreg = LogisticRegression().fit(X_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
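
The warning above means the solver stopped before converging. A minimal sketch of the two remedies it suggests, scaling the inputs and raising `max_iter`, is shown here on synthetic data. The data, the variable names, and the choice of `max_iter=1000` are assumptions for illustration; the models in this notebook were fit with the defaults, so the scores reported here do not reflect this change.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features, with one column on a much
# larger scale to mimic mixed-scale inputs (e.g. seeds vs. percentages).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
X_demo[:, 0] *= 1000

# Standardize each feature, then fit logistic regression with a higher
# iteration limit than the default.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_demo, y_demo)

# Number of iterations the solver actually used.
n_iter_used = scaled_logreg.named_steps['logisticregression'].n_iter_[0]
demo_accuracy = scaled_logreg.score(X_demo, y_demo)
```

Putting the scaler inside the pipeline means it is fit only on the data passed to `fit`, so the same transformation is reapplied automatically at prediction time.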


In [13]:
logreg.score(X_train,y_train)

0.7276914219919401

In [14]:
logreg.score(X_test,y_test)

0.696551724137931

In [15]:
logreg_pred = mm.bound_predictions(logreg.predict_proba(X_test)[:,1])

In [16]:
analysis.score_model_predictions(y_test,logreg_pred)

0.5416041357699954
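
`mm.bound_predictions` is defined in `march_madness.py` and is not reproduced here. As a rough sketch of the general idea, assuming it clips probabilities away from 0 and 1 so that a single confident miss cannot dominate the score, and assuming log-loss scoring: the name `clip_probabilities` and the bounds 0.025 and 0.975 are hypothetical.

```python
import math

def clip_probabilities(probs, lower=0.025, upper=0.975):
    """Clamp each predicted probability into the range [lower, upper]."""
    return [min(max(p, lower), upper) for p in probs]

raw = [0.0, 0.5, 1.0]
clipped = clip_probabilities(raw)

# Without clipping, predicting 0.0 for a game the team actually wins costs
# -log(0), which is infinite. After clipping, the worst case is finite.
worst_case_cost = -math.log(clipped[0])
```

The trade-off is that clipping also gives up a little score on predictions that would have been confidently correct, so the bounds are a tuning choice.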

### Random Forest Classifier (max_features controlled)

In [17]:
forest = RandomForestClassifier(n_estimators=1000,max_features=3,random_state=100).fit(X_train,y_train)

In [18]:
forest.score(X_train,y_train)

1.0

In [19]:
forest.score(X_test,y_test)

0.7155172413793104

In [20]:
forest_pred = mm.bound_predictions(forest.predict_proba(X_test)[:,1])

In [21]:
analysis.score_model_predictions(y_test,forest_pred)

0.5509841947134376

### Random Forest Classifier (max_depth controlled)

In [22]:
forest = RandomForestClassifier(n_estimators=1000,
                                max_features='sqrt',max_depth=5,random_state=100).fit(X_train,y_train)

In [23]:
forest.score(X_train,y_train)

0.7852619458837076

In [24]:
forest.score(X_test,y_test)

0.7189655172413794

In [25]:
forest_pred = mm.bound_predictions(forest.predict_proba(X_test)[:,1])

In [26]:
analysis.score_model_predictions(y_test,forest_pred)

0.5495379202820174

### Using Pruned Feature Set (Logistic Regression Only)

In [27]:
feature_keys = ['tourney seed','weighted win pct','owp','oowp','capped avg win margin','capped std win margin',
                'capped avg loss margin','capped std loss margin','close wins','close losses','weighted top64 wins',
                'weighted top16 wins','weighted top64 losses','weighted top16 losses','last10 weighted win pct',
                'last5 weighted win pct','conference tourney wins','conference champ']

In [28]:
X, y = analysis.extract_training_data(feature_keys=feature_keys)

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=100)

In [30]:
X_train.shape

(1737, 36)

In [31]:
y_train.shape

(1737,)

In [32]:
X_test.shape

(580, 36)

In [33]:
y_test.shape

(580,)

In [34]:
logreg = LogisticRegression().fit(X_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [35]:
logreg.score(X_train,y_train)

0.7184801381692574

In [36]:
logreg.score(X_test,y_test)

0.7155172413793104

In [37]:
logreg_pred = mm.bound_predictions(logreg.predict_proba(X_test)[:,1])

In [38]:
analysis.score_model_predictions(y_test,logreg_pred)

0.5385360241344801