# March Madness

<!-- PELICAN_BEGIN_SUMMARY -->

This year I participated in Kaggle's March Madness Competition. This isn't your typical office pool, however; instead, participants are tasked with assigning every possible match-up in the field of 64 a probability that a given team will win.

In this post, I will describe my simple approach to building a machine learning model to predict win probabilities for March Madness games. Then, we'll see how it performed in this year's competition!

<img src="http://jbechtel.github.io/images/predicted_bracket.png" alt='[img: bracket]'>
<!-- PELICAN_END_SUMMARY -->

## Introduction

The best part of the March Madness Kaggle competition is the data. Kaggle provides formatted regular season stats, play-by-play data, Tourney match-ups and results, and more, running back to 2003, with some data going back to 1985.

The goal of the competition is to predict tournament match-ups with the lowest log loss score. Unlike normal bracket challenges, Kaggle doesn't ask you to predict just the winner of each game. Instead they want the probability that a team will win. Furthermore, we have to predict win probabilities for every possible game, which for a 64-slot bracket with 4 play-in games (68 teams) comes to a grand total of 68 × 67 / 2 = 2,278 games.

A submission is then a 2,278 x 2 table in which each row represents one of the possible matchups: the first column holds a string that identifies the matchup, such as 2014_UNC_vs_Clemson, and the second column holds the probability that the first team (UNC in this case) wins. Success is measured by the log loss, which for a single game is defined as

`-log P(yt|yp) = -(yt log(yp) + (1 - yt) log(1 - yp))`

Here yt is the true outcome, equal to 1 when the first team wins and 0 otherwise, and yp is the predicted probability that the first team wins. In other words, we are measuring the probability that the model assigned to what actually happened. The score is the average of this quantity over the games that are actually played, so a perfect log loss score is 0, and a confident prediction that turns out wrong is penalized heavily.
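
To make the formula concrete, here is a minimal sketch of the log loss in plain Python. The function name and the small clipping constant `eps` are my own assumptions for illustration, not part of Kaggle's scoring code.

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    """Average binary log loss over a list of games.

    y_true: 1 if the first-listed team won, else 0.
    y_pred: predicted probability that the first-listed team wins.
    """
    total = 0.0
    for yt, yp in zip(y_true, y_pred):
        # Clip so a prediction of exactly 0 or 1 cannot produce log(0).
        # (The value of eps is an assumption made for this sketch.)
        yp = min(max(yp, eps), 1 - eps)
        total += -(yt * math.log(yp) + (1 - yt) * math.log(1 - yp))
    return total / len(y_true)

# One game that the first team actually won (y_true = 1):
confident_right = log_loss([1], [0.9])   # small penalty
coin_flip = log_loss([1], [0.5])         # equals ln(2)
confident_wrong = log_loss([1], [0.1])   # large penalty
print(confident_right, coin_flip, confident_wrong)
```

The three cases show why hedging matters: being confidently wrong costs far more than being confidently right gains.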


Given all of this data how should we proceed?

First, this is a classification problem, so the techniques that we explored in the last post are all at our disposal. 

In particular, fitting a logistic regression model corresponds to minimizing the log loss, which makes it a natural candidate. However, there are other possibilities such as neural nets, random forests, and support vector machines.

Before we settle on a classification algorithm, let's consider the feature space. That is, what data is the most important in determining a game's outcome?

An obvious place to start is team Seed, since we are more confident that a 1 seed will beat a 16 seed than that a 7 seed will beat a 10 seed. One of the Kaggle starter kernels employs this strategy, and it performs reasonably well. However, since we have so much data, I think it makes sense to feed our model with more descriptors than just Seed.

Accordingly, we will incorporate season-averaged statistics as well as an Elo rating into the feature vector.

## Feature Engineering

Team Seeds are one obvious descriptor for predicting outcomes. However, from past experience we know that March Madness games feature plenty of upsets! Is there a way that we might predict these upsets? Maybe by using more descriptive statistics of a team's season performance we can inform our model beyond the one-dimensional seed descriptor.

What stats should we build into our model? Since all of the teams have completed a full regular season, we have plenty of information on their past and recent performance. While one could design a metric that describes a team's 'momentum' going into the tournament (i.e. are they streaking? do their recent stats outperform early season stats?), we will focus on season-averaged stats as a simple approach to gauge a team's strength.

Since one of my main constraints is time, I want to find and incorporate informative yet straightforward features. To that end, I tried to make use of already coded-up examples, in this case the advanced statistics found in this notebook [LINK HERE]. 

This notebook calculates so-called 'Advanced' metrics, including percent possession etc., for every regular season game in our database. Using these definitions, I computed the season-averaged values for several of these stats, storing them as components in my feature vectors.
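
As a rough illustration of the season-averaging step, here is a minimal pandas sketch. The table layout and the column names (`off_rating`, `turnover_pct`) are hypothetical stand-ins; they are not the Kaggle schema or the stats computed in the linked notebook.

```python
import pandas as pd

# Toy per-game rows; the column names are hypothetical, not the Kaggle schema.
games = pd.DataFrame({
    "season":  [2018, 2018, 2018, 2018],
    "team_id": [1101, 1101, 1102, 1102],
    "off_rating": [110.0, 100.0, 95.0, 105.0],    # points per 100 possessions
    "turnover_pct": [0.15, 0.17, 0.20, 0.18],
})

# Average every stat over all of a team's games in a season.
season_avgs = games.groupby(["season", "team_id"]).mean()
print(season_avgs)
```

Each row of `season_avgs` is then one team-season, indexed by `(season, team_id)`, which makes it easy to look up a team's stats for a given tournament year.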

Additionally, I found another promising feature detailed in this [LINK HERE] notebook, which calculates every team's Elo score. The Elo score is a rating system, like the one used in chess, where all teams start at 1500 and then gain or lose rating points by defeating or losing to other teams. If a low-rated team defeats a high-rated opponent, its score will improve dramatically. On the other hand, if a high-rated team defeats a low-rated team, its score will only improve slightly.

Because each update depends on the strength of the opposition, the Elo score is a way to estimate a team's true skill rather than just its win-loss record.
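
The update rule can be sketched in a few lines. This is the standard Elo formula, with a K-factor of 20 and a scale of 400 chosen as assumptions for illustration; the linked notebook may use different constants or extra adjustments.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(winner, loser, k=20.0):
    """Return the new (winner, loser) ratings after one game.

    k is the K-factor; 20 is an assumed value for this sketch.
    """
    gain = k * (1.0 - expected_score(winner, loser))
    return winner + gain, loser - gain

# Upset: a 1400-rated team beats a 1600-rated team.
upset_winner, upset_loser = update_elo(1400.0, 1600.0)
# Expected result: the 1600-rated team beats the 1400-rated team.
fav_winner, fav_loser = update_elo(1600.0, 1400.0)

print(upset_winner - 1400.0)  # large gain for the underdog
print(fav_winner - 1600.0)    # small gain for the favorite
```

Note that the winner's gain equals the loser's loss, so the total number of rating points across the two teams is unchanged by a game.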

Now we have two components to our feature vector: (1) season-averaged advanced stats, and (2) Elo scores for every team. We will just concatenate these two components into a single feature vector for every team in the database.





## Set up Classification Problem

Now how do we set up our labels and features for this classification problem?

We are going to be training on all of the past tournaments for which we have data. This includes the 2003-2017 seasons. 

For each Tourney game, we will describe it by the year, name of the first team: t1, and name of the second team: t2. The label describing the outcome is y, which equals 1 when the first team wins, and 0 if the second team wins. 

The feature vector, x, for this game is the difference of the first team's feature vector and the second team's feature vector. 

Then we build our training data as a list of all of the 'difference feature vectors', and the labels are the list of results, y. 

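Just in case, we will also include every game with the first and second teams switched. Swapping the teams negates the difference feature vector and flips the label, so the model sees each matchup from both sides.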
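
The construction described above can be sketched as follows. The team IDs, the two-component feature vectors, and the game list are all made up for illustration.

```python
# Hypothetical per-team feature vectors keyed by (season, team_id):
# [season-averaged offensive rating, Elo rating].
team_features = {
    (2017, 1101): [108.0, 1650.0],
    (2017, 1102): [101.0, 1540.0],
    (2017, 1103): [104.0, 1600.0],
}

# Hypothetical tournament results as (season, winner_id, loser_id).
tourney_games = [(2017, 1101, 1102), (2017, 1103, 1101)]

def difference(season, t1, t2):
    """Feature vector of t1 minus feature vector of t2."""
    f1, f2 = team_features[(season, t1)], team_features[(season, t2)]
    return [a - b for a, b in zip(f1, f2)]

X, y = [], []
for season, winner, loser in tourney_games:
    # Winner listed first: label 1.
    X.append(difference(season, winner, loser))
    y.append(1)
    # Same game with the teams swapped: negated features, label 0.
    X.append(difference(season, loser, winner))
    y.append(0)

print(X)
print(y)
```

One practical reason to add the swapped rows: if the results table always lists the winner first, as in this toy example, the unswapped data would contain only label 1 and a classifier would have nothing to learn from.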



## Train Model

Now we can use any of our favorite classification algorithms to build our model. I tried several, and settled on logistic regression, which naturally outputs the probability that a given example will result in a specified outcome.

Of course, we want to avoid overfitting, so we use cross validation when training our model, and we also exhaustively search the space of hyperparameters to find their appropriate values. 
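
Here is a minimal scikit-learn sketch of cross-validated logistic regression with a grid search. The synthetic data, the feature scaling step, and the grid of `C` values are assumptions for illustration, not the exact setup I used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the difference feature vectors and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Scale the features, then fit an L2-regularized logistic regression.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Search over the inverse regularization strength C with 5-fold
# cross-validation, scoring each candidate by (negated) log loss.
search = GridSearchCV(
    pipeline,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_log_loss",
    cv=5,
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # mean cross-validated log loss

# Column 1 of predict_proba is the probability of class 1 (first team wins).
win_prob = search.predict_proba(X[:5])[:, 1]
```

Scoring with `neg_log_loss` means the hyperparameters are selected on the same metric the competition uses, rather than on accuracy.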

In the end, I achieved log losses of roughly 0.5 to 0.6 using logistic regression. As a side note, assigning a probability of 50% to every game outcome results in a log loss of ln 2 ≈ 0.69, so we are doing better than guessing at random :D

I thought this was decent for the amount of time I was willing to invest, so I trained my model with all available data from 2003-2017, then predicted the probabilistic outcomes for all possible 2018 matchups and uploaded the results to Kaggle. 
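
Enumerating every possible matchup is a one-liner with `itertools.combinations`. In this sketch the numeric team IDs, the `season_teamA_teamB` ID format, and the `predict_stub` placeholder are all hypothetical; a real submission would call the trained model instead.

```python
from itertools import combinations

def all_matchups(season, team_ids):
    """Every unordered pair of teams, lower ID listed first."""
    return [
        (f"{season}_{a}_{b}", a, b)
        for a, b in combinations(sorted(team_ids), 2)
    ]

def predict_stub(team_a, team_b):
    """Placeholder for a trained model's win probability for team_a."""
    return 0.5

# 68 hypothetical team IDs: 64 bracket slots plus 4 play-in teams.
team_ids = range(1101, 1169)
rows = [(game_id, predict_stub(a, b))
        for game_id, a, b in all_matchups(2018, team_ids)]

print(len(rows))
print(rows[0])
```

Listing the lower ID first gives every pairing exactly one row, so the probability in that row always refers to the same team.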



## Track my Results

Then comes the exciting part... seeing how my model stacks up against the competition. Since this was my first Kaggle competition, I got pretty obsessed with checking my current standing on the leaderboard. While waiting for Kaggle to update the leaderboard, however, I simply scored my own predictions against the completed games so that I could at least track my personal log-loss score with up-to-date results.
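
A running score like this can be computed by looking up each completed game in the submitted predictions. The sketch below assumes a hypothetical `season_lowID_highID` key format and made-up probabilities and results.

```python
import math

# Hypothetical submitted probabilities that the lower-ID team wins.
predictions = {
    "2018_1101_1102": 0.80,
    "2018_1101_1103": 0.35,
    "2018_1102_1103": 0.60,
}

# Hypothetical completed games as (winner_id, loser_id).
results = [(1101, 1102), (1103, 1101)]

def running_log_loss(predictions, results, season=2018):
    """Average log loss over only the games that have been played."""
    losses = []
    for winner, loser in results:
        low, high = sorted((winner, loser))
        p_low = predictions[f"{season}_{low}_{high}"]
        # Probability the model gave to the team that actually won.
        p_winner = p_low if winner == low else 1.0 - p_low
        losses.append(-math.log(p_winner))
    return sum(losses) / len(losses)

score = running_log_loss(predictions, results)
print(round(score, 4))
```

Matchups that never happen are simply never looked up, which mirrors how only played games count toward the score.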

Other apps that allow you to visualize your predictions can be found here [LINK] and here [LINK]. 

Currently I am at least in the top 50%, even after the tough Virginia loss.

At the end of the tournament

## Concluding Remarks

Participating in this competition was a fun experience, and I encourage any readers to enter next year and to write up their methodology as well.

Some things I might try next year are:....
