# CS109A Final Project: Data Driven March Madness 

**Harvard University**<br>
**Fall 2016**<br>
**Authors: Kurt Bullard and Kendrick Vinar**<br>

# Project Walkthrough

Below is a detailed walkthrough of how we built our models. Use the table of contents to jump to a topic of interest. It is divided into two parts, roughly corresponding to the two files in which we carried out the project. The data manipulation part covers acquiring and preparing the data. The data analysis part covers the models and results.
_____________________

### Data Manipulation
- [Overview](#overview)
- [Data Sources](#data_sources)
- [Data Cleaning](#data_cleaning)
- [Variable Creation/Selection](#variable_creation)
- [2016 Data Extraction](#2016_data)
- [Preparing the Data for Analysis](#preparation)

### Data Analysis
- [Overview](#overview_2)
- [Goals](#goals)
- [Functional Justification](#functional)
- [Binomial Sampling and the Case for Variance](#binomial)
- [Results](#results)
- [Areas for Further Exploration](#exploration)


Return to the [homepage](home_page.html).

# Data Manipulation
___________

<a id = 'overview'></a>

## Overview

The purpose of this portion of the project is to organize the data so that it is ready for the various machine learning algorithms we test. It covers importing the data, cleaning it, creating variables, and reducing the set of variables.

<a id = 'data_sources'></a>

## Data Sources

We retrieved data from two sources: the [Kaggle March Machine Learning Mania 2016](https://www.kaggle.com/c/march-machine-learning-mania-2016) competition and the [Pomeroy College Basketball Rankings](http://www.kenpom.com). The Kaggle dataset provided detailed information on each team’s regular season performance as well as results from each season’s tournament. The Pomeroy rankings (or KenPom, as they are colloquially known) give additional advanced metrics for each team, including but not limited to offensive efficiency, defensive efficiency, and adjusted tempo.

We imported nine CSV files from these sources, including seeding information and box scores of tournament and regular season games from 2003-2016.

<a id = 'data_cleaning'></a>

## Data Cleaning

While sorting through the data, we realized we needed an easy way to distinguish between, say, the Duke team that played in 2005 and the Duke team that played in 2010. To do this, we created a unique ID for each team by combining the team name with the season in which it played, so that the ID identifies both the team and the year.

We then chose the KenPom variables we wanted to consider for our model and merged them into the detailed tournament dataset. These included rankings such as offensive and defensive efficiency, which are advanced metrics not included in the Kaggle data.
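As a minimal sketch of the team-season ID idea, the snippet below joins a team name and a season into one key. The function name `make_team_id`, the underscore separator, and the sample rows are assumptions made for illustration, not the exact format used in our notebooks.

```python
def make_team_id(team_name: str, season: int) -> str:
    """Combine a team name and a season into one key (hypothetical format)."""
    return f"{team_name.strip()}_{season}"


# Hypothetical sample rows: the same school in two different seasons.
rows = [
    {"team": "Duke", "season": 2005},
    {"team": "Duke", "season": 2010},
]

for row in rows:
    row["team_id"] = make_team_id(row["team"], row["season"])

team_ids = [row["team_id"] for row in rows]
print(team_ids)
```

Because the season is part of the key, the two Duke teams no longer collide when datasets are merged on `team_id`.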

<a id = 'variable_creation'></a>

## Variable Creation and Selection

We created an integer seed variable, ranging from 1 to 16, so that seeds are comparable across regions. The original seed column combines the seed number with a code for the bracket region, so we extracted just the integer from each entry.
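The extraction can be sketched with a regular expression. This assumes seed strings shaped like `"W01"` or `"X16a"` (a region letter, a two-digit seed, and an optional suffix); both that format and the helper name `seed_to_int` are assumptions for illustration.

```python
import re


def seed_to_int(seed: str) -> int:
    """Pull the numeric seed (1-16) out of a region-prefixed seed string."""
    match = re.search(r"\d+", seed)
    if match is None:
        raise ValueError(f"no seed number found in {seed!r}")
    return int(match.group())


# Hypothetical raw seed values from different regions.
raw_seeds = ["W01", "X16a", "Y08", "Z11b"]
numeric_seeds = [seed_to_int(s) for s in raw_seeds]
print(numeric_seeds)
```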

We created new variables from the Kaggle data that we felt would be better predictors than the raw statistics. For example, we calculated assist ratio, which is the number of assists in a game divided by the number of made field goals in that game. This predictor serves as a proxy for how team-oriented a team is, or conversely how reliant it is on star players. We created other variables in a similar fashion, including offensive and defensive rebound ratio, point differential, and the percentages of shots in a game that were two-point and three-point field goal attempts.
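A rough sketch of how such per-game predictors might be computed from a single box score follows. The dictionary keys (`fgm`, `fga`, `fga3`, `ast`, `score`, `opp_score`) and the sample numbers are hypothetical, and rebound ratios are omitted because their exact definition is not given above.

```python
def safe_ratio(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def game_features(box: dict) -> dict:
    """Derive ratio-style predictors from one team's box score for one game."""
    return {
        # Assists per made field goal: a proxy for team-oriented play.
        "assist_ratio": safe_ratio(box["ast"], box["fgm"]),
        # Share of all field goal attempts taken from three-point range.
        "three_point_share": safe_ratio(box["fga3"], box["fga"]),
        # Share of all field goal attempts taken from two-point range.
        "two_point_share": safe_ratio(box["fga"] - box["fga3"], box["fga"]),
        # Margin of victory (negative for a loss).
        "point_diff": box["score"] - box["opp_score"],
    }


# Hypothetical box score for a single game.
example = {"fgm": 25, "fga": 60, "fga3": 20, "ast": 15, "score": 70, "opp_score": 62}
features = game_features(example)
print(features)
```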

We also created two variables to quantify performance against other teams in the NCAA tournament. One variable represents wins against other tournament teams and the other represents losses against them. For a team that beats a tournament team ranked $n$, we gave them $1/n$ points. This system weights wins against highly ranked tournament teams much more heavily than wins against lowly ranked ones. This technique for predictor creation was shown to be effective by John Ezekowitz in his paper, [insert link]. 
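The $1/n$ weighting for wins can be illustrated with a few lines of Python. The helper name `quality_points` and the sample ranks are hypothetical; the sketch only shows how the weights add up, not how the ranks were obtained.

```python
def quality_points(opponent_ranks) -> float:
    """Sum 1/n over the ranks n of the tournament teams that were beaten."""
    return sum(1.0 / n for n in opponent_ranks)


# Hypothetical team with wins over tournament teams ranked 1, 4, and 16.
win_points = quality_points([1, 4, 16])
print(win_points)  # 1 + 0.25 + 0.0625 = 1.3125
```

In this example the single win over the top-ranked team contributes sixteen times as much as the win over the team ranked 16.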

[kurt can you say more here about what happens next it’s getting pretty fuzzy for me here]

Once we aggregated all the predictors together, we created a correlation matrix to determine which predictors were correlated with each other. We checked for multicollinearity with variance inflation factor (VIF) and eliminated [kurt can you say more here?]

[kurt can you also say something about the interaction terms]

After selecting the desired variables, we created a new data frame to hold our selected predictors. Much of this data manipulation was time-intensive, so we separated it into a distinct notebook that needs to be rerun only when changes are made. At the end of that notebook, we export a new CSV file that serves as the starting point for our data analysis in another notebook.



<a id = '2016_data'></a>

## 2016 Data Extraction

[kurt, can you say something about the 2016 data and how that was different?] 


<a id = 'preparation'></a>

## Preparing the Data for Analysis



# Data Analysis
_________________

<a id = 'overview_2'></a>

## Overview



<a id = 'goals'></a>

## Goals

The goal of our model was to earn the highest possible score in the tournament bracket. 

<a id = 'functional'></a>

## Functional Justification

We attempted to write the majority of our code in functional form. This strategy is preferred for several reasons. 

First, many of the operations we perform need to be repeated numerous times. For example, we frequently need to reindex a dataframe or clean a dataframe of faulty seedings. Wrapping these operations in functions saved us from writing the same lines of code over and over.

Second, many of the operations are applied across different machine learning algorithms, different years of training data, and more. A functional approach maximized flexibility and ease of use. 

Last, a functional approach made our code easier to interpret. Because each function is packaged around a specific task, we were better able to understand what was happening in our code. As this was a partner project, this made working together much easier. 


<a id = 'standard_model'></a>

## Standard Model



<a id = 'risk'></a>

## Risk

In our search for ways to add variance to our model, we created a risk parameter that defines the decision boundary for classifying a win. Our hope was that the model would predict more upsets, leading to increased variance in our predictions. Ideally, our average score would decrease slightly but our variance would increase enough that we would, on occasion, achieve high bracket scores. 

Normally, our decision boundary, or risk parameter, was 0.5. This meant that if the model's predicted favorite had a better than 50% chance of winning the game, we would advance them to the next round. We tested how bracket results varied across different decision boundaries, using 0.4, 0.5, 0.6, and 0.7 as our risk parameters. 
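A minimal sketch of the thresholding rule is shown below. It assumes the model outputs a win probability for a designated favorite; the function name `pick_winner` and the sample probability are hypothetical.

```python
def pick_winner(p_favorite: float, risk: float = 0.5) -> str:
    """Advance the favorite only if its win probability exceeds the risk boundary."""
    return "favorite" if p_favorite > risk else "underdog"


# Hypothetical game in which the model gives the favorite a 55% chance.
p = 0.55
picks = {risk: pick_winner(p, risk) for risk in (0.4, 0.5, 0.6, 0.7)}
print(picks)
```

Under this rule the same 55% favorite advances at boundaries of 0.4 and 0.5 but is picked to lose at 0.6 and 0.7, which is how a higher boundary produces more predicted upsets.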

[need to say more here]

<a id = 'binomial'></a>

## Binomial Sampling and the Case for Variance

After creating our risk parameter, we pivoted to a different method of adding randomness to our bracket predictions. The problem with the 'risk' parameter was that it did not actually add any randomness. It only biased the model towards a larger number of upsets. 

Our solution was to fit a binomial distribution over the likelihood of the favorite advancing and to vary the number of draws from that distribution. As the number of draws increases, the result more closely approximates our original model, which always advanced the team the model rated as better. For example, suppose our model says team 1 has a 60% chance of defeating team 2. In effect, the binomial distribution lets us flip a weighted coin to decide who advances. The weighted coin comes up 'advance' for the favorite 60% of the time and 'advance' for the underdog 40% of the time. If we set the number of coin flips to one, the favorite advances 60% of the time and the underdog advances 40% of the time. 
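The weighted coin can be sketched with Python's standard `random` module. This is an illustrative simulation rather than our notebook code: the function names, the majority rule for multiple draws (best used with an odd number of draws, since ties go to the underdog here), and the trial count and seed are all assumptions.

```python
import random


def favorite_advances(p_favorite: float, n_draws: int = 1, rng=random) -> bool:
    """Flip a weighted coin n_draws times; the favorite advances on a majority."""
    wins = sum(rng.random() < p_favorite for _ in range(n_draws))
    return wins > n_draws / 2


def estimate_advance_rate(p_favorite: float, n_draws: int,
                          trials: int = 20000, seed: int = 0) -> float:
    """Estimate how often the favorite advances by repeated simulation."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    advanced = sum(favorite_advances(p_favorite, n_draws, rng) for _ in range(trials))
    return advanced / trials


for n in (1, 3, 5):
    print(n, estimate_advance_rate(0.6, n))
```

With a single draw the estimated rate should sit near the model probability of 0.6, and it should climb as the number of draws grows.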

While we could have always used a single draw to determine who advances, we felt that a single draw might cost us tournament points in the long run. Why? Because we assume that the expected value of the favorite going forward in the tournament is greater than the expected value of the underdog going forward. For example, maybe an 8-seed gets hot for a game and knocks off a 1-seed. The 8-seed may well be better than the seed it received, but it seems likely that there was also an element of luck in its win. Therefore, we don't expect it to keep winning many games deeper into the tournament. In contrast, 1-seeds on average make much deeper tournament runs, accumulating more points. 

By taking more coin flips, say a best-of-3 or best-of-5 series, we bias the result towards the team we think has a better chance of winning. This makes sense intuitively. If you know you're going to win a coin flip for five dollars 90% of the time, you want to play that game as often as you can because you know you'll come out ahead. Even if you lose the first game, you'll keep playing. The binomial distribution is similar. Increasing the number of coin flips is akin to playing an imaginary best-of-3 or best-of-5 series for the right to advance to the next round. 
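The size of this bias can be worked out exactly from the binomial distribution. The sketch below assumes an odd number of independent flips with the same win probability on each; the function name `prob_favorite_wins_series` is hypothetical.

```python
from math import comb


def prob_favorite_wins_series(p: float, n_flips: int) -> float:
    """Probability that the favorite wins a majority of n_flips weighted coin flips."""
    needed = n_flips // 2 + 1  # wins required for a majority
    return sum(
        comb(n_flips, k) * p**k * (1 - p) ** (n_flips - k)
        for k in range(needed, n_flips + 1)
    )


for n in (1, 3, 5):
    print(n, round(prob_favorite_wins_series(0.6, n), 3))
```

For a 60% favorite, the chance of advancing rises from 0.600 with one flip to about 0.648 in a best-of-3 and about 0.683 in a best-of-5, so each extra pair of flips pulls the bracket closer to always advancing the favorite.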

<a id = 'results'></a>

## Results

<a id = 'exploration'></a>

## Areas for Further Exploration