<a href="https://colab.research.google.com/github/jc39963/cloud_hosted_analysis/blob/main/Mini_9.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NCAA Historical Model Prediction Accuracy

This notebook examines a [dataset](https://github.com/fivethirtyeight/data/blob/master/historical-ncaa-forecasts/historical-538-ncaa-tournament-model-results.csv) from FiveThirtyEight that records the win probability its model assigned to the favorite in past NCAA tournament games, along with whether the favorite actually won. The goal is to see whether the actual win percentages match the predicted win probabilities.

## Exploratory Data Analysis

In [1]:
import pandas as pd

In [2]:
ncaa_data = pd.read_csv("https://github.com/fivethirtyeight/data/raw/refs/heads/master/historical-ncaa-forecasts/historical-538-ncaa-tournament-model-results.csv")

In [3]:
ncaa_data.sample(10)

,year,round,favorite,underdog,favorite_probability,favorite_win_flag
238,2014,2,Virginia,Coastal Carolina,0.964,1
77,2012,1,Western Kentucky,Mississippi Valley State,0.622,1
106,2011,2,Arizona,Memphis,0.668,1
251,2011,2,Duke,Hampton,0.995,1
124,2014,2,Baylor,Nebraska,0.703,1
36,2014,3,Iowa State,North Carolina,0.557,1
89,2014,2,Oklahoma,North Dakota State,0.638,0
190,2012,3,Ohio State,Gonzaga,0.848,1
61,2013,4,Miami (FL),Marquette,0.598,0
85,2012,3,Georgetown,North Carolina State,0.636,0


In [4]:
ncaa_data.shape

(253, 6)

In [5]:
ncaa_data["round"].value_counts()

round,count
2,128
3,64
4,24
1,16
5,12
6,6
7,3


In [6]:
ncaa_data["year"].value_counts()

year,count
2013,67
2011,67
2012,67
2014,52


In [44]:
ncaa_data["favorite_probability"].describe()



,favorite_probability
count,253.0
mean,0.721383
std,0.143935
min,0.501
25%,0.6
50%,0.704
75%,0.846
max,0.997


## Overall Accuracy

I will compare the favorites' actual win rates against their projected win probabilities, grouping games by projected probability. For example, did favorites given a 0.500 to 0.599 chance of winning actually win roughly 50 to 59% of the time? The resulting dataframe lists each predicted-probability bin alongside its actual win percentage to show how accurate the predictions were.

In [9]:
accuracy_data

Unnamed: 0,bins
0,50.0 - 59.9%
1,60.0 - 69.9%
2,70.0 - 79.9%
3,80.0 - 89.9%
4,90.0 - 99.9%


In [35]:
bins = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
labels = ["50.0 - 59.9%", "60.0 - 69.9%", "70.0 - 79.9%", "80.0 - 89.9%", "90.0 - 99.9%"]

In [33]:
# Bin the favorite_probability column
ncaa_data["probability_bin"] = pd.cut(ncaa_data["favorite_probability"], bins=bins, labels=labels, right=False)

ncaa_data

,year,round,favorite,underdog,favorite_probability,favorite_win_flag,probability_bin
0,2014,2,Texas,Arizona State,0.501,1,50.0 - 59.9%
1,2013,2,Illinois,Colorado,0.504,1,50.0 - 59.9%
2,2013,1,James Madison,Long Island,0.506,1,50.0 - 59.9%
3,2011,2,Cincinnati,Missouri,0.509,1,50.0 - 59.9%
4,2012,3,Cincinnati,Florida State,0.509,1,50.0 - 59.9%
...,...,...,...,...,...,...,...
248,2011,2,Kansas,Boston University,0.990,1,90.0 - 99.9%
249,2012,2,Kentucky,Western Kentucky,0.991,1,90.0 - 99.9%
250,2013,2,Louisville,North Carolina A&T,0.995,1,90.0 - 99.9%
251,2011,2,Duke,Hampton,0.995,1,90.0 - 99.9%
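
Because `right=False` makes each bin closed on the left and open on the right, it is worth checking how values that sit exactly on a bin edge are assigned. The sketch below reuses the same `bins` and `labels`; the `sample` probabilities are made up for illustration and do not come from the dataset.

```python
import pandas as pd

bins = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
labels = ["50.0 - 59.9%", "60.0 - 69.9%", "70.0 - 79.9%", "80.0 - 89.9%", "90.0 - 99.9%"]

# Hypothetical probabilities chosen to sit on or near the bin edges
sample = pd.Series([0.5, 0.599, 0.6, 0.899, 0.9, 1.0])

# right=False gives intervals [0.5, 0.6), [0.6, 0.7), ..., [0.9, 1.0)
binned = pd.cut(sample, bins=bins, labels=labels, right=False)

edge_demo = pd.DataFrame({"probability": sample, "probability_bin": binned})
print(edge_demo)
```

A value of exactly 0.6 lands in the "60.0 - 69.9%" bin, which matches the labels. A value of exactly 1.0 would fall outside the last interval and come back as `NaN`; that does not affect this dataset, whose maximum `favorite_probability` is 0.997.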


In [34]:
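# Group by the bins and calculate total games and win counts.
# observed=False keeps every bin in the result and avoids the pandas FutureWarning
# raised when grouping on a categorical column without it.
accuracy_data = ncaa_data.groupby("probability_bin", observed=False).agg(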
    total_games=("favorite_win_flag", "size"),
    won_games=("favorite_win_flag", "sum")
).reset_index()

# Calculate the actual win rate for each bin (stored as a fraction between 0 and 1)
accuracy_data["actual_win_percents"] = accuracy_data["won_games"] / accuracy_data["total_games"]

# Display the resulting DataFrame
accuracy_data


,probability_bin,total_games,won_games,actual_win_percents
0,50.0 - 59.9%,63,38,0.603175
1,60.0 - 69.9%,60,35,0.583333
2,70.0 - 79.9%,52,35,0.673077
3,80.0 - 89.9%,38,31,0.815789
4,90.0 - 99.9%,40,38,0.95
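
One way to read these numbers is to ask how much sampling noise each bin carries. The sketch below is not part of the original analysis: it copies the counts from the table above into a new, illustrative dataframe (`summary`, with assumed helper columns `bin_low` and `bin_high`), computes a binomial standard error for each actual win rate, and flags whether that rate falls inside its predicted bin.

```python
import pandas as pd

# Counts copied from the accuracy_data output above; bin_low/bin_high are the bin edges
summary = pd.DataFrame({
    "probability_bin": ["50.0 - 59.9%", "60.0 - 69.9%", "70.0 - 79.9%",
                        "80.0 - 89.9%", "90.0 - 99.9%"],
    "bin_low": [0.5, 0.6, 0.7, 0.8, 0.9],
    "bin_high": [0.6, 0.7, 0.8, 0.9, 1.0],
    "total_games": [63, 60, 52, 38, 40],
    "won_games": [38, 35, 35, 31, 38],
})

# Observed win rate of the favorite in each bin
summary["actual_win_rate"] = summary["won_games"] / summary["total_games"]

# Binomial standard error of the observed win rate: sqrt(p * (1 - p) / n)
p = summary["actual_win_rate"]
summary["std_error"] = (p * (1 - p) / summary["total_games"]) ** 0.5

# Does the observed rate land inside the predicted bin [bin_low, bin_high)?
summary["inside_bin"] = (p >= summary["bin_low"]) & (p < summary["bin_high"])

print(summary[["probability_bin", "actual_win_rate", "std_error", "inside_bin"]].round(3))
```

With only 38 to 63 games per bin, the standard errors come out at roughly 3 to 7 percentage points, so an actual win rate that lands slightly outside its bin is not by itself strong evidence that the predictions were off.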
