
# NCAA Historical Model Prediction Accuracy

This notebook examines a [dataset](https://github.com/fivethirtyeight/data/blob/master/historical-ncaa-forecasts/historical-538-ncaa-tournament-model-results.csv) from FiveThirtyEight that records the win probability their model assigned to the favorite in past NCAA tournament games, along with whether the favorite actually won. The goal is to check whether favorites won about as often as the model predicted.

## Exploratory Data Analysis

In [4]:
import pandas as pd

In [5]:
ncaa_data = pd.read_csv("https://github.com/fivethirtyeight/data/raw/refs/heads/master/historical-ncaa-forecasts/historical-538-ncaa-tournament-model-results.csv")

In [6]:
ncaa_data.sample(10)

,year,round,favorite,underdog,favorite_probability,favorite_win_flag
135,2013,4,Wichita State,La Salle,0.714,1
40,2013,2,San Diego State,Oklahoma,0.561,1
42,2011,2,Washington,Georgia,0.562,1
126,2011,4,North Carolina,Marquette,0.704,1
24,2012,3,New Mexico,Louisville,0.537,0
196,2012,4,North Carolina,Ohio,0.862,1
144,2014,2,Kentucky,Kansas State,0.739,1
20,2012,1,Lamar,Vermont,0.527,0
137,2012,3,Syracuse,Kansas State,0.716,1
43,2011,3,Wisconsin,Kansas State,0.563,1


In [7]:
ncaa_data.shape

(253, 6)

In [8]:
ncaa_data["round"].value_counts()

round,count
2,128
3,64
4,24
1,16
5,12
6,6
7,3


In [None]:
ncaa_data["year"].value_counts()

year,count
2013,67
2011,67
2012,67
2014,52


In [9]:
ncaa_data["favorite_probability"].describe()

,favorite_probability
count,253.0
mean,0.721383
std,0.143935
min,0.501
25%,0.6
50%,0.704
75%,0.846
max,0.997


## Overall Accuracy

In [10]:
bins = ["50.0 - 59.9%", "60.0 - 69.9%", "70.0 - 79.9%", "80.0 - 89.9%", "90.0 - 99.9%"]
accuracy_data = pd.DataFrame({"bins": bins})

In [11]:
accuracy_data

,bins
0,50.0 - 59.9%
1,60.0 - 69.9%
2,70.0 - 79.9%
3,80.0 - 89.9%
4,90.0 - 99.9%


In [13]:
# Use <= 0.599 so the win count covers the same games as total_count_5 below
actual_win_count_5 = ncaa_data[ncaa_data["favorite_probability"] <= 0.599]["favorite_win_flag"].sum()
actual_win_count_5

37

In [14]:
total_count_5 = len(ncaa_data[ncaa_data["favorite_probability"] <= 0.599])
total_count_5

63

In [15]:
actual_win_percents = []
actual_win_percents.append(actual_win_count_5 / total_count_5)
actual_win_percents

[0.5873015873015873]

In [16]:
actual_win_count_6 = ncaa_data[(ncaa_data["favorite_probability"] <= 0.699) & (ncaa_data["favorite_probability"] >=0.600)]["favorite_win_flag"].sum()
actual_win_count_6

35

In [17]:
total_count_6 = len(ncaa_data[(ncaa_data["favorite_probability"] <= 0.699) & (ncaa_data["favorite_probability"] >=0.600)])
total_count_6

60

In [18]:
actual_win_percents.append(actual_win_count_6 / total_count_6)
actual_win_percents

[0.5873015873015873, 0.5833333333333334]
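
The win count and the game count for a bin should come from the same filter. A minimal sketch of one way to do that with a single boolean mask is below. The helper name `bin_counts` and the frame `sample_games` are illustrative assumptions, not part of the notebook; the ten rows are copied from the `ncaa_data.sample(10)` output above, so the numbers differ from the full-data results.

```python
import pandas as pd

# Ten rows copied from the ncaa_data.sample(10) output shown earlier.
sample_games = pd.DataFrame({
    "favorite_probability": [0.714, 0.561, 0.562, 0.704, 0.537,
                             0.862, 0.739, 0.527, 0.716, 0.563],
    "favorite_win_flag": [1, 1, 1, 1, 0, 1, 1, 0, 1, 1],
})


def bin_counts(games, low, high):
    """Return (wins, total) for games with low <= favorite_probability <= high."""
    # Series.between is inclusive on both ends by default.
    in_bin = games["favorite_probability"].between(low, high)
    wins = int(games.loc[in_bin, "favorite_win_flag"].sum())
    total = int(in_bin.sum())
    return wins, total


wins, total = bin_counts(sample_games, 0.500, 0.599)
print(wins, total, wins / total)
```

Passing `ncaa_data` instead of `sample_games` would apply the same filter to the full dataset.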

In [28]:
actual_win_count_7 = ncaa_data[(ncaa_data["favorite_probability"] <= 0.799) & (ncaa_data["favorite_probability"] >=0.700)]["favorite_win_flag"].sum()
total_count_7 = len(ncaa_data[(ncaa_data["favorite_probability"] <= 0.799) & (ncaa_data["favorite_probability"] >=0.700)])
actual_win_percents.append(actual_win_count_7 / total_count_7)
print(f"The total number of wins for games with a projected win probability between 70 - 79.9% is {actual_win_count_7}, the total number of games in that probability bin is {total_count_7}, and the actual win percentage is {actual_win_count_7/total_count_7}")
actual_win_percents

The total number of wins for games with a projected win probability between 70 - 79.9% is 35, the total number of games in that probability bin is 52, and the actual win percentage is 0.6730769230769231


[0.5873015873015873, 0.5833333333333334, 0.6730769230769231]

In [33]:
actual_win_count_8 = ncaa_data[(ncaa_data["favorite_probability"] <= 0.899) & (ncaa_data["favorite_probability"] >=0.800)]["favorite_win_flag"].sum()
total_count_8 = len(ncaa_data[(ncaa_data["favorite_probability"] <= 0.899) & (ncaa_data["favorite_probability"] >=0.800)])
actual_win_percents.append(actual_win_count_8 / total_count_8)
print(f"The total number of wins for games with a projected win probability between 80 - 89.9% is {actual_win_count_8}, the total number of games in that probability bin is {total_count_8}, and the actual win percentage is {actual_win_count_8/total_count_8}")
actual_win_percents

The total number of wins for games with a projected win probability between 80 - 89.9% is 31, the total number of games in that probability bin is 38, and the actual win percentage is 0.8157894736842105


[0.5873015873015873,
 0.5833333333333334,
 0.6730769230769231,
 0.8157894736842105]

In [35]:
actual_win_percents

[0.5873015873015873,
 0.5833333333333334,
 0.6730769230769231,
 0.8157894736842105]

In [36]:
actual_win_count_9 = ncaa_data[(ncaa_data["favorite_probability"] <= 0.999) & (ncaa_data["favorite_probability"] >=0.900)]["favorite_win_flag"].sum()
total_count_9 = len(ncaa_data[(ncaa_data["favorite_probability"] <= 0.999) & (ncaa_data["favorite_probability"] >=0.900)])
actual_win_percents.append(actual_win_count_9 / total_count_9)
print(f"The total number of wins for games with a projected win probability between 90 - 99.9% is {actual_win_count_9}, the total number of games in that probability bin is {total_count_9}, and the actual win percentage is {actual_win_count_9/total_count_9}")
actual_win_percents

The total number of wins for games with a projected win probability between 90 - 99.9% is 38, the total number of games in that probability bin is 40, and the actual win percentage is 0.95


[0.5873015873015873,
 0.5833333333333334,
 0.6730769230769231,
 0.8157894736842105,
 0.95]

In [37]:
accuracy_data["actual_win_percents"] = actual_win_percents

In [38]:
accuracy_data

,bins,actual_win_percents
0,50.0 - 59.9%,0.587302
1,60.0 - 69.9%,0.583333
2,70.0 - 79.9%,0.673077
3,80.0 - 89.9%,0.815789
4,90.0 - 99.9%,0.95
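
Each bin holds only 38 to 63 games, so some gap between predicted and actual win rates is expected from chance alone. As a rough sketch, not something the notebook itself does, a 95% Wilson score interval gives a plausible range for each observed rate. The counts are copied from the cell outputs above; the function name `wilson_interval` and the choice of `z = 1.96` are assumptions made for illustration.

```python
import math

# Win and game counts per bin, copied from the outputs printed above.
bin_labels = ["50.0 - 59.9%", "60.0 - 69.9%", "70.0 - 79.9%",
              "80.0 - 89.9%", "90.0 - 99.9%"]
won_games = [37, 35, 35, 31, 38]
total_games = [63, 60, 52, 38, 40]


def wilson_interval(wins, games, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p_hat = wins / games
    denom = 1 + z**2 / games
    center = (p_hat + z**2 / (2 * games)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / games + z**2 / (4 * games**2)
    ) / denom
    return center - half_width, center + half_width


intervals = [wilson_interval(w, n) for w, n in zip(won_games, total_games)]
for label, (low, high) in zip(bin_labels, intervals):
    print(f"{label}: {low:.3f} to {high:.3f}")
```

If an interval overlaps the bin's predicted range, the observed rate is at least not clearly at odds with the forecast for that bin.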


In [39]:
accuracy_data["total_games"] = [total_count_5, total_count_6, total_count_7, total_count_8, total_count_9]
accuracy_data["won_games"] = [actual_win_count_5, actual_win_count_6, actual_win_count_7, actual_win_count_8, actual_win_count_9]
accuracy_data

,bins,actual_win_percents,total_games,won_games
0,50.0 - 59.9%,0.587302,63,37
1,60.0 - 69.9%,0.583333,60,35
2,70.0 - 79.9%,0.673077,52,35
3,80.0 - 89.9%,0.815789,38,31
4,90.0 - 99.9%,0.95,40,38
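
The same kind of summary table can likely be built in one pass with `pd.cut` and `groupby` instead of one cell per bin. The sketch below is an alternative, not the method used above. It runs on ten rows copied from the earlier `ncaa_data.sample(10)` output, so its counts will not match the full table; the names `sample_games`, `bin_edges`, and `binned_summary` are illustrative.

```python
import pandas as pd

# Ten rows copied from the ncaa_data.sample(10) output shown earlier.
sample_games = pd.DataFrame({
    "favorite_probability": [0.714, 0.561, 0.562, 0.704, 0.537,
                             0.862, 0.739, 0.527, 0.716, 0.563],
    "favorite_win_flag": [1, 1, 1, 1, 0, 1, 1, 0, 1, 1],
})

bin_edges = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
bin_labels = ["50.0 - 59.9%", "60.0 - 69.9%", "70.0 - 79.9%",
              "80.0 - 89.9%", "90.0 - 99.9%"]

# right=False makes each bin include its lower edge and exclude its upper
# edge, e.g. [0.6, 0.7). A probability of exactly 1.0 would fall outside.
sample_games["bin"] = pd.cut(
    sample_games["favorite_probability"],
    bins=bin_edges,
    labels=bin_labels,
    right=False,
)

# observed=False keeps bins that contain no games in the result.
binned_summary = sample_games.groupby("bin", observed=False).agg(
    total_games=("favorite_win_flag", "size"),
    won_games=("favorite_win_flag", "sum"),
)
# Empty bins produce NaN here because 0 / 0 is undefined.
binned_summary["actual_win_percents"] = (
    binned_summary["won_games"] / binned_summary["total_games"]
)
print(binned_summary)
```

Replacing `sample_games` with `ncaa_data` would run the same grouping over all 253 games.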
