March Madness Analysis

College basketball fans can better understand the NCAA Tournament by reviewing and analyzing data from previous years. In this report, results from every March Madness game between 1985 and 2016 are analyzed to identify the following trends:

Expected performance according to seed
Matchups that often result in exciting games
Outliers who have exceeded their expectation

This information can be used to identify games that are likely to be interesting, predict winners, or understand a team's tournament history.

Import Dependencies and Prep Data

	game_id	date	round	region	seed	team	score	opponent_seed	opponent	opponent_score	score_diff	win	seed_id	year
0	0	1985-03-14	Round of 64	East	1	Georgetown	68	16	Lehigh	43	25	1	1_16_fav	1985
1	0	1985-03-14	Round of 64	East	16	Lehigh	43	1	Georgetown	68	-25	0	1_16_dog	1985
2	1	1985-03-14	Round of 64	East	4	Loyola, Illinois	59	13	Iona	58	1	1	4_13_fav	1985
3	1	1985-03-14	Round of 64	East	13	Iona	58	4	Loyola, Illinois	59	-1	0	4_13_dog	1985
4	2	1985-03-14	Round of 64	East	5	Southern Methodist	85	12	Old Dominion	68	17	1	5_12_fav	1985
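
The table above holds two rows per game, one from each team's perspective. The notebook's own preparation code is not reproduced here, so the following is only a minimal sketch of how one game could be expanded into that shape. The function name `perspective_rows` and the "even" label for equal seeds are assumptions made for illustration.

```python
def perspective_rows(game_id, date, rnd, region, team_a, team_b):
    """Return one row per team for a single game.

    team_a and team_b are (seed, name, score) tuples.
    """
    rows = []
    for (seed, team, score), (opp_seed, opp, opp_score) in (
        (team_a, team_b),
        (team_b, team_a),
    ):
        low, high = sorted((seed, opp_seed))
        if seed == opp_seed:
            role = "even"  # hypothetical label; equal-seed ids are not shown in the table
        elif seed < opp_seed:
            role = "fav"
        else:
            role = "dog"
        rows.append({
            "game_id": game_id, "date": date, "round": rnd, "region": region,
            "seed": seed, "team": team, "score": score,
            "opponent_seed": opp_seed, "opponent": opp,
            "opponent_score": opp_score,
            "score_diff": score - opp_score,
            "win": int(score > opp_score),
            "seed_id": f"{low}_{high}_{role}",
            "year": int(date[:4]),
        })
    return rows


# First game from the table above.
rows = perspective_rows(0, "1985-03-14", "Round of 64", "East",
                        (1, "Georgetown", 68), (16, "Lehigh", 43))
for row in rows:
    print(row["team"], row["score_diff"], row["win"], row["seed_id"])
```

Keeping both perspectives makes per-seed and per-team aggregation a simple group-by, at the cost of storing every game twice.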

How does each seed typically perform?

Key Takeaways:

As expected, high seeds generally outperform low seeds.
There are exceptions, as 9 seeds win fewer games on average than 10 and 11 seeds, and 5 seeds win fewer games on average than 6 seeds.
In a statistical comparison, many seeds do not perform significantly differently from their neighboring seeds. Interestingly, seeds 8-12 show no significant difference in average point margin per game.

Gather Summary Data on Each Seed

Seed	Average Point Spread	Average Wins	Wins by Round
1	11.392193	3.351562	Elite Eight: 52, National Ch...
2	7.109302	2.398438	Elite Eight: 28, National Ch...
3	4.960452	1.796875	Elite Eight: 14, National Ch...
4	3.313846	1.546875	Elite Eight: 13, National Ch...
5	0.892593	1.109375	Elite Eight: 6, National Cha...
6	0.335793	1.125000	Elite Eight: 3, National Cha...
7	-0.585062	0.890625	Elite Eight: 2, National Cha...
8	-3.281818	0.726562	Elite Eight: 5, National Cha...
9	-4.220000	0.562500	Elite Eight: 1, National Semif...
10	-3.028571	0.640625	Elite Eight: 1, National Semif...
11	-3.524752	0.578125	Elite Eight: 3, National Semif...
12	-4.461538	0.523438	Elite Eight: 0, Round of 32: 20, ...
13	-9.043750	0.250000	Round of 32: 6, Round of 64: 26, ...
14	-10.688742	0.179688	Round of 32: 2, Round of 64: 21, ...
15	-16.065693	0.070312	Round of 32: 1, Round of 64: 8, Sw...
16	-24.718750	0.000000	Round of 64: 0
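
As a rough sketch of how the first two columns of this summary could be derived from the per-team game rows, the example below uses plain Python. It assumes that the average point spread is taken per game and that average wins are taken per tournament entry (one team in one year). The function name `summarize_by_seed` is hypothetical, and the sample rows are the five rows shown in the data preview.

```python
from collections import defaultdict


def summarize_by_seed(rows):
    """Average point spread per game and average wins per entry, by seed."""
    margins = defaultdict(list)
    wins = defaultdict(int)
    entries = defaultdict(set)  # distinct (year, team) entries per seed
    for r in rows:
        margins[r["seed"]].append(r["score_diff"])
        wins[r["seed"]] += r["win"]
        entries[r["seed"]].add((r["year"], r["team"]))
    return {
        seed: {
            "average_point_spread": sum(m) / len(m),
            "average_wins": wins[seed] / len(entries[seed]),
        }
        for seed, m in sorted(margins.items())
    }


# The five rows from the data preview, reduced to the columns needed here.
sample_rows = [
    {"seed": 1, "team": "Georgetown", "year": 1985, "score_diff": 25, "win": 1},
    {"seed": 16, "team": "Lehigh", "year": 1985, "score_diff": -25, "win": 0},
    {"seed": 4, "team": "Loyola, Illinois", "year": 1985, "score_diff": 1, "win": 1},
    {"seed": 13, "team": "Iona", "year": 1985, "score_diff": -1, "win": 0},
    {"seed": 5, "team": "Southern Methodist", "year": 1985, "score_diff": 17, "win": 1},
]

summary = summarize_by_seed(sample_rows)
for seed, stats in summary.items():
    print(seed, stats)
```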

Compare Average Wins By Seed

As expected, higher seeds generally outperform lower seeds. There are exceptions, as 9 seeds win fewer games on average than 10 and 11 seeds, and 5 seeds win fewer games on average than 6 seeds. When plotted against the number of wins that would be expected if the higher seed won each game, the expected levels of advancement appear to generally hold true.
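
The baseline of "wins expected if the higher seed won each game" can be sketched as follows. How the notebook computes it is not shown, so this is one possible reading: within a 16-team region the better seed always wins, and the three Final Four games are shared evenly among the four 1 seeds. The function name `chalk_wins` and the `final_four_share` parameter are assumptions.

```python
def chalk_wins(seed, final_four_share=0.75):
    """Wins for a seed if the better seed won every game in a 16-team region."""
    # Seed 1 survives four regional rounds, seed 2 three, seeds 3-4 two,
    # seeds 5-8 one, and seeds 9-16 none.
    regional_wins = 4 - (seed - 1).bit_length()
    # Assumption: the three Final Four games are all between 1 seeds, so the
    # four 1 seeds share three extra wins (0.75 each on average).
    return regional_wins + (final_four_share if seed == 1 else 0)


# Observed average wins for a few seeds, copied from the summary table.
observed = {1: 3.351562, 2: 2.398438, 5: 1.109375,
            6: 1.125000, 9: 0.562500, 10: 0.640625}

for seed, avg in observed.items():
    baseline = chalk_wins(seed)
    print(seed, baseline, round(avg - baseline, 3))
```

A positive difference means the seed wins more often than a bracket with no upsets would allow, which is only possible through upsets.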

Compare Average Point Margin by Seed

As expected, high seeds have a better average margin of victory than lower seeds. Only seeds 1-6 have positive average point margins, which can be explained by the single elimination format - teams that lose early on in the tournament do not have an opportunity to improve their point margin. 9 seeds also underperform in this area as they have a poorer point margin than both 10 and 11 seeds.

Typical Tournament Advancement By Seed

The plot below shows the percentage of each seed that advances to each round of the NCAA Tournament. As expected, higher seeds generally outperform lower seeds. There is an interesting trend highlighted in the second plot: while 10-12 seeds are less likely to achieve an upset and advance to the 2nd round, when they do, they are much more likely to win their 2nd round matchup than 8 or 9 seeds. This results in 10, 11, and 12 seeds advancing to the Sweet Sixteen more often than 8 and 9 seeds. This trend can be explained by matchups, and the fact that 8 and 9 seeds that win their first round game must then play a 1 seed in the 2nd round, while 10-12 seeds face easier 2nd round matchups.
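
The distinction between advancing to a round and winning a game once there can be made concrete with a short sketch. The win counts below are the ones still legible in the summary table for seeds 13-15 (counts for other seeds are truncated there). The figure of 128 teams per seed assumes four regions in each of the 32 tournaments from 1985 to 2016, which is consistent with the average wins in the table. The function name `advancement` is hypothetical.

```python
TEAMS_PER_SEED = 4 * 32  # assumption: four regions, 32 tournaments (1985-2016)

# Wins by round, as far as they are legible in the summary table.
wins_by_round = {
    13: {"Round of 64": 26, "Round of 32": 6},
    14: {"Round of 64": 21, "Round of 32": 2},
    15: {"Round of 64": 8, "Round of 32": 1},
}


def advancement(seed_wins, teams=TEAMS_PER_SEED):
    """Return (share reaching Round of 32, share reaching Sweet Sixteen,
    share of Round of 64 winners that also won in the Round of 32)."""
    reached_r32 = seed_wins["Round of 64"] / teams
    reached_s16 = seed_wins["Round of 32"] / teams
    conditional = seed_wins["Round of 32"] / seed_wins["Round of 64"]
    return reached_r32, reached_s16, conditional


for seed, counts in wins_by_round.items():
    r32, s16, cond = advancement(counts)
    print(f"{seed} seed: {r32:.1%} reach Round of 32, "
          f"{s16:.1%} reach Sweet Sixteen, {cond:.1%} of first-round winners win again")
```

The third value is the conditional rate discussed in the paragraph above: it ignores how hard the first upset was and looks only at what happens afterwards.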

Statistical Comparison of Seeds

Below are the results of a statistical comparison of each seed's average point margin. The average point margin for every team of each seed was collected, and a paired t-test was performed on each pair of seeds to determine whether the two were significantly different. As expected, seeds far from each other are very different, but neighboring seeds often did not meet the 0.05 threshold to be considered significantly different. The table lists only the pairs that were not significantly different. This allowed us to group seeds' performance into the following sets, where seeds in parentheses are statistically similar to one another: [1, 2, (3, 4), (5, 6, 7), (8, 9, 10, 11, 12), (13, 14), 15, 16]

	P Value	Seeds Compared	Significant Difference
29	0.093204	3 and 4	False
54	0.587780	5 and 6	False
55	0.159540	5 and 7	False
65	0.376600	6 and 7	False
84	0.427234	8 and 9	False
85	0.821109	8 and 10	False
86	0.834070	8 and 11	False
87	0.305909	8 and 12	False
92	0.313126	9 and 10	False
93	0.568254	9 and 11	False
94	0.841835	9 and 12	False
99	0.668556	10 and 11	False
100	0.213455	10 and 12	False
105	0.431084	11 and 12	False
114	0.178705	13 and 14	False

[1, 2, (3, 4), (5, 6, 7), (8, 9, 10, 11, 12), (13, 14), 15, 16]
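
One way to arrive at this grouping from the table is to treat every non-significant pair as a link and collect the linked seeds into groups. The sketch below does this with a small union-find structure. It is an illustration rather than the notebook's own code, and merging by linked pairs is an assumption about how the groups were formed. The pairs are copied from the table above.

```python
# Pairs of seeds whose difference was not significant (from the table above).
non_significant = [
    (3, 4), (5, 6), (5, 7), (6, 7),
    (8, 9), (8, 10), (8, 11), (8, 12), (9, 10), (9, 11), (9, 12),
    (10, 11), (10, 12), (11, 12), (13, 14),
]


def group_seeds(pairs, seeds=range(1, 17)):
    """Merge seeds connected by non-significant pairs into sorted groups."""
    parent = {s: s for s in seeds}

    def find(s):
        # Follow parent links to the group's representative, shortening the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in pairs:
        parent[find(b)] = find(a)

    groups = {}
    for s in seeds:
        groups.setdefault(find(s), []).append(s)
    return sorted(groups.values())


print(group_seeds(non_significant))
```

In this data every pair inside a group is itself non-significant, so the linked groups are also fully consistent. With other data, linking could chain together seeds that do differ significantly, and a stricter rule would be needed.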

Which matchups are most likely to produce upsets?

Key Takeaways:

There is a linear relationship between absolute seed difference and percentage of upsets in matchups.
Every game between a 2 and 5 seed has resulted in an upset.
40% of games between a 2 and 10 seed have resulted in an upset.

The purpose of this section is to determine which matchups are most likely to produce upsets relative to the absolute seed difference of the teams. Matchups between equally seeded teams were not considered.

Two dataframes were created to analyze and plot the data. The first dataframe has one record for each game from the perspective of the underdogs. This dataframe was used to create a line of best fit to generate the expected upset percentage by absolute seed difference. The second dataframe has one record for each game from the perspective of the winners. It was used to compare each matchup's historical upset percentage to expected upset percentage and plot the data.
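
The idea of an expected upset percentage from a line of best fit can be sketched with ordinary least squares in plain Python. The exact fitting procedure in the notebook is not shown, so this is one possible reading: fit the underdog's win indicator (0 or 1) against the absolute seed difference, one record per game. The function names and all sample records below are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def expected_upset_rate(seed_diff, slope, intercept):
    """Upset rate predicted by the fitted line for a given seed difference."""
    return slope * seed_diff + intercept


# Hypothetical underdog records: (absolute seed difference, 1 if the underdog won).
underdog_games = [(1, 1), (1, 0), (3, 1), (3, 0), (3, 0),
                  (7, 1), (7, 0), (7, 0), (15, 0), (15, 0)]
slope, intercept = fit_line([d for d, _ in underdog_games],
                            [w for _, w in underdog_games])

# Deviation for a hypothetical matchup with seed difference 8 and a 40% upset rate.
deviation = 0.40 - expected_upset_rate(8, slope, intercept)
print(round(slope, 4), round(intercept, 4), round(deviation, 4))
```

A matchup with a large positive deviation produces upsets more often than its seed difference alone would suggest.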

The analysis below highlights two subsets of matchups, but the underlying code contains functions that can be used to look at any subset.

Relatively high likelihood of upsets for matchups that don't occur often

This subset was defined as matchups that occur less often than once every four years on average. To eliminate extreme outliers, the matchups must also have occurred at least three times since 1985. From this subset, the four matchups with the most drastic deviation from expected upset percentage are:

2 seed vs 5 seed

4 total games
100% have been upsets

2 seed vs 8 seed

7 total games
71% have been upsets

1 seed vs 11 seed

6 total games
50% have been upsets

2 seed vs 4 seed

7 total games
57% have been upsets

Relatively high likelihood of upsets for matchups that occur every year

This subset was defined as matchups that occur at least once every year on average. The four matchups with the most drastic deviation from expected upset percentage are:

2 seed vs 10 seed

45 total games
40% have been upsets

4 seed vs 12 seed

35 total games
34% have been upsets

3 seed vs 11 seed

42 total games
33% have been upsets

5 seed vs 12 seed

128 total games
36% have been upsets
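
The two subset definitions used in this section can be expressed as simple filters on the number of games per matchup. The sketch below applies them to the eight matchups listed above, using the game counts and upset percentages from the text. The function names `is_rare` and `is_annual` are hypothetical, and the notebook's own helper functions are not reproduced here.

```python
TOURNAMENTS = 2016 - 1985 + 1  # 32 tournaments in the data set

# (better seed, worse seed): (total games, share of upsets), from the lists above.
matchups = {
    (2, 5): (4, 1.00), (2, 8): (7, 0.71), (1, 11): (6, 0.50), (2, 4): (7, 0.57),
    (2, 10): (45, 0.40), (4, 12): (35, 0.34), (3, 11): (42, 0.33), (5, 12): (128, 0.36),
}


def is_rare(games, tournaments=TOURNAMENTS, min_games=3):
    """Less often than once every four years, but at least min_games times."""
    return games >= min_games and games / tournaments < 1 / 4


def is_annual(games, tournaments=TOURNAMENTS):
    """At least once per tournament on average."""
    return games / tournaments >= 1


rare = [m for m, (games, _) in matchups.items() if is_rare(games)]
annual = [m for m, (games, _) in matchups.items() if is_annual(games)]
print("rare:", rare)
print("annual:", annual)
```

With 32 tournaments, the rare subset corresponds to matchups with three to seven games, and the annual subset to matchups with at least 32 games.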

How have individual teams performed in the tournament historically?

Key Takeaways:

Even the teams that have performed the best historically average fewer than three wins per tournament.
Most of the teams that have performed the best historically are schools that most fans would already consider the best in men's college basketball.
The average number of wins for a given seed was used to predict the number of "expected wins" for each team based on its seeding in each year.
Expected wins were used to find teams that have historically over- or underperformed.

Quantifying Team Performance

First, we grouped the results of each game by the teams playing to find the following quantities for each team that has been in the tournament: tournament appearances, games played, games won, win percentage, total point margin, average point margin, average wins per tournament, wins expected based on the historical performance of teams with the same seed, expected games played, deviation from expected wins, and average deviation from expected wins. This information was stored in a dataframe so that the results for individual teams could be accessed and compared.

	Team	Tournament Appearances	Games Played	Games Won	Win Percentage	Total Point Margin	Average Point Margin	Average Number of Wins	Expected Wins	Expected Games Played	Deviation From Expected Wins	Average Deviation From Expected Wins
0	Air Force	2	2	0	0	-20	-10	0	0.828125	2.82812	-0.828125	-0.414062
1	Akron	4	4	0	0	-78	-19.5	0	0.914062	4.91406	-0.914062	-0.228516
2	Alabama	15	33	18	0.545455	34	1.0303	1.2	16.8203	31.8203	1.17969	0.0786458
3	Alabama State	2	2	0	0	-69	-34.5	0	0	2	0	0
4	Albany	5	5	0	0	-73	-14.6	0	0.5	5.5	-0.5	-0.1
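
The expected-wins columns can be illustrated with a short sketch. It sums the seed-level average wins (copied from the seed summary table) over a team's appearances and compares the total with the games actually won. The relation "expected games played = expected wins + appearances" is inferred from the rows above, not taken from the notebook. The function name `team_summary` and the example team are hypothetical.

```python
# Average wins per seed, copied from the seed summary table.
AVERAGE_WINS_BY_SEED = {
    1: 3.351562, 2: 2.398438, 3: 1.796875, 4: 1.546875,
    5: 1.109375, 6: 1.125000, 7: 0.890625, 8: 0.726562,
    9: 0.562500, 10: 0.640625, 11: 0.578125, 12: 0.523438,
    13: 0.250000, 14: 0.179688, 15: 0.070312, 16: 0.000000,
}


def team_summary(appearances):
    """Summarize a team from a list of (seed, wins) pairs, one per tournament."""
    n = len(appearances)
    games_won = sum(wins for _, wins in appearances)
    expected = sum(AVERAGE_WINS_BY_SEED[seed] for seed, _ in appearances)
    deviation = games_won - expected
    return {
        "Tournament Appearances": n,
        "Games Won": games_won,
        "Average Number of Wins": games_won / n,
        "Expected Wins": expected,
        # Inferred from the table: one extra game per appearance.
        "Expected Games Played": expected + n,
        "Deviation From Expected Wins": deviation,
        "Average Deviation From Expected Wins": deviation / n,
    }


# Hypothetical team: an 11 seed and a 13 seed, both first-round losses.
example = team_summary([(11, 0), (13, 0)])
for key, value in example.items():
    print(key, value)
```

Dividing the deviation by the number of appearances keeps teams with many tournaments from dominating the comparison simply because they have had more chances to drift from the baseline.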

Finding the "Top Teams"

We then sorted by the number of tournament appearances to find the schools that have been in the most tournaments since the field expanded to 64 teams. The results matched our expectations and consisted of schools with a reputation for strong men's basketball programs.

Next, we sorted by the number of games won in the NCAA Tournament. This list included many of the same teams that were in the top ten for tournament appearances. There was a surprisingly large spread between the number of wins for Duke, which was first in this category, and UCLA, which was tenth. It also showed that some schools, like Connecticut, win many games each time they are in the tournament but do not make as many tournaments as some of the other schools on the list.

Finally, we sorted the teams by average point margin. Again, this list consisted mostly of the same teams, with a few exceptions. Kentucky, Duke, and Kansas have an average point margin of almost +10 per tournament game, which was higher than expected.

Finding Over and Underperforming Teams

We plotted the average deviation from the expected number of wins per tournament against the average number of wins per tournament to showcase teams that have outperformed their seeding and consistently gone far in the tournament. The size of each point was based on the total number of games a team has played in the tournament. The large red points on the far right of the graph represent schools that consistently win a lot of games in the tournament and outperform their seeds. The smaller red points represent teams that didn't make it as far into the tournament on average but still outperformed their expected results.

This table shows the average number of wins, average deviation from expected wins, games played, and tournament appearances for the teams highlighted in the previous graph. That Duke, North Carolina, Kentucky, and Connecticut had such a high number of wins and often overperformed their seeds was not surprising considering their position on the total-wins bar graph shown earlier. It was also expected that they wouldn't overperform their seeds by as much, since they are often seeded very high.

It was interesting that a team like Butler wins almost one game more than expected in every tournament while also having a large number of appearances and games played. Their run to the Final Four a few years ago probably skews this slightly, though.

	Team	Average Number of Wins	Average Deviation From Expected Wins	Games Played	Tournament Appearances
64	Duke	2.90323	0.366179	116	31
169	North Carolina	2.82759	0.460399	108	29
115	Kentucky	2.77778	0.597801	99	27
52	Connecticut	2.75	0.756641	71	20
47	Cleveland State	1.5	1.28516	5	2
28	Butler	1.46154	0.74399	32	13
128	Loyola Marymount	1.33333	0.752604	7	3
77	Florida Gulf Coast	1	0.964844	4	2
168	Norfolk State	1	0.929688	2	1

We plotted the total number of tournament games played versus the total deviation from expected wins. The teams in red are examples that have lots of tournament appearances but don't necessarily perform at the level expected based on their seeding. This trend was harder to show with the previous plot.

This table shows the average number of wins, average deviation from expected wins, games played, and tournament appearances for the teams highlighted in the previous graph. These teams were more similar in the average number of wins per tournament than was originally expected.

	Team	Average Number of Wins	Average Deviation From Expected Wins	Games Played	Tournament Appearances
188	Oklahoma	1.45833	-0.323568	59	24
101	Illinois	1.27273	-0.23402	50	22
45	Cincinnati	1.2	-0.263281	44	20
204	Purdue	1.13043	-0.33288	49	23
198	Pittsburgh	1	-0.503701	38	19
150	Missouri	0.842105	-0.323602	35	19
