College basketball fans can better understand the NCAA Tournament by reviewing and analyzing data from previous years. In this report, results from every March Madness game between 1985 and 2016 are analyzed to identify the following trends:
- Expected performance according to seed
- Matchups that often result in exciting games
- Outliers who have exceeded their expectation
This information can be used to identify games that are likely to be interesting, predict winners, or understand a team's tournament history.
| | game_id | date | round | region | seed | team | score | opponent_seed | opponent | opponent_score | overtime | score_diff | win | seed_id | year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1985-03-14 | Round of 64 | East | 1 | Georgetown | 68 | 16 | Lehigh | 43 | 0 | 25 | 1 | 1_16_fav | 1985 |
1 | 0 | 1985-03-14 | Round of 64 | East | 16 | Lehigh | 43 | 1 | Georgetown | 68 | 0 | -25 | 0 | 1_16_dog | 1985 |
2 | 1 | 1985-03-14 | Round of 64 | East | 4 | Loyola, Illinois | 59 | 13 | Iona | 58 | 0 | 1 | 1 | 4_13_fav | 1985 |
3 | 1 | 1985-03-14 | Round of 64 | East | 13 | Iona | 58 | 4 | Loyola, Illinois | 59 | 0 | -1 | 0 | 4_13_dog | 1985 |
4 | 2 | 1985-03-14 | Round of 64 | East | 5 | Southern Methodist | 85 | 12 | Old Dominion | 68 | 0 | 17 | 1 | 5_12_fav | 1985 |
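The sample above stores every game twice, once from each team's point of view. The sketch below shows one way such rows could be derived from a one-row-per-game table. The input column names (`seed_a`, `team_a`, and so on) and the `to_team_rows` helper are hypothetical, and the handling of equal seeds is an assumption, since the sample rows do not show that case.

```python
import pandas as pd

# Hypothetical raw input: one row per game (scores taken from the sample rows above).
games = pd.DataFrame({
    "game_id": [0, 1],
    "seed_a": [1, 4],
    "team_a": ["Georgetown", "Loyola, Illinois"],
    "score_a": [68, 59],
    "seed_b": [16, 13],
    "team_b": ["Lehigh", "Iona"],
    "score_b": [43, 58],
})


def to_team_rows(games: pd.DataFrame) -> pd.DataFrame:
    """Return two rows per game, one from each team's perspective."""
    a = games.rename(columns={
        "seed_a": "seed", "team_a": "team", "score_a": "score",
        "seed_b": "opponent_seed", "team_b": "opponent", "score_b": "opponent_score",
    })
    b = games.rename(columns={
        "seed_b": "seed", "team_b": "team", "score_b": "score",
        "seed_a": "opponent_seed", "team_a": "opponent", "score_a": "opponent_score",
    })
    rows = pd.concat([a, b], ignore_index=True)
    rows["score_diff"] = rows["score"] - rows["opponent_score"]
    rows["win"] = (rows["score_diff"] > 0).astype(int)

    # seed_id: "<better seed>_<worse seed>_<fav|dog>".
    # Assumption: with equal seeds, both rows are labeled "dog".
    better = rows[["seed", "opponent_seed"]].min(axis=1)
    worse = rows[["seed", "opponent_seed"]].max(axis=1)
    role = (rows["seed"] < rows["opponent_seed"]).map({True: "fav", False: "dog"})
    rows["seed_id"] = better.astype(str) + "_" + worse.astype(str) + "_" + role
    return rows.sort_values(["game_id", "seed"]).reset_index(drop=True)


team_rows = to_team_rows(games)
print(team_rows[["game_id", "seed", "team", "score_diff", "win", "seed_id"]])
```

Keeping both perspectives makes per-seed and per-team aggregations a single `groupby`, at the cost of doubling the row count.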
Key Takeaways:
- As expected, high seeds generally outperform low seeds.
- There are exceptions, as 9 seeds win fewer games on average than 10 and 11 seeds, and 5 seeds win fewer games on average than 6 seeds.
- In a statistical comparison, many seeds do not perform significantly differently from their neighboring seeds. Notably, seeds 8-12 show no significant difference in average point margin per game.
Seed | Average Point Spread | Average Wins | Wins by Round |
---|---|---|---|
1 | 11.392193 | 3.351562 | round Elite Eight 52 National Ch... |
2 | 7.109302 | 2.398438 | round Elite Eight 28 National Ch... |
3 | 4.960452 | 1.796875 | round Elite Eight 14 National Ch... |
4 | 3.313846 | 1.546875 | round Elite Eight 13 National Ch... |
5 | 0.892593 | 1.109375 | round Elite Eight 6 National Cha... |
6 | 0.335793 | 1.125000 | round Elite Eight 3 National Cha... |
7 | -0.585062 | 0.890625 | round Elite Eight 2 National Cha... |
8 | -3.281818 | 0.726562 | round Elite Eight 5 National Cha... |
9 | -4.220000 | 0.562500 | round Elite Eight 1 National Semif... |
10 | -3.028571 | 0.640625 | round Elite Eight 1 National Semif... |
11 | -3.524752 | 0.578125 | round Elite Eight 3 National Semif... |
12 | -4.461538 | 0.523438 | round Elite Eight 0 Round of 32 20 ... |
13 | -9.043750 | 0.250000 | round Round of 32 6 Round of 64 26 ... |
14 | -10.688742 | 0.179688 | round Round of 32 2 Round of 64 21 ... |
15 | -16.065693 | 0.070312 | round Round of 32 1 Round of 64 8 Sw... |
16 | -24.718750 | 0.000000 | round Round of 64 0 Name: win, dtype: int64 |
As expected, higher seeds generally outperform lower seeds. There are exceptions: 9 seeds win fewer games on average than 10 and 11 seeds, and 5 seeds win fewer games on average than 6 seeds. When average wins are plotted against the number of wins that would be expected if the higher seed won every game, actual advancement generally tracks the expected level.
As expected, higher seeds have a better average margin of victory than lower seeds. Only seeds 1-6 have positive average point margins, which can be explained by the single-elimination format: teams that lose early in the tournament have no opportunity to improve their point margin. 9 seeds also underperform in this area, as their point margin is worse than that of both 10 and 11 seeds.
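The two per-seed averages discussed above can be sketched with a `groupby` on the two-rows-per-game table. This is a minimal illustration on made-up rows, not the report's actual code. It assumes the point spread is averaged per game and that wins are averaged per tournament entry, where an entry is one (year, team) pair.

```python
import pandas as pd

# Toy rows in the report's format (one row per team per game); values are made up.
rows = pd.DataFrame({
    "year": [2015] * 6,
    "seed": [1, 16, 1, 8, 8, 9],
    "team": ["A", "P", "A", "H", "H", "I"],
    "score_diff": [20, -20, 5, -5, 3, -3],
    "win": [1, 0, 1, 0, 1, 0],
})

by_seed = rows.groupby("seed")

# Number of tournament entries per seed: one entry per (year, team) pair.
entries = rows.drop_duplicates(["year", "team"]).groupby("seed").size()

summary = pd.DataFrame({
    "avg_point_spread": by_seed["score_diff"].mean(),  # averaged per game
    "avg_wins": by_seed["win"].sum() / entries,        # averaged per entry
})
print(summary)
```

Because an eliminated team stops contributing games, a seed's average spread is weighted toward the teams that advance, which is the effect described in the paragraph above.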
The plot below shows the percentage of each seed that advances to each round of the NCAA Tournament. As expected, higher seeds generally outperform lower seeds. The second plot highlights an interesting trend: 10-12 seeds are less likely than 8 or 9 seeds to win their first-round game, but when they do, they are much more likely to win their second-round game. As a result, 10, 11, and 12 seeds advance to the Sweet Sixteen more often than 8 and 9 seeds. Matchups explain this trend: an 8 or 9 seed that wins its first-round game must then play a 1 seed, while 10-12 seeds face easier second-round opponents.
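The advancement percentages can be computed by counting how many teams of each seed played in each round. The sketch below uses made-up rows and assumes that a team "reached" a round if it has a game row in that round. The `ROUND_ORDER` list is truncated to three rounds only to keep the toy data small.

```python
import pandas as pd

ROUND_ORDER = ["Round of 64", "Round of 32", "Sweet Sixteen"]  # toy data stops here

# One row per team per game; only the columns needed for this calculation.
rows = pd.DataFrame({
    "seed": [1, 16, 8, 9, 7, 10, 2, 15, 1, 8, 10, 2, 1, 10],
    "round": ["Round of 64"] * 8 + ["Round of 32"] * 4 + ["Sweet Sixteen"] * 2,
})

# Teams of each seed that played in each round.
reached = (
    rows.groupby(["seed", "round"]).size()
    .unstack(fill_value=0)
    .reindex(columns=ROUND_ORDER, fill_value=0)
)

# Every team plays in the Round of 64, so that column is the number of entries.
pct_reached = reached.div(reached["Round of 64"], axis=0) * 100
print(pct_reached)
```

In this toy bracket the 8 seed wins its first game but falls in the Round of 32, while the 10 seed reaches the Sweet Sixteen, mirroring the pattern described above.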
Below are the results of a statistical comparison of each seed's average point margin. The average point margin for every team of each seed was collected, and a paired t-test was performed on each pair of seeds to determine whether their margins were significantly different. As expected, seeds far from each other are very different, but neighboring seeds often did not meet the 0.05 threshold to be considered significantly different from one another. This allowed us to group the seeds into the following sets, where seeds inside parentheses are statistically similar to one another: [1, 2, (3, 4), (5, 6, 7), (8, 9, 10, 11, 12), (13, 14), 15, 16]
| | P Value | Seeds Compared | Significant Difference |
---|---|---|---|
29 | 0.093204 | 3 and 4 | False |
54 | 0.587780 | 5 and 6 | False |
55 | 0.159540 | 5 and 7 | False |
65 | 0.376600 | 6 and 7 | False |
84 | 0.427234 | 8 and 9 | False |
85 | 0.821109 | 8 and 10 | False |
86 | 0.834070 | 8 and 11 | False |
87 | 0.305909 | 8 and 12 | False |
92 | 0.313126 | 9 and 10 | False |
93 | 0.568254 | 9 and 11 | False |
94 | 0.841835 | 9 and 12 | False |
99 | 0.668556 | 10 and 11 | False |
100 | 0.213455 | 10 and 12 | False |
105 | 0.431084 | 11 and 12 | False |
114 | 0.178705 | 13 and 14 | False |
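The pairwise comparison described above can be sketched with `scipy.stats.ttest_rel`, which performs a paired t-test on two equal-length samples. The margins below are randomly generated rather than taken from the tournament data, and pairing the samples by position is an assumption, since the report does not state how teams of different seeds were paired.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Made-up per-team average point margins, 128 teams per seed
# (4 teams per seed per year over 32 tournaments).
margins = {
    1: rng.normal(11, 8, 128),
    8: rng.normal(-3, 8, 128),
    9: rng.normal(-4, 8, 128),
}

ALPHA = 0.05  # significance threshold used in the report

records = []
for a, b in combinations(sorted(margins), 2):
    # Paired t-test; samples are paired by position (an assumption).
    t_stat, p_value = stats.ttest_rel(margins[a], margins[b])
    records.append({
        "P Value": p_value,
        "Seeds Compared": f"{a} and {b}",
        "Significant Difference": bool(p_value < ALPHA),
    })

results = pd.DataFrame(records)
print(results)
```

With 16 seeds this loop would run 120 comparisons, so some pairs could cross the 0.05 threshold by chance alone; a multiple-comparison correction would be a reasonable refinement.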
Key Takeaways:
- There is a linear relationship between absolute seed difference and percentage of upsets in matchups.
- Every game between a 2 and 5 seed has resulted in an upset.
- 40% of games between a 2 and 10 seed have resulted in an upset.
The purpose of this section is to determine which matchups are most likely to produce upsets relative to the absolute seed difference of the teams. Matchups between teams with the same seed were not considered.
Two dataframes were created to analyze and plot the data. The first dataframe has one record for each game from the perspective of the underdogs. This dataframe was used to create a line of best fit to generate the expected upset percentage by absolute seed difference. The second dataframe has one record for each game from the perspective of the winners. It was used to compare each matchup's historical upset percentage to expected upset percentage and plot the data.
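As a rough illustration of the approach described above, the sketch below fits a line to upset percentage by absolute seed difference with `numpy.polyfit`, then measures how far each matchup deviates from that line. The rows are made up, and for brevity both steps use a single underdog-perspective table rather than the two dataframes the report describes. Fitting on one aggregated percentage per seed difference is also an assumption.

```python
import numpy as np
import pandas as pd

# Toy underdog-perspective rows: win == 1 means the underdog won (an upset).
dogs = pd.DataFrame({
    "seed":          [16, 16, 9, 9, 12, 12, 12, 10],
    "opponent_seed": [1, 1, 8, 8, 5, 5, 5, 2],
    "win":           [0, 0, 1, 0, 1, 0, 0, 1],
})
dogs["seed_diff"] = (dogs["seed"] - dogs["opponent_seed"]).abs()

# Upset percentage for each absolute seed difference.
upset_pct = dogs.groupby("seed_diff")["win"].mean() * 100

# Degree-1 least-squares fit: expected upset % = slope * seed_diff + intercept.
slope, intercept = np.polyfit(
    upset_pct.index.to_numpy(dtype=float), upset_pct.to_numpy(), deg=1
)

# Per-matchup history compared with the fitted expectation.
matchups = (
    dogs.groupby(["opponent_seed", "seed"])
    .agg(games=("win", "size"), upsets=("win", "sum"))
    .reset_index()
)
matchups["upset_pct"] = matchups["upsets"] / matchups["games"] * 100
matchups["seed_diff"] = matchups["seed"] - matchups["opponent_seed"]
matchups["expected_pct"] = slope * matchups["seed_diff"] + intercept
matchups["deviation"] = matchups["upset_pct"] - matchups["expected_pct"]
print(matchups)
```

A positive `deviation` marks a matchup that has produced more upsets than its seed difference alone would suggest.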
The analysis below highlights two subsets of matchups, but the underlying code contains functions that can be used to look at any subset.
This subset was defined as matchups that occur less than once every four years on average. To eliminate extreme outliers, the matchups also must have occurred at least three times since 1985. From this subset, the four matchups with the most drastic deviation from expected upset percentage are:
- 4 total games
- 100% have been upsets
- 7 total games
- 71% have been upsets
- 6 total games
- 50% have been upsets
- 7 total games
- 57% have been upsets
This subset was defined as matchups that occur at least once every year on average. The four matchups with the most drastic deviation from expected upset percentage are:
- 45 total games
- 40% have been upsets
- 35 total games
- 34% have been upsets
- 42 total games
- 33% have been upsets
- 128 total games
- 36% have been upsets
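The two frequency-based subsets above can be expressed as a single filter over a per-matchup summary table. The `select_matchups` function and the table below are hypothetical stand-ins for the report's own functions, with made-up deviations. The frequency is computed over the 32 tournaments from 1985 through 2016.

```python
import pandas as pd

N_TOURNAMENTS = 2016 - 1985 + 1  # 32 tournaments in the data set

# Hypothetical per-matchup summary; matchup labels and deviations are made up.
summary = pd.DataFrame({
    "matchup": ["m1", "m2", "m3", "m4", "m5", "m6"],
    "games": [2, 4, 7, 20, 45, 128],
    "deviation": [50.0, 55.0, 30.0, 5.0, 12.0, -3.0],
})


def select_matchups(summary, min_per_year=None, max_per_year=None,
                    min_games=0, top_n=4):
    """Filter matchups by how often they occur, then rank by |deviation|."""
    per_year = summary["games"] / N_TOURNAMENTS
    mask = summary["games"] >= min_games
    if min_per_year is not None:
        mask &= per_year >= min_per_year
    if max_per_year is not None:
        mask &= per_year < max_per_year
    subset = summary[mask]
    order = subset["deviation"].abs().sort_values(ascending=False).index
    return subset.loc[order].head(top_n)


# Less than once every four years, but at least three games overall.
rare = select_matchups(summary, max_per_year=0.25, min_games=3)
# At least once per year on average.
common = select_matchups(summary, min_per_year=1.0)
print(rare)
print(common)
```

Ranking by absolute deviation surfaces both matchups with unusually many upsets and matchups with unusually few.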
Key Takeaways:
- Even the teams that have performed the best historically average fewer than three wins per tournament.
- Most of the teams that have performed the best historically are schools that most fans would associate with being the best in men's college basketball.
- The average number of wins for a given seed was used to estimate the number of "expected wins" for each team based on its seeding in each year.
- Expected wins were used to find teams that have historically overperformed or underperformed.
First, we grouped the results of each game by the teams playing to find the following quantities for each team that has been in the tournament: tournament appearances, games played, games won, win percentage, total point margin, average point margin, average wins per tournament, wins expected based on the historical performance of teams with the same seed, expected games played, deviation from expected wins, and average deviation from expected wins. This information was stored in a dataframe so that the results for individual teams could be accessed and compared.
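The per-team quantities listed above can be sketched in two aggregation steps: first per tournament entry, then per team. The team names and game rows below are made up, while the per-seed averages are copied from the seed table earlier in this report. The formula for expected games played (expected wins plus appearances) is inferred from the table that follows, for example 0.828125 + 2 = 2.828125 for Air Force, and is an assumption about the original code.

```python
import pandas as pd

# Average wins per seed, copied from the seed summary table (seeds used here only).
AVG_WINS_BY_SEED = {5: 1.109375, 11: 0.578125, 13: 0.250000}

# Toy two-rows-per-game data restricted to two hypothetical teams.
rows = pd.DataFrame({
    "year": [2004, 2006, 2010, 2010, 2010],
    "team": ["Team X", "Team X", "Team Y", "Team Y", "Team Y"],
    "seed": [11, 13, 5, 5, 5],
    "win": [0, 0, 1, 1, 0],
    "score_diff": [-11, -9, 6, 2, -4],
})

# Step 1: one row per tournament entry (team, year).
entries = (
    rows.groupby(["team", "year"])
    .agg(seed=("seed", "first"), games=("win", "size"),
         wins=("win", "sum"), margin=("score_diff", "sum"))
    .reset_index()
)
entries["expected_wins"] = entries["seed"].map(AVG_WINS_BY_SEED)

# Step 2: one row per team.
teams = entries.groupby("team").agg(
    appearances=("year", "size"),
    games_played=("games", "sum"),
    games_won=("wins", "sum"),
    total_margin=("margin", "sum"),
    expected_wins=("expected_wins", "sum"),
)
teams["win_pct"] = teams["games_won"] / teams["games_played"]
teams["avg_margin"] = teams["total_margin"] / teams["games_played"]
teams["avg_wins"] = teams["games_won"] / teams["appearances"]
teams["expected_games"] = teams["expected_wins"] + teams["appearances"]  # assumption
teams["deviation"] = teams["games_won"] - teams["expected_wins"]
teams["avg_deviation"] = teams["deviation"] / teams["appearances"]
print(teams)
```

A negative `deviation` means a team has won fewer games than the typical team with the same seeds; dividing by appearances makes schools with different numbers of appearances comparable.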
| | Team | Tournament Appearances | Games Played | Games Won | Win Percentage | Total Point Margin | Average Point Margin | Average Number of Wins | Expected Wins | Expected Games Played | Deviation From Expected Wins | Average Deviation From Expected Wins |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Air Force | 2 | 2 | 0 | 0 | -20 | -10 | 0 | 0.828125 | 2.82812 | -0.828125 | -0.414062 |
1 | Akron | 4 | 4 | 0 | 0 | -78 | -19.5 | 0 | 0.914062 | 4.91406 | -0.914062 | -0.228516 |
2 | Alabama | 15 | 33 | 18 | 0.545455 | 34 | 1.0303 | 1.2 | 16.8203 | 31.8203 | 1.17969 | 0.0786458 |
3 | Alabama State | 2 | 2 | 0 | 0 | -69 | -34.5 | 0 | 0 | 2 | 0 | 0 |
4 | Albany | 5 | 5 | 0 | 0 | -73 | -14.6 | 0 | 0.5 | 5.5 | -0.5 | -0.1 |
We then sorted by the number of tournament appearances to find the schools that had been in the most tournaments since the format was changed to include 64 teams. The results matched our expectations and consisted of schools that have a reputation for being good at men's basketball.
Next, we sorted by the number of games won in the NCAA Tournament. This list included many of the same teams that were in the top ten for most tournament appearances. It was interesting to see how large the spread was between the number of wins for Duke, which was first in this category, and UCLA, which was tenth. It also showed that some schools, like Connecticut, win a lot of games each time they are in the tournament but don't necessarily make it into as many tournaments as some of the other schools included.
Finally, we sorted the teams by average point margin. Again, this list mostly consisted of the same teams, with just a few exceptions. Kentucky, Duke, and Kansas each win their tournament games by an average of almost 10 points, which was higher than expected.
We plotted the average deviation from the expected number of wins per tournament versus the average number of wins per tournament to showcase the teams that had overperformed their seeding and had consistently gone far in the tournament. The size of each point was based on the total number of games a team has played in the tournament. The large red points on the far right side of the graph represent schools that consistently win a lot of games in the tournament and outperform their seeds. The smaller red points represent teams that didn't make it as far into the tournament on average but still outperformed their expected results.
This table shows the average number of wins, average deviation from expected wins, games played, and tournament appearances for the teams highlighted in the previous graph. The fact that Duke, North Carolina, Kentucky, and Connecticut had such a high number of wins and often overperformed their seeds was not surprising considering their position on the total number of wins bar graph shown earlier. It was also expected that they wouldn't overperform their seed as often, since they are often seeded very high.
It was interesting to see a team like Butler, which wins almost one game more than expected per tournament while also having a large number of appearances and games played. Their run to the Final Four a few years ago probably skews this slightly, though.
| | Team | Average Number of Wins | Average Deviation From Expected Wins | Games Played | Tournament Appearances |
---|---|---|---|---|---|
64 | Duke | 2.90323 | 0.366179 | 116 | 31 |
169 | North Carolina | 2.82759 | 0.460399 | 108 | 29 |
115 | Kentucky | 2.77778 | 0.597801 | 99 | 27 |
52 | Connecticut | 2.75 | 0.756641 | 71 | 20 |
47 | Cleveland State | 1.5 | 1.28516 | 5 | 2 |
28 | Butler | 1.46154 | 0.74399 | 32 | 13 |
128 | Loyola Marymount | 1.33333 | 0.752604 | 7 | 3 |
77 | Florida Gulf Coast | 1 | 0.964844 | 4 | 2 |
168 | Norfolk State | 1 | 0.929688 | 2 | 1 |
We plotted the total number of tournament games played versus the total deviation from expected wins. The teams in red are examples of schools that have many tournament appearances but don't necessarily perform at the level expected based on their seeding. This trend was harder to show with the previous plot.
This table shows the average number of wins, average deviation from expected wins, games played, and tournament appearances for the teams highlighted in the previous graph. These teams were more similar in the average number of wins per tournament than was originally expected.
| | Team | Average Number of Wins | Average Deviation From Expected Wins | Games Played | Tournament Appearances |
---|---|---|---|---|---|
188 | Oklahoma | 1.45833 | -0.323568 | 59 | 24 |
101 | Illinois | 1.27273 | -0.23402 | 50 | 22 |
45 | Cincinnati | 1.2 | -0.263281 | 44 | 20 |
204 | Purdue | 1.13043 | -0.33288 | 49 | 23 |
198 | Pittsburgh | 1 | -0.503701 | 38 | 19 |
150 | Missouri | 0.842105 | -0.323602 | 35 | 19 |