In [30]:
import pandas as pd
import altair as alt

Data set found on Kaggle: https://www.kaggle.com/datasets/andrewsundberg/college-basketball-dataset?resource=download

In [31]:
cbb_df = pd.read_csv('cbb24.csv')
cbb_df

Unnamed: 0,RK,TEAM,CONF,G,W,ADJOE,ADJDE,BARTHAG,EFG%,EFGD%,...,DRB,FTR,FTRD,2P_O,2P_D,3P_O,3P_D,ADJ_T,WAB,SEED
0,1,Houston,B12,34,30,119.2,85.5,0.9785,49.7,44.0,...,30.2,29.9,39.0,48.4,43.4,34.7,30.0,63.3,10.6,1.0
1,2,Connecticut,BE,34,31,127.1,93.6,0.9712,57.1,45.1,...,26.8,33.3,32.5,58.5,43.7,36.7,31.9,64.6,11.3,1.0
2,3,Purdue,B10,33,29,126.2,94.7,0.9644,56.0,47.7,...,24.7,42.8,23.0,53.2,48.1,40.8,31.4,67.6,11.0,1.0
3,4,Iowa St.,B12,34,27,113.6,86.5,0.9583,51.9,47.1,...,29.4,36.1,35.2,51.7,46.9,34.9,31.5,67.6,6.9,2.0
4,5,Auburn,SEC,34,27,120.7,92.1,0.9573,54.1,43.4,...,30.3,38.2,41.0,54.9,42.8,35.2,29.8,69.8,5.5,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
357,358,Stonehill,NEC,30,4,90.3,114.2,0.0629,46.7,52.7,...,31.0,22.6,29.4,47.9,51.7,30.1,35.9,68.1,-22.0,
358,359,St. Francis PA,NEC,28,8,93.1,118.1,0.0608,47.2,53.0,...,31.3,32.6,35.4,45.7,52.9,33.2,35.4,65.4,-18.6,
359,360,IUPUI,Horz,29,6,92.1,116.9,0.0607,46.5,58.2,...,35.5,33.2,33.4,49.5,59.0,24.7,38.0,67.3,-21.6,
360,361,Coppin St.,MEAC,29,2,85.1,111.1,0.0445,42.1,51.3,...,38.6,31.1,38.3,41.9,51.0,28.3,34.5,66.3,-22.9,


This data set contains statistical metrics for 362 Division 1 college basketball teams from the 2023-2024 season. I chose it because I enjoy college basketball a great deal. A lot of analysis also goes into bracket predictions when March Madness rolls around, so I wanted to take a stab at understanding how certain metrics might shed light on team success.

A few things stood out right away. Providence (my favorite team) had a BARTHAG, or Power Rating (the probability of beating an average Division 1 team), of 0.8151, yet they were denied a tournament spot while teams with significantly lower Power Ratings were granted one (still bitter).
I also found it interesting how tightly clustered the Big East's Power Ratings were, aside from two outliers at the bottom, which suggests that it is a very competitive conference.
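
The Providence comparison can be checked with a small filter. The sketch below is only illustrative: it uses the rows visible in the `cbb_df` preview as a stand-in for the full frame, it assumes that a missing `SEED` means the team did not make the tournament, and `seeded_below` is a hypothetical helper name rather than anything from the original notebook.

```python
import pandas as pd

# Stand-in for cbb_df: only the rows shown in the preview above.
sample = pd.DataFrame({
    "TEAM": ["Houston", "Connecticut", "Purdue", "Iowa St.", "Auburn",
             "Stonehill", "St. Francis PA", "IUPUI", "Coppin St."],
    "CONF": ["B12", "BE", "B10", "B12", "SEC", "NEC", "NEC", "Horz", "MEAC"],
    "BARTHAG": [0.9785, 0.9712, 0.9644, 0.9583, 0.9573,
                0.0629, 0.0608, 0.0607, 0.0445],
    "SEED": [1.0, 1.0, 1.0, 2.0, 4.0, None, None, None, None],
})


def seeded_below(df, rating):
    """Return tournament teams (non-missing SEED) rated below `rating`."""
    made_field = df["SEED"].notna()  # assumption: NaN SEED = not selected
    cols = ["TEAM", "CONF", "BARTHAG", "SEED"]
    return df.loc[made_field & (df["BARTHAG"] < rating), cols].sort_values("BARTHAG")


PROVIDENCE_BARTHAG = 0.8151  # value quoted in the text
lower_rated = seeded_below(sample, PROVIDENCE_BARTHAG)
print(lower_rated)
```

On the nine preview rows the result is empty, since every seeded team shown is rated above 0.95. Running the same call on the full `cbb_df` would list the seeded teams rated below Providence.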

In [36]:
conferences = cbb_df['CONF'].unique().tolist()  # plain list for the color scale domain
# 33 distinct hex colors, one per conference
colors = ['#bebada', '#a6cee3', '#fdb462', '#1f78b4', '#fccde5', '#b2df8a', '#f1b6da',
          '#33a02c', '#543005', '#fb9a99', '#e31a1c', '#fdbf6f', '#ff7f00', '#e7d4e8',
          '#bc80bd', '#cab2d6', '#6a3d9a', '#ffff99', '#9970ab', '#f4a582', '#b15928',
          '#8dd3c7', '#ffffb3', '#fb8072', '#67001f', '#d9d9d9', '#80b1d3', '#c51b7d',
          '#003c30', '#7fbc41', '#b3de69', '#ccebc5', '#ffed6f']


barthag_chart = alt.Chart(cbb_df, title="Distribution of College Basketball Teams' Power Rating by Conference").mark_boxplot(opacity=0.85).encode(
    x=alt.X("CONF", title="Conference"),
    y=alt.Y("BARTHAG:Q", title="Power Rating"),
    color=alt.Color('CONF:N', scale=alt.Scale(domain=conferences, range=colors), title='Conference Legend'),
)

barthag_chart

This box-and-whisker plot displays the distribution of teams' Power Ratings for each conference in Division 1 men's basketball. I chose this type of visualization to show the strength of each conference, indicated by the y-position of the box, and its competitiveness, indicated by the height of the box and whiskers (a shorter box means the teams are more evenly matched). The colors are meant to be categorical. Neither Colorgorical nor ColorBrewer could accommodate all 33 conferences, so I did the best I could.
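
The two quantities read off the box plot, the position of the box and its height, can also be computed directly. This is a rough sketch using the preview rows as a stand-in for `cbb_df`. `conference_spread` is a hypothetical helper, and using the median for strength and the interquartile range (IQR) for competitiveness is one reasonable reading of the chart, not the only one.

```python
import pandas as pd

# Stand-in for cbb_df: only the rows shown in the preview above.
sample = pd.DataFrame({
    "CONF": ["B12", "BE", "B10", "B12", "SEC", "NEC", "NEC", "Horz", "MEAC"],
    "BARTHAG": [0.9785, 0.9712, 0.9644, 0.9583, 0.9573,
                0.0629, 0.0608, 0.0607, 0.0445],
})


def conference_spread(df):
    """Median (strength) and IQR (spread) of BARTHAG per conference."""
    grouped = df.groupby("CONF")["BARTHAG"]
    summary = pd.DataFrame({
        "median": grouped.median(),
        "q1": grouped.quantile(0.25),
        "q3": grouped.quantile(0.75),
        "teams": grouped.size(),
    })
    summary["iqr"] = summary["q3"] - summary["q1"]
    # Strongest conferences first, matching a top-down read of the chart.
    return summary.sort_values("median", ascending=False)


spread = conference_spread(sample)
print(spread)
```

With only one or two teams per conference in the sample, the IQR values mean little. On the full frame, a small `iqr` alongside a high `median` would correspond to a short box sitting high on the chart.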

In [56]:
selection = alt.selection_point(fields=['CONF'], bind='legend')

cbb_scatter = alt.Chart(cbb_df, title='Wins vs. Wins Above Bubble for D1 College Basketball Teams').mark_circle(size=30).encode(
    x=alt.X('W:Q',title='Wins'),
    y=alt.Y('WAB:Q', title='Wins Above Bubble (WAB)'),
    color=alt.Color("CONF", title='Conference Legend'),
    tooltip=['TEAM', 'RK', 'W', 'WAB'],
    opacity=alt.condition(selection, alt.value(1), alt.value(0.1))
).interactive().add_params(
    selection
)

cbb_scatter

This scatter plot displays wins versus wins above bubble (WAB) for college basketball teams. WAB is the difference between the number of games a team won and the (estimated) minimum number it would have needed to qualify for March Madness. Teams with a positive WAB generally earned a spot in the tournament, while those with a negative WAB generally fell short unless they secured an automatic bid by winning their conference tournament. The interactive features are as follows: a tooltip showing the team name, rank, number of wins, and WAB; zooming and panning; and an interactive legend that lets the viewer select a conference to see how its teams fared. The interactive legend is crucial for seeing how the tournament selection committee views the competition level of each conference: teams from more competitive conferences may have fewer wins than teams from less competitive conferences and still be selected for the tournament.
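
One way to see how closely the sign of WAB tracks tournament selection is to cross-tabulate the two. The sketch below again uses the preview rows in place of the full `cbb_df` and assumes that a missing `SEED` means the team was not in the field. `wab_vs_selection` is a hypothetical helper name.

```python
import pandas as pd

# Stand-in for cbb_df: only the rows shown in the preview above.
sample = pd.DataFrame({
    "TEAM": ["Houston", "Connecticut", "Purdue", "Iowa St.", "Auburn",
             "Stonehill", "St. Francis PA", "IUPUI", "Coppin St."],
    "WAB": [10.6, 11.3, 11.0, 6.9, 5.5, -22.0, -18.6, -21.6, -22.9],
    "SEED": [1.0, 1.0, 1.0, 2.0, 4.0, None, None, None, None],
})


def wab_vs_selection(df):
    """Count teams by WAB sign and by whether they received a seed."""
    wab_sign = (df["WAB"] > 0).map({True: "WAB > 0", False: "WAB <= 0"})
    # Assumption: a missing SEED means the team did not make the tournament.
    seeded = df["SEED"].notna().map({True: "seeded", False: "not seeded"})
    return pd.crosstab(wab_sign.rename("wab"), seeded.rename("selection"))


table = wab_vs_selection(sample)
print(table)
```

On the full data, any teams in the "WAB <= 0 / seeded" or "WAB > 0 / not seeded" cells would be the cases where WAB and selection disagree.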

In [72]:
cbb_heatmap = alt.Chart(cbb_df, title='Offensive and Defensive Efficiency Rating Effect on Team Ranking').mark_rect().encode(
    alt.Y('ADJDE:Q', title='Defensive Efficiency Rating').bin(maxbins=40),
    alt.X('ADJOE:Q', title='Offensive Efficiency Rating').bin(maxbins=40),
    alt.Color('RK:Q', title='Rank').scale(scheme='greenblue'),
    tooltip=['TEAM']
)

cbb_heatmap

This heatmap plots teams' offensive efficiency ratings against their defensive efficiency ratings, with color indicating national ranking. A higher offensive rating means more output on offense, while a lower defensive rating means a team defends well. The color encoding is such that a more favorable ranking (closer to 1) is lighter, while less favorable rankings are darker.
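
Because the heatmap bins both axes, several teams can land in the same cell, and it can help to know what a single color per cell would summarize. The following is a minimal pandas sketch of that idea, not a reproduction of Altair's binning: the fixed bin `width` of 10 and the helper name `mean_rank_per_cell` are assumptions for illustration, and the data is the preview rows rather than the full `cbb_df`.

```python
import pandas as pd

# Stand-in for cbb_df: only the rows shown in the preview above.
sample = pd.DataFrame({
    "TEAM": ["Houston", "Connecticut", "Purdue", "Iowa St.", "Auburn",
             "Stonehill", "St. Francis PA", "IUPUI", "Coppin St."],
    "RK": [1, 2, 3, 4, 5, 358, 359, 360, 361],
    "ADJOE": [119.2, 127.1, 126.2, 113.6, 120.7, 90.3, 93.1, 92.1, 85.1],
    "ADJDE": [85.5, 93.6, 94.7, 86.5, 92.1, 114.2, 118.1, 116.9, 111.1],
})


def mean_rank_per_cell(df, width=10):
    """Average rank and team count per (offense bin, defense bin) cell.

    Bins are labeled by their lower edge; `width` is an illustrative choice.
    """
    cells = df.assign(
        oe_bin=(df["ADJOE"] // width * width).astype(int),
        de_bin=(df["ADJDE"] // width * width).astype(int),
    )
    return cells.groupby(["oe_bin", "de_bin"])["RK"].agg(["mean", "count"])


cell_ranks = mean_rank_per_cell(sample)
print(cell_ranks)
```

In this sample, Houston and Iowa St. share the cell with offense in [110, 120) and defense in [80, 90), so that cell's average rank is 2.5. A comparable per-cell summary on the full frame would show how many teams each rectangle stands for.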