In [45]:
import pandas as pd
import altair as alt
# Load the dataset from the local CSV file
df = pd.read_csv("cbb25.csv")


print("Path to dataset files:", df)
print(df.head())

Path to dataset files:       RK                    Team  CONF   G   W  ADJOE  ADJDE  BARTHAG  EFG_O  \
0      1                 Houston   B12  34  30  124.8   88.0   0.9823   52.7   
1      2                    Duke   ACC  34  31  128.5   91.3   0.9807   57.4   
2      3                  Auburn   SEC  33  28  129.0   93.7   0.9756   55.7   
3      4                 Florida   SEC  34  30  127.7   94.0   0.9713   55.0   
4      5                 Alabama   SEC  33  25  127.6   96.4   0.9621   56.3   
..   ...                     ...   ...  ..  ..    ...    ...      ...    ...   
359  360             The Citadel    SC  30   5   93.6  117.5   0.0687   46.9   
360  361             Chicago St.   NEC  32   4   92.5  116.1   0.0682   44.4   
361  362              Coppin St.  MEAC  30   6   87.8  112.4   0.0550   44.0   
362  363     Arkansas Pine Bluff  SWAC  31   6   95.0  121.7   0.0549   50.3   
363  364  Mississippi Valley St.  SWAC  31   3   82.9  125.2   0.0086   42.1   

     EFG_D  ... 

In [46]:
import pandas as pd
import numpy as np
import altair as alt
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

df = pd.read_csv("cbb25.csv")
df.head()


index,RK,Team,CONF,G,W,ADJOE,ADJDE,BARTHAG,EFG_O,EFG_D,...,FTRD,2P_O,2P_D,3P_O,3P_D,3PR,3PRD,ADJ_T,WAB,SEED
0,1,Houston,B12,34,30,124.8,88.0,0.9823,52.7,44.9,...,34.1,49.0,43.9,39.8,30.9,34.5,43.1,61.4,11.6,1.0
1,2,Duke,ACC,34,31,128.5,91.3,0.9807,57.4,44.5,...,25.4,58.0,43.4,37.7,30.9,45.4,37.9,65.7,9.6,1.0
2,3,Auburn,SEC,33,28,129.0,93.7,0.9756,55.7,46.0,...,39.2,56.1,47.2,36.8,29.2,40.6,34.8,67.8,12.5,1.0
3,4,Florida,SEC,34,30,127.7,94.0,0.9713,55.0,45.3,...,33.0,56.4,45.9,35.5,29.6,43.6,37.3,69.5,11.1,1.0
4,5,Alabama,SEC,33,25,127.6,96.4,0.9621,56.3,47.9,...,33.9,59.7,48.8,35.0,30.8,46.2,35.1,74.6,9.8,2.0


The first problem I want to solve is identifying which features play the biggest roles in wins. The second is building a supervised learning model that predicts a team's number of wins from the season statistics in the CSV file loaded above.

In [47]:
print("rows, cols:", df.shape)
df.info()
df.isna().mean().sort_values(ascending=False).head(15)


rows, cols: (364, 25)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 364 entries, 0 to 363
Data columns (total 25 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   RK       364 non-null    int64  
 1   Team     364 non-null    object 
 2   CONF     364 non-null    object 
 3   G        364 non-null    int64  
 4   W        364 non-null    int64  
 5   ADJOE    364 non-null    float64
 6   ADJDE    364 non-null    float64
 7   BARTHAG  364 non-null    float64
 8   EFG_O    364 non-null    float64
 9   EFG_D    364 non-null    float64
 10  TOR      364 non-null    float64
 11  TORD     364 non-null    float64
 12  ORB      364 non-null    float64
 13  DRB      364 non-null    float64
 14  FTR      364 non-null    float64
 15  FTRD     364 non-null    float64
 16  2P_O     364 non-null    float64
 17  2P_D     364 non-null    float64
 18  3P_O     364 non-null    float64
 19  3P_D     364 non-null    float64
 20  3PR      364 non-null    float64

SEED     0.813187
DRB      0.000000
WAB      0.000000
ADJ_T    0.000000
3PRD     0.000000
3PR      0.000000
3P_D     0.000000
3P_O     0.000000
2P_D     0.000000
2P_O     0.000000
FTRD     0.000000
FTR      0.000000
RK       0.000000
Team     0.000000
TORD     0.000000
dtype: float64

EDA:
For the EDA we will look at the distribution of wins and a few relationships with wins. These include the offensive rating (ADJOE), defensive rating (ADJDE), BARTHAG, effective field goal percentage (EFG_O), effective field goal percentage allowed (EFG_D), etc.

In [48]:
# Distribution of wins
chart = alt.Chart(df).mark_bar().encode(
    x=alt.X('W:Q', bin=alt.Bin(maxbins=20)),
    y='count()'
).properties(title="Distribution of Wins (W)")
chart


In [49]:
# Scatter: ADJOE vs W
alt.Chart(df).mark_circle(size=60).encode(
    x='ADJOE:Q',
    y='W:Q',
    tooltip=['Team','W','ADJOE','ADJDE','BARTHAG']
).properties(title='ADJOE vs Wins').interactive()


In [50]:
# Scatter: ADJDE vs W
alt.Chart(df).mark_circle(size=60).encode(
    x='ADJDE:Q',
    y='W:Q',
    tooltip=['Team','W','ADJOE','ADJDE','BARTHAG']
).properties(title='ADJDE vs Wins').interactive()


In [51]:
# Scatter: BARTHAG vs W
alt.Chart(df).mark_circle(size=60).encode(
    x='BARTHAG:Q',
    y='W:Q',
    tooltip=['Team','W','ADJOE','ADJDE','BARTHAG']
).properties(title='BARTHAG vs Wins').interactive()


Observations:
Based on the graphs above, most teams cluster around 15 to 17 wins (the mean is about 16.9), with totals ranging from 3 to 31 according to the numeric summary below. ADJOE, ADJDE, and BARTHAG are all strongly correlated with wins, which is expected: wins rise with ADJOE and BARTHAG and fall as ADJDE increases, since a lower ADJDE means a better defense.
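To put a number on "strongly correlated", one option is to compute the Pearson correlation of each metric with wins. The sketch below is only an illustration: it uses a small synthetic table (made-up values, not cbb25.csv) whose columns are named after the real ones, and the coefficients used to generate it are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real table (hypothetical values, NOT from cbb25.csv).
rng = np.random.default_rng(0)
n = 200
adjoe = rng.normal(106, 7, n)   # offensive efficiency: higher is better
adjde = rng.normal(106, 6, n)   # defensive efficiency: lower is better
wins = 17 + 0.5 * (adjoe - 106) - 0.5 * (adjde - 106) + rng.normal(0, 2, n)
demo = pd.DataFrame({"W": wins, "ADJOE": adjoe, "ADJDE": adjde})

# Pearson correlation of every column with wins, strongest first by magnitude.
corr_with_w = demo.corr()["W"].drop("W")
order = corr_with_w.abs().sort_values(ascending=False).index
corr_with_w = corr_with_w.reindex(order)
print(corr_with_w.round(2))
```

On the real data the same idea would be df[['W','ADJOE','ADJDE','BARTHAG']].corr()['W'], where a negative sign for ADJDE would match the downward slope in the scatter plot.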

In [52]:
# Quick numeric summary for a few features
df[['W','ADJOE','ADJDE','BARTHAG','EFG_O','EFG_D','ADJ_T','WAB']].describe().T


feature,count,mean,std,min,25%,50%,75%,max
W,364.0,16.934066,5.872009,3.0,13.0,17.0,21.0,31.0
ADJOE,364.0,106.433516,7.661871,82.9,100.875,106.0,111.275,129.0
ADJDE,364.0,106.433242,6.620135,88.0,101.95,106.75,111.225,125.2
BARTHAG,364.0,0.488307,0.258259,0.0086,0.2746,0.44865,0.693275,0.9823
EFG_O,364.0,50.791484,3.015819,42.1,48.9,50.7,52.625,58.7
EFG_D,364.0,50.911264,2.663695,44.4,49.0,50.9,52.8,58.1
ADJ_T,364.0,67.193681,2.445514,58.6,65.575,67.2,68.8,74.6
WAB,364.0,-8.535027,7.283724,-23.4,-13.8,-9.2,-4.25,12.5


Feature selection and preprocessing plan:
From the summary, several of these metrics can serve as predictors while other columns can be dropped. I will keep the numeric metrics, including the ones summarized above, and drop 'Team' and the other non-informative columns (in practice, the non-numeric ones). Missing feature values will be filled with the column median.
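The missing-value summary earlier shows that SEED is about 81% missing, presumably because only tournament teams receive a seed. A plain median fill gives every unseeded team the same seed value, so one possible variation (an assumption on my part, not what this notebook does) is to also keep a 0/1 flag that records which rows were filled. A minimal sketch on a made-up column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical SEED-like column: some entries missing, as for unseeded teams.
seed = np.array([[1.0], [np.nan], [3.0], [np.nan], [16.0]])

# Median fill plus a 0/1 "was missing" flag appended as an extra column.
imputer = SimpleImputer(strategy="median", add_indicator=True)
filled = imputer.fit_transform(seed)
print(filled)
```

The first output column holds the filled values (the median of 1, 3, and 16 is 3) and the second marks the rows that were originally missing.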

In [53]:
# pick numeric columns except target
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
numeric_cols.remove('W')
print("Candidate numeric features:", numeric_cols)


Candidate numeric features: ['RK', 'G', 'ADJOE', 'ADJDE', 'BARTHAG', 'EFG_O', 'EFG_D', 'TOR', 'TORD', 'ORB', 'DRB', 'FTR', 'FTRD', '2P_O', '2P_D', '3P_O', '3P_D', '3PR', '3PRD', 'ADJ_T', 'WAB', 'SEED']


In [54]:
# Define X and y; selecting only the numeric columns drops the string columns (Team, CONF)
X = df[numeric_cols].copy()
y = df['W'].copy()

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)


(291, 22) (73, 22)


In [55]:
# Preprocessing transformer for numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_cols)
])


Modeling:
I will use Linear Regression as my baseline and RandomForestRegressor as my stronger model. I will evaluate each with 5-fold cross-validation on the training set and then on the held-out test set, using MAE, RMSE, and R^2.
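For reference, the three metrics can be worked out by hand. The sketch below uses three made-up teams (hypothetical numbers, not from the dataset) and only the standard library, so each step of the arithmetic is visible.

```python
import math

# Hypothetical actual and predicted win totals for three teams.
actual = [10, 20, 30]
pred = [12, 18, 33]

errors = [p - a for a, p in zip(actual, pred)]
n = len(actual)

mae = sum(abs(e) for e in errors) / n                  # mean absolute error
rmse = math.sqrt(sum(e * e for e in errors) / n)       # root mean squared error
mean_actual = sum(actual) / n
ss_res = sum(e * e for e in errors)                    # residual sum of squares
ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
r2 = 1 - ss_res / ss_tot                               # share of variance explained

print(round(mae, 3), round(rmse, 3), round(r2, 3))
```

MAE and RMSE are both in units of wins, while RMSE penalizes large misses more heavily; R^2 is unitless and equals 1 for a perfect fit.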

In [56]:
# Helper function to evaluate
def evaluate_model(model, X_train, y_train, X_test, y_test):
    cv = 5
    scoring = ['neg_mean_absolute_error','neg_root_mean_squared_error','r2']
    cv_res = cross_validate(model, X_train, y_train, cv=cv, scoring=scoring, return_train_score=False)
    print("CV MAE: {:.3f} (+/- {:.3f})".format(-cv_res['test_neg_mean_absolute_error'].mean(), cv_res['test_neg_mean_absolute_error'].std()))
    print("CV RMSE: {:.3f} (+/- {:.3f})".format(-cv_res['test_neg_root_mean_squared_error'].mean(), cv_res['test_neg_root_mean_squared_error'].std()))
    print("CV R2: {:.3f} (+/- {:.3f})".format(cv_res['test_r2'].mean(), cv_res['test_r2'].std()))
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    mae = mean_absolute_error(y_test, preds)
    rmse = np.sqrt(mean_squared_error(y_test, preds))  # works whether or not the installed scikit-learn still accepts squared=False
    r2 = r2_score(y_test, preds)
    print("\nTest MAE: {:.3f}".format(mae))
    print("Test RMSE: {:.3f}".format(rmse))
    print("Test R2: {:.3f}".format(r2))
    return model, preds


In [57]:
# Baseline pipeline: linear regression
lin_pipe = Pipeline(steps=[
    ('pre', preprocessor),
    ('model', LinearRegression())
])

print("Linear regression")
lin_model, lin_preds = evaluate_model(lin_pipe, X_train, y_train, X_test, y_test)


Linear regression
CV MAE: 0.875 (+/- 0.091)
CV RMSE: 1.097 (+/- 0.091)
CV R2: 0.961 (+/- 0.008)

Test MAE: 0.826
Test RMSE: 1.059
Test R2: 0.974


In [58]:
# Random Forest
rf_pipe = Pipeline(steps=[
    ('pre', preprocessor),
    ('model', RandomForestRegressor(random_state=42, n_jobs=-1))
])
print("Random Forest (default params)")
rf_model, rf_preds = evaluate_model(rf_pipe, X_train, y_train, X_test, y_test)


Random Forest (default params)
CV MAE: 1.754 (+/- 0.151)
CV RMSE: 2.274 (+/- 0.270)
CV R2: 0.836 (+/- 0.024)

Test MAE: 1.949
Test RMSE: 2.465
Test R2: 0.860


As the results above show, linear regression performs better than the random forest. Its cross-validation and test scores are almost identical (CV RMSE 1.097 vs. test RMSE 1.059), so there is no sign of overfitting. The random forest's errors are about 2x larger and its R^2 is noticeably lower (0.860 vs. 0.974 on the test set). Overall, this suggests that the relationship between the efficiency metrics and win totals is mainly linear, which makes linear models more appropriate for this dataset.
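Since the linear model fits best, its coefficients on the scaled features are one possible way to look at which features matter. The sketch below is an illustration on synthetic data with made-up feature names ("a", "b", "c"), not the notebook's result.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): the target depends strongly on "a",
# weakly on "b", and not at all on "c".
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0, 0.5, 300)
names = ["a", "b", "c"]

pipe = Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())])
pipe.fit(X, y)

# After scaling, each coefficient is the change in the target per one
# standard deviation of that feature, so magnitudes are comparable.
coefs = dict(zip(names, pipe.named_steps["model"].coef_))
ranked = sorted(coefs, key=lambda k: abs(coefs[k]), reverse=True)
print(ranked)
```

In this notebook the equivalent values would come from lin_model.named_steps['model'].coef_ paired with numeric_cols. Several of the real features (for example RK, BARTHAG, and WAB) are likely to be highly correlated with each other, so individual coefficients should be read with caution.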

Hyperparameter tuning:
I will tune a small grid for RandomForest using GridSearchCV on the pipeline.

In [59]:
param_grid = {
    'model__n_estimators': [100, 300],
    'model__max_depth': [None, 6, 12],
    'model__min_samples_leaf': [1, 3, 5]
}

rf_grid_pipe = Pipeline(steps=[('pre', preprocessor),
                               ('model', RandomForestRegressor(random_state=42, n_jobs=-1))])

gs = GridSearchCV(rf_grid_pipe, param_grid, cv=4, scoring='neg_root_mean_squared_error', n_jobs=-1, verbose=1)
gs.fit(X_train, y_train)
print("Best params:", gs.best_params_)
print("Best CV RMSE:", -gs.best_score_)
best_rf = gs.best_estimator_
rf_test_preds = best_rf.predict(X_test)  # predictions from the refit best pipeline

mae = mean_absolute_error(y_test, rf_test_preds)
rmse = np.sqrt(mean_squared_error(y_test, rf_test_preds))  # works whether or not the installed scikit-learn still accepts squared=False
r2 = r2_score(y_test, rf_test_preds)
print("\nTuned RF Test MAE: {:.3f}, RMSE: {:.3f}, R2: {:.3f}".format(mae, rmse, r2))


Fitting 4 folds for each of 18 candidates, totalling 72 fits
Best params: {'model__max_depth': 12, 'model__min_samples_leaf': 1, 'model__n_estimators': 300}
Best CV RMSE: 2.3036887215669735

Tuned RF Test MAE: 1.936, RMSE: 2.456, R2: 0.861
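The "18 candidates, totalling 72 fits" line in the output follows directly from the grid size. A small standard-library sketch of the arithmetic (the grid values are copied from the tuning cell, with the model__ prefix dropped for readability):

```python
from itertools import product

# Same grid values as the tuning cell; keys shortened for readability.
grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 6, 12],
    "min_samples_leaf": [1, 3, 5],
}
cv_folds = 4

candidates = list(product(*grid.values()))  # every combination of settings
total_fits = len(candidates) * cv_folds     # one fit per candidate per fold
print(len(candidates), total_fits)
```

With the default refit behaviour, GridSearchCV then fits the best candidate once more on the full training set, which is the model stored in best_estimator_.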


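Even after tuning, the random forest still performed worse than linear regression (test RMSE 2.456 vs. 1.059) and barely changed from the default forest (2.465), because the data follows a mostly linear pattern that Linear Regression can model much more accurately. Random forests are nonlinear and more complex: they approximate a straight-line trend with piecewise-constant steps and can fit noise, which costs accuracy when the underlying relationship is simple.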
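One way to see the limitation of tree ensembles on linear data is to ask for a prediction outside the training range. The sketch below uses synthetic, noise-free data (an assumption chosen to make the effect obvious, not the basketball data).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free linear data (hypothetical): y = 2x on the range 0..10.
x_train = np.linspace(0, 10, 50).reshape(-1, 1)
y_train = 2 * x_train.ravel()

lin = LinearRegression().fit(x_train, y_train)
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(x_train, y_train)

# Ask both models about a point well outside the training range.
x_new = np.array([[20.0]])
lin_pred = lin.predict(x_new)[0]
rf_pred = rf.predict(x_new)[0]
print(f"linear: {lin_pred:.1f}, forest: {rf_pred:.1f}, "
      f"largest training target: {y_train.max():.1f}")
```

The linear model follows the line out to 40, while the forest can only average training targets and therefore cannot exceed the largest value it saw. Inside the training range the same step-like behaviour shows up as extra error relative to a straight line.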

In [60]:
# Extract numeric feature names and transformer to get feature order (after imputation/scaling it's same order)
feat_names = numeric_cols 
model = gs.best_estimator_.named_steps['model']
importances = model.feature_importances_
feat_imp = pd.DataFrame({'feature': feat_names, 'importance': importances}).sort_values('importance', ascending=False)
feat_imp.head(20)


index,feature,importance
20,WAB,0.788545
1,G,0.052372
3,ADJDE,0.012194
5,EFG_O,0.011928
0,RK,0.011384
8,TORD,0.011094
4,BARTHAG,0.010652
6,EFG_D,0.010132
9,ORB,0.010032
18,3PRD,0.008424


In [61]:
# Plot importances
feat_imp_plot = alt.Chart(feat_imp.head(20)).mark_bar().encode(
    x='importance:Q',
    y=alt.Y('feature:N', sort='-x')
).properties(title='Top 20 feature importances (RandomForest)')
feat_imp_plot


In [62]:
# predicted vs actual scatter
preds = rf_test_preds
resid = y_test - preds
df_diag = pd.DataFrame({'actual': y_test, 'pred': preds, 'resid': resid})
alt.Chart(df_diag).mark_circle().encode(
    x='actual:Q',
    y='pred:Q',
    tooltip=['actual','pred']
).properties(title='Predicted vs Actual (Test set)').interactive()


In [63]:
# residual histogram
alt.Chart(df_diag).mark_bar().encode(
    x=alt.X('resid:Q', bin=alt.Bin(maxbins=30)),
    y='count()'
).properties(title='Residuals (actual - pred)')


Based on the graphs above, WAB is by far the most important feature (importance of about 0.79). I did not know what WAB was, but after some research it makes sense that it ranks first. WAB stands for "wins above bubble": it captures how well a team performed relative to the difficulty of its schedule. It accounts for both wins and strength of schedule, so beating a good team moves it up a lot, while beating a bad team barely moves it.
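The importances above are the random forest's built-in, impurity-based scores. A possible cross-check (not run in this notebook) is permutation importance, which shuffles one column at a time on held-out data and measures how much the score drops. A minimal sketch on synthetic data, where only the first column matters by construction:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): only the first column drives the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 4.0 * X[:, 0] + rng.normal(0, 0.5, 300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and measure the drop in R^2.
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)
top = int(np.argmax(result.importances_mean))
print("most important column:", top)
```

Applied to this notebook, the same call could take the fitted pipeline gs.best_estimator_ together with X_test and y_test, so that the shuffling happens on the raw columns before preprocessing.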

The second graph is a predicted vs. actual scatter plot for the test set. It shows that the model has learned the relationship well: when actual wins are low the predictions are low, and when actual wins are high the predictions are high. Because the points lie close to the diagonal, the error is small and predicted and actual wins are strongly correlated.

The last graph is a histogram of the residuals. A residual of 0 is a perfect prediction. Because the residual is computed as actual minus predicted, a positive residual means the model under-predicted and a negative residual means it over-predicted. The residuals cluster around 0, which is a good sign. The tallest bars are between -1 and +2 with a peak around +0.5, so the predictions are close to the actual values. The distribution looks roughly bell-shaped, that is, approximately normal.
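Because the diagnostic cell computes resid = y_test - preds, the sign of a residual reads as follows. The numbers below are made up for illustration.

```python
# Hypothetical actual and predicted win totals for three teams.
actual = [20, 15, 25]
pred = [18, 15, 28]

# Same convention as the diagnostic cell: residual = actual - predicted.
resid = [a - p for a, p in zip(actual, pred)]

for a, p, r in zip(actual, pred, resid):
    if r > 0:
        verdict = "under-predicted"   # model guessed too few wins
    elif r < 0:
        verdict = "over-predicted"    # model guessed too many wins
    else:
        verdict = "exact"
    print(a, p, r, verdict)
```

Under this convention, a histogram peak slightly above 0 would mean the model tends to predict marginally too few wins.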

Outliers:
When I looked through the data during EDA, I noticed a few points that were definitely higher or lower than the rest. These were not bad data points; they were simply teams that performed extremely well or extremely poorly. In this dataset such values are normal because some teams just score far more efficiently or defend far better than others. For this reason I did not remove any outliers, as they still contain useful information for predicting wins.
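A common way to list such points without dropping them is the 1.5 x IQR rule. This is only a sketch of one option, using made-up win totals rather than the real column; the notebook itself identified the extreme points visually.

```python
import numpy as np

# Hypothetical win totals; NOT taken from cbb25.csv.
wins = np.array([3, 13, 15, 17, 18, 21, 31])

q1, q3 = np.percentile(wins, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag, but do not drop, values outside the 1.5 * IQR fences.
flagged = wins[(wins < low) | (wins > high)].tolist()
print(flagged)
```

Flagged rows can then be inspected by team name to confirm they are genuine extreme seasons rather than data errors.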

In this project I went through the full machine learning workflow using the 2025 College Basketball dataset. I cleaned and explored the data to understand how teams win and which statistics play a role in winning. The scatterplots confirmed the importance of offensive and defensive efficiency, the histogram showed the distribution of wins, and together they pointed to a strong correlation between these metrics and wins.

For the modeling phase I built a Linear Regression model and a Random Forest model. Comparing the two, the linear model performed better because the relationships in the data are mostly linear. I also scaled the features and tuned the random forest's hyperparameters, although tuning changed its test performance only slightly.

In conclusion, team efficiency metrics are strong predictors of success, and a simple model like linear regression captures these largely linear relationships better than the random forest.