# NBA Wins Predictions

## Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pymc3 as pm
import theano.tensor as tt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

np.random.seed(102) # picking a random seed for model simulations
pd.options.mode.chained_assignment = None 

## Question 2

In [None]:
team_metrics = pd.read_csv("2023_team_metrics")
nba_2023 = pd.read_csv('nba_2023_season.csv')
team_metrics

Unnamed: 0,RANK,TEAM,CONF,DIVISION,GP,PPG,oPPG,pDIFF,PACE,oEFF,...,SAR,CONS,A4F,W,L,WIN%,eWIN%,pWIN%,ACH,STRK
0,1,Milwaukee,East,Central,82,116.9,113.3,3.6,100.5,115.5,...,3.16,15.0,0.086,58,24,0.707,0.595,0.619,0.112,-2
1,2,Boston,East,Atlantic,82,117.9,111.4,6.5,98.4,118.1,...,5.76,14.3,0.063,57,25,0.695,0.677,0.714,0.018,3
2,3,Philadelphia,East,Atlantic,82,115.2,110.9,4.3,96.8,117.8,...,4.51,13.3,0.021,54,28,0.659,0.629,0.642,0.03,2
3,4,Denver,West,Northwest,82,115.8,112.5,3.3,98.1,117.6,...,2.84,14.3,0.058,53,29,0.646,0.594,0.609,0.052,1
4,5,Cleveland,East,Central,82,112.3,106.9,5.4,95.6,116.2,...,4.57,12.9,0.004,51,31,0.622,0.667,0.678,-0.045,-1
5,6,Memphis,West,Southwest,82,116.9,113.0,3.9,101.0,115.1,...,3.31,14.7,0.007,51,31,0.622,0.602,0.628,0.02,-1
6,7,Sacramento,West,Pacific,82,120.8,118.1,2.7,100.3,119.5,...,2.11,13.5,0.006,48,34,0.585,0.577,0.589,0.008,-3
7,8,New York,East,Atlantic,82,116.0,113.1,2.9,97.1,117.8,...,2.88,13.0,0.021,47,35,0.573,0.589,0.595,-0.016,-2
8,9,Brooklyn,East,Atlantic,82,113.4,112.5,0.9,98.3,115.0,...,0.75,15.6,0.036,45,37,0.549,0.521,0.53,0.028,-1
9,10,Phoenix,West,Pacific,82,113.6,111.6,2.0,98.2,115.2,...,2.44,15.2,-0.001,45,37,0.549,0.555,0.566,-0.006,-2


In [35]:
file_path_player = "NBA Stats 202223 All Stats  NBA Player Props Tool.csv"
nba_data = pd.read_csv(file_path_player)
file_path_team = "2023_team_metrics"
team_data = pd.read_csv(file_path_team)

In [36]:
nba_data.head(1)

Unnamed: 0,RANK,NAME,TEAM,POS,AGE,GP,MPG,USG%,TO%,FTA,...,APG,SPG,BPG,TPG,P+R,P+A,P+R+A,VI,ORtg,DRtg
0,1,Joel Embiid,Phi,C-F,29.1,66,34.6,37.0,14.5,771,...,4.2,1.0,1.7,3.4,43.2,37.2,47.4,13.0,124.4,104.1


In [37]:
nba_data.columns

Index(['RANK', 'NAME', 'TEAM', 'POS', 'AGE', 'GP', 'MPG', 'USG%', 'TO%', 'FTA',
       'FT%', '2PA', '2P%', '3PA', '3P%', 'eFG%', 'TS%', 'PPG', 'RPG', 'APG',
       'SPG', 'BPG', 'TPG', 'P+R', 'P+A', 'P+R+A', 'VI', 'ORtg', 'DRtg'],
      dtype='object')

In [38]:
team_data.head(2)

Unnamed: 0,RANK,TEAM,CONF,DIVISION,GP,PPG,oPPG,pDIFF,PACE,oEFF,...,SAR,CONS,A4F,W,L,WIN%,eWIN%,pWIN%,ACH,STRK
0,1,Milwaukee,East,Central,82,116.9,113.3,3.6,100.5,115.5,...,3.16,15.0,0.086,58,24,0.707,0.595,0.619,0.112,-2
1,2,Boston,East,Atlantic,82,117.9,111.4,6.5,98.4,118.1,...,5.76,14.3,0.063,57,25,0.695,0.677,0.714,0.018,3


In [39]:
team_data.columns

Index(['RANK', 'TEAM', 'CONF', 'DIVISION', 'GP', 'PPG', 'oPPG', 'pDIFF',
       'PACE', 'oEFF', 'dEFF', 'eDIFF', 'SOS', 'rSOS', 'SAR', 'CONS', 'A4F',
       'W', 'L', 'WIN%', 'eWIN%', 'pWIN%', 'ACH', 'STRK'],
      dtype='object')

# Prediction with GLMs and nonparametric methods

In this research, we aim to predict the success of NBA teams based on various team and player statistics. The goal is to understand which statistics are most influential in determining a team's success and to compare the performance of generalized linear models (GLMs) and nonparametric models in making these predictions. 

To predict team success, we will primarily use the 'WIN%' feature from the team_data dataset as our target variable, which represents the winning percentage of each team. This is a suitable measure of success as it reflects the proportion of games won out of the total games played.

For the features, we will consider a combination of team-level and player-level statistics to capture the overall performance of the teams. The following features are chosen based on their potential relevance to the team's success:

1. PPG (Points Per Game) - A higher average points scored per game is generally indicative of a strong offensive team.
2. oPPG (Opponent Points Per Game) - A lower average points scored by the opponents per game reflects a strong defensive team.
3. PACE (Number of possessions per 48 minutes) - The speed at which a team plays can impact their offensive and defensive strategies.
4. oEFF (Offensive Efficiency) - A measure of a team's scoring efficiency, which can indicate the quality of their offense.
5. dEFF (Defensive Efficiency) - A measure of a team's ability to prevent opponents from scoring, indicating the quality of their defense.
6. eDIFF (Efficiency Differential) - The difference between a team's offensive and defensive efficiency, which can provide an overall assessment of the team's performance.
7. USG% (Usage Percentage) - Measures the percentage of team plays involving a specific player while they are on the court. This can help identify key players contributing to the team's success.

These features were selected because they capture different aspects of a team's performance, such as offense, defense, pace of play, and individual player contributions. By incorporating these features into our models, we can analyze the relationship between various statistics and the overall success of NBA teams.

In [40]:
# map abbreviations to full names
team_map = {'Phi': 'Philadelphia', 'Dal': 'Dallas', 'Por': 'Portland', 'Okc': 'Oklahoma City',
            'Mil': 'Milwaukee', 'Bos': 'Boston', 'Bro': 'Brooklyn', 'Gol': 'Golden State',
            'Lal': 'LA Lakers', 'Cle': 'Cleveland', 'Pho': 'Phoenix', 'Mem': 'Memphis',
            'Atl': 'Atlanta', 'Nor': 'New Orleans', 'Uta': 'Utah', 'Nyk': 'New York',
            'Sac': 'Sacramento', 'Chi': 'Chicago', 'Min': 'Minnesota', 'Den': 'Denver',
            'Tor': 'Toronto', 'Lac': 'LA Clippers', 'Cha': 'Charlotte', 'Was': 'Washington',
            'Mia': 'Miami', 'Hou': 'Houston', 'San': 'San Antonio', 'Det': 'Detroit', 'Ind': 'Indiana',
            'Orl': 'Orlando'}

# abbreviations to full names 
nba_data['TEAM'] = nba_data['TEAM'].map(team_map)

In [41]:
merged_data = nba_data.merge(team_data, on="TEAM")

In [42]:
merged_data.head(1)

Unnamed: 0,RANK_x,NAME,TEAM,POS,AGE,GP_x,MPG,USG%,TO%,FTA,...,SAR,CONS,A4F,W,L,WIN%,eWIN%,pWIN%,ACH,STRK
0,1,Joel Embiid,Philadelphia,C-F,29.1,66,34.6,37.0,14.5,771,...,4.51,13.3,0.021,54,28,0.659,0.629,0.642,0.03,2


In [43]:
merged_data.columns

Index(['RANK_x', 'NAME', 'TEAM', 'POS', 'AGE', 'GP_x', 'MPG', 'USG%', 'TO%',
       'FTA', 'FT%', '2PA', '2P%', '3PA', '3P%', 'eFG%', 'TS%', 'PPG_x', 'RPG',
       'APG', 'SPG', 'BPG', 'TPG', 'P+R', 'P+A', 'P+R+A', 'VI', 'ORtg', 'DRtg',
       'RANK_y', 'CONF', 'DIVISION', 'GP_y', 'PPG_y', 'oPPG', 'pDIFF', 'PACE',
       'oEFF', 'dEFF', 'eDIFF', 'SOS', 'rSOS', 'SAR', 'CONS', 'A4F', 'W', 'L',
       'WIN%', 'eWIN%', 'pWIN%', 'ACH', 'STRK'],
      dtype='object')

# Bayesian Modeling

**Methods**

For this research question, we will use a Bayesian linear regression model as our Bayesian GLM. We chose this model because it is a flexible and interpretable method for predicting the success of NBA teams based on the selected features. Linear regression models assume a linear relationship between the predictors and the target variable, which can help us understand the individual contributions of each feature to the winning percentage of the teams.
We used a Gaussian likelihood. The link function in this case is the identity function. This choice is made based on the assumption that the relationship between the predictor variables and the target variable is linear, and that the distribution of the target variable is approximately Gaussian given the predictor variables.

The Bayesian linear regression model assumes the following form:

`WIN% = β0 + β1 * PPG + β2 * oPPG + β3 * PACE + β4 * oEFF + β5 * dEFF + β6 * eDIFF + β7 * USG% + ε`

where βi represents the coefficients for each feature, and ε is the error term.

The Bayesian approach incorporates prior knowledge about the model parameters (coefficients) in the form of prior distributions. We will use weakly informative priors for the coefficients: normal distributions with a mean of 0 and a large standard deviation (100, i.e. a variance of 10,000). Because the predictors and the target are standardized, priors this wide let the data dominate the posterior while still regularizing the model slightly, which keeps the coefficients bounded.

Assumptions made by the Bayesian linear regression model include:

1. Linearity: The relationship between the predictors and the target variable is assumed to be linear. This may not always hold true in practice, but it serves as a starting point for understanding the relationships between the features and the winning percentage.
2. Independence: The observations are assumed to be independent of each other. In the context of the NBA, this might not be completely accurate, as team performance can be influenced by various external factors such as injuries or changes in coaching staff. However, this assumption simplifies the model and allows us to focus on the relationships between the features and the target variable.
3. Homoscedasticity: The model assumes that the variance of the error term is constant across all levels of the predictors. This might not always be true in practice, as certain predictors may exhibit different variances at different levels.

By using a Bayesian linear regression model, we can estimate the uncertainty in our predictions and gain insights into the relationships between the features and the winning percentage of NBA teams. The choice of weakly informative priors helps balance the influence of the data and the prior information, resulting in a more robust model.
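
As a minimal sketch of how posterior draws turn into prediction uncertainty, the snippet below builds a 94% equal-tailed predictive interval with NumPy alone. The arrays `beta_0_draws`, `betas_draws`, and `sigma_draws` are synthetic stand-ins for posterior samples (they are not output from the model in this notebook), and `X_new` is a hypothetical standardized feature matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for posterior draws (S draws, p predictors).
S, p = 1000, 3
beta_0_draws = rng.normal(0.0, 0.05, size=S)
betas_draws = rng.normal([0.5, -0.4, 0.1], 0.05, size=(S, p))
sigma_draws = np.abs(rng.normal(0.3, 0.02, size=S))

# Hypothetical standardized features for two teams.
X_new = np.array([[1.0, -0.5, 0.2],
                  [-0.8, 0.9, 0.0]])

# Mean prediction for every draw: shape (S, n_teams).
mu_draws = beta_0_draws[:, None] + betas_draws @ X_new.T

# Add observation noise to obtain posterior predictive draws.
y_draws = rng.normal(mu_draws, sigma_draws[:, None])

# Point prediction and a 94% equal-tailed interval per team.
y_mean = y_draws.mean(axis=0)
lower, upper = np.percentile(y_draws, [3, 97], axis=0)
print(y_mean, lower, upper)
```

Averaging the parameters first and predicting once gives only the point prediction; keeping the draws separate until the end is what preserves the spread.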

In [44]:
# Relevant columns 
data = merged_data[['PPG_y', 'oPPG', 'PACE', 'oEFF', 'dEFF', 'eDIFF', 'USG%', 'WIN%']]

# Standardize every column (predictors and target) to mean 0 and std 1
standardized_data = (data - data.mean()) / data.std()

# predictors (X) and target variable (y)
X = standardized_data.drop('WIN%', axis=1)
y = standardized_data['WIN%']

# training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Model specification
with pm.Model() as bayesian_model:
    # Priors
    beta_0 = pm.Normal('beta_0', mu=0, sd=100)
    betas = pm.Normal('betas', mu=0, sd=100, shape=X_train.shape[1])

    # Linear regression model
    mu = beta_0 + tt.dot(X_train.values, betas)  # pass a plain NumPy array to theano

    # Likelihood
    sigma = pm.HalfNormal('sigma', sd=100)
    y_obs = pm.Normal('y_obs', mu=mu, sd=sigma, observed=y_train)

    # Sample from the posterior
    trace = pm.sample(2000, tune=1000, cores=2)

# summary
pm.summary(trace).round(2)


  trace = pm.sample(2000, tune=1000, cores=2)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [sigma, betas, beta_0]


In [None]:
# Model evaluation
with bayesian_model:
    # Draw 500 posterior samples of the parameters, then predict the test set from their means
    ppc = pm.sample_posterior_predictive(trace, var_names=['beta_0', 'betas', 'sigma'], samples=500)
    y_pred = ppc['beta_0'].mean(axis=0) + np.dot(X_test, ppc['betas'].mean(axis=0))

    # mean squared error (MSE)
    mse = mean_squared_error(y_test, y_pred)
    print(f'Mean squared error: {mse:.2f}')

    # R-squared score
    r2 = r2_score(y_test, y_pred)
    print(f'R-squared score: {r2:.2f}')

**Results and Discussion**

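On the test split, the Bayesian GLM achieves a mean squared error (MSE) of 0.07 and an R-squared score of 0.92, both measured on the standardized `WIN%`. The low MSE suggests that the model's predictions are close to the true values, and the high R-squared score indicates that the model explains a large proportion of the variance in the dependent variable. One caveat: the merged data has one row per player, and each team's statistics and `WIN%` are repeated on every one of its players' rows, so a random row-level split puts the same teams in both the training and test sets. These scores therefore likely overstate performance on teams the model has not seen.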

Since the model only considers a limited set of features, some that could have a significant impact on team success are excluded, such as team chemistry, coaching, and injuries. Additionally, the model assumes a linear relationship between the features and the target, which may not always be the case. Finally, the model was trained on a single season of data, so its ability to generalize to other seasons is uncertain.

The uncertainty in most parameters is relatively low, as indicated by their narrow credible intervals (`hdi_3%` to `hdi_97%`) and small posterior standard deviations. The exceptions are betas[3], betas[4], and betas[5], the coefficients for oEFF, dEFF, and eDIFF, which have wide credible intervals and high standard deviations. A likely cause is collinearity: eDIFF is defined as the difference between oEFF and dEFF, so the data cannot separate the individual contributions of these three predictors. The relatively small dataset and the noise in the data may also contribute.
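
The efficiency differential is described in the feature list as the difference between offensive and defensive efficiency, which makes the three efficiency columns linearly dependent. The following is a minimal sketch of how that shows up numerically, using synthetic ratings for 30 hypothetical teams rather than the notebook's data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic efficiency ratings for 30 hypothetical teams.
o_eff = rng.normal(115.0, 2.5, size=30)
d_eff = rng.normal(114.0, 2.0, size=30)
e_diff = o_eff - d_eff  # exact linear combination of the other two

X = np.column_stack([o_eff, d_eff, e_diff])

# Three columns, but only two independent directions.
rank = np.linalg.matrix_rank(X)
singular_values = np.linalg.svd(X, compute_uv=False)
print(f"columns: {X.shape[1]}, rank: {rank}")
print("singular values:", singular_values)
```

A rank below the number of columns means that infinitely many coefficient vectors give the same fit, which is one way wide posterior intervals can arise. Standardizing the columns does not remove the dependence.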

Generalizability will be limited since the model is based on data for just one year.

Based on these findings, NBA teams may find it helpful to focus on the features whose coefficients were estimated with low uncertainty, namely PPG (Points Per Game), oPPG (Opponent Points Per Game), PACE (Number of possessions per 48 minutes), and USG% (Usage Percentage), when making decisions related to team composition and strategy.

We merged two different data sources, the NBA player statistics and team statistics, in order to create a more comprehensive dataset for our analysis. The benefits of this approach are that we are able to capture a wider range of factors that could impact team success. However, the consequences include potential issues with data quality and reliability, as well as potential biases introduced by combining different sources.
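
Some of the data-quality risk from the merge can be checked directly. The sketch below uses two hypothetical miniature tables (the abbreviation `Xyz` and the player names are made up) to show how unmapped abbreviations and unmatched rows can be surfaced with pandas; an inner merge, the default for `DataFrame.merge`, would drop such rows without warning.

```python
import pandas as pd

# Hypothetical miniature versions of the player and team tables.
players = pd.DataFrame({
    "NAME": ["Player A", "Player B", "Player C"],
    "TEAM": ["Phi", "Bos", "Xyz"],  # "Xyz" has no entry in the map
})
teams = pd.DataFrame({
    "TEAM": ["Philadelphia", "Boston"],
    "WIN%": [0.659, 0.695],
})
team_map = {"Phi": "Philadelphia", "Bos": "Boston"}

players["TEAM"] = players["TEAM"].map(team_map)

# Abbreviations missing from the map become NaN.
unmapped = players[players["TEAM"].isna()]

# A left merge with an indicator shows which players found a team row;
# validate= raises if a team appears more than once in the team table.
merged = players.merge(teams, on="TEAM", how="left",
                       indicator=True, validate="many_to_one")
unmatched = merged[merged["_merge"] != "both"]
print(len(unmapped), len(unmatched))
```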

Limitations in the data include potential measurement errors in the statistics, as well as potential omitted variable biases if important features were not included in the model. Additional data that could be useful for improving the model includes information on team injuries, team chemistry, and coaching strategies.


# Frequentist Modeling

**Methods**

For the frequentist modeling, we will use a linear regression model to predict the success of NBA teams based on the selected features. This choice is consistent with the Bayesian GLM used earlier, and allows us to compare the performance of both approaches. Linear regression models assume a linear relationship between the predictors and the target variable, which can help us understand the individual contributions of each feature to the winning percentage of the teams.

The frequentist linear regression model assumes the following form:

`WIN% = β0 + β1 * PPG + β2 * oPPG + β3 * PACE + β4 * oEFF + β5 * dEFF + β6 * eDIFF + β7 * USG% + ε`

where βi represents the coefficients for each feature, and ε is the error term.

Assumptions made by the frequentist linear regression model are the same as those made by the Bayesian linear regression model, which include linearity, independence, and homoscedasticity.

We will use the same dataset, features, and target variable as the Bayesian GLM, and we will also standardize the predictors to have mean 0 and std 1.

In [None]:
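linear_regression = LinearRegression()
linear_regression.fit(X_train, y_train)

y_pred_frequentist = linear_regression.predict(X_test)

# Evaluate on the held-out test split
print(f'Mean squared error: {mean_squared_error(y_test, y_pred_frequentist):.2f}')
print(f'R-squared score: {r2_score(y_test, y_pred_frequentist):.2f}')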

**Results and Discussion**

The frequentist linear regression model yielded a mean squared error (MSE) of 0.05 and an R-squared score of 0.94. These results suggest that the model performed well in predicting team success based on the selected features. The low MSE indicates that the model's predictions are close to the true values, while the high R-squared score shows that the model explains a large proportion of the variance in the dependent variable.

Unlike the Bayesian GLM, the frequentist model as fitted here (scikit-learn's `LinearRegression`) does not report uncertainty estimates for its predictions. Such estimates are useful for decision-makers because they indicate how much confidence to place in each prediction.

As with the Bayesian model, the generalizability of these results is limited: the model was trained on a single season, so its predictions for other seasons or under different conditions may be less accurate. To improve generalizability, future studies could consider using data from multiple seasons or incorporating additional features that could impact team success.

We used the same merged dataset as for the Bayesian model, so the same data limitations and suggestions for improvement apply.

Future studies could build on this work by trying the model across other years, exploring additional features that could impact team success, and using different modeling techniques to compare and contrast the results.

# Comparison of Frequentist and Bayesian

Both the Bayesian and frequentist linear regression models performed well in predicting team success based on the selected features. The Bayesian GLM had a mean squared error (MSE) of 0.07 and an R-squared score of 0.92, while the frequentist model had a slightly better MSE of 0.05 and an R-squared score of 0.94. The low MSE values for both models indicate that their predictions are close to the true values, and the high R-squared scores suggest that the models explain a large proportion of the variance in the dependent variable.

One key difference between the two implementations is that the Bayesian GLM yields a full posterior distribution, from which uncertainty estimates for the predictions can be derived; these indicate the level of confidence associated with each prediction. The frequentist model, as fitted here with scikit-learn's `LinearRegression`, returns only point estimates. This can be considered an advantage of the Bayesian model, especially in situations where understanding the uncertainty associated with predictions is crucial for decision-making.

The gap between the two models is small. With priors this wide, the Bayesian posterior means are expected to lie close to the least-squares coefficients, so the slightly higher MSE of the Bayesian model may simply reflect sampling noise in the posterior draws and the high uncertainty in some of its coefficient estimates, rather than a real difference in predictive ability. Both models seem to fit the data well.

Despite the good performance of both models on this dataset, generalizability to future datasets is limited. Both models were trained on a single season of data, and their ability to predict team success in other seasons or under different conditions might not be as accurate. To improve generalizability, future studies could consider using data from multiple seasons, incorporating additional features that could impact team success, and using different modeling techniques to compare and contrast the results.

In conclusion, both Bayesian and frequentist linear regression models performed well on this dataset, but their generalizability to future datasets is uncertain. While the frequentist model had slightly better performance metrics, the Bayesian model provided uncertainty estimates, which can be valuable for decision-making. Future studies should aim to improve generalizability by incorporating more data and exploring additional features that could impact team success.

# Nonparametric Models

We are trying to predict the same target variable as above using the same features for ease of comparison between the models.

# Random Forests

**Methods**

The first nonparametric model we will be using is Random Forests. We chose Random Forests for this problem because it is a powerful and versatile method that can capture complex relationships between features and target variables. Random Forests are an ensemble learning method that constructs multiple decision trees and combines their predictions to improve overall accuracy and control overfitting.


The main assumptions made by Random Forests are:

1. The underlying relationship between features and the target variable can be modeled using decision trees.
2. The ensemble of decision trees can approximate the true relationship better than individual trees.
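
To make the "combines their predictions" step concrete, here is a small sketch on synthetic regression data (not the NBA data). It checks that a scikit-learn `RandomForestRegressor` prediction is the average of the predictions of its individual fitted trees, which are exposed through the `estimators_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)

# Synthetic regression data (not the NBA data).
X = rng.normal(size=(200, 3))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X, y)

# Predict with each fitted tree, then average across trees.
X_new = rng.normal(size=(5, 3))
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(np.allclose(manual_average, forest.predict(X_new)))
```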

In [None]:
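# Preparation
# Note: 'PPG_x' is the player-level PPG column from the merge; the linear
# models above use the team-level 'PPG_y'. The features are not standardized here.
features = ['PPG_x', 'oPPG', 'PACE', 'oEFF', 'dEFF', 'eDIFF', 'USG%']
target = 'WIN%'

X = merged_data[features]
y = merged_data[target]

# This reuses the names X_train, X_test, y_train, y_test and so replaces the
# standardized split used by the linear models
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)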

# Training
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predictions
y_pred = rf_model.predict(X_test)

# Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R2 Score:", r2)

**Results and Discussion**

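The Random Forest model yields a Mean Squared Error (MSE) of about 1.8e-31, which is zero up to floating-point error, and an R2 Score of 1.0. In other words, the model fits the test data perfectly, which is suspicious rather than reassuring. A perfect fit is very unlikely in practice and usually indicates overfitting or data leakage. Leakage is the more likely explanation here: the merged data has one row per player, every row for a given team carries the same team-level features and the same `WIN%`, and the random split places rows from the same team in both the training and test sets, so the trees can simply memorize each team's `WIN%`.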

When applying this model to future datasets, it is important to be cautious, as the model may not generalize well to new data. Overfitting can lead to poor performance on unseen data, and the perfect R2 score observed here raises concerns about the model's ability to generalize.

The limitations of the model include:

1. Overfitting: The model may have learned the noise in the training data, which can lead to poor performance on new data.
2. Data leakage: The presence of information in the training set that should not be available during the learning process may have allowed the model to achieve a perfect fit.
3. Limited feature set: The model may not capture all relevant features that contribute to a team's success, such as player injuries, coaching strategies, or team chemistry.

To improve the model and increase confidence in its applicability to future datasets, the following steps can be taken:

1. Investigate and address potential data leakage, ensuring that the training data does not contain information that should not be available during the learning process.
2. Regularize the model to reduce overfitting, by tuning hyperparameters like maximum depth, minimum samples per leaf, and the number of trees in the ensemble.
3. Cross-validate the model to obtain a more accurate estimate of its performance on unseen data.
4. Include additional relevant features, such as player injuries, coaching strategies, and team chemistry, to better capture the factors that contribute to a team's success.
5. Collect more data or increase the size of the dataset to reduce the uncertainty in the results.

Taken at face value, an R2 score of 1.0 suggests almost no uncertainty in the results. This is deceptive: the apparent certainty is more likely an artifact of overfitting or data leakage than a sign of real predictive power. It is essential to address these issues before trusting the model's predictions for new data.
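
One way to investigate the leakage concern is to split by team instead of by row, so that no team contributes rows to both the training and test sets. The sketch below illustrates the idea with NumPy on a hypothetical layout of 30 teams with 10 player rows each; the team ids and row counts are assumptions, not the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical layout: 30 teams with 10 player rows each.
teams = np.repeat(np.arange(30), 10)

# Hold out whole teams (20% of them) instead of individual rows.
unique_teams = np.unique(teams)
test_teams = rng.choice(unique_teams, size=6, replace=False)
test_mask = np.isin(teams, test_teams)

train_rows = np.flatnonzero(~test_mask)
test_rows = np.flatnonzero(test_mask)

# No team contributes rows to both sides of the split.
shared = np.intersect1d(teams[train_rows], teams[test_rows])
print(len(train_rows), len(test_rows), len(shared))
```

scikit-learn offers the same idea through `GroupShuffleSplit` and `GroupKFold`, which take the team labels as the `groups` argument of `split`.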

# Validation Dataset

As a validation dataset, we will use the playoff data from the same season. Playoff data has both advantages and drawbacks as a validation set:

**Good choice:**

Temporal Consistency: Since the playoff data is from the same season as the regular season data, the player and team performances are more likely to be consistent, and the models trained on the regular season data can be expected to perform reasonably well on the playoff data.

**Not-so-good choice:**

Change in Dynamics: The dynamics of playoff games are usually different from those of regular-season games. During the playoffs, teams may change their strategies, and players may perform differently due to the higher stakes and competitive environment. As a result, models trained on regular-season data might not fully capture these differences and may not generalize well to the playoff data. 

It also allows us to gauge how the models will perform on unseen data, giving an idea of their generalizability.

In [None]:
file_path_player_playoffs = "Player_Playoffs_2023"
nba_data_playoffs = pd.read_csv(file_path_player_playoffs)
file_path_team_Playoffs = "Team_Playoffs_2023"
team_data_playoffs = pd.read_csv(file_path_team_Playoffs)

In [None]:
nba_data_playoffs.head(1)

In [None]:
team_data_playoffs.head(1)

In [None]:
# map abbreviations to full names
team_map_playoffs = {'Phi': 'Philadelphia', 'Dal': 'Dallas', 'Por': 'Portland', 'Okc': 'Oklahoma City',
            'Mil': 'Milwaukee', 'Bos': 'Boston', 'Bro': 'Brooklyn', 'Gol': 'Golden State',
            'Lal': 'LA Lakers', 'Cle': 'Cleveland', 'Pho': 'Phoenix', 'Mem': 'Memphis',
            'Atl': 'Atlanta', 'Nor': 'New Orleans', 'Uta': 'Utah', 'Nyk': 'New York',
            'Sac': 'Sacramento', 'Chi': 'Chicago', 'Min': 'Minnesota', 'Den': 'Denver',
            'Tor': 'Toronto', 'Lac': 'LA Clippers', 'Cha': 'Charlotte', 'Was': 'Washington',
            'Mia': 'Miami', 'Hou': 'Houston', 'San': 'San Antonio', 'Det': 'Detroit', 'Ind': 'Indiana',
            'Orl': 'Orlando'}

# abbreviations to full names 
nba_data_playoffs['TEAM'] = nba_data_playoffs['TEAM'].map(team_map_playoffs)

In [None]:
nba_data_playoffs.head(1)

In [None]:
merged_data_playoffs = nba_data_playoffs.merge(team_data_playoffs, on="TEAM")

In [None]:
merged_data_playoffs.head(1)

In [None]:
# Relevant columns 
data_playoffs = merged_data_playoffs[['PPG_y', 'oPPG', 'PACE', 'oEFF', 'dEFF', 'eDIFF', 'USG%', 'WIN%']]

# Standardize every column (predictors and target) to mean 0 and std 1,
# using the playoff data's own mean and std
standardized_data_playoffs = (data_playoffs - data_playoffs.mean()) / data_playoffs.std()

# predictors (X) and target variable (y)
X_playoffs = standardized_data_playoffs.drop('WIN%', axis=1)
y_playoffs = standardized_data_playoffs['WIN%']

In [None]:
# Bayesian Model Evaluation on Validation Dataset
with bayesian_model:
    ppc_playoffs = pm.sample_posterior_predictive(trace, var_names=['beta_0', 'betas', 'sigma'], samples=500)
    y_pred_bayesian_playoffs = ppc_playoffs['beta_0'].mean(axis=0) + np.dot(X_playoffs, ppc_playoffs['betas'].mean(axis=0))

    mse_bayesian_playoffs = mean_squared_error(y_playoffs, y_pred_bayesian_playoffs)
    print(f'Mean squared error (Bayesian - Validation): {mse_bayesian_playoffs:.2f}')

    r2_bayesian_playoffs = r2_score(y_playoffs, y_pred_bayesian_playoffs)
    print(f'R-squared score (Bayesian - Validation): {r2_bayesian_playoffs:.2f}')


In [None]:
# Frequentist Model Evaluation on Validation Dataset
y_pred_frequentist_playoffs = linear_regression.predict(X_playoffs)

mse_frequentist_playoffs = mean_squared_error(y_playoffs, y_pred_frequentist_playoffs)
print(f'Mean squared error (Frequentist - Validation): {mse_frequentist_playoffs:.2f}')

r2_frequentist_playoffs = r2_score(y_playoffs, y_pred_frequentist_playoffs)
print(f'R-squared score (Frequentist - Validation): {r2_frequentist_playoffs:.2f}')

In [None]:
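# Random Forest Model Evaluation on Validation Dataset
# Note: rf_model was fit on unstandardized features and WIN%, whereas
# X_playoffs and y_playoffs are standardized
y_pred_rf_playoffs = rf_model.predict(X_playoffs)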

mse_rf_playoffs = mean_squared_error(y_playoffs, y_pred_rf_playoffs)
print(f'Mean squared error (Random Forest - Validation): {mse_rf_playoffs:.2f}')

r2_rf_playoffs = r2_score(y_playoffs, y_pred_rf_playoffs)
print(f'R-squared score (Random Forest - Validation): {r2_rf_playoffs:.2f}')

The results of the model evaluations on the validation dataset are as follows:

1. Bayesian Model:
- Mean squared error (MSE): 1.21
- R-squared score: -0.21

2. Frequentist Model (Linear Regression):
- Mean squared error (MSE): 598235153369958413366198272.00
- R-squared score: -601017642455400001457618944.00

3. Random Forest Model:
- Mean squared error (MSE): 1.15
- R-squared score: -0.16

Based on the MSE and R-squared scores, the Random Forest model performed best on the validation dataset, with the Bayesian model close behind. The Frequentist model had an MSE and R-squared score many orders of magnitude worse, indicating that its predictions on the validation data are essentially meaningless. Note, however, that all three R-squared scores are negative, which means that every model predicts the playoff `WIN%` worse than simply predicting the mean would.
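
For reference when reading negative R-squared scores, the short sketch below computes the score from its definition on made-up numbers. The helper `r_squared` is written here for illustration; it follows the same formula as scikit-learn's `r2_score`.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([-1.0, 0.0, 1.0, 2.0])  # made-up targets

baseline = np.full_like(y_true, y_true.mean())  # always predict the mean
reversed_order = y_true[::-1]                   # systematically wrong

print(r_squared(y_true, baseline))        # 0.0
print(r_squared(y_true, reversed_order))  # -3.0
```

A score of 0 corresponds to always predicting the mean, so any negative score means the predictions have a larger squared error than that constant baseline.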

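Differences between the Bayesian and Frequentist implementations of the GLM can be attributed to their underlying assumptions and methodologies. The Bayesian approach incorporates priors, which keep the coefficients bounded and can help when the dataset is small, noisy, or contains strongly correlated predictors. The Frequentist approach relies solely on the data at hand, which makes it more sensitive to outliers, noise, and collinearity. The latter is a plausible cause of the extreme error here: eDIFF is the difference between oEFF and dEFF, so the least-squares coefficients for these predictors are poorly determined and can become very large. This goes unnoticed on the training data but produces huge errors once the relationship between the columns shifts even slightly, as it can when the playoff data is standardized separately.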
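
The stabilizing effect of a prior can be illustrated on synthetic data. In the sketch below, the third predictor is almost exactly the difference of the first two, loosely mimicking the oEFF, dEFF, and eDIFF relationship. The ridge estimate is used as a stand-in for a Bayesian fit, since it is the posterior mode under independent zero-mean normal priors; the penalty `lam = 1.0` is an arbitrary assumption, and none of the numbers come from the notebook.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200

# Synthetic, nearly collinear predictors: x3 is almost exactly x1 - x2.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 - x2 + rng.normal(scale=1e-6, size=n)
X = np.column_stack([x1, x2, x3])
y = x1 - x2 + rng.normal(scale=0.1, size=n)

# Ordinary least squares: no penalty on coefficient size.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Ridge estimate, the posterior mode under independent Normal(0, tau^2)
# priors with lam = sigma^2 / tau^2 (lam = 1.0 is an arbitrary choice).
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

print("OLS:  ", beta_ols)
print("ridge:", beta_ridge)
```

Both coefficient vectors fit the synthetic training data about equally well, but the least-squares coefficients can be orders of magnitude larger, which is what makes predictions fragile when new data breaks the near-exact relationship between the columns.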

The limitations of each model are as follows:
- Bayesian Model: Computationally expensive, sensitive to the choice of priors, and requires more domain knowledge.
- Frequentist Model: Sensitive to outliers and noise, relies solely on the data at hand, and can result in overfitting when there are many features.
- Random Forest Model: Can be prone to overfitting when the trees are deep, and may not perform well on data with very different characteristics from the training data.

Additional data that would be useful for improving the models include:
- More recent data to account for the latest trends and player performance.
- Additional features such as player injuries, coaching strategies, and team chemistry.

The uncertainty in the results could be attributed to several factors, including noisy data, small dataset size, and variance in the estimation. Although the Random Forest model performed better, the R-squared score of -0.16 indicates that there is room for improvement, and the models may not generalize well to future datasets. As such, it is essential to validate and update the models as new data becomes available, and consider incorporating additional features and other modeling techniques to improve performance.
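
One preprocessing detail may also affect validation scores like those above: the playoff columns were standardized with their own mean and standard deviation, whereas a model fitted on standardized training data expects new data on the training scale. A minimal sketch of reusing the training statistics, on synthetic feature matrices (the shapes and distributions are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic feature matrices: the validation set is shifted and more spread out.
X_train = rng.normal(loc=110.0, scale=3.0, size=(100, 2))
X_valid = rng.normal(loc=100.0, scale=6.0, size=(40, 2))

# Learn the scaling from the training data only.
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0, ddof=1)  # ddof=1 matches pandas' default .std()

X_train_scaled = (X_train - train_mean) / train_std
X_valid_scaled = (X_valid - train_mean) / train_std  # reuse training statistics

# The validation set keeps its real offset relative to the training data.
print(X_valid_scaled.mean(axis=0))
```

Scaled this way, a shift between regular-season and playoff statistics remains visible to the model instead of being removed by separate standardization.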

# Predicting Playoff Teams

In this section we will use the original models to predict the top 16 teams that make it to the playoffs based on regular season data and compare it to the actual teams that make it. Making it to the playoffs is an indication of a team's success in the regular season since only 16 out of the 30 teams make it to the playoffs based on their wins and performance.

In [None]:
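# The linear models were fit on standardized features, so look up the
# standardized rows that correspond to the current test split
X_test_std = standardized_data.loc[X_test.index].drop('WIN%', axis=1)

# Bayesian model predictions
y_pred_bayesian = ppc['beta_0'].mean(axis=0) + np.dot(X_test_std, ppc['betas'].mean(axis=0))

# Frequentist model predictions
y_pred_frequentist = linear_regression.predict(X_test_std)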

# Random Forest model predictions
y_pred_rf = rf_model.predict(X_test)

# Combine team names and predictions
results = X_test.copy()
results['TEAM'] = merged_data['TEAM']
results['bayesian_pred'] = y_pred_bayesian
results['frequentist_pred'] = y_pred_frequentist
results['rf_pred'] = y_pred_rf

# Reset the index to 0..n-1 so that row labels match positions in the prediction arrays
results.reset_index(inplace=True, drop=True)

# actual playoff teams
actual_playoff_teams = team_data_playoffs['TEAM']

In [None]:
def get_top_n_unique_teams(y_pred, results, n=16):
    # Get the indices sorted by prediction value (descending order)
    sorted_indices = np.argsort(y_pred)[::-1]

    # Get the top n unique team names
    selected_teams = []
    for index in sorted_indices:
        team = results.loc[index, "TEAM"]
        if team not in selected_teams:
            selected_teams.append(team)
            if len(selected_teams) == n:
                break

    return pd.Series(selected_teams, name="TEAM")


# Get the top 16 unique teams for each model
top_16_teams_bayesian = get_top_n_unique_teams(y_pred_bayesian, results)
top_16_teams_frequentist = get_top_n_unique_teams(y_pred_frequentist, results)
top_16_teams_rf = get_top_n_unique_teams(y_pred_rf, results)

# results
print("Top 16 Teams (Bayesian):")
print(top_16_teams_bayesian)
print("\nTop 16 Teams (Frequentist):")
print(top_16_teams_frequentist)
print("\nTop 16 Teams (Random Forest):")
print(top_16_teams_rf)


In [None]:
correct_teams_frequentist = set(top_16_teams_frequentist).intersection(actual_playoff_teams)
correct_teams_bayesian = set(top_16_teams_bayesian).intersection(actual_playoff_teams)
correct_teams_rf = set(top_16_teams_rf).intersection(actual_playoff_teams)

In [None]:
print(f'Correct predictions (Bayesian): {len(correct_teams_bayesian)}')
print(f'Correct predictions (Frequentist): {len(correct_teams_frequentist)}')
print(f'Correct predictions (Random Forest): {len(correct_teams_rf)}')

In this section, we used the original models to predict the 16 teams most likely to make the playoffs. The Bayesian and Frequentist models each got 12 teams right, while Random Forest got 15 right. This is consistent with our model comparison, where Random Forest outperformed both linear models on the validation dataset. It also illustrates a real-world use of these predictions: the selected features can be used to anticipate team success.
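
Note that `get_top_n_unique_teams` ranks each team by its single highest row-level prediction. A possible variant, sketched below on hypothetical player-level predictions, averages the predictions within each team before ranking, so that one outlying row cannot decide a team's position.

```python
import pandas as pd

# Hypothetical player-level predictions (one row per player).
preds = pd.DataFrame({
    "TEAM": ["Boston", "Boston", "Denver", "Denver", "Detroit", "Detroit"],
    "pred": [0.9, 0.5, 0.6, 0.7, 0.95, -1.0],
})

# Average the row-level predictions within each team, then rank teams.
team_scores = preds.groupby("TEAM")["pred"].mean()
top_teams = team_scores.nlargest(2).index.tolist()
print(top_teams)
```

In this toy example Detroit has the highest single prediction but the lowest team average, so the two selection rules disagree about whether it belongs in the top two.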

# Conclusion

Based on the results, decision-makers in the NBA can use the models to inform their decisions related to team composition and strategy, but they should also consider the limitations and uncertainties associated with the models. The Bayesian model provides uncertainty estimates, which can be useful for decision-making, while the frequentist model has slightly better metrics on the test split but failed badly on the playoff validation data. The Random Forest model performed best in predicting team success, but its perfect fit on the test split may indicate overfitting or data leakage, and further investigation is needed to increase confidence in its applicability to future datasets.

A limitation of the models is that they only consider a limited set of features and do not capture factors such as team chemistry, coaching, and injuries. Future work could incorporate additional features and explore different modeling techniques to improve the models' performance and generalizability.

The results also show that the models' predictions are consistent with the actual teams that made it to the playoffs, suggesting their real-world applicability. However, decision-makers should still exercise caution when using the models to inform their decisions and consider additional factors that may impact team success.

In summary, the models provide a useful tool for predicting team success based on selected features, but decision-makers should consider their limitations and uncertainties. Future work could explore additional features and modeling techniques to improve performance and generalizability.

A call to action based on the results is to encourage decision-makers in the NBA to use these models as one of many tools to inform their decisions related to team composition and strategy. The models provide valuable insights into the relationships between features and team success and can help decision-makers make more informed decisions. However, decision-makers should also consider other factors that may impact team success, such as team chemistry, coaching, and injuries, and exercise caution when using the models to inform their decisions.