<h1 style="text-align: center;">Analysis of Point Production of NBA Players</h1>

# Introduction

Using this dataset of NBA player stats, we want to perform a regression analysis on selected variables to help predict the on-court performance of a player (their average points scored). To do so, all non-performance-related stats, including age, height, weight, and draft pick, will be used. The season variable will also allow player performances to be measured over time and will let us compare NBA stars from all eras and styles of the game. All of these factors were selected because they are likely to correlate with stronger on-court performance.

Our question: How do our selected factors help to predict the individual performance of NBA players? 

To do this, we will use [this dataset](https://www.kaggle.com/datasets/justinas/nba-players-data) from Kaggle. The data was acquired through the NBA API with missing information supplied by scraping from another NBA source (for this reason, the dataset was taken directly from Kaggle and not the original source). 

# Methods and Results

## Load Libraries

In [1]:
library(tidyverse)
library(tidymodels)
library(repr)
options(repr.matrix.max.rows = 6)



## Dataset Load & Preparation

First we pull our dataset from the internet.

The dataset contains 22 variables, the majority of which we discard: categorical variables cannot be used, and several numeric variables are unsuited to this model. We will analyze the remaining selected variables to try to predict the average points per game a player scores in a season. We first fit a model on the training data, then use the k-nearest neighbors algorithm to perform regression and assess how well the selected columns predict a player's performance.

In [None]:
url <-"https://raw.githubusercontent.com/mdean808/dsci-100-group-project/b11c50b091b2c4a554a2b7ff8f9e568e081b0f3c/all_seasons.csv"

temp <- tempfile()

download.file(url, temp)
# read the dataset from temp file
player_data <- read_csv(temp)
head(player_data)


For our project, we focus on a specific set of parameters for judging the overall performance of an NBA player: points (pts), the variable we will be predicting, along with rebounds (reb), assists (ast), usage percentage (usg_pct), true shooting percentage (ts_pct), and draft number (draft_number).

Additionally, players drafted in or before 2011 are excluded to keep the data recent and the dataset size manageable.

In [None]:
nba_players <- player_data |>
    filter(draft_year > 2011) |>
    select(pts, reb, ast, usg_pct, ts_pct, draft_number) |>
    # remove undrafted players with no draft number "Undrafted"
    filter(draft_number != "Undrafted") |>
    mutate(draft_number = as.numeric(draft_number))            
nba_players


Now we will split the data into training and testing. We will use the training data to build our regression model and our testing data to measure how well our model performs.

In [None]:
set.seed(1234)
nba_players_split <- initial_split(nba_players, prop = 0.75, strata = pts)
nba_training <- training(nba_players_split)
nba_testing <- testing(nba_players_split)

nba_training


## Dataset Visualization

Now we will look at how each of these parameters relates to points. This gives us an early indication of which parameters strongly influence overall points scored before we build our regression model.

In [None]:
options(repr.plot.width = 10)

nba_plot_rebounds <- ggplot(nba_training, aes(x = reb, y = pts)) +
geom_point(alpha = 0.4) +
labs(x = "Rebound", y = "Average Points Per Game") +
ggtitle("Rebound vs. Average Points Per Game") +
theme(text = element_text(size = 12))

nba_plot_assists <- ggplot(nba_training, aes(x = ast, y = pts)) +
geom_point(alpha = 0.4)+
labs(x = "Assist", y = "Average Points Per Game") +
ggtitle("Assist vs. Average Points Per Game") +
theme(text = element_text(size = 12))

nba_plot_usage_pct <- ggplot(nba_training, aes(x = usg_pct, y = pts)) +
geom_point(alpha = 0.4) +
labs(x = "Usage Percentage", y = "Average Points Per Game") +
ggtitle("Usage Percentage vs. Average Points Per Game") +
theme(text = element_text(size = 12))

nba_plot_true_shooting_pct <- ggplot(nba_training, aes(x = ts_pct, y = pts)) +
geom_point(alpha = 0.4) +
labs(x = "True Shooting Percentage", y = "Average Points Per Game") +
ggtitle("True Shooting Percentage vs. Average Points Per Game") +
theme(text = element_text(size = 12))

nba_plot_number <- ggplot(nba_training, aes(x = draft_number, y = pts)) +
geom_point(alpha = 0.4) +
labs(x = "Draft Number", y = "Average Points Per Game") +
ggtitle("Draft Number vs. Average Points Per Game") +
theme(text = element_text(size = 12))

nba_plot_rebounds
nba_plot_assists
nba_plot_usage_pct
nba_plot_true_shooting_pct
nba_plot_number


As shown in the visualizations above, usage percentage appears to have a strong and somewhat linear relationship with points scored, which makes it a potentially good predictor. The other variables also exhibit trends, with rebounds and assists showing a positive relationship with points scored. Draft number seems to have a weak negative relationship, and true shooting percentage shows a distribution in which the highest point totals cluster around the centre of the graph. The lack of linearity in several of the variables supports the use of k-nearest neighbors over linear regression for this data set.

Shown below are the calculated means of the variables, as well as the total number of data points in the training set.

In [None]:
# compute the mean of each predictor and the training-set size in one call
combined_summary <- nba_training |>
  summarise(
    mean_rebounds = mean(reb),
    mean_assists = mean(ast),
    mean_usage = mean(usg_pct),
    mean_ts = mean(ts_pct),
    mean_number = mean(draft_number, na.rm = TRUE),
    total_rows = n()
  )

combined_summary


## Data Analysis

To predict the player performance variable, we will be using k-nearest neighbors regression, as the relationships between the predictor variables and our performance variable are not all linear, and the performance variable is numeric.
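To make the idea concrete, here is a minimal base-R sketch of k-nearest neighbors regression on a single predictor. The `toy` data frame and the `knn_predict` helper are hypothetical and made up for illustration; the analysis itself relies on the `kknn` engine through tidymodels rather than this function.

```r
# Toy data (made-up values, not taken from the NBA dataset).
toy <- data.frame(
  usg = c(0.10, 0.15, 0.20, 0.25, 0.30),
  pts = c(2, 5, 9, 14, 22)
)

# Predict the response for one new point as the plain average of the
# k closest training rows ("rectangular" weighting means equal weights).
knn_predict <- function(train_x, train_y, new_x, k) {
  distances <- abs(train_x - new_x)
  nearest <- order(distances)[seq_len(k)]
  mean(train_y[nearest])
}

# The three nearest usage values to 0.21 are 0.20, 0.25, and 0.15,
# so the prediction is the mean of 9, 14, and 5.
knn_predict(toy$usg, toy$pts, new_x = 0.21, k = 3)
```

Because the prediction is a local average rather than a fitted line, the method makes no assumption that the relationship is linear.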

First, we prepare the recipe, using rebounds, assists, usage percentage, true shooting percentage, and draft number to predict the points scored. These predictors are all scaled and centered before continuing.

In [None]:
nba_recipe <- recipe(pts ~ reb + ast + usg_pct + ts_pct + draft_number, data = nba_training) |>
step_scale(all_predictors()) |>
step_center(all_predictors())

nba_recipe

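The reason for scaling and centering can be sketched with base R, using `scale()` as a stand-in for the recipe steps. The values below are made up for illustration. Without standardization, a predictor measured in large units (draft number) dominates the Euclidean distance that k-nearest neighbors relies on, while a predictor measured in small units (usage percentage) barely contributes.

```r
# Toy predictor columns on very different scales (made-up values).
m <- cbind(
  usg_pct = c(0.15, 0.20, 0.25, 0.30),
  draft_number = c(1, 15, 30, 58)
)

# Standardize each column: subtract its mean, divide by its standard deviation.
z <- scale(m)

# Euclidean distance between rows 1 and 4, before and after standardizing.
raw_dist <- sqrt(sum((m[1, ] - m[4, ])^2))
std_dist <- sqrt(sum((z[1, ] - z[4, ])^2))

round(colMeans(z), 10)   # each column now has mean 0
apply(z, 2, sd)          # each column now has standard deviation 1
c(raw = raw_dist, standardized = std_dist)
```

In the raw data the distance is almost entirely the difference in draft number; after standardizing, both columns contribute on a comparable footing.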

Then, we prepare the model. We will use tuning with 5-fold cross-validation to determine the best number of neighbours to use.

In [None]:
nba_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
set_engine("kknn") |>
set_mode("regression")

nba_vfold <- vfold_cv(nba_training, v = 5, strata = pts)

nba_workflow <- workflow() |>
add_recipe(nba_recipe) |>
add_model(nba_spec)

nba_workflow



We first want to determine the general range in which the best number of neighbors lies, so we begin at 1 and step up in increments of 5, stopping below 100 (that is, 1, 6, 11, ..., 96).

In [None]:
k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))

nba_results <- nba_workflow |>
  tune_grid(resamples = nba_vfold, grid = k_vals) |>
  collect_metrics() |>
  filter(.metric == "rmse")

nba_results



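For readers who want to see what this kind of tuning does under the hood, the following is a rough base-R sketch of 5-fold cross-validation over a grid of neighbor counts. It uses synthetic data and a hypothetical `knn_predict` helper, and it assigns folds purely at random (no stratification), so it is an approximation of the idea rather than a reproduction of what `tune_grid()` does internally.

```r
set.seed(1)  # arbitrary seed, used only for this illustration

# Synthetic data standing in for one standardized predictor and a response.
n <- 100
x <- runif(n)
y <- 10 * x^2 + rnorm(n, sd = 0.5)

# Equal-weight KNN prediction for a single new point.
knn_predict <- function(train_x, train_y, new_x, k) {
  nearest <- order(abs(train_x - new_x))[seq_len(k)]
  mean(train_y[nearest])
}

# Assign every row to one of 5 folds at random.
folds <- sample(rep(1:5, length.out = n))

# For one k: hold out each fold in turn, predict it from the other four,
# and average the per-fold RMSE.
cv_rmse <- function(k) {
  fold_rmse <- sapply(1:5, function(f) {
    held_out <- folds == f
    preds <- sapply(x[held_out], function(x0) {
      knn_predict(x[!held_out], y[!held_out], x0, k)
    })
    sqrt(mean((y[held_out] - preds)^2))
  })
  mean(fold_rmse)
}

# Coarse grid, capped at 76 because each training portion has only 80 rows.
k_grid <- seq(from = 1, to = 76, by = 5)
results <- data.frame(neighbors = k_grid, mean_rmse = sapply(k_grid, cv_rmse))

# Row with the lowest cross-validated RMSE.
results[which.min(results$mean_rmse), ]
```

Very large values of k push every prediction toward the overall mean, which is why the error rises again at the far end of the grid.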

Then, we can select the number of neighbors from this list with the lowest mean RMSE. Also shown is a graph of the number of neighbours vs. the mean RMSE.

In [None]:
nba_k_plot <- ggplot(nba_results, aes(x = neighbors, y = mean)) +
geom_point() +
labs(x = "Number of neighbours", y = "Mean RMSE") +
ggtitle("Number of neighbours vs. Mean RMSE")

nba_k_plot

nba_k_initial <- nba_results |>
  filter(mean == min(mean))
nba_k_initial


To determine the single best value, we narrow down the range of neighbors we investigate to be between 1 and 15, and step by 1 to find the best value.

In [None]:
k_vals_narrow <- tibble(neighbors = seq(from = 1, to = 15, by = 1))

nba_results_2 <- nba_workflow |>
  tune_grid(resamples = nba_vfold, grid = k_vals_narrow) |>
  collect_metrics() |>
  filter(.metric == "rmse")

nba_results_2

nba_smallest <- nba_results_2 |>
  filter(mean == min(mean))
nba_smallest



Next, we prepare the final model, using the optimal number of neighbors.

In [None]:
neighbors_val <- nba_smallest |>
pull(neighbors)

nba_tuned_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = neighbors_val) |>
  set_engine("kknn") |>
  set_mode("regression")

nba_fit <- workflow() |>
  add_recipe(nba_recipe) |>
  add_model(nba_tuned_spec) |>
  fit(data = nba_training)

nba_predict <- nba_fit |>
  predict(nba_testing) |>
  bind_cols(nba_testing)

nba_rmse <- nba_predict |>
  metrics(truth = pts, estimate = .pred) |>
  filter(.metric == 'rmse')

nba_predict_select <- nba_predict |>
select(.pred, pts)

nba_predict_select

nba_rmse

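The error metric computed above is the root mean squared error evaluated on the testing set. Assuming RMSPE stands for root mean squared prediction error, it can be written as follows, where $n$ is the number of testing observations, $y_i$ is the observed points value for observation $i$, and $\hat{y}_i$ is the model's prediction for it:

$$
\text{RMSPE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
$$

The result is in the same units as the response, so it can be read directly as a typical prediction error in points per game.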

The final RMSPE for our model against the testing data is 1.9338. While this is still a significant amount of error, in the context of predicting sports performance, the predictions offered by our model are still useful. Visualization of the completed model's performance is difficult owing to the numerous variables involved. Below, each variable used to predict player score is depicted separately.

In [None]:
options(repr.plot.width = 10)
nba_predict_plot_rebound <- ggplot(nba_training, aes(x = reb, y = pts)) +
geom_point(alpha = 0.4) +
geom_line(data = nba_predict, mapping = aes(x = reb, y = .pred), color = 'blue') +
labs(x = "Rebound", y = "Points") +
ggtitle("Rebound vs. Points Scored with Estimated Values") +
theme(text = element_text(size = 12))

nba_predict_plot_rebound


In [None]:
nba_predict_plot_assist <- ggplot(nba_training, aes(x = ast, y = pts)) +
geom_point(alpha = 0.4) +
geom_line(data = nba_predict, mapping = aes(x = ast, y = .pred), color = 'blue') +
labs(x = "Assists", y = "Points") +
ggtitle("Assists vs. Points Scored with Estimated Values") +
theme(text = element_text(size = 12))

nba_predict_plot_assist


In [None]:
options(repr.plot.width = 18)

nba_predict_plot_usage <- ggplot(nba_training, aes(x = usg_pct, y = pts)) +
geom_point(alpha = 0.4) +
geom_line(data = nba_predict, mapping = aes(x = usg_pct, y = .pred), color = 'blue') +
labs(x = "Usage Percent", y = "Points") +
ggtitle("Usage Percent vs. Points Scored with Estimated Values") +
theme(text = element_text(size = 14))

nba_predict_plot_usage


In [None]:
options(repr.plot.width = 14)

nba_predict_plot_shooting <- ggplot(nba_training, aes(x = ts_pct, y = pts)) +
geom_point(alpha = 0.4) +
geom_line(data = nba_predict, mapping = aes(x = ts_pct, y = .pred), color = 'blue') +
labs(x = "True Shooting Percentage", y = "Points") +
ggtitle("True Shooting Percent vs. Points Scored with Estimated Values") +
theme(text = element_text(size = 12))

nba_predict_plot_shooting


In [None]:
options(repr.plot.width = 10)
nba_predict_plot_dn <- ggplot(nba_training, aes(x = draft_number, y = pts)) +
geom_point(alpha = 0.4) +
geom_line(data = nba_predict, mapping = aes(x = draft_number, y = .pred), color = 'blue') +
labs(x = "Draft Number", y = "Points") +
ggtitle("Draft Number vs. Points Scored with Estimated Values") +
theme(text = element_text(size = 12))

nba_predict_plot_dn


Observe that the prediction lines in these visualizations are very jagged, which gives the impression of overfitting, with the model appearing to be heavily affected by individual data points. Originally we attributed this to the low number of neighbors (7) relative to the data size; however, this behaviour is still present even when the number of neighbors is increased to 100.

In [None]:
nba_demo_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 100) |>
  set_engine("kknn") |>
  set_mode("regression")

nba_fit_demo <- workflow() |>
  add_recipe(nba_recipe) |>
  add_model(nba_demo_spec) |>
  fit(data = nba_training)

nba_predict_demo <- nba_fit_demo |>
  predict(nba_testing) |>
  bind_cols(nba_testing)

nba_predict_demo_plot <- ggplot(nba_training, aes(x = reb, y = pts)) +
geom_point(alpha = 0.4) +
geom_line(data = nba_predict_demo, mapping = aes(x = reb, y = .pred), color = 'orange') +
labs(x = "Rebound", y = "Points") +
ggtitle("Rebound vs. Points Scored with Estimated Values Using 100 Neighbors") +
theme(text = element_text(size = 12))


nba_predict_demo_plot
nba_predict_plot_rebound

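One possible explanation for the jagged lines, independent of the number of neighbors, is that each prediction depends on all five predictors while each plot shows only one of them. The sketch below illustrates this with synthetic data and a made-up, perfectly smooth prediction rule; it is not the fitted NBA model. Even a smooth rule looks jagged when its predictions are ordered along a single axis while a second predictor varies freely.

```r
set.seed(2)  # arbitrary seed, used only for this illustration

# Synthetic example: predictions depend smoothly on two predictors.
n <- 200
x1 <- runif(n)
x2 <- runif(n)
pred <- 5 * x1 + 5 * x2   # stands in for a perfectly smooth model

# Order points by x1, as geom_line() does along the x axis.
ord <- order(x1)

# Average jump between consecutive points along the plotted line.
jump_all <- mean(abs(diff(pred[ord])))

# Same calculation with x2 held fixed at its mean, so only x1 varies.
pred_fixed <- 5 * x1 + 5 * mean(x2)
jump_fixed <- mean(abs(diff(pred_fixed[ord])))

c(all_predictors = jump_all, x2_held_fixed = jump_fixed)
```

If this is what is happening, a plot of predicted against observed points may be a more informative summary of model fit than a line drawn over a single predictor.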

# Discussion

Through this analysis, we determined that by using certain measurements of player performance, such as rebounds per game, assists per game, usage percentage, true shooting percentage, and draft number, we were able to provide reasonable estimates of a player's average points per game. The final model had an RMSPE of 1.9 on our testing set. Though this is still a significant amount of error, we think that the predictions offered by this model could still be useful ((TODO: explain why)). Although we had expected these variables to have some capability in predicting point performance, the accuracy of the predictions was somewhat unexpected.


# References
* NBA players data, Kaggle dataset (justinas): https://www.kaggle.com/datasets/justinas/nba-players-data
