# Random Forest Model to Predict Kobe's Shot-making Probabilities

## Introduction

This notebook was created for the Kobe Bryant Shot Selection competition here on Kaggle. For it, we were given a data set containing information on each of Kobe's shots from his 20-year professional basketball career. The outcomes of 5000 shots were removed (set to NA), and those shots form the testing data set. All the other shots are used to train the predictive model.

The original features of the data are:
action_type,
combined_shot_type,
game_event_id,
game_id,
lat,
loc_x,
loc_y,
lon,
minutes_remaining,
period,
playoffs,
season,
seconds_remaining,
shot_distance,
shot_type (whether it is a 2 or a 3),
shot_zone_area (areas of the court),
shot_zone_basic (e.g. “restricted area”),
shot_zone_range,
team_id (just the Lakers),
team_name (just the Lakers),
game_date,
matchup,
opponent (e.g. UTA),
shot_id
Response Variable: shot_made_flag


## Feature Engineering

In [None]:
library(tidyverse)
library(vroom)
library(embed)
library(tidymodels)

In [None]:
# Load data
full_data <- vroom("/kaggle/input/kobe-bryant-shot-selection/data.csv.zip")

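Using the Euclidean distance formula, I calculated the shot distance from the x and y coordinates of the shot. Both coordinates are divided by 10 first, which converts them to feet assuming they are recorded in tenths of a foot. Note that this overwrites the original "shot_distance" column.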

In [None]:
# Create distance column
full_data$shot_distance <- sqrt((full_data$loc_x/10)^2 + (full_data$loc_y/10)^2)
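
As a quick sanity check of the formula, here is a minimal sketch on a few made-up shots. The coordinates are invented for illustration, and the assumption that they are in tenths of a foot comes only from the division by 10 above.

```r
# Toy shot coordinates (hypothetical values, assumed to be in tenths of a foot)
shots <- data.frame(loc_x = c(0, 30, -150), loc_y = c(0, 40, 200))

# Same formula as above: Euclidean distance from the origin after rescaling
shots$shot_distance <- sqrt((shots$loc_x / 10)^2 + (shots$loc_y / 10)^2)

print(shots)
```

The sign of loc_x does not matter here, so shots from the left and right sides at the same spot get the same distance.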

Using trigonometry, I calculated the angle at which Kobe took a given shot as atan(loc_y / loc_x). Shots with loc_x equal to 0 are set to pi/2 to avoid dividing by zero.

In [None]:
# Create angle column
loc_x_zero <- full_data$loc_x == 0
full_data['angle'] <- rep(0,nrow(full_data))
full_data$angle[!loc_x_zero] <- atan(full_data$loc_y[!loc_x_zero] / full_data$loc_x[!loc_x_zero])
full_data$angle[loc_x_zero] <- pi / 2
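
To see what this angle looks like on concrete inputs, here is a small sketch with invented coordinates. It also shows atan2() for comparison; that is an alternative I am adding for illustration, not what the notebook uses, and it gives different values for shots on the left side of the court and for a shot at exactly (0, 0).

```r
# Hypothetical coordinates: straight on, right side, left side, at the origin
loc_x <- c(0, 100, -100, 0)
loc_y <- c(50, 100, 100, 0)

# Notebook's approach: atan(y / x), with pi/2 wherever loc_x is 0
angle <- ifelse(loc_x == 0, pi / 2, atan(loc_y / loc_x))

# Alternative (not the notebook's method): atan2 needs no special case for
# loc_x == 0 and keeps left-side shots distinct from right-side shots
angle_atan2 <- atan2(loc_y, loc_x)

print(data.frame(loc_x, loc_y, angle, angle_atan2))
```

With atan(), a left-side shot gets a negative angle, while atan2() places it between pi/2 and pi. Either encoding can work for a tree-based model; the point is only to know which one the feature holds.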

I converted "minutes_remaining" to seconds and added "seconds_remaining" to get a single "time_remaining" column in seconds.

In [None]:
# Create single, time variable column
full_data$time_remaining <- (full_data$minutes_remaining*60)+full_data$seconds_remaining

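Given that players tend to perform better at home than away, I altered the "matchup" column to be a binary "Home" or "Away" column. A matchup string containing "vs." marks a home game; anything else is treated as an away game.

In [None]:
# Create home and away column ("vs." is matched literally, not as a regex)
full_data$matchup <- ifelse(str_detect(full_data$matchup, fixed('vs.')), 'Home', 'Away')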
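
The matching rule can be checked on a few example strings. This is a base R sketch; the matchup strings are illustrative guesses at the format, not values copied from the data.

```r
# Hypothetical matchup strings in the assumed "TEAM vs. OPP" / "TEAM @ OPP" format
matchup <- c("LAL vs. UTA", "LAL @ UTA", "LAL vs. BOS")

# fixed = TRUE treats "vs." as a literal string, so the dot is not a wildcard
home_away <- ifelse(grepl("vs.", matchup, fixed = TRUE), "Home", "Away")

print(data.frame(matchup, home_away))
```

Without a literal match, the pattern "vs." is a regular expression in which the dot matches any character, so it would also match text such as "vs" followed by a space.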

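I simplified the "season" column by converting each label to a season number: "1996-97" becomes 1 (Kobe's first season), "2000-01" becomes 5, and so on. Keeping only the last digit of the label would give seasons ten years apart, such as "2000-01" and "2010-11", the same value.

In [None]:
# Create season column: season start year minus 1995, so 1996-97 is season 1
full_data$season <- as.integer(substr(full_data$season, 1, 4)) - 1995L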
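
When recoding season labels it helps to look at a few concrete cases. The sketch below compares two mappings on example labels: keeping one digit of the ending year, and subtracting 1995 from the starting year. The second assumes 1996-97 counts as season 1.

```r
# Example season labels in the "YYYY-YY" format
season <- c("1996-97", "2000-01", "2006-07", "2015-16")

# Second character of the part after the hyphen (last digit of the end year)
last_digit <- substr(sub(".*-", "", season), 2, 2)

# Season number counted from the start year (assumes 1996-97 is season 1)
season_num <- as.integer(substr(season, 1, 4)) - 1995L

print(data.frame(season, last_digit, season_num))
```

The single-digit version cannot tell 1996-97 and 2006-07 apart, while the start-year version gives every season its own number.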

I created a new "game_num" column that ranks the "game_date" values, so each game gets its chronological game number (1 for the earliest game in the data). Calling as.numeric() directly on a Date returns the number of days since 1970-01-01 rather than a game count, so the game-number thresholds used below would not mean anything.

In [None]:
# Create game number column: dense rank of the game dates (1 = earliest game)
full_data$game_num <- dense_rank(full_data$game_date)

Kobe tore his Achilles tendon late in the 2012-13 season, which could have affected his shooting ability. I created a column that labels any game after the injury (game number above 1452) with a 1 and any earlier game with a 0. If the column happens to contain only 0s in the training data, it doesn't affect anything, so it is safe to keep.

In [None]:
# Create Achilles injury status column
full_data$postachilles <- ifelse(full_data$game_num > 1452, 1, 0)

I also created a column indicating the games during Kobe's MVP season (2007-08, game numbers 909 through 990). As with the injury flag, it is harmless if it turns out to be constant in the training data.

In [None]:
# Create MVP status column
full_data$mvp <- ifelse(full_data$game_num >= 909 & full_data$game_num <= 990, 1, 0)
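
Because both flags depend on game numbers, it is worth checking how a date turns into a number. Here is a base R sketch on made-up dates showing that as.numeric() on a Date gives days since 1970-01-01, and one way to get a chronological game index instead.

```r
# Hypothetical game dates, with two shots from the same game
game_date <- as.Date(c("1996-11-03", "1996-11-05", "1996-11-03", "1996-11-06"))

# Days since 1970-01-01: large values that are not game counts
days_since_epoch <- as.numeric(game_date)

# Chronological game index: position of each date among the sorted unique dates
game_num <- match(game_date, sort(unique(game_date)))

print(data.frame(game_date, days_since_epoch, game_num))
```

Every value of days_since_epoch here is already far above 1452, so a threshold written in game numbers only makes sense when applied to something like game_num.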

I removed unnecessary columns (including a few with zero variance) to reduce noise in the data during prediction and cross-validation. I saved the altered data to a new variable so I can still access the "shot_id" column when creating the Kaggle submission at the end without including it in my analysis.

In [None]:
# Remove unnecessary columns
new_data <- full_data |>
  select(-c('shot_id', 'team_id', 'team_name', 'shot_zone_range', 'lon', 'lat',
            'seconds_remaining', 'minutes_remaining', 'game_event_id',
            'game_id', 'game_date','shot_zone_area',
            'shot_zone_basic', 'loc_x', 'loc_y'))

## Create Training and Testing Data Sets

The training data consists of the rows where the outcome of the shot is known, i.e. where "shot_made_flag" is not NA.

In [None]:
# Train
kobe_train <- new_data |>
  filter(!is.na(shot_made_flag))

The testing data consists of the remaining rows, where "shot_made_flag" is NA. The empty response column is dropped.

In [None]:
# Test
kobe_test <- new_data |>
  filter(is.na(shot_made_flag)) |>
  select(-shot_made_flag)

Since our response variable is a set of "yes" and "no" outcomes represented by 1s and 0s, we need to ensure the model reads "shot_made_flag" as a categorical variable.

In [None]:
# Make the response variable into a factor
kobe_train$shot_made_flag <- as.factor(kobe_train$shot_made_flag)

We must create a recipe to apply to both the training and testing data sets. First, it turns the "period" column into a factor (so the period numbers are treated as categories rather than quantities). "step_novel" assigns a new level to any categorical value that appears in the testing data but not in the training data, and "step_unknown" assigns missing categorical values to an "unknown" level. "step_dummy" then dummy encodes all categorical variables.

In [None]:
# Create Recipe
kobe_recipe <- recipe(shot_made_flag~., data = kobe_train) |>
    step_mutate(period = as.factor(period)) |>
    step_novel(all_nominal_predictors()) |>
    step_unknown(all_nominal_predictors()) |>
    step_dummy(all_nominal_predictors())
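
To make the effect of these steps concrete, here is a minimal sketch on toy data. The data frames and values are invented for illustration; only the recipe steps mirror the ones above. The test set contains one opponent that never appears in training and one missing value.

```r
library(recipes)

# Toy training data (hypothetical values)
train <- data.frame(
  shot_made_flag = factor(c(1, 0, 1, 0)),
  opponent       = c("UTA", "BOS", "UTA", "BOS"),
  shot_distance  = c(2, 15, 24, 8)
)

# Toy test data: "NYK" is unseen in training, and one opponent is missing
test <- data.frame(
  opponent      = c("UTA", "NYK", NA),
  shot_distance = c(5, 20, 12)
)

toy_recipe <- recipe(shot_made_flag ~ ., data = train) |>
  step_novel(all_nominal_predictors()) |>    # unseen categories get their own level
  step_unknown(all_nominal_predictors()) |>  # missing categories get their own level
  step_dummy(all_nominal_predictors())       # one indicator column per non-reference level

# Estimate the recipe on the training data, then apply it to the test data
baked <- toy_recipe |> prep() |> bake(new_data = test)
print(baked)
```

The baked test set has no missing values in the dummy columns, which is what lets the model predict on rows whose categories it never saw during training.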

## Random Forest Model

In [None]:
# Run code in parallel
library(doParallel)

num_cores <- parallel::detectCores()

cl <- makePSOCKcluster(num_cores)

registerDoParallel(cl)

This sets up the model. Out of all the models I tried, a classification random forest worked best. I will tune the "mtry" and "min_n" hyperparameters via cross-validation in the following few code chunks.

In [None]:
# Create a workflow with model & recipe
kobe_forest <- rand_forest(mtry = tune(),
                           min_n = tune(),
                           trees = 800) |>
  set_engine("ranger") |>
  set_mode("classification")


kobe_wf <- workflow() |>
  add_recipe(kobe_recipe) |>
  add_model(kobe_forest)

This sets up different values to try for the hyperparameters during cross-validation.

In [None]:
# Set up grid of tuning values
forest_grid <- grid_regular(mtry(range = c(1,(ncol(kobe_train)-1))),
                            min_n(),
                            levels = 3)
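
A regular grid with 3 levels for each of 2 parameters contains 3 x 3 = 9 candidate combinations. The sketch below builds a comparable grid; the upper bound of 10 for mtry is an arbitrary stand-in for the number of predictor columns.

```r
library(dials)

# 3 values of mtry crossed with 3 values of min_n (default min_n range)
toy_grid <- grid_regular(mtry(range = c(1, 10)),
                         min_n(),
                         levels = 3)

print(toy_grid)
nrow(toy_grid)
```

Each row is fit once per fold during tuning, so the total number of model fits is the number of grid rows times the number of folds.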

This sets up a 3-fold cross-validation.

In [None]:
# Set up K-fold CV
folds <- vfold_cv(kobe_train, v = 3, repeats=1)

This runs the cross-validation and scores each hyperparameter combination by its mean log loss, averaged over the folds. I then selected the combination with the lowest mean log loss.

In [None]:
# Find the best tuning parameters
CV_results <- kobe_wf |>
  tune_grid(resamples=folds,
            grid=forest_grid,
            metrics=metric_set(mn_log_loss))

bestTune <- CV_results |>
  select_best(metric="mn_log_loss")

bestTune$min_n
bestTune$mtry
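
For reference, the mean log loss of binary predictions can be computed by hand. This is a base R sketch with invented outcomes and probabilities; the helper name mean_log_loss is hypothetical, and the result should correspond to what "mn_log_loss" reports for a two-class problem.

```r
# Hypothetical helper: average negative log-likelihood of the observed outcomes
mean_log_loss <- function(y, p) {
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

y <- c(1, 0, 1)        # actual outcomes (1 = made, 0 = missed)
p <- c(0.9, 0.2, 0.6)  # predicted probability that the shot was made

ll <- mean_log_loss(y, p)
print(ll)
```

Lower is better: confident predictions that turn out wrong are penalized heavily, which is why the competition metric rewards well-calibrated probabilities rather than hard 0/1 guesses.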

## Predictions

This finalizes the workflow with the newly-tuned model, and it predicts on the test set.

In [None]:
# Finalize workflow and predict
final_wf <- kobe_wf |>
  finalize_workflow(bestTune) |>
  fit(data=kobe_train)

preds <- final_wf |>
  predict(new_data=kobe_test,
          type="prob")

We then format the predictions to meet the Kaggle competition requirements (a "shot_id" column and a "shot_made_flag" probability column) and write the submission file.

In [None]:
# Write Kaggle Submission
kaggle_submission <- full_data |> 
  filter(is.na(shot_made_flag)) |>
  bind_cols(preds) |> 
  select(shot_id, .pred_1) |> 
  rename(shot_made_flag = .pred_1)

vroom_write(x=kaggle_submission, file="kobe_forest.csv", delim = ",")

stopCluster(cl)