<h1 style="text-align: center;">NBA Shot Predictor</h1>
<h5 style="text-align: center;">Oliver Lee</h5>


## 1. Data Collection and Preprocessing  
The goal of this project is to train a model that predicts the likelihood that a shot is made from factors such as shot location, shot type, and player stats. The main training data is available at https://github.com/DomSamangy/NBA_Shots_04_25. It contains every shot taken in the NBA from 2004 to 2025, with features such as player, shot type, and shot location.

We then merge this data with individual player statistics from the NBA API, as shown below.

In [None]:
import pandas as pd
from tqdm.notebook import tqdm
import time
from nba_api.stats.endpoints import PlayerDashboardByYearOverYear

#### Define function to fetch stats for a single player

This function uses the NBA API to get field goal %, 3-point %, and minutes played for a given player in the specified season.

In [None]:
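def get_player_stats(player_id, season='2024-25'):
    try:
        dash = PlayerDashboardByYearOverYear(player_id=player_id, season=season)
        # Take the second returned frame and keep only the requested season
        df = dash.get_data_frames()[1]
        latest_season = df[df['GROUP_VALUE'] == season]
        if latest_season.empty:
            return None  # No row for this season; the caller records the ID as failed
        stats = latest_season[['FG_PCT', 'FG3_PCT', 'MIN']].copy()
        stats['PLAYER_ID'] = player_id
        return stats
    except Exception as e:
        # Report the error instead of failing silently
        print(f"Failed to fetch stats for player {player_id}: {e}")
        return None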

#### Load the raw shot data and fetch stats for unique player IDs

In [None]:
original_df = pd.read_csv("./raw_data/NBA_2025_Shots.csv")
unique_ids = original_df['PLAYER_ID'].unique()
print(f"Loaded {len(original_df)} shot records for {len(unique_ids)} unique players.")

all_stats = []
failed_ids = []

for pid in tqdm(unique_ids, desc="Fetching Player Stats"):
    stats_df = get_player_stats(pid)
    if stats_df is not None:
        all_stats.append(stats_df)
    else:
        failed_ids.append(pid)
    time.sleep(0.5)  # Delay to avoid API rate limit
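
The loop above gives each player a single attempt, so a transient timeout sends that player straight to `failed_ids`. One possible refinement is to retry with exponential backoff. The sketch below is an illustration and not part of the original notebook: `fetch_with_retry`, its parameters, and the stand-in `flaky_fetch` function (with its dummy return value) are hypothetical names introduced here, and the demonstration makes no network calls.

```python
import time


def fetch_with_retry(fetch, player_id, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(player_id), retrying on any exception with exponential backoff.

    Returns (result, None) on success, or (None, last_error) once all
    attempts have failed.
    """
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(player_id), None
        except Exception as exc:  # broad on purpose: network errors vary
            last_error = exc
            if attempt < retries - 1:
                # Wait base_delay, then 2x, then 4x, ... between attempts
                sleep(base_delay * 2 ** attempt)
    return None, last_error


# Stand-in fetcher (hypothetical) that fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch(player_id):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return {"PLAYER_ID": player_id, "FG_PCT": 0.5}  # dummy value


# sleep is replaced with a no-op so the demonstration runs instantly
result, error = fetch_with_retry(flaky_fetch, 1629673, sleep=lambda s: None)
print(result, error)
```

In the notebook, a function that raises on failure (rather than returning `None`) would be passed as `fetch`, and the existing `time.sleep(0.5)` between players would stay in place.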

#### Merge the fetched stats with the original shot data

We'll combine all player stats, merge them with the original dataframe, then save the results.

In [None]:
if all_stats:
    stats_combined = pd.concat(all_stats, ignore_index=True)
    merged_df = original_df.merge(stats_combined, on='PLAYER_ID', how='left')
    
    # Preview the merged data
    display(merged_df.head())
    
    # Save merged data to CSV
    merged_df.to_csv("./merged_data/24_25_allstats.csv", index=False)
    print("Saved merged stats to './merged_data/24_25_allstats.csv'.")
else:
    print("No player stats were retrieved.")

if failed_ids:
    print(f"Failed to fetch stats for {len(failed_ids)} players:")
    print(failed_ids)
else:
    print("Successfully fetched stats for all players.")
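
A left merge silently duplicates shot rows if the stats table ever holds more than one row for the same `PLAYER_ID`, and it leaves `NaN` stats for players whose fetch failed. The sketch below shows one way to guard against both using pandas' `validate` and `indicator` options. The toy frames and their values are made up for illustration; only the column names come from the notebook.

```python
import pandas as pd

# Toy stand-ins for original_df and stats_combined (values are invented)
shots = pd.DataFrame({
    "PLAYER_ID": [1, 1, 2, 3],
    "SHOT_MADE": [True, False, True, False],
})
stats = pd.DataFrame({
    "PLAYER_ID": [1, 1, 2],          # player 1 appears twice; player 3 is missing
    "FG_PCT": [0.45, 0.45, 0.50],
    "FG3_PCT": [0.35, 0.35, 0.40],
    "MIN": [1000.0, 1000.0, 1500.0],
})

# Keep one stats row per player so the merge cannot multiply shot rows
stats_unique = stats.drop_duplicates(subset="PLAYER_ID", keep="first")

# validate raises MergeError if the right side still has duplicate keys;
# indicator adds a _merge column marking rows with no matching stats
merged = shots.merge(
    stats_unique,
    on="PLAYER_ID",
    how="left",
    validate="many_to_one",
    indicator=True,
)

missing = merged.loc[merged["_merge"] == "left_only", "PLAYER_ID"].unique()
merged = merged.drop(columns="_merge")

assert len(merged) == len(shots)  # the merge preserved the shot count
print(f"Players without stats: {list(missing)}")
```

Keeping the first duplicate is a simplification; which row to keep for a player with several season rows depends on what those rows represent.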

This merging process takes a while because each request is followed by a delay to stay within the API's rate limit. The final merged data looks like this (the sample rows shown are from the 2023-24 season):

| SEASON_1 | SEASON_2 | TEAM_ID    | TEAM_NAME          | PLAYER_ID | PLAYER_NAME  | POSITION_GROUP | POSITION | GAME_DATE  | GAME_ID  | HOME_TEAM | AWAY_TEAM | EVENT_TYPE  | SHOT_MADE | ACTION_TYPE               | SHOT_TYPE         | BASIC_ZONE             | ZONE_NAME          | ZONE_ABB | ZONE_RANGE | LOC_X | LOC_Y  | SHOT_DISTANCE | QUARTER | MINS_LEFT | SECS_LEFT | FG_PCT | FG3_PCT | MIN          |
|----------|----------|------------|--------------------|-----------|--------------|----------------|----------|------------|----------|-----------|-----------|-------------|-----------|---------------------------|-------------------|------------------------|--------------------|----------|------------|-------|--------|---------------|---------|-----------|-----------|--------|---------|--------------|
| 2024     | 2023-24  | 1610612764 | Washington Wizards | 1629673   | Jordan Poole | G              | SG       | 11-03-2023 | 22300003 | MIA       | WAS       | Missed Shot | False     | Driving Floating Jump Shot | 2PT Field Goal    | In The Paint (Non-RA)  | Center             | C        | 8-16 ft.   | -0.4  | 17.45  | 12            | 1       | 11        | 1         | 0.413  | 0.326   | 2345.555     |
| 2024     | 2023-24  | 1610612764 | Washington Wizards | 1630166   | Deni Avdija  | F              | SF       | 11-03-2023 | 22300003 | MIA       | WAS       | Made Shot   | True      | Jump Shot                 | 3PT Field Goal    | Above the Break 3       | Center             | C        | 24+ ft.    | 1.5   | 30.55  | 25            | 1       | 10        | 26        | 0.506  | 0.374   | 2256.6433333 |

## 2. Training a Random Forest Classifier

In [12]:
import pandas as pd
import numpy as np
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


#### Removing Unrelated Features

Some features are removed before training because they should have no impact on the shot outcome. We also drop PLAYER_ID here but keep PLAYER_NAME as an easier way to identify each player. PLAYER_NAME is excluded from the training features and used only to label the test rows. y consists of SHOT_MADE, the target label for this experiment.

In [None]:
df = pd.read_csv('./merged_data/24_25_allstats.csv')
df = df.drop(columns=['SEASON_2', 'GAME_ID', 'ZONE_ABB', 'EVENT_TYPE', 'GAME_DATE',
                     'PLAYER_ID', 'TEAM_ID', 'TEAM_NAME'])

X = df.drop(columns=['SHOT_MADE', 'PLAYER_NAME'])
y = df['SHOT_MADE'].astype(int)
X_encoded = pd.get_dummies(X)
X_encoded['PLAYER_NAME'] = df['PLAYER_NAME']

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded.drop(columns=['PLAYER_NAME']),  # Remove PLAYER_NAME for training
    y, 
    test_size=0.2, 
    stratify=y,
    random_state=42
)

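# Look up player names by index label (.loc), which stays correct even if the
# index is not a default 0..n-1 range. Despite its name, this holds PLAYER_NAME values.
test_player_ids = X_encoded.loc[X_test.index, 'PLAYER_NAME']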

Finally, we train the random forest on X_train and y_train and store the model for analysis. For this project, I trained the model on the 2024-25 season and tested it on data from previous seasons.

In [None]:
model = RandomForestClassifier(n_estimators=100, random_state=42, verbose=1)
model.fit(X_train, y_train)

joblib.dump({
    'model': model,
    'test_player_ids': test_player_ids,
    'feature_names': X_train.columns
}, './models/random_forest_24_25.joblib')
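
When the saved model is applied to another season, `pd.get_dummies` may produce a different set of columns (for example, an action type or position that appears in one season but not the other). The saved `feature_names` can be used to align the new data with the training layout. The following is a minimal sketch of that alignment on invented toy rows; it is an assumption about how the evaluation step could be done, not the notebook's actual evaluation code.

```python
import pandas as pd

# Toy training-season rows (invented values; column names from the notebook)
train_raw = pd.DataFrame({
    "SHOT_DISTANCE": [2, 25, 12],
    "SHOT_TYPE": ["2PT Field Goal", "3PT Field Goal", "2PT Field Goal"],
    "POSITION": ["SG", "SF", "C"],
})

# Toy rows from another season, with a position the training data never saw
new_raw = pd.DataFrame({
    "SHOT_DISTANCE": [24, 5],
    "SHOT_TYPE": ["3PT Field Goal", "2PT Field Goal"],
    "POSITION": ["PG", "SG"],
})

# Plays the role of the 'feature_names' entry stored with the model
feature_names = pd.get_dummies(train_raw).columns

# Drop columns the model never saw, add missing ones as 0, and
# put everything in the training column order
new_encoded = pd.get_dummies(new_raw).reindex(columns=feature_names, fill_value=0)

print(list(new_encoded.columns))
```

With the real model, `feature_names` would come from the dictionary returned by `joblib.load`, and `new_encoded` could then be passed to `model.predict_proba`. Categories unseen in training (here `POSITION_PG`) are dropped, so those rows are scored with all dummy columns for that feature set to 0.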