## NBA 2022-23 Player Stats

This notebook is focused on two key tasks related to NBA player statistics from the 2022-23 season. The primary objectives are:

### 1. Position Prediction:

- Utilizing machine learning techniques to predict an NBA player's position (point guard, shooting guard, small forward, power forward or center) based on their statistical performance.
- Leveraging a dataset containing player per-game statistics from the 2022-23 season, including points per game (PTS), assists per game (AST), two-point field goal percentage (2P%), free throw percentage (FT%), and other relevant metrics.
- Employing classification algorithms to build a model capable of assigning player positions, which could support team management and strategic decision-making.

### 2. Player Comparison via Nearest Neighbors:

- Implementing a nearest neighbors algorithm to identify and rank the players whose statistical profiles are most similar to a given player.
- Extracting and normalizing key player statistics to create a multi-dimensional feature space.
- Calculating distances between players in this feature space to identify "nearest neighbors": players with the most comparable in-game performance.
- Providing a tool for player scouting, team analysis, and potential player trades or acquisitions based on similarity metrics.

In [151]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')


df = pd.read_csv('nba_2022-23_per_game_stats.csv')
adv = pd.read_csv('nba_2022-23_advanced_stats.csv')

df.drop(['Player Number', 'Player-additional'], axis=1, inplace=True)
adv.drop(['Player Number', 'GP', 'Position', 'Age', 'Unnamed: 19', 'Unnamed: 24', 'Player-additional'], axis=1, inplace=True)
adv.columns


Index(['Player Name', 'Team', 'Total Minutes', 'PER', 'TS%', '3PAr', 'FTr',
       'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'TOV%', 'USG%', 'OWS',
       'DWS', 'WS', 'WS/48', 'OBPM', 'DBPM', 'BPM', 'VORP'],
      dtype='object')

In [152]:
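# Keep the combined 'TOT' row for each traded player ('Team' column).
# .copy() avoids writing into a view of df (SettingWithCopyWarning).
df_filtered = df[df['Team'] == 'TOT'].copy()

for idx, row in df_filtered.iterrows():
    name = row['Player Name']
    # Replace 'TOT' with the slash-joined teams the player appeared for
    player_teams = df.loc[df['Player Name'] == name, 'Team'].unique()
    teams = '/'.join(t for t in player_teams if t != 'TOT')
    df_filtered.at[idx, 'Team'] = teams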


non_duplicate_rows = df.drop_duplicates(subset='Player Name', keep=False)
# Concatenate the filtered row with the rest of the DataFrame
result_df = pd.concat([df_filtered, non_duplicate_rows], axis=0)

# Same treatment for the advanced stats: keep each traded player's 'TOT' row
adv_filtered = adv[adv['Team'] == 'TOT'].copy()

for idx, row in adv_filtered.iterrows():
    name = row['Player Name']
    # Replace 'TOT' with the slash-joined teams the player appeared for
    player_teams = adv.loc[adv['Player Name'] == name, 'Team'].unique()
    teams = '/'.join(t for t in player_teams if t != 'TOT')
    adv_filtered.at[idx, 'Team'] = teams


non_duplicate_rows_adv = adv.drop_duplicates(subset='Player Name', keep=False)

# Concatenate the filtered row with the rest of the DataFrame
result_adv = pd.concat([adv_filtered, non_duplicate_rows_adv], axis=0)

stats = pd.merge(result_df, result_adv, on='Player Name', how='inner')



#result_df.to_csv('nba_per_game_processed.csv')

stats.drop(['Team_y'], axis=1, inplace=True)
stats.rename({'Team_x': 'Team'}, axis=1, inplace=True)
stats.to_csv('nba_2022-23_all_stats.csv')
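
The traded-player handling in the cell above can be checked on a toy frame. The sketch below is illustrative only: the player names, team codes and the helper name `collapse_traded_players` are made up for the example and are not part of the dataset or the notebook. Unlike the cell above, it also resets the index so the result is easy to inspect.

```python
import pandas as pd

def collapse_traded_players(frame):
    """Keep one row per player: the 'TOT' row if traded, the only row otherwise."""
    tot_rows = frame[frame['Team'] == 'TOT'].copy()
    for idx, row in tot_rows.iterrows():
        same_player = frame['Player Name'] == row['Player Name']
        player_teams = frame.loc[same_player, 'Team'].unique()
        # Replace 'TOT' with the individual team codes, in order of appearance
        tot_rows.at[idx, 'Team'] = '/'.join(t for t in player_teams if t != 'TOT')
    # keep=False drops every player that has more than one row
    single_team_rows = frame.drop_duplicates(subset='Player Name', keep=False)
    return pd.concat([tot_rows, single_team_rows], axis=0).reset_index(drop=True)

# Hypothetical data: Player A was traded (three rows), Player B was not
toy = pd.DataFrame({
    'Player Name': ['Player A', 'Player A', 'Player A', 'Player B'],
    'Team': ['TOT', 'AAA', 'BBB', 'CCC'],
    'PTS': [10.0, 12.0, 8.0, 20.0],
})
collapsed = collapse_traded_players(toy)
print(collapsed)
```

Each player should end up with exactly one row, and the traded player should keep the season-total stats with the individual team codes in the `Team` column.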

## Attempting to Classify a Player's Listed Position

In [153]:
processed_df = pd.read_csv('nba_2022-23_all_stats.csv', index_col = 0)

positions = {'PG': 0, 'SG': 1, 'SF': 2, 'PF': 3, 'C': 4}
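# Keep only players listed at one of the five standard positions.
# .copy() so the fillna assignment below does not write into a view.
df_new = processed_df[processed_df['Position'].isin(positions.keys())].copy()
# Fill NaN values in the percentage and rate columns with 0
columns_to_fill = ['FT%', '3P%', '2P%', 'eFG%', 'FG%', '3PAr', 'FTr', 'TOV%', 'TS%']
df_new[columns_to_fill] = df_new[columns_to_fill].fillna(0)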

player_df = df_new.copy()



features = ['PTS', 'PF', 'TOV', 'BLK', 'STL', 'AST', 'TRB', 'DRB', 'ORB','GP', 'GS', 'MP',
       'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%',
       'FT', 'FTA', 'FT%', 'Total Minutes', 'PER', 'TS%', '3PAr', 'FTr', 'ORB%', 'DRB%', 'TRB%',
       'AST%', 'STL%', 'BLK%', 'TOV%', 'USG%', 'OWS', 'DWS', 'WS', 'WS/48',
       'OBPM', 'DBPM', 'BPM', 'VORP']

X = df_new[features]
    
y = df_new['Position'].map(positions)

In [154]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Split the data into a training set and a testing set (adjust test_size as needed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train different classifiers
classifiers = {
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'Logistic Regression': LogisticRegression(),
    'SVM': SVC(),
    'K-Nearest Neighbors': KNeighborsClassifier()
}

for clf_name, clf in classifiers.items():
    clf.fit(X_train_scaled, y_train)
    y_pred = clf.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    print(f'{clf_name} Accuracy: {accuracy:.2f}')


Random Forest Accuracy: 0.57
Gradient Boosting Accuracy: 0.56
Logistic Regression Accuracy: 0.58
SVM Accuracy: 0.57
K-Nearest Neighbors Accuracy: 0.46


With the advanced stats added, logistic regression is the best of the five classifiers, predicting a player's listed position correctly 58% of the time on the held-out test set.
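
To put an accuracy of this size in context, it can help to compare it against a majority-class baseline and to average over several splits rather than rely on one. The sketch below shows one way to do that with scikit-learn's `DummyClassifier`, `make_pipeline` and `cross_val_score`. It runs on synthetic data from `make_classification`, an assumption made so the block is self-contained; with the notebook's data, `X` and `y` from the cells above would be passed instead. The numbers it prints say nothing about the NBA data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the player features: 5 classes, like the 5 positions
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=8,
    n_classes=5, random_state=42,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Baseline: always predict the most frequent class
baseline = DummyClassifier(strategy='most_frequent')
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=cv, scoring='accuracy')

# Scaling lives inside the pipeline, so each fold is standardized
# using only its own training portion
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='accuracy')

print(f'Baseline accuracy: {baseline_scores.mean():.2f}')
print(f'Logistic regression accuracy: {model_scores.mean():.2f} '
      f'(+/- {model_scores.std():.2f})')
```

With five roughly balanced classes the baseline sits near 20%, which is the figure a position classifier has to beat to be doing anything useful.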

In [155]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

def find_nearest_neighbours(player_name, data, cols, n_neighbors=5):
    """
    Find the nearest neighbors of a player based on specified columns.

    Parameters:
    - player_name: The name of the player you want to find neighbors for.
    - data: The DataFrame containing the NBA player dataset.
    - cols: A list of column names to consider for finding neighbors.
    - n_neighbors: The number of nearest neighbors to retrieve (default is 5).

    Returns:
    - A DataFrame containing the nearest neighbors and their distances.
    """
    # Extract the specified columns for the KNN algorithm
    data = data.reset_index(drop=True)
    X = data[cols]

    

    # Standardize the data to have zero mean and unit variance
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Create a KNN model
    knn = NearestNeighbors(n_neighbors=n_neighbors, metric='euclidean')
    knn.fit(X_scaled)

    # Find the index of the player in the dataset
    player_index = data[data['Player Name'] == player_name].index[0]

    # Use the KNN model to find the nearest neighbors
    distances, indices = knn.kneighbors([X_scaled[player_index]])

    # Create a DataFrame with the nearest neighbors and distances
    neighbors_data = data.iloc[indices[0]].copy()
    neighbors_data['Distance'] = distances[0]

    return neighbors_data


# Use the columns below to find closest matching players to Jayson Tatum
cols = ['PTS', 'AST', 'TRB', 'STL', 'BLK', 'TOV', '2P%','2PA', '3PA', 'FTA', '3P%', 'FT%']
tatum_nn = find_nearest_neighbours('Jayson Tatum', player_df, cols, 10)
tatum_nn


Unnamed: 0,Player Name,Position,Age,Team,GP,GS,MP,FG,FGA,FG%,...,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP,Distance
462,Jayson Tatum,SF,24,BOS,74,74,36.9,9.8,21.1,0.466,...,32.7,6.2,4.3,10.5,0.185,4.8,0.7,5.5,5.1,0.0
410,Julius Randle,PF,28,NYK,77,77,35.5,8.5,18.6,0.459,...,29.5,5.0,3.1,8.1,0.142,3.9,-0.2,3.7,3.9,2.179181
265,LeBron James,PF,38,LAL,55,54,35.5,11.1,22.2,0.5,...,33.3,3.2,2.4,5.6,0.138,5.5,0.6,6.1,4.0,2.442067
113,Jaylen Brown,SF,26,BOS,67,67,35.9,10.1,20.6,0.491,...,31.4,1.6,3.4,5.0,0.1,1.5,-0.2,1.3,2.0,2.473844
167,Luka Dončić,PG,23,DAL,66,66,36.2,10.9,22.0,0.496,...,37.6,7.3,2.9,10.2,0.204,7.6,1.4,8.9,6.6,2.628409
198,Paul George,SF,32,LAC,56,56,34.6,8.2,17.9,0.457,...,29.5,2.3,2.3,4.6,0.114,2.4,0.3,2.8,2.3,2.732257
180,Anthony Edwards,SG,21,MIN,79,79,36.0,8.9,19.5,0.459,...,29.9,0.2,3.6,3.8,0.064,1.0,0.0,1.0,2.1,2.81625
353,Donovan Mitchell,SG,26,CLE,68,68,35.8,10.0,20.6,0.484,...,32.1,5.4,3.5,8.9,0.176,5.6,0.6,6.3,5.0,2.843873
324,Lauri Markkanen,PF,25,UTA,66,66,34.4,8.7,17.3,0.499,...,26.6,6.3,1.9,8.2,0.173,4.9,-1.0,3.8,3.3,2.855985
100,Devin Booker,SG,26,PHO,53,53,34.6,9.9,20.1,0.494,...,31.8,4.2,1.9,6.0,0.157,4.5,-0.3,4.2,2.9,2.864682


It's reasonable that Julius Randle, Jaylen Brown, Luka Dončić, Devin Booker and Paul George emerge as close neighbours to Tatum, as they are all skilled all-round offensive weapons. LeBron James being a close neighbour is slightly surprising, but this nearest-neighbours search is based purely on per-game numbers in the six main stat categories (points, assists, rebounds, steals, blocks and turnovers), plus efficiency and attempts per game on two-point field goals, three-pointers and free throws, so it is a fairly simplistic measure.
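
For reference, the `Distance` column is the Euclidean distance between standardized stat vectors. A minimal sketch on made-up numbers (the three players and their values are hypothetical, not taken from the dataset) shows the same calculation done by hand, without `NearestNeighbors`:

```python
import numpy as np
import pandas as pd

# Hypothetical per-game stats for three players
toy = pd.DataFrame(
    {'PTS': [30.0, 25.0, 10.0], 'AST': [5.0, 4.0, 8.0], 'TRB': [9.0, 10.0, 3.0]},
    index=['Player A', 'Player B', 'Player C'],
)

# z-score each column; ddof=0 (population std) matches StandardScaler
z = (toy - toy.mean()) / toy.std(ddof=0)

# Euclidean distance from Player A to every player, closest first
diff = z - z.loc['Player A']
distances = np.sqrt((diff ** 2).sum(axis=1)).sort_values()
print(distances)
```

Because every column is standardized first, each stat contributes on the same scale, so the choice of columns (rather than their raw magnitudes) determines which players come out as similar.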