
# Problem Description
The goal of this project is to uncover the true archetypes of NBA players using unsupervised learning, without relying on traditional positional labels. Instead of using predefined roles like "point guard" or "center", we cluster players based on statistical tendencies to discover data-driven roles that emerge naturally from performance patterns.

This analysis focuses on regular season NBA games from 2020 onward, using one row per player-game performance. To avoid skewing results with limited data, only players with sufficient minutes played are retained. The outcome is intended to give coaches, analysts, and fans insight into the real, evolving structure of modern basketball.
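One way to apply the minutes requirement described above is to filter player-seasons before any aggregation. The sketch below is illustrative only: the notebook does not state its cutoff, so `MIN_AVG_MINUTES` and `MIN_GAMES` are hypothetical thresholds, and the toy rows are made up (only the column names mirror the dataset).

```python
import pandas as pd

# Toy per-game rows; column names mirror the notebook, values are invented.
games = pd.DataFrame({
    "firstName": ["A", "A", "B", "B", "C"],
    "lastName": ["One", "One", "Two", "Two", "Three"],
    "season": [2024] * 5,
    "numMinutes": [32.0, 30.0, 4.0, 6.0, 25.0],
})

MIN_AVG_MINUTES = 10.0  # hypothetical threshold, not taken from the notebook
MIN_GAMES = 2           # hypothetical threshold, not taken from the notebook

keys = ["firstName", "lastName", "season"]

# Average minutes and games played for each player-season.
usage = games.groupby(keys)["numMinutes"].agg(
    avg_minutes="mean", games_played="count"
)
eligible = usage[
    (usage["avg_minutes"] >= MIN_AVG_MINUTES)
    & (usage["games_played"] >= MIN_GAMES)
].reset_index()

# Keep only the per-game rows that belong to eligible player-seasons.
filtered = games.merge(eligible[keys], on=keys, how="inner")
print(filtered)
```

Requiring both a minutes average and a games count guards against a single long appearance passing the filter on its own.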


Core performance stats were selected:

Scoring attempts and efficiency: fieldGoalsAttempted, fieldGoalsPercentage, threePointersAttempted, threePointersPercentage, freeThrowsAttempted, freeThrowsPercentage

Counting stats: points, assists, reboundsTotal, steals, blocks, numMinutes

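Then we engineered share-based features to normalize contribution across players. Each one is a single stat divided by the player's combined total of points, assists, total rebounds, steals, and blocks in that game: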

pts_pct: % of a player's statistical contribution that is scoring

ast_pct: % of a player's contribution from assists

reb_pct: % from rebounding

stl_pct, blk_pct: % from steals and blocks

These share-based features (stored as fractions between 0 and 1) allow fair comparison between bigs, guards, and hybrids, making it easier to identify role patterns independent of raw box score volume.
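As a worked example of these share features, the sketch below computes them for three made-up stat lines. The values are invented for illustration; the third row shows that a game with no recorded stats produces missing values rather than a share, because its total is zero.

```python
import pandas as pd

# Three invented stat lines: a scorer, a rebounder/shot-blocker, an empty game.
toy = pd.DataFrame({
    "points": [20, 4, 0],
    "assists": [5, 1, 0],
    "reboundsTotal": [5, 10, 0],
    "steals": [0, 1, 0],
    "blocks": [0, 4, 0],
})

stat_cols = ["points", "assists", "reboundsTotal", "steals", "blocks"]
total = toy[stat_cols].sum(axis=1)

# Mask zero totals so the division yields NaN instead of dividing by zero.
shares = toy[stat_cols].div(total.where(total != 0), axis=0)
shares.columns = ["pts_pct", "ast_pct", "reb_pct", "stl_pct", "blk_pct"]
print(shares.round(3))
```

For the first row the total is 30, so `pts_pct` is 20/30; for the second row the total is 20, so `reb_pct` is 10/20. The shares in each non-empty row sum to 1.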

In [1]:

from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
import numpy as np

df = pd.read_csv('/content/drive/MyDrive/NBADatabase/PlayerStatistics.csv')
df['gameDate'] = pd.to_datetime(df['gameDate'])
df['season'] = df['gameDate'].dt.year
df = df[(df['gameType'] == 'Regular Season') & (df['season'] >= 2020)].dropna()


Mounted at /content/drive
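The cell above labels each game by calendar year, so an NBA season that spans two calendar years is split across two `season` values. If season-level grouping is wanted instead, one possible approach is to label games by the year in which their season started. This is a sketch under the assumption that seasons begin in October or later; `SEASON_START_MONTH` is a hypothetical parameter, not something the notebook defines.

```python
import pandas as pd

# Invented game dates: three from one season, one from the next.
dates = pd.to_datetime(
    pd.Series(["2023-11-01", "2024-01-15", "2024-04-10", "2024-10-25"])
)

SEASON_START_MONTH = 10  # assumption: seasons tip off in October or later

# Games on or after the start month keep their year; earlier months
# belong to the season that started in the previous calendar year.
season_start = dates.dt.year.where(
    dates.dt.month >= SEASON_START_MONTH, dates.dt.year - 1
)
print(season_start.tolist())
```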




In [5]:

df['total_stats'] = df[['points', 'assists', 'reboundsTotal', 'steals', 'blocks']].sum(axis=1)
df['pts_pct'] = df['points'] / df['total_stats']
df['ast_pct'] = df['assists'] / df['total_stats']
df['reb_pct'] = df['reboundsTotal'] / df['total_stats']
df['stl_pct'] = df['steals'] / df['total_stats']
df['blk_pct'] = df['blocks'] / df['total_stats']

df = df.replace([np.inf, -np.inf], np.nan).dropna()

In [6]:

agg_df = df.groupby(['firstName', 'lastName', 'season'], as_index=False)[
    ['numMinutes', 'points', 'assists', 'reboundsTotal', 'steals', 'blocks',
     'fieldGoalsAttempted', 'fieldGoalsPercentage',
     'threePointersAttempted', 'threePointersPercentage',
     'freeThrowsAttempted', 'freeThrowsPercentage',
     'pts_pct', 'ast_pct', 'reb_pct', 'stl_pct', 'blk_pct']
].mean()

feature_cols = ['numMinutes', 'fieldGoalsAttempted', 'fieldGoalsPercentage',
                'threePointersAttempted', 'threePointersPercentage',
                'freeThrowsAttempted', 'freeThrowsPercentage',
                'pts_pct', 'ast_pct', 'reb_pct', 'stl_pct', 'blk_pct']
X = agg_df[feature_cols].copy()


In [7]:

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import umap.umap_ as umap

# Standardize every feature to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3D embedding used for plotting only; clustering below runs on X_scaled.
reducer = umap.UMAP(n_components=3, random_state=42)
embedding = reducer.fit_transform(X_scaled)
agg_df['UMAP1'], agg_df['UMAP2'], agg_df['UMAP3'] = embedding[:,0], embedding[:,1], embedding[:,2]

# Pick k in [6, 14] with the highest silhouette score on the scaled features.
best_score, best_k = -1, None
for k in range(6, 15):
    labels = KMeans(n_clusters=k, random_state=42, n_init='auto').fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    if score > best_score:
        best_score, best_k = score, k

kmeans = KMeans(n_clusters=best_k, random_state=42, n_init='auto')
agg_df['cluster'] = kmeans.fit_predict(X_scaled)




In [8]:

valid_clusters = agg_df['cluster'].value_counts()[lambda x: x >= 10].index
agg_df = agg_df[agg_df['cluster'].isin(valid_clusters)].copy()
X_scaled = X_scaled[agg_df.index]
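The last line of the cell above indexes the NumPy array with `agg_df.index`, which works here because `agg_df` still carries the default integer index whose labels equal row positions. A boolean mask is an alternative that does not depend on that. The sketch below uses invented data and a smaller, hypothetical `MIN_CLUSTER_SIZE` so the effect is visible on a few rows.

```python
import numpy as np
import pandas as pd

# Invented cluster labels and a matching feature array (one row per player).
players = pd.DataFrame({"cluster": [0, 0, 0, 1, 2, 2, 2]})
features = np.arange(len(players) * 2, dtype=float).reshape(len(players), 2)

MIN_CLUSTER_SIZE = 3  # hypothetical value for the toy data; the notebook uses 10

counts = players["cluster"].value_counts()
large_clusters = counts[counts >= MIN_CLUSTER_SIZE].index

# One positional mask filters both objects, keeping rows aligned.
keep = players["cluster"].isin(large_clusters).to_numpy()
players_kept = players[keep].reset_index(drop=True)
features_kept = features[keep]

print(players_kept["cluster"].tolist(), features_kept.shape)
```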


In [9]:

import plotly.express as px

agg_df['player'] = agg_df['firstName'] + ' ' + agg_df['lastName'] + ' (' + agg_df['season'].astype(str) + ')'
fig = px.scatter_3d(agg_df, x='UMAP1', y='UMAP2', z='UMAP3', color='cluster', hover_name='player', opacity=0.7,
                    title='NBA Player Clusters (3D UMAP)')
fig.show()


In [10]:

for cluster_id in sorted(agg_df['cluster'].unique()):
    print(f"\nCluster {cluster_id} - Sample Players:")
    members = agg_df[agg_df['cluster'] == cluster_id]
    sample = members.sample(n=min(15, len(members)), random_state=42)
    for _, row in sample.iterrows():
        print(f"  - {row['firstName']} {row['lastName']} ({int(row['season'])})")



Cluster 0 - Sample Players:
  - Ariel Hukporti (2024)
  - Rudy Gobert (2024)
  - Harrison Barnes (2024)
  - Amen Thompson (2024)
  - Jalen Duren (2024)
  - Moritz Wagner (2024)
  - Daniel Gafford (2024)
  - Jeremiah Robinson-Earl (2024)
  - Deandre Ayton (2024)
  - Noah Clowney (2024)
  - Malcolm Brogdon (2024)
  - Javonte Green (2024)
  - Clint Capela (2024)
  - Kobe Brown (2024)
  - Ausar Thompson (2024)

Cluster 1 - Sample Players:
  - Kevon Looney (2024)
  - Andre Jackson Jr. (2024)
  - Kyle Lowry (2024)
  - Delon Wright (2024)
  - Gary Payton II (2024)
  - Taj Gibson (2024)
  - Neemias Queta (2024)
  - Jonathan Mogbo (2024)
  - Ousmane Dieng (2024)
  - Daniel Theis (2024)
  - Vit Krejci (2024)
  - Alex Len (2024)
  - Kyshawn George (2024)
  - Luke Kennard (2024)
  - Bruno Fernando (2024)

Cluster 2 - Sample Players:
  - Jalen Suggs (2024)
  - De'Aaron Fox (2024)
  - Keldon Johnson (2024)
  - OG Anunoby (2024)
  - Aaron Gordon (2024)
  - Deni Avdija (2024)
  - Jalen Johnson (2024)

In [11]:
cluster_summary = agg_df.groupby('cluster')[feature_cols].median()

low_thresh = X[feature_cols].quantile(0.25)
high_thresh = X[feature_cols].quantile(0.75)

descriptions = {}
for cluster_id, row in cluster_summary.iterrows():
    desc = []
    for col in feature_cols:
        if row[col] >= high_thresh[col]:
            desc.append(f"high in {col}")
        elif row[col] <= low_thresh[col]:
            desc.append(f"low in {col}")
        else:
            desc.append(f"medium in {col}")
    descriptions[cluster_id] = desc

# Print summary for all clusters
for cluster_id in sorted(descriptions):
    print(f"Cluster {cluster_id}:\n- " + ", ".join(descriptions[cluster_id]) + "\n")

Cluster 0:
- medium in numMinutes, medium in fieldGoalsAttempted, high in fieldGoalsPercentage, medium in threePointersAttempted, medium in threePointersPercentage, medium in freeThrowsAttempted, medium in freeThrowsPercentage, medium in pts_pct, medium in ast_pct, medium in reb_pct, medium in stl_pct, medium in blk_pct

Cluster 1:
- low in numMinutes, low in fieldGoalsAttempted, low in fieldGoalsPercentage, low in threePointersAttempted, low in threePointersPercentage, low in freeThrowsAttempted, low in freeThrowsPercentage, low in pts_pct, medium in ast_pct, high in reb_pct, medium in stl_pct, medium in blk_pct

Cluster 2:
- high in numMinutes, high in fieldGoalsAttempted, medium in fieldGoalsPercentage, high in threePointersAttempted, medium in threePointersPercentage, high in freeThrowsAttempted, high in freeThrowsPercentage, high in pts_pct, medium in ast_pct, low in reb_pct, medium in stl_pct, medium in blk_pct

Cluster 3:
- low in numMinutes, low in fieldGoalsAttempted, medium i

## Analysis (Model Building and Training)
Feature Scaling
All selected and engineered features were standardized with StandardScaler (zero mean, unit variance) so that each stat carries comparable weight in the distance computations used for clustering.
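To make the scaling step concrete, the sketch below standardizes a small invented matrix and checks it against the manual z-score formula. The two columns stand in for a minutes-like stat and a percentage-like stat on very different scales; the values are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented rows: [minutes-like value, percentage-like value].
X_toy = np.array([
    [36.0, 0.45],
    [12.0, 0.60],
    [24.0, 0.30],
])

X_std = StandardScaler().fit_transform(X_toy)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for ndarray.std.
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_std, manual))
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))
```

After scaling, a one-unit difference means "one standard deviation" in every column, so Euclidean distances in KMeans are no longer dominated by the stat with the largest raw range.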

Dimensionality Reduction with UMAP
We used Uniform Manifold Approximation and Projection (UMAP) to reduce the standardized features to three dimensions for visualization while preserving neighborhood structure from the high-dimensional space. UMAP was chosen over PCA because it can capture nonlinear structure, and over t-SNE because it tends to retain more of the global layout between groups.

Clustering with KMeans
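We applied KMeans to the standardized feature matrix (X_scaled), not to the UMAP embedding; the embedding was used only for plotting.

We used silhouette scoring to choose the number of clusters, searching k from 6 to 14 so that at least 6 clusters were always produced.

Clusters with fewer than 10 player-seasons were discarded to focus on meaningful archetypes.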

Player names were preserved and sampled from each cluster for interpretability.
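The silhouette-based choice of k described above can be illustrated on synthetic data where the right answer is known. This is a sketch, not the notebook's run: the blobs are generated with `make_blobs`, the search range is 2 to 6 rather than 6 to 14, and `n_init=10` is set explicitly for compatibility with older scikit-learn versions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three tight, well-separated synthetic groups.
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=42,
)

# Fit KMeans for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(best_k, {k: round(s, 3) for k, s in scores.items()})
```

The silhouette score rewards clusters that are compact and far from their neighbors, so it drops both when true groups are merged (k too small) and when a tight group is split (k too large).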



## Results
The final model produced 7 distinct player clusters. Each cluster was characterized based on its median stat profile relative to the full dataset. Players were labeled with the season year for clarity (e.g., "Kristaps Porzingis (2024)").

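A 3D UMAP scatterplot was created with color-coded clusters and player names shown on hover. Each cluster was summarized with a "high", "medium", or "low" label for each stat: a cluster median at or above the 75th percentile across all player-seasons is "high", one at or below the 25th percentile is "low", and anything in between is "medium".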


## Conclusion

This project applied unsupervised learning techniques to identify latent player roles based on in-game statistics from the 2024 NBA regular season. After engineering stat distribution features (e.g., pts_pct, ast_pct, reb_pct), we clustered the standardized features with KMeans and used UMAP to visualize the result in three dimensions, revealing patterns in playstyle and performance.

Summary of Cluster Archetypes
Cluster 0 - Interior Anchors with Efficient Scoring
Players like Rudy Gobert and Clint Capela stand out as defensive bigs who contribute efficient inside scoring. This group ranks high in field goal percentage and contributes across many stat areas without dominating any one category.

Cluster 1 - Low-Usage Role Players
Characterized by low minutes, low shooting volume, and low shooting percentages, this group (e.g., Kevon Looney, Delon Wright) draws a high share of its box-score contribution from rebounding despite limited offensive production. Assist, steal, and block shares are all medium, which suggests some playmaking and defensive contribution from non-primary roles.

Cluster 2 - High-Usage Offensive Engines
Featuring players like Tyrese Haliburton, Ja Morant, and Brandon Ingram, this cluster shows high minutes and high shot volume (field goals, three-pointers, and free throws) with a high scoring share. The assist share is medium rather than high, and the rebounding share is low. These are offensive leaders with little rebounding presence.

Cluster 3 - Low-Impact Reserves
A mixed group with minimal statistical standout features. Players here (e.g., Kyle Anderson, Kevin Love) tend to have reduced roles or limited sample sizes, contributing in modest ways across several areas.

Cluster 4 - Perimeter Defenders and Hustle Bigs
Highlighted by Alex Caruso and Steven Adams, this group excels in rebounding and steals, suggesting scrappy defenders and bigs who rely on effort and positioning rather than offensive creation.

Cluster 5 - High-Assist Low-Scoring Facilitators
Players like Elfrid Payton and Tre Jones fit the mold of pass-first guards and connectors. While not primary scorers, they drive ball movement with high assist shares and moderate rebounding.

Cluster 6 - Balanced Two-Way Wings
Featuring Draymond Green, Mike Conley, and Jerami Grant, this cluster maintains balance across nearly all metrics. These players contribute on both ends without extreme highs or lows, making them valuable complementary pieces.

Observations
Cluster 0 illustrates how some archetypes emerge from overall statistical efficiency and not just raw volume.

Clusters 1 and 3 likely reflect players with small roles or limited sample sizes; future analysis could weight by minutes or games played.

Clusters appear cohesive in terms of playstyle, even when mixing positions, suggesting the share-based feature engineering and normalization steps were effective.

Cluster labeling could be refined further with positional data or usage rates to improve interpretability.
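The suggestion above to weight by minutes could look like the following sketch, which compares a plain per-game mean with a minutes-weighted mean for one share feature. The data and the `weighted_mean` helper are hypothetical and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Invented per-game rows: player A has one long game and one short game.
games = pd.DataFrame({
    "player": ["A", "A", "B"],
    "numMinutes": [30.0, 10.0, 20.0],
    "pts_pct": [0.60, 0.20, 0.50],
})

def weighted_mean(group, value_col, weight_col="numMinutes"):
    """Hypothetical helper: average value_col weighted by minutes played."""
    return float(np.average(group[value_col], weights=group[weight_col]))

# Plain mean treats every game equally; the weighted mean favors long outings.
simple = games.groupby("player")["pts_pct"].mean()
weighted = {
    name: weighted_mean(group, "pts_pct")
    for name, group in games.groupby("player")
}

print(simple.to_dict())
print(weighted)
```

For player A the plain mean is 0.40, while the weighted mean is (30 × 0.60 + 10 × 0.20) / 40 = 0.50, so short garbage-time appearances pull the profile less. Note that `np.average` raises an error when the weights sum to zero, so player-seasons with no minutes would need to be dropped first.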