# Clustering Lab

 
Based on the amazing work you did in the movie industry, you've been recruited to the NBA! You are the VP of Analytics, supporting the head scout, Mr. Rooney, for the worst team in the NBA (probably the Wizards). Mr. Rooney just heard about data science and thinks it can solve all the team's problems! He wants you to find players who are high performing but not highly paid, whom you can steal to get the team to the playoffs!

In this document you will work through a process similar to the one we did in class, using the NBA data files (available in the Canvas assignment), which you will merge together.

Details: 

- Determine a way to use clustering to estimate, based on performance, whether players are generally underpaid or overpaid.

- Then select players you believe would be best for your team and explain why. Do so in three categories: 
    * Examples that are not good choices (3 or 4) 
    * Several options that are good choices (3 or 4)
    * Several options that could work, assuming you can't get the players in the good category (3 or 4)

- You will decide the cutoffs for each category, so you should be able to explain why you chose them.

- Provide a well-commented and clean report of your findings in a separate notebook that can be presented to Mr. Rooney, keeping in mind he doesn't understand...anything. Include a rationale for the variables you included in the model, details on your approach, and an overview of the results with supporting visualizations.


Hints:

- Salary is the variable you are trying to understand 
- When interpreting the results, you might want to use graphs that include the variables most correlated with Salary
- You'll need to scale the variables before performing the clustering
- Be specific about why you selected the players that you did, more detail is better
- Use good coding practices: comment heavily, indent, don't use for loops unless totally necessary, and create modular sections that align with some outcome. If necessary, create more than one script. List/load libraries at the top and don't include libraries that aren't used.
- Watch out for non-traditional characters in the players' names; certain graphs won't work when these characters are included.
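
The last hint can be handled before merging or plotting. The snippet below is a minimal sketch, assuming accented Latin letters are the main problem; `strip_accents` is a hypothetical helper name, not something from the course files. It decomposes each character with `unicodedata.normalize` and drops the combining marks, so a letter with no ASCII base form would be removed rather than converted.

```python
import unicodedata

def strip_accents(name: str) -> str:
    """Return `name` with accented letters replaced by plain ASCII base letters."""
    # NFKD splits a character such as "ć" into "c" plus a combining accent mark
    decomposed = unicodedata.normalize("NFKD", name)
    # Encoding to ASCII with errors="ignore" drops the combining marks
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_accents("Nikola Jokić"))  # prints "Nikola Jokic"
```

With a pandas column, the same helper could be applied with something like `df['Player'].map(strip_accents)`.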


In [15]:
# imports and reading in data
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
import plotly.express as px
from sklearn.metrics import silhouette_score

salary = pd.read_csv('2025_salaries.csv', header = 1, encoding = 'latin-1')
stats = pd.read_csv('nba_2025.txt', sep = ',', encoding = 'latin-1')

merged = pd.merge(salary, stats, on = 'Player')

In [16]:
# Drop variables that will not be needed or are duplicates
# rows for players who appear more than once, kept aside for inspection
duplicates = merged[merged.duplicated(subset = 'Player', keep = False)]

# dropping complete duplicates
df_clean = merged.drop_duplicates()

# sorting by players and minutes played, then dropping the duplicates, keeping the one with the most minutes played
df_clean = df_clean.sort_values(by = ['Player', 'MP'], ascending = [True, False]).drop_duplicates(subset = 'Player', keep = 'first')

# dropping missing values from the salary column since they won't be useful for analysis
df_clean = df_clean.dropna(subset = ['2025-26'])

df_final = df_clean.sort_index()
df_final

Unnamed: 0,Player,Tm,2025-26,Rk,Age,Team,Pos,G,GS,MP,...,TRB,AST,STL,BLK,TOV,PF,PTS,Trp-Dbl,Awards,Player-additional
0,Garrison Mathews,IND,"$131,970",398.0,29.0,IND,SG,15.0,1.0,196.0,...,17.0,10.0,6.0,3.0,3.0,19.0,78.0,0.0,,mathega01
2,Mac McClung,IND,"$164,060",459.0,27.0,2TM,SG,4.0,0.0,47.0,...,5.0,2.0,5.0,2.0,3.0,8.0,23.0,0.0,,mccluma01
5,Monte Morris,IND,"$321,184",470.0,30.0,IND,PG,6.0,0.0,65.0,...,7.0,9.0,1.0,1.0,2.0,3.0,18.0,0.0,,morrimo01
6,E.J. Liddell,PHO,"$706,898",461.0,25.0,BRK,PF,10.0,0.0,49.0,...,11.0,0.0,1.0,0.0,0.0,4.0,22.0,0.0,,liddeej01
7,James Wiseman,IND,"$1,000,000",480.0,24.0,IND,C,4.0,1.0,58.0,...,8.0,3.0,0.0,1.0,5.0,10.0,13.0,0.0,,wisemja01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
508,Anthony Davis,WAS,"$54,126,450",192.0,32.0,DAL,PF,20.0,20.0,626.0,...,221.0,56.0,22.0,33.0,41.0,42.0,407.0,0.0,,davisan02
509,Kevin Durant,HOU,"$54,708,609",8.0,37.0,HOU,SF,50.0,50.0,1835.0,...,267.0,222.0,42.0,44.0,161.0,101.0,1291.0,0.0,,duranke01
510,Joel Embiid,PHI,"$55,224,526",56.0,31.0,PHI,C,31.0,31.0,972.0,...,232.0,121.0,20.0,34.0,92.0,67.0,825.0,1.0,,embiijo01
511,Bradley Beal,LAC,"$59,020,270",426.0,32.0,LAC,SG,6.0,6.0,121.0,...,5.0,10.0,3.0,0.0,9.0,14.0,49.0,0.0,,bealbr01


In [17]:
# Run the clustering algo with your best guess for K

# selecting features for clustering
cluster_data = df_final[['PTS', 'TRB', 'AST', 'STL', 'BLK', 'eFG%', '3P%', 'FT%', 'MP']]
# fill missing values with 0: a missing value most likely means the player recorded nothing
# in that category (for example, a shooting percentage with no attempts)
cluster_data = cluster_data.fillna(0)

# scaling the data
scaled = MinMaxScaler().fit_transform(cluster_data)

# running kmeans with 5 clusters
kmeans = KMeans(n_clusters = 5, random_state = 1).fit(scaled)
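
To make the scaling step concrete, here is a small sketch of what min-max scaling does to each column: it maps the column minimum to 0 and the column maximum to 1 via (x - min) / (max - min). The toy numbers and the `min_max_scale` helper are illustrative assumptions; in the notebook itself `MinMaxScaler` does this work.

```python
import numpy as np

def min_max_scale(X):
    """Rescale every column of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Guard against constant columns (range 0) to avoid dividing by zero
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range

# Made-up rows: minutes played and a shooting percentage
toy = np.array([[100.0, 0.40], [500.0, 0.50], [1300.0, 0.60]])
print(min_max_scale(toy))
```

Without this step, a column measured in the hundreds or thousands (minutes played) would dominate the distance calculation over a column measured between 0 and 1 (shooting percentages).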

In [18]:
# View the results
print(kmeans.cluster_centers_)
print(kmeans.labels_)
print(kmeans.inertia_)

[[0.07975283 0.08670091 0.06008155 0.09102357 0.04938957 0.63590552
  0.35672549 0.7907549  0.15892931]
 [0.60777794 0.53188214 0.45557087 0.49436005 0.40055676 0.67324934
  0.32988525 0.7894918  0.80361442]
 [0.21486882 0.25205931 0.16807246 0.27917144 0.17261203 0.67008602
  0.31608594 0.74697656 0.42006821]
 [0.42252175 0.35022881 0.34009341 0.42945973 0.16719078 0.66834484
  0.35294444 0.7932     0.63940504]
 [0.02111682 0.03866008 0.02131268 0.03944954 0.02924528 0.56073894
  0.10813333 0.5148     0.0540293 ]]
[0 4 4 0 4 0 4 0 0 2 4 2 2 4 2 0 0 2 2 0 0 4 4 2 0 0 3 0 0 2 3 4 2 4 4 0 2
 2 2 2 2 0 4 2 0 0 2 0 0 0 2 2 0 0 1 4 3 4 0 0 0 2 3 3 2 2 0 0 0 2 0 4 0 2
 0 4 0 3 0 4 4 1 0 0 4 4 4 4 0 2 2 2 2 3 1 2 1 3 2 0 3 2 0 3 2 3 4 2 3 0 2
 0 0 3 0 0 0 2 2 0 0 0 0 3 0 3 3 3 2 0 2 2 0 0 0 2 2 3 2 3 2 0 0 2 2 3 2 2
 3 2 2 1 0 0 4 2 1 2 1 0 1 2 0 3 0 0 0 0 0 0 2 0 0 2 3 1 4 2 0 3 4 1 0 0 2
 2 2 0 3 2 3 0 3 0 4 0 3 0 2 2 3 3 2 0 1 0 2 2 3 0 2 0 2 4 0 2 1 3 2 3 3 2
 1 3 1 4 3 0 0 2 2 2 2 3 2 2 

In [19]:
# Create a visualization of the results with 2 or 3 variables that you think will best
# differentiate the clusters

# assigning cluster labels to the original dataframe
df_final['Cluster'] = kmeans.labels_

fig = px.scatter_3d(
    df_final, x = 'MP', y = 'PTS', z = 'eFG%',
    color = 'Cluster',
    opacity = 0.7)

fig.show()

In [20]:
# Evaluate the quality of the clustering using total variance explained and silhouette scores

# total variance explained, computed on the scaled data the model was fit on
# (kmeans.inertia_ is measured in scaled units, so the total must be as well)
total_sum_squares = np.sum((scaled - scaled.mean(axis = 0))**2, axis = 0)
total = np.sum(total_sum_squares)
between_SSE = (total - kmeans.inertia_)
var_explained = between_SSE / total
print(var_explained)

# silhouette score for k = 2..10
silhouette_scores = []
for k in range(2, 11):
    kmeans_obj = KMeans(n_clusters = k, algorithm = "lloyd", random_state = 1)
    kmeans_obj.fit(cluster_data)
    silhouette_scores.append(
        silhouette_score(cluster_data, kmeans_obj.labels_))
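
As a sanity check on the calculation, variance explained can be written as 1 - WCSS / TSS, where WCSS is the within-cluster sum of squares (what `inertia_` reports) and TSS is the total sum of squares around the overall mean. Both have to be computed on the same feature matrix the model was fit on. The sketch below uses four made-up points with hand-assigned labels, and `variance_explained` is a hypothetical helper.

```python
import numpy as np

def variance_explained(X, labels):
    """Share of total variance captured by the cluster assignment (0 to 1)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Total sum of squares around the overall mean
    tss = np.sum((X - X.mean(axis=0)) ** 2)
    # Within-cluster sum of squares around each cluster's own mean
    wcss = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        wcss += np.sum((members - members.mean(axis=0)) ** 2)
    return 1.0 - wcss / tss

points = [[0, 0], [0, 2], [10, 0], [10, 2]]
print(variance_explained(points, [0, 0, 1, 1]))  # 1 - 4/104, roughly 0.96
```

A value extremely close to 1 on real data is usually a sign that the two sums were computed on different scales.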


In [21]:
# Determine the ideal number of clusters using the elbow method and the silhouette coefficient

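# elbow method: within-cluster sum of squares (WCSS) for k = 1..10
# a separate variable name keeps the 5-cluster `kmeans` model from being overwritten
wcss = []
for i in range(1, 11):
    kmeans_i = KMeans(n_clusters = i, random_state = 1).fit(cluster_data)
    wcss.append(kmeans_i.inertia_)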

# find k with the highest silhouette score
best_nc = silhouette_scores.index(max(silhouette_scores)) + 2
print(best_nc)


2
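
For intuition about what the silhouette coefficient measures: for each point, a is the mean distance to the other points in its own cluster, b is the mean distance to the points in the nearest other cluster, and the score is (b - a) / max(a, b). The sketch below implements that definition with NumPy on made-up points. `mean_silhouette` is a hypothetical helper that assumes at least two clusters; in practice `sklearn.metrics.silhouette_score` should be preferred. Because the score is distance-based, it is normally computed on the same scaled matrix the model was fit on.

```python
import numpy as np

def mean_silhouette(X, labels):
    """Average silhouette coefficient; assumes at least two distinct labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # Mean distance to own cluster (the self-distance of 0 is excluded)
        a = dists[i, same].sum() / (n_same - 1)
        # Mean distance to the nearest other cluster
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

points = [[0, 0], [0, 2], [10, 0], [10, 2]]
good = mean_silhouette(points, [0, 0, 1, 1])  # groups the nearby points
bad = mean_silhouette(points, [0, 1, 0, 1])   # pairs up distant points
print(good > bad)  # prints True
```

Scores near 1 indicate tight, well-separated clusters, while scores near 0 or below indicate overlapping or misassigned points.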


In [22]:
# Visualize the results of the elbow method
elbow = pd.DataFrame({"k": range(1, 11), "wcss": wcss})
fig = px.line(elbow, x = "k", y = "wcss", title = "Elbow Method")
fig.show()

In [23]:
# Use the recommended number of clusters (assuming it's different) to retrain your model and visualize the results

# new kmeans model with the recommended number of clusters (best_nc, found above)
kmeans2 = KMeans(n_clusters = best_nc, random_state = 1).fit(scaled)

# visualizing
df_final['Cluster'] = kmeans2.labels_

fig = px.scatter_3d(
    df_final, x = 'MP', y = 'PTS', z = 'eFG%',
    color = 'Cluster',
    opacity = 0.7)

fig.show()

In [24]:
# Once again evaluate the quality of the clustering using total variance explained and silhouette scores
# total variance explained for the retrained model, computed on the scaled data it was fit on
total_sum_squares2 = np.sum((scaled - scaled.mean(axis = 0))**2, axis = 0)
total2 = np.sum(total_sum_squares2)
between_SSE2 = (total2 - kmeans2.inertia_)
var_explained2 = between_SSE2 / total2
print(var_explained2)

# silhouette score
best_nc2 = silhouette_scores.index(max(silhouette_scores)) + 2
print(best_nc2)

2


In [25]:
# Use the model to select players for Mr. Rooney to consider

# creating a performance metric by summing up key stats
df_final['Performance'] = df_final[['PTS', 'TRB', 'AST', 'STL', 'BLK']].sum(axis = 1)

# converting salary strings such as "$1,000,000" to numbers so the y-axis is numeric
df_final['Salary'] = df_final['2025-26'].str.replace(r'[$,]', '', regex = True).astype(float)

# plotting to see who has relatively high performance despite being in the lower cluster
fig1 = px.scatter(
    df_final, x = 'Performance', y = 'Salary', color = 'Cluster',
    hover_name = 'Player',
    title = "Performance vs. Salary"
)
fig1.show()

In [26]:
# players who are efficient (high eFG%) but aren't playing enough (MP) to be in the higher cluster
fig2 = px.scatter(
    df_final, x = 'MP', y = 'eFG%', color = 'Cluster', size = 'PTS',
    hover_name = 'Player', hover_data = ['2025-26'],
    title = "Efficiency (eFG%) vs. Opportunity (MP)"
)
fig2.show()
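
One possible way to turn the clusters into a shortlist, offered as an assumption rather than the required method: compare each player's salary with the median salary of his performance cluster. A ratio well below 1 suggests the player is paid less than peers with similar production. The numbers and the `salary_ratio_to_cluster_median` helper below are made up for illustration, and the cutoffs for the three categories remain a judgment call.

```python
import numpy as np

def salary_ratio_to_cluster_median(salaries, clusters):
    """Each player's salary divided by the median salary of his cluster."""
    salaries = np.asarray(salaries, dtype=float)
    clusters = np.asarray(clusters)
    ratio = np.empty_like(salaries)
    for c in np.unique(clusters):
        mask = clusters == c
        ratio[mask] = salaries[mask] / np.median(salaries[mask])
    return ratio

# Hypothetical salaries (in millions) and cluster labels
toy_salaries = [1, 2, 3, 10, 20, 30]
toy_clusters = [0, 0, 0, 1, 1, 1]
print(salary_ratio_to_cluster_median(toy_salaries, toy_clusters))
```

Here the first and fourth players each earn half of their cluster's median, which would make them candidates for the "good choices" list if their performance holds up on inspection.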

In [27]:
# Write up the results in a separate notebook with supporting visualizations and 
# an overview of how and why you made the choices you did. This should be at least 
# 500 words and should be written for a non-technical audience.