# Clustering Lab

 
Based of the amazing work you did in the Movie Industry you've been recruited to the NBA! You are working as the VP of Analytics that helps support a head scout, Mr. Rooney, for the worst team in the NBA probably the Wizards. Mr. Rooney just heard about Data Science and thinks it can solve all the team's problems!!! He wants you to figure out a way to find players that are high performing but maybe not highly paid that you can steal to get the team to the playoffs! 

In this document you will work through a similar process that we did in class with the NBA data (NBA_Perf_22 and nba_salaries_22), merging them together.

Details: 

- Determine a way to use clustering to estimate based on performance if 
players are under or over paid, generally. 

- Then select players you believe would be best for your team and explain why. Do so in three categories: 
    * Examples that are not good choices (3 or 4) 
    * Several options that are good choices (3 or 4)
    * Several options that could work, assuming you can't get the players in the good category (3 or 4)

- You will decide the cutoffs for each category, so you should be able to explain why you chose them.

- Provide a well commented and clean report of your findings in a separate notebook that can be presented to Mr. Rooney, keeping in mind he doesn't understand...anything. Include a rationale for variables you included in the model, details on your approach and a overview of the results with supporting visualizations. 


Hints:

- Salary is the variable you are trying to understand 
- When interpreting you might want to use graphs that include variables that are the most correlated with Salary
- You'll need to scale the variables before performing the clustering
- Be specific about why you selected the players that you did, more detail is better
- Use good coding practices, comment heavily, indent, don't use for loops unless totally necessary and create modular sections that align with some outcome. If necessary create more than one script,list/load libraries at the top and don't include libraries that aren't used. 
- Be careful for non-traditional characters in the players names, certain graphs won't work when these characters are included.


In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import MinMaxScaler, StandardScaler

In [2]:
import os
os.listdir()
print(os.getcwd())
os.chdir('/Users/Luke/DS-3001')

/Users/Luke/DS-3001/10_kMeans Clustering


In [3]:
NBA_Perf_22 = pd.read_csv("data/NBA_Perf_22.csv", encoding='latin-1')
NBA_salaries_22 = pd.read_csv("data/nba_salaries_22.csv")
NBA_22 = pd.merge(NBA_Perf_22, NBA_salaries_22, on='Player', how='inner')
NBA_22 = NBA_22.dropna()
NBA_22.head()

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Salary
0,Precious Achiuwa,C,22,TOR,73,28,23.6,3.6,8.3,0.439,...,2.0,4.5,6.5,1.1,0.5,0.6,1.2,2.1,9.1,"$2,840,160"
1,Steven Adams,C,28,MEM,76,75,26.3,2.8,5.1,0.547,...,4.6,5.4,10.0,3.4,0.9,0.8,1.5,2.0,6.9,"$17,926,829"
2,Bam Adebayo,C,24,MIA,56,56,32.6,7.3,13.0,0.557,...,2.4,7.6,10.1,3.4,1.4,0.8,2.6,3.1,19.1,"$30,351,780"
3,Santi Aldama,PF,21,MEM,32,0,11.3,1.7,4.1,0.402,...,1.0,1.7,2.7,0.7,0.2,0.3,0.5,1.1,4.1,"$2,094,120"
4,Nickeil Alexander-Walker,SG,23,TOT,65,21,22.6,3.9,10.5,0.372,...,0.6,2.3,2.9,2.4,0.7,0.4,1.4,1.6,10.6,"$5,009,633"


In [4]:
drop_cols = [0, 1, 3, 7, 8, 10, 11, 13, 14, 17, 18, 29]
eNBA_22 = NBA_22.drop(NBA_22.columns[drop_cols], axis=1)
eNBA_22.info()

<class 'pandas.core.frame.DataFrame'>
Index: 466 entries, 0 to 502
Data columns (total 18 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Age     466 non-null    int64  
 1   G       466 non-null    int64  
 2   GS      466 non-null    int64  
 3   MP      466 non-null    float64
 4   FG%     466 non-null    float64
 5   3P%     466 non-null    float64
 6   2P%     466 non-null    float64
 7   eFG%    466 non-null    float64
 8   FT%     466 non-null    float64
 9   ORB     466 non-null    float64
 10  DRB     466 non-null    float64
 11  TRB     466 non-null    float64
 12  AST     466 non-null    float64
 13  STL     466 non-null    float64
 14  BLK     466 non-null    float64
 15  TOV     466 non-null    float64
 16  PF      466 non-null    float64
 17  PTS     466 non-null    float64
dtypes: float64(15), int64(3)
memory usage: 69.2 KB


In [5]:
#eNBA_22['Salary'] = eNBA_22['Salary'].str.replace('$', '').str.replace(',', '').astype('int64')
numbers_listing = list(eNBA_22.select_dtypes('number'))
 #select function to find the numeric variables and create a list  
eNBA_22[numbers_listing] = MinMaxScaler().fit_transform(eNBA_22[numbers_listing])
eNBA_22.head()

Unnamed: 0,Age,G,GS,MP,FG%,3P%,2P%,eFG%,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,0.136364,0.888889,0.341463,0.591429,0.445896,0.359,0.290667,0.533582,0.392804,0.434783,0.392523,0.426573,0.101852,0.227273,0.214286,0.25,0.408163,0.273649
1,0.409091,0.925926,0.914634,0.668571,0.647388,0.0,0.397333,0.647388,0.314843,1.0,0.476636,0.671329,0.314815,0.409091,0.285714,0.3125,0.387755,0.199324
2,0.227273,0.679012,0.682927,0.848571,0.666045,0.0,0.416,0.666045,0.629685,0.521739,0.682243,0.678322,0.314815,0.636364,0.285714,0.541667,0.612245,0.611486
3,0.090909,0.382716,0.0,0.24,0.376866,0.125,0.413333,0.41791,0.437781,0.217391,0.130841,0.160839,0.064815,0.090909,0.107143,0.104167,0.204082,0.10473
4,0.181818,0.790123,0.256098,0.562857,0.320896,0.311,0.244,0.464552,0.614693,0.130435,0.186916,0.174825,0.222222,0.318182,0.142857,0.291667,0.306122,0.324324


In [73]:
clust_data_NBA = eNBA_22[["FG%", "PTS", "AST"]]
kmeans_obj_NBA = KMeans(n_clusters=2, random_state=1).fit(clust_data_NBA)





In [74]:
print(kmeans_obj_NBA.cluster_centers_)
print(kmeans_obj_NBA.labels_)
print(kmeans_obj_NBA.inertia_)

[[0.4892295  0.22538055 0.13904853]
 [0.48091639 0.59224691 0.45448839]]
[0 0 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0
 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 0
 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 1
 1 0 0 1 0 1 1 0 1 1 0 0 0 0 0 0 0 0 1 1 1 0 0 1 1 1 0 0 0 0 0 0 1 1 1 1 0
 0 0 0 1 1 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1
 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1
 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 1 1 0 0 1 0 1 1 0 0 0 1 0 0 0 0 0 0
 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0
 1 1 0 0 1 0 0 0 0 0 0 0 0 

In [75]:
fig = px.scatter_3d(eNBA_22, x= "FG%", y= "PTS", z="AST", color=kmeans_obj_NBA.labels_,
                    title="")
fig.show(renderer="browser")

In [76]:
wcss = []
for i in range(1, 11):
    kmeans_obj_NBA = KMeans(n_clusters=i, random_state=1984).fit(clust_data_NBA)
    wcss.append(kmeans_obj_NBA.inertia_)
elbow_data_NBA = pd.DataFrame({"k": range(1, 11), "wcss": wcss})
fig = px.line(elbow_data_NBA, x="k", y="wcss", title="Elbow Method")
fig.show()























In [77]:
#View the results
kmeans_obj_NBA = KMeans(n_clusters=2, random_state=1).fit(clust_data_NBA)





In [78]:
print(kmeans_obj_NBA.cluster_centers_)
print(kmeans_obj_NBA.labels_)
print(kmeans_obj_NBA.inertia_)

[[0.4892295  0.22538055 0.13904853]
 [0.48091639 0.59224691 0.45448839]]
[0 0 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0
 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 0
 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 1
 1 0 0 1 0 1 1 0 1 1 0 0 0 0 0 0 0 0 1 1 1 0 0 1 1 1 0 0 0 0 0 0 1 1 1 1 0
 0 0 0 1 1 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1
 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1
 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 1 1 0 0 1 0 1 1 0 0 0 1 0 0 0 0 0 0
 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0
 1 1 0 0 1 0 0 0 0 0 0 0 0 

In [12]:
fig = px.scatter_3d(eNBA_22, x= "3P%", y= "2P%", z="FT%", color=kmeans_obj_NBA.labels_,
                    title="")
fig.show(renderer="browser")

In [80]:
#Create a visualization of the results with 2 or 3 variables that you think will best
#differentiate the clusters
clust_data_NBA = eNBA_22[["eFG%", "ORB", "BLK"]]
kmeans_obj_NBA = KMeans(n_clusters=2, random_state=1).fit(clust_data_NBA)
print(kmeans_obj_NBA.cluster_centers_)
print(kmeans_obj_NBA.labels_)
print(kmeans_obj_NBA.inertia_)

[[0.59624258 0.14463534 0.11511137]
 [0.71097968 0.44588344 0.32446809]]
[1 1 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1 1 1 0 1 1 0 0 0 1 0 0 1 0 0 0 0
 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0
 1 0 0 0 0 1 0 0 1 1 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0
 0 0 1 1 1 0 0 0 0 1 1 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0
 0 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1
 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0
 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 1 1 0 0 0 0 0 0 0 1 1 1 0 0 0
 1 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 0 0 0 0 1 0 0 0 0 





In [81]:
fig = px.scatter_3d(eNBA_22, x= "eFG%", y= "ORB", z="BLK", color=kmeans_obj_NBA.labels_,
                    title="")
fig.show(renderer="browser")

In [None]:
#Evaluate the quality of the clustering using total variance explained and silhouette scores

In [64]:
total_sum_squares = np.sum((clust_data_NBA - np.mean(clust_data_NBA))**2)
total = np.sum(total_sum_squares)
print(total)

44.514537829538455



The behavior of DataFrame.sum with axis=None is deprecated, in a future version this will reduce over both axes and return a scalar. To retain the old behavior, pass axis=0 (or do not pass axis)



In [65]:
between_SSE = (total-kmeans_obj_NBA.inertia_)
print(between_SSE)
Var_explained = between_SSE/total
print(Var_explained)

25.101231342313696
0.5638883961557679


In [79]:
from itertools import combinations
import pandas as pd
import numpy as np

# Initialize variables to store the top combinations and their corresponding Var_explained values
top_combinations = []
top_var_explained = [float('-inf')] * 5  # Initialize with small values

# Iterate through all combinations of three columns
for column_combination in combinations(eNBA_22.columns, 3):
    # Create a subset of the data with the current column combination
    clust_data_NBA = eNBA_22[list(column_combination)]
    
    # Perform k-means clustering
    kmeans_obj_NBA = KMeans(n_clusters=2, random_state=1).fit(clust_data_NBA)
    
    # Calculate total sum of squares
    total_sum_squares = np.sum((clust_data_NBA - np.mean(clust_data_NBA))**2)
    total = np.sum(total_sum_squares)
    
    # Calculate between sum of squares (explained variance)
    between_SSE = (total - kmeans_obj_NBA.inertia_)
    
    # Calculate explained variance
    var_explained = between_SSE / total
    
    # Check if the current explained variance is among the top 5
    for i, var_explained_value in enumerate(top_var_explained):
        if var_explained > var_explained_value:
            # Update the list of top combinations and Var_explained values
            top_var_explained.insert(i, var_explained)
            top_combinations.insert(i, column_combination)
            # Keep only the top 5 combinations
            top_var_explained = top_var_explained[:5]
            top_combinations = top_combinations[:5]
            break

# Create a DataFrame to display the results
results_table = pd.DataFrame({
    'Top 5 Combinations': top_combinations,
    'Top 5 Var_explained': top_var_explained
})

# Display the results table
print(results_table)





The behavior of DataFrame.sum with axis=None is deprecated, in a future version this will reduce over both axes and return a scalar. To retain the old behavior, pass axis=0 (or do not pass axis)




The behavior of DataFrame.sum with axis=None is deprecated, in a future version this will reduce over both axes and return a scalar. To retain the old behavior, pass axis=0 (or do not pass axis)




The behavior of DataFrame.sum with axis=None is deprecated, in a future version this will reduce over both axes and return a scalar. To retain the old behavior, pass axis=0 (or do not pass axis)




The behavior of DataFrame.sum with axis=None is deprecated, in a future version this will reduce over both axes and return a scalar. To retain the old behavior, pass axis=0 (or do not pass axis)




The behavior of DataFrame.sum with axis=None is deprecated, in a future version this will reduce over both axes and return a scalar. To retain the old behavior, pass axis=0 (or do not pass axis)




Th

  Top 5 Combinations  Top 5 Var_explained
0   (eFG%, ORB, BLK)             0.829789
1   (FG%, eFG%, BLK)             0.820372
2   (2P%, eFG%, BLK)             0.815692
3   (eFG%, FT%, BLK)             0.811207
4    (GS, eFG%, BLK)             0.809372





The behavior of DataFrame.sum with axis=None is deprecated, in a future version this will reduce over both axes and return a scalar. To retain the old behavior, pass axis=0 (or do not pass axis)




The behavior of DataFrame.sum with axis=None is deprecated, in a future version this will reduce over both axes and return a scalar. To retain the old behavior, pass axis=0 (or do not pass axis)



In [66]:
#Determine the ideal number of clusters using the elbow method and the silhouette coefficient
wcss = []
for i in range(1, 11):
    kmeans_obj_Dem = KMeans(n_clusters=i, random_state=1).fit(clust_data_NBA)
    wcss.append(kmeans_obj_Dem.inertia_)
elbow_data_NBA = pd.DataFrame({"k": range(1, 11), "wcss": wcss})
elbow_fig = px.line(elbow_data_NBA, x="k", y="wcss", title="Elbow Method")

from sklearn.metrics import silhouette_score

# Run NbClust
silhouette_scores = []
for k in range(2, 11):
    kmeans_obj = KMeans(n_clusters=k, algorithm="auto", random_state=1).fit(clust_data_NBA)
    silhouette_scores.append(silhouette_score(clust_data_NBA, kmeans_obj.labels_))

best_nc = silhouette_scores.index(max(silhouette_scores))+2
#%%
#plot the silhouette scores
sil_fig = go.Figure(data=go.Scatter(x=list(range(2, 11)), y=silhouette_scores))

























algorithm='auto' is deprecated, it will be removed in 1.3. Using 'lloyd' instead.




algorithm='auto' is deprecated, it will be removed in 1.3. Using 'lloyd' instead.




algorithm='auto' is deprecated, it will be removed in 1.3. Using 'lloyd' instead.




algorithm='auto' is deprecated, it will be removed in 1.3. Using 'lloyd' instead.




algorithm='auto' is deprecated, it will be removed in 1.3. Using 'lloyd' instead.




algorithm='auto' is deprecated, it will be removed in 1.3. Using 'lloyd' instead.




algorithm='auto' is deprecated, it will be removed in 1.3. Using 'lloyd' instead.




algorithm='auto' is deprecated, it will be removed in 1.3. Using 'lloyd' instead.




algorithm='auto' is deprecated, it will be removed in 1.3. Using 'lloyd' instead.



In [67]:
#Visualize the results of the elbow method
elbow_fig.show()
sil_fig

In [71]:
#Use the recommended number of cluster (assuming it's different) to retrain your model and visualize the results
kmeans_obj_NBA = KMeans(n_clusters=3, random_state=1).fit(clust_data_NBA)
print(kmeans_obj_NBA.cluster_centers_)
print(kmeans_obj_NBA.labels_)
print(kmeans_obj_NBA.inertia_)

[[0.2521518  0.41736131 0.32825784]
 [0.55590717 0.46938776 0.66306021]
 [0.13182471 0.22906404 0.14952237]]
[0 0 1 2 0 0 2 0 0 2 2 1 2 1 1 2 2 2 0 0 0 0 0 1 0 0 1 2 0 0 1 0 2 2 2 2 2
 1 2 0 0 2 0 0 2 2 1 2 0 2 0 0 1 1 0 2 1 2 2 0 2 2 0 2 1 0 2 0 2 2 2 2 2 2
 0 0 2 2 2 0 0 2 0 0 0 2 0 0 0 2 2 0 0 1 0 0 0 1 1 0 0 0 1 0 2 0 0 0 0 0 0
 1 0 0 0 0 0 1 1 2 1 0 2 0 2 2 2 2 0 2 2 2 2 0 1 0 0 2 0 0 0 0 1 2 0 1 0 1
 1 2 0 0 0 0 1 2 0 1 0 0 2 0 2 2 2 0 1 0 1 0 0 1 1 1 2 0 0 0 0 0 0 1 0 1 0
 2 0 0 0 1 0 0 1 2 2 2 2 2 1 0 0 0 0 0 0 2 2 2 0 0 0 2 2 0 2 0 1 1 0 1 2 1
 1 2 2 0 2 0 0 0 2 0 0 0 2 0 2 2 2 0 0 0 2 0 2 2 2 2 2 2 2 2 2 0 0 1 2 2 2
 2 1 2 2 0 1 1 0 2 1 0 2 0 0 0 2 2 0 1 0 0 2 0 0 0 2 0 2 2 0 2 0 2 0 0 1 1
 1 0 0 2 0 0 2 2 0 2 0 1 0 0 0 1 0 0 2 1 0 0 0 2 1 2 2 2 2 2 0 2 2 0 2 2 0
 2 2 2 2 0 2 2 0 2 0 0 0 0 1 0 2 2 0 0 0 1 1 0 2 0 0 1 1 1 2 2 2 2 0 1 2 0
 0 2 2 2 0 0 0 0 2 0 2 0 2 2 0 1 1 1 1 1 1 2 2 1 2 1 0 2 2 2 0 0 2 0 2 2 2
 0 0 1 0 1 0 2 0 0 0 2 1 2 0 2 2 2 2 1 0 0 0 0 1 0 0 2 0 0 0 0 0 0





In [72]:
fig = px.scatter_3d(eNBA_22, x= "FG%", y= "2P%", z="eFG%", color=kmeans_obj_NBA.labels_,
                    title="")
fig.show(renderer="browser")

In [None]:
#Once again evaluate the quality of the clustering using total variance explained and silhouette scores

In [None]:
#Use the model to select players for Mr. Rooney to consider

In [None]:
#Write up the results in a separate notebook with supporting visualizations and 
#an overview of how and why you made the choices you did. This should be at least 
#500 words and should be written for a non-technical audience.