# NBA Archetypes (Part 2)

This project explores the evolving landscape of the NBA by identifying distinct player archetypes from 2023-2024 regular-season performance data. The five traditional positions (Point Guard, Shooting Guard, Small Forward, Power Forward, and Center) have become less rigid as play style, strategy, and individual skill sets have changed. We first preprocess the data to clean and prepare it, then run exploratory data analysis (EDA) to understand the patterns and relationships within it. We then apply Principal Component Analysis (PCA) to reduce dimensionality and K-Means clustering to group similar players, and use the resulting clusters to define archetypes that reflect the roles players occupy on the court today. The goal is a more nuanced picture of player roles, and of their impact on team performance, than the traditional positions provide.
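As a rough illustration of the PCA and K-Means steps described above, here is a minimal sketch on synthetic data. The feature matrix `X`, the group centers, and the choice of three clusters are assumptions made for the example only; they are not the NBA dataset or the settings used in this project.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for player stat totals (NOT real NBA data):
# three well-separated groups of 40 "players" with 6 features each.
rng = np.random.default_rng(0)
centers = np.array([
    [0, 0, 0, 0, 0, 0],
    [8, 8, 0, 0, 8, 8],
    [0, 8, 8, 8, 0, 0],
], dtype=float)
X = np.vstack([c + rng.normal(scale=1.0, size=(40, 6)) for c in centers])

# 1. Standardize so that no single stat dominates because of its scale.
X_scaled = StandardScaler().fit_transform(X)

# 2. Project onto the first two principal components.
pca = PCA(n_components=2, random_state=0)
X_pca = pca.fit_transform(X_scaled)

# 3. Cluster in the reduced space (k=3 is known here only because the data are synthetic).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_pca)

# 4. Silhouette score: closer to 1 means tighter, better-separated clusters.
score = silhouette_score(X_pca, labels)
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Silhouette score:", round(score, 3))
```

On real player data the number of clusters is not known in advance, so one common approach is to compare silhouette scores across several values of `n_clusters`.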

## Initialization

In [1]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Machine learning and data preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

## Load Data

In [2]:
# Load data function
def load_data(file_name, local_path, server_path):
    try:
        data = pd.read_csv(local_path + file_name)
        print(f"'{file_name}' file successfully read from the local path.")

    except FileNotFoundError:
        try:
            data = pd.read_csv(server_path + file_name)
            print(f"'{file_name}' file successfully read from the server path.")

        except FileNotFoundError:
            print(f"'{file_name}' file not found. Please check the file paths.")
            data = None
            
    return data

file_names = ['cluster_0.csv', 'cluster_1.csv', 'cluster_2.csv', 'cluster_3.csv', 'cluster_4.csv', 'cluster_5.csv']
local_path = '/Users/benjaminstephen/Documents/TripleTen/Code_Pudding/NBA-Archetypes/datasets/'
server_path = '/datasets/'

cluster_0 = load_data(file_names[0], local_path, server_path)
cluster_1 = load_data(file_names[1], local_path, server_path)
cluster_2 = load_data(file_names[2], local_path, server_path)
cluster_3 = load_data(file_names[3], local_path, server_path)
cluster_4 = load_data(file_names[4], local_path, server_path)
cluster_5 = load_data(file_names[5], local_path, server_path)

'cluster_0.csv' file successfully read from the local path.
'cluster_1.csv' file successfully read from the local path.
'cluster_2.csv' file successfully read from the local path.
'cluster_3.csv' file successfully read from the local path.
'cluster_4.csv' file successfully read from the local path.
'cluster_5.csv' file successfully read from the local path.
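The six near-identical calls above could also be written as a loop. The sketch below is one possible variant, not the loader used in this notebook: `load_clusters` is a hypothetical helper, it checks a single folder rather than a local and a server path, and the demo files are throwaway CSVs written to a temporary directory.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_clusters(folder, n_clusters):
    """Read cluster_0.csv ... cluster_{n-1}.csv from `folder` into a dict keyed by cluster id."""
    folder = Path(folder)
    clusters = {}
    for i in range(n_clusters):
        path = folder / f"cluster_{i}.csv"
        if path.exists():
            clusters[i] = pd.read_csv(path)
        else:
            # Missing files are reported and skipped rather than stored as None.
            print(f"'{path.name}' not found in {folder}.")
    return clusters


# Demo with made-up files; cluster_2.csv is deliberately missing.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(2):
        toy = pd.DataFrame({"Player Name": [f"Player {i}"], "Archetype": [i]})
        toy.to_csv(Path(tmp) / f"cluster_{i}.csv", index=False)
    demo = load_clusters(tmp, n_clusters=3)

print(sorted(demo))
```

Keeping the frames in a dict makes it straightforward to apply the same check to every cluster in a loop, at the cost of replacing the `cluster_0` ... `cluster_5` variable names used in the rest of this notebook.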


## Data Preprocessing

In [3]:
# Analyze function
def analyze(data):
    # Display the DataFrame
    display(data)

    # Print DataFrame Info
    print("DATAFRAME INFO:")
    data.info()
    print()

    # Calculate Percentage of Null Values
    print("PERCENTAGE OF NULL VALUES:")
    print((data.isnull().sum()/len(data)) * 100)
    print()

    # Calculate Number of Duplicated Rows
    print("NUMBER OF DUPLICATED ROWS:", data.duplicated().sum())
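Because `analyze` prints its results, they are hard to compare across clusters. A possible companion, sketched here under the assumption that a plain dict is a convenient return type, computes the same null-percentage and duplicate checks and returns them. `summarize` and the toy DataFrame are hypothetical and not part of the original notebook.

```python
import pandas as pd


def summarize(data):
    """Return the checks that analyze() prints as a dict instead."""
    return {
        "rows": len(data),
        "columns": data.shape[1],
        # Mean of the boolean null mask is the fraction of nulls per column.
        "null_pct": (data.isnull().mean() * 100).round(2).to_dict(),
        "duplicated_rows": int(data.duplicated().sum()),
    }


# Toy data (made up): one fully duplicated row and one missing value.
toy = pd.DataFrame({
    "Player Name": ["Player A", "Player B", "Player B", "Player C"],
    "Total PTS": [10.0, 20.0, 20.0, None],
})
summary = summarize(toy)
print(summary)
```

Here `data.isnull().mean() * 100` gives the same percentages as `data.isnull().sum() / len(data) * 100` in `analyze`.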

## Cluster 0

In [4]:
analyze(cluster_0)

Unnamed: 0,Player Name,Archetype,Total MP,Total 3P,Total 3PA,Total 2P,Total 2PA,Total FT,Total FTA,Total ORB,Total DRB,Total AST,Total STL,Total BLK,Total TOV,Total PF,Total PTS
0,A.J. Green,0,616.0,67.2,168.0,16.8,28.0,16.8,16.8,11.2,56.0,28.0,11.2,5.6,11.2,50.4,252.0
1,A.J. Lawson,0,310.8,12.6,50.4,42.0,71.4,16.8,21.0,12.6,37.8,21.0,8.4,4.2,12.6,21.0,134.4
2,AJ Griffin,0,172.0,10.0,40.0,8.0,24.0,2.0,2.0,2.0,16.0,6.0,2.0,2.0,8.0,6.0,48.0
3,Adam Flagler,0,14.0,1.0,6.0,0.0,1.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,3.0
4,Adama Sanogo,0,65.7,0.0,0.0,14.4,27.0,8.1,11.7,18.9,17.1,0.0,0.9,0.0,5.4,5.4,36.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
221,Wendell Moore Jr.,0,75.0,0.0,5.0,10.0,12.5,0.0,0.0,0.0,12.5,5.0,5.0,0.0,5.0,5.0,17.5
222,Wenyen Gabriel,0,81.0,1.0,6.0,7.0,16.0,0.0,5.0,7.0,18.0,3.0,2.0,2.0,8.0,10.0,17.0
223,Wesley Matthews,0,414.0,25.2,68.4,10.8,28.8,18.0,25.2,10.8,43.2,21.6,14.4,10.8,7.2,43.2,111.6
224,Xavier Moon,0,119.0,1.4,16.8,12.6,29.4,1.4,1.4,8.4,9.8,21.0,2.8,2.8,5.6,8.4,33.6


DATAFRAME INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226 entries, 0 to 225
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Player Name  226 non-null    object 
 1   Archetype    226 non-null    int64  
 2   Total MP     226 non-null    float64
 3   Total 3P     226 non-null    float64
 4   Total 3PA    226 non-null    float64
 5   Total 2P     226 non-null    float64
 6   Total 2PA    226 non-null    float64
 7   Total FT     226 non-null    float64
 8   Total FTA    226 non-null    float64
 9   Total ORB    226 non-null    float64
 10  Total DRB    226 non-null    float64
 11  Total AST    226 non-null    float64
 12  Total STL    226 non-null    float64
 13  Total BLK    226 non-null    float64
 14  Total TOV    226 non-null    float64
 15  Total PF     226 non-null    float64
 16  Total PTS    226 non-null    float64
dtypes: float64(15), int64(1), object(1)
memory usage: 30.1+ KB

PERCENTAGE OF NULL VALUES:
...

## Cluster 1

In [5]:
analyze(cluster_1)

Unnamed: 0,Player Name,Archetype,Total MP,Total 3P,Total 3PA,Total 2P,Total 2PA,Total FT,Total FTA,Total ORB,Total DRB,Total AST,Total STL,Total BLK,Total TOV,Total PF,Total PTS
0,Alec Burks,1,2427.9,249.6,657.9,171.3,477.2,288.6,330.5,50.5,252.2,175.3,54.8,26.1,92.4,148.6,1377.7
1,Austin Reaves,1,2632.2,155.8,418.2,303.4,516.6,229.6,270.6,57.4,295.2,451.0,65.6,24.6,172.2,155.8,1303.8
2,Bogdan Bogdanović,1,2401.6,237.0,639.9,229.1,458.2,150.1,165.9,55.3,221.2,244.9,94.8,23.7,110.6,181.7,1335.1
3,Bojan Bogdanović,1,2954.3,264.3,671.1,352.5,693.7,235.7,298.2,45.5,261.6,193.0,56.7,8.5,204.7,190.6,1733.6
4,Brandon Ingram,1,2105.6,83.2,243.2,416.0,774.4,243.2,307.2,44.8,281.6,364.8,51.2,38.4,160.0,147.2,1331.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56,Stephen Curry,1,2419.8,355.2,873.2,296.0,569.8,296.0,325.6,37.0,296.0,377.4,51.8,29.6,207.2,118.4,1953.6
57,Tobias Harris,1,2366.0,91.0,259.0,371.0,693.0,189.0,210.0,77.0,371.0,217.0,70.0,49.0,91.0,112.0,1204.0
58,Trae Young,1,1944.0,172.8,469.8,259.2,540.0,345.6,405.0,21.6,124.2,583.2,70.2,10.8,237.6,108.0,1387.8
59,Tyrese Haliburton,1,2221.8,193.2,538.2,303.6,510.6,193.2,227.7,34.5,234.6,752.1,82.8,48.3,158.7,75.9,1386.9


DATAFRAME INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Player Name  61 non-null     object 
 1   Archetype    61 non-null     int64  
 2   Total MP     61 non-null     float64
 3   Total 3P     61 non-null     float64
 4   Total 3PA    61 non-null     float64
 5   Total 2P     61 non-null     float64
 6   Total 2PA    61 non-null     float64
 7   Total FT     61 non-null     float64
 8   Total FTA    61 non-null     float64
 9   Total ORB    61 non-null     float64
 10  Total DRB    61 non-null     float64
 11  Total AST    61 non-null     float64
 12  Total STL    61 non-null     float64
 13  Total BLK    61 non-null     float64
 14  Total TOV    61 non-null     float64
 15  Total PF     61 non-null     float64
 16  Total PTS    61 non-null     float64
dtypes: float64(15), int64(1), object(1)
memory usage: 8.2+ KB

PERCENTAGE OF NULL VALUES:
...

## Cluster 2

In [6]:
analyze(cluster_2)

Unnamed: 0,Player Name,Archetype,Total MP,Total 3P,Total 3PA,Total 2P,Total 2PA,Total FT,Total FTA,Total ORB,Total DRB,Total AST,Total STL,Total BLK,Total TOV,Total PF,Total PTS
0,Aaron Holiday,2,1271.4,85.8,218.4,101.4,202.8,54.6,62.4,23.4,101.4,140.4,39.0,7.8,54.6,124.8,514.8
1,Aaron Wiggins,2,1224.6,62.4,124.8,148.2,249.6,54.6,70.2,62.4,124.8,85.8,54.6,15.6,54.6,93.6,538.2
2,Aleksej Pokusevski,2,811.6,35.0,110.2,61.6,133.4,61.6,83.8,33.2,145.6,72.0,29.4,27.6,38.8,43.6,290.8
3,Amir Coffey,2,1463.0,70.0,182.0,98.0,168.0,63.0,70.0,28.0,119.0,77.0,42.0,14.0,35.0,105.0,462.0
4,Anthony Black,2,1166.1,34.5,96.6,75.9,151.8,48.3,82.8,34.5,103.5,89.7,34.5,20.7,55.2,110.4,317.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
131,Wendell Carter Jr.,2,1408.0,66.0,170.5,159.5,253.0,93.5,132.0,110.0,269.5,93.5,33.0,27.5,66.0,121.0,605.0
132,Yuta Watanabe,2,927.2,48.0,170.4,33.6,62.8,18.9,33.5,21.8,88.0,27.3,25.3,12.6,40.8,77.2,233.0
133,Zach LaVine,2,872.5,60.0,170.0,110.0,207.5,87.5,102.5,7.5,120.0,97.5,20.0,7.5,52.5,57.5,487.5
134,Zeke Nnaji,2,574.2,5.8,23.2,63.8,127.6,40.6,63.8,63.8,63.8,34.8,17.4,40.6,29.0,81.2,185.6


DATAFRAME INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 136 entries, 0 to 135
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Player Name  136 non-null    object 
 1   Archetype    136 non-null    int64  
 2   Total MP     136 non-null    float64
 3   Total 3P     136 non-null    float64
 4   Total 3PA    136 non-null    float64
 5   Total 2P     136 non-null    float64
 6   Total 2PA    136 non-null    float64
 7   Total FT     136 non-null    float64
 8   Total FTA    136 non-null    float64
 9   Total ORB    136 non-null    float64
 10  Total DRB    136 non-null    float64
 11  Total AST    136 non-null    float64
 12  Total STL    136 non-null    float64
 13  Total BLK    136 non-null    float64
 14  Total TOV    136 non-null    float64
 15  Total PF     136 non-null    float64
 16  Total PTS    136 non-null    float64
dtypes: float64(15), int64(1), object(1)
memory usage: 18.2+ KB

PERCENTAGE OF NULL VALUES:
...