# Framework Workflow

This notebook presents the step-by-step workflow of the framework used in the article (DOI: [insert DOI here]).

It is divided into different stages and sub-stages, ensuring full reproducibility of the analyses conducted in the work.

---

## Preprocessing Stage

This stage covers the visualizations and decisions made while preparing the dataset, such as:

- Removal of players with low minutes
- Removal of highly correlated variables
- Calculation of the four moments  
  *(Note: For this, we have a separate dataset called `four_moment_nba.csv`, available on Google Drive)*
- Analysis of heavy tails
- Min-Max transformation
- Use of the UMAP dimensionality reduction algorithm

---

## Clustering Stage

This stage covers the visualizations and decisions made while clustering the data:

- Visualization of silhouette score and inertia
- Silhouette score analysis for specific groups using a blade (knife) plot
- Final selection of the optimal number of clusters

---

## Results Stage

Finally, all results presented in the study are included here, except for the UBMG plot, which was created directly in LaTeX.  
The information behind that plot is still computed in one of the notebook cells.

In [None]:
features_to_remove = [
    "MIN",
    "FGM",
    "FGA",
    "FG2M",
    "FG2U",
    "FG2A",
    #"FG3M",
    "FG3U",
    "FG3A",
    "FTM",
    "FTA",
    #"FTU",
    #"CLOSE_SHOT_M",
    "CLOSE_SHOT_U",
    "CLOSE_SHOT_A",
    "MID_RANGE_SHOT_M",
    #"MID_RANGE_SHOT_U",
    "MID_RANGE_SHOT_A",
    #"LONG_MID_RANGE_SHOT_M",
    "LONG_MID_RANGE_SHOT_U",
    "LONG_MID_RANGE_SHOT_A",
    "T_MID_RANGE_SHOT_M",
    "T_MID_RANGE_SHOT_U",
    "T_MID_RANGE_SHOT_A",
    "THREE_POINT_SHOT_M",
    "THREE_POINT_SHOT_U",
    "THREE_POINT_SHOT_A",
    #"OREB",
    #"DREB",
    "REB",
    #"AST",
    #"STL",
    #"BLK",
    "TO",
    #"PF",
    #"PTS",
    "PLUS_MINUS",
    "E_OFF_RATING",
    #"OFF_RATING",
    "E_DEF_RATING",
    #"DEF_RATING",
    "E_NET_RATING",
    #"NET_RATING",
    "AST_TOV",
    "E_PACE",
    "PACE",
    "PACE_PER40",
    "POSS",

    "TM_FG2M_PCT",
    "TM_FG2U_PCT",
    "TM_FG2A_PCT",
    "TM_FG3M_PCT",
    "TM_FG3U_PCT",
    "TM_FG3A_PCT",
    "TM_FTM_PCT",
    "TM_FTU_PCT",
    "TM_FTA_PCT",
    "TM_OREB_PCT",
    "TM_DREB_PCT",
    "TM_REB_PCT",
    "TM_AST_PCT",
    "TM_STL_PCT",
    "TM_BLK_PCT",
    "TM_TO_PCT",
    "TM_PF_PCT",
    "TM_PTS_PCT",

    #"CLOSE_SHOT_PCT",
    #"MID_RANGE_SHOT_PCT",
    "LONG_MID_RANGE_SHOT_PCT",
    "T_MID_RANGE_SHOT_PCT",
    "THREE_POINT_SHOT_PCT",
    #"FG_PCT",
    "FG2_PCT",
    "FG3_PCT",
    "FT_PCT",
    "AST_PCT",
    #"AST_RATIO",
    "PIE",
    "OREB_PCT",
    #"DREB_PCT",
    "REB_PCT",
    #"TM_TOV_PCT",
    "EFG_PCT",
    "TS_PCT",
    "USG_PCT",
    "E_USG_PCT"
]

#### Preprocessing Stage

In [None]:
import pandas as pd

# Load the dataset.
# NOTE: This file contains only the mean of each feature, not all four moments.

dataset = pd.read_csv("nba_dataset.csv", dtype={"PLAYER_ID": str, "PLAYER_NAME": str, "SEASON_ID": str})

# The last 12 columns hold metadata, such as the player name and individual awards.
metadata = dataset.iloc[:, -12:].copy()
dataset_features = dataset.iloc[:, :-12].copy()

##### Removal of players with low minutes

In [None]:
cdf_graph(dataset=dataset_features, column="MIN_TOTAL", title="minutes_cdf")

# Based on the graph, remove all players with fewer than 400 minutes in a season
dataset_features = dataset_features[dataset_features["MIN_TOTAL"] >= 400]
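
The helper `cdf_graph` is not defined in this notebook. As a rough sketch only, assuming it draws an empirical CDF of the chosen column, the curve behind such a plot and the share of rows a cutoff would remove can be computed as follows. The function `empirical_cdf` and the toy minutes are hypothetical and are not taken from the paper's data.

```python
import numpy as np

def empirical_cdf(values):
    """Return the sorted values and the fraction of observations <= each value."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

# Toy season minute totals, invented for illustration only.
minutes = [120, 350, 400, 900, 1500, 2200, 2800, 3000]
x, y = empirical_cdf(minutes)

# Share of player-seasons that a 400-minute cutoff would remove.
share_removed = float(np.mean(np.asarray(minutes) < 400))
print(share_removed)
```

Plotting `x` against `y` gives the CDF curve, and reading it at the cutoff shows how much of the sample is discarded.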

##### Removal of highly correlated features

In [None]:
# NOTE: This step was applied to all features; the shot-location metrics are used here as an example.

shot_locations_metrics = [
    "CLOSE_SHOT_M_MEAN", "CLOSE_SHOT_U_MEAN", "CLOSE_SHOT_A_MEAN", "CLOSE_SHOT_PCT_MEAN",
    "MID_RANGE_SHOT_M_MEAN", "MID_RANGE_SHOT_U_MEAN", "MID_RANGE_SHOT_A_MEAN", "MID_RANGE_SHOT_PCT_MEAN",
    "LONG_MID_RANGE_SHOT_M_MEAN", "LONG_MID_RANGE_SHOT_U_MEAN", "LONG_MID_RANGE_SHOT_A_MEAN", "LONG_MID_RANGE_SHOT_PCT_MEAN",
    "THREE_POINT_SHOT_M_MEAN", "THREE_POINT_SHOT_U_MEAN", "THREE_POINT_SHOT_A_MEAN", "THREE_POINT_SHOT_PCT_MEAN"
]

# Visualize the correlation matrix
matrix_of_correlation_graph(dataset_features, shot_locations_metrics)

# Use PCA loadings to decide which highly correlated features to remove
plot_pca_loadings(dataset_features, shot_locations_metrics)

selected_features = [
    "FG3M",
    "FTU",
    "CLOSE_SHOT_M",
    "MID_RANGE_SHOT_U",
    "LONG_MID_RANGE_SHOT_M",
    "OREB",
    "DREB",
    "AST",
    "STL",
    "BLK",
    "PF",
    "PTS",
    "OFF_RATING",
    "DEF_RATING",
    "NET_RATING",
    "CLOSE_SHOT_PCT",
    "MID_RANGE_SHOT_PCT",
    "FG_PCT",
    "AST_RATIO",
    "DREB_PCT",
    "TM_TOV_PCT"
]

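# NOTE: As noted above, the same selection procedure was applied to every feature group.

# Filtered dataset after feature selection.
# Assumption: the columns to drop are those listed in `features_to_remove`;
# adjust the names if the dataset columns carry a suffix such as `_MEAN`.
selected_dataset = dataset_features.drop(columns=features_to_remove).copy()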
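
The correlation helpers used in this cell are not defined in the notebook. As a minimal sketch of the underlying idea, assuming a Pearson correlation matrix and an arbitrary threshold of 0.9 (not a value from the paper), the pairs of features that are candidates for removal could be listed like this. The function name and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.9):
    """List column pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy data, invented for illustration: SHOT_A is almost a multiple of SHOT_M.
rng = np.random.default_rng(0)
made = rng.normal(size=200)
toy = pd.DataFrame({
    "SHOT_M": made,
    "SHOT_A": made * 2 + rng.normal(scale=0.05, size=200),
    "BLK": rng.normal(size=200),
})
pairs = highly_correlated_pairs(toy)
print(pairs)
```

For each flagged pair, one feature is kept and the other dropped; which one to keep is a judgment call that the PCA loadings plot can inform.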

##### Analysis of heavy tails

In [None]:
# Create a file summarizing the features
summarize_features(selected_dataset)

# Plot the distribution of each feature in batches
plot_histograms_in_batches(selected_dataset, batch_size=10)

# Transform features with high skewness
dataset_transformed = transform_skewed_features(selected_dataset)
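
The implementation of `transform_skewed_features` is not shown here. One common way to tame heavy right tails is a `log1p` transform on features whose skewness exceeds a threshold; the sketch below assumes that approach, which may differ from the one used in the paper. The function name, the threshold of 1.0, and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd

def log_transform_skewed(df, threshold=1.0):
    """Apply log1p to non-negative columns whose sample skewness exceeds the threshold."""
    out = df.copy()
    skew = df.skew()
    for col in df.columns:
        if skew[col] > threshold and (df[col] >= 0).all():
            out[col] = np.log1p(df[col])
    return out

# Toy data, invented for illustration: one heavy-tailed and one symmetric column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "HEAVY_TAIL": rng.lognormal(mean=0.0, sigma=1.0, size=500),
    "SYMMETRIC": rng.normal(loc=10.0, scale=1.0, size=500),
})
transformed = log_transform_skewed(toy)
print(toy.skew(), transformed.skew())
```

Only the heavy-tailed column is changed; the symmetric one passes through untouched.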

##### Normalization

In [None]:
dataset_scaled = min_max(dataset_transformed)
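
Min-Max scaling maps each feature to the range [0, 1] through x' = (x - min) / (max - min). The helper `min_max` is not defined in this notebook, so the following is only a sketch of that formula with pandas; the function name and the handling of constant columns are assumptions.

```python
import pandas as pd

def min_max_scale(df):
    """Rescale each column to [0, 1]; constant columns become 0."""
    span = df.max() - df.min()
    span = span.replace(0, 1)  # avoid dividing by zero for constant columns
    return (df - df.min()) / span

# Toy data, invented for illustration only.
toy = pd.DataFrame({"PTS": [5.0, 10.0, 25.0], "AST": [2.0, 2.0, 2.0]})
scaled = min_max_scale(toy)
print(scaled)
```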

##### Dimensionality reduction

In [None]:
# NOTE: A seed is needed throughout the next steps. The one below was drawn with a random selector.
# If you need a new random seed, use random.randint(0, 2**32 - 1).
# Here we use the seed from the paper.

seed = 2610048211

cluster_dataset_embedding = umap_method(dataset_scaled, n_components=15, n_neighbors=5, random_state=seed)

#### Clustering Stage

In [None]:
# Plot the silhouette score for each candidate number of clusters
plot_silhouette_analysis(cluster_dataset_embedding, filename="silhoutte_score.pdf")

# Plot the inertia for each candidate number of clusters
plot_inertia_analysis(cluster_dataset_embedding)

# Blade (knife) plot of the silhouette scores for k = 5
plot_silhouette_blades(cluster_dataset_embedding, 5, seed=seed)


# Final clustering
kmeans_labels, score, outliers, centroid, inertia = kmeans_method(
    embedding=cluster_dataset_embedding,
    n_clusters=5,
    seed=seed,
    plot=False
)

# Final organization of the dataset for use throughout the results section.
# Assumption: the filtered feature table is `selected_dataset` from the preprocessing stage.
raw_features, features_with_meta, minmax_normalized = formatted_data_to_analysis(
    features=selected_dataset,
    metadata=metadata,
    labels=kmeans_labels,
)
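
The plotting helpers above are project functions whose code is not included in this notebook. As a hedged sketch of what a silhouette and inertia sweep typically involves, the snippet below uses scikit-learn's `KMeans` and `silhouette_score` on a toy embedding. The function `sweep_k`, the toy data, and the seed value 42 are assumptions for illustration, not the paper's setup.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def sweep_k(embedding, k_values, seed):
    """Fit k-means for each k and record the inertia and mean silhouette score."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(embedding)
        results[k] = {
            "inertia": float(model.inertia_),
            "silhouette": float(silhouette_score(embedding, model.labels_)),
        }
    return results

# Toy embedding, invented for illustration: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
toy_embedding = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

results = sweep_k(toy_embedding, range(2, 7), seed=42)
best_k = max(results, key=lambda k: results[k]["silhouette"])
print(best_k)
```

Inertia keeps falling as k grows, so it is read through its elbow, while the silhouette score peaks near a well-supported k. Looking at both, together with the blade plot, is what motivates the final choice.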

#### Results Stage

In [None]:
cluster_size(features_with_meta, "cluster_size")

cluster_members_by_year(features_with_meta)

cluster_members_total(features_with_meta)

awards_by_cluster(features_with_meta)

gini_most(minmax_normalized, 10)

ginix_diference_plot(features_with_meta)

plotradar(minmax_normalized, colors)

# Select specific seasons for the UBMG graph
specific_seasons = ["2019-20", "2020-21", "2021-22", "2022-23", "2023-24"]
# Analyze cluster migration percentages for the selected seasons
ubmg_results = analyze_cluster_migration_percentages(features_with_meta, specific_seasons, n_clusters=5)
# To view the resulting table, simply print it
# print(ubmg_results)

player_position()

player_shot_map()
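
The function `analyze_cluster_migration_percentages` is not defined in this notebook. As an illustration of the kind of table a migration analysis can produce, the sketch below computes, for two consecutive seasons, the percentage of players from each origin cluster that end up in each destination cluster. The function name and the `CLUSTER` column are hypothetical; `PLAYER_ID` and `SEASON_ID` follow the column names used when loading the dataset.

```python
import pandas as pd

def migration_percentages(df, season_from, season_to):
    """Row-normalized table: % of each origin cluster's players found in each destination cluster."""
    a = df[df["SEASON_ID"] == season_from].set_index("PLAYER_ID")["CLUSTER"]
    b = df[df["SEASON_ID"] == season_to].set_index("PLAYER_ID")["CLUSTER"]
    # Keep only players present in both seasons.
    both = pd.concat([a.rename("from"), b.rename("to")], axis=1, join="inner")
    return pd.crosstab(both["from"], both["to"], normalize="index") * 100

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "PLAYER_ID": ["1", "2", "3", "4", "1", "2", "3", "4"],
    "SEASON_ID": ["2022-23"] * 4 + ["2023-24"] * 4,
    "CLUSTER": [0, 0, 1, 1, 0, 1, 1, 1],
})
table = migration_percentages(toy, "2022-23", "2023-24")
print(table)
```

Each row sums to 100, so a row reads as "where did the players of this cluster go in the following season".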