## NBA Clustering Lab Analysis

### Introduction

The purpose of this report was to analyze NBA players by performance level and cost in order to find potential steals to support a playoff push for Mr. Rooney's team. I wanted to avoid overpaying for low-value performance, so I used the two datasets I was provided with to select appropriate players. The first was a list of player statistics, from which I focused on each player's total points, games played, and minutes played; the second was a list of player salaries for the 2025-2026 season. I merged these into a single dataset for my analysis. My findings will allow the team to balance player performance against salary costs and avoid overspending.

### Variables

The variables I used were as follows. Games played (G) was used to drop duplicate rows for players who appeared with multiple teams in one season and were therefore listed more than once. Minutes played (MP) shows how much time a player spends on the court, and higher minutes can be taken to indicate higher importance to the team. Points scored (PTS) is a direct measure of a player's scoring contribution, and is thus one of the most valuable metrics. From these variables, I derived an efficiency measure, points per minute (PTS_per_MP), which measures how much a player scores per minute played. My value metric was efficiency relative to salary, that is, points per minute divided by salary in millions. I used these variables together because I believe they best capture performance (scoring) and cost-effectiveness. This also let me leave out statistics that would have overcomplicated the model without helping Mr. Rooney make decisions.
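As a rough illustration of the two derived metrics described above, here is a minimal sketch. The function names and the sample numbers are hypothetical; only the formulas (points divided by minutes, then divided by salary in millions) come from the description.

```python
def points_per_minute(pts: float, mp: float) -> float:
    """Scoring rate: total points divided by total minutes played."""
    if mp <= 0:
        raise ValueError("minutes played must be positive")
    return pts / mp


def value_metric(pts: float, mp: float, salary_usd: float) -> float:
    """Points per minute divided by salary expressed in millions of dollars."""
    if salary_usd <= 0:
        raise ValueError("salary must be positive")
    return points_per_minute(pts, mp) / (salary_usd / 1_000_000)


# Hypothetical player: 600 points in 1,200 minutes on a $2,000,000 salary.
example_rate = points_per_minute(600, 1200)
example_value = value_metric(600, 1200, 2_000_000)
```

A higher value means more scoring per minute for each million dollars of salary, which is why low-salary players with solid scoring rates rise to the top.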

### Approach

My approach started with the necessary data cleaning. I first merged the statistics and salaries datasets so that each player would have information from both. I removed duplicate players by keeping the row with the most games played, because that was likely more indicative of their typical performance. I removed rows with missing salary values or incomplete statistics, and I converted salaries to numeric values, since the analysis requires numeric inputs. I set the player names and salaries aside because I didn't need them for modeling, but I would need them later for visualizations, comparisons, and selections.
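The cleaning steps above could look something like the following pandas sketch. The tiny tables, the column name `Salary`, the `$1,000,000`-style salary strings, and the choice of `MP` and `PTS` as the model features are all assumptions made for illustration, not details taken from the actual datasets.

```python
import pandas as pd

# Made-up tables standing in for the two provided datasets.
stats = pd.DataFrame({
    "Player": ["A", "A", "B", "C", "D"],
    "G": [50, 20, 60, 10, 40],
    "MP": [1500, 500, 1800, 100, 900],
    "PTS": [700, 200, 900, 30, None],
})
salaries = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "Salary": ["$2,000,000", "$10,500,000", "$1,200,000"],
})

# Attach salaries; a left merge leaves players without a salary as NaN.
merged = stats.merge(salaries, on="Player", how="left")

# Keep one row per player: the one with the most games played.
merged = merged.sort_values("G", ascending=False).drop_duplicates(
    subset="Player", keep="first"
)

# Strip "$" and "," and convert salaries to numbers (bad values become NaN).
merged["Salary"] = pd.to_numeric(
    merged["Salary"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)

# Drop rows with a missing salary or incomplete statistics.
merged = merged.dropna(subset=["G", "MP", "PTS", "Salary"]).reset_index(drop=True)

# Set names and salaries aside; only the statistics go into the model.
info = merged[["Player", "Salary"]]
features = merged[["MP", "PTS"]]
```

Because `info` and `features` share the same row order, cluster labels computed from `features` can be attached back to the names and salaries afterwards.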

The next step was the actual clustering of players. I first scaled the data so that the variable with the larger numeric range (minutes vs. points) wouldn't dominate the clusters. I then used k-means clustering (KMeans) to group players into tiers based on similarities in their performance. The goal is for players within a cluster to be as similar as possible and for the clusters to be as different as possible from each other. I started by guessing that 3 clusters would be optimal (low, medium, and high performers), so I used a k-value of 3 for my model. I plotted the result as a scatter plot of points scored against minutes played, with each point colored by salary and each cluster represented by a different marker shape.
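A minimal sketch of the scaling and clustering step, assuming scikit-learn's `StandardScaler` and `KMeans`. The data here is synthetic (three invented tiers of minutes and points), and the `random_state` and `n_init` values are illustrative choices rather than the settings used in the lab.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic (minutes, points) pairs for three rough tiers of players.
X = np.vstack([
    rng.normal([300, 100], [80, 40], size=(40, 2)),
    rng.normal([1200, 500], [150, 80], size=(40, 2)),
    rng.normal([2200, 1400], [150, 150], size=(40, 2)),
])

# Standardize so minutes and points each have mean 0 and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Fit k-means with k=3; a fixed random_state makes the run repeatable.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Number of players assigned to each cluster.
cluster_sizes = np.bincount(labels)
```

Note that k-means assigns cluster numbers arbitrarily, so which label ends up meaning "low" or "high" has to be read off the cluster centers or the plot.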

After clustering, I checked two measures to evaluate the model's performance and the suitability of the k-value. Total variance explained tells me how much of the variation between players is captured by my groups, and the silhouette score measures how clearly each player fits in their assigned group compared with the nearest other group. In both cases, a higher number is typically better.
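Both measures can be computed in a few lines. The sketch below assumes that "total variance explained" means one minus the ratio of within-cluster sum of squares (`inertia_` in scikit-learn) to the total sum of squares, which is a common definition but may not match exactly how the number in this report was produced. The helper name `evaluate_kmeans` and the demo data are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def evaluate_kmeans(X, k, seed=42):
    """Return (variance explained, silhouette score) for a k-means fit."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    # Total sum of squared distances from every point to the overall mean.
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    # inertia_ is the within-cluster sum of squares left after clustering.
    variance_explained = 1 - model.inertia_ / total_ss
    silhouette = silhouette_score(X, model.labels_)
    return variance_explained, silhouette


# Two well-separated synthetic blobs as a sanity check.
rng = np.random.default_rng(1)
X_demo = np.vstack([
    rng.normal(-2.0, 0.5, size=(50, 2)),
    rng.normal(2.0, 0.5, size=(50, 2)),
])
var_exp, sil = evaluate_kmeans(X_demo, k=2)
```

Variance explained falls between 0 and 1, while the silhouette score ranges from -1 to 1, so the two numbers are not directly comparable to each other.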

I then used the elbow method and silhouette scores to find the ideal number of clusters. The elbow method plots the "cost" (typically the within-cluster variation) for each number of clusters (I evaluated 1-10) and looks for the point where adding more groups doesn't help much anymore (a bend in the plot). This point represents the optimal k, which turned out to be 2, showing that 2 clusters work best for a simple, clear separation between high and low performers. The silhouette scores also pointed to 2 as the ideal number of clusters, as k=2 gave the highest silhouette score.
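The search over k can be sketched as a simple loop. This example uses synthetic two-group data and assumes the elbow "cost" is scikit-learn's `inertia_`. One detail worth noting is that the silhouette score is only defined for two or more clusters, so it is skipped for k=1.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data with two clear groups, standing in for the scaled player stats.
rng = np.random.default_rng(2)
X_demo = np.vstack([
    rng.normal(-2.0, 0.5, size=(60, 2)),
    rng.normal(2.0, 0.5, size=(60, 2)),
])

inertias = {}     # k -> within-cluster sum of squares (the elbow "cost")
silhouettes = {}  # k -> silhouette score (defined only for k >= 2)

for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_demo)
    inertias[k] = model.inertia_
    if k >= 2:
        silhouettes[k] = silhouette_score(X_demo, model.labels_)

# The k with the highest silhouette score.
best_k = max(silhouettes, key=silhouettes.get)
```

Plotting `inertias` against k gives the elbow plot; the bend is judged by eye, which is why pairing it with the silhouette score is a useful cross-check.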

With this in mind, I retrained my model using a k-value of 2 instead of 3, and I plotted it again with the same constraints. I then reevaluated the model's performance by checking the total variance explained and silhouette score.

Finally, I put the model to use. I created a simple efficiency metric (points per minute / salary in millions) to give a clear idea of which players give the most performance for the least money. I filtered out players who play very few minutes (<300), as their stats may be inflated and not reflect their true ability. I sorted players by cluster and value, which provided me with three groups for Mr. Rooney: good choices (medium performers with high value), not good choices (low performers with poor value), and potential backups (high-profile but expensive players).
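The filtering and ranking step might look like the sketch below. The players, their numbers, and the column names `Salary`, `Cluster`, and `Value` are invented for illustration; the 300-minute cutoff and the value formula follow the description above.

```python
import pandas as pd

# Made-up players with cluster labels already attached.
players = pd.DataFrame({
    "Player": ["P1", "P2", "P3", "P4", "P5"],
    "MP": [250, 1200, 900, 2000, 1500],
    "PTS": [200, 600, 300, 1400, 450],
    "Salary": [1.0e6, 2.0e6, 8.0e6, 40.0e6, 3.0e6],
    "Cluster": [0, 1, 0, 1, 0],
})

MIN_MINUTES = 300  # players below this are dropped as small-sample cases

eligible = players[players["MP"] >= MIN_MINUTES].copy()

# Points per minute, then divided by salary in millions of dollars.
eligible["PTS_per_MP"] = eligible["PTS"] / eligible["MP"]
eligible["Value"] = eligible["PTS_per_MP"] / (eligible["Salary"] / 1e6)

# Sort by cluster, then by value from best to worst within each cluster.
ranked = eligible.sort_values(["Cluster", "Value"], ascending=[True, False])

# Top 4 players by value in each cluster.
top_per_cluster = ranked.groupby("Cluster").head(4)
```

Sorting the same table in ascending order of `Value` would surface the poor-value players instead.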


### Results

For the initial 3 clusters, there were 99 players in Cluster 0 (low performers), 174 players in Cluster 1 (medium performers), and 138 players in Cluster 2 (high performers). My visualization is pictured below.

<img src="3Clusters.png" width="500">

This is a scatter plot of points scored vs minutes played. As you can see, the clusters are differentiated by shape, generally with Cluster 0 at the bottom left (low points, low minutes), Cluster 2 at the top right (high points, high minutes), and Cluster 1 in between. The clusters are colored by salary, with the scale shown at the right, so darker colors reflect lower salaries and lighter colors reflect higher salaries. This is what I used to determine overall value.

Upon evaluation, my model using 3 clusters had a total variance explained of 0.7657 (76.57%, not bad!) and a silhouette score of 0.4461. I wanted to improve upon this, so I used the elbow method and silhouette scores to determine the ideal number of clusters. I tested k-values of 1-10, and the highest silhouette score, 0.4885, corresponded to k=2. As k increased, scores typically decreased, so in this case a lower k is better. The elbow plot, pictured below, gave me similar insights.

<img src="ElbowPlot.png" width="500">

As previously discussed, the elbow of the plot represents the ideal k-value, which turned out to be 2 (thus aligning with the silhouette score). I applied this new knowledge when retraining the model with a different k-value, which changed my visualization to represent the two new clusters.

<img src="2Clusters.png" width="500">

This new visualization is clearly similar to the original one, only with 2 clusters this time instead of 3. The same attributes apply to the new plot, with Cluster 0 (low performance) at the bottom left and Cluster 1 (high performance) at the top right. 

Upon reevaluation, I found that the total variance explained dropped to 0.5978, but the silhouette score increased to 0.4885. I believe it is more important to prioritize the higher silhouette score, because it reflects how well separated the clusters are. Higher variance explained isn't always better: it tends to rise whenever clusters are added, and clusters start to lose their meaning as their number increases. This leads to overfitting, in which the model starts capturing noise instead of real patterns. The silhouette score guards against this by penalizing unnecessary splits, and the natural split of the data into two distinct clusters allows for better decision-making.

When I used the model to select players for Mr. Rooney to consider, I pulled the top 4 players for each category (good selections, not good selections, and potential backups). Tyrese Proctor, Maxime Raynaud, Mohamed Diawara, and Noah Penda were the best options, all providing strong scoring relative to salary. Olivier-Maxence Prosper, Nick Smith Jr., Cam Whitmore, and Liam McNeeley were not ideal options, as they were low contributors relative to salary. Marcus Smart, Khris Middleton, Draymond Green, and Paul George were potential backups. My reasoning for including these high-profile but expensive players was that, despite their less-than-ideal cost, they have respectable production. We could focus on the cost-effective players first, but if that proved unsuccessful, we could pool our resources to nab one of these players instead.

Overall, my analysis demonstrates that performance-based clustering combined with salary efficiency metrics can help identify undervalued players and reduce financial risk in roster construction. Hopefully, my findings can help Mr. Rooney get his team to the playoffs!