The motivation for this project comes from a keen interest in understanding how geography influences the performance of NBA players. By looking at career statistics like points per game, assists per game and rebounds per game, I aim to identify potential patterns and correlations between a player's state of origin and their performance on the court.

The dataset for this analysis combines detailed career statistics of NBA players with demographic data from various states. This offers a comprehensive view of how regional factors might contribute to athletic success. My interest in this topic stems from a passion for basketball and a curiosity about the socio-economic and cultural factors that shape athletes.

This project aims to provide insights valuable to talent scouts, coaches and sports analysts. It highlights the diversity of player performance across different states and adds to the broader conversation about developing sports talent in various regions.

In [None]:
# import calls
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
import sklearn.metrics as metrics
import seaborn as sns
import cv2
sns.set_style("whitegrid")  # set_style applies the theme; axes_style alone only returns the settings
sns.set_context("paper")
from sklearn.linear_model import SGDRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import add_dummy_feature
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import ConfusionMatrixDisplay

np.random.seed(42)

In [None]:
from google.colab import files
upload = files.upload()

In [None]:
nba_df = pd.read_excel('NBA Players by State.xlsx')
nba_df.shape

In [None]:
pop = pd.read_csv('uscities.csv')
pop.head()

In [None]:
nba_df.shape

In [None]:
nba_df.info()

In [None]:
nba_df.describe()

In [None]:
nba_df.head()

In [None]:
nba_df.columns

In [None]:
nba_df.rename(columns={'MP.1': 'MPPG', 'PTS.1': 'PTSPG', 'TRB.1': 'TRBPG', 'AST.1': 'ASTPG'}, inplace=True)
nba_df.columns

# Which states have produced the most NBA players?

To answer this question, we extracted the State column from the NBA dataset and counted how often each state appears, which gives the number of players from each state. We then sorted the counts in descending order to find the state that has produced the most NBA players.

From our analysis, we can see that California produced the most players.

In [None]:
# counting the players per state from the State column, highest count first
states = nba_df['State'].value_counts().sort_values(ascending=False)
states.head()

# Is there a correlation between a player's birthplace (state) and the length of their NBA career (number of years played)?


We use the following approach to check whether there is any correlation between a player's birthplace and the length of their NBA career.

*   Create a DataFrame, nba_df2, which consists only of the columns Player, Yrs and State from the nba_df dataset.

*   Identify the indices of rows where the number of years played is not numeric and store them in drop_indices.

*   Drop the rows with indices present in drop_indices.

*   Create a boolean mask to filter states with more than 100 players so that the average years played by each state is not skewed.


*   Use the boolean mask, mask, to filter nba_df2 and create a new DataFrame, filtered_df, that only includes data from states with more than 100 players.

*   Group filtered_df by State and calculate the mean of Yrs (average years played) for each state.


*   Sort the results in descending order to find which states have the highest average years played.

*   Use a bar plot to visualize the average years played by state, with states sorted by the average years played.
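
The cleaning and filtering steps above can also be sketched on a small made-up table. This is only an illustration: the rows in `toy` are invented, the threshold `min_players` is lowered to 2 so the toy data survives the filter (the notebook keeps states with more than 100 players), and `pd.to_numeric(..., errors='coerce')` is swapped in for the index-based drop as one possible way to remove repeated header rows.

```python
import pandas as pd

# Hypothetical stand-in for nba_df; the values are invented for illustration.
toy = pd.DataFrame({
    'Player': ['A', 'B', 'Player', 'C', 'D', 'E'],
    'Yrs':    [3, 10, 'Yrs', 5, 7, 1],
    'State':  ['Ohio', 'Ohio', 'State', 'Texas', 'Texas', 'Utah'],
})

# Coerce non-numeric entries (such as a repeated header row) to NaN, then drop them.
toy['Yrs'] = pd.to_numeric(toy['Yrs'], errors='coerce')
toy = toy.dropna(subset=['Yrs'])

# Keep only states with at least `min_players` players.
min_players = 2
counts = toy.groupby('State')['Yrs'].transform('count')
filtered = toy[counts >= min_players]

# Average years played per state, highest first.
avg_years = filtered.groupby('State')['Yrs'].mean().sort_values(ascending=False)
print(avg_years)
```

Converting the column to a numeric dtype up front also keeps the later `mean()` from depending on how pandas handles object columns.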



In [None]:
# creating nba_df2 that contains the columns
# Player, Yrs and State
nba_df2 = nba_df[['Player', 'Yrs', 'State']].copy()  # copy so in-place edits do not act on a view of nba_df

# checking if there is any non-numerical data
nba_df2['Yrs'].unique()

In [None]:
# identifying unwanted rows and dropping them by their indices
drop_indices = nba_df2[nba_df2['Yrs'] == 'Yrs'].index
nba_df2.drop(drop_indices, axis=0, inplace=True)

In [None]:
# creating a boolean mask that is True for states with more than 100 players
mask = nba_df2[['State', 'Yrs']].groupby('State')['Yrs'].count() > 100
# using the mask to keep only the rows from those states
filtered_df = nba_df2[nba_df2['State'].isin(mask[mask].index)]
# grouping the DataFrame
filtered_df.groupby('State')['Yrs'].mean().sort_values(ascending=False)

In [None]:
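# plotting the states vs average years played, ordered from highest to lowest average
state_order = filtered_df.groupby('State')['Yrs'].mean().sort_values(ascending=False).index
sns.barplot(filtered_df, x='State', y='Yrs', color='lightsteelblue', errorbar=None, order=state_order)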

# setting the labels
plt.xticks(rotation=90)
plt.xlabel('States')
plt.ylabel('Average years played')
plt.title('Average years played by state')

According to the above analysis, Louisiana stands out with the highest average career length, almost 6.40 years. However, the other states are not far behind: North Carolina averages 5.99 years, Ohio 5.73 years and California 5.60 years. Because the averages are so similar across states, this analysis provides no clear evidence of a relationship between a player's birthplace and the length of their NBA career.
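
Comparing averages by eye does not show whether the differences are larger than chance alone would produce. One possible check, which is not part of the original analysis, is a Kruskal-Wallis test across states using `scipy.stats.kruskal`. The sketch below runs on synthetic career lengths in a hypothetical `demo` table, so its output says nothing about the real data; with the notebook's data, the groups would come from `filtered_df` after converting Yrs to a numeric type.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic career lengths (years) for three hypothetical states, all drawn
# from the same distribution; these are NOT the notebook's data.
demo = pd.DataFrame({
    'State': np.repeat(['State A', 'State B', 'State C'], 120),
    'Yrs': rng.geometric(p=0.18, size=360),
})

# One array of career lengths per state.
groups = [g['Yrs'].to_numpy() for _, g in demo.groupby('State')]

# Kruskal-Wallis H-test: do the groups plausibly share one distribution?
h_stat, p_value = stats.kruskal(*groups)
print(f"H = {h_stat:.2f}, p = {p_value:.3f}")
```

A small p-value would suggest that at least one state differs from the others, while a large one would be consistent with the conclusion drawn above.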

# How do the career statistics (e.g., points per game, assists per game, rebounds per game) of NBA players vary by their state of origin?



*   Create a DataFrame, nba_df3, having only the columns Player, State, PTSPG, ASTPG and TRBPG.

*   Identify the rows that have unwanted values and store them in drop_indices. For instance, rows with value PTS in the PTSPG column.

*   Drop the rows with indices present in drop_indices.

*   Group the data by State and calculate the mean of PTSPG, ASTPG and TRBPG for each state.


*   Create separate visualizations for each of these statistics:

    *   Plot the average points per game by state.
    *   Plot the average assists per game by state.
    *   Plot the average rebounds per game by state.

In [None]:
nba_df3 = nba_df[['Player', 'State', 'PTSPG', 'ASTPG', 'TRBPG']].copy()  # copy so in-place edits do not act on a view of nba_df
nba_df3['PTSPG'].unique() # checking whether there are any non-numeric values

In [None]:
# displaying the rows that hold the only non-numeric value
nba_df3[nba_df3['PTSPG'] == 'PTS']

In [None]:
drop_indices = nba_df3[nba_df3['PTSPG'] == 'PTS'].index # indices to be dropped
nba_df3.drop(drop_indices, axis=0, inplace=True)  # dropping the unwanted rows by indices

In [None]:
nba_df3.groupby('State')[['PTSPG', 'ASTPG', 'TRBPG']].mean().head() # mean of career statistics

In [None]:
# plotting states vs average points per game
plt.close('all')
plt.figure(figsize=(8, 6))
sns.barplot(nba_df3, x='State', y='PTSPG', errorbar=None, color='coral')

# setting up labels
plt.xticks(rotation=90)
plt.xlabel('States')
plt.ylabel('Average points per game')
plt.title('Average points per game by state')


The bar chart highlights the average points per game for NBA players by their state of origin. There's a noticeable difference across states. Hawaii shows the highest average points per game, while states like Alaska and Wyoming have lower averages. This suggests that player scoring performance varies widely depending on where they come from.

In [None]:
# plotting states vs average assists per game
plt.close('all')
plt.figure(figsize=(8, 6))
sns.barplot(nba_df3, x='State', y='ASTPG', errorbar=None, color='orange')

# setting up labels
plt.xticks(rotation=90)
plt.xlabel('States')
plt.ylabel('Average assists per game')
plt.title('Average assists per game by state')


This bar chart displays the average assists per game by state of origin for NBA players. Here, Alaska stands out with the highest average assists per game. In contrast, states like Arizona and South Carolina have lower averages. This indicates that the ability to assist varies significantly among players from different states.

In [None]:
# plotting states vs average rebounds per game
plt.close('all')
plt.figure(figsize=(8, 6))
sns.barplot(nba_df3, x='State', y='TRBPG', errorbar=None, color='deeppink')

# setting up labels
plt.xticks(rotation=90)
plt.xlabel('States')
plt.ylabel('Average rebounds per game')
plt.title('Average rebounds per game by state')


The third bar chart illustrates the average rebounds per game by state of origin for NBA players. Hawaii again stands out, this time with the highest average rebounds per game. On the other hand, states like Alaska and Nevada have lower averages. This shows that players' rebounding performance varies considerably based on their state of origin.
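
States at the extremes of these charts may simply have very few players, in which case a single player dominates the state's average. A minimal sketch of how to check this, assuming invented values in a hypothetical `toy_stats` table (with the notebook's data, `nba_df3` would take its place once its columns are numeric):

```python
import pandas as pd

# Invented per-player averages; not taken from the notebook's dataset.
toy_stats = pd.DataFrame({
    'State': ['Hawaii', 'California', 'California', 'California', 'Alaska'],
    'PTSPG': [12.0, 6.0, 9.0, 3.0, 2.0],
})

# Report each state's mean together with the number of players behind it.
summary = (
    toy_stats.groupby('State')['PTSPG']
    .agg(mean_ptspg='mean', players='count')
    .sort_values('players', ascending=False)
)
print(summary)
```

Reading the `players` column next to the mean makes it easier to judge which differences between states rest on enough data to be meaningful.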


# Predictive Modelling

Below is the algorithmic approach for predictive modeling of city sizes based on PTS, AST, and TRB.

*   Calculate the first quartile (q1) and third quartile (q3) of the city population densities.
*   Implement the assign_city_size function:

    *   Input: the population density of a city.

    *   Output: an integer code for the city's size category, based on where the density falls relative to q1 and q3.


*   Check for and handle NaN values in key features 'PTS', 'AST', and 'TRB'.


*   Convert non-numeric types to floats.

*   Extract the features 'PTS', 'AST', and 'TRB' from the merged DataFrame.
*   Extract the target variable 'city_size' from the merged DataFrame.


*   Use train_test_split to split the features and target into training and testing sets.


*   Convert y_train and y_test to DataFrames to ensure they are in the correct format.
*   Initialize the Decision Tree classifier:

    *   Set the maximum depth to 15.

*   Train the Decision Tree classifier on the training data and perform predictions.
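
The quartile-based labelling step can be sketched in vectorized form. This assumes made-up densities in a hypothetical `toy_density` Series and uses `numpy.select` in place of a row-wise `apply`; the integer codes follow the notebook's convention (1 below the first quartile, 2 above the third quartile, 3 in between).

```python
import numpy as np
import pandas as pd

# Made-up population densities for eight hypothetical cities.
toy_density = pd.Series([100, 200, 300, 400, 500, 600, 700, 800], name='density')

q1 = toy_density.quantile(0.25)
q3 = toy_density.quantile(0.75)

# 1: below q1, 2: above q3, 3: between the quartiles (inclusive).
city_size = np.select(
    [toy_density < q1, toy_density > q3],
    [1, 2],
    default=3,
)
print(pd.Series(city_size).value_counts().sort_index())
```

By construction, about half of the rows land in class 3, which is worth keeping in mind when reading per-class metrics later on.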

In [None]:
pop_df = pop[['city','state_name', 'population', 'density']]
pop_df

In [None]:
nba_df = nba_df[['Player', 'PTS', 'AST', 'TRB', 'State', 'City']]
nba_df

In [None]:
# merging nba_df and pop_df on city and then state
merged_df = pd.merge(nba_df, pop_df, left_on = ['City', 'State'], right_on=['city', 'state_name'], how='inner')

# calculating lower and upper quartiles
q1 = merged_df['density'].quantile(0.25)
q3 = merged_df['density'].quantile(0.75)

# defining function to categorize city_size based on pop. density
def assign_city_size(density):
  if density < q1:
    return 1  # indicates small city (below the first quartile)
  elif density > q3:
    return 2  # indicates big city (above the third quartile)
  else:
    return 3  # indicates medium city (between the quartiles)

# applying the function to the 'density' column
merged_df['city_size'] = merged_df['density'].apply(assign_city_size)
merged_df.drop(columns=['city', 'state_name'], inplace=True)  # dropping the duplicate columns

In [None]:
# plotting pairplot
mdf = merged_df[['PTS', 'AST', 'TRB', 'city_size']]
plt.close('all')
sns.pairplot(mdf, hue='city_size')

In [None]:
# making all the features numerical
merged_df['PTS'] = merged_df['PTS'].astype(float)
merged_df['AST'] = merged_df['AST'].astype(float)
merged_df['TRB'] = merged_df['TRB'].astype(float)

# setting NaN values to zero
merged_df[['PTS', 'AST', 'TRB', 'city_size']] = merged_df[['PTS', 'AST', 'TRB', 'city_size']].fillna(0)

In [None]:
# Define features and target
feat = merged_df[['PTS', 'AST', 'TRB']].values
target = merged_df['city_size'].values

# Split the data
x_train, x_test, y_train, y_test = train_test_split(feat, target, test_size=0.2, random_state=42)

# Wrap y_train and y_test in single-column DataFrames
y_train_df = pd.DataFrame(y_train)
y_test_df = pd.DataFrame(y_test)

In [None]:
# creating Decision Tree Classifier
log_reg = DecisionTreeClassifier(max_depth=15)

# fitting the model with the training data
log_reg.fit(x_train, y_train_df)

# outputting the scores of training and testing data
print("Score on training data: ", log_reg.score(x_train, y_train_df))
print("Score on testing data: ", log_reg.score(x_test, y_test_df))

In [None]:
# making predictions using training and testing data
y_pred = log_reg.predict(x_train)
y_predt = log_reg.predict(x_test)

# printing classification report of training data
print(classification_report(y_train, y_pred))

In [None]:
# printing classification report of testing data
print(classification_report(y_test, y_predt))

The model does a decent job on the training data, with an accuracy of 71%. However, precision and recall differ noticeably across classes. For instance, class 3 has a high recall of 0.91, whereas for classes 1 and 2 both precision and recall are only moderate.

When we look at the testing data, the story changes. The accuracy drops to 40%, which means the model isn't doing well with new, unseen data. Specifically, for classes 1 and 2, both precision and recall are pretty low, indicating the model finds it hard to classify these classes correctly in the test set. Class 3 does a bit better with a precision of 0.51, recall of 0.63, and an F1-score of 0.56, but these numbers aren't particularly impressive either.

Overall, it seems like the model is overfitting. It performs significantly better on the training data than on the testing data, which is a clear sign of this issue.
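
One common way to probe overfitting in a decision tree is to compare several values of `max_depth` under cross-validation. The sketch below is an illustration on synthetic data from `make_classification`, not a result for the NBA data; the names `X`, `y` and `cv_scores` are hypothetical, and with the notebook's data `x_train` and `y_train` would replace `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem with three features (stand-in for PTS, AST, TRB).
X, y = make_classification(
    n_samples=600, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)

# Mean 5-fold cross-validated accuracy for a range of tree depths.
cv_scores = {}
for depth in [2, 4, 8, 15]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    cv_scores[depth] = cross_val_score(tree, X, y, cv=5).mean()

for depth, score in cv_scores.items():
    print(f"max_depth={depth:>2}: mean CV accuracy = {score:.3f}")
```

If shallower trees score as well as or better than the depth-15 tree under cross-validation, that would support reducing `max_depth`.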

In [None]:
# confusion matrix of training data
ConfusionMatrixDisplay.from_predictions(y_train, y_pred)

In [None]:
# confusion matrix of testing data
ConfusionMatrixDisplay.from_predictions(y_test, y_predt)

The confusion matrix for the training data reveals that class 3 has the highest number of true positives, with 1126 instances. This matches the high recall (0.91) observed earlier. For classes 1 and 2, there are more false positives and false negatives, which aligns with their moderate precision and recall scores.

In the testing data, the confusion matrix shows a clear drop in true positives across all classes, especially for classes 1 and 2. This corresponds to the lower precision and recall scores for these classes. Class 3 maintains the highest number of true positives (196), but it also has a significant number of misclassifications. This indicates it performs better than classes 1 and 2, but still not well enough.

These confusion matrices confirm the earlier observation that the model is overfitting. The model performs much better on the training data compared to the testing data. There are significant drops in true positive rates and increases in misclassifications in the test set, especially for classes 1 and 2. The model struggles to maintain its precision and recall when faced with new, unseen data.
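
A test accuracy is easier to interpret next to a trivial baseline. Because class 3 covers the densities between the first and third quartiles, it holds roughly half of the rows, so always predicting it is a natural reference point. The sketch below shows the idea with scikit-learn's `DummyClassifier` on invented labels (`y_demo` and `X_demo` are hypothetical); it is not a result for the NBA data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Invented labels with the 25% / 25% / 50% split that quartile-based
# binning produces (classes 1, 2 and 3 respectively).
y_demo = np.array([1] * 25 + [2] * 25 + [3] * 50)
X_demo = np.zeros((len(y_demo), 3))  # the baseline ignores the features

# Always predict the most frequent class seen during fitting.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)
baseline_acc = baseline.score(X_demo, y_demo)
print(f"Majority-class accuracy: {baseline_acc:.2f}")
```

With the notebook's data, fitting the baseline on `x_train`, `y_train` and scoring it on `x_test`, `y_test` would give the number to compare against the tree's test accuracy.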