**Data Collection:** Load and organize data to streamline next steps of the project

This section includes loading the data and joining/merging it.

In [15]:
import os
import pandas as pd

In [16]:
# Load the 2020-2021 player stats data (per-36-minute rates)
per36 = pd.read_csv('nba2021_per36min.csv')
per36.head()

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,Precious Achiuwa,PF,21,MIA,28,2,408,6.4,10.8,0.59,...,0.561,3.2,6.7,9.9,1.5,1.1,1.3,2.6,4.8,16.0
1,Jaylen Adams,PG,24,MIL,6,0,17,2.1,16.9,0.125,...,0.0,0.0,6.4,6.4,4.2,0.0,0.0,0.0,2.1,4.2
2,Steven Adams,C,27,NOP,27,27,760,4.5,7.4,0.603,...,0.468,5.5,5.8,11.3,2.7,1.2,0.7,2.2,2.4,10.3
3,Bam Adebayo,C,23,MIA,26,26,873,7.9,13.8,0.573,...,0.841,2.1,7.8,9.9,5.7,1.0,1.1,3.2,2.8,21.4
4,LaMarcus Aldridge,C,35,SAS,18,18,480,8.0,16.9,0.476,...,0.762,1.1,4.7,5.9,2.6,0.5,1.2,1.2,2.0,19.0


In [17]:
# Load the 2020 player salary data
salary = pd.read_csv('NBA Players Salaries 2020.csv')
salary.head()

Unnamed: 0,Rk,Player,Tm,2019-20,2020-21,2021-22,2022-23,2023-24,2024-25,Signed Using,Guaranteed
0,1,Stephen Curry\curryst01,GSW,40231758.00 $,43006362.00 $,45780966.00 $,,,,Bird Rights,129019086.00 $
1,2,Chris Paul\paulch01,OKC,38506482.00 $,41358814.00 $,44211146.00 $,,,,Bird Rights,79865296.00 $
2,3,Russell Westbrook\westbru01,HOU,38178000.00 $,41006000.00 $,43848000.00 $,46662000.00 $,,,Bird Rights,123032000.00 $
3,4,John Wall\walljo01,WAS,37800000.00 $,40824000.00 $,43848000.00 $,46872000.00 $,,,Bird Rights,122472000.00 $
4,5,James Harden\hardeja01,HOU,37800000.00 $,40824000.00 $,43848000.00 $,46872000.00 $,,,Bird Rights,122472000.00 $


In [18]:
# Remove the extra text in the 'Player' column of the salary data. Each value has a single '\' between the
# name and the player ID; it only looks doubled ('\\') when a single value is shown in its escaped repr form.
salary['Player'] = salary['Player'].str.split('\\').str[0]
salary.head()

Unnamed: 0,Rk,Player,Tm,2019-20,2020-21,2021-22,2022-23,2023-24,2024-25,Signed Using,Guaranteed
0,1,Stephen Curry,GSW,40231758.00 $,43006362.00 $,45780966.00 $,,,,Bird Rights,129019086.00 $
1,2,Chris Paul,OKC,38506482.00 $,41358814.00 $,44211146.00 $,,,,Bird Rights,79865296.00 $
2,3,Russell Westbrook,HOU,38178000.00 $,41006000.00 $,43848000.00 $,46662000.00 $,,,Bird Rights,123032000.00 $
3,4,John Wall,WAS,37800000.00 $,40824000.00 $,43848000.00 $,46872000.00 $,,,Bird Rights,122472000.00 $
4,5,James Harden,HOU,37800000.00 $,40824000.00 $,43848000.00 $,46872000.00 $,,,Bird Rights,122472000.00 $
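
The one-versus-two backslash question comes down to how Python escapes strings. The sketch below uses a made-up one-row frame in the same `Name\id` shape as the salary file (the frame name `demo` is hypothetical) to show that the value holds one backslash and that `'\\'` in source code is a single character.

```python
import pandas as pd

# Hypothetical one-row frame shaped like the salary 'Player' column
demo = pd.DataFrame({'Player': ['Stephen Curry\\curryst01']})

raw = demo['Player'].iloc[0]
print(raw)        # plain printing shows the backslash once
print(repr(raw))  # repr escapes the backslash, so it is displayed doubled
print(len('\\'))  # the literal '\\' is one character long

# Split on the single backslash and keep only the name part
demo['Player'] = demo['Player'].str.split('\\').str[0]
print(demo['Player'].iloc[0])
```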


In [19]:
# Merge the two dataframes on the 'Player' column, keeping only the rows that have a matching 'Player' value
df = pd.merge(per36, salary, on='Player')
df.head()

Unnamed: 0,Player,Pos,Age,Tm_x,G,GS,MP,FG,FGA,FG%,...,Rk,Tm_y,2019-20,2020-21,2021-22,2022-23,2023-24,2024-25,Signed Using,Guaranteed
0,Jaylen Adams,PG,24,MIL,6,0,17,2.1,16.9,0.125,...,531,POR,163356.00 $,,,,,,Minimum Salary,163356.00 $
1,Jaylen Adams,PG,24,MIL,6,0,17,2.1,16.9,0.125,...,544,ATL,100000.00 $,,,,,,,100000.00 $
2,Steven Adams,C,27,NOP,27,27,760,4.5,7.4,0.603,...,41,OKC,25842697.00 $,27528090.00 $,,,,,1st Round Pick,53370787.00 $
3,Bam Adebayo,C,23,MIA,26,26,873,7.9,13.8,0.573,...,253,MIA,3454080.00 $,5115492.00 $,,,,,1st Round Pick,8569572.00 $
4,LaMarcus Aldridge,C,35,SAS,18,18,480,8.0,16.9,0.476,...,40,SAS,26000000.00 $,24000000.00 $,,,,,Cap Space,50000000.00 $
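
The preview above already lists Jaylen Adams twice, once per salary row. A minimal sketch of how an inner merge multiplies rows in that situation, and how repeated or dropped players could be flagged, is shown here. The frames `stats` and `pay` and their values are toy stand-ins, not the real data.

```python
import pandas as pd

# Toy stand-ins for per36 and salary (values are illustrative only)
stats = pd.DataFrame({'Player': ['A', 'B', 'C'], 'PTS': [4.2, 10.3, 21.4]})
pay = pd.DataFrame({
    'Player': ['A', 'A', 'B', 'D'],
    'Tm': ['POR', 'ATL', 'OKC', 'GSW'],
})

# Default how='inner': only players present in both frames survive,
# and a player with two salary rows yields two merged rows
merged = pd.merge(stats, pay, on='Player')

# keep=False marks every row of a repeated player, not just the later ones
repeated = merged[merged['Player'].duplicated(keep=False)]
print(repeated)

# Players that fell out of the merge because they had no salary row
dropped = set(stats['Player']) - set(merged['Player'])
print(dropped)
```

How repeated players should be resolved (keep one row, sum salaries, and so on) is a modelling decision that this sketch leaves open.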


**Data Cleaning:** Clean up the data in order to prepare it for the next steps of the project.

This section handles null or missing values and any duplicates.

Post-Merge Observations

The per36 statistics take a player's actual statistics and adjust them to show what would be expected
if the player played 36 minutes in a game.  Because of this, the number of games (G) and minutes played (MP) are not
necessary.
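
As a rough illustration of the adjustment, and assuming the usual definition (season total divided by total minutes played, then multiplied by 36), a per-36 rate could be computed as below. The helper name `per_36` and the numbers are hypothetical.

```python
def per_36(total: float, minutes_played: float) -> float:
    """Scale a season total to a per-36-minute rate."""
    if minutes_played <= 0:
        raise ValueError('minutes_played must be positive')
    return total / minutes_played * 36


# Hypothetical player: 200 points scored in 400 total minutes
rate = per_36(200, 400)
print(round(rate, 1))  # 18.0
```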

In addition, there are a lot of NaN values for salaries past 2019-20.  Since players tend to sign short-term contracts, many will have to sign a new contract in the future, so there is no data for what they will be earning.  Given this uncertainty and the many missing values, salaries after 2019-20 can be removed.

There are two team columns (Tm_x and Tm_y); some rows match and others do not.  Since the two datasets were not collected at exactly the same time, this is somewhat expected.  Many trades and free-agent signings occur in the NBA each season, so a player may end up playing for multiple teams in the same season.  It will be assumed that a player's ability does not change significantly based on which team they play for.  However, the salary a player earns may depend on the team (smaller markets may pay less, bigger markets may pay more).  This is unproven at this point, but the team from the salary data (Tm_y) will be used since there may be a correlation between team and salary.  This is something that can be investigated later.

Lastly, the Rk column was brought over from the salary dataframe and was a ranking of highest to lowest salary.  This information is not necessary in the dataframe since we can simply sort by salary to get the same result.
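
Sorting by salary only reproduces the ranking once the salary text (for example `40231758.00 $`) has been converted to a number, because strings sort alphabetically. A minimal sketch of that conversion on a toy frame follows; the column names `Salary` and `Rank` are assumptions introduced for illustration.

```python
import pandas as pd

# Hypothetical rows in the same 'amount $' text format as the salary file
toy = pd.DataFrame({
    'Player': ['A', 'B', 'C'],
    '2019-20': ['3454080.00 $', '40231758.00 $', '163356.00 $'],
})

# Strip the trailing ' $' and convert the remaining text to a number
toy['Salary'] = pd.to_numeric(
    toy['2019-20'].str.replace('$', '', regex=False).str.strip()
)

# Rank 1 = highest salary, mirroring the highest-to-lowest order described for Rk
toy['Rank'] = toy['Salary'].rank(ascending=False, method='min').astype(int)
print(toy.sort_values('Rank'))
```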

In [20]:
# Copy Tm_y into a new 'Team' column, then drop the unnecessary columns (including Tm_x and Tm_y)
df['Team'] = df['Tm_y']
df = df.drop(['G', 'MP', 'Tm_x', 'Tm_y', 'Rk', '2020-21', '2021-22', '2022-23', '2023-24', '2024-25'], axis=1)
df.head()

Unnamed: 0,Player,Pos,Age,GS,FG,FGA,FG%,3P,3PA,3P%,...,AST,STL,BLK,TOV,PF,PTS,2019-20,Signed Using,Guaranteed,Team
0,Jaylen Adams,PG,24,0,2.1,16.9,0.125,0.0,4.2,0.0,...,4.2,0.0,0.0,0.0,2.1,4.2,163356.00 $,Minimum Salary,163356.00 $,POR
1,Jaylen Adams,PG,24,0,2.1,16.9,0.125,0.0,4.2,0.0,...,4.2,0.0,0.0,0.0,2.1,4.2,100000.00 $,,100000.00 $,ATL
2,Steven Adams,C,27,27,4.5,7.4,0.603,0.0,0.0,0.0,...,2.7,1.2,0.7,2.2,2.4,10.3,25842697.00 $,1st Round Pick,53370787.00 $,OKC
3,Bam Adebayo,C,23,26,7.9,13.8,0.573,0.1,0.2,0.4,...,5.7,1.0,1.1,3.2,2.8,21.4,3454080.00 $,1st Round Pick,8569572.00 $,MIA
4,LaMarcus Aldridge,C,35,18,8.0,16.9,0.476,1.8,5.0,0.358,...,2.6,0.5,1.2,1.2,2.0,19.0,26000000.00 $,Cap Space,50000000.00 $,SAS


In [21]:
# Check for missing values in the new dataframe
null_data = df[df.isnull().any(axis=1)]
print(null_data)

                      Player Pos  Age  GS   FG   FGA    FG%   3P  3PA    3P%  \
1               Jaylen Adams  PG   24   0  2.1  16.9  0.125  0.0  4.2  0.000   
31               Jordan Bell   C   26   1  3.6  10.8  0.333  0.0  1.4  0.000   
43            Marques Bolden   C   22   0  1.2   3.7  0.333  0.0  0.0  0.000   
48             Avery Bradley  SG   30   1  5.3  11.3  0.470  2.7  6.5  0.421   
61                Trey Burke  PG   28   0  6.1  13.8  0.444  2.8  7.0  0.396   
73           Marquese Chriss  PF   23   0  6.7  18.7  0.357  1.3  6.7  0.200   
75                Gary Clark  SF   26  11  2.0   6.7  0.294  1.5  5.7  0.264   
77                Gary Clark  SF   26  11  2.0   6.7  0.294  1.5  5.7  0.264   
84          DeMarcus Cousins   C   30  11  5.6  14.9  0.376  2.8  8.3  0.336   
122              Tim Frazier  PG   30   0  2.2   8.7  0.250  0.0  2.2  0.000   
145               Jeff Green  PF   34  13  4.8   9.5  0.505  2.2  5.1  0.427   
147               Jeff Green  PF   34  1

It appears that all the missing data is in the 'Signed Using' column.  This information may be useful for the players that have it filled out, since different types of contracts (rookie, cap space, max, super max, etc.) affect how much a player earns.  Once the model is made this can be investigated further, but for now the NaN values can be replaced by 'Unknown'.
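
Printing the rows with missing values makes the affected column hard to spot in a wide frame. One way to check directly where the gaps are is to count nulls per column, sketched here on a toy frame with illustrative values.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps only in 'Signed Using' (values are illustrative)
toy = pd.DataFrame({
    'Player': ['A', 'B', 'C'],
    'PTS': [4.2, 10.3, 21.4],
    'Signed Using': ['Minimum Salary', np.nan, np.nan],
})

# Count missing values per column and show only the columns that have any
null_counts = toy.isnull().sum()
print(null_counts[null_counts > 0])
```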

In [22]:
# Replace the missing values in the 'Signed Using' column with 'Unknown'
df['Signed Using'] = df['Signed Using'].fillna('Unknown')
null_data = df[df.isnull().any(axis=1)]
print(null_data)

Empty DataFrame
Columns: [Player, Pos, Age, GS, FG, FGA, FG%, 3P, 3PA, 3P%, 2P, 2PA, 2P%, FT, FTA, FT%, ORB, DRB, TRB, AST, STL, BLK, TOV, PF, PTS, 2019-20, Signed Using, Guaranteed, Team]
Index: []

[0 rows x 29 columns]
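
Category labels in 'Signed Using' may also differ only by capitalization: the position preview later in this notebook shows both 'Cap space' and 'Cap Space'. A hedged sketch of collapsing such variants is given below; lowercasing is just one possible convention, and the series `signed` is illustrative.

```python
import pandas as pd

# Illustrative labels, including the two spellings seen in the notebook output
signed = pd.Series(['Cap Space', 'Cap space', 'Unknown', '1st Round Pick'])

# Inspect the distinct labels before cleaning
print(signed.value_counts())

# Normalize whitespace and capitalization so variants collapse together
cleaned = signed.str.strip().str.lower()
print(cleaned.value_counts())
```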


In [23]:
# Create 5 new dataframes to separate players by position.  This will help later when investigating what affects salary.
PG = df[df['Pos'] == 'PG']
SG = df[df['Pos'] == 'SG']
SF = df[df['Pos'] == 'SF']
PF = df[df['Pos'] == 'PF']
C = df[df['Pos'] == 'C']

print(PG.head(), SG.head(), SF.head(), PF.head(), C.head())

              Player Pos  Age  GS   FG   FGA    FG%   3P  3PA    3P%  ...  \
0       Jaylen Adams  PG   24   0  2.1  16.9  0.125  0.0  4.2  0.000  ...   
1       Jaylen Adams  PG   24   0  2.1  16.9  0.125  0.0  4.2  0.000  ...   
16  Ryan Arcidiacono  PG   26   0  3.0   8.1  0.375  2.0  5.4  0.375  ...   
17     D.J. Augustin  PG   33   0  3.4  10.3  0.336  2.4  6.6  0.364  ...   
20        Lonzo Ball  PG   23  25  6.0  14.2  0.424  3.4  8.8  0.381  ...   

    AST  STL  BLK  TOV   PF   PTS       2019-20    Signed Using  \
0   4.2  0.0  0.0  0.0  2.1   4.2   163356.00 $  Minimum Salary   
1   4.2  0.0  0.0  0.0  2.1   4.2   100000.00 $         Unknown   
16  5.7  1.0  0.0  0.7  3.4   9.4  3000000.00 $       Cap space   
17  6.1  1.0  0.1  1.6  1.4  11.8  7250000.00 $       Cap Space   
20  5.5  1.4  0.6  2.4  2.9  16.3  8719320.00 $  1st Round Pick   

       Guaranteed  Team  
0     163356.00 $   POR  
1     100000.00 $   ATL  
16   6000000.00 $   CHI  
17   7250000.00 $   ORL  
20  
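
An alternative to five hand-written filters is to let `groupby` produce one sub-frame per position label. This is a sketch on a toy frame; the dictionary name `by_pos` is an assumption. It also lists whichever labels are actually present, which would reveal any position outside the five filtered above (whether the real data has such labels is not shown here).

```python
import pandas as pd

# Toy frame standing in for df (values are illustrative only)
toy = pd.DataFrame({
    'Player': ['A', 'B', 'C', 'D'],
    'Pos': ['PG', 'C', 'PG', 'SF'],
    'PTS': [4.2, 10.3, 16.3, 18.0],
})

# One sub-frame per position, keyed by the position label
by_pos = {pos: group for pos, group in toy.groupby('Pos')}

print(sorted(by_pos))  # position labels actually present in the data
print(by_pos['PG'])    # same rows as toy[toy['Pos'] == 'PG']
```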

Since the city/franchise a player plays for is likely to affect their salary, data about franchise valuations needs to be pulled.  Data was taken from a Forbes article at https://www.forbes.com/sites/kurtbadenhausen/2021/02/10/nba-team-values-2021-knicks-keep-top-spot-at-5-billion-warriors-bump-lakers-for-second-place/?sh=85035f5645b7

In [24]:
# Create new dataframe where information will eventually be stored
market = pd.DataFrame()

# Create a dictionary with the team as the key and the franchise value as the value, shown in billions of dollars.
team_values = {'MEM':1.3, 'NOP':1.35, 'MIN':1.4, 'DET':1.45, 'ORL':1.46,
       'CHO':1.5, 'ATL':1.52, 'IND':1.55, 'CLE':1.56, 'OKC':1.575, 
       'MIL':1.625, 'DEN':1.65, 'UTA':1.66, 'PHO':1.7, 'WAS':1.8, 
       'SAC':1.825, 'SAS':1.85, 'POR':1.9, 'MIA':2, 'PHI':2.075, 
       'TOR':2.15, 'DAL':2.45, 'HOU':2.5, 'BRK':2.65, 'LAC':2.75, 
       'BOS':3.2, 'CHI':3.3, 'LAL':4.6, 'GSW':4.7, 'NYK':5}

print(market, team_values)

Empty DataFrame
Columns: []
Index: [] {'MEM': 1.3, 'NOP': 1.35, 'MIN': 1.4, 'DET': 1.45, 'ORL': 1.46, 'CHO': 1.5, 'ATL': 1.52, 'IND': 1.55, 'CLE': 1.56, 'OKC': 1.575, 'MIL': 1.625, 'DEN': 1.65, 'UTA': 1.66, 'PHO': 1.7, 'WAS': 1.8, 'SAC': 1.825, 'SAS': 1.85, 'POR': 1.9, 'MIA': 2, 'PHI': 2.075, 'TOR': 2.15, 'DAL': 2.45, 'HOU': 2.5, 'BRK': 2.65, 'LAC': 2.75, 'BOS': 3.2, 'CHI': 3.3, 'LAL': 4.6, 'GSW': 4.7, 'NYK': 5}


In [25]:
# Build the market dataframe from the dictionary items, one row per team, sorted by team abbreviation
market = pd.DataFrame(sorted(team_values.items()), columns=['Team', 'Team_Value'])
market.head()

Unnamed: 0,Team,Team_Value
0,ATL,1.52
1,BOS,3.2
2,BRK,2.65
3,CHI,3.3
4,CHO,1.5


In [26]:
# Get some basic stats for the team values
mean = market['Team_Value'].mean()
med = market['Team_Value'].median()
cut_25 = market.Team_Value.quantile(0.25)
cut_75 = market.Team_Value.quantile(0.75)

print('The mean is ' + str(mean) + ' billion dollars')
print('The median is ' + str(med) + ' billion dollars')
print('50% of teams are valued between ' + str(cut_25) + ' and ' + str(cut_75) + ' billion dollars')

The mean is 2.201666666666667 billion dollars
The median is 1.8125 billion dollars
50% of teams are valued between 1.5525 and 2.4875 billion dollars


In [27]:
# Add a column so that the teams can easily be grouped based on value during exploration
size = []

for value in market['Team_Value']:
    if value < cut_25:
        size.append('small')
    elif value <= cut_75:
        size.append('medium')
    else:
        size.append('large')
        
market['Market Size'] = size
market = market.sort_values('Market Size')
print(market)

   Team  Team_Value Market Size
10  HOU       2.500       large
1   BOS       3.200       large
2   BRK       2.650       large
3   CHI       3.300       large
19  NYK       5.000       large
9   GSW       4.700       large
13  LAL       4.600       large
12  LAC       2.750       large
23  PHO       1.700      medium
24  POR       1.900      medium
16  MIL       1.625      medium
15  MIA       2.000      medium
28  UTA       1.660      medium
29  WAS       1.800      medium
20  OKC       1.575      medium
26  SAS       1.850      medium
7   DEN       1.650      medium
6   DAL       2.450      medium
5   CLE       1.560      medium
27  TOR       2.150      medium
25  SAC       1.825      medium
22  PHI       2.075      medium
0   ATL       1.520       small
18  NOP       1.350       small
17  MIN       1.400       small
11  IND       1.550       small
8   DET       1.450       small
4   CHO       1.500       small
21  ORL       1.460       small
14  MEM       1.300       small
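
The same labelling can be written without an explicit loop. The sketch below uses `numpy.select`, which checks its conditions in order and so mirrors the if/elif/else logic above, on a small subset of the franchise values. The frame name `toy` and the variables `q25` and `q75` are illustrative, and the quartiles here are those of the subset, not of all 30 teams.

```python
import numpy as np
import pandas as pd

# Small illustrative subset of franchise values (billions of dollars)
toy = pd.DataFrame({
    'Team': ['MEM', 'CHO', 'WAS', 'MIA', 'BOS', 'NYK'],
    'Team_Value': [1.3, 1.5, 1.8, 2.0, 3.2, 5.0],
})

q25 = toy['Team_Value'].quantile(0.25)
q75 = toy['Team_Value'].quantile(0.75)

# The first matching condition wins: below q25 is small, up to q75 is medium
toy['Market Size'] = np.select(
    [toy['Team_Value'] < q25, toy['Team_Value'] <= q75],
    ['small', 'medium'],
    default='large',
)
print(toy)
```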


In [28]:
# Merge the market dataframe with the main dataframe
df = pd.merge(df, market, on='Team')
df = df.sort_values('Player')
df.head()

Unnamed: 0,Player,Pos,Age,GS,FG,FGA,FG%,3P,3PA,3P%,...,BLK,TOV,PF,PTS,2019-20,Signed Using,Guaranteed,Team,Team_Value,Market Size
118,Aaron Gordon,PF,25,19,5.9,13.9,0.427,2.0,5.5,0.369,...,1.0,3.5,2.5,17.1,19863636.00 $,Bird Rights,54409091.00 $,ORL,1.46,small
277,Aaron Holiday,PG,24,6,5.1,13.7,0.37,1.9,5.8,0.333,...,0.2,1.4,3.1,13.4,2239200.00 $,1st Round Pick,4584840.00 $,IND,1.55,small
31,Abdel Nader,SF,27,0,6.5,13.0,0.5,2.1,5.2,0.4,...,0.2,1.7,2.7,18.0,1618520.00 $,Cap Space,1618520.00 $,OKC,1.575,medium
364,Al Horford,C,34,19,7.5,16.6,0.453,2.8,7.1,0.39,...,1.1,1.6,2.2,18.7,28000000.00 $,Cap Space,97000000.00 $,PHI,2.075,medium
109,Al-Farouq Aminu,PF,30,0,2.3,2.3,1.0,2.3,2.3,1.0,...,0.0,4.5,0.0,6.8,9258000.00 $,MLE,29162700.00 $,ORL,1.46,small
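
Because the merge on 'Team' is an inner join by default, any player whose team abbreviation is missing from the market frame would be dropped silently. A minimal sketch of one way to check for that uses the `how`, `indicator`, and `validate` parameters of `pd.merge`. The frames are toy stand-ins and 'XYZ' is a made-up team code.

```python
import pandas as pd

# Toy stand-ins for the player frame and the market frame
players = pd.DataFrame({'Player': ['A', 'B', 'C'],
                        'Team': ['ORL', 'IND', 'XYZ']})  # 'XYZ' is hypothetical
values = pd.DataFrame({'Team': ['ORL', 'IND'], 'Team_Value': [1.46, 1.55]})

# how='left' keeps every player; indicator adds a '_merge' column;
# validate raises if a team appears more than once in the right frame
checked = pd.merge(players, values, on='Team', how='left',
                   indicator=True, validate='many_to_one')

# Rows whose team code had no match in the market frame
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched[['Player', 'Team']])
```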
