# Project Phase 2

Team Members: Kate Li (kl739), Audrey Holden (aeh252), Katherine Yee (ky424), Julian Correa (jfc297)

## Research Question

Question: What factors predict an NBA player’s salary? Is there a relationship between an NBA player's performance and an increase in their salary? 

In this assignment we will analyze how NBA player salaries vary with per-game performance statistics, including points, assists, rebounds, blocks, and steals. We will train a multivariable regression to see whether we can predict a player's salary from their performance statistics in the previous season(s). We hope the model will help sports analysts better predict the contract that a player may receive. It may also give NBA team managers more insight into team spending and budgeting.

We will also check whether the correlation between performance and salary is positive or negative.
We also plan to compare NBA player positions and examine how much salaries change with each position's respective stats. For example, an increase in blocks per game may raise a center's salary, but a comparable increase in points per game by a shooting guard could produce a much larger raise if offensive roles are valued more highly.
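
As a minimal sketch of the planned regression, assuming a joined table with one row per player-season, a fit with `LinearRegression` might look like the following. The column names `PTS`, `AST`, and `TRB` and every value in the toy table are made up for illustration; they are not taken from our data.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: per-game stats from the previous season
toy = pd.DataFrame({
    "PTS": [10.0, 20.0, 15.0, 25.0, 8.0, 18.0],
    "AST": [2.0, 5.0, 7.0, 6.0, 1.0, 3.0],
    "TRB": [4.0, 6.0, 3.0, 8.0, 9.0, 5.0],
})
# Salary (in $ millions) built from a known linear rule so the fit can be checked
toy["Salary"] = 1.0 + 0.5 * toy["PTS"] + 0.3 * toy["AST"] + 0.2 * toy["TRB"]

features = ["PTS", "AST", "TRB"]
model = LinearRegression().fit(toy[features], toy["Salary"])

# The sign of each coefficient gives the direction of the association
coefs = dict(zip(features, model.coef_))
print(coefs, model.intercept_)
```

Fitting one such model per position, or adding position as a feature, would be one way to compare how strongly each stat is associated with salary across roles.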

## Data Collection and Cleaning

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from sklearn.linear_model import LinearRegression, LogisticRegression
import duckdb
import requests
from bs4 import BeautifulSoup
import time

We read the dataset (csv) files and stored them in variables. `salaries` has data on NBA player salaries by year from 1990-2017. `stats` has data on NBA player performance statistics by year from 1950-2017.

In [2]:
salaries = pd.read_csv("salaries.csv")
stats = pd.read_csv("stats.csv")

We created a pandas dataframe using the `salaries` csv file and assigned it to the variable `salaries_df`. We then dropped miscellaneous columns such as `Register Value`, `Season End`, and `Full Team Name`. 

In [3]:
salaries_df = pd.DataFrame(salaries)
salaries_df = salaries_df.drop(['Register Value', 'Season End', 'Full Team Name'], axis = 1)
print(salaries_df)

              Player Name     Salary in $   Season Start Team
0              A.C. Green   $1,750,000.00           1990  LAL
1              A.C. Green   $1,750,000.00           1991  LAL
2              A.C. Green   $1,750,000.00           1992  LAL
3              A.C. Green   $1,885,000.00           1993  PHO
4              A.C. Green   $6,472,600.00           1994  PHO
...                   ...              ...           ...  ...
11832  Zydrunas Ilgauskas   $8,740,000.00           2005  CLE
11833  Zydrunas Ilgauskas   $9,442,697.00           2006  CLE
11834  Zydrunas Ilgauskas  $10,142,156.00           2007  CLE
11835  Zydrunas Ilgauskas  $10,841,615.00           2008  CLE
11836  Zydrunas Ilgauskas  $11,541,074.00           2009  WAS

[11837 rows x 4 columns]


Here, we renamed the column `Player Name` to `Player`, `Salary in $` to `Salary`, and `Season Start` to `Season`. We then selected only the rows with salaries listed between the years 2010 and 2017 (inclusive). When a player was listed more than once in the same season with different teams, we kept only the first row, meaning we disregard mid-season trades.

In [4]:
salaries_df = salaries_df.rename(columns = {'Player Name': 'Player', ' Salary in $ ': 'Salary', 'Season Start': 'Season'})

salaries_df = salaries_df[(salaries_df['Season'] >= 2010) & (salaries_df['Season'] <= 2017)]

salaries_df = salaries_df.drop_duplicates(subset = ['Player', 'Season'], keep = 'first')
print(salaries_df)

              Player          Salary  Season Team
19      A.J. Hammons    $650,000.00     2016  DAL
20      A.J. Hammons  $1,312,611.00     2017  MIA
22        A.J. Price    $762,195.00     2010  IND
23        A.J. Price    $854,389.00     2011  IND
24        A.J. Price    $885,120.00     2012  WAS
...              ...             ...     ...  ...
11805  Zaza Pachulia  $2,898,000.00     2016  GSW
11806  Zaza Pachulia  $3,477,600.00     2017  GSW
11817        Zhou Qi    $815,615.00     2017  HOU
11818   Zoran Dragic  $1,962,103.00     2014  MIA
11819   Zoran Dragic  $2,050,397.00     2015  BOS

[3710 rows x 4 columns]
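
The `Salary` column printed above is still text (for example `$650,000.00`), so it would need to be converted to a number before any regression. A minimal sketch of that conversion on a toy frame is shown below; the frame `toy` is hypothetical, and the leading and trailing spaces in the first value are added only to show that whitespace is handled.

```python
import pandas as pd

# Toy rows modeled on the printed output above
toy = pd.DataFrame({
    "Player": ["A.J. Hammons", "A.J. Hammons", "Zaza Pachulia"],
    "Salary": [" $650,000.00 ", "$1,312,611.00", "$2,898,000.00"],
})

# Strip "$", thousands separators, and whitespace, then convert to float
toy["Salary"] = (
    toy["Salary"]
    .str.replace(r"[$,\s]", "", regex=True)
    .astype(float)
)
print(toy["Salary"].tolist())
```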


We wanted to keep only the players that were active during the entire time frame that we selected (the 2010 to 2017 seasons). To do so, we counted the number of times each player was listed within `salaries_df`, which had already been filtered to this time frame and reduced to one row per player per season.

We kept the players that were counted exactly 8 times (once per season) and converted this result into a dataframe containing each player and their count. By performing an inner join on the player, we reduced the original `salaries_df` to the salaries of these players from 2010 to 2017 inclusive. This dataframe now has only players that were active from 2010 to 2017 (inclusive) and includes the player name, season, salary, and team.

In [7]:
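# Count the seasons listed for each player and keep those with all 8 (2010-2017)
player_count = salaries_df['Player'].value_counts()
select = player_count[player_count == 8]

# One row per qualifying player; name the columns explicitly for the join below
filter_players_df = select.rename_axis('Player').reset_index(name='count')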

salaries_df = duckdb.sql("""SELECT salaries_df.Player, salaries_df.Salary, salaries_df.Season, salaries_df.Team \
                             FROM salaries_df \
                             INNER JOIN filter_players_df \
                             ON salaries_df.Player = filter_players_df.Player""").df()
print(salaries_df)
print(salaries_df.iloc[:8, :])

          Player          Salary  Season Team
0     Aaron Gray  $1,028,840.00     2010  NOH
1     Aaron Gray  $2,500,000.00     2011  TOR
2     Aaron Gray  $2,575,000.00     2012  TOR
3     Aaron Gray  $2,690,875.00     2013  TOR
4     Aaron Gray  $1,227,985.00     2014  DET
..           ...             ...     ...  ...
827  Omri Casspi    $947,907.00     2013  HOU
828  Omri Casspi  $1,063,384.00     2014  SAC
829  Omri Casspi  $3,000,000.00     2015  SAC
830  Omri Casspi  $3,000,000.00     2016  NOH
831  Omri Casspi  $2,106,470.00     2017  GSW

[832 rows x 4 columns]
       Player          Salary  Season Team
0  Aaron Gray  $1,028,840.00     2010  NOH
1  Aaron Gray  $2,500,000.00     2011  TOR
2  Aaron Gray  $2,575,000.00     2012  TOR
3  Aaron Gray  $2,690,875.00     2013  TOR
4  Aaron Gray  $1,227,985.00     2014  DET
5  Aaron Gray    $452,059.00     2015  DET
6  Aaron Gray    $452,059.00     2016  DET
7  Aaron Gray    $452,059.00     2017  DET
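
The same "active in all 8 seasons" filter could also be expressed in pandas alone, without the intermediate count dataframe or the SQL join. The sketch below is an alternative under that assumption, run on a hypothetical toy frame rather than on `salaries_df`.

```python
import pandas as pd

# Toy data: player "A" has all 8 seasons, player "B" only 3
toy = pd.DataFrame({
    "Player": ["A"] * 8 + ["B"] * 3,
    "Season": list(range(2010, 2018)) + [2010, 2011, 2012],
})

# Number of distinct seasons per player, broadcast back onto every row
seasons_per_player = toy.groupby("Player")["Season"].transform("nunique")

# Keep only the rows of players that appear in all 8 seasons
full_span = toy[seasons_per_player == 8].reset_index(drop=True)
print(full_span["Player"].unique())
```

Counting distinct seasons rather than rows means the filter would still behave as intended even if duplicate player-season rows had not been dropped first.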


## Data Description

We plan to create two separate data frames and join them into one data frame for analysis. Our first data frame will contain data on about 40 NBA players that were active from the 2014-2015 season to the 2023-2024 season. This data set will have NBA performance metrics (columns) for each player (rows) over the last 10 seasons. Our second data frame will contain data on the salaries/contracts of the same 40 NBA players over the same time period. This data set will have the season salaries/earnings (columns) of each player (rows) over the last 10 seasons. These data sets were created so that we could more efficiently analyze the relationship between performance metrics and player salaries.
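
A minimal sketch of that join is shown below, assuming both frames are in long format with one row per player per season. The frame names `stats_toy` and `salaries_toy`, the `PTS` column, and all values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames; the real column names may differ
stats_toy = pd.DataFrame({
    "Player": ["A", "A", "B"],
    "Season": [2014, 2015, 2014],
    "PTS": [12.1, 14.3, 7.8],
})
salaries_toy = pd.DataFrame({
    "Player": ["A", "A", "B"],
    "Season": [2014, 2015, 2015],
    "Salary": [2_000_000.0, 2_500_000.0, 900_000.0],
})

# Inner join keeps only player-seasons present in both frames;
# validate= raises if either side has duplicate (Player, Season) keys
merged = stats_toy.merge(
    salaries_toy, on=["Player", "Season"], how="inner", validate="one_to_one"
)
print(merged)
```

Here player "B" drops out because the two toy frames have no season in common for that player, which is the behavior to watch for when names or seasons are recorded differently across sources.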

It is unclear whether anyone specifically funded the NBA and HoopsHype datasets. HoopsHype was founded by Jorge Sierra and was acquired by the USA Today Sports Media Group in 2012. The NBA, on the other hand, does not have a primary owner or sponsor, so it is unclear who funded the creation of the NBA dataset. However, given the NBA's revenue, it would not be surprising if the league funded its own data collection. Kaggle is owned by Google, so the platform itself is presumably funded by Google, although its datasets are uploaded by individual users. Overall, our sources appear to be reliable.

All of the websites that we collected data from are publicly accessible. Since the performance statistics of NBA players can be recorded by anyone who follows the sport closely, player stats involve little private data or information. The same can be said about player salaries, which are publicly available. However, one of our sources, Kaggle, hosts datasets submitted by other data analysts, who could have formatted the data however they wanted. This means we would be using already processed data, which could be missing information.

Web scraping will be used to extract data from three public websites (BR NBA Data, ESPN Player Salaries, NBA Leaders). Additionally, we will use formatted data from an existing dataset (Kaggle Player Performance and Salaries). We will use these data to create two new datasets. One dataset contains 40 selected players with their stats per year from the beginning of the 2014-2015 season to the end of the 2023-2024 season. The other dataset contains those same players’ salaries over the same time frame. We will use these two dataframes to compare the two factors, point out any trends, and make predictions.
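
As a rough sketch of how the scraping step could be organized with `requests` and `BeautifulSoup`, the helpers below download a page and read its first HTML table. The function names `fetch_html` and `parse_table` are hypothetical, and the table layout is an assumption: the real pages may be structured differently, and each site's terms of use should be checked before scraping.

```python
import time

import requests
from bs4 import BeautifulSoup


def fetch_html(url, delay_seconds=1.0):
    """Download one page, then pause so requests are spaced out."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    time.sleep(delay_seconds)
    return response.text


def parse_table(html):
    """Return the first <table> in the page as a list of row lists."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


# Inline HTML stands in for a downloaded page (made-up values)
sample_html = """
<table>
  <tr><th>Player</th><th>Salary</th></tr>
  <tr><td>Player A</td><td>$1,000,000</td></tr>
</table>
"""
rows = parse_table(sample_html)
print(rows)
```

Keeping the download and the parsing in separate functions makes it possible to test the parsing on saved HTML without contacting the site again.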

NBA officials and ESPN analysts collected the data for their own future reference and comparison. NBA players are well aware that their data is collected after every game. The players can use the data to study their performances and improve their abilities.

## Data Limitations

One potential data limitation is the effect of inflation on a player’s salary when comparing performance progression between seasons. If we compare a player's salary from previous seasons to their current salary without accounting for inflation, we will not get an accurate picture of how much the player's salary actually increased or decreased with performance over the seasons. Because inflation changes the real value of a salary, the raw dollar amounts would misrepresent any correlation between the two. To combat this, we would have to adjust all salaries to a common basis, for example by converting them to constant dollars of a single reference year.

Another limitation to consider is team salary caps. NBA revenue has increased through each decade, so NBA teams have increasingly larger salary caps to spend on contracts. As the cap rises, teams are able to write larger contracts for their players, meaning that player salaries increase naturally. This may be a limitation in our data because if we don’t take salary cap increases into account, we may see an average salary increase for all players, even those who played poorly. In these cases, a player’s performance and their salary could appear positively correlated, but the increase would result from a higher team salary cap rather than better performance.

Maximum individual contracts have also grown dramatically over the years, because they are tied to the salary cap. The salary cap for the 2014-2015 season, \$63.065 million, pales in comparison to the cap for the 2024-2025 season, \$140.588 million. This is a data limitation because it complicates our multivariable prediction analysis. Chris Paul, a future Hall-of-Famer, earned \$21.47 million in his 2015 NBA season, while Mikal Bridges earned \$23.3 million in the 2024-2025 season. Chris Paul, a significantly better and higher-ranked player, earned less than Mikal Bridges, who would probably have earned only about \$10 million ten years earlier. Given inflation and the change in NBA salary spending, it is tricky to create comparable graphs and prediction charts from the raw dollar amounts.
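
One possible way to make salaries from different seasons comparable is to express each salary as a share of that season's salary cap. The sketch below uses only the figures quoted in the paragraph above. Pairing Chris Paul's 2015 salary with the 2014-2015 cap is an assumption made for illustration, since the paragraph does not say which cap applied to that salary.

```python
# Figures quoted in the paragraph above, in $ millions
salary_cap = {"2014-15": 63.065, "2024-25": 140.588}

# (player, season, salary); the season assigned to Chris Paul is an assumption
salaries = [
    ("Chris Paul", "2014-15", 21.47),
    ("Mikal Bridges", "2024-25", 23.3),
]

# Salary as a fraction of that season's cap
cap_share = {
    player: salary / salary_cap[season] for player, season, salary in salaries
}
for player, share in cap_share.items():
    print(f"{player}: {share:.1%} of the cap")
```

Under that assumption the two salaries come out at roughly 34% and 17% of the cap, which reverses the ordering given by the raw dollar amounts.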

Another limitation arises when a player gets injured and misses a season. This would result in missing data in the time series. When we want to determine how a player’s performance and salary change from season to season, gaps in the time series will lead to inaccurate results when analyzing potential trends in the data. Predictions based on the available information will also be negatively affected.
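
A simple way to locate such gaps, sketched here on a hypothetical toy frame, is to reindex the data onto every (player, season) pair so that absent seasons show up as rows of missing values. The column name `PTS` and all values are made up.

```python
import pandas as pd

# Toy data: player "B" has no 2011 row
toy = pd.DataFrame({
    "Player": ["A", "A", "A", "B", "B"],
    "Season": [2010, 2011, 2012, 2010, 2012],
    "PTS": [10.0, 12.0, 11.0, 8.0, 9.0],
})

# Every combination of player and season in the window of interest
seasons = range(2010, 2013)
full_index = pd.MultiIndex.from_product(
    [toy["Player"].unique(), seasons], names=["Player", "Season"]
)

# Reindexing turns absent player-seasons into rows of NaN
complete = toy.set_index(["Player", "Season"]).reindex(full_index)
missing = complete[complete["PTS"].isna()].index.tolist()
print(missing)
```

Once the gaps are explicit, we can decide case by case whether to drop the player, as we did for the 2010 to 2017 salary data, or to keep them and treat the missing season separately.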

Failing to recognize these limitations could have a significant impact on people. One affected area is sports analytics. If a team hires a data analyst to review players' progression and determine what their salaries should be for a season, but the analyst fails to take inflation into account, the analyst may recommend the wrong salary for a player. The franchise could then pay the player a sum whose financial burden outweighs the benefits the player provides to the team. ESPN sports commentators and analysts, who often discuss future contract agreements, may also be misled by these data limitations.

People who participate in sports betting may also be negatively affected when players miss a season due to injury. If someone wants to bet on how a player performs in a game or season based on how well they have historically performed, a gap in the time series would skew the picture of the player’s performance and mislead the bettor. The person could then lose money because of faulty data.

## Exploratory Data Analysis

Since we haven’t completed our web scraping for our data sets yet, we will do an exploratory data analysis later that includes basic summary statistics and a plot that visualizes our two data sets. 

## Questions for Reviewers

1. What is meant by "raw source data," and how do we know whether it applies to us?
2. What other types of EDA should we include?