# Project Phase 2

Team Members: Kate Li (kl739), Audrey Holden (aeh252), Katherine Yee (ky424), Julian Correa (jfc297)

## Research Question

Question: What factors predict an NBA player’s salary? Is there a relationship between an NBA player's performance and an increase in their salary? 

In this assignment we will be analyzing how the salaries of NBA players change depending on many different statistical variables, including points, assists, rebounds, blocks, and steals per game. We will train a multivariable regression to see whether we can predict an NBA player's salary from their performance statistics in previous seasons. We hope our model will help sports analysts better predict the contract that an NBA player may receive. It may also give NBA team managers more insight into team spending and budgeting.

We will also see if there is a positive or negative correlation between performance and salary. 
We also plan on analyzing different NBA player positions and examining the magnitude of salary changes depending on their respective stats. For example, an increase in blocks per game may improve a center's salary, but we expect that a comparable increase in points per game by a shooting guard would result in a more dramatic salary increase, on the assumption that offensive roles are valued more highly.

## Data Collection and Cleaning

In [1]:
import requests
from bs4 import BeautifulSoup

import pandas as pd
import numpy as np
import time

These are the websites we use as data sources, stored in variables.

In [2]:
NBA_stats = """https://www.basketball-reference.com/players/"""
HH_salaries = """https://hoopshype.com/salaries/players/"""
stat_leaders = """https://www.nba.com/stats/leaders?Season=2023-24&SeasonType=Regular+Season"""
kaggle_salary = """https://www.kaggle.com/datasets/thedevastator/exploring-nba-player-performance-and-salaries-19"""

We send a `GET` request to each website and check the status code to confirm that we can reach the page before we try to extract data from it.

In [3]:
stats_result = requests.get(NBA_stats)
print(stats_result.status_code)
if stats_result.status_code != 200:
    print("Something went wrong:", stats_result.status_code, stats_result.reason)

salaries_result = requests.get(HH_salaries)
print(salaries_result.status_code)
if salaries_result.status_code != 200:
    print("Something went wrong:", salaries_result.status_code, salaries_result.reason)

stat_leaders_result = requests.get(stat_leaders)
print(stat_leaders_result.status_code)
if stat_leaders_result.status_code != 200:
    print("Something went wrong:", stat_leaders_result.status_code, stat_leaders_result.reason)

kaggle_result = requests.get(kaggle_salary)
print(kaggle_result.status_code)
if kaggle_result.status_code != 200:
    print("Something went wrong:", kaggle_result.status_code, kaggle_result.reason)

200
200
200
200
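
The four checks above repeat the same pattern, so they could be folded into one helper. The following is a minimal sketch, not the code we ran: `check_sources` and its `fetch` parameter are hypothetical names introduced here, and `fetch` defaults to `requests.get` so that the helper can also be exercised without network access.

```python
import requests


def check_sources(urls, fetch=requests.get):
    """Request each URL and return a dict mapping its label to the HTTP status code.

    `urls` maps a short label (hypothetical, chosen by the caller) to a URL string.
    """
    statuses = {}
    for label, url in urls.items():
        response = fetch(url)
        statuses[label] = response.status_code
        # Same check as in the cell above: anything other than 200 is reported.
        if response.status_code != 200:
            print("Something went wrong:", label, response.status_code, response.reason)
    return statuses
```

Calling `check_sources({"stats": NBA_stats, "salaries": HH_salaries})` would then return one status code per label. The individual response objects are not kept in this sketch, so the later cells would still need their own `requests.get` calls or a variant that returns the responses.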


We converted the response from each `GET` request to a text string using `.text` and saved the results in new variables. We printed the type of each result to confirm that this step worked.

In [4]:
stats_text = stats_result.text
print(type(stats_text))

salaries_text = salaries_result.text
print(type(salaries_text))

stat_leaders_text = stat_leaders_result.text
print(type(stat_leaders_text))

kaggle_text = kaggle_result.text
print(type(kaggle_text))

<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>


We converted each HTML text string to a searchable tree of tags using `BeautifulSoup`.

In [25]:
page_one = BeautifulSoup(stats_text, "html.parser")
page_two = BeautifulSoup(salaries_text, "html.parser")
page_three = BeautifulSoup(stat_leaders_text, "html.parser")
page_four = BeautifulSoup(kaggle_text, "html.parser")

In [40]:
salaries_info = page_two.find("tbody").find_all("tr")
print(salaries_info[0].text)

#for salary in salaries_info:
   #print(salary.text.strip()[:5])

leaders_info = page_three.find_all("tr")
print(leaders_info)



						1.
					


								Stephen Curry							


							$55,761,217						

							$59,606,817						

							$62,587,158						

							$0						

							$0						

							$0						

[]
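
The printed row is mostly whitespace, so it needs cleaning before it can go into a data frame. The sketch below assumes every row follows the layout shown above (rank, player name, then salary columns); `parse_salary_row` is a hypothetical helper, and the sample string is a shortened copy of the printed output. The output does not say which season each salary column belongs to, so the sketch returns the salaries as an unlabeled list.

```python
def parse_salary_row(row_text):
    """Split a scraped row's text into rank, player name, and a list of integer salaries."""
    # Keep only the pieces that are non-empty once surrounding whitespace is stripped.
    fields = [piece.strip() for piece in row_text.splitlines() if piece.strip()]
    rank = int(fields[0].rstrip("."))  # "1." -> 1
    player = fields[1]
    # "$55,761,217" -> 55761217
    salaries = [int(value.replace("$", "").replace(",", "")) for value in fields[2:]]
    return rank, player, salaries


# Shortened copy of the row printed above (first two salary columns only).
sample = "\n\t1.\n\t\n\n\tStephen Curry\t\n\n\t$55,761,217\t\n\n\t$59,606,817\t\n"
print(parse_salary_row(sample))
# (1, 'Stephen Curry', [55761217, 59606817])
```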


## Data Description

We plan on creating two separate data frames that we will join into one data frame so that we can perform data analysis. Our first data frame will contain data on about 40 NBA players who were active from the 2014-2015 season to the 2023-2024 season. This data set will have NBA performance metrics (columns) for each player (rows) over the last 10 seasons. Our second data frame will contain data on the salaries/contracts of the same 40 NBA players over the same time period. This data set will have the season salaries/earnings (columns) of each player (rows) over the last 10 seasons. We are creating these data sets so that we can more efficiently analyze the relationship between performance metrics and player salaries.
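
The join could be sketched as follows. This is one possible layout, not a settled design: it assumes one row per player per season rather than one row per player, and the column names, player labels, and numbers are all placeholders invented for illustration, not real data.

```python
import pandas as pd

# Placeholder performance data: one row per (player, season).
stats = pd.DataFrame({
    "player": ["Player A", "Player A", "Player B"],
    "season": ["2022-23", "2023-24", "2023-24"],
    "ppg": [20.0, 22.5, 11.0],
})

# Placeholder salary data with the same (player, season) key.
salaries = pd.DataFrame({
    "player": ["Player A", "Player A", "Player B"],
    "season": ["2022-23", "2023-24", "2023-24"],
    "salary": [10_000_000, 12_000_000, 4_000_000],
})

# Join on both key columns; validate raises an error if a key is duplicated in either frame.
combined = stats.merge(salaries, on=["player", "season"], how="inner", validate="one_to_one")
print(combined)
```

With this layout, adding another statistic means adding one column, and a season a player missed is simply an absent row.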

It is unclear whether anyone specifically funded the NBA and HoopsHype datasets. HoopsHype was founded by Jorge Sierra and was acquired by the USA Today Sports Media Group in 2012. The NBA, on the other hand, does not have a single owner or sponsor, so it is unclear who funded the creation of the NBA dataset. However, given the revenue that the NBA makes, it would not be surprising if the league funded its own data collection. Google owns Kaggle, which likely implies that Kaggle is funded by Google; we consider both to be reliable companies. Overall, all of our sources appear to be reliable.

The websites that we collected data from are all publicly accessible. Since the performance statistics of NBA players can be recorded by anyone who keeps close track of the sport, there is little private data or information involved in documenting player stats. The same can be said about player salaries, which are publicly available. However, the dataset we reference on Kaggle consists of data submitted by other data analysts, who could have formatted it however they wanted. This means we would be using data that has already been processed and could be missing information.

We will use web scraping to extract data from three public websites (Basketball Reference NBA Data, HoopsHype Player Salaries, NBA Leaders). Additionally, we will use formatted data from an existing dataset (Kaggle Player Performance and Salaries). We will use these data to create two new datasets. One dataset covers 40 selected players with their stats per year from the beginning of the 2014-2015 season to the end of the 2023-2024 season. The other dataset contains those same players’ salaries over the same timeframe. We will use these two data frames to compare performance and salary, point out any trends, and make predictions.

NBA officials and ESPN analysts collected the data for their own future reference and comparison. NBA players are well aware that their statistics are recorded after every game. The players can use the data to study their performances and improve their abilities.

## Data Limitations

One potential data limitation is the effect of inflation on a player’s salary when comparing performance progression between seasons. If we compare a player's salary from previous seasons with their current salary without accounting for inflation, we will not get an accurate picture of how much the salary actually increased or decreased with performance. Because inflation changes the real value of a salary, the raw figures would misrepresent any correlation between performance and pay. To address this, we would have to adjust salaries to a common baseline, for example by converting them to constant dollars.

Another limitation to consider is team salary caps. NBA revenue has increased through each decade, so NBA teams have increasingly larger salary caps to spend on contracts. Since salary caps have steadily increased, teams are able to write larger contracts for their players, meaning that player salaries increase naturally. This may be a limitation in our data because if we don’t take salary cap increases into account, we may see an average salary increase for all players, even those who played poorly. In these cases, a player’s salary would rise over time regardless of performance, and any positive correlation we find could be a result of salary cap growth rather than better play.

Individual player salary limits have also changed: maximum contracts have grown dramatically over the years. The salary cap for the 2014-2015 season, \$63.065 million, pales in comparison to the cap for the 2024-2025 season, \$140.588 million. This is a data limitation because it complicates our multivariable prediction analysis. Chris Paul, a future Hall-of-Famer, earned \$21.47 million in his 2015 NBA season, while Mikal Bridges earned \$23.3 million in the 2024-2025 season. Chris Paul, a significantly better and higher-ranked player, earned less than Mikal Bridges, who would probably have earned only about \$10 million ten years earlier. Given inflation and the change in NBA salary spending, it is tricky to create comparable graphs and prediction charts from the data.
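
One way to put salaries from different seasons on a common scale is to express each one as a share of that season's salary cap. The sketch below uses only the figures quoted in this section. Pairing Chris Paul's \$21.47 million with the 2014-2015 cap is an assumption, since the text says only "2015 NBA season", and `cap_share` is a hypothetical helper name.

```python
# Salary cap figures quoted in the text, in millions of dollars.
SALARY_CAP_MILLIONS = {"2014-15": 63.065, "2024-25": 140.588}


def cap_share(salary_millions, season):
    """Return a salary as a percentage of that season's salary cap."""
    return 100 * salary_millions / SALARY_CAP_MILLIONS[season]


# Assumption: Chris Paul's salary is compared against the 2014-15 cap.
print(round(cap_share(21.47, "2014-15"), 1))  # 34.0
# Mikal Bridges' salary against the 2024-25 cap.
print(round(cap_share(23.3, "2024-25"), 1))   # 16.6
```

Under these assumptions the nominally smaller contract takes up roughly twice the share of the cap, which is closer to the ranking of the two players described above.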

Another limitation arises when a player is injured and misses a season. This would result in missing data in the time series. When we want to determine how a player’s performance and salary change from season to season, gaps in the time series will lead to inaccurate results when analyzing potential trends in the data. Predictions based on the incomplete series will also be negatively affected.
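
Before analyzing trends, we could at least flag which seasons are missing for each player. The following is a minimal sketch, assuming seasons are stored as labels such as `"2014-15"`; both function names are hypothetical, and the default range matches the 2014-2015 to 2023-2024 window described earlier.

```python
def season_labels(first_start_year, last_start_year):
    """Build labels such as '2014-15' for every season in the inclusive range."""
    return [
        f"{year}-{str(year + 1)[-2:]}"
        for year in range(first_start_year, last_start_year + 1)
    ]


def missing_seasons(seasons_played, first_start_year=2014, last_start_year=2023):
    """Return the expected season labels that do not appear in seasons_played."""
    played = set(seasons_played)
    return [
        season
        for season in season_labels(first_start_year, last_start_year)
        if season not in played
    ]


# Example with a made-up record that skips one season.
print(missing_seasons(["2014-15", "2016-17"], 2014, 2016))  # ['2015-16']
```

How a flagged gap should be handled (dropping the player, leaving the season empty, or interpolating) is a separate decision that this sketch does not make.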

Failing to recognize the limitations of the data could have a significant impact on people. One area that could be affected is sports analytics. If a sports team hires a data analyst to review data on players' progression and determine what a salary should be for a season, but the analyst fails to take inflation into account, the analyst may recommend the wrong salary for a player. The franchise could then pay a player a sum that creates a financial burden outweighing the benefits the player provides to the team. ESPN sports commentators and analysts, who often discuss future contract agreements, may also be misled by these data limitations.

People who participate in sports betting may also be negatively affected when players miss a season due to injury. If someone wants to bet on how a player will perform in a game or season based on how well they have performed historically, a gap in the time series would skew the picture of the player’s performance and give the bettor misleading information. As a result, a person could lose money because of faulty data.

## Exploratory Data Analysis

Since we haven’t completed our web scraping for our data sets yet, we will do an exploratory data analysis later that includes basic summary statistics and a plot that visualizes our two data sets. 

## Questions for Reviewers

1. When we attempt to scrape data from our NBA Leaders source (https://www.nba.com/stats/leaders?Season=2023-24&SeasonType=Regular+Season) by using the `<tr>` tag, we get `None` or an empty list in return. However, we can see the `<tr>` tag when we inspect the website. How do we fix this, and what are we doing wrong?
2. How do we bypass a 403 error? Our original source for official NBA player salaries (https://www.espn.com/nba/salaries) returns a 403 error when we run a `GET` request. Does this mean we have to find another data source? (We have replaced this source with data from HoopsHype [https://hoopshype.com/salaries/players/])
3. If our source has drop-down menus for different seasons, how should we go about extracting data from each one to create one dataframe with information? 
4. How should we format our final data set? We will have different NBA statistical categories (PPG, APG, RPG, etc.) that span over 10 years for each player. What should our rows and columns look like and how do we best organize them? Is there a limit on how many columns we can have?
5. What is meant by raw source data, and how do we know if it is applicable to us?
6. How should we format our data? Since we are working with data that varies by season/year, what would be the best way to organize all of our data? Should we change our idea to make it easier to format data?