![logo](https://raw.githubusercontent.com/michaelkim9/nba_predictor_project/master/other_assets/open_court_logo2.png)
# Open Court
## The Glassdoor of the National Basketball Association (NBA)

### Problem

The National Basketball Association (NBA) is widely considered to be the premier men's professional basketball league in the world and features some of the most well-known athletes. As such, NBA players are the world's best paid athletes by average annual salary per player. Their rising salaries led to a salary cap, which limits each team's total payroll. This limit is subject to a complex system of rules and exceptions, so it is considered a "soft" cap.

The plot below shows the historical salary cap values of the NBA. The NBA players’ salary cap was instituted in the 1984-1985 season. That year the salary cap was 3.6 million and the average player salary was 300,000. By 2013, 66 percent of NBA players earned 1 million or more. By the 2012-2013 season, the salary cap had increased to 58 million and the average NBA player salary was 5.1 million. In the upcoming 2017-2018 NBA season, the salary cap is almost at the 100 million mark, with the highest salary at nearly 35 million and the average at 6.5 million.

![salary_cap](https://raw.githubusercontent.com/michaelkim9/nba_predictor_project/master/other_assets/nba_salary_cap.png)

The plot below shows the total salaries committed by each team for the upcoming 2017-2018 season. The vertical line indicates the salary cap for the season. As you can see, more than half of all NBA teams have surpassed the designated salary cap, which shows that it is a "soft" cap with many exceptions.

![team_salaries](https://raw.githubusercontent.com/michaelkim9/nba_predictor_project/master/other_assets/current_sal.png)

The escalation of NBA player salaries has not only been explosive, but it has also created a large earnings gap between players. The box plot below shows the spread of player salaries within each team (teams with the largest total salary on the left and the lowest on the right). As seen below, there are many outliers well beyond the median (as represented by the ticks). For example, the Golden State Warriors are paying Stephen Curry a whopping 34,682,550 (the highest tick mark) this season, which is far higher than the 945,126 paid to his Warriors teammate Jason Thompson.

![salary_spread](https://raw.githubusercontent.com/michaelkim9/nba_predictor_project/master/other_assets/sal_spread.png)

Here is a diagram to explain how to interpret the box plots.

![box_plot](https://raw.githubusercontent.com/michaelkim9/nba_predictor_project/master/other_assets/box-plot-explained.gif)
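
The quantities in the diagram can also be computed directly. The snippet below is a minimal sketch, assuming the common convention that points more than 1.5 times the interquartile range beyond the quartiles are drawn as outliers (the plots above may have been drawn with different settings). The salary values are made up for illustration.

```python
import numpy as np


def box_plot_summary(values):
    """Return quartiles, whisker fences and outliers for a 1-D sample."""
    data = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1  # interquartile range: the height of the box
    lower_fence = q1 - 1.5 * iqr  # assumed convention: 1.5 * IQR fences
    upper_fence = q3 + 1.5 * iqr
    outliers = data[(data < lower_fence) | (data > upper_fence)]
    return {
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "lower_fence": float(lower_fence),
        "upper_fence": float(upper_fence),
        "outliers": outliers.tolist(),
    }


# Hypothetical team salaries in millions; one star contract dwarfs the rest.
salaries = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 30.0]
summary = box_plot_summary(salaries)
print(summary)
```

With a sample like this, the single large contract falls above the upper fence and would be drawn as an individual tick rather than as part of the box.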

Salaries have been a major source of controversy in the NBA and have led to several lockouts over labor disputes. There were NBA lockouts in 1995, 1996, 1998 and 2011, and the main issue in each of them was related to players' salaries. Owners felt players were overpaid, and players felt their earning power was restricted. NBA salaries remain a hot topic of discussion today.

Below is a histogram that shows the salary distributions based on whether a player was an all-star or not. A player's all-star status is usually attributed to their performance on the court. There are other factors as well, but by and large it is based on how well they are performing. So we can see that there is probably a relationship between on-court performance and salaries.

![all_star](https://raw.githubusercontent.com/michaelkim9/nba_predictor_project/master/other_assets/all_star_status_salaries.png)

## Proposal

In an effort to get an idea of what players are worth, judging strictly on quantitative data, I am building a model to estimate player salaries based on on-the-court performance. The project is called "Open Court".

Determining the factors that influence how much NBA owners pay players is of great importance in light of these financial constraints and disputes over salaries. When owners put out offers and sign contracts with players, how would they know that their money is being put to good use? How do they know if they might be paying a player too much? Or if they aren't paying their players enough for their performance and thus risk losing them when their contracts end? Likewise, sports agents can make a stronger case for a higher salary on a client's contract if they have hard statistical evidence of the client's worth.

The purpose of developing "Open Court" is to help settle disputes regarding NBA players' salaries and to identify the variables that are most likely to contribute to a player's salary. Unlike coaches, who are mainly hired and paid based on a single metric (wins), players are hired and paid based on individual performance, which can be measured by their on-the-court metrics. Though there is a lot of literature regarding what determines NBA salaries, there is very little statistical backing for its claims.

The development of Open Court is based on the hypothesis that a player's performance variables, such as points per game, field goals, etc., are significant contributors to player salaries. The dependent variable for this project is NBA player salaries and the independent variables are the offensive and defensive statistical categories.

In the rest of the job market, a person's skill set and the amount of experience they have using those skills determine their salary. With that information, applications like Glassdoor and Indeed can accurately predict how much a person would make for any given job. The plot below was produced by Indeed from such information.

![programs](https://raw.githubusercontent.com/michaelkim9/nba_predictor_project/master/other_assets/averagesalary.png)

Likewise with the NBA, I believe there are certain performance determinants that will offer an accurate prediction for salary. Perhaps teams put a greater premium on points per game than they do on total rebounds per game. These are the kinds of questions I hope to explore and answer.

## Data Webscraping

Unlike the salary data behind Glassdoor or Indeed, NBA salary information is available online, as are player statistics. Though many people have assembled data sets and published them on the web, none of them were substantial enough to develop a model on. Therefore, I had to scrape basketball sites in order to get the data for this project.

Using BeautifulSoup, I scraped data from the sports sites and loaded it into tables within a PostgreSQL database. I scraped multiple web sites to gather draft data, salary data, NBA player statistics and NCAA college statistics. The web scraping code was developed into Python scripts, all of which are included in this GitHub repository in their respective folders.
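
As an illustration of the scraping step, here is a minimal sketch that parses a salary table with BeautifulSoup. The HTML, the table id `salaries` and the function name `parse_salary_table` are made up for this example and do not correspond to any particular site or to the scripts in this repository. Loading the rows into the database is left out.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a scraped page. Note the repeated header
# row in the middle of the table, a common feature of long stats tables.
SAMPLE_HTML = """
<table id="salaries">
  <tr><th>Player</th><th>Salary</th></tr>
  <tr><td>Player A</td><td>$1,000,000</td></tr>
  <tr><th>Player</th><th>Salary</th></tr>
  <tr><td>Player B</td><td>$2,500,000</td></tr>
</table>
"""


def parse_salary_table(html):
    """Extract (player, salary) rows from the hypothetical salary table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="salaries")
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if not cells:
            # Header rows contain only <th> cells, so they are skipped here.
            continue
        name = cells[0].get_text(strip=True)
        salary_text = cells[1].get_text(strip=True)
        salary = float(salary_text.replace("$", "").replace(",", ""))
        rows.append({"Player": name, "Salary": salary})
    return rows


rows = parse_salary_table(SAMPLE_HTML)
print(rows)
```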

The following versions of Python and its packages are needed to run the web scraping scripts.

    Python version: sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)
    BeautifulSoup version: 4.5.3
    Pandas version: 0.19.2
    Numpy version: 1.12.1
    RegEx version: 2.2.1

The entire setup was built using Docker. All relevant Docker files and yml files are included in the repository.

A diagram of the environment that I was working with is below.

![setup](https://raw.githubusercontent.com/michaelkim9/nba_predictor_project/master/other_assets/docker_postgres_setup.PNG)

## The Data

The dependent variable for this study was NBA player salaries and the independent variables were the offensive and defensive statistical categories.

We are going to use the salaries and statistics of 486 NBA players from the 2016-2017 season. I decided to use only the statistics from the 2016-2017 season as they are most reflective of current salary rates. The NBA salary cap has been increasing at a rate faster than inflation, so bringing in statistics from multiple years would not make for a good model. I thought about adding "year" as a feature so that the model would give weight to the year alongside the other features. But I wanted to stick to just statistical features to predict a single salary outcome.

#### Data Dictionary
- **Player**: Player Name -- TEXT
- **Position**: Position -- TEXT
- **Shooting_Hand**: Hand that player shoots with -- TEXT
- **Height_inches**: Height of player -- INTEGER
- **Weight_lbs**: Weight of player -- FLOAT
- **College**: College that player played at -- TEXT
- **Draft_Year**: Year player was drafted -- INTEGER
- **Draft_Position**: Rank in draft -- INTEGER
- **Season_Count**: Number of seasons played in NBA -- INTEGER
- **Age**: Age of player on February 1st of that season -- INTEGER
- **G**: Games -- INTEGER
- **GS**: Games Started -- INTEGER
- **MP**: Minutes Played -- FLOAT
- **FG**: Field Goals -- FLOAT
- **FGA**: Field Goal Attempts -- FLOAT
- **FG_Perc**: Field Goal Percentage -- FLOAT
- **Three_P**: 3-Point Field Goals -- FLOAT
- **Three_Att**: 3-Point Field Goal Attempts -- FLOAT
- **Three_Perc**: 3-Point Field Goal Percentage -- FLOAT
- **Two_P**: 2-Point Field Goals -- FLOAT
- **Two_Att**: 2-Point Field Goal Attempts -- FLOAT
- **Two_Perc**: 2-Point Field Goal Percentage -- FLOAT
- **EFG_Perc**: Effective Field Goal Percentage -- FLOAT <br>
     This statistic adjusts for the fact that a 3-point field goal is worth one more point than a 2-point field goal
- **FT**: Free Throws -- FLOAT
- **FTA**: Free Throw Attempts -- FLOAT
- **FT_Perc**: Free Throw Percentage -- FLOAT
- **ORB**: Offensive Rebounds -- FLOAT
- **DRB**: Defensive Rebounds  -- FLOAT
- **TRB**: Total Rebounds -- FLOAT
- **AST**: Assists -- FLOAT
- **STL**: Steals -- FLOAT
- **BLK**: Blocks -- FLOAT
- **All_Star**: All Star status, 1 if they were all star at some point in career, 0 if not -- INTEGER
- **TOV**: Turnovers -- FLOAT
- **PF**: Personal Fouls -- FLOAT
- **PTS**: Points -- FLOAT
- **PER**: Player Efficiency Rating -- FLOAT <br>
     A measure of per-minute production standardized such that the league average is 15
- **WS**: Win Shares -- FLOAT <br>
     An estimate of the number of wins contributed by a player
- **Salary**: Salary for the 2016-2017 season -- FLOAT

## Data Cleaning

Data scraped from the web is messy and needs to be cleaned. The main steps were:

- Get rid of a couple of rows (that were header rows) that contain only NoneType values
- Rename some of the columns
- Change to proper data types
- Deal with missing values
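
The steps above can be sketched in pandas as follows. This is an illustration only, not the code from the notebooks: the raw column names, the rename mapping, and the choice to fill a missing count with zero are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical scraped table: everything arrives as text, and one leftover
# header row has been parsed into a row of None values.
raw = pd.DataFrame(
    {
        "player": ["Player A", None, "Player B", "Player C"],
        "3P": ["120", None, "45", None],
        "salary": ["1000000", None, "2500000", "900000"],
    }
)


def clean_scraped_table(df):
    """Apply the four cleaning steps to a scraped table."""
    # 1. Drop rows where every value is missing (former header rows).
    cleaned = df.dropna(how="all")
    # 2. Rename columns (hypothetical mapping to data dictionary names).
    cleaned = cleaned.rename(
        columns={"player": "Player", "3P": "Three_P", "salary": "Salary"}
    )
    # 3. Convert text columns to numeric types.
    for column in ["Three_P", "Salary"]:
        cleaned[column] = pd.to_numeric(cleaned[column], errors="coerce")
    # 4. Handle missing values (assumption: a missing count means zero).
    cleaned["Three_P"] = cleaned["Three_P"].fillna(0.0)
    return cleaned.reset_index(drop=True)


cleaned = clean_scraped_table(raw)
print(cleaned)
```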
    
To see more detailed steps taken for data cleaning, please refer to the following notebooks.

[Data Cleaning for Draft Table](https://github.com/michaelkim9/nba_predictor_project/blob/master/Part_1_webscraping_draft_table_beautifulsoup.ipynb)<br>
[Data Cleaning for Data Set](https://github.com/michaelkim9/nba_predictor_project/blob/master/Part_4_predictive_model.ipynb)

Additional data cleaning steps that were important for the model are described below.

-----------------

#### Dropping features not related to on-court performance
Some features are not related to on-court performance. Some are indicators of past performance, like draft_position, which reflects how well a player did in college. But our model is concerned with actual on-court performance at the professional level. Other fields are simply not reflective of performance at all, such as shooting_hand, height and weight. So we are going to drop the following columns.

##### Drop these fields
- Player
- Position
- Shooting_Hand
- Height_inches
- Weight_lbs
- College
- Draft_Year
- Draft_Position
- All_Star

-------------------

#### Dropping Player Efficiency Rating (PER)

PER is an advanced statistical measure of per-minute production, standardized such that the league average is 15. It combines various factors, but it has a major flaw.

PER largely measures offensive performance. The two defensive statistics it incorporates, blocks and steals (which were not tracked as official stats until 1973), can produce a distorted picture of a player's value, so PER is not a reliable measure of a player's defensive acumen.

Therefore, it is a statistic that adds to the offensive statistics already in the data set, and may even double count them, without adding much defensive information.

Because of this unbalanced weight towards offensive statistics, we are going to drop this column.

Below is a joint plot of PER and salaries. If PER were a good determinant of salary, it should follow a distribution similar to that of salaries. But in the joint plot below, the two distributions look different.

![per](https://raw.githubusercontent.com/michaelkim9/nba_predictor_project/master/other_assets/sal_per.png)

Another advanced statistic similar to PER is Win Shares (WS). As stated in the data dictionary, it is an estimate of the number of wins contributed by a player. This statistic is more comprehensive than PER, so we are going to keep it in the data set. As you can see from the joint plot below, it follows a distribution similar to that of salaries.

![ws](https://raw.githubusercontent.com/michaelkim9/nba_predictor_project/master/other_assets/sal_ws.png)

------------------------------------------------------------------------------------------------------

#### Excluding rookies

Season count was a key indicator of whether a player was on their rookie contract. Age didn't give much relevant information because an older player could still be on a rookie contract. Season count has more explanatory power, as it shows which players are actually on their rookie contracts.

Below is a swarm plot of salaries by age.
![age](https://raw.githubusercontent.com/michaelkim9/nba_predictor_project/master/other_assets/player_sal_by_age.png)

The swarm plot below is the same, except that it shows salaries by the number of seasons played in the NBA.
![season_count](https://raw.githubusercontent.com/michaelkim9/nba_predictor_project/master/other_assets/player_sal_season_count.png)

Rookie salaries are constrained by the rookie salary cap, which is typically in force for three years. Accordingly, the swarm plot in our EDA notebook showed a large swarm of lower salaries at season counts of 1-3. Therefore, we are going to exclude rookies from this data set. If a rookie or a second-year player performed well statistically, they would not be compensated accordingly because they are "locked" into a contract that may not reward them for their excellent play. That conflicts with the premise of our model, which is that salary reflects on-court performance.

##### Drop these rows
- Any rookies, meaning any rows where the player has a season count lower than 3
    - 172 rookies
    - 486 total players
    - 314 players in our final data set
    
##### Drop these fields
- Age
- Season_Count
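
A minimal pandas sketch of this filtering step is shown below, using a made-up four-player table. The constant name `MIN_SEASONS` is introduced for the example; the threshold of 3 follows the rule stated above.

```python
import pandas as pd

# Hypothetical players; only the columns needed for this step are included.
players = pd.DataFrame(
    {
        "Player": ["A", "B", "C", "D"],
        "Age": [20, 27, 23, 31],
        "Season_Count": [1, 2, 3, 8],
        "Salary": [1.2e6, 1.5e6, 9.0e6, 2.1e7],
    }
)

MIN_SEASONS = 3  # rows with a season count lower than this are dropped

# Keep only players past the rookie window, then drop the helper columns.
veterans = players[players["Season_Count"] >= MIN_SEASONS]
veterans = veterans.drop(columns=["Age", "Season_Count"]).reset_index(drop=True)
print(veterans)
```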

------------------------------------------------------------

#### Per Game Statistics

Currently, the data has total numbers for the entire season. For example, the feature "PTS" includes all the points that the player scored in 2016-2017. However, totals may not fairly reflect one player's performance compared to another's because they don't compensate for games missed due to injuries, suspensions or any other reason a player couldn't play. Therefore, the columns that are totals are going to be divided by the number of games the player played so that the statistics are "per game." After that, we are going to drop the columns related to the number of games or minutes played, as they aren't statistics related to player performance.

The columns that are percentages aren't affected because the percentage will remain the same whether it's a total or per-game statistic.

##### Adjust these fields to per game
- FG: Field Goals
- FGA: Field Goal Attempts 
- Three_P: 3-Point Field Goals 
- Three_Att: 3-Point Field Goal Attempts 
- Two_P: 2-Point Field Goals 
- Two_Att: 2-Point Field Goal Attempts 
- FT: Free Throws 
- FTA: Free Throw Attempts 
- ORB: Offensive Rebounds 
- DRB: Defensive Rebounds 
- TRB: Total Rebounds 
- AST: Assists 
- STL: Steals 
- BLK: Blocks 
- TOV: Turnovers 
- PF: Personal Fouls 
- PTS: Points 

##### Drop these fields
- G: Games
- GS: Games Started
- MP: Minutes Played
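
The conversion can be sketched as follows. The function name `to_per_game` and the sample numbers are made up for illustration; the column names follow the data dictionary. The sketch assumes every player has played at least one game, so `G` is never zero.

```python
import pandas as pd

# Season-total columns to convert, as listed above.
TOTAL_COLUMNS = [
    "FG", "FGA", "Three_P", "Three_Att", "Two_P", "Two_Att", "FT", "FTA",
    "ORB", "DRB", "TRB", "AST", "STL", "BLK", "TOV", "PF", "PTS",
]


def to_per_game(df, total_columns=TOTAL_COLUMNS):
    """Divide season totals by games played, then drop G, GS and MP."""
    present = [column for column in total_columns if column in df.columns]
    per_game = df.copy()
    # Row-wise division: each player's totals are divided by their own G.
    per_game[present] = per_game[present].div(per_game["G"], axis=0)
    return per_game.drop(columns=["G", "GS", "MP"])


# Hypothetical season totals for two players.
season_totals = pd.DataFrame(
    {
        "G": [82.0, 60.0],
        "GS": [82.0, 10.0],
        "MP": [2800.0, 900.0],
        "PTS": [1640.0, 300.0],
        "AST": [410.0, 120.0],
        "FG_Perc": [0.48, 0.41],
    }
)
per_game = to_per_game(season_totals)
print(per_game)
```

Percentage columns such as `FG_Perc` are not in the list of totals, so they pass through unchanged.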

-------------

#### Dropping columns that have high correlation with others

The features (stats) in our data set should be as independent of one another as possible. From the heatmap below, we notice that some stats columns are highly (or even directly) correlated with one another. This affects our model because the same information appears in more than one column, so certain statistics are weighted more heavily than others. For example, field goals, field goal attempts and field goal percentage are directly related to one another. Likewise, 2-point field goals, 3-point field goals and free throws all contribute to points (which is a weighted sum of all of these). Such features would be counted more than once and unevenly favored.

The heatmap below shows the correlations of the remaining fields.
![heatmap_a](https://raw.githubusercontent.com/michaelkim9/nba_predictor_project/master/other_assets/stats_heat_map_a.png)

From the heatmap above, we can identify the stats columns that are highly correlated with one another. We are going to drop the following columns.

- fg
- fga
- fg_percent
- three_p
- three_att
- three_perc
- two_p
- two_att
- two_perc
- fta
- orb
- drb
- ft

![heatmap_b](https://raw.githubusercontent.com/michaelkim9/nba_predictor_project/master/other_assets/stats_heat_map_b.png)
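
Besides reading the heatmap by eye, highly correlated pairs can be listed programmatically. The sketch below is one way to do that. The function name `highly_correlated_pairs` and the 0.9 threshold are assumptions for illustration, not values taken from the notebooks, and the tiny data set is made up.

```python
import pandas as pd


def highly_correlated_pairs(df, threshold=0.9):
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr()
    columns = list(corr.columns)
    pairs = []
    for i, first in enumerate(columns):
        # Only look at columns after `first` so each pair is reported once.
        for second in columns[i + 1:]:
            r = float(corr.loc[first, second])
            if abs(r) > threshold:
                pairs.append((first, second, round(r, 3)))
    return pairs


# Hypothetical per-game stats: pts is an exact multiple of fg, blk is unrelated.
stats = pd.DataFrame(
    {
        "fg": [1.0, 2.0, 3.0, 4.0, 5.0],
        "pts": [2.5, 5.0, 7.5, 10.0, 12.5],
        "blk": [1.0, 0.0, 2.0, 0.0, 1.0],
    }
)
pairs = highly_correlated_pairs(stats)
print(pairs)
```

Each reported pair is a candidate for dropping one of its two columns. Which one to keep is still a judgment call.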

-------

## Machine Learning Models

Below is an outline of the predictive machine learning models.

### Linear Model with Regularization
Now we are going to build out a predictive model, starting with linear regression with regularization. Regularization is a method of adding constraints or a penalty to a model, with the goal of preventing overfitting and improving generalization.

### Scaling and Train-Test-Split
Statistics vary in scale, as some statistics inherently take higher values than others. For example, players typically have more free throws than blocks. But the magnitude of a statistic (the number of free throws versus blocks) shouldn't determine its influence on the end result (salary). So I am going to standardize the statistical variables by scaling them to unit variance. The data is also split into training and test sets so that the model can be scored on data it was not fit on.
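
A minimal scikit-learn sketch of this step is shown below, run on synthetic data rather than the real player table. Several choices here are assumptions made for the example and may differ from the notebook: the 75/25 split, fitting the scaler on the training rows only, and using `Ridge` (an L2 penalty) as the regularized linear model.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for three per-game stats on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, 1.0, 5.0], scale=[6.0, 0.5, 3.0], size=(200, 3))
# Synthetic "salary": a linear function of the stats plus noise.
y = 1.0e6 * (0.5 * X[:, 0] + 2.0 * X[:, 1] + 0.3 * X[:, 2])
y = y + rng.normal(scale=5.0e5, size=200)

# Hold out a quarter of the rows for testing (assumed split size).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both sets,
# so no information from the test rows leaks into the scaling.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Ridge is one example of a linear model with regularization.
model = Ridge(alpha=1.0).fit(X_train_scaled, y_train)
test_r2 = model.score(X_test_scaled, y_test)
print(f"Test R^2: {test_r2:.3f}")
```

After scaling, every training column has zero mean and unit variance, so the size of a coefficient reflects a feature's influence rather than its units.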

### More complex linear model: ElasticNet
The R² for the linear model was around 0.42 (42%) on the test data. The basic linear regression model may not be the best model, given that basketball statistics and a player's performance involve many variables and that other factors contribute to players' salaries. Building out a more complex model that includes these other factors could greatly improve the accuracy of our prediction.

I used a Pipeline and GridSearch cross-validation to implement ElasticNet.

But after building out this model, ElasticNet didn't do much better than the linear model. In fact, the basic linear model seems to have better scores and less error.
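
For reference, a Pipeline with grid-search cross-validation around ElasticNet can be sketched as below. It runs on synthetic data from `make_regression`, and the parameter grid, the number of folds and the step names `scale` and `model` are assumptions for illustration, not the settings used in the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the player statistics.
X, y = make_regression(
    n_samples=300, n_features=8, n_informative=5, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scaling lives inside the pipeline so it is refit on each training fold.
pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("model", ElasticNet(max_iter=10_000)),
    ]
)

# Hypothetical grid: alpha sets the penalty strength, l1_ratio the L1/L2 mix.
param_grid = {
    "model__alpha": [0.01, 0.1, 1.0, 10.0],
    "model__l1_ratio": [0.1, 0.5, 0.9],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

best_params = search.best_params_
test_r2 = search.score(X_test, y_test)
print(best_params, round(test_r2, 3))
```

If the search settles on a very small `alpha`, the penalized model is close to ordinary least squares, which is one reason ElasticNet may not outperform a plain linear model.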

### Random Forest Model
Next I tried other models, since a linear model may not be the best fit for this data set. I tried a Random Forest, which should have several benefits for our data. It is a non-parametric model, so it does not assume that the variables are normally distributed. That suits our data, as salaries do not follow a normal distribution.

Random Forests are a combination of tree predictors where each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The basic principle is that a group of “weak learners” can come together to form a “strong learner”. Random Forests are a useful tool for making predictions because adding more trees does not cause them to overfit: by the law of large numbers, the generalization error converges as the forest grows.
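
A minimal sketch of fitting a Random Forest regressor with scikit-learn is shown below. It uses synthetic data, and the hyperparameters (200 trees, default depth) are assumptions for illustration, so it says nothing about the results on the real player data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the player statistics.
X, y = make_regression(
    n_samples=300, n_features=8, n_informative=5, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Each tree is trained on a bootstrap sample of the training rows, and the
# forest's prediction is the average over all trees.
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

test_r2 = forest.score(X_test, y_test)
importances = forest.feature_importances_  # one value per feature, sums to 1
print(f"Test R^2: {test_r2:.3f}")
print(importances.round(3))
```

Tree-based models do not require feature scaling, so the standardization step can be skipped here. The feature importances give a rough ranking of which statistics the forest relies on most.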

After I implemented this model, 

### Limitations

The main limitation of this application is that only quantitative data was used. There are intangibles that we can't account for in this model, such as fan appeal, a player's proneness to injury, or a player's personality, leadership and impact on team morale. However, the model does highlight some trends and shows that data can be used to estimate NBA salaries. These kinds of limitations also apply to the normal job market. Applications like Glassdoor and Indeed predict salaries based on hard skills that are determined to be in greater demand. Their models are limited too, in that they cannot capture intangible factors like a person's soft skills, leadership ability, team fit, business acumen, etc. But such applications are still useful tools for both employers and job-seekers, as they give useful information on expected salaries for a given skill set. Likewise, Open Court functions as a similar tool for NBA team owners, athletes and their agents.

Another limitation is that this model doesn't account for players who signed long-term contracts well before the 2016-2017 season. Their salary was determined by their play in prior seasons and isn't fully captured by the 2016-2017 season alone. For the purposes of this model and project, I am attempting to predict the 2016-2017 salaries based solely on basketball statistics from that season.