![logo](https://raw.githubusercontent.com/michaelkim9/nba_predictor_project/master/other_assets/open_court_logo2.png)
# Open Court
## The Glassdoor of the National Basketball Association (NBA)

### Problem

The National Basketball Association (NBA) is widely considered the premier men's professional basketball league in the world and features some of the most well-known athletes. As such, NBA players are the world's best-paid athletes by average annual salary per player. Their rising salaries have led to a salary cap that limits each team's total payroll. Because this limit is subject to a complex system of rules and exceptions, it is considered a "soft" cap.

The plot below shows the historical salary cap values of the NBA. The NBA players’ salary cap was instituted in the 1984-1985 season. That year the salary cap was 3.6 million dollars and the average player salary was 300,000 dollars. By 2013, 66 percent of NBA players earned 1 million or more. By the 2012-2013 season, the salary cap had increased to 58 million and the average NBA player salary was 5.1 million. In the upcoming 2017-2018 NBA season, the salary cap is almost at the 100 million mark, with the highest salary at nearly 35 million and the average at 6.5 million.

![salary_cap](https://raw.githubusercontent.com/michaelkim9/nba_predictor_project/master/other_assets/nba_salary_cap.png)

The plot below shows the total salaries committed by each team for the upcoming 2017-2018 season. The vertical line indicates the salary cap for the season. As you can see, more than half of all NBA teams exceed the designated salary cap, which shows that it is a "soft" cap with many exceptions.

![team_salaries](https://raw.githubusercontent.com/michaelkim9/nba_predictor_project/master/other_assets/current_sal.png)

The escalation of NBA player salaries has not only been explosive; it has also created a large earnings gap between players. The box plot below shows the spread of teams' salaries and the spread per player (teams with the largest total salary on the left and the lowest on the right). As seen below, there are many outliers (represented by the ticks) well beyond the median. For example, the Golden State Warriors are paying Stephen Curry a whopping 34,682,550 (the highest tick mark) this season, far more than his Warriors teammate Jason Thompson at just 945,126.

![salary_spread](https://raw.githubusercontent.com/michaelkim9/nba_predictor_project/master/other_assets/sal_spread.png)

Here is a diagram to explain how to interpret the box plots.

![box_plot](https://raw.githubusercontent.com/michaelkim9/nba_predictor_project/master/other_assets/box-plot-explained.gif)

Salaries have been a major source of controversy in the NBA and have led to several lockouts. There were NBA lockouts in the 1995, 1996, 1998 and 2011 NBA seasons, and the main issue in each of them was related to players' salaries. Owners felt players were overpaid, and players felt that their earning power was restricted. NBA salaries continue to be a hot topic of discussion today.

### Proposal

In an effort to get an idea of what players are worth, judging strictly on quantitative data, I am building a model to estimate player salaries based on on-the-court performance. The project is called "Open Court".

Determining the factors that influence what NBA owners pay players is of great importance in light of these financial constraints and disputes over salaries. When owners put out offers and sign contracts with players, how do they know that their money is being put to good use? How do they know if they might be paying a player too much? Or if they aren't paying their players enough for their performance and thus risk losing them when their contracts end? Sports agents, too, can make a stronger case for higher salaries in their clients' contracts if they have hard statistical evidence of their clients' worth.

The purpose of developing "Open Court" is to help settle disputes regarding NBA players' salaries and to identify the variables that are most likely to contribute to a player's salary. Unlike coaches, who are mainly hired and paid based on a single metric (wins), players are hired and paid based on individual performance, which can be measured by their on-the-court metrics. Though there is a lot of literature on what determines NBA salaries, there is very little statistical backing for its claims.

The development of Open Court is based on the hypothesis that a player's performance variables, such as points per game and field goals, are significant contributors to player salaries. The dependent variable for this project is NBA player salary, and the independent variables are the offensive and defensive statistical categories.
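
As a rough sketch of this hypothesis, and assuming a linear form (an assumption made here for illustration; the hypothesis itself only says that the performance variables are significant contributors), the relationship could be written as:

```math
\text{salary}_i = \beta_0 + \sum_{j=1}^{p} \beta_j \, x_{ij} + \varepsilon_i
```

Here `salary_i` is the salary of player `i`, `x_ij` is that player's value for the `j`-th statistical category (points per game, field goals, and so on), each `beta_j` measures how much that category contributes, and `epsilon_i` is the part of the salary that the statistics do not explain. Under this reading, the hypothesis is that at least some of the `beta_j` coefficients differ meaningfully from zero.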

In the rest of the job market, a person's skill set and the amount of experience they have using those skills determine their salary. With that information, applications like Glassdoor and Indeed can accurately predict how much a person would make in any given job. The plot below was produced by Indeed from such information.

![programs](https://raw.githubusercontent.com/michaelkim9/nba_predictor_project/master/other_assets/averagesalary.png)

Likewise with the NBA, I believe there are certain performance determinants that will offer an accurate prediction of salary. Perhaps teams put a greater premium on points per game than they do on total rebounds per game. These kinds of questions are what I hope to explore and answer.

### Part 1: Data Webscraping

Unlike the salary data behind Glassdoor or Indeed, NBA salary information is available online, as are player statistics. Although many people have compiled datasets and published them on the web, none of them were substantial enough to develop a model on. Therefore, I had to scrape basketball sites in order to get the data for this project.

Using BeautifulSoup, I scraped data from the sports sites and loaded it into tables within a PostgreSQL database. I scraped multiple websites to gather draft data, salary data, NBA player statistics and NCAA college statistics. The web scraping code is organized as Python scripts, all of which are included in this GitHub repository in their respective folders.
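
To give a feel for the scraping step, here is a minimal sketch of turning an HTML table into rows that could then be loaded into a database. It is not the code from the repository: the project's scripts use BeautifulSoup, while this sketch uses Python's built-in `html.parser` so that it runs without extra packages. The HTML fragment, player names, salaries and function names are all made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical HTML fragment standing in for a scraped salary page.
SAMPLE_HTML = """
<table id="salaries">
  <tr><th>Player</th><th>Salary</th></tr>
  <tr><td>Player A</td><td>$1,500,000</td></tr>
  <tr><td>Player B</td><td>$2,000,000</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # row currently being built, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_salary_table(html):
    """Return (header, records), with salaries converted to integers."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    records = [
        (name, int(salary.replace("$", "").replace(",", "")))
        for name, salary in body
    ]
    return header, records


header, records = parse_salary_table(SAMPLE_HTML)
print(header)
print(records)
```

Each tuple in `records` corresponds to one row that could be inserted into a database table. With BeautifulSoup, the parsing class would typically collapse into a few calls on the parsed document.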

The following versions of Python and its libraries were used for web scraping and are needed to run the Python scripts.

    Python version: sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)
    BeautifulSoup version: 4.5.3
    Pandas version: 0.19.2
    Numpy version: 1.12.1
    RegEx version: 2.2.1

The entire setup was built using Docker. All relevant Docker files and YAML files are included in the repository.

A diagram of the environment that I was working with is below.

![setup](https://raw.githubusercontent.com/michaelkim9/nba_predictor_project/master/other_assets/docker_postgres_setup.PNG)

### Part 2: Data Cleaning

Data scraped from the web is messy and needs to be cleaned. Some of the necessary data cleaning steps are outlined below:

- Remove the few rows (stray header rows) that contain only NoneType values
- Rename some of the columns
- Convert columns to the proper data types
- Handle the remaining missing values
- Add a column for draft year
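
The steps above can be sketched roughly as follows. This is an illustration rather than the notebook's code: it uses plain Python dictionaries instead of the pandas library listed earlier so that it runs without extra packages. The column names, the sample rows, and the choice to fill missing numbers with 0 are all assumptions.

```python
# Hypothetical raw rows as they might come out of a scraper: every value is a
# string or None, and a repeated header row shows up as an all-None record.
raw_rows = [
    {"Pk": "1", "Player": "Player A", "Yrs": "3", "PTS": "1200"},
    {"Pk": None, "Player": None, "Yrs": None, "PTS": None},
    {"Pk": "2", "Player": "Player B", "Yrs": None, "PTS": "85"},
]

# Assumed mapping from scraped column names to friendlier ones.
RENAMES = {"Pk": "pick", "Player": "player", "Yrs": "years_played", "PTS": "points"}
INT_COLUMNS = ("pick", "years_played", "points")


def clean_draft_rows(rows, draft_year):
    """Apply the cleaning steps listed above to one draft class."""
    cleaned = []
    for row in rows:
        # Drop rows that contain only None values (stray header rows).
        if all(value is None for value in row.values()):
            continue
        # Rename the columns.
        record = {RENAMES.get(key, key): value for key, value in row.items()}
        # Convert numeric columns to int; missing values become 0 (an assumption).
        for column in INT_COLUMNS:
            value = record[column]
            record[column] = int(value) if value is not None else 0
        # Add a column for the draft year.
        record["draft_year"] = draft_year
        cleaned.append(record)
    return cleaned


for record in clean_draft_rows(raw_rows, draft_year=2016):
    print(record)
```

Whether a missing value should become 0, be left empty, or cause the row to be dropped depends on the column, so the fill rule here is only a placeholder.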
    
To see more detailed steps taken for data cleaning, please refer to the following notebooks.

[Data Cleaning for Draft Table](https://github.com/michaelkim9/nba_predictor_project/blob/master/Part_1_webscraping_draft_table_beautifulsoup.ipynb)



### Part 3: Feature Engineering

I took the following steps to engineer the features for the model.

- Remove the team column
     - For the purposes of this model, the team that a particular player plays for is irrelevant, as I am mainly looking at performance statistics as a predictor of salary. There are players who also play for multiple teams and so 

### Limitations

The main limitation of this application is that only quantitative data was used. There are intangibles that this model can't account for, such as fan appeal, injury proneness, or a player's personality, leadership and impact on team morale. However, the model does highlight some trends and shows that data can be used to determine NBA salaries. These kinds of limitations also apply to the normal job market. Applications like Glassdoor and Indeed predict salaries based on hard skills that are determined to be in greater demand. Their models are limited too, in that they are not able to capture intangible factors like a person's soft skills, leadership ability, team fit, business acumen, etc. Such applications are still useful tools for both employers and job-seekers, as they give useful information on expected salaries for a given skill set. Likewise, Open Court functions as a similar tool for NBA team owners, athletes and their agents.

Another limitation is that this model doesn't account for players who signed long-term contracts well before the 2016-2017 season. Their salaries were therefore determined by earlier seasons of play and aren't fully captured by the 2016-2017 season alone. For the purposes of this model and project, I am attempting to predict the 2016-2017 salaries based solely on basketball statistics from that season.