In [3]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

# The Data

Now we begin building models to predict salary from a player's statistics.

The dependent variable for this study is NBA player salary, and the independent variables are the offensive and defensive statistical categories.

We are going to use the salaries and statistics of 486 NBA players from the 2016-2017 season. I decided to use only that season's statistics because they are most reflective of current salary rates. The NBA salary cap has been increasing faster than inflation, so pooling statistics from multiple years would not make for a good model. I thought about adding "year" as a feature so that the model could weight the year alongside the other features, but I wanted to stick to statistical features to predict a single salary outcome.

### Data Dictionary

- **Player**: Player Name -- TEXT
- **Position**: Position -- TEXT
- **Shooting_Hand**: Hand that player shoots with -- TEXT
- **Height_inches**: Height of player -- INTEGER
- **Weight_lbs**: Weight of player -- FLOAT
- **College**: College that player played at -- TEXT
- **Draft_Year**: Year player was drafted -- INTEGER
- **Draft_Position**: Rank in draft -- INTEGER
- **Season_Count**: Number of seasons played in NBA -- INTEGER
- **Age**: Age of Player at the start of February 1st of that season -- INTEGER
- **G**: Games -- INTEGER
- **GS**: Games Started -- INTEGER
- **MP**: Minutes Played -- FLOAT
- **FG**: Field Goals -- FLOAT
- **FGA**: Field Goal Attempts -- FLOAT
- **FG_Perc**: Field Goal Percentage -- FLOAT
- **Three_P**: 3-Point Field Goals -- FLOAT
- **Three_Att**: 3-Point Field Goal Attempts -- FLOAT
- **Three_Perc**: 3-Point Field Goal Percentage -- FLOAT
- **Two_P**: 2-Point Field Goals -- FLOAT
- **Two_Att**: 2-Point Field Goal Attempts -- FLOAT
- **Two_Perc**: 2-Point Field Goal Percentage -- FLOAT
- **EFG_Perc**: Effective Field Goal Percentage -- FLOAT <br>
     This statistic adjusts for the fact that a 3-point field goal is worth one more point than a 2-point field goal
- **FT**: Free Throws -- FLOAT
- **FTA**: Free Throw Attempts -- FLOAT
- **FT_Perc**: Free Throw Percentage -- FLOAT
- **ORB**: Offensive Rebounds -- FLOAT
- **DRB**: Defensive Rebounds -- FLOAT
- **TRB**: Total Rebounds -- FLOAT
- **AST**: Assists -- FLOAT
- **STL**: Steals -- FLOAT
- **BLK**: Blocks -- FLOAT
- **All_Star**: All Star status, 1 if they were all star at some point in career, 0 if not -- INTEGER
- **TOV**: Turnovers -- FLOAT
- **PF**: Personal Fouls -- FLOAT
- **PTS**: Points -- FLOAT
- **PER**: Player Efficiency Rating -- FLOAT <br>
     A measure of per-minute production standardized such that the league average is 15
- **WS**: Win Shares -- FLOAT <br>
     An estimate of the number of wins contributed by a player
- **Salary**: Salary for the 2016-2017 season -- FLOAT

### EXTRA NOTES

- Excluded were the salaries of rookies and of players still playing under their rookie contract (22). The rationale is that these players' salaries are constrained by the rookie salary cap, which is typically in force for three years. So if a rookie or second-year player performed well statistically, they would not be compensated accordingly, because they are “locked” into a contract that may not reward them for their excellent play.
- The first model (Exhibit II) includes every player in the NBA and gives an R² value of 0.4546. In other words, roughly 45 percent of the variation in salaries is explained by the variation in the eleven statistical fields used.
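
To make the R² interpretation above concrete, here is a minimal sketch of how the statistic is computed: one minus the ratio of unexplained variation to total variation. The function name `r_squared` and the numbers are made up for illustration; they are not the model or data behind the 0.4546 figure.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation the model leaves unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical salaries (in millions) and hypothetical predictions.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]

print(r_squared(actual, predicted))
```

For these toy numbers the unexplained variation is 1.0 and the total variation is 5.0, so the model explains 80 percent of the variation. An R² of 0.4546 is read the same way.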

# Data Cleaning

Now we are going to build out some models for salary prediction. However, before we do this, we need to look at our features and set up our data for the model.

In [11]:
import psycopg2 as pg2
from psycopg2.extras import RealDictCursor

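def execute_query(query):
    # Open a connection, run the query, and return the rows as a list of dicts.
    connection = pg2.connect(host='postgres',
                             user='postgres',
                             database='basketball')
    try:
        cursor = connection.cursor(cursor_factory=RealDictCursor)
        cursor.execute(query)
        r = cursor.fetchall()
        cursor.close()
    finally:
        # Always release the connection, even if the query fails.
        connection.close()
    return r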

def query_to_df(query):
    df = pd.DataFrame(execute_query(query))
    return df

In [26]:
nba_df = query_to_df('SELECT * FROM nba_2016')
nba_df.head()

Unnamed: 0,age,all_star,ast,blk,college,draft_position,draft_year,drb,efg_perc,fg,...,three_att,three_p,three_perc,tov,trb,two_att,two_p,two_perc,weight_lbs,ws
0,24,0,4.0,13.0,Purdue University,46,2016,28.0,0.464,17.0,...,10.0,5.0,0.5,10.0,36.0,32.0,12.0,0.375,260.0,0.0
1,32,0,125.0,9.0,University of Oregon,26,2007,51.0,0.483,121.0,...,128.0,48.0,0.375,66.0,69.0,172.0,73.0,0.424,161.0,0.3
2,21,0,150.0,40.0,University of Arizona,4,2014,289.0,0.499,393.0,...,267.0,77.0,0.288,89.0,405.0,598.0,316.0,0.528,220.0,3.7
3,22,0,3.0,0.0,University of Kentucky,0,0,3.0,0.0,0.0,...,2.0,0.0,0.0,0.0,3.0,2.0,0.0,0.0,210.0,0.1
4,25,0,7.0,7.0,Michigan State University,15,2014,24.0,0.454,23.0,...,15.0,3.0,0.2,8.0,33.0,39.0,20.0,0.513,237.0,0.2


In [27]:
nba_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 486 entries, 0 to 485
Data columns (total 39 columns):
age               486 non-null int64
all_star          486 non-null int64
ast               486 non-null float64
blk               486 non-null float64
college           486 non-null object
draft_position    486 non-null int64
draft_year        486 non-null int64
drb               486 non-null float64
efg_perc          485 non-null float64
fg                486 non-null float64
fg_perc           485 non-null float64
fga               486 non-null float64
ft                486 non-null float64
ft_perc           471 non-null float64
fta               486 non-null float64
g                 486 non-null float64
gs                486 non-null float64
height_inches     486 non-null float64
mp                486 non-null float64
orb               486 non-null float64
per               486 non-null float64
pf                486 non-null float64
player            486 non-null object
position

#### Dropping features not related to on-court performance
Some features are not related to on-court performance. Some are indicators of past performance, like draft_position, which reflects how well a player did in college, but our model is concerned with actual on-court performance at the professional level. Other fields simply do not reflect performance at all, such as shooting_hand, height and weight. So we are going to drop the following columns.

##### Drop these fields
- Player
- Shooting_Hand
- Height_inches
- Weight_lbs
- College
- Draft_Year
- Draft_Position

In [28]:
nba_df.drop(['player','shooting_hand','height_inches','weight_lbs','college','draft_year','draft_position'],axis=1, inplace=True)

#### Age versus Season Count

From my previous EDA notebook, season count was a key indicator of whether a player was on their rookie contract. Age didn't give much relevant information, but season count had more explanatory power because it captures the players who were restricted by the rookie salary cap. Rookie salaries are constrained by that cap, which is typically in force for three years. Accordingly, the swarm plot in the EDA notebook showed a large cluster of lower salaries at season counts of 1-3. Season count will therefore be kept as a feature, even though it isn't strictly a performance statistic, because the model needs it to account for the lower rookie salaries. We're going to drop the age field because it is too closely correlated with season count, and of the two, season count seems to have more explanatory power.

##### Drop these fields
- Age

In [29]:
nba_df.drop('age',axis=1,inplace=True)
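
As a rough illustration of the reasoning above, the sketch below checks how strongly two columns move together before dropping one of them. The values are made up, and the column name `season_count` is assumed from the data dictionary; the real check would run on `nba_df` before the drop.

```python
import pandas as pd

# Toy data for illustration only; not taken from the real table.
toy = pd.DataFrame({
    "age":          [20, 22, 25, 28, 31, 35],
    "season_count": [1, 2, 4, 7, 10, 14],
})

# Pearson correlation between the two columns (values near 1.0 mean nearly linear).
corr = toy["age"].corr(toy["season_count"])
print(round(corr, 3))

# Drop one of the correlated pair without mutating the original frame.
reduced = toy.drop(columns=["age"])
print(list(reduced.columns))
```

A correlation close to 1 suggests the two columns carry largely the same information, which is the usual argument for keeping only one of them.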

#### Per Game Statistics

Currently, the data holds season totals. For example, the feature "PTS" is all the points a player scored in 2016-2017. However, totals may not fairly reflect one player's performance relative to another's, because they don't compensate for games missed due to injuries or suspensions. Therefore, the columns that are totals will be divided by the number of games the player played, so that the statistics become "per game." After that, we will drop the columns related to the number of games or minutes played, as they aren't statistics of player performance.

The columns that are percentages aren't affected because the percentage will remain the same whether it's a total or per-game statistic.

##### Adjust these fields to per game
- FG: Field Goals
- FGA: Field Goal Attempts 
- Three_P: 3-Point Field Goals 
- Three_Att: 3-Point Field Goal Attempts 
- Two_P: 2-Point Field Goals 
- Two_Att: 2-Point Field Goal Attempts 
- FT: Free Throws 
- FTA: Free Throw Attempts 
- ORB: Offensive Rebounds 
- DRB: Defensive Rebounds 
- TRB: Total Rebounds 
- AST: Assists 
- STL: Steals 
- BLK: Blocks 
- TOV: Turnovers 
- PF: Personal Fouls 
- PTS: Points 

##### Drop these fields
- G: Games
- GS: Games Started
- MP: Minutes Played
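
The per-game adjustment described above can be sketched in one step rather than column by column. This is a minimal sketch on made-up numbers: the helper `to_per_game` and the toy frame `totals` are hypothetical names, and only the lowercase column names (`g`, `gs`, `mp`, `pts`, `ast`, `fg_perc`) are assumed to match the real table.

```python
import pandas as pd

# Toy season totals for two hypothetical players; values are made up.
totals = pd.DataFrame({
    "g":       [10.0, 20.0],
    "gs":      [5.0, 20.0],
    "mp":      [200.0, 600.0],
    "pts":     [100.0, 400.0],
    "ast":     [30.0, 100.0],
    "fg_perc": [0.45, 0.50],   # percentages are left untouched
})

def to_per_game(df, total_columns, games_column="g",
                drop_columns=("g", "gs", "mp")):
    """Return a copy with totals divided by games played, minus the usage columns."""
    out = df.copy()
    # Row-wise division: each player's totals are divided by that player's games.
    out[total_columns] = out[total_columns].div(out[games_column], axis=0)
    return out.drop(columns=list(drop_columns))

per_game = to_per_game(totals, ["pts", "ast"])
print(per_game)
```

Returning a new frame instead of overwriting columns in place means the cell can be re-run without dividing the same column twice. A player with zero games would produce `inf` or `NaN` here, so such rows would need separate handling.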

In [30]:
nba_df.head()

Unnamed: 0,all_star,ast,blk,drb,efg_perc,fg,fg_perc,fga,ft,ft_perc,...,stl,three_att,three_p,three_perc,tov,trb,two_att,two_p,two_perc,ws
0,0,4.0,13.0,28.0,0.464,17.0,0.405,42.0,9.0,0.45,...,1.0,10.0,5.0,0.5,10.0,36.0,32.0,12.0,0.375,0.0
1,0,125.0,9.0,51.0,0.483,121.0,0.403,300.0,32.0,0.8,...,25.0,128.0,48.0,0.375,66.0,69.0,172.0,73.0,0.424,0.3
2,0,150.0,40.0,289.0,0.499,393.0,0.454,865.0,156.0,0.719,...,65.0,267.0,77.0,0.288,89.0,405.0,598.0,316.0,0.528,3.7
3,0,3.0,0.0,3.0,0.0,0.0,0.0,4.0,1.0,0.5,...,0.0,2.0,0.0,0.0,0.0,3.0,2.0,0.0,0.0,0.1
4,0,7.0,7.0,24.0,0.454,23.0,0.426,54.0,14.0,0.737,...,8.0,15.0,3.0,0.2,8.0,33.0,39.0,20.0,0.513,0.2


In [None]:
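# Season-total columns to convert to per-game values
# (lowercase, to match the DataFrame's column names)
adjust_columns = [
    'fg',
    'fga',
    'three_p',
    'three_att',
    'two_p',
    'two_att',
    'ft',
    'fta',
    'orb',
    'drb',
    'trb',
    'ast',
    'stl',
    'blk',
    'tov',
    'pf',
    'pts',
]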

In [20]:

nba_df['fg'] = nba_df['fg'] / nba_df['g']

In [24]:
nba_df['fga'] = nba_df['fga'] / nba_df['g']

In [25]:
nba_df.head(10)

Unnamed: 0,all_star,ast,blk,drb,efg_perc,fg,fg_perc,fga,ft,ft_perc,...,stl,three_att,three_p,three_perc,tov,trb,two_att,two_p,two_perc,ws
0,0,4.0,13.0,28.0,0.464,0.772727,0.405,1.909091,9.0,0.45,...,1.0,10.0,5.0,0.5,10.0,36.0,32.0,12.0,0.375,0.0
1,0,125.0,9.0,51.0,0.483,1.861538,0.403,4.615385,32.0,0.8,...,25.0,128.0,48.0,0.375,66.0,69.0,172.0,73.0,0.424,0.3
2,0,150.0,40.0,289.0,0.499,4.9125,0.454,10.8125,156.0,0.719,...,65.0,267.0,77.0,0.288,89.0,405.0,598.0,316.0,0.528,3.7
3,0,3.0,0.0,3.0,0.0,0.0,0.0,0.8,1.0,0.5,...,0.0,2.0,0.0,0.0,0.0,3.0,2.0,0.0,0.0,0.1
4,0,7.0,7.0,24.0,0.454,1.277778,0.426,3.0,14.0,0.737,...,8.0,15.0,3.0,0.2,8.0,33.0,39.0,20.0,0.513,0.2
5,1,337.0,86.0,370.0,0.527,5.573529,0.473,11.779412,108.0,0.8,...,52.0,242.0,86.0,0.355,115.0,465.0,559.0,293.0,0.524,6.3
6,0,57.0,16.0,203.0,0.499,3.560606,0.499,7.136364,65.0,0.765,...,19.0,1.0,0.0,0.0,33.0,278.0,470.0,235.0,0.5,2.3
7,0,99.0,44.0,374.0,0.468,3.0,0.393,7.639344,96.0,0.706,...,60.0,212.0,70.0,0.33,94.0,451.0,254.0,113.0,0.445,1.9
8,0,11.0,0.0,21.0,0.463,1.0,0.375,2.666667,12.0,0.75,...,3.0,44.0,14.0,0.318,7.0,24.0,36.0,16.0,0.444,0.1
9,0,23.0,32.0,198.0,0.517,2.93617,0.517,5.680851,70.0,0.625,...,27.0,1.0,0.0,0.0,37.0,292.0,266.0,138.0,0.519,2.1


#### Converting statistics to 

# Machine Learning Models

This is where some of the interesting stuff happens! Now we are going to build out a predictive model 