<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# DS-SF-42 | Class 8 | Regression - Addressing model fit

<br>
### _Predicting shots made per game by Kobe Bryant_

_Authors: Kiefer Katovich (SF) and Gus Ostow (SF)_

---

The Kobe shots dataset has hundreds of columns representing different characteristics of each basketball game. Fitting an ordinary linear regression to it causes problems that narrower datasets might not. In this exploration you will diagnose issues with model fit using regression metrics, a train/test split, and cross-validation.


### Plan

Today I am going to flip the script: we are going to start the class with a hands-on partner activity to motivate the day's topic, then address the theory after.

1. Motivating the problem
2. Slides interlude
3. Addressing the problem

### Teams

<img src=https://i.imgur.com/JI6ydY5.png align=left>
<br><br><br><br><br><br><br><br><br><br><br><br><br><br>


# <font color=blue>Part I</font> - Motivating the problem


---

### 1. Load packages and data

In [47]:
import numpy as np
import pandas as pd

from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [48]:
kobe = pd.read_csv('./datasets/kobe_superwide_games.csv')

---

### 2. Examine the data

#### Guiding questions

- How many columns are there? 
- Examine what the observations (rows) and columns represent.
- Why does this dataset _feel_ different than the datasets we've touched so far?
- What concerns do you have even before fitting your first model?
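
One way to approach these questions is to count rows and columns, then group the column names by the text before the first colon. The sketch below runs on a tiny made-up frame (`toy` and its values are assumptions; only the column naming pattern mirrors the real file), so swap in `kobe` to try it on the actual data.

```python
import pandas as pd

# Toy stand-in for the real data: hypothetical values, realistic column names.
toy = pd.DataFrame({
    "SHOTS_MADE": [4, 7, 9],
    "AWAY_GAME": [0, 1, 0],
    "SEASON_OPPONENT:atl:1996-97": [1, 0, 0],
    "SEASON_OPPONENT:bos:1996-97": [0, 1, 0],
    "ACTION_TYPE:tip_shot": [0, 2, 1],
})

n_rows, n_cols = toy.shape
print("rows: {}, columns: {}".format(n_rows, n_cols))

# Group columns by the text before the first colon and count each group.
prefixes = pd.Series(toy.columns).str.split(":").str[0]
prefix_counts = prefixes.value_counts()
print(prefix_counts)
```

On the real dataset, a comparison of the number of columns to the number of rows is a useful first warning sign before fitting anything.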

In [46]:
# A: 
kobe.shape


(1558, 645)

In [None]:
# Print every column name to see what the dataset contains.

for col in kobe.columns:
    print(col)

In [60]:
kobe["AWAY_GAME"].unique()

array([0, 1])

In [65]:
# Select the first row (index 0) of the SEASON_OPPONENT columns and cast each value to bool.

kobe.filter(like="SEASON_OPPONENT").loc[0].astype(bool)

SEASON_OPPONENT:atl:1996-97    False
SEASON_OPPONENT:atl:1997-98    False
SEASON_OPPONENT:atl:1999-00    False
SEASON_OPPONENT:atl:2000-01    False
SEASON_OPPONENT:atl:2001-02    False
SEASON_OPPONENT:atl:2002-03    False
SEASON_OPPONENT:atl:2003-04    False
SEASON_OPPONENT:atl:2004-05    False
SEASON_OPPONENT:atl:2005-06    False
SEASON_OPPONENT:atl:2006-07    False
SEASON_OPPONENT:atl:2007-08    False
SEASON_OPPONENT:atl:2008-09    False
SEASON_OPPONENT:atl:2009-10    False
SEASON_OPPONENT:atl:2010-11    False
SEASON_OPPONENT:atl:2011-12    False
SEASON_OPPONENT:atl:2012-13    False
SEASON_OPPONENT:atl:2013-14    False
SEASON_OPPONENT:atl:2014-15    False
SEASON_OPPONENT:atl:2015-16    False
SEASON_OPPONENT:bkn:2012-13    False
SEASON_OPPONENT:bkn:2015-16    False
SEASON_OPPONENT:bos:1996-97    False
SEASON_OPPONENT:bos:1997-98    False
SEASON_OPPONENT:bos:1999-00    False
SEASON_OPPONENT:bos:2001-02    False
SEASON_OPPONENT:bos:2002-03    False
SEASON_OPPONENT:bos:2003-04    False
...

In [68]:
# Boolean mask: which SEASON_OPPONENT columns are nonzero in the first row (index 0)?

column_mask = kobe.filter(like="SEASON_OPPONENT").loc[0] > 0

In [69]:
# Restrict the first row to the flagged columns (the season/opponent of that game).
column_mask = kobe.filter(like="SEASON_OPPONENT").loc[0, column_mask] > 0

In [45]:
# A: list the column names; SHOTS_MADE is the target, so it will be dropped from the predictors.

kobe.columns

kobe.columns.values

array(['SHOTS_MADE', 'AWAY_GAME', 'SEASON_OPPONENT:atl:1996-97',
       'SEASON_OPPONENT:atl:1997-98', 'SEASON_OPPONENT:atl:1999-00',
       'SEASON_OPPONENT:atl:2000-01', 'SEASON_OPPONENT:atl:2001-02',
       'SEASON_OPPONENT:atl:2002-03', 'SEASON_OPPONENT:atl:2003-04',
       'SEASON_OPPONENT:atl:2004-05', 'SEASON_OPPONENT:atl:2005-06',
       'SEASON_OPPONENT:atl:2006-07', 'SEASON_OPPONENT:atl:2007-08',
       'SEASON_OPPONENT:atl:2008-09', 'SEASON_OPPONENT:atl:2009-10',
       'SEASON_OPPONENT:atl:2010-11', 'SEASON_OPPONENT:atl:2011-12',
       'SEASON_OPPONENT:atl:2012-13', 'SEASON_OPPONENT:atl:2013-14',
       'SEASON_OPPONENT:atl:2014-15', 'SEASON_OPPONENT:atl:2015-16',
       'SEASON_OPPONENT:bkn:2012-13', 'SEASON_OPPONENT:bkn:2015-16',
       'SEASON_OPPONENT:bos:1996-97', 'SEASON_OPPONENT:bos:1997-98',
       'SEASON_OPPONENT:bos:1999-00', 'SEASON_OPPONENT:bos:2001-02',
       'SEASON_OPPONENT:bos:2002-03', 'SEASON_OPPONENT:bos:2003-04',
       'SEASON_OPPONENT:bos:2004-05', 

---

### 3.  Prepare the dataset for training AND validation

- Make predictor matrix `X` and target variable `y`
- Split your data into training and validation sets using `train_test_split`
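
As a minimal sketch of these two steps, the block below builds `X` and `y` from a small random frame and splits them. The frame `toy`, the feature names `f1` to `f3`, and the `random_state` value are assumptions for illustration; fixing `random_state` simply makes the split reproducible from run to run.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 rows, one target column and three made-up features.
rng = np.random.RandomState(0)
toy = pd.DataFrame(rng.rand(20, 4), columns=["SHOTS_MADE", "f1", "f2", "f3"])

# Predictor matrix and target vector.
X_toy = toy.drop("SHOTS_MADE", axis=1)
y_toy = toy["SHOTS_MADE"]

# Hold out 25% of the rows for validation.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)

print("train: {}, test: {}".format(X_tr.shape, X_te.shape))
```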

In [43]:
# A:
#target = kobe['SHOTS_MADE']
#X = kobe.drop('SHOTS_MADE', axis = 1)

In [72]:
X = kobe.drop("SHOTS_MADE", axis = 1)
y = kobe["SHOTS_MADE"]

In [73]:
X.shape[0] == y.shape[0]

True

---

### 4. Build a linear regression predicting `SHOTS_MADE` from the rest of the columns.

1. How does it perform? Keep the regression metrics we talked about on Tuesday in mind, like mean squared error, mean absolute error, and $R^2$
2. Is there a disparity between your train set and your test set? What does that indicate?
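
To make the metrics concrete, here is a sketch that reports mean squared error, mean absolute error, and $R^2$ on both the training and the test split. It uses synthetic data (the arrays `X_demo` and `y_demo` and the helper `regression_report` are made up for this example), so the numbers say nothing about the Kobe data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: 200 rows, 5 features, known linear signal plus noise.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 5))
true_coefs = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_demo = X_demo.dot(true_coefs) + rng.normal(scale=0.5, size=200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
demo_model = LinearRegression().fit(Xd_train, yd_train)


def regression_report(model, X_part, y_part):
    """Return MSE, MAE and R^2 for one split (hypothetical helper)."""
    y_hat = model.predict(X_part)
    return {
        "mse": mean_squared_error(y_part, y_hat),
        "mae": mean_absolute_error(y_part, y_hat),
        "r2": r2_score(y_part, y_hat),
    }


metrics = {
    "train": regression_report(demo_model, Xd_train, yd_train),
    "test": regression_report(demo_model, Xd_test, yd_test),
}
for split_name in ["train", "test"]:
    print(split_name, metrics[split_name])
```

When the train and test rows of such a report are close, the model generalizes; a large gap between them is the disparity the second question asks about.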

In [None]:
train_test_split(X,y,test_size=0.3)

In [53]:
# A: training/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

lm = LinearRegression()

#X = df 
#y = target 

model = lm.fit(X_train, y_train)
prediction = model.predict(X_test)
score = model.score(X_test, y_test)  # R^2 on the held-out test set
print(score)

-5.72950652745e+16


In [27]:
model

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [87]:
kobe.corr()

Unnamed: 0,SHOTS_MADE,AWAY_GAME,SEASON_OPPONENT:atl:1996-97,SEASON_OPPONENT:atl:1997-98,SEASON_OPPONENT:atl:1999-00,SEASON_OPPONENT:atl:2000-01,SEASON_OPPONENT:atl:2001-02,SEASON_OPPONENT:atl:2002-03,SEASON_OPPONENT:atl:2003-04,SEASON_OPPONENT:atl:2004-05,...,ACTION_TYPE:tip_layup_shot,ACTION_TYPE:tip_shot,ACTION_TYPE:turnaround_bank_shot,ACTION_TYPE:turnaround_fadeaway_bank_jump_shot,ACTION_TYPE:turnaround_fadeaway_shot,ACTION_TYPE:turnaround_finger_roll_shot,ACTION_TYPE:turnaround_hook_shot,ACTION_TYPE:turnaround_jump_shot,SEASON_GAME_NUMBER,CAREER_GAME_NUMBER
SHOTS_MADE,1.000000,0.009671,-0.076029,-0.024370,0.004683,0.004683,0.011791,0.006625,0.011986,-0.017227,...,-0.014039,-0.036108,0.062609,-0.017227,0.006544,0.000266,-0.016149,0.146655,0.059012,0.194842
AWAY_GAME,0.009671,1.000000,-0.000046,-0.000046,0.025310,0.025310,-0.000046,-0.000046,-0.025375,-0.025375,...,-0.000046,-0.021380,0.011718,0.025310,0.014764,0.002704,0.024190,0.053858,0.022408,0.007128
SEASON_OPPONENT:atl:1996-97,-0.076029,-0.000046,1.000000,-0.001285,-0.000909,-0.000909,-0.001285,-0.001285,-0.000909,-0.000909,...,-0.001285,-0.010103,-0.006594,-0.000909,-0.012282,-0.001282,-0.003170,-0.023060,-0.024709,-0.060220
SEASON_OPPONENT:atl:1997-98,-0.024370,-0.000046,-0.001285,1.000000,-0.000909,-0.000909,-0.001285,-0.001285,-0.000909,-0.000909,...,-0.001285,-0.010103,-0.006594,-0.000909,-0.012282,-0.001282,-0.003170,-0.023060,-0.024709,-0.054241
SEASON_OPPONENT:atl:1999-00,0.004683,0.025310,-0.000909,-0.000909,1.000000,-0.000642,-0.000909,-0.000909,-0.000642,-0.000642,...,-0.000909,-0.007142,-0.004661,-0.000642,-0.008682,-0.000906,-0.002241,-0.016301,-0.033038,-0.030961
SEASON_OPPONENT:atl:2000-01,0.004683,0.025310,-0.000909,-0.000909,-0.000642,1.000000,-0.000909,-0.000909,-0.000642,-0.000642,...,-0.000909,-0.007142,-0.004661,-0.000642,-0.008682,-0.000906,-0.002241,-0.016301,0.018543,-0.023016
SEASON_OPPONENT:atl:2001-02,0.011791,-0.000046,-0.001285,-0.001285,-0.000909,-0.000909,1.000000,-0.001285,-0.000909,-0.000909,...,-0.001285,-0.010103,-0.006594,-0.000909,-0.012282,-0.001282,-0.003170,-0.023060,0.003515,-0.027180
SEASON_OPPONENT:atl:2002-03,0.006625,-0.000046,-0.001285,-0.001285,-0.000909,-0.000909,-0.001285,1.000000,-0.000909,-0.000909,...,-0.001285,-0.010103,-0.006594,-0.000909,-0.012282,-0.001282,-0.003170,-0.005776,-0.005434,-0.019807
SEASON_OPPONENT:atl:2003-04,0.011986,-0.025375,-0.000909,-0.000909,-0.000642,-0.000642,-0.000909,-0.000909,1.000000,-0.000642,...,-0.000909,-0.007142,-0.004661,-0.000642,-0.008682,-0.000906,-0.002241,0.037716,-0.010654,-0.009099
SEASON_OPPONENT:atl:2004-05,-0.017227,-0.025375,-0.000909,-0.000909,-0.000642,-0.000642,-0.000909,-0.000909,-0.000642,1.000000,...,-0.000909,-0.007142,-0.004661,-0.000642,-0.008682,-0.000906,-0.002241,-0.016301,-0.037904,-0.005831



In [89]:
# from class

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.5)

lr = LinearRegression()
lr.fit(X_train, y_train)

lr_r2_train = lr.score(X_train, y_train)
lr_r2_test = lr.score(X_test, y_test)

print("R2 on train: {}".format(lr_r2_train))
print("R2 on test: {}".format(lr_r2_test))

R2 on train: 0.890785724923
R2 on test: 0.223015602856


### What's causing this?

- too many predictor columns
- too few rows relative to the number of columns
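
The effect of having more columns than training rows can be reproduced on synthetic data. In the sketch below (all names and sizes are assumptions chosen for illustration), only three of 100 columns carry signal, and the 50/50 split leaves 60 training rows, so the regression can fit the training set essentially perfectly while generalizing much worse.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Synthetic data (not the Kobe data): 120 rows, 100 columns,
# and only the first three columns actually influence the target.
n_rows, n_cols = 120, 100
X_wide = rng.normal(size=(n_rows, n_cols))
y_wide = (
    3.0 * X_wide[:, 0]
    - 2.0 * X_wide[:, 1]
    + X_wide[:, 2]
    + rng.normal(size=n_rows)
)

# A 50/50 split leaves 60 training rows for 100 columns.
Xw_train, Xw_test, yw_train, yw_test = train_test_split(
    X_wide, y_wide, test_size=0.5, random_state=0
)

wide_lr = LinearRegression().fit(Xw_train, yw_train)
r2_wide_train = wide_lr.score(Xw_train, yw_train)
r2_wide_test = wide_lr.score(Xw_test, yw_test)

print("R2 on train: {:.3f}".format(r2_wide_train))
print("R2 on test:  {:.3f}".format(r2_wide_test))
```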

In [81]:
# Correlation heatmap for a slice of the predictor columns.
sns.heatmap(X.iloc[:, 60:100].corr())

# <font color=blue> Interlude</font> - Slides

Sit back and enjoy the show...

----
# <font color=blue> Part II</font> - Addressing the problem

---

### 6. Try fitting and evaluating a `Ridge` model instead of a standard `LinearRegression`
The ridge regression is a model _similar_ to the standard linear regression, but for now let it remain shrouded in an \*air\* of mystery.

Is it better than the Linear regression? On the training set? On the test set? Why do you think that is?
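
A side-by-side comparison might look like the sketch below. It runs on synthetic data with many columns relative to rows (the arrays, sizes, and `alpha=10.0` are assumptions), and it scores both models on the training set and on the test set so the two questions can be answered separately.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data: 200 rows, 80 columns, signal in the first three columns only.
rng = np.random.RandomState(1)
X_cmp = rng.normal(size=(200, 80))
y_cmp = (
    3.0 * X_cmp[:, 0]
    - 2.0 * X_cmp[:, 1]
    + X_cmp[:, 2]
    + rng.normal(size=200)
)

Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_cmp, y_cmp, test_size=0.5, random_state=1
)

# Fit both models on the same training rows and score them on both splits.
scores = {}
for name, estimator in [("ols", LinearRegression()), ("ridge", Ridge(alpha=10.0))]:
    estimator.fit(Xc_train, yc_train)
    scores[name] = {
        "train": estimator.score(Xc_train, yc_train),
        "test": estimator.score(Xc_test, yc_test),
    }
    print("{:>5}  train R2: {:.3f}  test R2: {:.3f}".format(
        name, scores[name]["train"], scores[name]["test"]
    ))
```

Ordinary least squares minimizes the training error by construction, so the ridge model cannot beat it on the training set; the interesting comparison is on the test set.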

In [101]:
# A:

X2_train, X2_test, y2_train, y2_test = train_test_split(X, y, test_size=0.25)

lm = Ridge(alpha=100)

#X = df 
#y = target 

model = lm.fit(X2_train, y2_train)

# R^2 on the held-out test set; call model.score(X2_train, y2_train) for the training score.
score_test = model.score(X2_test, y2_test)
print(score_test)

0.624210885319


---
### 7. Examine your ridge model's coefficients

Does anything jump out at you? Use any of the tools we've learned so far, like histograms, barplots, and other descriptive statistics, to compare the ridge model's fit to the linear regression we used earlier.
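
One possible way to compare the two sets of coefficients is to put them in a single DataFrame and summarize it. The sketch below uses synthetic data and made-up feature names (`f0`, `f1`, ...), with `alpha=50.0` picked arbitrarily; on the real data the index would be `X.columns`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data: 150 rows, 60 columns, signal in the first three columns only.
rng = np.random.RandomState(2)
feature_names = ["f{}".format(i) for i in range(60)]
X_coef = rng.normal(size=(150, 60))
y_coef = (
    3.0 * X_coef[:, 0]
    - 2.0 * X_coef[:, 1]
    + X_coef[:, 2]
    + rng.normal(size=150)
)

ols = LinearRegression().fit(X_coef, y_coef)
ridge = Ridge(alpha=50.0).fit(X_coef, y_coef)

# One row per feature, one column per model.
coefs = pd.DataFrame({"ols": ols.coef_, "ridge": ridge.coef_}, index=feature_names)

# Descriptive statistics and the largest ridge coefficients by magnitude.
print(coefs.describe())
print(coefs["ridge"].abs().sort_values(ascending=False).head())

# Histogram counts of the ridge coefficients (pass coefs["ridge"] to plt.hist to plot).
counts, bin_edges = np.histogram(coefs["ridge"], bins=10)
print(counts)
```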


---

### 8. Play around with the `alpha` hyperparameter

How does this impact the coefficients of the fit model?

#### EX:
```python
ridge = Ridge(alpha = 10.0)
```

Some good values to try might be `0`, `0.1`, `1.0`, `10`, `100`
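
A sweep over `alpha` could be sketched as follows, tracking the overall size (L2 norm) of the coefficient vector at each setting. The data are synthetic and the variable names are assumptions. `alpha=0` is left out of the loop because scikit-learn's documentation advises using `LinearRegression` instead of `Ridge` in that case.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data: 100 rows, 30 columns, signal in the first three columns only.
rng = np.random.RandomState(3)
X_sweep = rng.normal(size=(100, 30))
y_sweep = (
    3.0 * X_sweep[:, 0]
    - 2.0 * X_sweep[:, 1]
    + X_sweep[:, 2]
    + rng.normal(size=100)
)

alphas = [0.1, 1.0, 10.0, 100.0]
coef_norms = []
for alpha in alphas:
    ridge_model = Ridge(alpha=alpha).fit(X_sweep, y_sweep)
    # L2 norm of the coefficients: a single number for "how large are they overall".
    coef_norms.append(np.linalg.norm(ridge_model.coef_))
    print("alpha = {:>6}: coefficient norm = {:.4f}".format(alpha, coef_norms[-1]))
```

Larger values of `alpha` penalize large coefficients more heavily, so the norm shrinks toward zero as `alpha` grows.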

In [7]:
# A:

---

### 9. Fit a `Lasso` model and examine its coefficients

Is it better than the Linear regression? Is it better than Ridge? What do the differences in results imply about the issues with the dataset?

- Show the coefficients for the variables, ordered from largest to smallest by absolute value.
- What percent of the variables in the original dataset are "zeroed-out" by the lasso?
- What are the most important predictors for how many shots Kobe made in a game?
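
The three bullets above can be sketched on synthetic data like this. The feature names, the sizes, and `alpha=0.2` are assumptions for the example; on the real data you would fit on your training split and index the coefficients by `X.columns`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic data: 200 rows, 30 columns, signal in f0, f1 and f2 only.
rng = np.random.RandomState(4)
lasso_features = ["f{}".format(i) for i in range(30)]
X_las = pd.DataFrame(rng.normal(size=(200, 30)), columns=lasso_features)
y_las = (
    3.0 * X_las["f0"]
    - 2.0 * X_las["f1"]
    + 1.5 * X_las["f2"]
    + rng.normal(scale=0.5, size=200)
)

Xl_train, Xl_test, yl_train, yl_test = train_test_split(
    X_las, y_las, test_size=0.25, random_state=4
)
lasso = Lasso(alpha=0.2).fit(Xl_train, yl_train)

# Coefficients ordered from largest to smallest absolute value.
lasso_coefs = pd.DataFrame({"coef": lasso.coef_}, index=lasso_features)
lasso_coefs["abs_coef"] = lasso_coefs["coef"].abs()
lasso_coefs = lasso_coefs.sort_values("abs_coef", ascending=False)
print(lasso_coefs.head(5))

# Share of variables whose coefficient the lasso set exactly to zero.
pct_zeroed = 100.0 * (lasso_coefs["coef"] == 0).mean()
print("Percent of variables zeroed out: {:.1f}%".format(pct_zeroed))

top_three = list(lasso_coefs.index[:3])
```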

In [8]:
# A:

---

### 10. Tune the alpha for your `Lasso` model

How does this influence the coefficients? The model performance on the train and the test sets?
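
One way to tune `alpha` is to score each candidate with `cross_val_score` and keep the best one. The sketch below does this on synthetic data; the candidate grid, the 5-fold setting, and all variable names are assumptions rather than recommended values.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Synthetic data: 200 rows, 30 columns, signal in the first three columns only.
rng = np.random.RandomState(5)
X_tune = rng.normal(size=(200, 30))
y_tune = (
    3.0 * X_tune[:, 0]
    - 2.0 * X_tune[:, 1]
    + 1.5 * X_tune[:, 2]
    + rng.normal(scale=0.5, size=200)
)

candidate_alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
cv_scores = {}
for alpha in candidate_alphas:
    candidate = Lasso(alpha=alpha, max_iter=10000)
    # Mean R^2 across 5 folds (R^2 is the default score for regressors).
    cv_scores[alpha] = cross_val_score(candidate, X_tune, y_tune, cv=5).mean()
    print("alpha = {:>6}: mean CV R2 = {:.3f}".format(alpha, cv_scores[alpha]))

best_alpha = max(cv_scores, key=cv_scores.get)
print("Best alpha: {}".format(best_alpha))
```

A very large `alpha` pushes every coefficient to zero, which leaves only the intercept and drives the cross-validated $R^2$ to roughly zero or below.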

In [9]:
# A:

---

### 11. Synthesize what you've discovered

Write a couple of sentences telling the story: 
- How did a standard linear regression perform on the Kobe dataset? What qualities of this dataset caused these results?
- How did the Ridge perform in comparison? What clues could you glean from its coefficients? How does `alpha` seem to dictate the coefficients?
- What about the `Lasso`?
- When would each of these models be useful?

In [10]:
# A: