<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# DS-SF-42 | Class 8 | Regression - Addressing model fit

<br>
### _Predicting shots made per game by Kobe Bryant_

_Authors: Kiefer Katovich (SF) and Gus Ostow (SF)_

---

The Kobe shots dataset has hundreds of columns representing different characteristics of each basketball game. Fitting an ordinary linear regression to it will cause issues that other datasets might not. In this exploration you will diagnose issues with model fit using regression metrics, a train/test split, and cross-validation.


### Plan

Today I am going to flip the script: we are going to start the class with a hands-on partner activity to motivate the day's topic, then address the theory after.

1. Motivating the problem
2. Slides interlude
3. Addressing the problem

### Teams

<img src=https://i.imgur.com/JI6ydY5.png align=left>
<br><br><br><br><br><br><br><br><br><br><br><br><br><br>


# <font color=blue>Part I</font> - Motivating the problem


---

### 1. Load packages and data

In [1]:
import numpy as np
import pandas as pd

from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [2]:
kobe = pd.read_csv('./datasets/kobe_superwide_games.csv')

---

### 2. Examine the data

#### Guiding questions

- How many columns are there? 
- Examine what the observations (rows) and columns represent.
- Why does this dataset _feel_ different than the datasets we've touched so far?
- What concerns do you have even before fitting your first model?

In [3]:
# A:
kobe.describe()

Unnamed: 0,SHOTS_MADE,AWAY_GAME,SEASON_OPPONENT:atl:1996-97,SEASON_OPPONENT:atl:1997-98,SEASON_OPPONENT:atl:1999-00,SEASON_OPPONENT:atl:2000-01,SEASON_OPPONENT:atl:2001-02,SEASON_OPPONENT:atl:2002-03,SEASON_OPPONENT:atl:2003-04,SEASON_OPPONENT:atl:2004-05,...,ACTION_TYPE:tip_layup_shot,ACTION_TYPE:tip_shot,ACTION_TYPE:turnaround_bank_shot,ACTION_TYPE:turnaround_fadeaway_bank_jump_shot,ACTION_TYPE:turnaround_fadeaway_shot,ACTION_TYPE:turnaround_finger_roll_shot,ACTION_TYPE:turnaround_hook_shot,ACTION_TYPE:turnaround_jump_shot,SEASON_GAME_NUMBER,CAREER_GAME_NUMBER
count,1558.0,1558.0,1558.0,1558.0,1558.0,1558.0,1558.0,1558.0,1558.0,1558.0,...,1558.0,1558.0,1558.0,1558.0,1558.0,1558.0,1558.0,1558.0,1558.0,1558.0
mean,7.358793,0.500642,0.001284,0.001284,0.000642,0.000642,0.001284,0.001284,0.000642,0.000642,...,6.4e-05,0.006207,0.002047,3.2e-05,0.014149,5e-05,0.000433,0.031766,42.946727,780.486521
std,3.47118,0.50016,0.035817,0.035817,0.025335,0.025335,0.035817,0.035817,0.025335,0.025335,...,0.001791,0.022033,0.011133,0.001267,0.041313,0.00139,0.004902,0.049402,26.048206,449.923227
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
25%,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,21.0,391.25
50%,7.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,42.0,780.5
75%,10.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.052632,63.0,1169.75
max,22.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.05,0.25,0.133333,0.05,0.533333,0.041667,0.111111,0.352941,105.0,1559.0


In [6]:
kobe.head()

Unnamed: 0,SHOTS_MADE,AWAY_GAME,SEASON_OPPONENT:atl:1996-97,SEASON_OPPONENT:atl:1997-98,SEASON_OPPONENT:atl:1999-00,SEASON_OPPONENT:atl:2000-01,SEASON_OPPONENT:atl:2001-02,SEASON_OPPONENT:atl:2002-03,SEASON_OPPONENT:atl:2003-04,SEASON_OPPONENT:atl:2004-05,...,ACTION_TYPE:tip_layup_shot,ACTION_TYPE:tip_shot,ACTION_TYPE:turnaround_bank_shot,ACTION_TYPE:turnaround_fadeaway_bank_jump_shot,ACTION_TYPE:turnaround_fadeaway_shot,ACTION_TYPE:turnaround_finger_roll_shot,ACTION_TYPE:turnaround_hook_shot,ACTION_TYPE:turnaround_jump_shot,SEASON_GAME_NUMBER,CAREER_GAME_NUMBER
0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1
1,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2,2
2,2.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3,3
3,2.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,4
4,0.0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5,5
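
One quick way to put numbers on the guiding questions is to compare the column count to the row count and to measure how many cells are exactly zero. The sketch below uses a small synthetic frame (`toy`, an assumption standing in for `kobe`) so that it runs without the CSV; the same three lines of arithmetic can be applied to the real DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 50 "games" and 200 mostly-zero columns.
rng = np.random.RandomState(0)
toy = pd.DataFrame((rng.rand(50, 200) < 0.02).astype(float),
                   columns=['FEATURE:{}'.format(i) for i in range(200)])

n_rows, n_cols = toy.shape
ratio = float(n_cols) / n_rows        # predictors available per observation
sparsity = (toy == 0).values.mean()   # fraction of cells that are exactly zero

print('rows: {}, columns: {}'.format(n_rows, n_cols))
print('columns per row: {:.2f}'.format(ratio))
print('fraction of zero cells: {:.3f}'.format(sparsity))
```

A high column-to-row ratio combined with many near-empty indicator columns is a hint that an unregularized model may have enough freedom to memorize the training rows.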


---

### 3.  Prepare the dataset for training AND validation

- Make predictor matrix `X` and target variable `y`
- Split your data into training and validation sets using `train_test_split`

In [19]:
kobe['AWAY_GAME'].unique()

array([0, 1])

In [21]:
kobe.filter(like='SEASON_OPPONENT').loc[0,] > 0

SEASON_OPPONENT:atl:1996-97    False
SEASON_OPPONENT:atl:1997-98    False
SEASON_OPPONENT:atl:1999-00    False
SEASON_OPPONENT:atl:2000-01    False
SEASON_OPPONENT:atl:2001-02    False
SEASON_OPPONENT:atl:2002-03    False
SEASON_OPPONENT:atl:2003-04    False
SEASON_OPPONENT:atl:2004-05    False
SEASON_OPPONENT:atl:2005-06    False
SEASON_OPPONENT:atl:2006-07    False
SEASON_OPPONENT:atl:2007-08    False
SEASON_OPPONENT:atl:2008-09    False
SEASON_OPPONENT:atl:2009-10    False
SEASON_OPPONENT:atl:2010-11    False
SEASON_OPPONENT:atl:2011-12    False
SEASON_OPPONENT:atl:2012-13    False
SEASON_OPPONENT:atl:2013-14    False
SEASON_OPPONENT:atl:2014-15    False
SEASON_OPPONENT:atl:2015-16    False
SEASON_OPPONENT:bkn:2012-13    False
SEASON_OPPONENT:bkn:2015-16    False
SEASON_OPPONENT:bos:1996-97    False
SEASON_OPPONENT:bos:1997-98    False
SEASON_OPPONENT:bos:1999-00    False
SEASON_OPPONENT:bos:2001-02    False
SEASON_OPPONENT:bos:2002-03    False
SEASON_OPPONENT:bos:2003-04    False
S

In [11]:
Descriptors = ['SHOT_ZONE_BASIC:above_the_break_3','SHOT_ZONE_RANGE:24+_ft.','SEASON:1996-97',
               'SHOT_ZONE_AREA:left_side_center(lc)','SHOT_ZONE_AREA:left_side(l)',
               'COMBINED_SHOT_TYPE:layup','SHOT_ZONE_AREA:right_side_center(rc)',
               'SHOT_ZONE_BASIC:in_the_paint_(non-ra)','SHOT_ZONE_AREA:right_side(r)',
               'SHOT_ZONE_BASIC:restricted_area','SHOT_ZONE_RANGE:8-16_ft.',
               'SHOT_ZONE_RANGE:less_than_8_ft.','SHOT_ZONE_RANGE:16-24_ft.',
               'SHOT_ZONE_BASIC:mid-range','SHOT_ZONE_AREA:center(c)',
               'COMBINED_SHOT_TYPE:jump_shot','SECONDS_REMAINING','SHOT_TYPE:2pt_field_goal']

In [12]:
# A:
y = kobe.SHOTS_MADE
X = kobe[Descriptors]

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

In [17]:
# Fit once on the training split, then score the same model on both splits.
lm = LinearRegression()
model = lm.fit(X_train, y_train)

score = model.score(X_train, y_train)   # R^2 on the training data
score2 = model.score(X_test, y_test)    # R^2 on the held-out data

print('score:  {}'.format(score))
print('score:  {}'.format(score2))

score:  0.6521037182148983
score:  0.6324666848156186


In [18]:
y = kobe.SHOTS_MADE  # shots made is the target variable

# Rank columns by absolute correlation with the target (ascending), computed once
# instead of on every loop iteration. Position 644 - i matches the original slice.
ranked = abs(kobe.corr()["SHOTS_MADE"]).sort_values()

resultsarray = []
Descriptors_array = []

# Add the next most-correlated column on each pass (ranks 3 and 5 are skipped)
# and record the test-set R^2 of the growing model.
for i in range(1, 25):
    if i not in (3, 5):
        Descriptors_array.append(ranked.index[644 - i])
        X = kobe[Descriptors_array]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

        lm = LinearRegression()
        model = lm.fit(X_train, y_train)
        score = model.score(X_test, y_test)
        resultsarray.append(score)

max_value = resultsarray.index(max(resultsarray)) + 2  # add 2 for the 2 values we are removing

Descriptors_array = []

for i in range(1, max_value + 2):  # add 2 to offset from 1-based and -1 end value
    if i not in (3, 5):
        Descriptors_array.append(ranked.index[644 - i])

X = kobe[Descriptors_array]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

lm = LinearRegression()
model = lm.fit(X_train, y_train)
predictions = model.predict(X_test)
score = model.score(X_test, y_test)
print(score)

0.6331555640143891


In [47]:
X = kobe.drop('SHOTS_MADE', axis = 1)
y = kobe['SHOTS_MADE']



In [23]:
assert X.shape[0] == y.shape[0]

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [26]:

lr = LinearRegression()

lr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)


lr = LinearRegression()

lr.fit(X_train, y_train)


lr_r2_train = lr.score(X_train, y_train)
lr_r2_test = lr.score(X_test, y_test)

print("R2 on train: {}".format(lr_r2_train))
print("R2 on test: {}".format(lr_r2_test))

R2 on train: 0.841736709151
R2 on test: -1.84569566526e+17


---

### 4. Build a linear regression predicting `SHOTS_MADE` from the rest of the columns.

1. How does it perform? Keep the regression metrics we talked about on Tuesday in mind, like mean squared error, mean absolute error, and $R^2$
2. Is there a disparity between your train set and your test set? What does that indicate?
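
As a reminder of how those three metrics can be computed side by side for the train and test sets, here is a minimal sketch. It uses synthetic data (`X_toy`, `y_toy`) and a hypothetical helper named `report`, neither of which is part of the lab; swap in your own `X_train`, `X_test`, `y_train`, `y_test` to answer the questions above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): 3 informative features plus noise.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy.dot([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=1)
model = LinearRegression().fit(X_tr, y_tr)


def report(name, X_part, y_part):
    """Hypothetical helper: print and return MSE, MAE and R^2 for one split."""
    pred = model.predict(X_part)
    metrics = {
        'mse': mean_squared_error(y_part, pred),
        'mae': mean_absolute_error(y_part, pred),
        'r2': r2_score(y_part, pred),
    }
    print('{}: MSE={mse:.3f}  MAE={mae:.3f}  R2={r2:.3f}'.format(name, **metrics))
    return metrics


train_metrics = report('train', X_tr, y_tr)
test_metrics = report('test', X_te, y_te)
```

A large gap between the two rows (train much better than test) is the usual symptom of overfitting.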

In [5]:
# A:

# <font color=blue> Interlude</font> - Slides

Sit back and enjoy the show...

----
# <font color=blue> Part II</font> - Addressing the problem

---

### 6. Try fitting and evaluating a `Ridge` model instead of a standard `LinearRegression`
The ridge regression is a model _similar_ to the standard linear regression, but for now let it remain shrouded in an \*air\* of mystery.

Is it better than the linear regression? On the training set? On the test set? Why do you think that is?

In [69]:
# A:
X = kobe.drop('SHOTS_MADE', axis = 1)
y = kobe['SHOTS_MADE']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)


rd = Ridge(alpha = 100.0)

rd.fit(X_train, y_train)


rd_r2_train = rd.score(X_train, y_train)
rd_r2_test = rd.score(X_test, y_test)

print("R2 on train: {}".format(rd_r2_train))
print("R2 on test: {}".format(rd_r2_test))

R2 on train: 0.677628310334
R2 on test: 0.61886813136


In [74]:
ridge = Ridge()

ridge.fit(X_train, y_train)

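# Pair each column name with its fitted coefficient (list() keeps this working on Python 3).
pd.DataFrame(list(zip(X_train.columns, ridge.coef_)), columns = ['feature', 'coef'])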


Unnamed: 0,feature,coef
0,AWAY_GAME,-0.275356
1,SEASON_OPPONENT:atl:1996-97,-0.276279
2,SEASON_OPPONENT:atl:1997-98,-0.831625
3,SEASON_OPPONENT:atl:1999-00,0.179406
4,SEASON_OPPONENT:atl:2000-01,-0.058562
5,SEASON_OPPONENT:atl:2001-02,1.247201
6,SEASON_OPPONENT:atl:2002-03,-0.798047
7,SEASON_OPPONENT:atl:2003-04,0.499361
8,SEASON_OPPONENT:atl:2004-05,0.581403
9,SEASON_OPPONENT:atl:2005-06,-0.367261


---
### 7. Examine your ridge model's coefficients

Does anything jump out at you? Use any of the tools we've learned so far, like histograms, barplots, and other descriptive statistics, to compare the ridge model's fit to the linear regression we used earlier.


---

### 8. Play around with the `alpha` hyperparameter

How does this impact the coefficients of the fit model?

#### EX:
```python
ridge = Ridge(alpha = 10.0)
```

Some good values to try might be `0`, `0.1`, `1.0`, `10`, `100`
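
One possible way to organize the experiment is to loop over several `alpha` values and summarize the coefficients with a single number, such as their overall length (L2 norm). This is only a sketch on synthetic data (`X_toy`, `y_toy` are assumptions); it leaves out `alpha=0`, which reduces to ordinary least squares.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (an assumption): 60 rows, 40 columns, two informative columns.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(60, 40))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(size=60)

alphas = [0.1, 1.0, 10.0, 100.0]
coef_norms = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X_toy, y_toy)
    norm = np.sqrt((ridge.coef_ ** 2).sum())   # overall size of the coefficient vector
    coef_norms.append(norm)
    print('alpha={:>6}: coefficient norm = {:.3f}, largest |coef| = {:.3f}'.format(
        alpha, norm, np.abs(ridge.coef_).max()))
```

Plotting `coef_norms` against `alphas` (on a log scale) is a compact way to see the shrinkage.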

In [7]:
# A:

---

### 9. Fit a `Lasso` model and examine its coefficients

Is it better than the linear regression? Is it better than Ridge? What do the differences in results imply about the issues with the dataset?

- Show the coefficients for the variables, ordered from largest to smallest by absolute value.
- What percent of the variables in the original dataset are "zeroed-out" by the lasso?
- What are the most important predictors for how many shots Kobe made in a game?
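
The bullets above can be approached along these lines: put the coefficients in a pandas Series indexed by column name, sort by absolute value, and count the exact zeros. The data and column names below (`x0`, `x1`, ...) are synthetic assumptions; with the lab data you would use your fitted model's `coef_` and `X.columns`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic data (an assumption): 50 columns, only x0 and x1 matter.
rng = np.random.RandomState(0)
names = ['x{}'.format(i) for i in range(50)]
X_toy = pd.DataFrame(rng.normal(size=(200, 50)), columns=names)
y_toy = 3.0 * X_toy['x0'] - 2.0 * X_toy['x1'] + rng.normal(size=200)

lasso = Lasso(alpha=0.5).fit(X_toy, y_toy)

coefs = pd.Series(lasso.coef_, index=X_toy.columns)
# Order by absolute value, largest first, while keeping the original signs.
ordered = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
pct_zero = 100.0 * (coefs == 0).mean()

print(ordered.head(5))
print('percent of coefficients zeroed out: {:.1f}%'.format(pct_zero))
```

The variables that survive with non-zero coefficients are the lasso's candidates for the most important predictors.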

In [71]:
# A:
X = kobe.drop('SHOTS_MADE', axis = 1)
y = kobe['SHOTS_MADE']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)


ls = Lasso()

ls.fit(X_train, y_train)


ls_r2_train = ls.score(X_train, y_train)
ls_r2_test = ls.score(X_test, y_test)

print("R2 on train: {}".format(ls_r2_train))
print("R2 on test: {}".format(ls_r2_test))

R2 on train: 0.603379933925
R2 on test: 0.621943753723


---

### 10. Tune the alpha for your `Lasso` model

How does this influence the coefficients? The model performance on the train and the test sets?
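
A simple way to explore this is to refit the lasso over a grid of `alpha` values and record, for each one, how many coefficients remain non-zero and the $R^2$ on each split. The sketch below runs on synthetic data (`X_toy`, `y_toy` and the particular grid are assumptions); scikit-learn also offers `LassoCV` if you prefer to have the search done for you.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption): 50 columns, only the first two matter.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 50))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=1)

results = []
for alpha in [0.01, 0.1, 1.0, 10.0]:
    lasso = Lasso(alpha=alpha).fit(X_tr, y_tr)
    results.append({
        'alpha': alpha,
        'n_nonzero': int((lasso.coef_ != 0).sum()),   # surviving coefficients
        'r2_train': lasso.score(X_tr, y_tr),
        'r2_test': lasso.score(X_te, y_te),
    })

for row in results:
    print('alpha={alpha:>5}: non-zero={n_nonzero:>2}  '
          'R2 train={r2_train:.3f}  R2 test={r2_test:.3f}'.format(**row))
```

Watch for the point where the train score keeps dropping but the test score stops improving: that is roughly where additional shrinkage starts to cost more than it gains.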

In [9]:
# A:

---

### 11. Synthesize what you've discovered

Write a couple of sentences telling the story: 
- How did a standard linear regression perform on the Kobe dataset? What qualities of this dataset caused these results?
- How did a Ridge perform in comparison? What clues could you glean from its coefficients? How does `alpha` seem to dictate the coefficients?
- What about the `Lasso`?
- When will each model be useful?

In [10]:
# A: