<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# DS-SF-42 | Class 8 | Regression - Addressing model fit

<br>
### _Predicting shots made per game by Kobe Bryant_

_Authors: Kiefer Katovich (SF) and Gus Ostow (SF)_

---

The Kobe shots dataset has hundreds of columns representing different characteristics of each basketball game. Fitting an ordinary linear regression to it causes problems you may not have seen with other datasets. In this exploration you will diagnose issues with model fit using regression metrics, a train/test split, and cross-validation.


### Plan

Today I am going to flip the script: we are going to start the class with a hands-on partner activity to motivate the day's topic, then address the theory after.

1. Motivating the problem
2. Slides interlude
3. Addressing the problem

### Teams

<img src=https://i.imgur.com/JI6ydY5.png align=left>
<br><br><br><br><br><br><br><br><br><br><br><br><br><br>


# <font color=blue>Part I</font> - Motivating the problem


---

### 1. Load packages and data

In [44]:
import numpy as np
import pandas as pd

from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [2]:
kobe = pd.read_csv('./datasets/kobe_superwide_games.csv')

---

### 2. Examine the data

#### Guiding questions

- How many columns are there? 
- Examine what the observations (rows) and columns represent.
- Why does this dataset _feel_ different than the datasets we've touched so far?
- What concerns do you have even before fitting your first model?

In [3]:
# A:
kobe.head()

Unnamed: 0,SHOTS_MADE,AWAY_GAME,SEASON_OPPONENT:atl:1996-97,SEASON_OPPONENT:atl:1997-98,SEASON_OPPONENT:atl:1999-00,SEASON_OPPONENT:atl:2000-01,SEASON_OPPONENT:atl:2001-02,SEASON_OPPONENT:atl:2002-03,SEASON_OPPONENT:atl:2003-04,SEASON_OPPONENT:atl:2004-05,...,ACTION_TYPE:tip_layup_shot,ACTION_TYPE:tip_shot,ACTION_TYPE:turnaround_bank_shot,ACTION_TYPE:turnaround_fadeaway_bank_jump_shot,ACTION_TYPE:turnaround_fadeaway_shot,ACTION_TYPE:turnaround_finger_roll_shot,ACTION_TYPE:turnaround_hook_shot,ACTION_TYPE:turnaround_jump_shot,SEASON_GAME_NUMBER,CAREER_GAME_NUMBER
0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1
1,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2,2
2,2.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3,3
3,2.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,4
4,0.0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5,5


In [6]:
# no. of columns
print("No. of columns in the dataset :",kobe.shape[1])

No. of columns in the dataset : 645


In [66]:
for col in kobe.columns:
    print(col)

SHOTS_MADE
AWAY_GAME
SEASON_OPPONENT:atl:1996-97
SEASON_OPPONENT:atl:1997-98
SEASON_OPPONENT:atl:1999-00
SEASON_OPPONENT:atl:2000-01
SEASON_OPPONENT:atl:2001-02
SEASON_OPPONENT:atl:2002-03
SEASON_OPPONENT:atl:2003-04
SEASON_OPPONENT:atl:2004-05
SEASON_OPPONENT:atl:2005-06
SEASON_OPPONENT:atl:2006-07
SEASON_OPPONENT:atl:2007-08
SEASON_OPPONENT:atl:2008-09
SEASON_OPPONENT:atl:2009-10
SEASON_OPPONENT:atl:2010-11
SEASON_OPPONENT:atl:2011-12
SEASON_OPPONENT:atl:2012-13
SEASON_OPPONENT:atl:2013-14
SEASON_OPPONENT:atl:2014-15
SEASON_OPPONENT:atl:2015-16
SEASON_OPPONENT:bkn:2012-13
SEASON_OPPONENT:bkn:2015-16
SEASON_OPPONENT:bos:1996-97
SEASON_OPPONENT:bos:1997-98
SEASON_OPPONENT:bos:1999-00
SEASON_OPPONENT:bos:2001-02
SEASON_OPPONENT:bos:2002-03
SEASON_OPPONENT:bos:2003-04
SEASON_OPPONENT:bos:2004-05
SEASON_OPPONENT:bos:2005-06
SEASON_OPPONENT:bos:2006-07
SEASON_OPPONENT:bos:2007-08
SEASON_OPPONENT:bos:2008-09
SEASON_OPPONENT:bos:2009-10
SEASON_OPPONENT:bos:2010-11
SEASON_OPPONENT:bos:2011-12

In [64]:
column_mask = kobe.filter(like="SEASON_OPPONENT").loc[0,:] > 0

In [65]:
kobe.filter(like="SEASON_OPPONENT").loc[0,column_mask]

SEASON_OPPONENT:min:1996-97    1.0
Name: 0, dtype: float64

In [7]:
kobe.describe()

Unnamed: 0,SHOTS_MADE,AWAY_GAME,SEASON_OPPONENT:atl:1996-97,SEASON_OPPONENT:atl:1997-98,SEASON_OPPONENT:atl:1999-00,SEASON_OPPONENT:atl:2000-01,SEASON_OPPONENT:atl:2001-02,SEASON_OPPONENT:atl:2002-03,SEASON_OPPONENT:atl:2003-04,SEASON_OPPONENT:atl:2004-05,...,ACTION_TYPE:tip_layup_shot,ACTION_TYPE:tip_shot,ACTION_TYPE:turnaround_bank_shot,ACTION_TYPE:turnaround_fadeaway_bank_jump_shot,ACTION_TYPE:turnaround_fadeaway_shot,ACTION_TYPE:turnaround_finger_roll_shot,ACTION_TYPE:turnaround_hook_shot,ACTION_TYPE:turnaround_jump_shot,SEASON_GAME_NUMBER,CAREER_GAME_NUMBER
count,1558.0,1558.0,1558.0,1558.0,1558.0,1558.0,1558.0,1558.0,1558.0,1558.0,...,1558.0,1558.0,1558.0,1558.0,1558.0,1558.0,1558.0,1558.0,1558.0,1558.0
mean,7.358793,0.500642,0.001284,0.001284,0.000642,0.000642,0.001284,0.001284,0.000642,0.000642,...,6.4e-05,0.006207,0.002047,3.2e-05,0.014149,5e-05,0.000433,0.031766,42.946727,780.486521
std,3.47118,0.50016,0.035817,0.035817,0.025335,0.025335,0.035817,0.035817,0.025335,0.025335,...,0.001791,0.022033,0.011133,0.001267,0.041313,0.00139,0.004902,0.049402,26.048206,449.923227
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
25%,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,21.0,391.25
50%,7.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,42.0,780.5
75%,10.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.052632,63.0,1169.75
max,22.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.05,0.25,0.133333,0.05,0.533333,0.041667,0.111111,0.352941,105.0,1559.0


- It has too many columns: 645 columns for only 1,558 rows
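
One way to see where all those columns come from is to count them per name prefix. The sketch below assumes the `PREFIX:detail` naming pattern visible in the column listing above; the small `demo` frame is a made-up stand-in, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for `kobe`: a few columns that mimic the naming pattern
demo = pd.DataFrame(
    0.0,
    index=range(3),
    columns=[
        "SHOTS_MADE",
        "AWAY_GAME",
        "SEASON_OPPONENT:atl:1996-97",
        "SEASON_OPPONENT:bos:1996-97",
        "ACTION_TYPE:tip_shot",
    ],
)

# Everything before the first ':' is treated as the column family
families = pd.Series(demo.columns).str.split(":").str[0]
family_counts = families.value_counts()
print(family_counts)

# Rows per column: a low ratio is a warning sign for ordinary least squares
print("rows per column:", demo.shape[0] / demo.shape[1])
```

Swapping `demo` for `kobe` should give the same summary for the real dataset.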



In [11]:
kobe['SEASON_OPPONENT:atl:1996-97'].unique()

array([ 0.,  1.])

In [12]:
kobe['ACTION_TYPE:turnaround_finger_roll_shot'].unique()

array([ 0.        ,  0.03571429,  0.04166667])

In [13]:
kobe['ACTION_TYPE:pullup_jump_shot'].unique()

array([ 0.        ,  0.04347826,  0.04545455,  0.09090909,  0.03571429,
        0.11111111,  0.08695652,  0.03333333,  0.08333333,  0.0625    ,
        0.06666667,  0.05555556,  0.04761905,  0.11764706,  0.03846154,
        0.1       ,  0.05      ,  0.0952381 ,  0.05263158,  0.03125   ,
        0.14285714,  0.13043478,  0.25      ,  0.22727273,  0.03448276,
        0.13333333,  0.04      ,  0.07142857,  0.17647059,  0.21052632,
        0.14814815,  0.06896552,  0.07692308,  0.08108108,  0.18181818,
        0.04166667,  0.16666667,  0.2       ,  0.17857143,  0.07407407,
        0.03703704,  0.09677419,  0.27777778,  0.08      ,  0.12903226,
        0.15      ,  0.12      ,  0.05882353,  0.03225806,  0.10344828,
        0.15384615,  0.10714286,  0.06060606,  0.13793103,  0.2195122 ,
        0.13636364,  0.125     ,  0.09375   ,  0.33333333,  0.32142857,
        0.08823529,  0.21428571,  0.35      ,  0.31578947,  0.30769231,
        0.3125    ,  0.22222222,  0.23076923,  0.24      ,  0.5 

In [18]:
kobe[kobe['SEASON_GAME_NUMBER'] == 1].iloc[:,13:]

Unnamed: 0,SEASON_OPPONENT:atl:2008-09,SEASON_OPPONENT:atl:2009-10,SEASON_OPPONENT:atl:2010-11,SEASON_OPPONENT:atl:2011-12,SEASON_OPPONENT:atl:2012-13,SEASON_OPPONENT:atl:2013-14,SEASON_OPPONENT:atl:2014-15,SEASON_OPPONENT:atl:2015-16,SEASON_OPPONENT:bkn:2012-13,SEASON_OPPONENT:bkn:2015-16,...,ACTION_TYPE:tip_layup_shot,ACTION_TYPE:tip_shot,ACTION_TYPE:turnaround_bank_shot,ACTION_TYPE:turnaround_fadeaway_bank_jump_shot,ACTION_TYPE:turnaround_fadeaway_shot,ACTION_TYPE:turnaround_finger_roll_shot,ACTION_TYPE:turnaround_hook_shot,ACTION_TYPE:turnaround_jump_shot,SEASON_GAME_NUMBER,CAREER_GAME_NUMBER
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1
74,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,76
163,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,165
221,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,1,223
309,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,311
393,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,395
492,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,494
586,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,588
672,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,1,674
738,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,1,740


---

### 3.  Prepare the dataset for training AND validation

- Make predictor matrix `X` and target variable `y`
- Split your data into a validation set using `train_test_split`
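
As a rough illustration of the split step, here is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are invented for the example). It also passes `random_state`, which the cells below do not use; fixing it makes the split repeatable, so scores from different models can be compared on the same rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 5))   # synthetic predictors, not the Kobe data
y_demo = rng.normal(size=100)        # synthetic target

# random_state fixes the shuffle so the same rows land in the test set every run
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

print(X_tr.shape, X_te.shape)
```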

In [50]:
# A:
y = kobe['SHOTS_MADE']
# drop() returns a new DataFrame and leaves `kobe` unchanged
kobe_new_dataset = kobe.drop('SHOTS_MADE', axis=1)
X = kobe_new_dataset
# second predictor set, without the two game-number columns
kobe_new_dataset1 = kobe_new_dataset.drop(['SEASON_GAME_NUMBER', 'CAREER_GAME_NUMBER'], axis=1)

In [68]:
X.shape[0] == y.shape[0]

True

In [69]:
# can also use assert
assert X.shape[0] == y.shape[0]

In [57]:
X = kobe_new_dataset.values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# X and y are positional arguments; test_size is a keyword argument.
# Keyword arguments can be given in any order, as long as they follow the positional ones.

lm = LinearRegression()

lm.fit(X_train,y_train)

y_pred = lm.predict(X_test)

print("b1 coefficient : {}".format(lm.coef_[0]))
print("b0 intercept : {}".format(lm.intercept_))

b1 coefficient : -0.38912490724743987
b0 intercept : 1123644016.721597


- Too many predictor columns
- Too few rows compared to columns
- Potential multicollinearity
    - It may not be present here, but it is something to look out for
    - When one column carries exactly the same information as another, that is perfect multicollinearity, and it will break linear regression
    - e.g. the same duration recorded once in minutes and once in hours
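
The minutes/hours case can be checked numerically. This is a minimal sketch on invented data: when one column is an exact rescaling of another, the predictor matrix has fewer independent directions (its rank) than it has columns, so least squares has no unique solution.

```python
import numpy as np

rng = np.random.RandomState(0)
minutes = rng.uniform(10, 48, size=50)   # synthetic "minutes" column
hours = minutes / 60.0                   # same information, different unit
other = rng.normal(size=50)              # an unrelated predictor

X_dup = np.column_stack([minutes, hours, other])

# Three columns, but only two independent directions
rank = np.linalg.matrix_rank(X_dup)
print("columns:", X_dup.shape[1], "rank:", rank)
```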

---

### 4. Build a linear regression predicting `SHOTS_MADE` from the rest of the columns.

1. How does it perform? Keep the regression metrics we talked about on Tuesday in mind, like mean squared error, mean absolute error, and $R^2$
2. Is there a disparity between your train set and your test set? What does that indicate?

In [58]:
# A:
print(mean_squared_error(y_test,y_pred))
print(lm.score(X_test, y_test))
print(lm.score(X_train,y_train))

2.114261213e+17
-1.67641929101e+16
0.838104932189
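
A hugely negative test $R^2$ can look like a bug. As a reminder, this is the standard definition that `score` reports for a regressor, with $\hat{y}_i$ the predictions and $\bar{y}$ the mean of the true values being scored:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

When the predictions are worse than always guessing $\bar{y}$, the ratio exceeds 1 and $R^2$ goes negative, with no lower bound. A mean squared error on the order of $10^{17}$ against a target whose variance is roughly 12 gives exactly this kind of value.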


In [59]:
X = kobe_new_dataset1.values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

lm = LinearRegression()

lm.fit(X_train,y_train)

y_pred = lm.predict(X_test)

print("b1 coefficient : {}".format(lm.coef_[0]))
print("b0 intercept : {}".format(lm.intercept_))
print("Mean squared error",mean_squared_error(y_test,y_pred))
print("R squared for test data",lm.score(X_test, y_test))
print("R squared for training data",lm.score(X_train,y_train))

b1 coefficient : -0.2615828188728984
b0 intercept : -923296198.0988156
Mean squared error 3.48476646242e+15
R squared for test data -2.83963141283e+14
R squared for training data 0.843953767781


# <font color=blue> Interlude</font> - Slides

Sit back and enjoy the show...

----
# <font color=blue> Part II</font> - Addressing the problem

---

### 6. Try fitting and evaluating a `Ridge` model instead of a standard `LinearRegression`
The ridge regression is a model _similar_ to the standard linear regression, but for now let it remain shrouded in an \*air\* of mystery.

Is it better than the Linear regression? On the training set? On the test set? Why do you think that is?

In [74]:
# A:

rg = Ridge()
y = kobe['SHOTS_MADE']
X = kobe.drop('SHOTS_MADE',axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

rg.fit(X_train,y_train)


print("score on training data",rg.score(X_train,y_train))
print("score on test data",rg.score(X_test,y_test))
print("Coefficient",rg.coef_[0])
print("Intercept",rg.intercept_)

score on training data 0.828506419582
score on test data 0.558404415231
Coefficient -0.206340948183
Intercept 1.50001765924


In [98]:
# zip() returns an iterator in Python 3, which DataFrame rejects; wrap it in list()
ridge_coeff = pd.DataFrame(list(zip(X_train.columns, rg.coef_)), columns=["feature", "coeff"])

In [99]:
lm.coef_.std()

281345479.48214769

In [100]:
rg.coef_.std()

0.23238671109860784

In [101]:
lm.coef_.mean()

78900813.779886603

In [102]:
rg.coef_.mean()

0.0018165474699748157

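Too many columns: the ridge penalty shrinks the coefficients toward 0 (standard deviation of about 0.23, versus about 2.8e8 for the plain linear regression), so the model all but ignores most columns. Ridge pulls coefficients close to 0 rather than setting them exactly to 0.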

---
### 7. Examine your ridge model's coefficients

Does anything jump out at you? Use any of the tools we've learned so far, like histograms, barplots, and other descriptive statistics, to compare the ridge model's fit to the linear regression we used earlier.
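
One possible way to set up that comparison is to put both sets of coefficients in a single table. The sketch below runs on synthetic data with an almost duplicated column, which is an assumption chosen to provoke unstable least-squares coefficients; the feature names are invented.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X_syn = rng.normal(size=(50, 30))
# Make the last column a near copy of the first to provoke unstable OLS coefficients
X_syn[:, -1] = X_syn[:, 0] + 1e-6 * rng.normal(size=50)
y_syn = X_syn[:, 0] + rng.normal(size=50)

features = ["x{}".format(i) for i in range(X_syn.shape[1])]  # made-up names

ols = LinearRegression().fit(X_syn, y_syn)
ridge_demo = Ridge(alpha=1.0).fit(X_syn, y_syn)

coefs = pd.DataFrame({"feature": features, "ols": ols.coef_, "ridge": ridge_demo.coef_})
coefs["abs_ridge"] = coefs["ridge"].abs()

# Largest ridge coefficients first, then summary statistics of the magnitudes
print(coefs.sort_values("abs_ridge", ascending=False).head())
print(coefs[["ols", "ridge"]].abs().describe())
```

From here, `coefs["ridge"].hist()` or a bar plot of the top rows would give the visual comparison.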


---

### 8. Play around with the `alpha` hyperparameter

How does this impact the coefficients of the fit model?

#### EX:
```python
ridge = Ridge(alpha = 10.0)
```

Some good values to try might be `0`, `0.1`, `1.0`, `10`, `100`
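
The suggested values can be swept in a loop rather than one cell at a time. This is a sketch on synthetic data, with the split fixed by `random_state` so that only `alpha` changes between fits; `alpha=0` is left out because it amounts to plain least squares.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1)
X_a = rng.normal(size=(80, 40))                     # synthetic, wide-ish data
y_a = X_a[:, 0] - X_a[:, 1] + rng.normal(size=80)

X_tr, X_te, y_tr, y_te = train_test_split(X_a, y_a, test_size=0.3, random_state=0)

coef_norms = []
for alpha in [0.1, 1.0, 10, 100]:
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    norm = np.linalg.norm(model.coef_)              # overall size of the coefficients
    coef_norms.append(norm)
    print(alpha,
          round(model.score(X_tr, y_tr), 3),        # train R^2
          round(model.score(X_te, y_te), 3),        # test R^2
          round(norm, 3))
```

The coefficient norm shrinks as `alpha` grows, which is the effect this section asks you to observe.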

In [79]:
# A:
rg = Ridge(alpha = 10)
y = kobe['SHOTS_MADE']
X = kobe.drop('SHOTS_MADE',axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

rg.fit(X_train,y_train)

print("score on training data",rg.score(X_train,y_train))
print("score on test data",rg.score(X_test,y_test))
print("Coefficient",rg.coef_[0])
print("Intercept",rg.intercept_)

score on training data 0.723512270006
score on test data 0.650277492847
Coefficient -0.207427804296
Intercept 1.99211196062


---

### 9. Fit a `Lasso` model and examine its coefficients

Is it better than the Linear regression? Is it better than Ridge? What do the differences in results imply about the issues with the dataset?

- Show the coefficient for variables, ordered from largest to smallest coefficient by absolute value.
- What percent of the variables in the original dataset are "zeroed-out" by the lasso?
- What are the most important predictors for how many shots Kobe made in a game?
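
The first two bullets can be sketched as follows. The data here is synthetic, with only two informative columns by construction, and the names `f0`, `f1`, ... are invented; on the real data the same steps would use the fitted model and `X.columns`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X_l = pd.DataFrame(rng.normal(size=(100, 20)),
                   columns=["f{}".format(i) for i in range(20)])  # made-up names
y_l = 3.0 * X_l["f0"] - 2.0 * X_l["f1"] + rng.normal(size=100)

lasso_demo = Lasso(alpha=0.5).fit(X_l, y_l)

# Coefficients ordered from largest to smallest absolute value
lasso_coefs = pd.DataFrame({"feature": X_l.columns, "coef": lasso_demo.coef_})
lasso_coefs["abs_coef"] = lasso_coefs["coef"].abs()
lasso_coefs = lasso_coefs.sort_values("abs_coef", ascending=False)

# Share of coefficients the lasso set exactly to zero
pct_zero = 100.0 * np.mean(lasso_demo.coef_ == 0)
print(lasso_coefs.head())
print("percent zeroed out:", pct_zero)
```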

In [87]:
# A:
ls = Lasso()
y = kobe['SHOTS_MADE']
X = kobe.drop('SHOTS_MADE',axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

ls.fit(X_train,y_train)

print("score on training data",ls.score(X_train,y_train))
print("score on test data",ls.score(X_test,y_test))
print("Coefficient",ls.coef_)
print("Intercept",ls.intercept_)

score on training data 0.602950130069
score on test data 0.611188644433
Coefficient [-0.         -0.          0.          0.          0.          0.         -0.
  0.          0.          0.          0.          0.         -0.          0.
  0.         -0.         -0.         -0.          0.          0.         -0.
  0.         -0.         -0.          0.         -0.         -0.         -0.
 -0.         -0.          0.         -0.          0.         -0.         -0.
  0.          0.         -0.         -0.          0.          0.          0.
 -0.         -0.         -0.          0.          0.          0.          0.
 -0.         -0.         -0.          0.         -0.          0.          0.
  0.         -0.         -0.          0.         -0.          0.          0.
 -0.          0.          0.         -0.         -0.         -0.          0.
 -0.          0.         -0.         -0.         -0.         -0.         -0.
 -0.         -0.          0.         -0.         -0.         -0.     

In [97]:


#print(dict(zip(X.columns,ls.coef_)))

---

### 10. Tune the alpha for your `Lasso` model

How does this influence the coefficients? The model performance on the train and the test sets?
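
One hedged way to explore this is to track, for each `alpha`, both a cross-validated score and the number of coefficients that survive. The sketch below uses synthetic data and an arbitrary grid of alphas; neither comes from the lesson.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X_t = rng.normal(size=(120, 30))                    # synthetic predictors
y_t = 3.0 * X_t[:, 0] - 2.0 * X_t[:, 1] + rng.normal(size=120)

results = {}
for alpha in [0.01, 0.1, 1.0, 10]:
    lasso = Lasso(alpha=alpha)
    cv_scores = cross_val_score(lasso, X_t, y_t, cv=5)        # R^2 per fold
    n_nonzero = int(np.sum(lasso.fit(X_t, y_t).coef_ != 0))   # surviving coefficients
    results[alpha] = (cv_scores.mean(), n_nonzero)
    print(alpha, round(cv_scores.mean(), 3), n_nonzero)
```

A very large `alpha` zeroes every coefficient, so the model falls back to predicting the mean; a very small one keeps nearly all of them.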

In [89]:
# A:
ls = Lasso(alpha = 10)
y = kobe['SHOTS_MADE']
X = kobe.drop('SHOTS_MADE',axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

ls.fit(X_train,y_train)

print("score on training data",ls.score(X_train,y_train))
print("score on test data",ls.score(X_test,y_test))
print("Coefficient",ls.coef_)
print("Intercept",ls.intercept_)

score on training data 0.547560926459
score on test data 0.579802080908
Coefficient [ -0.00000000e+00  -0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00  -0.00000000e+00
   0.00000000e+00   0.00000000e+00  -0.00000000e+00  -0.00000000e+00
   0.00000000e+00  -0.00000000e+00   0.00000000e+00  -0.00000000e+00
  -0.00000000e+00   0.00000000e+00  -0.00000000e+00  -0.00000000e+00
   0.00000000e+00  -0.00000000e+00  -0.00000000e+00  -0.00000000e+00
   0.00000000e+00  -0.00000000e+00   0.00000000e+00  -0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00  -0.00000000e+00
   0.00000000e+00  -0.00000000e+00  -0.00000000e+00  -0.00000000e+00
  -0.00000000e+00   0.00000000e+00  -0.00000000e+00   0.00000000e+00
  -0.00000000e+00   0.00000000e+00   0.00000000e+00  -0.00000000e+00
   0.00000000e+00   0.00000000e+00  -0.00000000e+00  -0.00000000e+00
   0.00000000e+00  

---

### 11. Synthesize what you've discovered

Write a couple of sentences telling the story: 
- How did a standard linear regression perform on the Kobe dataset? What qualities of this dataset caused these results?
- How did a Ridge perform in comparison? What clues could you glean from its coefficients? How does `alpha` seem to dictate the coefficients?
- What about the `Lasso`?
- When will these models be useful?

In [10]:
# A: