<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Predicting shots made per game by Kobe Bryant

_Authors: Kiefer Katovich (SF)_

---

In this lab you'll use the regularized regression penalties Ridge, Lasso, and Elastic Net to predict how many shots Kobe Bryant made per game over his career.

The Kobe shots dataset has hundreds of columns representing different characteristics of each basketball game. Fitting an ordinary linear regression using every predictor would dramatically overfit the model considering the limited number of observations (games) we have available. Furthermore, many of the predictors have significant multicollinearity. 

**Warning:** Some of these calculations are computationally expensive and may take a while to execute. It may be worthwhile to use only a portion of the data to perform these calculations, especially if you have experienced kernel issues in the past.
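
One way to follow that advice is to develop on a random subset of rows and rerun on the full data once the code works. The sketch below uses a small stand-in frame: `demo`, its column names, and the 50% fraction are assumptions for illustration. In the lab the same `sample` call would be made on the loaded games table.

```python
import numpy as np
import pandas as pd

# Stand-in data (hypothetical); in the lab this would be the loaded games table.
rng = np.random.RandomState(0)
demo = pd.DataFrame(rng.rand(200, 5), columns=list("abcde"))

# Keep a reproducible random half of the rows to speed up model fitting.
demo_small = demo.sample(frac=0.5, random_state=42)

print(demo.shape, demo_small.shape)
```

Fixing `random_state` keeps the subset the same between runs, which makes timing and score comparisons easier to interpret.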

---

### 1. Load packages and data

In [1]:
import numpy as np
import pandas as pd
import patsy

from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import cross_val_score

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [2]:
kobe = pd.read_csv('./datasets/kobe_superwide_games.csv')

---

### 2. Examine the data

- How many columns are there?
- Examine what the observations (rows) and columns represent.
- Why might regularization be particularly useful for this data?

In [3]:
# A:How many columns are there?
print("There are {} columns in Kobe data.".format(len(kobe.columns)))


There are 645 columns in Kobe data.


In [4]:
# A : Examine what the observations (rows) and columns represent.
kobe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1558 entries, 0 to 1557
Columns: 645 entries, SHOTS_MADE to CAREER_GAME_NUMBER
dtypes: float64(640), int64(5)
memory usage: 7.7 MB


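Each of the 1,558 rows is one game from Kobe's career. Among the columns shown are the target `SHOTS_MADE`, an `AWAY_GAME` flag, season-by-opponent indicators (`SEASON_OPPONENT:...`), shot-type features (`ACTION_TYPE:...`), and the season and career game numbers.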

In [5]:
kobe.head()

Unnamed: 0,SHOTS_MADE,AWAY_GAME,SEASON_OPPONENT:atl:1996-97,SEASON_OPPONENT:atl:1997-98,SEASON_OPPONENT:atl:1999-00,SEASON_OPPONENT:atl:2000-01,SEASON_OPPONENT:atl:2001-02,SEASON_OPPONENT:atl:2002-03,SEASON_OPPONENT:atl:2003-04,SEASON_OPPONENT:atl:2004-05,...,ACTION_TYPE:tip_layup_shot,ACTION_TYPE:tip_shot,ACTION_TYPE:turnaround_bank_shot,ACTION_TYPE:turnaround_fadeaway_bank_jump_shot,ACTION_TYPE:turnaround_fadeaway_shot,ACTION_TYPE:turnaround_finger_roll_shot,ACTION_TYPE:turnaround_hook_shot,ACTION_TYPE:turnaround_jump_shot,SEASON_GAME_NUMBER,CAREER_GAME_NUMBER
0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1
1,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2,2
2,2.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3,3
3,2.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,4
4,0.0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5,5


A: Why might regularization be particularly useful for this data?
Regularization is useful here because there are 644 predictors for only 1,558 games, so an unregularized model can easily overfit, and many of the predictors are collinear. The first group of predictors marks the opponent faced in each season; the second group describes the types of shots taken.
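
Both concerns can be checked numerically. The sketch below does so on a tiny synthetic frame: `demo_X`, its columns, and the 0.9 threshold are assumptions for illustration, not properties of the Kobe data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with one deliberately collinear pair.
rng = np.random.RandomState(1)
base = rng.rand(50, 3)
demo_X = pd.DataFrame({
    "x0": base[:, 0],
    "x1": base[:, 1],
    "x2": base[:, 2],
    # x3 is almost a copy of x0, so the pair is strongly collinear.
    "x3": base[:, 0] + 0.01 * rng.rand(50),
})

n_obs, n_pred = demo_X.shape
print("observations per predictor:", n_obs / n_pred)

# Absolute pairwise correlations; keep the upper triangle so each pair counts once.
corr = demo_X.corr().abs().values
upper = np.triu(corr, k=1)
high_pairs = int((upper > 0.9).sum())
print("pairs with |r| > 0.9:", high_pairs)
```

A low observations-per-predictor ratio and many highly correlated pairs are both signs that a penalized model is likely to generalize better than ordinary least squares.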

---

### 3. Make predictor and target variables. Standardize the predictors.

Why is normalization necessary for regularized regressions?

Use the `sklearn.preprocessing` class `StandardScaler` to standardize the predictors.

In [6]:
# A:
kobe.columns = [x.lower() for x in kobe.columns]

In [7]:
kobe.tail()

Unnamed: 0,shots_made,away_game,season_opponent:atl:1996-97,season_opponent:atl:1997-98,season_opponent:atl:1999-00,season_opponent:atl:2000-01,season_opponent:atl:2001-02,season_opponent:atl:2002-03,season_opponent:atl:2003-04,season_opponent:atl:2004-05,...,action_type:tip_layup_shot,action_type:tip_shot,action_type:turnaround_bank_shot,action_type:turnaround_fadeaway_bank_jump_shot,action_type:turnaround_fadeaway_shot,action_type:turnaround_finger_roll_shot,action_type:turnaround_hook_shot,action_type:turnaround_jump_shot,season_game_number,career_game_number
1553,4.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.105263,0.0,0.0,0.052632,62,1555
1554,4.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,63,1556
1555,9.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.045455,0.0,0.045455,0.045455,64,1557
1556,3.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,65,1558
1557,19.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.04,66,1559


In [8]:
from sklearn.preprocessing import StandardScaler

In [9]:
# target is shots_made and all other columns are predictors.
X = kobe.drop('shots_made', axis=1)
y = kobe[['shots_made']]

X.head()


Unnamed: 0,away_game,season_opponent:atl:1996-97,season_opponent:atl:1997-98,season_opponent:atl:1999-00,season_opponent:atl:2000-01,season_opponent:atl:2001-02,season_opponent:atl:2002-03,season_opponent:atl:2003-04,season_opponent:atl:2004-05,season_opponent:atl:2005-06,...,action_type:tip_layup_shot,action_type:tip_shot,action_type:turnaround_bank_shot,action_type:turnaround_fadeaway_bank_jump_shot,action_type:turnaround_fadeaway_shot,action_type:turnaround_finger_roll_shot,action_type:turnaround_hook_shot,action_type:turnaround_jump_shot,season_game_number,career_game_number
0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1
1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2,2
2,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3,3
3,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,4
4,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5,5


In [10]:
y.head()

Unnamed: 0,shots_made
0,0.0
1,0.0
2,2.0
3,2.0
4,0.0


In [11]:
from sklearn.model_selection import train_test_split

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=20)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(1043, 644) (515, 644) (1043, 1) (515, 1)


In [13]:
ss_scaler = StandardScaler()
ss_scaler.fit(X_train)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [14]:
ss_scaler_kobe = ss_scaler.transform(X_train)

In [15]:
ss_scaler_kobe[0:5]

array([[-0.98006444, -0.04383183, -0.04383183, ...,  0.1271675 ,
         0.23962241,  1.04767319],
       [-0.98006444, -0.04383183, -0.04383183, ...,  0.48663439,
        -1.44389743,  0.50269298],
       [ 1.02034107, -0.04383183, -0.04383183, ...,  0.09778799,
        -0.44909025, -0.59171626],
       [ 1.02034107, -0.04383183, -0.04383183, ...,  0.15899529,
         1.50226229,  0.90753542],
       [ 1.02034107, -0.04383183, -0.04383183, ...,  0.55684276,
        -1.44389743,  1.57485812]])
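
The scaled array above is a separate object from `X_train`, so a model only benefits from it if the scaled array is what gets passed to `fit`. One option, sketched below on synthetic data, is to wrap the scaler and the estimator in a pipeline so that the scaler is refit on the training portion of every cross-validation fold. The synthetic data and the `alpha=10.0` value are assumptions for illustration, not results from this lab.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 200 rows, 20 predictors on very different scales.
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 20) * np.logspace(0, 3, 20)
y_demo = X_demo[:, 0] * 3.0 + 0.1 * rng.randn(200)

# The pipeline standardizes inside each CV fold, then fits the ridge.
model = make_pipeline(StandardScaler(), Ridge(alpha=10.0))
scores = cross_val_score(model, X_demo, y_demo, cv=10)
print(scores.mean())
```

Scaling inside the pipeline also avoids leaking information from the held-out fold into the scaler's mean and standard deviation.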

---

### 4. Build a linear regression predicting `SHOTS_MADE` from the rest of the columns.

Cross-validate the $R^2$ of an ordinary linear regression model with 10 cross-validation folds.

How does it perform?

In [16]:
# A:
from sklearn.model_selection import cross_val_score, cross_val_predict
lr = LinearRegression()
# cross_val_score uses the estimator's default scorer, which is R^2 for regressors.
mean_cv = cross_val_score(lr, X_train, y_train, cv=10).mean()
print(mean_cv)

-6.07613306167e+15
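
A score this far below zero is possible because $R^2$ compares the model's squared error with the squared error of always predicting the mean, so it has no lower bound. The helper below makes that concrete with made-up numbers; `r_squared` and the arrays are names introduced here for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
close_preds = np.array([1.1, 1.9, 3.2, 3.8])      # small errors
wild_preds = np.array([40.0, -30.0, 25.0, -50.0])  # errors far larger than the spread of y

print(r_squared(y_true, close_preds))
print(r_squared(y_true, wild_preds))
```

Predictions that swing far outside the range of the target, as an overfit model's can on held-out folds, drive the ratio of the two sums of squares well above one and the score far below zero.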


---

### 5. Find an optimal value for Ridge regression alpha using `RidgeCV`.

[Go to the documentation and read how RidgeCV works.](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html)

> *Hint: once the RidgeCV is fit, the attribute `.alpha_` contains the best alpha parameter it found through cross-validation.*

Recall that Ridge performs best searching alphas through logarithmic space (`np.logspace`). This may take a while to fit!


In [17]:
# A:
a = np.logspace(-5, 5, 200)


In [18]:
ridgecv = RidgeCV(alphas=a)
ridgecv.fit(X_train, y_train)
r_alpha = ridgecv.alpha_

print("Optimal value for ridge regression is {}.".format(r_alpha))

Optimal value for ridge regression is 9.547716114208066.


In [19]:
ridgecv.coef_

array([[ -2.92266918e-01,  -1.41393458e-01,   9.28966174e-02,
          9.60514618e-04,   0.00000000e+00,   3.38487144e-01,
         -1.16009954e-01,   1.32446271e-01,   1.15021757e-01,
          1.88108751e-02,   2.08116498e-01,  -1.42402502e-01,
         -2.52766047e-01,   3.17007182e-01,  -8.15474621e-02,
         -1.94189250e-01,  -1.16421634e-01,  -1.73837788e-01,
          2.70591485e-01,   0.00000000e+00,   2.88514290e-02,
         -3.91472065e-02,   0.00000000e+00,   4.70821464e-02,
          2.47979769e-01,  -3.51995905e-01,  -3.28654210e-01,
         -1.94866627e-01,   0.00000000e+00,  -1.30926824e-02,
          0.00000000e+00,  -5.19622470e-01,   5.65402213e-02,
         -2.31243199e-01,   3.33368716e-01,  -1.77068989e-01,
          1.17419922e-01,   0.00000000e+00,  -6.46666967e-01,
         -2.78974767e-02,  -1.33797804e-01,   2.38116136e-02,
         -2.77695924e-01,  -3.44206775e-01,  -8.69178582e-02,
          0.00000000e+00,  -2.53816927e-01,   7.29411547e-03,
        

---

### 6. Cross-validate the Ridge regression $R^2$ with the optimal alpha.

Is it better than the Linear regression? If so, why might this be?

In [20]:
prediction = ridgecv.predict(X_train)
print(prediction)

[[  9.3921845 ]
 [  7.11466737]
 [ 10.2735615 ]
 ..., 
 [  7.84297424]
 [  7.38415923]
 [  5.50352856]]


In [21]:
mean_cv_ridge = cross_val_score(ridgecv, X_train, y_train, cv=10).mean()


In [22]:
print(mean_cv_ridge)

0.635986380188


The cross-validated Ridge model is much better than the plain linear regression. With a well-chosen alpha, the penalty shrinks the coefficients of unnecessary and collinear features, which reduces overfitting.
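
Passing a `RidgeCV` object to `cross_val_score` reruns the alpha search inside every fold. An alternative that follows the question's wording more literally is to cross-validate a plain `Ridge` fixed at the selected alpha. The sketch below shows that pattern on synthetic data; the data and the alpha grid are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: only the first three predictors carry signal.
rng = np.random.RandomState(0)
X_demo = rng.randn(150, 30)
y_demo = X_demo[:, :3].sum(axis=1) + 0.5 * rng.randn(150)

# Step 1: search a log-spaced alpha grid.
search = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(X_demo, y_demo)

# Step 2: score a plain Ridge fixed at the selected alpha.
fixed = Ridge(alpha=search.alpha_)
scores = cross_val_score(fixed, X_demo, y_demo, cv=10)
print(search.alpha_, scores.mean())
```

The two approaches answer slightly different questions: the nested version estimates how the whole tuning procedure generalizes, while the fixed-alpha version scores one specific model.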

---

### 7. Find an optimal value for Lasso regression alpha using `LassoCV`.

[Go to the documentation and read how LassoCV works.](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html) It is very similar to `RidgeCV`.

> *Hint: again, once the `LassoCV` is fit, the attribute `.alpha_` contains the best alpha parameter it found through cross-validation.*

Recall that Lasso, unlike Ridge, performs best searching for alpha through linear space (`np.linspace`). However, you can actually let the LassoCV decide itself what alphas to use by instead setting the keyword argument `n_alphas=` to however many alphas you want it to search over. It is recommended to let sklearn choose the range of alphas.

_**Tip:** If you find your CV taking a long time and you're not sure if it's working, set `verbose=1`._

In [23]:
# A:

l_alphas = np.linspace(0.001, 10000, 100)
#print(l_alphas)
lassocv = LassoCV(alphas=l_alphas)
lassocv.fit(X_train, y_train)

l_alpha = lassocv.alpha_

print("Optimal value for lasso regression is {}.".format(l_alpha))

  y = column_or_1d(y, warn=True)


Optimal value for lasso regression is 0.001.
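
Two details of this output are worth a closer look. The `column_or_1d` line is scikit-learn warning that `y` was passed as a one-column 2-D object rather than a 1-D array, and the selected alpha of 0.001 is the smallest value in the hand-built grid, whose next value is about 101. The sketch below, on synthetic data, flattens the target and leaves `alphas` unset so that scikit-learn builds its own grid; all names and data here are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV

rng = np.random.RandomState(0)
X_demo = rng.randn(150, 30)
# A one-column DataFrame target, as produced by df[['col']].
y_frame = pd.DataFrame({"target": X_demo[:, 0] * 2.0 + 0.5 * rng.randn(150)})

# Flatten to 1-D so the estimator receives the shape it expects.
y_flat = y_frame.values.ravel()

# No `alphas` argument: scikit-learn builds its own grid from the data.
lasso_search = LassoCV(cv=5).fit(X_demo, y_flat)
print(lasso_search.alpha_, len(lasso_search.alphas_))
```

When the best alpha sits at the edge of a grid, the true optimum may lie outside it, which is one reason an automatically chosen grid is often preferable.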




---

### 8. Cross-validate the Lasso $R^2$ with the optimal alpha.

Is it better than the Linear regression? Is it better than Ridge? What do the differences in results imply about the issues with the dataset?

In [24]:
# A:
mean_cv_lasso = cross_val_score(lassocv, X_train, y_train, cv=10).mean()

  y = column_or_1d(y, warn=True)




In [25]:
print(mean_cv_lasso)

0.579423838293


The Lasso is a better model than the linear regression, but not better than Ridge (mean cross-validated $R^2$ of about 0.58 versus 0.64), likely because Ridge is better suited to dealing with multicollinearity.

---

### 9. Look at the coefficients for variables in the Lasso.

1. Show the coefficient for variables, ordered from largest to smallest coefficient by absolute value.
2. What percent of the variables in the original dataset are "zeroed-out" by the lasso?
3. What are the most important predictors for how many shots Kobe made in a game?

> **Note:** if you only fit the Lasso within `cross_val_score`, you will have to refit it outside of that function to pull out the coefficients.

In [26]:
# A:
coef= lassocv.coef_
coef

array([ -3.25781235e-01,  -0.00000000e+00,   1.67177120e-01,
         0.00000000e+00,   0.00000000e+00,   1.41713891e+00,
        -5.04467458e-01,   6.16745694e-01,   4.58048625e-01,
         0.00000000e+00,   6.48311982e-01,  -1.15432036e-01,
        -1.32002011e+00,   1.21834491e+00,  -0.00000000e+00,
        -1.20393334e+00,  -3.39041875e-01,  -6.68328096e-01,
         1.35726766e+00,   0.00000000e+00,   0.00000000e+00,
        -0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         1.07606233e+00,  -1.23517761e+00,  -2.21257988e+00,
        -6.77263792e-01,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,  -1.61018961e+00,   5.77135892e-02,
        -4.09669662e-03,   1.36837962e+00,  -5.38569126e-01,
         1.67538370e-01,   0.00000000e+00,  -3.34942684e+00,
        -0.00000000e+00,  -2.86044380e-01,   0.00000000e+00,
        -1.78511281e+00,  -1.30474904e+00,  -0.00000000e+00,
         0.00000000e+00,  -1.59060984e+00,  -0.00000000e+00,
         0.00000000e+00,

In [27]:
# Show the coefficient for variables, ordered from largest to smallest coefficient by absolute value.
np.sort(np.absolute(coef))[::-1]

array([  9.58162980e+00,   5.77684538e+00,   4.78847013e+00,
         4.43773500e+00,   4.07314236e+00,   3.72136083e+00,
         3.53473932e+00,   3.41928861e+00,   3.35218337e+00,
         3.34942684e+00,   3.18009713e+00,   3.08298520e+00,
         3.07084401e+00,   2.98970554e+00,   2.98290530e+00,
         2.90847189e+00,   2.89740031e+00,   2.84479482e+00,
         2.82278999e+00,   2.69690790e+00,   2.68506099e+00,
         2.63506923e+00,   2.62707593e+00,   2.61925954e+00,
         2.61342296e+00,   2.60571957e+00,   2.58143514e+00,
         2.58066540e+00,   2.56149335e+00,   2.53460182e+00,
         2.52542602e+00,   2.49088704e+00,   2.39798184e+00,
         2.36045731e+00,   2.34824839e+00,   2.33849535e+00,
         2.33692791e+00,   2.31039841e+00,   2.30840476e+00,
         2.30433495e+00,   2.26045464e+00,   2.25694831e+00,
         2.24483842e+00,   2.21257988e+00,   2.20702428e+00,
         2.16916522e+00,   2.13800937e+00,   2.12131659e+00,
         2.11812742e+00,

In [28]:
zeros = (coef==0).sum()

In [29]:
percent = zeros/coef.size * 100

In [30]:
# What percent of the variables in the original dataset are "zeroed-out" by the lasso?
print('Percent of the variables in the original dataset "zeroed-out" by the lasso: {}'.format(percent))

Percent of the variables in the original dataset "zeroed-out" by the lasso: 44.87577639751553


What are the most important predictors for how many shots Kobe made in a game?

In [31]:
predict_zip = list(zip(kobe.iloc[:,1:], coef))

In [32]:
predictor_df=pd.DataFrame(predict_zip)

In [33]:
predictor_df.head()

Unnamed: 0,0,1
0,away_game,-0.325781
1,season_opponent:atl:1996-97,-0.0
2,season_opponent:atl:1997-98,0.167177
3,season_opponent:atl:1999-00,0.0
4,season_opponent:atl:2000-01,0.0


In [34]:
predictor_df.sort_values(1, ascending=False)

Unnamed: 0,0,1
609,action_type:jump_bank_shot,9.581630
266,season_opponent:mil:2005-06,5.776845
373,season_opponent:phi:2005-06,4.437735
116,season_opponent:den:2000-01,4.073142
466,season_opponent:sea:2004-05,3.721361
374,season_opponent:phi:2006-07,3.534739
327,season_opponent:nyk:2002-03,3.419289
233,season_opponent:mem:2009-10,3.352183
134,season_opponent:det:2000-01,3.070844
163,season_opponent:gsw:2010-11,2.982905


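The two largest coefficients by absolute value are `action_type:jump_bank_shot` (9.58) and `season_opponent:mil:2005-06` (5.78). The table above is sorted by signed value, so it lists only the largest positive coefficients; large negative coefficients do not appear in it.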
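
To rank predictors by the size of their effect regardless of sign, the coefficients can be ordered by absolute value while keeping the signed values for display. The sketch below uses made-up coefficients and hypothetical feature names (`feat_a` to `feat_d`); in the lab the Series would be built from the predictor names and `lassocv.coef_`.

```python
import pandas as pd

# Made-up coefficients keyed by hypothetical feature names.
coefs = pd.Series({"feat_a": 0.5, "feat_b": -4.0, "feat_c": 0.0, "feat_d": 2.5})

# Order the index by absolute value, then look up the signed values.
order = coefs.abs().sort_values(ascending=False).index
ranked = coefs.loc[order]
print(ranked)

# Share of coefficients that are exactly zero, as a percentage.
pct_zero = (coefs == 0).mean() * 100
print(pct_zero)
```

Keeping the sign in the ranked output shows whether a feature is associated with more or fewer shots made, which an absolute-value array alone hides.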

---

### 10. Find an optimal value for Elastic Net regression alpha using `ElasticNetCV`.

[Go to the documentation and read how ElasticNetCV works.](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNetCV.html)

Note here that you will be optimizing both the alpha parameter and the l1_ratio:
- `alpha`: strength of regularization
- `l1_ratio`: amount of ridge vs. lasso (0 = all ridge, 1 = all lasso)
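
As a reference for how the two parameters interact, scikit-learn's `ElasticNet` documentation states the objective below, where $n$ is the number of samples, $w$ is the coefficient vector, and $\rho$ stands for `l1_ratio`:

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 \;+\; \alpha \, \rho \, \lVert w \rVert_1 \;+\; \frac{\alpha \, (1 - \rho)}{2} \, \lVert w \rVert_2^2
$$

With $\rho = 1$ only the $\ell_1$ (lasso) term remains, and as $\rho$ approaches 0 the penalty approaches a pure $\ell_2$ (ridge) term.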
    
Do not include 0 in the search for `l1_ratio`: `ElasticNetCV` will not allow it and will break!

You can use `n_alphas` for the alpha parameters instead of setting your own values: highly recommended!

Also, be careful about setting too many `l1_ratio` values over cross-validation folds in your search. It can take a very long time if you choose too many combinations, and for the most part there are diminishing returns in this data.
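
The recommendations above can be sketched as follows: a short list of `l1_ratio` candidates that excludes 0, with the alpha grid left to scikit-learn. The synthetic data, the candidate list, and `cv=5` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Synthetic stand-in: two informative predictors out of 30.
rng = np.random.RandomState(0)
X_demo = rng.randn(150, 30)
y_demo = X_demo[:, 0] * 2.0 - X_demo[:, 1] + 0.5 * rng.randn(150)

# A short list of l1_ratio candidates (0 excluded); alphas are left to scikit-learn.
enet_search = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5)
enet_search.fit(X_demo, y_demo)
print(enet_search.l1_ratio_, enet_search.alpha_)
```

Each extra `l1_ratio` value multiplies the number of fits by the size of the alpha grid and the number of folds, so a handful of candidates is usually enough for a first pass.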

In [35]:
# A:
n = np.linspace(0.001, 10000, 100)   # candidate alphas
nn = [.1, .5, .7, .9, .95, .99, 1]    # candidate l1_ratio values
encv = ElasticNetCV(cv=10, alphas=n, l1_ratio=nn)
encv.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)


ElasticNetCV(alphas=array([  1.00000e-03,   1.01011e+02, ...,   9.89899e+03,   1.00000e+04]),
       copy_X=True, cv=10, eps=0.001, fit_intercept=True,
       l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1], max_iter=1000,
       n_alphas=100, n_jobs=1, normalize=False, positive=False,
       precompute='auto', random_state=None, selection='cyclic',
       tol=0.0001, verbose=0)

In [36]:
encv.l1_ratio_

0.10000000000000001

In [37]:
encv.alpha_

0.001

---

### 11. Cross-validate the ElasticNet $R^2$ with the optimal alpha and l1_ratio.

How does it compare to the Ridge and Lasso regularized regressions?

In [38]:
# A:
mean_cv_en = cross_val_score(encv, X_train, y_train, cv=10).mean()

  y = column_or_1d(y, warn=True)




In [39]:
print(mean_cv_en)

0.599557964376


---

### 12. [Bonus] Compare the residuals for the Ridge and Lasso visually.


In [None]:
# A: Maybe a jointplot?
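
One possible starting point, sketched on synthetic data, is to compute held-out residuals for both models and plot one against the other. The data, the placeholder alphas, and the plain matplotlib scatter (instead of a seaborn jointplot) are all assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the games data.
rng = np.random.RandomState(0)
X_demo = rng.randn(300, 20)
y_demo = X_demo[:, 0] * 2.0 - X_demo[:, 1] + 0.5 * rng.randn(300)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=20
)

# Alphas here are arbitrary placeholders, not tuned values.
ridge_resid = y_te - Ridge(alpha=1.0).fit(X_tr, y_tr).predict(X_te)
lasso_resid = y_te - Lasso(alpha=0.05).fit(X_tr, y_tr).predict(X_te)

# One point per held-out row: ridge residual on x, lasso residual on y.
fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(ridge_resid, lasso_resid, alpha=0.6)
ax.set_xlabel("Ridge residual")
ax.set_ylabel("Lasso residual")
```

Points close to the diagonal are games where both models err in the same way; points far from it show where the two penalties lead to different predictions.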