# Predicting shots made per game by Kobe Bryant

In this lab you'll use three regularization techniques (Ridge, Lasso, and Elastic Net) to predict how many shots Kobe Bryant made per game over his career.

---

### 1. Load packages and data

In [26]:
import numpy as np
import pandas as pd
import patsy
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression, RidgeCV, LassoCV, ElasticNetCV
# cross_val_score lives in sklearn.model_selection (scikit-learn >= 0.18)
from sklearn.model_selection import cross_val_score
from sklearn import metrics

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [2]:
kobe = pd.read_csv('kobe_superwide_games.csv')

---

### 2. Examine the data

- How many columns are there?
- Infer what the observations (rows) and columns represent.
- Why might regularization be particularly useful for this data?

In [18]:
kobe.head(3)


---

### Make predictor and target variables. Normalize the predictors.

Why is normalization necessary for regularized regressions?

There is a class in sklearn.preprocessing called `StandardScaler`. Look it up and figure out how to use it to normalize your predictor matrix. 
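To see why scale matters for a penalized model, here is a minimal sketch on synthetic data (not the Kobe dataset). The two features, their units, and `alpha=10.0` are made up for illustration. Both features are built to move `y` by about 1 per standard deviation, so any difference in how they are shrunk comes from their units alone.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
n = 200

# Two synthetic features with equal "true" importance but very different units
big = rng.normal(size=n) * 1000.0    # values in the thousands
small = rng.normal(size=n) * 0.001   # values in the thousandths
X_raw = np.column_stack([big, small])

# Each feature moves y by about 1 per standard deviation
y = big / 1000.0 + small * 1000.0 + rng.normal(scale=0.1, size=n)

# Ridge on the raw columns: the small-unit feature needs a large coefficient,
# so the penalty on coefficient size shrinks it far more
raw_fit = Ridge(alpha=10.0).fit(X_raw, y)
raw_effect = raw_fit.coef_ * X_raw.std(axis=0)   # effect per standard deviation

# Ridge on standardized columns: both features face the same penalty
X_std = StandardScaler().fit_transform(X_raw)
std_fit = Ridge(alpha=10.0).fit(X_std, y)
std_effect = std_fit.coef_                       # columns already have std 1

print("effect per std, raw units:   ", raw_effect.round(3))
print("effect per std, standardized:", std_effect.round(3))
```

On the raw columns the small-unit feature is shrunk almost to nothing, while after standardization both features keep a similar effect.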

In [17]:
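# Target: shots made per game (kept as a one-column DataFrame)
target = kobe[['SHOTS_MADE']]
# Predictors: every column after the first (this assumes SHOTS_MADE is column 0).
# .ix was removed from pandas, so use positional indexing with .iloc
predictors = kobe.iloc[:, 1:]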

scaler = StandardScaler()
scaled_data = scaler.fit_transform(predictors)

standardized = pd.DataFrame(scaled_data, columns=predictors.columns)
standardized.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| AWAY_GAME | 1558.0 | -7.838545e-18 | 1.000321 | -1.001285 | -1.001285 | 0.998717 | 0.998717 | 0.998717 |
| SEASON_OPPONENT:atl:1996-97 | 1558.0 | -1.285352e-15 | 1.000321 | -0.035852 | -0.035852 | -0.035852 | -0.035852 | 27.892651 |
| SEASON_OPPONENT:atl:1997-98 | 1558.0 | -1.162216e-15 | 1.000321 | -0.035852 | -0.035852 | -0.035852 | -0.035852 | 27.892651 |
| SEASON_OPPONENT:atl:1999-00 | 1558.0 | 4.360681e-16 | 1.000321 | -0.025343 | -0.025343 | -0.025343 | -0.025343 | 39.458839 |
| SEASON_OPPONENT:atl:2000-01 | 1558.0 | 3.266090e-16 | 1.000321 | -0.025343 | -0.025343 | -0.025343 | -0.025343 | 39.458839 |
| SEASON_OPPONENT:atl:2001-02 | 1558.0 | -3.869124e-16 | 1.000321 | -0.035852 | -0.035852 | -0.035852 | -0.035852 | 27.892651 |
| SEASON_OPPONENT:atl:2002-03 | 1558.0 | -1.132759e-16 | 1.000321 | -0.035852 | -0.035852 | -0.035852 | -0.035852 | 27.892651 |
| SEASON_OPPONENT:atl:2003-04 | 1558.0 | 4.385577e-17 | 1.000321 | -0.025343 | -0.025343 | -0.025343 | -0.025343 | 39.458839 |
| SEASON_OPPONENT:atl:2004-05 | 1558.0 | 2.335976e-17 | 1.000321 | -0.025343 | -0.025343 | -0.025343 | -0.025343 | 39.458839 |
| SEASON_OPPONENT:atl:2005-06 | 1558.0 | 1.193151e-16 | 1.000321 | -0.035852 | -0.035852 | -0.035852 | -0.035852 | 27.892651 |


---

### Build a linear regression predicting `SHOTS_MADE` from the rest of the columns.

Cross-validate the $R^2$ of a linear regression model with 10 cross-validation folds.

How does it perform?

In [29]:
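lr = LinearRegression()

# Note: these are the raw predictors, not the `standardized` DataFrame built above
x = predictors
y = target

# Fit on the full dataset and score on the same rows (in-sample fit)
model        =  lr.fit(x, y)
predictions  =  model.predict(x)
score        =  model.score(x, y)

# 10-fold cross-validated R^2, scored on held-out folds
scores = cross_val_score(lr, predictors, y, cv=10)
print("Cross-validated scores:", scores)

# In-sample R^2 despite the label: `predictions` come from the model fit on all rows
R_squared = metrics.r2_score(y, predictions)
print("Cross-Predicted R^2:", R_squared)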

Cross-validated scores: [ -2.28311785e+17  -5.05731427e+17  -7.52128097e+15  -1.64706932e+17
  -7.73941483e+16  -1.16921069e+17  -1.44702808e+16  -2.75209639e+14
  -4.67472226e+15  -7.17980707e+14]
Cross-Predicted R^2: 0.807892771734
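
A respectable $R^2$ on the rows the model was fit on, next to very poor held-out scores, is the usual sign of overfitting when there are many columns relative to rows. The sketch below reproduces that general pattern on synthetic data. The shapes, coefficients, and fold count are assumptions chosen for illustration and say nothing specific about the Kobe dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n_rows, n_cols = 60, 50            # almost as many columns as rows

X = rng.normal(size=(n_rows, n_cols))
# Only the first three columns carry signal; the rest are pure noise
y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(size=n_rows)

lr = LinearRegression()
train_r2 = lr.fit(X, y).score(X, y)             # scored on the rows it was fit on
cv_r2 = cross_val_score(lr, X, y, cv=5).mean()  # scored on held-out rows

print("in-sample R^2:      ", round(train_r2, 3))
print("cross-validated R^2:", round(cv_r2, 3))
```

The in-sample score rewards the model for memorizing noise, which is exactly what the penalty in Ridge and Lasso is meant to discourage.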


---

### Find an optimal value for Ridge regression alpha using RidgeCV

[Go to the documentation and read how RidgeCV works.](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html)

Hint: once the RidgeCV is fit, the attribute `.alpha_` contains the best alpha parameter it found through cross-validation.

Recall that Ridge performs best searching alphas through logarithmic space (`np.logspace`).
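
As a quick illustration of what a logarithmic grid looks like, the sketch below compares `np.logspace` with `np.linspace` over the same range. The endpoints and the six points are arbitrary choices for display.

```python
import numpy as np

# Evenly spaced in the exponent: 10**0, 10**1, ..., 10**5
log_grid = np.logspace(0, 5, 6)
# Same endpoints, evenly spaced in value
lin_grid = np.linspace(1, 1e5, 6)

print(log_grid)
print(lin_grid)
```

The logarithmic grid covers every order of magnitude, while the linear grid puts almost all of its points above 10,000.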


In [39]:
ridge = RidgeCV(alphas=np.logspace(0,5,200),fit_intercept=False,cv=10)
ridge.fit(x,y)
ridge.alpha_

opt_ridge = Ridge(alpha=ridge.alpha_,fit_intercept=False)

---

### Cross-validate the Ridge $R^2$ with the optimal alpha.

Is it better than the Linear regression? If so, why would this be?

In [46]:
ridge_scores = cross_val_score(opt_ridge, predictors, y, cv=10)
print("Cross-validated scores:", ridge_scores)
print("Mean: ", np.mean(ridge_scores))

Cross-validated scores: [ 0.64358493  0.5244071   0.51837756  0.60441685  0.53548521  0.56056225
  0.53480591  0.43739465  0.4710879   0.48528824]
Mean:  0.531541061003


---

### Find an optimal value for Lasso regression alpha using LassoCV

[Go to the documentation and read how LassoCV works.](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html) It is very similar to RidgeCV.

Hint: again, once the LassoCV is fit, the attribute `.alpha_` contains the best alpha parameter it found through cross-validation.

Recall that Lasso, unlike Ridge, performs best searching alphas through linear space (`np.linspace`). However, you can let `LassoCV` choose the alphas itself by setting the keyword argument `n_alphas=` to the number of alphas you want it to search over.

In [41]:
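# Search the L1 penalty strength with LassoCV. The grid starts above 0 because
# alpha=0 is plain least squares, which the Lasso solver handles poorly.
lasso = LassoCV(alphas=np.arange(0.0025, 0.15, 0.0025), fit_intercept=False, cv=10)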
lasso.fit(x,y)
lasso.alpha_

opt_lasso = Lasso(alpha=lasso.alpha_,fit_intercept=False)

---

### Cross-validate the Lasso $R^2$ with the optimal alpha.

Is it better than the Linear regression? Is it better than Ridge? For each, why would this be?

Depending on which $R^2$ is better between the Ridge and Lasso, what can you infer about the primary issue in the data?

In [45]:
lasso_scores = cross_val_score(opt_lasso, predictors, y, cv=10)
print("Cross-validated scores:", lasso_scores)
print("Mean: ", np.mean(lasso_scores))

Cross-validated scores: [ 0.62523891  0.51542013  0.5161596   0.57138981  0.54782538  0.54797083
  0.50505654  0.42822027  0.45928     0.48492841]
Mean:  0.520148988892


---

### Look at the coefficients for variables in the Lasso.

1. Show the coefficient for variables, ordered from largest to smallest coefficient by absolute value.
2. What percent of the variables in the original dataset are "zeroed-out" by the lasso?
3. What are the most important predictors for how many shots Kobe made in a game?

Note: if you only fit the Lasso within `cross_val_score`, you will have to refit it outside of that function to pull out the coefficients.

In [58]:
opt_lasso.fit(x,y)

Lasso(alpha=0.14749999999999999, copy_X=True, fit_intercept=False,
   max_iter=1000, normalize=False, positive=False, precompute=False,
   random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
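
One possible way to answer the coefficient questions is sketched below on synthetic data. The column names `x0` to `x19`, the true coefficients, and `alpha=0.5` are all made up for illustration. In the lab you would use `opt_lasso.coef_` and `predictors.columns` instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
# Hypothetical data: 20 columns, only x0 and x1 actually drive y
X = pd.DataFrame(rng.normal(size=(200, 20)),
                 columns=["x%d" % i for i in range(20)])
y = 5 * X["x0"] - 3 * X["x1"] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)

# Pair each coefficient with its column name
coefs = pd.Series(lasso.coef_, index=X.columns)

# Order by absolute value, largest first, keeping the original sign
order = coefs.abs().sort_values(ascending=False).index
coefs = coefs.reindex(order)

# Share of variables whose coefficient was shrunk exactly to zero
percent_zero = 100.0 * (coefs == 0).mean()

print(coefs.head(5))
print("percent zeroed out:", percent_zero)
```

Sorting on the absolute value rather than the raw value keeps large negative effects near the top of the list.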

---

### Find an optimal value for Elastic Net regression alpha using ElasticNetCV

[Go to the documentation and read how ElasticNetCV works.](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNetCV.html)

Note here that you will be optimizing both the alpha parameter and the l1_ratio:

    alpha: strength of regularization
    l1_ratio: amount of ridge vs. lasso (0 = all ridge, 1 = all lasso)
    
Do not include 0 in the search for l1_ratio: `ElasticNetCV` cannot build its automatic alpha grid when `l1_ratio` is 0 and will raise an error.

You can use n_alphas for the alpha parameters instead of setting your own values: highly recommended!

Also, be careful about searching too many l1_ratios across the cross-validation folds. The search can take a very long time if you choose too many combinations, and for the most part there are diminishing returns in this data.
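
To get a rough sense of how quickly the search grows, the arithmetic below counts the combinations for the settings used in this lab's `ElasticNetCV` call (`np.arange(.01, 1, .05)`, `n_alphas=50`, `cv=10`). It treats every (alpha, l1_ratio, fold) combination as one unit of work. That is a simplification, since alphas along one regularization path are cheaper to compute than independent fits.

```python
import numpy as np

l1_ratios = np.arange(.01, 1, .05)   # 0.01, 0.06, ..., 0.96
n_alphas = 50
n_folds = 10

n_combinations = len(l1_ratios) * n_alphas   # (alpha, l1_ratio) pairs
n_evaluations = n_combinations * n_folds     # each pair is scored on every fold

print(len(l1_ratios), n_combinations, n_evaluations)   # 20 1000 10000
```

Halving the number of l1_ratios halves the total, which is why a coarse l1_ratio grid is usually a reasonable first pass.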

In [65]:
elastic = ElasticNetCV(l1_ratio=np.arange(.01,1,.05),n_alphas=50,cv=10)
elastic.fit(x,y)





In [68]:
opt_elastic = ElasticNet(alpha=elastic.alpha_,l1_ratio=elastic.l1_ratio_)

---

### Cross-validate the ElasticNet $R^2$ with the optimal alpha and l1_ratio.

How does it compare to the other regularized regressions?

In [70]:
elastic_scores = cross_val_score(opt_elastic, predictors, y, cv=10)
print("Cross-validated scores:", elastic_scores)
print("Mean: ", np.mean(elastic_scores))

Cross-validated scores: [ 0.53797685  0.48044185  0.47560029  0.54125653  0.53455716  0.50328487
  0.4627109   0.39023747  0.42987488  0.43500476]
Mean:  0.479094555653


---

### Plot the residuals for the ridge, lasso, and elastic net on histograms

This is another way to look at the performance of your model.

The tighter the distribution of residuals around zero, the better your model has performed!
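
One way to draw these histograms is sketched below. It uses synthetic data and placeholder penalty values so that it runs on its own. In the lab you would swap in `opt_ridge`, `opt_lasso`, `opt_elastic`, `predictors`, and the target. Using `cross_val_predict` here is an assumption on my part: it makes each residual come from a model that did not see that row.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_predict

rng = np.random.RandomState(0)
# Hypothetical stand-in for the lab data: 30 columns, 5 of them informative
X = rng.normal(size=(300, 30))
y = X[:, :5].sum(axis=1) + rng.normal(size=300)

# Placeholder penalties; replace with the tuned models from the lab
models = {
    "Ridge": Ridge(alpha=10.0),
    "Lasso": Lasso(alpha=0.1),
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5),
}

residuals = {}
fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharex=True)
for ax, (name, model) in zip(axes, models.items()):
    # Out-of-fold predictions: each row is predicted by a model fit without it
    y_hat = cross_val_predict(model, X, y, cv=10)
    residuals[name] = y - y_hat
    ax.hist(residuals[name], bins=30)
    ax.axvline(0, color="k", linewidth=1)   # reference line at zero error
    ax.set_title(name)
    ax.set_xlabel("residual (actual - predicted)")
fig.tight_layout()
```

Sharing the x-axis across the three panels makes it easier to compare how tightly each model's residuals cluster around zero.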