<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Predicting shots made per game by Kobe Bryant

_Authors: Kiefer Katovich (SF)_

---

In this lab you'll use three regularized regression methods (Ridge, Lasso, and Elastic Net) to predict how many shots Kobe Bryant made per game over his career.

The Kobe shots dataset has hundreds of columns representing different characteristics of each basketball game. An ordinary linear regression fit on every predictor would overfit dramatically, given the limited number of observations (games) available. Furthermore, many of the predictors are highly multicollinear.

**Warning:** Some of these calculations are computationally expensive and may take a while to execute. It may be worthwhile to use only a portion of the data for these calculations, especially if you have experienced kernel issues in the past.
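
One way to work with a portion of the data is to draw a random subset of rows with pandas. The sketch below uses a small synthetic DataFrame named `demo` as a stand-in (an assumption, so the example runs on its own); the sampling fraction of 25% is likewise only illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in for the real DataFrame so the sketch runs on its own (assumption).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(1000, 5)), columns=list("abcde"))

# Keep a random 25% of the rows; random_state makes the draw repeatable.
subset = demo.sample(frac=0.25, random_state=42)

print(demo.shape, subset.shape)
```

The same `.sample(frac=..., random_state=...)` call could be applied to the loaded data before fitting any models.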

---

### 1. Load packages and data

In [1]:
import numpy as np
import pandas as pd
import patsy

from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import cross_val_score

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [3]:
kobe = pd.read_csv('data/kobe_superwide_games.csv')

---

### 2. Examine the data

- How many columns are there?
- Examine what the observations (rows) and columns represent.
- Why might regularization be particularly useful for this data?

In [7]:
# A:
kobe.shape

(1558, 645)

---

### 4. Build a linear regression predicting `SHOTS_MADE` from the rest of the columns.
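
As a rough illustration of what an unregularized baseline might look like, the sketch below cross-validates a `LinearRegression` on synthetic data with many predictors relative to the number of rows. The arrays `X_demo` and `y_demo` are hypothetical stand-ins, not the Kobe data, and the sizes are chosen only to make the overfitting visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): 100 rows, 60 predictors,
# and only the first predictor actually drives the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 60))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(size=100)

linreg = LinearRegression()

# R^2 on held-out folds (the default scorer for regressors is R^2).
cv_scores = cross_val_score(linreg, X_demo, y_demo, cv=5)

# R^2 on the same data used for fitting, for comparison.
train_r2 = linreg.fit(X_demo, y_demo).score(X_demo, y_demo)

print("mean CV R^2:", cv_scores.mean())
print("training R^2:", train_r2)
```

A large gap between the training score and the cross-validated score is the usual sign of overfitting, and it is the behavior regularization is meant to reduce.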



In [23]:
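# A:
# Predictors: every column except the target. Target: shots made per game.
X = kobe.drop('SHOTS_MADE', axis=1)
y = kobe['SHOTS_MADE']

# Lasso is already imported above. Note that this cell fits a Lasso with the
# default alpha=1.0, not an unregularized LinearRegression.
lasso_model = Lasso()
lasso_model.fit(X, y)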

Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)

In [24]:
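# Pair each feature name with its fitted coefficient; list() keeps this
# working on pandas versions that do not accept a bare zip iterator.
c = pd.DataFrame(list(zip(X.columns, lasso_model.coef_)), columns=['feature', 'coef'])
# Sort ascending by signed coefficient value.
c.sort_values('coef')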

|     | feature | coef |
| --- | --- | --- |
| 553 | MEAN_Y_POSITION | -0.001587 |
| 0 | AWAY_GAME | -0.000000 |
| 422 | SEASON_OPPONENT:sac:1998-99 | 0.000000 |
| 423 | SEASON_OPPONENT:sac:1999-00 | 0.000000 |
| 424 | SEASON_OPPONENT:sac:2000-01 | -0.000000 |
| ... | ... | ... |
| 643 | CAREER_GAME_NUMBER | 0.000061 |
| 582 | SECONDS_REMAINING | 0.005617 |
| 583 | MINUTES_REMAINING | 0.008581 |
| 584 | PERIOD | 0.036898 |


In [25]:
c[c.coef != 0]

|     | feature | coef |
| --- | --- | --- |
| 553 | MEAN_Y_POSITION | -0.001587 |
| 574 | SHOT_TYPE:2pt_field_goal | 0.068167 |
| 582 | SECONDS_REMAINING | 0.005617 |
| 583 | MINUTES_REMAINING | 0.008581 |
| 584 | PERIOD | 0.036898 |
| 643 | CAREER_GAME_NUMBER | 0.000061 |


---

### 5. Look at the coefficients for variables in the Lasso.

1. Show the coefficients for the variables, ordered from largest to smallest by absolute value.
2. What percent of the variables in the original dataset are "zeroed-out" by the lasso?
3. What are the most important predictors for how many shots Kobe made in a game?

> **Note:** If you only fit the Lasso within `cross_val_score`, you will have to refit it outside of that function to extract the coefficients.
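
The steps listed above can be sketched on synthetic data as follows. Everything here is illustrative: `X_demo`, `y_demo`, the column names `x0` to `x19`, and `alpha=0.5` are assumptions and do not come from the Kobe dataset. The Lasso is fit directly, outside of any cross-validation helper, so that `coef_` is available.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic stand-in data (assumption): 20 predictors, 3 of them informative.
rng = np.random.default_rng(0)
cols = [f"x{i}" for i in range(20)]
X_demo = pd.DataFrame(rng.normal(size=(200, 20)), columns=cols)
y_demo = (3.0 * X_demo["x0"] - 2.0 * X_demo["x1"] + 1.5 * X_demo["x2"]
          + rng.normal(scale=0.5, size=200))

# Fit directly so the fitted coefficients can be inspected afterwards.
lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)

# One row per feature, ordered by absolute coefficient size, largest first.
coefs = pd.DataFrame({"feature": X_demo.columns, "coef": lasso.coef_})
coefs["abs_coef"] = coefs["coef"].abs()
coefs = coefs.sort_values("abs_coef", ascending=False)

# Share of coefficients the penalty shrank exactly to zero.
pct_zero = 100 * (coefs["coef"] == 0).mean()

print(coefs.head(10))
print(f"{pct_zero:.1f}% of coefficients are exactly zero")
```

Coefficient magnitudes are only comparable across features when the predictors are on a similar scale, which holds for this synthetic data but should be checked on real data.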

In [10]:
# A: