### _Section 7.0_
#### Load packages

In [None]:
import numpy as np
import pandas as pd
from sklearn import linear_model, metrics
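# Note: sklearn.cross_validation and sklearn.grid_search only exist in scikit-learn < 0.20;
# later releases moved this functionality to sklearn.model_selection (with changed signatures).
from sklearn import cross_validation
from sklearn import grid_search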
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

sns.set_style("darkgrid")

### _Section 7.1_
#### Create sample data and fit a model
```diff
+ The following section provides an opportunity for the student to use a loss function (MSE) to differentiate between two models - one trained on normal data and one trained on biased data.
```

In [None]:
df = pd.DataFrame({'x': range(100), 'y': range(100)})
biased_df  = df.copy()
biased_df.loc[:20, 'x'] = 1
biased_df.loc[:20, 'y'] = 1

def append_jitter(series):
    # Add uniform random noise in [0, 1) to every value in the series
    jitter = np.random.random_sample(size=len(series))
    return series + jitter

df['x'] = append_jitter(df.x)
df['y'] = append_jitter(df.y)

biased_df['x'] = append_jitter(biased_df.x)
biased_df['y'] = append_jitter(biased_df.y)

In [None]:
## fit
lm = linear_model.LinearRegression()
lm.fit(df[['x']], df['y'])

print(metrics.mean_squared_error(df['y'], lm.predict(df[['x']])))

In [None]:
## biased fit
lm = linear_model.LinearRegression().fit(biased_df[['x']], biased_df['y'])
print(metrics.mean_squared_error(biased_df['y'], lm.predict(biased_df[['x']])))
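
To make the loss function concrete, the sketch below computes MSE by hand and compares it with `metrics.mean_squared_error`. The arrays `y_true` and `y_pred` are made-up toy values for illustration only, not data from this notebook.

```python
import numpy as np
from sklearn import metrics

# Hypothetical toy values, chosen only to illustrate the formula
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

# MSE is the mean of the squared residuals
manual_mse = np.mean((y_true - y_pred) ** 2)
sklearn_mse = metrics.mean_squared_error(y_true, y_pred)

print(manual_mse, sklearn_mse)
```

Because residuals are squared, a few large misses raise the MSE far more than many small ones, which is worth keeping in mind when comparing the two fits above.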

### _Section 7.2_
#### Cross validation
An introduction to cross validation, using the bike-share data from last class. We will be modeling casual ridership.
```diff
+ The following section provides an opportunity for the student to use cross validation, and to observe how it allows us to iteratively evaluate a model on a test set using a loss function to estimate generalizability.
```

In [None]:
wd = './dataset/'
bikeshare = pd.read_csv(wd + 'bikeshare.csv')

#### Create dummy variables and set outcome (dependent) variable

In [None]:
weather = pd.get_dummies(bikeshare.weathersit, prefix='weather')
modeldata = bikeshare[['temp', 'hum']].join(weather[['weather_1', 'weather_2', 'weather_3']])
y = bikeshare.casual

#### Create a cross validation with 5 folds

In [None]:
kf = cross_validation.KFold(len(modeldata), n_folds=5, shuffle=True)

In [None]:
mse_values = []
scores = []
n = 0
print("~~~~ CROSS VALIDATION each fold ~~~~")
for train_index, test_index in kf:
    # Fit on the training fold only
    lm = linear_model.LinearRegression().fit(modeldata.iloc[train_index], y.iloc[train_index])
    # Evaluate both metrics on the held-out fold, not on data the model has seen
    mse_values.append(metrics.mean_squared_error(y.iloc[test_index], lm.predict(modeldata.iloc[test_index])))
    scores.append(lm.score(modeldata.iloc[test_index], y.iloc[test_index]))
    n += 1
    print('Model', n)
    print('MSE:', mse_values[n-1])
    print('R2:', scores[n-1])

print("~~~~ SUMMARY OF CROSS VALIDATION ~~~~")
print('Mean of MSE for all folds:', np.mean(mse_values))
print('Mean of R2 for all folds:', np.mean(scores))

In [None]:
lm = linear_model.LinearRegression().fit(modeldata, y)
print("~~~~ Single Model ~~~~")
print('MSE of single model:', metrics.mean_squared_error(y, lm.predict(modeldata)))
print('R2: ', lm.score(modeldata, y))

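With 5 folds, each KFold iteration yields an 80:20 train:test split. The check below uses the indices left over from the last fold of the loop above.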

In [None]:
total = len(test_index)+len(train_index)
print(float(len(train_index))/total)
print(float(len(test_index))/total)
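
The same 80:20 check can be sketched for every fold rather than only the last one. This version assumes a recent scikit-learn, where `KFold` lives in `sklearn.model_selection` and takes `n_splits` instead of `n_folds`; the array `X` is a hypothetical stand-in for `modeldata`.

```python
import numpy as np
from sklearn import model_selection

# Hypothetical stand-in for modeldata: 100 rows, 2 features
X = np.arange(200).reshape(100, 2)

kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=0)

fractions = []
for train_index, test_index in kf.split(X):
    total = len(train_index) + len(test_index)
    # Record (train share, test share) for this fold
    fractions.append((len(train_index) / total, len(test_index) / total))

print(fractions)
```

With 100 rows and 5 folds, every fold holds out 20 rows, so each pair should come out as 0.8 and 0.2.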

### Check
While the cross validated approach here generated more overall error, which of the two approaches would predict new data more accurately: the single model or the cross validated, averaged one? Why?


Answer: 

### _Section 7.3_
#### Activity: Cross Validation with Linear Regression
```diff
+ The following section provides an opportunity for the student to build on the application of cross validation, by manually optimizing one of the parameters: k
```
Note:  
**shuffle** (boolean, optional): whether to shuffle the data before splitting it into folds
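
To see what shuffling changes, here is a small sketch on a hypothetical 10-row dataset. It assumes a recent scikit-learn (`sklearn.model_selection.KFold` with `n_splits`), and `random_state=1` is an arbitrary choice made only so the output is repeatable.

```python
import numpy as np
from sklearn import model_selection

X = np.arange(10).reshape(-1, 1)  # hypothetical 10-row dataset

# Without shuffling, each test fold is a contiguous block of rows
ordered = model_selection.KFold(n_splits=5, shuffle=False)
ordered_tests = [test.tolist() for _, test in ordered.split(X)]

# With shuffling, rows are permuted before being assigned to folds
shuffled = model_selection.KFold(n_splits=5, shuffle=True, random_state=1)
shuffled_tests = [test.tolist() for _, test in shuffled.split(X)]

print(ordered_tests)
print(shuffled_tests)
```

If the rows are stored in a meaningful order (for example by date), contiguous folds can differ systematically from one another, which is one reason to experiment with `shuffle` alongside k.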

In [None]:
# Give it a try...

### _Section 7.4_
#### There are ways to improve our model with regularization
```diff
+ The following section provides an opportunity for the student to try out different regularization methods discussed in today's lecture, and then optimize the parameters (first manually, then using gridsearch). 

+ Grid-search is a way of optimizing our models, by iteratively evaluating the model at each position in a grid of parameters.
```
Let's check out the effects on MSE and R2

In [None]:
lm = linear_model.LinearRegression().fit(modeldata, y)
print("~~~ OLS ~~~")
print('OLS MSE: ', metrics.mean_squared_error(y, lm.predict(modeldata)))
print('OLS R2:', lm.score(modeldata, y))

lm = linear_model.Lasso().fit(modeldata, y)
print("~~~ Lasso ~~~")
print('Lasso MSE: ', metrics.mean_squared_error(y, lm.predict(modeldata)))
print('Lasso R2:', lm.score(modeldata, y))

lm = linear_model.Ridge().fit(modeldata, y)
print("~~~ Ridge ~~~")
print('Ridge MSE: ', metrics.mean_squared_error(y, lm.predict(modeldata)))
print('Ridge R2:', lm.score(modeldata, y))

#### Figuring out the alphas can be done by "hand"

In [None]:
alphas = np.logspace(-10, 10, 21)
for a in alphas:
    print('Alpha:', a)
    lm = linear_model.Ridge(alpha=a)
    lm.fit(modeldata, y)
    print(lm.coef_)
    print(metrics.mean_squared_error(y, lm.predict(modeldata)))
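
The printed coefficients above are easier to read with a summary number. The sketch below tracks the overall size (L2 norm) of the Ridge coefficients as alpha grows. It runs on synthetic data: `X`, `true_coef`, and the noise level are assumptions made for illustration, not the bike-share data.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))            # hypothetical features
true_coef = np.array([4.0, -2.0, 1.0])   # hypothetical coefficients
y = X @ true_coef + rng.normal(scale=0.5, size=200)

norms = []
for a in [1e-2, 1e0, 1e2, 1e4]:
    ridge = linear_model.Ridge(alpha=a).fit(X, y)
    # A larger alpha penalizes large coefficients more heavily
    norms.append(np.linalg.norm(ridge.coef_))
    print('Alpha:', a, 'coefficients:', ridge.coef_)

print(norms)
```

The norms should shrink as alpha increases: very small alphas behave almost like OLS, while very large ones push every coefficient toward zero.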

#### Or we can use grid search to make this faster

In [None]:
alphas = np.logspace(-10, 10, 21)
gs = grid_search.GridSearchCV(
    estimator=linear_model.Ridge(),
    param_grid={'alpha': alphas},
    scoring='neg_mean_squared_error')
# 'neg_mean_squared_error' replaces the older 'mean_squared_error' scorer name;
# both report the negated MSE so that a higher score always means a better model

gs.fit(modeldata, y)

Best score 

In [None]:
print(gs.best_score_)

The score is reported as negative MSE, so flip the sign to get the MSE itself

In [None]:
print(-gs.best_score_)

Explain which grid_search setup worked best

In [None]:
print(gs.best_estimator_)

Show all the grid pairings and their performances

In [None]:
print(gs.grid_scores_)
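
In recent scikit-learn releases the `grid_search` module and the `grid_scores_` attribute are no longer available. A rough equivalent, assuming `sklearn.model_selection.GridSearchCV` and its `cv_results_` dictionary, might look like the sketch below. The data is synthetic, and the alpha range is narrowed purely to keep the output short.

```python
import numpy as np
from sklearn import linear_model, model_selection

rng = np.random.RandomState(0)
X = rng.normal(size=(120, 3))  # hypothetical features
y = X @ np.array([3.0, 0.0, -1.5]) + rng.normal(scale=0.3, size=120)

gs = model_selection.GridSearchCV(
    estimator=linear_model.Ridge(),
    param_grid={'alpha': np.logspace(-3, 3, 7)},
    scoring='neg_mean_squared_error',
    cv=5)
gs.fit(X, y)

# Scores are negated MSE, so flip the sign to report the MSE per setting
for params, score in zip(gs.cv_results_['params'],
                         gs.cv_results_['mean_test_score']):
    print(params, -score)

print('Best alpha:', gs.best_params_['alpha'])
print('Best MSE:', -gs.best_score_)
```

`cv_results_` holds one entry per parameter setting, so it plays the role that `grid_scores_` plays in the cell above.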

### _Section 7.5_
#### Gradient Descent

In [None]:
num_to_approach, start, steps, optimized = 6.2, 0., [-1, 1], False
while not optimized:
    # Distance from the current position to the target
    current_distance = np.abs(num_to_approach - start)
    got_better = False
    # Candidate positions: one step in each direction
    next_steps = [start + i for i in steps]
    for n in next_steps:
        distance = np.abs(num_to_approach - n)
        if distance < current_distance:
            got_better = True
            print(distance, 'is better than', current_distance)
            current_distance = distance
            start = n
    if got_better:
        print('found better solution! using', current_distance)
    else:
        # No candidate improved on the current position, so stop
        optimized = True
        print(start, 'is closest to', num_to_approach)

#### Bonus: 
Implement a stopping point, similar to what `n_iter` would do in gradient descent, for when we've reached "good enough".

#### Demo: Application of Gradient Descent 

In [None]:
lm = linear_model.SGDRegressor()
lm.fit(modeldata, y)
print("Gradient Descent R2:", lm.score(modeldata, y))
print("Gradient Descent MSE:", metrics.mean_squared_error(y, lm.predict(modeldata)))

#### Check: 
- Untuned, how well did gradient descent perform compared to OLS?

Answer: 

### _Section 7.6_
#### Independent Practice: Bike data revisited

There are tons of ways to approach a regression problem. The regularization techniques appended to ordinary least squares optimize the size of the coefficients to best account for error. Gradient descent also introduces a learning rate (how aggressively do we solve the problem?), epsilon (at what point do we say the error margin is acceptable?), and iterations (when should we stop no matter what?).
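
The three knobs named above can be seen together in a minimal sketch that minimizes the one-dimensional squared error `(x - target)**2`. The function `gradient_descent` and its default values are hypothetical, written for illustration; this is not how `SGDRegressor` is implemented internally.

```python
def gradient_descent(target, start=0.0, learning_rate=0.1,
                     epsilon=1e-6, max_iter=1000):
    """Minimize f(x) = (x - target)**2 by following its gradient."""
    x = start
    iterations = 0
    while iterations < max_iter:          # iterations: stop no matter what
        gradient = 2 * (x - target)       # derivative of the squared error
        step = learning_rate * gradient   # learning rate: how big a move
        if abs(step) < epsilon:           # epsilon: change is small enough
            break
        x -= step
        iterations += 1
    return x, iterations

x_min, used = gradient_descent(6.2)
print(x_min, used)
```

Trying a larger `learning_rate` (for example 0.9) or a smaller `max_iter` shows how each setting trades speed against accuracy.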

For this deliverable, our goals are to:

- implement the gradient descent approach to our bike-share modeling problem,
- show how gradient descent solves and optimizes the solution,
- demonstrate the grid_search module!

While exploring the Gradient Descent regressor object, you'll build a grid search using the stochastic gradient descent estimator for the bike-share data set. Continue with either the model you evaluated last class or the simpler one from today. In particular, be sure to implement the "param_grid" in the grid search to get answers for the following questions:

- With a set of alpha values between 10^-10 and 10^-1, how does the mean squared error change?
- Based on the data, we know when to properly use l1 vs l2 regularization. By using a grid search with l1_ratios between 0 and 1 (increasing every 0.05), does that statement hold true? If not, did gradient descent have enough iterations?
- How do these results change when you alter the learning rate (eta0)?
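
As a starting hint, the value ranges described in the questions above could be generated as follows. The variable names `alphas` and `l1_ratios` are illustrative; how they are wired into `param_grid` is left to the exercise.

```python
import numpy as np

# Ten alphas spaced evenly on a log scale from 1e-10 to 1e-1
alphas = np.logspace(-10, -1, 10)

# l1_ratio values from 0 to 1 in steps of 0.05 (21 values)
l1_ratios = np.linspace(0, 1, 21)

print(alphas)
print(l1_ratios)
```

One detail to keep in mind: in scikit-learn's `SGDRegressor`, `l1_ratio` only takes effect when `penalty='elasticnet'`.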

**Bonus**: 
- Can you see the advantages and disadvantages of using gradient descent after finishing this exercise?

#### Independent Practice Starter Code

In [None]:
params = {} # put your gradient descent parameters here
gs = grid_search.GridSearchCV(
    estimator=linear_model.SGDRegressor(),
    cv=cross_validation.KFold(len(modeldata), n_folds=5, shuffle=True),
    param_grid=params,
    scoring='neg_mean_squared_error',
    )

gs.fit(modeldata, y)

print('BEST ESTIMATOR')
print(-gs.best_score_)
print(gs.best_estimator_)
print('ALL ESTIMATORS')
print(gs.grid_scores_)

In [None]:
## go for it!