<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Linear Regression Review Lab

_Authors: Alexander Combs (NYC)_

---

In [None]:
import numpy as np
import pandas as pd
import random

import matplotlib
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

%matplotlib inline

### Create a Python dictionary 

- Use the following as the keys: 'X' and 'Y'
- Create two lists to use as the values in the dictionary: <br>
    for 'X': 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 <br>
    for 'Y': .5, .7, .8, .99, 1, 1.4, 1.8, 2.1, 2.4, 2.9

In [None]:
my_dict = {
    'X': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ],
    'Y': [.5, .7, .8, .99, 1, 1.4, 1.8, 2.1, 2.4, 2.9]
}

my_dict

### Using that dictionary, create a pandas DataFrame and call it pre_df

In [None]:
pre_df = pd.DataFrame(my_dict)

### Using the Series from the DataFrame, create two new Series

- The first Series should take the 'X' values and add 10 to each value
- The second Series should take the 'Y' values and add 3 to each value
- Put the two new Series into a new DataFrame and save it as new_data (hint: zip())

Note: the original DataFrame should be unchanged (don't save to pre_df as new columns)

In [None]:
x_series = pre_df['X'] + 10
y_series = pre_df['Y'] + 3

new_data = pd.DataFrame(list(zip(x_series,y_series)), columns = ['X','Y'])
new_data

### Using pd.concat, vertically concat the new DataFrame, new_data, to the original pre_df DataFrame. Save it as df.

Hint: Be mindful of your column names, and make sure your index is 0-based and continuous.

In [None]:
df = pd.concat([pre_df,new_data], ignore_index = True)
df
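
To see why the index hint matters, here is a minimal sketch using two made-up two-row frames (`a` and `b` are hypothetical, not the lab's data). It compares the index you get with and without `ignore_index=True`.

```python
import pandas as pd

# Hypothetical toy frames, used only to illustrate the index behaviour
a = pd.DataFrame({'X': [1, 2], 'Y': [0.5, 0.7]})
b = pd.DataFrame({'X': [11, 12], 'Y': [3.5, 3.7]})

kept = pd.concat([a, b])                      # each frame keeps its own index
reset = pd.concat([a, b], ignore_index=True)  # index is rebuilt from 0

kept_index = list(kept.index)
reset_index = list(reset.index)
print(kept_index)   # index labels repeat
print(reset_index)  # index is 0-based and continuous
```

Without `ignore_index=True` the labels repeat, which makes label-based lookups and index joins ambiguous later on.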

### Plot the df DataFrame using pandas + matplotlib

- Set the figure size to 12 wide and 6 height
- Add a title, 'X vs Y' to the plot
- Set the size of the markers to 50 and the color of the markers to black

In [None]:
df.plot(x='X', y='Y', kind='scatter', color='black',
        figsize=(12, 6), title='X vs Y', s=50)

### Using statsmodels, fit an OLS regression to your data and print out the summary

In [None]:
import statsmodels.api as sm

Y = df['Y']
X = sm.add_constant(df['X'])  # adds the intercept column ('const')

model = sm.OLS(Y, X)
results = model.fit()
results.summary()

## Using the model you fitted, answer the following questions:

### What is the R-squared for the model?

In [None]:
results.rsquared

### What is the p-value for your X?

In [None]:
results.t_test([0, 1]).pvalue

### What is the intercept?

In [None]:
results.params['const']

### Using the above, write the equation for our model

In [None]:
# Y = -0.0857 + 0.29*X
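
As a cross-check on the coefficients, the closed-form formulas for simple linear regression (slope = S_xy / S_xx, intercept = y_bar - slope * x_bar) can be evaluated in plain Python. This is only a sketch: it rebuilds the lab's 20 points by hand, and the variable names are illustrative.

```python
# Closed-form simple linear regression, written out in plain Python.
base_y = [.5, .7, .8, .99, 1, 1.4, 1.8, 2.1, 2.4, 2.9]
xs = list(range(1, 11)) + [x + 10 for x in range(1, 11)]
ys = base_y + [y + 3 for y in base_y]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sums of cross-deviations and squared deviations
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)

slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
print(round(intercept, 4), round(slope, 4))
```

The printed values should agree with the rounded intercept and slope in the equation.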

### Solve the equation for an x of 20 then 21 (by hand/calculator)

In [None]:
.29 * 20 - .0857

In [None]:
-.0857 + .29 * 21

### Using the predict functionality of statsmodels, predict the values for 20 and 21

Hint: You'll need to use a list - don't forget your intercept!

In [None]:
xlist = [20,21]
Xlist = sm.add_constant(xlist)

results.predict(Xlist)
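
The "don't forget your intercept" hint comes down to the shape of the design matrix: each row needs a 1 for the intercept next to the x value. Here is a numpy-only sketch of what the prediction amounts to. It assumes the rounded coefficients from the equation above, so the results will differ slightly from the statsmodels output.

```python
import numpy as np

# Rounded coefficients copied from the written equation: [intercept, slope]
params = np.array([-0.0857, 0.29])

new_x = np.array([20, 21])
# Design matrix: a column of ones (intercept) next to the x values
design = np.column_stack([np.ones(len(new_x)), new_x])

# Each prediction is 1 * intercept + x * slope
preds = design @ params
print(preds)
```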

### Get the SSE by using the predictions for every X (y_hats) and the true y values

In [None]:
y_hat = results.predict(X)
sum(np.square(y_hat - df['Y']))
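
The SSE ties back to the R-squared reported earlier: for a model with an intercept, R-squared = 1 - SSE / SST, where SST is the total sum of squares around the mean of y. The sketch below illustrates this with `np.polyfit` rather than statsmodels (a substitution made here to keep the example self-contained), on the same 20 points.

```python
import numpy as np

x = np.arange(1, 21)
y_base = np.array([.5, .7, .8, .99, 1, 1.4, 1.8, 2.1, 2.4, 2.9])
y = np.concatenate([y_base, y_base + 3])

# np.polyfit returns the highest power first: [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

sse = np.sum((y - y_hat) ** 2)     # variation left unexplained by the line
sst = np.sum((y - y.mean()) ** 2)  # total variation around the mean
r_squared = 1 - sse / sst
print(sse, r_squared)
```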

### Now plot your predictions for every X

- Plot the predictions as a line and the true y values using a scatterplot

In [None]:
fig = plt.figure(figsize=(12, 6))

plt.scatter(df['X'],df['Y'], color = 'black', s=50)
plt.title("X vs Y")
plt.xlabel("X")
plt.ylabel("Y")
plt.plot(df['X'], y_hat, color='r');

### Import PolynomialFeatures from sklearn. Then do the following:

- Instantiate a PolynomialFeatures object and save it as poly
- Documentation is [here](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)
- Use degree=5 (Hint: If that gives you more than 6 columns, you forgot to drop the constant column from X)
- Use fit_transform on X to create a numpy array of polynomial features
- Save that array as poly_feats
- Convert this array to a DataFrame and save it as poly_X
- Join this new poly_X DataFrame with df['Y'] using pd.merge (Hint: join on the index)
- Save this joined DataFrame as pdf

In [None]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=5)

In [None]:
poly_feats = poly.fit_transform(df[['X']])
poly_X = pd.DataFrame(poly_feats)

In [None]:
pdf = pd.merge(df[['Y']],poly_X, right_index=True, left_index=True)
pdf
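
For a single input column, and assuming sklearn's default `include_bias=True`, the six polynomial columns are simply the powers x^0 through x^5. The numpy-only sketch below builds the same kind of matrix with `np.vander` to show what those columns contain. It is an illustration, not a replacement for the sklearn step.

```python
import numpy as np

x = np.arange(1, 21)

# Column k holds x**k for k = 0..5, so column 0 is all ones (the bias term)
powers = np.vander(x, N=6, increasing=True)

print(powers.shape)
print(powers[:3])
```

Because column 0 is already constant, this matrix carries its own intercept term.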

### Using statsmodels as before, fit this new model and save it as poly_results

In [None]:
X = pdf[[0, 1, 2, 3, 4, 5]]
# Column 0 is already all ones, so add_constant (has_constant='skip') adds nothing
X = sm.add_constant(X)
y = pdf['Y']

lm = sm.OLS(y, X)
poly_results = lm.fit()
poly_results.summary()

### Print out the model's predictions and save them as poly_yhat

In [None]:
poly_yhat = poly_results.predict(X)
poly_yhat

### Calculate the SSE

In [None]:
sum(np.square(poly_yhat - pdf['Y']))

### Now, create a for loop that does the following:

- Iterates over the following alpha values [0, .001, .01, .25, .5, 1, 10]
- In each loop, you are going to fit a regularized regression
- See [Statsmodels Docs](http://statsmodels.sourceforge.net/devel/generated/statsmodels.regression.linear_model.OLS.fit_regularized.html) to understand how to do this
- In each loop, set the value of alpha to the value being iterated over
- Set the L1_wt parameter to 0
- In each loop print out the alpha value, the SSE, and the mean absolute value of the coefficient of the model
- You should also plot the predictions as a line and the true y's as a scatterplot as above

In [None]:
alphas = [0, .001, .01, .25, .5, 1, 10]

for alpha in alphas:
    # L1_wt=0 makes this a pure ridge (L2) fit
    reg_results = lm.fit_regularized(alpha=alpha, L1_wt=0)

    y_hat = reg_results.predict(X)

    print("Alpha: ", alpha)

    sse = np.sum(np.square(y_hat - pdf['Y']))
    print("SSE: ", sse)

    print("Mean Abs(coefficient): ", np.mean(np.abs(reg_results.params)))

    # Predictions as a line, true y values as points
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.scatter(df['X'], y, c='k')
    ax.plot(df['X'], y_hat, color='r')
    plt.show()
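
To see the mechanics behind the loop, here is a numpy-only sketch of a generic ridge fit that minimises ||y - Xb||^2 + lam * ||b||^2. It makes several assumptions that differ from the lab: x is rescaled to (0, 1] for numerical stability, every coefficient (including the intercept) is penalised, and the penalty is not scaled the way statsmodels scales it. The numbers are therefore not expected to match `fit_regularized`. The helper `ridge_fit` is hypothetical.

```python
import numpy as np

# Lab data, with x rescaled so the degree-5 powers stay well conditioned
x = np.arange(1, 21) / 20.0
y_base = np.array([.5, .7, .8, .99, 1, 1.4, 1.8, 2.1, 2.4, 2.9])
y = np.concatenate([y_base, y_base + 3])
X = np.vander(x, N=6, increasing=True)

def ridge_fit(X, y, lam):
    """Minimise ||y - Xb||^2 + lam * ||b||^2 via augmented least squares."""
    p = X.shape[1]
    # Extra rows sqrt(lam) * I with zero targets add lam * ||b||^2 to the loss
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
    y_aug = np.concatenate([y, np.zeros(p)])
    beta, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    return beta

sses, norms = [], []
for lam in [0, .001, .01, .25, .5, 1, 10]:
    beta = ridge_fit(X, y, lam)
    sses.append(np.sum((y - X @ beta) ** 2))  # fit on the training data
    norms.append(np.linalg.norm(beta))        # overall coefficient size
    print(lam, sses[-1], norms[-1])
```

As the penalty grows, the training SSE cannot decrease and the coefficient norm cannot increase, which is the pattern the questions below ask about.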

### Using the output of the above, answer the following:
- What happens to the SSE over the increasing alpha values?
- What happens to the mean abs. value of the coefficients?
- Does increasing the bias to reduce variance always mean a better model?

In [None]:
# The SSE increases as alpha increases.

In [None]:
# The mean absolute value of the coefficients shrinks toward 0.

In [None]:
# Increasing bias will not always improve the model. We want to find an optimal trade-off between bias and variance.