# Polynomial Regression

## Objectives:

1. Understand interaction effects in linear regression.

2. Learn how to read a residual plot.

3. Create higher-order terms and interaction terms with `PolynomialFeatures` from sklearn.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import statsmodels.api as sm

In [None]:
advertising = pd.read_csv('data/Advertising.csv', index_col=0)

In [None]:
y = advertising.Sales
X = advertising[['TV', 'Radio']]

Let's make sure that everything is as expected.

In [None]:
X.head(3)


In [None]:
y.head(3)

__Your Turn__

- Use statsmodels.api to fit a linear regression model to this data.

__Your Turn__

- Find y_predict (your predictions for the cities based on the model we fitted)

__Your Turn__

- Find residuals (The amount of error in your prediction for each city. Recall that the true values are in y)
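As a reminder of what predictions and residuals are, here is a minimal sketch on synthetic data. It uses NumPy's least-squares solver as a stand-in for statsmodels, and the data, coefficients, and variable names (`fitted`, `resid`) are made up for illustration, so it does not replace the exercises above. One thing it does show: the intercept has to be put into the design matrix explicitly, which is also true for `sm.OLS` (see `sm.add_constant`).

```python
import numpy as np

# Synthetic stand-in for the advertising data (hypothetical values).
rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, size=50)
radio = rng.uniform(0, 50, size=50)
sales = 3.0 + 0.05 * tv + 0.2 * radio + rng.normal(0, 1, size=50)

# Design matrix with an explicit column of ones for the intercept.
X_design = np.column_stack([np.ones_like(tv), tv, radio])

# Ordinary least squares via NumPy (stand-in for fitting with statsmodels).
coef, *_ = np.linalg.lstsq(X_design, sales, rcond=None)

fitted = X_design @ coef   # predicted value for each observation
resid = sales - fitted     # residual = true value - predicted value

print(coef)
print(resid[:5])
```

Because the model contains an intercept, the residuals average out to zero; what matters in a residual plot is whether they show any remaining pattern.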

## Residual Plot

In [None]:
plt.scatter(y_predict, residuals)
plt.hlines(y = 0, xmin = y_predict.min(), xmax = y_predict.max())
plt.title('Residual Plot for Advertising Dataset')
plt.xlabel('On the x-axis we put predicted values for Sales')
plt.ylabel('On the y-axis we put residuals (errors) for Sales')
plt.show()

### Linear Regression on Advertising Dataset - Visualization

<img src = 'images/interaction.png' width = 550>

Image source: ISLR, p. 81

## Adding Interaction Terms to the Model

In [None]:
# Sklearn has PolynomialFeatures class for creating higher order terms in the data
from sklearn.preprocessing import PolynomialFeatures

Recall that after importing the class `PolynomialFeatures`, we have to instantiate it before we can use it.

__Important parameters__

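- `degree`: The maximum degree of the polynomial terms to create. In our case we have $X_{1} = \text{TV}$ and $X_{2} = \text{Radio}$. The output always keeps the original columns $X_{1}, X_{2}$ and, by default (`include_bias=True`), also contains a constant column of ones.

If `degree=2`, the columns

$$X_{1}^{2}, X_{1}X_{2}, X_{2}^{2}$$

are added.

If `degree=3`, the columns

$$X_{1}^{2}, X_{1}X_{2}, X_{2}^{2}, X_{1}^{3}, X_{1}^{2}X_{2}, X_{1}X_{2}^{2}, X_{2}^{3}$$

are added.

- `interaction_only`: If `True`, only products of distinct features are added, and pure powers such as $X_{1}^{2}$ are left out. With our two features this adds only:


$$ X_{1}X_{2} $$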
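
To make the two parameters concrete, the following sketch enumerates the terms by hand with `itertools`. It only mimics the kind of output `PolynomialFeatures` produces. The helper names `polynomial_terms` and `label` are hypothetical, and the labels are written as plain products (`TV TV`) rather than in sklearn's own naming.

```python
from itertools import combinations, combinations_with_replacement

def polynomial_terms(names, degree, interaction_only=False):
    """List all terms of total degree 0..degree as tuples of feature names."""
    # Without replacement: each feature appears at most once (interactions only).
    # With replacement: features may repeat, which gives powers such as TV * TV.
    pick = combinations if interaction_only else combinations_with_replacement
    terms = []
    for d in range(degree + 1):
        terms.extend(pick(names, d))
    return terms

def label(term):
    """Readable label for a term; the empty tuple is the constant column."""
    return " ".join(term) if term else "1"

full = [label(t) for t in polynomial_terms(["TV", "Radio"], 2)]
inter = [label(t) for t in polynomial_terms(["TV", "Radio"], 2, interaction_only=True)]

print(full)   # ['1', 'TV', 'Radio', 'TV TV', 'TV Radio', 'Radio Radio']
print(inter)  # ['1', 'TV', 'Radio', 'TV Radio']
```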


Now, to understand the effect of `PolynomialFeatures`, let's work with the columns `['TV', 'Radio', 'Newspaper']`. Later on, for the final model, we will exclude `'Newspaper'`.

In [None]:
columns = ['TV', 'Radio', 'Newspaper']

In [None]:
# Instantiate PolynomialFeatures with degree=2
pf = PolynomialFeatures(degree=2)

In [None]:
p_data = pf.fit_transform(advertising[columns])

In [None]:
# PolynomialFeatures has a method that creates column names.
# get_feature_names was removed in scikit-learn 1.2; use get_feature_names_out instead.
p_columns = pf.get_feature_names_out(input_features=columns)

In [None]:
p_df = pd.DataFrame(p_data, columns=p_columns, index=y.index)

In [None]:
p_df.head()

__Your Turn__

- Change the parameters and observe the effect of each one.

1. Set `degree=3`. How many columns were added?

2. Set `degree=3` and `interaction_only=True`. How many columns are there now? What happened?

3. Set `degree=10`. How many columns do you have?
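
If you want to check your counts, the number of output columns can be worked out with binomial coefficients. This is a sketch assuming the default `include_bias=True`, so the constant column is counted too. The function name `n_output_columns` is hypothetical and not part of sklearn.

```python
from math import comb

def n_output_columns(n_features, degree, interaction_only=False):
    """Number of columns produced, including the constant column."""
    if interaction_only:
        # Choose d distinct features for each degree d = 0..degree.
        return sum(comb(n_features, d) for d in range(degree + 1))
    # All monomials of total degree <= degree in n_features variables.
    return comb(n_features + degree, degree)

# Two features (TV, Radio), degree 2:
print(n_output_columns(2, 2))                         # 6
print(n_output_columns(2, 2, interaction_only=True))  # 4
```

Compare the result for your own settings with `p_data.shape[1]`.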



## Fitting a Linear Regression Model with Polynomial Features

Now let's use `degree=2` and `interaction_only=True` and see whether this improves our model.

In [None]:
# This time, only use TV and Radio

In [None]:
pf = PolynomialFeatures(degree=2, interaction_only=True)

final_data = pf.fit_transform(advertising[['TV', 'Radio']])

# get_feature_names was removed in scikit-learn 1.2; use get_feature_names_out instead
final_cols = pf.get_feature_names_out(input_features=['TV', 'Radio'])
final_df = pd.DataFrame(final_data, columns=final_cols, index=y.index)

final_df.head()

In [None]:
# final_df already contains the constant column '1', so the model has an intercept
model = sm.OLS(y, final_df)

In [None]:
final_model_fitted = model.fit()

In [None]:
final_model_fitted.summary()

## Residuals for the linear model with interactions

In [None]:
y_predict = final_model_fitted.predict(final_df)

In [None]:
residuals = y - y_predict

In [None]:
plt.scatter(y_predict, residuals)
plt.hlines(y = 0, xmin = y_predict.min(), xmax = y_predict.max())
plt.title('Residual Plot for Advertising Dataset')
plt.xlabel('Predicted values for Sales')
plt.ylabel('Residuals (errors) for Sales')
plt.show()
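
To see why a missing interaction term leaves a pattern in the residuals, here is a small sketch on synthetic data. The data-generating process, the coefficients, and the helper `fit_residuals` are all assumptions made for illustration, and NumPy's least-squares solver stands in for statsmodels. The data are generated with a true interaction effect and then fitted twice, once without and once with the product term.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.uniform(0, 10, size=n)
x2 = rng.uniform(0, 10, size=n)

# Hypothetical data-generating process with a true interaction effect.
y_sim = 1.0 + 0.5 * x1 + 0.3 * x2 + 0.2 * x1 * x2 + rng.normal(0, 0.5, size=n)

def fit_residuals(design, target):
    """Fit least squares and return the residuals (target - fitted)."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef

ones = np.ones(n)
resid_additive = fit_residuals(np.column_stack([ones, x1, x2]), y_sim)
resid_interact = fit_residuals(np.column_stack([ones, x1, x2, x1 * x2]), y_sim)

# Residual sum of squares for both models.
rss_additive = float(resid_additive @ resid_additive)
rss_interact = float(resid_interact @ resid_interact)

# Leftover structure: correlation between residuals and the product x1 * x2.
corr_additive = float(np.corrcoef(resid_additive, x1 * x2)[0, 1])
corr_interact = float(np.corrcoef(resid_interact, x1 * x2)[0, 1])

print(rss_additive, rss_interact)
print(corr_additive, corr_interact)
```

In the additive fit the residuals remain correlated with the omitted product, which is the kind of structure a residual plot reveals. Once the product is included, that correlation drops to zero up to rounding error.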

__Your Turn__

- We still see some pattern in the residuals.

- Create different datasets by changing the parameters of `PolynomialFeatures`.

- Can you improve this model?


## Bonus: R-style formulas
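Many models are specified with [R-style](http://r-statistics.co/Linear-Regression.html) regression formulas. Statsmodels supports these through its formula API, so the model can be written as a string instead of building the design matrix by hand.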

In [None]:
import statsmodels.formula.api as smf

In [None]:
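# 'TV * Radio' expands to TV + Radio + TV:Radio (main effects plus their interaction);
# the formula interface adds the intercept automatically
model = smf.ols(formula='Sales ~ TV * Radio', data=advertising)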
res = model.fit()
res.summary()