## Codio Activity 8.3: Scikit-Learn Pipelines

**Estimated time: 60 minutes**

**Total Points: 24 Points**

This activity focuses on using the pipeline functionality of scikit-learn to combine a transformer with an estimator.  Specifically, you will combine the process of generating polynomial features with that of building a linear regression model.  You will use the `Pipeline` functionality from the `sklearn.pipeline` module to construct both a quadratic and cubic model.

## Index:

 - [Problem 1](#Problem-1)
 - [Problem 2](#Problem-2)
 - [Problem 3](#Problem-3)
 - [Problem 4](#Problem-4)
 - [Problem 5](#Problem-5)
 - [Problem 6](#Problem-6)

In [12]:
import numpy as np
import pandas as pd
import warnings
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

warnings.filterwarnings("ignore")

### The Data

The data will again be the automobile dataset.  You will use pipelines to build polynomial features and fit linear models that use `horsepower` to predict `mpg`.

In [13]:
auto = pd.read_csv("data/auto.csv")

In [14]:
auto.head()

| Unnamed: 0 | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 | ford torino |


[Back to top](#Index:) 

## Problem 1

### Creating a `Pipeline`

**4 Points**

First, you are to use a `Pipeline` to combine `PolynomialFeatures` and `LinearRegression`.  Be sure to set `degree=2` in the transformer, and leave all the arguments as default in your regressor.  Instantiate your pipeline as `pipe` below.  Name the transformer `quad_features` and the regressor `quad_model` inside the pipeline object. 

In [15]:
### GRADED

pipe = Pipeline(
    [
        ("quad_features", PolynomialFeatures(degree=2)),
        ("quad_model", LinearRegression()),
    ]
)

# Answer check
print(type(pipe))
print(pipe.named_steps)

<class 'sklearn.pipeline.Pipeline'>
{'quad_features': PolynomialFeatures(), 'quad_model': LinearRegression()}
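
The import cell also brings in `make_pipeline`, which is not used in this activity. As a minimal sketch of how it differs from `Pipeline`, the snippet below builds the same two steps both ways and compares the step names. The variable names `named_pipe` and `auto_pipe` are illustrative only and are not part of the graded solution.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Explicit step names, matching the names requested in Problem 1.
named_pipe = Pipeline(
    [
        ("quad_features", PolynomialFeatures(degree=2)),
        ("quad_model", LinearRegression()),
    ]
)

# make_pipeline takes only the estimators and names each step
# after its lowercased class name.
auto_pipe = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())

named_step_names = list(named_pipe.named_steps)
auto_step_names = list(auto_pipe.named_steps)
print(named_step_names)
print(auto_step_names)
```

Because this activity asks for specific step names, `Pipeline` with explicit `(name, estimator)` tuples is the form that fits here.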


[Back to top](#Index:) 

## Problem 2

### Fitting the Pipeline

**4 Points**

Fit the `pipe` object created above using `horsepower` to predict `mpg`.  Use the fitted pipeline to compute the `mean_squared_error` of your model on the training data, and assign the value as a float to `quad_pipe_mse` below.

In [16]:
### GRADED

X = auto[["horsepower"]]
y = auto["mpg"]
pipe.fit(X, y)
quad_pipe_mse = float(mean_squared_error(y, pipe.predict(X)))

# Answer check
print(type(quad_pipe_mse))
print(quad_pipe_mse)

<class 'float'>
18.98476890761722
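
To see what `pipe.fit` and `pipe.predict` do behind the scenes, the sketch below compares the pipeline with the equivalent manual steps. It uses synthetic stand-in data rather than the auto dataset, so the numbers will not match the output above. The names `X_demo`, `y_demo`, and `demo_pipe` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (an assumption; this is NOT the auto dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(50, 200, size=(100, 1))
y_demo = 60 - 0.5 * X_demo[:, 0] + 0.001 * X_demo[:, 0] ** 2 + rng.normal(0, 2, size=100)

# Manual route: transform the features, then fit and predict on the transformed array.
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_demo)
manual_model = LinearRegression().fit(X_poly, y_demo)
manual_mse = float(mean_squared_error(y_demo, manual_model.predict(X_poly)))

# Pipeline route: the same two steps, chained behind a single fit/predict interface.
demo_pipe = Pipeline(
    [
        ("quad_features", PolynomialFeatures(degree=2)),
        ("quad_model", LinearRegression()),
    ]
).fit(X_demo, y_demo)
pipe_mse = float(mean_squared_error(y_demo, demo_pipe.predict(X_demo)))

print(manual_mse, pipe_mse)
```

Both routes should report the same training error, since the pipeline simply applies the transformer before handing the result to the regressor.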


[Back to top](#Index:) 

## Problem 3

### Examining the Coefficients

**4 Points**

Now, to examine the coefficients, use the `.named_steps` attribute of the pipeline object to extract the regressor.  Assign the model as an sklearn `LinearRegression` estimator to `quad_reg` below.  Extract the coefficients from the model and assign these as an array to the variable `coefs`.

In [17]:
### GRADED

quad_reg = pipe.named_steps["quad_model"]
coefs = quad_reg.coef_  # coefficients of regressor

# Answer check
print(type(quad_reg))
print(coefs)

<class 'sklearn.linear_model._base.LinearRegression'>
[ 0.         -0.46618963  0.00123054]
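
Besides `.named_steps`, a fitted step can also be reached by indexing the pipeline directly, either by name or by position. The sketch below, on noise-free synthetic data (an assumption, not the auto dataset), checks that all three routes return the same fitted regressor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noise-free quadratic data: y = 1 + 2x + 3x^2 (illustrative only).
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 1.0 + 2.0 * X_demo[:, 0] + 3.0 * X_demo[:, 0] ** 2

demo_pipe = Pipeline(
    [
        ("quad_features", PolynomialFeatures(degree=2)),
        ("quad_model", LinearRegression()),
    ]
).fit(X_demo, y_demo)

by_named_steps = demo_pipe.named_steps["quad_model"]  # dictionary-style lookup
by_name = demo_pipe["quad_model"]                     # index the pipeline by step name
by_position = demo_pipe[-1]                           # index the pipeline by position

# All three expressions refer to the same fitted LinearRegression object.
same_object = by_named_steps is by_name is by_position
print(same_object)
print(by_position.coef_)
```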


[Back to top](#Index:) 

## Problem 4

### Considering the Bias 

**4 Points**

Note that your coefficients have 3 values.  Your model also contains an intercept term, though, which gives one more value than expected for a quadratic model with one input feature.  This happens because `PolynomialFeatures` adds a bias column (a column of ones) by default, while the `fit_intercept = True` default setting in the regressor also fits an intercept.  The bias column is redundant with that intercept, which is why its coefficient is 0 in the output above.


To get the appropriate model coefficients and intercept, you can set `include_bias = False` in the `PolynomialFeatures` transformer.  Build and fit a new pipeline named `pipe_no_bias` below.  Use the same names for the transformer and estimator in the pipeline -- `quad_features` and `quad_model`.  Determine the mean squared error of the new model and assign it as a float to `no_bias_mse`.  

In [18]:
### GRADED

pipe_no_bias = Pipeline(
    [
        ("quad_features", PolynomialFeatures(degree=2, include_bias=False)),
        ("quad_model", LinearRegression()),
    ]
)
X = auto[["horsepower"]]
y = auto["mpg"]
pipe_no_bias.fit(X, y)
no_bias_mse = float(mean_squared_error(y, pipe_no_bias.predict(X)))  # cast to a plain float, as the prompt asks


# Answer check
print(type(pipe_no_bias))
print(no_bias_mse)
print([pipe_no_bias[-1].intercept_, pipe_no_bias[-1].coef_])

<class 'sklearn.pipeline.Pipeline'>
18.98476890761722
[56.90009970211294, array([-0.46618963,  0.00123054])]
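
The effect of `include_bias` is easiest to see on the transformed array itself. As a small illustration, the sketch below applies both settings to a two-row input; `X_demo` is a made-up array, not data from the auto dataset.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two sample inputs: x = 2 and x = 3.
X_demo = np.array([[2.0], [3.0]])

# Default include_bias=True: columns are 1, x, x^2.
with_bias = PolynomialFeatures(degree=2).fit_transform(X_demo)

# include_bias=False: the column of ones is dropped, leaving x, x^2.
without_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_demo)

print(with_bias)
print(without_bias)
```

With the column of ones removed, the regressor's own intercept is the only constant term, so `coef_` holds just the linear and quadratic coefficients.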


[Back to top](#Index:) 

## Problem 5

### Building a Cubic Model with `Pipeline`

**4 Points**

Now, build a cubic model using `PolynomialFeatures` with `degree = 3`.  Instantiate the pipeline as `cubic_pipe` below, and remember to drop the bias term from the transformer.  Save your cubic model's mean squared error as a float to `cubic_mse` below.

In [19]:
cubic_pipe = Pipeline(
    [
        ("cubic_features", PolynomialFeatures(degree=3, include_bias=False)),
        ("cubic_model", LinearRegression()),
    ]
)
X = auto[["horsepower"]]
y = auto["mpg"]
cubic_pipe.fit(X, y)
cubic_mse = float(mean_squared_error(y, cubic_pipe.predict(X)))  # cast to a plain float, as the prompt asks

# Answer check
print(type(cubic_pipe))
print(cubic_mse)
print([cubic_pipe[-1].intercept_, cubic_pipe[-1].coef_])

<class 'sklearn.pipeline.Pipeline'>
18.94498981448592
[60.68478490666064, array([-5.68850128e-01,  2.07901126e-03, -2.14662591e-06])]
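
Because only the `degree` argument changes between the quadratic and cubic models, the same pipeline can be built in a loop. The sketch below does this on synthetic data (an assumption, not the auto dataset) and records the training error for each degree. The names `poly_features`, `linear_model`, and `training_mse` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic cubic relationship with noise (illustrative only).
rng = np.random.default_rng(42)
X_demo = rng.uniform(-2, 2, size=(80, 1))
x = X_demo[:, 0]
y_demo = 1 + x - 0.5 * x**2 + 0.3 * x**3 + rng.normal(0, 0.5, size=80)

training_mse = {}
for degree in (1, 2, 3):
    model = Pipeline(
        [
            ("poly_features", PolynomialFeatures(degree=degree, include_bias=False)),
            ("linear_model", LinearRegression()),
        ]
    )
    model.fit(X_demo, y_demo)
    # Error measured on the same data used for fitting.
    training_mse[degree] = float(mean_squared_error(y_demo, model.predict(X_demo)))

print(training_mse)
```

Each higher degree contains the lower-degree features, so the training error cannot increase as the degree grows. A lower training error alone does not show that the higher-degree model generalizes better.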


[Back to top](#Index:) 

## Problem 6

### Making Predictions on New Data

**4 Points**

Finally, one of the main benefits derived from using a Pipeline is that you do not need to engineer new polynomial features when predicting with new data.  Use your cubic pipeline to predict the `mpg` for a vehicle with 200 horsepower.  Assign your prediction as a numpy array to `cube_predict` below.

In [20]:
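# Pass a DataFrame so the feature name matches the data used for fitting.
cube_predict = cubic_pipe.predict(pd.DataFrame({"horsepower": [200]}))  # cubic pipe prediction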

# Answer check
print(type(cube_predict))
print(cube_predict)

<class 'numpy.ndarray'>
[12.90220247]
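
To make the benefit concrete, the sketch below compares a pipeline prediction on a raw new value with the manual route of transforming first and then predicting. It uses noise-free synthetic data where `y = x^3` (an assumption, not the auto dataset), and the step names `cubic_features` and `cubic_model` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noise-free training data following y = x^3 (illustrative only).
X_train = np.arange(10, dtype=float).reshape(-1, 1)
y_train = X_train[:, 0] ** 3

cubic_demo = Pipeline(
    [
        ("cubic_features", PolynomialFeatures(degree=3, include_bias=False)),
        ("cubic_model", LinearRegression()),
    ]
).fit(X_train, y_train)

# New raw input: no manual feature engineering is needed with the pipeline.
X_new = np.array([[10.0]])
pipe_prediction = cubic_demo.predict(X_new)

# Manual equivalent: build the polynomial features, then call the regressor.
manual_features = cubic_demo.named_steps["cubic_features"].transform(X_new)
manual_prediction = cubic_demo.named_steps["cubic_model"].predict(manual_features)

print(pipe_prediction, manual_prediction)
```

Both routes give the same value; the pipeline just removes the chance of forgetting the transform step or applying a different one at prediction time.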
