## Feature Engineering

* Demonstrate the use of **transformation** to incorporate non-linear functions into linear models

### Example - Weight vs. Height

We've previously seen simple examples of regression models, such as `weight` vs. `height`:

* A linear relationship is probably *okay* for modeling this data
* But it certainly makes some assumptions that aren't fully justified
* How can we fit more suitable, or more general, functions?

<img src="scatterplot_height_weight.JPG" alt="Drawing" style="width: 500px;"/>

**Fitting complex functions**

Let's start with a polynomial function (e.g. a cubic function):

$$
\mbox{weight} = \theta_0 + \theta_1 \times \mbox{height} + \theta_2 \times \mbox{height}^2 + \theta_3 \times \mbox{height}^3
$$

* Note that this is perfectly straightforward: the model still takes the form

    $$
    \mbox{weight} = \theta \cdot x
    $$

* We just need to use the feature vector

    $$
    x = [1, \mbox{height}, \mbox{height}^2, \mbox{height}^3]
    $$
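
To see why the cubic is still a linear model, here is a minimal sketch on synthetic data. The inputs, the parameter values, and the variable names are made up for illustration and are not taken from the dataset. It builds one feature row `[1, h, h^2, h^3]` per sample and solves for theta with ordinary least squares via `np.linalg.lstsq`.

```python
import numpy as np

# Synthetic, noise-free data: hypothetical values chosen only for illustration.
rng = np.random.default_rng(0)
h = rng.uniform(-2.0, 2.0, size=50)            # stand-in for a (rescaled) height
true_theta = np.array([1.0, -2.0, 0.5, 0.25])  # assumed "ground truth" parameters

# Feature matrix: one row [1, h, h^2, h^3] per sample.
X = np.column_stack([np.ones_like(h), h, h**2, h**3])
y = X @ true_theta

# Ordinary least squares: the cubic in h is linear in theta.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat)
```

Because the targets were generated without noise, the recovered `theta_hat` should match `true_theta` up to floating-point error. With real data the fit would only be approximate.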

In [1]:
import pandas as pd
df = pd.read_csv("weight_height.csv", sep=',', header=0)
df.head()

Unnamed: 0,Gender,Height,Weight
0,Male,73.847017,241.893563
1,Male,68.781904,162.310473
2,Male,74.110105,212.740856
3,Male,71.730978,220.04247
4,Male,69.881796,206.349801


In [2]:
import numpy as np

# Vectorized powers of the Height column (no per-row lambda needed)
df['Height^2'] = df['Height'] ** 2
df['Height^3'] = df['Height'] ** 3

df.head()

Unnamed: 0,Gender,Height,Weight,Height^2,Height^3
0,Male,73.847017,241.893563,5453.381922,402715.987625
1,Male,68.781904,162.310473,4730.950324,325403.771243
2,Male,74.110105,212.740856,5492.307721,407035.504061
3,Male,71.730978,220.04247,5145.333263,369079.789145
4,Male,69.881796,206.349801,4883.465393,341265.331673


**Code: Extracting weight and height, height^2, height^3**

In [3]:
Y = df['Weight']
X = df[['Height', 'Height^2', 'Height^3']]

print(len(X), len(Y))

10000 10000
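
The squared and cubed columns can also be generated automatically. The following sketch assumes scikit-learn's `PolynomialFeatures` transformer is available. The array `heights` is a hypothetical stand-in for `df[['Height']]`, using three of the heights shown above rounded to two decimals.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical stand-in for df[['Height']]: a single column of heights.
heights = np.array([[73.85], [68.78], [74.11]])

# degree=3 yields the columns [h, h^2, h^3]; include_bias=False omits the
# constant column, since LinearRegression fits the intercept itself by default.
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(heights)
print(X_poly.shape)
```

This produces the same three features as the manual construction, which may be more convenient when the degree is a tunable setting.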


**Code: Linear Regression on the Transformed Features**

(equivalent to fitting a cubic, i.e. a model that is non-linear in height)

In [4]:
from sklearn import linear_model

regr = linear_model.LinearRegression()
regr.fit(X,Y)

print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)

Intercept: 
 4180.2425414188765
Coefficients: 
 [-1.97710785e+02  3.09581169e+00 -1.55075939e-02]
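
The fitted parameters define a cubic in height that can be evaluated directly. As a sketch, the snippet below copies the intercept and coefficients printed above and evaluates the polynomial at a height of 70, an arbitrary value chosen for illustration. The printed coefficients are rounded, so the result is only approximate.

```python
import numpy as np

# Values copied from the output above (coefficients are rounded as printed).
intercept = 4180.2425414188765
coef = np.array([-1.97710785e+02, 3.09581169e+00, -1.55075939e-02])

h = 70.0  # arbitrary example height, in the same units as the Height column
x = np.array([h, h**2, h**3])  # same feature order as the columns of X

# theta_0 + theta_1 * h + theta_2 * h^2 + theta_3 * h^3
predicted_weight = intercept + coef @ x
print(round(predicted_weight, 1))
```

With the fitted model still in memory, calling `regr.predict` on a row with the same three features should perform the same computation using the unrounded coefficients.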


### Other Forms

**NOTE** that we can use the same approach to fit arbitrary functions of the features! E.g.,

$$
\mbox{weight} = \theta_0 + \theta_1 \times \mbox{height} + \theta_2 \times \mbox{height}^2 + \theta_3\exp{(\mbox{height})} + \theta_4\sin{(\mbox{height})}
$$

* We can apply arbitrary transformations to the **features**, and combine them freely, and the model will still be linear in the **parameters** (theta):

    $$
    \mbox{weight} = \theta \cdot x
    $$
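
As a rough sketch of this idea, the example below fits a model with polynomial, exponential, and sine features using plain least squares. The data are synthetic and the parameter values are assumptions made up for illustration. The inputs are kept on a small range because `exp` of a raw height near 70 would be on the order of 1e30, so real heights would need rescaling first.

```python
import numpy as np

# Synthetic inputs on a small range so that exp(x) stays numerically tame.
x = np.linspace(0.0, 3.0, 200)
true_theta = np.array([2.0, -1.0, 0.5, 0.1, 3.0])  # assumed parameters

# Feature matrix with columns [1, x, x^2, exp(x), sin(x)].
Phi = np.column_stack([np.ones_like(x), x, x**2, np.exp(x), np.sin(x)])
y = Phi @ true_theta  # noise-free targets

# The features are non-linear in x, but the fit is still linear in theta.
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.round(theta_hat, 6))
```

The only change from the polynomial case is the set of columns in the feature matrix; the solver is the same.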