# A taste of feature engineering

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Linear regression
Linear regression aims to find  
a best-fitting line (or curve)

In [None]:
def noise(*shape, amp=1):
    return amp * np.random.randn(*shape)

In [None]:
### make some sample data
x = np.linspace(0,10,20)
X = x[:, np.newaxis]
y = 3 + 0.3*x + noise(20, amp=0.2)
plt.axis('equal')
plt.scatter(x,y)

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
y_model = model.predict(X)

In [None]:
plt.axis('equal')
plt.scatter(x, y)
plt.plot(x, y_model, c='red')

In [None]:
model.intercept_

In [None]:
model.coef_

The basic linear regression model  
finds `a` and `b`  
such that `a + b*x` is as close to `y` as possible

In [None]:
a, b = model.intercept_, model.coef_[0]
df = pd.DataFrame({'1': np.ones_like(x), 
                   'x': x, 
                   'a+bx': a + b*x, 
                   'y_model': y_model, 
                   'y': y
                  })
df.head()

It aims to minimize the `error`,  
where `error = sum of (y[i] - y_model[i])**2`

In [None]:
error = ((y - y_model)**2).sum()
error
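
As a rough check that the fitted line really minimizes this error,  
the sketch below solves the same least-squares problem with `np.linalg.lstsq`  
and compares the error against slightly perturbed coefficients.  
The fixed seed, the `_demo` names, and the perturbation `0.1` are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (assumption, for reproducibility)
x_demo = np.linspace(0, 10, 20)
y_demo = 3 + 0.3 * x_demo + 0.2 * rng.standard_normal(20)

# feature matrix with the columns 1 and x
A = np.column_stack([np.ones_like(x_demo), x_demo])

# least-squares solution: minimizes sum of (y[i] - (a + b*x[i]))**2
coef, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
a_demo, b_demo = coef

def sq_error(a, b):
    """Sum of squared differences between y_demo and a + b*x_demo."""
    return ((y_demo - (a + b * x_demo)) ** 2).sum()

best = sq_error(a_demo, b_demo)
print(best)
print(sq_error(a_demo + 0.1, b_demo))  # shifting a makes the error larger
print(sq_error(a_demo, b_demo + 0.1))  # so does changing b
```

Any other choice of `a` and `b` gives an error at least as large as `best`.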

If `fit_intercept=False`  
then the column `1` will be omitted  
and the algorithm uses `bx` to fit `y`

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=False)
model.fit(X, y)
y_model = model.predict(X)

The fit gets worse:  
without the column `1`, the line is forced through the origin

In [None]:
plt.axis('equal')
plt.scatter(x, y)
plt.plot(x, y_model, c='red')

In [None]:
b = model.coef_[0]
df = pd.DataFrame({'1': np.ones_like(x), 
                   'x': x, 
                   'bx': b*x, 
                   'y_model': y_model, 
                   'y': y
                  })
df.head()

Adding new features (columns)  
increases the flexibility of a model
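
A minimal sketch of this idea, assuming freshly generated data with a fixed seed:  
putting the column `1` back as an explicit feature and using `fit_intercept=False`  
should recover the same line as the default model with an intercept.  
The names `x_demo`, `y_demo`, `m1`, `m2` are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # fixed seed (assumption, for reproducibility)
x_demo = np.linspace(0, 10, 20)
y_demo = 3 + 0.3 * x_demo + 0.2 * rng.standard_normal(20)

# model 1: the single feature x, intercept handled by sklearn
m1 = LinearRegression().fit(x_demo[:, np.newaxis], y_demo)

# model 2: the features 1 and x, no separate intercept
X_aug = np.column_stack([np.ones_like(x_demo), x_demo])
m2 = LinearRegression(fit_intercept=False).fit(X_aug, y_demo)

print(m1.intercept_, m1.coef_[0])  # a and b from model 1
print(m2.coef_)                    # the same a and b, now both stored as coefficients
```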

## Adding features
One may add `1` or `x**2`  
to the features  
or even `1/x` or `np.exp(x)`

In [None]:
### make some sample data
x = np.linspace(0,10,20)
X = x[:, np.newaxis]
y = 3 - 0.5*x + 0.1*x**2 + noise(20, amp=0.2)
plt.axis('equal')
plt.scatter(x,y)

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
y_model = model.predict(X)

plt.axis('equal')
plt.scatter(x, y)
plt.plot(x, y_model, c='red')

In [None]:
df = pd.DataFrame({
        '1': np.ones_like(x), 
        'x': x, 
        'x**2': x**2
    })
df.head()

In [None]:
from sklearn.linear_model import LinearRegression
X = df.values ### use the features in df to train
model = LinearRegression(fit_intercept=False)
model.fit(X, y)
y_model = model.predict(X)

In [None]:
plt.axis('equal')
plt.scatter(x, y)
plt.plot(x, y_model, c='red')

In [None]:
a,b,c = model.coef_
df = pd.DataFrame({
        '1': np.ones_like(x), 
        'x': x, 
        'x**2': x**2, 
        'a+bx+cx**2': a + b*x + c*x**2, 
        'y_model': y_model, 
        'y': y
    })
df.head()

$a+bx+cx^2$ is not linear in terms of $x$  
but it is linear in terms of $1$, $x$, and $x^2$
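
The columns `1`, `x`, `x**2` can also be generated automatically.  
The sketch below uses `PolynomialFeatures` from `sklearn.preprocessing`,  
which is not used elsewhere in this notebook and is shown only as one possible alternative;  
the names `x_demo`, `manual`, `auto` are made up for this example.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x_demo = np.linspace(0, 10, 20)

# the features built by hand: 1, x, x**2
manual = np.column_stack([np.ones_like(x_demo), x_demo, x_demo**2])

# the same features generated from the single column x
poly = PolynomialFeatures(degree=2, include_bias=True)
auto = poly.fit_transform(x_demo[:, np.newaxis])

print(auto.shape)
print(np.allclose(manual, auto))
```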

In [None]:
### make some sample data
x = np.linspace(0,10,20)
X = x[:, np.newaxis]
y = 3 - 0.5*x + 0.1*np.exp(x) + noise(20, amp=0.2)
plt.scatter(x,y)

In [None]:
df = pd.DataFrame({
        '1': np.ones_like(x), 
        'x': x, 
        'exp(x)': np.exp(x)
    })
df.head()

In [None]:
from sklearn.linear_model import LinearRegression
X = df.values ### use the features in df to train
model = LinearRegression(fit_intercept=False)
model.fit(X, y)
y_model = model.predict(X)

In [None]:
plt.scatter(x, y)
plt.plot(x, y_model, c='red')

In [None]:
a,b,c = model.coef_
df = pd.DataFrame({
        '1': np.ones_like(x), 
        'x': x, 
        'exp(x)': np.exp(x), 
        'a+bx+c exp(x)': a + b*x + c*np.exp(x), 
        'y_model': y_model, 
        'y': y
    })
df.head()
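
To see how much the `exp(x)` column helps,  
one can compare the squared error with and without it.  
This is only a sketch on freshly generated data with a fixed seed,  
using `np.linalg.lstsq` in place of `LinearRegression`;  
the helper `fit_error` and the `_demo` names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (assumption, for reproducibility)
x_demo = np.linspace(0, 10, 20)
y_demo = 3 - 0.5 * x_demo + 0.1 * np.exp(x_demo) + 0.2 * rng.standard_normal(20)

ones = np.ones_like(x_demo)
A_line = np.column_stack([ones, x_demo])                 # features 1, x
A_exp = np.column_stack([ones, x_demo, np.exp(x_demo)])  # features 1, x, exp(x)

def fit_error(A):
    """Least-squares fit of y_demo on the columns of A; returns (coefficients, squared error)."""
    coef, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
    return coef, ((y_demo - A @ coef) ** 2).sum()

coef_line, err_line = fit_error(A_line)
coef_exp, err_exp = fit_error(A_exp)

print(err_line, err_exp)  # the error drops sharply once exp(x) is a feature
print(coef_exp)           # the last entry should be close to 0.1
```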

## Categorical data
Categorical data uses numbers to  
represent each category  
but the order of those numbers carries no meaning

Say `1` means brand A,  
`2` stands for brand B, and  
`3` stands for brand C

In [None]:
### make sample data
x = np.array([1]*10 + [2]*10 + [3]*10)
x

In [None]:
y = np.concatenate([170 + noise(10), 
                    190 + noise(10), 
                    180 + noise(10)
                   ])
y

In [None]:
plt.scatter(x, y)

In [None]:
from sklearn.linear_model import LinearRegression
X = x[:, np.newaxis]
model = LinearRegression()
model.fit(X, y)
y_model = model.predict(X)

plt.scatter(x, y)
plt.scatter(x, y_model, c='red')
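
Fitting a line to the codes `1`, `2`, `3` treats them as ordered numbers.  
The sketch below, on freshly generated data with a fixed seed (an assumption),  
shows the consequence: the predictions for the three brands are equally spaced,  
so the prediction for brand B lands midway between those for A and C.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # fixed seed (assumption, for reproducibility)
x_demo = np.array([1]*10 + [2]*10 + [3]*10)
y_demo = np.concatenate([170 + rng.standard_normal(10),
                         190 + rng.standard_normal(10),
                         180 + rng.standard_normal(10)])

m = LinearRegression().fit(x_demo[:, np.newaxis], y_demo)
p1, p2, p3 = m.predict(np.array([[1], [2], [3]]))

# a straight line forces equal steps between consecutive codes
print(p2 - p1, p3 - p2)
# so brand B is predicted as the average of A and C, far from its true level of 190
print(p2, (p1 + p3) / 2)
```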

One may replace the categorical data  
by its **one-hot encoding**
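
A one-hot encoding gives each category its own column,  
with `1` in the column of the category a row belongs to and `0` elsewhere.  
A minimal sketch with plain NumPy, on a small made-up array `x_demo`:

```python
import numpy as np

x_demo = np.array([1, 1, 2, 3, 2])  # made-up category codes
categories = np.unique(x_demo)      # the distinct codes, sorted

# compare every entry with every category: one column per category
one_hot = (x_demo[:, np.newaxis] == categories).astype(int)
print(one_hot)
```

Each row contains exactly one `1`, so no order among the categories is implied.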

In [None]:
df = pd.get_dummies(x, dtype=int) ### dtype=int shows 0/1 instead of True/False in newer pandas
df

In [None]:
from sklearn.linear_model import LinearRegression
X = df.values ### use the features in df to train
model = LinearRegression(fit_intercept=False)
model.fit(X, y)
y_model = model.predict(X)

plt.scatter(x, y)
plt.scatter(x, y_model, c='red')

In [None]:
a,b,c = model.coef_
df = pd.get_dummies(x)
df['ax1+bx2+cx3'] = a*df[1] + b*df[2] + c*df[3]
df['y_model'] = y_model
df['y'] = y
df.iloc[[0,10,20],:]

In this case  
the entries of `model.coef_` are exactly  
the predicted prices  
of the three brands

In [None]:
model.coef_
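
With one-hot features and no intercept,  
the least-squares coefficient of each brand is the mean of `y` within that brand.  
The sketch below checks this on freshly generated data with a fixed seed;  
the `_demo` names and `group_means` are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # fixed seed (assumption, for reproducibility)
x_demo = np.array([1]*10 + [2]*10 + [3]*10)
y_demo = np.concatenate([170 + rng.standard_normal(10),
                         190 + rng.standard_normal(10),
                         180 + rng.standard_normal(10)])

# one-hot features, one column per brand
X_demo = pd.get_dummies(x_demo, dtype=float).values
m = LinearRegression(fit_intercept=False).fit(X_demo, y_demo)

# mean of y within each brand
group_means = np.array([y_demo[x_demo == k].mean() for k in (1, 2, 3)])

print(m.coef_)
print(group_means)
```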