# Lecture 6
1. Logarithmic Transformation
2. Standardization and Scaling
3. Polynomial Regression


## 1. Logarithmic Transformation
One of the principal tenets of the linear regression model is the idea that the relationship between the variables at play is linear. 

When that is not the case, we can sometimes apply a transformation to the data so that the transformed variables are linearly related.

Once the linear model is obtained, we can then undo the transformation to obtain our final model.

A commonly used transformation is applying a **logarithm** to *either one* or *both* of the independent and response variables.

In [None]:
import pandas as pd   # pandas is a module for data reading

mammals = pd.read_csv('./mammals.csv')     # csv is a common format for data storage

print(type(mammals))


In [None]:
import matplotlib.pyplot as plt

body_data = mammals['body']
brain_data = mammals['brain']

plt.scatter(body_data, brain_data)
plt.show()

In [None]:
import numpy as np

log_body_data = np.log(body_data)
log_brain_data = np.log(brain_data)

plt.scatter(log_body_data, log_brain_data)
plt.show()

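The transformation has helped us convert our problem into a simpler one. In this case, the relationship we see in this data may be modelled as a power law, i.e., $y=x^b$. With some middle school math, and writing $\bar{y}=\log(y)$ and $\bar{x}=\log(x)$:
$$\begin{aligned}
\log(y)&=\log(x^b) \\ 
\log(y)&=b \log(x) \\ 
\bar{y} &=b \bar{x} 
\end{aligned}$$
More generally, a power law with a constant factor, $y=ax^b$, becomes $\bar{y}=\log(a)+b\bar{x}$, a straight line with intercept $\log(a)$ and slope $b$.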


This plot is called a *log-log plot*. There is also the *semi-log plot*, where only one of the two axes is logarithmic. Read about it after class.

We can build a linear model from here.

In [None]:
from scipy import stats

beta_1, beta_0, r, p, std_err = stats.linregress(log_body_data, log_brain_data)

def myfunc(x):
  return beta_0 + beta_1 * x

mymodel = list(map(myfunc, log_body_data))

plt.scatter(log_body_data, log_brain_data)
plt.plot(log_body_data, mymodel)
plt.show()

What does this mean on the original data? 

$$ \begin{aligned} 
\log(Brain) &= \beta_0 + \beta_1 \log(Body) \\
Brain &= e^{\beta_0 + \beta_1 \log(Body)}\\
Brain &= e^{\beta_0} e^{\beta_1 \log(Body)}\\
Brain &= e^{\beta_0} {Body}^{\beta_1}
\end{aligned}
$$ 
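
The last step of the derivation can be checked numerically. The sketch below uses hypothetical coefficients `b0` and `b1` and made-up body values (none of them come from `mammals.csv`) to confirm that exponentiating the fitted line in log space gives the same numbers as the power-law form.

```python
import numpy as np

# Hypothetical coefficients, chosen only to illustrate the identity
b0, b1 = 2.0, 0.75
# Made-up positive "body" values spanning several orders of magnitude
body = np.array([0.5, 1.0, 10.0, 250.0, 6000.0])

# Exponentiate the fitted line evaluated in log space
via_log_space = np.exp(b0 + b1 * np.log(body))
# Equivalent power-law form on the original scale
via_power_law = np.exp(b0) * body ** b1

print(np.allclose(via_log_space, via_power_law))
```

Note that the identity relies on the inputs being strictly positive, since the logarithm is undefined otherwise.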

In [None]:
def myfunc_powered(x):
    return np.exp(beta_0) * (x**beta_1)

mymodel_powered = list(map(myfunc_powered, body_data))

plt.plot(body_data, mymodel_powered)

plt.scatter(body_data, brain_data)
plt.show()

Wait a second, why does it look so weird?

In [None]:
def myfunc_powered(x):
    return np.exp(beta_0) * (x**beta_1)

mymodel_powered = list(map(myfunc_powered, range(7000)))

plt.plot(range(7000), mymodel_powered)

plt.scatter(body_data, brain_data)
plt.show()
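
The semi-log case mentioned earlier can be sketched in the same way. Assume an exponential relationship $y=ae^{bx}$ (an assumption for illustration, not something observed in the mammals data): taking the logarithm of $y$ only gives $\log(y)=\log(a)+bx$, which is a straight line in $x$. The values `a_true` and `b_true` below are made up.

```python
import numpy as np

# Synthetic, noise-free data following y = a * exp(b * x) (illustrative values)
a_true, b_true = 2.0, 0.5
x_demo = np.linspace(0, 10, 50)
y_demo = a_true * np.exp(b_true * x_demo)

# Semi-log transformation: take the logarithm of y only, leave x untouched
log_y_demo = np.log(y_demo)

# Fit a straight line log(y) = intercept + slope * x
# (np.polyfit returns the highest-degree coefficient first)
slope, intercept = np.polyfit(x_demo, log_y_demo, 1)

# Undo the transformation to recover the parameters of the exponential model
a_hat = np.exp(intercept)
b_hat = slope
print(a_hat, b_hat)
```

With real, noisy data the recovered values would only approximate the underlying parameters.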

## 2. Standardization and Scaling

There are many more tricks for pre-processing the data to facilitate our modelling.

One of those techniques consists of centring the independent variables so that their mean is zero. 

Another useful transformation is the scaling of our variables. This is convenient in cases where we have features that have very different scales, where some variables have large values and others have very small ones.
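
Centring can be sketched with plain NumPy. The feature values below are made up to show two columns on very different scales; the only operation is subtracting each column's mean.

```python
import numpy as np

# Made-up feature matrix: one row per sample, one column per feature
features = np.array([[1.0, 1000.0],
                     [2.0, 3000.0],
                     [6.0, 8000.0]])

# Centring: subtract each column's mean so that every column has mean zero
centred = features - features.mean(axis=0)

print(centred)
print(centred.mean(axis=0))
```

Centring shifts the values but leaves their spread unchanged, so the two columns still differ in scale afterwards.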

### Normalization or Unit Scaling
The aim of this transformation is to convert the range of a given variable into a scale that goes from 0 to 1.

$$f_{scaled}=\frac{f-f_{min}}{f_{max}-f_{min}}$$
where $f_{min}$ and $f_{max}$ are the minimum and maximum values of this feature in the dataset.
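
The formula can be applied by hand with NumPy, as a minimal sketch on made-up values. Each column is scaled independently, using that column's own minimum and maximum.

```python
import numpy as np

# Made-up feature matrix with two columns on different scales
features = np.array([[1.0, 10.0],
                     [3.0, 250.0],
                     [5.0, 6000.0]])

f_min = features.min(axis=0)   # per-column minimum
f_max = features.max(axis=0)   # per-column maximum

# Min-max scaling: every column now runs from 0 to 1
features_scaled = (features - f_min) / (f_max - f_min)

print(features_scaled)
```

The division assumes that no column is constant; a constant column would give $f_{max}-f_{min}=0$ and would need separate handling.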

In [None]:
mammals[['body', 'brain']]

In [None]:
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler()

scaler.fit_transform(mammals[['body', 'brain']])  # Scaling applied to each column


### z-Score Scaling
An alternative method for scaling our features consists of taking into account how far away data points are from the mean.
$$f_{z-score}=\frac{f-\mu_f}{\sigma_f}$$
where $\mu_f$ is the mean and $\sigma_f$ is the standard deviation of this feature.
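
As a minimal sketch on made-up values, the z-score can be computed directly with NumPy. The example uses the population standard deviation (`ddof=0`, NumPy's default), which is also what `StandardScaler` uses.

```python
import numpy as np

# Made-up feature values
f = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = f.mean()      # mean of the feature
sigma = f.std()    # population standard deviation (ddof=0)

# z-score: distance from the mean, measured in standard deviations
f_z = (f - mu) / sigma

print(mu, sigma)
print(f_z)
```

Be aware that `pandas` computes the sample standard deviation (`ddof=1`) by default, so `Series.std()` gives slightly different values unless `ddof=0` is passed.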

In [None]:
scaler2 = preprocessing.StandardScaler()

scaler2.fit_transform(mammals[['body','brain']])

## 3. Polynomial Regression
In the previous sections we have seen how a simple transformation of the input and output variables can turn a complex model into a simpler one. In fact, we can try fitting different models using more and more complex functions. 

One important point to note is that a model is said to be linear when it is linear in the **parameters**. 

With that in mind, the 1-variable model
$$ y=\beta_0+\beta_1 x + \beta_2 x^2 +\varepsilon$$
and multivariate model 
$$ y=\beta_0 + \beta_1 x_1 +\beta_2 x_2 +\beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12}x_1x_2+\varepsilon$$
are both linear models, because each of them is linear in the parameters $\beta_{i}$.
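
To see what "linear in the parameters" buys us, here is a sketch that fits the quadratic one-variable model with ordinary least squares. The data are synthetic and noise-free, and the coefficients 1, 2 and 3 are made up; the point is that the powers of $x$ are just extra columns of a design matrix.

```python
import numpy as np

# Noise-free synthetic data from y = 1 + 2x + 3x^2 (illustrative coefficients)
x_demo = np.linspace(-2, 2, 21)
y_demo = 1.0 + 2.0 * x_demo + 3.0 * x_demo ** 2

# Design matrix with columns 1, x, x^2.
# The columns are nonlinear in x, but the model is still X @ beta.
X = np.column_stack([np.ones_like(x_demo), x_demo, x_demo ** 2])

# Ordinary least squares, exactly as for a model with three plain features
beta, residuals, rank, singular_values = np.linalg.lstsq(X, y_demo, rcond=None)

print(beta)
```

The same approach would work for the multivariate model by adding columns for $x_2$, $x_1^2$, $x_2^2$ and $x_1x_2$.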

In the examples above, the models are given by second order polynomials in one and two variables. Fitting such models to data is called *polynomial regression*, and in general the $k$-th order polynomial model in one variable is given by
$$ y=\beta_0+\beta_1 x + \beta_2 x^2 + ... + \beta_k x^k+\varepsilon.$$
* Linear regression is first order polynomial regression.
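
The bullet above can be checked on a small made-up dataset: a degree-1 `np.polyfit` should return the same slope and intercept as the closed-form simple linear regression estimates.

```python
import numpy as np

# Made-up data that is roughly linear
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# First order polynomial regression (coefficients come highest degree first)
slope_poly, intercept_poly = np.polyfit(x_demo, y_demo, 1)

# Closed-form simple linear regression estimates
x_mean, y_mean = x_demo.mean(), y_demo.mean()
slope_ols = np.sum((x_demo - x_mean) * (y_demo - y_mean)) / np.sum((x_demo - x_mean) ** 2)
intercept_ols = y_mean - slope_ols * x_mean

print(slope_poly, intercept_poly)
print(slope_ols, intercept_ols)
```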

Polynomial models can be very useful in cases where we know that nonlinear effects are present in the target variable.

A polynomial model can be viewed as a truncated [Taylor expansion](https://en.wikipedia.org/wiki/Taylor_series) of an unknown smooth function, and can therefore be used to approximate it, at least locally.
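
The connection to the Taylor series can be illustrated with a function we do know. As a sketch, we fit a third order polynomial to $e^x$ on a small interval around zero (the interval and the number of points are arbitrary choices) and compare the fitted coefficients with the Taylor coefficients $1, 1, \tfrac{1}{2}, \tfrac{1}{6}$.

```python
import numpy as np

# Sample exp(x) on a small interval around 0
x_demo = np.linspace(-0.1, 0.1, 201)
y_demo = np.exp(x_demo)

# Least-squares cubic fit; coefficients are ordered from x^3 down to the constant
coeffs = np.polyfit(x_demo, y_demo, 3)

# Taylor coefficients of exp(x) at 0, in the same order: x^3, x^2, x, 1
taylor = np.array([1 / 6, 1 / 2, 1.0, 1.0])

# Largest absolute error of the fitted polynomial on the sampled interval
max_error = np.max(np.abs(np.polyval(coeffs, x_demo) - y_demo))

print(coeffs)
print(taylor)
print(max_error)
```

The fitted and Taylor coefficients are close but not identical: least squares minimises the error over the whole interval, while the Taylor expansion matches derivatives at a single point.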

In [None]:
x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]

plt.scatter(x, y)
plt.show()

In [None]:
import numpy
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))


plt.scatter(x, y)
plt.plot(range(25), mymodel(range(25)))
plt.show()

In [None]:
numpy.polyfit(x, y, 3)

### R-Squared ($R^2$) Score
It is important to know how strong the relationship between the x and y values is; if there is no relationship, polynomial regression cannot be used to predict anything.

The strength of the relationship is measured with a value called R-squared.

For a least-squares fit scored on the data it was fitted to, the R-squared value ranges from 0 to 1, where 0 means no relationship and 1 means a perfect fit. For arbitrary predictions, `r2_score` can also be negative.
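
For reference, R-squared is defined as $R^2 = 1 - SS_{res}/SS_{tot}$, the fraction of the variance of $y$ explained by the model. The helper `r_squared` below is a hypothetical name written for this sketch, and the three prediction vectors are made up to show a perfect model, a model that always predicts the mean, and a model that does worse than the mean.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Hypothetical helper: R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0]

r2_perfect = r_squared(y_true, [1.0, 2.0, 3.0])    # predictions match exactly
r2_mean_only = r_squared(y_true, [2.0, 2.0, 2.0])  # always predict the mean
r2_reversed = r_squared(y_true, [3.0, 2.0, 1.0])   # worse than predicting the mean

print(r2_perfect, r2_mean_only, r2_reversed)
```

The third case shows that the score can drop below zero when predictions are worse than simply using the mean, which cannot happen for a least-squares polynomial fit evaluated on its own training data.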

Python and the Sklearn module will compute this value for you; all you have to do is feed it the observed y values and the model's predictions:

In [None]:
from sklearn.metrics import r2_score

print(r2_score(y, mymodel(x)))

Bad fit?

In [None]:
x = [89,43,36,36,95,10,66,34,38,20,26,29,48,64,6,5,36,66,72,40]
y = [21,46,3,35,67,95,53,72,58,10,26,34,90,33,38,20,56,2,47,15]

mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))


plt.scatter(x, y)
plt.plot(range(95), mymodel(range(95)))
plt.show()

In [None]:
print(r2_score(y, mymodel(x)))