# Transforming Numerical Variables

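**Linear models such as linear and logistic regression often perform better when numerical
variables are approximately normally distributed.** Strictly, the normality assumption of linear
regression concerns the residuals rather than the predictors, but skewed variables can make that
assumption, and the assumption of linearity, harder to satisfy. When a variable is skewed, we can
often **apply a mathematical transformation to bring its distribution closer to Gaussian**, and
sometimes even unmask linear relationships between variables and their targets.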

This means that **transforming variables may improve the performance of linear
machine learning models**. 

Commonly used mathematical transformations include the
logarithm, reciprocal, power, square and cube root transformations, as well as the Box-Cox
and Yeo-Johnson transformations.
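
As a rough illustration of how these transformations differ, the sketch below applies each of them to a synthetic right-skewed sample and compares the resulting skewness. The lognormal sample and the use of skewness as a summary are assumptions made for this example; they are not part of the Boston data used in the rest of the notebook.

```python
import numpy as np
import scipy.stats as stats

# Synthetic, strictly positive, right-skewed sample (an assumption for illustration)
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# Box-Cox and Yeo-Johnson estimate their lambda parameter from the data
x_boxcox, lambda_boxcox = stats.boxcox(x)
x_yeojohnson, lambda_yeojohnson = stats.yeojohnson(x)

transformed = {
    "original": x,
    "log": np.log(x),
    "reciprocal": np.reciprocal(x),
    "square root": np.sqrt(x),
    "cube root": np.cbrt(x),
    "Box-Cox": x_boxcox,
    "Yeo-Johnson": x_yeojohnson,
}

# Skewness close to 0 suggests a more symmetric distribution
skewness = {name: stats.skew(values) for name, values in transformed.items()}
for name, value in skewness.items():
    print(f"{name:>12}: skewness = {value:.2f}")
```

Skewness is only one summary of shape, so the histograms and Q-Q plots used in the recipes remain the more informative check.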

In [None]:
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
import scipy.stats as stats

## Transforming variables with the logarithm

The logarithm function is commonly used to transform variables. It has a **strong effect on
the shape of the variable distribution and can only be applied to strictly positive variables**,
because the logarithm of zero or of a negative number is undefined.
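
The domain restriction is easy to check before transforming. The following minimal sketch uses a hypothetical toy column (not taken from the Boston data) that contains a zero and a negative value, to show what NumPy returns outside the valid domain.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column containing a zero and a negative value
s = pd.Series([10.0, 1.0, 0.0, -1.0], name="toy")

# Check the domain before applying the logarithm
is_strictly_positive = bool((s > 0).all())
print(is_strictly_positive)

# Suppress NumPy's warnings so that only the result is shown
with np.errstate(divide="ignore", invalid="ignore"):
    logged = np.log(s)

# log(0) evaluates to -inf and log(-1) evaluates to NaN
print(logged)
```

Neither `-inf` nor `NaN` is usable by most models, so a check like `(s > 0).all()` is worth running on every column before the transformation.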

In [None]:
import scipy.stats as stats
from sklearn.preprocessing import FunctionTransformer

In [None]:
data = pd.read_csv("data/boston.csv")
data.head()

In [None]:
def diagnostic_plots(df, variable):
    # function to plot a histogram and a Q-Q plot
    # side by side, for a certain variable
    plt.figure(figsize=(12,4))
    plt.subplot(1, 2, 1)
    df[variable].hist(bins=30)
    plt.subplot(1, 2, 2)
    stats.probplot(df[variable], dist="norm", plot=plt)
    plt.show()

In [None]:
diagnostic_plots(data, 'LSTAT')

In [None]:
data_tf = data.copy()

In [None]:
data_tf[['LSTAT', 'NOX', 'DIS', 'RM']] = np.log(data[['LSTAT', 'NOX', 'DIS', 'RM']])

In [None]:
diagnostic_plots(data_tf, 'LSTAT')

In [None]:
transformer = FunctionTransformer(np.log)

In [None]:
data_tf = transformer.transform(data[['LSTAT', 'NOX', 'DIS', 'RM']])

## Transforming variables with the reciprocal function

The **reciprocal function, defined as 1/x, is a strong transformation with a very drastic effect
on the variable distribution.** It isn't defined for the value 0, but it can be applied to negative
numbers.
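
The behaviour described above can be sketched on a few toy values (assumed here for illustration). The example also shows a NumPy detail worth knowing: `np.reciprocal` uses integer arithmetic on integer arrays, so integer columns should be cast to float first.

```python
import numpy as np

# Negative and positive values are both valid inputs
values = np.array([-4.0, -0.5, 0.5, 4.0])
recip = np.reciprocal(values)
print(recip)  # [-0.25 -2.    2.    0.25]

# At zero the reciprocal is undefined; NumPy returns inf and raises a warning,
# which is suppressed here
with np.errstate(divide="ignore"):
    recip_zero = np.reciprocal(np.array([0.0]))
print(recip_zero)

# Integer arrays are processed with integer arithmetic
ints = np.array([1, 2, 4])
recip_ints = np.reciprocal(ints)
print(recip_ints)  # [1 0 0]

# Casting to float gives the intended result
recip_floats = np.reciprocal(ints.astype(float))
print(recip_floats)  # [1.   0.5  0.25]
```

Note that among positive values the reciprocal reverses the ordering: the largest original value becomes the smallest transformed value.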

In [None]:
from sklearn.preprocessing import FunctionTransformer


data = pd.read_csv("data/boston.csv")

In [None]:
def diagnostic_plots(df, variable):
    # function to plot a histogram and a Q-Q plot
    # side by side, for a certain variable
    plt.figure(figsize=(10,4))
    plt.subplot(1, 2, 1)
    df[variable].hist(bins=30)
    plt.subplot(1, 2, 2)
    stats.probplot(df[variable], dist="norm", plot=plt)
    plt.show()

In [None]:
diagnostic_plots(data, 'DIS')

In [None]:
transformer = FunctionTransformer(np.reciprocal)

In [None]:
data_tf = transformer.transform(data[['LSTAT', 'NOX', 'DIS', 'RM']])

In [None]:
data_tf = pd.DataFrame(data_tf, columns=['LSTAT', 'NOX', 'DIS', 'RM'])
diagnostic_plots(data_tf, 'DIS')

## Using power transformations on numerical variables

**Power functions are mathematical transformations** of the form `X_t = X^lambda`,
where `X_t` is the transformed variable and lambda can be any exponent.

The square and cube root transformations are special
cases of power transformations where lambda is 1/2 or 1/3, respectively. 

In practice, we **try
different lambdas to determine which one offers the best transformation**.
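
One possible way to try several lambdas is sketched below. The candidate grid, the synthetic sample, and the choice of absolute skewness as the selection criterion are all assumptions for this example; inspecting the Q-Q plots, as the recipes in this notebook do, is an equally valid criterion.

```python
import numpy as np
import scipy.stats as stats

# Synthetic, strictly positive, right-skewed sample (an assumption for illustration)
rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=0.5, size=1000)

# Hypothetical grid of exponents to try
candidate_lambdas = [0.1, 0.3, 0.5, 1.0, 2.0]

# Skewness of the variable after each power transformation
skew_by_lambda = {lam: stats.skew(np.power(x, lam)) for lam in candidate_lambdas}
for lam, value in skew_by_lambda.items():
    print(f"lambda = {lam}: skewness = {value:.2f}")

# Keep the exponent whose transformed variable is the most symmetric
best_lambda = min(skew_by_lambda, key=lambda lam: abs(skew_by_lambda[lam]))
print("best lambda:", best_lambda)
```

The selected exponent can then be passed to `np.power` inside a `FunctionTransformer`, as in the cells that follow.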

In [None]:
from sklearn.preprocessing import FunctionTransformer

data = pd.read_csv("data/boston.csv")

In [None]:
data.head()

In [None]:
data.hist(bins=30, figsize=(10,10))
plt.show()

In [None]:
diagnostic_plots(data, 'LSTAT')

In [None]:
# make a copy of the dataframe where we will store the modified
# variables
data_tf = data.copy()

In [None]:
transformer = FunctionTransformer(lambda x: np.power(x, 0.3))

# capture variables to transform in a list
cols = ['LSTAT', 'NOX', 'DIS', 'RM']

# transform slice of dataframe with indicated variables
data_tf = transformer.transform(data[cols])

data_tf = pd.DataFrame(data_tf, columns=cols)

In [None]:
# visualize the transformation (not in book)
diagnostic_plots(data_tf, 'LSTAT')

## Using square and cube root to transform variables

The **square and cube root transformations are two specific forms of power transformations
where the exponents are 1/2 and 1/3, respectively.**

> The square root transformation is not defined for negative values, so make
> sure you only transform variables whose values are >= 0; otherwise,
> you will introduce NaN values or receive an error message.
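
The difference between the two roots on negative inputs can be sketched with a few toy values (assumed for illustration). The cube root is defined for negative numbers when computed with `np.cbrt`, whereas raising a negative number to the power 1/3 with `np.power` returns NaN.

```python
import numpy as np

values = np.array([-8.0, 0.0, 27.0])

# The square root of a negative number is NaN (warning suppressed here)
with np.errstate(invalid="ignore"):
    square_roots = np.sqrt(values)
print(square_roots)

# np.cbrt handles negative inputs
cube_roots = np.cbrt(values)
print(cube_roots)

# np.power with a fractional exponent returns NaN for negative inputs
with np.errstate(invalid="ignore"):
    power_third = np.power(values, 1 / 3)
print(power_third)
```

For variables that may contain negative values, `np.cbrt` is therefore the safer way to apply a cube root.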

In [None]:
data = pd.read_csv("data/boston.csv")

data_tf = data.copy()

transformer = FunctionTransformer(np.sqrt)

# make a list of variables to transform
cols = ['LSTAT', 'NOX', 'DIS', 'RM']

# transform slice of dataframe with indicated variables
# returns a NumPy array or a DataFrame, depending on the scikit-learn version
data_tf = transformer.transform(data[cols])

data_tf = pd.DataFrame(data_tf, columns=cols)

In [None]:
diagnostic_plots(data_tf, 'LSTAT')