As introduced in the topic *[Pandas] Data Cleaning*, clean data is the input for analytical tasks. However, building a good predictive model (in terms of both performance and computation) takes a lot more work to improve data quality. These tasks are called *feature engineering*; they get data into the appropriate format and reveal hidden insights.

*Feature engineering* tasks are technically simple, but they do require some domain knowledge. This makes *feature engineering* more of an art than a science.

# 1. Data preprocessing

## 1.1. Scaling
Scaling is a preprocessing technique defined only on numerical variables. The scaled variable keeps the shape of the original distribution but gains a specific property:
- *Min-max scaling*: to have the min of $0$ and max of $1$
- *Standardization*: to have the mean of $0$ and standard deviation of $1$
- *Manhattan normalization*: to have the absolute values sum up to $1$
- *Euclidean normalization*: to have the squared values sum up to $1$

In [2]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

#### Min-max scaling
Many Machine Learning algorithms (distance-based ones such as k-nearest neighbors, for example) work best when all variables are on the same scale, typically $[0;1]$. The formula for rescaling to $[a;b]$ is:

$$\mathbf{x}\leftarrow\frac{\mathbf{x}-\mathbf{x}_{min}}{\mathbf{x}_{max}-\mathbf{x}_{min}}(b-a)+a$$

In [7]:
df = pd.DataFrame({
    'x': [17, 22, 25, 30, 38],
    'y': [75, 81, 32, 23, 55]
})

scaler = MinMaxScaler(feature_range=(0,1))
data_scaled = scaler.fit_transform(df.values)
df_scaled = pd.DataFrame(data_scaled, columns=['x_scaled', 'y_scaled'])

df.join(df_scaled)

| | x | y | x_scaled | y_scaled |
|---|---|---|---|---|
| 0 | 17 | 75 | 0.0 | 0.896552 |
| 1 | 22 | 81 | 0.238095 | 1.0 |
| 2 | 25 | 32 | 0.380952 | 0.155172 |
| 3 | 30 | 23 | 0.619048 | 0.0 |
| 4 | 38 | 55 | 1.0 | 0.551724 |
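
The rescaling formula can also be applied by hand, which makes the role of $a$ and $b$ explicit. The following is a minimal sketch; `min_max_scale` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import numpy as np

# Same toy column 'x' as in the cell above
x = np.array([17, 22, 25, 30, 38], dtype=float)

def min_max_scale(values, a=0.0, b=1.0):
    """Rescale a 1-D array to the range [a, b] (hypothetical helper)."""
    v_min, v_max = values.min(), values.max()
    # Assumes v_max > v_min; a constant column would divide by zero
    return (values - v_min) / (v_max - v_min) * (b - a) + a

x_01 = min_max_scale(x)           # rescaled to [0, 1]
x_sym = min_max_scale(x, -1, 1)   # rescaled to [-1, 1]

print(x_01.round(6))
print(x_sym.round(6))
```

With the default range, the result should match the `x_scaled` column produced by `MinMaxScaler` above.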


#### Standardization
Also called *z-score* scaling, standardization transforms a variable so that it has a mean of $0$ and a standard deviation of $1$. It is a common default choice for data preprocessing, especially when a variable has no natural minimum or maximum. The formula for standardizing $\mathbf{x}$ is:

$$\mathbf{x}\leftarrow\frac{\mathbf{x}-\mu_\mathbf{x}}{\sigma_\mathbf{x}}$$

In [9]:
df = pd.DataFrame({
    'x': [17, 22, 25, 30, 38],
    'y': [75, 81, 32, 23, 55]
})

scaler = StandardScaler()
data_scaled = scaler.fit_transform(df.values)
df_scaled = pd.DataFrame(data_scaled, columns=['x_scaled', 'y_scaled'])

df.join(df_scaled)

| | x | y | x_scaled | y_scaled |
|---|---|---|---|---|
| 0 | 17 | 75 | -1.310622 | 0.953649 |
| 1 | 22 | 81 | -0.613483 | 1.216121 |
| 2 | 25 | 32 | -0.195199 | -0.927401 |
| 3 | 30 | 23 | 0.50194 | -1.32111 |
| 4 | 38 | 55 | 1.617363 | 0.078742 |
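
As a cross-check, the z-score formula can be computed directly with NumPy. This sketch assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses; pandas' `Series.std()` defaults to `ddof=1` and would give slightly different numbers.

```python
import numpy as np

# Same toy column 'x' as in the cell above
x = np.array([17, 22, 25, 30, 38], dtype=float)

mu = x.mean()
sigma = x.std(ddof=0)     # population standard deviation
x_std = (x - mu) / sigma  # z-scores

print(x_std.round(6))
print(x_std.mean().round(6), x_std.std().round(6))
```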


#### Normalizing
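This technique rescales a vector so that it has a total length (norm) of $1$. The length can be measured with either the Manhattan norm (L1) $\|\mathbf{x}\|_1 = |x_1|+|x_2|+\dots+|x_n|$ or the Euclidean norm (L2) $\|\mathbf{x}\|_2 = \sqrt{x_1^2+x_2^2+\dots+x_n^2}$. Note that scikit-learn's `Normalizer` works on rows (samples), which is why the code below transposes the data in order to normalize each column. The formula for normalizing is: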

$$\mathbf{x}\leftarrow\frac{\mathbf{x}}{\|\mathbf{x}\|}$$

In [12]:
df = pd.DataFrame({
    'x': [17, 22, 25, 30, 38],
    'y': [75, 81, 32, 23, 55]
})

scaler = Normalizer(norm='l1')
data_scaled = scaler.fit_transform(df.values.T).T
df_scaled = pd.DataFrame(data_scaled, columns=['x_scaled', 'y_scaled'])

df.join(df_scaled)

| | x | y | x_scaled | y_scaled |
|---|---|---|---|---|
| 0 | 17 | 75 | 0.128788 | 0.281955 |
| 1 | 22 | 81 | 0.166667 | 0.304511 |
| 2 | 25 | 32 | 0.189394 | 0.120301 |
| 3 | 30 | 23 | 0.227273 | 0.086466 |
| 4 | 38 | 55 | 0.287879 | 0.206767 |


In [4]:
df = pd.DataFrame({
    'x': [17, 22, 25, 30, 38],
    'y': [75, 81, 32, 23, 55]
})

scaler = Normalizer(norm='l2')
data_scaled = scaler.fit_transform(df.values.T).T
df_scaled = pd.DataFrame(data_scaled, columns=['x_scaled', 'y_scaled'])

df.join(df_scaled)

| | x | y | x_scaled | y_scaled |
|---|---|---|---|---|
| 0 | 17 | 75 | 0.277905 | 0.579259 |
| 1 | 22 | 81 | 0.359642 | 0.625599 |
| 2 | 25 | 32 | 0.408684 | 0.24715 |
| 3 | 30 | 23 | 0.490421 | 0.177639 |
| 4 | 38 | 55 | 0.6212 | 0.42479 |
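
Both normalizations reduce to dividing the vector by its norm, so they can be verified with `np.linalg.norm`. This is a small sketch on the `x` column only; it avoids the transpose because it works on a single 1-D vector.

```python
import numpy as np

# Same toy column 'x' as in the cells above
x = np.array([17, 22, 25, 30, 38], dtype=float)

x_l1 = x / np.linalg.norm(x, ord=1)  # absolute values sum to 1
x_l2 = x / np.linalg.norm(x, ord=2)  # squared values sum to 1

print(x_l1.round(6), np.abs(x_l1).sum())
print(x_l2.round(6), (x_l2 ** 2).sum())
```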


## 1.2. Bucketizing
Bucketizing (or binning) refers to tasks that group data into larger bins. Bucketizing sacrifices some information, but it makes data more regular and can thus help prevent overfitting. Both numerical and categorical variables can be binned.

In [1]:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

#### Discretizing

In [34]:
df = pd.DataFrame({'x': [17, 22, 25, 30, 38]})

group = pd.cut(
    df.x,
    bins=[0, 20, 30, 100],
    right=False,
    labels=['A', 'B', 'C'])

df.assign(group=group)

| | x | group |
|---|---|---|
| 0 | 17 | A |
| 1 | 22 | B |
| 2 | 25 | B |
| 3 | 30 | C |
| 4 | 38 | C |
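
The `right` parameter of `pd.cut` decides which edge of each bin is closed, which matters for values that fall exactly on a boundary (here, 30). The sketch below compares both settings on the same data; the column names `left_closed` and `right_closed` are chosen for illustration only.

```python
import pandas as pd

x = pd.Series([17, 22, 25, 30, 38])
bins = [0, 20, 30, 100]
labels = ['A', 'B', 'C']

# right=False: bins are [0, 20), [20, 30), [30, 100)
left_closed = pd.cut(x, bins=bins, right=False, labels=labels)
# right=True (the default): bins are (0, 20], (20, 30], (30, 100]
right_closed = pd.cut(x, bins=bins, right=True, labels=labels)

comparison = pd.DataFrame({
    'x': x,
    'left_closed': left_closed,
    'right_closed': right_closed,
})
print(comparison)
```

Only the boundary value 30 changes group: it belongs to `C` with `right=False` and to `B` with `right=True`.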


#### Clustering

In [48]:
np.random.seed(1)
df = pd.DataFrame({
    'x': np.random.randint(10, 100, size=10),
    'y': np.random.randint(10, 100, size=10)
})

clusterer = KMeans(3, random_state=0)
group = clusterer.fit_predict(df.values)

df.assign(group=group)

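| | x | y | group |
|---|---|---|---|
| 0 | 47 | 86 | 2 |
| 1 | 22 | 81 | 2 |
| 2 | 82 | 16 | 1 |
| 3 | 19 | 35 | 0 |
| 4 | 85 | 60 | 1 |
| 5 | 15 | 30 | 0 |
| 6 | 89 | 28 | 1 |
| 7 | 74 | 94 | 2 |
| 8 | 26 | 21 | 0 |
| 9 | 11 | 38 | 0 |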


#### Mapping

In [2]:
df = pd.DataFrame({'x': ['England', 'France', 'Germany', 'Korea', 'Japan']})

x_map = df.x.map({
    'England': 'Europe', 'France': 'Europe', 'Germany': 'Europe',
    'Korea': 'Asia', 'Japan': 'Asia'
})

df.assign(x_map=x_map)

| | x | x_map |
|---|---|---|
| 0 | England | Europe |
| 1 | France | Europe |
| 2 | Germany | Europe |
| 3 | Korea | Asia |
| 4 | Japan | Asia |


## 1.3. Encoding
Encoding is a technique that transforms a categorical variable into one or more numerical variables.

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

#### Mapping
Mapping is used to transform ordinal data to numerical data.

In [16]:
df = pd.DataFrame({
    'x': ['Low', 'Medium', 'High', 'High', 'Medium']
})

x_encoded = df.x.map({'Low': 1, 'Medium': 2, 'High': 3})

df.assign(x_encoded=x_encoded)

| | x | x_encoded |
|---|---|---|
| 0 | Low | 1 |
| 1 | Medium | 2 |
| 2 | High | 3 |
| 3 | High | 3 |
| 4 | Medium | 2 |
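
An alternative to a hand-written dictionary is an ordered pandas categorical, where the order of the levels is declared once and the integer codes follow from it. This is a sketch under the assumption that the levels are exactly `Low < Medium < High`; the `+ 1` shift is only there to reproduce the 1-based codes used above.

```python
import pandas as pd

df = pd.DataFrame({
    'x': ['Low', 'Medium', 'High', 'High', 'Medium']
})

# Declare the levels from lowest to highest
levels = ['Low', 'Medium', 'High']
x_cat = pd.Categorical(df['x'], categories=levels, ordered=True)

# .codes are 0-based positions in `levels`; shift to start at 1
x_encoded = x_cat.codes + 1

print(df.assign(x_encoded=x_encoded))
```

One difference worth knowing: a value missing from the dictionary becomes `NaN` with `map`, whereas a value missing from `levels` gets the code `-1` (so `0` after the shift).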


#### One-hot encoding
This technique is used to transform nominal data to numerical data.

In [15]:
df = pd.DataFrame({
    'x': ['Apple', 'Apple', 'Orange', 'Mango', 'Apple']
})

encoder = LabelBinarizer()
x_encoded = pd.DataFrame(
    data=encoder.fit_transform(df.x.values.reshape(-1,1)),
    columns=encoder.classes_)

df.join(x_encoded)

| | x | Apple | Mango | Orange |
|---|---|---|---|---|
| 0 | Apple | 1 | 0 | 0 |
| 1 | Apple | 1 | 0 | 0 |
| 2 | Orange | 0 | 0 | 1 |
| 3 | Mango | 0 | 1 | 0 |
| 4 | Apple | 1 | 0 | 0 |
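
For quick exploration, the same one-hot columns can be obtained with pandas alone through `pd.get_dummies`. This is a sketch of an alternative, not a replacement for the scikit-learn encoder: unlike a fitted encoder, `get_dummies` does not remember the categories, so it is less convenient when the same encoding must be applied to new data.

```python
import pandas as pd

df = pd.DataFrame({
    'x': ['Apple', 'Apple', 'Orange', 'Mango', 'Apple']
})

# dtype=int gives 0/1 columns instead of booleans
x_encoded = pd.get_dummies(df['x'], dtype=int)
result = df.join(x_encoded)

print(result)
```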


# 2. Dimensionality reduction

#### The curse of dimensionality

## 2.1. Feature selection
Feature selection refers to tasks that remove low-quality features and keep the informative ones. It helps reduce noise and computational cost. Here are some feature selection techniques that are easy to implement.

#### Filter methods
Filter methods use descriptive statistics to decide which features to filter out.

- *Missing ratio evaluation*. As a rule of thumb, features with more than 40-50% missing values can be dropped.

- *Low variance filtering*. Consider a constant feature, whose observations all have the same value: it has no predictive power and cannot explain the target variable. By extension, features with very low variance carry little information and can usually be removed. Since variance depends on scale, compare variances only across features with comparable ranges.

- *High correlation filtering*. A pair of features with a high Pearson correlation coefficient are very similar to each other and bring nearly the same information to the predictive model. Such a situation is called *multicollinearity*, and it can mislead some Machine Learning algorithms. Therefore, only one variable of a highly correlated pair should be used.
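
The three filters above can be sketched in a few lines of pandas. The data is synthetic and the thresholds (`0.5` for the missing ratio, `1e-8` for the variance, `0.95` for the correlation) are assumptions chosen for this example; suitable values depend on the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data built so that each filter has something to catch
rng = np.random.default_rng(0)
n = 100
a = rng.normal(size=n)
df = pd.DataFrame({
    'a': a,
    'a_copy': a * 2 + rng.normal(scale=0.01, size=n),  # nearly collinear with 'a'
    'b': rng.normal(size=n),
    'constant': np.ones(n),
    'mostly_missing': np.where(np.arange(n) < 60, np.nan, 1.0),  # 60% missing
})

# 1. Missing ratio evaluation
missing_ratio = df.isna().mean()
drop_missing = missing_ratio[missing_ratio > 0.5].index.tolist()

# 2. Low variance filtering
variances = df.drop(columns=drop_missing).var()
drop_low_var = variances[variances < 1e-8].index.tolist()

# 3. High correlation filtering: look only at the upper triangle,
#    so that one feature of each correlated pair is kept
kept = df.drop(columns=drop_missing + drop_low_var)
corr = kept.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
drop_corr = [col for col in upper.columns if (upper[col] > 0.95).any()]

selected = kept.drop(columns=drop_corr).columns.tolist()
print(drop_missing, drop_low_var, drop_corr, selected)
```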

#### Wrapper methods
Wrapper methods run Machine Learning algorithms on subsets of the features to detect unimportant ones.

- *Feature importance analysis*. Some ML algorithms can return feature importances. For example, Linear Regression uses variable weights (on comparably scaled features) and Decision Tree uses the sum of information gains. Feature importances express how much information each feature contributes to predicting the target variable. In this approach, features with low importance are removed.

- *Backward feature elimination*. The idea of this technique is to fit an algorithm on all input variables, then repeatedly remove, one at a time, the feature whose removal hurts the model score the least. The procedure stops once removing any remaining feature causes a noticeable drop in the score.

- *Forward feature construction*. This is basically the inverse of the previous technique: it starts with no features and repeatedly adds the one that improves the model score the most, until no remaining feature brings a significant improvement. Both backward elimination and forward construction are greedy algorithms that refit the model many times, so they are not suited to large-scale data.
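
The wrapper ideas above can be sketched on synthetic data where only `x0` and `x1` drive the target. Everything here is an assumption made for illustration: the data, the choice of estimators, the 5-fold cross-validated $R^2$ as the model score, and the `tolerance` that defines a "noticeable" drop.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: x2 and x3 are pure noise
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=['x0', 'x1', 'x2', 'x3'])
y = 3 * X['x0'] - 2 * X['x1'] + rng.normal(scale=0.1, size=200)

# Feature importance analysis with a decision tree
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
importances = pd.Series(tree.feature_importances_, index=X.columns)
print(importances.round(3))

def cv_score(features):
    """Mean cross-validated R^2 of a linear model on the given features."""
    return cross_val_score(LinearRegression(), X[features], y, cv=5).mean()

# Backward feature elimination (greedy)
tolerance = 0.01  # assumed threshold for a noticeable drop in score
selected = list(X.columns)
best = cv_score(selected)
while len(selected) > 1:
    # Score obtained after removing each remaining feature in turn
    trials = {f: cv_score([c for c in selected if c != f]) for f in selected}
    candidate = max(trials, key=trials.get)  # least harmful removal
    if best - trials[candidate] > tolerance:
        break  # every removal now hurts the score noticeably
    selected.remove(candidate)
    best = trials[candidate]

print(selected)
```

Forward construction follows the same loop in reverse, starting from an empty list and adding the most helpful feature at each step. scikit-learn also ships `sklearn.feature_selection.SequentialFeatureSelector`, whose `direction` parameter covers both variants.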

# 3. Feature synthesis

---
*♥ By Quang Hung x Thuy Linh ♥*