# **Feature engineering: extraction**

As mentioned in the [feature selection notebook](https://github.com/leobezerra/scikit-zero/blob/master/en/notebooks/Feature_selection.ipynb), the feature engineering pipeline also includes feature extraction. Feature extraction reduces the number of features in a dataset by creating new features from the existing ones and then discarding the originals.

Feature extraction techniques can bring advantages such as:
- Accuracy improvements.
- Reduced risk of overfitting.
- Faster training.
- Improved data visualization.
- Increased explainability of our model.

In this notebook we use pandas and scikit-learn (sklearn), a library for data mining and analysis. All examples use the popular iris dataset.


In [None]:
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

df = pd.read_csv(url, names=['sepal length','sepal width','petal length','petal width','target'])

Verifying the loaded data:

In [None]:
df.head()

We can see that the iris dataset has 4 features. Extracting features reduces the dimensionality of the data, in other words, the number of features. Several algorithms exist for this purpose; the ones we will cover are:


*   ***Principal component analysis (PCA)***
*   ***Linear discriminant analysis (LDA)***
*   ***Independent component analysis (ICA)***
*   ***t-Distributed Stochastic Neighbor Embedding (t-SNE)***

## ***PCA***

> An unsupervised linear dimensionality reduction technique that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated variables called principal components. The number of principal components is always less than or equal to the number of original variables. PCA reduces the volume of data with the least possible loss of information by ordering the components according to the variance they capture, so the first component explains the most variance. This makes it possible to summarize and visualize a dataset described by multiple correlated quantitative variables. Smaller feature sets are also easier to explore and make downstream algorithms run faster.




In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

features = ['sepal length', 'sepal width', 'petal length', 'petal width']

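# Separate the feature matrix from the target labels
x = df.loc[:, features].values
y = df['target'].values  # 1-D array of labels, the shape scikit-learn estimators expect
# Standardize each feature to zero mean and unit variance before PCA
x = StandardScaler().fit_transform(x)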

pca = PCA(n_components=3)
componentes = pca.fit_transform(x)



PCAdf = pd.DataFrame(data=componentes,
                     columns=['componente 1', 'componente 2', 'componente 3'])

PCAdf.head()

Joining the components with the target:

In [None]:
finaldf = pd.concat([PCAdf, df[['target']]], axis = 1)

finaldf.head()

We can see that the dataset is reduced from 4 features to 3 principal components. The first two components together contain 95.8% of the variance in the original data.
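
The variance figure above can be checked directly from the definition of PCA. The following is a minimal sketch, not the notebook's own method: it assumes scikit-learn's bundled copy of iris (`load_iris`) in place of the UCI file, computes the eigenvalues of the covariance matrix of the standardized data with NumPy, and compares the resulting variance ratios with those reported by `PCA`. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled iris copy stands in for the UCI download.
X_std = StandardScaler().fit_transform(load_iris().data)

# Covariance matrix of the standardized features (4 x 4).
cov = np.cov(X_std, rowvar=False)

# eigh returns eigenvalues in ascending order, so sort them descending
# to put the direction of largest variance (PC1) first.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Fraction of the total variance carried by each principal component.
manual_ratio = eigvals / eigvals.sum()
cumulative = np.cumsum(manual_ratio)

# Projecting onto the eigenvectors (an orthogonal transformation) gives
# the principal components; their covariance matrix should be diagonal.
scores = X_std @ eigvecs
scores_cov = np.cov(scores, rowvar=False)

sk_ratio = PCA(n_components=4).fit(X_std).explained_variance_ratio_

print("manual ratios :", np.round(manual_ratio, 4))
print("sklearn ratios:", np.round(sk_ratio, 4))
print("first two PCs :", round(float(cumulative[1]), 4))
```

The off-diagonal entries of `scores_cov` are numerically zero, which is what "linearly uncorrelated" means in the definition of PCA.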

The bar plot below shows the fraction of the original variance explained by each principal component:


In [None]:
import seaborn as sns

# Use a separate name so the iris DataFrame `df` is not overwritten
var_df = pd.DataFrame({'var': pca.explained_variance_ratio_,
                       'PC': ['PC1', 'PC2', 'PC3']})
sns.barplot(x='PC', y='var', data=var_df, color='c');
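
Instead of fixing the number of components by hand, the explained variance can drive the choice. As a hedged sketch (again assuming the bundled `load_iris` copy rather than the UCI file), passing a float between 0 and 1 as `n_components` asks scikit-learn's `PCA` to keep the smallest number of components whose cumulative explained variance exceeds that fraction. The 0.95 threshold is an illustrative choice, not a recommendation from the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled iris copy stands in for the UCI download.
X_std = StandardScaler().fit_transform(load_iris().data)

# A float in (0, 1) selects the number of components by cumulative
# explained variance; this mode requires the full SVD solver.
pca_95 = PCA(n_components=0.95, svd_solver="full")
reduced = pca_95.fit_transform(X_std)

print("components kept     :", pca_95.n_components_)
print("cumulative variance :", pca_95.explained_variance_ratio_.cumsum())
print("reduced shape       :", reduced.shape)
```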

## ***LDA***

> LDA is also a dimensionality reduction method, but it is supervised: it uses the class labels associated with each sample to linearly extract the most discriminating features. It is most commonly used as a pre-processing step in pattern classification and machine learning applications. The objective is to project a dataset onto a lower-dimensional space with good class separability, in order to avoid overfitting and reduce computational cost. LDA is similar to PCA, but where PCA finds the component axes that maximize the variance of the data, LDA finds the axes that maximize the separation between classes. Working in fewer dimensions also helps avoid overfitting by reducing the error in parameter estimation.


In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

lda = LDA(n_components=2)

X_test

Reducing to 2 features:

In [None]:
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)

X_test
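
Because LDA is supervised, the number of discriminant axes it can produce is at most `min(n_classes - 1, n_features)`, which is 2 for the three iris classes. The sketch below is a self-contained illustration under a few assumptions: it uses the bundled `load_iris` copy, mirrors the notebook's scaling and split settings, and uses illustrative variable names. It also uses the fact that scikit-learn's `LinearDiscriminantAnalysis` is a classifier as well as a transformer, so the same fitted object can be scored on the held-out data.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled iris copy stands in for the UCI download.
iris = load_iris()
X_std = StandardScaler().fit_transform(iris.data)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_std, iris.target, test_size=0.2, random_state=0
)

# Three classes allow at most 3 - 1 = 2 discriminant axes.
lda_sketch = LinearDiscriminantAnalysis(n_components=2)
X_tr_lda = lda_sketch.fit_transform(X_tr, y_tr)  # the labels are required
X_te_lda = lda_sketch.transform(X_te)

print("train / test shapes   :", X_tr_lda.shape, X_te_lda.shape)
print("between-class variance:", lda_sketch.explained_variance_ratio_)

# The fitted LDA model can also classify the original test features.
test_accuracy = lda_sketch.score(X_te, y_te)
print("test accuracy         :", test_accuracy)
```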

## ***ICA***

> A dimensionality reduction algorithm that transforms a set of variables into a new set of components so that the statistical independence between the new components is maximized. This is similar to PCA, which maps a collection of variables to statistically uncorrelated components, except that ICA goes a step further by maximizing statistical independence rather than just producing uncorrelated components. Applied to an image, this means it will find the curves and edges within the image. For example, in facial recognition, ICA will identify the eyes, nose, mouth, etc. as independent components.

In [None]:
from sklearn.decomposition import FastICA

transformer = FastICA(n_components=2, random_state=0)
# Keep the standardized matrix `x` intact and store the ICA components separately
x_ica = transformer.fit_transform(x)
x_ica
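
The difference between "uncorrelated" and "independent" is easier to see on synthetic data than on iris. The sketch below is a toy illustration, not part of the original notebook: the two source signals, the mixing matrix `A` and the helper `best_match` are all hypothetical choices. It mixes two independent signals linearly and then checks how well FastICA and PCA each recover the original sources.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.RandomState(0)
t = np.linspace(0, 8, 2000)

# Two independent, non-Gaussian sources (hypothetical toy signals).
s1 = np.sin(2 * t)            # sinusoid
s2 = np.sign(np.sin(3 * t))   # square wave
S = np.c_[s1, s2] + 0.05 * rng.normal(size=(2000, 2))

# Observed data: each column is a linear mixture of both sources.
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])
X_mixed = S @ A.T

S_ica = FastICA(n_components=2, random_state=0).fit_transform(X_mixed)
S_pca = PCA(n_components=2).fit_transform(X_mixed)

def best_match(true_sources, estimates):
    """For each true source, the highest |correlation| with any estimate."""
    k = true_sources.shape[1]
    corr = np.corrcoef(true_sources.T, estimates.T)[:k, k:]
    return np.abs(corr).max(axis=1)

ica_match = best_match(S, S_ica)
pca_match = best_match(S, S_pca)
print("ICA recovery:", np.round(ica_match, 3))
print("PCA recovery:", np.round(pca_match, 3))
```

Values close to 1 mean a recovered component lines up with one of the true sources, up to sign and scale, which ICA cannot determine. Both methods return uncorrelated components, so comparing the two printed rows shows what maximizing independence adds.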

## ***t-SNE***

> The objective of t-SNE is to take a set of points in a high-dimensional space and find a faithful representation of those points in a lower-dimensional space, usually a 2D plane. The algorithm is non-linear and adapts to the data, performing different transformations in different regions of the high-dimensional space.
> Like PCA, t-SNE reduces dimensionality. However, PCA is a linear technique that aims to maximize variance and preserve large pairwise distances, whereas t-SNE preserves only small pairwise distances, that is, local similarities. t-SNE computes a measure of similarity between pairs of instances in the high-dimensional space and in the low-dimensional space. It then tries to make the two similarity measures agree by minimizing a cost function (the Kullback-Leibler divergence).

In [None]:
from sklearn.manifold import TSNE
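# Reload the iris data to start from the original features and target
df = pd.read_csv(url, names=['sepal length','sepal width','petal length','petal width','target'])

# The default number of optimization iterations (1000) is used; the explicit
# `n_iter` argument was renamed in recent scikit-learn releases
tsne = TSNE(n_components=2, random_state=40)
points = tsne.fit_transform(df[features])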

points

Now that we have the two resulting dimensions, we can visualize them with a scatter plot, coloring each sample by its **label**.

In [None]:
import matplotlib.pyplot as plt 

df['tsne-2d-one'] = points[:,0]
df['tsne-2d-two'] = points[:,1]

plt.figure(figsize=(10,7))
sns.scatterplot(
    x="tsne-2d-one", y="tsne-2d-two",
    hue="target",
    data=df,
    legend="full",
    alpha=0.3
)

We can see that the **labels** are clearly clustered in their own subgroups. If we now used a clustering algorithm to pick out the separate clusters, we could probably assign new points to a **label** quite accurately.
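
The clustering idea can be tried out with a short sketch. This is one possible check under stated assumptions, not the notebook's method: it uses the bundled `load_iris` copy, picks k-means with three clusters as the clustering algorithm, and measures agreement with the true labels using the adjusted Rand index (1.0 means identical groupings, values near 0 mean chance-level agreement).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.metrics import adjusted_rand_score

# Assumption: the bundled iris copy stands in for the UCI download.
iris = load_iris()

# Embed the four features in 2D, as in the cells above.
embedding = TSNE(n_components=2, random_state=40).fit_transform(iris.data)

# Hypothetical choice: k-means with one cluster per species.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(embedding)

# Compare the discovered clusters with the true species labels.
ari = adjusted_rand_score(iris.target, cluster_ids)
print("embedding shape    :", embedding.shape)
print("adjusted Rand index:", round(ari, 3))
```

One caveat: scikit-learn's `TSNE` offers `fit_transform` but no separate `transform` method, so points that were not part of the original fit cannot simply be projected into an existing embedding. Assigning genuinely new samples would need a re-fit or a model trained on the original features.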