# Pre-processing

### 1$-$ [Encoding](#encoding)

### 2$-$ [Normalization](#normalization)

### 3$-$ [Imputation](#imputation)

### 4$-$ [Selection](#selection)

### 5$-$ [Extraction](#extraction)

<a name="encoding"></a>
## 1$-$ Encoding

**Encoding converts qualitative (categorical) data into numeric values.**

### 1.a $-$ Ordinal encoding.

- LabelEncoder.
- OrdinalEncoder.

In [3]:
# LabelEncoder:
# -> Used to process a single column (typically the target labels y).

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

In [4]:
# Data
y = np.array(["cat", "dog", "cat", "bird"])

In [5]:
# Create an object of the LabelEncoder class.
encoder = LabelEncoder()

# Fit the encoder with the fit() method (it learns the classes present in y)
encoder.fit(y)

LabelEncoder()

In [6]:
# Categories
# We have three classes
encoder.classes_

array(['bird', 'cat', 'dog'], dtype='<U4')

In [7]:
# Use the fitted transformer to encode our data (and any future data).

# Note:
# >>> encoder.fit(y)
# >>> encoder.transform(y)
# is equivalent to
# >>> encoder.fit_transform(y)

encoder.transform(y)

array([1, 2, 1, 0])
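The equivalence noted in the cell above can be checked directly. This is a minimal sketch that reuses the same `y`. The names `two_step` and `one_step` are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y = np.array(["cat", "dog", "cat", "bird"])

# Two-step version: learn the classes, then encode.
two_step = LabelEncoder().fit(y).transform(y)

# One-step version: learn the classes and encode in a single call.
one_step = LabelEncoder().fit_transform(y)

# Both approaches should yield the same integer codes.
print(np.array_equal(two_step, one_step))
```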

In [8]:
# Decode our data with `inverse_transform`.

encoder.inverse_transform(np.array([2, 1, 2, 0]))

array(['dog', 'cat', 'dog', 'bird'], dtype='<U4')
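One thing to keep in mind with future data: a fitted `LabelEncoder` only knows the classes it saw during `fit()`. The sketch below illustrates this with a label that is not in the training data. The label `"fish"` and the flag `unseen_rejected` are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder().fit(np.array(["cat", "dog", "cat", "bird"]))

try:
    # "fish" was never seen by fit(), so it has no integer code.
    encoder.transform(np.array(["cat", "fish"]))
    unseen_rejected = False
except ValueError as error:
    unseen_rejected = True
    print(error)
```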

In [9]:
# OrdinalEncoder:
# -> Used to process several columns (a 2D feature matrix X).

from sklearn.preprocessing import OrdinalEncoder

In [10]:
# Data
X = np.array([["cat", "fur"],
              ["dog", "fur"],
              ["cat", "fur"],
              ["bird", "feathers"]])

In [11]:
# Create an object of the OrdinalEncoder class.
encoder = OrdinalEncoder()

# Fit our transformer and transform our data in one call
encoder.fit_transform(X)

array([[1., 1.],
       [2., 1.],
       [1., 1.],
       [0., 0.]])
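By default the integer codes follow the sorted order of the values in each column, which can be inspected through `categories_`. When the categories have a meaningful order, it can be passed explicitly through the `categories` parameter. The sketch below assumes an arbitrary order (`custom_order`) purely for illustration. It is not an order implied by the data.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X = np.array([["cat", "fur"],
              ["dog", "fur"],
              ["cat", "fur"],
              ["bird", "feathers"]])

# Default behaviour: categories are sorted per column.
default_encoder = OrdinalEncoder().fit(X)
print(default_encoder.categories_)

# Hypothetical explicit order, one list per column (an assumption for this example).
custom_order = [["dog", "cat", "bird"], ["fur", "feathers"]]
custom_encoder = OrdinalEncoder(categories=custom_order)
X_custom = custom_encoder.fit_transform(X)
print(X_custom)
```

With the explicit order, each code is the position of the value in the list supplied for its column.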

### 1.b $-$ One-Hot Encoding.

- LabelBinarizer.
- MultiLabelBinarizer.
- OneHotEncoder.

In [12]:
from sklearn.preprocessing import LabelBinarizer

In [13]:
# Data
y = np.array(["cat", "dog", "cat", "bird"])

In [14]:
# LabelBinarizer
# -> Used to process only one column.

encoder = LabelBinarizer()
encoder.fit_transform(y)

array([[0, 1, 0],
       [0, 0, 1],
       [0, 1, 0],
       [1, 0, 0]])
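The list at the top of this subsection also names `MultiLabelBinarizer`, which is meant for targets where each sample can carry several labels at once. The following is a minimal sketch. The `samples` data is invented for the example.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical data: each sample is a collection of labels.
samples = [("cat", "fur"), ("bird", "feathers"), ("dog",)]

encoder = MultiLabelBinarizer()
encoded = encoder.fit_transform(samples)

# One column per distinct label, in sorted order.
print(encoder.classes_)
# One row per sample, with a 1 for every label the sample carries.
print(encoded)
```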

In [15]:
# OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

In [16]:
# Data
X = np.array([["cat", "fur"],
              ["dog", "fur"],
              ["cat", "fur"],
              ["bird", "feathers"]])

In [24]:
# OneHotEncoder
# -> Used to process several columns.

encoder = OneHotEncoder()

CSR = encoder.fit_transform(X)

In [25]:
# Compressed Sparse Row (CSR) format
CSR

<4x5 sparse matrix of type '<class 'numpy.float64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [26]:
# Show the stored (row, column) entries of the CSR matrix
print(CSR)

  (0, 1)	1.0
  (0, 4)	1.0
  (1, 2)	1.0
  (1, 4)	1.0
  (2, 1)	1.0
  (2, 4)	1.0
  (3, 0)	1.0
  (3, 3)	1.0
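The `(row, column)` pairs printed above are easier to read as a dense array. The sketch below rebuilds the same encoder, converts the result with `toarray()`, and prints `categories_` to show which value each column stands for. The name `dense` is only illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["cat", "fur"],
              ["dog", "fur"],
              ["cat", "fur"],
              ["bird", "feathers"]])

encoder = OneHotEncoder()
CSR = encoder.fit_transform(X)

# Columns 0-2 come from the first feature, columns 3-4 from the second.
print(encoder.categories_)

# Convert the sparse matrix to a regular NumPy array for inspection.
dense = CSR.toarray()
print(dense)
```

For large datasets with many categories the sparse form is usually preferable, since the dense array stores every zero explicitly.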


In [27]:
# Compressed Sparse Row (CSR) is the default output format of the
# `OneHotEncoder` class.

# If you want to use it with the `LabelBinarizer` class, pass the
# parameter `sparse_output=True`.
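As a quick illustration of the note above, the sketch below reuses the earlier `y` and requests sparse output from `LabelBinarizer`. The use of `scipy.sparse.issparse` to check the result is an addition for this example.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import LabelBinarizer

y = np.array(["cat", "dog", "cat", "bird"])

# Ask LabelBinarizer for a sparse result instead of a dense array.
encoder = LabelBinarizer(sparse_output=True)
encoded = encoder.fit_transform(y)

print(sparse.issparse(encoded))
# Converting back to dense should match the earlier LabelBinarizer output.
print(encoded.toarray())
```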