## <font color='darkblue'>Preface</font>
([article source](https://pub.towardsai.net/data-preprocessing-concepts-with-python-b93c63f14bb6)) <font size='3ptx'>**A robust method to make data ready for machine learning estimators**</font>
![1.png](images/1.png)
<br/>

**In this article, we will study some important data preprocessing methods**. Visualizing the data and putting it into a suitable form is a very important step, because it lets the estimators (<font color='brown'>algorithms</font>) fit well and reach good accuracy.

Topics to be covered:
1. <font size='3ptx'>[**Standardization**](#sect1)</font>
2. <font size='3ptx'>[**Scaling with sparse data and outliers**](#sect2)</font>
3. <font size='3ptx'>[**Normalization**](#sect3)</font>
4. <font size='3ptx'>[**Categorical Encoding**](#sect4)</font>
5. <font size='3ptx'>[**Imputation**](#sect5)</font>

<a id='sect1'></a>
## <font color='darkblue'>Standardization</font>
**Standardization is a process that deals with the mean and standard deviation of the data points**. In raw data, feature values can vary from very low to very high. To keep this from hurting model performance, we use standardization: after the transformation, each feature has a mean of zero and a standard deviation of one.

The formula for standardization is shown below:
![2.png](images/2.png)
<br/>

Many algorithms assume that the data is centered and that the variances of all features are of the same order; otherwise the estimators may not predict correctly. The sklearn library can standardize a data set with [**StandardScaler**](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) from the preprocessing module.

We import and use it in Python as follows:
```python
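# Scale the features before fitting an estimator.
# X_train and X_test are assumed to be already defined (e.g. from a train/test split).
# The scaler is fitted on the training data only and then reused on the test data.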
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
```
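
To check the claim that the mean becomes zero and the standard deviation becomes one, here is a minimal sketch on a small made-up array (`X_demo` is an assumption for illustration, not data from the article). It also reproduces the result by hand with the usual z-score formula, `(x - mean) / std`, computed per column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): two features on very different scales.
X_demo = np.array([[1.0, 100.0],
                   [2.0, 200.0],
                   [3.0, 300.0],
                   [4.0, 400.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)

# Same result by hand: z = (x - mean) / std, computed per column.
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(X_scaled.mean(axis=0))            # approximately 0 for every column
print(X_scaled.std(axis=0))             # approximately 1 for every column
print(np.allclose(X_scaled, X_manual))  # the two computations agree
```

After scaling, both columns hold the same values even though the raw features differed by a factor of 100.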

<a id='sect2'></a>
## <font color='darkblue'>Scaling with sparse data and outliers</font>

### <font color='darkgreen'>Scaling with Sparse data</font>
Scaling is another way to bring feature values into a fixed range, typically between “0” and “1”. Two scalers do this: [**MinMaxScaler**](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html), which maps each feature to a given range ([0, 1] by default), and [**MaxAbsScaler**](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler), which divides each feature by its maximum absolute value so that the result lies in [-1, 1]. Below is an example with Python:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1., 0., 2.], [2., 0., -1.], [0., 2., -1.]])
min_max_scaler = MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
print(X_train_minmax)
```

```
[[0.5 0.  1. ]
 [1.  0.  0. ]
 [0.  1.  0. ]]
```


As we can see, every value now lies in the range between “0” and “1”.

**Centering sparse data is not a good idea because it destroys the sparsity structure. Scaling sparse input without centering still makes sense, especially when the features are on different scales.**
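
As a minimal sketch of scaling sparse input without centering, the example below applies MaxAbsScaler to a small made-up matrix (`X_sparse` is an assumption for illustration). Each column is divided by its maximum absolute value, so zero entries stay zero and the number of stored values does not change.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Toy sparse matrix (values are made up); most entries are zero.
X_sparse = sparse.csr_matrix(np.array([[0., 0., 4.],
                                       [2., 0., 0.],
                                       [0., -5., 0.],
                                       [-1., 0., 0.]]))

scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X_sparse)

# Each column is divided by its maximum absolute value, so zeros stay zeros.
print(sparse.issparse(X_scaled))     # the result is still a sparse matrix
print(X_scaled.nnz == X_sparse.nnz)  # same number of stored non-zero entries
print(X_scaled.toarray())            # all values now lie in [-1, 1]
```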

### <font color='darkgreen'>Scaling with Outliers</font>
**When raw data have many outliers, scaling with the mean and variance doesn’t work well**, because the mean and variance are themselves strongly influenced by the outliers. So, we have to use a more robust method based on the median and the interquartile range (<font color='brown'>IQR</font>). The IQR is the range between the 25th and 75th percentiles; the scaler removes the median and scales the data by this quantile range.

The [**RobustScaler**](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler) takes some parameters to perform scaling:
* The first parameter is <font color='violet'>with_centering</font>; if it is true, the data is centered (the median is removed) before scaling.
* The second parameter is <font color='violet'>with_scaling</font>; if it is true, the data is scaled to the quantile range.

Example with Python:

```python
from sklearn.preprocessing import RobustScaler

X = [
    [1., 0., 2.],
    [2., 0., -1.],
    [0., 2., -1.],
    [3., 1., 1.],
    [100, 100, -100]
]

transformer = RobustScaler(with_scaling=True).fit(X)
transformer.transform(X)
```

```
array([[ -0.5,  -0.5,   1.5],
       [  0. ,  -0.5,   0. ],
       [ -1. ,   0.5,   0. ],
       [  0.5,   0. ,   1. ],
       [ 49. ,  49.5, -49.5]])
```
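
To make the median and IQR step concrete, the sketch below recomputes the same transformation by hand with NumPy on the data from the example above, assuming the default quantile range of 25% to 75%. The names `median`, `q25`, `q75`, and `X_manual` are introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1., 0., 2.],
              [2., 0., -1.],
              [0., 2., -1.],
              [3., 1., 1.],
              [100., 100., -100.]])

# Per-column median and interquartile range (75th minus 25th percentile).
median = np.median(X, axis=0)
q25, q75 = np.percentile(X, [25, 75], axis=0)

# Remove the median, then divide by the IQR.
X_manual = (X - median) / (q75 - q25)

X_robust = RobustScaler().fit_transform(X)
print(np.allclose(X_manual, X_robust))  # the manual computation matches the scaler
```

The outlier row still ends up far from the others, but it no longer distorts how the remaining rows are scaled.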

<a id='sect3'></a>
## <font color='darkblue'>Normalization</font>
**Normalization scales each individual sample (row) so that it has unit norm.** This differs from [**MinMaxScaler**](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler), which rescales each feature (column) rather than each sample. The process is useful when we plan to use a quadratic form, such as the dot product or another kernel, to measure the similarity of pairs of samples.

It is also the basis of the vector space model, i.e. the vectors that represent text data samples, which is often used in text classification and clustering.

sklearn offers two ways to normalize, as shown below:
* [**normalize**](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html#sklearn.preprocessing.normalize): A function that scales the input vectors to unit norm. The <font color='violet'>norm</font> parameter selects the norm used to normalize each non-zero sample. It accepts three values, l1, l2, and max, where l2 is the default.
* [**Normalizer**](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer): A transformer class that does the same operation. It is stateless, so the fit method does nothing and is only there so the class can be used in a pipeline.

Example with Python:

```python
from sklearn.preprocessing import normalize

X = [
    [1., 0., 2.],
    [2., 0., -1.],
    [0., 2., -1.],
    [-1., 1., -2.]
]
X_normalized = normalize(X, norm='l2')
print(X_normalized)
```

```
[[ 0.4472136   0.          0.89442719]
 [ 0.89442719  0.         -0.4472136 ]
 [ 0.          0.89442719 -0.4472136 ]
 [-0.40824829  0.40824829 -0.81649658]]
```


Example with Normalizer:

```python
from sklearn.preprocessing import Normalizer

X = [
    [1., 0., 2.],
    [2., 0., -1.],
    [0., 2., -1.],
    [-1., 1., -2.]
]

normalizer = Normalizer().fit(X)
normalizer.transform(X)
```

```
array([[ 0.4472136 ,  0.        ,  0.89442719],
       [ 0.89442719,  0.        , -0.4472136 ],
       [ 0.        ,  0.89442719, -0.4472136 ],
       [-0.40824829,  0.40824829, -0.81649658]])
```

The Normalizer is useful as an early step in a data processing pipeline.
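
A minimal sketch of that idea is shown below: the Normalizer is the first step of a sklearn pipeline, followed by an estimator. The labels `y` and the choice of `KNeighborsClassifier` are assumptions made only for illustration; any estimator could take its place.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.neighbors import KNeighborsClassifier

# Toy data and labels, made up for illustration.
X = np.array([[1., 0., 2.], [2., 0., -1.], [0., 2., -1.], [-1., 1., -2.]])
y = np.array([0, 1, 1, 0])

# Step 1 normalizes every sample to unit L2 norm, step 2 fits the classifier.
pipe = make_pipeline(Normalizer(norm="l2"), KNeighborsClassifier(n_neighbors=1))
pipe.fit(X, y)

# New data passed to predict() goes through the same normalization first.
pred = pipe.predict(X)
print(pred)
```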

When we use sparse input, it is important to convert it to CSR format to avoid unnecessary memory copies. CSR stands for Compressed Sparse Row and is available as [**scipy.sparse.csr_matrix**](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html).
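
As a short sketch, the example below builds a CSR matrix from a small made-up array (`X_dense` is an assumption for illustration) and passes it to `normalize`. The result stays sparse, and a row that contains only zeros is left unchanged.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import normalize

# Toy data, made up for illustration; the second row is all zeros.
X_dense = np.array([[1., 0., 2.],
                    [0., 0., 0.],
                    [0., 3., 4.]])
X_csr = sparse.csr_matrix(X_dense)

# normalize accepts the sparse matrix directly and returns a sparse result.
X_norm = normalize(X_csr, norm="l2")

print(sparse.issparse(X_norm))
print(X_norm.toarray())
```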

<a id='sect4'></a>
## <font color='darkblue'>Categorical Encoding</font>
**Raw data sets often contain columns that hold categories (binary or multi-class) rather than continuous values**. To turn them into numeric values we use encoding methods. Some of them are listed below:
* [**Get Dummies**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html): Creates new feature columns that encode the categories with 0 and 1, using the pandas library.
* [**Label Encoder**](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder): Encodes categories as integers from 0 to n_classes - 1 in the sklearn library. It is intended for target labels rather than for input features.
* [**One Hot Encoder**](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder): The sklearn counterpart that converts each category into a new feature column of 0 and 1 values.
* [**Hashing**](https://contrib.scikit-learn.org/category_encoders/hashing.html): Maps the categories to a fixed number of columns with a hash function. It is more practical than one-hot encoding when a feature has high cardinality, because the output dimension does not grow with the number of categories.

There are many other encoding methods, such as mean encoding, Helmert encoding, ordinal encoding, and probability ratio encoding.

Example with Python:
```python
# Assumes pandas is imported as pd and df has a categorical 'State' column.
df1 = pd.get_dummies(df['State'], drop_first=True)
```
![3.png](images/3.png)
<br/>
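
For comparison, here is a minimal sketch of the two sklearn encoders from the list above, applied to a small made-up `State` column (`df_demo` and its values are assumptions for illustration, not the data in the screenshot).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy column, made up for illustration.
df_demo = pd.DataFrame({"State": ["New York", "California", "Florida", "California"]})

# LabelEncoder: one integer per category, assigned in sorted order.
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(df_demo["State"])
print(label_encoder.classes_)
print(labels)

# OneHotEncoder: one 0/1 column per category. It expects 2D input and
# returns a sparse matrix by default, so we convert it for display.
one_hot_encoder = OneHotEncoder()
one_hot = one_hot_encoder.fit_transform(df_demo[["State"]]).toarray()
print(one_hot_encoder.categories_)
print(one_hot)
```

Unlike the `get_dummies` call with `drop_first=True`, this one-hot encoding keeps a column for every category.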

<a id='sect5'></a>
## <font color='darkblue'>Imputation</font>
When raw data have missing values, **replacing each missing entry with a numeric value is known as imputing**.

For demonstration, let's create a random data frame:

```python
# import the pandas library
import pandas as pd
import numpy as np

df = pd.DataFrame(
    np.random.randn(4, 3),
    index=['a', 'c', 'e', 'h'],
    columns=['First', 'Second', 'Three']
)

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df
```

```
      First    Second     Three
a -0.357230 -1.119431 -0.004218
b       NaN       NaN       NaN
c  0.543968 -0.053535  0.447563
d       NaN       NaN       NaN
e  0.165891  1.365003 -1.010284
f       NaN       NaN       NaN
g       NaN       NaN       NaN
h  1.011389 -1.192242 -1.463785
```


Now replace the missing values with zero:

```python
print("NaN replaced with '0':")
print(df.fillna(0))
```

```
NaN replaced with '0':
      First    Second     Three
a -0.357230 -1.119431 -0.004218
b  0.000000  0.000000  0.000000
c  0.543968 -0.053535  0.447563
d  0.000000  0.000000  0.000000
e  0.165891  1.365003 -1.010284
f  0.000000  0.000000  0.000000
g  0.000000  0.000000  0.000000
h  1.011389 -1.192242 -1.463785
```


Replacing the missing values with mean:

```python
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit_transform(df)
```

```
array([[-0.35722967, -1.11943133, -0.0042178 ],
       [ 0.34100455, -0.25005143, -0.50768083],
       [ 0.54396779, -0.05353503,  0.44756316],
       [ 0.34100455, -0.25005143, -0.50768083],
       [ 0.16589073,  1.36500263, -1.01028369],
       [ 0.34100455, -0.25005143, -0.50768083],
       [ 0.34100455, -0.25005143, -0.50768083],
       [ 1.01138933, -1.192242  , -1.46378498]])
```
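
The mean is only one possible fill value. As a minimal sketch, the example below uses the `median` strategy of SimpleImputer on small made-up arrays (`X_train` and `X_new` are assumptions for illustration). The fill values are learned from the training data and then reused on new data, in the same way the scaler was fitted on the training set earlier.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy arrays, made up for illustration; 100.0 acts as an outlier in column 0.
X_train = np.array([[1.0, 10.0],
                    [np.nan, 20.0],
                    [3.0, np.nan],
                    [100.0, 30.0]])
X_new = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy="median")
X_train_filled = imputer.fit_transform(X_train)

print(imputer.statistics_)       # per-column medians learned from X_train
print(imputer.transform(X_new))  # new data is filled with those same medians
```

With the median, the outlier in the first column does not pull the fill value upward the way the mean would.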

## <font color='darkblue'>Conclusion</font>
Data preprocessing is an important step that puts the data set into a form our estimators can work with reliably.