# Content

This notebook shows how to handle numeric data on different scales by using normalization techniques.

Three normalization techniques are commonly used in data preprocessing:
<ol>
    <li>L1 normalization - <strong>Least absolute deviations | Least absolute errors</strong></li>
    <li>L2 normalization - <strong>Least squares</strong></li>
    <li>MinMax normalization</li>
</ol>

Different normalization techniques suit different purposes and datasets. One dataset may need only a single technique, while another may need a combination of two or more.

Furthermore, normalization can be applied to a dataset by row, by column, or both. To scale all values in <strong>one feature (one column)</strong>, apply normalization <strong>by column</strong>. To put <strong>all features</strong> of a record on the same scale, apply normalization <strong>by row</strong>. In some cases, normalization can also be applied by row and then by column, or vice versa.
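As a minimal sketch of the two directions, the snippet below scales a small array once by row and once by column, dividing by the sum in each case. The array `X` is a made-up example, not part of this notebook's dataset.

```python
import numpy as np

# Hypothetical 2x2 array, used only for illustration
X = np.array([[1.0, 3.0],
              [2.0, 2.0]])

# By row (axis=1): each row is divided by its own sum
by_row = X / X.sum(axis=1, keepdims=True)

# By column (axis=0): each column is divided by its own sum
by_col = X / X.sum(axis=0, keepdims=True)

print(by_row)  # every row sums to 1
print(by_col)  # every column sums to 1
```

`keepdims=True` keeps the sums as a column or row vector so that NumPy broadcasting divides each element by the sum of its own row or column.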


<strong>When do we need to normalize our data?</strong>
<ol>
    <li>We want to look for relationships between features.</li>
    <li>We want to use regression or multivariate analysis later on, because both focus on exploring the relationships between features.</li>
    <li>We are <strong>not</strong> much concerned with the means between or within features.</li>
</ol>

<strong>When should we not normalize our data?</strong>
<ol>
    <li>In experimental research, where we compare the mean of one treatment with the mean of another.</li>
    <li>When we are dealing with categorical data (more than 2 categories per feature).</li>
</ol>

# Libraries

For reading data: <strong>pandas</strong>

For scratch implementation: <strong>NumPy</strong>

For existing implementation: <strong>scikit-learn</strong>

In [1]:
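import pandas as pd
import numpy as np

from IPython.display import display  # built in inside Jupyter; imported so the cells also run elsewhere
from sklearn.preprocessing import normalize, MinMaxScaler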

# Data preparation

In this section, data is prepared for different normalization techniques.

<strong>Data summary</strong>
<ul>
    <li>Age - numeric data (measured in the range 1-100)</li>
    <li>Number of songs listened per day - numeric data (measured in the range 5-15)</li>
    <li>Satisfaction score - numeric data (measured in the range 0-1)</li>
</ul>

In [2]:
data = pd.DataFrame(
    [[20, 10, 0.7], [30, 8, 0.4], [25, 11, 0.4], [10, 5, 0.8], [40, 7, 0.6]],
    columns=['Age', 'Songs Listened', 'Satisfaction']
)

display(data)

|   | Age | Songs Listened | Satisfaction |
|---|-----|----------------|--------------|
| 0 | 20 | 10 | 0.7 |
| 1 | 30 | 8 | 0.4 |
| 2 | 25 | 11 | 0.4 |
| 3 | 10 | 5 | 0.8 |
| 4 | 40 | 7 | 0.6 |


# Normalization techniques

## L1 Normalization

As mentioned above, L1 normalization is known as <strong>Least absolute deviations</strong>. It takes the sum of the absolute values of all elements in a particular row or column, then divides each element by that sum.

<strong>Formula:</strong> 
<img width=550 src="./images/normalization/L1/formula.PNG">

<strong>For example: (by row)</strong>
<img width=1000 src="./images/normalization/L1/example_by_row.PNG">
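The same row-wise calculation can be checked by hand. The sketch below takes the first row of this notebook's dataset and applies the steps described above; the variable names are chosen here for illustration only.

```python
import numpy as np

# First row of the dataset: Age, Songs Listened, Satisfaction
row = np.array([20, 10, 0.7])

# L1 norm: sum of absolute values (20 + 10 + 0.7 = 30.7)
l1_norm = np.sum(np.abs(row))

# Divide each element by the norm
normalized = row / l1_norm
print(normalized)

# After L1 normalization the absolute values of the row sum to 1
print(np.sum(np.abs(normalized)))
```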

<strong>Questions:</strong>
<ol>
    <li>
        <strong>When do we need to use L1 normalization?</strong>
        <ul>
            <li>We want a robust normalization for the model. (Robust normalization: normalization that performs well for data drawn from a wide range of probability distributions.)</li>
            <li>We are not much concerned with outliers. (Outliers: unusual data points in the distribution.)</li>
            <li>Stability is not required. (Stability: the fitted line changes only slightly when the data is adjusted.)</li>
            <li>We want to use sparse models, which generate mostly zeros (vectors).</li>
        </ul>
    </li>
    <li>
        <strong>What are advantages of using L1 normalization?</strong>
        <ul>
            <li>It is not sensitive to outliers.</li>
            <li>It is used for performing feature selection (we can delete all features where the coefficient is 0).</li>
            <li>It optimizes the median.</li>
        </ul>
    </li>
    <li>
    <strong>What are disadvantages of using L1 normalization?</strong>
        <ul>
            <li>It does not perform well in models that require a small number of columns, because it creates a sparse matrix.</li>
        </ul>
    </li>
</ol>

In [3]:
def l1_from_scratch(data, axis=1):
    ''' This is the implementation of L1 normalization from scratch
    
    Parameters
    ----------
    data: DataFrame to be normalized
    axis: Direction of normalization (0 = by columns, 1 = by rows)
    
    Return
    ------
    L1-normalization result
    
    Notes
    -----
    1. We assume that the data contains only numeric features (not categorical features)
    2. We assume that the data is clean (no missing values)
    '''
    # Convert to a float array; the input DataFrame is not modified
    vals = data.to_numpy(dtype=float)
    
    # L1 norm: sum of absolute values along the chosen axis.
    # keepdims=True keeps the result broadcastable against vals.
    total = np.sum(np.abs(vals), axis=axis, keepdims=True)
    
    # Divide every element by the norm of its own row or column
    res = vals / total
    
    # Make DataFrame
    return pd.DataFrame(res, columns=data.columns, index=data.index)

In [4]:
print("INPUT:")
display(data)

print("OUTPUT (from scratch):")
display(l1_from_scratch(data, axis=1))

print("OUTPUT (sklearn):")
display(
    pd.DataFrame(
        normalize(data, norm='l1'), columns=data.columns
    )
)

INPUT:


|   | Age | Songs Listened | Satisfaction |
|---|-----|----------------|--------------|
| 0 | 20 | 10 | 0.7 |
| 1 | 30 | 8 | 0.4 |
| 2 | 25 | 11 | 0.4 |
| 3 | 10 | 5 | 0.8 |
| 4 | 40 | 7 | 0.6 |


OUTPUT (from scratch):


|   | Age | Songs Listened | Satisfaction |
|---|-----|----------------|--------------|
| 0 | 0.651466 | 0.325733 | 0.022801 |
| 1 | 0.78125 | 0.208333 | 0.010417 |
| 2 | 0.686813 | 0.302198 | 0.010989 |
| 3 | 0.632911 | 0.316456 | 0.050633 |
| 4 | 0.840336 | 0.147059 | 0.012605 |


OUTPUT (sklearn):


|   | Age | Songs Listened | Satisfaction |
|---|-----|----------------|--------------|
| 0 | 0.651466 | 0.325733 | 0.022801 |
| 1 | 0.78125 | 0.208333 | 0.010417 |
| 2 | 0.686813 | 0.302198 | 0.010989 |
| 3 | 0.632911 | 0.316456 | 0.050633 |
| 4 | 0.840336 | 0.147059 | 0.012605 |


## L2 Normalization

As mentioned above, L2 normalization is known as <strong>Least squares</strong>. It takes the sum of the squares of all elements in a particular row or column, then divides each element by the square root of that sum.

<strong>Formula:</strong> 
<img width=550 src="./images/normalization/L2/formula.PNG">

<strong>For example: (by row)</strong>
<img width=1000 src="./images/normalization/L2/example_by_row.PNG">
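The row-wise calculation can again be checked by hand. The sketch below applies the steps described above to the first row of this notebook's dataset; the variable names are chosen here for illustration only.

```python
import numpy as np

# First row of the dataset: Age, Songs Listened, Satisfaction
row = np.array([20, 10, 0.7])

# L2 norm: square root of the sum of squares, sqrt(400 + 100 + 0.49)
l2_norm = np.sqrt(np.sum(np.square(row)))

# Divide each element by the norm
normalized = row / l2_norm
print(normalized)

# After L2 normalization the row has unit Euclidean length
print(np.linalg.norm(normalized))
```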

<strong>Questions:</strong>
<ol>
    <li>
        <strong>When do we need to use L2 normalization?</strong>
        <ul>
            <li>Robustness of the model is not required.</li>
            <li>We do care about outliers.</li>
            <li>Stability is required.</li>
            <li>We want to use non-sparse models, which generate mostly non-zeros (vectors). This also supports computational efficiency.</li>
        </ul>
    </li>
    <li>
        <strong>What are advantages of using L2 normalization?</strong>
        <ul>
            <li>It supports computational efficiency, so it is also easy to use with gradient-based learning methods.</li>
            <li>It keeps the overall error small.</li>
            <li>It improves the prediction performance. This is because we consider almost all features.</li>
        </ul>
    </li>
    <li>
    <strong>What are disadvantages of using L2 normalization?</strong>
        <ul>
            <li>It is not used for performing feature selection.</li>
            <li>It is very sensitive to outliers.</li>
        </ul>
    </li>
</ol>

In [5]:
def l2_from_scratch(data, axis=1):
    ''' This is the implementation of L2 normalization from scratch
    
    Parameters
    ----------
    data: DataFrame to be normalized
    axis: Direction of normalization (0 = by columns, 1 = by rows)
    
    Return
    ------
    L2-normalization result
    
    Notes
    -----
    1. We assume that the data contains only numeric features (not categorical features)
    2. We assume that the data is clean (no missing values)
    '''
    # Convert to a float array; the input DataFrame is not modified
    vals = data.to_numpy(dtype=float)
    
    # Sum of squares along the chosen axis.
    # keepdims=True keeps the result broadcastable against vals.
    total = np.sum(np.square(vals), axis=axis, keepdims=True)
    
    # L2 norm: square root of the sum of squares
    sq_rt = np.sqrt(total)
    
    # Divide every element by the norm of its own row or column
    res = vals / sq_rt
    
    # Make DataFrame
    return pd.DataFrame(res, columns=data.columns, index=data.index)

In [6]:
print("INPUT:")
display(data)

print("OUTPUT (from scratch):")
display(l2_from_scratch(data, axis=1))

print("OUTPUT (sklearn):")
display(
    pd.DataFrame(
        normalize(data, norm='l2'), columns=data.columns
    )
)

INPUT:


|   | Age | Songs Listened | Satisfaction |
|---|-----|----------------|--------------|
| 0 | 20 | 10 | 0.7 |
| 1 | 30 | 8 | 0.4 |
| 2 | 25 | 11 | 0.4 |
| 3 | 10 | 5 | 0.8 |
| 4 | 40 | 7 | 0.6 |


OUTPUT (from scratch):


|   | Age | Songs Listened | Satisfaction |
|---|-----|----------------|--------------|
| 0 | 0.893989 | 0.446995 | 0.03129 |
| 1 | 0.966155 | 0.257641 | 0.012882 |
| 2 | 0.915217 | 0.402695 | 0.014643 |
| 3 | 0.892146 | 0.446073 | 0.071372 |
| 4 | 0.984923 | 0.172362 | 0.014774 |


OUTPUT (sklearn):


|   | Age | Songs Listened | Satisfaction |
|---|-----|----------------|--------------|
| 0 | 0.893989 | 0.446995 | 0.03129 |
| 1 | 0.966155 | 0.257641 | 0.012882 |
| 2 | 0.915217 | 0.402695 | 0.014643 |
| 3 | 0.892146 | 0.446073 | 0.071372 |
| 4 | 0.984923 | 0.172362 | 0.014774 |


## Min-Max Normalization

Min-Max normalization is the way we map the entire range of values of a particular row or column to the range 0 to 1. The minimum value is mapped to 0, the maximum value is mapped to 1 and every other value is mapped to a decimal between 0 and 1.

<strong>Formula:</strong>
<img width=1000 src="./images/normalization/MinMax/formula.PNG">

    X_min = The minimum value of a particular row or column
    X_max = The maximum value of a particular row or column
    X_i   = The value of a particular unit.
    y     = The value after normalizing.

<strong>For example: (by column)</strong>
<img width=1000 src="./images/normalization/MinMax/example_by_col.PNG">
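The column-wise calculation can be checked by hand. The sketch below applies the formula to the Age column of this notebook's dataset; the variable names are chosen here for illustration only.

```python
import numpy as np

# Age column of the dataset
age = np.array([20, 30, 25, 10, 40], dtype=float)

x_min = age.min()  # 10
x_max = age.max()  # 40

# y = (X_i - X_min) / (X_max - X_min)
y = (age - x_min) / (x_max - x_min)
print(y)
```

The smallest age (10) maps to 0, the largest (40) maps to 1, and an age of 25 sits exactly in the middle at 0.5.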

<strong>Questions:</strong>
<ol>
    <li>
        <strong>When do we need to use Min-Max normalization?</strong>
        <ul>
            <li>We want gradient descent to converge faster (logistic regression, SVMs, perceptrons, neural networks, etc.).</li>
            <li>When our further analysis uses K-NN for classification problems or K-means for clustering problems.</li>
            <li>When we want to find the directions that maximize the variance (LDA, PCA, Kernel PCA).</li>
        </ul>
    </li>
    <li>
        <strong>What are advantages of using Min-Max normalization?</strong>
        <ul>
            <li>It does not require the feature to be normally distributed.</li>
            <li>It suits features that fall within a bounded interval (e.g. pixel intensities within the 0-255 range).</li>
        </ul>
    </li>
    <li>
    <strong>What are disadvantages of using Min-Max normalization?</strong>
        <ul>
            <li>Very sensitive to outliers</li>
        </ul>
    </li>
</ol>
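To illustrate the sensitivity to outliers, the sketch below scales the Age values from this notebook once as they are and once with a single extreme value appended. The value 400 and the helper `min_max` are hypothetical and used only for this illustration.

```python
import numpy as np

def min_max(x):
    # Map the smallest value to 0 and the largest value to 1
    return (x - x.min()) / (x.max() - x.min())

ages = np.array([20, 30, 25, 10, 40], dtype=float)

# Hypothetical outlier appended to the same values
ages_with_outlier = np.append(ages, 400.0)

scaled = min_max(ages)
scaled_with_outlier = min_max(ages_with_outlier)

print(scaled)
print(scaled_with_outlier)
```

Without the outlier the five values spread over the whole 0 to 1 range. With it, the outlier takes the value 1 and the original five values are squeezed into a narrow band close to 0, so most of the range is left unused.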

In [7]:
def mm_from_scratch(data, axis=0):
    ''' This is the implementation of Min-Max normalization from scratch
    
    Parameters
    ----------
    data: DataFrame to be normalized
    axis: Direction of normalization (0 = by columns, 1 = by rows)
    
    Return
    ------
    Min-Max normalization result
    
    Notes
    -----
    1. We assume that the data contains only numeric features (not categorical features)
    2. We assume that the data is clean (no missing values)
    3. We assume that no row or column is constant (max != min)
    '''
    # Convert to a float array so integer data is not truncated
    # and the input DataFrame is not modified
    vals = data.to_numpy(dtype=float)
    
    # Min and max of each column (axis=0) or each row (axis=1).
    # keepdims=True keeps the results broadcastable against vals.
    min_val = vals.min(axis=axis, keepdims=True)
    max_val = vals.max(axis=axis, keepdims=True)
    
    # Map every value to the range [0, 1]
    res = (vals - min_val) / (max_val - min_val)
    
    # Make dataframe
    return pd.DataFrame(res, columns=data.columns, index=data.index)

In [8]:
print("INPUT:")
display(data)

print("OUTPUT (from scratch):")
display(mm_from_scratch(data, axis=0))

scaler = MinMaxScaler()

print("OUTPUT (sklearn):")
display(
    pd.DataFrame(
        scaler.fit_transform(data), columns=data.columns
    )
)

INPUT:


|   | Age | Songs Listened | Satisfaction |
|---|-----|----------------|--------------|
| 0 | 20 | 10 | 0.7 |
| 1 | 30 | 8 | 0.4 |
| 2 | 25 | 11 | 0.4 |
| 3 | 10 | 5 | 0.8 |
| 4 | 40 | 7 | 0.6 |


OUTPUT (from scratch):


|   | Age | Songs Listened | Satisfaction |
|---|-----|----------------|--------------|
| 0 | 0.333333 | 0.833333 | 0.75 |
| 1 | 0.666667 | 0.5 | 0.0 |
| 2 | 0.5 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 |
| 4 | 1.0 | 0.333333 | 0.5 |


OUTPUT (sklearn):




|   | Age | Songs Listened | Satisfaction |
|---|-----|----------------|--------------|
| 0 | 0.333333 | 0.833333 | 0.75 |
| 1 | 0.666667 | 0.5 | 0.0 |
| 2 | 0.5 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 |
| 4 | 1.0 | 0.333333 | 0.5 |
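The comparison above scales by column, which is the only direction `MinMaxScaler` works in. One possible way to scale by row with scikit-learn, sketched here as an assumption rather than a built-in option, is to transpose the values before and after scaling. The result should agree with `mm_from_scratch(data, axis=1)`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Same dataset as in the data preparation section
data = pd.DataFrame(
    [[20, 10, 0.7], [30, 8, 0.4], [25, 11, 0.4], [10, 5, 0.8], [40, 7, 0.6]],
    columns=['Age', 'Songs Listened', 'Satisfaction']
)

# MinMaxScaler scales column by column, so transpose the values to turn
# each original row into a column, scale, then transpose back.
scaled_t = MinMaxScaler().fit_transform(data.to_numpy(dtype=float).T)
by_row = pd.DataFrame(scaled_t.T, columns=data.columns, index=data.index)

print(by_row)
```

In every row the smallest feature becomes 0 and the largest becomes 1. With this dataset Age is always the largest value in a row and Satisfaction the smallest, which shows why row-wise Min-Max scaling is rarely meaningful when features are measured in different units.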
