# Data Preprocessing

In [1]:
import numpy as np
from sklearn import preprocessing

In [2]:
input_data = np.array([[5.1, -2.9, 3.3],
                       [-1.2, 7.8, -6.1],
                       [3.9, 0.4, 2.1],
                       [7.3, -9.9, -4.5]])

## Binarization

Binarization is the process of converting numerical values into boolean-like values (0 or 1) based on a given threshold.

In [5]:
binarizer = preprocessing.Binarizer(threshold = 2.1)
binarizer.transform(input_data)

array([[1., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       [1., 0., 0.]])

We can see that all values greater than 2.1 are transformed into 1s, and all values less than or equal to 2.1 (including the 2.1 in the third row) are transformed into 0s.
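
To make the thresholding rule explicit, the same result can be sketched with a plain NumPy comparison. This is only an illustration of the rule, not how `Binarizer` is implemented; the names `threshold` and `binarized` are introduced here for the example.

```python
import numpy as np

input_data = np.array([[5.1, -2.9, 3.3],
                       [-1.2, 7.8, -6.1],
                       [3.9, 0.4, 2.1],
                       [7.3, -9.9, -4.5]])

threshold = 2.1  # same threshold as the Binarizer cell

# Strict comparison: entries equal to the threshold map to 0, not 1
binarized = (input_data > threshold).astype(float)
print(binarized)
```

Note that the entry equal to 2.1 in the third row becomes 0 because the comparison is strict.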

## Mean Removal

Another form of preprocessing is to remove the mean from each feature and scale it to unit variance. The transformed features then have a mean close to 0 and a standard deviation close to 1.

In [10]:
print('\nBEFORE:')
print('Mean = {}'.format(input_data.mean(axis=0)))
print('Std deviation = {}'.format(input_data.std(axis=0)))


BEFORE:
Mean = [ 3.775 -1.15  -1.3  ]
Std deviation = [3.12039661 6.36651396 4.0620192 ]


We can see here that each column, or feature, has a different distribution, and therefore a different mean and standard deviation.

In [11]:
data_scaled = preprocessing.scale(input_data)
print('\nAfter')
print('Mean = {}'.format(data_scaled.mean(axis = 0)))
print('Std deviation = {}'.format(data_scaled.std(axis = 0)))


After
Mean = [1.11022302e-16 0.00000000e+00 2.77555756e-17]
Std deviation = [1. 1. 1.]


We can see that the newly scaled data has features with a mean of 0 (up to floating-point error) and unit standard deviation.
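
As a rough sketch of what this standardization does, each column can be centered and scaled by hand with NumPy. The variable names (`col_mean`, `col_std`, `standardized`) are chosen for this example, and the sketch assumes the population standard deviation (`ddof=0`, NumPy's default).

```python
import numpy as np

input_data = np.array([[5.1, -2.9, 3.3],
                       [-1.2, 7.8, -6.1],
                       [3.9, 0.4, 2.1],
                       [7.3, -9.9, -4.5]])

# Per-column (per-feature) statistics
col_mean = input_data.mean(axis=0)
col_std = input_data.std(axis=0)  # population std, ddof=0

# Subtract the mean, then divide by the standard deviation
standardized = (input_data - col_mean) / col_std

print('Mean = {}'.format(standardized.mean(axis=0)))
print('Std deviation = {}'.format(standardized.std(axis=0)))
```

The printed means may show tiny non-zero values on the order of 1e-16, which is ordinary floating-point rounding.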

## Scaling

Scaling is useful because features can vary widely in range, depending on how they are measured and the units used.

In [12]:
data_scaler_minmax = preprocessing.MinMaxScaler(feature_range = (0, 1))
data_scaled_minmax = data_scaler_minmax.fit_transform(input_data)
data_scaled_minmax

array([[0.74117647, 0.39548023, 1.        ],
       [0.        , 1.        , 0.        ],
       [0.6       , 0.5819209 , 0.87234043],
       [1.        , 0.        , 0.17021277]])

We can see that this rescales the values in each column to lie between 0 and 1, which makes the different features comparable. It also means that, within every column, the highest value becomes 1 and the lowest becomes 0.
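
For a target range of (0, 1), min-max scaling amounts to `(x - min) / (max - min)` computed per column. A minimal NumPy sketch of that formula follows; the names `col_min`, `col_max` and `minmax_scaled` are illustrative only.

```python
import numpy as np

input_data = np.array([[5.1, -2.9, 3.3],
                       [-1.2, 7.8, -6.1],
                       [3.9, 0.4, 2.1],
                       [7.3, -9.9, -4.5]])

# Per-column minimum and maximum
col_min = input_data.min(axis=0)
col_max = input_data.max(axis=0)

# Map each column linearly so that its minimum becomes 0 and its maximum becomes 1
minmax_scaled = (input_data - col_min) / (col_max - col_min)
print(minmax_scaled)
```

This sketch assumes no column is constant; a column whose minimum equals its maximum would cause a division by zero here.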

## Normalization

L1 normalization is based on the L1 norm, the same norm used in Least Absolute Deviations. It scales each row so that the sum of its absolute values is 1. <br>
L2 normalization is based on the L2 norm, the same norm used in Least Squares. It scales each row so that the sum of its squares is 1. <br>
L1 is generally considered more robust than L2 because it is more resistant to outliers. If we think the outliers are important, we may decide to use L2 normalization instead.

In [14]:
data_normalized_l1 = preprocessing.normalize(input_data, norm = 'l1')
data_normalized_l2 = preprocessing.normalize(input_data, norm = 'l2')
print('\nL1 normalized data L1:\n{}'.format(data_normalized_l1))
print('\nL2 normalized data L2:\n{}'.format(data_normalized_l2))


L1 normalized data L1:
[[ 0.45132743 -0.25663717  0.2920354 ]
 [-0.0794702   0.51655629 -0.40397351]
 [ 0.609375    0.0625      0.328125  ]
 [ 0.33640553 -0.4562212  -0.20737327]]

L2 normalized data L2:
[[ 0.75765788 -0.43082507  0.49024922]
 [-0.12030718  0.78199664 -0.61156148]
 [ 0.87690281  0.08993875  0.47217844]
 [ 0.55734935 -0.75585734 -0.34357152]]
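
To check the two definitions by hand, each row can be divided by its own L1 or L2 norm. This is a sketch of the arithmetic rather than the library's implementation, and the names `l1_norms`, `l2_norms`, `normalized_l1` and `normalized_l2` are introduced for the example.

```python
import numpy as np

input_data = np.array([[5.1, -2.9, 3.3],
                       [-1.2, 7.8, -6.1],
                       [3.9, 0.4, 2.1],
                       [7.3, -9.9, -4.5]])

# Row-wise norms; keepdims=True keeps shape (4, 1) so the division broadcasts per row
l1_norms = np.abs(input_data).sum(axis=1, keepdims=True)
l2_norms = np.sqrt((input_data ** 2).sum(axis=1, keepdims=True))

normalized_l1 = input_data / l1_norms
normalized_l2 = input_data / l2_norms

# Each row's absolute values sum to 1 (L1); each row's squares sum to 1 (L2)
print(np.abs(normalized_l1).sum(axis=1))
print((normalized_l2 ** 2).sum(axis=1))
```

Unlike mean removal and min-max scaling, which work column by column, normalization here works row by row, that is, per sample.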


## Label Encoding

Label encoding converts categorical labels into integers. This is especially useful when a single feature has a lot of distinct labels and one-hot encoding would produce far too many columns.

In [15]:
input_labels = ['red', 'black', 'red', 'green', 'black', 'yellow', 'white']

In [19]:
encoder = preprocessing.LabelEncoder()
encoder.fit(input_labels)
print('\nLabel mapping:')
for i, item in enumerate(encoder.classes_):
    print(item + '---> ' + str(i))


Label mapping:
black---> 0
green---> 1
red---> 2
white---> 3
yellow---> 4
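
Once fitted, the encoder can translate labels to integers and back. The sketch below uses `transform` and `inverse_transform` on a hypothetical list, `test_labels`, and a hypothetical list of codes, both made up for this example.

```python
from sklearn import preprocessing

input_labels = ['red', 'black', 'red', 'green', 'black', 'yellow', 'white']

encoder = preprocessing.LabelEncoder()
encoder.fit(input_labels)

# Hypothetical labels to encode (each must have been seen during fit)
test_labels = ['green', 'red', 'black']
encoded_values = encoder.transform(test_labels)
print('Labels = {}'.format(test_labels))
print('Encoded values = {}'.format(encoded_values.tolist()))

# Hypothetical integer codes to map back to the original labels
decoded_labels = encoder.inverse_transform([3, 0, 4, 1])
print('Decoded labels = {}'.format(decoded_labels.tolist()))
```

Passing a label that was not present when the encoder was fitted raises an error, so any new categories need to be handled before calling `transform`.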
