### In data science, we often need to normalize or standardize numeric variables (features) before modeling.
- The scalers accept scipy.sparse matrices in Compressed Sparse Rows (scipy.sparse.csr_matrix) and Compressed Sparse Columns (scipy.sparse.csc_matrix) format. Any other sparse input will be converted to the Compressed Sparse Rows representation.
- If the centered data is expected to be small enough, explicitly converting the input to a dense array with the toarray() method of sparse matrices is another option.
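
The two sparse-input options above can be sketched as follows. This is a minimal illustration on a made-up 4x3 matrix (`dense`, `X_sparse` and the other names are hypothetical, not part of the notebook): scale without centering to keep the matrix sparse, or convert to a dense array first when the data is small.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): mostly zeros, stored as CSR.
dense = np.array([[0.0, 2.0, 0.0],
                  [4.0, 0.0, 0.0],
                  [0.0, 6.0, 8.0],
                  [0.0, 0.0, 0.0]])
X_sparse = sp.csr_matrix(dense)

# Option 1: scale without centering, so the result stays sparse
# and zero entries stay zero.
scaled_sparse = StandardScaler(with_mean=False).fit_transform(X_sparse)
print(sp.issparse(scaled_sparse), scaled_sparse.nnz)

# Option 2: for small data, convert to a dense array and center as usual.
scaled_dense = StandardScaler().fit_transform(X_sparse.toarray())
print(scaled_dense.mean(axis=0))
```

Option 2 gives zero-mean columns but loses the sparse storage, so it only makes sense when the dense array fits comfortably in memory.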

In [1]:
%matplotlib inline

In [2]:
import numpy as np
import pandas as pd
# sklearn.datasets.samples_generator has been removed; import from sklearn.datasets
from sklearn.datasets import make_classification

# generate some data to work with
X, y = make_classification(n_samples=300, 
                           n_features=3, 
                           n_redundant=0, 
                           n_informative=2, 
                           random_state=22, 
                           n_clusters_per_class=1,
                           scale=100)
print(type(X))
print(X[2:5])

<class 'numpy.ndarray'>
[[-144.93189226  213.45502692 -159.49253106]
 [-139.42116928  179.02777073 -129.00901949]
 [  16.71490058  164.14533923 -160.13514532]]


#### In sklearn, there are several different scaling (normalization) methods. 
#### First, without scaling

In [3]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
# from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.svm import SVC

Xc = X.copy()
yc = y.copy()

# without normalization
X_train, X_test, y_train, y_test = train_test_split(Xc, yc, test_size=0.2)
clf = SVC()
clf.fit(X_train,y_train)
print("The prediction accuracy without scaling is: ", clf.score(X_test, y_test))

The prediction accuracy without scaling is:  0.416666666667


#### Normalization using various scaling methods inside preprocessing package
#### Default scaling: scale() or StandardScaler() [both apply the same transformation]
#### formula: (x - mean)/std
- The utility class StandardScaler() implements the Transformer API [fit(), transform() and fit_transform()]. It computes the mean and standard deviation on a training set so that the same transformation can later be reapplied to the test set.
- The scale() function combines these operations in a single step, but it does not store the fitted mean and standard deviation, so you cannot reapply the training-set transformation to a test set.
- Both can handle scipy.sparse matrices as long as with_mean=False.
- The two accuracies printed below differ only because each call to train_test_split() draws a different random split; the scaled values themselves are identical.
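
To make the train/test point concrete, here is a minimal sketch on synthetic data (the array `X_demo`, the labels `y_demo` and the fixed `random_state=0` are assumptions for illustration, not the notebook's data). The scaler is fitted on the training split only, then the same statistics are reapplied to the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[0.0, 100.0, -50.0],
                    scale=[1.0, 70.0, 130.0],
                    size=(200, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

# A fixed random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_tr)    # statistics come from the training set only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)   # same mean_ and scale_ reapplied

# The transform is (x - mean) / std using the training-set statistics.
manual = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)
print(np.allclose(X_te_scaled, manual))
```

Fitting on the training split alone keeps test-set information out of the preprocessing step, which scale() cannot do because it keeps no fitted state.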

In [4]:
# With normalization (Scaling)
Xc = X.copy()
yc = y.copy()

# Default scaling with scale():
Xc = preprocessing.scale(Xc)
print(Xc[0:2,])
print(Xc.mean(axis=0))
print(Xc.std(axis=0))

X_train, X_test, y_train, y_test = train_test_split(Xc, yc, test_size=0.2)
clf.fit(X_train, y_train)
print("Using default scaling: ", clf.score(X_test, y_test))

###################################################
# using default scaling through StandardScaler():
Xc = X.copy()
yc = y.copy()

tmp_scaler = preprocessing.StandardScaler()
tmp_scaler = tmp_scaler.fit(Xc)
print("Mean: ", tmp_scaler.mean_)
# print("Standard Deviation: ", tmp_scaler.std_)  
# std_ has been removed from scikit-learn; use scale_ instead
print("Scale_:", tmp_scaler.scale_)
Xc = tmp_scaler.transform(Xc)

# Here y is discrete (class labels), so this is a classification task scored by accuracy.
# If y were continuous, we would use a regression metric such as MAE, MSE, or RMSE.

X_train, X_test, y_train, y_test = train_test_split(Xc, yc, test_size=0.2)
clf.fit(X_train, y_train)
print("Using StandardScaler(): ", clf.score(X_test, y_test))

[[ 3.34037378 -1.72873468 -0.63109555]
 [ 0.71884974  0.77980568 -1.17636869]]
[ -4.14483263e-17  -2.77925830e-16   1.85037171e-17]
[ 1.  1.  1.]
Using default scaling:  0.966666666667
Mean:  [   7.78519807  103.66603519   -0.80587646]
Scale_: [ 102.534033     71.65868856  134.2951084 ]
Using StandardScaler():  0.933333333333


#### MinMaxScaler(): [minmax_scale() is the single step version]
Transforms features by scaling each feature to a given range.
The transformation is given by:
- X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
- X_scaled = X_std * (max - min) + min
- where min, max = feature_range, default=(0, 1).
- This transformation is often used as an alternative to zero mean, unit variance scaling.
The motivation for this scaling includes robustness to very small standard deviations of features and preserving zero entries in sparse data.
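
The two-step formula above can be checked by hand. The sketch below uses a small made-up array (`X_demo`) and an assumed target range of (-1, 1), and compares the manual result with MinMaxScaler().

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Small made-up array; each column has a different range.
X_demo = np.array([[1.0, -10.0],
                   [2.0,   0.0],
                   [4.0,  30.0]])
lo, hi = -1.0, 1.0  # assumed feature_range for this example

# Step 1: rescale each column to [0, 1].
X_std = (X_demo - X_demo.min(axis=0)) / (X_demo.max(axis=0) - X_demo.min(axis=0))
# Step 2: stretch and shift to [lo, hi].
X_manual = X_std * (hi - lo) + lo

X_sklearn = MinMaxScaler(feature_range=(lo, hi)).fit_transform(X_demo)
print(np.allclose(X_manual, X_sklearn))
```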

In [5]:
Xc = X.copy()
yc = y.copy()

# minmax_scale()
Xc = preprocessing.minmax_scale(Xc, feature_range=(-1, 1))
print(Xc[0:5])

X_train, X_test, y_train, y_test = train_test_split(Xc, yc, test_size=0.2)
clf.fit(X_train, y_train)
print("Using minmax_scaling: ", clf.score(X_test, y_test))

################################################
# MinMaxScaler()
Xc = X.copy()
yc = y.copy()

mm_scaler = preprocessing.MinMaxScaler(feature_range=(-2,2))
Xc = mm_scaler.fit_transform(Xc)
X_train, X_test, y_train, y_test = train_test_split(Xc, yc, test_size=0.2)
clf.fit(X_train, y_train)
print("Using MinMaxScaler(): ", clf.score(X_test, y_test))

[[ 1.         -0.43917499 -0.46885612]
 [ 0.11032647  0.31086199 -0.71041624]
 [-0.63910267  0.535796   -0.71274547]
 [-0.62086298  0.39214947 -0.61218764]
 [-0.10407545  0.33005304 -0.7148653 ]]
Using minmax_scaling:  0.95
Using MinMaxScaler():  0.916666666667


#### MaxAbsScaler() and maxabs_scale() for Scaling sparse data
Centering sparse data would destroy the sparseness structure in the data, and thus rarely is a sensible thing to do. However, it makes sense to scale sparse inputs, especially if features are on different scales.
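
As a sketch of why this scaler suits sparse data: it divides each feature by its maximum absolute value, so no centering is involved and zeros stay zero. The array `X_demo` below is made up for illustration.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import MaxAbsScaler, maxabs_scale

# Made-up data with several zero entries.
X_demo = np.array([[  0.0, -200.0, 3.0],
                   [ 50.0,    0.0, 0.0],
                   [-25.0,  100.0, 0.0]])

# Manual version: divide each column by its largest absolute value.
manual = X_demo / np.abs(X_demo).max(axis=0)
print(np.allclose(manual, maxabs_scale(X_demo)))

# On a sparse matrix the number of stored non-zeros is unchanged.
X_sparse = sp.csr_matrix(X_demo)
scaled = MaxAbsScaler().fit_transform(X_sparse)
print(scaled.nnz == X_sparse.nnz)
```

Every scaled value ends up in the range [-1, 1].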


In [6]:
Xc = X.copy()
yc = y.copy()

# maxabs_scale()
Xc = preprocessing.maxabs_scale(Xc)
print(Xc[2:5,:])

X_train, X_test, y_train, y_test = train_test_split(Xc, yc, test_size=0.2)
clf.fit(X_train, y_train)
print("using maxabs_scale():", clf.score(X_test, y_test))

############################################################
# MaxAbsScaler()
Xc = X.copy()
yc = y.copy()

ma_scaler = preprocessing.MaxAbsScaler()
Xc = ma_scaler.fit_transform(Xc)
print(Xc[2:5, :])

X_train, X_test, y_train, y_test = train_test_split(Xc, yc, test_size=0.2)
clf.fit(X_train, y_train)
print("using MaxAbsScaler(): ", clf.score(X_test, y_test))

[[-0.41375162  0.65737298 -0.44338447]
 [-0.3980196   0.55134808 -0.35864122]
 [ 0.0477177   0.50551496 -0.44517092]]
using maxabs_scale(): 0.95
[[-0.41375162  0.65737298 -0.44338447]
 [-0.3980196   0.55134808 -0.35864122]
 [ 0.0477177   0.50551496 -0.44517092]]
using MaxAbsScaler():  0.866666666667


#### Scaling data with outliers
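If the data contains many outliers, scaling with the mean and standard deviation is unlikely to work well. Use robust_scale() and RobustScaler() as drop-in replacements; they center on the median and scale by the interquartile range, which outliers barely affect.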
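
With default settings, the robust scalers subtract the median and divide by the interquartile range (75th minus 25th percentile) of each feature. The sketch below checks this on synthetic data with one injected outlier (`X_demo` is an assumption for illustration).

```python
import numpy as np
from sklearn.preprocessing import robust_scale

# Synthetic data with one extreme row (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
X_demo[0] = [1000.0, -1000.0]  # injected outlier

# Manual version: center on the median, scale by the interquartile range.
median = np.median(X_demo, axis=0)
q25, q75 = np.percentile(X_demo, [25, 75], axis=0)
manual = (X_demo - median) / (q75 - q25)

print(np.allclose(manual, robust_scale(X_demo)))
```

Because the median and quartiles barely move when a single extreme row is added, the remaining rows keep a sensible scale.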

In [7]:
from sklearn.preprocessing import RobustScaler

# robust_scale()
Xc = X.copy()
yc = y.copy()

Xc = preprocessing.robust_scale(Xc)
print(Xc[2:5,])

X_train, X_test, y_train, y_test = train_test_split(Xc, yc, test_size=0.2)
clf.fit(X_train, y_train)
print("using robust_scale(): ", clf.score(X_test, y_test))

###################################################################
# RobustScaler
Xc = X.copy()
yc = y.copy()

r_scaler = preprocessing.RobustScaler()
Xc = r_scaler.fit_transform(Xc)
print(Xc[2:5,])
X_train, X_test, y_train, y_test = train_test_split(Xc, yc, test_size=0.2)
clf.fit(X_train, y_train)
print("using RobustScaler(): ", clf.score(X_test, y_test))

[[-1.09751646  1.44361657 -0.59836839]
 [-1.05923922  1.00427368 -0.45472651]
 [ 0.02527523  0.81435176 -0.60139647]]
using robust_scale():  0.983333333333
[[-1.09751646  1.44361657 -0.59836839]
 [-1.05923922  1.00427368 -0.45472651]
 [ 0.02527523  0.81435176 -0.60139647]]
using RobustScaler():  0.966666666667


#### Binarization
Binarization thresholds feature values to 0 or 1 (boolean values): values greater than the threshold become 1, and all other values become 0.
Like the scalers, it comes in two forms: binarize() and Binarizer().
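
The thresholding rule can be written as a single comparison. The sketch below uses a made-up array (`X_demo`) and an assumed threshold of 50.0; note that a value exactly equal to the threshold maps to 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, binarize

# Made-up values; the last entry sits exactly on the threshold.
X_demo = np.array([[-144.9, 213.5, -159.5],
                   [  16.7, 164.1,   50.0]])
threshold = 50.0  # assumed for this example

# Manual version: 1.0 where the value is strictly greater than the threshold.
manual = (X_demo > threshold).astype(float)

print(np.array_equal(manual, binarize(X_demo, threshold=threshold)))
print(np.array_equal(manual, Binarizer(threshold=threshold).fit_transform(X_demo)))
```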

In [68]:
from sklearn.preprocessing import Binarizer

Xc = X.copy()
yc = y.copy()

# binarize()
Xc = preprocessing.binarize(Xc, threshold=50)
print(Xc[2:5,])

#########################################################
# Binarizer()
Xc = X.copy()
yc = y.copy()

binarizer = preprocessing.Binarizer(threshold=2.0)
Xc = binarizer.fit_transform(Xc)
print(Xc[2:5,])

[[ 0.  1.  0.]
 [ 0.  1.  0.]
 [ 0.  1.  0.]]
[[ 0.  1.  0.]
 [ 0.  1.  0.]
 [ 1.  1.  0.]]


#### Normalization
Normalization scales each individual sample (row) to unit norm: every value in a row is divided by norm(row). With the L2 norm (the default), norm(row) is the square root of the sum of squares of the row's values; with the L1 norm, it is the sum of their absolute values.
- There are two forms available just like others
- normalize() and Normalizer()
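
The row-wise formula can be reproduced with NumPy. This sketch uses a made-up two-row array (`X_demo`) and compares the manual L2 and L1 results with normalize().

```python
import numpy as np
from sklearn.preprocessing import normalize

# Made-up samples; each row is normalized independently.
X_demo = np.array([[ 3.0, 4.0,  0.0],
                   [-1.0, 2.0, -2.0]])

# L2: divide each row by the square root of its sum of squares.
l2_manual = X_demo / np.linalg.norm(X_demo, axis=1, keepdims=True)
# L1: divide each row by the sum of its absolute values.
l1_manual = X_demo / np.abs(X_demo).sum(axis=1, keepdims=True)

print(np.allclose(l2_manual, normalize(X_demo, norm='l2')))
print(np.allclose(l1_manual, normalize(X_demo, norm='l1')))
```

Unlike the scalers above, which work column by column, normalization works row by row, so it needs no fitted statistics.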


In [12]:
from sklearn.preprocessing import Normalizer

Xc = X.copy()

# use L1 norm
X_normalized = preprocessing.normalize(Xc, norm='l1')
print("using L1 norm:", X_normalized[2:4,])

# use L2 norm
X_normalized = preprocessing.normalize(Xc, norm='l2')
print("Using L2 norm", X_normalized[2:4,])

# using default norm(row)
X_normalized = preprocessing.normalize(Xc)
print("Using default norm", X_normalized[2:4,])

using L1 norm: [[-0.27985643  0.41217126 -0.30797231]
 [-0.31158496  0.40009964 -0.2883154 ]]
Using L2 norm [[-0.47781028  0.70371679 -0.52581367]
 [-0.53413552  0.6858721  -0.4942456 ]]
Using default norm [[-0.47781028  0.70371679 -0.52581367]
 [-0.53413552  0.6858721  -0.4942456 ]]
