Feature scaling is an important part of data preprocessing.

#### Typical feature transformations
- Normalization
- Standardization
- Exponentiation

#### Implemented Sklearn transformations

- MaxAbsScaler

        divides each feature by its maximum absolute value, so values fall in [-1, 1]


- MinMaxScaler

        maps values onto a predefined interval, default is [0, 1]


- StandardScaler

        subtracts average and divides by standard deviation

- RobustScaler

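        subtracts the median and divides by the interquartile range (25th to 75th percentile by default), so outliers have little influence on the result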

- Normalizer

        converts each data row so that it has unit norm (L1, L2 or Maximum)

- PowerTransformer

        applies a Box-Cox / Yeo-Johnson transformation to decrease the skewness of data
        https://en.wikipedia.org/wiki/Power_transform
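
To make the differences between these transformers concrete, here is a minimal sketch on toy data. The arrays `X_demo`, `row` and `skewed` are made up for illustration and are not part of the notebook's dataset.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import (
    MaxAbsScaler, MinMaxScaler, Normalizer, PowerTransformer,
    RobustScaler, StandardScaler,
)

# Toy single-feature column with one large outlier (illustrative data only).
X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

max_abs = MaxAbsScaler().fit_transform(X_demo)     # x / max(|x|)
min_max = MinMaxScaler().fit_transform(X_demo)     # (x - min) / (max - min)
standard = StandardScaler().fit_transform(X_demo)  # (x - mean) / std
robust = RobustScaler().fit_transform(X_demo)      # (x - median) / IQR

print("MaxAbsScaler:  ", max_abs.ravel())
print("MinMaxScaler:  ", min_max.ravel())
print("StandardScaler:", standard.ravel())
print("RobustScaler:  ", robust.ravel())

# Normalizer works per row (sample), not per feature.
row = np.array([[3.0, 4.0]])
l2_row = Normalizer(norm="l2").fit_transform(row)  # [[0.6, 0.8]]
print("Normalizer (L2):", l2_row)

# PowerTransformer (Yeo-Johnson by default) on right-skewed synthetic data.
rng = np.random.default_rng(0)
skewed = rng.lognormal(size=(1000, 1))
transformed = PowerTransformer().fit_transform(skewed)
print("skewness before:", skew(skewed.ravel()))
print("skewness after: ", skew(transformed.ravel()))
```

Note how the outlier pulls the first four values towards zero under MaxAbsScaler and MinMaxScaler, while RobustScaler keeps them spread out because the median and interquartile range ignore the extreme value.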


#### Sklearn transformer methods

- fit() 

        -- learns the required transformation
- transform()
        
        -- applies transformations
- fit_transform()

        -- does both
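
As a small illustration of the three methods, the sketch below checks that `fit()` followed by `transform()` gives the same result as `fit_transform()` on the same data. The array `X_demo` is a made-up example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny illustrative dataset: 3 samples, 2 features.
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaler = StandardScaler()
scaler.fit(X_demo)                    # learns per-feature mean_ and scale_
two_step = scaler.transform(X_demo)   # applies the learned parameters

one_step = StandardScaler().fit_transform(X_demo)  # both steps in one call

print(scaler.mean_)                      # per-feature means: 2 and 20
print(np.allclose(two_step, one_step))   # True
```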

Data preprocessing (if used) must be applied to all parts of the data (train/test/validation), including the data passed to predict().
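
One way to make sure the same preprocessing reaches `predict()` is to bundle the scaler and the model in a sklearn `Pipeline`. The sketch below is only an illustration: the synthetic data and the choice of `LinearRegression` are assumptions, not part of this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with two features on very different scales (assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2)) * np.array([1.0, 1000.0])
y_demo = X_demo[:, 0] + 0.001 * X_demo[:, 1]

X_train, X_new = X_demo[:80], X_demo[80:]
y_train = y_demo[:80]

# fit() fits the scaler on X_train, then fits the model on the scaled data.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# predict() re-applies the already fitted scaler before the model.
pred = model.predict(X_new)

# Equivalent manual version, for comparison.
fitted_scaler = model.named_steps["standardscaler"]
fitted_reg = model.named_steps["linearregression"]
manual_pred = fitted_reg.predict(fitted_scaler.transform(X_new))
print(np.allclose(pred, manual_pred))
```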

#### Where to fit the scalers
To get realistic estimates of model performance, we should fit the scaler on the Train part of the data only and then apply the fitted scaler to the Test parts.
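
A minimal sketch of this rule, assuming synthetic data and a plain `train_test_split` (neither is taken from the experiments in this notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))

X_train, X_test = train_test_split(X_demo, test_size=0.25, random_state=0)

# Fit on the training part only ...
scaler = StandardScaler().fit(X_train)

# ... and reuse the same fitted parameters for both parts.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train means are zero up to rounding; test means are only close to zero,
# because the test rows did not contribute to mean_ and scale_.
print(X_train_scaled.mean(axis=0))
print(X_test_scaled.mean(axis=0))
```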

#### Is it necessary?
- If there is enough training data and the train/test split retains the feature distributions => No, a scaler fitted on either part learns nearly the same parameters
- If the train/test split changes the feature distributions => Yes

#### Does data splitting retain distributions?
Let's check this on a real dataset (abalone from OpenML).

In [109]:
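import numpy as np  # used below for np.mean
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.model_selection import ShuffleSplit, KFold
from sklearn.datasets import fetch_openml

# download (or load from the local cache) the abalone dataset as a DataFrame
bunch = fetch_openml(name="abalone", data_home="/Users/Konstantin/git/", as_frame=True)
# keep numeric columns only, since the scalers cannot handle categorical features
X = bunch['data'].select_dtypes('number')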

Let's try ShuffleSplit

In [119]:
scaler_train = StandardScaler()
scaler_test = StandardScaler()

scale_result = []
center_result = []

for train_records, test_records in ShuffleSplit(n_splits=10, test_size=.25).split(X):
    # split() yields positional indices, so select rows with iloc
    X_train = X.iloc[train_records]
    X_test = X.iloc[test_records]
    # fit one scaler per part to compare the learned parameters
    scaler_train.fit(X_train)
    scaler_test.fit(X_test)
    # sum over features of the squared train/test difference
    scale_stats = round(sum([round(pow(x[1]-x[0],2), 5) for x in zip(scaler_train.scale_, scaler_test.scale_)]), 5)
    center_stats = round(sum([round(pow(x[1]-x[0],2), 5) for x in zip(scaler_train.mean_, scaler_test.mean_)]), 5)
    scale_result.append(scale_stats)
    center_result.append(center_stats)

print("scale = {}".format(round(np.mean(scale_result),5)))
print("center = {}".format(round(np.mean(center_result),5)))


scale = 0.00057
center = 0.00043
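
The comparison above can also be wrapped in a small helper so that different splitters and scalers can be swapped in. This is only a sketch: `split_discrepancy` and its `center_attr` parameter are hypothetical names, it skips the per-term rounding used in the cell above, and it runs on synthetic data so that it works without downloading abalone.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import ShuffleSplit
from sklearn.preprocessing import RobustScaler, StandardScaler


def split_discrepancy(X, splitter, scaler, center_attr="mean_"):
    """Average (over splits) squared train/test gap in scale and center.

    Hypothetical helper: for every split, one copy of `scaler` is fitted
    on the train rows and another on the test rows, and the squared
    differences of their learned parameters are summed over features.
    """
    X = np.asarray(X)
    scale_gaps, center_gaps = [], []
    for train_idx, test_idx in splitter.split(X):
        s_train = clone(scaler).fit(X[train_idx])
        s_test = clone(scaler).fit(X[test_idx])
        scale_gaps.append(np.sum((s_test.scale_ - s_train.scale_) ** 2))
        center_diff = getattr(s_test, center_attr) - getattr(s_train, center_attr)
        center_gaps.append(np.sum(center_diff ** 2))
    return float(np.mean(scale_gaps)), float(np.mean(center_gaps))


# Synthetic stand-in for the numeric feature matrix (assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
splitter = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)

scale_gap, center_gap = split_discrepancy(X_demo, splitter, StandardScaler())
print("StandardScaler: scale = {:.5f}, center = {:.5f}".format(scale_gap, center_gap))

# RobustScaler stores its center in center_ instead of mean_.
r_scale_gap, r_center_gap = split_discrepancy(
    X_demo, splitter, RobustScaler(), center_attr="center_"
)
print("RobustScaler:   scale = {:.5f}, center = {:.5f}".format(r_scale_gap, r_center_gap))
```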


Let's try RobustScaler (still with ShuffleSplit)

In [120]:
scaler_train = RobustScaler()
scaler_test = RobustScaler()

scale_result = []
center_result = []

for train_records, test_records in ShuffleSplit(n_splits=10, test_size=.25).split(X):
    # split() yields positional indices, so select rows with iloc
    X_train = X.iloc[train_records]
    X_test = X.iloc[test_records]
    scaler_train.fit(X_train)
    scaler_test.fit(X_test)
    scale_stats = round(sum([round(pow(x[1]-x[0],2), 5) for x in zip(scaler_train.scale_, scaler_test.scale_)]), 5)
    center_stats = round(sum([round(pow(x[1]-x[0],2), 5) for x in zip(scaler_train.center_, scaler_test.center_)]), 5)
    scale_result.append(scale_stats)
    center_result.append(center_stats)

print("scale = {}".format(round(np.mean(scale_result),5)))
print("center = {}".format(round(np.mean(center_result),5)))


scale = 0.00076
center = 0.00071


Let's try KFold (without shuffling)

In [88]:
scaler_train = StandardScaler()
scaler_test = StandardScaler()

scale_result = []
center_result = []

for train_records, test_records in KFold(n_splits=10).split(X):
    # split() yields positional indices, so select rows with iloc
    X_train = X.iloc[train_records]
    X_test = X.iloc[test_records]
    scaler_train.fit(X_train)
    scaler_test.fit(X_test)
    scale_stats = round(sum([round(pow(x[1]-x[0],2), 5) for x in zip(scaler_train.scale_, scaler_test.scale_)]), 5)
    center_stats = round(sum([round(pow(x[1]-x[0],2), 5) for x in zip(scaler_train.mean_, scaler_test.mean_)]), 5)
    scale_result.append(scale_stats)
    center_result.append(center_stats)

print("scale = {}".format(round(np.mean(scale_result),5)))
print("center = {}".format(round(np.mean(center_result),5)))


scale = 0.00354
center = 0.01828
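
A plausible reason for the larger gap is that KFold without shuffling uses contiguous blocks of rows as test folds, so any ordering in the file carries over into the folds. Whether abalone is ordered this way is an assumption here. The sketch below only demonstrates the effect on a deliberately sorted synthetic column; `mean_gap` is a hypothetical helper.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

# Deliberately sorted single-feature data: 0, 1, ..., 99 (illustration only).
X_sorted = np.arange(100, dtype=float).reshape(-1, 1)


def mean_gap(splitter, X):
    """Average squared difference between test and train feature means."""
    gaps = []
    for train_idx, test_idx in splitter.split(X):
        gaps.append((X[test_idx].mean() - X[train_idx].mean()) ** 2)
    return float(np.mean(gaps))


# Contiguous folds: rows 0-19, 20-39, ... each form one test fold.
first_test_fold = next(iter(KFold(n_splits=5).split(X_sorted)))[1]
print(first_test_fold)

gap_ordered = mean_gap(KFold(n_splits=5), X_sorted)
gap_random = mean_gap(
    ShuffleSplit(n_splits=5, test_size=0.2, random_state=0), X_sorted
)
print("KFold, no shuffle:", gap_ordered)
print("ShuffleSplit:     ", gap_random)
```

On sorted data the contiguous folds see very different value ranges, so the train/test gap is far larger than with random sampling.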


Let's add shuffling to KFold

In [87]:
scaler_train = StandardScaler()
scaler_test = StandardScaler()

scale_result = []
center_result = []

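# shuffle the rows once and reset the index before splitting into folds
X_shuffled = X.sample(frac=1).reset_index(drop=True)

for train_records, test_records in KFold(n_splits=10).split(X_shuffled):
    # split() yields positional indices, so select rows with iloc
    X_train = X_shuffled.iloc[train_records]
    X_test = X_shuffled.iloc[test_records]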
    scaler_train.fit(X_train)
    scaler_test.fit(X_test)
    scale_stats = round(sum([round(pow(x[1]-x[0],2), 5) for x in zip(scaler_train.scale_, scaler_test.scale_)]), 5)
    center_stats = round(sum([round(pow(x[1]-x[0],2), 5) for x in zip(scaler_train.mean_, scaler_test.mean_)]), 5)
    scale_result.append(scale_stats)
    center_result.append(center_stats)

print("scale = {}".format(round(np.mean(scale_result),5)))
print("center = {}".format(round(np.mean(center_result),5)))


scale = 0.00034
center = 0.00059
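
Instead of shuffling the DataFrame by hand, KFold can shuffle the row indices itself via its `shuffle` and `random_state` parameters. A minimal sketch on a made-up DataFrame (`X_demo` is not the abalone data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Small illustrative DataFrame with 20 rows.
X_demo = pd.DataFrame({"a": np.arange(20.0), "b": np.arange(20.0) * 2})

# shuffle=True permutes the row indices before they are cut into folds;
# random_state makes the permutation reproducible.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

seen_in_test = []
for train_idx, test_idx in kf.split(X_demo):
    X_train = X_demo.iloc[train_idx]  # positional indexing
    X_test = X_demo.iloc[test_idx]
    seen_in_test.extend(test_idx.tolist())

# Every row still lands in exactly one test fold.
print(sorted(seen_in_test) == list(range(len(X_demo))))
```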
