<div style="line-height:0.5">
<h1 style="color:#E74C3C"> Data preprocessing 3 </h1>
</div>
<div style="line-height:1.2">
<h4> Scaling and standardizing features with sklearn.preprocessing. </h4>
</div>
<div style="margin-top: 5px;">
<span style="display: inline-block;">
    <h3 style="color: lightblue; display: inline;">Keywords:</h3>  Binarizer + fit_transform
</span>
</div>

In [43]:
import numpy as np

from sklearn.preprocessing import scale, normalize
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler
from sklearn.preprocessing import PolynomialFeatures, KBinsDiscretizer, Binarizer
from sklearn.impute import SimpleImputer

from sklearn.datasets import make_regression, load_breast_cancer

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

<h3 style="color:#E74C3C"> => Binarization </h3>

In [33]:
""" Binarization converts numerical values into Boolean values:
    values greater than the threshold map to 1, all other values map to 0.
"""

data_inp = np.array([[2.3, 4.3, 6.4, -1.1],
                    [1.5, 5.7, 8.2, -6.3], 
                    [3.3, -6.3, 3.5, -4.5],
                    [7.8, 2.1, -2.2, 1.3]])
data_binarized = Binarizer(threshold=0.5).transform(data_inp)
print("\nBinarized data:\n", data_binarized)


Binarized data:
 [[1. 1. 1. 0.]
 [1. 1. 1. 0.]
 [1. 0. 1. 0.]
 [1. 1. 0. 1.]]
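
As a cross-check, the same thresholding can be sketched with plain NumPy. This is a minimal sketch that assumes the strict "greater than" comparison described for `Binarizer`; the names `manual_binarized` and `sk_binarized` are illustrative and do not appear in the original notebook.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

data_inp = np.array([[2.3, 4.3, 6.4, -1.1],
                     [1.5, 5.7, 8.2, -6.3],
                     [3.3, -6.3, 3.5, -4.5],
                     [7.8, 2.1, -2.2, 1.3]])

# Values strictly greater than the threshold become 1.0, all others 0.0
manual_binarized = (data_inp > 0.5).astype(float)

# Binarizer learns nothing from the data, so fit_transform only applies the threshold
sk_binarized = Binarizer(threshold=0.5).fit_transform(data_inp)

print("Same result:", np.array_equal(manual_binarized, sk_binarized))
```

Using `fit_transform` here keeps the call pattern consistent with the other transformers in this notebook, even though there is nothing to fit.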


<h3 style="color:#E74C3C"> => Standardization </h3>

In [34]:
""" MinMaxScaler transforms features individually by scaling and translating each one to a given range,
    e.g. between zero and one.
    The transformation is given by:
        X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
        X_scaled = X_std * (max - min) + min
    where (min, max) is the requested feature_range.
"""

data_inp = np.array([[2.3, 4.3, 6.4, -1.1],
                    [1.5, 5.7, 8.2, -6.3], 
                    [3.3, -6.3, 3.5, -4.5],
                    [7.8, 2.1, -2.2, 1.3]])

data_scaler_minmax = MinMaxScaler(feature_range=(0.1, 0.9))
## fit_transform is equivalent to calling fit() and then transform() on the same data
data_scaled_minmax = data_scaler_minmax.fit_transform(data_inp)
print("\nMin max scaled data:\n", data_scaled_minmax)


Min max scaled data:
 [[0.2015873  0.80666667 0.76153846 0.64736842]
 [0.1        0.9        0.9        0.1       ]
 [0.32857143 0.1        0.53846154 0.28947368]
 [0.9        0.66       0.1        0.9       ]]
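
The formula quoted in the docstring can be checked by hand. The following is a minimal sketch, assuming the same input array and `feature_range=(0.1, 0.9)`; the names `lo`, `hi`, `X_std` and `X_scaled` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data_inp = np.array([[2.3, 4.3, 6.4, -1.1],
                     [1.5, 5.7, 8.2, -6.3],
                     [3.3, -6.3, 3.5, -4.5],
                     [7.8, 2.1, -2.2, 1.3]])

lo, hi = 0.1, 0.9  # target feature_range (min, max)

# Per-column minimum and maximum (axis=0 runs down the rows)
col_min = data_inp.min(axis=0)
col_max = data_inp.max(axis=0)

# Step 1: map each column to [0, 1]; step 2: stretch and shift to [lo, hi]
X_std = (data_inp - col_min) / (col_max - col_min)
X_scaled = X_std * (hi - lo) + lo

sk_scaled = MinMaxScaler(feature_range=(lo, hi)).fit_transform(data_inp)
print("Matches MinMaxScaler:", np.allclose(X_scaled, sk_scaled))
```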


In [35]:
""" Remove the mean and scale to unit variance """

example_data = np.array([[2.1, -1.9, 5.5],
                      [-1.5, 2.4, 3.5],
                      [0.5, -7.9, 5.6],
                      [5.9, 2.3, -5.8]])
## Display the per-feature mean and standard deviation of the input data
print("Mean =", example_data.mean(axis=0))
print("Stddeviation = ", example_data.std(axis=0))
## Center each feature to zero mean and scale it to unit standard deviation
data_scaled = scale(example_data)
print("Mean_removed =", data_scaled.mean(axis=0))
print("Stddeviation_removed =", data_scaled.std(axis=0))

Mean = [ 1.75  -1.275  2.2  ]
Stddeviation =  [2.71431391 4.20022321 4.69414529]
Mean_removed = [1.11022302e-16 0.00000000e+00 0.00000000e+00]
Stddeviation_removed = [1. 1. 1.]


1) StandardScaler standardizes features by removing the mean and scaling to unit variance.

    The standard score of a sample `x` is calculated as:

        z = (x - u) / s

    where `u` is the mean of the training samples or zero if `with_mean=False`, \
    and `s` is the standard deviation of the training samples or one if `with_std=False`.

2) RobustScaler (robust to outliers) removes the median and scales the data according to
    the quantile range (defaults to IQR: Interquartile Range). \
    The IQR is the range between the 1st quartile (25th percentile)
    and the 3rd quartile (75th percentile).

3) MaxAbsScaler scales each feature by its maximum absolute value. It does not shift/center the data, and
    thus does not destroy any sparsity.
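
The three definitions above can be sketched directly in NumPy and compared against the scikit-learn scalers. This is a minimal sketch under the default settings of each scaler (population standard deviation, 25th to 75th percentile range); the variable names `z`, `robust` and `max_abs` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, MaxAbsScaler

example_data = np.array([[2.1, -1.9, 5.5],
                         [-1.5, 2.4, 3.5],
                         [0.5, -7.9, 5.6],
                         [5.9, 2.3, -5.8]])

# 1) Standard score: subtract the column mean, divide by the column std (ddof=0)
z = (example_data - example_data.mean(axis=0)) / example_data.std(axis=0)

# 2) Robust scaling: subtract the column median, divide by the interquartile range
q25, q50, q75 = np.percentile(example_data, [25, 50, 75], axis=0)
robust = (example_data - q50) / (q75 - q25)

# 3) Max-abs scaling: divide by the largest absolute value in each column (no centering)
max_abs = example_data / np.abs(example_data).max(axis=0)

print("StandardScaler:", np.allclose(z, StandardScaler().fit_transform(example_data)))
print("RobustScaler:  ", np.allclose(robust, RobustScaler().fit_transform(example_data)))
print("MaxAbsScaler:  ", np.allclose(max_abs, MaxAbsScaler().fit_transform(example_data)))
```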


In [36]:
std_scaler = StandardScaler()
data_std_scaled = std_scaler.fit_transform(example_data)

scaler1 = MinMaxScaler()
data_scaled1 = scaler1.fit_transform(example_data)
scaler2 = RobustScaler()
data_scaled2 = scaler2.fit_transform(example_data)
scaler3 = MaxAbsScaler()
data_scaled3 = scaler3.fit_transform(example_data)

print(f"data_std_scaled\n {data_std_scaled}")
print(f"data_scaled1\n {data_scaled1}")
print(f"data_scaled2\n {data_scaled2}")
print(f"data_scaled3\n {data_scaled3}")


data_std_scaled
 [[ 0.12894603 -0.14880162  0.70300338]
 [-1.19735598  0.8749535   0.27694073]
 [-0.46052153 -1.57729713  0.72430651]
 [ 1.52893149  0.85114524 -1.70425062]]
data_scaled1
 [[0.48648649 0.58252427 0.99122807]
 [0.         1.         0.81578947]
 [0.27027027 0.         1.        ]
 [1.         0.99029126 0.        ]]
data_scaled2
 [[ 0.26229508 -0.36681223  0.22988506]
 [-0.91803279  0.38427948 -0.22988506]
 [-0.26229508 -1.41484716  0.25287356]
 [ 1.50819672  0.36681223 -2.36781609]]
data_scaled3
 [[ 0.3559322  -0.24050633  0.94827586]
 [-0.25423729  0.30379747  0.60344828]
 [ 0.08474576 -1.          0.96551724]
 [ 1.          0.29113924 -1.        ]]


<h3 style="color:#E74C3C"> => Imputation </h3>

In [37]:
# Create a mean imputer (example_data has no missing values, so the output equals the input)
imputer = SimpleImputer(strategy='mean')
data_imputed1 = imputer.fit_transform(example_data)
data_imputed1

array([[ 2.1, -1.9,  5.5],
       [-1.5,  2.4,  3.5],
       [ 0.5, -7.9,  5.6],
       [ 5.9,  2.3, -5.8]])
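
Because `example_data` contains no missing values, the imputer above returns the input unchanged. A minimal sketch of the intended behavior follows, using a hypothetical copy of the data (`data_with_nan`, not in the original notebook) in which two entries have been replaced by `np.nan`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical variant of example_data with two missing entries
data_with_nan = np.array([[2.1, np.nan, 5.5],
                          [-1.5, 2.4, 3.5],
                          [0.5, -7.9, np.nan],
                          [5.9, 2.3, -5.8]])

imputer = SimpleImputer(strategy='mean')
data_imputed = imputer.fit_transform(data_with_nan)

# statistics_ holds the per-column fill value learned during fit
print("Column means used for filling:", imputer.statistics_)
print(data_imputed)
```

Each `np.nan` is replaced by the mean of the observed values in its column; the other entries are left untouched.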

<h3 style="color:#E74C3C"> => Discretization </h3>
Discretization converts continuous data into discrete data by dividing the range of each feature into bins or categories.

In [38]:
# Create discretizer obj
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
data_discretized = discretizer.fit_transform(example_data)
data_discretized

array([[1., 1., 2.],
       [0., 2., 2.],
       [0., 0., 2.],
       [2., 2., 0.]])
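
To see where the bin indices come from, the fitted edges can be inspected through the `bin_edges_` attribute. The sketch below also rebuilds the uniform bins for the first feature by hand; `manual_edges` and `manual_bins` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

example_data = np.array([[2.1, -1.9, 5.5],
                         [-1.5, 2.4, 3.5],
                         [0.5, -7.9, 5.6],
                         [5.9, 2.3, -5.8]])

discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
data_discretized = discretizer.fit_transform(example_data)

# One array of n_bins + 1 edges per feature
for j, edges in enumerate(discretizer.bin_edges_):
    print(f"feature {j}: edges = {edges}")

# Manual check for feature 0: equal-width edges between the column min and max
col = example_data[:, 0]
manual_edges = np.linspace(col.min(), col.max(), 3 + 1)
# Only the inner edges decide the bin; the maximum value falls in the last bin
manual_bins = np.digitize(col, manual_edges[1:-1])
print("manual bins for feature 0:", manual_bins)
```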

<h3 style="color:#E74C3C"> => Polynomial features </h3>
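Polynomial feature expansion creates new features from products and powers of the existing ones, so that a linear model can capture non-linear relationships. For degree 2 and three input features a, b, c, the output columns are 1, a, b, c, a², ab, ac, b², bc, c².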

In [39]:
# Polynomial features
poly = PolynomialFeatures(degree=2)
data_poly = poly.fit_transform(example_data)
print("data_poly")
print(data_poly)

data_poly
[[  1.     2.1   -1.9    5.5    4.41  -3.99  11.55   3.61 -10.45  30.25]
 [  1.    -1.5    2.4    3.5    2.25  -3.6   -5.25   5.76   8.4   12.25]
 [  1.     0.5   -7.9    5.6    0.25  -3.95   2.8   62.41 -44.24  31.36]
 [  1.     5.9    2.3   -5.8   34.81  13.57 -34.22   5.29 -13.34  33.64]]


In [40]:
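## Generate a random regression dataset (make_regression draws y as a linear function of X plus Gaussian noise,
## so the polynomial terms added below are not expected to improve the score)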
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)

## Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Create polynomial features up to degree 3
poly = PolynomialFeatures(degree=3)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
print(f"X_train_poly\n {X_train_poly[:4]}")
print(f"X_test_poly\n {X_test_poly[:4]}")
print()
print("-------------------------------------------------------")

## Fit a linear regression model on the original features
lr = LinearRegression()
lr.fit(X_train, y_train)

## Evaluate the linear regression model on the test set
lr_score = lr.score(X_test, y_test)
print("Linear regression score:", lr_score)

## Fit a linear regression model on the polynomial features
lr_poly = LinearRegression()
lr_poly.fit(X_train_poly, y_train)

## Evaluate the polynomial regression model on the test set
lr_poly_score = lr_poly.score(X_test_poly, y_test)
print("Polynomial regression score:", lr_poly_score)


X_train_poly
 [[ 1.          0.34361829  0.11807353  0.04057222]
 [ 1.         -1.01283112  1.02582688 -1.03898939]
 [ 1.         -0.60063869  0.36076684 -0.21669052]
 [ 1.          1.52302986  2.31961994  3.53285043]]
X_test_poly
 [[ 1.         -1.32818605  1.76407818 -2.34302403]
 [ 1.          1.47789404  2.18417081  3.22797303]
 [ 1.          0.81252582  0.66019821  0.5364281 ]
 [ 1.         -0.39210815  0.1537488  -0.06028616]]

-------------------------------------------------------
Linear regression score: 0.9374151607623286
Polynomial regression score: 0.9365814213479909


<h3 style="color:#E74C3C"> => Normalization </h3>

In [52]:
""" L1 and L2 normalization
Normalization rescales each sample (row) so that feature vectors can be compared on a common scale.
L1 and L2 are vector norms that measure the magnitude of a vector (the same norms behind the Lasso and Ridge penalties).
normalize() rescales each row so that the sum of its absolute values (L1 norm)
or the sum of its squared values (L2 norm) equals 1.
"""

dat = np.array([[2.1, -1.9, 5.5],
                    [-1.5, 2.4, 3.5],
                    [0.5, -7.9, 5.6],
                    [5.9, 2.3, -5.8]])

data_normalized_l1 = normalize(dat, norm='l1')
data_normalized_l2 = normalize(dat, norm='l2')
print("\nL1 normalized data:\n", data_normalized_l1)
print("\nL2 normalized data:\n", data_normalized_l2)

print()
print("###########################################################")
print()
X = np.random.rand(4, 3)

X_norm_l1 = normalize(X, norm='l1')
X_norm_l2 = normalize(X, norm='l2')

print("Original data:\n", X)
print("\nL1 normalized data:\n", X_norm_l1)
print("\nL2 normalized data:\n", X_norm_l2)


L1 normalized data:
 [[ 0.22105263 -0.2         0.57894737]
 [-0.2027027   0.32432432  0.47297297]
 [ 0.03571429 -0.56428571  0.4       ]
 [ 0.42142857  0.16428571 -0.41428571]]

L2 normalized data:
 [[ 0.33946114 -0.30713151  0.88906489]
 [-0.33325106  0.53320169  0.7775858 ]
 [ 0.05156558 -0.81473612  0.57753446]
 [ 0.68706914  0.26784051 -0.6754239 ]]

###########################################################

Original data:
 [[0.17698986 0.01327682 0.46659534]
 [0.70704891 0.9379844  0.27753776]
 [0.9209406  0.77397533 0.45378789]
 [0.89133978 0.19404063 0.31596212]]

L1 normalized data:
 [[0.26944755 0.02021249 0.71033996]
 [0.36776217 0.48788022 0.14435761]
 [0.42860286 0.36020568 0.21119146]
 [0.63606132 0.13846767 0.22547102]]

L2 normalized data:
 [[0.35453828 0.02659553 0.93466319]
 [0.58580819 0.77714418 0.22994716]
 [0.71628028 0.60197505 0.35294276]
 [0.92329869 0.20099794 0.32729092]]
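
The row-wise nature of `normalize` can be verified by dividing each row by its own norm. This is a minimal sketch on the fixed array used above; `manual_l1` and `manual_l2` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import normalize

dat = np.array([[2.1, -1.9, 5.5],
                [-1.5, 2.4, 3.5],
                [0.5, -7.9, 5.6],
                [5.9, 2.3, -5.8]])

# L1 norm of each row: sum of absolute values
l1_norms = np.abs(dat).sum(axis=1, keepdims=True)
# L2 norm of each row: square root of the sum of squares
l2_norms = np.sqrt((dat ** 2).sum(axis=1, keepdims=True))

manual_l1 = dat / l1_norms
manual_l2 = dat / l2_norms

print("L1 matches:", np.allclose(manual_l1, normalize(dat, norm='l1')))
print("L2 matches:", np.allclose(manual_l2, normalize(dat, norm='l2')))

# After normalization every row has unit norm
print("Row L1 sums:        ", np.abs(manual_l1).sum(axis=1))
print("Row sums of squares:", (manual_l2 ** 2).sum(axis=1))
```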


In [48]:
X, y = load_breast_cancer(return_X_y=True)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

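## Normalize the data using L1 norm (normalize() works row by row and is stateless,
## so the training and test sets can be processed independently)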
X_train_norm_l1 = normalize(X_train, norm='l1')
X_test_norm_l1 = normalize(X_test, norm='l1')

## Normalize the data using L2 norm
X_train_norm_l2 = normalize(X_train, norm='l2')
X_test_norm_l2 = normalize(X_test, norm='l2')

print(f"X_train_norm_l1 ==>\n {X_train_norm_l1}")
print()
print(f"X_test_norm_l1 ==>\n {X_test_norm_l1}")
print()
print(f"X_train_norm_l2 ==>\n {X_train_norm_l2}")
print()
print(f"X_test_norm_l2 ==>\n {X_test_norm_l2}")
print()

X_train_norm_l1 ==>
 [[1.15192592e-02 2.21097310e-02 7.50046789e-02 ... 2.23266181e-04
  5.39411094e-04 1.49907293e-04]
 [5.38293180e-03 6.78162627e-03 3.64222081e-02 ... 7.40950736e-05
  1.04595801e-04 3.27723302e-05]
 [1.17838189e-02 1.78048327e-02 7.60495018e-02 ... 6.53486175e-05
  4.21612272e-04 1.09064235e-04]
 ...
 [9.03326722e-03 1.06325791e-02 5.70821575e-02 ... 2.10691950e-05
  1.55379782e-04 3.86869107e-05]
 [7.85783596e-03 1.10279500e-02 5.12164530e-02 ... 1.02691461e-04
  1.78684267e-04 5.92991197e-05]
 [9.62106825e-03 1.62088933e-02 6.09966248e-02 ... 5.86979952e-05
  2.12800968e-04 5.43298090e-05]]

X_test_norm_l1 ==>
 [[8.63782899e-03 1.28840112e-02 5.61701325e-02 ... 7.03079104e-05
  2.08776396e-04 6.06102676e-05]
 [5.43924641e-03 6.11987017e-03 3.54958214e-02 ... 5.13770424e-05
  7.32603886e-05 1.89224892e-05]
 [6.82127280e-03 8.59498021e-03 4.48721503e-02 ... 6.68008216e-05
  1.25174327e-04 3.53814920e-05]
 ...
 [1.01387941e-02 1.31399475e-02 6.50132567e-02 ... 8.456