# Preprocessing data

### Standardization, or mean removal and variance scaling


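Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.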

In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.



In [56]:
from sklearn import preprocessing
import numpy as np
import pandas as pd

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

X_scaled = preprocessing.scale(X_train)

X_scaled

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [57]:
# (1. - 1.)/0.81649658
# (-1. - 0)/0.81649658
# (2. - 0.33333333)/1.24721913

# (2. - 1.)/0.81649658
# (0 - 0)/0.81649658
# (0 - 0.33333333)/1.24721913

# (0 - 1.)/0.81649658
# (1. - 0)/0.81649658
# (-1. - 0.33333333)/1.24721913

In [58]:
X_train.mean(axis=0)

array([1.        , 0.        , 0.33333333])

In [59]:
X_train.std(axis=0)

array([0.81649658, 0.81649658, 1.24721913])

In [60]:
X_scaled.mean(axis=0)

array([0., 0., 0.])

In [61]:
X_scaled.std(axis=0)

array([1., 1., 1.])

The standard score of a sample x is calculated as:
<tt>z = (x - u) / s</tt>
where u is the mean of the training samples or zero if <tt>with_mean=False</tt>, and s is the standard deviation of the training samples or one if <tt>with_std=False</tt>.
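
As a quick check of this formula, the sketch below recomputes the standard score by hand with NumPy and compares it with `preprocessing.scale`. The variable names `u`, `s` and `z` simply mirror the symbols above; it is an illustration, not how scikit-learn implements the scaling internally.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])

u = X_train.mean(axis=0)   # per-feature mean
s = X_train.std(axis=0)    # per-feature (population) standard deviation
z = (X_train - u) / s      # standard score computed by hand

# Should agree with scikit-learn's result up to floating-point error
print(np.allclose(z, preprocessing.scale(X_train)))
```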


The preprocessing module further provides a utility class StandardScaler that implements the Transformer API to compute the mean and standard deviation on a training set so as to be able to later reapply the same transformation on the testing set.

In [62]:
sss = preprocessing.StandardScaler()
scaler = sss.fit(X_train)

In [63]:
scaler.mean_

array([1.        , 0.        , 0.33333333])

In [64]:
scaler.scale_

array([0.81649658, 0.81649658, 1.24721913])

In [65]:
newX_Train = scaler.transform(X_train)

In [66]:
newX_Train

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [67]:
scaler.fit(X_train)
scaler.transform(X_train)

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [68]:
scaler.fit_transform(X_train)

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [69]:
scaler.fit_transform(X_train).mean()

4.9343245538895844e-17

In [70]:
scaler.fit_transform(X_train).std()

1.0

The scaler instance can then be used on new data to transform it the same way it did on the training set:



In [71]:
X_test = [[-1., 1., 0.]]
scaler.transform(X_test)

array([[-2.44948974,  1.22474487, -0.26726124]])
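
To make explicit that the test row is scaled with the statistics learned from the training set (and not with its own), here is a minimal sketch that reproduces `scaler.transform(X_test)` by hand from the fitted `mean_` and `scale_` attributes. The name `manual` is only illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])
X_test = np.array([[-1., 1., 0.]])

scaler = StandardScaler().fit(X_train)

# Centre and scale the test row with the *training* mean and std
manual = (X_test - scaler.mean_) / scaler.scale_
print(np.allclose(manual, scaler.transform(X_test)))
```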

### Scaling features to a range

An alternative standardization is scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size.

In [72]:
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

In [73]:
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))  # (0, 1) is also the default range

In [74]:
min_max_scaler.fit(X_train)

In [75]:
min_max_scaler.transform(X_train)

array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])

In [76]:
X_train_minmax = min_max_scaler.transform(X_train)

In [77]:
X_train_minmax

array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])

In [78]:
X_test = np.array([[-3., -1.,  4.]])
X_test_minmax = min_max_scaler.transform(X_test)
X_test_minmax

array([[-1.5       ,  0.        ,  1.66666667]])

In [79]:
X_train.mean()

0.4444444444444444

In [80]:
X_train_minmax.mean()

0.48148148148148145

In [81]:
X_train.std()

1.0657403385139377

In [82]:
X_train_minmax.std()

0.4115946439054235

The transformation is computed as:

<tt>X_scaled = scale * X + min - X.min(axis=0) * scale</tt>

where <tt>scale = (max - min) / (X.max(axis=0) - X.min(axis=0))</tt>
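
The formula above can be checked directly with NumPy. In this sketch, `lo` and `hi` are illustrative names standing for `min` and `max` of the target range; the result is compared with what `MinMaxScaler` returns.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])
lo, hi = 0.0, 1.0  # target feature range (min, max)

# Apply the formula feature by feature (axis=0)
scale = (hi - lo) / (X.max(axis=0) - X.min(axis=0))
X_manual = scale * X + lo - X.min(axis=0) * scale

mms = MinMaxScaler(feature_range=(lo, hi)).fit(X)
print(np.allclose(scale, mms.scale_))
print(np.allclose(X_manual, mms.transform(X)))
```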

In [83]:
min_max_scaler.scale_

array([0.5       , 0.5       , 0.33333333])
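
The introduction to this section also mentions scaling each feature so that its maximum absolute value becomes one. scikit-learn offers `MaxAbsScaler` for this; a minimal sketch on the same `X_train`, where each column is divided by its largest absolute value (`x_scaled = x / max(abs(x))`):

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])

max_abs_scaler = MaxAbsScaler().fit(X_train)
X_maxabs = max_abs_scaler.transform(X_train)

# The per-feature divisors are the column-wise maximum absolute values
print(max_abs_scaler.scale_)
print(X_maxabs)
```

Unlike `MinMaxScaler`, this scaler does not shift the data, so zeros stay zeros, which matters for sparse inputs.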

If your data contains many outliers, scaling using the mean and variance of the data is likely to not work very well. In these cases, you can use robust_scale and RobustScaler as drop-in replacements instead. They use more robust estimates for the center and range of your data.



In [84]:
from sklearn.preprocessing import RobustScaler
X = [[ 1., -2.,  2.],
     [ -2.,  1.,  3.],
     [ 4.,  1., -2.]]

In [85]:
transformer = RobustScaler().fit(X)

In [86]:
transformer.scale_

array([3. , 1.5, 2.5])

In [87]:
transformer.transform(X)

array([[ 0. , -2. ,  0. ],
       [-1. ,  0. ,  0.4],
       [ 1. ,  0. , -1.6]])

The transformation is computed as:

<tt>X_scaled = (X - median) / IQR</tt>

where <tt>median</tt> is the per-feature median and <tt>IQR = q3 - q1</tt> is the per-feature interquartile range.

The IQR is the range between the 1st quartile (q1) and the 3rd quartile (q3).
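
As an illustration, the median and IQR can be computed with NumPy and compared with `RobustScaler`. This sketch assumes the scaler's default settings (centering on the median and a 25th to 75th percentile range); the names `median`, `q1`, `q3` and `iqr` are chosen here for readability.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[ 1., -2.,  2.],
              [-2.,  1.,  3.],
              [ 4.,  1., -2.]])

median = np.median(X, axis=0)                 # per-feature median
q1, q3 = np.percentile(X, [25, 75], axis=0)   # per-feature quartiles
iqr = q3 - q1

X_manual = (X - median) / iqr
print(iqr)
print(np.allclose(X_manual, RobustScaler().fit_transform(X)))
```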

## Normalization


Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.

‘l1’
The l1 norm of a sample is the sum of the absolute values of its components. (As a regularization penalty, the l1 norm is known for encouraging sparsity; here it is only used to rescale each sample.)
x_normalized = x / sum(abs(i) for i in x)


‘l2’
The l2 norm of a sample is the square root of the sum of its squared components. It is invariant under rotation, which suits models such as PCA that assume rotational invariance.
x_normalized = x / sqrt(sum(i**2 for i in x))

In [88]:
X = np.array([[1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
X_normalized = preprocessing.normalize(X, norm='l2')

In [89]:
# Equivalent to the previous cell, using the Normalizer class:
# X_normalized = preprocessing.Normalizer(norm='l2').fit_transform(X)

In [90]:
X_normalized

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])

In [91]:
np.sqrt(0.40824829**2 + 0.40824829**2 + 0.81649658**2)

0.9999999988637723

In [92]:
np.sqrt(0**2 + 0.70710678**2 + (-0.70710678)**2)

0.9999999983219684
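
The same check can be done for every row at once, and for both norms. The sketch below computes the row-wise l1 and l2 norms with NumPy and compares the rescaled rows with `preprocessing.normalize`; the names `l1` and `l2` are illustrative.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

# Row-wise norms; keepdims=True keeps shape (3, 1) so division broadcasts per row
l1 = np.abs(X).sum(axis=1, keepdims=True)
l2 = np.sqrt((X ** 2).sum(axis=1, keepdims=True))

print(np.allclose(X / l1, preprocessing.normalize(X, norm='l1')))
print(np.allclose(X / l2, preprocessing.normalize(X, norm='l2')))
```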

You can refer to [https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html) for more details.

## Encoding categorical features

In [93]:
X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]
X

[['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]

In [94]:
enc = preprocessing.OneHotEncoder()
enc.fit(X)

In [95]:
enc.transform(X).toarray()

array([[0., 1., 0., 1., 0., 1.],
       [1., 0., 1., 0., 1., 0.]])

In [96]:
enc.transform([['female', 'from US', 'uses Safari'],
               ['male', 'from Europe', 'uses Firefox']]).toarray()

array([[1., 0., 0., 1., 0., 1.],
       [0., 1., 1., 0., 1., 0.]])

In [97]:
enc.get_feature_names_out()

array(['x0_female', 'x0_male', 'x1_from Europe', 'x1_from US',
       'x2_uses Firefox', 'x2_uses Safari'], dtype=object)

In [98]:
enc.categories_

[array(['female', 'male'], dtype=object),
 array(['from Europe', 'from US'], dtype=object),
 array(['uses Firefox', 'uses Safari'], dtype=object)]

In [99]:
enc = preprocessing.OneHotEncoder()
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)

In [100]:
pd.DataFrame(X, columns=['gender', 'locations', 'browser'])

   gender    locations       browser
0    male      from US   uses Safari
1  female  from Europe  uses Firefox


In [101]:
X = [['male', 'from Asia', 'uses Safari'], ['female', 'from Europe', 'uses Firefox'], ['male', 'from Asia', 'uses Safari']]
enc.fit(X)

# Flatten the per-feature category arrays into a single list of column names
new_features = [category for categories in enc.categories_ for category in categories]


Xnew = enc.transform(X).toarray()
pd.DataFrame(Xnew, columns=new_features)

   female  male  from Asia  from Europe  uses Firefox  uses Safari
0     0.0   1.0        1.0          0.0           0.0          1.0
1     1.0   0.0        0.0          1.0           1.0          0.0
2     0.0   1.0        1.0          0.0           0.0          1.0


In [102]:
enc.categories_

[array(['female', 'male'], dtype=object),
 array(['from Asia', 'from Europe'], dtype=object),
 array(['uses Firefox', 'uses Safari'], dtype=object)]

With a fixed list of categories for each feature:

In [103]:
genders = ['female', 'male']
locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers])

X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)

In [104]:
enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()

array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]])

With unknown values: by default, a category that the encoder does not know raises an error.

In [105]:
enc.transform([['female', 'from Reggio', 'uses Chrome']]).toarray()

ValueError: Found unknown categories ['from Reggio'] in column 1 during transform

In [None]:
enc = preprocessing.OneHotEncoder(handle_unknown='ignore')  # unknown categories are encoded as all zeros
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)

In [None]:
enc.transform([['female', 'from Reggio', 'uses Firefox']]).toarray()

array([[1., 0., 0., 0., 1., 0.]])

In [None]:
enc.categories_

[array(['female', 'male'], dtype=object),
 array(['from Europe', 'from US'], dtype=object),
 array(['uses Firefox', 'uses Safari'], dtype=object)]

In [None]:
enc = preprocessing.OneHotEncoder(handle_unknown='infrequent_if_exist')  # unknown categories map to the infrequent category if one exists, otherwise to all zeros
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)

In [None]:
enc.categories_

[array(['female', 'male'], dtype=object),
 array(['from Europe', 'from US'], dtype=object),
 array(['uses Firefox', 'uses Safari'], dtype=object)]

In [None]:
enc.transform([['female', 'from Reggio', 'uses Firefox']]).toarray()

array([[1., 0., 0., 0., 1., 0.]])

In [None]:
enc.inverse_transform([[1., 0., 0., 0., 1., 0.]])

array([['female', None, 'uses Firefox']], dtype=object)

In [None]:
enc = preprocessing.OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=1, max_categories=2)
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox'], ['female', 'from Europe', 'uses Firefox'],
     ['female', 'from Europe', 'uses Firefox'],['female', 'from Europe', 'uses Firefox'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)

In [None]:
enc.get_feature_names_out()

array(['x0_female', 'x0_infrequent_sklearn', 'x1_from Europe',
       'x1_infrequent_sklearn', 'x2_uses Firefox',
       'x2_infrequent_sklearn'], dtype=object)

In [None]:
enc.transform([['female', 'from Reggio', 'uses Firefox']]).toarray()

array([[1., 0., 0., 1., 1., 0.]])

In [None]:
enc.transform([['male', 'from Europe', 'uses Firefox']]).toarray()

array([[0., 1., 1., 0., 1., 0.]])

In [None]:
enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()

array([[1., 0., 0., 1., 0., 1.]])

In [None]:
enc.inverse_transform([[1., 0., 0., 1., 0., 1.]])

array([['female', 'infrequent_sklearn', 'infrequent_sklearn']],
      dtype=object)

Dropping the first category of each feature:

In [106]:
X = [['male', 'from US', 'uses Safari'],
    ['female', 'from Europe', 'uses Firefox'],
     ['male', 'from Asia', 'uses Safari'],
      ['male', 'from Indo', 'uses Safari']]
drop_enc = preprocessing.OneHotEncoder(drop='first').fit(X)

In [107]:
drop_enc.categories_

[array(['female', 'male'], dtype=object),
 array(['from Asia', 'from Europe', 'from Indo', 'from US'], dtype=object),
 array(['uses Firefox', 'uses Safari'], dtype=object)]

In [108]:
drop_enc.get_feature_names_out()

array(['x0_male', 'x1_from Europe', 'x1_from Indo', 'x1_from US',
       'x2_uses Safari'], dtype=object)

In [109]:
drop_enc.transform(X).toarray()

array([[1., 0., 0., 1., 1.],
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 1.],
       [1., 0., 1., 0., 1.]])

In [110]:
drop_enc.transform([['female','from Asia', 'uses Firefox']]).toarray()

array([[0., 0., 0., 0., 0.]])

In [111]:
drop_enc.inverse_transform([[0., 0., 0., 0., 0.]])

array([['female', 'from Asia', 'uses Firefox']], dtype=object)

In [112]:
enc = preprocessing.OrdinalEncoder()

In [113]:
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox'], ['female', 'from Asia', 'uses Firefox']]
enc.fit(X)

In [114]:
enc.categories_

[array(['female', 'male'], dtype=object),
 array(['from Asia', 'from Europe', 'from US'], dtype=object),
 array(['uses Firefox', 'uses Safari'], dtype=object)]

In [115]:
enc.transform(X)

array([[1., 2., 1.],
       [0., 1., 0.],
       [0., 0., 0.]])

In [116]:
enc.transform([['female','from Reggio', 'uses Safari']])

ValueError: Found unknown categories ['from Reggio'] in column 1 during transform
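
One way to avoid this error is to tell `OrdinalEncoder` which code to use for unseen categories. A minimal sketch, assuming `-1` is an acceptable placeholder for your downstream model:

```python
from sklearn.preprocessing import OrdinalEncoder

X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox'],
     ['female', 'from Asia', 'uses Firefox']]

# Unknown categories are encoded as -1 instead of raising ValueError
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1).fit(X)
encoded = enc.transform([['female', 'from Reggio', 'uses Safari']])
print(encoded)
```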

## Discretization

Discretization (otherwise known as quantization or binning) provides a way to partition continuous features into discrete values. Certain datasets with continuous features may benefit from discretization, because discretization can transform the dataset of continuous attributes to one with only nominal attributes.

In [None]:
X = np.array([[ -3., 5., 15 ],
              [  0., 6., 14 ],
              [  6., 3., 11 ]])
est = preprocessing.KBinsDiscretizer(n_bins=[3, 2, 3], encode='ordinal').fit(X)
est

In [None]:
est.transform(X)

array([[0., 1., 2.],
       [1., 1., 1.],
       [2., 0., 0.]])

There are different strategies implemented in KBinsDiscretizer:

- ‘uniform’: The discretization is uniform in each feature, which means that the bin widths are constant in each dimension.

- ‘quantile’: The discretization is done on the quantiled values, which means that each bin has approximately the same number of samples.

- ‘kmeans’: The discretization is based on the centroids of a KMeans clustering procedure.

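You can refer to [https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization_strategies.html](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization_strategies.html?highlight=kbinsdiscretizer) for a visual comparison of the strategies.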
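
The difference between strategies shows up clearly on skewed data. The sketch below bins a single feature containing one large value with the ‘uniform’ and ‘quantile’ strategies; the toy data and the `binned` dictionary are made up for illustration, and ‘kmeans’ is left out for brevity.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# One feature, five small values and one outlier
X = np.array([[0.], [1.], [2.], [3.], [4.], [100.]])

binned = {}
for strategy in ['uniform', 'quantile']:
    est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy=strategy)
    binned[strategy] = est.fit_transform(X).ravel()
    print(strategy, binned[strategy])
```

With equal-width bins the outlier stretches the range, so the small values all share the first bin; with quantile bins the samples are spread evenly across the three bins.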

Feature binarization is the process of thresholding numerical features to get boolean values. This can be useful for downstream probabilistic estimators that assume the input data is distributed according to a multivariate Bernoulli distribution.

In [None]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]

binarizer = preprocessing.Binarizer().fit(X)  # fit does nothing
binarizer

In [None]:
binarizer.transform(X)

In [None]:
binarizer = preprocessing.Binarizer(threshold=1.1)

In [None]:
binarizer.transform(X)
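
What the binarizer does can be expressed as a plain comparison: values strictly greater than the threshold become 1, all others 0. A minimal sketch comparing that expression with `Binarizer` for the two thresholds used above (`by_hand` and `by_sklearn` are illustrative names):

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

for threshold in (0.0, 1.1):
    by_hand = (X > threshold).astype(float)   # 1.0 where strictly greater
    by_sklearn = Binarizer(threshold=threshold).fit_transform(X)
    print(threshold, np.array_equal(by_hand, by_sklearn))
```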

## Custom transformers

Often, you will want to convert an existing Python function into a transformer to assist in data cleaning or processing.

In [117]:
import numpy as np
from sklearn.preprocessing import FunctionTransformer

transformer = FunctionTransformer(np.log1p, validate=True)
X = np.array([[0, 1], [2, 3]])
transformer.transform(X)

array([[0.        , 0.69314718],
       [1.09861229, 1.38629436]])

The *validate* parameter indicates that the input X array should be checked before calling func. The possibilities are:

- If False, there is no input validation.

- If True, then X will be converted to a 2-dimensional NumPy array or sparse matrix. If the conversion is not possible an exception is raised.
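
The effect of the two settings can be seen with an input that is not 2-dimensional. In this sketch the 1-D array `x_1d` and the names `lenient` and `strict` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

x_1d = np.array([0, 1, 2, 3])  # deliberately 1-D

# validate=False: the array is handed to np.log1p as-is
lenient = FunctionTransformer(np.log1p, validate=False)
print(lenient.transform(x_1d).shape)

# validate=True: a 2-D input is required, so a 1-D array is rejected
strict = FunctionTransformer(np.log1p, validate=True)
try:
    strict.transform(x_1d)
    raised = False
except ValueError:
    raised = True
print(raised)
```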

In [118]:
type(X)

numpy.ndarray

In [119]:
type(transformer.transform(X))

numpy.ndarray

In [120]:
df = pd.DataFrame(X)

In [121]:
df

   0  1
0  0  1
1  2  3


In [122]:
transformer.transform(df)

array([[0.        , 0.69314718],
       [1.09861229, 1.38629436]])

In [123]:
transformer = FunctionTransformer(np.log1p, validate=False)

In [124]:
transformer.transform(df)

          0         1
0  0.000000  0.693147
1  1.098612  1.386294


Pay attention!
The result of a transformer is typically a NumPy array, even when the input is a pandas DataFrame.

In [125]:
preprocessing.StandardScaler().fit_transform(df)

array([[-1., -1.],
       [ 1.,  1.]])

In [126]:
preprocessing.StandardScaler().fit_transform(X)

array([[-1., -1.],
       [ 1.,  1.]])

In [127]:
X.shape

(2, 2)