<a href="https://colab.research.google.com/github/pceuropa/machine_learning/blob/master/Fature_Engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Engineering

## Filtering and Scaling

Many [algorithms](https://en.wikipedia.org/wiki/Algorithm) are sensitive to features being on different scales, e.g. [gradient descent](https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html) and kNN.

**Solution:**
Align features onto the same scale
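As a rough illustration of why distance-based methods such as kNN are affected, the sketch below uses three made-up samples (height in metres, salary in dollars; the data and the helper `nearest` are hypothetical, not part of the original notebook). With raw features the salary column dominates the Euclidean distance, so the nearest neighbour changes once both columns are put on the same scale.

```python
import numpy as np

# Hypothetical data: column 0 is height in metres, column 1 is salary in dollars.
X = np.array([[1.60, 50000.0],
              [1.95, 50500.0],
              [1.61, 52000.0]])
query = np.array([1.60, 50600.0])

def nearest(points, q):
    """Return the index of the row in `points` closest to `q` (Euclidean distance)."""
    return int(np.argmin(np.linalg.norm(points - q, axis=1)))

# Raw features: the salary column dominates the distance.
raw_nearest = nearest(X, query)

# Standardize each column using statistics computed from X, then compare again.
mu, sigma = X.mean(axis=0), X.std(axis=0)
scaled_nearest = nearest((X - mu) / sigma, (query - mu) / sigma)

print(raw_nearest, scaled_nearest)  # 1 0
```

Before scaling, the closest sample is the one with the most similar salary, even though its height is very different; after scaling, the sample with the identical height wins.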


**Different scales**

Some algorithms, like [decision trees](https://en.wikipedia.org/wiki/Decision_tree) and [random forests](https://towardsdatascience.com/the-random-forest-algorithm-d457d499ffcd), **aren't sensitive** to features on different scales

**Important**: Fit the scaler to training data only, then transform both train and validation data
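A minimal sketch of this rule, assuming two small made-up arrays stand in for the training and validation splits (the names `X_train` and `X_val` are illustrative). The scaler's statistics come from the training split only and are then reused for the validation split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data standing in for a training split and a validation split.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
X_val = np.array([[2.5, 250.0], [5.0, 500.0]])

scaler = StandardScaler()
scaler.fit(X_train)                         # statistics come from the training data only
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)      # reuses the training mean and standard deviation

print(scaler.mean_)     # column means of X_train
print(X_val_scaled)     # validation rows expressed in training-set units
```

Fitting the scaler on the validation data as well would leak information about that split into the preprocessing step.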


Common choices in sklearn

- Normalizer 
- Mean/variance standardization
- MinMax scaling
- Maxabs scaling
- Robust scaling

Normalizer rescales each row (sample) independently, whereas the scalers rescale each column (feature).
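Maxabs and robust scaling appear in the list above but are not demonstrated elsewhere in this notebook, so here is a short sketch using sklearn's `MaxAbsScaler` and `RobustScaler` with their default settings. The input values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, RobustScaler

# MaxAbsScaler divides each column by its largest absolute value.
X = np.array([[-2.0, 10.0], [1.0, 5.0], [4.0, -20.0]])
maxabs_scaled = MaxAbsScaler().fit_transform(X)
print(maxabs_scaled)

# RobustScaler (defaults) subtracts the median and divides by the
# interquartile range, so the outlier 100 barely affects the other values.
col = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
robust_scaled = RobustScaler().fit_transform(col)
print(robust_scaled.ravel())
```

Maxabs scaling keeps zeros at zero and maps values into [-1, 1]; robust scaling is a reasonable choice when a column contains outliers that would distort the mean and standard deviation.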




### Mean/Variance Standardization

$\mu_j$ - [mean](https://colab.research.google.com/drive/1SauArq6_5lN_9uvcxJ3DYBeTZ81T2Rbp#scrollTo=1jFkn9tfeqYX) of feature (column) $j$

$\sigma_j$ - standard deviation of feature (column) $j$

Transform: $$ x'_{i,j} = \frac{x_{i,j} - \mu_j}{\sigma_j} $$

In [9]:
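from sklearn.preprocessing import StandardScaler

# Four samples (rows), two features (columns)
data = [[0, 0], [0, 0], [1, 1], [1, 1]]
scaler = StandardScaler()

# fit() learns the per-column mean and standard deviation and returns the scaler
print('Algorithm:', scaler.fit(data), '\n')
print(scaler.mean_, '\n')              # learned column means
print(scaler.transform(data), '\n')    # standardized training data
print(scaler.transform([[2, 2]]))      # new sample, scaled with the learned statistics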

Algorithm: StandardScaler(copy=True, with_mean=True, with_std=True) 

[0.5 0.5] 

[[-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [ 1.  1.]] 

[[3. 3.]]
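The result above can be reproduced by hand with NumPy, which makes the formula concrete. This is a sketch rather than sklearn's actual implementation; it relies on the fact that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[0, 0], [0, 0], [1, 1], [1, 1]], dtype=float)

mu = data.mean(axis=0)       # per-column mean
sigma = data.std(axis=0)     # per-column standard deviation, ddof=0 (population)
manual = (data - mu) / sigma

# The manual result agrees with StandardScaler on the same data.
assert np.allclose(manual, StandardScaler().fit_transform(data))
print(manual)
```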


### MinMax Scaling

Advantage: robust to very small standard deviations of features

Transform (applied to each feature column separately): $$x'_i = \frac{x_i - x_{min}}{x_{max} - x_{min}}$$

Scale values so that:

minimum = 0

maximum = 1

In [16]:
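from sklearn.preprocessing import MinMaxScaler

# Four samples (rows), two features (columns)
d = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler()

# fit() learns the per-column minimum and maximum and returns the scaler
print('Algorithm:', scaler.fit(d), '\n')
print('Data min:', scaler.data_min_)
print('Data max:', scaler.data_max_, '\n')
print('Transform:')
print(scaler.transform(d), '\n')
# A new sample outside the fitted range can map outside [0, 1]
print(scaler.transform([[2, 2]]), '\n')

# fit and transform in a single step
scaler.fit_transform(d)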


Algorithm: MinMaxScaler(copy=True, feature_range=(0, 1)) 

Data min: [-1.  2.]
Data max: [ 1. 18.] 

Transform:
[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]] 

[[1.5 0. ]] 



array([[0.  , 0.  ],
       [0.25, 0.25],
       [0.5 , 0.5 ],
       [1.  , 1.  ]])
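The same numbers can be obtained directly from the formula, as the NumPy sketch below shows. It also uses the `feature_range` parameter of `MinMaxScaler` to target the interval [-1, 1] instead of the default [0, 1]. Note that, as with `[[2, 2]]` in the cell above, samples outside the fitted minimum and maximum map outside the target interval.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

d = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)

# Manual min-max scaling, column by column.
d_min, d_max = d.min(axis=0), d.max(axis=0)
manual = (d - d_min) / (d_max - d_min)

# A different target interval via the feature_range parameter.
scaled_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(d)

print(manual)
print(scaled_pm1)
```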

### Normalizer

Transform: $$ x'_{i,j} = \frac{x_{i,j}}{\lVert x_i \rVert} $$
Rescales each sample (row) $x_i$ to unit norm based on
- L1 norm
- L2 norm
- Max norm

In [20]:
from sklearn.preprocessing import Normalizer
X = [[4, 1, 2, 2],
     [1, 3, 9, 3],
     [5, 7, 5, 1]]
transformer = Normalizer().fit(X)  # Normalizer is stateless, so fit() learns nothing; the default norm is 'l2'
transformer.transform(X)

array([[0.8, 0.2, 0.4, 0.4],
       [0.1, 0.3, 0.9, 0.3],
       [0.5, 0.7, 0.5, 0.1]])
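The cell above uses the default L2 norm. The sketch below reproduces it by hand and then tries the other two options through the `norm` parameter of `Normalizer` (`'l1'` and `'max'`), reusing the same matrix `X`.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[4, 1, 2, 2],
              [1, 3, 9, 3],
              [5, 7, 5, 1]], dtype=float)

# L2 (the default) by hand: divide each row by its Euclidean length.
manual_l2 = X / np.linalg.norm(X, axis=1, keepdims=True)

l1 = Normalizer(norm='l1').fit_transform(X)    # absolute values in each row sum to 1
mx = Normalizer(norm='max').fit_transform(X)   # largest absolute value in each row is 1

print(manual_l2)
print(l1.sum(axis=1))
print(mx.max(axis=1))
```

Each row is rescaled using only its own values, which is why fitting is unnecessary and why a new sample is normalized the same way regardless of the training data.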