In [1]:
from IPython.display import HTML
css_file = './custom.css'
HTML(open(css_file, "r").read())

# Scaling and Outliers

## 1. Definition

***Scaling***, ***standardizing*** and ***normalizing*** are used somewhat interchangeably, even though they are ***not the same thing***.

True, ***standardizing*** and ***normalizing*** can be seen as particular ways of ***scaling***. Let's sort out the differences between them before proceeding.

### 1.1 Scaling vs Standardizing vs Normalizing

According to the implementations in Scikit-Learn library:

1. ***Scaling (MinMaxScaler)***:

    Transforms features by scaling each feature to a given range.

    This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.


2. ***Standardizing (StandardScaler)***:

    Standardize features by removing the mean and ***scaling to unit variance***.

    The standard score of a sample x is calculated as:

        z = (x - u) / s

    where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.
        
        
3. ***Normalizing (Normalizer)***:

    Normalize samples individually to unit norm.

    Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1 or l2) equals one.
        
The main difference is: ***scaling*** and ***standardizing*** operate on ***features*** (columns of your dataset), while ***normalizing*** operates on ***individual samples*** (rows of your dataset).
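
To make the column-versus-row distinction concrete, here is a minimal sketch using the three Scikit-Learn classes named above. The toy matrix `X` is made up for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer

# Toy data (made up): 4 samples (rows) x 2 features (columns)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

# Column-wise: each feature is mapped to the [0, 1] range
X_minmax = MinMaxScaler().fit_transform(X)

# Column-wise: each feature ends up with mean 0 and standard deviation 1
X_standard = StandardScaler().fit_transform(X)

# Row-wise: each sample is rescaled to have L2 norm equal to 1
X_normalized = Normalizer(norm='l2').fit_transform(X)

print(X_minmax.min(axis=0), X_minmax.max(axis=0))        # per-column min and max
print(X_standard.mean(axis=0), X_standard.std(axis=0))   # per-column mean and std
print(np.linalg.norm(X_normalized, axis=1))              # per-row norms
```

Note that the first two checks aggregate over `axis=0` (columns), while the last one computes norms over `axis=1` (rows).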

Even though it is very common for people to say *"you need to normalize your features"*, they usually mean ***standardize your features***.

Why ***standardize*** instead of simply ***scaling***? It turns out, ***scaling*** is fine when you have features with a ***limited range***, like ***age*** (0-120) or ***pixel values*** (0-255), but it is likely a bad choice whenever your features may have values ***very far apart***, like ***salaries*** (maybe not yours and mine, but think of some CEOs...).
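
A small sketch of the salary problem, assuming made-up numbers (five regular salaries and one CEO): with min-max scaling, a single extreme value defines the top of the range, so all the other values get squeezed into a tiny interval near zero.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up yearly salaries in USD: five regular employees and one CEO
salaries = np.array([[30_000.0],
                     [45_000.0],
                     [60_000.0],
                     [80_000.0],
                     [100_000.0],
                     [50_000_000.0]])

# Default feature_range is (0, 1)
scaled = MinMaxScaler().fit_transform(salaries)

# The CEO lands at 1.0, everyone else is crammed below 0.002
print(scaled.ravel().round(4))
```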

### 1.2 Why Scaling?

Many Machine Learning algorithms use ***gradient descent*** to learn the weights of the model. You'll see in the next lesson that ***scaling*** the features makes a ***BIG*** difference in performance.

It is also important for other techniques, like ***Principal Component Analysis (PCA)*** for dimensionality reduction, and for identifying ***outliers***.

![](https://imgs.xkcd.com/comics/log_scale.png)
<center>Source: <a href="https://xkcd.com/1162/">XKCD</a></center>

### 1.3 Outliers

What is an ***outlier***? It could be defined in several ways:

- a point that is distant from the others (again, think of salaries, everyone is between USD 30k and 100k per year, and a CEO is making 50M!)
- a point that is distinct from the others (think of black sheep in a flock of white sheep)
- an error of measurement (think of someone listed as being 450 years old)
- an anomaly / fraud (think of finding the purchase of a USD 10k Rolex on your credit card bill)

The last case, anomalies / frauds, is a special case where ***your goal is to detect the outlier***.

For now, let's focus on the other cases: in all of them, the ***presence of an outlier*** may ***hurt your model***, impacting its training and making its predictions less useful.

So, how do you ***detect*** and ***remove or fence outliers***?

#### 1.3.1 Tukey's Fences

This is a very straightforward way of detecting ***possible*** outliers based on the ***InterQuartile Range (IQR)***.

It defines a ***lower*** and an ***upper fence***, which are given by:

$$
IQR = Q_3 - Q_1
\\
\text{lower fence} = Q_1 - k \cdot IQR
\\
\text{upper fence} = Q_3 + k \cdot IQR
$$

Typical values for ***k*** are 1.5 (outlier) and 3.0 (far out).

The plot below illustrates this:

![](http://www.physics.csbsju.edu/stats/complex.box.defs.gif)

<center>Source: <a href="http://www.physics.csbsju.edu/stats/box2.html">Box Plot: Display of Distribution</a></center>
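
The fences are easy to compute by hand. Below is a minimal sketch using NumPy; the helper name `tukey_fences` and the sample values are made up for illustration, and quartiles are computed with NumPy's default (linear interpolation) method, which is only one of several conventions.

```python
import numpy as np

def tukey_fences(values, k=1.5):
    """Return (lower fence, upper fence) for a 1-D array of values."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up salaries (in thousands of USD), including one extreme value
x = np.array([30, 42, 48, 55, 61, 67, 73, 80, 95, 50_000], dtype=float)

lower, upper = tukey_fences(x, k=1.5)

# Anything outside the fences is an outlier *candidate*
candidates = x[(x < lower) | (x > upper)]
print(lower, upper, candidates)
```

Calling `tukey_fences(x, k=3.0)` instead gives the wider "far out" fences.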

Although easy to compute, ***Tukey's Fences*** only consider the distribution of a ***single feature*** to assess values as outliers or not.

What if we wanted to check if a value can be considered an outlier, ***given its many features***?

#### 1.3.2 Mahalanobis Distance

In a single dimension, we can easily compute how far (in standard deviations) a point is from the mean. This is what ***standardization*** does for a single feature (refer to the previous lesson for more details).

Mahalanobis Distance is the generalization of the same idea in multiple dimensions.

The ***Mahalanobis Distance*** is given by:

$$
\large\sqrt{(x_1 - x_2)^T\ S^{-1}\ (x_1 - x_2)}
$$

where ***S*** is the covariance matrix.
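
As a sketch of how this could be computed, the snippet below evaluates the distance between a point and the data mean with plain NumPy and compares it to `scipy.spatial.distance.mahalanobis`, which takes the ***inverse*** covariance matrix as its third argument. The data, the helper name `mahalanobis_distance` and the chosen point are made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(42)

# Made-up, strongly correlated 2-D data
X = rng.multivariate_normal(mean=[0, 0],
                            cov=[[1.0, 0.8], [0.8, 1.0]],
                            size=200)

S = np.cov(X, rowvar=False)   # covariance matrix of the features
S_inv = np.linalg.inv(S)      # its inverse

def mahalanobis_distance(x1, x2, S_inv):
    """Mahalanobis distance between two points, given the inverse covariance."""
    diff = np.asarray(x1) - np.asarray(x2)
    return float(np.sqrt(diff @ S_inv @ diff))

point = np.array([2.0, -2.0])   # lies *against* the direction of correlation
center = X.mean(axis=0)

d_manual = mahalanobis_distance(point, center, S_inv)
d_scipy = mahalanobis(point, center, S_inv)
d_euclidean = np.linalg.norm(point - center)
print(d_manual, d_scipy, d_euclidean)
```

Because the point sits where the data has little spread, its Mahalanobis distance is noticeably larger than its Euclidean distance.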

If we ***standardize*** all features, the mean moves to the origin, so the ***Mahalanobis Distance*** of a point ***x*** from the mean corresponds to its distance from the origin:

$$
\large\sqrt{x^T\ S^{-1}\ x}
$$

Knowing the distance is not enough, though. To determine if a given point is an ***outlier*** or not, we need to evaluate the ***cumulative chi-squared distribution*** at its ***squared*** distance, using the ***number of features*** as ***degrees of freedom***. If the result falls ***above a threshold***, like 99.9%, the point is considered an ***outlier***.

```python
from scipy.stats import chi2

# For roughly Gaussian features, the *squared* Mahalanobis distance
# follows a chi-squared distribution with df = number of features
chi2.cdf(mahalanobis_distance ** 2, df=n_features)
```
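
Putting the pieces together, here is a hedged end-to-end sketch: standardize the features, compute the squared Mahalanobis distance of every point from the origin, and flag points whose chi-squared probability exceeds the threshold. The data is randomly generated for illustration (it is not the dataset used in the experiment below), and the test assumes the features are roughly Gaussian.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# 200 made-up "regular" points plus one extreme point appended at the end
X = rng.normal(size=(200, 2))
X = np.vstack([X, [[-9.0, 6.0]]])

# Standardize, so distances are measured from the origin
Z = StandardScaler().fit_transform(X)
S_inv = np.linalg.inv(np.cov(Z, rowvar=False))

# Squared Mahalanobis distance z^T S^-1 z, computed for every row at once
d2 = np.einsum('ij,jk,ik->i', Z, S_inv, Z)

# Chi-squared cumulative probability with df = number of features
prob = chi2.cdf(d2, df=Z.shape[1])

threshold = 0.999
is_outlier = prob > threshold
print(np.flatnonzero(is_outlier))   # indices of flagged points
```

Keep in mind that the extreme point itself inflates the estimated mean and covariance, which is one reason robust estimators are sometimes preferred for this kind of check.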

## 2. Experiment

Time to try it yourself!

There are 200 data points (in blue).

The controls below allow you to:

- change the ***scaling method***
    - note: MinMaxScaling is configured to scale features to the [-5, 5] range
- include ***ONE outlier*** (single red point)
- plot ***Tukey's fences*** on horizontal and vertical axes (k = 1.5)
- plot the ***Chi-Squared Probabilities*** contour plot for ***Mahalanobis Distance*** (under StandardScaling only!)
- choose a ***threshold for Chi-Sq.Prob.*** between 99.1% and 99.9% (under StandardScaling only!)

Use the controls to play with different configurations and answer the questions below.

In [2]:
from intuitiveml.feature.Scaling import *

In [3]:
X = data()
mysc = plotScaling(X, outlier=(-9, 6))
vb = VBox(build_figure(mysc), layout={'align_items': 'center'})

In [4]:
vb


#### Questions

1. Using ***no scaling***, turn ***Tukey's fences*** on:
    - how many ***outlier candidates*** did you find on the horizontal axis (X1 feature)?

    4

    - how many ***outlier candidates*** did you find on the vertical axis (X2 feature)?

    1

    - is there any point that is an ***outlier candidate*** on both features? If so, would you consider it an outlier or not? Why?

    They are too close to the center.

    - how do you like Tukey's method for outlier detection?

    It seems valid and fair.
    
    
2. Add the ***outlier*** to the same configuration above:
    - is the ***red point*** an outlier according to ***Tukey's method***?
    - how does the ***red point*** compare to any outliers you found in question 1?

    It is farther away.
    
    
    
3. Using ***MinMax Scaling***, uncheck ***all boxes***:
    - take note of the general position of the blue points

    Between -5 and 5.

    - include the ***outlier*** - what happens to the blue points? Why?

    They shift so that everything still fits into the range.

    - add ***Tukey's Fences*** - do you see any differences?

    No, the same outliers, just in a different range.
    
    
4. Using ***Standard Scaling***, uncheck ***all boxes***:
    - take note of the general position of the blue points

    The range is reduced.

    - include the ***outlier*** - what happens to the blue points? Why?

    The range is reduced even more.

    - how is this different from what happened using ***MinMax Scaling***? Why?

    It did not shrink as much.

    - add ***Tukey's Fences*** - do you see any differences?

    The same outliers, just on a different scale.


5. Using ***Standard Scaling***, check ONLY ***ChiSq.Prob. for L2 Norm***:

    - are there any points outside the 90% probability circle (dashed circle)?

    Yes, a lot.

    - include the ***outlier*** - is it outside the 90% probability circle?

    Yes.

    - change the ***probability threshold*** to different values and observe how the circle grows
    
    

## 3. Scikit-Learn

[MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)

[StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

[Normalizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html)

[Comparison of the effect of different scalers on data with outliers](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py)

## 4. More Resources

[About Feature Scaling and Normalization](https://sebastianraschka.com/Articles/2014_about_feature_scaling.html)

[Outlier Detection with Isolation Forest](https://towardsdatascience.com/outlier-detection-with-isolation-forest-3d190448d45e)

#### This material is copyright Daniel Voigt Godoy and made available under the Creative Commons Attribution (CC-BY) license ([link](https://creativecommons.org/licenses/by/4.0/)). 

#### Code is also made available under the MIT License ([link](https://opensource.org/licenses/MIT)).

In [5]:
from IPython.display import HTML
HTML('''<script>
  function code_toggle() {
    if (code_shown){
      $('div.input').hide('500');
      $('#toggleButton').val('Show Code')
    } else {
      $('div.input').show('500');
      $('#toggleButton').val('Hide Code')
    }
    code_shown = !code_shown
  }

  $( document ).ready(function(){
    code_shown=false;
    $('div.input').hide()
  });
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>''')