# Standardization

Most features in our data are plain numbers with no units attached. However, expressing a length in cm or in inches (or an impulse in [newton-seconds vs. pound-force seconds][1]) gives different values, and thus different outputs when these numbers are fed to, e.g., a neural network.

---
### Mars Climate Orbiter: a sad story
* mission launched in Dec 1998, planned to reach Mars orbit in Sep 1999
* Trajectory Correction Maneuver-4 was computed at the beginning of September 1999
    * some ground software by Lockheed Martin produced results in a United States customary unit, contrary to its specification, while a second system by NASA expected those results to be in SI units
    * specifically: the total impulse produced by thruster firings was given in pound-force seconds instead of Ns (factor 4.45 mismatch)
* unfortunately, the discrepancy was noticed, but the engineers' concerns were dismissed because they had not filled in the forms correctly
    * a 5th correction maneuver was considered but not carried out
* the probe's orbit around Mars was far too low (57 km instead of 226 km)
    * communication was lost on 23.09.1999 during orbital insertion
    * it likely skipped off the atmosphere back into space or was destroyed
* total cost of the mission: around 0.33 billion USD
---

Even more importantly, the values of different features may naturally cover completely different ranges. This may have a large effect on the performance of machine learning models:
* most obviously when distances between samples are computed, e.g. to find the nearest neighbors in the kNN estimator
* but also when a feature has a variance that is orders of magnitude larger than that of the others: it may dominate the objective function and make the estimator unable to learn from the other features correctly.
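
The effect on distances can be illustrated with a small sketch. The two features below are hypothetical (random numbers, not taken from any dataset in this notebook); the second one is simply expressed on a scale about 1000 times larger than the first. The helper `squared_distance_share` is introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Hypothetical features: the second one has values ~1000 times larger than the first.
X_toy = np.column_stack([rng.normal(0.0, 1.0, n), rng.normal(0.0, 1000.0, n)])

def squared_distance_share(data):
    """Fraction of the summed squared Euclidean distance between
    consecutive samples that each feature contributes."""
    diff = data[1:] - data[:-1]
    per_feature = (diff ** 2).sum(axis=0)
    return per_feature / per_feature.sum()

share_raw = squared_distance_share(X_toy)

# Standardize: subtract the mean and divide by the standard deviation, per feature.
X_toy_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
share_std = squared_distance_share(X_toy_std)

print("share per feature (raw):         ", share_raw)
print("share per feature (standardized):", share_std)
```

On the raw data, practically the whole distance comes from the second feature; after standardization both features contribute comparably.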

In general, learning algorithms benefit from standardization of the data set.
Some algorithms will be more susceptible to the scaling of the data than others:
* Neural networks expect all input features to vary in a similar way, and ideally to look like standard normally distributed data, i.e. Gaussian with zero mean and unit variance.
* Boosted decision trees (BDTs) only apply threshold cuts (`if` statements) to individual features and are typically very robust against rescaling.
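
As a minimal sketch of what "zero mean and unit variance" means in practice, the following uses `sklearn`'s `StandardScaler` on two hypothetical features (random numbers chosen for illustration) with very different offsets and spreads.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical features with very different offsets and spreads.
X_demo = np.column_stack([
    rng.normal(170.0, 10.0, 500),    # e.g. a height in cm
    rng.normal(0.002, 0.0005, 500),  # e.g. a small dimensionless ratio
])

# fit() computes mean and standard deviation per feature, transform() applies them.
scaler_demo = StandardScaler().fit(X_demo)
X_demo_scaled = scaler_demo.transform(X_demo)

print("means before:", X_demo.mean(axis=0))
print("stds before: ", X_demo.std(axis=0))
print("means after: ", X_demo_scaled.mean(axis=0))
print("stds after:  ", X_demo_scaled.std(axis=0))
```

After the transformation, each feature has a mean of zero and a standard deviation of one up to floating-point precision. Note that this fixes only the first two moments; it does not make the distribution Gaussian.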

The `sklearn` documentation provides further information on [preprocessing of data][2].

[1]: https://en.wikipedia.org/wiki/Mars_Climate_Orbiter#Cause_of_failure
[2]: https://scikit-learn.org/stable/modules/preprocessing.html


*Exercise: Read about different scalers in `sklearn` ([visual overview](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#original-data)).*
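
As a possible starting point for the exercise, the sketch below compares three `sklearn` scalers on a single hypothetical feature (made-up values) that contains one extreme outlier. It looks at how much spread is left among the ordinary values after scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical single feature: nine ordinary values and one extreme outlier.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0]).reshape(-1, 1)

spreads = {}
for scaler_cls in (StandardScaler, MinMaxScaler, RobustScaler):
    scaled = scaler_cls().fit_transform(values).ravel()
    # Spread of the nine ordinary values after scaling (outlier excluded).
    spreads[scaler_cls.__name__] = scaled[:-1].max() - scaled[:-1].min()
    print(f"{scaler_cls.__name__:>14}: spread of ordinary values = "
          f"{spreads[scaler_cls.__name__]:.4f}, outlier -> {scaled[-1]:.2f}")
```

`MinMaxScaler` and `StandardScaler` squeeze the ordinary values into a tiny interval because the outlier determines the range and the variance. `RobustScaler` uses the median and the interquartile range and is far less affected.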

### Example I
One example that illustrates the importance of preprocessing the data to rescale feature values to "standard ranges" is the application of SVMs to `sklearn`'s breast cancer dataset.

We use a *support vector machine* (SVM), a common classification model, implemented in `sklearn` as `SVC` (support vector classifier); `LinearSVC` is a dedicated linear implementation. A linear SVC tries to find a (hyper-)plane in feature space that optimally separates the two given populations.
* Predictions are then made by checking on which side of this plane new samples lie.
* The (hyper-)plane is defined by only a few of the training samples (the support vectors), namely those that lie closest to the resulting plane.
* generative classification vs discriminative classification: 
    * generative classification: model each class and compute likelihood of belonging to each (e.g. Gaussian Naive Bayes)
    * discriminative classification: find curve or manifold in feature space separating the classes from each other
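
The role of the support vectors can be made concrete with a small sketch. The toy data are hypothetical (two well-separated clusters generated with `make_blobs`) and unrelated to the cancer dataset; a linear kernel is assumed so that the decision boundary is a plane.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated clusters in two dimensions.
X_toy, y_toy = make_blobs(n_samples=100, centers=[[-3, -3], [3, 3]],
                          cluster_std=1.0, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X_toy, y_toy)

# Only the samples closest to the separating line end up as support vectors.
print("training samples:", len(X_toy))
print("support vectors: ", len(clf.support_vectors_))

# The sign of the decision function says on which side of the plane a sample lies.
new_points = np.array([[-2.0, -2.5], [2.5, 2.0]])
decision = clf.decision_function(new_points)
predicted = clf.predict(new_points)
print("decision function:", decision)
print("predicted classes:", predicted)
```

Removing any training sample that is not a support vector would leave the fitted plane unchanged, which is why only `support_vectors_` need to be stored for prediction.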

In [None]:
# import matplotlib
import matplotlib.pyplot as plt
plt.rc('xtick', labelsize=15) 
plt.rc('ytick', labelsize=15) 

In [None]:
# import SVC and load data
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

The features in this dataset have very different ranges as illustrated by the following plot.

In [None]:
# plot features
plt.figure(figsize = (20, 8))
# boxes: lower to upper quartile; whiskers: data within Q1 - 1.5 IQR and Q3 + 1.5 IQR; outliers are not shown
plt.boxplot(cancer.data, showfliers = False)
plt.xlabel("feature")
plt.yscale("log")

In [None]:
# fit data without rescaling
svm = SVC(gamma = "auto", random_state = 42)
svm.fit(X_train, y_train).score(X_test, y_test)

This is a very low accuracy compared to our previous results.
(Note that `SVC` by default uses `kernel="rbf"`, i.e. a [radial-basis-function kernel][1]. This kernel depends on the Euclidean distance between feature vectors, which is dominated by the features with the largest ranges; this is the likely cause of the low score. If we instead use `SVC(kernel="linear")`, the resulting score is much higher.)

[1]: https://en.wikipedia.org/wiki/Radial_basis_function_kernel
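
To see why the RBF kernel struggles here, one can look at the kernel values themselves. The sketch below assumes the kernel form exp(-gamma * ||x - x'||^2) with gamma = 1 / n_features (which is what `gamma="auto"` corresponds to) and uses `rbf_kernel` from `sklearn.metrics.pairwise`. The helper `median_offdiag_kernel` is introduced for illustration only, and the scaler is fitted on the full dataset only because no model is trained or evaluated here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.preprocessing import MinMaxScaler

X_cancer = load_breast_cancer().data
gamma = 1.0 / X_cancer.shape[1]  # corresponds to gamma="auto": 1 / n_features

def median_offdiag_kernel(data):
    """Median RBF kernel value over all pairs of distinct samples."""
    K = rbf_kernel(data, gamma=gamma)
    off_diagonal = K[~np.eye(len(K), dtype=bool)]
    return np.median(off_diagonal)

k_raw = median_offdiag_kernel(X_cancer)
k_scaled = median_offdiag_kernel(MinMaxScaler().fit_transform(X_cancer))

print("median kernel value, raw features:     ", k_raw)
print("median kernel value, rescaled features:", k_scaled)
```

On the raw features, typical kernel values between different samples are numerically zero: every sample looks infinitely far from every other one, so the classifier has almost nothing to generalize from. After rescaling, the kernel values are of order one and carry information.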

In [None]:
# now rescale the features and retrain
from sklearn.preprocessing import MinMaxScaler
# compute minimum and maximum on the training data
# note that we must train the scaler only on the training data (but not the full dataset)
scaler = MinMaxScaler().fit(X_train)
# rescale the training data
X_train_scaled = scaler.transform(X_train)
X_test_scaled  = scaler.transform(X_test)
# retrain
svm.fit(X_train_scaled, y_train).score(X_test_scaled, y_test)
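
An alternative to scaling by hand is to chain scaler and classifier. The following is a minimal sketch using `make_pipeline` from `sklearn.pipeline` (not used elsewhere in this notebook); it repeats the data loading so that it runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# fit() fits the scaler on the training data and then the SVC on the scaled result;
# score() reuses the already fitted scaler to transform the test data.
pipe = make_pipeline(MinMaxScaler(), SVC(gamma="auto", random_state=42))
pipe.fit(X_train, y_train)
pipe_score = pipe.score(X_test, y_test)
print("Pipeline test score:", pipe_score)
```

With a pipeline, the scaler cannot accidentally be fitted on the test data, which becomes particularly useful in cross-validation.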

In [None]:
### you can plot it if you want to see the effect
plt.figure(figsize = (20, 4))
plt.boxplot(scaler.transform(cancer.data), showfliers = False);

In [None]:
# same effect (although much smaller) e.g. on MLP
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(max_iter=1000, random_state=0)
print("Score using unscaled data:", mlp.fit(X_train, y_train).score(X_test, y_test))
print("Score using rescaled data:", mlp.fit(X_train_scaled, y_train).score(X_test_scaled, y_test))

In [None]:
# no effect e.g. on BDT
from sklearn.ensemble import AdaBoostClassifier
bdt = AdaBoostClassifier(random_state=0)
print("Score using unscaled data:", bdt.fit(X_train, y_train).score(X_test, y_test))
print("Score using rescaled data:", bdt.fit(X_train_scaled, y_train).score(X_test_scaled, y_test))

### Example II

Large values of features may also lead to convergence issues. 
(Note that many linear models include regularization terms that are introduced to avoid very large values of the parameters.) 

Let us write a quick function to define a dataset with two populations that should be easy to separate:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(43)
nsamples = 50
pops     = 2
yshift   = 0.5
yscale   = 100 # try changing to 1 later
noise    = 0.3

### helpers
def MakeXYcorrSample(x0 = 0, y0 = 0, noise = noise, yscale = yscale, pops = pops, yshift = yshift):
    # space
    X = np.zeros((nsamples * pops, 2))
    y = np.zeros(nsamples * pops)
    # fill
    X[:, 0] = x0 + np.random.rand(nsamples * pops)
    for n in range(pops):
        X[n*nsamples:(n+1)*nsamples, 1] = (
            yscale * (y0 + X[n*nsamples:(n+1)*nsamples, 0] - noise/2. + noise * np.random.rand(nsamples) + n*yshift)
        )
        y[n*nsamples:(n+1)*nsamples] = n
    return X, y

### make and plot dataset
X, y = MakeXYcorrSample()
for pop in range(pops):
    plt.scatter(*X[y == pop,0:2].T)

The plot shows two populations that look easy to separate. Or are they?

In [None]:
### prepare data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

### fit
from sklearn.svm import LinearSVC

model = LinearSVC().fit(X_train, y_train)

print("Score on training data:", model.score(X_train, y_train))
print("Score on validation data:", model.score(X_test, y_test))

The score on the training data shows that we cannot even model the training data correctly using the linear model.

In [None]:
from mltools import visualize_classifier
visualize_classifier(model, X, y, cmap="rainbow", plot_proba=False)

# add line for separation
line = np.linspace(0, 1)
coef, intercept = model.coef_.flatten(), model.intercept_.flatten()
plt.plot(line, -(line * coef[0] + intercept) / coef[1])


How can we solve this problem?

<!--
Again, the scaling is the problem. Try one of the following:
* rerun with `yscale` set to 1 (instead of `yscale = 100`) when generating the populations to separate
* introduce a scaler (like the one we used above)

In both cases you should get 100 % separation on both the training and the test sample.
-->