# Scaling numbers

When working with data, it's common to encounter variables with different scales or magnitudes. For example, imagine a medical dataset where we can find information related to people's weight and height. In this dataset, weight varies between 50 and 150 kilograms, while height varies between 1.50 and 1.90 meters.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

def make_dataset(n):
    # Weights in kilograms, drawn uniformly between 50 and 150
    min_w, max_w = 50, 150
    weights = np.random.uniform(min_w, max_w, n)

    # Heights in meters, drawn uniformly between 1.50 and 1.90
    min_h, max_h = 1.50, 1.90
    heights = np.random.uniform(min_h, max_h, n)

    # One row per person: column 0 is weight, column 1 is height
    return np.vstack([weights, heights]).T
    
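def plot(ax, dataset, title):
    weights, heights = dataset[:,0], dataset[:,1]
    # Vertical jitter so that overlapping points remain visible
    noise = np.random.uniform(-0.2, 0.2, len(weights))
    # Weights are drawn along the row y = 1
    ax.scatter(
        weights,
        np.full(len(weights), 1) + noise,
    )
    # Heights are drawn along the row y = 2
    ax.scatter(
        heights,
        np.full(len(heights), 2) + noise,
    )
    ax.set_ylim(0.5, 2.5)
    # Tick positions and labels are set separately so this also runs on older Matplotlib versions
    ax.set_yticks([1, 2])
    ax.set_yticklabels(['Weight', 'Height'])
    ax.set_title(title)
    return ax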

def show_dataframe(dataset):
    return pd.DataFrame(dataset, columns=['Weight', 'Height'])

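def plot_dataset(*objects):
    # Arguments come in (dataset, title) pairs: dataset1, title1, dataset2, title2, ...
    objects = [(objects[i], objects[i+1]) for i in range(0, len(objects) - 1, 2)]
    plots = len(objects)
    fig, axs = plt.subplots(1, plots, figsize=(5 * plots, 5))
    # With a single subplot, plt.subplots returns one Axes instead of an array
    if len(objects) == 1:
        axs = [axs]
    for (dataset, title), ax in zip(objects, axs):
        plot(ax, dataset, title)
    fig.tight_layout()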

In [None]:
original_dataset = make_dataset(100)
show_dataframe(original_dataset)

In [None]:
fig = plt.figure()
ax = fig.gca()
plot(ax, original_dataset, "Original data")

$$
  X_{norm} = \frac{X - \mu}{\sigma}
$$

Our machine learning model has no notion that some values are measured in kilograms and others in meters. If the data is not scaled, some attributes may carry more influence than others simply because of their scale, which can lead the model to incorrect decisions.

Returning to our example, the algorithm may "focus" more on weight because it spans a much larger range, 100, while height spans only 0.4. This is where scaling our variables becomes important.
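To make this concrete, here is a minimal sketch of how the Euclidean distance between two people is dominated by weight. The two people and their values are hypothetical, chosen only to stay within the ranges described above.

```python
import numpy as np

# Two hypothetical people: (weight in kg, height in m)
person_a = np.array([60.0, 1.90])
person_b = np.array([65.0, 1.50])

diff = person_a - person_b
squared = diff ** 2                        # per-feature contribution to the squared distance
distance = np.sqrt(squared.sum())          # Euclidean distance on the raw values
weight_share = squared[0] / squared.sum()  # fraction of the squared distance due to weight

print(f"distance = {distance:.3f}")
print(f"share due to weight = {weight_share:.4f}")
```

The heights differ by the entire 0.40 m range while the weights differ by only 5 kg out of 100, yet more than 99% of the squared distance comes from weight.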

## Why is it important?

Specifically, we can think of three reasons why it's worth scaling the values:

 1. It facilitates training: With all features on the same scale, gradient-based machine learning algorithms tend to converge faster towards a minimum of the loss function.
 1. It improves performance: Distance-based algorithms, like SVM or KNN, are very sensitive to the scale of the data and can give incorrect results if the features are not standardized.
 1. It allows for easier interpretation: Once the features are standardized, we can compare their relative importance in our model, for example through the coefficients of a linear model.
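The sensitivity of distance-based methods can be sketched with a tiny nearest-neighbor lookup. This is an illustration under assumptions: the query and the two candidates are made up, and the rescaling simply maps each feature onto [0, 1] using the ranges from the example (50 to 150 kg, 1.50 to 1.90 m). The helper names `nearest` and `rescale` are hypothetical.

```python
import numpy as np

# Feature ranges from the example: weight 50-150 kg, height 1.50-1.90 m
mins = np.array([50.0, 1.50])
maxs = np.array([150.0, 1.90])

query = np.array([70.0, 1.60])
candidates = np.array([
    [72.0, 1.88],  # similar weight, very different height
    [80.0, 1.61],  # 10 kg heavier, almost the same height
])

def nearest(point, others):
    # Index of the row in `others` with the smallest Euclidean distance to `point`
    return int(np.argmin(np.linalg.norm(others - point, axis=1)))

def rescale(x):
    # Map each feature onto [0, 1] using the known ranges
    return (x - mins) / (maxs - mins)

raw_nearest = nearest(query, candidates)
scaled_nearest = nearest(rescale(query), rescale(candidates))

print("nearest on raw data:", raw_nearest)
print("nearest on rescaled data:", scaled_nearest)
```

On the raw values the first candidate wins because only weight matters in practice. After rescaling, the second candidate is closer, since height now counts as much as weight.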

## How to do it?

There are various techniques to achieve variable scaling, and we can use the most common ones with scikit-learn.

### Standardization

Standardization is perhaps the most common scaling transformation. It consists of centering each attribute at 0 and rescaling it so that its variance is 1.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
standard_scaled = scaler.fit_transform(original_dataset)

plot_dataset(
    original_dataset, "Original data",
    standard_scaled, "Standardized data")

$$
  X_{norm} = \frac{X - \mu}{\sigma}
$$

where $\mu$ is the mean of the feature and $\sigma$ is its standard deviation.

This scaler is commonly used when your data is roughly normally distributed and you want all features to have similar scales. Note that the resulting range is not fixed: standardized values are not confined to a set interval. It is also often used when preparing data for regression models or neural networks.
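As a rough sketch of what standardization computes, the same result can be obtained by hand with NumPy. The data below is regenerated with a fixed seed purely for repeatability, and the variable names are illustrative. The standard deviation used here is the population one (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only to make the sketch repeatable
data = np.column_stack([
    rng.uniform(50, 150, 100),     # weight in kg
    rng.uniform(1.50, 1.90, 100),  # height in m
])

mu = data.mean(axis=0)    # per-column mean
sigma = data.std(axis=0)  # per-column population standard deviation (ddof=0)
standardized = (data - mu) / sigma

print(standardized.mean(axis=0))  # both values are close to 0
print(standardized.std(axis=0))   # both values are close to 1
```

Both columns end up with mean 0 and standard deviation 1, but their minimum and maximum still depend on the data.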

### Min-max scaling

This scaling technique helps us transform the values of our dataset so that they fall within a known range.

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
minmax_dataset = scaler.fit_transform(original_dataset)

plot_dataset(
    original_dataset, "Original data",
    minmax_dataset, "Min-max scaled data")

$$
  X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}
$$

Use this scaler when you want the data to fall within a specific range, which is between 0 and 1 by default. It is especially useful for distance-based models, such as k-nearest neighbors or SVM. In this case, it doesn't matter much whether your features are normally distributed.
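A hand-written version of min-max scaling might look like the sketch below. The data is regenerated with a fixed seed for repeatability, and the target interval `low, high` is an arbitrary choice made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only to make the sketch repeatable
data = np.column_stack([
    rng.uniform(50, 150, 100),     # weight in kg
    rng.uniform(1.50, 1.90, 100),  # height in m
])

x_min = data.min(axis=0)  # per-column minimum
x_max = data.max(axis=0)  # per-column maximum
minmax = (data - x_min) / (x_max - x_min)  # every column now spans [0, 1]

# Stretch the unit range to another interval [low, high] (illustrative values)
low, high = -1.0, 1.0
rescaled = minmax * (high - low) + low

print(minmax.min(axis=0), minmax.max(axis=0))
print(rescaled.min(axis=0), rescaled.max(axis=0))
```

In scikit-learn, a range other than the default can be requested through the `feature_range` argument of `MinMaxScaler`.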

### Maximum absolute scaling

This scaler transforms the data by dividing each variable by its maximum absolute value, so the results fall between -1 and 1. This is useful when working with data that has very large or very small values.

In [None]:
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
maxabs_scaled = scaler.fit_transform(original_dataset)

plot_dataset(
    original_dataset, "Original data",
    maxabs_scaled, "Max-abs scaled data")

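$$
  X_{scaled} = \frac{X}{\max(|X|)}
$$

`MaxAbsScaler` is a good choice when features are sparse or mostly zero and have varying scales: because it only divides and never shifts the data, zeros stay zero. It's also useful with neural networks or with linear models trained on sparse data, such as logistic regression or SVM.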
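The sketch below applies the same idea by hand to a small, mostly-zero array. The numbers are made up for illustration and are unrelated to the weight and height dataset.

```python
import numpy as np

# Hypothetical sparse-looking data: many zeros, different scales, some negatives
data = np.array([
    [0.0,   0.0],
    [200.0, 0.0],
    [0.0,  -0.5],
    [-50.0, 0.25],
])

max_abs = np.abs(data).max(axis=0)  # per-column maximum absolute value
scaled = data / max_abs             # values now lie in [-1, 1]

print(scaled)
# Zeros are untouched because the data is only divided, never shifted
print(np.array_equal(scaled == 0, data == 0))
```

Since no offset is subtracted, the zero entries remain exactly zero, which is why this kind of scaling works well with sparse data.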

### Other ways to scale values

Scikit-learn also provides other, more specialized scalers that we won't be able to cover here. They are meant for data with other distributions, and some of them transform the data so that the scaled values are approximately normally distributed.

In [None]:
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import PowerTransformer
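To give a flavor of why such alternatives exist, here is a small NumPy sketch of the idea behind `RobustScaler`, which by default centers on the median and divides by the interquartile range. The weights, including the extreme last value, are hypothetical, and this is a hand-written approximation rather than the library's implementation.

```python
import numpy as np

# Hypothetical weights in kg; the last value is an extreme outlier
weights = np.array([60.0, 65.0, 70.0, 75.0, 80.0, 400.0])

# Standardization: the mean and standard deviation are both pulled by the outlier
z = (weights - weights.mean()) / weights.std()

# Median / interquartile-range scaling: far less affected by the outlier
median = np.median(weights)
q1, q3 = np.percentile(weights, [25, 75])
robust = (weights - median) / (q3 - q1)

print(np.round(z, 3))
print(np.round(robust, 3))
```

With standardization, the five typical values are squeezed into a very narrow band because the outlier inflates the standard deviation. With median and interquartile-range scaling, they keep a usable spread.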

There you have it. I hope you now see the value of scaling your data and that from now on you'll use this technique in your projects.

In the next chapter, we'll learn how we can convert continuous variables to discrete ones.