# Data Scaling Tutorial

This notebook introduces some data scaling methods used in machine learning (ML). Features are scaled for several reasons. Some techniques critically depend on features being on a common scale, such as L1/L2 regularization or nearest-neighbors methods. In neural networks, scaling allows the network to learn the relative contributions of each feature without being dominated by the scale of one feature or another.
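
To make the nearest-neighbors point concrete, here is a minimal sketch on made-up data. The array `X_toy` and the helper `nearest_to_first` are illustrative assumptions and are not part of this project. The second column has a much larger numeric range than the first, so it dominates the raw Euclidean distance; after each column is rescaled to a common range, a different row becomes the nearest neighbor.

```python
import numpy as np

# Hypothetical toy data: column 0 spans about 1-10, column 1 spans about 1000-2000.
X_toy = np.array([
    [1.0, 1000.0],   # row 0: query point
    [1.1, 1300.0],   # row 1: very close in column 0, moderately far in column 1
    [5.0, 1050.0],   # row 2: far in column 0, close in column 1
    [10.0, 2000.0],  # row 3: sets the upper end of both column ranges
])

def nearest_to_first(data):
    """Return the index of the row closest to row 0 (Euclidean distance)."""
    dists = np.linalg.norm(data[1:] - data[0], axis=0 + 1)
    return int(np.argmin(dists)) + 1  # +1 because row 0 was excluded

# Rescale each column to [0, 1] so both features contribute comparably.
X_toy_scaled = (X_toy - X_toy.min(axis=0)) / (X_toy.max(axis=0) - X_toy.min(axis=0))

print("nearest row, raw data:   ", nearest_to_first(X_toy))         # row 2
print("nearest row, scaled data:", nearest_to_first(X_toy_scaled))  # row 1
```

Neither answer is "the" correct neighbor; the sketch only shows that the ranking depends on feature scale.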

*Note:* data can be transformed and inverse-transformed using the methods below, but the original values may not be recovered exactly because of floating-point rounding error.
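
A minimal sketch of such a round trip, using randomly generated stand-in data (`X_demo` is an assumption, not the notebook's dataset) and the min-max transform described later in this notebook. The comparison uses a tolerance rather than exact equality, since the recovered values may differ from the originals in the last few bits.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.uniform(low=5.0, high=35.0, size=(1000, 3))  # hypothetical data

col_min = X_demo.min(axis=0)
col_range = X_demo.max(axis=0) - col_min

X_demo_scaled = (X_demo - col_min) / col_range          # forward transform
X_demo_recovered = X_demo_scaled * col_range + col_min  # inverse transform

# Compare with a tolerance instead of `==`
max_err = np.max(np.abs(X_demo_recovered - X_demo))
print(f"largest round-trip error: {max_err:.3e}")
print(f"equal within tolerance:   {np.allclose(X_demo_recovered, X_demo)}")
```

The `sklearn` scalers used below provide an `inverse_transform` method for the same purpose.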

## Environment

In [1]:
# Environment
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import sys
# Local modules
sys.path.append('..')
import reproducibility
from utils import hash2
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from moisture_rnn_pkl import pkl2train
from moisture_rnn import create_rnn_data2



## Setup & Data Read


In [2]:
file='test_CA_202401.pkl'
train = pkl2train([file])

In [3]:
import yaml

with open("../params.yaml") as file:
    params = yaml.safe_load(file)["rnn"]
params

{'batch_size': 32,
 'timesteps': 5,
 'optimizer': 'adam',
 'rnn_layers': 1,
 'rnn_units': 6,
 'dense_layers': 1,
 'dense_units': 1,
 'activation': ['linear', 'linear'],
 'centering': [0.0, 0.0],
 'dropout': [0.2, 0.2],
 'recurrent_dropout': 0.2,
 'reset_states': True,
 'epochs': 100,
 'learning_rate': 0.001,
 'phys_initialize': False,
 'stateful': True,
 'verbose_weights': True,
 'verbose_fit': False,
 'features_list': ['Ed', 'Ew', 'rain'],
 'scale': False,
 'scaler': 'minmax',
 'train_frac': 0.5,
 'val_frac': 0.1}

In [4]:
case = 'KRNC1_202401'
rnn_dat = create_rnn_data2(train[case], params)
X = rnn_dat["X_train"]

Not scaling data


In [5]:
X[0:5, :]

array([[15.65665995, 14.24907313,  0.        ],
       [16.37073623, 14.95777203,  0.        ],
       [16.8830433 , 15.46613268,  0.        ],
       [16.58511884, 15.17121406,  0.        ],
       [15.42272608, 14.02588457,  0.        ]])

## Min-Max Scaler

Rescales data to a given range, (0, 1) by default in `sklearn`. If $x$ is a feature vector, we calculate the transformation $x'$ by:

$$
x' = \frac{x-\min\{x\}}{\max\{x\}-\min\{x\}}
$$

Notice that $x'=0$ if $x$ equals the minimum, and $x'=1$ if $x$ equals the maximum, as desired.
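
The same idea extends to an arbitrary target range: first map to (0, 1) with the formula above, then stretch and shift. The helper `minmax_to_range` below is a hypothetical illustration, not a function from this project; `MinMaxScaler` offers the same choice through its `feature_range` parameter. The sketch assumes the input is not constant, since a constant vector would give a zero denominator.

```python
import numpy as np

def minmax_to_range(x, low=0.0, high=1.0):
    """Linearly rescale a 1-D array so its minimum maps to `low` and its maximum to `high`.

    Hypothetical helper for illustration; assumes `x` is not constant.
    """
    x = np.asarray(x, dtype=float)
    x_unit = (x - x.min()) / (x.max() - x.min())  # the (0, 1) formula above
    return x_unit * (high - low) + low            # stretch and shift

x_small = np.array([10.0, 15.0, 20.0, 30.0])

print(minmax_to_range(x_small))             # values 0, 0.25, 0.5, 1
print(minmax_to_range(x_small, -1.0, 1.0))  # values -1, -0.5, 0, 1
```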

### Manual Calculation

In [6]:
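# Use col_min/col_max so the Python built-ins min() and max() are not shadowed
col_min = X.min(axis=0)
col_max = X.max(axis=0)
print(f"X column mins: {col_min}")
print(f"X column maxs: {col_max}")

X column mins: [9.18025688 7.97581012 0.        ]
X column maxs: [34.47706758 32.29556689  1.80660525]


In [7]:
# Apply the min-max formula column-wise (NumPy broadcasts the per-column arrays)
X_scaled = (X - col_min) / (col_max - col_min)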

print(X_scaled[0:5, :])

[[0.25601658 0.25794925 0.        ]
 [0.2842445  0.28709012 0.        ]
 [0.30449635 0.30799332 0.        ]
 [0.29271919 0.29586661 0.        ]
 [0.24676902 0.248772   0.        ]]


The scaled data should have column minima and maxima approximately equal to 0 and 1, respectively.

In [8]:
print(f"X-scaled column mins: {X_scaled.min(axis=0)}")
print(f"X-scaled column maxs: {X_scaled.max(axis=0)}")

X-scaled column mins: [0. 0. 0.]
X-scaled column maxs: [1. 1. 1.]


### Using `sklearn`

In [9]:
scaler = MinMaxScaler()
scaler.fit(X)
X_scaled2 = scaler.transform(X)

In [10]:
print(X_scaled2[0:5, :])

[[0.25601658 0.25794925 0.        ]
 [0.2842445  0.28709012 0.        ]
 [0.30449635 0.30799332 0.        ]
 [0.29271919 0.29586661 0.        ]
 [0.24676902 0.248772   0.        ]]


### Compare Difference

The difference between the methods should be zero or on the order of machine epsilon (about $2.2\times 10^{-16}$ for 64-bit floats).

In [11]:
np.max(np.abs(X_scaled - X_scaled2))

2.220446049250313e-16
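
The value printed above is exactly the machine epsilon for 64-bit floats. A difference of this size is what one would expect when the same formula is evaluated in two algebraically equivalent ways. `MinMaxScaler` exposes fitted `scale_` and `min_` attributes, which suggests a multiply-then-add form; the sketch below assumes that form for illustration and uses made-up data (`X_demo`), so it is not a statement about the library's internals.

```python
import numpy as np

eps = np.finfo(np.float64).eps
print(eps)  # 2.220446049250313e-16

rng = np.random.default_rng(1)
X_demo = rng.uniform(5.0, 35.0, size=(500, 3))  # hypothetical data

col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)

# Form 1: subtract the minimum, then divide by the range
scaled_a = (X_demo - col_min) / (col_max - col_min)

# Form 2: multiply by a precomputed scale, then add a precomputed offset
scale = 1.0 / (col_max - col_min)
offset = -col_min * scale
scaled_b = X_demo * scale + offset

max_diff = np.max(np.abs(scaled_a - scaled_b))
print(f"max difference in units of eps: {max_diff / eps:.1f}")
print(f"equal within tolerance: {np.allclose(scaled_a, scaled_b)}")
```

For checks like this, `np.allclose` is usually more robust than testing for an exact zero.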

## Standard Scaler

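Scales each feature to mean 0 and standard deviation 1, equivalent to computing z-scores. The method works best when features are approximately normally distributed; for heavily skewed features or features with outliers, the mean and standard deviation are less representative and the scaled values can be hard to interpret. If $x$ is a feature vector of length $N$, we calculate the standard transformation $x'$ by: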

$$
x' = \frac{x-\mu}{s}
$$



$$\text{where}\quad \mu = \frac{1}{N}\sum_{i=1}^N x_i \quad \text{and}\quad s = \sqrt{\frac{1}{N}\sum_{i=1}^N (x_i-\mu)^2}$$
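
Note that $s$ divides by $N$ (the population standard deviation), not $N-1$. This matches NumPy's default (`ddof=0`), while pandas defaults to `ddof=1`, so the two can give different scaled values for the same data. A small worked sketch on a made-up vector (`x_small` is an assumption chosen so the numbers come out round):

```python
import numpy as np

x_small = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical data
N = x_small.size

mu_small = x_small.sum() / N
sq_dev = np.sum((x_small - mu_small) ** 2)

s_pop = np.sqrt(sq_dev / N)           # divide by N, as in the formula above
s_sample = np.sqrt(sq_dev / (N - 1))  # divide by N - 1 (sample estimate)

print(mu_small, s_pop)  # mean 5.0, population sd 2.0
print(np.isclose(s_pop, np.std(x_small)))             # NumPy default is ddof=0
print(np.isclose(s_sample, np.std(x_small, ddof=1)))  # ddof=1 gives the sample sd

# z-scores using the population standard deviation
z_small = (x_small - mu_small) / s_pop
print(z_small)
```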

### Manual Calculation

In [12]:
mu = X.mean(axis=0)
s = X.std(axis=0)
print(f"X column means: {mu}")
print(f"X column sds: {s}")

X column means: [17.37172913 15.95227122  0.07087056]
X column sds: [4.14878625 4.04256916 0.32151782]


In [13]:
X_scaled = (X - mu)/s
print(X_scaled[0:5, :])

[[-0.41339059 -0.42131576 -0.22042497]
 [-0.24127367 -0.24600672 -0.22042497]
 [-0.11779007 -0.12025485 -0.22042497]
 [-0.1896001  -0.19320811 -0.22042497]
 [-0.46977668 -0.47652534 -0.22042497]]


The resulting scaled data should have column means approximately equal to zero and column standard deviations approximately equal to one.

In [14]:
print(f"X-scaled column means: {X_scaled.mean(axis=0)}")
print(f"X-scaled column sds: {X_scaled.std(axis=0)}")

X-scaled column means: [ 9.78053617e-16 -1.63890066e-16 -1.96271570e-16]
X-scaled column sds: [1. 1. 1.]
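
Zero mean and unit standard deviation do not make a feature bell-shaped: standardization is a linear map, so it preserves the shape of the distribution. In the scaled rows printed earlier, every zero in the third column maps to the same value, -0.22042497. The sketch below illustrates this on a made-up, mostly-zero vector; `rain_like` and the `skewness` helper are assumptions for illustration only.

```python
import numpy as np

# Hypothetical skewed feature: mostly zeros with a few positive values
rain_like = np.array([0.0] * 18 + [0.5, 1.8])

def skewness(v):
    """Moment-based skewness: third central moment over variance^(3/2)."""
    c = v - v.mean()
    return np.mean(c ** 3) / np.mean(c ** 2) ** 1.5

z_rain = (rain_like - rain_like.mean()) / rain_like.std()

# Mean and sd are standardized ...
print(np.isclose(z_rain.mean(), 0.0), np.isclose(z_rain.std(), 1.0))
# ... but the skewness is unchanged by the linear transform
print(np.isclose(skewness(z_rain), skewness(rain_like)))
# All zeros collapse to a single negative value, so only 3 distinct values remain
print(np.unique(z_rain).size)
```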


### Using `sklearn`

In [15]:
scaler = StandardScaler()
scaler.fit(X)
X_scaled2 = scaler.transform(X)

In [16]:
print(X_scaled2[0:5, :])

[[-0.41339059 -0.42131576 -0.22042497]
 [-0.24127367 -0.24600672 -0.22042497]
 [-0.11779007 -0.12025485 -0.22042497]
 [-0.1896001  -0.19320811 -0.22042497]
 [-0.46977668 -0.47652534 -0.22042497]]


### Compare Difference

The difference between the methods should be zero or on the order of machine epsilon (about $2.2\times 10^{-16}$ for 64-bit floats).

In [17]:
np.max(np.abs(X_scaled - X_scaled2))

0.0

## References

- `MinMaxScaler` from scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

- `StandardScaler` from scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html