# Scaling Data

Before fitting a model or running an algorithm we sometimes need to scale our data. This is particularly true when some features (columns of $X$) are on vastly different scales than others. In this notebook we will demonstrate how to scale data with `sklearn`.

## What we will accomplish

In this notebook we will:
- Introduce the concept of scaling your data,
- Demonstrate the `StandardScaler` object in `sklearn`,
- Discuss `fit`, `transform` and `fit_transform`, and
- Show the scaling process using `sklearn`.

Before we get started let's generate some data.

In [1]:
import numpy as np

In [2]:
## Make some data
## Notice that the columns of X have vastly different scales
rng = np.random.default_rng(440)
X = np.zeros((1000, 4))
X[:, 0] = 1000 * rng.standard_normal(1000)
X[:, 1] = rng.random(1000) - 10
X[:, 2] = rng.integers(-250, 150, 1000)
X[:, 3] = 10 * rng.standard_normal(1000) - 75

In [3]:
## demonstrating the different scales
print("mean of X1:",np.mean(X[:,0]))
print("variance of X1:",np.var(X[:,0]))
print()

print("mean of X2:",np.mean(X[:,1]))
print("variance of X2:",np.var(X[:,1]))
print()

print("mean of X3:",np.mean(X[:,2]))
print("variance of X3:",np.var(X[:,2]))
print()

print("mean of X4:",np.mean(X[:,3]))
print("variance of X4:",np.var(X[:,3]))

mean of X1: 11.935201655200657
variance of X1: 1026903.0991374647

mean of X2: -9.503242684051795
variance of X2: 0.0821580848191559

mean of X3: -52.442
variance of X3: 13772.400636

mean of X4: -74.45477168608731
variance of X4: 96.85199591904106


We can see that the columns of `X` have very different scales. We will soon learn some algorithms whose results can be greatly distorted when this happens.

The main approach to fixing this issue is to scale the data so that all of the columns are on the same scale.
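
To see why differing scales can matter, here is a minimal sketch. It assumes an algorithm that compares rows using Euclidean distance, which is an illustrative assumption since this notebook has not named a specific algorithm. The block rebuilds the same `X` so it runs on its own, then measures how much of the squared distance between consecutive rows comes from each column.

```python
import numpy as np

# Rebuild the same X as above so this block is self-contained
rng = np.random.default_rng(440)
X = np.zeros((1000, 4))
X[:, 0] = 1000 * rng.standard_normal(1000)
X[:, 1] = rng.random(1000) - 10
X[:, 2] = rng.integers(-250, 150, 1000)
X[:, 3] = 10 * rng.standard_normal(1000) - 75

# Squared differences between consecutive rows, column by column
sq_diffs = (X[1:] - X[:-1]) ** 2

# Fraction of the total squared Euclidean distance contributed by each column
share = sq_diffs.sum(axis=0) / sq_diffs.sum()
print("share of squared distance per column:", np.round(share, 4))
```

The first column, which has by far the largest variance, accounts for nearly all of the distance, so the other three columns would barely influence a distance-based method.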

## Standardizing your data

While there are a few different ways to scale data, one of the most common is to <i>standardize</i> it. When you standardize a variable, $x$, you apply the following transformation:

$$
x_\text{scaled} = \frac{x - \text{mean}(x)}{\text{standard deviation}(x)}.
$$

If you have taken a statistics course (or used $Z$-tables), this should look familiar. This is precisely the transformation that turns an arbitrary normal random variable into a <i>standard normal</i> random variable, hence the term <i>standardizing</i>.

Standardizing your data will transform it to have mean $0$ and standard deviation $1$.
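
As a minimal sketch of the formula above, here is the transformation done by hand with `numpy`. The values in `x` are made up for illustration and are not part of the notebook's data.

```python
import numpy as np

# A small made-up feature (hypothetical values, for illustration only)
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Subtract the mean and divide by the standard deviation.
# np.std uses the population standard deviation (ddof=0) by default.
x_scaled = (x - np.mean(x)) / np.std(x)

print("scaled values:", x_scaled)
print("mean:", np.mean(x_scaled))
print("standard deviation:", np.std(x_scaled))
```

Here the mean of `x` is $5$ and its standard deviation is $2$, so for instance the value $9$ becomes $(9 - 5)/2 = 2$.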

### `StandardScaler`

We could do this by hand using `numpy`, but that will quickly become tedious. `sklearn` provides a nice `scaler` object called `StandardScaler` that will perform this on all columns of your data set, and has functionality that plays nicely with train test splits. Here is the documentation <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html">https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html</a>.

In [4]:
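## import StandardScaler
from sklearn.preprocessing import StandardScaler

In [5]:
## Make a scaler object
scaler = StandardScaler()

## fit the scaler
scaler.fit(X)

In [6]:
## scale the data, i.e. transform it
X_scale = scaler.transform(X)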

In [7]:
## Checking the scaled means and variances
print("mean of standardized X1:",np.mean(X_scale[:,0]))
print("variance of standardized X1:",np.var(X_scale[:,0]))
print()

print("mean of standardized X2:",np.mean(X_scale[:,1]))
print("variance of standardized X2:",np.var(X_scale[:,1]))
print()

print("mean of standardized X3:",np.mean(X_scale[:,2]))
print("variance of standardized X3:",np.var(X_scale[:,2]))
print()

print("mean of standardized X4:",np.mean(X_scale[:,3]))
print("variance of standardized X4:",np.var(X_scale[:,3]))

mean of standardized X1: -2.4868995751603507e-17
variance of standardized X1: 1.000000000000001

mean of standardized X2: -3.849809360190193e-14
variance of standardized X2: 0.9999999999999999

mean of standardized X3: 4.085620730620576e-17
variance of standardized X3: 0.9999999999999991

mean of standardized X4: -1.0054179711005418e-14
variance of standardized X4: 1.0


#### `fit`, `transform` and `fit_transform` & scaling for train test splits

You may be wondering what `fit`, `transform` and `fit_transform` do. Let's describe:
- `fit` performs a fit of the `scaler` object. For `StandardScaler` this means finding the mean and standard deviation of each column and storing them. `fit` must be called <i>before</i> `transform`.
- `transform` is what actually performs the scaling. For `StandardScaler` this means subtracting the respective mean and dividing by the respective standard deviation for each column. `transform` must be called <i>after</i> `fit`.
- `fit_transform` does this all in one fell swoop, i.e. it fits the `scaler` object and then uses that fit to transform the data.
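
The three methods can be sketched on a tiny example. `X_demo` and `demo_scaler` are hypothetical names with made-up values, used only for this illustration. `mean_` and `scale_` are the attributes where a fitted `StandardScaler` stores the column means and standard deviations.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up data set (hypothetical values): two columns on different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0]])

# fit only learns and stores the column statistics; it does not change X_demo
demo_scaler = StandardScaler()
demo_scaler.fit(X_demo)
print("stored means:", demo_scaler.mean_)
print("stored standard deviations:", demo_scaler.scale_)

# transform applies the stored statistics
two_step = demo_scaler.transform(X_demo)

# fit_transform does both steps at once on a fresh scaler
one_step = StandardScaler().fit_transform(X_demo)

print("same result:", np.allclose(two_step, one_step))
```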

##### Why do we need anything other than `fit_transform`?

Excellent question! In the example above we probably could have just used `fit_transform`.

However, that is only because we were not dealing with train test splits, validation sets or cross-validation. We consider scaling the data to be part of the model: the algorithm/model is fit on data scaled with the training set's means and standard deviations. For example, if we scale the data prior to fitting a linear regression model, then $\hat{\beta}$ is found using data scaled according to the training set (using the means and standard deviations of the training set columns). To assess how that particular model performs we must scale any validation set, cross-validation holdout set or test set with the exact same scaling (i.e. using the means and standard deviations from the training set).

Let's illustrate what I mean with a quick final example.

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
X_train, X_test = train_test_split(X,
                                   shuffle=True,
                                   random_state=614,
                                   test_size=.2)

In [10]:
## Fit then transform the training set
scaler_new = StandardScaler()

scaler_new.fit(X_train)

X_train_scale = scaler_new.transform(X_train)

## alternatively I could do
## X_train_scale = scaler_new.fit_transform(X_train)
## Why? Because this is the training set, so it is okay to run fit_transform.

<i>Imagine we build a model here.</i>

In [11]:
## Transform the test set
## DO NOT refit the scaler!
X_test_scale = scaler_new.transform(X_test)

In [12]:
X_test_scale

array([[ 0.61246894,  1.56838084,  0.43381426,  0.36432437],
       [-0.2867345 ,  0.08436005,  0.48451448, -0.5869404 ],
       [-1.68220459,  1.40586541, -1.50124388,  0.23520427],
       [ 1.09587854,  0.08470574, -1.28999299, -0.49661516],
       [ 0.90958846,  0.9141458 , -1.07029207, -1.8859758 ],
       [ 0.50806649,  0.53895384,  0.64506515, -0.62261419],
       [-1.04757642, -0.40909106,  1.38866828,  0.74814426],
       [ 0.54031603,  0.77708632,  0.8225159 , -1.24942838],
       [-0.71156966,  0.06794486,  0.49296451,  1.03711241],
       [-0.22276656,  0.91986649,  1.61681924, -0.88373499],
       [-0.2677913 , -0.87954903, -1.09564217,  1.66442569],
       [ 1.04134888, -1.17642617, -0.30978887, -0.557363  ],
       [-0.31296045,  1.64411483,  0.1803132 , -0.18942955],
       [ 0.13766641,  0.96794768,  1.40556835,  1.04812381],
       [ 0.36541166, -0.83318276, -0.25908865, -1.01042997],
       [-0.50210526,  1.52391277, -1.40829349,  0.63605844],
       [-1.13054744,  1.

<i>Imagine we test model performance on the test set here.</i>
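
To make the two "imagine" steps concrete, here is a minimal sketch of the whole workflow. The features `X_demo`, the target `y_demo`, the coefficients used to generate it and the choice of `LinearRegression` are all hypothetical stand-ins; this notebook does not specify a model or a target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical features and target, made up for this sketch
rng = np.random.default_rng(440)
X_demo = rng.standard_normal((200, 2)) * np.array([1000.0, 0.1])
y_demo = 0.003 * X_demo[:, 0] + 20 * X_demo[:, 1] + rng.standard_normal(200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          shuffle=True,
                                          random_state=614,
                                          test_size=.2)

# Fit the scaler on the training set only
demo_scaler = StandardScaler()
X_tr_scale = demo_scaler.fit_transform(X_tr)

# Build the model on the scaled training data
model = LinearRegression()
model.fit(X_tr_scale, y_tr)

# Transform (do not refit) the test set, then score the model on it
X_te_scale = demo_scaler.transform(X_te)
print("test R^2:", model.score(X_te_scale, y_te))

# The test set was scaled with the training set's means and standard deviations
manual = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)
print("matches manual scaling:", np.allclose(X_te_scale, manual))
```

The last check confirms that `transform` reuses the training statistics, which is why the scaled test columns generally do not have mean exactly $0$ or standard deviation exactly $1$.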

## Other scaler objects

`sklearn` has more scalers than just `StandardScaler`, which you can find at this link, <a href="https://scikit-learn.org/stable/modules/preprocessing.html">https://scikit-learn.org/stable/modules/preprocessing.html</a>. Moreover, some `sklearn` models accept an input argument that handles scaling for you when set to `True`; check the documentation of the model you are using, since this varies by model and by `sklearn` version.
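
As one example of another scaler, here is a minimal sketch using `MinMaxScaler`, which follows the same `fit`, `transform` and `fit_transform` pattern. The values in `x_demo` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Small made-up column (hypothetical values)
x_demo = np.array([[-250.0], [-50.0], [150.0]])

# MinMaxScaler rescales each column to the range [0, 1] by default
minmax = MinMaxScaler()
x_minmax = minmax.fit_transform(x_demo)
print(x_minmax.ravel())
```

The smallest value maps to $0$, the largest to $1$, and $-50$ lands at $0.5$ because it sits halfway between them.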

--------------------------

This notebook was written for the Erd&#337;s Institute C&#337;de Data Science Boot Camp by Matthew Osborne, Ph. D., 2023.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md)