# Scaling Data

The problems in this notebook give you an opportunity to practice and extend the content covered in Lecture 4: Scaling Data.  

In [2]:
import pandas as pd
import numpy as np

##### 1. Practice `StandardScaler`

Run the cell below to generate `X`, then standardize it.

In [3]:
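np.random.seed(2039104)
X = np.zeros((100, 20))

## fill each column with draws from a randomly chosen distribution
for i in range(20):
    ## random scale and shift, so the columns live on very different scales
    multiplier = np.random.randint(0,10000)
    constant = np.random.randint(-100,100)
    ## draw normal, uniform and binomial candidates, then keep one at random
    X[:,i] = [constant + multiplier*np.random.randn(100),
                 constant + multiplier*np.random.random(size=100),
                 constant + multiplier*np.random.binomial(100,.2, 100)][np.random.randint(0,3)]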

##### Sample Solution

In [4]:
from sklearn.preprocessing import StandardScaler

In [5]:
scale = StandardScaler()

scale.fit(X)

scale.transform(X)

## or 

## scale.fit_transform(X)

array([[-1.31287285, -0.62045619, -1.18983853, ...,  0.15305584,
        -0.22956033, -0.04419577],
       [ 0.5299962 ,  0.82724372,  0.49276141, ..., -0.8056007 ,
         0.31488584,  1.18346457],
       [-1.49750194, -1.4823713 ,  0.97350425, ...,  1.38979169,
        -1.23350402, -0.28972784],
       ...,
       [-0.23152719,  1.08378946, -0.46872427, ...,  0.53459355,
         0.38622395, -0.53525991],
       [-1.83366698,  0.25898817,  1.69461851, ...,  0.4742891 ,
        -1.19136368,  0.9379325 ],
       [-0.09491744, -0.71413469, -1.67058137, ...,  0.74085942,
        -0.01211058, -0.78079198]])
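As a quick sanity check on what standardizing does, the sketch below confirms that every column of the scaled output has mean roughly 0 and standard deviation roughly 1, and that `fit` followed by `transform` gives the same result as `fit_transform`. The array `X_demo` is a small stand-in made up for illustration, not the `X` from the problem.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

## X_demo is a hypothetical stand-in for X, with columns on different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50, scale=1000, size=(100, 3))
X_demo[:, 1] = X_demo[:, 1] / 500

## two equivalent ways to standardize
two_step = StandardScaler().fit(X_demo).transform(X_demo)
one_step = StandardScaler().fit_transform(X_demo)

## each scaled column should have mean ~0 and standard deviation ~1
col_means = one_step.mean(axis=0)
col_stds = one_step.std(axis=0)

print(np.allclose(two_step, one_step))
print(np.allclose(col_means, 0), np.allclose(col_stds, 1))
```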

##### 2. `.mean_` and `.scale_`

Once it has been fit, a `StandardScaler` stores the column means in `.mean_` and the column standard deviations in `.scale_`.

Produce arrays of the column means and standard deviations of `X` from Problem 1, using the fitted `StandardScaler` rather than `numpy`.

##### Sample Solution

In [6]:
scale.mean_

array([ 1.60453337e+03, -2.98903198e+01,  1.11763850e+05,  1.02194537e+02,
       -7.79673342e+02,  4.83236400e+04,  1.72429660e+05, -4.18597449e+01,
        8.91364000e+04,  3.53222387e+03, -1.95904285e+02,  1.29952248e+03,
        1.62944614e+03,  1.76870050e+05, -7.11689666e+01,  1.15599563e+02,
        6.03404800e+04,  4.76551264e+01,  3.63811802e+03,  5.58804200e+04])

In [7]:
scale.scale_

array([7.95751469e+02, 3.06645479e+03, 2.33097595e+04, 2.44657283e+03,
       6.38525701e+03, 8.79223722e+03, 2.83194906e+04, 3.58344900e+02,
       1.69738937e+04, 1.82791247e+03, 1.28967614e+03, 7.39959898e+03,
       9.43471302e+02, 3.61512905e+04, 2.53408966e+02, 2.95238698e+03,
       1.28792176e+04, 3.39774572e+01, 2.30804982e+03, 1.12775493e+04])
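If you want to cross-check these attributes against `numpy`, the sketch below may help. It uses a made-up array `X_demo` rather than the `X` from Problem 1. Note that `.scale_` is the population standard deviation (`ddof=0`), not the sample standard deviation (`ddof=1`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

## X_demo is a hypothetical stand-in for X
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=10, scale=5, size=(100, 4))

scaler = StandardScaler().fit(X_demo)

## .mean_ agrees with the column means from numpy
mean_matches = np.allclose(scaler.mean_, X_demo.mean(axis=0))

## .scale_ agrees with the population standard deviation (ddof=0) ...
scale_matches_population = np.allclose(scaler.scale_, X_demo.std(axis=0, ddof=0))

## ... and not with the sample standard deviation (ddof=1)
scale_matches_sample = np.allclose(scaler.scale_, X_demo.std(axis=0, ddof=1))

print(mean_matches, scale_matches_population, scale_matches_sample)
```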

##### 3. Other scalers

So far we have used `StandardScaler`, but `sklearn` provides other scaler objects as well. Here we introduce three of them.

- `MinMaxScaler`: This linearly rescales each column so that its minimum value maps to `min` and its maximum value maps to `max`, where `min` and `max` are inputs you control with `feature_range=(min, max)`. By default, your features get scaled to the interval $[0,1]$. <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler">https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler</a>.
- `MaxAbsScaler`: This divides each column by its largest absolute value, so the scaled values lie in $[-1,1]$. <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler">https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler</a>.
- `RobustScaler`: This works in much the same way as `StandardScaler`, but instead of the mean it subtracts the median, and instead of the standard deviation it divides by the interquartile range. It is called "robust" because these statistics are less sensitive to outliers. <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler">https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler</a>.
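To make the three descriptions above concrete, here is a minimal sketch that applies each scaler to a tiny made-up array, `X_demo`, and checks the property each one is meant to guarantee. The array and variable names are illustrative assumptions only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler, RobustScaler

## a tiny hypothetical data set with two columns
X_demo = np.array([[1.0, -10.0],
                   [2.0,   0.0],
                   [3.0,   5.0],
                   [4.0,  20.0]])

## MinMaxScaler: column min -> 0 and column max -> 1 by default
minmax = MinMaxScaler().fit_transform(X_demo)

## feature_range changes the target interval, here [-1, 1]
minmax_wide = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X_demo)

## MaxAbsScaler: largest absolute value in each column becomes 1
maxabs = MaxAbsScaler().fit_transform(X_demo)

## RobustScaler: column median becomes 0
robust = RobustScaler().fit_transform(X_demo)

print(minmax.min(axis=0), minmax.max(axis=0))
print(minmax_wide.min(axis=0), minmax_wide.max(axis=0))
print(np.abs(maxabs).max(axis=0))
print(np.median(robust, axis=0))
```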

Choose one of these three scalers and scale the columns of `X` using it.

##### Sample Solution

In [8]:
from sklearn.preprocessing import RobustScaler

In [9]:
rob_scale = RobustScaler()

rob_scale.fit(X)

rob_scale.transform(X)

array([[-0.81321136, -0.54428819, -0.76190476, ...,  0.11223819,
        -0.09445719,  0.        ],
       [ 0.39804514,  0.56480561,  0.57142857, ..., -0.4920924 ,
         0.2057182 ,  0.95238095],
       [-0.93456192, -1.20460785,  0.95238095, ...,  0.8918681 ,
        -0.64797231, -0.19047619],
       ...,
       [-0.10247884,  0.76134724, -0.19047619, ...,  0.35275698,
         0.24504982, -0.38095238],
       [-1.15551201,  0.12946076,  1.52380952, ...,  0.31474146,
        -0.62473862,  0.76190476],
       [-0.0126898 , -0.61605599, -1.14285714, ...,  0.48278559,
         0.02543174, -0.57142857]])

In [10]:
rob_scale.center_

array([ 1.54436619e+03,  2.46100019e+02,  1.06441000e+05,  4.98386954e+01,
       -1.09080829e+03,  4.71810000e+04,  1.67662000e+05, -3.45101309e+01,
        8.73880000e+04,  3.43361226e+03, -8.96186490e+01,  1.99276047e+03,
        1.56554637e+03,  1.75852000e+05, -8.27690373e+01,  2.99123709e+02,
        6.11550000e+04,  4.68060676e+01,  3.50370263e+03,  5.53820000e+04])

In [11]:
np.median(X, axis=0)

array([ 1.54436619e+03,  2.46100019e+02,  1.06441000e+05,  4.98386954e+01,
       -1.09080829e+03,  4.71810000e+04,  1.67662000e+05, -3.45101309e+01,
        8.73880000e+04,  3.43361226e+03, -8.96186490e+01,  1.99276047e+03,
        1.56554637e+03,  1.75852000e+05, -8.27690373e+01,  2.99123709e+02,
        6.11550000e+04,  4.68060676e+01,  3.50370263e+03,  5.53820000e+04])

In [12]:
rob_scale.scale_

array([1.21069794e+03, 4.00264279e+03, 2.94157500e+04, 2.95252842e+03,
       9.97645338e+03, 1.24200000e+04, 4.41450000e+04, 4.70064595e+02,
       2.18550000e+04, 3.05318429e+03, 1.81176006e+03, 9.20794058e+03,
       1.40716720e+03, 5.55300000e+04, 4.03593302e+02, 3.18248056e+03,
       1.74540000e+04, 5.38988293e+01, 4.18624888e+03, 1.45372500e+04])
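The cells above confirm that `center_` matches the column medians. A similar check for `scale_` is sketched below on a made-up array `X_demo`, assuming the default `RobustScaler` settings, under which `scale_` is the difference between the 75th and 25th percentiles of each column.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

## X_demo is a hypothetical stand-in for X
rng = np.random.default_rng(2)
X_demo = rng.normal(loc=0, scale=100, size=(100, 5))

rob = RobustScaler().fit(X_demo)

## interquartile range of each column, computed directly with numpy
q25, q75 = np.percentile(X_demo, [25, 75], axis=0)
iqr = q75 - q25

center_matches_median = np.allclose(rob.center_, np.median(X_demo, axis=0))
scale_matches_iqr = np.allclose(rob.scale_, iqr)

print(center_matches_median, scale_matches_iqr)
```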

##### 4. Scaling mixed columns

When your data include both quantitative and categorical variables, scaling gets slightly trickier. Demonstrate this by running the `X` below through `StandardScaler`.

In [13]:
X = np.zeros((100,4))

X[:,0] = np.random.randn(100)*10 + 89
X[:,1] = np.random.binomial(1, .4, 100)
X[:,2] = np.random.binomial(1, .6, 100)
X[:,3] = np.random.binomial(1, .8, 100)

In [14]:
## First 10 rows of X
## note that columns 1, 2, 3 are binary categorical variables
## while column 0 is quantitative
X[:10,:]

array([[ 69.80368503,   0.        ,   1.        ,   1.        ],
       [ 90.69259659,   0.        ,   0.        ,   1.        ],
       [106.09862423,   0.        ,   0.        ,   1.        ],
       [ 85.09049446,   1.        ,   1.        ,   0.        ],
       [ 66.30189825,   1.        ,   0.        ,   0.        ],
       [ 82.83929007,   0.        ,   1.        ,   0.        ],
       [100.81836442,   1.        ,   0.        ,   1.        ],
       [ 68.24697839,   1.        ,   0.        ,   1.        ],
       [ 85.4479043 ,   0.        ,   1.        ,   0.        ],
       [ 73.59894067,   0.        ,   1.        ,   1.        ]])

In [15]:
scale = StandardScaler()

scale.fit(X)

scale.transform(X)[:10, :]

array([[-1.5300141 , -0.78288136,  0.92295821,  0.531085  ],
       [ 0.28195912, -0.78288136, -1.08347268,  0.531085  ],
       [ 1.61832887, -0.78288136, -1.08347268,  0.531085  ],
       [-0.20398573,  1.27733275,  0.92295821, -1.88293774],
       [-1.83377065,  1.27733275, -1.08347268, -1.88293774],
       [-0.39926264, -0.78288136,  0.92295821, -1.88293774],
       [ 1.16030169,  1.27733275, -1.08347268,  0.531085  ],
       [-1.66504797,  1.27733275, -1.08347268,  0.531085  ],
       [-0.17298282, -0.78288136,  0.92295821, -1.88293774],
       [-1.20080108, -0.78288136,  0.92295821,  0.531085  ]])
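Each scaled binary column in the output above takes exactly two values. If a fraction $p$ of a column's entries are 1, its mean is $p$ and its standard deviation is $\sqrt{p(1-p)}$, so 0 is sent to $-p/\sqrt{p(1-p)}$ and 1 is sent to $(1-p)/\sqrt{p(1-p)}$. The sketch below checks this on a made-up binary column, `binary_col`, which is an illustrative assumption and not one of the columns of `X`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

## a hypothetical binary column, shaped (100, 1) as sklearn expects
rng = np.random.default_rng(3)
binary_col = rng.binomial(1, 0.4, size=100).astype(float).reshape(-1, 1)

## fraction of ones in the column
p = binary_col.mean()

scaled = StandardScaler().fit_transform(binary_col)

## the two values that 0 and 1 should be mapped to
expected_zero = -p / np.sqrt(p * (1 - p))
expected_one = (1 - p) / np.sqrt(p * (1 - p))

## the distinct values that actually appear after scaling, in sorted order
unique_vals = np.unique(scaled.round(8))

print(unique_vals, expected_zero, expected_one)
```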

As the output shows, the three binary columns no longer hold 0s and 1s, so they can no longer be read as indicator variables. What we actually want is to scale column `0` and leave columns `1`, `2`, and `3` alone.

You can do this, but it takes slightly more than `sklearn`'s scaler objects offer on their own, since each one scales every column it is given. You can see one approach to dealing with this kind of situation in the next optional extra practice notebook, "More Advanced Pipelines".
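As a preview, one possible way to scale only the quantitative column is `sklearn`'s `ColumnTransformer`. This is a minimal sketch under assumed data, and it may not be the approach taken in the "More Advanced Pipelines" notebook. The array `X_mixed` and the transformer name `"scale_quant"` are made up for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

## hypothetical mixed data: column 0 is quantitative, columns 1-3 are binary
rng = np.random.default_rng(4)
X_mixed = np.zeros((100, 4))
X_mixed[:, 0] = rng.normal(loc=89, scale=10, size=100)
X_mixed[:, 1:] = rng.binomial(1, 0.5, size=(100, 3))

## scale column 0 only; remainder="passthrough" keeps the other columns as they are
ct = ColumnTransformer(
    [("scale_quant", StandardScaler(), [0])],
    remainder="passthrough",
)

X_out = ct.fit_transform(X_mixed)

## column 0 is standardized, the binary columns are unchanged
print(X_out[:5, :])
```

The transformed columns come first in the output and the passed-through columns follow, so the column order is preserved here only because the scaled column happens to be column `0`.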

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2023.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).