# Scaling Data

The problems in this notebook give you an opportunity to practice and extend the content covered in Lecture 4: Scaling Data.  

In [None]:
import pandas as pd
import numpy as np

##### 1. Practice `StandardScaler`

Generate `X` by running the seeded cell below, then standardize it with `StandardScaler`.
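
As a warm-up, here is a minimal sketch of the fit-then-transform pattern on a small toy array rather than on the exercise's `X`. The names `X_toy`, `scaler` and `X_toy_scaled` are made up for illustration; only `StandardScaler` and its `fit_transform` method come from `sklearn`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

## toy data (hypothetical, not the exercise's X):
## two columns on very different scales
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 2000.0],
                  [3.0, 3000.0],
                  [4.0, 4000.0]])

## fit learns each column's mean and standard deviation,
## transform subtracts the mean and divides by the standard deviation
scaler = StandardScaler()
X_toy_scaled = scaler.fit_transform(X_toy)

## after standardizing, each column should have mean ~0 and standard deviation ~1
print(X_toy_scaled.mean(axis=0))
print(X_toy_scaled.std(axis=0))
```

The same two lines (`StandardScaler()` followed by `fit_transform`) should carry over to the 100 by 20 array in this problem.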

In [None]:
np.random.seed(2039104)
X = np.zeros((100, 20))

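## fill each of the 20 columns with data on a randomly chosen scale
for i in range(20):
    multiplier = np.random.randint(0,10000)
    constant = np.random.randint(-100,100)
    ## draw a normal, a uniform and a binomial sample,
    ## then keep one of the three at random for this column
    X[:,i] = [constant + multiplier*np.random.randn(100),
              constant + multiplier*np.random.random(size=100),
              constant + multiplier*np.random.binomial(100,.2, 100)][np.random.randint(0,3)]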

In [None]:
## code here



In [None]:
## code here



In [None]:
## code here



##### 2. `.mean_` and `.scale_`

After it has been fit, a `StandardScaler` stores the per-column means in `.mean_` and the per-column standard deviations in `.scale_`.

Produce arrays of the column means and standard deviations of `X` from Problem 1 using `StandardScaler`, not `numpy`.
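
The sketch below shows where these attributes live, again on a made-up toy array (`X_toy`, `fitted_means` and `fitted_stds` are illustrative names). `numpy` appears only as a sanity check; note that `.scale_` matches the population standard deviation (`ddof=0`), which is also `numpy`'s default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

## toy data (hypothetical), three rows and two columns
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

## the attributes only exist after the scaler has been fit
scaler = StandardScaler().fit(X_toy)

fitted_means = scaler.mean_   ## one mean per column
fitted_stds = scaler.scale_   ## one standard deviation per column

## sanity check against numpy (population standard deviation, ddof=0)
print(np.allclose(fitted_means, X_toy.mean(axis=0)))
print(np.allclose(fitted_stds, X_toy.std(axis=0, ddof=0)))
```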

In [None]:
## code here



In [None]:
## code here



In [None]:
## code here



##### 3. Other scalers

While we have used `StandardScaler`, there are other scaler objects in `sklearn`. Here we introduce some of them.

- `MinMaxScaler`: This linearly rescales each column so that its minimum value is sent to `min` and its maximum value is sent to `max`, where `min` and `max` are inputs you control with `feature_range=(min, max)`. By default your features get scaled to the interval $[0,1]$. See <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler">https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler</a>.
- `MaxAbsScaler`: This divides each column by its largest absolute value, so the scaled values lie in the interval $[-1,1]$. See <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler">https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler</a>.
- `RobustScaler`: This works in much the same way as `StandardScaler`, but it subtracts the median instead of the mean and divides by the interquartile range instead of the standard deviation. It is called "robust" because these statistics are less sensitive to outliers. See <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler">https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler</a>.

Choose one of these three scalers and scale the columns of `X` using it.
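
To get a feel for how the three scalers differ, the sketch below applies each one to a single made-up column containing one large outlier. The array `col` and the result names are illustrative assumptions; all three scalers are used with their default settings.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler, RobustScaler

## toy column (hypothetical) with one large outlier
col = np.array([[-2.0], [0.0], [1.0], [2.0], [100.0]])

## default feature_range=(0, 1): minimum goes to 0, maximum goes to 1
minmax = MinMaxScaler().fit_transform(col)

## divide by the largest absolute value, here 100
maxabs = MaxAbsScaler().fit_transform(col)

## subtract the median, divide by the interquartile range
robust = RobustScaler().fit_transform(col)

print(minmax.ravel())
print(maxabs.ravel())
print(robust.ravel())
```

With `MinMaxScaler` and `MaxAbsScaler` the outlier squeezes the other values close together, while `RobustScaler` keeps the bulk of the data spread out and leaves the outlier far away.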

In [None]:
## code here



In [None]:
## code here



In [None]:
## code here



In [None]:
## code here



##### 4. Scaling mixed columns

When your data includes both quantitative and categorical variables, scaling gets slightly trickier. Demonstrate this by running the `X` below through `StandardScaler`.

In [None]:
X = np.zeros((100,4))

X[:,0] = np.random.randn(100)*10 + 89
X[:,1] = np.random.binomial(1, .4, 100)
X[:,2] = np.random.binomial(1, .6, 100)
X[:,3] = np.random.binomial(1, .8, 100)

In [None]:
## First 10 rows of X
## note that columns 1, 2, 3 are binary categorical variables
## while column 0 is quantitative
X[:10,:]

In [None]:
## code here




You should see that the three binary columns no longer contain 0s and 1s, so they can no longer be read as categories. What we actually want is to scale column `0` and leave columns `1`, `2` and `3` untouched.

You can do this, but it takes slightly more work than `sklearn`'s built-in scaler objects can do on their own. Check out `Lectures/Cleaning/5. More Advanced Pipelines` to see one way to approach such an issue.
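
For a quick preview, one possible approach (not necessarily the one taken in that lecture) is `sklearn`'s `ColumnTransformer`, which applies a transformer to selected columns and can pass the rest through unchanged. The sketch below builds its own made-up mixed array; `X_mixed`, `ct` and the step name `"scale_quant"` are illustrative assumptions.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

## toy mixed data (hypothetical): column 0 is quantitative,
## columns 1, 2, 3 are binary categorical
rng = np.random.default_rng(0)
X_mixed = np.column_stack([
    rng.normal(89, 10, size=100),
    rng.binomial(1, 0.4, size=100),
    rng.binomial(1, 0.6, size=100),
    rng.binomial(1, 0.8, size=100),
])

## scale only column 0 and pass the remaining columns through untouched
ct = ColumnTransformer(
    transformers=[("scale_quant", StandardScaler(), [0])],
    remainder="passthrough",
)
X_mixed_scaled = ct.fit_transform(X_mixed)

## column 0 is standardized, columns 1-3 are still 0s and 1s
print(X_mixed_scaled[:5, :])
```

`ColumnTransformer` places the transformed columns first and the passed-through columns after them, so the column order is preserved here only because the scaled column is already column `0`.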

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2023.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).