# Scaling

Scaling is the process of adjusting the numeric values in a dataset so that they sit on a common scale, for example:

* between 0 and 1, or
* so that they have a mean of 0 and a standard deviation of 1

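Whilst it's not strictly necessary for linear regression, many other machine learning algorithms are sensitive to the scale of the independent variables: a variable measured in large units can dominate one measured in small units. Scaling puts the variables on an equal footing so that each can contribute comparably to the outcome.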
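
The two kinds of scaling listed above can be sketched with a few lines of plain Python. This is only an illustration of the arithmetic: the `values` list is made-up toy data (an assumption, not part of the diabetes dataset), and the standard library is used so that each step is visible.

```python
from statistics import mean, pstdev

# Hypothetical toy values, not taken from the diabetes dataset
values = [2.0, 4.0, 6.0, 8.0]

# Min-max scaling: the smallest value maps to 0 and the largest to 1
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

# Standardisation: subtract the mean, then divide by the
# population standard deviation
mu, sigma = mean(values), pstdev(values)
standardised = [(v - mu) / sigma for v in values]

print(min_max)
print([round(z, 3) for z in standardised])
```

Min-max scaling keeps every value inside a fixed range, whereas standardisation centres the values on 0 without bounding them.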

In [None]:
# libraries
import pandas as pd
import seaborn as sns

from sklearn.preprocessing import StandardScaler


In [None]:
# load data and examine
df_dia = pd.read_csv("https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt", sep="\t")

df_dia.head(10)

In [None]:
df_dia.describe().round(2)

From the summary above, you can see that each variable has a different mean and standard deviation.

In [None]:
# get column names to select which ones to scale
df_dia.columns

We can apply standard scaling to selected columns using scikit-learn's `StandardScaler` class.

This transforms each selected column to have a mean of 0 and a standard deviation of 1.
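
As a rough check of what that transformation does, the sketch below compares `StandardScaler` with the same calculation done by hand. The `toy` DataFrame and its column names are hypothetical, invented for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data, not the diabetes dataset
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                    "b": [10.0, 20.0, 40.0, 80.0]})

# StandardScaler learns each column's mean and standard deviation,
# then rescales the column with them
scaled = StandardScaler().fit_transform(toy)

# The same calculation by hand; ddof=0 is the population standard deviation
manual = (toy - toy.mean()) / toy.std(ddof=0)

# True if the two results agree to within floating-point tolerance
print(np.allclose(scaled, manual.to_numpy()))
```

Note that `StandardScaler` divides by the population standard deviation, while pandas' `describe()` reports the sample standard deviation (`ddof=1`), so the `std` row of a scaled column can differ very slightly from 1 before rounding.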

In [None]:
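# SEX is left out because it is a category code that we do not want to scale
cols_to_scale = ['AGE', 'BMI', 'BP', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'Y']

# fit the scaler on the selected columns and overwrite them with the scaled values
df_dia[cols_to_scale] = StandardScaler().fit_transform(df_dia[cols_to_scale])

# check the result: each scaled column should show a mean of 0 and a std of 1
df_dia[cols_to_scale].describe().round(2)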

All of the selected variables now have the same mean and standard deviation: 0 and 1 respectively, to two decimal places.