# **Why do we need to scale features?**

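* Many machine learning techniques, especially those based on distances or gradient descent, will implicitly give **more weight to features with larger magnitudes**, regardless of how informative those features are.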

* There are two common approaches to scaling:
    + **MinMaxScaler**
    + **StandardScaler**
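
To make the magnitude problem concrete, here is a minimal sketch using three made-up samples (the values are hypothetical, not taken from the dataset used later). It compares Euclidean distances before and after a hand-written min-max scaling.

```python
import numpy as np

# Three hypothetical samples with two features on very different scales:
# column 0 is in the thousands, column 1 is a fraction between 0 and 1.
X = np.array([
    [1000.0, 0.10],
    [1010.0, 0.90],
    [1500.0, 0.12],
])


def euclidean(a, b):
    """Plain Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


# Raw distances: the large-magnitude column dominates the result.
d01_raw = euclidean(X[0], X[1])
d02_raw = euclidean(X[0], X[2])

# Min-max scale each column to [0, 1] by hand, then measure again.
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
d01_scaled = euclidean(X_scaled[0], X_scaled[1])
d02_scaled = euclidean(X_scaled[0], X_scaled[2])

print("raw:   ", d01_raw, d02_raw)
print("scaled:", d01_scaled, d02_scaled)
```

On the raw data, sample 1 looks far closer to sample 0 than sample 2 does, even though it differs strongly in the second feature. After scaling, both features contribute on comparable terms and the two distances come out nearly equal.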


**NOTE: tree-based algorithms don't require scaling.**
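
As a rough check of this note, the sketch below trains the same decision tree on raw and on standardized features. The choice of `DecisionTreeClassifier`, `StandardScaler`, and the split parameters is an assumption made for illustration. Because tree splits depend only on the ordering of values within a feature, the two accuracies should be identical or very close.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Learn the scaling parameters from the training set only.
scaler = StandardScaler().fit(X_train)

# Same model and seed, once on raw features and once on scaled features.
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(
    scaler.transform(X_train), y_train
)

acc_raw = tree_raw.score(X_test, y_test)
acc_scaled = tree_scaled.score(scaler.transform(X_test), y_test)
print("raw features:   ", acc_raw)
print("scaled features:", acc_scaled)
```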

-------

## Min Max Scaler

+ Min-max scaling rescales each feature to `a range between 0 and 1`, as defined by the `min and max of that feature`.
+ It is recommended when your algorithm doesn't make assumptions about the distribution of your variables, as in the case of KNN.
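
The min-max description above amounts to `(x - min) / (max - min)` per feature. The following sketch applies that formula by hand to a small made-up column (the values are hypothetical) and compares the result with scikit-learn's `MinMaxScaler` using its default range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A single hypothetical feature column with min 2 and max 10.
x = np.array([[2.0], [4.0], [6.0], [10.0]])

# Manual min-max scaling: subtract the column min, divide by the column range.
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# The same transformation via scikit-learn (default feature_range is (0, 1)).
sk = MinMaxScaler().fit_transform(x)

print(manual.ravel())  # the values 0, 0.25, 0.5 and 1
print(np.allclose(manual, sk))
```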


## Standard Scaler

+ The StandardScaler rescales each feature to the `number of standard deviations from the mean of that feature`, so each scaled feature has mean 0 and standard deviation 1. Thus we have a `range of values both positive and negative`.
+ This approach is most effective when your variables follow a roughly bell-shaped (normal) distribution.
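
Standardization as described above is `(x - mean) / std` per feature. Here is a small sketch on a made-up column (hypothetical values) that computes it by hand and checks it against `StandardScaler`. Note that `StandardScaler` uses the population standard deviation, which matches NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A single hypothetical feature column with mean 5.
x = np.array([[2.0], [4.0], [6.0], [8.0]])

# Manual standardization: subtract the mean, divide by the population std.
mean = x.mean(axis=0)
std = x.std(axis=0)  # ddof=0, the same convention StandardScaler uses
manual = (x - mean) / std

# The same transformation via scikit-learn.
sk = StandardScaler().fit_transform(x)

print(manual.ravel())           # negative below the mean, positive above it
print(np.allclose(manual, sk))
```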


-------

### Before we scale, we need to perform the train-test split.

The reason we split first is that we derive the scaling parameters (such as the min and max) from the training set only, then apply them to the test set.

Remember that in machine learning, anything our model learns must come from the training set, not the test set.
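
One way to enforce this rule is to wrap the scaler and the model in a scikit-learn pipeline, so that fitting the pipeline fits the scaler on the training data only. The sketch below is an illustration under assumptions: the KNN classifier, `n_neighbors=5`, and the split parameters are choices made here, not part of the original workflow.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Split first, so the test set stays unseen during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# The pipeline fits the scaler on X_train only, then trains KNN on the scaled data.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

# At scoring time, X_test is transformed with the training-set mean and std.
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```

The pipeline also makes it harder to call `fit` on the test set by accident, since scaling and prediction happen in a single `score` or `predict` call.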

-----

# Data

In [2]:
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

In [3]:
df = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])
df['target'] = pd.Series(cancer['target'])

df.head(2)

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |


-----

# Train-Test Split


In [4]:
from sklearn.model_selection import train_test_split

In [5]:
X = df.drop('target', axis=1)
y = df['target']

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

-------

# Two common approaches: **MinMaxScaler & StandardScaler**

# Min Max Scaler


In [7]:
from sklearn.preprocessing import MinMaxScaler

In [8]:
scaler = MinMaxScaler()

In [9]:
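# Fit on the training set only (learns each feature's min and max), then transform it
scaled_X_train = scaler.fit_transform(X_train)
# Reuse the training-set min and max on the test set; never refit on test data
scaled_X_test = scaler.transform(X_test)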

In [10]:
scaled_X_train

array([[0.21340338, 0.20248963, 0.20869325, ..., 0.25597658, 0.2712399 ,
        0.24111242],
       [0.16607506, 0.36929461, 0.15942229, ..., 0.22487082, 0.12773507,
        0.1533517 ],
       [0.2493729 , 0.34149378, 0.23826964, ..., 0.28284533, 0.30514488,
        0.17237308],
       ...,
       [0.11619102, 0.35726141, 0.11077327, ..., 0.17402687, 0.17524147,
        0.17263545],
       [0.12963226, 0.35311203, 0.11706171, ..., 0.        , 0.06780997,
        0.06919848],
       [0.21434995, 0.59004149, 0.21235575, ..., 0.33251808, 0.10782574,
        0.21172767]])
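
As a sanity check on the output above, the sketch below repeats the same steps in a self-contained form and inspects the value ranges. The training data should span exactly 0 to 1, while the test data, scaled with the training min and max, may fall slightly outside that range.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scaler = MinMaxScaler()
scaled_X_train = scaler.fit_transform(X_train)  # learns min and max from train
scaled_X_test = scaler.transform(X_test)        # reuses the train min and max

# Training data is bounded by construction.
print("train range:", scaled_X_train.min(), scaled_X_train.max())

# Test data is not guaranteed to stay inside [0, 1].
print("test range: ", scaled_X_test.min(), scaled_X_test.max())
```

Test values outside the 0 to 1 range are expected behaviour rather than a bug: they simply mean the test set contains values beyond those seen in training.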

------

# Standard Scaler


In [11]:
from sklearn.preprocessing import StandardScaler

In [12]:
scaler = StandardScaler()

In [13]:
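# Fit on the training set only (learns each feature's mean and std), then transform it
scaled_X_train = scaler.fit_transform(X_train)
# Reuse the training-set mean and std on the test set; never refit on test data
scaled_X_test = scaler.transform(X_test)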

In [14]:
scaled_X_train

array([[-0.74998027, -1.09978744, -0.74158608, ..., -0.6235968 ,
         0.07754241,  0.45062841],
       [-1.02821446, -0.1392617 , -1.02980434, ..., -0.7612376 ,
        -1.07145262, -0.29541379],
       [-0.53852228, -0.29934933, -0.56857428, ..., -0.50470441,
         0.34900827, -0.13371556],
       ...,
       [-1.3214733 , -0.20855336, -1.3143845 , ..., -0.98621857,
        -0.69108476, -0.13148524],
       [-1.24245479, -0.23244704, -1.27759928, ..., -1.7562754 ,
        -1.55125275, -1.01078909],
       [-0.74441558,  1.13188181, -0.72016173, ..., -0.28490593,
        -1.2308599 ,  0.20083251]])
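
A similar check can be sketched for the standardized output: each training column should have a mean of about 0 and a standard deviation of about 1, while the test columns will only be approximately so, because they are scaled with training statistics. The code below reruns the same steps in a self-contained form.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)  # learns mean and std from train
scaled_X_test = scaler.transform(X_test)        # reuses the train mean and std

# Per-column statistics of the scaled training data.
train_means = scaled_X_train.mean(axis=0)
train_stds = scaled_X_train.std(axis=0)

print("largest |mean| on train:", np.abs(train_means).max())
print("train std range:", train_stds.min(), train_stds.max())

# The test columns are close to, but not exactly, mean 0.
print("largest |mean| on test:", np.abs(scaled_X_test.mean(axis=0)).max())
```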