# Feature Scaling

Feature Scaling is a way of transforming your data into a common range of values.

There are two common scalings:

- Standardizing

- Normalizing

### Standardizing

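Standardizing subtracts the column mean from each value and divides by the column standard deviation, so the result has mean 0 and standard deviation 1. For example, if `df` has a column called `height`, we could create a standardized height as: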

```py
df["height_standard"] = (df["height"] - df["height"].mean()) / df["height"].std()
```

`df["height"].std()` is the standard deviation of the column.
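As a minimal sketch of the formula above, the snippet below standardizes a small, made-up `height` column by hand and with scikit-learn's `StandardScaler`. The data values and the `height_sklearn` column name are assumptions for illustration. One detail worth knowing: pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` uses the population standard deviation (`ddof=0`), so the two results differ by a constant factor.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical heights in centimetres, made up for illustration.
df = pd.DataFrame({"height": [150.0, 160.0, 170.0, 180.0, 190.0]})

# By hand: pandas' Series.std() uses the sample standard deviation (ddof=1).
df["height_standard"] = (df["height"] - df["height"].mean()) / df["height"].std()

# With scikit-learn: StandardScaler divides by the population
# standard deviation (ddof=0) and expects a 2-D input.
df["height_sklearn"] = StandardScaler().fit_transform(df[["height"]]).ravel()

print(df)
print("mean of standardized column:", df["height_standard"].mean())
print("std of standardized column: ", df["height_standard"].std())
```

For most modeling purposes the constant factor between the two versions does not matter, as long as the same scaling is applied consistently.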


### Normalizing
With normalizing, data are scaled between 0 and 1:

```py
df["height_normal"] = (df["height"] - df["height"].min()) / (df["height"].max() - df["height"].min())
```
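The same min-max formula can be checked on a toy column, and compared with scikit-learn's `MinMaxScaler`, which scales to the range 0 to 1 by default. This is a sketch on made-up data; the values and the `height_sklearn` column name are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical heights in centimetres, made up for illustration.
df = pd.DataFrame({"height": [150.0, 160.0, 170.0, 180.0, 190.0]})

# By hand: the minimum maps to 0 and the maximum maps to 1.
col = df["height"]
df["height_normal"] = (col - col.min()) / (col.max() - col.min())

# With scikit-learn: MinMaxScaler uses feature_range=(0, 1) by default
# and expects a 2-D input.
df["height_sklearn"] = MinMaxScaler().fit_transform(df[["height"]]).ravel()

print(df)
```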

### When to use Feature Scaling
- When your algorithm uses a distance-based metric to predict.
- When you incorporate regularization.

### Distance-Based Metrics

In future lessons, you will see a common supervised learning technique that is based on the distances between points: [Support Vector Machines (or SVMs)](https://en.wikipedia.org/wiki/Support_vector_machine).

Another technique that uses distances between points to determine a prediction is [k-nearest neighbors (or k-nn)](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm).

With either of these techniques, choosing not to scale your data may lead to drastically different (and likely misleading) predictions, because the feature with the largest numeric range dominates the distance calculation.

For this reason, choosing some sort of feature scaling is necessary with these distance-based techniques.
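To see why scaling matters for distances, here is a small sketch with made-up data: five hypothetical people described by height (metres) and income (dollars). On the raw data, income dominates the Euclidean distance, so the nearest neighbor of the first person is someone with a very different height. After standardizing each column, the nearest neighbor changes. The data and the helper `nearest_to_first` are assumptions for illustration only.

```python
import numpy as np

# Hypothetical (height in metres, income in dollars) rows, made up for illustration.
X = np.array([
    [1.60, 50_000.0],  # row 0: the query point
    [1.62, 51_000.0],  # row 1: similar height, income differs by 1,000
    [1.95, 50_200.0],  # row 2: very different height, income differs by 200
    [1.75, 30_000.0],  # row 3
    [1.80, 70_000.0],  # row 4
])

def nearest_to_first(points):
    """Return the index of the row closest to row 0 (Euclidean distance)."""
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    return int(np.argmin(dists)) + 1  # +1 because row 0 was excluded

# Standardize each column: subtract its mean, divide by its standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print("nearest neighbor on raw data:   ", nearest_to_first(X))         # row 2
print("nearest neighbor on scaled data:", nearest_to_first(X_scaled))  # row 1
```

On the raw data the height difference barely registers next to income differences in the hundreds or thousands; after scaling, both features contribute on comparable terms.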

### Regularization

When you start introducing regularization, you will again want to scale the features of your model. 

The penalty on particular coefficients in regularized linear regression techniques depends largely on the scale associated with the features.

When one feature is on a small range, say from 0 to 10, and another is on a large range, say from 0 to 1,000,000, applying regularization is going to unfairly punish the feature with the small range.

Features with small ranges need larger coefficients than features with large ranges in order to have the same effect on the predicted outcome.

(Think about how $ab = ba$ for two numbers $a$ and $b$.)

Therefore, if regularization could remove one of those two features with the same net increase in error, it would rather remove the small-ranged feature with the large coefficient, since that would reduce the regularization term the most.
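The effect described above can be sketched on synthetic data. Everything here is an assumption chosen for illustration: two independent features, one ranging from 0 to 1 and one from 0 to 1,000,000, each contributing equally to `y`, and `Lasso` with `alpha=1.0`. The small range is 0 to 1 rather than 0 to 10 so that the effect shows up with this particular `alpha`.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000

# Synthetic features (assumed ranges, for illustration only).
x_small = rng.uniform(0, 1, n)          # small range: 0 to 1
x_large = rng.uniform(0, 1_000_000, n)  # large range: 0 to 1,000,000

# Both features contribute equally to y: 6 * 1 == 6e-6 * 1_000_000.
# The small-range feature needs the much larger coefficient to do so.
y = 6.0 * x_small + 6e-6 * x_large + rng.normal(0, 0.5, n)

X = np.column_stack([x_small, x_large])

# Lasso on the raw features: the penalty falls mostly on the large coefficient.
raw_coef = Lasso(alpha=1.0).fit(X, y).coef_

# Lasso on standardized features: both features are penalized on equal terms.
X_scaled = StandardScaler().fit_transform(X)
scaled_coef = Lasso(alpha=1.0).fit(X_scaled, y).coef_

print("unscaled coefficients:", raw_coef)
print("scaled coefficients:  ", scaled_coef)
```

In this setup the unscaled fit should set the small-range coefficient to zero while keeping the large-range one, whereas the scaled fit keeps both with similar magnitudes, even though the two features are equally informative.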

## Exercise

Previously, you saw how regularization will remove features from a model (by setting their coefficients to zero) if the penalty for removing them is small. 

In this exercise, you'll revisit the same dataset as before and see how scaling the features changes which features are favored in a regularization step.

```py
import pandas as pd

spreadsheet = pd.read_csv('./09_data.csv', delimiter=',', header=None)
spreadsheet.head()
```

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 1.25664 | 2.04978 | -6.2364 | 4.71926 | -4.26931 | 0.2059 | 12.31798 |
| 1 | -3.89012 | -0.37511 | 6.14979 | 4.94585 | -3.57844 | 0.0064 | 23.67628 |
| 2 | 5.09784 | 0.9812 | -0.29939 | 5.85805 | 0.28297 | -0.20626 | -1.53459 |
| 3 | 0.39034 | -3.06861 | -5.63488 | 6.43941 | 0.39256 | -0.07084 | -24.6867 |
| 4 | 5.84727 | -0.15922 | 11.41246 | 7.52165 | 1.69886 | 0.29022 | 17.54122 |


```py
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Assign the data to predictor and outcome variables
train_data = pd.read_csv('./09_data.csv', header=None)
X = train_data.iloc[:, :-1]
y = train_data.iloc[:, -1]

# Standardize the predictors so each column has mean 0 and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit Lasso (L1-regularized) regression on the scaled predictors
lasso_reg = Lasso()
lasso_reg.fit(X_scaled, y)

# Retrieve and print the regression coefficients
reg_coef = lasso_reg.coef_
print(reg_coef)
```

```
[  0.           3.90753617   9.02575748  -0.         -11.78303187
   0.45340137]
```
