# Feature Scaling

> Feature scaling is essential for machine learning algorithms that compute distances between data points. Without scaling, the feature with the larger value range dominates the distance calculation, as explained intuitively in the introduction section. Algorithms that are sensitive to feature magnitudes, such as K-Nearest Neighbors, SVMs, and regression fitted with gradient descent or regularization, are the ones that need feature scaling.
> However, algorithms that do not use distance calculations, such as Naive Bayes, tree-based models, and LDA, do not require feature scaling.
Which scaling technique to use varies from case to case. For example, PCA works with the variance of the data, so standardization is advisable.
Feature scaling also helps gradient descent converge faster, because all features then lie in a similar, small range of values.
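
To make the "dominating feature" effect concrete, here is a minimal sketch with made-up numbers. The columns (age in years, income in dollars) and the helper `euclidean` are assumptions for illustration only. It compares Euclidean distances before and after rescaling each column to the range [0, 1].

```python
import numpy as np

# Hypothetical data: column 0 is age (years), column 1 is income (dollars).
X = np.array([
    [25.0, 50_000.0],
    [60.0, 51_000.0],   # very different age, similar income
    [26.0, 58_000.0],   # similar age, different income
])

def euclidean(a, b):
    """Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# On the raw data the income column decides almost everything.
d_raw_01 = euclidean(X[0], X[1])
d_raw_02 = euclidean(X[0], X[2])

# Rescale every column to [0, 1] so both features carry comparable weight.
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
d_scaled_01 = euclidean(X_scaled[0], X_scaled[1])
d_scaled_02 = euclidean(X_scaled[0], X_scaled[2])

print("raw:   ", d_raw_01, d_raw_02)
print("scaled:", d_scaled_01, d_scaled_02)
```

On the raw data the 35-year age gap barely registers next to the income gap. After scaling, the two distances are of similar size, so neither feature dominates.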

## Min-Max Scaling

>In this method, we find the minimum and the maximum value of each column, subtract the minimum from every entry, and divide the result by the difference between the maximum and the minimum. With the default settings, the scaled values lie in the range [0, 1].


In [4]:
from sklearn.preprocessing import MinMaxScaler

data=[[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scalarfunction = MinMaxScaler()
scalarfunction.fit(data)
print(scalarfunction.transform(data))

[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]
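
The rule behind this output is x' = (x - min) / (max - min), applied to each column separately. As a rough check, the same numbers can be reproduced with plain NumPy. This is a sketch of the formula, not of scikit-learn's internals, and it does not guard against a constant column, where max - min would be zero.

```python
import numpy as np

data = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)

# Column-wise minimum and maximum (axis=0 walks down the rows).
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# Min-max scaling: subtract the minimum, divide by the value range.
scaled = (data - col_min) / (col_max - col_min)
print(scaled)
```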


In [9]:
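from sklearn.preprocessing import MinMaxScaler

# Same data with one extra row, [12, 12], which is an outlier in the first column.
# The outlier sets the column maximum, so the other values are squeezed toward 0.
data = [[12, 12], [-1, 2], [-0.5, 6], [0, 10], [1, 18]]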
scalarfunction = MinMaxScaler()
scalarfunction.fit(data)
print(scalarfunction.transform(data))

[[1.         0.625     ]
 [0.         0.        ]
 [0.03846154 0.25      ]
 [0.07692308 0.5       ]
 [0.15384615 1.        ]]


## Standardization 

Standardization, also called z-score normalization, rescales each feature to have zero mean and unit variance.

In [15]:
from sklearn.preprocessing import StandardScaler

# `data` is the five-row list defined in the Min-Max Scaling example above.
std_scaler = StandardScaler().fit(data)

In [17]:
std_scaler.transform(data)

array([[ 1.98165628,  0.44232587],
       [-0.67417172, -1.40069858],
       [-0.57202449, -0.6634888 ],
       [-0.46987726,  0.07372098],
       [-0.2655828 ,  1.54814054]])
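
The values above follow from z = (x - mean) / std, computed per column. The sketch below reproduces them with plain NumPy, assuming the same five-row `data` as in the cells above. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

data = np.array([[12, 12], [-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)

# Per-column mean and population standard deviation (ddof=0).
mean = data.mean(axis=0)
std = data.std(axis=0)

# z-score: centre each column on 0 and scale it to unit variance.
standardized = (data - mean) / std
print(standardized)
```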

## Label Encoding 
 
Label encoding converts a categorical column into numbers by assigning each distinct category an integer code.

In [24]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np

df=pd.read_csv('iris.csv')
df['variety'].unique()



array(['Setosa', 'Versicolor', 'Virginica'], dtype=object)

In [33]:
df['label_encoder'] = LabelEncoder().fit_transform(df['variety'])

df['label_encoder'].unique()

array([0, 1, 2], dtype=int64)

In [39]:
print(df['variety'].tail())
print(df['label_encoder'].tail())

145    Virginica
146    Virginica
147    Virginica
148    Virginica
149    Virginica
Name: variety, dtype: object
145    2
146    2
147    2
148    2
149    2
Name: label_encoder, dtype: int32
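
To see where the codes 0, 1 and 2 come from, the mapping can be rebuilt by hand: `LabelEncoder` sorts the distinct labels and numbers them from 0. The short list `labels` below is a made-up stand-in for the `variety` column, and the sketch uses plain Python rather than scikit-learn.

```python
# Hypothetical sample of the 'variety' column.
labels = ["Setosa", "Versicolor", "Virginica", "Setosa", "Virginica"]

# Sort the distinct labels, then number them 0, 1, 2, ...
classes = sorted(set(labels))
class_to_code = {name: code for code, name in enumerate(classes)}

# Encode every label, then decode again to confirm the mapping is reversible.
encoded = [class_to_code[name] for name in labels]
decoded = [classes[code] for code in encoded]

print(class_to_code)
print(encoded)
```

The integer codes carry an order (0 < 1 < 2) that the categories themselves do not have, which can mislead a model when the codes are used as an input feature.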


In [43]:
df['variety'].value_counts()

Versicolor    50
Setosa        50
Virginica     50
Name: variety, dtype: int64

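## One-Hot Encoding

One-hot encoding creates one binary indicator column per category instead of a single integer code.

In [42]:
one_hot_encoding = pd.get_dummies(df, columns=['variety'])
one_hot_encoding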

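| | sepal.length | sepal.width | petal.length | petal.width | species | label_encoder | variety_Setosa | variety_Versicolor | variety_Virginica |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | 0 | 1 | 0 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | 0 | 1 | 0 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | 0 | 1 | 0 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | 0 | 1 | 0 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 | 2 | 0 | 0 | 1 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 | 2 | 0 | 0 | 1 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 | 2 | 0 | 0 | 1 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 | 2 | 0 | 0 | 1 |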


In [49]:
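# Handling the dummy variable trap: the three indicator columns always sum to 1,
# so one of them is redundant and can be dropped.
# drop() returns a new DataFrame, so the result is assigned back.
one_hot_encoding = one_hot_encoding.drop('variety_Virginica', axis=1)
one_hot_encoding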

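| | sepal.length | sepal.width | petal.length | petal.width | species | label_encoder | variety_Setosa | variety_Versicolor |
|---|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | 0 | 1 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | 0 | 1 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | 0 | 1 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | 0 | 1 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 | 2 | 0 | 0 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 | 2 | 0 | 0 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 | 2 | 0 | 0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 | 2 | 0 | 0 |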
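
As an alternative to dropping a column by hand, `pd.get_dummies` accepts `drop_first=True`, which leaves out the indicator for the first category. The sketch below uses a small made-up frame named `toy` rather than the iris data. Note that it drops `variety_Setosa` (the first category), not `variety_Virginica` as in the cell above. Either choice removes the redundancy.

```python
import pandas as pd

# Hypothetical miniature version of the 'variety' column.
toy = pd.DataFrame({"variety": ["Setosa", "Versicolor", "Virginica", "Setosa"]})

# One indicator column per category.
full = pd.get_dummies(toy, columns=["variety"])

# drop_first=True omits the first category, avoiding the dummy variable trap.
reduced = pd.get_dummies(toy, columns=["variety"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```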


In [32]:
# Import label encoder
from sklearn.preprocessing import LabelEncoder

# The label_encoder object learns the distinct word labels
# and maps each one to an integer.
label_encoder = LabelEncoder()

# Encode the labels of column 'variety' into a new column 'species'.
df['species'] = label_encoder.fit_transform(df['variety'])

df['species'].unique()


array([0, 1, 2], dtype=int64)