<p style="font-family:Verdana; font-size: 22px; color: cyan"> ML | Feature Engineering: Scaling, Normalization, and Standardization</p>

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Without feature scaling, many machine learning algorithms treat features
# with larger numeric values as more important than features with smaller
# values, regardless of the units in which the features are measured.
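
To make the point concrete, here is a minimal sketch using the `LotArea` and `MSSubClass` values of the first two rows shown later in this notebook. It assumes a distance-based model that compares rows with Euclidean distance; the variable names are illustrative only.

```python
import numpy as np

# First two rows of the notebook's data: (LotArea, MSSubClass).
a = np.array([8450.0, 60.0])
b = np.array([9600.0, 20.0])

# Squared difference per feature; their sum is the squared Euclidean distance.
sq_diff = (a - b) ** 2

# Fraction of the squared distance contributed by each feature.
share = sq_diff / sq_diff.sum()
print(share)
```

Almost all of the distance comes from `LotArea`, simply because its values are numerically larger.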

<p style="font-family:Verdana; font-size: 20px; color: orange"> 1. Absolute Maximum Scaling</p>

In [20]:
# Absolute maximum scaling takes two steps:
#   1. find the maximum absolute value of each column;
#   2. divide every entry of the column by that value.
# Afterwards, each entry of the column lies in the range -1 to 1.
import pandas as pd
df = pd.read_csv('../data/sampleFile.csv')
df.head()

,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [30]:
df = df.loc[:, ['LotArea', 'MSSubClass']]

In [31]:
# Preview the two selected columns (use df.info() for dtypes and null counts)
print(df.head())

   LotArea  MSSubClass
0     8450          60
1     9600          20
2    11250          60
3     9550          70
4    14260          60


In [32]:
import numpy as np
# Largest absolute value of each column (not one global maximum)
max_vals = df.abs().max()
max_vals

LotArea       215245
MSSubClass       190
dtype: int64

In [33]:
# Divide each column by its own absolute maximum
print(df / max_vals)

       LotArea  MSSubClass
0     0.039258    0.315789
1     0.044600    0.105263
2     0.052266    0.315789
3     0.044368    0.368421
4     0.066250    0.315789
...        ...         ...
1455  0.036781    0.315789
1456  0.061209    0.105263
1457  0.042008    0.368421
1458  0.045144    0.105263
1459  0.046166    0.105263

[1460 rows x 2 columns]
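
Absolute maximum scaling divides each column by that column's largest absolute value. As a rough sketch, the snippet below applies this to the five rows printed by `df.head()` and compares the result with scikit-learn's `MaxAbsScaler`. The array `X` is a hand-copied sample, so the numbers differ from those of the full dataset.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Five sample rows copied from df.head(): columns are LotArea, MSSubClass.
X = np.array([[8450, 60],
              [9600, 20],
              [11250, 60],
              [9550, 70],
              [14260, 60]], dtype=float)

# Manual version: divide each column by its own maximum absolute value.
manual = X / np.abs(X).max(axis=0)

# scikit-learn equivalent.
sk = MaxAbsScaler().fit_transform(X)

print(manual)
print(np.allclose(manual, sk))
```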


<p style="font-family:Verdana; font-size: 20px; color: orange"> 2. Min-Max Scaling</p>

In [None]:
# Min-max scaling subtracts the column minimum from each entry and divides
# by the column range (maximum - minimum), so the scaled values lie between 0 and 1.
# Because it relies on the minimum and the maximum, it is sensitive to outliers.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, 
                         columns=df.columns)
scaled_df.head()

,LotArea,MSSubClass
0,0.03342,0.235294
1,0.038795,0.0
2,0.046507,0.235294
3,0.038561,0.294118
4,0.060576,0.235294
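
The outlier sensitivity of min-max scaling can be illustrated with a small sketch. It uses the five `LotArea` values from `df.head()` and then appends 215245, the largest value reported earlier in the notebook. The helper `min_max` is a hypothetical name introduced here, not part of scikit-learn.

```python
import numpy as np

def min_max(x):
    """Rescale a 1-D array to [0, 1] using its own minimum and maximum."""
    return (x - x.min()) / (x.max() - x.min())

# LotArea values of the first five rows.
lot_area = np.array([8450, 9600, 11250, 9550, 14260], dtype=float)
scaled = min_max(lot_area)

# Same values plus one very large lot.
with_outlier = np.append(lot_area, 215245.0)
scaled_with_outlier = min_max(with_outlier)

print(scaled.round(3))                   # spread over the whole [0, 1] range
print(scaled_with_outlier[:5].round(3))  # squeezed close to 0 by the outlier
```

This is consistent with the small scaled `LotArea` values in the output above: a single very large lot stretches the range, so typical lots end up near 0.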


<p style="font-family:Verdana; font-size: 20px; color: orange"> 3. Normalization</p>

In [35]:
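# Normalizer rescales each row (sample), not each column, so that the row
# has unit length (L2 norm by default).
from sklearn.preprocessing import Normalizer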

scaler = Normalizer()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data,
                         columns=df.columns)
print(scaled_df.head())

    LotArea  MSSubClass
0  0.999975    0.007100
1  0.999998    0.002083
2  0.999986    0.005333
3  0.999973    0.007330
4  0.999991    0.004208
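
Because `Normalizer` works row by row, the printed values can be checked by hand: each row is divided by its own Euclidean (L2) norm. The sketch below does this with NumPy for the five rows from `df.head()`; `X` is a hand-copied sample.

```python
import numpy as np

# Five sample rows copied from df.head(): columns are LotArea, MSSubClass.
X = np.array([[8450, 60],
              [9600, 20],
              [11250, 60],
              [9550, 70],
              [14260, 60]], dtype=float)

# L2 norm of each row, kept as a column so it broadcasts across the row.
row_norms = np.linalg.norm(X, axis=1, keepdims=True)

# Every row now has length 1.
normalized = X / row_norms
print(np.round(normalized, 6))
```

Since `LotArea` is far larger than `MSSubClass` in every row, the first component is close to 1 and the second close to 0.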


<p style="font-family:Verdana; font-size: 20px; color: orange"> 4. Standardization</p>

In [36]:
# Standardization is based on the mean and the standard deviation of each column:
# subtract the column mean, then divide by the column standard deviation.
# The result has a mean of 0 and a standard deviation of 1. The shape of the
# distribution is unchanged, so skewed data does not become normally distributed.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data,
                         columns=df.columns)
print(scaled_df.head())

    LotArea  MSSubClass
0 -0.207142    0.073375
1 -0.091886   -0.872563
2  0.073480    0.073375
3 -0.096897    0.309859
4  0.375148    0.073375
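
As a sketch of what standardization computes, the snippet below applies the z-score formula by hand to the five `LotArea` values from `df.head()`. It uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses. The helper `skewness` is a hypothetical name introduced here to show that this linear rescaling leaves the shape of the distribution unchanged.

```python
import numpy as np

def skewness(v):
    """Third standardized moment: a simple measure of asymmetry."""
    return np.mean(((v - v.mean()) / v.std()) ** 3)

# LotArea values of the first five rows.
lot_area = np.array([8450, 9600, 11250, 9550, 14260], dtype=float)

# z-score: subtract the mean, divide by the standard deviation (ddof=0).
z = (lot_area - lot_area.mean()) / lot_area.std()

print(z.mean(), z.std())                 # approximately 0 and 1
print(skewness(lot_area), skewness(z))   # same skewness before and after
```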


<p style="font-family:Verdana; font-size: 20px; color: orange"> 5. Robust Scaling</p>

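* This method centers and scales each column with two outlier-resistant statistics: it subtracts the median and divides by the inter-quartile range (the 75th percentile minus the 25th percentile).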

> Median
>> Inter-Quartile Range

In [37]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data,
                         columns=df.columns)
print(scaled_df.head())

    LotArea  MSSubClass
0 -0.254076         0.2
1  0.030015        -0.6
2  0.437624         0.2
3  0.017663         0.4
4  1.181201         0.2
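
The following sketch applies the median and inter-quartile range formula by hand to the five `LotArea` values from `df.head()`. It then appends 215245, the largest value reported earlier in the notebook, to compare how far the median and the mean move. The numbers differ from the output above because only a small sample is used.

```python
import numpy as np

# LotArea values of the first five rows.
lot_area = np.array([8450, 9600, 11250, 9550, 14260], dtype=float)

# Robust scaling: subtract the median, divide by the inter-quartile range.
median = np.median(lot_area)
q1, q3 = np.percentile(lot_area, [25, 75])
robust = (lot_area - median) / (q3 - q1)
print(robust.round(3))

# Add one very large lot and compare the shift in median versus mean.
with_outlier = np.append(lot_area, 215245.0)
median_shift = abs(np.median(with_outlier) - median)
mean_shift = abs(with_outlier.mean() - lot_area.mean())
print(median_shift, mean_shift)
```

The median moves far less than the mean, which is why this scaler is less affected by extreme values than standardization or min-max scaling.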


> In conclusion
>> Scaling, normalization, and standardization are feature engineering techniques that put features on comparable scales, so that machine learning models sensitive to feature magnitude are not dominated by large-valued features.