Feature Engineering is the process of using domain knowledge to extract features from raw data. These features can be used to improve the performance of ML algorithms.

1. Feature Transformation - Missing Value Imputation, Handling Categorical Features, Outlier Detection, Feature Scaling

2. Feature Construction

3. Feature Selection

4. Feature Extraction

In [4]:
# Feature scaling is a technique to bring the independent features present in the data onto a comparable, fixed range.
# But why do we need feature scaling?
# Let x be the Salary and y be the Age.
# e.g. x1 & y1 are in row 2, x2 & y2 are in row 9.
# (x2-x1)^2 = (83000-48000)^2 = 1225000000
# (y2-y1)^2 = (50-27)^2 = 529
# Here x (Salary) dominates the squared distance, so a distance-based result is driven almost entirely by Salary and Age is effectively ignored.

# Types of Feature Scaling - 
# 1) Standardization - also called Z-score normalization
# 2) Normalization
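
The arithmetic in the comment above can be checked directly. This is a minimal sketch using only the four numbers quoted there; the variable names are made up for illustration.

```python
import math

# Values taken from the comment above: two rows compared on salary and age.
salary_1, salary_2 = 48000, 83000
age_1, age_2 = 27, 50

salary_sq = (salary_2 - salary_1) ** 2   # squared salary difference
age_sq = (age_2 - age_1) ** 2            # squared age difference

# Euclidean distance between the two rows on the raw (unscaled) features.
distance = math.sqrt(salary_sq + age_sq)

# Fraction of the squared distance that comes from salary alone.
salary_share = salary_sq / (salary_sq + age_sq)

print(f"squared salary difference: {salary_sq}")
print(f"squared age difference   : {age_sq}")
print(f"Euclidean distance       : {distance:.2f}")
print(f"share due to salary      : {salary_share:.8f}")
```

Almost all of the distance comes from the salary column, which is why the age difference has practically no say on raw features.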

In [5]:
# Let's understand the maths of standardization.
# e.g. Age (xi) => x1, x2, x3, x4, ..., xn
# so, xi_standardized = (xi - mean(x)) / sigma, where sigma is the standard deviation (S.D.) of x
# The interesting part is that the new xi's will have mean = 0 and standard deviation (S.D.) = 1.
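
The formula can be tried by hand with NumPy. This is a small sketch, assuming the five Age values that `df.head()` prints further down; it uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

# The five ages shown by df.head() in this notebook, used here only as a tiny sample.
ages = np.array([19, 35, 26, 27, 19], dtype=float)

mean = ages.mean()
sigma = ages.std()   # population S.D. (ddof=0)

# Standardization: subtract the mean, divide by the standard deviation.
ages_standardized = (ages - mean) / sigma

print(ages_standardized)
print("mean:", round(float(ages_standardized.mean()), 10))
print("std :", round(float(ages_standardized.std()), 10))
```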

In [6]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns

In [7]:
df = pd.read_csv("Social_Network_Ads.csv")

In [8]:
df.head()

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19,19000,0
1,15810944,Male,35,20000,0
2,15668575,Female,26,43000,0
3,15603246,Female,27,57000,0
4,15804002,Male,19,76000,0


In [9]:
df = df.iloc[:, 2:]

In [10]:
df.sample(5)

Unnamed: 0,Age,EstimatedSalary,Purchased
396,51,23000,1
338,38,55000,0
81,39,42000,0
384,57,33000,1
73,33,113000,0


In [11]:
x = df.drop('Purchased', axis=1)
y = df['Purchased']

In [13]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

In [16]:
x_train.shape

(280, 2)

In [17]:
x_test.shape

(120, 2)

In [18]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
ss.fit(x_train)
x_train_scaled = ss.transform(x_train)
x_test_scaled = ss.transform(x_test)

In [19]:
ss.mean_

array([3.75750000e+01, 7.05892857e+04])
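
The fitted scaler stores the per-column statistics it learned from the training data, and `transform` reuses them on any later data, including the test set. The sketch below checks this on a small made-up array (the `*_demo` names and values are assumptions, not the notebook's data).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up training and test sets (columns: Age, EstimatedSalary).
x_train_demo = np.array([[19, 19000], [35, 20000], [26, 43000], [27, 57000]], dtype=float)
x_test_demo = np.array([[30, 40000], [45, 90000]], dtype=float)

# fit() learns the statistics from the training set only.
scaler = StandardScaler().fit(x_train_demo)

# mean_ and scale_ hold the per-column mean and standard deviation learned in fit().
manual = (x_test_demo - scaler.mean_) / scaler.scale_

# The manual computation should agree with transform() on the test set.
print(np.allclose(manual, scaler.transform(x_test_demo)))
```

Fitting on the training set and only transforming the test set, as cell In [18] does, keeps test-set information out of the learned statistics.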

In [20]:
x_train_scaled

array([[-0.84252154,  0.1301563 ],
       [ 0.04175763,  0.2777019 ],
       [ 0.72953032, -1.31579061],
       [ 1.61380949,  1.10395728],
       [ 0.82778356, -1.40431797],
       [-1.43204099, -1.25677236],
       [-0.05649561,  0.1301563 ],
       [ 0.43477059, -0.16493491],
       [-0.2530021 ,  0.01211982],
       [ 1.31904976,  2.22530386],
       [ 0.14001087,  0.74984783],
       [-1.33378775,  0.54328399],
       [ 2.00682245,  0.72033871],
       [-1.23553451, -1.43382709],
       [ 0.33651735, -0.34198963],
       [-0.94077478,  0.54328399],
       [ 0.43477059,  0.2777019 ],
       [ 0.43477059,  1.10395728],
       [ 0.82778356,  0.74984783],
       [ 0.9260368 ,  1.25150288],
       [-0.44950858, -1.25677236],
       [-1.82505395, -1.34529973],
       [ 1.12254328,  0.54328399],
       [-0.64601506, -1.64039093],
       [-0.7442683 ,  0.24819278],
       [ 1.02429004,  2.07775825],
       [-0.54776182,  1.36953936],
       [-0.05649561,  0.01211982],
       [-1.9233072 ,

In [21]:
x_train_scaled = pd.DataFrame(x_train_scaled, columns=x_train.columns)
x_test_scaled = pd.DataFrame(x_test_scaled, columns=x_test.columns)

In [22]:
x_train_scaled.head()

Unnamed: 0,Age,EstimatedSalary
0,-0.842522,0.130156
1,0.041758,0.277702
2,0.72953,-1.315791
3,1.613809,1.103957
4,0.827784,-1.404318


In [23]:
np.round(x_train.describe(), 2)

Unnamed: 0,Age,EstimatedSalary
count,280.0,280.0
mean,37.58,70589.29
std,10.2,33948.5
min,18.0,15000.0
25%,30.0,44000.0
50%,37.0,71000.0
75%,45.0,88000.0
max,60.0,150000.0


In [24]:
np.round(x_train_scaled.describe(), 2)

Unnamed: 0,Age,EstimatedSalary
count,280.0,280.0
mean,-0.0,0.0
std,1.0,1.0
min,-1.92,-1.64
25%,-0.74,-0.78
50%,-0.06,0.01
75%,0.73,0.51
max,2.2,2.34


Effect of Scaling

In [38]:
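# Data becomes mean-centred (mean = 0) with standard deviation = 1.
# After scaling, the shape of the distribution remains the same; only its location and scale change.
# The impact of outliers remains the same: standardization does not remove outliers or reduce their influence.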
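
These points can be checked numerically. The sketch below uses made-up salaries with one extreme value and a hand-written skewness helper (both are assumptions for illustration); skewness is used here as a simple summary of the distribution's shape.

```python
import numpy as np

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    z = (values - values.mean()) / values.std()
    return float((z ** 3).mean())

# Made-up salaries with one extreme value.
salaries = np.array([20000, 25000, 30000, 32000, 35000, 40000, 500000], dtype=float)

# Standardize by hand: subtract the mean, divide by the standard deviation.
scaled = (salaries - salaries.mean()) / salaries.std()

print("skewness before:", round(skewness(salaries), 4))
print("skewness after :", round(skewness(scaled), 4))
print("index of largest value before/after:", int(salaries.argmax()), int(scaled.argmax()))
```

The skewness is unchanged and the extreme value is still the extreme value, so standardization alone is not a treatment for outliers.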

Why is Scaling Important?

In [26]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr_scaled = LogisticRegression()

In [32]:
print(x_train)
print(x_test_scaled)

     Age  EstimatedSalary
157   29            75000
109   38            80000
17    45            26000
347   54           108000
24    46            23000
..   ...              ...
71    24            27000
106   26            35000
270   43           133000
348   39            77000
102   32            86000

[280 rows x 2 columns]
          Age  EstimatedSalary
0    0.827784        -1.433827
1    2.105076         0.513775
2   -0.940775        -0.784626
3    1.024290         0.749848
4   -0.842522        -1.256772
..        ...              ...
115 -1.039028        -1.492845
116 -1.137281        -1.581373
117 -0.056496         0.661320
118  0.434771        -0.489535
119 -0.253002        -0.282971

[120 rows x 2 columns]


In [33]:
lr.fit(x_train, y_train)


In [34]:
lr_scaled.fit(x_train_scaled, y_train)

In [35]:
y_pred = lr.predict(x_test)
y_pred_scaled = lr_scaled.predict(x_test_scaled)

In [36]:
from sklearn.metrics import accuracy_score

In [37]:
print("Actual",accuracy_score(y_test,y_pred))
print("Scaled",accuracy_score(y_test,y_pred_scaled))

Actual 0.85
Scaled 0.85
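
In this run the two logistic regression scores are identical, so the effect of scaling is easier to see with a distance computation. The sketch below uses the rounded training-set mean and std reported by `x_train.describe()` above; the three points and the helper names are made up for illustration.

```python
import numpy as np

# Rounded training-set statistics from x_train.describe() above (Age, EstimatedSalary).
mean = np.array([37.58, 70589.29])
std = np.array([10.2, 33948.5])

# Made-up points (Age, EstimatedSalary), chosen to make the effect visible.
query = np.array([30.0, 50000.0])
a = np.array([31.0, 52000.0])   # similar age, salary differs by 2000
b = np.array([60.0, 50500.0])   # age differs by 30 years, salary by 500

def nearest(q, candidates):
    """Return the index of the candidate closest to q in Euclidean distance."""
    distances = [np.linalg.norm(q - c) for c in candidates]
    return int(np.argmin(distances))

def scale(p):
    """Standardize a point with the training-set statistics."""
    return (p - mean) / std

raw_winner = nearest(query, [a, b])
scaled_winner = nearest(scale(query), [scale(a), scale(b)])

print("nearest on raw features   :", "ab"[raw_winner])
print("nearest on scaled features:", "ab"[scaled_winner])
```

On raw features the 30-year age gap is outweighed by a salary gap of a few thousand, so the nearest neighbour changes once both columns are on the same scale.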


When to Use Standardization?

In [40]:
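# 1) K-Means: uses the Euclidean distance measure.
# 2) K-Nearest Neighbours: uses distance measures.
# 3) PCA: looks for the directions of maximum variance, so a feature on a larger scale would dominate.
# 4) Artificial neural networks: trained with gradient descent.
# 5) Gradient descent: algorithms optimized with it generally converge faster on scaled features.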
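
For a distance-based model such as K-Nearest Neighbours, one common way to apply standardization is to put the scaler and the model in a single scikit-learn pipeline. This is a minimal sketch on synthetic data: the generated ages and salaries, the label rule, and `n_neighbors=5` are all assumptions for illustration, not the notebook's dataset or results.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Age / EstimatedSalary data (not the real CSV).
rng = np.random.default_rng(42)
age = rng.integers(18, 61, size=400)
salary = rng.integers(15000, 150001, size=400)
X = np.column_stack([age, salary]).astype(float)

# Hypothetical label rule, used only so the example has something to learn.
y = ((age > 40) | (salary > 100000)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# The pipeline fits the scaler on the training data only, then passes scaled data to KNN.
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_scaled.fit(X_train, y_train)

print("test accuracy with scaling:", knn_scaled.score(X_test, y_test))
```

Keeping the scaler inside the pipeline means the same training-set statistics are applied automatically at prediction time.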