## Feature Engineering

### Label Encoding
#### Sklearn implementation of Label Encoding

In [1]:
# importing libraries

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

In [2]:
# label Encoding

targets = np.array(['Sun','Sun','Moon','Earth','Monn','Venus'])
labelenc = LabelEncoder()
labelenc.fit(targets)
targets_trans = labelenc.transform(targets)
print('--- targets ---')
print(targets)
print('--- targets trans ---')
print(targets_trans)

--- targets ---
['Sun' 'Sun' 'Moon' 'Earth' 'Monn' 'Venus']
--- targets trans ---
[3 3 2 0 1 4]
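
The integer codes above follow the sorted order of the unique labels. As a minimal sketch of how to inspect that mapping and reverse it, the snippet below uses the fitted encoder's `classes_` attribute and `inverse_transform`; the names `mapping`, `codes`, and `restored` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

targets = np.array(['Sun', 'Sun', 'Moon', 'Earth', 'Monn', 'Venus'])
labelenc = LabelEncoder().fit(targets)

# classes_ holds the sorted unique labels; each label's position is its code
mapping = {str(label): code for code, label in enumerate(labelenc.classes_)}
print(mapping)

# inverse_transform maps integer codes back to the original labels
codes = labelenc.transform(targets)
restored = labelenc.inverse_transform(codes)
print(restored)
```

Keeping the fitted encoder around is what makes the round trip possible, so it is usually worth storing it alongside the model.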


### Label Encoding
#### Pandas implementation of Label Encoding

In [4]:
# importing pandas library

import pandas as pd

In [13]:
df = pd.DataFrame({'col1':['Sun','Sun','Moon','Earth','Monn','Venus']})

print('The original dataframe type')
print(df['col1'].dtype)
print('*'*30)

df['col1'] = df['col1'].astype('category')
print('New Datatype')
print(df['col1'].dtype)
print('*'*30)

df['cat_code'] = df['col1'].cat.codes
print('The new column')
print(df)

The original dataframe type
object
******************************
New Datatype
category
******************************
The new column
    col1  cat_code
0    Sun         3
1    Sun         3
2   Moon         2
3  Earth         0
4   Monn         1
5  Venus         4
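
A related pandas option is `pd.factorize`, which by default assigns codes in order of first appearance rather than sorted order. The sketch below compares the two orderings and recovers the code-to-label lookup from the categorical dtype; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['Sun', 'Sun', 'Moon', 'Earth', 'Monn', 'Venus']})

# Codes in order of first appearance: Sun=0, Moon=1, Earth=2, ...
codes_appearance, uniques = pd.factorize(df['col1'])

# sort=True sorts the unique values first, which matches .cat.codes
codes_sorted, uniques_sorted = pd.factorize(df['col1'], sort=True)

# Recover the code -> label lookup from the categorical dtype
cat = df['col1'].astype('category')
lookup = dict(enumerate(cat.cat.categories))

print(codes_appearance)
print(codes_sorted)
print(lookup)
```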


### One-Hot Encoding
#### Sklearn implementation of One-Hot Encoding

In [14]:
# importing libraries

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [20]:
targets = np.array(['Sun','Sun','Moon','Earth','Monn','Venus'])

labelEnc = LabelEncoder()
new_targets = labelEnc.fit_transform(targets)

oneHot = OneHotEncoder()
oneHot.fit(new_targets.reshape(-1,1))
targets_trans = oneHot.transform(new_targets.reshape(-1,1))
print('-- original --')
print(new_targets)
print('-- transformed --')
print(targets_trans.toarray())

-- original --
[3 3 2 0 1 4]
-- transformed --
[[0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0.]
 [1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1.]]
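
In recent scikit-learn versions `OneHotEncoder` also accepts string categories directly, so the intermediate `LabelEncoder` step is optional. The sketch below assumes such a version and additionally sets `handle_unknown='ignore'`, which encodes a category not seen during fitting as an all-zero row; `'Mars'` is a made-up unseen value used only for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

targets = np.array(['Sun', 'Sun', 'Moon', 'Earth', 'Monn', 'Venus'])

# Fit directly on the string column (reshaped to one feature per row)
onehot = OneHotEncoder(handle_unknown='ignore')
encoded = onehot.fit_transform(targets.reshape(-1, 1)).toarray()
print(onehot.categories_[0])
print(encoded)

# A category unseen during fit becomes a row of zeros instead of an error
unseen = onehot.transform(np.array([['Mars']])).toarray()
print(unseen)
```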


### One-Hot Encoding
#### Pandas implementation of One-Hot Encoding
pandas provides a function called `get_dummies` for one-hot encoding.

In [22]:
df_new = pd.get_dummies(df,columns=['col1'],prefix='Planets')
df_new

   cat_code  Planets_Earth  Planets_Monn  Planets_Moon  Planets_Sun  Planets_Venus
0         3              0             0             0            1              0
1         3              0             0             0            1              0
2         2              0             0             1            0              0
3         0              1             0             0            0              0
4         1              0             1             0            0              0
5         4              0             0             0            0              1
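
Newer pandas releases may return boolean dummy columns by default, so passing `dtype=int` is one way to get 0/1 values consistently. The sketch below also shows `drop_first=True`, which drops the first level to avoid a redundant column; the names `dummies` and `reduced` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['Sun', 'Sun', 'Moon', 'Earth', 'Monn', 'Venus']})

# dtype=int forces 0/1 output regardless of the pandas default
dummies = pd.get_dummies(df, columns=['col1'], prefix='Planets', dtype=int)
print(dummies)

# drop_first=True removes the first level (Planets_Earth here),
# leaving k-1 columns for k categories
reduced = pd.get_dummies(df, columns=['col1'], prefix='Planets',
                         dtype=int, drop_first=True)
print(reduced)
```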


### Count Encoding
Count encoding replaces each category with the number of times it appears in the dataset. For example, in the column col1 below, the value Sun appears 2 times, so the new feature for Sun is 2.

In [25]:
df = pd.DataFrame({'col1':['Sun','Sun','Moon','Earth','Monn','Venus']})

print('--- The original Dataset ---')
print(df)

print('--- Count Encoding ---')
df['count_encode'] = df['col1'].map(df['col1'].value_counts().to_dict())
print(df)

--- The original Dataset ---
    col1
0    Sun
1    Sun
2   Moon
3  Earth
4   Monn
5  Venus
--- Count Encoding ---
    col1  count_encode
0    Sun             2
1    Sun             2
2   Moon             1
3  Earth             1
4   Monn             1
5  Venus             1
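
In practice the counts are usually learned on training data and then applied to new data. The sketch below assumes a hypothetical `test` frame containing an unseen category (`'Mars'`), fills its count with 0, and also shows a relative-frequency variant via `value_counts(normalize=True)`.

```python
import pandas as pd

train = pd.DataFrame({'col1': ['Sun', 'Sun', 'Moon', 'Earth', 'Monn', 'Venus']})
test = pd.DataFrame({'col1': ['Sun', 'Mars']})  # hypothetical new data

# Learn counts and relative frequencies from the training data only
counts = train['col1'].value_counts()
freqs = train['col1'].value_counts(normalize=True)

train['count_encode'] = train['col1'].map(counts)
train['freq_encode'] = train['col1'].map(freqs)

# Categories unseen in training map to NaN; treat them as count 0
test['count_encode'] = test['col1'].map(counts).fillna(0).astype(int)

print(train)
print(test)
```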


### Mean Encoding
Mean encoding (also called target encoding) replaces each category with the mean of the target value over the rows in that category. It is most common in classification tasks, particularly binary classification, where the mean is the fraction of positive examples in the category. The same idea applies to any numerical column, not only the target; the example below averages Price per category.

In [34]:
df = pd.DataFrame({'col1':['Sun','Sun','Moon','Earth','Monn','Venus'],
                  'Price':[20,30,30,35,40,55]})

print('--- Original Dataframe ---')
print(df)
print('*'*30)

print('--- Mean Encoding ---')
d = df.groupby(['col1'])['Price'].mean().to_dict()
df['col_mean'] = df['col1'].map(d)
print(df)

--- Original Dataframe ---
    col1  Price
0    Sun     20
1    Sun     30
2   Moon     30
3  Earth     35
4   Monn     40
5  Venus     55
******************************
--- Mean Encoding ---
    col1  Price  col_mean
0    Sun     20      25.0
1    Sun     30      25.0
2   Moon     30      30.0
3  Earth     35      35.0
4   Monn     40      40.0
5  Venus     55      55.0
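
Categories with only one or two rows get a mean that simply echoes those rows, which tends to overfit. One common remedy, sketched here as an assumption rather than something the original notebook does, is to shrink each category mean toward the global mean with a smoothing weight `m` (a hypothetical parameter, set to 2 for illustration).

```python
import pandas as pd

df = pd.DataFrame({'col1': ['Sun', 'Sun', 'Moon', 'Earth', 'Monn', 'Venus'],
                   'Price': [20, 30, 30, 35, 40, 55]})

m = 2  # hypothetical smoothing weight: larger m pulls harder toward the global mean
global_mean = df['Price'].mean()

# Per-category mean and row count
stats = df.groupby('col1')['Price'].agg(['mean', 'count'])

# Weighted average of the category mean and the global mean
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
df['col_mean_smooth'] = df['col1'].map(smoothed)
print(df)
```

With `m = 0` this reduces to the plain mean encoding shown above.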


### Weight of Evidence Encoding
Weight of evidence (WOE) is a technique for encoding categorical features in binary classification tasks. It measures how strongly a category supports one class over the other, expressed as the logarithm of a ratio. The example below uses the log odds of the positive class within each category, log(p1/p0).

In [44]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'col1':['Moon','Sun','Moon','Sun','Sun'],
                  'target':[1,1,0,1,0]})

df['target'] = df['target'].astype('float64')
print('--- The original Dataset ---')
print(df)
print('*'*30)

d = df.groupby(['col1'])['target'].mean().to_dict()
df['p1'] = df['col1'].map(d)
df['p0'] = 1 - df['p1']
df['woe'] = np.log(df['p1']/df['p0'])
print('--- The woe dataset ---')
print(df)

--- The original Dataset ---
   col1  target
0  Moon     1.0
1   Sun     1.0
2  Moon     0.0
3   Sun     1.0
4   Sun     0.0
******************************
--- The woe dataset ---
   col1  target        p1        p0       woe
0  Moon     1.0  0.500000  0.500000  0.000000
1   Sun     1.0  0.666667  0.333333  0.693147
2  Moon     0.0  0.500000  0.500000  0.000000
3   Sun     1.0  0.666667  0.333333  0.693147
4   Sun     0.0  0.666667  0.333333  0.693147
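
The per-category log odds above differ from the more conventional WOE definition only by a constant. As a sketch under that conventional definition, WOE is the log of the share of all positives that fall in a category divided by the share of all negatives that fall in it; the column names `events`, `non_events`, and `total` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['Moon', 'Sun', 'Moon', 'Sun', 'Sun'],
                   'target': [1, 1, 0, 1, 0]})

# Count positives (events) and negatives (non-events) per category
grouped = df.groupby('col1')['target'].agg(events='sum', total='count')
grouped['non_events'] = grouped['total'] - grouped['events']

# Share of all events / non-events that fall in each category
dist_events = grouped['events'] / grouped['events'].sum()
dist_non_events = grouped['non_events'] / grouped['non_events'].sum()

# Note: a category with zero events or zero non-events gives +/-inf here;
# real data usually needs a small additive adjustment to avoid that
grouped['woe'] = np.log(dist_events / dist_non_events)
df['woe'] = df['col1'].map(grouped['woe'])
print(df)
```

On this data the result equals the log odds minus log(3/2), the log of the overall ratio of positives to negatives.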


### Feature Interaction
Feature interaction creates new features by combining two or more existing features. For example, if you have two categorical features, you can create a new feature by joining their values together.

In [47]:
import pandas as pd

df = pd.DataFrame({'feat1':['a','b','c','d'],
                  'feat2':['apple','ball','cat','dog']})
print('--- original dataframe ---')
print(df)
df['feat1_feat2'] = df['feat1'].astype('str') + '_' + df['feat2'].astype('str')
print('--- feature interaction ---')
print(df)

--- original dataframe ---
  feat1  feat2
0     a  apple
1     b   ball
2     c    cat
3     d    dog
--- feature interaction ---
  feat1  feat2 feat1_feat2
0     a  apple     a_apple
1     b   ball      b_ball
2     c    cat       c_cat
3     d    dog       d_dog
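
The same idea extends to any number of columns, and the joined string usually still needs a numeric encoding before it reaches a model. The sketch below joins a list of columns row by row and then label-encodes the result with category codes; `cols`, `interaction`, and `interaction_code` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'feat1': ['a', 'b', 'c', 'd'],
                   'feat2': ['apple', 'ball', 'cat', 'dog']})

# Join an arbitrary list of columns row by row with an underscore
cols = ['feat1', 'feat2']
df['interaction'] = df[cols].astype(str).agg('_'.join, axis=1)

# Turn the combined category into integer codes for modelling
df['interaction_code'] = df['interaction'].astype('category').cat.codes
print(df)
```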
