# Introduction to Feature Engineering
<hr style="border:2px solid black">

## 1. Introduction

### Feature engineering: what & why?

- "art" of formulating useful features from existing data 
- transforms data to better relate to the underlying target variable
- improves the performance of an ML model
- follows naturally from domain knowledge
- helps incorporate non-numeric features into an ML model

### Feature engineering techniques

| technique | usefulness |
|:---------:|:----------:|
| `Imputation` | fills in missing values in the data |
| `Discretization` | groups the values of a feature into bins in some logical fashion |
| `Categorical Encoding` | encodes categorical features as numerical values |
| `Feature Splitting` | splits a feature into parts |
| `Outlier Handling` | takes care of unusually high/low values in the dataset |
| `Log Transformation` | deals with ill-behaved (skewed or heteroscedastic) data |
| `Feature Scaling` | handles the sensitivity of ML algorithms to the scale of input values |
| `RBF Transformation` | uses a continuous distribution to encode ordinal features |
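
Three of the techniques in the table (discretization, log transformation and feature splitting) are not used in the penguin example, so a minimal sketch may help. The toy data, the bin edges and the new column names are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# toy data, invented purely for illustration
toy = pd.DataFrame({
    "name": ["Doe, Mr. John", "Roe, Mrs. Jane", "Poe, Miss. Ann"],
    "age": [22, 38, 71],
    "income": [1_000.0, 10_000.0, 1_000_000.0],
})

# discretization: group ages into labelled bins (0, 30], (30, 60], (60, 120]
toy["age_group"] = pd.cut(
    toy["age"], bins=[0, 30, 60, 120], labels=["young", "middle", "senior"]
)

# log transformation: log1p(x) = log(1 + x) compresses the long right tail
toy["log_income"] = np.log1p(toy["income"])

# feature splitting: keep only the part of the name before the comma
toy["surname"] = toy["name"].str.split(",").str[0]

print(toy)
```

The bin edges and labels are a modelling choice; in practice they would come from domain knowledge or from the distribution of the training data.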

<hr style="border:2px solid black">

## 2. Example: Penguin Data

**load packages**

In [27]:
# data analysis stack
import numpy as np
import pandas as pd

# machine-learning stack
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# miscellaneous
import warnings
warnings.filterwarnings("ignore")

**read data**

In [28]:
df = pd.read_csv('../data/penguins_unclean.csv')
df.head()

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
0,Adelie,39.1,18.7,181.0,3750.0,MALE
1,Adelie,39.5,17.4,186.0,,FEMALE
2,Adelie,40.3,18.0,195.0,,FEMALE
3,Adelie,36.7,19.3,193.0,3450.0,FEMALE
4,Adelie,39.3,20.6,190.0,3650.0,MALE


### 2.1 Train-Test split

In [29]:
train,test = train_test_split(df, test_size=0.2, random_state=42)
train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)
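
The split above is purely random. As an optional refinement, `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both parts. The sketch below uses an invented toy frame with a hypothetical `label` column and is not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy frame with an imbalanced label (80 "a" rows, 20 "b" rows), invented for illustration
toy = pd.DataFrame({"value": range(100), "label": ["a"] * 80 + ["b"] * 20})

# stratify keeps the label proportions the same in the training and test parts
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["label"]
)

print(toy_train["label"].value_counts().to_dict())
print(toy_test["label"].value_counts().to_dict())
```

For the penguin data this would correspond to passing `stratify=df['Species']`, which matters most when one species is rare.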

### 2.2 Quick exploration

In [30]:
train.head()

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
0,Gentoo,40.9,13.7,214.0,4650.0,FEMALE
1,Adelie,37.3,17.8,191.0,,FEMALE
2,Gentoo,50.0,15.9,224.0,5350.0,MALE
3,Adelie,35.9,19.2,189.0,3800.0,FEMALE
4,Adelie,40.5,18.9,180.0,3950.0,MALE


In [31]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Species              266 non-null    object 
 1   Culmen Length (mm)   266 non-null    float64
 2   Culmen Depth (mm)    266 non-null    float64
 3   Flipper Length (mm)  266 non-null    float64
 4   Body Mass (g)        251 non-null    float64
 5   Sex                  266 non-null    object 
dtypes: float64(4), object(2)
memory usage: 12.6+ KB


### 2.3 Imputation

In [32]:
mean_weight = round(
    train.groupby(['Species','Sex'])['Body Mass (g)'].mean(),1
)
mean_weight

Species    Sex   
Adelie     FEMALE    3350.5
           MALE      4040.1
Chinstrap  FEMALE    3525.0
           MALE      3930.8
Gentoo     FEMALE    4684.9
           MALE      5465.4
Name: Body Mass (g), dtype: float64

In [33]:
train['Body Mass (g)'] = train.apply(
    # a missing body mass is replaced by the mean of the penguin's (species, sex) group
    lambda x: mean_weight[x['Species'], x['Sex']]
    if pd.isna(x['Body Mass (g)'])
    else x['Body Mass (g)'],
    axis=1
)
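
The row-wise `apply` above is easy to read but slow on large frames. A vectorized alternative is sketched below, assuming the same idea of group-mean imputation; the helper `impute_group_mean` and the toy data are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def impute_group_mean(frame, group_means, keys, column):
    """Return `column` with NaNs replaced by the mean of the row's group."""
    # look up the group mean for every row, aligned to the frame's index
    row_keys = pd.MultiIndex.from_frame(frame[keys])
    lookup = pd.Series(group_means.reindex(row_keys).to_numpy(), index=frame.index)
    # fillna only touches the missing entries
    return frame[column].fillna(lookup)

# toy training data, invented for illustration
toy_train = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"],
    "Sex": ["MALE", "MALE", "MALE", "FEMALE", "FEMALE"],
    "Body Mass (g)": [4000.0, np.nan, 4200.0, 4600.0, np.nan],
})
keys = ["Species", "Sex"]

# the group means are learned from the training data only
group_means = toy_train.groupby(keys)["Body Mass (g)"].mean()
toy_train["Body Mass (g)"] = impute_group_mean(
    toy_train, group_means, keys, "Body Mass (g)"
)
print(toy_train)
```

Because the group means are passed in as an argument, the same helper can be called on new data with the means computed on the training set.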

In [34]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Species              266 non-null    object 
 1   Culmen Length (mm)   266 non-null    float64
 2   Culmen Depth (mm)    266 non-null    float64
 3   Flipper Length (mm)  266 non-null    float64
 4   Body Mass (g)        266 non-null    float64
 5   Sex                  266 non-null    object 
dtypes: float64(4), object(2)
memory usage: 12.6+ KB


In [35]:
train.head()

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
0,Gentoo,40.9,13.7,214.0,4650.0,FEMALE
1,Adelie,37.3,17.8,191.0,3350.5,FEMALE
2,Gentoo,50.0,15.9,224.0,5350.0,MALE
3,Adelie,35.9,19.2,189.0,3800.0,FEMALE
4,Adelie,40.5,18.9,180.0,3950.0,MALE


### 2.4 Categorical Encoding

In [36]:
pd.get_dummies(
    data=train['Sex'],
    #drop_first=True
)

Unnamed: 0,FEMALE,MALE
0,1,0
1,1,0
2,0,1
3,1,0
4,0,1
...,...,...
261,0,1
262,0,1
263,1,0
264,0,1


In [37]:
train = train.join(
    pd.get_dummies(data=train['Sex'], drop_first=True)
)
train.head()

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,MALE
0,Gentoo,40.9,13.7,214.0,4650.0,FEMALE,0
1,Adelie,37.3,17.8,191.0,3350.5,FEMALE,0
2,Gentoo,50.0,15.9,224.0,5350.0,MALE,1
3,Adelie,35.9,19.2,189.0,3800.0,FEMALE,0
4,Adelie,40.5,18.9,180.0,3950.0,MALE,1
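
`pd.get_dummies` derives its columns from whatever categories are present in the frame it is given, so a data set that happens to lack a category yields different columns. One way around this, sketched here with invented toy data, is scikit-learn's `OneHotEncoder`, which learns the categories once and reuses them.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# toy data, invented for illustration; the second frame contains only one category
toy_train = pd.DataFrame({"Sex": ["FEMALE", "MALE", "FEMALE"]})
toy_test = pd.DataFrame({"Sex": ["MALE", "MALE"]})

# drop="first" removes the first category (FEMALE), like drop_first=True
encoder = OneHotEncoder(drop="first")
encoder.fit(toy_train[["Sex"]])

# names of the columns that remain after dropping the first category
columns = list(encoder.categories_[0][1:])

# the encoder returns a sparse matrix by default, hence .toarray()
encoded_test = pd.DataFrame(
    encoder.transform(toy_test[["Sex"]]).toarray(),
    columns=columns,
    index=toy_test.index,
)
print(encoded_test)
```

With `pd.get_dummies(toy_test['Sex'], drop_first=True)` the same toy frame would produce no columns at all, because its only category is dropped.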


### 2.5 Scaling

In [38]:
def standardize(series, mean, std):
    """
    returns the standardized counterpart of a series,
    given a mean and standard deviation
    """
    return (series-mean)/std

In [39]:
numerical_features = [
    'Culmen Length (mm)',
    'Culmen Depth (mm)',
    'Flipper Length (mm)',
    'Body Mass (g)'
]

In [40]:
# standard scaling parameter dictionary
parameters = {}

for feature in numerical_features: 
    # populate parameter dictionary
    mean = train[feature].mean()
    std = train[feature].std()
    parameters[feature] = (mean, std)
    
    # create standardized numerical columns
    train[feature] = standardize(train[feature], mean, std)

In [41]:
train.head()

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,MALE
0,Gentoo,-0.59261,-1.747645,0.934182,0.543626,FEMALE,0
1,Adelie,-1.25867,0.322499,-0.718601,-1.073456,FEMALE,0
2,Gentoo,1.091042,-0.636836,1.652783,1.414697,MALE,1
3,Adelie,-1.517694,1.029378,-0.862322,-0.514104,FEMALE,0
4,Adelie,-0.666617,0.877904,-1.509063,-0.327446,MALE,1
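
The manual standardization above can also be done with scikit-learn's `StandardScaler`. The sketch below uses invented toy data and points out one difference: `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `.std()` defaults to the sample version (`ddof=1`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# toy data, invented for illustration
toy_train = pd.DataFrame({"Body Mass (g)": [3000.0, 4000.0, 5000.0]})
toy_test = pd.DataFrame({"Body Mass (g)": [4500.0]})

# fit learns the mean and standard deviation from the training data only
scaler = StandardScaler()
scaled_train = scaler.fit_transform(toy_train)

# transform reuses the training parameters on new data
scaled_test = scaler.transform(toy_test)

# manual equivalent, with ddof=0 to match StandardScaler
manual = (toy_train - toy_train.mean()) / toy_train.std(ddof=0)
print(np.allclose(scaled_train, manual))
```

For a few hundred rows the two conventions give almost the same values, so either is fine as long as the same parameters are applied to training and test data.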


### 2.6 Feature-Target Separation

In [42]:
# features
X_train = train[numerical_features + ['MALE']]

# target
y_train = train['Species']

In [43]:
X_train.head()

Unnamed: 0,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),MALE
0,-0.59261,-1.747645,0.934182,0.543626,0
1,-1.25867,0.322499,-0.718601,-1.073456,0
2,1.091042,-0.636836,1.652783,1.414697,1
3,-1.517694,1.029378,-0.862322,-0.514104,0
4,-0.666617,0.877904,-1.509063,-0.327446,1


In [44]:
y_train

0         Gentoo
1         Adelie
2         Gentoo
3         Adelie
4         Adelie
         ...    
261    Chinstrap
262       Adelie
263       Adelie
264       Gentoo
265       Adelie
Name: Species, Length: 266, dtype: object

### 2.7 Model Building

**instantiate model**

In [45]:
classifier_model = LogisticRegression()

**train model**

In [46]:
classifier_model.fit(X_train,y_train)

**model validation**

In [47]:
training_accuracy = classifier_model.score(X_train,y_train)
print(f"training accuracy: {round(training_accuracy, 6)}")

training accuracy: 0.992481
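
The encoding and scaling steps from sections 2.4 and 2.5 have to be repeated by hand on any new data. A scikit-learn pipeline can bundle them with the classifier so that the fitted parameters are reused automatically. This is a minimal sketch on invented toy data, not the notebook's own method; the group-mean imputation of section 2.3 is left out because it is not a built-in transformer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy data, invented for illustration (no missing values)
toy = pd.DataFrame({
    "Flipper Length (mm)": [181.0, 186.0, 195.0, 214.0, 224.0, 220.0],
    "Body Mass (g)": [3750.0, 3800.0, 3650.0, 4650.0, 5350.0, 5000.0],
    "Sex": ["MALE", "FEMALE", "FEMALE", "FEMALE", "MALE", "MALE"],
    "Species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
})
numerical = ["Flipper Length (mm)", "Body Mass (g)"]

# scale the numerical columns and one-hot encode the categorical one
preprocessing = ColumnTransformer([
    ("scale", StandardScaler(), numerical),
    ("encode", OneHotEncoder(drop="first"), ["Sex"]),
])

# preprocessing and classifier are fitted together in one call
model = make_pipeline(preprocessing, LogisticRegression())

X, y = toy.drop(columns="Species"), toy["Species"]
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Calling `model.predict` or `model.score` on unseen rows applies the scaling and encoding learned during `fit`, so no separate parameter dictionary is needed.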


### 2.8 Model Evaluation

**test data quick exploration**

In [48]:
test.head()

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
0,Adelie,39.5,16.7,178.0,3250.0,FEMALE
1,Gentoo,46.9,14.6,222.0,4875.0,FEMALE
2,Adelie,42.1,19.1,195.0,4000.0,MALE
3,Chinstrap,49.8,17.3,198.0,3675.0,FEMALE
4,Adelie,41.1,18.2,192.0,4050.0,MALE


In [49]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67 entries, 0 to 66
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Species              67 non-null     object 
 1   Culmen Length (mm)   67 non-null     float64
 2   Culmen Depth (mm)    67 non-null     float64
 3   Flipper Length (mm)  67 non-null     float64
 4   Body Mass (g)        63 non-null     float64
 5   Sex                  67 non-null     object 
dtypes: float64(4), object(2)
memory usage: 3.3+ KB


**imputation**

In [50]:
test['Body Mass (g)'] = test.apply(
    # reuse the group means computed on the training data
    lambda x: mean_weight[x['Species'], x['Sex']]
    if pd.isna(x['Body Mass (g)'])
    else x['Body Mass (g)'],
    axis=1
)

**categorical encoding**

In [51]:
test = test.join(
    pd.get_dummies(data=test['Sex'], drop_first=True)
)
test.head()

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,MALE
0,Adelie,39.5,16.7,178.0,3250.0,FEMALE,0
1,Gentoo,46.9,14.6,222.0,4875.0,FEMALE,0
2,Adelie,42.1,19.1,195.0,4000.0,MALE,1
3,Chinstrap,49.8,17.3,198.0,3675.0,FEMALE,0
4,Adelie,41.1,18.2,192.0,4050.0,MALE,1


**scaling**

In [52]:
for feature in numerical_features:
    # call out standardization parameters
    mean, std = parameters[feature]
    
    # transform test data
    test[feature] = standardize(test[feature], mean, std)

In [53]:
test.head()

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,MALE
0,Adelie,-0.851633,-0.232905,-1.652783,-1.198517,FEMALE,0
1,Gentoo,0.51749,-1.293223,1.509063,0.823613,FEMALE,0
2,Adelie,-0.37059,0.978886,-0.431161,-0.265226,MALE,1
3,Chinstrap,1.054039,0.070042,-0.21558,-0.669652,FEMALE,0
4,Adelie,-0.555607,0.524464,-0.646741,-0.203007,MALE,1


**feature-target separation**

In [54]:
# features
X_test = test[numerical_features + ['MALE']]

# target
y_test = test['Species']

In [55]:
X_test.head()

Unnamed: 0,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),MALE
0,-0.851633,-0.232905,-1.652783,-1.198517,0
1,0.51749,-1.293223,1.509063,0.823613,0
2,-0.37059,0.978886,-0.431161,-0.265226,1
3,1.054039,0.070042,-0.21558,-0.669652,0
4,-0.555607,0.524464,-0.646741,-0.203007,1


**model performance**

In [31]:
test_accuracy = classifier_model.score(X_test,y_test)
print(f"test accuracy: {round(test_accuracy, 6)}")



<hr style="border:2px solid black">

## 3. Exercise: Titanic Data

**3.1 create a feature named Title**

In [None]:
# hint
# .apply(lambda x: x.split(',')[1].split('.')[0].lower().strip()) 
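
To see what the hinted expression does, it can be unpacked step by step. The names below are invented and only assume the "Surname, Title. First names" format; `extract_title` is a hypothetical helper name.

```python
import pandas as pd

# invented names in the "Surname, Title. First names" format the hint assumes
names = pd.Series([
    "Doe, Mr. John",
    "Roe, Mrs. Jane (Ann Smith)",
    "Poe, the Countess. of Somewhere (Lucy)",
])

def extract_title(name):
    # everything after the first comma, e.g. " Mr. John"
    after_comma = name.split(",")[1]
    # everything before the first period, e.g. " Mr"
    title = after_comma.split(".")[0]
    # normalize case and surrounding whitespace, e.g. "mr"
    return title.lower().strip()

titles = names.apply(extract_title)
print(titles.tolist())
```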

**3.2 binning: handling of rare titles**

In [None]:
# hint
# 1. find the list of unique titles

# 2. write a function that does the following transformations:
## ['mrs','mr','miss','master','dr','rev'] remain the same
## ['mlle','ms'] become 'miss'
## 'mme' becomes 'mrs'
## ['col','major','capt'] become 'army'
## ['don','lady','the countess','sir','the count','madam','lord'] become 'nobl'
## other titles become 'unknown'

# 3. use the .apply() method for binning the title column

**3.3 imputation of age**

In [None]:
# hint:
# .groupby(['Pclass','Sex'])['Age'].mean()

**3.4 imputation of embarkation**

In [None]:
# hint: use most frequent class

**3.5 imputation of cabin**

In [None]:
# hint: incorporate missing cabin as a class

**3.6 engineer fare price**

In [None]:
# hint
# .apply(lambda x: x['Fare']/(x['SibSp']+x['Parch']+1),axis=1)
# (+1 counts the passenger, so travelling alone does not divide by zero)
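
Assuming the goal of this hint is a fare per person, one detail matters: a passenger travelling alone has `SibSp + Parch = 0`. The sketch below therefore counts the passenger as part of the family size. The toy rows and the column name `fare_per_person` are made up for illustration.

```python
import pandas as pd

# toy rows, invented for illustration; column names follow the hint
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 30.0],
    "SibSp": [0, 1, 1],
    "Parch": [0, 0, 1],
})

# family size counts the passenger too, so it is never zero
family_size = toy["SibSp"] + toy["Parch"] + 1

# vectorized division instead of a row-wise apply
toy["fare_per_person"] = toy["Fare"] / family_size
print(toy)
```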

**3.7 scaling: numerical features**


<hr style="border:2px solid black">

## References

- [8 Feature Engineering Techniques for Machine Learning](https://www.projectpro.io/article/8-feature-engineering-techniques-for-machine-learning/423)

- [Fundamental Techniques of Feature Engineering for Machine Learning](https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114)