<h1 style="font-size:30px">Feature Engineering</h1>
<hr>

1. Domain knowledge
2. Create interaction features
3. Create indicator features
4. Group sparse classes
5. Encode dummy variables
6. Remove unused or redundant features

<span style="font-size:18px">**Import libraries**</span>

In [None]:
# Numpy for numerical computing
import numpy as np

# Pandas for Dataframes
import pandas as pd
pd.set_option('display.max_columns',100)

# Matplotlib for visualization
from matplotlib import pyplot as plt
# display plots in the notebook
%matplotlib inline

# Seaborn for easier visualization
import seaborn as sns

<span style="font-size:18px">**Load dataset**</span>

In [None]:
df = pd.read_csv('cleaned_df.csv')

<span style="font-size:18px">**1. Domain knowledge**</span><br>
Engineer informative features by tapping our expertise about the domain
* Try to think of specific information you might want to **isolate**
* You have a lot of creative freedom to think of ideas for new features
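
As a minimal sketch of isolating specific information, the toy example below flags rows that fall inside a period we might believe, from domain knowledge, behaves differently. The `demo` dataframe, the `tx_year` column, and the 2010 to 2013 window are all hypothetical and are not taken from `cleaned_df.csv`.

```python
import pandas as pd

# Hypothetical toy data; 'tx_year' is an assumed column name
demo = pd.DataFrame({'tx_year': [2008, 2011, 2013, 2016]})

# Isolate a period of interest (bounds are inclusive and chosen only for illustration)
demo['during_period'] = demo.tx_year.between(2010, 2013).astype(int)

print(demo)
```

The new 0/1 column lets a model treat that period separately without having to learn the cut-offs from the raw year.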

<span style="font-size:18px">**2. Create interaction features**</span><br>
Interaction features are operations between two or more other features<br>
* Do not have to be products of two variables<br>
* Can be **products**, **sums**, or **differences** between two features<br>
* Do a quick sanity check after creating the feature
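
The three kinds of interaction listed above could be sketched as follows. The `demo` dataframe and its column names (`tx_year`, `year_built`, `beds`, `baths`) are hypothetical, chosen only to make the arithmetic readable.

```python
import pandas as pd

# Hypothetical toy data
demo = pd.DataFrame({
    'tx_year':    [2010, 2015, 2012],
    'year_built': [2000, 2015, 1990],
    'beds':       [2, 3, 4],
    'baths':      [1, 2, 3],
})

# Difference between two features
demo['property_age'] = demo.tx_year - demo.year_built

# Sum of two features
demo['rooms'] = demo.beds + demo.baths

# Product of two features
demo['beds_x_baths'] = demo.beds * demo.baths

# Quick sanity check: an age should never be negative
n_negative = (demo.property_age < 0).sum()
print('Rows with negative age:', n_negative)
```

If the sanity check finds impossible values, it is usually worth inspecting those rows before deciding whether to correct or remove them.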

<span style="font-size:18px">**3. Create indicator features**</span><br>
One side benefit from indicator variables is the ability to quickly check the **proportion** of our observations that meet the condition
* Use boolean masks
* Exclude observations whose values were originally missing and were then flagged and filled

In [None]:
# Scatterplot of numeric_feature_1 vs numeric_feature_2, only for the target
sns.lmplot(x = 'numeric_feature_1', 
           y = 'numeric_feature_2',
           data = df[df.target == 'Target'],
           fit_reg = False)
plt.show()

In [None]:
# Create indicator features
df['indicator_feature_1'] = (df.numeric_feature_1 > 0.5).astype(int)
df['indicator_feature_2'] = (df.numeric_feature_1 <= 0.5).astype(int)
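
A minimal sketch of the two points above: combining boolean masks so that filled-in missing values are excluded, and reading the proportion straight off the indicator. The `score` and `score_missing` columns are hypothetical; the sketch assumes missing scores were filled with 0 and flagged in `score_missing` during cleaning.

```python
import pandas as pd

# Hypothetical toy data: row 1 was originally missing, then flagged and filled with 0
demo = pd.DataFrame({
    'score':         [0.9, 0.0, 0.4, 0.7],
    'score_missing': [0,   1,   0,   0],
})

# Boolean mask: low score AND not originally missing
mask = (demo.score < 0.5) & (demo.score_missing == 0)
demo['low_score'] = mask.astype(int)

# The mean of a 0/1 indicator is the proportion of rows that meet the condition
proportion = demo['low_score'].mean()
print(proportion)
```

Without the second condition, the filled-in 0 in row 1 would be counted as a genuinely low score.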

<span style="font-size:18px">**4. Group sparse classes**</span><br>
Group identical or similar classes into a single class to reduce the number of sparse classes in categorical features
* Each class should have a decent number of observations
* The number of observations depends on the size of the dataset and the number of other features
* As a guideline, combine classes until each one has at least around 50 observations

In [None]:
# Bar plot for categorical_feature
sns.countplot(y = 'categorical_feature', data = df)
plt.show()
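
Once the bar plot shows which classes are sparse, the grouping itself could look like the sketch below. The class labels are hypothetical, and the threshold of 2 is used only because the toy data is tiny; on a real dataset the guideline of roughly 50 observations would apply.

```python
import pandas as pd

# Hypothetical toy data with a duplicated label ('a' vs 'A') and several rare classes
demo = pd.DataFrame({'categorical_feature': ['A', 'a', 'B', 'C', 'A', 'D']})

# Merge classes that mean the same thing
demo['categorical_feature'] = demo['categorical_feature'].replace('a', 'A')

# Find the classes that are still sparse (fewer than 2 rows in this toy example)
counts = demo['categorical_feature'].value_counts()
sparse = counts[counts < 2].index

# Keep non-sparse labels and lump the rest into a single 'Other' class
is_sparse = demo['categorical_feature'].isin(sparse)
demo['categorical_feature'] = demo['categorical_feature'].where(~is_sparse, 'Other')

print(demo['categorical_feature'].value_counts())
```

Re-running the count plot after grouping is a quick way to confirm that no sparse classes remain.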

<span style="font-size:18px">**5. Encode dummy variables**</span><br>
Dummy variables are a set of binary (0 or 1) features that **each represent a single class** from a categorical feature
* Create new dataframe with dummy variables for categorical features

In [None]:
# Create new dataframe with dummy features
df = pd.get_dummies(df, columns = ['category_one', 'category_two'])
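
To see what the encoding produces, here is a small sketch on a hypothetical toy dataframe. Passing `dtype = int` is a choice made here so the dummies are 0/1 integers; the default dtype depends on the pandas version.

```python
import pandas as pd

# Hypothetical toy data with one categorical and one numeric column
demo = pd.DataFrame({'category_one': ['a', 'b', 'a'],
                     'value': [1, 2, 3]})

# Each class of 'category_one' becomes its own 0/1 column;
# the original categorical column is dropped from the result
dummies = pd.get_dummies(demo, columns = ['category_one'], dtype = int)

print(dummies)
```

The new columns are named `<original column>_<class>`, which keeps the source of each dummy variable traceable.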

<span style="font-size:18px">**6. Remove unused or redundant features**</span><br>
- Unused features:<br>
    - ID columns<br>
    - Features that wouldn't be available at the time of prediction<br>
    - Other text descriptions<br>
- Redundant features: those that have been **replaced by other features** that were added

In [None]:
# Drop feature from the dataset
df = df.drop(['feature'], axis = 1)

<span style="font-size:18px">**7. Save the cleaned dataframe**</span><br>

In [None]:
# Save cleaned dataframe to new file
df.to_csv('analytical_base_table.csv', index = False)