<h1 style="font-size:30px">Feature Engineering</h1>
<hr>

1. Domain knowledge
2. Create interaction features
3. Create indicator features
4. Group sparse classes
5. Encode dummy variables
6. Remove unused or redundant features

<span style="font-size:18px">**Import libraries**</span>

In [1]:
# Numpy for numerical computing
import numpy as np

# Pandas for Dataframes
import pandas as pd
pd.set_option('display.max_columns',100)

# Matplotlib for visualization
from matplotlib import pyplot as plt
# display plots in the notebook
%matplotlib inline

# Seaborn for easier visualization
import seaborn as sns

<span style="font-size:18px">**Load dataset**</span>

In [2]:
df = pd.read_csv('cleaned_df.csv')

<span style="font-size:18px">**1. Domain knowledge**</span><br>
Engineer informative features by drawing on what we know about the domain
* Try to think of specific information you might want to **isolate**
* You have a lot of creative freedom to think of ideas for new features
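
As a minimal sketch of isolating one piece of domain information, the cell below flags stones of one carat or more. The one-carat threshold, the `toy` dataframe, and the column name `one_carat_plus` are assumptions made for illustration; they are not taken from the dataset's documentation.

```python
import pandas as pd

# Toy stand-in for the diamonds data; the values are illustrative only
toy = pd.DataFrame({'carat': [0.23, 0.95, 1.00, 1.52]})

# Hypothetical domain rule: isolate stones of one carat or more
toy['one_carat_plus'] = (toy['carat'] >= 1.0).astype(int)

print(toy)
```

Whether a threshold like this helps depends on the domain, so it is worth checking the new feature against the target before keeping it.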

<span style="font-size:18px">**2. Create interaction features**</span><br>
Interaction features are operations between two or more other features<br>
* Need not be products of two variables<br>
* Can be **products**, **sums**, or **differences** between two or more features<br>
* Do a quick sanity check after creating the feature
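
One possible interaction feature for this kind of data is a rough volume built as the product of the three dimension columns. The sketch below uses a small `toy` dataframe with the `x`, `y`, `z` values shown in the output further down; the feature name `volume` is an assumption, and the product is only a proxy, not a true gemstone volume.

```python
import pandas as pd

# Toy rows with the same x, y, z columns as the dataset
toy = pd.DataFrame({
    'x': [3.95, 3.89, 4.05],
    'y': [3.98, 3.84, 4.07],
    'z': [2.43, 2.31, 2.31],
})

# Interaction feature: product of three existing features
toy['volume'] = toy['x'] * toy['y'] * toy['z']

# Quick sanity check: summary statistics and no non-positive values
print(toy['volume'].describe())
assert (toy['volume'] > 0).all()
```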

<span style="font-size:18px">**3. Create indicator features**</span><br>
Indicator features are binary (0 or 1) variables that flag whether an observation meets a condition. A side benefit is the ability to quickly check the **proportion** of observations that meet it
* Use boolean masks
* Exclude observations whose values were originally missing and have since been flagged and filled
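
A sketch of both points, assuming a hypothetical `table_missing` flag column and a missing `table` value that was filled with 0 during cleaning. The threshold of 57 and the name `narrow_table` are also illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    'table': [55.0, 61.0, 0.0, 58.0, 56.0],  # 0.0 stands for a filled-in missing value
    'table_missing': [0, 0, 1, 0, 0],        # hypothetical flag set when the value was filled
})

# Boolean mask: condition of interest AND value was not originally missing
mask = (toy['table'] < 57) & (toy['table_missing'] == 0)

# Convert the mask to a 0/1 indicator feature
toy['narrow_table'] = mask.astype(int)

# The mean of a 0/1 column is the proportion of observations meeting the condition
share = toy['narrow_table'].mean()
print(share)
```

Without the second half of the mask, the filled-in 0.0 would satisfy `table < 57` and be counted as a narrow table.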

<span style="font-size:18px">**4. Group sparse classes**</span><br>
Group identical or similar classes into a single class to reduce the number of sparse classes in categorical features
* Each class should have a decent number of observations
* The minimum number needed depends on the size of the dataset and the number of other features
* As a guideline, combine classes until each one has at least around 50 observations
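
The grouping step could look like the sketch below, which merges two similar clarity grades into one class and then re-checks the class counts. The toy counts, the merged label `VVS`, and the column name `clarity_grouped` are assumptions for illustration; which classes count as similar is a judgment call for the real data.

```python
import pandas as pd

# Toy categorical feature with two sparse classes (fewer than 50 rows each)
toy = pd.DataFrame({
    'clarity': ['SI1'] * 60 + ['VS2'] * 55 + ['VVS1'] * 30 + ['VVS2'] * 25
})

# Inspect class counts before grouping
print(toy['clarity'].value_counts())

# Merge the two similar sparse classes into a single class
toy['clarity_grouped'] = toy['clarity'].replace(['VVS1', 'VVS2'], 'VVS')

# Re-check: every class should now have at least around 50 observations
grouped_counts = toy['clarity_grouped'].value_counts()
print(grouped_counts)
```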

<span style="font-size:18px">**5. Encode dummy variables**</span><br>
Dummy variables are a set of binary (0 or 1) features that **each represent a single class** from a categorical feature
* Create a new dataframe with dummy variables for the categorical features

In [3]:
# Create new dataframe with dummy features
df = pd.get_dummies(df, columns = ['cut', 'color', 'clarity'])
df.head()

,Unnamed: 0,carat,depth,table,price,x,y,z,cut_Fair,cut_Good,cut_Ideal,cut_Premium,cut_Very Good,color_D,color_E,color_F,color_G,color_H,color_I,color_J,clarity_I1,clarity_IF,clarity_SI1,clarity_SI2,clarity_VS1,clarity_VS2,clarity_VVS1,clarity_VVS2
0,1,0.23,61.5,55.0,326,3.95,3.98,2.43,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0
1,2,0.21,59.8,61.0,326,3.89,3.84,2.31,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0
2,3,0.23,56.9,65.0,327,4.05,4.07,2.31,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0
3,4,0.29,62.4,58.0,334,4.2,4.23,2.63,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0
4,5,0.31,63.3,58.0,335,4.34,4.35,2.75,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0


<span style="font-size:18px">**6. Remove unused or redundant features**</span><br>
- Unused features:<br>
    - ID columns<br>
    - Features that wouldn't be available at the time of prediction<br>
    - Other text descriptions<br>
- Redundant features: those that have been **replaced by other features** that were added

In [4]:
# Drop the unused 'Unnamed: 0' column from the dataset
df = df.drop(columns=['Unnamed: 0'])
df.head()

,carat,depth,table,price,x,y,z,cut_Fair,cut_Good,cut_Ideal,cut_Premium,cut_Very Good,color_D,color_E,color_F,color_G,color_H,color_I,color_J,clarity_I1,clarity_IF,clarity_SI1,clarity_SI2,clarity_VS1,clarity_VS2,clarity_VVS1,clarity_VVS2
0,0.23,61.5,55.0,326,3.95,3.98,2.43,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0
1,0.21,59.8,61.0,326,3.89,3.84,2.31,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0
2,0.23,56.9,65.0,327,4.05,4.07,2.31,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0
3,0.29,62.4,58.0,334,4.2,4.23,2.63,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0
4,0.31,63.3,58.0,335,4.34,4.35,2.75,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0


<span style="font-size:18px">**7. Save the cleaned dataframe**</span><br>

In [5]:
# Save the engineered dataframe to a new file, without writing the row index
df.to_csv('analytical_base_table.csv', index=False)