# Session 08: Feature Engineering

In this session we are going to practice the Feature Engineering concepts covered in the previous session.

We are going to use the HR employee attrition dataset.

The dataset contains the following columns:

- **Age**: The age of the employee.
- **Attrition**: Indicates whether the employee has left the company (Yes/No).
- **BusinessTravel**: The frequency of business travel for the employee (e.g., Rarely, Frequently).
- **DailyRate**: The daily rate of the employee's pay.
- **Department**: The department in which the employee works (e.g., Sales, Research & Development).
- **DistanceFromHome**: The distance between the employee's home and workplace.
- **Education**: The education level of the employee (e.g., 1 = 'Below College', 2 = 'College', 3 = 'Bachelor', 4 = 'Master', 5 = 'Doctor').
- **EducationField**: The field of education of the employee (e.g., Life Sciences, Medical, Marketing).
- **EmployeeCount**: The number of employees (always 1 in this dataset).
- **EmployeeNumber**: A unique identifier for the employee.
- **EnvironmentSatisfaction**: The employee's satisfaction with the work environment (1 to 4 scale).
- **Gender**: The gender of the employee (Male/Female).
- **HourlyRate**: The hourly wage of the employee.
- **JobInvolvement**: The level of involvement in the job (1 to 4 scale).
- **JobLevel**: The job level of the employee.
- **JobRole**: The role of the employee within the company (e.g., Sales Executive, Research Scientist).
- **JobSatisfaction**: The employee's satisfaction with the job (1 to 4 scale).
- **MaritalStatus**: The marital status of the employee (e.g., Single, Married, Divorced).
- **MonthlyIncome**: The monthly income of the employee.
- **MonthlyRate**: The monthly rate of the employee's pay.
- **NumCompaniesWorked**: The number of companies the employee has worked for.
- **Over18**: Whether the employee is over 18 years old (always `Y` in this dataset).
- **OverTime**: Whether the employee works overtime (Yes/No).
- **PercentSalaryHike**: The percentage increase in salary for the employee.
- **PerformanceRating**: The performance rating of the employee (1 to 4 scale).
- **RelationshipSatisfaction**: The employee's satisfaction with relationships at work (1 to 4 scale).
- **StandardHours**: The standard number of working hours (always 80 in this dataset).
- **StockOptionLevel**: The stock option level of the employee (0 to 3 scale).
- **TotalWorkingYears**: The total number of years the employee has worked.
- **TrainingTimesLastYear**: The number of training sessions attended by the employee last year.
- **WorkLifeBalance**: The employee's work-life balance satisfaction (1 to 4 scale).
- **YearsAtCompany**: The number of years the employee has been with the company.
- **YearsInCurrentRole**: The number of years the employee has been in their current role.
- **YearsSinceLastPromotion**: The number of years since the employee's last promotion.
- **YearsWithCurrManager**: The number of years the employee has worked with their current manager.

## Problem description

We are going to train a Machine Learning model that predicts if an employee is going to leave the company or not. We are going to use the `Attrition` column as the target variable. Since we are predicting a binary variable, this is a binary classification problem.

We will be using a Decision Tree model to make the predictions.

## Machine Learning workflow

When working on a Machine Learning problem, we usually follow the steps below:

1. Data Preprocessing: Load and prepare the dataset.
2. Feature Engineering: Create new features or modify existing ones.
3. Model Training: Train a Machine Learning model.
4. Model Evaluation: Evaluate the model using a validation set.

In the following cells, we are going to implement these steps with the dataset as it comes without any preprocessing.

Load the dataset and display the first few rows of it.

In [1]:
import pandas as pd

data = pd.read_csv("HR-Employee-Attrition.csv")

data.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


In [2]:
data.shape

(1470, 35)

We can see that there are some columns that are numeric and others are categorical. We need to handle the categorical columns before training the model.

In [3]:
# check for columns that are strings
string_categorical_columns = data.select_dtypes(include=['object']).columns

These columns need to be encoded into numbers before training the model.

As we learned, we can use `LabelEncoder` to convert each column into a numerical one, or `OneHotEncoder` to create a binary column for each category.

Let's do both and compare the performance of the model with each type of encoding.

But the first step is to separate the target variable from the features.

* `x` should contain all columns except `Attrition`.
* `y` should only contain the `Attrition` column.

In [4]:
# separate the data into x and y
x = data.drop('Attrition', axis=1)
y = data['Attrition']

# check for columns that are strings in x
string_categorical_columns_x = x.select_dtypes(include=['object']).columns

string_categorical_columns_x

Index(['BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole',
       'MaritalStatus', 'Over18', 'OverTime'],
      dtype='object')

### Categorical encoding with LabelEncoder

Now we can apply the encoding to the features (x).

In [5]:
# label encoding

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

x_le = x.copy()

for column in string_categorical_columns_x:
    x_le[f'{column}_le'] = label_encoder.fit_transform(x[column])

x_le = x_le.drop(string_categorical_columns_x, axis=1)

x_le.head()

Unnamed: 0,Age,DailyRate,DistanceFromHome,Education,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,HourlyRate,JobInvolvement,JobLevel,...,YearsSinceLastPromotion,YearsWithCurrManager,BusinessTravel_le,Department_le,EducationField_le,Gender_le,JobRole_le,MaritalStatus_le,Over18_le,OverTime_le
0,41,1102,1,2,1,1,2,94,3,2,...,0,5,2,2,1,0,7,2,0,1
1,49,279,8,1,1,2,3,61,2,2,...,1,7,1,1,1,1,6,1,0,0
2,37,1373,2,2,1,4,4,92,2,1,...,0,0,2,1,4,1,2,2,0,1
3,33,1392,3,4,1,5,4,56,3,1,...,3,0,1,1,1,0,6,1,0,1
4,27,591,2,1,1,7,1,40,3,1,...,2,2,2,1,3,1,2,1,0,0


Now that all the columns are numeric, we can split the dataset into training and test sets.

In [6]:
# train test split, using 30% of the data for testing

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x_le, y, test_size=0.3, random_state=42)

# random_state is used to ensure that the split is the same every time the code is run, so that the results are reproducible
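As a quick check of what `test_size` and `random_state` do, the sketch below splits stand-in arrays with the same number of rows as the dataset (1470). The `rows` and `labels` arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows as the dataset (1470).
rows = np.arange(1470).reshape(-1, 1)
labels = np.arange(1470) % 2

# Two splits with the same random_state.
train_a, test_a, _, _ = train_test_split(rows, labels, test_size=0.3, random_state=42)
train_b, test_b, _, _ = train_test_split(rows, labels, test_size=0.3, random_state=42)

print(len(train_a), len(test_a))       # 1029 441
print(np.array_equal(test_a, test_b))  # True
```

With `test_size=0.3`, 441 of the 1470 rows go to the test set, and repeating the call with the same `random_state` selects exactly the same rows.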

Time for training the model. But first we instantiate the model.

In [7]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=42)

Now we train it using the training set.

In [8]:
tree.fit(x_train, y_train)

We create a prediction using the test set.

In [9]:
y_pred = tree.predict(x_test)

And now we compare the predictions with the actual values, using the accuracy metric.

$$ Accuracy = \frac{\text{Correct predictions}}{\text{Total predictions}} $$

In [10]:
from sklearn.metrics import accuracy_score

baseline_label_encoding_accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy: {baseline_label_encoding_accuracy}')

Accuracy: 0.782312925170068
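An accuracy value is easier to interpret next to a simple reference point. The sketch below uses hypothetical labels (not taken from the dataset) to compute accuracy by hand, following the formula above, and compares it with a baseline that always predicts the most frequent class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels, not taken from the dataset.
actual = np.array(["No", "No", "No", "No", "Yes", "No", "Yes", "No", "No", "No"])
predicted = np.array(["No", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "No"])

# Accuracy by hand: correct predictions divided by total predictions.
manual_accuracy = (actual == predicted).sum() / len(actual)
print(manual_accuracy)                    # 0.8
print(accuracy_score(actual, predicted))  # 0.8

# Baseline that ignores the features and always predicts the most frequent class.
placeholder_features = np.zeros((len(actual), 1))
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(placeholder_features, actual)
majority_accuracy = dummy.score(placeholder_features, actual)
print(majority_accuracy)                  # 0.8
```

In this toy case the predictions are no more accurate than always answering "No". When one class is much more common than the other, it may be worth running the same comparison on `y_test` before reading too much into a single accuracy number.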


### Categorical encoding with OneHotEncoder

Now we are going to repeat the process, but using one-hot encoding instead of label encoding.

Originally we had 8 categorical columns in `x`. Let's see how many columns we have after applying one-hot encoding.

In [11]:
# One hot encoding

from sklearn.preprocessing import OneHotEncoder

one_hot_encoder = OneHotEncoder()

x_ohe = x.copy()

x_ohe = one_hot_encoder.fit_transform(x_ohe[string_categorical_columns_x])

x_ohe = pd.DataFrame(x_ohe.toarray(), columns=one_hot_encoder.get_feature_names_out(string_categorical_columns_x))

print(x_ohe.shape)

x_ohe.head()

(1470, 29)


Unnamed: 0,BusinessTravel_Non-Travel,BusinessTravel_Travel_Frequently,BusinessTravel_Travel_Rarely,Department_Human Resources,Department_Research & Development,Department_Sales,EducationField_Human Resources,EducationField_Life Sciences,EducationField_Marketing,EducationField_Medical,...,JobRole_Research Director,JobRole_Research Scientist,JobRole_Sales Executive,JobRole_Sales Representative,MaritalStatus_Divorced,MaritalStatus_Married,MaritalStatus_Single,Over18_Y,OverTime_No,OverTime_Yes
0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0
1,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0
2,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0
3,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0
4,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0


One-hot encoding turns the 8 categorical columns into 29 binary columns, because each category in each column becomes its own binary column.
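The number of one-hot columns can be predicted before encoding by counting the distinct values in each column. The following is a minimal sketch on a hypothetical toy frame (the `toy` data is an assumption; only the column names come from the dataset).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame with three string columns.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "OverTime": ["Yes", "No", "No", "No"],
    "Over18": ["Y", "Y", "Y", "Y"],
})

# One-hot encoding yields one column per distinct value in each column.
expected_columns = toy.nunique().sum()
encoded = OneHotEncoder().fit_transform(toy)

print(toy.nunique().to_dict())             # {'Gender': 2, 'OverTime': 2, 'Over18': 1}
print(expected_columns, encoded.shape[1])  # 5 5
```

A column with a single value, such as `Over18`, produces a single column that is always 1 (`Over18_Y` in the output above), so it carries no information for the model.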

Let's put all the data together and split it into training and test sets.

In [12]:
# remove the original string columns from x
x = x.drop(string_categorical_columns_x, axis=1)

# add the new columns to x
x_ohe = pd.concat([x, x_ohe], axis=1)

print(x_ohe.shape)
x_ohe.head()

(1470, 55)


Unnamed: 0,Age,DailyRate,DistanceFromHome,Education,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,HourlyRate,JobInvolvement,JobLevel,...,JobRole_Research Director,JobRole_Research Scientist,JobRole_Sales Executive,JobRole_Sales Representative,MaritalStatus_Divorced,MaritalStatus_Married,MaritalStatus_Single,Over18_Y,OverTime_No,OverTime_Yes
0,41,1102,1,2,1,1,2,94,3,2,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0
1,49,279,8,1,1,2,3,61,2,2,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0
2,37,1373,2,2,1,4,4,92,2,1,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0
3,33,1392,3,4,1,5,4,56,3,1,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0
4,27,591,2,1,1,7,1,40,3,1,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0


Now let's repeat the Machine Learning process with the one-hot encoded data.

In [13]:
# train test split, using 30% of the data for testing
x_train, x_test, y_train, y_test = train_test_split(x_ohe, y, test_size=0.3, random_state=42)

# instantiate the model
tree = DecisionTreeClassifier(random_state=42)

# train the model
tree.fit(x_train, y_train)

# make predictions
y_pred = tree.predict(x_test)

# calculate the accuracy
baseline_onehot_encoding_accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy: {baseline_onehot_encoding_accuracy}')

Accuracy: 0.7732426303854876


### Which features are the most important?

One of the advantages of using a Decision Tree is that we can see which features are the most important for the model to make predictions.

In [14]:
# feature importance

importances = tree.feature_importances_

feature_importances = pd.DataFrame(importances, index=x_ohe.columns, columns=['importance'])

feature_importances = feature_importances.sort_values(by='importance', ascending=False)

import plotly.express as px

fig = px.bar(feature_importances, x=feature_importances.index, y='importance')
fig.show()

With the `feature_importances_` attribute, we can see the importance of each feature.
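To illustrate how the importances behave, here is a small sketch on hypothetical synthetic data in which only one column, `signal`, is related to the target. All names in it (`toy_x`, `toy_y`, `toy_tree`, `signal`, `noise_a`, `noise_b`) are assumptions made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: only "signal" is related to the target.
rng = np.random.default_rng(42)
toy_x = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise_a": rng.normal(size=200),
    "noise_b": rng.normal(size=200),
})
toy_y = (toy_x["signal"] > 0).astype(int)

toy_tree = DecisionTreeClassifier(random_state=42).fit(toy_x, toy_y)

# A Series keeps each importance next to its column name.
toy_importances = pd.Series(toy_tree.feature_importances_, index=toy_x.columns)
toy_importances = toy_importances.sort_values(ascending=False)

print(toy_importances)
print(toy_importances.sum())  # importances are normalized to sum to 1
```

Because the importances sum to 1, each value can be read as that feature's share of the total impurity reduction achieved by the tree's splits.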

## Exercises

### Exercise 1

Create 5 new features based on the existing ones.

* You can use arithmetic operations between columns, apply a mathematical function to a column, or combine the values of multiple columns using ratios, differences, etc.
    * Since this is not a time series dataset, the `shift`, `rolling` and `ewm` methods do not make sense here.

After you have created the new features, follow the process we have done above to train a model using the new features, using either label encoding or one-hot encoding for dealing with the categorical columns.

Save the new accuracy in a variable named `accuracy_ex1`.


In [15]:
# example of new features

# YearsAtCompany can be 0, so add 1 to the denominator to avoid dividing by zero
x['MonthlyIncomeOverYearsAtCompany'] = x['MonthlyIncome'] / (x['YearsAtCompany'] + 1)

x['MonthlyIncome_above_mean'] = x['MonthlyIncome'] > x['MonthlyIncome'].mean()
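Ratio features need care when the denominator can be zero, as `YearsAtCompany` can be for recent hires (the third row in the preview at the top of the notebook has 0). The sketch below uses hypothetical toy rows to show the problem and one possible guard. Adding 1 to the denominator is an assumption about what is reasonable here, not the only option.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the second employee has 0 years at the company.
toy = pd.DataFrame({
    "MonthlyIncome": [5993, 2090, 2909],
    "YearsAtCompany": [6, 0, 8],
})

# Plain division gives inf when the denominator is 0.
raw_ratio = toy["MonthlyIncome"] / toy["YearsAtCompany"]
print(np.isinf(raw_ratio).sum())  # 1

# One possible guard: add 1 to the denominator so it is never 0.
safe_ratio = toy["MonthlyIncome"] / (toy["YearsAtCompany"] + 1)
print(np.isfinite(safe_ratio).all())  # True
```

scikit-learn estimators generally refuse to fit on data containing infinite values, so it is worth checking new ratio columns with `np.isinf` before training.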



### Exercise 2

With your new features, and the model trained, use the `feature_importances_` attribute to find out which features are the most important.

Create a new `x` variable with only the 10 most important features and train the model again.

Save the new accuracy in a variable named `accuracy_ex2`.

### Exercise 3

Using the column descriptions at the beginning of the notebook, identify which of the numerical columns actually represent a categorical variable. For example, the `Education` column is numerical but represents a category.

Apply the one-hot encoding to these columns and train the model again.

Save the new accuracy in a variable named `accuracy_ex3`.

### Exercise 4

Using everything you have learned so far, try to get the best accuracy you can.

Save the best accuracy in a variable named `accuracy_ex4`. Post it in the forum discussion!