# Data Preprocessing

In [1]:
# Import packages
import json
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

In [2]:
# Load the dataset
df = pd.read_csv('./../../../data/cleaned_data.csv')

In [3]:
# Load lists of numerical and categorical columns from the static file
with open('./../../../data/statics.json') as f:
    statics = json.load(f)
categorical_columns = statics['categorical_columns']
numerical_columns = statics['numerical_columns']

Before we begin preprocessing, we need to split the data into training and testing sets. This is necessary because every transformation must be fitted on the training data only and then applied to both the training and testing sets. Fitting on the full dataset would leak information about the test set into the preprocessing step.

In [4]:
# Separate the target variable from the other data
y = df['Attrition']
X = df.drop('Attrition', axis=1)

In [5]:
# Attrition is the target, so remove it from the list of categorical feature columns
categorical_columns.remove('Attrition')

In [6]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
# Segregate the training data into numerical and categorical variables
num_df_train = X_train[numerical_columns]
cat_df_train = X_train[categorical_columns]

In [8]:
# Segregate the testing data into numerical and categorical variables
num_df_test = X_test[numerical_columns]
cat_df_test = X_test[categorical_columns]

## Preprocessing per data types

### Numerical columns

Let us begin the data preprocessing with the numerical columns. Since some of the columns are positively skewed and the columns are not all on the same scale, it is better to bring them to a common scale. The transformation used is the `MinMaxScaler` from scikit-learn, which by default rescales each column to the range [0, 1] using that column's minimum and maximum. Mathematically, it is given as:
$$
X' = \frac{X - X_{min}}{X_{max} - X_{min}}
$$
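
To make the formula concrete, here is a minimal sketch on toy numbers (the values are made up for illustration and are not taken from the dataset). It computes the scaling by hand with the training minimum and maximum and checks that the result agrees with `MinMaxScaler` fitted on the same training values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy values, purely illustrative (not from the attrition dataset)
toy_train = np.array([[20.0], [30.0], [60.0]])
toy_test = np.array([[40.0], [70.0]])

# Minimum and maximum are taken from the training data only
x_min = toy_train.min(axis=0)
x_max = toy_train.max(axis=0)

# Apply X' = (X - X_min) / (X_max - X_min) to both sets
manual_train = (toy_train - x_min) / (x_max - x_min)
manual_test = (toy_test - x_min) / (x_max - x_min)

# The scikit-learn scaler fitted on the training data gives the same result
toy_scaler = MinMaxScaler().fit(toy_train)
sklearn_train = toy_scaler.transform(toy_train)
sklearn_test = toy_scaler.transform(toy_test)

print(manual_train.ravel())
print(manual_test.ravel())
```

Because the minimum and maximum come from the training data, a test value larger than the training maximum (70 here) is mapped above 1. This is expected when the scaler is fitted on the training set only.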

In [10]:
# Scale the data
transformer = MinMaxScaler()
num_df_train = transformer.fit_transform(num_df_train)
num_df_test = transformer.transform(num_df_test)

### Categorical columns

Categorical columns need to be represented by numbers so that a model can process them. For our data, some columns need ordinal encoding while others need one-hot encoding.

In [None]:
# Columns to be ordinal encoded
ordinal_columns = ['Education', 'EnvironmentSatisfaction', 'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'Overtime', 'PerformanceRating', 'RelationshipSatisfaction', 'StockOptionLevel', 'WorkLifeBalance']
# Columns to be one-hot encoded
one_hot_columns = ['BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus']
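
The two encodings could be applied along the following lines. This is only a sketch on toy data, assuming scikit-learn's `OrdinalEncoder` and `OneHotEncoder` are used; the notebook does not say which encoders it uses. The frames `toy_train` and `toy_test` and their values are hypothetical. As with scaling, each encoder is fitted on the training data and then applied to both sets.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy frames, purely illustrative (values are assumptions, not the real data)
toy_train = pd.DataFrame({
    'BusinessTravel': ['Travel_Rarely', 'Non-Travel', 'Travel_Frequently'],
    'JobLevel': [1, 3, 2],
})
toy_test = pd.DataFrame({
    'BusinessTravel': ['Non-Travel', 'Travel_Rarely'],
    'JobLevel': [2, 1],
})

# Ordinal encoding: list the categories explicitly so their order is preserved
ordinal_encoder = OrdinalEncoder(categories=[[1, 2, 3]])
job_level_train = ordinal_encoder.fit_transform(toy_train[['JobLevel']])
job_level_test = ordinal_encoder.transform(toy_test[['JobLevel']])

# One-hot encoding: one binary column per category seen in the training data;
# categories unseen during fitting are encoded as all zeros
one_hot_encoder = OneHotEncoder(handle_unknown='ignore')
travel_train = one_hot_encoder.fit_transform(toy_train[['BusinessTravel']]).toarray()
travel_test = one_hot_encoder.transform(toy_test[['BusinessTravel']]).toarray()

print(job_level_train.ravel())
print(one_hot_encoder.categories_[0])
print(travel_test)
```

Ordinal encoding keeps a single column and maps each level to its position in the given order, while one-hot encoding produces one column per category and implies no order between categories.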