# **Modeling and Evaluation**

## Objectives

* Fit and evaluate the ML pipeline to predict attrition

## Inputs

* Dataset in outputs/datasets/collection/employee-attrition.csv

## Outputs

* TrainSet and TestSet
* Data cleaning and feature engineering pipeline
* Modeling pipeline

---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when you run a notebook in the editor you need to change the working directory

We change the working directory from the current folder to its parent folder
* We get the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
current_dir

'/workspace/attrition-predictor'

---

# Load the dataset

We want the pipeline itself to handle data cleaning, feature engineering and modeling. Therefore, we load the raw dataset from the collection folder and drop the variables identified in notebook 02.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

In [None]:
df = (pd.read_csv("outputs/datasets/collection/employee-attrition.csv")
        .drop(labels=['DailyRate','EmployeeCount', 'EmployeeNumber', 'HourlyRate',
                      'MonthlyRate', 'StandardHours', 'Over18'], axis=1))
df.head()

Next, we will:
* Split the dataset
* Create the data cleaning and feature engineering pipeline
* Create the modeling pipeline

---

# Split the dataset

In [13]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(['Attrition'], axis=1),
                                                    df['Attrition'],
                                                    test_size=0.2,
                                                    random_state=0,
                                                   )

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)


(1176, 27) (1176,) (294, 27) (294,)
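
The split above is not stratified, so the share of each Attrition class can differ between the train and test sets. The following is a minimal sketch of how one might check this, assuming a synthetic stand-in frame (`toy`, with a made-up `Age` column and an arbitrary 20/80 class mix) rather than the real dataset. Only the row count of 1470 and the split arguments are taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: same number of rows as the notebook's dataset,
# but the column values and the class mix are invented for illustration.
rng = np.random.default_rng(0)
n_rows = 1470
toy = pd.DataFrame({
    "Age": rng.integers(18, 60, size=n_rows),
    "Attrition": rng.choice(["Yes", "No"], size=n_rows, p=[0.2, 0.8]),
})

# Same split arguments as in the notebook cell above
X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop(["Attrition"], axis=1),
    toy["Attrition"],
    test_size=0.2,
    random_state=0,
)

# Class proportions in each split; a large gap suggests trying stratify=
train_share = y_tr.value_counts(normalize=True)
test_share = y_te.value_counts(normalize=True)
print(X_tr.shape, X_te.shape)
print(train_share)
print(test_share)
```

If the proportions diverge noticeably, passing `stratify=` with the target to `train_test_split` is one option to consider.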


---

# ML Pipelines

## Data cleaning and feature engineering pipeline

We create the data cleaning and feature engineering pipeline based on the conclusions from the previous notebook.

In [9]:
from sklearn.pipeline import Pipeline

# Feature Engineering
from feature_engine.selection import SmartCorrelatedSelection
from feature_engine.encoding import OrdinalEncoder
from feature_engine import transformation as vt


def PipelineDataCleaningAndFeatureEngineering():
    pipeline_base = Pipeline([
        ('yj', vt.YeoJohnsonTransformer(variables=['MonthlyIncome', 'TotalWorkingYears', 'YearsAtCompany']) ),
        ('OrdinalCategoricalEncoder', OrdinalEncoder(encoding_method='arbitrary',
                                                     variables=['BusinessTravel', 'Department',
                                                                'EducationField','Gender', 'JobRole',
                                                                'MaritalStatus', 'OverTime'])),

        ('SmartCorrelatedSelection', SmartCorrelatedSelection(method="spearman",
                                                                threshold=0.6,
                                                                selection_method="variance")),

    ])

    return pipeline_base


PipelineDataCleaningAndFeatureEngineering()


### Fit the pipeline

In [11]:
pipeline_data_cleaning_feat_eng = PipelineDataCleaningAndFeatureEngineering()
X_train = pipeline_data_cleaning_feat_eng.fit_transform(X_train)
X_test = pipeline_data_cleaning_feat_eng.transform(X_test)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(1176, 22) (1176,) (294, 22) (294,)
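
The drop from 27 to 22 columns comes from the correlation-based selection step. As a rough illustration of the idea behind it, the sketch below screens feature pairs by absolute Spearman correlation using only pandas. It is a simplified stand-in on invented data (`toy`, with hypothetical columns `a`, `b`, `c`), not feature_engine's actual algorithm, which also groups the correlated features and decides which one of each group to keep.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "b" is almost a rescaled copy of "a", "c" is independent
rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "a": base,
    "b": 2 * base + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})

# Absolute Spearman correlation between every pair of columns
corr = toy.corr(method="spearman").abs()

# Same threshold value as the pipeline step above
threshold = 0.6
cols = list(corr.columns)
pairs = [
    (cols[i], cols[j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if corr.iloc[i, j] > threshold
]
print(pairs)
```

Pairs above the threshold are candidates for pruning; which member of a pair is kept depends on the selection rule (variance, in the pipeline above).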


## Modeling and hyperparameter optimization pipeline

Here, the pipeline consists of:
* **StandardScaler**: rescales each feature to zero mean and a standard deviation of 1. It is applied to all variables. Scaling changes only the location and spread of a variable, not the shape of its distribution, so a skewed feature stays skewed.
* **SelectFromModel**: selects the relevant features for fitting. This is an embedded method, so feature selection happens during training. The estimator it uses is the algorithm of our choice.
* **model**: the ML algorithm
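
To make the StandardScaler step concrete, here is a minimal sketch on an invented right-skewed feature (a lognormal sample, not a column from the dataset). The helper `skewness` is a hypothetical name defined here for illustration. The sketch suggests that scaling sets the mean and standard deviation but leaves the skew untouched.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical right-skewed feature standing in for an income-like variable
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

scaled = StandardScaler().fit_transform(x)


def skewness(values):
    """Sample skewness: mean of the cubed standardized values."""
    flat = values.ravel()
    return float(np.mean(((flat - flat.mean()) / flat.std()) ** 3))


# Mean is ~0 and standard deviation is ~1 after scaling
print(scaled.mean(), scaled.std())
# Skewness is the same before and after scaling
print(skewness(x), skewness(scaled))
```

This is why the feature engineering pipeline applies a Yeo-Johnson transformation to selected variables before scaling: the scaler alone does not change the shape of a distribution.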

In [12]:
# Feat Scaling
from sklearn.preprocessing import StandardScaler
# Feat Selection
from sklearn.feature_selection import SelectFromModel
# ML algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier


def PipelineClf(model):
    pipeline_base = Pipeline([
        ("scaler", StandardScaler()),
        ("feature_selection", SelectFromModel(model)),
        ("model", model),
    ])

    return pipeline_base

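Show the features considered important for the given dataset by a given algorithm. The cell below assumes that pipeline is a PipelineClf(model) instance that has already been fitted on the train set.

In [7]:
# Assumption: `pipeline` is a PipelineClf(model) instance already fitted on (X_train, y_train)
X_train.columns[pipeline['feature_selection'].get_support()]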
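
The cell above needs a fitted pipeline. A self-contained sketch of the full sequence follows, assuming synthetic data from `make_classification` with hypothetical column names `feat_0` to `feat_7`, and a random forest as an arbitrary example model. The notebook itself does not state which algorithm was used here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def PipelineClf(model):
    # Same three steps as the notebook's PipelineClf
    return Pipeline([
        ("scaler", StandardScaler()),
        ("feature_selection", SelectFromModel(model)),
        ("model", model),
    ])


# Hypothetical toy data: 8 features, of which 3 carry signal
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

# Example model choice (an assumption, not the notebook's final model)
pipeline = PipelineClf(RandomForestClassifier(n_estimators=50, random_state=0))
pipeline.fit(X_demo, y_demo)

# Boolean mask over the input columns: True where SelectFromModel kept the feature
mask = pipeline["feature_selection"].get_support()
selected = X_demo.columns[mask]
print(list(selected))
```

With tree-based models, SelectFromModel by default keeps the features whose importance is at least the mean importance, so the selected set depends on the algorithm passed in.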

---

# Conclusions