In [15]:
import pandas as pd

#Model imports
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline

#model dumping
from joblib import dump
import os

In [12]:
#dataset
data = pd.read_csv('../data/data.csv')

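1. Random Forest classification: chosen on the assumption that it can cope with the imbalance between employees who left and those who stayed without resampling techniques such as oversampling or undersampling. The evaluation at the end of this notebook checks that assumption.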
2. Metrics: 
- accuracy
- ROC/AUC
- precision/recall
- f1
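
The notebook computes accuracy, precision, recall and F1 further down, but not ROC/AUC. As a minimal sketch, all of the listed metrics could be collected in one helper like the one below. The helper name `evaluate_binary` and the synthetic data from `make_classification` are assumptions for illustration, not part of the original analysis; the positive class is assumed to be labelled 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split


def evaluate_binary(model, X_test, y_test):
    """Return accuracy, ROC AUC, precision, recall and F1 for a fitted binary classifier."""
    y_pred = model.predict(X_test)
    # ROC AUC needs scores, not hard labels: use the probability of class 1.
    y_score = model.predict_proba(X_test)[:, 1]
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
    }


# Synthetic imbalanced data standing in for the real dataset (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)
demo_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

metrics = evaluate_binary(demo_model, X_te, y_te)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On the real data, the same helper could be called with the fitted model and the test split produced later in this notebook.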

In [14]:
# split the dataset into features (X) and target (y)
X = data.drop('Attrition', axis = 1)
y = data['Attrition']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) 

#model pipeline 
pipeline = Pipeline([
        ('rf', RandomForestClassifier(random_state=42))
])

# Define the grid of hyperparameters to search
param_grid = {
    'rf__n_estimators': [100, 200, 300],
    'rf__max_depth': [None, 5, 10],
    'rf__min_samples_split': [2, 5, 10],
    'rf__min_samples_leaf': [1, 2, 4]
}

# Create the grid search object
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5)

# Fit the grid search object to the data
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_

# Print the best hyperparameters
print(f"Best hyperparameters: {best_params}")


Best hyperparameters: {'rf__max_depth': 10, 'rf__min_samples_leaf': 4, 'rf__min_samples_split': 2, 'rf__n_estimators': 100}


After splitting the data into training and test sets, we use the RandomForestClassifier algorithm. Before fitting the final model, we use GridSearchCV to select the hyperparameters that score best under 5-fold cross-validation on the training set. We then fit the model with those hyperparameters and use it to predict on the test set.
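
The search-then-fit flow described above can be sketched on a small synthetic dataset. With the default `refit=True`, GridSearchCV retrains the best configuration on all the data passed to `fit`, so `best_estimator_` can predict straight away. The `scoring="f1"` and `StratifiedKFold` settings below are assumptions chosen for an imbalanced target; the notebook itself uses the default accuracy scoring with `cv=5`. All `demo_*` names and the reduced grid are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Synthetic imbalanced data standing in for the real dataset (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

demo_pipeline = Pipeline([("rf", RandomForestClassifier(random_state=0))])

# A deliberately small grid so the sketch runs quickly.
demo_grid = {"rf__n_estimators": [20, 40], "rf__max_depth": [None, 5]}

demo_search = GridSearchCV(
    demo_pipeline,
    param_grid=demo_grid,
    scoring="f1",  # assumption: rank candidates by F1 on class 1, not accuracy
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(round(demo_search.best_score_, 3))

# best_estimator_ was already refit on X_demo, y_demo by the search.
demo_preds = demo_search.best_estimator_.predict(X_demo[:5])
```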

In [17]:
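# best pipeline found by the grid search
# (already refit on X_train because GridSearchCV uses refit=True by default)
best_model = grid_search.best_estimator_

# fit on the training data again; redundant given the refit above, kept for explicitness
result = best_model.fit(X_train, y_train)

# predict on the held-out test set
y_pred = best_model.predict(X_test)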


So that the model can be reused later without retraining, we save the fitted pipeline to disk.

In [11]:
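folder = '../requirements'
# create the target folder if it does not exist yet, so dump() cannot fail on a missing path
os.makedirs(folder, exist_ok=True)
model_file = os.path.join(folder, 'best_model.joblib')
dump(best_model, model_file)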

['../requirements/best_model.joblib']
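
As a minimal sketch of the reuse step, a model saved with `dump` can be restored with `joblib.load` and used for prediction straight away. The example below uses a temporary folder and a small synthetic model rather than the real `best_model.joblib`; the `demo_*` names are hypothetical.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model standing in for the tuned pipeline (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
demo_model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_model.joblib")
    dump(demo_model, demo_path)      # save the fitted model
    restored_model = load(demo_path)  # load it back into memory

# The restored model should reproduce the original predictions.
same_predictions = (restored_model.predict(X_demo) == demo_model.predict(X_demo)).all()
print(same_predictions)
```

A joblib file should only be loaded from a trusted source, and ideally with the same scikit-learn version that produced it.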

In [18]:
# evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
# classification report (stored under a new name so the imported
# classification_report function is not overwritten when the cell is re-run)
report = classification_report(y_test, y_pred)

print(accuracy)
print(report)

0.8594104308390023
              precision    recall  f1-score   support

           0       0.87      0.98      0.92       380
           1       0.47      0.11      0.18        61

    accuracy                           0.86       441
   macro avg       0.67      0.55      0.55       441
weighted avg       0.82      0.86      0.82       441




Interpretation:

Class 0 (the majority class, employees who stayed): Precision is 87% and recall is 98%. The model finds almost every class 0 instance, and 87% of its class 0 predictions are correct.

Class 1 (the minority class, employees who left): Precision is 47% and recall is 11%. The model finds only about one in nine class 1 instances and labels the rest as class 0, and fewer than half of its class 1 predictions are correct.

Overall: The model's accuracy is 86%, which looks decent but is essentially what a model predicting class 0 for every instance would achieve (380/441, about 86%), because class 0 dominates the test set. The macro averages (0.67 precision, 0.55 recall, 0.55 F1-score) weight both classes equally and therefore expose the weak class 1 performance that accuracy hides.
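
The figures in the report above can be checked by hand. The sketch below takes the rounded per-class values printed in the report, derives the accuracy of a baseline that always predicts class 0, recomputes the macro averages, and reconstructs approximate confusion-matrix counts. The counts are inferred from rounded values, so they are an estimate rather than output taken from the notebook.

```python
# Rounded per-class values copied from the classification report above.
support = {0: 380, 1: 61}
precision = {0: 0.87, 1: 0.47}
recall = {0: 0.98, 1: 0.11}
f1 = {0: 0.92, 1: 0.18}

total = sum(support.values())

# Accuracy of a trivial model that predicts class 0 for every instance.
baseline_accuracy = support[0] / total


def macro_avg(per_class):
    """Unweighted mean over classes: each class counts equally."""
    return sum(per_class.values()) / len(per_class)


# Approximate confusion-matrix counts inferred from recall and support.
tp = round(recall[1] * support[1])  # class 1 correctly identified
fn = support[1] - tp                # class 1 predicted as class 0
tn = round(recall[0] * support[0])  # class 0 correctly identified
fp = support[0] - tn                # class 0 predicted as class 1
reconstructed_accuracy = (tp + tn) / total

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"macro precision: {macro_avg(precision):.3f}")
print(f"macro recall:    {macro_avg(recall):.3f}")
print(f"macro f1:        {macro_avg(f1):.3f}")
print(f"TP={tp} FN={fn} TN={tn} FP={fp}")
print(f"reconstructed accuracy: {reconstructed_accuracy:.4f}")
```

The reconstructed accuracy agrees with the printed value of 0.8594, which suggests the inferred counts are consistent with the report.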

In summary, the model performs well on class 0 but poorly on class 1. Random Forest on its own did not handle the class imbalance as assumed at the outset, so the model needs improvement, especially in correctly identifying instances of class 1.
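
One possible next step that still avoids resampling is the `class_weight="balanced"` option of RandomForestClassifier, which weights each class inversely to its frequency during training. The sketch below compares class 1 recall with and without it on synthetic data. It is an assumption that this would help on the attrition data; it has not been tested in this notebook, and the data and hyperparameters here are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the attrition dataset (assumption).
X_demo, y_demo = make_classification(
    n_samples=800, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

recalls = {}
for label, class_weight in [("default", None), ("balanced", "balanced")]:
    model = RandomForestClassifier(
        n_estimators=50,
        min_samples_leaf=4,
        class_weight=class_weight,  # None = all classes weighted equally
        random_state=0,
    )
    model.fit(X_tr, y_tr)
    # Recall on the minority class is the metric the model struggled with.
    recalls[label] = recall_score(y_te, model.predict(X_te), pos_label=1)

for label, value in recalls.items():
    print(f"class 1 recall ({label}): {value:.2f}")
```

Any gain in class 1 recall typically comes at some cost in class 1 precision, so both should be compared on the real test set before adopting the change.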