
<meta name="viewport" content="width=device-width, initial-scale=1">
<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.3/dist/css/bootstrap.min.css" rel="stylesheet">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.5.2/css/all.min.css">
<link rel="stylesheet" href="../static/css/styles.css">


        
<!-- <body> -->
<!-- Navigation-->
<nav class="navbar navbar-expand-lg navbar-light fixed-top" id="mainNav">
    <div class="container px-4 px-lg-5">
        <a class="navbar-brand" href="../index.html">Home</a>
        <button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbarResponsive" aria-controls="navbarResponsive" aria-expanded="false" aria-label="Toggle navigation">
            Menu
            <i class="fas fa-bars"></i>
        </button>
        <div class="collapse navbar-collapse" id="navbarResponsive">
            <ul class="navbar-nav ms-auto py-4 py-lg-0">
                <li class="nav-item"><a class="nav-link px-lg-3 py-3 py-lg-4" href="../index.html">Executive Summary</a></li>
                <li class="nav-item"><a class="nav-link px-lg-3 py-3 py-lg-4" href="eda.html">Exploratory Data Analysis</a></li>
                <!-- <li class="nav-item"><a class="nav-link px-lg-3 py-3 py-lg-4" href="models.html">Model Construction & Validation</a></li> -->
                <li class="nav-item"><a class="nav-link px-lg-3 py-3 py-lg-4" href="initial_work.html">Appendix: Model Development</a></li>
            </ul>
        </div>
    </div>
</nav>



<h2 id="title" style="text-align: center; width: 80%;">Model Construction and Validation</h2>

# Table of Contents

- [Description and Deliverables](#description-and-deliverables)
- [Data Dictionary](#data-dictionary)
- [Exploratory Data Analysis Insights](#insights)
- [Modeling Strategies Recap](#modeling-strategies-recap)
  - [Cross-Validation Results](#cross-validation-results)
- [Model Building](#model-building)

<div class="button">
    <a href="models.html">Back to Part 1<br>Exploratory Data Analysis</a>
</div>

<a id="description-and-deliverables"></a>

# Description and Deliverables
---

[Back to top](#)

The hypothetical HR department at the fictional Salifort Motors collected employee data to improve satisfaction. They requested data-driven suggestions based on an analysis of this data. The main question is: what factors are likely to make an employee leave the company?

The **goal** of this project is to **analyze the data** and build a model to **predict employee attrition**. By identifying which employees are likely to leave, it may be possible to determine the factors contributing to their departure. The model should be interpretable so HR can design targeted interventions to improve retention. Improving retention can reduce the costs associated with hiring and training new employees.

**Stakeholders:**  
The primary stakeholder is the Human Resources (HR) department, as they will use the results to inform retention strategies. Secondary stakeholders include C-suite executives who oversee company direction, managers implementing day-to-day retention efforts, employees (whose experiences and outcomes are directly affected), and, indirectly, customers—since employee satisfaction can impact customer satisfaction.

**Ethical Considerations:**  
- Ensure employee data privacy and confidentiality throughout the analysis.
- Avoid introducing or perpetuating bias in model predictions (e.g., not unfairly targeting specific groups).
- Maintain transparency in how predictions are generated and how they will be used in HR decision-making.

### This page covers the second part of the project: model construction and validation.

<a id="data-dictionary"></a>

# Data Dictionary
---

[Back to top](#)

The dataset contains 15,000 rows and 10 columns for the variables listed below. 

**Note:** For more information about the data, refer to its source on [Kaggle](https://www.kaggle.com/datasets/mfaisalqureshi/hr-analytics-and-job-prediction?select=HR_comma_sep.csv).

Variable|Description|
-----|-----|
satisfaction_level|Employee-reported job satisfaction level [0&ndash;1]|
last_evaluation|Score of employee's last performance review [0&ndash;1]|
number_project|Number of projects employee contributes to|
average_monthly_hours|Average number of hours employee worked per month|
time_spend_company|How long the employee has been with the company (years)|
Work_accident|Whether or not the employee experienced an accident while at work|
left|Whether or not the employee left the company|
promotion_last_5years|Whether or not the employee was promoted in the last 5 years|
Department|The employee's department|
salary|The employee's salary level (low, medium, or high)|

In [None]:
# Import packages
import time
import joblib
import os

import pandas as pd
import numpy as np
import math

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import plot_tree

from IPython.display import Image, display

from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_predict,
    GridSearchCV,
    RandomizedSearchCV,
    train_test_split,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix,
    fbeta_score,
)

In [None]:
# get initial time, for measuring performance at the end
nb_start_time = time.time()

In [None]:
# Load dataset into a dataframe
df0 = pd.read_csv("../resources/HR_capstone_dataset.csv")


# Display first few rows of the dataframe
# df0.head()

In [None]:
# Rename columns as needed
df0.rename(
    columns={
        "Department": "department",
        "Work_accident": "work_accident",
        "average_montly_hours": "average_monthly_hours",
        "time_spend_company": "tenure",
    },
    inplace=True,
)


# Display all column names after the update
# df0.columns

In [None]:
# Drop duplicates and save resulting dataframe in a new variable as needed
df = df0.drop_duplicates()


# Display first few rows of new dataframe as needed
# print(df.info())
# df.head()

In [None]:
# Determine the number of rows containing outliers
q1 = df.tenure.quantile(0.25)
q3 = df.tenure.quantile(0.75)
iqr = q3 - q1
upper_bound = q3 + 1.5 * iqr

# # Filter the dataframe to find outliers
# outliers = df[df.tenure > upper_bound]

# # Display the number of outliers
# print(f"Number of tenure outliers: {len(outliers)}")
# print(f"Outliers percentage of total: {len(outliers) / len(df) * 100:.2f}%")
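
The upper bound in the cell above follows the usual 1.5 × IQR rule. As a minimal sketch of the same arithmetic, here it is on made-up tenure values using only the standard library. The numbers are illustrative and not taken from the dataset, and `method="inclusive"` is chosen on the assumption that it matches the linear interpolation pandas uses by default.

```python
from statistics import quantiles

# Hypothetical tenure values in years (illustrative only, not from the dataset)
tenure = [2, 3, 3, 3, 4, 4, 5, 6, 10]

# "inclusive" interpolates linearly between the observed data points
q1, _, q3 = quantiles(tenure, n=4, method="inclusive")
iqr = q3 - q1
upper_bound = q3 + 1.5 * iqr

# Anything above the upper bound counts as an outlier
outliers = [t for t in tenure if t > upper_bound]
print(f"q1={q1}, q3={q3}, upper_bound={upper_bound}, outliers={outliers}")
```

Only the upper bound is used because tenure is skewed toward long-serving employees; a lower bound could be added the same way with `q1 - 1.5 * iqr`.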

<a id="insights"></a>

# Exploratory Data Analysis Insights
---

[Back to top](#)

> The data suggests significant issues with employee retention at this company. Two main groups of leavers emerge:
>
> - **Underworked and Dissatisfied:** Some employees worked on fewer projects and logged fewer hours than a standard work week, with below-average satisfaction. These individuals may have been disengaged, assigned less work as they prepared to leave, or possibly let go.
> - **Overworked and Burned Out:** Another group managed a high number of projects (up to 7) and worked exceptionally long hours (sometimes approaching 80-hour weeks). This group reported very low satisfaction and received few, if any, promotions.
>
> Most employees work well above a typical 40-hour work week (160–184 hours/month, 20-23 work days/month), indicating a culture of overwork. The lack of promotions and high workload likely contribute to dissatisfaction and attrition.
>
> **Employee evaluation scores** show only a weak relationship with attrition; both leavers and stayers have similar performance reviews. High-performing employees are not necessarily retained, especially if they are overworked or dissatisfied.
>
> Other variables—such as department, salary, and work accidents—do not show strong predictive value for employee churn compared to satisfaction and workload.
>
> Overall, the data points to management and workload issues as primary drivers of employee turnover.

<a id="modeling-strategies-recap"></a>

# Modeling Strategies Recap
---

[Back to top](#)

> **Observations from Baseline Model Building:**  
> - **Logistic Regression performed much worse** than tree-based models (recall: 0.24 vs. >0.90 for others). This suggests the relationship between features and attrition is highly non-linear, or that important interactions are not captured by a linear model.
> - **Tree-based models (Decision Tree, Random Forest, XGBoost) all performed very well** (recall >0.90, AUC >0.97), with XGBoost slightly ahead. This is surprisingly strong for a shallow Decision Tree (max depth 4), which may indicate that the data is easy to separate or a bit too “clean” (the dataset is synthetic).
> - **Confusion matrices show very few false negatives** for tree-based models, but Logistic Regression misses many true leavers.
>
> **Independent Variables Chosen:**  
> - All available features were included: satisfaction_level, last_evaluation, number_project, average_monthly_hours, tenure, work_accident, promotion_last_5years, salary (ordinal), and department (one-hot encoded).
> - This approach ensures the model can capture all possible relationships, especially since EDA showed satisfaction, workload, and tenure are strong predictors.
>
> **Model Assumptions Met:**  
> - **Logistic Regression:** Outliers were removed and features were scaled. The outcome is categorical and observations are independent (duplicates were dropped). Sample size is ample. Multicollinearity was checked with the heatmap at the end of the EDA. The poor performance suggests the linearity assumption is not met.
> - **Tree-based models:** No strong assumptions about feature scaling, linearity, or multicollinearity; these models are robust to the data structure provided.
>
> **Model Fit:**  
> - **Tree-based models fit the data extremely well** (recall, precision, and AUC all very high). This suggests strong predictive power, but also raises the possibility of overfitting.
> - **Logistic Regression fits poorly**, missing most true positives.
>
> **Potential Improvements:**  
> - **Feature engineering:** (Will do.) Create interaction terms or non-linear transformations (e.g., satisfaction × workload, tenure bins) to help linear models like Logistic Regression capture more complex relationships. Consider feature selection to remove redundant or less informative variables.
> - **Interpretability:** (Will do.) Use feature importance plots for tree-based models and SHAP values to explain individual predictions and overall model behavior. This will help stakeholders understand which factors drive attrition risk.
> - **Model validation:** (Done.) Rigorously check for data leakage by reviewing the entire data pipeline, ensuring all preprocessing steps are performed only on training data within cross-validation folds.
> - **Class imbalance:** (Might do.) Although recall is high, further address class imbalance by experimenting with resampling techniques (e.g., SMOTE, undersampling) or adjusting class weights, especially if the business wants to minimize false negatives.
> - **Alternative Models:** (Won't do anytime soon.) Try other algorithms (e.g., LightGBM, SVM, or neural networks) or ensemble approaches to see if performance or interpretability can be improved.
> - **Time series data:** (Don't have it.) If this were real-world data, it would be useful to track changes over time in work satisfaction, performance reviews, workload, promotions, absences, etc.
>
> **Ethical Considerations:**  
> - Ensure predictions are used to support employees (e.g., for retention efforts), not for punitive actions.
> - Ensure the model does not unfairly target or disadvantage specific groups (e.g., by department, salary, or tenure).
> - Clearly communicate how predictions are made and how they will be used by HR.
> - Protect employee data and avoid using sensitive or personally identifiable information.
> - Regularly audit the model for bias and unintended consequences after deployment.
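
The data leakage point above comes down to fitting every preprocessing step on the training portion of each fold only. Below is a minimal sketch of that idea in plain Python. The helper names `fit_scaler` and `apply_scaler` and the toy values are assumptions for illustration; in the notebook itself, placing the scaler inside a `Pipeline` is what takes care of this.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn scaling parameters (mean, population std) from training rows only."""
    return mean(values), pstdev(values)


def apply_scaler(values, params):
    """Scale values with parameters that were learned elsewhere."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


# Hypothetical average monthly hours for one train / validation fold
train_hours = [150, 160, 170, 180, 190]
valid_hours = [140, 300]

params = fit_scaler(train_hours)                  # fit on training rows only
train_scaled = apply_scaler(train_hours, params)
valid_scaled = apply_scaler(valid_hours, params)  # reuse the parameters, never refit

print(params)
print(valid_scaled)
```

If the scaler were instead fitted on all rows, the extreme validation value would shift the mean and standard deviation used for training, which is the kind of leakage the review was meant to rule out.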

<a id="cross-validation-results"></a>

## Cross-Validation Results


[Back to top](#)

In [None]:
# display base model results df
# df_base = pd.read_csv("../results/base_model_evaluation_results.csv")
# df_base[["model", "recall", "precision", "f1", "accuracy", "roc_auc", "features", "search_time"]]


In [None]:
# show confusion matrix exemplar
display(Image(filename="../resources/images/confusion_matrix_exemplar.png", width=800))

In [None]:
# display confusion matrix for base model
display(Image(filename="../results/images/base_model_confusion_matrices_confusion_grid.png", width=800))

In [None]:
# display all model evaluation results
df_results = pd.read_csv("../results/all_model_evaluation_results.csv") 
df_results = df_results.rename(
    columns={
        "model": "Model",
        "recall": "Recall",
        "f1": "F1 Score",
        "roc_auc": "ROC AUC",
        "precision": "Precision",
        "accuracy": "Accuracy",
        "features": "Num Features",
        "best_params": "Best Params",
        "cv_best_score": "CV Best Score",
        "conf_matrix": "Confusion Matrix",
        "search_time": "Search Time (s)",
    }
)
df_results[["Model", "Recall", "Precision", "F1 Score", "Accuracy", "ROC AUC", "Num Features", "Confusion Matrix", "Search Time (s)"]]
# df_results

In [None]:
# display confusion matrix for all models, top 9
display(Image(filename="../results/images/top_model_confusion_matrices_confusion_grid.png", width=800))

<a id="model-building"></a>

## Model Building
---

[Back to top](#)


# Plan

One section for each type of model (logistic regression, decision tree, random forest, XGBoost). Each section will:

- show the results dataframe for the versions of that type of model
- give a rationale for which version of the model is chosen (and show its confusion matrix)
- run it through the cross-validation model evaluation function again and save the model, using the exact same Pipeline and random_state
- train the saved model on all of X_train (and save it again)
- run the model on X_test
- compare the training and testing confusion matrices, plot feature importance, PR / ROC curves, etc.
- interpret the model: how it makes decisions and the business implications

At the end, summarize all four model types with a summary table or bullet points comparing recall, precision, F1, and interpretability.

#### **Choose evaluation metric**

While **ROC AUC** is a common metric for evaluating binary classifiers—offering a threshold-independent measure of how well the model distinguishes between classes—it is **not ideal for imbalanced problems like employee churn**, where the positive class (those likely to leave) is much smaller and more critical to identify.

During model development, I did review ROC AUC to get a general sense of model discrimination. However, for **model selection and tuning**, I ultimately prioritized **recall**. A high recall ensures that we identify as many at-risk employees as possible, aligning with the company's goal to support retention through early intervention. Missing a potential churner (a false negative) is generally more costly than mistakenly flagging someone who is not at risk (a false positive), especially when interventions are supportive rather than punitive.

While precision is also important—since too many false positives could dilute resources or create unnecessary concern—recall is more aligned with a **proactive retention strategy**. This tradeoff assumes that HR interventions are constructive and that the company has systems in place to act ethically on model outputs.
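
To make the recall and precision tradeoff concrete, here is a small sketch that derives both from confusion-matrix counts. The counts are hypothetical and chosen only to illustrate the arithmetic; they are not results from this project.

```python
# Hypothetical confusion-matrix counts (illustrative only, not project results)
tn, fp, fn, tp = 1900, 100, 20, 380

recall = tp / (tp + fn)      # share of actual leavers that the model flagged
precision = tp / (tp + fp)   # share of flagged employees who actually left

print(f"recall={recall:.3f}, precision={precision:.3f}")
print(f"missed leavers (false negatives): {fn}")
print(f"unnecessary flags (false positives): {fp}")
```

In this made-up case, prioritizing recall means accepting 100 unnecessary flags in exchange for missing only 20 leavers.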

To avoid unintended harm, I recommend implementing **clear usage guidelines** and **transparency** measures, ensuring that predictions are used to help employees, not penalize them. Calibration and regular fairness audits should accompany any deployment of the model.

#### **Evaluation Tie Breaker**

One final twist (I hope). I made the classic mistake of not defining success clearly and rigidly up front, and I now have a number of models that are all excellent at recall, hovering in the 0.93-0.96 range. So I am making a post-hoc call, but at least I am fixing the rule before applying it. The best model of each type will be chosen using the following tie-breakers, in order:
- recall > 0.935
- F2 > 0.85 (F2 is the F-beta score with beta = 2, a harmonic mean that weights recall at 80% and precision at 20%)
- fewest features
- highest F2
- highest precision

I hope I can make a choice by then; there can't be that many models. I... hehehe... predict... that I'll have it by number three, fewest features.
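
The tie-breakers above can be sketched as two threshold filters followed by a sort. The candidate names and scores below are hypothetical (not real results), and the helper `fbeta` is written out by hand only to show the formula; the notebook itself uses scikit-learn's `fbeta_score`.

```python
def fbeta(precision, recall, beta=2.0):
    """F-beta score: weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical candidates (names and scores are illustrative, not real results)
candidates = [
    {"model": "A", "recall": 0.950, "precision": 0.80, "features": 18},
    {"model": "B", "recall": 0.940, "precision": 0.85, "features": 6},
    {"model": "C", "recall": 0.930, "precision": 0.95, "features": 4},
    {"model": "D", "recall": 0.960, "precision": 0.50, "features": 5},
]
for c in candidates:
    c["f2"] = fbeta(c["precision"], c["recall"])

# Tie-breakers 1 and 2 are hard thresholds
eligible = [c for c in candidates if c["recall"] > 0.935 and c["f2"] > 0.85]

# Tie-breakers 3 to 5: fewest features, then highest F2, then highest precision
eligible.sort(key=lambda c: (c["features"], -c["f2"], -c["precision"]))

best = eligible[0]
print(best["model"], round(best["f2"], 3))
```

With beta = 2 the reciprocal of the score is 0.2 / precision + 0.8 / recall, which is where the 80% / 20% weighting comes from.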

In [None]:
# set evaluation metric
scoring = "recall"


# for XGBoost eval_metric
def get_xgb_eval_metric(scoring):
    mapping = {
        "roc_auc": "auc",  # area under ROC curve
        "accuracy": "error",  # classification error rate
        "f1": "logloss",  # logarithmic loss (not F1, but closest available)
        "precision": "logloss",  # no direct precision metric, logloss is a common fallback
        "recall": "logloss",  # no direct recall metric, logloss is a common fallback
    }
    return mapping.get(scoring, "auc")  # default to 'auc' if not found

In [None]:
# encode categorical variables


# copy the dataframe to avoid modifying the original
df_enc = df.copy()

# encode salary as ordinal
df_enc["salary"] = df_enc["salary"].map({"low": 0, "medium": 1, "high": 2})

# encode department as dummies
df_enc = pd.get_dummies(df_enc, columns=["department"])

# confirm the changes
# print("Original salary values:\n", df["salary"].value_counts())
# print("\nEncoded salary values:\n", df_enc["salary"].value_counts())
# df_enc.columns

In [None]:
# split the data into train / test sets for different models

# One set for tree-based models (decision tree, random forest, XGBoost)
# Another set for logistic regression (which must have outliers removed and data normalized)
# Stratify the target variable each time to account for class imbalance.



# split the data into features and target variable for tree-based models
X = df_enc.drop(columns=["left"])
y = df_enc["left"]

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# scale_pos_weight for XGBoost (ratio of negative to positive class in training set)
scale_pos_weight_value = (y_train == 0).sum() / (y_train == 1).sum()
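
As a rough illustration of what stratifying the split achieves, the sketch below samples each class separately so that the train and test sets keep the same class ratio, then computes the negative-to-positive ratio used for `scale_pos_weight`. The function `stratified_split` is a simplified stand-in written for this example; it is not how `train_test_split` is implemented, and the labels are made up.

```python
import random
from collections import Counter


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 100 stayers (0) and 20 leavers (1)
y = [0] * 100 + [1] * 20
train_idx, test_idx = stratified_split(y)

train_counts = Counter(y[i] for i in train_idx)
test_counts = Counter(y[i] for i in test_idx)

# Same ratio the notebook passes to XGBoost: negatives / positives in training
scale_pos_weight = train_counts[0] / train_counts[1]
print(train_counts, test_counts, scale_pos_weight)
```

Because both splits keep the 5:1 ratio, the weight computed on the training set also describes the test set.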

In [None]:
# split the data into features and target variable for logistic regression
# remove outliers from tenure for logistic regression
df_enc_lr = df_enc.copy()

# upper_bound was defined way up above, at the end of the initial data
# exploration and cleaning; the code is not needed here, but is copied for reference
# q1 = df.tenure.quantile(0.25)
# q3 = df.tenure.quantile(0.75)
# iqr = q3 - q1
# upper_bound = q3 + 1.5 * iqr

# remove outliers (outliers are rows with tenure > upper_bound, so keep tenure <= upper_bound)
df_enc_lr = df_enc_lr[df_enc_lr.tenure <= upper_bound]

X_lr = df_enc_lr.drop(columns=["left"])
y_lr = df_enc_lr["left"]

# split the data into training and testing sets for logistic regression
X_train_lr, X_test_lr, y_train_lr, y_test_lr = train_test_split(
    X_lr, y_lr, test_size=0.2, random_state=42, stratify=y_lr
)

In [None]:
# Functions to make models, run models, plot confusion matrices and feature importances

# build models_config for run_model_evaluation

def make_models_config(
    models,
    X_train,
    y_train,
    feature_func=None,  # can be a function, list, dict, or None
    param_grids=None,
    scaler=None,
    name_suffix="",
):
    """
    Build models_config for run_model_evaluation.
    - models: dict of {name: estimator}
    - X_train, y_train: training data
    - feature_func: function, list of functions, dict of {name: func}, or None
    - param_grids: dict of {name: param_grid} (or None for empty)
    - scaler: sklearn transformer (e.g., StandardScaler) or None
    - name_suffix: string to append to model name
    """
    configs = []
    for name, model in models.items():
        # order of steps matters, features first, then scaler, then model
        steps = []

        # determine which feature_func to use for this model
        func = None
        if isinstance(feature_func, dict):  # dict of {name: func}
            func = feature_func.get(name)
        elif callable(feature_func) or isinstance(feature_func, list):
            func = feature_func
        # handles a list of feature functions (apply in sequence), or a single function
        if func is not None:
            if isinstance(func, list):
                for i, f in enumerate(func):
                    steps.append((f"features_{i+1}", FunctionTransformer(f)))
            else:
                steps.append(("features", FunctionTransformer(func)))

        # add scaler if provided
        if scaler is not None:
            steps.append(("scaler", scaler))

        # add model
        steps.append(("model", model))

        # create the pipeline
        pipe = Pipeline(steps)

        # add parameter grid if provided
        param_grid = {}
        if isinstance(param_grids, dict):
            param_grid = param_grids.get(name, {})

        # add model configuration to the list
        configs.append(
            {
                "name": f"{name}{name_suffix}",
                "X_train": X_train,
                "y_train": y_train,
                "pipeline": pipe,
                "param_grid": param_grid,
            }
        )
    return configs
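
As a minimal sketch of the pipeline shape that `make_models_config` assembles (features first, then scaler, then model), the example below builds one such pipeline by hand on toy data. The feature function `add_squared_column` and the random data are assumptions made for illustration; they are not part of the project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def add_squared_column(X):
    """Hypothetical feature function: append the square of the first column."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([X, X[:, 0] ** 2])


# Toy data: the label is 1 when the single feature is far from zero
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 1))
y = (np.abs(X[:, 0]) > 1).astype(int)

# Order matters: engineered features are created before they are scaled
pipe = Pipeline(
    [
        ("features", FunctionTransformer(add_squared_column)),
        ("scaler", StandardScaler()),
        ("model", LogisticRegression()),
    ]
)
pipe.fit(X, y)

# Everything before the final estimator can be sliced off and used to transform
X_transformed = pipe[:-1].transform(X)
print(list(pipe.named_steps), X_transformed.shape)
```

Transforming with every step except the last is also how the number of features after feature engineering can be counted.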

In [None]:
# run model evaluation function


def run_model_evaluation(
    models_config,
    results_df=None,
    scoring="recall",
    save_model=False,
    search_type="grid",
    n_iter=20,
):
    """
    Run model training and evaluation for a list of model configurations using cross-validated hyperparameter search.

    For each model configuration, performs hyperparameter tuning (GridSearchCV or RandomizedSearchCV),
    fits the best pipeline, evaluates cross-validated performance metrics, and optionally saves the best model.

    Parameters:
        models_config (list of dict): List of model configurations, each containing:
            - 'name': Model name (str)
            - 'X_train': Training features (pd.DataFrame or np.ndarray)
            - 'y_train': Training labels (pd.Series or np.ndarray)
            - 'pipeline': sklearn Pipeline object
            - 'param_grid': dict of hyperparameters for search
        results_df (pd.DataFrame or None): Existing results DataFrame to append to, or None to create a new one.
        scoring (str): Scoring metric for model selection (e.g., 'recall', 'accuracy', 'roc_auc').
        save_model (bool): If True, saves the best model pipeline to disk for each configuration.
        search_type (str): 'grid' for GridSearchCV, 'random' for RandomizedSearchCV.
        n_iter (int): Number of parameter settings sampled for RandomizedSearchCV (ignored for grid search).

    Returns:
        pd.DataFrame: Results DataFrame with model name, metrics (recall, f1, roc_auc, precision, accuracy),
                      number of features, best hyperparameters, best CV score, confusion matrix, and search time.

    Notes:
        - Uses stratified 5-fold cross-validation for both hyperparameter search and out-of-fold predictions.
        - Calculates metrics on cross-validated predictions for robust performance estimates.
        - Handles models that do not support predict_proba for ROC AUC gracefully.
        - Saves models to '../results/saved_models/' if save_model=True.
    """
    if results_df is None:
        results_df = pd.DataFrame(
            columns=[
                "model",
                "recall",
                "f2",  # 80% recall, 20% precision (metric created to weigh recall more heavily)
                "f1", # 50% recall, 50% precision
                "roc_auc",
                "precision",
                "accuracy",
                "features",
                "best_params",
                "cv_best_score",
                "conf_matrix",
                "search_time",
            ]
        )

    # ensure cross-validation is stratified for balanced class distribution
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    for cfg in models_config:
        # time the model training and evaluation
        start_time = time.time()
        print(f"Running model: {cfg['name']}...")

        # conditional to choose search type, instantiate the appropriate search class
        if search_type == "random":
            grid = RandomizedSearchCV(
                cfg["pipeline"],
                cfg["param_grid"],
                n_iter=n_iter,
                cv=cv,
                scoring=scoring,
                n_jobs=-1,
                verbose=2,
                random_state=42,
            )
        else:
            grid = GridSearchCV(
                cfg["pipeline"],
                cfg["param_grid"],
                cv=cv,
                scoring=scoring,
                n_jobs=-1,
                verbose=2,
            )

        # fit the grid search to the training data
        grid.fit(cfg["X_train"], cfg["y_train"])

        # print the execution time
        end_time = time.time()
        search_time = end_time - start_time
        print(f"Execution time for {cfg['name']}: {search_time:.2f} seconds")

        # get the best model and its parameters
        best_model = grid.best_estimator_
        print(f"Best parameters for {cfg['name']}: {grid.best_params_}")
        print(f"Best score for {cfg['name']}: {grid.best_score_:.4f} ({scoring})")

        # --- get the number of features after all pipeline steps ---
        # try to transform X_train through all steps except the final estimator
        try:
            if hasattr(best_model, "named_steps"):
                # Remove the final estimator step
                steps = list(best_model.named_steps.items())
                if len(steps) > 1:
                    # Remove last step (the model)
                    feature_pipeline = Pipeline(steps[:-1])
                    X_transformed = feature_pipeline.transform(cfg["X_train"])
                    n_features = X_transformed.shape[1]
                else:
                    n_features = cfg["X_train"].shape[1]
            else:
                n_features = cfg["X_train"].shape[1]
        except Exception as e:
            print(f"Could not determine number of features: {e}")
            n_features = cfg["X_train"].shape[1]

        # conditional to save the best model
        if save_model:
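            model_dir = "../results/saved_models"
            os.makedirs(model_dir, exist_ok=True)  # create the folder on first run
            model_path = os.path.join(model_dir, f"{cfg['name'].replace(' ', '_').lower()}.joblib")
            joblib.dump(best_model, model_path)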
            print(f"Model {cfg['name']} saved successfully.\n")
        else:
            print(f"Model {cfg['name']} not saved. Set save_model=True to save it.\n")

        # make predictions using cross-validation to generate out-of-fold predictions for each training sample
        # translation:
        # substitute for setting aside a validation set
        # takes more time, but provides better estimates of model performance
        # it makes a prediction for each sample in the training set, using a different fold of the data for each prediction...
        # ...the fold where the sample is not included in the 80% training set (the sample is in the 20%)
        y_pred = cross_val_predict(
            best_model, cfg["X_train"], cfg["y_train"], cv=cv, n_jobs=-1
        )

        # # check misclassified cases for further analysis
        # print(f"Misclassified cases for {cfg['name']}:")
        # misclassified = cfg['X_train'].copy()
        # misclassified['actual'] = cfg["y_train"]
        # misclassified['predicted'] = y_pred
        # misclassified = misclassified[misclassified['actual'] != misclassified['predicted']]

        # # Show counts of each type of misclassification
        # counts = misclassified.groupby(['actual', 'predicted']).size().rename('count')
        # print("\nMisclassification counts:")
        # print(counts)
        # print()

        # # Show .describe() for each group, side by side
        # pd.set_option('display.max_columns', None)
        # for (actual, predicted), group in misclassified.groupby(['actual', 'predicted']):
        #     label_map = {0: "Stayed", 1: "Left"}
        #     print(f"--- Misclassified: Actual={label_map.get(actual, actual)}, Predicted={label_map.get(predicted, predicted)} (n={len(group)}) ---")
        #     print(group.describe().T)
        #     print()
        # pd.reset_option('display.max_columns')
        # print("\n")

        # calculate the ROC AUC score, need predicted probabilities (not just class labels, but confidence in those labels)
        # try / except block to handle models that do not support predict_proba (e.g., SVC)
        try:
            y_proba = cross_val_predict(
                best_model,
                cfg["X_train"],
                cfg["y_train"],
                cv=cv,
                method="predict_proba",
                n_jobs=-1,
            )
            roc_auc = roc_auc_score(cfg["y_train"], y_proba[:, 1])
        except (AttributeError, ValueError):
            roc_auc = np.nan
            print(f"Model {cfg['name']} does not support predict_proba.")

        # save results in the results dataframe
        results_df.loc[len(results_df)] = {
            "model": cfg["name"],
            "features": n_features,
            "accuracy": accuracy_score(cfg["y_train"], y_pred),
            "precision": precision_score(cfg["y_train"], y_pred),
            "recall": recall_score(cfg["y_train"], y_pred),
            "f1": f1_score(cfg["y_train"], y_pred),
            "f2": fbeta_score(cfg["y_train"], y_pred, beta=2), # 80% recall, 20% precision (ratio is "beta squared : 1", b^2:1, 2^2:1, 4:1)
            "roc_auc": roc_auc,
            "conf_matrix": confusion_matrix(cfg["y_train"], y_pred).tolist(),
            "best_params": grid.best_params_,
            "cv_best_score": grid.best_score_,
            "search_time": search_time,
        }

    return results_df
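
The out-of-fold predictions used in the function above can be illustrated without any library: every sample is predicted exactly once, by a model that never saw it during fitting. In this sketch the "model" is a toy majority-class rule and the folds are simple interleaved groups; both are assumptions for illustration, whereas the notebook uses the real pipeline with a shuffled `StratifiedKFold`.

```python
from collections import Counter


def majority_class(labels):
    """Toy 'model': always predict the most common training label."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical labels for ten training samples
y = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
n_splits = 5

# Simple interleaved folds: sample i belongs to fold i % n_splits
folds = [[i for i in range(len(y)) if i % n_splits == k] for k in range(n_splits)]

oof_pred = [None] * len(y)
for held_out in folds:
    # "Fit" on the other four folds only
    train_labels = [y[i] for i in range(len(y)) if i not in held_out]
    prediction = majority_class(train_labels)
    for i in held_out:
        oof_pred[i] = prediction  # each sample is predicted exactly once

print(oof_pred)
```

Metrics computed on `oof_pred` therefore estimate performance on unseen data while still using every training row.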

In [None]:
# plot confusion matrices from dataframe

def plot_confusion_from_results(results_df, save_png=False):
    """Plots SINGLE confusion matrices from results dataframe and optionally saves png."""

    class_labels = ["Stayed", "Left"]

    for idx, row in results_df.iterrows():
        cm = row["conf_matrix"]
        model_name = row["model"]

        plt.figure(figsize=(5, 4))
        sns.heatmap(
            cm,
            annot=True,
            fmt="d",
            cmap="Blues",
            cbar=False,
            xticklabels=class_labels,
            yticklabels=class_labels,
        )
        plt.title(f"Confusion Matrix: {model_name}")
        plt.xlabel("Predicted")
        plt.ylabel("Actual")
        plt.tight_layout()

        # conditional to save the confusion matrix as a PNG file
        if save_png:
            plt.savefig(
                f"../results/images/{model_name.replace(' ', '_').lower()}_confusion_matrix.png"
            )

        plt.show()

In [None]:
# plot confusion matrix grid from dataframe

def plot_confusion_grid_from_results(results_df, png_title=None):
    """Plots ALL confusion matrices from results_df IN A GRID and optionally saves png."""
    import math

    class_labels = ["Stayed", "Left"]
    n_models = len(results_df)
    n_cols = 2 if n_models <= 4 else 3
    n_rows = math.ceil(n_models / n_cols)

    fig, axes = plt.subplots(n_rows, n_cols, figsize=(5 * n_cols, 4 * n_rows))
    # plt.subplots returns an array whenever n_cols > 1 (even for one model), so always flatten it
    axes = np.atleast_1d(axes).flatten()

    for idx, (i, row) in enumerate(results_df.iterrows()):
        cm = row["conf_matrix"]
        model_name = row["model"]
        ax = axes[idx]
        sns.heatmap(
            cm,
            annot=True,
            fmt="d",
            cmap="Blues",
            cbar=False,
            xticklabels=class_labels,
            yticklabels=class_labels,
            ax=ax,
        )
        ax.set_title(f"{model_name}")
        ax.set_xlabel("Predicted")
        ax.set_ylabel("Actual")

    # hide any unused subplots
    for j in range(idx + 1, len(axes)):
        fig.delaxes(axes[j])

    plt.tight_layout()

    # conditional to save the confusion grid as a PNG file
    if png_title:
        plt.suptitle(png_title, fontsize=16, y=1.02)
        plt.savefig(
            f"../results/images/{png_title.replace(' ', '_').lower()}_confusion_grid.png",
            bbox_inches="tight",  # keep the suptitle (placed at y=1.02) inside the saved image
        )

    plt.show()

In [None]:
# plot feature importances function

def load_and_plot_feature_importance(
    file_name, model_name, feature_names, top_n=10, save_png=False
):
    """Load a model and plot its feature importance, optionally saves png."""

    # load model
    model_path = os.path.join("../results/saved_models", file_name)
    model = joblib.load(model_path)

    # if model is a pipeline, get the estimator
    if hasattr(model, "named_steps"):
        # for logistic regression, get the scaler's feature names if available
        # NOTE: StandardScaler does not change feature names, so X_train_lr.columns is correct here
        # if using a transformer that changes the feature set (e.g., OneHotEncoder, ColumnTransformer)...
        # ...one would need to extract the transformed feature names from the transformer
        estimator = model.named_steps["model"]
    # if model is not a pipeline, use it directly (irrelevant for this case, but included for future-proofing)
    else:
        estimator = model

    # get importances
    # for tree-based models, use feature_importances_ or coef_ for logistic regression
    if hasattr(estimator, "feature_importances_"):
        importances = estimator.feature_importances_
        title = "Feature Importance"
    elif hasattr(estimator, "coef_"):
        importances = np.abs(estimator.coef_[0])
        title = "Absolute Coefficient Magnitude"
    else:
        print(f"Model {model_name} does not support feature importance.")
        return

    # sort and select top N
    indices = np.argsort(importances)[::-1][:top_n]

    plt.figure(figsize=(8, 5))
    plt.barh(np.array(feature_names)[indices][::-1], importances[indices][::-1])
    plt.xlabel(title)
    plt.title(f"{model_name}: Top {top_n} Features")
    plt.tight_layout()

    # conditional to save the feature importance plot as a PNG file
    if save_png:
        plt.savefig(
            f"../results/images/{model_name.replace(' ', '_').lower()}_feature_importance.png"
        )

    plt.show()
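
The ranking step in `load_and_plot_feature_importance` can be tried in isolation. The sketch below applies the same `np.abs` and `np.argsort` pattern to made-up coefficients; the values are illustrative only, and the feature names are borrowed from this dataset just for readability.

```python
import numpy as np

# made-up coefficients for four features (not fitted values)
feature_names = ["satisfaction_level", "last_evaluation", "tenure", "work_accident"]
coefs = np.array([-2.0, 0.5, 1.2, -0.1])

importances = np.abs(coefs)  # magnitude only: the sign (direction of effect) is dropped
top_n = 2
indices = np.argsort(importances)[::-1][:top_n]  # largest magnitude first

top_features = [(feature_names[i], float(importances[i])) for i in indices]
print(top_features)
```

Because the absolute value is taken, a strongly negative coefficient ranks as high as a strongly positive one, so the resulting plot shows strength of association, not direction.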

<a id="logistic regression"></a>

## Logistic Regression

[Back to top](#)

In [None]:
# display model evaluation results for all logistic regression models
df_lr = df_results[df_results["Model"].str.contains("Logistic Regression")]
if df_lr.empty:
    print("No logistic regression models found in results.")

# print("Logistic Regression Model Evaluation Results:")
df_lr[["Model", "Recall", "Precision", "F1 Score", "Accuracy", "ROC AUC", "Num Features", "Confusion Matrix"]]

In [None]:
# define logistic regression base model and its parameter grid
lr_base_model = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42)
}
lr_base_param_grid = {
    "Logistic Regression": {
        "model__C": [
            0.01,
            0.1,
            1.0,
            10.0,
            100.0,
        ],  # regularization strength (inverse): smaller = stronger regularization
        "model__penalty": ["l1", "l2"],  # regularization type (L1 = Lasso, L2 = Ridge)
        "model__solver": [
            "liblinear"
        ],  # optimization algorithm (liblinear supports L1/L2)
        "model__class_weight": [
            None,
            "balanced",
        ],  # handle class imbalance (None = no adjustment, balanced = adjust weights inversely proportional to class frequencies)
    }
}

# create models_config for base logistic regression model
lr_base_config = make_models_config(
    models=lr_base_model,
    X_train=X_train_lr,  # training features for logistic regression (outliers removed, to be scaled)
    y_train=y_train_lr,  # training target labels for logistic regression (outliers removed)
    feature_func=None,  # no additional features for base model
    param_grids=lr_base_param_grid,
    scaler=StandardScaler(),  # scale features for logistic regression
    name_suffix=" (base)",
)
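
For reference, scikit-learn documents `class_weight="balanced"` as `n_samples / (n_classes * np.bincount(y))`. A minimal pure-Python sketch of that formula follows, using a made-up label split. The `balanced_class_weights` helper is hypothetical and only mirrors the documented formula.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weights inversely proportional to class frequency (the 'balanced' heuristic)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# made-up imbalance: 83 stayed (0), 17 left (1)
y_toy = [0] * 83 + [1] * 17
weights = balanced_class_weights(y_toy)

for label, weight in sorted(weights.items()):
    print(label, round(weight, 3))
```

Each class ends up with the same total weight (count times weight), which is why the minority class gets the larger per-sample weight.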

In [None]:
# define tree-based base models and their parameter grids
dt_base_model = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}
rf_xgb_base_models = {
    "Random Forest": RandomForestClassifier(
        random_state=42, n_jobs=-1
    ),  # n_jobs=-1 uses all available cores, makes training faster
    "XGBoost": XGBClassifier(
        eval_metric=get_xgb_eval_metric(scoring), random_state=42, n_jobs=-1
    ),  # eval_metric: sets evaluation metric for XGBoost (e.g., 'auc', 'logloss')
}
tree_base_param_grids = {
    "Decision Tree": {
        "model__max_depth": [4, 6, 8, None],  # max tree depth (None = unlimited)
        "model__min_samples_leaf": [1, 2, 5],  # min samples required at a leaf node
        "model__min_samples_split": [
            2,
            4,
            6,
        ],  # min samples required to split an internal node
        "model__class_weight": [
            None,
            "balanced",
        ],  # handle class imbalance (None = no weighting, 'balanced' = automatic weighting)
    },
    "Random Forest": {
        "model__n_estimators": [300, 500],  # number of trees in the forest
        "model__max_depth": [3, 5, None],  # max depth of each tree (None = unlimited)
        "model__max_features": [
            "sqrt",
            1.0,
        ],  # number of features to consider at each split (sqrt = square root, 1.0 = all)
        "model__max_samples": [
            0.7,
            1.0,
        ],  # fraction of samples to train each tree (0.7 = 70%, 1.0 = 100%)
        "model__min_samples_leaf": [1, 2, 3],  # min samples at a leaf node
        "model__min_samples_split": [2, 3, 4],  # min samples to split a node
        "model__class_weight": [
            None,
            "balanced",
        ],  # handle class imbalance (None = no weighting, 'balanced' = automatic weighting)
    },
    "XGBoost": {
        "model__n_estimators": [100, 300],  # number of boosting rounds (trees)
        "model__max_depth": [3, 5, 7],  # max tree depth for base learners
        "model__learning_rate": [
            0.01,
            0.1,
            0.2,
        ],  # step size shrinkage (lower = slower, more robust, less overfitting, but more trees / training time)
        "model__subsample": [
            0.6,
            0.8,
            1.0,
        ],  # fraction of samples used per tree (row sampling)
        "model__colsample_bytree": [
            0.6,
            0.8,
            1.0,
        ],  # fraction of features used per tree (column sampling)
        "model__min_child_weight": [
            1,
            5,
            10,
        ],  # min sum of instance weight needed in a child (higher = fewer larger leaves, less overfitting) (like min_samples_leaf)
        "model__gamma": [
            0,
            0.1,
            0.2,
        ],  # min loss reduction required to make a split (higher = fewer splits, less overfitting)
        "model__scale_pos_weight": [
            1,
            scale_pos_weight_value,
        ],  # try 1 and the calculated value for class imbalance
    },
}

# create models_config for base tree-based models
dt_base_config = make_models_config(
    models=dt_base_model,
    X_train=X_train,  # training features for tree-based models
    y_train=y_train,  # training target labels for tree-based models
    feature_func=None,  # no additional features for base model
    param_grids=tree_base_param_grids,
    scaler=None,  # no scaling needed for tree-based models
    name_suffix=" (base)",
)
rf_xgb_base_configs = make_models_config(
    models=rf_xgb_base_models,
    X_train=X_train,  # training features for tree-based models
    y_train=y_train,  # training target labels for tree-based models
    feature_func=None,  # no additional features for base model
    param_grids=tree_base_param_grids,
    scaler=None,  # no scaling needed for tree-based models
    name_suffix=" (base)",
)

#### **Run the baseline models**

In [None]:
# run base model evaluation
results_df = run_model_evaluation(lr_base_config, scoring=scoring, save_model=True, search_type="grid")
results_df = run_model_evaluation(
    dt_base_config, results_df=results_df, scoring=scoring, save_model=True, search_type="grid"
)
results_df = run_model_evaluation(
    rf_xgb_base_configs,
    results_df=results_df,
    scoring=scoring,
    save_model=True,
    search_type="random",
    n_iter=50,
)
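
To see why the lighter models get an exhaustive grid search while Random Forest and XGBoost get a randomized search, it helps to count the combinations in each baseline grid. The sketch below copies the value lists from the grids above (with the `model__` prefix dropped) and uses `5.0` as a stand-in for `scale_pos_weight_value`, since only the number of candidate values matters here.

```python
from math import prod

grids = {
    "Logistic Regression": {
        "C": [0.01, 0.1, 1.0, 10.0, 100.0],
        "penalty": ["l1", "l2"],
        "solver": ["liblinear"],
        "class_weight": [None, "balanced"],
    },
    "Decision Tree": {
        "max_depth": [4, 6, 8, None],
        "min_samples_leaf": [1, 2, 5],
        "min_samples_split": [2, 4, 6],
        "class_weight": [None, "balanced"],
    },
    "Random Forest": {
        "n_estimators": [300, 500],
        "max_depth": [3, 5, None],
        "max_features": ["sqrt", 1.0],
        "max_samples": [0.7, 1.0],
        "min_samples_leaf": [1, 2, 3],
        "min_samples_split": [2, 3, 4],
        "class_weight": [None, "balanced"],
    },
    "XGBoost": {
        "n_estimators": [100, 300],
        "max_depth": [3, 5, 7],
        "learning_rate": [0.01, 0.1, 0.2],
        "subsample": [0.6, 0.8, 1.0],
        "colsample_bytree": [0.6, 0.8, 1.0],
        "min_child_weight": [1, 5, 10],
        "gamma": [0, 0.1, 0.2],
        "scale_pos_weight": [1, 5.0],  # 5.0 is a placeholder, not the notebook's value
    },
}

# number of candidate settings = product of the list lengths
n_combinations = {name: prod(len(v) for v in grid.values()) for name, grid in grids.items()}

n_iter = 50
for name, n in n_combinations.items():
    print(f"{name}: {n} combinations, n_iter={n_iter} covers {min(1.0, n_iter / n):.1%}")
```

Each candidate is also refit once per cross-validation fold, so the real number of fits is a multiple of these counts.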

#### **Baseline results**

In [None]:
# print results, order by recall
print("Model Evaluation Results:")
results_df.sort_values(by="recall", ascending=False, inplace=True)
results_df

> #### **Observations on Baseline Results**
>
> - **XGBoost** had the best overall performance across metrics, including recall, precision, F1, and ROC AUC.
>
> - **Random Forest** was close behind but took the longest to run.
>
> - **Decision Tree** was fast and reasonably strong—good for quick baselines or interpretation.
>
> - **Logistic Regression** severely underperformed in recall—unsuitable if false negatives are costly.


In [None]:
# confusion matrix plots, saved as PNG files
plot_confusion_grid_from_results(results_df, png_title="Base Model Confusion Matrices")
# plot_confusion_from_results(results_df)

> **Summary of Observations from the Confusion Matrices:**
>
> - **Tree-based models (Decision Tree, Random Forest, XGBoost)** show very high recall, correctly identifying most employees who left (true positives), with very few false positives. They also have relatively few false negatives, indicating strong overall performance.
> - **Logistic Regression** has a much higher number of false negatives, missing many employees who actually left. This results in lower recall and makes it less suitable for identifying at-risk employees.
> - Overall, the ensemble models (Random Forest and XGBoost) provide the best balance between correctly identifying leavers and minimizing incorrect predictions, while Logistic Regression struggles with this non-linear problem.
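
When reading these grids, recall that scikit-learn's `confusion_matrix` puts actual classes in rows and predicted classes in columns, so with labels `[Stayed, Left]` the layout is `[[TN, FP], [FN, TP]]`. The sketch below unpacks a made-up matrix in that layout. The `summarize_confusion` helper is hypothetical and the counts are not results from this notebook.

```python
def summarize_confusion(cm):
    """Unpack a 2x2 matrix laid out as [[TN, FP], [FN, TP]] into headline numbers."""
    (tn, fp), (fn, tp) = cm
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {
        "missed_leavers": fn,  # false negatives: left, but predicted to stay
        "false_alarms": fp,    # false positives: stayed, but predicted to leave
        "recall": recall,
        "precision": precision,
    }


toy_cm = [[950, 50], [20, 180]]  # made-up counts
summary = summarize_confusion(toy_cm)
print(summary)
```

The bottom-left cell (false negatives) is the one that drives recall down, which is the cell to watch when comparing Logistic Regression with the tree-based models.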

In [None]:
# show confusion matrix exemplar
display(Image(filename="../resources/images/confusion_matrix_exemplar.png", width=400))

#### **Check feature importance**

> After fitting baseline models, I reviewed the decision tree and feature importances. This step is not to guide feature selection yet, but rather to **cross-check with the EDA** and ensure the models are **learning meaningful patterns**.
>
> I’m mindful not to overinterpret these plots—they can be intuitive and visually appealing, but heavy reliance risks overfitting and misleading conclusions. This is a **calibration check**, not a signal to optimize prematurely.

In [None]:
# Load decision tree model and plot the tree

dt_model = joblib.load("../results/saved_models/decision_tree_(base).joblib")
estimator = dt_model.named_steps["model"]

# ensure feature_names matches columns used for training
# sklearn requires a list of strings, not a pandas index or series
feature_names = list(X_train.columns)

plt.figure(figsize=(20, 12))
plot_tree(
    estimator,
    feature_names=feature_names,
    class_names=["Stayed", "Left"],
    filled=True,
    rounded=True,
    max_depth=3,
)
plt.title("Baseline Decision Tree")
plt.tight_layout()

plt.savefig("../results/images/decision_tree_(base)_visualization.png")

plt.show()

In [None]:
# plot feature importance for each model

# list of model files and names
model_files = [
    ("decision_tree_(base).joblib", "Decision Tree", X_train.columns),
    ("random_forest_(base).joblib", "Random Forest", X_train.columns),
    ("xgboost_(base).joblib", "XGBoost", X_train.columns),
    ("logistic_regression_(base).joblib", "Logistic Regression", X_train_lr.columns),
]

# load each model and plot feature importance
for file_name, model_name, feature_names in model_files:
    load_and_plot_feature_importance(file_name, model_name, feature_names, top_n=10, save_png=True)

> All models consistently identify **low satisfaction** and **extreme workload** (either very high or very low) as the most important predictors of employee attrition. This finding aligns with the exploratory data analysis (EDA). **Tenure** also emerges as a significant factor, matching a pattern around the 4-5 year mark observed in the EDA. In contrast, **salary**, **department**, and **recent promotions** have minimal predictive value in this dataset. These key features are especially prominent in the ensemble models (Random Forest and XGBoost), which are likely the most robust. While all models highlight these variables, it is important to note that decision trees are prone to overfitting, and logistic regression underperforms due to its inability to capture non-linear relationships present in the data.

<a id="feature-engineering-round-one"></a>

## **Feature Engineering (Round One)**

[Back to top](#)

> Based on EDA and feature importance, focus on:
>
> - Satisfaction level (especially low values)
> - Extreme workload (very high or very low monthly hours, number of projects)
> - Tenure (especially the 4–5 year window)
>
> Feature engineering steps to experiment with:
>
> **Binning:**
>
> - Bin satisfaction_level (e.g., low/medium/high)
> - Bin average_monthly_hours (e.g., <160, 160–240, >240)
> - Bin number_project (e.g., ≤2, 3–5, ≥6)
> - Bin tenure (e.g., ≤3, 4–5, >5 years)
>
> **Interactions:**
>
> - satisfaction_level * number_project
>    - low: possibly disengaged or underperforming
>    - high: possibly engaged top performer or healthy productivity
>    - mid: potential burnout
> - satisfaction_level * average_monthly_hours
>    - satisfaction **given workload**
>    - low: burnout risk
>    - high: engaged
> - evaluation * satisfaction
>    - performance and morale
>    - both low: possibly disengaged, firing risk
>    - both high: ideal employee
>    - high eval, low satisfaction: attrition risk
> - monthly_hours / number_project
>    - overwork / underwork index
>
> **Categorical Flags:**
>
> - burnout: (projects ≥ 6 or hours ≥ 240) & satisfaction ≤ 0.3
> - disengaged: (projects ≤ 2 and hours < 160 and satisfaction ≤ 0.5)
> - no_promo_4yr: (promotion_last_5years == 0) & (tenure >= 4)
>
> **Feature Selection:**
>
> - Drop weak predictors (e.g., department, salary, work_accident), as they add noise and, for logistic regression in particular, multicollinearity.
>
> ---
>
> **Note:**  
> While experimenting, I tried two approaches: simple hyperparameter grids for quick testing of new features and combinations, and wide grids that I left running while I walked away from the computer to enjoy life. I eventually settled on exhaustively grid searching the quick models and randomly searching the heavy tree models. Once the best feature set was identified, I did a final round of model training with a more extensive hyperparameter grid for optimal performance.
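
The three flag definitions listed above can be checked on a few made-up employees before running them on the real data. This is a minimal sketch assuming the column names used elsewhere in this notebook; the rows are invented to hit each rule.

```python
import pandas as pd

# four invented employees: burned out, disengaged, healthy, overworked but satisfied
toy = pd.DataFrame(
    {
        "number_project": [7, 2, 4, 6],
        "average_monthly_hours": [280, 140, 200, 250],
        "satisfaction_level": [0.10, 0.40, 0.80, 0.75],
        "promotion_last_5years": [0, 0, 1, 0],
        "tenure": [4, 3, 5, 6],
    }
)

# burnout: heavy workload (many projects OR long hours) AND low satisfaction
toy["burnout"] = (
    (toy["number_project"] >= 6) | (toy["average_monthly_hours"] >= 240)
) & (toy["satisfaction_level"] <= 0.3)

# disengaged: light workload on both measures AND middling-to-low satisfaction
toy["disengaged"] = (
    (toy["number_project"] <= 2)
    & (toy["average_monthly_hours"] < 160)
    & (toy["satisfaction_level"] <= 0.5)
)

# no_promo_4yr: no recent promotion despite at least four years of tenure
toy["no_promo_4yr"] = (toy["promotion_last_5years"] == 0) & (toy["tenure"] >= 4)

print(toy[["burnout", "disengaged", "no_promo_4yr"]])
```

The last row is overworked but satisfied, so it is not flagged as burnout; the satisfaction condition is what separates the two groups.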

#### **Feature engineering functions**

In [None]:
# function to add new features to the X_train / X_train_lr dataframe


# add binning features
def add_binning_features(df):
    df = df.copy()
    df["satisfaction_bin"] = pd.cut(
        df["satisfaction_level"],
        bins=[-0.01, 0.4, 0.7, 1.0],
        labels=["low", "medium", "high"],
    )
    df["hours_bin"] = pd.cut(
        df["average_monthly_hours"],
        bins=[0, 160, 240, np.inf],
        labels=["low", "medium", "high"],
    )
    df["projects_bin"] = pd.cut(
        df["number_project"], bins=[0, 2, 5, np.inf], labels=["low", "medium", "high"]
    )
    df["tenure_bin"] = pd.cut(
        df["tenure"], bins=[0, 3, 5, np.inf], labels=["short", "mid", "long"]
    )
    # encode the binned features as dummies
    df = pd.get_dummies(
        df,
        columns=["satisfaction_bin", "hours_bin", "projects_bin", "tenure_bin"],
        drop_first=True,
    )
    return df


# add interaction features
def add_interaction_features(df):
    df = df.copy()
    df["satisfaction_x_projects"] = df["satisfaction_level"] * df["number_project"]
    df["satisfaction_x_hours"] = df["satisfaction_level"] * df["average_monthly_hours"]
    df["evaluation_x_satisfaction"] = df["last_evaluation"] * df["satisfaction_level"]
    df["hours_per_project"] = df["average_monthly_hours"] / df["number_project"]
    return df


# add flag features
def add_flag_features(df):
    df = df.copy()
    df["burnout"] = (
        (df["number_project"] >= 6) | (df["average_monthly_hours"] >= 240)
    ) & (df["satisfaction_level"] <= 0.3)
    df["disengaged"] = (
        (df["number_project"] <= 2)
        & (df["average_monthly_hours"] < 160)
        & (df["satisfaction_level"] <= 0.5)
    )
    df["no_promo_4yr"] = (df["promotion_last_5years"] == 0) & (df["tenure"] >= 4)
    return df
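
One detail worth checking in the binning function is where the boundary values land. `pd.cut` uses right-closed intervals by default, so exactly 160 hours falls in the "low" bin and exactly 240 in "medium". The sketch below verifies this on made-up values and shows the `right=False` alternative, which is not what the notebook uses.

```python
import numpy as np
import pandas as pd

hours = pd.Series([150, 160, 161, 240, 241])  # made-up values around the bin edges
bins = [0, 160, 240, np.inf]
names = ["low", "medium", "high"]

# default: intervals are (0, 160], (160, 240], (240, inf)
hours_bin = pd.cut(hours, bins=bins, labels=names)
labels_right_closed = hours_bin.astype(str).tolist()

# alternative: intervals are [0, 160), [160, 240), [240, inf)
labels_left_closed = pd.cut(hours, bins=bins, labels=names, right=False).astype(str).tolist()

# drop_first=True removes the first category ("low"), which becomes the reference level
dummies = pd.get_dummies(hours_bin, prefix="hours_bin", drop_first=True)
dummy_columns = list(dummies.columns)

print(labels_right_closed)
print(labels_left_closed)
print(dummy_columns)
```

Either convention is defensible; what matters is that the written bin descriptions and the code agree on which side each edge belongs to.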

#### **Feature selection**

In [None]:
# feature selection for logistic regression
drop_cols = [col for col in X_train_lr.columns if col.startswith("department_")]
drop_cols += ["salary", "work_accident"]
X_train_lr_fs = X_train_lr.drop(columns=drop_cols)

# feature selection for tree-based models
drop_cols = [col for col in X_train.columns if col.startswith("department_")]
drop_cols += ["salary", "work_accident"]
X_train_fs = X_train.drop(columns=drop_cols)

#### **Define logistic regression models**

In [None]:
# logistic regression feature engineering parameters
lr_fe_params = {
    "model__C": [0.1, 1.0, 10.0],  # regularization strength (inverse)
    "model__penalty": ["l1", "l2"],  # regularization type (L1 = Lasso, L2 = Ridge)
    "model__solver": ["liblinear"],  # optimization algorithm (liblinear supports L1/L2)
    "model__class_weight": [None, "balanced"],  # None or balanced for class imbalance
}

In [None]:
# define feature engineered logistic regression models, their feature functions, and parameter grids
lr_fe_models = {
    "Logistic Regression with Binning": LogisticRegression(
        max_iter=1000, random_state=42
    ),
    "Logistic Regression with Interaction": LogisticRegression(
        max_iter=1000, random_state=42
    ),
    "Logistic Regression with Flags": LogisticRegression(
        max_iter=1000, random_state=42
    ),
}
lr_fe_feature_funcs = {
    "Logistic Regression with Binning": add_binning_features,
    "Logistic Regression with Interaction": add_interaction_features,
    "Logistic Regression with Flags": add_flag_features,
}
lr_fe_param_grids = {
    "Logistic Regression with Binning": lr_fe_params,
    "Logistic Regression with Interaction": lr_fe_params,
    "Logistic Regression with Flags": lr_fe_params,
}

# create models_config for logistic regression with feature engineering
lr_fe_configs = make_models_config(
    lr_fe_models,
    X_train_lr,
    y_train_lr,
    feature_func=lr_fe_feature_funcs,
    scaler=StandardScaler(),
    param_grids=lr_fe_param_grids,
)

# create models_config for logistic regression with feature engineering and feature selection
lr_fe_fs_configs = make_models_config(
    lr_fe_models,
    X_train_lr_fs,
    y_train_lr,
    feature_func=lr_fe_feature_funcs,
    scaler=StandardScaler(),
    param_grids=lr_fe_param_grids,
    name_suffix=" (feature selection)",
)

#### **Run logistic regression models**

In [None]:
# run feature engineered logistic regression model evaluation
results_lr_fe_df = run_model_evaluation(
    lr_fe_configs + lr_fe_fs_configs, scoring=scoring
)
# print feature engineered model results, order by recall
print("Feature Engineered Model Evaluation Results:")
results_lr_fe_df.sort_values(by="recall", ascending=False, inplace=True)
results_lr_fe_df.head()

> #### **Observations of Feature-Engineered Logistic Regression Results**
>
> - **Logistic Regression with Flags (feature selection)** had the highest recall and strong metrics, using only 6 features—making it highly interpretable and efficient.
>
> - **Feature selection** (removing department, salary, accident, etc.) simplified the model without hurting accuracy.
>
> - **Interaction and binning features** improved recall and F1 over the baseline, but not as much as the flag-based models.
>
> - **Interpretability:** These models are transparent and easy to explain—ideal for HR use.
>
> - **Summary:** With targeted feature engineering, logistic regression can approach the accuracy of complex models while staying simple and explainable.

In [None]:
# plot confusion matrices for feature engineered models
plot_confusion_grid_from_results(results_lr_fe_df)
# plot_confusion_from_results(results_lr_fe_df)

#### **Define tree-based feature engineering models**

In [None]:
# tree-based feature engineering parameters
tree_fe_params = {
    "Random Forest": {
        "model__n_estimators": [100, 300],  # 300 was best, but 100 is faster for FE
        "model__max_depth": [
            3,
            4,
            5,
            8,
        ],  # 5 was best, trying for regularization, deeper trees can overfit, take longer to train
        "model__max_features": ["sqrt", 1.0],  # 1.0 was best, but sqrt is common
        "model__max_samples": [0.7, 1.0],  # 1.0 was best
        "model__min_samples_leaf": [1, 2],  # 1 or 2
        "model__min_samples_split": [2, 3],  # 2 or 3
        "model__class_weight": [
            None,
            "balanced",
        ],  # None or balanced for class imbalance
    },
    "XGBoost": {
        "model__n_estimators": [100, 300],  # 300 was best
        "model__max_depth": [
            3,
            4,
            5,
            8,
        ],  # 3 was best (moderate increase in training time)
        "model__learning_rate": [
            0.1,
            0.2,
        ],  # 0.1 is standard, 0.2 for speed, step size shrinkage
        "model__subsample": [
            0.6,
            0.8,
            1.0,
        ],  # 1.0 was best, row subsampling (adds randomness, helps generalization)
        "model__colsample_bytree": [
            0.6,
            0.8,
            1.0,
        ],  # 1.0 was best, column subsampling (adds randomness, helps generalization)
        "model__min_child_weight": [
            1,
            5,
        ],  # 1 is default, 5 for regularization, minimum sum of instance weight in a child
        "model__gamma": [
            0,
            0.1,
            0.2,
        ],  # 0.2 was best, try 0 for comparison, minimum loss reduction required to make a split
        "model__scale_pos_weight": [
            1,
            scale_pos_weight_value,
        ],  # 1 or calculated value for class imbalance
        "model__reg_alpha": [
            0,
            0.1,
            1,
        ],  # L1 regularization (helps control overfitting)
        "model__reg_lambda": [1, 2, 5],  # L2 regularization (helps control overfitting)
    },
    "Decision Tree": {
        "model__max_depth": [3, 4, 5, 6, 8],  # best was 8
        "model__min_samples_leaf": [1, 2, 3],  # 1 was best
        "model__min_samples_split": [2, 3, 4],  # 2 was best
        "model__class_weight": [
            None,
            "balanced",
        ],  # None or balanced for class imbalance
    },
}

In [None]:
# tree-based feature engineering configs with full model names
dt_fe_models = {
    "Decision Tree with Binning": DecisionTreeClassifier(random_state=42),
    "Decision Tree with Interaction": DecisionTreeClassifier(random_state=42),
    "Decision Tree with Flags": DecisionTreeClassifier(random_state=42),
}
rf_xgb_fe_models = {
    "Random Forest with Binning": RandomForestClassifier(random_state=42, n_jobs=-1),
    "Random Forest with Interaction": RandomForestClassifier(
        random_state=42, n_jobs=-1
    ),
    "Random Forest with Flags": RandomForestClassifier(random_state=42, n_jobs=-1),
    "XGBoost with Binning": XGBClassifier(
        eval_metric=get_xgb_eval_metric(scoring), random_state=42, n_jobs=-1
    ),
    "XGBoost with Interaction": XGBClassifier(
        eval_metric=get_xgb_eval_metric(scoring), random_state=42, n_jobs=-1
    ),
    "XGBoost with Flags": XGBClassifier(
        eval_metric=get_xgb_eval_metric(scoring), random_state=42, n_jobs=-1
    ),
}

tree_fe_feature_funcs = {
    "Random Forest with Binning": add_binning_features,
    "Random Forest with Interaction": add_interaction_features,
    "Random Forest with Flags": add_flag_features,
    "XGBoost with Binning": add_binning_features,
    "XGBoost with Interaction": add_interaction_features,
    "XGBoost with Flags": add_flag_features,
    "Decision Tree with Binning": add_binning_features,
    "Decision Tree with Interaction": add_interaction_features,
    "Decision Tree with Flags": add_flag_features,
}

tree_fe_param_grids = {
    "Random Forest with Binning": tree_fe_params["Random Forest"],
    "Random Forest with Interaction": tree_fe_params["Random Forest"],
    "Random Forest with Flags": tree_fe_params["Random Forest"],
    "XGBoost with Binning": tree_fe_params["XGBoost"],
    "XGBoost with Interaction": tree_fe_params["XGBoost"],
    "XGBoost with Flags": tree_fe_params["XGBoost"],
    "Decision Tree with Binning": tree_fe_params["Decision Tree"],
    "Decision Tree with Interaction": tree_fe_params["Decision Tree"],
    "Decision Tree with Flags": tree_fe_params["Decision Tree"],
}

# with feature engineering only (each model gets its own feature function by name)
dt_fe_configs = make_models_config(
    dt_fe_models,
    X_train,
    y_train,
    feature_func=tree_fe_feature_funcs,
    param_grids=tree_fe_param_grids,
)
rf_xgb_fe_configs = make_models_config(
    rf_xgb_fe_models,
    X_train,
    y_train,
    feature_func=tree_fe_feature_funcs,
    param_grids=tree_fe_param_grids,
)

# with feature engineering and feature selection (drop_cols)
dt_fe_fs_configs = make_models_config(
    dt_fe_models,
    X_train_fs,
    y_train,
    feature_func=tree_fe_feature_funcs,
    param_grids=tree_fe_param_grids,
    name_suffix=" (feature selection)",
)
rf_xgb_fe_fs_configs = make_models_config(
    rf_xgb_fe_models,
    X_train_fs,
    y_train,
    feature_func=tree_fe_feature_funcs,
    param_grids=tree_fe_param_grids,
    name_suffix=" (feature selection)",
)

#### **Run tree-based feature engineering models**

In [None]:
# run tree-based feature engineered model evaluation
results_tree_fe_df = run_model_evaluation(
    dt_fe_configs + dt_fe_fs_configs, scoring=scoring, search_type="grid"
)
results_tree_fe_df = run_model_evaluation(
    rf_xgb_fe_configs + rf_xgb_fe_fs_configs,
    results_df=results_tree_fe_df,
    scoring=scoring,
    search_type="random",
    n_iter=50,
)
# print feature engineered tree-based model results, order by recall
print("Feature Engineered Tree-Based Model Evaluation Results:")
results_tree_fe_df.sort_values(by="recall", ascending=False, inplace=True)
results_tree_fe_df.head()

> ### **Patterns in Results thus far**
>
> - **Recall is consistently high** across all models, especially for Logistic Regression and Decision Tree (base), indicating strong sensitivity to identifying leavers.
> - **F1 and Precision are much lower for Logistic Regression (base),** suggesting many false positives. Tree-based models, including XGBoost, have a much better balance between recall and precision.
> - **ROC AUC is highest for XGBoost and Random Forest,** showing strong overall discrimination.
> - **Feature selection and engineering** (binning, interaction, flags) generally improves F1, precision, and accuracy, sometimes at a small cost to recall.
> - **Reducing features (feature selection)** often maintains or even improves performance, especially for XGBoost and Decision Tree, and greatly reduces model complexity and training time.
> - **Confusion matrices show that most errors are false positives** (predicting leave when they stay), which is expected with `class_weight='balanced'` and high recall focus.

<a id="feature-engineering-round-two"></a>

## **Feature Engineering (Round Two)**

[Back to top](#)

>#### **What? Why?**
>
> Really, this is a feature shrinking round. Some feature engineering paired with a lot of feature selection. Feature-rich models have barely improved or even reduced performance, and feature selection has performed well.
>
> Simpler models are easier to explain to stakeholders, and they should hopefully reduce noise and potential multicollinearity.
>
> **Selected features + burnout flag:**
> This set isolates the core predictors of attrition (satisfaction, evaluation, workload, tenure, promotion) and adds a “burnout” flag to capture the high-risk group of overworked, dissatisfied employees.
>
> **Selected features + interactions:**
> This set focuses on the main drivers (satisfaction, workload, tenure) and adds interaction terms (satisfaction × projects, hours per project) to capture non-linear effects and workload intensity, which EDA showed are important for distinguishing between underworked, overworked, and healthy employees.
>
> **Selected features + interactions + burnout flag:**
> This feature set combines the core predictors of attrition (satisfaction, workload, tenure) with a “burnout” flag to capture high-risk, overworked employees. It also includes a key interaction term, "satisfaction × projects", to distinguish between groups identified in EDA.
>
> `satisfaction_x_projects` separates healthy, burned-out, and underperforming employees:
> - Employees who are satisfied and productive (high satisfaction, moderate projects)
> - Those who are overworked and dissatisfied (low satisfaction, high projects)
> - Those who are disengaged (low satisfaction, low projects)
>
> `hours_per_project` captures nuanced patterns of overwork and underwork:
> - Employees with many projects but reasonable hours (healthy workload)
> - Employees with few projects but high hours (potentially inefficient or struggling)
> - Employees with many projects and high hours (burnout risk)

In [None]:
# selected features + burnout flag
def select_core_features_with_burnout(df):
    df = df.copy()
    # burnout flag: (projects >= 6 or hours >= 240) & satisfaction <= 0.3
    df["burnout"] = (
        (df["number_project"] >= 6) | (df["average_monthly_hours"] >= 240)
    ) & (df["satisfaction_level"] <= 0.3)
    return df[
        [
            "satisfaction_level",
            "last_evaluation",
            "number_project",
            "average_monthly_hours",
            "tenure",
            "promotion_last_5years",
            "burnout",
        ]
    ]


# selected features + interactions
def select_core_features_with_interactions(df):
    df = df.copy()
    # interactions
    df["satisfaction_x_projects"] = df["satisfaction_level"] * df["number_project"]
    df["hours_per_project"] = df["average_monthly_hours"] / df["number_project"]
    return df[
        [
            "satisfaction_level",
            "number_project",
            "average_monthly_hours",
            "tenure",
            "satisfaction_x_projects",
            "hours_per_project",
        ]
    ]


# selected features + interactions + burnout flag
def select_core_features_with_interactions_and_burnout(df):
    df = df.copy()
    # burnout flag: (projects >= 6 or hours >= 240) & satisfaction <= 0.3
    df["burnout"] = (
        (df["number_project"] >= 6) | (df["average_monthly_hours"] >= 240)
    ) & (df["satisfaction_level"] <= 0.3)
    # interaction
    df["satisfaction_x_projects"] = df["satisfaction_level"] * df["number_project"]
    return df[
        [
            "satisfaction_level",
            "number_project",
            "average_monthly_hours",
            "tenure",
            "burnout",
            "satisfaction_x_projects",
        ]
    ]

#### **Define feature engineering round 2 models**

In [None]:
# --- Feature engineering round 2 model dicts ---

# logistic regression FE2 models, feature funcs, param grids
lr_fe2_models = {
    "Logistic Regression (Core + Burnout)": LogisticRegression(
        max_iter=1000, random_state=42
    ),
    "Logistic Regression (Core + Interactions)": LogisticRegression(
        max_iter=1000, random_state=42
    ),
    "Logistic Regression (Core + Interactions + Burnout)": LogisticRegression(
        max_iter=1000, random_state=42
    ),
}
lr_fe2_feature_funcs = {
    "Logistic Regression (Core + Burnout)": select_core_features_with_burnout,
    "Logistic Regression (Core + Interactions)": select_core_features_with_interactions,
    "Logistic Regression (Core + Interactions + Burnout)": select_core_features_with_interactions_and_burnout,
}
lr_fe2_param_grids = {
    "Logistic Regression (Core + Burnout)": lr_fe_params,
    "Logistic Regression (Core + Interactions)": lr_fe_params,
    "Logistic Regression (Core + Interactions + Burnout)": lr_fe_params,
}

# tree-based FE2 models, feature funcs, param grids
dt_fe2_models = {
    "Decision Tree (Core + Burnout)": DecisionTreeClassifier(random_state=42),
    "Decision Tree (Core + Interactions)": DecisionTreeClassifier(random_state=42),
    "Decision Tree (Core + Interactions + Burnout)": DecisionTreeClassifier(
        random_state=42
    ),
}
rf_xgb_fe2_models = {
    "Random Forest (Core + Burnout)": RandomForestClassifier(
        random_state=42, n_jobs=-1
    ),
    "Random Forest (Core + Interactions)": RandomForestClassifier(
        random_state=42, n_jobs=-1
    ),
    "Random Forest (Core + Interactions + Burnout)": RandomForestClassifier(
        random_state=42, n_jobs=-1
    ),
    "XGBoost (Core + Burnout)": XGBClassifier(
        eval_metric=get_xgb_eval_metric(scoring), random_state=42, n_jobs=-1
    ),
    "XGBoost (Core + Interactions)": XGBClassifier(
        eval_metric=get_xgb_eval_metric(scoring), random_state=42, n_jobs=-1
    ),
    "XGBoost (Core + Interactions + Burnout)": XGBClassifier(
        eval_metric=get_xgb_eval_metric(scoring), random_state=42, n_jobs=-1
    ),
}
tree_fe2_feature_funcs = {
    "Decision Tree (Core + Burnout)": select_core_features_with_burnout,
    "Decision Tree (Core + Interactions)": select_core_features_with_interactions,
    "Decision Tree (Core + Interactions + Burnout)": select_core_features_with_interactions_and_burnout,
    "Random Forest (Core + Burnout)": select_core_features_with_burnout,
    "Random Forest (Core + Interactions)": select_core_features_with_interactions,
    "Random Forest (Core + Interactions + Burnout)": select_core_features_with_interactions_and_burnout,
    "XGBoost (Core + Burnout)": select_core_features_with_burnout,
    "XGBoost (Core + Interactions)": select_core_features_with_interactions,
    "XGBoost (Core + Interactions + Burnout)": select_core_features_with_interactions_and_burnout,
}
tree_fe2_param_grids = {
    "Decision Tree (Core + Burnout)": tree_fe_params["Decision Tree"],
    "Decision Tree (Core + Interactions)": tree_fe_params["Decision Tree"],
    "Decision Tree (Core + Interactions + Burnout)": tree_fe_params["Decision Tree"],
    "Random Forest (Core + Burnout)": tree_fe_params["Random Forest"],
    "Random Forest (Core + Interactions)": tree_fe_params["Random Forest"],
    "Random Forest (Core + Interactions + Burnout)": tree_fe_params["Random Forest"],
    "XGBoost (Core + Burnout)": tree_fe_params["XGBoost"],
    "XGBoost (Core + Interactions)": tree_fe_params["XGBoost"],
    "XGBoost (Core + Interactions + Burnout)": tree_fe_params["XGBoost"],
}

# create models_config for FE2 models
lr_fe2_configs = make_models_config(
    lr_fe2_models,
    X_train_lr,
    y_train_lr,
    feature_func=lr_fe2_feature_funcs,
    scaler=StandardScaler(),
    param_grids=lr_fe2_param_grids,
)
dt_fe2_configs = make_models_config(
    dt_fe2_models,
    X_train,
    y_train,
    feature_func=tree_fe2_feature_funcs,
    param_grids=tree_fe2_param_grids,
)
rf_xgb_fe2_configs = make_models_config(
    rf_xgb_fe2_models,
    X_train,
    y_train,
    feature_func=tree_fe2_feature_funcs,
    param_grids=tree_fe2_param_grids,
)

#### **Run feature engineering round 2 models**

In [None]:
# run feature engineered round 2 model evaluation
results_fe2_df = run_model_evaluation(
    lr_fe2_configs, scoring=scoring, search_type="grid"
)
results_fe2_df = run_model_evaluation(
    dt_fe2_configs, results_df=results_fe2_df, scoring=scoring, search_type="grid"
)
results_fe2_df = run_model_evaluation(
    rf_xgb_fe2_configs,
    results_df=results_fe2_df,
    scoring=scoring,
    search_type="random",
    n_iter=50,
)
# print feature engineered round 2 model results, order by recall
print("Feature Engineered Round 2 Model Evaluation Results:")
results_fe2_df.sort_values(by="recall", ascending=False, inplace=True)
results_fe2_df.head()
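
The calls above use `search_type="grid"` for logistic regression and decision trees, and `search_type="random"` with `n_iter=50` for Random Forest and XGBoost. How `run_model_evaluation` implements these is defined elsewhere in the notebook; the sketch below only illustrates the general difference between the two strategies using the standard library. The parameter grid is a hypothetical example, not the notebook's `tree_fe_params`.

```python
import itertools
import random

# Hypothetical parameter grid (assumed values, not the notebook's grid)
param_grid = {
    "n_estimators": [100, 200, 300, 500],
    "max_depth": [3, 5, 7, None],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2", None],
}

# Grid search: every combination of values is a candidate
keys = list(param_grid)
grid_candidates = [
    dict(zip(keys, values)) for values in itertools.product(*param_grid.values())
]

# Random search: a fixed budget of candidates sampled without replacement
rng = random.Random(42)
n_iter = 50
random_candidates = rng.sample(grid_candidates, k=min(n_iter, len(grid_candidates)))

print(f"grid search candidates:   {len(grid_candidates)}")
print(f"random search candidates: {len(random_candidates)}")
```

Each candidate is refit once per cross-validation fold, so a fixed budget is a reasonable choice for the slower ensemble models, while an exhaustive grid stays affordable for the cheaper ones.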

In [None]:
# plot confusion matrices for feature engineered round 2 models
plot_confusion_grid_from_results(results_fe2_df)
# plot_confusion_from_results(results_fe2_df)

<a id="model-evaluation-results"></a>

## **Model Evaluation Results**

[Back to top](#)

In [None]:
# merge all results dataframes into a single dataframe for comparison
all_results_df = pd.concat(
    [results_df, results_lr_fe_df, results_tree_fe_df, results_fe2_df],
    ignore_index=True,
)
all_results_df.sort_values(by="recall", ascending=False, inplace=True)
print("All Model Evaluation Results:")
all_results_df.head(10)

In [None]:
# save results to CSV
results_df.to_csv("../results/base_model_evaluation_results.csv", index=False)
results_lr_fe_df.to_csv(
    "../results/logistic_regression_feature_engineered_results.csv", index=False
)
results_tree_fe_df.to_csv(
    "../results/tree_based_feature_engineered_results.csv", index=False
)
results_fe2_df.to_csv("../results/feature_engineered_round_2_results.csv", index=False)
all_results_df.to_csv("../results/all_model_evaluation_results.csv", index=False)

In [None]:
# plot confusion matrices for all models in a grid
plot_confusion_grid_from_results(all_results_df)

In [None]:
plot_confusion_grid_from_results(
    all_results_df.iloc[:9], 
    png_title="Top Model Confusion Matrices"
)

### Model Evaluation Summary

#### 1. **Logistic Regression**
- **Best Recall:**  
  - *Logistic Regression (Core + Interactions)* achieves the highest recall (0.962), with only 6 features and a simple, interpretable model.
  - Other logistic regression variants with feature selection or binning also maintain high recall (0.94–0.96) with fewer features.
- **F1 & Precision:**  
  - F1 scores for logistic regression are generally lower (0.64–0.75), reflecting lower precision (0.51–0.63).
  - Feature selection and engineering (e.g., interactions, binning) slightly improve F1 and precision while keeping models simple.

#### 2. **Tree-Based Models (Decision Tree, Random Forest, XGBoost)**
- **Top F1 & Precision:**  
  - XGBoost and Random Forest models consistently achieve the highest F1 (up to 0.91) and precision (up to 0.89), with strong recall (0.93–0.94).
  - Decision Trees also perform well, especially with feature engineering (F1 up to 0.89, precision up to 0.87).
- **Feature Efficiency:**  
  - Tree-based models with feature selection or engineered features (e.g., "Core + Burnout", "feature selection") often match or outperform base models with fewer features.

#### 3. **Feature Selection & Engineering**
- **Effectiveness:**  
  - Models using feature selection or engineered features (interactions, binning, flags) often achieve similar or better performance with fewer features.
  - This reduces model complexity and improves interpretability without sacrificing accuracy, recall, or F1.

#### 4. **Interpretability vs. Performance**
- **Trade-off:**  
  - Logistic regression models are more interpretable and, with feature engineering, are now much more competitive in recall and accuracy.
  - Tree-based models remain top performers for F1 and precision, but at the cost of increased complexity.

---

**Conclusion:**  
- Feature selection and engineering are highly effective, enabling simpler models (especially logistic regression) to achieve strong recall and competitive accuracy.
- Tree-based models (especially XGBoost) remain the best for F1 and precision, but logistic regression is now a viable, interpretable alternative for high-recall use cases.
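
The gap between recall and F1 for logistic regression follows from the F1 formula, the harmonic mean of precision $P$ and recall $R$. As a worked check, the first calculation uses the recall (0.962) and precision (0.51) reported for Logistic Regression (Core + Interactions). The second uses the upper-end tree-based values quoted above (precision 0.89, recall 0.94) purely for illustration, since they may not come from a single model. All inputs are rounded, so the last digit can differ from the stored results.

$$
F_1 = \frac{2PR}{P + R}
$$

$$
F_1^{\text{LR}} = \frac{2(0.51)(0.962)}{0.51 + 0.962} = \frac{0.981}{1.472} \approx 0.667
$$

$$
F_1^{\text{tree}} = \frac{2(0.89)(0.94)}{0.89 + 0.94} = \frac{1.673}{1.830} \approx 0.91
$$

Because the harmonic mean is pulled toward the smaller value, very high recall cannot compensate for precision near 0.5.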

In [None]:
# print all_results_df, ordered by alternate metrics
metrics = ["f1", "accuracy", "roc_auc", "precision"]

for metric in metrics:
    print(f"\n--- Sorted by {metric} (descending) ---")
    display(all_results_df.sort_values(by=metric, ascending=False).head())

#### **Logistic Regression**

**Top Pick:** Logistic Regression (Core + Interactions)

- Recall: 0.962 (highest among all models)
- F1: 0.667 (moderate)
- Precision: 0.51 (lower, but expected with high recall)
- Features: 6 (very simple, highly interpretable)

**Why:**

- Achieves the highest recall, which is critical for identifying as many at-risk employees as possible.
- Uses only 6 features, making it easy to explain to HR and stakeholders.
- Slightly lower F1 and precision, but this is a common trade-off when maximizing recall.
- Good for organizations prioritizing interpretability and proactive retention.

**Alternative:** Logistic Regression with Interaction (feature selection)

- Recall: 0.960 (very close to top)
- F1: 0.671 (slightly higher)
- Precision: 0.52 (slightly higher)
- Features: 10 (still simple)

**Why:**

- Slightly better F1 and precision, with a small drop in recall.
- Still interpretable and efficient.

---

#### **Decision Tree**

**Top Pick:** Decision Tree (Core + Burnout)

- Recall: 0.944 (highest among DTs)
- F1: 0.814
- Precision: 0.72
- Features: 7
- Depth: 5 → interpretable

**Why:**

- Best recall for DTs, with strong F1 and precision.
- Simple model, easy to visualize and explain.
- Relatively shallow depth helps interpretability.
- Captures key non-linear relationships (e.g., burnout).

**Alternative:** Decision Tree (base)

- Recall: 0.942
- F1: 0.803
- Precision: 0.70
- Features: 18

**Why:**

- Slightly lower recall, more features, but still interpretable.
- Deeper and less parsimonious.
- Useful if you want to see the effect of all variables.

---

#### **Random Forest**

**Top Pick:** Random Forest (Core + Burnout)

- Recall: 0.941 (highest among RFs)
- F1: 0.837 (best among all models except XGB)
- Precision: 0.75
- Features: 7
- Max Depth: 5 (controlled complexity)

**Why:**

- Best recall and F1 for RFs, with a compact feature set.
- Limited feature set and shallow trees improve generalizability.
- Balances predictive power and interpretability.
- Efficient for deployment.

**Alternative:** Random Forest (base)

- Recall: 0.940
- F1: 0.770
- Precision: 0.65
- Features: 18

**Why:**

- Slightly lower recall and F1, but includes all features.
- Useful for feature importance analysis.

---

#### **XGBoost**

**Top Pick:** XGBoost (base)

- Recall: 0.937 (highest among XGBs)
- F1: 0.911 (highest overall)
- Precision: 0.89 (highest overall)
- ROC AUC: 0.986 (highest overall)
- Features: 18

**Why:**

- Best balance of recall, F1, precision, and ROC AUC.
- Best overall performer in general metrics.
- Excellent for minimizing both false negatives and false positives.
- Slightly more complex, but worth it for performance.

**Alternative:** XGBoost (Core + Burnout)

- Recall: 0.937 (same as base)
- F1: 0.906
- Precision: 0.88
- Features: 7

**Why:**

- Nearly identical recall, slightly lower F1/precision, but much simpler.
- Good if you want a more interpretable XGBoost model.

**Alternates:**

- XGB with Binning (recall tied, more compact features)
- XGB with Flags (feature selection) (best interpretability: only 9 features, F1=0.914, recall=0.935)

If interpretability or runtime matters more than a slight edge in F1, pick the XGB with Flags.

In [None]:
# print total execution time, for measuring performance
nb_end_time = time.time()
print(f"Total execution time: {nb_end_time - nb_start_time:.2f} seconds")
print(
    f"Total execution time: {time.strftime('%H:%M:%S', time.gmtime(nb_end_time - nb_start_time))}"
)

<a id="pace-execute-stage"></a>

# pacE: Execute Stage
[Back to top](#)
- Interpret model performance and results
- Share actionable steps with stakeholders



> I passed the point of diminishing returns long ago.
>
> But I learned a lot of foundational material about the model construction process (Pipelines, cross-validation, random search vs. grid search, checking misclassification errors, feature selection and engineering, etc.), and I did get the logistic regression model a bit better, so I'll call it a win. The time may have been wasted now, but I'll be a lot quicker next time. Nothing like mastering the basics.

✏
## Recall evaluation metrics

- **AUC** is the area under the ROC curve; it's also considered the probability that the model ranks a random positive example more highly than a random negative example.
- **Precision** measures the proportion of data points predicted as True that are actually True, in other words, the proportion of positive predictions that are true positives.
- **Recall** measures the proportion of data points that are predicted as True, out of all the data points that are actually True. In other words, it measures the proportion of positives that are correctly classified.
- **Accuracy** measures the proportion of data points that are correctly classified.
- **F1-score** is the harmonic mean of precision and recall, so it is high only when both are high.
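
As a small illustration of the definitions above, the sketch below computes precision, recall, accuracy, and F1 from confusion-matrix counts, and AUC as the share of positive/negative pairs that are ranked correctly. The counts and scores are hypothetical values chosen for easy arithmetic, not results from this notebook, and both helper functions are illustrative names.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, accuracy, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


def pairwise_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical counts: 100 actual leavers, 400 actual stayers
example_metrics = classification_metrics(tp=90, fp=60, fn=10, tn=340)
for name, value in example_metrics.items():
    print(f"{name}: {value:.3f}")

# Hypothetical predicted probabilities for 3 leavers and 4 stayers
example_auc = pairwise_auc(pos_scores=[0.9, 0.7, 0.4], neg_scores=[0.8, 0.3, 0.2, 0.1])
print(f"auc: {example_auc:.3f}")
```

With these counts, recall is 0.90 but precision is only 0.60, which pulls F1 down to 0.72. This is the same pattern seen in the high-recall logistic regression models.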






💭
### Reflect on these questions as you complete the executing stage.

- What key insights emerged from your model(s)?
- What business recommendations do you propose based on the models built?
- What potential recommendations would you make to your manager/company?
- Do you think your model could be improved? Why or why not? How?
- Given what you know about the data and the models you were using, what other questions could you address for the team?
- What resources do you find yourself using as you complete this stage? (Make sure to include the links.)
- Do you have any ethical considerations in this stage?



> ### Execute Stage Reflection
> 
> #### What key insights emerged from your model(s)?
> - **Satisfaction level** and **workload** (number of projects, monthly hours) are the strongest predictors of attrition.
> - Two main at-risk groups: **overworked/burned-out employees** (many projects, long hours, low satisfaction) and **underworked/disengaged employees** (few projects, low satisfaction).
> - **Tenure** is important: attrition peaks at 4–5 years, then drops sharply.
> - **Salary, department, and recent promotions** have minimal predictive value.
> - **Tree-based models (Random Forest, XGBoost)** achieved the best balance of recall, precision, and F1. With feature engineering, **logistic regression** became competitive and highly interpretable.
> 
> #### What business recommendations do you propose based on the models built?
> - **Monitor satisfaction and workload:** Regularly survey employees and track workload to identify those at risk of burnout or disengagement.
> - **Targeted retention efforts:** Focus on employees with low satisfaction and extreme workloads, especially those at the 4–5 year tenure mark.
> - **Promotions and recognition:** Consider more frequent recognition or advancement opportunities.
> - **Work-life balance:** Encourage reasonable project loads and monthly hours to reduce burnout risk.
> 
> #### What potential recommendations would you make to your manager/company?
> - **Implement early warning systems** using the model to flag at-risk employees for supportive HR outreach.
> - **Review workload distribution** and ensure fair, manageable assignments.
> - **Conduct stay interviews** with employees approaching 4–5 years of tenure.
> - **Communicate transparently** about how predictive models are used, emphasizing support rather than punitive action.
> 
> #### Do you think your model could be improved? Why or why not? How?
> - **Feature engineering:** Further refine interaction terms or add time-based features if available.
> - **External data:** Incorporate additional data (e.g., engagement surveys, manager ratings, exit interview themes).
> - **Model calibration:** Regularly retrain and calibrate the model as new data becomes available.
> - **Bias audits:** Routinely check for bias across demographic groups.
> 
> #### Given what you know about the data and the models you were using, what other questions could you address for the team?
> - What are the specific reasons for attrition in different departments or roles?
> - Are there seasonal or cyclical patterns in attrition?
> - How do external factors (e.g., economic conditions, industry trends) affect turnover?
> - What interventions are most effective for retaining at-risk employees?
> 
> #### What resources do you find yourself using as you complete this stage? (Make sure to include the links.)
> - [pandas documentation](https://pandas.pydata.org/docs/)
> - [matplotlib documentation](https://matplotlib.org/stable/users/index.html)
> - [seaborn documentation](https://seaborn.pydata.org/)
> - [scikit-learn documentation](https://scikit-learn.org/stable/user_guide.html)
> - [XGBoost documentation](https://xgboost.readthedocs.io/en/stable/)
> - [Kaggle HR Analytics Dataset](https://www.kaggle.com/datasets/mfaisalqureshi/hr-analytics-and-job-prediction?select=HR_comma_sep.csv)
> 
> #### Do you have any ethical considerations in this stage?
> - **Data privacy:** Ensure employee data is kept confidential and secure.
> - **Fairness:** Avoid using the model to unfairly target or penalize specific groups.
> - **Transparency:** Clearly communicate how predictions are generated and used.
> - **Supportive use:** Use predictions to offer support and resources, not for punitive measures.
> - **Ongoing monitoring:** Regularly audit the model for bias and unintended consequences.

<a id="results-and-evaluation"></a>

## Results and Evaluation
[Back to top](#)
- Interpret model
- Evaluate model performance using metrics
- Prepare results, visualizations, and actionable steps to share with stakeholders




### Summary of model results

> Will do, after running X_test through model

### Conclusion, Recommendations, Next Steps

> ### Conclusion
> - **Satisfaction level** and **workload** (number of projects, monthly hours) are the strongest predictors of employee attrition.
> - Two main at-risk groups emerged: **overworked/burned-out employees** (many projects, long hours, low satisfaction) and **underworked/disengaged employees** (few projects, low satisfaction).
> - **Tenure** is important: attrition peaks at 4–5 years, then drops sharply.
> - **Salary, department, and recent promotions** have minimal predictive value.
> - **Tree-based models (Random Forest, XGBoost)** achieved the best balance of recall, precision, and F1. With feature engineering, **logistic regression** became competitive and highly interpretable.
> 
> ### Recommendations
> - **Monitor satisfaction and workload:** Regularly survey employees and track workload to identify those at risk of burnout or disengagement.
> - **Targeted retention efforts:** Focus on employees with low satisfaction and extreme workloads, especially those at the 4–5 year tenure mark.
> - **Promotions and recognition:** Consider more frequent recognition or advancement opportunities.
> - **Work-life balance:** Encourage reasonable project loads and monthly hours to reduce burnout risk.
> - **Implement early warning systems:** Use the model to flag at-risk employees for supportive HR outreach.
> - **Review workload distribution:** Ensure fair, manageable assignments.
> - **Conduct stay interviews:** Engage employees approaching 4–5 years of tenure.
> - **Communicate transparently:** Clearly explain how predictive models are used, emphasizing support rather than punitive action.
> 
> ### Next Steps
> - **Model deployment:** Integrate the predictive model into HR processes for early identification of at-risk employees.
> - **Continuous improvement:** Regularly retrain and calibrate the model as new data becomes available.
> - **Expand data sources:** Incorporate additional data (e.g., engagement surveys, manager ratings, exit interview themes) to improve model accuracy.
> - **Bias and fairness audits:** Routinely check for bias across demographic groups and monitor for unintended consequences.
> - **Ethical safeguards:** Ensure employee data privacy, fairness, and transparency in all predictive analytics initiatives.
> 




<script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.3/dist/js/bootstrap.bundle.min.js"></script>
<script src="../static/js/scripts.js"></script>

