In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn modules
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score, 
                             roc_auc_score, mean_squared_error, mean_absolute_error, 
                             r2_score, classification_report, confusion_matrix, 
                             precision_recall_curve)

# Machine learning models
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge, Lasso
from sklearn.ensemble import (RandomForestClassifier, RandomForestRegressor, 
                              GradientBoostingClassifier, GradientBoostingRegressor)
from sklearn.svm import SVC, SVR
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Other utilities
import joblib
import pickle
from scipy.stats import boxcox
import xgboost as xgb

# Load dataset
df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
df_transformed = df.copy()  # Copy so later transformations do not modify df


In [None]:
df.head()

## Data Cleaning:

Since the data was collected in-house through a multiple-choice survey without open-ended questions, the cleaning process is relatively straightforward. However, there are still a few aspects to check:

- **Data types:** Ensure correct classification, primarily int64 and object.
- **Duplicates and missing values:** Identify and handle any null or duplicate entries.
- **Whitespace:** Confirm that text fields contain no extra leading or trailing spaces.
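
The checks listed above can be sketched as two small helpers. This is a minimal illustration, not the cleaning code used for this dataset: `basic_cleaning_report` and `strip_whitespace` are hypothetical names, and the toy frame only stands in for the survey export.

```python
import pandas as pd

def basic_cleaning_report(frame: pd.DataFrame) -> dict:
    """Summarize dtypes, duplicate rows, and missing values (hypothetical helper)."""
    return {
        "dtypes": frame.dtypes.astype(str).to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
        "missing_values": int(frame.isna().sum().sum()),
    }

def strip_whitespace(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with leading/trailing spaces removed from string columns."""
    cleaned = frame.copy()
    for col in cleaned.columns:
        if pd.api.types.is_string_dtype(cleaned[col]):
            cleaned[col] = cleaned[col].str.strip()
    return cleaned

# Toy data standing in for the survey export (values are illustrative only)
toy = pd.DataFrame({
    "Age": [41, 49, 41],
    "Department": [" Sales", "Research & Development ", " Sales"],
})
toy_clean = strip_whitespace(toy)
report = basic_cleaning_report(toy_clean)
print(report)
```

Stripping whitespace before counting duplicates matters here: the first and third toy rows only become identical once the stray spaces are removed.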

## Data Exploration

In [None]:
df.describe().T

## Checking for Outliers  

While some machine learning models are robust to outliers, it is still important to know where they occur: many statistical techniques assume normally distributed errors, and heavily skewed features can hurt algorithms that are sensitive to feature scale and extreme values, such as linear regression and SVMs.  

Common methods for detecting outliers include:  
- **Z-score**: Identifies data points that deviate significantly from the mean.  
- **Interquartile Range (IQR) Method**: Flags values that fall beyond 1.5 times the IQR.  
- **Visualization Techniques**: Box plots, histograms, and density plots provide intuitive insights into data distribution.  

For this analysis, I used **box plots** to visualize the spread and detect potential outliers effectively.
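
For reference, the two numeric methods listed above can be sketched as follows. This is an illustrative sketch rather than part of the analysis: the helper names, the 1.5 multiplier default, the threshold of 3 standard deviations, and the toy values are all assumptions.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

def zscore_outlier_mask(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Flag values more than `threshold` standard deviations from the mean."""
    z = (series - series.mean()) / series.std(ddof=0)
    return z.abs() > threshold

# Illustrative values only: nineteen typical tenures and one extreme one
years = pd.Series(
    [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7, 7, 8, 9, 40],
    name="YearsAtCompany",
)
iqr_outliers = years[iqr_outlier_mask(years)].tolist()
z_outliers = years[zscore_outlier_mask(years)].tolist()
print("IQR outliers:", iqr_outliers)
print("Z-score outliers:", z_outliers)
```

On very small samples the Z-score method can fail to flag even an extreme value, because that value inflates the standard deviation it is measured against. The IQR rule is less affected by this.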

In [None]:
# Set figure size
plt.figure(figsize=(12, 6))

# Create a boxplot for all numerical columns
sns.boxplot(data=df, orient="h")

# Show the plot
plt.title("Outlier Detection using Boxplots")
plt.show()

Because the columns are on very different scales, several box plots in the graph above are compressed and hard to read. To improve clarity, the plot below focuses on a selected set of columns.

In [None]:
columns_to_check = ['NumCompaniesWorked', 'PerformanceRating', 'StockOptionLevel',
                    'TotalWorkingYears', 'TrainingTimesLastYear', 'YearsAtCompany',
                    'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager']

# Plot only selected columns
plt.figure(figsize=(12, 6))
sns.boxplot(data=df[columns_to_check], orient="h")
plt.show()

### Comparing the box plots for each feature by Attrition:

In [None]:
# Define the columns to compare with Attrition
columns_to_check = ['NumCompaniesWorked','StockOptionLevel','MonthlyIncome','MonthlyRate','DailyRate',
                    'TotalWorkingYears', 'TrainingTimesLastYear', 'YearsAtCompany',
                    'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager']

# Loop through each column and create a boxplot
for col in columns_to_check:
    plt.figure(figsize=(12, 6))
    sns.boxplot(x='Attrition', y=col, data=df)
    plt.xticks(rotation=45)
    plt.title(f"Distribution of {col} Across Attrition")
    plt.show()

The box plots suggest that employees who leave tend to do so around the 2.5-year mark and are earlier in their careers than those who stay.

### Comparing categorical features against Attrition:

In [None]:
categorical_columns = ['BusinessTravel','Department','EducationField','Gender','JobRole','MaritalStatus','Over18','OverTime']

for col in categorical_columns:
    plt.figure(figsize=(8, 5))  # New figure for each plot

    # Group and count occurrences
    category_counts = df.groupby(['Attrition', col]).size().unstack()

    # Sort columns by total count (sum over rows) in descending order
    category_counts = category_counts[category_counts.sum(axis=0).sort_values(ascending=False).index]

    # Plot stacked bar chart
    category_counts.plot(kind='bar', stacked=True, ax=plt.gca())

    # Labels & Title
    plt.xlabel('Attrition')
    plt.ylabel("Count")
    plt.title(f"Stacked Bar Chart: {col} vs Attrition")
    plt.legend(title=col)

    # Show plot
    plt.show()

In [None]:
categorical_columns = ['BusinessTravel','Department','EducationField','Gender','JobRole','MaritalStatus','OverTime']

summary_tables = {}

# Loop through each categorical column
for cat_col in categorical_columns:
    # Cross-tabulation of Attrition and categorical variable
    crosstab = pd.crosstab(df[cat_col], df['Attrition'], margins=True)

    # Convert counts to percentages
    crosstab_percentage = crosstab.div(crosstab["All"], axis=0) * 100

    # Combine both count & percentage into a single DataFrame
    summary_table = crosstab.copy()
    for col in crosstab.columns[:-1]:  # Exclude the "All" column
        summary_table[col] = summary_table[col].astype(str) + " (" + crosstab_percentage[col].round(2).astype(str) + "%)"

    # Store the formatted summary table
    summary_tables[cat_col] = summary_table.drop(columns=["All"])  # Remove the total count column

    # Print the table
    print(f"\n📊 Summary Table for {cat_col} vs Attrition:\n")
    print(summary_tables[cat_col])


The bar charts and percentage tables group each categorical column by **Attrition**. The percentages are computed within each category (row), so they give the share of that group that left. Some categories are small, so their rates are less reliable. Key observations include:  

1. **25%** of employees who travel frequently left the company.  
2. **20%** of employees in the **Sales** department left.  
3. Employees with a **Technical Degree** or **Marketing** education field had some of the highest attrition rates at **22-24%**.  
4. **26%** of employees with a Human Resources education field left, but this group is small (only **27 employees** in the study).
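
The direction of the percentages matters when reading these tables. The summary tables divide each row by its row total, which answers "what share of this group left?", not "what share of leavers came from this group?". The sketch below contrasts the two using the `normalize` argument of `pd.crosstab`. The toy data is illustrative only and does not reproduce the study's numbers.

```python
import pandas as pd

# Toy data (illustrative only): 4 frequent travelers, 6 rare travelers
toy = pd.DataFrame({
    "BusinessTravel": ["Travel_Frequently"] * 4 + ["Travel_Rarely"] * 6,
    "Attrition": ["Yes", "No", "No", "No", "Yes", "No", "No", "No", "No", "No"],
})

# Row-normalized: share of each travel group that left or stayed
by_group = pd.crosstab(toy["BusinessTravel"], toy["Attrition"], normalize="index") * 100

# Column-normalized: share of leavers (or stayers) that belong to each travel group
by_outcome = pd.crosstab(toy["BusinessTravel"], toy["Attrition"], normalize="columns") * 100

print(by_group.round(1))
print(by_outcome.round(1))
```

In the toy data, 25% of frequent travelers left, while 50% of all leavers were frequent travelers. The two statements describe the same counts but answer different questions.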

## Dealing with Outliers

Since our data contains outliers, we will address them before analyzing correlations and other patterns. Because the context of these outliers is unknown, we keep every row and apply the Box-Cox transformation instead of removing data.

The Box-Cox transformation is a power transform that compresses large values and reduces skew, which brings the data closer to a normal distribution and makes it more suitable for predictive modeling. It fits its exponent (lambda) to each column, so it adapts to different distributions, but it requires strictly positive input, so columns containing zeros or negative values must be shifted first.
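
To make the effect concrete, the sketch below applies `scipy.stats.boxcox` to synthetic right-skewed data and compares skewness before and after. For a fitted lambda, the transform is `(x**lambda - 1) / lambda`, or `log(x)` when lambda is 0. The lognormal sample and the variable names are assumptions chosen for illustration, not values from the HR dataset.

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(0)
# Synthetic right-skewed, strictly positive data (a stand-in for an income-like column)
income_like = rng.lognormal(mean=8.0, sigma=0.6, size=1000)

# boxcox returns the transformed values and the lambda that maximizes the log-likelihood
transformed, fitted_lambda = boxcox(income_like)

skew_before = skew(income_like)
skew_after = skew(transformed)
print(f"lambda: {fitted_lambda:.3f}")
print(f"skewness before: {skew_before:.3f}, after: {skew_after:.3f}")
```

The extreme values are still present after the transform. They are pulled closer to the bulk of the distribution rather than removed.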

In [None]:
df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
df_transformed = df.copy()  # Work on a copy so the raw data stays untouched

num_cols = df.select_dtypes(include=['float64', 'int64']).columns
cat_cols = df.select_dtypes(include=['object', 'category']).columns

# Function to apply Box-Cox transformation
def apply_boxcoxs(series):
    # Box-Cox requires strictly positive values, so shift the data if necessary.
    # Build a new Series instead of using += so the caller's column is not modified in place.
    if (series <= 0).any():
        series = series + abs(series.min()) + 1
    return boxcox(series)[0]  # Keep the transformed values, discard lambda

# Apply Box-Cox to numerical columns, skipping columns that raise ValueError (e.g. constant columns)
for col in num_cols:
    try:
        df_transformed[col] = apply_boxcoxs(df[col])
    except ValueError as e:
        print(f"Skipping {col}: {e}")

# Categorical columns are carried over unchanged by the copy; encode the target as 0/1
df_transformed['Attrition'] = df_transformed['Attrition'].map({'Yes': 1, 'No': 0})

## Correlation of Unskewed Data
Below are the 20 features with the highest positive correlation with Attrition (the heatmap includes Attrition itself). Key insights include:

- Employees who work overtime are more likely to leave.

- Individuals who have never been married show higher attrition rates.

- Sales Representatives are among the most likely to leave.

These correlations offer a first look at employee retention patterns. They show association, not cause.

In [None]:

# Convert categorical columns to numeric using one-hot encoding
df_encoded = pd.get_dummies(df_transformed, drop_first=True)  # Creates numeric columns for categories

# Compute correlation with Attrition
corr_matrix = df_encoded.corr()

# Select the 20 features with the highest correlation with Attrition (includes Attrition itself)
top_corr_features = corr_matrix.nlargest(20, "Attrition")["Attrition"].index

# Plot heatmap
plt.figure(figsize=(15, 15))
sns.heatmap(df_encoded[top_corr_features].corr(), annot=True, cmap="RdYlGn", annot_kws={"size": 10})

plt.title("Top 20 Correlated Features")
plt.show()

In [None]:
# Convert categorical columns to numeric using one-hot encoding
data_encoded = pd.get_dummies(df_transformed, drop_first=True)  # Avoid dummy variable trap

# Compute correlation with Attrition
corr_values = data_encoded.drop(columns=['Attrition','StandardHours','EmployeeCount'], errors='ignore').corrwith(data_encoded['Attrition']).sort_values()

# Plot correlation values
plt.figure(figsize=(10, 15))
corr_values.plot(kind='barh')
plt.title("Correlation of All Features with Attrition")
plt.xlabel("Correlation")
plt.ylabel("Features")
plt.show()

### Handling Class Imbalance  

Since employee attrition is typically lower than retention, class imbalance can impact model performance. To address this:  

1. **Detect and quantify class imbalance** by calculating the distribution of attrition classes.  
2. **Implement class weighting** in models that support it to ensure balanced learning.  

This approach helps the model recognize the minority class (employees who leave), typically improving recall for that class, sometimes at the cost of a little overall accuracy.
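
The two steps above can be sketched on toy labels. This example uses scikit-learn's `compute_class_weight` with the `'balanced'` heuristic, which is one common way to derive weights. It is an alternative to a hand-built weight dictionary, and the label counts are illustrative only.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels only: 84 stayers (0) and 16 leavers (1)
y_toy = np.array([0] * 84 + [1] * 16)

# Step 1: quantify the imbalance
counts = np.bincount(y_toy)
print(f"stayed: {counts[0]}, left: {counts[1]}, ratio: {counts[0] / counts[1]:.2f}:1")

# Step 2: derive weights; 'balanced' uses n_samples / (n_classes * class_count)
classes = np.unique(y_toy)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weight)
```

The resulting dictionary can be passed as `class_weight` to estimators that accept it, such as `LogisticRegression` or `RandomForestClassifier`. Passing `class_weight='balanced'` directly to those estimators applies the same formula.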

### Attrition Rate

In [None]:
target_col = 'Attrition'

# Check attrition distribution in the encoded data (0=stayed, 1=left)
attrition_counts = df_transformed[target_col].value_counts()
print("Attrition distribution:")
print(attrition_counts)

# Calculate attrition rate
left_count = attrition_counts.get(1, 0)
total_count = len(df_transformed)
attrition_rate = left_count / total_count
print(f"Attrition rate: {attrition_rate:.2%}")

# Check for class imbalance
stayed_count = attrition_counts.get(0, 0)
if left_count > 0:
    imbalance_ratio = stayed_count / left_count
    print(f"Class imbalance ratio (stayed:left): {imbalance_ratio:.2f}:1")
    if imbalance_ratio > 3:
        print("Note: Dataset has significant class imbalance.")
else:
    print("Warning: No attrition cases found in the dataset.")
    imbalance_ratio = 999  # Set a high value to trigger class weight adjustment

In [None]:
target_col = 'Attrition'
# Split Data
X = df_transformed.drop(target_col, axis=1)
y = df_transformed[target_col]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Print Dataset Summary
print(f"Training set shape: {X_train.shape}, Test set shape: {X_test.shape}")
print(f"Training attrition rate: {np.mean(y_train):.2%}, Test attrition rate: {np.mean(y_test):.2%}")

# Preprocessing Pipeline (No Imputation Since No Nulls)
numeric_transformer = Pipeline([('scaler', StandardScaler())])

categorical_transformer = Pipeline([
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
# Select columns from X so the target column is never passed to the transformer
numeric_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_cols = X.select_dtypes(include=['object', 'category', 'bool']).columns.tolist()
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_cols),
    ('cat', categorical_transformer, categorical_cols)
])

# Handle Class Imbalance
class_weight = {0: 1, 1: imbalance_ratio} if imbalance_ratio > 3 else None

# Define Models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42, class_weight=class_weight),
    'Random Forest': RandomForestClassifier(random_state=42, class_weight=class_weight),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'XGBoost': xgb.XGBClassifier(random_state=42, scale_pos_weight=imbalance_ratio if imbalance_ratio > 3 else 1),
    'SVC': SVC(probability=True, random_state=42, class_weight=class_weight),
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(random_state=42, class_weight=class_weight)
}

The attrition rates in the training and test sets are nearly identical, which is what stratified sampling is designed to achieve. Both sets therefore reflect the same class balance, so evaluation on the test set is not distorted by a different share of leavers.

**Class Imbalance:** Only about 16% of employees leave, so the dataset is naturally imbalanced. Models must account for this to avoid favoring the majority class (employees who stay).

**Real-World Representation:** If the training and test data had very different attrition rates, the model might struggle to generalize. Here the rates match, so the model is trained and evaluated on samples with the same class balance as the full dataset.

## Train and Evaluate Models

In [None]:
# Train and Evaluate Models
results = {}
for name, model in models.items():
    print(f"\nTraining {name}...")

    pipeline = Pipeline([('preprocessor', preprocessor), ('model', model)])
    pipeline.fit(X_train, y_train)

    y_pred = pipeline.predict(X_test)
    y_pred_proba = pipeline.predict_proba(X_test)[:, 1] if hasattr(model, "predict_proba") else None

    # Compute Metrics
    metrics = {
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred, average='weighted'),
        'Recall': recall_score(y_test, y_pred, average='weighted'),
        'F1 Score': f1_score(y_test, y_pred, average='weighted'),
        'ROC-AUC': roc_auc_score(y_test, y_pred_proba) if y_pred_proba is not None else None,
        'Attrition Precision': precision_score(y_test, y_pred, pos_label=1),
        'Attrition Recall': recall_score(y_test, y_pred, pos_label=1),
        'Attrition F1': f1_score(y_test, y_pred, pos_label=1),
        'Model': pipeline
    }
    
    results[name] = metrics

In [None]:
# Display Model Comparison
metrics_df = pd.DataFrame({k: {m: v[m] for m in v if m != "Model"} for k, v in results.items()}).T
print("\n=== Model Performance Comparison ===")
print(metrics_df)

# Identify Best Models
best_metric = 'Attrition Recall'
best_model_name = max(results, key=lambda k: results[k][best_metric] or 0)
best_model = results[best_model_name]['Model']
print(f"\nBest model based on {best_metric}: {best_model_name}")

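# Feature Importance for Tree-Based Models
if hasattr(best_model.named_steps['model'], 'feature_importances_'):
    # Read feature names from the preprocessor fitted inside the best pipeline
    fitted_preprocessor = best_model.named_steps['preprocessor']
    cat_features = fitted_preprocessor.transformers_[1][2]
    cat_feature_names = fitted_preprocessor.transformers_[1][1].steps[0][1].get_feature_names_out(cat_features)
    feature_names = np.concatenate([numeric_cols, cat_feature_names])

    importances = best_model.named_steps['model'].feature_importances_
    sorted_idx = np.argsort(importances)[::-1]

    # Show the ten most important features
    for idx in sorted_idx[:10]:
        print(f"{feature_names[idx]}: {importances[idx]:.4f}")
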

# Precision-Recall Curve and Optimal Threshold
if hasattr(best_model.named_steps['model'], 'predict_proba'):
    y_pred_proba = best_model.predict_proba(X_test)[:, 1]
    precisions, recalls, thresholds = precision_recall_curve(y_test, y_pred_proba)
    f1_scores = 2 * (precisions * recalls) / (precisions + recalls + 1e-10)
    # The last precision/recall pair has no matching threshold, so exclude it from the argmax
    optimal_threshold = thresholds[np.argmax(f1_scores[:-1])] if len(thresholds) > 0 else 0.5

    print(f"\nOptimal probability threshold for attrition: {optimal_threshold:.4f}")


## Conclusion

Logistic Regression has the highest Attrition Recall: it correctly flags the largest share of employees who actually left, which makes it the best choice when the goal is to catch as many at-risk employees as possible.

On the overall metrics (Accuracy, weighted Precision, Recall, and F1 Score, and ROC-AUC), XGBoost and Gradient Boosting perform best.

XGBoost has the highest weighted F1 Score (0.841) and Accuracy (86.39%), making it a strong all-around model. Because only about 16% of employees leave, these overall scores are dominated by the majority class and should be read alongside the attrition-specific metrics.

### Comparison:

- Logistic Regression is best at finding leavers (highest recall: 61.7%) but is less precise, so it produces more false positives.

- Random Forest & KNN have poor recall, meaning they miss a lot of people who actually leave.

- XGBoost & Gradient Boosting offer a balance, but they still miss many at-risk employees.


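**Optimal Probability Threshold (0.6073):**

For the selected model, the threshold that maximizes the attrition F1 score on the test set is 0.6073 rather than the default 0.5. Raising the threshold means fewer employees are flagged: false alarms drop, at the cost of missing some leavers. Because this value was tuned on the test set, it should be confirmed on separate validation data before use.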
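
Applying a custom threshold means comparing the predicted probability for the attrition class against that value instead of calling `predict`. The sketch below shows the mechanics and the resulting precision/recall trade-off. The labels, probabilities, and the helper name `predict_with_threshold` are illustrative assumptions, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Illustrative labels and predicted probabilities (not results from this notebook)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([0.10, 0.20, 0.30, 0.55, 0.58, 0.40, 0.52, 0.65, 0.80, 0.90])

def predict_with_threshold(probabilities, threshold):
    """Label a sample as attrition (1) when its probability reaches the threshold."""
    return (probabilities >= threshold).astype(int)

for threshold in (0.5, 0.6):
    y_hat = predict_with_threshold(proba, threshold)
    p = precision_score(y_true, y_hat)
    r = recall_score(y_true, y_hat)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

With a fitted pipeline, `proba` would come from `pipeline.predict_proba(X)[:, 1]`. In the toy data, moving the threshold from 0.5 to 0.6 removes both false alarms but also misses one leaver.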

### Business Takeaway
If identifying at-risk employees is the priority (minimizing missed leavers), go with Logistic Regression.

If a balance between predicting leavers and overall accuracy is needed, XGBoost or Gradient Boosting is better.

Setting the probability threshold to 0.6073 gives the best balance between precision and recall (highest F1) for the attrition class.