# Debugging Assignment — Employee Attrition Prediction (BROKEN VERSION)

This notebook contains an **intentionally broken** machine learning pipeline for predicting employee attrition.

Your task (as a candidate) will be to:
- Identify the issues
- Fix them
- Retrain and properly evaluate the model

> ⚠️ Note: This version is deliberately incorrect. Do **not** use it as a template in production.


In [31]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

# Load dataset
df = pd.read_csv("debug_dataset.csv")
df.head()

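|   | age | gender | education | department | job_role | monthly_income | years_at_company | promotions | overtime | performance_rating | attrition |
|---|-----|--------|-----------|------------|----------|----------------|------------------|------------|----------|--------------------|-----------|
| 0 | 50 | Female | Post-Graduate | IT | Lead | 102565 | 9 | 0 | No | 1 | 0 |
| 1 | 36 | Female | PhD | Sales | Lead | 49402 | 2 | 0 | Yes | 2 | 0 |
| 2 | 29 | Female | Graduate | HR | Executive | 24263 | 7 | 1 | No | 1 | 0 |
| 3 | 42 | Male | Graduate | HR | Executive | 116523 | 7 | 3 | Yes | 4 | 0 |
| 4 | 40 | Female | PhD | HR | Manager | 66828 | 1 | 1 | Yes | 3 | 0 |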


In [32]:
# Basic info
print("Dataset Shape:", df.shape)
print("\nData Types:")
print(df.dtypes)
print("\nMissing Values:")
print(df.isnull().sum())
print("\nTarget Distribution:")
print(df['attrition'].value_counts())

Dataset Shape: (300, 11)

Data Types:
age                    int64
gender                object
education             object
department            object
job_role              object
monthly_income         int64
years_at_company       int64
promotions             int64
overtime              object
performance_rating     int64
attrition              int64
dtype: object

Missing Values:
age                   0
gender                0
education             0
department            0
job_role              0
monthly_income        0
years_at_company      0
promotions            0
overtime              0
performance_rating    0
attrition             0
dtype: int64

Target Distribution:
attrition
0    231
1     69
Name: count, dtype: int64


In [33]:

# --- Feature Engineering (BROKEN ON PURPOSE) ---

# 1) Create a feature that directly uses the target (leakage disguised as a helper feature)
df["attrition_copy"] = df["attrition"]  # <-- target copied into features

# 2) Derive 'target_leakage_feature' from the target and assume it's a good predictor
df["target_leakage_feature"] = df["attrition"].apply(lambda x: 1 if x == 1 else 0)

# 3) Fill all missing values with 0 without analysis
df = df.fillna(0)

df.head()

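|   | age | gender | education | department | job_role | monthly_income | years_at_company | promotions | overtime | performance_rating | attrition | attrition_copy | target_leakage_feature |
|---|-----|--------|-----------|------------|----------|----------------|------------------|------------|----------|--------------------|-----------|----------------|------------------------|
| 0 | 50 | Female | Post-Graduate | IT | Lead | 102565 | 9 | 0 | No | 1 | 0 | 0 | 0 |
| 1 | 36 | Female | PhD | Sales | Lead | 49402 | 2 | 0 | Yes | 2 | 0 | 0 | 0 |
| 2 | 29 | Female | Graduate | HR | Executive | 24263 | 7 | 1 | No | 1 | 0 | 0 | 0 |
| 3 | 42 | Male | Graduate | HR | Executive | 116523 | 7 | 3 | Yes | 4 | 0 | 0 | 0 |
| 4 | 40 | Female | PhD | HR | Manager | 66828 | 1 | 1 | Yes | 3 | 0 | 0 | 0 |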
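
Both engineered columns in the cell above hold exactly the same values as `attrition`, which is what makes them leak the label. One possible way to surface such columns is sketched below on a tiny hypothetical frame (`toy`) rather than the real dataset. The helper name `find_target_copies` is an assumption made for illustration and is not part of the notebook. It only catches exact copies, not features that are merely derived from or correlated with the target.

```python
import pandas as pd


def find_target_copies(frame: pd.DataFrame, target: str) -> list:
    """Return the names of columns whose values equal the target row by row."""
    return [
        col
        for col in frame.columns
        if col != target and bool(frame[col].eq(frame[target]).all())
    ]


# Tiny hypothetical frame standing in for the notebook's df (not real data)
toy = pd.DataFrame({
    "age": [50, 36, 29, 42],
    "attrition": [0, 1, 0, 1],
})
toy["attrition_copy"] = toy["attrition"]
toy["target_leakage_feature"] = toy["attrition"].apply(lambda x: 1 if x == 1 else 0)

leaky = find_target_copies(toy, "attrition")
print("Columns identical to the target:", leaky)

# Drop the leaking columns before building the feature matrix
clean = toy.drop(columns=leaky)
print("Remaining columns:", list(clean.columns))
```

A check like this is cheap to run before training; a suspiciously perfect score is another hint that a column of this kind slipped through.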


In [34]:

# --- Create features/target (order BROKEN) ---

categorical_cols = ['gender', 'education', 'department', 'job_role', 'overtime']
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

X = df.drop("attrition", axis=1)
y = df["attrition"]

# Scale ALL data before splitting (data leakage)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split AFTER scaling (this uses information from the full dataset)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
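
The comments in the cell above flag the ordering problem: the scaler sees the test rows before the split. A minimal sketch of the usual remedy, splitting first and fitting the scaler on the training rows only, is shown here on synthetic data. The arrays `X_demo` and `y_demo` and the name `demo_scaler` are assumptions for illustration, and `stratify` is added as one reasonable choice for an imbalanced target rather than something the notebook prescribes.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target (not the real dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(300, 3))
y_demo = (rng.random(300) < 0.23).astype(int)

# 1) Split first, keeping the class ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# 2) Fit the scaler on the training rows only
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)

# 3) Reuse the training statistics to transform the test rows
X_te_scaled = demo_scaler.transform(X_te)

print("Train shape:", X_tr_scaled.shape, "Test shape:", X_te_scaled.shape)
```

With this ordering the test rows contribute nothing to the mean and standard deviation used for scaling.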


In [35]:

# --- Train RandomForest with no depth limit or other regularization ---

model = RandomForestClassifier(n_estimators=500, random_state=42, max_depth=None)
model.fit(X_train, y_train)

# Evaluate on a single test split, relying on accuracy as the headline metric
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("Test Accuracy:", acc)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))



Test Accuracy: 1.0

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        49
           1       1.00      1.00      1.00        11

    accuracy                           1.00        60
   macro avg       1.00      1.00      1.00        60
weighted avg       1.00      1.00      1.00        60
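
The support column in the report shows 49 negatives and 11 positives, so accuracy alone says little here. The sketch below rebuilds labels with those same class counts and scores a hypothetical predictor that always outputs the majority class. The arrays `y_true` and `y_majority` are constructed for illustration and are not the notebook's actual predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, roc_auc_score

# Labels with the same class counts as the test split: 49 stayed, 11 left
y_true = np.array([0] * 49 + [1] * 11)

# Hypothetical "model" that always predicts the majority class
y_majority = np.zeros_like(y_true)

# Plain accuracy rewards the majority guess: 49 / 60
baseline_acc = accuracy_score(y_true, y_majority)

# Balanced accuracy averages per-class recall: (1.0 + 0.0) / 2
baseline_bal = balanced_accuracy_score(y_true, y_majority)

# Constant scores carry no ranking information, so ROC AUC is 0.5
baseline_auc = roc_auc_score(y_true, np.zeros(len(y_true), dtype=float))

print(f"Accuracy: {baseline_acc:.3f}")
print(f"Balanced accuracy: {baseline_bal:.3f}")
print(f"ROC AUC: {baseline_auc:.3f}")
```

A predictor that never flags attrition already reaches about 0.817 accuracy on this split, which is why imbalance-aware metrics such as balanced accuracy or ROC AUC are worth reporting alongside it.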



In [36]:

# --- Misused cross-validation (BROKEN) ---

# Run cross-validation only on the TEST set (incorrect!)
cv_scores = cross_val_score(model, X_test, y_test, cv=5)
print("Cross-validation scores on test set:", cv_scores)
print("Mean CV score:", cv_scores.mean())



Cross-validation scores on test set: [1. 1. 1. 1. 1.]
Mean CV score: 1.0
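
Cross-validation is normally run on the training portion, with preprocessing refit inside each fold, while the test set is held back for one final check. One way this could look is sketched below on synthetic data. `X_demo`, `y_demo`, the pipeline step names, and the hyperparameters (`n_estimators=100`, `max_depth=5`) are all assumptions chosen for illustration, not tuned or recommended values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with an imbalanced binary target (not the real dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + rng.normal(scale=1.0, size=300) > 0.8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scaling lives inside the pipeline, so it is refit on each training fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("rf", RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)),
])

# Stratified folds on the TRAINING data only
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = cross_val_score(pipe, X_tr, y_tr, cv=cv, scoring="roc_auc")
print("CV ROC AUC per fold:", np.round(cv_auc, 3))

# The held-out test set is touched once, after model selection is finished
pipe.fit(X_tr, y_tr)
test_auc = roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1])
print("Held-out test ROC AUC:", round(float(test_auc), 3))
```

Wrapping the scaler and the model in a single `Pipeline` keeps each validation fold unseen by the preprocessing step, which avoids repeating the leakage from the earlier cell inside cross-validation.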



## ✅ Final Conclusion (INCORRECT)

The model achieves **100% accuracy** on the test set and in cross-validation and is therefore ready to be used in production for predicting employee attrition.

No further validation or checks are required.
