Week 3 – ML Foundations: Baseline Model
Predicting Employee Attrition using Logistic Regression & Random Forest (Baseline)

In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

Load Dataset

In [10]:
# Load the HR attrition data and encode the target as 1 = left, 0 = stayed
df = pd.read_csv('../06_ml/HR-Employee-Attrition.csv')
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})
print('Shape:', df.shape)
df.head()

Shape: (1470, 35)


,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,1,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,0,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,1,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,0,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,0,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


Prepare Features and Target

In [11]:
features = [
    'Age', 'MonthlyIncome', 'DistanceFromHome', 'YearsAtCompany',
    'TotalWorkingYears', 'JobSatisfaction', 'EnvironmentSatisfaction',
    'JobInvolvement', 'PerformanceRating', 'WorkLifeBalance',
    'OverTime', 'Gender', 'MaritalStatus'
]

X = df[features]
y = df['Attrition']

# One-hot encode the categorical columns; drop_first avoids redundant dummy columns
X = pd.get_dummies(X, columns=['OverTime', 'Gender', 'MaritalStatus'], drop_first=True)
print('Feature shape:', X.shape)
print('Target distribution:\n', y.value_counts())

Feature shape: (1470, 14)
Target distribution:
 Attrition
0    1233
1     237
Name: count, dtype: int64
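
The counts above show a strongly imbalanced target, which is why the later cells use `stratify=y` and `class_weight='balanced'`. As a rough reference point, here is a minimal sketch of the attrition rate and the accuracy a trivial "always predict 0" classifier would get on the full dataset. The variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
n_stay, n_leave = 1233, 237
n_total = n_stay + n_leave

attrition_rate = n_leave / n_total        # share of the positive class
majority_baseline_acc = n_stay / n_total  # accuracy of always predicting 0

print(f"Attrition rate: {attrition_rate:.3f}")
print(f"Majority-class accuracy: {majority_baseline_acc:.3f}")
```

Because the trivial classifier already scores about 0.84 on accuracy, recall and F1 on the positive class are more informative than accuracy for this task.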


Split and Scale Data

In [12]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training split only, then reuse its statistics on the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print('Training set:', X_train.shape, 'Testing set:', X_test.shape)

Training set: (1176, 14) Testing set: (294, 14)
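
The cell above fits the scaler on the training split and only transforms the test split, which keeps test-set statistics out of preprocessing. One way to package the same pattern is scikit-learn's `Pipeline`. The sketch below uses synthetic data from `make_classification` as a stand-in for the HR features (an assumption made so the block runs on its own), so its score is not comparable to the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (assumption: not the HR dataset).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=14, weights=[0.84, 0.16], random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# fit() learns the scaling statistics from the training rows only;
# predict() and score() reuse them on the test rows.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000, class_weight="balanced")),
])
pipe.fit(Xd_train, yd_train)
print("Demo test accuracy:", round(pipe.score(Xd_test, yd_test), 3))
```

Bundling the steps also means cross-validation refits the scaler inside each fold, which a manually pre-scaled array would not do.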


Baseline Model – Logistic Regression

In [13]:
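# class_weight='balanced' reweights classes inversely to their frequency,
# which offsets the roughly 5:1 imbalance between stayers and leavers.
model = LogisticRegression(
    max_iter=2000, class_weight='balanced', solver='lbfgs', random_state=42
)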
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

print('Accuracy:', round(accuracy_score(y_test, y_pred), 3))
print('F1 Score:', round(f1_score(y_test, y_pred), 3))
print('\nClassification Report:\n', classification_report(y_test, y_pred))
print('\nConfusion Matrix:\n', confusion_matrix(y_test, y_pred))

Accuracy: 0.759
F1 Score: 0.51

Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.75      0.84       247
           1       0.38      0.79      0.51        47

    accuracy                           0.76       294
   macro avg       0.66      0.77      0.68       294
weighted avg       0.86      0.76      0.79       294


Confusion Matrix:
 [[186  61]
 [ 10  37]]
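
To connect the confusion matrix to the reported scores, here is a small sketch that recomputes the positive-class metrics by hand from the four cell counts above. It follows scikit-learn's layout (rows are actual classes, columns are predicted classes). The variable names are illustrative.

```python
# Cell counts copied from the confusion matrix above.
tn, fp = 186, 61   # actual 0: correctly kept, wrongly flagged as leaving
fn, tp = 10, 37    # actual 1: missed leavers, correctly flagged leavers

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of predicted leavers, share who actually left
recall = tp / (tp + fn)      # of actual leavers, share the model caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

The results agree with the classification report: the model catches about 79% of leavers, but only about 38% of the employees it flags actually leave.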


Feature Importance – Logistic Regression

In [14]:
coef_df = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_[0]
}).sort_values(by='Coefficient', ascending=False)

# With only 14 features, head(10)/tail(10) would mix signs, so split by sign instead
positive = coef_df[coef_df['Coefficient'] > 0]
negative = coef_df[coef_df['Coefficient'] < 0]

print('\nPositive Coefficients (Higher Attrition Risk):\n')
print(positive)
print('\nNegative Coefficients (Higher Retention):\n')
print(negative)


Positive Coefficients (Higher Attrition Risk):

                  Feature  Coefficient
10           OverTime_Yes     0.642238
13   MaritalStatus_Single     0.489612
2        DistanceFromHome     0.240569
12  MaritalStatus_Married     0.145210
11            Gender_Male     0.138972

Negative Coefficients (Higher Retention):

                   Feature  Coefficient
8        PerformanceRating    -0.088698
3           YearsAtCompany    -0.162494
1            MonthlyIncome    -0.192234
4        TotalWorkingYears    -0.224247
0                      Age    -0.230754
9          WorkLifeBalance    -0.233871
5          JobSatisfaction    -0.346166
7           JobInvolvement    -0.359475
6  EnvironmentSatisfaction    -0.378514
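
Because the model was fit on standardized features, each coefficient is the change in the log-odds of attrition for a one-standard-deviation increase in that column, dummy columns included. Exponentiating a coefficient gives an odds ratio, which is often easier to read. Here is a minimal sketch using a few coefficients copied from the output above. The dictionary is illustrative and not part of the notebook.

```python
import math

# A few coefficients copied from the output above (standardized features).
coefs = {
    "OverTime_Yes": 0.642238,
    "MaritalStatus_Single": 0.489612,
    "JobSatisfaction": -0.346166,
    "EnvironmentSatisfaction": -0.378514,
}

# exp(coef) = multiplicative change in the odds of attrition for a
# one-standard-deviation increase in the scaled feature.
odds_ratios = {name: math.exp(c) for name, c in coefs.items()}

for name, ratio in sorted(odds_ratios.items(), key=lambda kv: -kv[1]):
    print(f"{name:<25} odds ratio ~ {ratio:.2f}")
```

Ratios above 1 indicate higher odds of attrition and ratios below 1 indicate lower odds. These are associations within this model, not causal effects.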
