# First Project: Analysis of the Factors Behind the High Employee Attrition Rate at Jaya Jaya Maju (Solving a Human Resources Problem)

- Name: Muhammad Akbar Hamid
- Email: muhakbarhamid21@gmail.com
- Dicoding ID: muhakbarhamid21

## Preparation

In [52]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import optuna
import joblib

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, roc_curve



In [2]:
sns.set(style="whitegrid")

In [3]:
df = pd.read_csv('data/employee_data.csv')

In [4]:
df.head()

Unnamed: 0,EmployeeId,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,1,38,,Travel_Frequently,1444,Human Resources,1,4,Other,1,...,2,80,1,7,2,3,6,2,1,2
1,2,37,1.0,Travel_Rarely,1141,Research & Development,11,2,Medical,1,...,1,80,0,15,2,1,1,0,0,0
2,3,51,1.0,Travel_Rarely,1323,Research & Development,4,4,Life Sciences,1,...,3,80,3,18,2,4,10,0,2,7
3,4,42,0.0,Travel_Frequently,555,Sales,26,3,Marketing,1,...,4,80,1,23,2,4,20,4,4,8
4,5,40,,Travel_Rarely,1194,Research & Development,2,4,Medical,1,...,2,80,3,20,2,3,5,3,0,2


## Data Understanding

### General Dataset Information

In [5]:
print(f"Number of rows: {df.shape[0]}, Number of columns: {df.shape[1]}")

Number of rows: 1470, Number of columns: 35


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   EmployeeId                1470 non-null   int64  
 1   Age                       1470 non-null   int64  
 2   Attrition                 1058 non-null   float64
 3   BusinessTravel            1470 non-null   object 
 4   DailyRate                 1470 non-null   int64  
 5   Department                1470 non-null   object 
 6   DistanceFromHome          1470 non-null   int64  
 7   Education                 1470 non-null   int64  
 8   EducationField            1470 non-null   object 
 9   EmployeeCount             1470 non-null   int64  
 10  EnvironmentSatisfaction   1470 non-null   int64  
 11  Gender                    1470 non-null   object 
 12  HourlyRate                1470 non-null   int64  
 13  JobInvolvement            1470 non-null   int64  
 14  JobLevel

### Check for Missing Values & Duplicates

In [7]:
missing = df.isnull().sum()
print("Missing values:\n", missing[missing > 0])

Missing values:
 Attrition    412
dtype: int64


The Attrition column has 412 missing values → these rows need to be cleaned up (dropped or imputed).

In [8]:
print("Number of duplicate rows:", df.duplicated().sum())

Number of duplicate rows: 0


No duplicate rows were found.

### Distribution of the Target "Attrition"

In [9]:
attr_counts = df['Attrition'].value_counts()
attr_percent = df['Attrition'].value_counts(normalize=True) * 100

# value_counts() sorts by frequency, so the labels below assume that
# 0.0 (stayed) is the majority class and therefore comes first.
attr_summary = pd.DataFrame({
    'Employee Status': ['0.0 (Stayed)', '1.0 (Left)'],
    'Count': attr_counts.values,
    'Proportion (%)': attr_percent.values
})

print("Employee Attrition Status Summary:\n")
print(attr_summary.to_string(index=False))


Employee Attrition Status Summary:

Employee Status  Count  Proportion (%)
   0.0 (Stayed)    879       83.081285
     1.0 (Left)    179       16.918715


In [10]:
fig = go.Figure(data=[go.Pie(labels=['0.0 (Stayed)', '1.0 (Left)'], values=attr_counts, pull=[0.1, 0], textinfo='percent+label',)])
fig.update_layout(title='Employee Distribution: Left vs Stayed', height=400, width=600)
fig.show()

- About 83% of employees stayed and only about 17% left.

In other words, the data is class-imbalanced, which has to be taken into account when building a predictive model.
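
One common way to account for this imbalance is class weighting, which is what `class_weight='balanced'` does in the models trained later in this notebook. The sketch below only reproduces the arithmetic for the class counts reported above (879 vs 179); the variable names are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts taken from the summary above: 879 stayed (0), 179 left (1).
y = np.array([0] * 879 + [1] * 179)

# "balanced" weights follow n_samples / (n_classes * count_per_class).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
manual = len(y) / (2 * np.bincount(y))

print(dict(zip([0, 1], weights.round(3))))
```

The minority class receives the larger weight, so misclassifying a leaver costs the model more than misclassifying a stayer.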

### Descriptive Statistics

A statistical summary of the numeric and categorical columns.

In [11]:
# Numeric
df.describe()

Unnamed: 0,EmployeeId,Age,Attrition,DailyRate,DistanceFromHome,Education,EmployeeCount,EnvironmentSatisfaction,HourlyRate,JobInvolvement,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
count,1470.0,1470.0,1058.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,...,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0
mean,735.5,36.92381,0.169187,802.485714,9.192517,2.912925,1.0,2.721769,65.891156,2.729932,...,2.712245,80.0,0.793878,11.279592,2.79932,2.761224,7.008163,4.229252,2.187755,4.123129
std,424.496761,9.135373,0.375094,403.5091,8.106864,1.024165,0.0,1.093082,20.329428,0.711561,...,1.081209,0.0,0.852077,7.780782,1.289271,0.706476,6.126525,3.623137,3.22243,3.568136
min,1.0,18.0,0.0,102.0,1.0,1.0,1.0,1.0,30.0,1.0,...,1.0,80.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
25%,368.25,30.0,0.0,465.0,2.0,2.0,1.0,2.0,48.0,2.0,...,2.0,80.0,0.0,6.0,2.0,2.0,3.0,2.0,0.0,2.0
50%,735.5,36.0,0.0,802.0,7.0,3.0,1.0,3.0,66.0,3.0,...,3.0,80.0,1.0,10.0,3.0,3.0,5.0,3.0,1.0,3.0
75%,1102.75,43.0,0.0,1157.0,14.0,4.0,1.0,4.0,83.75,3.0,...,4.0,80.0,1.0,15.0,3.0,3.0,9.0,7.0,3.0,7.0
max,1470.0,60.0,1.0,1499.0,29.0,5.0,1.0,4.0,100.0,4.0,...,4.0,80.0,3.0,40.0,6.0,4.0,40.0,18.0,15.0,17.0


In [12]:
# Categorical
df.describe(include='object')

Unnamed: 0,BusinessTravel,Department,EducationField,Gender,JobRole,MaritalStatus,Over18,OverTime
count,1470,1470,1470,1470,1470,1470,1470,1470
unique,3,3,6,2,9,3,1,2
top,Travel_Rarely,Research & Development,Life Sciences,Male,Sales Executive,Married,Y,No
freq,1043,961,606,882,326,673,1470,1054


### Numeric Feature Exploration

In [13]:
fig = px.histogram(df, x='Age', color='Attrition', barmode='stack', title='Age Distribution by Attrition', height=400, text_auto=True)
fig.update_layout(xaxis_title='Age', yaxis_title='Count', bargap=0.1)
fig.show()

Among younger employees (roughly 20–30 years old), more people left (Attrition = 1) than among older employees. This is consistent with the negative correlation between age and attrition.

In [14]:
fig = px.box(df, x='Attrition', y='MonthlyIncome', color='Attrition', title='Monthly Income Distribution by Attrition', height=400, width=800)
fig.update_layout(yaxis_title='Monthly Income')
fig.show()

Employees who left tend to have a lower monthly income than those who stayed. The outliers show that a few highly paid employees also left, but their number is small.

### Categorical Feature Exploration

In [15]:
fig = px.histogram(df, x="OverTime", color="Attrition", barmode="group", title="OverTime Distribution by Attrition", height=400, width=800, text_auto=True)
fig.update_layout(xaxis_title='Over Time', yaxis_title='Count')
fig.show()

Employees who work overtime (OverTime = Yes) leave far more often. This suggests that excessive overtime may contribute to burnout and high turnover.

In [16]:
fig = px.histogram(df, x="Department", color="Attrition", barmode="group", title="Department Distribution by Attrition", height=400, width=800, text_auto=True)
fig.update_layout(xaxis_title='Department', yaxis_title='Count')
fig.show()

In absolute numbers, attrition is highest in these departments:
- Sales
- R&D

HR has only a small number of leavers, possibly because the team itself is small.

In [17]:
fig = px.histogram(df, x="Gender", color="Attrition", barmode="group", title="Gender Distribution by Attrition", height=400, width=800, text_auto=True)
fig.update_layout(xaxis_title='Gender', yaxis_title='Count')
fig.show()

The shares of men and women who left are roughly comparable, although slightly more men left in absolute terms. This may simply reflect the larger number of male employees.
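
Because the charts in this section show counts, group sizes can hide the actual attrition rate. A minimal sketch of how the rate per category could be computed with `pd.crosstab`, using a hypothetical toy frame rather than the project data:

```python
import pandas as pd

# Hypothetical toy data (not the project dataset), only to show the mechanics.
toy = pd.DataFrame({
    "Gender": ["Male"] * 6 + ["Female"] * 4,
    "Attrition": [1, 1, 0, 0, 0, 0, 1, 0, 0, 0],
})

# Absolute counts per group and attrition status.
counts = pd.crosstab(toy["Gender"], toy["Attrition"])

# Row-normalized shares: each row sums to 1, so column 1 is the attrition rate.
rates = pd.crosstab(toy["Gender"], toy["Attrition"], normalize="index")

attrition_rate = rates[1].sort_values(ascending=False)
print(counts)
print(attrition_rate)
```

The same two calls should work for `Department`, `JobRole`, or `BusinessTravel`; rows with a missing `Attrition` value are left out of the table by default.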

In [18]:
fig = px.histogram(df, x="EducationField", color="Attrition", barmode="group", title="Education Field Distribution by Attrition", height=400, width=800, text_auto=True)
fig.update_layout(xaxis_title='Education Field', yaxis_title='Count')
fig.show()

No single pattern dominates. Life Sciences and Medical show the highest numbers of leavers, but these two fields also make up the majority of employees.

In [19]:
fig = px.histogram(df, x="JobRole", color="Attrition", barmode="group", title="Job Role Distribution by Attrition", height=400, width=800, text_auto=True)
fig.update_layout(xaxis_title='Job Role', yaxis_title='Count')
fig.show()

- Sales Representative and Laboratory Technician have a high share of leavers.
- Manager, Research Director, and Healthcare Representative tend to be more loyal.

In [20]:
fig = px.histogram(df, x="MaritalStatus", color="Attrition", barmode="group", title="Marital Status Distribution by Attrition", height=400, width=800, text_auto=True)
fig.update_layout(xaxis_title='Marital Status', yaxis_title='Count')
fig.show()

Single employees tend to leave more often than Married or Divorced employees. This may be related to commitment and life stability.

In [21]:
fig = px.histogram(df, x="BusinessTravel", color="Attrition", barmode="group", title="Business Travel Distribution by Attrition", height=400, width=800, text_auto=True)
fig.update_layout(xaxis_title='Business Travel', yaxis_title='Count')
fig.show()

Employees who travel frequently (Travel_Frequently) have a higher share of leavers than those who do not travel. Fatigue and poor work-life balance may be contributing factors.

### Correlation Between Numeric Features

In [22]:
corr = df.corr(numeric_only=True).round(2)

fig = go.Figure(data=go.Heatmap(z=corr.values, x=corr.columns, y=corr.index, colorscale='RdBu', zmin=-1, zmax=1, colorbar=dict(title='Correlation'), text=corr.values, texttemplate="%{text}", hovertemplate="Feature X: %{x}<br>Feature Y: %{y}<br>Correlation: %{z:.2f}<extra></extra>"))
fig.update_layout(title="Correlation Matrix", xaxis_title="Feature", yaxis_title="Feature", width=1400, height=1400)
fig.show()


In [23]:
top_corr = corr['Attrition'].drop('Attrition').sort_values(ascending=True)
chart_height = len(top_corr) * 30

fig = px.bar(x=top_corr.values, y=top_corr.index, orientation='h', title='Feature Correlation with Attrition', labels={'x': 'Correlation', 'y': 'Feature'}, height=chart_height, width=800, text_auto=True)
fig.update_layout(
    xaxis_range=[-0.20, 0.20],
    yaxis=dict(
        showgrid=True,
        gridcolor='lightgrey',
        gridwidth=1,
        ticks="outside"
    ),
    plot_bgcolor='white'
)
fig.show()


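Features with the strongest negative correlation to Attrition:
1. TotalWorkingYears: -0.18
2. JobLevel: -0.17
3. Age: -0.17
4. MonthlyIncome: -0.16
5. YearsWithCurrManager: -0.16
6. YearsInCurrentRole: -0.16

These correlations are weak in absolute terms (all below 0.2 in magnitude), but they point in the same direction: the more senior and experienced an employee is, the less likely they are to leave the company.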

## Data Preparation / Preprocessing

In [24]:
df_prep = df.copy()

### Drop Irrelevant Columns

`EmployeeId` is only an identifier, while `EmployeeCount`, `StandardHours`, and `Over18` each take a single value, so none of these columns is useful for modeling.

In [25]:
drop_cols = ['EmployeeId', 'EmployeeCount', 'StandardHours', 'Over18']
df_prep.drop(columns=drop_cols, inplace=True)

### Handle Missing Values

When predicting Attrition, rows with a missing target value should not be filled in arbitrarily, because they provide no label for the model to learn from. Forcing a value into the Attrition target would make the model learn from incorrect data, degrade performance, and undermine the validity of its predictions. The best practice is therefore to drop these rows before training begins.

In [26]:
df_prep = df_prep.dropna(subset=['Attrition'])

### Convert the Target to Integer

In [27]:
df_prep['Attrition'] = df_prep['Attrition'].astype(int)

### Encode Categorical Features

In [28]:
df_prep = pd.get_dummies(df_prep, drop_first=True)
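
To make the effect of `drop_first=True` concrete, here is a small sketch on a hypothetical toy frame (not the project data). For each categorical column, the alphabetically first category is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Hypothetical toy frame, only to show what drop_first=True does.
toy = pd.DataFrame({
    "Age": [38, 37, 51],
    "OverTime": ["Yes", "No", "Yes"],
    "Department": ["Sales", "Human Resources", "Research & Development"],
})

encoded = pd.get_dummies(toy, drop_first=True)

# The alphabetically first category of each column becomes the baseline:
# "No" for OverTime and "Human Resources" for Department.
print(encoded.columns.tolist())
```

Numeric columns pass through unchanged, and the dummy columns are appended in the order of the original categorical columns.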

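### Normalization/Standardization of Numeric Features

In [29]:
# Continuous columns that are candidates for standardization.
# Note: this list is not used later in the notebook; the Logistic Regression
# step applies StandardScaler to all feature columns instead.
numerical_cols = ['Age', 'DailyRate', 'DistanceFromHome', 'HourlyRate', 'MonthlyIncome', 
                  'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike', 'TotalWorkingYears', 
                  'TrainingTimesLastYear', 'YearsAtCompany', 'YearsInCurrentRole', 
                  'YearsSinceLastPromotion', 'YearsWithCurrManager']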
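
The cell above only defines the column list. If the intent is to standardize just those continuous columns and leave the dummy columns untouched, one option is a `ColumnTransformer`. This is a sketch on a hypothetical toy frame, not what the notebook itself does; `cols_to_scale` stands in for `numerical_cols`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame with two continuous columns and one dummy column.
toy = pd.DataFrame({
    "Age": [25.0, 35.0, 45.0, 55.0],
    "MonthlyIncome": [2000.0, 4000.0, 6000.0, 8000.0],
    "OverTime_Yes": [1, 0, 1, 0],
})
cols_to_scale = ["Age", "MonthlyIncome"]  # stand-in for numerical_cols

# Scale only the listed columns; pass everything else through unchanged.
preprocess = ColumnTransformer(
    transformers=[("num", StandardScaler(), cols_to_scale)],
    remainder="passthrough",
)
scaled = preprocess.fit_transform(toy)

# Output column order: scaled columns first, then the passthrough columns.
print(np.round(scaled, 3))
```

In a real workflow the transformer would be fit on the training split only and then applied to the test split.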

### Save the Preprocessed Data

In [19]:
df_prep.to_csv("data/employee_data_preprocessed.csv", index=False)

## Modeling

In [None]:
X = df_prep.drop(columns='Attrition')
y = df_prep['Attrition']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
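
The `stratify=y` argument keeps the share of leavers nearly identical in the training and test sets. A small sketch that reproduces this with the class counts reported earlier (879 vs 179); the feature matrix is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the class counts reported above: 879 stayed (0), 179 left (1).
y = np.array([0] * 879 + [1] * 179)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, values are irrelevant here

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify=y the positive share is (almost) identical in both splits.
print(len(y_tr), len(y_te))
print(round(y.mean(), 4), round(y_tr.mean(), 4), round(y_te.mean(), 4))
```

Without stratification, a small minority class like this one could end up noticeably over- or under-represented in the test set.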

### Logistic Regression

In [31]:
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
def objective_lr(trial):
    penalty = trial.suggest_categorical("penalty", ["l1", "l2"])
    C = trial.suggest_float("C", 0.01, 10.0, log=True)
    solver = "liblinear" if penalty == "l1" else "lbfgs"
    model = LogisticRegression(class_weight='balanced', penalty=penalty, C=C, solver=solver, max_iter=1000)
    return cross_val_score(model, X_train_scaled, y_train, scoring="roc_auc", cv=3).mean()

In [33]:
study_lr = optuna.create_study(direction="maximize")
study_lr.optimize(objective_lr, n_trials=30)
best_params_lr = study_lr.best_params

[I 2025-05-02 23:53:51,475] A new study created in memory with name: no-name-89412db3-06ef-4ea8-a5e0-b91edf447981
[I 2025-05-02 23:53:51,512] Trial 0 finished with value: 0.8175476161150298 and parameters: {'penalty': 'l2', 'C': 0.18097081681549174}. Best is trial 0 with value: 0.8175476161150298.
[I 2025-05-02 23:53:51,547] Trial 1 finished with value: 0.8108089338831 and parameters: {'penalty': 'l1', 'C': 0.17808997458792789}. Best is trial 0 with value: 0.8175476161150298.
[I 2025-05-02 23:53:51,597] Trial 2 finished with value: 0.8137233204231845 and parameters: {'penalty': 'l1', 'C': 1.09073132682291}. Best is trial 0 with value: 0.8175476161150298.
[I 2025-05-02 23:53:51,649] Trial 3 finished with value: 0.8109814695515238 and parameters: {'penalty': 'l2', 'C': 3.8367336076066443}. Best is trial 0 with value: 0.8175476161150298.
[I 2025-05-02 23:53:51,723] Trial 4 finished with value: 0.747905065857665 and parameters: {'penalty': 'l1', 'C': 0.03207557376286463}. Best is trial 0 w

In [34]:
print("Logistic Regression :", best_params_lr)

Logistic Regression : {'penalty': 'l2', 'C': 0.3050381040881149}


In [35]:
# Use the best parameters (LR with scaling)

penalty = best_params_lr['penalty']
# Pass the solver explicitly: the default lbfgs solver does not support an l1 penalty.
solver = "liblinear" if penalty == "l1" else "lbfgs"

lr_best = LogisticRegression(**best_params_lr, solver=solver, class_weight='balanced', max_iter=1000)
lr_best.fit(X_train_scaled, y_train)
y_pred_lr = lr_best.predict(X_test_scaled)
y_proba_lr = lr_best.predict_proba(X_test_scaled)[:, 1]
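
In the cells above, the scaler is fit on the whole training set before `cross_val_score` runs, so every validation fold has already contributed to the scaling statistics. A possible alternative, sketched here on synthetic data (not the project dataset) and with an illustrative `C` value, is to put the scaler inside a pipeline so it is refit per fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (not the project dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.83, 0.17], random_state=42
)

# The scaler is refit on the training part of every fold,
# so the validation part never influences the scaling statistics.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", C=0.3, max_iter=1000),
)
scores = cross_val_score(pipe, X_demo, y_demo, scoring="roc_auc", cv=3)
print(scores.mean())
```

The same pipeline object could be returned from an Optuna objective in place of the bare `LogisticRegression`; for standard scaling the effect on the score is usually small, but the pattern avoids leakage by construction.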

### Random Forest

In [36]:
def objective_rf(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "max_depth": trial.suggest_int("max_depth", 3, 20),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 10),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 5),
        "max_features": trial.suggest_categorical("max_features", ["sqrt", "log2", None])
    }
    model = RandomForestClassifier(class_weight='balanced', **params, random_state=42)
    return cross_val_score(model, X_train, y_train, scoring="roc_auc", cv=3).mean()

In [37]:
study_rf = optuna.create_study(direction="maximize")
study_rf.optimize(objective_rf, n_trials=30)
best_params_rf = study_rf.best_params

[I 2025-05-02 23:53:52,646] A new study created in memory with name: no-name-4ebd95cd-06de-4222-80cf-0209fbae4528
[I 2025-05-02 23:53:53,481] Trial 0 finished with value: 0.7636943365374403 and parameters: {'n_estimators': 133, 'max_depth': 18, 'min_samples_split': 5, 'min_samples_leaf': 5, 'max_features': 'log2'}. Best is trial 0 with value: 0.7636943365374403.
[I 2025-05-02 23:53:54,026] Trial 1 finished with value: 0.7530885854165619 and parameters: {'n_estimators': 126, 'max_depth': 3, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt'}. Best is trial 0 with value: 0.7636943365374403.
[I 2025-05-02 23:53:57,478] Trial 2 finished with value: 0.762920251565189 and parameters: {'n_estimators': 243, 'max_depth': 6, 'min_samples_split': 6, 'min_samples_leaf': 4, 'max_features': None}. Best is trial 0 with value: 0.7636943365374403.
[I 2025-05-02 23:53:59,932] Trial 3 finished with value: 0.7490144404076048 and parameters: {'n_estimators': 183, 'max_depth': 19, 'min_sa

In [38]:
print("Random Forest :", best_params_rf)

Random Forest : {'n_estimators': 213, 'max_depth': 12, 'min_samples_split': 8, 'min_samples_leaf': 5, 'max_features': 'sqrt'}


In [39]:
# Use the best parameters
rf_best = RandomForestClassifier(**best_params_rf, class_weight='balanced', random_state=42)
rf_best.fit(X_train, y_train)
y_pred_rf = rf_best.predict(X_test)
y_proba_rf = rf_best.predict_proba(X_test)[:, 1]

### XGBoost

In [40]:
# Compute scale_pos_weight
scale_pos_weight = len(y_train[y_train == 0]) / len(y_train[y_train == 1])
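
As a worked example of this ratio: the stratified 80/20 split of 1058 labeled rows leaves 846 training rows, of which 143 are leavers (179 in total minus the 36 in the test set). The sketch below reproduces the arithmetic and also shows how a fixed setting can be merged with tuned values, since an Optuna study's `best_params` only contains parameters drawn through `trial.suggest_*`. The `tuned` dict is an illustrative excerpt, and `final_params` is a hypothetical name.

```python
import numpy as np

# Training labels with the class counts implied by the stratified split
# (703 stayed, 143 left); used here only to reproduce the arithmetic.
y_train_demo = np.array([0] * 703 + [1] * 143)

neg = int((y_train_demo == 0).sum())
pos = int((y_train_demo == 1).sum())
spw = neg / pos  # ratio of negatives to positives

# best_params holds only suggested values, so fixed settings such as
# scale_pos_weight have to be merged back in before refitting.
tuned = {"max_depth": 4, "n_estimators": 217}  # illustrative excerpt of tuned values
final_params = {**tuned, "scale_pos_weight": spw}
print(round(spw, 3), final_params)
```

A ratio of roughly 4.9 tells XGBoost to weight each positive example about five times as heavily as a negative one.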

In [41]:
def objective_xgb(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "max_depth": trial.suggest_int("max_depth", 3, 15),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 0, 5),
        "reg_lambda": trial.suggest_float("reg_lambda", 0, 5),
        "gamma": trial.suggest_float("gamma", 0, 5),
        "scale_pos_weight": scale_pos_weight
    }
    model = XGBClassifier(**params, eval_metric='logloss', random_state=42)
    return cross_val_score(model, X_train, y_train, scoring="roc_auc", cv=3).mean()


In [42]:
study_xgb = optuna.create_study(direction="maximize")
study_xgb.optimize(objective_xgb, n_trials=30)
best_params_xgb = study_xgb.best_params

[I 2025-05-02 23:54:29,012] A new study created in memory with name: no-name-2421c5bc-713d-47f6-bcdd-51f243e54a0e
[I 2025-05-02 23:54:29,301] Trial 0 finished with value: 0.7787200321614461 and parameters: {'n_estimators': 249, 'max_depth': 6, 'learning_rate': 0.1354473707689558, 'subsample': 0.8476862914134807, 'colsample_bytree': 0.6498959317681546, 'reg_alpha': 0.3098959430856024, 'reg_lambda': 3.5660095373033607, 'gamma': 4.431437589372273}. Best is trial 0 with value: 0.7787200321614461.
[I 2025-05-02 23:54:29,655] Trial 1 finished with value: 0.7518327934758823 and parameters: {'n_estimators': 250, 'max_depth': 9, 'learning_rate': 0.25279560697221837, 'subsample': 0.6620167824087686, 'colsample_bytree': 0.6515842048960157, 'reg_alpha': 0.02500357771867301, 'reg_lambda': 0.9493740588538563, 'gamma': 0.9019932646621953}. Best is trial 0 with value: 0.7787200321614461.
[I 2025-05-02 23:54:29,814] Trial 2 finished with value: 0.7533397943189327 and parameters: {'n_estimators': 65, 'm

In [43]:
print("XGBoost :", best_params_xgb)

XGBoost : {'n_estimators': 217, 'max_depth': 4, 'learning_rate': 0.15846773896247185, 'subsample': 0.590413014517955, 'colsample_bytree': 0.7409921279970334, 'reg_alpha': 4.970786844349558, 'reg_lambda': 0.9329272486387984, 'gamma': 2.390573478957289}


In [44]:
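# Use the best parameters
# Note: study_xgb.best_params only holds the values suggested by the trial, so the
# scale_pos_weight used during tuning is not part of best_params_xgb and is not applied here.
xgb_best = XGBClassifier(**best_params_xgb, eval_metric='logloss', random_state=42)
xgb_best.fit(X_train, y_train)
y_proba_xgb = xgb_best.predict_proba(X_test)[:, 1]

# Threshold tuning: lower the decision threshold from the default 0.5 to 0.35
y_pred_xgb = (y_proba_xgb >= 0.35).astype(int)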
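
The notebook does not show how the value 0.35 was chosen. One common approach, sketched here on synthetic data with a plain logistic regression as a stand-in model, is to pick the threshold on a separate validation split (not the test set) by maximizing F1 along the precision-recall curve. All names in this block are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (not the project dataset).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.83, 0.17], random_state=42
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
proba_val = clf.predict_proba(X_val)[:, 1]

# precision/recall have one more entry than thresholds; drop the last point.
precision, recall, thresholds = precision_recall_curve(y_val, proba_val)
f1 = 2 * precision[:-1] * recall[:-1] / np.clip(precision[:-1] + recall[:-1], 1e-12, None)

best_threshold = float(thresholds[np.argmax(f1)])
y_pred_val = (proba_val >= best_threshold).astype(int)
print(round(best_threshold, 3))
```

Choosing the threshold on validation data keeps the test-set metrics an honest estimate; a different criterion (for example a minimum recall) could replace F1 depending on the HR objective.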

### Stacking Ensemble (meta-model)

Combine the outputs of Logistic Regression + XGBoost + RF, then train a meta-classifier on them.

In [45]:
# Build a pipeline for the meta-classifier
meta_learner = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight='balanced', max_iter=2000)
)

In [46]:
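# Stacking with Logistic Regression as the meta-learner
# Note: the 'lr' base learner is fit on the unscaled X_train here, which is the
# likely cause of the lbfgs convergence warnings shown after fitting.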
stack_model = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(**best_params_lr, class_weight='balanced', max_iter=1000)),
        ('rf', RandomForestClassifier(**best_params_rf, class_weight='balanced', random_state=42)),
        ('xgb', XGBClassifier(**best_params_xgb, eval_metric='logloss', random_state=42))
    ],
    final_estimator=meta_learner
)

In [47]:
# Train the stacking model
stack_model.fit(X_train, y_train)

# Predictions and probabilities
y_pred_stack = stack_model.predict(X_test)
y_proba_stack = stack_model.predict_proba(X_test)[:, 1]


lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
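
The warning above is consistent with the `lr` base learner receiving unscaled features. A minimal sketch of one way to address it, assuming synthetic data and leaving XGBoost out to keep the example dependency-free, is to give the linear base learner its own scaler inside a pipeline. The name `stack_demo` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the project dataset); scale=100 puts the
# features on a large scale, similar to unscaled salary-like columns.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.83, 0.17], scale=100.0, random_state=42
)

stack_demo = StackingClassifier(
    estimators=[
        # The linear base learner gets its own scaler inside a pipeline.
        ("lr", make_pipeline(StandardScaler(),
                             LogisticRegression(class_weight="balanced", max_iter=1000))),
        # Tree ensembles do not need scaled inputs.
        ("rf", RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=42)),
    ],
    final_estimator=LogisticRegression(class_weight="balanced", max_iter=1000),
    cv=3,
)
stack_demo.fit(X_demo, y_demo)
proba_demo = stack_demo.predict_proba(X_demo)[:, 1]
print(proba_demo[:5].round(3))
```

Because the scaler lives inside the base estimator, the stacking model can still be fit on the raw feature matrix, and the tree-based learners are unaffected.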



## Evaluation

In [48]:
def evaluate_model(name, y_true, y_pred, y_proba):
    print(f"=== {name} ===")
    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred, zero_division=0))
    print("Recall   :", recall_score(y_true, y_pred, zero_division=0))
    print("F1 Score :", f1_score(y_true, y_pred, zero_division=0))
    print("ROC AUC  :", roc_auc_score(y_true, y_proba))
    print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))
    print("\n")

evaluate_model("Logistic Regression (Tuned)", y_test, y_pred_lr, y_proba_lr)
evaluate_model("Random Forest (Tuned)", y_test, y_pred_rf, y_proba_rf)
evaluate_model("XGBoost (Tuned + Threshold 0.35)", y_test, y_pred_xgb, y_proba_xgb)
evaluate_model("Stacking Ensemble", y_test, y_pred_stack, y_proba_stack)


=== Logistic Regression (Tuned) ===
Accuracy : 0.7122641509433962
Precision: 0.33766233766233766
Recall   : 0.7222222222222222
F1 Score : 0.46017699115044247
ROC AUC  : 0.811395202020202
Confusion Matrix:
 [[125  51]
 [ 10  26]]


=== Random Forest (Tuned) ===
Accuracy : 0.8584905660377359
Precision: 0.6666666666666666
Recall   : 0.3333333333333333
F1 Score : 0.4444444444444444
ROC AUC  : 0.8077651515151515
Confusion Matrix:
 [[170   6]
 [ 24  12]]


=== XGBoost (Tuned + Threshold 0.35) ===
Accuracy : 0.8443396226415094
Precision: 0.56
Recall   : 0.3888888888888889
F1 Score : 0.45901639344262296
ROC AUC  : 0.8137626262626263
Confusion Matrix:
 [[165  11]
 [ 22  14]]


=== Stacking Ensemble ===
Accuracy : 0.7547169811320755
Precision: 0.3888888888888889
Recall   : 0.7777777777777778
F1 Score : 0.5185185185185185
ROC AUC  : 0.8151830808080808
Confusion Matrix:
 [[132  44]
 [  8  28]]
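
Each metric printed above can be derived directly from the confusion matrix. As a worked example, the following sketch recomputes the Stacking Ensemble figures from its matrix (rows are actual classes, columns are predicted classes):

```python
import numpy as np

# Confusion matrix of the stacking ensemble as printed above:
# rows = actual (0, 1), columns = predicted (0, 1).
cm = np.array([[132, 44],
               [8, 28]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Read this way, the model catches 28 of the 36 actual leavers, at the cost of flagging 44 employees who stayed.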




In [49]:
def get_metrics(name, y_true, y_pred, y_proba):
    return {
        "Model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, zero_division=0),
        "Recall": recall_score(y_true, y_pred, zero_division=0),
        "F1 Score": f1_score(y_true, y_pred, zero_division=0),
        "ROC AUC": roc_auc_score(y_true, y_proba)
    }

results = []
results.append(get_metrics("Logistic Regression (Tuned)", y_test, y_pred_lr, y_proba_lr))
results.append(get_metrics("Random Forest (Tuned)", y_test, y_pred_rf, y_proba_rf))
results.append(get_metrics("XGBoost (Tuned + Threshold 0.35)", y_test, y_pred_xgb, y_proba_xgb))
results.append(get_metrics("Stacking Ensemble", y_test, y_pred_stack, y_proba_stack))

df_results = pd.DataFrame(results)
df_results

Unnamed: 0,Model,Accuracy,Precision,Recall,F1 Score,ROC AUC
0,Logistic Regression (Tuned),0.712264,0.337662,0.722222,0.460177,0.811395
1,Random Forest (Tuned),0.858491,0.666667,0.333333,0.444444,0.807765
2,XGBoost (Tuned + Threshold 0.35),0.84434,0.56,0.388889,0.459016,0.813763
3,Stacking Ensemble,0.754717,0.388889,0.777778,0.518519,0.815183


In [50]:
df_long = df_results.melt(id_vars='Model', var_name='Metric', value_name='Score')

fig = px.bar(df_long, x='Score', y='Model', color='Metric', barmode='group', orientation='h', title='Model Performance Comparison', text='Score')
fig.update_layout(xaxis_title='Metric Value', yaxis_title='Model', legend_title='Metric', template='plotly_white', height=500)
fig.update_traces(texttemplate='%{text:.3f}', textposition='outside')
fig.show()


In [51]:
fig = go.Figure()

model_preds = {
    "LogReg": y_proba_lr,
    "RandomForest": y_proba_rf,
    "XGBoost": y_proba_xgb,
    "Stacking": y_proba_stack
}

for name, y_prob in model_preds.items():
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    auc = roc_auc_score(y_test, y_prob)
    fig.add_trace(go.Scatter(x=fpr, y=tpr, mode='lines', name=f'{name} (AUC={auc:.3f})'))

fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1], mode='lines', line=dict(dash='dash'), name='Baseline'))
fig.update_layout(title='ROC Curve - All Models', xaxis_title='False Positive Rate', yaxis_title='True Positive Rate', template='plotly_white', height=500)
fig.show()


### Analysis of the Evaluation Results

Summary of Model Evaluation Results

| Model                         | Accuracy  | Precision | Recall    | F1 Score  | ROC AUC   |
| ----------------------------- | --------- | --------- | --------- | --------- | --------- |
| Logistic Regression (Tuned)   | 0.712     | 0.338     | 0.722     | 0.460     | 0.811     |
| Random Forest (Tuned)         | **0.858** | **0.667** | 0.333     | 0.444     | 0.808     |
| XGBoost (Tuned + Thresh 0.35) | 0.844     | 0.560     | 0.389     | 0.459     | 0.814     |
| Stacking Ensemble             | 0.755     | 0.389     | **0.778** | **0.519** | **0.815** |

Per-Model Analysis
1. Logistic Regression (Tuned)
   - ✅ High recall (0.72) → suited to early detection of employees who may leave.
   - ❌ Low precision (0.34) → many false positives (employees predicted to leave who actually stay).
   - 📈 ROC AUC 0.811 → fairly good at separating leavers from stayers.
   - ⚠️ Risk: may cause over-alerting in the HR system (too many employees get flagged).
2. Random Forest (Tuned)
   - ✅ Highest accuracy and precision.
   - ❌ Recall of only 0.33 → it detects just 1 in 3 employees who leave.
   - Suited to an HR strategy that wants positive predictions to be correct but does not need high sensitivity.
3. XGBoost (Tuned + Threshold 0.35)
   - 🧠 With the 0.35 threshold, precision (0.56) and recall (0.39) are somewhat more balanced than for RF.
   - ✅ ROC AUC of 0.814, on par with the other models.
   - ⚖️ A “middle ground” between RF (high accuracy) and LogReg (high recall).
   - ⚠️ Can be used to rank employees by attrition risk.
4. Stacking Ensemble
   - ✅ Highest recall (0.778) → best at detecting who leaves.
   - ✅ Highest F1 score (0.519) → best balance between precision and recall.
   - ✅ Highest ROC AUC (0.815), although the margin over the other models (0.808 to 0.814) is small.
   - ⚠️ Accuracy is not the highest (0.755), but that is a reasonable trade-off for higher recall.
   - 🧠 Overall, this is the strongest model on this test set.

**Recommended Model: Stacking Ensemble**

Based on the metrics and the HR business objective, the Stacking Ensemble is the recommended model because it:
   - Delivers the highest recall (important for detecting at-risk employees).
   - Has the highest F1 score → the most balanced predictions.
   - Has the highest ROC AUC (0.815), although only marginally above the other models.

Suitable for:
   - An HR early-warning system.
   - Supporting strategic decisions on employee retention.
   - Providing the basis for an interactive predictive dashboard.

Final Recommendation

| HR Goal                                 | Selected Model          | Reason                                    |
| --------------------------------------- | ----------------------- | ----------------------------------------- |
| Catch as many leavers as possible       | ✅ **Stacking Ensemble** | Highest recall & F1 score                 |
| Want positive predictions to be correct | 🔸 Random Forest        | High precision, but weak recall           |
| Sensitive notification system           | 🔸 Logistic Regression  | Good recall, at the cost of precision     |
| Risk separation + ranking               | 🔸 XGBoost + Threshold  | Good AUC, balanced, adjustable threshold  |


## Save Models

In [53]:
joblib.dump(lr_best, 'model/logistic_model.pkl')
joblib.dump(rf_best, 'model/random_forest_model.pkl')
joblib.dump(xgb_best, 'model/xgboost_model.pkl')
joblib.dump(stack_model, 'model/stacking_model.pkl')

['model/stacking_model.pkl']

In [18]:
joblib.dump(X_train.columns.tolist(), "model/feature_columns.pkl")

['model/feature_columns.pkl']
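
Saving the feature column list matters because new data encoded with `pd.get_dummies` may not produce the same columns as the training data. The sketch below shows one way the saved artifacts could be used at inference time; it trains a tiny stand-in model on hypothetical toy data and writes to a temporary directory, so the file names here are illustrative. Note also that `lr_best` was trained on scaled features, so using it for inference would additionally require persisting the fitted `scaler`, which the cells above do not do.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy training data, only to make the sketch runnable.
train = pd.DataFrame({
    "Age": [25, 40, 31, 52, 29, 45],
    "OverTime": ["Yes", "No", "Yes", "No", "Yes", "No"],
    "Attrition": [1, 0, 1, 0, 1, 0],
})
X_toy = pd.get_dummies(train.drop(columns="Attrition"), drop_first=True, dtype=int)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_toy, train["Attrition"])

with tempfile.TemporaryDirectory() as tmp:
    # Persist the model together with the exact training column order.
    joblib.dump(model, os.path.join(tmp, "model.pkl"))
    joblib.dump(X_toy.columns.tolist(), os.path.join(tmp, "feature_columns.pkl"))

    loaded_model = joblib.load(os.path.join(tmp, "model.pkl"))
    feature_columns = joblib.load(os.path.join(tmp, "feature_columns.pkl"))

# A new record may lack some dummy columns (here the only value is "No",
# so get_dummies with drop_first=True produces no OverTime column at all).
new = pd.DataFrame({"Age": [33], "OverTime": ["No"]})
new_encoded = pd.get_dummies(new, drop_first=True, dtype=int)

# Reindex to the training layout; missing dummy columns are filled with 0.
new_aligned = new_encoded.reindex(columns=feature_columns, fill_value=0)
risk = loaded_model.predict_proba(new_aligned)[:, 1]
print(new_aligned.columns.tolist(), risk)
```

Reindexing against the saved list also drops any unexpected extra columns, so the model always sees the layout it was trained on.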