# Final Project: Solving the Problems of an Edutech Company

- Name: Muhammad Akbar Hamid
- Email: muhakbarhamid21@gmail.com
- Dicoding ID: muhakbarhamid21

## Preparation

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score


In [2]:
sns.set(style="whitegrid")

In [3]:
df = pd.read_csv('data/employee_data.csv')

In [4]:
df.head()

|  | EmployeeId | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 38 | NaN | Travel_Frequently | 1444 | Human Resources | 1 | 4 | Other | 1 | ... | 2 | 80 | 1 | 7 | 2 | 3 | 6 | 2 | 1 | 2 |
| 1 | 2 | 37 | 1.0 | Travel_Rarely | 1141 | Research & Development | 11 | 2 | Medical | 1 | ... | 1 | 80 | 0 | 15 | 2 | 1 | 1 | 0 | 0 | 0 |
| 2 | 3 | 51 | 1.0 | Travel_Rarely | 1323 | Research & Development | 4 | 4 | Life Sciences | 1 | ... | 3 | 80 | 3 | 18 | 2 | 4 | 10 | 0 | 2 | 7 |
| 3 | 4 | 42 | 0.0 | Travel_Frequently | 555 | Sales | 26 | 3 | Marketing | 1 | ... | 4 | 80 | 1 | 23 | 2 | 4 | 20 | 4 | 4 | 8 |
| 4 | 5 | 40 | NaN | Travel_Rarely | 1194 | Research & Development | 2 | 4 | Medical | 1 | ... | 2 | 80 | 3 | 20 | 2 | 3 | 5 | 3 | 0 | 2 |


## Data Understanding

### General Dataset Information

In [5]:
print(f"Number of rows: {df.shape[0]}, Number of columns: {df.shape[1]}")

Number of rows: 1470, Number of columns: 35


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   EmployeeId                1470 non-null   int64  
 1   Age                       1470 non-null   int64  
 2   Attrition                 1058 non-null   float64
 3   BusinessTravel            1470 non-null   object 
 4   DailyRate                 1470 non-null   int64  
 5   Department                1470 non-null   object 
 6   DistanceFromHome          1470 non-null   int64  
 7   Education                 1470 non-null   int64  
 8   EducationField            1470 non-null   object 
 9   EmployeeCount             1470 non-null   int64  
 10  EnvironmentSatisfaction   1470 non-null   int64  
 11  Gender                    1470 non-null   object 
 12  HourlyRate                1470 non-null   int64  
 13  JobInvolvement            1470 non-null   int64  
 14  JobLevel

### Checking Missing Values & Duplicates

In [7]:
missing = df.isnull().sum()
print("Missing values:\n", missing[missing > 0])

Missing values:
 Attrition    412
dtype: int64


The Attrition column has 412 missing values (about 28% of the 1,470 rows), so these rows need to be handled before modeling, either by dropping them or by imputation.
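Because Attrition is the prediction target, one common option is to keep only the labeled rows rather than impute a label. The cell below is a minimal sketch of that option on a small made-up frame; `toy`, `labeled`, and `unlabeled` are hypothetical names, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real data (column names follow the dataset).
toy = pd.DataFrame({
    "EmployeeId": [1, 2, 3, 4, 5],
    "Age": [38, 37, 51, 42, 40],
    "Attrition": [np.nan, 1.0, 1.0, 0.0, np.nan],
})

# Keep only rows where the target is known, then store it as an integer label.
labeled = toy.dropna(subset=["Attrition"]).copy()
labeled["Attrition"] = labeled["Attrition"].astype(int)

# Rows without a label are set aside; they could be scored by a model later.
unlabeled = toy[toy["Attrition"].isna()]

print(f"Labeled rows: {len(labeled)}, unlabeled rows: {len(unlabeled)}")
```

Applied to the real `df`, the same `dropna(subset=["Attrition"])` call would be expected to leave the 1,058 labeled rows reported by `df.info()`.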

In [8]:
print("Number of duplicate rows:", df.duplicated().sum())

Number of duplicate rows: 0


No duplicate rows were found.

### Distribution of the Target "Attrition"

In [9]:
# sort_index() orders the counts by label (0.0, then 1.0) so they always line up with the hard-coded labels below
attr_counts = df['Attrition'].value_counts().sort_index()
attr_percent = df['Attrition'].value_counts(normalize=True).sort_index() * 100

attr_summary = pd.DataFrame({
    'Employee Status': ['0.0 (Stayed)', '1.0 (Left)'],
    'Count': attr_counts.values,
    'Proportion (%)': attr_percent.values
})

print("Employee Attrition Status Summary:\n")
print(attr_summary.to_string(index=False))


Employee Attrition Status Summary:

Employee Status  Count  Proportion (%)
   0.0 (Stayed)    879       83.081285
     1.0 (Left)    179       16.918715


In [10]:
fig = go.Figure(data=[go.Pie(labels=['0.0 (Stayed)', '1.0 (Left)'], values=attr_counts, pull=[0.1, 0], textinfo='percent+label')])
fig.update_layout(title='Employee Distribution: Left vs. Stayed', height=400, width=600)
fig.show()

- 83% of employees stayed and only 17% left (879 vs. 179 of the 1,058 labeled rows).

This means the data is class-imbalanced, which must be taken into account when building a predictive model.
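One way to account for the imbalance is to weight each class inversely to its frequency. The sketch below works through the "balanced" heuristic, weight = n_samples / (n_classes * class_count), using the counts from the summary above. This is the formula scikit-learn documents for `class_weight='balanced'`; whether this notebook's model uses it is an assumption, not something shown in the source.

```python
import numpy as np

# Class counts from the attrition summary: 879 stayed (label 0), 179 left (label 1).
counts = np.array([879, 179])
n_samples = counts.sum()   # 1058 labeled rows
n_classes = len(counts)    # 2 classes

# Balanced heuristic: rarer classes receive proportionally larger weights.
weights = n_samples / (n_classes * counts)

for label, weight in zip([0, 1], weights):
    print(f"class {label}: weight {weight:.3f}")
# class 0: weight 0.602
# class 1: weight 2.955
```

With these weights, each class contributes the same total weight (529) to the training loss, so the minority class is not drowned out by the majority.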

### Descriptive Statistics

A statistical summary of the numeric and categorical columns.

In [11]:
# Numeric
df.describe()

|  | EmployeeId | Age | Attrition | DailyRate | DistanceFromHome | Education | EmployeeCount | EnvironmentSatisfaction | HourlyRate | JobInvolvement | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1470.0 | 1470.0 | 1058.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | ... | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 |
| mean | 735.5 | 36.92381 | 0.169187 | 802.485714 | 9.192517 | 2.912925 | 1.0 | 2.721769 | 65.891156 | 2.729932 | ... | 2.712245 | 80.0 | 0.793878 | 11.279592 | 2.79932 | 2.761224 | 7.008163 | 4.229252 | 2.187755 | 4.123129 |
| std | 424.496761 | 9.135373 | 0.375094 | 403.5091 | 8.106864 | 1.024165 | 0.0 | 1.093082 | 20.329428 | 0.711561 | ... | 1.081209 | 0.0 | 0.852077 | 7.780782 | 1.289271 | 0.706476 | 6.126525 | 3.623137 | 3.22243 | 3.568136 |
| min | 1.0 | 18.0 | 0.0 | 102.0 | 1.0 | 1.0 | 1.0 | 1.0 | 30.0 | 1.0 | ... | 1.0 | 80.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 368.25 | 30.0 | 0.0 | 465.0 | 2.0 | 2.0 | 1.0 | 2.0 | 48.0 | 2.0 | ... | 2.0 | 80.0 | 0.0 | 6.0 | 2.0 | 2.0 | 3.0 | 2.0 | 0.0 | 2.0 |
| 50% | 735.5 | 36.0 | 0.0 | 802.0 | 7.0 | 3.0 | 1.0 | 3.0 | 66.0 | 3.0 | ... | 3.0 | 80.0 | 1.0 | 10.0 | 3.0 | 3.0 | 5.0 | 3.0 | 1.0 | 3.0 |
| 75% | 1102.75 | 43.0 | 0.0 | 1157.0 | 14.0 | 4.0 | 1.0 | 4.0 | 83.75 | 3.0 | ... | 4.0 | 80.0 | 1.0 | 15.0 | 3.0 | 3.0 | 9.0 | 7.0 | 3.0 | 7.0 |
| max | 1470.0 | 60.0 | 1.0 | 1499.0 | 29.0 | 5.0 | 1.0 | 4.0 | 100.0 | 4.0 | ... | 4.0 | 80.0 | 3.0 | 40.0 | 6.0 | 4.0 | 40.0 | 18.0 | 15.0 | 17.0 |


In [12]:
# Categorical
df.describe(include='object')

|  | BusinessTravel | Department | EducationField | Gender | JobRole | MaritalStatus | Over18 | OverTime |
|---|---|---|---|---|---|---|---|---|
| count | 1470 | 1470 | 1470 | 1470 | 1470 | 1470 | 1470 | 1470 |
| unique | 3 | 3 | 6 | 2 | 9 | 3 | 1 | 2 |
| top | Travel_Rarely | Research & Development | Life Sciences | Male | Sales Executive | Married | Y | No |
| freq | 1043 | 961 | 606 | 882 | 326 | 673 | 1470 | 1054 |
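The two summaries show that EmployeeCount and StandardHours have a standard deviation of 0 and that Over18 has a single unique value, so these columns carry no information for telling employees apart. The cell below is a rough sketch of how such constant columns could be detected and dropped. The frame `toy` and the names `constant_cols` and `reduced` are made up for illustration; dropping these columns is a suggestion, not a step taken in this notebook.

```python
import pandas as pd

# Made-up rows; the three constant columns mirror what the summaries report.
toy = pd.DataFrame({
    "EmployeeCount": [1, 1, 1],
    "StandardHours": [80, 80, 80],
    "Over18": ["Y", "Y", "Y"],
    "Age": [38, 37, 51],
})

# A column with at most one distinct value cannot help separate the classes.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) <= 1]
reduced = toy.drop(columns=constant_cols)

print("Constant columns:", constant_cols)
print("Remaining columns:", list(reduced.columns))
```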


### Exploring Numeric Features

In [13]:
fig = px.histogram(df, x='Age', color='Attrition', barmode='stack', title='Age Distribution by Attrition', height=400, text_auto=True)
fig.update_layout(xaxis_title='Age', yaxis_title='Count', bargap=0.1)
fig.show()

Younger employees (roughly 20 to 30 years old) account for more departures (Attrition = 1) than older employees. This is consistent with the negative correlation between age and attrition.
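A stacked histogram shows counts, so a rate per age band makes the comparison between younger and older employees more direct. The sketch below bins ages with `pd.cut` and computes the share of leavers per band. The rows are made up and the band edges are an arbitrary choice covering the 18 to 60 range seen in the descriptive statistics.

```python
import pandas as pd

# Made-up rows for illustration; only the column names come from the dataset.
toy = pd.DataFrame({
    "Age": [22, 25, 28, 29, 33, 37, 41, 45, 52, 58],
    "Attrition": [1, 1, 0, 1, 0, 1, 0, 0, 0, 0],
})

# Right-closed bins: (17, 30], (30, 40], (40, 50], (50, 60].
bins = [17, 30, 40, 50, 60]
labels = ["18-30", "31-40", "41-50", "51-60"]
toy["AgeBand"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# The mean of a 0/1 column is the share of leavers in each band.
rate_by_band = toy.groupby("AgeBand", observed=False)["Attrition"].agg(["count", "mean"])
print(rate_by_band)
```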

In [14]:
fig = px.box(df, x='Attrition', y='MonthlyIncome', color='Attrition', title='Monthly Income Distribution by Attrition', height=400, width=800)
fig.update_layout(yaxis_title='Monthly Income')
fig.show()

Employees who left tend to have lower monthly income than those who stayed. The outliers show that a few highly paid employees also left, but their number is small.

### Exploring Categorical Features

In [15]:
fig = px.histogram(df, x="OverTime", color="Attrition", barmode="group", title="OverTime Distribution by Attrition", height=400, width=800, text_auto=True)
fig.update_layout(xaxis_title='Over Time', yaxis_title='Count')
fig.show()

Employees who work overtime (OverTime = Yes) leave far more often. This suggests that excessive overtime may contribute to burnout and high turnover.

In [16]:
fig = px.histogram(df, x="Department", color="Attrition", barmode="group", title="Department Distribution by Attrition", height=400, width=800, text_auto=True)
fig.update_layout(xaxis_title='Department', yaxis_title='Count')
fig.show()

The highest attrition counts occur in these departments:
- Sales
- R&D

HR has only a small number of departures, possibly because the team itself is small.
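Since department sizes differ, the number of leavers and the attrition rate can rank departments differently. The following sketch computes both side by side with a grouped aggregation. All rows are made up to show the effect and do not reflect the real figures; `summary` is a hypothetical name.

```python
import pandas as pd

# Made-up rows: a small department can have few leavers but a high rate.
toy = pd.DataFrame({
    "Department": ["Sales"] * 5 + ["Research & Development"] * 6 + ["Human Resources"] * 2,
    "Attrition": [1, 1, 0, 0, 0,   1, 1, 0, 0, 0, 0,   1, 0],
})

# Named aggregation: leavers (sum of 1s), headcount, and rate (mean of 0/1).
summary = (
    toy.groupby("Department")["Attrition"]
    .agg(leavers="sum", headcount="count", rate="mean")
    .sort_values("rate", ascending=False)
)
print(summary)
```

In this toy data Human Resources has the fewest leavers yet the highest rate, which is why the count chart alone is not enough to rank departments.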

In [17]:
fig = px.histogram(df, x="Gender", color="Attrition", barmode="group", title="Gender Distribution by Attrition", height=400, width=800, text_auto=True)
fig.update_layout(xaxis_title='Gender', yaxis_title='Count')
fig.show()

Proportionally, men and women leave at nearly the same rate, although slightly more men leave in absolute terms. This may simply be because there are more male employees (882 of 1,470).

In [18]:
fig = px.histogram(df, x="EducationField", color="Attrition", barmode="group", title="Education Field Distribution by Attrition", height=400, width=800, text_auto=True)
fig.update_layout(xaxis_title='Education Field', yaxis_title='Count')
fig.show()

No strongly dominant pattern emerges, but attrition counts are higher in the Life Sciences and Medical fields, which also make up the majority of employees.

In [None]:
fig = px.histogram(df, x="JobRole", color="Attrition", barmode="group", title="Job Role Distribution by Attrition", height=400, width=800, text_auto=True)
fig.update_layout(xaxis_title='Job Role', yaxis_title='Count')
fig.show()

- Sales Representative and Laboratory Technician have a high proportion of departures.
- Manager, Research Director, and Healthcare Representative tend to be more loyal.

In [None]:
fig = px.histogram(df, x="MaritalStatus", color="Attrition", barmode="group", title="Marital Status Distribution by Attrition", height=400, width=800, text_auto=True)
fig.update_layout(xaxis_title='Marital Status', yaxis_title='Count')
fig.show()

Single employees tend to leave more often than Married or Divorced employees. This may be related to commitment and life stability.

In [None]:
fig = px.histogram(df, x="BusinessTravel", color="Attrition", barmode="group", title="Business Travel Distribution by Attrition", height=400, width=800, text_auto=True)
fig.update_layout(xaxis_title='Business Travel', yaxis_title='Count')
fig.show()

Employees who travel frequently (Travel_Frequently) have a higher proportion of departures than those who do not travel. Fatigue and poor work-life balance may be contributing factors.

### Correlation Between Numeric Features

In [22]:
corr = df.corr(numeric_only=True).round(2)

fig = go.Figure(data=go.Heatmap(z=corr.values, x=corr.columns, y=corr.index, colorscale='RdBu', zmin=-1, zmax=1, colorbar=dict(title='Correlation'), text=corr.values, texttemplate="%{text}", hovertemplate="Feature X: %{x}<br>Feature Y: %{y}<br>Correlation: %{z:.2f}<extra></extra>"))
fig.update_layout(title="Correlation Matrix", xaxis_title="Feature", yaxis_title="Feature", width=1400, height=1400)
fig.show()


In [27]:
top_corr = corr['Attrition'].drop('Attrition').sort_values(ascending=True)
chart_height = len(top_corr) * 30

fig = px.bar(x=top_corr.values, y=top_corr.index, orientation='h', title='Feature Correlation with Attrition', labels={'x': 'Correlation', 'y': 'Feature'}, height=chart_height, width=800, text_auto=True)
fig.update_layout(
    xaxis_range=[-0.20, 0.20],
    yaxis=dict(
        showgrid=True,
        gridcolor='lightgrey',
        gridwidth=1,
        ticks="outside"
    ),
    plot_bgcolor='white'
)
fig.show()


Features with the strongest negative correlation:
1. TotalWorkingYears: -0.18
2. JobLevel: -0.17
3. Age: -0.17
4. MonthlyIncome: -0.16
5. YearsWithCurrManager: -0.16
6. YearsInCurrentRole: -0.16

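The more senior and experienced an employee is, the less likely they are to leave the company. All of these correlations are weak (|r| ≤ 0.18), so each feature on its own explains only a small part of attrition.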
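The sign of a correlation with a 0/1 target has a simple reading: a negative value means leavers have a lower average for that feature, and a positive value means a higher one. The sketch below illustrates this on made-up rows using `DataFrame.corrwith`, an alternative to reading one column off the full correlation matrix. `toy`, `features`, `corr_with_target`, and `group_means` are hypothetical names.

```python
import pandas as pd

# Made-up rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "Attrition": [1, 1, 0, 0, 0, 1, 0, 0],
    "TotalWorkingYears": [1, 3, 10, 15, 20, 2, 8, 12],
    "DistanceFromHome": [20, 5, 3, 10, 1, 25, 7, 2],
})

# Pearson correlation of every feature against the binary target.
features = toy.drop(columns="Attrition")
corr_with_target = features.corrwith(toy["Attrition"]).sort_values()

# Group means explain the sign: compare stayers (0) with leavers (1).
group_means = toy.groupby("Attrition")[list(features.columns)].mean()

print(corr_with_target.round(2))
print(group_means)
```

In this toy data leavers have fewer working years on average (negative correlation) and a longer commute (positive correlation).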

## Data Preparation / Preprocessing

## Modeling

## Evaluation