# Final Project - Solving an Educational Institution's Problems: Predicting Student Dropout and Visualizing Academic Performance at Jaya Jaya Institut

- Name: Muhammad Akbar Hamid
- Email: muhakbarhamid21@gmail.com
- Dicoding ID: muhakbarhamid21

## Preparation

### Preparing the required libraries

In [1]:
import pandas as pd
import math

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score

from imblearn.over_sampling import SMOTE, ADASYN

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

import joblib
import optuna

### Preparing the data to be used

In [2]:
df = pd.read_csv("data/data_student.csv", sep=";")

## Data Understanding

### Preview Data

View the first few rows to confirm that the columns and data format are correct, and to inspect sample values in each column.

In [3]:
df.head()

| | Marital_status | Application_mode | Application_order | Course | Daytime_evening_attendance | Previous_qualification | Previous_qualification_grade | Nacionality | Mothers_qualification | Fathers_qualification | ... | Curricular_units_2nd_sem_credited | Curricular_units_2nd_sem_enrolled | Curricular_units_2nd_sem_evaluations | Curricular_units_2nd_sem_approved | Curricular_units_2nd_sem_grade | Curricular_units_2nd_sem_without_evaluations | Unemployment_rate | Inflation_rate | GDP | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 17 | 5 | 171 | 1 | 1 | 122.0 | 1 | 19 | 12 | ... | 0 | 0 | 0 | 0 | 0.0 | 0 | 10.8 | 1.4 | 1.74 | Dropout |
| 1 | 1 | 15 | 1 | 9254 | 1 | 1 | 160.0 | 1 | 1 | 3 | ... | 0 | 6 | 6 | 6 | 13.666667 | 0 | 13.9 | -0.3 | 0.79 | Graduate |
| 2 | 1 | 1 | 5 | 9070 | 1 | 1 | 122.0 | 1 | 37 | 37 | ... | 0 | 6 | 0 | 0 | 0.0 | 0 | 10.8 | 1.4 | 1.74 | Dropout |
| 3 | 1 | 17 | 2 | 9773 | 1 | 1 | 122.0 | 1 | 38 | 37 | ... | 0 | 6 | 10 | 5 | 12.4 | 0 | 9.4 | -0.8 | -3.12 | Graduate |
| 4 | 2 | 39 | 1 | 8014 | 0 | 1 | 100.0 | 1 | 37 | 38 | ... | 0 | 6 | 6 | 6 | 13.0 | 0 | 13.9 | -0.3 | 0.79 | Graduate |


### Column Types

Displays the data type and non-null count of every column, which matters for detecting missing values and for telling numeric columns apart from categorical ones.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4424 entries, 0 to 4423
Data columns (total 37 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   Marital_status                                4424 non-null   int64  
 1   Application_mode                              4424 non-null   int64  
 2   Application_order                             4424 non-null   int64  
 3   Course                                        4424 non-null   int64  
 4   Daytime_evening_attendance                    4424 non-null   int64  
 5   Previous_qualification                        4424 non-null   int64  
 6   Previous_qualification_grade                  4424 non-null   float64
 7   Nacionality                                   4424 non-null   int64  
 8   Mothers_qualification                         4424 non-null   int64  
 9   Fathers_qualification                         4424 non-null   i

### Descriptive Statistics of Numeric Features

Summary statistics (count, mean, std, min, quartiles, max) for all numeric columns, useful for spotting outliers or high variability.

In [5]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Marital_status | 4424.0 | 1.178571 | 0.605747 | 1.0 | 1.0 | 1.0 | 1.0 | 6.0 |
| Application_mode | 4424.0 | 18.669078 | 17.484682 | 1.0 | 1.0 | 17.0 | 39.0 | 57.0 |
| Application_order | 4424.0 | 1.727848 | 1.313793 | 0.0 | 1.0 | 1.0 | 2.0 | 9.0 |
| Course | 4424.0 | 8856.642631 | 2063.566416 | 33.0 | 9085.0 | 9238.0 | 9556.0 | 9991.0 |
| Daytime_evening_attendance | 4424.0 | 0.890823 | 0.311897 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Previous_qualification | 4424.0 | 4.577758 | 10.216592 | 1.0 | 1.0 | 1.0 | 1.0 | 43.0 |
| Previous_qualification_grade | 4424.0 | 132.613314 | 13.188332 | 95.0 | 125.0 | 133.1 | 140.0 | 190.0 |
| Nacionality | 4424.0 | 1.873192 | 6.914514 | 1.0 | 1.0 | 1.0 | 1.0 | 109.0 |
| Mothers_qualification | 4424.0 | 19.561935 | 15.603186 | 1.0 | 2.0 | 19.0 | 37.0 | 44.0 |
| Fathers_qualification | 4424.0 | 22.275316 | 15.343108 | 1.0 | 3.0 | 19.0 | 37.0 | 44.0 |


### Distribution of the Target (`Status`)

Checks the proportion of each class (Dropout, Enrolled, Graduate) to see whether class imbalance needs handling.

In [6]:
fig = px.histogram(
  df, 
  x='Status', 
  color='Status', 
  title='Distribusi Status Mahasiswa', 
  text_auto=True, 
  height=400, width=800,
  color_discrete_sequence=px.colors.qualitative.Plotly
)
fig.show()
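
The class balance can also be read as plain numbers rather than bar heights. A minimal sketch, assuming a toy `status` Series in place of `df['Status']` (the labels and counts here are made up for illustration), that reports counts, shares, and a simple largest-to-smallest imbalance ratio:

```python
import pandas as pd

# Hypothetical toy labels standing in for df['Status'].
status = pd.Series(
    ["Graduate"] * 5 + ["Dropout"] * 3 + ["Enrolled"] * 2, name="Status"
)

counts = status.value_counts()                # absolute count per class
shares = status.value_counts(normalize=True)  # fraction of rows per class

summary = pd.DataFrame({"count": counts, "share": shares.round(3)})
print(summary)

# Ratio of the largest class to the smallest one; values well above 1
# suggest that resampling or class weighting may be worth considering.
imbalance_ratio = counts.max() / counts.min()
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

In the notebook, the same two `value_counts` calls could be applied directly to `df['Status']`.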

### Distribution of Numeric Features

Below are a few important numeric features. Colors show the distribution per `Status` class.

#### Distribution of the `Age_at_enrollment` Feature

Shows the range and the most common ages at enrollment; extreme ages may affect dropout risk.

In [7]:
fig = px.histogram(
  df, 
  x='Age_at_enrollment', 
  title='Distribusi Usia pada Saat Pendaftaran', 
  text_auto=True, 
  height=400, width=800, 
  color='Status'
)
fig.show()

#### Distribution of the `Admission_grade` Feature

Distribution of admission grades: are most of them low or high? This is an early indicator of academic ability.

In [8]:
fig = px.histogram(
    df,
    x='Admission_grade',
    nbins=30,
    title='Distribusi Admission Grade',
    labels={'Admission_grade':'Admission Grade','count':'Jumlah'},
    text_auto=True,
    height=400, width=800,
    color='Status',
    color_discrete_sequence=px.colors.qualitative.Plotly,
)
fig.show()

#### Distribution of the `Previous_qualification_grade` Feature

Grade of the qualification held before enrolling; it reflects academic background and the potential to adapt.

In [9]:
fig = px.histogram(
    df,
    x='Previous_qualification_grade',
    nbins=50,
    title='Distribusi Nilai Kualifikasi Sebelumnya',
    labels={'Previous_qualification_grade':'Prev Qualification Grade','count':'Jumlah'},
    text_auto=True,
    height=400, width=800,
    color='Status',
    color_discrete_sequence=px.colors.qualitative.Plotly
)
fig.show()

#### Distribution of Average Semester Grades

Shows performance in each semester to detect declining or consistent grades.

In [10]:
# Semester 1
fig = px.histogram(
    df,
    x='Curricular_units_1st_sem_grade',
    nbins=30,
    title='Distribusi Nilai Rata-rata Sem 1',
    text_auto=True, height=400, width=800, color='Status'
)
fig.show()

# Semester 2
fig = px.histogram(
    df,
    x='Curricular_units_2nd_sem_grade',
    nbins=30,
    title='Distribusi Nilai Rata-rata Sem 2',
    text_auto=True, height=400, width=800, color='Status'
)
fig.show()


#### Distribution of the Macroeconomic Features `Unemployment_rate`, `Inflation_rate`, and `GDP`

External variables that may affect students' ability to pay tuition and continue their studies.

In [11]:
fig = px.histogram(
    df,
    x='Unemployment_rate',
    title='Distribusi Unemployment Rate (%)',
    text_auto=True, height=400, width=800,
    color='Status',
)
fig.show()

fig = px.histogram(
    df,
    x='Inflation_rate',
    title='Distribusi Inflation Rate (%)',
    text_auto=True, height=400, width=800,
    color='Status',
)
fig.show()

fig = px.histogram(
    df,
    x='GDP',
    nbins=30,
    title='Distribusi GDP',
    text_auto=True, height=400, width=800,
    color='Status',
)
fig.show()


#### Distribution of Academic Load vs. Success (Semester 1)

Compares the number of curricular units enrolled with the number approved; a low ratio signals academic difficulty.

In [12]:
fig = px.histogram(
    df,
    x='Curricular_units_1st_sem_enrolled',
    nbins=20,
    title='Units Enrolled Sem 1',
    text_auto=True, height=400, width=800,
    color='Status',
    color_discrete_sequence=px.colors.qualitative.Plotly
)
fig.show()

fig = px.histogram(
    df,
    x='Curricular_units_1st_sem_approved',
    nbins=20,
    title='Units Approved Sem 1',
    text_auto=True, height=400, width=800,
    color='Status',
    color_discrete_sequence=px.colors.qualitative.Plotly
)
fig.show()

### Distribution of Categorical Features

Focus on the categories most likely to influence the outcome.

#### Distribution of the `Course` Feature

Courses with the most students can become the focus of intervention if they also have a high dropout rate.

In [13]:
course_cnt = df['Course'].value_counts().reset_index()
course_cnt.columns = ['Course','Count']

fig = px.bar(
  course_cnt, 
  x='Course', y='Count', 
  title='Jumlah Mahasiswa per Course', 
  color_discrete_sequence=['#FF0000'], 
  text='Count', 
  height=400, width=800
)
fig.show()

#### Distribution of the `Marital_status` Feature

Family support and level of responsibility may differ across marital statuses.

In [14]:
fig = px.histogram(
    df,
    x='Marital_status',
    category_orders={'Marital_status': sorted(df['Marital_status'].unique())},
    title='Distribusi Marital Status Mahasiswa',
    labels={'Marital_status':'Marital Status (1=Single, 2=Married, 3=Widower, 4=Divorced, 5=Facto Union, 6=Legally Separated)','count':'Jumlah'},
    text_auto=True,
    height=400, width=800,
    color='Status',
    color_discrete_sequence=px.colors.qualitative.Plotly
)
fig.show()

#### Distribution of the `Application_mode` Feature

The application mode indicates the admission route (regular, transfer, special scholarship) and may hint at the student's motivation.

In [15]:
fig = px.histogram(
    df,
    x='Application_mode',
    title='Distribusi Application Mode',
    labels={'Application_mode':'Mode Aplikasi','count':'Jumlah'},
    text_auto=True,
    height=400, width=800,
    color='Status',
    color_discrete_sequence=px.colors.qualitative.Plotly
)
fig.show()

#### Distribution of the `Daytime_evening_attendance` Feature

Daytime versus evening classes may affect performance (energy levels, side jobs).

In [16]:
fig = px.histogram(
    df,
    x='Daytime_evening_attendance',
    title='Distribusi Kehadiran (Daytime vs Evening)',
    labels={'Daytime_evening_attendance':'1=Day, 0=Evening','count':'Jumlah'},
    text_auto=True,
    height=400, width=800,
    color='Status',
    color_discrete_sequence=px.colors.qualitative.Plotly
)
fig.show()

#### Distribution of the `Previous_qualification` Feature

The level of prior education influences academic readiness.

In [17]:
fig = px.histogram(
    df,
    x='Previous_qualification',
    title='Distribusi Tingkat Kualifikasi Sebelumnya',
    labels={'Previous_qualification':'Kualifikasi Awal','count':'Jumlah'},
    text_auto=True,
    height=400, width=800,
    color='Status',
    color_discrete_sequence=px.colors.qualitative.Plotly
)
fig.show()

#### Distribution of the `Scholarship_holder` Feature

Scholarships can reduce the financial burden and improve retention.

In [18]:
fig = px.histogram(
    df,
    x='Scholarship_holder',
    title='Proporsi Mahasiswa Penerima Beasiswa',
    labels={'Scholarship_holder':'1=Ya, 0=Tidak','count':'Jumlah'},
    text_auto=True,
    height=400, width=800,
    color='Status',
    color_discrete_sequence=px.colors.qualitative.Plotly
)
fig.show()

#### Distribution of the `Displaced` Feature

Displaced students may face greater adaptation challenges.

In [19]:
fig = px.histogram(
    df,
    x='Displaced',
    title='Proporsi Mahasiswa Displaced',
    labels={'Displaced':'1=Ya, 0=Tidak','count':'Jumlah'},
    text_auto=True,
    height=400, width=800,
    color='Status',
    color_discrete_sequence=px.colors.qualitative.Plotly
)
fig.show()

#### Distribution of the `Debtor` Feature

Overdue payments can cause stress and raise dropout risk.

In [20]:
fig = px.histogram(
    df,
    x='Debtor',
    title='Proporsi Mahasiswa Debtor',
    labels={'Debtor':'1=Ya, 0=Tidak','count':'Jumlah'},
    text_auto=True,
    height=400, width=800,
    color='Status',
    color_discrete_sequence=px.colors.qualitative.Plotly
)
fig.update_xaxes(tickmode='array', tickvals=[0,1], ticktext=['Tidak','Ya'])
fig.show()


### Correlation Between Features

Displays the correlation matrix for all numeric columns, which helps detect multicollinearity.

In [21]:
corr = df.corr(numeric_only=True).round(2)

fig = go.Figure(
    data=go.Heatmap(
        z=corr.values,
        x=corr.columns,
        y=corr.index,
        colorscale='RdBu',
        zmin=-1,
        zmax=1,
        colorbar=dict(title='Korelasi'),
        text=corr.values,
        texttemplate="%{text}",
        hovertemplate=(
            "Fitur X: %{x}<br>"
            "Fitur Y: %{y}<br>"
            "Korelasi: %{z:.2f}<extra></extra>"
        )
    )
)

fig.update_layout(title="Matriks Korelasi", xaxis_title="Fitur", yaxis_title="Fitur", width=1500, height=1500)

fig.show()
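
A large heatmap is hard to scan by eye, so a short list of strongly correlated pairs can complement it. A minimal sketch on synthetic data; the helper name `high_corr_pairs` and the 0.8 cutoff are assumptions for illustration, not values taken from this notebook:

```python
import numpy as np
import pandas as pd


def high_corr_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation is >= threshold."""
    corr = frame.corr(numeric_only=True).abs()
    # Keep only the strict upper triangle so each pair is listed once
    # and the diagonal (self-correlation of 1.0) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs >= threshold].sort_values(ascending=False)


# Synthetic demo: 'b' is almost a linear copy of 'a', 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

pairs = high_corr_pairs(toy)
print(pairs)
```

Applied to `df`, a helper like this would likely surface the first- and second-semester curricular-unit columns as candidates for closer inspection, though that is a guess until it is run on the real data.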


#### Correlation of Features with Each `Status` Category (Dropout, Graduate, Enrolled)

Shows which features correlate most positively or negatively with each target class.

In [22]:
status_dummies = pd.get_dummies(df['Status'], prefix='Status')

df_enc = pd.concat([df.drop(columns=['Status']), status_dummies], axis=1)

corr = df_enc.corr(numeric_only=True).round(2)

for cat in ['Dropout', 'Graduate', 'Enrolled']:
    col = f'Status_{cat}'
    top_corr = corr[col].drop(col).sort_values(ascending=True)
    chart_height = len(top_corr) * 30

    fig = go.Figure(
        data=go.Bar(
            x=top_corr.values,
            y=top_corr.index,
            orientation='h',
            text=top_corr.values,
            textposition='auto'
        )
    )
    fig.update_layout(
        title=f'Korelasi Fitur terhadap Status = {cat}',
        xaxis=dict(
            title='Korelasi',
            range=[top_corr.min() - 0.05, top_corr.max() + 0.05]
        ),
        yaxis=dict(
            title='Fitur',
            showgrid=True,
            gridcolor='lightgrey',
            gridwidth=1,
            ticks="outside"
        ),
        plot_bgcolor='white',
        width=800,
        height=chart_height
    )
    fig.show()
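
The per-class bars come from correlating each feature with a 0/1 indicator column for one class. A minimal sketch of that idea on a hypothetical six-row frame (the `grade` values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data: low grades for Dropout, high grades for Graduate.
toy = pd.DataFrame({
    "grade": [0, 2, 12, 14, 13, 11],
    "Status": ["Dropout", "Dropout", "Graduate", "Graduate", "Graduate", "Enrolled"],
})

# One indicator column per class (1.0 if the row belongs to the class, else 0.0).
dummies = pd.get_dummies(toy["Status"], prefix="Status").astype(float)

# Pearson correlation between the feature and each class indicator.
corr_to_class = dummies.apply(lambda indicator: toy["grade"].corr(indicator))
print(corr_to_class.round(2))
```

A negative value for `Status_Dropout` means that higher grades go with fewer dropouts in this toy sample. The three indicators are mutually exclusive, so their correlations are not independent of one another.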


### Relationship Between Features and the Target (`Status` Column)

#### Relationship: `Admission_grade` vs `Status`

Box plots compare the distribution of admission grades across students' final outcomes.

In [23]:
fig = px.box(
  df,
  x='Status', y='Admission_grade',
  title='Admission Grade per Kelas Status',
  color='Status',
  points='all',
  height=400, width=800,
  color_discrete_sequence=px.colors.qualitative.Plotly
)
fig.show()


#### Relationship: `Previous_qualification_grade` vs `Status`

Compares the typical previous-qualification grade and its spread across outcomes.

In [24]:
fig = px.box(
    df,
    x='Status', y='Previous_qualification_grade',
    title='Nilai Kualifikasi Sebelumnya per Kelas Status',
    color='Status',
    points='all',
    labels={'Previous_qualification_grade':'Prev Qualification Grade','Status':'Status'},
    height=400, width=800
)
fig.show()

#### Relationship: `Curricular_units_1st_sem_grade` vs `Status`

Assesses first-semester performance as an indicator of dropout risk or graduation.

In [25]:
fig = px.box(
    df,
    x='Status', y='Curricular_units_1st_sem_grade',
    title='Rata-rata Nilai Semester 1 per Kelas Status',
    color='Status',
    points='all',
    labels={'Curricular_units_1st_sem_grade':'Grade Sem 1','Status':'Status'},
    height=400, width=800,
)
fig.show()

#### Relationship: `Curricular_units_2nd_sem_grade` vs `Status`

Checks the change in second-semester performance: is there a rebound or a decline?

In [26]:
fig = px.box(
    df,
    x='Status', y='Curricular_units_2nd_sem_grade',
    title='Rata-rata Nilai Semester 2 per Kelas Status',
    color='Status',
    points='all',
    labels={'Curricular_units_2nd_sem_grade':'Grade Sem 2','Status':'Status'},
    height=400, width=800
)
fig.show()

#### Relationship: `Age_at_enrollment` vs `Status`

Shows whether certain age groups are more at risk of dropping out or more likely to graduate.

In [27]:
fig = px.box(
    df,
    x='Status', y='Age_at_enrollment',
    title='Usia Saat Pendaftaran per Kelas Status',
    color='Status',
    points='all',
    labels={'Age_at_enrollment':'Usia','Status':'Status'},
    height=400, width=800
)
fig.show()

#### Relationship: Semester 1 `approved/enrolled` Ratio vs `Status`

Assesses how efficiently students complete their study load, without permanently adding a new column.

In [28]:
fig = px.box(
    df.assign(
        ratio_sem1 = df['Curricular_units_1st_sem_approved'] / df['Curricular_units_1st_sem_enrolled']
    ),
    x='Status', y='ratio_sem1',
    title='Rasio Approved/Enrolled Sem 1 per Kelas Status',
    labels={'ratio_sem1':'Approval Ratio Sem 1','Status':'Status'},
    points='all',
    height=400, width=800,
    color_discrete_sequence=px.colors.qualitative.Plotly,
    color='Status'
)
fig.show()
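
Students with zero enrolled units make the ratio undefined, since pandas returns `NaN` for 0/0. A minimal sketch, on hypothetical toy columns named `approved` and `enrolled`, that makes this handling explicit:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the second student has no enrolled units.
toy = pd.DataFrame({
    "approved": [5, 0, 3, 0],
    "enrolled": [6, 0, 6, 4],
})

# Replace a zero denominator with NaN so the ratio is explicitly "undefined"
# for those rows instead of relying on implicit division behaviour.
ratio = toy["approved"].div(toy["enrolled"].replace(0, np.nan))
print(ratio)

# How many rows have no defined ratio (and would be absent from a box plot).
print("Undefined ratios:", int(ratio.isna().sum()))
```

Whether such rows should stay undefined or be filled with 0 is a modelling decision that the notebook leaves open.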


## Data Preparation

### Data Cleaning

In [29]:
df.isnull().sum()

Marital_status                                  0
Application_mode                                0
Application_order                               0
Course                                          0
Daytime_evening_attendance                      0
Previous_qualification                          0
Previous_qualification_grade                    0
Nacionality                                     0
Mothers_qualification                           0
Fathers_qualification                           0
Mothers_occupation                              0
Fathers_occupation                              0
Admission_grade                                 0
Displaced                                       0
Educational_special_needs                       0
Debtor                                          0
Tuition_fees_up_to_date                         0
Gender                                          0
Scholarship_holder                              0
Age_at_enrollment                               0


There are no missing values.

In [30]:
df.duplicated().sum()

0

There are no duplicate rows.

### Data Type Conversion

Mark columns as categorical or numeric so that downstream steps know how to process them.

In [31]:
cat_cols = [
    # categorical codes
    'Marital_status',             # 1=single, 2=married, …
    'Application_mode',           # 1=first phase, …
    'Application_order',          # 0=first choice, …
    'Course',                     # course code
    'Daytime_evening_attendance', # 1=day, 0=evening
    'Previous_qualification',     # qualification level code
    'Nacionality',                # nationality code
    'Mothers_qualification',      # mother's education level code
    'Fathers_qualification',      # father's education level code
    'Mothers_occupation',         # mother's occupation code
    'Fathers_occupation',         # father's occupation code
    'Displaced',                  # 1=yes, 0=no
    'Educational_special_needs',  # 1=yes, 0=no
    'Debtor',                     # 1=yes, 0=no
    'Tuition_fees_up_to_date',    # 1=yes, 0=no
    'Gender',                     # 1=male, 0=female
    'Scholarship_holder',         # 1=yes, 0=no
    'International',              # 1=yes, 0=no
    'Status'                      # target: Dropout/Enrolled/Graduate
]

df[cat_cols] = df[cat_cols].astype('category')

In [32]:
num_cols = [
    'Previous_qualification_grade',
    'Admission_grade',
    'Age_at_enrollment',
    # all Curricular_units_… columns (credited, enrolled, evaluations, approved, grade, without_evaluations)
    'Curricular_units_1st_sem_credited',
    'Curricular_units_1st_sem_enrolled',
    'Curricular_units_1st_sem_evaluations',
    'Curricular_units_1st_sem_approved',
    'Curricular_units_1st_sem_grade',
    'Curricular_units_1st_sem_without_evaluations',
    'Curricular_units_2nd_sem_credited',
    'Curricular_units_2nd_sem_enrolled',
    'Curricular_units_2nd_sem_evaluations',
    'Curricular_units_2nd_sem_approved',
    'Curricular_units_2nd_sem_grade',
    'Curricular_units_2nd_sem_without_evaluations',
    'Unemployment_rate',
    'Inflation_rate',
    'GDP'
]

In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4424 entries, 0 to 4423
Data columns (total 37 columns):
 #   Column                                        Non-Null Count  Dtype   
---  ------                                        --------------  -----   
 0   Marital_status                                4424 non-null   category
 1   Application_mode                              4424 non-null   category
 2   Application_order                             4424 non-null   category
 3   Course                                        4424 non-null   category
 4   Daytime_evening_attendance                    4424 non-null   category
 5   Previous_qualification                        4424 non-null   category
 6   Previous_qualification_grade                  4424 non-null   float64 
 7   Nacionality                                   4424 non-null   category
 8   Mothers_qualification                         4424 non-null   category
 9   Fathers_qualification                         4424 n

### Categorical Encoding

#### Label‐Encode Target (`Status`)

In [34]:
le = LabelEncoder()
df['Status_enc'] = le.fit_transform(df['Status'])
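
`LabelEncoder` assigns integer codes in sorted label order, which is what makes the later class-distribution output readable. A minimal sketch with a separate encoder (`le_demo`, a name introduced here) showing how the mapping can be recovered:

```python
from sklearn.preprocessing import LabelEncoder

# Fit a demo encoder on the three Status labels (input order does not matter).
le_demo = LabelEncoder().fit(["Dropout", "Graduate", "Enrolled", "Graduate"])

# classes_ is sorted, so the position of each label is its integer code.
mapping = {str(label): code for code, label in enumerate(le_demo.classes_)}
print(mapping)  # {'Dropout': 0, 'Enrolled': 1, 'Graduate': 2}

# Codes can be turned back into labels for reporting.
decoded = le_demo.inverse_transform([2, 0]).tolist()
print(decoded)  # ['Graduate', 'Dropout']
```

With the notebook's fitted `le`, the same dictionary comprehension over `le.classes_` gives the mapping for `Status_enc`.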

#### Ordinal-Encode Categorical Features (Excluding the Target `Status`)

In [35]:
cat_nominal = [c for c in cat_cols if c != 'Status']

for col in cat_nominal:
    df[col] = df[col].cat.codes

In [36]:
df.shape[1]

38

### Train–Test Split

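Split the data into training and test sets with stratification so that both sets keep the original class proportions. Splitting before scaling and feature selection also keeps test-set information out of those steps, which helps avoid data leakage.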

In [37]:
X = df.drop(columns=['Status','Status_enc'])
y = df['Status_enc']

In [38]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)

In [39]:
print("Shape X_train:", X_train.shape, "X_test:", X_test.shape)
print("Distribusi y_train:\n", y_train.value_counts(normalize=True))

Shape X_train: (3539, 36) X_test: (885, 36)
Distribusi y_train:
 Status_enc
2    0.499294
0    0.321277
1    0.179429
Name: proportion, dtype: float64
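
The effect of `stratify=y` can be checked by comparing class shares in the two splits. A minimal sketch on hypothetical toy labels with a 50/30/20 class mix:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy target with a 50/30/20 class mix and a dummy feature.
y_toy = pd.Series([0] * 50 + [1] * 30 + [2] * 20, name="label")
X_toy = pd.DataFrame({"x": range(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Side-by-side class shares; with stratification they should nearly match.
shares = pd.DataFrame({
    "train": y_tr.value_counts(normalize=True).sort_index(),
    "test": y_te.value_counts(normalize=True).sort_index(),
})
print(shares)

max_gap = (shares["train"] - shares["test"]).abs().max()
print(f"Largest share difference: {max_gap:.3f}")
```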


### Feature Scaling (MinMax for χ²)

Rescales all features (now numeric after encoding) to a common range so they suit the chi-squared test, which requires non-negative inputs, and scale-sensitive models. `MinMaxScaler` is used here: it is fit on the training set only and maps the training values to [0, 1].

In [40]:
mm = MinMaxScaler()
X_train_mm = pd.DataFrame(
    mm.fit_transform(X_train),
    columns=X_train.columns,
    index=X_train.index
)
X_test_mm = pd.DataFrame(
    mm.transform(X_test),
    columns=X_test.columns,
    index=X_test.index
)

In [41]:
print("Min nilai X_train_mm:", X_train_mm.min().min())
print("Max nilai X_train_mm:", X_train_mm.max().max())

Min nilai X_train_mm: 0.0
Max nilai X_train_mm: 1.0
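
The check above covers the training set only. Because the scaler is fit on training data, test values outside the training range map outside [0, 1]. A minimal sketch with made-up one-column arrays, also showing scikit-learn's optional `clip=True` setting:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: training values span 0..10, test values fall outside that span.
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[-5.0], [15.0]])

scaler = MinMaxScaler().fit(train)
scaled = scaler.transform(test)
print(scaled.ravel())   # values below 0 and above 1 are possible

# clip=True bounds transformed values to the fitted feature range.
clipping_scaler = MinMaxScaler(clip=True).fit(train)
clipped = clipping_scaler.transform(test)
print(clipped.ravel())
```

This matters mainly if a chi-squared score were ever computed on transformed test data, since `chi2` rejects negative inputs. In this notebook the selector is fit on `X_train_mm` only.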


### Feature Selection

In [42]:
def objective(trial):
    k = trial.suggest_int("k", 1, X_train_mm.shape[1])
    selector = SelectKBest(score_func=chi2, k=k)
    X_sel = selector.fit_transform(X_train_mm, y_train)
    clf = LogisticRegression(max_iter=1000)
    score = cross_val_score(clf, X_sel, y_train, cv=5, scoring="accuracy").mean()
    penalty = 0.0025 * k # Adjust this value to control the penalty
    return score - penalty

In [43]:
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50, show_progress_bar=True)

[I 2025-05-18 21:47:53,186] A new study created in memory with name: no-name-99afc18f-937a-4d98-869b-a262508bc386


  0%|          | 0/50 [00:00<?, ?it/s]

[I 2025-05-18 21:47:53,430] Trial 0 finished with value: 0.7014733616218766 and parameters: {'k': 10}. Best is trial 0 with value: 0.7014733616218766.
[I 2025-05-18 21:47:53,481] Trial 1 finished with value: 0.5454359152622283 and parameters: {'k': 2}. Best is trial 0 with value: 0.7014733616218766.
[I 2025-05-18 21:47:53,652] Trial 2 finished with value: 0.6943821670302623 and parameters: {'k': 16}. Best is trial 0 with value: 0.7014733616218766.
[I 2025-05-18 21:47:53,776] Trial 3 finished with value: 0.7042154723946971 and parameters: {'k': 8}. Best is trial 3 with value: 0.7042154723946971.
[I 2025-05-18 21:47:53,930] Trial 4 finished with value: 0.6891444513700765 and parameters: {'k': 19}. Best is trial 3 with value: 0.7042154723946971.
[I 2025-05-18 21:47:54,117] Trial 5 finished with value: 0.6798590567289174 and parameters: {'k': 33}. Best is trial 3 with value: 0.7042154723946971.
[I 2025-05-18 21:47:54,269] Trial 6 finished with value: 0.6891444513700765 and parameters: {'k'

In [44]:
best_k = study.best_params["k"]

print(f"Best k (jumlah fitur): {best_k}\n")

Best k (jumlah fitur): 7
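
The objective above selects features on the full training set before cross-validating, so each validation fold has already influenced which columns were kept. A hedged alternative sketch, on synthetic data and with a plain loop instead of Optuna, wraps scaling and selection in a scikit-learn `Pipeline` so both are refit inside every fold. The helper name `penalized_score` is an assumption; the 0.0025 per-feature penalty mirrors the cell above:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=4, random_state=0
)


def penalized_score(k: int, penalty_per_feature: float = 0.0025) -> float:
    """Mean CV accuracy minus a small penalty for every selected feature."""
    pipe = Pipeline([
        ("scale", MinMaxScaler()),                      # refit per training fold
        ("select", SelectKBest(score_func=chi2, k=k)),  # refit per training fold
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    acc = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    return acc - penalty_per_feature * k


# Exhaustive search over k is cheap here; Optuna would sample k instead.
scores_by_k = {k: penalized_score(k) for k in range(1, X_demo.shape[1] + 1)}
best_k_demo = max(scores_by_k, key=scores_by_k.get)
print("Best k on the synthetic data:", best_k_demo)
```

The penalty trades a little accuracy for a smaller feature set: each extra feature must improve mean accuracy by more than 0.25 percentage points to be worth keeping.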



In [45]:
final_selector = SelectKBest(score_func=chi2, k=best_k).fit(X_train_mm, y_train)
selected_features = X_train_mm.columns[final_selector.get_support()].tolist()

print("Fitur terpilih:", selected_features, "\n")

Fitur terpilih: ['Debtor', 'Tuition_fees_up_to_date', 'Gender', 'Scholarship_holder', 'Curricular_units_1st_sem_grade', 'Curricular_units_2nd_sem_approved', 'Curricular_units_2nd_sem_grade'] 



In [46]:
X_train_sel = X_train_mm[selected_features].copy()
X_test_sel  = X_test_mm[selected_features].copy()

print("Dimensi X_train_sel:", X_train_sel.shape)
print("Dimensi X_test_sel :", X_test_sel.shape)


Dimensi X_train_sel: (3539, 7)
Dimensi X_test_sel : (885, 7)


#### χ² Scores

In [47]:
chi2_sel = SelectKBest(score_func=chi2, k='all').fit(X_train_mm, y_train)

scores = pd.Series(chi2_sel.scores_, index=X_train_mm.columns).sort_values(ascending=False)

In [48]:
print("Top fitur menurut skor χ²:")
print(scores)

Top fitur menurut skor χ²:
Scholarship_holder                              240.209317
Debtor                                          201.265523
Curricular_units_2nd_sem_grade                  174.139010
Curricular_units_2nd_sem_approved               139.757866
Gender                                          119.584389
Curricular_units_1st_sem_grade                  102.489923
Tuition_fees_up_to_date                          78.259262
Curricular_units_1st_sem_approved                75.409231
Application_mode                                 52.031398
Age_at_enrollment                                40.747134
Displaced                                        21.985026
Previous_qualification                           18.635949
Marital_status                                   13.094620
Curricular_units_2nd_sem_without_evaluations      7.971461
Curricular_units_2nd_sem_evaluations              7.864524
Application_order                                 7.195725
Curricular_units_1st_sem_with

In [49]:
df_scores = scores.reset_index()
df_scores.columns = ['feature', 'chi2_score']

In [50]:
df_scores.to_csv('data/chi2_feature_scores.csv', index=False)

#### Feature Selection Considerations

In [51]:
manual_features = [
    'Application_mode',
    'Age_at_enrollment',
    'Curricular_units_1st_sem_approved',
    'Curricular_units_1st_sem_grade',
    'Curricular_units_2nd_sem_approved',
    'Curricular_units_2nd_sem_grade',
    'Debtor',
    'Displaced',
    'Gender',
    'Marital_status',
    'Previous_qualification',
    'Scholarship_holder',
    'Tuition_fees_up_to_date'
]

In [52]:
missing_train = set(manual_features) - set(X_train_mm.columns)
missing_test  = set(manual_features) - set(X_test_mm.columns)

if missing_train or missing_test:
    raise ValueError(f"Fitur hilang: train {missing_train}, test {missing_test}")

X_train_sel = X_train_mm[manual_features].copy()
X_test_sel  = X_test_mm[manual_features].copy()

print("Dimensi X_train_sel:", X_train_sel.shape)
print("Dimensi X_test_sel :", X_test_sel.shape)

Dimensi X_train_sel: (3539, 13)
Dimensi X_test_sel : (885, 13)


In [53]:
joblib.dump(manual_features, 'model/selected_features.pkl')

['model/selected_features.pkl']
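
`joblib.dump` does not create missing directories, so the `model/` folder must exist before the cell above runs. A minimal round-trip sketch in a temporary directory (the two feature names are just a sample) showing the directory creation and how the saved list can be loaded again at inference time:

```python
import os
import tempfile

import joblib

# Sample feature list standing in for manual_features.
features = ["Debtor", "Tuition_fees_up_to_date"]

with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "model")
    os.makedirs(model_dir, exist_ok=True)   # create the folder if it is missing

    path = os.path.join(model_dir, "selected_features.pkl")
    joblib.dump(features, path)

    # Later (e.g. in a prediction script) the same list is restored
    # and used to pick columns in the same order as during training.
    restored = joblib.load(path)

print(restored)
```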

## Modeling

### Logistic Regression

#### Logistic Regression (Baseline)

In [54]:
def objective_lr_base(trial):
    C = trial.suggest_float('C', 1e-3, 1e2, log=True)
    penalty = trial.suggest_categorical('penalty', ['l1','l2'])
    solver = 'liblinear' if penalty=='l1' else 'lbfgs'
    model = LogisticRegression(C=C, penalty=penalty, solver=solver, max_iter=1000, random_state=42)
    return cross_val_score(model, X_train_sel, y_train, cv=5, scoring='roc_auc_ovr', n_jobs=-1).mean()

study_lr_base = optuna.create_study(direction='maximize')
study_lr_base.optimize(objective_lr_base, n_trials=50, show_progress_bar=True)
best_lr_params = study_lr_base.best_params

print("Best params LR Baseline:", best_lr_params)
print("Best score LR Baseline:", study_lr_base.best_value)

lr_base = LogisticRegression(**best_lr_params,
                                solver=('liblinear' if best_lr_params['penalty']=='l1' else 'lbfgs'),
                                max_iter=1000, random_state=42)
lr_base.fit(X_train_sel, y_train)
joblib.dump(lr_base, 'model/lr_base_optuna.pkl')

[I 2025-05-18 21:48:00,090] A new study created in memory with name: no-name-aee8a88c-2d8a-42c4-8ce6-109b0f5fd854


  0%|          | 0/50 [00:00<?, ?it/s]

[I 2025-05-18 21:48:06,486] Trial 0 finished with value: 0.8323711296844853 and parameters: {'C': 54.7591407269952, 'penalty': 'l2'}. Best is trial 0 with value: 0.8323711296844853.
[I 2025-05-18 21:48:08,120] Trial 1 finished with value: 0.8324831275610975 and parameters: {'C': 17.693571415256216, 'penalty': 'l2'}. Best is trial 1 with value: 0.8324831275610975.
[I 2025-05-18 21:48:09,725] Trial 2 finished with value: 0.824232498553183 and parameters: {'C': 41.84487959111901, 'penalty': 'l1'}. Best is trial 1 with value: 0.8324831275610975.
[I 2025-05-18 21:48:09,825] Trial 3 finished with value: 0.8242288513098138 and parameters: {'C': 20.417294867131023, 'penalty': 'l1'}. Best is trial 1 with value: 0.8324831275610975.
[I 2025-05-18 21:48:09,882] Trial 4 finished with value: 0.8328265457253984 and parameters: {'C': 0.4055459471983423, 'penalty': 'l2'}. Best is trial 4 with value: 0.8328265457253984.
[I 2025-05-18 21:48:09,921] Trial 5 finished with value: 0.78997479628143 and parame

['model/lr_base_optuna.pkl']

#### Logistic Regression + SMOTE

In [55]:
def objective_lr_smote(trial):
    C = trial.suggest_float('C', 1e-3, 1e2, log=True)
    penalty = trial.suggest_categorical('penalty', ['l1','l2'])
    solver = 'liblinear' if penalty=='l1' else 'lbfgs'
    sm = SMOTE(random_state=42)
    X_res, y_res = sm.fit_resample(X_train_sel, y_train)
    model = LogisticRegression(C=C, penalty=penalty, solver=solver, max_iter=1000, random_state=42)
    return cross_val_score(model, X_res, y_res, cv=5, scoring='roc_auc_ovr', n_jobs=-1).mean()

study_lr_smote = optuna.create_study(direction='maximize')
study_lr_smote.optimize(objective_lr_smote, n_trials=50, show_progress_bar=True)
best_lr_smote_params = study_lr_smote.best_params

sm = SMOTE(random_state=42)
X_smote, y_smote = sm.fit_resample(X_train_sel, y_train)

print("Best params LR + SMOTE:", best_lr_smote_params)
print("Best score LR + SMOTE:", study_lr_smote.best_value)

lr_smote = LogisticRegression(**best_lr_smote_params,
                                solver=('liblinear' if best_lr_smote_params['penalty']=='l1' else 'lbfgs'),
                                max_iter=1000, random_state=42)
lr_smote.fit(X_smote, y_smote)
joblib.dump(lr_smote, 'model/lr_smote_optuna.pkl')

[I 2025-05-18 21:48:13,043] A new study created in memory with name: no-name-ac9f77f8-de8f-4095-98c1-9d5d43661b77


  0%|          | 0/50 [00:00<?, ?it/s]

[I 2025-05-18 21:48:13,174] Trial 0 finished with value: 0.8290016644079131 and parameters: {'C': 0.170017618416068, 'penalty': 'l2'}. Best is trial 0 with value: 0.8290016644079131.
[I 2025-05-18 21:48:13,331] Trial 1 finished with value: 0.8233078037241475 and parameters: {'C': 0.4053317354595332, 'penalty': 'l1'}. Best is trial 0 with value: 0.8290016644079131.
[I 2025-05-18 21:48:13,619] Trial 2 finished with value: 0.8233349169170531 and parameters: {'C': 6.911403306607692, 'penalty': 'l1'}. Best is trial 0 with value: 0.8290016644079131.
[I 2025-05-18 21:48:13,780] Trial 3 finished with value: 0.8233872832371688 and parameters: {'C': 0.4430232084706799, 'penalty': 'l1'}. Best is trial 0 with value: 0.8290016644079131.
[I 2025-05-18 21:48:13,878] Trial 4 finished with value: 0.8115646790759232 and parameters: {'C': 0.024745297245757705, 'penalty': 'l1'}. Best is trial 0 with value: 0.8290016644079131.
[I 2025-05-18 21:48:14,020] Trial 5 finished with value: 0.8231711820874452 and 

['model/lr_smote_optuna.pkl']
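
The objective above oversamples the full training set and only then hands it to `cross_val_score`, so points synthesized from a sample can land in a validation fold while their source sits in the training fold. That ordering tends to make cross-validated scores optimistic. The sketch below shows the alternative ordering, split first and oversample only the training part of each fold, using only the standard library. `kfold_indices`, `oversample_minority`, and the toy labels are hypothetical, and plain random duplication stands in for SMOTE; in the notebook itself the usual route would be an `imblearn.pipeline.Pipeline` that wraps the sampler and the classifier.

```python
import random
from collections import Counter

def kfold_indices(n_samples, n_splits, seed=42):
    """Shuffle indices once and cut them into n_splits validation folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_splits] for i in range(n_splits)]

def oversample_minority(indices, labels, seed=42):
    """Duplicate random minority-class indices until every class matches the largest one.

    Stand-in for SMOTE/ADASYN: it returns indices of existing rows, not synthetic rows.
    """
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        balanced.extend(members + extra)
    return balanced

# Toy imbalanced labels: 40 samples of class 0, 10 of class 1 (hypothetical data).
labels = [0] * 40 + [1] * 10
folds = kfold_indices(len(labels), n_splits=5)

for k, val_idx in enumerate(folds):
    val_set = set(val_idx)
    train_idx = [i for i in range(len(labels)) if i not in val_set]
    # Oversample the training part only; the validation fold stays untouched.
    train_balanced = oversample_minority(train_idx, labels, seed=k)
    counts = Counter(labels[i] for i in train_balanced)
    assert val_set.isdisjoint(train_balanced)
    print(f"fold {k}: train class counts {dict(counts)}, validation size {len(val_idx)}")
```

With this ordering the validation fold keeps the original class balance, so the score reflects data the model has not seen in any resampled form.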

#### Logistic Regression + ADASYN

In [56]:
def objective_lr_adasyn(trial):
    C = trial.suggest_float('C', 1e-3, 1e2, log=True)
    penalty = trial.suggest_categorical('penalty', ['l1','l2'])
    # lbfgs does not support the l1 penalty, so l1 trials use liblinear
    solver = 'liblinear' if penalty=='l1' else 'lbfgs'
    ada = ADASYN(random_state=42)
    X_res, y_res = ada.fit_resample(X_train_sel, y_train)
    model = LogisticRegression(C=C, penalty=penalty, solver=solver, max_iter=1000, random_state=42)
    return cross_val_score(model, X_res, y_res, cv=5, scoring='roc_auc_ovr', n_jobs=-1).mean()

study_lr_adasyn = optuna.create_study(direction='maximize')
study_lr_adasyn.optimize(objective_lr_adasyn, n_trials=50, show_progress_bar=True)
best_lr_adasyn_params = study_lr_adasyn.best_params

ada = ADASYN(random_state=42)
X_adasyn, y_adasyn = ada.fit_resample(X_train_sel, y_train)

print("Best params LR + ADASYN:", best_lr_adasyn_params)
print("Best score LR + ADASYN:", study_lr_adasyn.best_value)

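# best_params only holds C and penalty, so the solver is re-derived from the penalty here,
# mirroring the rule used inside the objective.
lr_adasyn = LogisticRegression(**best_lr_adasyn_params,
                                solver=('liblinear' if best_lr_adasyn_params['penalty']=='l1' else 'lbfgs'),
                                max_iter=1000, random_state=42)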
lr_adasyn.fit(X_adasyn, y_adasyn)
joblib.dump(lr_adasyn, 'model/lr_adasyn_optuna.pkl')

[I 2025-05-18 21:48:19,611] A new study created in memory with name: no-name-a84091fd-aeb2-44fe-8102-be87a5564150


[I 2025-05-18 21:48:19,780] Trial 0 finished with value: 0.746181699052286 and parameters: {'C': 0.01834137708950083, 'penalty': 'l1'}. Best is trial 0 with value: 0.746181699052286.
[I 2025-05-18 21:48:19,956] Trial 1 finished with value: 0.7787011421462464 and parameters: {'C': 3.6677531902750466, 'penalty': 'l2'}. Best is trial 1 with value: 0.7787011421462464.
[I 2025-05-18 21:48:20,142] Trial 2 finished with value: 0.6004895542724754 and parameters: {'C': 0.0027712281963345247, 'penalty': 'l1'}. Best is trial 1 with value: 0.7787011421462464.
[I 2025-05-18 21:48:20,270] Trial 3 finished with value: 0.762481963533844 and parameters: {'C': 0.023073018168539194, 'penalty': 'l2'}. Best is trial 1 with value: 0.7787011421462464.
[I 2025-05-18 21:48:20,447] Trial 4 finished with value: 0.778196568674674 and parameters: {'C': 41.65436916025157, 'penalty': 'l2'}. Best is trial 1 with value: 0.7787011421462464.
[I 2025-05-18 21:48:20,683] Trial 5 finished with value: 0.7718674351869026 and

['model/lr_adasyn_optuna.pkl']
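
The solver rule appears twice in each logistic regression cell, once inside the objective and once when the final model is rebuilt from `best_params`. A small helper could keep the two in sync. `lr_kwargs` is a hypothetical name that does not exist in the notebook, and the example dictionary reuses the parameters logged for trial 1 above.

```python
def lr_kwargs(params):
    """Return LogisticRegression keyword arguments for a dict holding 'C' and 'penalty'.

    lbfgs does not support the l1 penalty, so l1 falls back to liblinear.
    """
    solver = "liblinear" if params["penalty"] == "l1" else "lbfgs"
    # Build a new dict so the caller's params are left unchanged.
    return {**params, "solver": solver, "max_iter": 1000, "random_state": 42}

# Parameters copied from trial 1 in the log above.
trial_1 = {"C": 3.6677531902750466, "penalty": "l2"}
print(lr_kwargs(trial_1))
# Intended use in the notebook (assumption):
#   LogisticRegression(**lr_kwargs(study_lr_adasyn.best_params))
```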

### Random Forest

#### Random Forest (Baseline)

In [57]:
def objective_rf_base(trial):
    n_estimators = trial.suggest_int('n_estimators', 50, 300)
    max_depth = trial.suggest_int('max_depth', 2, 20)
    min_samples_split = trial.suggest_int('min_samples_split', 2, 10)
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        random_state=42,
        n_jobs=-1
    )
    return cross_val_score(model, X_train_sel, y_train, cv=5, scoring='roc_auc_ovr', n_jobs=-1).mean()

study_rf_base = optuna.create_study(direction='maximize')
study_rf_base.optimize(objective_rf_base, n_trials=50, show_progress_bar=True)
best_rf_params = study_rf_base.best_params

print("Best params RF Baseline:", best_rf_params)
print("Best score RF Baseline:", study_rf_base.best_value)

rf_base = RandomForestClassifier(**best_rf_params, random_state=42, n_jobs=-1)
rf_base.fit(X_train_sel, y_train)
joblib.dump(rf_base, 'model/rf_base_optuna.pkl')

[I 2025-05-18 21:48:28,641] A new study created in memory with name: no-name-a5eeee48-d0b0-4dc2-b59a-2a912b26ae6d


[I 2025-05-18 21:48:29,343] Trial 0 finished with value: 0.8590942783804509 and parameters: {'n_estimators': 61, 'max_depth': 19, 'min_samples_split': 8}. Best is trial 0 with value: 0.8590942783804509.
[I 2025-05-18 21:48:29,991] Trial 1 finished with value: 0.8538314559957307 and parameters: {'n_estimators': 85, 'max_depth': 16, 'min_samples_split': 3}. Best is trial 0 with value: 0.8590942783804509.
[I 2025-05-18 21:48:30,401] Trial 2 finished with value: 0.8624146253859024 and parameters: {'n_estimators': 56, 'max_depth': 11, 'min_samples_split': 7}. Best is trial 2 with value: 0.8624146253859024.
[I 2025-05-18 21:48:31,354] Trial 3 finished with value: 0.8645215677925581 and parameters: {'n_estimators': 265, 'max_depth': 8, 'min_samples_split': 10}. Best is trial 3 with value: 0.8645215677925581.
[I 2025-05-18 21:48:31,915] Trial 4 finished with value: 0.8633758963978174 and parameters: {'n_estimators': 124, 'max_depth': 10, 'min_samples_split': 3}. Best is trial 3 with value: 0.8

['model/rf_base_optuna.pkl']

#### Random Forest + SMOTE

In [58]:
def objective_rf_smote(trial):
    n_estimators = trial.suggest_int('n_estimators', 50, 300)
    max_depth = trial.suggest_int('max_depth', 2, 20)
    min_samples_split = trial.suggest_int('min_samples_split', 2, 10)
    sm = SMOTE(random_state=42)
    X_res, y_res = sm.fit_resample(X_train_sel, y_train)
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        random_state=42,
        n_jobs=-1
    )
    return cross_val_score(model, X_res, y_res, cv=5, scoring='roc_auc_ovr', n_jobs=-1).mean()

study_rf_smote = optuna.create_study(direction='maximize')
study_rf_smote.optimize(objective_rf_smote, n_trials=50, show_progress_bar=True)
best_rf_smote_params = study_rf_smote.best_params

sm = SMOTE(random_state=42)
X_smote, y_smote = sm.fit_resample(X_train_sel, y_train)

print("Best params RF + SMOTE:", best_rf_smote_params)
print("Best score RF + SMOTE:", study_rf_smote.best_value)

rf_smote = RandomForestClassifier(**best_rf_smote_params, random_state=42, n_jobs=-1)
rf_smote.fit(X_smote, y_smote)
joblib.dump(rf_smote, 'model/rf_smote_optuna.pkl')

[I 2025-05-18 21:49:15,573] A new study created in memory with name: no-name-4df42c9b-40fa-4be6-b8dc-8e2baab33d46


[I 2025-05-18 21:49:16,458] Trial 0 finished with value: 0.8620793615176977 and parameters: {'n_estimators': 193, 'max_depth': 4, 'min_samples_split': 7}. Best is trial 0 with value: 0.8620793615176977.
[I 2025-05-18 21:49:17,168] Trial 1 finished with value: 0.9178242966625645 and parameters: {'n_estimators': 105, 'max_depth': 14, 'min_samples_split': 8}. Best is trial 1 with value: 0.9178242966625645.
[I 2025-05-18 21:49:17,925] Trial 2 finished with value: 0.8460369785512392 and parameters: {'n_estimators': 189, 'max_depth': 2, 'min_samples_split': 9}. Best is trial 1 with value: 0.9178242966625645.
[I 2025-05-18 21:49:18,885] Trial 3 finished with value: 0.8868042938330009 and parameters: {'n_estimators': 206, 'max_depth': 7, 'min_samples_split': 10}. Best is trial 1 with value: 0.9178242966625645.
[I 2025-05-18 21:49:20,835] Trial 4 finished with value: 0.9267253046906457 and parameters: {'n_estimators': 255, 'max_depth': 19, 'min_samples_split': 5}. Best is trial 4 with value: 0.

['model/rf_smote_optuna.pkl']

#### Random Forest + ADASYN

In [59]:
def objective_rf_adasyn(trial):
    n_estimators = trial.suggest_int('n_estimators', 50, 300)
    max_depth = trial.suggest_int('max_depth', 2, 20)
    min_samples_split = trial.suggest_int('min_samples_split', 2, 10)
    ada = ADASYN(random_state=42)
    X_res, y_res = ada.fit_resample(X_train_sel, y_train)
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        random_state=42,
        n_jobs=-1
    )
    return cross_val_score(model, X_res, y_res, cv=5, scoring='roc_auc_ovr', n_jobs=-1).mean()

study_rf_adasyn = optuna.create_study(direction='maximize')
study_rf_adasyn.optimize(objective_rf_adasyn, n_trials=50, show_progress_bar=True)
best_rf_adasyn_params = study_rf_adasyn.best_params

ada = ADASYN(random_state=42)
X_adasyn, y_adasyn = ada.fit_resample(X_train_sel, y_train)

print("Best params RF + ADASYN:", best_rf_adasyn_params)
print("Best score RF + ADASYN:", study_rf_adasyn.best_value)

rf_adasyn = RandomForestClassifier(**best_rf_adasyn_params, random_state=42, n_jobs=-1)
rf_adasyn.fit(X_adasyn, y_adasyn)
joblib.dump(rf_adasyn, 'model/rf_adasyn_optuna.pkl')

[I 2025-05-18 21:50:37,269] A new study created in memory with name: no-name-1897651d-8f95-4b3b-83e9-61724b3f4247


[I 2025-05-18 21:50:38,938] Trial 0 finished with value: 0.9036322699133044 and parameters: {'n_estimators': 235, 'max_depth': 12, 'min_samples_split': 2}. Best is trial 0 with value: 0.9036322699133044.
[I 2025-05-18 21:50:40,100] Trial 1 finished with value: 0.9140734836986472 and parameters: {'n_estimators': 151, 'max_depth': 16, 'min_samples_split': 4}. Best is trial 1 with value: 0.9140734836986472.
[I 2025-05-18 21:50:40,838] Trial 2 finished with value: 0.8649916907632654 and parameters: {'n_estimators': 112, 'max_depth': 8, 'min_samples_split': 8}. Best is trial 1 with value: 0.9140734836986472.
[I 2025-05-18 21:50:41,833] Trial 3 finished with value: 0.832186874083226 and parameters: {'n_estimators': 214, 'max_depth': 5, 'min_samples_split': 10}. Best is trial 1 with value: 0.9140734836986472.
[I 2025-05-18 21:50:42,477] Trial 4 finished with value: 0.8321790684161599 and parameters: {'n_estimators': 124, 'max_depth': 5, 'min_samples_split': 3}. Best is trial 1 with value: 0.9

['model/rf_adasyn_optuna.pkl']

### Support Vector Machine

#### Support Vector Machine (Baseline)

In [60]:
def objective_svc_base(trial):
    C = trial.suggest_float('C', 1e-3, 1e2, log=True)
    kernel = trial.suggest_categorical('kernel', ['linear','rbf','poly'])
    # gamma is ignored by the linear kernel, so it is only tuned for 'rbf' and 'poly'
    gamma = 'scale' if kernel=='linear' else trial.suggest_float('gamma', 1e-4, 1e-1, log=True)
    model = SVC(C=C, kernel=kernel, gamma=gamma, probability=True, random_state=42)
    return cross_val_score(model, X_train_sel, y_train, cv=5, scoring='roc_auc_ovr', n_jobs=-1).mean()

study_svc_base = optuna.create_study(direction='maximize')
study_svc_base.optimize(objective_svc_base, n_trials=50, show_progress_bar=True)
best_svc_params = study_svc_base.best_params

print("Best params SVC Baseline:", best_svc_params)
print("Best score SVC Baseline:", study_svc_base.best_value)

svc_base = SVC(**best_svc_params, probability=True, random_state=42)
svc_base.fit(X_train_sel, y_train)
joblib.dump(svc_base, 'model/svc_base_optuna.pkl')

[I 2025-05-18 21:51:59,413] A new study created in memory with name: no-name-469349c2-0cb6-408a-875c-3564c1ecfcb5


[I 2025-05-18 21:52:01,685] Trial 0 finished with value: 0.8382007342514589 and parameters: {'C': 2.1813108553891025, 'kernel': 'rbf', 'gamma': 0.0052827245367124236}. Best is trial 0 with value: 0.8382007342514589.
[I 2025-05-18 21:52:03,385] Trial 1 finished with value: 0.7338205400397241 and parameters: {'C': 49.178343019000835, 'kernel': 'poly', 'gamma': 0.00012737450677070329}. Best is trial 0 with value: 0.8382007342514589.
[I 2025-05-18 21:52:04,543] Trial 2 finished with value: 0.8393713115648737 and parameters: {'C': 0.07728199118839725, 'kernel': 'linear'}. Best is trial 2 with value: 0.8393713115648737.
[I 2025-05-18 21:52:06,337] Trial 3 finished with value: 0.7339727497902551 and parameters: {'C': 82.25640892359152, 'kernel': 'poly', 'gamma': 0.00010851819976976881}. Best is trial 2 with value: 0.8393713115648737.
[I 2025-05-18 21:52:07,424] Trial 4 finished with value: 0.8400177246037929 and parameters: {'C': 0.27561231072970216, 'kernel': 'linear'}. Best is trial 4 with 

['model/svc_base_optuna.pkl']

#### Support Vector Machine + SMOTE

In [61]:
def objective_svc_smote(trial):
    C = trial.suggest_float('C', 1e-3, 1e2, log=True)
    kernel = trial.suggest_categorical('kernel', ['linear','rbf','poly'])
    # gamma is ignored by the linear kernel, so it is only tuned for 'rbf' and 'poly'
    gamma = 'scale' if kernel=='linear' else trial.suggest_float('gamma', 1e-4, 1e-1, log=True)
    sm = SMOTE(random_state=42)
    X_res, y_res = sm.fit_resample(X_train_sel, y_train)
    model = SVC(C=C, kernel=kernel, gamma=gamma, probability=True, random_state=42)
    return cross_val_score(model, X_res, y_res, cv=5, scoring='roc_auc_ovr', n_jobs=-1).mean()

study_svc_smote = optuna.create_study(direction='maximize')
study_svc_smote.optimize(objective_svc_smote, n_trials=50, show_progress_bar=True)
best_svc_smote_params = study_svc_smote.best_params

sm = SMOTE(random_state=42)
X_smote, y_smote = sm.fit_resample(X_train_sel, y_train)

print("Best params SVC + SMOTE:", best_svc_smote_params)
print("Best score SVC + SMOTE:", study_svc_smote.best_value)

svc_smote = SVC(**best_svc_smote_params, probability=True, random_state=42)
svc_smote.fit(X_smote, y_smote)
joblib.dump(svc_smote, 'model/svc_smote_optuna.pkl')

[I 2025-05-18 21:53:33,373] A new study created in memory with name: no-name-8e633e57-cf0f-40cc-874b-134af6b3b534


[I 2025-05-18 21:53:41,976] Trial 0 finished with value: 0.7895460870741198 and parameters: {'C': 9.432068054626567, 'kernel': 'rbf', 'gamma': 0.00019110087182024838}. Best is trial 0 with value: 0.7895460870741198.
[I 2025-05-18 21:53:49,096] Trial 1 finished with value: 0.24458217382006806 and parameters: {'C': 0.017477339614011322, 'kernel': 'poly', 'gamma': 0.02771502198720716}. Best is trial 0 with value: 0.7895460870741198.
[I 2025-05-18 21:53:54,238] Trial 2 finished with value: 0.8409978204331214 and parameters: {'C': 21.26435573079378, 'kernel': 'rbf', 'gamma': 0.01898647037216124}. Best is trial 2 with value: 0.8409978204331214.
[I 2025-05-18 21:54:03,000] Trial 3 finished with value: 0.25046529222677194 and parameters: {'C': 0.00806379411842456, 'kernel': 'rbf', 'gamma': 0.00047513734172474186}. Best is trial 2 with value: 0.8409978204331214.
[I 2025-05-18 21:54:06,012] Trial 4 finished with value: 0.8329094707685998 and parameters: {'C': 0.14823935323741422, 'kernel': 'line

['model/svc_smote_optuna.pkl']

#### Support Vector Machine + ADASYN

In [62]:
def objective_svc_adasyn(trial):
    C = trial.suggest_float('C', 1e-3, 1e2, log=True)
    kernel = trial.suggest_categorical('kernel', ['linear','rbf','poly'])
    # gamma is ignored by the linear kernel, so it is only tuned for 'rbf' and 'poly'
    gamma = 'scale' if kernel=='linear' else trial.suggest_float('gamma', 1e-4, 1e-1, log=True)
    ada = ADASYN(random_state=42)
    X_res, y_res = ada.fit_resample(X_train_sel, y_train)
    model = SVC(C=C, kernel=kernel, gamma=gamma, probability=True, random_state=42)
    return cross_val_score(model, X_res, y_res, cv=5, scoring='roc_auc_ovr', n_jobs=-1).mean()

study_svc_adasyn = optuna.create_study(direction='maximize')
study_svc_adasyn.optimize(objective_svc_adasyn, n_trials=50, show_progress_bar=True)
best_svc_adasyn_params = study_svc_adasyn.best_params

ada = ADASYN(random_state=42)
X_adasyn, y_adasyn = ada.fit_resample(X_train_sel, y_train)

print("Best params SVC + ADASYN:", best_svc_adasyn_params)
print("Best score SVC + ADASYN:", study_svc_adasyn.best_value)

svc_adasyn = SVC(**best_svc_adasyn_params, probability=True, random_state=42)
svc_adasyn.fit(X_adasyn, y_adasyn)
joblib.dump(svc_adasyn, 'model/svc_adasyn_optuna.pkl')

[I 2025-05-18 21:58:21,550] A new study created in memory with name: no-name-84afc52d-c7a7-4746-8c36-7205303a2690


[I 2025-05-18 21:58:26,893] Trial 0 finished with value: 0.7030557692564734 and parameters: {'C': 0.07630998798104283, 'kernel': 'poly', 'gamma': 0.0014262049312597522}. Best is trial 0 with value: 0.7030557692564734.
[I 2025-05-18 21:58:36,008] Trial 1 finished with value: 0.7172169216603084 and parameters: {'C': 0.022855173749562023, 'kernel': 'rbf', 'gamma': 0.009585156269033328}. Best is trial 1 with value: 0.7172169216603084.
[I 2025-05-18 21:58:41,737] Trial 2 finished with value: 0.7030950804069118 and parameters: {'C': 0.0071160381148837925, 'kernel': 'poly', 'gamma': 0.0026569102362665824}. Best is trial 1 with value: 0.7172169216603084.
[I 2025-05-18 21:58:47,389] Trial 3 finished with value: 0.49496192711151316 and parameters: {'C': 0.028104254495584135, 'kernel': 'poly', 'gamma': 0.00041397823657014467}. Best is trial 1 with value: 0.7172169216603084.
[I 2025-05-18 21:58:52,784] Trial 4 finished with value: 0.7819214218570909 and parameters: {'C': 15.987564000619601, 'kerne

['model/svc_adasyn_optuna.pkl']
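
Every objective in this notebook is scored with `roc_auc_ovr`, which scikit-learn computes from class probabilities; that is why the SVC models are built with `probability=True`. As a rough illustration of what the number means, the sketch below computes a macro-averaged one-vs-rest AUC by hand on a tiny made-up example. `binary_auc`, `macro_ovr_auc`, and the toy arrays are hypothetical, and the pairwise comparison is quadratic in the number of samples, so this is for intuition only and not a replacement for the scikit-learn scorer.

```python
def binary_auc(is_positive, scores):
    """AUC as the chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for flag, s in zip(is_positive, scores) if flag]
    neg = [s for flag, s in zip(is_positive, scores) if not flag]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def macro_ovr_auc(y_true, proba, classes):
    """Average the AUC of 'class k versus the rest', using column k of proba as the score."""
    aucs = []
    for k, cls in enumerate(classes):
        is_positive = [y == cls for y in y_true]
        scores = [row[k] for row in proba]
        aucs.append(binary_auc(is_positive, scores))
    return sum(aucs) / len(aucs)

# Toy three-class example (made-up probabilities, one row per sample).
classes = [0, 1, 2]
y_true = [0, 0, 1, 1, 2, 2]
proba = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
]
print(round(macro_ovr_auc(y_true, proba, classes), 4))
```

In this toy case classes 0 and 2 are ranked perfectly, while class 1 loses one of its eight positive-negative pairs, which pulls the macro average slightly below 1.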

### XGBoost

#### XGBoost (Baseline)

In [63]:
def objective_xgb_base(trial):
    param = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 1e-1, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'eval_metric': 'logloss'
    }
    model = XGBClassifier(**param, random_state=42, n_jobs=-1)
    return cross_val_score(model, X_train_sel, y_train, cv=5, scoring='roc_auc_ovr', n_jobs=-1).mean()

study_xgb_base = optuna.create_study(direction='maximize')
study_xgb_base.optimize(objective_xgb_base, n_trials=50, show_progress_bar=True)
best_xgb_params = study_xgb_base.best_params

print("\nBest params XGBoost Baseline:", best_xgb_params)
print("Best score XGBoost Baseline:", study_xgb_base.best_value)

xgb_base = XGBClassifier(**best_xgb_params, eval_metric='logloss', random_state=42, n_jobs=-1)
xgb_base.fit(X_train_sel, y_train)
joblib.dump(xgb_base, 'model/xgb_base_optuna.pkl')

[I 2025-05-18 22:03:03,062] A new study created in memory with name: no-name-ded81f9e-00f3-48da-9d78-5bbdfd3b5fe1


[I 2025-05-18 22:03:03,554] Trial 0 finished with value: 0.8598544550726247 and parameters: {'max_depth': 4, 'learning_rate': 0.05922599533994059, 'n_estimators': 243, 'subsample': 0.8250445146814592, 'colsample_bytree': 0.8720889775422302}. Best is trial 0 with value: 0.8598544550726247.
[I 2025-05-18 22:03:04,212] Trial 1 finished with value: 0.8646511563256146 and parameters: {'max_depth': 6, 'learning_rate': 0.01565153178047934, 'n_estimators': 284, 'subsample': 0.5102828313502127, 'colsample_bytree': 0.6674222824020872}. Best is trial 1 with value: 0.8646511563256146.
[I 2025-05-18 22:03:05,243] Trial 2 finished with value: 0.8616111110860125 and parameters: {'max_depth': 10, 'learning_rate': 0.001199693680917707, 'n_estimators': 237, 'subsample': 0.5408099968344287, 'colsample_bytree': 0.6424631447643987}. Best is trial 1 with value: 0.8646511563256146.
[I 2025-05-18 22:03:06,199] Trial 3 finished with value: 0.8640036103154044 and parameters: {'max_depth': 8, 'learning_rate': 0.

['model/xgb_base_optuna.pkl']

#### XGBoost + SMOTE

In [64]:
def objective_xgb_smote(trial):
    param = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 1e-1, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'eval_metric': 'logloss'
    }
    sm = SMOTE(random_state=42)
    X_res, y_res = sm.fit_resample(X_train_sel, y_train)
    model = XGBClassifier(**param, random_state=42, n_jobs=-1)
    return cross_val_score(model, X_res, y_res, cv=5, scoring='roc_auc_ovr', n_jobs=-1).mean()

study_xgb_smote = optuna.create_study(direction='maximize')
study_xgb_smote.optimize(objective_xgb_smote, n_trials=50, show_progress_bar=True)
best_xgb_smote_params = study_xgb_smote.best_params

sm = SMOTE(random_state=42)
X_smote, y_smote = sm.fit_resample(X_train_sel, y_train)

print("Best params XGBoost + SMOTE:", best_xgb_smote_params)
print("Best score XGBoost + SMOTE:", study_xgb_smote.best_value)

xgb_smote = XGBClassifier(**best_xgb_smote_params, eval_metric='logloss', random_state=42, n_jobs=-1)
xgb_smote.fit(X_smote, y_smote)
joblib.dump(xgb_smote, 'model/xgb_smote_optuna.pkl')

[I 2025-05-18 22:03:24,130] A new study created in memory with name: no-name-c8263c08-749a-42ba-8d4f-608120d60705


[I 2025-05-18 22:03:25,141] Trial 0 finished with value: 0.9148886589198743 and parameters: {'max_depth': 6, 'learning_rate': 0.029874158371912523, 'n_estimators': 262, 'subsample': 0.943690387824428, 'colsample_bytree': 0.5071636219760847}. Best is trial 0 with value: 0.9148886589198743.
[I 2025-05-18 22:03:26,031] Trial 1 finished with value: 0.9166485281964191 and parameters: {'max_depth': 5, 'learning_rate': 0.07062978562053635, 'n_estimators': 264, 'subsample': 0.8316099485342895, 'colsample_bytree': 0.6062502263173833}. Best is trial 1 with value: 0.9166485281964191.
[I 2025-05-18 22:03:26,550] Trial 2 finished with value: 0.8967860041805373 and parameters: {'max_depth': 4, 'learning_rate': 0.025736794797280256, 'n_estimators': 203, 'subsample': 0.9535386873761811, 'colsample_bytree': 0.6281540700497084}. Best is trial 1 with value: 0.9166485281964191.
[I 2025-05-18 22:03:27,297] Trial 3 finished with value: 0.8874938166020951 and parameters: {'max_depth': 5, 'learning_rate': 0.0

['model/xgb_smote_optuna.pkl']

#### XGBoost + ADASYN

In [65]:
def objective_xgb_adasyn(trial):
    param = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 1e-1, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'eval_metric': 'logloss'
    }
    ada = ADASYN(random_state=42)
    X_res, y_res = ada.fit_resample(X_train_sel, y_train)
    model = XGBClassifier(**param, random_state=42, n_jobs=-1)
    return cross_val_score(model, X_res, y_res, cv=5, scoring='roc_auc_ovr', n_jobs=-1).mean()

study_xgb_adasyn = optuna.create_study(direction='maximize')
study_xgb_adasyn.optimize(objective_xgb_adasyn, n_trials=50, show_progress_bar=True)
best_xgb_adasyn_params = study_xgb_adasyn.best_params

ada = ADASYN(random_state=42)
X_adasyn, y_adasyn = ada.fit_resample(X_train_sel, y_train)

print("Best params XGBoost + ADASYN:", best_xgb_adasyn_params)
print("Best score XGBoost + ADASYN:", study_xgb_adasyn.best_value)

xgb_adasyn = XGBClassifier(**best_xgb_adasyn_params, eval_metric='logloss', random_state=42, n_jobs=-1)
xgb_adasyn.fit(X_adasyn, y_adasyn)
joblib.dump(xgb_adasyn, 'model/xgb_adasyn_optuna.pkl')

[I 2025-05-18 22:04:44,612] A new study created in memory with name: no-name-f6e5a6d9-f35f-4f07-8fcb-58f5464db443


[I 2025-05-18 22:04:46,742] Trial 0 finished with value: 0.9192711459309679 and parameters: {'max_depth': 10, 'learning_rate': 0.060249192650645766, 'n_estimators': 207, 'subsample': 0.6589541510563582, 'colsample_bytree': 0.6172162534878165}. Best is trial 0 with value: 0.9192711459309679.
[I 2025-05-18 22:04:47,283] Trial 1 finished with value: 0.8280267632972272 and parameters: {'max_depth': 3, 'learning_rate': 0.005778681898291797, 'n_estimators': 192, 'subsample': 0.7612547213428928, 'colsample_bytree': 0.9114626041338411}. Best is trial 0 with value: 0.9192711459309679.
[I 2025-05-18 22:04:50,106] Trial 2 finished with value: 0.9203328407392842 and parameters: {'max_depth': 10, 'learning_rate': 0.04322337613676716, 'n_estimators': 239, 'subsample': 0.9032197700602677, 'colsample_bytree': 0.8535625275386913}. Best is trial 2 with value: 0.9203328407392842.
[I 2025-05-18 22:04:50,819] Trial 3 finished with value: 0.8674158882255234 and parameters: {'max_depth': 5, 'learning_rate': 

['model/xgb_adasyn_optuna.pkl']
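
`learning_rate` (and `C` in the earlier sections) is suggested with `log=True`, which means Optuna draws the value uniformly on a logarithmic scale, so each decade of the range gets roughly the same share of candidates. The standard-library sketch below mimics that behaviour for the `[1e-3, 1e-1]` range used here. `sample_log_uniform` is a hypothetical helper; it only illustrates the shape of the search space and does not reproduce Optuna's sampler.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(42)
draws = [sample_log_uniform(1e-3, 1e-1, rng) for _ in range(10_000)]

# The range spans two decades, so each decade should receive about half of the draws.
share_low_decade = sum(d < 1e-2 for d in draws) / len(draws)
print(f"share of draws in [1e-3, 1e-2): {share_low_decade:.3f}")
```

A plain uniform draw over the same range would put only about 9% of the candidates below `1e-2`, which is why the log scale is the usual choice for learning rates and regularization strengths.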

### LightGBM

#### LightGBM (Baseline)

In [66]:
def objective_lgb_base(trial):
    num_leaves = trial.suggest_int('num_leaves', 20, 100)
    n_estimators = trial.suggest_int('n_estimators', 50, 300)
    learning_rate = trial.suggest_float('learning_rate', 1e-3, 1e-1, log=True)
    model = LGBMClassifier(
        num_leaves=num_leaves,
        n_estimators=n_estimators,
        learning_rate=learning_rate,
        random_state=42,
        n_jobs=-1,
        verbose=-1,
        force_row_wise=True,
    )
    return cross_val_score(model, X_train_sel, y_train, cv=5, scoring='roc_auc_ovr', n_jobs=-1).mean()

study_lgb_base = optuna.create_study(direction='maximize')
study_lgb_base.optimize(objective_lgb_base, n_trials=50, show_progress_bar=True)
best_lgb_params = study_lgb_base.best_params

print("Best params LGBM Baseline:", best_lgb_params)
print("Best score LGBM Baseline:", study_lgb_base.best_value)

lgb_base = LGBMClassifier(**best_lgb_params, random_state=42, n_jobs=-1, verbose=-1, force_row_wise=True)
lgb_base.fit(X_train_sel, y_train)
joblib.dump(lgb_base, 'model/lgb_base_optuna.pkl')

[I 2025-05-18 22:06:35,899] A new study created in memory with name: no-name-72fe40bb-3e08-47cd-abdd-c2a387f583a2


[I 2025-05-18 22:06:48,680] Trial 0 finished with value: 0.855854396736316 and parameters: {'num_leaves': 33, 'n_estimators': 130, 'learning_rate': 0.04286207054335831}. Best is trial 0 with value: 0.855854396736316.
[I 2025-05-18 22:07:30,914] Trial 1 finished with value: 0.8548466040733593 and parameters: {'num_leaves': 76, 'n_estimators': 214, 'learning_rate': 0.0049428053968494845}. Best is trial 0 with value: 0.855854396736316.
[I 2025-05-18 22:08:40,059] Trial 2 finished with value: 0.849094227570078 and parameters: {'num_leaves': 85, 'n_estimators': 279, 'learning_rate': 0.02410843865433377}. Best is trial 0 with value: 0.855854396736316.
[I 2025-05-18 22:09:11,128] Trial 3 finished with value: 0.8582423004414143 and parameters: {'num_leaves': 43, 'n_estimators': 248, 'learning_rate': 0.007428437639017094}. Best is trial 3 with value: 0.8582423004414143.
[I 2025-05-18 22:09:24,392] Trial 4 finished with value: 0.8614079797165883 and parameters: {'num_leaves': 22, 'n_estimators':

['model/lgb_base_optuna.pkl']

#### LightGBM + SMOTE

In [67]:
def objective_lgb_smote(trial):
    num_leaves = trial.suggest_int('num_leaves', 20, 100)
    n_estimators = trial.suggest_int('n_estimators', 50, 300)
    learning_rate = trial.suggest_float('learning_rate', 1e-3, 1e-1, log=True)
    sm = SMOTE(random_state=42)
    X_res, y_res = sm.fit_resample(X_train_sel, y_train)
    model = LGBMClassifier(
        num_leaves=num_leaves,
        n_estimators=n_estimators,
        learning_rate=learning_rate,
        random_state=42,
        n_jobs=-1,
        verbose=-1,
        force_row_wise=True,
    )
    return cross_val_score(model, X_res, y_res, cv=5, scoring='roc_auc_ovr', n_jobs=-1).mean()

study_lgb_smote = optuna.create_study(direction='maximize')
study_lgb_smote.optimize(objective_lgb_smote, n_trials=50, show_progress_bar=True)
best_lgb_smote_params = study_lgb_smote.best_params

sm = SMOTE(random_state=42)
X_smote, y_smote = sm.fit_resample(X_train_sel, y_train)

print("Best params LGBM + SMOTE:", best_lgb_smote_params)
print("Best score LGBM + SMOTE:", study_lgb_smote.best_value)

lgb_smote = LGBMClassifier(**best_lgb_smote_params, random_state=42, n_jobs=-1, verbose=-1, force_row_wise=True)
lgb_smote.fit(X_smote, y_smote)
joblib.dump(lgb_smote, 'model/lgb_smote_optuna.pkl')

[I 2025-05-18 22:25:24,788] A new study created in memory with name: no-name-f137c5e8-f703-4f4e-b1e9-0ac723558a26


[I 2025-05-18 22:25:50,219] Trial 0 finished with value: 0.8930018974707492 and parameters: {'num_leaves': 36, 'n_estimators': 226, 'learning_rate': 0.004103348125141256}. Best is trial 0 with value: 0.8930018974707492.
[I 2025-05-18 22:26:05,499] Trial 1 finished with value: 0.9091005868518589 and parameters: {'num_leaves': 86, 'n_estimators': 51, 'learning_rate': 0.03726280810683616}. Best is trial 1 with value: 0.9091005868518589.
[I 2025-05-18 22:26:20,797] Trial 2 finished with value: 0.9080375632683614 and parameters: {'num_leaves': 76, 'n_estimators': 68, 'learning_rate': 0.027145176806397147}. Best is trial 1 with value: 0.9091005868518589.
[I 2025-05-18 22:26:40,461] Trial 3 finished with value: 0.8925731138437701 and parameters: {'num_leaves': 53, 'n_estimators': 121, 'learning_rate': 0.0054426932137046375}. Best is trial 1 with value: 0.9091005868518589.
[I 2025-05-18 22:27:39,083] Trial 4 finished with value: 0.9151689138486313 and parameters: {'num_leaves': 92, 'n_estimato

['model/lgb_smote_optuna.pkl']

#### LightGBM + ADASYN

In [68]:
def objective_lgb_adasyn(trial):
    num_leaves = trial.suggest_int('num_leaves', 20, 100)
    n_estimators = trial.suggest_int('n_estimators', 50, 300)
    learning_rate = trial.suggest_float('learning_rate', 1e-3, 1e-1, log=True)
    ada = ADASYN(random_state=42)
    X_res, y_res = ada.fit_resample(X_train_sel, y_train)
    model = LGBMClassifier(
        num_leaves=num_leaves,
        n_estimators=n_estimators,
        learning_rate=learning_rate,
        random_state=42,
        n_jobs=-1,
        verbose=-1,
        force_row_wise=True,
    )
    return cross_val_score(model, X_res, y_res, cv=5, scoring='roc_auc_ovr', n_jobs=-1).mean()

study_lgb_adasyn = optuna.create_study(direction='maximize')
study_lgb_adasyn.optimize(objective_lgb_adasyn, n_trials=50, show_progress_bar=True)
best_lgb_adasyn_params = study_lgb_adasyn.best_params

ada = ADASYN(random_state=42)
X_adasyn, y_adasyn = ada.fit_resample(X_train_sel, y_train)

print("Best params LGBM + ADASYN:", best_lgb_adasyn_params)
print("Best score LGBM + ADASYN:", study_lgb_adasyn.best_value)

lgb_adasyn = LGBMClassifier(**best_lgb_adasyn_params, random_state=42, n_jobs=-1, verbose=-1, force_row_wise=True)
lgb_adasyn.fit(X_adasyn, y_adasyn)
joblib.dump(lgb_adasyn, 'model/lgb_adasyn_optuna.pkl')

[I 2025-05-18 23:05:31,366] A new study created in memory with name: no-name-9dfb7a0c-d4eb-4c88-86f2-ee21df234985


[I 2025-05-18 23:05:48,980] Trial 0 finished with value: 0.8636088281305809 and parameters: {'num_leaves': 38, 'n_estimators': 146, 'learning_rate': 0.007776616407607062}. Best is trial 0 with value: 0.8636088281305809.
[I 2025-05-18 23:06:11,886] Trial 1 finished with value: 0.8615122463697438 and parameters: {'num_leaves': 44, 'n_estimators': 167, 'learning_rate': 0.004988108580915682}. Best is trial 0 with value: 0.8636088281305809.
[I 2025-05-18 23:06:36,351] Trial 2 finished with value: 0.8837521656217888 and parameters: {'num_leaves': 90, 'n_estimators': 97, 'learning_rate': 0.017480966974908092}. Best is trial 2 with value: 0.8837521656217888.
[I 2025-05-18 23:07:13,642] Trial 3 finished with value: 0.9145070554025156 and parameters: {'num_leaves': 62, 'n_estimators': 209, 'learning_rate': 0.055846228236656116}. Best is trial 3 with value: 0.9145070554025156.
[I 2025-05-18 23:07:33,600] Trial 4 finished with value: 0.8945042516650734 and parameters: {'num_leaves': 47, 'n_estimat

['model/lgb_adasyn_optuna.pkl']

### CatBoost

#### CatBoost (Baseline)

In [69]:
def objective_cat_base(trial):
    iterations = trial.suggest_int('iterations', 100, 500)
    depth = trial.suggest_int('depth', 3, 10)
    learning_rate = trial.suggest_float('learning_rate', 1e-3, 1e-1, log=True)
    model = CatBoostClassifier(
        iterations=iterations,
        depth=depth,
        learning_rate=learning_rate,
        verbose=0,
        random_state=42,
        thread_count=-1,
    )
    return cross_val_score(model, X_train_sel, y_train, cv=5, scoring='roc_auc_ovr', n_jobs=-1).mean()

study_cat_base = optuna.create_study(direction='maximize')
study_cat_base.optimize(objective_cat_base, n_trials=50, show_progress_bar=True)
best_cat_params = study_cat_base.best_params

print("Best params CatBoost Baseline:", best_cat_params)
print("Best score CatBoost Baseline:", study_cat_base.best_value)

cat_base = CatBoostClassifier(**best_cat_params, verbose=0, random_state=42, thread_count=-1)
cat_base.fit(X_train_sel, y_train)
joblib.dump(cat_base, 'model/cat_base_optuna.pkl')

[I 2025-05-18 23:47:13,956] A new study created in memory with name: no-name-445619a9-a461-4a55-b496-4eef6810c54f


[I 2025-05-18 23:47:15,531] Trial 0 finished with value: 0.8643892009790413 and parameters: {'iterations': 187, 'depth': 6, 'learning_rate': 0.07734485487114112}. Best is trial 0 with value: 0.8643892009790413.
[I 2025-05-18 23:47:22,960] Trial 1 finished with value: 0.8563593902145961 and parameters: {'iterations': 129, 'depth': 10, 'learning_rate': 0.013041414289894195}. Best is trial 0 with value: 0.8643892009790413.
[I 2025-05-18 23:47:28,027] Trial 2 finished with value: 0.855711705424049 and parameters: {'iterations': 393, 'depth': 7, 'learning_rate': 0.0024557487948737755}. Best is trial 0 with value: 0.8643892009790413.
[I 2025-05-18 23:47:30,803] Trial 3 finished with value: 0.8621567628539484 and parameters: {'iterations': 256, 'depth': 7, 'learning_rate': 0.013774178698393902}. Best is trial 0 with value: 0.8643892009790413.
[I 2025-05-18 23:47:32,533] Trial 4 finished with value: 0.852857969711786 and parameters: {'iterations': 479, 'depth': 6, 'learning_rate': 0.0010371981

['model/cat_base_optuna.pkl']

#### CatBoost + SMOTE

In [70]:
def objective_cat_smote(trial):
    iterations = trial.suggest_int('iterations', 100, 500)
    depth = trial.suggest_int('depth', 3, 10)
    learning_rate = trial.suggest_float('learning_rate', 1e-3, 1e-1, log=True)
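    # Caveat: resampling before cross_val_score also places synthetic samples
    # in the validation folds, so this CV score is likely optimistic.
    sm = SMOTE(random_state=42)
    X_res, y_res = sm.fit_resample(X_train_sel, y_train)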
    model = CatBoostClassifier(
        iterations=iterations,
        depth=depth,
        learning_rate=learning_rate,
        verbose=0,
        random_state=42,
        thread_count=-1,
    )
    return cross_val_score(model, X_res, y_res, cv=5, scoring='roc_auc_ovr', n_jobs=-1).mean()

study_cat_smote = optuna.create_study(direction='maximize')
study_cat_smote.optimize(objective_cat_smote, n_trials=50, show_progress_bar=True)
best_cat_smote_params = study_cat_smote.best_params

sm = SMOTE(random_state=42)
X_smote, y_smote = sm.fit_resample(X_train_sel, y_train)

print("Best params CatBoost + SMOTE:", best_cat_smote_params)
print("Best score CatBoost + SMOTE:", study_cat_smote.best_value)

cat_smote = CatBoostClassifier(**best_cat_smote_params, verbose=0, random_state=42, thread_count=-1)
cat_smote.fit(X_smote, y_smote)
joblib.dump(cat_smote, 'model/cat_smote_optuna.pkl')

[I 2025-05-18 23:48:53,838] A new study created in memory with name: no-name-dd0691ee-4e20-458b-a26d-0206d4418de8


[I 2025-05-18 23:48:55,996] Trial 0 finished with value: 0.8970994612819629 and parameters: {'iterations': 256, 'depth': 6, 'learning_rate': 0.03195002932657537}. Best is trial 0 with value: 0.8970994612819629.
[I 2025-05-18 23:48:56,604] Trial 1 finished with value: 0.8485781108857374 and parameters: {'iterations': 173, 'depth': 3, 'learning_rate': 0.004249544632331223}. Best is trial 0 with value: 0.8970994612819629.
[I 2025-05-18 23:49:24,000] Trial 2 finished with value: 0.9329637724248181 and parameters: {'iterations': 337, 'depth': 10, 'learning_rate': 0.08938871533763507}. Best is trial 2 with value: 0.9329637724248181.
[I 2025-05-18 23:49:25,695] Trial 3 finished with value: 0.8827748044646496 and parameters: {'iterations': 296, 'depth': 5, 'learning_rate': 0.01885613827321624}. Best is trial 2 with value: 0.9329637724248181.
[I 2025-05-18 23:49:41,447] Trial 4 finished with value: 0.8698742075299519 and parameters: {'iterations': 387, 'depth': 9, 'learning_rate': 0.00137804607

['model/cat_smote_optuna.pkl']

#### CatBoost + ADASYN

In [71]:
def objective_cat_adasyn(trial):
    iterations = trial.suggest_int('iterations', 100, 500)
    depth = trial.suggest_int('depth', 3, 10)
    learning_rate = trial.suggest_float('learning_rate', 1e-3, 1e-1, log=True)
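    # Caveat: resampling before cross_val_score also places synthetic samples
    # in the validation folds, so this CV score is likely optimistic.
    ada = ADASYN(random_state=42)
    X_res, y_res = ada.fit_resample(X_train_sel, y_train)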
    model = CatBoostClassifier(
        iterations=iterations,
        depth=depth,
        learning_rate=learning_rate,
        verbose=0,
        random_state=42,
        thread_count=-1,
    )
    return cross_val_score(model, X_res, y_res, cv=5, scoring='roc_auc_ovr', n_jobs=-1).mean()

study_cat_adasyn = optuna.create_study(direction='maximize')
study_cat_adasyn.optimize(objective_cat_adasyn, n_trials=50, show_progress_bar=True)
best_cat_adasyn_params = study_cat_adasyn.best_params

ada = ADASYN(random_state=42)
X_adasyn, y_adasyn = ada.fit_resample(X_train_sel, y_train)

print("Best params CatBoost + ADASYN:", best_cat_adasyn_params)
print("Best score CatBoost + ADASYN:", study_cat_adasyn.best_value)

cat_adasyn = CatBoostClassifier(**best_cat_adasyn_params, verbose=0, random_state=42, thread_count=-1)
cat_adasyn.fit(X_adasyn, y_adasyn)
joblib.dump(cat_adasyn, 'model/cat_adasyn_optuna.pkl')

[I 2025-05-19 00:05:09,171] A new study created in memory with name: no-name-4e3de3df-8f45-43f0-bb52-01541d6f281a


[I 2025-05-19 00:05:11,570] Trial 0 finished with value: 0.8806525493106643 and parameters: {'iterations': 422, 'depth': 5, 'learning_rate': 0.040439160186669224}. Best is trial 0 with value: 0.8806525493106643.
[I 2025-05-19 00:05:12,200] Trial 1 finished with value: 0.8500731194158548 and parameters: {'iterations': 164, 'depth': 3, 'learning_rate': 0.07140271897051291}. Best is trial 0 with value: 0.8806525493106643.
[I 2025-05-19 00:05:21,998] Trial 2 finished with value: 0.8888402349674479 and parameters: {'iterations': 427, 'depth': 8, 'learning_rate': 0.02440502147168267}. Best is trial 2 with value: 0.8888402349674479.
[I 2025-05-19 00:05:23,686] Trial 3 finished with value: 0.8521649739890609 and parameters: {'iterations': 296, 'depth': 5, 'learning_rate': 0.02450901943060313}. Best is trial 2 with value: 0.8888402349674479.
[I 2025-05-19 00:05:39,288] Trial 4 finished with value: 0.8424103250996353 and parameters: {'iterations': 189, 'depth': 10, 'learning_rate': 0.00723355565

['model/cat_adasyn_optuna.pkl']

## Evaluation

In [72]:
models = {
    'LR Base': 'model/lr_base_optuna.pkl',
    'LR SMOTE': 'model/lr_smote_optuna.pkl',
    'LR ADASYN': 'model/lr_adasyn_optuna.pkl',
    'RF Base': 'model/rf_base_optuna.pkl',
    'RF SMOTE': 'model/rf_smote_optuna.pkl',
    'RF ADASYN': 'model/rf_adasyn_optuna.pkl',
    'SVC Base': 'model/svc_base_optuna.pkl',
    'SVC SMOTE': 'model/svc_smote_optuna.pkl',
    'SVC ADASYN': 'model/svc_adasyn_optuna.pkl',
    'XGB Base': 'model/xgb_base_optuna.pkl',
    'XGB SMOTE': 'model/xgb_smote_optuna.pkl',
    'XGB ADASYN': 'model/xgb_adasyn_optuna.pkl',
    'LGB Base': 'model/lgb_base_optuna.pkl',
    'LGB SMOTE': 'model/lgb_smote_optuna.pkl',
    'LGB ADASYN': 'model/lgb_adasyn_optuna.pkl',
    'Cat Base': 'model/cat_base_optuna.pkl',
    'Cat SMOTE': 'model/cat_smote_optuna.pkl',
    'Cat ADASYN': 'model/cat_adasyn_optuna.pkl'
}
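
The metric cells that follow reload every saved model once per metric. As a possible alternative, the sketch below scores each fitted classifier in a single pass and collects all five metrics in one table. It is a minimal illustration under stated assumptions: `evaluate_fitted_models` is a hypothetical helper, and the two demo models and the synthetic data are placeholders for the saved models and the notebook's test split.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split


def evaluate_fitted_models(fitted, X_eval, y_eval):
    """Score each fitted classifier once and return one row per model."""
    rows = []
    for name, clf in fitted.items():
        y_pred = clf.predict(X_eval)
        y_proba = clf.predict_proba(X_eval)
        rows.append({
            "Model": name,
            "Accuracy": accuracy_score(y_eval, y_pred),
            "Precision": precision_score(y_eval, y_pred, average="weighted", zero_division=0),
            "Recall": recall_score(y_eval, y_pred, average="weighted"),
            "F1-Score": f1_score(y_eval, y_pred, average="weighted"),
            "ROC AUC": roc_auc_score(y_eval, y_proba, multi_class="ovr"),
        })
    table = pd.DataFrame(rows)
    return table.sort_values("ROC AUC", ascending=False).reset_index(drop=True)


# Demo on synthetic 3-class data (placeholder for the notebook's test split).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)
fitted_demo = {
    "LR demo": LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
    "RF demo": RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr),
}
summary = evaluate_fitted_models(fitted_demo, X_te, y_te)
print(summary)
```

In this notebook, the input dictionary could be built with `{name: joblib.load(path) for name, path in models.items()}` and evaluated on `X_test_sel` and `y_test`.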

#### Accuracy

In [73]:
acc_list = []
for name, path in models.items():
    m = joblib.load(path)
    y_pred = m.predict(X_test_sel)
    acc_list.append({'Model': name, 'Accuracy': accuracy_score(y_test, y_pred)})

In [74]:
df_acc = pd.DataFrame(acc_list).sort_values(by='Accuracy', ascending=False)
print(df_acc)

         Model  Accuracy
15    Cat Base  0.763842
9     XGB Base  0.762712
12    LGB Base  0.760452
3      RF Base  0.754802
16   Cat SMOTE  0.740113
0      LR Base  0.738983
6     SVC Base  0.737853
5    RF ADASYN  0.733333
4     RF SMOTE  0.732203
11  XGB ADASYN  0.731073
10   XGB SMOTE  0.727684
17  Cat ADASYN  0.725424
14  LGB ADASYN  0.719774
13   LGB SMOTE  0.716384
7    SVC SMOTE  0.674576
1     LR SMOTE  0.658757
2    LR ADASYN  0.650847
8   SVC ADASYN  0.638418


In [92]:
fig = px.bar(df_acc, x='Model', y='Accuracy',
             title='Accuracy', text='Accuracy',
             height=400, width=800, range_y=[0.6, 0.8], color='Accuracy')
fig.update_layout(xaxis_tickangle=-45)
fig.show()

#### Precision

In [76]:
prec_list = []
for name, path in models.items():
    m = joblib.load(path)
    y_pred = m.predict(X_test_sel)
    prec_list.append({'Model': name,
                      'Precision': precision_score(y_test, y_pred, average='weighted')})

In [77]:
df_prec = pd.DataFrame(prec_list).sort_values(by='Precision', ascending=False)
print(df_prec)

         Model  Precision
15    Cat Base   0.748566
9     XGB Base   0.744564
7    SVC SMOTE   0.744438
12    LGB Base   0.741107
16   Cat SMOTE   0.737437
8   SVC ADASYN   0.732873
3      RF Base   0.732316
4     RF SMOTE   0.729821
6     SVC Base   0.729629
5    RF ADASYN   0.729300
1     LR SMOTE   0.722470
17  Cat ADASYN   0.721469
2    LR ADASYN   0.720841
10   XGB SMOTE   0.719240
11  XGB ADASYN   0.717912
0      LR Base   0.713308
14  LGB ADASYN   0.707786
13   LGB SMOTE   0.707412


In [78]:
fig = px.bar(df_prec, x='Model', y='Precision',
             title='Precision', text='Precision',
             height=400, width=800, range_y=[0.6, 0.8], color='Precision')
fig.update_layout(xaxis_tickangle=-45)
fig.show()

#### Recall

In [79]:
rec_list = []
for name, path in models.items():
    m = joblib.load(path)
    y_pred = m.predict(X_test_sel)
    rec_list.append({'Model': name,
                     'Recall': recall_score(y_test, y_pred, average='weighted')})

In [80]:
df_rec = pd.DataFrame(rec_list).sort_values(by='Recall', ascending=False)
print(df_rec)

         Model    Recall
15    Cat Base  0.763842
9     XGB Base  0.762712
12    LGB Base  0.760452
3      RF Base  0.754802
16   Cat SMOTE  0.740113
0      LR Base  0.738983
6     SVC Base  0.737853
5    RF ADASYN  0.733333
4     RF SMOTE  0.732203
11  XGB ADASYN  0.731073
10   XGB SMOTE  0.727684
17  Cat ADASYN  0.725424
14  LGB ADASYN  0.719774
13   LGB SMOTE  0.716384
7    SVC SMOTE  0.674576
1     LR SMOTE  0.658757
2    LR ADASYN  0.650847
8   SVC ADASYN  0.638418
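
The Recall table is identical to the Accuracy table. That is expected with `average='weighted'`: each class recall TP_k / n_k is weighted by its share n_k / N, so the sum collapses to (sum of TP_k) / N, which is the accuracy. The short check below reproduces this from the confusion matrix printed for RF Base in the Confusion Matrix section, relying on the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
# Confusion matrix reported for "RF Base" (rows = true class, columns = predicted class).
cm = [[215, 23, 46],
      [48, 44, 67],
      [14, 19, 409]]

total = sum(sum(row) for row in cm)
support = [sum(row) for row in cm]              # true samples per class
true_pos = [cm[k][k] for k in range(len(cm))]   # correct predictions per class

per_class_recall = [tp / n for tp, n in zip(true_pos, support)]
weighted_recall = sum((n / total) * r for n, r in zip(support, per_class_recall))
accuracy = sum(true_pos) / total

print(f"weighted recall = {weighted_recall:.6f}")  # 0.754802
print(f"accuracy        = {accuracy:.6f}")         # 0.754802
```

Because the two columns carry the same information, macro-averaged or per-class recall may be a more useful complement to accuracy when the classes are imbalanced.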


In [81]:
fig = px.bar(df_rec, x='Model', y='Recall',
             title='Recall', text='Recall',
             height=400, width=800, range_y=[0.6, 0.8], color='Recall')
fig.update_layout(xaxis_tickangle=-45)
fig.show()

#### F1-Score

In [82]:
f1_list = []
for name, path in models.items():
    m = joblib.load(path)
    y_pred = m.predict(X_test_sel)
    f1_list.append({'Model': name,
                    'F1-Score': f1_score(y_test, y_pred, average='weighted')})

In [83]:
df_f1 = pd.DataFrame(f1_list).sort_values(by='F1-Score', ascending=False)
print(df_f1)

         Model  F1-Score
15    Cat Base  0.752590
9     XGB Base  0.746529
12    LGB Base  0.739803
16   Cat SMOTE  0.738591
3      RF Base  0.734295
5    RF ADASYN  0.730724
4     RF SMOTE  0.730074
6     SVC Base  0.726147
17  Cat ADASYN  0.723244
11  XGB ADASYN  0.723209
10   XGB SMOTE  0.722855
14  LGB ADASYN  0.712643
13   LGB SMOTE  0.711406
0      LR Base  0.705181
7    SVC SMOTE  0.695551
1     LR SMOTE  0.679746
2    LR ADASYN  0.673550
8   SVC ADASYN  0.666550


In [84]:
fig = px.bar(df_f1, x='Model', y='F1-Score',
             title='F1-Score', text='F1-Score',
             height=400, width=800, range_y=[0.6, 0.8], color='F1-Score')
fig.update_layout(xaxis_tickangle=-45)
fig.show()

#### ROC AUC

In [85]:
roc_list = []
for name, path in models.items():
    m = joblib.load(path)
    y_proba = m.predict_proba(X_test_sel)
    roc_list.append({'Model': name,
                    'ROC AUC': roc_auc_score(y_test, y_proba, multi_class='ovr')})

In [86]:
df_roc = pd.DataFrame(roc_list).sort_values(by='ROC AUC', ascending=False)
print(df_roc)

         Model   ROC AUC
15    Cat Base  0.870088
9     XGB Base  0.869443
3      RF Base  0.866917
12    LGB Base  0.866578
4     RF SMOTE  0.855229
16   Cat SMOTE  0.855114
17  Cat ADASYN  0.854706
5    RF ADASYN  0.854189
6     SVC Base  0.853291
7    SVC SMOTE  0.850145
11  XGB ADASYN  0.849143
10   XGB SMOTE  0.848910
13   LGB SMOTE  0.846292
8   SVC ADASYN  0.842304
14  LGB ADASYN  0.842106
0      LR Base  0.834966
1     LR SMOTE  0.831354
2    LR ADASYN  0.831248
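
For reference, `roc_auc_score(..., multi_class='ovr')` with its default macro averaging computes one binary AUC per class (that class against the rest) and takes the unweighted mean. The sketch below checks this on synthetic data; `LogisticRegression` and the names `X_demo`, `y_demo`, `manual_ovr`, and `sklearn_ovr` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem (placeholder for the notebook's data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)

# One-vs-rest by hand: binary AUC of "class k vs. everything else", then the mean.
# Column k of proba corresponds to class k because the labels are 0, 1, 2.
per_class_auc = [
    roc_auc_score((y_te == k).astype(int), proba[:, k])
    for k in range(proba.shape[1])
]
manual_ovr = float(np.mean(per_class_auc))
sklearn_ovr = roc_auc_score(y_te, proba, multi_class="ovr")

print("per-class AUC:", [round(a, 3) for a in per_class_auc])
print("manual mean  :", round(manual_ovr, 6))
print("sklearn OvR  :", round(sklearn_ovr, 6))
```

Printing the per-class values alongside the mean can reveal whether a single class drags the overall score down.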


In [87]:
fig = px.bar(df_roc, x='Model', y='ROC AUC',
             title='ROC AUC', text='ROC AUC',
             height=400, width=800, range_y=[0.8, 0.9], color='ROC AUC')
fig.update_layout(xaxis_tickangle=-45)
fig.show()

#### Confusion Matrix

In [88]:
cm_list = []
for name, path in models.items():
    m = joblib.load(path)
    y_pred = m.predict(X_test_sel)
    cm = confusion_matrix(y_test, y_pred)
    cm_list.append({'Model': name, 'ConfusionMatrix': cm})

In [89]:
cm_df = pd.DataFrame(cm_list)
print(cm_df)

         Model                                ConfusionMatrix
0      LR Base    [[207, 24, 53], [42, 29, 88], [22, 2, 418]]
1     LR SMOTE  [[180, 77, 27], [26, 91, 42], [21, 109, 312]]
2    LR ADASYN  [[194, 68, 22], [31, 91, 37], [26, 125, 291]]
3      RF Base   [[215, 23, 46], [48, 44, 67], [14, 19, 409]]
4     RF SMOTE   [[202, 42, 40], [40, 67, 52], [17, 46, 379]]
5    RF ADASYN   [[206, 39, 39], [42, 66, 51], [19, 46, 377]]
6     SVC Base   [[183, 53, 48], [25, 60, 74], [18, 14, 410]]
7    SVC SMOTE  [[172, 87, 25], [23, 99, 37], [15, 101, 326]]
8   SVC ADASYN  [[172, 92, 20], [23, 99, 37], [18, 130, 294]]
9     XGB Base   [[216, 25, 43], [45, 52, 62], [16, 19, 407]]
10   XGB SMOTE   [[207, 41, 36], [47, 59, 53], [23, 41, 378]]
11  XGB ADASYN   [[212, 37, 35], [46, 54, 59], [23, 38, 381]]
12    LGB Base   [[216, 23, 45], [44, 46, 69], [18, 13, 411]]
13   LGB SMOTE   [[207, 41, 36], [49, 55, 55], [26, 44, 372]]
14  LGB ADASYN   [[207, 39, 38], [44, 54, 61], [27, 39, 376]]
15    Ca
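
The printed matrices can be turned into per-class recall and precision to see what oversampling changes. The sketch below does this for the LR Base and LR SMOTE matrices shown above, assuming the scikit-learn layout (rows are true classes, columns are predicted classes); `per_class_scores` is a hypothetical helper.

```python
def per_class_scores(cm):
    """Return (recall, precision) lists from a square confusion matrix.

    Rows are true classes, columns are predicted classes.
    """
    n = len(cm)
    recall = [cm[k][k] / sum(cm[k]) for k in range(n)]
    precision = [cm[k][k] / sum(cm[r][k] for r in range(n)) for k in range(n)]
    return recall, precision


# Matrices copied from the output above.
lr_base = [[207, 24, 53], [42, 29, 88], [22, 2, 418]]
lr_smote = [[180, 77, 27], [26, 91, 42], [21, 109, 312]]

recall_base, precision_base = per_class_scores(lr_base)
recall_smote, precision_smote = per_class_scores(lr_smote)

for k in range(3):
    print(
        f"class {k}: recall {recall_base[k]:.3f} -> {recall_smote[k]:.3f}, "
        f"precision {precision_base[k]:.3f} -> {precision_smote[k]:.3f}"
    )
```

For the class at index 1, recall rises from 0.182 to 0.572 with SMOTE while precision falls from 0.527 to 0.329, a trade-off that the weighted metrics above largely hide.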

In [90]:
labels = [str(l) for l in sorted(set(y_test))]  # class labels as strings for the axes
items = list(models.items())
n_models = len(items)
cols = 3  # subplots per row
rows = math.ceil(n_models / cols)

In [93]:
fig = make_subplots(rows=rows, cols=cols,
                    subplot_titles=[n for n,_ in items])

for idx, (name, path) in enumerate(items):
    m = joblib.load(path)
    y_pred = m.predict(X_test_sel)
    cm = confusion_matrix(y_test, y_pred)
    # Map the flat model index to a 1-based (row, col) subplot position.
    r, c = divmod(idx, cols)
    row, col = r + 1, c + 1

    fig.add_trace(
        go.Heatmap(
            z=cm, x=labels, y=labels,
            colorscale='Blues', showscale=(idx==n_models-1)
        ), row=row, col=col
    )
    for i in range(len(labels)):
        for j in range(len(labels)):
            fig.add_annotation(
                x=labels[j], y=labels[i],
                text=str(cm[i][j]), showarrow=False,
                row=row, col=col
            )

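# Label the axes and flip y so rows read top to bottom like the printed matrices.
fig.update_xaxes(title_text='Predicted')
fig.update_yaxes(title_text='Actual', autorange='reversed')
fig.update_layout(
    title='Confusion Matrix',
    height=400*rows, width=400*cols, showlegend=False
)
fig.show()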