# Final Project - Solving an Educational Institution's Problems: Student Dropout Prediction and Academic Performance Visualization at Jaya Jaya Institut

- Name: Muhammad Akbar Hamid
- Email: muhakbarhamid21@gmail.com
- Dicoding ID: muhakbarhamid21

## Preparation

### Preparing the required libraries

In [37]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.ensemble import StackingClassifier

from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, accuracy_score, precision_score, recall_score, f1_score

import joblib

import warnings
import plotly.graph_objects as go

warnings.filterwarnings("ignore")
sns.set(style="whitegrid")


### Preparing the data to be used

In [5]:
df = pd.read_csv("data/data_student.csv", sep=";")

## Data Understanding

### Preview Data

Display the first few rows to confirm that the columns and data format are correct, and to inspect sample values in each column.

In [93]:
df.head()

|  | Marital_status | Application_mode | Application_order | Course | Daytime_evening_attendance | Previous_qualification | Previous_qualification_grade | Nacionality | Mothers_qualification | Fathers_qualification | ... | Curricular_units_2nd_sem_credited | Curricular_units_2nd_sem_enrolled | Curricular_units_2nd_sem_evaluations | Curricular_units_2nd_sem_approved | Curricular_units_2nd_sem_grade | Curricular_units_2nd_sem_without_evaluations | Unemployment_rate | Inflation_rate | GDP | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 17 | 5 | 171 | 1 | 1 | 122.0 | 1 | 19 | 12 | ... | 0 | 0 | 0 | 0 | 0.0 | 0 | 10.8 | 1.4 | 1.74 | Dropout |
| 1 | 1 | 15 | 1 | 9254 | 1 | 1 | 160.0 | 1 | 1 | 3 | ... | 0 | 6 | 6 | 6 | 13.666667 | 0 | 13.9 | -0.3 | 0.79 | Graduate |
| 2 | 1 | 1 | 5 | 9070 | 1 | 1 | 122.0 | 1 | 37 | 37 | ... | 0 | 6 | 0 | 0 | 0.0 | 0 | 10.8 | 1.4 | 1.74 | Dropout |
| 3 | 1 | 17 | 2 | 9773 | 1 | 1 | 122.0 | 1 | 38 | 37 | ... | 0 | 6 | 10 | 5 | 12.4 | 0 | 9.4 | -0.8 | -3.12 | Graduate |
| 4 | 2 | 39 | 1 | 8014 | 0 | 1 | 100.0 | 1 | 37 | 38 | ... | 0 | 6 | 6 | 6 | 13.0 | 0 | 13.9 | -0.3 | 0.79 | Graduate |


### Column Types

Show the data type and non-null count of every column. This is important for detecting missing values and for confirming which columns are numeric and which are categorical.

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4424 entries, 0 to 4423
Data columns (total 37 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   Marital_status                                4424 non-null   int64  
 1   Application_mode                              4424 non-null   int64  
 2   Application_order                             4424 non-null   int64  
 3   Course                                        4424 non-null   int64  
 4   Daytime_evening_attendance                    4424 non-null   int64  
 5   Previous_qualification                        4424 non-null   int64  
 6   Previous_qualification_grade                  4424 non-null   float64
 7   Nacionality                                   4424 non-null   int64  
 8   Mothers_qualification                         4424 non-null   int64  
 9   Fathers_qualification                         4424 non-null   i
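
Missing values can also be counted directly instead of being read off the `info()` listing. The sketch below is only an illustration: `toy` is a hypothetical frame with made-up values standing in for `df`, and its column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the student data (values are made up)
toy = pd.DataFrame({
    "Marital_status": [1, 1, 2, 1],
    "Admission_grade": [127.3, np.nan, 124.8, 119.6],
    "Status": ["Dropout", "Graduate", None, "Enrolled"],
})

# One row per column: its dtype and the number of missing values
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isna().sum(),
})
print(summary)
```

Running the same two expressions on `df` should give the per-column view that `df.info()` reports as non-null counts.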

### Descriptive Statistics of Numeric Features

Summary statistics (count, mean, std, min, quartiles, max) for all numeric columns, useful for detecting outliers or high variability.

In [14]:
df.describe().T

|  | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Marital_status | 4424.0 | 1.178571 | 0.605747 | 1.0 | 1.0 | 1.0 | 1.0 | 6.0 |
| Application_mode | 4424.0 | 18.669078 | 17.484682 | 1.0 | 1.0 | 17.0 | 39.0 | 57.0 |
| Application_order | 4424.0 | 1.727848 | 1.313793 | 0.0 | 1.0 | 1.0 | 2.0 | 9.0 |
| Course | 4424.0 | 8856.642631 | 2063.566416 | 33.0 | 9085.0 | 9238.0 | 9556.0 | 9991.0 |
| Daytime_evening_attendance | 4424.0 | 0.890823 | 0.311897 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Previous_qualification | 4424.0 | 4.577758 | 10.216592 | 1.0 | 1.0 | 1.0 | 1.0 | 43.0 |
| Previous_qualification_grade | 4424.0 | 132.613314 | 13.188332 | 95.0 | 125.0 | 133.1 | 140.0 | 190.0 |
| Nacionality | 4424.0 | 1.873192 | 6.914514 | 1.0 | 1.0 | 1.0 | 1.0 | 109.0 |
| Mothers_qualification | 4424.0 | 19.561935 | 15.603186 | 1.0 | 2.0 | 19.0 | 37.0 | 44.0 |
| Fathers_qualification | 4424.0 | 22.275316 | 15.343108 | 1.0 | 3.0 | 19.0 | 37.0 | 44.0 |
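
One common way to turn these quartiles into an outlier check is the 1.5 x IQR rule. The sketch below is not part of the original analysis: `iqr_outlier_mask` is a hypothetical helper and the grade values are made up. Integer-coded columns such as `Course` hold category codes, so the rule is only meaningful for truly numeric columns.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside the interval [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up grades; the last value sits far above the rest
grades = pd.Series(
    [120.0, 125.0, 130.0, 133.0, 140.0, 190.0],
    name="Previous_qualification_grade",
)
mask = iqr_outlier_mask(grades)
print(grades[mask])
```

The multiplier `k` is a convention rather than a rule; a larger value flags fewer points.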


### Target Distribution (`Status`)

Check the proportion of each class (Dropout, Enrolled, Graduate) to determine whether class imbalance needs to be handled.

In [None]:
fig = px.histogram(
  df,
  x='Status',
  color='Status',
  title='Student Status Distribution',
  text_auto=True, 
  height=400, width=800,
  color_discrete_sequence=px.colors.qualitative.Plotly
)
fig.show()
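
The histogram shows counts; the class shares can also be computed as numbers. A minimal sketch on a made-up `status` series (the real column is `df['Status']`); the max-to-min ratio used here is just one simple, assumed way to summarize imbalance.

```python
import pandas as pd

# Made-up labels standing in for df['Status']
status = pd.Series(
    ["Dropout", "Graduate", "Graduate", "Enrolled", "Graduate", "Dropout"],
    name="Status",
)

counts = status.value_counts()                 # absolute class sizes
shares = status.value_counts(normalize=True)   # class proportions, summing to 1

balance = pd.DataFrame({"count": counts, "share": shares.round(3)})
imbalance_ratio = counts.max() / counts.min()  # largest class vs smallest class

print(balance)
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```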

### Distribution of Numeric Features

Below are several important numeric features. Color shows the distribution per `Status` class.

#### Distribution of `Age_at_enrollment`

Shows the range and the most common ages of students at enrollment; extreme ages may affect dropout risk.

In [None]:
fig = px.histogram(
  df,
  x='Age_at_enrollment',
  title='Distribution of Age at Enrollment',
  text_auto=True, 
  height=400, width=800, 
  color='Status'
)
fig.show()

#### Distribution of `Admission_grade`

The distribution of admission grades: are most grades low or high? This is an early indicator of academic ability.

In [None]:
fig = px.histogram(
    df,
    x='Admission_grade',
    nbins=30,
    title='Admission Grade Distribution',
    labels={'Admission_grade':'Admission Grade','count':'Count'},
    text_auto=True,
    height=400, width=800,
    color='Status',
    color_discrete_sequence=px.colors.qualitative.Plotly,
)
fig.show()

#### Distribution of `Previous_qualification_grade`

Grades from the qualification obtained before university, which reflect academic background and potential to adapt.

In [109]:
fig = px.histogram(
    df,
    x='Previous_qualification_grade',
    nbins=50,
    title='Distribution of Previous Qualification Grade',
    labels={'Previous_qualification_grade':'Prev Qualification Grade','count':'Count'},
    text_auto=True,
    height=400, width=800,
    color='Status',
    color_discrete_sequence=px.colors.qualitative.Plotly
)
fig.show()

#### Distribution of Average Semester Grades

Examine performance in each semester to detect declining or consistent grades.

In [None]:
# Semester 1
fig = px.histogram(
    df,
    x='Curricular_units_1st_sem_grade',
    nbins=30,
    title='Distribution of Average Grade, Sem 1',
    text_auto=True, height=400, width=800, color='Status'
)
fig.show()

# Semester 2
fig = px.histogram(
    df,
    x='Curricular_units_2nd_sem_grade',
    nbins=30,
    title='Distribution of Average Grade, Sem 2',
    text_auto=True, height=400, width=800, color='Status'
)
fig.show()


#### Distribution of the Macroeconomic Features `Unemployment_rate`, `Inflation_rate`, and `GDP`

External variables that may affect students' ability to pay tuition and continue their studies.

In [None]:
# Unemployment rate
fig = px.histogram(
    df,
    x='Unemployment_rate',
    title='Unemployment Rate Distribution (%)',
    text_auto=True, height=400, width=800,
    color='Status',
)
fig.show()

# Inflation rate
fig = px.histogram(
    df,
    x='Inflation_rate',
    title='Inflation Rate Distribution (%)',
    text_auto=True, height=400, width=800,
    color='Status',
)
fig.show()

# GDP
fig = px.histogram(
    df,
    x='GDP',
    nbins=30,
    title='GDP Distribution',
    text_auto=True, height=400, width=800,
    color='Status',
)
fig.show()


#### Distribution of Academic Load vs Success (Semester 1)

Compare the number of curricular units enrolled with the number approved; a low ratio signals academic difficulty.

In [None]:
# Enrolled vs Approved Sem 1
fig = px.histogram(
    df,
    x='Curricular_units_1st_sem_enrolled',
    nbins=20,
    title='Units Enrolled Sem 1',
    text_auto=True, height=400, width=800,
    color='Status',
    color_discrete_sequence=px.colors.qualitative.Plotly
)
fig.show()

fig = px.histogram(
    df,
    x='Curricular_units_1st_sem_approved',
    nbins=20,
    title='Units Approved Sem 1',
    text_auto=True, height=400, width=800,
    color='Status',
    color_discrete_sequence=px.colors.qualitative.Plotly
)
fig.show()

### Distribution of Categorical Features

Focus on the categories most likely to have a strong influence on the outcome.

#### Distribution of `Course`

Courses with the most students can become a focus for intervention if they also have a high dropout rate.

In [None]:
course_cnt = df['Course'].value_counts().reset_index()
course_cnt.columns = ['Course','Count']

fig = px.bar(
  course_cnt, 
  x='Course', y='Count', 
  title='Number of Students per Course',
  color_discrete_sequence=['#FF0000'], 
  text='Count', 
  height=400, width=800
)
fig.show()
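
The bar chart shows course size only. To relate size to dropout, the dropout rate per course can be computed alongside it. A minimal sketch on a made-up frame; the `is_dropout` helper column and the `per_course` name are assumptions introduced for illustration.

```python
import pandas as pd

# Made-up rows standing in for df[['Course', 'Status']]
toy = pd.DataFrame({
    "Course": [9254, 9254, 9254, 171, 171, 9070],
    "Status": ["Dropout", "Graduate", "Graduate", "Dropout", "Dropout", "Enrolled"],
})

per_course = (
    toy.assign(is_dropout=toy["Status"].eq("Dropout"))  # True where the student dropped out
       .groupby("Course")["is_dropout"]
       .agg(n_students="size", dropout_rate="mean")     # mean of booleans = share of dropouts
       .sort_values("dropout_rate", ascending=False)
)
print(per_course)
```

Rates for very small courses are noisy, so it may be worth reading `dropout_rate` together with `n_students`.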

#### Distribution of `Marital_status`

Family support and level of responsibility may differ across marital statuses.

In [112]:
fig = px.histogram(
    df,
    x='Marital_status',
    category_orders={'Marital_status': sorted(df['Marital_status'].unique())},
    title='Distribution of Student Marital Status',
    labels={'Marital_status':'Marital Status (1=Single, 2=Married, 3=Widower, 4=Divorced, 5=Facto Union, 6=Legally Separated)','count':'Count'},
    text_auto=True,
    height=400, width=800,
    color='Status',
    color_discrete_sequence=px.colors.qualitative.Plotly
)
fig.show()

#### Distribution of `Application_mode`

The application mode indicates the source of motivation or the admission route (regular, transfer, special scholarship).

In [65]:
fig = px.histogram(
    df,
    x='Application_mode',
    title='Application Mode Distribution',
    labels={'Application_mode':'Application Mode','count':'Count'},
    text_auto=True,
    height=400, width=800,
    color='Status',
    color_discrete_sequence=px.colors.qualitative.Plotly
)
fig.show()

#### Distribution of `Daytime_evening_attendance`

Daytime versus evening class schedules may affect performance (energy levels, side jobs).

In [66]:
fig = px.histogram(
    df,
    x='Daytime_evening_attendance',
    title='Attendance Distribution (Daytime vs Evening)',
    labels={'Daytime_evening_attendance':'1=Day, 0=Evening','count':'Count'},
    text_auto=True,
    height=400, width=800,
    color='Status',
    color_discrete_sequence=px.colors.qualitative.Plotly
)
fig.show()

#### Distribution of `Previous_qualification`

The level of prior education affects academic readiness.

In [68]:
fig = px.histogram(
    df,
    x='Previous_qualification',
    title='Distribution of Previous Qualification Level',
    labels={'Previous_qualification':'Previous Qualification','count':'Count'},
    text_auto=True,
    height=400, width=800,
    color='Status',
    color_discrete_sequence=px.colors.qualitative.Plotly
)
fig.show()

#### Distribution of `Scholarship_holder`

Scholarships can reduce the financial burden and improve retention.

In [69]:
fig = px.histogram(
    df,
    x='Scholarship_holder',
    title='Proportion of Scholarship Holders',
    labels={'Scholarship_holder':'1=Yes, 0=No','count':'Count'},
    text_auto=True,
    height=400, width=800,
    color='Status',
    color_discrete_sequence=px.colors.qualitative.Plotly
)
fig.show()
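
Stacked counts make it hard to compare groups of very different size. Row-normalized shares are one way to compare outcomes between scholarship holders and non-holders. A minimal sketch on made-up rows; the same call could be applied to other binary flags such as `Debtor` or `Displaced`.

```python
import pandas as pd

# Made-up rows standing in for df[['Scholarship_holder', 'Status']]
toy = pd.DataFrame({
    "Scholarship_holder": [1, 1, 0, 0, 0, 0],
    "Status": ["Graduate", "Graduate", "Dropout", "Dropout", "Graduate", "Enrolled"],
})

# normalize='index' makes each row sum to 1, giving the outcome shares within each group
rates = pd.crosstab(toy["Scholarship_holder"], toy["Status"], normalize="index")
print(rates.round(2))
```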

#### Distribution of `Displaced`

Displaced students may face greater adaptation challenges.

In [70]:
fig = px.histogram(
    df,
    x='Displaced',
    title='Proportion of Displaced Students',
    labels={'Displaced':'1=Yes, 0=No','count':'Count'},
    text_auto=True,
    height=400, width=800,
    color='Status',
    color_discrete_sequence=px.colors.qualitative.Plotly
)
fig.show()

#### Distribution of `Debtor`

Outstanding payments can cause stress and increase dropout risk.

In [75]:
fig = px.histogram(
    df,
    x='Debtor',
    title='Proportion of Debtor Students',
    labels={'Debtor':'1=Yes, 0=No','count':'Count'},
    text_auto=True,
    height=400, width=800,
    color='Status',
    color_discrete_sequence=px.colors.qualitative.Plotly
)
fig.update_xaxes(tickmode='array', tickvals=[0,1], ticktext=['No','Yes'])
fig.show()


### Correlation Between Features

Display the correlation matrix for all numeric columns, which helps detect multicollinearity.

In [73]:
corr = df.corr(numeric_only=True).round(2)

fig = go.Figure(
    data=go.Heatmap(
        z=corr.values,
        x=corr.columns,
        y=corr.index,
        colorscale='RdBu',
        zmin=-1,
        zmax=1,
        colorbar=dict(title='Correlation'),
        text=corr.values,
        texttemplate="%{text}",
        hovertemplate=(
            "Feature X: %{x}<br>"
            "Feature Y: %{y}<br>"
            "Correlation: %{z:.2f}<extra></extra>"
        )
    )
)

fig.update_layout(title="Correlation Matrix", xaxis_title="Feature", yaxis_title="Feature", width=1500, height=1500)

fig.show()
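
With this many features, strongly correlated pairs are easier to read from a list than from the heatmap. The sketch below lists each pair once and keeps those above a cutoff. The 0.8 threshold and the toy values are assumptions, not taken from the analysis.

```python
from itertools import combinations

import pandas as pd

# Made-up values; the two "approved" columns are deliberately similar
toy = pd.DataFrame({
    "Curricular_units_1st_sem_approved": [0, 2, 4, 5, 6, 6],
    "Curricular_units_2nd_sem_approved": [0, 3, 4, 5, 6, 5],
    "Age_at_enrollment": [19, 35, 20, 22, 18, 27],
})
corr = toy.corr()

# Visit every unordered pair of columns exactly once
rows = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
pairs = pd.DataFrame(rows, columns=["feature_a", "feature_b", "corr"])

threshold = 0.8  # assumed cutoff for a "strong" correlation
high = pairs[pairs["corr"].abs() >= threshold].sort_values(
    "corr", key=lambda s: s.abs(), ascending=False
)
print(high)
```

Pairs that appear in `high` are candidates for dropping or combining before fitting linear models.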


#### Correlation of Features with Each `Status` Category (Dropout, Graduate, Enrolled)

See which features correlate most positively or negatively with each target class.

In [None]:
# One-hot encode the Status column
status_dummies = pd.get_dummies(df['Status'], prefix='Status')

# Build a new DataFrame that includes the dummies and drops the original Status column
df_enc = pd.concat([df.drop(columns=['Status']), status_dummies], axis=1)

# Compute correlations across all numeric columns
corr = df_enc.corr(numeric_only=True).round(2)

# For each Status category, plot a horizontal bar chart
for cat in ['Dropout', 'Graduate', 'Enrolled']:
    col = f'Status_{cat}'
    # Take the correlations with the dummy, drop the self-correlation, and sort
    top_corr = corr[col].drop(col).sort_values(ascending=True)
    chart_height = len(top_corr) * 30

    fig = go.Figure(
        data=go.Bar(
            x=top_corr.values,
            y=top_corr.index,
            orientation='h',
            text=top_corr.values,
            textposition='auto'
        )
    )
    fig.update_layout(
        title=f'Feature Correlation with Status = {cat}',
        xaxis=dict(
            title='Correlation',
            # Set the axis range to the data range plus a small margin
            range=[top_corr.min() - 0.05, top_corr.max() + 0.05]
        ),
        yaxis=dict(
            title='Feature',
            showgrid=True,
            gridcolor='lightgrey',
            gridwidth=1,
            ticks="outside"
        ),
        plot_bgcolor='white',
        width=800,
        height=chart_height
    )
    fig.show()


### Relationship Between Features and the Target (`Status` Column)

#### Relationship Between `Admission_grade` and `Status`

Box plots compare the distribution of admission grades across students' final outcomes.

In [None]:
fig = px.box(
  df,
  x='Status', y='Admission_grade',
  title='Admission Grade per Status Class',
  color='Status',
  points='all',
  height=400, width=800,
  color_discrete_sequence=px.colors.qualitative.Plotly
)
fig.show()
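
The quartiles that a box plot draws can also be tabulated per class, which makes small differences between outcomes easier to compare. A minimal sketch with made-up grades; the same pattern would apply to the other numeric features examined in this section.

```python
import pandas as pd

# Made-up rows standing in for df[['Status', 'Admission_grade']]
toy = pd.DataFrame({
    "Status": ["Dropout"] * 3 + ["Graduate"] * 3,
    "Admission_grade": [110.0, 120.0, 130.0, 125.0, 135.0, 145.0],
})

# Q1, median and Q3 per class: one row per Status, one column per quantile
quartiles = (
    toy.groupby("Status")["Admission_grade"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(quartiles)
```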


#### Relationship Between `Previous_qualification_grade` and `Status`

See how previous qualification grades differ across outcomes.

In [None]:
fig = px.box(
    df,
    x='Status', y='Previous_qualification_grade',
    title='Previous Qualification Grade per Status Class',
    color='Status',
    points='all',
    labels={'Previous_qualification_grade':'Prev Qualification Grade','Status':'Status'},
    height=400, width=800
)
fig.show()

#### Relationship Between `Curricular_units_1st_sem_grade` and `Status`

Assess first-semester performance as an indicator of dropout risk or graduation.

In [None]:
fig = px.box(
    df,
    x='Status', y='Curricular_units_1st_sem_grade',
    title='Average Semester 1 Grade per Status Class',
    color='Status',
    points='all',
    labels={'Curricular_units_1st_sem_grade':'Grade Sem 1','Status':'Status'},
    height=400, width=800,
)
fig.show()

#### Relationship Between `Curricular_units_2nd_sem_grade` and `Status`

Check how performance changes in the second semester: is there a rebound or a decline?

In [None]:
fig = px.box(
    df,
    x='Status', y='Curricular_units_2nd_sem_grade',
    title='Average Semester 2 Grade per Status Class',
    color='Status',
    points='all',
    labels={'Curricular_units_2nd_sem_grade':'Grade Sem 2','Status':'Status'},
    height=400, width=800
)
fig.show()

#### Relationship Between `Age_at_enrollment` and `Status`

Find out whether certain age groups are more at risk of dropping out or more likely to graduate.

In [None]:
fig = px.box(
    df,
    x='Status', y='Age_at_enrollment',
    title='Age at Enrollment per Status Class',
    color='Status',
    points='all',
    labels={'Age_at_enrollment':'Age','Status':'Status'},
    height=400, width=800
)
fig.show()

#### Relationship Between the Semester 1 `approved/enrolled` Ratio and `Status`

Assess how efficiently students complete their study load, without adding a permanent new column.

In [None]:
fig = px.box(
    df.assign(
        ratio_sem1 = df['Curricular_units_1st_sem_approved'] / df['Curricular_units_1st_sem_enrolled']
    ),
    x='Status', y='ratio_sem1',
    title='Approved/Enrolled Ratio Sem 1 per Status Class',
    labels={'ratio_sem1':'Approval Ratio Sem 1','Status':'Status'},
    points='all',
    height=400, width=800,
    color_discrete_sequence=px.colors.qualitative.Plotly,
    color='Status'
)
fig.show()
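
Students with zero enrolled units make the ratio undefined: pandas returns `NaN` for 0/0 and `inf` for a nonzero numerator over zero. One option, sketched below on made-up rows, is to mask zero denominators explicitly so that every undefined ratio becomes `NaN`. Whether such rows should instead be dropped or treated as a separate group is left open here.

```python
import numpy as np
import pandas as pd

# Made-up rows; the third student is enrolled in no units
toy = pd.DataFrame({
    "Curricular_units_1st_sem_enrolled": [6, 6, 0, 5],
    "Curricular_units_1st_sem_approved": [6, 3, 0, 0],
})

enrolled = toy["Curricular_units_1st_sem_enrolled"]
approved = toy["Curricular_units_1st_sem_approved"]

# where() keeps positive denominators and replaces the rest with NaN
ratio_sem1 = approved / enrolled.where(enrolled > 0)
print(ratio_sem1)
```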


## Data Preparation / Preprocessing

## Data Transformation

## Modeling

## Evaluation