# **Background**

Student academic achievement is an important indicator for assessing the success of the educational process. However, academic outcomes are shaped not only by intellectual ability but also by various social, economic, and environmental factors. A data-driven analysis is therefore needed to understand the factors that influence students' academic performance more comprehensively.

The Student Performance dataset from the UCI Machine Learning Repository provides students' academic grades along with supporting attributes such as family background, study habits, and social conditions. Using this dataset, this project aims to analyze the relationship between these factors and student achievement and to build a model that predicts the final grade. The results are expected to help educators make better-informed decisions to improve the quality of learning.

Source: https://archive.ics.uci.edu/dataset/320/student+performance

# **Importing the Required Libraries**

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Import the libraries used
import pandas as pd

# Run the following to customize how DataFrames are displayed
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.float_format', '{:.2f}'.format)

# **Data Extraction**

In [3]:
# Google Drive file ID
file_id = '1MvZttuTu1xMGe8RQenmv8J2giGpvDmdA'

# Construct the direct download URL
download_url = f'https://drive.google.com/uc?id={file_id}'

# Read the CSV directly into a pandas DataFrame
student_wellbeing = pd.read_csv(download_url)

# Display the first 3 rows of the DataFrame
display(student_wellbeing.head(3))

Unnamed: 0,Name,Gender,Age,Education Level,Screen Time (hrs/day),Sleep Duration (hrs),Physical Activity (hrs/week),Stress Level,Anxious Before Exams,Academic Performance Change
0,Aarav,Male,15.0,Class 8,7.1,8.9,9.3,Medium,No,Same
1,Meera,Female,25.0,MSc,3.3,-5.0,0.2,Medium,No,Same
2,Ishaan,Male,20.0,BTech,9.5,150.0,6.2,Medium,No,Same


**Data Dictionary**

| Column Name                  | Description                                                              |
|:-----------------------------|:-------------------------------------------------------------------------|
| Name                         | Respondent's name                                                        |
| Gender                       | Respondent's gender                                                      |
| Age                          | Respondent's age                                                         |
| Education Level              | Respondent's current education level                                     |
| Screen Time (hrs/day)        | Average time the respondent spends looking at a screen per day, in hours |
| Sleep Duration (hrs)         | Average sleep duration per day, in hours                                 |
| Physical Activity (hrs/week) | Total physical activity per week, in hours                               |
| Stress Level                 | Stress level the respondent experiences in daily life                    |
| Anxious Before Exams         | Whether the respondent feels anxious before exams                        |
| Academic Performance Change  | Change in the respondent's academic performance over a given period      |


# **General Information About the Data**

In [4]:
# Inspect general information about the data
student_wellbeing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1250 entries, 0 to 1249
Data columns (total 10 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Name                          1240 non-null   object 
 1   Gender                        1239 non-null   object 
 2   Age                           1240 non-null   object 
 3   Education Level               1240 non-null   object 
 4   Screen Time (hrs/day)         1236 non-null   object 
 5   Sleep Duration (hrs)          1232 non-null   float64
 6   Physical Activity (hrs/week)  1243 non-null   float64
 7   Stress Level                  1238 non-null   object 
 8   Anxious Before Exams          1233 non-null   object 
 9   Academic Performance Change   1235 non-null   object 
dtypes: float64(2), object(8)
memory usage: 97.8+ KB


In [5]:
pip install ydata-profiling



In [6]:
import ydata_profiling as yp

profile = yp.ProfileReport(
    student_wellbeing,
    title = 'Student Mental Wellbeing Report'
)

profile.to_file('profiling_mental_murid.html')


# **Univariate Analysis**

## ***Categorical Columns***

In [7]:
# Select the non-numeric (object) columns
kolom_kategorik = student_wellbeing.select_dtypes(exclude = ['number']).columns


# Show the frequency of each value, counting missing values as well
for col in kolom_kategorik:
    print(f'> Frequency of \033[91m{col}\033[0m')
    print(f'  {student_wellbeing[col].nunique(dropna = False)} unique values\n')
    display(student_wellbeing[col].value_counts(dropna = False).reset_index())
    print('\n')

> Frequency of Name
  31 unique values



Unnamed: 0,Name,count
0,Shaurya,77
1,Kavya,68
2,Meera,66
3,Aadhya,66
4,Diya,64
5,Arjun,64
6,Anika,61
7,Krishna,61
8,Myra,60
9,Reyansh,60




> Frequency of Gender
  4 unique values



Unnamed: 0,Gender,count
0,Female,592
1,Male,578
2,Other,69
3,,11




> Frequency of Age
  17 unique values



Unnamed: 0,Age,count
0,17.0,121
1,21.0,120
2,23.0,115
3,15.0,108
4,20.0,104
5,16.0,104
6,25.0,96
7,19.0,96
8,26.0,90
9,22.0,88




> Frequency of Education Level
  12 unique values



Unnamed: 0,Education Level,count
0,MSc,172
1,MTech,172
2,MA,164
3,Class 10,116
4,Class 11,115
5,BSc,105
6,BTech,104
7,Class 9,100
8,BA,78
9,Class 8,58




> Frequency of Screen Time (hrs/day)
  106 unique values



Unnamed: 0,Screen Time (hrs/day),count
0,6.9,22
1,4.8,21
2,4.5,21
3,6.3,20
4,9.9,19
...,...,...
101,12.0,6
102,6.5,6
103,9.2,4
104,8.5,4




> Frequency of Stress Level
  4 unique values



Unnamed: 0,Stress Level,count
0,Medium,625
1,Low,398
2,High,215
3,,12




> Frequency of Anxious Before Exams
  3 unique values



Unnamed: 0,Anxious Before Exams,count
0,Yes,634
1,No,599
2,,17




> Frequency of Academic Performance Change
  4 unique values



Unnamed: 0,Academic Performance Change,count
0,Same,486
1,Improved,377
2,Declined,372
3,,15






In [8]:
# Normalize the column names: lowercase, underscores instead of spaces, no parentheses or slashes

student_wellbeing.columns = (
    student_wellbeing.columns
    .str.lower()
    .str.replace(' ', '_')
    .str.replace(r'[()/]', '', regex=True)
)

Non-numeric characters were found in the numeric columns ...., so ... needs to be done ...
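One way to locate such entries is to coerce the column to numeric and look at the values that were present in the raw data but could not be parsed. The snippet below is a minimal sketch on a hypothetical sample series (`raw_age` is invented for illustration), not the notebook's actual data.

```python
import pandas as pd

# Hypothetical sample that mimics a numeric column stored as object dtype
raw_age = pd.Series(['15.0', '25.0', 'twenty', None, '20.0'], name='age')

# Coerce to numeric: anything that cannot be parsed becomes NaN
parsed = pd.to_numeric(raw_age, errors='coerce')

# Entries that exist in the raw data but are lost after coercion are the offenders
offenders = raw_age[raw_age.notna() & parsed.isna()]
print(offenders.unique())
```

Inspecting the offending strings first makes it possible to decide case by case whether to map them to a number or treat them as missing.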


In [9]:
# Replace the string 'twenty' with its numeric value in the age column
student_wellbeing['age'] = student_wellbeing['age'].replace('twenty', 20)

# Convert the column to a numeric dtype; any remaining unparseable value becomes NaN
student_wellbeing['age'] = pd.to_numeric(
    student_wellbeing['age'],
    errors='coerce'
)


In [10]:
# Replace the string 'unknown' in screen_time_hrsday with a missing value.
# NaN is used instead of None because replace(..., None) behaves differently across pandas versions.
student_wellbeing['screen_time_hrsday'] = (
    student_wellbeing['screen_time_hrsday']
    .replace('unknown', float('nan'))
)

# Convert the column to a numeric dtype
student_wellbeing['screen_time_hrsday'] = pd.to_numeric(
    student_wellbeing['screen_time_hrsday'],
    errors='coerce'
)

## ***Numeric Columns***

In [11]:
# Select the columns with a numeric dtype
kolom_numerik = student_wellbeing.select_dtypes(include='number')

# Descriptive statistics for the numeric columns
display(kolom_numerik.describe().T)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,1240.0,21.53,15.43,1.0,17.0,20.0,23.0,200.0
screen_time_hrsday,1235.0,8.17,13.47,-5.0,4.4,6.9,9.5,150.0
sleep_duration_hrs,1232.0,8.2,15.18,-5.0,5.1,6.5,7.8,150.0
physical_activity_hrsweek,1243.0,7.03,16.19,-5.0,2.7,5.1,7.7,150.0


**Data Insight**

Every numeric column contains extreme values (for example a minimum of -5 hours and a maximum of 150 hours) that indicate invalid entries. The data needs to be cleaned so that these values do not distort the modelling.
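A quick way to quantify such invalid entries is to count the values that fall outside a plausible interval for each column. The sketch below uses a small hypothetical sample, and the limits in `valid_ranges` are assumptions chosen for illustration (a day has 24 hours); they are not part of the original analysis.

```python
import pandas as pd

# Hypothetical plausibility limits; adjust them to the survey's actual design
valid_ranges = {
    'sleep_duration_hrs': (0, 24),
    'screen_time_hrsday': (0, 24),
}

# Small invented sample containing a few impossible values
sample = pd.DataFrame({
    'sleep_duration_hrs': [8.9, -5.0, 150.0, 6.5],
    'screen_time_hrsday': [7.1, 3.3, 9.5, 30.0],
})

# Count the values outside the allowed (inclusive) interval for each column
invalid_counts = {
    col: int((~sample[col].between(low, high)).sum())
    for col, (low, high) in valid_ranges.items()
}
print(invalid_counts)
```

Note that `between` returns `False` for missing values, so on data that still contains NaN the missing entries would be counted as invalid unless they are filtered out first.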


## **Missing Value**

### **Count the *Missing Values***

In [12]:
# Show the number of missing values per column
display(student_wellbeing.isna().sum())

Unnamed: 0,0
name,10
gender,11
age,10
education_level,10
screen_time_hrsday,15
sleep_duration_hrs,18
physical_activity_hrsweek,7
stress_level,12
anxious_before_exams,17
academic_performance_change,15


### **Handling Missing Values - Categorical Columns**

In [13]:
# Show the initial shape
student_wellbeing.shape

(1250, 10)

In [14]:
# Drop rows where the target, academic_performance_change, is missing
student_wellbeing = student_wellbeing.dropna(
    subset=['academic_performance_change']
)

In [15]:
# Fill missing gender values with 'Other'
student_wellbeing['gender'] = (
    student_wellbeing['gender']
    .fillna('Other')
)

# Fill missing education_level, stress_level, and anxious_before_exams values with 'Unknown'
cols_fill_unknown = [
    'education_level',
    'stress_level',
    'anxious_before_exams'
]
student_wellbeing[cols_fill_unknown] = (
    student_wellbeing[cols_fill_unknown]
    .fillna('Unknown')
)


### **Handling Missing Values - Numeric Columns**

In [16]:
# Fill the missing values in each numeric column with that column's median
for col in kolom_numerik:
    median_data = student_wellbeing[col].median()
    student_wellbeing[col] = student_wellbeing[col].fillna(median_data)

### **Re-check**

In [17]:
# Check the missing values again
display(student_wellbeing.isna().sum())

Unnamed: 0,0
name,10
gender,0
age,0
education_level,0
screen_time_hrsday,0
sleep_duration_hrs,0
physical_activity_hrsweek,0
stress_level,0
anxious_before_exams,0
academic_performance_change,0


## **Duplicated Data**

### **Count the *Duplicated Rows***

In [18]:
print(f'Current data shape : {student_wellbeing.shape}')
# keep=False counts every row that belongs to a duplicate group, including the first occurrence
print(f'Duplicated rows    : {student_wellbeing.duplicated(keep = False).sum()}')

Current data shape : (1235, 10)
Duplicated rows    : 342


### **Remove the Duplicated Rows**

In [19]:
# Drop the duplicated rows, keeping the first occurrence of each
student_wellbeing = student_wellbeing.drop_duplicates()
print(f'Current data shape : {student_wellbeing.shape}')

Current data shape : (1056, 10)
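The count of 342 duplicated rows is larger than the 179 rows actually removed (1235 - 1056) because `duplicated(keep=False)` flags every member of a duplicate group, while `drop_duplicates()` keeps the first occurrence of each group. A minimal sketch on an invented toy frame illustrates the difference:

```python
import pandas as pd

# Toy data: 'A' appears three times, 'B' twice, 'C' once
toy = pd.DataFrame({
    'name': ['A', 'A', 'A', 'B', 'B', 'C'],
    'age':  [20, 20, 20, 21, 21, 22],
})

# keep=False marks every member of a duplicate group, including the first
rows_in_duplicate_groups = int(toy.duplicated(keep=False).sum())

# The default keep='first' marks only the repeats, which is what drop_duplicates removes
rows_removed = int(toy.duplicated().sum())

rows_left = len(toy.drop_duplicates())
print(rows_in_duplicate_groups, rows_removed, rows_left)
```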


## **Outlier**

### **Detecting *Outliers* with Box Plots**

In [20]:
import plotly.express as px

def box_plot(series, column_name, color):
    # Create a horizontal box plot
    fig = px.box(
        series,
        orientation = 'h',
        color_discrete_sequence  = [color]
    )

    # Update the layout before displaying the plot
    fig.update_layout(
        title = f'<b>Distribution of {column_name}</b>',
        yaxis = dict(
            title = '',
            showgrid = False,
            showline = False,
            showticklabels = False,
            zeroline = False,
        ),
        xaxis = dict(
            title = column_name,
            showgrid = False,
            showline = True,
            showticklabels = True,
            zeroline = False,
        )
    )

    fig.show()

In [21]:
for col in kolom_numerik:
    box_plot(student_wellbeing[col], col, '#F28E2B')

### **Handling Outliers with the Winsorizing Technique**

In [22]:
# Winsorizing function based on the IQR bounds
def teknik_winsorizing(series):
    # Compute Q1, Q3, and the IQR
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1

    # Compute the lower and upper bounds
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # If the lower bound is negative, set it to 0
    if lower_bound < 0:
        lower_bound = 0

    # Winsorizing: clip the values to the lower and upper bounds
    series = series.astype(pd.Float64Dtype())
    winsorized_series = series.clip(lower=lower_bound, upper=upper_bound)

    return winsorized_series

In [23]:
for col in kolom_numerik:
    student_wellbeing[col] = teknik_winsorizing(student_wellbeing[col])
    box_plot(student_wellbeing[col], col, '#B07AA1')
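To make the clipping step concrete, the following sketch works through the same IQR-based bounds on a small invented series. It assumes pandas' default linear interpolation for quantiles; the variable names are illustrative only.

```python
import pandas as pd

# Invented values with one low and one extreme high entry
values = pd.Series([1.0, 5.0, 6.0, 7.0, 8.0, 150.0])

# Quartiles and interquartile range
q1 = values.quantile(0.25)   # 5.25
q3 = values.quantile(0.75)   # 7.75
iqr = q3 - q1                # 2.5

# Bounds at 1.5 * IQR beyond the quartiles, with the lower bound floored at 0
lower = max(q1 - 1.5 * iqr, 0)   # 1.5
upper = q3 + 1.5 * iqr           # 11.5

# Values outside the bounds are replaced by the nearest bound
clipped = values.clip(lower=lower, upper=upper)
print(clipped.tolist())
```

Only the two outer values change (1.0 becomes 1.5 and 150.0 becomes 11.5), so the number of rows is preserved, unlike an approach that drops outliers.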

## **Distribution**

In [24]:
import plotly.express as px

for col in kolom_numerik:
    fig = px.histogram(
        student_wellbeing,
        x = col,
        nbins = 25,
        color_discrete_sequence = ['#B07AA1'],
        marginal = "box",
        hover_data = student_wellbeing.columns
    )

    fig.update_yaxes(
        showgrid = False,
        showticklabels=False,
        title =''
    )

    fig.update_layout(
        title={
            'text' : f'Distribution of <b><span style="color:#B07AA1">{col}</span></b>',
            'y':0.92,
            'x':0.5,
            'xanchor': 'center',
            'yanchor': 'top'
        },
        plot_bgcolor = 'rgba(0,0,0,0)',
        bargap = 0.01,
        title_font = dict(size = 25)
    )

    fig.show()

# **Multivariate Analysis**

## **Correlation**

In [25]:
# Compute the correlation between the numeric columns
data_corr = student_wellbeing.select_dtypes(
    include='number'
).corr()


# Display the result
display(data_corr)

Unnamed: 0,age,screen_time_hrsday,sleep_duration_hrs,physical_activity_hrsweek
age,1.0,0.01,0.02,-0.0
screen_time_hrsday,0.01,1.0,-0.0,0.02
sleep_duration_hrs,0.02,-0.0,1.0,-0.02
physical_activity_hrsweek,-0.0,0.02,-0.02,1.0


In [26]:
import plotly.express as px

# Create the heatmap
fig = px.imshow(
    data_corr,
    color_continuous_scale='blues',
    title = '<b>Correlation of Numeric Columns in the Student Data</b><br>',
    text_auto = True
)

# Hide the correlation color scale
fig.update_coloraxes(showscale=False)

# Position the heatmap title and set the figure size
fig.update_layout(
    title = dict(
        x=0.5,
        y=0.9,
        xanchor='center',
        yanchor='top'
    ),
    width = 1000,
    height = 800
)

# Display the heatmap
fig.show()

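# **Statistics for Each Academic Performance Change Category**

In [27]:
# Note: 'count' skips rows where name is missing, so total_data can be smaller than the group size
student_wellbeing.groupby(['academic_performance_change']).agg(
    total_data = ('name', 'count'),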
    min_age = ('age', 'min'),
    median_age = ('age', 'median'),
    max_age = ('age', 'max'),
    min_screen_time = ('screen_time_hrsday', 'min'),
    median_screen_time = ('screen_time_hrsday', 'median'),
    max_screen_time = ('screen_time_hrsday', 'max'),
    min_sleep_duration = ('sleep_duration_hrs', 'min'),
    median_sleep_duration = ('sleep_duration_hrs', 'median'),
    max_sleep_duration = ('sleep_duration_hrs', 'max')
)

Unnamed: 0_level_0,total_data,min_age,median_age,max_age,min_screen_time,median_screen_time,max_screen_time,min_sleep_duration,median_sleep_duration,max_sleep_duration
academic_performance_change,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Declined,310,8.0,20.0,32.0,0.0,6.9,17.0,1.3,6.35,11.7
Improved,317,8.0,21.0,32.0,0.0,7.0,17.0,1.3,6.5,11.7
Same,419,8.0,20.0,32.0,0.0,6.9,17.0,1.3,6.6,11.7


# **Data Preprocessing**

## **Feature Selection**

In [28]:
# Copy DataFrame
data_preprocessing = student_wellbeing.copy()

# Drop the name column
data_preprocessing = data_preprocessing.drop(columns=['name'])

# Display the result
data_preprocessing.head()

Unnamed: 0,gender,age,education_level,screen_time_hrsday,sleep_duration_hrs,physical_activity_hrsweek,stress_level,anxious_before_exams,academic_performance_change
0,Male,15.0,Class 8,7.1,8.9,9.3,Medium,No,Same
1,Female,25.0,MSc,3.3,1.3,0.2,Medium,No,Same
2,Male,20.0,BTech,9.5,11.7,6.2,Medium,No,Same
3,Male,20.0,BA,10.8,5.6,5.5,High,Yes,Same
4,Female,17.0,Class 11,2.8,5.4,3.1,Medium,Yes,Same


## **Encoding**

In [29]:
# Redefine the categorical and numeric columns
kolom_kategorik = data_preprocessing.select_dtypes(include=['object']).columns
kolom_numerik = data_preprocessing.select_dtypes(include=['number']).columns

In [30]:
from sklearn.preprocessing import OrdinalEncoder

# Categorical columns to encode (excluding the target)
kolom_kategorik_ord = [col for col in kolom_kategorik if col != 'academic_performance_change']

# Create the OrdinalEncoder object
ord_enc = OrdinalEncoder()

# Fit the encoder on the categorical columns
ord_enc.fit(data_preprocessing[kolom_kategorik_ord])

# Show the mapping for each column
for col, cats in zip(kolom_kategorik_ord, ord_enc.categories_):
    print(f"Mapping for {col}:")
    for i, cat in enumerate(cats):
        print(f"  {cat} → {i}")
    print()

# Transform the categorical columns after the mapping has been displayed
data_preprocessing[kolom_kategorik_ord] = ord_enc.transform(data_preprocessing[kolom_kategorik_ord])


Mapping for gender:
  Female → 0
  Male → 1
  Other → 2

Mapping for education_level:
  BA → 0
  BSc → 1
  BTech → 2
  Class 10 → 3
  Class 11 → 4
  Class 12 → 5
  Class 8 → 6
  Class 9 → 7
  MA → 8
  MSc → 9
  MTech → 10
  Unknown → 11

Mapping for stress_level:
  High → 0
  Low → 1
  Medium → 2
  Unknown → 3

Mapping for anxious_before_exams:
  No → 0
  Unknown → 1
  Yes → 2
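By default `OrdinalEncoder` sorts categories alphabetically, which is why `stress_level` is mapped as High → 0, Low → 1, Medium → 2 above. If the natural order of a column should be reflected in the codes, the `categories` parameter can be set explicitly. The sketch below is illustrative only: the toy frame is invented, and placing `'Unknown'` last is an arbitrary assumption.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Invented sample of stress levels
toy = pd.DataFrame({'stress_level': ['Medium', 'High', 'Low', 'Unknown']})

# Assumed order from lowest to highest; the position of 'Unknown' is arbitrary
stress_order = ['Low', 'Medium', 'High', 'Unknown']

# One list of categories per encoded column
encoder = OrdinalEncoder(categories=[stress_order])
encoded = encoder.fit_transform(toy[['stress_level']]).ravel()
print(encoded)
```

For a decision tree the split thresholds operate on these integer codes, so an order that matches the meaning of the categories can make individual splits easier to interpret.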




## **Splitting Data**

In [32]:
# Import the required library
from sklearn.model_selection import train_test_split

# X holds the features and y holds the target
X = data_preprocessing.drop(columns=['academic_performance_change'])
y = data_preprocessing['academic_performance_change']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
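The `stratify=y` argument keeps the class proportions of the target roughly equal in the training and test sets. The sketch below demonstrates this on synthetic labels that reuse the class counts reported for the target in this notebook (424, 320, 312); the single feature column is a placeholder invented for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the target
y_toy = pd.Series(['Same'] * 424 + ['Improved'] * 320 + ['Declined'] * 312)

# Placeholder feature; its values are irrelevant for the demonstration
X_toy = pd.DataFrame({'feature': range(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Class shares in the full data and in the test split should be nearly identical
print(y_toy.value_counts(normalize=True).round(3))
print(y_te.value_counts(normalize=True).round(3))
```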

In [33]:
# Show the frequency of each target class
display(y.value_counts())

Unnamed: 0_level_0,count
academic_performance_change,Unnamed: 1_level_1
Same,424
Improved,320
Declined,312


* Declined → 0
* Improved → 1
* Same → 2

In [34]:
student_wellbeing['academic_performance_change'].value_counts(dropna=False)


Unnamed: 0_level_0,count
academic_performance_change,Unnamed: 1_level_1
Same,424
Improved,320
Declined,312


In [35]:
data_preprocessing['academic_performance_change'] = (
    student_wellbeing['academic_performance_change']
)


In [36]:
data_preprocessing['academic_performance_change'].value_counts()


Unnamed: 0_level_0,count
academic_performance_change,Unnamed: 1_level_1
Same,424
Improved,320
Declined,312


In [37]:
print(kolom_kategorik)


Index(['gender', 'education_level', 'stress_level', 'anxious_before_exams',
       'academic_performance_change'],
      dtype='object')


In [38]:
kolom_kategorik = kolom_kategorik.difference(
    ['academic_performance_change']
)


In [39]:
mapping = {
    'Declined': 0,
    'Improved': 1,
    'Same': 2
}

data_preprocessing['academic_performance_change'] = (
    data_preprocessing['academic_performance_change']
    .map(mapping)
)


In [40]:
data_preprocessing['academic_performance_change'].value_counts()


Unnamed: 0_level_0,count
academic_performance_change,Unnamed: 1_level_1
2,424
1,320
0,312


## **Imbalance Target**

In [41]:
# Import SMOTE from imblearn
from imblearn.over_sampling import SMOTE

# Create the SMOTE object
smote = SMOTE(random_state=42)

# Apply it to the training data only
X_train_resample, y_train_resample = smote.fit_resample(
    X_train, y_train
)

In [42]:
# Show the frequency of each target class after resampling
display(pd.Series(y_train_resample).value_counts())

Unnamed: 0_level_0,count
academic_performance_change,Unnamed: 1_level_1
Improved,339
Same,339
Declined,339


# **Modelling**

In [43]:
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline(steps=[
    ('scaler', RobustScaler()),
    ('smote', SMOTE()),
    ('model',DecisionTreeClassifier())
])

param_grid = {
    'smote__k_neighbors': [3, 5, 7],
    'model__criterion': ['gini', 'entropy', 'log_loss'],
    'model__max_depth': [None, 5, 10, 20, 30],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 5],
    'model__class_weight': [None, 'balanced']
}

grid = GridSearchCV(
    estimator = pipeline,
    param_grid = param_grid,
    cv = 5,
    scoring = 'f1_macro',
    n_jobs = -1
)

grid.fit(X_train, y_train)

print("Best Params:", grid.best_params_)
print("Best CV Score:", grid.best_score_)

Best Params: {'model__class_weight': 'balanced', 'model__criterion': 'entropy', 'model__max_depth': 20, 'model__min_samples_leaf': 1, 'model__min_samples_split': 2, 'smote__k_neighbors': 5}
Best CV Score: 0.38616886909848114
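The cost of the search above can be estimated before running it by counting the parameter combinations. The sketch below copies the `param_grid` from the cell above and uses scikit-learn's `ParameterGrid`; the `n_folds` name is introduced here for illustration and mirrors `cv = 5`.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the GridSearchCV cell
param_grid = {
    'smote__k_neighbors': [3, 5, 7],
    'model__criterion': ['gini', 'entropy', 'log_loss'],
    'model__max_depth': [None, 5, 10, 20, 30],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 5],
    'model__class_weight': [None, 'balanced'],
}

# Every combination is one candidate: 3 * 3 * 5 * 3 * 3 * 2
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per cross-validation fold
n_folds = 5
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)
```

With 810 candidates and 5 folds, the search performs 4050 pipeline fits, which is worth keeping in mind before widening the grid further.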


In [44]:
# Import the required libraries
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Set the model hyperparameters.
# Note: max_depth and min_samples_split differ from grid.best_params_ above,
# and class_weight is left at its default instead of 'balanced'.
model_dtc = DecisionTreeClassifier(
    criterion = 'entropy',
    max_depth = 30,
    min_samples_leaf = 1,
    min_samples_split = 5,
    splitter = 'best'
)

# Define a new pipeline
new_pipe = Pipeline([
    ('scaler', RobustScaler()),
    ('smote', SMOTE()),
    ('model', model_dtc)
])

# Evaluate with stratified cross-validation
cv = StratifiedKFold(
    n_splits = 5,
    shuffle = True,
    random_state = 42
)

scores = cross_val_score(new_pipe, X, y, cv=cv, scoring = 'accuracy')

print(f'Scores : {scores}')
print(f'Mean   : {scores.mean()}')
print(f'Std    : {scores.std()}')

Scores : [0.39622642 0.30805687 0.35545024 0.31753555 0.32701422]
Mean   : 0.3408566574264509
Std    : 0.03190641303014306


In [45]:
model_dtc.fit(X_train_resample, y_train_resample)

# Predict test set labels
y_pred = model_dtc.predict(X_test)

In [46]:
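from sklearn import metrics

# Compute the classification report.
# Note: the scikit-learn signature is classification_report(y_true, y_pred).
# With the arguments in this order, the precision and recall columns are swapped
# and support counts the predicted labels rather than the true labels.
report = metrics.classification_report(y_pred, y_test)

# Display the result
print(report)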

              precision    recall  f1-score   support

    Declined       0.30      0.32      0.31        59
    Improved       0.30      0.29      0.29        66
        Same       0.38      0.37      0.37        87

    accuracy                           0.33       212
   macro avg       0.32      0.33      0.33       212
weighted avg       0.33      0.33      0.33       212
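Because scikit-learn metric functions expect the true labels first and the predictions second, swapping the two arguments exchanges precision and recall. The sketch below shows this on a few invented labels; the data is purely illustrative and unrelated to the model's actual predictions.

```python
from sklearn.metrics import precision_score, recall_score

# Invented labels: the classifier predicts 'Same' for almost everything
y_true = ['Same', 'Same', 'Same', 'Improved', 'Declined', 'Declined']
y_pred = ['Same', 'Same', 'Same', 'Same', 'Same', 'Declined']

# Correct order: (y_true, y_pred), restricted to the 'Same' class
correct_precision = precision_score(y_true, y_pred, labels=['Same'], average=None)[0]
correct_recall = recall_score(y_true, y_pred, labels=['Same'], average=None)[0]

# Swapped order: the value reported as precision is actually the recall
swapped_precision = precision_score(y_pred, y_true, labels=['Same'], average=None)[0]

print(correct_precision, correct_recall, swapped_precision)
```

Here 3 of the 5 'Same' predictions are correct (precision 0.6) while all 3 true 'Same' rows are found (recall 1.0); with the arguments swapped, the reported precision becomes 1.0.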



In [47]:
import plotly.express as px
from sklearn.metrics import confusion_matrix

# Plot with Plotly
fig = px.imshow(
    confusion_matrix(y_test, y_pred),
    text_auto = True,
    color_continuous_scale = 'Blues',
    title = "<b>Confusion Matrix of the Decision Tree Model</b>",
)

# Customize the x and y tick labels
fig.update_xaxes(
    tickmode = "array",
    tickvals = [0, 1, 2],
    ticktext = ['Declined', 'Improved', 'Same']
)

fig.update_yaxes(
    tickmode = "array",
    tickvals = [0, 1, 2],
    ticktext = ['Declined', 'Improved', 'Same'],
    tickangle = -90
)

# Hide the colorbar
fig.update_layout(coloraxis_showscale = False)

# Title position, axis labels, and figure size
fig.update_layout(
    title = dict(
        x = 0.5,
        y = 0.9,
        xanchor = 'center',
        yanchor = 'top'
    ),
    xaxis_title = "<b>Predicted</b>",
    yaxis_title = "<b>Actual</b>",
    width = 700,
    height = 700
)

# Display the figure
fig.show()

**Learning**

Based on the confusion matrix on the test data, the model correctly predicts the "Same" category more often than the other categories. This suggests that the model recognizes unchanged academic performance more easily, although "Same" is also the most frequent class in the data.

For the "Improved" and "Declined" categories, the model still makes frequent prediction errors. Some students whose performance actually improved are predicted as declined, and vice versa. This shows that the model cannot yet reliably distinguish the direction of change in academic performance.

Overall, the model captures only the general picture of academic performance: its test accuracy and macro-average F1 score are both about 0.33, which is close to what random guessing among three classes would achieve. The model needs further development before it can predict changes in academic performance accurately.
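A simple reference point for judging these results is a baseline that always predicts the most frequent class. The sketch below is an illustration under stated assumptions: it rebuilds synthetic labels from the class counts reported earlier in this notebook (424, 320, 312) and scores scikit-learn's `DummyClassifier` on that full distribution rather than on the actual test split.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels with the class counts of the cleaned target
labels = np.array(['Same'] * 424 + ['Improved'] * 320 + ['Declined'] * 312)

# Placeholder features; the dummy classifier ignores them
features = np.zeros((len(labels), 1))

# Always predict the most frequent class ('Same')
baseline = DummyClassifier(strategy='most_frequent').fit(features, labels)
baseline_accuracy = baseline.score(features, labels)
print(round(baseline_accuracy, 3))
```

This baseline reaches an accuracy of about 0.40 (424 / 1056), so a useful model should clearly exceed that value on held-out data.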