> # **IMPORT LIBRARIES**
Imports the libraries the program needs.

- *`pandas`* = reads and processes tabular data (similar to an Excel sheet)
- *`numpy`* = numerical computation
- *`train_test_split`* = splits the data into a training set and a test set
- *`LabelEncoder`* = converts text labels into integers
- *`RandomForestClassifier`* = the classification algorithm used
- *`accuracy_score, classification_report`* = evaluate the model's predictions
___

In [3]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

> # **LOAD DATASET**

Reads the full dataset from the specified CSV file, then displays it.
___

In [4]:
df = pd.read_csv("Dataset/student_performance_updated_1000.csv")

df

Unnamed: 0,StudentID,Name,Gender,AttendanceRate,StudyHoursPerWeek,PreviousGrade,ExtracurricularActivities,ParentalSupport,FinalGrade,Study Hours,Attendance (%),Online Classes Taken
0,1.0,John,Male,85.0,15.0,78.0,1.0,High,80.0,4.8,59.0,False
1,2.0,Sarah,Female,90.0,20.0,85.0,2.0,Medium,87.0,2.2,70.0,True
2,3.0,Alex,Male,78.0,10.0,65.0,0.0,Low,68.0,4.6,92.0,False
3,4.0,Michael,Male,92.0,25.0,90.0,3.0,High,92.0,2.9,96.0,False
4,5.0,Emma,Female,,18.0,82.0,2.0,Medium,85.0,4.1,97.0,True
...,...,...,...,...,...,...,...,...,...,...,...,...
995,,Kenneth Murray,Male,85.0,20.0,,1.0,High,72.0,0.8,80.0,True
996,4497.0,Amy Stout,Female,91.0,,86.0,0.0,High,90.0,3.9,80.0,True
997,1886.0,,Male,85.0,8.0,82.0,2.0,Low,68.0,0.4,54.0,False
998,7636.0,Joseph Sherman,Male,88.0,17.0,60.0,2.0,High,85.0,0.9,53.0,True
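
The preview above contains blank cells (for example, `AttendanceRate` for Emma and `Name` in row 997), so it can help to count missing values per column before cleaning. The sketch below is only an illustration: `sample` is a hypothetical mini-frame with a few of the columns shown above, not the real CSV. In the notebook the same calls would be made on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame mirroring a few columns from the preview;
# in the notebook, the same methods would be called on `df`.
sample = pd.DataFrame({
    "Name": ["John", "Sarah", "Emma", None],
    "AttendanceRate": [85.0, 90.0, np.nan, 85.0],
    "StudyHoursPerWeek": [15.0, 20.0, 18.0, 8.0],
    "FinalGrade": [80.0, 87.0, 85.0, 68.0],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Overall shape as (rows, columns)
print(sample.shape)
```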


> # **DATA CLEANING**

Drops rows that have missing values (NaN) in any of the required columns, so that gaps do not mislead the model or produce wrong labels, then copies the DataFrame so that later steps can modify it safely.
___

In [5]:
required_features = [
    "FinalGrade",
    "Attendance (%)",
    "StudyHoursPerWeek",
    "PreviousGrade"
]

df = df.dropna(subset=required_features).copy()
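
Because `dropna` is called with `subset=`, only gaps in the listed columns remove a row; gaps elsewhere (such as a missing `Name`) are kept. A minimal sketch on hypothetical toy data, assuming just two required columns for brevity:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with gaps in different columns
toy = pd.DataFrame({
    "Name": ["A", None, "C", "D"],
    "FinalGrade": [80.0, 87.0, np.nan, 68.0],
    "PreviousGrade": [78.0, 85.0, 65.0, np.nan],
})

# Only gaps in the listed columns cause a row to be dropped
cleaned = toy.dropna(subset=["FinalGrade", "PreviousGrade"]).copy()

print(cleaned)
print("rows removed:", len(toy) - len(cleaned))
```

The row with a missing `Name` survives, while the two rows with a missing grade are removed. The original index values are preserved, which is why the cleaned notebook output can skip row numbers.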

> # **FEATURE SELECTION**

Selects the columns the model will receive as input.

The model learns only from:
- `FinalGrade`
- `Attendance (%)`
- `StudyHoursPerWeek`
- `PreviousGrade`
___

In [6]:
X = df[required_features]

df[required_features]

Unnamed: 0,FinalGrade,Attendance (%),StudyHoursPerWeek,PreviousGrade
0,80.0,59.0,15.0,78.0
1,87.0,70.0,20.0,85.0
2,68.0,92.0,10.0,65.0
3,92.0,96.0,25.0,90.0
4,85.0,97.0,18.0,82.0
...,...,...,...,...
992,68.0,68.0,15.0,90.0
993,87.0,79.0,25.0,60.0
994,62.0,70.0,20.0,60.0
997,68.0,54.0,8.0,82.0


> # **LABEL CREATION**
___

**Creating the Performance Score**
---
```python
df["performance_score"] = (
    0.35 * df["FinalGrade"] +
    0.25 * df["Attendance (%)"] +
    0.25 * df["StudyHoursPerWeek"] +
    0.15 * df["PreviousGrade"]
)
```
Combines several academic indicators into a single weighted performance score. The weights sum to 1.0, and the score is the basis for the labels.
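
As a worked example of the weighted sum, the sketch below plugs in the values from the first row of the dataset preview (John). The variable names are illustrative only.

```python
# Values taken from the first row of the dataset preview (John)
final_grade = 80.0
attendance_pct = 59.0
study_hours_per_week = 15.0
previous_grade = 78.0

performance_score = (
    0.35 * final_grade
    + 0.25 * attendance_pct
    + 0.25 * study_hours_per_week
    + 0.15 * previous_grade
)

# The four terms are 28.0 + 14.75 + 3.75 + 11.7
print(round(performance_score, 2))
```

The result agrees with the `performance_score` of 58.20 that the notebook reports for this row.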
___

**Setting the Category Boundaries (Quantiles)**
---
```python
low = df["performance_score"].quantile(0.33)
high = df["performance_score"].quantile(0.66)
```
Sets the score thresholds that separate the three categories:
- `Low/Medium/High`

The data is split by relative position (the 33rd and 66th percentiles), not by fixed cut-off values. This keeps the three classes roughly equal in size.
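
To illustrate how quantile thresholds balance the classes, here is a small sketch on hypothetical scores (not taken from the dataset). It computes the two cut points and the share of scores in each band.

```python
import pandas as pd

# Hypothetical scores, used only to illustrate the cut points
scores = pd.Series([50.0, 55.0, 58.0, 60.0, 63.0, 66.0, 70.0, 72.0, 76.0])

low = scores.quantile(0.33)
high = scores.quantile(0.66)

# Share of scores falling in each band
share_low = (scores <= low).mean()
share_medium = ((scores > low) & (scores <= high)).mean()
share_high = (scores > high).mean()

print(low, high)
print(share_low, share_medium, share_high)
```

With these nine values, each band receives three scores. On real data the split is only approximately even, since ties and the 0.33/0.66 cut points can shift a few rows.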
___

**Creating the Performance Label**
---
```python
def label_performance(score):
    if score <= low:
        return "Low"
    elif score <= high:
        return "Medium"
    else:
        return "High"

df["performance_label"] = df["performance_score"].apply(label_performance)
```
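Converts the numeric score into a text label, so every student is now labeled `Low`, `Medium`, or `High`. The dataset is now labeled and can be used for supervised learning. Note that the label is computed from the same four columns that serve as features, so the classifier is effectively learning to reproduce this scoring rule.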
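
As an alternative to a row-by-row `apply`, the same binning can be sketched with `pd.cut`, which is vectorized. This is an optional variant, not the notebook's method; the scores are hypothetical, and the check at the end confirms both approaches agree on them.

```python
import numpy as np
import pandas as pd

# Hypothetical scores for illustration
scores = pd.Series([50.0, 55.0, 58.0, 60.0, 63.0, 66.0, 70.0, 72.0, 76.0])
low = scores.quantile(0.33)
high = scores.quantile(0.66)

def label_performance(score):
    if score <= low:
        return "Low"
    elif score <= high:
        return "Medium"
    return "High"

# Row-by-row labeling, as in the notebook
labels_apply = scores.apply(label_performance)

# Vectorized labeling: bins are right-inclusive by default,
# i.e. (-inf, low], (low, high], (high, inf)
labels_cut = pd.cut(
    scores,
    bins=[-np.inf, low, high, np.inf],
    labels=["Low", "Medium", "High"],
)

same = (labels_apply == labels_cut.astype(str)).all()
print(same)
```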
___

In [7]:
df["performance_score"] = (
    0.35 * df["FinalGrade"] +
    0.25 * df["Attendance (%)"] +
    0.25 * df["StudyHoursPerWeek"] +
    0.15 * df["PreviousGrade"]
)

low = df["performance_score"].quantile(0.33)
high = df["performance_score"].quantile(0.66)

def label_performance(score):
    if score <= low:
        return "Low"
    elif score <= high:
        return "Medium"
    else:
        return "High"

df["performance_label"] = df["performance_score"].apply(label_performance)

> # **ENCODING LABEL**

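Converts the text labels into integers.

The labels are stored as integer codes for training. `LabelEncoder` assigns the codes in alphabetical order of the class names, so:
- High → `0`
- Low → `1`
- Medium → `2`

This only changes the representation, not the meaning. The codes are identifiers and do not imply a ranking.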

In [8]:
le = LabelEncoder()
y = le.fit_transform(df["performance_label"])
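
The mapping can be checked directly from the fitted encoder's `classes_` attribute, whose position gives each class's integer code. A minimal sketch on a hypothetical list of labels:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, in arbitrary order
labels = ["Low", "Medium", "High", "Medium", "Low"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ is sorted, so the index of each class is its integer code
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)
print(encoded.tolist())
```

This sorted order is also why the classification report later lists the classes as High, Low, Medium.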

> # **VIEW UPDATED DATASET**

Shows that two new columns have been added: `performance_score` and `performance_label`.

In [9]:
df

Unnamed: 0,StudentID,Name,Gender,AttendanceRate,StudyHoursPerWeek,PreviousGrade,ExtracurricularActivities,ParentalSupport,FinalGrade,Study Hours,Attendance (%),Online Classes Taken,performance_score,performance_label
0,1.0,John,Male,85.0,15.0,78.0,1.0,High,80.0,4.8,59.0,False,58.20,Low
1,2.0,Sarah,Female,90.0,20.0,85.0,2.0,Medium,87.0,2.2,70.0,True,65.70,Medium
2,3.0,Alex,Male,78.0,10.0,65.0,0.0,Low,68.0,4.6,92.0,False,59.05,Low
3,4.0,Michael,Male,92.0,25.0,90.0,3.0,High,92.0,2.9,96.0,False,75.95,High
4,5.0,Emma,Female,,18.0,82.0,2.0,Medium,85.0,4.1,97.0,True,70.80,High
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
992,,James Garcia,Male,91.0,15.0,90.0,2.0,High,68.0,4.3,68.0,False,58.05,Low
993,3592.0,Monica Johnson,Female,90.0,25.0,60.0,1.0,Low,87.0,1.7,79.0,False,65.45,Medium
994,2787.0,Shannon Porter,Male,78.0,20.0,60.0,0.0,High,62.0,1.6,70.0,False,53.20,Low
997,1886.0,,Male,85.0,8.0,82.0,2.0,Low,68.0,0.4,54.0,False,51.60,Low


> # **TRAIN–TEST SPLIT**

Splits the data into:
- Training data `(80%)`
- Test data `(20%)`

The model learns from one part of the data and is then tested on data it has never seen. This checks that the model generalizes rather than memorizing the training examples. Passing `stratify=y` keeps the class proportions the same in both parts.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
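
To see what `stratify=y` does, the sketch below splits a small hypothetical dataset with three balanced classes and counts the classes on each side of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 30 samples, 3 balanced classes of 10 each
X = np.arange(60).reshape(30, 2)
y = np.repeat([0, 1, 2], 10)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y,
)

# Class counts on each side of the split
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(train_counts, test_counts)
```

Each class contributes the same share to the training and test sets, so the evaluation is not skewed by an unlucky split.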

> # **TRAIN MODEL**

Trains the model on the training data.

This is where the model:
- Sees many example rows
- Learns the patterns linking the academic data to the labels
- Stores those patterns inside the model

`THIS IS THE CORE OF THE AI'S INTELLIGENCE`

In [11]:
model = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    random_state=42
)

model.fit(X_train, y_train)

Parameter,Value
n_estimators,200
criterion,'gini'
max_depth,10
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,'sqrt'
max_leaf_nodes,None
min_impurity_decrease,0.0
bootstrap,True
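
One way to inspect what a trained forest has picked up is its `feature_importances_` attribute. The sketch below is an illustration under stated assumptions: it trains on synthetic, uniformly random data (not the real dataset), builds labels with the same weighted score and quantile rule as the notebook, and uses fewer trees to stay quick.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real features (hypothetical values)
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "FinalGrade": rng.uniform(50, 100, n),
    "Attendance (%)": rng.uniform(50, 100, n),
    "StudyHoursPerWeek": rng.uniform(5, 30, n),
    "PreviousGrade": rng.uniform(50, 100, n),
})

# Same weighted score and quantile thresholds as in the notebook
score = (
    0.35 * X["FinalGrade"]
    + 0.25 * X["Attendance (%)"]
    + 0.25 * X["StudyHoursPerWeek"]
    + 0.15 * X["PreviousGrade"]
)
low, high = score.quantile(0.33), score.quantile(0.66)
y = pd.cut(
    score,
    bins=[-np.inf, low, high, np.inf],
    labels=["Low", "Medium", "High"],
).astype(str)

model = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=42)
model.fit(X, y)

# Impurity-based importances, one per feature, summing to 1
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```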


> # **MODEL EVALUATION**

Measures how well the model performs.

The model's predictions are compared with the true labels to assess:
- Accuracy
- Prediction quality for each class

In [12]:
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=le.classes_))

Accuracy: 0.9176470588235294

Classification Report:
              precision    recall  f1-score   support

        High       0.93      0.91      0.92        58
         Low       0.96      0.95      0.95        56
      Medium       0.86      0.89      0.88        56

    accuracy                           0.92       170
   macro avg       0.92      0.92      0.92       170
weighted avg       0.92      0.92      0.92       170
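
The report shows that `Medium` has the lowest precision and recall. A confusion matrix can show which classes are being mixed up. The sketch below uses short hypothetical label lists rather than the notebook's `y_test` and `y_pred`, to keep it self-contained.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels
y_true = ["High", "High", "Low", "Low", "Medium", "Medium", "Medium"]
y_hat = ["High", "Medium", "Low", "Low", "Medium", "High", "Medium"]

# Rows are true classes, columns are predicted classes, in this order
labels = ["High", "Low", "Medium"]
cm = confusion_matrix(y_true, y_hat, labels=labels)
print(cm)
```

In this toy example the only errors are between `High` and `Medium`, the two classes that share a boundary.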



> # **PREDICTING NEW DATA**

Lets the user enter new data and get a prediction. The model predicts a new student's performance from the patterns it has learned.

In [13]:
def predict_from_user_input():
    # Read the four feature values from the keyboard
    final_grade = float(input("Final Grade (0–100): "))
    attendance = float(input("Attendance (%) (0–100): "))
    study_hours = float(input("Study Hours Per Week: "))
    previous_grade = float(input("Previous Grade (0–100): "))

    # Build a one-row DataFrame with the same column names used in training
    input_df = pd.DataFrame([{
        "FinalGrade": final_grade,
        "Attendance (%)": attendance,
        "StudyHoursPerWeek": study_hours,
        "PreviousGrade": previous_grade
    }])

    # Predict the encoded class, then map it back to its text label
    prediction = model.predict(input_df)[0]
    label = le.inverse_transform([prediction])[0]

    print("Predict Academic Performance:", label)

In [14]:
predict_from_user_input()

Predict Academic Performance: High
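
For scripted use or testing, a non-interactive variant may be more convenient than `input()`. The sketch below is one possible shape, not the notebook's implementation: `predict_performance` and `FEATURES` are hypothetical names, the range checks are an added assumption, and the model is trained on synthetic data so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

FEATURES = ["FinalGrade", "Attendance (%)", "StudyHoursPerWeek", "PreviousGrade"]

# Synthetic training data (hypothetical), labeled with the notebook's rule
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "FinalGrade": rng.uniform(50, 100, 300),
    "Attendance (%)": rng.uniform(50, 100, 300),
    "StudyHoursPerWeek": rng.uniform(5, 30, 300),
    "PreviousGrade": rng.uniform(50, 100, 300),
})
score = (
    0.35 * train["FinalGrade"] + 0.25 * train["Attendance (%)"]
    + 0.25 * train["StudyHoursPerWeek"] + 0.15 * train["PreviousGrade"]
)
text_labels = pd.cut(
    score,
    bins=[-np.inf, score.quantile(0.33), score.quantile(0.66), np.inf],
    labels=["Low", "Medium", "High"],
).astype(str)

le = LabelEncoder()
y = le.fit_transform(text_labels)
model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(train[FEATURES], y)

def predict_performance(final_grade, attendance, study_hours, previous_grade):
    """Hypothetical helper: validate inputs, return (label, class probabilities)."""
    for name, value in [("final_grade", final_grade),
                        ("attendance", attendance),
                        ("previous_grade", previous_grade)]:
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100, got {value}")
    if study_hours < 0:
        raise ValueError(f"study_hours must be non-negative, got {study_hours}")

    # One-row frame with the same column names and order used in training
    row = pd.DataFrame([{
        "FinalGrade": final_grade,
        "Attendance (%)": attendance,
        "StudyHoursPerWeek": study_hours,
        "PreviousGrade": previous_grade,
    }])[FEATURES]

    code = model.predict(row)[0]
    label = le.inverse_transform([code])[0]
    # predict_proba columns follow the encoded class order, i.e. le.classes_
    proba = dict(zip(le.classes_, model.predict_proba(row)[0]))
    return label, proba

label, proba = predict_performance(92.0, 96.0, 25.0, 90.0)
print(label, proba)
```

Returning the class probabilities alongside the label gives a rough sense of how confident the forest is in a given prediction.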
