# Tokenization Stage with the IndoBERT Tokenizer

This stage converts raw text into the inputs the model reads (token IDs, attention mask, and so on).

**What this stage does:**
- Load the tokenizer of the IndoBERT base-p2 model
- Tokenize the 'Soal' (question) and 'Jawaban' (answer) columns of the train, validation, and test sets
- Use max_length=256 (BERT-base models accept at most 512 tokens)
- Keep the tokenized datasets ready for training

**IMPORT LIBRARIES**

In [1]:
import os
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import torch

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    get_linear_schedule_with_warmup
)
from datasets import load_dataset, DatasetDict
from accelerate import Accelerator

from sklearn.metrics import mean_squared_error

from torch.optim import AdamW

print("All libraries imported successfully")


In [None]:
# Load the split files
from datasets import Dataset, DatasetDict
train_df = pd.read_csv("../dataset/aes_2/train.csv")
valid_df = pd.read_csv("../dataset/aes_2/valid.csv")
test_df  = pd.read_csv("../dataset/aes_2/test.csv")

# Convert to a Hugging Face DatasetDict
final_dataset = DatasetDict({
    "train": Dataset.from_pandas(train_df),
    "validation": Dataset.from_pandas(valid_df),
    "test": Dataset.from_pandas(test_df)
})

print("Dataset loaded:")
print(final_dataset)

**SET SEED AND CONSTANTS**

In [None]:
# Seed every random number generator used in this notebook
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # in case a GPU is used later

# Call once at the start
set_seed(42)

# Main constants
MODEL_NAME = "indobenchmark/indobert-base-p2"
PROCESSED_FILE = "../data/processed/aes_preprocessed.csv"
OUTPUT_DIR = "./results_aes"

MAX_LENGTH = 256
TRAIN_BATCH_SIZE = 2
EVAL_BATCH_SIZE = 2

# Dynamic number of epochs
MAX_EPOCHS = 5          # upper bound
EARLY_STOPPING_PATIENCE = 2  # stop after 2 epochs without improvement

LEARNING_RATE = 2e-5
WARMUP_STEPS = 200

print("Seed set and constants defined")
print(f"Model in use: {MODEL_NAME}")
print(f"Data file: {PROCESSED_FILE}")
print(f"Maximum epochs: {MAX_EPOCHS} (with early stopping)")
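LEARNING_RATE and WARMUP_STEPS interact through the learning-rate schedule. With a linear schedule (the Trainer default, as far as the library documents it), the rate climbs from 0 to the base value during warmup and then decays linearly to 0. The sketch below reproduces that shape in plain Python. The helper name `linear_warmup_lr` and the total of 1,000 optimizer steps are assumptions for illustration; the real total depends on dataset size, batch size, gradient accumulation, and the number of epochs.

```python
def linear_warmup_lr(step, base_lr=2e-5, warmup_steps=200, total_steps=1000):
    """Learning rate at an optimizer step under linear warmup + linear decay.

    total_steps=1000 is a hypothetical value, not taken from this notebook.
    """
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale linearly from base_lr down to 0
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 100, 200, 600, 1000):
    print(step, linear_warmup_lr(step))
```

On a small dataset the whole run may have fewer optimizer steps than the warmup itself, in which case the rate never reaches its base value, so it is worth comparing WARMUP_STEPS against the actual step count.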

# Tokenization Stage with the IndoBERT Tokenizer

With the data already split, the text is converted into the token format the model understands.

**What this stage does:**
- Load the tokenizer of IndoBERT base-p2
- Tokenize the 'Soal' (question) and 'Jawaban' (answer) columns of the train and validation sets (used for training)
- Tokenize the test set separately (used for the final evaluation)
- Use max_length=256, matching MAX_LENGTH (BERT-base models accept at most 512 tokens)
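To make the output format concrete, the layout a BERT-style tokenizer produces for a question/answer pair can be sketched with a toy encoder. This is not the IndoBERT tokenizer: it splits on whitespace instead of using WordPiece, and the vocabulary, its IDs, and the function name `encode_pair` are made up for illustration. It only shows how `input_ids`, `token_type_ids`, and `attention_mask` relate under truncation and padding to a fixed length.

```python
def encode_pair(question, answer, vocab, max_length=12):
    """Toy BERT-style pair encoding: [CLS] question [SEP] answer [SEP] + padding."""
    unk = vocab["[UNK]"]
    q_ids = [vocab.get(w, unk) for w in question.lower().split()]
    a_ids = [vocab.get(w, unk) for w in answer.lower().split()]

    # Truncation: reserve 3 slots for special tokens, then trim the longer
    # segment one token at a time (a longest-first style strategy).
    budget = max_length - 3
    while len(q_ids) + len(a_ids) > budget:
        longer = q_ids if len(q_ids) > len(a_ids) else a_ids
        longer.pop()

    input_ids = [vocab["[CLS]"]] + q_ids + [vocab["[SEP]"]] + a_ids + [vocab["[SEP]"]]
    # Segment 0 covers [CLS] + question + [SEP]; segment 1 covers answer + [SEP]
    token_type_ids = [0] * (len(q_ids) + 2) + [1] * (len(a_ids) + 1)
    attention_mask = [1] * len(input_ids)

    # Padding: fill up to max_length; padded positions get mask 0
    pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [vocab["[PAD]"]] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,
    }


# Hypothetical vocabulary; real IndoBERT IDs are different
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "apa": 4, "itu": 5,
         "fotosintesis": 6, "proses": 7, "tumbuhan": 8, "membuat": 9, "makanan": 10}

enc = encode_pair("Apa itu fotosintesis", "Proses tumbuhan membuat makanan", vocab)
for name, values in enc.items():
    print(name, values)
```

Every encoded example has exactly `max_length` entries per field, which is why padding to a fixed length makes batches trivially stackable at the cost of extra computation on padding tokens.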

**LOAD TOKENIZER**

In [None]:
# Name of the model we use
MODEL_NAME = "indobenchmark/indobert-base-p2"

# Load the tokenizer
print(f"Loading tokenizer {MODEL_NAME}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

print("Tokenizer loaded")
print("Vocab size:", tokenizer.vocab_size)
print("Default max length:", tokenizer.model_max_length)

**TOKENIZATION FUNCTION & RUN**

In [None]:
# Tokenization function: encodes each (Soal, Jawaban) pair as one sequence
def tokenize_function(examples):
    return tokenizer(
        examples["Soal"],
        examples["Jawaban"],
        padding="max_length",
        truncation=True,
        max_length=MAX_LENGTH
    )

# Tokenize train & validation
print("Tokenizing train & validation...")

tokenized_train = final_dataset["train"].map(
    tokenize_function,
    batched=True,
    remove_columns=["Soal", "Jawaban"]
)

tokenized_valid = final_dataset["validation"].map(
    tokenize_function,
    batched=True,
    remove_columns=["Soal", "Jawaban"]
)

# Tokenize test
print("Tokenizing test...")
tokenized_test = final_dataset["test"].map(
    tokenize_function,
    batched=True,
    remove_columns=["Soal", "Jawaban"]
)

print("Tokenization finished!")
print("One train sample after tokenization:")
print(tokenized_train[0])

# ==============================
# EXAMPLE TABLE OF TOKENS AND IDS
# ==============================
import pandas as pd

sample_ids = tokenized_train[0]["input_ids"]
sample_tokens = tokenizer.convert_ids_to_tokens(sample_ids)

# Take the first 15 tokens so the table stays short
df_tokens = pd.DataFrame({
    "Token": sample_tokens[:15],
    "ID": sample_ids[:15]
})

print("\nExample tokenization table:")
display(df_tokens)

In [None]:
print(tokenized_train.column_names)

In [None]:
tokenized_train = tokenized_train.rename_column("label", "labels")
tokenized_valid = tokenized_valid.rename_column("label", "labels")
tokenized_test  = tokenized_test.rename_column("label", "labels")

print("Columns after renaming:")
print(tokenized_train.column_names)

**SET PYTORCH FORMAT**

In [None]:
# Return PyTorch tensors. token_type_ids is kept when the tokenizer produced it,
# because it marks which tokens belong to Soal (0) and which to Jawaban (1).
wanted_columns = ["input_ids", "token_type_ids", "attention_mask", "labels"]
torch_columns = [c for c in wanted_columns if c in tokenized_train.column_names]

tokenized_train.set_format("torch", columns=torch_columns)
tokenized_valid.set_format("torch", columns=torch_columns)
tokenized_test.set_format("torch", columns=torch_columns)

print("PyTorch format set")
print("Columns returned as tensors:", torch_columns)

# Model Loading & Training Setup Stage

With the data tokenized, the next steps are:
1. Load the IndoBERT base-p2 model for a regression task (num_labels=1)
2. Set up TrainingArguments (CPU-friendly hyperparameters + early stopping)
3. Set up the Trainer with an RMSE compute_metrics function (as requested by the lecturer)
4. Run training

**Notes:**  
- The number of epochs is dynamic via early stopping  
- The batch size is safe for a CPU with 8 GB of RAM
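The patience rule behind "dynamic epochs" can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the library's implementation. The function name `early_stopping_epoch` and the RMSE values are hypothetical, and any improvement, however small, is assumed to reset the counter.

```python
def early_stopping_epoch(valid_rmse_per_epoch, patience=2):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs fail to improve
    on the best (lowest) validation RMSE seen so far.
    """
    best = float("inf")
    best_epoch = 0
    bad_epochs = 0
    for epoch, rmse in enumerate(valid_rmse_per_epoch, start=1):
        if rmse < best:
            best, best_epoch, bad_epochs = rmse, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    # Patience never ran out: training used every available epoch
    return len(valid_rmse_per_epoch), best_epoch


rmse_history = [0.30, 0.24, 0.25, 0.26, 0.22]  # made-up validation RMSE values
stop_epoch, best_epoch = early_stopping_epoch(rmse_history, patience=2)
print("Stopped after epoch", stop_epoch, "with best epoch", best_epoch)
```

In this made-up run, training stops after epoch 4 and never reaches the better epoch 5. That is the trade-off of a small patience: it saves CPU time but can stop before a late improvement.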

**Load the IndoBERT Model**

In [None]:
import torch

# Choose the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the IndoBERT model for regression
print(f"Loading model {MODEL_NAME}...")
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=1,                   # regression → a single output value
    problem_type="regression"       # selects MSE loss
)

model.to(device)

print("Model loaded")
print("Device:", device)
print("Number of parameters:", model.num_parameters())

# =================================
# EXAMPLE PREDICTION BEFORE TRAINING
# =================================
# Take one sample from the training data
sample = tokenized_train[0]

# Prepare the basic inputs
inputs = {
    "input_ids": sample["input_ids"].unsqueeze(0).to(device),
    "attention_mask": sample["attention_mask"].unsqueeze(0).to(device),
}

# Add token_type_ids only if available
if "token_type_ids" in sample:
    inputs["token_type_ids"] = sample["token_type_ids"].unsqueeze(0).to(device)

# Predict before training
model.eval()
with torch.no_grad():
    output = model(**inputs)

prediksi = output.logits.item()
label_asli = sample["labels"].item()

print("\nExample prediction from the untrained model:")
print("Prediction (0–1) :", prediksi)
print("True label (0–1) :", label_asli)

# Rescaled to the original 1–10 score range
print("\nOn the 1–10 scale:")
print("Prediction :", prediksi * 10)
print("Label      :", label_asli * 10)

**TrainingArguments**

In [None]:
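training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=MAX_EPOCHS,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=EVAL_BATCH_SIZE,

    gradient_accumulation_steps=4,   # effective train batch size = 2 x 4 = 8

    learning_rate=LEARNING_RATE,
    weight_decay=0.01,
    warmup_steps=WARMUP_STEPS,

    # Evaluate and checkpoint every epoch so early stopping can track eval_rmse.
    # Note: older transformers releases name this argument evaluation_strategy.
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_rmse",
    greater_is_better=False,         # lower RMSE is better
    save_total_limit=2,              # limit disk usage from checkpoints

    dataloader_num_workers=0,
    logging_steps=50,
    report_to="none",
)

**DEFINE COMPUTE METRICS**

In [None]:
# Compute metrics: RMSE only (as requested by the lecturer)
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Regression predictions have shape (N, 1); flatten them to match labels
    rmse = np.sqrt(mean_squared_error(labels, predictions.reshape(-1)))
    return {"eval_rmse": float(rmse)}

In [None]:
from transformers import EarlyStoppingCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_valid,
    compute_metrics=compute_metrics,        # RMSE only
    # Stop when eval_rmse has not improved for EARLY_STOPPING_PATIENCE evaluations
    callbacks=[EarlyStoppingCallback(early_stopping_patience=EARLY_STOPPING_PATIENCE)],
)

print("Trainer ready!")
print("Starting training... (this can take several hours on CPU)")

# Run training
trainer.train()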

In [None]:
# Evaluate on the test set
print("Final evaluation on the test set:")
test_results = trainer.evaluate(tokenized_test)
print(test_results)

# Show the final RMSE
print(f"\nRMSE on the test set: {test_results['eval_rmse']:.4f}")
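The reported RMSE is on the normalized 0–1 label scale. Because RMSE scales linearly with the targets, it can be read on the 1–10 score scale by multiplying by 10. This assumes the labels are scores divided by 10, which is what the notebook's own `* 10` conversion implies. The sketch below checks this with made-up numbers; the helper `rmse` and all values are for illustration only.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error; flattens inputs so (N, 1) predictions match (N,) labels."""
    y_true = np.asarray(y_true, dtype=float).reshape(-1)
    y_pred = np.asarray(y_pred, dtype=float).reshape(-1)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


labels = np.array([0.8, 0.5, 1.0, 0.3])          # made-up normalized scores
preds = np.array([[0.7], [0.6], [0.9], [0.5]])   # shape (N, 1), like regression logits

rmse_normalized = rmse(labels, preds)
rmse_original = rmse(labels * 10, preds * 10)

print(f"RMSE on the 0-1 scale : {rmse_normalized:.4f}")
print(f"RMSE on the 1-10 scale: {rmse_original:.4f}")
```

Reporting the error on the original scale is often easier to interpret: a normalized RMSE of about 0.13 corresponds to predictions that are off by roughly 1.3 points out of 10.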

In [None]:
import math
from collections import defaultdict
import matplotlib.pyplot as plt

history = trainer.state.log_history

# Training loss is logged every `logging_steps` steps, while eval_rmse is logged
# once per evaluation, so group both by epoch before comparing them.
train_loss_by_epoch = defaultdict(list)
valid_rmse_by_epoch = {}

for log in history:
    if "epoch" not in log:
        continue
    epoch = math.ceil(log["epoch"])
    if "loss" in log:
        train_loss_by_epoch[epoch].append(log["loss"])
    if "eval_rmse" in log:
        # Keep the first value per epoch: a later trainer.evaluate() on the
        # test set also logs eval_rmse under the last epoch.
        valid_rmse_by_epoch.setdefault(epoch, log["eval_rmse"])

epochs_range = sorted(set(train_loss_by_epoch) & set(valid_rmse_by_epoch))

# The training loss is MSE, so the square root of its epoch mean approximates train RMSE
history_train_rmse = [float(np.mean(train_loss_by_epoch[e])) ** 0.5 for e in epochs_range]
history_valid_rmse = [valid_rmse_by_epoch[e] for e in epochs_range]

print("Train RMSE:", history_train_rmse)
print("Valid RMSE:", history_valid_rmse)

# Check for an empty history
if not epochs_range:
    print("Training history is empty. Run trainer.train() with per-epoch evaluation first.")
else:
    plt.figure(figsize=(10,5))
    plt.plot(epochs_range, history_train_rmse, marker="o", label="Train RMSE")
    plt.plot(epochs_range, history_valid_rmse, marker="o", label="Valid RMSE")
    plt.xlabel("Epoch")
    plt.ylabel("RMSE")
    plt.title("Train vs Validation RMSE")
    plt.legend()
    plt.grid(True)
    plt.show()