# Fine-Tuning BERT for Classification: A Demo

This demo performs binary text classification with BERT, labeling news headlines as either clickbait or non-clickbait. Fine-tuning is done with the ktrain library.



**Step 1**: Install ktrain and import the required modules

In [None]:
# install ktrain
!pip3 install ktrain



In [None]:
# import ktrain
import ktrain
from ktrain import text
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
# Check the installed ktrain version
ktrain.__version__

'0.39.0'

In [None]:
# Check whether a GPU has been allocated to the runtime
!nvidia-smi

Wed Nov 22 06:56:50 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   55C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

**Step 2**: Import the CSV files from Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
df_train = pd.read_csv("drive/MyDrive/clickbait_dataset/clickbait_train.csv")
df_test = pd.read_csv("drive/MyDrive/clickbait_dataset/clickbait_test.csv")
df_val = pd.read_csv("drive/MyDrive/clickbait_dataset/clickbait_val.csv")

In [None]:
# Show the first 5 rows of the dataset
df_train.head(5)

| | title | label | label_score |
|---|---|---|---|
| 0 | Skenario Global Positif, IHSG Bakal Bertengger... | non-clickbait | 0 |
| 1 | Acha Septriasa Ungkap Terima Kasih Pada Suami,... | clickbait | 1 |
| 2 | 35 Orang Tewas di Pesta Pernikahan Akibat Sera... | non-clickbait | 0 |
| 3 | Victon Adakan Fan Meeting, Perbedaan Jumlah Pe... | clickbait | 1 |
| 4 | Seimbangkan Keuanganmu, Simak Pesan Malaikat H... | clickbait | 1 |


**Step 3**: Load and preprocess the dataset




In [None]:
df_train.shape

(12000, 3)

In [None]:
# Count how many rows carry each label in the dataset
df_train["label"].value_counts()

non-clickbait    6968
clickbait        5032
Name: label, dtype: int64
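
The class counts above give a useful reference point for any accuracy figure reported on this data: a classifier that always predicts the majority class would already be right on roughly 58% of the training titles. A minimal sketch of that calculation, using the counts printed above (`majority_baseline` is a hypothetical helper name, not part of ktrain or pandas):

```python
# Class counts copied from the value_counts() output above.
label_counts = {"non-clickbait": 6968, "clickbait": 5032}

def majority_baseline(counts):
    """Return (majority label, accuracy of always predicting it)."""
    total = sum(counts.values())
    label = max(counts, key=counts.get)
    return label, counts[label] / total

label, baseline_acc = majority_baseline(label_counts)
print(f"Always predicting '{label}' gives {baseline_acc:.2%} accuracy")
```

Any trained model should clear this baseline by a comfortable margin before its accuracy is considered meaningful.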

In [None]:
# Take a random 50% sample of the DataFrame (without replacement)
df_sample = df_train.sample(frac=0.5, replace=False, random_state=1)

In [None]:
df_sample.shape

(6000, 3)

In [None]:
# Count how many rows carry each label in the sample
df_sample["label"].value_counts()

non-clickbait    3508
clickbait        2492
Name: label, dtype: int64

In [None]:
# Extract the titles (inputs) and labels (targets) as NumPy arrays
X_train = df_train["title"].to_numpy()
Y_train = df_train["label"].to_numpy()

X_test = df_test["title"].to_numpy()
Y_test = df_test["label"].to_numpy()

X_val = df_val["title"].to_numpy()
Y_val = df_val["label"].to_numpy()

X_train

array(['Skenario Global Positif, IHSG Bakal Bertengger Hijau',
       'Acha Septriasa Ungkap Terima Kasih Pada Suami, Singgung Kepercayaan Penuh Didik Anak Sendiri',
       '35 Orang Tewas di Pesta Pernikahan Akibat Serangan Pasukan Afghanistan',
       ...,
       'Bintangi Film Hustler, Jennifer Lopez Berpeluang Masuk Nominasi Oscar',
       'Salah Kaprah Penggunaan Istilah Makam dan Kuburan, Ini Penjelasannya',
       'Manchester United Vs Astana 1-0, Mason Greenwood Torehkan Rekor'],
      dtype=object)

In [None]:
# Load and preprocess the data (the validation split is passed as x_test/y_test)
(X_train, y_train), (X_val, y_val), preproc = text.texts_from_array(x_train = X_train,
                                                                      y_train = Y_train,
                                                                      x_test = X_val,
                                                                      y_test = Y_val,
                                                                      maxlen = 500,
                                                                      ngram_range = 1,
                                                                      preprocess_mode = "bert",
                                                                      class_names = ["clickbait", "non-clickbait"])

preprocessing train...
language: id


Is Multi-Label? False
preprocessing test...
language: id




task: text classification
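
`maxlen = 500` is generous for news headlines, and sequence length directly affects memory use and training time. One rough way to pick a tighter value is to look at how long the titles actually are. The sketch below counts whitespace-separated words, which is only an approximation: BERT splits text into WordPiece tokens, so real token counts are higher. The `suggest_maxlen` helper and its `margin` factor are illustrative assumptions, not part of ktrain.

```python
import math

def suggest_maxlen(titles, percentile=0.99, margin=2.0):
    """Suggest a sequence length from whitespace word counts.

    `margin` is a guessed allowance for WordPiece splitting and the
    special [CLS]/[SEP] tokens; tune it for your own data.
    """
    lengths = sorted(len(t.split()) for t in titles)
    # Position of the requested percentile (nearest-rank method).
    rank = max(1, math.ceil(percentile * len(lengths)))
    return math.ceil(lengths[rank - 1] * margin)

# Three titles taken from the training data shown earlier.
titles = [
    "Skenario Global Positif, IHSG Bakal Bertengger Hijau",
    "35 Orang Tewas di Pesta Pernikahan Akibat Serangan Pasukan Afghanistan",
    "Manchester United Vs Astana 1-0, Mason Greenwood Torehkan Rekor",
]
print(suggest_maxlen(titles))
```

In the notebook, the same function could be called on `df_train["title"]` to get a suggestion based on the full training set.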


**Step 4**: Load the pretrained BERT model and wrap it in a `ktrain.Learner` object

In [None]:
model = text.text_classifier('bert', train_data=(X_train, y_train), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(X_train, y_train), val_data=(X_val, y_val), batch_size=6)

Is Multi-Label? False
maxlen is 500




done.


In [None]:
X_test

[array([[   101,    125,  22206, ...,      0,      0,      0],
        [   101,  49114,  10116, ...,      0,      0,      0],
        [   101,  45500, 109403, ...,      0,      0,      0],
        ...,
        [   101,  18561,    123, ...,      0,      0,      0],
        [   101,  11471,  10390, ...,      0,      0,      0],
        [   101,  72605,  15926, ...,      0,      0,      0]]),
 array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])]
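
A BERT-preprocessed input is a pair of arrays: the token IDs, zero-padded up to `maxlen`, and the segment IDs, which are all zero for single-sentence classification. The sketch below reproduces that layout for a toy sequence. It assumes the standard BERT vocabulary, where `[CLS]` is 101, `[SEP]` is 102, and padding is 0; `encode_like_bert` is a hypothetical helper, and real WordPiece tokenization is skipped by passing IDs in directly.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed standard BERT special-token IDs

def encode_like_bert(wordpiece_ids, maxlen):
    """Wrap IDs in [CLS]/[SEP], truncate, and zero-pad to `maxlen`."""
    body = list(wordpiece_ids)[: maxlen - 2]  # leave room for [CLS] and [SEP]
    token_ids = [CLS_ID] + body + [SEP_ID]
    token_ids += [PAD_ID] * (maxlen - len(token_ids))
    segment_ids = [0] * maxlen  # single segment, so all zeros
    return token_ids, segment_ids

# 125 and 22206 are the first IDs of the first row printed above.
token_ids, segment_ids = encode_like_bert([125, 22206], maxlen=8)
print(token_ids)
print(segment_ids)
```

This is why every row in the output starts with 101 and ends in a long run of zeros: headlines are far shorter than the 500-token limit.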

**Step 5**: Train and fine-tune the model on the dataset


The model reaches a validation accuracy of **81.73%** after a single epoch.

In [None]:
learner.fit_onecycle(1e-5, 1)



begin training using onecycle policy with max lr of 1e-05...


<keras.src.callbacks.History at 0x7d840184ca00>
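
With 12,000 training titles and `batch_size=6`, one epoch is 2,000 optimizer steps. Under a one-cycle policy the learning rate ramps up to the maximum (`1e-5` here) during the first half of training and back down during the second half. The sketch below is a simplified triangular schedule; the starting rate of one tenth of the maximum is an assumption made for illustration, and ktrain's actual implementation may differ in detail.

```python
import math

def onecycle_lr(step, total_steps, max_lr, start_div=10.0):
    """Triangular one-cycle schedule: linear warm-up, then linear decay.

    `start_div` (assumed value) sets the starting rate to max_lr / start_div.
    """
    base_lr = max_lr / start_div
    half = total_steps / 2
    # Fraction of the way to the peak: 0 at both ends, 1 at the midpoint.
    frac = 1.0 - abs(step - half) / half
    return base_lr + (max_lr - base_lr) * frac

steps_per_epoch = math.ceil(12000 / 6)  # training rows / batch size
for step in (0, 500, 1000, 1500, 2000):
    print(step, f"{onecycle_lr(step, steps_per_epoch, 1e-5):.2e}")
```

The warm-up phase keeps early updates small, which tends to matter when fine-tuning a pretrained model whose weights are already close to a good solution.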

In [None]:
predictor = ktrain.get_predictor(learner.model, preproc)

In [None]:
# Try the trained model on a few new headlines
data = ['Heboh! Ada UFO mendarat di Depok, lihat selengkapnya', 'SBY Meninggal di Jakarta', '4 Member Daftar Wamil Bareng, BTS Diperkirakan Comeback 2025']

In [None]:
# Predict labels for the headlines defined above
predictor.predict(data)

['clickbait', 'non-clickbait', 'clickbait']
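
Spot checks like the one above are useful, but the held-out test split loaded in Step 2 gives a more reliable picture. In the notebook, predictions could be obtained with `predictor.predict(df_test["title"].tolist())` and compared against `df_test["label"]`. The sketch below shows only the comparison step on small made-up lists, so the numbers it prints say nothing about the real model; `accuracy` and `confusion_counts` are hypothetical helpers.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))

# Made-up labels for illustration only; not results from the model.
y_true = ["clickbait", "non-clickbait", "clickbait", "non-clickbait"]
y_pred = ["clickbait", "non-clickbait", "non-clickbait", "non-clickbait"]

print(f"accuracy: {accuracy(y_true, y_pred):.2f}")
for (true, pred), n in sorted(confusion_counts(y_true, y_pred).items()):
    print(f"true={true:14s} predicted={pred:14s} count={n}")
```

Looking at the per-pair counts, rather than accuracy alone, shows whether the model's errors lean toward missed clickbait or toward false alarms.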