# Finetuning for Classification Demo: Sentiment Classification with IMDb Movie Reviews Dataset Using BERT

In this demo, we perform binary text classification with BERT, labeling movie reviews as positive or negative. We use the ktrain library for fine-tuning.



**Step 1** : Install ktrain and import the required modules

In [None]:
# install ktrain
!pip3 install ktrain



In [None]:
# import ktrain
import ktrain
from ktrain import text
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
# Check the installed ktrain version
ktrain.__version__

'0.38.0'

In [None]:
# Check whether a GPU has been allocated to this runtime
!nvidia-smi

Mon Nov 13 06:03:09 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   44C    P8    11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
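
The same check can be done from Python, which is handy when a notebook should fail early instead of silently training on CPU. The following is a minimal sketch that only assumes the `nvidia-smi` command-line tool; the helper name `list_gpus` is hypothetical and not part of ktrain.

```python
import shutil
import subprocess


def list_gpus():
    """Return one 'name, memory' string per visible NVIDIA GPU, or [] if none."""
    # If the nvidia-smi binary is not on PATH, assume no NVIDIA GPU is usable.
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=30, check=True,
        )
    except (subprocess.SubprocessError, OSError):
        # nvidia-smi exists but failed (for example, no driver is loaded).
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
print(gpus if gpus else "No GPU detected; fine-tuning BERT on CPU will be very slow.")
```

On a Colab runtime like the one shown above, the list would be expected to contain a single entry for the Tesla T4.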

**Step 2** : Import the CSV file from Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
df = pd.read_csv("drive/MyDrive/ColabNotebooks/Review_IMDB.csv")

In [None]:
# Display the first 5 rows of the dataset
df.head(5)

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


**Step 3** : Load and Preprocess the Dataset




In [None]:
df.shape

(50000, 2)

In [None]:
# Count the number of reviews per sentiment in the dataset
df["sentiment"].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

In [None]:
# Draw a random 50% sample of the DataFrame (without replacement)
df_sample = df.sample(frac=0.5, replace=False, random_state=1)

In [None]:
df_sample.shape

(25000, 2)

In [None]:
# Count the number of reviews per sentiment in the sample
df_sample["sentiment"].value_counts()

negative    12592
positive    12408
Name: sentiment, dtype: int64
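
A plain random sample leaves the two classes slightly unbalanced (12,592 vs. 12,408). If an exact 50/50 ratio matters, one option is to sample within each class. The sketch below uses a toy DataFrame with the same column names; the data and the names `toy_df` and `toy_sample` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the IMDb DataFrame (hypothetical data, same column names).
toy_df = pd.DataFrame({
    "review": [f"review {i}" for i in range(20)],
    "sentiment": ["positive"] * 10 + ["negative"] * 10,
})

# Sample 50% within each sentiment group so the class ratio is preserved.
toy_sample = toy_df.groupby("sentiment").sample(frac=0.5, random_state=1)

print(toy_sample["sentiment"].value_counts())
```

Applied to the full dataset, the same call would draw 12,500 reviews from each class. `DataFrameGroupBy.sample` requires pandas 1.1 or later.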

In [None]:
# Create a new variable X holding the reviews
X = df_sample["review"].tolist()

In [None]:
# Display the first 5 items of X (reviews)
X[:5]

["With No Dead Heroes you get stupid lines like that as this woefully abysmal action flick needs to be seen to be believed. William Sanders is saved by his buddy Harry Cotter during an extraction in Vietnam but gets himself captured by the enemy. Fast forward ten years and Harry is now a brainwashed Russian operative with a mind control microchip implanted in his brain. His new Russian superior is Ivan played to the obscene hilt by Nick Nicholson who might I add not only doesn't attempt once to speak with a Russian accent but resembles more a gas station attendant in Kentucky with his stained teeth. What is even more absurd is the fact that he was also the dialog coach for this film. Soon William is re-recruited by the CIA to hunt Harry down. He teams up with Barbara, a freedom fighter who has infiltrated Ivan's El Salvador camp and soon the both of them are blowing up half of South America. Some scenes are so jaw droppingly awful that it's a wonder why this film doesn't have more of a

In [None]:
# Create a new variable y holding the sentiment labels
y = df_sample["sentiment"].tolist()

In [None]:
# Display the first 5 items of y (sentiment)
y[:5]

['negative', 'negative', 'negative', 'negative', 'positive']

In [None]:
# Split the data into a training set (70%) and a combined validation/test set (30%)
X_train, X_val_and_test, y_train, y_val_and_test = train_test_split(X,
                                                                    y,
                                                                    test_size = 0.3)

In [None]:
# Split the combined set in half to obtain the actual validation and test sets
X_val, X_test, y_val, y_test = train_test_split(X_val_and_test,
                                                y_val_and_test,
                                                test_size = 0.5)
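
The two calls above use no `random_state`, so the split changes on every run. A reproducible, class-balanced variant of the same 70/15/15 split could look like the sketch below. The toy data, the variable names, and the seed value are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for X and y from the cells above.
texts = [f"review {i}" for i in range(200)]
labels = ["positive", "negative"] * 100

SEED = 42  # arbitrary fixed seed (an assumption) so the split is repeatable

# 70% train, 30% held out; stratify keeps the class ratio in both parts.
texts_train, texts_rest, labels_train, labels_rest = train_test_split(
    texts, labels, test_size=0.3, stratify=labels, random_state=SEED)

# Split the held-out 30% in half: 15% validation, 15% test.
texts_val, texts_test, labels_val, labels_test = train_test_split(
    texts_rest, labels_rest, test_size=0.5, stratify=labels_rest,
    random_state=SEED)

print(len(texts_train), len(texts_val), len(texts_test))
```

With the 25,000 sampled reviews, the same proportions give 17,500 training, 3,750 validation, and 3,750 test examples.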

In [None]:
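# Load and preprocess the data
# Note: the test split is passed here and later used as val_data; X_val and y_val are left unused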
(X_train, y_train), (X_test, y_test), preproc = text.texts_from_array(x_train = X_train,
                                                                      y_train = y_train,
                                                                      x_test = X_test,
                                                                      y_test = y_test,
                                                                      maxlen = 500,
                                                                      ngram_range = 1,
                                                                      preprocess_mode = "bert",
                                                                      class_names = ["positive", "negative"])

downloading pretrained BERT model (uncased_L-12_H-768_A-12.zip)...
[██████████████████████████████████████████████████]
extracting pretrained BERT model...
done.

cleanup downloaded zip...
done.

preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en




task: text classification
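
To give a rough idea of what `maxlen = 500` controls: BERT expects fixed-length inputs that start with `[CLS]`, end with `[SEP]`, and are padded with `[PAD]`, with at most 512 positions for the pretrained base model. The sketch below illustrates that general input format only. It is not ktrain's implementation, whitespace splitting stands in for WordPiece tokenization, and `format_bert_input` is a hypothetical helper.

```python
def format_bert_input(tokens, maxlen):
    """Wrap tokens in [CLS]/[SEP], truncate to maxlen, and pad with [PAD]."""
    # Reserve two positions for the special [CLS] and [SEP] tokens.
    body = tokens[: maxlen - 2]
    sequence = ["[CLS]"] + body + ["[SEP]"]
    # The attention mask marks real tokens with 1 and padding with 0.
    n_pad = maxlen - len(sequence)
    mask = [1] * len(sequence) + [0] * n_pad
    return sequence + ["[PAD]"] * n_pad, mask


# Whitespace splitting is a stand-in for WordPiece tokenization.
tokens = "the film really sucked".split()
sequence, mask = format_bert_input(tokens, maxlen=8)
print(sequence)
print(mask)
```

Reviews longer than the limit lose their tail, so a larger `maxlen` keeps more of each review at the cost of memory and training time.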


**Step 4** : Load the pretrained BERT model and wrap it in a `ktrain.Learner` object

In [None]:
model = text.text_classifier('bert', train_data = (X_train, y_train), preproc=preproc)
learner = ktrain.get_learner(model,train_data = (X_train, y_train), val_data=(X_test, y_test), batch_size=6)

Is Multi-Label? False
maxlen is 500




done.


**Step 5** : Train and Fine-Tune the Model on the IMDb Dataset


The model reaches **93.31% validation accuracy** after a single (1) epoch.

In [None]:
learner.fit_onecycle(1e-5, 1)



begin training using onecycle policy with max lr of 1e-05...


<keras.src.callbacks.History at 0x7e2cf263c5b0>
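
For intuition about the "onecycle policy" mentioned in the log, the sketch below shows a simplified triangular schedule: the learning rate rises linearly to the maximum (`1e-5` here) by the midpoint of training and then falls back. This is a minimal illustration, not ktrain's exact schedule. The function `one_cycle_lr` and the `div_factor` of 10 are assumptions.

```python
import math


def one_cycle_lr(step, total_steps, max_lr, div_factor=10.0):
    """Triangular schedule: warm up to max_lr at the midpoint, then decay."""
    min_lr = max_lr / div_factor
    half = total_steps / 2
    if step <= half:
        fraction = step / half                  # 0 -> 1 during warm-up
    else:
        fraction = (total_steps - step) / half  # 1 -> 0 during decay
    return min_lr + fraction * (max_lr - min_lr)


# 70% of the 25,000 sampled reviews are used for training, in batches of 6.
train_size = 25000 * 7 // 10
total_steps = math.ceil(train_size / 6)

for step in (0, total_steps // 2, total_steps):
    print(step, one_cycle_lr(step, total_steps, max_lr=1e-5))
```

With a batch size of 6, one epoch over the 17,500 training reviews takes 2,917 optimizer steps, so the peak rate is reached after roughly 1,458 steps.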

In [None]:
predictor = ktrain.get_predictor(learner.model, preproc)

In [None]:
# Try out the fine-tuned model on a few hand-written examples
data = ['This movie was horrible! The plot was boring. Acting was okay, though.',
        'The film really sucked. I want my money back.',
        'The plot had too many holes.',
        'What a beautiful romantic comedy. 10/10 would see again!',
        'I dont know what to say, I really love this movie!'
        ]

In [None]:
# Predict the sentiment of the examples defined above
predictor.predict(data)

['negative', 'negative', 'negative', 'positive', 'positive']
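
Five hand-written sentences are only a spot check. To score predictions against known labels, a small helper such as the one below may be enough. It is a sketch in plain Python: the function names are hypothetical, and the reference labels for the five sentences were assigned by hand for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))


# Predictions copied from the output above.
predicted = ["negative", "negative", "negative", "positive", "positive"]
# Hand-assigned reference labels for the five sentences (an assumption).
expected = ["negative", "negative", "negative", "positive", "positive"]

print(accuracy(expected, predicted))
print(confusion_counts(expected, predicted))
```

The same helpers could be applied to a held-out split, for instance by comparing `y_val` with the output of `predictor.predict(X_val)`, since that split is not used during training in this notebook.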