<a href="https://colab.research.google.com/github/nadyadtm/BERT-Implementation-in-News-Categorization/blob/main/News_Categorization_using_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# News Categorization using BERT base uncased

This notebook implements news categorization with BERT base uncased.

## Import Package
Before starting, install the transformers package and import the packages the implementation needs.


In [None]:
# mount google drive
from google.colab import drive
drive.mount('/content/drive')

# install the transformers package, then import it
!pip install transformers==3
import transformers
from transformers import BertModel, BertTokenizer, DistilBertModel, DistilBertTokenizer, AdamW, get_linear_schedule_with_warmup

#import package pytorch
import torch
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader

#import package sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

#import numpy, pandas
import numpy as np
import pandas as pd

import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc
from collections import defaultdict
from textwrap import wrap

#import os
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

Mounted at /content/drive

<torch._C.Generator at 0x7f65f9672480>

## Set GPU
Before training, select the device PyTorch will run on: the GPU if one is available, otherwise the CPU.

In [None]:
# check whether a GPU is available on this machine
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))
# otherwise fall back to the CPU
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
Device name: Tesla T4


## Load and Split Train and Test
The following step loads the training data

In [None]:
# load the training data
df_train = pd.read_csv('drive/My Drive/SMT 2/NLP/Tugas 2/Data_Train.csv',encoding='cp1252')
df_train.tail()

Unnamed: 0,STORY,SECTION
7623,"Karnataka has been a Congress bastion, but it ...",0
7624,"The film, which also features Janhvi Kapoor, w...",2
7625,The database has been created after bringing t...,1
7626,"The state, which has had an uneasy relationshi...",0
7627,"Virus stars Kunchacko Boban, Tovino Thomas, In...",2


Next, split the data into train, validation, and test sets using sklearn. The split is 90% train, 5% validation, and 5% test.

In [None]:
X = df_train

# split into 90% train and 10% held-out data
X_train, X_val =\
    train_test_split(X, test_size=0.1, random_state=2020)

# split the held-out data in half (5% validation and 5% test)
X_val, X_test =\
    train_test_split(X_val, test_size=0.5, random_state=2020)
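
The partition sizes produced by this two-stage split can be checked by hand. The sketch below assumes that `train_test_split` rounds the held-out part up and gives the remainder to the other part; `split_sizes` is a hypothetical helper written for illustration, not a scikit-learn function. The row count of 7628 is inferred from the last index (7627) shown by `df_train.tail()`.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_first, n_second) for a fractional split.

    Assumption: the held-out part is rounded up and the remainder
    goes to the first part.
    """
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second

n_total = 7628                                  # inferred from the last row index 7627
n_train, n_holdout = split_sizes(n_total, 0.1)  # 90% train, 10% held out
n_val, n_test = split_sizes(n_holdout, 0.5)     # held-out part cut in half

print(n_train, n_val, n_test)
```

Because 763 held-out rows cannot be halved evenly, the validation and test sets differ by one row.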

In [None]:
print("Number of train rows : ", X_train.shape[0])
print("Number of validation rows : ", X_val.shape[0])
print("Number of test rows : ", X_test.shape[0])

Number of train rows :  6865
Number of validation rows :  381
Number of test rows :  382


## Preprocessing
Preprocessing for this task consists of lowercasing and tokenization, both handled by the BERT tokenizer.

In [None]:
# name of the pretrained model
PRE_TRAINED_MODEL_NAME = 'bert-base-uncased'

# initialize the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME, do_lower_case=True)


In [None]:
# dataset class that encodes each news story with the tokenizer
class NewsDataset(Dataset):

  def __init__(self, stories, targets, tokenizer, max_len):
    self.stories = stories
    self.targets = targets
    self.tokenizer = tokenizer
    self.max_len = max_len
  
  def __len__(self):
    return len(self.stories)
  
  def __getitem__(self, item):
    story = str(self.stories[item])
    target = self.targets[item]

    encoding = self.tokenizer.encode_plus(
      story,                      # text to tokenize
      add_special_tokens=True,    # add the special tokens [CLS] and [SEP]
      max_length=self.max_len,    # maximum sequence length in tokens
      return_token_type_ids=False,
      pad_to_max_length=True,     # pad every sequence up to max_length
      truncation=True,            # truncate text longer than max_length
      return_attention_mask=True, # return the attention mask
      return_tensors='pt'         # return PyTorch tensors
    )

    # return the text, input_ids, attention_mask, and class label
    return {
      'story_text': story,
      'input_ids': encoding['input_ids'].flatten(),
      'attention_mask': encoding['attention_mask'].flatten(),
      'targets': torch.tensor(target, dtype=torch.long)
    }
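
To make the effect of `max_length`, padding, and truncation concrete, here is a toy sketch in plain Python. It is not the tokenizer's real implementation: `pad_or_truncate` is a hypothetical helper, and the token ids passed to it are arbitrary. The special-token ids are those of the bert-base-uncased vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # [CLS], [SEP], [PAD] in bert-base-uncased

def pad_or_truncate(token_ids, max_len):
    """Toy stand-in for the padding and truncation done by encode_plus."""
    # Reserve two positions for [CLS] and [SEP], then cut the text if needed.
    body = token_ids[: max_len - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    # Real tokens get mask 1, padding gets mask 0.
    attention_mask = [1] * len(input_ids)
    n_pad = max_len - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad

# A short input is padded, a long input is truncated.
short_ids, short_mask = pad_or_truncate([7592, 2088], max_len=6)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1010)), max_len=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Either way every example ends up with exactly `max_len` positions, which is what lets the DataLoader stack examples into a single batch tensor.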

Next, create the data loader

In [None]:
# function that wraps a DataFrame in a NewsDataset and returns a DataLoader
def create_data_loader(df, tokenizer, max_len, batch_size):
  ds = NewsDataset(
    stories=df.STORY.to_numpy(),
    targets=df.SECTION.to_numpy(),
    tokenizer=tokenizer,
    max_len=max_len
  )

  return DataLoader(
    ds,
    batch_size=batch_size,
    num_workers=4
  )

Next, set the batch size to 32 and MAX_LEN to 100

In [None]:
BATCH_SIZE = 32
MAX_LEN = 100

# build the data loaders (preprocessing happens inside NewsDataset)
train_data_loader = create_data_loader(X_train, tokenizer, MAX_LEN, BATCH_SIZE)
val_data_loader = create_data_loader(X_val, tokenizer, MAX_LEN, BATCH_SIZE)
test_data_loader = create_data_loader(X_test, tokenizer, MAX_LEN, BATCH_SIZE)
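
With these settings the number of batches per split follows directly from the split sizes. This is a small sketch, assuming the DataLoader keeps the final, smaller batch (the default `drop_last=False`, which `create_data_loader` does not override).

```python
import math

BATCH_SIZE = 32
split_rows = {"train": 6865, "val": 381, "test": 382}  # sizes printed earlier

# With drop_last=False a DataLoader yields ceil(n / batch_size) batches.
batches = {name: math.ceil(n / BATCH_SIZE) for name, n in split_rows.items()}

# Size of the final, possibly smaller batch in each split.
last_batch = {
    name: n - (batches[name] - 1) * BATCH_SIZE for name, n in split_rows.items()
}

print(batches)
print(last_batch)
```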

## Model Architecture

Load the pretrained BERT model

In [None]:
# initialize the BERT model from the pretrained weights
bert_model = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)


Define the classifier architecture on top of BERT

In [None]:
class NewsClassifier(nn.Module):

  def __init__(self, n_classes, b_model):
    super(NewsClassifier, self).__init__()
    self.bert = b_model
    self.drop = nn.Dropout(p=0.3)
    self.out = nn.Linear(self.bert.config.hidden_size, n_classes)
  
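  def forward(self, input_ids, attention_mask):
    # the first BERT output is the last hidden state: (batch, seq_len, hidden_size)
    bert_outputs = self.bert(
      input_ids=input_ids,
      attention_mask=attention_mask
    )
    # classify from the hidden state of the [CLS] token at position 0
    output = self.drop(bert_outputs[0][:, 0, :])
    return self.out(output)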
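
The tensor shapes in `forward` can be traced with random stand-in data. The NumPy sketch below is only an illustration: it assumes a hidden size of 768 (BERT base), uses random numbers in place of real BERT outputs and learned weights, and leaves out dropout.

```python
import numpy as np

batch_size, seq_len, hidden_size, n_classes = 32, 100, 768, 4
rng = np.random.default_rng(0)

# Stand-in for the first BERT output: one vector per token.
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden_size))

# Position 0 holds the [CLS] token, used here as a summary of the whole story.
cls_vector = last_hidden_state[:, 0, :]

# Stand-in for nn.Linear(hidden_size, n_classes): y = x W^T + b.
weight = rng.normal(size=(n_classes, hidden_size))
bias = np.zeros(n_classes)
logits = cls_vector @ weight.T + bias

print(cls_vector.shape, logits.shape)
```

Each row of `logits` holds one unnormalized score per news section, which is the form `nn.CrossEntropyLoss` expects.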

Move the classifier to the selected device

In [None]:
# instantiate the customized model with 4 output classes
model = NewsClassifier(4,bert_model)
model = model.to(device)

Set the optimizer, loss function, and scheduler

In [None]:
EPOCHS = 10

# set the optimizer
optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)
total_steps = len(train_data_loader) * EPOCHS

# set the learning-rate schedule
scheduler = get_linear_schedule_with_warmup(
  optimizer,
  num_warmup_steps=0,
  num_training_steps=total_steps
)

# set the loss function
loss_fn = nn.CrossEntropyLoss().to(device)
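
The shape of the schedule can be sketched without any library. The function below is a plain-Python approximation of linear warmup followed by linear decay, not the transformers implementation; `linear_schedule_lr` is a hypothetical name, and 215 steps per epoch is an assumption derived from 6865 training rows at batch size 32.

```python
def linear_schedule_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        factor = step / max(1, num_warmup_steps)
    else:
        remaining = num_training_steps - step
        factor = max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))
    return base_lr * factor

steps_per_epoch = 215               # assumption: ceil(6865 / 32) training batches
total_steps = steps_per_epoch * 10  # EPOCHS = 10

# Learning rate at the start, the midpoint, and the end of training.
for step in (0, total_steps // 2, total_steps):
    print(step, linear_schedule_lr(step, 2e-5, 0, total_steps))
```

With `num_warmup_steps=0` the rate starts at its full value of 2e-5 and shrinks linearly to zero at the last step.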

## Training Process
Define the train_epoch function, which runs training for a single epoch

In [None]:
def train_epoch(
  model, 
  data_loader, 
  loss_fn, 
  optimizer, 
  device, 
  scheduler, 
  n_examples
):
  # put the model in training mode
  model = model.train()

  losses = []
  correct_predictions = 0

  # iterate over the batches
  for d in data_loader:

    # move the input ids, attention mask, and targets of the batch to the device
    input_ids = d["input_ids"].to(device)
    attention_mask = d["attention_mask"].to(device)
    targets = d["targets"].to(device)

    # compute the model outputs (logits) for the batch
    outputs = model(
      input_ids=input_ids,
      attention_mask=attention_mask
    )

    # take the class with the highest logit as the prediction
    _, preds = torch.max(outputs, dim=1)

    # compute the loss
    loss = loss_fn(outputs, targets)

    # count the correct predictions
    correct_predictions += torch.sum(preds == targets)

    # record the batch loss
    losses.append(loss.item())

    loss.backward()

    # clip the gradient norm to guard against exploding gradients
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()

  # return the accuracy and the mean loss
  return correct_predictions.double() / n_examples, np.mean(losses)
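
Clipping by norm rescales all gradients together whenever their joint L2 norm exceeds `max_norm`, and leaves them untouched otherwise. The following is a minimal sketch of that idea in plain Python, not PyTorch's code; `clip_by_global_norm` is a hypothetical helper and the gradient values are made up.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale gradient vectors so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Scale down only when the norm is too large; never scale up.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in vec] for vec in grads], total_norm

grads = [[3.0, 4.0], [12.0]]  # joint norm = sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))

print(norm_before, norm_after)
```

The direction of the update is preserved because every component is multiplied by the same factor.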

Next, define the eval_model function to evaluate the model

In [None]:
def eval_model(model, data_loader, loss_fn, device, n_examples):

  # put the model in evaluation mode
  model = model.eval()

  # per-batch losses and the running count of correct predictions
  losses = []
  correct_predictions = 0

  with torch.no_grad():
    for d in data_loader:
      # move the input ids, attention mask, and targets of the batch to the device
      input_ids = d["input_ids"].to(device)
      attention_mask = d["attention_mask"].to(device)
      targets = d["targets"].to(device)

      # compute the model outputs (logits) for the batch
      outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask
      )

      # take the class with the highest logit as the prediction
      _, preds = torch.max(outputs, dim=1)

      # compute the loss between outputs and targets
      loss = loss_fn(outputs, targets)

      # count the correct predictions
      correct_predictions += torch.sum(preds == targets)
      losses.append(loss.item())

  # return the accuracy and the mean loss
  return correct_predictions.double() / n_examples, np.mean(losses)
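
One detail of the returned loss is worth spelling out. `nn.CrossEntropyLoss` averages within a batch by default, so `np.mean(losses)` is a mean of batch means; when the last batch is smaller than the rest, this can differ slightly from the per-example average. The accuracy is unaffected because it divides a total count by `n_examples`. The toy numbers below are hypothetical and only illustrate the effect.

```python
# Hypothetical per-example losses for a split of 5 rows with batch size 2.
per_example = [0.1, 0.1, 0.2, 0.2, 1.0]
batch_size = 2
batches = [
    per_example[i:i + batch_size] for i in range(0, len(per_example), batch_size)
]

# What a mean over batch losses reports: every batch counts equally.
batch_means = [sum(b) / len(b) for b in batches]
mean_of_batch_means = sum(batch_means) / len(batch_means)

# Per-example average: every row counts equally.
per_example_mean = sum(per_example) / len(per_example)

print(mean_of_batch_means, per_example_mean)
```

With hundreds of full batches and a single short one, as in this notebook, the gap is expected to be small.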

## Train BERT base uncased

In [None]:
history = defaultdict(list)
best_accuracy = 0

# iterate over the epochs
for epoch in range(EPOCHS):

  print(f'Epoch {epoch + 1}/{EPOCHS}')
  print('-' * 10)

  # run one training epoch
  train_acc, train_loss = train_epoch(
    model,
    train_data_loader,
    loss_fn,
    optimizer,
    device,
    scheduler,
    len(X_train)
  )

  # print the train loss and accuracy
  print(f'Train loss {train_loss} accuracy {train_acc}')

  # evaluate on the validation set
  val_acc, val_loss = eval_model(
    model,
    val_data_loader,
    loss_fn,
    device,
    len(X_val)
  )

  # print the validation loss and accuracy
  print(f'Val   loss {val_loss} accuracy {val_acc}')
  print()

  # record the train and validation history
  history['train_acc'].append(train_acc)
  history['train_loss'].append(train_loss)
  history['val_acc'].append(val_acc)
  history['val_loss'].append(val_loss)

  # save the model whenever validation accuracy beats the best so far
  if val_acc > best_accuracy:
    torch.save(model.state_dict(), 'best_model_state.bin')
    best_accuracy = val_acc

Epoch 1/10
----------
Train loss 0.14338467145544412 accuracy 0.9551347414420975
Val   loss 0.07086619824985974 accuracy 0.979002624671916

Epoch 2/10
----------
Train loss 0.03559312634657375 accuracy 0.9909686817188638
Val   loss 0.06148209750729924 accuracy 0.984251968503937

Epoch 3/10
----------
Train loss 0.01344933862416182 accuracy 0.9953386744355426
Val   loss 0.04833574873191537 accuracy 0.9816272965879265

Epoch 4/10
----------
Train loss 0.007575821570935659 accuracy 0.9970866715222141
Val   loss 0.05408717639996515 accuracy 0.9868766404199475

Epoch 5/10
----------
Train loss 0.00612796780031965 accuracy 0.9967953386744355
Val   loss 0.06676019196675043 accuracy 0.9816272965879265

Epoch 6/10
----------
Train loss 0.005355147397156458 accuracy 0.9981063364894391
Val   loss 0.06277147968891465 accuracy 0.9868766404199475

Epoch 7/10
----------
Train loss 0.0052756194446268384 accuracy 0.9969410050983247
Val   loss 0.06278669078407499 accuracy 0.9868766404199475

Epoch 8/10
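
The checkpoint rule in the loop can be replayed on the seven complete epochs logged above. This sketch copies the printed validation accuracies and applies the same strict `>` comparison; `saved_at` is a name introduced here for illustration.

```python
# Validation accuracies of the seven epochs logged above.
val_accs = [
    0.979002624671916, 0.984251968503937, 0.9816272965879265,
    0.9868766404199475, 0.9816272965879265, 0.9868766404199475,
    0.9868766404199475,
]

best_accuracy = 0
saved_at = []  # epochs at which a checkpoint would be written
for epoch, val_acc in enumerate(val_accs, start=1):
    if val_acc > best_accuracy:  # strict comparison: ties keep the older file
        saved_at.append(epoch)
        best_accuracy = val_acc

print(saved_at, best_accuracy)
```

Because ties do not overwrite the file, the checkpoint from the earliest epoch that reached the best accuracy is the one kept.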


## Testing Process
In this step, the model is evaluated on the test data. The train, validation, and test accuracies are as follows

In [None]:
train_acc, _ = eval_model(model, train_data_loader, loss_fn, device, len(X_train))
val_acc, _ = eval_model(model, val_data_loader, loss_fn, device, len(X_val))
test_acc, _ = eval_model(model, test_data_loader, loss_fn, device, len(X_test))

print("Train acc", train_acc.item())
print("Val acc", val_acc.item())
print("Test acc", test_acc.item())

Train acc 0.9978150036416605
Val acc 0.9868766404199475
Test acc 0.9816753926701571


To produce a classification report, define the get_predictions function, which returns the predicted labels for the test data

In [21]:
import torch.nn.functional as F

def get_predictions(model, data_loader):
  model = model.eval()
  
  story_texts = []
  predictions = []
  prediction_probs = []
  real_values = []

  with torch.no_grad():
    for d in data_loader:

      texts = d["story_text"]
      input_ids = d["input_ids"].to(device)
      attention_mask = d["attention_mask"].to(device)
      targets = d["targets"].to(device)

      outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask
      )
      _, preds = torch.max(outputs, dim=1)

      probs = F.softmax(outputs, dim=1)

      story_texts.extend(texts)
      predictions.extend(preds)
      prediction_probs.extend(probs)
      real_values.extend(targets)

  predictions = torch.stack(predictions).cpu()
  prediction_probs = torch.stack(prediction_probs).cpu()
  real_values = torch.stack(real_values).cpu()
  return story_texts, predictions, prediction_probs, real_values
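
`get_predictions` takes the argmax of the raw logits for the label and applies softmax only to obtain probabilities. Since softmax preserves the ordering of the scores, both give the same label. Here is a minimal plain-Python sketch with made-up logits for one example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, -1.0, 0.5, 0.0]  # hypothetical scores for the 4 sections
probs = softmax(logits)

# Index of the largest entry, computed from logits and from probabilities.
pred_from_logits = max(range(len(logits)), key=logits.__getitem__)
pred_from_probs = max(range(len(probs)), key=probs.__getitem__)

print(pred_from_logits, pred_from_probs, round(sum(probs), 6))
```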

The classification report for the test data is as follows

In [22]:
y_story_texts, y_pred, y_pred_probs, y_test = get_predictions(
  model,
  test_data_loader
)

In [23]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.96      0.97        82
           1       0.99      0.98      0.98       149
           2       1.00      1.00      1.00        96
           3       0.95      0.98      0.96        55

    accuracy                           0.98       382
   macro avg       0.98      0.98      0.98       382
weighted avg       0.98      0.98      0.98       382
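
The two average rows of the report can be recomputed from the per-class rows. The sketch below uses the precision column and the supports exactly as printed; because those values are already rounded to two decimals, the result is only an approximate check.

```python
# Per-class precision and support copied from the report above (rounded values).
precision = {0: 0.98, 1: 0.99, 2: 1.00, 3: 0.95}
support = {0: 82, 1: 149, 2: 96, 3: 55}

n_total = sum(support.values())

# Macro average: every class counts equally.
macro_avg = sum(precision.values()) / len(precision)

# Weighted average: every class counts in proportion to its support.
weighted_avg = sum(precision[c] * support[c] for c in precision) / n_total

print(n_total, round(macro_avg, 2), round(weighted_avg, 2))
```

The supports add up to the 382 test rows, and both averages agree with the report at two decimals.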

