<a href="https://colab.research.google.com/github/nadyadtm/BERT-Implementation-in-News-Categorization/blob/main/News_Categorization_using_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# News Categorization using BERT

This notebook implements news categorization with BERT.

## Import Packages
Before starting, install the `transformers` package and import the packages the notebook needs.


In [102]:
from google.colab import drive
drive.mount('/content/drive')

!pip install transformers==3
import transformers

from transformers import BertModel, BertTokenizer, AdamW, get_linear_schedule_with_warmup
import torch
import numpy as np
import pandas as pd
import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from collections import defaultdict
from textwrap import wrap
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
import os

# Make CUDA kernel launches synchronous so errors are reported at the failing call (slows training)
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Set GPU
Before training, select the device PyTorch will run on: the GPU if one is available, otherwise the CPU.

In [103]:
import torch

if torch.cuda.is_available():       
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
Device name: Tesla T4


## Load and Split Train and Test
The following step loads the training data.

In [104]:
# Load the training data
df_train = pd.read_csv('drive/My Drive/SMT 2/NLP/Tugas 2/Data_Train.csv',encoding='cp1252')
df_train.tail()

Unnamed: 0,STORY,SECTION
7623,"Karnataka has been a Congress bastion, but it ...",0
7624,"The film, which also features Janhvi Kapoor, w...",2
7625,The database has been created after bringing t...,1
7626,"The state, which has had an uneasy relationshi...",0
7627,"Virus stars Kunchacko Boban, Tovino Thomas, In...",2


Next, split the data into training, validation, and test sets with scikit-learn. The split is 90% training, 5% validation, and 5% test.

In [105]:
from sklearn.model_selection import train_test_split

X = df_train

X_train, X_val =\
    train_test_split(X, test_size=0.1, random_state=2020)

X_val, X_test =\
    train_test_split(X_val, test_size=0.5, random_state=2020)

In [106]:
print("Number of training samples   : ", X_train.shape[0])
print("Number of validation samples : ", X_val.shape[0])
print("Number of test samples       : ", X_test.shape[0])

Number of training samples   :  6865
Number of validation samples :  381
Number of test samples       :  382
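
The three counts follow from how `train_test_split` handles a fractional `test_size`: to my understanding it rounds the held-out portion up and leaves the remainder on the other side. A minimal sketch of that arithmetic in plain Python, assuming 7,628 rows in total (the sum of the three counts above); `split_sizes` is a hypothetical helper, not part of scikit-learn:

```python
import math

def split_sizes(n_samples, test_fraction):
    """Return (n_kept, n_held_out) for a fractional test size, rounding the held-out part up."""
    n_held_out = math.ceil(n_samples * test_fraction)
    return n_samples - n_held_out, n_held_out

n_total = 6865 + 381 + 382          # total rows, taken from the counts printed above
n_train, n_holdout = split_sizes(n_total, 0.1)   # first split: 90% / 10%
n_val, n_test = split_sizes(n_holdout, 0.5)      # second split: halve the 10%

print(n_train, n_val, n_test)
```

The odd-sized hold-out set is why validation and test differ by one row.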


## Preprocessing
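Preprocessing for this task consists of lowercasing and tokenization, both handled by the BERT tokenizer. Note that `do_lower_case=True` is combined here with the `bert-base-cased` checkpoint, whose vocabulary is case-sensitive.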

In [107]:
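# Defined here because the tokenizer needs it; the same value is set again in the settings cell below
PRE_TRAINED_MODEL_NAME = 'bert-base-cased'
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME, do_lower_case=True)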

class NewsDataset(Dataset):

  def __init__(self, stories, targets, tokenizer, max_len):
    self.stories = stories
    self.targets = targets
    self.tokenizer = tokenizer
    self.max_len = max_len
  
  def __len__(self):
    return len(self.stories)
  
  def __getitem__(self, item):
    story = str(self.stories[item])
    target = self.targets[item]

    encoding = self.tokenizer.encode_plus(
      story,
      add_special_tokens=True,
      max_length=self.max_len,
      return_token_type_ids=False,
      padding='max_length',  # replaces the deprecated pad_to_max_length=True
      truncation=True,
      return_attention_mask=True,
      return_tensors='pt'
    )

    return {
      'story_text': story,
      'input_ids': encoding['input_ids'].flatten(),
      'attention_mask': encoding['attention_mask'].flatten(),
      'targets': torch.tensor(target, dtype=torch.long)
    }
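
To make the effect of `max_length`, truncation, and padding concrete, here is a simplified illustration of the fixed-length encoding that `encode_plus` returns. It is not the tokenizer's implementation: the word-piece ids are made up, `encode_fixed_length` is a hypothetical helper, and the special-token ids (`[CLS]` = 101, `[SEP]` = 102, `[PAD]` = 0) are assumed to match the BERT vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed special-token ids

def encode_fixed_length(token_ids, max_len):
    """Wrap ids in [CLS] ... [SEP], truncate to max_len, then right-pad with [PAD]."""
    body = token_ids[:max_len - 2]            # leave room for the two special tokens
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)     # 1 marks real tokens
    n_pad = max_len - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad             # 0 marks padding
    return input_ids, attention_mask

# A short input is padded; a long input is truncated.
short_ids, short_mask = encode_fixed_length([7, 8, 9], max_len=8)
long_ids, long_mask = encode_fixed_length(list(range(1000, 1020)), max_len=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every item therefore has the same length, which is what lets the `DataLoader` stack items into a batch tensor.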

Next, create the data loaders.

In [108]:
def create_data_loader(df, tokenizer, max_len, batch_size):
  ds = NewsDataset(
    stories=df.STORY.to_numpy(),
    targets=df.SECTION.to_numpy(),
    tokenizer=tokenizer,
    max_len=max_len
  )

  return DataLoader(
    ds,
    batch_size=batch_size,
    num_workers=4
  )

Then set the batch size to 16 and `MAX_LEN` to 30, so every story is truncated or padded to 30 tokens.

In [109]:
BATCH_SIZE = 16
MAX_LEN = 30
PRE_TRAINED_MODEL_NAME = 'bert-base-cased'

train_data_loader = create_data_loader(X_train, tokenizer, MAX_LEN, BATCH_SIZE)
val_data_loader = create_data_loader(X_val, tokenizer, MAX_LEN, BATCH_SIZE)
test_data_loader = create_data_loader(X_test, tokenizer, MAX_LEN, BATCH_SIZE)

## Model Architecture

Load the pretrained BERT model.

In [112]:
bert_model = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)

Define the classifier: BERT followed by dropout and a linear output layer.

In [113]:
class NewsClassifier(nn.Module):

  def __init__(self, n_classes):
    super(NewsClassifier, self).__init__()
    self.bert = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)
    self.drop = nn.Dropout(p=0.3)
    self.out = nn.Linear(self.bert.config.hidden_size, n_classes)
  
  def forward(self, input_ids, attention_mask):
    # With transformers 3.x, BertModel returns a tuple: (sequence_output, pooled_output)
    _, pooled_output = self.bert(
      input_ids=input_ids,
      attention_mask=attention_mask
    )
    output = self.drop(pooled_output)
    return self.out(output)

Create the classifier with 4 output classes and move it to the selected device.

In [114]:
model = NewsClassifier(4)
model = model.to(device)

Set up the optimizer, loss function, and learning-rate scheduler.

In [116]:
EPOCHS = 10

optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)
total_steps = len(train_data_loader) * EPOCHS

scheduler = get_linear_schedule_with_warmup(
  optimizer,
  num_warmup_steps=0,
  num_training_steps=total_steps
)

loss_fn = nn.CrossEntropyLoss().to(device)
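
With `num_warmup_steps=0`, the scheduler lowers the learning rate linearly from `2e-5` to zero over all training steps. A rough sketch of that rule in plain Python; `linear_lr` is a hypothetical helper that mirrors the schedule as I understand it, and the step count assumes the 6,865 training rows reported earlier:

```python
import math

def linear_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

steps_per_epoch = math.ceil(6865 / 16)   # batches per epoch with BATCH_SIZE = 16
total_steps = steps_per_epoch * 10       # EPOCHS = 10

# Learning rate at the start, the midpoint, and the end of training
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, 2e-5, 0, total_steps))
```

Because `scheduler.step()` is called once per batch, the decay is applied per optimizer step rather than per epoch.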

## Training Process
Define `train_epoch`, which runs one training epoch.

In [117]:
def train_epoch(
  model, 
  data_loader, 
  loss_fn, 
  optimizer, 
  device, 
  scheduler, 
  n_examples
):
  model = model.train()

  losses = []
  correct_predictions = 0
  
  for d in data_loader:
    input_ids = d["input_ids"].to(device)
    attention_mask = d["attention_mask"].to(device)
    targets = d["targets"].to(device)

    outputs = model(
      input_ids=input_ids,
      attention_mask=attention_mask
    )

    _, preds = torch.max(outputs, dim=1)
    loss = loss_fn(outputs, targets)

    correct_predictions += torch.sum(preds == targets)
    losses.append(loss.item())

    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()

  return correct_predictions.double() / n_examples, np.mean(losses)
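
`clip_grad_norm_` rescales all gradients together when their combined L2 norm exceeds `max_norm`. The sketch below illustrates the idea on plain Python lists; it is an illustration rather than PyTorch's implementation, and `clip_by_global_norm` is a hypothetical name:

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale every gradient by the same factor so the global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + 1e-6))   # never scales gradients up
    return [[g * scale for g in vec] for vec in grads], total_norm

grads = [[3.0, 0.0], [0.0, 4.0]]          # two made-up parameter gradients, global norm 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))

print(norm_before, norm_after)
```

Using one shared scale factor keeps the direction of the update unchanged and only limits its size.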

Next, define `eval_model` for evaluation.

In [118]:
def eval_model(model, data_loader, loss_fn, device, n_examples):
  model = model.eval()

  losses = []
  correct_predictions = 0

  with torch.no_grad():
    for d in data_loader:
      input_ids = d["input_ids"].to(device)
      attention_mask = d["attention_mask"].to(device)
      targets = d["targets"].to(device)

      outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask
      )
      _, preds = torch.max(outputs, dim=1)

      loss = loss_fn(outputs, targets)

      correct_predictions += torch.sum(preds == targets)
      losses.append(loss.item())

  return correct_predictions.double() / n_examples, np.mean(losses)

Run training.

In [119]:
history = defaultdict(list)
best_accuracy = 0

for epoch in range(EPOCHS):

  print(f'Epoch {epoch + 1}/{EPOCHS}')
  print('-' * 10)

  train_acc, train_loss = train_epoch(
    model,
    train_data_loader,    
    loss_fn, 
    optimizer, 
    device, 
    scheduler, 
    len(X_train)
  )

  print(f'Train loss {train_loss} accuracy {train_acc}')

  val_acc, val_loss = eval_model(
    model,
    val_data_loader,
    loss_fn, 
    device, 
    len(X_val)
  )

  print(f'Val   loss {val_loss} accuracy {val_acc}')
  print()

  history['train_acc'].append(train_acc)
  history['train_loss'].append(train_loss)
  history['val_acc'].append(val_acc)
  history['val_loss'].append(val_loss)

  if val_acc > best_accuracy:
    torch.save(model.state_dict(), 'best_model_state.bin')
    best_accuracy = val_acc

Epoch 1/10
----------
Train loss 0.38272206365553185 accuracy 0.882884195193008
Val   loss 0.25773550960002467 accuracy 0.931758530183727

Epoch 2/10
----------
Train loss 0.15707014350797224 accuracy 0.9613983976693372
Val   loss 0.30218571800893795 accuracy 0.9343832020997376

Epoch 3/10
----------
Train loss 0.0778494997832621 accuracy 0.9809176984705025
Val   loss 0.2598658752855651 accuracy 0.94750656167979

Epoch 4/10
----------
Train loss 0.040385066958274256 accuracy 0.9905316824471959
Val   loss 0.3027964792478694 accuracy 0.9396325459317585

Epoch 5/10
----------
Train loss 0.019400862002167254 accuracy 0.9947560087399854
Val   loss 0.26633608057090896 accuracy 0.9553805774278215

Epoch 6/10
----------
Train loss 0.01204925697176043 accuracy 0.9966496722505462
Val   loss 0.26926551365613705 accuracy 0.9553805774278215

Epoch 7/10
----------
Train loss 0.00988708442712587 accuracy 0.9959213401310997
Val   loss 0.28030040720598964 accuracy 0.94750656167979

Epoch 8/10
---------
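
The checkpoint is overwritten only when validation accuracy strictly improves. Replaying that rule over the seven completed epochs in the log above (values copied from the output and rounded to four decimals) shows which epoch's weights are in `best_model_state.bin` at that point:

```python
# Validation accuracy per epoch, copied from the training log (epochs 1-7)
val_accs = [0.9318, 0.9344, 0.9475, 0.9396, 0.9554, 0.9554, 0.9475]

best_accuracy = 0
best_epoch = None
for epoch, val_acc in enumerate(val_accs, start=1):
    if val_acc > best_accuracy:      # strict comparison: a tie keeps the earlier checkpoint
        best_accuracy = val_acc
        best_epoch = epoch

print(best_epoch, best_accuracy)
```

Epochs 5 and 6 report the same validation accuracy, so the strict `>` keeps the epoch 5 checkpoint.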

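## Testing Process
First, compute the accuracy on the test set. Note that `model` here still holds the weights from the last completed epoch; the checkpoint saved to `best_model_state.bin` is not reloaded.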

In [120]:
test_acc, _ = eval_model(
  model,
  test_data_loader,
  loss_fn,
  device,
  len(X_test)
)

test_acc.item()

0.9345549738219896

Define a function that returns the predicted labels.

In [121]:
import torch.nn.functional as F

def get_predictions(model, data_loader):
  model = model.eval()
  
  story_texts = []
  predictions = []
  prediction_probs = []
  real_values = []

  with torch.no_grad():
    for d in data_loader:

      texts = d["story_text"]
      input_ids = d["input_ids"].to(device)
      attention_mask = d["attention_mask"].to(device)
      targets = d["targets"].to(device)

      outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask
      )
      _, preds = torch.max(outputs, dim=1)

      probs = F.softmax(outputs, dim=1)

      story_texts.extend(texts)
      predictions.extend(preds)
      prediction_probs.extend(probs)
      real_values.extend(targets)

  predictions = torch.stack(predictions).cpu()
  prediction_probs = torch.stack(prediction_probs).cpu()
  real_values = torch.stack(real_values).cpu()
  return story_texts, predictions, prediction_probs, real_values
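
`get_predictions` takes the arg-max over the raw logits and separately converts the logits to probabilities with softmax. Because softmax preserves the ordering of its inputs, both views give the same predicted class. A small pure-Python sketch with made-up logits for the four classes:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]   # subtract the max to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.2, -0.3, 3.1, 0.4]            # made-up scores for classes 0..3
probs = softmax(logits)

# Index of the largest value in each list
pred_from_logits = max(range(len(logits)), key=lambda i: logits[i])
pred_from_probs = max(range(len(probs)), key=lambda i: probs[i])

print(pred_from_logits, pred_from_probs, round(sum(probs), 6))
```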

Print the classification report.

In [122]:
y_story_texts, y_pred, y_pred_probs, y_test = get_predictions(
  model,
  test_data_loader
)

In [123]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.88      0.90        82
           1       0.96      0.95      0.96       149
           2       0.95      0.96      0.95        96
           3       0.86      0.93      0.89        55

    accuracy                           0.93       382
   macro avg       0.92      0.93      0.93       382
weighted avg       0.94      0.93      0.93       382
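
Each row of the report is derived from per-class counts of true positives, false positives, and false negatives. A minimal sketch on a made-up set of labels (not the notebook's data); `per_class_metrics` is a hypothetical helper, not a scikit-learn function:

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, F1, and support for one class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn   # support = number of true items of this class

y_true = [0, 0, 1, 1, 1, 2, 2, 3]   # made-up labels
y_pred = [0, 1, 1, 1, 2, 2, 2, 3]   # made-up predictions

for label in sorted(set(y_true)):
    p, r, f1, support = per_class_metrics(y_true, y_pred, label)
    print(label, round(p, 2), round(r, 2), round(f1, 2), support)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy", accuracy)
```

The macro average in the report is the unweighted mean of these per-class values, while the weighted average weights each class by its support.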

