<a href="https://colab.research.google.com/github/nadyadtm/BERT-Implementation-in-Chunking/blob/main/BERT_implementation_in_Chunking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT implementation in Chunking

This notebook implements BERT for chunking, using the CoNLL-2000 dataset.


## Mount Google Drive

In [1]:
# mount google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Import Package

In [2]:
# install the transformers package and import it
!pip install transformers==3
import transformers
from transformers import DistilBertModel, DistilBertTokenizer, AdamW, get_linear_schedule_with_warmup
from keras.preprocessing.sequence import pad_sequences


#import package pytorch
import torch
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader,TensorDataset

#import package sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score

#import numpy, pandas
import numpy as np
import pandas as pd

#import os
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)


<torch._C.Generator at 0x7f1c5a0b52b0>

## Set GPU
Before training, select the device PyTorch will run on: the GPU if one is available, otherwise the CPU.

In [3]:
# check whether a GPU is available on this machine
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))
# otherwise fall back to the CPU
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
Device name: Tesla P100-PCIE-16GB


## Load File txt
Load train.txt and test.txt first, then group the words into sentences. Each non-empty line holds a word, its POS tag, and its chunk tag, and a blank line marks the end of a sentence.

In [4]:
DATA_DIR = 'drive/My Drive/SMT 2/NLP/Tugas 3/Chunking/{}.txt'

def get_data(file):
    with open(file, 'r', encoding='latin1') as fp:
        content = fp.readlines()
    data, sent = [], []
    for line in content:
        if not line.strip():
            if sent: data.append(sent)
            sent = []
        else:
            word, pos, tag = line.strip().split()
            # tag = tag.split('-')[0]
            sent.append((word, pos, tag))
    return data

In [5]:
train_data = get_data(DATA_DIR.format('train'))
test_data = get_data(DATA_DIR.format('test'))
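
To make the expected file layout concrete, here is a minimal sketch that applies the same blank-line grouping to an in-memory sample. The sample lines and the helper name `parse_conll_lines` are illustrative assumptions, not part of the notebook or the dataset files.

```python
# Hypothetical helper mirroring get_data, but reading from a list of lines
def parse_conll_lines(lines):
    data, sent = [], []
    for line in lines:
        if not line.strip():
            # a blank line closes the current sentence
            if sent:
                data.append(sent)
            sent = []
        else:
            word, pos, tag = line.strip().split()
            sent.append((word, pos, tag))
    # keep the last sentence if the input does not end with a blank line
    if sent:
        data.append(sent)
    return data

# Illustrative sample in the "word POS chunk-tag" layout
sample = [
    "He PRP B-NP",
    "reckons VBZ B-VP",
    ". . O",
    "",
    "The DT B-NP",
    "pound NN I-NP",
    "fell VBD B-VP",
]
parsed = parse_conll_lines(sample)
print(parsed[0])
print(len(parsed))
```

Unlike `get_data`, this sketch also keeps a final sentence that is not followed by a blank line.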

Next, extract the words and the chunk tags of each sentence. The POS tags are ignored in this case.


In [6]:
sentences = [[word[0] for word in sentence] for sentence in train_data]
sentences_test = [[word[0] for word in sentence] for sentence in test_data]

In [7]:
labels = [[s[2] for s in sentence] for sentence in train_data]
labels_test = [[s[2] for s in sentence] for sentence in test_data]

An example sentence from the training data:

In [8]:
sentences[0]

['Confidence',
 'in',
 'the',
 'pound',
 'is',
 'widely',
 'expected',
 'to',
 'take',
 'another',
 'sharp',
 'dive',
 'if',
 'trade',
 'figures',
 'for',
 'September',
 ',',
 'due',
 'for',
 'release',
 'tomorrow',
 ',',
 'fail',
 'to',
 'show',
 'a',
 'substantial',
 'improvement',
 'from',
 'July',
 'and',
 'August',
 "'s",
 'near-record',
 'deficits',
 '.']

Check the maximum number of words per sentence in the training data:

In [9]:
len_sentence = [len(s) for s in sentences]
print("Maximum number of words in a sentence: ", np.max(len_sentence))

Maximum number of words in a sentence:  78


The chunk tags that occur in the train and test sets are listed below. A PAD tag is appended for padding positions.

In [10]:
data1 = pd.read_csv("drive/My Drive/SMT 2/NLP/Tugas 3/Chunking/train.txt", 
                  sep=' ', 
                  names=["Words", "POS", "Tag"])
data2 = pd.read_csv("drive/My Drive/SMT 2/NLP/Tugas 3/Chunking/test.txt", 
                  sep=' ', 
                  names=["Words", "POS", "Tag"])
data = pd.concat([data1,data2])
tag_values = list(set(data["Tag"].values))
tag_values.append("PAD")
tag2idx = {t: i for i, t in enumerate(tag_values)}
tag2idx

{'B-ADJP': 14,
 'B-ADVP': 7,
 'B-CONJP': 2,
 'B-INTJ': 21,
 'B-LST': 11,
 'B-NP': 5,
 'B-PP': 0,
 'B-PRT': 20,
 'B-SBAR': 10,
 'B-UCP': 4,
 'B-VP': 1,
 'I-ADJP': 13,
 'I-ADVP': 19,
 'I-CONJP': 15,
 'I-INTJ': 17,
 'I-LST': 12,
 'I-NP': 9,
 'I-PP': 6,
 'I-PRT': 22,
 'I-SBAR': 3,
 'I-UCP': 8,
 'I-VP': 16,
 'O': 18,
 'PAD': 23}
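
Because the iteration order of a `set` of strings can differ between Python sessions, the indices above are not guaranteed to be the same on every run. A minimal sketch of a reproducible alternative, assuming a small illustrative tag list (`build_tag_index` is a hypothetical helper, not part of the notebook):

```python
def build_tag_index(tags, pad_tag="PAD"):
    # sort the unique tags so the mapping is identical on every run
    tag_values = sorted(set(tags))
    # append the padding tag last, as the notebook does
    tag_values.append(pad_tag)
    tag2idx = {t: i for i, t in enumerate(tag_values)}
    return tag_values, tag2idx

example_tags = ["B-NP", "I-NP", "O", "B-VP", "B-NP", "O"]
tag_values_sorted, tag2idx_sorted = build_tag_index(example_tags)
print(tag2idx_sorted)
```

A fixed ordering matters mainly when a trained model is saved and reloaded in a later session, since the classifier outputs are tied to these indices.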

## Preprocessing
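Initialize the tokenizer, the maximum sequence length, and the batch size first. The model used here is DistilBERT. Note that MAX_LEN is the maximum word count (78), while padding and truncation are later applied at the subword level, so sentences that tokenize to more than MAX_LEN subwords are cut off.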

In [11]:
PRE_TRAINED_MODEL_NAME = 'distilbert-base-cased'
MAX_LEN = np.max(len_sentence)
bs = 32

tokenizer = DistilBertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME, do_lower_case=False)


The following function tokenizes a sentence into subwords and repeats each word's label for every subword it produces.

In [12]:
# tokenize a single sentence
def tokenize_and_preserve_labels(sentence, text_labels):
    tokenized_sentence = []
    labels = []

    for word, label in zip(sentence, text_labels):

        # tokenize the word and count the resulting subwords
        tokenized_word = tokenizer.tokenize(word)
        n_subwords = len(tokenized_word)

        # add the subwords to the tokenized sentence
        tokenized_sentence.extend(tokenized_word)

        # repeat the word's label once per subword
        labels.extend([label] * n_subwords)

    return tokenized_sentence, labels
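
To illustrate how labels follow the subwords, the sketch below uses a toy stand-in for the tokenizer so that it runs without downloading a model. `toy_tokenize` and its fixed-length splitting rule are assumptions made only for illustration; the real WordPiece tokenizer splits words differently.

```python
def toy_tokenize(word, piece_len=4):
    # Hypothetical splitter: cut a word into fixed-size pieces and mark
    # continuation pieces with "##", imitating the WordPiece convention
    pieces = [word[i:i + piece_len] for i in range(0, len(word), piece_len)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]

def propagate_labels(sentence, text_labels, tokenize=toy_tokenize):
    tokens, labels = [], []
    for word, label in zip(sentence, text_labels):
        subwords = tokenize(word)
        tokens.extend(subwords)
        # every subword inherits the label of the word it came from
        labels.extend([label] * len(subwords))
    return tokens, labels

tokens, token_labels = propagate_labels(
    ["near-record", "deficits", "."], ["I-NP", "I-NP", "O"]
)
print(tokens)
print(token_labels)
```

The token and label lists always have the same length, which is what the padding step relies on.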

In [13]:
# tokenize a list of sentences
def tokenized(sentences, labels):
    tokenized_texts_and_labels = [
      tokenize_and_preserve_labels(sent, labs)
      for sent, labs in zip(sentences, labels)
    ]
    tokenized_texts = [token_label_pair[0] for token_label_pair in tokenized_texts_and_labels]
    labels = [token_label_pair[1] for token_label_pair in tokenized_texts_and_labels]
    return tokenized_texts, labels

In [14]:
tokenized_texts, labels = tokenized(sentences,labels)

Next, pad the tokenized sentences (input_ids) and their tags to MAX_LEN, then derive the attention masks.

In [15]:
# pad with pad_sequences and derive the attention masks (1.0 for real tokens, 0.0 for padding)
def get_input_tags_attention(tokenized_texts, labels, MAX_LEN):
  input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts],
                          maxlen=MAX_LEN, dtype="long", value=0.0,
                          truncating="post", padding="post")
  tags = pad_sequences([[tag2idx.get(l) for l in lab] for lab in labels],
                     maxlen=MAX_LEN, value=tag2idx["PAD"], padding="post",
                     dtype="long", truncating="post")
  attention_masks = [[float(i != 0.0) for i in ii] for ii in input_ids]

  return input_ids, tags, attention_masks

In [16]:
input_ids, tags, attention_masks = get_input_tags_attention(tokenized_texts, labels, MAX_LEN)
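
The padding and mask logic can be sketched in plain Python, without Keras. `pad_post` is a hypothetical helper that imitates `pad_sequences` with `padding="post"` and `truncating="post"`; the IDs below are made up.

```python
def pad_post(seq, max_len, value):
    # truncate from the end, then fill the remainder with the pad value
    trimmed = list(seq[:max_len])
    return trimmed + [value] * (max_len - len(trimmed))

max_len = 5
pad_tag_id = 23  # index of "PAD" in the tag2idx shown earlier

ids = [[101, 7, 42], [5, 6, 7, 8, 9, 10]]      # made-up token IDs
tag_ids = [[5, 9, 18], [5, 9, 9, 1, 16, 18]]   # made-up tag indices

padded_ids = [pad_post(s, max_len, 0) for s in ids]
padded_tags = [pad_post(s, max_len, pad_tag_id) for s in tag_ids]
# a position is attended to only if it holds a real token (ID != 0)
masks = [[float(i != 0) for i in row] for row in padded_ids]

print(padded_ids)
print(padded_tags)
print(masks)
```

The second sequence is longer than `max_len`, so its last token and tag are dropped, which is the same truncation that affects long sentences in the notebook.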

## Train Val split dan Data Loader
Split the training data into 90% for training and 10% for validation.

In [17]:
# split the training data 90:10; reusing the same random_state keeps the masks aligned with the inputs
tr_inputs, val_inputs, tr_tags, val_tags = train_test_split(input_ids, tags,
                                                            random_state=2018, test_size=0.1)
tr_masks, val_masks, _, _ = train_test_split(attention_masks, input_ids,
                                             random_state=2018, test_size=0.1)

The following function converts input_ids, tags, and masks to tensors and wraps them in a DataLoader.

In [18]:
def data_loader(input_ids, tag, attention_mask, bs):
  t_inputs = torch.tensor(input_ids)
  t_tags = torch.tensor(tag)
  t_masks = torch.tensor(attention_mask)

  data = TensorDataset(t_inputs, t_masks, t_tags)
  dataloader = DataLoader(data, batch_size=bs)

  return data, dataloader

In [19]:
train_data,train_dataloader = data_loader(tr_inputs, tr_tags, tr_masks,bs)
valid_data,valid_dataloader = data_loader(val_inputs, val_tags, val_masks,bs)

## BERT Model
Initialize the model first.

In [20]:
from transformers import DistilBertForTokenClassification

model = DistilBertForTokenClassification.from_pretrained(
    PRE_TRAINED_MODEL_NAME,
    num_labels=len(tag2idx),
    output_attentions = False,
    output_hidden_states = False
)


Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForTokenClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this 

In [21]:
# move the model to the selected device and display its architecture
model.to(device)

DistilBertForTokenClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
          

Then set up the optimizer, the scheduler, and the number of epochs.

In [22]:
#set epoch
epochs = 5

#set optimizer
optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)
total_steps = len(train_dataloader) * epochs

#set schedule
scheduler = get_linear_schedule_with_warmup(
  optimizer,
  num_warmup_steps=0,
  num_training_steps=total_steps
)
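
For intuition, the learning-rate multiplier of a linear schedule with warmup can be written in a few lines. This is a sketch of the general rule (linear ramp-up during warmup, then linear decay to zero), with `linear_multiplier` as a hypothetical name and toy step counts; it is not the library's code.

```python
def linear_multiplier(step, num_warmup_steps, num_training_steps):
    # ramp up linearly from 0 to 1 during warmup
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    # then decay linearly from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
total = 10  # toy value; the notebook uses len(train_dataloader) * epochs
for step in (0, 5, 10):
    print(step, base_lr * linear_multiplier(step, 0, total))
```

With `num_warmup_steps=0`, as in the cell above, the learning rate starts at its full value and decreases linearly to zero by the last training step.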

## Model Function
The training and evaluation functions used in the training loop are defined below.

### train_epoch function

In [23]:
def train_epoch(
  model, 
  data_loader, 
  optimizer, 
  device, 
  scheduler, 
  n_examples
):
  # put the model in training mode
  model.train()

  # initialize the running loss
  total_loss = 0

  # initialize the predicted and true labels
  predictions , true_labels = [], []

  # iterate over the batches
  for step, batch in enumerate(data_loader):
      # unpack the input ids, attention mask, and labels of the batch
      batch = tuple(t.to(device) for t in batch)
      b_input_ids, b_input_mask, b_labels = batch

      # clear the previously computed gradients
      model.zero_grad()

      # forward pass
      outputs = model(b_input_ids, attention_mask=b_input_mask, labels=b_labels)

      # collect the predicted and true labels
      logits = outputs[1].detach().cpu().numpy()
      label_ids = b_labels.to('cpu').numpy()
      predictions.extend([list(p) for p in np.argmax(logits, axis=2)])
      true_labels.extend(label_ids)

      # get the loss
      loss = outputs[0]

      # backward pass
      loss.backward()
      total_loss += loss.item()

      # clip the gradient norm to avoid exploding gradients
      torch.nn.utils.clip_grad_norm_(parameters=model.parameters(), max_norm=1)
      optimizer.step()
      scheduler.step()

  # compute the average loss and the accuracy, ignoring PAD positions
  avg_train_loss = total_loss / len(data_loader)
  pred_tags = [tag_values[p_i] for p, l in zip(predictions, true_labels)
                                 for p_i, l_i in zip(p, l) if tag_values[l_i] != "PAD"]
  valid_tags = [tag_values[l_i] for l in true_labels
                                  for l_i in l if tag_values[l_i] != "PAD"]
  # accuracy_score expects (y_true, y_pred)
  return accuracy_score(valid_tags, pred_tags),avg_train_loss

### eval_model function

In [24]:
def eval_model(model, data_loader, device, n_examples):

  # put the model in evaluation mode
  model.eval()

  # initialize the validation loss, the predicted labels, and the true labels
  eval_loss, eval_accuracy = 0, 0
  predictions , true_labels = [], []
  
  for batch in data_loader:
      batch = tuple(t.to(device) for t in batch)
      b_input_ids, b_input_mask, b_labels = batch

      # no gradients are needed during evaluation
      with torch.no_grad():
          # forward pass
          outputs = model(b_input_ids, attention_mask=b_input_mask, labels=b_labels)

      # get the model outputs
      logits = outputs[1].detach().cpu().numpy()
      label_ids = b_labels.to('cpu').numpy()

      # accumulate the loss and collect the predicted and true labels
      eval_loss += outputs[0].mean().item()
      predictions.extend([list(p) for p in np.argmax(logits, axis=2)])
      true_labels.extend(label_ids)

  # compute the average loss and the accuracy, ignoring PAD positions
  eval_loss = eval_loss / len(data_loader)
  pred_tags = [tag_values[p_i] for p, l in zip(predictions, true_labels)
                                 for p_i, l_i in zip(p, l) if tag_values[l_i] != "PAD"]
  valid_tags = [tag_values[l_i] for l in true_labels
                                  for l_i in l if tag_values[l_i] != "PAD"]
  # accuracy_score expects (y_true, y_pred)
  return accuracy_score(valid_tags, pred_tags),eval_loss
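
The accuracy computed by both functions ignores padding positions. A minimal pure-Python sketch of that idea, using made-up label indices and a hypothetical `masked_accuracy` helper:

```python
def masked_accuracy(predictions, true_labels, pad_id):
    correct, total = 0, 0
    for pred_row, true_row in zip(predictions, true_labels):
        for p, t in zip(pred_row, true_row):
            # skip positions whose gold label is the padding tag
            if t == pad_id:
                continue
            total += 1
            correct += int(p == t)
    return correct / total if total else 0.0

PAD_ID = 23
preds = [[5, 9, 18, 0, 0], [5, 1, 16, 18, 7]]
golds = [[5, 9, 18, 23, 23], [5, 9, 16, 18, 23]]
print(masked_accuracy(preds, golds, PAD_ID))
```

Whatever the model predicts at a padded position has no effect on the score, because only positions with a real gold label are counted.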

## Training Process
The training loop runs in the following cell.

In [25]:
for epoch in range(epochs):

    print(f'Epoch {epoch + 1}/{epochs}')
    print('-' * 10)

    # run one training epoch
    train_acc, train_loss = train_epoch(
      model,
      train_dataloader,    
      optimizer, 
      device, 
      scheduler, 
      len(train_data)
    )

    # print the training loss and accuracy
    print(f'Train loss {train_loss} accuracy {train_acc}')

    val_acc, val_loss = eval_model(
      model,
      valid_dataloader,    
      device, 
      len(valid_data)
    )
    
    # print the validation loss and accuracy
    print(f'Val   loss {val_loss} accuracy {val_acc}')
    print()

Epoch 1/5
----------
Train loss 0.2963341994299775 accuracy 0.9095954381700405
Val   loss 0.10567484822656427 accuracy 0.9704180309144902

Epoch 2/5
----------
Train loss 0.07675547086234603 accuracy 0.9791250103460143
Val   loss 0.09833445413304227 accuracy 0.9733115053681566

Epoch 3/5
----------
Train loss 0.04854029431820862 accuracy 0.9869662001158753
Val   loss 0.0995386088533061 accuracy 0.974453666336709

Epoch 4/5
----------
Train loss 0.035769588609654755 accuracy 0.9906733403904041
Val   loss 0.09812257279242788 accuracy 0.9754054671438361

Epoch 5/5
----------
Train loss 0.03006212732991174 accuracy 0.9920847545489792
Val   loss 0.0980400617367455 accuracy 0.9748343866595599



## Evaluation on the test data
After training, the model is evaluated on the test data.

### Preprocessing and DataLoader for the test data

In [26]:
tokenized_texts_test, labels_test = tokenized(sentences_test,labels_test)
input_ids_test, tags_test, attention_masks_test = get_input_tags_attention(tokenized_texts_test, labels_test, MAX_LEN)
test_data,test_dataloader = data_loader(input_ids_test, tags_test, attention_masks_test,bs)

### Evaluation on the train, validation, and test sets

In [27]:
train_acc, train_loss = eval_model(
      model,
      train_dataloader,    
      device, 
      len(train_data)
)
val_acc, val_loss = eval_model(
      model,
      valid_dataloader,    
      device, 
      len(valid_data)
)
test_acc, test_loss = eval_model(
      model,
      test_dataloader,    
      device, 
      len(test_data)
)

In [28]:
print("Training accuracy   : ", train_acc)
print("Validation accuracy : ", val_acc)
print("Test accuracy       : ", test_acc)

Training accuracy   :  0.9952430115396176
Validation accuracy :  0.9748343866595599
Test accuracy       :  0.9733735458853943
