<a href="https://colab.research.google.com/github/nadyadtm/BERT-Implementation-in-Chunking/blob/main/BERT_implementation_in_Chunking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT implementation in Chunking

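Chunking is the task of dividing a sentence into non-overlapping syntactic phrases such as noun phrases (NP) and verb phrases (VP). This notebook treats it as token classification over B-/I-/O chunk tags using DistilBERT.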


## Mount Google Drive

In [1]:
# mount google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Import Package

In [2]:
# Install the transformers package and import it
!pip install transformers==3
import transformers
from transformers import DistilBertModel, DistilBertTokenizer, AdamW, get_linear_schedule_with_warmup

#import package pytorch
import torch
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader,TensorDataset
from tqdm import tqdm, trange


#import package sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

#import numpy, pandas
import numpy as np
import pandas as pd

import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc
from collections import defaultdict
from textwrap import wrap

#import os
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

Collecting transformers==3
  Downloading https://files.pythonhosted.org/packages/9c/35/1c3f6e62d81f5f0daff1384e6d5e6c5758682a8357ebc765ece2b9def62b/transformers-3.0.0-py3-none-any.whl (754kB)

<torch._C.Generator at 0x7fb5342022b0>

## Set GPU
Before training starts, select the GPU as the device on which PyTorch will run.

In [3]:
# Check whether a GPU is available on this machine
if torch.cuda.is_available():       
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))
# If not, fall back to the CPU
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
Device name: Tesla T4


## Load File txt

In [4]:
DATA_DIR = 'drive/My Drive/SMT 2/NLP/Tugas 3/Chunking/{}.txt'

def get_data(file):
    with open(file, 'r', encoding='latin1') as fp:
        content = fp.readlines()
    data, sent = [], []
    for line in content:
        if not line.strip():
            if sent: data.append(sent)
            sent = []
        else:
            word, pos, tag = line.strip().split()
            # tag = tag.split('-')[0]
            sent.append((word, pos, tag))
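    # Keep the last sentence even if the file does not end with a blank line.
    if sent: data.append(sent)
    return data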
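
The loader above assumes CoNLL-2000 style input: one token per line written as `word POS chunk-tag`, separated by whitespace, with a blank line between sentences. A minimal sketch of the same parsing on an in-memory sample follows; the sample text and the name `parse_conll` are illustrative and not taken from the dataset files.

```python
import io

# Hypothetical two-sentence sample in the assumed "word POS tag" layout.
SAMPLE = """He PRP B-NP
reckons VBZ B-VP
. . O

It PRP B-NP
rose VBD B-VP
"""

def parse_conll(fp):
    """Group (word, pos, tag) triples into sentences split on blank lines."""
    data, sent = [], []
    for line in fp:
        if not line.strip():
            if sent:
                data.append(sent)
            sent = []
        else:
            word, pos, tag = line.strip().split()
            sent.append((word, pos, tag))
    if sent:  # flush the last sentence when there is no trailing blank line
        data.append(sent)
    return data

sentences_demo = parse_conll(io.StringIO(SAMPLE))
print(len(sentences_demo))
print(sentences_demo[0])
```

The sample deliberately ends without a blank line, so the final `if sent` check is what keeps the second sentence.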

In [5]:
train_data = get_data(DATA_DIR.format('train'))
test_data = get_data(DATA_DIR.format('test'))

Next, extract the sentences and their labels. In this case we ignore the POS tags.

In [6]:
sentences = [[word[0] for word in sentence] for sentence in train_data]
sentences_test = [[word[0] for word in sentence] for sentence in test_data]

In [7]:
labels = [[s[2] for s in sentence] for sentence in train_data]
labels_test = [[s[2] for s in sentence] for sentence in test_data]

The following tags occur in the dataset.

In [8]:
data1 = pd.read_csv("drive/My Drive/SMT 2/NLP/Tugas 3/Chunking/train.txt", 
                  sep=' ', 
                  names=["Words", "POS", "Tag"])
data2 = pd.read_csv("drive/My Drive/SMT 2/NLP/Tugas 3/Chunking/test.txt", 
                  sep=' ', 
                  names=["Words", "POS", "Tag"])
data = pd.concat([data1,data2])
tag_values = list(set(data["Tag"].values))
tag_values.append("PAD")
tag2idx = {t: i for i, t in enumerate(tag_values)}
tag2idx

{'B-ADJP': 21,
 'B-ADVP': 12,
 'B-CONJP': 6,
 'B-INTJ': 13,
 'B-LST': 7,
 'B-NP': 2,
 'B-PP': 10,
 'B-PRT': 22,
 'B-SBAR': 11,
 'B-UCP': 8,
 'B-VP': 15,
 'I-ADJP': 19,
 'I-ADVP': 20,
 'I-CONJP': 14,
 'I-INTJ': 16,
 'I-LST': 17,
 'I-NP': 5,
 'I-PP': 4,
 'I-PRT': 0,
 'I-SBAR': 3,
 'I-UCP': 18,
 'I-VP': 1,
 'O': 9,
 'PAD': 23}
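
The indices above come from iterating over a `set`, and the iteration order of a set of strings can change between Python sessions. A `tag2idx` rebuilt later may therefore not match the one a saved model was trained with. Below is a small sketch of a reproducible mapping, assuming that sorting the tags is acceptable; `demo_tags`, `demo_tag2idx`, and `demo_idx2tag` are illustrative names.

```python
# Hypothetical subset of the chunk tags shown above.
demo_tags = ["B-NP", "I-NP", "B-VP", "O", "B-NP", "I-VP"]

# Sorting the unique tags makes the index assignment deterministic.
demo_tag_values = sorted(set(demo_tags))
demo_tag_values.append("PAD")  # padding label goes last, as in the notebook

demo_tag2idx = {t: i for i, t in enumerate(demo_tag_values)}
# Reverse mapping, useful for turning predicted indices back into tags.
demo_idx2tag = {i: t for t, i in demo_tag2idx.items()}

print(demo_tag2idx)
```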

## Preprocessing
Initialize the tokenizer, the maximum sequence length, and the batch size up front. The model used here is DistilBERT.

In [9]:
PRE_TRAINED_MODEL_NAME = 'distilbert-base-cased'
MAX_LEN = 30
bs = 32

tokenizer = DistilBertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME, do_lower_case=False)

The function below tokenizes each word into subwords and repeats the word's label for every subword.

In [10]:
def tokenize_and_preserve_labels(sentence, text_labels):
    tokenized_sentence = []
    labels = []

    for word, label in zip(sentence, text_labels):

        # Tokenize the word and count # of subwords the word is broken into
        tokenized_word = tokenizer.tokenize(word)
        n_subwords = len(tokenized_word)

        # Add the tokenized word to the final tokenized word list
        tokenized_sentence.extend(tokenized_word)

        # Add the same label to the new list of labels `n_subwords` times
        labels.extend([label] * n_subwords)

    return tokenized_sentence, labels
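
To make the label propagation concrete, here is a sketch with a stand-in tokenizer. `toy_tokenize` is hypothetical: it only mimics WordPiece's `##` continuation prefix by cutting words into three-character pieces, and the real DistilBERT tokenizer will split words differently.

```python
def toy_tokenize(word, piece_len=3):
    """Hypothetical stand-in for tokenizer.tokenize: fixed-size pieces."""
    pieces = [word[i:i + piece_len] for i in range(0, len(word), piece_len)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]

def propagate_labels(sentence, text_labels):
    tokens, labels = [], []
    for word, label in zip(sentence, text_labels):
        sub = toy_tokenize(word)
        tokens.extend(sub)
        # Every subword inherits the label of the word it came from.
        labels.extend([label] * len(sub))
    return tokens, labels

tokens_demo, labels_demo = propagate_labels(["He", "reckons"], ["B-NP", "B-VP"])
print(tokens_demo)
print(labels_demo)
```

Because a `B-` tag is repeated on every subword, one word can yield several `B-` labels. The accuracy figures in this notebook are therefore computed over subword tokens, not over the original words.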

In [11]:
def tokenized(sentences,labels):
    tokenized_texts_and_labels = [
      tokenize_and_preserve_labels(sent, labs)
      for sent, labs in zip(sentences, labels)
    ]
    tokenized_texts = [token_label_pair[0] for token_label_pair in tokenized_texts_and_labels]
    labels = [token_label_pair[1] for token_label_pair in tokenized_texts_and_labels]
    return tokenized_texts, labels

In [12]:
tokenized_texts, labels = tokenized(sentences,labels)

Next, pad the sequences and build the attention mask for each sentence.

In [13]:
from keras.preprocessing.sequence import pad_sequences

def get_input_tags_attention(tokenized_texts, labels, MAX_LEN):
  input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts],
                          maxlen=MAX_LEN, dtype="long", value=0.0,
                          truncating="post", padding="post")
  tags = pad_sequences([[tag2idx.get(l) for l in lab] for lab in labels],
                     maxlen=MAX_LEN, value=tag2idx["PAD"], padding="post",
                     dtype="long", truncating="post")
  attention_masks = [[float(i != 0.0) for i in ii] for ii in input_ids]

  return input_ids, tags, attention_masks
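
What `pad_sequences` does with `padding="post"` and `truncating="post"` can be sketched in plain Python. The sketch assumes, as the notebook does, that id 0 is the padding token; `pad_post` is an illustrative helper and not part of Keras.

```python
def pad_post(seq, max_len, value):
    """Cut the sequence at max_len, then fill the tail with `value`."""
    seq = list(seq)[:max_len]
    return seq + [value] * (max_len - len(seq))

ids_demo = [[101, 7, 42], [5, 6, 7, 8, 9, 10]]
padded = [pad_post(s, max_len=5, value=0) for s in ids_demo]
# The mask is 1.0 for real tokens and 0.0 for padding positions.
masks = [[float(i != 0) for i in row] for row in padded]

print(padded)
print(masks)
```

With `MAX_LEN = 30`, any sentence longer than 30 subword tokens is cut, so its trailing tokens are neither trained on nor scored.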

In [17]:
input_ids, tags, attention_masks = get_input_tags_attention(tokenized_texts, labels, MAX_LEN)

## Train/Validation Split and Data Loader
Split the data into training and validation sets, then wrap each set in a data loader.

In [18]:
tr_inputs, val_inputs, tr_tags, val_tags = train_test_split(input_ids, tags,
                                                            random_state=2018, test_size=0.1)
tr_masks, val_masks, _, _ = train_test_split(attention_masks, input_ids,
                                             random_state=2018, test_size=0.1)
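
The two calls stay aligned only because they share `random_state` and `test_size` and their inputs have the same length. One alternative is to shuffle the indices once and apply them to every array, which makes the alignment explicit. The sketch below uses the standard library rather than scikit-learn, and `split_indices` is a hypothetical helper.

```python
import random

def split_indices(n, test_size, seed):
    """Return (train_idx, val_idx) from one shuffled index list."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_val = max(1, int(round(n * test_size)))
    return idx[n_val:], idx[:n_val]

inputs_demo = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
masks_demo = [[1, 1], [1, 1], [1, 0], [1, 1], [1, 0]]

train_idx, val_idx = split_indices(len(inputs_demo), test_size=0.2, seed=2018)
# The same indices select from every array, so rows cannot drift apart.
tr_in = [inputs_demo[i] for i in train_idx]
tr_mask = [masks_demo[i] for i in train_idx]
val_in = [inputs_demo[i] for i in val_idx]
val_mask = [masks_demo[i] for i in val_idx]
```

scikit-learn's `train_test_split` also accepts more than two arrays in a single call, which gives the same guarantee without a second invocation.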

In [19]:
def data_loader(input_ids, tag, attention_mask, bs):
  t_inputs = torch.tensor(input_ids)
  t_tags = torch.tensor(tag)
  t_masks = torch.tensor(attention_mask)

  data = TensorDataset(t_inputs, t_masks, t_tags)
  dataloader = DataLoader(data, batch_size=bs)

  return data, dataloader

In [20]:
train_data,train_dataloader = data_loader(tr_inputs, tr_tags, tr_masks,bs)
valid_data,valid_dataloader = data_loader(val_inputs, val_tags, val_masks,bs)

## BERT Model
Initialize the DistilBERT token-classification model.

In [21]:
from transformers import DistilBertForTokenClassification

model = DistilBertForTokenClassification.from_pretrained(
    PRE_TRAINED_MODEL_NAME,
    num_labels=len(tag2idx),
    output_attentions = False,
    output_hidden_states = False
)

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForTokenClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this 

Move the model to the GPU.

In [22]:
model.to(device)  # uses the device selected earlier, so the CPU fallback also works

DistilBertForTokenClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
          

Set the optimizer, the number of epochs, and the learning-rate scheduler.

In [23]:
epochs = 5

#set optimizer
optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)
total_steps = len(train_dataloader) * epochs

#set schedule
scheduler = get_linear_schedule_with_warmup(
  optimizer,
  num_warmup_steps=0,
  num_training_steps=total_steps
)
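
With `num_warmup_steps=0`, the schedule decays the learning rate linearly from its initial value to zero over `total_steps`. The sketch below computes the multiplier applied to the base learning rate at each step in plain Python; `linear_lr_factor` is a hypothetical name used for illustration, not the library function.

```python
def linear_lr_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier on the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 up to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 down to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
total = 10  # illustrative; the notebook uses len(train_dataloader) * epochs
lrs = [base_lr * linear_lr_factor(s, 0, total) for s in range(total + 1)]
print(lrs[0], lrs[5], lrs[10])
```

Because the decay depends on `total_steps`, changing `epochs` or the batch size changes how quickly the learning rate falls.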

## Model Function
The training and evaluation functions are defined below.

### train_epoch function

In [24]:
def train_epoch(
  model, 
  data_loader, 
  optimizer, 
  device, 
  scheduler, 
  n_examples
):
  # Put the model into training mode.
  model.train()
  # Reset the total loss for this epoch.
  total_loss = 0

  # Collect predictions and gold labels to compute the training accuracy.
  predictions , true_labels = [], []

  # Training loop
  for step, batch in enumerate(data_loader):
      # add batch to gpu
      batch = tuple(t.to(device) for t in batch)
      b_input_ids, b_input_mask, b_labels = batch
      # Always clear any previously calculated gradients before performing a backward pass.
      model.zero_grad()
      # Forward pass. Because `labels` is provided, the model returns
      # a tuple whose first two elements are (loss, logits).
      outputs = model(b_input_ids, attention_mask=b_input_mask, labels=b_labels)

      # Move logits and labels to CPU
      logits = outputs[1].detach().cpu().numpy()
      label_ids = b_labels.to('cpu').numpy()
      predictions.extend([list(p) for p in np.argmax(logits, axis=2)])
      true_labels.extend(label_ids)

      # get the loss
      loss = outputs[0]
      # Perform a backward pass to calculate the gradients.
      loss.backward()
      # track train loss
      total_loss += loss.item()
      # Clip the norm of the gradient
      # This is to help prevent the "exploding gradients" problem.
      torch.nn.utils.clip_grad_norm_(parameters=model.parameters(), max_norm=1)
      # update parameters
      optimizer.step()
      # Update the learning rate.
      scheduler.step()

  # Calculate the average loss over the training batches.
  avg_train_loss = total_loss / len(data_loader)
  pred_tags = [tag_values[p_i] for p, l in zip(predictions, true_labels)
                                 for p_i, l_i in zip(p, l) if tag_values[l_i] != "PAD"]
  valid_tags = [tag_values[l_i] for l in true_labels
                                  for l_i in l if tag_values[l_i] != "PAD"]
  return accuracy_score(pred_tags, valid_tags),avg_train_loss

### eval_model function

In [25]:
def eval_model(model, data_loader, device, n_examples):

  # Put the model into evaluation mode
  model.eval()
  # Accumulate the loss over all batches of this evaluation pass.
  eval_loss = 0
  predictions , true_labels = [], []
  
  for batch in data_loader:
      batch = tuple(t.to(device) for t in batch)
      b_input_ids, b_input_mask, b_labels = batch

      # Telling the model not to compute or store gradients,
      # saving memory and speeding up validation
      with torch.no_grad():
          # Forward pass. Labels are provided, so the output tuple
          # starts with (loss, logits).
          outputs = model(b_input_ids, attention_mask=b_input_mask, labels=b_labels)

      # Move logits and labels to CPU
      logits = outputs[1].detach().cpu().numpy()
      label_ids = b_labels.to('cpu').numpy()

      # Accumulate the loss for this batch and store its predictions.
      eval_loss += outputs[0].mean().item()
      predictions.extend([list(p) for p in np.argmax(logits, axis=2)])
      true_labels.extend(label_ids)

  eval_loss = eval_loss / len(data_loader)
  pred_tags = [tag_values[p_i] for p, l in zip(predictions, true_labels)
                                 for p_i, l_i in zip(p, l) if tag_values[l_i] != "PAD"]
  valid_tags = [tag_values[l_i] for l in true_labels
                                  for l_i in l if tag_values[l_i] != "PAD"]
  return accuracy_score(pred_tags, valid_tags),eval_loss
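
Both functions score only the positions whose gold label is not `PAD`. Here is a self-contained sketch of that masked accuracy on toy index arrays; the tag list and the values are made up for illustration.

```python
demo_tag_values = ["B-NP", "I-NP", "O", "PAD"]

# Two padded sequences of label indices (3 == PAD).
true_ids = [[0, 1, 2, 3], [2, 0, 3, 3]]
pred_ids = [[0, 1, 1, 0], [2, 2, 0, 0]]

pairs = [
    (demo_tag_values[p], demo_tag_values[t])
    for p_row, t_row in zip(pred_ids, true_ids)
    for p, t in zip(p_row, t_row)
    if demo_tag_values[t] != "PAD"  # predictions at padded positions are ignored
]
accuracy = sum(p == t for p, t in pairs) / len(pairs)
print(accuracy)
```

Filtering on the gold label rather than on the prediction means a predicted `PAD` at a real token still counts as an error.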

## Training Process

In [26]:
from sklearn.metrics import accuracy_score
## Store the average loss after each epoch so we can plot them.
loss_values, validation_loss_values = [], []

for epoch in range(epochs):
    # ========================================
    #               Training
    # ========================================
    # Perform one full pass over the training set.

    print(f'Epoch {epoch + 1}/{epochs}')
    print('-' * 10)

    # Call the train_epoch function
    train_acc, train_loss = train_epoch(
      model,
      train_dataloader,    
      optimizer, 
      device, 
      scheduler, 
      len(train_data)
    )

    # Print the training loss and accuracy, and store the loss for plotting
    print(f'Train loss {train_loss} accuracy {train_acc}')
    loss_values.append(train_loss)


    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.

    val_acc, val_loss = eval_model(
      model,
      valid_dataloader,    
      device, 
      len(valid_data)
    )
    
    # Print the validation loss and accuracy, and store the loss for plotting
    print(f'Val   loss {val_loss} accuracy {val_acc}')
    validation_loss_values.append(val_loss)
    print()


Epoch 1/5
----------
Train loss 0.3107265374726719 accuracy 0.9063687466633169
Val   loss 0.11248449701815844 accuracy 0.9692415402567095

Epoch 2/5
----------
Train loss 0.08420568990654179 accuracy 0.9765045169529671
Val   loss 0.1069880628160068 accuracy 0.9718553092182031

Epoch 3/5
----------
Train loss 0.05172471240872428 accuracy 0.9861717384249808
Val   loss 0.10925127792039088 accuracy 0.9721353558926488

Epoch 4/5
----------
Train loss 0.03755601450786113 accuracy 0.9900553758544526
Val   loss 0.11047332481081996 accuracy 0.9725087514585764

Epoch 5/5
----------
Train loss 0.030461816946857623 accuracy 0.9921228108741849
Val   loss 0.11190409905144147 accuracy 0.973022170361727



## Evaluation on the Test Data

In [27]:
tokenized_texts_test, labels_test = tokenized(sentences_test,labels_test)

In [28]:
input_ids_test, tags_test, attention_masks_test = get_input_tags_attention(tokenized_texts_test, labels_test, MAX_LEN)

In [29]:
test_data,test_dataloader = data_loader(input_ids_test, tags_test, attention_masks_test,bs)

In [31]:
train_acc, train_loss = eval_model(
      model,
      train_dataloader,    
      device, 
      len(train_data)
)
val_acc, val_loss = eval_model(
      model,
      valid_dataloader,    
      device, 
      len(valid_data)
)
test_acc, test_loss = eval_model(
      model,
      test_dataloader,    
      device, 
      len(test_data)
)

In [32]:
print(train_acc)
print(val_acc)
print(test_acc)

0.9952108783720127
0.973022170361727
0.9710286188303608
