In [2]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [1]:
import gc
import random
import torch
from numpy import asarray, ndarray, nonzero
from pandas import DataFrame, read_csv
from transformers import AutoModel, AutoTokenizer, BatchEncoding, PreTrainedModel, PreTrainedTokenizer

In [2]:
def clear_cache():
    gc.collect()
    torch.cuda.empty_cache()
clear_cache()

In [3]:
CUDA = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
CPU = torch.device("cpu")

Seeds are set according to the IndoBERT paper (https://arxiv.org/pdf/2009.05387)
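
To illustrate why a fixed seed matters, here is a minimal sketch using only Python's standard `random` module: two generators created with the same seed replay the same draws. `torch.manual_seed` plays the analogous role for PyTorch's generator. The helper name `draw_three` is hypothetical and is not used elsewhere in this notebook.

```python
import random


def draw_three(seed: int) -> list:
    """Create a private generator with the given seed and draw three floats."""
    rng = random.Random(seed)  # private generator, global state is untouched
    return [rng.random() for _ in range(3)]


# Same seed, same sequence; a different seed gives a different sequence.
first = draw_three(42)
second = draw_three(42)
other = draw_three(43)
print(first == second)
print(first == other)
```

A private `random.Random` instance is used here so that running the sketch does not disturb the global seed set in the next cell.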

In [4]:
# set random seeds
seed = 42
random.seed(seed)
torch.manual_seed(seed)

<torch._C.Generator at 0x7f686936c5f0>

# 0. Set Up and Mount Google Drive

To run this notebook, please do the following first:
1. Create a folder called `NLP` in your own Google Drive.
2. Copy everything from the following link (https://drive.google.com/drive/folders/1prXBSo990_33GJbsntehgl-FFSPeX7EB?usp=sharing) into your `NLP` folder. Make sure that the folder structure remains unchanged.
3. Mount your Google Drive into Colab.

# 1. Download the Transformer

In [5]:
transformer_url: str = "indobenchmark/indobert-base-p2"
tokenizer: PreTrainedTokenizer = AutoTokenizer.from_pretrained(transformer_url)
transformer: PreTrainedModel = AutoModel.from_pretrained(transformer_url).to(CUDA)

# 2. Load and Examine the Data

In this section, we will:
- load the training and testing data,
- determine what kinds of data preprocessing we are going to do,
- store the data as PyTorch `Dataset` objects,
- prepare `DataLoader`s for each `Dataset`.

## 2.1. Load the Data

*To load the data, you need to mount your Google Drive into Colab.*

The file `train.csv` is our training data, while `test.csv` is our testing data. Also, classification labels are stored separately in the file `labels.txt`.

We will load all 3 files.

In [6]:
train_df: DataFrame = read_csv("/content/drive/MyDrive/NLP/data_worthcheck/train.csv", index_col=0)
test_df: DataFrame = read_csv("/content/drive/MyDrive/NLP/data_worthcheck/test.csv")
with open("/content/drive/MyDrive/NLP/data_worthcheck/labels.txt") as f:
    labels: list = f.readlines()
labels: ndarray = asarray([line.rstrip("\n") for line in labels])

In [7]:
train_df

Unnamed: 0,text_a,label
0,betewe buka twitter cuman ngetweet liat home b...,no
1,mas piyuuu mugo2 corona tuh mulut tersumpal ma...,no
2,e100ss gini buka informasi sejelas nya identit...,yes
3,neng solo wes ono terduga corona cobo neng ati...,no
4,midiahn nii akun gak takut takut nya isu coron...,no
...,...,...
21596,depok panas ga karuan kereta sampe pasming huj...,no
21597,oxfara arie kriting yg lebi goblo nya orang ke...,no
21598,virus corona menyaba depok cuci tangan makan n...,no
21599,mata sipit tinggal depok udah abis dah bahan c...,no


In [8]:
test_df

Unnamed: 0,text_a,label
0,jek dajal ga depok bang,no
1,detikcom untung depok masuk wilayah nya ridwan...,no
2,df dom jakarta depok yg gunain vc cabang nya c...,no
3,your2rl depok jkt,no
4,doakan indonesia selamat virus corona pkb depo...,yes
...,...,...
2795,ku tenang2 bae ku sih ya corona nya ga depok k...,no
2796,guru hati hati ya virus corona uda indonesia t...,yes
2797,4 terawan menyebut virus corona indonesia terd...,yes
2798,realffk buhari can t pronounce corona virus,no


In [9]:
labels

array(['no', 'yes'], dtype='<U3')

## 2.2. Determine What Kinds of Data Preprocessing We Are Going to Do

### 2.2.1. Encode labels

The labels are 'no' and 'yes'. We will encode 'no' as 0, and 'yes' as 1.

In [10]:
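def encode_label(labels: ndarray, label: str) -> int:
    # index of the first position in `labels` that equals `label`,
    # converted from a NumPy integer to a plain Python int
    return int(nonzero(labels == label)[0][0])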

In [11]:
for label in labels:
    print(f"- {label} -> {encode_label(labels, label)}")

- no -> 0
- yes -> 1


Perfect.
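
The same mapping can also be sketched with a plain dictionary, which avoids scanning the label array on every lookup. This is an illustrative alternative, not the approach used in the rest of this notebook; the names `label_names`, `label_to_id` and `id_to_label` are hypothetical.

```python
# Labels in the same order as labels.txt.
label_names = ["no", "yes"]

# Build forward and reverse lookups from the list order.
label_to_id = {name: i for i, name in enumerate(label_names)}
id_to_label = {i: name for name, i in label_to_id.items()}

print(label_to_id)     # {'no': 0, 'yes': 1}
print(id_to_label[1])  # yes
```

The reverse lookup is handy when turning a predicted class index back into a readable label.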

### 2.2.2. Tokenized Text Truncation/Padding

Next, we need to know the max length of our **tokenized** texts.

In [12]:
print("Longest tokenized text length:")
print(f"- train.csv: {train_df['text_a'].map(lambda x: tokenizer(x, return_tensors='pt').input_ids.size()[1]).max()} tokens")
print(f"- test.csv: {test_df['text_a'].map(lambda x: tokenizer(x, return_tensors='pt').input_ids.size()[1]).max()} tokens")

Longest tokenized text length:
- train.csv: 1971 tokens
- test.csv: 465 tokens


*Yikes!* BERT accepts at most `512` tokens per input, so every text longer than that has to be **truncated**. Shorter texts will be **padded**, so every text ends up as exactly `512` tokens.
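
As a rough illustration of what truncation and padding do to a sequence of token ids, here is a pure-Python sketch. It is not how the Hugging Face tokenizer is implemented: the real tokenizer also preserves special tokens such as `[CLS]` and `[SEP]` and uses its own padding id. The function name `pad_or_truncate` and the padding id `0` are assumptions made for this example.

```python
def pad_or_truncate(token_ids, max_length=512, pad_id=0):
    """Return (ids, attention_mask), both exactly `max_length` long."""
    ids = list(token_ids)[:max_length]  # drop everything past max_length
    mask = [1] * len(ids)               # 1 marks a real token
    n_pad = max_length - len(ids)       # how many padding slots are needed
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# A short text is padded, a long text is cut off (max_length=8 for readability).
short_ids, short_mask = pad_or_truncate([1, 2, 3, 4, 5], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1, 2000)), max_length=8)

print(short_ids, short_mask)  # [1, 2, 3, 4, 5, 0, 0, 0] [1, 1, 1, 1, 1, 0, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5, 6, 7, 8] [1, 1, 1, 1, 1, 1, 1, 1]
```

The attention mask is what lets the model ignore the padded positions.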

## 2.3. Store the Data As PyTorch `Dataset` Objects

In [13]:
class WorthcheckDataset(torch.utils.data.Dataset):

    def __init__(self, df: DataFrame, labels: ndarray) -> None:
        super().__init__()
        self.df: DataFrame = df
        self.labels: ndarray = labels
    
    def __len__(self) -> int:
        return self.df.shape[0]
    
    def __getitem__(self, idx) -> tuple:
        return (self.df["text_a"][idx], encode_label(self.labels, self.df["label"][idx]))

In [14]:
train_dataset: WorthcheckDataset = WorthcheckDataset(train_df, labels)
test_dataset: WorthcheckDataset = WorthcheckDataset(test_df, labels)

## 2.4. Prepare the `DataLoader`s

We need to create dataloaders that split each dataset into **batches**. The creators of IndoBERT (https://arxiv.org/pdf/2009.05387) used a batch size of `16` for fine-tuning, so our training dataloader also uses a batch size of `16`. Our test dataloader uses a batch size of `len(test_dataset) // 10`, so the test set is evaluated in about 10 large batches.

In [25]:
train_dataloader: torch.utils.data.DataLoader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True, pin_memory=True)
test_dataloader: torch.utils.data.DataLoader = torch.utils.data.DataLoader(test_dataset, batch_size=(len(test_dataset) // 10), pin_memory=True)
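
To make the batch sizes above concrete, the following sketch works out how many batches each dataloader yields per pass. It assumes the dataset sizes reported later in the training and test logs (21601 and 2800 examples) and the default `DataLoader` behaviour of keeping the last, smaller batch. The helper name `n_batches` is hypothetical.

```python
import math


def n_batches(n_examples: int, batch_size: int) -> int:
    """Number of batches when the last partial batch is kept."""
    return math.ceil(n_examples / batch_size)


train_batches = n_batches(21601, 16)             # 1350 full batches + 1 leftover example
test_batch_size = 2800 // 10                     # same expression as in the cell above
test_batches = n_batches(2800, test_batch_size)

print(train_batches)    # 1351
print(test_batch_size)  # 280
print(test_batches)     # 10
```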

# 3. Define Model

Our model has two main parts:
- Transformer
- Classifier

The transformer's role is to encode texts into ~vectors~ tensors. The classifier then maps the transformer's pooled output to a single probability that the label is 'yes'.

In [16]:
class TransformerBinaryClassifier(torch.nn.Module):

    def __init__(self, transformer: PreTrainedModel) -> None:
        super().__init__()
        self.transformer: PreTrainedModel = transformer
        self.classifier: torch.nn.Sequential = torch.nn.Sequential(
            torch.nn.Linear(transformer.config.hidden_size, 1),
            torch.nn.Sigmoid()
        )
    
    def forward(self, x: BatchEncoding) -> torch.Tensor:
        z: torch.Tensor = self.transformer(**x)
        y_tilde: torch.Tensor = self.classifier(z.pooler_output)
        return y_tilde

In [17]:
model: TransformerBinaryClassifier = TransformerBinaryClassifier(transformer)
criterion: torch.nn.Module = torch.nn.BCELoss()
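
For intuition about what the `Sigmoid` output and `BCELoss` compute, here is a pure-Python sketch for a single example. It mirrors the standard formulas only: `torch.nn.BCELoss` additionally averages over the batch by default. The function names `sigmoid` and `binary_cross_entropy` are hypothetical helpers for this illustration.

```python
import math


def sigmoid(logit: float) -> float:
    """Squash a real-valued score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def binary_cross_entropy(p: float, y: int) -> float:
    """Loss for predicted probability p and true label y (0 or 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


p = sigmoid(0.0)                          # an undecided model predicts 0.5
loss_if_yes = binary_cross_entropy(p, 1)  # equals ln(2)

confident = sigmoid(4.0)                  # strongly predicts 'yes'
print(binary_cross_entropy(confident, 1) < binary_cross_entropy(confident, 0))  # True
```

A confident prediction is cheap when it is right and expensive when it is wrong, which is the behaviour the loss is meant to encourage.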

# 4. Fine-tune the Model

## 4.1. Define Training and Testing Loops

In [18]:
def fit(dataloader: torch.utils.data.DataLoader, tokenizer: PreTrainedTokenizer, model: torch.nn.Module, criterion: torch.nn.Module, optimizer: torch.optim.Optimizer) -> tuple:
  
    # set model to training mode
    model.to(CUDA).train()

    # log
    epoch_loss: float = 0
    epoch_correct: int = 0
    epoch_count: int = 0

    # load a batch of data
    for X, y in dataloader:

        # tokenize the batch, then move inputs and labels to the device
        Z: BatchEncoding = tokenizer(list(X), padding="max_length", truncation=True, max_length=512, return_tensors="pt").to(CUDA)
        y: torch.Tensor = y.view(-1, 1).to(torch.float).to(CUDA)

        # forward pass
        y_tilde: torch.Tensor = model(Z)
        loss: torch.Tensor = criterion(y_tilde, y)

        # backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # log
        epoch_loss += loss.item()
        epoch_correct += (y_tilde.round() == y).sum().item()
        epoch_count += y.size(dim=0)
        clear_cache()
    
    # return log
    return (epoch_loss, epoch_correct, epoch_count)

In [21]:
def evaluate(dataloader: torch.utils.data.DataLoader, tokenizer: PreTrainedTokenizer, model: torch.nn.Module, criterion: torch.nn.Module) -> tuple:
  
    # set model to evaluation mode
    model.to(CUDA).eval()

    # log
    epoch_loss: float = 0
    epoch_correct: int = 0
    epoch_count: int = 0

    # load a batch of data
    for X, y in dataloader:

        # tokenize the batch, then move inputs and labels to the device
        Z: BatchEncoding = tokenizer(list(X), padding="max_length", truncation=True, max_length=512, return_tensors="pt").to(CUDA)
        y: torch.Tensor = y.view(-1, 1).to(torch.float).to(CUDA)

        # forward pass
        with torch.no_grad():
            y_tilde: torch.Tensor = model(Z)
            loss: torch.Tensor = criterion(y_tilde, y)

        # log
        epoch_loss += loss.item()
        epoch_correct += (y_tilde.round() == y).sum().item()
        epoch_count += y.size(dim=0)
        clear_cache()
    
    # return log
    return (epoch_loss, epoch_correct, epoch_count)
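
Both loops return the **sum** of the per-batch losses, so the reported loss grows with the number of batches and is not directly comparable between dataloaders of different lengths. A minimal sketch of how the returned tuple could be turned into per-batch figures follows. The helper name `summarize` and the numbers passed to it are made up for illustration; in the notebook the number of batches would be `len(dataloader)`.

```python
def summarize(epoch_loss: float, epoch_correct: int, epoch_count: int, n_batches: int) -> tuple:
    """Convert summed loss and raw counts into mean batch loss and accuracy."""
    mean_loss = epoch_loss / n_batches      # average loss per batch
    accuracy = epoch_correct / epoch_count  # fraction of correct predictions
    return mean_loss, accuracy


# Made-up numbers: 12 batches, 100 examples, 90 of them classified correctly.
mean_loss, accuracy = summarize(epoch_loss=6.0, epoch_correct=90, epoch_count=100, n_batches=12)
print(mean_loss)  # 0.5
print(accuracy)   # 0.9
```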

## 4.2. Configure Training Hyperparameters

Again, we match these with the IndoBERT paper (https://arxiv.org/pdf/2009.05387).

In [None]:
n_epochs: int = 25
learning_rate: float = 4e-5
optimizer: torch.optim.Optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

## 4.3. Train and Save the Best and Final Models

*The models are saved into (and subsequently, loaded from) Google Drive. When saving, any previously saved models created from this notebook will be overwritten.*

In [None]:
best_loss: float = float("inf")
for i_epoch in range(n_epochs):
    epoch_loss, epoch_correct, epoch_count = fit(train_dataloader, tokenizer, model, criterion, optimizer)
    print(f"Epoch [{i_epoch + 1}/{n_epochs}], loss: {epoch_loss}, accuracy: {epoch_correct / epoch_count} ({epoch_correct}/{epoch_count})")
    if epoch_loss < best_loss:
        torch.save(model.state_dict(), f"/content/drive/MyDrive/NLP/models/best/indobert_classifier{i_epoch + 1}.pt")
        best_loss = epoch_loss
torch.save(model.state_dict(), "/content/drive/MyDrive/NLP/models/final/indobert_classifier_final.pt")

Epoch [1/25], loss: 413.94131806865335, accuracy: 0.8742187861673071 (18884/21601)
Epoch [2/25], loss: 307.47342503722757, accuracy: 0.9136151104115551 (19735/21601)
Epoch [3/25], loss: 227.05701800296083, accuracy: 0.9386139530577288 (20275/21601)
Epoch [4/25], loss: 160.79203585302457, accuracy: 0.9604184991435581 (20746/21601)


Sadly, we lost the 4th epoch model because our web browser crashed. We then continued fine-tuning from the saved 3rd epoch model.

In [None]:
model.load_state_dict(torch.load("/content/drive/MyDrive/NLP/models/best/indobert_classifier3.pt"))
optimizer: torch.optim.Optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
best_loss: float = float("inf")
for i_epoch in range(3, n_epochs):
    epoch_loss, epoch_correct, epoch_count = fit(train_dataloader, tokenizer, model, criterion, optimizer)
    print(f"Epoch [{i_epoch + 1}/{n_epochs}], loss: {epoch_loss}, accuracy: {epoch_correct / epoch_count} ({epoch_correct}/{epoch_count})")
    if epoch_loss < best_loss:
        torch.save(model.state_dict(), f"/content/drive/MyDrive/NLP/models/best/indobert_classifier{i_epoch + 1}.pt")
        best_loss = epoch_loss
torch.save(model.state_dict(), "/content/drive/MyDrive/NLP/models/final/indobert_classifier_final.pt")

Epoch [4/25], loss: 178.99997822847217, accuracy: 0.9578723207258923 (20691/21601)
Epoch [5/25], loss: 147.40243205893785, accuracy: 0.9660663858154716 (20868/21601)


After the 5th epoch, we had to stop fine-tuning because we hit Google Colab's GPU usage limit (fine-tuning for 1 epoch takes approximately 35 minutes).

# 5. Test the Saved Models on the Test Set

Using a Google Colab CPU instance, we were able to test the saved models on the test set. Each test takes about 1 hour and 40 minutes because it runs on a CPU.

*Again, the models are loaded from Google Drive.*

In [26]:
for i_epoch in range(5):
    model.load_state_dict(torch.load(f"/content/drive/MyDrive/NLP/models/best/indobert_classifier{i_epoch + 1}.pt", map_location=CUDA))
    epoch_loss, epoch_correct, epoch_count = evaluate(test_dataloader, tokenizer, model, criterion)
    print(f"Epoch {i_epoch + 1} model, test set loss: {epoch_loss}, accuracy: {epoch_correct / epoch_count} ({epoch_correct}/{epoch_count})")

Epoch 1 model, test set loss: 3.000379726290703, accuracy: 0.8785714285714286 (2460/2800)
Epoch 2 model, test set loss: 3.3991998434066772, accuracy: 0.865 (2422/2800)
Epoch 3 model, test set loss: 3.348657712340355, accuracy: 0.8760714285714286 (2453/2800)
Epoch 4 model, test set loss: 4.028535783290863, accuracy: 0.86 (2408/2800)
Epoch 5 model, test set loss: 5.872631788253784, accuracy: 0.7696428571428572 (2155/2800)
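
The log above can be read off programmatically. The sketch below copies the logged test losses and correct-prediction counts into a dictionary and looks up which saved epoch scores best on each. It only summarizes the numbers already printed; choosing a checkpoint for real use would normally rely on a separate validation split, which this notebook does not have. The variable names are hypothetical.

```python
# (test loss, correct predictions out of 2800) per saved epoch, copied from the log above.
results = {
    1: (3.000379726290703, 2460),
    2: (3.3991998434066772, 2422),
    3: (3.348657712340355, 2453),
    4: (4.028535783290863, 2408),
    5: (5.872631788253784, 2155),
}

best_by_loss = min(results, key=lambda epoch: results[epoch][0])      # lowest loss
best_by_accuracy = max(results, key=lambda epoch: results[epoch][1])  # most correct

print(best_by_loss, best_by_accuracy)  # 1 1
```

Both criteria point to the epoch 1 model, while training accuracy kept rising through epoch 5, which suggests the later epochs overfit the training data.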
