
# Fine-tuning a BERT model for text classification

BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only transformer architecture with wide applications in NLP. In this notebook we fine-tune a pretrained BERT model to classify reviews as positive or negative.

# Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import DataLoader,Dataset
from torch.optim import Adam
from torch.nn import BCELoss
from torch import accelerator
from torch import tensor
from transformers import BertForSequenceClassification
from transformers import AutoTokenizer


It is also handy to set a variable that holds the device we want to train the model on. That way we don't have to change the device manually if a GPU is no longer available.

In [2]:
device=accelerator.current_accelerator().type if accelerator.is_available() else "cpu"

# Hyperparameters

In [3]:
batch_size=8
lr=1e-5
epochs=5

# Data preparation and tokenisation

Our corpus is a CSV file of Amazon reviews with three columns: the label, the title of the review, and the text of the review itself.

For the tokeniser, we are lucky to have a pretrained one that we can load through `AutoTokenizer`. It returns the tokenised corpus in a dictionary-like object with the keys `input_ids`, `token_type_ids`, and `attention_mask`, which is later fed to the model through the `DataLoader`. Because the pretrained model expects its input in exactly this form, we can no longer use our homegrown tokeniser or that of `CountVectorizer`. We can, however, still use the `LabelEncoder` from `sklearn` to convert our labels into ones and zeroes.
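
To make the roles of `input_ids` and `attention_mask` more concrete, here is a minimal sketch of what `padding=True` does to a batch of sequences of different lengths. The helper `pad_batch`, the token ids, and the padding id of 0 are assumptions made up for illustration. The real tokeniser also adds special tokens and produces `token_type_ids`, which this sketch leaves out.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one (toy version)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids for two reviews of different lengths
toy_batch = pad_batch([[7, 8, 9], [7, 8]])
print(toy_batch)
```

After padding, every row has the same length, so the batch can be stacked into a single tensor, while the mask records which positions hold real tokens.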




In [4]:
corpus_file="xaa"
data=pd.read_csv(corpus_file,encoding="utf-8")
trainX,testX,trainY,testY=train_test_split(data["text"],data["label"],test_size=0.1)

label_encoder=LabelEncoder()

tokenizer=AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
train_tokens=tokenizer(list(trainX),truncation=True,padding=True)
test_tokens=tokenizer(list(testX),truncation=True,padding=True)
train_labels=label_encoder.fit_transform(trainY)
test_labels=label_encoder.transform(testY)



# Loading the data

Models implemented in `torch` can be trained by feeding them input data through the `Dataset` and `DataLoader` classes. The `DataLoader` delivers the data in batches of the specified size and, if requested, shuffles it.

To implement our own class that inherits from `Dataset`, we just need to define a couple of basic methods: one that returns the number of examples we have, and one that returns both the input and the corresponding output for a given index. In this case we have included the labels in the dictionary that is returned, but we could also have returned the labels separately.

In [5]:
class token_data(Dataset):
    def __init__(self,data,tokens,labels):
        # self.data will hold raw data (not tokenized) for __len__ method
        self.data=data
        self.tokens=tokens
        self.labels=labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self,idx):
        item={}
        for k,v in self.tokens.items():
            item[k]=tensor(v[idx])
        item["labels"]=tensor(self.labels[idx]).float()
        return item
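
As noted above, the labels could also be returned separately from the inputs. A minimal sketch of that variant follows. The class name `token_data_pairs` is hypothetical, and the toy dictionaries stand in for the real tokeniser output and encoded labels.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class token_data_pairs(Dataset):
    # Hypothetical variant of token_data that returns (inputs, label) pairs
    def __init__(self, tokens, labels):
        self.tokens = tokens
        self.labels = labels

    def __len__(self):
        # One label per example, so the labels give the dataset size
        return len(self.labels)

    def __getitem__(self, idx):
        inputs = {k: torch.tensor(v[idx]) for k, v in self.tokens.items()}
        label = torch.tensor(self.labels[idx]).float()
        return inputs, label

# Toy stand-ins for the tokeniser output and the encoded labels
toy_tokens = {
    "input_ids": [[101, 7, 102], [101, 8, 102], [101, 9, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
}
toy_labels = [0, 1, 1]

toy_loader = DataLoader(token_data_pairs(toy_tokens, toy_labels), batch_size=2)
for inputs, labels in toy_loader:
    print(inputs["input_ids"].shape, labels)
```

With this variant, a training loop would unpack each batch as `inputs, labels` instead of reading `batch["labels"]`.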

Now that we have our own `token_data` class, all that remains is to wrap our data in the `DataLoader` class.

In [6]:
train_loader=DataLoader(token_data(trainX,train_tokens,train_labels),shuffle=True,batch_size=batch_size)
test_loader=DataLoader(token_data(testX,test_tokens,test_labels),shuffle=False,batch_size=batch_size)

# Loading the pretrained BERT model

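We load the pretrained weights with `BertForSequenceClassification`, which adds a classification head on top of BERT. By default this head has two outputs, one per class. We use binary cross-entropy loss because our classification task is binary: the reviews are either positive or negative. Note that `BCELoss` expects probabilities between 0 and 1 rather than raw logits, so the model's output has to be converted to a probability before the loss is computed.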

In [7]:
model=BertForSequenceClassification.from_pretrained("google-bert/bert-base-cased").to(device)
optimizer=Adam(model.parameters(),lr=lr)
loss_fn=BCELoss()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
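
To make the binary cross-entropy loss more concrete, the sketch below computes it by hand for a few made-up predictions. The helper `binary_cross_entropy` and the probabilities are assumptions for illustration only. In the notebook itself the loss is computed by `BCELoss`.

```python
import math

def binary_cross_entropy(probs, targets):
    """Mean binary cross-entropy of predicted probabilities against 0/1 targets."""
    total = 0.0
    for p, y in zip(probs, targets):
        # Only the log-probability assigned to the true class contributes
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

# Two examples with targets 1 and 0, under three hypothetical sets of predictions
confident_right = binary_cross_entropy([0.9, 0.1], [1, 0])
undecided = binary_cross_entropy([0.5, 0.5], [1, 0])
confident_wrong = binary_cross_entropy([0.1, 0.9], [1, 0])
print(confident_right, undecided, confident_wrong)
```

The loss is small for confident correct predictions and grows quickly for confident wrong ones. A model that always predicts 0.5 scores ln 2, roughly 0.69, which is a useful reference point for an untrained classifier.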


# Training


In [None]:
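model.train()

for epoch in range(epochs):
    for i,batch in enumerate(train_loader):
        optimizer.zero_grad()
        input_ids=batch["input_ids"].to(device)
        attention_mask=batch["attention_mask"].to(device)
        labels=batch["labels"].to(device)

        prediction=model(input_ids,attention_mask=attention_mask)
        # BCELoss needs probabilities in [0,1], not raw logits: apply softmax over
        # the two class logits and keep the probability of the class encoded as 1
        prob_positive=prediction.logits.softmax(dim=1)[:,1]
        loss=loss_fn(prob_positive,labels)
        loss.backward()
        optimizer.step()

        batch_loss=loss.item()
        # BCELoss already averages over the batch; dividing by batch_size
        # only rescales the printed value
        print(f"Batch {i}, loss {batch_loss/batch_size:.2f}")
    # This reports the loss of the last batch in the epoch
    print(f"Epoch {epoch}, loss {batch_loss/batch_size:.2f}")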

Batch 0, loss 0.09
Batch 1, loss 0.10
Batch 2, loss 0.10
Batch 3, loss 0.10
Batch 4, loss 0.09
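
The `test_loader` built earlier can be used to check how well the fine-tuned model generalises. The following is a minimal sketch, not part of the original notebook. It assumes batches shaped like those produced by `token_data` and a model whose output has a `.logits` attribute with one column per class. The helper name `accuracy` is hypothetical.

```python
import torch

def accuracy(model, loader, device="cpu"):
    """Fraction of examples in `loader` whose predicted class matches the label."""
    model.eval()  # switch off dropout for evaluation
    correct, total = 0, 0
    with torch.no_grad():  # no gradients are needed when only predicting
        for batch in loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            logits = model(input_ids, attention_mask=attention_mask).logits
            predicted = logits.argmax(dim=1)  # index of the larger logit
            correct += (predicted == labels.long()).sum().item()
            total += labels.numel()
    return correct / total
```

It would be called as `accuracy(model, test_loader, device)`. Since the function puts the model into evaluation mode, `model.train()` should be called again before any further training.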
