<a href="https://colab.research.google.com/github/mobarakol/tutorial_notebooks/blob/main/Text_Classification_MultiClass_DistilBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Installing libraries

In [1]:
! pip -q install transformers

[K     |████████████████████████████████| 3.4 MB 18.2 MB/s 
[K     |████████████████████████████████| 67 kB 3.4 MB/s 
[K     |████████████████████████████████| 3.3 MB 46.8 MB/s 
[K     |████████████████████████████████| 895 kB 64.9 MB/s 
[K     |████████████████████████████████| 596 kB 73.6 MB/s 
[?25h

#Dataset
We are using the News aggregator dataset available at by [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/News+Aggregator)<br>
- There are `422937` rows of data
- CATEGORY News category (b = business, t = science and technology, e = entertainment, m = health)


Download dataset:

In [6]:
!wget -c https://archive.ics.uci.edu/ml/machine-learning-databases/00359/NewsAggregatorDataset.zip
!unzip -q NewsAggregatorDataset.zip -d data

--2022-01-19 16:40:05--  https://archive.ics.uci.edu/ml/machine-learning-databases/00359/NewsAggregatorDataset.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable

    The file is already fully retrieved; nothing to do.



Dataset Preparation

In [1]:
# Importing the libraries needed
import pandas as pd
import torch
import transformers
from torch.utils.data import Dataset, DataLoader
from transformers import DistilBertModel, DistilBertTokenizer

from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

def update_cat(x):
    return my_dict[x]

def encode_cat(x):
    if x not in encode_dict.keys():
        encode_dict[x]=len(encode_dict)
    return encode_dict[x]

df = pd.read_csv('./data/newsCorpora.csv', sep='\t', names=['ID','TITLE', 'URL', 'PUBLISHER', 'CATEGORY', 'STORY', 'HOSTNAME', 'TIMESTAMP'])
df = df[['TITLE','CATEGORY']] # Removing unwanted columns
my_dict = {
    'e':'Entertainment',
    'b':'Business',
    't':'Science',
    'm':'Health'
}

df['CATEGORY'] = df['CATEGORY'].apply(lambda x: update_cat(x))

encode_dict = {}
df['ENCODE_CAT'] = df['CATEGORY'].apply(lambda x: encode_cat(x))

Dataloader

In [2]:
from transformers import DistilBertModel, DistilBertTokenizer
class Triage(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.len = len(dataframe)
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len
        
    def __getitem__(self, index):
        title = str(self.data.TITLE[index])
        title = " ".join(title.split())
        inputs = self.tokenizer.encode_plus(
            title,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            pad_to_max_length=True,
            return_token_type_ids=True,
            truncation=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']

        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'targets': torch.tensor(self.data.ENCODE_CAT[index], dtype=torch.long)
        } 
    
    def __len__(self):
        return self.len

MAX_LEN = 512
TRAIN_BATCH_SIZE = 32
VALID_BATCH_SIZE = 32
EPOCHS = 1
LEARNING_RATE = 1e-05
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')

train_size = 0.8
train_dataset=df.sample(frac=train_size,random_state=200)
test_dataset=df.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)


print("FULL Dataset: {}".format(df.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("TEST Dataset: {}".format(test_dataset.shape))

training_set = Triage(train_dataset, tokenizer, MAX_LEN)
testing_set = Triage(test_dataset, tokenizer, MAX_LEN)

train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

FULL Dataset: (422419, 3)
TRAIN Dataset: (337935, 3)
TEST Dataset: (84484, 3)


Architecture of DistilBERT:

In [3]:
from transformers import DistilBertModel, DistilBertTokenizer
class DistilBERTClass(torch.nn.Module):
    def __init__(self):
        super(DistilBERTClass, self).__init__()
        self.l1 = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.pre_classifier = torch.nn.Linear(768, 768)
        self.dropout = torch.nn.Dropout(0.3)
        self.classifier = torch.nn.Linear(768, 4)

    def forward(self, input_ids, attention_mask):
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask)
        hidden_state = output_1[0]
        pooler = hidden_state[:, 0]
        pooler = self.pre_classifier(pooler)
        pooler = torch.nn.ReLU()(pooler)
        pooler = self.dropout(pooler)
        output = self.classifier(pooler)
        return output

model = DistilBERTClass()
model.to(device);

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Training Script

In [15]:
def calcuate_accu(big_idx, targets):
    n_correct = (big_idx==targets).sum().item()
    return n_correct


def train(epoch, model, training_loader, optimizer, loss_function):
    tr_loss = 0
    n_correct = 0
    nb_tr_steps = 0
    nb_tr_examples = 0
    model.train()
    for _,data in enumerate(training_loader, 0):
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)
        targets = data['targets'].to(device, dtype = torch.long)

        outputs = model(ids, mask)
        loss = loss_function(outputs, targets)
        tr_loss += loss.item()
        big_val, big_idx = torch.max(outputs.data, dim=1)
        n_correct += calcuate_accu(big_idx, targets)

        nb_tr_steps += 1
        nb_tr_examples+=targets.size(0)
        
        if _%50==0:
            loss_step = tr_loss/nb_tr_steps
            accu_step = (n_correct*100)/nb_tr_examples 
            print(f"Training Loss per 50\{len(training_loader)} steps: {loss_step}")
            print(f"Training Accuracy per 50\{len(training_loader)} steps: {accu_step}")

        optimizer.zero_grad()
        loss.backward()
        # # When using GPU
        optimizer.step()

    print(f'The Total Accuracy for Epoch {epoch}: {(n_correct*100)/nb_tr_examples}')
    epoch_loss = tr_loss/nb_tr_steps
    epoch_accu = (n_correct*100)/nb_tr_examples
    print(f"Training Loss Epoch: {epoch_loss}")
    print(f"Training Accuracy Epoch: {epoch_accu}")

    return 

def valid(model, testing_loader):
    model.eval()
    n_correct = 0; n_wrong = 0; total = 0; val_loss = 0; nb_val_steps = 0
    nb_val_examples = 0
    with torch.no_grad():
        for _, data in enumerate(testing_loader, 0):
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            targets = data['targets'].to(device, dtype = torch.long)
            outputs = model(ids, mask).squeeze()
            loss = loss_function(outputs, targets)
            val_loss += loss.item()
            big_val, big_idx = torch.max(outputs.data, dim=1)
            n_correct += calcuate_accu(big_idx, targets)

            nb_val_steps += 1
            nb_val_examples+=targets.size(0)
            
            if _%50==0:
                loss_step = val_loss/nb_val_steps
                accu_step = (n_correct*100)/nb_val_examples
                print(f"Validation Loss per 10\{len(testing_loader)} steps: {loss_step}")
                print(f"Validation Accuracy per 10\{len(testing_loader)} steps: {accu_step}")

    epoch_loss = val_loss/nb_val_steps
    epoch_accu = (n_correct*100)/nb_val_examples
    print(f"Validation Loss Epoch: {epoch_loss}")
    print(f"Validation Accuracy Epoch: {epoch_accu}")
    
    return epoch_accu


Training

In [18]:
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)

best_epoch, best_acc = 0.0, 0
for epoch in range(EPOCHS):
    train(epoch, model, training_loader, optimizer, loss_function)
    acc = valid(model, testing_loader)
    if acc > best_acc:
        best_acc = acc
        best_epoch = epoch
        torch.save(model.state_dict(), 'best_model_cifar10h.pth.tar')
    print('epoch: {}  acc: {:.4f}  best epoch: {}  best acc: {:.4f}'.format(
            epoch, acc, best_epoch, best_acc, optimizer.param_groups[0]['lr']))



Training Loss per 50\10561 steps: 0.9380739331245422
Training Accuracy per 50\10561 steps: 59.375
Training Loss per 50\10561 steps: 1.028627407317068
Training Accuracy per 50\10561 steps: 57.96568627450981
Training Loss per 50\10561 steps: 1.021858262543631
Training Accuracy per 50\10561 steps: 57.88985148514851
The Total Accuracy for Epoch 0: 57.88985148514851
Training Loss Epoch: 1.021858262543631
Training Accuracy Epoch: 57.88985148514851
Validation Loss per 10\2641 steps: 0.9856563806533813
Validation Accuracy per 10\2641 steps: 62.5
Validation Loss per 10\2641 steps: 0.9940446241229188
Validation Accuracy per 10\2641 steps: 58.76225490196079
Validation Loss per 10\2641 steps: 0.9879287977029781
Validation Accuracy per 10\2641 steps: 59.71534653465346
Validation Loss per 10\2641 steps: 0.9869846906883037
Validation Accuracy per 10\2641 steps: 59.41639072847682
Validation Loss per 10\2641 steps: 0.9823407859944585
Validation Accuracy per 10\2641 steps: 59.483830845771145
Validation 