<a href="https://colab.research.google.com/github/pranayprasad3/BERT/blob/main/DisitilBERT_For_Sequence_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sequence Classification with IMDb Reviews

We will download, tokenize, and train a model on the IMDb reviews dataset. 

## Downloading the dataset

In [1]:
import os
import tarfile
from torchvision.datasets.utils import download_url

In [2]:
dataset_url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
download_url(dataset_url, '.')

Downloading http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz to ./aclImdb_v1.tar.gz



In [3]:
with tarfile.open('./aclImdb_v1.tar.gz', 'r:gz') as tar:
    tar.extractall(path='/content/')




## Exploring the folder structure

We can see that the data is arranged in test and train folders. The files 'README', 'imdb.vocab', 'imdbEr.txt' are metadata files and are not important in this context.

In [4]:
data_dir = '/content/aclImdb'

print(os.listdir(data_dir))

['imdbEr.txt', 'train', 'test', 'imdb.vocab', 'README']


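The training data falls into three groups (again, ignore the metadata files):

1. pos - reviews with positive sentiment
2. neg - reviews with negative sentiment
3. unsup - unlabeled reviews intended for unsupervised learning (they carry no sentiment label)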

In [5]:
classes = os.listdir(data_dir + "/train")
print(classes)

['unsupBow.feat', 'unsup', 'pos', 'urls_pos.txt', 'labeledBow.feat', 'neg', 'urls_neg.txt', 'urls_unsup.txt']
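The listing above mixes the class folders with metadata files. Here is a minimal sketch of one way to keep only the sub-directories. The helper name list_class_dirs is hypothetical and not part of the original notebook.

```python
import os

def list_class_dirs(split_dir):
    """Return the sorted names of the sub-directories of split_dir, skipping plain files."""
    return sorted(
        name for name in os.listdir(split_dir)
        if os.path.isdir(os.path.join(split_dir, name))
    )
```

Calling list_class_dirs(data_dir + "/train") on the extracted dataset should leave only the neg, pos and unsup folders, given the listing shown above.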


Let's look at how many positive and negative example files we have:

In [6]:
pos_files = os.listdir(data_dir + "/train/pos")
print('No. of training examples for positive sentiment:', len(pos_files))
print(pos_files[:5])

No. of training examples for positive sentiment: 12500
['7253_10.txt', '7924_9.txt', '6955_10.txt', '4157_8.txt', '6734_10.txt']


In [7]:
neg_files = os.listdir(data_dir + "/train/neg")
print('No. of training examples for negative sentiment:', len(neg_files))
print(neg_files[:5])

No. of training examples for negative sentiment: 12500
['11976_1.txt', '4773_1.txt', '169_1.txt', '550_4.txt', '8000_4.txt']


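So there are 12,500 files each for positive and negative sentiment. Each file name has the form id_rating.txt, where the rating is on a 1-10 scale. Note that there are no files with a rating of 5 or 6: the dataset labels reviews rated 7 or higher as positive and 4 or lower as negative, presumably so that neutral reviews do not blur the two classes.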
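Since the rating is encoded in each file name, it can be recovered with a little string handling. The sketch below assumes every name follows the id_rating.txt pattern seen in the listings above. Both function names are hypothetical helpers, not part of the original notebook.

```python
from collections import Counter

def rating_from_filename(filename):
    """Extract the integer rating from a name like '7253_10.txt'."""
    stem = filename.rsplit(".", 1)[0]    # '7253_10'
    return int(stem.rsplit("_", 1)[1])   # 10

def rating_histogram(filenames):
    """Count how many files carry each rating."""
    return Counter(rating_from_filename(name) for name in filenames)

# File names copied from the two listings above.
sample = ['7253_10.txt', '7924_9.txt', '6955_10.txt', '4157_8.txt', '6734_10.txt',
          '11976_1.txt', '4773_1.txt', '169_1.txt', '550_4.txt', '8000_4.txt']
print(sorted(rating_histogram(sample).items()))
# [(1, 3), (4, 2), (8, 1), (9, 1), (10, 3)]
```

Running rating_histogram(pos_files + neg_files) on the full listings would be one way to check that no file is rated 5 or 6.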

## Reading the texts and their corresponding labels

In [8]:
from pathlib import Path

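def read_imdb_split(split_dir):
    # Read every review under split_dir/pos and split_dir/neg.
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text(encoding="utf-8"))
            # Compare strings with ==, not `is` (identity check); 0 = negative, 1 = positive.
            labels.append(0 if label_dir == "neg" else 1)

    return texts, labels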

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')

In [9]:
train_texts[0]

"Not sure one can call this an anti-war film, it shows war at an elite level. These are elite troops that know what they are doing and take great pride in it. Even when they are pacifist, they still enjoy the skill level and defeating their foes, even if it does go against being a pacifist. The movies is slow and rather uneventful and in many ways is rather tame as war movies go-more so by todays standards, no body parts flying off as in modern movies. It is brutal in other ways though as you see killing at a personal level. This is more of a thinking man's movie. Once you start to watch you don't want to miss anything. The thoughts of the men in the movie and their interactions, is what the movie is about- not the combat itself or a big exciting storyline. This maybe called a war triller.<br /><br />If you are into the skill of war, if you are into reading or seeing programs about the SAS and so on, YOU WANT TO WATCH THIS MOVIE!!!!!<br /><br />Comparable movies are The Hill (1965) wit

In [10]:
train_labels[0]

1

In [11]:
train_texts[17000]

"Rather then long dance sequences and close ups of the characters which made the film drag on - the movie would have been better served explaining the story and motivations of the characters.<br /><br />The marginalisation of Nubo, the minister, auntie, mother - and the dumbing down of the dynamic and IMPORTANT rivalry between hatsumo and mameha and hatsumo and sayuri made the movie lack any real depth. If you hadn't read the book you would not really understand why Sayuri loved the Chairman and why Mameha became her mentor at all.<br /><br />Visually the film was stunning - and the actors all did the best with the C rate script they were given, but that was all that was good about this movie."

In [12]:
train_labels[17000]

0

## Splitting the training set into training and validation set

In [13]:
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)
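The split above is random, so the class proportions in the two parts can drift slightly from 50/50. Here is a minimal sketch of a quick balance check. The helper label_fractions is hypothetical, and the toy labels stand in for train_labels or val_labels.

```python
from collections import Counter

def label_fractions(labels):
    """Return {label: fraction of examples} for a list of integer labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}

# Toy labels standing in for train_labels / val_labels.
print(label_fractions([1, 1, 0, 1]))   # {0: 0.25, 1: 0.75}
```

If the fractions need to match exactly, train_test_split accepts a stratify argument for that purpose, and a random_state argument to make the split reproducible.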

## Tokenization

Now let’s tackle tokenization. We’ll eventually train a classifier using pre-trained DistilBERT, so let’s use the DistilBERT tokenizer.

In [14]:
!pip install transformers
from transformers import DistilBertTokenizerFast

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/50/0c/7d5950fcd80b029be0a8891727ba21e0cd27692c407c51261c3c921f6da3/transformers-4.1.1-py3-none-any.whl (1.5MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers==0.9.4
  Downloading https://files.pythonhosted.org/packages/0f/1c/e789a8b12e28be5bc1ce2156cf87cb522b379be9cadc7ad8091a4cc107c4/tokenizers-0.9.4-cp36-cp36m-manylinux2010_x86_64.whl (2.9MB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp36-none-any.whl size=893261 sha256=8af21804195d9f0ca54

In [15]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')





Now we can simply pass our texts to the tokenizer. We’ll pass truncation=True and padding=True, which ensures that every sequence is truncated to be no longer than the model’s maximum input length and then padded to the length of the longest sequence in the batch. Because all sequences end up with the same length, we can feed batches of them into the model at the same time.

In [16]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

In [17]:
train_encodings[0:5]

[Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])]
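To make the effect of truncation=True and padding=True concrete, here is a simplified sketch on plain Python lists. It is not the tokenizer's implementation: the function name, the toy token ids and the small max_length are assumptions for illustration, and the real tokenizer keeps its special end token when it truncates, which this sketch does not.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each id list to max_length, then pad all lists to the longest remaining length.

    Returns (input_ids, attention_mask) as lists of lists.
    """
    truncated = [seq[:max_length] for seq in sequences]
    target = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in truncated]
    return input_ids, attention_mask

toy_batch = [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 102]]
ids, mask = pad_and_truncate(toy_batch, max_length=5)
print(ids)   # [[101, 7, 8, 102, 0], [101, 9, 102, 0, 0], [101, 1, 2, 3, 4]]
print(mask)  # [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Every encoding printed above reports num_tokens=512, which is what this logic produces once at least one review in the batch reaches a 512-token limit.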

## Turning our labels and encodings into a Dataset object

In [18]:
import torch

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)

## Fine-tuning with native PyTorch

In [21]:
from tqdm.notebook import tqdm
from torch.optim import AdamW  # transformers.AdamW is deprecated in later releases
from torch.utils.data import DataLoader
from transformers import DistilBertForSequenceClassification

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.to(device)
model.train()  # enable dropout for training

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

# weight_decay=0.0 matches the default of the transformers.AdamW used originally
optim = AdamW(model.parameters(), lr=5e-5, weight_decay=0.0)

for epoch in range(3):
    for batch in tqdm(train_loader):
        optim.zero_grad()  # clear gradients left over from the previous step
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs[0]  # the loss comes first when labels are passed
        loss.backward()
        optim.step()

model.eval()  # switch to inference mode (disables dropout)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi





DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       