Created by: c00k1ez (https://github.com/c00k1ez)


# Quora question pairs classification with BERT

Paraphrase detection is a challenging NLP problem: deciding whether two or more phrases have the same meaning.
In this notebook, we are going to build a baseline solution for an unusual classification task, where each input is a pair of questions rather than a single text.

For token embeddings we are going to use the BERT model. Read more [here](http://jalammar.github.io/illustrated-bert/) and [here](https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270).
 

In [1]:
import torch
import torch.nn.functional as F

from src.bert.models import BertClassifier
from src.bert.dataset import PairsDataset
from src.bert.data_parser import DataParser
from src.utils import seed_all

import transformers

In [2]:
seed_all(42)

In [11]:
config = {
    'model_name': 'distilbert-base-uncased',
    'pad_len': 256,
    'batch_size': 30,
    'lr': 5e-5
}

If you do not have the computational resources for full BERT, the simplest option is a distilled version of the model: DistilBERT by Hugging Face.
See the [blog post](https://medium.com/huggingface/distilbert-8cf3380435b5) about DistilBERT and this introduction to [distillation](https://blog.floydhub.com/knowledge-distillation/).

Other models you can use as well:
* `BertModel`
* `TransfoXLModel`
* `XLNetModel`
* `ElectraModel`
* `RobertaModel`
* `XLMRobertaModel`
* `AlbertModel`  
etc.

In [6]:
from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained(config['model_name'])
bert_model = DistilBertModel.from_pretrained(config['model_name'])

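# Freeze the DistilBERT weights. The attribute is `requires_grad`;
# a misspelled name would be set silently and freeze nothing.
for p in bert_model.parameters():
    p.requires_grad = False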

Downloading: 100%|██████████| 232k/232k [00:00<00:00, 438kB/s]
Downloading: 100%|██████████| 442/442 [00:00<00:00, 109kB/s]
Downloading: 100%|██████████| 268M/268M [04:22<00:00, 1.02MB/s]


In [10]:
parser = DataParser('./data/questions.csv')
train, test = parser.train_test_split()
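`DataParser.train_test_split` is one of the pieces you implement yourself. As a starting point, here is a minimal sketch of a stratified split in plain Python. It assumes each sample is a `(question1, question2, label)` tuple; the label position, the `stratified_split` name, and the `test_fraction` and `seed` parameters are illustrative assumptions, not part of the `DataParser` API.

```python
import random
from collections import defaultdict


def stratified_split(samples, test_fraction=0.2, seed=42):
    """Split samples so train and test keep similar label ratios.

    Assumes every sample is a tuple whose third element is the label
    (hypothetical layout, adapt it to your own data format).
    """
    rng = random.Random(seed)

    # Group samples by label so each class is split separately.
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[2]].append(sample)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_test = int(round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])

    # Shuffle again so the classes are interleaved.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Toy data: 6 non-duplicate pairs (label 0) and 4 duplicate pairs (label 1).
pairs = [('q{}a'.format(i), 'q{}b'.format(i), 0 if i < 6 else 1) for i in range(10)]
train_pairs, test_pairs = stratified_split(pairs, test_fraction=0.5)
```

Splitting per class matters when duplicates and non-duplicates are imbalanced: a purely random split can leave the test set with a different class ratio than the training set, which distorts the macro F1 score.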

In [17]:
from collections import Counter

print('Our dataset contains {} question pairs.'.format(len(train)))

tokens = []
for sample in parser.question_pairs:
    sample = tokenizer.tokenize(sample[0] + ' ' + sample[1])
    tokens.extend(sample)
counter = Counter(tokens)
print('There are {} unique tokens in the dataset and {} tokens in total.'.format(len(counter), sum(counter.values())))
print('')
print('Most common 10 tokens:')
for token, freq in counter.most_common(10):
    print('{} : {}'.format(token, freq))

Our dataset contains 404351 question pairs.
There are 25855 unique tokens in the dataset and 11083222 tokens in total.

Most common 10 tokens:
? : 852527
the : 378272
what : 327913
is : 270823
i : 224246
how : 220984
a : 212801
to : 206127
in : 201179
do : 161465


In [9]:
train_dataset = PairsDataset(train, tokenizer, config['pad_len'])
#test_dataset = PairsDataset(test, tokenizer, config['pad_len'])

train_loader = torch.utils.data.DataLoader(train_dataset, config['batch_size'], shuffle=True)
#test_loader = torch.utils.data.DataLoader(test_dataset, config['batch_size'], shuffle=False)

In [13]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = BertClassifier(bert_model).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=config['lr'])
criterion = torch.nn.CrossEntropyLoss()

First of all, you have to implement the `DataParser.train_test_split` method and the `validation` function so that you can validate with the macro F1 score.

## How to improve this model's results?
* Try other models instead of DistilBERT: classic BERT, ALBERT, RoBERTa, TinyBERT, etc.
* Implement a correct MeanPooling/MaxPooling layer (note that batches contain `[PAD]` tokens, which you have to exclude when computing the mean or max).
* Use a more complex model on top of the BERT embeddings.
* Try a siamese network that encodes the first and second questions independently and is trained with metric learning. Read more about it [here](https://parajain.github.io/metric_learning_tutorial/).
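For the pooling suggestion above, here is a minimal sketch of mean and max pooling that ignore `[PAD]` positions. It assumes you have the encoder's hidden states of shape `(batch, seq_len, dim)` and an attention mask with `1` for real tokens and `0` for padding; the function names are illustrative and are not taken from `BertClassifier`.

```python
import torch


def masked_mean_pool(hidden, attention_mask):
    """Average token vectors over real tokens only.

    hidden: (batch, seq_len, dim); attention_mask: (batch, seq_len), 1 = real token.
    """
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)                   # padded positions add 0
    counts = mask.sum(dim=1).clamp(min=1.0)               # avoid division by zero
    return summed / counts


def masked_max_pool(hidden, attention_mask):
    """Take the per-dimension max over real tokens only."""
    mask = attention_mask.unsqueeze(-1).bool()
    # Padded positions are set to -inf so they can never win the max.
    filled = hidden.masked_fill(~mask, float('-inf'))
    return filled.max(dim=1).values


# One sequence with two real tokens and one padded position holding junk values.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = torch.tensor([[1, 1, 0]])

mean_vec = masked_mean_pool(hidden, attention_mask)
max_vec = masked_max_pool(hidden, attention_mask)
```

A plain `hidden.mean(dim=1)` would mix the padded vector into the result, so short questions padded up to `pad_len` would be dominated by padding. Note that the max variant returns `-inf` for a sequence with no real tokens at all, so guard against empty inputs if they can occur.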

In [14]:
from sklearn.metrics import f1_score

def validation(model, test_loader, device):
    model.eval()
    avg_val_loss = []
    avg_val_loss_value = -1.0
    y_true = []
    y_pred = []
    ################### INSERT YOUR CODE HERE ###################
        
    ################### INSERT YOUR CODE HERE ###################
    model.train()
    return avg_val_loss_value, f1_score(y_true, y_pred, average='macro')
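To make clear what `average='macro'` reports, here is a small pure-Python sketch that computes the same quantity by hand: F1 is computed per class and the per-class scores are averaged with equal weight. The `macro_f1` helper is illustrative only; in the notebook itself `sklearn.metrics.f1_score` does this work.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Three non-duplicates and one duplicate; one non-duplicate is misclassified.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
score = macro_f1(y_true, y_pred)  # class 0: 0.8, class 1: 2/3
```

Because every class counts equally, the macro average penalizes a model that does well only on the majority class, which is why it suits an imbalanced duplicate-detection dataset better than accuracy.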

In [15]:
def train_epoch(model, train_loader, test_loader, optimizer, epoch_num, device, criterion, log_interval=200):
    losses = []
    avg_loss = []
    step = 1
    for batch in train_loader:
        optimizer.zero_grad()
        for key in batch.keys():
            batch[key] = batch[key].to(device)
        label = batch['label'].view(-1)
        logits = model(batch)
        loss = criterion(logits, label)
        avg_loss.append(loss.detach().item())
        if step % log_interval == 0:
            # Mean training loss over the last `log_interval` steps.
            train_loss = sum(avg_loss) / len(avg_loss)
            losses.append(train_loss)
            avg_loss = []
            print('epoch {}\t[{}/{}]\ttrain_loss = {:.4f}'.format(epoch_num, step, len(train_loader), train_loss))
        loss.backward()
        optimizer.step()
        step += 1
    return losses

In [None]:
EPOCHS = 5
losses = []
for epoch in range(EPOCHS):
    # Accumulate losses across epochs instead of overwriting them each time.
    losses.extend(train_epoch(model, train_loader, None, optimizer, epoch, device, criterion))