Created by: c00k1ez (https://github.com/c00k1ez)


The Transformer is a powerful architecture that shows state-of-the-art results in many seq2seq tasks, such as NMT and summarization, and especially in language modeling. Transformers are also good at text classification.

# Quora question pairs classification with BERT

 Paraphrase detection is a challenging NLP problem: deciding whether multiple phrases have the same meaning. 
 In this notebook, we are going to build a baseline solution for this unusual classification task. 

 For token embeddings we are going to use the BERT model. Read more [here](http://jalammar.github.io/illustrated-bert/) and [here](https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270).
 

In [6]:
import torch
import torch.nn.functional as F

from src.bert.models import BertClassifier
from src.bert.dataset import PairsDataset
from src.bert.data_parser import DataParser
from src.utils import seed_all

import transformers

In [2]:
seed_all(42)

In [3]:
config = {
    'model_name': 'distilbert-base-uncased',
    'pad_len': 256,
    'batch_size': 30,
    'lr': 5e-5
}

The simplest way to use BERT with limited computational resources is a distilled version of the model: DistilBert by HuggingFace. 
[Blogpost](https://medium.com/huggingface/distilbert-8cf3380435b5) about DistilBert and [distillation](https://blog.floydhub.com/knowledge-distillation/).

Other models that you can use:
* `BertModel`
* `TransfoXLModel`
* `XLNetModel`
* `ElectraModel`
* `RobertaModel`
* `XLMRobertaModel`
* `AlbertModel`  
etc.

_Warning!_ The models will be downloaded from the Internet. Their size ranges from about 100 MB to 1-2 GB.

In [9]:
from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained(config['model_name'])
bert_model = DistilBertModel.from_pretrained(config['model_name'])

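# Freeze the pretrained weights so that only the classifier on top is trained.
# The attribute is `requires_grad`; assigning to `require_grad` has no effect.
for p in bert_model.parameters():
    p.requires_grad = False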

In [7]:
parser = DataParser('./data/questions.tsv')
train, test = parser.train_test_split()

In [10]:
from collections import Counter

print('Our dataset contains {} question pairs.'.format(len(train)))

tokens = []
for sample in parser.question_pairs:
    sample = tokenizer.tokenize(sample[0] + ' ' + sample[1])
    tokens.extend(sample)
counter = Counter(tokens)
print('There are {} unique tokens in dataset and {} tokens at all.'.format(len(counter), sum([v for _,v in dict(counter).items()])))
print('Most common 10 tokens:')
for token, freq in counter.most_common(10):
    print('{} : {}'.format(token, freq))

Our dataset contains 404290 question pairs.
There are 25856 unique tokens in dataset and 11082400 tokens at all.

Most common 10 tokens:
? : 852404
the : 378243
what : 327860
is : 270750
i : 224198
how : 220949
a : 212794
to : 206110
in : 201156
do : 161421


In [9]:
train_dataset = PairsDataset(train, tokenizer, config['pad_len'])
#test_dataset = PairsDataset(test, tokenizer, config['pad_len'])

train_loader = torch.utils.data.DataLoader(train_dataset, config['batch_size'], shuffle=True)
#test_loader = torch.utils.data.DataLoader(test_dataset, config['batch_size'], shuffle=False)

In [13]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = BertClassifier(bert_model).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=config['lr'])
criterion = torch.nn.CrossEntropyLoss()

First of all, you have to implement the `DataParser.train_test_split` method and the `validation` function to validate with the macro F1 score.

## How to improve this model's results?
* Play with other models instead of DistilBert: classic BERT, ALBERT, RoBERTa, TinyBERT, etc.
* Implement a correct MeanPooling/MaxPooling layer (note that the inputs contain `[PAD]` tokens during training, and you have to exclude them when computing the mean or max).
* Use a more complex model on top of the BERT embeddings.
* You can try a siamese network that encodes the first and second questions independently, trained with metric learning. Read more about it [here](https://parajain.github.io/metric_learning_tutorial/).
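
The pooling idea from the list above can be sketched as follows. This is a minimal illustration, not the layer used in `BertClassifier`: the function names and the tensor layout (`hidden` of shape batch x sequence x dimension, `attention_mask` with 1 for real tokens and 0 for `[PAD]`) are assumptions made for this example.

```python
import torch

def masked_mean_pool(hidden, attention_mask):
    """Average token embeddings over real tokens only.

    hidden: (batch, seq_len, dim) float tensor of token embeddings.
    attention_mask: (batch, seq_len) tensor, 1 for real tokens, 0 for [PAD].
    """
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)                   # padded positions add 0
    counts = mask.sum(dim=1).clamp(min=1.0)               # avoid division by zero
    return summed / counts

def masked_max_pool(hidden, attention_mask):
    """Element-wise max over real tokens only."""
    mask = attention_mask.unsqueeze(-1).bool()
    # Padded positions are set to -inf so they can never win the max.
    return hidden.masked_fill(~mask, float('-inf')).max(dim=1).values

# One sequence of three positions; the last one is padding with large values.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = torch.tensor([[1, 1, 0]])
mean_vec = masked_mean_pool(hidden, attention_mask)
max_vec = masked_max_pool(hidden, attention_mask)
```

Without the mask, the padded position would dominate both results, which is why plain `mean` or `max` over the sequence dimension is not enough here.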

In [14]:
from sklearn.metrics import f1_score

def validation(model, test_loader, device):
    model.eval()
    avg_val_loss = []
    avg_val_loss_value = -1.0
    y_true = []
    y_pred = []
    ################### INSERT YOUR CODE HERE ###################
        
    ################### INSERT YOUR CODE HERE ###################
    model.train()
    return avg_val_loss_value, f1_score(y_true, y_pred, average='macro')
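
To make the metric concrete, here is a small pure-Python sketch of what a macro F1 score computes: the F1 score of each class is calculated separately and the results are averaged without weighting by class size. The helper name `macro_f1` is made up for this illustration; in the notebook itself `sklearn.metrics.f1_score` does this work.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
score = macro_f1(y_true, y_pred)
```

Because every class has the same weight, a model that ignores the rarer class is penalized more than it would be under plain accuracy.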

In [15]:
def train_epoch(model, train_loader, test_loader, optimizer, epoch_num, device, criterion, log_interval=200):
    losses = []
    avg_loss = []
    step = 1
    for batch in train_loader:
        optimizer.zero_grad()
        for key in batch.keys():
            batch[key] = batch[key].to(device)
        label = batch['label'].view(-1)
        logits = model(batch)
        loss = criterion(logits, label)
        avg_loss.append(loss.detach().item())
        if step % log_interval == 0:
            val_loss = sum(avg_loss) / len(avg_loss)
            losses.append(val_loss)
            avg_loss = []
            # The format string has four placeholders, so the averaged loss must be passed too
            print('epoch {}\t[{}/{}]\ttrain_loss = {:.4f}'.format(epoch_num, step, len(train_loader), val_loss))
        loss.backward()
        optimizer.step()
        step += 1
    return losses

In [None]:
EPOCHS = 5
losses = []
for epoch in range(EPOCHS):
    # Accumulate the logged losses instead of overwriting them every epoch
    losses.extend(train_epoch(model, train_loader, None, optimizer, epoch, device, criterion))

# Dialogue generation with GPT2

Our next task is to try out the text generation abilities of transformers. In this notebook we are going to work with GPT2, a model from OpenAI that showed state-of-the-art results for language modeling in 2019. You can read the original blog post [here](https://openai.com/blog/better-language-models/).  

Let's consider an interesting application of the GPT2 model: dialogue generation. To describe the task more clearly: we have some context, for example a user question, and our model has to generate a relevant answer.  
How can we train a model for it? First of all, in the input we need special tokens to mark the context and the model answer, like `[CONTEXT] some context [ANSWER] model answer`. Then there are two possible ways:
* train it like a classic autoregressive LM,
* train it like a seq2seq LM. Read more [here](https://arxiv.org/abs/1905.03197).  
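
The input format described above can be sketched in a few lines. This is an illustration only, not the code of `DialogueDataset`: the helper names are hypothetical, a whitespace split stands in for the real BPE tokenizer, and masking the context out of the loss is just one possible variation on classic LM training. The value `-100` is the default `ignore_index` of `torch.nn.CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_sample(context, answer):
    """Join a context and an answer into one training string."""
    return '[CONTEXT] {} [ANSWER] {}'.format(context, answer)

def build_labels(tokens, token_ids, answer_only=False):
    """Language-modeling labels for one sample.

    answer_only=False: classic LM, every token is a prediction target.
    answer_only=True: everything up to and including [ANSWER] is ignored in the loss.
    """
    if not answer_only:
        return list(token_ids)
    answer_start = tokens.index('[ANSWER]') + 1
    return [IGNORE_INDEX] * answer_start + list(token_ids[answer_start:])

text = build_sample('where are you?', 'at home')
tokens = text.split()  # toy whitespace tokenizer, a stand-in for BPE
vocab = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}
token_ids = [vocab[t] for t in tokens]

lm_labels = build_labels(tokens, token_ids)
answer_labels = build_labels(tokens, token_ids, answer_only=True)
```

At generation time the same format is used as a prompt: the input ends with `[ANSWER]` and the model continues from there.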

You can read more about GPT2 [here](http://jalammar.github.io/illustrated-gpt2/) and [here](https://towardsdatascience.com/openai-gpt-2-understanding-language-generation-through-visualization-8252f683b2f8).  
[Documentation](https://huggingface.co/transformers/model_doc/gpt2.html) for GPT2.

| ![seq2seq lm](https://drive.google.com/uc?export=view&id=1NxS-O0Tto2rcFrALhpUBbywriyKlSTL4) |
|:--:| 
| *seq2seq LM* |

In [1]:
import torch

import transformers
from transformers import GPT2Tokenizer, GPT2LMHeadModel, GPT2Config

from src.utils import get_answer, seed_all
from src.gpt2.data_parser import Dialogue, DataParser
from src.gpt2.dataset import DialogueDataset

In [2]:
seed_all(42)

In [3]:
params_config = {
    'pad_len': 100,
    'train_batch_size': 10,
    'model_name': 'gpt2',
    'lr': 5e-5,
    'residual_dropout': 0.1,
    'embedding_dropout': 0.1,
    'attention_dropout': 0.1
}

We are going to use the smallest GPT2 model: it has 124M trainable parameters and requires about 500 MB of disk space.

In [7]:
config = GPT2Config.from_pretrained(params_config['model_name'])
config.resid_pdrop = params_config['residual_dropout']
config.attn_pdrop = params_config['attention_dropout']
config.embd_pdrop = params_config['embedding_dropout']

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = GPT2LMHeadModel.from_pretrained(params_config['model_name'], config=config).to(device)

Downloading: 100%|██████████| 548M/548M [06:21<00:00, 1.44MB/s]


**Important**: we have to add the special tokens `[CONTEXT]` and `[ANSWER]` to the tokenizer, then resize the model embeddings.

A few words about the tokenizer.
GPT2, like some other models, uses [Byte-Pair Encoding](https://leimao.github.io/blog/Byte-Pair-Encoding/) with special tokens in the vocabulary.  

All tokenizers from `transformers` share a unified structure and the same methods, so we are going to use a few of them:
* `tokenizer.tokenize` to split a string into a list of tokens,
* `tokenizer.encode` to transform a string into token ids,
* `tokenizer.decode` to transform a list of ids back into a string.
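
To give a feeling for how Byte-Pair Encoding builds sub-word tokens, here is a toy sketch that learns merges on a handful of words. It is a simplified character-level version written for this illustration (GPT2 itself works on bytes and ships a pretrained merge table); the function names are hypothetical and ties between equally frequent pairs are broken arbitrarily but deterministically.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair."""
    corpus = Counter(tuple(w) for w in words)  # word -> frequency, as symbol tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=lambda p: (pairs[p], p))  # deterministic tie-break
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

def apply_bpe(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

merges = learn_bpe(['low', 'low', 'lower', 'lowest', 'new', 'newer'], num_merges=2)
pieces = apply_bpe('lowest', merges)
```

Frequent character sequences such as `low` end up as single tokens, while rare suffixes stay split into smaller pieces, so any string can still be encoded.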

In [9]:
tokenizer = GPT2Tokenizer.from_pretrained(params_config['model_name'])
tokenizer.add_special_tokens({'additional_special_tokens': ['[CONTEXT]', '[ANSWER]']})
model.resize_token_embeddings(len(tokenizer))

Embedding(50259, 768)

Now we can take a closer look at our dataset.

In [5]:
parser = DataParser('./data/TwitterLowerAsciiCorpus.txt')
train, test = parser.train_test_split()

In [15]:
from collections import Counter

print('Our dataset contains {} context-answer pairs and unique {} dialogues'.format(len(parser.all_pairs), len(parser.dialogues)))

tokens = []
for sample in parser.all_pairs:
    sample = tokenizer.tokenize(sample['context'] + ' ' + sample['answer'])
    tokens.extend(sample)
counter = Counter(tokens)
print('There are {} unique tokens in dataset and {} tokens at all. Notice, that small GPT2 have vocabulary with 50k sub-words.'.format(len(counter), sum([v for _,v in dict(counter).items()])))
print('')
print('Most common 10 tokens:')
for token, freq in counter.most_common(10):
    print('{} : {}'.format(token, freq))

Our dataset contains 8574 context-answer pairs and unique 1983 dialogues
There are 10046 unique tokens in dataset and 233751 tokens at all. Notice, that small GPT2 have vocabulary with 50k sub-words.

Most common 10 tokens:
. : 7196
Ġi : 7186
Ġthe : 4561
Ġto : 4025
Ġyou : 3977
, : 3602
Ġa : 3373
Ġit : 3218
Ġand : 2602
's : 2285


Now you can see that we have a **really** small dataset for our "toy" task.

In [17]:
train_dataset = DialogueDataset(train, tokenizer, params_config['pad_len'])
train_loader = torch.utils.data.DataLoader(train_dataset, params_config['train_batch_size'], shuffle=True)

**Another important point**: we are using the AdamW optimizer from the `transformers` package, **not** classic Adam and **not** AdamW from `torch.optim`!  
[Blogpost](https://www.fast.ai/2018/07/02/adam-weight-decay/) about AdamW.
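
The difference between classic Adam with L2 regularization and the decoupled weight decay of AdamW can be sketched on a single scalar weight. This is a simplified illustration of the idea, not the implementation in `transformers`: the function name and signature are made up for this example, and the decay is applied to the weight value from before the step.

```python
import math

def adam_update(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                eps=1e-8, l2=0.0, weight_decay=0.0):
    """One Adam step for a single scalar weight.

    l2 > 0           : classic Adam + L2, the decay term is added to the gradient.
    weight_decay > 0 : AdamW-style decoupled decay, applied to the weight directly.
    """
    g = grad + l2 * w
    m = beta1 * m + (1 - beta1) * g          # first moment estimate
    v = beta2 * v + (1 - beta2) * g * g      # second moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

# Zero task gradient, so only the regularization moves the weight.
w_l2, _, _ = adam_update(1.0, 0.0, 0.0, 0.0, t=1, l2=0.01)
w_decoupled, _, _ = adam_update(1.0, 0.0, 0.0, 0.0, t=1, weight_decay=0.01)
```

With L2, the decay term passes through Adam's adaptive normalization, so in this example the weight moves by roughly `lr` no matter how small `l2` is. With decoupled decay, the weight shrinks by exactly `lr * weight_decay * w`.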

In [18]:
optimizer = transformers.AdamW(model.parameters(), lr=params_config['lr'])

As you saw earlier, we have a small dataset, so it is quite hard to get a good result without overfitting.

## How to improve this model's results?
* Implement the `train_test_split` method in the `DataParser` class and a validation loop to calculate [perplexity](https://towardsdatascience.com/perplexity-intuition-and-derivation-105dd481c8f3).
* Find optimal `residual_dropout`, `embedding_dropout` and `attention_dropout` probabilities.
* Currently only the previous sentence is used as the context for an answer during training. You can rewrite the `Dialogue.get_pairs` method to sample one, two, three, or more sentences as the context.
* You can add more regularization, for example, drop random tokens from the sample, or swap the answer and the context with a small probability.
* Read about [BPE-dropout](https://arxiv.org/abs/1910.13267). It is hard to implement with `transformers`, so you can just read about this technique.
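
The idea of using several previous sentences as context can be sketched like this. The function `make_pairs` is hypothetical and is not the real `Dialogue.get_pairs`; it assumes a dialogue is a plain list of utterance strings and reuses the `context` and `answer` keys that appear in `parser.all_pairs`.

```python
def make_pairs(utterances, context_size=1):
    """Build context-answer pairs from one dialogue.

    The context of each answer is made of up to `context_size`
    utterances that directly precede it.
    """
    pairs = []
    for i in range(1, len(utterances)):
        start = max(0, i - context_size)
        pairs.append({
            'context': ' '.join(utterances[start:i]),
            'answer': utterances[i],
        })
    return pairs

dialogue = ['hi', 'hello there', 'how are you?', 'fine, thanks']
one_turn = make_pairs(dialogue, context_size=1)
two_turns = make_pairs(dialogue, context_size=2)
```

Longer contexts produce longer samples, so `pad_len` may need to grow together with `context_size`.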

In [None]:
def validation(model, test_loader, device):
    ################### INSERT YOUR CODE HERE ###################
        
    ################### INSERT YOUR CODE HERE ###################
    pass
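
Perplexity, the metric suggested for this validation loop, is the exponent of the average negative log-likelihood per token. A minimal sketch of the formula, assuming the per-token losses have already been collected into a list (the function name is made up for this example):

```python
import math

def perplexity(token_nlls):
    """exp of the mean negative log-likelihood per token (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that gives probability 1/4 to every target token
# is as uncertain as a uniform choice between 4 options.
nlls = [-math.log(0.25)] * 10
ppl = perplexity(nlls)
```

When the loss is averaged per batch, batches with different numbers of real tokens should be weighted by their token counts before exponentiating; averaging per-batch perplexities gives a different number.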

In [19]:
def train_epoch(model, loader, test_loader, optimizer, epoch_num, device, log_interval=100):
    losses = []
    avg_loss = []
    step = 1
    for batch in loader:
        optimizer.zero_grad()
        input_ids, mask, label = batch['sample'], batch['mask'], batch['label']
        input_ids = input_ids.to(device)
        mask = mask.to(device)
        label = label.to(device)
        outputs = model(input_ids, attention_mask=mask, labels=label)
        loss, logits = outputs[:2]
        avg_loss.append(loss.detach().item())
        if step % log_interval == 0:
            val_loss = sum(avg_loss) / len(avg_loss)
            losses.append(val_loss)
            avg_loss = []
            print('epoch {}\t[{}/{}]\tloss = {:.4f}'.format(epoch_num, step, len(loader), val_loss))
        loss.backward()
        optimizer.step()
        step += 1
    return losses

In [None]:
EPOCHS = 5
losses = []
for epoch in range(EPOCHS):
    ep_losses = train_epoch(model, train_loader, None, optimizer, epoch, device)
    losses.extend(ep_losses)  # keep the logged losses from every epoch

In [None]:
model = model.to(torch.device('cpu'))
model.eval()

In [None]:
get_answer("where are you?", model, tokenizer)