<a href="https://colab.research.google.com/github/graviraja/100-Days-of-NLP/blob/applications%2Fclassification/applications/classification/QQP%20Classification%20with%20BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Quora Duplicate Question Pair Detection with BERT

### Imports

In [0]:
!pip install sh
!pip install nlp
!pip install transformers
!pip install pytorch_lightning

Collecting sh
  Downloading https://files.pythonhosted.org/packages/fa/9c/796934ee6d990d504c600056aa435e31bd49dbfba37e81d2045d37c8bdaf/sh-1.13.1-py2.py3-none-any.whl (40kB)
Installing collected packages: sh
Successfully installed sh-1.13.1
Collecting nlp
  Downloading https://files.pythonhosted.org/packages/99/80/05b452119eb2213fc1b3c7647f39bd231b5804edc065168f4e43dce8026d/nlp-0.2.1-py3-none-any.whl (869kB)
Collecting pyarrow>=0.16.0
  Downloading https://files.pythonhosted.org/packages/ba/3f/6cac1714fff444664603f92cb9fbe91c7ae25375880158b9e9691c4584c8/pyarrow-0.17.1-cp36-cp36m-manylinux2014_x86_64.whl (63.8MB)

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/48/35/ad2c5b1b8f99feaaf9d7cdadaeef261f098c6e1a6a2935d4d07662a6b780/transformers-2.11.0-py3-none-any.whl (674kB)

I started from the [hugging-face-nlp demo](https://github.com/yk/huggingface-nlp-demo/blob/master/demo.py) and adapted it to the Quora question pair task.
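
To make the "question pair" part concrete, here is a minimal sketch of how BERT lays out a two-sentence input: `[CLS] question1 [SEP] question2 [SEP]`, with `token_type_ids` marking the segment and an attention mask hiding the padding. The function `encode_pair` is a hypothetical helper written for illustration; it keeps tokens as plain strings and does no WordPiece splitting or truncation, so it only mirrors the layout, not the real tokenizer.

```python
from typing import Dict, List


def encode_pair(q1: List[str], q2: List[str], max_len: int) -> Dict[str, list]:
    """Toy pair encoder (hypothetical helper): shows the BERT pair layout only."""
    tokens = ["[CLS]"] + q1 + ["[SEP]"] + q2 + ["[SEP]"]
    if len(tokens) > max_len:
        # Real tokenizers truncate here; this sketch simply refuses long inputs.
        raise ValueError("pair is longer than max_len")
    # Segment 0 covers [CLS], question 1 and the first [SEP];
    # segment 1 covers question 2 and the final [SEP].
    token_type_ids = [0] * (len(q1) + 2) + [1] * (len(q2) + 1)
    n_pad = max_len - len(tokens)
    return {
        "tokens": tokens + ["[PAD]"] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,
        # 1 for real tokens, 0 for padding
        "attention_mask": [1] * len(tokens) + [0] * n_pad,
    }


enc = encode_pair(["how", "to", "cook"], ["cooking", "tips"], max_len=10)
for name, values in enc.items():
    print(name, values)
```

The notebook gets the same three pieces from the real tokenizer and from the `input_ids != 0` mask in `forward`.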

The dataset is fairly large (about 400,000 question pairs), so running BERT on all of it takes 6-8 hours. I therefore trained on a subset of it. To use more or less of the data, change the `percent` value.
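
As a rough illustration of what `percent` controls, the sketch below builds the split expression that the notebook passes to `nlp.load_dataset(..., split=...)`. The helper name `split_expr` is an assumption made for this example; the notebook builds the same string inline with an f-string.

```python
def split_expr(split: str, percent: int, debug: bool = False, batch_size: int = 8) -> str:
    """Build a slice expression such as 'train[:20%]' (hypothetical helper)."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    # In debug mode load just one batch worth of rows, otherwise a percentage.
    size = str(batch_size) if debug else f"{percent}%"
    return f"{split}[:{size}]"


print(split_expr("train", 20))                   # train[:20%]
print(split_expr("validation", 20, debug=True))  # validation[:8]
```

Raising `percent` to 100 loads the whole split, at the cost of the longer training time mentioned above.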

### References:

- [Hugging-face-nlp-demo code](https://github.com/yk/huggingface-nlp-demo/blob/master/demo.py)
- [Hugging-face-nlp-demo video tutorial](https://www.youtube.com/watch?v=G3pOvrKkFuk&feature=youtu.be)


In [0]:
import sh

import nlp
import transformers
import torch as th
import pytorch_lightning as pl

In [0]:
device = th.device('cuda' if th.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [0]:
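debug = False                      # True: load only one batch of data and do a fast dev run
epochs = 10
batch_size = 8
lr = 1e-2                          # SGD learning rate
momentum = 0.9                     # SGD momentum
model_type = 'bert-base-uncased'
seq_length = 100                   # maximum number of tokens per encoded question pair
percent = 20                       # percentage of each split to load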


sh.rm('-r', '-f', 'logs')
sh.mkdir('logs')



In [0]:
class QQPClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = transformers.BertForSequenceClassification.from_pretrained(model_type)
        self.loss = th.nn.CrossEntropyLoss(reduction='none')
    
    def prepare_data(self, split="train"):
        tokenizer = transformers.BertTokenizer.from_pretrained(model_type)

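        def _tokenize(x):
            # batch_encode_plus takes a single list of inputs; its second
            # positional argument is add_special_tokens, so the two questions
            # must be zipped into (question1, question2) tuples.
            encoded = tokenizer.batch_encode_plus(
                    list(zip(x['question1'], x['question2'])),
                    max_length=seq_length,
                    pad_to_max_length=True)
            x['input_ids'] = encoded['input_ids']
            x['token_type_ids'] = encoded['token_type_ids']
            return x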

        def _prepare_ds(split):
            ds = nlp.load_dataset('glue', 'qqp', split=f'{split}[:{batch_size if debug else f"{percent}%"}]')
            ds = ds.map(_tokenize, batched=True)
            ds.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'label'])
            return ds

        self.train_ds, self.val_ds = map(_prepare_ds, ('train', 'validation'))

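    def forward(self, input_ids, token_type_ids):
        # Token id 0 is [PAD] in the BERT vocabulary, so non-zero ids are real tokens.
        mask = (input_ids != 0).float()
        # Without labels, the model returns a 1-tuple holding the logits.
        logits, = self.model(input_ids, mask, token_type_ids)
        return logits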

    def training_step(self, batch, batch_idx):
        logits = self.forward(batch['input_ids'], batch['token_type_ids'])
        loss = self.loss(logits, batch['label']).mean()
        return {'loss': loss, 'log': {'train_loss': loss}}

    def validation_step(self, batch, batch_idx):
        logits = self.forward(batch['input_ids'], batch['token_type_ids'])
        loss = self.loss(logits, batch['label'])
        acc = (logits.argmax(-1) == batch['label']).float()
        return {'loss': loss, 'acc': acc}

    def validation_epoch_end(self, outputs):
        loss = th.cat([o['loss'] for o in outputs], 0).mean()
        acc = th.cat([o['acc'] for o in outputs], 0).mean()
        out = {'val_loss': loss, 'val_acc': acc}
        return {**out, 'log': out}

    def train_dataloader(self):
        return th.utils.data.DataLoader(
                self.train_ds,
                batch_size=batch_size,
                drop_last=True,
                shuffle=True,
                )

    def val_dataloader(self):
        return th.utils.data.DataLoader(
                self.val_ds,
                batch_size=batch_size,
                drop_last=False,
                shuffle=False,
                )

    def configure_optimizers(self):
        return th.optim.SGD(
            self.parameters(),
            lr=lr,
            momentum=momentum,
        )

In [0]:
model = QQPClassifier()
trainer = pl.Trainer(
    default_root_dir='logs',
    gpus=(1 if th.cuda.is_available() else 0),
    max_epochs=epochs,
    fast_dev_run=debug,
    logger=pl.loggers.TensorBoardLogger('logs/', name='qqp', version=0),
)
trainer.fit(model)

GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0]


Downloading and preparing dataset glue/qqp (download: 57.73 MiB, generated: 107.02 MiB, total: 164.75 MiB) to /root/.cache/huggingface/datasets/glue/qqp/1.0.0...


Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/qqp/1.0.0. Subsequent calls will reuse this data.


100%|██████████| 73/73 [00:24<00:00,  3.00it/s]
100%|██████████| 9/9 [00:02<00:00,  3.38it/s]

    | Name                                                   | Type                          | Params
-----------------------------------------------------------------------------------------------------
0   | model                                                  | BertForSequenceClassification | 109 M 
1   | model.bert                                             | BertModel                     | 109 M 
2   | model.bert.embeddings                                  | BertEmbeddings                | 23 M  
3   | model.bert.embeddings.word_embeddings                  | Embedding                     | 23 M  
4   | model.bert.embeddings.position_embeddings              | Embedding                     | 393 K 
5   | model.bert.embeddings.token_type_embeddings            | Embedding                     | 1 K   
6   | model.bert.embeddings.LayerNorm                        | LayerNorm                 
