# <center> Pair binary classification with DistilBERT and Catalyst
    
<img src='https://habrastorage.org/webt/ne/n_/ow/nen_ow49hxu8zrkgolq1rv3xkhi.png'>
    

1. **Gradient accumulation.** Doing one optimization step for several backward steps. Well explained in [this post](https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255) by HuggingFace
1. **Mixed-precision training.** Handled by [Nvidia Apex](https://github.com/NVIDIA/apex) and reused by Catalyst
1. **Learning rate schedule.** A standard thing when training deep neural networks; Catalyst handles a lot of them
1. **Sequence bucketing (soon).** The main idea is that you can group long sentences with long ones, short ones with short ones and thus do less padding. Three approaches are described in [this Kernel](https://www.kaggle.com/bminixhofer/speed-up-your-rnn-with-sequence-bucketing)
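
To make the sequence bucketing idea concrete, here is a minimal pure-Python sketch. It is not the Kernel's code or anything Catalyst provides; `padding_cost` and the toy lengths are made up for illustration. It counts how many padding tokens are needed when each batch is padded to its own longest sequence, with and without sorting by length first.

```python
from typing import List


def padding_cost(lengths: List[int], batch_size: int) -> int:
    """Total number of pad tokens when every batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total


# hypothetical token counts of eight sentences
lengths = [5, 40, 7, 38, 6, 42, 8, 39]

unsorted_cost = padding_cost(lengths, batch_size=4)          # long and short mixed
bucketed_cost = padding_cost(sorted(lengths), batch_size=4)  # similar lengths together

print(unsorted_cost, bucketed_cost)
```

Note that the datasets in this notebook pad every example to `MAX_SEQ_LENGTH`, so bucketing would only pay off together with per-batch padding.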

In [1]:
# Python 
import os
import warnings
import logging
from typing import Mapping, List
from pprint import pprint
from collections import OrderedDict


# Numpy and Pandas 
import numpy as np
import pandas as pd

# PyTorch 
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Transformers 
from transformers import AutoConfig, AutoModel, AutoTokenizer

# Catalyst
from catalyst.dl import SupervisedRunner
from catalyst.dl.callbacks import AccuracyCallback, F1ScoreCallback, OptimizerCallback, SchedulerCallback
from catalyst.dl.callbacks import CheckpointCallback, InferCallback, CriterionCallback
from catalyst.utils import set_global_seed, prepare_cudnn

from sklearn.model_selection import train_test_split

# Sources
os.chdir('../../')

from pair_classification.catalyst.data.nlp.pair_bin_classify import TextPairBinaryClfDataset
from pair_classification.catalyst.contrib.models.nlp.bert.distil_pair_bin_classify import DistilBertForSequencePairBinaryClassification

In [2]:
torch.cuda.is_available()

True

**Setup**

In [3]:
MODEL_NAME = 'distilbert-base-uncased' # pretrained model from Transformers
LOG_DIR = "./logdir"    # for training logs and tensorboard visualizations
NUM_EPOCHS = 10                        # something around 2-6 epochs is typically fine when fine-tuning transformers
BATCH_SIZE = 64                        # depends on your available GPU memory (in combination with max seq length)
MAX_SEQ_LENGTH = 150                   # depends on your available GPU memory (in combination with batch size)
LEARN_RATE = 5e-5                      # learning rate is typically ~1e-5 for transformers
ACCUM_STEPS = 4                        # one optimization step for that many backward passes
SEED = 11                              # random seed for reproducibility
POSITIVE_WEIGHT = [0.37]               # positive weight constant  
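
With `BATCH_SIZE = 64` and `ACCUM_STEPS = 4`, the optimizer effectively sees batches of 64 × 4 = 256 examples. The sketch below is a toy, framework-free illustration (a one-parameter linear model with squared error; `grad_mse` and the data are hypothetical) of why this works: averaging the gradients of several equally sized micro-batches gives the same gradient as one large batch.

```python
from typing import List, Tuple

BATCH_SIZE = 64
ACCUM_STEPS = 4
EFFECTIVE_BATCH_SIZE = BATCH_SIZE * ACCUM_STEPS  # examples per optimizer step


def grad_mse(w: float, batch: List[Tuple[float, float]]) -> float:
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# toy data: y = 3x, eight examples split into four micro-batches of two
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w = 0.5

full_grad = grad_mse(w, data)  # gradient of one big batch

accumulated = 0.0
micro = len(data) // ACCUM_STEPS
for start in range(0, len(data), micro):
    # dividing each micro-batch gradient by the number of steps mirrors
    # scaling the loss by 1 / ACCUM_STEPS before calling backward()
    accumulated += grad_mse(w, data[start:start + micro]) / ACCUM_STEPS

print(EFFECTIVE_BATCH_SIZE, full_grad, accumulated)
```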

**Additionally, we install [Nvidia Apex](https://github.com/NVIDIA/apex) to reuse AMP - automatic mixed-precision training.**

The idea is that we can use the float16 format for faster training, only switching to float32 when necessary. 
Here we'll only need to tell Catalyst to use fp16.
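
One reason float32 is still needed in places is float16's narrow range: very small gradients underflow to zero. The standard-library sketch below does not use Apex, and `to_fp16` is a helper made up for this illustration. It shows the underflow, and how multiplying by a scale factor before the cast and dividing afterwards preserves the value. This is the idea behind loss scaling in mixed-precision training.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('e', struct.pack('e', x))[0]


tiny_grad = 1e-8   # fine in float32, too small for float16
scale = 65536.0    # illustrative power-of-two scale factor

underflowed = to_fp16(tiny_grad)                # rounds to zero
recovered = to_fp16(tiny_grad * scale) / scale  # scale, cast, unscale in full precision

print(underflowed, recovered)
```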

In [4]:
# FP16_PARAMS = None
FP16_PARAMS = dict(opt_level="O1") 

**Dataset**

Pairs of questions with a binary `is_duplicate` label; the columns (`qid1`, `qid2`, `question1`, `question2`, `is_duplicate`) match the [Quora Question Pairs](https://www.kaggle.com/c/quora-question-pairs) competition data.
Given the text of two questions, we need to predict whether they are duplicates of each other.

In [5]:
# to reproduce, download the data and customize this path
PATH_TO_DATA = './data/'

In [6]:
train_df = pd.read_csv(PATH_TO_DATA + 'train.csv', index_col='id').fillna('')
# valid_df = pd.read_csv(PATH_TO_DATA + 'valid.csv', index_col='id').fillna('')
test_df = pd.read_csv(PATH_TO_DATA + 'test.csv', index_col='test_id').fillna('')





In [7]:
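# split row positions (not the data itself) so that both frames can be sliced with iloc
X_train = np.arange(len(train_df))
y_train = train_df['is_duplicate'].to_numpy(dtype=np.int32)
# stratify to keep the share of duplicates equal in both parts; seed the split for reproducibility
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.1, stratify=y_train, random_state=SEED)
val_df = train_df.iloc[X_val]
train_df = train_df.iloc[X_train]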


In [8]:
train_df.head()

| id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|
| 44525 | 79891 | 79892 | Why should one avoid committing suicide? | Should one commit suicide? | 1 |
| 234636 | 345188 | 345189 | What do you think if London become an independ... | Could London survive as an independent city st... | 0 |
| 384380 | 223575 | 160967 | How is answering questions on Quora like study... | How do you answer questions on Quora? | 0 |
| 397203 | 530313 | 530314 | Does Uber provide training for new drivers? | How does Uber provide security both for the dr... | 0 |
| 107351 | 176642 | 176643 | How was iron discovered? | How can I use low octane number gasoline with ... | 0 |


In [9]:
val_df.head()

| id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|
| 171133 | 3853 | 264485 | What are the best free and legal music downloa... | What is a good free website to download mp3 so... | 1 |
| 253912 | 368542 | 100790 | What is the difference between introverted and... | What's the difference between supper and dinner? | 0 |
| 266588 | 383782 | 383783 | Chinese Food: Why do Asians eat shark fin soup? | Do you drink soup or eat soup? | 0 |
| 33086 | 60834 | 60835 | What do practitioners of Doga hope to achieve? | Where did Doga get its start? | 0 |
| 88837 | 23933 | 149351 | What is faster the speed of light or the speed... | What is faster, the speed of light or the spee... | 0 |


## Torch Dataset

This part is left for the user to define. Catalyst will take care of the rest. 

**Create Torch Datasets with train, validation, and test data.**

In [10]:
train_dataset = TextPairBinaryClfDataset(
    texts_left=train_df['question1'].values.tolist(),
    texts_right=train_df['question2'].values.tolist(),
    labels=train_df['is_duplicate'].values.tolist(),
    max_seq_length=MAX_SEQ_LENGTH,
    model_name=MODEL_NAME
)

valid_dataset = TextPairBinaryClfDataset(
    texts_left=val_df['question1'].values.tolist(),
    texts_right=val_df['question2'].values.tolist(),
    labels=val_df['is_duplicate'].values.tolist(),
    max_seq_length=MAX_SEQ_LENGTH,
    model_name=MODEL_NAME
)

test_dataset = TextPairBinaryClfDataset(
    texts_left=test_df['question1'].values.tolist(),
    texts_right=test_df['question2'].values.tolist(),
    labels=None,
    max_seq_length=MAX_SEQ_LENGTH,
    model_name=MODEL_NAME
)


One of the training dataset instances:

In [11]:
train_df.iloc[1]

qid1                                                       345188
qid2                                                       345189
question1       What do you think if London become an independ...
question2       Could London survive as an independent city st...
is_duplicate                                                    0
Name: 234636, dtype: object

In [12]:
pprint(train_dataset[1])

{'features_left': tensor([ 101, 2054, 2079, 2017, 2228, 2065, 2414, 2468, 2019, 2981, 2406, 1029,
         102,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,   
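
The long run of zeros above is padding: the question has 13 tokens (including `[CLS]` = 101 and `[SEP]` = 102) and is padded up to `MAX_SEQ_LENGTH`. The attention mask tells the model which positions are real. Here is a rough sketch of what the dataset presumably does after tokenization; the helper `pad_and_mask` is hypothetical, and the actual `TextPairBinaryClfDataset` implementation is not shown in this notebook.

```python
from typing import List, Tuple

MAX_SEQ_LENGTH = 150


def pad_and_mask(token_ids: List[int], max_len: int, pad_id: int = 0) -> Tuple[List[int], List[int]]:
    """Truncate or right-pad token ids to max_len and build the matching attention mask."""
    # naive truncation for illustration; a real tokenizer would keep the final [SEP]
    ids = token_ids[:max_len]
    mask = [1] * len(ids)           # 1 marks a real token
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


# token ids of the left question from the output above
question_ids = [101, 2054, 2079, 2017, 2228, 2065, 2414, 2468, 2019, 2981, 2406, 1029, 102]

features, mask = pad_and_mask(question_ids, MAX_SEQ_LENGTH)
print(len(features), sum(mask))
```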

**Finally, we define standard PyTorch loaders. This dictionary will be fed to Catalyst.**

In [13]:
train_val_loaders = {
    "train": DataLoader(dataset=train_dataset,
                        batch_size=BATCH_SIZE, 
                        shuffle=True),
    "valid": DataLoader(dataset=valid_dataset,
                        batch_size=BATCH_SIZE, 
                        shuffle=False)    
}

# The model

It's going to be a slightly simplified version of [`DistilBertForSequenceClassification`](https://github.com/huggingface/transformers/blob/master/transformers/modeling_distilbert.py#L547) by HuggingFace.
We need only the logits as output, nothing more: neither the loss nor the hidden states or attentions (as in the original implementation).

In [14]:
class DistilBertForSequencePairBinaryClassification(nn.Module):
    def __init__(self, model_name):
        super().__init__()
        
        config = AutoConfig.from_pretrained(
            model_name, num_labels=1)
        
        self.distilbert = AutoModel.from_pretrained(model_name, 
                                                    config=config)
        self.pre_classifier = nn.Linear(config.dim, config.dim)
        self.classifier = nn.Linear(2*config.dim, 1)
        self.dropout = nn.Dropout(config.seq_classif_dropout)
        
    def encode_sequence(self, features, mask):
        assert mask is not None, "attention mask is none"
        distilbert_output = self.distilbert(
            input_ids=features,
            attention_mask=mask
        )
        
        hidden_state = distilbert_output[0]                 # (bs, seq_len, dim)
        pooled_output = hidden_state[:, 0]                  # (bs, dim)
        pooled_output = self.pre_classifier(pooled_output)  # (bs, dim)
        return pooled_output

    def forward(self, features_left, mask_left, features_right, mask_right):
        # encode both questions with the same (shared) DistilBERT weights
        output_left = self.encode_sequence(features_left, mask_left)
        output_right = self.encode_sequence(features_right, mask_right)

        concat_outputs = torch.cat((output_left, output_right), dim=1)  # (bs, 2*dim)
        concat_outputs = torch.relu(concat_outputs)                     # (bs, 2*dim)

        concat_outputs = self.dropout(concat_outputs)                   # (bs, 2*dim)
        logits = self.classifier(concat_outputs)                        # (bs, 1)

        return logits


In [15]:
model = DistilBertForSequencePairBinaryClassification(model_name=MODEL_NAME)

## Model training

First we specify criterion, optimizer and scheduler (pure PyTorch). Then Catalyst stuff.

In [21]:
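# pos_weight has to be passed by keyword: the first positional argument of BCEWithLogitsLoss is `weight`
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=torch.FloatTensor(POSITIVE_WEIGHT).cuda())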
optimizer = torch.optim.Adam(model.parameters(), lr=LEARN_RATE)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
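
For reference, `BCEWithLogitsLoss` accepts two different weighting arguments: `weight` (the first positional argument) rescales the loss of every element, while `pos_weight` multiplies only the term for positive targets. Below is a plain-Python sketch of the per-example formula given in the PyTorch documentation; the function `bce_with_logits` is written for this illustration and is not the PyTorch implementation.

```python
import math


def bce_with_logits(logit: float, target: float, pos_weight: float = 1.0) -> float:
    """Per-example loss: -[pos_weight * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(pos_weight * target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


pos_w = 0.37   # same value as POSITIVE_WEIGHT above
logit = 0.8    # arbitrary example logit

pos_plain = bce_with_logits(logit, 1.0)
pos_weighted = bce_with_logits(logit, 1.0, pos_w)  # positive example: scaled by 0.37
neg_plain = bce_with_logits(logit, 0.0)
neg_weighted = bce_with_logits(logit, 0.0, pos_w)  # negative example: unchanged

print(pos_plain, pos_weighted, neg_plain, neg_weighted)
```

With a `pos_weight` below 1, mistakes on duplicate pairs cost less than mistakes on non-duplicate pairs.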

To run Deep Learning experiments, Catalyst resorts to the [`Runner`](https://catalyst-team.github.io/catalyst/api/dl.html#catalyst.dl.core.runner.Runner) abstraction, in particular, to [`SupervisedRunner`](https://catalyst-team.github.io/catalyst/api/dl.html#module-catalyst.dl.runner.supervised).

`SupervisedRunner` implements the following methods:
 - `train` - starts the training process of the model
 - `predict_loader` - makes a prediction on the whole loader with the specified model
 - `infer` - makes the inference on the model
 
To train the model within this interface you pass the following to the `train` method:
 - model (`torch.nn.Module`) – PyTorch model to train
 - criterion (`nn.Module`) – PyTorch criterion function for training
 - optimizer (`optim.Optimizer`) – PyTorch optimizer for training
 - loaders (dict) – dictionary containing one or several `torch.utils.data.DataLoader` for training and validation
 - logdir (str) – path to output directory. There Catalyst will write logs, will dump the best model and the actual code to train the model
 - callbacks – list of Catalyst callbacks
 - scheduler (`optim.lr_scheduler._LRScheduler`) – PyTorch scheduler for training
 - ...
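
The scheduler created earlier is `ReduceLROnPlateau` with default settings, which in PyTorch means `factor=0.1` and `patience=10`. The following is a simplified re-implementation of its core rule for a metric that should decrease; the class `PlateauSketch` is hypothetical and omits cooldown, `min_lr` and other options. Assuming the scheduler is stepped once per epoch, it suggests that with `NUM_EPOCHS = 10` the default patience never triggers a reduction.

```python
class PlateauSketch:
    """Simplified reduce-on-plateau rule for a metric that should go down."""

    def __init__(self, lr: float, factor: float = 0.1, patience: int = 10,
                 threshold: float = 1e-4):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.threshold = threshold
        self.best = float('inf')
        self.num_bad_epochs = 0

    def step(self, metric: float) -> float:
        if metric < self.best * (1.0 - self.threshold):
            # noticeable improvement: remember it and reset the counter
            self.best = metric
            self.num_bad_epochs = 0
        else:
            self.num_bad_epochs += 1
        if self.num_bad_epochs > self.patience:
            # too many epochs without improvement: shrink the learning rate
            self.lr *= self.factor
            self.num_bad_epochs = 0
        return self.lr


# hypothetical validation losses that stop improving after the first epoch
losses = [0.50] + [0.60] * 9   # 10 epochs

default = PlateauSketch(lr=5e-5)
for loss in losses:
    lr_default = default.step(loss)

impatient = PlateauSketch(lr=5e-5, patience=2)
for loss in losses:
    lr_impatient = impatient.step(loss)

print(lr_default, lr_impatient)
```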
 
In our case we'll pass the `DistilBertForSequencePairBinaryClassification` model, the binary cross-entropy criterion, the Adam optimizer, the scheduler and the data loaders that we created earlier. We'll also be tracking accuracy and F1, and thus will need `AccuracyCallback` and `F1ScoreCallback`. To perform gradient accumulation, we'll be using `OptimizerCallback`.

There are many more useful [callbacks](https://catalyst-team.github.io/catalyst/api/dl.html#module-catalyst.dl.callbacks.checkpoint) implemented, also check out [Catalyst examples](https://github.com/catalyst-team/catalyst/tree/master/examples/notebooks).
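
Since the model returns raw logits, a metric such as accuracy first has to turn them into class predictions. Conceptually, `activation='Sigmoid'` with `threshold=0.5` amounts to the following. This is a plain-Python sketch with made-up logits and labels, not Catalyst's implementation.

```python
import math
from typing import List


def predict(logits: List[float], threshold: float = 0.5) -> List[int]:
    """Apply a sigmoid to each logit and compare the probability with the threshold."""
    return [int(1.0 / (1.0 + math.exp(-z)) > threshold) for z in logits]


def binary_accuracy(preds: List[int], targets: List[int]) -> float:
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


def binary_f1(preds: List[int], targets: List[int]) -> float:
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, targets))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, targets))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, targets))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# hypothetical model outputs and ground-truth labels
logits = [2.0, -1.5, 0.3, -0.2]
targets = [1, 0, 0, 1]

preds = predict(logits)
accuracy = binary_accuracy(preds, targets)
f1 = binary_f1(preds, targets)

print(preds, accuracy, f1)
```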

In [22]:
os.environ['CUDA_VISIBLE_DEVICES'] = "0"    # can be changed in case of multiple GPUs onboard
set_global_seed(SEED)                       # reproducibility
prepare_cudnn(deterministic=True)           # reproducibility

In [23]:
# we need a small wrapper around Catalyst's runner to be able to pass masks to it
class BertSupervisedRunner(SupervisedRunner):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, input_key=(
            'features_left',
            'mask_left',
            'features_right',
            'mask_right'
        ), **kwargs)
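
The tuple passed as `input_key` tells the runner which entries of each batch dictionary to hand to the model's `forward` as keyword arguments, which is why the names match the `forward` signature above. Roughly, and only as an illustration of the idea (the functions below are stand-ins, not Catalyst internals, and the `targets` key is an assumption):

```python
from typing import Any, Dict, Sequence

INPUT_KEYS = ('features_left', 'mask_left', 'features_right', 'mask_right')


def select_inputs(batch: Dict[str, Any], keys: Sequence[str]) -> Dict[str, Any]:
    """Pick only the model inputs from a batch, dropping e.g. the targets."""
    return {key: batch[key] for key in keys}


def fake_forward(features_left, mask_left, features_right, mask_right):
    """Stand-in for model.forward; just reports how many real tokens each side has."""
    return sum(mask_left), sum(mask_right)


# a toy batch entry; 'targets' is an assumed key name for the label
batch = {
    'features_left': [101, 2054, 102, 0],
    'mask_left': [1, 1, 1, 0],
    'features_right': [101, 2071, 2414, 102],
    'mask_right': [1, 1, 1, 1],
    'targets': 0,
}

result = fake_forward(**select_inputs(batch, INPUT_KEYS))
print(result)
```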





In [24]:
%%time
# model runner
runner = BertSupervisedRunner()

callbacks = OrderedDict({
    '_criterion': CriterionCallback(),
    '_optimizer': OptimizerCallback(accumulation_steps=ACCUM_STEPS),
    '_scheduler': SchedulerCallback(),
    '_saver': CheckpointCallback(),
    'accuracy': AccuracyCallback(num_classes=1, threshold=0.5, activation='Sigmoid'),
    'f1': F1ScoreCallback()
})

# model training
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    loaders=train_val_loaders,
    callbacks=callbacks,
    fp16=FP16_PARAMS,
    logdir=LOG_DIR,
    num_epochs=NUM_EPOCHS,
    verbose=True
)

Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic

1/10 * Epoch (train):   0% 0/5686 [00:00<?, ?it/s]

RuntimeError: CUDA out of memory. Tried to allocate 58.00 MiB (GPU 0; 11.17 GiB total capacity; 10.47 GiB already allocated; 39.75 MiB free; 436.83 MiB cached)

In [21]:
!nvidia-smi

Tue Dec  3 18:08:04 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   61C    P0    59W / 149W |  10299MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage    

In [22]:
torch.cuda.empty_cache()

In [23]:
!nvidia-smi

Tue Dec  3 18:08:22 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   62C    P0    71W / 149W |   1985MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage    

# Plot metrics

<img src="https://habrastorage.org/webt/ki/ib/hy/kiibhyp373r65zriwruroiqitky.jpeg" width=30% />

There are at least 4 ways to monitor training:

### 1. Good old tqdm
Above, it's enabled with the `verbose` flag in `runner.train`. Actually, it's not that bad :)

<img src='https://habrastorage.org/webt/ta/1s/98/ta1s988ghabz412weaq0lgs_cke.png'> 


### 2. Weights & Biases

Before launching training, you can run the [Weights & Biases](https://app.wandb.ai/) initialization for this project. Execute `wandb init` in a separate terminal window (from the same directory where this notebook is running). `wandb` will ask for your API key from https://app.wandb.ai/authorize and a project name. The rest will be picked up by Catalyst's `SupervisedWandbRunner` (so you'll need to import this instead of `SupervisedRunner`). 
By following the links printed during training (something like https://app.wandb.ai/yorko/catalyst-nlp-bert) we can keep track of loss and metrics.

### 3. Tensorboard
During training, logs are written to `LOG_DIR` specified above. 
Simultaneously with training, you can run `tensorboard --logdir $LOG_DIR` in another terminal tab (when training on a server, I also had to add the `--bind_all` flag),
and you'll get a nice dashboard. Here we see how accuracy and loss change during training.

<img src="https://habrastorage.org/webt/2a/sx/mo/2asxmoizgcpf2fnhjjkfhvf70aw.png" width=50% />

### 4. Offline metric plotting

If your training is pretty fast and/or you're not interested in tracking training progress, you can just plot losses and metrics once the training is done. This doesn't seem to work in Kaggle Kernels, but you can try it locally.

# Inference for the test set

Let's create a Torch loader for the test set and launch `infer` to actually make predictions for the test set. First, we load the best model checkpoint, then run inference with this model.

In [None]:
test_loaders = {
    "test": DataLoader(dataset=test_dataset,
                        batch_size=BATCH_SIZE, 
                        shuffle=False) 
}

In [None]:
runner.infer(
    model=model,
    loaders=test_loaders,
    callbacks=[
        CheckpointCallback(
            resume=f"{LOG_DIR}/checkpoints/best.pth"
        ),
        InferCallback(),
    ],   
    verbose=True
)

In [None]:
predicted_probs = runner.callbacks[0].predictions['logits']

Now that we have the predictions (raw logits, despite the variable name), let's finally create a submission file.

In [None]:
sample_sub_df = pd.read_csv(PATH_TO_DATA + 'sample_submission.csv',
                           index_col='id')

In [None]:
train_dataset.label_dict

The model outputs a single logit per question pair, so instead of taking an argmax over classes we apply a sigmoid to get the probability that the pair is a duplicate.

In [None]:
# `label` is the column name used in the original cell; adjust it to the competition's submission format
sample_sub_df['label'] = 1 / (1 + np.exp(-np.asarray(predicted_probs).reshape(-1)))

In [None]:
sample_sub_df.head()

In [None]:
sample_sub_df.to_csv('distillbert_submission.csv')