# <center> Pair binary classification with DistilBERT and Catalyst
    
<img src='https://habrastorage.org/webt/ne/n_/ow/nen_ow49hxu8zrkgolq1rv3xkhi.png'>
    

1. **Gradient accumulation.** Doing one optimization step for several backward steps. Well explained in [this post](https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255) by HuggingFace
1. **Mixed-precision training.** Handled by [Nvidia Apex](https://github.com/NVIDIA/apex) and reused by Catalyst
1. **Learning rate schedule.** Standard thing when training deep neural networks, Catalyst handles a lot of them
1. **Sequence bucketing (soon).** The main idea is that you can group long sentences with long ones, short ones with short ones and thus do less padding. Three approaches are described in [this Kernel](https://www.kaggle.com/bminixhofer/speed-up-your-rnn-with-sequence-bucketing)
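
To make the bucketing idea concrete, here is a minimal sketch in plain Python. It is not one of the three approaches from the linked Kernel: the helper names and the toy sentence lengths are made up for illustration, and it assumes each batch is padded only to its own longest sample (the datasets in this notebook pad everything to `MAX_SEQ_LENGTH` instead).

```python
import random


def bucketed_batches(lengths, batch_size, seed=0):
    """Return batches of sample indices grouped by similar sequence length."""
    # Sort sample indices by length so neighbours have similar lengths.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    # Cut the sorted order into consecutive batches.
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    # Shuffle whole batches (not samples) to keep some randomness.
    random.Random(seed).shuffle(batches)
    return batches


def padded_tokens(lengths, batches):
    """Total token count when every batch is padded to its longest sample."""
    return sum(max(lengths[i] for i in batch) * len(batch) for batch in batches)


# Toy token counts per sentence (made-up numbers).
lengths = [5, 48, 7, 50, 6, 47, 8, 49]

# Batches taken in the original order mix short and long sentences.
naive = [list(range(i, i + 4)) for i in range(0, len(lengths), 4)]
bucketed = bucketed_batches(lengths, batch_size=4)

naive_cost = padded_tokens(lengths, naive)
bucketed_cost = padded_tokens(lengths, bucketed)
print(naive_cost, bucketed_cost)
```

On this toy input the bucketed batches need far fewer padded tokens than the batches taken in the original order, which is where the speed-up comes from.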

In [1]:
# Python 
import os
import warnings
import logging
from typing import Mapping, List
from pprint import pprint
from collections import OrderedDict


# Numpy and Pandas 
import numpy as np
import pandas as pd

# PyTorch 
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Transformers 
from transformers import AutoConfig, AutoModel, AutoTokenizer

# Catalyst
from catalyst.dl import SupervisedRunner
from catalyst.dl.callbacks import AccuracyCallback, F1ScoreCallback, OptimizerCallback, SchedulerCallback
from catalyst.dl.callbacks import CheckpointCallback, InferCallback, CriterionCallback
from catalyst.utils import set_global_seed, prepare_cudnn, set_requires_grad

from sklearn.model_selection import train_test_split

# Sources
os.chdir('../../')

from pair_classification.catalyst.data.nlp.pair_bin_classify_debiased import TextPairBinaryClfDebiasedDataset
from pair_classification.catalyst.contrib.models.nlp.bert.distil_pair_bin_classify import DistilBertForSequencePairBinaryClassification
from pair_classification.catalyst.contrib.criterion.bce import SampleWeightedBCELoss


In [2]:
torch.cuda.is_available()

True

**Setup**

In [3]:
MODEL_NAME = 'distilbert-base-uncased' # pretrained model from Transformers
LOG_DIR = "./models/catalyst/logdir"    # for training logs and tensorboard visualizations
NUM_EPOCHS = 20                         # around 2-6 epochs is typically fine when fine-tuning transformers
BATCH_SIZE = 64                        # depends on your available GPU memory (in combination with max seq length)
MAX_SEQ_LENGTH = 150                   # depends on your available GPU memory (in combination with batch size)
LEARN_RATE = 5e-5                      # learning rate is typically ~1e-5 for transformers
ACCUM_STEPS = 4                        # one optimization step for that many backward passes
SEED = 11                              # random seed for reproducibility
POSITIVE_WEIGHT = None               # positive weight constant  
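
With `BATCH_SIZE = 64` and `ACCUM_STEPS = 4`, the optimizer sees an effective batch of 64 × 4 = 256 samples per step. The sketch below shows why accumulation works, using a one-parameter toy model in plain Python rather than Catalyst. It assumes equally sized micro-batches and divides each micro-batch gradient by the number of accumulation steps; whether `OptimizerCallback` applies the same scaling is not shown in this notebook.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
w = 0.5
accum_steps = 4                    # mirrors ACCUM_STEPS above
micro = len(xs) // accum_steps     # micro-batch size

# Reference: one gradient over the whole batch of 8 samples.
full_grad = grad_mse(w, xs, ys)

# Accumulation: add up scaled micro-batch gradients, no update in between.
accumulated = 0.0
for k in range(accum_steps):
    lo, hi = k * micro, (k + 1) * micro
    accumulated += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accum_steps

# A single optimizer step after all backward passes.
lr = 0.01
w_new = w - lr * accumulated
print(full_grad, accumulated, w_new)
```

The accumulated gradient matches the full-batch gradient, so only one micro-batch has to fit in GPU memory at a time.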

**Additionally, we install [Nvidia Apex](https://github.com/NVIDIA/apex) to reuse AMP - automatic mixed-precision training.**

The idea is that we can use float16 format for faster training, only switching to float32 when necessary. 
Here we'll only need to tell Catalyst to use fp16.
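
Why is float32 still "necessary" in places? A small illustration using only the standard library (`struct` supports IEEE 754 half precision via the `'e'` format). The `to_fp16` helper and the numbers are made up for the demo; this is not how Apex is implemented, it only shows the two effects that mixed precision has to work around: small updates getting lost, and tiny gradients underflowing unless the loss is scaled.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('e', struct.pack('e', x))[0]


# 1) A small update vanishes when added to a float16 weight.
weight, update = 1.0, 1e-4
fp16_sum = to_fp16(weight + update)
print(fp16_sum == weight)

# 2) A tiny gradient underflows to zero in float16 ...
tiny_grad = 1e-8
underflowed = to_fp16(tiny_grad)
print(underflowed)

# ... but survives if it is scaled up before the cast
# and scaled back down in higher precision afterwards.
loss_scale = 1024.0
scaled = to_fp16(tiny_grad * loss_scale)
recovered = scaled / loss_scale
print(recovered)
```

This is the reason the Apex log in the training cells reports a `loss_scale` setting.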

In [4]:
# FP16_PARAMS = None
FP16_PARAMS = dict(opt_level="O1") 

**Dataset**

Pairs of questions in the layout of Kaggle's Quora Question Pairs competition, loaded here from a debiased copy that also carries a per-sample `weights` column.
Given two questions (`question1` and `question2`), we need to predict `is_duplicate`: whether they ask the same thing.

In [5]:
# to reproduce, download the data and customize this path
PATH_TO_DATA = './data/debiased/'

In [6]:
train_df = pd.read_csv(PATH_TO_DATA + 'train.csv', index_col='id').fillna('')[:128]
valid_df = pd.read_csv(PATH_TO_DATA + 'valid.csv', index_col='id').fillna('')[:64]
test_df = pd.read_csv('./data/test.csv', index_col='test_id').fillna('')[:128]


elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison



In [7]:
train_df.head()

| id | qid1 | qid2 | question1 | question2 | is_duplicate | weights |
|---|---|---|---|---|---|---|
| 130057 | 208794 | 208795 | What is the history of Kuala Lumpur, Malaysia? | What is there to do in Kuala Lumpur, Malaysia? | 0 | 0.513184 |
| 395808 | 528827 | 528828 | How much control do US president have over mil... | How much control does a sitting President have... | 0 | 0.786406 |
| 71099 | 122407 | 122408 | What are some of the awesome places to visit i... | What are some awesome places one can visit in ... | 0 | 0.512839 |
| 238614 | 350029 | 350030 | How do I live happy even though I am ugly? | How do I be happy when I am extremely ugly? | 1 | 0.786433 |
| 146864 | 231907 | 7084 | How do I get a girl who doesn't know you like ... | How do you know the girl you like doesn't like... | 0 | 0.561186 |


In [8]:
valid_df.head()

| id | qid1 | qid2 | question1 | question2 | is_duplicate | weights |
|---|---|---|---|---|---|---|
| 323664 | 22892 | 449654 | How many chromosomes are in a sperm cell? | How many chromosomes are there in gametes? | 0 | 0.513492 |
| 361660 | 491529 | 491530 | Is architecture a good career? | Who is a good architect? | 0 | 0.786517 |
| 54605 | 96376 | 9351 | How do I know if I have been blocked on messen... | How can you tell if you've been blocked on Fac... | 1 | 0.786723 |
| 239163 | 350695 | 350696 | What is my dream all about? | What do you dream about? | 0 | 0.535758 |
| 192153 | 36057 | 18429 | Could time travel be a real thing? Could it be... | What is the possibility of time travel becomin... | 1 | 0.786602 |


## Torch Dataset

The dataset class is left for the user to define. Catalyst will take care of the rest. 
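
The source of `TextPairBinaryClfDebiasedDataset` is not shown in this notebook, so here is a rough sketch of what such a map-style dataset has to provide: `__len__`, and a `__getitem__` that returns one sample as a dictionary. `ToyPairDataset` and its whitespace "tokenizer" are hypothetical stand-ins; a real version would subclass `torch.utils.data.Dataset`, use the pretrained model's tokenizer and return tensors. Only the dictionary keys are taken from the rest of this notebook.

```python
class ToyPairDataset:
    """Hypothetical stand-in for a text-pair dataset (not the project's class)."""

    PAD, UNK = 0, 1

    def __init__(self, texts_left, texts_right, labels=None, weights=None,
                 max_seq_length=8):
        self.texts_left, self.texts_right = texts_left, texts_right
        self.labels, self.weights = labels, weights
        self.max_seq_length = max_seq_length
        # Toy whitespace vocabulary; a real dataset would use the model's tokenizer.
        words = sorted({w for t in texts_left + texts_right for w in t.lower().split()})
        self.vocab = {w: i + 2 for i, w in enumerate(words)}

    def __len__(self):
        return len(self.texts_left)

    def _encode(self, text):
        # Map words to ids, truncate, then pad ids and mask to a fixed length.
        ids = [self.vocab.get(w, self.UNK) for w in text.lower().split()]
        ids = ids[:self.max_seq_length]
        mask = [1] * len(ids)
        pad = self.max_seq_length - len(ids)
        return ids + [self.PAD] * pad, mask + [0] * pad

    def __getitem__(self, index):
        left_ids, left_mask = self._encode(self.texts_left[index])
        right_ids, right_mask = self._encode(self.texts_right[index])
        item = {
            "features_left": left_ids, "mask_left": left_mask,
            "features_right": right_ids, "mask_right": right_mask,
        }
        # Labels and weights are absent for the test set.
        if self.labels is not None:
            item["targets"] = float(self.labels[index])
        if self.weights is not None:
            item["weights"] = float(self.weights[index])
        return item


dataset = ToyPairDataset(
    texts_left=["Is architecture a good career?"],
    texts_right=["Who is a good architect?"],
    labels=[0], weights=[0.786517], max_seq_length=8,
)
sample = dataset[0]
print(sample["features_left"], sample["mask_left"])
```

The mask marks real tokens with 1 and padding with 0, so the model can ignore the padded positions.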

**Create Torch Datasets with train, validation, and test data.**

In [9]:
train_dataset = TextPairBinaryClfDebiasedDataset(
    texts_left=train_df['question1'].values.tolist(),
    texts_right=train_df['question2'].values.tolist(),
    labels=train_df['is_duplicate'].values.tolist(),
    weights=train_df['weights'].values.tolist(),
    max_seq_length=MAX_SEQ_LENGTH,
    model_name=MODEL_NAME
)

valid_dataset = TextPairBinaryClfDebiasedDataset(
    texts_left=valid_df['question1'].values.tolist(),
    texts_right=valid_df['question2'].values.tolist(),
    labels=valid_df['is_duplicate'].values.tolist(),
    weights=valid_df['weights'].values.tolist(),
    max_seq_length=MAX_SEQ_LENGTH,
    model_name=MODEL_NAME
)

test_dataset = TextPairBinaryClfDebiasedDataset(
    texts_left=test_df['question1'].values.tolist(),
    texts_right=test_df['question2'].values.tolist(),
    labels=None,
    weights=None,
    max_seq_length=MAX_SEQ_LENGTH,
    model_name=MODEL_NAME
)


One of the training dataset instances:

In [10]:
train_df.iloc[1]

qid1                                                       528827
qid2                                                       528828
question1       How much control do US president have over mil...
question2       How much control does a sitting President have...
is_duplicate                                                    0
weights                                                  0.786406
Name: 395808, dtype: object

In [11]:
pprint(train_dataset[1])

{'features_left': tensor([ 101, 2129, 2172, 2491, 2079, 2149, 2343, 2031, 2058, 2510, 3821, 1029,
         102,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,   

**Finally, we define standard PyTorch loaders. This dictionary will be fed to Catalyst.**

In [12]:
train_val_loaders = {
    "train": DataLoader(dataset=train_dataset,
                        batch_size=BATCH_SIZE, 
                        shuffle=True),
    "valid": DataLoader(dataset=valid_dataset,
                        batch_size=BATCH_SIZE, 
                        shuffle=False)    
}

# The model

It's going to be a slightly simplified version of [`DistilBertForSequenceClassification`](https://github.com/huggingface/transformers/blob/master/transformers/modeling_distilbert.py#L547) by HuggingFace.
We only need the model's predictions as output, nothing more: neither the loss nor the hidden states or attentions that the original implementation can also return.

In [13]:
model = DistilBertForSequencePairBinaryClassification(model_name=MODEL_NAME)

## Model training

First we specify criterion, optimizer and scheduler (pure PyTorch). Then Catalyst stuff.

In [14]:
criterion = SampleWeightedBCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LEARN_RATE)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
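
`SampleWeightedBCELoss` comes from the project sources and is not shown here. As a rough idea of what a sample-weighted binary cross-entropy computes, here is a plain-Python sketch. It assumes the loss takes raw logits and averages the weighted per-sample losses over the batch; the actual class may normalize differently.

```python
import math


def weighted_bce_with_logits(logits, targets, weights):
    """Mean of per-sample weighted binary cross-entropy, computed from logits."""
    total = 0.0
    for z, y, w in zip(logits, targets, weights):
        # Numerically stable form of
        # -[y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z))].
        loss = max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        total += w * loss
    return total / len(logits)


# Toy batch: made-up logits with weights similar to the `weights` column.
logits = [2.0, -1.0, 0.0]
targets = [1.0, 0.0, 1.0]
weights = [0.51, 0.79, 0.56]
loss = weighted_bce_with_logits(logits, targets, weights)
print(loss)
```

A sample with a larger weight pulls the loss (and the gradient) harder, which is how the debiasing weights influence training.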

To run Deep Learning experiments, Catalyst resorts to the [`Runner`](https://catalyst-team.github.io/catalyst/api/dl.html#catalyst.dl.core.runner.Runner) abstraction, in particular, to [`SupervisedRunner`](https://catalyst-team.github.io/catalyst/api/dl.html#module-catalyst.dl.runner.supervised).

`SupervisedRunner` implements the following methods:
 - `train` - starts the training process of the model
 - `predict_loader` - makes a prediction on the whole loader with the specified model
 - `infer` - makes the inference on the model
 
To train the model within this interface you pass the following to the `train` method:
 - model (`torch.nn.Module`) – PyTorch model to train
 - criterion (`nn.Module`) – PyTorch criterion function for training
 - optimizer (`optim.Optimizer`) – PyTorch optimizer for training
 - loaders (dict) – dictionary containing one or several `torch.utils.data.DataLoader` for training and validation
 - logdir (str) – path to output directory. There Catalyst will write logs, will dump the best model and the actual code to train the model
 - callbacks – list of Catalyst callbacks
 - scheduler (`optim.lr_scheduler._LRScheduler`) – PyTorch scheduler for training
 - ...
 
In our case we'll pass the created `DistilBertForSequencePairBinaryClassification` model, the sample-weighted binary cross-entropy criterion, Adam optimizer, scheduler and data loaders that we created earlier. Also, we'll be tracking accuracy and F1 score and thus will need `AccuracyCallback` and `F1ScoreCallback`. To perform batch accumulation, we'll be using `OptimizerCallback`.

There are many more useful [callbacks](https://catalyst-team.github.io/catalyst/api/dl.html#module-catalyst.dl.callbacks.checkpoint) implemented, also check out [Catalyst examples](https://github.com/catalyst-team/catalyst/tree/master/examples/notebooks).
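
For reference, this is what accuracy and F1 mean for a single-logit binary classifier with a sigmoid and a 0.5 threshold, written in plain Python. It is a sketch of the metric definitions only: `binary_metrics` is a made-up helper, and Catalyst's callbacks may differ in details such as tie handling at the threshold or how the values are scaled in the logs.

```python
import math


def binary_metrics(logits, targets, threshold=0.5):
    """Accuracy and F1 for binary targets, given raw logits."""
    # Turn logits into probabilities, then into hard 0/1 predictions.
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    preds = [1 if p > threshold else 0 for p in probs]

    tp = sum(p == 1 and t == 1 for p, t in zip(preds, targets))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, targets))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, targets))

    accuracy = sum(p == t for p, t in zip(preds, targets)) / len(targets)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Toy example with made-up logits and labels.
accuracy, f1 = binary_metrics(logits=[1.2, -0.3, 0.4, -2.0], targets=[1, 1, 0, 0])
print(accuracy, f1)
```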

In [15]:
os.environ['CUDA_VISIBLE_DEVICES'] = "0"    # can be changed in case of multiple GPUs onboard
set_global_seed(SEED)                       # reproducibility
prepare_cudnn(deterministic=True)           # reproducibility


In [16]:
# we need a small wrapper around Catalyst's runner to be able to pass masks to it
class BertSupervisedRunner(SupervisedRunner):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, input_key=(
            'features_left',
            'mask_left',
            'features_right',
            'mask_right'
        ), **kwargs)
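
Conceptually, `input_key` tells the runner which entries of a batch dictionary go into the model's `forward`. The toy code below mimics that idea in plain Python. It reflects my understanding of how the runner uses a tuple of keys, not Catalyst's actual source, and `toy_model` and `forward_from_batch` are hypothetical names.

```python
INPUT_KEYS = ("features_left", "mask_left", "features_right", "mask_right")


def toy_model(features_left, mask_left, features_right, mask_right):
    """Hypothetical model: accepts exactly the four inputs named in INPUT_KEYS."""
    # Stand-in computation: count the non-padded tokens on both sides.
    return sum(mask_left) + sum(mask_right)


def forward_from_batch(model, batch, input_keys):
    # Pick only the keys the model expects and pass them as keyword arguments;
    # 'targets' and 'weights' stay in the batch for the criterion.
    model_inputs = {key: batch[key] for key in input_keys}
    return model(**model_inputs)


batch = {
    "features_left": [7, 4, 0], "mask_left": [1, 1, 0],
    "features_right": [3, 0, 0], "mask_right": [1, 0, 0],
    "targets": 1.0, "weights": 0.5,
}
output = forward_from_batch(toy_model, batch, INPUT_KEYS)
print(output)
```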





In [17]:
%%time

# freeze bert
set_requires_grad(getattr(model, 'distilbert'), False)

# model runner
runner = BertSupervisedRunner()

callbacks = OrderedDict({
    '_criterion': CriterionCallback(input_key=['targets', 'weights']),
    '_optimizer': OptimizerCallback(accumulation_steps=ACCUM_STEPS),
    '_saver': CheckpointCallback(),
    '_scheduler': SchedulerCallback(),
    'accuracy': AccuracyCallback(num_classes=1, threshold=0.5, activation='Sigmoid'),
    'f1': F1ScoreCallback()
})

# model training
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    loaders=train_val_loaders,
    callbacks=callbacks,
    fp16=FP16_PARAMS,
    logdir=LOG_DIR,
    num_epochs=1,
    verbose=True
)

Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
1/1 * Epoch (train): 100% 2/2 [00:04<00:00,  2.29s/it, _timers/_fps=32.465, accuracy01=34.375, f1_score=0.364, loss=0.598]
1/1 * Epoch (valid): 100% 1/1 [00:02<00:00,  2.08s/it, _timers/_fps=33.064, accuracy01=37.500, f1_score=0.421, loss=0.596]
[2019-12-08 23:12:12,341] 
1/1 * Epoch 1 (train): _base/lr=5.000e-0

In [18]:
torch.cuda.empty_cache()

In [19]:
!nvidia-smi

Sun Dec  8 23:12:12 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   60C    P0    72W / 149W |    825MiB / 11441MiB |     19%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage    

In [20]:
runner = BertSupervisedRunner()

In [21]:
%%time

# unfreeze bert
set_requires_grad(getattr(model, 'distilbert'), True)

optimizer = torch.optim.Adam(model.parameters(), lr=LEARN_RATE)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)

callbacks = OrderedDict({
    '_criterion': CriterionCallback(input_key=['targets', 'weights']),
    '_optimizer': OptimizerCallback(accumulation_steps=ACCUM_STEPS),
    '_scheduler': SchedulerCallback(),
    '_saver': CheckpointCallback(resume=f"{LOG_DIR}/checkpoints/best_full.pth", save_n_best=5),
    'accuracy': AccuracyCallback(num_classes=1, threshold=0.5, activation='Sigmoid'),
    'f1': F1ScoreCallback()
})

# model training
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    loaders=train_val_loaders,
    callbacks=callbacks,
    fp16=FP16_PARAMS,
    logdir=LOG_DIR,
    num_epochs=NUM_EPOCHS,
    verbose=True
)


Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
=> loading checkpoint ./models/catalyst/logdir/checkpoints/best_full.pth
loaded checkpoint ./models/catalyst/logdir/checkpoints/best_full.pth (epoch 1)
1/20 * Epoch (train): 100% 2/2 [00:12<00:00,  6.14s/it, _timers/_fps=32.274, accuracy01=32.812, f1_score=0.361, loss=0.551]
1/20 * Epoch (valid): 100% 1/1 [00:02

AttributeError: 'NoneType' object has no attribute 'write'

In [24]:
torch.cuda.empty_cache()

In [25]:
!nvidia-smi

Sun Dec  8 23:16:25 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   65C    P0    74W / 149W |   1899MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage    

# Plot metrics

<img src="https://habrastorage.org/webt/ki/ib/hy/kiibhyp373r65zriwruroiqitky.jpeg" width=30% />

There are at least 4 ways to monitor training:

### 1. Good old tqdm
Above, it's enabled with the `verbose` flag in `runner.train`. Actually, it's not that bad :)

<img src='https://habrastorage.org/webt/ta/1s/98/ta1s988ghabz412weaq0lgs_cke.png'> 


### 2. Weights & Biases

Before launching training, you can run [Weights & Biases](https://app.wandb.ai/) initialization for this project. Execute `wandb init` in a separate terminal window (from the same directory where this notebook is running). `wandb` will ask for your API key from https://app.wandb.ai/authorize and a project name. The rest will be picked up by Catalyst's `SupervisedWandbRunner` (so you'll need to import this instead of `SupervisedRunner`). 
Following the link that `wandb` prints (something like https://app.wandb.ai/yorko/catalyst-nlp-bert), we can keep track of loss and metrics.

### 3. Tensorboard
During training, logs are written to `LOG_DIR` specified above. 
Simultaneously with training, you can run `tensorboard --logdir $LOG_DIR` in another terminal tab (when training on a server, I also had to add the `--bind_all` flag),
and you'll get a nice dashboard. Here we see how accuracy and loss change during training.

<img src="https://habrastorage.org/webt/2a/sx/mo/2asxmoizgcpf2fnhjjkfhvf70aw.png" width=50% />

### 4. Offline metric plotting

If your training is pretty fast and/or you're not interested in tracking training progress, you can just plot losses and metrics once the training is done. It doesn't seem to work in Kaggle Kernels, but try it locally.

# Inference for the test set

Let's create a Torch loader for the test set and launch `infer` to actually make predictions for the test set. First, we load the best model checkpoint, then run inference with this model.

In [26]:
test_loaders = {
    "test": DataLoader(dataset=test_dataset,
                        batch_size=BATCH_SIZE, 
                        shuffle=False) 
}

In [27]:
runner.infer(
    model=model,
    loaders=test_loaders,
    callbacks=[
        CheckpointCallback(
            resume=f"{LOG_DIR}/checkpoints/best.pth"
        ),
        InferCallback(),
    ],   
    verbose=True
)

=> loading checkpoint ./models/catalyst/logdir/checkpoints/best.pth
loaded checkpoint ./models/catalyst/logdir/checkpoints/best.pth (epoch 6)
1/1 * Epoch (test): 100% 2/2 [00:04<00:00,  2.24s/it, _timers/_fps=32.010]
Top best models:



In [28]:
predicted_probs = runner.callbacks[0].predictions['logits']

Now that we have the predictions (raw logits, which the sigmoid below turns into probabilities), let's finally create a submission file.

In [30]:
sample_sub_df = pd.read_csv(PATH_TO_DATA + 'sample_submission.csv',
                           index_col='test_id')


In [31]:
from pair_classification.bert_finetuning.util import sigmoid_np

sample_sub_df['is_duplicate'] = sigmoid_np(predicted_probs.squeeze())

ValueError: Length of values does not match length of index
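
The `ValueError` above is presumably caused by the `[:128]` slice applied to `test_df` when loading the data: only 128 predictions exist, while `sample_submission.csv` is read in full. The sketch below shows a numerically stable sigmoid and an early length check in plain Python. `sigmoid` and `probabilities_for_index` are hypothetical helpers, not the project's `sigmoid_np`.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function for a single float."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative inputs, avoid exp() of a large positive number.
    e = math.exp(z)
    return e / (1.0 + e)


def probabilities_for_index(logits, index):
    """Convert logits to probabilities, refusing mismatched lengths early."""
    if len(logits) != len(index):
        raise ValueError(f"{len(logits)} predictions for {len(index)} rows")
    return {row_id: sigmoid(z) for row_id, z in zip(index, logits)}


# Toy example: three made-up logits for three submission rows.
probs = probabilities_for_index([0.0, 2.0, -2.0], index=[0, 1, 2])
print(probs)
```

To fix the notebook itself, either predict on the full test set or restrict the submission frame to the rows that were actually scored.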

In [None]:
sample_sub_df.head()

In [None]:
sample_sub_df.to_csv('distillbert_submission.csv')