<a href="https://colab.research.google.com/github/gstoil/Hydrowl/blob/master/KDD_2022_tutorial_session3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Knowledge Distillation

This notebook walks you through the basic steps of knowledge distillation, using simple teacher and student models implemented with [Hugging Face Transformers](https://huggingface.co/docs/transformers/index). We will also conduct experiments on the QADSM task in the [XGLUE](https://huggingface.co/datasets/xglue) dataset, which is extracted from real Bing Ads traffic.

In [None]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/KDD_2022/

!pip install transformers
!pip install datasets

# test whether we can import packages properly
from datasets import load_dataset
from transformers import AutoTokenizer

Mounted at /content/drive
[Errno 2] No such file or directory: '/content/drive/MyDrive/KDD_2022/'
/content

## Preparations for Colab

Running the code snippets in this notebook requires a GPU runtime, and we also need to install some dependencies.


The code used in this notebook is available at https://github.com/sufferandjoy/kdd_2022_tutorial.git:

In [None]:
!git clone https://github.com/sufferandjoy/kdd_2022_tutorial.git

Cloning into 'kdd_2022_tutorial'...
remote: Enumerating objects: 42, done.
remote: Counting objects: 100% (42/42), done.
remote: Compressing objects: 100% (35/35), done.
remote: Total 42 (delta 17), reused 18 (delta 7), pack-reused 0
Unpacking objects: 100% (42/42), done.


## Teacher and Student Model

As mentioned above, both our teacher and student models are implemented using Hugging Face Transformers. More specifically, we use [BERT-Mini](https://huggingface.co/google/bert_uncased_L-4_H-256_A-4) (4 layers, 4 attention heads, hidden size 256) as our teacher model, and a [TwinBERT](https://arxiv.org/abs/2002.06275) model constructed from two [BERT-Tiny](https://huggingface.co/google/bert_uncased_L-2_H-128_A-2) (2 layers, 2 attention heads, hidden size 128) encoders as our student model. Other model structures are also supported and can be set with the `teacher_pretrained` and `student_pretrained` parameters. Below is the model structure specified in `model.py`:

In [None]:
import torch
import torch.nn as nn
from transformers import BertModel


class TeacherModel(nn.Module):
    def __init__(self, args):
        super(TeacherModel, self).__init__()
        self.model = BertModel.from_pretrained(args.teacher_pretrained)
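        # Parse the hidden size from the checkpoint name,
        # e.g. 'google/bert_uncased_L-4_H-256_A-4' -> 256.
        # This only works for names that follow this pattern.
        hidden_size = int(args.teacher_pretrained.split('/')[1].split('_')[3].split('-')[-1])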
        self.ff = nn.Linear(hidden_size, 1)

    def forward(self, ids, mask, token_type_ids):
        bert_output = self.model(input_ids=ids, attention_mask=mask, token_type_ids=token_type_ids)
        output = torch.sigmoid(self.ff(bert_output.pooler_output))
        return output


class TwinBERT(nn.Module):
    def __init__(self, args):
        super(TwinBERT, self).__init__()
        self.encoder_model = BertModel.from_pretrained(args.student_pretrained)

    def forward(self, seq1, mask1, seq2, mask2):
        output_1 = self.encoder_model(seq1, attention_mask=mask1).pooler_output
        output_2 = self.encoder_model(seq2, attention_mask=mask2).pooler_output
        cosine_similarity = nn.functional.cosine_similarity(output_1, output_2).unsqueeze(-1)
        return cosine_similarity

The TwinBERT model has a two-tower structure, implemented as two BERT encoders sharing the same weights, as shown in the figure below. In this notebook, we use the query as input to the left encoder, and the concatenation of `ad_title` and `ad_description` as input to the right encoder.

![twinbert.png](https://drive.google.com/uc?id=1H3qpUI8LwqOKnk9NWbf7s6cY04SlbtED)
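
To make the weight sharing concrete, here is a minimal sketch of the two-tower idea. It does not use BERT: `ToyEncoder` and `ToyTwinTower` are hypothetical stand-ins (a small embedding layer with mean pooling) chosen so the example runs without downloading a checkpoint. The point is only that both inputs pass through the same module and are compared with cosine similarity, as in the `TwinBERT` class above.

```python
import torch
import torch.nn as nn


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for a BERT encoder: embeds tokens and mean-pools them."""

    def __init__(self, vocab_size=100, hidden_size=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, ids, mask):
        embedded = self.embedding(ids)                   # (batch, seq_len, hidden)
        mask = mask.unsqueeze(-1).to(embedded.dtype)     # (batch, seq_len, 1)
        # Average only over non-padded positions.
        pooled = (embedded * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return torch.tanh(self.proj(pooled))             # (batch, hidden)


class ToyTwinTower(nn.Module):
    def __init__(self):
        super().__init__()
        # A single encoder instance is reused for both inputs,
        # so the two towers share weights by construction.
        self.encoder = ToyEncoder()

    def forward(self, seq1, mask1, seq2, mask2):
        left = self.encoder(seq1, mask1)
        right = self.encoder(seq2, mask2)
        return nn.functional.cosine_similarity(left, right).unsqueeze(-1)


torch.manual_seed(0)
model = ToyTwinTower()
query_ids = torch.randint(0, 100, (4, 8))    # made-up token ids for 4 queries
ad_ids = torch.randint(0, 100, (4, 32))      # made-up token ids for 4 ads
query_mask = torch.ones_like(query_ids)
ad_mask = torch.ones_like(ad_ids)

scores = model(query_ids, query_mask, ad_ids, ad_mask)
print(scores.shape)  # one similarity score per query-ad pair
```

Because the two inputs never attend to each other, the ad-side vectors could in principle be computed ahead of time, which is one common motivation for two-tower students.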


## Dataset and Preprocessing

Below we show several samples from the QADSM task in XGLUE, where each sample contains `query`, `ad_title`, `ad_description`, and a binary label named `relevance_label` indicating the relevance between each query-ad pair. In this notebook we will conduct training on the 100K `train` split, and conduct evaluation on the 10K `test.en` split. We further remove all training and test samples starting with "ERROR_AdRejected".

![QADSM task in XGLUE](https://drive.google.com/uc?id=1TQDw1b5iZeonUONcbdutO-7XkmFhwEva)
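
The filtering step can be sketched in plain Python. The text does not say which field carries the `ERROR_AdRejected` marker, so the sketch below assumes it appears at the start of `ad_title`; the helper name `is_valid` and the sample rows are made up for illustration.

```python
REJECTED_PREFIX = "ERROR_AdRejected"


def is_valid(example, field="ad_title"):
    # Assumption: the rejection marker appears at the start of `field`.
    return not example[field].startswith(REJECTED_PREFIX)


# Made-up rows that mimic the QADSM columns.
samples = [
    {"query": "running shoes", "ad_title": "Trail Runners",
     "ad_description": "Lightweight shoes", "relevance_label": 1},
    {"query": "cheap flights", "ad_title": "ERROR_AdRejected",
     "ad_description": "", "relevance_label": 0},
    {"query": "laptop deals", "ad_title": "Garden Hoses",
     "ad_description": "Durable hoses", "relevance_label": 0},
]

kept = [example for example in samples if is_valid(example)]
print(len(samples), "->", len(kept))
```

With the Hugging Face `datasets` library, the same predicate could be passed to `Dataset.filter`.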

The logic for loading this dataset and pre-processing samples is implemented in `utils.py`. Because our teacher and student models have different input schemas (the teacher takes a single text sequence as input, while the student takes two), we implement two different preprocess functions, shown below. Note that `preprocess_function_student()` concatenates `ad_title` and `ad_description` into the ad text, and its output contains tokenized results for both the query and the ad text.

In [None]:
def preprocess_function(examples):
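    # Build (query, ad text) pairs, where ad text = ad_title + ' ' + ad_description;
    # the tokenizer encodes each pair as a single sequence for the teacher.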
    texts = []
    for i in range(len(examples['query'])):
        new_text = (examples['query'][i], examples['ad_title'][i] + ' ' + examples['ad_description'][i])
        texts.append(new_text)

    return tokenizer(
        texts,
        padding='max_length',
        truncation=True,
        max_length=args.max_length,
        return_special_tokens_mask=False,
    )

def preprocess_function_student(examples):
    # Concatenate ad_title and ad_description
    texts = []
    for i in range(len(examples['query'])):
        new_text = examples['ad_title'][i] + ' ' + examples['ad_description'][i]
        texts.append(new_text)

    tok_q = tokenizer(
        examples['query'],
        padding='max_length',
        truncation=True,
        max_length=args.max_length_query,
        return_special_tokens_mask=False,
    )

    tok_a = tokenizer(
        texts,
        padding='max_length',
        truncation=True,
        max_length=args.max_length_ad,
        return_special_tokens_mask=False,
    )
    tok_q['input_ids_2'] = tok_a['input_ids']
    tok_q['attention_mask_2'] = tok_a['attention_mask']
    tok_q['token_type_ids_2'] = tok_a['token_type_ids']
    return tok_q

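# Tokenize the dataset with the preprocess function that matches the model type.
# Note: `dataset`, `tokenizer`, and `args` are defined elsewhere in utils.py, so this
# excerpt raises a NameError when run as a standalone cell.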
tokenized_dataset = dataset.map(
    preprocess_function if args.model == 'teacher' else preprocess_function_student,
    batched=True,
    num_proc=1,
    # remove_columns=dataset["train"].column_names,
    load_from_cache_file=True,
    # desc="Running tokenizer on dataset line_by_line",
)

NameError: ignored

## A Simple Experiment on Knowledge Distillation

Having introduced the models and the dataset, we now walk through the major steps in knowledge distillation and demonstrate its effectiveness. To begin with, we introduce a key parameter in our code called `task`, which supports five different settings:


*   `teacher_ft`: loads a pretrained teacher model and finetunes it on the binary labels of the QADSM task.
*   `student_ft`: similarly, finetunes our student model directly on the binary labels.
*   `teacher_inf`: runs inference with our best finetuned teacher model, where "best" means having the smallest validation loss.
*   `student_kd`: trains our student model by regression to the teacher scores produced in the `teacher_inf` setting.
*   `eval`: runs a full evaluation on the same test data, to compare the performance of `teacher_ft`, `student_ft` and `student_kd`.
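
As a small illustration of what "best" means for `teacher_inf`, the sketch below picks the epoch with the smallest validation loss. The helper `select_best_checkpoint` and the loss values are hypothetical; how `main.py` actually tracks and saves checkpoints is not shown in this notebook.

```python
def select_best_checkpoint(val_losses):
    """Return (epoch, loss) for the smallest validation loss.

    Ties are resolved in favor of the earliest epoch.
    """
    if not val_losses:
        raise ValueError("need at least one validation loss")
    best_epoch = min(range(len(val_losses)), key=lambda epoch: val_losses[epoch])
    return best_epoch, val_losses[best_epoch]


# Made-up validation losses, one per epoch.
val_losses = [0.71, 0.64, 0.59, 0.61, 0.60]
best_epoch, best_loss = select_best_checkpoint(val_losses)
print(best_epoch, best_loss)
```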

### Step 0: Student Finetuning on Binary Labels as Baseline

We first run our code using the `student_ft` setting, to get an idea of how well we do without knowledge distillation:





In [None]:
!python kdd_2022_tutorial/main.py --task student_ft --train_batch_size 512 --val_batch_size 2048

### Step 1: Teacher Finetuning

Having obtained the above baseline, we turn to the first step in knowledge distillation: teacher finetuning. Teacher models used in industrial applications are usually very powerful and hence very resource-consuming, but thanks to knowledge distillation, we do not need to worry about how to serve such models online. Instead, all training and inference jobs on these models run offline, and only the lightweight student model is deployed to the online environment.

In this notebook, we can run the following code snippet for teacher finetuning, similar to what we did in the last step:

In [None]:
!python kdd_2022_tutorial/main.py --task teacher_ft --train_batch_size 512 --val_batch_size 2048

### Step 2: Teacher Inference

Once the `teacher_ft` task has completed, we can run inference over the entire training corpus to get a teacher score for each training sample. The dataset scored in this step is often referred to as **distillation data**, and it does not need to have human labels. That is why we can often leverage business logs in industrial scenarios: such logs are usually plentiful, and sampling from them is much easier and cheaper than collecting labels from human judges.

Here we will do the inference on the same 100K training data used in the above finetuning steps. The scale of this data is much smaller than what we typically have in industrial scenarios (where we can sample billions of logs), but as we will see later, this facilitates a fair comparison between `student_ft` and `student_kd`:

In [None]:
!python kdd_2022_tutorial/main.py --task teacher_inf --val_batch_size 4096
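
The core of an offline scoring pass like this can be sketched as follows. This is not the repository's inference code: `toy_teacher` is a hypothetical stand-in for the finetuned teacher, and the random `features` replace tokenized inputs. The sketch only shows the usual pattern of switching to eval mode, disabling gradients, and scoring the data batch by batch.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def score_in_batches(model, features, batch_size):
    """Score every row of `features` without tracking gradients."""
    model.eval()
    scores = []
    for start in range(0, features.size(0), batch_size):
        batch = features[start:start + batch_size]
        scores.append(model(batch).squeeze(-1))
    return torch.cat(scores)


torch.manual_seed(0)
# Hypothetical teacher: a linear layer followed by a sigmoid, so each score lies in [0, 1].
toy_teacher = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())
features = torch.randn(10, 8)  # made-up inputs standing in for 10 samples

teacher_scores = score_in_batches(toy_teacher, features, batch_size=4)
print(teacher_scores.shape)  # one score per sample
```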

This operation outputs a `prediction.tsv` file under `output/teacher_inf`. We need to copy this file to `data/QADSM/`, since this is where `load_dataset.py` will look for the scored data in the next step.

In [None]:
import os
# Create the target directory if it does not exist yet.
os.makedirs('data/QADSM', exist_ok=True)
!cp output/teacher_inf/prediction.tsv data/QADSM/prediction.tsv

### Step 3: Distill Knowledge from Teacher to Student

Finally, we reach the actual distillation step! All we need to do is run the script once again with task `student_kd`. Note that this time we need to specify the relative path of `load_dataset.py` using the `load_dataset_py_path` parameter:

In [None]:
!python kdd_2022_tutorial/main.py --task student_kd --train_batch_size 512 --val_batch_size 2048 --load_dataset_py_path kdd_2022_tutorial/load_dataset.py
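
Conceptually, training "by regression to the teacher" means replacing the binary labels with the teacher's continuous scores as targets. The sketch below illustrates this with a mean squared error loss, which is an assumption: the exact loss used in the repository is not shown here. The `student` is a hypothetical toy model, and `teacher_scores` are random numbers standing in for the contents of `prediction.tsv`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy student; the real student in this notebook is TwinBERT.
student = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())
optimizer = torch.optim.SGD(student.parameters(), lr=0.1)

features = torch.randn(64, 8)        # made-up inputs
teacher_scores = torch.rand(64, 1)   # made-up soft targets in [0, 1]

# Assumption: regress the student output onto the teacher score with MSE.
loss_fn = nn.MSELoss()

losses = []
for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(student(features), teacher_scores)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

The soft targets carry more information per sample than a 0/1 label, which is one common explanation for why distillation can help even when the distillation data is the same as the finetuning data.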

### Evaluation

Now that all the major steps of knowledge distillation are complete, we want to know whether these efforts have had any real impact. To find out, let us run our script one last time with task `eval`, which loads the best checkpoint under the `teacher_ft`, `student_ft`, and `student_kd` settings respectively and evaluates each on the same test data.

This prints a table at the end of its log file, where the last row highlights the improvement by comparing the metrics for `student_kd` against those for `student_ft`. Even though we experiment on such a small dataset, with barely any advanced training strategies or hyper-parameter tuning, we see a remarkable 3% AUC lift:

In [None]:
!python kdd_2022_tutorial/main.py --task eval
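
For readers unfamiliar with the metric, here is a minimal, dependency-free sketch of ROC AUC and of comparing two models on the same labels. It is not the evaluation code from the repository: the function `roc_auc` and all labels and scores below are made up, and the pairwise loop is quadratic, so it is only suitable for small illustrations.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores for two hypothetical students.
labels = [1, 0, 1, 0, 1, 0]
baseline_scores = [0.9, 0.8, 0.4, 0.3, 0.7, 0.6]
distilled_scores = [0.9, 0.5, 0.6, 0.3, 0.7, 0.4]

auc_baseline = roc_auc(labels, baseline_scores)
auc_distilled = roc_auc(labels, distilled_scores)
print(f"baseline AUC:  {auc_baseline:.4f}")
print(f"distilled AUC: {auc_distilled:.4f}")
print(f"absolute lift: {auc_distilled - auc_baseline:.4f}")
```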