# Week 8: model compression

## Distillation

We will examine the TinyBERT distillation approach for language models. Please open [this repository](https://github.com/cxa-unique/Simplified-TinyBERT/blob/main/task_distill_simplified.py) and study the main function.

__Approach description__

The initial approach to BERT distillation was called DistilBERT and used the following technique.
- Setup. We use a __teacher__ model and a __student__ model. The teacher is a pretrained BERT model; the student is the smaller model we are going to train.
- Student initialization. __Student layers are initialized from uniformly spaced teacher layers__: e.g. if the teacher has 8 attention blocks and we want the student to have 4, we take every second block from the teacher.
- Loss. DistilBERT uses a __combination of losses__: we not only train the student with the original LM loss, but also want the student's probability distribution over tokens to be close to the teacher's, so we add `soft_cross_entropy` between the student's and the teacher's logits to the LM loss. __Examine the `soft_cross_entropy` function in the script.__
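
The initialization and loss described above can be sketched in plain Python. This is a minimal illustration and not the code from the repository: `pick_teacher_layers`, `hard_cross_entropy` and `distillation_loss` are hypothetical names, the weighting factor `alpha` is an assumption, starting the layer selection at index 0 is one of several possible conventions, and the soft cross-entropy here sums over classes for a single example.

```python
import math


def pick_teacher_layers(num_teacher_layers, num_student_layers):
    """Indices of uniformly spaced teacher layers used to initialize the student."""
    if num_teacher_layers % num_student_layers != 0:
        raise ValueError("this sketch assumes the layer counts divide evenly")
    stride = num_teacher_layers // num_student_layers
    return list(range(0, num_teacher_layers, stride))


def log_softmax(logits):
    """Numerically stable log-softmax of a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def soft_cross_entropy(student_logits, teacher_logits):
    """Cross-entropy of the student's distribution against the teacher's soft targets."""
    teacher_probs = [math.exp(lp) for lp in log_softmax(teacher_logits)]
    student_log_probs = log_softmax(student_logits)
    return -sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))


def hard_cross_entropy(student_logits, label):
    """Ordinary cross-entropy against the ground-truth token or class index."""
    return -log_softmax(student_logits)[label]


def distillation_loss(student_logits, teacher_logits, label, alpha=0.5):
    """Weighted sum of the soft (teacher) and hard (label) terms; alpha is an assumed weight."""
    soft = soft_cross_entropy(student_logits, teacher_logits)
    hard = hard_cross_entropy(student_logits, label)
    return alpha * soft + (1.0 - alpha) * hard


teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.5, 0.7, -0.5]
print(pick_teacher_layers(8, 4))  # [0, 2, 4, 6]
print(distillation_loss(student_logits, teacher_logits, label=0))
```

The soft term is smallest when the student's distribution matches the teacher's, which is what pulls the student towards the teacher during training.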

For the TinyBERT approach, additional MSE regularization terms are used during student training:
- The idea is to make sure that both the student's embeddings and the student's attention matrices are close to the teacher's. __Examine lines 263 and below.__
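
A rough sketch of these extra terms, in plain Python with nested lists standing in for tensors. The function names and the `layer_map` argument (which teacher layer each student layer is compared with) are hypothetical, and the sketch assumes the student and teacher matrices already have matching shapes; consult the script for the exact terms and weights it uses.

```python
def mse(a, b):
    """Mean squared error between two equally shaped matrices (lists of rows)."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def intermediate_loss(student_emb, teacher_emb, student_attn, teacher_attn, layer_map):
    """Embedding MSE plus one attention MSE per student layer.

    layer_map[i] is the index of the teacher layer matched to student layer i.
    """
    loss = mse(student_emb, teacher_emb)
    for student_idx, teacher_idx in enumerate(layer_map):
        loss += mse(student_attn[student_idx], teacher_attn[teacher_idx])
    return loss


# Toy example: a 4-layer teacher, a 2-layer student, 2x2 attention matrices.
teacher_attn = [[[0.9, 0.1], [0.2, 0.8]],
                [[0.7, 0.3], [0.4, 0.6]],
                [[0.6, 0.4], [0.5, 0.5]],
                [[0.5, 0.5], [0.3, 0.7]]]
student_attn = [[[0.8, 0.2], [0.4, 0.6]],
                [[0.5, 0.5], [0.1, 0.9]]]
teacher_emb = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
student_emb = [[0.0, 0.2, 0.3], [0.4, 0.5, 0.8]]

print(intermediate_loss(student_emb, teacher_emb, student_attn, teacher_attn, layer_map=[1, 3]))
```

These terms are added to the prediction-level loss, so the student is pushed to imitate the teacher's internal representations as well as its outputs.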


## Quantization

There are three approaches to model quantization: static, dynamic, and quantization-aware training. Let's take a brief look at dynamic quantization of a pretrained BERT. At the end of this section you will see a memory, speed, and quality benchmark for the original and quantized versions of the same model.

This section is derived from the official PyTorch quantization [guide](https://pytorch.org/tutorials/intermediate/dynamic_quantization_bert_tutorial.html). Note that the data downloading snippet has been updated compared to the original script.
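
Before running the real thing, it may help to see what "quantizing to int8" means numerically. The sketch below shows per-tensor affine quantization of a handful of float values in plain Python. It is an illustration of the arithmetic only, not PyTorch's implementation; the helper names and the sample values are made up for this example.

```python
def choose_qparams(values, qmin=-128, qmax=127):
    """Pick a scale and zero point that map the value range onto [qmin, qmax]."""
    lo = min(min(values), 0.0)  # include 0.0 so that zero is exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:  # all-zero input: any positive scale works
        scale = 1.0
    zero_point = int(round(qmin - lo / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """float -> int8: divide by the scale, shift by the zero point, clamp."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """int8 -> float: undo the shift and the scaling."""
    return [(q - zero_point) * scale for q in q_values]


weights = [-0.6, -0.1, 0.0, 0.35, 1.2]
scale, zero_point = choose_qparams(weights)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)

print("scale:", scale, "zero point:", zero_point)
print("int8 values:", q)
print("max round-trip error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

Each value is stored in one byte instead of four, and the round-trip error stays within about half of `scale`. With dynamic quantization the weights are converted like this ahead of time, while the scale for activations is computed on the fly during inference.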

### Dynamic quantization

In [1]:
!pip install transformers

In [2]:
from __future__ import absolute_import, division, print_function

import logging
import os
import random
import sys
import time
import warnings
from argparse import Namespace

import numpy as np
import torch
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
                              TensorDataset)
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm
from transformers import (BertConfig, BertForSequenceClassification, BertTokenizer,)
from transformers import glue_compute_metrics as compute_metrics
from transformers import glue_output_modes as output_modes
from transformers import glue_processors as processors
from transformers import glue_convert_examples_to_features as convert_examples_to_features

In [3]:
warnings.filterwarnings("ignore")

# Setup logging
logger = logging.getLogger(__name__)
logging.basicConfig(format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
                    datefmt="%m/%d/%Y %H:%M:%S",
                    level=logging.WARN)

logging.getLogger("transformers.modeling_utils").setLevel(
   logging.WARN)  # Reduce logging

print(torch.__version__)

1.10.2+cu102


In [4]:
# Set the number of threads to 1 so that both models are benchmarked single-threaded
torch.set_num_threads(1)
print(torch.__config__.parallel_info())

ATen/Parallel:
	at::get_num_threads() : 1
	at::get_num_interop_threads() : 32
OpenMP 201511 (a.k.a. OpenMP 4.5)
	omp_get_max_threads() : 1
Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
	mkl_get_max_threads() : 1
Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
std::thread::hardware_concurrency() : 64
Environment variables:
	OMP_NUM_THREADS : [not set]
	MKL_NUM_THREADS : [not set]
ATen parallel backend: OpenMP



In [5]:
!wget https://gist.githubusercontent.com/raffaem/bcd8e1c21339408cd477b6798e7d6133/raw/d62be52f81c503e43460a291abc1260764e728d4/download_glue_data.py
!python download_glue_data.py --data_dir='glue_data' --tasks='MRPC' 
# `!export` only affects a throwaway subshell, so set the variables with %env instead
%env GLUE_DIR=./glue_data
%env TASK_NAME=MRPC
%env OUT_DIR=./MRPC/
!wget https://download.pytorch.org/tutorial/MRPC.zip
!unzip MRPC.zip

In [6]:
configs = Namespace()

# The output directory for the fine-tuned model, $OUT_DIR.
configs.output_dir = "./MRPC/"

# The data directory for the MRPC task in the GLUE benchmark, $GLUE_DIR/$TASK_NAME.
configs.data_dir = "./glue_data/MRPC"

# The model name or path for the pre-trained model.
configs.model_name_or_path = "bert-base-uncased"
# The maximum length of an input sequence
configs.max_seq_length = 128

# Prepare GLUE task.
configs.task_name = "MRPC".lower()
configs.processor = processors[configs.task_name]()
configs.output_mode = output_modes[configs.task_name]
configs.label_list = configs.processor.get_labels()
configs.model_type = "bert".lower()
configs.do_lower_case = True

# Set the device, batch size, topology, and caching flags.
configs.device = "cpu"
configs.per_gpu_eval_batch_size = 8
configs.n_gpu = 0
configs.local_rank = -1
configs.overwrite_cache = False


# Set random seed for reproducibility.
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
set_seed(42)

In [7]:
# Load the fine-tuned MRPC model and its tokenizer from configs.output_dir
tokenizer = BertTokenizer.from_pretrained(
    configs.output_dir, do_lower_case=configs.do_lower_case)

model = BertForSequenceClassification.from_pretrained(configs.output_dir)
model = model.to(configs.device)

Here you can see an implementation of the BERT evaluation.

In [8]:
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

def evaluate(args, model, tokenizer, prefix=""):
    # Loop to handle MNLI double evaluation (matched, mis-matched)
    eval_task_names = ("mnli", "mnli-mm") if args.task_name == "mnli" else (args.task_name,)
    eval_outputs_dirs = (args.output_dir, args.output_dir + "-MM") if args.task_name == "mnli" else (args.output_dir,)

    results = {}
    for eval_task, eval_output_dir in zip(eval_task_names, eval_outputs_dirs):
        eval_dataset = load_and_cache_examples(args, eval_task, tokenizer, evaluate=True)

        if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:
            os.makedirs(eval_output_dir)

        args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
        # Note that DistributedSampler samples randomly
        eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
        eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)

        # multi-gpu eval
        if args.n_gpu > 1:
            model = torch.nn.DataParallel(model)

        # Eval!
        logger.info("***** Running evaluation {} *****".format(prefix))
        logger.info("  Num examples = %d", len(eval_dataset))
        logger.info("  Batch size = %d", args.eval_batch_size)
        eval_loss = 0.0
        nb_eval_steps = 0
        preds = None
        out_label_ids = None
        for batch in tqdm(eval_dataloader, desc="Evaluating"):
            model.eval()
            batch = tuple(t.to(args.device) for t in batch)

            with torch.no_grad():
                inputs = {"input_ids":      batch[0],
                          "attention_mask": batch[1],
                          "labels":         batch[3]}
                if args.model_type != "distilbert":
                    inputs["token_type_ids"] = batch[2] if args.model_type in ["bert", "xlnet"] else None  # XLM, DistilBERT and RoBERTa don't use segment_ids
                outputs = model(**inputs)
                tmp_eval_loss, logits = outputs[:2]

                eval_loss += tmp_eval_loss.mean().item()
            nb_eval_steps += 1
            if preds is None:
                preds = logits.detach().cpu().numpy()
                out_label_ids = inputs["labels"].detach().cpu().numpy()
            else:
                preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
                out_label_ids = np.append(out_label_ids, inputs["labels"].detach().cpu().numpy(), axis=0)

        eval_loss = eval_loss / nb_eval_steps
        if args.output_mode == "classification":
            preds = np.argmax(preds, axis=1)
        elif args.output_mode == "regression":
            preds = np.squeeze(preds)
        result = compute_metrics(eval_task, preds, out_label_ids)
        results.update(result)

        output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
        with open(output_eval_file, "w") as writer:
            logger.info("***** Eval results {} *****".format(prefix))
            for key in sorted(result.keys()):
                logger.info("  %s = %s", key, str(result[key]))
                writer.write("%s = %s\n" % (key, str(result[key])))

    return results


def load_and_cache_examples(args, task, tokenizer, evaluate=False):
    if args.local_rank not in [-1, 0] and not evaluate:
        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache

    processor = processors[task]()
    output_mode = output_modes[task]
    # Load data features from cache or dataset file
    cached_features_file = os.path.join(args.data_dir, "cached_{}_{}_{}_{}".format(
        "dev" if evaluate else "train",
        list(filter(None, args.model_name_or_path.split("/"))).pop(),
        str(args.max_seq_length),
        str(task)))
    if os.path.exists(cached_features_file) and not args.overwrite_cache:
        logger.info("Loading features from cached file %s", cached_features_file)
        features = torch.load(cached_features_file)
    else:
        logger.info("Creating features from dataset file at %s", args.data_dir)
        label_list = processor.get_labels()
        if task in ["mnli", "mnli-mm"] and args.model_type in ["roberta"]:
            # HACK(label indices are swapped in RoBERTa pretrained model)
            label_list[1], label_list[2] = label_list[2], label_list[1]
        examples = processor.get_dev_examples(args.data_dir) if evaluate else processor.get_train_examples(args.data_dir)
        features = convert_examples_to_features(examples,
                                                tokenizer,
                                                label_list=label_list,
                                                max_length=args.max_seq_length,
                                                output_mode=output_mode,
                                                # pad_on_left=bool(args.model_type in ['xlnet']),                 # pad on the left for xlnet
                                                # pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
                                                # pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0,
        )
        if args.local_rank in [-1, 0]:
            logger.info("Saving features into cached file %s", cached_features_file)
            torch.save(features, cached_features_file)

    if args.local_rank == 0 and not evaluate:
        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache

    # Convert to Tensors and build dataset
    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
    all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
    if output_mode == "classification":
        all_labels = torch.tensor([f.label for f in features], dtype=torch.long)
    elif output_mode == "regression":
        all_labels = torch.tensor([f.label for f in features], dtype=torch.float)

    dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_labels)
    return dataset

In [9]:
# Quantize the weights of every torch.nn.Linear layer to int8 (torch.qint8)
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
print(quantized_model)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
              (key): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
              (value): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
              (dropout): Dropout(p=0.1, inplace=False)
            )
      

Let's examine the memory footprint of both models, measured here as the size of the serialized `state_dict`.

In [10]:
def print_size_of_model(model):
    torch.save(model.state_dict(), "temp.p")
    print("Size (MB):", os.path.getsize("temp.p") / 1e6)
    os.remove("temp.p")

print_size_of_model(model)
print_size_of_model(quantized_model)

Size (MB): 438.017325
Size (MB): 181.497157
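
The two sizes can be roughly reproduced by counting parameters. The estimate below assumes the standard BERT-base shapes (hidden size 768, intermediate size 3072, 12 layers, vocabulary of 30522) and two output labels for MRPC. It also assumes that only the `Linear` weights are stored as one byte each, while the embeddings, LayerNorm parameters and biases stay in FP32. Serialization overhead and the stored quantization parameters are ignored, so the numbers come out slightly below the measured ones.

```python
hidden, intermediate, layers = 768, 3072, 12
vocab, max_positions, type_vocab, num_labels = 30522, 512, 2, 2

# Word, position and token-type embeddings.
embedding_params = (vocab + max_positions + type_vocab) * hidden
# One LayerNorm after the embeddings and two per encoder layer (weight + bias each).
layer_norm_params = (1 + 2 * layers) * 2 * hidden
# Per layer: query, key, value, attention output (hidden x hidden) and the two feed-forward matrices.
linear_weights = (layers * (4 * hidden * hidden + 2 * hidden * intermediate)
                  + hidden * hidden        # pooler
                  + hidden * num_labels)   # classifier
linear_biases = layers * (4 * hidden + intermediate + hidden) + hidden + num_labels

total_params = embedding_params + layer_norm_params + linear_weights + linear_biases
fp32_mb = total_params * 4 / 1e6
int8_mb = (linear_weights * 1 + (total_params - linear_weights) * 4) / 1e6

print("parameters:", total_params)
print("estimated FP32 size (MB):", round(fp32_mb, 1))
print("estimated INT8 size (MB):", round(int8_mb, 1))
```

Under these assumptions roughly a quarter of the parameters (mostly the embeddings) remain in FP32, which is why the file shrinks by less than a factor of four.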


Finally, let's run a speed and quality benchmark.

In [11]:
def time_model_evaluation(model, configs, tokenizer):
    eval_start_time = time.time()
    result = evaluate(configs, model, tokenizer, prefix="")
    eval_end_time = time.time()
    eval_duration_time = eval_end_time - eval_start_time
    print(result)
    print("Evaluate total time (seconds): {0:.1f}".format(eval_duration_time))

In [12]:
# Evaluate the original FP32 BERT model
time_model_evaluation(model, configs, tokenizer)

Evaluating: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [02:15<00:00,  2.66s/it]

{'acc': 0.8602941176470589, 'f1': 0.9018932874354562, 'acc_and_f1': 0.8810937025412575}
Evaluate total time (seconds): 135.8





In [13]:
# Evaluate the INT8 BERT model after the dynamic quantization
time_model_evaluation(quantized_model, configs, tokenizer)

Evaluating: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [02:16<00:00,  2.68s/it]

{'acc': 0.8553921568627451, 'f1': 0.8977469670710572, 'acc_and_f1': 0.8765695619669012}
Evaluate total time (seconds): 137.0
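
The numbers printed in this section can be put side by side with a few lines of arithmetic. The values below are copied from the outputs above; the dictionary layout is only for illustration.

```python
fp32 = {"size_mb": 438.017325, "acc": 0.8602941176470589,
        "f1": 0.9018932874354562, "seconds": 135.8}
int8 = {"size_mb": 181.497157, "acc": 0.8553921568627451,
        "f1": 0.8977469670710572, "seconds": 137.0}

size_ratio = fp32["size_mb"] / int8["size_mb"]  # how many times smaller the INT8 file is
speedup = fp32["seconds"] / int8["seconds"]      # values above 1 mean INT8 is faster
acc_drop = fp32["acc"] - int8["acc"]
f1_drop = fp32["f1"] - int8["f1"]

print(f"size: {size_ratio:.2f}x smaller")
print(f"speedup: {speedup:.2f}x")
print(f"accuracy drop: {100 * acc_drop:.2f} points, F1 drop: {100 * f1_drop:.2f} points")
```

In this particular run the quantized model is about 2.4 times smaller and loses roughly half a point of accuracy, while the total evaluation time is essentially unchanged.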





We can now serialize and deserialize the quantized model using the `torch.jit` tracer.

In [14]:
def ids_tensor(shape, vocab_size, name=None):
    # Creates a random int64 (torch.long) tensor of the given shape with values in [0, vocab_size)
    total_dims = 1
    for dim in shape:
        total_dims *= dim

    values = []
    for _ in range(total_dims):
        values.append(random.randint(0, vocab_size - 1))

    return torch.tensor(data=values, dtype=torch.long, device="cpu").view(shape).contiguous()


input_ids = ids_tensor([8, 128], 2)  # dummy inputs for tracing only; the unused third argument was dropped
token_type_ids = ids_tensor([8, 128], 2)
attention_mask = ids_tensor([8, 128], vocab_size=2)
dummy_input = (input_ids, attention_mask, token_type_ids)
traced_model = torch.jit.trace(quantized_model, dummy_input, strict=False)
torch.jit.save(traced_model, "bert_traced_eager_quant.pt")

In [15]:
loaded_quantized_model = torch.jit.load("bert_traced_eager_quant.pt")

### ONNX runtime

Besides the default PyTorch tools, we can use external libraries as an inference engine for our model. One such tool is ONNX (Open Neural Network Exchange). ONNX itself is just a format for storing a computational graph. A PyTorch or TensorFlow model can be exported to ONNX.

Alongside the ONNX format, there is the ONNX Runtime engine, which can be used not only for computational graph inference but also for quantization. Let's see a brief example of its usage.

In [16]:
!pip install onnx onnxruntime coloredlogs sympy optimum

In [17]:
model.config

BertConfig {
  "_name_or_path": "./MRPC/",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "finetuning_task": "mrpc",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "transformers_version": "4.17.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

In [18]:
# convert model to ONNX without quantization
model = BertForSequenceClassification.from_pretrained(configs.output_dir)
huggingface_model_path = "bert-hf"
tokenizer.save_pretrained(huggingface_model_path)
model.save_pretrained(huggingface_model_path)

onnx_model_path = "bert-example.onnx"

os.system(
    f"python -m transformers.onnx "
    f"--model {huggingface_model_path} "
    f"{onnx_model_path}"
)

Some weights of the model checkpoint at bert-hf were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Using framework PyTorch: 1.10.2+cu102
Overriding 1 configuration item(s)
	- use_cache -> False
Validating ONNX model...
	-[✓] ONNX model output names match reference model ({'last_hidden_state'})
	- Validating ONNX Model output "last_hidden_state":
		-[✓] (2, 8, 768) matches (2, 8, 768)
		-[✓] all values close (atol: 1e-05)
All good, model saved at: bert-example.onnx/model.onnx


0

In [19]:
from optimum.onnxruntime import ORTConfig, ORTQuantizer

# The type of quantization to apply
ort_config = ORTConfig(quantization_approach="dynamic")
quantizer = ORTQuantizer(ort_config)
# Quantize the model!
quantizer.fit(huggingface_model_path, output_dir=".", feature="sequence-classification")

In [20]:
!ls

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
bert-example.onnx	    homework.ipynb	  MRPC
bert-hf			    homework_utils	  MRPC.zip
bert_traced_eager_quant.pt  model.onnx		  practice.ipynb
download_glue_data.py	    model-opt.onnx	  README.md
glue_data		    model-quantized.onnx  slides.pdf


In [21]:
from datasets import Dataset
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModel

# Load quantized model
ort_model = ORTModel("model-quantized.onnx", quantizer.onnx_config)
# Create a dataset or load one from the Hub
ds = Dataset.from_dict({"sentence": ["I love burritos!"]})
# Tokenize the inputs & convert to PyTorch tensors
tokenizer = AutoTokenizer.from_pretrained(huggingface_model_path)

def preprocess_fn(ex):
    return tokenizer(ex["sentence"])

tokenized_ds = ds.map(preprocess_fn, remove_columns=ds.column_names)
tokenized_ds.set_format("torch")
# Create dataloader and run evaluation
dataloader = DataLoader(tokenized_ds)
ort_outputs = ort_model.evaluation_loop(dataloader)
# Extract logits!
ort_outputs.predictions

0ex [00:00, ?ex/s]

array([[ 1.1566894, -0.9845178]], dtype=float32)