MarsJacobs/kd-qat-large-enc

Code for "Understanding and Improving Knowledge Distillation for Quantization-Aware-Training of Large Transformer Encoders"
EMNLP 2022 (Main Conference): Proceedings, arXiv

  • This paper provides an in-depth analysis of the mechanism of Knowledge Distillation (KD) on attention recovery of quantized large Transformer encoders.
  • Based on this analysis, we propose new sets of KD loss functions for better QAT at ultra-low bit precision (weight ternarization of Transformer encoders).

Our implementation is based on the Huawei-Noah TernaryBERT PyTorch code (link).

Setup

pip install -r requirements.txt

First, you need task-specific fine-tuned full-precision BERT models to initialize the model for QAT. You can fine-tune a BERT-base/large pre-trained model using the Hugging Face example code at the following link.

Alternatively, you can download fine-tuned BERT-base/large models from the Google Cloud link (SST-2 and RTE provided).

Fine-tuned BERT model files should be placed in the "models" folder under their GLUE task name (e.g., models/rte).

Training (QAT)

This repository provides multiple KD options for Ternary QAT of BERT-base/large.

Attention map/output loss

For attention map/output loss QAT,

bash run_kd_qat_map.sh $GPU_NUM # map loss
bash run_kd_qat_output.sh $GPU_NUM # output loss
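
Conceptually, the two options differ in which attention tensor is matched between the teacher and the student. The sketch below shows illustrative forms of the two loss terms; the exact formulation (attention scores vs. softmax probabilities, normalization, layer weighting) is defined by the paper and the scripts, so treat this only as an outline.

# Illustrative KD loss terms (not necessarily the exact formulation in the scripts):
# - the map loss matches per-layer self-attention maps (softmax probabilities)
# - the output loss matches per-layer attention outputs
import torch.nn.functional as F

def attention_map_loss(student_maps, teacher_maps):
    # each argument: list of (batch, heads, seq, seq) attention probability tensors
    return sum(F.mse_loss(s, t) for s, t in zip(student_maps, teacher_maps))

def attention_output_loss(student_outputs, teacher_outputs):
    # each argument: list of (batch, seq, hidden) attention output tensors
    return sum(F.mse_loss(s, t) for s, t in zip(student_outputs, teacher_outputs))

def kd_loss(s_maps, t_maps, s_outs, t_outs, map_coeff=1.0, output_coeff=0.0):
    # the mixing coefficients play the same role as $map_coeff / $output_coeff below
    return (map_coeff * attention_map_loss(s_maps, t_maps)
            + output_coeff * attention_output_loss(s_outs, t_outs))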

KD options exploration

To explore KD options for QAT with BERT-base/large over the GLUE tasks, use run_kd_qat_exploration.sh. For example, to run attention-map loss QAT of BERT-base on the CoLA task:

task_name=cola
bert=base
map_coeff=1
output_coeff=0
bash run_kd_qat_exploration.sh $task_name $bert $map_coeff $output_coeff 

Find mixing parameters

To explore mixing parameters for the attention-map/output losses, run run_mixing_param_sweep.sh.
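
If you want to drive a sweep yourself, the hypothetical sketch below simply loops over candidate (map_coeff, output_coeff) pairs and calls run_kd_qat_exploration.sh with the arguments documented above. It is not the interface of run_mixing_param_sweep.sh, and the coefficient grid is arbitrary.

# Hypothetical sweep driver: reuses the documented arguments of
# run_kd_qat_exploration.sh; the coefficient grid here is arbitrary.
import itertools
import subprocess

task_name = "cola"            # GLUE task
bert = "base"                 # "base" or "large"
coeff_grid = [0.0, 0.5, 1.0]  # candidate mixing coefficients (illustrative)

for map_coeff, output_coeff in itertools.product(coeff_grid, repeat=2):
    if map_coeff == 0.0 and output_coeff == 0.0:
        continue  # skip the configuration with no KD signal
    cmd = ["bash", "run_kd_qat_exploration.sh",
           task_name, bert, str(map_coeff), str(output_coeff)]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)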

Experiments (Analysis)

Model Directory

For the experiment notebooks, you need the QAT model files from the Training section. Note that fine-tuned full-precision model files should be placed in the models folder, and QAT model files should be placed in the output folder. For example,

teacher_model_dir = "models/BERT_base/sst-2"
student_model_dir = "output/BERT_large/rte/exploration/$EXP_NAME"

Please set the full-precision/QAT model directory names properly in the notebook :)
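
As a rough illustration of how these directories are used, the sketch below loads the full-precision teacher with Hugging Face transformers and reads a QAT student checkpoint with torch.load. The actual notebooks build the quantized student with this repository's own model classes, and the checkpoint file name pytorch_model.bin is an assumption here.

# Sketch only: the notebooks use the repository's own (quantized) model classes;
# this just shows where the two directories point.
import os
import torch
from transformers import BertForSequenceClassification

exp_name = os.environ["EXP_NAME"]  # experiment name used during Training
teacher_model_dir = "models/BERT_base/sst-2"
student_model_dir = f"output/BERT_large/rte/exploration/{exp_name}"

# Full-precision teacher (standard Hugging Face checkpoint layout assumed).
teacher = BertForSequenceClassification.from_pretrained(teacher_model_dir)

# QAT student: load the raw state dict (file name is an assumption).
student_state = torch.load(os.path.join(student_model_dir, "pytorch_model.bin"),
                           map_location="cpu")
print(f"Loaded {len(student_state)} student parameter tensors")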

Exp1. Attention Map Distance (Figure 4)

Exp 1 shows how to measure the self-attention map distance between the teacher model (full precision) and the student model (ternary quantized). The notebook produces an attention map distance plot as follows.

[Figure: attention map distance plot]
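
As a generic sketch of the idea, the snippet below runs the teacher and student on the same batch with output_attentions=True and reports a per-layer mean squared distance between their attention maps. It assumes both checkpoints load as Hugging Face BERT models and uses an illustrative distance; the notebook defines the exact metric used in Figure 4.

# Sketch: per-layer distance between teacher and student self-attention maps.
# Assumes both directories hold Hugging Face-compatible BERT checkpoints.
import torch
from transformers import BertModel, BertTokenizer

teacher_dir = "models/rte"  # full-precision teacher for the task (see Setup)
student_dir = "output/..."  # matching QAT student directory (fill in your run)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
teacher = BertModel.from_pretrained(teacher_dir, output_attentions=True).eval()
student = BertModel.from_pretrained(student_dir, output_attentions=True).eval()

batch = tokenizer(["a simple example sentence"], return_tensors="pt")
with torch.no_grad():
    t_attn = teacher(**batch).attentions  # tuple of (batch, heads, seq, seq), one per layer
    s_attn = student(**batch).attentions

for layer, (t, s) in enumerate(zip(t_attn, s_attn)):
    dist = (t - s).pow(2).mean().item()  # mean squared distance over heads/positions
    print(f"layer {layer:2d}: attention map distance {dist:.6f}")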

Exp2. Hessian Max Eigenvalue Spectra Analysis (Figure 3)

Exp2 provides the Hessian max eigenvalue spectra of the QAT model. This implementation is based on PyHessian and the repository for "Park et al., How Do Vision Transformers Work?, ICLR 2022".

PyHessian: https://github.com/amirgholami/PyHessian
How-do-vits-work: xxxnell/how-do-vits-work#12
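
For reference, a minimal PyHessian sketch is shown below. PyHessian calls model(inputs) with a single tensor, so a thin wrapper around the classifier is used; the wrapper, the checkpoint path, the loss, and the toy batch are all illustrative assumptions.

# Sketch: top Hessian eigenvalues of a BERT classifier via PyHessian
# (https://github.com/amirgholami/PyHessian). Inputs below are toy assumptions.
import torch
import torch.nn as nn
from pyhessian import hessian
from transformers import BertForSequenceClassification, BertTokenizer

class SingleInputWrapper(nn.Module):
    """PyHessian expects model(inputs); wrap BERT so it takes only input_ids."""
    def __init__(self, classifier):
        super().__init__()
        self.classifier = classifier
    def forward(self, input_ids):
        mask = (input_ids != 0).long()  # assumes pad token id 0
        return self.classifier(input_ids=input_ids, attention_mask=mask).logits

classifier = BertForSequenceClassification.from_pretrained("models/rte")  # or a QAT checkpoint
model = SingleInputWrapper(classifier).eval()
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

batch = tokenizer(["a toy input sentence"], return_tensors="pt")
inputs, targets = batch["input_ids"], torch.tensor([1])

criterion = nn.CrossEntropyLoss()
hessian_comp = hessian(model, criterion, data=(inputs, targets), cuda=False)
top_eigenvalues, _ = hessian_comp.eigenvalues(top_n=5)  # power iteration
print("Top-5 Hessian eigenvalues:", top_eigenvalues)
# hessian_comp.density() additionally estimates the full eigenvalue spectral density.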

Exp3. Attention Output Analysis (Figure 5-6)

This experiment analyzes the attention output's min-max dynamic range and attention norms. Once you load the model files properly, you can inspect the attention output dynamic range per layer and conduct norm-based analysis per layer/head.

Per-layer attention output min-max dynamic range (left); norm-based analysis per layer (right).

[Figure: per-layer dynamic range and norm-based analysis]
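
A generic way to obtain the per-layer range is to register forward hooks on each layer's attention output module and track running min/max, as sketched below. Treating the post-projection BertSelfOutput tensor as the "attention output" is an assumption; the notebook defines the exact tap point.

# Sketch: per-layer min/max of the attention output, collected with forward hooks.
# Hooks are placed on attention.output (post-projection); adjust to match the notebook.
import torch
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained("models/rte").eval()  # or a QAT checkpoint
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

stats = {}  # layer index -> (running min, running max)

def make_hook(idx):
    def hook(module, inputs, output):
        out = output if isinstance(output, torch.Tensor) else output[0]
        lo, hi = out.min().item(), out.max().item()
        cur = stats.get(idx, (float("inf"), float("-inf")))
        stats[idx] = (min(cur[0], lo), max(cur[1], hi))
    return hook

handles = [layer.attention.output.register_forward_hook(make_hook(i))
           for i, layer in enumerate(model.encoder.layer)]

batch = tokenizer(["an example sentence for range profiling"], return_tensors="pt")
with torch.no_grad():
    model(**batch)

for i, (lo, hi) in sorted(stats.items()):
    print(f"layer {i:2d}: attention output range [{lo:+.3f}, {hi:+.3f}]")

for h in handles:
    h.remove()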

Per-head attention probability and transformed-output heat map visualization, plus heat maps of the difference between the student and teacher models (with attention-map/output loss QAT).

[Figure: per-head attention and output heat maps]

The attention norm-based analysis follows "Kobayashi et al., Attention Is Not Only a Weight: Analyzing Transformers with Vector Norms, EMNLP 2020" (code link).
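
As a rough sketch of the vector-norm idea, the snippet below computes per-head ||alpha * v|| scores for one layer from the attention probabilities and value vectors of a Hugging Face BERT model. This is a simplified form of Kobayashi et al.'s analysis (it omits the output projection), so treat it as illustrative only.

# Sketch: per-head ||alpha_{i,j} * v_j|| for one layer. Simplified norm analysis;
# Kobayashi et al. additionally fold the output projection into f(x).
import torch
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained("models/rte", output_attentions=True).eval()
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
layer_idx = 0
self_attn = model.encoder.layer[layer_idx].attention.self

captured = {}
handle = self_attn.value.register_forward_hook(
    lambda module, inputs, output: captured.update(values=output))  # (batch, seq, hidden)

batch = tokenizer(["norms are not the same as attention weights"], return_tensors="pt")
with torch.no_grad():
    attn = model(**batch).attentions[layer_idx]  # (batch, heads, seq, seq)
handle.remove()

b, h, s, _ = attn.shape
d_head = captured["values"].shape[-1] // h
v = captured["values"].view(b, s, h, d_head).permute(0, 2, 1, 3)  # (batch, heads, seq, d_head)
v_norm = v.norm(dim=-1)                  # ||v_j|| per head
weighted = attn * v_norm.unsqueeze(2)    # alpha_{i,j} * ||v_j|| (alpha is non-negative)
per_head_score = weighted.mean(dim=(0, 2, 3))
for head, score in enumerate(per_head_score.tolist()):
    print(f"head {head:2d}: mean ||alpha * v|| = {score:.4f}")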

For further questions, contact me anytime (minsoo2333@hanyang.ac.kr) or kindly leave a question in the issues tab.
