# **BERT Variants II - Based on Knowledge Distillation** 
### Follow book from Pg. no. 153
- One of the challenges with using the pre-trained BERT model is that it is computationally
expensive and very difficult to run with limited resources. The pre-trained BERT model
has a large number of parameters and a high inference time, which makes it hard to
use on edge devices such as mobile phones.
- To alleviate this issue, we transfer knowledge from a large pre-trained BERT to a small
BERT using knowledge distillation.
- Next, we will learn about DistilBERT and see in detail how to transfer knowledge from a
large pre-trained BERT to a small BERT by using knowledge distillation.
- In this Notebook, we will learn about the following topics:
  - Introducing knowledge distillation 
  - DistilBERT – the distilled version of BERT 
  - Introducing TinyBERT 
  - Transferring knowledge from BERT to neural networks

## 1. **Introducing knowledge distillation**
  - Knowledge distillation is a model compression technique in which a small model is
trained to reproduce the behavior of a large pre-trained model. It is also referred to as
teacher-student learning, where the large pre-trained model is the teacher and the small
model is the student.

## 2. **DistilBERT – the distilled version of BERT**
- The pre-trained BERT model has a large number of parameters and also high inference
time, which makes it harder to use on edge devices such as mobile phones. To solve this
issue, we use DistilBERT, which was introduced by researchers at Hugging Face.
DistilBERT is a smaller, faster, cheaper, and lighter version of BERT.
- DistilBERT uses knowledge distillation. The core idea of DistilBERT is that we take a
large pre-trained BERT model and transfer its knowledge to a small BERT through knowledge
distillation. The large pre-trained BERT is called the teacher BERT and the small BERT is
called the student BERT.
- Since the small BERT (student BERT) acquires its knowledge from the large pre-trained
BERT (teacher BERT) through distillation, we call the small BERT DistilBERT.
DistilBERT is 60% faster and 40% smaller than the BERT-Base model. Now
that we have a basic idea of DistilBERT, let's get into the details and learn how it works.

## 3. **Distillation in TinyBERT**
- In TinyBERT, apart from transferring knowledge from the output layer (prediction layer)
of the teacher to the student BERT, we also transfer knowledge from the intermediate
layers. Distillation is therefore performed at the following layers:
  - Transformer layer (encoder layer)
  - Embedding layer (input layer)
  - Prediction layer (output layer)

### **Transformer layer distillation**
  - Attention-based distillation
    - We perform attention-based distillation by minimizing the mean squared error
between the attention matrices of the student and the teacher. Note that we use the
unnormalized attention matrix, that is, the attention matrix before the softmax
function is applied. This is because the unnormalized attention matrix performs better
and attains faster convergence in this setting.
  - Hidden state-based distillation
    - The hidden state is the output of the encoder, that is, the representation. So, in
hidden-state distillation, we transfer knowledge from the hidden state of the teacher
encoder to the hidden state of the student encoder. Let $H^S$ be the hidden state of the
student and $H^T$ be the hidden state of the teacher. Then we perform distillation by
minimizing the mean squared error between $H^S$ and $H^T$, as expressed here:

      $$L_{\text{hidn}} = \text{MSE}(H^S, H^T)$$
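
The two Transformer layer losses described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the TinyBERT implementation: the matrices are made-up values, the helper names (`mse`, `matmul`, `attention_loss`, `hidden_state_loss`) are hypothetical, and two details are assumptions taken from the usual TinyBERT formulation rather than from these notes: the attention loss is averaged over heads, and a projection matrix `w_h` maps the student hidden state to the teacher's width when the two differ.

```python
# Illustrative sketch only: tiny hand-written matrices stand in for real model outputs.


def mse(a, b):
    """Mean squared error between two equally shaped matrices (lists of rows)."""
    diffs = [(x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def matmul(a, b):
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention_loss(student_heads, teacher_heads):
    """Average the per-head MSE between unnormalized (pre-softmax) attention matrices."""
    per_head = [mse(s, t) for s, t in zip(student_heads, teacher_heads)]
    return sum(per_head) / len(per_head)


def hidden_state_loss(student_hidden, teacher_hidden, projection):
    """MSE after projecting the student hidden state to the teacher's width (assumed step)."""
    return mse(matmul(student_hidden, projection), teacher_hidden)


# Two heads, each a 2 x 2 matrix of pre-softmax attention scores (made-up values).
student_attn = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 1.0], [1.0, 2.0]]]
teacher_attn = [[[1.0, 1.0], [0.0, 1.0]], [[2.0, 1.0], [1.0, 0.0]]]

# Student hidden size 2, teacher hidden size 3; w_h maps 2 -> 3 (hypothetical values).
student_hidden = [[1.0, 2.0], [0.0, 1.0]]
teacher_hidden = [[1.0, 2.0, 3.0], [0.0, 1.0, 1.0]]
w_h = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

attn_loss = attention_loss(student_attn, teacher_attn)
hidn_loss = hidden_state_loss(student_hidden, teacher_hidden, w_h)
print(attn_loss, hidn_loss)
```

In a real setup these values would come from the student and teacher encoders, and `w_h` would be learned together with the student rather than fixed by hand.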
 

### **Embedding layer distillation**
- In embedding layer distillation, we transfer knowledge from the embedding layer of the
teacher to the embedding layer of the student. Let $E^S$ denote the embedding of the student
and $E^T$ denote the embedding of the teacher. We then train the network to perform
embedding layer distillation by minimizing the mean squared error between the
embeddings of the student and the teacher.
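
Written as a formula, the embedding loss might look like the following. The projection matrix $W_e$ is an assumption that does not appear in the notes above: it is the kind of learnable linear map typically needed when the student's embedding dimension is smaller than the teacher's, and it can be dropped when the two dimensions match.

$$
L_{\text{embd}} = \text{MSE}(E^S W_e, E^T)
$$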

### **Prediction layer distillation**
- In prediction layer distillation, we transfer knowledge from the final output layer, that
is, the logits produced by the teacher BERT, to the student BERT. This is similar to the
distillation loss we learned about with DistilBERT.
- We perform prediction layer distillation by minimizing the cross-entropy loss between the
soft target (from the teacher) and the soft prediction (from the student).
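
A minimal sketch of this loss in plain Python is shown below. The logits are made-up values and the function names are hypothetical. The `temperature` parameter is an assumption not mentioned in these notes: it is the softening factor commonly used in knowledge distillation, applied to both the teacher and the student logits before the softmax.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - max_scaled) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Cross-entropy between the teacher's soft target and the student's soft prediction."""
    soft_target = softmax_with_temperature(teacher_logits, temperature)
    soft_prediction = softmax_with_temperature(student_logits, temperature)
    return -sum(t * math.log(p) for t, p in zip(soft_target, soft_prediction))


teacher_logits = [4.0, 1.0, 0.5]  # made-up values
student_logits = [2.5, 1.5, 0.2]  # made-up values

# The loss is smallest when the student reproduces the teacher's distribution exactly.
loss_mismatch = distillation_loss(teacher_logits, student_logits, temperature=2.0)
loss_match = distillation_loss(teacher_logits, teacher_logits, temperature=2.0)
print(loss_mismatch, loss_match)
```

The loss never drops to zero even for a perfect student, because the cross-entropy of a distribution with itself equals its entropy; what matters during training is that the loss decreases as the student's soft prediction approaches the soft target.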

## **Training the student BERT (TinyBERT)**
- In TinyBERT, we will use a two-stage learning framework as follows:
  - General distillation
  - Task-specific distillation

## **The data augmentation method**
- To perform distillation at the fine-tuning step (task-specific distillation), we need
more task-specific data points. So we use a data augmentation method to obtain an
augmented dataset, and we fine-tune the general TinyBERT with this augmented dataset.

## **Findings**
- TinyBERT retains about 96% of the performance of the BERT-Base model while being
7.5 times smaller and 9.4 times faster during inference.

- DistilBERT retains almost 97% of the accuracy of the original BERT-Base model. Since
DistilBERT is lighter, we can easily deploy it on edge devices, and it is 60% faster at
inference than the BERT model.

# **NOTE**:
- So far, we have learned how to perform knowledge distillation from a large pre-trained BERT to a smaller BERT. Can we perform knowledge distillation and transfer knowledge from a pre-trained BERT to a simple neural network?



