# BERT and Friends final project

In [None]:
## Installing the Dependencies ##

!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.22.1-py3-none-any.whl (4.9 MB)
Collecting huggingface-hub<1.0,>=0.9.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.9.1 tokenizers-0.12.1 transformers-4.22.1


This project has three parts:

**Part 1:** We will fine-tune BERT-base, DistilBERT, DistilRoBERTa and BERT-tiny (the student model) on the Amazon Massive dataset.

**Part 2:** We will perform task-specific Knowledge Distillation using the Amazon Massive dataset.

Student model: BERT-tiny (2 layers, hidden size 128, 2 attention heads)

We use the models fine-tuned in Part 1 as teachers. Knowledge distillation is performed in three settings:

1.   With the BERT model only
2.   With the DistilBERT model only
3.   With the BERT and DistilBERT models combined
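The three settings can be illustrated with one loss function that accepts a list of teacher logits: one entry for the single-teacher settings and two for the combined setting. This is a minimal sketch of the common soft-target distillation loss, not necessarily the loss used in this project. The function name `distillation_loss`, the averaging of teacher logits, the `temperature` and `alpha` values, and the 60-class output size are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits_list, labels,
                      temperature=2.0, alpha=0.5):
    """Hypothetical helper: hard-label loss mixed with a soft-target loss."""
    # Assumption: several teachers are combined by averaging their logits.
    teacher_logits = torch.stack(teacher_logits_list).mean(dim=0)

    # Soften both distributions with the same temperature.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence between teacher and student, scaled by T^2 so its
    # gradient magnitude stays comparable to the hard-label term.
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2

    # Ordinary cross-entropy against the ground-truth intent labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * ce + (1.0 - alpha) * kd


# Random tensors stand in for real model outputs (batch of 4, 60 classes assumed).
student = torch.randn(4, 60)
bert_teacher = torch.randn(4, 60)
distilbert_teacher = torch.randn(4, 60)
labels = torch.randint(0, 60, (4,))

loss_one_teacher = distillation_loss(student, [bert_teacher], labels)
loss_two_teachers = distillation_loss(student, [bert_teacher, distilbert_teacher], labels)
```

Passing a one-element list reproduces settings 1 and 2, and passing both teachers reproduces setting 3, so the training loop does not need to change between settings.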

**Part 3:** We will analyze the model size and the processing time of the teacher and student models.

In [None]:
## Importing the libraries ##

import time
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

Here, we analyze the models fine-tuned on the Amazon Massive dataset. The results were almost the same for the models fine-tuned on the Emotion dataset.

Model size is reported as the number of parameters (in millions). Processing time is the wall-clock time, in milliseconds (ms), to tokenize and classify one example sentence, measured with a single run.

**Device configuration:** Google Colab CPU runtime with 12.68 GB RAM and 107.72 GB disk space
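A single timed call can be noisy, since it may include one-off costs such as lazy initialization on the first forward pass. As a hedged alternative to the single-run measurement used in this notebook, the sketch below takes the median over several runs after a short warm-up. The helper name `time_ms` and its `warmup` and `repeats` defaults are assumptions for illustration, not part of the original analysis.

```python
import statistics
import time


def time_ms(fn, warmup=3, repeats=20):
    """Hypothetical helper: median wall-clock time of fn() in milliseconds."""
    # Warm-up calls are executed but not timed.
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in workload; in this notebook the callable would wrap the
# tokenizer call and the model forward pass inside torch.no_grad().
elapsed = time_ms(lambda: sum(range(10_000)))
print(f"Median time: {elapsed:.3f} ms")
```

The median is used rather than the mean so that an occasional slow run on a shared Colab machine does not dominate the result.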

# Teacher models

## BERT-base model

In [None]:
checkpoint = "gokuls/bert-base-Massive-intent" ## Model used for analysis ##

## Tokenization ##
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

## Model ##
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

## Example sentence ##
example = "wake me up at eight am on monday"

## Getting processing time ##

start_time = time.time()

inputs = tokenizer(example, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_id = logits.argmax().item()

end_time = time.time()

print('The processing time is: ', end_time-start_time)
print('Number of Parameters: ', sum(p.numel() for p in model.parameters())) ## Getting the number of parameters ##

The processing time is:  0.2662777900695801
Number of Parameters:  109528380


**Teacher Model: BERT-base**

1. **Processing time:** 266.28 ms
2. **Model size:** 109.53 M parameters
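The parameter count can be cross-checked from the architecture alone. The sketch below sums the weights and biases of a BERT encoder with a classification head. It assumes the standard BERT settings (vocabulary of 30522 tokens, 512 positions, 2 token types, feed-forward size of 4 times the hidden size) and 60 intent labels; these values and the function name `bert_classifier_params` are assumptions, not read from the checkpoints. Under those assumptions it reproduces the counts printed in this notebook for BERT-base and BERT-tiny. It does not apply to DistilBERT or DistilRoBERTa, whose layouts differ.

```python
def bert_classifier_params(hidden, layers, num_labels, vocab=30522,
                           max_pos=512, type_vocab=2, intermediate=None):
    """Hypothetical helper: parameter count of BERT plus a linear classifier."""
    if intermediate is None:
        intermediate = 4 * hidden  # assumed feed-forward width

    layer_norm = 2 * hidden  # scale and shift vectors

    # Word, position and token-type embeddings, followed by a LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm

    # Query, key, value and output projections (weights + biases), then LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Two dense layers of the feed-forward block, then LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)

    pooler = hidden * hidden + hidden
    classifier = hidden * num_labels + num_labels
    return embeddings + layers * (attention + feed_forward) + pooler + classifier


print(bert_classifier_params(hidden=768, layers=12, num_labels=60))  # 109528380
print(bert_classifier_params(hidden=128, layers=2, num_labels=60))   # 4393660
```

The number of attention heads does not appear in the formula because splitting the projections into heads changes how the weights are used, not how many there are.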

## DistilBERT model

In [None]:
checkpoint = "gokuls/distilbert-base-Massive-intent" ## Model used for analysis ##

## Tokenization ##
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

## Model ##
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

## Example sentence ##
example = "wake me up at eight am on monday"

## Getting processing time ##

start_time = time.time()

inputs = tokenizer(example, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_id = logits.argmax().item()

end_time = time.time()

print('The processing time is: ', end_time-start_time)
print('Number of Parameters: ', sum(p.numel() for p in model.parameters())) ## Getting the number of parameters ##

The processing time is:  0.050185441970825195
Number of Parameters:  66999612


**Teacher Model: DistilBERT**

1. **Processing time:** 50.19 ms
2. **Model size:** 67.00 M parameters

## DistilRoBERTa model

In [None]:
checkpoint = "gokuls/distilroberta-base-Massive-intent" ## Model used for analysis ##

## Tokenization ##
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

## Model ##
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

## Example sentence ##
example = "wake me up at eight am on monday"

## Getting processing time ##

start_time = time.time()

inputs = tokenizer(example, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_id = logits.argmax().item()

end_time = time.time()

print('The processing time is: ', end_time-start_time)
print('Number of Parameters: ', sum(p.numel() for p in model.parameters())) ## Getting the number of parameters ##

The processing time is:  0.04914546012878418
Number of Parameters:  82164540


**Teacher Model: DistilRoBERTa**

1. **Processing time:** 49.15 ms
2. **Model size:** 82.16 M parameters

# Student model (BERT-tiny)

In [None]:
checkpoint = "gokuls/bert-tiny-Massive-intent-KD-distilBERT" ## Model used for analysis ##

## Tokenization ##
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

## Model ##
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

## Example sentence ##
example = "wake me up at eight am on monday"

## Getting processing time ##

start_time = time.time()

inputs = tokenizer(example, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_id = logits.argmax().item()

end_time = time.time()

print('The processing time is: ', end_time-start_time)
print('Number of Parameters: ', sum(p.numel() for p in model.parameters())) ## Getting the number of parameters ##

The processing time is:  0.006506443023681641
Number of Parameters:  4393660


**Student Model: BERT-tiny**

1. **Processing time:** 6.51 ms
2. **Model size:** 4.39 M parameters
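To put the four measurements side by side, the sketch below computes how much larger and slower each model is than the BERT-tiny student, using the raw values printed by the cells above. The helper name `relative_to_student` is hypothetical. Because each time comes from a single run on a shared CPU, the time ratios should be read as indicative rather than exact.

```python
# (processing time in seconds, number of parameters), copied from the cell outputs.
results = {
    "BERT-base": (0.2662777900695801, 109528380),
    "DistilBERT": (0.050185441970825195, 66999612),
    "DistilRoBERTa": (0.04914546012878418, 82164540),
    "BERT-tiny": (0.006506443023681641, 4393660),
}


def relative_to_student(results, student="BERT-tiny"):
    """Hypothetical helper: (size ratio, time ratio) of each model vs. the student."""
    student_seconds, student_params = results[student]
    return {
        name: (params / student_params, seconds / student_seconds)
        for name, (seconds, params) in results.items()
    }


ratios = relative_to_student(results)
for name, (size_ratio, time_ratio) in ratios.items():
    seconds, params = results[name]
    print(f"{name:14s} {seconds * 1000:8.2f} ms {params / 1e6:8.2f} M "
          f"size x{size_ratio:5.1f}  time x{time_ratio:5.1f}")
```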