After the advent of BERT, many pretrained language models were introduced, and most of them opted for larger model sizes to achieve better performance. While large-scale pretrained models indeed ensure good performance, they come with the drawback of being challenging to use in typical computing environments. To address this issue, there has been a movement to build efficient pretrained models that can offer a certain level of performance.
There are broadly two ways to improve model efficiency: reducing the number of model parameters, and improving the attention mechanism. This project compares representative models of each approach against the baseline, BERT, and assesses their efficiency gains on real tasks. The lightweight-focused models are ALBERT, DistilBERT, and MobileBERT; the attention-focused models are Reformer, Longformer, and BigBird.
LightWeight Focused Models
- ALBERT
A Lite BERT
- DistilBERT
Distilled BERT
- MobileBERT
Attention Focused Models
- Reformer
- Longformer
- BigBird
LightWeight Focused Models
Model | Params | Size | Param Ratio (vs. BERT) |
---|---|---|---|
BERT | 109,482,240 | 417.649 MB | 100% |
ALBERT | 11,683,584 | 44.577 MB | 10.67% |
DistilBERT | 66,362,880 | 253.158 MB | 60.62% |
MobileBERT | 24,581,888 | 93.776 MB | 22.45% |
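The BERT parameter count in the table can be reproduced from the standard BERT-base configuration (30,522-token vocabulary, hidden size 768, 12 layers, FFN size 3072, 512 positions, 2 segment types). A minimal sketch in plain Python — the function name is illustrative, not part of this repo:

```python
def bert_param_count(vocab=30522, hidden=768, layers=12,
                     ffn=3072, max_pos=512, type_vocab=2):
    """Estimate a BERT-style encoder's parameter count from its config."""
    # Token, position, and segment embeddings, plus the embedding LayerNorm
    embeddings = (vocab + max_pos + type_vocab) * hidden + 2 * hidden
    # Self-attention: Q, K, V, and output projections (weights + biases)
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward network: up- and down-projections (weights + biases)
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # Two LayerNorms per layer (gain + bias each)
    layer_norms = 2 * (2 * hidden)
    per_layer = attention + feed_forward + layer_norms
    # Pooler: one dense layer over the [CLS] hidden state
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler

print(bert_param_count())  # 109482240 — matches the table above
```

ALBERT's much smaller count comes from factorized embeddings and cross-layer parameter sharing, which this per-layer formula does not model.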
Attention Focused Models
Model | Params | Size | Attention Type |
---|---|---|---|
BERT | 109,482,240 | 417.649 MB | Full Attention |
Reformer | 148,654,080 | 567.070 MB | LSH Attention |
Longformer | 148,659,456 | 567.091 MB | Sliding Window + Global Attention |
BigBird | 127,468,800 | 486.317 MB | Sparse Attention (Random + Window + Global) |
LightWeight Focused Models
 | BERT | ALBERT | DistilBERT | MobileBERT |
---|---|---|---|---|
CoLA Accuracy | - | - | - | - |
Training Speed per Batch | - | - | - | - |
Attention Focused Models
 | BERT | Reformer | Longformer | BigBird |
---|---|---|---|---|
IMDB Accuracy | - | - | - | - |
Training Speed per Batch | - | - | - | - |
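Training speed per batch in the tables above can be measured with a simple wall-clock timer around the training step. A minimal sketch — the helper name and warmup policy are illustrative, not from this repo:

```python
import time

def time_per_batch(step_fn, batches, warmup=2):
    """Average wall-clock seconds per batch, skipping warmup steps."""
    # Warmup steps absorb one-time costs (compilation, cache fills)
    for batch in batches[:warmup]:
        step_fn(batch)
    start = time.perf_counter()
    for batch in batches[warmup:]:
        step_fn(batch)
    return (time.perf_counter() - start) / max(len(batches) - warmup, 1)
```

For GPU training, remember to synchronize the device before reading the clock, otherwise asynchronous kernel launches make the measured time misleadingly small.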
```
python3 run.py -mode {lightweight,attention}
```
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
- MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
- Reformer: The Efficient Transformer
- Longformer: The Long-Document Transformer
- Big Bird: Transformers for Longer Sequences