## Compact Transformers  🤏

We all know bigger is often better! Nothing cooler than rolling up in your larger-than-life Transformer model. Who doesn't want to drive a monster truck once in their life...

However, monster trucks aren't always practical in real life. Models need to run smoothly in production. So the best model size is often the one most suited for your practical situation and requirements.

In this tip, we'll have a look at some compact transformer models (most of them under 25 million parameters) available through Hugging Face. That way you can have both a smoothly running application and brag to your manager that you implemented transformers!

### The test 🔨
We will be evaluating all of our candidates on a few fronts:


*   GPU memory during finetune: relevant in case of hardware or cost limits
*   GPU memory during inference: relevant when running on an edge device
*   Model artefact size: relevant when facing hardware restrictions
*   Model CPU & GPU inference time: relevant when dealing with latency restrictions
*   Time taken to finetune: relevant when dealing with frequent retraining loops

All of these are benchmarked against the final accuracy after training for one epoch!

*Note: tests were done in Google Colab, with a V100 GPU.*

... let the games begin 🏋️‍♀️



### Setup

In [None]:
!pip install -q transformers sentencepiece datasets

Check GPU:

In [None]:
!nvidia-smi
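
The GPU memory figures recorded for each model later in this notebook are hardcoded numbers. One way to read such a number programmatically is to query `nvidia-smi` from Python. The sketch below assumes the `nvidia-smi` CLI is on the path; the helper name `gpu_memory_used_mib` is made up for illustration and returns `None` when no GPU tooling is found.

```python
import shutil
import subprocess


def gpu_memory_used_mib():
    """Return the used memory of each visible GPU in MiB, or None if it cannot be read."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        completed = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
            capture_output=True,
            text=True,
            check=True,
        )
        # One line per GPU, each holding a plain integer number of MiB
        return [int(line) for line in completed.stdout.splitlines() if line.strip()]
    except (subprocess.CalledProcessError, OSError, ValueError):
        return None


gpu_memory = gpu_memory_used_mib()
print(gpu_memory)
```

Calling it once during finetuning and once during inference would give numbers comparable to the `gpu_memory_finetune` and `gpu_memory_inference` fields, although the reading also includes memory held by any other process on the GPU.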

### Imports

In [None]:
from transformers import AutoTokenizer
from datasets import load_dataset
import numpy as np
import time
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

from transformers import AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments

import torch

### Data

For this tip, we will look at a simple and straightforward NLP task: classifying news headlines into four categories.

In [None]:
dataset = load_dataset("ag_news")

### Models

In [None]:
!mkdir -p tests

In [None]:
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    return {
        "accuracy": (preds == p.label_ids).mean()
    }
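
To make the metric concrete, here is a toy run of the same function on four made-up predictions. The `SimpleNamespace` object is only a stand-in for what the `Trainer` passes in, under the assumption that nothing beyond the `predictions` and `label_ids` attributes is used.

```python
from types import SimpleNamespace

import numpy as np


def compute_metrics(p):
    # Pick the class with the highest logit, then compare with the labels
    preds = np.argmax(p.predictions, axis=1)
    return {"accuracy": (preds == p.label_ids).mean()}


# Made-up logits for four headlines and four classes
fake_eval = SimpleNamespace(
    predictions=np.array([
        [2.0, 1.0, 0.0, 0.0],  # predicted class 0
        [0.0, 3.0, 1.0, 0.0],  # predicted class 1
        [0.0, 0.0, 1.0, 2.0],  # predicted class 3
        [1.0, 0.0, 0.0, 0.0],  # predicted class 0
    ]),
    label_ids=np.array([0, 1, 2, 0]),
)

toy_metrics = compute_metrics(fake_eval)
print(toy_metrics)
```

Three of the four predictions match their label, so the reported accuracy is 0.75.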

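In the segment below, feel free to tweak the `max_steps` parameter for faster notebook execution! Note that a positive `max_steps` overrides `num_train_epochs`, so remove it if you want to train for the full epoch.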

In [None]:
def train_and_evaluate_model(model_name):
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tokenize(batch):
        return tokenizer(batch['text'], padding=True, truncation=True)

    train_dataset, test_dataset = load_dataset('ag_news', split=['train', 'test'])
    train_dataset = train_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))
    test_dataset = test_dataset.map(tokenize, batched=True, batch_size=len(test_dataset))
    train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
    test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

    training_args = TrainingArguments(
        output_dir='./results',          # output directory
        num_train_epochs=1,             # total number of training epochs
        per_device_train_batch_size=16,  # batch size per device during training
        per_device_eval_batch_size=64,   # batch size for evaluation
        warmup_steps=500,                # number of warmup steps for learning rate scheduler
        weight_decay=0.01,               # strength of weight decay
        #logging_dir='./logs',            # directory for storing logs
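        learning_rate=2e-5,              # peak learning rate, reached after warmup
        evaluation_strategy='epoch',     # evaluate at the end of each epoch
        max_steps=2500                   # caps training at 2500 steps; overrides num_train_epochs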
    )

    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, 
        num_labels=4
    )

    # Get the number of parameters

    parameters = model.num_parameters()

    # Finetune the model

    trainer = Trainer(
        model=model,
        args=training_args,
        compute_metrics=compute_metrics,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
    )

    start_time = time.time()

    trainer.train()

    duration = time.time() - start_time

    # Get the metrics

    metrics = trainer.evaluate()

    # Store the model
    trainer.save_model(f"tests/{model_name}")
    tokenizer.save_pretrained(f"tests/{model_name}")

    return parameters, metrics, duration
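
A quick back-of-the-envelope check shows how much data the settings above actually cover. `ag_news` has 120,000 training examples; the sketch assumes a single GPU and no gradient accumulation, so one training step consumes exactly one batch of 16.

```python
import math

train_examples = 120_000  # size of the ag_news training split
batch_size = 16           # per_device_train_batch_size
max_steps = 2500          # value passed to TrainingArguments

steps_per_epoch = math.ceil(train_examples / batch_size)
examples_seen = max_steps * batch_size
fraction_of_epoch = examples_seen / train_examples

print(steps_per_epoch)
print(examples_seen)
print(round(fraction_of_epoch, 3))
```

Under these assumptions a full epoch takes 7500 steps, so `max_steps=2500` covers about a third of the training set.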



In [None]:
run_dict = []

Now meet our contenders:

#### MobileBERT

A knowledge-transfer model based on a BERT_LARGE-derived teacher, with some architectural tricks. Claims a 62ms latency on a Pixel 4 phone!

[link to paper](https://arxiv.org/abs/2004.02984)

In [None]:
model_name = 'google/mobilebert-uncased'
parameters, metrics, duration = train_and_evaluate_model(model_name)
run_dict.append({
    "name": "mobilebert",
    "model_name": model_name,
    "parameters": parameters,
    "gpu_memory_finetune": 9035,
    "gpu_memory_inference": 1481,
    "artefact_size": 147,
    "duration": duration,
    "metrics": metrics
})

#### ALBERT
Two clever parameter-reduction techniques, factorized embedding parameterization and cross-layer parameter sharing, give this model a much smaller parameter count than BERT.

[link to paper](https://arxiv.org/pdf/1909.11942.pdf)

In [None]:
model_name = 'albert-base-v2'
parameters, metrics, duration = train_and_evaluate_model(model_name)
run_dict.append({
    "name": "albert",
    "model_name": model_name,
    "parameters": parameters,
    "gpu_memory_finetune": 12023,
    "gpu_memory_inference": 1397,
    "artefact_size": 47.4,
    "duration": duration,
    "metrics": metrics
})

#### TinyBERT
The result of a novel Knowledge Distillation (KD) method, accompanied by a two-stage learning framework (general distillation first, then task-specific distillation).

[link to paper](https://arxiv.org/abs/1909.10351)

In [None]:
model_name = 'huawei-noah/TinyBERT_General_4L_312D'
parameters, metrics, duration = train_and_evaluate_model(model_name)
run_dict.append({
    "name": "tinybert",
    "model_name": model_name,
    "parameters": parameters,
    "gpu_memory_finetune": 3797,
    "gpu_memory_inference": 1405,
    "artefact_size": 62.7,
    "duration": duration,
    "metrics": metrics
})

#### BERT-small
The following three models (BERT-small, BERT-mini and BERT-tiny) are all results of so-called Pre-trained Distillation, which allows more compact models to yield better performance.

[link to paper](https://arxiv.org/abs/1908.08962)

In [None]:
model_name = 'google/bert_uncased_L-4_H-512_A-8'
parameters, metrics, duration = train_and_evaluate_model(model_name)
run_dict.append({
    "name": "bert-small",
    "model_name": model_name,
    "parameters": parameters,
    "gpu_memory_finetune": 3797,
    "gpu_memory_inference": 1459,
    "artefact_size": 116,
    "duration": duration,
    "metrics": metrics
})

#### BERT-mini

[link to paper](https://arxiv.org/abs/1908.08962)

In [None]:
model_name = 'google/bert_uncased_L-4_H-256_A-4'
parameters, metrics, duration = train_and_evaluate_model(model_name)
run_dict.append({
    "name": "bert-mini",
    "model_name": model_name,
    "parameters": parameters,
    "gpu_memory_finetune": 2579,
    "gpu_memory_inference": 1383,
    "artefact_size": 45.1,
    "duration": duration,
    "metrics": metrics
})

#### BERT-tiny

[link to paper](https://arxiv.org/abs/1908.08962)

In [None]:
model_name = 'google/bert_uncased_L-2_H-128_A-2'
parameters, metrics, duration = train_and_evaluate_model(model_name)
run_dict.append({
    "name": "bert-tiny",
    "model_name": model_name,
    "parameters": parameters,
    "gpu_memory_finetune": 1695,
    "gpu_memory_inference": 1357,
    "artefact_size": 17.8,
    "duration": duration,
    "metrics": metrics
})

#### DistilBERT

A true classic default for our finetuning efforts! Serves as a bit of a baseline against which we will compare the smaller models.

[link to paper](https://arxiv.org/abs/1910.01108)

In [None]:
model_name = 'distilbert-base-uncased'
parameters, metrics, duration = train_and_evaluate_model(model_name)
run_dict.append({
    "name": "distilbert",
    "model_name": model_name,
    "parameters": parameters,
    "gpu_memory_finetune": 6691,
    "gpu_memory_inference": 1631,
    "artefact_size": 268,
    "duration": duration,
    "metrics": metrics
})
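
The `artefact_size` values above are hardcoded. A minimal sketch of how one could measure them from the saved model directory is shown below; the helper name `directory_size_mb` is hypothetical, and counting a megabyte as 10**6 bytes is an assumption about the unit used in this notebook.

```python
import os
import tempfile


def directory_size_mb(path):
    """Sum the sizes of all files under path, in megabytes (10**6 bytes)."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for file_name in files:
            total_bytes += os.path.getsize(os.path.join(root, file_name))
    return total_bytes / 1e6


# Demo on a throwaway directory so the snippet runs without a trained model
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "weights.bin"), "wb") as handle:
        handle.write(b"\0" * 2_000_000)
    demo_size = directory_size_mb(demo_dir)

print(demo_size)
```

In this notebook the call would look like `directory_size_mb(f"tests/{model_name}")`. Keep in mind that the total then includes the tokenizer and config files, not only the weights.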

### Comparison

In [None]:
from transformers import TextClassificationPipeline, pipeline

In [None]:
test_sentence = "Today, Google, Facebook and Amazon announced that they would be joining hands in releasing a new unified AI framework called TensorBro."

#### Batched prediction speed on GPU
Model size impacts the maximum batch size, and thus how many sentences we can process per second. For each model, we try to find the largest batch that fits on the GPU (in increments of 5000 sentences) and measure how long it takes to predict that batch. This should come close to the throughput you can expect when deploying the model on a similar GPU.

In [None]:
for model in run_dict:
    print(f"testing model {model.get('model_name')}")
    test_pipe = pipeline("sentiment-analysis", f"tests/{model.get('model_name')}", device=0)

    # Reset per model so a model never inherits the previous model's numbers
    final_count = 0
    final_time = 0

    try:
        for i in range(1, 100000, 5000):
            print(f"Trying size {i}")
            start_time = time.time()
            result = test_pipe([test_sentence] * i)
            final_time = time.time() - start_time
            final_count = i
            time.sleep(10)
    except RuntimeError as e:
        # Typically a CUDA out-of-memory error: the previous size was the largest that fit
        print(e)

    # Store the last successful attempt, whether or not an error was raised
    model["inference_time_max_batch"] = final_time
    model["inference_size_max_batch"] = final_count
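
The search logic above can be isolated from the GPU and tested on its own. The sketch below is an illustration only: `largest_working_batch` and `fake_predict` are hypothetical names, and the simulated capacity of 12,000 sentences stands in for a real out-of-memory limit.

```python
def largest_working_batch(predict, sizes):
    """Return the last size in sizes for which predict(size) did not raise RuntimeError."""
    largest = 0
    for size in sizes:
        try:
            predict(size)
        except RuntimeError:
            # Stop at the first failure; larger sizes would fail as well
            break
        largest = size
    return largest


def fake_predict(size, capacity=12_000):
    # Hypothetical stand-in for the pipeline call: fails above a fixed capacity
    if size > capacity:
        raise RuntimeError("CUDA out of memory (simulated)")


best = largest_working_batch(fake_predict, range(1, 100_000, 5000))
print(best)
```

With the same size schedule as in the notebook, the simulated run succeeds at 1, 5001 and 10001 sentences and fails at 15001, so the function returns 10001.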

#### Sequential CPU inference time
GPUs are costly, so sometimes you want to run predictions one at a time, or on a CPU. Let's test this out as well:

In [None]:
for model in run_dict:
    test_pipe = pipeline("sentiment-analysis",f"tests/{model.get('model_name')}", device=-1)

    start_time = time.time()
    for i in range(100):
        test_pipe(test_sentence)
    
    duration = time.time() - start_time

    # Average out
    duration /= 100.

    # Express in msec
    duration *= 1000.


    model["inference_time_cpu"] = duration
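
The timing pattern used above can be wrapped in a small reusable helper. This is a sketch rather than the method used for the reported numbers: it adds a few untimed warm-up calls and uses `time.perf_counter`, and the name `mean_latency_ms` as well as the dummy workload are assumptions for illustration.

```python
import time


def mean_latency_ms(fn, n_runs=100, n_warmup=5):
    """Average wall-clock time of fn() in milliseconds, after untimed warm-up calls."""
    for _ in range(n_warmup):
        fn()
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / n_runs * 1000.0


# Dummy workload standing in for test_pipe(test_sentence)
latency = mean_latency_ms(lambda: sum(range(10_000)), n_runs=20, n_warmup=2)
print(f"{latency:.3f} ms per call")
```

In the notebook, the call would be `mean_latency_ms(lambda: test_pipe(test_sentence))`. The warm-up calls keep one-off start-up costs out of the average.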

### Visualize

In [None]:
import pandas as pd

In [None]:
df_data = pd.DataFrame(run_dict)
df_data['eval_accuracy'] = df_data.metrics.apply(lambda x: x['eval_accuracy'])
df_data['batched_prediction_speed'] = df_data.inference_size_max_batch / df_data.inference_time_max_batch

In [None]:
df_data.head(10)

|  | name | model_name | parameters | gpu_memory_finetune | gpu_memory_inference | artefact_size | duration | metrics | inference_time_max_batch | inference_size_max_batch | inference_time_cpu | eval_accuracy | batched_prediction_speed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | mobilebert | google/mobilebert-uncased | 24583940 | 9035 | 1481 | 147.0 | 1701.219089 | {'eval_loss': 0.2605212330818176, 'eval_accura... | 15.224273 | 20001 | 52.183349 | 0.918684 | 1313.757337 |
| 1 | albert | albert-base-v2 | 11686660 | 12023 | 1397 | 47.4 | 3460.138656 | {'eval_loss': 0.23464326560497284, 'eval_accur... | 12.188317 | 5001 | 113.805959 | 0.925132 | 410.310946 |
| 2 | tinybert | huawei-noah/TinyBERT_General_4L_312D | 14351500 | 3797 | 1405 | 62.7 | 408.08737 | {'eval_loss': 0.3276883363723755, 'eval_accura... | 5.773093 | 25001 | 9.504721 | 0.894079 | 4330.60718 |
| 3 | bert-small | google/bert_uncased_L-4_H-512_A-8 | 28765700 | 3797 | 1459 | 116.0 | 549.804414 | {'eval_loss': 0.2530038058757782, 'eval_accura... | 5.292146 | 15001 | 19.072402 | 0.914079 | 2834.578027 |
| 4 | bert-mini | google/bert_uncased_L-4_H-256_A-4 | 11171588 | 2579 | 1383 | 45.1 | 252.107582 | {'eval_loss': 0.3101600706577301, 'eval_accura... | 5.392508 | 30001 | 6.873202 | 0.902105 | 5563.459987 |
| 5 | bert-tiny | google/bert_uncased_L-2_H-128_A-2 | 4386436 | 1695 | 1357 | 17.8 | 121.61159 | {'eval_loss': 0.4182027280330658, 'eval_accura... | 7.484345 | 70001 | 2.526121 | 0.88 | 9352.989822 |
| 6 | distilbert | distilbert-base-uncased | 66956548 | 6691 | 1631 | 268.0 | 1509.763152 | {'eval_loss': 0.21709568798542023, 'eval_accur... | 11.219041 | 10001 | 59.247344 | 0.9275 | 891.430889 |


In [None]:
import plotly.express as px

fig = px.scatter(
    df_data,
    title="Accuracy vs finetune time and required GPU memory",
    labels={
        "duration": "Time (sec) to finetune for 1 epoch",
        "eval_accuracy": "Accuracy on the test-set"
    },
    x="duration",
    y="eval_accuracy",
    size="gpu_memory_finetune", hover_name="name", size_max=100)
fig.show()

In [None]:
fig = px.scatter(
    df_data,
    title="Accuracy vs CPU inference time and model artefact size (MB)",
    labels={
        "inference_time_cpu": "Inference time (ms) on CPU",
        "eval_accuracy": "Accuracy on the test-set"
    },
    x="inference_time_cpu",
    y="eval_accuracy",
    size="artefact_size", hover_name="name", size_max=100)
fig.show()

In [None]:
fig = px.scatter(
    df_data,
    title="Accuracy vs GPU batched prediction speed and model artefact size (MB)",
    labels={
        "batched_prediction_speed": "Prediction speed (sentences/sec) at max batch size",
        "eval_accuracy": "Accuracy on the test-set"
    },
    x="batched_prediction_speed",
    y="eval_accuracy",
    size="artefact_size", hover_name="name", size_max=100)
fig.show()
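
As a complement to the scatter plots, the accuracy versus latency trade-off can also be read off as a Pareto front: the models for which no other model is both faster on CPU and more accurate. The sketch below copies the CPU latency and accuracy values from the results table; the helper `pareto_front` is a hypothetical name, and the analysis is an illustration rather than part of the original benchmark.

```python
# (name, CPU latency in ms, test accuracy), copied from the results table
results = [
    ("mobilebert", 52.183349, 0.918684),
    ("albert", 113.805959, 0.925132),
    ("tinybert", 9.504721, 0.894079),
    ("bert-small", 19.072402, 0.914079),
    ("bert-mini", 6.873202, 0.902105),
    ("bert-tiny", 2.526121, 0.88),
    ("distilbert", 59.247344, 0.9275),
]


def pareto_front(rows):
    """Keep models that are not beaten on both latency and accuracy by another model."""
    front = []
    best_accuracy = float("-inf")
    # Walk from fastest to slowest; a model earns its place only by raising accuracy
    for name, _latency, accuracy in sorted(rows, key=lambda row: row[1]):
        if accuracy > best_accuracy:
            front.append(name)
            best_accuracy = accuracy
    return front


front = pareto_front(results)
print(front)
```

On these numbers TinyBERT and ALBERT drop out: BERT-mini is both faster and more accurate than TinyBERT, and DistilBERT is both faster and more accurate than ALBERT.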

### Conclusion

Some points to be made:


1.   **ALBERT and MobileBERT** were a bit beefier and slower than we had expected them to be, in all honesty. It could be that we did something wrong, or that an implementation quirk in Hugging Face gave worse performance than expected. For ALBERT, one plausible explanation is that cross-layer parameter sharing shrinks the parameter count but not the compute: the input still passes through all 12 layers.
2.   We were very pleasantly surprised by the **baby-BERT** models. BERT-small especially seems to provide a nice trade-off: 91.4% accuracy at roughly 19 ms per sentence on CPU.
3.   **Google** apparently REALLY ❤️ small models: almost all of the candidates came from them.
4.   **No GPU** for inference? No problem! A single prediction on CPU took between roughly 2.5 ms (BERT-tiny) and 114 ms (ALBERT), which is fine for many use cases.
5.   **# Parameters isn't everything**. TinyBERT, for example, has more parameters than ALBERT (14.4M vs 11.7M), but is roughly 12x faster on CPU (9.5 ms vs 113.8 ms) 🤔. Implementation differences can have a profound effect.


So all in all, a base BERT doesn't have to be your initial step. If you want quicker experimentation or a faster-running application, and you don't necessarily need the best possible performance, there are a ton of great options to check out.