# ModernBERT

ModernBERT was published in late 2024 and has shown substantial improvements over other BERT family models (BERT, RoBERTa, ALBERT, etc.). This notebook showcases the improvements of ModernBERT compared to BERT. Specifically, we will look at the differences in tokenization, long-context capability, model architecture, model outputs, and inference speed.

Here are some helpful resources:

https://huggingface.co/docs/transformers/main/en/model_doc/modernbert

https://huggingface.co/blog/modernbert

https://huggingface.co/answerdotai/ModernBERT-base

https://huggingface.co/docs/transformers/model_doc/bert

# Import libraries

In [1]:
!pip install transformers datasets -q

import numpy as np
import pandas as pd
import torch

from datasets import load_dataset

from transformers import BertTokenizer, BertModel, BertForSequenceClassification
from transformers import AutoTokenizer, AutoModel

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m485.4/485.4 kB[0m [31m4.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m4.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m143.5/143.5 kB[0m [31m8.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.8/194.8 kB[0m [31m8.5 MB/s[0m eta [36m0:00:00[0m
[?25h

# Load models & tokenizer

In [2]:
# Let's first load the BERT model
bert_checkpoint = "bert-base-cased"
bert_tokenizer = BertTokenizer.from_pretrained(bert_checkpoint)
bert_model = BertModel.from_pretrained(bert_checkpoint)

# BERT tokenizer and model can also be loaded with AutoTokenizer and AutoModel
#bert_tokenizer = AutoTokenizer.from_pretrained(bert_checkpoint)
#bert_model = AutoModel.from_pretrained(bert_checkpoint)


In [3]:
# Then we will load the ModernBERT model
mbert_checkpoint = "answerdotai/ModernBERT-base"
mbert_tokenizer = AutoTokenizer.from_pretrained(mbert_checkpoint)
mbert_model = AutoModel.from_pretrained(mbert_checkpoint)


# Tokenizer

Let's first take a look at the difference between the tokenizers for the two models.

In [4]:
text = "This is MIDS 266. Let's learn some NLP!"

In [5]:
bert_inputs = bert_tokenizer(text, return_tensors="pt")
bert_inputs

{'input_ids': tensor([[  101,  1188,  1110, 26574, 13675,  1744,  1545,   119,  2421,   112,
           188,  3858,  1199, 21239,  2101,   106,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [6]:
mbert_inputs = mbert_tokenizer(text, return_tensors="pt")
mbert_inputs

{'input_ids': tensor([[50281,  1552,   310,   353, 15782, 30610,    15,  1281,   434,  3037,
           690,   427, 13010,     2, 50282]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

What do we notice at first glance?

Let's now take a closer look at the `input_ids`.

In [7]:
bert_inputs.input_ids.shape

torch.Size([1, 17])

In [8]:
mbert_inputs.input_ids.shape

torch.Size([1, 15])

We can see that the two tokenizers produce `input_ids` of different lengths for the same text: 17 for BERT and 15 for ModernBERT!

Let's then take a look at how the tokenization differs between the two models.

In [9]:
btokens = bert_tokenizer.tokenize(text)
print(btokens)

['This', 'is', 'MI', '##DS', '26', '##6', '.', 'Let', "'", 's', 'learn', 'some', 'NL', '##P', '!']


In [10]:
mtokens = mbert_tokenizer.tokenize(text)
print(mtokens)

['This', 'Ġis', 'ĠM', 'IDS', 'Ġ266', '.', 'ĠLet', "'s", 'Ġlearn', 'Ġsome', 'ĠN', 'LP', '!']


Wow! The tokens look quite different between the two models!
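The two token lists above hint at the underlying schemes. BERT's WordPiece tokenizer marks a piece that continues the previous word with a `##` prefix, while ModernBERT's byte-level BPE tokenizer marks a token that follows a space with `Ġ`. Here is a minimal sketch that uses only the token lists printed above; the helper names `merge_wordpieces` and `bpe_to_text` are made up for this illustration and are not part of the `transformers` API.

```python
# Token lists copied from the two outputs above.
btokens = ['This', 'is', 'MI', '##DS', '26', '##6', '.', 'Let', "'", 's',
           'learn', 'some', 'NL', '##P', '!']
mtokens = ['This', 'Ġis', 'ĠM', 'IDS', 'Ġ266', '.', 'ĠLet', "'s", 'Ġlearn',
           'Ġsome', 'ĠN', 'LP', '!']


def merge_wordpieces(tokens):
    """Glue WordPiece continuation pieces ('##x') onto the previous piece."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return words


def bpe_to_text(tokens):
    """Join byte-level BPE tokens; 'Ġ' stands for a leading space."""
    return "".join(tokens).replace("Ġ", " ")


print(merge_wordpieces(btokens))
print(bpe_to_text(mtokens))  # This is MIDS 266. Let's learn some NLP!

# Adding the two special tokens gives the input_ids lengths seen earlier.
print(len(btokens) + 2, len(mtokens) + 2)  # 17 15
```

ModernBERT keeps `266` and `'s` as single tokens, which is where the two-token difference comes from for this sentence.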

We already know BERT uses CLS and SEP tokens. Does ModernBERT do the same?

In [11]:
bert_tokenizer.decode(bert_tokenizer.encode(text))

"[CLS] This is MIDS 266. Let's learn some NLP! [SEP]"

In [12]:
bert_inputs.input_ids

tensor([[  101,  1188,  1110, 26574, 13675,  1744,  1545,   119,  2421,   112,
           188,  3858,  1199, 21239,  2101,   106,   102]])

In [13]:
mbert_tokenizer.decode(mbert_tokenizer.encode(text))

"[CLS]This is MIDS 266. Let's learn some NLP![SEP]"

In [14]:
mbert_inputs.input_ids

tensor([[50281,  1552,   310,   353, 15782, 30610,    15,  1281,   434,  3037,
           690,   427, 13010,     2, 50282]])

Like other BERT-family models, ModernBERT uses CLS and SEP tokens. Can you guess the input ids of these special tokens?

Read the [ModernBert Config](https://huggingface.co/docs/transformers/main/en/model_doc/modernbert#transformers.ModernBertConfig) to identify the other special tokens and the input id of each of them.

Let's now try batch encoding. What is different this time?

In [15]:
bert_input = bert_tokenizer.batch_encode_plus(
    ['This is great!', 'This is terrible!'],
    max_length=10,
    truncation=True,
    padding='max_length',
    return_tensors='pt'
)

bert_input

{'input_ids': tensor([[ 101, 1188, 1110, 1632,  106,  102,    0,    0,    0,    0],
        [ 101, 1188, 1110, 6434,  106,  102,    0,    0,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}

In [16]:
mbert_input = mbert_tokenizer.batch_encode_plus(
    ['This is great!', 'This is terrible!'],
    max_length=10,
    truncation=True,
    padding='max_length',
    return_tensors='pt'
)

mbert_input

{'input_ids': tensor([[50281,  1552,   310,  1270,     2, 50282, 50283, 50283, 50283, 50283],
        [50281,  1552,   310, 11527,     2, 50282, 50283, 50283, 50283, 50283]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}
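The padded batches make the special-token ids easy to read off: the first and last positions with an attention mask of 1 hold CLS and SEP, and the positions with an attention mask of 0 hold the padding id. The sketch below does this for the first row of each output above. `infer_special_ids` is a hypothetical helper written for this illustration; in the notebook you could compare its result with the tokenizer attributes `cls_token_id`, `sep_token_id`, and `pad_token_id`.

```python
# First row of each batch output above, plus the shared attention mask.
bert_ids = [101, 1188, 1110, 1632, 106, 102, 0, 0, 0, 0]
mbert_ids = [50281, 1552, 310, 1270, 2, 50282, 50283, 50283, 50283, 50283]
attention_mask = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]


def infer_special_ids(ids, mask):
    """Infer CLS, SEP and PAD ids from one padded row and its attention mask."""
    real = [i for i, m in zip(ids, mask) if m == 1]
    padding = {i for i, m in zip(ids, mask) if m == 0}
    return {"cls": real[0], "sep": real[-1], "pad": padding}


bert_special = infer_special_ids(bert_ids, attention_mask)
mbert_special = infer_special_ids(mbert_ids, attention_mask)

print(bert_special)   # {'cls': 101, 'sep': 102, 'pad': {0}}
print(mbert_special)  # {'cls': 50281, 'sep': 50282, 'pad': {50283}}
```

Notice also that the ModernBERT tokenizer output contains no `token_type_ids`, unlike the BERT output.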

# Long Context Inputs

To illustrate long-context inputs, we will use the [long-context retrieval (MLDR)](https://huggingface.co/datasets/sentence-transformers/mldr) dataset. This dataset has 10K anchor-positive-negative triplets and is well suited to information retrieval. We will cover information retrieval in Week 10.

We can load the dataset directly from the Hugging Face Hub. For this exercise, we will take only the first 5 datapoints as an example.

In [19]:
from datasets import load_dataset

# Load the English triplet subset and keep only the first 5 rows
dataset = load_dataset("sentence-transformers/mldr",
                       "en-triplet",
                       split="train").take(5)


In [20]:
dataset

Dataset({
    features: ['anchor', 'positive', 'negative'],
    num_rows: 5
})

This dataset is designed for long-context retrieval. Let's take a look at the first example.

In [21]:
text = dataset[0]["positive"]
len(text)

12379

Wow, at 12,379 characters this is surely a long text! What happens if we try to tokenize it for BERT?

In [22]:
bert_inputs = bert_tokenizer(text, return_tensors="pt")
bert_inputs

Token indices sequence length is longer than the specified maximum sequence length for this model (2450 > 512). Running this sequence through the model will result in indexing errors


{'input_ids': tensor([[  101,  1287,  5665,  ...,  4052, 12762,   102]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1]])}

Uh-oh, we see a warning that the sequence length (2450 tokens) exceeds the maximum sequence length the model allows (512).

What happens if we feed this long text to the BERT model without any preprocessing? As the warning says, the forward pass fails with an indexing error, because BERT has only 512 position embeddings. If we truncate the input to fit instead, all text after the 512th token is lost!

Think about what you could do to work around this issue if you wanted to use a BERT model on this dataset. What disadvantages would that have compared to using a model with long-context capabilities?
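One common workaround, sketched below under stated assumptions, is to split the token ids into overlapping windows that each fit within 512 positions and to run BERT on every window separately. The helper name `sliding_windows` and the overlap of 128 tokens are illustrative choices, not something prescribed by BERT or by this notebook. The ids 101 and 102 are the BERT CLS and SEP ids seen earlier, and the placeholder list stands in for the real token ids of our text.

```python
def sliding_windows(token_ids, max_len=512, overlap=128, cls_id=101, sep_id=102):
    """Split token ids (without special tokens) into overlapping windows.

    Each window gets its own [CLS]/[SEP] and is at most max_len ids long.
    """
    body = max_len - 2      # room left after adding [CLS] and [SEP]
    step = body - overlap   # how far the window start advances each time
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append([cls_id] + token_ids[start:start + body] + [sep_id])
        if start + body >= len(token_ids):
            break  # this window already reached the end of the text
    return windows


# Placeholder ids standing in for the 2448 non-special BERT tokens of our text
# (2450 reported in the warning, minus [CLS] and [SEP]).
fake_ids = list(range(1000, 1000 + 2448))
windows = sliding_windows(fake_ids)

print(len(windows), max(len(w) for w in windows))  # 7 512
```

With these settings the single document becomes seven BERT inputs, so it needs seven forward passes, and tokens in different windows never attend to each other. A long-context model avoids both costs.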

Now let's tokenize it for ModernBERT

In [23]:
mbert_inputs = mbert_tokenizer(text, return_tensors="pt")
mbert_inputs

{'input_ids': tensor([[50281,  8732, 26456,  ..., 13416, 20759, 50282]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1]])}

In [24]:
mbert_inputs.input_ids.shape

torch.Size([1, 2671])

No more warnings about the sequence length exceeding the maximum. This is because ModernBERT supports contexts of up to 8192 tokens, 16 times BERT's 512. This is huge!

# Model Architecture

Let's now take a look at the model architecture!

In [25]:
for name, param in bert_model.named_parameters():
    print(name)

embeddings.word_embeddings.weight
embeddings.position_embeddings.weight
embeddings.token_type_embeddings.weight
embeddings.LayerNorm.weight
embeddings.LayerNorm.bias
encoder.layer.0.attention.self.query.weight
encoder.layer.0.attention.self.query.bias
encoder.layer.0.attention.self.key.weight
encoder.layer.0.attention.self.key.bias
encoder.layer.0.attention.self.value.weight
encoder.layer.0.attention.self.value.bias
encoder.layer.0.attention.output.dense.weight
encoder.layer.0.attention.output.dense.bias
encoder.layer.0.attention.output.LayerNorm.weight
encoder.layer.0.attention.output.LayerNorm.bias
encoder.layer.0.intermediate.dense.weight
encoder.layer.0.intermediate.dense.bias
encoder.layer.0.output.dense.weight
encoder.layer.0.output.dense.bias
encoder.layer.0.output.LayerNorm.weight
encoder.layer.0.output.LayerNorm.bias
encoder.layer.1.attention.self.query.weight
encoder.layer.1.attention.self.query.bias
encoder.layer.1.attention.self.key.weight
encoder.layer.1.attention.self.key

In [26]:
for name, param in mbert_model.named_parameters():
    print(name)

embeddings.tok_embeddings.weight
embeddings.norm.weight
layers.0.attn.Wqkv.weight
layers.0.attn.Wo.weight
layers.0.mlp_norm.weight
layers.0.mlp.Wi.weight
layers.0.mlp.Wo.weight
layers.1.attn_norm.weight
layers.1.attn.Wqkv.weight
layers.1.attn.Wo.weight
layers.1.mlp_norm.weight
layers.1.mlp.Wi.weight
layers.1.mlp.Wo.weight
layers.2.attn_norm.weight
layers.2.attn.Wqkv.weight
layers.2.attn.Wo.weight
layers.2.mlp_norm.weight
layers.2.mlp.Wi.weight
layers.2.mlp.Wo.weight
layers.3.attn_norm.weight
layers.3.attn.Wqkv.weight
layers.3.attn.Wo.weight
layers.3.mlp_norm.weight
layers.3.mlp.Wi.weight
layers.3.mlp.Wo.weight
layers.4.attn_norm.weight
layers.4.attn.Wqkv.weight
layers.4.attn.Wo.weight
layers.4.mlp_norm.weight
layers.4.mlp.Wi.weight
layers.4.mlp.Wo.weight
layers.5.attn_norm.weight
layers.5.attn.Wqkv.weight
layers.5.attn.Wo.weight
layers.5.mlp_norm.weight
layers.5.mlp.Wi.weight
layers.5.mlp.Wo.weight
layers.6.attn_norm.weight
layers.6.attn.Wqkv.weight
layers.6.attn.Wo.weight
layers.6.mlp

What differences do you see when comparing the layers?
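A few differences can be read straight from the parameter names. The sketch below copies the names of one full encoder layer from each listing above (BERT layer 0 and ModernBERT layer 1, with the layer prefix removed) and counts tensors and bias terms. It only inspects the names printed above; the lists and the `summarize` helper are written out by hand for this illustration.

```python
# Parameter names of one encoder layer, copied from the listings above.
bert_layer = [
    "attention.self.query.weight", "attention.self.query.bias",
    "attention.self.key.weight", "attention.self.key.bias",
    "attention.self.value.weight", "attention.self.value.bias",
    "attention.output.dense.weight", "attention.output.dense.bias",
    "attention.output.LayerNorm.weight", "attention.output.LayerNorm.bias",
    "intermediate.dense.weight", "intermediate.dense.bias",
    "output.dense.weight", "output.dense.bias",
    "output.LayerNorm.weight", "output.LayerNorm.bias",
]
mbert_layer = [
    "attn_norm.weight", "attn.Wqkv.weight", "attn.Wo.weight",
    "mlp_norm.weight", "mlp.Wi.weight", "mlp.Wo.weight",
]


def summarize(names):
    """Count parameter tensors and how many of them are bias terms."""
    return {
        "tensors": len(names),
        "biases": sum(name.endswith(".bias") for name in names),
    }


bert_summary = summarize(bert_layer)
mbert_summary = summarize(mbert_layer)
print("BERT:      ", bert_summary)   # {'tensors': 16, 'biases': 8}
print("ModernBERT:", mbert_summary)  # {'tensors': 6, 'biases': 0}
```

ModernBERT fuses the query, key, and value projections into a single `Wqkv` matrix and lists no bias terms in these layers. Its embedding block also has no `position_embeddings` or `token_type_embeddings` entries; the ModernBERT blog post linked at the top describes rotary positional embeddings in place of learned ones.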

# Model outputs

We have seen the differences in inputs and model architecture. Let's now take a look at the model outputs.

Remember, our text is too long for BERT, so we must truncate it before we can feed the tokenized inputs to the model.

In [27]:
bert_inputs = bert_tokenizer(text, return_tensors="pt", truncation=True)
bert_outputs = bert_model(**bert_inputs)
bert_outputs

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.5935,  0.0940, -0.1687,  ..., -0.3806,  0.0434,  0.3832],
         [ 0.4253, -0.4766,  0.7440,  ..., -0.3942,  0.3230,  0.2140],
         [ 0.4182, -0.5374,  0.5692,  ..., -0.2323, -0.0629,  1.2690],
         ...,
         [ 0.5098, -0.5685,  0.2344,  ..., -0.5356,  0.0076,  0.3847],
         [ 0.4078, -0.4184,  0.0023,  ..., -0.1491, -0.1271,  0.2958],
         [ 1.3882,  0.1162,  1.2960,  ...,  0.3271,  0.2593, -0.2280]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-5.8410e-01,  3.0264e-01,  9.9930e-01, -9.6620e-01,  8.9615e-01,
          9.3145e-01,  7.0277e-01, -9.9296e-01, -8.9288e-01,  6.4332e-02,
          9.1099e-01,  9.9563e-01, -9.9678e-01, -9.9903e-01,  7.9328e-01,
         -8.7767e-01,  9.3608e-01, -5.5321e-01, -9.9983e-01, -4.3656e-01,
         -5.8475e-01, -9.9944e-01,  1.0964e-01,  9.4960e-01,  7.4010e-01,
          9.5588e-02,  9.5480e-01,  9.9985e-01,  3.6781e-01, -7.141

BERT has two outputs. Do you remember what they are?

In [28]:
print('Shape of first BERT output: ', bert_outputs[0].shape)
print('Shape of second BERT output: ', bert_outputs[1].shape)

Shape of first BERT output:  torch.Size([1, 512, 768])
Shape of second BERT output:  torch.Size([1, 768])


With ModernBERT, thanks to the long context capabilities, we do not need to truncate our long text before feeding it to the model.

In [29]:
# this cell takes a minute to run because our text is so long
mbert_inputs = mbert_tokenizer(text, return_tensors="pt")
mbert_outputs = mbert_model(**mbert_inputs)
mbert_outputs

BaseModelOutput(last_hidden_state=tensor([[[ 0.1680, -0.3384, -0.7694,  ..., -0.4745, -0.1105, -0.7589],
         [-0.8740, -0.3295,  0.6591,  ..., -1.9694, -2.3188,  0.4383],
         [ 0.3498, -1.4112,  0.0969,  ...,  0.4195,  0.0887, -0.6138],
         ...,
         [ 1.8556, -0.1904, -1.1067,  ..., -1.8617,  0.1445, -0.0550],
         [ 0.2851, -0.9601, -0.9530,  ..., -1.6096,  0.1400,  0.2290],
         [ 0.1789, -0.0395,  0.0412,  ...,  0.0570,  0.1783,  0.1189]]],
       grad_fn=<NativeLayerNormBackward0>), hidden_states=None, attentions=None)

Interesting! ModernBERT only has one output!

In [30]:
print('Shape of ModernBERT output: ', mbert_outputs[0].shape)

Shape of ModernBERT output:  torch.Size([1, 2671, 768])


Compare the first output from BERT with the only output from ModernBERT. What is the difference? What does the second dimension represent?
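Both outputs have the shape `(batch_size, sequence_length, hidden_size)`, so the second dimension is the number of tokens the model actually processed. As a quick back-of-the-envelope check that uses only numbers printed above (the variable names are ours), we can see how much of the document each model got to see:

```python
# Shapes copied from the outputs above: (batch_size, sequence_length, hidden_size)
bert_shape = (1, 512, 768)
mbert_shape = (1, 2671, 768)

bert_full_length = 2450   # BERT token count reported in the tokenizer warning
mbert_full_length = 2671  # ModernBERT token count of the untruncated text

bert_coverage = bert_shape[1] / bert_full_length
mbert_coverage = mbert_shape[1] / mbert_full_length

print(f"BERT processed {bert_coverage:.0%} of its tokens")         # 21%
print(f"ModernBERT processed {mbert_coverage:.0%} of its tokens")  # 100%
```

After truncation, BERT's hidden states describe only about the first fifth of the document, while ModernBERT's cover all of it.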

In [31]:
bert_lhs_shape = bert_outputs.last_hidden_state.shape
bert_lhs_data = bert_outputs.last_hidden_state

print("Last hidden state shape")
print(bert_lhs_shape)

print("Last hidden state")
print(bert_lhs_data)

Last hidden state shape
torch.Size([1, 512, 768])
Last hidden state
tensor([[[ 0.5935,  0.0940, -0.1687,  ..., -0.3806,  0.0434,  0.3832],
         [ 0.4253, -0.4766,  0.7440,  ..., -0.3942,  0.3230,  0.2140],
         [ 0.4182, -0.5374,  0.5692,  ..., -0.2323, -0.0629,  1.2690],
         ...,
         [ 0.5098, -0.5685,  0.2344,  ..., -0.5356,  0.0076,  0.3847],
         [ 0.4078, -0.4184,  0.0023,  ..., -0.1491, -0.1271,  0.2958],
         [ 1.3882,  0.1162,  1.2960,  ...,  0.3271,  0.2593, -0.2280]]],
       grad_fn=<NativeLayerNormBackward0>)


In [32]:
mbert_lhs_shape = mbert_outputs.last_hidden_state.shape
mbert_lhs_data = mbert_outputs.last_hidden_state

print("Last hidden state shape")
print(mbert_lhs_shape)

print("Last hidden state")
print(mbert_lhs_data)

Last hidden state shape
torch.Size([1, 2671, 768])
Last hidden state
tensor([[[ 0.1680, -0.3384, -0.7694,  ..., -0.4745, -0.1105, -0.7589],
         [-0.8740, -0.3295,  0.6591,  ..., -1.9694, -2.3188,  0.4383],
         [ 0.3498, -1.4112,  0.0969,  ...,  0.4195,  0.0887, -0.6138],
         ...,
         [ 1.8556, -0.1904, -1.1067,  ..., -1.8617,  0.1445, -0.0550],
         [ 0.2851, -0.9601, -0.9530,  ..., -1.6096,  0.1400,  0.2290],
         [ 0.1789, -0.0395,  0.0412,  ...,  0.0570,  0.1783,  0.1189]]],
       grad_fn=<NativeLayerNormBackward0>)


As expected, the output tensors from the two models are different.

What advantage does a long-context model have over a short-context model? Why might one prefer one model over the other for a given task?

# Inference Speed

To make an apples-to-apples comparison of inference speed between the two models, let's truncate the long texts to 512 tokens for both models.

In [33]:
texts = dataset["positive"]
len(texts)

5

In [34]:
bert_inputs = bert_tokenizer(texts,
                             max_length=512,
                             padding=True,
                             truncation=True,
                             return_tensors='pt')

mbert_inputs = mbert_tokenizer(texts,
                               max_length=512,
                               padding=True,
                               truncation=True,
                               return_tensors='pt')

In [35]:
%%time

bert_outputs = bert_model(**bert_inputs)

CPU times: user 8.16 s, sys: 2.36 s, total: 10.5 s
Wall time: 10.9 s


In [36]:
%%time

mbert_outputs = mbert_model(**mbert_inputs)

CPU times: user 12.4 s, sys: 3.21 s, total: 15.6 s
Wall time: 16.2 s


Even with its enhanced capabilities, ModernBERT's inference time on CPU is of the same order as BERT's: about 16 s versus about 11 s of wall time for this batch of five truncated texts.
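`%%time` measures a single run, which can be noisy (for example, the first call may include one-time setup). A slightly more robust approach, sketched here with the standard library only, is to do a warm-up call and report the median of several runs. The `benchmark` helper and its `warmup` and `repeats` parameters are our own illustration, not part of any library, and the toy workload at the bottom stands in for a model call.

```python
import statistics
import time


def benchmark(fn, warmup=1, repeats=5):
    """Call fn a few times and return (median seconds, list of all timings)."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), timings


# Toy workload standing in for a model call.
# In the notebook you could pass a model call instead, e.g.
#   benchmark(lambda: bert_model(**bert_inputs))
# ideally inside `with torch.no_grad():` so that no gradients are tracked.
median_s, runs = benchmark(lambda: sum(range(100_000)))
print(f"median over {len(runs)} runs: {median_s:.6f} s")
```

The median is less sensitive to a single slow run than one `%%time` measurement. On a GPU, the timed function should also wait for the device to finish (for example with `torch.cuda.synchronize()`), because CUDA calls return before the work completes.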

We can also run inference on a GPU and compare the speeds there.

Select a GPU runtime and run the cells below.

In [1]:
!pip install transformers datasets -q

import numpy as np
import pandas as pd
import torch

from datasets import load_dataset

from transformers import BertTokenizer, BertModel, BertForSequenceClassification
from transformers import AutoTokenizer, AutoModel

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/485.4 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━[0m [32m378.9/485.4 kB[0m [31m12.8 MB/s[0m eta [36m0:00:01[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m481.3/485.4 kB[0m [31m11.5 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m485.4/485.4 kB[0m [31m6.5 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/116.3 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m8.7 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/143.5 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m143.5/143.5 kB[0m [31m8.1 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m

In [2]:
bert_checkpoint = "bert-base-cased"
bert_tokenizer = BertTokenizer.from_pretrained(bert_checkpoint)
bert_model = BertModel.from_pretrained(bert_checkpoint)


In [3]:
mbert_checkpoint = "answerdotai/ModernBERT-base"
mbert_tokenizer = AutoTokenizer.from_pretrained(mbert_checkpoint)
mbert_model = AutoModel.from_pretrained(mbert_checkpoint)


In [4]:
dataset = load_dataset("sentence-transformers/mldr", "en-triplet", split="train").take(5)
texts = dataset["positive"]


Again, we truncate the long context texts to the same length to compare the two models.

In [5]:
bert_inputs = bert_tokenizer(texts,
                             max_length=512,
                             padding=True,
                             truncation=True,
                             return_tensors='pt')

mbert_inputs = mbert_tokenizer(texts,
                               max_length=512,
                               padding=True,
                               truncation=True,
                               return_tensors='pt')

In [6]:
%%time

bert_outputs = bert_model(**bert_inputs)

CPU times: user 6.95 s, sys: 903 ms, total: 7.85 s
Wall time: 8.43 s


In [7]:
%%time

mbert_outputs = mbert_model(**mbert_inputs)

CPU times: user 10.7 s, sys: 2.29 s, total: 13 s
Wall time: 14.1 s


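Note that the cells above never move the models or the inputs to the GPU (for example with `.to("cuda")`), so these timings still reflect CPU execution: 8.43 s for BERT and 14.1 s for ModernBERT, close to the earlier CPU results. To see the actual GPU speedup, move both the model and the tokenized inputs to the GPU before timing.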

Next, take a look at the [ModernBERT documentation](https://huggingface.co/docs/transformers/main/en/model_doc/modernbert) to see if you can fine-tune a ModernBERT model for a downstream task!