---
title: Notes on NeoBERT
#subtitle: "..."
description: "My study notes on the 'NeoBERT: A Next-Generation BERT' paper and explorations with the model"
date: 2025-06-25
toc: true
---

The paper "[NeoBERT: A Next-Generation BERT](https://arxiv.org/abs/2502.19587)" (2025) by Lola Le Breton, Quentin Fournier, Mariam El Mezouar, John X. Morris, Sarath Chandar introduces a new encoder model with updated architecture, training data, and pre-training methods and is intended as a strong backbone model.

> [W]e introduce NeoBERT, a next-generation encoder that redefines the capabilities of bidirectional models by integrating state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies.

These are the most important highlights at a first glance:

- **Performance:** It performs state-of-the-art on MTEB for its parameter class against other backbone models like BERT, NomicBERT, or ModernBERT base.
- **Size:** The model is medium-sized with 250M parameters
- **Sequence length:** Compared to BERT and RoBERTa (512) it has an extended context length of 4,096 tokens similar to NomicBERT (2,048) and ModernBERT (8,192)
- **Speed:** NeoBERT is faster than ModernBERT in inference speed, despite being 100M parameters larger than the ModernBERT base
- **Dimensions:** NeoBERT maintains same hidden size as base models (768) allowing for seamless plug-and-play

And here are the relevant links:

- arXiv: [https://arxiv.org/abs/2502.19587](https://arxiv.org/abs/2502.19587)
- Model on Hugging Face: [https://huggingface.co/chandar-lab/NeoBERT](https://huggingface.co/chandar-lab/NeoBERT)
- Code: [https://github.com/chandar-lab/NeoBERT](https://github.com/chandar-lab/NeoBERT)


## Motivation
The motivation behind is paper is the fact that encoders have not received as much love as LLMs in recent years, although they are equally important for downstream applications, like RAG systems.

Today's LLMs are capable of in-context learning and reasoning because of advancements in architecture, training data, pre-training, and fine-tuning.
While there has been research on fine-tuning methods for pre-trained encoders (e.g., [GTE](Li et al., 2023b) or [jina-embeddings](Sturua et al., 2024)), they are applied to older base models, like [BERT](TODO link) from 2019.
Therefore, the authors see a lack of updated open-source
base models to apply these new fine-tuning techniques to.

> As a result, there is a dire need for a new generation of BERT-like pre-trained models that incorporate up-to-date knowledge and leverage both architectural and training innovations, forming stronger backbones for these more advanced fine-tuning procedures.

Recent work on modernizing these base models are NomicBERT and ModernBERT, which this paper takes inspiration from:

- NomicBERT:
    - Paper: [https://arxiv.org/abs/2402.01613](https://arxiv.org/abs/2402.01613) (`nomic-bert-2048`, not to be confused with `nomic-embed-text-v1`)
    - Model: [https://huggingface.co/nomic-ai/nomic-bert-2048](https://huggingface.co/nomic-ai/nomic-bert-2048)
    - License: apache-2.0
    - Released: February 2025
- ModernBERT:
    - Paper: [https://arxiv.org/abs/2412.13663](https://arxiv.org/abs/2412.13663)
    - Model: [https://huggingface.co/answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base)
    - Model: [https://huggingface.co/answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large)
    - License: apache-2.0
    - Released: December 2024

## Key insights
The paper covers a lot of interesting nitty-gritty details on recent advancements around architecture choice, training data selection, and pre-training methods.
But if you step back, I think the key insights are that they are confirming what we already know from LLMs:

1. Training on a lot of good data = Better models
2. Increasing model size = Better models (even at small scale)

### Training on a lot of good data = Better models

According to the paper, the modification with the biggest improvement was changing the training data:

> [...] replacing Wikitext and BookCorpus with the significantly larger and more diverse RefinedWeb dataset **improved the score by +3.6%** [...]

They trained NeoBERT on [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), which is a 2.8 TB large dataset. It contains 600B tokens and is 18 times larger than RoBERTa's training dataset.

> Following the same trend, we pre-trained NeoBERT on RefinedWeb (Penedo et al., 2023), a massive dataset containing 600B tokens, nearly 18 times larger than RoBERTa’s.

I think it's interesting that apparently the newer NomicBERT was trained on the same dataset as BERT with 13GB, while RoBERTa was trained on an extended dataset of 160 GB, and it's just now that this has been done to encoders, while this has been done to generative models for a while already.

> Recent generative models like the LLaMA family (Touvron et al., 2023; Dubey et al., 2024) have demonstrated that language models benefit from being trained on significantly more tokens than was previously standard. Recently, LLaMA-3.2 1B was successfully trained on up to 9T tokens without showing signs of saturation. Moreover, encoders are less sample-efficient than decoders since they only make predictions for masked tokens. **Therefore, it is reasonable to believe that encoders of similar sizes can be trained on an equal or even greater number of tokens without saturating.**

### Increasing model size = Better models (even at small scale)

The second most impactful modification was increasing the model size and finding an optimal depth-to-width ratio for the Transformer architecture:

> [...] while increasing the model size from 120M to 250M in M7 **led to a +2.9% relative improvement**.

So, NomicBERT and ModernBERT base both have around 150M parameters and are considered small-sized.
NeoBERT with 250M parameters can be considered medium-sized, so it makes sense that it performs better than smaller models.

But what's interesting is that they took the depth-to-width ratio into consideration when increasing the model size:

> In contrast, small language models like BERT, RoBERTa, and NomicBERT are instead in a width-inefficiency regime. To maximize NeoBERT’s parameter efficiency while ensuring it remains a seamless plug-and-play replacement, we retain the original BERT base width of 768 and instead increase its depth to achieve this optimal ratio.

So, they first increased the number of parameters to 250M with a depth-to-width ratio of 16 x 1056 (too wide) and then they optimized the depth-to-width ratio to 28 x 768 (more width-efficient).

> Note that to assess the impact of the depth-to-width ratio, we first scale the number of parameters in M7 to 250M while maintaining a similar ratio to BERT base, resulting in 16 layers of dimension 1056. In M8, the ratio is then adjusted to 28 layers of dimension 768.

This is nice, because by keeping the hidden size at 768, NeoBERT can be easily switched out for other base models:
> NeoBERT is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models

What I also think is remarkable is that it has a much faster inference speed compared to both ModernBERT models despite it's size.
There's a nice figure in the paper showing the thoughput at different sequence length.
The figure is missing NomicBERT though, which I don't know why.

>  For extended sequences, NeoBERT significantly outperforms ModernBERT base, despite having 100M more parameters, **achieving a 46.7% speedup on sequences of 4, 096 tokens**.

## Old Encoders vs. Modern Encoders

The paper features a nice overview table of different characteristics between older encoders, like BERT (2019), RoBERTa (2019), and newer encoders, like ModernBERT (2024) and NomicBERT (2025), which shows their differences. Here I'm summarizing key differences between older and newer encoders that stood out to me:

| Configuration | Older Encoders  | Newer Encoders  |
| -- | --  | -- |
| Position encoding and sequence lengths | Absolute positional embeddings with a sequence length of 512 | RoPE for handling longer sequences of 2,048 to 8,192 |
| Masking rate | 15 % | Optimal masking rate was found to be between 20 to 40 % [by Wettig et al.](https://arxiv.org/abs/2202.08005) |
| Optimizer| Adam | AdamW |
| Training  | DDP | FlashAttention and other |
| Normalization  | Post-Layer Normalization | Pre-Layer Normalization (normalization layer is moved inside the residual connection of each feed-forward and attention block)|

## NeoBERT Example

Since NeoBERT is a state-of-the-art encoder, it can be used as an embedding model.
Let's take it for a spin, shall we?

I'm going to quickyl test how to create an embedding vector and then use it with the [Weaviate](https://github.com/weaviate/weaviate) vector database.

If you want to test it out yourself, you can open this blog as a Jupyter notebook in Colab.
<a target="_blank" href="https://colab.research.google.com/github/iamleonie/website/blob/main/blog/neobert.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

Let's first install the required packages.
(I'm fixing the `transformers` version that has worked for me here.)

In [1]:
#| echo: true
#| output: false
!pip install transformers==4.41.0 torch xformers==0.0.28.post3
!pip install -U weaviate-client

Collecting transformers==4.41.0
  Downloading transformers-4.41.0-py3-none-any.whl.metadata (43 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/43.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m43.8/43.8 kB[0m [31m2.9 MB/s[0m eta [36m0:00:00[0m
Collecting xformers==0.0.28.post3
  Downloading xformers-0.0.28.post3-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Collecting tokenizers<0.20,>=0.19 (from transformers==4.41.0)
  Downloading tokenizers-0.19.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting torch
  Downloading torch-2.5.1-cp311-cp311-manylinux1_x86_64.whl.metadata (28 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-many

### Generate a single embedding

Then, let's try the sample code from the [Hugging Face model card](https://huggingface.co/chandar-lab/NeoBERT).

In [2]:
#| echo: true
#| output: false
from transformers import AutoModel, AutoTokenizer

model_name = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

# Tokenize input text
text = "NeoBERT is the most efficient model of its kind!"
inputs = tokenizer(text, return_tensors="pt")

# Generate embeddings
outputs = model(**inputs)
embedding = outputs.last_hidden_state[:, 0, :]

Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/928 [00:00<?, ?B/s]

model.py: 0.00B [00:00, ?B/s]

rotary.py: 0.00B [00:00, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/chandar-lab/NeoBERT:
- rotary.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/chandar-lab/NeoBERT:
- model.py
- rotary.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

model.safetensors:   0%|          | 0.00/981M [00:00<?, ?B/s]

Let's take a look at the generated output:

Nice. It looks like a vector embedding.

In [3]:
embedding[0].tolist()[:5]

[-1.9510374069213867,
 -0.9967884421348572,
 -1.2037752866744995,
 0.24297747015953064,
 -0.1231154128909111]

As you can see, NeoBERT creates vectors of dimension 768.

In [4]:
embedding.shape

torch.Size([1, 768])

### Use it for vector search

And last, but not least, let's use NeoBERT for vector search. 
For this, we will spin up an embedded instance of Weaviate.

In [6]:
import weaviate, os

# Connect to an embedded Weaviate instance
client = weaviate.connect_to_embedded()

client.is_ready()

INFO:weaviate-client:Binary /root/.cache/weaviate-embedded did not exist. Downloading binary from https://github.com/weaviate/weaviate/releases/download/v1.30.5/weaviate-v1.30.5-Linux-amd64.tar.gz
INFO:weaviate-client:Started /root/.cache/weaviate-embedded: process ID 2556


True

To use NeoBERT with Weaviate, we're going to use the 'Bring your own vectors' approach and therefore won't specify a vectorizer during the creation of the collection. We will only specify one property field `"Text"` here.

In [50]:
import weaviate.classes.config as wc

# Delete the collection if it already exists
if (client.collections.exists("TestCollection")):
    client.collections.delete("TestCollection")

collection = client.collections.create(
    name="TestCollection",
    vectorizer_config=wc.Configure.Vectorizer.none(),
    properties=[
        wc.Property(name="Text", data_type=wc.DataType.TEXT),
    ]
)

Since we are not using an automatic vectorizer, we have to write a small function to generate embeddings manually.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

def get_embedding(source_text: str):
  inputs = tokenizer(text, return_tensors="pt")

  # Generate embeddings
  outputs = model(**inputs)
  embedding = outputs.last_hidden_state[:, 0, :]
  return embedding[0].tolist()

Since we're just playing around with NeoBERT, we will load a small toy dataset of Jeopardy questions into the database.

In [51]:
import weaviate.classes as wvc
import requests, json

url = 'https://raw.githubusercontent.com/weaviate/weaviate-examples/main/jeopardy_small_dataset/jeopardy_tiny.json'
resp = requests.get(url)
data = json.loads(resp.text)

objs = []
for i, d in enumerate(data):
    objs.append(wvc.data.DataObject(
        properties={
            "Text": d["Question"],
        },
        vector=get_embedding(d["Question"])
    ))

collection.data.insert_many(objs)

BatchObjectReturn(_all_responses=[UUID('87425cb6-95cf-412a-aad5-576eb041f428'), UUID('1376af87-61fc-443f-8461-e8ebf7dd894f'), UUID('935d6a44-8fa6-4bd2-a867-8c2252426562'), UUID('51d06b87-e592-472b-a130-d6d32410134a'), UUID('b909dfb4-859b-4c5a-bad4-4bef34f19911'), UUID('9dd808f2-ab6a-4ee4-9782-cf4027f82d1b'), UUID('5dc9a795-4641-4cc3-82a6-b3d0118fdb39'), UUID('51e34d78-68e8-4fdb-bf08-908cfe7742c0'), UUID('edce80a1-a891-41dd-a4a4-5e82d4f58827'), UUID('bbc13949-0237-4e36-af94-36ab6eafd9a9')], elapsed_seconds=0.0041272640228271484, errors={}, uuids={0: UUID('87425cb6-95cf-412a-aad5-576eb041f428'), 1: UUID('1376af87-61fc-443f-8461-e8ebf7dd894f'), 2: UUID('935d6a44-8fa6-4bd2-a867-8c2252426562'), 3: UUID('51d06b87-e592-472b-a130-d6d32410134a'), 4: UUID('b909dfb4-859b-4c5a-bad4-4bef34f19911'), 5: UUID('9dd808f2-ab6a-4ee4-9782-cf4027f82d1b'), 6: UUID('5dc9a795-4641-4cc3-82a6-b3d0118fdb39'), 7: UUID('51e34d78-68e8-4fdb-bf08-908cfe7742c0'), 8: UUID('edce80a1-a891-41dd-a4a4-5e82d4f58827'), 9: UUID

Now, we're all set and we can run a simple search query with the `near_vector` method.

In [53]:
response = collection.query.near_vector(
    near_vector=get_embedding("African animal"),
    limit=3,
    include_vector = True
)

for item in response.objects:
    print("ID:", item.uuid)
    print("Data:", item.properties)
    print("Vector:", item.vector, "\n")

ID: 51d06b87-e592-472b-a130-d6d32410134a
Data: {'text': 'Weighing around a ton, the eland is the largest species of this animal in Africa'}
Vector: {'default': [-1.9510374069213867, -0.9967884421348572, -1.2037752866744995, 0.24297747015953064, -0.1231154128909111, 1.15748131275177, -1.6040924787521362, -1.8666054010391235, -0.5423393249511719, -1.5033859014511108, 2.1337497234344482, -1.035761833190918, 0.6672560572624207, 0.8548232316970825, 1.4260377883911133, 1.7781187295913696, 0.49215567111968994, -1.5952364206314087, 3.0107059478759766, 0.8425706028938293, -0.02023920789361, -0.17652419209480286, 1.5272082090377808, -1.5370515584945679, -2.210336446762085, 1.9915522336959839, -0.47231099009513855, -1.5976117849349976, -1.3054083585739136, -1.3994256258010864, 2.627300500869751, 0.892441987991333, 2.0996060371398926, -0.9450390934944153, -1.0044033527374268, -3.512587547302246, 3.1640431880950928, 0.6829001307487488, -0.7545009851455688, -2.748605728149414, -1.1117602586746216, 3