<a href="https://colab.research.google.com/github/kjahan/semantic_similarity/blob/main/examples/modern_embeddings_2025.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introducing Nomic Embed: A Truly Open Embedding Model

`Longer context 8192`

https://www.nomic.ai/blog/posts/nomic-embed-text-v1

`text embedding model with an 8192 context-length that outperforms OpenAI Ada-002 and text-embedding-3-small on both short and long context tasks.`

https://huggingface.co/nomic-ai/nomic-embed-text-v1

https://huggingface.co/nomic-ai/nomic-bert-2048

https://www.databricks.com/blog/mosaicbert


### Task instruction prefixes
`search_document`

`Purpose: embed texts as documents from a dataset`

`This prefix is used for embedding texts as documents, for example as documents for a RAG index.`
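
The model card lists four task prefixes: `search_document`, `search_query`, `clustering`, and `classification`. A minimal sketch of applying one before encoding; `add_task_prefix` is a hypothetical helper written for this notebook, not part of any library.

```python
# Task prefixes documented on the nomic-embed-text-v1 model card.
TASK_PREFIXES = ("search_document", "search_query", "clustering", "classification")


def add_task_prefix(texts, task):
    """Prepend '<task>: ' to every text (hypothetical helper, not a library API)."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task prefix: {task!r}")
    return [f"{task}: {text}" for text in texts]


docs = add_task_prefix(["TSNE is a dimensionality reduction algorithm."], "search_document")
queries = add_task_prefix(["Who created TSNE?"], "search_query")
print(docs[0])
print(queries[0])
```

The prefixed strings are what would be passed to `model.encode(...)`; documents and queries for the same index get different prefixes.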

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)


## Use clustering prefix

### Purpose: embed texts to group them into clusters

`This prefix is used for embedding texts in order to group them into clusters, discover common topics, or remove semantic duplicates.`

https://sbert.net/
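
One way to act on "remove semantic duplicates" is to flag pairs whose cosine similarity exceeds a threshold. The sketch below uses plain Python and made-up 3-d vectors in place of real embeddings; the `0.9` threshold and the function names are assumptions for illustration and would need tuning on real data.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def near_duplicates(vectors, threshold=0.9):
    """Return index pairs (i, j), i < j, whose cosine similarity meets the threshold."""
    pairs = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Toy vectors standing in for real embeddings (made up for illustration).
toy = [[1.0, 0.0, 0.0], [0.99, 0.1, 0.0], [0.0, 1.0, 0.0]]
print(near_duplicates(toy))
```

With real embeddings, `vectors` would be the rows returned by `model.encode(...)` on `clustering:`-prefixed texts.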

In [None]:
# The sentences to encode
sentences = [
    "clustering: The weather is lovely today.",
    "clustering: It's so sunny outside!",
    "clustering: He drove to the stadium.",
]

# sentences = ['search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten']
embeddings = model.encode(sentences)

print(embeddings.shape)
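# (3, 768)

# Calculate the pairwise embedding similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)

# Reference output from the SBERT quickstart (all-MiniLM-L6-v2, same sentences
# without the "clustering: " prefix); values from this model will differ.
# tensor([[1.0000, 0.6660, 0.1046],
#         [0.6660, 1.0000, 0.1411],
#         [0.1046, 0.1411, 1.0000]])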

### E5-mistral-7b-instruct

`Improving Text Embeddings with Large Language Models. Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei, arXiv 2024`

`This model has 32 layers and the embedding size is 4096.`

https://huggingface.co/intfloat/e5-mistral-7b-instruct

### Loading this model crashes the notebook because of its size!
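
A rough back-of-the-envelope estimate of why: the sketch below counts only the memory needed to hold the weights, assuming roughly 7 billion parameters (taken from the "7b" in the model name, not an exact count). Activations and framework overhead come on top.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed just to hold the weights, in GiB."""
    return num_params * bytes_per_param / 1024**3


# Assumption: roughly 7e9 parameters, as the "7b" in the model name suggests.
NUM_PARAMS = 7_000_000_000

fp32 = weight_memory_gib(NUM_PARAMS, 4)  # 4 bytes per float32 weight
fp16 = weight_memory_gib(NUM_PARAMS, 2)  # 2 bytes per float16 weight
print(f"float32: {fp32:.1f} GiB, float16: {fp16:.1f} GiB")
```

Even in half precision the weights alone are likely to exceed what a small notebook runtime offers.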

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")
# In case you want to reduce the maximum sequence length:
model.max_seq_length = 4096

### Testing

`Below is an example to encode queries and passages from the MS-MARCO passage ranking dataset.`

`Have a look at config_sentence_transformers.json for the prompts that are pre-configured, such as web_search_query, sts_query, and summarization_query. Additionally, check out unilm/e5/utils.py for prompts we used for evaluation. You can use these via e.g. model.encode(queries, prompt="Instruct: Given a claim, find documents that refute the claim\nQuery: ").`
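
The prompt quoted above follows the pattern `Instruct: <task>\nQuery: `. A minimal sketch of building such a prompt for a custom task; `build_instruct_prompt` is a hypothetical helper name and the example query is made up.

```python
def build_instruct_prompt(task_description):
    """Build the prefix passed as `prompt=` (hypothetical helper, not a library API)."""
    return f"Instruct: {task_description}\nQuery: "


prompt = build_instruct_prompt("Given a claim, find documents that refute the claim")
# The prompt is prepended to each query text before it is embedded.
full_query = prompt + "Vitamin C cures the common cold"
print(full_query)
```

The resulting string would be passed as `model.encode(queries, prompt=prompt)`. Documents are encoded without a prompt.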

In [None]:
queries = [
    "how much protein should a female eat",
    "summit define",
]
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."
]

query_embeddings = model.encode(queries, prompt_name="web_search_query")
document_embeddings = model.encode(documents)

scores = (query_embeddings @ document_embeddings.T) * 100
print(scores.tolist())

## jina-embeddings-v2-base-en

`jina-embeddings-v2-base-en is an English, monolingual embedding model supporting 8192 sequence length. It is based on a BERT architecture (JinaBERT) that supports the symmetric bidirectional variant of ALiBi to allow longer sequence length. The backbone jina-bert-v2-base-en is pretrained on the C4 dataset. The model is further trained on Jina AI's collection of more than 400 million sentence pairs and hard negatives. These pairs were obtained from various domains and were carefully selected through a thorough cleaning process.`

`The embedding model was trained using 512 sequence length, but extrapolates to 8k sequence length (or even longer) thanks to ALiBi. This makes our model useful for a range of use cases, especially when processing long documents is needed, including long document retrieval, semantic textual similarity, text reranking, recommendation, RAG and LLM-based generative search, etc.`

`With a standard size of 137 million parameters, the model enables fast inference while delivering better performance than our small model. It is recommended to use a single GPU for inference.`

https://huggingface.co/jinaai/jina-embeddings-v2-base-en

In [None]:
#!pip install transformers
from transformers import AutoModel
from numpy.linalg import norm

cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True) # trust_remote_code is needed to use the encode method


In [None]:
embeddings = model.encode(['How is the weather today?', 'What is the current weather like today?'])
print(cos_sim(embeddings[0], embeddings[1]))


### Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

`Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question has yet to be answered: how does a model achieve extrapolation at inference time for sequences that are longer than it saw during training? We first show that extrapolation can be enabled by simply changing the position representation method, though we find that current methods do not allow for efficient extrapolation. We therefore introduce a simpler and more efficient position method, Attention with Linear Biases (ALiBi). ALiBi does not add positional embeddings to word embeddings; instead, it biases query-key attention scores with a penalty that is proportional to their distance. We show that this method trains a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048 but training 11% faster and using 11% less memory. ALiBi's inductive bias towards recency also leads it to outperform multiple strong position methods on the WikiText-103 benchmark.`


https://arxiv.org/abs/2108.12409
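
A minimal sketch of the bias ALiBi adds to the attention scores, in plain Python. It assumes the number of heads is a power of two and uses the paper's geometric slope schedule starting at 2^(-8/n). The `symmetric=True` branch is my reading of the "symmetric bidirectional variant" mentioned for JinaBERT (penalty on `|i - j|` in both directions), not code taken from either project.

```python
def alibi_slopes(num_heads):
    """Head-specific slopes: a geometric sequence starting at 2^(-8/n).

    Assumes num_heads is a power of two.
    """
    start = 2 ** (-8 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(seq_len, slope, symmetric=False):
    """Bias added to the query-key score for each (query i, key j) pair.

    Causal variant: -slope * (i - j) for j <= i; future keys are left at 0.0
    here because a causal mask would remove them anyway.
    Symmetric variant: -slope * |i - j| for every pair.
    """
    bias = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            distance = abs(i - j)
            if distance and (symmetric or j < i):
                row.append(-slope * distance)
            else:
                row.append(0.0)
        bias.append(row)
    return bias


slopes = alibi_slopes(8)
print(slopes[:3])
for row in alibi_bias(4, slopes[0], symmetric=True):
    print(row)
```

Because the penalty depends only on distance, the same rule applies to positions never seen in training, which is what allows a model trained on short sequences to run on longer ones.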


### ModernBERT

`Finally, a Replacement for BERT`

`This blog post introduces ModernBERT, a family of state-of-the-art encoder-only models representing improvements over older generation encoders across the board, with an 8192 sequence length, better downstream performance and much faster processing.`

`ModernBERT is available as a slot-in replacement for any BERT-like models, with both a base (149M params) and large (395M params) model size.`

https://huggingface.co/blog/modernbert

## gte-modernbert-base

`We are excited to introduce the gte-modernbert series of models, which are built upon the latest modernBERT pre-trained encoder-only foundation models. The gte-modernbert series models include both text embedding models and rerank models.`

`The gte-modernbert models demonstrate competitive performance in several text embedding and text retrieval evaluation tasks when compared to similar-scale models from the current open-source community. This includes assessments such as MTEB, LoCO, and COIR evaluation.`

https://huggingface.co/Alibaba-NLP/gte-modernbert-base

In [None]:
# Requires transformers>=4.48.0

import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_path = "Alibaba-NLP/gte-modernbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)

In [None]:
input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

# Tokenize the input texts (they must be defined before this call)
batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors='pt')

# Run the encoder and keep the first ([CLS]) token of each sequence as its embedding
outputs = model(**batch_dict)
embeddings = outputs.last_hidden_state[:, 0]

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
# [[42.89073944091797, 71.30911254882812, 33.664554595947266]]

In [None]:
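# Score the second query (quick sort) against the two candidate answers.
# Normalizing again is harmless if the embeddings are already unit length.
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[1:2] @ embeddings[2:].T) * 100
print(scores.tolist())
# Prints a 1x2 list: [[score vs "Beijing", score vs "sorting algorithms"]]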

In [None]:
# Requires transformers>=4.48.0
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

model = SentenceTransformer("Alibaba-NLP/gte-modernbert-base")
embeddings = model.encode(input_texts)
print(embeddings.shape)
# (4, 768)

similarities = cos_sim(embeddings[0], embeddings[1:])
print(similarities)
# tensor([[0.4289, 0.7131, 0.3366]])
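
To turn the similarity scores into a ranking, sort the candidates by score. A small sketch using the values printed above; `rank_candidates` is a hypothetical helper name.

```python
def rank_candidates(candidates, scores):
    """Pair each candidate with its score and sort from most to least similar."""
    return sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)


candidates = ["how to implement quick sort in python?", "Beijing", "sorting algorithms"]
scores = [0.4289, 0.7131, 0.3366]  # similarities to "what is the capital of China?"
for text, score in rank_candidates(candidates, scores):
    print(f"{score:.4f}  {text}")
```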

## gte-reranker-modernbert-base

The reranker belongs to the same gte-modernbert series described above. Unlike the embedding model, it takes a (query, document) pair as a single input and returns one relevance score per pair.

In [None]:
# Requires transformers>=4.48.0
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name_or_path = "Alibaba-NLP/gte-reranker-modernbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.float16,
)
model.eval()



In [None]:
pairs = [
    ["what is the capital of China?", "Beijing"],
    ["how to implement quick sort in python?", "Introduction of quick sort"],
    ["how to implement quick sort in python?", "The weather is nice today"],
]

with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
    print(scores)

# tensor([ 2.1387,  2.4609, -1.6729])


In [None]:
# Requires transformers>=4.48.0
from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "Alibaba-NLP/gte-reranker-modernbert-base",
    automodel_args={"torch_dtype": "auto"},
)

pairs = [
    ["what is the capital of China?", "Beijing"],
    ["how to implement quick sort in python?","Introduction of quick sort"],
    ["how to implement quick sort in python?", "The weather is nice today"],
]

scores = model.predict(pairs)
print(scores)
# [0.8945664  0.9213594  0.15742092]
# NOTE: For a single-logit model like this one, Sentence Transformers applies a Sigmoid to the outputs by default, hence the scores are in the [0, 1] range.
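
The two reranker examples report the same relevance on different scales: the `transformers` version prints raw logits, the `CrossEncoder` version prints values in [0, 1]. A quick check, in plain Python, that a logistic sigmoid maps one onto the other. Small differences in the last digits are expected because the two examples load the model with different dtypes.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, mapping any real logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


raw_logits = [2.1387, 2.4609, -1.6729]  # printed by the transformers example
probabilities = [sigmoid(x) for x in raw_logits]
print([round(p, 4) for p in probabilities])
```

The ordering of the pairs is the same on both scales, so either output can be used for reranking.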
