# Embeddings with Python

## Setup

Check that you have configured your environment properly with uv (see [setup](../setup.md)).

## Embeddings

[sentence-transformers](https://github.com/UKPLab/sentence-transformers) is a Python library for computing sentence embeddings. This library works fully locally but requires an internet connection to download embedding models.

The documentation is available [here](https://www.sbert.net/).

In [None]:
from sentence_transformers import SentenceTransformer

  from .autonotebook import tqdm as notebook_tqdm


*The warning message shown in red can be safely ignored.*

### (Down)Load model

We load the [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) embedding model. It is a rather *small* embedding model (109M, i.e. 109 million, parameters) hosted on [HuggingFace](https://huggingface.co/). It converts any input text into a vector of 768 dimensions.

In [None]:
model_name = "all-mpnet-base-v2"

The first time the model is called, it will be downloaded locally. This can take some time and disk space (about 419 MB). By default, models are stored in `$HOME/.cache/huggingface/`.

If you're using university computers, we store models locally in the `/tmp` folder to avoid overloading your HOME directory and the NFS server that hosts it.

In [None]:
import os
import socket
username = os.environ["USER"]
hostname = socket.gethostname()
print(f"This code is running on computer {hostname} with user {username}")

In [None]:
if hostname.startswith("lk"):
    # University computers
    model = SentenceTransformer(
        model_name,
        cache_folder=f"/tmp/{username}/huggingface/hub"
    )
else:
    # Personal computers
    model = SentenceTransformer(model_name)
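The host-based choice above can be factored into a small helper, which makes the rule easy to check without loading a model. This is only a sketch: `pick_cache_folder` is a hypothetical name (not part of sentence-transformers), and the hostname and username passed to it below are made-up example values.

```python
def pick_cache_folder(hostname, username):
    """Return a cache folder under /tmp for university hosts, else None.

    Hypothetical helper: the "lk" prefix is the convention used in this
    tutorial to recognise university computers.
    """
    if hostname.startswith("lk"):
        return f"/tmp/{username}/huggingface/hub"
    # None means: let the library use its default cache location.
    return None


# Made-up example values, for illustration only.
print(pick_cache_folder("lk042", "alice"))
print(pick_cache_folder("my-laptop", "alice"))
```

The returned value could then be passed as `cache_folder=` when creating the `SentenceTransformer`; passing `None` should behave like omitting the argument.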

This is the kind of output you could expect to get while downloading the model:

```
modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]
config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]
README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]
sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]
config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]
model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]
tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]
vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]
special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]
1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]
```

### Basic example

Here are a few sentences to play with:

In [None]:
sentences = [
    "DNA carries genetic information in cells.",
    "Proteins are made up of chains of amino acids.",
    "DNA encodes the sequence of residues.",
    "RNA is a type of nucleic acid."
]

We get the embeddings for each sentence. Each embedding is a vector of 768 dimensions.

In [None]:
embeddings = model.encode(sentences)
print("Size of the first vector:")
print(len(embeddings[0]))
print("Ten first elements of the first vector:")
print(embeddings[0,:10])

Get similarity between all embeddings:

In [None]:
similarities = model.similarity(embeddings, embeddings)
print(similarities)

We obtain a 4 x 4 square matrix. The diagonal contains only 1s because each sentence is identical to itself.

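Remark: the following code also computes cosine similarities, provided the embeddings have unit length (which should be the case for this model, since it normalizes its output):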

```python
import numpy as np
similarities = np.inner(embeddings, embeddings)
```
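To see why the inner product can stand in for the cosine similarity, here is a minimal sketch in plain Python. The two 3-dimensional vectors are made up for illustration (they are not real embeddings), and the helper names are illustrative.

```python
import math


def dot(u, v):
    """Inner product of two vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Inner product divided by the product of the vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def normalize(u):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]


# Toy vectors, made up for illustration.
u = [1.0, 2.0, 2.0]
v = [2.0, 0.0, 1.0]

cos_uv = cosine_similarity(u, v)
inner_normalized = dot(normalize(u), normalize(v))

print(f"cosine similarity:           {cos_uv:.4f}")
print(f"inner product (unit length): {inner_normalized:.4f}")
print(f"inner product (raw vectors): {dot(u, v):.4f}")
```

The first two values agree, while the raw inner product differs: the shortcut is only valid once the vectors have been scaled to length 1.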

We now display the most similar sentence for a given sentence:

In [None]:
import numpy as np
for idx, sentence in enumerate(sentences):
    # Discard similarity for the sentence itself.
    # Score of 1 is replaced by -1.
    similarities[idx][idx] = -1
    # Find index of the most similar sentence.
    most_similar_idx = np.argmax(similarities[idx])
    print(f"Original sentence    : {sentence}")
    print(f"Most similar sentence: {sentences[most_similar_idx]}\n")
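The loop above overwrites the diagonal of `similarities`, so the matrix cannot be reused as-is afterwards. One possible alternative, sketched here on a small hand-written matrix (the values are invented, not model output), ranks the other sentences by score while skipping the sentence itself. `rank_neighbors` is a hypothetical helper name.

```python
def rank_neighbors(similarity_row, own_idx):
    """Return the indices of the other items, most similar first."""
    candidates = [i for i in range(len(similarity_row)) if i != own_idx]
    return sorted(candidates, key=lambda i: similarity_row[i], reverse=True)


# Invented 3 x 3 similarity matrix for three placeholder sentences.
toy_sentences = ["sentence A", "sentence B", "sentence C"]
toy_similarities = [
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.4],
    [0.7, 0.4, 1.0],
]

for idx, sentence in enumerate(toy_sentences):
    ranking = rank_neighbors(toy_similarities[idx], idx)
    print(f"{sentence} -> closest: {toy_sentences[ranking[0]]}")
```

Because the full ranking is returned, the same helper also gives the second or third closest sentence, and the matrix is left untouched.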

What do you think of these results? Do you agree with the most similar sentences?

### Try by yourself

Use different sentences and compare the similarities between them.

### Other models

The `all-mpnet-base-v2` model takes a maximum of 384 tokens as input.
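In practice, text beyond the input limit is typically truncated before it is embedded. A common workaround is to split long texts into smaller pieces and embed each piece separately. The sketch below splits on whitespace, which is only a rough proxy: words are not tokens, and the 200-word budget is an arbitrary assumption chosen to stay well below 384 tokens. `chunk_words` is a hypothetical helper name.

```python
def chunk_words(text, max_words):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


# Placeholder text of 500 words. A word often maps to more than one
# token, so the word budget is kept well below the token limit.
chunks = chunk_words("word " * 500, max_words=200)
print([len(chunk.split()) for chunk in chunks])
```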

Larger embedding models are openly available, such as [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct):
- 1.78B (1.78 billion) parameters (about 7 GB of model data to download)
- embedding vector with 1,536 dimensions
- max input tokens: 32k

**If you want to use this more powerful model, be aware it will take some time to download on your machine.**
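The download sizes quoted in this notebook can be roughly recovered from the parameter counts. The sketch below assumes that weights are stored as 32-bit floats (4 bytes per parameter), which is an assumption and not something stated by the model cards; real files also contain some extra metadata. `estimate_download_gb` is a hypothetical helper name.

```python
BYTES_PER_FLOAT32 = 4  # assumption: weights stored as 32-bit floats


def estimate_download_gb(n_parameters, bytes_per_parameter=BYTES_PER_FLOAT32):
    """Rough size of the weights in gigabytes (1 GB = 10**9 bytes)."""
    return n_parameters * bytes_per_parameter / 1e9


for name, n_parameters in [
    ("all-mpnet-base-v2", 109e6),
    ("gte-Qwen2-1.5B-instruct", 1.78e9),
]:
    print(f"{name}: about {estimate_download_gb(n_parameters):.2f} GB")
```

The estimates (about 0.44 GB and 7.12 GB) are in line with the sizes mentioned above.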

For comparison, here is a list of commercial embedding models provided by [OpenAI](https://openai.com/api/pricing/):


| Model                    | Description                                                                       | Max tokens | Output dimensions | Price (US$ / 1M tokens) |
| ------------------------ | --------------------------------------------------------------------------------- | ---------- | ----------------- | ----------------------- |
| `text-embedding-3-large` | Most capable embedding model for both English and non-English tasks               | 8,191      | 3,072             | 0.13                    |
| `text-embedding-3-small` | Increased performance over 2nd generation ada embedding model                     | 8,191      | 1,536             | 0.02                    |
| `text-embedding-ada-002` | Most capable 2nd generation embedding model, replacing 16 first generation models | 8,191      | 1,536             | 0.10                    |
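
To get a feel for these prices, here is a small cost estimate. The prices are copied from the table above and may be out of date. The corpus size of 5 million tokens is a made-up example, and `embedding_cost` is a hypothetical helper name.

```python
# Prices in US dollars per 1M tokens, copied from the table above.
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-large": 0.13,
    "text-embedding-3-small": 0.02,
    "text-embedding-ada-002": 0.10,
}


def embedding_cost(model_id, n_tokens):
    """Cost in US dollars of embedding n_tokens with the given model."""
    return n_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model_id]


n_tokens = 5_000_000  # hypothetical corpus size
for model_id in PRICE_PER_MILLION_TOKENS:
    print(f"{model_id}: ${embedding_cost(model_id, n_tokens):.2f}")
```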
