# Tokenizer and embeddings with Python

## Setup

In [1]:
from IPython.display import Markdown, display; display(Markdown("env_instructions.md"))

From university computers, use the Conda environment `ppoulain-llm-24`:

```bash
$ conda activate ppoulain-llm-24
```

You can also create this environment on your own computer.

Either with [Miniconda](https://docs.anaconda.com/miniconda/):

```bash
$ mkdir -p llm-practicals
$ cd llm-practicals
$ curl https://raw.githubusercontent.com/pierrepo/llm-practicals/main/content/practical-env.yml --output practical-env.yml
# or wget https://raw.githubusercontent.com/pierrepo/llm-practicals/main/content/practical-env.yml
$ conda env create -f practical-env.yml
$ conda activate ppoulain-llm-24
$ jupyter lab
```

or with [Pixi](https://pixi.sh):

```bash
$ mkdir -p llm-practicals
$ cd llm-practicals
$ curl https://raw.githubusercontent.com/pierrepo/llm-practicals/main/content/practical-env.yml --output practical-env.yml
# or wget https://raw.githubusercontent.com/pierrepo/llm-practicals/main/content/practical-env.yml
$ pixi init --import practical-env.yml
$ pixi run jupyter lab
```

## Tokenizer

[tiktoken](https://github.com/openai/tiktoken) is an open-source Python library developed by OpenAI to tokenize text. Tokenization runs entirely on your machine; an internet connection is only needed the first time an encoding is loaded, to download its vocabulary file.

### Load model

In [2]:
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")

### First example

Get the token IDs for a simple sentence:

In [3]:
tokens = enc.encode("Hello world")
print(tokens)

[9906, 1917]


Visualize tokens by separating them with `|`:

In [4]:
print("|".join([ enc.decode([tok]) for tok in tokens]))

Hello| world


The first token is `Hello` and the second is ` world` (with a leading space before `world`).

### Hello bioinformatics

Let's try with another sentence:

In [5]:
tokens = enc.encode("Hello bioinformatics")
print(tokens)
print("|".join([ enc.decode([tok]) for tok in tokens]))

[9906, 17332, 98588]
Hello| bio|informatics


This time we get 3 tokens.

Here is the same sentence in a different language:

In [6]:
tokens = enc.encode("Salut la bioinformatique")
print(tokens)
print("|".join([ enc.decode([tok]) for tok in tokens]))

[17691, 332, 1208, 17332, 258, 2293, 2428]
Sal|ut| la| bio|in|format|ique


The word `bioinformatics` is expressed in 2 tokens whereas its French equivalent (`bioinformatique`) is made of 4 tokens.

Tokenizers are generally optimized for English, so equivalent sentences in other languages usually take more tokens. This difference matters because LLM APIs are usually priced per (million) tokens.
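To make the difference concrete, here is a minimal sketch that computes the token overhead of the French sentence relative to the English one. The token counts are copied from the outputs above; the helper name `token_overhead` is hypothetical and only used for illustration.

```python
# Token counts taken from the outputs above (GPT-4 encoding).
token_counts = {
    "Hello bioinformatics": 3,
    "Salut la bioinformatique": 7,
}


def token_overhead(n_tokens: int, n_tokens_reference: int) -> float:
    """Return how many times more tokens a text needs than a reference text."""
    if n_tokens_reference <= 0:
        raise ValueError("reference token count must be positive")
    return n_tokens / n_tokens_reference


english = token_counts["Hello bioinformatics"]
french = token_counts["Salut la bioinformatique"]
ratio = token_overhead(french, english)
print(f"The French sentence needs {ratio:.2f}x the tokens of the English one.")
```

With per-token pricing, the cost scales by the same factor. A single short sentence is not representative, so this ratio should be read as an illustration rather than a general estimate.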

### Explore by yourself

Compare tokenization of other sentences or words in different languages (English, French, Italian, Russian, Chinese...).

## Embeddings

[sentence-transformers](https://github.com/UKPLab/sentence-transformers) is a Python library for computing sentence embeddings. This library works fully locally but requires an internet connection to download embedding models.

The documentation is available [here](https://www.sbert.net/).

In [7]:
from sentence_transformers import SentenceTransformer

  from tqdm.autonotebook import tqdm, trange


*The warning message in red is OK.*

### (Down)Load model

We load the [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) embedding model. It is a rather *small* embedding model (109M, i.e. 109 million, parameters) hosted on [Hugging Face](https://huggingface.co/). It converts any input text into a vector of 768 dimensions.

In [8]:
model_name = "all-mpnet-base-v2"

The first time the model is called, it will be downloaded locally. This can take some time and disk space (about 419 MB). By default, models are stored in `$HOME/.cache/huggingface/`. If you're using university computers, to avoid overloading your HOME directory and also the NFS server that supports it, we specify where models are stored locally:

In [9]:
import os
import socket
username = os.environ["USER"]
hostname = socket.gethostname()
print(f"This code is running on computer {hostname} with user {username}")

This code is running on computer vanille with user pierre


In [10]:
if hostname.startswith("lk"):
    # University computers
    model = SentenceTransformer(
        model_name,
        cache_folder=f"/scratch/{username}/huggingface/hub"
    )
else:
    # Personal computers
    model = SentenceTransformer(model_name)

This is the kind of output you could expect to get while downloading the model:

```
modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]
config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]
README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]
sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]
config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]
model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]
tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]
vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]
special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]
1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]
```

### Basic example

Here are a few sentences to play with:

In [11]:
sentences = [
    "DNA carries genetic information in cells.",
    "Proteins are made up of chains of amino acids.",
    "DNA encodes the sequence of residues.",
    "RNA is a type of nucleic acid."
]

We get the embeddings for each sentence. Each embedding is a vector of 768 dimensions.

In [12]:
embeddings = model.encode(sentences)
print(embeddings.shape)
print(embeddings)

(4, 768)
[[-0.01080427  0.02138491 -0.03320902 ... -0.04414995 -0.02035814
  -0.0343384 ]
 [ 0.05468469 -0.02634791  0.01001398 ... -0.07231469 -0.00555746
  -0.0263908 ]
 [ 0.01480863 -0.00215497 -0.02918642 ...  0.00138177  0.01167384
  -0.05353284]
 [ 0.06408021 -0.03017923  0.0217301  ... -0.02100319 -0.05333204
  -0.05483887]]


Get similarity between all embeddings:

In [13]:
similarities = model.similarity(embeddings, embeddings)
print(similarities)

tensor([[1.0000, 0.3202, 0.6056, 0.4671],
        [0.3202, 1.0000, 0.4517, 0.4007],
        [0.6056, 0.4517, 1.0000, 0.4412],
        [0.4671, 0.4007, 0.4412, 1.0000]])


We obtain a 4 x 4 square matrix. The diagonal contains only 1s because each sentence is identical to itself.

Remark: Because this model returns unit-length embeddings, the inner product also gives the cosine similarity:

```python
import numpy as np
similarities = np.inner(embeddings, embeddings)
```
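The inner product equals the cosine similarity only when every vector has unit length. The sketch below illustrates this with toy 2D vectors standing in for real embeddings (the values and the helper name `cosine_similarity_matrix` are illustrative assumptions, not part of the practical). On real embeddings, `np.linalg.norm(embeddings, axis=1)` is one way to check whether the vectors are already normalized.

```python
import numpy as np


def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_vectors = vectors / norms
    return unit_vectors @ unit_vectors.T


# Toy vectors standing in for real embeddings (illustrative values only).
toy = np.array([[3.0, 4.0], [4.0, 3.0], [-3.0, -4.0]])
cos = cosine_similarity_matrix(toy)
print(cos)

# The plain inner product matches only once the rows have unit length.
unit = toy / np.linalg.norm(toy, axis=1, keepdims=True)
print(np.allclose(np.inner(unit, unit), cos))  # True
print(np.allclose(np.inner(toy, toy), cos))    # False
```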

We now display the most similar sentence for a given sentence:

In [14]:
import numpy as np
for idx, sentence in enumerate(sentences):
    # Discard similarity for the sentence itself.
    # The score of 1 is replaced by -1.
    similarities[idx][idx] = -1
    # Find index of the most similar sentence.
    most_similar_idx = np.argmax(similarities[idx])
    print(f"Original sentence    : {sentence}")
    print(f"Most similar sentence: {sentences[most_similar_idx]}\n")

Original sentence    : DNA carries genetic information in cells.
Most similar sentence: DNA encodes the sequence of residues.

Original sentence    : Proteins are made up of chains of amino acids.
Most similar sentence: DNA encodes the sequence of residues.

Original sentence    : DNA encodes the sequence of residues.
Most similar sentence: DNA carries genetic information in cells.

Original sentence    : RNA is a type of nucleic acid.
Most similar sentence: DNA carries genetic information in cells.
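
The loop above overwrites the diagonal of `similarities` in place. As a possible alternative, the sketch below masks the diagonal on a copy and finds the best match for every row in one call. It uses the similarity values printed earlier, retyped as a NumPy array; with the real output of `model.similarity` (a PyTorch tensor), a conversion such as `similarities.numpy()` would presumably be needed first.

```python
import numpy as np

# Similarity matrix copied from the output above.
similarities = np.array([
    [1.0000, 0.3202, 0.6056, 0.4671],
    [0.3202, 1.0000, 0.4517, 0.4007],
    [0.6056, 0.4517, 1.0000, 0.4412],
    [0.4671, 0.4007, 0.4412, 1.0000],
])

# Work on a copy so the original matrix keeps its diagonal of 1.
masked = similarities.copy()
np.fill_diagonal(masked, -np.inf)

# For each row, index of the highest remaining score.
most_similar = masked.argmax(axis=1)
print(most_similar.tolist())  # [2, 2, 0, 0]
```

The indices agree with the sentences printed above: sentences 0 and 1 are closest to sentence 2, while sentences 2 and 3 are closest to sentence 0.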



What do you think of these results? Do you agree with the most similar sentences?

### Try by yourself

Use different sentences and compare the similarities between them.

### Other models

The `all-mpnet-base-v2` model takes a maximum of 384 tokens as input.
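
Text longer than this limit is, as far as we know, truncated rather than rejected, so the end of a long document would not contribute to its embedding. One common workaround, not covered in this practical, is to split the tokens into chunks and embed each chunk separately. Here is a minimal sketch, assuming the text has already been tokenized with the model's own tokenizer; `chunk_tokens` is a hypothetical helper, and a real chunk size would need to be slightly smaller to leave room for special tokens.

```python
def chunk_tokens(tokens: list[int], max_tokens: int = 384) -> list[list[int]]:
    """Split a token list into consecutive chunks of at most `max_tokens`."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


# Dummy token ids standing in for a long tokenized document.
dummy_tokens = list(range(1000))
chunks = chunk_tokens(dummy_tokens)
print([len(chunk) for chunk in chunks])  # [384, 384, 232]
```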

Larger embedding models are openly available, such as [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct):
- 1.78B (1.78 billion) parameters (about 7 GB of model data to download)
- embedding vector with 1,536 dimensions
- max input tokens: 32k

**If you want to use this more powerful model, be aware it will take some time to download onto your machine.**

For comparison, here is a list of commercial embedding models provided by [OpenAI](https://openai.com/api/pricing/):


| Model                    | Description                                                                       | Max tokens | Output dimensions | Price (US$ / 1M tokens) |
| ------------------------ | --------------------------------------------------------------------------------- | ---------- | ----------------- | ----------------------- |
| `text-embedding-3-large` | Most capable embedding model for both English and non-English tasks               | 8191       | 3,072             | 0.13                    |
| `text-embedding-3-small` | Increased performance over 2nd generation ada embedding model                     | 8191       | 1,536             | 0.02                    |
| `text-embedding-ada-002` | Most capable 2nd generation embedding model, replacing 16 first generation models | 8191       | 1,536             | 0.10                    |
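
To get a feel for these prices, the sketch below estimates the cost of embedding a corpus with each model. The prices are copied from the table above and may have changed since; the corpus size (10,000 abstracts of about 300 tokens each) and the helper name `embedding_cost_usd` are hypothetical.

```python
# Prices in US$ per 1M tokens, copied from the table above.
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-large": 0.13,
    "text-embedding-3-small": 0.02,
    "text-embedding-ada-002": 0.10,
}


def embedding_cost_usd(n_tokens: int, model: str) -> float:
    """Estimate the cost of embedding `n_tokens` tokens with `model`."""
    return n_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]


# Hypothetical corpus: 10,000 abstracts of about 300 tokens each.
corpus_tokens = 10_000 * 300
for model in PRICE_PER_MILLION_TOKENS:
    print(f"{model}: ${embedding_cost_usd(corpus_tokens, model):.2f}")
```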
