In this notebook, we explore how text containing multiple words or tokens is encoded into an embedding vector using the popular sentence_transformers library. We will look at:

- OpenAI embeddings
- Open source encoder input embeddings
- Open source encoder output embedding (with context)
- Improved encoder for queries and documents (bi-encoder)

In [1]:
# Define a rich theme for better object printing

from rich.console import Console
from rich_theme_manager import Theme, ThemeManager
import pathlib
theme_dir = pathlib.Path("themes")
theme_manager = ThemeManager(theme_dir=theme_dir)
dark = theme_manager.get("dark")

# Create a console with the dark theme
console = Console(theme=dark)

import warnings

# Suppress warnings
warnings.filterwarnings('ignore')

In [3]:
first_sentence = "I have no interest in politics"

from dotenv import load_dotenv

load_dotenv()

# from openai import OpenAI
# client = OpenAI()

# response = client.embeddings.create(
#     input=first_sentence,
#     model="text-embedding-3-small"
# )

# console.print(response)

False

We will use the model's default tokenizer. Every word or subword is mapped to a token with a fixed ID. For example, in the following two sentences, the word "interest" is tokenized to the same ID (3037), even though it carries a different meaning in each sentence.

In [4]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

first_sentence = "I have no interest in politics"
second_sentence = "The bank's interest rate rises"

tokenized_first_sentence = model.tokenize([first_sentence])
console.rule(f"{first_sentence}")
console.print(tokenized_first_sentence)

In [None]:
tokenized_second_sentence = model.tokenize([second_sentence])
console.rule(f"{second_sentence}")
console.print(tokenized_second_sentence)

Token IDs can be converted back into readable tokens:

In [6]:

sentence_tokens = model.tokenizer.convert_ids_to_tokens(tokenized_second_sentence["input_ids"][0])

console.print(sentence_tokens)
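
To make the fixed-ID idea concrete without downloading a model, here is a minimal sketch of greedy longest-match subword tokenization in the WordPiece style. Everything in it is an assumption for illustration: `TOY_VOCAB`, its IDs, and the helpers `wordpiece`, `encode`, and `decode` are hypothetical and do not match the real all-MiniLM-L6-v2 vocabulary. The second sentence drops the apostrophe so that plain whitespace splitting is enough.

```python
# Hypothetical toy vocabulary: the IDs are made up for illustration only
# and do NOT correspond to the real all-MiniLM-L6-v2 tokenizer.
TOY_VOCAB = {
    "[UNK]": 0, "i": 1, "have": 2, "no": 3, "interest": 4, "in": 5,
    "politics": 6, "the": 7, "bank": 8, "rate": 9, "rise": 10, "##s": 11,
}
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}


def wordpiece(word, vocab=TOY_VOCAB):
    """Split one lowercase word into subwords, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            # Pieces that continue a word carry the "##" prefix
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


def encode(sentence):
    """Lowercase, split on whitespace, and map every subword to its ID."""
    ids = []
    for word in sentence.lower().split():
        ids.extend(TOY_VOCAB[piece] for piece in wordpiece(word))
    return ids


def decode(ids):
    """Map IDs back to their token strings."""
    return [ID_TO_TOKEN[token_id] for token_id in ids]


first_ids = encode("I have no interest in politics")
second_ids = encode("The bank interest rate rises")

print(first_ids)
print(second_ids)
print(decode(second_ids))
```

In this sketch, "interest" receives the same ID in both sentences because the lookup depends only on the string, not on the surrounding words. The word "rises" is split into `rise` and `##s` only because the toy vocabulary lacks the full word; a real tokenizer may well keep it whole.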


In [7]:
# Token -> ID pairs from the model's tokenizer
vocabulary = model.tokenizer.get_vocab().items()

console.print("[bold]Vocabulary size[/bold]:", len(vocabulary))
console.print(dict(list(vocabulary)[:20]))

Let's look at part of the tokenizer vocabulary. We will find the token "interest" and list its neighbors in ID order.

In [8]:
# Sort the (token, id) pairs by ID
sorted_vocabulary = sorted(vocabulary, key=lambda item: item[1])
sorted_tokens = [token for token, _ in sorted_vocabulary]

focused_token = "interest"
# Find the position of the 'interest' token in the ID-sorted list
focused_index = sorted_tokens.index(focused_token)
# Take up to 10 tokens on each side of the focused token, clamped to the list bounds
start_index = max(0, focused_index - 10)
end_index = min(len(sorted_tokens), focused_index + 11)
tokens_around_focused_index = sorted_tokens[start_index:end_index]

from rich.table import Table

table = Table(title=f"Tokens around '{focused_token}':")
table.add_column("id", justify="right", style="cyan", no_wrap=True)
table.add_column("token", style="bright_green")

# Show the real token IDs rather than list positions, in case the IDs are not contiguous
for token, token_id in sorted_vocabulary[start_index:end_index]:
    if token == focused_token:
        table.add_row(f"[bold][black on yellow]{token_id}[/black on yellow][/bold]", f"[bold][black on yellow]{token}[/black on yellow][/bold]")
    else:
        table.add_row(str(token_id), token)

console.print(table)