# Introducing Vocabularies
In this notebook, we look at vocabularies. A vocabulary is essential to a large language model because it maps pieces of text to numbers. Models do not work with text directly, so the text is first split into tokens, often at whitespace and possibly at other characters. The vocabulary then translates each token into a number. An LLM does not know every word in each language it supports, which becomes a problem when you work in a domain with very specific terms.
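
To make the mapping from tokens to numbers concrete, here is a minimal sketch with a toy word-level vocabulary. Everything in it (`toy_vocabulary`, `toy_tokenize`, `toy_encode`, the `<unk>` entry) is hypothetical and only for illustration. The subword vocabularies loaded later in this notebook work differently: they break an unfamiliar word into smaller known pieces instead of mapping it to a single unknown id.

```python
import re

# Toy word-level vocabulary: every known token maps to an integer id.
# All names and ids here are made up for illustration.
UNKNOWN = "<unk>"
toy_vocabulary = {UNKNOWN: 0, "large": 1, "language": 2, "models": 3, ":": 4,
                  "size": 5, "does": 6, "matter": 7}


def toy_tokenize(text):
    """Split lowercased text into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def toy_encode(text, vocabulary):
    """Translate each token into its id; unknown tokens fall back to <unk>."""
    return [vocabulary.get(token, vocabulary[UNKNOWN]) for token in toy_tokenize(text)]


print(toy_encode("Large Language Models: size does matter", toy_vocabulary))
# [1, 2, 3, 4, 5, 6, 7]
print(toy_encode("Large Language Models: perplexity does matter", toy_vocabulary))
# [1, 2, 3, 4, 0, 6, 7]  ("perplexity" is not in the toy vocabulary)
```

The second call shows the problem with domain-specific terms: a word outside the vocabulary loses its identity as soon as it is turned into a number.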

## Loading the vocabulary
Using the tiktoken library, we load one of the OpenAI vocabularies that the library provides and print its size.

In [None]:
import tiktoken
from understanding_llm import pretty_print_bytes, find_matching_tokens, interactive_mode

# encoding = tiktoken.get_encoding("cl100k_base")
encoding = tiktoken.get_encoding("p50k_base")
vocabulary = encoding.token_byte_values()

print(f'Size of the vocabulary: {encoding.n_vocab}')

## Encode the text using the encoder
Use the loaded encoder to turn each string into a list of integers, where each integer identifies a token in the vocabulary.

In [None]:
text_en = "Large Language Models: size does matter"
text_nl = "Grote Taal Modellen: de grootte is belangrijk"

encoded_en = encoding.encode(text_en)
encoded_nl = encoding.encode(text_nl)

## Print the results
We use a helper function to print the tokens in a readable form so you can recognise them, followed by the number of tokens in each sentence.

In [None]:
pretty_print_bytes(encoded_en, encoding)
pretty_print_bytes(encoded_nl, encoding)
print(f"# tokens en: {len(encoded_en)}\n# tokens nl: {len(encoded_nl)}")
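
The helper `pretty_print_bytes` comes from this notebook's own `understanding_llm` module, and its implementation is not shown here. As a rough idea of what such a helper could do, the sketch below looks up the bytes behind each token id and formats them for reading. The function `describe_tokens` and the `demo_table` are hypothetical, and the ids in the table are made up so the example runs without downloading an encoding.

```python
def describe_tokens(token_ids, decode_single_token_bytes):
    """Return one readable line per token: its id and the bytes it stands for."""
    lines = []
    for token_id in token_ids:
        token_bytes = decode_single_token_bytes(token_id)
        # errors="replace" keeps partial UTF-8 sequences from raising.
        readable = token_bytes.decode("utf-8", errors="replace")
        lines.append(f"{token_id:>6}  {token_bytes!r:<14} {readable!r}")
    return lines


# Stand-in lookup table; the ids and byte values are invented for illustration.
demo_table = {10: b"Large", 11: b" Language", 12: b" Models"}

for line in describe_tokens([10, 11, 12], demo_table.__getitem__):
    print(line)
```

With a loaded tiktoken encoding, the same function should work as `describe_tokens(encoded_en, encoding.decode_single_token_bytes)`. Note how the leading space belongs to the token itself, which is why whitespace matters when counting tokens.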

## Searching the vocabulary
Because the vocabulary is loaded, we can also search it. Below, we take a few characters and find the vocabulary entries that contain them.

In [None]:
find_matching_tokens('side', vocabulary)
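
The implementation of `find_matching_tokens` lives in the `understanding_llm` module and is not shown in this notebook. A search like this could be sketched as a plain substring test over the byte values of the vocabulary. The function `search_vocabulary` and the small `toy_byte_vocabulary` below are hypothetical and exist only to illustrate the idea.

```python
def search_vocabulary(fragment, vocabulary):
    """Return every vocabulary entry (as bytes) that contains the fragment."""
    needle = fragment.encode("utf-8")
    return [token_bytes for token_bytes in vocabulary if needle in token_bytes]


# Invented entries; a real vocabulary holds tens of thousands of byte strings.
toy_byte_vocabulary = [b"side", b" inside", b"sides", b" size", b"consider", b"Side"]

print(search_vocabulary("side", toy_byte_vocabulary))
# [b'side', b' inside', b'sides', b'consider']
```

The match is case-sensitive, so `b"Side"` is not returned, and it also finds the fragment in the middle of an entry such as `b"consider"`.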

Below is a form that you can use to find the tokens that contain the characters you type. The matching tokens update with every character you type.

In [None]:
interactive_mode(encoding, vocabulary)