# Tokenizer with Python

## Setup

Please check that you have configured your environment properly with uv (see [setup](../setup.md)).

## Tokenizer

[tiktoken](https://github.com/openai/tiktoken) is an open-source Python library developed by OpenAI to tokenize text. Tokenization itself runs locally; an internet connection is only needed the first time an encoding is loaded, when its vocabulary file is downloaded and cached.

## Load model

In [1]:
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")

## First example

Get the token IDs of a simple sentence:

In [2]:
tokens = enc.encode("Hello world")
print(tokens)

[9906, 1917]


Visualize tokens by separating them with `|`:

In [3]:
print("|".join([ enc.decode([tok]) for tok in tokens]))

Hello| world


The first token is `Hello` and the second is ` world` (with a leading space before `world`).

## Hello bioinformatics

Let's try with another sentence:

In [4]:
tokens = enc.encode("Hello bioinformatics")
print(tokens)
print("|".join([ enc.decode([tok]) for tok in tokens]))

[9906, 17332, 98588]
Hello| bio|informatics


This time we get 3 tokens.

Here is the same sentence in a different language:

In [5]:
tokens = enc.encode("Salut la bioinformatique")
print(tokens)
print("|".join([ enc.decode([tok]) for tok in tokens]))

[17691, 332, 1208, 17332, 258, 2293, 2428]
Sal|ut| la| bio|in|format|ique


The English word `bioinformatics` is split into 2 tokens, whereas its French equivalent (`bioinformatique`) takes 4 tokens.

Tokenizers such as this one are optimized mainly for the English language, so equivalent sentences in other languages usually take more tokens. This difference matters because the cost of using LLM APIs is usually charged per (million) tokens.
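To get a feeling for what this means in practice, here is a minimal sketch that turns the token counts of the two sentences above into a cost comparison. The helper `estimate_cost` and the price `PRICE_PER_MILLION` are assumptions made up for illustration, not real API pricing.

```python
def estimate_cost(n_tokens, price_per_million):
    """Cost of processing n_tokens at a given price per million tokens."""
    return n_tokens / 1_000_000 * price_per_million


# Token counts taken from the two examples above.
english_tokens = 3  # "Hello bioinformatics"
french_tokens = 7   # "Salut la bioinformatique"

# Hypothetical price, for illustration only (not a real tariff).
PRICE_PER_MILLION = 10.0

# How many more tokens the French sentence needs.
ratio = french_tokens / english_tokens
print(f"French/English token ratio: {ratio:.2f}")

# Scale each sentence up to one million similar sentences.
n_sentences = 1_000_000
for label, count in [("English", english_tokens), ("French", french_tokens)]:
    cost = estimate_cost(count * n_sentences, PRICE_PER_MILLION)
    print(f"{label}: {cost:.2f}")
```

Whatever the actual price is, the cost ratio between the two languages stays equal to the token ratio.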

### Explore by yourself

Compare the tokenization of other sentences or words in different languages (English, French, Italian, Russian, Chinese...).