# Understanding Tokenization

Hello and welcome! This notebook will give you a hands-on feel for **tokenization**, an important concept when working with large language models (LLMs).

As we discussed, tokenization is the process of breaking text down into the smaller units, or "tokens," that an LLM can actually understand. These tokens can be words, parts of words, or even just punctuation and spaces.

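Understanding how your text is converted into tokens is a key part of prompt engineering. It helps explain:

- **Cost:** Why some prompts are more expensive than others. API usage is typically billed per token, not per word or character.
- **Context Windows:** Why a model's finite "memory" is measured in tokens, so the same limit holds more of some texts than others.
- **Model Behavior:** Why a small change in spelling, casing, or spacing can change the tokens the model sees, and with them the result.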

In this notebook, we'll use `tiktoken`, OpenAI's official tokenizer library, to see exactly how this process works.

## Loading the Right Tokenizer

Different models use different tokenization rules. Therefore, the first step is always to load the specific tokenizer that corresponds to the model you intend to use. Since we will be using `gpt-4o-mini` in our future examples, let's load its tokenizer.

> Best Practice: Always match your tokenizer to your model to get an accurate token count and representation. The `tiktoken.encoding_for_model()` function makes this easy.

## Tokenizing a Simple Sentence

Let's start with a basic example. We will take a simple English sentence and see how the tokenizer breaks it down. The `.encode()` method converts our human-readable string into a list of integers, where each integer is a unique ID for a specific token.

We can then decode these integers one by one to see exactly what text each token represents.

## Tokenizing Complex or Uncommon Words

Now, what happens with a word that might not be in the tokenizer's vocabulary as a single token, like "Tokenization"? Instead of failing, the tokenizer breaks it down into smaller sub-word pieces that are in the vocabulary. Because the vocabulary falls back to individual bytes, any string can be encoded, no matter how unusual the word.

This is a critical concept: **one word does not always equal one token**.

## Tokenizing Code

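Finally, let's see how tokenization applies to computer code. The process is exactly the same: the tokenizer has no notion of syntax and simply applies its rules to the raw text. Keywords, variable names, operators, and whitespace (including indentation and newlines) all become tokens, and a long identifier may be split into several pieces.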