# Chapter 1 - Tokenizer and Sampler

In this chapter we focus on the outermost layers of a Transformer-based language model. The core of the model is a deep neural network called a Transformer, but your direct interface during text generation is through its input and output.

We start with the input and output because they help visualize the end-to-end process. Regardless of how the internal computation works, you ultimately feed text in and receive generated text back. Input and output are not trivial with LMs, unlike simple stdin/stdout, so we explore them first.

As the first step of our journey, we therefore look more closely at how these input and output steps work in an LM.

## Goals

- Understand how the input of an LM, i.e. the tokenizer, works  
  - Keywords: Byte Pair Encoding (BPE)
- Understand how the output of an LM, i.e. the sampler, works  
  - Keywords: Softmax, Temperature scaling, Top-K sampling, Top-P sampling


## Tokenizer

When you enter text into ChatGPT, that text isn't how the LLM sees your request because a Transformer network can only understand numbers. We need to translate the text into a sequence of integers. This process is called *tokenization* and the component doing it is the *tokenizer*.

### Naive ways to tokenize a text

The most naive way to tokenize arbitrary text is splitting by bytes. Because plain text is a sequence of bytes regardless of encoding, you can easily convert any text into a sequence of bytes (0-255).
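
As a minimal sketch of byte-level tokenization, assuming the text is UTF-8 encoded (the helper name `byte_tokenize` is ours, not part of any library):

```python
def byte_tokenize(text: str) -> list[int]:
    # Each UTF-8 byte (0-255) becomes one token id.
    return list(text.encode("utf-8"))

print(byte_tokenize("time"))  # [116, 105, 109, 101]: four tokens for one word
print(byte_tokenize("é"))     # [195, 169]: one character, two bytes
```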

The problem with this approach is the length of the token sequence. We'll see in later chapters that sequence length is one of the most expensive resources of a Transformer network. Because byte-level tokenization has only 256 possible tokens (the number of possible tokens is called the *vocabulary size*), its sequences end up longer than those of methods with a much larger vocabulary.

How can we increase the vocabulary size and reduce the sequence length? Another naive approach is to use words as tokens. The word `time` then becomes a single token instead of the four tokens that byte-based tokenization needs. However, some languages don't have an easy way to split text into words, and this method isn't resilient to typos: a misspelled word is simply missing from the vocabulary, whereas modern LMs can understand text even with misspellings.
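
To make the typo problem concrete, here is a toy word-level tokenizer. The three-entry vocabulary and the `<unk>` fallback token are assumptions made up for this illustration:

```python
# Hypothetical tiny vocabulary; a real word-level vocabulary would be far larger.
WORD_VOCAB = {"<unk>": 0, "time": 1, "flies": 2}

def word_tokenize(text: str) -> list[int]:
    # Split on whitespace and map unknown words to the <unk> id.
    return [WORD_VOCAB.get(word, WORD_VOCAB["<unk>"]) for word in text.split()]

print(word_tokenize("time flies"))  # [1, 2]
print(word_tokenize("tiem flies"))  # [0, 2]: the typo collapses to <unk>
```

The misspelled `tiem` loses all of its information, while a byte-level tokenizer would still see the individual letters.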

### Byte Pair Encoding (BPE)

Byte Pair Encoding is a popular method to tokenize text into an arbitrary vocabulary size. We won't dive deep into the implementation details, but the core idea is:

- **Training**
  1. Split many texts into bytes (like the naive method above)
  2. Find the most frequent pair of adjacent tokens (initially bytes) in the entire corpus
  3. Add that pair to the vocabulary as a new token
  4. Replace all occurrences of that pair with the new token
  5. Repeat steps 2-4 until the desired vocabulary size is reached
- **Tokenizing**
  1. Split the text into bytes
  2. Apply the merges in the same order they were learned during training

BPE is a language-agnostic algorithm, but the vocabulary it learns depends on the training data. In the worst case, it falls back to tokenizing completely unseen text as individual bytes, so it can always produce a result.

Note: the training mentioned here is not training a neural network. It learns new tokens and their merge order from a dataset. The learned vocabulary and merges are usually distributed with the Transformer model itself because the two are tightly coupled.
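
The training and tokenizing steps of BPE can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation used by production tokenizers: the function names (`train_bpe`, `merge_pair`, `encode`) are our own, ties between equally frequent pairs go to the pair seen first, and the pre-splitting that real tokenizers usually apply before merging is omitted.

```python
from collections import Counter

def merge_pair(seq: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    # Replace every non-overlapping occurrence of `pair` with `new_id`.
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bpe(corpus: list[str], vocab_size: int) -> list[tuple[int, int]]:
    # Step 1: every text starts as a list of byte values (ids 0-255).
    sequences = [list(text.encode("utf-8")) for text in corpus]
    merges: list[tuple[int, int]] = []
    next_id = 256  # the first id after the 256 byte tokens
    while next_id < vocab_size:
        # Step 2: count adjacent pairs across the whole corpus.
        counts: Counter = Counter()
        for seq in sequences:
            counts.update(zip(seq, seq[1:]))
        if not counts:
            break  # nothing left to merge
        best = max(counts, key=counts.get)
        # Steps 3 and 4: register the new token and replace the pair everywhere.
        merges.append(best)
        sequences = [merge_pair(seq, best, next_id) for seq in sequences]
        next_id += 1
    return merges

def encode(text: str, merges: list[tuple[int, int]]) -> list[int]:
    # Tokenizing: start from bytes and replay the merges in training order.
    seq = list(text.encode("utf-8"))
    for offset, pair in enumerate(merges):
        seq = merge_pair(seq, pair, 256 + offset)
    return seq

merges = train_bpe(["low low low lower"], vocab_size=258)
print(encode("low", merges))  # [257]: "low" became a single token
print(encode("xyz", merges))  # [120, 121, 122]: unseen text stays as bytes
```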

### Visualize tokenizer

You can visualize the tokenization [here](https://tiktokenizer.vercel.app/?model=Qwen%2FQwen2.5-72B). `Hello` and ` World` (note the leading space) are tokenized to `9707` and `4337`. `Learn` and ` Transformer` become `23824` and `62379`, and `198` represents `\n`.

Lastly, `<|endoftext|>` is a special token indicating the end of text generation and its token id is `151643`. This special token was inserted into the training dataset of the Transformer so the model learns to emit it when generation should stop.

![Tiktokenizer](./tiktokenizer.png)


## Coding

Now let's implement a simple tokenization process to see how it works with PyTorch.

We use the `transformers` package provided by Hugging Face. This package already contains many pretrained models and tokenizers. While we build the model network using PyTorch primitives, we simply use the pretrained tokenizer for the Qwen3 model as is. If you're interested in implementing BPE yourself, I highly recommend Stanford's CS336 Assignment 1.


First, load the pretrained tokenizer for the Qwen3 model. It has `151,669` vocabulary tokens, which is much larger than the `256` tokens used in byte-based tokenization.


```python
from transformers import AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Base vocabulary plus the special tokens added on top of it.
vocab_size = tokenizer.vocab_size + len(tokenizer.added_tokens_decoder)
print("vocabulary size:", vocab_size)
```

```
vocabulary size: 151669
```


Next, add a utility function to tokenize a text into a one-dimensional PyTorch tensor.


```python
from torch import Tensor
from jaxtyping import Int64

def tokenize(text: str) -> Int64[Tensor, "seq_len"]:
    return tokenizer(text, return_tensors="pt")["input_ids"][0]
```

The implementation looks a bit awkward because the tokenizer returns a batch and `[0]` selects the only sequence in it, but the output is simple: a 1-D tensor of `int64` token IDs.

You can see the output tokens match the ones shown above when visualizing the tokenization because both use the same tokenizer.


```python
Hello_World = tokenize("Hello World\n")
Learn_Transformer = tokenize("Learn Transformer\n")
EndToken = tokenize("<|endoftext|>")

print(Hello_World, "Hello World\\n")
print(Learn_Transformer, "Learn Transformer\\n")
print(EndToken, "<|endoftext|>")
```

```
tensor([9707, 4337,  198]) Hello World\n
tensor([23824, 62379,   198]) Learn Transformer\n
tensor([151643]) <|endoftext|>
```


Now we understand what the raw input to a Transformer-based LM looks like.

During training the model sees pairs of an input sequence and its expected next token:
1. `[..., 9707]` ("…Hello")
   - Expected next token: `4337` (" World")
2. `[..., 23824]` ("…Learn")
   - Expected next token: `62379` (" Transformer")
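
Put differently, every prefix of a token sequence yields one such pair. A small sketch of that idea (the helper name `next_token_pairs` is hypothetical):

```python
def next_token_pairs(tokens: list[int]) -> list[tuple[list[int], int]]:
    # For each position i, the context is tokens[:i] and the target is tokens[i].
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

hello_world = [9707, 4337, 198]  # "Hello", " World", "\n"
for context, target in next_token_pairs(hello_world):
    print(context, "->", target)
# [9707] -> 4337
# [9707, 4337] -> 198
```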

During inference (text generation) we iteratively feed the predicted token back into the model to generate the next one.
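
The generation loop can be sketched as follows. `toy_model` is a stand-in invented for this example: it looks the next token up in a fixed table instead of running a Transformer, and the loop stops at the `<|endoftext|>` id or after a maximum number of steps.

```python
END_TOKEN = 151643  # <|endoftext|>

# Hypothetical lookup table standing in for a real model's prediction.
NEXT_TOKEN = {9707: 4337, 4337: 198, 198: END_TOKEN}

def toy_model(tokens: list[int]) -> int:
    # A real model would look at the whole sequence; this toy uses the last token only.
    return NEXT_TOKEN.get(tokens[-1], END_TOKEN)

def generate(prompt: list[int], max_new_tokens: int = 10) -> list[int]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = toy_model(tokens)
        if next_token == END_TOKEN:
            break  # the model signalled that generation is finished
        tokens.append(next_token)  # feed the prediction back as input
    return tokens

print(generate([9707]))  # [9707, 4337, 198]
```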


## Sampler

On the other side of a Transformer-based language model we need a way to translate the predicted tokens back into text, just like in ChatGPT's responses. This last-mile translation from token id to text is straightforward because we have a full mapping, though we must handle multi-byte characters correctly.
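
One reason multi-byte characters need care: with a byte-based vocabulary, a single character can be split across two tokens, and decoding each token's bytes on its own would fail. The sketch below uses Python's standard incremental UTF-8 decoder; the two byte strings are assumed stand-ins for the raw bytes of two consecutive tokens.

```python
import codecs

# Together these two chunks are the UTF-8 encoding of "é".
token_bytes = [b"\xc3", b"\xa9"]

decoder = codecs.getincrementaldecoder("utf-8")()
# The decoder buffers the incomplete first byte and emits text once it is complete.
pieces = [decoder.decode(chunk) for chunk in token_bytes]
print(pieces)  # ['', 'é']
```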

However, we first have to decide which single token to produce from the model's output. The Transformer's final output is a vector of scores (called *logits*) over the entire vocabulary, not a single token id. Each element represents the model's confidence in that token given the input text and the text generated so far, so we need to pick exactly one token from this distribution.

### Naive method

The simplest method is `argmax`: because a higher value means a more likely next token, we simply pick the token with the highest value. However, always taking the top token makes generation deterministic and tends to produce repetitive text, and the top-ranked token is not always the best choice, especially when the model is small. So an `argmax` choice isn't always ideal.
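
A plain-Python sketch of the `argmax` choice over a toy logits list (the four values are made up for illustration):

```python
def argmax(logits: list[float]) -> int:
    # Index of the largest logit; the first one wins on ties.
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [0.1, 2.5, -1.0, 2.4]
print(argmax(logits))  # 1
```

Token `3` is almost as likely as token `1` here, yet `argmax` will never pick it.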

### Softmax sampling

Instead, we can treat this vector as a probability distribution and sample from it. Because the raw output may contain arbitrary values, including negatives or infinities, we normalize it first. `softmax` is the most common method to convert logits into probabilities between 0 and 1 that sum to 1. After applying `softmax` we can use `multinomial` sampling to pick one token.
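
To show the arithmetic without any library, here is a plain-Python sketch of `softmax` followed by sampling. It uses `random.choices` from the standard library in place of `multinomial`, and the four logits are invented for the example.

```python
import math
import random

def softmax(logits: list[float]) -> list[float]:
    # Subtracting the max logit avoids overflow and does not change the result.
    # Assumes at least one logit is finite.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, -1.0, float("-inf")]
probs = softmax(logits)
print([round(p, 3) for p in probs])  # the -inf entry gets probability 0.0

# Draw one token id according to the probabilities.
token_id = random.choices(range(len(probs)), weights=probs, k=1)[0]
print(token_id)
```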

### Temperature scaling

Softmax sampling is the basis of many sampling schemes, but techniques exist to adjust its behavior. One is temperature scaling, where we divide the logits by a temperature value (`logits / temperature`) before applying `softmax`. A lower temperature sharpens the distribution and reduces the variety of sampled tokens; as the temperature approaches zero, sampling becomes almost identical to `argmax`. A higher temperature flattens the distribution and makes lower-probability tokens more likely. High temperature is often described as more creative, while low temperature is more conservative.
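
The effect of the temperature can be checked with a small plain-Python sketch. The `temperature` parameter of this `softmax` helper is our own addition, and the three logits are made up:

```python
import math

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    # Divide by the temperature first, then apply a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax(logits, t)])
```

With these logits the top token's probability falls from roughly 0.87 at temperature 0.5 to roughly 0.67 at 1.0 and 0.51 at 2.0.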

### Top-K sampling

Although a high temperature can generate creative outputs, we typically want to avoid completely irrelevant tokens. Top-K sampling keeps only the K tokens with the highest logits and discards the rest before sampling. This fixes the number of candidates and maintains some randomness while avoiding extremely low-probability tokens.
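
A minimal plain-Python sketch of the filtering step, assuming we mask discarded tokens with `-inf` so that a later `softmax` assigns them probability 0 (the helper name `top_k_filter` is hypothetical):

```python
def top_k_filter(logits: list[float], k: int) -> list[float]:
    # The k-th largest logit is the cut-off; everything below it is masked out.
    # Note: if several logits tie with the cut-off, more than k tokens survive.
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]

print(top_k_filter([1.0, 3.0, 2.0, 0.5], k=2))  # [-inf, 3.0, 2.0, -inf]
```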

### Top-P sampling

Top-P (also called nucleus sampling) is another filtering method. After `softmax`, it sorts the tokens by probability and keeps the smallest set of top tokens whose cumulative probability reaches P (for example, 0.9). Unlike top-K, top-P does not fix the number of remaining tokens: if the distribution is sharply peaked it may keep only one or two tokens; if it is broad it may keep many.
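
A plain-Python sketch of top-P filtering on an already normalized distribution. The helper name `top_p_filter` and the choice to renormalize the surviving probabilities are assumptions of this example:

```python
def top_p_filter(probs: list[float], p: float) -> list[float]:
    # Visit tokens from most to least likely until their total mass reaches p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Zero out everything else and renormalize so the result sums to 1 again.
    total = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered

peaked = top_p_filter([0.9, 0.05, 0.03, 0.02], p=0.8)
broad = top_p_filter([0.25, 0.25, 0.25, 0.25], p=0.8)
print(peaked)  # [1.0, 0.0, 0.0, 0.0]: one token survives
print(broad)   # [0.25, 0.25, 0.25, 0.25]: all four survive
```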

### Real world usage

In practice these three techniques are often used together, typically in this order: 1) temperature scaling, 2) top-K sampling, and 3) top-P sampling. The blog posts below describe these techniques in detail:

- [Is a Zero Temperature Deterministic?](https://medium.com/google-cloud/is-a-zero-temperature-deterministic-c4a7faef4d20)
- [Beyond temperature: Tuning LLM output with top-k and top-p](https://medium.com/google-cloud/beyond-temperature-tuning-llm-output-with-top-k-and-top-p-24c2de5c3b16)


## Coding

Let's try `softmax` sampling only for now to keep things simple. I'll add an extra section for Temperature + Top-K + Top-P later.

For now we assume the output logits vector represents just one token. This isn't realistic because language models typically output a sequence of logits and use the last position as the next token's unnormalized probability distribution. But this simplified setup is enough for now.

Next, let's create a logits vector as if the model had generated it:

- The size of the logits vector is the same as the vocabulary size of our tokenizer
  - We use `torch.ones()` to create a vector of that size filled with `1`, then multiply it by `-inf`
- " World" (`4337`), " Transformer" (`62379`), and "<|endoftext|>" (`151643`) each get the same probability, and all other tokens get 0%

Note: To represent 0% after `softmax`, we set `-inf` for all other elements because the softmax of `-inf` is always `0`.


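```python
import torch

# Start with -inf everywhere so every token gets probability 0 after softmax.
weight = torch.ones(vocab_size) * float("-inf")
# Give the three candidate tokens the same finite logit.
weight[4337] = 1.0     # " World"
weight[62379] = 1.0    # " Transformer"
weight[151643] = 1.0   # "<|endoftext|>"
print("logits:", weight.shape, weight)
```

```
logits: torch.Size([151669]) tensor([-inf, -inf, -inf,  ..., -inf, -inf, -inf])
```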


Let's check the `softmax` values. As expected, all three candidates have 33.33% probability because the logits are `1.0` for all of them.


```python
import torch

softmax = torch.softmax(weight, dim=0)
print(softmax[softmax > 0.0], "softmax > 0.0")
print(torch.nonzero(softmax, as_tuple=True)[0], "softmax > 0.0 indices")
```

```
tensor([0.3333, 0.3333, 0.3333]) softmax > 0.0
tensor([  4337,  62379, 151643]) softmax > 0.0 indices
```


Lastly, sample one token by following the softmax probability distribution. Because sampling is stochastic, you'll see different outputs each time you run the code below.


```python
from torch import Tensor
from jaxtyping import Float

def predict_next_token(logits: Float[Tensor, "vocab_size"]) -> str:
    softmax = torch.softmax(logits, dim=0)
    next_token = torch.multinomial(softmax, num_samples=1).item()
    return tokenizer.decode(next_token)

print(f"Next token: '{predict_next_token(weight)}'")
```

```
Next token: ' World'
```
