# Chapter 1 - Tokenizer and Sampler

In this chapter, we look at the outermost layers of a Transformer-based language model (LM). Although the core of the language model is a special deep neural network called the Transformer,
the input and output interfaces are what you interact with directly, especially when generating text.

We start from the input and output because doing so makes it easier to visualize the end-to-end process.
No matter how the information is processed in between, the model ultimately takes a text as input and produces a generated text as output. However, unlike stdin/stdout, input and output for an LM are not trivial.

So, we first dig a little deeper into how the input and output of an LM work. At the end of this chapter, we'll manually "train" a very simple model to generate tiny fixed text examples, although it won't work well.

## Goals

- Understand how the input side of an LM, i.e. the tokenizer, works
  - Keywords: Byte Pair Encoding (BPE)
- Understand how the output side of an LM, i.e. the sampler, works
  - Keywords: Softmax, Temperature scaling, Top-K sampling, Top-P sampling
- Understand a basic linear layer model
  - Keywords: Linear layer, Matrix multiplication


## Tokenizer

When you type a text into the chat window of ChatGPT, the LLM doesn't see your request as text, because a Transformer network can only process numbers. So, we need to translate the input text into a sequence of numbers (specifically, integers). This process is called "tokenization" and the function that does it is called a "tokenizer".

### Naive ways to tokenize a text

The most naive way to tokenize arbitrary text is to split it into bytes. Because plain text is actually a sequence of bytes regardless of its encoding, you can easily convert any text into a sequence of byte values (0-255).
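As a minimal sketch of byte-level tokenization, the snippet below uses only Python's built-in UTF-8 encoding. The helper names `byte_tokenize` and `byte_detokenize` are made up for this illustration.

```python
def byte_tokenize(text: str) -> list[int]:
    """Encode the text as UTF-8 and use each byte value (0-255) as a token id."""
    return list(text.encode("utf-8"))


def byte_detokenize(tokens: list[int]) -> str:
    """Reverse the process: turn the byte values back into text."""
    return bytes(tokens).decode("utf-8")


ascii_tokens = byte_tokenize("time")
kanji_tokens = byte_tokenize("時間")  # two characters, three UTF-8 bytes each

print(ascii_tokens)               # [116, 105, 109, 101]
print(len(kanji_tokens))          # 6
print(byte_detokenize(ascii_tokens))
```

Even a four-letter word costs four tokens here, and non-ASCII characters cost several tokens each.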

The problem with this approach is the length of the token sequence. As we'll see in the following chapters, sequence length is one of the most expensive resources of a Transformer network. Because this method has only 256 types of tokens (the number of token types is often called the vocabulary size), it produces longer sequences than methods with a much larger vocabulary.

How can we increase the vocabulary size and reduce the sequence length? Another naive approach is to use words as tokens. Unlike bytes-as-tokens, which needs 4 tokens for the word `time`, this needs just 1 token that represents `time`. However, some languages have no easy way to split text into words. Also, this method is not resilient to typos, whereas modern LMs can understand text even when it is misspelled.
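To make the typo problem concrete, here is a small sketch of word-level tokenization. The vocabulary `word_vocab`, the `<unk>` fallback entry, and the helper `word_tokenize` are all hypothetical and exist only for this example.

```python
# Hypothetical toy vocabulary; a real word-level vocabulary would hold many thousands of words.
word_vocab = {"<unk>": 0, "learn": 1, "transformer": 2, "time": 3}


def word_tokenize(text: str) -> list[int]:
    """Split on whitespace and map each word to its id, or to <unk> if it is unknown."""
    return [word_vocab.get(word, word_vocab["<unk>"]) for word in text.lower().split()]


clean = word_tokenize("Learn Transformer")
typo = word_tokenize("Learn Transfomer")  # one letter is missing

print(clean)  # [1, 2]
print(typo)   # [1, 0]
```

A single missing letter turns the whole word into the unknown token, so the model loses all information about it.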

### Byte Pair Encoding (BPE)

Byte Pair Encoding (BPE) is a popular method that can tokenize any text with a vocabulary of arbitrary size. We won't dive deep into BPE's implementation details, but the core concept is as follows:

- Training
  1. Split many texts into bytes (like the first method)
  2. Find the most frequent pair of adjacent bytes (or already merged tokens) across all the texts
  3. Add that pair to the vocabulary as a new token
  4. Replace every occurrence of the pair with the new token
  5. Repeat 2-4 until you reach the desired vocabulary size
- Tokenizing
  1. Split the text into bytes, then apply the learned merges in the same order as in training
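The steps above can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation used by any real tokenizer: it trains on a single short string and skips the pre-splitting that production tokenizers usually apply before merging. The function names `merge_pair`, `train_bpe` and `bpe_tokenize` are made up for this example.

```python
from collections import Counter


def merge_pair(tokens, pair, new_id):
    """Replace each occurrence of `pair` with `new_id`, scanning left to right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(text, vocab_size):
    tokens = list(text.encode("utf-8"))            # 1. split into bytes
    merges = {}                                    # (left, right) -> new id, in learned order
    for new_id in range(256, vocab_size):          # 5. repeat until the vocabulary is full
        pairs = Counter(zip(tokens, tokens[1:]))   # 2. count adjacent pairs
        if not pairs:
            break
        pair, _count = pairs.most_common(1)[0]
        merges[pair] = new_id                      # 3. register the pair as a new token
        tokens = merge_pair(tokens, pair, new_id)  # 4. replace the pair everywhere
    return merges


def bpe_tokenize(text, merges):
    tokens = list(text.encode("utf-8"))
    for pair, new_id in merges.items():            # dicts keep insertion order = merge order
        tokens = merge_pair(tokens, pair, new_id)
    return tokens


merges = train_bpe("low lower lowest low low", vocab_size=259)
seen = bpe_tokenize("low", merges)
unseen = bpe_tokenize("xyz", merges)
print(seen)    # one merged token
print(unseen)  # plain bytes, because no learned merge applies
```

With only three merges learned, `low` already collapses into a single token, while `xyz` falls back to one token per byte.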

So, BPE is a language-agnostic algorithm, but its result depends on the training data. In the worst case, when the input text looks nothing like the training dataset, it tokenizes the text into just a sequence of bytes.

Note: Training here is not the training of a neural network. It means learning the new tokens and their merge order from a particular dataset. Normally, the trained tokenizer data is distributed along with the Transformer model itself because the two are tightly coupled.

### Visualize the tokenizer
You can visualize the tokenization [here](https://tiktokenizer.vercel.app/?model=Qwen%2FQwen2.5-72B) as shown below. `Hello` and ` World` (note the leading space) are tokenized to `9707` and `4337`. Likewise, `Learn` and ` Transformer` are tokenized to `23824` and `62379`. Also, `198` represents `\n`.

Lastly, `<|endoftext|>` is a special token that indicates the end of text generation, and its token id is `151643`. This special token is inserted into the training dataset of the Transformer neural network to mark the end of each text chunk. So, the model learns when to stop generating text, and it predicts this special token at that point.

![Tiktokenizer](./tiktokenizer.png)


## Coding

Now, let's implement a simple tokenization process to see how it really works with PyTorch.

We use the `transformers` package provided by Hugging Face. This package already contains a bunch of pretrained models and tokenizers. While we build the model network using PyTorch primitives, we use the pretrained tokenizer for the Qwen3 model as-is. If you're interested in implementing BPE, I highly recommend trying Stanford CS336 Assignment 1.

First, load the pretrained tokenizer for the Qwen3 model. You can see that its vocabulary has `151,669` entries, far more than the `256` of byte-based tokenization.

In [90]:
from transformers import AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
vocab_size = tokenizer.vocab_size + len(tokenizer.added_tokens_decoder)
print("vocabulary size:", vocab_size)

vocabulary size: 151669


Next, add a utility function that tokenizes a text into a one-dimensional PyTorch tensor.

In [82]:
from torch import Tensor
from jaxtyping import Int64

def tokenize(text: str) -> Int64[Tensor, "seq_len"]:
    # The tokenizer returns a batch of shape (1, seq_len); [0] drops the batch dimension.
    return tokenizer(text, return_tensors="pt")["input_ids"][0]

The implementation looks a bit awkward, but the output should be easy to understand.

You can see that the output tokens match the ones you saw above when visualizing the tokenization, because both use the same tokenizer.

We keep using these two tiny sequences as the input and output of our models for a while, i.e.

- If the token is "Hello" => Predict " World"
- If the token is "Learn" => Predict " Transformer"
- If the token is either " World" or " Transformer" => Predict "<|endoftext|>"

That's all we want so far. We don't care about any other input text for now.

In [83]:
Hello_World = tokenize("Hello World")
print(Hello_World, "Hello World")
Learn_Transformer = tokenize("Learn Transformer")
print(Learn_Transformer, "Learn Transformer")
EndToken = tokenize("<|endoftext|>")
print(EndToken, "<|endoftext|>")

tensor([9707, 4337]) Hello World
tensor([23824, 62379]) Learn Transformer
tensor([151643]) <|endoftext|>


## Sampler

On the other side of a Transformer-based language model, we need a way to translate the predicted tokens into text, just like the responses you see from ChatGPT. The last-mile translation, i.e. from a token id to the corresponding text fragment, is straightforward because we have the full mapping for it (Note: you have to handle multi-byte characters carefully, since a single character can be split across tokens).
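The multi-byte caveat can be illustrated with the standard library alone. The `token_bytes` table below is hypothetical, since a real tokenizer ships its own mapping. It splits the two UTF-8 bytes of `é` across two tokens, which is the situation a streaming decoder has to survive.

```python
import codecs

# Hypothetical token-to-bytes table; "é" is b"\xc3\xa9" in UTF-8 and is split over two tokens.
token_bytes = {0: b"caf", 1: b"\xc3", 2: b"\xa9"}

# Decoding one token at a time fails on an incomplete character.
try:
    token_bytes[1].decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

# An incremental decoder buffers incomplete bytes until the character is complete.
decoder = codecs.getincrementaldecoder("utf-8")()
pieces = [decoder.decode(token_bytes[token_id]) for token_id in [0, 1, 2]]
text = "".join(pieces)

print(naive_ok)  # False
print(pieces)    # ['caf', '', 'é']
print(text)      # café
```

The middle token produces no text on its own. The character only appears once the following token completes it.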

However, you need one more step to decide which single token the Transformer predicts as the next token. To understand this process, let's look at what the Transformer outputs. Its final output is a distribution of weights over the whole vocabulary. It's not a single token id but a vector with the same length as the vocabulary size, where each element represents the confidence level of the corresponding token id given the input text and the text generated so far. Thus, we need to pick exactly one token based on this distribution.

### Naive method

The most naive method is called `argmax`. Because the confidence value is higher when a token is more strongly expected to come next, the token with the highest value is the logical choice. `argmax` is the strategy of simply picking that highest value. However, because the model's output is a distribution over several plausible tokens rather than a single certain answer, and sometimes the network is not large enough to rank them reliably, the `argmax` choice isn't always ideal.
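A tiny sketch of `argmax` in plain Python, using hypothetical logits over a five-token toy vocabulary:

```python
# Hypothetical logits for a toy vocabulary of 5 tokens.
logits = [0.1, 2.5, -1.0, 2.4, 0.3]

# argmax: the index (token id) of the largest logit.
next_token = max(range(len(logits)), key=lambda i: logits[i])

print(next_token)  # 1
```

Token `3` scores almost as high as token `1`, yet `argmax` never selects it no matter how many times you run the code.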

### Softmax sampling

Instead, we can treat this distribution as a probability distribution and sample one token according to its probability rather than always picking the highest. Because the Transformer's outputs are arbitrary real values that may be negative or even infinite, we have to normalize them first. `softmax` is the most popular method for converting such arbitrary numbers (often called "logits") into probability values, i.e. each value is between 0 and 1 and all values sum to 1. In our case, once we apply `softmax` to the logits, we can simply use the `multinomial` function to sample one token from the probability distribution.
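To show the arithmetic behind `softmax` and the sampling step, here is a plain-Python sketch that does not depend on PyTorch. It uses `random.choices` as a stand-in for a multinomial draw, and the toy logits are made up for this example.

```python
import math
import random


def softmax(logits):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j), shifted by max(z) for numerical stability."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits: three candidates share the same score, the others are ruled out with -inf.
logits = [float("-inf"), 1.0, float("-inf"), 1.0, 1.0]
probs = softmax(logits)

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = rng.choices(range(len(probs)), weights=probs, k=1000)

print([round(p, 4) for p in probs])  # [0.0, 0.3333, 0.0, 0.3333, 0.3333]
print(sorted(set(samples)))          # only token ids with non-zero probability appear
```

Subtracting the maximum before exponentiating leaves the result unchanged but avoids overflow for large logits.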

### Temperature scaling

Softmax sampling is the basis of the sampling logic, but real-world samplers use several additional techniques to adjust its behavior. One of them is temperature scaling. This method simply divides the logits by a single temperature value (`logits / temperature`). The lower the temperature, the sharper (lower-variance) the `softmax`-normalized distribution becomes, so if the temperature is close to zero, sampling is almost identical to `argmax`. A higher temperature flattens the distribution, so there is a higher chance of picking low-probability tokens. People sometimes say that a high temperature is more creative while a low temperature is more conservative.
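The effect of the temperature is easy to see numerically. The sketch below applies `logits / temperature` before a plain-Python `softmax`. The logits and the helper name `scaled_probs` are made up for this illustration.

```python
import math


def softmax(logits):
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_probs(logits, temperature):
    """Divide every logit by the temperature, then normalize with softmax."""
    return softmax([x / temperature for x in logits])


logits = [2.0, 1.0, 0.0]
sharp = scaled_probs(logits, 0.5)  # low temperature: the top token dominates
plain = scaled_probs(logits, 1.0)  # unchanged softmax
flat = scaled_probs(logits, 2.0)   # high temperature: probabilities move closer together

for name, probs in [("T=0.5", sharp), ("T=1.0", plain), ("T=2.0", flat)]:
    print(name, [round(p, 2) for p in probs])
```

The probability of the top token falls from roughly 0.87 at `T=0.5` to about 0.51 at `T=2.0`, while the lowest token gains share.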

### Top-K sampling

Although a high temperature is great for generating creative outputs, we typically want to avoid completely irrelevant tokens. Top-K sampling simply keeps only the K highest-scoring tokens and then samples among them. Top-K gives us a fixed-size candidate pool, which maintains some randomness while avoiding low-probability tokens.
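One way to sketch Top-K filtering is to overwrite every logit outside the top K with `-inf`, so that a later `softmax` assigns those tokens zero probability. The helper `top_k_filter` and the logits are hypothetical.

```python
import math


def top_k_filter(logits, k):
    """Keep the k largest logits and replace all the others with -inf."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = set(ranked[:k])
    return [x if i in kept else float("-inf") for i, x in enumerate(logits)]


logits = [2.0, 1.0, 0.5, -1.0, 3.0]
filtered = top_k_filter(logits, k=2)

print(filtered)                                 # [2.0, -inf, -inf, -inf, 3.0]
print(sum(math.isfinite(x) for x in filtered))  # 2
```

Exactly K candidates survive regardless of how the scores are distributed.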

### Top-P sampling

Top-P is another filtering method. After `softmax`, it keeps only the most probable tokens whose cumulative probability reaches P and drops the rest. Unlike top-K, top-P doesn't guarantee how many tokens remain. If the distribution is almost a delta function, it may keep only the top 1 or 2 tokens. If the distribution is spread out, it may drop only the last 1 or 2 outliers.
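The following sketch shows how the number of surviving tokens depends on the shape of the distribution. The helper `top_p_filter` and both probability vectors are made up for this example. The sketch assumes the input is already normalized and renormalizes the kept tokens so they sum to 1 again.

```python
def top_p_filter(probs, p):
    """Keep the most probable tokens until their cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for token_id in ranked:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalized survivors


peaked = [0.96, 0.02, 0.01, 0.01]  # close to a delta function
spread = [0.4, 0.3, 0.2, 0.1]      # spread out

peaked_kept = top_p_filter(peaked, p=0.8)
spread_kept = top_p_filter(spread, p=0.8)

print(sorted(peaked_kept))  # [0]
print(sorted(spread_kept))  # [0, 1, 2]
```

With the same `p`, the peaked distribution keeps a single token while the spread-out one only drops its last outlier.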

### Real world usage

In the real world, these three techniques are sometimes used together. In that case, people typically apply them in this order: 1. Temperature scaling, 2. Top-K sampling, and 3. Top-P sampling. The blog posts below describe these techniques well, so I highly recommend reading them:

- [Is a Zero Temperature Deterministic?](https://medium.com/google-cloud/is-a-zero-temperature-deterministic-c4a7faef4d20)
- [Beyond temperature: Tuning LLM output with top-k and top-p](https://medium.com/google-cloud/beyond-temperature-tuning-llm-output-with-top-k-and-top-p-24c2de5c3b16)
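The ordering described above (temperature, then Top-K, then Top-P) can be combined into one function. This is a plain-Python sketch under stated assumptions, not how any particular library does it. The name `sample` and its parameters are hypothetical, and renormalizing after the Top-K step is just one possible design choice.

```python
import math
import random


def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    # 1. Temperature scaling followed by softmax.
    scaled = [x / temperature for x in logits]
    largest = max(scaled)
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)

    # 2. Top-K: keep the K most probable tokens and renormalize them.
    if top_k is not None:
        ranked = ranked[:top_k]
        kept_total = sum(prob for prob, _ in ranked)
        ranked = [(prob / kept_total, i) for prob, i in ranked]

    # 3. Top-P: keep tokens until the cumulative probability reaches top_p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for prob, i in ranked:
            kept.append((prob, i))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = kept

    # Sample among the survivors; random.choices normalizes the weights itself.
    probs, token_ids = zip(*ranked)
    return rng.choices(token_ids, weights=probs, k=1)[0]


logits = [3.0, 2.0, 1.0, 0.0, -1.0]
rng = random.Random(0)
draws = {sample(logits, temperature=0.8, top_k=3, top_p=0.9, rng=rng) for _ in range(200)}
print(sorted(draws))  # never contains token ids 3 or 4, which Top-K removed
```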


## Coding

Let's try `softmax` sampling only for now to minimize the scope. I'll add an extra section to do Temperature + Top-K + Top-P later.

For now, we assume the output logits cover only one token. This isn't realistic: language models typically output logits for a sequence of tokens, and we pick the last logits as the next-token prediction. But for now, this simpler setup is better.

Then, let's create a logits vector as if it had been generated by the model:
  - The size of the logits is the same as the vocabulary size of our tokenizer
    - We use `torch.ones()`, which creates a vector of the specified size filled with `1`
  - " World" (`4337`), " Transformer" (`62379`) and "<|endoftext|>" (`151643`) each get the same probability, and the rest get 0%

Note: To represent 0% after `softmax`, we set all the other elements to `-inf` (because `softmax` of `-inf` is always `0`).

In [None]:
import torch

weight = torch.ones(vocab_size) * float("-inf")
weight[4337] = 1.0
weight[62379] = 1.0
weight[151643] = 1.0
print("logits:", weight.shape, weight)

logits: torch.Size([151669]) tensor([-inf, -inf, -inf,  ..., -inf, -inf, -inf])


Let's check the `softmax` values. As expected, each of the three candidates has a probability of 33.33% because their logits are all `1.0`.

In [None]:
import torch

softmax = torch.softmax(weight, dim=0)
print(softmax[softmax > 0.0], "softmax > 0.0")
print(torch.nonzero(softmax, as_tuple=True)[0], "softmax > 0.0 indices")

tensor([0.3333, 0.3333, 0.3333]) softmax > 0.0
tensor([  4337,  62379, 151643]) softmax > 0.0 indices


Lastly, sample one token according to the softmax probability distribution. Because it's sampling, you may see a different output each time you run the code below.

In [None]:
from torch import Tensor
from jaxtyping import Float

def predict_next_token(logits: Float[Tensor, "vocab_size"]) -> str:
    softmax = torch.softmax(logits, dim=0)
    next_token = torch.multinomial(softmax, num_samples=1).item()
    return tokenizer.decode(next_token)

print(f"Next token: '{predict_next_token(weight)}'")

Next token: ' World'


## Linear layer model

In this last section, we stitch together everything we've done so far and mimic a manually trained language model that generates one of three tokens at random.

This is not machine learning but a simple exercise to understand how PyTorch's neural network models work.

First, let's create an `nn.Linear` model whose input size is `1` and whose output size is `vocab_size`. "Linear" means that this module only applies a linear transformation. Although it's based on tensor computation, the mathematical expression is just `Y = AX + B`. In this example, `bias=False` means there is no `B` term, so it's `Y = AX`.

`X` is a single-element vector holding the token id of the input (remember, we assume only one token at a time). So far, we don't make any real use of this input information.
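To see what `Y = AX` means with a single input value, here is a plain-Python sketch on a toy vocabulary of 5 tokens. The helper `linear_no_bias` and all the numbers are made up for this illustration. Like `nn.Linear`, it stores `A` with one row per output and one column per input, i.e. with shape `(vocab_size, 1)` in our setting.

```python
def linear_no_bias(weight, x):
    """Compute Y = AX: each output is the dot product of one weight row with the input."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in weight]


# Toy setting: 5 output tokens, 1 input value. Rows 1, 3 and 4 are the "allowed" tokens.
A = [[0.0], [1.0], [0.0], [1.0], [1.0]]
X = [7.0]  # stands in for a token id converted to float

Y = linear_no_bias(A, X)
print(Y)  # [0.0, 7.0, 0.0, 7.0, 7.0]
```

Because the input is a single positive number, it only scales the rows of `A`: the allowed tokens end up with equal, large logits and the rest stay at zero.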




In [None]:
from torch import nn

model = nn.Linear(1, vocab_size, bias=False)
logits = model(tokenize("Hello").float())
print(f"Next token: '{predict_next_token(logits)}'")


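Next token: 'CO'

With its randomly initialized weights, the layer predicts an arbitrary token. Next, we set the weights by hand: the rows for " World", " Transformer" and "<|endoftext|>" get `1.0` and every other row gets `0.0`. Because the input token id is a large positive number, these three logits dominate after `softmax`, so each of the three tokens is sampled with roughly equal probability.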

In [194]:
import torch

weight = torch.zeros(vocab_size, 1)
weight[4337] = 1.0
weight[62379] = 1.0
weight[151643] = 1.0

model.load_state_dict({"weight": weight})

logits = model(tokenize("Hello").float())
print(f"Next token: '{predict_next_token(logits)}'")

Next token: ' Transformer'
