# Week 3 — Part 01: Tokens and Context Windows Lab

**Estimated time:** 90–120 minutes

---

## Pre-study (Self-learn)

The Foundations Course assumes you have completed the Self-learn material. If you need a refresher on tokenization and prompt engineering:

- [Foundations Course Pre-study index](../PRESTUDY.md)
- [Self-learn — Prompt engineering and evaluation](../self_learn/Chapters/3/02_prompt_engineering_evaluation.md)

---

## What success looks like (end of Part 01)

- You can estimate token counts for a given text.
- You can explain why long prompts fail (context window limits).
- You can design prompts that fit within limits.

### Checkpoint

After running this notebook:
- You can measure prompt size in tokens
- You can identify when a prompt exceeds typical context limits

## Learning Objectives

- Understand tokenization basics
- Measure token counts in prompts
- Design prompts for context window constraints

### What this part covers
This notebook builds **practical intuition about tokens and context windows** — the two most important constraints when working with LLM APIs.

**Token** = the unit an LLM processes. Not a word, not a character — something in between (roughly 4 characters per token for English).

**Context window** = the maximum number of tokens the model can handle in one request, counting both the prompt and the generated output. If the prompt plus the expected output exceeds this limit, the request fails or the output is cut off.

**Why this matters immediately:** Every API call you make has a token budget. Exceeding it causes failures. Staying well within it reduces cost and latency.
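
To make the budget concrete, here is a minimal sketch of a pre-flight check. It assumes the rough 4-characters-per-token figure above and a placeholder 8,192-token context window. `estimate_tokens`, `fits_in_context`, and both constants are illustrative names, not part of any API. Real limits vary by model, so treat the numbers as placeholders.

```python
import math

CHARS_PER_TOKEN = 4      # assumption: rough average for English text
CONTEXT_WINDOW = 8192    # assumption: placeholder limit; check your model's documentation


def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count (heuristic, not exact)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(prompt: str, max_output_tokens: int, window: int = CONTEXT_WINDOW) -> bool:
    """True if the estimated prompt size plus the reserved output budget fits in the window."""
    return estimate_tokens(prompt) + max_output_tokens <= window


short_prompt = "Summarize the following paragraph in one sentence."
long_prompt = "word " * 10_000  # 50,000 characters

print(estimate_tokens(short_prompt), fits_in_context(short_prompt, max_output_tokens=500))
print(estimate_tokens(long_prompt), fits_in_context(long_prompt, max_output_tokens=500))
```

The check reserves room for the output as well as the prompt, since both count against the same window. For anything beyond a quick estimate, swap the heuristic for the model's real tokenizer.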

## Overview

This lab is about budgeting context so your prompts and structured outputs don’t fail unexpectedly.

You will:

- compare token counts across inputs
- implement a simple truncation strategy
- build a small token-budget summary helper (TODO)

If you need more background on why prompts fail under long context, use the Self-learn links at the top of the notebook.

### What this cell does
Defines two tokenizers and compares them:

- **`simple_tokenize()`**: a regex-based approximation. Fast, no dependencies, but not what LLMs actually use.
- **`tiktoken`**: OpenAI's tokenizer library. Gives exact counts for OpenAI models, provided you pick the encoding that matches the model.

**Key insight:** `simple_tokenize` and `tiktoken` will give different counts for the same text. The difference matters when you're budgeting prompts — always use the model's actual tokenizer for production code.

**If tiktoken is not installed:** Run `pip install tiktoken` in your environment. The notebook will still run without it (the `try/except` handles the missing import gracefully).

In [None]:
import re
from typing import List

try:
    import tiktoken
except Exception as e:  # pragma: no cover
    tiktoken = None
    _tiktoken_error = e


def simple_tokenize(text: str) -> List[str]:
    return re.findall(r"\w+|[^\w\s]", text)


sample = "Hello, world! Token test."
print("simple tokens:", simple_tokenize(sample))

if tiktoken is None:
    print("tiktoken not available:", _tiktoken_error)
else:
    # Note: cl100k_base is used by gpt-4, gpt-4-turbo, and gpt-3.5-turbo.
    # Newer models (gpt-4o, gpt-4o-mini) use o200k_base.
    # tiktoken.encoding_for_model("<model_name>") returns the matching encoding
    # and raises KeyError for model names it does not recognize.
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("Token counting is useful for budgeting.")
    print("tiktoken count (cl100k_base):", len(tokens))
    print("Tip: for gpt-4o / gpt-4o-mini, use tiktoken.encoding_for_model('gpt-4o') → o200k_base")

### What this cell does
Implements `truncate_text_tokens()` — a function that cuts text to fit within a maximum token budget by encoding to tokens, slicing, then decoding back to text.

**Why truncate at the token level, not character level?** Characters and tokens don't have a fixed ratio. Truncating at 800 characters might give you 150 tokens or 400 tokens depending on the content. Token-level truncation gives you precise control over the budget.
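
The varying ratio is easy to see even without tiktoken. The sketch below reuses the notebook's regex tokenizer as a stand-in, so the counts are approximations rather than real model token counts, and compares characters per token for plain prose and for a URL. `chars_per_token` is a helper name introduced here for illustration.

```python
import re
from typing import List


def simple_tokenize(text: str) -> List[str]:
    # Same regex approximation as earlier in the notebook.
    return re.findall(r"\w+|[^\w\s]", text)


def chars_per_token(text: str) -> float:
    """Average number of characters per (approximate) token."""
    n_tokens = len(simple_tokenize(text))
    return len(text) / n_tokens if n_tokens else 0.0


prose = "Token counting is useful for budgeting prompts."
url = "https://example.com/some/path?with=query&and=more"

for label, text in [("prose", prose), ("url", url)]:
    print(f"{label}: {len(text)} chars, {len(simple_tokenize(text))} tokens, "
          f"{chars_per_token(text):.2f} chars/token")
```

The two strings are almost the same length, yet the URL splits into many more pieces. A real tokenizer will give different numbers, but punctuation-heavy text tends to cost more tokens per character than plain prose, which is why a fixed character cutoff is an unreliable budget.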

**Practical use:** In Week 6, you'll compress tabular data before sending it to an LLM. This truncation pattern is one tool in that toolkit — useful when you have a block of text that might be too long.

## Truncation exercise

Use a simple truncation strategy to fit text into a max token budget.

### What the exercise cell does
The exercise cell in the "Exercise: token budgeting helper" section defines `token_budget_summary_todo()`, a helper that returns both a simple token count and a tiktoken count for any text.

**Your task:** The starter implementation returns `None` for `n_tiktoken_tokens`. Implement it to return the real tiktoken count when tiktoken is available, and `None` when it's not.

**Why both counts?** The simple tokenizer is always available (no dependencies), so it's useful as a quick estimate. The tiktoken count is exact for OpenAI models that use the chosen encoding. Having both lets you see how far the approximation is from the real count.

In [None]:
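def truncate_text_tokens(text: str, max_tokens: int, *, enc) -> str:
    """Return `text` cut to at most `max_tokens` tokens under encoding `enc`."""
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text  # already within budget; return unchanged
    # Keep the first max_tokens tokens and decode them back to text.
    return enc.decode(tokens[:max_tokens])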


if tiktoken is not None:
    long_text = "data " * 400
    truncated = truncate_text_tokens(long_text, 200, enc=enc)
    print("orig tokens:", len(enc.encode(long_text)))
    print("trunc tokens:", len(enc.encode(truncated)))
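
If tiktoken is not installed, the cell above prints nothing. As a rough way to still exercise the function, the sketch below passes a stand-in encoder. `PieceEncoding` is a hypothetical class written for this example, not part of tiktoken. It only mimics the `encode`/`decode` pair that `truncate_text_tokens()` relies on, and its pieces are not real model tokens.

```python
import re
from typing import List


class PieceEncoding:
    """Hypothetical stand-in for a tokenizer: splits text into lossless string pieces."""

    def encode(self, text: str) -> List[str]:
        # Whitespace runs, word runs, and single punctuation marks. Together
        # these patterns cover every character, so joining restores the text.
        return re.findall(r"\s+|\w+|[^\w\s]", text)

    def decode(self, tokens: List[str]) -> str:
        return "".join(tokens)


def truncate_text_tokens(text: str, max_tokens: int, *, enc) -> str:
    # Same logic as the notebook cell: encode, slice, decode.
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return enc.decode(tokens[:max_tokens])


fallback_enc = PieceEncoding()
long_text = "data " * 400
truncated = truncate_text_tokens(long_text, 200, enc=fallback_enc)
print("orig pieces:", len(fallback_enc.encode(long_text)))
print("trunc pieces:", len(fallback_enc.encode(truncated)))
```

Because the function takes `enc` as a parameter, any object with matching `encode` and `decode` methods works. Use the stand-in only to check the control flow; budgets for a real model need that model's tokenizer.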

## Exercise: token budgeting helper (TODO)

Implement the TODO function below.

Goal:

- Given a piece of text, return a dict containing:
  - `n_simple_tokens` using `simple_tokenize`
  - `n_tiktoken_tokens` if `tiktoken` is available, otherwise `None`

Checkpoint:

- Running the cell prints token counts for at least 2 inputs.

In [None]:
def token_budget_summary_todo(text: str) -> dict:
    # TODO: implement.
    # Keep it runnable even if tiktoken is not installed.
    return {"n_simple_tokens": len(simple_tokenize(text)), "n_tiktoken_tokens": None}


examples = [
    "Hello world!",
    "https://example.com/some/path?with=query&and=more",
]

for s in examples:
    print(s, "->", token_budget_summary_todo(s))
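
Once the helper returns both counts, one way to use them is to report how far the simple estimate is from the tiktoken count. The sketch below is an illustration only: `count_gap` is a name introduced here, and the sample dicts hold placeholder values rather than real tokenizer output.

```python
from typing import Optional


def count_gap(summary: dict) -> Optional[float]:
    """Relative gap between the simple count and the tiktoken count.

    Returns None when the tiktoken count is missing or zero.
    """
    n_simple = summary["n_simple_tokens"]
    n_exact = summary["n_tiktoken_tokens"]
    if not n_exact:
        return None
    return (n_simple - n_exact) / n_exact


# Placeholder values chosen for illustration, not real tokenizer output.
with_tiktoken = {"n_simple_tokens": 12, "n_tiktoken_tokens": 10}
without_tiktoken = {"n_simple_tokens": 3, "n_tiktoken_tokens": None}

print(count_gap(with_tiktoken))
print(count_gap(without_tiktoken))
```

A positive gap means the simple tokenizer overestimates and a negative gap means it underestimates. Returning `None` when tiktoken is unavailable keeps the helper usable in either environment.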

## Appendix: Solutions (peek only after trying)

Reference implementation for `token_budget_summary_todo`.

In [None]:
def token_budget_summary_todo(text: str) -> dict:
    n_simple = len(simple_tokenize(text))

    if tiktoken is None:
        return {"n_simple_tokens": n_simple, "n_tiktoken_tokens": None}

    enc = tiktoken.get_encoding("cl100k_base")
    n_tiktoken = len(enc.encode(text))
    return {"n_simple_tokens": n_simple, "n_tiktoken_tokens": int(n_tiktoken)}


for s in examples:
    print(s, "->", token_budget_summary_todo(s))