# Week 3 — Part 01: Tokens and Context Windows Lab

**Estimated time:** 90–120 minutes

---

## Pre-study (Self-learn)

The Foundamental Course assumes you have completed Self-learn. If you need a refresher on tokenization and prompt engineering, see:

- [Foundamental Course Pre-study index](../PRESTUDY.md)
- [Self-learn — Prompt engineering and evaluation](../../self_learn/Chapters/3/02_prompt_engineering_evaluation.md)

---

## What success looks like (end of Part 01)

- You can estimate token counts for a given text.
- You can explain why long prompts fail (context window limits).
- You can design prompts that fit within limits.

### Checkpoint

After running this notebook:

- You can measure prompt size in tokens.
- You can identify when a prompt exceeds typical context limits.

## Learning Objectives

- Understand tokenization basics
- Measure token counts in prompts
- Design prompts for context window constraints

## Overview

This lab is about budgeting context so your prompts and structured outputs don’t fail unexpectedly.

You will:

- compare token counts across inputs
- implement a simple truncation strategy
- build a small token-budget summary helper (TODO)

If you need more background on why prompts fail under long context, use the Self-learn links at the top of the notebook.
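The budgeting idea above can be sketched as simple arithmetic: the prompt and the model's reply share one context window, so the prompt may only use what is left after reserving room for the output. The class name `ContextBudget` and the numbers below (an 8192-token limit with 1024 tokens reserved) are illustrative assumptions, not the limits of any particular model.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    """Hypothetical helper: splits a context window between prompt and reply."""

    context_limit: int    # assumed total window size, in tokens
    reserved_output: int  # tokens held back for the model's reply

    @property
    def max_prompt_tokens(self) -> int:
        # Never report a negative budget if the reservation exceeds the limit.
        return max(self.context_limit - self.reserved_output, 0)

    def fits(self, n_prompt_tokens: int) -> bool:
        return n_prompt_tokens <= self.max_prompt_tokens


# Illustrative numbers only; check the real limit for the model you use.
budget = ContextBudget(context_limit=8192, reserved_output=1024)
print("max prompt tokens:", budget.max_prompt_tokens)
print("7000-token prompt fits:", budget.fits(7000))
print("7500-token prompt fits:", budget.fits(7500))
```

Reserving output space up front matters most for structured outputs, where a reply cut off mid-object is usually unusable.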

In [None]:
import re
from typing import List

try:
    import tiktoken
except ImportError as e:  # pragma: no cover
    # tiktoken is optional; later cells fall back to simple_tokenize without it.
    tiktoken = None
    _tiktoken_error = e


def simple_tokenize(text: str) -> List[str]:
    # Rough proxy: word runs and single punctuation marks, not a model tokenizer.
    return re.findall(r"\w+|[^\w\s]", text)


sample = "Hello, world! Token test."
print("simple tokens:", simple_tokenize(sample))

if tiktoken is None:
    print("tiktoken not available:", _tiktoken_error)
else:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("Token counting is useful for budgeting.")
    print("tiktoken count:", len(tokens))
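When no tokenizer is installed, a character-based estimate can stand in for a real count. The sketch below assumes the often-quoted rule of thumb of roughly four characters per token for English text; the ratio varies with language and content (URLs and code tend to use more tokens), so treat `chars_per_token` as a tunable assumption rather than a property of any tokenizer. The function name is hypothetical.

```python
import math


def estimate_tokens_by_chars(text: str, chars_per_token: float = 4.0) -> int:
    """Estimate a token count from character length (heuristic, not exact)."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    # Round up so the estimate errs toward using more of the budget.
    return math.ceil(len(text) / chars_per_token)


text_example = "Token counting is useful for budgeting."
print("estimated tokens:", estimate_tokens_by_chars(text_example))
```

Comparing this estimate with the `tiktoken` count for the same string gives a feel for how far the heuristic can drift.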

## Truncation exercise

Fit text into a maximum token budget with a simple truncation strategy: encode the text, keep the first `max_tokens` tokens, and decode the result.

In [None]:
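def truncate_text_tokens(text: str, max_tokens: int, *, enc) -> str:
    """Return text cut to at most max_tokens tokens under the encoding enc."""
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    # Keep the head of the token sequence; a cut inside a multi-byte character
    # may leave a replacement character at the end of the decoded text.
    return enc.decode(tokens[:max_tokens])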


if tiktoken is not None:
    long_text = "data " * 400
    truncated = truncate_text_tokens(long_text, 200, enc=enc)
    print("orig tokens:", len(enc.encode(long_text)))
    print("trunc tokens:", len(enc.encode(truncated)))

## Exercise: token budgeting helper (TODO)

Implement the TODO function below.

Goal:

- Given a piece of text, return a dict containing:
  - `n_simple_tokens` using `simple_tokenize`
  - `n_tiktoken_tokens` if `tiktoken` is available, otherwise `None`

Checkpoint:

- Running the cell prints token counts for at least 2 inputs.

In [None]:
def token_budget_summary_todo(text: str) -> dict:
    # TODO: implement. Replace the placeholder None with a real tiktoken count
    # when tiktoken is available; keep it runnable if tiktoken is not installed.
    return {"n_simple_tokens": len(simple_tokenize(text)), "n_tiktoken_tokens": None}


examples = [
    "Hello world!",
    "https://example.com/some/path?with=query&and=more",
]

for s in examples:
    print(s, "->", token_budget_summary_todo(s))

## Appendix: Solutions (peek only after trying)

Reference implementation for `token_budget_summary_todo`.

In [None]:
def token_budget_summary_todo(text: str) -> dict:
    n_simple = len(simple_tokenize(text))

    if tiktoken is None:
        # Stay runnable without tiktoken: report only the regex-based count.
        return {"n_simple_tokens": n_simple, "n_tiktoken_tokens": None}

    enc = tiktoken.get_encoding("cl100k_base")
    n_tiktoken = len(enc.encode(text))  # len() already returns an int
    return {"n_simple_tokens": n_simple, "n_tiktoken_tokens": n_tiktoken}


for s in examples:
    print(s, "->", token_budget_summary_todo(s))