# Notebook 01: Decoding and Tokens

**Objectives:**
- Understand how text becomes tokens
- Explore input/context/output token mapping
- Experiment with decoding parameters (temperature, top_p, max_tokens)
- Observe token count impact on cost and latency

**Key Concepts:**
- Tokenization is subword-based (BPE)
- Different models use different encodings
- Tokens ≠ Words (can be characters, subwords, or whole words)


In [None]:
import sys
sys.path.append('..')

from utils.token_utils import pick_encoding, count_text_tokens
from utils.logging_utils import log_llm_call
from utils.llm_client import LLMClient
from utils.router import pick_model
import tiktoken

## Part 1: Tokenization Basics

Let's see how different text gets tokenized.


In [6]:
# Get encoding for OpenAI
encoding = pick_encoding("openai", "gpt-4o")

# Example texts
examples = [
    "Hello, world!",
    "The quick brown fox jumps over the lazy dog.",
    "Supercalifragilisticexpialidocious",
    "你好世界",  # Chinese
    "مرحبا بالعالم",  # Arabic
    "print('Hello, World!')",  # Code
]

print("Text → Tokens")
print("=" * 80)

for text in examples:
    tokens = encoding.encode(text)
    decoded_tokens = [encoding.decode([t]) for t in tokens]
    
    print(f"\nText: {text}")
    print(f"Token count: {len(tokens)}")
    print(f"Tokens: {decoded_tokens}")
    print(f"Token IDs: {tokens[:10]}{'...' if len(tokens) > 10 else ''}")


Text → Tokens

Text: Hello, world!
Token count: 4
Tokens: ['Hello', ',', ' world', '!']
Token IDs: [13225, 11, 2375, 0]

Text: The quick brown fox jumps over the lazy dog.
Token count: 10
Tokens: ['The', ' quick', ' brown', ' fox', ' jumps', ' over', ' the', ' lazy', ' dog', '.']
Token IDs: [976, 4853, 19705, 68347, 65613, 1072, 290, 29082, 6446, 13]

Text: Supercalifragilisticexpialidocious
Token count: 10
Tokens: ['Super', 'cal', 'if', 'rag', 'il', 'istic', 'exp', 'ial', 'id', 'ocious']
Token IDs: [17260, 5842, 366, 17764, 311, 6207, 8067, 563, 315, 170661]

Text: 你好世界
Token count: 2
Tokens: ['你好', '世界']
Token IDs: [177519, 28428]

Text: مرحبا بالعالم
Token count: 4
Tokens: ['مرح', 'با', ' بالع', 'الم']
Token IDs: [158894, 26537, 101462, 12773]

Text: print('Hello, World!')
Token count: 7
Tokens: ['print', "('", 'Hello', ',', ' World', '!', "')"]
Token IDs: [1598, 706, 13225, 11, 5922, 0, 1542]


## Part 2: Input vs Context vs Output Tokens

Understanding the token breakdown:
- **Input tokens**: Your prompt (system + user messages)
- **Context tokens**: Additional context (e.g., RAG documents)
- **Output tokens**: Model's response

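Both cost and latency grow with token count. Providers typically price input and output tokens separately, and latency tends to scale mostly with the number of output tokens, since those are generated one at a time.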
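
As a rough illustration of how the breakdown feeds into cost, the sketch below multiplies token counts by per-1K-token prices. The helper name `estimate_cost_usd` and the prices in the example call are hypothetical placeholders, not any provider's actual rates.

```python
def estimate_cost_usd(input_tokens, context_tokens, output_tokens,
                      input_price_per_1k, output_price_per_1k):
    """Estimate request cost, assuming prompt and context share the input price."""
    prompt_tokens = input_tokens + context_tokens
    prompt_cost = prompt_tokens / 1000 * input_price_per_1k
    output_cost = output_tokens / 1000 * output_price_per_1k
    return {
        "prompt_tokens": prompt_tokens,
        "prompt_cost": prompt_cost,
        "output_cost": output_cost,
        "total_cost": prompt_cost + output_cost,
    }

# Hypothetical prices (USD per 1K tokens); substitute your provider's real rates.
estimate = estimate_cost_usd(
    input_tokens=26, context_tokens=40, output_tokens=30,
    input_price_per_1k=1.0, output_price_per_1k=2.0,
)
print(estimate)
```

Keeping the prompt and output costs separate makes it easier to see whether a long context or a long response dominates the bill.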


In [7]:
from utils.token_utils import count_messages_tokens

# Example with context
messages = [
    {"role": "system", "content": "You are a helpful document summarizer."},
    {"role": "user", "content": "Summarize the key points from the context."}
]

context_strs = [
    "CloudSync Pro is a file synchronization platform with real-time updates.",
    "It supports AES-256 encryption and is GDPR compliant.",
    "The service offers 99.9% uptime SLA and unlimited storage for enterprise."
]

# Count tokens separately
token_breakdown = count_messages_tokens(messages, "openai", "gpt-4o", context_strs)

print("Token Breakdown:")
print("=" * 50)
print(f"Input tokens (prompt):     {token_breakdown['input_tokens']:>6}")
print(f"Context tokens:            {token_breakdown['context_tokens']:>6}")
print(f"Estimated total (input):   {token_breakdown['estimated_total']:>6}")
print("\nNote: Output tokens determined by model response length")


Token Breakdown:
Input tokens (prompt):         26
Context tokens:                40
Estimated total (input):       69

Note: Output tokens determined by model response length


## Part 3: Decoding Parameter Experiments

Key parameters that affect output:
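- **temperature** (0.0-2.0; the allowed range depends on the provider): Controls randomness (0 = near-deterministic, always favoring the most likely token; higher = more varied)
- **max_tokens**: Hard cap on the number of output tokens
- **top_p**: Nucleus sampling threshold; the model samples only from the smallest set of tokens whose cumulative probability reaches `top_p`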

Let's see how these affect the response.
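
Before calling a real model, it can help to see what temperature and `top_p` do to a next-token distribution. The sketch below is a simplified, standalone illustration over a made-up list of logits. It is not the provider's actual sampling code, and all function names are assumptions for this example.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 (treat 0 as greedy decoding)")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalized nucleus

def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick a token index: greedy at temperature 0, nucleus sampling otherwise."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    nucleus = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]

toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top token, and higher ones flatten the distribution. `top_p` then trims the low-probability tail before sampling.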


In [8]:
# Setup client
model = pick_model("google", "general")
client = LLMClient("google", model)

prompt_msg = [
    {"role": "system", "content": "You are a creative storyteller."},
    {"role": "user", "content": "Write the opening sentence of a sci-fi story."}
]

# Temperature sweep
temperatures = [0.0, 0.5, 1.0, 1.5]

print("Temperature Sweep (same prompt, different temperatures)")
print("=" * 80)

for temp in temperatures:
    response = client.chat(prompt_msg, temperature=temp, max_tokens=50)
    
    print(f"\nTemperature = {temp}")
    print(f"Output: {response['text']}")
    print(f"Completion tokens: {response['usage']['completion_tokens_actual']}")
    
    # Log it
    log_llm_call(
        provider="google",  # must match the provider used to create the client
        model=model,
        technique=f"temp_sweep_{temp}",
        latency_ms=response['latency_ms'],
        usage=response['usage'],
    )


Temperature Sweep (same prompt, different temperatures)

Temperature = 0.0
Output: The crimson sun bled across the methane swamps of Xylos, painting the skeletal remains of the crashed starship in hues of rust and despair.

Completion tokens: 30

Temperature = 0.5
Output: The crimson sun bled across the corrugated iron horizon of Neo-Dustbowl, painting the rust-colored shacks in hues of dying hope.

Completion tokens: 29

Temperature = 1.0
Output: The bioluminescent moss pulsed with an eerie green light, casting long, skeletal shadows across the rusting hull of the abandoned starship, a silent testament to a war humanity had long forgotten.

Completion tokens: 38

Temperature = 1.5
Output: The crimson sun bled across the metallic horizon, painting the chrome city of Veridium in hues of rust and regret.

Completion tokens: 24


## Part 4: Max Tokens Limiting

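Cap output length to bound cost. `max_tokens` is a hard cutoff, not a length target: the model does not plan around the limit, so a response can stop mid-sentence.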
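
The cutoff behavior can be sketched with a toy generation loop. This is a simplified model of autoregressive decoding, not the notebook's `LLMClient`. The `"stop"` / `"length"` labels mirror the kind of finish reason many APIs report, but whether `LLMClient` exposes one is not shown here, so treat those labels as an assumption.

```python
def generate(next_token, max_tokens, eos="<eos>"):
    """Emit tokens one at a time until the model ends the text or the cap is hit."""
    tokens = []
    for _ in range(max_tokens):
        tok = next_token(tokens)
        if tok == eos:
            return tokens, "stop"      # the model finished on its own
        tokens.append(tok)
    return tokens, "length"            # cut off by max_tokens

# A scripted stand-in for a model: it replays a fixed sentence, then ends.
script = ["Imagine", " a", " light", " switch", "."]

def scripted_model(tokens_so_far):
    i = len(tokens_so_far)
    return script[i] if i < len(script) else "<eos>"

for cap in (3, 10):
    out, reason = generate(scripted_model, max_tokens=cap)
    print(cap, repr("".join(out)), reason)
```

With a cap of 3 the loop returns a sentence fragment and reports `"length"`. With a cap of 10 the scripted model reaches its own end and reports `"stop"`.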


In [9]:
max_tokens_values = [10, 30, 100]

prompt_msg = [
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]

print("Max Tokens Limiting")
print("=" * 80)

for max_tok in max_tokens_values:
    response = client.chat(prompt_msg, temperature=0.7, max_tokens=max_tok)
    
    print(f"\nMax tokens = {max_tok}")
    print(f"Output ({response['usage']['completion_tokens_actual']} tokens):")
    print(response['text'])
    print("-" * 80)


Max Tokens Limiting

Max tokens = 10
Output (10 tokens):
Okay, imagine you have a light switch. A
--------------------------------------------------------------------------------

Max tokens = 30
Output (30 tokens):
Imagine a regular computer bit like a light switch: it can be either on (1) or off (0).  Quantum computers use **qubits
--------------------------------------------------------------------------------

Max tokens = 100
Output (100 tokens):
Okay, imagine regular computers use light switches: they can be either **on (1)** or **off (0)**.  These are called **bits**.

Quantum computers are different.  They use **qubits**. Think of a dimmer switch instead of a light switch.  A qubit can be:

* **On (1)**,
* **Off (0)**,
* **Or, and this is the key, *both on AND off at the same time***!
--------------------------------------------------------------------------------


## Key Takeaways

1. **Tokenization is subword-based**: One word can be multiple tokens, especially for rare words or non-English text
2. **Input vs Output tokens**: Cost is driven by input tokens (your prompt plus any context) and output tokens (the response), which providers usually price separately
3. **Temperature controls randomness**: Use 0.0 for near-deterministic output, 0.7-1.0 for creative tasks
4. **max_tokens limits length**: Caps cost, but truncates the response at the limit rather than making it more concise

**Next:** `02_prompt_structure_patterns.ipynb` explores structured prompt engineering patterns.
