## Tokenizer Lab: Understanding How Text Becomes Tokens

**Goals**
- Learn what a tokenizer is and how it maps text to token IDs
- Inspect tokens, IDs, and special tokens from the provided tokenizer in `tokenizer/`
- Run experiments to see how spaces, punctuation, casing, and numbers affect tokenization
- Complete an assignment: Convert your student ID into tokens and analyze the results

You will use `transformers.AutoTokenizer` to load the local tokenizer used by the RocLM demo (`roc_demo.py`).


In [None]:
# Setup: load tokenizer from local directory
from transformers import AutoTokenizer
from pathlib import Path

# In notebooks, __file__ is not defined; use current working directory
TOKENIZER_PATH = (Path.cwd() / "tokenizer").resolve()
print(f"Loading tokenizer from: {TOKENIZER_PATH}")

assert TOKENIZER_PATH.exists(), f"Tokenizer directory not found at {TOKENIZER_PATH}"

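# trust_remote_code=True allows custom tokenizer code in that directory to run; only use it with files you trust
tokenizer = AutoTokenizer.from_pretrained(str(TOKENIZER_PATH), trust_remote_code=True)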
print("Loaded tokenizer!\n")

# Display basic info
print("Special tokens:")
print(tokenizer.special_tokens_map)
# len(tokenizer) counts added tokens too; tokenizer.vocab_size is the base vocabulary only
print("\nVocab size:", len(tokenizer))


### 1) Basics: Encode and Decode

Use the tokenizer to convert text to token IDs and back.
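- `tokenizer.encode(text)` → list of integers (token IDs); by default this includes any special tokens the tokenizer is configured to add (pass `add_special_tokens=False` to omit them)
- `tokenizer.decode(ids)` → string
- `tokenizer(text, return_tensors="pt")` → dict-like object of PyTorch tensors (typically `input_ids` and `attention_mask`) that can be passed to a model; requires PyTorch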
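
As a rough mental model of what encode and decode do, here is a minimal sketch built on a tiny hand-made vocabulary. Everything in it (`TOY_VOCAB`, `toy_encode`, `toy_decode`, the `<bos>`/`<eos>` strings, and the ID values) is hypothetical and chosen for illustration; the lab tokenizer uses a much larger learned vocabulary and will produce different IDs.

```python
# Toy illustration only: a hand-made vocabulary, not the lab tokenizer's.
TOY_VOCAB = {"<bos>": 0, "<eos>": 1, "<unk>": 2,
             "Hello": 3, "hello": 4, "world": 5, "!": 6}
ID_TO_TOKEN = {i: t for t, i in TOY_VOCAB.items()}
SPECIAL_IDS = {TOY_VOCAB["<bos>"], TOY_VOCAB["<eos>"]}


def toy_encode(text, add_special_tokens=False):
    """Split on whitespace, peel off a trailing '!', and look up each piece."""
    pieces = []
    for word in text.split():
        if word.endswith("!") and len(word) > 1:
            pieces += [word[:-1], "!"]
        else:
            pieces.append(word)
    # Pieces missing from the vocabulary fall back to the <unk> ID.
    ids = [TOY_VOCAB.get(p, TOY_VOCAB["<unk>"]) for p in pieces]
    if add_special_tokens:
        ids = [TOY_VOCAB["<bos>"]] + ids + [TOY_VOCAB["<eos>"]]
    return ids


def toy_decode(ids, skip_special_tokens=True):
    """Map IDs back to token strings and join them with single spaces."""
    tokens = [ID_TO_TOKEN[i] for i in ids
              if not (skip_special_tokens and i in SPECIAL_IDS)]
    return " ".join(tokens)


print(toy_encode("Hello world!"))                            # [3, 5, 6]
print(toy_encode("hello world!"))                            # [4, 5, 6]
print(toy_encode("Hello world!", add_special_tokens=True))   # [0, 3, 5, 6, 1]
print(toy_decode([0, 3, 5, 6, 1]))                           # Hello world !
```

Note that the toy round trip is lossy (`"Hello world!"` comes back as `"Hello world !"`). When you run the real tokenizer, check whether its decoded text matches the input exactly.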

Run the following cell and inspect the outputs.


In [None]:
sample_texts = [
    "Hello world!",
    "hello world!",
    "Hello,   world!",  # extra spaces
    "AI/ML & NLP: 2025",
]

for text in sample_texts:
    ids = tokenizer.encode(text)
    back = tokenizer.decode(ids)
    print("Text:", repr(text))
    print("Token IDs:", ids)
    print("Decoded:", repr(back))
    print("-")

# Tensor form
encoded = tokenizer("Hello world!", return_tensors="pt")
print("Tensor keys:", list(encoded.keys()))
for k, v in encoded.items():
    print(k, v.shape, v)


### 2) Inspect Tokens and Special Tokens

Let's inspect how the tokenizer represents tokens and special tokens.
- View the tokens behind the first few IDs and behind a random sample of other IDs
- Inspect special tokens like BOS/EOS, PAD, etc.
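
To see why a PAD token exists at all, the sketch below pads ID sequences of different lengths into a rectangular batch and builds the matching attention mask. The helper `pad_batch`, the value `PAD_ID = 0`, and the example IDs are assumptions made for illustration; the real pad ID is whatever `tokenizer.pad_token_id` reports, and some tokenizers define no pad token at all.

```python
PAD_ID = 0  # hypothetical pad ID; check tokenizer.pad_token_id for the real one


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad ID lists to equal length and build a matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq))
                      for seq in sequences]
    return input_ids, attention_mask


ids, mask = pad_batch([[11, 12, 13], [21]])
print(ids)   # [[11, 12, 13], [21, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```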


In [None]:
from random import sample

print("Special tokens map:")
for k, v in tokenizer.special_tokens_map.items():
    print(f"  {k}: {repr(v)} -> id {tokenizer.convert_tokens_to_ids(v)}")

print("\nAll special tokens:")
print(tokenizer.all_special_tokens)
print("All special ids:")
print(tokenizer.all_special_ids)

print("\nExamples from vocab:")
example_ids = list(range(10)) + sample(range(100, len(tokenizer)), 10)
for tid in example_ids:
    tok = tokenizer.convert_ids_to_tokens(tid)
    print(f"  id {tid:6d} -> token {repr(tok)}")


### 3) Experiments: What Affects Tokenization?

Explore how the tokenizer handles:
- Spaces: single vs multiple
- Punctuation: commas, slashes, hyphens
- Casing: `Hello` vs `hello`
- Numbers: `2025`, `000123`, phone-like strings
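
Many subword tokenizers first cut text into coarse chunks (pre-tokenization) before applying subword merges, which is one reason spaces and punctuation change the result. The sketch below uses a simplified, ASCII-only pattern loosely modeled on GPT-2-style pre-tokenization. The pattern and the `pre_tokenize` helper are assumptions for illustration; the lab tokenizer may use different rules.

```python
import re

# Simplified, ASCII-only pattern (an assumption, not the lab tokenizer's rule):
# an optional leading space attaches to the letters, digits, or punctuation after it.
PRE_TOKEN_PATTERN = re.compile(
    r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"
)


def pre_tokenize(text):
    """Split text into chunks that a subword model would then process separately."""
    return PRE_TOKEN_PATTERN.findall(text)


print(pre_tokenize("Hello world!"))    # ['Hello', ' world', '!']
print(pre_tokenize("Hello  world!"))   # ['Hello', ' ', ' world', '!']
print(pre_tokenize("(585) 555-1234"))  # ['(', '585', ')', ' 555', '-', '1234']
```

Under these rules `"world"` and `" world"` are different chunks, so a word can receive a different token ID depending on whether a space precedes it.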

Run the cell below and compare the token IDs and token counts across the cases.


In [None]:
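def show_tokenization(text: str) -> None:
    """Print the token IDs, token strings, and token count for `text`."""
    ids = tokenizer.encode(text)
    # convert_ids_to_tokens accepts a whole list, so no per-ID loop is needed
    toks = tokenizer.convert_ids_to_tokens(ids)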
    print(f"Text: {repr(text)}")
    print("IDs:", ids)
    print("TOK:", toks)
    print("Len:", len(ids))
    print("-")

cases = [
    "Hello world!",
    "Hello  world!",  # double space
    "Hello, world!",
    "Hello-world",
    "AI/ML",
    "hello",
    "Hello",
    "2025",
    "000123",
    "(585) 555-1234",
]

for c in cases:
    show_tokenization(c)


### 4) Assignment: Tokenize Your Student ID

Follow the instructions below and complete the analysis.

1. Enter your student ID as a string (e.g., `"p1234567"` or numeric-only if that's your format).
2. Tokenize it using the local tokenizer.
3. Report:
   - The token IDs
   - The tokens (string form)
   - The number of tokens
4. Analyze:
   - Does your ID tokenize into a single token or multiple?
   - If multiple, why do you think the splits happen where they do?
   - Try variants: add a prefix like `"id:"`, add spaces, or change the casing. What changes?

Fill in the next cell and write your observations in the markdown cell after that.


In [None]:
# === Student Work ===
# Replace with your own student ID string
student_id = "p1234567"  # <-- change me

ids = tokenizer.encode(student_id)
toks = tokenizer.convert_ids_to_tokens(ids)

print("Student ID:", student_id)
print("Token IDs:", ids)
print("Tokens:", toks)
print("Num tokens:", len(ids))

# Try some variants
variants = [
    student_id,
    f"id:{student_id}",
    f"ID:{student_id}",
    f"{student_id} ",  # trailing space
    f" {student_id}",  # leading space
]

print("\nVariants:")
for v in variants:
    v_ids = tokenizer.encode(v)
    v_toks = tokenizer.convert_ids_to_tokens(v_ids)
    print(f"{repr(v)} -> {v_ids} -> {v_toks}")


#### Analysis (write your observations here)

- Is your ID a single token or multiple? Why?
- How do prefixes (`id:` vs `ID:`) change the result?
- What effect do leading/trailing spaces have?
- Relate this to subword tokenization (e.g., BPE/WordPiece) and vocabulary coverage.
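
To make the last point concrete, here is a minimal sketch of how BPE-style merging can leave an ID string in several pieces. The merge table `TOY_MERGES` and the helper `toy_bpe` are hypothetical; a real tokenizer learns thousands of merges from its training corpus, and the lab tokenizer may not use BPE at all.

```python
def toy_bpe(word, merges):
    """Apply ranked merges to a word, starting from single characters."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the lowest (earliest-learned) rank.
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge applies, so the split is final
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge table: common letter sequences merge fully, digits barely.
TOY_MERGES = [("h", "e"), ("l", "l"), ("he", "ll"), ("hell", "o"),
              ("1", "2"), ("12", "3")]

print(toy_bpe("hello", TOY_MERGES))     # ['hello']
print(toy_bpe("Hello", TOY_MERGES))     # ['H', 'e', 'll', 'o']
print(toy_bpe("p1234567", TOY_MERGES))  # ['p', '123', '4', '5', '6', '7']
```

In this toy setup a frequent lowercase word collapses into one token, while the capitalized form and the ID-like string stay fragmented because the merge table has few entries covering them. Compare this with what you observed for your own ID.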
