Authored by: Aryan Mistry

# Tokenization, Inference and Text Completion

Tokenization breaks a string into units (tokens) that a language model can process. Inference uses a trained model to compute outputs such as next‑token probabilities, and *completion* refers to generating additional text given a prompt. In this expanded lab you will practise tokenization at the word and character level, build n‑gram models for completion, and see how pre‑trained models are used (in a reference-only code block). [3]

## 1 – Tokenization

Different tokenization schemes affect how models interpret text:

- **Whitespace or word tokenization** splits on spaces and punctuation. It is simple but struggles with contractions and unknown words.
- **Subword tokenization** (e.g. Byte‑Pair Encoding, WordPiece, SentencePiece) segments rare words into smaller units, allowing a manageable vocabulary while still capturing word structure.
- **Character tokenization** treats each character as a token and is robust to unseen words, but sequences become much longer.
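
The lab itself only implements word and character tokenizers, so as an aside, here is a minimal sketch of how Byte‑Pair Encoding learns its merges: it repeatedly finds the most frequent adjacent pair of symbols and fuses it into a new symbol. The function name `learn_bpe` and the toy word list are hypothetical, and production tokenizers add details (end-of-word markers, byte-level fallback, tie-breaking rules) that are omitted here.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to num_merges merge rules from a list of words (toy sketch)."""
    # Each word starts as a tuple of single characters, with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

toy_words = ['low', 'low', 'lower', 'lowest', 'newer', 'newest']
merges, vocab = learn_bpe(toy_words, num_merges=3)
print('Learned merges:', merges)
print('Segmented words:', list(vocab))
```

Frequent fragments become single tokens after a few merges, while rarer words stay split into smaller pieces, which is the trade-off described in the list above.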

Below we implement simple word and character tokenizers using regular expressions and Python string methods. [3]

In [None]:
import re

def word_tokenize(text: str):
    return re.findall(r'\b\w+\b', text.lower())

def char_tokenize(text: str):
    return list(text.lower())

example = 'Tokenization is the first step in building language models!'
print('Word tokens:', word_tokenize(example))
print('Character tokens:', char_tokenize(example)[:20])  # show first 20 characters

Word tokens: ['tokenization', 'is', 'the', 'first', 'step', 'in', 'building', 'language', 'models']
Character tokens: ['t', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n', ' ', 'i', 's', ' ', 't', 'h', 'e', ' ']
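
To make the contraction problem concrete, the sketch below compares the lab's word pattern with a hypothetical variant that allows apostrophes inside a word. The names `word_tokenize_simple` and `word_tokenize_contractions` are illustrative, and the variant only handles the straight apostrophe `'`, not typographic quotes.

```python
import re

def word_tokenize_simple(text: str):
    # Same pattern as the lab's word_tokenize: runs of word characters only.
    return re.findall(r'\b\w+\b', text.lower())

def word_tokenize_contractions(text: str):
    # Hypothetical variant: a word may continue across an internal apostrophe.
    return re.findall(r"\w+(?:'\w+)*", text.lower())

sample = "Don't split contractions, they're single words."
print(word_tokenize_simple(sample))
print(word_tokenize_contractions(sample))
```

The first tokenizer breaks "don't" into `don` and `t`, whereas the variant keeps it as one token.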


### Exercise: Token Frequencies

Using the `word_tokenize` function, compute the frequency of each word in the following short paragraph. Display the 5 most common words and their counts.

In [None]:
paragraph = (
    'Large language models have revolutionized natural language processing. ' +
    'These models are trained on vast amounts of text data and can generate coherent sentences. ' +
    'However, they still rely on basic tokenization techniques to break text into manageable pieces.'
)

# TODO: count word frequencies and display the top 5
# from collections import Counter
# tokens = word_tokenize(paragraph)
# word_counts = Counter(tokens)
# top5 = word_counts.most_common(5)
# print(top5)


## 2 – Building a Markov Completion Model

We revisit the Markov (bigram) model as a simple way to perform next‑word inference and completion. The functions below train a bigram model and generate completions. [7]

In [None]:
from collections import defaultdict, Counter
import random

def train_bigram(tokens):
    """Estimate P(next word | current word) from adjacent token pairs."""
    counts = defaultdict(Counter)
    for i in range(len(tokens)-1):
        counts[tokens[i]][tokens[i+1]] += 1
    # Normalize each word's successor counts into probabilities
    probs = {}
    for w1, counter in counts.items():
        total = sum(counter.values())
        probs[w1] = {w2: c/total for w2, c in counter.items()}
    return probs

def generate_bigram(probs, start, length=10):
    """Sample up to `length` words, beginning with `start`."""
    current = start.lower()
    words = [current]
    for _ in range(length-1):
        next_dict = probs.get(current)
        if not next_dict:
            break  # the current word was never followed by anything in training
        candidates, p = zip(*next_dict.items())
        current = random.choices(candidates, weights=p)[0]
        words.append(current)
    return ' '.join(words)

# Train on the paragraph tokens
# tokens = word_tokenize(paragraph)
# bigram_probs = train_bigram(tokens)
# print('Completion starting from "models":', generate_bigram(bigram_probs, 'models'))

### Exercise: Trigram Model and Perplexity

1. **Trigram Model:** Modify the training function to build a trigram model (conditioning on two previous words). Use it to generate text and compare the coherence to the bigram model.
2. **Perplexity:** Implement a function that computes the perplexity of a given test sentence under your bigram model. Recall that perplexity for a sentence \(w_1, \dots, w_T\) is defined as \(\exp\left(-\frac{1}{T}\sum_{t} \log P(w_t \mid w_{t-1})\right)\).
3. **Character vs Word Models:** Build a character‑level trigram model using `char_tokenize`. Compare its generated sequences to those of the word‑level trigram model.

In [None]:
# TODO: implement trigram training and perplexity calculations here
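
As a starting point for the perplexity task, here is one possible sketch rather than a reference solution. It makes two assumptions that are not part of the lab's bigram model: add-alpha smoothing, so that unseen bigrams do not produce `log(0)`, and averaging over the observed transitions (one fewer than the number of tokens), since no start-of-sentence symbol is used. The names `train_bigram_counts` and `bigram_perplexity` are hypothetical.

```python
import math
from collections import defaultdict, Counter

def train_bigram_counts(tokens):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for w1, w2 in zip(tokens, tokens[1:]):
        counts[w1][w2] += 1
    return counts

def bigram_perplexity(counts, vocab_size, sentence_tokens, alpha=1.0):
    """Perplexity of a token list under an add-alpha smoothed bigram model."""
    log_prob_sum = 0.0
    n = 0
    for w1, w2 in zip(sentence_tokens, sentence_tokens[1:]):
        context = counts.get(w1, {})
        total = sum(context.values())
        # Smoothed estimate of P(w2 | w1); alpha is an assumption of this sketch.
        p = (context.get(w2, 0) + alpha) / (total + alpha * vocab_size)
        log_prob_sum += math.log(p)
        n += 1
    if n == 0:
        raise ValueError('need at least two tokens to score a sentence')
    return math.exp(-log_prob_sum / n)

train_tokens = 'the cat sat on the mat the cat ate'.split()
counts = train_bigram_counts(train_tokens)
vocab_size = len(set(train_tokens))

pp_seen = bigram_perplexity(counts, vocab_size, ['the', 'cat', 'sat'])
pp_unseen = bigram_perplexity(counts, vocab_size, ['mat', 'sat', 'ate'])
print('Seen word order:  ', pp_seen)
print('Unseen word order:', pp_unseen)
```

A sentence whose word pairs appear in the training text should score a lower perplexity than one built from pairs the model has never seen.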


## 3 – Using Pre‑trained Models (Reference)

Modern language models use subword tokenization and consider long contexts. To use a pre‑trained model such as GPT‑2 or GPT‑Neo, install the `transformers` library and run the code below. Because this environment does not include the package or weights, the following cell is provided as a reference only. [4]

In [None]:
# Reference code for using a Hugging Face model for completion
# !pip install transformers torch
# from transformers import AutoTokenizer, AutoModelForCausalLM
# import torch

# model_name = 'gpt2'
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = AutoModelForCausalLM.from_pretrained(model_name)

# prompt = 'The history of artificial intelligence begins'
# inputs = tokenizer(prompt, return_tensors='pt')
# max_length counts the prompt tokens plus the generated tokens
# outputs = model.generate(**inputs, max_length=30, do_sample=True, top_k=50, top_p=0.95)
# print(tokenizer.decode(outputs[0], skip_special_tokens=True))
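
The `top_k` and `top_p` arguments in the reference cell restrict sampling to the most likely next tokens. The sketch below illustrates the idea in plain Python on a made-up distribution. It is not how `transformers` implements it (the library works on logit tensors), and the function name `top_k_top_p_filter` and the toy probabilities are hypothetical.

```python
import random

def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Keep the top_k most likely tokens, then the smallest set of those whose
    cumulative probability reaches top_p, and renormalize what is left."""
    # Step 1: top-k, keep only the k highest-probability tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    k_total = sum(p for _, p in ranked)
    ranked = [(token, p / k_total) for token, p in ranked]
    # Step 2: top-p (nucleus), stop once the cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Made-up next-token distribution for illustration only.
next_token_probs = {'the': 0.5, 'a': 0.3, 'an': 0.15, 'zebra': 0.05}
filtered = top_k_top_p_filter(next_token_probs, top_k=3, top_p=0.8)
print('Filtered distribution:', filtered)

# Sample the next token from the filtered distribution.
choice = random.choices(list(filtered), weights=list(filtered.values()))[0]
print('Sampled token:', choice)
```

Lower values of `top_k` or `top_p` make the output more predictable, while higher values admit more unlikely tokens and so more varied text.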

Foundational LLMs & Transformers

1. Vaswani, A., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NIPS 2017).
2. Brown, T. B., et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020.
3. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019.
4. OpenAI (2023). GPT-4 Technical Report. arXiv:2303.08774.
5. Touvron, H., et al. (2023). LLaMA 2: Open Foundation and Fine-Tuned Chat Models. Meta AI.

Generative AI & Sampling

6. Goodfellow, I., et al. (2014). Generative Adversarial Nets. NeurIPS 2014.
7. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
8. Neal, R. M. (1993). Probabilistic Inference Using Markov Chain Monte Carlo Methods. Technical Report CRG-TR-93-1, University of Toronto.

Retrieval-Augmented Generation (RAG) & Knowledge Grounding

9. Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP. NeurIPS 2020.
10. deepset ai (2023). Haystack: Open-Source Framework for Search and RAG Applications. https://haystack.deepset.ai
11. LangChain (2023). LangChain Documentation and Cookbook. https://python.langchain.com

Evaluation & Safety

12. Papineni, K., et al. (2002). BLEU: A Method for Automatic Evaluation of Machine Translation. ACL 2002.
13. Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. ACL Workshop 2004.
14. OpenAI (2024). Evaluating Model Outputs: Faithfulness and Grounding. OpenAI Docs.
15. Guardrails AI (2024). Open-Source Guardrails Framework. https://github.com/shreyar/guardrails

Prompt Engineering & Instruction Tuning

16. White, J. (2023). The Prompting Guide. https://www.promptingguide.ai
17. Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback. NeurIPS 2022.

Agents & Tool Use

18. Yao, S., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629.
19. LangChain (2024). LangChain Agents and Tools Documentation.
20. Microsoft (2023). Semantic Kernel Developer Guide. https://learn.microsoft.com/en-us/semantic-kernel/
21. Google DeepMind (2024). Gemini Technical Report. arXiv:2312.11805.

State, Memory & Orchestration

22. LangGraph (2024). Stateful Agent Orchestration Framework. https://langchain-langgraph.vercel.app
23. Park, J. S., et al. (2023). Generative Agents: Interactive Simulacra of Human Behavior. arXiv:2304.03442.

Pedagogical and Course Design References

24. fast.ai (2023). fast.ai Deep Learning Course Notebooks. https://course.fast.ai
25. Ng, A. (2023). DeepLearning.AI Short Courses on Generative AI.
26. MIT 6.S191, Stanford CS324, UC Berkeley CS294-158. (2022–2024). Course Materials and Public Notebooks for ML and LLMs.