# Project 1: Build an LLM Playground

Welcome to your first project! In this project, you'll build a simple large language model (LLM) playground, an interactive environment where you can experiment with LLMs and understand how they work under the hood.

The goal here is to understand the foundations and mechanics behind LLMs rather than relying on higher-level abstractions or frameworks. You'll see what happens ‚Äúunder the hood‚Äù, how an LLM receives a text, processes it, and generate a response. In later projects, you'll use frameworks like Ollama and LangChain that simplify many of these steps. But before that, this project will help you build a solid mental model of how LLMs actually work.

We'll use Google Colab, a free browser-based platform that lets you run Python code and machine learning models without installing anything locally. Click the button below to open this notebook in Colab.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/bytebyteai/ai-eng-projects-2/blob/main/project_1/lm_playground.ipynb)

If you prefer to run the project locally, you can use the provided `env.yaml` file to create a compatible environment using conda. To do so, open a terminal in the same directory as this notebook and run:

```bash
# Create and activate the conda environment
conda env create -f env.yaml && conda activate llm_playground

# Register this environment as a Jupyter kernel
python -m ipykernel install --user --name=llm_playground --display-name "llm_playground"
```


---
## Learning Objectives  
- Understand tokenization and how raw text is converted into a sequence of discrete tokens
- Inspect GPT-2 and the Transformer architecture
- Learn how to load pretrained LLMs using Hugging Face
- Explore decoding strategies to generate text from LLMs
- Compare completion models with instruction-tuned models


Let's get started!

In [1]:
# Confirm required libraries are installed and working.
import torch, transformers, tiktoken
print("torch", torch.__version__, "| transformers", transformers.__version__)
print("‚úÖ Environment check complete. You're good to go!")

torch 2.8.0+cu126 | transformers 4.57.1
‚úÖ Environment check complete. You're good to go!


# 1 - Tokenization

A neural network cannot process raw text directly. It needs numbers.
Tokenization is the process of converting text into numerical IDs that models can understand. In this section, you will learn how tokenization works in practice and why it is an essential step in every language model pipeline.

Tokenization methods generally fall into three main categories:
1. Word-level
2. Character-level
3. Subword-level

### 1.1 - Word-level tokenization
This method splits text by whitespace and treats each word as a single token. In the next cell, you will implement a basic word-level tokenizer by building a vocabulary that maps words to IDs and writing `encode` and `decode` functions.

In [2]:
# Creating a tiny corpus. In practice, a corpus is generally the entire internet-scale dataset used for training.
corpus = [
    "The quick brown fox jumps over the lazy dog",
    "Tokenization converts text to numbers",
    "Large language models predict the next token"
]

# Step 1: Build vocabulary (all unique words in the corpus) and mappings
vocab = []      # Will store all unique words from the corpus
word2id = {}    # Dictionary mapping word -> numerical ID
id2word = {}    # Dictionary mapping numerical ID -> word

# Build vocabulary from corpus
all_words = []  # Temporary list to collect all words

# Extract all words from each sentence in corpus
for sentence in corpus:
    # Split sentence into words by whitespace and convert to lowercase for consistency
    words = sentence.lower().split()
    # Add words to our temporary list
    all_words.extend(words)

# Get unique words and sort for consistent ordering
unique_words = sorted(set(all_words))

# Add special tokens first - important for handling edge cases
vocab.append('[UNK]')  # Unknown token for words not in vocabulary
vocab.append('[PAD]')  # Padding token for sequence alignment

# Add all unique words from corpus to vocabulary
vocab.extend(unique_words)

# Build mapping dictionaries
for idx, word in enumerate(vocab):
    word2id[word] = idx      # Map word -> ID
    id2word[idx] = word      # Map ID -> word

print(f"Vocabulary size: {len(vocab)} words")
print("First 15 vocab entries:", vocab[:15])
print("word2id mapping:", word2id)


Vocabulary size: 21 words
First 15 vocab entries: ['[UNK]', '[PAD]', 'brown', 'converts', 'dog', 'fox', 'jumps', 'language', 'large', 'lazy', 'models', 'next', 'numbers', 'over', 'predict']
word2id mapping: {'[UNK]': 0, '[PAD]': 1, 'brown': 2, 'converts': 3, 'dog': 4, 'fox': 5, 'jumps': 6, 'language': 7, 'large': 8, 'lazy': 9, 'models': 10, 'next': 11, 'numbers': 12, 'over': 13, 'predict': 14, 'quick': 15, 'text': 16, 'the': 17, 'to': 18, 'token': 19, 'tokenization': 20}


In [4]:
# Step 2: Define encode and decode functions
def encode(text):
    # converts text to token IDs
    # Input: string of text
    # Output: list of numerical token IDs

    # Convert text to lowercase and split into words
    words = text.lower().split()
    token_ids = []

    for word in words:
        # If word is in vocabulary, use its ID, otherwise use [UNK] token
        if word in word2id:
            token_ids.append(word2id[word])
        else:
            token_ids.append(word2id['[UNK]'])  # Handle unknown words

    return token_ids



def decode(ids):
    # converts token IDs back to text
    """
    YOUR CODE HERE (~1-5 lines of code)
    """
    # converts token IDs back to text
    # Input: list of numerical token IDs
    # Output: reconstructed text string

    words = []
    for token_id in ids:
        # Convert each ID back to word
        if token_id in id2word:
            words.append(id2word[token_id])
        else:
            words.append('[UNK]')  # Handle invalid IDs

    # Join words with spaces to reconstruct original text
    return ' '.join(words)

In [5]:
# Step 3: Test your tokenizer with random sentences.
# Try a sentence with unseen words and see what happens (and how to fix it)

"""
YOUR CODE HERE
"""

print("\n=== Testing Tokenizer ===")

# Test 1: Sentence from corpus (should work perfectly)
test_sentence1 = "the quick brown fox"
encoded1 = encode(test_sentence1)
decoded1 = decode(encoded1)
print(f"Test 1 - Known sentence:")
print(f"  Original: '{test_sentence1}'")
print(f"  Encoded: {encoded1}")
print(f"  Decoded: '{decoded1}'")

# Test 2: Sentence with unseen words (demonstrates OOV problem)
test_sentence2 = "the elephant runs fast"  # "elephant" and "runs" not in original corpus
encoded2 = encode(test_sentence2)
decoded2 = decode(encoded2)
print(f"\nTest 2 - Unknown words:")
print(f"  Original: '{test_sentence2}'")
print(f"  Encoded: {encoded2}")
print(f"  Decoded: '{decoded2}'")
print(f"  Note: 'elephant' and 'runs' became '[UNK]' - OOV problem!")

# Test 3: Mixed case and punctuation (shows limitations)
test_sentence3 = "The Quick BROWN fox!"
encoded3 = encode(test_sentence3)
decoded3 = decode(encoded3)
print(f"\nTest 3 - Case handling:")
print(f"  Original: '{test_sentence3}'")
print(f"  Encoded: {encoded3}")
print(f"  Decoded: '{decoded3}'")

# Test 4: Show vocabulary mapping
print(f"\nVocabulary mapping examples:")
for i, (word, idx) in enumerate(list(word2id.items())[:8]):
    print(f"  '{word}' -> {idx}")


=== Testing Tokenizer ===
Test 1 - Known sentence:
  Original: 'the quick brown fox'
  Encoded: [17, 15, 2, 5]
  Decoded: 'the quick brown fox'

Test 2 - Unknown words:
  Original: 'the elephant runs fast'
  Encoded: [17, 0, 0, 0]
  Decoded: 'the [UNK] [UNK] [UNK]'
  Note: 'elephant' and 'runs' became '[UNK]' - OOV problem!

Test 3 - Case handling:
  Original: 'The Quick BROWN fox!'
  Encoded: [17, 15, 2, 0]
  Decoded: 'the quick brown [UNK]'

Vocabulary mapping examples:
  '[UNK]' -> 0
  '[PAD]' -> 1
  'brown' -> 2
  'converts' -> 3
  'dog' -> 4
  'fox' -> 5
  'jumps' -> 6
  'language' -> 7


While word-level tokenization is simple and easy to understand, it has two key limitations that make it impractical for large-scale models:
1.  large vocabulary size: every new word or variation (for example, run, runs, running) increases the total vocabulary, leading to higher memory and training costs.
2. Out-of-vocabulary (OOV) problem: the model cannot handle unseen or rare words that were not part of the training vocabulary, so they must be replaced with a generic [UNK] token.

The next section introduces character-level tokenization, where text is represented as individual characters instead of words.

### 1.2 - Character-level tokenization

In this approach, every single character (including spaces, punctuation, and even emojis) is assigned its own ID.

In the next section, we will rebuild a tokenizer using the same corpus as before, but this time with a character-level approach.
For simplicity, assume we are only using lowercase and uppercase English letters (a-z, A-Z).

In [6]:
import string

# Step 1: Create a vocabulary that includes all uppercase and lowercase letters.
vocab = []
char2id = {}
id2char = {}
"""
YOUR CODE HERE (~5 lines of code)
"""
# Build character vocabulary
# Add special tokens first
vocab.append('[UNK]')  # For any characters not in our defined set
vocab.append('[PAD]')  # Padding token

# Add all uppercase and lowercase English letters
# string.ascii_letters gives 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
vocab.extend(list(string.ascii_letters))

# Build mapping dictionaries
for idx, char in enumerate(vocab):
    char2id[char] = idx      # Map character -> ID
    id2char[idx] = char      # Map ID -> character

print(f"Vocabulary size: {len(vocab)} (52 letters + 2 specials)")
print("Vocabulary:", vocab)


Vocabulary size: 54 (52 letters + 2 specials)
Vocabulary: ['[UNK]', '[PAD]', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']


In [7]:
# Step 2: Implement encode() and decode() functions to convert between text and IDs.
def encode(text):
    # convert text to list of IDs
    """
    YOUR CODE HERE (~2-5 lines of code)
    """
    # convert text to list of IDs
    token_ids = []
    for char in text:
        # If character is in vocabulary, use its ID, otherwise use [UNK]
        if char in char2id:
            token_ids.append(char2id[char])
        else:
            token_ids.append(char2id['[UNK]'])  # Handle unknown characters
    return token_ids


def decode(ids):
    # Convert list of IDs to text
    """
    YOUR CODE HERE (~2-5 lines of code)
    """
    # Convert list of IDs to text
    chars = []
    for token_id in ids:
        # Convert each ID back to character
        if token_id in id2char:
            chars.append(id2char[token_id])
        else:
            chars.append('[UNK]')  # Handle invalid IDs
    # Join characters to reconstruct original text
    return ''.join(chars)

In [8]:
# Step 3: Test your tokenizer on a short sample word.
"""
YOUR CODE HERE (~2-5 lines of code)
"""
print("\n=== Testing Character-Level Tokenizer ===")

# Test with mixed case
test_text = "Hello"
encoded = encode(test_text)
decoded = decode(encoded)

print(f"Original text: '{test_text}'")
print(f"Encoded IDs: {encoded}")
print(f"Decoded text: '{decoded}'")

# Show what each character maps to
print(f"\nCharacter mappings in 'Hello':")
for char in test_text:
    print(f"  '{char}' -> {char2id[char]}")

# Test with punctuation (should become [UNK])
test_text2 = "Hello, World!"
encoded2 = encode(test_text2)
decoded2 = decode(encoded2)

print(f"\nOriginal text: '{test_text2}'")
print(f"Encoded IDs: {encoded2}")
print(f"Decoded text: '{decoded2}'")
print("Note: Punctuation marks ',' and '!' became [UNK] tokens")

# Compare sequence length with word-level
word_level_tokens = len(test_text.split())  # Would be 1 for "Hello"
char_level_tokens = len(encoded)            # Is 5 for "Hello"
print(f"\nSequence length comparison for '{test_text}':")
print(f"  Word-level: {word_level_tokens} token(s)")
print(f"  Character-level: {char_level_tokens} tokens")


=== Testing Character-Level Tokenizer ===
Original text: 'Hello'
Encoded IDs: [35, 6, 13, 13, 16]
Decoded text: 'Hello'

Character mappings in 'Hello':
  'H' -> 35
  'e' -> 6
  'l' -> 13
  'l' -> 13
  'o' -> 16

Original text: 'Hello, World!'
Encoded IDs: [35, 6, 13, 13, 16, 0, 0, 50, 16, 19, 13, 5, 0]
Decoded text: 'Hello[UNK][UNK]World[UNK]'
Note: Punctuation marks ',' and '!' became [UNK] tokens

Sequence length comparison for 'Hello':
  Word-level: 1 token(s)
  Character-level: 5 tokens


Character-level tokenization solves the out-of-vocabulary problem but introduces new challenges:

1. Longer sequences: because each word becomes many tokens, models need to process much longer inputs.
2. Weaker semantic representation: individual characters carry very little meaning, so models must learn relationships across many steps.
3. Higher computational cost: longer sequences lead to more tokens per input, which increases training and inference time.

To find a better balance between vocabulary size and sequence length, we move to subword-level tokenization next.

### 1.3 - Subword-level tokenization

Sub-word methods such as `Byte-Pair Encoding (BPE)`, `WordPiece`, and `SentencePiece` **learn** common groups of characters and merge them into tokens. For example, the word **unbelievable** might turn into three tokens: **["un", "believ", "able"]**. This approach strikes a balance between word-level and character-level methods and fix their limitations.

The BPE algorithm builds a vocabulary iteratively using the following process:
1. Start with individual characters (each character is a token).
2. Count all adjacent pairs of tokens in a large text corpus.
3. Merge the most frequent pair into a new token.

Repeat steps 2 and 3 until you reach the desired vocabulary size (for example, 50,000 tokens).

In the next cell, you will experiment with BPE in practice to see how it compresses text into meaningful subword units. Instead of implementing the algorithm from scratch, you will use a pretrained tokenizer, which was already trained on a large text corpus to build its vocabulary, such as the data used to train `GPT-2`. This allows you to see how BPE works in practice with a real, learned vocabulary.

In [10]:
from transformers import AutoTokenizer

# Step 1: Load a pretrained GPT-2 tokenizer from Hugging Face.
# Refer to this to learn more: https://huggingface.co/docs/transformers/en/model_doc/gpt2

"""
YOUR CODE HERE (~1 line of code)
"""
tokenizer = AutoTokenizer.from_pretrained("gpt2")



In [11]:
# Step 2: Use it to write encode and decode helper functions
def encode(text):
    """
    Encode text into token IDs using GPT-2 tokenizer
    """
    # The tokenizer returns a dictionary, we want the 'input_ids' which are the token IDs
    return tokenizer.encode(text)

def decode(ids):
    """
    YOUR CODE HERE (~1 line of code)
    """
    """
    Decode token IDs back to text using GPT-2 tokenizer
    """
    return tokenizer.decode(ids)

In [12]:
# 3. Inspect the tokens to see how BPE breaks words apart.
sample = "Unbelievable tokenization powers! üöÄ"
"""
YOUR CODE HERE
"""

print("=== Subword Tokenization with GPT-2 BPE ===")
print(f"Original text: '{sample}'")

# Encode the text
encoded_ids = encode(sample)
print(f"Encoded IDs: {encoded_ids}")

# Decode back to text
decoded_text = decode(encoded_ids)
print(f"Decoded text: '{decoded_text}'")

# Get the actual tokens to see how words are split
tokens = tokenizer.tokenize(sample)
print(f"Tokens: {tokens}")

# Show detailed breakdown
print(f"\nDetailed token breakdown:")
for i, (token, token_id) in enumerate(zip(tokens, encoded_ids)):
    print(f"  Token {i:2d}: '{token:10}' -> ID: {token_id:5}")

# Compare with other methods
print(f"\n=== Comparison with Other Methods ===")
print(f"Original text length: {len(sample)} characters")

# Character-level length
char_length = len(sample)
print(f"Character-level tokens: {char_length}")

# Word-level length (approximate)
word_length = len(sample.split())
print(f"Word-level tokens: {word_length}")

# Subword-level length
subword_length = len(tokens)
print(f"Subword-level tokens: {subword_length}")

# Show the benefits of subword tokenization
print(f"\n=== Benefits of Subword Tokenization ===")

# Test with complex words
complex_words = ["unbelievable", "tokenization", "reorganization", "antidisestablishmentarianism"]

for word in complex_words:
    word_tokens = tokenizer.tokenize(word)
    print(f"'{word}': {len(word_tokens)} tokens ‚Üí {word_tokens}")

# Test OOV handling
print(f"\n=== OOV Handling Test ===")
oov_text = "The quantum flux capacitor hummed rhythmically."
oov_tokens = tokenizer.tokenize(oov_text)
oov_encoded = encode(oov_text)
oov_decoded = decode(oov_encoded)

print(f"Text with potentially unknown words: '{oov_text}'")
print(f"Tokens: {oov_tokens}")
print(f"Decoded: '{oov_decoded}'")
print("Note: Even complex/uncommon words get broken into meaningful subwords!")

=== Subword Tokenization with GPT-2 BPE ===
Original text: 'Unbelievable tokenization powers! üöÄ'
Encoded IDs: [3118, 6667, 11203, 540, 11241, 1634, 5635, 0, 12520, 248, 222]
Decoded text: 'Unbelievable tokenization powers! üöÄ'
Tokens: ['Un', 'bel', 'iev', 'able', 'ƒ†token', 'ization', 'ƒ†powers', '!', 'ƒ†√∞≈Å', 'ƒº', 'ƒ¢']

Detailed token breakdown:
  Token  0: 'Un        ' -> ID:  3118
  Token  1: 'bel       ' -> ID:  6667
  Token  2: 'iev       ' -> ID: 11203
  Token  3: 'able      ' -> ID:   540
  Token  4: 'ƒ†token    ' -> ID: 11241
  Token  5: 'ization   ' -> ID:  1634
  Token  6: 'ƒ†powers   ' -> ID:  5635
  Token  7: '!         ' -> ID:     0
  Token  8: 'ƒ†√∞≈Å       ' -> ID: 12520
  Token  9: 'ƒº         ' -> ID:   248
  Token 10: 'ƒ¢         ' -> ID:   222

=== Comparison with Other Methods ===
Original text length: 35 characters
Character-level tokens: 35
Word-level tokens: 4
Subword-level tokens: 11

=== Benefits of Subword Tokenization ===
'unbelievable': 4 tokens ‚Üí

### 1.4 - TikToken

`tiktoken` is a fast, production-ready library for tokenization used by OpenAI models.
It is designed for efficiency and consistency with how OpenAI counts tokens in GPT models.

In this section, you will explore how different model families use different tokenizers. We will compare tokenizers used to train `GPT-2` and more powerful models such as `GPT-4`. By trying both, you will see how tokenization has evolved to handle more diverse text (including emojis, Unicode, and special characters) while remaining efficient.

In the next cell, you will use tiktoken to load these encodings and inspect how each one splits the same text. You may find reading this doc helpful: https://github.com/openai/tiktoken

In [2]:
import tiktoken

# Compare GPT-2 and GPT-4 tokenizers using tiktoken.

# Step 1: Load two tokenizers
# GPT-2 uses a BPE tokenizer trained on WebText data
gpt2_encoder = tiktoken.get_encoding("gpt2")

# GPT-4 uses the cl100k_base tokenizer which handles more languages and emojis better
gpt4_encoder = tiktoken.get_encoding("cl100k_base")  # Used by GPT-4, GPT-3.5-turbo

# Step 2: Encode the same sentence with both and observe how they differ
sentence = "The üåü star-programmer implemented AGI overnight."

print("=== Comparing GPT-2 vs GPT-4 Tokenizers ===")
print(f"Original sentence: '{sentence}'")
print(f"String length: {len(sentence)} characters\n")

# Encode with both tokenizers
gpt2_tokens = gpt2_encoder.encode(sentence)
gpt4_tokens = gpt4_encoder.encode(sentence)

gpt2_decoded_tokens = [gpt2_encoder.decode([token]) for token in gpt2_tokens]
gpt4_decoded_tokens = [gpt4_encoder.decode([token]) for token in gpt4_tokens]

print("GPT-2 Tokenization:")
print(f"  Token IDs: {gpt2_tokens}")
print(f"  Token count: {len(gpt2_tokens)}")
print(f"  Tokens: {gpt2_decoded_tokens}")

print("\nGPT-4 Tokenization:")
print(f"  Token IDs: {gpt4_tokens}")
print(f"  Token count: {len(gpt4_tokens)}")
print(f"  Tokens: {gpt4_decoded_tokens}")

# Show detailed comparison
print(f"\n=== Detailed Comparison ===")
print(f"{'Token':<15} {'GPT-2':<30} {'GPT-4':<30}")
print("-" * 75)
for i in range(max(len(gpt2_tokens), len(gpt4_tokens))):
    gpt2_info = f"'{gpt2_decoded_tokens[i]}' (ID:{gpt2_tokens[i]})" if i < len(gpt2_tokens) else ""
    gpt4_info = f"'{gpt4_decoded_tokens[i]}' (ID:{gpt4_tokens[i]})" if i < len(gpt4_tokens) else ""
    print(f"{f'Token {i}':<15} {gpt2_info:<30} {gpt4_info:<30}")

# Test with different types of text
test_cases = [
    "Emojis: üöÄüåüüòäüéâ",
    "Code: def hello_world(): print('Hello!')",
    "French: Bonjour le monde! √áa va?",
    "Japanese: „Åì„Çì„Å´„Å°„ÅØ‰∏ñÁïå",
    "Arabic: ŸÖÿ±ÿ≠ÿ®ÿß ÿ®ÿßŸÑÿπÿßŸÑŸÖ",
    "Special chars: 123$$$!!!###"
]

print(f"\n=== Testing Different Text Types ===")
for test_text in test_cases:
    gpt2_len = len(gpt2_encoder.encode(test_text))
    gpt4_len = len(gpt4_encoder.encode(test_text))

    print(f"\nText: '{test_text}'")
    print(f"  GPT-2 tokens: {gpt2_len}")
    print(f"  GPT-4 tokens: {gpt4_len}")
    print(f"  Difference: {gpt2_len - gpt4_len:+d} tokens")

# Show efficiency with a longer text
longer_text = """
The quick brown fox jumps over the lazy dog. ü¶äüê∂
Python programming: def calculate_sum(a, b): return a + b
Multiple languages: Hello! ‰Ω†Â•Ω! ¬°Hola! Bonjour!
Mathematical expressions: E = mc¬≤ and ‚àë(x_i) from i=1 to n
"""

gpt2_long = gpt2_encoder.encode(longer_text)
gpt4_long = gpt4_encoder.encode(longer_text)

print(f"\n=== Efficiency Comparison ===")
print(f"Longer text character count: {len(longer_text)}")
print(f"GPT-2 token count: {len(gpt2_long)}")
print(f"GPT-4 token count: {len(gpt4_long)}")
print(f"GPT-4 is {len(gpt2_long)/len(gpt4_long):.2f}x more efficient for this text")

# Show specific improvements
print(f"\n=== Key Improvements in GPT-4 Tokenizer ===")
improvement_examples = [
    "üåü", "üöÄ", "üòä", "ü¶ä", "üê∂", "√áa", "„Åì„Çì", "ŸÖÿ±ÿ≠ÿ®ÿß"
]

print(f"{'Text':<10} {'GPT-2 Tokens':<15} {'GPT-4 Tokens':<15}")
print("-" * 40)
for example in improvement_examples:
    gpt2_t = len(gpt2_encoder.encode(example))
    gpt4_t = len(gpt4_encoder.encode(example))
    print(f"'{example}':{gpt2_t:>7} tokens{gpt4_t:>7} tokens")


=== Comparing GPT-2 vs GPT-4 Tokenizers ===
Original sentence: 'The üåü star-programmer implemented AGI overnight.'
String length: 48 characters

GPT-2 Tokenization:
  Token IDs: [464, 12520, 234, 253, 3491, 12, 23065, 647, 9177, 13077, 40, 13417, 13]
  Token count: 13
  Tokens: ['The', ' ÔøΩ', 'ÔøΩ', 'ÔøΩ', ' star', '-', 'program', 'mer', ' implemented', ' AG', 'I', ' overnight', '.']

GPT-4 Tokenization:
  Token IDs: [791, 11410, 234, 253, 6917, 67120, 1195, 11798, 15432, 40, 25402, 13]
  Token count: 12
  Tokens: ['The', ' ÔøΩ', 'ÔøΩ', 'ÔøΩ', ' star', '-program', 'mer', ' implemented', ' AG', 'I', ' overnight', '.']

=== Detailed Comparison ===
Token           GPT-2                          GPT-4                         
---------------------------------------------------------------------------
Token 0         'The' (ID:464)                 'The' (ID:791)                
Token 1         ' ÔøΩ' (ID:12520)                ' ÔøΩ' (ID:11410)               
Token 2         'ÔøΩ' (ID:234

Try changing the input sentence and observe how different tokenizers behave.
Experiment with:
- Emojis, special characters, or punctuation
- Code snippets or structured text
- Non-English text (for example, Japanese, French, or Arabic)

If you are curious, you can also attempt to implement the BPE algorithm yourself using a small text corpus to see how token merges are learned in practice.

### 1.5 - Key Takeaways
- **Word-level**: simple and intuitive, but limited by large vocabularies and out-of-vocabulary issues
- **Character-level**: flexible and covers all text, but produces long sequences that are harder to model
- **Subword / BPE**: balances both worlds and is the default choice for most modern LLMs
- **TikToken**: a production-ready tokenizer used in OpenAI models, demonstrating how optimized subword vocabularies are applied in real systems

# 2. What is a Language Model?

At its core, a **language model (LM)** is just a *very large* mathematical function built from many neural-network layers.  
Given a sequence of tokens `[t‚ÇÅ, t‚ÇÇ, ‚Ä¶, t‚Çô]`, it learns to output a probability for the next token `t‚Çô‚Çä‚ÇÅ`.


Each layer performs basic mathematical operations such as matrix multiplication and attention. When hundreds of these layers are stacked together, the model learns complex patterns and statistical relationships in text. The final output is a vector of scores that represents how likely each possible token is to appear next. You can think of the entire model as one giant equation whose parameters were optimized during training to minimize prediction errors.

### 2.1 - A Single `Linear` Layer

Before jumping into Transformers, let's start with the simplest building block: a `Linear` layer.

A Linear layer computes `y = Wx + b`.

Where:  
  * `x` - input vector  
  * `W` - weight matrix (learned)  
  * `b` - bias vector (learned)

Although this operation looks simple, stacking many linear layers (along with nonlinear activation functions) allows neural networks to model highly complex relationships in data.

In the next cell, you will explore how a **Linear layer** works in practice by implementing one from scratch. You will define the weights and bias, then perform the matrix multiplication and addition manually to see what happens inside this layer. You may find the following links useful:
- https://docs.pytorch.org/docs/stable/generated/torch.nn.parameter.Parameter.html
- https://docs.pytorch.org/docs/stable/generated/torch.randn.html
- https://docs.pytorch.org/docs/stable/generated/torch.matmul.html

In [1]:
import torch
import torch.nn as nn

# Define a MyLinear PyTorch module and perform y = Wx + b.

class MyLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super(MyLinear, self).__init__()
        # Initialize weights and bias as learnable parameters.
        """
        Initialize weight matrix with shape (out_features, in_features)
        and bias vector with shape (out_features)
        """
        self.weight = nn.Parameter(torch.randn(out_features, in_features))  # W: (out, in)
        self.bias = nn.Parameter(torch.randn(out_features))                 # b: (out,)


    def forward(self, x):
        # Matrix multiplication followed by bias addition
        # Matrix multiplication followed by bias addition
        # y = Wx + b
        # x shape: (in_features,) or (batch_size, in_features)
        # W shape: (out_features, in_features)
        # b shape: (out_features,)
        # Output shape: (out_features,) or (batch_size, out_features)
        return torch.matmul(x, self.weight.T) + self.bias  # Wx + b


lin = MyLinear(3, 2)
x = torch.tensor([1.0, -1.0, 0.5])
print("Input :", x)
print("Weights:", lin.weight)
print("Bias   :", lin.bias)
print("Output :", lin(x))

Input : tensor([ 1.0000, -1.0000,  0.5000])
Weights: Parameter containing:
tensor([[-1.9507, -0.7035,  1.2750],
        [-0.4567,  0.4959, -1.0900]], requires_grad=True)
Bias   : Parameter containing:
tensor([1.2180, 2.7709], requires_grad=True)
Output : tensor([0.6083, 1.2734], grad_fn=<AddBackward0>)


Next, you will use PyTorch's built-in nn.Linear module, which performs the same computation `(y = Wx + b)` but automatically handles parameter initialization, gradient tracking, and integration with the rest of a neural network. Comparing your manual implementation with this built-in version will help you understand what a linear layer does and how deep learning frameworks make these operations easier to use.

You may find this link useful:
- https://docs.pytorch.org/docs/stable/generated/torch.nn.Linear.html

In [2]:
import torch.nn as nn, torch

# Create a linear layer using pytorch's nn.Linear
"""
YOUR CODE HERE (~1 line of code)
"""
lin_pt = nn.Linear(3, 2)  # in_features=3, out_features=2

x = torch.tensor([1.0, -1.0, 0.5])
print("Input :", x)
print("Weights:", lin.weight)
print("Bias   :", lin.bias)
print("Output :", lin(x))


Input : tensor([ 1.0000, -1.0000,  0.5000])
Weights: Parameter containing:
tensor([[-1.9507, -0.7035,  1.2750],
        [-0.4567,  0.4959, -1.0900]], requires_grad=True)
Bias   : Parameter containing:
tensor([1.2180, 2.7709], requires_grad=True)
Output : tensor([0.6083, 1.2734], grad_fn=<AddBackward0>)


### 2.2 - A `Transformer` Layer

Most LLMs are a **stack of identical Transformer blocks**. Each block fuses two main components:

| Step | What it does | Where it lives in code |
|------|--------------|------------------------|
| **Multi-Head Self-Attention** | Every token looks at every other token and decides *what matters*. | `block.attn` |
| **Feed-Forward Network (MLP)** | Re-mixes information token-by-token. | `block.mlp` |

In the next section, you will load `GPT-2` and inspect its first Transformer block to see these components in a real model. You will locate its layers, print their shapes and parameters, and understand how a block processes a batch of token embeddings.

In [1]:
import torch
from transformers import GPT2LMHeadModel, GPT2Config
import torch.nn as nn

# Step 1: load the smallest GPT-2 model (124M parameters) using the Hugging Face transformers library.
# Refer to: https://huggingface.co/docs/transformers/en/model_doc/gpt2
"""
YOUR CODE HERE (~1 line of code)
"""
model = GPT2LMHeadModel.from_pretrained("gpt2")  # This loads the 124M parameter version

print("=== GPT-2 Model Loaded ===")
print(f"Model type: {type(model)}")
print(f"Number of parameters: {sum(p.numel() for p in model.parameters()):,}")


# Step 2: # Inspect the first Transformer block one by printing it.
"""
YOUR CODE HERE (~1-2 line of code)
"""
print("\n" + "="*60)
print("=== Inspecting Transformer Blocks ===")

# GPT-2 model structure: transformer -> h -> list of blocks
transformer_blocks = model.transformer.h
print(f"Number of transformer blocks: {len(transformer_blocks)}")

# Get the first block
first_block = transformer_blocks[0]
print(f"\nFirst transformer block structure:")
print(first_block)

print(f"\nComponents of the first block:")
print(f"  - ln_1 (layer norm 1): {first_block.ln_1}")
print(f"  - attn (attention): {first_block.attn}")
print(f"  - ln_2 (layer norm 2): {first_block.ln_2}")
print(f"  - mlp (feed-forward): {first_block.mlp}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

=== GPT-2 Model Loaded ===
Model type: <class 'transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel'>
Number of parameters: 124,439,808

=== Inspecting Transformer Blocks ===
Number of transformer blocks: 12

First transformer block structure:
GPT2Block(
  (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (attn): GPT2Attention(
    (c_attn): Conv1D(nf=2304, nx=768)
    (c_proj): Conv1D(nf=768, nx=768)
    (attn_dropout): Dropout(p=0.1, inplace=False)
    (resid_dropout): Dropout(p=0.1, inplace=False)
  )
  (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (mlp): GPT2MLP(
    (c_fc): Conv1D(nf=3072, nx=768)
    (c_proj): Conv1D(nf=768, nx=3072)
    (act): NewGELUActivation()
    (dropout): Dropout(p=0.1, inplace=False)
  )
)

Components of the first block:
  - ln_1 (layer norm 1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  - attn (attention): GPT2Attention(
  (c_attn): Conv1D(nf=2304, nx=768)
  (c_proj): Conv1D(nf=768, nx=768)
  (attn_dropout):

In this section, you will run a minimal forward pass through one GPT-2 block to understand how tokens are transformed inside the model.

In [3]:
# Step 1: Create a small dummy input with a sequence of 8 random token IDs.
"""
YOUR CODE HERE (~2-3 lines of code)
"""
batch_size, seq_len = 1, 8  # Define batch size (1 sample) and sequence length (8 tokens)
dummy_input_ids = torch.randint(0, model.config.vocab_size, (batch_size, seq_len))  # Generate random token IDs within vocabulary range

# Step 2: Convert token IDs into embeddings
# GPT-2 uses two embedding layers:
#   - wte (word token embeddings)
#   - wpe (positional embeddings)
# Add them together to form the initial hidden representation of your input tokens.
"""
YOUR CODE HERE (~2-4 lines of code)
"""
word_embeddings = model.transformer.wte(dummy_input_ids)  # Convert token IDs to word embeddings using GPT-2's word token embedding layer
position_ids = torch.arange(seq_len).unsqueeze(0)  # Create position IDs [0, 1, 2, ..., 7] and add batch dimension
positional_embeddings = model.transformer.wpe(position_ids)  # Get positional embeddings using GPT-2's positional embedding layer
input_embeddings = word_embeddings + positional_embeddings  # Combine word and positional embeddings by element-wise addition

# Step 3: Pass the embeddings through a single Transformer block
# This simulates one layer of computation in GPT-2.
"""
YOUR CODE HERE (~1 line of code)
"""
first_block = model.transformer.h[0]  # Get the first transformer block from GPT-2's 12-layer stack
block_output = first_block(input_embeddings)  # Forward pass: process embeddings through attention and MLP layers

# Handle tuple output - extract the first element which is the hidden states
if isinstance(block_output, tuple):  # Check if output is a tuple
    block_output = block_output[0]  # Extract the first element (hidden states) from the tuple

# Step 4: Inspect the result
# The output shape should be (batch_size, sequence_length, hidden_size)
"""
YOUR CODE HERE (~1 line of code)
"""
print(f"Output shape: {block_output.shape}")  # Print the shape: (batch_size, sequence_length, hidden_size) = (1, 8, 768)

Output shape: torch.Size([1, 8, 768])


### 2.3 - Inside GPT-2

GPT-2 is essentially a stack of identical Transformer blocks arranged in sequence.
Each block contains attention, feed-forward, and normalization layers that process token representations step by step.

In this section, you will print the modules inside the GPT-2 Transformer to see how these components are organized.
This will help you understand how the model scales from a single block to a full network of many layers working together.

In [4]:
# Print the name of all layers inside gpt.transformer.
# You may find this helpful: https://docs.pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.named_children

"""
YOUR CODE HERE (~2-4 line of code)
"""

print("=== GPT-2 Transformer Architecture ===")
for name, module in model.transformer.named_children():  # Iterate through immediate child modules of transformer
    print(f"{name:15} -> {type(module).__name__}")  # Print module name and its class type

=== GPT-2 Transformer Architecture ===
wte             -> Embedding
wpe             -> Embedding
drop            -> Dropout
h               -> ModuleList
ln_f            -> LayerNorm


As you can see, the Transformer holds various modules, arranged from a list of blocks (`h`). The following table summarizes these modules:

| Step | What it does | Why it matters |
|------|--------------|----------------|
| **Token ‚Üí Embedding** | Converts IDs to vectors | Gives the model a numeric ‚Äúhandle‚Äù on words |
| **Positional Encoding** | Adds ‚Äúwhere am I?‚Äù info | Order matters in language |
| **Multi-Head Self-Attention** | Each token asks ‚Äúwhich other tokens should I look at?‚Äù | Lets the model relate words across a sentence |
| **Feed-Forward Network** | Two stacked Linear layers with a non-linearity | Mixes information and adds depth |
| **LayerNorm & Residual** | Stabilize training and help gradients flow | Keeps very deep networks trainable |


### 2.4 LLM's output

When you pass a sequence of tokens through a language model, it produces a tensor of logits with shape
`(batch_size, seq_len, vocab_size)`.
Each position in the sequence receives a vector of scores representing how likely every possible token is to appear next. By applying a softmax function on the last dimension, these logits can be converted into probabilities that sum to 1.

In the next cell, you will feed an 8-token dummy sequence into GPT-2, print the shape of its logits, and display the five most likely next tokens predicted for the final position in the sequence.


In [5]:
import torch, torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Step 1: Load GPT-2 model and its tokenizer
"""
YOUR CODE HERE (~2 lines of code)
"""
model = GPT2LMHeadModel.from_pretrained("gpt2")  # Load pre-trained GPT-2 model weights
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # Load tokenizer that matches GPT-2's training

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

In [6]:
# Step 2: Tokenize input text
text = "Hello my name"

"""
YOUR CODE HERE (~1 line of code)
"""
text = "Hello my name"  # Input prompt for the model to complete
inputs = tokenizer(text, return_tensors="pt")  # Convert text to token IDs and attention mask

In [7]:
# Step 3: Pass the input IDs to the model
"""
YOUR CODE HERE (~2-3 line of code)
"""
with torch.no_grad():  # Disable gradient calculation for inference (faster, less memory)
    outputs = model(inputs["input_ids"])  # Forward pass through GPT-2 to get logits
    logits = outputs.logits  # Extract raw output scores before softmax

In [8]:
# Step 4: Predict the next token
# We take the logits from the final position, apply softmax to get probabilities,
# and then extract the top 5 most likely next tokens. You may find F.softmax and torch.topk helpful in your implementation.

"""
YOUR CODE HERE (~3-7 line of code)
"""
last_token_logits = logits[0, -1, :]  # Get logits for the last position in sequence
probabilities = F.softmax(last_token_logits, dim=-1)  # Convert logits to probabilities
top5_probs, top5_indices = torch.topk(probabilities, 5)  # Get top 5 probabilities and their token indices

print(f"Input text: '{text}'")
print(f"Logits shape: {logits.shape}")  # Should be (batch_size, seq_len, vocab_size)
print("\nTop 5 predicted next tokens:")
for i, (prob, idx) in enumerate(zip(top5_probs, top5_indices)):
    token = tokenizer.decode(idx)  # Convert token ID back to text
    print(f"{i+1}. '{token}' - probability: {prob:.4f} ({prob*100:.2f}%)")

Input text: 'Hello my name'
Logits shape: torch.Size([1, 3, 50257])

Top 5 predicted next tokens:
1. ' is' - probability: 0.7773 (77.73%)
2. ',' - probability: 0.0373 (3.73%)
3. ''s' - probability: 0.0332 (3.32%)
4. ' was' - probability: 0.0127 (1.27%)
5. ' and' - probability: 0.0076 (0.76%)


### 2.5 - Key Takeaway

A language model is not a black box or something mysterious.
It is a large composition of simple, understandable layers such as linear layers, attention, and normalization, trained together to predict the next token in a sequence.

By learning this next-token prediction task at scale, the model gradually develops an internal understanding of language structure, meaning, and context, which allows it to generate coherent and relevant text.

# 3 - Text Generation (Decoding)
Once a language model has been trained to predict token probabilities, we can use it to generate text.
This process is called text generation or decoding.

At each step, the model outputs a probability distribution over possible next tokens.
A decoding algorithm then selects one token based on that distribution, appends it to the sequence, and repeats the process to build text word by word. Different decoding strategies control how the model chooses the next token and how creative or deterministic the output will be. For example:
- **Greedy** decoding: always pick the token with the highest probability. Simple and consistent, but often repetitive.
- **Top-k** or **Nucleus** (top-p) sampling: randomly sample from the top few likely tokens to add variety.
- Beam search: explores multiple candidate continuations and keeps the best overall sequence.

Note: `Temperature` adjusts randomness in sampling. Higher values make outputs more diverse, while lower values make them more focused and deterministic.

### 3.1 - Greedy decoding
In this section, you will use GPT-2 and Hugging Face's built-in generate method to produce text using the greedy decoding strategy.

In [5]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


model_id = "gpt2"
device = "cuda" if torch.cuda.is_available() else "mps"


# Step 1. Load GPT-2 model and tokenizer.
"""
YOUR CODE HERE (~2 lines of code)
"""
model = AutoModelForCausalLM.from_pretrained(model_id)  # Load GPT-2 model for causal language modeling
tokenizer = AutoTokenizer.from_pretrained(model_id)  # Load tokenizer that matches the model

# Step 2. Implement a text generation function using HuggingFace's generate method.
def generate(model, tokenizer, prompt, max_new_tokens=128):
    """
    YOUR CODE HERE (~3-6 lines of code)
    """

    inputs = tokenizer(prompt, return_tensors="pt")  # Tokenize input prompt into model-ready format
    with torch.no_grad():  # Disable gradient calculation for inference efficiency
        outputs = model.generate(  # Generate text using greedy decoding (default)
            inputs["input_ids"],  # Input token IDs (tensor, not list)
            max_new_tokens=max_new_tokens,  # Maximum number of new tokens to generate
            pad_token_id=tokenizer.eos_token_id,  # Use EOS token for padding to avoid warnings
            do_sample=False  # Disable sampling to use greedy decoding (most likely token at each step)
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)  # Convert generated tokens back to readable text




In [6]:
tests=["Once upon a time","What is 2+2?", "Suggest a party theme."]
for prompt in tests:
    print(f"\n GPT-2 | Greedy")
    print(generate(model, tokenizer, prompt, 80))


 GPT-2 | Greedy
Once upon a time, the world was a place of great beauty and great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and

 GPT-2 | Greedy
What is 2+2?

2+2 is the number of times you can use a spell to cast a spell.

2+2 is the number of times you can use a spell to cast a spell.

2+2 is the number of times you can use a spell to cast a spell.

2+2 is the number of times you can use a spell to cast a spell.

 GPT-2 | Greedy
Suggest a party theme.

The party theme is a simple, simple, and fun way to get your friends to join you.

The party theme is a simple, simple, and fun way to get your friends to join you. The party theme is a simple, simple, and fun way to get your friends to join you. The party theme is a simple, simple, and f

Naively selecting the single most probable token at each step (known as greedy decoding) often leads to poor results in practice:
- Repetition loops: phrases like ‚ÄúThe cat is is is‚Ä¶‚Äù
- Short-sighted choices: the most likely token right now might lead to incoherent text later

These issues are why more advanced decoding methods such as top-k and nucleus sampling are commonly used to make model outputs more diverse and natural.

### 3.2 - Top-k and top-p sampling
The generate function you implemented earlier can easily be extended to use different decoding strategies.

In this section, you will reimplement the same function but adapt it to support Top-k and Top-p (nucleus) sampling. These methods introduce controlled randomness, allowing the model to explore multiple plausible continuations instead of always choosing the single most likely next token.

In [10]:
# Implement `generate` to support 3 strategies: greedy, top_k, and top_o
# You may find this link helpful: https://huggingface.co/docs/transformers/en/main_classes/text_generation

def generate(model, tokenizer, prompt, strategy, max_new_tokens):
    """
    YOUR CODE HERE (~10-15 lines of code)
    """
    inputs = tokenizer(prompt, return_tensors="pt")  # Convert text input to token IDs

    # Set generation parameters based on the selected strategy
    if strategy == "greedy":
        generation_config = {
            "do_sample": False,  # Always pick the most likely token
            "num_beams": 1,  # No beam search
        }
    elif strategy == "top_k":
        generation_config = {
            "do_sample": True,  # Enable sampling
            "top_k": 50,  # Consider only top 50 most likely tokens
            "temperature": 1.0,  # Neutral temperature (no adjustment)
        }
    elif strategy == "top_p":
        generation_config = {
            "do_sample": True,  # Enable sampling
            "top_p": 0.9,  # Consider tokens that make up 90% of probability mass
            "temperature": 1.0,  # Neutral temperature
        }
    else:
        raise ValueError(f"Unknown strategy: {strategy}")

    # Common generation parameters for all strategies
    generation_config.update({
        "max_new_tokens": max_new_tokens,  # Maximum tokens to generate
        "pad_token_id": tokenizer.eos_token_id,  # Use EOS token for padding
    })

    with torch.no_grad():  # Disable gradients for inference
        outputs = model.generate(inputs["input_ids"], **generation_config)  # Generate text

    return tokenizer.decode(outputs[0], skip_special_tokens=True)  # Convert back to text


In [11]:

tests=["Once upon a time","What is 2+2?", "Suggest a party theme."]
for prompt in tests:
    print(f"\n GPT-2 | Greedy")
    print(generate(model, tokenizer, prompt, "greedy", 40))
    print(f"\n GPT-2 | Top-p")
    print(generate(model, tokenizer, prompt, "top_p", 40))
    print(f"\n GPT-2 | Top-k")
    print(generate(model, tokenizer, prompt, "top_k", 40))


 GPT-2 | Greedy
Once upon a time, the world was a place of great beauty and great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger

 GPT-2 | Top-p
Once upon a time, you could easily imagine you were being made to look like a man in the middle of a crowd. You were not dressed in a suit, you were wearing a coat, shirt, and tie.

 GPT-2 | Top-k
Once upon a time they looked like the same old.

"So was I?" She sighed. "Still, how could it go any better?"

"They were a total different breed back in the days

 GPT-2 | Greedy
What is 2+2?

2+2 is the number of times you can use a spell to cast a spell.

2+2 is the number of times you can use a spell to cast a spell.

 GPT-2 | Top-p
What is 2+2? 2=20? 2=12.8? 2=5.7? - 3=9.2? 3=6.8? 2=5.4? - 4=7?

 GPT-2 | Top-k
What is 2+2? Which color determines what color we can change?

There is a system for determining where redness is set and how it should be set. There are dif

### 3.3 - Try It Yourself

Now it‚Äôs time to experiment with text generation. Replace the sample prompts with your own prompts or adjust the decoding strategy.
You can experiment with:
- strategy: "greedy", "beam", "top_k", "top_p"
- temperature: values between 0.2 and 2.0
- k or p: thresholds that control sampling diversity

Try generating the same prompt with `greedy` and `top_p` (for example, 0.9). Notice how even small temperature changes can make the output more focused or more free-form.




# 4 - Completion vs. Instruction-tuned LLMs

So far, we have used `GPT-2` to generate text from a given input prompt. However, `GPT-2` is just a completion model. It simply continues the provided text without understanding it as a task or question. It is not designed to engage in dialogue or follow instructions.

In contrast, instruction-tuned LLMs (such as `Qwen-Chat`) undergo an additional post-training stage after base pre-training. This process fine-tunes the model to behave helpfully and safely when interacting with users. Because of this extra stage, instruction-tuned models can:

- Interpret prompts as requests rather than just text to continue
- Stay in conversation mode, answering questions and following steps
- Handle refusals and safety boundaries appropriately
- Maintain a consistent helpful persona, rather than drifting into storytelling

### 4.1 - `Qwen/Qwen3-0.6B` vs. `GPT2`

In the next cell, you will feed the same prompt to two different models:

- GPT-2 (completion-only): continues the text in the same writing style
- Qwen/Qwen3-0.6B (instruction-tuned): interprets the input as an instruction and responds helpfully

Comparing the two outputs will make the difference between completion and instruction-tuned behavior clear.



In [12]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load both GPT-2 and Qwen models using HuggingFace `.from_pretrained` method.
"""
YOUR CODE HERE (~10-15 lines of code)
"""
gpt2_model = AutoModelForCausalLM.from_pretrained("gpt2")  # Load GPT-2 (completion-only model)
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")  # Load GPT-2 tokenizer

qwen_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
#qwen_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B", torch_dtype=torch.float16, device_map="auto")  # Load Qwen 0.5B model with efficient settings
qwen_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")  # Load Qwen tokenizer

config.json:   0%|          | 0.00/681 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/988M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/138 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

We have now downloaded two small checkpoints: GPT-2 (124M parameters) and Qwen3-0.6B (600M parameters). If the previous cell took some time to run, that was mainly due to model download speed. The models will be cached locally, so future runs will be faster.

Next, we will generate text using our generate function with both models and the same prompt to directly compare how a completion-only model (GPT-2) behaves differently from an instruction-tuned model (Qwen).

In [13]:

tests=[("Once upon a time", "greedy"),("What is 2+2?", "top_k"),("Suggest a party theme.", "top_p")]

"""
YOUR CODE HERE (~3-5 lines of code)
"""
for prompt, strategy in tests:
    print(f"Prompt: '{prompt}' | Strategy: {strategy}")

    # GPT-2 (completion model)
    gpt2_result = generate(gpt2_model, gpt2_tokenizer, prompt, strategy, max_new_tokens=50)
    print(f"GPT-2:  {gpt2_result}")

    # Qwen (instruction-tuned model) - needs proper formatting
    qwen_prompt = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
    qwen_result = generate(qwen_model, qwen_tokenizer, qwen_prompt, strategy=strategy, max_new_tokens=50)
    # Clean up Qwen response to show only the assistant part
    if "<|im_start|>assistant" in qwen_result:
        qwen_result = qwen_result.split("<|im_start|>assistant")[1].strip()
    print(f"Qwen:   {qwen_result}")


Prompt: 'Once upon a time' | Strategy: greedy
GPT-2:  Once upon a time, the world was a place of great beauty and great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a place of great danger
Qwen:   user
Once upon a time
assistant
Once upon a timeERICA
ERICA
ERICA
ERICA
ERICA
ERICA
ERICA
ERICA
ERICA
ERICA
ERICA
ERICA
ERICA
ERICA
ERICA
ERICA
ERICA
ERICA
ERICA
ERICA
ERICA
ERICA
ERICA

Prompt: 'What is 2+2?' | Strategy: top_k
GPT-2:  What is 2+2? What is 3 with 0 and 1+1? The answer can't all be explained by anything simple, but here are some things you can explain by 2+2 (from below it may seem somewhat awkward to be writing this):

One thing
Qwen:   user
What is 2+2?
assistant
The result is 4.2
Prompt: 'Suggest a party theme.' | Strategy: top_p
GPT-2:  Suggest a party theme. The last word is the key point: don't be a dick.

The answer to all of that is simple: get your pants down and make

# 5. (Optional) A Small Interactive LLM Playground
This section is optional. You do not need to implement it to complete the project. It is meant purely for exploration and will not significantly affect your core AI engineering skills.

If you are curious, you can build a simple interactive playground to experiment with text generation. You can:
- Create input widgets for the prompt, model selection, decoding strategy, and temperature
- Use Hugging Face's generate method to produce text based on the selected settings
- Display the model's response directly in the notebook output

You may find following links helpful:
- https://ipywidgets.readthedocs.io/en/latest/
- https://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html

In [15]:
import ipywidgets as widgets
from IPython.display import display, Markdown

# Steps to implement:
# 1. Load models and tokenizers (GPT-2 and Qwen).
# 2. Define a helper function to generate text with different decoding strategies.
# 3. Create interactive UI elements (prompt box, model selector, strategy selector, temperature slider).
# 4. Add a button to trigger text generation.
# 5. Define the button‚Äôs behavior.
# 6. Display the full UI for the playground.

"""
YOUR CODE HERE (~3-5 lines of code)
"""
# Create interactive widgets
prompt_input = widgets.Textarea(description="Prompt:", value="Explain quantum computing")  # Text input for user prompt
model_dropdown = widgets.Dropdown(description="Model:", options=[("GPT-2", "gpt2"), ("Qwen", "qwen")])  # Model selection
strategy_dropdown = widgets.Dropdown(description="Strategy:", options=[("Greedy", "greedy"), ("Top-K", "top_k"), ("Top-P", "top_p")])  # Decoding strategy
generate_button = widgets.Button(description="Generate Text")  # Button to trigger generation
output_display = widgets.Output()  # Area to display generated text

# Define button click behavior
def on_generate_click(b):
    with output_display:  # Capture output in display area
        output_display.clear_output()  # Clear previous results
        prompt = prompt_input.value  # Get current prompt
        model = model_dropdown.value  # Get selected model
        strategy = strategy_dropdown.value  # Get selected strategy
        # Generate and display text based on selections
        if model == "gpt2":
            result = generate(gpt2_model, gpt2_tokenizer, prompt, strategy, 100)
        else:
            #qwen_prompt = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
            result = generate(qwen_model, qwen_tokenizer, prompt, strategy, 100)
        display(Markdown(f"**Generated Text:**\n\n{result}"))  # Show formatted output

generate_button.on_click(on_generate_click)  # Connect button to function
display(widgets.VBox([prompt_input, model_dropdown, strategy_dropdown, generate_button, output_display]))  # Show all widgets

VBox(children=(Textarea(value='Explain quantum computing', description='Prompt:'), Dropdown(description='Model‚Ä¶


## üéâ Congratulations!

You've just learned, explored, and inspected a real **LLM**. In one project you:
* Learned how **tokenization** works in practice
* Used `tiktoken` library to load and experiment with most advanced tokenizers.
* Explored LLM architecture and inspected GPT2 blocks and layers
* Learned decoding strategies and used `top-p` to generate text from GPT2
* Loaded a powerful chat model, `Qwen3-0.6B` and generated text
* Built an LLM playground


üëè **Great job!** Take a moment to celebrate. You now have a working mental model of how LLMs work. The skills you used here power most LLMs you see everywhere.
