<a href="https://colab.research.google.com/github/mdufresne/ai-enj-proj/blob/main/project_1/lm_playground.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 1: Build an LLM Playground

Welcome to your first project! In this project, you'll build a simple large language model (LLM) playground, an interactive environment where you can experiment with LLMs and understand how they work under the hood.

The goal here is to understand the foundations and mechanics behind LLMs rather than relying on higher-level abstractions or frameworks. You'll see what happens “under the hood”: how an LLM receives text, processes it, and generates a response. In later projects, you'll use frameworks like Ollama and LangChain that simplify many of these steps. But before that, this project will help you build a solid mental model of how LLMs actually work.

We'll use Google Colab, a free browser-based platform that lets you run Python code and machine learning models without installing anything locally. Click the button below to open this notebook in Colab.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/bytebyteai/ai-eng-projects-2/blob/main/project_1/lm_playground.ipynb)

If you prefer to run the project locally, you can use the provided `env.yaml` file to create a compatible environment using conda. To do so, open a terminal in the same directory as this notebook and run:

```bash
# Create and activate the conda environment
conda env create -f env.yaml && conda activate llm_playground

# Register this environment as a Jupyter kernel
python -m ipykernel install --user --name=llm_playground --display-name "llm_playground"
```


---
## Learning Objectives  
- Understand tokenization and how raw text is converted into a sequence of discrete tokens
- Inspect GPT-2 and the Transformer architecture
- Learn how to load pretrained LLMs using Hugging Face
- Explore decoding strategies to generate text from LLMs
- Compare completion models with instruction-tuned models


Let's get started!

In [None]:
# Confirm required libraries are installed and working.
import torch, transformers, tiktoken
print("torch", torch.__version__, "| transformers", transformers.__version__)
print("✅ Environment check complete. You're good to go!")

# 1 - Tokenization

A neural network cannot process raw text directly. It needs numbers.
Tokenization is the process of converting text into numerical IDs that models can understand. In this section, you will learn how tokenization works in practice and why it is an essential step in every language model pipeline.

Tokenization methods generally fall into three main categories:
1. Word-level
2. Character-level
3. Subword-level

### 1.1 - Word-level tokenization
This method splits text by whitespace and treats each word as a single token. In the next cell, you will implement a basic word-level tokenizer by building a vocabulary that maps words to IDs and writing `encode` and `decode` functions.

In [None]:
# Creating a tiny corpus. In practice, a corpus is generally the entire internet-scale dataset used for training.
corpus = [
    "The quick brown fox jumps over the lazy dog",
    "Tokenization converts text to numbers",
    "Large language models predict the next token"
]

# Step 1: Build vocabulary (all unique words in the corpus) and mappings
vocab = []
word2id = {}
id2word = {}

"""
YOUR CODE HERE (~6-15 lines of code)
"""

print(f"Vocabulary size: {len(vocab)} words")
print("First 15 vocab entries:", vocab[:15])


In [None]:
# Step 2: Define encode and decode functions
def encode(text):
    # converts text to token IDs
    """
    YOUR CODE HERE (~1-5 lines of code)
    """
    pass


def decode(ids):
    # converts token IDs back to text
    """
    YOUR CODE HERE (~1-5 lines of code)
    """
    pass

In [None]:
# Step 3: Test your tokenizer with random sentences.
# Try a sentence with unseen words and see what happens (and how to fix it)

"""
YOUR CODE HERE
"""

While word-level tokenization is simple and easy to understand, it has two key limitations that make it impractical for large-scale models:
1. Large vocabulary size: every new word or variation (for example, run, runs, running) increases the total vocabulary, leading to higher memory and training costs.
2. Out-of-vocabulary (OOV) problem: the model cannot handle unseen or rare words that were not part of the training vocabulary, so they must be replaced with a generic `[UNK]` token.

The next section introduces character-level tokenization, where text is represented as individual characters instead of words.

### 1.2 - Character-level tokenization

In this approach, every single character (including spaces, punctuation, and even emojis) is assigned its own ID.

In the next section, we will rebuild a tokenizer using the same corpus as before, but this time with a character-level approach.
For simplicity, assume we are only using lowercase and uppercase English letters (a-z, A-Z).

In [None]:
import string

# Step 1: Create a vocabulary that includes all uppercase and lowercase letters.
vocab = []
char2id = {}
id2char = {}
"""
YOUR CODE HERE (~5 lines of code)
"""

print(f"Vocabulary size: {len(vocab)} (52 letters + 2 specials)")


In [None]:
# Step 2: Implement encode() and decode() functions to convert between text and IDs.
def encode(text):
    # convert text to list of IDs
    """
    YOUR CODE HERE (~2-5 lines of code)
    """
    pass


def decode(ids):
    # Convert list of IDs to text
    """
    YOUR CODE HERE (~2-5 lines of code)
    """
    pass

In [None]:
# Step 3: Test your tokenizer on a short sample word.
"""
YOUR CODE HERE (~2-5 lines of code)
"""

Character-level tokenization solves the out-of-vocabulary problem but introduces new challenges:

1. Longer sequences: because each word becomes many tokens, models need to process much longer inputs.
2. Weaker semantic representation: individual characters carry very little meaning, so models must learn relationships across many steps.
3. Higher computational cost: longer sequences lead to more tokens per input, which increases training and inference time.
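
To make the sequence-length trade-off concrete, the sketch below counts tokens for one sentence from the corpus under both schemes. It uses `str.split()` and `list()` as stand-ins for the word-level and character-level tokenizers, which is a simplifying assumption (spaces are counted as characters here).

```python
sentence = "The quick brown fox jumps over the lazy dog"

# Word-level: one token per whitespace-separated word
word_tokens = sentence.split()

# Character-level: one token per character (spaces included in this sketch)
char_tokens = list(sentence)

print("word-level tokens:", len(word_tokens))
print("char-level tokens:", len(char_tokens))
print("char/word ratio  :", round(len(char_tokens) / len(word_tokens), 2))
```

The same sentence needs several times more tokens at the character level, which is the extra cost described in points 1 and 3.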

To find a better balance between vocabulary size and sequence length, we move to subword-level tokenization next.

### 1.3 - Subword-level tokenization

Subword methods such as `Byte-Pair Encoding (BPE)`, `WordPiece`, and `SentencePiece` **learn** common groups of characters and merge them into tokens. For example, the word **unbelievable** might turn into three tokens: **["un", "believ", "able"]**. This approach strikes a balance between word-level and character-level methods and mitigates the limitations of both.

The BPE algorithm builds a vocabulary iteratively using the following process:
1. Start with individual characters (each character is a token).
2. Count all adjacent pairs of tokens in a large text corpus.
3. Merge the most frequent pair into a new token.

Repeat steps 2 and 3 until you reach the desired vocabulary size (for example, 50,000 tokens).

In the next cell, you will experiment with BPE in practice to see how it compresses text into meaningful subword units. Instead of implementing the algorithm from scratch, you will use a pretrained tokenizer, which was already trained on a large text corpus to build its vocabulary, such as the data used to train `GPT-2`. This allows you to see how BPE works in practice with a real, learned vocabulary.

In [None]:
from transformers import AutoTokenizer

# Step 1: Load a pretrained GPT-2 tokenizer from Hugging Face.
# Refer to this to learn more: https://huggingface.co/docs/transformers/en/model_doc/gpt2

"""
YOUR CODE HERE (~1 line of code)
"""


In [None]:
# Step 2: Use it to write encode and decode helper functions
def encode(text):
    """
    YOUR CODE HERE (~1 line of code)
    """
    pass


def decode(ids):
    """
    YOUR CODE HERE (~1 line of code)
    """
    pass

In [None]:
# Step 3: Inspect the tokens to see how BPE breaks words apart.
sample = "Unbelievable tokenization powers! 🚀"
"""
YOUR CODE HERE
"""

### 1.4 - TikToken

`tiktoken` is a fast, production-ready library for tokenization used by OpenAI models.
It is designed for efficiency and consistency with how OpenAI counts tokens in GPT models.

In this section, you will explore how different model families use different tokenizers. We will compare the tokenizers used by `GPT-2` and by more powerful models such as `GPT-4`. By trying both, you will see how tokenization has evolved to handle more diverse text (including emojis, Unicode, and special characters) while remaining efficient.

In the next cell, you will use tiktoken to load these encodings and inspect how each one splits the same text. You may find reading this doc helpful: https://github.com/openai/tiktoken

In [None]:
import tiktoken

# Compare GPT-2 and GPT-4 tokenizers using tiktoken.

# Step 1: Load two tokenizers
"""
YOUR CODE HERE (~2-3 lines of code)
"""

# Step 2: Encode the same sentence with both and observe how they differ
sentence = "The 🌟 star-programmer implemented AGI overnight."

"""
YOUR CODE HERE (~3-10 lines of code)
"""


Try changing the input sentence and observe how different tokenizers behave.
Experiment with:
- Emojis, special characters, or punctuation
- Code snippets or structured text
- Non-English text (for example, Japanese, French, or Arabic)

If you are curious, you can also attempt to implement the BPE algorithm yourself using a small text corpus to see how token merges are learned in practice.
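
If you would like a starting point for that exercise, here is a minimal sketch of the merge loop described in section 1.3. It is a toy illustration, not the tokenizer used by GPT-2: the corpus, the function names, and the choice to lowercase and split on whitespace are all assumptions made for this example, and ties between equally frequent pairs are broken by first-seen order.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent token pairs across all words, weighted by word frequency."""
    counts = Counter()
    for tokens, freq in words.items():
        for pair in zip(tokens, tokens[1:]):
            counts[pair] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged = {}
    for tokens, freq in words.items():
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    # Step 1: start with individual characters (each word becomes a tuple of characters)
    word_freqs = Counter(word for line in corpus for word in line.lower().split())
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)      # Step 2: count adjacent pairs
        if not counts:
            break
        best = max(counts, key=counts.get)   # Step 3: pick the most frequent pair
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words


toy_corpus = ["low low low lower lowest", "new newer newest"]  # hypothetical toy corpus
merges, words = train_bpe(toy_corpus, num_merges=4)
print("Learned merges:", merges)
print("Segmented words:", list(words))
```

On this toy corpus the frequent word "low" quickly becomes a single token, while rarer words such as "lowest" remain split into smaller pieces.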

### 1.5 - Key Takeaways
- **Word-level**: simple and intuitive, but limited by large vocabularies and out-of-vocabulary issues
- **Character-level**: flexible and covers all text, but produces long sequences that are harder to model
- **Subword / BPE**: balances both worlds and is the default choice for most modern LLMs
- **TikToken**: a production-ready tokenizer used in OpenAI models, demonstrating how optimized subword vocabularies are applied in real systems

# 2 - What is a Language Model?

At its core, a **language model (LM)** is just a *very large* mathematical function built from many neural-network layers.  
Given a sequence of tokens `[t₁, t₂, …, tₙ]`, it learns to output a probability distribution over the next token `tₙ₊₁`.


Each layer performs basic mathematical operations such as matrix multiplication and attention. When dozens or even hundreds of these layers are stacked together, the model learns complex patterns and statistical relationships in text. The final output is a vector of scores that represents how likely each possible token is to appear next. You can think of the entire model as one giant equation whose parameters were optimized during training to minimize prediction errors.

### 2.1 - A Single `Linear` Layer

Before jumping into Transformers, let's start with the simplest building block: a `Linear` layer.

A Linear layer computes `y = Wx + b`.

Where:  
  * `x` - input vector  
  * `W` - weight matrix (learned)  
  * `b` - bias vector (learned)

Although this operation looks simple, stacking many linear layers (along with nonlinear activation functions) allows neural networks to model highly complex relationships in data.
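
As a small worked example of the arithmetic, the sketch below evaluates `y = Wx + b` with plain Python lists before any PyTorch is involved. The values of `W` and `b` are hypothetical and chosen only to make the numbers easy to follow.

```python
def linear(x, W, b):
    """Compute y = Wx + b for one input vector using plain Python lists."""
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(W, b)
    ]


x = [1.0, -1.0, 0.5]        # input vector with 3 features
W = [[0.5, -1.0, 2.0],      # hypothetical weights: 2 outputs x 3 inputs
     [1.5, 0.0, -0.5]]
b = [0.1, -0.2]             # hypothetical bias: one value per output

y = linear(x, W, b)
print(y)  # approximately [2.6, 1.05]
```

Each output value is a weighted sum of all inputs plus one bias term, so a layer with 3 inputs and 2 outputs has 3 x 2 weights and 2 biases.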

In the next cell, you will explore how a **Linear layer** works in practice by implementing one from scratch. You will define the weights and bias, then perform the matrix multiplication and addition manually to see what happens inside this layer. You may find the following links useful:
- https://docs.pytorch.org/docs/stable/generated/torch.nn.parameter.Parameter.html
- https://docs.pytorch.org/docs/stable/generated/torch.randn.html
- https://docs.pytorch.org/docs/stable/generated/torch.matmul.html

In [None]:
import torch
import torch.nn as nn

# Define a MyLinear PyTorch module and perform y = Wx + b.

class MyLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super(MyLinear, self).__init__()
        # Initialize weights and bias as learnable parameters.
        """
        YOUR CODE HERE (~2-4 lines of code)
        """
        pass

    def forward(self, x):
        # Matrix multiplication followed by bias addition
        """
        YOUR CODE HERE (~1-5 lines of code)
        """
        pass


lin = MyLinear(3, 2)
x = torch.tensor([1.0, -1.0, 0.5])
print("Input :", x)
print("Weights:", lin.weight)
print("Bias   :", lin.bias)
print("Output :", lin(x))

Next, you will use PyTorch's built-in `nn.Linear` module, which performs the same computation (`y = Wx + b`) but automatically handles parameter initialization, gradient tracking, and integration with the rest of a neural network. Comparing your manual implementation with this built-in version will help you understand what a linear layer does and how deep learning frameworks make these operations easier to use.

You may find this link useful:
- https://docs.pytorch.org/docs/stable/generated/torch.nn.Linear.html

In [None]:
import torch
import torch.nn as nn

# Create a linear layer using PyTorch's nn.Linear
"""
YOUR CODE HERE (~1 line of code)
"""

x = torch.tensor([1.0, -1.0, 0.5])
print("Input :", x)
print("Weights:", lin.weight)
print("Bias   :", lin.bias)
print("Output :", lin(x))


### 2.2 - A `Transformer` Layer

Most LLMs are a **stack of identical Transformer blocks**. Each block fuses two main components:

| Step | What it does | Where it lives in code |
|------|--------------|------------------------|
| **Multi-Head Self-Attention** | Every token looks at itself and the tokens before it (GPT-2 attention is causal) and decides *what matters*. | `block.attn` |
| **Feed-Forward Network (MLP)** | Re-mixes information token-by-token. | `block.mlp` |
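
To give a feel for the attention step, here is a heavily simplified sketch in plain Python. It assumes a single head and uses the token vectors directly as queries, keys, and values, whereas GPT-2 learns separate projections (inside `c_attn`) and runs several heads in parallel. What it does keep is the causal rule: each token attends only to itself and to earlier tokens.

```python
import math


def softmax(scores):
    m = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(x):
    """x is a list of token vectors. Returns (outputs, attention_weights)."""
    d = len(x[0])
    outputs, weights = [], []
    for i, query in enumerate(x):
        # Scaled dot-product scores against positions 0..i only (causal mask)
        scores = [
            sum(q * k for q, k in zip(query, x[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        w = softmax(scores)
        # Output is the attention-weighted average of the visible token vectors
        out = [sum(w[j] * x[j][c] for j in range(i + 1)) for c in range(d)]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # three toy 2-dimensional token vectors
outputs, weights = causal_self_attention(tokens)
for i, w in enumerate(weights):
    print(f"token {i} attends to positions 0..{i}:", [round(v, 3) for v in w])
```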

In the next section, you will load `GPT-2` and inspect its first Transformer block to see these components in a real model. You will locate its layers, print their shapes and parameters, and understand how a block processes a batch of token embeddings.

In [2]:
import torch
from transformers import GPT2LMHeadModel

# Step 1: load the smallest GPT-2 model (124M parameters) using the Hugging Face transformers library.
# Refer to: https://huggingface.co/docs/transformers/en/model_doc/gpt2
model = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")

# Step 2: Inspect the model by printing it. `model.transformer` is the GPT2Model that
# holds the embeddings and the stack of Transformer blocks (`h`).
gpt2model = model.transformer
print(gpt2model)

GPT2Model(
  (wte): Embedding(50257, 768)
  (wpe): Embedding(1024, 768)
  (drop): Dropout(p=0.1, inplace=False)
  (h): ModuleList(
    (0-11): 12 x GPT2Block(
      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): GPT2Attention(
        (c_attn): Conv1D(nf=2304, nx=768)
        (c_proj): Conv1D(nf=768, nx=768)
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (mlp): GPT2MLP(
        (c_fc): Conv1D(nf=3072, nx=768)
        (c_proj): Conv1D(nf=768, nx=3072)
        (act): NewGELUActivation()
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)


In this section, you will run a minimal forward pass through one GPT-2 block to understand how tokens are transformed inside the model.

In [3]:
import torch

# Step 1: Create a dummy sequence of 8 token IDs.
# The IDs are random (in the range 0-8), so the printed values differ on every run.
input_ids = torch.randint(0, 9, (8,))
print(input_ids)

# Step 2: Convert token IDs into embeddings
# GPT-2 uses two embedding layers:
#   - wte (word token embeddings), indexed by token ID
#   - wpe (positional embeddings), indexed by position (0, 1, 2, ...)
# Add them together to form the initial hidden representation of your input tokens.
positions = torch.arange(input_ids.size(0))
s = gpt2model.wte(input_ids) + gpt2model.wpe(positions)
print(s)

# Step 3: Pass the embeddings through a single Transformer block
# This simulates one layer of computation in GPT-2.
# Add a batch dimension to 's' as the model expects a batch input
l1out = gpt2model.h[0](s.unsqueeze(0))

# Step 4: Inspect the result
# Depending on the transformers version, the block returns either a tuple whose
# first element is the hidden states, or the hidden-state tensor itself.
hidden = l1out[0] if isinstance(l1out, tuple) else l1out
# The output shape should be (batch_size, sequence_length, hidden_size)
print("Output shape of one Transformer block:", hidden.shape)

### 2.3 - Inside GPT-2

GPT-2 is essentially a stack of identical Transformer blocks arranged in sequence.
Each block contains attention, feed-forward, and normalization layers that process token representations step by step.

In this section, you will print the modules inside the GPT-2 Transformer to see how these components are organized.
This will help you understand how the model scales from a single block to a full network of many layers working together.

In [4]:
# Print the names of all layers inside the GPT-2 Transformer (`gpt2model`).
# You may find this helpful: https://docs.pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.named_children

for name, module in gpt2model.named_children():
    print(name)
    for subname, submodule in module.named_children():
        print(f"  {subname}")
        for subsubname, subsubmodule in submodule.named_children():
            print(f"    {subsubname}")


wte
wpe
drop
h
  0
    ln_1
    attn
    ln_2
    mlp
  1
    ln_1
    attn
    ln_2
    mlp
  2
    ln_1
    attn
    ln_2
    mlp
  3
    ln_1
    attn
    ln_2
    mlp
  4
    ln_1
    attn
    ln_2
    mlp
  5
    ln_1
    attn
    ln_2
    mlp
  6
    ln_1
    attn
    ln_2
    mlp
  7
    ln_1
    attn
    ln_2
    mlp
  8
    ln_1
    attn
    ln_2
    mlp
  9
    ln_1
    attn
    ln_2
    mlp
  10
    ln_1
    attn
    ln_2
    mlp
  11
    ln_1
    attn
    ln_2
    mlp
ln_f


As you can see, the Transformer is made up of embedding layers (`wte`, `wpe`), a dropout layer (`drop`), a list of 12 identical blocks (`h`), and a final LayerNorm (`ln_f`). The following table summarizes what these modules do:

| Step | What it does | Why it matters |
|------|--------------|----------------|
| **Token → Embedding** | Converts IDs to vectors | Gives the model a numeric “handle” on words |
| **Positional Encoding** | Adds “where am I?” info | Order matters in language |
| **Multi-Head Self-Attention** | Each token asks “which other tokens should I look at?” | Lets the model relate words across a sentence |
| **Feed-Forward Network** | Two stacked Linear layers with a non-linearity | Mixes information and adds depth |
| **LayerNorm & Residual** | Stabilize training and help gradients flow | Keeps very deep networks trainable |


### 2.4 - LLM's output

When you pass a sequence of tokens through a language model, it produces a tensor of logits with shape
`(batch_size, seq_len, vocab_size)`.
Each position in the sequence receives a vector of scores representing how likely every possible token is to appear next. By applying a softmax function on the last dimension, these logits can be converted into probabilities that sum to 1.
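
The softmax step can be illustrated without a model. The sketch below applies it to a handful of hypothetical logits for a four-token vocabulary; the real GPT-2 vector has 50,257 entries, but the arithmetic is the same.

```python
import math


def softmax(logits):
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1, -1.0]   # hypothetical scores for a 4-token vocabulary
probs = softmax(logits)

print("probabilities:", [round(p, 3) for p in probs])
print("sum          :", round(sum(probs), 6))
```

Larger logits map to larger probabilities, the ordering is preserved, and the results always sum to 1.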

In the next cells, you will feed a short prompt into GPT-2, print its logits, and display the five most likely next tokens predicted for the final position in the sequence.


In [5]:
import torch, torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Step 1: Load GPT-2 model and its tokenizer
gpt2model = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("openai-community/gpt2")

In [6]:
# Step 2: Tokenize input text
text = "Hello my name"

tokens = tokenizer(text, return_tensors="pt")
print(tokens)

{'input_ids': tensor([[15496,   616,  1438]]), 'attention_mask': tensor([[1, 1, 1]])}


In [7]:
# Step 3: Pass the input IDs to the model.
# The result is an output object; the logits tensor is stored in its `.logits` attribute.
logits = gpt2model(**tokens)
print(logits)

CausalLMOutputWithCrossAttentions(loss=None, logits=tensor([[[-35.2362, -35.3265, -38.9753,  ..., -44.4645, -43.9974, -36.4580],
         [-74.1302, -75.3139, -79.5631,  ..., -81.1064, -82.6454, -78.4387],
         [-54.6547, -56.0921, -62.8932,  ..., -66.3084, -68.8439, -60.8408]]],
       grad_fn=<UnsafeViewBackward0>), past_key_values=DynamicCache(layers=[DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer]), hidden_states=None, attentions=None, cross_attentions=None)


In [16]:
# Step 4: Predict the next token
# We take the logits from the final position, apply softmax to get probabilities,
# and then extract the top 5 most likely next tokens. You may find F.softmax and torch.topk helpful in your implementation.

import torch.nn.functional as F

# Get the logits for the last token in the sequence (for the first batch)
# logits.logits has shape (batch_size, sequence_length, vocab_size)
# We want the prediction for the token *after* the last input token.
# So we take the logits corresponding to the last position in the input sequence.
last_token_logits = logits.logits[0, -1, :] # Shape (vocab_size,)

# Apply softmax to convert logits to probabilities
probabilities = F.softmax(last_token_logits, dim=-1) # Shape (vocab_size,)

# Get the top 5 most likely next tokens and their probabilities
top_k_results = torch.topk(probabilities, k=5)
print(top_k_results)

# Decode the token IDs. top_k_results.indices is already a 1D tensor of shape (5,)
decoded_tokens = tokenizer.decode(top_k_results.indices)
print(decoded_tokens)

torch.return_types.topk(
values=tensor([0.7773, 0.0373, 0.0332, 0.0127, 0.0076], grad_fn=<TopkBackward0>),
indices=tensor([318,  11, 338, 373, 290]))
 is,'s was and


### 2.5 - Key Takeaway

A language model is not a black box or something mysterious.
It is a large composition of simple, understandable layers such as linear layers, attention, and normalization, trained together to predict the next token in a sequence.

By learning this next-token prediction task at scale, the model gradually develops an internal understanding of language structure, meaning, and context, which allows it to generate coherent and relevant text.

# 3 - Text Generation (Decoding)
Once a language model has been trained to predict token probabilities, we can use it to generate text.
This process is called text generation or decoding.

At each step, the model outputs a probability distribution over possible next tokens.
A decoding algorithm then selects one token based on that distribution, appends it to the sequence, and repeats the process to build text token by token. Different decoding strategies control how the model chooses the next token and how creative or deterministic the output will be. For example:
- **Greedy** decoding: always pick the token with the highest probability. Simple and consistent, but often repetitive.
- **Top-k** or **Nucleus** (top-p) sampling: randomly sample from a restricted set of likely tokens (the k most probable, or the smallest set whose cumulative probability reaches p) to add variety.
- **Beam search**: explores multiple candidate continuations and keeps the best overall sequence.

Note: `Temperature` adjusts randomness in sampling. Higher values make outputs more diverse, while lower values make them more focused and deterministic.
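
One common way to apply temperature is to divide the logits by it before the softmax. The sketch below shows the effect on a few hypothetical logits; the function name and the values are illustrative only.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]   # temperature rescales the logits
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]   # hypothetical scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}:", [round(p, 3) for p in probs])
```

At a low temperature most of the probability mass concentrates on the top token, while at a high temperature the distribution flattens and less likely tokens get sampled more often.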

### 3.1 - Greedy decoding
In this section, you will use GPT-2 and Hugging Face's built-in generate method to produce text using the greedy decoding strategy.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


model_id = "gpt2"
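# Pick the best available device: NVIDIA GPU, then Apple Silicon (MPS), then CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"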


# Step 1. Load GPT-2 model and tokenizer.
"""
YOUR CODE HERE (~2 lines of code)
"""

# Step 2. Implement a text generation function using HuggingFace's generate method.
def generate(model, tokenizer, prompt, max_new_tokens=128):
    """
    YOUR CODE HERE (~3-6 lines of code)
    """
    pass

In [None]:
tests=["Once upon a time","What is 2+2?", "Suggest a party theme."]
for prompt in tests:
    print(f"\n GPT-2 | Greedy")
    print(generate(model, tokenizer, prompt, 80))

Naively selecting the single most probable token at each step (known as greedy decoding) often leads to poor results in practice:
- Repetition loops: phrases like “The cat is is is…”
- Short-sighted choices: the most likely token right now might lead to incoherent text later
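
The repetition loop can be reproduced with a toy example. The sketch below uses a hypothetical lookup table in place of a real model: the next-token probabilities depend only on the previous token, and the numbers are invented so that the most likely token after "is" is "is" again.

```python
# Hypothetical next-token probabilities for a toy model that only sees the previous token.
next_token_probs = {
    "The": {"cat": 0.6, "dog": 0.4},
    "cat": {"is": 0.7, "sat": 0.3},
    "dog": {"is": 0.8, "ran": 0.2},
    "is": {"is": 0.4, "happy": 0.35, "here": 0.25},
}


def greedy_decode(start, max_new_tokens):
    tokens = [start]
    for _ in range(max_new_tokens):
        options = next_token_probs.get(tokens[-1])
        if options is None:                            # no known continuation: stop
            break
        tokens.append(max(options, key=options.get))   # always take the most likely token
    return tokens


print(" ".join(greedy_decode("The", 5)))  # The cat is is is is
```

Although "happy" and "here" together are more likely than "is", greedy decoding never picks them, which is exactly where sampling helps.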

These issues are why more advanced decoding methods such as top-k and nucleus sampling are commonly used to make model outputs more diverse and natural.

### 3.2 - Top-k and top-p sampling
The generate function you implemented earlier can easily be extended to use different decoding strategies.

In this section, you will reimplement the same function but adapt it to support Top-k and Top-p (nucleus) sampling. These methods introduce controlled randomness, allowing the model to explore multiple plausible continuations instead of always choosing the single most likely next token.
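
Before wiring these strategies into `generate`, it may help to see what the two filters do to a probability distribution. The following is a plain-Python sketch with hypothetical token probabilities; the helper names `top_k_filter` and `top_p_filter` are made up for this illustration and are not part of the Hugging Face API.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


# Hypothetical next-token distribution
probs = {" is": 0.5, " was": 0.25, ",": 0.15, " and": 0.07, " xyz": 0.03}

print("top-k (k=2)  :", top_k_filter(probs, k=2))
print("top-p (p=0.8):", top_p_filter(probs, p=0.8))

# Sample the next token from the filtered distribution
filtered = top_p_filter(probs, p=0.8)
next_token = random.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print("sampled token:", repr(next_token))
```

Top-k always keeps a fixed number of candidates, while top-p keeps more candidates when the model is uncertain and fewer when one token dominates.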

In [None]:
# Implement `generate` to support 3 strategies: greedy, top_k, and top_p
# You may find this link helpful: https://huggingface.co/docs/transformers/en/main_classes/text_generation

def generate(model, tokenizer, prompt, strategy="greedy", max_new_tokens=128):
    """
    YOUR CODE HERE (~10-15 lines of code)
    """
    pass

In [None]:

tests=["Once upon a time","What is 2+2?", "Suggest a party theme."]
for prompt in tests:
    print(f"\n GPT-2 | Top-p")
    print(generate(model, tokenizer, prompt, "top_p", 40))

### 3.3 - Try It Yourself

Now it's time to experiment with text generation. Replace the sample prompts with your own prompts or adjust the decoding strategy.
You can experiment with:
- strategy: "greedy", "beam", "top_k", "top_p"
- temperature: values between 0.2 and 2.0
- k or p: thresholds that control sampling diversity

Try generating the same prompt with `greedy` and with `top_p` (for example, p = 0.9), then vary the temperature. Notice how even small temperature changes can make the output more focused or more free-form.




# 4 - Completion vs. Instruction-tuned LLMs

So far, we have used `GPT-2` to generate text from a given input prompt. However, `GPT-2` is just a completion model. It simply continues the provided text without understanding it as a task or question. It is not designed to engage in dialogue or follow instructions.

In contrast, instruction-tuned LLMs (such as `Qwen-Chat`) undergo an additional post-training stage after base pre-training. This process fine-tunes the model to behave helpfully and safely when interacting with users. Because of this extra stage, instruction-tuned models can:

- Interpret prompts as requests rather than just text to continue
- Stay in conversation mode, answering questions and following steps
- Handle refusals and safety boundaries appropriately
- Maintain a consistent helpful persona, rather than drifting into storytelling

### 4.1 - `Qwen/Qwen3-0.6B` vs. `GPT2`

In the next cell, you will feed the same prompt to two different models:

- GPT-2 (completion-only): continues the text in the same writing style
- Qwen/Qwen3-0.6B (instruction-tuned): interprets the input as an instruction and responds helpfully

Comparing the two outputs will make the difference between completion and instruction-tuned behavior clear.



In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load both GPT-2 and Qwen models using HuggingFace `.from_pretrained` method.
"""
YOUR CODE HERE (~10-15 lines of code)
"""


We have now downloaded two small checkpoints: GPT-2 (124M parameters) and Qwen3-0.6B (600M parameters). If the previous cell took some time to run, that was mainly due to model download speed. The models will be cached locally, so future runs will be faster.

Next, we will generate text using our generate function with both models and the same prompt to directly compare how a completion-only model (GPT-2) behaves differently from an instruction-tuned model (Qwen).

In [None]:

tests=[("Once upon a time", "greedy"),("What is 2+2?", "top_k"),("Suggest a party theme.", "top_p")]

"""
YOUR CODE HERE (~3-5 lines of code)
"""


# 5 - (Optional) A Small Interactive LLM Playground
This section is optional. You do not need to implement it to complete the project. It is meant purely for exploration and will not significantly affect your core AI engineering skills.

If you are curious, you can build a simple interactive playground to experiment with text generation. You can:
- Create input widgets for the prompt, model selection, decoding strategy, and temperature
- Use Hugging Face's generate method to produce text based on the selected settings
- Display the model's response directly in the notebook output

You may find the following links helpful:
- https://ipywidgets.readthedocs.io/en/latest/
- https://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html

In [None]:
import ipywidgets as widgets
from IPython.display import display, Markdown

# Steps to implement:
# 1. Load models and tokenizers (GPT-2 and Qwen).
# 2. Define a helper function to generate text with different decoding strategies.
# 3. Create interactive UI elements (prompt box, model selector, strategy selector, temperature slider).
# 4. Add a button to trigger text generation.
# 5. Define the button's behavior.
# 6. Display the full UI for the playground.

"""
YOUR CODE HERE (~3-5 lines of code)
"""


## 🎉 Congratulations!

You've just learned, explored, and inspected a real **LLM**. In one project you:
* Learned how **tokenization** works in practice
* Used the `tiktoken` library to load and experiment with the tokenizers behind GPT-2 and GPT-4
* Explored LLM architecture and inspected GPT-2 blocks and layers
* Learned decoding strategies and used top-p sampling to generate text from GPT-2
* Loaded an instruction-tuned chat model, `Qwen3-0.6B`, and generated text
* Built an LLM playground


👏 **Great job!** Take a moment to celebrate. You now have a working mental model of how LLMs work. The concepts you practiced here underpin most of the LLMs in use today.
