# Week 2 Practice: From Transformers to Alignment
### Learning Objectives:
* Understand attention mechanisms through NumPy code
* Build a simple transformer block
* Predict next token using a pretrained LLM
* Analyze hallucinations
* Explore supervised fine-tuning logic
* Understand how DPO works via preference modeling

#### Package Introduction

In this notebook, we will use several important Python libraries:

- **[NumPy](https://education.launchcode.org/data-analysis-curriculum/eda-with-pandas/reading/numpy-intro/index.html)**: The fundamental package for scientific computing with Python. We use it for matrix operations and to demonstrate the attention mechanism.
- **[PyTorch](https://www.geeksforgeeks.org/start-learning-pytorch-for-beginners/)**: A popular deep learning framework. We use it to build and train neural network models, including transformer blocks.
- **[Hugging Face Transformers](https://github.com/huggingface/transformers)**: Provides state-of-the-art pre-trained models and tools for natural language processing. We use it to load and interact with large language models (LLMs).
- **[huggingface-cli](https://huggingface.co/docs/huggingface_hub/main/en/guides/cli)**: A command-line tool for managing Hugging Face models and datasets. Useful for downloading models or checking your authentication.

Make sure you have these packages installed. You can install them using pip if needed:


In [4]:
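import sys
# accelerate is needed later by the cells that pass device_map="auto"
!{sys.executable} -m pip install python-dotenv numpy torch transformers huggingface_hub accelerate
from dotenv import load_dotenv
load_dotenv()  # reads variables such as HUGGINGFACE_TOKEN from a local .env file, if present

# `huggingface-cli` is installed with `huggingface_hub` (a dependency of `transformers`). You can check your installation with:

!huggingface-cli --help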


usage: huggingface-cli <command> [<args>]

positional arguments:
  {download,upload,repo-files,env,login,whoami,logout,auth,repo,lfs-enable-largefiles,lfs-multipart-upload,scan-cache,delete-cache,tag,version,upload-large-folder}
                        huggingface-cli command helpers
    download            Download files from the Hub
    upload              Upload a file or a folder to a repo on the Hub
    repo-files          Manage files in a repo on the Hub
    env                 Print information about the environment.
    login               Log in using a token from
                        huggingface.co/settings/tokens
    whoami              Find out which huggingface.co account you are logged
                        in as.
  

### Part 1: Attention Mechanism (Self-Attention)

In [2]:
import numpy as np

# Random Q, K, V matrices
def generate_random_qkv(seq_len=4, d_model=8):
    return [np.random.rand(seq_len, d_model) for _ in range(3)]

# Scaled dot-product attention
def self_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = np.dot(Q, K.T) / np.sqrt(d_k)
    weights = softmax(scores)
    output = np.dot(weights, V)
    return output, weights

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

Q, K, V = generate_random_qkv()
out, attn_weights = self_attention(Q, K, V)
print("Attention Output:\n", out)
print("Attention Weights:\n", attn_weights)


Attention Output:
 [[0.53548901 0.66392775 0.55686998 0.4933884  0.63604606 0.82976013
  0.56207106 0.72222938]
 [0.51730692 0.67209703 0.55225692 0.48447672 0.64504055 0.82935251
  0.55577758 0.71951072]
 [0.51688515 0.67368168 0.5530581  0.49419091 0.64699743 0.83166348
  0.55489162 0.71800925]
 [0.53046839 0.66599947 0.55854521 0.48407288 0.63883148 0.8266584
  0.55291995 0.72824893]]
Attention Weights:
 [[0.25381026 0.23048414 0.18819445 0.32751115]
 [0.27283275 0.22793776 0.19673454 0.30249495]
 [0.2727269  0.21660305 0.19701241 0.31365764]
 [0.26986406 0.24203092 0.18106032 0.30704471]]


 Discussion: Walk students through the QK^T score computation, scaling, and softmax. Explain how this captures relationships between tokens.
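To make the three steps concrete, here is a minimal sketch on an input small enough to check by hand. The two-token `Q`, `K`, and `V` matrices are illustrative values chosen for this example, not taken from the cell above.

```python
import numpy as np

def softmax(x):
    # Subtract the row max for numerical stability before exponentiating
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

# Illustrative values: 2 tokens, d_k = 2
Q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
K = np.array([[1.0, 0.0],
              [0.0, 1.0]])
V = np.array([[10.0, 0.0],
              [0.0, 10.0]])

d_k = Q.shape[-1]
raw_scores = Q @ K.T                       # step 1: similarity of every query with every key
scaled_scores = raw_scores / np.sqrt(d_k)  # step 2: scale so scores do not grow with d_k
weights = softmax(scaled_scores)           # step 3: each row becomes a distribution over tokens
output = weights @ V                       # weighted mix of the value vectors

print("Raw scores:\n", raw_scores)
print("Attention weights:\n", weights)
print("Row sums:", weights.sum(axis=-1))
print("Output:\n", output)
```

Each token's query aligns best with its own key here, so it receives the largest weight on itself and its output row is pulled toward its own value vector.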

### Part 2: Mini Transformer Block in PyTorch

In [3]:
import torch
import torch.nn as nn

class MiniTransformerBlock(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()  # initialize the nn.Module base class
        # Two attention heads, each with its own Q/K/V projections, so the block can attend to different aspects
        # batch_first=True means the input shape is (batch, seq_len, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=2, batch_first=True)
        # Two-layer feed-forward network (FFN), i.e. one hidden layer
        self.ffn = nn.Sequential(
            # Expand the embedding dimension to give the FFN more capacity
            # nn.Linear arguments: (in_features, out_features)
            nn.Linear(embed_dim, embed_dim * 4),
            # Apply the nonlinearity
            nn.ReLU(),
            # Project back down to the embedding dimension
            nn.Linear(embed_dim * 4, embed_dim)
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
    
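    def forward(self, x):
        # Self-attention: queries, keys, and values all come from x
        attn_output, _ = self.attn(x, x, x)
        # Residual connection followed by layer norm (post-norm layout)
        x = self.norm1(x + attn_output)
        ffn_output = self.ffn(x)
        # Second residual connection and layer norm around the FFN
        x = self.norm2(x + ffn_output)
        return x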

x = torch.randn(1, 5, 16)  # batch_size=1, seq_len=5, embed_dim=16
model = MiniTransformerBlock(embed_dim=16)
out = model(x)
print(out.shape)


torch.Size([1, 5, 16])


 Goal: Show how self-attention and FFN work with residual and norm in PyTorch.
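The "residual and norm" step can also be looked at in isolation. The sketch below is a plain NumPy approximation of what `nn.LayerNorm` computes at initialization (when its learnable scale is 1 and bias is 0); the `sublayer_out` array is a hypothetical stand-in for the attention or FFN output.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))             # (seq_len, embed_dim); batch dimension omitted
sublayer_out = rng.normal(size=(5, 16))  # hypothetical attention or FFN output

# Residual add first, then normalize (the post-norm layout used in the block above)
y = layer_norm(x + sublayer_out)

print("Shape:", y.shape)
print("Per-token mean:", y.mean(axis=-1).round(6))
print("Per-token std:", y.std(axis=-1).round(3))
```

The residual path lets the input pass through unchanged alongside the sublayer's contribution, and the normalization keeps every token's features on a comparable scale before the next sublayer.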

### Part 3: Next Token Prediction using HuggingFace

#### Option 1: Use a Publicly Available Model
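Use a non-gated model such as TinyLlama or `mistralai/Mistral-7B-v0.1`. Note that the cells below load `mistralai/Mistral-7B-Instruct-v0.1`, which requires accepting the model agreement on its Hugging Face page before it can be downloaded.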

In [None]:
# Step 1: Login via CLI
import os
!{sys.executable} -m pip install ipywidgets
!{sys.executable} -m pip install --upgrade "huggingface_hub[cli]"
HUGGINGFACE_TOKEN = os.getenv("HUGGINGFACE_TOKEN")
!huggingface-cli login --token {HUGGINGFACE_TOKEN}


In [18]:
!{sys.executable} -m pip install ipywidgets
from huggingface_hub import login, whoami
HUGGINGFACE_TOKEN = os.getenv("HUGGINGFACE_TOKEN")
login(token=HUGGINGFACE_TOKEN)  # token read from the environment on the line above
print(whoami())


{'type': 'user', 'id': '68890ea5eeb54fefc6b182ed', 'name': 'felixpeng24', 'fullname': 'Felix Peng', 'canPay': False, 'periodEnd': None, 'isPro': False, 'avatarUrl': 'https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/iWtMjWcO0PfFdF7RD_4F2.jpeg', 'orgs': [], 'auth': {'type': 'access_token', 'accessToken': {'displayName': 'Inference AI', 'role': 'fineGrained', 'createdAt': '2025-07-30T15:59:09.506Z', 'fineGrained': {'canReadGatedRepos': True, 'global': ['discussion.write', 'post.write'], 'scoped': [{'entity': {'_id': '68890ea5eeb54fefc6b182ed', 'type': 'user', 'name': 'felixpeng24'}, 'permissions': ['repo.content.read', 'repo.write', 'inference.serverless.write', 'inference.endpoints.infer.write', 'inference.endpoints.write', 'user.webhooks.read', 'user.webhooks.write', 'collection.read', 'collection.write', 'discussion.write', 'user.billing.read', 'job.write']}]}}}}


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from huggingface_hub import login

login(token=HUGGINGFACE_TOKEN)

# you have to visit https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1 to sign the agreement in order to use this model
model_name = "mistralai/Mistral-7B-Instruct-v0.1"
# token=True reuses the token saved by login(); it replaces the deprecated `use_auth_token` argument
tokenizer = AutoTokenizer.from_pretrained(model_name, token=True)
model = AutoModelForCausalLM.from_pretrained(model_name, token=True, device_map="auto")


If you're on a Mac with Apple Silicon (M1/M2/M3), first check whether PyTorch's MPS backend is available:

In [None]:
! python -c "import torch; print(torch.backends.mps.is_available())"


If it prints `True`, you can run the model on the MPS device like this:

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from huggingface_hub import login
import torch

# Optional if already logged in via CLI

login(token=HUGGINGFACE_TOKEN)

# Check device for MacBook (MPS if available, else CPU)
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print(f"Using device: {device}")

model_name = "mistralai/Mistral-7B-Instruct-v0.1"

# Load tokenizer and model with Hugging Face gated repo access
# token=True replaces the deprecated `use_auth_token` argument
tokenizer = AutoTokenizer.from_pretrained(model_name, token=True)
model = AutoModelForCausalLM.from_pretrained(model_name, token=True).to(device)

# Prepare input prompt
prompt = "The Eiffel Tower is located in"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Run generation
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=10)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))


ImportError: The `notebook_login` function can only be used in a notebook (Jupyter or Colab) and you need the `ipywidgets` module: `pip install ipywidgets`.

If you are not sure which accelerator your machine has, the following cell selects the best available device in the order CUDA → MPS → CPU and prints which one it chose:

In [23]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from huggingface_hub import login
import torch

# Optional if already logged in via CLI
# login(token="your_hf_token")

# ------------------------------------------
# 📦 Device Selection: CUDA > MPS > CPU
# ------------------------------------------
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("✅ Using CUDA (GPU)")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("🟡 Using MPS (Apple Silicon GPU)")
else:
    device = torch.device("cpu")
    print("🔴 Using CPU")

# ------------------------------------------
# 🧠 Load Model from Hugging Face
# ------------------------------------------
model_name = "mistralai/Mistral-7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name, token=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=True,  # replaces the deprecated `use_auth_token` argument
    torch_dtype=torch.float16 if device.type != "cpu" else torch.float32  # avoid FP16 on CPU
).to(device)

# ------------------------------------------
# 📝 Prompt + Inference
# ------------------------------------------
prompt = "The Eiffel Tower is located in"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=10)
    print("📝 Generated Output:", tokenizer.decode(outputs[0], skip_special_tokens=True))


🟡 Using MPS (Apple Silicon GPU)


Loading checkpoint shards: 100%|██████████| 2/2 [00:13<00:00,  6.70s/it]
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


📝 Generated Output: The Eiffel Tower is located in Paris, France, and is one of the most


### Part 4: DPO vs PPO – Side-by-Side Educational Example
#### DPO: Direct Preference Optimization

In [20]:
import torch
import torch.nn.functional as F

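# Simulated log-probs of chosen vs rejected completions
chosen_logp = torch.tensor([[-1.0]])
rejected_logp = torch.tensor([[-2.0]])

def dpo_loss(chosen_logp, rejected_logp, beta=0.1):
    # Simplified: the reference-model log-probs of the full DPO loss are omitted,
    # which is equivalent to a reference that scores both completions equally.
    # beta multiplies the preference margin inside the sigmoid.
    return -F.logsigmoid(beta * (chosen_logp - rejected_logp)).mean()

print(f"DPO Loss: {dpo_loss(chosen_logp, rejected_logp).item():.4f}")



DPO Loss: 0.6444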


#### PPO: Proximal Policy Optimization (simplified for in-class demo)

In [19]:
import torch
import torch.nn.functional as F

# Simulated log-probs of the same action under the old and new policies
# (log π_θ_old(a|s) and log π_θ(a|s))
old_log_prob = torch.tensor([[-1.0]])  # from the old policy (the model before this PPO step)
new_log_prob = torch.tensor([[-0.8]])  # from the updated policy
reward = torch.tensor([[1.0]])         # reward from human or reward model
epsilon = 0.2                          # PPO clipping parameter

# Compute ratio of new to old policy
log_ratio = new_log_prob - old_log_prob
ratio = torch.exp(log_ratio)

# Advantage and clipped probability ratio
advantage = reward  # assume reward ~ advantage for simplicity
clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)

# PPO loss (negative of the clipped surrogate objective)
ppo_loss = -torch.min(ratio * advantage, clipped_ratio * advantage).mean()
print("PPO Loss:", ppo_loss.item())


PPO Loss: -1.2000000476837158
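The clipping only matters once the new policy has moved far from the old one. The following pure-Python sketch evaluates the same clipped surrogate for three hypothetical cases; the log-prob and advantage values are made up for illustration.

```python
import math

def ppo_clipped_loss(new_log_prob, old_log_prob, advantage, epsilon=0.2):
    # Probability ratio pi_new / pi_old, computed from log-probs
    ratio = math.exp(new_log_prob - old_log_prob)
    clipped_ratio = max(1 - epsilon, min(ratio, 1 + epsilon))
    # Negative of the clipped surrogate objective for a single sample
    return -min(ratio * advantage, clipped_ratio * advantage)

# (label, new_log_prob, old_log_prob, advantage), all hypothetical
cases = [
    ("small step, A > 0", -0.9, -1.0, 1.0),
    ("large step, A > 0", -0.5, -1.0, 1.0),
    ("large step, A < 0", -1.5, -1.0, -1.0),
]
for label, new_lp, old_lp, adv in cases:
    print(f"{label}: loss = {ppo_clipped_loss(new_lp, old_lp, adv):.4f}")
```

In the first case the ratio stays inside `[1 - ε, 1 + ε]` and the loss tracks it directly. In the other two the ratio has left that interval in the direction the advantage favors, so the loss is pinned at the clipped value and further movement earns nothing.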


## 🔍 DPO vs PPO: Alignment Loss Comparison

| Criterion         | DPO (Direct Preference Optimization)     | PPO (Proximal Policy Optimization)        |
|------------------|-------------------------------------------|--------------------------------------------|
| 🧠 Origin         | Preference modeling (Rafailov et al., Stanford, 2023) | Reinforcement Learning (Schulman et al., OpenAI, 2017) |
| ✅ Rejection Signal | Yes: uses chosen vs rejected pairs        | No: requires scalar reward                 |
| 🏆 Reward Signal   | Implicit via logit difference              | Explicit reward model needed               |
| 📐 Loss Function   | `-log(sigmoid(β * (chosen - rejected)))`   | `-min(ratio * A, clipped_ratio * A)`       |
| 🔧 Optimization    | Binary classification over preferences     | Policy gradient with clipped surrogate     |
| 🎯 Application     | DPO-tuned models like LLaMA 3              | RLHF-tuned models like InstructGPT         |
| ⚙️ Complexity      | Simpler (no reward model needed)           | More complex (needs reward model + sampling) |



Explain how aligning models toward human preference uses logit differences.
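One way to see the logit difference at work is to include the reference model explicitly, as the DPO objective does. This is a minimal pure-Python sketch under stated assumptions: all sequence log-probs are hypothetical numbers, and `log_sigmoid` is a small helper written for this example.

```python
import math

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z))
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Log-ratio of policy to reference for each completion
    chosen_logratio = pi_chosen - ref_chosen
    rejected_logratio = pi_rejected - ref_rejected
    # Preference logit: how much more the policy favors the chosen
    # completion than the reference model does, scaled by beta
    logit = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(logit), logit

# Hypothetical log-probs: the reference scores both completions at -1.5
loss_same, logit_same = dpo_loss(-1.5, -1.5, -1.5, -1.5)      # policy equals reference
loss_better, logit_better = dpo_loss(-1.0, -2.0, -1.5, -1.5)  # policy prefers chosen

print(f"policy == reference: logit = {logit_same:.2f}, loss = {loss_same:.4f}")
print(f"policy prefers chosen: logit = {logit_better:.2f}, loss = {loss_better:.4f}")
```

When the policy matches the reference the logit is zero and the loss equals `log(2)`. As the policy shifts probability toward the chosen completion relative to the reference, the logit grows and the loss falls.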

### Bonus: Inference with Quantization (O1 & O3)

In [21]:
# Load the model weights in FP16 (half precision); `model_name` and the imports come from the Part 3 cells
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

Loading checkpoint shards: 100%|██████████| 2/2 [00:10<00:00,  5.32s/it]


Concept: Explain how FP16/O1 optimizes memory and speed at inference.

### 🧠 Concept Breakdown: How FP16 & O1 Optimize Inference

#### 📌 What is FP16?

- **FP16** = *16-bit floating point*, also called **half precision**.
- It uses **less memory** than FP32 (standard 32-bit float), with:
  - 1 sign bit
  - 5 exponent bits
  - 10 mantissa bits
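- FP32 keeps roughly 7 significant decimal digits: `0.123456789` is stored as about `0.12345679`
- FP16 keeps roughly 3 significant decimal digits: the same value is stored as about `0.1235` (lower precision, but usually good enough for inference)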
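
The precision and size difference can be checked directly with NumPy's `float16` and `float32` types. This is only a small sketch; the sample value is arbitrary.

```python
import numpy as np

value = 0.123456789
as_fp32 = np.float32(value)
as_fp16 = np.float16(value)

print("FP32:", as_fp32, "| bytes per value:", as_fp32.itemsize)
print("FP16:", as_fp16, "| bytes per value:", as_fp16.itemsize)

# Machine epsilon: the gap between 1.0 and the next representable number
print("FP32 eps:", np.finfo(np.float32).eps)
print("FP16 eps:", np.finfo(np.float16).eps)

# Largest finite FP16 value; anything larger overflows to inf
print("FP16 max:", np.finfo(np.float16).max)
```

The 10 mantissa bits give FP16 a relative precision of about one part in a thousand, and the 5 exponent bits cap its range at 65504.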

---

#### 📉 Why Use FP16?

| Feature        | FP32                     | FP16                    |
|----------------|--------------------------|-------------------------|
| Memory usage   | 4 bytes per value         | 2 bytes per value       |
| Compute speed  | Slower on GPUs            | Much faster on GPUs (especially A100/H100) |
| Energy usage   | Higher                    | Lower                   |
| Precision      | High                      | Slightly reduced (acceptable for inference) |

🧠 FP16 helps run **large models** on GPUs with limited memory (e.g., 24GB vs 80GB cards).

---

#### ⚙️ What Is O1 Optimization?

`O1` is one of the optimization levels (`O0` to `O3`) defined by NVIDIA Apex AMP for **mixed-precision training**. Libraries such as **[DeepSpeed](https://www.deepspeed.ai/)** can pass these levels through in their configuration, and **[Accelerate](https://huggingface.co/docs/accelerate)** offers comparable mixed-precision modes; the same idea carries over to inference.

| Optimization Level | Description                           |
|--------------------|---------------------------------------|
| O0                 | Full precision (FP32)                 |
| **O1**             | **Mixed precision (auto FP16 + FP32 fallback)** |
| O2                 | "Almost FP16" mixed precision (FP16 model weights, FP32 master weights and batch norm) |
| O3                 | Pure FP16                             |

##### 🔧 What Does O1 Do?
- Automatically **casts compatible operations** (like matmul) to FP16
- **Keeps numerically sensitive ops** (e.g., layer norm, softmax) in FP32
- Result: **Best balance** between speed and stability
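
A small NumPy sketch of why range-sensitive operations are kept in FP32. It deliberately uses a naive softmax (no max subtraction, unlike the `softmax` in Part 1) purely to expose the FP16 range limit; the logits are made-up values.

```python
import numpy as np

logits = np.array([12.0, 11.0, 10.0])

with np.errstate(over="ignore", invalid="ignore"):
    # exp(12) is about 162755, above the FP16 maximum of 65504, so it becomes inf
    exp_fp16 = np.exp(logits.astype(np.float16))
    naive_fp16 = exp_fp16 / exp_fp16.sum()

# Same computation in FP32, cast down to FP16 only at the end
exp_fp32 = np.exp(logits.astype(np.float32))
probs = (exp_fp32 / exp_fp32.sum()).astype(np.float16)

print("FP16 exponentials:", exp_fp16)
print("Naive FP16 softmax:", naive_fp16)
print("FP32 softmax cast to FP16:", probs)
```

The FP16 path overflows and produces `nan`, while the FP32 path yields a valid distribution that survives the cast to FP16. Mixed precision applies the same idea per operation.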

---

#### ⚡ Benefits at Inference Time

| Metric            | Before (FP32 / O0) | After (FP16 / O1) |
|------------------|---------------------|--------------------|
| VRAM usage       | High                | ~2x lower          |
| Batch size limit | Smaller             | Larger             |
| Latency          | Higher              | Lower              |
| Throughput       | Lower               | Higher             |

**Example:** Running LLaMA-7B in FP32 might require ~30GB VRAM, while FP16 can bring that down to ~16GB.
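
The arithmetic behind those figures can be sketched as follows. This assumes a nominal 7 billion parameters and counts weights only; the helper name `weight_memory_gb` is made up for this example.

```python
def weight_memory_gb(num_params, bytes_per_param):
    # Memory for the weights only; activations, KV cache,
    # and framework overhead come on top of this
    return num_params * bytes_per_param / 1e9

num_params = 7e9  # assumption: a nominal "7B" model

fp32_gb = weight_memory_gb(num_params, 4)  # 4 bytes per FP32 value
fp16_gb = weight_memory_gb(num_params, 2)  # 2 bytes per FP16 value

print(f"FP32 weights: {fp32_gb:.0f} GB")
print(f"FP16 weights: {fp16_gb:.0f} GB")
```

The weights alone come to about 28 GB in FP32 and 14 GB in FP16, which is consistent with the approximate totals quoted above once runtime overhead is added.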

---

#### 💡 Code Example (Hugging Face Transformers)

In [22]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "openchat/openchat-3.5-1210"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # 👈 Enable half-precision
    device_map="auto"
)

Downloading shards: 100%|██████████| 3/3 [01:20<00:00, 26.91s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:11<00:00,  3.96s/it]


#### 🧪 Bonus: Beyond Half Precision
Beyond half precision, further savings come from techniques such as quantization, sparse attention, and custom kernels.

These are supported by tools like DeepSpeed, vLLM, AWQ, and TensorRT.

