# Week 2 Practice: From Transformers to Alignment
### Learning Objectives:
* Understand attention mechanisms through NumPy code
* Build a simple transformer block
* Predict next token using a pretrained LLM
* Analyze hallucinations
* Explore supervised fine-tuning logic
* Understand how DPO works via preference modeling

#### Package Introduction

In this notebook, we will use several important Python libraries:

- **[NumPy](https://education.launchcode.org/data-analysis-curriculum/eda-with-pandas/reading/numpy-intro/index.html?utm_term=launchcode&utm_campaign=&utm_source=bing&utm_medium=ppc&hsa_acc=4368208516&hsa_cam=568518766&hsa_grp=1173180668353233&hsa_ad=&hsa_src=o&hsa_tgt=dat-2325123495982042:loc-190&hsa_kw=launchcode&hsa_mt=b&hsa_net=adwords&hsa_ver=3&msclkid=f69fad9bed3f18d44e31c5a6703d580b&utm_content=Group%202)**: The fundamental package for scientific computing with Python. We use it for matrix operations and to demonstrate the attention mechanism.
- **[PyTorch](https://www.geeksforgeeks.org/start-learning-pytorch-for-beginners/)**: A popular deep learning framework. We use it to build and train neural network models, including transformer blocks.
- **[Hugging Face Transformers](https://github.com/huggingface/transformers)**: Provides state-of-the-art pre-trained models and tools for natural language processing. We use it to load and interact with large language models (LLMs).
- **[huggingface-cli](https://huggingface.co/docs/huggingface_hub/main/en/guides/cli)**: A command-line tool for managing Hugging Face models and datasets. Useful for downloading models or checking your authentication.

Make sure you have these packages installed. You can install them using pip if needed:


In [1]:
! pip install numpy torch transformers huggingface_hub

# For `huggingface-cli`, it is included with `huggingface_hub` or `transformers`. You can check your installation with:

! huggingface-cli --help


Collecting transformers
  Downloading transformers-4.54.1-py3-none-any.whl.metadata (41 kB)
     ---------------------------------------- 0.0/41.7 kB ? eta -:--:--
     ------------------ ------------------- 20.5/41.7 kB 640.0 kB/s eta 0:00:01
     -------------------------------------- 41.7/41.7 kB 499.5 kB/s eta 0:00:00
Collecting huggingface_hub
  Downloading huggingface_hub-0.34.3-py3-none-any.whl.metadata (14 kB)
Collecting regex!=2019.12.17 (from transformers)
  Downloading regex-2025.7.34-cp311-cp311-win_amd64.whl.metadata (41 kB)
     ---------------------------------------- 0.0/41.5 kB ? eta -:--:--
     ---------------------------------------- 0.0/41.5 kB ? eta -:--:--
     ---------------------------------------- 0.0/41.5 kB ? eta -:--:--
     ---------------------------------------- 0.0/41.5 kB ? eta -:--:--
     ---------------------------------------- 0.0/41.5 kB ? eta -:--:--
     ---------------------------------------- 0.0/41.5 kB ? eta -:--:--
     -------------------


[notice] A new release of pip is available: 24.0 -> 25.2
[notice] To update, run: C:\Users\johnny\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip


usage: huggingface-cli <command> [<args>]

positional arguments:
  {download,upload,repo-files,env,login,whoami,logout,auth,repo,lfs-enable-largefiles,lfs-multipart-upload,scan-cache,delete-cache,tag,version,upload-large-folder}
                        huggingface-cli command helpers
    download            Download files from the Hub
    upload              Upload a file or a folder to a repo on the Hub
    repo-files          Manage files in a repo on the Hub
    env                 Print information about the environment.
    login               Log in using a token from
                        huggingface.co/settings/tokens
    whoami              Find out which huggingface.co account you are logged
                        in as.
    logout              Log out
    auth                Other authentication related commands
    repo                {create} Commands to interact with your huggingface.co
                        repos.
    lfs-enable-largefiles
                        Co

### Part 1: Attention Mechanism (Self-Attention)

In [1]:
import numpy as np

# Random Q, K, V matrices
def generate_random_qkv(seq_len=4, d_model=8):
    return [np.random.rand(seq_len, d_model) for _ in range(3)]

# Scaled dot-product attention
def self_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = np.dot(Q, K.T) / np.sqrt(d_k)
    weights = softmax(scores)
    output = np.dot(weights, V)
    return output, weights

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

Q, K, V = generate_random_qkv()
out, attn_weights = self_attention(Q, K, V)
print("Attention Output:\n", out)
print("Attention Weights:\n", attn_weights)


Attention Output:
 [[0.472317   0.6004395  0.47280575 0.56079671 0.76212407 0.48117356
  0.54135689 0.61600153]
 [0.46835529 0.61284489 0.47745831 0.557511   0.77170991 0.48230037
  0.56134899 0.61996662]
 [0.47425451 0.61578046 0.48533681 0.56396424 0.76298441 0.46473933
  0.53955355 0.61884739]
 [0.46978432 0.60429574 0.47276668 0.55957995 0.76502296 0.48343427
  0.54745493 0.61506526]]
Attention Weights:
 [[0.29168436 0.24899303 0.23584364 0.22347897]
 [0.31473761 0.23272291 0.24143685 0.21110263]
 [0.29163734 0.21916253 0.24068584 0.24851429]
 [0.29723171 0.24508877 0.23970565 0.21797388]]


 Discussion: Walk students through the QK^T score computation, scaling, and softmax. Explain how this captures relationships between tokens.

### Part 2: Mini Transformer Block in PyTorch

In [7]:
import torch
import torch.nn as nn

class MiniTransformerBlock(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=2, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, embed_dim * 4),
            nn.ReLU(),
            nn.Linear(embed_dim * 4, embed_dim)
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
    
    def forward(self, x):
        attn_output, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_output)
        ffn_output = self.ffn(x)
        x = self.norm2(x + ffn_output)
        return x

x = torch.randn(1, 5, 16)  # batch_size=1, seq_len=5, embed_dim=16
model = MiniTransformerBlock(embed_dim=16)
out = model(x)
print(out.data)


tensor([[[ 0.3437, -0.7434,  1.9965,  0.1187, -0.0828, -2.5316,  0.3967,
           0.6023, -0.8104,  0.4995, -0.1412,  0.0738,  0.9935, -1.1261,
           0.9544, -0.5436],
         [ 0.5594,  1.2446, -1.4856, -1.0434, -0.1766, -0.1144,  0.9168,
           1.5608, -0.6772,  0.0592, -0.8327, -1.6486, -0.4579,  1.4244,
          -0.4341,  1.1052],
         [ 0.3649,  0.0834,  0.4399,  0.8397,  0.9748, -1.9832, -0.1018,
           0.7218, -0.5787,  0.6477,  1.4206, -0.4369,  1.0727, -1.9019,
          -0.2366, -1.3265],
         [-0.5305,  0.3394, -1.8503, -0.4006,  2.2120, -0.0298,  0.5067,
           0.3052,  0.4162, -1.6105,  0.6547,  0.4939, -1.6868,  0.2873,
           0.5142,  0.3790],
         [-0.6227, -0.5340,  0.3867, -1.8154,  0.3300,  0.6576,  0.3610,
          -0.6669,  0.6062,  0.8167,  1.9314, -0.4063,  1.3760, -1.1439,
          -1.5919,  0.3156]]])


 Goal: Show how self-attention and FFN work with residual and norm in PyTorch.

### Part 3: Next Token Prediction using HuggingFace

#### Option 1: Use a Publicly Available Model
Use a non-gated model such as TinyLlama or mistralai/Mistral-7B-v0.1

In [2]:
# Step 1: Login via CLI
#! pip install --upgrade huggingface_hub[cli]
# To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
#! huggingface-cli login --token **removed**
! hf auth login --token **removed**


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `hf`CLI if you want to set the git credential as well.
Token is valid (permission: fineGrained).
The token `MLE_Week2` has been saved to C:\Users\johnny\.cache\huggingface\stored_tokens
Your token has been saved to C:\Users\johnny\.cache\huggingface\token
Login successful.
The current active token is: `MLE_Week2`


In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from huggingface_hub import login

# login(token="your_hf_token")  # optional if already logged in via CLI

# you have to visit https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1 to sign the agreement in order to use this model
model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_auth_token=True)
model = AutoModelForCausalLM.from_pretrained(model_name, use_auth_token=True, device_map="auto")




Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Some parameters are on the meta device because they were offloaded to the disk.


If you're using a Mac with M1/M2/M3 and have this line working:

In [14]:
! python -c "import torch; print(torch.backends.mps.is_available())"


False


If it returns True, then you can run like this:

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from huggingface_hub import login
import torch

# Optional if already logged in via CLI
# login(token="your_hf_token")

# Check device for MacBook (MPS if available, else CPU)
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print(f"Using device: {device}")

model_name = "mistralai/Mistral-7B-Instruct-v0.1"

# Load tokenizer and model with Hugging Face gated repo access
tokenizer = AutoTokenizer.from_pretrained(model_name, use_auth_token=True)
model = AutoModelForCausalLM.from_pretrained(model_name, use_auth_token=True).to(device)

# Prepare input prompt
prompt = "The Eiffel Tower is located in"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Run generation
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=10)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))


  from .autonotebook import tqdm as notebook_tqdm


Using device: cpu


Loading checkpoint shards: 100%|██████████| 2/2 [00:29<00:00, 14.89s/it]
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


The Eiffel Tower is located in Paris, France, and is one of the most


If you are not sure about your GPU in your device, selects the best available device in the order: CUDA → MPS → CPU and also prints which one it chose:

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from huggingface_hub import login
import torch

# Optional if already logged in via CLI
# login(token="your_hf_token")

# ------------------------------------------
# 📦 Device Selection: CUDA > MPS > CPU
# ------------------------------------------
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("✅ Using CUDA (GPU)")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("🟡 Using MPS (Apple Silicon GPU)")
else:
    device = torch.device("cpu")
    print("🔴 Using CPU")

# ------------------------------------------
# 🧠 Load Model from Hugging Face
# ------------------------------------------
model_name = "mistralai/Mistral-7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_auth_token=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    use_auth_token=True,
    torch_dtype=torch.float16 if device.type != "cpu" else torch.float32  # avoid FP16 on CPU
).to(device)

# ------------------------------------------
# 📝 Prompt + Inference
# ------------------------------------------
prompt = "The Eiffel Tower is located in"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=10)
    print("📝 Generated Output:", tokenizer.decode(outputs[0], skip_special_tokens=True))


  from .autonotebook import tqdm as notebook_tqdm


✅ Using CUDA (GPU)


Loading checkpoint shards: 100%|██████████| 2/2 [00:20<00:00, 10.33s/it]
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


📝 Generated Output: The Eiffel Tower is located in Paris, France, and is one of the most


### Part 4: DPO vs PPO – Side-by-Side Educational Example
#### DPO: Direct Preference Optimization

In [32]:
import torch
import torch.nn.functional as F

# Simulated log-probs of chosen vs rejected completions
chosen_logp = torch.tensor([[-2.0]])
rejected_logp = torch.tensor([[-4.0]])

def dpo_loss(chosen_logp, rejected_logp, beta=0.1):
    return -F.logsigmoid((chosen_logp - rejected_logp) / beta).mean()

print("DPO Loss:", dpo_loss(chosen_logp, rejected_logp).item())

print("DPO Loss:", dpo_loss(torch.tensor([[-1.0]]), torch.tensor([[-10.0]])).item())
print("DPO Loss:", dpo_loss(torch.tensor([[-2.0]]), torch.tensor([[-15.0]])).item())
print("DPO Loss:", dpo_loss(torch.tensor([[-3.0]]), torch.tensor([[-7.0]])).item())
print("DPO Loss:", dpo_loss(torch.tensor([[-4.0]]), torch.tensor([[-5.0]])).item())



DPO Loss: 2.06115369216775e-09
DPO Loss: 8.194008692231508e-40
DPO Loss: -0.0
DPO Loss: 4.24835413113866e-18
DPO Loss: 4.5398901420412585e-05


#### PPO: Proximal Policy Optimization (simplified for in-class demo)

In [34]:
import torch
import torch.nn.functional as F

# Simulated old and new policy log-probs (log π_θ(a|s) and log π_θ_old(a|s))
old_log_prob = torch.tensor([[-1.0]])  # from reference policy (e.g. GPT-4 before PPO step)
new_log_prob = torch.tensor([[-0.8]])  # from updated policy
reward = torch.tensor([[1.0]])         # reward from human or reward model
epsilon = 0.2                          # PPO clipping parameter

# Compute ratio of new to old policy
log_ratio = new_log_prob - old_log_prob
ratio = torch.exp(log_ratio)

# Unclipped and clipped advantages
advantage = reward  # assume reward ~ advantage for simplicity
clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)

# PPO loss (negative of the clipped surrogate objective)
ppo_loss = -torch.min(ratio * advantage, clipped_ratio * advantage).mean()
print("PPO Loss:", ppo_loss.item())


PPO Loss: -1.2000000476837158


## 🔍 DPO vs PPO: Alignment Loss Comparison

| Criterion         | DPO (Direct Preference Optimization)     | PPO (Proximal Policy Optimization)        |
|------------------|-------------------------------------------|--------------------------------------------|
| 🧠 Origin         | Preference modeling (UnfoldAI 2023)        | Reinforcement Learning (OpenAI 2017)       |
| ✅ Rejection Signal | Yes — uses chosen vs rejected pairs       | No — requires scalar reward                |
| 🏆 Reward Signal   | Implicit via logit difference              | Explicit reward model needed               |
| 📐 Loss Function   | `-log(sigmoid((chosen - rejected)/β))`     | `-min(ratio * A, clipped_ratio * A)`       |
| 🔧 Optimization    | Binary classification over preferences     | Policy gradient with clipped surrogate     |
| 🎯 Application     | DPO-tuned models like LLaMA 3              | RLHF-tuned models like InstructGPT         |
| ⚙️ Complexity      | Simpler (no reward model needed)           | More complex (needs reward model + sampling) |



Explain how aligning models toward human preference uses logit differences.

### Bonus: Inference with Quantization (O1 & O3)

In [5]:
# Run model with torch_dtype=torch.float16 for O1
import torch

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Some parameters are on the meta device because they were offloaded to the cpu and disk.


Concept: Explain how FP16/O1 optimizes memory and speed at inference.

### 🧠 Concept Breakdown: How FP16 & O1 Optimize Inference

#### 📌 What is FP16?

- **FP16** = *16-bit floating point*, also called **half precision**.
- It uses **less memory** than FP32 (standard 32-bit float), with:
  - 1 sign bit
  - 5 exponent bits
  - 10 mantissa bits
- Typical FP32 values: `0.123456789`
- FP16 representation: `0.1234` (lower precision but good enough for inference)

---

#### 📉 Why Use FP16?

| Feature        | FP32                     | FP16                    |
|----------------|--------------------------|-------------------------|
| Memory usage   | 4 bytes per value         | 2 bytes per value       |
| Compute speed  | Slower on GPUs            | Much faster on GPUs (especially A100/H100) |
| Energy usage   | Higher                    | Lower                   |
| Precision      | High                      | Slightly reduced (acceptable for inference) |

🧠 FP16 helps run **large models** on GPUs with limited memory (e.g., 24GB vs 80GB cards).

---

#### ⚙️ What Is O1 Optimization?

`O1` is a setting from **[DeepSpeed](https://www.deepspeed.ai/)** and **[Accelerate](https://huggingface.co/docs/accelerate)** used for **mixed-precision inference/training**.

| Optimization Level | Description                           |
|--------------------|---------------------------------------|
| O0                 | Full precision (FP32)                 |
| **O1**             | **Mixed precision (auto FP16 + FP32 fallback)** |
| O2                 | Pure FP16                             |
| O3                 | Advanced optimizations (e.g., quantization, kernel fusion) |

##### 🔧 What Does O1 Do?
- Automatically **casts compatible operations** (like matmul) to FP16
- **Keeps numerically sensitive ops** (e.g., layer norm, softmax) in FP32
- Result: **Best balance** between speed and stability

---

#### ⚡ Benefits at Inference Time

| Metric            | Before (FP32 / O0) | After (FP16 / O1) |
|------------------|---------------------|--------------------|
| VRAM usage       | High                | ~2x lower          |
| Batch size limit | Smaller             | Larger             |
| Latency          | Higher              | Lower              |
| Throughput       | Lower               | Higher             |

**Example:** Running LLaMA-7B in FP32 might require ~30GB VRAM, while FP16 can bring that down to ~16GB.

---

#### 💡 Code Example (Hugging Face Transformers)

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "openchat/openchat-3.5-1210"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # 👈 Enable half-precision
    device_map="auto"
)

  from .autonotebook import tqdm as notebook_tqdm
Fetching 3 files: 100%|██████████| 3/3 [11:51<00:00, 237.02s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:07<00:00,  2.53s/it]
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Some parameters are on the meta device because they were offloaded to the disk and cpu.


#### 🧪 Bonus: Combine with O3
O3 goes further with quantization, sparse attention, and custom kernels

Supported by tools like DeepSpeed, vLLM, AWQ, and TensorRT



In [None]:
!pip install gguf

# working on llama-cpp-python install

Collecting gguf
  Downloading gguf-0.17.1-py3-none-any.whl.metadata (4.3 kB)
Downloading gguf-0.17.1-py3-none-any.whl (96 kB)
Installing collected packages: gguf
Successfully installed gguf-0.17.1


In [7]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# 1. Define the model and GGUF file name
# We'll use a GGUF-specific model repository from TheBloke,
# which is a great source for quantized models.
model_id = "TheBloke/Llama-2-7B-Chat-GGUF"
gguf_file = "C:\\Users\\johnny\\python\\InferenceAI\\MLE_in_Gen_AI-Course\\p101\\llama-2-7b-chat.Q4_K_M.gguf"

# 2. Load the tokenizer and model from the GGUF file
# The `gguf_file` parameter is the key to loading a GGUF model.
try:
    print(f"Loading GGUF model '{gguf_file}' from '{model_id}'...")
    tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        gguf_file=gguf_file,
        torch_dtype=torch.float32,  # GGUF models are dequantized to a PyTorch format for inference
        device_map="auto"           # Automatically map model layers to available devices (e.g., GPU, CPU)
    )
    print("Model loaded successfully!")

except Exception as e:
    print(f"An error occurred: {e}")
    print("Please ensure the model and GGUF file exist on the Hugging Face Hub or locally.")

# 3. Use the model for inference (optional)
if 'model' in locals() and 'tokenizer' in locals():
    # Define your prompt
    prompt = "Tell me a short story about a dragon."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generate a response
    print("\nGenerating text...")
    outputs = model.generate(**inputs, max_new_tokens=100)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    print("\n--- Generated Response ---")
    print(response)
    print("--------------------------")

Loading GGUF model 'C:\Users\johnny\python\InferenceAI\MLE_in_Gen_AI-Course\p101\llama-2-7b-chat.Q4_K_M.gguf' from 'TheBloke/Llama-2-7B-Chat-GGUF'...


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


An error occurred: TheBloke/Llama-2-7B-Chat-GGUF does not appear to have a file named C:\Users\johnny\python\InferenceAI\MLE_in_Gen_AI-Course\p101\llama-2-7b-chat.Q4_K_M.gguf. Checkout 'https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/tree/main' for available files.
Please ensure the model and GGUF file exist on the Hugging Face Hub or locally.

Generating text...

--- Generated Response ---
Tell me a short story about a dragon.

Once upon a time, in a faraway kingdom, there lived a dragon named Dracul. Dracul was different from other dragons. He didn't breathe fire, and he didn't want to conquer the kingdom. All he wanted was to live in peace and protect the people he cared about.

One day, a group of villagers came to Dracul's cave, seeking his help. They were being terrorized by a band of go
--------------------------
