# Week 10: Introduction to Large Language Models - Homework

**ML2: Advanced Machine Learning**

**Estimated Time**: 1 hour

---

This homework combines programming exercises and knowledge-based questions to reinforce this week's concepts.

## Setup

Run this cell to import necessary libraries:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn

# Set random seed for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print('✓ Libraries imported successfully')

---
## Part 1: Programming Exercises (60%)

Complete the following programming tasks. Read each description carefully and implement the requested functionality.

### Exercise 1: Temperature and Sampling Experiment

**Time**: 10 min

Explore how temperature affects LLM output diversity.

In [None]:
# Conceptual demonstration (pseudocode; the outputs below are illustrative)
# In practice, you would call an LLM API (e.g., the OpenAI API)

# Temperature = 0.0 (deterministic, always picks most likely token)
prompt = "The capital of France is"
# Output: "Paris" (every time)

# Temperature = 0.7 (balanced creativity)
# Output: "Paris" (high probability)
#         "Paris, which is known for" (medium probability)

# Temperature = 1.5 (very creative/random)
# Output: "located in Europe" (lower probability)
#         "a fascinating question" (even lower probability)

# TODO: Run experiments with different temperatures
# Observe: low temperature = predictable/repetitive; high temperature = diverse but increasingly incoherent
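
To make the effect of temperature concrete without an API key, here is a minimal sketch. It assumes a toy four-token vocabulary with made-up logits (the names `tokens`, `logits`, and `temperature_probs` are hypothetical, not real model output) and shows how dividing logits by the temperature before the softmax sharpens or flattens the sampling distribution.

```python
import numpy as np

# Hypothetical next-token candidates and logits (made-up values for illustration)
tokens = ["Paris", "located", "a", "the"]
logits = np.array([5.0, 2.0, 1.0, 0.5])

def temperature_probs(logits, temperature):
    """Return softmax(logits / temperature); temperature 0 is treated as greedy argmax."""
    if temperature == 0:
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0
        return probs
    scaled = logits / temperature
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

rng = np.random.default_rng(42)
for t in [0.0, 0.7, 1.5]:
    probs = temperature_probs(logits, t)
    samples = rng.choice(tokens, size=10, p=probs)
    print(f"T={t}: probs={np.round(probs, 3)}, samples={list(samples)}")
```

As the temperature rises, probability mass moves away from the top token, so the sampled list becomes more varied.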

---
## Part 2: Knowledge Questions (40%)

Answer the following questions to test your conceptual understanding.

### Question 1 (Short Answer)

**Question 1 - Emergence from Scale**

- Small language models (millions of parameters): complete sentences
- Large language models (billions of parameters): reasoning, math, coding, translation

Emergent abilities are capabilities that appear abruptly at scale and are absent from smaller models.

Explain:
1. Why does scale enable new capabilities?
2. Is this just "more data" or something fundamental?
3. What surprised researchers about GPT-3's abilities?

**Hint**: Scale allows models to capture more complex patterns and relationships in data.

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 2 (Short Answer)

**Question 2 - Pretraining vs Fine-tuning**

- Pretraining: Learn language from massive unlabeled text ("predict the next word")
- Fine-tuning: Adapt to specific tasks with labeled data

Explain:
1. Why is pretraining on unlabeled data so powerful?
2. What does the model learn during pretraining?
3. How does fine-tuning leverage this?

**Hint**: Pretraining = general language understanding. Fine-tuning = task specialization.
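
The "predict the next word" objective can be sketched as follows. This is a minimal illustration, assuming made-up token ids and random logits standing in for a real model's output. It shows why no labels are needed: the targets are just the input sequence shifted by one position.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 10
token_ids = np.array([3, 7, 1, 4, 4, 9])  # made-up token ids for one "sentence"

# The text supplies its own labels: predict token t+1 from tokens up to t
inputs, targets = token_ids[:-1], token_ids[1:]

# Stand-in for a language model: random logits of shape (len(inputs), vocab_size)
logits = rng.normal(size=(len(inputs), vocab_size))

# Numerically stable log-softmax over the vocabulary
shifted = logits - logits.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

# Average negative log-likelihood of the actual next tokens
loss = -log_probs[np.arange(len(targets)), targets].mean()
print(f"next-token cross-entropy: {loss:.3f}")
```

Pretraining minimizes this loss over very large text corpora. Fine-tuning typically continues from the resulting weights on a smaller, task-specific dataset.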

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 3 (Multiple Choice)

**Question 3 - Temperature Parameter**

Temperature controls how the model samples from its probability distribution.

What happens with temperature = 0?

A) Random outputs
B) Always selects the most likely next token (deterministic)
C) Model stops working
D) Longer responses

**Hint**: Temp=0 → greedily pick highest probability token every time.

**Your Answer**: [Write your answer here - e.g., 'B']

**Explanation**: [Explain why this is correct]

### Question 4 (Short Answer)

**Question 4 - Context Window Limitations**

Depending on the version, GPT-4's context window ranges from roughly 8k to 128k tokens.

Explain:
1. What happens when your conversation exceeds the context window?
2. Why can't we just make infinite context windows?
3. How does this affect long document summarization?

**Hint**: Standard self-attention costs O(n²) time and memory in the sequence length n, so both grow rapidly as the context gets longer.
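
The quadratic growth can be seen directly in the shape of the attention score matrix. The sketch below assumes a naive single-head implementation with random queries and keys (`attention_scores` and the head dimension `d=16` are illustrative choices). Optimized implementations may avoid materializing the full matrix.

```python
import numpy as np

def attention_scores(n, d=16, seed=0):
    """Scaled dot-product scores for n tokens: an (n, n) matrix."""
    rng = np.random.default_rng(seed)
    q = rng.normal(size=(n, d))  # one query vector per token
    k = rng.normal(size=(n, d))  # one key vector per token
    return q @ k.T / np.sqrt(d)

for n in [128, 256, 512]:
    scores = attention_scores(n)
    print(f"n={n}: score matrix shape={scores.shape}, entries={scores.size}")

# Back-of-the-envelope memory for one float32 score matrix (single head, single layer)
for n in [8_000, 128_000]:
    print(f"n={n}: {n * n * 4 / 1e9:.2f} GB")
```

Doubling the sequence length quadruples the number of entries, which is why context length cannot simply be increased without limit.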

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 5 (Short Answer)

**Question 5 - In-Context Learning**

You can teach GPT new tasks by providing examples IN THE PROMPT (no fine-tuning needed).

Example:
- Prompt: "Translate to French: Hello → Bonjour, Goodbye → Au revoir, Thank you → "
- Output: "Merci"

Explain:
1. How does the model "learn" from these examples without training?
2. Why is this a game-changer for NLP?
3. What are the limits?

**Hint**: The model recognizes the pattern at inference time, leveraging its pretraining.
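
A few-shot prompt is just a string assembled from demonstrations plus the new query. The sketch below uses a hypothetical helper, `build_few_shot_prompt`, to rebuild the translation prompt from the example above. Sending it to a model is left out, since that depends on the API you use.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Format demonstration pairs and a final query into a single prompt string."""
    demos = ", ".join(f"{src} → {tgt}" for src, tgt in examples)
    return f"{instruction}: {demos}, {query} → "

examples = [("Hello", "Bonjour"), ("Goodbye", "Au revoir")]
prompt = build_few_shot_prompt("Translate to French", examples, "Thank you")
print(prompt)
```

No weights change here: the "learning" happens entirely through the text placed in the context window.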

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 6 (Multiple Choice)

**Question 6 - Tokenization**

LLMs don't process characters or words—they process TOKENS.

What is a token?

A) Always one word
B) Always one character  
C) A subword unit (could be part of word, whole word, or punctuation)
D) A sentence

**Hint**: "ChatGPT" might be 2 tokens: "Chat" + "GPT". Tokenization is subword-based.
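
To illustrate what a subword split looks like, here is a toy greedy longest-match tokenizer over a hand-picked vocabulary. Both `greedy_tokenize` and `toy_vocab` are assumptions for illustration. Real tokenizers (for example, byte-pair encoding) learn their vocabulary from data and may split these words differently.

```python
def greedy_tokenize(text, vocab):
    """Split text into the longest matching vocabulary pieces, left to right."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate substring first
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry matches: fall back to a single character
            tokens.append(text[i])
            i += 1
    return tokens

toy_vocab = {"Chat", "GPT", "token", "ization", "un", "believ", "able"}
for word in ["ChatGPT", "tokenization", "unbelievable"]:
    print(word, "->", greedy_tokenize(word, toy_vocab))
```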

**Your Answer**: [Write your answer here - e.g., 'B']

**Explanation**: [Explain why this is correct]

### Question 7 (Short Answer)

**Question 7 - Hallucinations**

LLMs sometimes generate plausible-sounding but factually incorrect information.

Explain:
1. Why do hallucinations happen?
2. Is this fundamentally fixable, or inherent to the approach?
3. How can you reduce hallucinations in practice?

**Hint**: LLMs predict plausible text, not necessarily true text. They don't have a "fact database".

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 8 (Short Answer)

**Question 8 - RLHF (Reinforcement Learning from Human Feedback)**

ChatGPT was trained with RLHF to align its outputs with human preferences.

Process:
1. Collect human rankings of model outputs
2. Train a reward model to predict human preferences
3. Fine-tune the LLM with reinforcement learning to maximize the predicted reward

Explain: Why is this better than just supervised fine-tuning on human demonstrations?

**Hint**: RLHF allows learning from comparisons ("A is better than B"), not just demonstrations.
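
Step 2 of the process is often trained with a pairwise comparison loss. The sketch below shows one commonly used formulation, a Bradley-Terry style loss, as an assumption. The source does not specify which loss is used, and the reward values are made up.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected))."""
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(d)) is equal to log(1 + exp(-d))
    return math.log1p(math.exp(-diff))

# Reward model agrees with the human ranking: small loss
print(pairwise_preference_loss(2.0, 0.5))
# Reward model disagrees with the human ranking: larger loss
print(pairwise_preference_loss(0.5, 2.0))
```

The loss only needs to know which of two outputs a human preferred, not what the ideal output would have been.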

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 9 (Short Answer)

**Question 9 - Zero-Shot vs Few-Shot**

- Zero-shot: "Classify sentiment: 'I love this!' → "
- Few-shot: "Positive: 'Great!' Negative: 'Terrible!' Classify: 'I love this!' → "

Explain:
1. When does few-shot help significantly?
2. When is zero-shot sufficient?
3. What does this reveal about what LLMs learned during pretraining?

**Hint**: LLMs have general capabilities. Few-shot refines them for specific formats/tasks.

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 10 (Short Answer)

**Question 10 - Scaling Laws**

Research shows that LLM performance (typically measured as test loss) improves predictably, roughly as a power law, with model size, dataset size, and compute:
Performance = f(model_size, data_size, compute)

Explain:
1. What does this predict about future models?
2. What are the bottlenecks to continued scaling?
3. Is "bigger is always better" sustainable?

**Hint**: Bottlenecks: compute cost, energy, data availability, diminishing returns.
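
The "diminishing returns" bottleneck follows from the power-law shape. The sketch below assumes a loss of the form L(N) = (n_c / N)^alpha in the parameter count N. The constants `n_c` and `alpha` are made up for illustration and are not fitted values from any paper.

```python
def power_law_loss(n_params, n_c=1e13, alpha=0.1):
    """Illustrative power law L(N) = (n_c / N) ** alpha; constants are made up."""
    return (n_c / n_params) ** alpha

previous = None
for n in [1e8, 1e9, 1e10, 1e11]:
    loss = power_law_loss(n)
    gain = "" if previous is None else f", improvement={previous - loss:.3f}"
    print(f"N={n:.0e}: loss={loss:.3f}{gain}")
    previous = loss
```

Each 10x increase in parameters multiplies the loss by the same factor, so the absolute improvement shrinks while the cost of each step grows.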

**Your Answer**:

[Write your answer here in 2-4 sentences]

---
## Submission

Before submitting:
1. Run all cells to ensure code executes without errors
2. Check that all questions are answered
3. Review your explanations for clarity

**To Submit**:
- File → Download → Download .ipynb
- Submit the notebook file to your course LMS

**Note**: Make sure your name is in the filename (e.g., homework_01_yourname.ipynb)