# Week 1: Introduction to Deep Learning - Homework

**ML2: Advanced Machine Learning**

**Estimated Time**: 1 hour

---

This homework combines programming exercises and knowledge-based questions to reinforce this week's concepts.

## Setup

Run this cell to import necessary libraries:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn

# Set random seed for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print('✓ Libraries imported successfully')

---
## Part 1: Programming Exercises (60%)

Complete the following programming tasks. Read each description carefully and implement the requested functionality.

### Exercise 1: Experiment: Observing Feature Learning

**Time**: 8 min

Run this code to see a small network learn its own features for a simple task instead of relying on hand-crafted ones. Observe the printed loss and first-layer weights, then answer the related reflection question (Question 8) in Part 2.

In [None]:
import torch
import torch.nn as nn
import numpy as np

# Simulate a simple pattern recognition task
# Pattern: Detect if sum of inputs > 0
np.random.seed(42)
torch.manual_seed(42)

# Generate data
X = torch.randn(100, 4)  # 100 samples, 4 features
y = (X.sum(dim=1) > 0).float()  # Label: 1 if sum > 0, else 0

# Network that LEARNS features
model = nn.Sequential(
    nn.Linear(4, 8),   # Learned feature extraction
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Sigmoid()
)

# Train for a few steps
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCELoss()

for epoch in range(20):
    optimizer.zero_grad()
    predictions = model(X).squeeze()
    loss = loss_fn(predictions, y)
    loss.backward()
    optimizer.step()

print(f"Final loss: {loss.item():.4f}")
print("First layer weights (learned features):")
print(model[0].weight.data)

# TODO: After running, answer reflection questions below
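
For comparison with the learned features above, a hand-crafted baseline can be sketched as follows. This is only an illustrative sketch: it regenerates data with the same recipe as the experiment, and the variable names `handcrafted_feature`, `handcrafted_pred`, and `handcrafted_acc` are hypothetical, not part of the original exercise. Here a human decides in advance that the sum of the inputs is the feature that matters.

```python
import torch

torch.manual_seed(42)

# Same data-generating recipe as the experiment above
X = torch.randn(100, 4)                 # 100 samples, 4 features
y = (X.sum(dim=1) > 0).float()          # Label: 1 if sum > 0, else 0

# Hand-crafted feature (an assumption for illustration):
# a human picks "sum of the inputs" as the single relevant feature
handcrafted_feature = X.sum(dim=1)

# Threshold the hand-crafted feature at 0 to get a prediction
handcrafted_pred = (handcrafted_feature > 0).float()

# Fraction of samples where the prediction matches the label
handcrafted_acc = (handcrafted_pred == y).float().mean().item()
print(f"Hand-crafted accuracy: {handcrafted_acc:.2f}")
```

The baseline matches the labels by construction, because the person writing it already knew the rule. The network in the experiment has to discover a comparable rule from the data alone; its accuracy could be checked the same way by thresholding `model(X).squeeze()` at 0.5.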

### Exercise 2: Experiment: Network Without Nonlinearity

**Time**: 10 min

This experiment demonstrates why activation functions are essential. Compare two networks: one with ReLU, one without.

In [None]:
import torch
import torch.nn as nn

# Network WITH nonlinearity (ReLU)
network_with_relu = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 15),
    nn.ReLU(),
    nn.Linear(15, 5)
)

# Network WITHOUT nonlinearity (just linear layers)
network_without_relu = nn.Sequential(
    nn.Linear(10, 20),
    nn.Linear(20, 15),
    nn.Linear(15, 5)
)

# Test input
x = torch.randn(1, 10)

# Compare outputs
output_with = network_with_relu(x)
output_without = network_without_relu(x)

print("With ReLU output:", output_with)
print("Without ReLU output:", output_without)

# TODO: Now manually compute what network_without_relu is equivalent to
# Hint: Multiple linear transformations collapse into a single linear transformation
# Can you express the 3-layer linear network as a SINGLE equivalent linear layer?
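
As a starting point for the TODO above, the two-layer case might be sketched like this. It is a hedged illustration, not a reference solution: the names `layer1`, `layer2`, `W`, and `b` are hypothetical, the layer sizes are chosen to mirror the first two layers of the experiment, and the three-layer case is left for you to extend.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two stacked linear layers with no activation in between
layer1 = nn.Linear(10, 20)
layer2 = nn.Linear(20, 15)

# nn.Linear computes x @ weight.T + bias, so stacking gives
# W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)
with torch.no_grad():
    W = layer2.weight @ layer1.weight               # shape (15, 10)
    b = layer2.weight @ layer1.bias + layer2.bias   # shape (15,)

    x = torch.randn(1, 10)
    stacked_out = layer2(layer1(x))   # output of the two-layer stack
    collapsed_out = x @ W.T + b       # output of the single equivalent layer

# The two outputs should agree up to floating-point error
print(torch.allclose(stacked_out, collapsed_out, atol=1e-5))
```

The same idea applies with a third layer: multiply one more weight matrix on the left and fold the accumulated bias through it.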

---
## Part 2: Knowledge Questions (40%)

Answer the following questions to test your conceptual understanding.

### Question 1 (Short Answer)

**Question 1 - Automatic Feature Learning (Conceptual)**

Traditional machine learning for image classification requires manually designing features (e.g., edge detectors, color histograms, texture filters). Deep learning does not.

Explain in 3-4 sentences:
1. WHY can deep networks learn features automatically?
2. WHAT enables this (what architectural property)?
3. What is the tradeoff (what does deep learning need more of)?

**Hint**: Think about what happens in each layer of a deep network and how backpropagation adjusts those layers.

**Your Answer**:

[Write your answer here in 3-4 sentences]

### Question 2 (Short Answer)

**Question 2 - Feature Hierarchy (Conceptual)**

In a deep CNN for face recognition:
- Layer 1 might detect edges
- Layer 2 might detect facial features (eyes, nose)
- Layer 3 might detect whole faces

Explain: Why does depth create this hierarchy? What would happen if you used a single-layer network instead?

**Hint**: Consider how each layer builds on representations from the previous layer.

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 3 (Short Answer)

**Question 3 - Nonlinearity Experiment Reflection**

Based on the 'Network Without Nonlinearity' experiment above:

Prove mathematically or explain conceptually why the 3-layer network without ReLU is equivalent to a SINGLE linear layer. What does this tell you about the necessity of activation functions?

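**Hint**: Remember that composing linear layers yields another linear layer, i.e. Linear_2(Linear_1(x)) = Linear_3(x), because the weight matrices can be multiplied together into a single matrix (and the biases combined into a single bias).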

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 4 (Multiple Choice)

**Question 4 - Understanding Nonlinearity**

A neural network with 10 layers but NO activation functions can represent:

A) Any possible function (universal approximation)
B) Only linear functions
C) Only polynomial functions
D) Only step functions

**Hint**: What happens when you compose linear transformations?

**Your Answer**: [Write your answer here - e.g., 'B']

**Explanation**: [Explain why this is correct]

### Question 5 (Short Answer)

**Question 5 - ReLU Design Choice**

ReLU(x) = max(0, x) is one of the simplest possible nonlinear functions. Yet it has largely replaced sigmoid as the default activation function in the hidden layers of deep networks.

Explain TWO advantages ReLU has over sigmoid for deep networks. One should relate to gradients, one to computation.

**Hint**: Think about what happens to gradients when x is large and positive in sigmoid vs ReLU.
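
To explore the hint numerically, the gradients of both activations can be inspected with autograd. This is a minimal sketch under stated assumptions: the input values are arbitrary illustrative choices, and the names `x_sig`, `x_relu`, `sigmoid_grads`, and `relu_grads` are hypothetical.

```python
import torch

# Inputs spanning large negative, small, and large positive values
x_sig = torch.tensor([-10.0, 0.5, 10.0], requires_grad=True)
torch.sigmoid(x_sig).sum().backward()   # d/dx of sigmoid at each input
sigmoid_grads = x_sig.grad

x_relu = torch.tensor([-10.0, 0.5, 10.0], requires_grad=True)
torch.relu(x_relu).sum().backward()     # d/dx of ReLU at each input
relu_grads = x_relu.grad

print("sigmoid gradients:", sigmoid_grads)
print("ReLU gradients:   ", relu_grads)
```

Compare the two printed rows at the large positive input, and consider what multiplying many such values together across layers would do during backpropagation.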

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 6 (Short Answer)

**Question 6 - Gradient Descent Intuition**

Gradient descent updates weights using: θ_new = θ_old - α × ∇L

Where ∇L is the gradient of the loss.

Explain in simple terms:
1. What does the gradient ∇L represent geometrically?
2. Why do we SUBTRACT it (the negative sign)?
3. What role does α (learning rate) play?

**Hint**: Think of the loss function as a landscape/terrain you're trying to navigate.
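
The update rule above can be tried on a toy problem. The following is a minimal sketch, assuming a one-dimensional quadratic loss L(θ) = (θ - 3)² chosen purely for illustration; the functions `loss` and `grad`, the starting point, and the learning rate are all hypothetical choices rather than part of the course material.

```python
def loss(theta):
    """Toy loss with its minimum at theta = 3 (illustrative assumption)."""
    return (theta - 3.0) ** 2


def grad(theta):
    """Derivative of the toy loss with respect to theta."""
    return 2.0 * (theta - 3.0)


theta = 0.0    # starting point
alpha = 0.1    # learning rate
history = [theta]

for step in range(50):
    # theta_new = theta_old - alpha * gradient
    theta = theta - alpha * grad(theta)
    history.append(theta)

print(f"theta after 50 steps: {theta:.4f}")
print(f"loss at start: {loss(history[0]):.4f}, loss at end: {loss(history[-1]):.6f}")
```

Try changing `alpha` to a much smaller or much larger value (for example 0.001 or 1.1) and watch how `history` behaves; this connects directly to part 3 of the question.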

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 7 (Multiple Choice)

**Question 7 - Loss Function Purpose**

The loss function in deep learning serves to:

A) Measure how wrong the model is, providing a signal for gradient descent
B) Prevent overfitting by penalizing complex models
C) Speed up training by reducing computation
D) Automatically select which features to learn

**Hint**: What do we need to compute gradients?

**Your Answer**: [Write your answer here - e.g., 'B']

**Explanation**: [Explain why this is correct]

### Question 8 (Short Answer)

**Question 8 - Feature Learning Reflection**

After running the 'Observing Feature Learning' experiment:

Look at the learned weights in the first layer. These represent the FEATURES the network learned.

Explain: How did the network 'know' which features to learn? What guided it to learn useful features rather than random ones?

**Hint**: The answer involves both the loss function and backpropagation.

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 9 (Short Answer)

**Question 9 - Connecting the Concepts**

Integrate all three key insights:

Explain how (1) automatic feature learning, (2) nonlinearity, and (3) gradient descent work TOGETHER to enable deep learning. 

Your answer should show how all three are necessary and how they interact.

**Hint**: Think: What would happen if you removed any one of these three components?

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 10 (Short Answer)

**Question 10 - Scaling to Real Problems**

ImageNet (image classification) has 1000 classes and ~1.2 million training images. Traditional ML would require human experts to manually design thousands of features.

Explain: Why does deep learning have an advantage that GROWS as the problem gets more complex (more classes, more data)? What breaks down in the traditional approach?

**Hint**: Consider both the human effort required and what happens when you have more data.

**Your Answer**:

[Write your answer here in 2-4 sentences]

---
## Submission

Before submitting:
1. Run all cells to ensure code executes without errors
2. Check that all questions are answered
3. Review your explanations for clarity

**To Submit**:
- File → Download → Download .ipynb
- Submit the notebook file to your course LMS

**Note**: Make sure your name is in the filename (e.g., homework_01_yourname.ipynb)