# Lab 1: Python Foundations for Deep Learning

## Learning Objectives

By the end of this lab, you will be able to:
1. Write type-annotated Python functions using modern syntax
2. Apply Pythonic patterns (comprehensions, context managers, enumerate/zip)
3. Perform NumPy array operations including broadcasting and reshaping
4. Load, preprocess, and visualise data with Pandas and Matplotlib
5. Understand OOP conventions used in PyTorch (classes, `__call__`, etc.)
6. Implement linear regression using sklearn and from scratch
7. Write generators for memory-efficient data processing

## Prerequisites

- Basic Python programming (variables, loops, functions, classes)
- Familiarity with importing packages

## Why This Lab?

This lab covers the **Python/ML foundations** you need before diving into deep learning:
- **Type hints** make code self-documenting
- **NumPy broadcasting** is essential for tensor operations
- **Pandas** for data loading and preprocessing
- **sklearn basics** bridge to neural network training
- **OOP patterns** like `__call__` are central to `nn.Module`
- **Generators** power PyTorch DataLoaders

In [None]:
# ==== Environment Setup ====
import os
import sys

# Detect environment
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print("Running on Google Colab")
else:
    print("Running locally")

def download_file(url: str, filename: str) -> str:
    """Download file if it doesn't exist. Works on both Colab and local."""
    if os.path.exists(filename):
        print(f"'{filename}' already exists")
        return filename
    
    print(f"Downloading {filename}...")
    if IN_COLAB:
        import subprocess
        subprocess.run(['wget', '-q', url, '-O', filename], check=True)
    else:
        import urllib.request
        urllib.request.urlretrieve(url, filename)
    print(f"Downloaded {filename}")
    return filename

In [None]:
# ==== Device Setup ====
import torch

def get_device() -> torch.device:
    """Get best available device: CUDA > MPS > CPU."""
    if torch.cuda.is_available():
        device = torch.device('cuda')
        print(f"Using CUDA GPU: {torch.cuda.get_device_name(0)}")
    elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
        device = torch.device('mps')
        print("Using Apple MPS (Metal)")
    else:
        device = torch.device('cpu')
        print("Using CPU")
    return device

DEVICE = get_device()

---

# Part 2: Python Foundations for Deep Learning

---

## 2.1 Type Hints

Type hints make code self-documenting and enable better IDE support.

### Basic Syntax (Python 3.10+)

In [None]:
# Basic type hints
def greet(name: str) -> str:
    return f"Hello, {name}!"

# Collections - use lowercase built-in generics (Python 3.9+)
def average(values: list[float]) -> float:
    return sum(values) / len(values)

# Dictionaries
def word_count(text: str) -> dict[str, int]:
    words = text.lower().split()
    return {word: words.count(word) for word in set(words)}

# Optional values (can be None)
def find_index(items: list[str], target: str) -> int | None:
    try:
        return items.index(target)
    except ValueError:
        return None

# Test
print(f"greet('World'): {greet('World')}")
print(f"average([1, 2, 3, 4, 5]): {average([1, 2, 3, 4, 5])}")
print(f"word_count('the cat and the dog'): {word_count('the cat and the dog')}")
print(f"find_index(['a', 'b', 'c'], 'b'): {find_index(['a', 'b', 'c'], 'b')}")
print(f"find_index(['a', 'b', 'c'], 'x'): {find_index(['a', 'b', 'c'], 'x')}")

<details>
<summary><b>Q: Why use `int | None` instead of `Optional[int]`?</b></summary>

**A:** `int | None` is the modern Python 3.10+ syntax. It's more readable and doesn't require importing from `typing`. The older `Optional[int]` still works but is more verbose.

```python
# Old style (pre-3.10)
from typing import Optional, List
def f(x: Optional[int]) -> List[str]: ...

# Modern style (3.10+)
def f(x: int | None) -> list[str]: ...
```
</details>
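
Type hints are not enforced when the code runs; they are read by IDEs and by static checkers such as mypy. A minimal sketch of what that means in practice (the function `double` is a hypothetical example, not part of the lab):

```python
def double(x: int) -> int:
    """Return x doubled; the hints document intent but are not checked at runtime."""
    return x * 2

# Python accepts a str here even though the hint says int
result = double("ab")
print(result)                  # str * 2 repeats the string

# The hints are stored as metadata on the function object
print(double.__annotations__)
```

Running a static type checker over the same file is what would flag the `double("ab")` call.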

### Exercise: Add Type Hints

Add type hints to the following functions:

In [None]:
# Exercise: Add type hints to these functions

def calculate_loss(predictions, targets):
    """Calculate mean squared error loss."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)

def get_batch(data, batch_idx, batch_size):
    """Get a batch from data. Returns None if batch_idx out of range."""
    start = batch_idx * batch_size
    if start >= len(data):
        return None
    return data[start:start + batch_size]

def create_optimiser_config(lr, momentum, weight_decay):
    """Create optimiser configuration dictionary."""
    return {"lr": lr, "momentum": momentum, "weight_decay": weight_decay}

# Test (uncomment after adding hints):
# print(calculate_loss([1.0, 2.0], [1.1, 2.2]))
# print(get_batch([1,2,3,4,5], 0, 2))
# print(create_optimiser_config(0.01, 0.9, 1e-4))

<details>
<summary><b>Solution: Type Hints</b></summary>

```python
def calculate_loss(predictions: list[float], targets: list[float]) -> float:
    """Calculate mean squared error loss."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)

def get_batch(data: list, batch_idx: int, batch_size: int) -> list | None:
    """Get a batch from data. Returns None if batch_idx out of range."""
    start = batch_idx * batch_size
    if start >= len(data):
        return None
    return data[start:start + batch_size]

def create_optimiser_config(lr: float, momentum: float, weight_decay: float) -> dict[str, float]:
    """Create optimiser configuration dictionary."""
    return {"lr": lr, "momentum": momentum, "weight_decay": weight_decay}
```
</details>

## 2.2 Docstrings

Use Google-style docstrings for complex functions:

In [None]:
def train_model(
    model,
    train_loader,
    epochs: int = 10,
    learning_rate: float = 0.001,
    verbose: bool = True
) -> dict[str, list[float]]:
    """
    Train a PyTorch model.
    
    Args:
        model: PyTorch model to train (nn.Module)
        train_loader: DataLoader with training data
        epochs: Number of training epochs
        learning_rate: Learning rate for optimiser
        verbose: Whether to print progress
    
    Returns:
        Dictionary with 'train_loss' history
    
    Raises:
        ValueError: If epochs < 1
    
    Example:
        >>> history = train_model(model, loader, epochs=5)
        >>> plt.plot(history['train_loss'])
    """
    if epochs < 1:
        raise ValueError("epochs must be >= 1")
    # ... training code ...
    return {"train_loss": []}

# For simple/obvious functions, a one-liner is fine:
def relu(x: float) -> float:
    """Return max(0, x)."""
    return max(0.0, x)  # 0.0 (not 0) keeps the return type float
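
The `Example:` section of a docstring can double as a test, because the standard-library `doctest` module can execute `>>>` lines and compare the output. A minimal sketch, assuming a hypothetical helper `clip` that is not part of the lab:

```python
import doctest

def clip(x: float, low: float, high: float) -> float:
    """Clamp x to the range [low, high].

    Args:
        x: Value to clamp
        low: Lower bound
        high: Upper bound

    Returns:
        x limited to [low, high]

    Example:
        >>> clip(5.0, 0.0, 1.0)
        1.0
    """
    return max(low, min(x, high))

# Parse the >>> examples out of the docstring and run them against clip
parser = doctest.DocTestParser()
test = parser.get_doctest(clip.__doc__, {"clip": clip}, "clip", "<clip docstring>", 0)
results = doctest.DocTestRunner(verbose=False).run(test)
print(f"Failed: {results.failed}, attempted: {results.attempted}")
```

In a script you would more commonly call `doctest.testmod()`; the explicit parser is used here so the sketch also works inside a notebook cell.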

## 2.3 Pythonic Patterns

### List Comprehensions

In [None]:
# Instead of:
squares_loop = []
for i in range(10):
    squares_loop.append(i ** 2)

# Use:
squares = [i ** 2 for i in range(10)]
print(f"Squares: {squares}")

# With condition
evens = [i for i in range(20) if i % 2 == 0]
print(f"Evens: {evens}")

# Dict comprehension
word_lengths = {word: len(word) for word in ["cat", "elephant", "dog"]}
print(f"Word lengths: {word_lengths}")

# Set comprehension (removes duplicates)
unique_lengths = {len(word) for word in ["cat", "bat", "elephant", "ant"]}
print(f"Unique lengths: {unique_lengths}")

### enumerate, zip, sorted

In [None]:
# enumerate - get index and value
fruits = ["apple", "banana", "cherry"]
for i, fruit in enumerate(fruits):
    print(f"{i}: {fruit}")

# zip - iterate multiple sequences together
names = ["Alice", "Bob", "Charlie"]
scores = [85, 92, 78]
for name, score in zip(names, scores):
    print(f"{name}: {score}")

# sorted with key function
students = [("Alice", 85), ("Bob", 92), ("Charlie", 78)]
by_score = sorted(students, key=lambda x: x[1], reverse=True)
print(f"By score (desc): {by_score}")

<details>
<summary><b>Q: When should you use a list comprehension vs a regular loop?</b></summary>

**A:** Use comprehensions when:
- Building a new list/dict/set from an iterable
- The logic fits on one readable line

Use regular loops when:
- You need complex logic or multiple statements
- You're modifying in place rather than creating new
- A one-liner would hurt readability

**Rule of thumb:** If you can't understand it in 5 seconds, use a loop.
</details>

### Exercise: Pythonic Refactoring

Refactor this verbose code to use Pythonic patterns:

In [None]:
# VERBOSE VERSION - refactor this to be Pythonic!

# Task 1: Create list of (name, score) tuples where score > 80
names = ["Alice", "Bob", "Charlie", "Diana"]
scores = [95, 72, 88, 65]
high_scorers = []
for i in range(len(names)):
    if scores[i] > 80:
        high_scorers.append((names[i], scores[i]))

# Task 2: Create dict mapping file stem -> extension (e.g. "data" -> "csv")
files = ["data.csv", "model.pt", "config.json", "README.md"]
extensions = {}
for f in files:
    parts = f.split(".")
    name = parts[0]
    ext = parts[1]
    extensions[name] = ext

# Task 3: Read file, count non-empty lines (use context manager!)
f = open("test_file.txt", "w")
f.write("line1\n\nline2\nline3\n")
f.close()

f = open("test_file.txt", "r")
lines = f.readlines()
f.close()
count = 0
for line in lines:
    if line.strip() != "":
        count = count + 1

# Cleanup
import os
os.remove("test_file.txt")

print(f"High scorers: {high_scorers}")
print(f"Extensions: {extensions}")
print(f"Non-empty lines: {count}")

<details>
<summary><b>Solution: Pythonic Refactoring</b></summary>

```python
# Task 1: zip + list comprehension with filter
high_scorers = [(n, s) for n, s in zip(names, scores) if s > 80]

# Task 2: dict comprehension (split each name once, then unpack)
extensions = {stem: ext for stem, ext in (f.split(".") for f in files)}
# Or cleaner, using pathlib:
from pathlib import Path
extensions = {Path(f).stem: Path(f).suffix[1:] for f in files}

# Task 3: context manager + sum with generator
with open("test_file.txt", "w") as f:
    f.write("line1\n\nline2\nline3\n")

with open("test_file.txt", "r") as f:
    count = sum(1 for line in f if line.strip())
```
</details>

### Context Managers

In [None]:
# Context managers ensure cleanup (files close, locks release, etc.)

# File I/O - always use 'with'
from pathlib import Path

# Write
with open("test.txt", "w") as f:
    f.write("Hello, World!")

# Read
with open("test.txt", "r") as f:
    content = f.read()
print(f"File content: {content}")

# Clean up
Path("test.txt").unlink()

# PyTorch example: disable gradients for inference
import torch
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

with torch.no_grad():
    y = x * 2  # No gradient tracking here
print(f"y.requires_grad: {y.requires_grad}")
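
You can also write your own context managers. A minimal sketch using the standard library's `contextlib.contextmanager`; the `timer` helper and its dict-based result are assumptions for illustration:

```python
import time
from collections.abc import Iterator
from contextlib import contextmanager

@contextmanager
def timer(label: str) -> Iterator[dict[str, float]]:
    """Time the body of a with-block and record the result in a dict."""
    timing: dict[str, float] = {}
    start = time.perf_counter()
    try:
        yield timing  # Control passes to the with-block here
    finally:
        # Runs even if the with-block raises, like __exit__ would
        timing["elapsed"] = time.perf_counter() - start
        print(f"{label}: {timing['elapsed']:.4f}s")

with timer("sum of squares") as t:
    total = sum(i * i for i in range(100_000))

print(f"Recorded: {t['elapsed']:.4f}s")
```

Everything before `yield` plays the role of setup, and the `finally` block is the guaranteed cleanup.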

### Exception Handling

In [None]:
# Basic try/except
def safe_divide(a: float, b: float) -> float | None:
    try:
        return a / b
    except ZeroDivisionError:
        print("Warning: Division by zero")
        return None

print(safe_divide(10, 2))
print(safe_divide(10, 0))

# Multiple exception types with proper chaining
def parse_int(s: str) -> int:
    try:
        return int(s)
    except ValueError as e:
        raise ValueError(f"Cannot parse '{s}' as integer") from e  # Chain exceptions!
    except TypeError as e:
        raise TypeError(f"Expected string, got {type(s)}") from e

# finally - always runs (cleanup)
def read_with_cleanup(filename: str) -> str:
    f = None
    try:
        f = open(filename, "r")
        return f.read()
    except FileNotFoundError:
        return ""
    finally:
        if f:
            f.close()
            print("File closed")

# Demo exception chaining
try:
    parse_int("abc")
except ValueError as e:
    print(f"Caught: {e}")
    print(f"Original cause: {e.__cause__}")

<details>
<summary><b>Q: When should you catch exceptions vs let them propagate?</b></summary>

**A:** 
- **Catch** when you can handle it meaningfully (retry, default value, cleanup)
- **Propagate** when the caller should decide how to handle it

**Bad:** Catching everything and hiding errors
```python
try:
    result = do_something()
except:  # Never do this!
    pass
```

**Good:** Catch specific exceptions you can handle
```python
try:
    data = load_file(path)
except FileNotFoundError:
    data = default_data
```
</details>

### Part 2 Key Takeaways

- **Type hints** (`def f(x: int) -> str`) improve code clarity and IDE support
- **Comprehensions** are cleaner than loops for building collections
- **Context managers** (`with`) ensure proper resource cleanup
- **Exception chaining** (`raise ... from e`) preserves the original error

---

# Part 3: NumPy Essentials

NumPy's array model (shapes, dtypes, broadcasting, vectorised operations) carries over almost directly to PyTorch tensors, so understanding it is essential.

---

## 3.1 Why NumPy Matters for Deep Learning

In [None]:
import numpy as np
import time

# Vectorisation is MUCH faster than loops
size = 1_000_000

# Loop version
a_list = list(range(size))
b_list = list(range(size))

start = time.perf_counter()  # perf_counter has higher resolution than time.time()
c_list = [a + b for a, b in zip(a_list, b_list)]
loop_time = time.perf_counter() - start

# NumPy version
a_np = np.arange(size)
b_np = np.arange(size)

start = time.perf_counter()
c_np = a_np + b_np
numpy_time = time.perf_counter() - start

print(f"Loop time: {loop_time:.4f}s")
print(f"NumPy time: {numpy_time:.4f}s")
print(f"NumPy is {loop_time/numpy_time:.1f}x faster")

<details>
<summary><b>Deep Dive: Why is NumPy so fast?</b></summary>

NumPy achieves 10-100x speedups through several mechanisms:

1. **Contiguous Memory Layout**: Arrays store data in continuous memory blocks, enabling efficient CPU cache utilisation. Python lists store pointers to scattered objects.

2. **Compiled C/Fortran Backend**: Core operations are implemented in optimised C code, not interpreted Python.

3. **SIMD Vectorisation**: Modern CPUs can process multiple numbers per instruction (Single Instruction, Multiple Data). NumPy operations leverage this automatically.

4. **No Type Checking Per Element**: Python lists check types dynamically for each element. NumPy arrays have uniform dtype - no per-element overhead.

5. **No Python Object Overhead**: Every Python number is a full object. On 64-bit CPython a `float` takes 24 bytes (reference count, type pointer, value) and a small `int` takes 28 bytes. NumPy stores only the raw 8-byte values.

```python
# Memory comparison
import sys
import numpy as np
py_list = [float(i) for i in range(1000)]  # 1000 distinct float objects
np_array = np.arange(1000, dtype=np.float64)
print(f"Python list: {sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)} bytes")
print(f"NumPy array: {np_array.nbytes} bytes")  # Just 8000 bytes (8 bytes per float64)
```

**Rule**: If you're looping over array elements in Python, you're probably doing it wrong.
</details>

## 3.2 Array Creation & Indexing

In [None]:
import numpy as np

# Creating arrays
a = np.array([1, 2, 3, 4, 5])          # From list
b = np.zeros((3, 4))                     # 3x4 zeros
c = np.ones((2, 3))                      # 2x3 ones
d = np.arange(0, 10, 2)                  # [0, 2, 4, 6, 8]
e = np.linspace(0, 1, 5)                 # 5 points from 0 to 1
f = np.random.randn(3, 3)                # 3x3 standard normal

print(f"zeros shape: {b.shape}")
print(f"arange: {d}")
print(f"linspace: {e}")

# Indexing
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"\narr:\n{arr}")
print(f"arr[0, 1]: {arr[0, 1]}")         # Single element
print(f"arr[0, :]: {arr[0, :]}")         # First row
print(f"arr[:, 1]: {arr[:, 1]}")         # Second column
print(f"arr[0:2, 1:3]:\n{arr[0:2, 1:3]}")  # Subarray

# Boolean indexing
print(f"\narr > 5: {arr[arr > 5]}")

## 3.3 Broadcasting

Broadcasting allows operations between arrays of different shapes.

### Rules:
1. Compare shapes from right to left
2. Dimensions match if they're equal OR one of them is 1
3. Missing dimensions are treated as 1

**Before running each cell below, predict the output shape!**

In [None]:
import numpy as np

# Scalar broadcasts to any shape
a = np.array([[1, 2, 3], [4, 5, 6]])  # Shape: (2, 3)
print(f"a + 10:\n{a + 10}")  # 10 broadcasts to (2, 3)

# Row vector broadcasts across rows
row = np.array([100, 200, 300])  # Shape: (3,)
print(f"\na + row:\n{a + row}")  # (3,) -> (2, 3)

# Column vector broadcasts across columns
col = np.array([[10], [20]])  # Shape: (2, 1)
print(f"\na + col:\n{a + col}")  # (2, 1) -> (2, 3)

# Outer product via broadcasting
x = np.array([1, 2, 3])[:, np.newaxis]  # Shape: (3, 1)
y = np.array([10, 20])                   # Shape: (2,)
print(f"\nOuter product (x * y):\n{x * y}")  # (3, 1) * (2,) -> (3, 2)

<details>
<summary><b>Q: Why does `np.array([1,2]) + np.array([[1],[2],[3]])` work?</b></summary>

**A:** Let's trace the broadcasting:
- Left: shape (2,)
- Right: shape (3, 1)

Align from right:
```
     (2,)  ->  (1, 2)  [add dimension]
  (3, 1)   ->  (3, 1)
  Result:      (3, 2)  [both expand]
```

Each expands where it has size 1:
```python
[[1, 2],      [[1, 1],     [[2, 3],
 [1, 2],  +    [2, 2],  =   [3, 4],
 [1, 2]]       [3, 3]]      [4, 5]]
```
</details>

In [None]:
# Broadcasting Debugger - useful helper function
def broadcast_shapes(*shapes: tuple[int, ...]) -> tuple[int, ...] | None:
    """Visualise how shapes align and what the result will be."""
    max_dims = max(len(s) for s in shapes)
    
    # Pad shapes with 1s on the left
    padded = [((1,) * (max_dims - len(s))) + s for s in shapes]
    
    print("Shape alignment (right-aligned):")
    for i, (orig, pad) in enumerate(zip(shapes, padded)):
        print(f"  Array {i+1}: {str(orig):>15} -> {pad}")
    
    # Compute result shape
    result = []
    for dims in zip(*padded):
        if len(set(d for d in dims if d != 1)) > 1:
            print(f"\n❌ INCOMPATIBLE: dimension has {dims} (multiple non-1 values)")
            return None
        result.append(max(dims))
    
    print(f"\n✓ Result shape: {tuple(result)}")
    return tuple(result)

# Test it
print("Example 1: (2,3) + (3,)")
broadcast_shapes((2, 3), (3,))

print("\nExample 2: (3,1) + (1,4)")
broadcast_shapes((3, 1), (1, 4))

print("\nExample 3: Incompatible shapes")
broadcast_shapes((3, 4), (5,))
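
NumPy ships a built-in with the same purpose as the hand-written helper: `np.broadcast_shapes` (available from NumPy 1.20). A short sketch applying it to the same three cases:

```python
import numpy as np

# np.broadcast_shapes applies the broadcasting rules without creating any arrays
print(np.broadcast_shapes((2, 3), (3,)))
print(np.broadcast_shapes((3, 1), (1, 4)))

# Incompatible shapes raise ValueError rather than returning None
try:
    np.broadcast_shapes((3, 4), (5,))
except ValueError as e:
    print(f"Incompatible: {e}")
```

The hand-written version is still useful for learning, since it prints how the shapes are aligned.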

### Exercise: Broadcasting

Check whether the code below adds the bias to each sample correctly, and fix it if it does not:

In [None]:
import numpy as np

# Data: 100 samples, 784 features (like MNIST flattened)
X = np.random.randn(100, 784)
bias = np.random.randn(784)

# This should add bias to each row
result = X + bias  # Does this work?
print(f"X shape: {X.shape}")
print(f"bias shape: {bias.shape}")
print(f"result shape: {result.shape}")
assert result.shape == (100, 784), "Shape mismatch!"
print("Broadcasting worked!")

<details>
<summary><b>Solution: Broadcasting</b></summary>

The code already works! Broadcasting automatically handles this case:

```python
X.shape     # (100, 784)
bias.shape  # (784,)

# NumPy aligns from right:
#   X:    (100, 784)
#   bias:      (784,)  → treated as (1, 784)
# Result: (100, 784) ✓
```

If bias had shape `(100,)` instead, you'd need to reshape:
```python
bias_wrong = np.random.randn(100)  # Shape (100,)
result = X + bias_wrong[:, np.newaxis]  # Reshape to (100, 1) for column broadcast
```
</details>

## 3.4 Common Operations

In [None]:
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
print(f"Original shape: {a.shape}")

# Reshape
b = a.reshape(3, 2)
print(f"Reshaped to (3,2):\n{b}")

# Transpose
print(f"Transposed:\n{a.T}")

# Flatten
print(f"Flattened: {a.flatten()}")

# Concatenate
c = np.array([[7, 8, 9]])
print(f"\nVertical concat:\n{np.concatenate([a, c], axis=0)}")

d = np.array([[10], [20]])
print(f"\nHorizontal concat:\n{np.concatenate([a, d], axis=1)}")

In [None]:
# Reductions along axes
a = np.array([[1, 2, 3], [4, 5, 6]])
print(f"Array:\n{a}")

print(f"\nSum all: {a.sum()}")
print(f"Sum rows (axis=1): {a.sum(axis=1)}")     # Sum each row
print(f"Sum cols (axis=0): {a.sum(axis=0)}")     # Sum each column

print(f"\nMean all: {a.mean():.2f}")
print(f"Mean rows: {a.mean(axis=1)}")

# Matrix multiplication
W = np.random.randn(3, 4)  # 3x4
x = np.random.randn(4, 2)  # 4x2
y = W @ x                   # 3x2
print(f"\nW @ x: {W.shape} @ {x.shape} = {y.shape}")

<details>
<summary><b>Q: What's the difference between `axis=0` and `axis=1` in reductions?</b></summary>

**A:** The axis parameter specifies which dimension to "collapse":
- `axis=0`: Collapses the row dimension (reduces down each column) → one value per column
- `axis=1`: Collapses the column dimension (reduces across each row) → one value per row

Think of it as: "sum **along** this axis" or "reduce **this** dimension"

```python
a = [[1, 2, 3],
     [4, 5, 6]]  # Shape (2, 3)

a.sum(axis=0)  # [5, 7, 9]   - summed down columns, shape (3,)
a.sum(axis=1)  # [6, 15]     - summed across rows, shape (2,)
```
</details>
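
Reductions and broadcasting often meet when normalising rows. A minimal sketch, assuming a small illustrative `scores` array, showing why `keepdims=True` matters:

```python
import numpy as np

scores = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])  # Shape (2, 3)

row_sums = scores.sum(axis=1)                      # Shape (2,)
row_sums_kept = scores.sum(axis=1, keepdims=True)  # Shape (2, 1)

# (2, 3) / (2, 1) broadcasts: each row is divided by its own sum
normalised = scores / row_sums_kept
print(normalised.sum(axis=1))

# (2, 3) / (2,) fails: the trailing dimensions 3 and 2 do not match
try:
    scores / row_sums
except ValueError as e:
    print(f"Without keepdims: {e}")
```

The same pattern appears in softmax, where each row of exponentiated scores is divided by its row sum.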

### Exercise: Shape Prediction

**Predict the output shapes before running!** Write your predictions, then verify.
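
A minimal sketch of expressions to practise on follows; the arrays and sizes are illustrative assumptions. Each `assert` holds the answer, so make your prediction before reading the right-hand side.

```python
import numpy as np

A = np.ones((32, 784))        # Batch of 32 flattened images
W = np.ones((784, 10))        # Weight matrix
b = np.ones(10)               # Bias vector
imgs = np.ones((32, 28, 28))  # Batch of 32 images, 28x28 each

assert (A @ W).shape == (32, 10)                       # Matrix multiply
assert (A @ W + b).shape == (32, 10)                   # (10,) broadcasts over rows
assert A.sum(axis=0).shape == (784,)                   # Collapses the batch axis
assert A.mean(axis=1, keepdims=True).shape == (32, 1)  # keepdims retains the axis
assert imgs.reshape(32, -1).shape == (32, 784)         # -1 infers 28 * 28
assert A.T.shape == (784, 32)                          # Transpose swaps axes
assert (b[:, np.newaxis] * b).shape == (10, 10)        # (10, 1) * (10,) -> (10, 10)
print("All shape predictions verified")
```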

## 3.5 Views vs Copies (Critical!)

Understanding when NumPy creates a view vs a copy prevents subtle bugs.

In [None]:
import numpy as np

# VIEWS: Slicing creates a view (shares memory!)
original = np.array([1, 2, 3, 4, 5])
view = original[1:4]  # This is a VIEW

print(f"Original: {original}")
print(f"View: {view}")

# Modifying the view changes the original!
view[0] = 999
print(f"After modifying view[0]:")
print(f"  Original: {original}")  # Also changed!
print(f"  View: {view}")

# COPIES: Use .copy() to get independent data
original = np.array([1, 2, 3, 4, 5])
copy = original[1:4].copy()  # Explicit copy

copy[0] = 999
print(f"\nWith .copy():")
print(f"  Original: {original}")  # Unchanged!
print(f"  Copy: {copy}")

# How to check: views share memory
a = np.array([1, 2, 3])
b = a[:]      # View
c = a.copy()  # Copy

print(f"\nShares memory?")
print(f"  a and b: {np.shares_memory(a, b)}")  # True
print(f"  a and c: {np.shares_memory(a, c)}")  # False

<details>
<summary><b>Q: Is `arr.reshape(3, 4)` a view or a copy?</b></summary>

**A:** It depends! Reshape returns a **view** when possible (if the data is contiguous in memory), but may return a **copy** if the memory layout doesn't allow a view.

```python
a = np.arange(12).reshape(3, 4)  # Usually a view
b = a.T.reshape(6, 2)            # Must be a copy (transpose breaks contiguity)
```

**Safe approach:** If you need independent data, call `.copy()` explicitly. To check whether a result is a view, use `np.shares_memory(original, result)`. Note that `order='A'` does not guarantee a view, and `np.ndarray.view()` reinterprets the dtype rather than changing the shape.
</details>

### Part 3 Key Takeaways

- **Vectorise** operations; Python loops over array elements are slow
- **Broadcasting** aligns shapes from the right, expanding size-1 dimensions
- **Views** share memory with originals; use `.copy()` for independence
- **axis=0** reduces down the rows (one result per column); **axis=1** reduces across the columns (one result per row)

---

# Part 4: Data Handling with Pandas

Pandas is the standard library for tabular data in Python. Essential for loading and preprocessing ML datasets.

---

## 4.1 Loading Data

In [None]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris, load_wine

# Load from sklearn
iris = load_iris(as_frame=True)
df = iris['data']
target = iris['target']

print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
df.head()

## 4.2 DataFrame Basics

In [None]:
# Filtering
short_sepals = df[df['sepal length (cm)'] < 5]
print(f"Flowers with sepal < 5cm: {len(short_sepals)}")

# Selecting columns
subset = df[['sepal length (cm)', 'petal length (cm)']]
print(f"Subset shape: {subset.shape}")

# Adding columns
df_with_target = df.copy()
df_with_target['species'] = target
df_with_target['species_name'] = df_with_target['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# GroupBy
print("\nMean by species:")
print(df_with_target.groupby('species_name')[['sepal length (cm)', 'petal length (cm)']].mean())

<details>
<summary><b>Q: What's the difference between `df[col]` and `df[[col]]`?</b></summary>

**A:** 
- `df['col']` returns a **Series** (1D)
- `df[['col']]` returns a **DataFrame** (2D, single column)

```python
type(df['sepal length (cm)'])  # pandas.Series
type(df[['sepal length (cm)']])  # pandas.DataFrame
```

Use double brackets when you need to keep the DataFrame structure (e.g., for sklearn).
</details>
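
The difference shows up in the array shapes you hand to a model, since sklearn estimators expect a 2D feature matrix of shape `(n_samples, n_features)`. A minimal sketch with a small made-up DataFrame (`df_demo` is an assumption for illustration):

```python
import pandas as pd

df_demo = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [2.0, 4.0, 6.0]})

as_series = df_demo["x"]    # Series: 1D
as_frame = df_demo[["x"]]   # DataFrame: 2D with a single column

print(as_series.to_numpy().shape)
print(as_frame.to_numpy().shape)
```

The single-bracket form drops the column axis, which is why it suits targets and the double-bracket form suits features.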

## 4.3 Handling Missing Values

In [None]:
# Create sample data with missing values
df_missing = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, 2, 3, 4, 5]
})

print("Original:")
print(df_missing)
print(f"\nMissing values per column:\n{df_missing.isna().sum()}")

# Option 1: Drop rows with any NaN
print(f"\nAfter dropna(): {len(df_missing.dropna())} rows")

# Option 2: Fill with value
print(f"\nFill with 0:\n{df_missing.fillna(0)}")

# Option 3: Fill with column mean
print(f"\nFill with mean:\n{df_missing.fillna(df_missing.mean())}")

## 4.4 Encoding Categorical Variables

In [None]:
# One-hot encoding with pandas
df_categories = pd.DataFrame({
    'colour': ['red', 'blue', 'green', 'red', 'blue'],
    'size': ['S', 'M', 'L', 'M', 'S']
})

print("Original:")
print(df_categories)

# One-hot encode
encoded = pd.get_dummies(df_categories, columns=['colour', 'size'])
print("\nOne-hot encoded:")
print(encoded)

## 4.5 Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Sample data
X = np.array([[1, 100], [2, 200], [3, 300], [4, 400]])

# StandardScaler: zero mean, unit variance
scaler_std = StandardScaler()
X_standardised = scaler_std.fit_transform(X)
print("StandardScaler (mean=0, std=1):")
print(X_standardised)
print(f"Mean: {X_standardised.mean(axis=0)}, Std: {X_standardised.std(axis=0)}")

# MinMaxScaler: scale to [0, 1]
scaler_mm = MinMaxScaler()
X_minmax = scaler_mm.fit_transform(X)
print("\nMinMaxScaler [0, 1]:")
print(X_minmax)

<details>
<summary><b>Q: When should you use StandardScaler vs MinMaxScaler?</b></summary>

**A:**
- **StandardScaler**: When features follow roughly Gaussian distribution. Works well with most ML algorithms, especially those sensitive to feature magnitudes (SVM, logistic regression, neural networks).

- **MinMaxScaler**: When you need bounded values (e.g., [0,1] for image pixels or probabilities). Sensitive to outliers - a single extreme value can compress all other values.

**Rule of thumb**: Start with StandardScaler for neural networks. Use MinMaxScaler when interpretability of the scale matters.
</details>
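
When scaling is part of a train/test workflow, the scaler is normally fitted on the training split only and then reused on the test split, so that test statistics do not leak into preprocessing. A minimal sketch with synthetic data (the array `X_demo` and the 80/20 split are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # Synthetic features

X_train, X_test = train_test_split(X_demo, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # Reuse the training statistics

print(f"Train mean: {X_train_scaled.mean(axis=0).round(3)}")
print(f"Test mean:  {X_test_scaled.mean(axis=0).round(3)}")  # Near 0, but not exactly 0
```

The test split ends up only approximately standardised, which is expected: it is being measured against the training distribution.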

### Part 4 Key Takeaways

- **pandas** is essential for loading and exploring tabular data
- Always check for **missing values** before training
- **One-hot encoding** converts categories to numeric features
- **Scaling** (StandardScaler/MinMaxScaler) improves model training

---

# Part 5: Visualisation with Matplotlib

Visualisation is critical for understanding data and debugging models.

---

## 5.1 Basic Plots

In [None]:
import matplotlib.pyplot as plt

# Reload iris for plotting
iris = load_iris(as_frame=True)
df = iris['data']
target = iris['target']

# Scatter plot with colour by class
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Left: Sepal dimensions
colours = ['red', 'green', 'blue']
for i, species in enumerate(['setosa', 'versicolor', 'virginica']):
    mask = target == i
    axes[0].scatter(df.loc[mask, 'sepal length (cm)'], 
                   df.loc[mask, 'sepal width (cm)'],
                   c=colours[i], label=species, alpha=0.7)
axes[0].set_xlabel('Sepal Length (cm)')
axes[0].set_ylabel('Sepal Width (cm)')
axes[0].set_title('Sepal Dimensions')
axes[0].legend()

# Right: Petal dimensions
for i, species in enumerate(['setosa', 'versicolor', 'virginica']):
    mask = target == i
    axes[1].scatter(df.loc[mask, 'petal length (cm)'], 
                   df.loc[mask, 'petal width (cm)'],
                   c=colours[i], label=species, alpha=0.7)
axes[1].set_xlabel('Petal Length (cm)')
axes[1].set_ylabel('Petal Width (cm)')
axes[1].set_title('Petal Dimensions')
axes[1].legend()

plt.tight_layout()
plt.show()

## 5.2 Histograms and Distributions

In [None]:
# Histogram of features
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes = axes.flatten()

for i, col in enumerate(df.columns):
    axes[i].hist(df[col], bins=20, edgecolor='black', alpha=0.7)  # Matplotlib uses US spelling for keyword arguments
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')
    axes[i].set_title(f'Distribution of {col}')

plt.tight_layout()
plt.show()

## 5.3 Plotting for Deep Learning

Common plots you'll use when training models:

In [None]:
# Simulated training history
epochs = range(1, 51)
train_loss = 2.0 * np.exp(-np.array(epochs) / 10) + 0.1 + np.random.randn(50) * 0.05
val_loss = 2.0 * np.exp(-np.array(epochs) / 12) + 0.15 + np.random.randn(50) * 0.08

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Loss curves
axes[0].plot(epochs, train_loss, label='Train Loss', color='blue')
axes[0].plot(epochs, val_loss, label='Val Loss', color='orange')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training Curves')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Gradient histogram (simulated)
gradients = np.random.randn(1000) * 0.1
axes[1].hist(gradients, bins=50, edgecolor='black', alpha=0.7)
axes[1].axvline(x=0, color='red', linestyle='--', label='Zero')
axes[1].set_xlabel('Gradient Value')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Gradient Distribution (Healthy)')
axes[1].legend()

plt.tight_layout()
plt.show()

print("Tip: If nearly all gradients sit in a narrow spike at 0 -> vanishing gradients")
print("     If gradients are huge -> exploding gradients")

<details>
<summary><b>Q: What does a bimodal gradient histogram suggest?</b></summary>

**A:** A bimodal (two-peaked) gradient histogram often indicates:
1. Different layers learning at different rates
2. Potential issues with initialisation
3. Some weights updating much faster than others

Healthy gradient distributions are typically unimodal and centred near zero.
</details>

<details>
<summary><b>Q: What should you look for in training curves?</b></summary>

**A:** Key patterns to watch:
- **Train loss decreasing, val loss stable then increasing** → Overfitting, stop earlier
- **Both losses plateau high** → Underfitting, increase model capacity
- **Loss spikes or oscillates** → Learning rate too high
- **Very slow decrease** → Learning rate too low
- **Train and val loss track closely** → Good generalisation
</details>
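
The overfitting pattern above is what early stopping automates. A minimal sketch, assuming a plain list of per-epoch validation losses and a hypothetical helper `early_stopping_epoch` (not a PyTorch API):

```python
def early_stopping_epoch(val_losses: list[float], patience: int = 3) -> int:
    """Return the 1-based epoch at which training would stop.

    Stops once validation loss has failed to improve on its best value
    for `patience` consecutive epochs; otherwise runs to the last epoch.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)

# Validation loss improves, then rises: the overfitting pattern
history = [1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.7, 0.9]
stop = early_stopping_epoch(history, patience=3)
print(f"Stop at epoch {stop}; best val loss was {min(history)}")
```

In a real training loop you would also save the model weights at the best epoch, so they can be restored after stopping.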

### Part 5 Key Takeaways

- **Scatter plots** reveal feature relationships and class separability
- **Histograms** show feature distributions
- **Training curves** diagnose overfitting, underfitting, and learning rate issues
- **Gradient histograms** detect vanishing/exploding gradients

---

# Part 6: OOP for Deep Learning

PyTorch heavily uses OOP. Understanding these patterns is essential.

---

## 6.1 Classes Review

In [None]:
class NeuralNetwork:
    """A simple neural network class demonstrating OOP patterns."""
    
    # Class attribute (shared by all instances)
    default_activation = "relu"
    
    def __init__(self, input_size: int, hidden_size: int, output_size: int):
        """Initialise the network."""
        # Instance attributes (unique to each instance)
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        
        # Simulated weights
        self.weights = {
            "W1": np.random.randn(input_size, hidden_size) * 0.01,
            "W2": np.random.randn(hidden_size, output_size) * 0.01,
        }
    
    def forward(self, x: np.ndarray) -> np.ndarray:
        """Forward pass."""
        h = x @ self.weights["W1"]
        h = np.maximum(0, h)  # ReLU
        return h @ self.weights["W2"]

# Usage
net = NeuralNetwork(784, 128, 10)
x = np.random.randn(32, 784)  # Batch of 32
output = net.forward(x)
print(f"Input: {x.shape} -> Output: {output.shape}")

## 6.2 Naming Conventions

In [None]:
class DataProcessor:
    """Demonstrates Python naming conventions."""
    
    def __init__(self, data: list):
        self.data = data              # Public: anyone can access
        self._cache = {}              # Protected: internal use, but accessible
        self.__secret = "hidden"      # Private: name-mangled to _DataProcessor__secret
    
    def process(self):
        """Public method - part of the API."""
        return self._preprocess()
    
    def _preprocess(self):
        """Protected method - internal helper, but subclasses can override."""
        return [x * 2 for x in self.data]
    
    def __validate(self):
        """Private method - truly internal, not for subclasses."""
        return all(isinstance(x, (int, float)) for x in self.data)

dp = DataProcessor([1, 2, 3])
print(f"Public data: {dp.data}")
print(f"Protected _cache: {dp._cache}")  # Works but discouraged
# print(dp.__secret)  # AttributeError!
print(f"Mangled name: {dp._DataProcessor__secret}")  # How to access if needed

<details>
<summary><b>Q: When should you use `_protected` vs `__private`?</b></summary>

**A:**
- **`_protected`**: Use for internal methods that subclasses might need to override. It's a convention saying "internal, but accessible."

- **`__private`**: Use when you truly want to prevent accidental override in subclasses. Python mangles the name to `_ClassName__method`, making it harder (but not impossible) to access.

**In practice:** Most Python code uses `_protected`. Use `__private` sparingly.
</details>
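
The difference is easiest to see when a subclass happens to reuse an attribute name. The following is a minimal sketch; `Base`, `Child`, and the `token` attribute are hypothetical names chosen for illustration, not part of the lab code.

```python
class Base:
    def __init__(self):
        self.__token = "base"       # stored on the instance as _Base__token

    def token(self) -> str:
        return self.__token         # compiled as self._Base__token


class Child(Base):
    def __init__(self):
        super().__init__()
        self.__token = "child"      # stored as _Child__token, so Base's value survives


c = Child()
base_token = c.token()              # Base.token still reads Base's own attribute
attribute_names = sorted(vars(c))   # both mangled names live side by side
print(base_token)
print(attribute_names)
```

With a single underscore (`self._token`) the assignment in `Child` would have overwritten the value that `Base.token()` reads, which is exactly the accidental override that name mangling guards against.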

## 6.3 Dunder Methods

Dunder (double underscore) methods let you customise how objects behave.

In [None]:
class Tensor:
    """A simple tensor class demonstrating dunder methods."""
    
    def __init__(self, data: list):
        self.data = np.array(data)
    
    def __repr__(self) -> str:
        """For developers - unambiguous representation."""
        return f"Tensor(shape={self.data.shape}, dtype={self.data.dtype})"
    
    def __str__(self) -> str:
        """For users - readable representation."""
        return f"Tensor with shape {self.data.shape}"
    
    def __len__(self) -> int:
        """Enable len(tensor)."""
        return len(self.data)
    
    def __getitem__(self, idx):
        """Enable tensor[idx]."""
        return self.data[idx]
    
    def __call__(self, x):
        """Enable tensor(x) - used heavily in PyTorch!"""
        return self.data @ x

t = Tensor([[1, 2], [3, 4]])
print(f"repr: {repr(t)}")
print(f"str: {str(t)}")
print(f"len: {len(t)}")
print(f"t[0]: {t[0]}")
print(f"t([1, 1]): {t(np.array([1, 1]))}")  # Callable!

<details>
<summary><b>Q: Why does PyTorch use `__call__` for the forward pass?</b></summary>

**A:** In PyTorch, `model(x)` invokes `nn.Module.__call__`, which runs `model.forward(x)` and wraps it with extra machinery, most importantly any registered hooks (callbacks that run before and after `forward`).

Calling `model.forward(x)` directly silently skips those hooks. This is why you define `forward()` but call `model(x)`, not `model.forward(x)`.
</details>
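
The hook mechanism can be sketched in a few lines. This is a simplified illustration of the pattern, not PyTorch's actual implementation: `HookedModule` and `Double` are hypothetical classes, and the hook signature here passes the raw input rather than the tuple of arguments PyTorch uses.

```python
class HookedModule:
    """Toy base class: __call__ wraps forward() with registered hooks."""

    def __init__(self):
        self._forward_hooks = []

    def register_forward_hook(self, hook):
        self._forward_hooks.append(hook)

    def __call__(self, x):
        out = self.forward(x)
        for hook in self._forward_hooks:
            hook(self, x, out)          # runs only when the module is *called*
        return out

    def forward(self, x):
        raise NotImplementedError


class Double(HookedModule):
    def forward(self, x):
        return 2 * x


log = []
layer = Double()
layer.register_forward_hook(lambda module, inp, out: log.append((inp, out)))

via_call = layer(3)              # goes through __call__, so the hook fires
via_forward = layer.forward(4)   # bypasses __call__, so the hook is skipped
print(via_call, via_forward, log)
```

Only the call through `layer(3)` leaves an entry in `log`, which mirrors why PyTorch code calls the module rather than its `forward` method.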

## 6.5 Inheritance (Essential for PyTorch)

PyTorch's `nn.Module` uses inheritance heavily. You'll subclass it for every model.

<details>
<summary><b>Q: Why do we call `super().__init__()` in subclasses?</b></summary>

**A:** `super().__init__()` calls the parent class's `__init__` method, ensuring proper initialisation of inherited attributes. Without it:

```python
class Linear(Module):
    def __init__(self, in_features, out_features):
        # WRONG: forgot super().__init__()
        self.weight = ...
        
layer = Linear(10, 5)
print(layer.training)  # AttributeError! .training was never set
```

In PyTorch, forgetting `super().__init__()` is a common bug that breaks module registration, parameter tracking, and device movement.
</details>

In [None]:
# Simplified nn.Module-like base class
class Module:
    """Base class demonstrating PyTorch's Module pattern."""
    
    def __init__(self):
        self._modules = {}
        self.training = True
    
    def __call__(self, x):
        """When you call model(x), this runs."""
        return self.forward(x)
    
    def forward(self, x):
        """Subclasses MUST override this."""
        raise NotImplementedError("Subclasses must implement forward()")
    
    def train(self, mode: bool = True):
        self.training = mode
        return self
    
    def eval(self):
        return self.train(False)


# Subclass: A simple linear layer
class Linear(Module):
    """Linear layer: y = x @ W + b"""
    
    def __init__(self, in_features: int, out_features: int):
        super().__init__()  # Call parent's __init__
        self.weight = np.random.randn(in_features, out_features) * 0.01
        self.bias = np.zeros(out_features)
    
    def forward(self, x: np.ndarray) -> np.ndarray:
        return x @ self.weight + self.bias  # Broadcasting!


# Subclass: A two-layer network
class TwoLayerNet(Module):
    """Network that composes multiple layers."""
    
    def __init__(self, input_size: int, hidden_size: int, output_size: int):
        super().__init__()
        self.fc1 = Linear(input_size, hidden_size)
        self.fc2 = Linear(hidden_size, output_size)
    
    def forward(self, x: np.ndarray) -> np.ndarray:
        x = self.fc1(x)           # Note: uses __call__, not .forward()
        x = np.maximum(0, x)      # ReLU activation
        x = self.fc2(x)
        return x


# Usage - this is exactly how you'll use PyTorch!
model = TwoLayerNet(784, 128, 10)
x = np.random.randn(32, 784)
output = model(x)  # Calls __call__ -> forward
print(f"Input: {x.shape} -> Output: {output.shape}")
print(f"Training mode: {model.training}")
model.eval()
print(f"After eval(): {model.training}")

## 6.4 Decorators

In [None]:
class Model:
    def __init__(self, name: str):
        self._name = name
        self._is_training = True
    
    @property
    def name(self) -> str:
        """Property decorator - access like an attribute."""
        return self._name
    
    @property
    def is_training(self) -> bool:
        return self._is_training
    
    @is_training.setter
    def is_training(self, value: bool):
        """Setter for property."""
        self._is_training = value
        print(f"Training mode: {value}")
    
    @staticmethod
    def count_parameters(weights: dict) -> int:
        """Static method - doesn't need self."""
        return sum(w.size for w in weights.values())
    
    @classmethod
    def from_config(cls, config: dict):
        """Class method - alternative constructor."""
        return cls(name=config.get("name", "unnamed"))

# Usage
m = Model("MyModel")
print(f"Name: {m.name}")  # Property access
m.is_training = False     # Property setter

m2 = Model.from_config({"name": "ConfigModel"})  # Classmethod
print(f"From config: {m2.name}")
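
One consequence of `@property` worth seeing in isolation: a property without a setter is read-only. The sketch below uses a hypothetical `ReadOnlyModel` class (so it does not shadow `Model` above) and assumes nothing beyond the standard library.

```python
class ReadOnlyModel:
    def __init__(self, name: str):
        self._name = name

    @property
    def name(self) -> str:
        return self._name            # getter only: no @name.setter is defined


ro = ReadOnlyModel("MyModel")
try:
    ro.name = "Renamed"              # assignment requires a setter
    rename_failed = False
except AttributeError:
    rename_failed = True

print(ro.name, rename_failed)
```

This is a convenient way to expose values that should not change after construction, while `is_training` above shows the opposite choice of adding a setter that can validate or log the change.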

---

# Part 7: sklearn & Linear Regression

Before neural networks, understand classical ML. Linear regression is the foundation.

---

## 7.1 Train/Test Split

In [None]:
from sklearn.model_selection import train_test_split

# Generate synthetic data
np.random.seed(42)
X = np.random.randn(200, 1) * 2
y = 3 * X.squeeze() + 2 + np.random.randn(200) * 0.8

# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

# Visualise
plt.figure(figsize=(8, 5))
plt.scatter(X_train, y_train, alpha=0.5, label='Train')
plt.scatter(X_test, y_test, alpha=0.5, label='Test')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Train/Test Split')
plt.legend()
plt.show()

<details>
<summary><b>Q: Why do we split data into train and test sets?</b></summary>

**A:** To evaluate **generalisation** - how well the model performs on unseen data.

- **Training set**: Used to fit model parameters
- **Test set**: Held out completely, used only for final evaluation

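If we evaluated on the training data, we would overestimate performance, because the model may simply have "memorised" those examples. A model that fits its training set well but performs poorly on unseen data is **overfitting**, and only a held-out test set can reveal it.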

**Common splits:** 80/20 or 70/30 for train/test.
</details>
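
To make the "memorisation" point concrete, here is a small sketch using a 1-nearest-neighbour predictor, which stores the training set and returns the target of the closest stored point. The data generator, the 160/40 split, and the helper names are assumptions made for this illustration; it uses only the standard library.

```python
import random

rng = random.Random(0)

# Synthetic data of the same form as the cell above: y = 3x + 2 plus noise
xs = [rng.uniform(-4, 4) for _ in range(200)]
ys = [3 * x + 2 + rng.gauss(0, 0.8) for x in xs]

train_x, test_x = xs[:160], xs[160:]
train_y, test_y = ys[:160], ys[160:]


def predict_1nn(x: float) -> float:
    """Return the target of the closest training point (pure memorisation)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def mse_1nn(inputs, targets) -> float:
    return sum((t - predict_1nn(x)) ** 2 for x, t in zip(inputs, targets)) / len(targets)


train_mse = mse_1nn(train_x, train_y)   # every training point is its own neighbour
test_mse = mse_1nn(test_x, test_y)      # unseen points expose the memorised noise
print(f"Train MSE: {train_mse:.4f}")
print(f"Test MSE:  {test_mse:.4f}")
```

The training error is zero by construction, so it says nothing about how the model behaves on new inputs; the test error is the only honest estimate.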

## 7.2 Linear Regression with sklearn

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Fit model
model = LinearRegression()
model.fit(X_train, y_train)

# Learned parameters
print(f"Learned: y = {model.coef_[0]:.3f}x + {model.intercept_:.3f}")
print(f"True:    y = 3.000x + 2.000")

# Predictions
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

# Evaluate
print(f"\nTrain MSE: {mean_squared_error(y_train, y_pred_train):.4f}")
print(f"Test MSE:  {mean_squared_error(y_test, y_pred_test):.4f}")
print(f"Test R²:   {r2_score(y_test, y_pred_test):.4f}")

# Plot fit
plt.figure(figsize=(8, 5))
plt.scatter(X_test, y_test, alpha=0.5, label='Test data')
X_line = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
plt.plot(X_line, model.predict(X_line), color='red', linewidth=2, label='Fitted line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression Fit')
plt.legend()
plt.show()

## 7.3 Linear Regression from Scratch

Implementing gradient descent - the same algorithm that trains neural networks!

### Mathematical Derivation

Before implementing, let's derive the gradient descent update rules.

<details>
<summary><b>Deep Dive: Deriving MSE Gradients</b></summary>

For linear regression with model $\hat{y} = wx + b$, the Mean Squared Error loss is:

$$L = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - wx_i - b)^2$$

**Gradient with respect to $w$:**

Using the chain rule:
$$\frac{\partial L}{\partial w} = \frac{1}{n}\sum_{i=1}^{n} 2(y_i - wx_i - b) \cdot (-x_i) = -\frac{2}{n}\sum_{i=1}^{n} x_i(y_i - \hat{y}_i)$$

**Gradient with respect to $b$:**
$$\frac{\partial L}{\partial b} = \frac{1}{n}\sum_{i=1}^{n} 2(y_i - wx_i - b) \cdot (-1) = -\frac{2}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)$$

**Update rule:** Move in the *opposite* direction of the gradient (downhill):
$$w_{\text{new}} = w_{\text{old}} - \alpha \frac{\partial L}{\partial w}$$
$$b_{\text{new}} = b_{\text{old}} - \alpha \frac{\partial L}{\partial b}$$

where $\alpha$ is the learning rate.
</details>
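
A quick way to gain confidence in a hand-derived gradient is to compare it with a finite-difference estimate. The sketch below does this for the two formulas above; the toy data, the starting point `(w0, b0)`, and the function names are arbitrary choices for illustration.

```python
def mse_loss(w, b, xs, ys):
    """L = (1/n) * sum (y_i - w*x_i - b)^2"""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def analytic_grads(w, b, xs, ys):
    """Gradients from the derivation: dL/dw and dL/db."""
    n = len(xs)
    residuals = [y - (w * x + b) for x, y in zip(xs, ys)]
    dw = -2 / n * sum(x * r for x, r in zip(xs, residuals))
    db = -2 / n * sum(residuals)
    return dw, db


def numeric_grads(w, b, xs, ys, h=1e-5):
    """Central finite differences: (L(p + h) - L(p - h)) / (2h)."""
    dw = (mse_loss(w + h, b, xs, ys) - mse_loss(w - h, b, xs, ys)) / (2 * h)
    db = (mse_loss(w, b + h, xs, ys) - mse_loss(w, b - h, xs, ys)) / (2 * h)
    return dw, db


toy_x = [0.0, 1.0, 2.0, 3.0]
toy_y = [2.1, 4.9, 8.2, 10.8]
w0, b0 = 0.5, 0.1

dw_a, db_a = analytic_grads(w0, b0, toy_x, toy_y)
dw_n, db_n = numeric_grads(w0, b0, toy_x, toy_y)
print(f"dL/dw analytic={dw_a:.6f} numeric={dw_n:.6f}")
print(f"dL/db analytic={db_a:.6f} numeric={db_n:.6f}")
```

The two estimates should agree to several decimal places. The same "gradient check" idea is commonly used to debug hand-written backward passes.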

In [None]:
def linear_regression_gd(X, y, lr=0.1, epochs=100):
    """
    Train linear regression using gradient descent.
    
    Args:
        X: Features, shape (n_samples, 1)
        y: Targets, shape (n_samples,)
        lr: Learning rate
        epochs: Number of iterations
    
    Returns:
        w, b: Learned parameters
        history: Loss at each epoch
    """
    # Initialise parameters
    w = 0.0
    b = 0.0
    n = len(y)
    history = []
    
    X_flat = X.squeeze()  # Shape: (n_samples,)
    
    for epoch in range(epochs):
        # Forward pass: predictions
        y_pred = w * X_flat + b
        
        # Compute loss (MSE)
        loss = np.mean((y - y_pred) ** 2)
        history.append(loss)
        
        # Compute gradients (partial derivatives of MSE)
        # d(MSE)/dw = -2/n * sum(X * (y - y_pred))
        # d(MSE)/db = -2/n * sum(y - y_pred)
        dw = (-2/n) * np.sum(X_flat * (y - y_pred))
        db = (-2/n) * np.sum(y - y_pred)
        
        # Update parameters (gradient descent step)
        w = w - lr * dw
        b = b - lr * db
        
        if epoch % 20 == 0:
            print(f"Epoch {epoch:3d}: Loss = {loss:.4f}, w = {w:.3f}, b = {b:.3f}")
    
    return w, b, history

# Train from scratch
w, b, history = linear_regression_gd(X_train, y_train, lr=0.1, epochs=100)

print(f"\nFinal: y = {w:.3f}x + {b:.3f}")
print(f"sklearn: y = {model.coef_[0]:.3f}x + {model.intercept_:.3f}")

In [None]:
# Visualise gradient descent
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Loss curve
axes[0].plot(history)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('MSE Loss')
axes[0].set_title('Gradient Descent Convergence')
axes[0].grid(True, alpha=0.3)

# Compare fits
axes[1].scatter(X_test, y_test, alpha=0.5, label='Test data')
X_line = np.linspace(X.min(), X.max(), 100)
axes[1].plot(X_line, w * X_line + b, color='red', linewidth=2, label=f'GD: y={w:.2f}x+{b:.2f}')
axes[1].plot(X_line, model.coef_[0] * X_line + model.intercept_, 
             color='green', linewidth=2, linestyle='--', label='sklearn')
axes[1].set_xlabel('X')
axes[1].set_ylabel('y')
axes[1].set_title('Comparison: Gradient Descent vs sklearn')
axes[1].legend()

plt.tight_layout()
plt.show()

<details>
<summary><b>Q: What happens if learning rate is too high or too low?</b></summary>

**A:**
- **Too high**: Loss oscillates or diverges (explodes to infinity). The steps are too big and overshoot the minimum.

- **Too low**: Converges very slowly. May get stuck or take forever to train.

**Try it**: Change `lr=0.1` to `lr=0.01` (slow) or `lr=1.0` (unstable) and observe.

**Rule of thumb**: Start with lr=0.01 or 0.001 for neural networks. Use learning rate schedulers for better results.
</details>
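
The three regimes can be seen on the simplest possible loss, $f(w) = w^2$, whose gradient is $2w$. Each gradient descent step then multiplies $w$ by $(1 - 2\,\text{lr})$, so the behaviour depends only on that factor. The function name `run_gd` and the specific learning rates below are illustrative choices.

```python
def run_gd(lr: float, steps: int = 50, w_start: float = 1.0) -> float:
    """Minimise f(w) = w**2 with gradient descent and return the final w."""
    w_cur = w_start
    for _ in range(steps):
        grad = 2 * w_cur              # f'(w) = 2w
        w_cur = w_cur - lr * grad     # equivalent to w_cur *= (1 - 2 * lr)
    return w_cur


w_good = run_gd(lr=0.1)      # factor 0.8 per step: shrinks quickly towards 0
w_slow = run_gd(lr=0.001)    # factor 0.998 per step: barely moves in 50 steps
w_bad = run_gd(lr=1.1)       # factor -1.2 per step: sign flips and magnitude grows

print(f"lr=0.1   -> w = {w_good:.2e}")
print(f"lr=0.001 -> w = {w_slow:.4f}")
print(f"lr=1.1   -> w = {w_bad:.2e}")
```

Any learning rate with $|1 - 2\,\text{lr}| < 1$ converges on this loss; real loss surfaces are not this simple, but the same overshoot mechanism explains diverging training curves.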

## 7.4 Polynomial Features & Overfitting

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Generate nonlinear data
np.random.seed(42)
X_poly = np.random.uniform(-3, 3, 50).reshape(-1, 1)
y_poly = 0.5 * X_poly.squeeze()**2 - X_poly.squeeze() + 2 + np.random.randn(50) * 0.5

X_train_p, X_test_p, y_train_p, y_test_p = train_test_split(X_poly, y_poly, test_size=0.3, random_state=42)

# Fit models of different complexity
degrees = [1, 3, 15]
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

X_plot = np.linspace(-3.5, 3.5, 100).reshape(-1, 1)

for ax, degree in zip(axes, degrees):
    # Create polynomial pipeline
    model = make_pipeline(
        PolynomialFeatures(degree),
        LinearRegression()
    )
    model.fit(X_train_p, y_train_p)
    
    # Evaluate
    train_mse = mean_squared_error(y_train_p, model.predict(X_train_p))
    test_mse = mean_squared_error(y_test_p, model.predict(X_test_p))
    
    # Plot
    ax.scatter(X_train_p, y_train_p, alpha=0.6, label='Train')
    ax.scatter(X_test_p, y_test_p, alpha=0.6, label='Test')
    ax.plot(X_plot, model.predict(X_plot), color='red', linewidth=2)
    ax.set_xlabel('X')
    ax.set_ylabel('y')
    ax.set_title(f'Degree {degree}\nTrain MSE: {train_mse:.2f}, Test MSE: {test_mse:.2f}')
    ax.legend()
    ax.set_ylim(-5, 15)

plt.tight_layout()
plt.show()

print("Observe: Degree 15 has LOW train error but HIGH test error = OVERFITTING")

<details>
<summary><b>Q: How do you detect overfitting?</b></summary>

**A:** Compare train vs test performance:

| Train Error | Test Error | Diagnosis |
|-------------|------------|-----------|
| Low | Low | Good fit |
| Low | High | **Overfitting** |
| High | High | Underfitting |

**Solutions for overfitting:**
1. More training data
2. Simpler model (fewer parameters)
3. Regularization (L1/L2)
4. Early stopping
5. Dropout (for neural networks)
</details>
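
Of the remedies listed above, early stopping is the easiest to sketch without any library. The helper below is a hypothetical, patience-based version: it scans a sequence of validation losses and stops once the loss has failed to improve for `patience` consecutive epochs. The loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience: int = 3):
    """Return (best_epoch, best_loss) under a patience-based stopping rule."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break                 # stop: no improvement for `patience` epochs
    return best_epoch, best_loss


# Validation loss improves, then starts rising as the model overfits
val_losses = [1.0, 0.8, 0.6, 0.55, 0.57, 0.60, 0.65, 0.50]
stop_result = early_stopping_epoch(val_losses, patience=3)
print(stop_result)
```

In a real training loop you would also save the model weights at `best_epoch` and restore them after stopping. Note that the rule never sees the final `0.50`, which is the usual trade-off when choosing `patience`.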

## 7.5 Regularization Preview

In [None]:
from sklearn.linear_model import Ridge, Lasso

# Compare regularized models on degree-15 polynomial
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

models = [
    ('No Regularization', make_pipeline(PolynomialFeatures(15), LinearRegression())),
    ('Ridge (L2)', make_pipeline(PolynomialFeatures(15), Ridge(alpha=1.0))),
    ('Lasso (L1)', make_pipeline(PolynomialFeatures(15), Lasso(alpha=0.1))),
]

for ax, (name, model) in zip(axes, models):
    model.fit(X_train_p, y_train_p)
    
    train_mse = mean_squared_error(y_train_p, model.predict(X_train_p))
    test_mse = mean_squared_error(y_test_p, model.predict(X_test_p))
    
    ax.scatter(X_train_p, y_train_p, alpha=0.6, label='Train')
    ax.scatter(X_test_p, y_test_p, alpha=0.6, label='Test')
    ax.plot(X_plot, model.predict(X_plot), color='red', linewidth=2)
    ax.set_xlabel('X')
    ax.set_ylabel('y')
    ax.set_title(f'{name}\nTrain: {train_mse:.2f}, Test: {test_mse:.2f}')
    ax.legend()
    ax.set_ylim(-5, 15)

plt.tight_layout()
plt.show()

print("Ridge/Lasso add penalty terms to prevent overfitting:")
print("  Ridge: penalizes large weights (L2 norm)")
print("  Lasso: promotes sparsity (L1 norm)")
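
To see how an L2 penalty shrinks weights, consider the simplest case: one feature and no intercept. Minimising $\sum_i (y_i - w x_i)^2 + \alpha w^2$ gives the closed form $w = \sum_i x_i y_i / (\sum_i x_i^2 + \alpha)$. The sketch below evaluates it for a few values of $\alpha$; the function name and toy data are assumptions for illustration, and this one-dimensional case is a simplification of what `Ridge` solves.

```python
def ridge_slope(xs, ys, alpha: float) -> float:
    """Closed-form ridge solution for y ~ w * x (single feature, no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)        # alpha in the denominator pulls w towards 0


toy_x = [1.0, 2.0, 3.0]
toy_y = [2.0, 4.0, 6.0]               # exactly y = 2x

w_ols = ridge_slope(toy_x, toy_y, alpha=0.0)     # ordinary least squares
w_mild = ridge_slope(toy_x, toy_y, alpha=1.0)
w_strong = ridge_slope(toy_x, toy_y, alpha=14.0)

print(w_ols, w_mild, w_strong)
```

Larger `alpha` trades a slightly worse fit on the training data for smaller weights, which is what tames the wild oscillations of the degree-15 polynomial.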

---

# Part 8: Practical Patterns

---

## 8.1 Generators & Iterators

Generators are crucial for memory-efficient data loading.

In [None]:
# Generator function - uses yield
def count_up_to(n: int):
    """Generate numbers from 0 to n-1."""
    i = 0
    while i < n:
        yield i  # Pauses here, returns value
        i += 1

# Usage
for num in count_up_to(5):
    print(num, end=" ")
print()

# Generator expression (like list comprehension but lazy)
squares_gen = (x**2 for x in range(1000000))  # Nothing computed yet: values are produced lazily
print(f"Generator: {squares_gen}")
print(f"First 5: {[next(squares_gen) for _ in range(5)]}")

In [None]:
# Why generators matter for DL: memory efficiency
import sys

# List stores all values in memory
# (sys.getsizeof counts only the list's internal pointer array, not the int objects it refers to)
big_list = [i**2 for i in range(1000000)]
print(f"List size: {sys.getsizeof(big_list) / 1e6:.1f} MB")

# Generator computes on-demand
def big_gen():
    for i in range(1000000):
        yield i**2

gen = big_gen()
print(f"Generator size: {sys.getsizeof(gen)} bytes")

# DataLoader-style batching
def batch_generator(data: list, batch_size: int):
    """Yield batches from data."""
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

data = list(range(100))
for batch in batch_generator(data, batch_size=32):
    print(f"Batch: {batch[:3]}... (size {len(batch)})")

<details>
<summary><b>Q: When should you use a generator vs a list?</b></summary>

**A:**
- **Generator**: When data is large, you only need one pass, or values are computed on-demand
- **List**: When you need random access, multiple passes, or the data is small

**DataLoaders iterate lazily, in the same spirit as generators,** because:
1. Training data is often huge (can't fit in RAM)
2. You only need one batch at a time
3. Data can be augmented on-the-fly
</details>
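
The "one pass" limitation is worth seeing once, because it is a common source of silently empty loops. A minimal sketch (the `squares` helper is a hypothetical example):

```python
def squares(n: int):
    """Yield 0, 1, 4, ... lazily."""
    for i in range(n):
        yield i * i


gen = squares(4)
first_pass = list(gen)       # consumes the generator
second_pass = list(gen)      # already exhausted: nothing left to yield

# To iterate again (e.g. once per epoch), create a fresh generator each time
epochs = [list(squares(4)) for _ in range(2)]

print(first_pass, second_pass, epochs)
```

This is why training loops typically re-iterate over a loader object at the start of every epoch rather than holding on to a single generator.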

## 8.2 File I/O with Pathlib

In [None]:
from pathlib import Path

# Create paths (cross-platform!)
data_dir = Path("data")
model_path = data_dir / "models" / "best.pt"

print(f"Path: {model_path}")
print(f"Parent: {model_path.parent}")
print(f"Name: {model_path.name}")
print(f"Stem: {model_path.stem}")
print(f"Suffix: {model_path.suffix}")

# Check existence
print(f"\nExists: {model_path.exists()}")
print(f"Is file: {model_path.is_file()}")

# Find files
current = Path(".")
print(f"\nPython files in current dir: {list(current.glob('*.py'))[:3]}")
print(f"All .ipynb (recursive): {list(current.glob('**/*.ipynb'))[:3]}")
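
Beyond inspecting paths, `Path` objects can create directories and read or write files directly. The sketch below works inside a temporary directory so it leaves nothing behind; the directory layout and file name are made up for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    ckpt_dir = Path(tmp) / "models" / "run1"
    ckpt_dir.mkdir(parents=True, exist_ok=True)   # create intermediate directories too

    notes = ckpt_dir / "notes.txt"
    notes.write_text("epoch=3, val_loss=0.55\n", encoding="utf-8")

    contents = notes.read_text(encoding="utf-8")
    found = [p.name for p in ckpt_dir.glob("*.txt")]

print(contents.strip())
print(found)
```

`mkdir(parents=True, exist_ok=True)` is a handy idiom before saving checkpoints, since it neither fails on missing parent folders nor on a directory that already exists.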

## 8.3 Debugging Strategies

In [None]:
# 1. Print debugging with f-strings
def debug_forward(x, W):
    print(f"DEBUG: x.shape={x.shape}, W.shape={W.shape}")
    result = x @ W
    print(f"DEBUG: result.shape={result.shape}")
    return result

# 2. Assertions - catch bugs early
def normalize(x: np.ndarray) -> np.ndarray:
    assert x.ndim == 2, f"Expected 2D array, got {x.ndim}D"
    assert x.shape[0] > 0, "Empty array"
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

# 3. Shape annotations in comments
def attention(Q, K, V):
    # Q: (batch, heads, seq_len, d_k)
    # K: (batch, heads, seq_len, d_k)
    # V: (batch, heads, seq_len, d_v)
    
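    # swapaxes works on NumPy arrays and PyTorch tensors alike;
    # ndarray.transpose(-2, -1) would fail on a 4D NumPy array
    scores = Q @ K.swapaxes(-2, -1)  # (batch, heads, seq_len, seq_len)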
    weights = scores  # Simplified - normally softmax
    output = weights @ V  # (batch, heads, seq_len, d_v)
    return output

# Test
x = np.random.randn(32, 784)
W = np.random.randn(784, 128)
y = debug_forward(x, W)
z = normalize(x)
print(f"\nNormalized shape: {z.shape}")
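
The `attention` function above skips the softmax. For reference, here is a NumPy sketch of the usual scaled dot-product form, with shape assertions in the style recommended in this section. The scaling by $\sqrt{d_k}$ and the helper names are additions for this illustration and are not taken from the cell above.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)   # avoid overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    # Q, K: (batch, heads, seq_len, d_k); V: (batch, heads, seq_len, d_v)
    assert Q.shape == K.shape, f"Q {Q.shape} and K {K.shape} must match"
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -2, -1) / np.sqrt(d_k)   # (batch, heads, seq, seq)
    weights = softmax(scores, axis=-1)                   # each row sums to 1
    return weights @ V, weights                          # (batch, heads, seq, d_v)


rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4, 5, 8))
K = rng.standard_normal((2, 4, 5, 8))
V = rng.standard_normal((2, 4, 5, 16))

attn_out, attn_weights = scaled_dot_product_attention(Q, K, V)
print(attn_out.shape, attn_weights.shape)
```

Checking that the attention weights sum to one along the last axis is a cheap sanity test that catches a softmax applied over the wrong dimension.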

---

# Part 9: Summary & Capstone

---

## Key Takeaways

### Python Foundations
- Use **type hints** for self-documenting code
- Use **comprehensions** for building collections
- Use **context managers** (`with`) for resource management

### NumPy
- **Vectorize** operations - avoid Python loops
- **Broadcasting** aligns shapes from the right
- **axis=0** collapses rows, **axis=1** collapses columns
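
As a quick refresher on the last two bullets, the sketch below broadcasts a `(3, 1)` array against a `(4,)` array and reduces a small matrix along each axis. The array values are arbitrary.

```python
import numpy as np

col = np.arange(3).reshape(3, 1)     # shape (3, 1)
row = np.arange(4)                   # shape (4,), treated as (1, 4)
grid = col + row                     # shapes align from the right -> (3, 4)

m = np.arange(6).reshape(2, 3)       # [[0, 1, 2], [3, 4, 5]]
col_sums = m.sum(axis=0)             # collapses the 2 rows -> one value per column
row_sums = m.sum(axis=1)             # collapses the 3 columns -> one value per row

print(grid.shape, col_sums, row_sums)
```

A useful habit is to read `axis=k` as "the dimension that disappears from the shape": `(2, 3)` becomes `(3,)` for `axis=0` and `(2,)` for `axis=1`.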

### Data Handling
- **Pandas** for loading and preprocessing tabular data
- Always check for **missing values** and handle appropriately
- **Scale features** before training ML models

### Machine Learning Basics
- Always **split** data into train/test sets
- **MSE** and **R²** for regression evaluation
- Watch for **overfitting**: low train error, high test error
- **Regularization** (Ridge/Lasso) prevents overfitting

### OOP for DL
- **`__call__`** makes objects callable (used by `nn.Module`)
- **Inheritance** is fundamental to PyTorch model building

### Practical Patterns
- **Generators** for memory-efficient iteration (DataLoaders!)
- **Pathlib** for cross-platform file paths

## Self-Assessment Checklist

Before proceeding to Lab 2, you should be able to:

- [ ] Write a function with type hints and a Google-style docstring
- [ ] Predict the output shape of broadcasting `(3,1) + (4,)`
- [ ] Load a CSV file with pandas and handle missing values
- [ ] Split data into train/test sets using sklearn
- [ ] Train a linear regression model and compute MSE
- [ ] Explain why high train accuracy + low test accuracy = overfitting
- [ ] Explain why `__call__` is used in PyTorch modules
- [ ] Write a generator function with `yield`

## Capstone Exercise: End-to-End ML Pipeline

Build a complete pipeline from data loading to model evaluation:

In [None]:
# Capstone Exercise: Complete the pipeline

class MLPipeline:
    """
    End-to-end ML pipeline.
    
    TODO: Implement the following methods:
    1. load_data: Load wine dataset from sklearn
    2. preprocess: Handle missing values, scale features
    3. split: Train/test split
    4. train: Fit a Ridge regression model
    5. evaluate: Return MSE and R² on test set
    """
    
    def __init__(self):
        self.scaler = None
        self.model = None
        self.X_train = None
        self.X_test = None
        self.y_train = None
        self.y_test = None
    
    def load_data(self):
        # TODO: Load wine dataset, use first feature as target for regression
        pass
    
    def preprocess(self, X):
        # TODO: Scale features using StandardScaler
        # Remember: fit on train, transform on both train and test
        pass
    
    def split(self, X, y, test_size=0.2):
        # TODO: Split into train/test
        pass
    
    def train(self):
        # TODO: Fit Ridge regression
        pass
    
    def evaluate(self):
        # TODO: Return dict with 'mse' and 'r2' on test set
        pass
    
    def run(self):
        """Run the full pipeline."""
        X, y = self.load_data()
        self.split(X, y)
        self.X_train = self.preprocess(self.X_train)
        self.X_test = self.preprocess(self.X_test)
        self.train()
        return self.evaluate()

# Test your implementation:
# pipeline = MLPipeline()
# results = pipeline.run()
# print(f"Test MSE: {results['mse']:.4f}")
# print(f"Test R²: {results['r2']:.4f}")

<details>
<summary><b>Solution</b></summary>

```python
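from sklearn.datasets import load_wine
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


class MLPipeline: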
    def __init__(self):
        self.scaler = StandardScaler()
        self.model = Ridge(alpha=1.0)
        self.X_train = None
        self.X_test = None
        self.y_train = None
        self.y_test = None
        self._fitted = False
    
    def load_data(self):
        wine = load_wine()
        X = wine.data[:, 1:]  # Features (all but first)
        y = wine.data[:, 0]   # Target (first column: alcohol)
        return X, y
    
    def preprocess(self, X):
        if not self._fitted:
            self._fitted = True
            return self.scaler.fit_transform(X)
        return self.scaler.transform(X)
    
    def split(self, X, y, test_size=0.2):
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            X, y, test_size=test_size, random_state=42
        )
    
    def train(self):
        self.model.fit(self.X_train, self.y_train)
    
    def evaluate(self):
        y_pred = self.model.predict(self.X_test)
        return {
            'mse': mean_squared_error(self.y_test, y_pred),
            'r2': r2_score(self.y_test, y_pred)
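        }
    
    def run(self):
        """Run the full pipeline (unchanged from the exercise skeleton)."""
        X, y = self.load_data()
        self.split(X, y)
        self.X_train = self.preprocess(self.X_train)  # first call fits the scaler on train only
        self.X_test = self.preprocess(self.X_test)    # second call reuses the fitted scaler
        self.train()
        return self.evaluate()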
```
</details>

---

# Part 10: Software Engineering Essentials

Professional data science requires more than just modeling skills. This section covers essential tools and practices.

> **Prerequisite Course**: For data structures and algorithms foundations, see the [DSA Lab Course](https://github.com/henrycgbaker/data-structures-algorithms-lab-2025-TEACHING).

---

## 10.1 Python Package Management

### Virtual Environments

Always use virtual environments to isolate project dependencies:

```bash
# Using venv (built-in)
python -m venv .venv
source .venv/bin/activate  # Linux/Mac
.venv\Scripts\activate   # Windows

# Using conda
conda create -n myproject python=3.10
conda activate myproject
```

### requirements.txt vs pyproject.toml

**requirements.txt** (traditional):
```
numpy>=1.20.0
pandas>=1.3.0
torch>=2.0.0
```

**pyproject.toml** (modern, recommended):
```toml
[project]
name = "my-dl-project"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
    "numpy>=1.20.0",
    "pandas>=1.3.0",
    "torch>=2.0.0",
]

[project.optional-dependencies]
dev = ["pytest", "ruff", "mypy"]
```

### Poetry (Recommended for Projects)

[Poetry](https://python-poetry.org/) provides dependency management and packaging:

```bash
# Install poetry
curl -sSL https://install.python-poetry.org | python3 -

# Create new project
poetry new my-project
cd my-project

# Add dependencies
poetry add numpy pandas torch
poetry add --group dev pytest ruff

# Install all dependencies
poetry install

# Run commands in virtual environment
poetry run python train.py
poetry run pytest

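# Export to requirements.txt (for deployment)
# Note: `export` is provided by the poetry-plugin-export plugin, which recent Poetry releases may require you to install separately
poetry export -f requirements.txt --output requirements.txt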
```

**Why Poetry?**
- Lock file ensures reproducible builds
- Separates dev and production dependencies
- Resolves a mutually compatible set of dependency versions (and reports conflicts it cannot resolve)
- Modern pyproject.toml format

<details>
<summary><b>Q: When should you use pip vs conda vs poetry?</b></summary>

**A:**
| Tool | Best For |
|------|----------|
| **pip** | Simple scripts, quick prototypes |
| **conda** | Scientific computing, GPU libraries, cross-language deps |
| **poetry** | Production projects, packages you'll distribute |

**Rule of thumb**: 
- Colab/quick experiments → pip
- Complex ML environments → conda
- Serious projects → poetry
</details>

## 10.2 Code Quality & Pre-commit Hooks

In [None]:
# Pre-commit hooks run checks automatically before each commit
# Install: pip install pre-commit

# Example .pre-commit-config.yaml:
pre_commit_config = """
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.1.6
    hooks:
      - id: ruff          # Linting
        args: [--fix]
      - id: ruff-format   # Formatting
  
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-added-large-files
"""

print("Save this as .pre-commit-config.yaml in your repo root")
print("Then run: pre-commit install")
print("Now every 'git commit' will auto-format and lint your code!")

### Ruff: Modern Python Linter

[Ruff](https://github.com/astral-sh/ruff) is extremely fast and replaces multiple tools:

```bash
# Install
pip install ruff

# Lint (check for issues)
ruff check .

# Fix auto-fixable issues
ruff check --fix .

# Format (like black)
ruff format .
```

**pyproject.toml configuration:**
```toml
[tool.ruff]
line-length = 100
target-version = "py310"

[tool.ruff.lint]
select = ["E", "F", "I", "UP"]  # Error, pyflakes, isort, pyupgrade
ignore = ["E501"]  # Line too long (handled by formatter)
```

## 10.3 GitHub Actions (CI/CD)

Automatically run tests and checks on every push:

```yaml
# .github/workflows/ci.yml
name: CI

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest ruff
      
      - name: Lint with ruff
        run: ruff check .
      
      - name: Run tests
        run: pytest tests/
```

This ensures:
- Every PR is automatically tested
- Code style is enforced
- Broken code can't be merged

<details>
<summary><b>Q: Why does the Dockerfile in Section 10.4 copy requirements.txt before copying the code?</b></summary>

**A:** Docker caches each layer. If you copy code first, ANY code change invalidates the cache for `pip install`. By copying requirements.txt first:
- Requirements layer is cached if dependencies are unchanged
- Code changes only rebuild the final copy layer

This can save minutes on each build when dependencies are stable.
</details>

<details>
<summary><b>Q: What's the difference between `poetry install` and `pip install -r requirements.txt`?</b></summary>

**A:**
- **`poetry install`**: Uses lock file (`poetry.lock`) for exact versions. Creates isolated virtual environment. Handles dependency resolution.

- **`pip install -r`**: Uses version ranges from requirements.txt. May get different versions on different machines. No built-in environment management.

Poetry is more reproducible; pip is simpler for quick setups.
</details>

## 10.4 Docker Basics

Docker containers ensure your code runs the same everywhere.

> **Full Docker Course**: See [DS Hub Docker Guide](https://github.com/hertie-data-science-lab/ds01-hub/tree/main)

### Essential Dockerfile for ML

```dockerfile
# Use official Python image
FROM python:3.10-slim

# Set working directory
WORKDIR /app

# Copy requirements first (for caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy code
COPY . .

# Run training script
CMD ["python", "train.py"]
```

### Common Commands

```bash
# Build image
docker build -t my-ml-project .

# Run container
docker run my-ml-project

# Run with GPU (NVIDIA)
docker run --gpus all my-ml-project

# Interactive shell
docker run -it my-ml-project /bin/bash

# Mount local directory
docker run -v $(pwd)/data:/app/data my-ml-project
```

### Docker Compose for Multi-Container Apps

```yaml
# docker-compose.yml
version: '3.8'
services:
  training:
    build: .
    volumes:
      - ./data:/app/data
      - ./models:/app/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
  
  tensorboard:
    image: tensorflow/tensorflow
    ports:
      - "6006:6006"
    volumes:
      - ./logs:/logs
    command: tensorboard --logdir=/logs --host=0.0.0.0
```

```bash
# Start all services
docker-compose up

# Run in background
docker-compose up -d

# Stop all
docker-compose down
```

<details>
<summary><b>Q: When should you use Docker for ML projects?</b></summary>

**A:** Use Docker when:
- **Reproducibility matters**: Ensure exact same environment
- **Deployment**: Serving models in production
- **Collaboration**: Share exact environments with team
- **GPU clusters**: Many HPC systems require containers

**Skip Docker when:**
- Quick experiments in Colab
- Simple scripts with few dependencies
- Learning/prototyping phase

**Rule**: Start without Docker, add it when you need reproducibility or deployment.
</details>

## 10.5 Project Structure

Recommended structure for ML projects:

```
my-ml-project/
├── .github/
│   └── workflows/
│       └── ci.yml          # GitHub Actions
├── data/
│   ├── raw/                # Original data (gitignored)
│   └── processed/          # Cleaned data
├── models/                 # Saved model checkpoints
├── notebooks/              # Jupyter notebooks for exploration
├── src/
│   ├── __init__.py
│   ├── data.py            # Data loading/preprocessing
│   ├── model.py           # Model architecture
│   ├── train.py           # Training loop
│   └── evaluate.py        # Evaluation metrics
├── tests/
│   └── test_model.py      # Unit tests
├── .gitignore
├── .pre-commit-config.yaml
├── pyproject.toml         # Dependencies & config
├── README.md
└── Dockerfile
```

**Key principles:**
- Separate code (src/) from experiments (notebooks/)
- Never commit raw data or model weights to git
- Use pyproject.toml for all configuration
- Write tests for critical functions
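
As an illustration of the last principle, here is a sketch of what a small test module might look like. The `min_max_scale` function and the test names are hypothetical; in a real project the function would live under `src/` and be imported into a file such as `tests/test_data.py`, where `pytest` discovers functions whose names start with `test_`.

```python
def min_max_scale(values: list[float]) -> list[float]:
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]          # constant input: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]


def test_scales_to_unit_interval():
    assert min_max_scale([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]


def test_constant_input_does_not_divide_by_zero():
    assert min_max_scale([3.0, 3.0]) == [0.0, 0.0]


# pytest would call these automatically; here they are run by hand
test_scales_to_unit_interval()
test_constant_input_does_not_divide_by_zero()
print("all tests passed")
```

Tests like the second one, which pin down an edge case, tend to pay off most: preprocessing bugs on degenerate inputs are easy to miss in a notebook and hard to trace once they reach training.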

## 10.6 Further Resources

### Courses & Links
- [DSA Lab Course](https://github.com/henrycgbaker/data-structures-algorithms-lab-2025-TEACHING) - Data structures & algorithms
- [DS Hub Docker Guide](https://github.com/hertie-data-science-lab/ds01-hub/tree/main) - Docker for data science
- [Poetry Documentation](https://python-poetry.org/docs/)
- [GitHub Actions Guide](https://docs.github.com/en/actions)
- [Ruff Documentation](https://docs.astral.sh/ruff/)

### Books
- *The Good Research Code Handbook* - Patrick Mineault
- *Software Engineering for Data Scientists* - Andrew Treadway

## References

### Python & NumPy
1. [Python Type Hints Cheat Sheet](https://mypy.readthedocs.io/en/stable/cheat_sheet_py3.html)
2. [NumPy Broadcasting Rules](https://numpy.org/doc/stable/user/basics.broadcasting.html)
3. [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html)

### Data Science & ML
4. [Pandas Documentation](https://pandas.pydata.org/docs/)
5. [sklearn User Guide](https://scikit-learn.org/stable/user_guide.html)
6. [PyTorch nn.Module Source](https://github.com/pytorch/pytorch/blob/main/torch/nn/modules/module.py)

### Software Engineering
7. [DSA Lab Course](https://github.com/henrycgbaker/data-structures-algorithms-lab-2025-TEACHING) - Prerequisites
8. [DS Hub Docker Guide](https://github.com/hertie-data-science-lab/ds01-hub/tree/main)
9. [Poetry Documentation](https://python-poetry.org/docs/)
10. [Ruff Documentation](https://docs.astral.sh/ruff/)
11. [GitHub Actions Guide](https://docs.github.com/en/actions)

---

**Next:** Lab 2 - Introduction to Feedforward Neural Networks