# Chapter 1: Data Structures & Complexity Refresher

## Learning Objectives
By the end of this chapter, you will be able to:
- Analyze time and space complexity of algorithms
- Understand memory layout and cache effects in ML operations
- Compare vectorized operations vs loops for performance
- Implement memory-efficient algorithms for large-scale ML
- Profile and optimize memory usage in PyTorch/TensorFlow

## Prerequisites
- Basic understanding of Python and NumPy
- Familiarity with PyTorch tensors
- Knowledge of basic data structures (arrays, lists, dictionaries)


## 1.1 Complexity Analysis Refresher

### Big O Notation
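Big O notation gives an asymptotic upper bound on how an algorithm's running time or memory use grows with input size. It is most often quoted for the worst case, but it can equally describe average-case or amortized cost.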

**Common Complexities:**
- O(1) - Constant time (array access, average-case hash table lookup)
- O(log n) - Logarithmic (binary search, balanced tree operations)
- O(n) - Linear (single pass through array)
- O(n log n) - Linearithmic (efficient sorting algorithms)
- O(n²) - Quadratic (nested loops)
- O(2ⁿ) - Exponential (naive recursive Fibonacci)
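
To make the gap between O(n) and O(log n) concrete, the sketch below counts the comparisons made by a linear scan and by binary search on the same sorted list. The helper names `linear_search_steps` and `binary_search_steps` are illustrative and not part of any library.

```python
def linear_search_steps(items, target):
    """Return the number of comparisons a linear scan makes."""
    steps = 0
    for value in items:
        steps += 1
        if value == target:
            break
    return steps


def binary_search_steps(items, target):
    """Return the number of probes binary search makes on sorted input."""
    lo, hi = 0, len(items) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if items[mid] == target:
            break
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps


for n in (1_000, 1_000_000):
    data = list(range(n))
    target = n - 1  # last element: the worst case for the linear scan
    print(n, linear_search_steps(data, target), binary_search_steps(data, target))
```

Growing the input by a factor of 1,000 multiplies the linear count by 1,000 but adds only about ten probes to the binary search.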

### Space Complexity
Space complexity measures the amount of memory an algorithm uses relative to input size:
- **Auxiliary space**: Extra space used by the algorithm
- **Total space**: Input space + auxiliary space

### Amortized Analysis
Some operations are occasionally expensive, but cheap when the cost is averaged over a sequence of operations:
- Dynamic array resizing: O(1) amortized append
- Hash table with chaining: O(1) expected lookup, and O(1) amortized insertion once resizing is accounted for
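
The amortized bound for dynamic arrays can be checked empirically. The following is a minimal sketch of a toy growable array that doubles its capacity when full and counts how many elements it moves during resizes; the class `DynamicArray` is hypothetical and only mimics what a real list implementation does internally.

```python
class DynamicArray:
    """Toy growable array that doubles its capacity when full."""

    def __init__(self):
        self.capacity = 1
        self.size = 0
        self.slots = [None] * self.capacity
        self.copies = 0  # elements moved during resizes

    def append(self, value):
        if self.size == self.capacity:
            self._grow()
        self.slots[self.size] = value
        self.size += 1

    def _grow(self):
        # Allocate twice the space and move every existing element over.
        new_slots = [None] * (2 * self.capacity)
        for i in range(self.size):
            new_slots[i] = self.slots[i]
            self.copies += 1
        self.slots = new_slots
        self.capacity *= 2


arr = DynamicArray()
n = 100_000
for i in range(n):
    arr.append(i)

# Resizes copy 1 + 2 + 4 + ... elements, which sums to less than 2n,
# so the average cost per append stays constant.
print(f"appends={n}, copies={arr.copies}, copies/append={arr.copies / n:.2f}")
```

A single append can cost O(n) when it triggers a resize, but those resizes become rarer as the array grows, which is why the total work stays linear.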


## 1.2 Memory Layout and Cache Effects

### Memory Hierarchy
Modern computers have multiple levels of memory with different speeds and sizes. The figures below are rough orders of magnitude and vary by processor:

1. **CPU Registers** (fastest, smallest)
2. **L1 Cache** (~32KB, ~4 cycles)
3. **L2 Cache** (~256KB, ~10 cycles)
4. **L3 Cache** (~8MB, ~40 cycles)
5. **RAM** (~16GB, ~200 cycles)
6. **Storage** (SSD/HDD, ~100,000+ cycles)

### Cache Locality
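- **Spatial locality**: Accessing memory locations near a recently used one. This is fast because memory is fetched a whole cache line (typically 64 bytes) at a time.
- **Temporal locality**: Accessing the same memory location repeatedly within a short time, while it is still in cache.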

### Row-Major vs Column-Major Ordering
```python
# Row-major (C-style): elements in the same row are adjacent in memory,
# so arr[i][j] and arr[i][j+1] sit next to each other.

# Column-major (Fortran-style): elements in the same column are adjacent,
# so arr[i][j] and arr[i+1][j] sit next to each other.
```
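
The two layouts can be inspected directly in NumPy through an array's strides, which give the number of bytes to step along each axis. This is a small illustration using `np.asfortranarray` to build a column-major copy of the same values; the variable names are arbitrary.

```python
import numpy as np

c_order = np.arange(12, dtype=np.int64).reshape(3, 4)  # row-major (NumPy default)
f_order = np.asfortranarray(c_order)                   # same values, column-major

# Strides are (bytes per step along axis 0, bytes per step along axis 1).
print("C strides:", c_order.strides)
print("F strides:", f_order.strides)
print(c_order.flags["C_CONTIGUOUS"], f_order.flags["F_CONTIGUOUS"])

# order="K" reads the elements in the order they are laid out in memory.
print("C memory order:", c_order.ravel(order="K"))
print("F memory order:", f_order.ravel(order="K"))
```

In the row-major array the last axis has the smallest stride, so iterating over `j` in the inner loop walks memory sequentially; in the column-major array the first axis does.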

### Memory Access Patterns in ML
- **Sequential access** is much faster than random access
- **Vectorized operations** leverage SIMD instructions
- **Memory bandwidth** often limits performance more than compute
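
One way to observe the cost of random access is to gather the same elements twice, once in sequential order and once in a shuffled order. The sketch below assumes an array that is larger than the last-level cache; the size `n` and the helper `timed_gather` are illustrative choices, and the measured ratio will differ between machines.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n = 2_000_000
data = rng.random(n)

sequential_idx = np.arange(n)       # visits memory front to back
random_idx = rng.permutation(n)     # visits the same elements in shuffled order


def timed_gather(idx):
    """Gather data[idx], sum it, and return (sum, elapsed seconds)."""
    start = time.perf_counter()
    total = data[idx].sum()
    return total, time.perf_counter() - start


seq_total, seq_time = timed_gather(sequential_idx)
rand_total, rand_time = timed_gather(random_idx)

print(f"sequential: {seq_time:.4f}s  random: {rand_time:.4f}s")
```

Both gathers perform the same arithmetic on the same values, so any difference in time comes from the memory access pattern.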


## 1.3 Vectorization vs Loops

### Why Vectorization is Faster
1. **SIMD Instructions**: Single Instruction, Multiple Data
2. **Reduced Python overhead**: Fewer interpreter-level operations per element
3. **Better cache utilization**: Sequential memory access
4. **Optimized native implementations**: NumPy/PyTorch run compiled C/C++ loops and dispatch linear algebra to optimized BLAS libraries

### Example: Element-wise Operations
```python
import numpy as np
import time

# Large array
arr = np.random.randn(1000000)

# Loop version (slow)
def loop_sum(arr):
    total = 0
    for x in arr:
        total += x
    return total

# Vectorized version (fast)
def vectorized_sum(arr):
    return np.sum(arr)

# Vectorized is often ~100x faster for large arrays (the exact speedup depends on hardware and dtype)
```
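
The speedup is easy to measure rather than assume. Below is a minimal timing sketch using the standard-library `timeit` module; the array size and repeat count are arbitrary choices, and `loop_sum` is redefined here only so the block runs on its own.

```python
import timeit
import numpy as np

arr = np.random.default_rng(0).standard_normal(200_000)


def loop_sum(values):
    """Sum element by element in the Python interpreter."""
    total = 0.0
    for x in values:
        total += x
    return total


# Take the best of a few runs to reduce noise from other processes.
loop_time = min(timeit.repeat(lambda: loop_sum(arr), number=1, repeat=3))
vec_time = min(timeit.repeat(lambda: np.sum(arr), number=1, repeat=3))

print(f"loop:       {loop_time * 1e3:.2f} ms")
print(f"vectorized: {vec_time * 1e3:.2f} ms")
print(f"speedup:    {loop_time / vec_time:.0f}x")
```

Taking the minimum over repeats is a common convention for micro-benchmarks, since slower runs usually reflect interference rather than the code itself.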

### When to Use Each Approach
- **Use loops for**: Complex logic, small arrays, irregular patterns
- **Use vectorization for**: Mathematical operations, large arrays, regular patterns
- **Hybrid approach**: Vectorize the hot inner loops and keep the outer control flow in plain, readable Python


## 1.4 Memory-Efficient Algorithms

### In-Place Operations
Modify data structures without creating copies:
```python
# In-place: O(1) extra space
def reverse_inplace(arr):
    left, right = 0, len(arr) - 1
    while left < right:
        arr[left], arr[right] = arr[right], arr[left]
        left += 1
        right -= 1

# Not in-place: O(n) extra space (slicing a list copies it; slicing a NumPy array returns a view instead)
def reverse_copy(arr):
    return arr[::-1]
```
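
The same distinction applies to NumPy arrays, where an expression such as `a * 2` allocates a new array while the `out=` argument and augmented assignment reuse the existing buffer. The sketch below uses `np.shares_memory` to confirm which results alias the original; the variable names are illustrative.

```python
import numpy as np

a = np.ones(1_000_000)
original = a  # a second name for the same buffer

b = a * 2.0                 # out-of-place: allocates a new 8 MB array
np.multiply(a, 2.0, out=a)  # in-place: writes the result into a's buffer
a += 1.0                    # augmented assignment is also in-place for ndarrays

print("a still aliases the original buffer:", np.shares_memory(a, original))
print("b aliases the original buffer:      ", np.shares_memory(b, original))
```

In-place updates save memory but overwrite the input, so they are only safe when no other code still needs the old values.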

### Memory-Efficient Data Structures
- **Sparse matrices**: Store only non-zero elements
- **Compressed representations**: Use bit packing, run-length encoding
- **Lazy evaluation**: Compute values on-demand
- **Streaming algorithms**: Process data in chunks
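
As an illustration of the streaming idea, the sketch below computes a mean over data that arrives in chunks, keeping only a running count and total in memory. The generator `chunked` is a hypothetical stand-in for reading batches from disk or a network source.

```python
import numpy as np


def chunked(n_total, chunk_size, seed=0):
    """Yield random chunks, standing in for batches read from disk."""
    rng = np.random.default_rng(seed)
    remaining = n_total
    while remaining > 0:
        size = min(chunk_size, remaining)
        yield rng.standard_normal(size)
        remaining -= size


def streaming_mean(chunks):
    """Mean over an iterable of arrays using O(chunk) memory."""
    count = 0
    total = 0.0
    for chunk in chunks:
        count += chunk.size
        total += float(chunk.sum())
    return total / count if count else float("nan")


print(streaming_mean(chunked(1_000_000, 10_000)))
```

Only one chunk is alive at a time, so peak memory depends on `chunk_size` rather than on the total number of elements.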

### Memory Profiling Tools
```python
import tracemalloc
import psutil
import torch

# PyTorch GPU memory tracking (both calls return bytes)
torch.cuda.memory_allocated()  # Memory currently occupied by tensors
torch.cuda.max_memory_allocated()  # Peak tensor memory since start or last reset

# Python memory profiling
tracemalloc.start()
# ... your code ...
current, peak = tracemalloc.get_traced_memory()
```
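
To show `tracemalloc` in use, the sketch below wraps it in a small helper and compares the peak allocation of the in-place and copying list reversals from earlier in this section. The helper `peak_bytes` is hypothetical, and the reversal functions are repeated so the block is self-contained.

```python
import tracemalloc


def peak_bytes(func, *args):
    """Run func(*args) and return the peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def reverse_inplace(arr):
    left, right = 0, len(arr) - 1
    while left < right:
        arr[left], arr[right] = arr[right], arr[left]
        left += 1
        right -= 1


def reverse_copy(arr):
    return arr[::-1]


data = list(range(200_000))  # allocated before tracing starts, so not counted
copy_peak = peak_bytes(reverse_copy, data)
inplace_peak = peak_bytes(reverse_inplace, data)

print(f"reverse_copy peak:    {copy_peak / 1e6:.2f} MB")
print(f"reverse_inplace peak: {inplace_peak / 1e6:.4f} MB")
```

Note that `tracemalloc` only sees allocations made through Python's memory allocator while tracing is active; it does not report GPU memory.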


## 1.5 PyTorch/Tensor Operations

### Tensor Memory Layout
PyTorch tensors are stored in row-major (C-contiguous) order by default:
```python
import torch

x = torch.randn(3, 4)
print(x.is_contiguous())  # True
print(x.stride())  # (4, 1) - elements in same row are adjacent
```
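
Views such as a transpose reuse the same storage with different strides, which makes them non-contiguous. A short sketch of how this shows up, and how `.contiguous()` produces a fresh row-major copy:

```python
import torch

x = torch.arange(12).reshape(3, 4)
xt = x.t()  # transpose: a view with swapped strides, no data is copied

print(xt.shape, xt.stride(), xt.is_contiguous())
print("shares storage with x:", xt.data_ptr() == x.data_ptr())

xc = xt.contiguous()  # materializes a new row-major tensor
print(xc.stride(), xc.is_contiguous())
print("shares storage with x:", xc.data_ptr() == x.data_ptr())
```

Some operations, `view` among them, require a compatible memory layout, so a non-contiguous tensor may need `.contiguous()` first, at the cost of one extra copy.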

### Memory-Efficient Tensor Operations
```python
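import torch

x = torch.rand(1000, 1000)
y = torch.rand(1000, 1000)

# Out-of-place: (x + y) allocates a temporary tensor the size of x
result = (x + y).sum().sqrt()

# The functional spelling is equivalent and allocates the same temporary
result = torch.sqrt(torch.sum(torch.add(x, y)))

# In-place: add_ writes into x's storage, so no temporary is allocated.
# Caveat: this overwrites x, and autograd may reject in-place updates
# to tensors it needs for the backward pass.
result = x.add_(y).sum().sqrt_()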
```

### Broadcasting and Memory
Broadcasting does not copy its inputs, but the result is materialized at the full broadcast shape, which can be much larger than either operand:
```python
a = torch.randn(1000, 1)
b = torch.randn(1, 1000)
c = a + b  # Result has shape (1000, 1000): 1,000,000 elements from 2,000 inputs
```
```
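
The difference between a broadcast view and a materialized result can be seen with `expand`, which returns a view with stride 0 along the expanded dimension. The sketch below also shows one case where the large result can be avoided entirely, by rearranging a reduction algebraically; this rearrangement is specific to summation and is offered as an example, not a general recipe.

```python
import torch

a = torch.randn(1000, 1)
b = torch.randn(1, 1000)

# expand() creates a broadcast view: no new memory, stride 0 on the new axis.
a_view = a.expand(1000, 1000)
print(a_view.stride(), a_view.data_ptr() == a.data_ptr())

# The arithmetic result, by contrast, is a real (1000, 1000) tensor.
c = a + b
bytes_c = c.numel() * c.element_size()
print(f"c occupies {bytes_c / 1e6:.1f} MB")

# If only the total is needed, the big tensor is unnecessary:
# sum over i, j of (a_i + b_j) = n_cols * sum(a) + n_rows * sum(b)
total_direct = c.sum()
total_cheap = b.shape[1] * a.sum() + a.shape[0] * b.sum()
print(total_direct.item(), total_cheap.item())
```

The two totals agree up to floating-point rounding, but the second never allocates the million-element intermediate.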


---

## Practice Questions

Now let's apply these concepts with 5 progressive exercises. Each question builds on the previous concepts and increases in difficulty.


### Question 1: Memory-Efficient Prefix Sum (Easy)

**Problem**: Implement a memory-efficient prefix sum algorithm that computes the cumulative sum of an array in-place.

**Requirements**:
- Modify the input array in-place (O(1) extra space)
- Time complexity: O(n)
- Handle edge cases (empty array, single element)

**Example**:
```python
arr = [1, 2, 3, 4, 5]
prefix_sum_inplace(arr)
print(arr)  # [1, 3, 6, 10, 15]
```

**Starter Code**:


def prefix_sum_inplace(arr):
    """
    Compute prefix sum in-place.
    
    Args:
        arr: List of numbers to modify in-place
        
    Returns:
        None (modifies input array)
    """
    # TODO: Implement in-place prefix sum
    pass

# Comprehensive Test Suite
def test_prefix_sum():
    """Comprehensive tests for prefix sum implementation."""
    print("Running comprehensive prefix sum tests...")
    
    # Test case 1: Normal array
    arr1 = [1, 2, 3, 4, 5]
    original1 = arr1.copy()
    prefix_sum_inplace(arr1)
    expected1 = [1, 3, 6, 10, 15]
    assert arr1 == expected1, f"Test 1 failed: Expected {expected1}, got {arr1}"
    print("✓ Test 1: Normal array passed")
    
    # Test case 2: Single element
    arr2 = [42]
    original2 = arr2.copy()
    prefix_sum_inplace(arr2)
    expected2 = [42]
    assert arr2 == expected2, f"Test 2 failed: Expected {expected2}, got {arr2}"
    print("✓ Test 2: Single element passed")
    
    # Test case 3: Empty array
    arr3 = []
    original3 = arr3.copy()
    prefix_sum_inplace(arr3)
    expected3 = []
    assert arr3 == expected3, f"Test 3 failed: Expected {expected3}, got {arr3}"
    print("✓ Test 3: Empty array passed")
    
    # Test case 4: Negative numbers
    arr4 = [-1, 2, -3, 4]
    original4 = arr4.copy()
    prefix_sum_inplace(arr4)
    expected4 = [-1, 1, -2, 2]
    assert arr4 == expected4, f"Test 4 failed: Expected {expected4}, got {arr4}"
    print("✓ Test 4: Negative numbers passed")
    
    # Test case 5: All zeros
    arr5 = [0, 0, 0, 0]
    original5 = arr5.copy()
    prefix_sum_inplace(arr5)
    expected5 = [0, 0, 0, 0]
    assert arr5 == expected5, f"Test 5 failed: Expected {expected5}, got {arr5}"
    print("✓ Test 5: All zeros passed")
    
    # Test case 6: Large array
    arr6 = list(range(1, 1001))  # [1, 2, 3, ..., 1000]
    original6 = arr6.copy()
    prefix_sum_inplace(arr6)
    # Verify first few and last few elements
    assert arr6[0] == 1, f"Test 6 failed: First element should be 1, got {arr6[0]}"
    assert arr6[1] == 3, f"Test 6 failed: Second element should be 3, got {arr6[1]}"
    assert arr6[2] == 6, f"Test 6 failed: Third element should be 6, got {arr6[2]}"
    assert arr6[-1] == sum(range(1, 1001)), f"Test 6 failed: Last element incorrect"
    print("✓ Test 6: Large array passed")
    
    # Test case 7: Mixed positive and negative
    arr7 = [5, -3, 2, -1, 4]
    original7 = arr7.copy()
    prefix_sum_inplace(arr7)
    expected7 = [5, 2, 4, 3, 7]
    assert arr7 == expected7, f"Test 7 failed: Expected {expected7}, got {arr7}"
    print("✓ Test 7: Mixed positive/negative passed")
    
    # Test case 8: Single negative number
    arr8 = [-10]
    original8 = arr8.copy()
    prefix_sum_inplace(arr8)
    expected8 = [-10]
    assert arr8 == expected8, f"Test 8 failed: Expected {expected8}, got {arr8}"
    print("✓ Test 8: Single negative passed")
    
    # Test case 9: Verify in-place modification
    test_arr = [1, 2, 3]
    original_ref = id(test_arr)
    prefix_sum_inplace(test_arr)
    assert id(test_arr) == original_ref, "Test 9 failed: Array reference changed (not in-place)"
    print("✓ Test 9: In-place modification verified")
    
    # Test case 10: Edge case with very small numbers
    arr10 = [0.1, 0.2, 0.3]
    original10 = arr10.copy()
    prefix_sum_inplace(arr10)
    expected10 = [0.1, 0.3, 0.6]
    assert all(abs(a - b) < 1e-10 for a, b in zip(arr10, expected10)), f"Test 10 failed: Expected {expected10}, got {arr10}"
    print("✓ Test 10: Small numbers passed")
    
    print("\n🎉 All 10 prefix sum tests passed!")

def test_prefix_sum_performance():
    """Performance test for prefix sum."""
    import time
    
    print("\nRunning performance test...")
    
    # Test with large array
    sizes = [1000, 10000, 100000]
    for size in sizes:
        arr = list(range(1, size + 1))
        
        start_time = time.perf_counter()  # monotonic, high-resolution timer
        prefix_sum_inplace(arr)
        end_time = time.perf_counter()
        
        execution_time = end_time - start_time
        print(f"Size {size:,}: {execution_time:.6f} seconds")
        
        # Verify correctness
        expected_sum = sum(range(1, size + 1))
        assert arr[-1] == expected_sum, f"Performance test failed for size {size}"
    
    print("✓ Performance test passed!")

if __name__ == "__main__":
    test_prefix_sum()
    test_prefix_sum_performance()


### Question 2: Tensor Operation Benchmarking (Easy-Medium)

**Problem**: Compare the performance of Python loops vs NumPy vectorized operations for element-wise multiplication on large arrays.

**Requirements**:
- Implement both loop-based and vectorized versions
- Measure execution time for different array sizes
- Analyze memory usage patterns
- Create a performance comparison plot

**Example**:
```python
# Compare these approaches:
# 1. Python loop: for i in range(n): result[i] = a[i] * b[i]
# 2. NumPy vectorized: result = a * b
```

**Starter Code**:


import numpy as np
import time
import matplotlib.pyplot as plt
import tracemalloc

def elementwise_multiply_loop(a, b):
    """
    Element-wise multiplication using Python loops.
    
    Args:
        a, b: 1D numpy arrays of same length
        
    Returns:
        numpy array: element-wise product
    """
    # TODO: Implement using Python loops
    pass

def elementwise_multiply_vectorized(a, b):
    """
    Element-wise multiplication using NumPy vectorization.
    
    Args:
        a, b: 1D numpy arrays of same length
        
    Returns:
        numpy array: element-wise product
    """
    # TODO: Implement using NumPy vectorization
    pass

def benchmark_operations(sizes=[1000, 10000, 100000, 1000000]):
    """
    Benchmark both approaches across different array sizes.
    
    Args:
        sizes: List of array sizes to test
        
    Returns:
        dict: Results with timing and memory usage
    """
    # TODO: Implement benchmarking
    pass

def plot_performance(results):
    """
    Create performance comparison plots.
    
    Args:
        results: Dictionary with benchmark results
    """
    # TODO: Create plots showing:
    # 1. Execution time vs array size
    # 2. Memory usage vs array size
    # 3. Speedup ratio
    pass

# Test the implementations
if __name__ == "__main__":
    # Quick test
    a = np.random.randn(1000)
    b = np.random.randn(1000)
    
    # Test both implementations
    result_loop = elementwise_multiply_loop(a, b)
    result_vec = elementwise_multiply_vectorized(a, b)
    
    # Verify they produce the same result
    assert np.allclose(result_loop, result_vec), "Results don't match!"
    print("Basic test passed!")
    
    # Run full benchmark
    results = benchmark_operations()
    plot_performance(results)


### Question 3: Cache-Aware Matrix Operations (Medium)

**Problem**: Implement matrix multiplication with cache-aware optimization. Compare different memory access patterns and their impact on performance.

**Requirements**:
- Implement standard matrix multiplication
- Implement cache-optimized version (tiling/blocking)
- Measure performance difference
- Analyze cache miss rates

**Background**: Matrix multiplication C = A × B where A is m×k, B is k×n, C is m×n.
The standard algorithm has poor cache locality when matrices don't fit in cache.

**Starter Code**:


import numpy as np
import time
from numba import jit

def matrix_multiply_naive(A, B):
    """
    Standard matrix multiplication with poor cache locality.
    
    Args:
        A: numpy array of shape (m, k)
        B: numpy array of shape (k, n)
        
    Returns:
        numpy array of shape (m, n)
    """
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "Inner dimensions must match"
    
    # TODO: Implement standard matrix multiplication
    pass

def matrix_multiply_tiled(A, B, tile_size=64):
    """
    Cache-optimized matrix multiplication using tiling.
    
    Args:
        A: numpy array of shape (m, k)
        B: numpy array of shape (k, n)
        tile_size: Size of the tile for blocking
        
    Returns:
        numpy array of shape (m, n)
    """
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "Inner dimensions must match"
    
    # TODO: Implement tiled matrix multiplication
    pass

@jit(nopython=True)
def matrix_multiply_numba(A, B):
    """
    Numba-optimized matrix multiplication for comparison.
    
    Args:
        A: numpy array of shape (m, k)
        B: numpy array of shape (k, n)
        
    Returns:
        numpy array of shape (m, n)
    """
    # TODO: Implement with Numba JIT compilation
    pass

def benchmark_matrix_multiply(sizes=[64, 128, 256, 512, 1024]):
    """
    Benchmark different matrix multiplication approaches.
    
    Args:
        sizes: List of matrix sizes to test (square matrices)
        
    Returns:
        dict: Performance results
    """
    # TODO: Implement benchmarking
    pass

def verify_correctness():
    """Verify all implementations produce the same result."""
    # TODO: Test with small matrices to ensure correctness
    pass

if __name__ == "__main__":
    # Verify correctness first
    verify_correctness()
    print("Correctness tests passed!")
    
    # Run benchmarks
    results = benchmark_matrix_multiply()
    
    # Print results
    for size in results['sizes']:
        print(f"Size {size}x{size}:")
        print(f"  Naive: {results['naive_times'][size]:.4f}s")
        print(f"  Tiled: {results['tiled_times'][size]:.4f}s")
        print(f"  Numba: {results['numba_times'][size]:.4f}s")
        print()


### Question 4: Custom Sparse Tensor Operations (Medium-Hard)

**Problem**: Implement a memory-efficient sparse tensor class with basic operations like addition, multiplication, and reshaping.

**Requirements**:
- Use COO (Coordinate) format for sparse representation
- Implement tensor addition and element-wise multiplication
- Support reshaping operations
- Memory usage should be O(nnz) where nnz is number of non-zeros
- Compare with dense tensor operations

**Background**: Sparse tensors store only non-zero values, making them memory-efficient for data with many zeros.

**Starter Code**:


import numpy as np
from typing import List, Tuple, Union

class SparseTensor:
    """
    Memory-efficient sparse tensor using COO (Coordinate) format.
    
    Stores only non-zero values with their coordinates.
    """
    
    def __init__(self, shape: Tuple[int, ...], values: np.ndarray = None, 
                 indices: np.ndarray = None):
        """
        Initialize sparse tensor.
        
        Args:
            shape: Tensor dimensions
            values: Non-zero values (1D array)
            indices: Coordinates of non-zero values (2D array, shape: [nnz, ndim])
        """
        self.shape = shape
        self.ndim = len(shape)
        
        if values is None:
            self.values = np.array([], dtype=np.float32)
            self.indices = np.array([], dtype=np.int32).reshape(0, self.ndim)
        else:
            self.values = values.astype(np.float32)
            self.indices = indices.astype(np.int32)
            self._validate()
    
    def _validate(self):
        """Validate tensor data."""
        # TODO: Add validation checks
        pass
    
    def to_dense(self) -> np.ndarray:
        """Convert to dense tensor."""
        # TODO: Implement conversion to dense
        pass
    
    @classmethod
    def from_dense(cls, dense_tensor: np.ndarray) -> 'SparseTensor':
        """Create sparse tensor from dense tensor."""
        # TODO: Implement conversion from dense
        pass
    
    def add(self, other: 'SparseTensor') -> 'SparseTensor':
        """Add two sparse tensors."""
        # TODO: Implement sparse tensor addition
        pass
    
    def multiply_elementwise(self, other: 'SparseTensor') -> 'SparseTensor':
        """Element-wise multiplication of two sparse tensors."""
        # TODO: Implement element-wise multiplication
        pass
    
    def reshape(self, new_shape: Tuple[int, ...]) -> 'SparseTensor':
        """Reshape the tensor."""
        # TODO: Implement reshaping
        pass
    
    def memory_usage(self) -> int:
        """Calculate memory usage in bytes."""
        # TODO: Calculate actual memory usage
        pass
    
    def sparsity(self) -> float:
        """Calculate sparsity ratio (fraction of zeros)."""
        # TODO: Calculate sparsity
        pass

def create_sparse_tensor(shape: Tuple[int, ...], sparsity: float = 0.9) -> SparseTensor:
    """
    Create a random sparse tensor with given sparsity.
    
    Args:
        shape: Tensor dimensions
        sparsity: Fraction of elements that should be zero
        
    Returns:
        SparseTensor: Random sparse tensor
    """
    # TODO: Implement random sparse tensor creation
    pass

def benchmark_sparse_vs_dense(shape: Tuple[int, ...], sparsity: float = 0.9):
    """
    Benchmark sparse vs dense operations.
    
    Args:
        shape: Tensor dimensions
        sparsity: Sparsity level
        
    Returns:
        dict: Performance comparison
    """
    # TODO: Implement benchmarking
    pass

# Test cases
def test_sparse_tensor():
    """Test sparse tensor functionality."""
    # TODO: Add comprehensive tests
    pass

if __name__ == "__main__":
    test_sparse_tensor()
    print("All tests passed!")
    
    # Benchmark example
    results = benchmark_sparse_vs_dense((100, 100), sparsity=0.95)
    print(f"Sparse memory: {results['sparse_memory']} bytes")
    print(f"Dense memory: {results['dense_memory']} bytes")
    print(f"Memory savings: {results['memory_savings']:.1%}")


### Question 5: Memory Profiling and Optimization (Hard)

**Problem**: Implement a comprehensive memory profiler for PyTorch operations and optimize a memory-intensive neural network forward pass.

**Requirements**:
- Create a memory profiler that tracks GPU/CPU memory usage
- Profile a multi-layer neural network forward pass
- Identify memory bottlenecks and optimize them
- Implement gradient checkpointing to reduce memory usage
- Compare memory usage before and after optimization

**Background**: Large neural networks can consume massive amounts of memory. Understanding and optimizing memory usage is crucial for training large models.

**Starter Code**:


import torch
import torch.nn as nn
import torch.nn.functional as F
import time
import psutil
import gc
from contextlib import contextmanager
from typing import Dict, List, Optional

class MemoryProfiler:
    """
    Memory profiler for PyTorch operations.
    """
    
    def __init__(self, device: str = 'cuda' if torch.cuda.is_available() else 'cpu'):
        self.device = device
        self.memory_log = []
        self.peak_memory = 0
        
    def get_memory_usage(self) -> Dict[str, float]:
        """Get current memory usage in MB."""
        # TODO: Implement memory usage tracking
        pass
    
    def log_memory(self, operation_name: str):
        """Log memory usage for an operation."""
        # TODO: Implement memory logging
        pass
    
    def reset(self):
        """Reset profiler state."""
        # TODO: Reset profiler
        pass
    
    def get_peak_memory(self) -> float:
        """Get peak memory usage in MB."""
        # TODO: Return peak memory
        pass

class LargeNeuralNetwork(nn.Module):
    """
    Memory-intensive neural network for profiling.
    """
    
    def __init__(self, input_size: int = 1000, hidden_sizes: List[int] = [2000, 2000, 2000], 
                 output_size: int = 100, dropout: float = 0.1):
        super().__init__()
        # TODO: Implement large neural network
        pass
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass without optimization."""
        # TODO: Implement forward pass
        pass
    
    def forward_with_checkpointing(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass with gradient checkpointing."""
        # TODO: Implement checkpointed forward pass
        pass

def profile_network_memory(model: nn.Module, input_tensor: torch.Tensor, 
                          use_checkpointing: bool = False) -> Dict[str, float]:
    """
    Profile memory usage of network forward pass.
    
    Args:
        model: Neural network model
        input_tensor: Input tensor
        use_checkpointing: Whether to use gradient checkpointing
        
    Returns:
        dict: Memory usage statistics
    """
    # TODO: Implement memory profiling
    pass

def optimize_memory_usage(model: nn.Module, input_tensor: torch.Tensor) -> Dict[str, float]:
    """
    Apply memory optimizations and measure improvement.
    
    Args:
        model: Neural network model
        input_tensor: Input tensor
        
    Returns:
        dict: Optimization results
    """
    # TODO: Implement memory optimizations
    pass

def compare_memory_strategies():
    """Compare different memory optimization strategies."""
    # TODO: Implement comparison
    pass

# Test the implementation
if __name__ == "__main__":
    # Create model and test data
    model = LargeNeuralNetwork()
    input_tensor = torch.randn(32, 1000)  # batch_size=32
    
    print("Profiling memory usage...")
    
    # Profile without optimization
    results_baseline = profile_network_memory(model, input_tensor, use_checkpointing=False)
    print(f"Baseline memory usage: {results_baseline['peak_memory']:.2f} MB")
    
    # Profile with optimization
    results_optimized = profile_network_memory(model, input_tensor, use_checkpointing=True)
    print(f"Optimized memory usage: {results_optimized['peak_memory']:.2f} MB")
    
    # Calculate improvement
    improvement = (results_baseline['peak_memory'] - results_optimized['peak_memory']) / results_baseline['peak_memory']
    print(f"Memory reduction: {improvement:.1%}")


---

## 💡 Hints

<details>
<summary>Click to reveal hint for Question 1: Memory-Efficient Prefix Sum</summary>

**Hint**: For in-place prefix sum, you need to iterate through the array once and update each element to be the sum of all previous elements plus itself. Start from index 1 (since index 0 doesn't need to change) and for each position i, set `arr[i] = arr[i-1] + arr[i]`.

**Key insight**: Each element becomes the sum of all elements from the beginning up to and including itself.
</details>

<details>
<summary>Click to reveal hint for Question 2: Tensor Operation Benchmarking</summary>

**Hint**: For the loop version, use a simple for loop with indexing. For the vectorized version, use NumPy's built-in multiplication operator. Use `time.perf_counter()` to measure execution time (it is better suited to short intervals than `time.time()`) and `tracemalloc` to track memory usage. Create arrays of different sizes and measure both time and memory for each approach.

**Key insight**: Vectorized operations should be significantly faster due to SIMD instructions and reduced Python overhead.
</details>

<details>
<summary>Click to reveal hint for Question 3: Cache-Aware Matrix Operations</summary>

**Hint**: For tiling, divide the matrices into smaller blocks (tiles) and process them separately. This improves cache locality by keeping frequently accessed data in cache. The tile size should be chosen based on cache size (typically 64x64 or 128x128). For Numba, use the `@jit(nopython=True)` decorator to compile the function.

**Key insight**: Tiling reduces cache misses by working on blocks small enough to stay in cache for as long as they are being reused.
</details>

<details>
<summary>Click to reveal hint for Question 4: Custom Sparse Tensor Operations</summary>

**Hint**: In COO format, store non-zero values and their coordinates separately. For addition, merge the two coordinate lists and sum the values that share a coordinate. For element-wise multiplication, only coordinates present in both tensors can produce a non-zero result. For reshaping, flatten each multi-dimensional coordinate to a linear index under the old shape, then unravel it under the new shape (`np.ravel_multi_index` and `np.unravel_index` do this).

**Key insight**: Sparse operations should only work on non-zero elements, making them much more memory-efficient for sparse data.
</details>

<details>
<summary>Click to reveal hint for Question 5: Memory Profiling and Optimization</summary>

**Hint**: Use `torch.cuda.memory_allocated()` and `torch.cuda.max_memory_allocated()` for GPU memory tracking. For gradient checkpointing, use `torch.utils.checkpoint.checkpoint()` to trade compute for memory. Implement the network with multiple large layers and measure memory usage at each step.

**Key insight**: Gradient checkpointing saves memory by recomputing activations during backward pass instead of storing them during forward pass.
</details>
