# Chapter 1: Data Structures & Complexity Refresher

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jmamath/openai-interview-textbook/blob/main/chapter_01_data_structures_complexity.ipynb)

## Learning Objectives
By the end of this chapter, you will be able to:
- Analyze time and space complexity of algorithms
- Understand memory layout and cache effects in ML operations
- Compare vectorized operations vs loops for performance
- Implement memory-efficient algorithms for large-scale ML
- Profile and optimize memory usage in PyTorch/TensorFlow

## Prerequisites
- Basic understanding of Python and NumPy
- Familiarity with PyTorch tensors
- Knowledge of basic data structures (arrays, lists, dictionaries)


## 1.1 Complexity Analysis Refresher

### Big O Notation
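Big O notation gives an asymptotic upper bound on how an algorithm's time or space requirements grow with input size. It is most often quoted for the worst case, but it can equally describe average-case or amortized cost.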

**Common Complexities:**
- O(1) - Constant time (array indexing, average-case hash table lookup)
- O(log n) - Logarithmic (binary search, balanced tree operations)
- O(n) - Linear (single pass through array)
- O(n log n) - Linearithmic (efficient sorting algorithms)
- O(n²) - Quadratic (nested loops)
- O(2ⁿ) - Exponential (naive recursive Fibonacci)
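To make the gap between O(n) and O(log n) concrete, the sketch below counts the steps each search takes on a sorted list. The helper names `linear_search_steps` and `binary_search_steps` are illustrative and not part of the chapter's exercises.

```python
def linear_search_steps(sorted_items, target):
    """Return the number of comparisons a left-to-right scan makes."""
    steps = 0
    for item in sorted_items:
        steps += 1
        if item == target:
            break
    return steps


def binary_search_steps(sorted_items, target):
    """Return the number of loop iterations a binary search makes."""
    lo, hi = 0, len(sorted_items) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            break
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps


for n in (1_000, 1_000_000):
    data = list(range(n))
    target = n - 1  # last element: the worst case for the linear scan
    print(n, linear_search_steps(data, target), binary_search_steps(data, target))
```

Growing the input by a factor of 1,000 multiplies the linear count by 1,000 but adds only about ten iterations to the binary search.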

### Space Complexity
Space complexity measures the amount of memory an algorithm uses relative to input size:
- **Auxiliary space**: Extra space used by the algorithm
- **Total space**: Input space + auxiliary space

### Amortized Analysis
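Some operations are occasionally expensive but cheap when averaged over a sequence of calls:
- Dynamic array resizing: O(1) amortized append, because each O(n) resize is paid for by the many cheap appends before it
- Hash table with chaining: O(1) expected lookup, and O(1) amortized insertion when the table resizes as it grows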
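The amortized bound for dynamic arrays can be checked by simulation. The sketch below assumes a growth factor of 2 (real implementations, including CPython's `list`, use different factors) and counts how many element copies the resizes cause. `count_copies` is a hypothetical helper.

```python
def count_copies(num_appends):
    """Simulate a doubling dynamic array and count element copies caused by resizing."""
    capacity, size, copies = 1, 0, 0
    for _ in range(num_appends):
        if size == capacity:
            copies += size  # every existing element moves to the new buffer
            capacity *= 2
        size += 1
    return copies


for n in (10, 1_000, 1_000_000):
    copies = count_copies(n)
    print(f"{n:>9,} appends -> {copies:>9,} copies ({copies / n:.2f} per append)")
```

The total number of copies stays below 2n, so the cost per append is bounded by a constant even though individual resizes are O(n).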


## 1.2 Memory Layout and Cache Effects

### Memory Hierarchy
Modern computers have multiple levels of memory with different speeds and sizes. The figures below are rough, order-of-magnitude values that vary by processor:

1. **CPU Registers** (fastest, smallest)
2. **L1 Cache** (~32KB, ~1 cycle)
3. **L2 Cache** (~256KB, ~10 cycles)
4. **L3 Cache** (~8MB, ~40 cycles)
5. **RAM** (~16GB, ~200 cycles)
6. **Storage** (SSD/HDD, ~100,000+ cycles)

### Cache Locality
- **Spatial Locality**: Accessing memory locations close to ones accessed recently
- **Temporal Locality**: Accessing the same memory location repeatedly within a short time

### Row-Major vs Column-Major Ordering
```python
# Row-major (C-style): elements in the same row are adjacent,
# so arr[i][j] and arr[i][j+1] sit next to each other in memory.

# Column-major (Fortran-style): elements in the same column are adjacent,
# so arr[i][j] and arr[i+1][j] sit next to each other in memory.
```
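The two layouts can be inspected directly in NumPy through an array's `strides`, which give the number of bytes to step when moving one index along each axis. This is a small illustration with an arbitrarily chosen 3×4 `float64` array.

```python
import numpy as np

c_order = np.zeros((3, 4), dtype=np.float64, order="C")  # row-major
f_order = np.zeros((3, 4), dtype=np.float64, order="F")  # column-major

# strides = bytes to step to reach the next index along (axis 0, axis 1)
print(c_order.strides)  # (32, 8): neighbours within a row are 8 bytes apart
print(f_order.strides)  # (8, 24): neighbours within a column are 8 bytes apart
```

The axis with the smallest stride is the one to iterate over in the innermost loop.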

### Memory Access Patterns in ML
- **Sequential access** is much faster than random access
- **Vectorized operations** leverage SIMD instructions
- **Memory bandwidth** often limits performance more than compute


## 1.3 Vectorization vs Loops

### Why Vectorization is Faster
1. **SIMD Instructions**: Single Instruction, Multiple Data
2. **Reduced Python overhead**: Fewer interpreter calls
3. **Better cache utilization**: Sequential memory access
4. **Optimized compiled implementations**: NumPy/PyTorch run their inner loops in compiled code and call optimized BLAS libraries for linear algebra

### Example: Element-wise Operations
```python
import numpy as np
import time

# Large array
arr = np.random.randn(1000000)

# Loop version (slow)
def loop_sum(arr):
    total = 0
    for x in arr:
        total += x
    return total

# Vectorized version (fast)
def vectorized_sum(arr):
    return np.sum(arr)

# Vectorized is typically one to two orders of magnitude faster for large arrays;
# the exact ratio depends on hardware, dtype, and array size
```
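The speedup is easy to measure rather than assume. The sketch below times both versions with `timeit`, taking the best of three runs to reduce noise. The array size of 200,000 is an arbitrary choice that keeps the loop version quick.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
arr = rng.standard_normal(200_000)


def loop_sum(values):
    total = 0.0
    for x in values:
        total += x
    return total


# Best of three single runs for each version
loop_time = min(timeit.repeat(lambda: loop_sum(arr), number=1, repeat=3))
vec_time = min(timeit.repeat(lambda: np.sum(arr), number=1, repeat=3))

print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.6f}s  "
      f"ratio: {loop_time / vec_time:.0f}x")
```

The two results can differ in the last few digits because `np.sum` adds the elements in a different order than the sequential loop.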

### When to Use Each Approach
- **Use loops for**: Complex logic, small arrays, irregular patterns
- **Use vectorization for**: Mathematical operations, large arrays, regular patterns
- **Hybrid approach**: Keep a readable Python loop over batches or chunks and vectorize the work inside each iteration
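One way to combine the two styles is a Python loop over chunks with vectorized work inside each iteration. The function name `chunked_sum_of_squares` and the chunk size are illustrative choices, not tuned values.

```python
import numpy as np


def chunked_sum_of_squares(values, chunk_size=65_536):
    """Sum of squares computed chunk by chunk to bound temporary memory."""
    total = 0.0
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]  # basic slicing returns a view, not a copy
        total += float(np.dot(chunk, chunk))      # vectorized work per chunk
    return total


values = np.arange(1_000_000, dtype=np.float64)
print(chunked_sum_of_squares(values))
```

The loop runs only a handful of times, so interpreter overhead is negligible, and no temporary the size of the full array is ever allocated.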


## 1.4 Memory-Efficient Algorithms

### In-Place Operations
Modify data structures without creating copies:
```python
# In-place: O(1) extra space
def reverse_inplace(arr):
    left, right = 0, len(arr) - 1
    while left < right:
        arr[left], arr[right] = arr[right], arr[left]
        left += 1
        right -= 1

# Not in-place: O(n) extra space
def reverse_copy(arr):
    return arr[::-1]
```
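The same distinction applies to NumPy arrays. As a small illustration, the `out=` argument of a ufunc and augmented assignment both write into an existing buffer, while `a + b` allocates a new one. The buffer address is read here through `__array_interface__`.

```python
import numpy as np

a = np.ones(1_000_000)
b = np.full(1_000_000, 2.0)
address_before = a.__array_interface__["data"][0]

out_of_place = a + b     # allocates a new array for the result
np.add(a, b, out=a)      # writes the result into a's existing buffer
a *= 2                   # augmented assignment is also in-place for arrays

address_after = a.__array_interface__["data"][0]
print(address_before == address_after)    # True: a still uses the same buffer
print(np.shares_memory(out_of_place, a))  # False: a + b produced a separate array
```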

### Memory-Efficient Data Structures
- **Sparse matrices**: Store only non-zero elements
- **Compressed representations**: Use bit packing, run-length encoding
- **Lazy evaluation**: Compute values on-demand
- **Streaming algorithms**: Process data in chunks
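Lazy evaluation and streaming often go together. The sketch below computes a mean over values produced one at a time by a generator, so memory use stays constant regardless of stream length. `read_values` is a hypothetical stand-in for a real data source such as a file reader.

```python
def read_values(n):
    """Hypothetical data source: yields values lazily instead of building a list."""
    for i in range(n):
        yield float(i)


def streaming_mean(stream):
    """Running mean with O(1) memory, updated one value at a time."""
    count, mean = 0, 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count  # incremental update, no stored history
    return mean


print(streaming_mean(read_values(1_000_000)))
```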

### Memory Profiling Tools
```python
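import tracemalloc

import psutil
import torch

# PyTorch GPU memory tracking (only meaningful when a CUDA device is present)
if torch.cuda.is_available():
    print(torch.cuda.memory_allocated())      # Bytes currently occupied by tensors
    print(torch.cuda.max_memory_allocated())  # Peak bytes since program start or last reset

# Python heap profiling
tracemalloc.start()
# ... your code ...
current, peak = tracemalloc.get_traced_memory()  # Both values are in bytes
tracemalloc.stop()

# Resident memory of the whole process as reported by the OS
rss_bytes = psutil.Process().memory_info().rss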
```
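As a usage sketch, `tracemalloc` can confirm the space-complexity claims made for the two list-reversal strategies earlier in this section. `peak_bytes` is a hypothetical helper, and the in-place variant uses `list.reverse()` for brevity.

```python
import tracemalloc


def peak_bytes(func, *args):
    """Run func(*args) and return the peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def reverse_copy(items):
    return items[::-1]  # allocates a second list of the same length


def reverse_inplace(items):
    items.reverse()     # swaps elements inside the existing list


data = list(range(100_000))
copy_peak = peak_bytes(reverse_copy, data)
inplace_peak = peak_bytes(reverse_inplace, data)
print(f"copy: {copy_peak:,} bytes  in-place: {inplace_peak:,} bytes")
```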


## 1.5 PyTorch/Tensor Operations

### Tensor Memory Layout
PyTorch tensors are stored in row-major (C-contiguous) order by default:
```python
import torch

x = torch.randn(3, 4)
print(x.is_contiguous())  # True
print(x.stride())  # (4, 1) - elements in same row are adjacent
```

### Memory-Efficient Tensor Operations
```python
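import torch

x = torch.rand(1000)
y = torch.rand(1000)

# Out-of-place: x + y allocates a temporary tensor the size of x
result = (x + y).sum().sqrt()

# In-place: add_ writes the sum into x's storage, so no temporary is allocated.
# Caution: this overwrites x, and in-place ops on tensors that autograd
# still needs can raise an error during the backward pass.
result = x.add_(y).sum().sqrt_()

# The functional spelling is equivalent to the first version
# and allocates the same temporary
result = torch.sqrt(torch.sum(torch.add(x, y)))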
```

### Broadcasting and Memory
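Broadcasting does not copy or expand its inputs in memory, but the result is allocated at the full broadcast shape, which can be far larger than either input: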
```python
# a and b hold 1,000 elements each; neither is expanded in memory
a = torch.randn(1000, 1)
b = torch.randn(1, 1000)
c = a + b  # The result is materialized as a full (1000, 1000) tensor
```
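The mechanism can be inspected in NumPy, whose broadcasting semantics PyTorch follows. This is a NumPy analogue rather than PyTorch code: `np.broadcast_to` exposes the zero-stride view that broadcasting uses internally.

```python
import numpy as np

a = np.zeros((1000, 1), dtype=np.float32)
b = np.zeros((1, 1000), dtype=np.float32)

# broadcast_to returns a read-only view: stride 0 along the repeated axis, no copy
a_view = np.broadcast_to(a, (1000, 1000))
print(a_view.strides)       # (4, 0)
print(a.nbytes, b.nbytes)   # 4000 4000

c = a + b                   # only the result is materialized in full
print(c.shape, c.nbytes)    # (1000, 1000) 4000000
```

The inputs stay at 4 KB each, while the result occupies 4 MB.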


---

## Practice Questions

Now let's apply these concepts with 5 progressive exercises. Each question builds on the previous concepts and increases in difficulty.


### Question 1: Memory-Efficient Prefix Sum (Easy)

**Problem**: Implement a memory-efficient prefix sum algorithm that computes the cumulative sum of an array in-place.

**Requirements**:
- Modify the input array in-place (O(1) extra space)
- Time complexity: O(n)
- Handle edge cases (empty array, single element)

**Example**:
```python
arr = [1, 2, 3, 4, 5]
prefix_sum_inplace(arr)
print(arr)  # [1, 3, 6, 10, 15]
```

**Starter Code**:


In [None]:
def prefix_sum_inplace(arr):
    """
    Compute prefix sum in-place.
    
    Args:
        arr: List of numbers to modify in-place
        
    Returns:
        None (modifies input array)
    """
    # TODO: Implement in-place prefix sum
    pass

# Comprehensive Test Suite
def test_prefix_sum():
    """Comprehensive tests for prefix sum implementation."""
    print("Running comprehensive prefix sum tests...")
    
    # Test case 1: Normal array
    arr1 = [1, 2, 3, 4, 5]
    original1 = arr1.copy()
    prefix_sum_inplace(arr1)
    expected1 = [1, 3, 6, 10, 15]
    assert arr1 == expected1, f"Test 1 failed: Expected {expected1}, got {arr1}"
    print("✓ Test 1: Normal array passed")
    
    # Test case 2: Single element
    arr2 = [42]
    original2 = arr2.copy()
    prefix_sum_inplace(arr2)
    expected2 = [42]
    assert arr2 == expected2, f"Test 2 failed: Expected {expected2}, got {arr2}"
    print("✓ Test 2: Single element passed")
    
    # Test case 3: Empty array
    arr3 = []
    original3 = arr3.copy()
    prefix_sum_inplace(arr3)
    expected3 = []
    assert arr3 == expected3, f"Test 3 failed: Expected {expected3}, got {arr3}"
    print("✓ Test 3: Empty array passed")
    
    # Test case 4: Negative numbers
    arr4 = [-1, 2, -3, 4]
    original4 = arr4.copy()
    prefix_sum_inplace(arr4)
    expected4 = [-1, 1, -2, 2]
    assert arr4 == expected4, f"Test 4 failed: Expected {expected4}, got {arr4}"
    print("✓ Test 4: Negative numbers passed")
    
    # Test case 5: All zeros
    arr5 = [0, 0, 0, 0]
    original5 = arr5.copy()
    prefix_sum_inplace(arr5)
    expected5 = [0, 0, 0, 0]
    assert arr5 == expected5, f"Test 5 failed: Expected {expected5}, got {arr5}"
    print("✓ Test 5: All zeros passed")
    
    # Test case 6: Large array
    arr6 = list(range(1, 1001))  # [1, 2, 3, ..., 1000]
    original6 = arr6.copy()
    prefix_sum_inplace(arr6)
    # Verify first few and last few elements
    assert arr6[0] == 1, f"Test 6 failed: First element should be 1, got {arr6[0]}"
    assert arr6[1] == 3, f"Test 6 failed: Second element should be 3, got {arr6[1]}"
    assert arr6[2] == 6, f"Test 6 failed: Third element should be 6, got {arr6[2]}"
    assert arr6[-1] == sum(range(1, 1001)), f"Test 6 failed: Last element incorrect"
    print("✓ Test 6: Large array passed")
    
    # Test case 7: Mixed positive and negative
    arr7 = [5, -3, 2, -1, 4]
    original7 = arr7.copy()
    prefix_sum_inplace(arr7)
    expected7 = [5, 2, 4, 3, 7]
    assert arr7 == expected7, f"Test 7 failed: Expected {expected7}, got {arr7}"
    print("✓ Test 7: Mixed positive/negative passed")
    
    # Test case 8: Single negative number
    arr8 = [-10]
    original8 = arr8.copy()
    prefix_sum_inplace(arr8)
    expected8 = [-10]
    assert arr8 == expected8, f"Test 8 failed: Expected {expected8}, got {arr8}"
    print("✓ Test 8: Single negative passed")
    
    # Test case 9: Verify in-place modification
    test_arr = [1, 2, 3]
    original_ref = id(test_arr)
    prefix_sum_inplace(test_arr)
    assert id(test_arr) == original_ref, "Test 9 failed: Array reference changed (not in-place)"
    print("✓ Test 9: In-place modification verified")
    
    # Test case 10: Edge case with very small numbers
    arr10 = [0.1, 0.2, 0.3]
    original10 = arr10.copy()
    prefix_sum_inplace(arr10)
    expected10 = [0.1, 0.3, 0.6]
    assert all(abs(a - b) < 1e-10 for a, b in zip(arr10, expected10)), f"Test 10 failed: Expected {expected10}, got {arr10}"
    print("✓ Test 10: Small numbers passed")
    
    print("\n🎉 All 10 prefix sum tests passed!")

def test_prefix_sum_performance():
    """Performance test for prefix sum."""
    import time
    
    print("\nRunning performance test...")
    
    # Test with large array
    sizes = [1000, 10000, 100000]
    for size in sizes:
        arr = list(range(1, size + 1))
        
        start_time = time.perf_counter()  # Monotonic, high-resolution timer
        prefix_sum_inplace(arr)
        end_time = time.perf_counter()
        
        execution_time = end_time - start_time
        print(f"Size {size:,}: {execution_time:.6f} seconds")
        
        # Verify correctness
        expected_sum = sum(range(1, size + 1))
        assert arr[-1] == expected_sum, f"Performance test failed for size {size}"
    
    print("✓ Performance test passed!")

if __name__ == "__main__":
    test_prefix_sum()
    test_prefix_sum_performance()


### Question 2: Tensor Operation Benchmarking (Easy-Medium)

**Problem**: Compare the performance of Python loops vs NumPy vectorized operations for element-wise multiplication on large tensors.

**Requirements**:
- Implement both loop-based and vectorized versions
- Measure execution time for different array sizes
- Analyze memory usage patterns
- Create a performance comparison plot

**Example**:
```python
# Compare these approaches:
# 1. Python loop: for i in range(n): result[i] = a[i] * b[i]
# 2. NumPy vectorized: result = a * b
```

**Starter Code**:


In [None]:
import numpy as np
import time
import matplotlib.pyplot as plt
import tracemalloc

def elementwise_multiply_loop(a, b):
    """
    Element-wise multiplication using Python loops.
    
    Args:
        a, b: 1D numpy arrays of same length
        
    Returns:
        numpy array: element-wise product
    """
    # TODO: Implement using Python loops
    pass

def elementwise_multiply_vectorized(a, b):
    """
    Element-wise multiplication using NumPy vectorization.
    
    Args:
        a, b: 1D numpy arrays of same length
        
    Returns:
        numpy array: element-wise product
    """
    # TODO: Implement using NumPy vectorization
    pass

def benchmark_operations(sizes=(1000, 10000, 100000, 1000000)):
    """
    Benchmark both approaches across different array sizes.
    
    Args:
        sizes: List of array sizes to test
        
    Returns:
        dict: Results with timing and memory usage
    """
    # TODO: Implement benchmarking
    pass

def plot_performance(results):
    """
    Create performance comparison plots.
    
    Args:
        results: Dictionary with benchmark results
    """
    # TODO: Create plots showing:
    # 1. Execution time vs array size
    # 2. Memory usage vs array size
    # 3. Speedup ratio
    pass

# Comprehensive Test Suite
def test_elementwise_operations():
    """Comprehensive tests for element-wise operations."""
    print("Running comprehensive element-wise operation tests...")
    
    # Test case 1: Basic functionality
    a1 = np.array([1, 2, 3, 4, 5])
    b1 = np.array([2, 3, 4, 5, 6])
    
    result_loop1 = elementwise_multiply_loop(a1, b1)
    result_vec1 = elementwise_multiply_vectorized(a1, b1)
    expected1 = np.array([2, 6, 12, 20, 30])
    
    assert np.array_equal(result_loop1, expected1), f"Loop test 1 failed: Expected {expected1}, got {result_loop1}"
    assert np.array_equal(result_vec1, expected1), f"Vectorized test 1 failed: Expected {expected1}, got {result_vec1}"
    assert np.array_equal(result_loop1, result_vec1), "Loop and vectorized results don't match!"
    print("✓ Test 1: Basic functionality passed")
    
    # Test case 2: Negative numbers
    a2 = np.array([-1, -2, -3])
    b2 = np.array([2, -2, 3])
    
    result_loop2 = elementwise_multiply_loop(a2, b2)
    result_vec2 = elementwise_multiply_vectorized(a2, b2)
    expected2 = np.array([-2, 4, -9])
    
    assert np.array_equal(result_loop2, expected2), f"Loop test 2 failed: Expected {expected2}, got {result_loop2}"
    assert np.array_equal(result_vec2, expected2), f"Vectorized test 2 failed: Expected {expected2}, got {result_vec2}"
    print("✓ Test 2: Negative numbers passed")
    
    # Test case 3: Single element
    a3 = np.array([42])
    b3 = np.array([2])
    
    result_loop3 = elementwise_multiply_loop(a3, b3)
    result_vec3 = elementwise_multiply_vectorized(a3, b3)
    expected3 = np.array([84])
    
    assert np.array_equal(result_loop3, expected3), f"Loop test 3 failed: Expected {expected3}, got {result_loop3}"
    assert np.array_equal(result_vec3, expected3), f"Vectorized test 3 failed: Expected {expected3}, got {result_vec3}"
    print("✓ Test 3: Single element passed")
    
    # Test case 4: Empty arrays
    a4 = np.array([])
    b4 = np.array([])
    
    result_loop4 = elementwise_multiply_loop(a4, b4)
    result_vec4 = elementwise_multiply_vectorized(a4, b4)
    expected4 = np.array([])
    
    assert np.array_equal(result_loop4, expected4), f"Loop test 4 failed: Expected {expected4}, got {result_loop4}"
    assert np.array_equal(result_vec4, expected4), f"Vectorized test 4 failed: Expected {expected4}, got {result_vec4}"
    print("✓ Test 4: Empty arrays passed")
    
    # Test case 5: Floating point precision
    a5 = np.array([0.1, 0.2, 0.3])
    b5 = np.array([2.0, 3.0, 4.0])
    
    result_loop5 = elementwise_multiply_loop(a5, b5)
    result_vec5 = elementwise_multiply_vectorized(a5, b5)
    expected5 = np.array([0.2, 0.6, 1.2])
    
    assert np.allclose(result_loop5, expected5, rtol=1e-10), f"Loop test 5 failed: Expected {expected5}, got {result_loop5}"
    assert np.allclose(result_vec5, expected5, rtol=1e-10), f"Vectorized test 5 failed: Expected {expected5}, got {result_vec5}"
    print("✓ Test 5: Floating point precision passed")
    
    # Test case 6: Large arrays
    size = 10000
    a6 = np.random.randn(size)
    b6 = np.random.randn(size)
    
    result_loop6 = elementwise_multiply_loop(a6, b6)
    result_vec6 = elementwise_multiply_vectorized(a6, b6)
    
    assert np.allclose(result_loop6, result_vec6, rtol=1e-10), "Large array test failed: Results don't match!"
    print("✓ Test 6: Large arrays passed")
    
    # Test case 7: Different shapes (should raise error)
    try:
        a7 = np.array([1, 2, 3])
        b7 = np.array([1, 2])  # Different length
        elementwise_multiply_loop(a7, b7)
        assert False, "Should have raised an error for different shapes"
    except (ValueError, IndexError):
        print("✓ Test 7: Shape validation passed")
    
    print("\n🎉 All 7 element-wise operation tests passed!")

def test_benchmarking():
    """Test the benchmarking functionality."""
    print("\nRunning benchmarking tests...")
    
    # Test with small sizes first
    results = benchmark_operations(sizes=[100, 1000])
    
    # Verify results structure
    assert 'sizes' in results, "Results should contain 'sizes'"
    assert 'loop_times' in results, "Results should contain 'loop_times'"
    assert 'vectorized_times' in results, "Results should contain 'vectorized_times'"
    assert 'memory_loop' in results, "Results should contain 'memory_loop'"
    assert 'memory_vectorized' in results, "Results should contain 'memory_vectorized'"
    
    # Verify vectorized is faster
    for size in results['sizes']:
        loop_time = results['loop_times'][size]
        vec_time = results['vectorized_times'][size]
        assert vec_time < loop_time, f"Vectorized should be faster for size {size}"
    
    print("✓ Benchmarking tests passed!")

def test_performance_analysis():
    """Test performance analysis and plotting."""
    print("\nRunning performance analysis tests...")
    
    # Test with small dataset
    results = benchmark_operations(sizes=[100, 500, 1000])
    
    # Test plotting (should not crash)
    try:
        plot_performance(results)
        print("✓ Plotting functionality works")
    except Exception as e:
        print(f"⚠️  Plotting test failed: {e}")
    
    # Verify performance trends
    sizes = results['sizes']
    loop_times = [results['loop_times'][s] for s in sizes]
    vec_times = [results['vectorized_times'][s] for s in sizes]
    
    # Vectorized should be consistently faster
    for i in range(len(sizes)):
        assert vec_times[i] < loop_times[i], f"Vectorized not faster for size {sizes[i]}"
    
    print("✓ Performance analysis tests passed!")

# Test the implementations
if __name__ == "__main__":
    test_elementwise_operations()
    test_benchmarking()
    test_performance_analysis()
    
    print("\n" + "="*50)
    print("Running full benchmark with visualization...")
    results = benchmark_operations()
    plot_performance(results)


### Question 3: Cache-Aware Matrix Operations (Medium)

**Problem**: Implement matrix multiplication with cache-aware optimization. Compare different memory access patterns and their impact on performance.

**Requirements**:
- Implement standard matrix multiplication
- Implement cache-optimized version (tiling/blocking)
- Measure performance difference
- Analyze cache miss rates

**Background**: Matrix multiplication C = A × B where A is m×k, B is k×n, C is m×n.
The standard algorithm has poor cache locality when matrices don't fit in cache.

**Starter Code**:


In [None]:
import numpy as np
import time
from numba import jit

def matrix_multiply_naive(A, B):
    """
    Standard matrix multiplication with poor cache locality.
    
    Args:
        A: numpy array of shape (m, k)
        B: numpy array of shape (k, n)
        
    Returns:
        numpy array of shape (m, n)
    """
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "Inner dimensions must match"
    
    # TODO: Implement standard matrix multiplication
    pass

def matrix_multiply_tiled(A, B, tile_size=64):
    """
    Cache-optimized matrix multiplication using tiling.
    
    Args:
        A: numpy array of shape (m, k)
        B: numpy array of shape (k, n)
        tile_size: Size of the tile for blocking
        
    Returns:
        numpy array of shape (m, n)
    """
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "Inner dimensions must match"
    
    # TODO: Implement tiled matrix multiplication
    pass

@jit(nopython=True)
def matrix_multiply_numba(A, B):
    """
    Numba-optimized matrix multiplication for comparison.
    
    Args:
        A: numpy array of shape (m, k)
        B: numpy array of shape (k, n)
        
    Returns:
        numpy array of shape (m, n)
    """
    # TODO: Implement with Numba JIT compilation
    pass

def benchmark_matrix_multiply(sizes=(64, 128, 256, 512, 1024)):
    """
    Benchmark different matrix multiplication approaches.
    
    Args:
        sizes: List of matrix sizes to test (square matrices)
        
    Returns:
        dict: Performance results
    """
    # TODO: Implement benchmarking
    pass

def verify_correctness():
    """Verify all implementations produce the same result."""
    # TODO: Test with small matrices to ensure correctness
    pass

# Comprehensive Test Suite
def test_matrix_multiply_correctness():
    """Comprehensive correctness tests for matrix multiplication."""
    print("Running comprehensive matrix multiplication correctness tests...")
    
    # Test case 1: Small matrices
    A1 = np.array([[1, 2], [3, 4]], dtype=np.float32)
    B1 = np.array([[5, 6], [7, 8]], dtype=np.float32)
    expected1 = np.array([[19, 22], [43, 50]], dtype=np.float32)
    
    result_naive1 = matrix_multiply_naive(A1, B1)
    result_tiled1 = matrix_multiply_tiled(A1, B1)
    result_numba1 = matrix_multiply_numba(A1, B1)
    
    assert np.allclose(result_naive1, expected1, rtol=1e-6), f"Naive test 1 failed: Expected {expected1}, got {result_naive1}"
    assert np.allclose(result_tiled1, expected1, rtol=1e-6), f"Tiled test 1 failed: Expected {expected1}, got {result_tiled1}"
    assert np.allclose(result_numba1, expected1, rtol=1e-6), f"Numba test 1 failed: Expected {expected1}, got {result_numba1}"
    print("✓ Test 1: Small matrices passed")
    
    # Test case 2: Identity matrix
    I = np.eye(3, dtype=np.float32)
    A2 = np.random.randn(3, 3).astype(np.float32)
    
    result_naive2 = matrix_multiply_naive(A2, I)
    result_tiled2 = matrix_multiply_tiled(A2, I)
    result_numba2 = matrix_multiply_numba(A2, I)
    
    assert np.allclose(result_naive2, A2, rtol=1e-6), "Naive identity test failed"
    assert np.allclose(result_tiled2, A2, rtol=1e-6), "Tiled identity test failed"
    assert np.allclose(result_numba2, A2, rtol=1e-6), "Numba identity test failed"
    print("✓ Test 2: Identity matrix passed")
    
    # Test case 3: Zero matrix
    A3 = np.zeros((2, 3), dtype=np.float32)
    B3 = np.random.randn(3, 4).astype(np.float32)
    expected3 = np.zeros((2, 4), dtype=np.float32)
    
    result_naive3 = matrix_multiply_naive(A3, B3)
    result_tiled3 = matrix_multiply_tiled(A3, B3)
    result_numba3 = matrix_multiply_numba(A3, B3)
    
    assert np.allclose(result_naive3, expected3, rtol=1e-6), "Naive zero test failed"
    assert np.allclose(result_tiled3, expected3, rtol=1e-6), "Tiled zero test failed"
    assert np.allclose(result_numba3, expected3, rtol=1e-6), "Numba zero test failed"
    print("✓ Test 3: Zero matrix passed")
    
    # Test case 4: Non-square matrices
    A4 = np.random.randn(2, 3).astype(np.float32)
    B4 = np.random.randn(3, 4).astype(np.float32)
    
    result_naive4 = matrix_multiply_naive(A4, B4)
    result_tiled4 = matrix_multiply_tiled(A4, B4)
    result_numba4 = matrix_multiply_numba(A4, B4)
    
    # All should produce same result
    assert np.allclose(result_naive4, result_tiled4, rtol=1e-6), "Non-square naive vs tiled failed"
    assert np.allclose(result_naive4, result_numba4, rtol=1e-6), "Non-square naive vs numba failed"
    assert result_naive4.shape == (2, 4), f"Wrong output shape: {result_naive4.shape}"
    print("✓ Test 4: Non-square matrices passed")
    
    # Test case 5: Single element
    A5 = np.array([[5]], dtype=np.float32)
    B5 = np.array([[3]], dtype=np.float32)
    expected5 = np.array([[15]], dtype=np.float32)
    
    result_naive5 = matrix_multiply_naive(A5, B5)
    result_tiled5 = matrix_multiply_tiled(A5, B5)
    result_numba5 = matrix_multiply_numba(A5, B5)
    
    assert np.allclose(result_naive5, expected5, rtol=1e-6), "Naive single element test failed"
    assert np.allclose(result_tiled5, expected5, rtol=1e-6), "Tiled single element test failed"
    assert np.allclose(result_numba5, expected5, rtol=1e-6), "Numba single element test failed"
    print("✓ Test 5: Single element passed")
    
    # Test case 6: Dimension mismatch (should raise error)
    try:
        A6 = np.random.randn(2, 3)
        B6 = np.random.randn(4, 5)  # Wrong inner dimension
        matrix_multiply_naive(A6, B6)
    except AssertionError:
        print("✓ Test 6: Dimension validation passed")
    else:
        # Raised outside the try body so the except clause cannot swallow it
        raise RuntimeError("Should have raised an error for dimension mismatch")
    
    print("\n🎉 All 6 matrix multiplication correctness tests passed!")

def test_matrix_multiply_performance():
    """Test matrix multiplication performance."""
    print("\nRunning matrix multiplication performance tests...")
    
    # Test with different sizes
    sizes = [16, 32, 64, 128]
    results = benchmark_matrix_multiply(sizes)
    
    # Verify results structure
    assert 'sizes' in results, "Results should contain 'sizes'"
    assert 'naive_times' in results, "Results should contain 'naive_times'"
    assert 'tiled_times' in results, "Results should contain 'tiled_times'"
    assert 'numba_times' in results, "Results should contain 'numba_times'"
    
    # Verify all methods work
    for size in sizes:
        assert size in results['naive_times'], f"Missing naive time for size {size}"
        assert size in results['tiled_times'], f"Missing tiled time for size {size}"
        assert size in results['numba_times'], f"Missing numba time for size {size}"
        
        # Times should be positive
        assert results['naive_times'][size] > 0, f"Invalid naive time for size {size}"
        assert results['tiled_times'][size] > 0, f"Invalid tiled time for size {size}"
        assert results['numba_times'][size] > 0, f"Invalid numba time for size {size}"
    
    print("✓ Performance tests passed!")

def test_tiling_effectiveness():
    """Test that tiling is effective for larger matrices."""
    print("\nRunning tiling effectiveness tests...")
    
    # Test with larger matrices where tiling should help
    large_sizes = [256, 512]
    results = benchmark_matrix_multiply(large_sizes)
    
    for size in large_sizes:
        naive_time = results['naive_times'][size]
        tiled_time = results['tiled_times'][size]
        
        # Tiled should be faster or at least not significantly slower
        if tiled_time < naive_time:
            speedup = naive_time / tiled_time
            print(f"✓ Size {size}: Tiled is {speedup:.2f}x faster")
        else:
            print(f"⚠️  Size {size}: Tiled is slower (may need tuning)")
    
    print("✓ Tiling effectiveness tests completed!")

if __name__ == "__main__":
    # Run all tests
    test_matrix_multiply_correctness()
    test_matrix_multiply_performance()
    test_tiling_effectiveness()
    
    print("\n" + "="*50)
    print("Running full benchmark with visualization...")
    results = benchmark_matrix_multiply()
    
    # Print results
    for size in results['sizes']:
        print(f"Size {size}x{size}:")
        print(f"  Naive: {results['naive_times'][size]:.4f}s")
        print(f"  Tiled: {results['tiled_times'][size]:.4f}s")
        print(f"  Numba: {results['numba_times'][size]:.4f}s")
        print()


### Question 4: Custom Sparse Tensor Operations (Medium-Hard)

**Problem**: Implement a memory-efficient sparse tensor class with basic operations like addition, multiplication, and reshaping.

**Requirements**:
- Use COO (Coordinate) format for sparse representation
- Implement tensor addition and element-wise multiplication
- Support reshaping operations
- Memory usage should be O(nnz) where nnz is number of non-zeros
- Compare with dense tensor operations

**Background**: Sparse tensors store only non-zero values, making them memory-efficient for data with many zeros.
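To illustrate the COO layout itself, the sketch below extracts the coordinate and value arrays from a small dense matrix and rebuilds it. It uses the `[nnz, ndim]` index layout that the starter code documents. The variable names are illustrative, and this is a picture of the format rather than a reference solution for the class.

```python
import numpy as np

dense = np.array([[0, 0, 3],
                  [4, 0, 0],
                  [0, 0, 5]], dtype=np.float32)

# indices: one row per non-zero entry, one column per dimension -> shape (nnz, ndim)
indices = np.argwhere(dense != 0)
# values: the non-zero entries in the same row-major order as indices
values = dense[dense != 0]

print(indices.tolist())  # [[0, 2], [1, 0], [2, 2]]
print(values.tolist())   # [3.0, 4.0, 5.0]

# Rebuilding the dense matrix scatters each value back to its coordinate
rebuilt = np.zeros_like(dense)
rebuilt[tuple(indices.T)] = values
```

Storage is three values plus three coordinate pairs instead of nine entries. The saving grows with the fraction of zeros.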

**Starter Code**:


In [None]:
import numpy as np
from typing import List, Tuple, Union

class SparseTensor:
    """
    Memory-efficient sparse tensor using COO (Coordinate) format.
    
    Stores only non-zero values with their coordinates.
    """
    
    def __init__(self, shape: Tuple[int, ...], values: np.ndarray = None, 
                 indices: np.ndarray = None):
        """
        Initialize sparse tensor.
        
        Args:
            shape: Tensor dimensions
            values: Non-zero values (1D array)
            indices: Coordinates of non-zero values (2D array, shape: [nnz, ndim])
        """
        self.shape = shape
        self.ndim = len(shape)
        
        if values is None:
            self.values = np.array([], dtype=np.float32)
            self.indices = np.array([], dtype=np.int32).reshape(0, self.ndim)
        else:
            self.values = values.astype(np.float32)
            self.indices = indices.astype(np.int32)
            self._validate()
    
    def _validate(self):
        """Validate tensor data."""
        # TODO: Add validation checks
        pass
    
    def to_dense(self) -> np.ndarray:
        """Convert to dense tensor."""
        # TODO: Implement conversion to dense
        pass
    
    @classmethod
    def from_dense(cls, dense_tensor: np.ndarray) -> 'SparseTensor':
        """Create sparse tensor from dense tensor."""
        # TODO: Implement conversion from dense
        pass
    
    def add(self, other: 'SparseTensor') -> 'SparseTensor':
        """Add two sparse tensors."""
        # TODO: Implement sparse tensor addition
        pass
    
    def multiply_elementwise(self, other: 'SparseTensor') -> 'SparseTensor':
        """Element-wise multiplication of two sparse tensors."""
        # TODO: Implement element-wise multiplication
        pass
    
    def reshape(self, new_shape: Tuple[int, ...]) -> 'SparseTensor':
        """Reshape the tensor."""
        # TODO: Implement reshaping
        pass
    
    def memory_usage(self) -> int:
        """Calculate memory usage in bytes."""
        # TODO: Calculate actual memory usage
        pass
    
    def sparsity(self) -> float:
        """Calculate sparsity ratio (fraction of zeros)."""
        # TODO: Calculate sparsity
        pass

def create_sparse_tensor(shape: Tuple[int, ...], sparsity: float = 0.9) -> SparseTensor:
    """
    Create a random sparse tensor with given sparsity.
    
    Args:
        shape: Tensor dimensions
        sparsity: Fraction of elements that should be zero
        
    Returns:
        SparseTensor: Random sparse tensor
    """
    # TODO: Implement random sparse tensor creation
    pass

def benchmark_sparse_vs_dense(shape: Tuple[int, ...], sparsity: float = 0.9):
    """
    Benchmark sparse vs dense operations.
    
    Args:
        shape: Tensor dimensions
        sparsity: Sparsity level
        
    Returns:
        dict: Performance comparison
    """
    # TODO: Implement benchmarking
    pass

# Comprehensive Test Suite
def test_sparse_tensor_basic():
    """Test basic sparse tensor functionality."""
    print("Running basic sparse tensor tests...")
    
    # Test case 1: Empty tensor
    empty_tensor = SparseTensor((0, 0))
    assert empty_tensor.shape == (0, 0)
    assert empty_tensor.ndim == 2
    assert empty_tensor.sparsity() == 1.0  # All zeros
    print("✓ Test 1: Empty tensor passed")
    
    # Test case 2: Single element tensor
    values = np.array([5.0])
    indices = np.array([[0, 0]])
    single_tensor = SparseTensor((1, 1), values, indices)
    
    assert single_tensor.shape == (1, 1)
    assert single_tensor.sparsity() == 0.0  # No zeros
    dense = single_tensor.to_dense()
    assert dense.shape == (1, 1)
    assert dense[0, 0] == 5.0
    print("✓ Test 2: Single element tensor passed")
    
    # Test case 3: Small 2D tensor
    values = np.array([1.0, 2.0, 3.0])
    indices = np.array([[0, 0], [1, 1], [2, 2]])
    diag_tensor = SparseTensor((3, 3), values, indices)
    
    assert diag_tensor.shape == (3, 3)
    assert np.isclose(diag_tensor.sparsity(), 2/3)  # 6 zeros out of 9 elements; avoid exact float equality
    dense = diag_tensor.to_dense()
    expected = np.array([[1, 0, 0], [0, 2, 0], [0, 0, 3]])
    assert np.array_equal(dense, expected)
    print("✓ Test 3: Small 2D tensor passed")
    
    print("🎉 Basic sparse tensor tests passed!")

def test_sparse_tensor_operations():
    """Test sparse tensor operations."""
    print("\nRunning sparse tensor operations tests...")
    
    # Test case 1: Addition
    values1 = np.array([1.0, 2.0])
    indices1 = np.array([[0, 0], [1, 1]])
    tensor1 = SparseTensor((2, 2), values1, indices1)
    
    values2 = np.array([3.0, 4.0])
    indices2 = np.array([[0, 0], [1, 1]])
    tensor2 = SparseTensor((2, 2), values2, indices2)
    
    result = tensor1.add(tensor2)
    expected_dense = np.array([[4, 0], [0, 6]])
    assert np.array_equal(result.to_dense(), expected_dense)
    print("✓ Test 1: Addition passed")
    
    # Test case 2: Element-wise multiplication
    result_mult = tensor1.multiply_elementwise(tensor2)
    expected_mult = np.array([[3, 0], [0, 8]])
    assert np.array_equal(result_mult.to_dense(), expected_mult)
    print("✓ Test 2: Element-wise multiplication passed")
    
    # Test case 3: Reshaping
    values3 = np.array([1.0, 2.0, 3.0, 4.0])
    indices3 = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    tensor3 = SparseTensor((2, 2), values3, indices3)
    
    reshaped = tensor3.reshape((4, 1))
    assert reshaped.shape == (4, 1)
    # Verify values are preserved
    assert len(reshaped.values) == 4
    print("✓ Test 3: Reshaping passed")
    
    print("🎉 Sparse tensor operations tests passed!")

def test_sparse_tensor_from_dense():
    """Test conversion from dense to sparse."""
    print("\nRunning dense to sparse conversion tests...")
    
    # Test case 1: Dense matrix with zeros
    dense1 = np.array([[1, 0, 3], [0, 5, 0], [7, 0, 9]])
    sparse1 = SparseTensor.from_dense(dense1)
    
    assert sparse1.shape == (3, 3)
    assert np.isclose(sparse1.sparsity(), 4/9)  # 4 zeros out of 9; avoid exact float equality
    assert np.array_equal(sparse1.to_dense(), dense1)
    print("✓ Test 1: Dense to sparse conversion passed")
    
    # Test case 2: All zeros
    dense2 = np.zeros((2, 3))
    sparse2 = SparseTensor.from_dense(dense2)
    
    assert sparse2.sparsity() == 1.0  # All zeros
    assert len(sparse2.values) == 0  # No non-zero values
    print("✓ Test 2: All zeros matrix passed")
    
    # Test case 3: No zeros
    dense3 = np.array([[1, 2], [3, 4]])
    sparse3 = SparseTensor.from_dense(dense3)
    
    assert sparse3.sparsity() == 0.0  # No zeros
    assert len(sparse3.values) == 4  # All values are non-zero
    print("✓ Test 3: No zeros matrix passed")
    
    print("🎉 Dense to sparse conversion tests passed!")

def test_sparse_tensor_memory():
    """Test memory usage calculations."""
    print("\nRunning memory usage tests...")
    
    # Test case 1: Memory usage calculation
    values = np.array([1.0, 2.0, 3.0], dtype=np.float32)
    indices = np.array([[0, 0], [1, 1], [2, 2]], dtype=np.int32)
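    # Use a shape large enough that the COO index overhead cannot cancel the savings:
    # 3 float32 values (12 bytes) + 3x2 int32 indices (24 bytes) = 36 bytes,
    # which would only tie a dense 3x3 float32 tensor (36 bytes).
    tensor = SparseTensor((100, 100), values, indices)
    
    memory = tensor.memory_usage()
    assert memory > 0, "Memory usage should be positive"
    
    # Should be much less than dense equivalent
    dense_memory = 100 * 100 * 4  # 100x100 float32 = 40,000 bytes
    assert memory < dense_memory, "Sparse should use less memory than dense"
    print("✓ Test 1: Memory usage calculation passed")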
    
    # Test case 2: Memory vs sparsity relationship
    # More sparse should use less memory
    sparse_tensor = create_sparse_tensor((100, 100), sparsity=0.9)
    dense_tensor = create_sparse_tensor((100, 100), sparsity=0.0)
    
    sparse_memory = sparse_tensor.memory_usage()
    dense_memory = dense_tensor.memory_usage()
    
    assert sparse_memory < dense_memory, "Sparser tensor should use less memory"
    print("✓ Test 2: Memory vs sparsity relationship passed")
    
    print("🎉 Memory usage tests passed!")

def test_sparse_tensor_edge_cases():
    """Test edge cases and error handling."""
    print("\nRunning edge cases tests...")
    
    # Test case 1: Invalid shape
    # The failure is raised in the else branch so it is not swallowed by the
    # except clause, which also catches AssertionError.
    try:
        SparseTensor((0, 0), np.array([1.0]), np.array([[0, 0]]))
    except (ValueError, AssertionError):
        print("✓ Test 1: Invalid shape validation passed")
    else:
        raise AssertionError("Should raise error for invalid shape")
    
    # Test case 2: Mismatched values and indices
    try:
        values = np.array([1.0, 2.0])
        indices = np.array([[0, 0]])  # Only one index for two values
        SparseTensor((2, 2), values, indices)
    except (ValueError, AssertionError):
        print("✓ Test 2: Mismatched dimensions validation passed")
    else:
        raise AssertionError("Should raise error for mismatched dimensions")
    
    # Test case 3: Invalid indices
    try:
        values = np.array([1.0])
        indices = np.array([[5, 5]])  # Index out of bounds for (2, 2) tensor
        SparseTensor((2, 2), values, indices)
    except (ValueError, AssertionError):
        print("✓ Test 3: Out-of-bounds indices validation passed")
    else:
        raise AssertionError("Should raise error for out-of-bounds indices")
    
    print("🎉 Edge cases tests passed!")

def test_benchmarking():
    """Test benchmarking functionality."""
    print("\nRunning benchmarking tests...")
    
    # Test with small tensors
    results = benchmark_sparse_vs_dense((10, 10), sparsity=0.5)
    
    # Verify results structure
    assert 'sparse_memory' in results
    assert 'dense_memory' in results
    assert 'memory_savings' in results
    
    # Verify memory savings calculation
    expected_savings = (results['dense_memory'] - results['sparse_memory']) / results['dense_memory']
    assert abs(results['memory_savings'] - expected_savings) < 1e-6
    
    print("✓ Benchmarking tests passed!")

if __name__ == "__main__":
    # Run all tests
    test_sparse_tensor_basic()
    test_sparse_tensor_operations()
    test_sparse_tensor_from_dense()
    test_sparse_tensor_memory()
    test_sparse_tensor_edge_cases()
    test_benchmarking()
    
    print("\n" + "="*50)
    print("Running benchmark example...")
    
    # Benchmark example
    results = benchmark_sparse_vs_dense((100, 100), sparsity=0.95)
    print(f"Sparse memory: {results['sparse_memory']} bytes")
    print(f"Dense memory: {results['dense_memory']} bytes")
    print(f"Memory savings: {results['memory_savings']:.1%}")
    
    print("\n🎉 All sparse tensor tests completed successfully!")


### Question 5: Memory Profiling and Optimization (Hard)

**Problem**: Implement a comprehensive memory profiler for PyTorch operations and optimize a memory-intensive neural network forward pass.

**Requirements**:
- Create a memory profiler that tracks GPU/CPU memory usage
- Profile a multi-layer neural network forward pass
- Identify memory bottlenecks and optimize them
- Implement gradient checkpointing to reduce memory usage
- Compare memory usage before and after optimization

**Background**: Large neural networks can consume massive amounts of memory. Understanding and optimizing memory usage is crucial for training large models.
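
To make the background concrete, here is a rough back-of-the-envelope estimate of where the memory goes. It is only a sketch under stated assumptions: the network is taken to be a plain stack of fully connected layers with the default sizes from the starter code (1000 → 2000 → 2000 → 2000 → 100), float32 everywhere, batch size 32, and only the output of each linear layer is counted as a stored activation. The starter code leaves the architecture as a TODO, so these are assumptions, and `estimate_mlp_memory` is a hypothetical helper, not part of the exercise API.

```python
from typing import Dict, List

BYTES_PER_FLOAT32 = 4


def estimate_mlp_memory(layer_sizes: List[int], batch_size: int) -> Dict[str, int]:
    """Rough float32 memory estimate for a fully connected network (assumed architecture)."""
    pairs = list(zip(layer_sizes[:-1], layer_sizes[1:]))
    # Each linear layer holds a weight matrix (n_in * n_out) plus a bias vector (n_out).
    param_count = sum(n_in * n_out + n_out for n_in, n_out in pairs)
    # One stored activation per layer output, per sample in the batch.
    activation_count = batch_size * sum(n_out for _, n_out in pairs)
    return {
        "param_count": param_count,
        "param_bytes": param_count * BYTES_PER_FLOAT32,
        "activation_bytes": activation_count * BYTES_PER_FLOAT32,
    }


stats = estimate_mlp_memory([1000, 2000, 2000, 2000, 100], batch_size=32)
print(f"Parameters:  {stats['param_bytes'] / 2**20:.1f} MiB")       # Parameters:  38.9 MiB
print(f"Activations: {stats['activation_bytes'] / 2**20:.2f} MiB")  # Activations: 0.74 MiB
```

Under these assumptions the parameters dominate at batch size 32. Gradient checkpointing only reduces stored activations, so its benefit would show up mainly with larger batches or deeper networks.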

**Starter Code**:


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import time
import psutil
import gc
from contextlib import contextmanager
from typing import Dict, List, Optional

class MemoryProfiler:
    """
    Memory profiler for PyTorch operations.
    """
    
    def __init__(self, device: str = 'cuda' if torch.cuda.is_available() else 'cpu'):
        self.device = device
        self.memory_log = []
        self.peak_memory = 0
        
    def get_memory_usage(self) -> Dict[str, float]:
        """Get current memory usage in MB."""
        # TODO: Implement memory usage tracking
        pass
    
    def log_memory(self, operation_name: str):
        """Log memory usage for an operation."""
        # TODO: Implement memory logging
        pass
    
    def reset(self):
        """Reset profiler state."""
        # TODO: Reset profiler
        pass
    
    def get_peak_memory(self) -> float:
        """Get peak memory usage in MB."""
        # TODO: Return peak memory
        pass

class LargeNeuralNetwork(nn.Module):
    """
    Memory-intensive neural network for profiling.
    """
    
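    def __init__(self, input_size: int = 1000, hidden_sizes: Optional[List[int]] = None,
                 output_size: int = 100, dropout: float = 0.1):
        super().__init__()
        # Avoid a mutable default argument; None selects the default layer widths.
        if hidden_sizes is None:
            hidden_sizes = [2000, 2000, 2000]
        # TODO: Implement large neural network
        pass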
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass without optimization."""
        # TODO: Implement forward pass
        pass
    
    def forward_with_checkpointing(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass with gradient checkpointing."""
        # TODO: Implement checkpointed forward pass
        pass

def profile_network_memory(model: nn.Module, input_tensor: torch.Tensor, 
                          use_checkpointing: bool = False) -> Dict[str, float]:
    """
    Profile memory usage of network forward pass.
    
    Args:
        model: Neural network model
        input_tensor: Input tensor
        use_checkpointing: Whether to use gradient checkpointing
        
    Returns:
        dict: Memory usage statistics
    """
    # TODO: Implement memory profiling
    pass

def optimize_memory_usage(model: nn.Module, input_tensor: torch.Tensor) -> Dict[str, float]:
    """
    Apply memory optimizations and measure improvement.
    
    Args:
        model: Neural network model
        input_tensor: Input tensor
        
    Returns:
        dict: Optimization results
    """
    # TODO: Implement memory optimizations
    pass

def compare_memory_strategies():
    """Compare different memory optimization strategies."""
    # TODO: Implement comparison
    pass

# Comprehensive Test Suite
def test_memory_profiler():
    """Test memory profiler functionality."""
    print("Running memory profiler tests...")
    
    # Test case 1: Basic profiler initialization
    profiler = MemoryProfiler()
    assert profiler.device in ['cpu', 'cuda'], f"Invalid device: {profiler.device}"
    assert len(profiler.memory_log) == 0, "Memory log should be empty initially"
    print("✓ Test 1: Profiler initialization passed")
    
    # Test case 2: Memory usage tracking
    memory_usage = profiler.get_memory_usage()
    assert isinstance(memory_usage, dict), "Memory usage should return a dictionary"
    assert 'cpu_memory' in memory_usage, "Should track CPU memory"
    if profiler.device == 'cuda':
        assert 'gpu_memory' in memory_usage, "Should track GPU memory"
    print("✓ Test 2: Memory usage tracking passed")
    
    # Test case 3: Memory logging
    profiler.log_memory("test_operation")
    assert len(profiler.memory_log) == 1, "Should log one operation"
    assert profiler.memory_log[0]['operation'] == "test_operation"
    print("✓ Test 3: Memory logging passed")
    
    # Test case 4: Profiler reset
    profiler.reset()
    assert len(profiler.memory_log) == 0, "Memory log should be empty after reset"
    assert profiler.peak_memory == 0, "Peak memory should be reset"
    print("✓ Test 4: Profiler reset passed")
    
    print("🎉 Memory profiler tests passed!")

def test_neural_network():
    """Test neural network functionality."""
    print("\nRunning neural network tests...")
    
    # Test case 1: Network initialization
    model = LargeNeuralNetwork(input_size=100, hidden_sizes=[50, 50], output_size=10)
    assert model is not None, "Model should be created"
    print("✓ Test 1: Network initialization passed")
    
    # Test case 2: Forward pass
    input_tensor = torch.randn(4, 100)  # batch_size=4
    output = model.forward(input_tensor)
    assert output.shape == (4, 10), f"Wrong output shape: {output.shape}"
    print("✓ Test 2: Forward pass passed")
    
    # Test case 3: Checkpointed forward pass
    output_checkpointed = model.forward_with_checkpointing(input_tensor)
    assert output_checkpointed.shape == (4, 10), f"Wrong checkpointed output shape: {output_checkpointed.shape}"
    print("✓ Test 3: Checkpointed forward pass passed")
    
    # Test case 4: Gradient computation
    loss = output.sum()
    loss.backward()
    
    # Check that gradients are computed
    for param in model.parameters():
        assert param.grad is not None, "Gradients should be computed"
    print("✓ Test 4: Gradient computation passed")
    
    print("🎉 Neural network tests passed!")

def test_memory_profiling():
    """Test memory profiling functionality."""
    print("\nRunning memory profiling tests...")
    
    # Test case 1: Basic profiling
    model = LargeNeuralNetwork(input_size=50, hidden_sizes=[20, 20], output_size=5)
    input_tensor = torch.randn(2, 50)  # Small batch
    
    results = profile_network_memory(model, input_tensor, use_checkpointing=False)
    
    assert 'peak_memory' in results, "Results should contain peak_memory"
    assert 'memory_timeline' in results, "Results should contain memory_timeline"
    assert results['peak_memory'] > 0, "Peak memory should be positive"
    print("✓ Test 1: Basic profiling passed")
    
    # Test case 2: Checkpointed profiling
    results_checkpointed = profile_network_memory(model, input_tensor, use_checkpointing=True)
    
    assert 'peak_memory' in results_checkpointed, "Checkpointed results should contain peak_memory"
    assert results_checkpointed['peak_memory'] > 0, "Checkpointed peak memory should be positive"
    print("✓ Test 2: Checkpointed profiling passed")
    
    # Test case 3: Memory optimization
    optimization_results = optimize_memory_usage(model, input_tensor)
    
    assert 'baseline_memory' in optimization_results, "Should contain baseline memory"
    assert 'optimized_memory' in optimization_results, "Should contain optimized memory"
    assert 'improvement' in optimization_results, "Should contain improvement percentage"
    print("✓ Test 3: Memory optimization passed")
    
    print("🎉 Memory profiling tests passed!")

def test_memory_optimization():
    """Test memory optimization strategies."""
    print("\nRunning memory optimization tests...")
    
    # Test case 1: Memory comparison
    model = LargeNeuralNetwork(input_size=100, hidden_sizes=[50, 50], output_size=10)
    input_tensor = torch.randn(8, 100)
    
    # Profile both approaches
    baseline_results = profile_network_memory(model, input_tensor, use_checkpointing=False)
    checkpointed_results = profile_network_memory(model, input_tensor, use_checkpointing=True)
    
    # Checkpointing should not increase peak memory; allow 10% slack for measurement noise
    assert checkpointed_results['peak_memory'] <= baseline_results['peak_memory'] * 1.1, \
        "Checkpointed should not use significantly more memory"
    print("✓ Test 1: Memory comparison passed")
    
    # Test case 2: Performance vs memory tradeoff
    # Time both approaches with a monotonic, high-resolution clock
    # (time is already imported at the top of this cell)
    start_time = time.perf_counter()
    _ = model.forward(input_tensor)
    baseline_time = time.perf_counter() - start_time
    
    start_time = time.perf_counter()
    _ = model.forward_with_checkpointing(input_tensor)
    checkpointed_time = time.perf_counter() - start_time
    
    # The recomputation cost of checkpointing is paid during backward,
    # so these forward-only timings may be close to each other
    print(f"Baseline time: {baseline_time:.4f}s, Checkpointed time: {checkpointed_time:.4f}s")
    print("✓ Test 2: Performance vs memory tradeoff analyzed")
    
    print("🎉 Memory optimization tests passed!")

def test_edge_cases():
    """Test edge cases and error handling."""
    print("\nRunning edge cases tests...")
    
    # Test case 1: Very small batch size
    model = LargeNeuralNetwork(input_size=10, hidden_sizes=[5], output_size=2)
    input_tensor = torch.randn(1, 10)  # Single sample
    
    results = profile_network_memory(model, input_tensor)
    assert results['peak_memory'] > 0, "Should handle single sample"
    print("✓ Test 1: Single sample handling passed")
    
    # Test case 2: Large batch size
    input_tensor_large = torch.randn(100, 10)
    results_large = profile_network_memory(model, input_tensor_large)
    assert results_large['peak_memory'] > results['peak_memory'], "Larger batch should use more memory"
    print("✓ Test 2: Large batch handling passed")
    
    # Test case 3: Different input sizes
    model_different = LargeNeuralNetwork(input_size=20, hidden_sizes=[10], output_size=3)
    input_tensor_different = torch.randn(4, 20)
    
    results_different = profile_network_memory(model_different, input_tensor_different)
    assert results_different['peak_memory'] > 0, "Should handle different input sizes"
    print("✓ Test 3: Different input sizes passed")
    
    print("🎉 Edge cases tests passed!")

def test_benchmarking():
    """Test benchmarking and comparison functionality."""
    print("\nRunning benchmarking tests...")
    
    # Test case 1: Strategy comparison
    try:
        comparison_results = compare_memory_strategies()
        assert isinstance(comparison_results, dict), "Should return comparison results"
        print("✓ Test 1: Strategy comparison passed")
    except Exception as e:
        print(f"⚠️  Strategy comparison test failed: {e}")
    
    # Test case 2: Performance metrics
    model = LargeNeuralNetwork(input_size=50, hidden_sizes=[25], output_size=5)
    input_tensor = torch.randn(4, 50)
    
    results = profile_network_memory(model, input_tensor)
    
    # Verify all expected metrics are present
    expected_metrics = ['peak_memory', 'memory_timeline', 'operations']
    for metric in expected_metrics:
        assert metric in results, f"Missing metric: {metric}"
    
    print("✓ Test 2: Performance metrics passed")
    
    print("🎉 Benchmarking tests passed!")

# Test the implementation
if __name__ == "__main__":
    # Run all tests
    test_memory_profiler()
    test_neural_network()
    test_memory_profiling()
    test_memory_optimization()
    test_edge_cases()
    test_benchmarking()
    
    print("\n" + "="*50)
    print("Running full memory profiling example...")
    
    # Create model and test data
    model = LargeNeuralNetwork()
    input_tensor = torch.randn(32, 1000)  # batch_size=32
    
    print("Profiling memory usage...")
    
    # Profile without optimization
    results_baseline = profile_network_memory(model, input_tensor, use_checkpointing=False)
    print(f"Baseline memory usage: {results_baseline['peak_memory']:.2f} MB")
    
    # Profile with optimization
    results_optimized = profile_network_memory(model, input_tensor, use_checkpointing=True)
    print(f"Optimized memory usage: {results_optimized['peak_memory']:.2f} MB")
    
    # Calculate improvement
    improvement = (results_baseline['peak_memory'] - results_optimized['peak_memory']) / results_baseline['peak_memory']
    print(f"Memory reduction: {improvement:.1%}")
    
    print("\n🎉 All memory profiling tests completed successfully!")


---

## 💡 Hints

<details>
<summary>Click to reveal hint for Question 1: Memory-Efficient Prefix Sum</summary>

**Hint**: For in-place prefix sum, you need to iterate through the array once and update each element to be the sum of all previous elements plus itself. Start from index 1 (since index 0 doesn't need to change) and for each position i, set `arr[i] = arr[i-1] + arr[i]`.

**Key insight**: Each element becomes the sum of all elements from the beginning up to and including itself.
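
The recurrence in the hint can be sketched as follows. This is a minimal illustration on a plain Python list; the function name `prefix_sum_inplace` is a placeholder and may not match the signature used in the exercise.

```python
from typing import List


def prefix_sum_inplace(arr: List[float]) -> List[float]:
    """Overwrite arr with its running totals using O(1) extra space."""
    # Index 0 already holds its own prefix sum, so start at 1.
    for i in range(1, len(arr)):
        arr[i] += arr[i - 1]
    return arr


data = [1, 2, 3, 4]
prefix_sum_inplace(data)
print(data)  # [1, 3, 6, 10]
```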
</details>

<details>
<summary>Click to reveal hint for Question 2: Tensor Operation Benchmarking</summary>

**Hint**: For the loop version, use a simple for loop with indexing. For the vectorized version, use NumPy's built-in multiplication operator. Use `time.time()` (or `time.perf_counter()`, which has better resolution for short intervals) to measure execution time and `tracemalloc` to track memory usage. Create arrays of different sizes and measure both time and memory for each approach.

**Key insight**: Vectorized operations should be significantly faster due to SIMD instructions and reduced Python overhead.
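
One possible shape for the measurement harness is sketched below. The helper names (`multiply_loop`, `multiply_vectorized`, `measure`) are illustrative assumptions, not the names required by the exercise, and the sketch uses `time.perf_counter()` rather than `time.time()` for finer timer resolution.

```python
import time
import tracemalloc

import numpy as np


def multiply_loop(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Element-wise product with an explicit Python loop."""
    out = np.empty_like(a)
    for i in range(len(a)):
        out[i] = a[i] * b[i]
    return out


def multiply_vectorized(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Element-wise product using NumPy's vectorized operator."""
    return a * b


def measure(fn, a: np.ndarray, b: np.ndarray):
    """Return (result, elapsed seconds, peak traced bytes) for one call of fn."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(a, b)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


rng = np.random.default_rng(0)
a, b = rng.random(10_000), rng.random(10_000)
for fn in (multiply_loop, multiply_vectorized):
    _, elapsed, peak = measure(fn, a, b)
    print(f"{fn.__name__}: {elapsed:.6f}s, peak {peak} bytes")
```

Note that tracing with `tracemalloc` adds overhead of its own, so for precise timings it may be worth timing and memory-tracing in separate runs.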
</details>

<details>
<summary>Click to reveal hint for Question 3: Cache-Aware Matrix Operations</summary>

**Hint**: For tiling, divide the matrices into smaller blocks (tiles) and process them one block at a time. This improves cache locality by keeping frequently accessed data in cache. Choose the tile size so that the working set (one tile from each of the three matrices) fits in the cache level you are targeting; 64x64 or 128x128 are common starting points to tune from. For Numba, use the `@jit(nopython=True)` decorator to compile the function.

**Key insight**: Tiling reduces cache misses because each tile is reused many times while it is still resident in cache, instead of being evicted and reloaded.
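
The blocking structure can be sketched in NumPy as below. This is an illustration of the loop order only: `matmul_tiled` is a hypothetical name, and because each block product is delegated to NumPy's own `@`, it should not be expected to outperform a plain `a @ b`. The cache benefit shows up when the inner block is written as explicit loops and compiled, for example with Numba.

```python
import numpy as np


def matmul_tiled(a: np.ndarray, b: np.ndarray, tile: int = 64) -> np.ndarray:
    """Blocked matrix multiply: accumulate the output one tile at a time."""
    n, k = a.shape
    k2, m = b.shape
    if k != k2:
        raise ValueError(f"Incompatible shapes: {a.shape} and {b.shape}")
    out = np.zeros((n, m), dtype=np.result_type(a, b))
    for i0 in range(0, n, tile):          # rows of the output tile
        for j0 in range(0, m, tile):      # columns of the output tile
            for k0 in range(0, k, tile):  # shared (inner) dimension
                # Slices past the array edge are clipped, so ragged tiles work.
                out[i0:i0 + tile, j0:j0 + tile] += (
                    a[i0:i0 + tile, k0:k0 + tile] @ b[k0:k0 + tile, j0:j0 + tile]
                )
    return out


rng = np.random.default_rng(0)
a, b = rng.random((70, 50)), rng.random((50, 33))
print(np.allclose(matmul_tiled(a, b, tile=16), a @ b))  # True
```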
</details>

<details>
<summary>Click to reveal hint for Question 4: Custom Sparse Tensor Operations</summary>

**Hint**: In COO format, store non-zero values and their coordinates separately. For addition, merge the two coordinate lists and sum the values that share a coordinate. For element-wise multiplication, only coordinates present in both tensors can be non-zero, so multiply just those pairs. For reshaping, flatten each coordinate to a linear (row-major) index using the old shape, then unravel that index into coordinates for the new shape; the values themselves do not change.

**Key insight**: Sparse operations should only work on non-zero elements, making them much more memory-efficient for sparse data.
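
The addition and reshaping steps can be sketched with NumPy's index helpers. The functions below operate on raw `(indices, values)` arrays rather than on the `SparseTensor` class, and their names (`coo_add`, `coo_reshape`) are illustrative assumptions. Indices are assumed to have shape `(nnz, ndim)`, as in the test suite.

```python
import math

import numpy as np


def coo_add(shape, idx_a, val_a, idx_b, val_b):
    """Add two COO tensors of the same shape; returns (indices, values)."""
    # Flatten coordinates to linear indices so equal coordinates compare equal.
    lin = np.concatenate([
        np.ravel_multi_index(tuple(idx_a.T), shape),
        np.ravel_multi_index(tuple(idx_b.T), shape),
    ])
    vals = np.concatenate([val_a, val_b])
    uniq, inverse = np.unique(lin, return_inverse=True)
    summed = np.zeros(len(uniq), dtype=vals.dtype)
    np.add.at(summed, inverse, vals)  # accumulate values sharing a coordinate
    keep = summed != 0                # drop entries that cancelled to zero
    indices = np.stack(np.unravel_index(uniq[keep], shape), axis=1)
    return indices, summed[keep]


def coo_reshape(indices, old_shape, new_shape):
    """Map COO coordinates from old_shape to new_shape (row-major order)."""
    if math.prod(old_shape) != math.prod(new_shape):
        raise ValueError("Total number of elements must be unchanged")
    lin = np.ravel_multi_index(tuple(indices.T), old_shape)
    return np.stack(np.unravel_index(lin, new_shape), axis=1)


idx = np.array([[0, 0], [1, 1]])
out_idx, out_val = coo_add((2, 2), idx, np.array([1.0, 2.0]), idx, np.array([3.0, 4.0]))
print(out_idx.tolist(), out_val.tolist())  # [[0, 0], [1, 1]] [4.0, 6.0]

full = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(coo_reshape(full, (2, 2), (4, 1)).tolist())  # [[0, 0], [1, 0], [2, 0], [3, 0]]
```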
</details>

<details>
<summary>Click to reveal hint for Question 5: Memory Profiling and Optimization</summary>

**Hint**: Use `torch.cuda.memory_allocated()` and `torch.cuda.max_memory_allocated()` for GPU memory tracking, and `psutil.Process().memory_info().rss` for CPU memory (the tests expect a `cpu_memory` entry on every device). For gradient checkpointing, use `torch.utils.checkpoint.checkpoint()` to trade compute for memory. Implement the network with multiple large layers and measure memory usage at each step.

**Key insight**: Gradient checkpointing saves memory by recomputing activations during the backward pass instead of storing them during the forward pass.
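
A minimal sketch of per-block checkpointing is shown below. `TinyMLP` and its layer sizes are illustrative assumptions, not the exercise's `LargeNeuralNetwork`, and dropout is omitted so both code paths give identical outputs. The sketch passes `use_reentrant=False`, which assumes a reasonably recent PyTorch release.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class TinyMLP(nn.Module):
    """Small MLP whose hidden blocks can optionally be checkpointed."""

    def __init__(self, sizes=(16, 32, 32, 4)):
        super().__init__()
        hidden_pairs = zip(sizes[:-2], sizes[1:-1])
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(n_in, n_out), nn.ReLU()) for n_in, n_out in hidden_pairs]
        )
        self.head = nn.Linear(sizes[-2], sizes[-1])

    def forward(self, x: torch.Tensor, use_checkpointing: bool = False) -> torch.Tensor:
        for block in self.blocks:
            if use_checkpointing:
                # Intermediate activations inside the block are not stored;
                # they are recomputed when backward reaches this block.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return self.head(x)


torch.manual_seed(0)
model = TinyMLP()
x = torch.randn(8, 16)

out_plain = model(x)
out_ckpt = model(x, use_checkpointing=True)
print(torch.allclose(out_plain, out_ckpt))

out_ckpt.sum().backward()
print(all(p.grad is not None for p in model.parameters()))
```

On a GPU, the peak usage of the two paths could be compared by calling `torch.cuda.reset_peak_memory_stats()` before each forward and backward pass and reading `torch.cuda.max_memory_allocated()` afterwards.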
</details>
