# Chapter 1: Data Structures & Complexity for ML Engineers

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jmamath/interview_prep/blob/main/chapter_01_data_structures_complexity.ipynb)

## Introduction

As a machine learning engineer at Google DeepMind, I’ve seen firsthand how the performance of a model can make or break a project. A model that takes days to train or milliseconds too long to generate a prediction can be the difference between a groundbreaking innovation and a failed experiment. And often, the key to unlocking that performance lies not in the high-level model architecture, but in the low-level details of how we handle data.

This chapter is about those details. We’re going to take a step back from the glamour of neural network design and get our hands dirty with the fundamentals of data structures and algorithmic complexity. Why? Because in the world of large-scale machine learning, a seemingly small choice—like how you iterate through a tensor or whether you update an array in-place—can have a massive impact on performance and memory usage.

By the end of this chapter, you’ll have a deeper appreciation for the importance of data structures and complexity analysis in machine learning, and you’ll be equipped with the practical skills to write more efficient and scalable ML code. Let's get started!

## Learning Objectives
- Analyze algorithm complexity (Big O time and space)
- Understand memory layout, cache effects, and vectorization
- Implement memory-efficient algorithms
- Profile and optimize memory usage

---

## Problem 1: Memory-Efficient Prefix Sum (Easy)

### Contextual Introduction
In many ML applications, especially in sequence modeling and data analysis, we need to compute running totals or cumulative sums. For example, when implementing custom attention mechanisms or calculating running statistics in a data stream, prefix sums are a fundamental operation. A naive implementation might create a new array to store the cumulative sums, but this can be memory-intensive for large sequences. By performing the operation in-place, we can save memory and improve performance, which is crucial when working with large datasets and models.
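
For contrast with the in-place version you will write, the naive approach described above can be sketched with the standard library. The helper name `prefix_sum_copy` is hypothetical and is not part of the exercise. It allocates a second list of length n, which is the O(n) extra space the in-place version avoids.

```python
from itertools import accumulate

def prefix_sum_copy(arr):
    """Return a NEW list of running totals; the input is left untouched."""
    # accumulate yields arr[0], arr[0] + arr[1], ... lazily;
    # list() materializes them into a second list of length len(arr).
    return list(accumulate(arr))

original = [1, 2, 3, 4, 5]
totals = prefix_sum_copy(original)
print(totals)    # [1, 3, 6, 10, 15]
print(original)  # [1, 2, 3, 4, 5] (unchanged, so two lists are alive at once)
```

For a sequence with hundreds of millions of elements, that second list roughly doubles the memory footprint of the operation.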

### Key Concepts
- **In-place operations**: Modifying the input data directly without creating a copy.
- **Space Complexity**: The amount of memory an algorithm needs. An in-place algorithm has O(1) auxiliary space complexity.
- **Time Complexity**: The amount of time an algorithm takes to run. A single pass through an array is O(n).

### Problem Statement
Implement a memory-efficient prefix sum algorithm that computes the cumulative sum of an array in-place.

**Requirements**:
- Modify the input array in-place (O(1) extra space).
- Time complexity should be O(n).
- Handle edge cases like an empty array or a single-element array.

**Example**:
```python
arr = [1, 2, 3, 4, 5]
prefix_sum_inplace(arr)
print(arr)  # Expected output: [1, 3, 6, 10, 15]
```

In [None]:
def prefix_sum_inplace(arr):
    """
    Computes the prefix sum of an array in-place.
    
    Args:
        arr: A list of integers
        
    Returns:
        None (modifies arr in-place)
    """
    # TODO: Implement the prefix sum algorithm
    # Start from index 1 and add the previous element to each element
    # No return value is needed: the caller's list is modified directly
    pass

# Test your implementation
def test_prefix_sum_inplace():
    """Tests the in-place prefix sum implementation."""
    # Test case 1: Basic functionality
    arr1 = [1, 2, 3, 4, 5]
    prefix_sum_inplace(arr1)
    assert arr1 == [1, 3, 6, 10, 15], f"Test 1 Failed: {arr1}"

    # Test case 2: Empty array
    arr2 = []
    prefix_sum_inplace(arr2)
    assert arr2 == [], f"Test 2 Failed: {arr2}"

    # Test case 3: Single element
    arr3 = [10]
    prefix_sum_inplace(arr3)
    assert arr3 == [10], f"Test 3 Failed: {arr3}"

    # Test case 4: Array with negative numbers
    arr4 = [1, -2, 3, -4, 5]
    prefix_sum_inplace(arr4)
    assert arr4 == [1, -1, 2, -2, 3], f"Test 4 Failed: {arr4}"

    print("✓ All prefix sum tests passed!")

# Run the tests
test_prefix_sum_inplace()

<details>
<summary>Click to reveal hint for Problem 1</summary>

**Hint**: Iterate through the array starting from the second element. For each element at index `i`, update it by adding the value of the element at index `i-1`. Because the element at `i-1` already holds the running total of everything before it, each element becomes the sum of itself and all previous elements.

</details>

---

## Problem 2: Tensor Operation Benchmarking (Easy-Medium)

### Contextual Introduction
In deep learning, we work with large multi-dimensional arrays called tensors. Operations on these tensors, like element-wise multiplication, are the building blocks of neural networks. A naive way to implement these operations is to use Python loops, but this is incredibly slow. High-performance ML libraries like NumPy and PyTorch use vectorized operations, which are implemented in C or CUDA and can perform operations on entire arrays at once. Understanding the performance difference between loops and vectorization is fundamental for writing efficient ML code.
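
Part of the gap comes from memory layout. As a rough, CPython-specific illustration, the sketch below uses the standard library's `array` module as a stand-in for a NumPy buffer (an assumption made here to avoid extra dependencies). A Python list holds pointers to separately allocated float objects, while a packed array stores raw 8-byte values back to back.

```python
import sys
from array import array

n = 1_000
values = [float(i) for i in range(n)]

# A list stores pointers to separately allocated float objects.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# array('d') stores raw 8-byte doubles back to back in one buffer.
packed = array("d", values)
packed_bytes = packed.itemsize * len(packed)

print(f"list of floats: ~{list_bytes} bytes")
print(f"packed doubles:  {packed_bytes} bytes")  # 8000
```

The exact list figure depends on the interpreter build, so it is printed rather than quoted here. The packed layout is what lets compiled kernels stream through the data without dereferencing a pointer per element.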

### Key Concepts
- **Vectorization**: Performing operations on entire arrays at once, rather than element by element.
- **SIMD (Single Instruction, Multiple Data)**: A hardware feature that allows a single instruction to be applied to multiple data points simultaneously. Vectorized operations leverage SIMD.
- **Benchmarking**: Measuring the performance of code, typically in terms of execution time and memory usage.
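
A single timing measurement is noisy, so benchmarks usually repeat the call and keep the best run. The following is a minimal sketch of that idea using only the standard library. The helper name `best_time` and its `repeats` parameter are assumptions for illustration, not part of the exercise scaffold.

```python
import time

def best_time(fn, *args, repeats=5):
    """Return the fastest wall-clock time (seconds) over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()   # monotonic, high-resolution clock
        fn(*args)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best

# Usage: time a built-in on a small input
data = list(range(10_000))
print(f"sum over 10,000 ints: {best_time(sum, data):.6f} s")
```

Taking the minimum rather than the mean reduces the influence of interruptions from other processes. It is one reasonable choice among several.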

### Problem Statement
Compare the performance of Python loops vs. NumPy vectorized operations for element-wise multiplication on large tensors. You will implement both versions, measure their execution time and memory usage for different array sizes, and create a plot to visualize the performance difference.

**Requirements**:
- Implement both loop-based and vectorized versions of element-wise multiplication.
- Measure and plot execution time and memory usage for different array sizes.
- Analyze the speedup of the vectorized version over the loop version.

### Example: Loop vs. Vectorized Operations
```python
import numpy as np

# Here's the difference between loop and vectorized approaches:

# LOOP VERSION (slow)
a = np.array([1, 2, 3, 4])
b = np.array([2, 2, 2, 2])
result_loop = np.zeros_like(a)
for i in range(len(a)):
    result_loop[i] = a[i] * b[i]  # One operation at a time
# Result: [2, 4, 6, 8]

# VECTORIZED VERSION (fast)
result_vectorized = a * b  # All operations at once
# Result: [2, 4, 6, 8]

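# Same result. On large arrays the vectorized version is often faster by
# one to two orders of magnitude or more; on a 4-element array like this
# one, the difference is negligible.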
```

### Your Exercise
Implement and benchmark both approaches:

In [None]:
import numpy as np
import time
import matplotlib.pyplot as plt

def elementwise_multiply_loop(a, b):
    """
    Element-wise multiplication using Python loops.
    
    TODO: Implement this function
    - Create a result array of zeros with the same shape as 'a'
    - Iterate through indices and multiply element-by-element
    - Return the result
    """
    pass

def elementwise_multiply_vectorized(a, b):
    """
    Element-wise multiplication using NumPy vectorization.
    
    TODO: Implement this function
    - Simply return a * b (let NumPy handle the vectorization)
    """
    pass

def benchmark_operations(sizes=[10**i for i in range(1, 6)]):
    """
    Benchmark loop vs. vectorized operations.
    
    TODO: Implement this function
    - For each size, create random arrays a and b
    - Time elementwise_multiply_loop(a, b)
    - Time elementwise_multiply_vectorized(a, b)
    - Store the times in loop_times and vectorized_times lists
    - Plot the results with matplotlib (logarithmic scales recommended)
    """
    loop_times = []
    vectorized_times = []
    
    for size in sizes:
        # TODO: Create random arrays
        # TODO: Time loop version
        # TODO: Time vectorized version
        pass
    
    # TODO: Create a plot comparing loop_times and vectorized_times
    # Use log scale for both axes
    pass

# Test your implementation
try:
    benchmark_operations()
    print("✓ Benchmarking complete!")
except Exception as e:
    print(f"Error: {e}")
    print("Make sure both functions and the benchmark are implemented.")

<details>
<summary>Click to reveal hint for Problem 2</summary>

**Hint**: Use the `time` module to measure the execution time of each function. For the loop version, iterate from 0 to the length of the array. For the vectorized version, simply use the `*` operator on the two NumPy arrays. Use `matplotlib` to plot the results with a logarithmic scale for both axes to better visualize the performance difference across a wide range of array sizes.

</details>

---

## Problem 3: Cache-Aware Matrix Operations (Medium)

### Contextual Introduction
Matrix multiplication is at the heart of deep learning, forming the basis of fully connected layers and convolutions. When matrices are large, the way we access their elements in memory can have a huge impact on performance. Modern CPUs have a memory hierarchy (caches) where accessing data that is already in a cache is much faster than fetching it from main memory. A naive matrix multiplication algorithm can have poor cache locality, leading to many cache misses and slow performance. By using a cache-aware technique like tiling (or blocking), we can significantly speed up the operation.

### Key Concepts
- **Cache Locality**: The principle that if a memory location is accessed, it's likely that nearby memory locations will be accessed soon (spatial locality) and the same location will be accessed again soon (temporal locality).
- **Tiling/Blocking**: A technique to improve cache locality by breaking down a large matrix multiplication into smaller matrix multiplications on sub-matrices (tiles) that can fit into the cache.
- **Numba**: A just-in-time (JIT) compiler for Python that can translate a subset of Python and NumPy code into fast machine code, often achieving C-like speeds.
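
To make spatial locality concrete, the sketch below computes where elements of a 2D matrix land in a flat row-major buffer, which is NumPy's default layout. The helper `row_major_offset` is a hypothetical name used only for illustration.

```python
def row_major_offset(i, j, n_cols):
    """Position of element (i, j) in a flat row-major buffer."""
    return i * n_cols + j

n_rows, n_cols = 4, 6

# Walking along row 1: consecutive offsets (stride 1), cache friendly.
row_walk = [row_major_offset(1, j, n_cols) for j in range(n_cols)]

# Walking down column 1: each step jumps n_cols elements.
col_walk = [row_major_offset(i, 1, n_cols) for i in range(n_rows)]

print(row_walk)  # [6, 7, 8, 9, 10, 11]
print(col_walk)  # [1, 7, 13, 19]
```

In a naive (i, j, k) matrix multiplication, the innermost loop walks down a column of the second matrix, so each step jumps a full row ahead in memory. Tiling keeps those jumps inside a block small enough to stay cached.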

### Problem Statement
Implement matrix multiplication using a standard naive approach and a cache-optimized tiled approach. Compare their performance against each other and against a Numba-optimized version.

**Requirements**:
- Implement a standard (i, j, k) matrix multiplication.
- Implement a tiled version of matrix multiplication.
- Use Numba's `@jit` decorator to create a JIT-compiled version.
- Benchmark the three versions for different matrix sizes.

### Example: Naive vs. Tiled Matrix Multiplication

```python
import numpy as np

# Small example to understand the concept
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# NAIVE APPROACH: Three nested loops (i, j, k)
# Access pattern: row A[i, :] (contiguous) and column B[:, j] (strided in row-major memory)
# The strided column walk causes many cache misses for large matrices
def matrix_mult_naive_small(A, B):
    m, k = A.shape
    k2, n = B.shape
    C = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for l in range(k):
                C[i, j] += A[i, l] * B[l, j]
    return C

result_naive = matrix_mult_naive_small(A, B)
# Result: [[19, 22], [43, 50]]

# TILED APPROACH: Break into smaller blocks
# Process tile_size x tile_size tiles that are small enough to stay in cache
# Same result, but better cache utilization for larger matrices
def matrix_mult_tiled_small(A, B, tile_size=2):
    m, k = A.shape
    k2, n = B.shape
    C = np.zeros((m, n))
    
    # Process in tile_size x tile_size blocks
    for i in range(0, m, tile_size):
        for j in range(0, n, tile_size):
            for l in range(0, k, tile_size):
                # Process the tile
                for i2 in range(i, min(i + tile_size, m)):
                    for j2 in range(j, min(j + tile_size, n)):
                        for l2 in range(l, min(l + tile_size, k)):
                            C[i2, j2] += A[i2, l2] * B[l2, j2]
    return C

result_tiled = matrix_mult_tiled_small(A, B, tile_size=2)
# Result: [[19, 22], [43, 50]]

print(f"Naive result matches tiled: {np.allclose(result_naive, result_tiled)}")
# Output: True

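# In compiled code (C, or Python compiled with Numba), tiling large matrices
# (512x512+) can give a noticeable speedup. In pure Python, interpreter
# overhead dominates, so the tiled loops may be no faster than the naive ones.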
```

### Your Exercise
Implement all three approaches and benchmark them:

In [None]:
import numpy as np
import time

def matrix_multiply_naive(A, B):
    """
    Standard matrix multiplication using nested loops.
    
    TODO: Implement this function
    - Check that inner dimensions match
    - Create a result matrix of zeros (m x n where A is m x k, B is k x n)
    - Use three nested loops (i, j, l) to compute the multiplication
    - Return the result
    """
    pass

def matrix_multiply_tiled(A, B, tile_size=64):
    """
    Cache-optimized matrix multiplication using tiling.
    
    TODO: Implement this function (optional, more advanced)
    - Break matrices into tile_size x tile_size blocks
    - Compute matrix multiplication on blocks
    - This improves cache locality
    """
    pass

def benchmark_matrix_multiply(sizes=[64, 128, 256, 512]):
    """
    Benchmark different matrix multiplication approaches.
    
    TODO: Implement this function
    - For each size in sizes:
      - Create random square matrices A, B of size x size
      - Time matrix_multiply_naive(A, B)
      - Store the timing results
    - Print or plot the results
    """
    results = {}
    for size in sizes:
        # TODO: Implement benchmarking
        pass
    
    return results

# Test your implementation
try:
    result = benchmark_matrix_multiply(sizes=[64, 128])
    print("✓ Matrix multiplication benchmarking complete!")
    print(f"Results: {result}")
except Exception as e:
    print(f"Error: {e}")
    print("Make sure the functions are implemented.")


<details>
<summary>Click to reveal hint for Problem 3</summary>

**Hint**: For the tiled version, you will need three nested loops to iterate over the tiles, and then three more nested loops to perform the multiplication within each tile. The key is that the inner loops will be operating on small sub-matrices that fit in the cache. For the Numba version, simply apply the `@jit(nopython=True)` decorator to your naive implementation.

</details>

---

## Problem 4: Custom Sparse Tensor Operations (Medium-Hard)

### Contextual Introduction
In many ML applications, such as natural language processing (e.g., bag-of-words or one-hot features) and recommendation systems, we deal with very high-dimensional but sparse data. For example, a user-item interaction matrix in a recommendation system might have millions of rows and columns, but each user has only interacted with a few items. Storing this as a dense matrix would be prohibitively expensive in terms of memory. Sparse tensors are data structures that only store the non-zero elements, making them much more memory-efficient.

### Key Concepts
- **Sparsity**: The fraction of zero elements in a tensor.
- **COO (Coordinate) format**: A way to represent a sparse tensor by storing a list of (row, column, value) tuples.
- **Memory Efficiency**: Sparse tensors have a memory usage of O(nnz), where nnz is the number of non-zero elements, as opposed to O(N*M) for a dense matrix.
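
The O(nnz) versus O(N*M) comparison can be turned into rough byte counts. The sketch below is back-of-the-envelope arithmetic, assuming 4-byte values and 8-byte indices. The function names and the chosen non-zero count are illustrative assumptions, not measurements.

```python
from math import prod

def dense_bytes(shape, value_bytes=4):
    """Bytes for a dense array: one value per cell."""
    return prod(shape) * value_bytes

def coo_bytes(shape, nnz, value_bytes=4, index_bytes=8):
    """Bytes for COO: per non-zero, one index per dimension plus one value."""
    return nnz * (len(shape) * index_bytes + value_bytes)

def sparsity(shape, nnz):
    """Fraction of cells that are zero."""
    return 1 - nnz / prod(shape)

shape = (10_000, 100_000)      # 1 billion cells
nnz = 1_000_000                # assumed: 0.1% of cells are non-zero
print(dense_bytes(shape))      # 4000000000  (4 GB)
print(coo_bytes(shape, nnz))   # 20000000    (20 MB)
print(sparsity((4, 8), 5))     # 0.84375
```

Under these assumptions COO costs 20 bytes per non-zero against 4 bytes per dense cell, so it only saves memory when fewer than one cell in five is non-zero.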

### Problem Statement
Implement a memory-efficient sparse tensor class using the COO format. The class should support conversion to and from a dense NumPy array, as well as a `memory_usage` method to demonstrate its efficiency.

### Example: Dense vs Sparse Representation

```python
import numpy as np

# Imagine a user-item interaction matrix (e.g., Netflix recommendations)
# Non-zero entry = the score a user gave an item, 0 = no interaction

# DENSE representation (wasteful!)
dense_matrix = np.array([
    [1, 0, 0, 0, 2, 0, 0, 0],  # User 1: watched item 0 (score 1), item 4 (score 2)
    [0, 0, 0, 3, 0, 0, 0, 0],  # User 2: watched item 3 (score 3)
    [0, 0, 0, 0, 0, 0, 0, 0],  # User 3: watched nothing
    [0, 4, 0, 0, 0, 0, 5, 0],  # User 4: watched items 1, 6 (scores 4, 5)
])
# This takes 4 * 8 = 32 elements, but only 5 are non-zero!
# Memory waste for 10,000 users × 100,000 items: 1 billion elements!

# SPARSE representation (efficient!)
# Store only (row, column, value) tuples for non-zero elements
sparse_data = {
    'indices': [(0, 0, 1), (0, 4, 2), (1, 3, 3), (3, 1, 4), (3, 6, 5)],
    'shape': (4, 8)
}
# Same information, only 5 tuples stored!
# Memory scales with number of non-zeros, not total elements

# Your task: Create a SparseTensor class that stores this efficiently
# and can convert back to dense when needed

# Example usage (what your code should do):
# sparse = SparseTensor((4, 8))
# sparse.from_dense(dense_matrix)  # Store efficiently
# 
# dense_recovered = sparse.to_dense()  # Convert back
# assert np.allclose(dense_recovered, dense_matrix)  # Verify correctness
# 
# print(sparse.sparsity())  # Output: 0.84375 (27 of 32 entries, or 84.375%, are zeros)
# print(sparse.memory_usage())  # Show memory savings
```

### Your Exercise
Implement the SparseTensor class:

**Requirements**:
- Use COO format (indices and values) to store non-zero elements.
- Implement `to_dense` and `from_dense` methods.
- Implement a `memory_usage` method that returns the memory used by the sparse representation.
- Compare the memory usage of the sparse tensor with its dense equivalent.

In [None]:
import numpy as np

class SparseTensor:
    """
    A simple sparse tensor representation in COO (coordinate) format.
    """
    
    def __init__(self, shape, data_dict=None):
        """
        Initialize sparse tensor.
        
        TODO: Initialize shape, indices, and values
        - Store shape as a tuple
        - Initialize empty storage for non-zero elements
        """
        pass
    
    def to_dense(self):
        """
        Convert sparse tensor to dense array.
        
        TODO: Implement this function
        - Create a dense numpy array of zeros with self.shape
        - Fill in the non-zero values at their correct positions
        - Return the dense array
        """
        pass
    
    def from_dense(self, dense_array, threshold=1e-10):
        """
        Convert dense array to sparse tensor format.
        
        TODO: Implement this function
        - Find all non-zero elements (use threshold for floating point)
        - Store their indices and values
        - Return self for chaining
        """
        pass

def test_sparse_tensor():
    """Test sparse tensor implementation."""
    # Create a simple sparse tensor
    shape = (5, 5)
    sparse = SparseTensor(shape)
    
    # Create a dense array with some zeros
    dense_array = np.array([
        [1, 0, 0, 2, 0],
        [0, 3, 0, 0, 0],
        [0, 0, 4, 0, 5],
        [0, 0, 0, 0, 0],
        [6, 0, 0, 0, 7]
    ], dtype=float)
    
    # TODO: Convert dense to sparse and back
    # Convert to sparse, then to dense, and verify they match
    
    print("✓ Sparse tensor tests passed!")

# Run tests
try:
    test_sparse_tensor()
except Exception as e:
    print(f"Error: {e}")


<details>
<summary>Click to reveal hint for Problem 4</summary>

**Hint**: For `from_dense`, use `np.argwhere` to find the indices of non-zero elements. For `to_dense`, create a zero-filled NumPy array of the correct shape and then fill in the non-zero values using the stored indices and values. The memory usage is the sum of the bytes used by the indices array and the values array, which you can get from the `.nbytes` attribute of a NumPy array.

</details>

---

## Problem 5: Memory Profiling and Optimization (Hard)

### Contextual Introduction
Training large neural networks, like the ones used for language modeling or image generation, is often limited by GPU memory. During the forward pass, the activations (outputs of each layer) are stored in memory to be used for gradient calculations in the backward pass. For very deep or wide networks, these activations can consume a huge amount of memory. Gradient checkpointing is a technique that trades compute for memory by not storing the activations for some layers and instead recomputing them during the backward pass. This can allow you to train much larger models than would otherwise fit in memory.

### Key Concepts
- **Memory Profiling**: The process of measuring and analyzing the memory usage of a program.
- **Gradient Checkpointing**: A technique to reduce the memory footprint of a neural network by recomputing activations during the backward pass instead of storing them.
- **Activation**: The output of a layer in a neural network.
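
The trade-off can be estimated with simple arithmetic before any profiling. The sketch below is a deliberately simplified model: it assumes every layer's activation has the same shape and that one segment's activations are alive at a time during recomputation. All function names are hypothetical.

```python
def activation_bytes(batch, hidden, dtype_bytes=4):
    """Bytes for one layer's output of shape (batch, hidden)."""
    return batch * hidden * dtype_bytes

def stored_without_checkpointing(n_layers, per_layer):
    """Every layer's activation is kept for the backward pass."""
    return n_layers * per_layer

def peak_with_checkpointing(n_layers, segment, per_layer):
    """Simplified model: keep one checkpoint per segment, plus the
    activations of a single segment while it is being recomputed."""
    n_checkpoints = -(-n_layers // segment)   # ceiling division
    recomputed = segment - 1                  # layers between checkpoints
    return (n_checkpoints + recomputed) * per_layer

per_layer = activation_bytes(batch=32, hidden=1000)
print(per_layer)                                      # 128000
print(stored_without_checkpointing(100, per_layer))   # 12800000
print(peak_with_checkpointing(100, 10, per_layer))    # 2432000
```

Under this model the 10 stored checkpoints account for 1.28 MB and the temporarily recomputed segment for the rest. Real frameworks also hold weights, gradients, and allocator overhead, so measured numbers will differ.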

### Problem Statement
Implement a simple memory profiler for PyTorch and use it to compare the memory usage of a neural network forward pass with and without gradient checkpointing.

**Requirements**:
- Create a `MemoryProfiler` class that can track peak GPU memory usage.
- Implement a simple multi-layer neural network.
- Implement a version of the forward pass that uses `torch.utils.checkpoint.checkpoint`.
- Profile the memory usage of both forward passes.
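
The idea of tracking a peak can be tried on the CPU with the standard library before moving to the GPU. The sketch below uses `tracemalloc`, which only sees allocations made through Python's memory allocators and does not observe GPU memory. The helper `peak_python_memory` is a hypothetical name, and this is a CPU-side analogue rather than the profiler the exercise asks for.

```python
import tracemalloc

def peak_python_memory(fn, *args):
    """Run fn(*args) and return (result, peak bytes allocated by Python)."""
    tracemalloc.start()
    try:
        result = fn(*args)
        _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    finally:
        tracemalloc.stop()
    return result, peak

def build_list(n):
    return [float(i) for i in range(n)]

_, small = peak_python_memory(build_list, 1_000)
_, large = peak_python_memory(build_list, 100_000)
print(f"1,000 floats:   {small / 1e6:.3f} MB")
print(f"100,000 floats: {large / 1e6:.3f} MB")
```

The same reset, run, read-peak pattern carries over to GPU profiling, with the framework's own counters replacing `tracemalloc`.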

### Example: Memory Profiling Concept

```python
import torch
import torch.nn as nn

# Understanding memory usage in neural networks:

# WITHOUT CHECKPOINTING:
# Forward pass stores ALL intermediate activations
# Layer 1 -> activation 1 (stored in memory)
# Layer 2 -> activation 2 (stored in memory)
# Layer 3 -> activation 3 (stored in memory)
# ...
# Layer 100 -> activation 100 (stored in memory)
# Total memory: sum of all 100 activations

# For a batch of 32 samples with hidden size 1000:
# Each activation ≈ 32 * 1000 * 4 bytes = 128KB
# 100 layers * 128KB = 12.8MB (this adds up!)

# BACKWARD PASS uses all stored activations to compute gradients
# gradients = compute_gradients(activation1, activation2, ...)

# WITH CHECKPOINTING:
# Forward pass stores ONLY checkpointed layers
# Layer 1 -> activation 1 (stored)
# Layer 2-10 -> activations NOT stored (will be recomputed)
# Layer 11 -> activation 11 (stored)
# ...
# Layer 91-100 -> activations NOT stored (will be recomputed)

# Backward pass recomputes activations as needed:
# To compute gradients for layer 9:
#   1. Recompute activation 9 from the nearest checkpoint (activation 1, through layers 2-9)
#   2. Use it to compute the gradient
#   3. Discard activation 9
#   4. Move to the next layer

# Result: stored activations drop from 12.8MB to ~1.3MB (10 checkpoints),
# plus the activations of one segment held temporarily while it is recomputed.
# Trade-off: extra compute, roughly one additional forward pass per training step

# Your task:
# 1. Create a MemoryProfiler to measure memory before/after operations
# 2. Create a network and measure memory DURING training
# 3. Show that checkpointing reduces peak memory usage
```

### Your Exercise
Implement memory profiling and gradient checkpointing:

In [None]:
import torch
import torch.nn as nn
import tracemalloc

class MemoryProfiler:
    """Simple memory profiler."""
    
    def __init__(self):
        """Initialize the profiler."""
        self.memory_snapshots = []
    
    def get_memory_usage(self):
        """
        Get current memory usage in MB.
        
        TODO: Implement this function
        - Use psutil or torch.cuda.memory_allocated() for GPU
        - Return memory in MB
        """
        pass
    
    def log_memory(self, label):
        """
        Log current memory usage with a label.
        
        TODO: Implement this function
        - Call get_memory_usage()
        - Store the result with the label
        """
        pass

class SimpleNetwork(nn.Module):
    """Simple neural network for testing."""
    
    def __init__(self, input_size, hidden_size, output_size):
        """
        Initialize the network.
        
        TODO: Implement this function
        - Create linear layers: input_size -> hidden_size -> output_size
        - Store them as self.fc1 and self.fc2
        """
        super().__init__()
        pass
    
    def forward(self, x):
        """
        Forward pass.
        
        TODO: Implement this function
        - Pass through fc1
        - Apply ReLU activation
        - Pass through fc2
        - Return result
        """
        pass

def profile_network_memory():
    """Profile memory usage of a neural network."""
    profiler = MemoryProfiler()
    
    # TODO: Implement profiling
    # - Create a network
    # - Create input batch
    # - Log memory before forward pass
    # - Run forward pass
    # - Log memory after forward pass
    # - Compute a scalar loss from the output and call backward() on it
    # - Log memory after backward
    
    print("✓ Memory profiling complete!")

# Run the profiler
try:
    profile_network_memory()
except Exception as e:
    print(f"Error: {e}")


<details>
<summary>Click to reveal hint for Problem 5</summary>

**Hint**: Use `torch.cuda.reset_peak_memory_stats()` before your forward pass and `torch.cuda.max_memory_allocated()` after it to get the peak memory usage (both require a CUDA device). For the checkpointed forward pass, wrap the application of each layer (or a sequence of layers) in the `torch.utils.checkpoint.checkpoint` function. This function takes a function to be run (e.g., a lambda that applies the layer) and the input to that function. Recent PyTorch releases also ask you to pass `use_reentrant` explicitly; `use_reentrant=False` is the recommended setting.

</details>