# Chapter 1: Data Structures & Complexity for ML Engineers

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jmamath/interview_prep/blob/main/chapter_01_data_structures_complexity.ipynb)

## Introduction

As a machine learning engineer at Google DeepMind, I’ve seen firsthand how the performance of a model can make or break a project. A model that takes days to train or milliseconds too long to generate a prediction can be the difference between a groundbreaking innovation and a failed experiment. And often, the key to unlocking that performance lies not in the high-level model architecture, but in the low-level details of how we handle data.

This chapter is about those details. We’re going to take a step back from the glamour of neural network design and get our hands dirty with the fundamentals of data structures and algorithmic complexity. Why? Because in the world of large-scale machine learning, a seemingly small choice—like how you iterate through a tensor or whether you update an array in-place—can have a massive impact on performance and memory usage.

By the end of this chapter, you’ll have a deeper appreciation for the importance of data structures and complexity analysis in machine learning, and you’ll be equipped with the practical skills to write more efficient and scalable ML code. Let's get started!

## Learning Objectives
- Analyze algorithm complexity (Big O time and space)
- Understand memory layout, cache effects, and vectorization
- Implement memory-efficient algorithms
- Profile and optimize memory usage

---

## Problem 1: Memory-Efficient Prefix Sum (Easy)

### Contextual Introduction
In many ML applications, especially in sequence modeling and data analysis, we need to compute running totals or cumulative sums. For example, when implementing custom attention mechanisms or calculating running statistics in a data stream, prefix sums are a fundamental operation. A naive implementation might create a new array to store the cumulative sums, but this can be memory-intensive for large sequences. By performing the operation in-place, we can save memory and improve performance, which is crucial when working with large datasets and models.

### Key Concepts
- **In-place operations**: Modifying the input data directly without creating a copy.
- **Space Complexity**: The amount of memory an algorithm needs. An in-place algorithm typically uses O(1) auxiliary space, meaning the extra memory it allocates does not grow with the input size.
- **Time Complexity**: The amount of time an algorithm takes to run. A single pass through an array is O(n).

### Problem Statement
Implement a memory-efficient prefix sum algorithm that computes the cumulative sum of an array in-place.

**Requirements**:
- Modify the input array in-place (O(1) extra space).
- Time complexity should be O(n).
- Handle edge cases like an empty array or a single-element array.

**Example**:
```python
arr = [1, 2, 3, 4, 5]
prefix_sum_inplace(arr)
print(arr)  # Expected output: [1, 3, 6, 10, 15]
```

In [None]:
def prefix_sum_inplace(arr):
    """Computes the prefix sum of an array in-place."""
    for i in range(1, len(arr)):
        arr[i] += arr[i-1]

def test_prefix_sum_inplace():
    """Tests the in-place prefix sum implementation."""
    # Test case 1: Basic functionality
    arr1 = [1, 2, 3, 4, 5]
    prefix_sum_inplace(arr1)
    assert arr1 == [1, 3, 6, 10, 15], f"Test 1 Failed: {arr1}"

    # Test case 2: Empty array
    arr2 = []
    prefix_sum_inplace(arr2)
    assert arr2 == [], f"Test 2 Failed: {arr2}"

    # Test case 3: Single element
    arr3 = [10]
    prefix_sum_inplace(arr3)
    assert arr3 == [10], f"Test 3 Failed: {arr3}"

    # Test case 4: Array with negative numbers
    arr4 = [1, -2, 3, -4, 5]
    prefix_sum_inplace(arr4)
    assert arr4 == [1, -1, 2, -2, 3], f"Test 4 Failed: {arr4}"

    print("🎉 All prefix sum tests passed!")

test_prefix_sum_inplace()

<details>
<summary>Click to reveal hint for Problem 1</summary>

**Hint**: Iterate through the array starting from the second element. For each element at index `i`, update it by adding the value of the element at index `i-1`. This way, each element becomes the sum of itself and all previous elements.

</details>

---

## Problem 2: Tensor Operation Benchmarking (Easy-Medium)

### Contextual Introduction
In deep learning, we work with large multi-dimensional arrays called tensors. Operations on these tensors, like element-wise multiplication, are the building blocks of neural networks. A naive way to implement these operations is to use Python loops, but this is often orders of magnitude slower, because every iteration pays the interpreter's per-element overhead. Numerical libraries like NumPy and PyTorch provide vectorized operations, which are implemented in C or CUDA and apply an operation to entire arrays in a single call. Understanding the performance difference between loops and vectorization is fundamental for writing efficient ML code.

### Key Concepts
- **Vectorization**: Performing operations on entire arrays at once, rather than element by element.
- **SIMD (Single Instruction, Multiple Data)**: A hardware feature that allows a single instruction to be applied to multiple data points simultaneously. The compiled loops behind vectorized operations can use SIMD instructions where the hardware and data type allow it.
- **Benchmarking**: Measuring the performance of code, typically in terms of execution time and memory usage.
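
A single timing is noisy, so one common approach is to repeat the measurement and keep the fastest run. The sketch below assumes NumPy is installed and uses the standard-library `timeit` module; `best_time` and `multiply_loop` are hypothetical helper names introduced only for this illustration.

```python
import timeit

import numpy as np


def best_time(fn, *args, repeat=5):
    """Return the fastest of `repeat` single runs; the minimum is least affected by noise."""
    return min(timeit.repeat(lambda: fn(*args), repeat=repeat, number=1))


def multiply_loop(a, b):
    """Element-wise product computed one element at a time in Python."""
    out = np.empty_like(a)
    for i in range(len(a)):
        out[i] = a[i] * b[i]
    return out


rng = np.random.default_rng(0)
a = rng.random(100_000)
b = rng.random(100_000)

t_loop = best_time(multiply_loop, a, b)
t_vec = best_time(np.multiply, a, b)
print(f"loop: {t_loop:.4f}s  vectorized: {t_vec:.6f}s  speedup: {t_loop / t_vec:.0f}x")
```

The exact speedup depends on the machine and the array size, which is why the exercise asks you to measure it across several sizes.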

### Problem Statement
Compare the performance of Python loops vs. NumPy vectorized operations for element-wise multiplication on large tensors. You will implement both versions, measure their execution time and memory usage for different array sizes, and create a plot to visualize the performance difference.

**Requirements**:
- Implement both loop-based and vectorized versions of element-wise multiplication.
- Measure and plot execution time and memory usage for different array sizes.
- Analyze the speedup of the vectorized version over the loop version.
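
For the memory half of the requirements, one possible approach is the standard-library `tracemalloc` module. This is a sketch under the assumption that NumPy reports its buffer allocations to `tracemalloc` (recent NumPy releases do); `peak_memory_bytes` is a hypothetical helper name, not part of any library.

```python
import tracemalloc

import numpy as np


def peak_memory_bytes(fn, *args):
    """Run fn(*args) and return the peak memory traced during the call, in bytes."""
    tracemalloc.start()
    try:
        fn(*args)
        _, peak = tracemalloc.get_traced_memory()  # returns (current, peak)
    finally:
        tracemalloc.stop()
    return peak


a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# The inputs were allocated before tracing started, so the peak mostly
# reflects the result array created inside the call.
peak = peak_memory_bytes(np.multiply, a, b)
print(f"peak during a * b: {peak / 1024**2:.2f} MiB")
```

Tracing adds overhead, so it is usually better to measure memory and time in separate runs.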

In [None]:
import numpy as np
import time
import matplotlib.pyplot as plt

def elementwise_multiply_loop(a, b):
    result = np.zeros_like(a)
    for i in range(len(a)):
        result[i] = a[i] * b[i]
    return result

def elementwise_multiply_vectorized(a, b):
    return a * b

def benchmark_operations():
    sizes = [10**i for i in range(1, 7)]
    loop_times = []
    vectorized_times = []

    for size in sizes:
        a = np.random.rand(size)
        b = np.random.rand(size)

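        # perf_counter is a monotonic, high-resolution clock; time.time() can
        # report 0.0 for very short runs, which breaks the log-scale plot.
        start_time = time.perf_counter()
        elementwise_multiply_loop(a, b)
        loop_times.append(time.perf_counter() - start_time)

        start_time = time.perf_counter()
        elementwise_multiply_vectorized(a, b)
        vectorized_times.append(time.perf_counter() - start_time)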

    # Plotting the results
    plt.figure(figsize=(10, 6))
    plt.plot(sizes, loop_times, 'o-', label='Loop')
    plt.plot(sizes, vectorized_times, 'o-', label='Vectorized')
    plt.title('Loop vs. Vectorized Performance')
    plt.xlabel('Array Size')
    plt.ylabel('Execution Time (seconds)')
    plt.xscale('log')
    plt.yscale('log')
    plt.legend()
    plt.grid(True)
    plt.show()

benchmark_operations()

<details>
<summary>Click to reveal hint for Problem 2</summary>

**Hint**: Use the `time` module to measure the execution time of each function; `time.perf_counter()` is better suited to short intervals than `time.time()` because it has higher resolution. For the loop version, iterate from 0 to the length of the array. For the vectorized version, simply use the `*` operator on the two NumPy arrays. Use `matplotlib` to plot the results with a logarithmic scale for both axes to better visualize the performance difference across a wide range of array sizes.

</details>

---

## Problem 3: Cache-Aware Matrix Operations (Medium)

### Contextual Introduction
Matrix multiplication is at the heart of deep learning, forming the basis of fully connected layers and convolutions. When matrices are large, the way we access their elements in memory can have a huge impact on performance. Modern CPUs have a memory hierarchy (caches) where accessing data that is already in a cache is much faster than fetching it from main memory. A naive matrix multiplication algorithm can have poor cache locality, leading to many cache misses and slow performance. By using a cache-aware technique like tiling (or blocking), we can significantly speed up the operation.

### Key Concepts
- **Cache Locality**: The principle that if a memory location is accessed, it's likely that nearby memory locations will be accessed soon (spatial locality) and the same location will be accessed again soon (temporal locality).
- **Tiling/Blocking**: A technique to improve cache locality by breaking down a large matrix multiplication into smaller matrix multiplications on sub-matrices (tiles) that can fit into the cache.
- **Numba**: A just-in-time (JIT) compiler for Python that can translate a subset of Python and NumPy code into fast machine code, often achieving C-like speeds.
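
Spatial locality is easier to reason about once you look at how an array is laid out in memory. The short sketch below, assuming NumPy is installed, inspects the strides (bytes to step along each axis) of a row-major array and of its transpose.

```python
import numpy as np

A = np.zeros((1000, 1000))  # float64, row-major (C order) by default

print(A.strides)    # (8000, 8): the next column is 8 bytes away, the next row 8000
print(A.T.strides)  # (8, 8000): the transpose is a view with swapped strides

row = A[0, :]  # 1000 adjacent values
col = A[:, 0]  # 1000 values, each 8000 bytes apart
print(row.flags["C_CONTIGUOUS"], col.flags["C_CONTIGUOUS"])  # True False
```

In the naive (i, j, k) loop, the innermost index walks along a row of `A` but down a column of `B`, so the accesses to `B` are the strided ones.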

### Problem Statement
Implement matrix multiplication using a standard naive approach and a cache-optimized tiled approach. Compare their performance against each other and against a Numba-optimized version.

**Requirements**:
- Implement a standard (i, j, k) matrix multiplication.
- Implement a tiled version of matrix multiplication.
- Use Numba's `@jit` decorator to create a JIT-compiled version.
- Benchmark the three versions for different matrix sizes.

In [None]:
import numpy as np
import time
from numba import jit

def matrix_multiply_naive(A, B):
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "Inner dimensions must match"
    C = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for l in range(k):
                C[i, j] += A[i, l] * B[l, j]
    return C

def matrix_multiply_tiled(A, B, tile_size=16):
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "Inner dimensions must match"
    C = np.zeros((m, n))
    for i0 in range(0, m, tile_size):
        for j0 in range(0, n, tile_size):
            for l0 in range(0, k, tile_size):
                for i in range(i0, min(i0 + tile_size, m)):
                    for j in range(j0, min(j0 + tile_size, n)):
                        for l in range(l0, min(l0 + tile_size, k)):
                            C[i, j] += A[i, l] * B[l, j]
    return C

@jit(nopython=True)
def matrix_multiply_numba(A, B):
    m, k = A.shape
    k2, n = B.shape
    C = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for l in range(k):
                C[i, j] += A[i, l] * B[l, j]
    return C

def benchmark_matrix_multiplication():
    sizes = [64, 128, 256, 512]
    results = {s: {} for s in sizes}

    for size in sizes:
        A = np.random.rand(size, size)
        B = np.random.rand(size, size)

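        # perf_counter gives a monotonic, high-resolution reading for each run
        start_time = time.perf_counter()
        matrix_multiply_naive(A, B)
        results[size]['naive'] = time.perf_counter() - start_time

        start_time = time.perf_counter()
        matrix_multiply_tiled(A, B)
        results[size]['tiled'] = time.perf_counter() - start_time

        # Warm up Numba so JIT compilation time is excluded from the measurement
        matrix_multiply_numba(A, B)
        start_time = time.perf_counter()
        matrix_multiply_numba(A, B)
        results[size]['numba'] = time.perf_counter() - start_time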

    # Print results
    for size, timings in results.items():
        print(f"Matrix size: {size}x{size}")
        print(f"  Naive: {timings['naive']:.4f}s")
        print(f"  Tiled: {timings['tiled']:.4f}s")
        print(f"  Numba: {timings['numba']:.4f}s")

benchmark_matrix_multiplication()

<details>
<summary>Click to reveal hint for Problem 3</summary>

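**Hint**: For the tiled version, you will need three nested loops to iterate over the tiles, and then three more nested loops to perform the multiplication within each tile. The key is that the inner loops operate on small sub-matrices that fit in the cache. Note that in pure Python, interpreter overhead dominates each iteration, so the tiled version may run no faster than the naive one; the cache benefit shows up once the loops are compiled (for example, by applying Numba to the tiled function as well). For the Numba version, simply apply the `@jit(nopython=True)` decorator to your naive implementation.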

</details>
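
To check that a tiling scheme visits every block exactly once, it can help to write the tile loops with NumPy slices and compare the result against `A @ B`. This is a minimal sketch, assuming NumPy is installed; `matmul_blocked` is a hypothetical name, and it swaps the three scalar inner loops of the exercise for NumPy's `@` on sub-matrices.

```python
import numpy as np


def matmul_blocked(A, B, tile=64):
    """Tiled matrix product: each step multiplies one pair of sub-matrices."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "Inner dimensions must match"
    C = np.zeros((m, n), dtype=np.result_type(A, B))
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for l0 in range(0, k, tile):
                # Slices clip at the array edge, so ragged final tiles
                # need no special handling.
                C[i0:i0 + tile, j0:j0 + tile] += (
                    A[i0:i0 + tile, l0:l0 + tile] @ B[l0:l0 + tile, j0:j0 + tile]
                )
    return C


rng = np.random.default_rng(0)
A = rng.random((130, 70))
B = rng.random((70, 95))
print(np.allclose(matmul_blocked(A, B, tile=32), A @ B))  # True
```

The shapes are deliberately not multiples of the tile size, so the check also covers the partial tiles at the edges.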

---

## Problem 4: Custom Sparse Tensor Operations (Medium-Hard)

### Contextual Introduction
In many ML applications, such as natural language processing (e.g., bag-of-words or one-hot token features) and recommendation systems, we deal with very high-dimensional but sparse data. For example, a user-item interaction matrix in a recommendation system might have millions of rows and columns, but each user has only interacted with a few items. Storing this as a dense matrix would be prohibitively expensive in terms of memory. Sparse tensors are data structures that only store the non-zero elements, making them much more memory-efficient.

### Key Concepts
- **Sparsity**: The fraction of zero elements in a tensor.
- **COO (Coordinate) format**: A way to represent a sparse tensor by storing a list of (row, column, value) tuples.
- **Memory Efficiency**: Sparse tensors have a memory usage of O(nnz), where nnz is the number of non-zero elements, as opposed to O(N*M) for a dense matrix.
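
The O(nnz) bound hides a constant: each stored element also carries its coordinates. The arithmetic below is a rough sketch assuming 8-byte values (float64) and 8-byte indices (int64); `dense_bytes` and `coo_bytes` are hypothetical helper names used only for this estimate.

```python
import math


def dense_bytes(shape, value_bytes=8):
    """Memory for a dense array: one value per element."""
    return math.prod(shape) * value_bytes


def coo_bytes(nnz, ndim, value_bytes=8, index_bytes=8):
    """Memory for COO: each non-zero stores `ndim` coordinates plus one value."""
    return nnz * (ndim * index_bytes + value_bytes)


shape = (1000, 1000)
print(dense_bytes(shape))    # 8000000
print(coo_bytes(2, ndim=2))  # 48

# Break-even for this shape: 24 bytes per non-zero versus 8 bytes per element.
print(coo_bytes(333_333, 2) < dense_bytes(shape))  # True
print(coo_bytes(333_334, 2) < dense_bytes(shape))  # False
```

Under these assumptions, a 2-D COO tensor is smaller than its dense equivalent only while fewer than about one third of the elements are non-zero.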

### Problem Statement
Implement a memory-efficient sparse tensor class using the COO format. The class should support conversion to and from a dense NumPy array, as well as a `memory_usage` method to demonstrate its efficiency.

**Requirements**:
- Use COO format (indices and values) to store non-zero elements.
- Implement `to_dense` and `from_dense` methods.
- Implement a `memory_usage` method that returns the memory used by the sparse representation.
- Compare the memory usage of the sparse tensor with its dense equivalent.

In [None]:
import numpy as np

class SparseTensor:
    def __init__(self, shape, indices, values):
        self.shape = shape
        self.indices = indices
        self.values = values

    @classmethod
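    def from_dense(cls, dense_tensor):
        # argwhere returns one row of coordinates per non-zero element
        indices = np.argwhere(dense_tensor != 0)
        # Index with a tuple of per-axis coordinate arrays so any rank works,
        # not only 2-D matrices
        values = dense_tensor[tuple(indices.T)]
        return cls(dense_tensor.shape, indices, values)

    def to_dense(self):
        # Keep the dtype of the stored values instead of defaulting to float64
        dense_tensor = np.zeros(self.shape, dtype=self.values.dtype)
        # Scatter all values in one vectorized assignment
        dense_tensor[tuple(self.indices.T)] = self.values
        return dense_tensor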

    def memory_usage(self):
        return self.indices.nbytes + self.values.nbytes

def test_sparse_tensor():
    # Create a large, sparse dense tensor
    dense_tensor = np.zeros((1000, 1000))
    dense_tensor[10, 20] = 5
    dense_tensor[100, 200] = 10

    # Convert to sparse tensor
    sparse_tensor = SparseTensor.from_dense(dense_tensor)

    # Check memory usage
    dense_memory = dense_tensor.nbytes
    sparse_memory = sparse_tensor.memory_usage()

    # Report bytes as well: the sparse size is far too small to show up in MB
    print(f"Dense tensor memory: {dense_memory:,} bytes ({dense_memory / 1024**2:.4f} MB)")
    print(f"Sparse tensor memory: {sparse_memory:,} bytes")
    print(f"Memory savings: {1 - sparse_memory / dense_memory:.2%}")

    # Verify correctness
    reconstructed_dense = sparse_tensor.to_dense()
    assert np.allclose(dense_tensor, reconstructed_dense)
    print("\n🎉 Sparse tensor implementation is correct and memory-efficient!")

test_sparse_tensor()

<details>
<summary>Click to reveal hint for Problem 4</summary>

**Hint**: For `from_dense`, use `np.argwhere` to find the indices of non-zero elements. For `to_dense`, create a zero-filled NumPy array of the correct shape and then fill in the non-zero values using the stored indices and values. The memory usage is the sum of the bytes used by the indices array and the values array, which you can get from the `.nbytes` attribute of a NumPy array.

</details>
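
Beyond conversion, the COO layout supports computing directly on the stored entries. As an illustration that goes slightly past the exercise, here is a hedged sketch of a matrix-vector product for a 2-D COO matrix, assuming NumPy is installed and the same `(nnz, 2)` index layout produced by `np.argwhere`; `coo_matvec` is a hypothetical function name.

```python
import numpy as np


def coo_matvec(shape, indices, values, x):
    """Compute y = A @ x for a 2-D matrix A stored as COO (indices: (nnz, 2))."""
    rows, cols = indices[:, 0], indices[:, 1]
    y = np.zeros(shape[0], dtype=np.result_type(values, x))
    # np.add.at accumulates correctly even when several entries share a row.
    np.add.at(y, rows, values * x[cols])
    return y


dense = np.array([[0.0, 2.0, 0.0],
                  [1.0, 0.0, 3.0],
                  [0.0, 0.0, 0.0]])
indices = np.argwhere(dense != 0)
values = dense[tuple(indices.T)]
x = np.array([1.0, 10.0, 100.0])

print(coo_matvec(dense.shape, indices, values, x).tolist())  # [20.0, 301.0, 0.0]
print((dense @ x).tolist())                                  # [20.0, 301.0, 0.0]
```

The work is proportional to the number of stored entries rather than to the full matrix size, which is the practical payoff of the sparse representation.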

---

## Problem 5: Memory Profiling and Optimization (Hard)

### Contextual Introduction
Training large neural networks, like the ones used for language modeling or image generation, is often limited by GPU memory. During the forward pass, the activations (outputs of each layer) are stored in memory to be used for gradient calculations in the backward pass. For very deep or wide networks, these activations can consume a huge amount of memory. Gradient checkpointing is a technique that trades compute for memory by not storing the activations for some layers and instead recomputing them during the backward pass. This can allow you to train much larger models than would otherwise fit in memory.

### Key Concepts
- **Memory Profiling**: The process of measuring and analyzing the memory usage of a program.
- **Gradient Checkpointing**: A technique to reduce the memory footprint of a neural network by recomputing activations during the backward pass instead of storing them.
- **Activation**: The output of a layer in a neural network.
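
Before profiling, a back-of-envelope estimate helps set expectations. The arithmetic below uses the sizes from this problem's code (batch 16, input 1024, hidden 4096, 8 linear layers, float32) and assumes, as a simplification, that one `(batch, hidden)` tensor is kept per layer for the backward pass; what autograd actually stores may differ.

```python
def mib(num_bytes):
    """Convert bytes to mebibytes."""
    return num_bytes / 1024**2


batch, input_size, hidden, num_layers = 16, 1024, 4096, 8
bytes_per_float = 4  # float32

# Weights and biases of the Linear layers.
params = (input_size * hidden + hidden) + (num_layers - 1) * (hidden * hidden + hidden)
param_bytes = params * bytes_per_float

# Assumption: one (batch, hidden) float32 tensor is kept per layer for backward.
activation_bytes = num_layers * batch * hidden * bytes_per_float

print(f"parameters:  {mib(param_bytes):.3f} MiB")       # parameters:  464.125 MiB
print(f"activations: {mib(activation_bytes):.3f} MiB")  # activations: 2.000 MiB
```

With a batch of 16, the parameters dominate the peak, so the relative saving from checkpointing will look small; larger batches or longer sequences shift the balance toward activations.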

### Problem Statement
Implement a simple memory profiler for PyTorch and use it to compare the memory usage of a neural network forward pass with and without gradient checkpointing.

**Requirements**:
- Create a `MemoryProfiler` class that can track peak GPU memory usage.
- Implement a simple multi-layer neural network.
- Implement a version of the forward pass that uses `torch.utils.checkpoint.checkpoint`.
- Profile the memory usage of both forward passes and compare them.

In [None]:
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class MemoryProfiler:
    def __init__(self, model, device):
        self.model = model
        self.device = device

    def profile(self, forward_pass_func, *args):
        torch.cuda.reset_peak_memory_stats(self.device)
        torch.cuda.synchronize()
        _ = forward_pass_func(*args)
        torch.cuda.synchronize()
        peak_memory = torch.cuda.max_memory_allocated(self.device) / 1024**2
        return peak_memory

class LargeNeuralNetwork(nn.Module):
    def __init__(self, input_size=1024, hidden_size=4096, num_layers=8):
        super().__init__()
        self.layers = nn.ModuleList()
        self.layers.append(nn.Linear(input_size, hidden_size))
        for _ in range(num_layers - 1):
            self.layers.append(nn.Linear(hidden_size, hidden_size))

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x

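    def forward_with_checkpointing(self, x):
        for layer in self.layers:
            # Bind `layer` as a default argument: the checkpointed function is
            # re-run during backward, after the loop variable has moved on.
            x = checkpoint(lambda y, layer=layer: torch.relu(layer(y)), x, use_reentrant=False)
        return x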

def test_memory_profiling():
    if not torch.cuda.is_available():
        print("CUDA not available, skipping memory profiling test.")
        return

    device = torch.device("cuda")
    model = LargeNeuralNetwork().to(device)
    profiler = MemoryProfiler(model, device)

    # Profile standard forward pass
    input_tensor = torch.randn(16, 1024, device=device)
    peak_mem_standard = profiler.profile(model.forward, input_tensor)
    print(f"Standard forward pass peak memory: {peak_mem_standard:.2f} MB")

    # Profile checkpointed forward pass
    peak_mem_checkpointed = profiler.profile(model.forward_with_checkpointing, input_tensor)
    print(f"Checkpointed forward pass peak memory: {peak_mem_checkpointed:.2f} MB")

    # Compare results
    print(f"Memory savings with checkpointing: {1 - peak_mem_checkpointed / peak_mem_standard:.2%}")
    assert peak_mem_checkpointed < peak_mem_standard
    print("\n🎉 Gradient checkpointing successfully reduced memory usage!")

test_memory_profiling()

<details>
<summary>Click to reveal hint for Problem 5</summary>

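**Hint**: Use `torch.cuda.reset_peak_memory_stats()` before your forward pass and `torch.cuda.max_memory_allocated()` after to get the peak memory usage. For the checkpointed forward pass, wrap the application of each layer (or a sequence of layers) in the `torch.utils.checkpoint.checkpoint` function. This function takes a function to be run (e.g., a lambda that applies the layer) and the input to that function. If you create the lambda inside a loop, bind the current layer explicitly (e.g., `lambda y, layer=layer: ...`), because the function is called again during the backward pass, after the loop variable has changed. Recent PyTorch versions also expect `use_reentrant` to be passed explicitly; `use_reentrant=False` is the recommended setting.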

</details>