# Distributed and Accelerated Computing

What happens when we run some Python code? First, the source code is compiled into bytecode. The bytecode is then loaded into the Python virtual machine, which executes it step by step on the CPU. When we execute code we generally want it to run as fast as possible, but it will always be limited by some resource. Profiling our code is essential to understanding where this limitation occurs, be it compute, memory, or I/O. Once we understand what is limiting our code, we can start to apply strategies to reduce the bottleneck.

In this workshop we will explore how to analyze our code to understand its bottlenecks and discuss tooling and approaches to address them.
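
As a rough illustration of profiling, the sketch below uses the standard library's `cProfile` and `pstats` modules to see where a small job spends its time. The functions `slow_step`, `fast_step`, and `job` are hypothetical stand-ins invented for this example, not part of the workshop code.

```python
import cProfile
import io
import pstats
import time


def slow_step():
    # hypothetical stand-in for a step that dominates the runtime
    time.sleep(0.1)


def fast_step():
    # hypothetical stand-in for a cheap step
    return sum(range(1000))


def job():
    slow_step()
    fast_step()


# collect profiling data while the job runs
profiler = cProfile.Profile()
profiler.enable()
job()
profiler.disable()

# write the five most expensive entries (by cumulative time) to a string
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(5)
report = buffer.getvalue()
print(report)
```

In the report, `slow_step` should appear near the top of the cumulative-time ranking, which is the kind of signal that points us at a bottleneck.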

## Bottlenecks
Bottlenecks will always exist in our code; otherwise it would run instantly.

### Compute
When executing code on the CPU, the lines of code are executed sequentially, and sometimes our operations simply take too long. In this scenario we may try to exploit some parallelizable aspect of the computation and distribute it across multiple compute units, either CPUs or GPUs.

We refer to spreading the computation across multiple CPU cores, or even multiple machines with multiple CPUs, as distributed computing. We refer to vectorizing the computation on a GPU or on an application-specific integrated circuit (ASIC) as accelerated computing.

Below is an example of a job that might benefit from parallelization, as each computation is independent of the others.

In [1]:
import time

# define some computation which takes a long time
def computation(x):
    time.sleep(1.0)
    return x ** 2

# define the inputs and outputs
inputs = [1, 2, 3]
outputs = []

# perform the computation one by one
for val in inputs:
    output = computation(val)
    outputs.append(output)

In [2]:
# distribute the computation ....
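
One possible way to distribute this job is sketched below, using `concurrent.futures.ThreadPoolExecutor` from the standard library. This is an assumption about how the placeholder cell could be filled in, not the workshop's own solution. Threads are enough here because `time.sleep` only waits; for genuinely CPU-bound Python code a process pool would usually be the better fit. The sleep is shortened to 0.2 seconds to keep the example quick.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def computation(x):
    # stand-in for slow work (shortened from the 1.0 s used above)
    time.sleep(0.2)
    return x ** 2


inputs = [1, 2, 3]

start = time.perf_counter()
# run each independent computation in its own worker thread
with ThreadPoolExecutor(max_workers=len(inputs)) as pool:
    # pool.map returns results in the same order as the inputs
    outputs = list(pool.map(computation, inputs))
elapsed = time.perf_counter() - start

print(outputs)
print(f"elapsed: {elapsed:.2f}s")
```

Because the three calls wait at the same time, the elapsed time should be close to that of a single call rather than the sum of all three.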

Below is an example of a job that might benefit from vectorization. Rather than iterating over each value, we can structure the computation so that all of the operations are performed at the same time. This is what GPUs in particular are very good at.

In [11]:
import numpy as np

sz = 100
A = np.random.randn(sz,sz)
B = np.random.randn(sz,sz)


def matrix_multiply(A, B):
    # Create the result matrix with zeros
    output = np.zeros((A.shape[0], B.shape[1]))
    
    # Perform matrix multiplication
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            output[i][j] = sum(A[i][k] * B[k][j] for k in range(A.shape[1]))
    
    return output

# perform the computation
output = matrix_multiply(A, B)

In [4]:
# perform vectorized matrix multiplication
output = np.matmul(A, B)
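
To make the comparison concrete, the following sketch times both versions on the CPU and checks that they produce the same result. It is illustrative only: the matrix size (50) and the fixed random seed are assumptions chosen to keep the run short, and the measured times will vary by machine.

```python
import time

import numpy as np


def matrix_multiply(A, B):
    # explicit Python loops, same algorithm as the cell above
    output = np.zeros((A.shape[0], B.shape[1]))
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            output[i, j] = sum(A[i, k] * B[k, j] for k in range(A.shape[1]))
    return output


# assumed size and seed, chosen only to keep the example fast and repeatable
rng = np.random.default_rng(0)
sz = 50
A = rng.standard_normal((sz, sz))
B = rng.standard_normal((sz, sz))

start = time.perf_counter()
loop_output = matrix_multiply(A, B)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
vectorized_output = np.matmul(A, B)
vectorized_seconds = time.perf_counter() - start

# both approaches should agree up to floating-point rounding
results_match = np.allclose(loop_output, vectorized_output)
print(f"loop: {loop_seconds:.4f}s, vectorized: {vectorized_seconds:.6f}s")
print(f"results match: {results_match}")
```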

### Memory
Another bottleneck we may encounter is memory, where the data or the intermediate results of a computation are simply too large to hold in memory. In this scenario we may try to break the computation up into more manageable chunks and process them individually.

We refer to this as chunking, and often we might also decide to distribute the computation being performed on each of these chunks.

In [5]:
from pathlib import Path
Path("..").joinpath("data").mkdir(exist_ok=True, parents=True)

Suppose you are trying to read in a massive CSV file containing billions of transactions, and you want to compute the net value of the transactions.

In [6]:
import random
import pandas as pd
N = 1000000
lines = {
    "user": [random.randint(0,10) for _ in range(N)],
    "transaction_value": [100 * random.random() for _ in range(N)]
}
pd.DataFrame(lines).to_csv("../data/huge_file.csv")

In [7]:
import pandas as pd

# say this doesn't fit into memory
df = pd.read_csv("../data/huge_file.csv") # e.g. 100TiB of memory?!


df["transaction_value"].sum()

np.float64(50026984.48608181)

Loading this entire file into memory at once would use more memory than your machine has available. If we can't load the data, does that mean we can't do the computation at all?

In [8]:
chunk_sum = 0
rows_per_chunk = 1000

# process the file chunk by chunk
for chunk in pd.read_csv("../data/huge_file.csv", chunksize=rows_per_chunk):
    # process each chunk independently : QUICK!
    chunk_sum += chunk["transaction_value"].sum() 
    
chunk_sum

np.float64(50026984.4860818)

The chunked total agrees with the full-file total up to the last printed digit. The tiny difference arises because floating-point addition is not associative, so summing chunk by chunk changes the order of the additions.

### I/O
When reading from or writing to the filesystem we can also be limited by the I/O speed. In this scenario we may try to parallelize the reads and writes to saturate the I/O bandwidth.

In [9]:
# do we need to process the files one at a time if they're independent?
for file in ["../data/huge_file.csv"] * 10:
    df = pd.read_csv(file)

In [10]:
# distribute the computation ....
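
One way the reads could be issued concurrently is sketched below with a thread pool. This is an assumption about how the placeholder cell might be completed, not the workshop's own solution. To stay self-contained it writes a few small hypothetical files (`part_0.csv` and so on) into a temporary directory instead of using `huge_file.csv`. Whether threads actually help depends on the storage and on how much of the time is spent waiting on I/O rather than parsing.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import pandas as pd

# create a few small, independent CSV files (hypothetical stand-ins)
tmp_dir = Path(tempfile.mkdtemp())
paths = []
for i in range(4):
    path = tmp_dir / f"part_{i}.csv"
    frame = pd.DataFrame({"transaction_value": [float(i), float(i) + 0.5]})
    frame.to_csv(path, index=False)
    paths.append(path)

# read the files concurrently; each worker thread handles one file at a time
with ThreadPoolExecutor(max_workers=4) as pool:
    frames = list(pool.map(pd.read_csv, paths))

# combine the per-file results
total = sum(frame["transaction_value"].sum() for frame in frames)
print(len(frames), total)
```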

### Compute -> Parallelize, Vectorize
### Memory -> Chunk
### I/O -> Parallelize