# HW3: Part 1 - JIT and Threading

v1.0 (2021 Spring): Aled Cuda, Aditya Sengupta

**Time Budget: 3 Hours**

## [Just-In-Time Compilation](https://en.wikipedia.org/wiki/Just-in-time_compilation):

In order to transform Python code into machine instructions we use the library [Numba](https://numba.readthedocs.io/en/stable/user/index.html). The inner workings of Numba are devilishly complex, but broadly speaking, here's what happens when you call a function that has been marked for jit with Numba:

1. When the Python interpreter tries to execute a function that you marked for JITing with a Numba decorator, it calls out to Numba.
2. Numba checks its database to see if it has already compiled the given function. If it has, it verifies that the input types match those it was compiled with and calls that compiled function. If not, it moves on to the following step.
3. Numba will check the types of all the inputs, and use that information to figure out the types of every variable in the program as well as the return type. If it's successful at inferring the types, it proceeds to the next step.
4. Knowing the type of everything, Numba calls out to its compilation backend ([LLVM](https://llvm.org/docs/index.html)) which turns the Python code into machine code.
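
The steps above can be mimicked with a toy dispatcher in plain Python. This is only a sketch of the caching idea, not how Numba is actually implemented: `toy_jit`, `toy_add`, and the `cache` attribute are hypothetical names made up for this illustration, and no machine code is generated.

```python
import functools

def toy_jit(func):
    """Toy stand-in for a JIT decorator (hypothetical, not Numba's API)."""
    cache = {}  # maps a tuple of argument types to a "compiled" function

    @functools.wraps(func)
    def dispatcher(*args):
        # Look at the types of the inputs for this particular call
        signature = tuple(type(arg) for arg in args)
        if signature not in cache:
            # A real JIT would run type inference and emit machine code here;
            # this sketch just stores the original Python function
            cache[signature] = func
        return cache[signature](*args)

    dispatcher.cache = cache
    return dispatcher

@toy_jit
def toy_add(a, b):
    return a + b

toy_add(1, 2)      # new signature (int, int): "compiled" and cached
toy_add(3, 4)      # same signature: served from the cache
toy_add("a", "b")  # new signature (str, str): "compiled" again
print([tuple(t.__name__ for t in sig) for sig in toy_add.cache])
```

Each distinct combination of argument types gets its own cache entry, which is the part of the picture that matters for the questions below.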

As you can probably tell, there are a number of steps in this chain that can fail. Chief among them is step 3, which is why JITing Python is so difficult. Python is *SERIOUSLY* dynamically typed, which means it plays fast and loose with the types of objects. One minute a function can be adding two integers, and the next it can be concatenating two strings, on the same line of code, using the same variables. Run the example below to see how this works:

In [None]:
def dynamic_add(a, b):
    return a + b

stringa = "Hi"
stringb = " Jeff"

a = 10
b = 20

print("dynamic_add(stringa, stringb):", dynamic_add(stringa, stringb))
print("type(dynamic_add(stringa, stringb)):", type(dynamic_add(stringa, stringb)))

print("dynamic_add(a, b):", dynamic_add(a, b))
print("type(dynamic_add(a, b)):", type(dynamic_add(a, b)))

As you can see, the same function operates on two completely different types of data. If you were to write this in assembly, you'd have to write two entirely different functions. It's almost definitely impossible to figure out statically (that is, without running the program) what type each variable will take on. By waiting until runtime, Numba mostly solves this problem, but it can't always figure out the types, so in a sense it only really supports a subset of Python.
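
As a small illustration of why types can't always be worked out ahead of time, consider the sketch below. The `parse` helper is a hypothetical example (it is not part of the homework): its return type depends on the contents of a string that is only known when the program runs.

```python
def parse(token):
    # The return type depends on the *contents* of the string,
    # which nobody can know just from reading the source code
    if token.isdigit():
        return int(token)
    return token

values = [parse(t) for t in ["42", "spam"]]
print([type(v).__name__ for v in values])  # ['int', 'str']
```

A function like this, where a single variable can end up with different types on different code paths, is the kind of code that type inference tends to struggle with.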

To JIT a function, you can use the `njit` decorator from Numba to tag it for compilation, like below:

In [None]:
from numba import njit

@njit
def numba_add(a, b):
    return a + b

%time numba_add(10.5, 19.5)
%time numba_add(3.141, 3.141)
%time numba_add(9.1, 8.1)

### Question 1:

a. Run the cell above. Explain why the first call to `numba_add` takes a couple thousand times longer than the next two.

**YOUR ANSWER**

b. Run the cell below. The function sped up above, so why did it slow down again before speeding back up?

**YOUR ANSWER**

In [None]:
%time numba_add("Hi from ", "Numba")
%time numba_add("Hi from ", "Numba but faster this time")

### Question 2:

Modify the function below, which computes the sum of the squares of the elements of an array, so that it runs JIT-compiled. Note the speedup you achieve:

My speedup = xxx%

(Don't worry this isn't a trick question, but don't use any numpy functions)

In [None]:
import numpy as np

def sum_squares(l):
    running_sum = 0
    for i in range(len(l)):
        running_sum += l[i]**2
    return running_sum

# Trigger jit
sum_squares(np.array([0.0, 1.0]))
testvec = np.linspace(0, 100, 100000)

%timeit sum_squares(testvec)

Another way to speed up Python code is to run it in parallel, for example with the Python `multiprocessing` library. Let's try that!

In [None]:
from multiprocessing import Pool

data = np.arange(25)

def helper(x):
    """
    Double and print a single element
    """
    print(x*2)
    return x*2

def multi_doublenprint(a):
    """
    Take an array as input, and use the map method from multiprocessing's
    Pool class to print out and return the double of each element
    """
    p = Pool(16)
    
    # Use the Pool.map method to call our helper on each element of the array
    return p.map(helper, a)

multi_doublenprint(data)

Huh???? To get around the global interpreter lock, the multiprocessing library spawns multiple separate Python processes. It splits up the list we're operating on and sends the pieces out to the different processes. It then combines the results from each and builds a new list out of them! Because the processes all print at the same time, the print output unfortunately gets all mangled together, even though the returned list is still in order. This is a good example of the non-deterministic nature of parallel execution. Run this cell a couple of times, and see how the output changes.
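
The split, compute, and combine pattern described above can be sketched sequentially in plain Python. This is a toy model only: `split_into_chunks` and `toy_pool_map` are hypothetical helpers, no processes are actually spawned, and the real library chooses its own chunk sizes and scheduling.

```python
def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, roughly equal pieces."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for k in range(n_chunks):
        # The first `extra` chunks get one additional element
        stop = start + size + (1 if k < extra else 0)
        if stop > start:
            chunks.append(items[start:stop])
        start = stop
    return chunks

def toy_pool_map(func, items, n_workers):
    chunks = split_into_chunks(list(items), n_workers)
    # A real Pool would hand each chunk to a different process here;
    # this sketch just processes the chunks one after another
    partial_results = [[func(x) for x in chunk] for chunk in chunks]
    # Stitch the per-chunk results back together in the original order
    return [y for part in partial_results for y in part]

print(toy_pool_map(lambda x: 2 * x, range(10), 4))
```

Because the results are stitched together by chunk position rather than by finishing time, the combined list comes out in input order even when the workers finish (and print) in a different order.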

The other way to write decent parallel code is to explicitly parallelize a loop with Numba. This has the advantage of being both more efficient and more compact than the multiprocessing way:

In [None]:
from numba import prange

# We need to use parallel=True, otherwise prange just acts like range
@njit(parallel = True)
def explicit_doublenprint(a):
    # Create an array to hold our result
    ret = np.empty(a.size, dtype = np.int64)
    # Use the prange function from numba to dispatch different iterations of our for loop on different threads
    for i in prange(0, len(a)):
        print(2*a[i])
        ret[i] = 2*a[i]
    return ret

explicit_doublenprint(data)

You'll notice that the print output is quite a bit less garbled. This is because the way Numba spawns threads is much lighter weight and quite a bit saner: for the most part, it can just ignore the global interpreter lock and run multiple concurrent threads in the same Python process.

The last, easiest, and generally fastest way to run Python code in parallel with Numba is to use the `vectorize` decorator and to set `target='parallel'`. With this target, Numba also needs you to explicitly specify the input types up front, as you'll see below.

In [None]:
from numba import vectorize
# We ask numba to generate parallel code that takes an int64 as an
# argument (the part in parentheses) and returns an int64. int64 is
# the default numpy integer datatype, and float64 is the default
# floating point datatype for numpy
@vectorize('int64(int64)', target='parallel')
def vectored_doublenprint(x):
    print(x * 2)
    return x * 2

vectored_doublenprint(data)

As you can see, this operates in essentially the same way as the prange example, but the expression is much clearer. As we probably saw in lecture, Numba also gets a little bit smarter with the vectorization, so this way is typically a little faster.

For the rest of this part of the homework, we'll explore ways of speeding up calculations of the Fibonacci sequence, defined in plain Python as follows:

In [1]:
def fibonacci(n):
    lastnum = 1
    lastlastnum = 0
    if n == 0:
        return 0
    if n == 1:
        return 1
    i = 2
    while i < n:
        i += 1
        lastlastnum, lastnum = lastnum, lastnum + lastlastnum
    return lastnum + lastlastnum
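
As a quick sanity check of the definition above, the sketch below repeats the same function (so that it runs on its own) and verifies the recurrence. It also looks at how large the values get: plain Python integers grow without bound, whereas a 64-bit integer does not, which may matter when comparing against typed implementations. `INT64_MAX` and `largest_fitting` are names made up for this illustration.

```python
def fibonacci(n):
    # Same iterative definition as in the cell above
    lastnum = 1
    lastlastnum = 0
    if n == 0:
        return 0
    if n == 1:
        return 1
    i = 2
    while i < n:
        i += 1
        lastlastnum, lastnum = lastnum, lastnum + lastlastnum
    return lastnum + lastlastnum

first_ten = [fibonacci(k) for k in range(10)]
print(first_ten)  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

# Each term from index 2 onwards is the sum of the two terms before it
assert all(fibonacci(k) == fibonacci(k - 1) + fibonacci(k - 2) for k in range(2, 50))

# Largest index whose Fibonacci number still fits in a signed 64-bit integer
INT64_MAX = 2**63 - 1
largest_fitting = max(k for k in range(200) if fibonacci(k) <= INT64_MAX)
print(largest_fitting)
```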

### Question 3:

Use `multiprocessing.Pool` to compute `fibonacci(k)` for each element `k` in an array, where `fibonacci(k)` is the kth element of the Fibonacci sequence as given by the function above. Feel free to use the given Fibonacci code in your test:

In [2]:
def fibonacci_pool(arr):
    """
    Fill in this function to use the multiprocessing library to compute the fibonacci
    sequence for every element in the array, and return the result as an array
    """
    pass

assert np.all(np.array([fibonacci(i) for i in range(100)]) == fibonacci_pool(np.arange(100)))

### Question 4:

Use Numba's `prange` object and the `@njit(parallel=True)` decorator to explicitly parallelize this function.

If you decide to call out to the `fibonacci` function, make a copy of it under a different name in this cell and decorate it with `@njit`.

In [None]:
# You will need to decorate this function for JITing and enable parallelism
def fibonacci_prange(arr):
    """
    Fill in this function with an explicit loop over the array using numba's prange function
    """
    pass

assert np.all(np.array([fibonacci(i) for i in range(100)]) == fibonacci_prange(np.arange(100)))

### Question 5:

Use Numba's `@vectorize` decorator to implicitly parallelize the Fibonacci program. You will probably want the type declaration `'int64(int64)'` and the compilation target `target='parallel'`.

In [None]:
# You will need to decorate this function with the vectorize decorator, and target='parallel'
def fibonacci_vectorize(x):
    """
    Fill in this function to compute the xth element of the fibonacci sequence and
    let numba handle the parallelism
    """
    pass

assert np.all(np.array([fibonacci(i) for i in range(100)]) == fibonacci_vectorize(np.arange(100)))

### Question 6:

Run the cell below and remark on the performance differences between the implementations. Do any of them perform better or worse than you were expecting?

**YOUR ANSWER**

In [None]:
testvec = np.random.randint(0, 300, size=100000)

print("Naive python implementation:")
%timeit _ = np.array([fibonacci(i) for i in testvec])
print("Python multiprocessing pool implementation:")
%timeit _ = fibonacci_pool(testvec)
print("Numba explicitly parallel implementation:")
%timeit _ = fibonacci_prange(testvec)
print("Numba vectorized implementation:")
%timeit _ = fibonacci_vectorize(testvec)