# Introduction

Writing code in Python is easy: because it is dynamically typed, we don’t have to worry too much about declaring variable types (e.g., integers vs. floating point numbers). It is also interpreted, rather than compiled. Taken together, this means that we can avoid a lot of the boilerplate that makes compiled, statically typed languages hard to read. However, this comes with a major drawback: some operations can be quite slow.

Whenever possible, using the NumPy array representation saves time, because operations on whole arrays run in compiled code rather than in the Python interpreter. But not all operations can be vectorized. What do you do when you need to speed up your code, but can’t rely on vectorization?
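
As a rough illustration of what vectorization means here, the sketch below computes the same sum of squared differences twice: once with an explicit Python loop, and once with NumPy array operations. The function names and the array size are made up for this example.

```python
import numpy as np


def sum_sq_diff_loop(a, b):
    """Sum of squared differences, one element at a time in Python."""
    total = 0.0
    for x, y in zip(a, b):
        diff = x - y
        total += diff * diff
    return total


def sum_sq_diff_vectorized(a, b):
    """The same quantity, computed with whole-array operations."""
    diff = a - b
    return float(np.dot(diff, diff))


# Illustrative data: two random vectors of the same length.
rng = np.random.default_rng(0)
a = rng.random(10_000)
b = rng.random(10_000)

# Both versions should agree up to floating-point rounding.
assert np.isclose(sum_sq_diff_loop(a, b), sum_sq_diff_vectorized(a, b))
```

The vectorized version is usually much faster for large arrays, but how much faster depends on the array size and the machine, so it is worth measuring rather than assuming.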

Here, we’ll mention two general approaches to speeding up code:

1. The first is the main topic of this tutorial: parallelization. One of the reasons that Python can be slow is that it does one thing at a time. Technically, the standard Python interpreter (CPython) lets only one thread execute Python code at any given time, which protects data and memory from corruption and keeps basic operations reasonably fast. This is implemented in the so-called Global Interpreter Lock, or GIL. Here, we'll explore one of many approaches to parallelization, using a software library called Dask.

1. The other approach -- which we will not explore in this tutorial -- is to compile your Python code into something faster, and then call the compiled code from within Python. There are two ways to do that:
    - Sometimes, your only choice in speeding up code is to write extension code in C, but this is very cumbersome, and requires writing many lines of additional code above and beyond your core algorithms, just to communicate between the Python and C computation layers. [Cython](https://cython.org/) is a technology that allows us to bridge more easily between Python and the underlying C representations. The main purpose of the library is to take code that is written in Python and, provided some additional (mostly type) information, translate it to C, compile the C code, and bundle the compiled objects into Python extensions that can then be imported directly into Python.

    - [Numba](https://numba.pydata.org/) and [JAX](https://jax.readthedocs.io/en/latest/index.html) also compile your code to machine code, but they take a distinctly different approach. Instead of translating your Python code to C, and then compiling that down to machine code, they compile the code “just in time”, at the time that the code is called.
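
To make the first approach above a little more concrete, here is a minimal sketch of how the GIL shows up in practice. It runs the same CPU-bound function twice in a row, and then in two threads, and prints how long each took. The function `count_down` and the constant `N` are made up for this illustration, and the sketch assumes a standard CPython build with the GIL enabled.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_down(n):
    """CPU-bound busy work written in pure Python."""
    while n > 0:
        n -= 1
    return n


N = 2_000_000  # illustrative amount of work per call

# Run the work twice, one call after the other.
start = time.perf_counter()
serial_results = [count_down(N), count_down(N)]
serial_seconds = time.perf_counter() - start

# Run the same two calls in two threads.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:
    threaded_results = list(pool.map(count_down, [N, N]))
threaded_seconds = time.perf_counter() - start

print(f"serial:   {serial_seconds:.3f} s")
print(f"threaded: {threaded_seconds:.3f} s")
```

With the GIL in place, the two timings tend to be similar, because only one thread executes Python code at a time. The exact numbers depend on the machine and the Python version.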


But before we dive into parallelization, we need to find a way to know whether what we are doing is even helping.

## Profiling

To know whether what you are doing is helping, it is crucial to measure how well you are doing before and after some change. Profiling is a way to measure how long a particular piece of code takes to run, and where that time is spent.

### The IPython `timeit` magic

In a Jupyter notebook, you can use a ‘magic’ function to time either a single statement, or multiple statements. For example, the following shows us how one operation scales with the size of the data. In this case, the `%timeit` magic only times the operation on that line.

In [None]:
import numpy as np

for shape in [10e3, 10e4, 10e5]:
    X = np.random.rand(int(shape))
    %timeit np.dot(X, X.T)

In contrast, if you use `%%timeit`, the magic would apply to the entire cell.

For example, in the following cell, we might calculate the pair-wise distance between the entries in a random matrix of 100 by 100, and store them:

In [None]:
%%timeit 
X = np.random.rand(100, 100)
D = np.empty((100, 100))

M = X.shape[0]
N = X.shape[1]
for i in range(M):
    for j in range(M):
        d = 0.0
        for k in range(N):
            tmp = X[i, k] - X[j, k]
            d += tmp * tmp
        D[i, j] = np.sqrt(d)
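
Outside a notebook, the `timeit` module in Python's standard library can be used for the same kind of measurement. The sketch below times two ways of building the same list, as a stand-in for a "before and after" comparison. The two functions are made up for this illustration, and the repeat counts are arbitrary.

```python
import timeit


def squares_loop(n):
    """Build a list of squares with an explicit loop."""
    out = []
    for i in range(n):
        out.append(i * i)
    return out


def squares_comprehension(n):
    """Build the same list with a list comprehension."""
    return [i * i for i in range(n)]


# For each version: 5 repeats of 100 calls each, keeping the best repeat.
for func in (squares_loop, squares_comprehension):
    timings = timeit.repeat(lambda: func(1_000), repeat=5, number=100)
    per_call_us = min(timings) / 100 * 1e6
    print(f"{func.__name__}: {per_call_us:.1f} us per call")
```

Taking the minimum over several repeats is a common choice, because slower repeats are usually caused by other activity on the machine rather than by the code being timed.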

### Line profiling

Knowing that some set of procedures takes time is good, but to improve things, we often need to drill down deeper, and figure out which exact lines within a function are the ones that take up most of the time.

That’s where a line profiler comes in handy. First, we activate its Jupyter extension:

In [None]:
%load_ext line_profiler

Once you've done that, you'll need to define a function around the code
that you are interested in profiling:

In [None]:
def distance():
    X = np.random.rand(100, 100)
    D = np.empty((100, 100))

    M = X.shape[0]
    N = X.shape[1]
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)

Sometimes the function you want to profile is not the same as the one you
would call to profile it, so the syntax of the line-profiler extension
is:

    %lprun -f function_to_be_profiled function_to_be_called()
    
In this case, they are the same, so we run the following:

In [None]:
%lprun -f distance distance()

In this output, the 'Hits' column is important, because it tells us how many times each line was executed, and therefore which lines of code are heavily used. The '% Time' column is also very important, because it tells us where we should focus our attention first in making this go faster.

With that in our toolbox, let's start accelerating some code!