This notebook benchmarks several simple stencils (pointwise, 1D, etc.) using pure Python code (nested lists), NumPy arrays accessed with `for` loops, and vectorized NumPy code.

In [None]:
import math
import numpy as np

In [None]:
from common import initialize_field, plot_field, save_result

In [None]:
NX = 128
NY = 128
NZ = 80
N_ITER = 50

# Pointwise stencils

We consider two pointwise stencil models: a simple copy, and applying the `sin` function.

$$
a(i) = b(i)
$$

$$
a(i) = \sin(b(i))
$$

### Nested lists and `for` loops

In [None]:
def list_pointwise(in_field, out_field):
    for n in range(N_ITER):
        for k in range(NZ):
            for j in range(NY):
                for i in range(NX):
                    out_field[k][j][i] = in_field[k][j][i]
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field

In [None]:
in_field = initialize_field(NX, NY, NZ, mode="random")
# Initialize out_field with zeros so that we can verify later that the copy was done correctly
out_field = np.zeros_like(in_field).tolist()
in_field = in_field.tolist()
list_pointwise(in_field, out_field)
plot_field(out_field)

Before running the first benchmark, let's have a look at how Python lists are stored in memory.

Python lists are quite big objects. Under the hood, the CPython implementation of the list object is a vector of pointers, so each element of a list is by itself a proper Python object and a pointer to it is what is stored in the list object. Essentially, after removing some comments, the implementation that can be found in [listobject.h](https://github.com/python/cpython/blob/main/Include/cpython/listobject.h) is the following:

```C
typedef struct {
    PyObject_VAR_HEAD
    PyObject **ob_item;
    Py_ssize_t allocated;
} PyListObject;
```

So our 3D fields are in fact a list containing 80 lists (dimension Z), each of these lists containing 128 lists (dimension Y), and finally each of these lists containing 128 floats. Python lists can be enlarged or reduced, and elements can be inserted or deleted at arbitrary positions. All of this makes these objects extremely cache unfriendly. When accessing the field element (x,y,z) with `field[z][y][x]`, a chain of pointers has to be followed, and therefore the chances that subsequent values remain in the cache are minimal.
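The pointer-based layout described above can be made tangible with a small sketch. It uses only the standard `sys.getsizeof` function and a tiny field; the helper name `nested_footprint` is ours and not part of the notebook, and the byte counts it reports depend on the CPython build.

```python
import sys
import numpy as np

def nested_footprint(obj, seen=None):
    """Approximate bytes used by a nested list and the objects it points to."""
    if seen is None:
        seen = set()
    if id(obj) in seen:  # count each object only once
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)  # for a list: header plus the array of pointers
    if isinstance(obj, list):
        size += sum(nested_footprint(item, seen) for item in obj)
    return size

small = np.zeros((2, 3, 4))
nested = small.tolist()

# Every level is an independent list object, and the leaves are Python floats.
print(type(nested), type(nested[0]), type(nested[0][0]), type(nested[0][0][0]))

# A list slot stores a reference, so two slots can point to the same object.
value = 1.5
row = [value, value]
print(row[0] is row[1])

print(f"NumPy data buffer: {small.nbytes} bytes")
print(f"Nested lists:      {nested_footprint(nested)} bytes")
```

Even for this tiny field the nested lists need several times the memory of the raw data buffer, and that memory is scattered over many separate objects.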

In [None]:
%%timeit -n1 -r3 -o in_field = initialize_field(NX, NY, NZ, mode="square"); out_field = np.zeros_like(in_field).tolist(); in_field = in_field.tolist(); 
list_pointwise(in_field, out_field)

In [None]:
result = _
save_result(result, "list_pointwise")

Even though Python lists are not ideal for maximizing cache hits, we still expect to see a difference depending on the order of the `for` loops. Let's verify this by trying all the possible permutations: ZYX (done above), XYZ, XZY, ZXY, YXZ, YZX.
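As a compact way to check that every permutation produces the same copy, the loop order can be passed as a parameter. The sketch below is illustrative only: `copy_with_loop_order` is a hypothetical helper, it uses `itertools.product` instead of literal nested `for` statements, and its timings are therefore not comparable with the hand-written functions of this notebook.

```python
import itertools

def copy_with_loop_order(in_field, out_field, order, shape):
    """Copy a ZYX-nested list, visiting the points in the given loop order.

    order: string such as "XYZ"; the first letter is the outermost loop.
    shape: dict mapping "X", "Y", "Z" to the size of each dimension.
    """
    ranges = [range(shape[dim]) for dim in order]
    # itertools.product advances its last argument fastest (innermost loop).
    for idx in itertools.product(*ranges):
        pos = dict(zip(order, idx))
        k, j, i = pos["Z"], pos["Y"], pos["X"]
        out_field[k][j][i] = in_field[k][j][i]

shape = {"X": 4, "Y": 3, "Z": 2}
src = [[[100 * k + 10 * j + i for i in range(4)] for j in range(3)] for k in range(2)]

for perm in itertools.permutations("XYZ"):
    order = "".join(perm)
    dst = [[[0] * 4 for _ in range(3)] for _ in range(2)]
    copy_with_loop_order(src, dst, order, shape)
    print(order, dst == src)
```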

In [None]:
def list_pointwise_XYZ(in_field, out_field):
    for n in range(N_ITER):
        for i in range(NX):
            for j in range(NY):
                for k in range(NZ):
                    out_field[k][j][i] = in_field[k][j][i]
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field

def list_pointwise_XZY(in_field, out_field):
    for n in range(N_ITER):
        for i in range(NX):
            for k in range(NZ):
                for j in range(NY):
                    out_field[k][j][i] = in_field[k][j][i]
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field

def list_pointwise_ZXY(in_field, out_field):
    for n in range(N_ITER):
        for k in range(NZ):
            for i in range(NX):
                for j in range(NY):
                    out_field[k][j][i] = in_field[k][j][i]
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field

def list_pointwise_YXZ(in_field, out_field):
    for n in range(N_ITER):
        for j in range(NY):
            for i in range(NX):
                for k in range(NZ):
                    out_field[k][j][i] = in_field[k][j][i]
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field


def list_pointwise_YZX(in_field, out_field):
    for n in range(N_ITER):
        for j in range(NY):
            for k in range(NZ):
                for i in range(NX):
                    out_field[k][j][i] = in_field[k][j][i]
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field

In [None]:
%%timeit -n1 -r3 in_field = initialize_field(NX, NY, NZ, mode="square"); out_field = np.zeros_like(in_field).tolist(); in_field = in_field.tolist(); 
list_pointwise_XYZ(in_field, out_field)

In [None]:
%%timeit -n1 -r3 in_field = initialize_field(NX, NY, NZ, mode="square"); out_field = np.zeros_like(in_field).tolist(); in_field = in_field.tolist(); 
list_pointwise_XZY(in_field, out_field)

In [None]:
%%timeit -n1 -r3 in_field = initialize_field(NX, NY, NZ, mode="square"); out_field = np.zeros_like(in_field).tolist(); in_field = in_field.tolist(); 
list_pointwise_ZXY(in_field, out_field)

In [None]:
%%timeit -n1 -r3 in_field = initialize_field(NX, NY, NZ, mode="square"); out_field = np.zeros_like(in_field).tolist(); in_field = in_field.tolist(); 
list_pointwise_YXZ(in_field, out_field)

In [None]:
%%timeit -n1 -r3 in_field = initialize_field(NX, NY, NZ, mode="square"); out_field = np.zeros_like(in_field).tolist(); in_field = in_field.tolist(); 
list_pointwise_YZX(in_field, out_field)

Judging by these timings, with nested lists we should iterate in the same order in which the lists are nested, i.e., the outermost list should be traversed by the outermost (slowest) loop and the innermost list by the innermost (fastest) loop. The explanation is that to access the element `field[k][j][i+1]` the pointers required to reach the innermost list, i.e., `field[k][j]`, are already in the cache, while to get `field[k+1][j][i]` we have to follow a whole new path of pointers, which most likely are not in the cache if the list is large enough.
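One way to see how much of the work consists of following pointers is to fetch each inner list once and reuse the reference. This is a sketch under the same ZYX nesting as above; `list_pointwise_hoisted` is not one of the benchmarked functions and performs a single sweep only.

```python
def list_pointwise_hoisted(in_field, out_field):
    """Single ZYX sweep that looks up each plane and row only once."""
    for in_plane, out_plane in zip(in_field, out_field):   # Z level
        for in_row, out_row in zip(in_plane, out_plane):   # Y level
            for i in range(len(in_row)):                    # X level
                out_row[i] = in_row[i]

src = [[[float(100 * k + 10 * j + i) for i in range(4)] for j in range(3)] for k in range(2)]
dst = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]
list_pointwise_hoisted(src, dst)
print(dst == src)
```

Reusing the row reference is only possible when the innermost loop runs over the innermost list; with any other loop order each access has to start again from the outermost list.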

Furthermore, we would expect that if the fields are initialized with the same order of the nested lists as the loop order, we should obtain similar performance. Let's check this for the cases XYZ and YZX.

In [None]:
def list_pointwise_XYZ_alt(in_field, out_field):
    for n in range(N_ITER):
        for i in range(NX):
            for j in range(NY):
                for k in range(NZ):
                    out_field[i][j][k] = in_field[i][j][k]
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field

def list_pointwise_YZX_alt(in_field, out_field):
    for n in range(N_ITER):
        for j in range(NY):
            for k in range(NZ):
                for i in range(NX):
                    out_field[j][k][i] = in_field[j][k][i]
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field

In [None]:
%%timeit -n1 -r3 in_field = initialize_field(NX, NY, NZ, dim_order="XYZ", mode="square"); out_field = np.zeros_like(in_field).tolist(); in_field = in_field.tolist();
list_pointwise_XYZ_alt(in_field, out_field)

In [None]:
%%timeit -n1 -r3 in_field = initialize_field(NX, NY, NZ, dim_order="YZX", mode="square"); out_field = np.zeros_like(in_field).tolist(); in_field = in_field.tolist();
list_pointwise_YZX_alt(in_field, out_field)

Similarly, we can replicate the worst results by using the opposite list structure to the loop order. Let's try this with the two best results from above: ZYX and YZX.

In [None]:
def list_pointwise_ZYX_alt(in_field, out_field):
    for n in range(N_ITER):
        for k in range(NZ):
            for j in range(NY):
                for i in range(NX):
                    out_field[i][j][k] = in_field[i][j][k]
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field

# YZX loop order on lists nested in the opposite (XZY) order.
# Named differently from the matching-order list_pointwise_YZX_alt defined earlier.
def list_pointwise_YZX_alt2(in_field, out_field):
    for n in range(N_ITER):
        for j in range(NY):
            for k in range(NZ):
                for i in range(NX):
                    out_field[i][k][j] = in_field[i][k][j]
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field

In [None]:
%%timeit -n1 -r3 in_field = initialize_field(NX, NY, NZ, dim_order="XYZ", mode="square"); out_field = np.zeros_like(in_field).tolist(); in_field = in_field.tolist();
list_pointwise_ZYX_alt(in_field, out_field)

In [None]:
%%timeit -n1 -r3 in_field = initialize_field(NX, NY, NZ, dim_order="XZY", mode="square"); out_field = np.zeros_like(in_field).tolist(); in_field = in_field.tolist();
list_pointwise_YZX_alt2(in_field, out_field)

All these results confirm our previous explanation.

Let's now check the other simple pointwise stencil computation. In this case, instead of just copying all the values, we will compute the sine of each gridpoint.

In [None]:
def list_sin_pointwise(in_field, out_field):
    for n in range(N_ITER):
        for k in range(NZ):
            for j in range(NY):
                for i in range(NX):
                    # numpy.sin() is ~7 times slower than math.sin() when applied to single values
                    # Check Appendix for more details about this.
                    out_field[k][j][i] = math.sin(in_field[k][j][i])
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field

In [None]:
%%timeit -n1 -r3 -o in_field = initialize_field(NX, NY, NZ, mode="square"); out_field = np.zeros_like(in_field).tolist(); in_field = in_field.tolist(); 
list_sin_pointwise(in_field, out_field)

In [None]:
result = _
save_result(result, "list_sin_pointwise")

### NumPy arrays and `for` loops

The following code shows the worst we can do when working with NumPy arrays. Surprisingly, as we will see in [the Numba notebook](./2_numba.ipynb), this way of working with NumPy arrays is necessary when working with certain accelerators.

In [None]:
def array_pointwise(in_field):
    out_field = np.empty_like(in_field)
    for n in range(N_ITER):
        for k in range(NZ):
            for j in range(NY):
                for i in range(NX):
                    out_field[k, j, i] = in_field[k, j, i]
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field
    return out_field

In [None]:
in_field = initialize_field(NX, NY, NZ, mode="random")
out_field = array_pointwise(in_field)
plot_field(out_field)

In [None]:
%%timeit -n1 -r3 -o in_field = initialize_field(NX, NY, NZ, mode="square")
out_field = array_pointwise(in_field)

In [None]:
result = _
save_result(result, "array_pointwise")

However, it is still interesting to try all the possible permutations of the `for` loops. Hopefully, this will give us some hints about how NumPy arrays are stored in memory.

In [None]:
def array_pointwise_ZYX(in_field):
    out_field = np.empty_like(in_field)
    for n in range(N_ITER):
        for k in range(NZ):
            for j in range(NY):
                for i in range(NX):
                    out_field[k, j, i] = in_field[k, j, i]
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field
    return out_field

def array_pointwise_XYZ(in_field):
    out_field = np.empty_like(in_field)
    for n in range(N_ITER):
        for i in range(NX):
            for j in range(NY):
                for k in range(NZ):
                    out_field[k, j, i] = in_field[k, j, i]
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field
    return out_field

def array_pointwise_XZY(in_field):
    out_field = np.empty_like(in_field)
    for n in range(N_ITER):
        for i in range(NX):
            for k in range(NZ):
                for j in range(NY):
                    out_field[k, j, i] = in_field[k, j, i]
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field
    return out_field
            
def array_pointwise_ZXY(in_field):
    out_field = np.empty_like(in_field)
    for n in range(N_ITER):
        for k in range(NZ):
            for i in range(NX):
                for j in range(NY):
                    out_field[k, j, i] = in_field[k, j, i]
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field
    return out_field
            
def array_pointwise_YXZ(in_field):
    out_field = np.empty_like(in_field)
    for n in range(N_ITER):
        for j in range(NY):
            for i in range(NX):
                for k in range(NZ):
                    out_field[k, j, i] = in_field[k, j, i]
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field
    return out_field

def array_pointwise_YZX(in_field):
    out_field = np.empty_like(in_field)
    for n in range(N_ITER):
        for j in range(NY):
            for k in range(NZ):
                for i in range(NX):
                    out_field[k, j, i] = in_field[k, j, i]
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field
    return out_field

In [None]:
%%timeit -n1 -r3 in_field = initialize_field(NX, NY, NZ, mode="square")
out_field = array_pointwise_ZYX(in_field)

In [None]:
%%timeit -n1 -r3 in_field = initialize_field(NX, NY, NZ, mode="square")
out_field = array_pointwise_XYZ(in_field)

In [None]:
%%timeit -n1 -r3 in_field = initialize_field(NX, NY, NZ, mode="square")
out_field = array_pointwise_XZY(in_field)

In [None]:
%%timeit -n1 -r3 in_field = initialize_field(NX, NY, NZ, mode="square")
out_field = array_pointwise_ZXY(in_field)

In [None]:
%%timeit -n1 -r3 in_field = initialize_field(NX, NY, NZ, mode="square")
out_field = array_pointwise_YXZ(in_field)

In [None]:
%%timeit -n1 -r3 in_field = initialize_field(NX, NY, NZ, mode="square")
out_field = array_pointwise_YZX(in_field)

We observe some differences between some of the loop permutations. However, the important result is that they are all much slower than when working with lists. A possible explanation is that NumPy arrays are, by design, not cache friendly when accessed element by element. To really understand this result we need to understand how NumPy arrays are stored in memory.

NumPy arrays are objects with two very different parts: a data buffer (a contiguous, fixed block of memory containing fixed-size data items) and the metadata describing that buffer. The data buffer is very close to what C/Fortran developers would call an array. The metadata contains all the information NumPy needs to interpret the data buffer correctly (e.g., strides, dim order, byte order, dtype, ...). NumPy handles many array manipulations very efficiently by avoiding unnecessary copies. For example, reshaping an array usually changes only the metadata block and leaves the data buffer untouched. This is great, but it is something to keep in mind when trying to optimize NumPy code or to explain some surprising results.
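The split between metadata and data buffer can be observed directly. The following minimal sketch uses only standard NumPy attributes (`strides`, `base`) and `np.shares_memory`; the array names are arbitrary.

```python
import numpy as np

a = np.arange(24, dtype=np.float64)   # one data buffer of 24 items
b = a.reshape(2, 3, 4)                # new metadata, same buffer
c = b.transpose(2, 1, 0)              # only shape and strides change

print(a.strides, b.strides, c.strides)
print(np.shares_memory(a, b), np.shares_memory(a, c))

b[0, 0, 0] = -1.0                     # writing through the view ...
print(a[0])                           # ... changes the original array
```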

For example, knowing how arrays are stored in memory, we can now explain why the previous tests were so terribly slow. The main reason is that every time we execute `out_field[k, j, i] = in_field[k, j, i]`, the metadata of both the `in_field` and `out_field` arrays has to be read just to figure out how to access the desired element. Then, the relevant part of the data buffer is copied to the cache. However, when we try to copy the next point, the metadata has to be loaded into the cache again, evicting the data buffer, and so on. This code does not make good use of our CPU cache.
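A related per-element cost can be observed without any timing: every scalar read from an array has to be wrapped in a new Python object, whereas a list returns the object it already stores. The sketch below only illustrates this behaviour on a tiny array and does not measure how much it contributes to the timings above.

```python
import numpy as np

arr = np.zeros((2, 3, 4))
lst = arr.tolist()

x = arr[0, 0, 0]
y = arr[0, 0, 0]
print(type(x))      # a NumPy scalar, not a plain Python float
print(x is y)       # each read builds a new object from the data buffer

p = lst[0][0][0]
q = lst[0][0][0]
print(type(p))
print(p is q)       # the list hands back the object it already stores
```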

Now let's run the benchmark for the `sin()` stencil and save the results.

In [None]:
def array_sin_pointwise(in_field):
    out_field = np.empty_like(in_field)
    for n in range(N_ITER):
        for k in range(NZ):
            for j in range(NY):
                for i in range(NX):
                    out_field[k, j, i] = math.sin(in_field[k, j, i])
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field
    return out_field

In [None]:
%%timeit -n1 -r3 -o in_field = initialize_field(NX, NY, NZ, mode="square")
out_field = array_sin_pointwise(in_field)

In [None]:
result = _
save_result(result, "array_sin_pointwise")

### NumPy arrays with vectorized code

Now we test the performance of the same two pointwise stencils using vectorized NumPy code that does not use `for` loops to iterate over the three spatial dimensions.

This is how NumPy should be used. Leaving aside the overhead of Python `for` loops, this code is cache friendly because the metadata of both arrays is read only once, and then only the data buffers are read to make the actual copy. Because of this, large chunks of the arrays are copied by reading directly from the cache.
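Before benchmarking, it is worth checking on a tiny array that the vectorized form computes the same result as the element-by-element loops. This is a sketch with an arbitrary random field, independent of `initialize_field()`; the `out=` argument of `np.sin` is standard NumPy and writes the result into an existing array.

```python
import numpy as np

rng = np.random.default_rng(0)
small = rng.random((2, 3, 4))

# Loop version: one indexing operation per element.
looped = np.empty_like(small)
for k in range(small.shape[0]):
    for j in range(small.shape[1]):
        for i in range(small.shape[2]):
            looped[k, j, i] = small[k, j, i]

# Vectorized version: a single slice assignment handled inside NumPy.
vectorized = np.empty_like(small)
vectorized[:, :, :] = small[:, :, :]
print(np.array_equal(looped, vectorized))

# The sin stencil can write into a preallocated array in the same way.
sin_out = np.empty_like(small)
np.sin(small, out=sin_out)
print(np.allclose(sin_out, np.sin(small)))
```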

In [None]:
def numpy_pointwise(in_field):
    out_field = np.empty_like(in_field)
    for n in range(N_ITER):
        out_field[:, :, :] = in_field[:, :, :]
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field
    return out_field

In [None]:
%%timeit -o in_field = initialize_field(NX, NY, NZ, mode="square")
out_field = numpy_pointwise(in_field)

In [None]:
result = _
save_result(result, "numpy_pointwise")

In [None]:
def numpy_sin_pointwise(in_field):
    out_field = np.empty_like(in_field)
    for n in range(N_ITER):
        out_field = np.sin(in_field)
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field
    return out_field

In [None]:
%%timeit -o in_field = initialize_field(NX, NY, NZ, mode="square")
out_field = numpy_sin_pointwise(in_field)

In [None]:
result = _
save_result(result, "numpy_sin_pointwise")

The conclusion of these first tests is that Python loops are **terribly slow**. Python lists are not cache friendly but they are surprisingly fast if one takes into account how flexible they are. NumPy arrays can make things worse if they are used incorrectly, e.g., by iterating over the arrays, because they are even less cache friendly than lists. However, when used with vectorized code, performance can be increased a lot. The simplest pointwise stencil (copy) using NumPy vectorized code is **~150x faster** than pure Python code with lists, and the pointwise stencil applying `sin()` is **~7x faster**.

# 1D stencils

Here we consider 1D stencils that update each gridpoint using only values from the same row or the same column, with periodic boundary conditions.

$$
a(i,j) = \frac{1}{2} \Big[b(i+1,j) - b(i,j)\Big]
$$

$$
a(i,j) = \frac{1}{2} \Big[b(i,j+1) - b(i,j)\Big]
$$

The factor 1/2 is to avoid getting huge numbers when `N_ITER` is large with the simple initialization patterns we defined before.
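As a reference for the periodic boundary condition, the first stencil can be written in two equivalent ways on a small array: with an explicit modulo on the shifted index, or with `np.roll`, which wraps values around the boundary. The function names below are ours and are not the benchmarked implementations.

```python
import numpy as np

def same_col_loop(b):
    """a(i) = 0.5 * (b(i+1) - b(i)) along the last axis, periodic in i."""
    a = np.empty_like(b)
    nx = b.shape[-1]
    for i in range(nx):
        a[..., i] = 0.5 * (b[..., (i + 1) % nx] - b[..., i])
    return a

def same_col_roll(b):
    """Same stencil written with np.roll, which wraps around the boundary."""
    return 0.5 * (np.roll(b, -1, axis=-1) - b)

b = np.arange(24, dtype=np.float64).reshape(2, 3, 4) ** 2
print(np.allclose(same_col_loop(b), same_col_roll(b)))
print(same_col_roll(np.array([1.0, 3.0, 7.0])))
```

Note that `np.roll` returns a new array, so this form is convenient for checking results rather than for benchmarking.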

The 1D stencils are the perfect opportunity to validate our explanation about why Python lists are slow. We should be able to observe and measure the effects of writing more or less cache friendly code by changing the order of the `for` loops and the order of how the lists are nested.

### Same column: $a(i,j) = \frac{1}{2} \Big[b(i+1,j) - b(i,j)\Big]$

This stencil updates each gridpoint with the current value and the value in the next column. For this reason, the optimal code will be the one that iterates the lists in the ZYX order, with the nested lists also stored in that order.

But let's start with our default list order, i.e. ZYX, and the ZXY order for the `for` loops.

In [None]:
def list_1D_same_col_ZXY(in_field, out_field):
    for n in range(N_ITER):
        for k in range(NZ):
            for i in range(NX - 1):
                for j in range(NY):
                    out_field[k][j][i] = 0.5 * (in_field[k][j][i+1] - in_field[k][j][i])
                    # Periodic boundary condition
                    out_field[k][j][NX-1] = 0.5 * (in_field[k][j][0] - in_field[k][j][NX-1])
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field

In [None]:
in_field = initialize_field(NX, NY, NZ, mode="vertical-bars")
out_field = np.zeros_like(in_field).tolist()
in_field = in_field.tolist()
list_1D_same_col_ZXY(in_field, out_field)
plot_field(out_field)

In [None]:
%%timeit -n1 -r3 in_field = initialize_field(NX, NY, NZ, mode="vertical-bars"); out_field = np.zeros_like(in_field).tolist(); in_field = in_field.tolist(); 
list_1D_same_col_ZXY(in_field, out_field)

In [None]:
def list_1D_same_col_ZYX(in_field, out_field):
    for n in range(N_ITER):
        for k in range(NZ):
            for j in range(NY):
                for i in range(NX - 1):
                    out_field[k][j][i] = 0.5 * (in_field[k][j][i+1] - in_field[k][j][i])
                out_field[k][j][NX-1] = 0.5 * (in_field[k][j][0] - in_field[k][j][NX-1])
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field

In [None]:
%%timeit -n1 -r3 -o in_field = initialize_field(NX, NY, NZ, mode="vertical-bars"); out_field = np.zeros_like(in_field).tolist(); in_field = in_field.tolist(); 
list_1D_same_col_ZYX(in_field, out_field)

In [None]:
result = _
save_result(result, "list_1D_same_col_ZYX")

Looking at the indentation of the code, we can also see that by choosing the optimal way to iterate over and nest the lists, we avoid repeating the operations that implement the periodic boundary conditions. This, together with the higher cache hit ratio, explains the better performance of this scenario. We can quickly check the impact of the extra operations by indenting that single line so that it belongs to the innermost loop.
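The number of boundary updates in each placement follows directly from the loop bounds. A short calculation with the sizes used in this notebook:

```python
NX, NY, NZ, N_ITER = 128, 128, 80, 50

# Boundary update placed after the innermost loop: once per (k, j) pair.
updates_outside = N_ITER * NZ * NY

# Boundary update placed inside the innermost loop: once per interior point.
updates_inside = N_ITER * NZ * NY * (NX - 1)

print(updates_outside)
print(updates_inside)
print(updates_inside - updates_outside)
```

Placing the update inside the innermost loop therefore repeats it `NX - 1` times for every row.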

In [None]:
def list_1D_same_col_ZYX_alt(in_field, out_field):
    for n in range(N_ITER):
        for k in range(NZ):
            for j in range(NY):
                for i in range(NX - 1):
                    out_field[k][j][i] = 0.5 * (in_field[k][j][i+1] - in_field[k][j][i])
                    out_field[k][j][NX-1] = 0.5 * (in_field[k][j][0] - in_field[k][j][NX-1])
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field

In [None]:
%%timeit -n1 -r3 in_field = initialize_field(NX, NY, NZ, mode="vertical-bars"); out_field = np.zeros_like(in_field).tolist(); in_field = in_field.tolist(); 
list_1D_same_col_ZYX_alt(in_field, out_field)

Finally, let's verify our hypothesis that the order of the two outer loops should not affect performance as long as the nested lists follow the same order.

In [None]:
def list_1D_same_col_YZX(in_field, out_field):
    for n in range(N_ITER):
        for j in range(NY):
            for k in range(NZ):
                for i in range(NX - 1):
                    out_field[k][j][i] = 0.5 * (in_field[k][j][i+1] - in_field[k][j][i])
                out_field[k][j][NX-1] = 0.5 * (in_field[k][j][0] - in_field[k][j][NX-1])
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field

In [None]:
%%timeit -n1 -r3 in_field = initialize_field(NX, NY, NZ, mode="vertical-bars"); out_field = np.zeros_like(in_field).tolist(); in_field = in_field.tolist(); 
list_1D_same_col_YZX(in_field, out_field)

Next, we will not benchmark NumPy arrays with `for` loops because we already know from the pointwise stencils that this will be much worse than lists. We have already explained why. So we will jump directly to explore the 1D stencil using vectorized NumPy code.

In [None]:
def numpy_1D_same_col(in_field):
    out_field = np.empty_like(in_field)
    for n in range(N_ITER):
        out_field[:, :, :-1] = 0.5 * (in_field[:, :, 1:] - in_field[:, :, :-1])
        # Periodic boundary condition
        out_field[:, :, -1] = 0.5 * (in_field[:, :, 0] - in_field[:, :, -1])
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field
    return out_field

With vectorized code we cannot test different loop orders. The iteration over arrays is done internally by NumPy. However, we can explore the effects of storing the data in the data buffer differently. A first test we can do is to initialize the arrays using the Fortran-style rather than the C-style (default), and check whether this affects performance or not.

In [None]:
%%timeit -n10 -r7 in_field = initialize_field(NX, NY, NZ, mode="vertical-bars", array_order="C")
out_field = numpy_1D_same_col(in_field)

In [None]:
%%timeit -n10 -r7 in_field = initialize_field(NX, NY, NZ, mode="vertical-bars", array_order="F")
out_field = numpy_1D_same_col(in_field)

By default, the data buffers of NumPy arrays are stored row-wise (C), but as shown above this can be changed using the NumPy kwarg `order` (`array_order` is simply passed on to `order` in our function, as one can check in [`common.py`](./common.py)). From the NumPy documentation:

> Data in new ndarrays is in the row-major (C) order, unless otherwise specified, but, for example, basic array slicing often produces views in a different scheme.
> (...)
> Several algorithms in NumPy work on arbitrarily strided arrays. However, some algorithms require single-segment arrays. When an irregularly strided array is passed in to such algorithms, a copy is automatically made.

The main takeaway should be that when using vectorized NumPy code, some situations may lead to worse performance than what one would expect. This may be due to extra copies being done by NumPy internally so that the data buffers of the arrays are stored in a certain way.
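Whether an array is stored as a single C-ordered or Fortran-ordered segment can be inspected through its `flags`, and `np.shares_memory` shows when a conversion had to copy the data. The sketch below uses small arbitrary arrays and only standard NumPy functions.

```python
import numpy as np

c_arr = np.zeros((4, 3, 5), order="C")
f_arr = np.asfortranarray(c_arr)      # same values, Fortran-ordered copy

print(c_arr.flags["C_CONTIGUOUS"], c_arr.flags["F_CONTIGUOUS"])
print(f_arr.flags["C_CONTIGUOUS"], f_arr.flags["F_CONTIGUOUS"])

# A transposed view reuses the buffer but is no longer C-contiguous.
view = c_arr.transpose(2, 1, 0)
print(view.flags["C_CONTIGUOUS"], np.shares_memory(view, c_arr))

# Asking for a C-contiguous version of that view forces a copy.
copy = np.ascontiguousarray(view)
print(copy.flags["C_CONTIGUOUS"], np.shares_memory(copy, c_arr))
```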

We will test the performance of the 12 possible different storage configurations of the data buffers. That is, initializing the fields with the dimensions in all the possible permutations, and for each one the data buffer either in C-style or Fortran-style.

In [None]:
%%timeit -n10 -r7 in_field = initialize_field(NX, NY, NZ, mode="vertical-bars", dim_order="ZYX", array_order="C")
out_field = numpy_1D_same_col(in_field)

In [None]:
%%timeit -n10 -r7 in_field = initialize_field(NX, NY, NZ, mode="vertical-bars", dim_order="ZYX", array_order="F")
out_field = numpy_1D_same_col(in_field)

In [None]:
%%timeit -n10 -r7 in_field = initialize_field(NX, NY, NZ, mode="vertical-bars", dim_order="XYZ", array_order="C")
out_field = numpy_1D_same_col(in_field)

In [None]:
%%timeit -n10 -r7 in_field = initialize_field(NX, NY, NZ, mode="vertical-bars", dim_order="XYZ", array_order="F")
out_field = numpy_1D_same_col(in_field)

In [None]:
%%timeit -n10 -r7 in_field = initialize_field(NX, NY, NZ, mode="vertical-bars", dim_order="XZY", array_order="C")
out_field = numpy_1D_same_col(in_field)

In [None]:
%%timeit -n10 -r7 in_field = initialize_field(NX, NY, NZ, mode="vertical-bars", dim_order="XZY", array_order="F")
out_field = numpy_1D_same_col(in_field)

In [None]:
%%timeit -n10 -r7 in_field = initialize_field(NX, NY, NZ, mode="vertical-bars", dim_order="ZXY", array_order="C")
out_field = numpy_1D_same_col(in_field)

In [None]:
%%timeit -n10 -r7 in_field = initialize_field(NX, NY, NZ, mode="vertical-bars", dim_order="ZXY", array_order="F")
out_field = numpy_1D_same_col(in_field)

In [None]:
%%timeit -n10 -r7 in_field = initialize_field(NX, NY, NZ, mode="vertical-bars", dim_order="YXZ", array_order="C")
out_field = numpy_1D_same_col(in_field)

In [None]:
%%timeit -n10 -r7 in_field = initialize_field(NX, NY, NZ, mode="vertical-bars", dim_order="YXZ", array_order="F")
out_field = numpy_1D_same_col(in_field)

In [None]:
%%timeit -n10 -r7 in_field = initialize_field(NX, NY, NZ, mode="vertical-bars", dim_order="YZX", array_order="C")
out_field = numpy_1D_same_col(in_field)

In [None]:
%%timeit -n10 -r7 in_field = initialize_field(NX, NY, NZ, mode="vertical-bars", dim_order="YZX", array_order="F")
out_field = numpy_1D_same_col(in_field)

To explain these results we need to understand how arbitrary N-dimensional arrays are stored in a one-dimensional segment of computer memory. In particular, let's suppose that `array` is an N-dimensional array with dimensions $d=(d_0,d_1,\dots,d_{N-1})$. The offset of element $(n_0,n_1,\dots,n_{N-1})$, i.e., the number of bytes from the beginning of the data buffer to the element, is given by the formula

$$
n_\text{offset} = \sum_{k=0}^{N-1} s_k n_k\,,
$$

where $s_k$ are the so-called strides of the array, i.e., the number of bytes required to advance one position in the $k$ dimension. These strides are precisely the numbers that change when we store the data in C-style or Fortran-style. In particular, we have that

$$
s_k^F = \text{itemsize}\times\prod_{j=0}^{k-1}d_j\,,\quad s_k^{C} = \text{itemsize}\times\prod_{j=k+1}^{N-1}d_j
$$

The main takeaway is that in C-style arrays the last dimension has the smallest stride, so consecutive elements along the last dimension are adjacent in memory, while in Fortran-style arrays it is the opposite, i.e., consecutive elements along the first dimension are adjacent in memory.
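The two stride formulas can be checked against NumPy itself. In this sketch `strides_from_formula` is a hypothetical helper that implements the products above, and the arrays are created with plain `np.zeros` rather than `initialize_field()`.

```python
import numpy as np

def strides_from_formula(shape, itemsize, order):
    """Strides in bytes for a C- or Fortran-ordered array of the given shape."""
    strides = []
    for k in range(len(shape)):
        # C order: product of the trailing dimensions; Fortran: leading ones.
        dims = shape[k + 1:] if order == "C" else shape[:k]
        stride = itemsize
        for d in dims:
            stride *= d
        strides.append(stride)
    return tuple(strides)

shape = (80, 128, 128)  # (NZ, NY, NX)
for order in ("C", "F"):
    arr = np.zeros(shape, dtype=np.float64, order=order)
    print(order, strides_from_formula(shape, arr.itemsize, order), arr.strides)
```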

For example, for our 3D fields with $d=(\text{NZ},\text{NY},\text{NX})$ and 8-byte items, we get for Fortran-style arrays $s_Z=8$, $s_Y=8\times\text{NZ}=640$ and $s_X=8\times\text{NZ}\times\text{NY}=81920$, and for C-style arrays $s_X=8$, $s_Y=8\times\text{NX}=1024$ and $s_Z=8\times\text{NY}\times\text{NX}=131072$. Let's verify this.

In [None]:
field_C = initialize_field(NX, NY, NZ, dim_order="ZYX", array_order="C")
field_F = initialize_field(NX, NY, NZ, dim_order="ZYX", array_order="F")
print(f"C-style: {field_C.strides}")
print(f"Fortran-style: {field_F.strides}")

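Based on this we can interpret the differences observed in the previous benchmarks. `numpy_1D_same_col()` always shifts the last axis of the array. It is the slowest for `dim_order="ZYX"` and `array_order="C"`, where the shifted axis has the smallest stride ($s_X=8$ B), and the fastest for `dim_order="ZYX"` and `array_order="F"`, where it has the largest stride ($s_X=81920$ B, i.e., 80 kB). A likely reason is that slicing along the axis with the largest stride, as in `in_field[:, :, 1:]`, still yields a single contiguous block of memory, whereas slicing along the axis with the smallest stride breaks the buffer into many short segments and makes the boundary update jump through memory. The remaining results, which lie in between, correspond to the cases $s_X=640$ B or $s_X\approx 1$ kB.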
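One property of these layouts can be checked directly: whether the slices used by `numpy_1D_same_col()` still form a single contiguous segment. The sketch below uses plain `np.zeros` arrays of the same shape rather than `initialize_field()`, so it only approximates the benchmarked arrays, and `is_single_segment` is a helper introduced here for illustration.

```python
import numpy as np

def is_single_segment(arr):
    """True if the array occupies one contiguous block (C or Fortran order)."""
    return bool(arr.flags["C_CONTIGUOUS"] or arr.flags["F_CONTIGUOUS"])

shape = (80, 128, 128)  # (NZ, NY, NX)
for order in ("C", "F"):
    field = np.zeros(shape, order=order)
    shifted = field[:, :, 1:]   # operand of the same-column stencil
    boundary = field[:, :, 0]   # slab read by the periodic boundary update
    print(order, field.strides[-1], is_single_segment(shifted), is_single_segment(boundary))
```

For the Fortran-ordered array both slices remain one block of memory, while for the C-ordered array neither does.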

### Same row: $a(i,j) = \frac{1}{2} \Big[b(i,j+1) - b(i,j)\Big]$

Now in this other stencil, we want the innermost loop to iterate over the rows. This means that the optimal loop permutations are ZXY and XZY.

In [None]:
def list_1D_same_row_ZXY(in_field, out_field):
    for n in range(N_ITER):
        for k in range(NZ):
            for i in range(NX):
                for j in range(NY-1):
                    out_field[k][j][i] = 0.5 * (in_field[k][j+1][i] - in_field[k][j][i])
                out_field[k][NY-1][i] = 0.5 * (in_field[k][0][i] - in_field[k][NY-1][i])
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field

In [None]:
in_field = initialize_field(NX, NY, NZ, mode="horizontal-bars")
out_field = np.zeros_like(in_field).tolist()
in_field = in_field.tolist()
list_1D_same_row_ZXY(in_field, out_field)
plot_field(out_field)

In [None]:
%%timeit -n1 -r3 in_field = initialize_field(NX, NY, NZ, mode="horizontal-bars"); out_field = np.zeros_like(in_field).tolist(); in_field = in_field.tolist()
list_1D_same_row_ZXY(in_field, out_field)

This result is quite good because the order of the loops is the right one (and we already benefit from fewer calls to the boundary condition update), but we can still improve it by nesting the lists in the same way as the loops.

In [None]:
def list_1D_same_row_ZXY_alt(in_field, out_field):
    for n in range(N_ITER):
        for k in range(NZ):
            for i in range(NX):
                for j in range(NY-1):
                    out_field[k][i][j] = 0.5 * (in_field[k][i][j+1] - in_field[k][i][j])
                out_field[k][i][NY-1] = 0.5 * (in_field[k][i][0] - in_field[k][i][NY-1])
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field

In [None]:
%%timeit -n1 -r3 -o in_field = initialize_field(NX, NY, NZ, dim_order="ZXY", mode="horizontal-bars"); out_field = np.zeros_like(in_field).tolist(); in_field = in_field.tolist()
list_1D_same_row_ZXY_alt(in_field, out_field)

In [None]:
result = _
save_result(result, "list_1D_same_row_ZXY")

Let's check that we get a similar time with the other optimal setting.

In [None]:
def list_1D_same_row_XZY(in_field, out_field):
    for n in range(N_ITER):
        for i in range(NX):
            for k in range(NZ):
                for j in range(NY-1):
                    out_field[i][k][j] = 0.5 * (in_field[i][k][j+1] - in_field[i][k][j])
                out_field[i][k][NY-1] = 0.5 * (in_field[i][k][0] - in_field[i][k][NY-1])
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field

In [None]:
%%timeit -n1 -r3 in_field = initialize_field(NX, NY, NZ, dim_order="XZY", mode="horizontal-bars"); out_field = np.zeros_like(in_field).tolist(); in_field = in_field.tolist()
list_1D_same_row_XZY(in_field, out_field)

Finally, let's check the vectorized NumPy code. We are now in a position to guess which scenarios are the fastest and the slowest, so we don't need to try all 12 of them like we did before. In particular, following the same-column results, for the fastest implementation we want the shifted dimension, here Y, to have the largest stride $s_Y$, so that the shifted slices remain contiguous. This happens, for example, for `dim_order="YXZ"` and `array_order="C"` or `dim_order="ZXY"` and `array_order="F"`.

For this reason, with the default `dim_order="ZYX"` it is impossible to get the best performance observed in the same col case, whether we use C-style or Fortran-style arrays. We will not observe the worst case either.
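To make the role of the array order more concrete, the following sketch inspects the strides of two plain `np.zeros` arrays with the default `(Z, Y, X)` shape. It does not use `initialize_field`. NumPy reports strides in bytes; the hypothetical helper `element_strides` converts them to element counts, which need not match the convention used for $s_Y$ above.

```python
import numpy as np

NX, NY, NZ = 128, 128, 80

def element_strides(arr):
    """Strides expressed in number of elements instead of bytes."""
    return tuple(s // arr.itemsize for s in arr.strides)

c_field = np.zeros((NZ, NY, NX), order="C")
f_field = np.zeros((NZ, NY, NX), order="F")

# C order: the last axis (X) is contiguous, the first axis (Z) has the largest stride
c_strides = element_strides(c_field)
# Fortran order: the first axis (Z) is contiguous, the last axis (X) has the largest stride
f_strides = element_strides(f_field)
```

In both layouts the middle axis (Y) has an intermediate stride, which is consistent with the default `dim_order="ZYX"` giving neither the best nor the worst case.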

In [None]:
def numpy_1D_same_row(in_field):
    out_field = np.empty_like(in_field)
    for n in range(N_ITER):
        out_field[:, :-1, :] = 0.5 * (in_field[:, 1:, :] - in_field[:, :-1, :])
        # Periodic boundary condition
        out_field[:, -1, :] = 0.5 * (in_field[:, 0, :] - in_field[:, -1, :])
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field
    return out_field

In [None]:
%%timeit in_field = initialize_field(NX, NY, NZ, mode="horizontal-bars", array_order="C")
out_field = numpy_1D_same_row(in_field)

In [None]:
%%timeit in_field = initialize_field(NX, NY, NZ, mode="horizontal-bars", array_order="F")
out_field = numpy_1D_same_row(in_field)

To conclude this section of 1D stencils, let's try to replicate the best and worst results we obtained in the same col case. To make it more fun, and perhaps observe some weird result due to the internals of NumPy, instead of reusing the `numpy_1D_same_row()` function with modified `dim_order` arrays, let's write a new function where the shifted index is the first, instead of the last one.

In [None]:
def numpy_1D_same_row_alt(in_field):
    out_field = np.empty_like(in_field)
    for n in range(N_ITER):
        out_field[:-1, :, :] = 0.5 * (in_field[1:, :, :] - in_field[:-1, :, :])
        # Periodic boundary condition
        out_field[-1, :, :] = 0.5 * (in_field[0, :, :] - in_field[-1, :, :])
        if n < N_ITER - 1:
            in_field, out_field = out_field, in_field
    return out_field
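One possible reason why the position of the shifted axis matters is whether the shifted slices remain a single contiguous block of memory. The sketch below checks this on small arrays; it is only an illustration of the memory layout, not a performance measurement, and `is_contiguous` is a hypothetical helper.

```python
import numpy as np

field_c = np.zeros((4, 5, 6), order="C")
field_f = np.zeros((4, 5, 6), order="F")

def is_contiguous(view):
    """True if the view occupies one unbroken block of memory."""
    return bool(view.flags["C_CONTIGUOUS"] or view.flags["F_CONTIGUOUS"])

contiguity = {
    # Slicing the slowest-varying axis keeps the view contiguous
    "C, shift axis 0": is_contiguous(field_c[1:, :, :]),
    "F, shift axis 2": is_contiguous(field_f[:, :, 1:]),
    # Slicing any other axis leaves gaps between the selected elements
    "C, shift axis 1": is_contiguous(field_c[:, 1:, :]),
    "F, shift axis 0": is_contiguous(field_f[1:, :, :]),
}
```

In every case the slice is a view on the original buffer, so no data is copied when the slice is taken; only the access pattern changes.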

In [None]:
%%timeit in_field = initialize_field(NX, NY, NZ, mode="horizontal-bars", dim_order="YXZ", array_order="C")
out_field = numpy_1D_same_row_alt(in_field)

In [None]:
%%timeit in_field = initialize_field(NX, NY, NZ, mode="horizontal-bars", dim_order="YXZ", array_order="F")
out_field = numpy_1D_same_row_alt(in_field)

In [None]:
plot_field(out_field)

## 2D stencil

The last stencil we will check is the 4th-order non-monotonic diffusion we used during the course, which is defined in terms of the Laplace operator as

$$
a(i, j) = \Delta(\Delta(b(i, j)))
$$

More details about the discretization and the exact stencil can be read in [the project report](./report.pdf).
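As a rough illustration of the formula above, the sketch below applies a standard 5-point Laplacian twice. It assumes periodic wrap-around via `np.roll` and unit grid spacing, so it is not the `apply_diffusion` function from `stencil2d.py`, whose halo handling and coefficients are not reproduced here. The names `laplacian` and `double_laplacian` are hypothetical.

```python
import numpy as np

def laplacian(field):
    """5-point Laplacian on the last two axes, periodic wrap-around (assumption)."""
    return (
        np.roll(field, 1, axis=-1) + np.roll(field, -1, axis=-1)
        + np.roll(field, 1, axis=-2) + np.roll(field, -1, axis=-2)
        - 4.0 * field
    )

def double_laplacian(field):
    """Apply the Laplacian twice: a = lap(lap(b))."""
    return laplacian(laplacian(field))

# Response to a single unit spike reveals the effective 13-point stencil
field = np.zeros((1, 8, 8))
field[0, 4, 4] = 1.0
out = double_laplacian(field)
```

The response to the unit spike shows that the composed stencil reaches two points away from the centre in each direction, which is why the course implementation is called with `num_halo=2`.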

Since it is already clear from the experiments with pointwise and 1D stencils that Python `for` loops are very slow, here we will only run the NumPy implementation from `stencil2d.py` given in the course. This sets the baseline time to improve upon using the different high-level programming techniques available in Python (which will be tested in different notebooks).

In [None]:
from stencil2d import apply_diffusion as numpy_2D

In [None]:
in_field = initialize_field(NX, NY, NZ, mode="square")
out_field = np.empty_like(in_field)
plot_field(in_field)

In [None]:
numpy_2D(in_field, out_field, num_halo=2, num_iter=N_ITER)

In [None]:
plot_field(out_field)

In [None]:
%%timeit -o in_field = initialize_field(NX, NY, NZ, mode="square"); out_field = np.empty_like(in_field)
numpy_2D(in_field, out_field, num_halo=2, num_iter=N_ITER)

In [None]:
result = _
save_result(result, "numpy_2D")

# Conclusions

- NumPy arrays are much faster than Python lists
- The explanation is that (vectorized) NumPy code is much more cache friendly than Python code with nested `for` loops
- **TODO: Verify this carefully.** Unfortunately, NumPy does not take full advantage of multithreading
- NumPy is not very efficient in its memory usage (there are some exceptions such as matrix-matrix multiplication)
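One possible reading of the last point is that vectorized expressions allocate temporary arrays for intermediate results. The sketch below contrasts the slicing expression used in `numpy_1D_same_row()` with a variant that writes directly into the output through the `out=` argument of NumPy ufuncs. Both function names are hypothetical, and whether the second variant is actually faster was not measured here.

```python
import numpy as np

def diff_with_temporaries(in_field, out_field):
    # The subtraction and the multiplication each create a temporary array
    out_field[:, :-1, :] = 0.5 * (in_field[:, 1:, :] - in_field[:, :-1, :])
    out_field[:, -1, :] = 0.5 * (in_field[:, 0, :] - in_field[:, -1, :])

def diff_in_place(in_field, out_field):
    # Write the differences straight into the output buffer
    np.subtract(in_field[:, 1:, :], in_field[:, :-1, :], out=out_field[:, :-1, :])
    np.subtract(in_field[:, 0, :], in_field[:, -1, :], out=out_field[:, -1, :])
    # Scale in place, without allocating a new array
    out_field *= 0.5

rng = np.random.default_rng(0)
in_field = rng.random((4, 6, 8))
out_a = np.empty_like(in_field)
out_b = np.empty_like(in_field)
diff_with_temporaries(in_field, out_a)
diff_in_place(in_field, out_b)
same_result = np.allclose(out_a, out_b)
```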

---

# Appendices

## np.sin() vs math.sin()

There is no doubt that `np.sin()` is faster than `math.sin()` when applied to large NumPy arrays, but it has an overhead cost when applied to single elements. Here we show that `math.sin()` is more efficient than `np.sin()` when applied to four or fewer elements. However, the overhead of a Python loop is even worse than the overhead of `np.sin()`.

In [None]:
rng = np.random.default_rng()
x = rng.random(100)

`np.sin()` is ~7.7 times slower than `math.sin()` when applied to a single value.

In [None]:
%timeit math.sin(x[0])

In [None]:
%timeit np.sin(x[0])

However, we lose part of this performance the moment we use a single `for` loop.

In [None]:
%timeit for i in range(1): math.sin(x[i])

In [None]:
%timeit np.sin(x[:1])

In any case, even without for loops, `math.sin()` becomes slower than `np.sin()` the moment we need to compute more than four values.

In [None]:
%timeit math.sin(x[0]), math.sin(x[1]), math.sin(x[2]), math.sin(x[3]), math.sin(x[4]), math.sin(x[5])

In [None]:
%timeit np.sin(x[:6])

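This is because `np.sin()` processes the whole array in a single call to a compiled loop (which may use SIMD instructions), so its call overhead is paid only once, while every `math.sin()` call pays the Python function-call overhead again.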
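Outside a notebook, the same comparison can be sketched with the standard `timeit` module. The helper `best_time`, the chosen sizes, and the repetition counts are arbitrary assumptions; the absolute numbers and the crossover point depend on the machine and the NumPy version.

```python
import math
import timeit
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(100)

def best_time(func, number=2_000, repeat=3):
    """Best per-call time in seconds for a zero-argument callable."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number

sizes = (1, 4, 16, 64)
timings = {}
for size in sizes:
    chunk = x[:size]
    # One Python-level call per element
    t_math = best_time(lambda: [math.sin(v) for v in chunk])
    # One call for the whole chunk
    t_numpy = best_time(lambda: np.sin(chunk))
    timings[size] = (t_math, t_numpy)

for size, (t_math, t_numpy) in timings.items():
    print(f"{size:3d} values: math {t_math * 1e6:.2f} us, numpy {t_numpy * 1e6:.2f} us")
```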

## Different dtypes affect performance

NumPy defines different data type (dtype) objects, which describe how the bytes in the fixed-size block of memory corresponding to an array item should be interpreted. By default, NumPy will choose a proper dtype depending on the input passed to the constructor. However, we are free to change this default behaviour by specifying the dtype manually. As expected, and quickly demonstrated here using the `numpy_sin_pointwise()` function, this choice can also impact the performance of our stencil model.

In [None]:
# np.longdouble is the portable name; np.float128 is not defined on every platform
dtypes_ = [np.float16, np.float32, np.float64, np.longdouble]

In [None]:
%%timeit in_field = initialize_field(NX, NY, NZ, mode="square").astype(dtypes_[0]); out_field = np.empty_like(in_field)
numpy_sin_pointwise(in_field, out_field)

In [None]:
%%timeit in_field = initialize_field(NX, NY, NZ, mode="square").astype(dtypes_[1]); out_field = np.empty_like(in_field)
numpy_sin_pointwise(in_field, out_field)

In [None]:
%%timeit in_field = initialize_field(NX, NY, NZ, mode="square").astype(dtypes_[2]); out_field = np.empty_like(in_field)
numpy_sin_pointwise(in_field, out_field)

In [None]:
%%timeit in_field = initialize_field(NX, NY, NZ, mode="square").astype(dtypes_[3]); out_field = np.empty_like(in_field)
numpy_sin_pointwise(in_field, out_field)

More numerical precision implies moving more data to and from memory.
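To put numbers on this, the sketch below computes the memory footprint of one field of the size used in this notebook for each dtype. It allocates plain `np.zeros` arrays rather than calling `initialize_field`, and the size of `np.longdouble` is platform dependent, so it is reported but not assumed.

```python
import numpy as np

NX, NY, NZ = 128, 128, 80

candidates = [
    ("float16", np.float16),
    ("float32", np.float32),
    ("float64", np.float64),
    ("longdouble", np.longdouble),  # width depends on the platform
]

field_mib = {}
for label, dtype in candidates:
    field = np.zeros((NZ, NY, NX), dtype=dtype)
    # nbytes = number of elements * bytes per element
    field_mib[label] = field.nbytes / 2**20

for label, size in field_mib.items():
    print(f"{label:>10}: {size:.1f} MiB per field")
```

Each pointwise sweep reads one field and writes another, so the amount of data moved per iteration is roughly twice these values.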