# GT4Py Hands-on

This notebook guides you towards a fully (or at least partially) GT4Py-based implementation of the `stencil2d.py` program we saw on Day1. The module `stencil2d-gt4py-v0.py` contains the backbone of the final code; the gaps you need to fill in are marked with `# TODO`. Here we go through the porting process step by step. You will have the opportunity to implement all missing parts in isolation and test them stand-alone. Once you have completed all the mandatory tasks, you can copy and paste the relevant cells of this notebook into `stencil2d-gt4py-v0.py` to obtain a running GT4Py program. To keep things simple, we confine our attention to two CPU backends of GT4Py: `numpy` and `gtx86`.

In [1]:
import gt4py as gt
from gt4py import gtscript
import numpy as np

## Stencil computations

Let's start by implementing the 5-point Laplacian stencil
```
lap_field[i, j, k] = - 4 * in_field[  i,   j, k] 
                     +     in_field[i-1,   j, k] 
                     +     in_field[i+1,   j, k] 
                     +     in_field[  i, j-1, k] 
                     +     in_field[  i, j+1, k]
```
as a GTScript subroutine.
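
For reference, the same formula can be written in plain NumPy, which is handy for checking the offsets before moving to GTScript. The sketch below is only an illustration: the helper name `laplacian_numpy` and the test field are assumptions, not part of `stencil2d.py`. It relies on the fact that the discrete Laplacian of `i**2 + j**2` is exactly 4 at every interior point.

```python
import numpy as np


def laplacian_numpy(in_field):
    """5-point Laplacian on an I-J-K array; the outermost ring in I and J stays zero."""
    lap_field = np.zeros_like(in_field)
    lap_field[1:-1, 1:-1] = (
        -4.0 * in_field[1:-1, 1:-1]
        + in_field[:-2, 1:-1]   # i-1
        + in_field[2:, 1:-1]    # i+1
        + in_field[1:-1, :-2]   # j-1
        + in_field[1:-1, 2:]    # j+1
    )
    return lap_field


# Test field phi(i, j, k) = i**2 + j**2, replicated over three K levels.
i, j = np.meshgrid(np.arange(6.0), np.arange(5.0), indexing="ij")
phi = np.repeat((i**2 + j**2)[:, :, np.newaxis], 3, axis=2)
lap = laplacian_numpy(phi)
```

Note that NumPy slices address absolute positions, whereas GTScript field accesses such as `in_field[-1, 0, 0]` are offsets relative to the current grid point.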

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>1.</b> Fill the GTScript subroutine <tt>laplacian</tt> whose signature is already provided. <br>
</div>

In [2]:
@gtscript.function
def laplacian(in_field):
    lap_field = (
        -4.0 * in_field[0, 0, 0]
        + in_field[-1, 0, 0]
        + in_field[1, 0, 0]
        + in_field[0, -1, 0]
        + in_field[0, 1, 0]
    )
    return lap_field

We now introduce another level of abstraction with respect to `stencil2d.py`. Leveraging the `laplacian` subroutine we implement a stencil which applies the fourth-order diffusion operator

\begin{equation}
    \frac{\partial \phi}{\partial t} = - \alpha_4 \, \Delta_h \, (\Delta_h \phi) \, .
\end{equation}
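
To connect this equation with the code, it helps to write down one explicit time step. Assuming a forward Euler discretization in which the time step and the grid spacing are absorbed into the coefficient (still written $\alpha$ here, to match the `alpha` argument), and using $n$ as an assumed time-level index, the update reads

\begin{equation}
    \phi^{n+1}_{i,j,k} = \phi^{n}_{i,j,k} - \alpha \, \Delta_h \left( \Delta_h \phi^{n} \right)_{i,j,k} \, ,
\end{equation}

where $\Delta_h$ is the 5-point Laplacian introduced above. This is the operation carried out by the `stencil2d.py` lines quoted in the next task: two applications of the Laplacian followed by `in_field - alpha * out_field`.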

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>2.</b> Write a GTScript function called <tt>diffusion_defs</tt> which performs the same operations as the following lines in <tt>stencil2d.py</tt>:<br>
<code>laplacian( in_field, tmp_field, num_halo=num_halo, extend=1 )
 laplacian( tmp_field, out_field, num_halo=num_halo, extend=0 )
 out_field[:, num_halo:-num_halo, num_halo:-num_halo] = \
     in_field[:, num_halo:-num_halo, num_halo:-num_halo] \
     - alpha * out_field[:, num_halo:-num_halo, num_halo:-num_halo] </code><br>
The function receives the input field <tt>in_field</tt>, the output field <tt>out_field</tt> and the scalar coefficient <tt>alpha</tt>. Assume grid point values are stored as <tt>float</tt>s. Make sure to import and call the <tt>laplacian</tt> subroutine.<br>
<b>3.</b> Compile the stencil using <tt>gtscript.stencil()</tt>. Remember to pass the <tt>laplacian</tt> subroutine as an external symbol. Make sure that the code compiles with both the <tt>"numpy"</tt> and <tt>"gtx86"</tt> backends, and re-compile the stencil every time you modify its definition function!<br>
</div>

In [3]:
def diffusion_defs(
    in_field: gtscript.Field[float], out_field: gtscript.Field[float], *, alpha: float
):
    from __externals__ import laplacian
    from __gtscript__ import PARALLEL, computation, interval
    
    with computation(PARALLEL), interval(...):
        lap1 = laplacian(in_field)
        lap2 = laplacian(lap1)
        out_field = in_field - alpha * lap2

In [4]:
# compile with the numpy backend
diffusion_numpy = gtscript.stencil(
    definition=diffusion_defs, 
    backend="numpy", 
    externals={"laplacian": laplacian}
)

In [5]:
# compile with the gtx86 backend
diffusion_gtx86 = gtscript.stencil(
    definition=diffusion_defs, 
    backend="gtx86", 
    externals={"laplacian": laplacian},
    verbose=False
)

## Updating the boundary region

Due to current limitations in the storage design, the boundary conditions should be enforced using GT4Py if we wish to run our code on GPUs. However, implementing the halo update in the DSL is rather cumbersome, so here we restrict ourselves to a plain Python version. A GT4Py version will be provided at the end of the course.

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>4.</b> The <tt>update_halo()</tt> function receives (i) the GT4Py storage on which periodicity should be imposed and (ii) the width of the halo. Write the body of the function treating the input field as a regular <tt>numpy.ndarray</tt>. Recall that the axes order is <tt>I-J-K</tt>, while in <tt>stencil2d.py</tt> we adopted the Fortran-ish <tt>K-J-I</tt> order. Validate your code using <tt>test_update_halo()</tt>. <br>
</div>

In [2]:
def update_halo(field, num_halo):
    # bottom edge (without corners)
    field[num_halo:-num_halo, :num_halo] = field[num_halo:-num_halo, -2 * num_halo : -num_halo]

    # top edge (without corners)
    field[num_halo:-num_halo, -num_halo:] = field[num_halo:-num_halo, num_halo : 2 * num_halo]

    # left edge (including corners)
    field[:num_halo, :] = field[-2 * num_halo : -num_halo, :]

    # right edge (including corners)
    field[-num_halo:, :] = field[num_halo : 2 * num_halo]

In [3]:
def test_update_halo(f):
    data = np.load("baseline_data/update_halo.npz")
    field = data["in_field"]
    val = data["out_field"]
    num_halo = int(data["num_halo"])  # convert the stored array to a plain int for slicing
    
    f(field, num_halo)
    
    if np.allclose(field, val):
        print("Unit test for update_halo(): PASSED!")
    else:
        print("Unit test for update_halo(): FAILED.")

In [4]:
test_update_halo(update_halo)

FileNotFoundError: [Errno 2] No such file or directory: 'baseline_data/update_halo.npz'
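
When the baseline file is not available, the halo update can still be checked independently. The sketch below is an assumption-laden alternative, not the course's official test: it compares the result against `numpy.pad` with `mode="wrap"`, which builds the periodic extension of the interior directly. The grid sizes and the random interior are arbitrary choices made for this illustration.

```python
import numpy as np


def update_halo(field, num_halo):
    # bottom and top edges (without corners)
    field[num_halo:-num_halo, :num_halo] = field[num_halo:-num_halo, -2 * num_halo : -num_halo]
    field[num_halo:-num_halo, -num_halo:] = field[num_halo:-num_halo, num_halo : 2 * num_halo]
    # left and right edges (including corners)
    field[:num_halo, :] = field[-2 * num_halo : -num_halo, :]
    field[-num_halo:, :] = field[num_halo : 2 * num_halo, :]


num_halo = 2
nx, ny, nz = 4, 5, 3  # small, arbitrary grid in I-J-K order

# Random interior surrounded by a zero-filled halo.
rng = np.random.default_rng(0)
field = np.zeros((nx + 2 * num_halo, ny + 2 * num_halo, nz))
field[num_halo:-num_halo, num_halo:-num_halo] = rng.random((nx, ny, nz))
interior = field[num_halo:-num_halo, num_halo:-num_halo].copy()

update_halo(field, num_halo)

# Periodic extension in I and J only; the K axis is not padded.
expected = np.pad(
    interior, ((num_halo, num_halo), (num_halo, num_halo), (0, 0)), mode="wrap"
)
halo_matches_wrap = np.array_equal(field, expected)
```

Because the left and right edges are copied after the bottom and top ones, the corners pick up values that are already periodic in J, which is why the comparison covers the whole array including corners.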

## Time integration

The time marching procedure is carried out by the `apply_diffusion` function, whose signature reads:

```
def apply_diffusion(diffusion_stencil, in_field, out_field, alpha, num_halo, num_iter=1):
```

Here `diffusion_stencil` is the stencil object which applies the diffusion operator, `in_field` and `out_field` are the input and output fields, `alpha` is the diffusion coefficient, `num_halo` is the number of halo points, and `num_iter` is the number of iterations. Each iteration consists of three steps:

1. Updating the halo region of the input field `in_field`;
2. Running the `diffusion` stencil on `in_field` and storing the results in `out_field`;
3. Updating the halo region of the output field `out_field` if it is the last iteration, otherwise swapping `in_field` and `out_field`.

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
    <b>5.</b> Determine the <tt>origin</tt> of the computation domain and its extent <tt>domain</tt> based on <tt>num_halo</tt> and the grid size. Hint: use the <tt>shape</tt> attribute of a GT4Py storage to retrieve its size.<br>
    <b>6.</b> Add the call to <tt>diffusion_stencil</tt>. <br>
</div>

In [11]:
def apply_diffusion(diffusion_stencil, in_field, out_field, alpha, num_halo, num_iter=1):
    # origin and extent of the computational domain
    origin = (num_halo, num_halo, 0)
    domain = (
        in_field.shape[0] - 2 * num_halo,
        in_field.shape[1] - 2 * num_halo,
        in_field.shape[2],
    )

    for n in range(num_iter):
        # halo update (update_halo takes only the field and the halo width)
        update_halo(in_field, num_halo)

        # run the stencil
        diffusion_stencil(
            in_field=in_field,
            out_field=out_field,
            alpha=alpha,
            origin=origin,
            domain=domain,
        )

        if n < num_iter - 1:
            # swap input and output fields
            in_field, out_field = out_field, in_field
        else:
            # halo update
            update_halo(out_field, num_halo)
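
To make the meaning of `origin` and `domain` concrete, the sketch below computes both for a plain NumPy array and turns them into slices. The helper `compute_origin_and_domain` is a hypothetical name introduced only for this illustration, and the shape mirrors the grid used later in the notebook.

```python
import numpy as np


def compute_origin_and_domain(shape, num_halo):
    """Return (origin, domain) for a field with the given I-J-K shape."""
    origin = (num_halo, num_halo, 0)
    domain = (shape[0] - 2 * num_halo, shape[1] - 2 * num_halo, shape[2])
    return origin, domain


num_halo = 2
field = np.zeros((128 + 2 * num_halo, 128 + 2 * num_halo, 64))
origin, domain = compute_origin_and_domain(field.shape, num_halo)

# Slices spanning exactly the points a stencil call with this origin and
# domain would write to: the interior in I and J, every level in K.
interior = tuple(slice(o, o + d) for o, d in zip(origin, domain))
field[interior] = 1.0
```

The halo points stay untouched, which is why `apply_diffusion` has to call `update_halo` explicitly.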

## Input and output fields

We are almost done. The last mile concerns the storages which contain the input and output fields. We explained in `03-GT4Py-concepts.ipynb` that stencil objects must be fed with custom arrays created through one of the utilities provided by the module `gt4py.storage`. Data is laid out in memory so as to get the maximum performance out of the target architecture. This does not affect the user interface, which is hardware-agnostic.

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
    <b>7.</b> Based on what we said in <tt>03-GT4Py-concepts.ipynb</tt>, which is the most appropriate default origin for the storages? Assume <tt>num_halo = 2</tt>.<br>
    <b>8.</b> Convert the NumPy array <tt>in_field_np</tt> into a GT4Py storage called <tt>in_field</tt> using <tt>gt4py.storage.from_array()</tt>. Use the <tt>gtx86</tt> backend.<br>
    <b>9.</b> Allocate an empty storage <tt>out_field</tt> to hold the output field. <tt>out_field</tt> must have the same shape as <tt>in_field</tt>. Pick the same backend as for <tt>in_field</tt>.<br>
    <b>10.</b> Use the newly allocated fields to test <tt>diffusion_gtx86</tt> via <tt>test_diffusion()</tt>.
</div>

In [12]:
nx = 128
ny = 128
nz = 64
num_halo = 2

# default origin
default_origin = (num_halo, num_halo, 0)

# numpy array
in_field_np = np.zeros((nx + 2 * num_halo, ny + 2 * num_halo, nz), dtype=float)
in_field_np[
    num_halo + nx // 4 : num_halo + 3 * nx // 4,
    num_halo + ny // 4 : num_halo + 3 * ny // 4,
    nz // 4 : 3 * nz // 4,
] = 1.0

# gt4py storage collecting the input values
in_field = gt.storage.from_array(in_field_np, backend="gtx86", default_origin=default_origin)

# empty gt4py storage which will collect the output values
out_field = gt.storage.empty("gtx86", default_origin, (nx + 2 * num_halo, ny + 2 * num_halo, nz), float)

In [14]:
def test_diffusion(stencil_object, in_field, out_field):
    phi = np.asarray(in_field)
    tmp1 = np.empty_like(phi)
    tmp2 = np.empty_like(phi)
    out = np.empty_like(phi)
    
    tmp1[1:-1, 1:-1] = (
        phi[2:, 1:-1] 
        + phi[:-2, 1:-1] 
        + phi[1:-1, 2:] 
        + phi[1:-1, :-2] 
        - 4. * phi[1:-1, 1:-1]
    )
    tmp2[2:-2, 2:-2] = (
        tmp1[3:-1, 2:-2] 
        + tmp1[1:-3, 2:-2] 
        + tmp1[2:-2, 3:-1] 
        + tmp1[2:-2, 1:-3] 
        - 4. * tmp1[2:-2, 2:-2]
    )
    out[2:-2, 2:-2] = phi[2:-2, 2:-2] - 2.0 * tmp2[2:-2, 2:-2]
    
    stencil_object(in_field=in_field, out_field=out_field, alpha=2.0)
    
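    # compare interior points only: the halo of both arrays is never written
    if np.allclose(np.asarray(out_field)[2:-2, 2:-2], out[2:-2, 2:-2]):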
        print(f"Unit test for diffusion_{stencil_object.backend}: PASSED!")
    else:
        print(f"Unit test for diffusion_{stencil_object.backend}: FAILED.")

In [15]:
test_diffusion(diffusion_gtx86, in_field, out_field)

Unit test for diffusion_gtx86: PASSED!


## Running

All right! We are now ready to move on to `stencil2d-gt4py-v0.py`.

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
    <b>11.</b> Take some time to understand the structure of the code and work out which parts of this notebook can be transferred as they are (or with minor modifications) into the script.<br>
    <b>12.</b> Fill the holes marked with <tt># TODO</tt> by copy-and-paste from this notebook.
</div>

Let's run `stencil2d-gt4py-v0.py` and check that the stencil compiles fine with both backends:

In [17]:
!python stencil2d-gt4py-v0.py --nx=32 --ny=32 --nz=32 --num_iter=1024 --backend=numpy

Elapsed time for work = 1.003267765045166 s


In [16]:
!python stencil2d-gt4py-v0.py --nx=32 --ny=32 --nz=32 --num_iter=1024 --backend=gtx86

Elapsed time for work = 0.6777076721191406 s


<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
    <b>13.</b> From a terminal, execute the <tt>validation.sh</tt> Bash script to validate the numerics of your code. This script takes two command line arguments: the version tag of the program (here <code>v0</code>) and the desired backend. <br>
    <b>14 (Bonus).</b> Run both <tt>stencil2d.py</tt> and <tt>stencil2d-gt4py-v0.py</tt> with <code>--nx=128 --ny=128 --nz=64 --num_iter=1024</code>. How does the performance compare across the different CPU backends of GT4Py? <br>
    <b>15 (Bonus).</b> Increase the grid size to <code>nx x ny x nz = 256 x 128 x 64</code> and then <code>nx x ny x nz = 256 x 256 x 64</code>. Speculate how the speed-up provided by the <tt>gtx86</tt> backend varies with the number of grid points.
</div>

## Further optimizations

Let's try to apply a couple of optimizations to our code. We shall proceed along the lines of what we did on Day1 with `stencil2d.F90`. All the following tasks are optional and involve the `gtx86` backend only.

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
    <b>16 (Bonus).</b> Make a copy of <tt>stencil2d-gt4py-v0.py</tt> and name it <tt>stencil2d-gt4py-v1.py</tt>. Inside <tt>diffusion_defs</tt>, inline the subroutine <tt>laplacian</tt> by replacing the calls to the function with its body. Use the <tt>validation.sh</tt> script to validate your code. What is the performance gain with respect to <tt>stencil2d-gt4py-v0.py</tt>? Base your answer on the timings measured at different grid sizes. <br>
    <b>17 (Bonus).</b> Here we go for a more aggressive optimization. Make a copy of <tt>stencil2d-gt4py-v1.py</tt> and name it <tt>stencil2d-gt4py-v2.py</tt>. Fuse all stages (i.e. statements) inside <tt>diffusion_defs</tt> into a single stage, as done in <code>day1/.solutions/stencil2d-inlining_v2.F90</code>. Modify the function signature by replacing <tt>alpha</tt> with <tt>a1 = - alpha</tt>, <tt>a2 = - 2 * alpha</tt>, <tt>a8 = 8 * alpha</tt> and <tt>a20 = 1 - 20 * alpha</tt>. Adapt the stencil call inside <tt>apply_diffusion</tt> accordingly. Validate your code using <tt>validation.sh</tt>. Do you observe any appreciable improvement in performance? In your opinion, is the stencil code still as understandable and intuitive as in <tt>stencil2d-gt4py-v0.py</tt>? <br>
</div>
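
The coefficients in task 17 follow from expanding the Laplacian of the Laplacian: the centre point gets weight 20, the four nearest neighbours -8, the four diagonal neighbours 2, and the four points at distance two 1. The NumPy sketch below checks this numerically. It is not the GTScript solution: the function names are hypothetical, and periodic wrap-around via `np.roll` is an assumption made here to avoid halo handling.

```python
import numpy as np


def shift(f, di, dj):
    """Return the field whose value at (i, j) is f[i + di, j + dj], with periodic wrap."""
    return np.roll(f, (-di, -dj), axis=(0, 1))


def lap(f):
    """5-point Laplacian with periodic boundaries."""
    return -4.0 * f + shift(f, -1, 0) + shift(f, 1, 0) + shift(f, 0, -1) + shift(f, 0, 1)


def diffusion_two_stage(phi, alpha):
    """Two Laplacian stages followed by the update, as in the v0 stencil."""
    return phi - alpha * lap(lap(phi))


def diffusion_fused(phi, alpha):
    """Single-stage form using the coefficients listed in task 17."""
    a1, a2, a8, a20 = -alpha, -2.0 * alpha, 8.0 * alpha, 1.0 - 20.0 * alpha
    return (
        a20 * phi
        + a8 * (shift(phi, -1, 0) + shift(phi, 1, 0) + shift(phi, 0, -1) + shift(phi, 0, 1))
        + a2 * (shift(phi, -1, -1) + shift(phi, -1, 1) + shift(phi, 1, -1) + shift(phi, 1, 1))
        + a1 * (shift(phi, -2, 0) + shift(phi, 2, 0) + shift(phi, 0, -2) + shift(phi, 0, 2))
    )


rng = np.random.default_rng(42)
phi = rng.random((8, 9))
alpha = 1.0 / 32.0

two_stage = diffusion_two_stage(phi, alpha)
fused = diffusion_fused(phi, alpha)
max_abs_diff = np.max(np.abs(two_stage - fused))
```

The two forms agree up to floating-point rounding. The fused version needs no temporary field, at the price of a 13-point expression that is harder to read.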