# GT4Py Concepts
*Disclaimer: This notebook contains substantial contributions from the whole GT4Py development team.*

This notebook walks through the major concepts of the [GT4Py](https://github.com/GridTools/gt4py) stencil library. The concepts are introduced through illustrative examples that are particularly relevant to weather and climate modeling.

In [1]:
import gt4py as gt
from gt4py import gtscript
import numpy as np

## Defining a stencil computation

Horizontal advection by the mean flow represents a major driving force for atmospheric dynamics. Consider the conserved quantity $\phi = \rho \psi$, where $\rho$ is the air density and $\psi$ a specific quantity. Its transport by the steering wind $(u, \, v)$ is governed by the equation

\begin{equation}
    \frac{\partial \phi}{\partial t} + \frac{\partial \phi u}{\partial x} + \frac{\partial \phi v}{\partial y} = 0 \, .
\end{equation}

(Observe that setting $\psi \equiv 1$ recovers the continuity equation.) An established way to discretize this equation on a Cartesian grid is by centered spatio-temporal differencing:

\begin{equation}
    \frac{\phi^{n+1}_{i,j} - \phi^{n-1}_{i,j}}{2 \Delta t} + \frac{\phi_{i+1,j}^n \, u_{i+1,j}^n - \phi_{i-1,j}^n \, u_{i-1,j}^n}{2 \Delta x} + \frac{\phi_{i,j+1}^n \, v_{i,j+1}^n - \phi_{i,j-1}^n \, v_{i,j-1}^n}{2 \Delta y} = 0 \, .
\end{equation}

This method is commonly known as the leapfrog scheme. Here $\Delta x$ and $\Delta y$ are the grid spacings in the $x$- and $y$-direction, $\Delta t$ is the time step, and for a generic variable $\xi = \xi(x, \, y, \, t)$ we denote by $\xi_{i,j}^n$ the numerical approximation of $\xi(i \Delta x, \, j \Delta y, \, n \Delta t)$.

![grid](img/grid.png)

The formula which advances the solution forward in time is found to be

\begin{equation}
    \phi_{i,j}^{n+1} = \phi_{i, j}^{n-1} - \frac{\Delta t}{\Delta x} \left( \phi_{i+1,j}^n \, u_{i+1,j}^n - \phi_{i-1,j}^n \, u_{i-1,j}^n \right) - \frac{\Delta t}{\Delta y} \left( \phi_{i,j+1}^n \, v_{i,j+1}^n - \phi_{i,j-1}^n \, v_{i,j-1}^n \right) \, .
\end{equation}

We recognize the update operator as a stencil computation. The field $\phi^{n+1}$ at $(i, \, j)$ (blue point in the figure below) is computed by accessing $\phi^{n-1}$ at $(i, \, j)$ and $\phi^n$, $u^n$ and $v^n$ at the neighboring points $(i-1, \, j)$, $(i+1, \, j)$, $(i, \, j-1)$ and $(i, \, j+1)$ (red points).

![stencil](img/stencil.png)
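
As a point of reference, the update formula above can be sketched in plain NumPy with array slicing. This is a hand-written illustration, not GT4Py code: the function name `leapfrog_numpy` and the choice to leave the boundary points at zero are assumptions made for this example.

```python
import numpy as np

def leapfrog_numpy(u, v, phi_old, phi_now, dt, dx, dy):
    """Reference leapfrog update on the interior points of 3D arrays."""
    phi_new = np.zeros_like(phi_now)
    # flux products phi * u and phi * v at time level n
    fx = phi_now * u
    fy = phi_now * v
    # interior points only: each one reads its four horizontal neighbors
    phi_new[1:-1, 1:-1] = phi_old[1:-1, 1:-1] - (
        dt / dx * (fx[2:, 1:-1] - fx[:-2, 1:-1])
        + dt / dy * (fy[1:-1, 2:] - fy[1:-1, :-2])
    )
    return phi_new

rng = np.random.default_rng(0)
shape = (8, 6, 3)
u, v, phi_old, phi_now = (rng.random(shape) for _ in range(4))
phi_new = leapfrog_numpy(u, v, phi_old, phi_now, dt=0.1, dx=1.0, dy=1.0)
```

The slices `2:` and `:-2` play the role of the neighbors at $i+1$ and $i-1$; the outermost ring of points is never written because it lacks a full set of neighbors.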

GT4Py exposes the domain-specific language (DSL) GTScript to express stencil computations as regular Python functions. 

In [2]:
def leapfrog_defs(
    u: gtscript.Field[float], 
    v: gtscript.Field[float], 
    phi_old: gtscript.Field[float], 
    phi_now: gtscript.Field[float],
    phi_new: gtscript.Field[float],
    *,
    dt: float,
    dx: float,
    dy: float
):
    from __gtscript__ import PARALLEL, computation, interval
    
    with computation(PARALLEL), interval(...):
        phi_new = phi_old[0, 0, 0] - (
            dt / dx * (phi_now[1, 0, 0] * u[1, 0, 0] - phi_now[-1, 0, 0] * u[-1, 0, 0])
            + dt / dy * (phi_now[0, 1, 0] * v[0, 1, 0] - phi_now[0, -1, 0] * v[0, -1, 0])
        )

Let's demystify the definition function block by block.

* All input parameters must be annotated. GTScript offers the type descriptor `Field` for data fields. This descriptor is parametric in the data-type. Supported data-types are: `float`, `numpy.float64`. Scalar coefficients must appear as keyword-only parameters.
* The function adopts an object-oriented interface: its signature includes read-only fields (`u`, `v`, `phi_old`, `phi_now`), scalar coefficients (`dt`, `dx`, `dy`), and the fields to be computed (`phi_new`).
* Any computation must be enclosed in a **computation block**. Computation blocks are defined as one or multiple assignments (or **stages**) wrapped within a `with` statement. The `with` construct is used in combination with two context managers: `computation()` and `interval()`. 
    1. `computation()` specifies the iteration order in the vertical direction. This can be either `PARALLEL`, `FORWARD` or `BACKWARD`. Since here we do not have any data dependency in the vertical, we set `computation(PARALLEL)`. We will see later an example where both forward and backward sweeps are needed.
    2. `interval()` specifies the vertical region of application. Range specification follows standard Python range specification as closely as possible.
        - The starting element is included, while the ending element is not. 
        - Negative numbers represent the distance from the last element.
        - `None` denotes the end of the axis.
        - The three dots `...` are syntactic sugar for `(0, None)`. In other words, they signify that the iteration must span all elements in the vertical direction.
* The import statement is optional.
* Neighboring points are accessed through the corresponding **offsets**, i.e. the relative displacements with respect to the current point. Offsets are signed integers. The syntax is `[x_offset, y_offset, z_offset]`. Please have a look at the figure below for a better understanding of how offsets work.
* Note that `for` loops are abstracted away and computations are defined for a single grid point. Loop bounds will be specified when running the computation.
* No return statement is required. This is fully compliant with the object-oriented interface.

![offsets](img/offsets.png)
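
The range rules listed for `interval()` mirror Python's own slicing conventions. The following sketch spells them out with a hypothetical helper, `resolve_interval`, which is not part of GT4Py and only illustrates which vertical indices a given `(start, end)` pair would cover on an axis of `nz` levels.

```python
def resolve_interval(start, end, nz):
    """Hypothetical helper: list the vertical indices covered by (start, end).

    Rules: start is included, end is excluded, negative values are
    distances from the last element, and None denotes the end of the axis.
    """
    lo = start + nz if start < 0 else start
    if end is None:
        hi = nz
    else:
        hi = end + nz if end < 0 else end
    return list(range(lo, hi))

nz = 5
print(resolve_interval(0, None, nz))   # every level, like interval(...)
print(resolve_interval(0, 1, nz))      # first level only
print(resolve_interval(1, -1, nz))     # all levels except first and last
print(resolve_interval(-1, None, nz))  # last level only
```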

## Compiling a stencil

GT4Py can generate high-performance implementations of a stencil starting from its definition function. The GT4Py pipeline relies on the [GridTools (GT) framework](https://github.com/GridTools) to produce native implementations for different platforms. The piece of software in charge of synthesizing optimized code for a specific hardware architecture is called a **backend**. Besides the GT-based backends, GT4Py also offers Pythonic code-generating backends. These do not trigger any compilation, and are thus suitable for early testing and prototyping.

We use the expression **stencil compilation** to indicate the joint procedure which generates (and possibly compiles) the stencil code on-the-fly, creates Python bindings for it and imports these bindings in the current scope. The stencil compilation is accomplished by the function `gtscript.stencil()`:

In [3]:
backend = "gtx86"
leapfrog = gtscript.stencil(definition=leapfrog_defs, backend=backend, verbose=True)

We observe that the backend is specified as a string. Available options are:

* `"debug"`: Pythonic backend which explicitly iterates over all grid points;
* `"numpy"`: Pythonic backend which adopts vectorized syntax;
* `"gtx86"`: GT-based backend devised for a generic CPU;
* `"gtmc"`: GT-based backend devised specifically for many-core CPUs;
* `"gtcuda"`: GT-based backend devised for NVIDIA GPUs.
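
As a rough picture of the difference between the two Pythonic backends, the sketch below writes the same centered difference in $x$ twice: once with explicit loops over all grid points and once with vectorized slicing. Both functions are hand-written for illustration (their names are assumptions); they are not the code that GT4Py actually generates.

```python
import numpy as np

def diff_x_loops(phi):
    """Explicit iteration over the interior grid points."""
    nx, ny, nz = phi.shape
    out = np.zeros_like(phi)
    for i in range(1, nx - 1):
        for j in range(ny):
            for k in range(nz):
                out[i, j, k] = phi[i + 1, j, k] - phi[i - 1, j, k]
    return out

def diff_x_vectorized(phi):
    """Same computation expressed with whole-array slices."""
    out = np.zeros_like(phi)
    out[1:-1] = phi[2:] - phi[:-2]
    return out

phi = np.random.default_rng(1).random((6, 5, 4))
same = np.array_equal(diff_x_loops(phi), diff_x_vectorized(phi))
print(same)
```

The loop version is easy to step through in a debugger, whereas the vectorized version delegates the iteration to NumPy.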

`gtscript.stencil()` returns a callable object (henceforth referred to as **stencil object**) which exposes a high-level entry-point to the generated code.

It is worth mentioning that the generated code, binaries and bindings are cached for future use. If you prefer not to rely on this caching mechanism, you should pass the keyword argument `rebuild=True` to `gtscript.stencil()`.

In [4]:
# trigger compilation
%timeit -n 1 -r 1 gtscript.stencil(definition=leapfrog_defs, backend=backend, rebuild=True)
# exploit cache
%timeit -n 1 -r 1 gtscript.stencil(definition=leapfrog_defs, backend=backend)

11 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
9.44 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
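
The gap between the two timings is what one would expect from a build cache. The toy sketch below mimics the behavior of the `rebuild` flag. It is an analogy only: `get_stencil`, the cache key and the `builds` list are all invented for this example and say nothing about how GT4Py implements its cache.

```python
_cache = {}
builds = []  # records every simulated build, for illustration only

def get_stencil(definition, backend, rebuild=False):
    """Toy stand-in for a cached build step (not GT4Py's implementation)."""
    key = (definition.__code__.co_code, backend)
    if rebuild or key not in _cache:
        builds.append(key)        # stands for code generation and compilation
        _cache[key] = definition  # stands for the resulting stencil object
    return _cache[key]

def my_defs(x):
    return x + 1

get_stencil(my_defs, "numpy")                # first call: builds
get_stencil(my_defs, "numpy")                # second call: served from the cache
get_stencil(my_defs, "numpy", rebuild=True)  # cache bypassed: builds again
print(len(builds))
```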


## Storages

GT4Py provides data storages to hold three-dimensional fields which sit on structured grids. The following figure shows how the array elements (green boxes) map to the grid points (grey dots). The pairs between square brackets represent the element indices in a horizontal slice of the storage.

![storage](img/storage.png)

The storages subclass `numpy.ndarray`. (This will change soon, since inheriting from `numpy.ndarray` is nowadays discouraged. Adhering to [NEP 18](https://numpy.org/neps/nep-0018-array-function-protocol.html), storages will be coded as duck arrays which implement the `__array_function__` protocol.) Storages therefore feature the same high-level API as `numpy.ndarray`. In particular, the user interface hides as much as possible all low-level and backend-specific details, such as the memory layout, strides and padding. All these aspects are handled internally by GT4Py in a transparent fashion.
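
To see why hiding the memory layout matters, consider two plain NumPy arrays holding the same values in row-major and column-major order. This is a NumPy-only sketch, not a GT4Py storage: the point is merely that indexing and arithmetic give the same results while the strides differ.

```python
import numpy as np

a_c = np.arange(24, dtype=float).reshape(2, 3, 4)  # C (row-major) layout
a_f = np.asfortranarray(a_c)                       # Fortran (column-major) layout

print(a_c.strides, a_f.strides)        # byte strides differ between layouts
print(np.array_equal(a_c, a_f))        # the logical content is identical
print(a_c[1, 2, 3] == a_f[1, 2, 3])    # indexing is layout-agnostic
```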

The module `gt4py.storage` exposes useful utilities to either allocate a GT4Py storage, or convert a `numpy.ndarray` into a GT4Py storage. When instantiating a storage, attention should be paid to the so-called `default_origin`. This represents the element which is aligned in memory. We will see in the next section an empirical way to set this parameter.

In [5]:
# grid size
nx = 128
ny = 128
nz = 64

# storage shape
shape = (nx, ny, nz)

# default origin (trust it for now!)
default_origin = (1, 1, 0)

# allocate an empty storage
phi_new = gt.storage.empty(backend, default_origin, shape, dtype=float)

# allocate a storage filled with zeros
v = gt.storage.zeros(backend, default_origin, shape, dtype=float)

# allocate a storage filled with ones
u = gt.storage.ones(backend, default_origin, shape, dtype=float)

# create storages out of numpy.ndarrays
phi_old = gt.storage.from_array(np.random.rand(*shape), backend, default_origin)
phi_now = gt.storage.from_array(np.random.rand(*shape), backend, default_origin)

## Running computations

Executing stencil computations is as simple as a function call:

In [6]:
leapfrog(
    u=u,
    v=v,
    phi_old=phi_old,
    phi_now=phi_now,
    phi_new=phi_new,
    dt=1.0,
    dx=1.0,
    dy=1.0,
    origin=(1, 1, 0),
    domain=(nx - 2, ny - 2, nz)
)

The stencil object retains the same signature as its definition function and adds two parameters: `origin` and `domain`. The former specifies the first element of the output field `phi_new` for which a new value should be computed. In other words, it represents the origin of the region of application (or **computation domain**) of the stencil. The extent of the region of application is determined by `domain`. Here is a schematic visualization of the two concepts:

![halo](img/halo.png)

The blue area denotes the computation domain, i.e. where values for `phi_new` can be computed and stored. On the other hand, the red boxes form the **boundary region** where values for `phi_new` cannot be calculated, but where the input fields `u`, `v` and `phi_now` are read. It should be remarked that the figure showcases the *largest* possible computation domain. It is possible to restrict the application of the stencil to a subset of the largest feasible computation domain, provided that the following conditions are satisfied:

In [7]:
# stencil halo
stencil_extent = (1, 1, 0)

# storage shape
buffer_shape = (nx, ny, nz)

# stencil origin
origin = (1, 1, 0)

# stencil computation domain
domain = (nx - 2, ny - 2, nz)

# requirements
assert all(origin[i] >= stencil_extent[i] for i in range(3))
assert all(origin[i] + domain[i] <= buffer_shape[i] - stencil_extent[i] for i in range(3))
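
The two requirements can be wrapped into small helpers. The sketch below is plain Python with hypothetical names (`largest_domain`, `is_valid`); it assumes, as in the cell above, that the stencil extent is the same on both sides of each axis.

```python
def largest_domain(shape, extent):
    """Origin and domain of the largest region that leaves room for the halo."""
    origin = tuple(extent)
    domain = tuple(n - 2 * e for n, e in zip(shape, extent))
    return origin, domain

def is_valid(origin, domain, shape, extent):
    """Check the two requirements: halo before the origin and after the domain."""
    return all(
        o >= e and o + d <= n - e
        for o, d, n, e in zip(origin, domain, shape, extent)
    )

shape, extent = (128, 128, 64), (1, 1, 0)
origin, domain = largest_domain(shape, extent)
print(origin, domain)
print(is_valid(origin, domain, shape, extent))               # largest domain
print(is_valid((2, 2, 0), (10, 10, 64), shape, extent))      # a smaller subset
print(is_valid((0, 1, 0), (126, 126, 64), shape, extent))    # origin inside the halo
```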

It should also be noted that the binding between the symbols used within the definition function and the storage buffers happens at invocation time. This implies that the stencil object is not bound to any given grid size. Therefore the same stencil computation can be run on different grids and/or computation domains without any re-compilation.

## Sub-routines

GTScript allows the user to call a custom function (sub-routine) inside a computation block. This function can accept both fields and scalar coefficients, perform stencil operations, and eventually return one or multiple fields. A sub-routine should be thought of as a macro which is automatically expanded by the GT4Py pipeline, so there is little performance penalty associated with sub-routines. This is in contrast with regular Python functions, whose invocation may entail significant overheads. To make a function callable from within a stencil, use the `gtscript.function` decorator.

In [14]:
@gtscript.function
def centered_diff_x(dx, u, phi):
    return (phi[1, 0, 0] * u[1, 0, 0] - phi[-1, 0, 0] * u[-1, 0, 0]) / (2.0 * dx)

@gtscript.function
def centered_diff_y(dy, v, phi):
    return (phi[0, 1, 0] * v[0, 1, 0] - phi[0, -1, 0] * v[0, -1, 0]) / (2.0 * dy)

def leapfrog_subroutines_defs(
    u: gtscript.Field[float], 
    v: gtscript.Field[float], 
    phi_old: gtscript.Field[float], 
    phi_now: gtscript.Field[float],
    phi_new: gtscript.Field[float],
    *,
    dt: float,
    dx: float,
    dy: float
):
    from __gtscript__ import PARALLEL, computation, interval
    from __externals__ import centered_diff_x, centered_diff_y
    
    with computation(PARALLEL), interval(...):
        dphi_dx = centered_diff_x(dx, u, phi_now)
        dphi_dy = centered_diff_y(dy, v, phi_now)
        phi_new = phi_old - 2.0 * dt * (dphi_dx + dphi_dy)

(Please note that `field[0, 0, 0]` can be shortened to `field`.) Sub-routines are imported within the definition function as external symbols. The map between the symbols and the actual function objects is set at compilation time through the `externals` dictionary:

In [15]:
leapfrog_subroutines = gtscript.stencil(
    definition=leapfrog_subroutines_defs, 
    backend=backend, 
    externals={"centered_diff_x": centered_diff_x, "centered_diff_y": centered_diff_y}
)

The systematic use of sub-routines may avoid duplicated code and improve readability, without introducing unacceptable overheads. The latter statement can be easily validated on our simple example:

In [16]:
fields = {"u": u, "v": v, "phi_old": phi_old, "phi_now": phi_now, "phi_new": phi_new}
scalars = {"dt": 1.0, "dx": 1.0, "dy": 1.0}
%timeit leapfrog(**fields, **scalars, origin=(1, 1, 0), domain=(nx - 2, ny - 2, nz))
%timeit leapfrog_subroutines(**fields, **scalars, origin=(1, 1, 0), domain=(nx - 2, ny - 2, nz))

1.99 ms ± 28.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.09 ms ± 2.55 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
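
Beyond having similar run times, the two stencils compute the same update: the factor `2.0 * dt` in the sub-routine version cancels the `2.0 * dx` and `2.0 * dy` in the centered differences. The scalar sketch below checks this for a single grid point; the function and argument names are illustrative and stand for the neighboring flux values $\phi u$ and $\phi v$.

```python
import math

def direct_update(p_old, pu_east, pu_west, pv_north, pv_south, dt, dx, dy):
    """Update written as in the first leapfrog definition."""
    return p_old - (dt / dx * (pu_east - pu_west) + dt / dy * (pv_north - pv_south))

def subroutine_update(p_old, pu_east, pu_west, pv_north, pv_south, dt, dx, dy):
    """Update written via centered differences, as in the sub-routine version."""
    dphi_dx = (pu_east - pu_west) / (2.0 * dx)
    dphi_dy = (pv_north - pv_south) / (2.0 * dy)
    return p_old - 2.0 * dt * (dphi_dx + dphi_dy)

args = dict(p_old=1.3, pu_east=0.7, pu_west=0.2, pv_north=-0.4, pv_south=0.9,
            dt=0.05, dx=2.0, dy=3.0)
a = direct_update(**args)
b = subroutine_update(**args)
print(math.isclose(a, b, rel_tol=1e-12))
```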


## Run-time conditionals

GT4Py supports all binary operators and ternary operators. The latter are also known as conditional expressions in Python, and can be used to calculate the absolute value of a field:

In [11]:
@gtscript.function
def absolute_value(phi):
    return phi if phi > 0 else -phi

The ternary operator can also be split into an if-statement followed by an else-statement:

In [13]:
@gtscript.function
def absolute_value(phi):
    if phi > 0:
        out = phi
    else:
        out = -phi
    return out

Another notable example where run-time conditionals come in handy is the numerical integration of the hyperbolic equation

\begin{equation}
    \frac{\partial \phi}{\partial t} + \frac{\partial \phi u}{\partial x} + \frac{\partial \phi v}{\partial y} = 0
\end{equation}

by the upwind scheme:

\begin{equation}
    F_{i,j} =
    \begin{cases}
        & \left( \phi_{i,j} \, u_{i,j} - \phi_{i-1,j} \, u_{i-1,j} \right) / \Delta x \qquad \text{if $u_{i,j} > 0$} \\
        & \left( \phi_{i+1,j} \, u_{i+1,j} - \phi_{i,j} \, u_{i,j} \right) / \Delta x \qquad \text{if $u_{i,j} < 0$}
    \end{cases} \\
    G_{i,j} =
    \begin{cases}
        & \left( \phi_{i,j} \, v_{i,j} - \phi_{i,j-1} \, v_{i,j-1} \right) / \Delta y \qquad \text{if $v_{i,j} > 0$} \\
        & \left( \phi_{i,j+1} \, v_{i,j+1} - \phi_{i,j} \, v_{i,j} \right) / \Delta y \qquad \text{if $v_{i,j} < 0$}
    \end{cases} \\
    \phi_{i,j}^{n+1} = \phi_{i,j}^n - \Delta t \left( F_{i,j}^n + G_{i,j}^n \right) \, .
\end{equation}

In [32]:
@gtscript.function
def upwind_diff_x(dx, u, phi):
    out = (
        (phi[0, 0, 0] * u[0, 0, 0] - phi[-1, 0, 0] * u[-1, 0, 0]) / dx
        if u > 0 else
        (phi[1, 0, 0] * u[1, 0, 0] - phi[0, 0, 0] * u[0, 0, 0]) / dx
    )
    return out

@gtscript.function
def upwind_diff_y(dy, v, phi):
    out = (
        (phi[0, 0, 0] * v[0, 0, 0] - phi[0, -1, 0] * v[0, -1, 0]) / dy
        if v > 0 else
        (phi[0, 1, 0] * v[0, 1, 0] - phi[0, 0, 0] * v[0, 0, 0]) / dy
    )
    return out

def upwind_defs(
    u: gtscript.Field[float], 
    v: gtscript.Field[float], 
    phi_now: gtscript.Field[float],
    phi_new: gtscript.Field[float],
    *,
    dt: float,
    dx: float,
    dy: float
):
    from __gtscript__ import PARALLEL, computation, interval
    from __externals__ import upwind_diff_x, upwind_diff_y
    
    with computation(PARALLEL), interval(...):
        dphi_dx = upwind_diff_x(dx, u, phi_now)
        dphi_dy = upwind_diff_y(dy, v, phi_now)
        phi_new = phi_now - dt * (dphi_dx + dphi_dy)
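
For comparison, the same point-wise selection can be sketched in plain NumPy with `np.where`, which picks the backward or forward difference depending on the sign of the local wind. This is a hand-written reference, not GT4Py output: the name `upwind_numpy` and the zero-filled boundary are assumptions made for this example. As in the GTScript version, a wind of exactly zero falls into the forward-difference branch.

```python
import numpy as np

def upwind_numpy(u, v, phi_now, dt, dx, dy):
    """Reference upwind update on the interior points of 3D arrays."""
    phi_new = np.zeros_like(phi_now)
    fx = phi_now * u
    fy = phi_now * v
    # one-sided differences of the fluxes at the interior points
    back_x = (fx[1:-1, 1:-1] - fx[:-2, 1:-1]) / dx
    fwd_x = (fx[2:, 1:-1] - fx[1:-1, 1:-1]) / dx
    back_y = (fy[1:-1, 1:-1] - fy[1:-1, :-2]) / dy
    fwd_y = (fy[1:-1, 2:] - fy[1:-1, 1:-1]) / dy
    # run-time conditional: choose per grid point based on the wind sign
    dphi_dx = np.where(u[1:-1, 1:-1] > 0, back_x, fwd_x)
    dphi_dy = np.where(v[1:-1, 1:-1] > 0, back_y, fwd_y)
    phi_new[1:-1, 1:-1] = phi_now[1:-1, 1:-1] - dt * (dphi_dx + dphi_dy)
    return phi_new

rng = np.random.default_rng(2)
shape = (6, 7, 3)
u = rng.random(shape) - 0.5   # winds of both signs
v = rng.random(shape) - 0.5
phi_now = rng.random(shape)
phi_new = upwind_numpy(u, v, phi_now, dt=0.1, dx=1.0, dy=2.0)
```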

## Compile-time conditionals

A more sophisticated control flow statement consists of an if-else construct which queries a **scalar** quantity whose value is known at stencil compile time. Such a scalar quantity is made available inside the stencil definition as an external symbol and read through the `__INLINED()` accessor. This technique makes it possible to fuse the `leapfrog` and `upwind` stencils into a single stencil by introducing the `UPWINDING` flag.

In [35]:
def horizontal_advection_defs(
    u: gtscript.Field[float], 
    v: gtscript.Field[float], 
    phi_old: gtscript.Field[float],
    phi_now: gtscript.Field[float],
    phi_new: gtscript.Field[float],
    *,
    dt: float,
    dx: float,
    dy: float
):
    from __gtscript__ import __INLINED, PARALLEL, computation, interval
    from __externals__ import UPWINDING, centered_diff_x, centered_diff_y, upwind_diff_x, upwind_diff_y
    
    with computation(PARALLEL), interval(...):
        if __INLINED(UPWINDING):
            dphi_dx = upwind_diff_x(dx, u, phi_now)
            dphi_dy = upwind_diff_y(dy, v, phi_now)
            phi_new = phi_now - dt * (dphi_dx + dphi_dy)
        else:
            dphi_dx = centered_diff_x(dx, u, phi_now)
            dphi_dy = centered_diff_y(dy, v, phi_now)
            phi_new = phi_old - 2.0 * dt * (dphi_dx + dphi_dy)
            
horizontal_advection = gtscript.stencil(
    definition=horizontal_advection_defs, 
    backend=backend, 
    externals={
        "UPWINDING": True,
        "centered_diff_x": centered_diff_x,
        "centered_diff_y": centered_diff_y,
        "upwind_diff_x": upwind_diff_x,
        "upwind_diff_y": upwind_diff_y
    }
)
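
The idea of resolving a branch once, at build time, can be illustrated in plain Python with a factory function. This is only an analogy with hypothetical names (`make_advection`, `tendency`): the flag is consulted when the function is created, and the returned function contains no test on it. It does not reproduce what GT4Py does during code generation.

```python
import math

def make_advection(upwinding):
    """Select the update rule once, when the function is built."""
    if upwinding:
        def update(phi_old, phi_now, tendency, dt):
            # forward-in-time step used by the upwind scheme
            return phi_now - dt * tendency
    else:
        def update(phi_old, phi_now, tendency, dt):
            # centered-in-time (leapfrog) step
            return phi_old - 2.0 * dt * tendency
    return update

advect_upwind = make_advection(True)
advect_leapfrog = make_advection(False)

print(advect_upwind(1.0, 2.0, 0.5, 0.1))
print(advect_leapfrog(1.0, 2.0, 0.5, 0.1))
```

Switching scheme therefore means building a second function with a different flag, in the same way that a different value of `UPWINDING` in `externals` requires a separate call to `gtscript.stencil()`.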

## Vertical direction