# Python for High Performance Computing
# Putting it all together
<hr style="border: solid 4px green">
<br>
<center><img src="../../images/arc_logo.png"; style="float: center; width: 20%"></center>
<br>
## http://www.arc.ox.ac.uk
## support@arc.ox.ac.uk

## Overview
<hr style="border: solid 4px green">

### Packages and tools
* `NumPy` arrays
* `f2py` and `ctypes`
<br><br>

### They all make Python code run faster.  How much faster?
* we test these in a realistic application
* the test is against a straightforward loop-based Python implementation
<br><br>

### Baseline execution for future benchmarking
* `NumPy` implementation -- fastest performance in Python with minimum development
* C / Fortran implementation -- fastest single threaded performance without a constraint on language or development effort

## A finite difference solution to the 2D heat equation
<hr style="border: solid 4px green">

### The physics
Find the time-varying temperature distribution across a plate given an initial distribution and a fixed temperature around the edges.
<br><br>

### The maths
Solve the equation
$$\frac{\partial u}{\partial t}=\frac{\partial^2 u}{\partial x^2}+\frac{\partial^2 u}{\partial y^2}$$
with
* initial condition: $u(x,y,0)=u_0(x,y)$
* boundary condition: $u(x,y,t)=0$ on the boundary
<br><br>

Choose the domain to be the unit square $0\leq x,y\leq 1$ and the initial conditions
$$u_0(x,y)=\sin\pi x\cdot\sin\pi y$$
Then, the analytic solution is
$$u(x,y,t)=\sin\pi x\cdot\sin\pi y\cdot e^{-2\pi^2 t}$$
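
As a quick sanity check, the analytic solution can be evaluated on a grid with NumPy. This is only an illustrative sketch: the function name `analytic_solution` and the 11x11 grid are assumptions, not part of `heat.py`.

```python
import numpy as np

def analytic_solution(x, y, t):
    """u(x, y, t) = sin(pi x) sin(pi y) exp(-2 pi^2 t) on the unit square."""
    return np.sin(np.pi * x) * np.sin(np.pi * y) * np.exp(-2.0 * np.pi**2 * t)

# sample the unit square on an 11x11 grid (size chosen for illustration only)
pts = np.linspace(0.0, 1.0, 11)
x, y = np.meshgrid(pts, pts, indexing="ij")

u0 = analytic_solution(x, y, 0.0)     # initial condition u_0(x, y)
u1 = analytic_solution(x, y, 0.01)    # a short time later

print(u0.max())                # peak temperature, at the centre of the plate
print(u1.max() / u0.max())     # uniform decay factor exp(-2 pi^2 * 0.01)
```

The shape of the temperature profile never changes; only its amplitude decays, which makes this solution convenient for measuring numerical error.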

## A finite difference solution to the 2D heat equation (cont'd)
<hr style="border: solid 4px green">

### The numerics
Using a **F**inite **D**ifference (FD) solution
* sample the 2D domain at equidistant points at coordinates $x_i=i\Delta x$ and $y_j=j\Delta y$
* sample time at points $t_n=n\Delta t$
* compute the discrete values $u^n_{i,j}$ corresponding to the $x_i$, $y_j$ and $t_n$
* assuming $\Delta x=\Delta y$, the numerical solution is produced by the *time-marching scheme*

$$u^{n+1}_{i,j}=u^{n}_{i,j}+\nu \left( u^{n}_{i+1,j}+u^{n}_{i-1,j}+u^{n}_{i,j+1}+u^{n}_{i,j-1}\ -\ 4u^{n}_{i,j} \right)$$

where $\nu=\frac{\Delta t}{\Delta x^2}\leq 0.25$ for numerical stability.
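
The stability condition ties the time step to the grid spacing. The short sketch below works through the arithmetic; it assumes `n` grid points per side with the boundaries included (so $\Delta x = 1/(n-1)$), which may differ from how `heat.py` sets up its grid.

```python
# Assumption: n grid points per side on the unit square, boundaries included
n = 100
dx = 1.0 / (n - 1)

nu = 0.25            # largest value allowed by the stability condition
dt = nu * dx**2      # time step implied by nu = dt / dx^2

num_steps = 500
print(f"dx = {dx:.6f}, dt = {dt:.3e}, simulated time = {num_steps * dt:.5f}")
```

Because $\Delta t$ scales with $\Delta x^2$, halving the grid spacing requires four times as many time steps to reach the same simulated time.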

## A finite difference solution to the 2D heat equation (cont'd)
<hr style="border: solid 4px green">

### The numerical scheme
A 6-point stencil
* each discrete point $(i, j)$ at time $t_{n+1}$ is updated from five values at time $t_n$ (the same point plus its four neighbours)
<br><br>

<table border="0">
  <tr>
    <td><center>time step $n$</center></td>
    <td><center>time step $n+1$</center></td>
  </tr>
  <tr>
    <td><img src="./images/untxt.png"; style="float: center; width: 40%"></td>
    <td><img src="./images/unp1txt.png"; style="float: center; width: 40%"></td>
  </tr>
</table>

<br><br>

### The numerical algorithm
* start with initial conditions
* for each time step
  * for each space point, use current solution $u^{n}$ to compute next step $u^{n+1}$
  * apply boundary conditions
  * swap current solution with updated "next-step" solution
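
The steps above can be sketched as a small self-contained NumPy script. It is not the code from `heat.py`: the grid size, the number of steps and the max-norm definition of the relative error are all assumptions made for illustration.

```python
import numpy as np

n = 21                 # grid points per side (illustrative)
nu = 0.25              # scheme parameter, at the stability limit
dx = 1.0 / (n - 1)
dt = nu * dx**2
num_steps = 100

pts = np.linspace(0.0, 1.0, n)
x, y = np.meshgrid(pts, pts, indexing="ij")

# start with the initial conditions, with the edges held at exactly zero
uo = np.sin(np.pi * x) * np.sin(np.pi * y)
uo[0, :] = uo[-1, :] = uo[:, 0] = uo[:, -1] = 0.0
u = uo.copy()

for _ in range(num_steps):
    # use the current solution uo to compute the next step u (interior points)
    u[1:-1, 1:-1] = uo[1:-1, 1:-1] + nu * (uo[:-2, 1:-1] + uo[2:, 1:-1]
                                           + uo[1:-1, :-2] + uo[1:-1, 2:]
                                           - 4.0 * uo[1:-1, 1:-1])
    # apply boundary conditions (fixed zero temperature on the edges)
    u[0, :] = u[-1, :] = u[:, 0] = u[:, -1] = 0.0
    # swap current and next-step solutions (names only, no data is copied)
    u, uo = uo, u

# after the final swap the newest solution is referenced by uo
exact = np.sin(np.pi * x) * np.sin(np.pi * y) * np.exp(-2.0 * np.pi**2 * num_steps * dt)
rel_error = np.abs(uo - exact).max() / np.abs(exact).max()
print(f"relative error after {num_steps} steps: {rel_error:.4f}")
```

Swapping the two names rather than copying the arrays keeps the cost of each time step down to the stencil update itself.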

## A finite difference solution to the 2D heat equation (cont'd)
<hr style="border: solid 4px green">

### Implementation

| Class | Data | Methods |
| :--- | :--- | :--- | 
| `grid` | coordinates: `x`, `y` | `error` |
|        | solution: `u`         | |
|        | old solution: `uo`    | |
|  `solution`  | `grid` object | `timeStep` |
|              |               | `setStepper` |
|              |               | `numpyStepper` |
|              |               | `fortranStepper` |
|              |               | `cStepper` |
|              |               | *etc.* |

## A finite difference solution to the 2D heat equation (cont'd)
<hr style="border: solid 4px green">

The `solution` class has the method `timeStep`, which advances the numerical solution in time:
```python
def timeStep (self, numIters=0):
    """
    Advances the solution numIters timesteps using stepper set by setStepper()
    """
    # number of grid points in x, y
    nx, ny = self.grid.u.shape
    # solution vectors
    u  = self.grid.u
    uo = self.grid.uo
    # scheme parameter (<=0.25 for stability)
    nu = self.nu
    # advance the solution numIters time steps
    for t in range (numIters):
        # apply numerical scheme
        self.stepper (nx,ny, u,uo, nu)
        # swap references: the newly computed solution becomes the old one
        u, uo = uo, u
```

## A finite difference solution to the 2D heat equation (cont'd)
<hr style="border: solid 4px green">

This matches the *numerical algorithm* above, where `self.stepper` is an implementation of the update
$$u^{n+1}_{i,j}=u^{n}_{i,j}+\nu \left( u^{n}_{i+1,j}+u^{n}_{i-1,j}+u^{n}_{i,j+1}+u^{n}_{i,j-1}\ -\ 4u^{n}_{i,j} \right)$$
for all $i$ and $j$, except on the boundaries.

`self.stepper` is set to one of the available `*Stepper` methods, each of which is a different implementation of the same update.
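
One common way to do this kind of selection is a dictionary that maps option names to bound methods. The sketch below is hypothetical: the class `Solution`, the placeholder stepper bodies and the error handling are assumptions, and the real `setStepper` in `heat.py` may be written differently.

```python
class Solution:
    """Hypothetical, stripped-down stand-in for the `solution` class."""

    def __init__(self):
        self.stepper = None

    def pythonStepper(self, nx, ny, u, uo, nu):
        return "python"        # placeholder body, for illustration only

    def numpyStepper(self, nx, ny, u, uo, nu):
        return "numpy"         # placeholder body, for illustration only

    def setStepper(self, stepper="numpy"):
        # map option names to bound methods; unknown names raise an error
        steppers = {"python": self.pythonStepper,
                    "numpy": self.numpyStepper}
        if stepper not in steppers:
            raise ValueError(f"unknown stepper: {stepper!r}")
        self.stepper = steppers[stepper]


s = Solution()
s.setStepper("numpy")
print(s.stepper.__name__)      # numpyStepper
```

Since every stepper takes the same arguments, `timeStep` can call `self.stepper(...)` without knowing which implementation is behind it.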

## <span style="font-family: Courier New, Courier, monospace;">pythonStepper</span>: pure Python
<hr style="border: solid 4px green">

```python
for i in range(1, nx-1):
    for j in range(1, ny-1):
        u[i,j] = uo[i,j] + ( nu * ( uo [i-1, j] + uo [i+1, j] +
                                    uo [i, j-1] + uo [i, j+1]
                                    - 4.0 * uo [i,j] ) )
```

## <span style="font-family: Courier New, Courier, monospace;">numpyStepper</span>: <span style="font-family: Courier New, Courier, monospace;">NumPy</span> arrays
<hr style="border: solid 4px green">

Replacing the `for` loops by array slicing.

```python
u[1:-1, 1:-1] = uo[1:-1, 1:-1] + ( nu * ( uo [0:-2, 1:-1] + uo [2:, 1:-1]
                                        + uo [1:-1, 0:-2] + uo [1:-1, 2:]
                                        - 4.0 * uo [1:-1, 1:-1] ) )
```
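
A quick way to gain confidence in the slicing version is to compare it against the loop version on random data. The helper names `python_step` and `numpy_step` below are assumptions made for this sketch; they wrap the two snippets shown above.

```python
import numpy as np

def python_step(u, uo, nu):
    """One time step using explicit loops over the interior points."""
    nx, ny = uo.shape
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            u[i, j] = uo[i, j] + nu * (uo[i-1, j] + uo[i+1, j]
                                       + uo[i, j-1] + uo[i, j+1]
                                       - 4.0 * uo[i, j])

def numpy_step(u, uo, nu):
    """The same time step expressed with array slices."""
    u[1:-1, 1:-1] = uo[1:-1, 1:-1] + nu * (uo[0:-2, 1:-1] + uo[2:, 1:-1]
                                           + uo[1:-1, 0:-2] + uo[1:-1, 2:]
                                           - 4.0 * uo[1:-1, 1:-1])

rng = np.random.default_rng(0)
uo = rng.random((8, 6))            # small non-square test array
u_loop = np.zeros_like(uo)
u_slice = np.zeros_like(uo)

python_step(u_loop, uo, 0.25)
numpy_step(u_slice, uo, 0.25)
print(np.allclose(u_loop, u_slice))   # True
```

Each slice is a shifted view of `uo`, so `uo[0:-2, 1:-1]` lines up the $(i-1, j)$ neighbour with every interior point at once.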

## <span style="font-family: Courier New, Courier, monospace;">fortranStepper</span>: Fortran
<hr style="border: solid 4px green">

### Python side
```python
import fortran_stepper
fortran_stepper.timestep (nu, uo,u)
```

### Library source code
```fortran
subroutine timestep (nu, uo,u, nx,ny)

  implicit none

  integer, parameter :: dp = selected_real_kind (15, 307)

  integer,                                   intent (in)    :: nx,ny
  real (kind=dp), dimension (0:nx-1,0:ny-1), intent (inout) :: u
  real (kind=dp), dimension (0:nx-1,0:ny-1), intent (in)    :: uo
  real (kind=dp),                            intent (in)    :: nu

  u(1:nx-2, 1:ny-2) = uo(1:nx-2, 1:ny-2)                               &
                    + nu * ( uo (0:nx-3, 1:ny-2) + uo (2:nx-1, 1:ny-2) &
                           + uo (1:nx-2, 0:ny-3) + uo (1:nx-2, 2:ny-1) &
                           - 4.0_dp * uo (1:nx-2, 1:ny-2) )

end subroutine timestep
```

> *Note*: the above uses Fortran array syntax, which is very similar to `numpy` array slicing

## <span style="font-family: Courier New, Courier, monospace;">fortranStepper</span>: Fortran  (cont'd)
<hr style="border: solid 4px green">

### Important: arrays must be stored in column-major (Fortran) order

```python
if (stepper == "fortran"):
  self.grid.uo =  numpy.array (self.grid.uo, order="F")
  self.grid.u  =  numpy.ndarray (shape=self.grid.uo.shape,
                                 dtype=self.grid.uo.dtype,
                                 order="F")
```
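
The memory layout of an array can be inspected through its `flags` and `strides`. The following sketch uses a tiny example array (an assumption for illustration) to show that a column-major copy holds the same values in a different order in memory.

```python
import numpy as np

a = np.arange(6, dtype=np.float64).reshape(2, 3)   # row-major (C) by default
f = np.asfortranarray(a)                           # column-major (Fortran) copy

print(a.flags["C_CONTIGUOUS"], a.flags["F_CONTIGUOUS"])   # True False
print(f.flags["C_CONTIGUOUS"], f.flags["F_CONTIGUOUS"])   # False True
print(np.array_equal(a, f))                               # True: same values
print(a.strides, f.strides)                               # (24, 8) (8, 16)
```

Indexing with `a[i, j]` and `f[i, j]` gives identical results in Python; only compiled code that walks the raw memory sees the difference.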

## `cStepper`: C
<hr style="border: solid 4px green">

### Python side

```python
import ctypes
from numpy.ctypeslib import ndpointer

lib = ctypes.cdll.LoadLibrary("c_stepper.so")
lib.timestep.restype = None
lib.timestep.argtypes = [ctypes.c_double,
                         ctypes.c_int,
                         ctypes.c_int,
                         ndpointer(ctypes.c_double, flags="C_CONTIGUOUS"),
                         ndpointer(ctypes.c_double, flags="C_CONTIGUOUS")]

lib.timestep (nu, nx,ny, uo,u)
```
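
The `ndpointer` argument types also act as a safety net: an array with the wrong layout is rejected before the C function is ever called. The sketch below exercises that check on its own, without loading `c_stepper.so`; the array names are assumptions for illustration.

```python
import ctypes
import numpy as np
from numpy.ctypeslib import ndpointer

# same argument type as used for uo and u above
c_double_array = ndpointer(ctypes.c_double, flags="C_CONTIGUOUS")

good = np.zeros((4, 3))            # float64, row-major: acceptable
bad = np.asfortranarray(good)      # column-major: not C-contiguous

c_double_array.from_param(good)    # passes the check silently
try:
    c_double_array.from_param(bad)
except TypeError as exc:
    print("rejected:", exc)

# convert to row-major storage before handing the array to C
fixed = np.ascontiguousarray(bad)
c_double_array.from_param(fixed)
```

`ctypes` calls `from_param` automatically on each argument, so in normal use the check happens inside `lib.timestep (...)`.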

### Library source code

```c
void timestep ( const double nu,
                const int nx,const int ny,
                double uo[nx][ny],double u[nx][ny] )
{

  int i,j;

  // finite difference scheme
  for (i=1; i<nx-1; i++) {
    for (j=1; j<ny-1; j++) {
      u[i][j] = uo[i][j]
              + nu * ( uo [i-1][j] + uo [i+1][j]
                     + uo [i][j-1] + uo [i][j+1]
                     - 4.0 * uo [i][j] );
    }
  }
}
```

## Exercise
<hr style="border: solid 4px green">

### Steps
* build the modules from source
* measure performance
* draw conclusions
<br><br>

### Observations
* the C and Fortran source files are found in the `src/` directory
* the modules (shared libraries) go into a directory `lib/python2.7/site-packages`
  * directory created by the build process
  * directory name mimics `distutils` installations

## Exercise: building the modules
<hr style="border: solid 4px green">

### Simple way using methods learnt so far
* using `gcc` and `f2py`
* to make life simple, the building process is managed by the `make` utility
<br><br>

### Pythonic way
* providing a script `setup.py`, which uses `numpy.distutils` to define rules
* the modules are built with the command `python setup.py build_ext --inplace`
* this is what we shall be using in future
<br><br>

### Build
* edit the `makefile` and observe
  * the compiler details (flags, in particular)
  * the rules to build the modules
  * using `make` is just a convenient way to manage the `gcc` and `f2py` commands
* build the modules by running the command `make`
* check that the modules were created (`ls -l lib/python2.7/site-packages`)

```
# build libraries
! make
```

```
/bin/mkdir -p /Users/mihai/Documents/arc/training/scientific-python/arc-sci-py/python-hpc-day-1/lecture06-summary/lib/python2.7/site-packages
gcc -O2 -mavx -fPIC -std=c99 -c src/c_stepper.c
gcc -shared c_stepper.o -o /Users/mihai/Documents/arc/training/scientific-python/arc-sci-py/python-hpc-day-1/lecture06-summary/lib/python2.7/site-packages/c_stepper.so
/bin/rm -f c_stepper.o
f2py    src/fortran_stepper.f90 -h fortran_stepper.pyf
Reading fortran codes...
	Reading file 'src/fortran_stepper.f90' (format:free)
Post-processing...
	Block: timestep
Post-processing (stage 2)...
Saving signatures to file "./fortran_stepper.pyf"
f2py -c src/fortran_stepper.f90 -m fortran_stepper
running build
running config_cc
unifing config_cc, config, build_clib, build_ext, build commands --compiler options
running config_fc
unifing config_fc, config, build_clib, build_ext, build commands --fcompiler options
running build_src
build_src
buil
```

```
# check libraries
! ls -l lib/python2.7/site-packages
```

```
total 80
-rwxr-xr-x  1 mihai  staff   4192  5 Mar 15:42 c_stepper.so
-rwxr-xr-x  1 mihai  staff  32436  5 Mar 15:42 fortran_stepper.so
drwxr-xr-x  3 mihai  staff    102  5 Mar 15:38 fortran_stepper.so.dSYM
```


## Exercise: comparing performance
<hr style="border: solid 4px green">

### Edit the file <span style="font-family: Courier New, Courier, monospace;">heat.py</span>
* understand how the class `solution` selects the "stepper" in method `setStepper`
* locate the "stepper" implementations `pythonStepper` and `numpyStepper`
* locate the "stepper" implementations `cStepper` and `fortranStepper`
* locate the C and Fortran sources in the directory `src/`
<br><br>

### Modify the file <span style="font-family: Courier New, Courier, monospace;">heat.py</span>
* `stepperTypeList` must include all options
  * `"python"`
  * `"numpy"`
  * `"fortran"`
  * `"ctypes"`

## Exercise: comparing performance (cont'd)
<hr style="border: solid 4px green">

### Run #1
* run `python heat.py --help` to see which command line arguments the test script takes
* then, run the test script on a 100x100 grid for 500 timesteps

> *Notes*:
> * time measured refers to solution time only
> * error is measured relative to analytic solution

```
! python heat.py 100 500
```

```
 computing 500 iterations on a 100x100 grid
 stepper python, 500 iterations, 9.287722 seconds, 0.104286 rel error
 stepper numpy, 500 iterations, 0.046170 seconds, 0.104286 rel error
 stepper fortran, 500 iterations, 0.008832 seconds, 0.104286 rel error
 stepper ctypes, 500 iterations, 0.064468 seconds, 0.104286 rel error
```


## Exercise: comparing performance (cont'd)
<hr style="border: solid 4px green">

### Modify the file <span style="font-family: Courier New, Courier, monospace;">heat.py</span>
* `stepperTypeList` must exclude the option `"python"` (too slow)

### Run #2
* run the test script on a 1000x1000 grid for 500 timesteps
* compare time spent by the following implementations
  * `"numpy"`
  * `"fortran"`
  * `"ctypes"`

```
! python heat.py 1000 500
```

```
 computing 500 iterations on a 1000x1000 grid
 stepper numpy, 500 iterations, 5.367319 seconds, 0.382189 rel error
 stepper fortran, 500 iterations, 0.515249 seconds, 0.382189 rel error
 stepper ctypes, 500 iterations, 0.838505 seconds, 0.382189 rel error
```


## Exercise: comparing performance (cont'd)
<hr style="border: solid 4px green">

```
(mihai@malus) lecture06-summary > python heat.py 100 500
 computing 500 iterations on a 100x100 grid
 stepper python, 500 iterations, 9.472842 seconds, 0.104286 rel error
 stepper numpy, 500 iterations, 0.041959 seconds, 0.104286 rel error
 stepper fortran, 500 iterations, 0.007695 seconds, 0.104286 rel error
 stepper ctypes, 500 iterations, 0.067254 seconds, 0.104286 rel error
```
<br><br>

```
(mihai@malus) > python heat.py 1000 500
 computing 500 iterations on a 1000x1000 grid
 stepper numpy, 500 iterations, 5.358381 seconds, 0.382189 rel error
 stepper fortran, 500 iterations, 0.490860 seconds, 0.382189 rel error
 stepper ctypes, 500 iterations, 0.835508 seconds, 0.382189 rel error
```

## A few more easy methods
<hr style="border: solid 4px green">

### Compiled code solution: <span style="font-family: Courier New, Courier, monospace;">scipy.weave</span>
* package provides tools for including C/C++ code within Python code
* `weave.inline()` executes C code directly within Python
* `weave.blitz()` translates Python NumPy expressions to C++ for fast execution
* *Pros*
  * fast execution from C/C++ code (generated from NumPy expressions or inserted)
* *Cons*
  * needs C/C++ code available
  * there are easier-to-use alternatives

## <span style="font-family: Courier New, Courier, monospace;">blitzStepper</span>: <span style="font-family: Courier New, Courier, monospace;">scipy.weave.blitz</span>
<hr style="border: solid 4px green">

```python
    def blitzStepper (self, nx,ny, u,uo, nu):
        """ time-steps implemented using numpy expression dispatched via blitz"""
        from scipy import weave
        # define expression (same one as for numpyStepper)
        expr = "u[1:-1, 1:-1] = uo[1:-1, 1:-1] + ( nu * ( uo [0:-2, 1:-1] + uo [2:, 1:-1] + " \
               "                                   uo [1:-1, 0:-2] + uo [1:-1, 2:]" \
               "                                   - 4.0 * uo [1:-1, 1:-1] ) )"
        weave.blitz (expr, check_size=0)
```

## <span style="font-family: Courier New, Courier, monospace;">inlineStepper</span>: <span style="font-family: Courier New, Courier, monospace;">scipy.weave.inline</span>
<hr style="border: solid 4px green">

### Note the use of a "linear" array address rather than normal 2D indexing

```python
    def inlineStepper (self, nx,ny, u,uo, nu):
        """ time-steps implemented using C code dispatched via weave"""
        from scipy import weave
        from scipy.weave import converters
        # define code (same as the C code used by cStepper)
        #  * cannot use u[i][j]
        #  * instead use u[k], with k = i*ny + j
        code = """
               int i,j;
               int k,kn,ks,kw,ke;
               for (i=1; i<nx-1; i++) {
                 for (j=1; j<ny-1; j++) {
                   k    = i*ny + j;
                   kn   = k + ny;
                   ks   = k - ny;
                   ke   = k + 1;
                   kw   = k - 1;
                   u[k] = uo[k]
                        + nu * ( uo[kn] + uo[ks]
                               + uo[ke] + uo[kw]
                               - 4.0 * uo[k]);
                 }
               }
               """
        # compiler keyword only needed on windows with MSVC installed
        err = weave.inline (code,
                            ["nx","ny","u","uo","nu"])
```
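
The linear address can be checked from Python, since `ravel()` on a C-contiguous array exposes the same 1D ordering that the C code sees. In the sketch below the array shape and the chosen point are assumptions for illustration; the grid is deliberately non-square to show that neighbours in the `i` direction are `ny` elements apart (for a square grid `nx` and `ny` coincide).

```python
import numpy as np

nx, ny = 4, 6                          # deliberately non-square
uo = np.arange(nx * ny, dtype=np.float64).reshape(nx, ny)   # C-contiguous
flat = uo.ravel()                      # the 1D ordering seen by the C code

i, j = 2, 3
k = i * ny + j                         # linear address of point (i, j)

print(flat[k] == uo[i, j])                                       # True
print(flat[k + ny] == uo[i + 1, j], flat[k - ny] == uo[i - 1, j])  # True True
print(flat[k + 1] == uo[i, j + 1], flat[k - 1] == uo[i, j - 1])    # True True
```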

## A few more easy methods (cont'd)
<hr style="border: solid 4px green">

### JIT solution: <span style="font-family: Courier New, Courier, monospace;">numba</span>
* a compiler
  * it leverages LLVM
  * parses, compiles to, and optimises assembly code
  * works in a similar manner to compiled languages such as C and Fortran
* is Python
  * underlying powerful libraries are used for performance
  * code the programmer develops is always pure Python
* accelerates Python functions via simple decorators
* *Pros*
  * powerful
  * extremely easy to use
* *Cons*
  * a relative newcomer, hence not mature

## <span style="font-family: Courier New, Courier, monospace;">numbaStepper</span>: <span style="font-family: Courier New, Courier, monospace;">numba</span>
<hr style="border: solid 4px green">

```python
    def numbaStepper (self, nx,ny, u,uo, nu):
        """ time-steps implemented using straight python array indexing dispatched via JIT compiling"""
        # apply numerical scheme (one time-step)
        numbaStepperJIT (nx,ny, u,uo, nu)

# the JIT-compiled function is defined at module level, outside the class
from numba import jit

# numba / JIT compiler decorator
@jit
def numbaStepperJIT (nx,ny, u,uo, nu):
    # same code as the straight python stepper
    for i in range(1, nx-1):
        for j in range(1, ny-1):
            u[i,j] = uo[i,j] + ( nu * ( uo [i-1, j] + uo [i+1, j] +
                                        uo [i, j-1] + uo [i, j+1]
                                        - 4.0 * uo [i,j] ) )
```
```
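
With a JIT compiler, compilation happens the first time the function is called, so the first call is much slower than later ones. The sketch below times two consecutive calls to show the effect. The function name `step`, the grid size and the fallback decorator are assumptions for illustration; the fallback only exists so that the example still runs (as plain Python) where `numba` is not installed.

```python
import time
import numpy as np

try:
    from numba import jit
except ImportError:
    # fallback: a do-nothing decorator, so the loops run as plain Python
    def jit(func):
        return func

@jit
def step(nx, ny, u, uo, nu):
    # same loops as the pure Python stepper
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            u[i, j] = uo[i, j] + nu * (uo[i-1, j] + uo[i+1, j]
                                       + uo[i, j-1] + uo[i, j+1]
                                       - 4.0 * uo[i, j])

nx = ny = 200
uo = np.random.default_rng(0).random((nx, ny))
u = np.zeros_like(uo)

for label in ("first call (includes any compilation)", "second call"):
    start = time.perf_counter()
    step(nx, ny, u, uo, 0.25)
    print(f"{label}: {time.perf_counter() - start:.4f} s")
```

When benchmarking, a warm-up call like the first one here should be kept out of the timed section.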

## Exercise: comparing performance (cont'd)
<hr style="border: solid 4px green">

### Modify the file <span style="font-family: Courier New, Courier, monospace;">heat.py</span>
* include all options so far except `"python"` (too slow)
```python
    stepperTypeList = [
        "numpy",
        "numba",
        "blitz",
        "inline",
        "fortran",
        "ctypes"
    ]
```

## Exercise: comparing performance (cont'd)
<hr style="border: solid 4px green">

### Preliminary run
* small datasets
* allows `weave` and `numba` to generate code and cache it

```
! python heat.py 100 500
```

```
 computing 500 iterations on a 100x100 grid
 stepper numpy, 500 iterations, 0.195461 seconds, 0.104286 rel error
 stepper numba, 500 iterations, 0.716754 seconds, 0.104286 rel error
 stepper blitz, 500 iterations, 0.353546 seconds, 0.104286 rel error
 stepper inline, 500 iterations, 0.053986 seconds, 0.104286 rel error
 stepper fortran, 500 iterations, 0.030147 seconds, 0.104286 rel error
 stepper ctypes, 500 iterations, 0.284029 seconds, 0.104286 rel error
```


## Exercise: comparing performance (cont'd)
<hr style="border: solid 4px green">

### Run #3
* run the test script on a 1000x1000 grid for 500 timesteps
* compare time spent in all implementations so far

```
! python heat.py 1000 500
```

```
 computing 500 iterations on a 1000x1000 grid
 stepper numpy, 500 iterations, 21.538667 seconds, 0.382189 rel error
 stepper numba, 500 iterations, 2.472552 seconds, 0.382189 rel error
 stepper blitz, 500 iterations, 6.626490 seconds, 0.382189 rel error
 stepper inline, 500 iterations, 4.406395 seconds, 0.382189 rel error
 stepper fortran, 500 iterations, 2.089497 seconds, 0.382189 rel error
 stepper ctypes, 500 iterations, 3.665506 seconds, 0.382189 rel error
```
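
The timings are easier to compare as speed-ups over the `NumPy` baseline. The sketch below simply redoes that arithmetic with the numbers from the 1000x1000 run above; timings on another machine will differ.

```python
# timings in seconds, copied from the 1000x1000 run above
timings = {
    "numpy":   21.538667,
    "numba":    2.472552,
    "blitz":    6.626490,
    "inline":   4.406395,
    "fortran":  2.089497,
    "ctypes":   3.665506,
}

baseline = timings["numpy"]

# list the steppers from fastest to slowest with their speed-up over numpy
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name:8s} {seconds:10.6f} s   {baseline / seconds:5.2f}x vs numpy")
```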


## Summary
<hr style="border: solid 4px green">

### We have
* seen the methods for making Python fast in action
* applied them to a simple but realistic application
<br><br>

### Performance
* pure Python is a beginner's mistake
* `NumPy` is the basic language of scientific Python and leads to reasonable performance
* compiled languages (C and Fortran) can be harnessed for extra performance
* alternative methods (`weave.blitz`, `weave.inline`) exist for C/C++ code
<br><br>

### Overall winner: `numba`
* JIT compiler
* extremely easy to use, performance on a par with C/Fortran
* future developments promise more
* a pity it is serial...
<br><br>

### Baseline performance (for benchmarking)
* `NumPy`
* serial C / Fortran modules

<img src="../../images/reusematerial.png"; style="float: center; width: 90"; >