# Python for High Performance Computing
# Interfacing to C and Fortran
<hr style="border: solid 4px green">
<br>
<center><img src="../../images/arc_logo.png"; style="float: center; width: 20%"></center>
<br>
## http://www.arc.ox.ac.uk
## support@arc.ox.ac.uk

## Software
<hr style="border: solid 4px green">

We shall need compilers for C and Fortran.

The **G**NU **C**ompiler **C**ollection (GCC) includes the following compiler front-ends
* C -- `gcc`
* C++ -- `g++`
* Fortran -- `gfortran`

### Linux
* `gcc` is part of the operating system installation
* `gfortran` needs to be installed separately, *e.g.* `sudo apt-get install gfortran`

### Mac OS
* the `gcc` that comes with Apple's command line tools (installed with `xcode-select --install`) is an alias for `clang` (front-end for LLVM)
* GNU `gcc` and `gfortran` are installed via Homebrew with `brew install gcc`

### Windows
* a Linux-like solution: Cygwin, a distribution of GNU and other popular open-source tools running on Windows
* `gcc` and `gfortran` are included

## Why couple Python with another language?
<hr style="border: solid 4px green">

### Python is *slow* compared with compiled languages
* dynamically typed language
  * variables are Python objects, assigned a type at runtime, determined from context
  * any operation involves access and checks on the Python object attributes
* interpreted language
  * the interpreter executes one operation at a time, so few of the optimisations available to a compiler can be applied
* memory access can be sub-optimal
  * *e.g.* a Python list stores references to objects that can be scattered in memory, not the values themselves
<br><br>
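
A small illustration of the points above (a sketch, not a benchmark; the variable names are chosen for illustration only): a list can hold objects of any type, each one a separate Python object, whereas a NumPy array holds raw values of a single type in one block of memory.

```python
import numpy as np

# a list holds references to arbitrary Python objects;
# the type of each element is only known at run time
values = [1, 2.5, "three"]
element_types = [type(v).__name__ for v in values]
print(element_types)

# a NumPy array holds raw values of one fixed type in a single memory block
arr = np.array([1.0, 2.5, 3.0])
print(arr.dtype, arr.itemsize, arr.nbytes)
```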

### But... Python is *useful*
* easier to use than compiled languages
* flexible and forgiving
* efficient use of development time

## Why couple Python with another language?
<hr style="border: solid 4px green">

### Best of both worlds: combine
* Python flexibility
  * easy data manipulation, inspection and visualisation
  * parse command line arguments
  * handle complex software coordination
* compiled language performance
  * functions that do the computationally intensive parts
  * these functions are presented as Python callable functions
<br><br>

The scope of such an optimisation effort is highlighted by profiling:
* typically, 10-20% of the code takes 80-90% of the runtime
* the slow parts of the code can be re-programmed in C or Fortran
<br><br>

> More info
> * http://docs.scipy.org/doc/numpy-dev/f2py/
> * http://scipy-cookbook.readthedocs.org
> * http://www.f2py.com/home/

## Extension modules
<hr style="border: solid 4px green">

### Basic approach: build *extension modules*
* hand-write C or Fortran functions
* compile source to produce a dynamic library (containing native machine code)
  * `.so` file (shared object) in Linux and macOS
  * `.dll` file in Windows
* add a wrapper around the library to provide a Python interface
<br><br>

### The result
* an extension module, *i.e.* a shared library
* loadable at run time using `import`.
<br><br>
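
As a quick check of what `import` accepts as an extension module, the standard library lists the file name endings recognised by the running interpreter. This is a minimal sketch; the module name `my_module` is hypothetical.

```python
import importlib.machinery

# file name endings the running interpreter recognises as extension modules
suffixes = importlib.machinery.EXTENSION_SUFFIXES
print(suffixes)

# a hypothetical module "my_module" could be provided by any of these files
candidates = ["my_module" + s for s in suffixes]
print(candidates)
```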

### Requires
* a clear understanding of the number and types of all arguments
* an appropriate compiler (*e.g.* `gcc` and `gfortran`)
<br><br>

> *Warning*: bugs in the C/Fortran code can easily crash the Python interpreter.

## A number of techniques
<hr style="border: solid 4px green">

### Fortran to Python interface generators
* `f2py` -- a tool that is part of `NumPy`
* `f90wrap` -- a newer package that works together with `f2py`
* both are remarkably easy to use but neither is well maintained

## A number of techniques
<hr style="border: solid 4px green">

### A miscellany of methods for C/C++
* Ctypes
  * a foreign function library which provides C compatible data types and allows calling functions from shared libraries
  * *pros*: Python standard library
  * *cons*: functions need to be available from a shared library, poor support for C++
* Cython
  * is both a Python-like *language* for writing C-extensions and a *compiler* for this language
  * the Cython language is a superset of Python, with additional constructs that annotate variables and class attributes with C types (in a sense, Python with types)
  * supports incremental optimization -- start with a pure-Python script and incrementally add Cython types to optimize targeted code paths
  * *pros*: easy to use (Python-like language for writing C-extensions), incremental optimization, includes a GNU debugger extension, C++ support
  * *cons*: must be compiled
* C native interface
  * the Python-C API is the backbone of the standard CPython interpreter
  * *pros*: no additional libraries, low-level control, usable from C/C++
  * *cons*: needs compilation, substantial effort and maintenance cost, costly computing overheads, compatibility across Python versions
* SWIG (**S**implified **W**rapper and **I**nterface **G**enerator)
  * tool to connect C/C++ code with a variety of high-level programming languages (inc. Python)
  * reads header files and generates libraries Python can load
  * *pros*: given headers, automatically wraps entire libraries, works well with C++
  * *cons*: generates huge files, difficult to debug, steep learning curve
<br><br>

> More information and examples
> * http://www.scipy-lectures.org/advanced/interfacing_with_c/interfacing_with_c.html
> * https://docs.scipy.org/doc/numpy-1.10.0/user/c-info.python-as-glue.html

## Techniques in this lecture
<hr style="border: solid 4px green">

* `f2py`
* Ctypes

## Arrays and memory access
<hr style="border: solid 4px green">

### NumPy arrays
* are contiguous in memory when newly created (slices and other views may not be)
* easily usable from Fortran and C
<br><br>

### NumPy arrays (2D and higher)
* default storage is in C order
* Fortran storage has to be explicitly asked for
* the internal ordering is hidden by the abstraction layer
<br><br>
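
The storage order of an array can be inspected through its `flags` attribute. A minimal sketch (array contents chosen for illustration):

```python
import numpy as np

a_c = np.array([[1, 2, 3], [4, 5, 6]])   # default: C order
a_f = np.asfortranarray(a_c)             # explicit request for Fortran order

print(a_c.flags["C_CONTIGUOUS"], a_c.flags["F_CONTIGUOUS"])
print(a_f.flags["C_CONTIGUOUS"], a_f.flags["F_CONTIGUOUS"])

# the ordering is hidden by the abstraction layer:
# indexing and comparisons give the same results for both arrays
same = bool(np.array_equal(a_c, a_f) and a_c[1, 2] == a_f[1, 2])
print(same)
```

Creation functions such as `np.zeros` and `np.empty` also accept `order="F"` to request Fortran storage directly.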

### Arrays are stored in memory at contiguous locations
<table border="0">
  <tr>
    <td><center>Math representation</center></td>
    <td><center>C mapping</center></td>
    <td><center>Fortran mapping</center></td>
  </tr>
  <tr>
    <td><img src="./images/array_2x3_math.png"; style="float: center; width: 40%"></td>
    <td><img src="./images/array_2x3_C.png"; style="float: center; width: 40%"></td>
    <td><img src="./images/array_2x3_Fortran.png"; style="float: center; width: 40%"></td>
  </tr>
</table>
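
The two mappings can also be listed in code. The sketch below assumes a 2x3 matrix with illustrative values and uses `ravel` to read the elements in C (row-major) and Fortran (column-major) order.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

# sequence of elements under each mapping
c_sequence = a.ravel(order="C")   # row after row
f_sequence = a.ravel(order="F")   # column after column

print(c_sequence)
print(f_sequence)
```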

## Arrays and memory access (cont'd)
<hr style="border: solid 4px green">

### Does this matter?
* yes, it has an impact on performance
<br><br>

### Why?
* computation is (almost) free, memory access is expensive
* memory access is cached to improve performance
  * cache is fast but size-limited memory
  * sits between CPU and main memory
  * split between (up to) 3 levels
* the caching mechanism assumes *spatial locality*
  * if a particular storage location is referenced, it is likely that nearby memory locations will be referenced during the following instructions
  * therefore, when an array entry is referenced, a whole *cache line* is pulled out of memory
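
To make the link between array layout and cache lines concrete, the sketch below compares the array strides with a cache line size. The 64-byte cache line is an assumption (a common value, but hardware dependent), and the array shape is arbitrary.

```python
import numpy as np

CACHE_LINE_BYTES = 64  # assumption: common cache line size, hardware dependent

u = np.zeros((1000, 1000))          # C order, 8-byte doubles
row_stride, col_stride = u.strides  # bytes to step to the next row / next column

# number of consecutive array entries that share one cache line
doubles_per_line = CACHE_LINE_BYTES // u.itemsize

print(row_stride, col_stride, doubles_per_line)
```

Moving along a row advances by `col_stride` bytes, so consecutive entries share a cache line. Moving down a column advances by `row_stride` bytes, which is larger than the assumed cache line, so each access touches a different line.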

## Arrays and memory access (cont'd)
<hr style="border: solid 4px green">

### A simple example
* compute the Frobenius norm of a *large* 2D array
* traverse the array row first and column first and compare times
* C-storage (default) means row-first is best
  * `u[i, j]` being accessed means `u[i, j+1]`, `u[i, j+2]`, ... are already in cache and accessed quickly
<br><br>

> *Note*:  the *proper* way to compute the Frobenius norm (3 orders of magnitude faster) is `numpy.linalg.norm (u, "fro")`.

In [1]:
import numpy as np

# function that traverses a 2D array C-style (row major)
# and computes the Frobenius norm of the array
def c_fro_norm (u):
    m, n = u.shape
    rms = 0.0
    for i in range(m):
        for j in range(n):
            rms += u[i,j]**2
    return np.sqrt(rms)

# function that traverses a 2D array Fortran-style (column major)
# and computes the Frobenius norm of the array
def f_fro_norm (u):
    m, n = u.shape
    rms = 0.0
    for j in range(n):
        for i in range(m):
            rms += u[i,j]**2
    return np.sqrt(rms)

## Arrays and memory access (cont'd)
<hr style="border: solid 4px green">

Now, create a large array and time both functions.

In [3]:
u = np.random.rand (60, 40000)
%timeit c_fro_norm (u)
%timeit f_fro_norm (u)

1 loop, best of 3: 1.21 s per loop
1 loop, best of 3: 1.22 s per loop
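
The two timings above are nearly identical, most likely because the interpreter overhead of each `u[i,j]` access in pure Python dominates the cost of the memory access itself. The effect of the storage order is easier to see when the inner loop is executed by compiled code. The sketch below is one way to do this, using `np.dot` on whole rows and whole columns; the function names and array size are illustrative, and the measured difference depends on hardware and array size, so no timings are quoted.

```python
import timeit
import numpy as np

u = np.random.rand(2000, 2000)   # C order by default

def row_wise_norm(u):
    # inner loop over a contiguous row, executed in compiled code
    total = 0.0
    for i in range(u.shape[0]):
        row = u[i, :]
        total += np.dot(row, row)
    return np.sqrt(total)

def col_wise_norm(u):
    # inner loop over a strided column, executed in compiled code
    total = 0.0
    for j in range(u.shape[1]):
        col = u[:, j]
        total += np.dot(col, col)
    return np.sqrt(total)

t_row = timeit.timeit(lambda: row_wise_norm(u), number=5)
t_col = timeit.timeit(lambda: col_wise_norm(u), number=5)
print("row-wise: %.4f s, column-wise: %.4f s" % (t_row, t_col))
```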


## Python and Fortran via <span style="font-family: Courier New, Courier, monospace;">f2py</span>
<hr style="border: solid 4px green">

### <span style="font-family: Courier New, Courier, monospace;">f2py</span>: **F**ortran **to** **Py**thon interface generator
From Fortran source, `f2py`
* creates a *signature file*, which contains argument attributes (defining the Fortran interface)
* compiles the source (using an external Fortran compiler)
* wraps the compiled source in an extension module importable from Python
<br><br>

### General recipe
* create a signature file
  * `f2py <source_file> -m <extension_module_name> -h <signature_file>.pyf`
  * typically, the signature filename stub is the same as the source filename

* (optional) check the signature file for correctness
  * sequence and types of arguments to be passed from Python to Fortran function and back
  * argument attributes
     * `depend`
     * `check`
     * `intent`
     * `shape`

* produce the final extension module
  * `f2py -c <signature_file>.pyf <source_file>.f90`

* import the module into Python and use the external Fortran function

```python
import extension_module_name
extension_module_name.function (args)
```
<br><br>

> The source filename may not be the same as the function name.

## Fortran example : <span style="font-family: Courier New, Courier, monospace;">f_array_sqrt.f90</span>
<hr style="border: solid 4px green">

`array_sqrt()`, in the file `f_array_sqrt.f90`, is an external subroutine that computes the square root of each value in an array.

In [None]:
# %load f_array_sqrt.f90
subroutine array_sqrt (n, a_in, a_out)

  implicit none
  integer, intent(in) :: n
  real, dimension(n), intent(in)  :: a_in
  real, dimension(n), intent(out) :: a_out

  integer :: i

  do i = 1, n
     a_out(i) = sqrt(a_in(i))
  end do

  return

end subroutine array_sqrt


## Fortran example: create signature file
<hr style="border: solid 4px green">

`f2py` creates the signature file automatically:
```bash
f2py f_array_sqrt.f90 -h f_array_sqrt.pyf
```

* use the `-h` option to write the signature to the text file `f_array_sqrt.pyf`
* use `--overwrite-signature` to overwrite an existing signature file

In [4]:
# call from notebook to avoid exiting...
!f2py f_array_sqrt.f90 -h f_array_sqrt.pyf

Reading fortran codes...
	Reading file 'f_array_sqrt.f90' (format:free)
Post-processing...
	Block: array_sqrt
Post-processing (stage 2)...
Saving signatures to file "./f_array_sqrt.pyf"


## Fortran example: check signature file (optional)
<hr style="border: solid 4px green">

Attributes (such as `optional`, `intent` and `depend`)
* specify the visibility, purpose and dependencies of the arguments
* are automatically inferred from the Fortran source
* can be manually modified if needed

In [5]:
!cat f_array_sqrt.pyf

!    -*- f90 -*-
! Note: the context of this file is case sensitive.

subroutine array_sqrt(n,a_in,a_out) ! in f_array_sqrt.f90
    integer, optional,intent(in),check(len(a_in)>=n),depend(a_in) :: n=len(a_in)
    real dimension(n),intent(in) :: a_in
    real dimension(n),intent(out),depend(n) :: a_out
end subroutine array_sqrt

! This file was auto-generated with f2py (version:2).
! See http://cens.ioc.ee/projects/f2py2e/


Normally, the automatically generated signature file does not need inspection or modification.

Modifying the signature file is sometimes needed to change the `intent` attributes, *e.g.* when
* generating modules from legacy Fortran 77 code, which has no `intent` declarations
* adding signature-specific options, *e.g.* `intent(in, hide)`

## Fortran example: compile extension module
<hr style="border: solid 4px green">

Once the signature file has been checked, `f2py` compiles an extension module that can be imported into Python

```bash
f2py -c f_array_sqrt.f90 -m f_array_sqrt
```

* `-m` specifies the name of the output module, in this case a shared library file called `f_array_sqrt.so`
* it uses the default compiler `gfortran` but this can be changed
  * *e.g.* the Intel Fortran compiler can be used by adding `--compiler=intelem --fcompiler=intelem`
  * the available options are printed with `f2py -c --help-fcompiler`

In [7]:
# call from notebook to avoid exiting...
# (to avoid the long output, use "msg = !f2py ...")
msg = !f2py -c f_array_sqrt.f90 -m f_array_sqrt

In [8]:
# check we have the f_array_sqrt.so
!ls *.so

f_array_sqrt.so


## Fortran example: calling external function from Python
<hr style="border: solid 4px green">

The function provided by the module has the same name as the Fortran routine: `array_sqrt`.

In [9]:
# import the extension module
import numpy as np
from f_array_sqrt import array_sqrt

In [10]:
# view docstring of function (automatically produced)
array_sqrt?

In [11]:
# use the function
a_in = np.array([16.0, 25.0, 36.0, 49.0])
a_out = array_sqrt(a_in)
print(a_out)

[ 4.  5.  6.  7.]


## Fortran example: final remarks
<hr style="border: solid 4px green">

### <span style="font-family: Courier New, Courier, monospace;">f2py</span> automation
Notice the difference between the Fortran source
```fortran
subroutine array_sqrt (n, a_in, a_out)
  integer, intent(in) :: n
  real, dimension(n), intent(in)  :: a_in
  real, dimension(n), intent(out) :: a_out
```
and the Python usage
```python
a_out = array_sqrt(a_in)
```
Where does `n` go?  Find it in the signature file
```fortran
integer, optional,intent(in),check(len(a_in)>=n),depend(a_in) :: n=len(a_in)
```
<br><br>

### Input and output variables
* Fortran specifies what is input and output as tightly as possible
* Python allocates `a_in`, initialises it and passes it to `array_sqrt ()`
* Python also allocates `a_out` automatically for Fortran to work with

## Python and C via <span style="font-family: Courier New, Courier, monospace;">ctypes</span>
<hr style="border: solid 4px green">

### Less automation than <span style="font-family: Courier New, Courier, monospace;">f2py</span>
* no additional interface file
* no mixed-language intermediate code
<br><br>

### General recipe
* compile C source to a shared library (`.so` extension)
* the library is ready to use from Python
  * import the `ctypes` module
```python
import ctypes
```

  * load the library explicitly, *e.g.*
```python
lib = ctypes.cdll.LoadLibrary ("./my_library.so")
```

  * specify the prototype for the C function, *e.g.*
```python
lib.my_c_function.restype = ctypes.c_int
lib.my_c_function.argtypes = [ctypes.c_double]
```
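
The recipe can be tried without compiling anything by calling `sqrt` from the C maths library that ships with the operating system. This is a sketch under assumptions: it relies on `ctypes.util.find_library` locating the maths library, which works on typical Linux and macOS systems but may not on others (*e.g.* Windows).

```python
import ctypes
import ctypes.util

# locate and load the system C maths library
# (assumption: find_library can locate it on this platform; if it returns
# None on a POSIX system, CDLL(None) exposes the symbols already loaded
# into the Python process instead)
libm_path = ctypes.util.find_library("m")
libm = ctypes.CDLL(libm_path)

# specify the prototype: double sqrt(double)
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

root = libm.sqrt(2.0)
print(root)
```

Without the `restype` line, `ctypes` would assume the function returns a C `int` and the result would be meaningless.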

## C example: <span style="font-family: Courier New, Courier, monospace;">c_array_sqrt.c</span>
<hr style="border: solid 4px green">

Consider the same square root example, this time in C.

In [None]:
# %load c_array_sqrt.c
#include <math.h>

void array_sqrt (int n, double * a_in, double * a_out) {
  int i;
  for (i = 0; i < n; i++) {
    a_out[i] = sqrt(a_in[i]);
  }
}


## C example: create the module
<hr style="border: solid 4px green">

Now, generate the shared library, using the C compiler `gcc` directly.

In [12]:
# call from notebook to avoid exiting...

# first compile
!gcc -c -fPIC c_array_sqrt.c
# then generate library
!gcc -shared -o c_array_sqrt.so c_array_sqrt.o
# check library was generated
!ls *.so

c_array_sqrt.so f_array_sqrt.so


The compiler flag `-fPIC` stands for **P**osition **I**ndependent **C**ode
* generated machine code is not dependent on being located at a specific address in order to work
* this allows the dynamic loader (the mechanism whereby a process can load a library at run-time) to relocate libraries to different addresses at load time
* this is essential for shared library support

> *Note*: shared libraries are *shared* by different processes in memory.

## C example: calling external function from Python
<hr style="border: solid 4px green">

* the function provided by the module has the same name as the C routine: `array_sqrt`
* there is no wrapper, so the corresponding `ctypes` code must address the two C pointers

In [13]:
import ctypes
import numpy as np
from numpy.ctypeslib import ndpointer

lib = ctypes.cdll.LoadLibrary("./c_array_sqrt.so")
lib.array_sqrt.restype = None
lib.array_sqrt.argtypes = [ctypes.c_int, ndpointer(ctypes.c_double, flags="C_CONTIGUOUS"),
                                         ndpointer(ctypes.c_double, flags="C_CONTIGUOUS")]

a_in  = np.array([16.0, 25.0, 36.0, 49.0])
a_out = np.empty(4, np.double)

lib.array_sqrt(4, a_in, a_out)
print(a_out)

[ 4.  5.  6.  7.]
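
The `ndpointer` argument types do more than describe the pointers: they check each array before the C function is called. The sketch below exercises that check directly through the `from_param` hook that `ctypes` uses, so it runs without the compiled library; the helper name `accepted` is hypothetical.

```python
import ctypes
import numpy as np
from numpy.ctypeslib import ndpointer

# same argument type as used for a_in and a_out above
double_array = ndpointer(ctypes.c_double, flags="C_CONTIGUOUS")

def accepted(arr):
    """Return True if arr would be accepted for a double_array argument."""
    try:
        double_array.from_param(arr)
        return True
    except TypeError:
        return False

ok          = accepted(np.array([16.0, 25.0]))                    # float64, contiguous
wrong_dtype = accepted(np.array([16.0, 25.0], dtype=np.float32))  # wrong data type
strided     = accepted(np.arange(8.0)[::2])                       # not contiguous
print(ok, wrong_dtype, strided)
```

The length argument `n` is not checked in this way: passing a value larger than the array length makes the C code read and write outside the arrays.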


## C example: final remarks
<hr style="border: solid 4px green">

### What you program is what you get
Compare the C source
```c
void array_sqrt (int n, double * a_in, double * a_out)
```
with the Python usage
```python
lib.array_sqrt(4, a_in, a_out)
```
<br><br>

### Advantages of less automation
* complete control over generating the shared library
* complete control over variable memory allocation

## Exercise
<hr style="border: solid 4px green">

### Generate a Python extension module for the function `fibonacci()` 
* start with the C or Fortran source provided
* the function computes the first `n` Fibonacci numbers (0, 1, 1, 2, 3, 5, 8, 13, ...) and stores the results in the array provided
* test this function in Python
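
For the testing step, the output of the extension module can be compared against a pure-Python reference. The sketch below is one possible reference; the name `fibonacci_reference` and the `int64` data type are assumptions, and the data type should match whatever the provided source uses.

```python
import numpy as np

def fibonacci_reference(n):
    """Return the first n Fibonacci numbers (0, 1, 1, 2, ...) as an array."""
    fib = np.zeros(n, dtype=np.int64)   # assumption: 64-bit integers
    if n > 1:
        fib[1] = 1
    for i in range(2, n):
        fib[i] = fib[i - 1] + fib[i - 2]
    return fib

print(fibonacci_reference(8))
```

A comparison such as `np.array_equal(result, fibonacci_reference(len(result)))` then checks the array filled by the compiled function.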

## Summary
<hr style="border: solid 4px green">

### We have looked at
* coupling Python with C and Fortran code
  * allowing code re-use (*e.g.* existing libraries)
* `f2py` is a simple way to call Fortran code from Python
* `ctypes` is the simplest way to interface Python with C functions
<br><br>

### Next: extended example
* put all the above together (and more)
* realistic example
* focus on performance

<img src="../../images/reusematerial.png"; style="float: center; width: 90"; >