[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JamesFergusson/Introduction-to-Research-Computing/blob/master/10_Cython.ipynb)

# Cython

One of the best ways to speed up Python code is to convert it into compiled C code. Luckily this is fairly easy: we can do it directly in a Jupyter notebook, which is handy for testing things (using it in scripts needs some extra steps). First we install Cython using `conda install cython`, then we load the extension into the notebook:

In [None]:
%load_ext Cython

Now let's compare a normal Python function and its Cython-ized version. We will use the example of a function that calculates the Nth Fibonacci number:

In [None]:
def fib1(N):
    a,b = 0,1
    for i in range(N):
        a,b = b,a+b
    return a

%timeit fib1(1000)

In [None]:
%%cython
def fib2(N):
    a,b = 0,1
    for i in range(N):
        a,b = b,a+b
    return a

In [None]:
%timeit fib2(1000)

In [None]:
%%cython
def fib3(int N):
    cdef int i
    cdef int a=0,b=1
    for i in range(N):
        a,b = b,a+b
    return a



In [None]:
%timeit fib3(1000)

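So just adding `%%cython` gives us a speedup of roughly 2x. But if we simply add types to our variables with `cdef` this increases to a speedup of roughly 240x! (Be aware that the 1000th Fibonacci number is far too large for a C `int`, so `fib3(1000)` overflows and returns the wrong value; the timing comparison is still useful.)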
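As a quick sanity check on the typed version, here is a small plain-Python sketch (not part of the original notebook) that finds the first Fibonacci number that no longer fits in a signed 32-bit integer. It assumes a C `int` is 32 bits wide on your platform; the names `INT32_MAX` and `first_overflow` are purely illustrative.

```python
# Plain-Python check: which Fibonacci numbers fit in a signed 32-bit int?
INT32_MAX = 2**31 - 1   # assumption: a C int is 32 bits wide


def fib1(N):
    """Same pure-Python Fibonacci function as in the cell above."""
    a, b = 0, 1
    for i in range(N):
        a, b = b, a + b
    return a


# Smallest N whose Fibonacci number is larger than a 32-bit int can hold
first_overflow = next(N for N in range(100) if fib1(N) > INT32_MAX)

print(first_overflow)            # index of the first value that would overflow
print(fib1(1000).bit_length())   # bits needed to store fib1(1000) exactly
```

Python integers grow as needed, which is why `fib1` and `fib2` stay correct while a version using `cdef int` can only be trusted for small `N`.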

This is because the function is dominated by the loop, which C can do much faster. The `%%cython` magic does something clever in the background: it takes the cell, converts it to C code, compiles it and stores the resulting compiled module in a temporary location. We can see the C code that was generated by using the annotate option, adding `-a` after `%%cython`. This gives us a window into how the code has been converted to C, with highlighting to show how much Python interaction is left on each line. If we click the little '+' next to a line number it shows what that line has been converted to in C, and the stronger the yellow the more Python interaction remains.

We will come back to compilation later but let's look at the difference between writing cython and python code:

1. We don't have to do anything to cython-ise most python code.  We can put almost any python code through the cython compiler and it will work fine and usually run faster.

2. To access performance of C with Cython we usually only have to declare types using `cdef` and sometimes switch the default behaviour of some operations using simple flags.

3. In Cython we can now use all C libraries and easily access threaded parallelism by avoiding the GIL.

So we see that there are very few differences.  Cython is a superset of python so we don't have to change anything if we don't want to.  As cython is effectively an optimisation tool we should profile the code and only cythonise the slowest parts.  This is the main advantage. If you wanted to access the speed of C you would otherwise have to re-write all your code in C where lots of things can be significantly more difficult.  Instead we can use the convenience of python for most of the code and only invoke C in the sections where performance is most important.
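To make the "profile first" advice concrete, here is a minimal sketch using the standard library `cProfile` and `pstats` modules. The functions `slow_part`, `fast_part` and `main` are made-up stand-ins for your own code; the point is only to show how to find which function is worth cythonising.

```python
import cProfile
import io
import pstats


def slow_part(n):
    """A loop-heavy function: the kind of code that benefits from Cython."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_part(n):
    """A trivial function that is not worth optimising."""
    return n * 2


def main():
    fast_part(10)
    return slow_part(200_000)


# Run main() under the profiler and collect the statistics
profiler = cProfile.Profile()
result = profiler.runcall(main)

# Sort by cumulative time and write the top entries into a string
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

In the printed table the loop-heavy function should account for nearly all of the time, so that is the one to move into a `.pyx` file or a `%%cython` cell.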

## Types
Using Cython in its basic form is pretty easy. Let's look at the `cdef` statement a bit more. Here are the basic `cdef` types:

In [None]:
%%cython
cdef char i=1           # Oddly an 8 bit integer (-128 to 127) (it's enough to label all characters so can be used for strings)
cdef short j=2          # 16 bit integer (-32,768 to 32,767)
cdef int k=3            # 32 bit integer (-2,147,483,648 to 2,147,483,647)
cdef unsigned int l=4   # 32 bit +ve integer (0 to 4,294,967,295), "unsigned" can go in front of all integer types
cdef long int m=5       # 64 bit integer on most Linux/macOS systems (-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807), only 32 bit on Windows
cdef float x=0.0        # 32 bit float (about 7 significant decimal digits, max decimal exponent 38)
cdef double y = 0e0     # 64 bit float (about 15 significant decimal digits, max decimal exponent 308)
cdef list list1 = [1,2,3]       # just a normal list (not much performance gain)
cdef dict dict1 = {'a':1,'b':2} # just a normal dict (not much performance gain)

Once we define a type we have to stick with it, unlike Python, which dynamically changes types to store accurately any number you give it. This means we are now in danger of overflow errors. For example, if you declare `cdef short j` and then write `j = 200**2` you get:

In [None]:
%%cython
cdef short j
j = 200**2
print(j)

This is because 40,000 is larger than 32,767 so we wrap around to the negative part.  Similarly if we try:

In [None]:
%%cython
cdef unsigned int j
j = -1
print(j)

so we have to be a bit careful with our variables to avoid strange results.
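The same wrap-around can be reproduced without compiling anything by using the standard library `ctypes` module, whose integer types do no overflow checking. This is only a sketch to show the arithmetic: it assumes `short` is 16 bits and `unsigned int` is 32 bits on your platform, and `wrap_signed` is a hypothetical helper written for this illustration.

```python
import ctypes

# ctypes integer types silently wrap, just like assigning to a C variable
wrapped_short = ctypes.c_short(200**2).value     # 40,000 does not fit in 16 bits
wrapped_unsigned = ctypes.c_uint(-1).value       # -1 does not fit in an unsigned type


def wrap_signed(value, bits):
    """Reduce value modulo 2**bits and reinterpret it as a signed integer."""
    modulus = 1 << bits
    value %= modulus
    return value - modulus if value >= modulus // 2 else value


print(wrapped_short)            # 40000 - 65536
print(wrapped_unsigned)         # 2**32 - 1
print(wrap_signed(200**2, 16))  # same as wrapped_short
```

The stored value is simply the true value modulo 2 to the power of the number of bits, reinterpreted as signed where appropriate.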

Strings are stored completely differently in C so there is no dedicated `cdef` type for them. Instead they are just an array of `char`. The `char*` means that it is the address of the point in memory where the string begins. Python and C also encode strings differently, so you have to `encode` and `decode` for them to be able to talk to each other. It's best just to keep strings as Python variables.

In [None]:
%%cython
def test(input):
    input_byte = input.encode('utf-8')
    cdef char* c_string = input_byte
    cdef bytes py_string_byte = c_string
    output = py_string_byte.decode('utf-8')
    print(output)
test('Hello')
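The encode/decode round trip can also be tried in plain Python. The sketch below uses `ctypes.c_char_p` as a stand-in for a Cython `char*` (an assumption made so that it runs without compilation) and a non-ASCII character to show why the number of bytes and the number of characters can differ.

```python
import ctypes

text = "h\u00e9llo"                  # 'héllo': 5 characters
raw = text.encode("utf-8")           # Python str -> bytes ('é' takes two bytes)
c_string = ctypes.c_char_p(raw)      # C-style pointer to NUL-terminated bytes
round_trip = c_string.value.decode("utf-8")   # bytes -> Python str again

print(len(text), len(raw))           # characters vs bytes
print(round_trip == text)
```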

We can also use any of the standard C math libraries with:

In [None]:
%%cython
from libc.math cimport sin

def sin_c(double x):
    return sin(x)

In [None]:
import math
x = 0.5
%timeit math.sin(x)
%timeit sin_c(x)

which are a bit faster.  To use numpy arrays we have to do:

In [None]:
%%cython
import numpy as np
cimport numpy as cnp

cdef cnp.ndarray array

but this won't give us all the speed improvement possible as C doesn't know how to allocate the memory as it doesn't know the shape and datatype of the array.  Instead it is better to do:

In [None]:
%%cython
import numpy as np
cimport numpy as cnp

def matrix_dot(cnp.ndarray[cnp.int_t, ndim=2] array1, cnp.ndarray[cnp.int_t, ndim=2] array2):
    cdef cnp.ndarray[cnp.int_t, ndim=2] array3
    array3 = np.dot(array1,array2)
    return array3


In [None]:
import numpy as np
array1 = np.ones((100,100),dtype=np.int_)  # np.int was removed in NumPy 1.24, use np.int_
array2 = np.ones((100,100),dtype=np.int_)
%timeit array3 = matrix_dot(array1,array2)
%timeit array4 = np.dot(array1,array2)

Note: we had to put this declaration in a function; this is so Cython knows how long the memory needs to be allocated for, as it's local to the function. You can also `cimport numpy as np`; I gave it a different name so you can see which is doing what. Finally, NumPy is already written in C so, as expected, wrapping it in Cython doesn't help.

We can also optimise the function call by specifying the return type. For functions we have three choices: `def`, `cdef` and `cpdef`. The first is callable from Python or Cython, the second from Cython only with an optimised call, and the third is callable from both Python and Cython but only optimised when called from Cython. If you use `cdef` or `cpdef` you should add the type of the return variable, as below:

In [None]:
%%cython
import numpy as np
cimport numpy as cnp

cpdef cnp.ndarray[cnp.int_t, ndim=2] matrix_dot2(cnp.ndarray[cnp.int_t, ndim=2] array1, cnp.ndarray[cnp.int_t, ndim=2] array2):
    cdef cnp.ndarray[cnp.int_t, ndim=2] array3
    array3 = np.dot(array1,array2)
    return array3

## Cython for scripts

Using the `%%cython` magic is pretty cool but we can't write a standalone program with it. So how do we use Cython in our normal Python code? It's a four step process (two more than normal):

1. Put your cython code in a file with extension `.pyx` like `cython_module.pyx`

In [None]:
%%file cython_module.pyx
"""
Cython code for Fibonacci numbers
"""
cpdef int fibonacci(int N):
    cdef int i
    cdef int a=0,b=1
    for i in range(N):
        a,b = b,a+b
    return a

2. Create a file called setup.py with the following:

In [None]:
%%file setup.py
from setuptools import setup  # distutils was removed in Python 3.12
from Cython.Build import cythonize

setup(
    ext_modules = cythonize("cython_module.pyx")
)

3. Now compile the code on the command line with:

In [None]:
cd Code

In [None]:
%%bash
python3 setup.py build_ext --inplace

4. Use the new cython functions with:

In [None]:
import cython_module as cym
cym.fibonacci(10)

You are now free to use the functions in `cython_module` in Python.
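One common pattern, sketched here as a suggestion rather than something the course prescribes, is to fall back to a pure-Python implementation when the compiled module has not been built yet. The flag `USING_CYTHON` is a made-up name for illustration; `cython_module` and `fibonacci` are the names used above.

```python
try:
    # Use the compiled version if the build step has been run
    from cython_module import fibonacci
    USING_CYTHON = True
except ImportError:
    USING_CYTHON = False

    def fibonacci(N):
        """Pure-Python fallback with the same interface as the Cython function."""
        a, b = 0, 1
        for _ in range(N):
            a, b = b, a + b
        return a


print(fibonacci(10), "compiled" if USING_CYTHON else "pure Python")
```

This keeps scripts runnable on machines without a C compiler, at the cost of the speedup.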

If we look in the directory we see two new files, `cython_module.c` and a `cython_module` shared library ending in `.so` (the full filename usually also includes a platform tag). The `.c` file is the translation of our Cython code into C and the `.so` file is the compiled version of it. If you open the `.c` file you will see that it is about 2600 lines long. Mostly it's definitions, with the actual calculation appearing around line 1070 and lasting about 80 lines. It is clear from the `.c` code that it is doing a lot of the checks which Python does in the background, which can slow down the code. Again we can see how well we are doing by using the annotate option in our `setup.py` file:

In [None]:
%%file setup.py
from setuptools import setup  # distutils was removed in Python 3.12
from Cython.Build import cythonize

setup(
    ext_modules = cythonize("cython_module.pyx", annotate=True)
)

Now we can build it again, but we will have to delete the `.so` files to make it run (otherwise it doesn't think anything has changed). This generates a `.html` file which shows us how much of our code has been converted to C. For a function declared with a plain `def`, such as `fib3` above, it should highlight two lines: the `def fib3()` line and the `return a` line. This is because we haven't specified what type the function should return. We can correct this by changing the definition to `cpdef int fib3()`. Now when we re-compile, the `return` line is white and the `def` line is a paler yellow. This can't be removed entirely as we want the function to be available in Python, so it must interact with it.

## Extensions
Now we have access to all of the functionality of C and C++. This is a massive topic and I couldn't begin to address it here. There are, however, a couple of options I will flag up for you to think about in future.

Here is a link to compiler directives that can be specified in the setup file for all code or using decorators (which we haven't discussed but are just lines above a function beginning with an @) for specific functions:
https://cython.readthedocs.io/en/latest/src/userguide/source_files_and_compilation.html#compiler-directives
Some common decorators are:
- @cython.boundscheck(False)  Remove checks that you are accessing valid array entries
- @cython.wraparound(False)   Remove the ability to use negative indexing in arrays
- @cython.cdivision(True)     Use C's version of division rather than Python's, so no more divide by zero errors

- @cython.profile(True)  This is necessary if you want to profile using cProfile

Turning these off and on can help you access more of the C speed by removing Python-style checks. If you turn the checks off, your code will usually just produce nonsense or explode when you do something wrong (like in C!) rather than raise an error (like in Python). These can buy some speed but are only really important if they are blocking a loop from being converted to C (where it would vectorise) or if the loop contains only this type of calculation, but this is quite hard to set up. You probably don't need to worry about them much.

The second is that we can now access thread-level parallelism through the Cython `prange` command.

In [None]:
%%cython --compile-args=-fopenmp --link-args=-fopenmp
import numpy as np
cimport numpy as cnp
from cython.parallel import prange
import cython

@cython.cdivision(True)
@cython.boundscheck(False)
cpdef cnp.ndarray[cnp.double_t, ndim=1] func1(cnp.ndarray[cnp.double_t, ndim=1] Xin):
    cdef int i
    cdef int N = Xin.shape[0]
    cdef cnp.ndarray[cnp.double_t, ndim=1] Xout = np.empty_like(Xin)
    
    for i in range(N):
        Xout[i] = 1e0/Xin[i]
        
    return Xout

@cython.cdivision(True)
@cython.boundscheck(False)
cpdef cnp.ndarray[cnp.double_t, ndim=1] func2(cnp.ndarray[cnp.double_t, ndim=1] Xin):
    cdef int i
    cdef int N = Xin.shape[0]
    cdef cnp.ndarray[cnp.double_t, ndim=1] Xout = np.empty_like(Xin)
    
    for i in prange(N, nogil=True):
        Xout[i] = 1e0/Xin[i]
        
    return Xout

In [None]:
import numpy as np

Xin = np.random.random((10000))+1e0

%timeit func1(Xin)
%timeit func2(Xin)

Here you can easily run into issues if you want to sum numbers, as all threads have to access the same variable. Cython does make sure the answer is right (unlike in C) but the code becomes effectively serial, so it will run slower due to the overhead of creating the threads in the first place. Still, this can be an easy way to parallelise simple loops. Note that this will not happen if there is any Python inside the loop; it has to be all Cython.

In [None]:
%%cython --compile-args=-fopenmp --link-args=-fopenmp
from cython.parallel import prange

cdef int i
cdef int n = 30
cdef int sum = 0

# for i in range(n):
for i in prange(n, nogil=True):
    sum += i

print(sum)

## Wrapping C with Cython

One useful advanced topic is learning to wrap pure C code for use in Python. This is fairly easy but a little fiddly for people new to C. Here is a simple example to show you how it works.

Suppose we have a function written in C that we want to use in Python. The C code is as follows:

cexample.c :

In [None]:
#include <omp.h>
#include "cexample.h"

// The top two lines are the C version of loading modules
// This is done via "header" files which list the functions to load.

int fibonacci(int n){
	
	int i,a,b,tmp;
	a=0;
	b=1;
	for (i=0;i<(n-1);i++){
		tmp = a+b;
		a = b;
		b = tmp;
	}
	return b;
}

with header file cexample.h:

In [None]:
#ifndef C_EXAMPLE_H
#define C_EXAMPLE_H

// The bit at the top is to check if the module has already been loaded
// It asks if it is already defined and only loads it if not.

int fibonacci(int n);

#endif

We must compile this code into a library which we will then load in Cython. To do this we run two commands: the first compiles `cexample.c` into an object file, and the second creates the library **ar**chive from the object files.

In [None]:
cd ../Cython

In [None]:
%%bash

cd lib
gcc -c cexample.c
ar rcs libcexample.a cexample.o
cd ..

Now we have created the C library we need to use cython to wrap it for use in python.  All we need to do is create the `.pyx` file for this:

In [None]:
%%file wrapcexample.pyx
cdef extern from "cexample.h":
    int fibonacci(int n)

def cfib(n):
    cdef int m
    m = fibonacci(n)
    return m

Then create the `setup.py` to create the python module:

In [None]:
%%file setup.py
from setuptools import setup, Extension  # distutils was removed in Python 3.12
from Cython.Build import cythonize

cexample_extension = Extension(
    name="wrapcexample",            
    sources=["wrapcexample.pyx"],
    libraries=["cexample"],
    library_dirs=["lib"],
    include_dirs=["lib"]
)

setup(
    name="wrapcexample",
    ext_modules=cythonize([cexample_extension])
)

Now if we run the build command:

In [None]:
%%bash
python3 setup.py build_ext --inplace

Then we can access the C function in pure Python:

In [None]:
import wrapcexample as cexmpl

cexmpl.cfib(10)

This is a useful way to access thread level parallelism as we can now use OpenMP (OMP) in native C where it's much easier.  An example is in the directory CythonOMP. The C code is below:

In [None]:
#include "ompexample.h"
#include <omp.h>

int pointlesssum(int n){
	
	int i,sum;
	sum = 0;
	
	#pragma omp parallel for default(none) private(i) shared(n) reduction(+:sum)
	for(i=0;i<n;i++){
		sum += i;
	}
	return sum;
}

Now thread parallelism is accessed via simple `pragmas`, avoiding the trouble of having to turn off lots of Python flags to get true parallelism as for `prange`. The above parallelises the `for` loop, where `i` is private, meaning each thread keeps its own copy, `n` is shared, which means that all threads see the same variable, and `sum` is a reduction variable, meaning that each thread gets a local copy and these are merged at the end of the loop by `+`. The `default(none)` means that variables are neither private nor shared by default; you have to specify them. Alternatively you can set the default to `private` and then only specify variables that are shared or, more dangerously, set the default to `shared` and then specify which are private. This is a pretty common way to use it. You can create a parallel region with:

In [None]:
#pragma omp parallel default(none) private(...) shared(...)
{
	t = omp_get_thread_num();
    
    ...
    
}

You can program this like a normal MPI template, but now communication is done through shared variables and reduction through critical regions, which only one thread can access at a time, e.g.:

In [None]:
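// Declare the variables before the parallel region so the clauses can refer to them
int t, i, thrsum;
int totsum = 0;

#pragma omp parallel default(none) private(t,i,thrsum) shared(totsum)
{
	t = omp_get_thread_num();
	thrsum = 0;   // each thread starts its own private partial sum
	for(i=50*t;i<50*(t+1);i++){
		thrsum += i;
	}
	
	// only one thread at a time may update the shared total
	#pragma omp critical
	{
		totsum += thrsum;
	}    
}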
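For readers more comfortable in Python, the same pattern can be sketched with the standard `threading` module: each thread builds a private partial sum, then adds it to the shared total inside a lock, which plays the role of `omp critical`. This is only an analogy (Python threads do not run this loop in parallel because of the GIL), and `N_THREADS`, `shared` and `worker` are names invented for the illustration.

```python
import threading

N_THREADS = 4                      # hypothetical number of threads
shared = {"totsum": 0}             # visible to every thread, like shared(totsum)
lock = threading.Lock()            # stands in for "#pragma omp critical"


def worker(t):
    thrsum = 0                     # local variable: private to this thread
    for i in range(50 * t, 50 * (t + 1)):
        thrsum += i
    with lock:                     # only one thread at a time gets past here
        shared["totsum"] += thrsum


threads = [threading.Thread(target=worker, args=(t,)) for t in range(N_THREADS)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

totsum = shared["totsum"]
print(totsum)                      # sum of 0 .. 50*N_THREADS - 1
```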

The reason they can't all write to a shared variable at the same time is due to race conditions, where one thread reads the variable then performs the operation, but before it can write the variable back another thread reads it. Here is an example:

In [None]:
int a = 2;
#pragma omp parallel default(none) shared(a)
{
	a+=1;   
}

1. thread1: read a=2
2. thread1: add one
3. thread2: read a=2
4. thread1: write a=3
5. thread2: add one
6. thread2: write a=3

so a=3, not 4 as it should be.
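The interleaving listed above can be replayed deterministically in plain Python by giving each "thread" its own temporary variable. This is a simulation of the ordering only, not real threading, and both function names are made up for the illustration.

```python
def interleaved_increment(a):
    """Replay the six steps above with two simulated threads."""
    t1 = a        # 1. thread1 reads a
    t1 += 1       # 2. thread1 adds one
    t2 = a        # 3. thread2 reads a (thread1 has not written yet)
    a = t1        # 4. thread1 writes its result
    t2 += 1       # 5. thread2 adds one to the stale value
    a = t2        # 6. thread2 overwrites thread1's update
    return a


def serialised_increment(a):
    """What a critical region enforces: each read-add-write finishes first."""
    for _ in range(2):            # two threads, one after the other
        tmp = a
        tmp += 1
        a = tmp
    return a


print(interleaved_increment(2))   # one update is lost
print(serialised_increment(2))    # both updates are kept
```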

## Makefiles

Compiling the C code and converting it into Python modules took a few commands, which can be a pain to remember, especially if the C code has multiple files. Generally when writing C this is automated with `Makefiles`. Here is a short introduction to them.

Makefiles consist of a list of rules, usually for updating files when the files they depend on change, but they can be used more generally. The rules take the following form (note that commands must be indented with a `<tab>` rather than 4 spaces):

In [None]:
target ... : dependency ...
	commands
	...
	...

The `target` is usually the file to be updated or a rule name to pass to `make`, `dependency` lists the files/rules that the target depends on, and `commands` are the bash commands to run in order to update the target. A very simple example would be:

In [None]:
%%file Makefile

test :
	echo "Hello!"

In [None]:
%%bash
make test

This has no dependencies so when we `make test` it finds the rule for `test` and runs the commands below. `make` always does the first rule if nothing is specified (unless you have set a default) so equivalently we could have typed:

In [None]:
%%bash
make

Now we can add a dependency, so that in order to run the first rule we need to run the second one first.

In [None]:
%%file Makefile

test1 : test2
	echo "Hello two!"
    
test2 :
	echo "Hello one!"

In [None]:
%%bash
make

Both `target` and `dependency` can be files. In that case `make` checks whether any files in the "rule tree" have been updated and reruns the commands from there up. A simple makefile for compiling and wrapping the C functions would be:

In [None]:
%%file Makefile

wrap : lib/libcexample.a
	python setup.py build_ext --inplace
    
lib/libcexample.a : lib/cexample.o
	ar rcs lib/libcexample.a lib/cexample.o
    
lib/cexample.o : lib/cexample.c
	gcc -c lib/cexample.c -o lib/cexample.o

In [None]:
%%bash
make wrap
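The decision `make` takes for each file rule can be sketched in a few lines of Python: rebuild if the target is missing or older than any of its dependencies. This is a simplified model of the timestamp check only (real `make` also walks the rule tree), and `needs_rebuild` is a hypothetical helper, not part of any tool.

```python
import os
import tempfile


def needs_rebuild(target, dependencies):
    """Return True if target is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_time for dep in dependencies)


with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "cexample.c")
    obj = os.path.join(workdir, "cexample.o")

    with open(source, "w") as handle:
        handle.write("int fibonacci(int n);\n")
    missing_target = needs_rebuild(obj, [source])   # no .o file yet

    with open(obj, "w") as handle:
        handle.write("")
    os.utime(source, (1000, 1000))                  # source older than object
    os.utime(obj, (2000, 2000))
    up_to_date = needs_rebuild(obj, [source])

    os.utime(source, (3000, 3000))                  # source edited afterwards
    stale = needs_rebuild(obj, [source])

print(missing_target, up_to_date, stale)
```

The timestamps are set explicitly with `os.utime` so that the example does not depend on how quickly the files are written.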

This is OK but a bit verbose. Just like when writing Python, it is better to use variables to make the code simpler to understand. In makefiles variables are created with `=` and `:=`. The values are accessed with the `$()` construct, so if `var1 = 2` then `$(var1)` evaluates to 2. The first assignment, `=`, is `implicit` (GNU make calls this recursively expanded), which means that it doesn't expand the right-hand side immediately; the second, `:=`, is `explicit` (simply expanded) in that it does expand it before assignment. The difference can be seen in the example:

In [None]:
var1 = $(var2)
var2 = "hello"
echo $(var1)

Here we don't expand `var1` until we get to `echo $(var1)` so it doesn't matter that `var2` isn't defined when we assign `var1`.  With `:=` this would matter as we would try to expand `$(var2)` when creating var1 and it wouldn't exist.  Conversely in this example:

In [None]:
var1 = "hello"
var1 = $(var1)

As the second assignment is implicit this creates an infinite loop that can't be expanded.  Here `:=` would work fine as we would expand it before assignment.
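The difference between the two kinds of assignment can be mimicked with a toy expander in Python. This is a deliberately tiny model written for illustration (it only understands `$(name)` references, and treats undefined variables as empty, as `make` does); `expand` and the dictionaries are invented names, not part of `make`.

```python
import re

VAR_REF = re.compile(r"\$\((\w+)\)")


def expand(text, variables, depth=0):
    """Replace every $(name) in text, expanding stored values recursively."""
    if depth > 20:
        raise RecursionError("variable references itself")

    def lookup(match):
        stored = variables.get(match.group(1), "")   # undefined -> empty string
        return expand(stored, variables, depth + 1)

    return VAR_REF.sub(lookup, text)


# '=' : store the raw text, expand only when the variable is used
deferred = {}
deferred["var1"] = "$(var2)"
deferred["var2"] = "hello"
deferred_result = expand("$(var1)", deferred)

# ':=' : expand the right-hand side at the moment of assignment
immediate = {}
immediate["var1"] = expand("$(var2)", immediate)     # var2 is not defined yet
immediate["var2"] = "hello"
immediate_result = expand("$(var1)", immediate)

# '=' with a self-reference can never be expanded
try:
    expand("$(var1)", {"var1": "$(var1)"})
    self_reference_fails = False
except RecursionError:
    self_reference_fails = True

print(repr(deferred_result), repr(immediate_result), self_reference_fails)
```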


There is also `?=`, which assigns the variable only if it has not previously been assigned, and `+=`, which adds another element to a list, i.e.:

In [None]:
var1 = one two three
var1 += four

We can also write patterns using `%`. This will match any non-empty string and can be used in any string object, but only once. Together with the automatically defined variables `$@`, `$<` and `$^`, which are the filename of the target, the filename of the first dependency and the filenames of all the dependencies respectively, this allows us to set up generic rules for all files of a specific type, like object files which are always created from their C files, i.e.:

In [None]:
%%file Makefile
# compilers, flags and libraries 
CC = gcc
CFLAGS := -g -O3 -march=native

# libraries
LIBS := 
    
# Directories for code, objects and libraries
LIBDIR := lib

# source file(s) without suffix 
CFILES = cexample

#This says don't look for a file called "wrap"
.PHONY : wrap
    
# dependency says wrap depends on all files in LIBDIR
# matching anything in CFILES but with 'lib' on the front and '.a' on the end
wrap : $(LIBDIR)/$(CFILES:%=lib%.a)
	python setup.py build_ext --inplace

# for anything in CFILES but with 'lib' on the front and '.a' on the end
# dependent on anything in CFILES with '.o' on the end
$(LIBDIR)/$(CFILES:%=lib%.a) : $(LIBDIR)/$(CFILES:%=%.o)
	ar rcs $@ $^

# everything that ends in '.o' should be made from the same file with '.c' instead
$(LIBDIR)/%.o : $(LIBDIR)/%.c
	$(CC) -c $< -o $@
    
clean : 
	rm *.c *.so $(LIBDIR)/*.a $(LIBDIR)/*.o

In [None]:
%%bash
make clean
make wrap
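To see what a substitution such as `$(CFILES:%=lib%.a)` produces, here is a small Python imitation of the `%` matching. It is a sketch of the idea only, and `patsubst` here is a hypothetical Python helper named after the corresponding `make` function rather than a real API.

```python
def patsubst(pattern, replacement, words):
    """Rewrite each whitespace-separated word that matches pattern.

    '%' in pattern matches a non-empty stem, which is then substituted
    for the first '%' in replacement. Words that do not match are kept.
    """
    prefix, suffix = pattern.split("%", 1)
    rewritten = []
    for word in words.split():
        matches = (
            word.startswith(prefix)
            and word.endswith(suffix)
            and len(word) > len(prefix) + len(suffix)
        )
        if matches:
            stem = word[len(prefix):len(word) - len(suffix)]
            rewritten.append(replacement.replace("%", stem, 1))
        else:
            rewritten.append(word)
    return " ".join(rewritten)


CFILES = "cexample"
LIBDIR = "lib"

library = LIBDIR + "/" + patsubst("%", "lib%.a", CFILES)   # like $(LIBDIR)/$(CFILES:%=lib%.a)
objects = LIBDIR + "/" + patsubst("%", "%.o", CFILES)      # like $(LIBDIR)/$(CFILES:%=%.o)

print(library)
print(objects)
print(patsubst("%.c", "%.o", "cexample.c notes.txt"))
```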

While makefiles work fine for smaller projects in compiled languages, they become too tricky to write manually for larger ones and are usually generated by the `CMake` system. It is not straightforward to learn and we will not spend time on it now. Recently a simpler alternative to `CMake`, called `meson`, has been gaining popularity and is worth checking out if you are too lazy to write your makefiles even for smaller codes. By default `meson` will use `ninja` instead of `make`, which is actually faster at compiling. You don't need to learn about `ninja` files though, as they will be generated automatically anyway.

## Fortran

As an aside, we note that something similar exists for Fortran code: the f2py package (which can also handle C). It only wraps code, whereas Cython allows you to mix the two together, and in Fortran strings and arrays need a little more work to pass (see: https://docs.scipy.org/doc/numpy/f2py/). F2py can be used on the command line a bit like a compiler. First you create some Fortran code like this:

In [None]:
MODULE funcs
  
  CONTAINS
    
  SUBROUTINE fibonacci(n,m)
    IMPLICIT none
    INTEGER, INTENT(in) :: n
    INTEGER, INTENT(out) :: m
    INTEGER :: i,a,b,tmp
    a = 0
    b = 1
    DO i=1,n
      tmp = a+b
      a = b
      b = tmp
    END DO
    m = b
  END SUBROUTINE fibonacci
  
END MODULE funcs

Then to create a loadable module (`.so` file) you compile it with f2py:

In [None]:
f2py3 -c fibonacci.f90 -m fortran

Now you should be able to use the function after importing the module in python

In [None]:
cd Fortran/

In [None]:
import fortran as fort

fort.funcs.fibonacci(10)

Otherwise you can use it via Python by creating signature files. I'll leave you to explore further with the documentation (https://docs.scipy.org/doc/numpy/f2py/getting-started.html#the-quick-and-smart-way).

**Exercise:**

Try to Cythonise other code we have created in previous solutions, both in notebook cells and as scripts. Good examples would be our code for Recamán sequences and for periodic data.