# How does Cython speed up Python?

## Reason 1: Interpreted -> Compiled

## Cython version of trivial function

In [22]:
%load_ext Cython

The Cython extension is already loaded. To reload it, use:
  %reload_ext Cython


In [24]:
%%cython -n cyfoo

def cyfoo(a, b):
    return a + b

## Profiling

In [25]:
%timeit cyfoo(1, 2)

82.8 ns ± 3.08 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [26]:
import sys
sys.modules['cyfoo']

<module 'cyfoo' (/home/jovyan/.cache/ipython/cython/cyfoo.cpython-36m-x86_64-linux-gnu.so)>

In [27]:
print("Cython integer addition speedup: {:0.1f}%".format((112. - 79.) / 112. * 100))

Cython integer addition speedup: 29.5%


In [28]:
%timeit cyfoo('a', 'b')

119 ns ± 3.74 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [29]:
print("Cython string addition speedup: {:0.1f}%".format((159. - 133.) / 159. * 100))

Cython string addition speedup: 16.4%


### For simple addition, the Cython version gives a consistent speedup

* With all the caveats for microbenchmarks...
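The percentages above come from a simple relative-reduction formula. As a minimal sketch, the same calculation can be wrapped in a helper and fed by the standard-library `timeit` module outside of IPython. The names `percent_speedup`, `best_time_ns`, and `pyfoo` are illustrative assumptions, not part of the notebook, and the `lambda` wrapper adds call overhead of its own, so absolute numbers will not match `%timeit`.

```python
import timeit


def percent_speedup(baseline_ns, optimized_ns):
    """Relative reduction in run time, as a percentage of the baseline."""
    return (baseline_ns - optimized_ns) / baseline_ns * 100.0


def best_time_ns(func, *args, number=100_000, repeat=5):
    """Best per-call time in nanoseconds over several repeats."""
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number * 1e9


def pyfoo(a, b):
    # Hypothetical pure-Python counterpart of cyfoo (assumed for illustration).
    return a + b


# Reproduce the two percentages printed in the cells above.
print("{:0.1f}%".format(percent_speedup(112.0, 79.0)))   # 29.5%
print("{:0.1f}%".format(percent_speedup(159.0, 133.0)))  # 16.4%

# Live measurement; the value depends on the machine and Python version.
print("pyfoo(1, 2): {:0.1f} ns per call".format(best_time_ns(pyfoo, 1, 2)))
```

Taking the minimum over repeats is one common convention for microbenchmarks; `%timeit` reports mean and standard deviation instead.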

## We see the same `PyNumber_Add()` entry point as for interpreted Python

In [30]:
!cat /home/jovyan/.cache/ipython/cython/cyfoo.c | nl

     1	/* Generated by Cython 0.25.2 */
       
     2	/* BEGIN: Cython Metadata
     3	{
     4	    "distutils": {
     5	        "language": "c"
     6	    },
     7	    "module_name": "cyfoo"
     8	}
     9	END: Cython Metadata */
       
    10	#define PY_SSIZE_T_CLEAN
    11	#include "Python.h"
    12	#ifndef Py_PYTHON_H
    13	    #error Python headers needed to compile C extensions, please install development version of Python.
    14	#elif PY_VERSION_HEX < 0x02060000 || (0x03000000 <= PY_VERSION_HEX && PY_VERSION_HEX < 0x03020000)
    15	    #error Cython requires Python 2.6+ or Python 3.2+.
    16	#else
    17	#define CYTHON_ABI "0_25_2"
    18	#include <stddef.h>
    19	#ifndef offsetof
    20	  #define offsetof(type, member) ( (size_t) & ((type*)0) -> member )
    21	#endif
    22	#if !defined(WIN32) && !defined(MS_WINDOWS)
    23	  #ifndef __stdcall
    24	    #define __stdcall
    25	  #endif
    26	  #ifndef __cdecl
    27	    #define __cdecl



```c
static PyObject 
*__pyx_pf_5cyfoo_cyfoo(CYTHON_UNUSED PyObject *__pyx_self,
                       PyObject *__pyx_v_a,
                       PyObject *__pyx_v_b) {
[...]
  /* "cyfoo.pyx":3
 * 
 * def cyfoo(a, b):
 *     return a + b             # <<<<<<<<<<<<<<
 */
  __pyx_t_1 = PyNumber_Add(__pyx_v_a, __pyx_v_b);
 [...]
}
```
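For comparison, the interpreted side can be inspected with the standard-library `dis` module. This is a sketch, assuming a pure-Python function `pyfoo` (a hypothetical name) with the same body as `cyfoo`. The opcode name depends on the CPython version: `BINARY_ADD` on older releases such as the 3.6 used here, and the generic `BINARY_OP` on 3.11 and later.

```python
import dis


def pyfoo(a, b):
    # Hypothetical pure-Python counterpart of cyfoo.
    return a + b


# Collect the opcode names the interpreter executes for this function.
opnames = [ins.opname for ins in dis.get_instructions(pyfoo)]

# The addition is a single generic bytecode; it is not specialized for
# ints or strings at compile time.
binary_ops = [name for name in opnames if name.startswith("BINARY_")]
print(binary_ops)

# The same bytecode handles both argument types used in the timings above.
print(pyfoo(1, 2))
print(pyfoo("a", "b"))
```

In both the interpreted and the Cython-compiled case the operand types are only known at run time, which is why the generated C still goes through the generic `PyNumber_Add()` call.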

## We conclude: converting from interpreted to compiled code gives some speedup

## Reason 2: Dynamic -> Static Typing

In [31]:
def pyfac(n):
    if n <= 1:
        return 1
    return n * pyfac(n - 1)

In [32]:
%timeit pyfac(20.0)
pyfac(20.0)

3.66 µs ± 35.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


2.43290200817664e+18

In [33]:
%%cython

def cyfac(n):
    if n <= 1:
        return 1
    return n * cyfac(n - 1)

def cyfac_double(double n):
    if n <= 1:
        return 1.0
    return n * cyfac_double(n - 1)

In [34]:
%timeit cyfac(20.0)
cyfac(20.0)

1.48 µs ± 44.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


2.43290200817664e+18

In [35]:
%timeit cyfac_double(20.0)
cyfac_double(20.0)

943 ns ± 9.14 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


2.43290200817664e+18

## Optimal Cython solution: up to 40x speedup

* Optimal for *this* recursive implementation...

In [36]:
%%cython

cpdef double cyfac_double_fast(double n):
    if n <= 1:
        return 1.0
    return n * cyfac_double_fast(n - 1)

In [37]:
%timeit cyfac_double_fast(20.0)
cyfac_double_fast(20.0)

89.4 ns ± 1.48 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


2.43290200817664e+18
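The "up to 40x" figure can be checked against the timings reported in the cells above. The sketch below only redoes that arithmetic; the values are copied from this run's `%timeit` output and will differ on other machines.

```python
# Mean timings from the cells above, all converted to nanoseconds.
timings_ns = {
    "pyfac": 3660.0,             # 3.66 µs, pure Python
    "cyfac": 1480.0,             # 1.48 µs, untyped Cython def
    "cyfac_double": 943.0,       # typed argument, def function
    "cyfac_double_fast": 89.4,   # cpdef with typed argument and return
}

baseline = timings_ns["pyfac"]
speedups = {name: baseline / t for name, t in timings_ns.items()}

for name, factor in speedups.items():
    print("{:<18} {:5.1f}x".format(name, factor))
# pyfac                1.0x
# cyfac                2.5x
# cyfac_double         3.9x
# cyfac_double_fast   40.9x
```

Most of the gain appears only in the last step, when the recursive call itself no longer goes through the Python calling convention.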

## For the record: what about a loop-based version?

In [38]:
def pyfac_loop(n):
    r = 1.0
    for i in range(1, n+1):
        r *= i
    return r

In [39]:
%timeit pyfac_loop(20)
pyfac_loop(20)

1.02 µs ± 13.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


2.43290200817664e+18

In [40]:
%%cython -a

cpdef double cyfac_loop(int n):
    cdef double r = 1.0
    cdef int i
    for i in range(1, n+1):
        r *= <double>i
    return r

In [41]:
%timeit cyfac_loop(20)
cyfac_loop(20)

69 ns ± 1.4 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


2.43290200817664e+18

In [42]:
print("Cython speedup factor--loop-based version: {:0.1f}".format((1.81 / 0.062)))

Cython speedup factor--loop-based version: 29.2


## Exercises / questions

* Why are we using `double` here instead of `long`?
* Why are the `pyfac_loop()` and `cyfac_loop()` versions *better* from a robustness point of view?
* Write a trivial no-op function in Python and measure its performance with `timeit`.  Now, make a Cython no-op `def` function, and measure *it*.  How do they compare?  Conjecture why.  What does this imply for function call overhead between pure Python and Cython code?
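One possible starting point for the first two questions, as a pure-Python sketch. It assumes a 64-bit C `long` (as on the x86-64 Linux platform shown earlier; on other platforms `long` may be 32-bit), and it redefines `pyfac` and `pyfac_loop` so the block is self-contained.

```python
import math
import sys

# Assumption: C long is a 64-bit signed integer on this platform.
INT64_MAX = 2**63 - 1

print(math.factorial(20) <= INT64_MAX)                  # True
print(math.factorial(21) <= INT64_MAX)                  # False
print(float(math.factorial(170)) < sys.float_info.max)  # True


def pyfac(n):
    if n <= 1:
        return 1
    return n * pyfac(n - 1)


def pyfac_loop(n):
    r = 1.0
    for i in range(1, n + 1):
        r *= i
    return r


# A deep enough argument exhausts the interpreter's recursion limit,
# while the loop-based version has no such limit.
deep = sys.getrecursionlimit() + 100
try:
    pyfac(deep)
    recursion_failed = False
except RecursionError:
    recursion_failed = True

print(recursion_failed)  # True
print(pyfac_loop(deep))  # inf (the float result overflows)
```

The typed Cython functions do not go through the Python recursion limit, so very deep recursion there is bounded by the C stack instead.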

In [43]:
def pynoop(): pass
%timeit pynoop()

78 ns ± 0.942 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [44]:
%%cython -a
def cynoop(): pass

In [45]:
%timeit cynoop()

41.9 ns ± 1.32 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
