****
# High Performance Computation in Python: Cython vs. Multiprocessing
****
## About this notebook: 
Notebook prepared by **Jesus Perez Colino**. Version 0.1 (alpha), first released 01/12/2014.  

- This work is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unported License](http://creativecommons.org/licenses/by-sa/3.0/deed.en_US). This work is offered for free, with the hope that it will be useful.


- **Summary**: This notebook gives a brief introduction to Cython and then compares the performance of Cython and Multiprocessing on the simplest possible Monte Carlo simulation.


- **Python and package versions** needed to reproduce the results of this notebook: 

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import IPython
from scipy.stats import norm
from abc import ABCMeta, abstractmethod
from sys import version 
import multiprocessing
from numpy import ceil, mean
import time
import os
print ' Reproducibility conditions for this notebook '.center(90,'-')
print 'Python version:     ' + version
print 'Numpy version:      ' + np.__version__
print 'IPython version:    ' + IPython.__version__
print 'Multiprocessing:    ' + multiprocessing.__version__
print '-'*90

---------------------- Reproducibility conditions for this notebook ----------------------
Python version:     2.7.10 |Anaconda 2.3.0 (x86_64)| (default, Sep 15 2015, 14:29:08) 
[GCC 4.2.1 (Apple Inc. build 5577)]
Numpy version:      1.9.2
IPython version:    4.0.0
Multiprocessing:    0.70a1
------------------------------------------------------------------------------------------


## Introduction to Cython

Cython is two closely related things:

- Cython is a programming language that blends Python with the static type system of C and C++.
- Cython is also a compiler that translates Cython source code into C or C++ source code. The generated source can then be compiled into a Python extension module or a standalone executable.

Python is a high-level, dynamic, simple and flexible programming language. These advantages come at a cost, however: because Python is dynamic and interpreted, it can be several orders of magnitude slower than statically typed compiled languages.

C, on the other hand, is one of the oldest statically typed compiled languages in widespread use, so compilers have had nearly half a century to optimize its performance. C is very low level and very powerful. Unlike Python, it does not have many safeguards in place and can be difficult to use.

The following Fibonacci example makes the difference in run time between `fibonacci_python` and `fibonacci_cython` clear:


In [2]:
def fibonacci_python(n):
    a, b = 0, 1
    while b < n:
        #print b,
        a, b = b, a + b

In [3]:
%timeit fibonacci_python(100000000)

100000 loops, best of 3: 2.26 µs per loop


In [4]:
%load_ext cython

In [7]:
%%cython
def fibonacci_cython(int n ):
    cdef int a=0, b=1
    while b < n:
        #print b,
        a, b = b, a + b

In [8]:
%timeit fibonacci_cython(100000000)

The slowest run took 15.24 times longer than the fastest. This could mean that an intermediate result is being cached 
10000000 loops, best of 3: 78.2 ns per loop


Python, as a dynamically typed language, places no restrictions on a variable’s type: the same variable can start out as an integer and end up as a string, or a list, or an instance of a custom Python object, for example. Dynamically typed languages are typically easier to write because the user does not have to explicitly declare variables’ types, with the tradeoff that type-related errors are caught only at runtime.
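
As a small illustration of the paragraph above, the sketch below rebinds one name to three different types and then triggers a type error that only surfaces when the offending line runs. The names `value`, `kinds` and `halve` are made up for this example; the code is written to run on both Python 2.7 and Python 3.

```python
value = 42                          # the name starts out bound to an int
kinds = [type(value).__name__]
value = "forty-two"                 # the same name now refers to a str
kinds.append(type(value).__name__)
value = [4, 2]                      # ... and now to a list
kinds.append(type(value).__name__)
print(kinds)


def halve(x):
    # Nothing checks the type of x until this line is actually executed.
    return x / 2


try:
    halve("forty-two")
    error_caught = False
except TypeError as exc:
    error_caught = True
    print("caught at runtime: %s" % exc)
```

A C compiler would reject the equivalent of `halve("forty-two")` before the program ever ran; here the mistake shows up only when the call is made.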

When running a Python program, the interpreter spends most of its time figuring out what low-level operation to perform, and extracting the data to give to this low-level operation. Given Python’s design and flexibility, the Python interpreter always has to determine the low-level operation in a completely general way, because a variable can have any type at any time. This is known as *dynamic dispatch*, and for many reasons, fully general dynamic dispatch is slow.
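
A minimal sketch of dynamic dispatch follows: the single expression `a + b` ends up running a different low-level operation depending on the runtime types of its operands. The `Money` class and the `add` function are hypothetical names introduced only for this illustration.

```python
class Money(object):
    """Hypothetical user-defined type with its own meaning of '+'."""

    def __init__(self, cents):
        self.cents = cents

    def __add__(self, other):
        return Money(self.cents + other.cents)


def add(a, b):
    # On every call the interpreter looks at the runtime types of a and b
    # to decide which addition routine to run.
    return a + b


print(add(1, 2))                           # integer addition
print(add(1.5, 2))                         # float addition
print(add("py", "thon"))                   # string concatenation
print(add([1], [2]))                       # list concatenation
print(add(Money(150), Money(250)).cents)   # user-defined __add__
```

The same lookup happens on each iteration of a loop such as the one in `fibonacci_python`, which is part of the overhead that static typing removes.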

The situation for C is very different. Because C is compiled and statically typed, the C compiler can determine at compile time what low-level operations to perform and what low-level data to pass as arguments. At runtime, a compiled C program skips nearly all steps that the Python interpreter must perform, and therefore, a compiled C program spends nearly all its time calling fast C functions and performing fundamental operations. Because of the restrictions a statically typed language places on its variables, a compiler generates faster, more specialized instructions that are tailored to its data.
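
One practical consequence of those restrictions is that a C `int` has a fixed width, which matters for `cdef int` declarations like the ones in `fibonacci_cython`. The sketch below uses the standard-library `ctypes` module as a stand-in for a C variable (an assumption made for illustration; it is not how Cython stores values internally) and compares it with Python's arbitrary-precision integers.

```python
import ctypes

# Width of a C int on this platform (typically 32 bits).
bits = 8 * ctypes.sizeof(ctypes.c_int)
largest = 2 ** (bits - 1) - 1          # largest value a signed C int can hold

python_result = largest + 1                  # a Python integer simply grows
c_result = ctypes.c_int(largest + 1).value   # ctypes does no overflow checking

print("C int width: %d bits" % bits)
print("Python int : %d" % python_result)
print("C int      : %d" % c_result)
```

The bound 100000000 used in the Fibonacci timing fits comfortably in a 32-bit `int`, but a much larger `n` would need a wider C type such as `long long`.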

## An Example of Cython vs. Multiprocessing Simulation

In [9]:
def step():
    return np.sign(np.random.random(1)-.5)

def sim1(n):
    x = np.zeros(n)
    dx = 1./n
    for i in xrange(n-1):
        x[i+1] = x[i] + dx * step()
    return x

n = 10000
%timeit sim1(n)

10 loops, best of 3: 41.1 ms per loop


In [10]:
%%cython
import numpy as np
cimport numpy as np
DTYPE = np.double
ctypedef np.double_t DTYPE_t
from libc.stdlib cimport rand, RAND_MAX
from libc.math cimport round
cdef double step():
    return 2 * round(float(rand()) / RAND_MAX) - 1
def sim2(int n):
    cdef int i
    cdef double dx = 1. / n
    cdef np.ndarray[DTYPE_t, ndim=1] x = np.zeros(n, dtype=DTYPE)
    for i in range(n - 1):
        x[i+1] = x[i] + dx * step()
    return x

In [11]:
%timeit sim2(n)

10000 loops, best of 3: 99 µs per loop
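
To put the timings so far side by side, the short sketch below computes the speedup factors from the `%timeit` figures reported in this notebook. The dictionary layout and the name `timings` are choices made for this example only.

```python
# Best-of-3 timings copied from the %timeit outputs above, in seconds.
timings = {
    "fibonacci": {"python": 2.26e-6, "cython": 78.2e-9},
    "random walk (n = 10000)": {"python": 41.1e-3, "cython": 99e-6},
}

speedups = {}
for name in sorted(timings):
    t = timings[name]
    speedups[name] = t["python"] / t["cython"]
    print("%-25s about %.0fx faster with Cython" % (name, speedups[name]))
```

The random-walk ratio should be read with some care: `sim1` draws its steps from NumPy's generator while `sim2` calls the C library's `rand()`, so the difference reflects more than static typing alone.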


In [14]:
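# One scenario per pool size; every scenario covers the same total of n steps.
scenarios = {'1': n, 
             '2': n, 
             '3': n,
             '4': n,
             '5': n,
             '6': n}
results = {}
print '-' * 85
# A Python 2 dict has no guaranteed iteration order, so the pool sizes
# are not necessarily visited in ascending order.
for num_processes in scenarios:
    N = scenarios[num_processes]
    # One chunk size per worker. N / int(...) is integer division in Python 2,
    # so ceil() has no effect here; the last chunk absorbs the remainder.
    chunks = [int(ceil(N / int(num_processes)))] * int(num_processes)
    chunks[-1] = int(chunks[-1] - sum(chunks) + N)
    p = multiprocessing.Pool(int(num_processes))
    print 'Number of processors:', num_processes 
    # Each worker process runs sim1 on its own chunk size.
    %timeit p.map(sim1, chunks)
    p.close()
    p.join()
    print '-' * 85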


-------------------------------------------------------------------------------------
Number of processors: 1
10 loops, best of 3: 43.2 ms per loop
-------------------------------------------------------------------------------------
Number of processors: 3
100 loops, best of 3: 15 ms per loop
-------------------------------------------------------------------------------------
Number of processors: 2
10 loops, best of 3: 21.8 ms per loop
-------------------------------------------------------------------------------------
Number of processors: 5
100 loops, best of 3: 13.4 ms per loop
-------------------------------------------------------------------------------------
Number of processors: 4
100 loops, best of 3: 11.7 ms per loop
-------------------------------------------------------------------------------------
Number of processors: 6
100 loops, best of 3: 12.2 ms per loop
-------------------------------------------------------------------------------------
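
The chunk arithmetic used in the benchmark loop can be isolated in a small helper to see exactly how the `N` steps are divided among the workers. This is a sketch under the assumption that the integer-division behaviour of Python 2 is what was intended; `split_into_chunks` is a hypothetical name, and `//` is used so the code behaves the same on Python 2 and 3.

```python
def split_into_chunks(total, num_processes):
    """Split `total` steps into `num_processes` integer chunk sizes."""
    base = total // num_processes       # floor division, like int / int on Python 2
    chunks = [base] * num_processes
    # The last chunk absorbs whatever the floor division left over.
    chunks[-1] = chunks[-1] - sum(chunks) + total
    return chunks


for k in range(1, 7):
    print("%d -> %s" % (k, split_into_chunks(10000, k)))
```

Each worker then calls the simulation with its own chunk size, so it simulates a shorter independent walk with `dx = 1. / chunk` rather than a slice of one walk of length `N`.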


## And combining Multiprocessing and Cython... 

In [16]:
# Same benchmark as in the previous section, with the Cython function sim2
# in place of the pure-Python sim1.
scenarios = {'1': n, 
             '2': n, 
             '3': n,
             '4': n,
             '5': n,
             '6': n}
results = {}
print '-' * 85
for num_processes in scenarios:
    N = scenarios[num_processes]
    # One chunk size per worker; the last chunk absorbs the remainder.
    chunks = [int(ceil(N / int(num_processes)))] * int(num_processes)
    chunks[-1] = int(chunks[-1] - sum(chunks) + N)
    p = multiprocessing.Pool(int(num_processes))
    print 'Number of processors:', num_processes 
    %timeit p.map(sim2, chunks)
    p.close()
    p.join()
    print '-' * 85

-------------------------------------------------------------------------------------
Number of processors: 1
The slowest run took 6.41 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 289 µs per loop
-------------------------------------------------------------------------------------
Number of processors: 3
The slowest run took 7.66 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 231 µs per loop
-------------------------------------------------------------------------------------
Number of processors: 2
The slowest run took 6.79 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 219 µs per loop
-------------------------------------------------------------------------------------
Number of processors: 5
The slowest run took 7.14 times longer than the fastest. This could mean that an intermed