# The Professional

A `pythonista` understands the meaning behind the Zen of Python.

At first you try to force yourself into good practices.

In the end you can't stand writing bad code anymore.

A `notebooker` wants to share their work and their results.

Usually also Python and its Zen.

The great capabilities of Python are in its packages.

Mastering Python means being able to quickly study and use a new or unfamiliar package.

# Speeding

> `Cython` is both a language (a superset of Python) and a Python library. 

Cython is a Python-like language that:

- Improves Python’s performance – 1000x speedups not uncommon
- Wraps external code: C, C++, Fortran, others

The cython command:

- generates an optimized C or C++ source file from a Cython source file 
- the C/C++ source is then compiled into a Python extension module

Other features:

- Built-in support for NumPy
- Integration with IPython
- C’s performance combined with Python’s ease of use

http://www.cython.org/

In [None]:
def fib(n):
    a,b = 1,1
    for i in range(n):
        a, b = a+b, a
    return a

In [None]:
fib

In [None]:
%load_ext cython

In [None]:
%%cython
def cfib(int n):
    cdef int i, a, b
    a,b = 1,1
    for i in range(n):
        a, b = a+b, a
    return a

In [None]:
cfib

What's the difference?

```python
def cfib(int n):
    cdef int i, a, b
    a,b = 1,1
    for i in range(n):
        a, b = a+b, a
    return a
```

* We added the `int` type to the argument `n`

* We used `cdef` to declare `i`, `a` and `b` as C integers

Test and comparison

In [None]:
fib(10)

In [None]:
cfib(10)

Performance

In [None]:
# Number to compute
test_size = 100000
# Normal python
t1 = %timeit -n1 -r1 -o fib(test_size)
# Cython library
t2 = %timeit -n1 -r1 -o cfib(test_size)

In [None]:
print("%sx speedup" % (t1.best // t2.best))

**!!**

*Note*: for this function Cython reaches the same speed as a C implementation.
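
One caveat about the benchmark above: `cdef int` declares fixed-width C integers, so `cfib` cannot hold the huge values that `fib` returns for large `n`. The sketch below imitates that in plain Python, assuming a 32-bit `int` that wraps around on overflow (typical on common platforms, though C does not guarantee it). `fib_int32` is a hypothetical helper written for this illustration, not part of Cython.

```python
import ctypes

def fib(n):
    """Pure-Python version: integers grow without bound."""
    a, b = 1, 1
    for _ in range(n):
        a, b = a + b, a
    return a

def fib_int32(n):
    """Same loop, but every sum is truncated to a signed 32-bit value."""
    a, b = 1, 1
    for _ in range(n):
        a, b = ctypes.c_int32(a + b).value, a
    return a

print(fib(44) == fib_int32(44))  # the result still fits in 32 bits
print(fib(45) == fib_int32(45))  # the first result that no longer fits
```

The timing therefore compares the speed of the loop, not identical results. For large `n` a wider C type or plain Python integers would be needed to get the same values.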

In [None]:
%%cython
make an error

A C compiler is required.

## How to

With Cython, we start from a regular Python program and add annotations about the types of the variables.

Then, Cython translates that code to C and compiles the result to a Python extension module. 

Finally, we can use this compiled module in any Python program.

While dynamic typing comes with a performance cost in Python, statically-typed variables in Cython generally lead to faster code execution.

Performance gains are most significant in CPU-bound programs, notably in **tight Python loops**. 

By contrast, *I/O-bound programs* are **NOT** expected to benefit much from a Cython implementation.
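
Before porting a function to Cython, it can help to check whether it is CPU-bound at all. A rough sketch, using only the standard library: compare the CPU time of a call with its wall-clock time. The helper `cpu_fraction` and the two workloads are hypothetical names chosen for this illustration, and `time.sleep` stands in for waiting on I/O.

```python
import time

def cpu_fraction(func, *args):
    """Return CPU time divided by wall-clock time for one call of func."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else 0.0

def busy_loop(n):
    """A tight Python loop: the kind of code Cython tends to help."""
    total = 0
    for i in range(n):
        total += i * i
    return total

cpu_bound = cpu_fraction(busy_loop, 1_000_000)
io_like = cpu_fraction(time.sleep, 0.2)  # sleeping uses almost no CPU

print("busy loop: %.2f" % cpu_bound)
print("sleep:     %.2f" % io_like)
```

A fraction close to 1 suggests the time is spent computing. A fraction close to 0 suggests the time is spent waiting, and static typing is unlikely to change that.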

# With numpy

Generating the Mandelbrot fractal.

In [None]:
import numpy as np

def mandelbrot_python(m, size, iterations):
    for i in range(size):
        for j in range(size):
            c = -2 + 3./size*j + 1j*(1.5-3./size*i) 
            z= 0
            for n in range(iterations):
                if np.abs(z) <= 10:
                    z = z*z + c
                    m[i, j] = n
                else:
                    break

In [None]:
mandelbrot_python

In [None]:
size = 200
iterations = 100

In [None]:
%%timeit -n1 -r1 m = np.zeros((size, size),dtype=np.int32) 
mandelbrot_python(m, size, iterations)

In [None]:
%%cython
import numpy as np
def mandelbrot_cython(int[:,::1] m, int size, int iterations):
    cdef int i, j, n
    cdef complex z, c

    for i in range(size):
        for j in range(size):
            c = -2 + 3./size*j + 1j*(1.5-3./size*i)
            z= 0
            for n in range(iterations):
                if z.real**2 + z.imag**2 <= 100:
                    z = z*z + c
                    m[i, j] = n
                else:
                    break

In [None]:
%%timeit -n1 -r1 m = np.zeros((size, size),dtype=np.int32) 
mandelbrot_cython(m, size, iterations)

## Wait. Does that really work?

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

m = np.zeros((size, size),dtype=np.int32) 
mandelbrot_cython(m, size, iterations)
plt.imshow(np.log(m), cmap=plt.cm.hot)
plt.xticks([]); plt.yticks([])
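
A complementary check that needs no plot is to compare a few pixels against a plain-Python reference. The sketch below uses a hypothetical helper, `escape_count`, that repeats the inner loop of the functions above for a single point `c`, with the squared-modulus test of the Cython version.

```python
def escape_count(c, iterations):
    """Return the last iteration index n for which |z| <= 10 still held."""
    z = 0j
    count = 0
    for n in range(iterations):
        if z.real**2 + z.imag**2 <= 100:
            z = z * z + c
            count = n
        else:
            break
    return count

print(escape_count(0j, 100))      # the origin never escapes
print(escape_count(1 + 1j, 100))  # this point escapes after a few steps
```

For pixel `(i, j)` the functions above use `c = -2 + 3./size*j + 1j*(1.5-3./size*i)`, so `m[i, j]` should match `escape_count(c, iterations)` for that `c`.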

## How it really works

The `cdef` keyword declares a variable as a statically-typed C variable. 

C variables lead to faster code execution because the overhead from Python's dynamic typing is mitigated. 

Function arguments can also be declared as statically-typed C variables.

In general, variables used inside tight loops should be declared with cdef. 

There are two ways of declaring NumPy arrays as C variables with Cython: using array buffers or using typed memory views. 

Memory views do not implement element-wise operations like NumPy. 
Thus, memory views act as convenient data containers within tight for loops. 

For element-wise NumPy-like operations, array buffers should be used instead.
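
Python's built-in `memoryview` gives a feel for this distinction. It is not the Cython typed memory view itself, but both are built on the buffer protocol: indexing and assignment work, element-wise arithmetic does not. A small sketch using only the standard library:

```python
from array import array

# A flat buffer of six C ints, viewed as a 2x3 matrix without copying.
buf = array('i', range(6))
view = memoryview(buf).cast('B').cast('i', shape=[2, 3])

view[1, 2] = 42        # indexed writes go straight to the buffer
print(buf[5])          # the underlying array sees the change
print(view.tolist())

try:
    view * 2           # a view offers no element-wise operations
except TypeError as exc:
    print("TypeError:", exc)
```

The intermediate `cast('B')` is needed because the built-in `memoryview` only changes shape when one side of the cast is a byte format.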

## If you want or need to use Cython inside standard Python

## Step 1

Write a standalone Cython script in a `.pyx` file. 

This should correspond exactly to the entire contents of a %%cython cell magic.


In [None]:
%%writefile fib.pyx
def fcfib(int n):
    cdef int i, a, b
    a,b = 1,1
    for i in range(n):
        a, b = a+b, a
    return a

## Step 2

Create a setup.py file that we will use to compile the Cython module.


In [None]:
%%writefile setup.py
from setuptools import setup, Extension
from Cython.Build import cythonize

# distutils was removed in Python 3.12; setuptools and cythonize replace it
setup(ext_modules=cythonize([Extension("cython_fibonacci", ["fib.pyx"])]))

## Step 3

Execute this setup script with Python:

In [None]:
! python setup.py build_ext --inplace

Two files have been created during the build process: 

1. the C source file 
2. and a compiled Python extension. 

The file extension is `.pyd` on Windows (DLL files) and `.so` on UNIX.

In [None]:
! ls *.so
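
The exact suffix depends on the platform and on the interpreter version. A portable way to look it up, instead of hard-coding `.so`, is sketched below using only the standard library. The glob at the end simply lists whatever extension modules sit in the current directory.

```python
import glob
import importlib.machinery
import sysconfig

# All suffixes the running interpreter accepts for extension modules.
suffixes = importlib.machinery.EXTENSION_SUFFIXES
print(suffixes)

# The suffix used when building extensions for this interpreter.
print(sysconfig.get_config_var("EXT_SUFFIX"))

# Portable replacement for `ls *.so`.
built = sorted({path for suffix in suffixes
                for path in glob.glob("*" + suffix)})
print(built)
```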

## Step 4

Finally, we can load the compiled module as usual (using `from cython_fibonacci import fcfib`).

In [None]:
from cython_fibonacci import fcfib

We can now use the optimized function

In [None]:
fcfib(10)

In [None]:
fcfib


With this technique, Cython code can also be integrated within a Python package. 
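
When the compiled module ships inside a package, a common pattern is to fall back to a pure-Python implementation if the extension has not been built. The sketch below assumes the module and function names from the steps above (`cython_fibonacci`, `fcfib`). The name `fib_impl` is hypothetical, and the fallback body is the plain `fib` from the beginning of this notebook.

```python
try:
    # Fast path: the extension built in Step 3.
    from cython_fibonacci import fcfib as fib_impl
except ImportError:
    # Fallback: the same algorithm in pure Python.
    def fib_impl(n):
        a, b = 1, 1
        for _ in range(n):
            a, b = a + b, a
        return a

print(fib_impl(10))
```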

Here are a few references:

* Distributing Cython modules, explained at http://docs.cython.org/src/userguide/source_files_and_compilation.html
* Compilation with Cython, explained at http://docs.cython.org/src/reference/compilation.html

# Interactive plots?

http://bokeh.pydata.org/en/0.10.0/docs/gallery.html

In [None]:
# Preparation
import pandas as pd
from bokeh.plotting import figure, show, output_notebook
from collections import OrderedDict

In [None]:
from bokeh.sampledata.iris import flowers

colormap = {'setosa': 'red', 'versicolor': 'green', 'virginica': 'blue'}
flowers['color'] = flowers['species'].map(lambda x: colormap[x])

p = figure(title = "Iris Morphology")
p.xaxis.axis_label = 'Petal Length'
p.yaxis.axis_label = 'Petal Width'

p.circle(flowers["petal_length"], flowers["petal_width"], color=flowers["color"], fill_alpha=0.2, size=10)
output_notebook()
show(p)

In [None]:
from bokeh._legacy_charts import Donut, show, output_file
from bokeh.sampledata.olympics2014 import data

# throw the data into a pandas data frame
df = pd.io.json.json_normalize(data['data'])

# keep countries with more than 8 medals and sort by total
df = df[df['medals.total'] > 8]
df = df.sort_values("medals.total", ascending=False)

# get the countries and group the data by medal type
countries = df.abbr.values.tolist()
gold = df['medals.gold'].astype(float).values
silver = df['medals.silver'].astype(float).values
bronze = df['medals.bronze'].astype(float).values

# build a dict containing the grouped data
medals = OrderedDict()
medals['bronze'] = bronze
medals['silver'] = silver
medals['gold'] = gold
medals = pd.DataFrame(medals)

donut = Donut(medals, countries)
output_notebook()
show(donut)

# GPU

GPU programming is a rich and highly technical topic, encompassing low-level architectural details of GPUs.

We present here only the simplest paradigm possible (the "embarrassingly parallel" problem).

## PyCUDA

Installing and configuring PyCUDA is not straightforward in general.

* First, you need an NVIDIA GPU.
* Then, you need to install the CUDA SDK. 
* Finally, you have to install and configure PyCUDA. 
    (Note that PyCUDA depends on a few external packages, notably `pytools`)
    
Make sure your version of CUDA matches the version used in the PyCUDA package...

In [None]:
import pycuda.driver as cuda
import pycuda.autoinit

In [None]:
import numpy as np

# NumPy array that will contain the fractal
size = 200
iterations = 100
col = np.empty((size, size), dtype=np.int32)
# allocate GPU memory for this array
col_gpu = cuda.mem_alloc(col.nbytes)

In [None]:
# We write the CUDA kernel (C code) in a string!

code = """
__global__ void mandelbrot(int size,
    int iterations,
    int *col) {

    // YOUR CODE
    // YOUR CODE
}
"""

Or import a C file in a python string :)

In [None]:
# Read the kernel source from a C file into a Python string
with open("mycuda_ccode.c") as f:
    code = f.read()

In [None]:
# Compile the CUDA program
from pycuda.compiler import SourceModule
prg = SourceModule(code)
mandelbrot = prg.get_function("mandelbrot")

In [None]:
# define the block size and the grid size, 
# specifying how the threads will be parallelized with respect to your data
block_size = 10
block = (block_size, block_size, 1)
grid = (size // block_size, size // block_size, 1)
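
To see what these two tuples mean, the sketch below enumerates in plain Python the pixel each CUDA thread would handle. It assumes the kernel computes its indices in the conventional way (`blockIdx * blockDim + threadIdx`), which the placeholder kernel above does not show. `thread_pixels` is a hypothetical helper written for this illustration.

```python
from itertools import product

def thread_pixels(size, block_size):
    """List the (i, j) pixel assigned to every thread of a 2D launch."""
    blocks_per_axis = size // block_size
    pixels = []
    for bx, by in product(range(blocks_per_axis), repeat=2):       # grid
        for tx, ty in product(range(block_size), repeat=2):        # block
            i = bx * block_size + tx
            j = by * block_size + ty
            pixels.append((i, j))
    return pixels

pixels = thread_pixels(200, 10)
print(len(pixels), len(set(pixels)))  # one thread per pixel, no duplicates

edge_case = thread_pixels(25, 10)
print(len(set(edge_case)), "of", 25 * 25, "pixels covered")
```

With `size = 200` and `block_size = 10` every pixel gets exactly one thread. When `size` is not a multiple of `block_size`, the floor division leaves the last rows and columns without a thread.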

In [None]:
# Execute!

mandelbrot(np.int32(size), 
           np.int32(iterations), 
           col_gpu, # python space for cuda buffer
           block=block, grid=grid) # parallelization

In [None]:
# copy the contents of the CUDA buffer back to the NumPy array
cuda.memcpy_dtoh(col, col_gpu)

# Parallel

In [None]:
! conda install -y ipyparallel

Now open a terminal and create a cluster:

```bash
$ ipcluster start -n 4
```

The first step is to import the `ipyparallel` module and then create a `Client` instance.

In [None]:
import ipyparallel as ipp
rc = ipp.Client()
rc

In [None]:
rc.ids

In [None]:
# Process ids

import os
ar = rc[:].apply_async(os.getpid)
pid_map = ar.get_dict()
pid_map

In [None]:
# We might check the PIDs
! ps xa | grep engine


In [None]:
# A DirectView of all engines
dview = rc[:]

Blocking execution

In [None]:
dview.block = True
dview['a'] = 5
dview['b'] = 10

dview.apply(lambda x: a+b+x, 27)

Magic

In [None]:
%px print('hi')

In [None]:
# OOPS
%px print('hi'

In [None]:
import numpy
%px numpy.random.rand(1)

In [None]:
with rc[:].sync_imports():
    import numpy

In [None]:
%%px 
a = numpy.random.rand(2,2)
numpy.linalg.eigvals(a)

Non-blocking execution: asynchronous

In [None]:
%%px
import time
import random
pause = random.randint(1,5)
time.sleep(pause)
now = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(time.time()))
print("[%s] Completed after %s seconds " % (now,pause) )

In [None]:
%%px --noblock
import time
import random
pause = random.randint(1,5)
time.sleep(pause)
now = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(time.time()))
print("[%s] Completed after %s seconds " % (now,pause) )

In [None]:
%pxresult
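
The blocking versus non-blocking distinction is not specific to ipyparallel. As an analogy only (this is the standard library's `concurrent.futures`, with threads instead of engines), the sketch below submits work without waiting and collects the results later, much like running a cell with `--noblock` and then asking for `%pxresult`. The function `task` is a hypothetical stand-in for the cell body.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def task(pause):
    """Sleep for `pause` seconds, then report."""
    time.sleep(pause)
    return "Completed after %s seconds" % pause

with ThreadPoolExecutor(max_workers=4) as pool:
    # submit() returns a Future at once: this is the non-blocking part.
    futures = [pool.submit(task, pause) for pause in (0.1, 0.2)]
    # result() waits for each task: this is the blocking part.
    results = [future.result() for future in futures]

for line in results:
    print(line)
```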

In [None]:
%%px --targets 1
print("I am number 1")

In [None]:
%%px --targets ::2
print("I am even")

In [None]:
%%px --targets 1:3
print("In the middle")
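
The `--targets` values above follow Python's index and slice syntax, applied to the list of engines. The sketch below mimics that selection on a plain list, assuming the four engines started with `ipcluster start -n 4`. `select_targets` is a hypothetical helper, not an ipyparallel function.

```python
def select_targets(spec, engine_ids):
    """Pick engine ids the way a Python index or slice would."""
    if ":" in spec:
        parts = [int(part) if part else None for part in spec.split(":")]
        return engine_ids[slice(*parts)]
    return [engine_ids[int(spec)]]

engine_ids = [0, 1, 2, 3]
print(select_targets("1", engine_ids))    # a single engine
print(select_targets("::2", engine_ids))  # every second engine
print(select_targets("1:3", engine_ids))  # engines 1 and 2
```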

* Dependencies
* controller and engine are separated
    - they can run on different hosts
* Integrates with MPI
* PBS mode
* Load balancer
* Scheduler
* Retries

```ipython
# for a visible LAN controller listening on an external port:
rc = ipp.Client('tcp://192.168.1.16:10101')
# or to connect with a specific profile you have set up:
rc = ipp.Client(profile='mpi')
```

https://ipyparallel.readthedocs.org/en/latest/index.html

# The END

> The end is only the beginning