## Performance and Code Optimization

Occasionally a direct implementation of a statistical algorithm will not execute quickly enough to be applied to interesting data sets. When this occurs, there are a number of alternatives, ranging from improvements possible using only NumPy and Python to using native code through a Python module.

Note that before any code optimization, it is essential that a clean, working implementation is available. This allows both for measuring performance improvements and for ensuring that optimizations have not introduced any bugs.

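Timing code is an important step in measuring performance. IPython contains the magic keywords %timeit and %time which can be used to measure the execution time of a block of code. %time simply runs the code once and reports the time needed, while %timeit repeats the code many times and reports summary timings, which is more reliable for fast statements. To time an entire multi-line cell, use the cell forms %%time or %%timeit as the first line of the cell.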

In [16]:
import random
import numpy as np
import pandas as pd

In [17]:
%%time
c = 0
for i in range(1000):
    c += 1

CPU times: user 3 µs, sys: 3 µs, total: 6 µs
Wall time: 11 µs
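Outside IPython, the standard library's timeit module offers a comparable measurement. The following is a minimal sketch rather than a definitive benchmark; the helper name count_to and the repeat counts are illustrative assumptions, not part of the original example.

```python
import timeit


def count_to(n):
    """Increment a counter n times, mirroring the loop timed above."""
    c = 0
    for _ in range(n):
        c += 1
    return c


# Run 5 independent measurements of 1000 calls each and keep the fastest,
# which is the one least disturbed by other activity on the machine.
runs = timeit.repeat(lambda: count_to(1000), repeat=5, number=1000)
best_per_call = min(runs) / 1000  # seconds per single call

print("best of 5: {:.2f} microseconds per call".format(best_per_call * 1e6))
```

Taking the minimum rather than the mean is a common choice because slower runs usually reflect interference from other processes, not the code itself.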


### Profile Long Running Functions

Profiling provides detailed information about the number of times a line is executed as well as the execution time spent on each line. The default Python profiling tools are not adequate to address all performance measurement issues in NumPy code, and so a third-party library known as line_profiler is needed. line_profiler is not currently available in Anaconda and so it must be installed before use.

In [18]:
!pip install line_profiler



The simplest method to profile a function is to use IPython. This requires a small amount of setup to define a new magic word, %lprun. Recent releases of line_profiler register the same magic with %load_ext line_profiler instead.

In [19]:
import IPython, line_profiler

In [20]:
ip = IPython.get_ipython()

In [21]:
ip.define_magic('lprun', line_profiler.magic_lprun)

In [22]:
glass_data = pd.read_csv('data/dati/glass.data.csv')

In [23]:
def test_1(x,y):
    z = x + y
    u = x*y
    return z*u

In [24]:
%lprun -f test_1 test_1(1,2)

#### Always use xrange to be faster

## Executing Code in Parallel

### map and related functions

map is a built-in function that applies a function to a generic iterable. It is used as map(function, iterable) and, in Python 2, returns a list containing the results of applying function to each item of iterable. The list returned can be either a simple list if the function returns a single item, or a list of tuples if the function returns more than 1 value.

In [25]:
def powers(x):
    return x**2,x**3,x**4

In [26]:
y = [1.0, 2.0, 3.0, 4.0]
map(powers, y)

[(1.0, 1.0, 1.0), (4.0, 8.0, 16.0), (9.0, 27.0, 81.0), (16.0, 64.0, 256.0)]
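The output above is what Python 2 produces. Under Python 3, map returns a lazy iterator rather than a list, so the results have to be materialized explicitly. A minimal sketch of the same example, assuming Python 3:

```python
def powers(x):
    return x**2, x**3, x**4


y = [1.0, 2.0, 3.0, 4.0]

lazy = map(powers, y)   # nothing is computed yet in Python 3
result = list(lazy)     # consuming the iterator performs the calls

# The equivalent list comprehension gives the same list of tuples.
comprehension = [powers(v) for v in y]

print(result)
```

Note that the iterator can be consumed only once; a second list(lazy) would be empty.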

map can be used with more than 1 iterable, in which case, under Python 2, it iterates using the length of the longest iterable. If one of the iterables is shorter than the other(s), then it is extended with None. It is usually best practice to ensure that all iterables have the same length before using map.

In [27]:
def powers(x,y):
    if x is None or y is None:
        return None
    else:
        return x**2,x*y,y**2

In [28]:
x = [10.0, 20.0, 30.0]
y = [1.0, 2.0, 3.0, 4.0]
map(powers, x, y)

[(100.0, 10.0, 1.0), (400.0, 40.0, 4.0), (900.0, 90.0, 9.0), None]
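The padding with None shown above is specific to Python 2. In Python 3, map stops at the end of the shortest iterable. A sketch of how the padded behavior could be reproduced in Python 3 with itertools.zip_longest, which fills missing values with None by default:

```python
from itertools import zip_longest


def powers(x, y):
    if x is None or y is None:
        return None
    return x**2, x * y, y**2


x = [10.0, 20.0, 30.0]
y = [1.0, 2.0, 3.0, 4.0]

# Python 3: map silently stops after the 3 items of the shorter list.
truncated = list(map(powers, x, y))

# zip_longest pads the shorter list with None, as Python 2's map did.
padded = [powers(a, b) for a, b in zip_longest(x, y)]

print(truncated)
print(padded)
```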

A related function is zip, which combines two or more lists into a single list of tuples. It is similar to calling map except that it will stop at the end of the shortest iterable, rather than extending using None.

In [29]:
x = [10.0, 20.0, 30.0]
y = [1.0, 2.0, 3.0, 4.0]
zip(x, y)

[(10.0, 1.0), (20.0, 2.0), (30.0, 3.0)]
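As with map, zip returns an iterator in Python 3, so a list has to be requested explicitly. The short sketch below, assuming Python 3, also shows the common idiom of reversing the operation by unpacking the pairs back into zip.

```python
x = [10.0, 20.0, 30.0]
y = [1.0, 2.0, 3.0, 4.0]

# The 4th element of y is dropped because x has only 3 items.
pairs = list(zip(x, y))

# Unpacking the pairs into zip separates them into two tuples again.
xs, ys = zip(*pairs)

print(pairs)
print(xs, ys)
```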

### multiprocessing

The real advantage of map over list comprehensions is that it can be combined with the multiprocessing module to run code on more than 1 (local) processor. Note that on Windows, the multiprocessing module does not work correctly in IPython, and so it is necessary to use stand-alone Python programs. multiprocessing includes a map function which is similar to that in the standard Python distribution except that it executes using a Pool rather than on a single processor. The performance gains from using a Pool may be large, and the speed-up should be close to the number of pool processes (which should be less than or equal to the number of physical processors on a system) if code execution is completely independent.

This example uses multiprocessing to compute critical values for a non-standard distribution and is illustrative of a Monte Carlo-like setup. The program has the standard set of imports including the multiprocessing module.

In [30]:
import multiprocessing as mp

Using multiprocessing requires a `__name__ == '__main__'` block in the program. The main block does three things:

1. Compute the setup for the simulation. This is done so that the variables can be passed to the function executed in parallel.
2. Initialize the pool using mp.Pool(processes=2) (normally should use the same number as the number of physical cores on the system)
3. Call map from the multiprocessing module

In [31]:
def powers(x):
    return x**2,x**3,x**4

In [32]:
x = [10.0, 20.0, 30.0]
po = mp.Pool(processes=4)
res = po.map(powers, x)
print(res)
po.close()

[(100.0, 1000.0, 10000.0), (400.0, 8000.0, 160000.0), (900.0, 27000.0, 810000.0)]
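The three steps listed earlier can be sketched as a stand-alone program, which is the form that also works on Windows. This is an illustrative layout under the assumption that the file is run as a script; the function name main is a convention chosen here, not something required by multiprocessing.

```python
import multiprocessing as mp


def powers(x):
    # Worker functions must be defined at module level so that the
    # child processes can import them.
    return x**2, x**3, x**4


def main():
    # Step 1: set up the inputs that will be passed to the workers.
    x = [10.0, 20.0, 30.0]

    # Step 2: initialize the pool; the context manager shuts it down on exit.
    with mp.Pool(processes=2) as pool:
        # Step 3: call the pool's map, which blocks until all results are in.
        res = pool.map(powers, x)

    print(res)
    return res


if __name__ == "__main__":
    # The guard prevents child processes from re-running the setup
    # when they import this module.
    main()
```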


### joblib

joblib is a Python package that provides a simple interface to the multiprocessing module with a better syntax, especially for functions taking multiple arguments. This improved syntax allows some arguments to vary according to an index while others stay fixed, which is not simple to handle in map (it requires setting up a tuple with both the dynamic and fixed parameters). It is not part of Anaconda, but can be easily installed using pip from the console:

In [33]:
!pip install joblib



Using joblib is a two-step process:

1. Produce a delayed version of the function to be called using delayed(func)
2. Call Parallel with a simple loop across iterations
Parallel takes two sets of inputs. The first are options for use by Parallel, while the second is the function and loop statement. The most important options are n_jobs, which sets the number of jobs (use -1 to use all cores), and verbose, which takes a non-negative integer and instructs joblib to produce some output about progress and expected completion time. Reasonable values for verbose are typically between 5 and 10; using 50 will produce an update on every job completion, while using 0 produces no output.

In [34]:
from joblib import Parallel, delayed

In [35]:
func = delayed(powers)

In [38]:
res = Parallel(n_jobs=4, verbose=10)(func(s) for s in xrange(1000))

[Parallel(n_jobs=4)]: Batch computation too fast (0.0028s.) Setting batch_size=144.
[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 1000 out of 1000 | elapsed:    0.0s finished


In [40]:
def powers(x,y):
    if x is None or y is None:
        return None
    else:
        return x**2,x*y,y**2

In [41]:
func = delayed(powers)

In [44]:
res = Parallel(n_jobs=4, verbose=10)(func(s, t) for s in xrange(1000) for t in xrange(100))

[Parallel(n_jobs=4)]: Batch computation too fast (0.0035s.) Setting batch_size=114.
[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Batch computation too fast (0.0101s.) Setting batch_size=4526.
[Parallel(n_jobs=4)]: Done 236 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 5446 tasks      | elapsed:    0.3s
[Parallel(n_jobs=4)]: Done 37128 tasks      | elapsed:    1.1s
[Parallel(n_jobs=4)]: Done 100000 out of 100000 | elapsed:    2.5s finished
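For comparison, the tuple-packing approach that plain map requires for fixed arguments can be sketched with the standard library alone. The function names scaled_power and scaled_power_packed and the parameter values are hypothetical, chosen only for illustration. The built-in map is used here so the example runs anywhere; the same callables could be passed to a Pool's map since they are defined at module level.

```python
from functools import partial


def scaled_power(x, exponent, scale):
    """x varies across calls; exponent and scale stay fixed."""
    return scale * x**exponent


def scaled_power_packed(args):
    # map passes a single item per call, so the arguments arrive as a tuple.
    x, exponent, scale = args
    return scaled_power(x, exponent, scale)


xs = [1.0, 2.0, 3.0]

# Option 1: pack the fixed parameters into every tuple.
packed = [(x, 2, 10.0) for x in xs]
via_tuples = list(map(scaled_power_packed, packed))

# Option 2: bind the fixed parameters once with functools.partial.
fixed = partial(scaled_power, exponent=2, scale=10.0)
via_partial = list(map(fixed, xs))

print(via_tuples)
print(via_partial)
```

Both options give the same results; joblib's delayed syntax avoids the extra wrapper function that the first option needs.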
