# BMI565: Bioinformatics Programming & Scripting

#### (C) 2015 Michael Mooney (mooneymi@ohsu.edu)

## Week 11: Benchmarking and Optimizing Python Code

1. Benchmarking / Profiling in Python
2. Optimizing Code with SciPy (Weave)
3. Parallel Processing
    - `multiprocessing` module
    - `pp` (Parallel Python) module
4. Final Exam Review

#### Requirements

- Python 2.7
- `time`, `timeit`, and `profile` modules
- `scipy` and `numpy` modules
- `multiprocessing` module
- Parallel Python (`pp`) module

## Benchmarking / Profiling

There are a number of ways to evaluate the performance of your Python code. Three useful modules are:

- `time`
- `timeit`
- `profile`

In [1]:
## Define a function that determines if a number is prime
def isprime(n):
    """
    Returns the number if it is prime, otherwise returns None.
    """
    assert n > 0, "Number must be greater than 0!"
    if n < 2: return None
    for i in range(2,n):
        if n % i == 0:
            return None
    return n

def get_primes(min_n, max_n):
    """
    Returns a list of the prime numbers between min_n and max_n (inclusive).
    """
    result = []
    possible_primes = range(min_n, max_n+1)
    for n in possible_primes:
        result.append(isprime(n))

    prime_nums = [n for n in result if n is not None]
    return prime_nums

In [2]:
## Binary search function
def bsearch(l, n):
    s = 0
    e = len(l) - 1
    while True:
        if s > e:
            return None
        mid = (s + e) // 2
        if l[mid] < n:
            s = mid + 1
        elif l[mid] > n:
            e = mid - 1
        else:
            return mid

In [3]:
## Recursive binary search function
def rec_bsearch(l,n,s=0,e=None):
    if e is None: e = len(l) - 1
    if s > e:
        return None
    mid = (s + e) // 2
    if n == l[mid]:
        return mid
    elif n < l[mid]:
        return rec_bsearch(l,n,s,mid-1)
    else:
        return rec_bsearch(l,n,mid+1,e)

In [5]:
bsearch([1,2,3,4,5,6,7,8], 9)

### `time` module

In [7]:
import time

def search_time(fun, N, M):
    """
    Runs the search function fun N times on a sorted list of M integers
    and prints the total and mean runtimes.
    """
    runtimes = []
    nums = range(M)
    start_time = time.time()
    for i in range(N):
        t0 = time.time()
        idx = fun(nums, 3450)
        runtimes.append(time.time() - t0)
    
    print "Total runtime: ", time.time() - start_time
    print "Mean runtime: ", sum(runtimes)/len(runtimes)
    return None

In [8]:
print "Binary Search:"
search_time(bsearch, 5000, 1000000)

Binary Search:
Total runtime:  0.138224124908
Mean runtime:  2.69233703613e-05
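
The manual `t0` / `time.time()` pattern above can be wrapped in a small helper. The following is a minimal sketch, not part of the course code: `time_call` is a hypothetical name, and it uses `timeit.default_timer`, which selects a suitable clock for the platform. It is written to run on both Python 2.7 and Python 3.

```python
import timeit

def time_call(func, *args, **kwargs):
    """Call func(*args, **kwargs) once; return (result, elapsed_seconds)."""
    # timeit.default_timer is the clock that the timeit module itself uses
    start = timeit.default_timer()
    result = func(*args, **kwargs)
    elapsed = timeit.default_timer() - start
    return result, elapsed

# Example: time the built-in sum() on a list of one million integers.
# The list is built before the call, so its construction is not timed.
total, seconds = time_call(sum, list(range(1000000)))
print("sum = %d, elapsed = %.6f s" % (total, seconds))
```

A single measurement like this is noisy; for short-running code the `timeit` module is usually the better tool.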


### `timeit` module

In [9]:
import timeit

## Get the runtime of a Python statement
timeit.timeit("bsearch(nums, 3450)", setup="from __main__ import bsearch; nums = range(1000000)", number=5000)

0.05454111099243164

In [8]:
## Create a timer and run it multiple times
timer = timeit.Timer("bsearch(nums, 3450)", setup="from __main__ import bsearch; nums = range(1000000)")
timer.repeat(3, number=5000)

[0.04827690124511719, 0.0411679744720459, 0.03947019577026367]
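
`timeit.Timer` also accepts a zero-argument callable in place of a statement string, which avoids the `from __main__ import ...` setup. The sketch below, written under that assumption for Python 2.7 and 3, takes the minimum of the repeated runs (the run least disturbed by other activity on the machine) and compares the hand-written search with one built on the standard library's `bisect` module. `bisect_search` is a hypothetical helper introduced only for this comparison.

```python
import bisect
import timeit

def bsearch(l, n):
    """Iterative binary search; returns the index of n in sorted list l, or None."""
    s, e = 0, len(l) - 1
    while s <= e:
        mid = (s + e) // 2
        if l[mid] < n:
            s = mid + 1
        elif l[mid] > n:
            e = mid - 1
        else:
            return mid
    return None

def bisect_search(l, n):
    """Same contract as bsearch, built on bisect.bisect_left."""
    i = bisect.bisect_left(l, n)
    return i if i < len(l) and l[i] == n else None

nums = list(range(1000000))

for func in (bsearch, bisect_search):
    # A zero-argument callable can be passed instead of a statement string
    timer = timeit.Timer(lambda: func(nums, 3450))
    # Three repeats of 5000 calls each; keep the fastest repeat
    best = min(timer.repeat(3, number=5000))
    print("%-14s best of 3: %.4f s per 5000 calls" % (func.__name__, best))
```

The timings depend on the machine, so no expected output is shown.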

### `profile` module

In [11]:
import profile
nums = range(10000000)
profile.run("rec_bsearch(nums, 3450)")

         23 function calls (5 primitive calls) in 0.000 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 :0(len)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
     19/1    0.000    0.000    0.000    0.000 <ipython-input-3-3e47e1fd46da>:2(rec_bsearch)
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        0    0.000             0.000          profile:0(profiler)
        1    0.000    0.000    0.000    0.000 profile:0(rec_bsearch(nums, 3450))
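
The `profile` module is written in pure Python and adds considerable overhead to the code it measures. The standard library also provides `cProfile`, a C extension with the same `run()` interface, and `pstats` for sorting and trimming the report. The following is a minimal sketch, assuming a compact variant of the prime-finding functions from `In [1]` as the workload; the parameter names `lo` and `hi` are substitutions made here, and the code is written to run on Python 2.7 and 3.

```python
import cProfile
import pstats

try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO        # Python 3

def isprime(n):
    """Returns n if it is prime, otherwise None (trial division)."""
    if n < 2:
        return None
    for i in range(2, n):
        if n % i == 0:
            return None
    return n

def get_primes(lo, hi):
    """Returns the primes in the closed interval [lo, hi]."""
    return [n for n in range(lo, hi + 1) if isprime(n) is not None]

# Profile only the region between enable() and disable()
profiler = cProfile.Profile()
profiler.enable()
primes = get_primes(2, 2000)
profiler.disable()

# Collect the report in a string, sorted by cumulative time, top 10 rows only
stream = StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

Sorting by cumulative time puts the functions that account for the most total runtime, including their callees, at the top of the report.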




## SciPy and `weave`

Weave allows you to optimize your code by embedding C/C++ code within your Python program. The `weave.inline()` function compiles and runs a string of C/C++ code and returns the result to your Python program. The `weave.blitz()` function compiles a NumPy expression into C++ for faster execution.

In [12]:
## C binary search 
from scipy import weave
import numpy as np

def c_int_bsearch(l, n):
    """
    Binary search written in C, using SciPy weave
    """
    
    ## C code for binary search of integers
    c_code = """int val, mid, s = 0;
        int e = l.length() - 1;
        PyObject *py_val;
        while(1)
        {
            if (s > e)
            {
                return_val =  -1;
                break;
            }
            mid =  (s + e) /2;
            val = py_to_int(PyList_GetItem(l, mid), "val");
            if (val < n)
                s = mid + 1;
            else if (val > n)
                e = mid - 1;
            else
            {
                return_val = mid;
                break;
            }
        }
    """
    
    return weave.inline(c_code, ['l','n'])

In [13]:
c_int_bsearch([1,2,3,4,5,6,7,8], 3)

2

In [14]:
## Use weave.blitz() to compile and run a NumPy expression
a = np.random.random_integers(0,100,(3,50))
print a
print 
np_expr = "a[0,:] = (a[0,:] + a[1,:] + a[2,:])/3"
weave.blitz(np_expr)
print a

[[ 72  69  26  99  28  54  51   6  66  27  79  68  13  72  75  28  16  91
   91  69  92  66  56  76  45  62  19  97  67  35   3  83   8  96  97   2
   91  80  82  18  39  15  71  13  83  89  64  49  57  55]
 [ 47  89  55   2  57  56  66  92  23  62  42  77  71  21  32  27  81  32
   75  53  99  47  47  79  59  12  33  56  25  55   9  25  14  81  73  51
   52  37  88  28  40  14  94  69  61  99  74  28  41  13]
 [ 34  29  80  71  81   7  41  44  46   9  37  11  10  23  93  69  49  79
   16  29  55 100  80  27  36  24  75  42  84  55  40  67  12  99  98  53
    1  23   8  59  29  51  10  32  69  65  62   8  27  41]]

[[ 51  62  53  57  55  39  52  47  45  32  52  52  31  38  66  41  48  67
   60  50  82  71  61  60  46  32  42  65  58  48  17  58  11  92  89  35
   48  46  59  35  36  26  58  38  71  84  66  28  41  36]
 [ 47  89  55   2  57  56  66  92  23  62  42  77  71  21  32  27  81  32
   75  53  99  47  47  79  59  12  33  56  25  55   9  25  14  81  73  51
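   52  37  88  28  40  14  94  69  61  99  74  28  41  13]
 [ 34  29  80  71  81   7  41  44  46   9  37  11  10  23  93  69  49  79
   16  29  55 100  80  27  36  24  75  42  84  55  40  67  12  99  98  53
    1  23   8  59  29  51  10  32  69  65  62   8  27  41]]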

## Parallel Processing

Parallel processing improves the performance of a computational task by splitting a large problem into smaller problems that can be solved simultaneously (in parallel). Because physical constraints limit how much faster a single processor can be made, multi-core and multi-processor machines are now a common way to increase computational power.

Many Python modules let you take advantage of multiple processors. The Python wiki maintains a list, which is not exhaustive:

[https://wiki.python.org/moin/ParallelProcessing](https://wiki.python.org/moin/ParallelProcessing)

In [17]:
import multiprocessing as mp
import pp

In [15]:
## Find prime numbers serially
min_prime = 30000
max_prime = 50000

t0 = time.time()
prime_nums = get_primes(min_prime, max_prime)
t1 = time.time()

print "There are %d prime numbers between %d and %d." % (len(prime_nums), min_prime, max_prime)
print "Elapsed time:", t1 - t0

There are 1888 prime numbers between 30000 and 50000.
Elapsed time: 15.1406350136


In [15]:
profile.run("get_primes(min_prime, max_prime)")

         60008 function calls in 15.019 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    20001    0.066    0.000    0.066    0.000 :0(append)
    20002    4.873    0.000    4.873    0.000 :0(range)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
        1    0.156    0.156   15.018   15.018 <ipython-input-1-4d94308a6ebf>:13(get_primes)
    20001    9.922    0.000   14.795    0.001 <ipython-input-1-4d94308a6ebf>:2(isprime)
        1    0.000    0.000   15.018   15.018 <string>:1(<module>)
        1    0.000    0.000   15.019   15.019 profile:0(get_primes(min_prime, max_prime))
        0    0.000             0.000          profile:0(profiler)
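
The profile above shows that almost all of the 15 seconds is spent inside `isprime`, and that about a third of it goes to `range`, which builds a full list of candidate divisors on every call in Python 2. Before parallelizing, it is often worth reducing the work itself. The following is a minimal sketch, assuming the same return contract as `isprime` for `n >= 1`; `isprime_fast` is a hypothetical name, not part of the course code. It tests only odd divisors up to the square root of `n` and uses a `while` loop so that no list is built.

```python
def isprime_fast(n):
    """Returns n if it is prime, otherwise None."""
    if n < 2:
        return None
    if n < 4:
        return n          # 2 and 3 are prime
    if n % 2 == 0:
        return None
    # A composite n always has a divisor no larger than sqrt(n), so stop there
    i = 3
    while i * i <= n:
        if n % i == 0:
            return None
        i += 2            # even divisors were ruled out above
    return n

primes = [n for n in range(30000, 50001) if isprime_fast(n) is not None]
print("There are %d prime numbers between 30000 and 50000." % len(primes))
```

This version performs at most a few hundred divisions per number in this range instead of tens of thousands, and it finds the same 1888 primes reported by the serial run.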




### `multiprocessing` module

`multiprocessing` is a module in Python's standard library that allows you to spawn multiple Python processes. It is an easy way to take advantage of multiple cores on a single machine.

[https://docs.python.org/2/library/multiprocessing.html](https://docs.python.org/2/library/multiprocessing.html)

In [18]:
## Get number of CPUs
mp.cpu_count()

4

In [19]:
## Find prime numbers using parallel processes
possible_primes = range(min_prime,max_prime+1)

t2 = time.time()
pool = mp.Pool(processes=3)
result2 = pool.map(isprime, possible_primes)
prime_nums2 = [n for n in result2 if n is not None]
t3 = time.time()

## Stop accepting new tasks, then wait for the worker processes to exit
pool.close()
pool.join()

print "There are %d prime numbers between %d and %d." % (len(prime_nums2), min_prime, max_prime) 
print "Elapsed time:", t3 - t2

There are 1888 prime numbers between 30000 and 50000.
Elapsed time: 7.95806097984
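
The two elapsed times can be summarized as a speedup (serial time divided by parallel time) and a parallel efficiency (speedup divided by the number of processes). The short calculation below uses the timings reported above; the variable names are introduced here for illustration.

```python
serial_time = 15.1406350136     # seconds, serial run reported above
parallel_time = 7.95806097984   # seconds, Pool(processes=3) run reported above
n_processes = 3

speedup = serial_time / parallel_time
efficiency = speedup / n_processes

print("Speedup: %.2fx" % speedup)
print("Parallel efficiency: %.0f%%" % (100 * efficiency))
```

The speedup is about 1.9x with three processes, not 3x. Plausible contributors are the cost of starting the worker processes (the timed region includes `mp.Pool(...)`), the cost of passing inputs and results between processes, and uneven work per number, since primes take far longer to check than numbers with small factors.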


### `pp` (Parallel Python) module

The Parallel Python module can be used to parallelize across multiple processors on a single machine, and also across multiple nodes of a computing cluster.

[http://www.parallelpython.com/](http://www.parallelpython.com/)

In [20]:
## Create pp job server
job_server = pp.Server(ncpus=3)
jobs = []

t4 = time.time()
## Submit jobs to pp server
for i in possible_primes:
    jobs.append(job_server.submit(isprime, (i,)))
## Wait for all jobs to finish
job_server.wait()
results3 = [job() for job in jobs]
prime_nums3 = [n for n in results3 if n is not None]
t5 = time.time()

## Close the processes created by pp
job_server.destroy()

## Print results
print "There are %d prime numbers between %d and %d." % (len(prime_nums3), min_prime, max_prime) 
print "Elapsed time:", t5 - t4

There are 1888 prime numbers between 30000 and 50000.
Elapsed time: 17.623721838
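
The `pp` run above is slower than the serial run (17.6 s versus 15.1 s). One plausible cause is that each of the 20,001 numbers is submitted as its own job, so per-job overhead outweighs the work done in each job. A common remedy is to batch the work into a few larger jobs. The following is a minimal sketch using only the standard library: `chunk_ranges` and `primes_in_range` are hypothetical helpers, and a serial loop stands in for the parallel step. Each `(start, stop)` pair would be passed as a single job to `job_server.submit` or `pool.map`; that wiring is not shown here and has not been tested. A small range keeps the demonstration fast.

```python
def isprime(n):
    """Returns n if it is prime, otherwise None (trial division)."""
    if n < 2:
        return None
    for i in range(2, n):
        if n % i == 0:
            return None
    return n

def chunk_ranges(lo, hi, n_chunks):
    """Split the closed interval [lo, hi] into at most n_chunks (start, stop) pairs."""
    total = hi - lo + 1
    size = -(-total // n_chunks)   # ceiling division
    return [(start, min(start + size - 1, hi))
            for start in range(lo, hi + 1, size)]

def primes_in_range(bounds):
    """Returns the primes in the closed interval bounds = (start, stop)."""
    start, stop = bounds
    return [n for n in range(start, stop + 1) if isprime(n) is not None]

chunks = chunk_ranges(2, 2000, 8)
# Serial stand-in for the parallel step: each chunk would be one job
per_chunk = [primes_in_range(c) for c in chunks]
# Flatten the per-chunk lists; chunks are in order, so the result is sorted
primes = [p for chunk in per_chunk for p in chunk]
print("%d chunks, %d primes" % (len(chunks), len(primes)))
```

`primes_in_range` takes a single tuple argument so that it can be used directly with interfaces that pass one item per call. Because larger numbers take longer to check, using more chunks than processes helps to balance the load.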


## Final Exam Review

### Topics Covered on Exam

- Basic Linux commands
    - I/O redirection
- Bash
    - Control structures
- XML
- Algorithm Analysis
    - Time complexity
    - Big O notation
- Error Handling
    - try/except
    - assert statements
    - Check input, use print statements, etc.
- BioPython
    - Sequence objects
    - Alignments (clustalW)
    - BLAST
- Numpy
    - Creating arrays
    - Boolean indexing
    - Element-wise operations

## References

- [https://wiki.python.org/moin/ParallelProcessing](https://wiki.python.org/moin/ParallelProcessing)
- [http://docs.scipy.org/doc/scipy-0.14.0/reference/tutorial/weave.html](http://docs.scipy.org/doc/scipy-0.14.0/reference/tutorial/weave.html)
- [https://docs.python.org/2/library/time.html](https://docs.python.org/2/library/time.html)
- [https://docs.python.org/2/library/timeit.html](https://docs.python.org/2/library/timeit.html)
- [https://docs.python.org/2/library/profile.html](https://docs.python.org/2/library/profile.html)

#### Last Updated: 1-Dec-2015