# Improving code performance in Python 

 - Homework: reading, think of questions for the guest speakers, work on project. Sign up for a 1-1 meeting on Monday if you want.

 - Friday: Guest speakers

 - Monday: project check-in


## 1. Parallel Processing

As with many other languages, Python does have some support for running commands in parallel. This type of programming comes in a few flavors, including multi-threading, GPU-based parallelism, and MPI. GPU computing (passing off calculations to graphics cards) can be useful for a variety of tasks, but does require specialized hardware and so we will not pursue that idea here. MPI (message passing interface) parallelism is also useful in a high performance computing setting where computing clusters are involved, so we will also forego this.

Parallelism is somewhat less efficient in Python because of how the interpreter is designed. The standard interpreter (CPython) has a 'global interpreter lock' (GIL), which allows only one thread at a time to execute Python code. Parallel libraries in Python often work around this by starting multiple Python processes. This structure works well for "trivially parallelizable" tasks, but becomes more cumbersome as the complexity of code increases; it isn't too difficult to bump up against the limitations of this approach.
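
As a minimal sketch of the effect of the global interpreter lock (GIL), the standard-library example below runs a CPU-bound, pure-Python loop four times in serial and then on four threads. The function names and the loop length are illustrative choices, not part of the original notes. On a standard CPython build, the threaded version usually takes about as long as the serial one, because only one thread can execute Python code at a time.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def count_down(n):
    """CPU-bound pure-Python loop; it holds the GIL while running."""
    while n > 0:
        n -= 1
    return n

def timed(label, func):
    """Run func(), print the elapsed wall time, and return its result."""
    start = time.perf_counter()
    result = func()
    print(f"{label}: {time.perf_counter() - start:.2f} s")
    return result

N = 1_000_000  # loop length per task (arbitrary choice)

# Four tasks, one after another.
serial = timed("serial", lambda: [count_down(N) for _ in range(4)])

# The same four tasks, spread over four threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = timed("4 threads", lambda: list(pool.map(count_down, [N] * 4)))
```

Exact timings depend on the machine and the Python build, so none are quoted here.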

This kind of parallelism is available through a couple of modules. Here we will look at functionality contained within [joblib.Parallel](https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html).

In [16]:
from joblib import Parallel, delayed
?Parallel

So, we can call `Parallel`, and supply it with a function to iterate over. We can use one of the `%time` or `%timeit` [magic commands](https://ipython.readthedocs.io/en/stable/interactive/magics.html) to get an idea of how fast execution times are.

In [19]:
import numpy as np
from time import sleep

# Sleep for 1 second, 6 times.
print("Sleeping for 6 seconds with one thread...")
%time [sleep(1) for i in range(6)]

# 2 threads sleep for a total 6 seconds, so this should take 3 seconds to run.
print("\n\nSleeping for 6 seconds across 2 threads...")
%time Parallel(n_jobs=2)(delayed(sleep)(1) for i in range(6))

# 3 threads sleep for a total 6 seconds, so this should take 2 seconds to run?
print("\n\nSleeping for 6 seconds across 3 threads...")
%time Parallel(n_jobs=3)(delayed(sleep)(1) for i in range(6)) # n_jobs = threads

Sleeping for 6 seconds with one thread...
CPU times: user 26.9 ms, sys: 3.19 ms, total: 30.1 ms
Wall time: 6.01 s


Sleeping for 6 seconds across 2 threads...
CPU times: user 24.6 ms, sys: 2.05 ms, total: 26.7 ms
Wall time: 3.04 s


Sleeping for 6 seconds across 3 threads...
CPU times: user 22.6 ms, sys: 12.2 ms, total: 34.8 ms
Wall time: 2.22 s


[None, None, None, None, None, None]
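
The `delayed(sleep)(1)` syntax can look odd at first: it records a function call without running it, so that the call can be handed to a worker later. The sketch below imitates that pattern with the standard library only. `my_delayed` and `run_parallel` are hypothetical stand-ins written for illustration, not joblib's actual implementation, and they use threads rather than joblib's default backend.

```python
from concurrent.futures import ThreadPoolExecutor
from time import perf_counter, sleep

def my_delayed(func):
    """Hypothetical stand-in for joblib.delayed: capture a call without running it."""
    def capture(*args, **kwargs):
        return (func, args, kwargs)
    return capture

def run_parallel(tasks, n_jobs):
    """Hypothetical stand-in for Parallel(n_jobs=...)(tasks), using threads."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        futures = [pool.submit(f, *a, **kw) for f, a, kw in tasks]
        # Collect results in the order the tasks were submitted.
        return [fut.result() for fut in futures]

start = perf_counter()
results = run_parallel((my_delayed(sleep)(0.2) for _ in range(6)), n_jobs=3)
elapsed = perf_counter() - start
print(results, f"{elapsed:.2f} s")
```

Six sleeps of 0.2 seconds spread over three workers should finish in roughly two rounds rather than six, mirroring the timings above.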

How many CPUs do we actually have?

In [18]:
import psutil as ps
print("# local CPUs is:", ps.cpu_count())

# local CPUs is: 2


So more workers than CPUs can run at once (sleeping uses almost no CPU time), but for real computations this is not necessarily efficient.

The `Parallel` call also returns the output of each function call, in order:

In [20]:
[np.sqrt(i) for i in range(6)]

[0.0, 1.0, 1.4142135623730951, 1.7320508075688772, 2.0, 2.23606797749979]

In [21]:
Parallel(n_jobs=2)(delayed(np.sqrt)(i) for i in range(6))

[0.0, 1.0, 1.4142135623730951, 1.7320508075688772, 2.0, 2.23606797749979]

Below we try a slightly more complicated example.

In [28]:
# As a slightly more complicated example, let's integrate a set of special functions.
from scipy.integrate import quad
from scipy.special import spherical_jn

def bessel_integral(n) :
  return quad(lambda z: spherical_jn(n, z)/z, 0, np.inf, limit=10000)

# This should take a few seconds.
print("Running with one thread...")
%time values = [ bessel_integral(i) for i in range(1, 9) ]

# We might expect this to take half the time if run on 2 cores.
print("\n\nRunning with 2 threads now...")
%time Parallel(n_jobs=2)(delayed(bessel_integral)(i) for i in range(1, 9))

Running with one thread...


  the requested tolerance from being achieved.  The error may be 
  underestimated.
  


CPU times: user 3.25 s, sys: 0 ns, total: 3.25 s
Wall time: 3.27 s


Running with 2 threads now...
CPU times: user 44.5 ms, sys: 1.71 ms, total: 46.2 ms
Wall time: 3.08 s


[(0.7853982627819827, 5.079640384986206e-07),
 (0.3333336924982387, 5.888876016268974e-07),
 (0.19634944168331162, 5.385902947241394e-07),
 (0.13333298120149809, 5.671209072632966e-07),
 (0.09817487142158836, 1.2012872389194351e-06),
 (0.07619081841751907, 5.363645518841054e-07),
 (0.06135839608844671, 2.875120197823322e-07),
 (0.05079332019865245, 5.000602467386539e-07)]
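
As a sanity check on these numbers, the sketch below compares them with a closed-form expression for $\int_0^\infty j_n(z)/z\,dz$, namely $\frac{\sqrt{\pi}}{4}\,\Gamma(n/2)/\Gamma((n+3)/2)$. This formula is an assumption brought in for illustration (a standard result for integrals of spherical Bessel functions); it is not part of the original notes. The numerical values are copied, rounded, from the output above.

```python
from math import gamma, pi, sqrt

def closed_form(n):
    """Assumed analytic value of the integral of j_n(z)/z over (0, inf), n >= 1."""
    return sqrt(pi) / 4 * gamma(n / 2) / gamma((n + 3) / 2)

# First element of each (value, error estimate) pair printed above, rounded.
numerical = [0.7853983, 0.3333337, 0.1963494, 0.1333330,
             0.0981749, 0.0761908, 0.0613584, 0.0507933]

for n, num in enumerate(numerical, start=1):
    print(n, f"{closed_form(n):.7f}", f"{num:.7f}")

# Largest disagreement between the two columns.
max_diff = max(abs(closed_form(n) - num)
               for n, num in enumerate(numerical, start=1))
```

The two columns agree to within about the error estimates that `quad` reports.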

### Quick Exercise

Recall the first midterm where we looked at the Eigenvalues of random, symmetric matrices. Suppose we were unsatisfied with the results from the single matrices we considered there, and instead we wanted to look at the distribution of Eigenvalues of many matrices.

Below, write a function that generates a $1000\times1000$ matrix with random elements on the interval (0, 1), symmetrizes it, and returns the Eigenvalues. Recall we can extract the symmetric part of a matrix M as $(M + M^{\rm T})/2$.

Compute the Eigenvalues of 10 of these matrices. Is the calculation significantly faster when using multiple threads?

In [29]:
import scipy.linalg as la

def rand_eigvals(n) :
  mat = np.random.rand(n, n)
  return np.real(la.eigvals(0.5*(mat+mat.T)))

print("Generating matrices in serial.")
%time evs=[rand_eigvals(1000) for _ in range(10)]

print("\n\nGenerating matrices in parallel.")
evs2 = Parallel(n_jobs=2, verbose=5)(delayed(rand_eigvals)(1000) for _ in range(10))

Generating matrices in serial.
CPU times: user 18.1 s, sys: 8.32 s, total: 26.4 s
Wall time: 13.5 s


Generating matrices in parallel.


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  10 out of  10 | elapsed:    8.9s remaining:    0.0s
[Parallel(n_jobs=2)]: Done  10 out of  10 | elapsed:    8.9s finished
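
To put a number on "significantly faster", the short sketch below computes the speedup and parallel efficiency from the two wall times reported above. The variable names are illustrative.

```python
serial_wall = 13.5    # seconds, wall time of the serial run above
parallel_wall = 8.9   # seconds, elapsed time reported by joblib above
n_workers = 2

speedup = serial_wall / parallel_wall
efficiency = speedup / n_workers  # 1.0 would be perfect scaling
print(f"speedup: {speedup:.2f}x, parallel efficiency: {efficiency:.0%}")
```

Note that the serial run already reports more CPU time (26.4 s) than wall time (13.5 s), which may indicate that the underlying linear algebra library was using more than one core on its own; that would limit what extra workers can add.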


## 2. Cython (Numba)

We have a few options within Python to further improve and optimize code performance. Python is, unfortunately, not one of the most efficient languages. A quick comparison of algorithms in Python to other languages will show this. The comparison below is of algorithms written naively.

![benchmarks.svg](https://julialang.org/assets/benchmarks/benchmarks.svg)

This is, to an extent, not an entirely fair comparison: various Python modules can and will pass function calls to highly efficient code written in e.g. C or Fortran. Such is the case with many routines within numpy and scipy that we looked at during the semester.

Nevertheless, other languages can take advantage of *compilers* to translate high-level code you may have written to low-level machine code. Modern compilers can heavily optimize code during this process, restructuring it, and even optimizing it so it can run more efficiently on specific hardware.

One extension exists that allows you to, essentially, compile Python code, or to mix Python and C code: Cython. We can load the Cython extension in a Jupyter notebook as follows,


In [30]:
%load_ext Cython

If you would like to do this on your own computer, you can install Cython locally. You can also use Cython outside of a notebook environment, but usage will be more complicated.

Below is a simple function which performs some basic multiplication.

In [31]:
def test(x):
    y = 1
    for i in range(1, x+1):
        y *= i
    return y

In [42]:
%timeit -n 2000 test(20)

2000 loops, best of 5: 1.58 µs per loop


We will need to prefix a Cython cell with the `%%cython` command.
Below we do so, similarly defining a cython function that performs the same task. We can add the `-a` option to see what has been optimized.

In [36]:
%%cython
def test1(x):
    y = 1
    for i in range(1, x+1):
        y *= i
    return y

In [43]:
%timeit -n 2000 test1(20)

2000 loops, best of 5: 947 ns per loop


That doesn't seem to have helped much.
However, within Cython, we can also declare the types of variables.

In [46]:
%%cython
def test2(int x): # The argument x will be an integer.
    cdef long long y = 1 # y is a 64-bit integer; 20! would overflow a 32-bit int.
    cdef int i # i will also be an integer.
    for i in range(1, x+1):
        y *= i
    return y

In [44]:
%timeit -n 2000 test2(20)

2000 loops, best of 5: 150 ns per loop
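
One caveat with C types is that they have a fixed width, unlike Python integers. The function above computes a factorial, and 20! is larger than a 32-bit signed integer can hold. The sketch below uses `ctypes` to show what survives when the exact value is stored in 32 and 64 bits; it illustrates typical wraparound behavior and is not a statement about what any particular C compiler will do.

```python
import ctypes
import math

exact = math.factorial(20)  # Python integers have arbitrary precision

# ctypes integer types do no overflow checking: only the low bits are kept.
as_int32 = ctypes.c_int32(exact).value  # typical width of a C `int`
as_int64 = ctypes.c_int64(exact).value  # width of a C `long long`

print("exact :", exact)
print("32-bit:", as_int32, "| matches exact:", as_int32 == exact)
print("64-bit:", as_int64, "| matches exact:", as_int64 == exact)
print("21! fits in 64 bits:", math.factorial(21) <= 2**63 - 1)
```

So a typed version needs a 64-bit integer to return the right answer for `x = 20`, and even that is not enough for `x = 21`.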


### Quick exercise

The Fibonacci numbers are defined through the recurrence relation,
$$ F_i = F_{i-1} + F_{i-2}\,,$$
with $F_0 = 0$ and $F_1 = 1$.

Below write a function to compute the $n$th Fibonacci number. You can do this using e.g. a loop or a recursive function. You should use floating point numbers instead of integers for this; the Fibonacci numbers grow very quickly.

Compute the 100th Fibonacci number, and time how fast your code executes. See if you can compile your function with Cython, and improve its speed.

In [47]:
def python_fib(n):
    a = 0.0
    b = 1.0
    for i in range(n):
        tmp = a
        a = a + b
        b = tmp
    return a

%timeit -n 200 python_fib(100)

The slowest run took 4.59 times longer than the fastest. This could mean that an intermediate result is being cached.
200 loops, best of 5: 9.35 µs per loop


In [51]:
%%cython
def cython_fib(int n):
    cdef double a = 0
    cdef double b = 1
    cdef double tmp
    for i in range(n):
        tmp = a
        a = a + b
        b = tmp
    return a

In [52]:
%timeit -n 200 cython_fib(100)

200 loops, best of 5: 261 ns per loop
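
The exercise asks for floating point numbers because the 100th Fibonacci number does not fit in a 64-bit integer. The sketch below compares an exact computation using Python's arbitrary-precision integers with the floating point version; the function names are illustrative.

```python
def fib_int(n):
    """Exact Fibonacci number using Python's arbitrary-precision integers."""
    a, b = 0, 1
    for _ in range(n):
        a, b = a + b, a
    return a

def fib_float(n):
    """Same recurrence with floats, as a typed C double would compute it."""
    a, b = 0.0, 1.0
    for _ in range(n):
        a, b = a + b, a
    return a

exact = fib_int(100)
approx = fib_float(100)
# int(approx) converts the float exactly, so the difference is exact too.
rel_err = abs(int(approx) - exact) / exact

print(exact)
print(approx)
print("fits in a signed 64-bit integer:", exact <= 2**63 - 1)
print("relative error of the float result:", rel_err)
```

The float result is only approximate, but its relative error stays close to machine precision.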


### Numba

Another option for compiling code is numba. To compile a function, the basic syntax looks something like

```
from numba import jit
compiled_function = jit(uncompiled_function)
```

where `uncompiled_function` is, appropriately, an uncompiled function. See if you can use numba to compile your function; how does its execution speed compare now?

In [53]:
from numba import jit
jfib = jit(python_fib)
%timeit -n 200 jfib(100)

The slowest run took 1834.28 times longer than the fastest. This could mean that an intermediate result is being cached.
200 loops, best of 5: 349 ns per loop
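
The "slowest run took 1834.28 times longer" message is most likely the first call, during which Numba compiles the function. The sketch below times the first and second calls separately, using the decorator form `@njit`. The `try`/`except` fallback is an addition so that the example still runs (without any speedup) where Numba is not installed.

```python
import time

try:
    from numba import njit  # Numba's nopython-mode JIT decorator
except ImportError:
    def njit(func):  # fallback: no compilation, just so the sketch runs
        return func

@njit
def fib(n):
    a, b = 0.0, 1.0
    for _ in range(n):
        a, b = a + b, a
    return a

def time_call(n):
    """Wall time of a single call to fib(n)."""
    start = time.perf_counter()
    fib(n)
    return time.perf_counter() - start

first = time_call(100)   # includes compilation when Numba is installed
second = time_call(100)  # reuses the compiled machine code
print(f"first call:  {first * 1e6:.1f} µs")
print(f"second call: {second * 1e6:.1f} µs")
```

When benchmarking compiled functions, it is common to call them once beforehand so that compilation time is not counted.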


## 3. High-performance I/O

A number of highly efficient file formats exist for storing large amounts of data. These include formats such as HDF5, FITS, and others; for smaller amounts of data, less efficient and less structured formats such as CSV may be useful. Python contains routines to interface with all of these formats and many more.

### MySQL

For dynamic datasets, where you want to access or modify specific parts of a large amount of data, database systems exist. One example of such a system is MySQL. SQL ("Structured Query Language") is a language designed for accessing databases; MySQL is one database system that uses this language. Here, we will only look at a few basic commands, but the language is quite powerful, and alternatives to it exist as well (e.g. NoSQL systems). We will need to install a special package for MySQL.

In [54]:
!pip -q install mysql-connector-python

We can then initialize a connection. Below we connect to `ensembldb.ensembl.org`, a public database that contains a large amount of genomic data. We will need to specify the server, user, and a specific database to connect to. Here we will look at a database with information about chimpanzees (*Pan troglodytes*).

In [55]:
import mysql.connector
mydb = mysql.connector.connect(
  host="ensembldb.ensembl.org",
  user="anonymous",
  password="",
  database="pan_troglodytes_core_100_3"
)
mycursor = mydb.cursor(buffered=True)  # buffered: fetch each full result set right away

We can view tables of data within this database by running an appropriate command, `SHOW TABLES`. We could similarly see other databases, `SHOW DATABASES`, but we will just look at the chimpanzee database for now.

In [56]:
mycursor.execute("SHOW TABLES")
for x in mycursor:
  print(x)

('alt_allele',)
('alt_allele_attrib',)
('alt_allele_group',)
('analysis',)
('analysis_description',)
('assembly',)
('assembly_exception',)
('associated_group',)
('associated_xref',)
('attrib_type',)
('biotype',)
('coord_system',)
('data_file',)
('density_feature',)
('density_type',)
('dependent_xref',)
('ditag',)
('ditag_feature',)
('dna',)
('dna_align_feature',)
('dna_align_feature_attrib',)
('exon',)
('exon_transcript',)
('external_db',)
('external_synonym',)
('gene',)
('gene_archive',)
('gene_attrib',)
('genome_statistics',)
('identity_xref',)
('interpro',)
('intron_supporting_evidence',)
('karyotype',)
('map',)
('mapping_session',)
('mapping_set',)
('marker',)
('marker_feature',)
('marker_map_location',)
('marker_synonym',)
('meta',)
('meta_coord',)
('misc_attrib',)
('misc_feature',)
('misc_feature_misc_set',)
('misc_set',)
('object_xref',)
('ontology_xref',)
('operon',)
('operon_transcript',)
('operon_transcript_gene',)
('peptide_archive',)
('prediction_exon',)
('prediction_transc

We can obtain more information about a specific table using the `DESCRIBE` command.

In [57]:
mycursor.execute("DESCRIBE gene")
for x in mycursor:
  print(x)

('gene_id', 'int(10) unsigned', 'NO', 'PRI', None, 'auto_increment')
('biotype', 'varchar(40)', 'NO', '', None, '')
('analysis_id', 'smallint(5) unsigned', 'NO', 'MUL', None, '')
('seq_region_id', 'int(10) unsigned', 'NO', 'MUL', None, '')
('seq_region_start', 'int(10) unsigned', 'NO', '', None, '')
('seq_region_end', 'int(10) unsigned', 'NO', '', None, '')
('seq_region_strand', 'tinyint(2)', 'NO', '', None, '')
('display_xref_id', 'int(10) unsigned', 'YES', 'MUL', None, '')
('source', 'varchar(40)', 'NO', '', None, '')
('description', 'text', 'YES', '', None, '')
('is_current', 'tinyint(1)', 'NO', '', '1', '')
('canonical_transcript_id', 'int(10) unsigned', 'NO', 'MUL', None, '')
('stable_id', 'varchar(128)', 'YES', 'MUL', None, '')
('version', 'smallint(5) unsigned', 'YES', '', None, '')
('created_date', 'datetime', 'YES', '', None, '')
('modified_date', 'datetime', 'YES', '', None, '')


In [58]:
mycursor.execute("SELECT * FROM gene LIMIT 10")
for x in mycursor:
  print(x)

(1, 'Mt_tRNA', 1, 120533, 1, 71, 1, None, 'RefSeq', None, 1, 1, 'ENSPTRG00000042638', 1, datetime.datetime(2012, 11, 8, 14, 22, 14), datetime.datetime(2012, 11, 8, 14, 22, 14))
(2, 'Mt_rRNA', 1, 120533, 72, 1020, 1, None, 'RefSeq', None, 1, 2, 'ENSPTRG00000042646', 1, datetime.datetime(2012, 11, 8, 14, 22, 14), datetime.datetime(2012, 11, 8, 14, 22, 14))
(3, 'Mt_tRNA', 1, 120533, 1021, 1089, 1, None, 'RefSeq', None, 1, 3, 'ENSPTRG00000042654', 1, datetime.datetime(2012, 11, 8, 14, 22, 14), datetime.datetime(2012, 11, 8, 14, 22, 14))
(4, 'Mt_rRNA', 1, 120533, 1090, 2647, 1, None, 'RefSeq', None, 1, 4, 'ENSPTRG00000042645', 1, datetime.datetime(2012, 11, 8, 14, 22, 14), datetime.datetime(2012, 11, 8, 14, 22, 14))
(5, 'Mt_tRNA', 1, 120533, 2648, 2722, 1, None, 'RefSeq', None, 1, 5, 'ENSPTRG00000042644', 1, datetime.datetime(2012, 11, 8, 14, 22, 14), datetime.datetime(2012, 11, 8, 14, 22, 14))
(6, 'protein_coding', 1, 120533, 2725, 3681, 1, 10317784, 'RefSeq', 'mitochondrially encoded NADH
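
To experiment with queries like these without a network connection, the sketch below uses Python's built-in `sqlite3` module in place of MySQL. This is a swapped-in database engine, so some commands differ (SQLite has no `SHOW TABLES`, for example). The table is a hypothetical, heavily simplified version of `gene`, filled with the start and end coordinates of the first four rows shown above.

```python
import sqlite3

# In-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical, heavily simplified version of the `gene` table.
cur.execute("CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, biotype TEXT, "
            "seq_region_start INTEGER, seq_region_end INTEGER)")
rows = [(1, "Mt_tRNA", 1, 71), (2, "Mt_rRNA", 72, 1020),
        (3, "Mt_tRNA", 1021, 1089), (4, "Mt_rRNA", 1090, 2647)]
cur.executemany("INSERT INTO gene VALUES (?, ?, ?, ?)", rows)

# SQLite's rough counterpart of MySQL's SHOW TABLES.
cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
tables = cur.fetchall()
print(tables)

# Filter with WHERE and compute each gene's length in the query itself.
# The `?` placeholder passes the value separately from the SQL text.
cur.execute("SELECT gene_id, seq_region_end - seq_region_start + 1 "
            "FROM gene WHERE biotype = ? ORDER BY gene_id LIMIT 10",
            ("Mt_tRNA",))
trna = cur.fetchall()
print(trna)

conn.close()
```

Selecting only the columns and rows that are needed, rather than `SELECT *`, keeps the amount of data transferred small, which matters more for a remote server than for this in-memory example.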

### Filesystem I/O

We can also compare the performance of writing data to a few file formats. Here we compare core Python file writing with NumPy's formats and with the HDF5 format.

In [62]:
import numpy as np
import h5py

randmat = np.random.rand(1000, 1000)

print("Time taken to save binary data using native python...")
file = open("standard.bin", "wb")
%time file.write(randmat)
file.close()

print("\nTime taken to save ascii data...")
%time np.savetxt("ascii.npz", randmat)

print("\nTime taken to save numpy data...")
%time np.savez("numpy.npz", randmat)

print("\nTime taken to save compressed numpy data...")
%time np.savez_compressed("compressed.npz", randmat)

print("\nTime taken to save HDF5 data...")
f = h5py.File("h5file.h5", "w")
%time dset = f.create_dataset("mydataset", data=randmat)
f.close()

print("\nTime taken to save compressed HDF5 data...")
f = h5py.File("h5compressed.h5", "w")
%time dset = f.create_dataset("mydataset", data=randmat, compression="gzip", compression_opts=9)
f.close()

print("\n")
! ls -lath

Time taken to save binary data using native python...
CPU times: user 383 µs, sys: 5.16 ms, total: 5.54 ms
Wall time: 5.55 ms

Time taken to save ascii data...
CPU times: user 1.03 s, sys: 39.6 ms, total: 1.07 s
Wall time: 1.08 s

Time taken to save numpy data...
CPU times: user 10.5 ms, sys: 9.04 ms, total: 19.6 ms
Wall time: 19.7 ms

Time taken to save compressed numpy data...
CPU times: user 460 ms, sys: 11.9 ms, total: 472 ms
Wall time: 476 ms

Time taken to save HDF5 data...
CPU times: user 880 µs, sys: 6.08 ms, total: 6.96 ms
Wall time: 6.91 ms

Time taken to save compressed HDF5 data...
CPU times: user 247 ms, sys: 9.84 ms, total: 257 ms
Wall time: 256 ms


total 62M
-rw-r--r-- 1 root root 7.3M Apr 26 20:48 h5compressed.h5
-rw-r--r-- 1 root root 7.7M Apr 26 20:48 h5file.h5
-rw-r--r-- 1 root root 7.2M Apr 26 20:48 compressed.npz
-rw-r--r-- 1 root root 7.7M Apr 26 20:48 numpy.npz
-rw-r--r-- 1 root root  24M Apr 26 20:48 ascii.npz
-rw-r--r-- 1 root root 7.7M Apr 26 20:48 standard.b
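
The file sizes above can be understood with a little arithmetic, sketched below using only the standard library. The binary estimate assumes 8 bytes per value (float64), and the text estimate assumes `np.savetxt`'s default `'%.18e'` format plus one separator per value. The last part compresses a sample of random doubles with `zlib` as a stand-in for the gzip-based compression used above; the sample size and seed are arbitrary choices.

```python
import random
import struct
import zlib

n_values = 1000 * 1000

# Raw binary: 8 bytes per float64 value.
binary_bytes = n_values * 8

# Text: '%.18e' formatting plus one delimiter or newline per value.
chars_per_value = len("%.18e" % 0.5) + 1
ascii_bytes = n_values * chars_per_value

print(f"binary: {binary_bytes / 2**20:.1f} MiB")
print(f"ascii:  {ascii_bytes / 2**20:.1f} MiB")

# Random doubles in [0, 1) packed as raw bytes, then compressed at level 9.
random.seed(0)
sample = struct.pack("<100000d", *(random.random() for _ in range(100_000)))
ratio = len(zlib.compress(sample, 9)) / len(sample)
print(f"compressed size / original size: {ratio:.2f}")
```

Random numbers are close to incompressible, which is consistent with the compressed files above being only slightly smaller than the uncompressed ones; structured or repetitive data would usually compress much better.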