# Python Optimization

In this section we discuss Python optimization.  Before optimizing anything, keep the following points in mind:
1. Is your code correct?
2. Do you need to optimize?
3. Do you really need to optimize?
4. Optimization is not parallelization -- parallelization usually comes last.
5. Optimization involves tradeoffs.  Be careful what you wish for.

There are a few steps to optimization:
1. Profile.
2. Profile again.
3. Check the hotspots.
4. Optimize where it pays off: modify your use case, use better algorithms, use builtin functions, use numba, or use pre-compiled code.



## Profiling

The first and most important step in optimization is to figure out which part is slow.  For this you need to profile your code.  Fortunately Python offers some excellent profilers, and Jupyter makes them even easier to use.  Here we will use the magic function %prun.
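Outside a notebook, roughly the same report can be produced with the standard-library `cProfile` and `pstats` modules. The following is a minimal sketch, not part of the N-body code: `slow_sum` and `profile_call` are hypothetical helpers introduced only for illustration.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    # Deliberately naive loop standing in for a hotspot (hypothetical example).
    total = 0
    for k in range(n):
        total += k * k
    return total


def profile_call(func, *args, top=10):
    """Run func(*args) under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # "tottime" sorts by internal time, the ordering shown in the %prun output.
    stats.sort_stats("tottime").print_stats(top)
    return result, stream.getvalue()


result, report = profile_call(slow_sum, 100_000)
print(report)
```

The report has the same columns as the `%prun` output (ncalls, tottime, percall, cumtime), so it can be read the same way.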

In [2]:
import numpy as np 
%matplotlib inline
import matplotlib.pyplot as plt
from IPython import display
N_massive = 20
N_bodies = N_massive
M = np.ones(N_bodies)

def Nbody_derivatives(pos,vel) :
    dpdt = vel
    dvdt = np.zeros(vel.shape)
    for i in range(N_bodies) :
        for j in range(N_bodies) :
            if i == j : 
                continue
            r = np.linalg.norm( pos[j]-pos[i])
            mass = M[j]
            rhat =(pos[j] - pos[i])/r
            dvdt[i] += mass/(r*r)*rhat
        
    return dpdt, dvdt

def initial_conditions() : 
    pos = np.random.random([N_bodies,3])
    vel = np.random.random([N_bodies,3])

    return pos, vel

def run_Nbody_rk2(tend,tframe,dt,derivatives=Nbody_derivatives) :
    p,v = initial_conditions()
    t = 0
    tnext = tframe
    positions = []
    while t<tend :
        while t < tnext :
            delta_t = min(tnext-t,dt)
            dpdt, dvdt = derivatives(p,v) 
            phalf, vhalf = p+dpdt*0.5*delta_t, v+dvdt*0.5*delta_t
            dpdt, dvdt = derivatives(phalf, vhalf)
            p, v = p + dpdt*delta_t, v + dvdt*delta_t
            t += delta_t
        positions.append(p.copy())
        tnext += tframe
    return positions

tframe = 0.01
dt = 0.001
frames = 100
%prun positions = run_Nbody_rk2(frames*tframe, tframe, dt)
from matplotlib.animation import FuncAnimation

def animate(i, positions):
    ax.clear()
    # Get the point from the points list at index i
    pos = positions[i]
    ax.scatter(pos[:,0], pos[:,1], color='green', marker='o')
    # Set the x and y axis to display a fixed range
    ax.set_xlim([-100,100])
    ax.set_ylim([-100,100])
fig, ax = plt.subplots()
ani = FuncAnimation(fig, lambda i : animate(i, positions), frames=len(positions), interval=50, repeat=False)
video = ani.to_html5_video()
html = display.HTML(video)
display.display(html)
plt.close()


 

         7605207 function calls in 6.994 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2000    3.529    0.002    6.981    0.003 3022826614.py:9(Nbody_derivatives)
   760000    1.437    0.000    2.675    0.000 linalg.py:2357(norm)
   760000    0.756    0.000    0.756    0.000 {method 'dot' of 'numpy.ndarray' objects}
   760000    0.376    0.000    3.451    0.000 <__array_function__ internals>:177(norm)
   760000    0.313    0.000    2.987    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
   760000    0.145    0.000    0.145    0.000 {method 'ravel' of 'numpy.ndarray' objects}
   760000    0.139    0.000    0.194    0.000 linalg.py:117(isComplexType)
  1520000    0.119    0.000    0.119    0.000 {built-in method builtins.issubclass}
   760000    0.088    0.000    0.088    0.000 linalg.py:2353(_norm_dispatcher)
   760000    0.078    0.000    0.078    0.000 {built-in method numpy.asarray}
 

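As you can see, %prun reveals what is slow.  Nearly all of the run time is spent in Nbody_derivatives, which computes the accelerations: 6.981 s cumulative out of 6.994 s total.  About half of that (3.451 s cumulative) goes into 760000 calls to np.linalg.norm, each on a tiny 3-element vector, so much of it is per-call overhead rather than arithmetic.  Essentially no time is spent anywhere else.  This is what profiling does for you: it tells you exactly where to look.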

## Optimal python

The first thing we want to do is rewrite the code so that it is more idiomatic (more Pythonic), which here means more numpy-like: replace the explicit loops with array operations.  Let's look at the following.  Starting with the above code, I challenge you to write it so that it is significantly cleaner and faster.

In [None]:
def Nbody_derivatives2(pos,vel) :
    dpdt = vel
    dvdt = np.zeros(vel.shape)
    # write the above in optimal python using numpy.  How fast can you make it.
    return dpdt, dvdt

%prun positions = run_Nbody_rk2(frames*tframe, tframe, dt, derivatives=Nbody_derivatives2)


fig, ax = plt.subplots()
ani = FuncAnimation(fig, lambda i : animate(i, positions), frames=len(positions), interval=50, repeat=False)
video = ani.to_html5_video()
html = display.HTML(video)
display.display(html)
plt.close()


This is a significant speedup, about a factor of 20 over the original loop version.  This is really excellent.
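For reference, here is one possible vectorized form of the acceleration calculation. It is a sketch under stated assumptions, not necessarily the solution that produced the timing above: the function name `nbody_derivatives_vectorized` is hypothetical, and the masses are passed in as an argument instead of being read from the global `M`.

```python
import numpy as np


def nbody_derivatives_vectorized(pos, vel, masses):
    """Accelerations for all bodies at once via broadcasting (illustrative sketch)."""
    # diff[i, j] = pos[j] - pos[i], shape (N, N, 3)
    diff = pos[np.newaxis, :, :] - pos[:, np.newaxis, :]
    # Squared separations, shape (N, N)
    r2 = np.einsum("ijk,ijk->ij", diff, diff)
    # Infinity on the diagonal makes the i == j terms contribute exactly zero.
    np.fill_diagonal(r2, np.inf)
    inv_r3 = r2 ** -1.5
    # dvdt[i] = sum_j masses[j] * diff[i, j] / r_ij**3
    dvdt = np.einsum("ij,ijk->ik", masses * inv_r3, diff)
    return vel, dvdt


# Small random example (values are arbitrary).
rng = np.random.default_rng(0)
pos = rng.random((5, 3))
vel = rng.random((5, 3))
masses = np.ones(5)
dpdt, dvdt = nbody_derivatives_vectorized(pos, vel, masses)
print(dvdt.shape)
```

To use it with `run_Nbody_rk2`, wrap it so that it takes only `pos` and `vel`, for example `lambda p, v: nbody_derivatives_vectorized(p, v, M)`. The price of the vectorization is memory: the intermediate `diff` array has N x N x 3 entries.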

## Hot spot optimization with numba

Recently, compiling Python has become practical.  One noteworthy example is numba, a JIT (just-in-time) compiler that works well with numpy.  Let's try it on the original loop-based code.

It is extremely easy to use.  You just put the decorator @jit before the function to optimize.

There are two modes of operation, nopython=True and nopython=False.  nopython=True produces much faster code, but it supports only a subset of Python and numpy; code that uses unsupported features can only be compiled in the slower nopython=False (object) mode.

In [None]:
from numba import jit,njit

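# Apply the decorator described above; without it the function is never compiled.
@jit(nopython=True)
def Nbody_derivatives3(pos,vel) :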
    dpdt = vel
    dvdt = np.zeros(vel.shape)
    for i in range(N_bodies) :
        for j in range(N_bodies) :
            if i == j : 
                continue
            r = np.linalg.norm( pos[j]-pos[i])
            mass = M[j]
            rhat =(pos[j] - pos[i])/r
            dvdt[i] += mass/(r*r)*rhat
        
    return dpdt, dvdt

%prun positions = run_Nbody_rk2(frames*tframe, tframe, dt, derivatives=Nbody_derivatives3)
fig, ax = plt.subplots()
ani = FuncAnimation(fig, lambda i : animate(i, positions), frames=frames, interval=25, repeat=False)
video = ani.to_html5_video()
html = display.HTML(video)
display.display(html)
plt.close()


This is faster than idiomatic Python, though sometimes only marginally so.  It is still much faster than the original version.  Not bad for a simple @jit.
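Numba tends to do best on plain scalar loops, so one variation worth trying is to write out the vector arithmetic by hand. The sketch below is an assumption-laden illustration, not the notebook's method: `nbody_accel_loops` and `numba_derivs` are hypothetical names, and the masses are passed as an argument because Numba treats global variables as compile-time constants. The `try`/`except` fallback only keeps the example runnable (slowly) when numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs without numba (no speedup in that case).
    def njit(func):
        return func


@njit
def nbody_accel_loops(pos, masses):
    """Pairwise accelerations with explicit scalar loops (illustrative sketch)."""
    n = pos.shape[0]
    dvdt = np.zeros((n, 3))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = pos[j, 0] - pos[i, 0]
            dy = pos[j, 1] - pos[i, 1]
            dz = pos[j, 2] - pos[i, 2]
            r2 = dx * dx + dy * dy + dz * dz
            # mass / r**2 * rhat == mass * (dx, dy, dz) / r**3
            inv_r3 = 1.0 / (r2 * np.sqrt(r2))
            dvdt[i, 0] += masses[j] * dx * inv_r3
            dvdt[i, 1] += masses[j] * dy * inv_r3
            dvdt[i, 2] += masses[j] * dz * inv_r3
    return dvdt


def numba_derivs(pos, vel, masses):
    # Same (dpdt, dvdt) return convention as Nbody_derivatives.
    return vel, nbody_accel_loops(pos, masses)
```

The first call includes the compilation time, so profile a second call when comparing against the other versions.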

## Hot spot optimization with cython

Cython is a superset of the Python language that "converts Python to C" and then compiles the C code to produce a fast extension module.
1. This means that any Python program is also a Cython program.
2. It also means that you can give Cython directives that help it generate faster code.

There are a number of directives, but the most important are the type declarations.  You can declare variables as
1. cdef int -> int in C
2. cdef float or cdef double -> float or double in C
3. cdef int/float/double [:] or [:,:] for arrays -> typed memoryviews, which play the role of int, float, double * or ** in C
These declarations turn Python numerical data, which are objects, into fast native data types.

First we load Cython in Jupyter.

In [None]:
%load_ext cython

Now we compile a Cython program.

In [None]:
%%cython -a
import numpy as np
from cpython cimport array
import array
import math 
def cython_Nbody_derivatives(pos,vel,M) :
    cdef int N_bodies = M.size
    dpdt = vel
    dvdt = np.zeros(vel.shape)
    for i in range(N_bodies) :
        for j in range(N_bodies) :
            if i == j : 
                continue
            r = np.linalg.norm( pos[j]-pos[i])
            mass = M[j]
            rhat =(pos[j] - pos[i])/r
            dvdt[i] += mass/(r*r)*rhat
        
    return dpdt, dvdt


In [None]:
def cython_derivs(pos,vel) :
    return cython_Nbody_derivatives(pos,vel,M)
%prun positions = run_Nbody_rk2(frames*tframe, tframe, dt, derivatives=cython_derivs)
fig, ax = plt.subplots()
ani = FuncAnimation(fig, lambda i : animate(i, positions), frames=frames, interval=25, repeat=False)
video = ani.to_html5_video()
html = display.HTML(video)
display.display(html)
plt.close()


As is, you get no speedup, because every variable except N_bodies is still an untyped Python object.  But if you judiciously use cdef int, cdef double, and cdef double [:,:], you get huge speedups.

The speedup that I got was about 100x.  Can you match it?

## Hot spot optimization with f2py

The final example uses Fortran to optimize the slowest parts.  Why Fortran?  Because Fortran 90 plays extremely well with Python: f2py, which ships with numpy, wraps Fortran subroutines as Python functions.

First we generate the .f90 file.

In [None]:
%%file nbody_derivatives.f90
SUBROUTINE derivs(pos,vel,mass,dpdt,dvdt,n)
    implicit none
    integer, intent(IN) :: n
    double precision, intent(IN), dimension(n,3):: pos, vel
    double precision, intent(IN), dimension(n) :: mass
    double precision, intent(OUT), dimension(n,3):: dpdt, dvdt
!f2py intent(in) n
!f2py intent(in) pos, vel, mass
!f2py intent(out) dpdt, dvdt
!f2py depend(n) mass
    integer :: i, j
    double precision, dimension(3) :: rhat,r
    double precision :: r2
    
    dpdt(:,:) = vel(:,:)
    dvdt(:,:) = 0.
    do i = 1,n
        do j = 1,n
            if( i .eq. j) then
                cycle
            endif
            r(:) = pos(j,:) - pos(i,:)
            r2 = sum(r*r)
            rhat = r/sqrt(r2)
            dvdt(i,:) = dvdt(i,:)+ mass(j)/(r2)*rhat(:)
        enddo
    enddo
    
    return
end subroutine derivs

In [None]:
!f2py -m nbody_derivatives -c nbody_derivatives.f90

In [None]:
import nbody_derivatives as nbd
import importlib
importlib.reload(nbd)

def fortran_derivs(pos,vel) :
    return nbd.derivs(pos,vel,M)
%prun positions = run_Nbody_rk2(frames*tframe, tframe, dt, derivatives=fortran_derivs)
#print(positions)
fig, ax = plt.subplots()
ani = FuncAnimation(fig, lambda i : animate(i, positions), frames=frames, interval=50, repeat=False)
video = ani.to_html5_video()
html = display.HTML(video)
display.display(html)
plt.close()


This is extremely fast, about 600x faster than the original code.  It is so fast that the Python driver loop becomes the limiting factor.  Compiled code in a highly optimized language is extremely powerful.

# Running in Parallel

Students usually think that running in parallel is the way to optimize code.  This is not true.  As you can see above, the real speedups come from Fortran, Cython, numba, and optimized Python.  You should almost always try optimized Python and numba first, as they are the easiest, and only then move on to Cython and Fortran.  Fortran was the fastest here, but it is the most work and requires you to learn a new language.

There are many ways to parallelize.  One way we discussed already is Apache Spark, which allows parallel processing of large distributed data sets; it is extremely valuable in technology companies but overkill for most purposes.  Instead we will focus on a simple approach, the multiprocessing module, which runs on a single node.  With nodes containing upwards of 128 CPU cores, one node is generally plenty for most problems.


In [3]:
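import multiprocessing 
count = 4

def onebody_derivatives(i, pos, vel) :
    # acceleration of body i due to all the other bodies
    dvdt = np.zeros(3)
    for j in range(N_bodies) :
        if i == j : 
            continue
        r = np.linalg.norm( pos[j]-pos[i])
        mass = M[j]
        rhat =(pos[j] - pos[i])/r
        dvdt += mass/(r*r)*rhat
        
    return dvdt

# Create the pool only after the worker function is defined.  With the fork
# start method the workers are copies of the parent at creation time, so a
# function defined later would not exist in them.
pool = multiprocessing.Pool(processes=count)

def parallel_Nbody_derivatives(pos,vel) :
    dpdt = vel
    # one task per body; pos and vel are sent to the workers with every task
    res = pool.starmap(onebody_derivatives, zip(range(N_bodies), [pos] * N_bodies, [vel]*N_bodies))
    dvdt = np.array(res)
    return dpdt, dvdt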

%prun positions = run_Nbody_rk2(frames*tframe, tframe, dt, derivatives=parallel_Nbody_derivatives)
fig, ax = plt.subplots()
ani = FuncAnimation(fig, lambda i : animate(i, positions), frames=frames, interval=50, repeat=False)
video = ani.to_html5_video()
html = display.HTML(video)
display.display(html)
plt.close()


 

         77207 function calls in 5.566 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     8000    5.301    0.001    5.301    0.001 {method 'acquire' of '_thread.lock' objects}
     2000    0.160    0.000    0.160    0.000 {built-in method numpy.array}
        1    0.022    0.022    5.566    5.566 3022826614.py:29(run_Nbody_rk2)
     2000    0.014    0.000    0.045    0.000 pool.py:471(_map_async)
     2000    0.009    0.000    0.009    0.000 {method 'put' of '_queue.SimpleQueue' objects}
     2000    0.008    0.000    0.008    0.000 threading.py:236(__init__)
     2000    0.008    0.000    5.543    0.003 2559947476.py:17(parallel_Nbody_derivatives)
     2000    0.008    0.000    5.313    0.003 threading.py:288(wait)
     2000    0.005    0.000    0.016    0.000 pool.py:747(__init__)
     2000    0.005    0.000    5.321    0.003 threading.py:589(wait)
     2000    0.004    0.000    5.376    0.003 pool.py:369(starmap)
     

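This is the problem with parallelization.  With 4 processes we gained almost nothing: 5.6 s against 7.0 s for the original serial loop, which is far worse than any of the optimizations above.  The profile shows why.  Almost all of the time in the main process (5.3 s) is spent acquiring a lock, that is, waiting for the workers to hand back results.  Every call has to ship the positions to the worker processes and collect the answers, and the amount of work per task (one body) is too small to pay for that overhead.  So parallelization is good if there is no contention *and* the amount of work per core is large.  The second condition is not always required (*cough*, GPUs), but for most of the problems you deal with it is.  Parallelization doesn't usually give you the best performance.  Even here, where the operation is relatively easy to parallelize, the gains are much smaller than those from being more careful about how you approach the problem.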
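One way to raise the amount of work per task is to hand each worker a contiguous block of bodies instead of a single body. The sketch below illustrates the idea only and makes no timing claim: `chunk_accelerations` and `chunked_derivatives` are hypothetical names, and the `map_fn` parameter is an assumption added so that the same code runs serially with the builtin `map` or in parallel with a pool's `map`.

```python
import numpy as np


def chunk_accelerations(task):
    """Accelerations for bodies start..stop-1; one call per task (hypothetical helper)."""
    start, stop, pos, masses = task
    out = np.zeros((stop - start, 3))
    for i in range(start, stop):
        diff = pos - pos[i]                      # (N, 3) separations from body i
        r2 = np.einsum("jk,jk->j", diff, diff)   # squared distances to body i
        r2[i] = np.inf                           # drop the self-interaction
        out[i - start] = (masses * r2 ** -1.5) @ diff
    return out


def chunked_derivatives(pos, vel, masses, n_chunks, map_fn=map):
    """Split the bodies into n_chunks tasks and stitch the results together."""
    n = pos.shape[0]
    edges = np.linspace(0, n, n_chunks + 1).astype(int)
    tasks = [(int(a), int(b), pos, masses) for a, b in zip(edges[:-1], edges[1:])]
    dvdt = np.concatenate(list(map_fn(chunk_accelerations, tasks)))
    return vel, dvdt


# Serial check with the builtin map (values are arbitrary).
rng = np.random.default_rng(2)
pos = rng.random((7, 3))
vel = rng.random((7, 3))
masses = np.ones(7)
dpdt, dvdt = chunked_derivatives(pos, vel, masses, n_chunks=3)
print(dvdt.shape)
```

With a pool, pass `map_fn=pool.map` and `n_chunks=count`, so that each call sends `count` tasks instead of `N_bodies`. Whether this beats the serial code still depends on N: for 20 bodies the communication cost is likely to dominate either way.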