The pure Python code is slow.

Discretization and ODE solving take the following times:

In [1]:
%run solve_ode.py

discretizing...
discretization takes 255.250808 seconds
Solving ODE...
solving ODE takes 39.486365 seconds


# Optimizing breakage and selection functions with Cython

The breakage function (based on a lognormal distribution) and the selection function in `lognormal.py` are shown below.

```python
def lnpdf(x, m, sg):
    num = np.exp(-(np.log(x) - m)**2 / (2 * sg**2))
    den = x * sg * np.sqrt(2 * np.pi)
    return num / den

def lognorm_b(x, y, m, sg):
    assert sg > 0, "sigma must be larger than 0"
   
    num = lnpdf(x, m, sg)
    den = erfc(-(np.log(y) - m) / (np.sqrt(2) * sg))/2

    # In case 'y' is too small compared to 'mu',
    # 'den' can be numerically zero 
    # if it is smaller than the machine precision epsilon 
    # which is not correct theoretically
    if den == 0:
        den = np.finfo(float).eps
    # convert volume to number
    return (y / x)**3 * num / den

def breakagefunc(x, y, k, *args):
    mu = args[0]
    sigma = args[1]
    res = k[1] * lognorm_b(x, y, mu[0], sigma[0])\
        + k[2] * lognorm_b(x, y, mu[1], sigma[1])\
        + (1 - k[1] - k[2]) * lognorm_b(x, y, mu[2], sigma[2])
    return res

def selectionfunc(y, k, *args):
    return k[0] * y**3
```
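
The guard against a zero denominator in `lognorm_b` can be seen in isolation. The sketch below uses the standard library's `math.erfc` in place of the vectorized `erfc` that the module imports, and the values of `y`, `m`, and `sg` are made up for illustration: they are chosen so that `y` lies far below the lognormal location `m`.

```python
import math
import sys

# Hypothetical values: y is far below exp(m) relative to the width sg.
y, m, sg = 1e-3, 5.0, 0.1

# Same expression as the denominator in lognorm_b, with math in place of NumPy.
z = -(math.log(y) - m) / (math.sqrt(2) * sg)
den = math.erfc(z) / 2

print(z)         # a large positive argument
print(den == 0)  # erfc underflows to exactly 0.0 in double precision

# The fallback used in lognorm_b: replace 0 by machine epsilon
# (sys.float_info.epsilon is the same value as np.finfo(float).eps).
if den == 0:
    den = sys.float_info.epsilon
print(den)
```

Without the fallback, the final division in `lognorm_b` would divide by zero for such inputs.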

The benchmark results are:

In [1]:
import benchmark

benchmark.breakage('python')

breakage function takes 26.58 μs.


In [2]:
benchmark.selection('python')

breakage function takes  0.60 μs.
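
The `benchmark` module itself is not shown. A per-call time like the ones above could be measured roughly as follows. This is only a sketch using the standard library's `timeit`; the helper name `time_per_call` and the parameter values in `k` are hypothetical, and `selectionfunc` copies the definition from `lognormal.py`.

```python
import timeit

def selectionfunc(y, k, *args):
    # Same selection function as in lognormal.py above.
    return k[0] * y**3

def time_per_call(func, *args, number=100_000):
    """Return the average wall time per call in microseconds (hypothetical helper)."""
    total = timeit.timeit(lambda: func(*args), number=number)
    return total / number * 1e6

k = [1e-3, 0.3, 0.3]  # made-up parameter values
elapsed = time_per_call(selectionfunc, 2.0, k)
print(f"selection function takes {elapsed:5.2f} μs.")
```

Averaging over many calls matters here because a single call takes well under a microsecond of useful work.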


## Simply cythonize without modification

Compiling was done with `setup.py`:

`$ python setup.py build_ext -i`

In [1]:
import benchmark

benchmark.breakage('cython')

breakage function takes 20.80 μs.


In [2]:
benchmark.selection('cython')

breakage function takes  0.37 μs.


## Use C library for math functions

The Cythonized code in `lognormal_cy.pyx` is shown below.

In [None]:
import numpy as np
from libc.math cimport exp, log, sqrt, erfc

def lnpdf(x, m, sg):
    num = exp(-(log(x) - m)**2 / (2 * sg**2))
    den = x * sg * sqrt(2 * np.pi)
    return num / den

def lognorm_b(x, y, m, sg):
    assert sg > 0, "sigma must be larger than 0"
   
    num = lnpdf(x, m, sg)
    den = erfc(-(log(y) - m) / (sqrt(2) * sg))/2

    if den == 0:
        den = np.finfo(float).eps

    return (y / x)**3 * num / den

In [1]:
import benchmark

benchmark.breakage('cython')

breakage function takes  5.48 μs.


In [2]:
benchmark.selection('cython')

breakage function takes  0.42 μs.


## Static types

```python
cdef double lnpdf(double x, double m, double sg):
    cdef double pi = 3.141592653589793115997963468544185161590576171875
    cdef double num = exp(-(log(x) - m) ** 2 / (2 * sg**2))
    cdef double den = x * sg * sqrt(2 * pi)
    return num / den

cdef double lognorm_b(double x, double y, double m, double sg):
    assert sg > 0, "sigma must be larger than 0"
    cdef double num = lnpdf(x, m, sg)
    cdef double den = erfc(-(log(y) - m) / (sqrt(2) * sg)) / 2
    if den == 0:
        den = np.finfo(float).eps
    return (y / x)**3 * num / den

cpdef double breakagefunc(double x, double y, double[:] k, args):
    cdef double[:] mu = args[0]
    cdef double[:] sigma = args[1]
    cdef double res = k[1] * lognorm_b(x, y, mu[0], sigma[0])\
                    + k[2] * lognorm_b(x, y, mu[1], sigma[1])\
                    + (1 - k[1] - k[2]) * lognorm_b(x, y, mu[2], sigma[2])
    return res

cpdef double selectionfunc(double y, double[:] k, args):
    return k[0] * y**3
```

In [1]:
import benchmark

benchmark.breakage('cython')

breakage function takes  1.65 μs.


In [2]:
benchmark.selection('cython')

breakage function takes  0.52 μs.


Cythonization makes the breakage function about 16 times faster (26.58 μs to 1.65 μs), but the selection function shows no significant improvement (0.60 μs to 0.52 μs).

# Optimizing discretization with Cython

The discretization code in `discretize.py` is shown below.

In [None]:
def den_integrand(x, k, *args):
    return x**3 * selectionfunc(x, k, *args)

def num_integrand(x, y, k, *args):
    return x**3 * selectionfunc(y, k, *args) * breakagefunc(x, y, k, *args)

def breakage_discretize(L, n, k, *args):
    L = np.insert(L, 0, 0)
    res = np.zeros((n, n))

    for i in range(n):
        den, err = quad(den_integrand, L[i], L[i+1], args=(k, *args))
        assert den != 0, 'breakage_discretize: division by zero'
        for j in range(i):
            num, err = dblquad(num_integrand, L[i], L[i+1],
                               lambda x: L[j], lambda x: L[j+1],
                               args=(k, *args))
            Li = (L[i]+L[i+1])/2
            Lj = (L[j]+L[j+1])/2
            res[j, i] = (Li / Lj)**3 * num / den
        num, err = dblquad(num_integrand, L[i], L[i+1],
                           lambda x: L[i], lambda x: x,
                           args=(k, *args))
        res[i, i] = num / den
        
    return res 

def particle_number(x, k, *args): 
    res = quad(lambda a: breakagefunc(a, x, k, *args), 0, x)[0]
    return res

def selection_integrand(x, k, *args):
    return (particle_number(x, k, *args) - 1) * selectionfunc(x, k, *args)

def selection_discretize(L, n, k, breakage_mat, *args):
    res = np.empty(n)
    L = np.insert(L, 0, 0)
    
    for i in range(1, n):
        integ = quad(selection_integrand, L[i], L[i+1], args=(k, *args))[0]
        num = integ / (L[i+1] - L[i])
        sum = np.sum(breakage_mat[:i+1, i])
        den = sum - 1
        assert den != 0, 'selection_discretize: division by zero'
        res[i] = num / den
        
    res[0] = 0.0
    return res

The benchmark results are:

In [1]:
import benchmark

benchmark.discretize('python')

discretization of breakage takes 24.565 s.
discretization of selection takes 12.370 s.


## Simply inserting cythonized lognormal function

Unlike a Python function, a `cdef` function cannot take a starred argument `*args` for a variable number of arguments, so `*args` has to be replaced by a single argument `args`. With this modification, simply inserting the Cythonized functions into `discretize.py` results in:

In [None]:
def den_integrand(x, k, *args):
    return x**3 * selectionfunc(x, k, args)

def num_integrand(x, y, k, *args):
    return x**3 * selectionfunc(y, k, args) * breakagefunc(x, y, k, args)

def breakage_discretize(L, n, k, *args):
    L = np.insert(L, 0, 0)
    res = np.zeros((n, n))

    for i in range(n):
        den, err = quad(den_integrand, L[i], L[i+1], args=(k, *args))
        assert den != 0, 'breakage_discretize: division by zero'
        for j in range(i):
            num, err = dblquad(num_integrand, L[i], L[i+1],
                               lambda x: L[j], lambda x: L[j+1],
                               args=(k, *args))
            Li = (L[i]+L[i+1])/2
            Lj = (L[j]+L[j+1])/2
            res[j, i] = (Li / Lj)**3 * num / den
        num, err = dblquad(num_integrand, L[i], L[i+1],
                           lambda x: L[i], lambda x: x,
                           args=(k, *args))
        res[i, i] = num / den
        
    return res 

def particle_number(x, k, *args): 
    res = quad(lambda a: breakagefunc(a, x, k, args), 0, x)[0]
    return res

def selection_integrand(x, k, *args):
    return (particle_number(x, k, *args) - 1) * selectionfunc(x, k, args)

def selection_discretize(L, n, k, breakage_mat, *args):
    res = np.empty(n)
    L = np.insert(L, 0, 0)
    
    for i in range(1, n):
        integ = quad(selection_integrand, L[i], L[i+1], args=(k, *args))[0]
        num = integ / (L[i+1] - L[i])
        sum = np.sum(breakage_mat[:i+1, i])
        den = sum - 1
        assert den != 0, 'selection_discretize: division by zero'
        res[i] = num / den
        
    res[0] = 0.0
    return res

In [1]:
import benchmark

benchmark.discretize_cython_check()

No error


In [2]:
benchmark.discretize('serial')

discretization of breakage takes  2.018 s.
discretization of selection takes  0.685 s.
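
The change from `*args` to `args` described at the start of this section can be illustrated in plain Python. In the sketch below, `typed_like` is a hypothetical stand-in for a `cdef`/`cpdef` function that accepts the extra arguments as a single tuple, and `wrapper` plays the role of `den_integrand`, which still receives them unpacked from `quad`. All values are made up.

```python
def typed_like(x, k, args):
    # Stand-in for a cpdef function: extra arguments arrive as ONE tuple.
    mu, sigma = args[0], args[1]
    return x * k[0] + mu[0] + sigma[0]

def wrapper(x, k, *args):
    # Called as wrapper(x, k, mu, sigma); forwards the packed tuple `args`
    # instead of re-unpacking it with *args.
    return typed_like(x, k, args)

k = [2.0]
mu, sigma = [1.0, 2.0, 3.0], [0.1, 0.2, 0.3]
extra = (mu, sigma)

# quad(..., args=(k, *extra)) would call the integrand like this:
result = wrapper(3.0, k, *extra)
print(result)
```

Inside `wrapper`, `args` is already the tuple `(mu, sigma)`, which is exactly the form the typed function expects.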


## Add static types

The integrands for `quad` and `dblquad` from `scipy.integrate` are Cythonized as shown below.

Closures inside `cpdef` functions are not supported, so a function that uses a `lambda` cannot be converted to a `cpdef` function.

In [None]:
cpdef double den_integrand(double x, double[:] k, args):
    return x**3 * selectionfunc(x, k, args)

cpdef double num_integrand(double x, double y, double[:] k, args):
    return x**3 * selectionfunc(y, k, args) * breakagefunc(x, y, k, args)

def breakage_discretize(L, n, k, *args):
    L = np.insert(L, 0, 0)
    res = np.zeros((n, n))

    for i in range(n):
        den, err = quad(den_integrand, L[i], L[i+1], args=(k, args))
        assert den != 0, 'breakage_discretize: division by zero'
        for j in range(i):
            num, err = dblquad(num_integrand, L[i], L[i+1],
                               lambda x: L[j], lambda x: L[j+1],
                               args=(k, args))
            Li = (L[i]+L[i+1])/2
            Lj = (L[j]+L[j+1])/2
            res[j, i] = (Li / Lj)**3 * num / den
        num, err = dblquad(num_integrand, L[i], L[i+1],
                           lambda x: L[i], lambda x: x,
                           args=(k, args))
        res[i, i] = num / den
        
    return res 

def particle_number(double x, double[:] k, args): 
    res = quad(lambda a: breakagefunc(a, x, k, args), 0, x)[0]
    return res

cpdef double selection_integrand(double x, double[:] k, args):
    return (particle_number(x, k, args) - 1) * selectionfunc(x, k, args)

def selection_discretize(L, n, k, breakage_mat, *args):
    res = np.empty(n)
    L = np.insert(L, 0, 0)
    
    for i in range(1, n):
        integ = quad(selection_integrand, L[i], L[i+1], args=(k, args))[0]
        num = integ / (L[i+1] - L[i])
        sum = np.sum(breakage_mat[:i+1, i])
        den = sum - 1
        assert den != 0, 'selection_discretize: division by zero'
        res[i] = num / den
        
    res[0] = 0.0
    return res

In [1]:
import benchmark

benchmark.discretize_cython_check()

No error


In [2]:
benchmark.discretize('cython')

discretization of breakage takes  2.358 s.
discretization of selection takes  0.654 s.


Defining the integrands as `cpdef` functions gives no performance improvement. This is probably because `quad` calls its integrand as a Python object, so static types on the integrand bring no benefit.

## Static types for loops

In [None]:
def den_integrand(x, k, *args):
    return x**3 * selectionfunc(x, k, args)

def num_integrand(x, y, k, *args):
    return x**3 * selectionfunc(y, k, args) * breakagefunc(x, y, k, args)

def breakage_discretize(L, Py_ssize_t n, k, *args):
    L = np.insert(L, 0, 0)
    res = np.zeros((n, n))
    
    cdef Py_ssize_t i, j

    for i in range(n):
        den, err = quad(den_integrand, L[i], L[i+1], args=(k, *args))
        assert den != 0, 'breakage_discretize: division by zero'
        for j in range(i):
            num, err = dblquad(num_integrand, L[i], L[i+1],
                               lambda x: L[j], lambda x: L[j+1],
                               args=(k, *args))
            Li = (L[i]+L[i+1])/2
            Lj = (L[j]+L[j+1])/2
            res[j, i] = (Li / Lj)**3 * num / den
        num, err = dblquad(num_integrand, L[i], L[i+1],
                           lambda x: L[i], lambda x: x,
                           args=(k, *args))
        res[i, i] = num / den
        
    return res 

def particle_number(x, k, *args): 
    res = quad(lambda a: breakagefunc(a, x, k, args), 0, x)[0]
    return res

def selection_integrand(x, k, *args):
    return (particle_number(x, k, *args) - 1) * selectionfunc(x, k, args)

def selection_discretize(L, Py_ssize_t n, k, breakage_mat, *args):
    res = np.empty(n)
    L = np.insert(L, 0, 0)
    
    cdef Py_ssize_t i
    
    for i in range(1, n):
        integ = quad(selection_integrand, L[i], L[i+1], args=(k, *args))[0]
        num = integ / (L[i+1] - L[i])
        sum = np.sum(breakage_mat[:i+1, i])
        den = sum - 1
        assert den != 0, 'selection_discretize: division by zero'
        res[i] = num / den
        
    res[0] = 0.0
    return res

In [2]:
import benchmark

benchmark.discretize_cython_check()

No error


In [3]:
benchmark.discretize('cython')

discretization of breakage takes  2.054 s.
discretization of selection takes  0.696 s.


# Parallelize for-loop using Joblib

Since most of the time is spent calling the `quad` and `dblquad` functions of `scipy.integrate`, Cython has only a limited effect on performance. For further optimization, the `for` loops need to be parallelized.

In [None]:
from joblib import Parallel, delayed

def breakage_discretize(L, n, k, *args):
    L = np.insert(L, 0, 0)
    
    def in_for_loop(i):
        temp = np.zeros(n)
        den, err = quad(den_integrand, L[i], L[i+1], args=(k, *args))
        assert den != 0, 'breakage_discretize: division by zero'
        for j in range(i):
            num, err = dblquad(num_integrand, L[i], L[i+1],
                               lambda x: L[j], lambda x: L[j+1],
                               args=(k, *args))
            Li = (L[i]+L[i+1])/2
            Lj = (L[j]+L[j+1])/2
            temp[j] = (Li / Lj)**3 * num / den
        num, err = dblquad(num_integrand, L[i], L[i+1],
                           lambda x: L[i], lambda x: x,
                           args=(k, *args))
        temp[i] = num / den
        
        return temp
    
    r = Parallel(n_jobs=-1)(delayed(in_for_loop)(i) for i in range(n))
    
    res = np.stack(r).T 
        
    return res

def selection_discretize(L, n, k, breakage_mat, *args):
    L = np.insert(L, 0, 0)
    
    def in_for_loop(i):
        integ = quad(selection_integrand, L[i], L[i+1], args=(k, *args))[0]
        num = integ / (L[i+1] - L[i])
        sum = np.sum(breakage_mat[:i+1, i])
        den = sum - 1
        assert den != 0, 'selection_discretize: division by zero'
        return num / den
        
    r = Parallel(n_jobs=-1)(delayed(in_for_loop)(i) for i in range(1, n))
    
    res = np.zeros(n)
    res[1:] = r
    return res

In [1]:
import benchmark

benchmark.discretize_parallel_check()

No error


In [2]:
benchmark.discretize('parallel')

discretization of breakage takes  0.577 s.
discretization of selection takes  0.205 s.
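
The pattern used above (compute one column per loop index independently, then stack the results) does not depend on Joblib specifically. The sketch below shows the same structure with the standard library's `concurrent.futures.ThreadPoolExecutor` swapped in for `joblib.Parallel`, plain lists in place of NumPy arrays, and a made-up `column` function in place of the quadrature calls. Threads will not speed up pure-Python work; the point is only how the per-index results are assembled.

```python
from concurrent.futures import ThreadPoolExecutor

n = 4

def column(i, size):
    # Hypothetical stand-in for in_for_loop(i): fills entries 0..i of column i.
    temp = [0.0] * size
    for j in range(i + 1):
        temp[j] = float(10 * i + j)
    return temp

# Each task is independent, so map() can run them concurrently;
# results come back in the order of the inputs.
with ThreadPoolExecutor() as pool:
    columns = list(pool.map(column, range(n), [n] * n))

# Equivalent of np.stack(r).T: row j, column i holds columns[i][j].
res = [[columns[i][j] for i in range(n)] for j in range(n)]

for row in res:
    print(row)
```

As in `breakage_discretize`, the result is upper triangular because column `i` only has entries for `j <= i`.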


The execution time for discretizing the breakage function dropped from 23.956 s to 0.605 s, and that for the selection function from 11.992 s to 0.209 s.

# Test

In [1]:
%run solve_ode.py

discretizing...
discretization takes  5.857 seconds
Solving ODE...
solving ODE takes 47.361 seconds


The time to evaluate the `discretize` function dropped from 283.45 s to 5.86 s.

Because `discretize` is called repeatedly from `ode_solve`, its result should be cached with `joblib.Memory` so that it is not re-evaluated for the same arguments.

In [None]:
from joblib import Memory

cachedir = './cachedir'
memory = Memory(cachedir, verbose=0)

@memory.cache
def discretize(L, n, p, k, delta, *args):
    ...  # function body unchanged (omitted here)
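
The effect of caching can be sketched without Joblib. The example below swaps in the standard library's `functools.lru_cache`, which keeps results in memory only and requires hashable arguments, whereas `joblib.Memory` persists results to disk and can hash NumPy arrays. The function `slow_discretize` and its arguments are hypothetical placeholders, not the real `discretize`.

```python
import functools

@functools.lru_cache(maxsize=None)
def slow_discretize(n, k):
    # Placeholder for an expensive computation.
    return tuple(k[0] * i**3 for i in range(n))

k = (0.5, 0.3, 0.3)             # a tuple, because lru_cache needs hashable arguments
first = slow_discretize(4, k)
second = slow_discretize(4, k)  # same arguments: served from the cache

info = slow_discretize.cache_info()
print(first)
print(info.hits, info.misses)   # one cache hit, one real evaluation
```

The second call returns without running the function body, which is the behavior wanted when `discretize` is called again with unchanged arguments.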

# Optimizing ODE construction functions

The ODE functions for the breakage PBM in `ode.py` are shown below.

In [None]:
def breakage(number, brk_mat, slc_vec):
    n = len(number)
    R1 = np.zeros(n)
    
    # Mechanism 1 (i=1~n, j=i~n) !!! with index 1~n
    for i in range(n):
        R1[i] = np.sum(brk_mat[i, i:] * slc_vec[i:] * number[i:])
        
    # Mechanism 2 (i=2~n)
    R2 = slc_vec[1:] * number[1:]
    R2 = np.insert(R2, 0, 0.0)
        
    dNdt = R1 - R2

    return dNdt



def breakage_moment(Y, brk_mat, slc_vec, L):
    n = len(Y) - 4
    number = Y[0:n]

    dNdt = breakage(number, brk_mat, slc_vec)

    m0 = np.sum(dNdt)
    m1 = np.sum(L @ dNdt)
    m2 = np.sum(np.power(L, 2) @ dNdt)
    m3 = np.sum(np.power(L, 3) @ dNdt)
    
    dydt = np.append(dNdt,[m0,m1,m2,m3])
    
    return dydt

In [1]:
import benchmark

benchmark.ode('python')

constructing ode takes 330.41 μs.
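
Reading the code above as equations may help. With $N_i$ for `number[i]`, $b_{ij}$ for `brk_mat[i, j]`, $S_j$ for `slc_vec[j]`, and 1-based indices as in the code comments, `breakage` evaluates the following (this is a transcription of the code, not an independent derivation):

$$
\frac{dN_i}{dt} = \sum_{j=i}^{n} b_{ij}\, S_j\, N_j \;-\;
\begin{cases}
0, & i = 1 \\
S_i\, N_i, & i = 2, \dots, n
\end{cases}
$$

`breakage_moment` then appends the rates of the first four moments, with $L_i$ for `L[i]`:

$$
\frac{dm_q}{dt} = \sum_{i=1}^{n} L_i^{\,q}\, \frac{dN_i}{dt}, \qquad q = 0, 1, 2, 3
$$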


## Cythonize

Since constructing the ODE iterates over arrays, Cythonizing it with static types should result in a significant performance improvement.

In [None]:
import numpy as np
from cython cimport boundscheck, wraparound

@boundscheck(False)
@wraparound(False)
def breakage(number, brk_mat, slc_vec):
    cdef Py_ssize_t n = len(number)
    R1 = np.zeros(n)
    R2 = np.zeros(n)
    
    # Memoryview
    cdef double[:] R1_view = R1
    cdef double[:] R2_view = R2
    cdef double[:] n_view = number
    cdef double[:, :] brk_view = brk_mat
    cdef double[:] slc_view = slc_vec
    
    cdef Py_ssize_t i, j
    cdef double sum
    
    # Mechanism 1 (i=1~n, j=i~n) !!! with index 1~n
    for i in range(n):
        sum = 0
        for j in range(i, n):
            sum += brk_view[i, j] * slc_view[j] * n_view[j]
        R1_view[i] = sum
        
    # Mechanism 2 (i=2~n)
    for i in range(1, n):
        R2_view[i] = slc_view[i] * n_view[i]

    return R1 - R2

In [1]:
import benchmark

benchmark.ode_check()

No error


In [2]:
benchmark.ode('cython')

constructing ode takes  6.18 μs.


## Parallelize with prange

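Note that `sum += ...` does not work inside a `prange` loop: Cython treats a variable updated with an in-place operator as a reduction variable, which cannot also be assigned or read in the loop body. Use `sum = sum + ...` instead, which makes `sum` private to each thread.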

In [None]:
from cython.parallel import prange

@boundscheck(False)
@wraparound(False)
def breakage(number, brk_mat, slc_vec):
    cdef Py_ssize_t n = len(number)
    R1 = np.zeros(n)
    R2 = np.zeros(n)
    
    # Memoryview
    cdef double[:] R1_view = R1
    cdef double[:] R2_view = R2
    cdef double[:] n_view = number
    cdef double[:, :] brk_view = brk_mat
    cdef double[:] slc_view = slc_vec
    
    cdef Py_ssize_t i, j
    cdef double sum
    
    for i in prange(n, nogil=True):
        sum = 0
        for j in range(i, n):
            sum = sum + brk_view[i, j] * slc_view[j] * n_view[j]
        R1_view[i] = sum
        R2_view[i] = slc_view[i] * n_view[i]
        
    R2_view[0] = 0.0

    return R1 - R2

In [1]:
import benchmark

benchmark.ode_check()

No error


In [2]:
benchmark.ode('parallel')

constructing ode takes 13.66 μs.


Parallelizing hurts performance here (13.66 μs versus 6.18 μs for the serial Cython version). The overhead of parallelization outweighs the gain from splitting such small jobs.

# Test

In [4]:
%run solve_ode.py

discretization takes 0.004441 seconds
Solving ODE...
solving ODE takes 2.283300 seconds


Discretization took only 0.004 seconds since it was cached.

The time for solving the ODE dropped from 44.1 to 2.28 seconds.

# Phi construction

The code for constructing $\phi$ in `phi.py` is shown below. It uses `breakage` from the `ode_cy` module.

In [None]:
import numpy as np
from ode_cy import breakage

def phi_breakage(z, dbs, n, p, delta):
    # dbs: discretized breakage and selection functions
    z = z.astype(float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent
    y = z[0:n]
    J = z[n:].reshape((p, n)).transpose()
    phiz = np.empty(n * (p+1))
    dfdy = np.empty((n, n))
    dfdk = np.empty((p, n))
    
    for i in range(n):
        yr = y.copy()
        
        yl = y.copy()
        yr[i] += delta
        yl[i] -= delta
        dfdy[i] = (breakage(yr, dbs[0], dbs[1]) - \
                   breakage(yl, dbs[0], dbs[1])) / (2 * delta)
    dfdy = dfdy.transpose()
    
    for i in range(p):
        dfdk[i] = (breakage(y, dbs[2][i], dbs[3][i]) - \
                   breakage(y, dbs[4][i], dbs[5][i])) / (2 * delta)
    dfdk = dfdk.transpose()
    
    dJdt = dfdy @ J + dfdk
    phiz[0:n] = breakage(y, dbs[0], dbs[1])
    phiz[n:] = dJdt.transpose().flatten()
    return phiz

In [1]:
import benchmark

benchmark.phi('python')

constructing phi takes  1.20 ms.
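
The two loops in `phi_breakage` are central finite differences: row `i` of `dfdy` approximates the derivative of `breakage` with respect to `y[i]` using a step `delta` in each direction, and the result is transposed afterwards. A minimal standard-library sketch of the same scheme follows, applied to a made-up linear function `f` whose Jacobian is known exactly (the matrix `A` and all values are illustrative assumptions).

```python
def f(y):
    # Hypothetical test function f(y) = A @ y with a fixed 2x2 matrix A.
    A = [[1.0, 2.0], [0.0, 3.0]]
    return [sum(a * v for a, v in zip(row, y)) for row in A]

def central_jacobian(func, y, delta):
    n = len(y)
    rows = []
    for i in range(n):
        yr = list(y)
        yl = list(y)
        yr[i] += delta
        yl[i] -= delta
        fr, fl = func(yr), func(yl)
        # Row i holds d f / d y_i, as in the dfdy[i] assignment above.
        rows.append([(a - b) / (2 * delta) for a, b in zip(fr, fl)])
    # Transpose so that entry [r][c] is d f_r / d y_c.
    return [[rows[c][r] for c in range(n)] for r in range(len(rows[0]))]

J = central_jacobian(f, [1.0, 1.0], delta=1e-6)
print(J)  # close to A, up to rounding error
```

Each Jacobian costs `2 * n` evaluations of the function, which is why the speed of `breakage` dominates the cost of building $\phi$.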


## Cythonize

In [None]:
def phi_breakage(z, dbs, Py_ssize_t n, Py_ssize_t p, double delta):
    # dbs: discretized breakage and selection functions
    z = z.astype(float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent
    y = z[0:n]
    J = z[n:].reshape((p, n)).transpose()
    phiz = np.empty(n * (p + 1))
    dfdy = np.empty((n, n))
    dfdk = np.empty((p, n))
    
    Y = np.tile(y, [n, 1])
    Yr = Y + np.eye(n) * delta
    Yl = Y - np.eye(n) * delta
    
    # Memoryview
    cdef double[:, :] brk_view = dbs[0]
    cdef double[:] slc_view = dbs[1]
    cdef double[:, :, :] brk_view_r = dbs[2]
    cdef double[:, :] slc_view_r = dbs[3]
    cdef double[:, :, :] brk_view_l = dbs[4]
    cdef double[:, :] slc_view_l = dbs[5]
    cdef double[:] y_view = y
    cdef double[:, :] Yr_view = Yr
    cdef double[:, :] Yl_view = Yl
    cdef double[:, :] dfdy_view = dfdy
    cdef double[:, :] dfdk_view = dfdk
    
    cdef double[:] temp1 = np.empty(n)
    cdef double[:] temp2 = np.empty(n)
    
    cdef Py_ssize_t i, j
    
    for i in range(n):
        temp1 = breakage(Yr_view[i], brk_view, slc_view)
        temp2 = breakage(Yl_view[i], brk_view, slc_view)
        for j in range(n):
            dfdy_view[i, j] = (temp1[j] - temp2[j]) / (2 * delta)
            
    dfdy = dfdy.transpose()
    
    for i in range(p):
        temp1 = breakage(y_view, brk_view_r[i], slc_view_r[i])
        temp2 = breakage(y_view, brk_view_l[i], slc_view_l[i])
        for j in range(n):
            dfdk_view[i, j] = (temp1[j] - temp2[j]) / (2 * delta)
            
    dfdk = dfdk.transpose()
    
    dJdt = dfdy @ J + dfdk
    phiz[0:n] = breakage(y_view, brk_view, slc_view)
    phiz[n:] = dJdt.transpose().flatten()
    return phiz

In [1]:
import benchmark

benchmark.phi_check()

No error


In [2]:
benchmark.phi('cython')

constructing phi takes  0.96 ms.


## Defining ODE constructor as cdef function
The ODE constructor function `breakage`, defined as a `cdef` function in `ode_cy_cdef.pyx`, is shown below.

In [None]:
from cython import boundscheck, wraparound

@boundscheck(False)
@wraparound(False)
cdef double[:] breakage(double[:] number, double[:, :] brk_mat, double[:] slc_vec):
    cdef Py_ssize_t n = len(number)
    
    # Memoryview
    cdef double[:] dndt = np.zeros(n)
    
    cdef Py_ssize_t i, j
    cdef double sum
    
    sum = 0
    for j in range(n):
        sum += brk_mat[0, j] * slc_vec[j] * number[j]
    dndt[0] = sum
    
    for i in range(1, n):
        sum = 0
        for j in range(i, n):
            sum += brk_mat[i, j] * slc_vec[j] * number[j]
        dndt[i] = sum - slc_vec[i] * number[i]

    return dndt

To use a `cdef` function from another Cython module, that module needs a declaration file (`.pxd`). `ode_cy_cdef.pxd` is shown below.

In [None]:
cdef double[:] breakage(double[:] number, double[:, :] brk_mat, double[:] slc_vec)

The `cdef` function can then be imported with `cimport`:

In [None]:
from ode_cy_cdef cimport breakage

In [1]:
import benchmark

benchmark.phi_check()

No error


In [2]:
benchmark.phi('cython')

constructing phi takes  0.48 ms.


Cythonization decreased the phi construction time from 1.2 to 0.48 ms.

## Parallelize by Cython with nogil

In [None]:
from cython cimport boundscheck, wraparound

@boundscheck(False)
@wraparound(False)
cdef void breakage(double[:] dndt, double[:] number, double[:, :] brk_mat, double[:] slc_vec) nogil:
    cdef Py_ssize_t n = len(number)
    
    cdef Py_ssize_t i, j
    cdef double sum
    
    # Mechanism 1 (i=1~n, j=i~n) !!! with index 1~n
    for i in range(n):
        sum = 0
        for j in range(i, n):
            sum += brk_mat[i, j] * slc_vec[j] * number[j]
        dndt[i] = sum
        
    # Mechanism 2 (i=2~n)
    for i in range(1, n):
        dndt[i] -= slc_vec[i] * number[i]

        

from cython.parallel import prange 

@boundscheck(False)
@wraparound(False)
def phi_breakage(breakage, z, dbs, Py_ssize_t n, Py_ssize_t p, double delta):
    # dbs: discretized breakage and selection functions
    z = z.astype(float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent
    y = z[0:n]
    J = z[n:].reshape((p, n)).transpose()
    phiz = np.empty(n * (p + 1))
    dfdy = np.empty((n, n))
    dfdk = np.empty((p, n))
    
    Y = np.tile(y, [n, 1])
    Yr = Y + np.eye(n) * delta
    Yl = Y - np.eye(n) * delta
    
    # Memoryview
    cdef double[:, :] brk_mat = dbs[0]
    cdef double[:] slc_vec = dbs[1]
    cdef double[:, :, :] brk_mat_r = dbs[2]
    cdef double[:, :] slc_vec_r = dbs[3]
    cdef double[:, :, :] brk_mat_l = dbs[4]
    cdef double[:, :] slc_vec_l = dbs[5]
    cdef double[:] yv = y
    cdef double[:, :] Yrv = Yr
    cdef double[:, :] Ylv = Yl
    cdef double[:, :] dfdyv = dfdy
    cdef double[:, :] dfdkv = dfdk
    cdef double[:] dndt = phiz[:n]
    
    cdef double[:] temp1 = np.empty(n)
    cdef double[:] temp2 = np.empty(n)
    
    cdef Py_ssize_t i, j
    
    for i in prange(n, nogil=True):
        # Index the typed memoryviews: NumPy arrays cannot be indexed without the GIL
        breakage(temp1, Yrv[i], brk_mat, slc_vec)
        breakage(temp2, Ylv[i], brk_mat, slc_vec)
        for j in range(n):
            dfdyv[i, j] = (temp1[j] - temp2[j]) / (2 * delta)
            
    dfdy = dfdy.transpose()
    
    for i in prange(p, nogil=True):
        breakage(temp1, yv, brk_mat_r[i], slc_vec_r[i])
        breakage(temp2, yv, brk_mat_l[i], slc_vec_l[i])
        for j in range(n):
            dfdkv[i, j] = (temp1[j] - temp2[j]) / (2 * delta)
            
    dfdk = dfdk.transpose()
    
    dJdt = dfdy @ J + dfdk
    breakage(dndt, y, dbs[0], dbs[1])
    phiz[n:] = dJdt.transpose().flatten()
    return phiz

In [3]:
import benchmark

benchmark.phi_parallel_check()

No error


In [4]:
benchmark.phi('parallel')

constructing phi takes  0.29 ms.


Parallelization decreased the phi construction time from 0.48 to 0.29 ms.

## Test

In [6]:
%run solve_ode.py

discretization takes  0.001 seconds
Solving ODE...
solving ODE takes  0.672 seconds


Time for solving ODE decreased from 2.28 to 0.67 s.

Time for discretization: 255 -> 5.86 seconds.

Time for ODE solving: 39 -> 0.67 seconds.
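
The overall speedup factors follow from the timings printed by the first `%run solve_ode.py` output and by the later runs in this notebook. A short calculation, using only those reported numbers:

```python
# Timings in seconds, copied from the %run outputs above:
# (pure Python, optimized). The discretization time is the uncached one.
timings = {
    "discretization": (255.250808, 5.857),
    "ODE solving": (39.486365, 0.672),
}

speedups = {name: before / after for name, (before, after) in timings.items()}
for name, factor in speedups.items():
    print(f"{name}: {factor:.0f}x faster")
```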