refactored _log_probability function to avoid gil in multithreading #1008

Closed

Conversation


@ghost ghost commented Nov 28, 2022

I noticed that when training a HiddenMarkovModel whose emissions are ExponentialDistribution objects wrapped in an IndependentComponentsDistribution, training was orders of magnitude slower than when using ExponentialDistribution directly. I was able to reproduce the same behaviour with NormalDistribution, so I narrowed it down to IndependentComponentsDistribution.

The slowdown is noticeable only when using multithreading in fit (n_jobs > 1).
I was able to narrow the cause down to the _log_probability function of IndependentComponentsDistribution.
The following version of the function leads to the slow behaviour:

cdef void _log_probability(self, double* X, double* log_probability, int n) nogil:
    cdef int i, j
    cdef double logp

    memset(log_probability, 0, n*sizeof(double))

    for i in range(n):
        for j in range(self.d):
            if self.cython == 1:
                (<Model> self.distributions_ptr[j])._log_probability(X+i*self.d+j, &logp, 1)
            else:
                with gil:
                    python_log_probability(self.distributions[j], X+i*self.d+j, &logp, 1)

            log_probability[i] += logp * self.weights_ptr[j]

But the following is fast:

cdef void _log_probability(self, double* X, double* log_probability, int n) nogil:
    cdef int i, j
    cdef double logp

    memset(log_probability, 0, n*sizeof(double))

    for i in range(n):
        for j in range(self.d):
            if True or self.cython == 1:  # notice the True here
                (<Model> self.distributions_ptr[j])._log_probability(X+i*self.d+j, &logp, 1)
            else:
                with gil:
                    python_log_probability(self.distributions[j], X+i*self.d+j, &logp, 1)

            log_probability[i] += logp * self.weights_ptr[j]

This happens even though self.cython is always 1, and so the else block is never executed.
Moreover, this is also slow:

cdef void _log_probability(self, double* X, double* log_probability, int n) nogil:
    cdef int i, j
    cdef double logp

    memset(log_probability, 0, n*sizeof(double))

    for i in range(n):
        for j in range(self.d):
            if self.cython == 1 or True:  # notice the True here, with the order reversed
                (<Model> self.distributions_ptr[j])._log_probability(X+i*self.d+j, &logp, 1)
            else:
                with gil:
                    python_log_probability(self.distributions[j], X+i*self.d+j, &logp, 1)

            log_probability[i] += logp * self.weights_ptr[j]

I think that branch in the inner loop somehow makes it harder for Cython to generate efficient code.
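One way such a refactor could look, purely as a sketch (the actual change in this commit may differ), is to hoist the self.cython check out of the per-sample loop so that the hot nogil path never reaches a with gil block:

cdef void _log_probability(self, double* X, double* log_probability, int n) nogil:
    cdef int i, j
    cdef double logp

    memset(log_probability, 0, n*sizeof(double))

    if self.cython == 1:
        # Pure C path: no GIL interaction anywhere inside the loops.
        for i in range(n):
            for j in range(self.d):
                (<Model> self.distributions_ptr[j])._log_probability(X+i*self.d+j, &logp, 1)
                log_probability[i] += logp * self.weights_ptr[j]
    else:
        # Python-backed distributions: acquire the GIL once per call
        # instead of once per (i, j) iteration.
        with gil:
            for i in range(n):
                for j in range(self.d):
                    python_log_probability(self.distributions[j], X+i*self.d+j, &logp, 1)
                    log_probability[i] += logp * self.weights_ptr[j]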
Here is a complete reproducible example that elicits the slow behaviour:

from pomegranate.distributions import ExponentialDistribution, IndependentComponentsDistribution
from pomegranate.base import State
from pomegranate.io import SequenceGenerator
from pomegranate.hmm import HiddenMarkovModel
import numpy as np

def get_model(A_d, B_d):
    model = HiddenMarkovModel()
   
    A_s = State(A_d, name="A")
    B_s = State(B_d, name="B")

    model.add_states([A_s, B_s])

    model.add_transition(
        a=model.start,
        b=A_s,
        probability=0.5,
    )
    model.add_transition(
        a=model.start,
        b=B_s,
        probability=0.5,
    )
    model.add_transition(
        a=A_s,
        b=A_s,
        probability=0.99,
    )
    model.add_transition(
        a=B_s,
        b=B_s,
        probability=0.99,
    )
    model.add_transition(
        a=A_s,
        b=B_s,
        probability=0.01,
    )
    model.add_transition(
        a=B_s,
        b=A_s,
        probability=0.01,
    )

    model.bake(verbose=True)
    
    return model

model = get_model(
    IndependentComponentsDistribution(
        [ExponentialDistribution(1), ExponentialDistribution(1), ExponentialDistribution(1)]
    ),
    IndependentComponentsDistribution(
        [ExponentialDistribution(100), ExponentialDistribution(100), ExponentialDistribution(100)]
    )
)
X = np.zeros((1000, 3))
X[:500] = 1
X[500:] = 1/100
X_set = [X]*1000
model.fit(X_set, min_iterations=10, max_iterations=10, verbose=True, n_jobs=-1)

model = get_model(
    ExponentialDistribution(1),
    ExponentialDistribution(100)
)
X = np.zeros((1000))
X[:500] = 1
X[500:] = 1/100
X_set = [X]*1000
model.fit(X_set, min_iterations=10, max_iterations=10, verbose=True, n_jobs=-1)

The first fit call is slow with the original pomegranate code:

[1] Improvement: 8114.942703745794	Time (s): 6.926
[2] Improvement: 0.0011873976327478886	Time (s): 8.598
[3] Improvement: 1.3969838619232178e-09	Time (s): 8.284
[4] Improvement: -6.426125764846802e-08	Time (s): 8.033
[5] Improvement: 6.472691893577576e-08	Time (s): 8.972
[6] Improvement: -6.472691893577576e-08	Time (s): 8.011
[7] Improvement: 6.472691893577576e-08	Time (s): 7.855
[8] Improvement: -6.472691893577576e-08	Time (s): 8.366
[9] Improvement: 6.472691893577576e-08	Time (s): 8.269
[10] Improvement: -6.472691893577576e-08	Time (s): 8.243
Total Training Improvement: 8114.943891080562
Total Training Time (s): 88.2151

The second fit call without IndependentComponentsDistribution is fast:

[1] Improvement: 8111.520942713367	Time (s): 0.1579
[2] Improvement: 1.6850003311410546	Time (s): 0.1475
[3] Improvement: 0.00018647708930075169	Time (s): 0.1554
[4] Improvement: 1.3271346688270569e-08	Time (s): 0.1509
[5] Improvement: -1.234002411365509e-08	Time (s): 0.1648
[6] Improvement: 2.3283064365386963e-10	Time (s): 0.1637
[7] Improvement: 0.0	Time (s): 0.1708
[8] Improvement: 0.0	Time (s): 0.1509
[9] Improvement: 0.0	Time (s): 0.1576
[10] Improvement: 0.0	Time (s): 0.1464
Total Training Improvement: 8113.206129522761
Total Training Time (s): 1.7545

With the modifications in this commit, the slow fit call becomes fast:

[1] Improvement: 8114.942703745794	Time (s): 0.1773
[2] Improvement: 0.0011873976327478886	Time (s): 0.1736
[3] Improvement: 1.3969838619232178e-09	Time (s): 0.1783
[4] Improvement: -6.426125764846802e-08	Time (s): 0.1667
[5] Improvement: 6.472691893577576e-08	Time (s): 0.1846
[6] Improvement: -6.472691893577576e-08	Time (s): 0.1884
[7] Improvement: 6.472691893577576e-08	Time (s): 0.1826
[8] Improvement: -6.472691893577576e-08	Time (s): 0.1638
[9] Improvement: 6.472691893577576e-08	Time (s): 0.1868
[10] Improvement: -6.472691893577576e-08	Time (s): 0.1695
Total Training Improvement: 8114.943891080562
Total Training Time (s): 1.9820

@ghost ghost marked this pull request as ready for review November 28, 2022 03:18
@ghost ghost changed the title refactored _log_probability functionto avoid gil in multithreading refactored _log_probability function to avoid gil in multithreading Nov 28, 2022
@jmschrei (Owner)

Thank you for your contribution. However, pomegranate has recently been rewritten from the ground up to use PyTorch instead of Cython, and so this PR is unfortunately out of date and will be closed.

@jmschrei jmschrei closed this Apr 16, 2023