
[QUESTION] predict_proba too slow on a Bayesian Network #811

Closed
jeongsoolee09 opened this issue Sep 8, 2020 · 2 comments
@jeongsoolee09

Hello! I have a custom-made Bayesian Network with the following specs:

  • 494 nodes
  • 608 edges
  • max incoming edges per node: 9
  • number of values each random variable can take: 4
  • CPTs are custom-made

I am trying to get the marginal probabilities of every node without any evidence by the following code:

BN_for_inference.predict_proba({}, n_jobs=-1)

However, the code runs very slowly: it hadn't finished after 30 minutes on a 32-core machine with 128GB of RAM.

Is there any way to speed up my computation, or is it impossible due to the sheer size of my network?

@jeongsoolee09 jeongsoolee09 changed the title predict_proba too slow on a Bayesian Network [QUESTION] predict_proba too slow on a Bayesian Network Sep 8, 2020
@jmschrei
Owner

Howdy

Unfortunately, loopy belief propagation is a pretty slow algorithm, due in part to its iterative nature, and it may not be the right inference algorithm for a model as large as the one you're defining. Further, additional jobs currently do not speed up inference for a single example; they parallelize the processing of multiple examples (e.g., if you had 32 examples, each thread would process one in parallel).

However, I think that limiting the number of incoming edges per node might significantly speed things up. The size of a CPT is equal to n**m * p, where p is the number of values the variable itself can take, n is the number of values each parent can take (assuming they're all the same; more generally prod(n) when n is a vector of parent cardinalities), and m is the number of parents. If each of your variables can take 4 values, this means you have at least one CPT with over a million entries (4**9 * 4 = 1,048,576). I'd see whether there are any instances where you can reduce this size.
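To make the arithmetic concrete, here is a minimal sketch in plain Python (the `cpt_size` helper is just for illustration, not a pomegranate function) computing the CPT sizes implied by the numbers above:

```python
def cpt_size(parent_cardinalities, p):
    """Number of entries in a conditional probability table:
    prod(n_i over parents) * p, where p is the cardinality
    of the variable itself."""
    size = p
    for n in parent_cardinalities:
        size *= n
    return size

# Worst-case node in the original network: 9 parents, all with 4 values
print(cpt_size([4] * 9, 4))  # 4**9 * 4 = 1048576 entries

# After limiting fan-in to 6 parents
print(cpt_size([4] * 6, 4))  # 4**6 * 4 = 16384 entries
```

Capping the fan-in at 6 shrinks the worst-case table by a factor of 64, which is why the speedup reported below is so dramatic.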

@jeongsoolee09
Author

You were definitely right: I aggressively reduced the maximum fan-in per node to 6, and the inference time dropped drastically, to 5-7 seconds. Fighting O(4^m) complexity (where m is the number of incoming edges per node) is definitely a tough thing!
