Add shortcuts for missing data #602
Conversation
This looks good. I'll review it more thoroughly when I get back next week from the conference I'm at. Would it be possible for you to confirm that you're getting the same network before and after your changes by printing out |
I have now verified that the shortcuts result in no changes to model.structure or the output of model.predict_proba. I look forward to these commits being merged when you return from your conference, and after that I'll likely start digging into other areas where the performance could be improved. Thanks again for your support! |
When I run the test script on my machine I don't see the same speedups that you see. This is the result before incorporating your change to
When I incorporate the change:
To me, the differences look like normal run-to-run variation. When I cranked up the percentage of missing values, I actually got a segfault in my code. It took me a while to figure out, but the issue seems to lie in the score calculation of a candidate parent set, which is log P(D|M) - log(|D|) / 2 * |M|, where |D| is the number of examples in your data set and |M| is the number of parameters in the model. Pomegranate handles missing data when building Bayesian networks by ignoring examples that have missing values in the subset of features corresponding to a variable and its candidate parents. This means that |D| can change from one candidate parent set to the next. When the candidate parent set is large and the data is full of missing values, you can end up with only one example in which none of those features are missing. In that case log P(D|M) must be 0, but log(|D|) is also 0, meaning that complexity is no longer penalized. The segfault appears to be related to this, with massive candidate sets being selected for some variables. The fix, derived from the MDL regularization itself, appears to be to discard candidate parent sets that leave only one example, much like we already discard candidate parent sets that leave zero examples. Regardless, after incorporating this simple fix, I timed both the original code and the code incorporating this change, removing ~99% of the values. The original code:
The proposed changes:
The difference in speed between the original code and the modifications appears minor, if not just run-to-run variation. Are you sure that these changes are causing a dramatic speedup on your end? To be fair, I didn't incorporate the changes to ConditionalProbabilityTable or JointProbabilityTable because I didn't expect them to make as much of a difference as this function. |
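The degenerate MDL case described above can be sketched in plain Python. `mdl_score` is a hypothetical stand-in for pomegranate's internal scoring routine, not its actual API; it only illustrates why a single usable example defeats the complexity penalty, and how the proposed guard closes the hole:

```python
from math import log, inf

def mdl_score(log_likelihood, n_examples, n_params):
    # MDL/BIC-style score: log P(D|M) - (log |D| / 2) * |M|.
    # With a single usable example (|D| = 1), the penalty term
    # log(1) / 2 * |M| vanishes, so model complexity goes unpunished.
    if n_examples <= 1:
        # Proposed fix: discard candidate parent sets that leave at
        # most one complete example, as is already done for zero.
        return -inf
    return log_likelihood - log(n_examples) / 2.0 * n_params

# A single perfectly fit example has log-likelihood 0, so without the
# guard its score would be 0, beating any honestly penalized set.
print(mdl_score(0.0, 1, 500))      # -inf under the fix
print(mdl_score(-10.0, 1000, 20))  # an ordinary penalized score
```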
The biggest difference actually comes from adding a shortcut to ConditionalProbabilityTable.__summarize (two underscores). Try the benchmarks again after applying the patch that touches that function. |
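The actual `ConditionalProbabilityTable.__summarize` is Cython and is not shown in this thread; the following pure-Python sketch only illustrates the general shape of such a "NaN shortcut", with a hypothetical `summarize` that tests each row once and bails out early instead of carrying missing values into the count update:

```python
import math

def summarize(rows, counts):
    """Accumulate joint counts for a CPT, skipping incomplete rows.

    `rows` is a list of tuples whose entries are values or float('nan');
    `counts` maps complete tuples to how often they were observed.
    """
    for row in rows:
        # Shortcut: skip the whole row as soon as any value is missing,
        # rather than doing per-entry work on a row that cannot count.
        if any(isinstance(v, float) and math.isnan(v) for v in row):
            continue
        counts[row] = counts.get(row, 0) + 1
    return counts

counts = summarize([(0, 1), (float('nan'), 1), (0, 1)], {})
```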
I added in all of your changes and get the following timings when using ~99% sparsity:
When I go back to ~10% sparsity I get the following, which is comparable to the timings above: |
Huh. Are you sure you're running a fair comparison, with Pomegranate not installed via pip and compiled and installed the same way before and after applying the changes? If it's a fair comparison, maybe the difference lies in your Python installation, compiler, or CPU. I have Python 3.7.4, GCC 9.1.0, and an Intel i5-5257U. What do you have? |
Or could your fix for data with too many missing values be affecting the performance? Could you put that fix on a branch where I can look at it? |
Each time I would modify the code locally, build it, and then run the timings twice to ensure that there weren't any first-run issues. I'm running Python 3.7.3 with GCC 7.3.0 in an Ubuntu VMware virtual machine on an i7-8700K 6-core (12-thread) CPU @ 3.70GHz. My fix involved simply adding

```python
if w_sum > 1:
    logp -= _log(w_sum) / 2 * m[d+1]
else:
    logp = NEGINF
```

to the end of the |
I don't see any improvement when running in a virtual machine either. I think you have to be running on bare metal to see the difference. |
I talked to Sergiusz and he confirmed that the bug you found could be part of our problem. However, it does not seem to be the entire problem because even with the fix, training with missing values is still significantly slower than training without missing values. But regardless of performance, fixing the bug is the right thing to do. |
Hi Jacob, it's been a week since your last comment. Do you need me to get you SSH access to a non-VM server where you can play around and see the performance difference? |
The fastest way for me to provide a fix would be for me to see a data set where this is a problem. It's difficult for me to provide a fix for a bug I can't reproduce. |
You should be able to see the performance boost on the digits dataset as long as you are not running in a virtual machine. The performance characteristics of virtual machines are different from real hardware. |
I guess the question is what your criteria are for accepting performance improvements. Do you only care about performance in virtual machines? If so, do you only care about VMware, or would you accept the patches if I showed that they benefit another platform such as QEMU? |
I ran both the current master and the proposed changes on a machine that only had Linux installed. I couldn't reproduce the gain in speed there either. The only noteworthy thing about the machine is that it's part of a shared file system and appears to be fairly slow compared to both of our machines. I don't know where this slowness comes from, because it allegedly has 2.4GHz processors, compared to my 3.7GHz ones. However, not only is the learning process slow, it also takes a few seconds to write "Without missing values" to the screen, and then another few seconds to start the progress bar... Current master:
Proposed changes:
Out of curiosity, why is the progress bar out of 6 when you write

My understanding of the situation is that there appears to be a bug which, on your end, causes a major slowdown in structure learning when there are missing values. I agree with you that it is important to track down this bug and fix it. However, I haven't been able to reproduce it, and it doesn't look like you can either, outside of this one(?) data set that you haven't shared with me. Please correct me if I'm wrong. Instead, you have proposed a few changes that, while well-intentioned, don't appear to make any difference on my end. Even when the improvements you report are taken at face value, they don't appear to solve the larger bug, and don't appear to me to yield major speed improvements either. I would like to see either an issue with reproducible code that demonstrates the bug so that I can begin debugging it on my end, or, even better, a PR with such code plus the fix. I would happily merge such a PR. |
I posted a sample program that has similar problems to our real dataset over a month ago at #589 (comment). I'll respond to your other questions in the morning. |
I don't know why neurtu appears to do 6 iterations with each iteration taking more time than the last. I agree that it is strange. If you prefer to use different benchmarking software, that's totally fine. |
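For a benchmark that does a fixed number of independent runs rather than neurtu's progressively longer iterations, the standard library's `timeit` is one option. `fit_model` below is a placeholder, not the script from this thread; swap in the actual structure-learning call:

```python
import timeit

def fit_model():
    # Placeholder workload standing in for structure learning on a
    # dataset with missing values; replace with the real call.
    sum(i * i for i in range(10_000))

# repeat() returns several independent timings; the minimum is the
# usual headline figure, since it is least affected by system load.
timings = timeit.repeat(fit_model, repeat=5, number=10)
print(min(timings))
```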
(force-pushed from 9bd386b to 7afa277)
I've tried the shortcuts on several more computers now and gotten mixed results. The shortcuts seem more likely to help when the system is under load, but I haven't found a scenario where I can consistently see an improvement. I still think that merging these commits is the right thing to do primarily because they simplify the code, but I can live with it if you are still opposed. |
Added in a different PR. |
The following benchmarking program shows that on sklearn's "digits" dataset, the sklearn dataset best suited to Bayesian analysis, adding these shortcuts reduces total execution time by about 10% when there are missing values and does not increase it when there are none.
Interestingly, on this dataset I could not reproduce the missing-values slowdown that I see with our own dataset or with a randomly generated one. Nonetheless, the benchmarks show a significant benefit to adding the shortcuts.
Script:
Without NaN shortcuts:
With NaN shortcuts:
Please let me know if there's anything else you need.
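The benchmark script itself is not reproduced above; as one hedged sketch of its key step, injecting missing values at a chosen rate before timing might look like the following. `inject_missing` and the 99% rate are illustrative assumptions, not the posted script:

```python
import math
import random

def inject_missing(data, fraction, seed=0):
    """Return a copy of `data` (a list of rows) with roughly `fraction`
    of the entries replaced by NaN, the way a missing-value benchmark
    might sparsify a complete dataset before timing structure learning."""
    rng = random.Random(seed)  # fixed seed so runs are comparable
    return [[float('nan') if rng.random() < fraction else v for v in row]
            for row in data]

data = [[float(c) for c in range(10)] for _ in range(500)]
sparse = inject_missing(data, 0.99)
missing = sum(math.isnan(v) for row in sparse for v in row)
print(missing / 5000)  # close to 0.99
```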