Add shortcuts for missing data #602

Closed
alexhenrie wants to merge 6 commits from the nan_shortcuts branch

Conversation

alexhenrie
Contributor

The following benchmarking program shows that on sklearn's "digits" dataset, which is their dataset most suited for Bayesian analysis, adding these shortcuts reduces the total execution time by about 10% when there are missing values and does not increase the execution time when there are no missing values.

Interestingly, on this dataset I could not reproduce the missing-values slowdown that I see with our own dataset or with a randomly generated dataset. Nonetheless, the benchmarks show that there is a significant benefit to adding the shortcuts.

Script:

import numpy as np
from neurtu import delayed, Benchmark
from pomegranate import BayesianNetwork
from random import random
from sklearn.datasets import load_digits

# Use the first 10 features of the digits dataset.
X = load_digits(return_X_y=True)[0][:, :10]

# Lazily build the training expression; neurtu re-evaluates it each time it is
# benchmarked, so it picks up any in-place changes to X.
train = delayed(BayesianNetwork)().from_samples(X)

print('Without missing values:')
print(Benchmark(wall_time=True, cpu_time=True, repeat=3)(train))
print()

# Randomly replace about 10% of the values with NaN (X is modified in place).
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        if random() <= 0.1:
            X[i][j] = np.nan

print('With missing values:')
print(Benchmark(wall_time=True, cpu_time=True, repeat=3)(train))
print()
print()

Without NaN shortcuts:

Without missing values:
      wall_time  cpu_time
mean  11.357033  7.907993
max   11.491907  7.944401
std    0.117067  0.031715

With missing values:
      wall_time  cpu_time
mean  10.816061  7.586988
max   10.831504  7.742489
std    0.013388  0.148912

With NaN shortcuts:

Without missing values:
      wall_time  cpu_time
mean  11.586937  7.937418
max   12.074349  8.004921
std    0.422112  0.067759

With missing values:
      wall_time  cpu_time
mean   9.708813  6.849663
max    9.726768  6.869913
std    0.020418  0.025905

Please let me know if there's anything else you need.

@jmschrei
Owner

This looks good. I'll review it more thoroughly when I get back next week from the conference I'm at. Would it be possible for you to confirm that you're getting the same network before and after your changes by printing out model.structure and by calculating the probability of some data set? It can be the one that you learned the network over. Thanks!

@alexhenrie
Contributor Author

alexhenrie commented Jul 29, 2019

I have now verified that the shortcuts result in no changes to model.structure or the output of model.predict_proba. I look forward to these commits being merged when you return from your conference, and after that I'll likely start digging into other areas where the performance could be improved. Thanks again for your support!
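For reference, a minimal sketch of the kind of equivalence check described above (illustrative only; it assumes the X without missing values from the benchmark script, and exact call signatures may differ between pomegranate versions):

from pomegranate import BayesianNetwork

model = BayesianNetwork.from_samples(X)   # learn the network from the same data
print(model.structure)                    # tuple of parent-index tuples, one per variable
print(model.log_probability(X).sum())     # total log-probability assigned to the training data

Running this once on master and once with the shortcuts applied should produce identical output if the shortcuts are purely a performance change.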

@jmschrei
Owner

jmschrei commented Aug 7, 2019

When I run the test script on my machine I don't see the same speedups that you see.

This is the result before incorporating your change to discrete_score_node (which is generally the most compute-intensive part of structure learning).

Without missing values:
      cpu_time  wall_time                                                                                                                                                                
mean  5.683641   8.551788
max   5.730023   8.625430
std   0.068112   0.064195

With missing values:
      cpu_time  wall_time                                                                                                                                                                
mean  5.138517   7.745014
max   5.150633   7.810688
std   0.016490   0.057186

When I incorporate the change:

Without missing values:
      cpu_time  wall_time                                                                                                                                                                
mean  5.744121   8.610118
max   5.888380   8.690713
std   0.137677   0.070891

With missing values:
      cpu_time  wall_time                                                                                                                                                                
mean  5.517163   8.523160
max   5.574747   8.530496
std   0.066018   0.012541

To me, the differences look like normal run-to-run differences.

When I cranked up the percentage of missing values, I actually got a segfault in my code. It took me a while to figure out, but the issue seems to have to do with the score calculation of a candidate parent set, which is log P(D|M) - log(|D|) / 2 * |M|, where |D| is the number of examples in your data set and |M| is the number of parameters in the model.

Pomegranate handles missing data when building Bayesian networks by ignoring examples that have missing values in the subset of features corresponding to a variable and its candidate parents. This means that |D| can change from one candidate parent set to the next. In cases where the candidate parent set is large and the data is full of missing values, you may get situations where only one example has no missing features. In that case log P(D|M) must be 0, but log(|D|) is also 0, meaning that model complexity is no longer penalized. The segfault appears to be related to this: massive candidate parent sets were being selected for some variables. The fix, derived from the MDL regularization itself, appears to be to discard candidate parent sets which leave only one example, much like we discard candidate parent sets that leave zero examples.
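To make the failure mode concrete, here is a toy illustration (ordinary Python, not the actual Cython in discrete_score_node) of the penalty term going to zero when only one complete example survives:

import math

def mdl_penalty(n_examples, n_parameters):
    # The complexity term of the score: log(|D|) / 2 * |M|.
    return math.log(n_examples) / 2 * n_parameters

print(mdl_penalty(1000, 50))   # about 172.7: large parent sets pay a real penalty
print(mdl_penalty(1, 50))      # 0.0: with a single example the penalty vanishes entirely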

Regardless, after incorporating this simple fix, I timed both the original code and the code incorporating your proposed change, this time removing ~99% of the values. The original code:

Without missing values:
      cpu_time  wall_time                                                                                                                                                                
mean  5.665113   8.461362
max   5.768124   8.482762
std   0.103726   0.021463

With missing values:
      cpu_time  wall_time
mean  0.171360   0.170214
max   0.172353   0.171646
std   0.000871   0.001241

The proposed changes:

Without missing values:
      cpu_time  wall_time                                                                                                                                                                
mean  5.484248   8.345044
max   5.521972   8.416701
std   0.032916   0.070389

With missing values:
      cpu_time  wall_time
mean  0.160196   0.160249
max   0.161104   0.162660
std   0.001142   0.002100

The difference in speed between the original code and the modifications appears minor, if not just run-to-run variation. Are you sure that these changes are causing a dramatic speedup on your end? To be fair, I didn't incorporate the changes to ConditionalProbabilityTable or JointProbabilityTable because I didn't expect them to make as much of a difference as this function.

@alexhenrie
Contributor Author

The biggest difference actually comes from adding a shortcut to ConditionalProbabilityTable.__summarize (two underscores). Try the benchmarks again after applying the patch that touches that function.
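For context, a plain-Python sketch of the general idea behind that kind of shortcut (the real code is Cython and also handles sample weights; the function and variable names here are illustrative, not pomegranate's): skip a row as soon as a missing value is seen, before doing any further work on it.

import numpy as np

def summarize_counts(counts, keymap, X):
    for row in X:
        # Shortcut: a row containing any missing value cannot contribute to the
        # counts, so skip it immediately instead of processing every element first.
        if np.isnan(row).any():
            continue
        counts[keymap[tuple(row)]] += 1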

@jmschrei
Owner

jmschrei commented Aug 7, 2019

I added in all of your changes and get the following timings when using ~99% sparsity:

Without missing values:
      cpu_time  wall_time                                                                                                                                                                
mean  5.520043   8.370361
max   5.609389   8.449367
std   0.123448   0.125639

With missing values:
      cpu_time  wall_time
mean  0.159223   0.160261
max   0.160032   0.167227
std   0.000870   0.006134

When I go back to ~10% sparsity I get the following, which is comparable to the timings above:

Without missing values:
      cpu_time  wall_time                                                                                                                                                                
mean  5.564584   8.379262
max   5.654466   8.494476
std   0.125364   0.137579

With missing values:
      cpu_time  wall_time                                                                                                                                                                
mean  5.649820   8.333919
max   5.746899   8.389962
std   0.109258   0.059623

@alexhenrie
Contributor Author

Huh. Are you sure you're running a fair comparison, i.e. that Pomegranate is not installed via pip and that it is compiled and installed the same way before and after applying the changes?

If it's a fair comparison, maybe it's a difference between your Python installation, compiler, or CPU and mine. I have Python 3.7.4, GCC 9.1.0, and an Intel i5-5257U. What do you have?

@alexhenrie
Contributor Author

Or could your fix for data with too many missing values be affecting the performance? Could you put that fix on a branch where I can look at it?

@jmschrei
Owner

jmschrei commented Aug 7, 2019

Each time, I would modify the code locally, build it, and then run the timings twice to ensure that there weren't any first-run issues. I'm running Python 3.7.3 and GCC 7.3.0 in an Ubuntu VMware virtual machine on an i7-8700K 6-core (12-thread) CPU @ 3.70GHz.

My fix involved simply adding

	if w_sum > 1:
		logp -= _log(w_sum) / 2 * m[d+1]
	else:
		logp = NEGINF

to the end of the discrete_score_node function, so I doubt that it was significantly impacting the speed of that function. It is possible that, by returning NEGINF rather than 0, it builds a smaller network when the data has many missing values, and so the time taken by ConditionalProbabilityTable is lessened. However, since my script segfaults with the original code, I haven't been able to time that.

@alexhenrie
Contributor Author

I don't see any improvement when running in a virtual machine either. I think you have to be running on bare metal to see the difference.

@alexhenrie
Contributor Author

alexhenrie commented Aug 8, 2019

I talked to Sergiusz and he confirmed that the bug you found could be part of our problem. However, it does not seem to be the entire problem because even with the fix, training with missing values is still significantly slower than training without missing values. But regardless of performance, fixing the bug is the right thing to do.

@alexhenrie
Contributor Author

alexhenrie commented Aug 16, 2019

Hi Jacob, it's been a week since your last comment. Do you need me to get you SSH access to a non-VM server where you can play around and see the performance difference?

@jmschrei
Owner

The fastest way for me to provide a fix would be for me to see a data set where this is a problem. It's difficult for me to provide a fix for a bug I can't reproduce.

@alexhenrie
Contributor Author

alexhenrie commented Aug 16, 2019

You should be able to see the performance boost on the digits dataset as long as you are not running in a virtual machine. The performance characteristics of virtual machines are different from real hardware.

@alexhenrie
Contributor Author

I guess the question is what your criteria are for accepting performance improvements. Do you only care about performance in virtual machines? If so, do you only care about VMware, or would you accept the patches if I showed that they benefit another platform such as QEMU?

@jmschrei
Owner

I ran both the current master and the proposed changes on a machine that only had Linux installed. I couldn't reproduce the gain in speed there either. The only noteworthy thing about the machine is that it's part of a shared file system and appears to be fairly slow compared to both of our machines. I don't know where this slowness comes from, because it allegedly has 2.4GHz processors compared to my 3.7GHz ones. However, not only is the learning process slow, it also takes a few seconds to write "Without missing values" to the screen, and then another few seconds to start the progress bar...

Current master:

Without missing values:
       cpu_time  wall_time                                                                                           
mean  25.002199  31.330353
max   25.019196  31.929205
std    0.016997   0.518762

With missing values:
       cpu_time  wall_time                                                                                           
mean  25.142178  31.134752
max   25.156176  31.183504
std    0.013526   0.042548

Proposed changes:

Without missing values:
       cpu_time  wall_time                                                                                                                                                 
mean  25.780414  32.184617
max   25.831073  32.499963
std    0.074321   0.287220

With missing values:
       cpu_time  wall_time                                                                                                                                                 
mean  25.802744  32.714040
max   25.805077  32.910791
std    0.003214   0.170431

Out of curiosity, why is the progress bar out of 6 when you write repeat=3 in the command? Also, do you know why each iteration appears to take longer than the last one? They should all be the same command on the same data, right?

My understanding of the situation is that there appears to be a bug which, on your end, is causing a major slowdown in structure learning when there are missing values. I agree with you that it is important to track down this bug and fix it. However, I haven't been able to reproduce it, and it doesn't look like you have been able to either, outside of this one(?) data set that you haven't shared with me. Please correct me if I'm wrong. Instead, you have proposed a few changes that, while well-intentioned, don't appear to make any difference on my end. Even when the improvements you report are taken at face value, they don't solve the larger bug, and they don't appear to yield major speed improvements.

I would like to see either an issue with reproducible code that demonstrates the bug so that I can begin to debug it on my end, or, even better, a PR with such code and also the fix. I would happily merge such a PR.

@alexhenrie
Contributor Author

I posted a sample program that has similar problems to our real dataset over a month ago at #589 (comment). I'll respond to your other questions in the morning.

@alexhenrie
Contributor Author

I don't know why neurtu appears to do 6 iterations with each iteration taking more time than the last. I agree that it is strange. If you prefer to use different benchmarking software, that's totally fine.

@alexhenrie
Contributor Author

I've tried the shortcuts on several more computers now and gotten mixed results. The shortcuts seem more likely to help when the system is under load, but I haven't found a scenario where I can consistently see an improvement. I still think that merging these commits is the right thing to do primarily because they simplify the code, but I can live with it if you are still opposed.

@jmschrei
Owner

Added in a different PR.

@jmschrei jmschrei closed this Aug 26, 2019
@alexhenrie alexhenrie deleted the nan_shortcuts branch September 4, 2019 22:45