Add shortcuts for missing data #602
Conversation
This looks good. I'll review it more thoroughly when I get back next week from the conference I'm at. Would it be possible for you to confirm that you're getting the same network before and after your changes by printing out |
I have now verified that the shortcuts result in no changes to model.structure or the output of model.predict_proba. I look forward to these commits being merged when you return from your conference, and after that I'll likely start digging into other areas where the performance could be improved. Thanks again for your support! |
When I run the test script on my machine I don't see the same speedups that you see. This is the result before incorporating your change to
When I incorporate the change:
To me, the differences look like normal run-to-run variation. When I cranked up the percentage of missing values, I actually got a segfault in my code. It took me a while to figure out, but the issue seems to lie in the score calculation of a candidate parent set, which is log P(D|M) - log(|D|) / 2 * |M|, where |D| is the number of examples in your data set and |M| is the number of parameters in the model. Pomegranate handles missing data when building Bayesian networks by ignoring examples that have missing values in the subset of features corresponding to a variable and its candidate parents. This means that |D| can change from one candidate parent set to the next. When the candidate parent set is large and the data is full of missing values, you can end up with only one example in which none of those features are missing. In that case log P(D|M) must be 0, but log(|D|) is also 0, meaning that complexity is no longer penalized. The segfault appears to be related to this, with massive candidate sets being selected for some variables. The fix, derived from the MDL regularization itself, appears to be to discard candidate parent sets that leave only one example, much like we already discard candidate parent sets that leave zero examples. Regardless, after incorporating this simple fix, I timed both the original code and the code incorporating this change, removing ~99% of the values. The original code:
The proposed changes:
The difference in speed between the original code and the modifications appears minor, if not just run-to-run variation. Are you sure that these changes are causing a dramatic speedup on your end? To be fair, I didn't incorporate the changes to ConditionalProbabilityTable or JointProbabilityTable because I didn't expect them to make as much of a difference as this function. |
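The degenerate MDL case described above can be sketched in plain Python. `mdl_score` is a hypothetical stand-in for pomegranate's internal scoring routine, not its actual API; it only illustrates why a single usable example defeats the complexity penalty, and how the proposed guard closes the hole:

```python
from math import log, inf

def mdl_score(log_likelihood, n_examples, n_params):
    # MDL/BIC-style score: log P(D|M) - (log |D| / 2) * |M|.
    # With a single usable example (|D| = 1), the penalty term
    # log(1) / 2 * |M| vanishes, so model complexity goes unpunished.
    if n_examples <= 1:
        # Proposed fix: discard candidate parent sets that leave at
        # most one complete example, as is already done for zero.
        return -inf
    return log_likelihood - log(n_examples) / 2.0 * n_params

# A single perfectly fit example has log-likelihood 0, so without the
# guard its score would be 0, beating any honestly penalized set.
print(mdl_score(0.0, 1, 500))      # -inf under the fix
print(mdl_score(-10.0, 1000, 20))  # an ordinary penalized score
```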
The biggest difference actually comes from adding a shortcut to ConditionalProbabilityTable.__summarize (two underscores). Try the benchmarks again after applying the patch that touches that function. |
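The actual `ConditionalProbabilityTable.__summarize` is Cython and is not shown in this thread; the following pure-Python sketch only illustrates the general shape of such a "NaN shortcut", with a hypothetical `summarize` that tests each row once and bails out early instead of carrying missing values into the count update:

```python
import math

def summarize(rows, counts):
    """Accumulate joint counts for a CPT, skipping incomplete rows.

    `rows` is a list of tuples whose entries are values or float('nan');
    `counts` maps complete tuples to how often they were observed.
    """
    for row in rows:
        # Shortcut: skip the whole row as soon as any value is missing,
        # rather than doing per-entry work on a row that cannot count.
        if any(isinstance(v, float) and math.isnan(v) for v in row):
            continue
        counts[row] = counts.get(row, 0) + 1
    return counts

counts = summarize([(0, 1), (float('nan'), 1), (0, 1)], {})
```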
I added in all of your changes and get the following timings when using ~99% sparsity:
When I go back to ~10% sparsity I get the following, which is comparable to the timings above: |
Huh. Are you sure you're running a fair comparison, with Pomegranate not installed via pip and compiled and installed the same way before and after applying the changes? If it's a fair comparison, maybe the difference lies in your Python installation, compiler, or CPU. I have Python 3.7.4, GCC 9.1.0, and an Intel i5-5257U. What do you have? |
Or could your fix for data with too many missing values be affecting the performance? Could you put that fix on a branch where I can look at it? |
Each time I would modify the code locally, build it, and then run the timings twice to ensure that there weren't any first-run issues. I'm running Python 3.7.3 with GCC 7.3.0 in an Ubuntu VMware virtual machine on an i7-8700K 6-core (12-thread) CPU @ 3.70GHz. My fix involved simply adding

```python
if w_sum > 1:
    logp -= _log(w_sum) / 2 * m[d+1]
else:
    logp = NEGINF
```

to the end of the |
I don't see any improvement when running in a virtual machine either. I think you have to be running on bare metal to see the difference. |
I talked to Sergiusz and he confirmed that the bug you found could be part of our problem. However, it does not seem to be the entire problem because even with the fix, training with missing values is still significantly slower than training without missing values. But regardless of performance, fixing the bug is the right thing to do. |
Hi Jacob, it's been a week since your last comment. Do you need me to get you SSH access to a non-VM server where you can play around and see the performance difference? |
The fastest way for me to provide a fix would be for me to see a data set where this is a problem. It's difficult for me to provide a fix for a bug I can't reproduce. |
You should be able to see the performance boost on the digits dataset as long as you are not running in a virtual machine. The performance characteristics of virtual machines are different from real hardware. |
I guess the question is what your criteria are for accepting performance improvements. Do you only care about performance in virtual machines? If so, do you only care about VMware, or would you accept the patches if I showed that they benefit another platform such as QEMU? |
I ran both the current master and the proposed changes on a machine that only had Linux installed. I couldn't reproduce the gain in speed there either. The only noteworthy thing about the machine is that it's part of a shared file system and appears to be fairly slow compared to both of our machines. I don't know where this slowness comes from, because it allegedly has 2.4GHz processors, compared to my 3.7GHz ones. However, not only is the learning process slow, it also takes a few seconds to write "Without missing values" to the screen, and then another few seconds to start the progress bar... Current master:
Proposed changes:
Out of curiosity, why is the progress bar out of 6 when you write

My understanding of the situation is that there appears to be a bug which, on your end, causes a major slowdown in structure learning when there are missing values. I agree with you that it is important to track down this bug and fix it. However, I haven't been able to reproduce it, and it doesn't look like you can either, outside of this one(?) data set that you haven't shared with me. Please correct me if I'm wrong. Instead, you have proposed a few changes that, while well-intentioned, don't appear to make any difference on my end. Even when the improvements you report are taken at face value, they don't appear to solve the larger bug, and don't appear to me to yield major speed improvements either. I would like to see either an issue with reproducible code that demonstrates the bug so that I can begin debugging it on my end, or, even better, a PR with such code plus the fix. I would happily merge such a PR. |
I posted a sample program that has similar problems to our real dataset over a month ago at #589 (comment). I'll respond to your other questions in the morning. |
I don't know why neurtu appears to do 6 iterations with each iteration taking more time than the last. I agree that it is strange. If you prefer to use different benchmarking software, that's totally fine. |
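For a benchmark that does a fixed number of independent runs rather than neurtu's progressively longer iterations, the standard library's `timeit` is one option. `fit_model` below is a placeholder, not the script from this thread; swap in the actual structure-learning call:

```python
import timeit

def fit_model():
    # Placeholder workload standing in for structure learning on a
    # dataset with missing values; replace with the real call.
    sum(i * i for i in range(10_000))

# repeat() returns several independent timings; the minimum is the
# usual headline figure, since it is least affected by system load.
timings = timeit.repeat(fit_model, repeat=5, number=10)
print(min(timings))
```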
(force-pushed from 9bd386b to 7afa277)
I've tried the shortcuts on several more computers now and gotten mixed results. The shortcuts seem more likely to help when the system is under load, but I haven't found a scenario where I can consistently see an improvement. I still think that merging these commits is the right thing to do primarily because they simplify the code, but I can live with it if you are still opposed. |
Added in a different PR. |
The following benchmarking program shows that on sklearn's "digits" dataset, the sklearn dataset best suited to Bayesian analysis, adding these shortcuts reduces total execution time by about 10% when there are missing values and does not increase it when there are none.
Interestingly, on this dataset I could not reproduce the missing-values slowdown that I see with our own dataset or with a randomly generated one. Nonetheless, the benchmarks show a significant benefit to adding the shortcuts.
Script:
Without NaN shortcuts:
With NaN shortcuts:
Please let me know if there's anything else you need.
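The benchmark script itself is not reproduced above; as one hedged sketch of its key step, injecting missing values at a chosen rate before timing might look like the following. `inject_missing` and the 99% rate are illustrative assumptions, not the posted script:

```python
import math
import random

def inject_missing(data, fraction, seed=0):
    """Return a copy of `data` (a list of rows) with roughly `fraction`
    of the entries replaced by NaN, the way a missing-value benchmark
    might sparsify a complete dataset before timing structure learning."""
    rng = random.Random(seed)  # fixed seed so runs are comparable
    return [[float('nan') if rng.random() < fraction else v for v in row]
            for row in data]

data = [[float(c) for c in range(10)] for _ in range(500)]
sparse = inject_missing(data, 0.99)
missing = sum(math.isnan(v) for row in sparse for v in row)
print(missing / 5000)  # close to 0.99
```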