
[GSoC] Implement probabilistic KDE error bounds #1934

Merged
merged 84 commits into from Jul 30, 2019

Conversation

@robertohueso
Member

commented Jun 20, 2019

Inspired by this paper and corrections made by @rcurtin, this PR adds support for probabilistic relative error bounds for the Gaussian kernel to the existing KDE codebase.

@rcurtin

Member

commented on src/mlpack/methods/kde/kde_rules_impl.hpp in 94ea85c Jun 4, 2019

Are you sure this is right? I thought it might be better to use boost::math::cdf() here.

Member

replied Jun 4, 2019

Ah, sorry. I was not reading the right documentation. This gives us the inverse CDF, which is what we need. So I think it is right.

@rcurtin

Member

commented on src/mlpack/methods/kde/kde_rules_impl.hpp in 94ea85c Jun 4, 2019

The paper isn't clear on this, but I wonder if sample.size() > mThresh means that Monte Carlo approximation can't provide the desired error bounds, and recursion should be done.

I tried digging in the ancient FASTLIB/MLPACK code (as it was called at that time), but I did not quickly see Dongryeol's implementation of this particular technique. You can look if you like:

https://github.com/mlpack/mlpack/tree/0f23d5f61908b4640ed540e4673485bdf3fd9d3b/src/contrib/dongryel/kde

In the end, that code was never ported to newer mlpack because of its complexity, Dongryeol's focus on defending his thesis (meaning he didn't have time :)), and the fact that it was very specific to kd-trees and did not look easy to port. So even if you can find the right code in there, it's not clear how easy it would be to port over. Still, it may be useful to glance at it.

@rcurtin


I spent a long time with this part of the paper tonight, and I actually found a flaw in the bound. Luckily, it ends up giving us an easier-to-implement (but looser) bound. Here's my work---it's probably a good idea to try to reproduce it to make sure I haven't gone wrong somewhere.

Our goal is to ensure that the following condition is satisfied by our choice of m:

z_{β / 2} σ^S / √(m) <= ϵ Φ(q, R) / |R|

And our goal is to find the value of m that will, with high probability, cause this inequality to be satisfied. But we do not have access to the true kernel density estimate Φ(q, R) (although you can compute it in the code, and I did while debugging); instead, we can bound it in one of two ways:

(1) with probability 1 - β, Φ(q, R) >= |R| (µ_S - z_{β / 2} σ^S / √(m))---if the inequality above is satisfied.
(2) with probability 1 - β, Φ(q, R) >= Φ^l(q, R) + (|R| - m) (µ_S - z_{β / 2} σ^S / √(m))---again this holds if the inequality above is satisfied.

Note this sheds some light on what Φ^l(q, R) has to be. In the paper it's not really clear how it's defined, other than as the currently running lower bound on the sum, computed using exact methods. But the derivation above means that Φ^l(q, R) must correspond in the code to just the quantity arma::accu(sample).

So, now that we have a lower bound for Φ(q, R), we can go from the inequality above to Equation (2) in the paper. However, Equation (2) uses an incorrect bounding of Φ(q, R): the quantity µ_S - z_{β / 2} σ^S / √(m) is the lower bound on what we expect the kernel contribution of any reference point to be. (And, based on our bounds from earlier, a reference point will have kernel contribution at least that bound with probability (1 - β / 2). Although, that applies only if m is chosen such that the inequality is true.) Thus, if we take Φ^l(q, R) + |R| (µ_S - z_{β / 2} σ^S / √(m)), where Φ^l(q, R) is just the sum of the true kernel contributions of m points, then we are double-counting m points. So we have to instead take Φ^l(q, R) + (|R| - m) (µ_S - z_{β / 2} σ^S / √(m)) as the bound! This is an error in the paper, and it causes the equation that establishes the bound on m to be wrong (unfortunately that equation is not numbered).

So, we could work out that unnumbered equation again simply by starting from the bounded inequality

z_{β / 2} σ^S / √(m) <= ϵ (Φ^l(q, R) + (|R| - m) (µ_S - z_{β / 2} σ^S / √(m))) / |R|

and solving for m. However, in the derivation that is done in the paper, there is another error---the upper bound (µ_S - z_{β / 2} σ^S / √(m)) <= µ_S is used. (If you take a look at the result, you can see how this was applied.) But this is invalid, since we used a lower bound on the right-hand-side Φ(q, R) in the first step. If we now apply an upper bound after a lower bound, the original inequality may no longer hold.

I found it easier to use bound (1) on Φ(q, R) instead of bound (2). Doing that, I rederived the value of m and got the following:

m >= z_{β / 2}^2 (σ^S)^2 (1 + ϵ)^2 / (ϵ µ_S)^2

but you should also rederive it and see if I didn't mess anything up. :) I also tried deriving it using bound (2) but the algebra became really irritating and it's a little bit late here. :) Maybe you can do it better than I did.

Owner Author

replied Jun 13, 2019

Thank you very much for the time you took working on this :)
After reviewing it, I think everything is correct except for the last derivation.

I think instead of

m >= z_{β / 2}^2 (σ^S)^2 (1 + ϵ)^2 / (ϵ µ_S)^2

it should be

m >= z_{β / 2}^2 (σ^S)^2 (1 + 2ϵ)^2 / (2ϵ µ_S)^2

I'm not sure I'm right, because your derivation yields better results (i.e. fewer estimates out of bounds) than mine, but I'm concerned it might be oversizing mThresh.

When I realized this, I thought about checking it using WolframAlpha (although it's not the best free software solution) and I got this result for the derivation using (1).


replied Jun 14, 2019

Shouldn't the input for the derivation (1) be this instead though? https://www.wolframalpha.com/input/?i=solve+z*s%2Fsqrt(m)%3D(e*(P%2B(R+-+m)*(n-(z*s%2Fsqrt(m)))))%2FR+for+m
The key is that P (or Φ^l(q, R), I was just too lazy to type that into Wolfram Alpha) is actually a value we've already computed, not something dependent on the mean and standard deviation we estimated.

I don't follow where the 2 came from though... my derivation looks like this:

// Starting from the original statement.
z_{β / 2} σ^S / √(m) <= ϵ Φ(q, R) / |R|

// Apply bound (1).
z_{β / 2} σ^S / √(m) <= ϵ (|R| (µ_S - z_{β / 2} σ^S / √(m))) / |R|

// Distribute ϵ, drop |R| term.
z_{β / 2} σ^S / √(m) <= ϵ µ_S - ϵ z_{β / 2} σ^S / √(m)

// Move RHS term to LHS.
(1 + ϵ) z_{β / 2} σ^S / √(m) <= ϵ µ_S

// Invert and then move LHS denominator to the right.
√(m) >=  ((1 + ϵ) z_{β / 2} σ^S) / (ϵ µ_S)

// Square both sides.
m >= z_{β / 2}^2 (σ^S)^2 (1 + ϵ)^2 / (ϵ µ_S)^2

But, it's possible I missed something in the math there. Based on the simulation results I saw, I agree, it may be choosing mThresh too large.

@rcurtin


I found a bug here---when we take mThresh as a size_t, this is equivalent to taking the floor of the floating point result. If you do

size_t mThresh = std::ceil(std::pow(z * stddev * (numerator / denominator), 2.0));

instead, this should fix it. The current code can predict a too-small value of m as a result. :)

Owner Author

replied Jun 12, 2019

Exactly :) Thanks!

@rcurtin


Are you sure GetKernelBandwidth() should be used here and not 1.0? The z scores should be for a unit-variance Gaussian as far as I know. Correct me if I'm wrong.

Owner Author

replied Jun 12, 2019

I think you're right, I just tried this in case I was doing it wrong :/

Member

left a comment

Hey @robertohueso, I went pretty in depth with the implementation and I think it is basically good to go. Some of the comments I left are pretty simple little things, but most of the "big" comments have to do with basically one idea that I'll summarize here---

Right now the code avoids Monte Carlo approximation (or approximation of any sort) when a cover tree's self-child is encountered for the reference node. This is done via the use of the newCalculations variable. However, my suggestions primarily center around changing this, because it is possible to still approximate (Monte Carlo or not) when a cover tree self-child is encountered. Primarily, the change that needs to be made in this case is to avoid double-counting the reference self-child point (i.e. referenceNode.Point(0)). In dual-tree mode it gets a little bit more tricky, because we have to avoid double-counting the reference self-child point but only for the query self-child point.

In any case, I left comments throughout that should be the entirety of the changes that are needed to allow approximation even for the self-child case. But (1) you should double-check to make sure I didn't miss something while thinking about it :) and (2) if you had a specific reason that I overlooked for avoiding approximation in these cases (like maybe it's more complex and I forgot some awful detail), let me know. It seems to me that it should be mostly straightforward to make the change, but if it's not, I don't want to make you spend weeks on it. :) I think it will work as-is, this would just be a minor improvement.

I don't have any other comments to the code, other than any already-open comments that aren't resolved from earlier reviews, and for those earlier comments, it's up to you how (or if) you want to handle them; I don't think any of those are critical, all just suggestions. Once each comment is handled I think it is ready for merge. 👍

Great work on this. I know it took longer than you planned for but that's perfectly okay. :)

//! Monte Carlo entry coefficient.
static constexpr double mcEntryCoef = 3;

//! Monte Carlo break coefficient.
static constexpr double mcBreakCoef = 0.7;

@rcurtin (Member) Jul 18, 2019

Have you tried being more aggressive with this one? I feel like an mcBreakCoef of 0.4 could give good results, although one might also want to have a maximum number of points sampled too, since we almost definitely want to recurse at very high levels, where even mcBreakCoef * referenceNode.NumDescendants() could be extremely large (like at the root node for instance). Up to you how you want to handle it; let me know if I can clarify what I wrote.

@robertohueso (Member Author) Jul 22, 2019

I've been doing some testing and this is an example of the KDE computation time (in seconds) of one of the tests:

Execution parameters:
abs_error: 0
algorithm: dual-tree
initial_sample_size: 100
kernel: gaussian
mc_entry_coef: 3
mc_probability: 0.95
monte_carlo: true
rel_error: 0.05
tree: kd-tree
| Bandwidth \ mcBreakCoef | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 |
|---|---|---|---|---|---|---|---|
| 0.5 | 6.789707 | 6.266044 | 5.977942 | 6.204131 | 6.275582 | 6.391213 | 6.270569 |
| 0.8 | 5.146516 | 4.916502 | 6.405031 | 9.853159 | 11.119784 | 2.484530 | 13.522522 |
| 1.0 | 5.859346 | 1.235331 | 1.224885 | 1.249122 | 1.269203 | 1.254212 | 1.280066 |

This is highly dependent on the other parameters, but it seems to follow a similar pattern across different datasets, so a good default would be either 0.3 or 0.4; let's choose 0.4 :)

@rcurtin (Member) Jul 24, 2019

Agreed. 👍 Thanks for doing the testing on this one to figure out what is best.

Review threads were opened on:
- src/mlpack/methods/kde/kde_impl.hpp
- src/mlpack/methods/kde/kde_main.cpp
- src/mlpack/methods/kde/kde_rules_impl.hpp
- src/mlpack/tests/kde_test.cpp
robertohueso added 2 commits:
- Now it also tries to approximate self-children
- Fix for Cover Tree
@robertohueso

Member Author

commented Jul 22, 2019

Thanks for the last review! I think everything is now solved and ready for review :)

@rcurtin

Member

commented Jul 24, 2019

@mlpack-jenkins test this please

@rcurtin

Member

commented Jul 24, 2019

Agreed, this all seems ready to me except for the static code analysis error. This is the one about the RectangleTree copy constructor. Did you want me to provide a quick patch for that?

All of the failing tests on Travis appear to be in ANNLayerTest; #1953 is trying to solve that, so we don't need to consider it here.

@robertohueso

Member Author

commented Jul 24, 2019

If it doesn't take too much time for you, it would be nice. Otherwise I can also try to fix it :)

@rcurtin

Member

commented Jul 24, 2019

👍 let me see if I can do it in a moment here once I handle a few other emails.

@rcurtin

Member

commented Jul 26, 2019

https://gist.github.com/d8fd23ded2bda55277689265362c224f should handle the RectangleTree fixes. Let me know if you see anything that can be improved. 👍 Otherwise I think that this is ready, let's just make sure there are no CI problems before approval.

rcurtin and others added 3 commits Jul 26, 2019
Signed-off-by: Roberto Hueso Gomez <robertohueso96@gmail.com>
@rcurtin

Member

commented on aaaaa38 Jul 26, 2019

Nice catch! It's possible my focus is a little elsewhere this week... :)

Member

left a comment

Awesome work! Now that all the tests pass I think this is ready to merge (with the exception of ANNLayerTest but as we know that's not an issue with this code). 💯

@mlpack-bot
mlpack-bot bot approved these changes Jul 28, 2019

left a comment

Second approval provided automatically after 24 hours. 👍

@rcurtin rcurtin merged commit b882909 into mlpack:master Jul 30, 2019
5 of 6 checks passed:
- continuous-integration/travis-ci/pr: The Travis CI build failed
- LaTeX Documentation Checks: Build finished.
- Memory Checks: Build finished.
- Static Code Analysis Checks: Build finished.
- Style Checks: Build finished.
- continuous-integration/appveyor/pr: AppVeyor build succeeded
@rcurtin

Member

commented Jul 30, 2019

Awesome, happy to have this merged. 👍
