[GSoC] Implement probabilistic KDE error bounds #1934
The paper isn't clear on this, but I wonder if
I tried digging in the ancient FASTLIB/MLPACK code (as it was called at that time), but I did not quickly see Dongryeol's implementation of this particular technique. You can look if you like:
In the end, that code was never ported to newer mlpack because of its complexity, Dongryeol's focus on defending his thesis (meaning he didn't have time :)), and the fact that it was very specific to kd-trees and did not look easy to port. So even if you can find the right stuff in there, it's not clear how easy it will be to port over. Still, it may be useful to glance at it (maybe).
I spent a long time with this part of the paper tonight and actually I found a flaw in the bound. Luckily, it ends up giving us an easier-to-implement (but looser) bound. Here's my work---it's probably a good idea to try to reproduce it to make sure I haven't gone wrong somewhere.
Our goal is to ensure that the following condition is satisfied by our choice of
And our goal is to find the value of m that will, with high probability, cause this inequality to be satisfied. But we do not have access to the true kernel density estimate
(1) with probability
Note this sheds some light on what
So, now that we have a lower bound for
So, we could work out that unnumbered equation again simply by starting from the bounded inequality
and solving for m. However, in the derivation that is done in the paper, there is another error---the upper bound
I found it easier to use bound (1) on
but you should also rederive it and see if I didn't mess anything up. :) I also tried deriving it using bound (2) but the algebra became really irritating and it's a little bit late here. :) Maybe you can do it better than I did.
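For anyone reproducing the derivation, here is a minimal sketch of the kind of sampled estimate whose error is being bounded: draw m reference points, average the kernel values, and report a CLT-style half-width of z*s/sqrt(m). The Gaussian kernel, the 1-D points, sampling with replacement, the default z = 1.96, and the function names are all assumptions made for illustration; this is not the code in this PR.

```python
import math
import random


def gaussian_kernel(distance, bandwidth):
    # Unnormalized Gaussian kernel (an assumption; any kernel would do here).
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def monte_carlo_kernel_sum(query, references, m, bandwidth=1.0, z=1.96, rng=None):
    """Estimate sum_r K(|query - r|) from m sampled references.

    Returns (estimate, half_width), where half_width is the CLT-style
    bound len(references) * z * s / sqrt(m). Requires m >= 2.
    """
    rng = rng or random.Random(0)
    # Sample with replacement and evaluate the kernel at each sampled point.
    samples = [
        gaussian_kernel(abs(query - rng.choice(references)), bandwidth)
        for _ in range(m)
    ]
    mean = sum(samples) / m
    # Unbiased sample variance of the kernel values.
    variance = sum((x - mean) ** 2 for x in samples) / (m - 1)
    s = math.sqrt(variance)
    half_width = z * s / math.sqrt(m)
    # Scale the mean and its half-width up to the full reference set.
    return len(references) * mean, len(references) * half_width
```

The half-width shrinks like 1/sqrt(m), which is why the discussion centers on choosing m large enough for the bound to hold but no larger than necessary.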
Thank you very much for the time you took working on this :)
I think instead of
it should be
I'm not sure I'm right, because your derivation yields better results (i.e. fewer estimates out of bounds) than mine, but I'm concerned it might be oversizing
When I realized this, I thought about checking it using WolframAlpha (although it's not the best free software solution) and I got this result for the derivation using (1).
Shouldn't the input for the derivation (1) be this instead though? https://www.wolframalpha.com/input/?i=solve+z*s%2Fsqrt(m)%3D(e*(P%2B(R+-+m)*(n-(z*s%2Fsqrt(m)))))%2FR+for+m
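As a cross-check that does not depend on WolframAlpha, the relation in that query can also be evaluated numerically. The sketch below uses the symbol names exactly as they appear in the URL (z, s, m, e, P, R, n) and assumes the condition of interest is that the left-hand side is at most the right-hand side, with m an integer between 1 and R. Both the inequality direction and the function name are assumptions.

```python
import math


def smallest_sample_size(z, s, e, P, R, n):
    """Smallest integer m in [1, R] such that

        z*s/sqrt(m) <= e*(P + (R - m)*(n - z*s/sqrt(m)))/R,

    or None if no such m exists. Symbols follow the WolframAlpha query.
    """
    for m in range(1, R + 1):
        half_width = z * s / math.sqrt(m)
        allowed = e * (P + (R - m) * (n - half_width)) / R
        if half_width <= allowed:
            return m
    # No m up to R satisfies the condition for these inputs.
    return None
```

A linear scan is used instead of a closed form or bisection because the two sides are not obviously monotone in m, and the scan is guaranteed to return the smallest satisfying value. A `None` result means no sample size up to R meets the condition.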
I don't follow where the
But, it's possible I missed something in the math there. Based on the simulation results I saw, I agree, it may be choosing
I found a bug here---when we take
instead, this should fix it. The current code can predict a too-small value of
rcurtin left a comment
Hey @robertohueso, I went pretty in depth with the implementation and I think it is basically good to go. Some of the comments I left are pretty simple little things, but most of the "big" comments have to do with basically one idea that I'll summarize here---
Right now the code avoids Monte Carlo approximation (or approximation of any sort) when a cover tree's self-child is encountered for the reference node. This is done via the use of the
In any case, I left comments throughout that should be the entirety of the changes that are needed to allow approximation even for the self-child case. But (1) you should double-check to make sure I didn't miss something while thinking about it :) and (2) if you had a specific reason that I overlooked for avoiding approximation in these cases (like maybe it's more complex and I forgot some awful detail), let me know. It seems to me that it should be mostly straightforward to make the change, but if it's not, I don't want to make you spend weeks on it. :) I think it will work as-is, this would just be a minor improvement.
I don't have any other comments to the code, other than any already-open comments that aren't resolved from earlier reviews, and for those earlier comments, it's up to you how (or if) you want to handle them; I don't think any of those are critical, all just suggestions. Once each comment is handled I think it is ready for merge.
Great work on this. I know it took longer than you planned for but that's perfectly okay. :)
Agreed, this all seems ready to me except for the static code analysis error. This is the one about the
All of the failing tests on Travis appear to be in ANNLayerTest, which #1953 is trying to fix, so we don't need to consider them here.
https://gist.github.com/d8fd23ded2bda55277689265362c224f should handle the