adds GammaDistribution::Train(observations, probabilities) #834
Conversation
@rcurtin For testing this method, I have decided to generate n points from a fixed gamma distribution and n points between 0 and 1 from a uniform distribution, then use the points from the uniform distribution as probabilities and the points from the original gamma distribution as observations to fit another gamma distribution. Finally, I compare the parameters of the two gamma distributions.
(force-pushed 6458c9d to 256fe79)
(force-pushed 256fe79 to 333c541)
@rcurtin Should I proceed with the tests, or do you have any suggestions? I cannot understand why Travis is failing but AppVeyor passes.
@yashu-seth: thanks for the contribution; sorry it took me a little while to get around to looking at this. Don't worry about the test failure: that is for the Nystroem method and has nothing to do with your code. Some of the mlpack tests are probabilistic, and although we try to keep the failure probability very low, in some cases it isn't low enough.
The testing method seems reasonable---I think the gamma distribution that you fit with the uniform distribution for weights should have approximately the same parameters as if you trained it without weights. Here are a couple more ideas for simple tests:
- Draw points from two different gamma distributions. Set the probabilities for the points from the first distribution to something small (0.01, 0.001, something like that) and the probabilities for the second to something large (0.99, 0.999, something like that). The gamma distribution that you recover should have the same parameters as the second gamma distribution you drew from.
- Train with probabilities all set to 1, and ensure that this gives the same result as when you train with no probabilities at all.
Hope this helps---let me know if I can clarify anything. I glanced at the implementation and I think it looks right; it just needs tests. :)
I've actually done some work on Gamma EM (a generalization of the Gamma distribution following the idea of Gaussian EM...). You can obtain a decent starting value for your Newton-Raphson by noting that the expected value of the empirical covariance of X and log X is, in fact, beta.
Another alternative is to use the method of moments to obtain initial estimates.
…On Tue, Dec 20, 2016 at 12:46 PM, Ryan Curtin ***@***.***> wrote: […]
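Both suggestions in the email above can be sketched in plain C++ (illustrative helpers, not mlpack code; covariance, betaStart, and momStart are hypothetical names). For a Gamma(alpha, beta) distribution with scale beta, Cov(X, log X) equals beta, so the empirical covariance of the sample with its logs gives a starting value for the Newton-Raphson iterations; the method of moments gives alpha0 = m^2 / v and beta0 = v / m.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Empirical (population) covariance of two equal-length samples.
double covariance(const std::vector<double>& x, const std::vector<double>& y)
{
  double mx = 0.0, my = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i)
  {
    mx += x[i];
    my += y[i];
  }
  mx /= x.size();
  my /= y.size();

  double c = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i)
    c += (x[i] - mx) * (y[i] - my);
  return c / x.size();
}

// For Gamma(alpha, beta) in the scale parameterization, Cov(X, log X)
// equals beta, so the sample covariance with the logs is a decent
// starting value for beta before Newton-Raphson refinement.
double betaStart(const std::vector<double>& x)
{
  std::vector<double> logx(x.size());
  for (std::size_t i = 0; i < x.size(); ++i)
    logx[i] = std::log(x[i]);
  return covariance(x, logx);
}

// Method-of-moments initial estimates: alpha0 = m^2 / v, beta0 = v / m.
void momStart(const std::vector<double>& x, double& alpha0, double& beta0)
{
  double m = 0.0;
  for (double v : x)
    m += v;
  m /= x.size();
  const double var = covariance(x, x); // variance is cov(X, X)
  alpha0 = m * m / var;
  beta0 = var / m;
}
```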
@rcurtin I have added the requisite tests. Please have a look.
Tests look good, I think they just need a few minor modifications. Let me know what you think of my comments. It would probably be a good idea to split these into multiple tests, too; if you prefer I can do that after merge.
Also, there are a few places where the code doesn't follow the mlpack style guide... could you fix those please? :) Specifically, for(...) -> for (...), i%2 -> i % 2 (horizontal whitespace), and no variable names with underscores (all_probabilities_1 -> allProbabilities1, etc.). Thanks!
Lastly, do you want to add your name and email to src/mlpack/core.hpp and COPYRIGHT.txt?
BOOST_REQUIRE_CLOSE(gDist2.Beta(0), gDist.Beta(0), 10);
BOOST_REQUIRE_CLOSE(alphaReal, gDist.Alpha(0), 10);
BOOST_REQUIRE_CLOSE(betaReal, gDist.Beta(0), 10);
We can probably check these with much closer tolerances; I'd suggest 1e-5 instead of 10. Shouldn't we also check gDist.Alpha(1) and gDist.Beta(1)?
@rcurtin I just checked: closer tolerances are not working. A tolerance less than 10 does not pass all the tests.
I forgot to add gDist.Alpha(1) and gDist.Beta(1); I will add the checks.
We can use a larger tolerance between the values of alphaReal and betaReal and the estimated values, due to the noisy sampling from the gamma distribution, though I still think a 10% tolerance is a bit large. But gDist and gDist2 should have virtually identical values, so the tolerance should be very small and 1e-5 should be fine. If that is causing the tests to fail, then I think the implementation is incorrect.
@rcurtin I increased the sample size from 500 to 50000, and it achieves a tolerance of 1, but it does not get below 1e-5.
Besides, for the low/high-probability two-distribution test, it achieves a tolerance of less than 2%.
Hm, ok, a larger tolerance like 5% (or even 10% if need be) is okay for the difference between alphaReal and gDist.Alpha(0), but the tolerance between gDist and gDist2 should be 1e-5. The gDist/gDist2 tolerance is really the one I'm much more concerned with.
One way you can test a tolerance is by adding math::RandomSeed(std::time(NULL)) (although you are using different RNGs, so you'll have to seed them accordingly) and then running the test over and over again to see how often it fails.
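The repeated-run idea can be sketched with the standard library (not mlpack's math::RandomSeed(); drawGamma is an illustrative helper): seed the generator from the clock so every run exercises a different random stream, while a fixed seed reproduces one run exactly when chasing a failure.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Draw n gamma-distributed values from a generator seeded explicitly.
// Passing std::time(NULL) as the seed gives a different stream per run,
// which is what repeated tolerance-stress runs need; a fixed seed
// reproduces the exact same draws, which is what debugging a failure needs.
std::vector<double> drawGamma(unsigned seed, std::size_t n,
                              double alpha, double beta)
{
  std::mt19937 gen(seed);
  std::gamma_distribution<double> dist(alpha, beta);
  std::vector<double> out(n);
  for (double& v : out)
    v = dist(gen);
  return out;
}
```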
@rcurtin I checked the implementation and could not find anything incorrect.
But I think it is not necessary that gDist and gDist2 always differ by less than 1e-5 when we are sampling probabilities from a uniform distribution between 0 and 1. For example, some data points would be assigned a very low probability and would be treated as if they were not present at all, so when we train gDist and gDist2 they do not get exactly the same data to be trained on.
Something in favour of my argument is the fact that when I train the distribution with all probabilities equal to 1, I get the exact same distribution as when I train it with just the data.
Ok, I think that we were getting confused because we were talking about probabilities for different tests. :) I looked through the most recent code, and I think the tolerances are fine now. When I was saying that the tolerance should be small like 1e-5, I was only referencing the test with probabilities all equal to one. Thanks for the updates.
  probabilities(i) = low_prob(generator);
else
  probabilities(i) = high_prob(generator);
}
Another possibility to simplify the creation here is just to put the first N/2 points in the first part of rdata from low_prob and the second N/2 points in the second part of rdata from high_prob, then call arma::shuffle() to shuffle the dataset.
This would require ensuring that the rdata matrix and the probabilities vector are shuffled in the same order. How can shuffle be used for this?
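One way to shuffle both in the same order is to permute a vector of indices once and reorder both containers through it (a plain-C++ sketch; shuffleTogether is an illustrative name, and with Armadillo the same idea works by shuffling an index vector and indexing both rdata and probabilities through it).

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Shuffle observations and their probabilities with the SAME permutation:
// shuffle one index vector, then index both containers through it, so
// each observation keeps its own probability.
void shuffleTogether(std::vector<double>& data,
                     std::vector<double>& probabilities,
                     std::mt19937& gen)
{
  std::vector<std::size_t> idx(data.size());
  std::iota(idx.begin(), idx.end(), 0);   // 0, 1, 2, ...
  std::shuffle(idx.begin(), idx.end(), gen);

  std::vector<double> d(data.size()), p(data.size());
  for (std::size_t i = 0; i < idx.size(); ++i)
  {
    d[i] = data[idx[i]];
    p[i] = probabilities[idx[i]];
  }
  data = std::move(d);
  probabilities = std::move(p);
}
```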
@rcurtin I will split these into multiple tests and also fix the style guide violations.
(force-pushed 528b361 to 4bf6e05)
Ah, a comment to whoever replied via email: the method used for training here is this one:
BOOST_REQUIRE_EQUAL(gDist2.Beta(0), gDist.Beta(0));
BOOST_REQUIRE_EQUAL(gDist2.Alpha(1), gDist.Alpha(1));
BOOST_REQUIRE_EQUAL(gDist2.Beta(1), gDist.Beta(1));
We should make this BOOST_REQUIRE_CLOSE() with a tolerance of 1e-5 or so, because floating-point errors in some implementations could produce slightly different results. Your current implementation appears to produce the exact same result, but that may not be true in the future if the code is modified (but still valid), so we should be careful to try and make our test future-proof.
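For context on the tolerances being discussed: BOOST_REQUIRE_CLOSE's third argument is a tolerance expressed as a percentage of relative difference, not an absolute bound. A minimal stand-in for that check (closeEnough is an illustrative helper, not the Boost implementation) looks like this:

```cpp
#include <cmath>

// Minimal stand-in for BOOST_REQUIRE_CLOSE's comparison: the tolerance
// is in PERCENT of relative difference, checked against both operands,
// so both |a-b|/|a| and |a-b|/|b| must be within tolPercent / 100.
bool closeEnough(double a, double b, double tolPercent)
{
  const double diff = std::fabs(a - b);
  const double tol = tolPercent / 100.0;
  return diff <= tol * std::fabs(a) && diff <= tol * std::fabs(b);
}
```

So BOOST_REQUIRE_CLOSE(x, y, 10) allows a 10% relative gap, while a tolerance of 1e-5 demands agreement to about seven significant digits.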
Things look good to me, minus the one comment above.
(force-pushed 4bf6e05 to f5dadd8)
@rcurtin It was nice discussing things with you and getting your suggestions; I have made the suggested changes. Please have a look. Besides, I was looking into the Perceptron module; I think there could also be test cases to check the behaviour of the constructors and the classify methods. Should I create an issue for it and look into it?
Thanks! I made some style changes in c739839 and ec65745, and changed the random number generators to use the mlpack random number generator object in 1ef2a02 (some of the tests I changed there were not yours, but this is still an improvement, because you can now set the random number seed). Then, for each test, I added a line at the top:
This will give output like this:
Using this strategy I found that some of the tolerances needed to be modified to prevent failures. This strategy is important to test with, because each compilation will (effectively) have a different random seed associated with it, so to ensure that tests pass on users' systems, we have to check with many different random seeds.
Anyway, about the Perceptron code: definitely, if you would like to add tests, it would be much appreciated! You can create an issue if you like or just open a PR; whatever you prefer is fine.
fixes #733