
adds GammaDistribution::Train(observations, probabilities) #834

Merged: 3 commits merged into mlpack:master on Dec 27, 2016

Conversation

yashu-seth (Contributor) commented Dec 18, 2016

fixes #733

yashu-seth (Contributor, Author) commented Dec 18, 2016

@rcurtin For testing this method, I have decided to generate some n points from a fixed gamma distribution and n points between 0 and 1 from a uniform distribution. I then use the points from the uniform distribution as probabilities and the points from the original gamma distribution as observations to train another gamma distribution, and finally compare the parameters of the two gamma distributions.
Please correct me if I am missing something.
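(A minimal sketch of this test idea follows. It is illustrative only: the concrete parameter values, the test name, and the std::mt19937-based sampling are assumptions rather than the exact code in this PR; the Train(observations, probabilities) overload and the Alpha()/Beta() accessors are taken from the discussion below.)

#include <mlpack/core.hpp>
#include <boost/test/unit_test.hpp>
#include <random>

using namespace mlpack::distribution;

BOOST_AUTO_TEST_CASE(GammaDistributionTrainWithProbabilitiesSketch)
{
  const double alphaReal = 5.3;
  const double betaReal = 1.5;
  const size_t N = 1000;

  // Draw N observations from a fixed gamma distribution and N weights in
  // (0, 1) from a uniform distribution.
  std::mt19937 generator;
  std::gamma_distribution<double> gammaSampler(alphaReal, betaReal);
  std::uniform_real_distribution<double> uniform(0.0, 1.0);

  arma::mat observations(1, N);
  arma::vec probabilities(N);
  for (size_t i = 0; i < N; ++i)
  {
    observations(0, i) = gammaSampler(generator);
    probabilities(i) = uniform(generator);
  }

  // Train one distribution with the weights and one without; their fitted
  // parameters should be approximately equal.
  GammaDistribution gDist, gDist2;
  gDist.Train(observations, probabilities);
  gDist2.Train(observations);

  BOOST_REQUIRE_CLOSE(gDist2.Alpha(0), gDist.Alpha(0), 10);
  BOOST_REQUIRE_CLOSE(gDist2.Beta(0), gDist.Beta(0), 10);
}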

yashu-seth (Contributor, Author):

@rcurtin Should I proceed with the tests, or do you have any suggestions? I cannot understand why Travis is failing but AppVeyor passes.

rcurtin (Member) commented Dec 20, 2016

@yashu-seth: thanks for the contribution; sorry it took me a little while to get around to looking at this. Don't worry about the test failure: it is for the Nystroem method and has nothing to do with your code. Some of the mlpack tests are probabilistic, and although we try to keep the failure probability very low, in some cases it isn't low enough.

The testing method seems reasonable---I think the gamma distribution that you fit with the uniform distribution for weights should have approximately the same parameters as if you trained it without weights. Here are a couple more ideas for simple tests:

  • Draw points from two different gamma distributions. Set the probabilities for the points from the first distribution to something small (0.01, 0.001, something like that) and the probabilities for the second to something large (0.99, 0.999, something like that). The gamma distribution that you recover should have the same parameters as the second gamma distribution you drew from.

  • Train with probabilities all set to 1, and ensure that this gives the same result as when you train with no probabilities at all (a small sketch of this check follows after this comment).

Hope this helps---let me know if I can clarify anything. I glanced at the implementation, I think it looks right; just needs tests. :)
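(A small sketch of the second suggestion, continuing the setup from the earlier sketch; the allProbabilities1 name comes from the review below, and the very tight tolerance is discussed later in this thread.)

// All probabilities equal to 1: the weighted fit should match the unweighted
// fit on the same data (observations and N as in the earlier sketch).
arma::vec allProbabilities1(N, arma::fill::ones);

GammaDistribution gDist, gDist2;
gDist.Train(observations, allProbabilities1);
gDist2.Train(observations);

// Both fits see identical effective data, so a very small tolerance applies.
BOOST_REQUIRE_CLOSE(gDist2.Alpha(0), gDist.Alpha(0), 1e-5);
BOOST_REQUIRE_CLOSE(gDist2.Beta(0), gDist.Beta(0), 1e-5);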

mlpack-mailing-lists commented Dec 21, 2016 via email

yashu-seth (Contributor, Author):

@rcurtin I have added the requisite tests. Please have a look.

rcurtin (Member) left a comment

Tests look good, I think they just need a few minor modifications. Let me know what you think of my comments. It would probably be a good idea to split these into multiple tests, too; if you prefer I can do that after merge.

Also, there are a few places where the code doesn't follow the mlpack style guide... could you fix those please? :) Specifically, for(...) -> for (...), i%2 -> i % 2 (horizontal whitespace), and no variables names with underscores (all_probabilities_1 -> allProbabilities1, etc.). Thanks!

Lastly, do you want to add your name and email to src/mlpack/core.hpp and COPYRIGHT.txt?

BOOST_REQUIRE_CLOSE(gDist2.Beta(0), gDist.Beta(0), 10);

BOOST_REQUIRE_CLOSE(alphaReal, gDist.Alpha(0), 10);
BOOST_REQUIRE_CLOSE(betaReal, gDist.Beta(0), 10);
rcurtin (Member):

We can probably check these with much closer tolerances. I'd suggest 1e-5 instead of 10. Shouldn't we also check gDist.Alpha(1) and gDist.Beta(1)?

yashu-seth (Contributor, Author):

@rcurtin I just checked; closer tolerances are not working. A tolerance of less than 10 does not pass all the tests.
I forgot to add gDist.Alpha(1) and gDist.Beta(1); I will add the checks.

rcurtin (Member):

We can use a larger tolerance between the values of alphaReal and betaReal and the estimated values, due to the noisy sampling from the gamma distribution. I still think 10% tolerance is a bit large. But gDist and gDist2 should have virtually identical values so the tolerance should be very small and 1e-5 should be fine. If that is causing the tests to fail, then I think the implementation is incorrect.

yashu-seth (Contributor, Author):

@rcurtin I increased the sample size from 500 up to 50000 and it achieves a tolerance of 1, but it does not get below 1e-5.
Also, for the two-distribution test with low and high probabilities, it achieves a tolerance of less than 2%.

rcurtin (Member):

Hm, ok, a larger tolerance like 5% (or even 10% if need be) is okay for the difference between alphaReal and gDist.Alpha(0), but the tolerance between gDist and gDist2 should be 1e-5. The gDist/gDist2 tolerance is really the one I'm much more concerned with.

One way you can test a tolerance is by adding math::RandomSeed(std::time(NULL)) (although you are using different RNGs so you'll have to set them accordingly) and then running the test over and over again to see how often it fails.
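(For example, something like the following two lines at the top of the test; the std::mt19937 generator name is assumed, standing in for whatever standard-library RNG the test actually uses.)

// Re-seed both mlpack's RNG and the test's own RNG from the clock, so each
// run of the test uses a different seed.
mlpack::math::RandomSeed((size_t) std::time(NULL));
std::mt19937 generator((unsigned int) std::time(NULL));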

yashu-seth (Contributor, Author):

@rcurtin I checked the implementation and could not find anything incorrect.

But I do not think that gDist and gDist2 must always differ by less than 1e-5 when we sample the probabilities from a uniform distribution between 0 and 1. For example, some data points will be assigned a very low probability and treated almost as if they were not present at all, so gDist and gDist2 are not trained on exactly the same effective data.

In favour of my argument: when I train the distribution with all probabilities equal to 1, I get exactly the same distribution as when I train it with just the data.

rcurtin (Member):

Ok, I think that we were getting confused because we were talking about probabilities for different tests. :) I looked through the most recent code, and I think the tolerances are fine now. When I was saying that the tolerance should be small like 1e-5, I was only referencing the test with probabilities all equal to one. Thanks for the updates.

probabilities(i) = low_prob(generator);
else
probabilities(i) = high_prob(generator);
}
rcurtin (Member):

Another possibility to simplify the creation here is just to put the first N/2 points in the first part of rdata from low_prob and then the second N/2 points in the second part of rdata from high_prob, then call arma::shuffle() to shuffle the dataset.

yashu-seth (Contributor, Author):

This would require shuffling the rdata matrix and the probabilities vector in the same order. How can shuffle be used for this?
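(One possible approach, sketched below using the variable names from this thread: shuffle a single index vector, then use it to reorder both the observation matrix and the probability vector.)

// Shuffle rdata and probabilities in the same order by shuffling one index
// vector and indexing both objects with it (a sketch, not the merged code).
arma::uvec order = arma::shuffle(arma::linspace<arma::uvec>(0, N - 1, N));
arma::mat shuffledData = rdata.cols(order);           // reorder observations
arma::vec shuffledProbs = probabilities.elem(order);  // same reordering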

yashu-seth (Contributor, Author) commented Dec 22, 2016

@rcurtin I will split these into multiple tests and also fix the style guide violations.
I would love to get my name added.

yashu-seth force-pushed the fixes_issue_#733 branch 2 times, most recently from 528b361 to 4bf6e05 on December 22, 2016
rcurtin (Member) commented Dec 23, 2016

Ah, a comment to whoever replied via email, the method used for training here is this one:
http://131.107.65.14/en-us/um/people/minka/papers/minka-gamma.pdf
It seems to work quite well!
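(For reference, the core of that paper is a generalized Newton fixed-point update for the shape parameter. The sketch below is a from-memory summary of that update, not mlpack's actual implementation; for weighted training, the plain means become probability-weighted means.)

#include <cmath>
#include <armadillo>
#include <boost/math/special_functions/digamma.hpp>
#include <boost/math/special_functions/trigamma.hpp>

// Estimate the shape parameter alpha of a gamma distribution via Minka's
// generalized Newton iteration; the scale is then beta = mean(x) / alpha.
double EstimateAlpha(const arma::vec& x, const size_t maxIterations = 100)
{
  const double meanX = arma::mean(x);
  const double meanLogX = arma::mean(arma::log(x));

  double alpha = 0.5 / (std::log(meanX) - meanLogX);  // initial approximation
  for (size_t i = 0; i < maxIterations; ++i)
  {
    const double numerator = meanLogX - std::log(meanX) + std::log(alpha)
        - boost::math::digamma(alpha);
    const double denominator = alpha * alpha
        * (1.0 / alpha - boost::math::trigamma(alpha));
    alpha = 1.0 / (1.0 / alpha + numerator / denominator);
  }
  return alpha;
}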

BOOST_REQUIRE_EQUAL(gDist2.Beta(0), gDist.Beta(0));

BOOST_REQUIRE_EQUAL(gDist2.Alpha(1), gDist.Alpha(1));
BOOST_REQUIRE_EQUAL(gDist2.Beta(1), gDist.Beta(1));
rcurtin (Member):

We should make this BOOST_REQUIRE_CLOSE() with a tolerance of 1e-5 or so, because floating-point differences in other valid implementations could produce slightly different results. Your current implementation appears to produce the exact same result, but that may not be true in the future if the code is modified (but still valid), so we should be careful to make our test future-proof.
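(That is, something like the following, with the third argument being a percentage tolerance:)

BOOST_REQUIRE_CLOSE(gDist2.Alpha(0), gDist.Alpha(0), 1e-5);
BOOST_REQUIRE_CLOSE(gDist2.Beta(0), gDist.Beta(0), 1e-5);
BOOST_REQUIRE_CLOSE(gDist2.Alpha(1), gDist.Alpha(1), 1e-5);
BOOST_REQUIRE_CLOSE(gDist2.Beta(1), gDist.Beta(1), 1e-5);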

rcurtin (Member) commented Dec 23, 2016

Things look good to me minus the one comment about BOOST_REQUIRE_CLOSE. I'll probably refactor the tests very slightly after merge to try and use mlpack's existing RNGs instead of custom ones. You can add your name to src/mlpack/core.hpp and COPYRIGHT.txt. Let me know when you think this is ready to merge and I'll hit the button. Thanks again for taking the time to sort out the issues here.

yashu-seth (Contributor, Author) commented Dec 23, 2016

@rcurtin It was nice discussing things with you and getting suggestions. I have made the suggested changes. Please have a look.

Besides, I was looking into the Perceptron module. I think there could also be test cases to check the behaviour of the constructors and the Classify methods. Should I create an issue for it and look into it?

rcurtin merged commit 60f16d7 into mlpack:master on Dec 27, 2016
rcurtin (Member) commented Dec 27, 2016

Thanks! I made some style changes in c739839 and ec65745 and changed the random number generators to use the mlpack random number generator object in 1ef2a02 (some of the tests I changed there were not yours, but still this is an improvement because you can now set the random number seed with mlpack::math::RandomSeed()).

Then, for each test, I added a line at the top: math::RandomSeed(std::time(NULL)) and recompiled. I ran a bash script like this:

while(true); do bin/mlpack_test -t GammaDistributionTrainWithProbabilitiesTest; sleep 1; done

This will give output like this:

Running 1 test case...

*** No errors detected
Running 1 test case...

*** No errors detected
Running 1 test case...

*** No errors detected
Running 1 test case...
/home/ryan/src/mlpack/src/mlpack/tests/distribution_test.cpp(483): fatal error in "GammaDistributionTrainWithProbabilitiesTest": difference{2.03727%} between betaReal{6.7000000000000002} and gDist.Beta(0){6.5662282671947327} exceeds 2%

*** 1 failure detected in test suite "mlpackTest"
Running 1 test case...

*** No errors detected
Running 1 test case...

*** No errors detected
Running 1 test case...

*** No errors detected
Running 1 test case...

*** No errors detected
Running 1 test case...

*** No errors detected

Using this strategy I found that some of the tolerances in GammaDistributionTrainWithProbabilitiesTest needed to be modified to prevent failures: the comparison between the gDist and gDist2 parameters occasionally failed, so I loosened that tolerance from 1% to 1.5%, and the comparisons between alphaReal/betaReal and the trained distributions often failed, so I loosened that tolerance from 1% to 2.5%. I think this is an ok change to make, and these are reasonable bounds for the expected noise.

This strategy is important to test with, because each compilation will (effectively) have a different random seed associated with it, so to ensure that tests pass on users' systems, we have to check with many different random seeds.

Anyway, about the Perceptron code, definitely, if you would like to add tests, it would be much appreciated! You can create an issue if you like or just open a PR, whatever you prefer is fine.

rcurtin added this to the mlpack 2.2.0 milestone on Mar 17, 2017
Successfully merging this pull request may close: GammaDistribution Implementation is Incomplete (#733)