
adds GammaDistribution::Train(observations, probabilities) #834

Merged: 3 commits merged into mlpack:master on Dec 27, 2016

Conversation

yashu-seth (Contributor) commented Dec 18, 2016

fixes #733

yashu-seth (Contributor, Author) commented Dec 18, 2016

@rcurtin For testing this method, I have decided to generate some n points from a fixed gamma distribution and n points between 0 and 1 from a uniform distribution. I then use the points from the uniform distribution as probabilities and the points from the original gamma distribution as observations to train another gamma distribution, and finally compare the parameters of the two gamma distributions.
Please correct me if I am missing something.
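(A minimal sketch of this test idea follows. It is illustrative only: the concrete parameter values, the test name, and the std::mt19937-based sampling are assumptions rather than the exact code in this PR; the Train(observations, probabilities) overload and the Alpha()/Beta() accessors are taken from the discussion below.)

#include <mlpack/core.hpp>
#include <boost/test/unit_test.hpp>
#include <random>

using namespace mlpack::distribution;

BOOST_AUTO_TEST_CASE(GammaDistributionTrainWithProbabilitiesSketch)
{
  const double alphaReal = 5.3;
  const double betaReal = 1.5;
  const size_t N = 1000;

  // Draw N observations from a fixed gamma distribution and N weights in
  // (0, 1) from a uniform distribution.
  std::mt19937 generator;
  std::gamma_distribution<double> gammaSampler(alphaReal, betaReal);
  std::uniform_real_distribution<double> uniform(0.0, 1.0);

  arma::mat observations(1, N);
  arma::vec probabilities(N);
  for (size_t i = 0; i < N; ++i)
  {
    observations(0, i) = gammaSampler(generator);
    probabilities(i) = uniform(generator);
  }

  // Train one distribution with the weights and one without; their fitted
  // parameters should be approximately equal.
  GammaDistribution gDist, gDist2;
  gDist.Train(observations, probabilities);
  gDist2.Train(observations);

  BOOST_REQUIRE_CLOSE(gDist2.Alpha(0), gDist.Alpha(0), 10);
  BOOST_REQUIRE_CLOSE(gDist2.Beta(0), gDist.Beta(0), 10);
}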

yashu-seth (Contributor, Author):

@rcurtin Should I proceed with the tests, or do you have any suggestions? I cannot understand why Travis is failing but AppVeyor passes.

rcurtin (Member) commented Dec 20, 2016

@yashu-seth: thanks for the contribution; sorry it took me a little while to get around to looking at this. Don't worry about the test failure: it is for the Nystroem method and has nothing to do with your code. Some of the mlpack tests are probabilistic, and although we try to keep the failure probability very low, in some cases it isn't low enough.

The testing method seems reasonable---I think the gamma distribution that you fit with the uniform distribution for weights should have approximately the same parameters as if you trained it without weights. Here are a couple more ideas for simple tests:

  • Draw points from two different gamma distributions. Set the probabilities for the points from the first distribution to something small (0.01, 0.001, something like that) and the probabilities for the second to something large (0.99, 0.999, something like that). The gamma distribution that you recover should have the same parameters as the second gamma distribution you drew from.

  • Train with probabilities all set to 1, and ensure that this gives the same result as when you train with no probabilities at all (a small sketch of this check follows after this comment).

Hope this helps---let me know if I can clarify anything. I glanced at the implementation, I think it looks right; just needs tests. :)
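(A small sketch of the second suggestion, continuing the setup from the earlier sketch; the allProbabilities1 name comes from the review below, and the very tight tolerance is discussed later in this thread.)

// All probabilities equal to 1: the weighted fit should match the unweighted
// fit on the same data (observations and N as in the earlier sketch).
arma::vec allProbabilities1(N, arma::fill::ones);

GammaDistribution gDist, gDist2;
gDist.Train(observations, allProbabilities1);
gDist2.Train(observations);

// Both fits see identical effective data, so a very small tolerance applies.
BOOST_REQUIRE_CLOSE(gDist2.Alpha(0), gDist.Alpha(0), 1e-5);
BOOST_REQUIRE_CLOSE(gDist2.Beta(0), gDist.Beta(0), 1e-5);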

mlpack-mailing-lists commented Dec 21, 2016 via email

yashu-seth (Contributor, Author):

@rcurtin I have added the requisite tests. Please have a look.

rcurtin (Member) left a comment

Tests look good, I think they just need a few minor modifications. Let me know what you think of my comments. It would probably be a good idea to split these into multiple tests, too; if you prefer I can do that after merge.

Also, there are a few places where the code doesn't follow the mlpack style guide... could you fix those please? :) Specifically, for(...) -> for (...), i%2 -> i % 2 (horizontal whitespace), and no variables names with underscores (all_probabilities_1 -> allProbabilities1, etc.). Thanks!

Lastly, do you want to add your name and email to src/mlpack/core.hpp and COPYRIGHT.txt?

BOOST_REQUIRE_CLOSE(gDist2.Beta(0), gDist.Beta(0), 10);

BOOST_REQUIRE_CLOSE(alphaReal, gDist.Alpha(0), 10);
BOOST_REQUIRE_CLOSE(betaReal, gDist.Beta(0), 10);
rcurtin (Member):

We can probably check these with much closer tolerances. I'd suggest 1e-5 instead of 10. Shouldn't we also check gDist.Alpha(1) and gDist.Beta(1)?

yashu-seth (Contributor, Author):

@rcurtin I just checked; closer tolerances are not working. A tolerance of less than 10 does not pass all the tests.
I forgot to add gDist.Alpha(1) and gDist.Beta(1); I will add the checks.

rcurtin (Member):

We can use a larger tolerance between the values of alphaReal and betaReal and the estimated values, due to the noisy sampling from the gamma distribution. I still think 10% tolerance is a bit large. But gDist and gDist2 should have virtually identical values so the tolerance should be very small and 1e-5 should be fine. If that is causing the tests to fail, then I think the implementation is incorrect.

yashu-seth (Contributor, Author):

@rcurtin I increased the sample size from 500 up to 50000 and it achieves a tolerance of 1, but it does not get below 1e-5.
Also, for the two-distribution test with low and high probabilities, it achieves a tolerance of less than 2%.

rcurtin (Member):

Hm, ok, a larger tolerance like 5% (or even 10% if need be) is okay for the difference between alphaReal and gDist.Alpha(0), but the tolerance between gDist and gDist2 should be 1e-5. The gDist/gDist2 tolerance is really the one I'm much more concerned with.

One way you can test a tolerance is by adding math::RandomSeed(std::time(NULL)) (although you are using different RNGs so you'll have to set them accordingly) and then running the test over and over again to see how often it fails.
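(For example, something like the following two lines at the top of the test; the std::mt19937 generator name is assumed, standing in for whatever standard-library RNG the test actually uses.)

// Re-seed both mlpack's RNG and the test's own RNG from the clock, so each
// run of the test uses a different seed.
mlpack::math::RandomSeed((size_t) std::time(NULL));
std::mt19937 generator((unsigned int) std::time(NULL));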

yashu-seth (Contributor, Author):

@rcurtin I checked the implementation and could not find anything incorrect.

But I do not think that gDist and gDist2 must always differ by less than 1e-5 when we sample the probabilities from a uniform distribution between 0 and 1. For example, some data points will be assigned a very low probability and treated almost as if they were not present at all, so gDist and gDist2 are not trained on exactly the same effective data.

In favour of my argument: when I train the distribution with all probabilities equal to 1, I get exactly the same distribution as when I train it with just the data.

rcurtin (Member):

Ok, I think that we were getting confused because we were talking about probabilities for different tests. :) I looked through the most recent code, and I think the tolerances are fine now. When I was saying that the tolerance should be small like 1e-5, I was only referencing the test with probabilities all equal to one. Thanks for the updates.

probabilities(i) = low_prob(generator);
else
probabilities(i) = high_prob(generator);
}
rcurtin (Member):

Another possibility to simplify the creation here is just to put the first N/2 points in the first part of rdata from low_prob and then the second N/2 points in the second part of rdata from high_prob, then call arma::shuffle() to shuffle the dataset.

yashu-seth (Contributor, Author):

This would require shuffling the rdata matrix and the probabilities vector in the same order. How can shuffle be used for this?
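(One possible approach, sketched below using the variable names from this thread: shuffle a single index vector, then use it to reorder both the observation matrix and the probability vector.)

// Shuffle rdata and probabilities in the same order by shuffling one index
// vector and indexing both objects with it (a sketch, not the merged code).
arma::uvec order = arma::shuffle(arma::linspace<arma::uvec>(0, N - 1, N));
arma::mat shuffledData = rdata.cols(order);           // reorder observations
arma::vec shuffledProbs = probabilities.elem(order);  // same reordering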

yashu-seth (Contributor, Author) commented Dec 22, 2016

@rcurtin I will split these into multiple tests and also fix the style guide violations.
I would love to get my name added.

yashu-seth force-pushed the fixes_issue_#733 branch 2 times, most recently from 528b361 to 4bf6e05 on December 22, 2016
rcurtin (Member) commented Dec 23, 2016

Ah, a comment to whoever replied via email, the method used for training here is this one:
http://131.107.65.14/en-us/um/people/minka/papers/minka-gamma.pdf
It seems to work quite well!
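(For reference, the core of that paper is a generalized Newton fixed-point update for the shape parameter. The sketch below is a from-memory summary of that update, not mlpack's actual implementation; for weighted training, the plain means become probability-weighted means.)

#include <cmath>
#include <armadillo>
#include <boost/math/special_functions/digamma.hpp>
#include <boost/math/special_functions/trigamma.hpp>

// Estimate the shape parameter alpha of a gamma distribution via Minka's
// generalized Newton iteration; the scale is then beta = mean(x) / alpha.
double EstimateAlpha(const arma::vec& x, const size_t maxIterations = 100)
{
  const double meanX = arma::mean(x);
  const double meanLogX = arma::mean(arma::log(x));

  double alpha = 0.5 / (std::log(meanX) - meanLogX);  // initial approximation
  for (size_t i = 0; i < maxIterations; ++i)
  {
    const double numerator = meanLogX - std::log(meanX) + std::log(alpha)
        - boost::math::digamma(alpha);
    const double denominator = alpha * alpha
        * (1.0 / alpha - boost::math::trigamma(alpha));
    alpha = 1.0 / (1.0 / alpha + numerator / denominator);
  }
  return alpha;
}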

BOOST_REQUIRE_EQUAL(gDist2.Beta(0), gDist.Beta(0));

BOOST_REQUIRE_EQUAL(gDist2.Alpha(1), gDist.Alpha(1));
BOOST_REQUIRE_EQUAL(gDist2.Beta(1), gDist.Beta(1));
rcurtin (Member):

We should make this BOOST_REQUIRE_CLOSE() with a tolerance of 1e-5 or so, because floating-point differences in other valid implementations could produce slightly different results. Your current implementation appears to produce the exact same result, but that may not be true in the future if the code is modified (but still valid), so we should be careful to make our test future-proof.
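(That is, something like the following, with the third argument being a percentage tolerance:)

BOOST_REQUIRE_CLOSE(gDist2.Alpha(0), gDist.Alpha(0), 1e-5);
BOOST_REQUIRE_CLOSE(gDist2.Beta(0), gDist.Beta(0), 1e-5);
BOOST_REQUIRE_CLOSE(gDist2.Alpha(1), gDist.Alpha(1), 1e-5);
BOOST_REQUIRE_CLOSE(gDist2.Beta(1), gDist.Beta(1), 1e-5);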

rcurtin (Member) commented Dec 23, 2016

Things look good to me minus the one comment about BOOST_REQUIRE_CLOSE. I'll probably refactor the tests very slightly after merge to try and use mlpack's existing RNGs instead of custom ones. You can add your name to src/mlpack/core.hpp and COPYRIGHT.txt. Let me know when you think this is ready to merge and I'll hit the button. Thanks again for taking the time to sort out the issues here.

yashu-seth (Contributor, Author) commented Dec 23, 2016

@rcurtin It was nice discussing things with you and getting suggestions. I have made the suggested changes. Please have a look.

Besides, I was looking into the Perceptron module. I think there could also be test cases to check the behaviour of the constructors and the Classify methods. Should I create an issue for it and look into it?

rcurtin merged commit 60f16d7 into mlpack:master on Dec 27, 2016
rcurtin (Member) commented Dec 27, 2016

Thanks! I made some style changes in c739839 and ec65745 and changed the random number generators to use the mlpack random number generator object in 1ef2a02 (some of the tests I changed there were not yours, but still this is an improvement because you can now set the random number seed with mlpack::math::RandomSeed()).

Then, for each test, I added a line at the top: math::RandomSeed(std::time(NULL)) and recompiled. I ran a bash script like this:

while(true); do bin/mlpack_test -t GammaDistributionTrainWithProbabilitiesTest; sleep 1; done

This will give output like this:

Running 1 test case...

*** No errors detected
Running 1 test case...

*** No errors detected
Running 1 test case...

*** No errors detected
Running 1 test case...
/home/ryan/src/mlpack/src/mlpack/tests/distribution_test.cpp(483): fatal error in "GammaDistributionTrainWithProbabilitiesTest": difference{2.03727%} between betaReal{6.7000000000000002} and gDist.Beta(0){6.5662282671947327} exceeds 2%

*** 1 failure detected in test suite "mlpackTest"
Running 1 test case...

*** No errors detected
Running 1 test case...

*** No errors detected
Running 1 test case...

*** No errors detected
Running 1 test case...

*** No errors detected
Running 1 test case...

*** No errors detected

Using this strategy I found that some of the tolerances in GammaDistributionTrainWithProbabilitiesTest needed to be modified to prevent failures: the comparison between the gDist and gDist2 parameters occasionally failed, so I loosened that tolerance from 1% to 1.5%, and the comparisons between alphaReal/betaReal and the trained distributions often failed, so I loosened that tolerance from 1% to 2.5%. I think this is an ok change to make, and these are reasonable bounds for the expected noise.

This strategy is important to test with, because each compilation will (effectively) have a different random seed associated with it, so to ensure that tests pass on users' systems, we have to check with many different random seeds.

Anyway, about the Perceptron code, definitely, if you would like to add tests, it would be much appreciated! You can create an issue if you like or just open a PR, whatever you prefer is fine.

rcurtin added this to the mlpack 2.2.0 milestone on Mar 17, 2017
Successfully merging this pull request may close: GammaDistribution Implementation is Incomplete (#733)