
Fix test stability #249

Merged: 4 commits, Feb 11, 2021

Conversation

rcurtin (Member) commented Feb 7, 2021

Occasionally, the ensmallen tests fail at random, because many of the tests we run are non-deterministic. This can cause a bad user experience (and can confuse reviewers of journal articles for ensmallen too :)), so I spent some time refactoring ensmallen's test suite until I was able to run it 1000 times with no failures at all.

A number of major changes were made during this process:

  1. I added a GetFinalPoint() method to most of the functions in ensmallen_bits/problems/. I didn't do this for functions with multiple global minima. Maybe we could do that another time. I also added GetFinalObjective().

  2. Some tests used different initial points for the functions in ensmallen_bits/problems/ than GetInitialPoint() returned, so I updated the implementations of GetInitialPoint() to return the points that were actually used in the tests.

  3. I added a function called FunctionTest() to test_function_tools.hpp, which takes an input optimizer and the type of a function, then uses GetInitialPoint() and GetFinalPoint() to run the optimization and check convergence (a rough sketch of the idea is given just after this list). It also supports a number of trials---so for instances where the test often fails, you can set trials to, e.g., 3 and it will run the test up to 3 times.

  4. I made a standalone LogisticRegressionFunctionTest() utility in test_function_tools.hpp, since FunctionTest() doesn't work quite right for our logistic regression tests (as we are checking the accuracy, not the objective). This also supports multiple trials.

  5. Wherever possible, I refactored all of the tests to use FunctionTest() and LogisticRegressionFunctionTest(), and set the number of trials to what I observed was necessary for no failures in 1000 runs.

  6. I removed tests for SGDTestFunction as much as possible. The reason is that SGDTestFunction is really not a good test function for SGD: in order to optimize it correctly, an optimizer must consider each of the three different objective functions it contains, but when a batch size of 1 is used, the gradients of the individual functions point in very different directions. This means that lots of oscillation is introduced for most SGD-like optimizers, and as a result these tests were very prone to failure. I found the logistic regression test to be more suitable in these instances. However, there are a couple of tests that still use SGDTestFunction and seem to work okay.
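
To make the new helpers concrete, here is a minimal sketch of the FunctionTest() idea, assuming the problem class provides GetInitialPoint(), GetFinalPoint(), and GetFinalObjective(). The real helper in test_function_tools.hpp is built on the Catch test macros, and its exact signature, defaults, and tolerances may differ; the parameter names, the exception-based failure, and the L_BFGS/RosenbrockFunction pairing shown here are illustrative assumptions only. LogisticRegressionFunctionTest() follows the same multi-trial pattern but checks classification accuracy instead of the final objective.

    // Illustrative sketch only; not the actual test_function_tools.hpp code.
    #include <ensmallen.hpp>
    #include <cmath>
    #include <stdexcept>

    template<typename FunctionType, typename OptimizerType>
    void FunctionTest(OptimizerType& optimizer,
                      const double objectiveTolerance = 0.01,
                      const double coordinateTolerance = 0.001,
                      const size_t trials = 1)
    {
      FunctionType f;
      for (size_t t = 0; t < trials; ++t)
      {
        // Every trial starts from the problem's canonical initial point.
        arma::mat coordinates = f.GetInitialPoint();
        const double objective = optimizer.Optimize(f, coordinates);

        // A trial succeeds if both the objective and the coordinates are
        // close enough to the known optimum of the problem.
        const bool objectiveOk =
            std::abs(objective - f.GetFinalObjective()) <= objectiveTolerance;
        const bool coordinatesOk = arma::approx_equal(
            coordinates, f.GetFinalPoint(), "absdiff", coordinateTolerance);

        if (objectiveOk && coordinatesOk)
          return;  // Converged; no further trials needed.
      }

      // In the real test suite this would be a Catch failure instead.
      throw std::runtime_error("optimization did not converge in any trial");
    }

    int main()
    {
      // Hypothetical usage: allow up to 3 trials on the Rosenbrock problem.
      ens::L_BFGS lbfgs;
      FunctionTest<ens::test::RosenbrockFunction>(lbfgs, 0.01, 0.003, 3);
    }

The multi-trial loop is the key design choice: a genuinely stochastic optimizer is allowed a few attempts before the test is declared a failure, which is how the flaky tests can pass reliably without loosening the convergence tolerances.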

Hopefully after this we will see very few reports of test failures! :) I suppose I should think about how to set up an automatic job that runs the tests 1000 times each month or something like that.

zoq (Member) commented Feb 7, 2021

I'll fix the static analysis job.

Review thread on a GetInitialPoint() change in ensmallen_bits/problems/:

    template<typename MatType = arma::mat>
    MatType GetInitialPoint() const { return MatType("-5.0; 5.0"); }
    MatType GetInitialPoint() const { return MatType("0.02; 0.02"); }
Member:

I can see you changed some of the initial points to be closer to the final point; is that for performance reasons?

Member Author:

Yeah---they did not always seem to converge with the given initial points. So I took the starting points that were actually being used in adam_test.cpp and used those for GetInitialPoint(). I wrote the original values in a comment in case we ever want to change back.

Member:

I'm not sure, but I guess the comment could be confusing for users, since they don't really have a reference for where the old value comes from; we could just browse the commit history to check what the old value was instead?

Member Author:

Fair point---I removed those comments in 0aeab3c. 👍

rcurtin (Member Author) commented Feb 9, 2021

I noticed that the master build job fails because it does not properly run the Catch tests---it was still configured for the Boost tests. I updated that in the Jenkins configuration; hopefully that helps.

rcurtin (Member Author) commented Feb 9, 2021

@mlpack-jenkins test this please

zoq (Member) left a review:

Wow, it looks like this was quite some work, but if the end result is a more stable test suite it was worth the time 👍

mlpack-bot (bot) left a comment:

Second approval provided automatically after 24 hours. 👍

conradsnicta (Contributor) commented Feb 11, 2021

Since this is in effect a large change, I suggest bumping the version to 2.16 (and adjusting the corresponding version number in the JMLR cover letter as well).

The version number doesn't have to be used only for indicating changes/expansions to the optimisers. It can also be used for general marketing, to indicate other useful changes (i.e., people like shiny new things with bigger version numbers).
