Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix failing radical test. #1924

Merged
merged 4 commits into from Aug 10, 2019

Conversation

@walragatver
Copy link
Contributor

commented Jun 14, 2019

Hi,
This is a fix for failing radical test during the build.
Following are the observations.

  1. RadicalDiffNoiseStdDevTest, RadicalDiffReplicatesTest, RadicalDiffAnglesTest, RadicalDiffSweepsTest fail occasionally. They failed 21 times in 10000 iterations.
  2. After digging in it is being found The covariance matrix is not correctly remade after decomposing it during whitening of the data matrix.
  3. Earlier the output was exploding the values were coming in the range 1e7 to 1e14. After modification the output is coming in 0 to 1 range.
    I am quite new to ICA. So, let me know if this is wrong.
Copy link
Member

left a comment

@walragatver wow, nice work reproducing this. 12 in 100k is not very many failures, I guess maybe I was just not patient enough. :)

I see that the problem itself stems from the given data matrix being (effectively) singular. So, adding the term like 1e-20 to the denominator like I suggested is likely to help, but we can really solve the issue by changing the tests so that the data matrix is never singular. We could do this by either changing from randu(5, 3) to a fixed, given matrix that we know is not singular, or, we could do something like this...

arma::mat data(3, 5, arma::fill::randu);
// Use SVD to decompose, and then we'll "rebuild" with any small singular values increased.
arma::mat matU, matV;
arma::vec s;
arma::svd(matU, s, matV);
s = arma::min(s, 0.1); // Set all singular values to be at least 0.1.
data = matU * s * matV; // Reassemble original matrix.

Either is fine with me, honestly---the RADICAL main tests are just supposed to test that all the input options to the RADICAL binding work properly anyway.

I hope this is helpful; let me know if there's anything I can clarify in my explanation. :)

@@ -188,6 +188,6 @@ void mlpack::radical::WhitenFeatureMajorMatrix(const mat& matX,
mat matU, matV;
vec s;
arma::svd(matU, s, matV, cov(matX));
matWhitening = matU * diagmat(1 / sqrt(s)) * trans(matV);
matWhitening = matU * diagmat(s) * trans(matV);

This comment has been minimized.

Copy link
@rcurtin

rcurtin Jun 16, 2019

Member

So if we multiply by the singular values, we are just reconstructing the original matrix instead of whitening it. If what you are seeing is that the singular values are extremely small, then what we can do is try something like diagmat(1 / (sqrt(s) + 1e-20)) or some other thing like this to prevent division by 0.

@walragatver

This comment has been minimized.

Copy link
Contributor Author

commented Jun 17, 2019

Hi @rcurtin,
Even after making the above changes the tests are failing at the same rate.
I am still somewhat confused about the whitening of the radical test. I am digging the code and I will let you know about the fix soon.
Also it looks like I added an extra zero while writing description. It's 10K instead of 100K. 😛

@walragatver walragatver force-pushed the walragatver:radical_test branch from b7f3b84 to 1e2e71f Jun 18, 2019
@walragatver

This comment has been minimized.

Copy link
Contributor Author

commented Jun 18, 2019

Hi @rcurtin,
I have changed the input matrix to a fixed matrix which is non-singular to prevent failure. Let me know your thoughts regarding the same.

@zoq

This comment has been minimized.

Copy link
Member

commented Jun 18, 2019

Looks like we should run/check the valgrind output for the test as well.

@rcurtin

This comment has been minimized.

Copy link
Member

commented Jun 19, 2019

I can't seem to reproduce the valgrind error; it could be an Armadillo issue or something. You might try instantiating from a string instead, i.e., arma::mat data("1 2 3 4; 5 6 7 8") or something like this.

Also, I'm interested in figuring out exactly why RADICAL is failing---do you happen to have one of the failing data matrices' values that I could try manually?

@walragatver

This comment has been minimized.

Copy link
Contributor Author

commented Jun 19, 2019

Hi, @rcurtin, @zoq ,
I have discovered something weird. When I supplied the failing matrix manually. It is not failing at all. I also tried by increasing the precision of double values to 12 but I got the same result.
Here are some of the failing matrices but yes they pass when supplied manually.
For DiffAnglesTest:
Input:

{
  {0.266563, 0.661939, 0.754365},
  {0.441935, 0.0513205,  0.929337},
  {0.187137, 0.638367, 0.115132},
  {0.214277,  0.955993, 0.882105},
  {0.465047, 0.828365, 0.326111}
}

Output:

 1.31271e+15 6.81579e+14 -1.69787e+15 -9.44321e+14 3.34079e+15 1.31271e+15 6.81579e+14 -1.69787e+15 -9.44321e+14 3.34079e+15 1.31271e+15 6.81579e+14 -1.69787e+15 -9.44321e+14 3.34079e+15 

For DiffSweeps and DiffReplicatesTest (These two fail together on same input and more often than others):

Input:

{
    {0.784161, 0.780592, 0.0311524}
    { 0.0293395, 0.368261, 0.35229 }
    {0.922519, 0.183277, 0.841028 }
    {0.554937, 0.165608, 0.233458}
    {0.809767, 0.157323,  0.868404}
}

Output:

9.2652e+15, 2.04385e+16, 1.28048e+16,
-4.63089e+15, -1.17856e+15 9.2652e+15 2.04385e+16 1.28048e+16 -4.63089e+15 -1.17856e+15 9.2652e+15 2.04385e+16 1.28048e+16 -4.63089e+15 -1.17856e+15

Input (High Precision):

{
{0.024773937712, 0.814637115814, 0.505250521271},
{0.230576237157, 0.166377793940, 0.080895868448},
{0.108842626525, 0.124015884598, 0.982638085458},
{0.849357525961, 0.453377366438, 0.011420917620},
{0.826997388653, 0.397100743844, 0.157533767942}  }

The normal non failing output comes near to 10e7 and failing outputs as above(near 10e15).

In order to reproduce the error here is the small script you can run on random input,

cd /media/walragatver/LENOVO/code/mlpack/build
make mlpack_test

for (( c=1; c<=10000; c++ ))
do  
   bin/mlpack_test -t RadicalMainTest;
done

@zoq

This comment has been minimized.

Copy link
Member

commented Jun 20, 2019

In your tests did you build mlpack with -DDEBUG=ON?

@rcurtin

This comment has been minimized.

Copy link
Member

commented Jun 20, 2019

When I ran the tests, I tried with both -DDEBUG=ON and -DDEBUG=OFF and wasn't able to reproduce the memory issue either way.

@walragatver

This comment has been minimized.

Copy link
Contributor Author

commented Jun 21, 2019

When I ran the tests, I tried with both -DDEBUG=ON and -DDEBUG=OFF and wasn't able to reproduce the memory issue either way.

Yes same here.

@walragatver

This comment has been minimized.

Copy link
Contributor Author

commented Jun 21, 2019

I have found a new error.
error: Sort() detected NAN.
My #1920 PR is failing on this

I have digged somewhat and I have found following things.
When the test fails values calculated for all angles on Line 92 in radical.cpp comes out to be -inf Due to this angle value 0 is selected all the time. which results in same matY.

Also while reading about unsafe_col() in armadillo it was written that it is not alias safe. Could that be the issue for memory leak? Let me know your thoughts regarding same. I tried using cols() instead of unsafe_cols() but the testcases continue to fail.
On Line 90 unsafe_col() is used and on Line 46 an alias is used for the argument which is passed on Line 92.

@rcurtin

This comment has been minimized.

Copy link
Member

commented Jul 6, 2019

Hey @walragatver, sorry for the slow response. For the unsafe_col() bit, what actually happens when unsafe_col() is called is that a new Armadillo vector is created just as an alias to the existing memory. So if some other part of the code modifies the underlying matrix, the alias's values will change too. In general that shouldn't be a problem for what we're doing; in line 90 of radical.cpp, unsafe_col() is just used to get temporary aliases to pass to Vasicek().

The observation that the result of Vasicek() is -Inf is very helpful---this can happen only if two elements in the (sorted) vector v have the exact same value. Vasicek() is a technique that computes the m-spacing estimator of entropy as detailed in the paper: http://www.jmlr.org/papers/volume4/learned-miller03a/learned-miller03a.pdf (Section 2.1).

As far as I believe, if we have perfect representations of the data, where we are adding Gaussian noise to it, then no two points in any dataset will ever be exactly the same---the probability of drawing two identical samples from a Gaussian is zero. But... we don't have perfect representation of the data, we only have 64-bit floating point numbers. Therefore, we can have the (previously impossible) situation where two values are exactly the same. In this case, log(v[i - m] - v[i]) turns to -Inf. So, we can (kind of) compensate for this possibility by just limiting the size of what is passed to the log:

log(max(v[i - m] - v[i], 1e-50)

(I chose 1e-50 somewhat arbitrarily. I think DBL_MIN could work too.)

I think that this change to Vasicek() will actually fix the underlying issue, and then we won't have to change the input matrices. I'm still not sure about the memory error, but do you want to try this idea and see how it works, and if there's a memory issue we can continue to debug it then?

@walragatver

This comment has been minimized.

Copy link
Contributor Author

commented Jul 7, 2019

error: Sort() detected NAN.
My #1920 PR is failing on this

Okay, I have tested your solution. It's working fine. However, the above error continues to persist. /

@walragatver walragatver closed this Jul 7, 2019
@walragatver walragatver reopened this Jul 7, 2019
@walragatver

This comment has been minimized.

Copy link
Contributor Author

commented Jul 7, 2019

Sorry, accidentally I have closed this.
Okay, After digging somewhat more. I have found that the Sort() detected NAN comes when the whitened matrix at line 192 in radical.cpp comes out to be -nan. So I think root of the problem lies in the whitening of the matrix. Let me know your thoughts regarding the same.
Also if you want me to push the changes you suggested just let me know.

@walragatver

This comment has been minimized.

Copy link
Contributor Author

commented Jul 11, 2019

Hi, @rcurtin,
I have found that when one of the value in s at line 191 in radical.cpp becomes zero it leads to -nan error because of 1 / sqrt(s).
I tried to change zero value in s vector to a non-zero (like 1e-50) value but then it leads to the same output error. So I think it will be great to keep one single input instead of random input for the test as it will not lead to s becoming zero.
Let me know your thoughts regarding the same.

@rcurtin

This comment has been minimized.

Copy link
Member

commented Jul 12, 2019

@walragatver thanks for continuing to dig into this one. I have to agree, I think using a single input is sufficient for this. If you want, you can go ahead and push the changes to Vasicek() too, they won't hurt anything and they may help prevent a very unlikely failure for a user.

@walragatver walragatver force-pushed the walragatver:radical_test branch from fa85bdd to a40108e Jul 13, 2019
@walragatver

This comment has been minimized.

Copy link
Contributor Author

commented Jul 13, 2019

Hi, @rcurtin,
I have made the required changes.

@rcurtin

This comment has been minimized.

Copy link
Member

commented Jul 19, 2019

I'm trying to figure out what's wrong with the memory checks build. I can't seem to reproduce any issues locally.

@rcurtin

This comment has been minimized.

Copy link
Member

commented Jul 19, 2019

@mlpack-jenkins test this please

@walragatver walragatver force-pushed the walragatver:radical_test branch from a40108e to d48c932 Aug 2, 2019
@rcurtin

This comment has been minimized.

Copy link
Member

commented Aug 5, 2019

@mlpack-jenkins test this please

@rcurtin

This comment has been minimized.

Copy link
Member

commented Aug 5, 2019

Here's the result of the RADICAL memory checks:

Build timed out (after 180 minutes). Marking the build as aborted.

Since I can't reproduce any issues locally, I think we can ignore the check. @zoq @walragatver any thoughts on that?

@zoq

This comment has been minimized.

Copy link
Member

commented Aug 5, 2019

Let's increase the timeout for this PR and rerun the test. Will run the test on my system as well.

@zoq

This comment has been minimized.

Copy link
Member

commented Aug 5, 2019

@mlpack-jenkins test this please

@rcurtin

This comment has been minimized.

Copy link
Member

commented Aug 6, 2019

Even 280 minutes wasn't enough... let's try 1000...

@rcurtin

This comment has been minimized.

Copy link
Member

commented Aug 6, 2019

@mlpack-jenkins test this please

@walragatver

This comment has been minimized.

Copy link
Contributor Author

commented Aug 6, 2019

This is quite weird. Earlier at least it was getting completed.

@rcurtin

This comment has been minimized.

Copy link
Member

commented Aug 6, 2019

I'm watching http://ci.mlpack.org/job/pull-requests%20mlpack%20memory/2381/console now, it seems strange that the test appears to have hung after completion:

...
/home/jenkins/workspace/pull-requests mlpack memory/src/mlpack/tests/main_tests/radical_test.cpp(63): info: check CLI::GetParam<arma::mat>("output_unmixing").n_cols == 5 has passed
/home/jenkins/workspace/pull-requests mlpack memory/src/mlpack/tests/main_tests/radical_test.cpp(49): Leaving test case "RadicalOutputDimensionTest"; testing time: 5648741us
/home/jenkins/workspace/pull-requests mlpack memory/src/mlpack/tests/main_tests/radical_test.cpp(44): Leaving test suite "RadicalMainTest"; testing time: 5656602us
Leaving test module "mlpackTest"; testing time: 5663068us

*** No errors detected
[spinning icon]

I'll keep watching...

@zoq

This comment has been minimized.

Copy link
Member

commented Aug 6, 2019

This is quite weird. Earlier at least it was getting completed.

Not sure that is correct in previous runs the test was aborted after 180 minutes.

Anyway, no issue on my system, so if it runs fine on at least two machines, I think this is ready to go.

@rcurtin

This comment has been minimized.

Copy link
Member

commented Aug 8, 2019

Yeah, even after 1000 minutes the test still times out. Since we can't reproduce it, let's just go ahead and merge. Thanks @walragatver; sorry this took so long to get merged. :)

@rcurtin
rcurtin approved these changes Aug 8, 2019
Copy link
Member

left a comment

Looks good.

@rcurtin rcurtin merged commit ab6a6f1 into mlpack:master Aug 10, 2019
5 of 6 checks passed
5 of 6 checks passed
Memory Checks Build finished.
Details
LaTeX Documentation Checks Build finished.
Details
Static Code Analysis Checks Build finished.
Details
Style Checks Build finished.
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@rcurtin

This comment has been minimized.

Copy link
Member

commented Aug 10, 2019

Thanks again. 👍 I updated HISTORY.md in 6d68f7c.

@walragatver walragatver deleted the walragatver:radical_test branch Aug 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.