CosineTreeTest/CosineNodeCosineSplit test sometimes fails, only on i386 #358
Reported by rcurtin on 7 Jul 44943747 12:11 UTC
To reproduce it, do this:
Then you can hardcode the failing random seed for the purposes of debugging.
The cosine tree is built by choosing a basis vector (that is, a point from the dataset) and calculating the cosine distance between that basis vector and all other points. Points with large cosines go to the left child, and points with small cosines go to the right child. The currently implemented split is (I think) at the median cosine, which results in a tree where the left and right children have the same number of points. This process is repeated iteratively. (The details are in the paper.)
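To make the split described above concrete, here is a minimal sketch in plain Python. It is not the mlpack implementation: the names `cosine` and `median_cosine_split` are hypothetical, the basis point is left in the point list for simplicity, and splitting at the midpoint of the sorted order is only my reading of "median cosine".

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def median_cosine_split(points, basis):
    """Split points into (left, right); points with larger cosines go left.

    Assumption: the split is at the midpoint of the cosine-sorted order,
    so the two children differ in size by at most one point.
    """
    ranked = sorted(points, key=lambda p: cosine(p, basis), reverse=True)
    half = (len(ranked) + 1) // 2
    return ranked[:half], ranked[half:]

points = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (-1.0, 1.0)]
left, right = median_cosine_split(points, basis=(1.0, 0.0))
```

With this toy data, the two points closest in angle to the basis land in `left` and the other two in `right`.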
The tests appear to be failing because in some cases, points in the right child appear to have a larger cosine to the basis vector than points in the left child. Using valgrind, I've checked for memory leaks and other issues that may only show up on i386, and fixed what I found (a missing destructor) in r17485, but this did not fix the issue.
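The property the test seems to be checking can be written down as a small predicate. This is a sketch under assumptions, not the actual test code: `split_is_ordered` and its `tol` parameter are hypothetical. The tolerance only guards against near-ties in floating-point comparisons; whether rounding has anything to do with this failure is not established.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def split_is_ordered(left, right, basis, tol=1e-12):
    """True if no right-child point has a larger cosine than any left-child point.

    tol is a hypothetical slack for near-ties; it is not taken from the test.
    """
    if not left or not right:
        return True  # nothing to compare against
    min_left = min(cosine(p, basis) for p in left)
    max_right = max(cosine(p, basis) for p in right)
    return max_right <= min_left + tol
```

Running a check like this on a tree built with the failing seed would show which node first violates the ordering.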
So, the bug may be either in the cosine tree code or in the test itself; I don't know enough to say. But anyone wishing to solve this bug should spend a little time understanding the basics of the cosine tree and working out what problem (if any) there is with the cosine trees that fail the test; at that point maybe the solution will be clear. For the 1.0.11 release I've commented the test out and noted that the bug is present (r17488).
Siddharth, I've CC'ed you just in case my description is incorrect or anything like that. If you have thoughts, feel free, but if not, don't feel obligated. I would call this relatively low priority since i386 is far less important these days. :)
Commented by siddharth.950 on 10 Mar 44943832 22:41 UTC
This also reminds me of a problem I had noticed with the mlpack tests. We randomly initialize the data matrices used to test the implementations, but in Armadillo the 'random' values are the same each time. In practice this makes the tests use the same matrices again and again, so we are less likely to encounter any flaws in the code or the tests. I think we should set the seed as you have done in the description and then initialize the matrices.
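The idea of seeding first and then initializing the matrices could look roughly like this. It is a plain Python sketch, not Armadillo or mlpack code: `make_test_matrix` is a hypothetical helper, and taking the seed from the clock is just one possible choice.

```python
import random
import time

def make_test_matrix(rows, cols, seed):
    """Build a rows x cols matrix of uniform values from a dedicated seeded generator."""
    rng = random.Random(seed)  # own generator, so other code cannot disturb the sequence
    return [[rng.random() for _ in range(cols)] for _ in range(rows)]

# Pick a different seed on each run and print it, so that a failing run
# can be replayed later by hardcoding the printed value.
seed = int(time.time())
print("random seed:", seed)
data = make_test_matrix(3, 4, seed)
```

The same seed always reproduces the same matrix, so printing it gives varied test data without losing the ability to debug a specific failure.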
Commented by rcurtin on 6 May 44946312 07:54 UTC