
LMNN: Bounds for avoiding Impostors recalculation #1466

Merged · 9 commits merged into mlpack:master on Aug 2, 2018

Conversation

manish7294 (Member) commented Jul 9, 2018

see #1448

manish7294 added some commits Jul 9, 2018

rcurtin (Member) left a comment

Looks good to me. If you want to try the few strategies I described for optimization, it could possibly help out a little bit more.

The numbers you mentioned in the issue are promising, but could you do a more in-depth dive into the behavior, like I did in #1461, to see how many prunes we get each iteration for a couple of different datasets? If we eventually write this up into some kind of paper, we'll want to have some idea of how much each optimization has helped us. It may be useful to add a timer for this specific portion of the code.

(As with the other bounds, norm-centering would help this one somewhat too, depending on the dataset.)

Precalculate(labels);

arma::mat subDataset = dataset.cols(points);
arma::Row<size_t> sublabels = labels.cols(points);

rcurtin (Member), Jul 10, 2018:

I'd try to avoid making this copy here and instead use points or points[i] in subsequent expressions.

@@ -483,9 +535,12 @@ template<typename MetricType>
inline void LMNNFunction<MetricType>::Precalculate()
{
pCij.zeros(dataset.n_rows, dataset.n_rows);
norm.zeros(dataset.n_cols);

rcurtin (Member), Jul 10, 2018:

You could just use set_size() here---zeros() will take an unnecessary pass over the array.

}
}

arma::uvec points = arma::conv_to<arma::uvec>::from(tempPoints);

rcurtin (Member), Jul 10, 2018:

Another potential opportunity for speedup would be to allocate a "large-enough" array here, which could be done in a few ways (I am not sure which would be better, feel free to try each!):

  1. Allocate points to have length equal to dataset.n_cols, then simply fill it and track how full it is. Then have Impostors() also take a parameter indicating how many elements of points should be used.

  2. Allocate points as above, but with length dataset.n_cols * 0.75 (or some tunable parameter, I am not sure which will be best). If, during the pruning check process, points becomes full, then simply shortcut and run Impostors() on everything.

  3. Take a two-pass approach, where in the first pass you compute whether or not the point is pruned, then allocate the array with the correct size. In the second pass, fill the elements. I'm hesitant to say this will be faster here, but it could be. The conv_to<>::from() operation could take a little time.
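For illustration, here is a minimal sketch of option 1; the Pruned() predicate and the extended Impostors() signature are hypothetical stand-ins, not the final API:

  // In the constructor: allocate the index buffer once, at maximum size.
  points.set_size(dataset.n_cols);

  // During the pruning check: fill a prefix of the buffer and count, rather
  // than push_back() followed by conv_to<>::from().
  size_t numPoints = 0;
  for (size_t i = 0; i < dataset.n_cols; ++i)
  {
    if (!Pruned(i)) // hypothetical per-point pruning predicate
      points(numPoints++) = i;
  }

  // Impostors() then only reads points[0 .. numPoints - 1].
  constraint.Impostors(impostors, distance, dataset, labels, points, numPoints);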

rcurtin (Member) commented Jul 10, 2018

Ah, I see also that this (like others) has a potential problem with the storage of the old transformation matrix for different batches. Let's talk about that specifically in #1461 so we (or more specifically I) don't get confused during the discussion because it is in multiple places. :)

manish7294 (Member, Author) commented Jul 11, 2018

Here are some simulations: https://gist.github.com/manish7294/b1efd856ba3013803be29801bfa4629e
I used mean-centering here, though I have a doubt about its correctness:

for (size_t i = 0; i < data.n_rows; i++)
{
  data.row(i) = data.row(i) - arma::mean(data.row(i));
}
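For what it's worth, an equivalent one-pass Armadillo form (assuming points are stored one per column, as mlpack does) would be:

  // Subtract each dimension's mean across all points.
  data.each_col() -= arma::mean(data, 1);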

It seems points change too much on covertype.

rcurtin (Member) commented Jul 12, 2018

I believe that's correct, looking at the implementation. If you want to quickly check it, just ensure that the starting kNN accuracy with and without mean-centering is the same (kNN should be invariant to mean-centering). The real check would be whether the exact kNN results, including distances, are the same, so you could do that too (see the sketch below).

The results look great. I can see that in the later iterations for vc2 and iris, very few impostors are recalculated. I'm not too surprised about the covertype dataset---I've found that its points have very large norms, so the bounds don't seem to be as effective there. Still, the mean-centering should help.

When #1461 is done let's return to this and work out the batch size issue, do some more timing simulations to convince ourselves the speedup is good, and then merge it in. 👍
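A minimal sketch of that sanity check, using mlpack's KNN typedef (data and k assumed given; standalone otherwise):

  #include <mlpack/methods/neighbor_search/neighbor_search.hpp>

  using namespace mlpack::neighbor;

  // kNN results on the original data.
  arma::Mat<size_t> n1, n2;
  arma::mat d1, d2;
  KNN knn1(data);
  knn1.Search(k, n1, d1);

  // kNN results on the mean-centered data.
  arma::mat centered = data;
  centered.each_col() -= arma::mean(centered, 1);
  KNN knn2(centered);
  knn2.Search(k, n2, d2);

  // kNN is invariant to mean-centering: distances should agree up to
  // floating-point noise (neighbor indices may differ only where distances
  // are exactly tied).
  const bool same = arma::approx_equal(d1, d2, "absdiff", 1e-10);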

manish7294 (Member, Author) commented Jul 13, 2018

Just one issue left to figure out: we are obliged to calculate k + 1 target neighbors and impostors instead of k. Now, let's say we have only 6 points in one of the classes. Ideally, we should be able to run simulations up to k = 5, but now we are restricted to k = 4. That's the reason one of the unit tests is failing.
If we look at this problem from the point of view of real datasets, we generally don't have such a low number of points in any class, so it shouldn't cause much of a problem. We could just output an error message to the user if this happens with their parameters. What do you think?

rcurtin (Member) commented Jul 13, 2018

Right, that won't typically happen in real-world datasets, but if it does, you can do this: search for k neighbors, resize the distances matrix to have k + 1 rows, then fill that last row with DBL_MAX. Then the bounds will still work correctly.
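A minimal sketch of that fallback (illustrative names; the real matrices live inside Constraints::Impostors()):

  #include <cfloat> // DBL_MAX

  // Only k real neighbors exist in this class, so search for k...
  knn.Search(k, neighbors, distances);

  // ...then pad a (k + 1)'th row so the bounds have something to compare
  // against; resize() keeps the existing k rows intact.
  neighbors.resize(k + 1, neighbors.n_cols);
  distances.resize(k + 1, distances.n_cols);
  distances.row(k).fill(DBL_MAX);
  neighbors.row(k).fill(0); // never used, since its distance is DBL_MAX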

manish7294 (Member, Author) commented Jul 13, 2018

Yeah, that would solve the problem; just an extra check like this inside LMNNFunction() would be required:

  size_t minCount = arma::min(arma::histc(labels, arma::unique(labels)));
  if (minCount < k + 1)
  {
    // use DBL_MAX
  }
rcurtin (Member) commented Jul 13, 2018

Couldn't you just check the size of each of the training sets right before the Search() call instead of calling histc()?

manish7294 added some commits Jul 19, 2018

manish7294 (Member, Author) commented Jul 19, 2018

@rcurtin I think it's now ready for review.

manish7294 (Member, Author) commented Jul 20, 2018

@rcurtin It looks like when we calculate targetNeighbors for k + 1, the first k rows of the result are a bit different than when we calculate targetNeighbors for k. Ideally, the first k rows from the k + 1 calculation should be the same as the result when k target neighbors are calculated. Impostors has the same issue.

To cross-check this, I ran mlpack_knn twice, once with k = 3 and once with k = 4, and it seems the entries of some of the columns get a kind of shuffling / re-ordering effect when k changes.

rcurtin (Member) commented Jul 20, 2018

Can you file an issue for the KNN bug so that I can try and reproduce it? If the distances are exactly the same, the results may come back in a slightly different order, but the distance should always be exactly the same.

manish7294 (Member, Author) commented Jul 20, 2018

Right, I am filing a bug issue.

rcurtin (Member) left a comment

The code looks good to me, but the tests are failing, so something must be wrong. Do you have any intuition on what the issue is?

Once we get the tests passing, we should run some benchmarks to take a look at what the speedup is, and then we can merge it.


// Store impostors.
outputNeighbors.cols(points.elem(subIndexSame)) = neighbors;
outputDistance.cols(points.elem(subIndexSame)) = distances;

rcurtin (Member), Jul 20, 2018:

This is really a minor comment but it looks like there is an extra space in these two lines.

for (size_t i = 0; i < dataset.n_cols; i++)
{
  if (transformationDiff * (2 * norm(i) + norm(impostors(k - 1, i)) +
      norm(impostors(k, i))) > distance(k, i) - distance(k - 1, i))

rcurtin (Member), Jul 20, 2018:

In the situation from #1470 where the neighbors indices differ, distance(k, i) - distance(k - 1, i) == 0 so unless all the points are identically 0 then the bounds will always be recalculated, so I think everything is fine at least on this line.
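(For reference, the bound in the snippet above follows from the triangle inequality: for any two points x_i and x_j,

  | ||L x_i - L x_j|| - ||L_old x_i - L_old x_j|| | <= ||L - L_old|| * (||x_i|| + ||x_j||),

so the k'th and (k - 1)'th impostor distances can together shift by at most transformationDiff * (2 * norm(i) + norm(impostors(k - 1, i)) + norm(impostors(k, i))). If that is smaller than the gap distance(k, i) - distance(k - 1, i), the two impostors cannot swap order and the recalculation for point i can be skipped.)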

}
}

points.resize(numPoints);

rcurtin (Member), Jul 20, 2018:

resize() will incur a copy, so if you really want to be tricky here, hold points as a member of LMNNFunction (so it only gets allocated once in the constructor), and then don't call resize() but instead pass numPoints to Impostors(), then use subvec() inside of the overload of the Impostors() that you've written for this PR.
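A sketch of the callee side under that scheme (names illustrative; assumes numPoints > 0):

  // Inside the Impostors() overload: operate only on the filled prefix of
  // points, without copying the whole index vector.
  arma::mat subDataset = dataset.cols(points.subvec(0, numPoints - 1));
  arma::Row<size_t> subLabels = labels.cols(points.subvec(0, numPoints - 1));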

manish7294 (Author), Jul 21, 2018:

Done!

-// Re-calculate impostors on transformed dataset.
-constraint.Impostors(impostors, distance, transformedDataset, labels);
+  // Re-calculate impostors on transformed dataset.
+  constraint.Impostors(impostors, distance, transformedDataset, labels);

rcurtin (Member), Jul 20, 2018:

I think the extra indent wasn't necessary here.

manish7294 (Author), Jul 21, 2018:

Done!

 double transformationDiff = 0;
-if (!transformationOld.is_empty())
+if (!transformationOld.is_empty() && iteration++ % range == 0)

rcurtin (Member), Jul 20, 2018:

transformationDiff is used every iteration, so shouldn't we avoid adding the && iteration++ % range == 0 here?

manish7294 (Author), Jul 21, 2018:

Done!

manish7294 (Member, Author) commented Jul 20, 2018

One of the accuracy tests over the simple 6-point dataset using AMSGrad is failing somehow, and it's weird that I am not getting that failure on my system. Debugging this is what led me to #1470 earlier, as there is some deviation in the final results because of it. Other than that, I can't think of anything else that could be wrong.

rcurtin (Member) commented Jul 20, 2018

You can try changing the test dataset very slightly so that there are no kNN distances that are the same, i.e., add a random perturbation of less than 0.01 to each coordinate. If that fixes things, then I'll join you in thinking about how the issue in #1470 affects the results.
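For example, something like this (the exact scale is arbitrary, as long as it is small):

  // Break exact distance ties with tiny uniform noise in [0, 0.01).
  dataset += 0.01 * arma::randu<arma::mat>(arma::size(dataset));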

manish7294 (Member, Author) commented Jul 20, 2018

I performed some simulations, and I think having a dataset with distinct distances to all neighbors solves the issue. Even vc2 works perfectly.

manish7294 (Member, Author) commented Jul 20, 2018

I think the deviation is to be expected, as we are actually training over a different permutation of points. A point may have a greater impact along some dimension than another point while still having the same distance from the query point.

rcurtin (Member) commented Jul 20, 2018

I see. Give me some time to think about it, and you can think about it also, and let's see if we can figure out a way to handle it. The answer may be requiring that NearestNeighborSearch gives us back deterministic orderings of points, or the answer may be that we specifically look through our kNN results and reorder neighbor indices when distances are the same.

manish7294 (Member, Author) commented Jul 20, 2018

On second thought, though the result is a bit different, the learned distance will still be correct, as we are just using a different set of constraints that satisfy the conditions of LMNN.

rcurtin (Member) commented Jul 21, 2018

Unfortunately I've done a little bit of thinking about the exact-same-distance issue and I've come to two conclusions:

  1. Right now all bounds we are computing are with respect to the k'th impostor. In essence these bounds are saying, "the distance between the point and its k'th impostor next iteration cannot be greater than (our bound quantity)". But remember that this depends on the norm of the impostor. So what if the (k - 1)'th impostor has huge norm but the k'th impostor has very small norm? In this case it is possible that the (k - 1)'th impostor will be further away next iteration than the k'th impostor. That means that, with respect to our bounds, it's possible that our current bounds could say "it's not possible that the (k + 1)'th impostor and the k'th impostor could switch places in the distance order" and be correct---but at the exact same time it could be possible that the (k - 1)'th and (k + 1)'th impostor would switch places. (Or something like this.) At the end of my long explanation here, it's clear that the only solution is that we have to rework the bounds a little, to take the max over all k possible bounds. (Also note that for k == 1, there is no problem, and I think the bounds being incorrect would be very rare in practice.) Let's worry about this issue in another PR or something since I don't believe it is causing any of our issues.

  2. The ordering of the results does matter because it will affect the gradient direction. If the impostor ordering is switching every iteration, this will cause trouble for the gradient (I believe). In any case, the order in which we want our results to be is in order of increasing norm. So, e.g., if the k'th and (k + 1)'th impostor have the same distance to the query point, then we'd prefer to have it in the order that the k'th impostor has smaller norm than the (k + 1)'th impostor. But since this is problem-specific to our LMNN optimizations, it doesn't make sense to me to do this computation in the NeighborSearch code. So I believe the best way to do this will be to update Impostors() so that it runs a loop after calling Search() to ensure that the ordering is correct. I worked out what I think is some code that will do this, but I did not have a chance to fully integrate this into the different Constraints::Impostors() overloads. On the laptop I'm using I don't actually have a way to middle-click, so I can't copy-paste it, but I'll email the function to you separately.
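(Not the emailed function, but a minimal sketch of the idea: a post-Search() pass over each column that reorders neighbors within runs of equal distance so that the smaller-norm point comes first; the norms vector is assumed precomputed.)

  #include <utility> // std::swap

  for (size_t i = 0; i < neighbors.n_cols; ++i)
  {
    for (size_t j = 1; j < neighbors.n_rows; ++j)
    {
      // Insertion sort restricted to tied runs: the distances are equal, so
      // only the neighbor indices need to move.
      size_t p = j;
      while (p > 0 && distances(p, i) == distances(p - 1, i) &&
             norms(neighbors(p, i)) < norms(neighbors(p - 1, i)))
      {
        std::swap(neighbors(p, i), neighbors(p - 1, i));
        --p;
      }
    }
  }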

manish7294 (Member, Author) commented Jul 21, 2018

> Right now all bounds we are computing are with respect to the k'th impostor. [...] it's clear that the only solution is that we have to rework the bounds a little, to take the max over all k possible bounds. [...] Let's worry about this issue in another PR or something since I don't believe it is causing any of our issues.

Agreed, this definitely is something that had been overlooked. It seems doing this will be a bit tricky.

> The ordering of the results does matter because it will affect the gradient direction. [...] So I believe the best way to do this will be to update Impostors() so that it runs a loop after calling Search() to ensure that the ordering is correct. [...] I'll email the function to you separately.

Right, that will be the better option compared to making changes to NeighborSearch. I saw the code, and it looks good to me. I don't think the computation cost will be an issue compared to other parts of LMNN.

manish7294 added some commits Jul 22, 2018

rcurtin (Member) commented Jul 24, 2018

I think everything is working here but the random test failure is due to #1469 not being merged yet. Could you run the same simulations we've run on other PRs to see the runtimes with and without this optimization? (It would be useful to calculate the accuracy too because it is easy to do, just as a sanity check to make sure nothing is wrong. Ideally it should be the same.)

rcurtin (Member) commented Jul 24, 2018

It's also not clear to me why the build should take any longer than before---usually the build/install step takes roughly 2850s, but in the Travis build here it's more like 3200s, so the build runs out of time. I think that too is random and we don't need to worry about it. There's nothing in this PR that would cause an additional 350s of build time.

manish7294 (Member, Author) commented Jul 24, 2018

I guess I am not in a position to comment on the train runtime, as I haven't debugged much here. But intuitively I think this may be due to BBSGD getting stuck, as I have seen it do so earlier as well, though it doesn't make any sense. Probably #1469 will solve this.

rcurtin (Member) commented Jul 24, 2018

Sorry, I'm not sure I understand what your comment is referring to. Can you clarify?

manish7294 (Member, Author) commented Jul 24, 2018

Ah! Sorry, it's autocorrect again; I meant Travis build time.

manish7294 (Member, Author) commented Jul 24, 2018

Currently I am travelling back to my college and commenting from my phone, so a lot of autocorrect mistakes can happen :)

rcurtin (Member) commented Jul 24, 2018

No worries, thanks for the clarification. I turn autocorrect off but then I just get lots of spelling errors. :)

I don't think we should worry about the Travis build time---I think we are just unlucky. I'll try the build again a few times tomorrow and see if that helps; otherwise I (or you) could just time how long it takes to build the master branch and also this branch and make sure there is no significant difference.

manish7294 (Member, Author) commented Jul 25, 2018

I think I may have a reason for Travis timeout.

In BBSGD we update the iterate using the lines below:

// Update the iterate.
iterate -= stepSize * gradient;

But if the gradient is a zero matrix, the iterate will never be updated, and the same thing will continue forever. So basically, we will be calling Evaluate() on the same iterate matrix again and again.

Does this sound reasonable?

rcurtin (Member) commented Jul 25, 2018

The timeout does not seem to have to do with the testing phase. The Travis build is split into a build phase, which usually takes ~2800 seconds, and a test phase. But for this PR the Travis build step is taking ~3200-3500 seconds, and that time does not include the time to run the tests. Here's an example successful build, where the build step for the first configuration takes only 2820 seconds:

https://travis-ci.org/mlpack/mlpack/builds/407688072

So I don't believe the issue is in the tests. I believe it is some strange Travis artifact and I don't see any reason that that behavior should continue after the merge, so I would be happy to merge it in spite of these strange results.

Next, about BBSGD: you are right that the situation could occur that the gradient matrix is identically zero and in this case no step will be taken. However, remember that BBSGD works on batches of points---so the next batch is likely to have a different gradient that will not be zero, and so a step will be taken. Also, note that BBSGD does have a tolerance so if the gradient becomes zero, it will terminate at the end of the epoch.

So I don't expect that we are seeing any issues as a result of zero gradients in BBSGD. If you've noticed cases where that indeed does happen, then let's discuss it, but I don't think it's happening here.

Anyway, that aside:

> Could you run the same simulations we've run on other PRs to see the runtimes with and without this optimization? (It would be useful to calculate the accuracy too because it is easy to do, just as a sanity check to make sure nothing is wrong. Ideally it should be the same.)

Have you had a chance to run these numbers yet? :)

manish7294 (Member, Author) commented Jul 26, 2018

Here are some simulations - https://gist.github.com/manish7294/01c192dee660bb09baf27841c84c9b70
Sorry for the delay.

manish7294 (Member, Author) commented Jul 26, 2018

Do these results seem meaningful to you?

rcurtin (Member) commented Jul 26, 2018

I am not sure; I think you have only run it for one optimizer, and it's important to run it for all three because Evaluate() and EvaluateWithGradient() each use these bounds differently. I was thinking that I would simply devote some time and run some experiments myself to try to understand why there is no speedup in the covertype and letters cases.

rcurtin (Member) commented Jul 28, 2018

Just an update, I have not collated my results entirely yet but I have seen promising results on large covertype datasets and MNIST:

covertype-5k lbfgs: 66 iterations

        total     computing_neighbors     acc before      acc after
old:    80.758s   21.672s                 75.56%          79.98%
new:    75.456s   15.992s                 75.56%          79.98%

covertype-5k amsgrad: 50 passes, batch size 256

        total     computing_neighbors     acc before      acc after
old:    136.093s  16.157s                 75.56%          75.54%
new:    137.484s  16.813s                 75.56%          77.26%


        total     computing_neighbors     building    acc before      acc after
old:    249.279s  128.153s                5.587s      90.11%          91.796%
new:    620.516s  340.776s                ~100s       90.11%          91.862%

The first MNIST run, on the 10k test points (784d) with L-BFGS and 25 maximum iterations, took ~3670s before this optimization and ~3697s afterwards. So I guess there is not much optimization possible here, but I want to investigate that more. Alternatively, if you would like to determine whether or not any impostors are being pruned with the MNIST dataset, I'd be interested to see the results.

Overall I'm fairly convinced this is good and we should merge it, but I just want to make sure and see a few more results (mostly with large datasets) first. I will be busy this weekend, though, so I won't really have a chance to run anything; if you're able to get to it before me, great, otherwise feel free to wait.

rcurtin (Member) commented Jul 28, 2018

Note also, I believe the accuracy difference in some cases is due to our slight changes in the gradient calculation.

manish7294 (Member, Author) commented Jul 28, 2018

Right, the results seem good enough. Sure, I will perform some pruning simulations like you did earlier. Can I still use Slack?

rcurtin (Member) commented Jul 29, 2018

I think that should be fine; I am not sure what the long-term status of that system will be, as I have not heard back yet. But it is up for now.

manish7294 (Member, Author) commented Jul 30, 2018

@rcurtin Here are some simulations, mainly focusing on the full covertype and the MNIST test (10k) datasets: https://gist.github.com/manish7294/f1f478d06ea91c91a42f059e3fabde42
Sorry, the actual iteration number is off by one; I added the print statement after the iteration update. We can see the effects of the impostor bounds on covertype starting from iteration 40, whereas I don't think MNIST is going to react anytime soon, as even after 50 iterations transformationDiff is still quite big.

rcurtin (Member) commented Jul 30, 2018

Looks good. Is mean-centering being performed before these simulations? It could give an additional gain (probably slight). We could address that in a separate PR also.

Overall I think nothing is wrong, and it's nice to see that even on large datasets eventually the impostor bounds start making a huge difference. I think you are right, with MNIST it's likely to not make a difference until way later into the optimization because the norm of the transformation difference is so large.

I know I am asking for a lot, but would it be possible to plot the same information for the amsgrad optimizer? I'm curious; I think that with an optimizer like that, the "big" steps may happen more quickly.

I don't see any other changes necessary in this PR so I think it is ready to go, I just want to make sure that the optimization behaves how we expect first.

Thanks for the hard work! 👍

manish7294 (Member, Author) commented Jul 30, 2018

This time I used mean-centering, but it doesn't look like any pruning happened, at least for the first 50 passes. Just to make sure everything is fine, I ran a simulation on iris as well, for which pruning started from very early stages: https://gist.github.com/manish7294/baecea1fcf0553f73dfe7c67d1eb6e58

rcurtin (Member) left a comment

Great, thanks---I agree with your observation. I think with more passes SGD-type optimizers would start pruning, but at least for covertype and MNIST it seems like it takes a long time to get there.

I'll go ahead and merge this in 3 days to leave time for any other comments. If you want to also work on a PR to do mean centering in lmnn_main.cpp (as a PARAM_FLAG() I guess), I think that would be a nice addition also.

Nice work with this one---it can provide a pretty significant speedup in some cases. 👍

rcurtin merged commit c7ec18f into mlpack:master on Aug 2, 2018

4 of 5 checks passed:

continuous-integration/travis-ci/pr: The Travis CI build could not complete due to an error
Memory Checks: Build finished.
Static Code Analysis Checks: Build finished.
Style Checks: Build finished.
continuous-integration/appveyor/pr: AppVeyor build succeeded