LMNN: Bounds for slack term #1461

Merged
12 commits merged into mlpack:master, Jul 18, 2018

Conversation
manish7294 commented Jul 4, 2018

See #1449

manish7294 added some commits Jul 4, 2018

manish7294 commented Jul 5, 2018

@rcurtin Implementing this bound has caused a slight increase in the program's runtime; please have a look at it.

rcurtin commented Jul 5, 2018

What have you tried to debug the lack of speedup?

2 * norm(i));
if (eval > -1)
{
// Calculate exact eval.

manish7294 Jul 5, 2018

I think this may be the reason for the lack of speedup. Here, each time eval > -1, we are forced to do the computation twice: once for the eval bound and once for the exact eval value.

rcurtin Jul 5, 2018

Yes, but your simulations in #1449 showed that a large number of the eval calculations could be skipped. Have you tried finding out if we are skipping the eval calculation the expected number of times?

manish7294 Jul 5, 2018

I think I made some serious mistake while doing those simulations; I can't seem to reproduce those results. I even tried LMNN versions from the previous 12 days, but they all give the same results, totally opposite to the simulations in #1449. According to the current simulation, there are very few points on which eval < -1.

rcurtin Jul 5, 2018

I'm virtually certain your simulations were correct---so, under that assumption, what's different between what you did there and what's implemented here?

manish7294 Jul 5, 2018

I think I did the same thing: I initialized three count variables, and after each eval calculation I used three if / else if conditions to increase the counts based on the eval value; just before the return statement I print the values.

manish7294 Jul 5, 2018

That's what's strange: the result is not the same. I even printed the eval values.

rcurtin Jul 5, 2018

I'd suggest starting the debugging from square one. From my side I am quite confident that the original theory and your simulations in #1449 were correct (and this is supported by the observations in the JMLR paper). So start from the beginning, make sure that you can get the original observations you were seeing in #1449 with unmodified code again, and then take a look at your implementation here and make sure that you still can.

Looking at the code, my strong belief is that simply counting the number of evals less than -2, between -2 and -1, and greater than -1 in the code you have here should provide the same result. But I believe that the bound implementation itself is wrong. However, that is just my intuition, which is no substitute for digging into the problem and figuring out what is wrong.

manish7294 Jul 5, 2018

I couldn't get those results on iris and vc2, but with the letters dataset it is much better:

(eval  > -1) --- (-2 < eval < -1) --- (eval < -2)
[DEBUG] Compiled with debugging symbols.
30240---28381---69168
[DEBUG] L-BFGS iteration 0; objective 124341, gradient norm 141545, 0.
13051---20816---77796
[DEBUG] L-BFGS iteration 1; objective 93106.6, gradient norm 44531.9, 0.2512.
5137---23661---75860
[DEBUG] L-BFGS iteration 2; objective 49849.1, gradient norm 19805.7, 0.464602.
30---22888---77110
[DEBUG] L-BFGS iteration 3; objective 22247.1, gradient norm 17154.7, 0.55371.
0---636---99364
[DEBUG] L-BFGS iteration 4; objective 9315.77, gradient norm 14277.2, 0.58126.
0---8---99992
[DEBUG] L-BFGS iteration 5; objective 5669.07, gradient norm 10134.7, 0.391455.
0---0---100000

I think I probably made some mistake earlier.

rcurtin Jul 5, 2018

It may be a little time consuming, but I think it's very important to understand what is going on here. So I think it is worthwhile to take the time to dig in further and see for sure whether you made a mistake earlier or what. Since I can't see the code you are using, I can't review it myself. And the output of the code you just ran looks a little suspicious to me, since there are 0 evals greater than -1 by the fifth L-BFGS iteration. That doesn't seem right to me.

During this whole process I would say it's really important to keep rigorous and detailed notes to avoid confusion like this, since if we want to eventually publish on it we need to be 100% sure that what we're writing is correct.

manish7294 Jul 6, 2018

Ideally, it should be like this:

(eval  > -1)----(-2 < eval < -1)----(eval < -2)
1298----862----1590
944----656----2150
667----426----2657
921----647----2182
655----408----2687
898----636----2216
646----391----2713
866----633----2251
640----375----2735

But using these new bounds results are:

(eval  > -1)----(-2 < eval < -1)----(eval < -2)
0----0----3750
0----0----3750
0----0----3750
0----0----3750
0----0----3750
0----3750----0
0----3750----0
0----3750----0
0----3750----0

For cross-checking: if I bypass our bounds by setting if (transformationOld.n_elem != 0) to false, then we get the ideal results.

manish7294 commented Jul 6, 2018

@rcurtin I think the implementation is now fully refined. The runtime has also improved w.r.t. the current LMNN implementation.

manish7294 commented Jul 6, 2018

The thing I am going to write will contradict my above comment; sorry about that.
The runtimes are good and on par with earlier timings, but I can't strongly say that they are much better.
Here's why I think that is: our bounds algorithm efficiently prunes most of the points for which eval < -1, meaning those points never reach the bound-checking condition eval = evalOld + transformationDiff * (norm(targetNeighbors(j, i)) + norm(impostors(l, i)) + 2 * norm(i)), and the points that do pass through it are mostly those with eval > -1, leading to a double calculation (first we calculate the upper bound for eval, and then we are forced to calculate the exact eval as well). Yesterday's eval simulations, which were done without commenting out if (eval < -1) { bp = l; break; }, also support this.
In general, for every few thousand points with eval > -1, we encounter only a few hundred with eval < -1; all the others are pruned.

transformedDataset.col(targetNeighbors(j, i))) -
distance(l, i);
eval = evalOld + transformationDiff * (norm(targetNeighbors(j, i)) +
norm(impostors(l, i)) + 2 * norm(i));

rcurtin Jul 9, 2018

Right, caching the norms for this bound can make a huge difference for speedup.

manish7294 Jul 9, 2018

Ya, these are cached norms (they just have the same name as arma::norm) :)
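For reference, a minimal sketch of how such a cache could be filled (assuming the member declaration quoted just below; the loop itself is illustrative, not necessarily the PR's exact code):

norm.set_size(dataset.n_cols);
for (size_t i = 0; i < dataset.n_cols; ++i)
  norm(i) = arma::norm(dataset.col(i));  // One arma::norm() call per point, once.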

//! Holds the norm of each data point.
arma::vec norm;
//! Hold previous eval value.
double evalOld;

rcurtin Jul 9, 2018

This confuses me a little bit---shouldn't we cache one evalOld for each query point / impostor combination? I don't think it should just be one eval for the whole dataset.

manish7294 Jul 9, 2018

Right, it seems like a big mistake. I think it's just a coincidence that the results didn't differ much :)

rcurtin Jul 9, 2018

No worries, if you want to fix it and push the change we can see if there is a big improvement. 👍

manish7294 Jul 9, 2018

I have made the changes. The results are now probably more consistent, but sadly there is no improvement in the runtimes :(

rcurtin Jul 9, 2018

Then probably something is not right in the code. I also don't know what "no improvement" means, since you've already pointed out that the runtime is improved (but I don't know how much). I'd suggest that we start debugging here, so some things to check:

  • How tight should the bound be? Remember that the bound term is ||L_{t - 1} - L_t|| * (||x_a|| + ||x_b|| + 2 ||x_i||)---is that typically large or small? (A quick arithmetic check follows this list.) Norm-centering the dataset can also help reduce the norms on average.

  • If the step size is sufficiently small, then || L_{t - 1} - L_t || should be so small that basically all evaluations should be bounded. So you should be able to see, as the number of iterations increase (and the steps get smaller), that more and more pruning can happen. If that's not happening, there's likely to be a bug.

  • The results may be different for different datasets. So we should always be careful to run the benchmarks against a handful of datasets instead of just one or two, before saying that there is no improvement.
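As a rough arithmetic check on the first point (numbers invented purely for illustration): if || L_{t - 1} - L_t || = 10^-6 and || x_a || + || x_b || + 2 || x_i || = 10^4, the bound term is only 0.01, so any cached eval below -1.01 can still be pruned; but if || L_{t - 1} - L_t || = 0.5 with the same norms, the bound term is 5000 and nothing will be pruned.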

I think that in this case there is likely to be some bug though. I'll take a look at the code when I have a chance.

manish7294 Jul 10, 2018

Just for confirmation, have you read #1461 (comment)?

rcurtin Jul 10, 2018

Yes, I read it, but at that time the code was only storing one eval for the entire dataset, which is definitely not right. In fact I have found that the code never actually prunes anything. Let me write up a longer comment shortly.

rcurtin left a comment

I checked out the code and ran some simulations. Please go ahead and reproduce what I'm doing here for yourself so that we can know we're on the same page, and please try and look at the level of detail provided here and reproduce that so that we can be on the same page as we debug these things in the future. :) (I know it takes a long time, but it is quicker in the long term to spend a long time writing and describing exactly what is going on, because as you write the description you may find you want to ask additional questions about what is going on in your writeup, and when you investigate these questions, you may figure out nice ways of speedup. I have done that here actually.)

The first thing that I did was modify the master branch mlpack_lmnn program to add a timer lmnn_function_evaluate_with_gradient, which calls Timer::Start() at the start of EvaluateWithGradient() and Timer::Stop() right before the return statement. I also added a timer lmnn_function_compute_evals, which times the loop for (size_t i = 0; i < dataset.n_cols; i++) in which each eval is computed.
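For concreteness, here is a minimal sketch of that instrumentation, assuming mlpack's Timer API; the loop body is a stand-in for the real per-point eval work, not the actual LMNN code:

#include <mlpack/core.hpp>

using namespace mlpack;

// Sketch: time the whole function and, separately, the per-point eval loop.
double TimedEvaluateWithGradientSketch(const arma::mat& dataset)
{
  Timer::Start("lmnn_function_evaluate_with_gradient");

  double cost = 0.0;
  Timer::Start("lmnn_function_compute_evals");
  for (size_t i = 0; i < dataset.n_cols; i++)
  {
    // Stand-in for the eval computations done for each point.
    cost += arma::dot(dataset.col(i), dataset.col(i));
  }
  Timer::Stop("lmnn_function_compute_evals");

  Timer::Stop("lmnn_function_evaluate_with_gradient");
  return cost;
}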

This gives us a baseline to work with. Just to pick only one dataset, I chose covertype-5k, for which I got the following results:

 $ bin/mlpack_lmnn -i ~/datasets/covertype-5k.train.csv -l ~/datasets/covertype-5k.train.labels.csv -v -O lbfgs -o output.csv --print_accuracy | grep -v 'node combinations\|base cases'
[INFO ] Loading '/home/ryan/datasets/covertype-5k.train.csv' as CSV data.  Size is 54 x 5000.
[INFO ] Loading '/home/ryan/datasets/covertype-5k.train.labels.csv' as raw ASCII formatted data.  Size is 5000 x 1.
[INFO ] Initial learning point have invalid dimensionality.  Identity matrix will be used as initial learning point for optimization.
[INFO ] Accuracy on initial dataset: 73.94%
[INFO ] Accuracy on transformed dataset: 79.82%
[INFO ] Saving CSV data to 'output.csv'.
[INFO ] 
...
[INFO ] Program timers:
[INFO ]   computing_neighbors: 27.892385s
[INFO ]   line_search: 47.787443s
[INFO ]   lmnn_function_compute_evals: 3.029449s
[INFO ]   lmnn_function_evaluate_with_gradient: 47.942906s
[INFO ]   lmnn_optimization: 47.974576s
[INFO ]   loading_data: 0.034106s
[INFO ]   saving_data: 0.001417s
[INFO ]   total_time: 48.374898s
[INFO ]   tree_building: 2.313428s

So we can see that with this dataset, the computation of each eval is a limited part of the runtime. Then, I made the same timer modifications to the branch in this PR and found that the runtime was slightly longer.

My immediate thought was that no pruning was happening. So in order to diagnose this, I modified the code to print a lot of diagnostic information, including:

  • How many eval computations were actually pruned each call to EvaluateWithGradient().
  • How many eval values were actually greater than -1.
  • How many eval values were less than -1 but greater than -2.
  • How many eval values were less than -2.
  • (It is commented out) Printing all the necessary information for the bound computation for each point, so I can see what was happening.

Here is a gist that includes those changes (plus a fix, so, sorry it is a spoiler :)): https://gist.github.com/rcurtin/c29a7a513ac070aa407a000aa5a7a80b
I'd ask you to check it out and run it.

Without the modification to when evalOld is set, I ran, and received this (partial) output:

Evaluate(): 0 pruned of 5000, transformation diff 0.  Active: 1314, close 20, inactive 3666.
Evaluate(): 0 pruned of 5000, transformation diff 0.611229.  Active: 1394, close 29, inactive 3577.
Evaluate(): 0 pruned of 5000, transformation diff 0.596621.  Active: 3606, close 86, inactive 1308.
Evaluate(): 0 pruned of 5000, transformation diff 0.46502.  Active: 2716, close 86, inactive 2198.
Evaluate(): 0 pruned of 5000, transformation diff 0.23251.  Active: 2763, close 119, inactive 2118.
Evaluate(): 0 pruned of 5000, transformation diff 0.0796543.  Active: 2334, close 116, inactive 2550.
Evaluate(): 0 pruned of 5000, transformation diff 0.197749.  Active: 2860, close 150, inactive 1990.
Evaluate(): 0 pruned of 5000, transformation diff 0.184094.  Active: 2314, close 192, inactive 2494.
Evaluate(): 0 pruned of 5000, transformation diff 0.175174.  Active: 2507, close 226, inactive 2267.
Evaluate(): 0 pruned of 5000, transformation diff 0.172605.  Active: 2540, close 264, inactive 2196.
Evaluate(): 0 pruned of 5000, transformation diff 0.0748272.  Active: 2573, close 346, inactive 2081.
Evaluate(): 0 pruned of 5000, transformation diff 0.0362764.  Active: 2625, close 389, inactive 1986.
Evaluate(): 0 pruned of 5000, transformation diff 0.0883931.  Active: 3161, close 514, inactive 1325.
Evaluate(): 0 pruned of 5000, transformation diff 0.132874.  Active: 2820, close 574, inactive 1606.
Evaluate(): 0 pruned of 5000, transformation diff 0.0685879.  Active: 2711, close 594, inactive 1695.
Evaluate(): 0 pruned of 5000, transformation diff 0.0152386.  Active: 3138, close 686, inactive 1176.
Evaluate(): 0 pruned of 5000, transformation diff 0.0374723.  Active: 3234, close 756, inactive 1010.
Evaluate(): 0 pruned of 5000, transformation diff 0.0398768.  Active: 3300, close 759, inactive 941.
Evaluate(): 0 pruned of 5000, transformation diff 0.0250394.  Active: 3235, close 816, inactive 949.
Evaluate(): 0 pruned of 5000, transformation diff 0.0201565.  Active: 3326, close 781, inactive 893.
Evaluate(): 0 pruned of 5000, transformation diff 0.0150167.  Active: 3241, close 818, inactive 941.
Evaluate(): 0 pruned of 5000, transformation diff 0.0299163.  Active: 3096, close 816, inactive 1088.
Evaluate(): 0 pruned of 5000, transformation diff 0.0330943.  Active: 3082, close 828, inactive 1090.
Evaluate(): 0 pruned of 5000, transformation diff 0.0803452.  Active: 2866, close 817, inactive 1317.
Evaluate(): 0 pruned of 5000, transformation diff 0.0550873.  Active: 2879, close 873, inactive 1248.
Evaluate(): 0 pruned of 5000, transformation diff 0.0875116.  Active: 2969, close 887, inactive 1144.
Evaluate(): 0 pruned of 5000, transformation diff 0.0889569.  Active: 2995, close 930, inactive 1075.
Evaluate(): 0 pruned of 5000, transformation diff 0.148345.  Active: 2861, close 1048, inactive 1091.
Evaluate(): 0 pruned of 5000, transformation diff 0.135667.  Active: 3479, close 1003, inactive 518.
Evaluate(): 0 pruned of 5000, transformation diff 0.0554286.  Active: 3476, close 1072, inactive 452.
Evaluate(): 0 pruned of 5000, transformation diff 0.0107477.  Active: 3634, close 1087, inactive 279.
Evaluate(): 0 pruned of 5000, transformation diff 0.0225253.  Active: 3717, close 1059, inactive 224.
...
Evaluate(): 0 pruned of 5000, transformation diff 2.91092e-13.  Active: 3913, close 990, inactive 97.
Evaluate(): 0 pruned of 5000, transformation diff 3.202e-13.  Active: 3913, close 990, inactive 97.
Evaluate(): 0 pruned of 5000, transformation diff 6.72421e-13.  Active: 3913, close 990, inactive 97.
Evaluate(): 0 pruned of 5000, transformation diff 6.41857e-13.  Active: 3913, close 990, inactive 97.
Evaluate(): 0 pruned of 5000, transformation diff 3.20926e-13.  Active: 3913, close 990, inactive 97.
Evaluate(): 0 pruned of 5000, transformation diff 3.5302e-13.  Active: 3913, close 990, inactive 97.
Evaluate(): 0 pruned of 5000, transformation diff 7.41342e-13.  Active: 3913, close 990, inactive 97.

So definitely something is wrong---it's not pruning, even at the end of the optimization when transformationDiff is very small. Then I uncommented the cout call that prints the information for every single calculation of eval using the bounds. I got output like this:

eval bound of point 501 with l 0: 8123.46; tdiff 0.596621 norm target 3523.53, norm impostors 3136.12, norm i 3441.11; evalOld 44.0952
eval bound of point 502 with l 0: 11152.5; tdiff 0.596621 norm target 5043.43, norm impostors 3579.43, norm i 5034.99; evalOld 0
eval bound of point 503 with l 0: 6644.96; tdiff 0.596621 norm target 2758.45, norm impostors 2651.33, norm i 2863.94; evalOld 0
eval bound of point 504 with l 0: 8616.29; tdiff 0.596621 norm target 3780.76, norm impostors 3010.31, norm i 3796.55; evalOld 34.4037
eval bound of point 505 with l 0: 10042.8; tdiff 0.596621 norm target 4313.84, norm impostors 4085.71, norm i 4200.54; evalOld 19.2269
eval bound of point 506 with l 0: 11692.8; tdiff 0.596621 norm target 5127.23, norm impostors 4241.35, norm i 5114.93; evalOld 0
eval bound of point 507 with l 0: 6884.76; tdiff 0.596621 norm target 2812.86, norm impostors 3048.53, norm i 2839.1; evalOld 0
eval bound of point 508 with l 0: 9194.64; tdiff 0.596621 norm target 3788.47, norm impostors 3982.62, norm i 3806.09; evalOld 16.6542
eval bound of point 509 with l 0: 10675.3; tdiff 0.596621 norm target 4504.05, norm impostors 4282.94, norm i 4552.93; evalOld 0
eval bound of point 510 with l 0: 15528.3; tdiff 0.596621 norm target 6074.61, norm impostors 7861.31, norm i 6045.6; evalOld 0
eval bound of point 511 with l 0: 9294.89; tdiff 0.596621 norm target 3953.39, norm impostors 3767.32, norm i 3929.26; evalOld 0
eval bound of point 512 with l 0: 9460.31; tdiff 0.596621 norm target 3929.16, norm impostors 3945.34, norm i 3990.99; evalOld 0
eval bound of point 513 with l 0: 9744.06; tdiff 0.596621 norm target 4107.85, norm impostors 4190.15, norm i 3987.92; evalOld 34.7482
eval bound of point 514 with l 0: 8477.52; tdiff 0.596621 norm target 3508.36, norm impostors 3472.01, norm i 3569.13; evalOld 54.045

That's not on the first iteration, so we can see that the old eval values are never less than -1... that led me to the conclusion that evalOld is not being set right when pruning occurs!

So I commented out the every-calculation output and got something more expected, although the pruning only happens at the end of the optimization:

...
Evaluate(): 0 pruned of 5000, transformation diff 0.000489792.  Active: 3909, close 991, inactive 100.
Evaluate(): 0 pruned of 5000, transformation diff 0.000244896.  Active: 3911, close 990, inactive 99.
Evaluate(): 3 pruned of 5000, transformation diff 0.000122448.  Active: 3911, close 994, inactive 95.
Evaluate(): 58 pruned of 5000, transformation diff 6.12241e-05.  Active: 3911, close 1043, inactive 46.
Evaluate(): 204 pruned of 5000, transformation diff 3.0612e-05.  Active: 3913, close 1050, inactive 37.
Evaluate(): 407 pruned of 5000, transformation diff 1.5306e-05.  Active: 3913, close 1027, inactive 60.
Evaluate(): 600 pruned of 5000, transformation diff 7.65301e-06.  Active: 3913, close 1031, inactive 56.
Evaluate(): 744 pruned of 5000, transformation diff 3.8265e-06.  Active: 3913, close 1028, inactive 59.
Evaluate(): 848 pruned of 5000, transformation diff 1.91325e-06.  Active: 3913, close 1030, inactive 57.
Evaluate(): 948 pruned of 5000, transformation diff 9.56626e-07.  Active: 3913, close 1028, inactive 59.
Evaluate(): 982 pruned of 5000, transformation diff 4.78313e-07.  Active: 3913, close 1027, inactive 60.
Evaluate(): 1034 pruned of 5000, transformation diff 2.39156e-07.  Active: 3913, close 1027, inactive 60.
Evaluate(): 1060 pruned of 5000, transformation diff 1.19578e-07.  Active: 3913, close 1027, inactive 60.
Evaluate(): 1073 pruned of 5000, transformation diff 5.97891e-08.  Active: 3913, close 1027, inactive 60.
Evaluate(): 1073 pruned of 5000, transformation diff 2.98946e-08.  Active: 3913, close 1027, inactive 60.
Evaluate(): 1084 pruned of 5000, transformation diff 1.49473e-08.  Active: 3913, close 1027, inactive 60.
Evaluate(): 1085 pruned of 5000, transformation diff 7.47364e-09.  Active: 3913, close 1027, inactive 60.
Evaluate(): 1083 pruned of 5000, transformation diff 3.73682e-09.  Active: 3913, close 1027, inactive 60.

At least for the covertype dataset, we don't get that many prunes, but it's likely to be different for other datasets. I have not investigated that here, since you already have a benchmarking system set up for that, and also my only goal was to try and find why we were getting no speedup.

Once I removed all the extra code for counting and commenting and left only the fix for evalOld, here are the results that I got:

$ bin/mlpack_lmnn -i ~/datasets/covertype-5k.train.csv -l ~/datasets/covertype-5k.train.labels.csv -v -O lbfgs -o output.csv --print_accuracy | grep -v 'node combinations\|base cases' 2>&1
[INFO ] Loading '/home/ryan/datasets/covertype-5k.train.csv' as CSV data.  Size is 54 x 5000.
[INFO ] Loading '/home/ryan/datasets/covertype-5k.train.labels.csv' as raw ASCII formatted data.  Size is 5000 x 1.
[INFO ] Initial learning point have invalid dimensionality.  Identity matrix will be used as initial learning point for optimization.
[INFO ] Accuracy on initial dataset: 73.94%
[INFO ] Accuracy on transformed dataset: 79.82%
[INFO ] Saving CSV data to 'output.csv'.
[INFO ] 
...
[INFO ] Program timers:
[INFO ]   computing_neighbors: 27.788266s
[INFO ]   line_search: 47.258662s
[INFO ]   lmnn_function_compute_evals: 2.923383s
[INFO ]   lmnn_function_evaluate_with_gradient: 47.409206s
[INFO ]   lmnn_optimization: 47.440213s
[INFO ]   loading_data: 0.034090s
[INFO ]   saving_data: 0.001398s
[INFO ]   total_time: 47.811751s
[INFO ]   tree_building: 2.266863s

The speedup for the lmnn_function_compute_evals is marginal, but it is definitely not nothing. It's possible there is a little more improvement to be had here---see my comment on the shape of evalOld. And it's also possible that we will see much more speedup for some other datasets.

A last comment is, if we norm-center the data first, the average norm of points will go down, making this bounding strategy more effective. I'd suggest adding an option to do that to the lmnn_main.cpp program (and maybe to the LMNN class); that doesn't have to be this PR, you could do it in another one if you like.
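A sketch of the norm-centering idea (assuming it simply means subtracting the per-dimension mean, so that point norms shrink on average; columns are points):

arma::vec dataMean = arma::mean(dataset, 1);
dataset.each_col() -= dataMean;  // Points are now centered around the origin.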

I hope this is helpful. I've gone to great lengths to describe the details of what I've done in the hopes that it can help you in debugging some of the other optimization issues we are looking at. Of course I am here to help also. :) Anyway, I hope this is helpful! Let me know if I can clarify anything.

//! Holds the norm of each data point.
arma::vec norm;
//! Hold previous eval values for each datapoint.
arma::vec evalOld;

rcurtin Jul 10, 2018

By the way, I think that we have to hold one eval value for each datapoint and each of its neighbors.

manish7294 Jul 10, 2018

First of all, thanks a lot for doing this. That's a whole new level of debugging from my view. It has helped clear up a lot of things :)

I think I can at least partly see what you mean here. While doing the eval derivation, we fixed a particular x_i, x_a, and x_b. So the new eval bound eval_new <= eval_old + || L_{t + 1} - L_t ||_F^2 * ( ||x_a||^2 + 2 * ||x_i||^2 + ||x_b||^2 ) should hold for those particular points only. If that's right, do you think we need to store an eval for each triplet (x_i, x_a, x_b)?
That might be a bit inconvenient, though.

rcurtin Jul 10, 2018

Right, each eval is a function of x_i, x_a, and x_b. So we should be holding evals for each of those. It will take some more RAM, but it shouldn't be too much compared to everything else.

All of the debugging I did was with k = 1, so I did not actually see any difference there (as there is only one eval per point in this case).
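One possible shape for that cache, as a sketch (assuming k target neighbors and k impostors per point, so one k x k matrix per point; NaN marks "no cached value yet"):

arma::cube evalOld(k, k, dataset.n_cols);
evalOld.fill(arma::datum::nan);
// One slice per point i; rows and columns index the impostor and the
// target neighbor for that point.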

{
// update bound.
bp = l;
break;
}

// Update cache eval value.
evalOld[i] = eval;

rcurtin Jul 10, 2018

Since you are setting the cached value down here, it is never updated in the case that eval is small enough that it can be pruned.

manish7294 Jul 10, 2018

I always had my doubts regarding this, but I kept neglecting it, attributing the marginal improvement to system noise. After some more confusing thoughts, I ended up putting it here.

So basically this should be done at least before the break statement, right?

rcurtin Jul 10, 2018

Right, I put it up above the break and that is when I started seeing prunes. I would encourage you to add the same debugging output I linked to and watch how the change makes a difference. :)

manish7294 commented Jul 10, 2018

I forgot to mention this; it struck me today while working on the bounds for impostors. I think we may not be able to use this for optimizers that use batches, as it would be quite inconvenient to keep track of transformationOld for the points of each batch. So I think that part of this code is wrong.

rcurtin commented Jul 10, 2018

Right, so I think maybe holding a bunch of transformation matrices and then a list for each point of which transformation matrix was last used could be a way to solve this.

Alternately, we could store only one transformation matrix when Shuffle() is called, and then be careful about how we compute evalOld for each point (since it would not be computed with the exact same transformation matrix).

manish7294 commented Jul 11, 2018

I think we may have to store a transformation matrix for each point: apart from Shuffle(), we also have some optimizers with a variable batchSize, which could cause some extra problems. In terms of memory, it will be of the same order as evalOld.

rcurtin commented Jul 11, 2018

Would you like to try implementing it? Some more memory will have to be allocated for each of the stored transformation matrices, and some management will be necessary to check when old ones can be deleted.

}

// Update cache eval value.
evalOld[i](j, l) = eval;

manish7294 Jul 11, 2018

I think this shouldn't be here. Suppose we get eval < -1: then we aren't calculating the exact eval value, and we are updating the cache with this approximate value. So it will definitely cause some deviation from the actual result.

So I think doing this after the break statement may be the valid choice, or we may have to calculate the exact eval every time (which I think would be a negative optimization).

rcurtin Jul 11, 2018

Is it actually a problem if we store an approximate bounded value in evalOld?

manish7294 Jul 11, 2018

I think it can be, when the approximate eval is very close to -1 but the actual eval is not, or vice versa.

manish7294 Jul 11, 2018

At the very least, we won't be getting the exact same results.

rcurtin Jul 11, 2018

But does your idea here match the results I was showing in my simulation? Note that I was comparing the output transformation matrices and received the exact same results. To put it more clearly: suppose that we take evalOld[i](j, l) = eval + x for some x greater than 0; now, the bound for that particular evalOld is loose. What will happen the next iteration?

rcurtin Jul 11, 2018

The explanation makes sense to me. I agree---increasing x would result in fewer prunes. Would you then conclude that for any positive x, even if we take evalOld[i](j, l) = eval + x, the bound will still be correct? (As in, it will still be a valid bound.)

manish7294 Jul 11, 2018

No doubt, it will surely be correct; just the prunes will decrease.

manish7294 Jul 11, 2018

Having loose bounds definitely shouldn't be wrong.

rcurtin Jul 11, 2018

Ok, I agree. Now we can get back to your original question about the placement of how evalOld is assigned. Even if we assign evalOld before the prune, the worst case is that we assign a bound instead of an exact value (so, this is equivalent to saying evalOld[i](j, l) = eval + x where x is the bounding term || L_{t + 1} - L_t || * (|| x_j || + || x_l || + 2 || x_i ||)). Since we both agree that it is still a valid bound, just possibly loose, there is no problem with doing this. In fact, in the next iteration, if the exact calculation of eval is not done and we end up pruning based on another bound, our stored evalOld value will be loose---but not invalid. So we will still get some prunes, but it will not be perfect.

The other option, as you pointed out, is to assign evalOld after the pruning. But in this case, this means that we will never set any evalOld value for a lot of pruned points, and then it won't be possible to prune them in the following iteration. You can see this in the simulations I ran yesterday---without setting evalOld before the pruning, we get zero prunes throughout the entire algorithm (and no speedup).

So, given these two choices, it's best to go with the approach that will give some prunes despite the bound being looser than it could be.
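To summarize the placement just agreed on, here is a sketch of the inner loop assembled from the fragments quoted in this review (names follow the PR's discussion; metric is assumed to be LMNN's squared Euclidean distance):

double eval = 0.0;
bool exactEval = true;
if (!std::isnan(evalOld(l, j, i)))
{
  // Upper-bound the new eval using the cached value.
  eval = evalOld(l, j, i) + transformationDiff *
      (norm(targetNeighbors(j, i)) + norm(impostors(l, i)) + 2 * norm(i));
  exactEval = false;
}

if (exactEval || eval > -1)
{
  // The bound is not decisive, so compute the exact eval.
  eval = metric.Evaluate(transformedDataset.col(i),
      transformedDataset.col(targetNeighbors(j, i))) - distance(l, i);
}

// Cache *before* any prune: a pruned triplet then keeps a valid, possibly
// loose, bound for the next iteration.
evalOld(l, j, i) = eval;

if (eval <= -1)
{
  bp = l;
  break;
}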

manish7294 Jul 11, 2018

Totally agree :) Thanks for clearing this up.

eval = metric.Evaluate(transformedDataset.col(i),
transformedDataset.col(targetNeighbors(j, i))) -
distance(l, i);
eval = evalOld[i](j, l) + transformationDiff *

manish7294 Jul 11, 2018

This is one of the biggest problems. Each time impostors() is called, the triplets {i, j, l} mostly change, and since evalOld contains values for the previous triplets {i, j, l}, it gets all wrong.
One solution may be to reset evalOld on every impostors() call, but then we would lose the significance of this optimization.

manish7294 Jul 11, 2018

Please have a look at this one too. It worries me the most.

rcurtin Jul 11, 2018

Yes, I wanted to wait to discuss this one until we took care of the other part. I thought about it this morning and because the bound depends on || x_i ||, || x_j ||, and || x_l ||, I believe it is possible that the impostors changing could cause a problem. But let's talk a little bit more about what the problem is. As always please read carefully, it's possible there may be an error in what I've written.

Our whole goal is to bound eval for the next iteration. We've done this by computing the maximum possible eval value for the next iteration (not the minimum). Implicitly our derivation does this by bounding how much closer the impostor x_l could get given || L_{t + 1} - L_t ||. So even if the impostor changes, the bound will not. I haven't been very clear in my description here (but I am about to go to lunch so I want to keep it short), but I'd encourage you to sit down and try to work this out if you like.

That said there is still another problem: each iteration, we compute this bounding term || L_{t + 1} - L_t || * (|| x_j || + || x_l || + 2 || x_i ||). But if x_l changes, then suddenly you are right, it is possible that the bound could be incorrect---because we need to be using x_l with respect to the original impostor x_l that the eval was calculated with, not a new impostor point.

I thought about it for a while, and I think the problem could be solved if you did this: every time we perform a direct computation of eval, store the quantity || x_j || + || x_l || + 2 || x_i ||. Then when we compute the bound, just use that stored quantity. Now, even if the impostor changes, the bound will still be valid. As soon as the bound is no longer able to prune (i.e. once it becomes greater than -1), then we'll perform another direct computation of eval and store the new quantity with the new impostors.

What do you think, does that make sense?

manish7294 Jul 11, 2018

Here's what I got -

[attached image: handwritten derivation, "new doc 2018-07-11_1"]

I think it will depend on the norm of the point. I may not be absolutely correct here, just an attempt.

rcurtin Jul 11, 2018

Right, I think that I agree with the derivation here, so definitely if || x_k ||^2 > || x_{k + 1} ||^2, we can use the previous bound as-is. In the other case, where || x_k ||^2 <= || x_{k + 1} ||^2 (that is, the new impostor x_{k + 1} has larger norm than the old impostor x_k), a valid bound is

d(L_t x_i, L_t x_k) - A || x_{k + 1} ||^2 - B,  // call this (1)

although that bound would actually be somewhat looser than the bound

d(L_t x_i, L_t x_{k + 1}) - A || x_{k + 1} ||^2 - B.  // call this (2)

It's possible to put these together to make a nice bound that uses max(|| x_k_t ||^2, || x_k_{t + 1} ||^2), where x_k_t (sorry for the ugly notation) is impostor k at iteration t and x_k_{t + 1} is impostor k at iteration t + 1. I think it might be possible to make that tighter by exploiting that impostors can only change when (iteration - 1) % range == 0, but I think the result would be really complex. For that max(), note that when impostors don't change, x_k_t == x_k_{t + 1} and the bound is the same as before.

manish7294 Jul 12, 2018

So, our new condition is basically,
eval_new <= eval_old + || L_{t + 1} - L_t ||_F^2 * ( ||x_a||^2 + 2 * ||x_i||^2 + max(||x_b||^2, ||x_b_p||^2))
Here,
x_a : target neighbor
x_b : current impostor
x_b_p : previous impostor
x_i : current data point

As for the implementation technicalities, I think we will have to hold the last maximum impostor norm used and will be solving:
eval_new <= eval_old + || L_{t + 1} - L_t ||_F^2 * ( ||x_a||^2 + 2 * ||x_i||^2 + max(||x_b||^2, previousMaxNorm))

So basically we will have to hold another cube.

manish7294 Jul 12, 2018

Or, maybe just a 2D matrix will do.
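A sketch of that 2D cache in use (assuming maxImpNorm elements start at 0, which also avoids a NaN check; see the review comments below):

// Keep the largest impostor norm used so far for this (l, i) pair...
maxImpNorm(l, i) = std::max(maxImpNorm(l, i), norm(impostors(l, i)));
// ...and bound with it, so the bound stays valid even if the impostor changed.
eval = evalOld(l, j, i) + transformationDiff *
    (norm(targetNeighbors(j, i)) + maxImpNorm(l, i) + 2 * norm(i));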

//! Holds the norm of each data point.
arma::vec norm;
//! Hold previous eval values for each datapoint.
std::vector<arma::mat> evalOld;

rcurtin Jul 11, 2018

I'd suggest arma::cube or something, no need to use a std::vector.

manish7294 Jul 11, 2018

Right, that could help to simplify things.

manish7294 commented Jul 12, 2018

Hopefully, this will be the last problem in the course of this PR:
I have no idea why BigBatchSGD, after a number of iterations, just keeps on running without printing any further iteration output, as if it is stuck at a point. I have observed this several times but couldn't find an answer.

rcurtin left a comment

Looking good, I think we are getting close to having this optimization about as good as we can get it. I see only a couple things left to do:

  • Check for correctness. I'll do this too, but you can do this empirically by running with and without these changes for both the L-BFGS and SGD optimizers (for SGD make sure that you are doing a linear scan, so there is no randomness), then make sure the learned output matrices are the same.

  • Check optimization of data structures (I left a few comments here, once we get them resolved I think we are good to go).

  • Check performance of the optimization on a couple different datasets. If you want to run the type of simulations I did a few days ago where we capture the overall speedup and also see some of the pruning behavior at different iterations, I think it would be really helpful.

if (!std::isnan(maxImpNorm(l, i)))
maxNorm = std::max(norm(impostors(l, i)), maxImpNorm(l, i));
else
maxNorm = norm(impostors(l, i));

rcurtin Jul 12, 2018

We could avoid the if statement here if we initialize maxImpNorm elements to 0. (Correct me if I am wrong---but I think that will work.) You could even just do

maxImpNorm(l, i) = std::max(maxImpNorm(l, i), norm(impostors(l, i)));

in that case.

manish7294 Jul 12, 2018

Definitely, that's a better way of doing this.

// Update cache max impostor norm.
maxImpNorm(l, i) = maxNorm;

eval = evalOld(j, l, i) + transformationDiff *

rcurtin Jul 12, 2018

This is a minor point, but since Armadillo is column-major, points in the same column end up being contiguous in memory. Since we're looping over i, then j, then l, the best access pattern for evalOld will be if we are accessing as evalOld(l, j, i) (i.e. l corresponds to the row, j to the column, i to the slice).
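(For reference: in an arma::cube, element (r, c, s) lives at offset r + c * n_rows + s * n_rows * n_cols, so with evalOld(l, j, i) the innermost loop over l walks consecutive memory.)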

transformationOldPoint.set_size(transformation.n_rows,
transformation.n_cols, dataset.n_cols);
transformationOldPoint.fill(arma::datum::nan);
}

rcurtin Jul 12, 2018

I think this means your approach is that you will store one transformation matrix per point? That will be very large in memory. I'd suggest instead a std::vector<arma::mat> of old transformation matrices, and then a arma::Row<size_t> to track the index that is used for each point. (I know that earlier I suggested changing from a std::vector<arma::mat> to an arma::cube, but that was for a case where the number of slices in the cube wouldn't change. Here, I think it would be better for memory to only store as many transformation matrices as we need. Also, I guess each call to Shuffle() we can check for unused transformation matrices and either remove them or just call clear().)

manish7294 Jul 12, 2018

I have some doubts regarding this.
It is mainly done for batch support; the reason I have done it this way is the changing batch size. Let's suppose the batch size changes in the next iteration: I can't see how we would keep track of transformationOld then. Please explain a little bit more; maybe I am missing something here.

rcurtin Jul 12, 2018

Right, so the idea is this. Hold an arma::Row<size_t> called lastTransformationIndices (or something), and hold a std::vector<arma::mat> called oldTransformationMatrices (or something). Then, when we do a batch:

oldTransformationMatrices.push_back(transformationMatrix);
lastTransformationIndices.subvec(begin, begin + batchSize - 1).fill(
    oldTransformationMatrices.size() - 1);

Then, when you need to get the transformation matrix for a point i, do oldTransformationMatrices[lastTransformationIndices[i]]. Do you see what I mean? There will be a little bit of extra complexity here but that is the basic idea.

manish7294 Jul 13, 2018

The update process is nasty and will require some extra computation time at each Evaluate() call:

  // Update & remove unused cache matrices.
  oldTransformationMatrices.push_back(transformation);
  lastTransformationIndices.cols(begin, begin + batchSize - 1).fill(
      oldTransformationMatrices.size() - 1);

  size_t maxIndex = lastTransformationIndices.max();
  arma::uvec matricesUsed = arma::unique(lastTransformationIndices);
  for (size_t i = 0, j = 0, l = 0; i <= maxIndex; i++)
  {
    if (i != matricesUsed(j))
    {
      oldTransformationMatrices.erase(i - l);
      l++;
    }
    else
      j++;
  }
  // Normalize lastTransformationIndices.
  // Maybe NormalizeLabels() can be used for this.

So, do you think it is worth the tradeoff for space? And I think we are using the same amount of space for evalOld as well.

rcurtin Jul 13, 2018

Right, let's consider using a slightly different strategy, holding the following auxiliary variables:

// It might be useful to come up with shorter names, these are really long...

// This will hold each of the transformation matrices.
std::vector<arma::mat> oldTransformationMatrices;
// This will hold how many points are using each transformation matrix.
std::vector<size_t> oldTransformationCounts;
// This will hold which transformation matrix each point is using.
arma::Row<size_t> lastTransformationIndices;

Then, whenever we call Evaluate(..., begin, batchSize), we can do the following:

// Are there any empty transformation matrices?
size_t index = oldTransformationMatrices.size();
for (size_t i = 0; i < oldTransformationCounts.size(); ++i)
{
  if (oldTransformationCounts[i] == 0)
  {
    index = i; // Reuse this index.
    break;
  }
}

// Did we find an unused matrix?  If not, we have to allocate new space.
if (index == oldTransformationMatrices.size())
{
  oldTransformationMatrices.push_back(transformationMatrix);
  oldTransformationCounts.push_back(0);
}
else
{
  oldTransformationMatrices[index] = transformationMatrix;
}
    
// Update all the transformation indices.
for (size_t i = begin; i < begin + batchSize; ++i)
  --oldTransformationCounts[lastTransformationIndices[i]];
lastTransformationIndices.cols(begin, begin + batchSize - 1).fill(index);
oldTransformationCounts[index] += batchSize;

During initialization for the whole procedure I'd suggest setting oldTransformationMatrices to have one empty matrix and set all the lastTransformationIndices to 0 and the counts accordingly.

And, when shuffling is done, we just need to modify lastTransformationIndices, which will be very similar to the existing shuffling code.

This is an improvement over the idea above, because we don't need to call arma::unique() every call, which is very expensive. Instead the cost of this approach, at worst, is one transformation matrix copy followed by 2 * batchSize + 1 assignments, which is pretty trivial compared to the rest of the computation.
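And a sketch of the Shuffle()-time update mentioned above, mirroring how the evalOld slices are shuffled elsewhere in this PR (ordering is assumed to be the permutation generated during Shuffle()):

arma::Row<size_t> newIndices(lastTransformationIndices.n_elem);
for (size_t i = 0; i < ordering.n_elem; ++i)
  newIndices[i] = lastTransformationIndices[ordering[i]];
lastTransformationIndices = newIndices;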

manish7294 Jul 14, 2018

Thanks! That's a really good algorithm to handle the situation :)

rcurtin commented Jul 12, 2018

I have no idea why BigBatchSGD after a number of iterations just keeps on running without printing any further iteration output, as if it is stuck at a point. I have observed this several times but couldn't find an answer.

Hmm, let's look into this once we resolve the couple of other comments and testing. If you can tell me how to reproduce it, I can help guide you through the debugging.

manish7294 commented Jul 12, 2018

Any simple command like this one, with any k except k = 1, will reproduce it:
bin/mlpack_lmnn -i iris.csv -l iris_labels.txt -k 5 -O bbsgd -o output.csv -v

rcurtin commented Jul 12, 2018

Ok, and what have you tried to produce output that could help pinpoint what the problem is?
(Feel free to wait to approach this issue until the other issues are fixed, that's fine with me. It's possible that some of the other changes you will make may solve this.)

manish7294 commented Jul 13, 2018

It looks like making these new changes solved the problem, as I am not seeing it anymore. The optimization is successfully converging. Maybe something was wrong with the maxImpNorm implementation earlier.

manish7294 commented Jul 13, 2018

Here is the gist for the correctness of the distances: https://gist.github.com/manish7294/7f939c3ed80be5af7ea3f2b216304efb
The results are overall the same, with just a little difference for BigBatchSGD.

rcurtin commented Jul 13, 2018

Thanks for the simulations. It looks to me like there may be some problem---this pruning should be exact and should not affect the results at all. But I can see for L-BFGS for the iris dataset and for BBSGD, the results are a little different. Are you sure that any randomness is removed from BBSGD?

bool exactEval = true;

// Use eval calculated during Evaluate.
if (!std::isnan(evalOld(l, j, i)))

manish7294 Jul 13, 2018

I think I have at least pinpointed the cause of the difference in results:

  1. In this PR the breaking condition is if (eval <= -1), whereas in the current LMNN it is if (eval < -1). Changing this solved the iris deviation, so I think we don't need to worry about it.

  2. Something is going wrong here while using the cached evalOld for the Gradient() function. If I change if (!std::isnan(evalOld(l, j, i))) to if (false), the BBSGD deviation problem is solved as well.

rcurtin Jul 13, 2018
  1. In this PR the breaking condition is if (eval <= -1), whereas in the current LMNN it is if (eval < -1). Changing this solved the iris deviation, so I think we don't need to worry about it.

Got it, so you mean that when you run the current PR with if (eval < -1), you get the exact same results with L-BFGS on the iris dataset as the current git master branch?

  2. Something is going wrong here while using the cached evalOld for the Gradient() function. If I change if (!std::isnan(evalOld(l, j, i))) to if (false), the BBSGD deviation problem is solved as well.

Sorry, do you mean you change to if (false) or if (std::isnan(evalOld(l, j, i)) == false) or something else? I don't know if I understood.

manish7294 Jul 13, 2018

Got it, so you mean that when you run the current PR with if (eval < -1), you get the exact same results with L-BFGS on the iris dataset as the current git master branch?

Yes

Sorry, do you mean you change to if (false) or if (std::isnan(evalOld(l, j, i)) == false) or something else? I don't know if I understood.

Ya, I wrote a confusing statement; sorry about that. I meant if (false).

rcurtin Jul 13, 2018

I think the condition actually needs to be eval < -1 not eval <= -1; correct me if I'm wrong on that. In any case we can call that solved.

The difference that I notice with BigBatchSGD and SGD is that SGD will call EvaluateWithGradient(), but BigBatchSGD calls Evaluate() and Gradient() separately. So my guess is that this has to do with us not handling different previous transformation matrices in the same way we do in EvaluateWithGradient(). Do you want to implement the strategy I proposed for that a little while ago in EvaluateWithGradient(), see if it works for SGD, and if so, then apply it to Evaluate() and Gradient() also?

manish7294 Jul 13, 2018

Here's my thought about eval <= -1:
Our original condition was max(0, 1 + eval), and 0 doesn't contribute anything, so skipping it shouldn't be a problem. And at eval = -1, we are basically dealing with 0. Does that sound right?

Ya, I will try that.

rcurtin Jul 13, 2018

You're right, when eval == -1, there is no change to the cost function. But there is a change to the gradient:

         // Calculate gradient due to impostors.
         arma::vec diff = dataset.col(i) - dataset.col(targetNeighbors(j, i));
         cil += diff * arma::trans(diff);
 
         diff = dataset.col(i) - dataset.col(impostors(l, i));
         cil -= diff * arma::trans(diff);

In the git master branch, those terms get added to cil even if eval == -1. When I review the paper it's not fully clear whether we should include those terms in our gradient computation. The paper writes:

"Let us define a set of triples N_t, such that (i, j,l) ∈ N_t if and only if the indices (i, j,l) trigger the hinge loss in the second part of Eq. (18)."

I think we can go either way here, and I like the tighter condition of if (eval <= -1) since that will allow a little bit of extra pruning. So I would say, let's leave it as it is with <=, but just in order to test correctness with BigBatchSGD, let's make sure we do that test with < to match the git master code.

manish7294 commented Jul 14, 2018

Ah, finally! It looks like we are done here :)
Here are the updated distance correctness simulations: https://gist.github.com/manish7294/7f939c3ed80be5af7ea3f2b216304efb

rcurtin commented Jul 14, 2018

Looks great! Thank you for the hard work. I'll give this a final review shortly but I am pretty sure any comments will be very minor.

The last thing to check would be the speedup---can you run some timing simulations to check what the speedup is on different datasets?

Thanks again for the hard work here. I'm really happy to see this merged soon. :)

manish7294 commented Jul 14, 2018

Timing simulations:

               With Change ----- Without Change
Iris -          0.841357 ------- 0.878832
vc2 -           3.255247 ------- 3.341104
balance_scale - 9.823953 ------- 12.755052
ecoli -         5.727747 ------- 5.909457
diabetes -      18.633512 ------ 18.724131
letters -      522.005695 -----  547.120529

Pruning output -
https://gist.github.com/manish7294/cbafa9f0d1ba1370c5ef0160ec4b2430

I have avoided covertype-5k, as we already know that it has a very low pruning rate.

rcurtin left a comment

This is looking great. I have a handful of minor comments about how the code could be slightly optimized, but based on the simulation results you showed, I'm convinced that (a) the pruning is correct and (b) the pruning provides a non-negligible speedup on some datasets. If you can take care of the comments, I think this is ready for merge.

As a side note, I was a little surprised that so many of the constraints for many of the datasets were active. In many cases it seemed like only 10% or fewer of the constraints were less than -1 in the last iterations of the optimization. The text in the JMLR paper on page 240 under equation (19) suggests that most of the constraints won't be "active" (i.e. greater than -1), but this seems to not be what we are seeing. I see no problems in our code or anything, I just thought that was an interesting observation to point out.

bool exactEval = true;

// Bounds for eval.
if (transformationOld.n_elem != 0 && !std::isnan(evalOld(l, j, i)))

rcurtin Jul 14, 2018

Here's an idea to reduce computation very slightly (and I think it could clarify the code too). Right now you skip the calculation of the bound if evalOld(l, j, i) is NaN. But you could also always skip if evalOld(l, j, i) is greater than -1 (since our bound can only increase that value). So, when you initialize, you could just fill all the evalOld with 0, and then there's no need for checking if it's NaN or not.

if (((iteration - 1) % range == 0) && (iteration - 1 != 0))

// Flag to trigger exact eval calculation.
bool exactEval = true;

rcurtin Jul 14, 2018

You could remove this variable entirely and use if (eval > -1) instead of if (exactEval) later.

{
transformationDiff = arma::norm(transformation -
oldTransformationMatrices[lastTransformationIndices(i)]);
}

rcurtin Jul 14, 2018

This computation could be redundant for large batches. It might (maybe) be worthwhile to move this computation outside the loop and create a custom loop where we only call arma::norm() on the matrices that we need to. I would say it's up to you if you'd like to do this.

manish7294 Jul 14, 2018

I think it is possible to do this for AMSGrad and SGD (the change can be made in EvaluateWithGradient()), but we will have to keep it like this for BBSGD.

rcurtin Jul 14, 2018

Right, you can leave it like this and I think it's fine. What I was thinking was more along the lines of something like this:

std::map<size_t, double> transformationDiffs;
for (size_t i = begin; i < begin + batchSize; ++i)
{
  if (transformationDiffs.count(lastTransformationIndices[i]) == 0)
  {
    // If you make the change so that oldTransformationMatrices[0] is empty, this will work.
    if (lastTransformationIndices[i] == 0)
    {
      transformationDiffs[0] = 0.0; // This won't be used anyway...
    }
    else
    {
      transformationDiffs[lastTransformationIndices[i]] = arma::norm(transformation -
          oldTransformationMatrices[lastTransformationIndices(i)]);
    }
  }
}

I think that will work, but I haven't tried or debugged it. Then, later, instead of using transformationDiff for the bound calculation, you'd use transformationDiffs[lastTransformationIndices[i]].


eval = evalOld(l, j, i) + transformationDiff *
(norm(targetNeighbors(j, i)) + maxImpNorm(l, i) +
2 * norm(i));

rcurtin Jul 14, 2018

Oops, minor style issue---we should indent wrapped lines twice (so, four spaces instead of two).

evalOld(l, j, i) = arma::datum::nan;
maxImpNorm(l, i) = 0;
lastTransformationIndices(i) = arma::datum::nan;
--oldTransformationCounts[lastTransformationIndices(i)];

rcurtin Jul 14, 2018

I don't actually think this is necessary---if eval > -1 (i.e. if no pruning could happen) it's okay to leave evalOld(l, j, i) set to the exact value of eval, and to leave lastTransformationIndices(i) set to the index of this transformation matrix.

Also I think that setting lastTransformationIndices(i) to NaN will cause a problem when we later do --oldTransformationCounts[lastTransformationIndices(i)].

lastTransformationIndices(i) = index;
}

oldTransformationCounts[index] += batchSize;

rcurtin Jul 14, 2018

I think this is the same code used at the end of Gradient() and EvaluateWithGradient(). To reduce the maintenance burden of the code, we should try and keep it as short as we can, so could you make this code into a separate method please and then just call it from each of those methods?

manish7294 Jul 14, 2018

Sure, I will make a separate method for this update part.

manish7294 Jul 15, 2018

Done!

maxImpNorm.fill(0);

lastTransformationIndices.set_size(dataset.n_cols);
lastTransformationIndices.fill(arma::datum::nan);

rcurtin Jul 14, 2018

I made a comment later about the danger of using NaNs here. I'd suggest filling lastTransformationIndices originally with 0, and then setting oldTransformationCounts accordingly (so, one element with dataset.n_cols as the value), and also adding an empty matrix to oldTransformationMatrices.

return cost;
}

template<typename MetricType>
inline void LMNNFunction<MetricType>::Precalculate()
{
pCij.zeros(dataset.n_rows, dataset.n_rows);
norm.zeros(dataset.n_cols);

rcurtin Jul 14, 2018

You can just use set_size() here. zeros() will take a second pass to fill all the elements.

manish7294 Jul 14, 2018

Done in #1466

rcurtin Jul 14, 2018

Ah, sorry, forgot about that. 👍

for (size_t i = 0; i < ordering.n_elem; i++)
{
evalOld.slice(i) = newEvalOld.slice(ordering(i));
}

rcurtin Jul 14, 2018

If you aren't sure if the oldTransformationCounts are correct after each iteration, you could add this set of debugging assertions:

#ifdef DEBUG
  size_t total = 0;
  for (size_t i = 0; i < oldTransformationCounts.size(); ++i)
  {
    std::ostringstream oss;
    oss << "transformation counts for matrix " << i << " invalid (" << oldTransformationCounts[i] << ")!";
    Log::Assert(oldTransformationCounts[i] <= dataset.n_cols, oss.str());
    total += oldTransformationCounts[i];
  }

  std::ostringstream oss;
  oss << "total count for transformation matrices invalid (" << total << ", "
      << "should be " << dataset.n_cols << "!";
  Log::Assert(total == dataset.n_cols, oss.str());
#endif

That code (in debugging mode) checks that the count for each transformation matrix is valid, and that the total counts sum to the total number of points. It may be useful to add.

manish7294 Jul 15, 2018

Done!

manish7294 commented Jul 15, 2018

I think all the comments have been handled :)

manish7294 commented Jul 15, 2018

As a side note, I was a little surprised that so many of the constraints for many of the datasets were active. In many cases it seemed like only 10% or fewer of the constraints were less than -1 in the last iterations of the optimization. The text in the JMLR paper on page 240 under equation (19) suggests that most of the constraints won't be "active" (i.e. greater than -1), but this seems to not be what we are seeing. I see no problems in our code or anything, I just thought that was an interesting observation to point out.

Currently we are not visiting each and every point, since if (eval < -1) { bp = l; break; } helps us avoid visiting a major portion of the points having eval < -1.
Let's take the case of iris itself. Ideally, for k = 5, we should have 150 (dataset.n_cols) * 5 (number of target neighbors) * 5 (number of impostors) = 3750 constraints, but as apparent from the results we actually visit around 1402 points, and out of those, 753 usually remain active. So the active-to-total-constraints percentage comes down to 20.08% for iris; similarly it is 57.49% for balance scale, 67.55% for ecoli, 76.50% for vc2, 95.40% for diabetes, and 0% for letters.

So I think the text may be referring to the percentage w.r.t. the total number of constraints. But this figure still seems highly dataset dependent, and at least looking at these results, the authors' statement that most of the constraints won't be "active" doesn't seem quite right.

rcurtin commented Jul 15, 2018

Oh, you are right about the number of constraints. I had not considered all the skipped ones. Thank you for the clarification!

rcurtin left a comment

Thank you so much for your hard work with this. I think you've done a great job. I don't see any more comments to handle, so I will go ahead and merge this in 3 days to leave time for other comments. Then I think we can get the rest of the optimizations done and have a truly fast LMNN implementation. :)

manish7294 commented Jul 15, 2018

I'm the one who should be thanking you. Thanks for always having my back; I couldn't have done it without you. All of your innovative ideas were awesome. :)

if (((iteration - 1) % range == 0) && (iteration - 1 != 0))

// Bounds for eval.
if (transformationOld.n_elem != 0 && evalOld(l, j, i) < -1)

zoq Jul 16, 2018

Perhaps .is_empty() is more descriptive?

manish7294 Jul 17, 2018

Done!


eval = evalOld(l, j, i) + transformationDiff *
(norm(targetNeighbors(j, i)) + maxImpNorm(l, i) +
2 * norm(i));

zoq Jul 16, 2018

I'm not sure I get norm(i) here; it's always i if i >= 0. Did I miss something? Perhaps the same applies to targetNeighbors.

manish7294 Jul 17, 2018

It looks like you got norm(i) confused with arma::norm(i) (though I guess arma::norm expects a vector or matrix to be passed, not a scalar) :)
Actually, it is a vector which stores the norm of each data point:
norm(i) = arma::norm(dataset.col(i));

zoq Jul 17, 2018

Ahh, this makes sense, thanks for the clarification.

rcurtin merged commit 3b7bbf0 into mlpack:master, Jul 18, 2018

5 checks passed:

Memory Checks: Build finished.
Static Code Analysis Checks: Build finished.
Style Checks: Build finished.
continuous-integration/appveyor/pr: AppVeyor build succeeded.
continuous-integration/travis-ci/pr: The Travis CI build passed.
rcurtin commented Jul 18, 2018

Thanks again. Very happy to see this come together. 👍
