
evaluate various changes to L-BFGS optimizer #370

Closed
stephentu opened this issue Jan 2, 2015 · 7 comments

Comments

@stephentu
Contributor

This is a continuation of the thread of discussion in #3.

The following diff stephentu/mlpack@mlpack:10d90d0...stephentu:8af98d5 allows the Lovász theta SDP test case to pass without removing the last edge that was causing the problem.

Essentially, the pattern of optimization looks like:

  • For smaller values of sigma (e.g. sigma less than ~1000), the solution tends toward R=0
  • As sigma is increased, the behaviors diverge. The L-BFGS-B implementation was able to climb out of the R=0 saddle point and converge towards the correct solution, whereas the mlpack L-BFGS implementation stayed stuck in the saddle point no matter how large sigma got.

In my diff above, I added some logic to the mlpack optimizer to help guide it out of the saddle point. I'm not submitting this as a pull request yet because I'm not sure it is the right thing to do; I'll defer that to Ryan, who implemented L-BFGS in the first place.

@rcurtin: To respond to your desire to keep out the external Fortran L-BFGS-B dependency: that's fine by me if you feel strongly about it (which you clearly do 😄). But I will point out that I was pushing for L-BFGS-B not because I think the specifics of the modified variant of the algorithm (versus what you have implemented) are necessary to get the SDPs to converge; I was just looking for a fast, reliable black box with a self-contained implementation that we could easily import. In other words, if you want to invest time in improving an in-house optimization routine, I think it's better to build on the existing implementation.

Anyway, this patch should be enough for you to decide how to proceed with the optimizer.

I'm going to turn my attention to writing a few more SDP test cases, and then building a higher level wrapper for matrix completion problems.

@rcurtin
Member

rcurtin commented Jan 14, 2015

This looks good; a relative termination condition is a good improvement, in my opinion. Just so I understand: based on what you've said, it seems that L_BFGS gets stuck in some kind of valley with a small gradient (but not small enough to terminate), and "walks" down this valley to the R=0 saddle point even though the objective function improvement at each iteration is very small.

This would imply that either increasing the gradient norm tolerance, or adding a relative objective function improvement termination criterion (wow that's a long set of strung-together nouns), would allow LRSDP to avoid that saddle point.

Just one very minor question before I merge: where does the factr name come from?

Your observations with LRSDP, and mine in the past, further suggest that it's difficult to make LRSDP converge, and that we may not easily be able to provide an LRSDP implementation that always converges.

@stephentu
Contributor Author

It's actually a bit subtle, because this diff tries to fix two problems. Let me try to explain.

The optimization proceeds as follows for Lovász theta. The augmented Lagrangian method starts at small values of sigma (the penalty parameter). With these small values (say sigma < 100), L-BFGS guides the current iterate towards the dreaded saddle point at X = 0.
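(For context, here is a sketch of the augmented Lagrangian being minimized at each outer iteration, in the standard Burer-Monteiro low-rank form; the exact sign and scaling conventions in the mlpack code may differ.)

```latex
L_\sigma(R, \lambda) = \langle C, RR^T\rangle
  - \sum_i \lambda_i \bigl(\langle A_i, RR^T\rangle - b_i\bigr)
  + \frac{\sigma}{2} \sum_i \bigl(\langle A_i, RR^T\rangle - b_i\bigr)^2
```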

Here's where the current mlpack (pre-merge) implementation differs from the scipy one. In the mlpack one, since X is approximately 0, grad F is also approximately 0 at the starting point of L-BFGS, even as we increase sigma. However, the current implementation was written so that if grad F is approximately zero, it won't even run a single iteration of L-BFGS; it just exits immediately. Hence, we now check that itNum > 0 before allowing termination. Why? Because after stepping one iteration with an increased sigma value, I found (empirically) that it was able to get out of this saddle point.

Now, I also added a relative function value termination criterion to address your other concern, that the optimization was indeed slow. It turns out that as sigma gets larger and larger, it becomes harder for the gradient norm to get really small (since we're dealing with increased numerical instability from really large penalties). As you rightfully observed, the function value itself doesn't change much at that point. This lets us short-circuit the issue: the termination criterion is an OR, not an AND, so only one of the conditions needs to be true.
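A minimal sketch of the termination logic described above (illustrative only, not the actual mlpack code; names like itNum, minGradientNorm, and relTolerance are placeholders):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Combined (OR'd) termination test: a gradient-norm test, guarded so it
// cannot fire before a single iteration has run, or a relative objective
// improvement test (the factr-style condition).
bool ShouldTerminate(const size_t itNum,
                     const double gradientNorm,
                     const double prevObjective,
                     const double objective,
                     const double minGradientNorm,
                     const double relTolerance)
{
  // Gradient test only after at least one iteration, so starting near the
  // X = 0 saddle point (where the gradient is already tiny) does not cause
  // an immediate exit.
  const bool gradSmall = (itNum > 0) && (gradientNorm < minGradientNorm);

  // Relative change in the objective: stop once the function value has
  // essentially stopped improving.
  const double denom = std::max({ std::abs(prevObjective),
                                  std::abs(objective), 1.0 });
  const bool relSmall = (itNum > 0) &&
      (std::abs(prevObjective - objective) / denom < relTolerance);

  return gradSmall || relSmall; // OR, not AND.
}
```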

factr is terminology borrowed from scipy (http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.fmin_l_bfgs_b.html), which in turn borrowed it from the Fortran L-BFGS-B code.
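For reference, the way factr is conventionally turned into a tolerance (per the scipy / L-BFGS-B documentation) is by scaling machine epsilon, roughly:

```cpp
#include <limits>

// Per the scipy docs: factr = 1e12 => low accuracy, 1e7 => moderate,
// 10 => extremely high accuracy. The relative improvement test then
// compares against factr * eps.
double ToleranceFromFactr(const double factr)
{
  return factr * std::numeric_limits<double>::epsilon();
}
```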

To address your observation that LRSDP does not converge generically and needs a lot of hand-holding: YES! Even if you read the original paper, the experimental section is full of text explaining the hacks they had to do with initialization points and penalty schedules to get the SDPs to converge. In fact, if you email Sam Burer asking about these heuristics, he will tell you he can't quite piece them together himself unless he spends quite a bit of time looking at his code.

I think this is just an unfortunate property of the algorithm, and we don't know how to make it generically robust (if we did, that would be paper-worthy in itself). Hence, I think the best we can do is offer a few choices of solvers (which is why I'm slowly working on an interior point solver, which will help for small/medium-sized problem instances).

Another thing we might want to think about: LRSDP doesn't provide any dual certificates, so we can't say anything about optimality. One cool thing we could do towards this, I think, is run a dual ascent solver in parallel, so at least you could certify that the duality gap is below a specific tolerance. Then I could essentially grid-search LRSDP from different starting points with different penalty parameters until the duality gap is small enough.
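A rough sketch of what that certification check could look like (hypothetical; nothing like this exists in mlpack yet), assuming a dual solver that produces a feasible dual objective value:

```cpp
#include <algorithm>
#include <cmath>

// For a feasible primal/dual pair in a minimization problem, the primal
// objective upper-bounds the dual objective, and their gap bounds the
// suboptimality of the primal iterate.
bool DualityGapCertified(const double primalObjective,
                         const double dualObjective,
                         const double tolerance)
{
  const double gap = primalObjective - dualObjective;
  const double scale = std::max(1.0, std::abs(primalObjective));
  return (gap / scale) < tolerance;
}
```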

@rcurtin
Member

rcurtin commented Jan 15, 2015

Okay; I understand the changes you've made then. Thanks for the explanation. One thing I overlooked, though: you've split out the line search termination conditions in order to give better failure messages, but on those failures it no longer returns false; it just breaks. Is there a particular reason for this? I think this could cause L_BFGS to iterate forever with no improvement, although the relative improvement termination condition should catch that. That would then imply there's no need for LineSearch() to return anything at all or to check whether or not it failed. If you agree with that line of reasoning, then after I merge it I'll make LineSearch() void and remove the checks.

> In fact, if you email Sam Burer asking about these heuristics, he will tell you he can't quite piece them together himself unless he spends quite a bit of time looking at his code.

Hah, I have one of those emails too! :)

He suggested that he avoided the R=0 problem by using the determinant of R^T R, and linked to a Mathematical Programming paper titled "Local Minima and Convergence in Low-Rank Semidefinite Programming":
http://www.researchgate.net/publication/225703502_Local_Minima_and_Convergence_in_Low-Rank_Semidefinite_Programming/file/d912f508148a86b469.pdf

and also suggested a (slightly) more recent paper:
http://link.springer.com/article/10.1007%2Fs10898-008-9328-4

It has been a very long time since I've so much as opened those papers, though...

I do think a more robust LRSDP would definitely be paper-worthy and of high interest. It would also allow mlpack to provide faster implementations of SDP-based algorithms that actually work. (This is better than the old rejected code that was fast but didn't work...) A dual solver would definitely help us know when a restart was necessary, but I don't know whether this would make the entire algorithm slower overall than just a regular SDP solver. Either way, it is certainly an interesting avenue to investigate.

@stephentu
Contributor Author

> One thing I overlooked, though: you've split out the line search termination conditions in order to give better failure messages, but on those failures it no longer returns false; it just breaks. Is there a particular reason for this? I think this could cause L_BFGS to iterate forever with no improvement, although the relative improvement termination condition should catch that. That would then imply there's no need for LineSearch() to return anything at all or to check whether or not it failed.

I think I actually needed break instead of return false for it to work. In my experience, it was cond3 that was triggering, that is, numIterations >= maxLineSearchTrials. In this case, I found it was better to take a (non-zero) step that didn't satisfy the Armijo conditions than to give up entirely on the optimization. In other words, it helps to get out of these strange valleys so that on subsequent iterations we can make good progress. I also tried increasing maxLineSearchTrials, but then the optimization became quite slow. You do bring up a valid point: if the step becomes very small, could it loop infinitely? That's why we have the relative difference condition, as you pointed out :)

I don't think, however, you should change LineSearch() to return void. There is still one condition that causes us to terminate entirely (the initialSearchDirectionDotGradient > 0.0 check above).
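To make the distinction concrete, here is a self-contained sketch of the control flow being discussed (not the actual LineSearch() code): running out of trials keeps the last backtracked step and breaks, while a non-descent direction still fails the whole search by returning false.

```cpp
#include <cstddef>
#include <functional>

// phi(t) evaluates f(x + t * d); phi0 = f(x); dirDotGrad = <grad f(x), d>.
bool LineSearchSketch(const std::function<double(double)>& phi,
                      const double phi0,
                      const double dirDotGrad,
                      double& stepSize,
                      const size_t maxLineSearchTrials = 50)
{
  if (dirDotGrad > 0.0)
    return false; // Not a descent direction: abort the optimization entirely.

  const double c1 = 1e-4; // Armijo (sufficient decrease) constant.
  stepSize = 1.0;
  for (size_t trial = 0; trial < maxLineSearchTrials; ++trial)
  {
    if (phi(stepSize) <= phi0 + c1 * stepSize * dirDotGrad)
      break; // Armijo condition satisfied.

    stepSize *= 0.5; // Backtrack and try again.
  }

  // If the trial budget is exhausted, the last (non-zero) step is kept
  // anyway rather than reporting failure.
  return true;
}
```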

@stephentu
Contributor Author

Also thanks for the pointer to the more recent paper on necessary and sufficient conditions. It's in my (ever growing) reading queue now.

@stephentu
Contributor Author

Did we ever come to a consensus on this? I can't remember.

@rcurtin
Member

rcurtin commented Jan 29, 2015

I sat down and ran some tests, and the L-BFGS implementation is not any slower than before (maybe a little faster? probably about the same), but the modifications also cause LRSDP to converge on the test cases an order of magnitude faster. I pulled from your lbfgs branch and merged it in, so I think we can call this one good. 9b55e5e..b257536

@rcurtin rcurtin closed this as completed Jan 29, 2015
@rcurtin rcurtin added this to the mlpack 1.1.0 milestone Jan 29, 2015