
Search direction #136

Closed
wants to merge 9 commits into from

Conversation

vdumoulin
Member

@vdumoulin vdumoulin commented Nov 1, 2012

I've completed adding the SearchDirection class to Pylearn2 and I tested it. It seems to work as I intended.

I also added some subclasses (adagrad and momentum, or at least momentum in the way I understand it) to test it a little.

If everything is okay with my pull request, I think there would be some cleanup to do in the SGD class as some of the hacks added over time might be replaced by a SearchDirection subclass.

updates = {}
direction = {}
for param, grad in gradients.iteritems():
    sqr_grad_sum = theano.shared(numpy.zeros_like(param.get_value()))
Contributor

@goodfeli goodfeli Nov 1, 2012


Use pylearn2.utils.sharedX so that this will respect theano.config.floatX. numpy.zeros_like will default to float64, so people who expect their shared variables to go on GPU will be disappointed.
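A minimal sketch of what this review point is about: `numpy.zeros_like` on a float64 array stays float64, while `pylearn2.utils.sharedX` casts its argument to `theano.config.floatX` before calling `theano.shared`. Here `"float32"` stands in for `floatX`, and `sharedX_value` is a hypothetical helper illustrating just the cast, not the real pylearn2 API:

```python
import numpy as np

floatX = "float32"  # stand-in for theano.config.floatX

def sharedX_value(value):
    # the cast sharedX performs before wrapping the value in theano.shared
    return np.asarray(value, dtype=floatX)

param_value = np.zeros(3)  # numpy defaults to float64
assert np.zeros_like(param_value).dtype == np.float64        # stays float64: would not go on GPU
assert sharedX_value(np.zeros_like(param_value)).dtype == np.float32
```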

@vdumoulin
Member Author

@vdumoulin vdumoulin commented Nov 2, 2012

I've committed some of the changes proposed and commented on the others. Thanks for the feedback!

@goodfeli
Contributor

@goodfeli goodfeli commented Nov 2, 2012

I think there should not be an LRDecay class. When I said before that you might want to use LRDecay and Momentum at the same time, I didn't mean to hack LRDecay into the momentum class. I meant having learning rate decay implemented as a SearchDirection is a problem because you either have to define an interface for combining two search directions, or you will need to hack learning rate decay into every SearchDirection that you might want to use with it.

The existing interface makes it pretty easy to use arbitrary learning rate schedules. If we merge this PR then we're stuck using one decay rate formula, and if we want to make that more flexible, we need to implement the flexibility in every SearchDirection class that we want to use with learning rate decay.

@vdumoulin
Member Author

@vdumoulin vdumoulin commented Nov 2, 2012

I've made the changes. Concerning the implementation of the momentum class, have you seen my comment?

@lamblin
Member

@lamblin lamblin commented Nov 2, 2012

Being able to chain different SearchDirections would make sense to me: we could simply pass the directions dictionary returned by the first one, and use it as the gradients of the next one. This could easily be implemented in a new SearchDirection class. The trickiest part would be chaining the update rules of the different SearchDirections, but I think calling clone with the previous updates dictionary as the replace kwarg would solve that.
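A hypothetical sketch of the chaining idea above, with each SearchDirection modeled as a plain function from a `{param: direction}` dict to a new one (`chain` and the two example stages are illustrative, not pylearn2 API; the Theano update-chaining subtlety is left out):

```python
def chain(stages):
    """Compose SearchDirection-like stages left to right."""
    def chained(gradients):
        directions = gradients
        for stage in stages:
            # each stage treats the previous stage's output as its "gradients"
            directions = stage(directions)
        return directions
    return chained

halve = lambda d: {p: 0.5 * v for p, v in d.items()}
flip = lambda d: {p: -v for p, v in d.items()}

combined = chain([halve, flip])
assert combined({"W": 4.0}) == {"W": -2.0}
```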

@goodfeli
Contributor

@goodfeli goodfeli commented Nov 2, 2012

Which comment about the Momentum class are you referring to? I don't think I've seen it.

Which clone method are you talking about?

Chaining SearchDirections involves some subtleties and I find it a bit hard to think through just off the top of my head.

Under the original interface we discussed, a SearchDirection can safely assume that the optimization algorithm will move in the direction that it returns, and it can assume that the direction it is passed is the gradient. Chaining SearchDirections together breaks both assumptions. Have you thought through what the consequences of that would be for a few different use cases?

if self.momentum is None:
    updates.update(dict(safe_zip(params, [param - learning_rate * \
-                                         lr_scalers.get(param, 1.) * grads[param]
+                                         lr_scalers.get(param, 1.) * direction[param]
                                          for param in params])))
else:
    for param in params:
Contributor

@goodfeli goodfeli Nov 2, 2012


Also, it looks like you didn't modify the case where there is momentum.

@lamblin
Member

@lamblin lamblin commented Nov 2, 2012

@goodfeli I was trying to define a simple "interface for combining two search directions". Chaining them is my proposal.
Regarding clone, I was thinking about theano.clone.
I do not think that the assumptions you mention are necessary for most of the SearchDirections to work, although I agree that some chaining would not make any sense (I would not use momentum after conjugate gradient, for instance).
In my opinion, if a particular SearchDirection needs the incoming directions to be actual gradients, then it should be first in the chain. If another needs the returned directions to be used directly in the updates, then it should be last in the chain. If it makes sense to scale the directions differently for different parameters, but only with one scalar scale per parameter, then the next module should only do that, but I think that decision belongs to the programmer.
Typically, a decaying learning rate would be applied last (though perhaps not after another instance of itself, or another SearchDirection serving the same purpose). Momentum may be applied after TONGA, and then we can add decaying LR. In fact, I think second-order methods can be safely applied to the gradients returned by TONGA.

@vdumoulin
Member Author

@vdumoulin vdumoulin commented Nov 5, 2012

@goodfeli

Here's what I wrote about the Momentum subclass:

I must have figured it out wrong. The equations I had were

Theta_t <-- Theta_{t-1} + v_t
v_t <-- mu * v_{t-1} - epsilon * grad(Theta_{t-1})

which needed to be in the form

Theta_t <-- Theta_{t-1} - epsilon * search_direction

to be used by SGD, so I factored out -epsilon in the expression of v_t, which gave me

v_t <-- -epsilon * (grad(Theta_{t-1}) - (mu/epsilon) * v_{t-1}) = -epsilon * search_direction
search_direction = grad(Theta_{t-1}) - (mu/epsilon) * v_{t-1}
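The factoring can be checked numerically; a quick sketch with plain floats and made-up values (not pylearn2 code):

```python
mu, epsilon = 0.9, 0.01
theta, v, grad = 1.0, 0.5, 2.0

# standard momentum step
v_new = mu * v - epsilon * grad
theta_momentum = theta + v_new

# the same step written as theta - epsilon * search_direction
search_direction = grad - (mu / epsilon) * v
theta_search = theta - epsilon * search_direction

assert abs(theta_momentum - theta_search) < 1e-12
```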

@vdumoulin
Member Author

@vdumoulin vdumoulin commented Nov 5, 2012

(Once I fix the Momentum subclass, would it be better to remove the momentum option from the SGD class entirely?)

@goodfeli
Contributor

@goodfeli goodfeli commented Nov 7, 2012

I've gotten fairly busy preparing for the AISTATS deadline next week. Do you mind if I wait until then to finish reviewing this?

@vdumoulin
Member Author

@vdumoulin vdumoulin commented Nov 8, 2012

No problem!

@vdumoulin
Member Author

@vdumoulin vdumoulin commented Nov 9, 2012

I've rebased my commits and finished my implementation of momentum. I've added lots of comments to justify the counter-intuitive way it is implemented, so hopefully everything will be clear for those who use it.

I also removed the old way momentum was implemented.

@@ -551,6 +536,7 @@ def __call__(self, algorithm):
        algorithm.learning_rate.set_value(new_lr)


# TODO: decide whether this class is still relevant with SearchDirection
Contributor

@goodfeli goodfeli Nov 14, 2012


Yes it is. This is what allows you to change the momentum over time, which is very important for getting good results. But it would have to be updated to point to the new momentum coefficient.

@goodfeli
Contributor

@goodfeli goodfeli commented Nov 14, 2012

I'm still not convinced momentum should be a SearchDirection, for the following reasons:

  1. The case where learning_rate = 0 is unsupported. learning_rate=0 actually is a useful case to be able to run for debugging. If you notice something bad happening during learning and you aren't sure if it is caused by the learning itself or by one of your callbacks, setting the learning rate to 0 can clear that up. With this implementation you'd need to remember to also disable momentum.
  2. Have you tested it to make sure it is numerically stable? Having to take the difference between the gradient and something divided by epsilon at every step is making me nervous. I'm not very good at analyzing numerical stuff but it seems to me like this could cause problems if the momentum coefficient is close to 1, the gradient points in the same direction for several steps, and the learning rate is small.
  3. Changing the math makes it harder to understand what we're doing. Users need to understand not just momentum but also the Pylearn2 SearchDirection system and why we had to rearrange things to shove momentum into that system.
  4. Since momentum is learning-rate dependent, we're not really computing one search direction anymore. It makes the SearchDirection metaphor inappropriate.
  5. Algorithms other than SGD don't use momentum so there is not much to be gained by factoring momentum out of the TrainingAlgorithm.
  6. You can't implement Nesterov momentum in this framework, because Nesterov momentum requires taking the gradient at a different point than the current model parameters. So the SGD algorithm is going to need a different mechanism for generalizing momentum anyway.
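Point 2 is easy to illustrate with the search-direction formula from earlier in the thread (`search_direction = grad - (mu / epsilon) * v`); the values below are made up:

```python
mu, epsilon = 0.999, 1e-6   # momentum coefficient near 1, small learning rate
grad, v = 1.0, 0.01

search_direction = grad - (mu / epsilon) * v
# the mu/epsilon term dwarfs the gradient, so the direction is dominated by a
# large term that the later multiplication by -epsilon has to cancel back down
assert abs(search_direction) > 1_000 * abs(grad)
```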

@yoshua

@yoshua yoshua commented Nov 14, 2012

Nicolas Boulanger may have something to say about point 6. He has discovered a reformulation of Nesterov that makes it of the same EXACT form as regular momentum except for the way the linear coefficients are chosen.

-- Yoshua


@dwf
Contributor

@dwf dwf commented Nov 14, 2012

#2 definitely could pose a problem. #4 I think is a strong argument against momentum-as-search-direction.

@boulanni

@boulanni boulanni commented Nov 14, 2012

Yes, I found that the following change of variable simplifies the implementation of Nesterov momentum. It is in the same form as regular momentum in the sense that both velocity and parameter updates depend only on the gradient at the current value of the parameters.

In short:

regular momentum:
(1) v_t = mu * v_{t-1} - lr * gradient_f(params_{t-1})
(2) params_t = params_{t-1} + v_t
(3) params_t = params_{t-1} + mu * v_{t-1} - lr * gradient_f(params_{t-1})

Nesterov momentum:
(4) v_t = mu * v_{t-1} - lr * gradient_f(params_{t-1} + mu * v_{t-1})
(5) params_t = params_{t-1} + v_t

alternate formulation for Nesterov momentum:
(6) v_t = mu * v_{t-1} - lr * gradient_f(params_{t-1})
(7) params_t = params_{t-1} + mu * v_t - lr * gradient_f(params_{t-1})
(8) params_t = params_{t-1} + mu**2 * v_{t-1} - (1 + mu) * lr * gradient_f(params_{t-1})

So with Theano you can use (1) then either (2) or (7)/(8) to have both options.
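A quick numeric check that (7), with v_t substituted from (6), matches the closed form (8); plain floats with illustrative values:

```python
mu, lr = 0.9, 0.1
v_prev, g, p_prev = 0.5, 2.0, 1.0

v_t = mu * v_prev - lr * g                              # (6)
p_via_7 = p_prev + mu * v_t - lr * g                    # (7)
p_via_8 = p_prev + mu**2 * v_prev - (1 + mu) * lr * g   # (8)

assert abs(p_via_7 - p_via_8) < 1e-12
```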

@goodfeli
Contributor

@goodfeli goodfeli commented Nov 14, 2012

Thanks @boulanni . I will add this to the repo as a note for someone to implement.

@vdumoulin
Member Author

@vdumoulin vdumoulin commented Nov 19, 2012

I implemented Nicolas's reformulation of Nesterov momentum; hopefully everything is all right!

@@ -54,6 +54,8 @@ class TransformerIterator(object):
     def __init__(self, raw_iterator, transformer_dataset):
         self.raw_iterator = raw_iterator
         self.transformer_dataset = transformer_dataset
+        self.stochastic = raw_iterator.stochastic
+        self.uneven = raw_iterator.uneven
Contributor

@goodfeli goodfeli Nov 19, 2012


Looks good. Is there a unit test we should add to make sure this does not break again?

@vdumoulin
Member Author

@vdumoulin vdumoulin commented Nov 20, 2012

It seems like the last rebase I did messed up BIG TIME: the momentum implementation completely disappeared from the last commit. That would explain the strange disappearance of docstring comments.

Sorry about that, I'll have this fixed soon.

@goodfeli
Contributor

@goodfeli goodfeli commented Nov 20, 2012

ok, @dwf is probably the most knowledgeable person about git in the lab. He can probably help you recover stuff that disappeared. I know how to do a rebase but I don't know how to recover from a rebase gone wrong.

@goodfeli
Contributor

@goodfeli goodfeli commented Dec 5, 2012

@vdumoulin is planning to redo this in several small pull requests rather than one big rebase, so I am closing the PR.
