Search direction #136

Closed · wants to merge 9 commits
Conversation

vdumoulin (Member) commented Nov 1, 2012

I've completed adding the SearchDirection class to Pylearn2 and I tested it. It seems to work as I intended.

I also added some subclasses (adagrad and momentum, or at least momentum in the way I understand it) to test it a little.

If everything is okay with my pull request, I think there would be some cleanup to do in the SGD class, since some of the hacks added over time could be replaced by SearchDirection subclasses.

goodfeli reviewed Nov 1, 2012 on pylearn2/search_direction/adagrad.py (outdated)
updates = {}
direction = {}
for param, grad in gradients.iteritems():
sqr_grad_sum = theano.shared(numpy.zeros_like(param.get_value()))

goodfeli (Contributor), Nov 1, 2012:

Use pylearn2.utils.sharedX so that this will respect theano.config.floatX. numpy.zeros_like will default to float64, so people who expect their shared variables to go on GPU will be disappointed.
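A minimal sketch of the suggested fix, assuming the accumulator is built from the parameter's current value (the helper name make_accumulator is hypothetical and not part of the PR):

```python
import numpy
from pylearn2.utils import sharedX


def make_accumulator(param):
    """Create a floatX accumulator shaped like a shared parameter.

    sharedX casts the initial value to theano.config.floatX, so the
    accumulator stays in float32 (and on the GPU) when floatX is float32;
    theano.shared(numpy.zeros_like(param.get_value())) would default to
    float64 instead.
    """
    return sharedX(numpy.zeros_like(param.get_value()),
                   name='sqr_grad_sum')
```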

goodfeli reviewed Nov 1, 2012 on pylearn2/search_direction/adagrad.py (outdated)
for param, grad in gradients.iteritems():
sqr_grad_sum = theano.shared(numpy.zeros_like(param.get_value()))
updates[sqr_grad_sum] = sqr_grad_sum + grad**2
direction[param] = theano.tensor.cast(grad / theano.tensor.sqrt(sqr_grad_sum + self.c), 'float32')

goodfeli (Contributor), Nov 1, 2012:

Don't use a cast. Use:

assert rval.dtype == grad.dtype

and if the assert fails, figure out what changed the dtype and fix it. Using the cast at the end will make stuff bounce onto CPU and then go back to the GPU after the cast.
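A hedged sketch of the reviewed loop with the assert in place of the cast (the function name and the (direction, updates) return convention are assumed for illustration, not taken verbatim from the PR):

```python
import numpy
import theano.tensor as T
from pylearn2.utils import sharedX


def adagrad_direction(gradients, c=1e-6):
    """Illustrative AdaGrad-style direction with a dtype check.

    gradients maps shared parameters to their gradient expressions;
    returns (direction, updates) dictionaries.
    """
    updates = {}
    direction = {}
    for param, grad in gradients.iteritems():  # Python 2, as in the PR
        sqr_grad_sum = sharedX(numpy.zeros_like(param.get_value()))
        updates[sqr_grad_sum] = sqr_grad_sum + T.sqr(grad)
        rval = grad / T.sqrt(sqr_grad_sum + c)
        # If this fails, something upcast the graph to float64; fix the
        # source of the upcast rather than casting back down, which would
        # bounce the computation through the CPU.
        assert rval.dtype == grad.dtype
        direction[param] = rval
    return direction, updates
```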

goodfeli reviewed Nov 1, 2012 on pylearn2/search_direction/lr_decay.py (outdated)
import theano
from pylearn2.search_direction.search_direction import SearchDirection


goodfeli (Contributor), Nov 1, 2012:

I don't think it makes much sense to implement learning rate decay as a SearchDirection. Usually we will want to use both momentum and learning rate decay, so if both of those are SearchDirections we need to implement some system for composing two SearchDirections together.

vdumoulin (Author, Member), Nov 2, 2012:

Would it be a better idea to add learning rate decay as an optional parameter directly into the SearchDirection subclasses in which it fits?

If we're going to implement the SearchDirection class, it would be better to stick with it; otherwise we might end up hacking things into SGD again, which I think is what we're trying to avoid.

goodfeli reviewed Nov 1, 2012 on pylearn2/search_direction/momentum.py (outdated)
for param, grad in gradients.iteritems():
v = theano.shared(numpy.zeros_like(param.get_value()))
updates[v] = grad - self.mass * v
direction[param] = theano.tensor.cast(v, 'float32')

goodfeli (Contributor), Nov 1, 2012:

Same issue here with the cast instead of the assert.

dwf reviewed Nov 1, 2012 on pylearn2/search_direction/search_direction.py (outdated)

Parameters
----------
gradients: dictionary mapping parameters (tensor-like theano

dwf (Contributor), Nov 1, 2012:

It'd be easier if these followed the NumPy convention used elsewhere in the library so that when we eventually rig up numpydoc to generate API documentation it will render nicely.
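For reference, a numpydoc-style version of that parameter block might look like the sketch below (the class layout, the method name dir_from_grads, and the exact wording are illustrative, not taken from the PR):

```python
class SearchDirection(object):
    """Sketch of how the base-class docstring could be formatted."""

    def dir_from_grads(self, gradients):
        """
        Compute a search direction from the gradients.

        Parameters
        ----------
        gradients : dict
            Maps parameters (Theano shared variables) to their gradients
            (tensor-like Theano variables).

        Returns
        -------
        direction : dict
            Maps each parameter to its transformed gradient.
        updates : dict
            Maps persistent variables used in the transformation to their
            updated values.
        """
        raise NotImplementedError(str(type(self)) +
                                  " does not implement dir_from_grads.")
```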

vdumoulin (Author, Member), Nov 2, 2012:

I thought I was following it; can you point out which comments break the convention?

goodfeli reviewed Nov 1, 2012 on pylearn2/search_direction/momentum.py (outdated)
direction = {}
for param, grad in gradients.iteritems():
v = theano.shared(numpy.zeros_like(param.get_value()))
updates[v] = grad - self.mass * v

goodfeli (Contributor), Nov 1, 2012:

This looks like the wrong equation. If your gradient is 0 and your mass is 1, v gets updated to -v? That's like anti-momentum.

vdumoulin (Author, Member), Nov 2, 2012:

I must have figured it out wrong. The equations I had were

Theta_t <-- Theta_{t-1} + v_t
v_t <-- mu * v_{t-1} - epsilon * grad(Theta_{t-1})

which needed to be in the form

Theta_t <-- Theta_{t-1} - epsilon * search_direction

to be used by SGD, so I factored out -epsilon in the expression of v_t, which gave me

v_t <-- -epsilon * (grad(Theta_{t-1}) - (mu / epsilon) * v_{t-1}) = -epsilon * search_direction
search_direction = grad(Theta_{t-1}) - (mu / epsilon) * v_{t-1}

Is it correct?

dwf reviewed Nov 1, 2012 on pylearn2/search_direction/lr_decay.py (outdated)
updates[time] = time + 1
direction = {}
for param, grad in gradients.iteritems():
direction[param] = theano.tensor.cast(grad * decay, 'float32')

dwf (Contributor), Nov 1, 2012:

Same issue with the cast.

goodfeli reviewed Nov 1, 2012 on pylearn2/search_direction/search_direction.py (outdated)
The __init__ method should store parameters used to compute the
transformed gradient (decay rate of annealing and such...).
"""
raise NotImplementedError(str(type(self))+" does not implement "+ \

goodfeli (Contributor), Nov 1, 2012:

It shouldn't be an error not to have an __init__ method. For example, if we made a ConjugateGradient SearchDirection, I don't think it would have any settings that need to be configured in __init__.

goodfeli reviewed Nov 1, 2012 on pylearn2/search_direction/search_direction.py (outdated)
variables) to transformed gradients (tensor-like theano
variables)
updates: dictionary mapping persistent variables used in the
transformation of the gradient to their updated value

goodfeli (Contributor), Nov 1, 2012:

Add a sentence like "theano functions computing gradients must apply all of these updates in order for successive calls to the function to be correct."
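To illustrate why that sentence matters, here is a toy, purely illustrative example (none of these names come from the PR) of a persistent variable whose update must be compiled into the training function:

```python
import numpy
import theano
import theano.tensor as T
from pylearn2.utils import sharedX

# Toy setup: one parameter, a quadratic cost.
x = T.vector('x')
w = sharedX(numpy.zeros(3), name='w')
cost = T.sum((T.dot(x, w) - 1.) ** 2)
grad = T.grad(cost, w)

# Suppose a SearchDirection returned this momentum-like direction along
# with an update rule for its persistent velocity variable v.
v = sharedX(numpy.zeros(3), name='v')
direction = {w: grad - 0.9 * v}
updates = {v: grad - 0.9 * v}

# The compiled training function must apply the SearchDirection's updates
# too; otherwise successive calls would keep reading a stale v.
learning_rate = 0.01
updates[w] = w - learning_rate * direction[w]
train = theano.function([x], cost, updates=updates)
```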

goodfeli reviewed Nov 1, 2012 on pylearn2/training_algorithms/sgd.py (outdated)
# The updates are computed on the search direction, which is the
# gradient if no SearchDirection object was provided at
# initialization. We should consider cleaning up the SGD class
# a bit, since some old options might have become redundant.

goodfeli (Contributor), Nov 1, 2012:

"We should consider cleaning up the SGD class
a bit, since some old options might have become redundant."
->
# TODO: remove SGD.momentum and implement momentum as a SearchDirection

vdumoulin (Author, Member) commented Nov 2, 2012

I've committed some of the proposed changes and commented on the others. Thanks for the feedback!

goodfeli (Contributor) commented Nov 2, 2012

I think there should not be an LRDecay class. When I said before that you might want to use LRDecay and Momentum at the same time, I didn't mean to hack LRDecay into the Momentum class. I meant that implementing learning rate decay as a SearchDirection is a problem because you either have to define an interface for combining two search directions, or you will need to hack learning rate decay into every SearchDirection that you might want to use with it.

The existing interface makes it pretty easy to use arbitrary learning rate schedules. If we merge this PR, then we're stuck using one decay rate formula, and if we want to make that more flexible, we need to implement the flexibility in every SearchDirection class that we want to use with learning rate decay.

vdumoulin (Author, Member) commented Nov 2, 2012

I've made the changes. Concerning the implementation of the momentum class, have you seen my comment?

lamblin (Member) commented Nov 2, 2012

Being able to chain different SearchDirections would make sense to me: we could simply pass the directions dictionary returned by the first one, and use it as the gradients of the next one. This could easily be implemented in a new SearchDirection class. The trickiest part would be chaining the update rules of the different SearchDirections, but I think calling clone with the previous updates dictionary as the replace kwarg would solve that.
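One possible reading of this proposal, sketched under the assumption that SearchDirection exposes a dir_from_grads(gradients) -> (direction, updates) method (the method name and the exact chaining semantics are assumptions, not part of the PR):

```python
import theano
from pylearn2.search_direction.search_direction import SearchDirection


class Chain(SearchDirection):
    """Hypothetical composition of several SearchDirections.

    Each stage receives the previous stage's directions as if they were
    gradients. Before a stage runs, its inputs are rewritten with
    theano.clone so that references to earlier stages' persistent
    variables see those stages' updated values.
    """

    def __init__(self, stages):
        self.stages = stages

    def dir_from_grads(self, gradients):
        direction = gradients
        all_updates = {}
        for stage in self.stages:
            # Substitute the updates accumulated so far into the incoming
            # directions, per the theano.clone suggestion above.
            cloned = dict((p, theano.clone(d, replace=all_updates))
                          for p, d in direction.iteritems())
            direction, updates = stage.dir_from_grads(cloned)
            all_updates.update(updates)
        return direction, all_updates
```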

goodfeli (Contributor) commented Nov 2, 2012

Which comment about the Momentum class are you referring to? I don't think I've seen it.

Which clone method are you talking about?

Chaining SearchDirections involves some subtleties and I find it a bit hard to think through just off the top of my head.

Under the original interface we discussed, a SearchDirection can safely assume that the optimization algorithm will move in the direction that it returns, and it can assume that the direction it is passed is the gradient. Chaining SearchDirections together breaks both assumptions. Have you thought through what the consequences of that would be for a few different use cases?

goodfeli reviewed Nov 2, 2012 on pylearn2/training_algorithms/sgd.py (outdated)
if self.momentum is None:
    updates.update( dict(safe_zip(params, [param - learning_rate * \
-       lr_scalers.get(param, 1.) * grads[param]
+       lr_scalers.get(param, 1.) * direction[param]
        for param in params])))
else:
    for param in params:

goodfeli (Contributor), Nov 2, 2012:

Also, it looks like you didn't modify the case where there is momentum.

lamblin (Member) commented Nov 2, 2012

@goodfeli I was trying to define a simple "interface for combining two search directions". Chaining them is my proposal.
Regarding clone, I was thinking about theano.clone.
I do not think the assumptions you mention are necessary for most SearchDirections to work, although I agree that some chainings would not make any sense (I would not use momentum after conjugate gradient, for instance).
In my opinion, if a particular SearchDirection needs the incoming directions to be actual gradients, then it should be first in the chain. If another needs the returned directions to be used directly in the updates, then it should be last in the chain. If it makes sense to scale the directions differently for different parameters, but only by one scalar per parameter, then the next module should do only that; I think that decision belongs to the programmer.
Typically, learning rate decay would be applied last (though perhaps not after another instance of itself, or after another SearchDirection serving the same purpose). Momentum may be applied after TONGA, and then we can add learning rate decay. In fact, I think second-order methods can be safely applied to the gradients returned by TONGA.

vdumoulin (Author, Member) commented Nov 5, 2012

@goodfeli

Here's what I wrote about the Momentum subclass:

I must have figured it out wrong. The equations I had were

Theta_t <-- Theta_{t-1} + v_t
v_t <-- mu * v_{t-1} - epsilon * grad(Theta_{t-1})

which needed to be in the form

Theta_t <-- Theta_{t-1} - epsilon * search_direction

to be used by SGD, so I factored out -epsilon in the expression of v_t, which gave me

v_t <-- -epsilon * (grad(Theta_{t-1}) - (mu / epsilon) * v_{t-1}) = -epsilon * search_direction
search_direction = grad(Theta_{t-1}) - (mu / epsilon) * v_{t-1}

vdumoulin (Author, Member) commented Nov 5, 2012

(Once I fix the Momentum subclass, would it be better to remove the momentum option from the SGD class entirely?)

goodfeli (Contributor) commented Nov 7, 2012

I've gotten fairly busy preparing for the AISTATS deadline next week. Do you mind if I wait to finish reviewing this until then?

vdumoulin (Author, Member) commented Nov 8, 2012

No problem!

vdumoulin (Author, Member) commented Nov 9, 2012

I've rebased my commits and finished my implementation of momentum. I've added lots of comments to justify the counter-intuitive way it is implemented, so hopefully everything will be clear to those who use it.

I also removed the old momentum implementation.

@@ -551,6 +536,7 @@ def __call__(self, algorithm):
        algorithm.learning_rate.set_value(new_lr)


+# TODO: decide whether this class is still relevant with SearchDirection

goodfeli (Contributor), Nov 14, 2012:

Yes it is. This is what allows you to change the momentum over time, which is very important for getting good results. But it would have to be updated to point to the new momentum coefficient.

goodfeli (Contributor) commented Nov 14, 2012

I'm still not convinced momentum should be a SearchDirection, for the following reasons:

  1. The case where learning_rate = 0 is unsupported. learning_rate=0 actually is a useful case to be able to run for debugging. If you notice something bad happening during learning and you aren't sure if it is caused by the learning itself or by one of your callbacks, setting the learning rate to 0 can clear that up. With this implementation you'd need to remember to also disable momentum.
  2. Have you tested it to make sure it is numerically stable? Having to take the difference between the gradient and something divided by epsilon at every step is making me nervous. I'm not very good at analyzing numerical stuff but it seems to me like this could cause problems if the momentum coefficient is close to 1, the gradient points in the same direction for several steps, and the learning rate is small.
  3. Changing the math makes it harder to understand what we're doing. Users need to understand not just momentum but also the Pylearn2 SearchDirection system and why we had to rearrange things to shove momentum into that system.
  4. Since momentum is learning-rate dependent, we're not really computing one search direction anymore. It makes the SearchDirection metaphor inappropriate.
  5. Algorithms other than SGD don't use momentum so there is not much to be gained by factoring momentum out of the TrainingAlgorithm.
  6. You can't implement Nesterov momentum in this framework, because Nesterov momentum requires taking the gradient at a different point than the current model parameters. So the SGD algorithm is going to need a different mechanism for generalizing momentum anyway.

yoshua commented Nov 14, 2012

Nicolas Boulanger may have something to say about point 6. He has discovered a reformulation of Nesterov that makes it of the same EXACT form as regular momentum except for the way the linear coefficients are chosen.

-- Yoshua

dwf (Contributor) commented Nov 14, 2012

Point 2 definitely could pose a problem. Point 4, I think, is a strong argument against momentum-as-search-direction.

boulanni commented Nov 14, 2012

Yes, I found that the following change of variable simplifies the implementation of Nesterov momentum. It is in the same form as regular momentum in the sense that both velocity and parameter updates depend only on the gradient at the current value of the parameters.

In short:

regular momentum:
(1) v_t = mu * v_t-1 - lr * gradient_f(params_t-1)
(2) params_t = params_t-1 + v_t
(3) params_t = params_t-1 + mu * v_t-1 - lr * gradient_f(params_t-1)

Nesterov momentum:
(4) v_t = mu * v_t-1 - lr * gradient_f(params_t-1 + mu * v_t-1)
(5) params_t = params_t-1 + v_t

alternate formulation for Nesterov momentum:
(6) v_t = mu * v_t-1 - lr * gradient_f(params_t-1)
(7) params_t = params_t-1 + mu * v_t - lr * gradient_f(params_t-1)
(8) params_t = params_t-1 + mu**2 * v_t-1 - (1+mu) * lr * gradient_f(params_t-1)

So with Theano you can use (1) then either (2) or (7)/(8) to have both options.
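A hedged Theano sketch of that recipe (the helper name and signature are illustrative, not from the PR; params is a list of Theano shared variables, and lr and mu may be Python floats or shared variables):

```python
import numpy
import theano
import theano.tensor as T
from pylearn2.utils import sharedX


def momentum_updates(cost, params, lr, mu, nesterov=False):
    """Regular vs. Nesterov momentum via the change of variables above.

    Both variants share the velocity update (1)/(6); they differ only in
    how the parameters move: (2) for regular momentum, (7) for Nesterov.
    """
    updates = {}
    for param in params:
        grad = T.grad(cost, param)
        v = sharedX(numpy.zeros_like(param.get_value()))
        new_v = mu * v - lr * grad                            # (1) == (6)
        updates[v] = new_v
        if nesterov:
            updates[param] = param + mu * new_v - lr * grad   # (7)
        else:
            updates[param] = param + new_v                    # (2)
    return updates
```

A training function would then be compiled with something like theano.function(inputs, cost, updates=momentum_updates(cost, params, lr, mu)), so the velocity and parameter updates are both applied on every call.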

goodfeli (Contributor) commented Nov 14, 2012

Thanks @boulanni. I will add this to the repo as a note for someone to implement.

vdumoulin (Author, Member) commented Nov 19, 2012

I implemented Nicolas's reformulation of Nesterov momentum; hopefully everything is all right!

@@ -54,6 +54,8 @@ class TransformerIterator(object):
     def __init__(self, raw_iterator, transformer_dataset):
         self.raw_iterator = raw_iterator
         self.transformer_dataset = transformer_dataset
+        self.stochastic = raw_iterator.stochastic
+        self.uneven = raw_iterator.uneven

goodfeli (Contributor), Nov 19, 2012:

Looks good. Is there a unit test we should add to make sure this does not break again?

@@ -54,19 +56,6 @@ def __init__(self, learning_rate, cost, batch_size=None,
        can change it after each minibatch.
    cost: a pylearn2.costs.cost.Cost object specifying the objective
        function to be minimized.
-   init_momentum: if None, does not use momentum

goodfeli (Contributor), Nov 19, 2012:

Can you restore the init_momentum docstring?

from pylearn2.search_direction.search_direction import SearchDirection


class Momentum(SearchDirection):

goodfeli (Contributor), Nov 19, 2012:

I thought we were moving Momentum back into the SGD algorithm.

@@ -42,9 +44,9 @@ class SGD(TrainingAlgorithm):
     def __init__(self, learning_rate, cost, batch_size=None,
                  monitoring_batches=None, monitoring_dataset=None,
                  termination_criterion=None, update_callbacks=None,
-                 init_momentum = None, set_batch_size = False,
+                 init_momentum = None, nesterov_momentum=False, set_batch_size = False,

goodfeli (Contributor), Nov 19, 2012:

I don't actually see the nesterov_momentum flag used anywhere in the "Files changed" view.

vdumoulin (Author, Member) commented Nov 20, 2012

It seems like the last rebase I did messed up BIG TIME: the momentum implementation completely disappeared from the last commit. That would explain the strange disappearance of docstring comments.

Sorry about that, I'll have this fixed soon.

goodfeli (Contributor) commented Nov 20, 2012

ok, @dwf is probably the most knowledgeable person about git in the lab. He can probably help you recover stuff that disappeared. I know how to do a rebase but I don't know how to recover from a rebase gone wrong.

goodfeli (Contributor) commented Dec 5, 2012

@vdumoulin is planning to redo this in several small pull requests rather than one big rebase, so I am closing the PR.

goodfeli closed this Dec 5, 2012
vdumoulin deleted the vdumoulin:search_direction branch Dec 7, 2013
lucasb-eyer mentioned this pull request Feb 26, 2015
lucasb-eyer mentioned this pull request Mar 5, 2015
ajbrock mentioned this pull request Mar 1, 2017