
Search direction #136

Closed
vdumoulin wants to merge 9 commits into base: master

6 participants
@vdumoulin
Member

vdumoulin commented Nov 1, 2012

I've completed adding the SearchDirection class to Pylearn2 and I tested it. It seems to work as I intended.

I also added some subclasses (adagrad and momentum, or at least momentum in the way I understand it) to test it a little.

If everything is okay with my pull request, I think there would be some cleanup to do in the SGD class as some of the hacks added over time might be replaced by a SearchDirection subclass.

Inline review comments (now outdated):

- @goodfeli on pylearn2/search_direction/adagrad.py (2 comments)
- @goodfeli on pylearn2/search_direction/lr_decay.py
- @goodfeli on pylearn2/search_direction/momentum.py (2 comments)
- @dwf on pylearn2/search_direction/search_direction.py
- @dwf on pylearn2/search_direction/lr_decay.py
- @goodfeli on pylearn2/search_direction/search_direction.py (2 comments)
- @goodfeli on pylearn2/training_algorithms/sgd.py
@vdumoulin (Member) commented Nov 2, 2012

I've committed some of the changes proposed and commented on the others. Thanks for the feedback!

@goodfeli (Contributor) commented Nov 2, 2012

I think there should not be an LRDecay class. When I said before that you might want to use LRDecay and Momentum at the same time, I didn't mean to hack LRDecay into the Momentum class. I meant that having learning rate decay implemented as a SearchDirection is a problem: you either have to define an interface for combining two search directions, or you will need to hack learning rate decay into every SearchDirection that you might want to use with it.

The existing interface makes it pretty easy to use arbitrary learning rate schedules. If we merge this PR, we're stuck using one decay rate formula, and if we want to make that more flexible, we need to implement the flexibility in every SearchDirection class that we want to use with learning rate decay.
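The kind of schedule the existing interface supports can be sketched as an update callback. This is a hypothetical example (the class name and decay rule are illustrative, not actual Pylearn2 code), built around the `algorithm.learning_rate.set_value` pattern that appears in the sgd.py diff later in this thread:

```python
# Hypothetical learning-rate-decay update callback (illustrative name and
# decay rule; not actual Pylearn2 code). SGD update callbacks receive the
# algorithm each step and may adjust its learning_rate shared variable.
class ExponentialDecayCallback(object):
    def __init__(self, decay_factor, min_lr=1e-6):
        self.decay_factor = decay_factor
        self.min_lr = min_lr

    def __call__(self, algorithm):
        # Shrink the learning rate geometrically, never below min_lr.
        current = algorithm.learning_rate.get_value()
        new_lr = max(current * self.decay_factor, self.min_lr)
        algorithm.learning_rate.set_value(new_lr)
```

Because the schedule lives in a callback rather than a SearchDirection, it composes freely with any search direction, which is the flexibility being argued for here.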

@vdumoulin (Member) commented Nov 2, 2012

I've made the changes. Concerning the implementation of the Momentum class, have you seen my comment?

@lamblin (Member) commented Nov 2, 2012

Being able to chain different SearchDirections would make sense to me: we could simply pass the directions dictionary returned by the first one and use it as the gradients of the next one. This could easily be implemented in a new SearchDirection class. The trickiest part would be chaining the update rules of the different SearchDirections, but I think calling clone with the previous updates dictionary as the replace kwarg would solve that.
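The chaining idea can be sketched roughly as follows. This is a hypothetical class: the method name `dir_from_grads` and the (directions, updates) return convention are assumptions about the SearchDirection interface, not actual Pylearn2 code.

```python
# Hypothetical chaining of SearchDirections (not actual Pylearn2 code).
# The method name dir_from_grads and the (directions, updates) return
# convention are assumptions about the interface under discussion.
class ChainedSearchDirection(object):
    def __init__(self, stages):
        self.stages = stages  # SearchDirection-like objects, applied in order

    def dir_from_grads(self, grads):
        """grads: dict mapping each parameter to its gradient expression.

        Each stage's output directions become the next stage's input
        "gradients"; all stages' updates are merged into one dictionary.
        """
        directions = grads
        updates = {}
        for stage in self.stages:
            directions, stage_updates = stage.dir_from_grads(directions)
            updates.update(stage_updates)
        return directions, updates
```

In a real implementation, merging each stage's Theano updates would need the theano.clone trick described above; plain dict merging stands in for that here.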

@goodfeli (Contributor) commented Nov 2, 2012

Which comment about the Momentum class are you referring to? I don't think I've seen it.

Which clone method are you talking about?

Chaining SearchDirections involves some subtleties, and I find it a bit hard to think through just off the top of my head.

Under the original interface we discussed, a SearchDirection can safely assume that the optimization algorithm will move in the direction it returns, and it can assume that the direction it is passed is the gradient. Chaining SearchDirections together breaks both assumptions. Have you thought through what the consequences of that would be for a few different use cases?

@goodfeli on pylearn2/training_algorithms/sgd.py (outdated review comment)
@lamblin (Member) commented Nov 2, 2012

@goodfeli I was trying to define a simple "interface for combining two search directions". Chaining them is my proposal.
Regarding clone, I was thinking about theano.clone.
I do not think the assumptions you mention are necessary for most SearchDirections to work, although I agree that some chainings would not make any sense (I would not use momentum after conjugate gradient, for instance).
In my opinion, if a particular SearchDirection needs the incoming directions to be actual gradients, then it should be first in the chain. If another needs the returned directions to be used directly in the updates, then it should be last in the chain. If it makes sense to scale the directions for different parameters differently, but only with one scalar scale per parameter, then the next module should only do that; but I think that decision belongs to the programmer.
Typically, a decaying learning rate would be applied last (though perhaps not after another instance of itself, or another SearchDirection serving the same purpose). Momentum may be applied after TONGA, and then we can add decaying LR. In fact, I think second-order methods can be safely applied to the gradients returned by TONGA.

@vdumoulin (Member) commented Nov 5, 2012

@goodfeli

Here's what I wrote about the Momentum subclass:

I must have figured it out wrong. The equations I had were

    Theta_t <-- Theta_{t-1} + v_t
    v_t <-- mu * v_{t-1} - epsilon * grad(Theta_{t-1})

which needed to be in the form

    Theta_t <-- Theta_{t-1} - epsilon * search_direction

to be used by SGD, so I factored out -epsilon in the expression of v_t, which gave me

    v_t <-- -epsilon * (grad(Theta_{t-1}) - (mu / epsilon) * v_{t-1}) = -epsilon * search_direction
    search_direction = grad(Theta_{t-1}) - (mu / epsilon) * v_{t-1}
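A quick numerical check (NumPy, not Pylearn2 code) confirms that this factoring reproduces the standard momentum update when the SGD step is theta <-- theta - epsilon * search_direction:

```python
import numpy as np

# Quick numerical check (NumPy, not Pylearn2 code) that the factored form
# above matches the standard momentum update when the SGD step is
# theta <- theta - epsilon * search_direction.
rng = np.random.RandomState(0)
theta = rng.randn(5)
v = rng.randn(5)       # v_{t-1}
grad = rng.randn(5)    # grad(Theta_{t-1})
mu, epsilon = 0.9, 0.01

# Standard momentum: v_t = mu * v_{t-1} - epsilon * grad; Theta_t = Theta_{t-1} + v_t
v_t = mu * v - epsilon * grad
theta_standard = theta + v_t

# Factored form: search_direction = grad - (mu / epsilon) * v_{t-1}
search_direction = grad - (mu / epsilon) * v
theta_factored = theta - epsilon * search_direction

assert np.allclose(theta_standard, theta_factored)
```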

@vdumoulin (Member) commented Nov 5, 2012

(Once I fix the Momentum subclass, would it be better to remove the momentum option from the SGD class entirely?)

@goodfeli (Contributor) commented Nov 7, 2012

I've gotten fairly busy preparing for the AISTATS deadline next week. Do you mind if I wait to finish reviewing this until then?

@vdumoulin (Member) commented Nov 8, 2012

No problem!

@vdumoulin (Member) commented Nov 9, 2012

I've rebased my commits and finished my implementation of momentum. I've added lots of comments to justify the counter-intuitive way it is implemented, so hopefully everything will be clear for those who use it.

I also removed the old way momentum was implemented.

@@ -551,6 +536,7 @@ def __call__(self, algorithm):
algorithm.learning_rate.set_value(new_lr)
# TODO: decide whether this class is still relevant with SearchDirection

@goodfeli (Contributor) commented Nov 14, 2012

Yes it is. This is what allows you to change the momentum over time, which is very important for getting good results. But it would have to be updated to point to the new momentum coefficient.

@goodfeli (Contributor) commented Nov 14, 2012

I'm still not convinced momentum should be a SearchDirection, for the following reasons:

1. The case where learning_rate = 0 is unsupported. learning_rate = 0 is actually a useful case to be able to run for debugging: if you notice something bad happening during learning and you aren't sure whether it is caused by the learning itself or by one of your callbacks, setting the learning rate to 0 can clear that up. With this implementation you'd need to remember to also disable momentum.
2. Have you tested it to make sure it is numerically stable? Having to take the difference between the gradient and something divided by epsilon at every step makes me nervous. I'm not very good at analyzing numerical stuff, but it seems like this could cause problems if the momentum coefficient is close to 1, the gradient points in the same direction for several steps, and the learning rate is small.
3. Changing the math makes it harder to understand what we're doing. Users need to understand not just momentum but also the Pylearn2 SearchDirection system and why we had to rearrange things to shove momentum into that system.
4. Since momentum is learning-rate dependent, we're not really computing one search direction anymore. It makes the SearchDirection metaphor inappropriate.
5. Algorithms other than SGD don't use momentum, so there is not much to be gained by factoring momentum out of the TrainingAlgorithm.
6. You can't implement Nesterov momentum in this framework, because Nesterov momentum requires taking the gradient at a different point than the current model parameters. So the SGD algorithm is going to need a different mechanism for generalizing momentum anyway.
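Point 2 can be illustrated numerically (NumPy, not Pylearn2 code): in float32, with a small learning rate, mu / epsilon is on the order of 1e7, and a moderately sized gradient falls below one float32 ulp of (mu / epsilon) * v, so the factored form rounds the gradient's contribution away entirely while the standard update still registers it:

```python
import numpy as np

# Float32 illustration of point 2 (NumPy, not Pylearn2 code): with a small
# learning rate, mu / epsilon is ~1e7, and a moderate gradient is smaller
# than one float32 ulp of (mu / epsilon) * v, so the factored form rounds
# the gradient's contribution away while the standard update keeps it.
mu = np.float32(0.99)
epsilon = np.float32(1e-7)
v = np.float32(1.0)
grad = np.float32(0.4)

# Standard velocity update: the epsilon * grad term survives.
direct = mu * v - epsilon * grad
direct_no_grad = mu * v

# Factored form: grad is absorbed into a number of magnitude ~9.9e6.
inner = grad - (mu / epsilon) * v
inner_no_grad = np.float32(0.0) - (mu / epsilon) * v
factored = -epsilon * inner
factored_no_grad = -epsilon * inner_no_grad

assert direct != direct_no_grad        # standard form sees the gradient
assert factored == factored_no_grad    # factored form lost it to rounding
```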
@yoshua commented Nov 14, 2012

Nicolas Boulanger may have something to say about point 6. He has discovered a reformulation of Nesterov that makes it of the same EXACT form as regular momentum, except for the way the linear coefficients are chosen.

-- Yoshua

@dwf (Contributor) commented Nov 14, 2012

Point 2 definitely could pose a problem. Point 4, I think, is a strong argument against momentum-as-search-direction.

@boulanni commented Nov 14, 2012

Yes, I found that the following change of variables simplifies the implementation of Nesterov momentum. It is in the same form as regular momentum in the sense that both the velocity and parameter updates depend only on the gradient at the current value of the parameters.

In short:

regular momentum:
(1) v_t = mu * v_{t-1} - lr * gradient_f(params_{t-1})
(2) params_t = params_{t-1} + v_t
(3) params_t = params_{t-1} + mu * v_{t-1} - lr * gradient_f(params_{t-1})

Nesterov momentum:
(4) v_t = mu * v_{t-1} - lr * gradient_f(params_{t-1} + mu * v_{t-1})
(5) params_t = params_{t-1} + v_t

alternate formulation for Nesterov momentum:
(6) v_t = mu * v_{t-1} - lr * gradient_f(params_{t-1})
(7) params_t = params_{t-1} + mu * v_t - lr * gradient_f(params_{t-1})
(8) params_t = params_{t-1} + mu**2 * v_{t-1} - (1 + mu) * lr * gradient_f(params_{t-1})

So with Theano you can use (1), then either (2) or (7)/(8), to have both options.
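A quick NumPy check (not Pylearn2 code) that the combined form (8) agrees with applying (6) and then (7):

```python
import numpy as np

# Quick check (NumPy, not Pylearn2 code) that the combined form (8) agrees
# with applying (6) and then (7).
rng = np.random.RandomState(1)
params = rng.randn(4)   # params_{t-1}
v_prev = rng.randn(4)   # v_{t-1}
grad = rng.randn(4)     # gradient_f(params_{t-1})
mu, lr = 0.9, 0.1

# (6) then (7)
v_t = mu * v_prev - lr * grad
params_67 = params + mu * v_t - lr * grad

# (8) directly
params_8 = params + mu ** 2 * v_prev - (1 + mu) * lr * grad

assert np.allclose(params_67, params_8)
```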

@goodfeli (Contributor) commented Nov 14, 2012

Thanks @boulanni. I will add this to the repo as a note for someone to implement.

@vdumoulin (Member) commented Nov 19, 2012

I implemented Nicolas's reformulation of Nesterov momentum; hopefully everything is all right!

@@ -54,6 +54,8 @@ class TransformerIterator(object):
def __init__(self, raw_iterator, transformer_dataset):
self.raw_iterator = raw_iterator
self.transformer_dataset = transformer_dataset
self.stochastic = raw_iterator.stochastic
self.uneven = raw_iterator.uneven

@goodfeli (Contributor) commented Nov 19, 2012

Looks good. Is there a unit test we should add to make sure this does not break again?
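A minimal unit-test sketch along the lines requested; the stub classes are illustrative (a real test would construct a TransformerDataset and inspect the iterator it returns), and the constructor mirrors the diff above:

```python
# Unit-test sketch (stub names are illustrative; a real test would build a
# TransformerDataset and inspect the iterator it returns). The constructor
# mirrors the diff above.
class FakeRawIterator(object):
    stochastic = True
    uneven = False

class TransformerIterator(object):
    def __init__(self, raw_iterator, transformer_dataset):
        self.raw_iterator = raw_iterator
        self.transformer_dataset = transformer_dataset
        # The two lines under review: forward the raw iterator's flags.
        self.stochastic = raw_iterator.stochastic
        self.uneven = raw_iterator.uneven

def test_transformer_iterator_forwards_flags():
    it = TransformerIterator(FakeRawIterator(), transformer_dataset=None)
    assert it.stochastic is True
    assert it.uneven is False
```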

@@ -54,19 +56,6 @@ def __init__(self, learning_rate, cost, batch_size=None,
can change it after each minibatch.
cost: a pylearn2.costs.cost.Cost object specifying the objective
function to be minimized.
init_momentum: if None, does not use momentum

@goodfeli (Contributor) commented Nov 19, 2012

Can you restore the init_momentum docstring?

from pylearn2.search_direction.search_direction import SearchDirection
class Momentum(SearchDirection):

@goodfeli (Contributor) commented Nov 19, 2012

I thought we were moving Momentum back into the SGD algorithm.

@@ -42,9 +44,9 @@ class SGD(TrainingAlgorithm):
def __init__(self, learning_rate, cost, batch_size=None,
monitoring_batches=None, monitoring_dataset=None,
termination_criterion=None, update_callbacks=None,
init_momentum = None, set_batch_size = False,
init_momentum = None, nesterov_momentum=False, set_batch_size = False,

@goodfeli (Contributor) commented Nov 19, 2012

I don't actually see the nesterov_momentum flag used anywhere in the "Files changed" view.

@vdumoulin (Member) commented Nov 20, 2012

It seems like the last rebase I did messed up BIG TIME: the momentum implementation completely disappeared from the last commit. That would explain the strange disappearance of docstring comments.

Sorry about that, I'll have this fixed soon.

@goodfeli (Contributor) commented Nov 20, 2012

OK, @dwf is probably the most knowledgeable person about git in the lab. He can probably help you recover stuff that disappeared. I know how to do a rebase, but I don't know how to recover from a rebase gone wrong.

@goodfeli (Contributor) commented Dec 5, 2012

@vdumoulin is planning to redo this in several small pull requests rather than one big rebase, so I am closing the PR.

@goodfeli goodfeli closed this Dec 5, 2012

@vdumoulin vdumoulin deleted the vdumoulin:search_direction branch Dec 7, 2013

@lucasb-eyer lucasb-eyer referenced this pull request Feb 26, 2015

Closed

Implement alternatives to SGD #3

5 of 10 tasks complete

@lucasb-eyer lucasb-eyer referenced this pull request Mar 5, 2015

Closed

Implement alternatives to SGD #10

10 of 10 tasks complete

@ajbrock ajbrock referenced this pull request Mar 1, 2017

Merged

Add Nesterov Momentum #887
