
Momentum again #7

Closed

ChenglongChen opened this issue Mar 21, 2014 · 2 comments

Comments

@ChenglongChen
Contributor

I use the momentum and gradient update rules in this code to train my network. However, the error decreases for the first few epochs and then flattens out.
I checked those update rules against Hinton's dropout paper (and also the ImageNet paper), and they seem to be different.

The update rules currently implemented are as follows:

from collections import OrderedDict

updates = OrderedDict()
# exponential moving average of the gradient
for gparam_mom, gparam in zip(gparams_mom, gparams):
    updates[gparam_mom] = mom * gparam_mom + (1. - mom) * gparam

# step along the smoothed gradient; note that learning_rate scales the
# whole momentum term, including the accumulated part
for param, gparam_mom in zip(classifier.params, gparams_mom):
    stepped_param = param - learning_rate * updates[gparam_mom]

According to my understanding of Appendix A.1 in the dropout paper, learning_rate is NOT multiplied into mom * gparam_mom, whereas the code above does multiply it in. According to the formula there, we should instead have:

updates = OrderedDict()
# the learning rate scales only the fresh gradient term; the accumulated
# velocity is carried over unscaled
for gparam_mom, gparam in zip(gparams_mom, gparams):
    updates[gparam_mom] = mom * gparam_mom - (1. - mom) * learning_rate * gparam

# the velocity already contains the learning rate, so it is added directly
for param, gparam_mom in zip(classifier.params, gparams_mom):
    stepped_param = param + updates[gparam_mom]

Am I right? Or is the update rule currently implemented actually used somewhere that I am not aware of? In that case, I'd love some pointers.
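
For concreteness, here is a minimal, self-contained sketch of how an updates dictionary built this way would be consumed in Theano. The toy cost and the constants are mine for illustration, not code from this repo:

import numpy as np
import theano
import theano.tensor as T
from collections import OrderedDict

# toy scalar parameter and cost f(w) = 0.5 * w^2, whose gradient is w
w = theano.shared(np.asarray(1.0), name='w')
v = theano.shared(np.asarray(0.0), name='v')  # momentum buffer
mom = T.scalar('mom')
learning_rate = T.scalar('learning_rate')
cost = 0.5 * w ** 2
gparam = T.grad(cost, w)

updates = OrderedDict()
# dropout-paper style rule: learning_rate scales only the fresh gradient
updates[v] = mom * v - (1. - mom) * learning_rate * gparam
updates[w] = w + updates[v]

# each call applies one momentum update and one parameter step
train_step = theano.function([mom, learning_rate], cost, updates=updates)
for _ in range(5):
    print(train_step(0.99, 10.0))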

As another note, since learning_rate is multiplied by (1. - mom), a large learning_rate is expected to be needed to give good results (no wonder Hinton now uses 10...).
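
To put rough numbers on that (the values here are my own illustration, not taken from either paper):

# under the dropout-paper rule the fresh gradient enters the step scaled by
# (1. - mom) * learning_rate, so a nominally huge rate gives a modest step
mom = 0.95
learning_rate = 10.0
print((1. - mom) * learning_rate)  # effective step of 0.5 per unit of gradient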

However, in their ImageNet paper they use a slightly different rule for updating the momentum, which also includes weight decay. In addition, learning_rate is no longer multiplied by (1. - mom), so in that case a small learning_rate should be expected. In code, the ImageNet update rule is:

updates = OrderedDict()
# weight decay is folded into the gradient, and learning_rate scales the
# whole term; there is no (1. - mom) factor on the gradient
for gparam_mom, gparam, param in zip(gparams_mom, gparams, params):
    updates[gparam_mom] = mom * gparam_mom - learning_rate * (weight_decay * param + gparam)
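
The parameter step that goes with this rule would then add the velocity directly (my sketch, reusing the naming above; it is not part of the quoted rule):

# the learning rate is already folded into the velocity, so the velocity is
# added to the parameter as-is
for param, gparam_mom in zip(params, gparams_mom):
    updates[param] = param + updates[gparam_mom]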

Regards,

@mdenil
Owner

mdenil commented Mar 23, 2014

You are correct that the learning rule in the code is not exactly the same as what is described in the dropout paper. This also explains why Hinton is able to use such a large learning rate, as you note.

Have you observed better performance by changing the learning rule to match the dropout paper exactly?

@mdenil
Owner

mdenil commented Mar 24, 2014

Changed in 6f8c362

mdenil closed this as completed Mar 24, 2014