
Momentum again #7

Closed

ChenglongChen opened this issue Mar 21, 2014 · 2 comments

Comments

@ChenglongChen
Contributor

I use the momentum and gradient update rules in this code to train my network. However, the error decreases for the first few epochs and then flattens out.
I checked those update rules against Hinton's dropout paper (and also the ImageNet paper), and they seem to be different.

The update rules currently implemented are as follows:

from collections import OrderedDict

updates = OrderedDict()
# exponential moving average of the gradient
for gparam_mom, gparam in zip(gparams_mom, gparams):
    updates[gparam_mom] = mom * gparam_mom + (1. - mom) * gparam

# step along the smoothed gradient; note that learning_rate scales the
# whole momentum term, including the accumulated part
for param, gparam_mom in zip(classifier.params, gparams_mom):
    stepped_param = param - learning_rate * updates[gparam_mom]

According to my understanding of Appendix A.1 in the dropout paper, learning_rate is NOT multiplied into mom * gparam_mom, whereas the code above does multiply it in. According to the formula there, we should instead have:

updates = OrderedDict()
# the learning rate scales only the fresh gradient term; the accumulated
# velocity is carried over unscaled
for gparam_mom, gparam in zip(gparams_mom, gparams):
    updates[gparam_mom] = mom * gparam_mom - (1. - mom) * learning_rate * gparam

# the velocity already contains the learning rate, so it is added directly
for param, gparam_mom in zip(classifier.params, gparams_mom):
    stepped_param = param + updates[gparam_mom]

Am I right? Or is the update rule currently implemented actually used somewhere that I am not aware of? In that case, I'd love some pointers.
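
For concreteness, here is a minimal, self-contained sketch of how an updates dictionary built this way would be consumed in Theano. The toy cost and the constants are mine for illustration, not code from this repo:

import numpy as np
import theano
import theano.tensor as T
from collections import OrderedDict

# toy scalar parameter and cost f(w) = 0.5 * w^2, whose gradient is w
w = theano.shared(np.asarray(1.0), name='w')
v = theano.shared(np.asarray(0.0), name='v')  # momentum buffer
mom = T.scalar('mom')
learning_rate = T.scalar('learning_rate')
cost = 0.5 * w ** 2
gparam = T.grad(cost, w)

updates = OrderedDict()
# dropout-paper style rule: learning_rate scales only the fresh gradient
updates[v] = mom * v - (1. - mom) * learning_rate * gparam
updates[w] = w + updates[v]

# each call applies one momentum update and one parameter step
train_step = theano.function([mom, learning_rate], cost, updates=updates)
for _ in range(5):
    print(train_step(0.99, 10.0))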

As another note, since learning_rate is multiplied by (1. - mom), a large learning_rate is expected to be needed to give good results (no wonder Hinton now uses 10...).
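
To put rough numbers on that (the values here are my own illustration, not taken from either paper):

# under the dropout-paper rule the fresh gradient enters the step scaled by
# (1. - mom) * learning_rate, so a nominally huge rate gives a modest step
mom = 0.95
learning_rate = 10.0
print((1. - mom) * learning_rate)  # effective step of 0.5 per unit of gradient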

However, in their ImageNet paper they use a slightly different rule for updating the momentum, which also includes weight decay. In addition, learning_rate is no longer multiplied by (1. - mom), so in that case a small learning_rate should be expected. In code, the ImageNet update rule is:

updates = OrderedDict()
# weight decay is folded into the gradient, and learning_rate scales the
# whole term; there is no (1. - mom) factor on the gradient
for gparam_mom, gparam, param in zip(gparams_mom, gparams, params):
    updates[gparam_mom] = mom * gparam_mom - learning_rate * (weight_decay * param + gparam)
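
The parameter step that goes with this rule would then add the velocity directly (my sketch, reusing the naming above; it is not part of the quoted rule):

# the learning rate is already folded into the velocity, so the velocity is
# added to the parameter as-is
for param, gparam_mom in zip(params, gparams_mom):
    updates[param] = param + updates[gparam_mom]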

Regards,

@mdenil
Owner

mdenil commented Mar 23, 2014

You are correct that the learning rule in the code is not exactly the same as what is described in the dropout paper. This also explains why Hinton is able to use such a large learning rate, as you note.

Have you observed better performance by changing the learning rule to match the dropout paper exactly?

@mdenil
Owner

mdenil commented Mar 24, 2014

Changed in 6f8c362

mdenil closed this as completed Mar 24, 2014