I use the momentum update rule together with the gradient rule to train my network. However, the error decreases for the first few epochs and then flattens out.
I checked these update rules against Hinton's dropout paper (and also the ImageNet paper), and they seem to differ.
The update rules currently implemented are as follows:
```python
from collections import OrderedDict

updates = OrderedDict()
# exponential moving average of the gradient
for gparam_mom, gparam in zip(gparams_mom, gparams):
    updates[gparam_mom] = mom * gparam_mom + (1. - mom) * gparam
# descent step: learning_rate scales the whole averaged gradient
for param, gparam_mom in zip(classifier.params, gparams_mom):
    stepped_param = param - learning_rate * updates[gparam_mom]
```
According to my understanding of Appendix A.1 in the dropout paper, learning_rate should NOT multiply the mom * gparam_mom term, yet the code above does exactly that. Following the formula there, we should instead have:
```python
updates = OrderedDict()
# momentum accumulator: learning_rate scales only the fresh gradient term
for gparam_mom, gparam in zip(gparams_mom, gparams):
    updates[gparam_mom] = mom * gparam_mom - (1. - mom) * learning_rate * gparam
# the step is the momentum variable itself
for param, gparam_mom in zip(classifier.params, gparams_mom):
    stepped_param = param + updates[gparam_mom]
```
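For reference, unrolling the two loops into equations makes the difference explicit (writing v for gparam_mom, g for gparam, \theta for param, m for mom, and \eta for learning_rate):

```latex
% currently implemented:
v_{t+1} = m\,v_t + (1 - m)\,g_t, \qquad \theta_{t+1} = \theta_t - \eta\,v_{t+1}
% dropout paper, Appendix A.1:
v_{t+1} = m\,v_t - (1 - m)\,\eta\,g_t, \qquad \theta_{t+1} = \theta_t + v_{t+1}
```

In the implemented version the learning rate rescales the entire momentum history at every step; in the paper's version it scales only the newly added gradient term.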
Am I right? Or is the update rule currently implemented actually used somewhere that I am not aware of? In that case, I'd love some pointers.
As another note: since learning_rate is multiplied by (1. - mom), a large learning_rate is needed to get a reasonable effective step. For example, with mom = 0.99, a learning_rate of 10 scales the gradient term by only 10 * (1 - 0.99) = 0.1 (no wonder Hinton uses 10 now...).
However, in their ImageNet paper they use a slightly different rule for updating the momentum, one that includes weight decay. There, learning_rate is no longer multiplied by (1. - mom), so we should expect a small learning_rate instead.
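In code, the ImageNet update rule would look something like the following sketch, written in the same style as the snippets above (`weight_decay` is a name introduced here for illustration; the 0.9 momentum and 0.0005 decay coefficients are the values given in the paper):

```python
# sketch of the ImageNet (Krizhevsky et al.) momentum rule:
#   v <- 0.9 * v - weight_decay * learning_rate * w - learning_rate * grad
#   w <- w + v
mom = 0.9              # momentum coefficient from the paper
weight_decay = 0.0005  # weight-decay coefficient from the paper (name assumed here)

updates = OrderedDict()
for gparam_mom, gparam, param in zip(gparams_mom, gparams, classifier.params):
    updates[gparam_mom] = (mom * gparam_mom
                           - weight_decay * learning_rate * param
                           - learning_rate * gparam)
for param, gparam_mom in zip(classifier.params, gparams_mom):
    stepped_param = param + updates[gparam_mom]
```

Note that the gradient term here is scaled by learning_rate alone, with no (1. - mom) factor, which is consistent with the much smaller learning rates used in that paper.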
You are correct that the learning rule in the code is not exactly the same as what is described in the dropout paper. This also explains why Hinton is able to use such a large learning rate, as you note.
Have you observed better performance by changing the learning rule to match the dropout paper exactly?