
The implementation of layerwise learning rate decay #51

Closed

importpandas opened this issue Apr 30, 2020 · 2 comments

Comments

@importpandas

for layer in range(n_layers):
  key_to_depths["encoder/layer_" + str(layer) + "/"] = layer + 1
return {
    key: learning_rate * (layer_decay ** (n_layers + 2 - depth))
    for key, depth in key_to_depths.items()
}

According to the code here, assume n_layers=24. Then key_to_depths["encoder/layer_23/"] = 24, which is the depth of the last encoder layer, but the learning rate for that layer is:
learning_rate * (layer_decay ** (24 + 2 - 24)) = learning_rate * (layer_decay ** 2).

That's what confuses me. Why is the learning rate for the last layer learning_rate * (layer_decay ** 2) rather than learning_rate? Am I missing something?
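
A quick way to check this numerically (a minimal sketch: the initialization of key_to_depths outside the loop is assumed, and only the loop and return come from the snippet above):

```python
def layerwise_lrs(learning_rate=1.0, layer_decay=0.8, n_layers=24):
    # Assumed: embeddings sit at depth 0; only the encoder-layer entries
    # below are taken from the quoted snippet.
    key_to_depths = {"embeddings/": 0}
    for layer in range(n_layers):
        key_to_depths["encoder/layer_" + str(layer) + "/"] = layer + 1
    return {
        key: learning_rate * (layer_decay ** (n_layers + 2 - depth))
        for key, depth in key_to_depths.items()
    }

lrs = layerwise_lrs()
print(lrs["encoder/layer_23/"])  # 0.8 ** 2 = 0.64, not 0.8 or 1.0
```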

@clarkkev
Collaborator

clarkkev commented May 8, 2020

For the layerwise learning rate decay we count the task-specific layer added on top of the pre-trained transformer as an additional layer of the model, so the learning rate for the last transformer layer of ELECTRA should be learning_rate * 0.8. But you've still found a bug: instead it is learning_rate * 0.8^2.

The bug happened because there used to be a pooler layer in ELECTRA before we removed the next-sentence-prediction task. In that case the learning rates per layer were:

  • task-specific softmax: learning_rate
  • pooler: learning_rate * 0.8
  • transformer layer 24: learning_rate * 0.8^2
  • transformer layer 23: learning_rate * 0.8^3
  • ...

However, when we removed the pooling layer, we didn't fix the learning rates correspondingly. I guess in practice this didn't hurt performance much, so I'm leaving it as-is to keep results reproducible for now.
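
A sketch of what the corrected exponent could look like once the pooler is gone, assuming the task-specific layer is assigned depth n_layers + 1 (hypothetical key names; not the repository's actual code):

```python
def layerwise_lrs_fixed(learning_rate=1.0, layer_decay=0.8, n_layers=24):
    # Assumed depths: embeddings at 0, task-specific head at n_layers + 1.
    key_to_depths = {"embeddings/": 0, "task_specific/": n_layers + 1}
    for layer in range(n_layers):
        key_to_depths["encoder/layer_" + str(layer) + "/"] = layer + 1
    # With the pooler removed, the exponent drops by one level so the top
    # transformer layer gets learning_rate * layer_decay.
    return {
        key: learning_rate * (layer_decay ** (n_layers + 1 - depth))
        for key, depth in key_to_depths.items()
    }

lrs = layerwise_lrs_fixed()
print(lrs["task_specific/"])     # 1.0
print(lrs["encoder/layer_23/"])  # 0.8
print(lrs["encoder/layer_22/"])  # 0.8 ** 2
```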

@importpandas
Author

I got it, thanks for your explanation.
