As described in the original paper, the authors separated the dropout rates of the transformer cells from that of the classifier; moreover, in V2 all dropouts are 0 (except for the classifier, again).
The current implementation does not support this, and the models do not train well (I can't reproduce the GLUE benchmark results with the V2 models). After manually updating these values, the V2 models converge.
This issue was raised in #2337 and also mentioned in google-research/albert#23
I added a separate parameter to the config file and updated the sequence classification head.
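For illustration, the separation this introduces might look like the following in the model config (the key name `classifier_dropout_prob` and the 0.1 value are assumptions for illustration; the zero transformer dropouts follow the V2 release):

```python
# Hypothetical ALBERT V2 config values. "classifier_dropout_prob" is the
# proposed separate parameter; the two existing keys are set to 0 as in V2.
albert_v2_config = {
    "attention_probs_dropout_prob": 0.0,  # V2: no dropout in the attention layers
    "hidden_dropout_prob": 0.0,           # V2: no dropout in the hidden layers
    "classifier_dropout_prob": 0.1,       # dropout kept only on the classification head
}
```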
Please also update the configuration of the ALBERT V2 models (base, large, xlarge, xxlarge) in your repository, specifically their attention and hidden dropout rates (see https://tfhub.dev/google/albert_base/3, https://tfhub.dev/google/albert_large/3, https://tfhub.dev/google/albert_xlarge/3 and https://tfhub.dev/google/albert_xxlarge/3).