Description
This PR adds options for:
- Sharing weights across some of the transformer layers. This technique was introduced in the ALBERT paper. For a model with shared weights to match the quality of a model without sharing, you need more compute but fewer parameters, so it is a way to trade extra computation for reduced GPU memory consumption and reduced communication between GPUs (in the case of distributed training).
- Sharing weights for input and output embeddings. This technique is commonly used in seq2seq models; for example, fairseq provides the `--share-input-output-embed` option. (Both options are illustrated in the sketch after this list.)
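To make the two options concrete, here is a minimal PyTorch sketch of both ideas. It is not the implementation in this PR; the class and argument names (`TinyTransformerLM`, `num_unique_layers`, `share_input_output_embed`) are illustrative, causal masking is omitted for brevity, and it assumes a PyTorch version that supports `batch_first` in `nn.TransformerEncoderLayer`:

```python
import torch
import torch.nn as nn


class TinyTransformerLM(nn.Module):
    """Toy model illustrating layer weight sharing and input/output embedding tying."""

    def __init__(self, vocab_size=1024, d_model=256, n_head=4,
                 num_hidden_layers=12, num_unique_layers=3,
                 share_input_output_embed=True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)

        # ALBERT-style sharing: only `num_unique_layers` sets of parameters exist,
        # reused round-robin across `num_hidden_layers` forward passes.
        assert num_hidden_layers % num_unique_layers == 0
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_head, batch_first=True)
            for _ in range(num_unique_layers)
        )
        self.num_hidden_layers = num_hidden_layers

        # Output projection; optionally tied to the input embedding matrix
        # (both have shape (vocab_size, d_model)), in the spirit of fairseq's
        # --share-input-output-embed.
        self.out_proj = nn.Linear(d_model, vocab_size, bias=False)
        if share_input_output_embed:
            self.out_proj.weight = self.embed.weight

    def forward(self, tokens):  # tokens: (batch, seq_len) of token ids
        x = self.embed(tokens)
        for i in range(self.num_hidden_layers):
            x = self.layers[i % len(self.layers)](x)  # same module reused several times
        return self.out_proj(x)  # logits: (batch, seq_len, vocab_size)


model = TinyTransformerLM()
logits = model(torch.randint(0, 1024, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 1024])
print(sum(p.numel() for p in model.parameters()))  # far fewer params than 12 unique layers
```

In this sketch, the 12-layer stack stores only 3 layers' worth of parameters, and tying the output projection to the input embedding removes a second `vocab_size x d_model` matrix, at the cost of running the shared layers multiple times per forward pass.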
Experiments

I originally implemented this for our project, where we train a DALL-E-like model collaboratively over the Internet on LAION-400M.
We use both options above in a model with the following config:
The model produces reasonable outputs, so weight sharing can indeed be used for DALL-E in practice. The pictures below were generated after completing 1/3 of the training schedule: