
Add options for weight sharing #408

Closed
wants to merge 7 commits

Conversation

@borzunov (Contributor) commented Jan 10, 2022

Description

This PR adds options for:

  1. Sharing weights across some of the transformer layers. This technique was introduced in the ALBERT paper. A model with shared weights needs more compute to match the quality of a model without sharing, but it has fewer parameters. Thus, it is a way to trade extra computation for reduced GPU memory consumption and reduced communication between GPUs (in the case of distributed training).

  2. Sharing weights between the input and output embeddings. This technique is commonly used in seq2seq models; for example, fairseq provides the --share-input-output-embed option for it. A minimal sketch of both techniques is shown below.
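
For illustration, here is a minimal sketch of both ideas in plain PyTorch. It is not the dalle-pytorch implementation, and the class name, layer type, and sizes are made up: only a few unique transformer blocks are allocated and reused cyclically across the full depth, and the output projection reuses the token embedding matrix.

import torch
from torch import nn

class TiedTransformerLM(nn.Module):
    def __init__(self, num_tokens=1000, dim=64, depth=6, num_unique_layers=2):
        super().__init__()
        self.token_emb = nn.Embedding(num_tokens, dim)
        # ALBERT-style sharing: allocate only `num_unique_layers` distinct
        # blocks and reuse them cyclically across `depth` applications
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(num_unique_layers)
        ])
        self.layer_schedule = [i % num_unique_layers for i in range(depth)]
        # Tie input and output embeddings: the output projection reuses the
        # token embedding matrix, so no extra parameters are allocated
        self.to_logits = nn.Linear(dim, num_tokens, bias=False)
        self.to_logits.weight = self.token_emb.weight

    def forward(self, tokens):
        x = self.token_emb(tokens)
        for i in self.layer_schedule:
            x = self.blocks[i](x)  # the same block is applied at several depths
        return self.to_logits(x)

model = TiedTransformerLM()
logits = model(torch.randint(0, 1000, (2, 16)))  # (batch, seq) -> (batch, seq, num_tokens)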

Experiments

Originally, I implemented this for our project, where we collaboratively train a DALL-E-like model over the Internet on LAION-400M.

We use both options above in a model with the following config:

from itertools import cycle, islice

from dalle_pytorch import DALLE

depth = 64

# Attention layer types as in the DALL-E paper: a repeating axial pattern
# for the first depth - 1 layers, then a final 'conv_like' layer
attn_types = list(islice(cycle(['axial_row', 'axial_col', 'axial_row', 'axial_row']), depth - 1))
attn_types.append('conv_like')

# Layers that get the same id share their attention/FF weights.
# The first depth - 1 layers cycle through ids 0..3; the final conv-like
# layer gets its own id, so its weights are not shared with anything.
shared_layer_ids = list(islice(cycle(range(4)), depth - 1))
shared_layer_ids.append('w_conv')

dalle = DALLE(
    vae=vqgan_f8,  # a pretrained VQ-GAN VAE, defined elsewhere
    num_text_tokens=32100,
    text_seq_len=256,
    dim=1024,
    depth=depth,
    heads=16,
    dim_head=64,
    attn_types=attn_types,
    ff_dropout=0,
    attn_dropout=0,
    shared_attn_ids=shared_layer_ids,
    shared_ff_ids=shared_layer_ids,
    rotary_emb=True,
    reversible=True,
    share_input_output_emb=True,
)
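
As a quick sanity check on the schedule above (an illustrative snippet, not part of the PR): with depth = 64, the transformer runs 64 layers deep but stores only five unique sets of attention/FF weights, one for each of the ids 0-3 plus one for 'w_conv'.

from collections import Counter

# Count how many layers map onto each shared weight id: ids 0-2 cover
# 16 layers each, id 3 covers 15, and 'w_conv' covers only the final layer
print(Counter(shared_layer_ids))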

The model demonstrates reasonable outputs, so we can indeed use weight sharing for DALL-E in practice. These pictures were generated after passing 1/3 of the training schedule.

@lucidrains (Owner) commented:
@borzunov Interesting! Weight sharing has been shown to work with ALBERT, though it is still not popular in practice with language models. Will do some testing on my end before merging this one in! Thank you!

@borzunov (Contributor, Author) commented:
@lucidrains Oh, I'm afraid this code has already been merged as part of #409. That PR was branched from this one, sorry for the confusion :(

If you eventually decide that this feature shouldn't be in the repo, feel free to remove weight sharing (or I can prepare a PR removing it myself).

@lucidrains (Owner) commented:
No problem! Let's keep the feature! Thank you 🙏

@lucidrains closed this on Jan 12, 2022