Description
This PR adds options for:
- Sharing weights across some of the transformer layers. This technique was introduced in the ALBERT paper. For a model with shared weights to match the quality of a model without sharing, you need more compute but fewer parameters, so it is a way to trade extra computation for reduced GPU memory consumption and reduced communication between GPUs (in the case of distributed training).
- Sharing weights for input and output embeddings. This technique is commonly used in seq2seq models; for example, fairseq provides the `--share-input-output-embed` option. (Both options are illustrated in the sketch after this list.)
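To make the two options concrete, here is a minimal PyTorch sketch of both ideas. It is not the implementation in this PR; the class and argument names (`TinyTransformerLM`, `num_unique_layers`, `share_input_output_embed`) are illustrative, causal masking is omitted for brevity, and it assumes a PyTorch version that supports `batch_first` in `nn.TransformerEncoderLayer`:

```python
import torch
import torch.nn as nn


class TinyTransformerLM(nn.Module):
    """Toy model illustrating layer weight sharing and input/output embedding tying."""

    def __init__(self, vocab_size=1024, d_model=256, n_head=4,
                 num_hidden_layers=12, num_unique_layers=3,
                 share_input_output_embed=True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)

        # ALBERT-style sharing: only `num_unique_layers` sets of parameters exist,
        # reused round-robin across `num_hidden_layers` forward passes.
        assert num_hidden_layers % num_unique_layers == 0
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_head, batch_first=True)
            for _ in range(num_unique_layers)
        )
        self.num_hidden_layers = num_hidden_layers

        # Output projection; optionally tied to the input embedding matrix
        # (both have shape (vocab_size, d_model)), in the spirit of fairseq's
        # --share-input-output-embed.
        self.out_proj = nn.Linear(d_model, vocab_size, bias=False)
        if share_input_output_embed:
            self.out_proj.weight = self.embed.weight

    def forward(self, tokens):  # tokens: (batch, seq_len) of token ids
        x = self.embed(tokens)
        for i in range(self.num_hidden_layers):
            x = self.layers[i % len(self.layers)](x)  # same module reused several times
        return self.out_proj(x)  # logits: (batch, seq_len, vocab_size)


model = TinyTransformerLM()
logits = model(torch.randint(0, 1024, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 1024])
print(sum(p.numel() for p in model.parameters()))  # far fewer params than 12 unique layers
```

In this sketch, the 12-layer stack stores only 3 layers' worth of parameters, and tying the output projection to the input embedding removes a second `vocab_size x d_model` matrix, at the cost of running the shared layers multiple times per forward pass.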
Experiments

I originally implemented this for our project, where we train a DALL-E-like model collaboratively over the Internet on LAION-400M.
We use both options above in a model with the following config:
The model produces reasonable outputs, so weight sharing can indeed be used for DALL-E in practice. The pictures below were generated after completing 1/3 of the training schedule: