Sequence length limited #17

Closed
Henrykwokkk opened this issue Nov 20, 2020 · 14 comments

@Henrykwokkk

I tried this model, but the sequence length the Routing Transformer can process seems limited. I set the batch size to 16 and the sequence length to 1024, but it ran out of GPU memory.

@lucidrains
Owner

Hmm, that doesn't seem right. Have you tried running the Colab? Want to share your script?

@lucidrains
Owner

How deep is your network? Try turning on reversibility.

@Henrykwokkk
Author

> How deep is your network? Try turning on reversibility.

Both the encoder and decoder have a depth of 3. I'll get back to you with more details later.

@lucidrains
Owner

How much memory are you working with? Can you show me your full settings?

@Henrykwokkk
Author

Henrykwokkk commented Nov 21, 2020

The settings are as follows:

```python
NUM_BATCHES = int(1e5)
BATCH_SIZE = 32
LEARNING_RATE = 1e-4
GENERATE_EVERY = 100
NUM_TOKENS = 256 + 2
ENC_SEQ_LEN = 1024
DEC_SEQ_LEN = 2048

model = RoutingTransformerEncDec(
    dim = 512,
    enc_num_tokens = NUM_TOKENS,
    enc_depth = 3,
    enc_heads = 8,
    enc_max_seq_len = ENC_SEQ_LEN,
    enc_window_size = 32,
    dec_num_tokens = NUM_TOKENS,
    dec_depth = 3,
    dec_heads = 8,
    dec_max_seq_len = DEC_SEQ_LEN,
    dec_window_size = 32,
).cuda()
```

A RuntimeError occurred:

```
Tried to allocate 64.00 MiB (GPU 0; 10.76 GiB total capacity; 9.58 GiB already allocated; 20.94 MiB free; 9.89 GiB reserved in total by PyTorch)
```

Similarly, I had this problem with the Reformer model: after about 500 batches it ran out of memory.

@lucidrains
Owner

So first, turn on reversibility; second, you can decrease your batch size and do gradient accumulation instead.
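
(For reference, a minimal sketch of both suggestions. It assumes the EncDec wrapper accepts `enc_reversible` / `dec_reversible` keyword arguments and that the forward call with `return_loss = True` returns a scalar training loss, as in the repo's examples; `train_loader` is a hypothetical loader yielding padded `(src, tgt)` token batches.)

```python
import torch
from routing_transformer import RoutingTransformerEncDec

# Reversible blocks recompute activations during the backward pass,
# trading extra compute for a much smaller activation-memory footprint.
model = RoutingTransformerEncDec(
    dim = 512,
    enc_num_tokens = 258, enc_depth = 3, enc_heads = 8,
    enc_max_seq_len = 1024, enc_window_size = 32, enc_reversible = True,
    dec_num_tokens = 258, dec_depth = 3, dec_heads = 8,
    dec_max_seq_len = 2048, dec_window_size = 32, dec_reversible = True
).cuda()

optimizer = torch.optim.Adam(model.parameters(), lr = 1e-4)

# Gradient accumulation: a micro-batch of 4 with 8 accumulation steps sees the
# same number of sequences per update as batch size 32, at far lower peak memory.
ACCUM_STEPS = 8
optimizer.zero_grad()
for i, (src, tgt) in enumerate(train_loader):                  # train_loader is assumed
    loss = model(src.cuda(), tgt.cuda(), return_loss = True)   # exact call/return signature may differ; see the repo README
    (loss / ACCUM_STEPS).backward()                            # scale so the accumulated update matches a full batch
    if (i + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()
```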

@Henrykwokkk
Author

> So first, turn on reversibility; second, you can decrease your batch size and do gradient accumulation instead.

I turned on reversibility and set the batch size to 8, but training stopped at batch 172 with a CUDA out-of-memory RuntimeError.

@lucidrains
Owner

@guohanyang1994 Make your batch size even smaller and increase your gradient accumulation steps.

@Henrykwokkk
Author

Could I ask why the CUDA out-of-memory error occurs during training (at batch 172) rather than at the beginning of training?

@tomweingarten
Contributor

Hard to say without seeing the code, but are your batches different sizes? It's possible it takes that long to hit the longest combination of sequence lengths.
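
(One way to test that hypothesis: log each batch's sequence lengths together with peak GPU memory, so an OOM at a given step can be matched against an unusually long `(src, tgt)` combination. Sketch only; `train_loader`, `model`, and `optimizer` are assumed to be set up as in the sketch above.)

```python
import torch

for i, (src, tgt) in enumerate(train_loader):
    torch.cuda.reset_peak_memory_stats()
    loss = model(src.cuda(), tgt.cuda(), return_loss = True)   # call signature assumed, as above
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    print(f"batch {i}: src len {src.shape[1]}, tgt len {tgt.shape[1]}, peak mem {peak_gb:.2f} GB")
```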

@Henrykwokkk
Author

The batch size is fixed at 4 and the sequence length is capped at 2048, but training still stops at about the 1200th batch. I am still confused :(
But thanks for your reply.

@lucidrains
Owner

lucidrains commented Nov 23, 2020

@guohanyang1994 are you sure you don't have a memory leak? Routing Transformer has been trained on GPT-3-sized datasets successfully by others, so I doubt there's any problem with the framework.
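
(A common source of such leaks in PyTorch training scripts is accumulating the loss tensor itself, which keeps its autograd history alive across iterations. A minimal illustration with a stand-in model; only the last two lines are the point.)

```python
import torch

model = torch.nn.Linear(512, 512).cuda()                    # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr = 1e-4)

total_loss = 0
for step in range(1000):
    x = torch.randn(16, 512, device = 'cuda')
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # total_loss += loss        # leaky: retains autograd history for every step
    total_loss += loss.item()   # fixed: accumulate a plain Python float instead
```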

@Henrykwokkk
Author

Oh yes, it was exactly a memory leak problem. I have fixed it; thank you so much. Sorry to bother you, I'm an NLP beginner.

@lucidrains
Owner

ok np :D
