
Generation doesn't seem right #56

Closed
timsoraro opened this issue Mar 8, 2020 · 13 comments

@timsoraro

First of all, the DeepSpeed implementation is awesome! I trained on 4 V100s and got an 8.5X boost, and a 20X boost with fp16 turned on, compared to just one GPU.

I trained a model on a 300MB dialogue dataset for 2 epochs, but the generated samples weren't good. I'm quite sure I messed up the code somehow, since I come from a programming background and not ML.

Here's my code: https://pastebin.com/V1t5Ctg7
lr = 0.0004, bs=32, vocab_size=2000

Here are some samples: https://pastebin.com/yCL0vVdv

From my experiments with other architectures (GPT-2 from scratch, LSTM), it should generate decent samples after being fed this data, so something must be wrong somewhere.

@lucidrains
Owner

@timsoraro thanks for trying it and validating that things are working! Please try the following settings: dim of 1024, depth of 24, 12 heads, and then see how big of a context you can stretch it to without running out of memory. That would be equivalent to GPT-2 small for a fair comparison. Also, try turning on axial positional encoding; instructions are in the readme. I found it gave the best results for 1024+ context. Thanks for sharing your DeepSpeed benchmarks!

@lucidrains
Owner

Lastly, set weight_tie to False for your next run. It comes from the ALBERT paper and is mainly for self-supervised models. n_hashes can be set to 8 if memory allows. I know, a ton of knobs to tweak lol
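
Pulling those knobs together, a rough, untested sketch of the configuration (argument names as in the readme; num_tokens and max_seq_len are placeholders, and heads is set to 16 since dim has to be divisible by the number of heads):

import torch
from reformer_pytorch import ReformerLM

model = ReformerLM(
    num_tokens = 2000,                 # vocab size from the original run
    dim = 1024,
    depth = 24,
    heads = 16,                        # 12 was suggested, but dim must be divisible by heads
    max_seq_len = 2048,                # placeholder; stretch until memory runs out
    causal = True,                     # autoregressive language modelling
    weight_tie = False,
    n_hashes = 8,
    axial_position_emb = True,
    axial_position_shape = (64, 32),   # must multiply out to max_seq_len
    axial_position_dims = (512, 512),  # must sum to dim
)

# quick smoke test on random tokens (move the model to GPU for real training)
x = torch.randint(0, 2000, (1, 2048))
logits = model(x)                      # shape: (1, 2048, 2000)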

@timsoraro
Author

Haha ok, I'll run a test and report back! Thanks :)

@timsoraro
Author

So I changed to:

dim=1024
depth=24
heads=16 (since dim needs to be divisible by the number of heads)
axial_position_emb=True
axial_position_shape=(64, 32)
axial_position_dims=(512, 512)
weight_tie=False
n_hashes=8

And after seven epochs (on 8 V100s), the samples still don't make any sense (I think they should by now).
https://pastebin.com/PzuHLrMx

Something is not right I think...

@lucidrains
Owner

@timsoraro wow, that's a lot of firepower! I think it is best to compare the final perplexities between the different models given an equal parameter count and amount of training data. But I'll take your word that you trained the other two models (GPT-2 and LSTM) from scratch.
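
(For reference, perplexity is just the exponential of the mean per-token cross-entropy on held-out data; a minimal sketch, assuming a causal LM that returns logits and a hypothetical val_loader of token-id batches:)

import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, val_loader, device="cpu"):
    # exponential of the mean next-token cross-entropy over a held-out set
    model.eval()
    total_loss, total_tokens = 0.0, 0
    for batch in val_loader:                              # batch: (batch, seq_len) token ids
        batch = batch.to(device)
        logits = model(batch)                             # (batch, seq_len, vocab)
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),  # predict token t+1 from the prefix
            batch[:, 1:].reshape(-1),
            reduction="sum",
        )
        total_loss += loss.item()
        total_tokens += batch[:, 1:].numel()
    return math.exp(total_loss / total_tokens)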

My hunch is that Reformer will never be as good as full attention, and we will have to test the limits of making up for the approximation with parameter count and data. Given your hardware, I suggest adding even more parameters to your model, while capping the number of epochs you train for at 5. If you can, first train on OpenWebText https://github.com/jcpeterson/openwebtext for just 1 epoch (https://arxiv.org/abs/1906.06669), and then finally train on your own dialogue corpus for up to 5.

If none of those work, then that is valuable information that either the implementation is wrong, or Reformer simply does not work as well as advertised.

For LSH related hyperparameters, you could always try increasing bucket_size to 128 and beyond, although the authors noted in their paper that they got diminishing returns after hitting 64, which is the setting I defaulted to.

Thanks for sharing these results; this is great for everyone.

@lucidrains
Owner

lucidrains commented Mar 9, 2020

@timsoraro my advisor @AranKomat has recommended forcing one of the LSH hashes to be locally attentive. I've added it as a setting in 0.17.1, which you can turn on with n_local_attn_hashes = 1, if you choose to run another experiment!

Edit: I've made it into a flag instead, so just add_local_attn_hash = True will work
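
Concretely, it is just one more keyword argument on the model (a sketch reusing the configuration from earlier in the thread):

from reformer_pytorch import ReformerLM

model = ReformerLM(
    num_tokens = 2000,
    dim = 1024,
    depth = 24,
    heads = 16,
    max_seq_len = 2048,
    causal = True,
    n_hashes = 8,
    add_local_attn_hash = True,   # forces one of the hash rounds to attend locally (0.17.1+)
)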

@timsoraro
Author

Hey, I think I ran the previous test incorrectly, so I'm still experimenting.

Excuse my ignorance, but let's say I have 240,000 data examples. When I do print(len(trainloader)), it shows 30,000 (240,000 / 8 GPUs). But since my batch_size is 32, I would imagine there would be around 937 steps for one epoch, yet

for i, data in enumerate(trainloader):

still walks through 30,000 iterations. So what's happening here? How can I tell when the model has gone over all the data examples (one epoch)?
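
(For reference, len() on a torch DataLoader counts batches per process, not examples; a toy sketch with placeholder data shows the ~937 that would be expected if the loader itself were batching with batch_size=32:)

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# toy stand-in for 240,000 examples (placeholder data, one token each)
dataset = TensorDataset(torch.zeros(240_000, 1, dtype=torch.long))

# what one of 8 GPUs sees when batch_size=32 is set on the DataLoader itself
sampler = DistributedSampler(dataset, num_replicas=8, rank=0)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
print(len(loader))   # ceil(240000 / 8 / 32) = 938 batches per epoch per process

# a length of 30,000 instead implies each iteration yields a single example,
# i.e. the batch size this loader sees is 1 (perhaps it is configured elsewhere,
# e.g. in the DeepSpeed config, rather than on the loader)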

@timsoraro
Author

timsoraro commented Mar 11, 2020

I'll move this question to the DeepSpeed repo.

BTW, I used the GPT-2 tokenizer (Hugging Face) and the results are much better (though they need more testing). I think it's the much larger vocabulary (50,257 vs 2,000), but I'm not quite sure.
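
Roughly, the swap looks like this (a sketch; the corpus filename is a placeholder, and the model's num_tokens has to become the tokenizer's vocab size):

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")    # 50,257-token byte-level BPE

with open("dialogue.txt", encoding="utf-8") as f:        # placeholder corpus file
    text = f.read()

token_ids = tokenizer.encode(text)
print(len(token_ids), "tokens; vocab size:", tokenizer.vocab_size)

# the ReformerLM then needs num_tokens = tokenizer.vocab_size (50257) instead of 2000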

Thanks again!

@lucidrains
Owner

@timsoraro Please share some samples!

@timsoraro
Author

Sure! I will update here (:

@timsoraro
Author

timsoraro commented Mar 14, 2020

If you can, first train on OpenWebText https://github.com/jcpeterson/openwebtext for just 1 epoch (https://arxiv.org/abs/1906.06669), and then finally train on your own dialogue corpus for up to 5.

Does this paper suggest training models with fewer parameters but more data for 1 epoch? It seems to take quite a long time to train a model this size (dim=1024, depth=12, heads=8), so I'm considering training a model with fewer parameters on more data for fewer epochs.

@lucidrains
Owner

@timsoraro the new recommendation is to make your model as big as possible and stop your training early (https://twitter.com/Eric_Wallace_/status/1235616760595791872?s=20). More data always helps!

@timsoraro
Author

Oh okay, thanks for the info! I'm considering how to spend my budget; I really wish there were a pre-trained Reformer model à la GPT-2. I assume it would cost $5,330 to pre-train on OpenWebText with 8 V100s on an AWS spot instance (with the same hyperparameters as the model posted here), which I cannot afford.
