Generation doesn't seem right #56
Comments
@timsoraro thanks for trying and validating that things are working! Please try the following settings: dimension of 1024, depth of 24, heads of 12, and then see how big a context you can stretch it to before running out of memory. That would be equivalent to GPT-2 small for a fair comparison. Also, try turning on axial positional encoding; instructions are in the readme. I found it gave the best results for 1024+ context. Thanks for sharing your DeepSpeed benchmarks!
Lastly, set …
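For reference, here is a minimal sketch of what those suggested settings could look like with `ReformerLM`. The `max_seq_len`, `num_tokens`, and especially the axial positional encoding keyword names are assumptions based on the readme and may differ between versions of the library, so treat them as illustrative rather than exact:

```python
# Sketch only: dim/depth/heads follow the suggestion above; the axial
# positional encoding kwargs are assumed names -- check the reformer-pytorch
# README for the exact arguments in your version.
import torch
from reformer_pytorch import ReformerLM

model = ReformerLM(
    num_tokens = 50257,        # e.g. the GPT-2 BPE vocabulary
    dim = 1024,                # suggested dimension
    depth = 24,                # suggested depth
    heads = 12,                # suggested head count
    max_seq_len = 4096,        # stretch this until memory runs out
    causal = True,             # autoregressive language modeling
    axial_position_emb = True,         # assumed flag name for axial positional encoding
    axial_position_shape = (64, 64),   # should multiply out to max_seq_len
    axial_position_dims = (512, 512),  # should sum to dim
)

x = torch.randint(0, 50257, (1, 4096))
logits = model(x)  # (1, 4096, 50257)
```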
Haha ok, I'll run a test and report back! Thanks :)
So I changed to the suggested settings, and after seven epochs (on 8 V100s) the samples still don't make any sense (I think they should by now). Something is not right, I think...
@timsoraro wow, that's a lot of firepower! I think it is best to compare the final perplexities between different models given equal parameter count and amount of training data. But I'll take your word that you trained the other two models (GPT-2 and LSTM) from scratch. My hunch is that Reformer will never be as good as full attention, and we will have to test the limits of making up for the approximation with parameter count and data. Given your hardware, I suggest adding even more parameters to your model, while capping the number of epochs you train for at 5. If you can, first train on OpenWebText (https://github.com/jcpeterson/openwebtext) for just 1 epoch (https://arxiv.org/abs/1906.06669), and then finally train on your own dialogue corpus for up to 5. If none of those work, then that is valuable information that either the implementation is wrong, or Reformer simply does not work as well as advertised. For LSH-related hyperparameters, you could always try increasing … Thanks for sharing these results, this is great for everyone.
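A rough sketch of that schedule, in case it helps: one pass over OpenWebText, then at most five passes over the dialogue corpus. The `model`, `openwebtext_loader`, and `dialogue_loader` names below are placeholders, not anything from the repo:

```python
import torch
import torch.nn.functional as F

def run_epochs(model, optimizer, loader, num_epochs, device="cuda"):
    """Run a fixed number of passes over a loader yielding (input, target) token batches."""
    model.train()
    for _ in range(num_epochs):
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            logits = model(inputs)                        # (batch, seq, vocab)
            loss = F.cross_entropy(logits.transpose(1, 2), targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Placeholders: swap in your own ReformerLM and tokenized datasets.
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# run_epochs(model, optimizer, openwebtext_loader, num_epochs=1)  # pretrain: single pass
# run_epochs(model, optimizer, dialogue_loader, num_epochs=5)     # fine-tune: capped at 5
```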
@timsoraro my advisor, @AranKomat, has recommended forcing one of the LSH hashes to be locally attentive. I've added it as a setting in the latest version. Edit: I've made it into a flag instead, so you can just turn it on.
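If you want to try it, something like the following should work. The flag name `add_local_attn_hash` is my guess based on later versions of the library, so verify it against the current readme:

```python
# Assumed flag name -- check the reformer-pytorch README for your version.
from reformer_pytorch import ReformerLM

model = ReformerLM(
    num_tokens = 50257,
    dim = 1024,
    depth = 24,
    heads = 12,
    max_seq_len = 4096,
    causal = True,
    add_local_attn_hash = True,  # force one LSH hash round to attend locally (assumed name)
)
```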
Hey, I think I did the previous test wrongly, so I'm still experimenting. Excuse my ignorance, but let's say I have 240,000 data examples. When I do …, it still walks through only 30,000 steps. So what's happening here? How can I know when the model has run over all the data examples (one epoch)?
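One way to sanity-check this: with data parallelism, the number of optimizer steps per epoch is the dataset size divided by the effective (global) batch size, i.e. per-GPU batch size × number of GPUs × gradient accumulation steps. A small illustration (the per-GPU batch size and accumulation steps here are made-up values, not taken from your setup):

```python
# Illustration only: how many optimizer steps make up one epoch.
num_examples = 240_000        # dataset size mentioned above
per_gpu_batch_size = 1        # hypothetical value
num_gpus = 8                  # hypothetical value
grad_accum_steps = 1          # hypothetical value

effective_batch = per_gpu_batch_size * num_gpus * grad_accum_steps
steps_per_epoch = num_examples // effective_batch
print(steps_per_epoch)        # 30000 with these particular numbers
```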
I'll move this question to the DeepSpeed repo. BTW, I used the GPT-2 tokenizer (Hugging Face) and the results are much better (though they need more testing). I think it's the much larger vocabulary (50257 vs. 2000), but I'm not quite sure. Thanks again!
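For anyone trying the same swap, a minimal sketch of using the Hugging Face GPT-2 tokenizer; the model wiring at the end is only an assumption about how the training script is set up:

```python
# Tokenize with the GPT-2 BPE vocabulary (50257 tokens) instead of a small
# custom vocab, then size the model's embedding table to match.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
print(tokenizer.vocab_size)                 # 50257

ids = tokenizer.encode("Hello, how are you?")
text = tokenizer.decode(ids)

# When building the model, reuse the tokenizer's vocabulary size, e.g.:
# model = ReformerLM(num_tokens=tokenizer.vocab_size, ...)
```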
@timsoraro Please share some samples!
Sure! I will update here (:
Does this paper suggest training models with fewer parameters but on more data for 1 epoch? It seems to take quite a long time for a model this size (dim = 1024, depth = 12, heads = 8) to train, so I'm considering training a model with fewer parameters on more data for fewer epochs.
@timsoraro the new recommendation is to make your model as big as possible and stop your training early: https://twitter.com/Eric_Wallace_/status/1235616760595791872?s=20. More data always helps!
Oh okay, thanks for the info! I'm considering how to spend my budget; I really wish there were a pre-trained Reformer model à la GPT-2. I estimate it would cost about $5,330 to pre-train on OpenWebText with 8 V100s on an AWS spot instance (with the same settings as the model posted here), which I cannot afford.
First of all, the DeepSpeed implementation is awesome! I trained on 4 V100s and got an 8.5× boost, and 20× with fp16 turned on, compared to just one GPU.
I trained a model on a 300MB dialogue dataset for 2 epochs, but the generated samples weren't good. I'm quite sure I messed up the code somehow, since I come from a programming background and not ML.
Here's my code: https://pastebin.com/V1t5Ctg7
lr = 0.0004, bs=32, vocab_size=2000
Here are some samples: https://pastebin.com/yCL0vVdv
From my experiments with other architectures (GPT-2 from scratch, LSTM), it should generate decent samples after training on this data, so something must be wrong somewhere.
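For context, here is a sketch of a DeepSpeed configuration roughly matching those hyperparameters. The choice of Adam and the overall layout are assumptions; the actual pastebin code may be set up differently:

```python
# Sketch of a DeepSpeed config matching lr = 0.0004, bs = 32, fp16 enabled,
# written out as JSON so it can be passed to the launcher via --deepspeed_config.
import json

ds_config = {
    "train_batch_size": 32,                 # global batch size reported above
    "optimizer": {
        "type": "Adam",                     # assumed optimizer
        "params": {"lr": 0.0004}
    },
    "fp16": {"enabled": True}               # the fp16 setting that gave the extra speedup
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```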