
Generation doesn't seem right #56

Closed
timsoraro opened this issue Mar 8, 2020 · 13 comments

@timsoraro

First of all, the DeepSpeed implementation is awesome! I trained on 4 V100s and got an 8.5X boost, and a 20X boost with fp16 turned on, compared to just one GPU.

I trained a model on a 300MB dialogue dataset for 2 epochs, but the generated samples weren't good. I'm quite sure I messed up the code somehow, since I come from a programming background and not ML.

Here's my code: https://pastebin.com/V1t5Ctg7
lr = 0.0004, bs=32, vocab_size=2000

Here are some samples: https://pastebin.com/yCL0vVdv

From my experiments with other architectures (GPT-2 from scratch, LSTM), it should generate decent samples after being fed this data, so something must be wrong somewhere.

@lucidrains
Owner

@timsoraro thanks for trying it and validating that things are working! Please try the following settings: dim of 1024, depth of 24, 12 heads, and then see how big of a context you can stretch it to without running out of memory. That would be equivalent to GPT-2 small for a fair comparison. Also, try turning on axial positional encoding; instructions are in the readme. I found it gave the best results for 1024+ context. Thanks for sharing your DeepSpeed benchmarks!

@lucidrains
Owner

Lastly, set weight_tie to False for your next run. It comes from the ALBERT paper and is mainly for self-supervised models. n_hashes can be set to 8 if memory allows. I know, a ton of knobs to tweak lol
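
Pulling those knobs together, a rough, untested sketch of the configuration (argument names as in the readme; num_tokens and max_seq_len are placeholders, and heads is set to 16 since dim has to be divisible by the number of heads):

import torch
from reformer_pytorch import ReformerLM

model = ReformerLM(
    num_tokens = 2000,                 # vocab size from the original run
    dim = 1024,
    depth = 24,
    heads = 16,                        # 12 was suggested, but dim must be divisible by heads
    max_seq_len = 2048,                # placeholder; stretch until memory runs out
    causal = True,                     # autoregressive language modelling
    weight_tie = False,
    n_hashes = 8,
    axial_position_emb = True,
    axial_position_shape = (64, 32),   # must multiply out to max_seq_len
    axial_position_dims = (512, 512),  # must sum to dim
)

# quick smoke test on random tokens (move the model to GPU for real training)
x = torch.randint(0, 2000, (1, 2048))
logits = model(x)                      # shape: (1, 2048, 2000)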

@timsoraro
Author

Haha ok, I'll run a test and report back! Thanks :)

@timsoraro
Author

So I changed to:

dim=1024
depth=24
heads=16 (since dim needs to be divisible by the number of heads)
axial_position_emb=True
axial_position_shape=(64, 32)
axial_position_dims=(512, 512)
weight_tie=False
n_hashes=8

And after seven epochs (on 8 V100s), the samples still don't make any sense (I think they should by now).
https://pastebin.com/PzuHLrMx

Something is not right I think...

@lucidrains
Owner

@timsoraro wow, that's a lot of firepower! I think it is best to compare the final perplexities between the different models given an equal parameter count and amount of training data. But I'll take your word that you trained the other two models (GPT-2 and LSTM) from scratch.
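
(For reference, perplexity is just the exponential of the mean per-token cross-entropy on held-out data; a minimal sketch, assuming a causal LM that returns logits and a hypothetical val_loader of token-id batches:)

import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, val_loader, device="cpu"):
    # exponential of the mean next-token cross-entropy over a held-out set
    model.eval()
    total_loss, total_tokens = 0.0, 0
    for batch in val_loader:                              # batch: (batch, seq_len) token ids
        batch = batch.to(device)
        logits = model(batch)                             # (batch, seq_len, vocab)
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),  # predict token t+1 from the prefix
            batch[:, 1:].reshape(-1),
            reduction="sum",
        )
        total_loss += loss.item()
        total_tokens += batch[:, 1:].numel()
    return math.exp(total_loss / total_tokens)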

My hunch is that Reformer will never be as good as full attention, and we will have to test the limits of making up for the approximation with parameter count and data. Given your hardware, I suggest adding even more parameters to your model, while capping the number of epochs you train for at 5. If you can, first train on OpenWebText https://github.com/jcpeterson/openwebtext for just 1 epoch (https://arxiv.org/abs/1906.06669), and then finally train on your own dialogue corpus for up to 5.

If none of those work, then that is valuable information that either the implementation is wrong, or Reformer simply does not work as well as advertised.

For LSH related hyperparameters, you could always try increasing bucket_size to 128 and beyond, although the authors noted in their paper that they got diminishing returns after hitting 64, which is the setting I defaulted to.

Thanks for sharing these results; this is great for everyone.

@lucidrains
Owner

lucidrains commented Mar 9, 2020

@timsoraro my advisor @AranKomat has recommended forcing one of the LSH hashes to be locally attentive. I've added it as a setting in 0.17.1, which you can turn on with n_local_attn_hashes = 1, if you choose to run another experiment!

Edit: I've made it into a flag instead, so just add_local_attn_hash = True will work
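
Concretely, it is just one more keyword argument on the model (a sketch reusing the configuration from earlier in the thread):

from reformer_pytorch import ReformerLM

model = ReformerLM(
    num_tokens = 2000,
    dim = 1024,
    depth = 24,
    heads = 16,
    max_seq_len = 2048,
    causal = True,
    n_hashes = 8,
    add_local_attn_hash = True,   # forces one of the hash rounds to attend locally (0.17.1+)
)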

@timsoraro
Author

Hey, I think I ran the previous test incorrectly, so I'm still experimenting.

Excuse my ignorance, but let's say I have 240,000 data examples. When I do print(len(trainloader)), it shows 30,000 (240,000 / 8 GPUs). But since my batch_size is 32, I would imagine there would be around 937 steps for one epoch, yet

for i, data in enumerate(trainloader):

still walks through 30,000 iterations. So what's happening here? How can I tell when the model has gone over all the data examples (one epoch)?
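
(For reference, len() on a torch DataLoader counts batches per process, not examples; a toy sketch with placeholder data shows the ~937 that would be expected if the loader itself were batching with batch_size=32:)

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# toy stand-in for 240,000 examples (placeholder data, one token each)
dataset = TensorDataset(torch.zeros(240_000, 1, dtype=torch.long))

# what one of 8 GPUs sees when batch_size=32 is set on the DataLoader itself
sampler = DistributedSampler(dataset, num_replicas=8, rank=0)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
print(len(loader))   # ceil(240000 / 8 / 32) = 938 batches per epoch per process

# a length of 30,000 instead implies each iteration yields a single example,
# i.e. the batch size this loader sees is 1 (perhaps it is configured elsewhere,
# e.g. in the DeepSpeed config, rather than on the loader)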

@timsoraro
Author

timsoraro commented Mar 11, 2020

I'll move this question to the DeepSpeed repo.

BTW, I used the GPT-2 tokenizer (Hugging Face) and the results are much better (though they need more testing). I think it's the much larger vocabulary (50,257 vs 2,000), but I'm not quite sure.
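
Roughly, the swap looks like this (a sketch; the corpus filename is a placeholder, and the model's num_tokens has to become the tokenizer's vocab size):

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")    # 50,257-token byte-level BPE

with open("dialogue.txt", encoding="utf-8") as f:        # placeholder corpus file
    text = f.read()

token_ids = tokenizer.encode(text)
print(len(token_ids), "tokens; vocab size:", tokenizer.vocab_size)

# the ReformerLM then needs num_tokens = tokenizer.vocab_size (50257) instead of 2000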

Thanks again!

@lucidrains
Owner

@timsoraro Please share some samples!

@timsoraro
Author

Sure! I will update here (:

@timsoraro
Author

timsoraro commented Mar 14, 2020

If you can, first train on OpenWebText https://github.com/jcpeterson/openwebtext for just 1 epoch (https://arxiv.org/abs/1906.06669), and then finally train on your own dialogue corpus for up to 5.

Does this paper suggest training models with fewer parameters but more data for 1 epoch? It seems to take quite a long time to train a model this size (dim=1024, depth=12, heads=8), so I'm considering training a model with fewer parameters on more data for fewer epochs.

@lucidrains
Owner

@timsoraro the new recommendation is to make your model as big as possible and stop your training early (https://twitter.com/Eric_Wallace_/status/1235616760595791872?s=20). More data always helps!

@timsoraro
Author

Oh okay, thanks for the info! I'm considering how to spend my budget; I really wish there were a pre-trained Reformer model à la GPT-2. I assume it would cost $5,330 to pre-train on OpenWebText with 8 V100s on an AWS spot instance (with the same hyperparameters as the model posted here), which I cannot afford.
