Recommended hardware for gpt2-medium #14

Closed

g-karthik opened this issue Jun 14, 2019 · 12 comments

Comments

@g-karthik

g-karthik commented Jun 14, 2019

Hello!

By adapting the code in this repo, I've been able to fine-tune GPT and GPT-2 small on Topical-Chat using an EC2 instance with 8 Tesla V100 GPUs (32 GB memory each). However, I'm unable to fine-tune GPT-2 medium on the same instance with the exact same hyper-parameters: I get out-of-memory errors, presumably because GPT-2 medium is much larger than GPT-2 small. I haven't tried fine-tuning GPT-2 medium on Persona-Chat yet, though.

Have you tried fine-tuning GPT-2 medium (from the attention branch of pytorch-pretrained-BERT) on large dialog datasets with long turns, and if so, could you share the details of the underlying hardware you used?

Thanks!

@martinritchie

martinritchie commented Oct 28, 2019

What batch size are you using? Also, are you using --FP16 O1?

@michaelklachko

https://www.gwern.net/GPT-2#training

Seems like a single 1080 Ti with 11 GB should be enough; if you switch to FP16, you wouldn't even need gradient checkpointing (Gwern used FP32).
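
For a rough sense of scale (back-of-envelope, not measured):

    # GPT-2 medium is ~355M parameters; fine-tuning with Adam in FP32 needs
    # weights + gradients + two moment buffers before any activations:
    params = 355e6
    bytes_per_param = 4 * (1 + 1 + 2)          # weights, grads, exp_avg, exp_avg_sq
    print(params * bytes_per_param / 1024**3)  # ~5.3 GB

Activations scale with batch size and sequence length, which is where FP16 and/or gradient checkpointing make the difference on an 11 GB card.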

@gwern

gwern commented Nov 20, 2019

Tweaking the gradient checkpointing was enough to get 774M to work, but not 1.5b. We experimented with FP16 when we were trying to get 1.5b to work on a 1080 Ti. It caused a lot of issues: the codebase multiplies by constants which can't be represented in FP16 (so it wound up generating '!!!!!!' infinitely, because that's the first BPE token or whatever), and once we figured that out and converted the pretrained model to FP16, the output was completely screwed up, so something slightly more clever was obviously required to make reduced precision work. At that point we switched over to Colab TPUs (which opened up an entirely different kettle of worms relating to TPU iterations randomly freezing; our best guess so far is that some reshape or loop makes the TPU very unhappy).
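
To illustrate the constant problem (a minimal sketch of just the overflow, in PyTorch for brevity):

    import torch

    # FP16's largest finite value is ~65504, so the sort of ~1e10 constant used
    # to mask attention logits overflows the moment the model runs in half
    # precision; the masked logits become -inf and the softmax produces garbage.
    print(torch.finfo(torch.float16).max)            # 65504.0
    print(torch.tensor(-1e10, dtype=torch.float16))  # tensor(-inf, dtype=torch.float16)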

@ibzip

ibzip commented Jan 5, 2020

> Tweaking the gradient checkpointing was enough to get 774M to work, but not 1.5b. We experimented with FP16 when we were trying to get 1.5b to work on a 1080 Ti. It caused a lot of issues: the codebase multiplies by constants which can't be represented in FP16 (so it wound up generating '!!!!!!' infinitely, because that's the first BPE token or whatever), and once we figured that out and converted the pretrained model to FP16, the output was completely screwed up, so something slightly more clever was obviously required to make reduced precision work. At that point we switched over to Colab TPUs (which opened up an entirely different kettle of worms relating to TPU iterations randomly freezing; our best guess so far is that some reshape or loop makes the TPU very unhappy).

@gwern mind mentioning broadly what you tweaked? I'm using checkpointing in PyTorch and can't fit even one sample into a 12 GB GPU for the 774M version.

@gwern

gwern commented Jan 5, 2020

I believe we needed something like this:

diff --git a/src/model.py b/src/model.py
index 4e942d8..71092bc 100644
--- a/src/model.py
+++ b/src/model.py
@@ -124,10 +124,10 @@ def block(x, scope, *, past, hparams):
     with tf.variable_scope(scope):
         nx = x.shape[-1].value
         a, present = attn(norm(x, 'ln_1'), 'attn', nx, past=past, hparams=hparams)
-        x = x + a
+        x = x1 = x + a
         m = mlp(norm(x, 'ln_2'), 'mlp', nx*4, hparams=hparams)
         x = x + m
-        return x, present
+        return x, present, x1
 
 def past_shape(*, hparams, batch_size=None, sequence=None):
     return [batch_size, hparams.n_layer, 2, hparams.n_head, sequence, hparams.n_embd // hparams.n_head]
@@ -161,9 +161,9 @@ def model(hparams, X, past=None, scope='model', reuse=tf.AUTO_REUSE):
         pasts = tf.unstack(past, axis=1) if past is not None else [None] * hparams.n_layer
         assert len(pasts) == hparams.n_layer
         for layer, past in enumerate(pasts):
-            h, present = block(h, 'h%d' % layer, past=past, hparams=hparams)
-            if layer == 10:
-                tf.add_to_collection('checkpoints', h)
+            h, present, x1 = block(h, 'h%d' % layer, past=past, hparams=hparams)
+            if layer < 48:
+                tf.add_to_collection('checkpoints', x1)
             presents.append(present)
         results['present'] = tf.stack(presents, axis=1)
         h = norm(h, 'ln_f')
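
(The 'checkpoints' collection above is only consumed if gradients are computed through the memory_saving_gradients module from openai/gradient-checkpointing, which the gpt-2 training forks build on. Roughly, with illustrative names:)

    import tensorflow as tf
    import memory_saving_gradients

    def build_train_op(loss, learning_rate=1e-4):
        train_vars = [v for v in tf.trainable_variables() if 'model' in v.name]
        # checkpoints='collection' recomputes everything between the tensors
        # that block() tagged via tf.add_to_collection('checkpoints', x1)
        grads = memory_saving_gradients.gradients(loss, train_vars,
                                                  checkpoints='collection')
        opt = tf.train.AdamOptimizer(learning_rate=learning_rate)
        return opt.apply_gradients(zip(grads, train_vars))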

@g-karthik
Author

> What batch size are you using? Also, are you using --FP16 O1?

@martinritchie I've been using FP16 O3 -- it gives me a NaN loss after about 55% of an epoch of training: WARNING:root:NaN or Inf found in input tensor. Training continues after the warning, but it just keeps printing NaN for the loss along with the same warning.

I've also been using a batch size of 2, but I think the NaN error above is specific to FP16.

@michaelklachko @gwern I am using FP16 and facing NaN errors.

@martinritchie

FP16 O3 can be unstable (check the apex docs); stick to O1, which should reduce the chance of it diverging. As Gwern suggested, use gradient checkpointing and reduce the batch size to one if you are still having memory problems.
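
Roughly, the O1 setup looks like this (an illustrative sketch with a placeholder model and data, not this repo's exact training loop):

    import torch
    from apex import amp

    model = torch.nn.Linear(10, 2).cuda()      # stand-in for the GPT-2 model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    # O1 patches whitelisted ops to FP16, keeps FP32 master weights, and uses
    # dynamic loss scaling -- generally much more stable than O3 (pure FP16).
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

    for step in range(10):
        inputs = torch.randn(4, 10).cuda()     # stand-in for a training batch
        loss = model(inputs).pow(2).mean()     # stand-in for the LM loss
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()
        optimizer.zero_grad()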

@g-karthik
Author

@martinritchie do you have any thoughts on how exactly to perform gradient checkpointing when the underlying modules return a variable number of tensors, like here:

https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_gpt2.py#L478

I get errors like CheckpointFunctionBackward.forward: expected Variable (got list) for return value 0. When I looked it up, it seems that a variable number of tensors is not supported.

Perhaps I could explicitly unpack the variable number of tensors in every sub-module of GPT2Model and then place the checkpoints?

@martinritchie

That sounds like it would be a little heavy-handed. Could you provide a minimal working example or show me how you are using it?

@g-karthik
Author

g-karthik commented Jan 24, 2020

https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_gpt2.py#L478

@martinritchie So I basically replaced the line above with:

            if i == 10:
                outputs = checkpoint(block, hidden_states, layer_past=layer_past, attention_mask=attention_mask,
                                     head_mask=head_mask[i])
            else:
                outputs = block(hidden_states,
                                layer_past=layer_past,
                                attention_mask=attention_mask,
                                head_mask=head_mask[i])

That initially failed, saying that checkpoint.py does not support keyword arguments, so I removed the keywords from the call:

            if i == 10:
                outputs = checkpoint(block, hidden_states, layer_past, attention_mask, head_mask[i])
            else:
                outputs = block(hidden_states,
                                layer_past=layer_past,
                                attention_mask=attention_mask,
                                head_mask=head_mask[i])

With that, the previous error went away, and now I get CheckpointFunctionBackward.forward: expected Variable (got list) for return value 0.

Here's a thread I found about this on the PyTorch forums: https://discuss.pytorch.org/t/checkpoint-didnt-support-list-output/16957/3

@g-karthik
Author

g-karthik commented Jan 25, 2020

Update: it looks like in the transformers library I simply need to change forward() in Block: it returns a list of outputs, which needs to become tuple(outputs) for checkpointing to work.
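
Roughly, the change is at the end of Block.forward in modeling_gpt2.py (paraphrased; the exact code is version-dependent):

    outputs = [x] + output_attn[1:]   # hidden states, present, (attentions)
    return tuple(outputs)             # was `return outputs` (a list);
                                      # torch.utils.checkpoint only accepts a
                                      # Tensor or a tuple of Tensors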

@g-karthik
Author

> FP16 O3 can be unstable (check the apex docs); stick to O1, which should reduce the chance of it diverging. As Gwern suggested, use gradient checkpointing and reduce the batch size to one if you are still having memory problems.

@martinritchie So I'm using gradient checkpointing like above (I changed the if condition to i <= 12 so all layers are checkpointed) with a batch size of 1, along with fp16 O1, and I'm still hitting CUDA OOM errors. My training dataset tensor is (batch, candidates, sequence length): torch.Size([33410, 3, 150]).
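
Concretely, the loop in GPT2Model.forward now looks roughly like this (paraphrased; the exact transformers code differs slightly by version, and checkpoint is torch.utils.checkpoint.checkpoint):

    for i, (block, layer_past) in enumerate(zip(self.h, past)):
        if i <= 12:
            # positional args only -- torch.utils.checkpoint rejects keyword args
            outputs = checkpoint(block, hidden_states, layer_past,
                                 attention_mask, head_mask[i])
        else:
            outputs = block(hidden_states,
                            layer_past=layer_past,
                            attention_mask=attention_mask,
                            head_mask=head_mask[i])
        hidden_states, present = outputs[:2]
        # ... accumulate `present` and continue as in the original forward()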

Any ideas how to get this to work?
