OOM error with new 774M model when running in Colab #108

Open · ghost opened this issue Aug 20, 2019 · 77 comments

@ghost commented Aug 20, 2019

When running the sess command, I get an OOM error. Not sure if the new large model is too large for Colab?

WARNING: Logging before flag parsing goes to stderr.
W0820 16:58:18.137592 140704259733376 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/gpt_2_simple/src/sample.py:17: add_dispatch_support..wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where

ResourceExhaustedError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1355 try:
-> 1356 return fn(*args)
1357 except errors.OpError as e:

7 frames
ResourceExhaustedError: OOM when allocating tensor with shape[50257,1280] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node model/wte/Initializer/random_normal/RandomStandardNormal}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

ResourceExhaustedError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1368 pass
1369 message = error_interpolation.interpolate(message, self._graph)
-> 1370 raise type(e)(node_def, op, message)
1371
1372 def _extend_graph(self):

ResourceExhaustedError: OOM when allocating tensor with shape[50257,1280] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node model/wte/Initializer/random_normal/RandomStandardNormal (defined at /usr/local/lib/python3.6/dist-packages/gpt_2_simple/src/model.py:185) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Original stack trace for 'model/wte/Initializer/random_normal/RandomStandardNormal':
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py", line 16, in
app.launch_new_instance()
File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 658, in launch_instance
app.start()
File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelapp.py", line 477, in start
ioloop.IOLoop.instance().start()
File "/usr/local/lib/python3.6/dist-packages/tornado/ioloop.py", line 888, in start
handler_func(fd_obj, events)
File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
self._handle_recv()
File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
self._run_callback(callback, msg)
File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
callback(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 283, in dispatcher
return self.dispatch_shell(stream, msg)
File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 235, in dispatch_shell
handler(stream, idents, msg)
File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 399, in execute_request
user_expressions, allow_stdin)
File "/usr/local/lib/python3.6/dist-packages/ipykernel/ipkernel.py", line 196, in do_execute
res = shell.run_cell(code, store_history=store_history, silent=silent)
File "/usr/local/lib/python3.6/dist-packages/ipykernel/zmqshell.py", line 533, in run_cell
return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2718, in run_cell
interactivity=interactivity, compiler=compiler, result=result)
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2828, in run_ast_nodes
if self.run_code(code, result):
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 12, in
save_every=500
File "/usr/local/lib/python3.6/dist-packages/gpt_2_simple/gpt_2.py", line 170, in finetune
output = model.model(hparams=hparams, X=context)
File "/usr/local/lib/python3.6/dist-packages/gpt_2_simple/src/model.py", line 185, in model
initializer=tf.compat.v1.random_normal_initializer(stddev=0.02))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 1496, in get_variable
aggregation=aggregation)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 1239, in get_variable
aggregation=aggregation)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 562, in get_variable
aggregation=aggregation)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 514, in _true_getter
aggregation=aggregation)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 929, in _get_single_variable
aggregation=aggregation)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py", line 259, in call
return cls._variable_v1_call(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py", line 220, in _variable_v1_call
shape=shape)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py", line 198, in
previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 2511, in default_variable_creator
shape=shape)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py", line 263, in call
return super(VariableMetaclass, cls).call(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py", line 1568, in init
shape=shape)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py", line 1698, in _init_from_args
initial_value(), name="initial_value", dtype=dtype)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 901, in
partition_info=partition_info)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/init_ops.py", line 323, in call
shape, self.mean, self.stddev, dtype, seed=self.seed)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/random_ops.py", line 79, in random_normal
shape_tensor, dtype, seed=seed1, seed2=seed2)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_random_ops.py", line 728, in random_standard_normal
seed2=seed2, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 2005, in init
self._traceback = tf_stack.extract_stack()

@minimaxir (Owner)

It is likely not possible to finetune 774M. Discussion here: https://news.ycombinator.com/item?id=20749037

I need to run tests to determine how well it works; if it's not possible, I'll add a bespoke assert to prevent finetuning on it.

minimaxir reopened this Aug 20, 2019
@sdan commented Aug 21, 2019

Would using fp16 help?

@saippuakauppias commented Aug 21, 2019

Maybe it can be run on one of these: https://cloud.google.com/compute/all-pricing#gpus ?
What would the minimum configuration be? Or are there places where it would cost less?
Is it possible to use this repo: https://github.com/minimaxir/gpt-2-cloud-run ?

@minimaxir (Owner)

There is no magic switch for FP16 in TensorFlow [yet], and the 16 GB VRAM offered by cloud GPUs still isn't enough.

If there are any workarounds, I would be interested in them.

@saippuakauppias

Do you need to recompile TensorFlow to use FP16? I have some experience with this; I can explain how to do it without much difficulty.

@woctezuma (Contributor) commented Aug 21, 2019

For reference, this was @AdamDanielKing's answer on HackerNews:

TalkToTransformer.com uses preemptible P4 GPUs on Google Kubernetes Engine. Changing the number of workers and automatically restarting them when they're preempted is easy with Kubernetes.

To provide outputs incrementally rather than waiting for the entire sequence to be generated, I open a websocket to a worker and have it do a few tokens at a time, sending the output back as it goes. GPT-2 tokens can end partway through a multi-byte character, so to make this work you need to send the raw UTF-8 bytes to the browser and then have it concatenate them before decoding the string.

While my workers can batch requests from multiple users, the modest increase in performance is probably not worth the complexity in most cases.

I won't say that I understand everything though.

@AdamDanielKing

@woctezuma That comment only explains how to deploy a trained model, which requires much less GPU memory than training because the gradients aren't stored. @minimaxir is probably right that for training you won't fit a full batch of 774M training samples in the K80 GPU that Colab gives you.

@minimaxir You can work around this by training with a smaller batch size but accumulating gradients over several iterations before applying an update to the weights. That achieves a larger effective batch size than can fit in the GPU. This page might be helpful.
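
For illustration, a minimal sketch of gradient accumulation in TF1-style code (not the package's exact implementation; loss, train_vars, and the hyperparameters are placeholders):

    import tensorflow as tf

    def build_accumulated_train_op(loss, train_vars, accumulate_steps=8, lr=1e-4):
        # Keep one non-trainable buffer per variable, add each mini-batch
        # gradient into it, and apply the averaged update only once every
        # `accumulate_steps` runs to get a larger effective batch size.
        opt = tf.compat.v1.train.AdamOptimizer(learning_rate=lr)
        grads = tf.gradients(loss, train_vars)
        accumulators = [tf.Variable(tf.zeros(v.shape, dtype=v.dtype), trainable=False)
                        for v in train_vars]
        accumulate_op = tf.group(*[a.assign_add(g / accumulate_steps)
                                   for a, g in zip(accumulators, grads)])
        apply_op = opt.apply_gradients(zip(accumulators, train_vars))
        zero_op = tf.group(*[a.assign(tf.zeros_like(a)) for a in accumulators])
        return accumulate_op, apply_op, zero_op

Run accumulate_op for several mini-batches, then apply_op followed by zero_op.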

@minimaxir (Owner)

The workflow for 345M finetuning uses a batch size of 1 with accumulated gradients. That is the workflow 774M should be using now, apparently with no success.

@AdamDanielKing

Ah, I see. That's surprising.

I know OpenAI uses gradient checkpointing for some of their other work, so I'd bet they use it in their GPT-2 training code as well. See https://github.com/cybertronai/gradient-checkpointing. Instead of storing all the layer activations at once, this stores a subset of them and then recomputes the rest during the backward pass to significantly reduce memory usage. In my experience it's pretty easy to get that library working, and if you do, it should be effective.

Another workaround is to only train with sequences significantly shorter than the maximum of 1024 tokens.
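
For context, a rough sketch of how that gradient-checkpointing library is typically wired into a TF1 training graph (the exact call used in this repo may differ; loss, train_vars, and opt are assumed to already exist):

    import tensorflow as tf
    import memory_saving_gradients  # from github.com/cybertronai/gradient-checkpointing

    # Compute gradients with checkpointing so intermediate activations are
    # recomputed on the backward pass instead of all being kept in memory.
    grads = memory_saving_gradients.gradients(
        loss, train_vars, checkpoints='collection')  # or 'memory' / 'speed'
    train_op = opt.apply_gradients(zip(grads, train_vars))

In 'collection' mode the library only checkpoints tensors that the model definition has explicitly added via tf.add_to_collection('checkpoints', tensor).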

@saippuakauppias commented Aug 21, 2019

Maybe we should turn back and try implementing this with a TPU or multiple GPUs?

In your opinion, which option would be preferable so we don't have to revisit this issue in the future (when the 1558M-parameter model is released)? (I assume Colab may not be enough for that, of course.)

@sdan commented Aug 21, 2019

@AdamDanielKing, this repo took a good chunk of nshepperd's codebase, as @minimaxir has said in the past. That means it automatically does gradient checkpointing for any model that is not 117M (see gpt_2.py).

@saippuakauppias I already tried it with every GPU/RAM/CPU configuration on GCP. Only after 10-15 failed attempts did I realize it was a real issue, and I was pointed to HN, where I saw @minimaxir and @AdamDanielKing's discussion.

And @saippuakauppias, someone has already tried a TPU: [Colab using TPU](https://colab.research.google.com/github/shawwn/gpt-2/blob/tpu/Training_GPT_2_Using_TPUs.ipynb). So far I haven't gotten the best results, although I have to do some data preprocessing to see what the exact issue is. I'm also getting a pretty high loss on it.

@saippuakauppias can you help me recompile TF to only use FP16?

@AdamDanielKing commented Aug 21, 2019

@dantuluri Thanks for pointing this out. It looks like the code only uses 1 gradient checkpoint at layer 10:

    if layer == 10:
        tf.compat.v1.add_to_collection('checkpoints', h)

The code is using memory_saving_gradients in 'collection' mode, so it doesn't automatically add any other checkpoints. 774M has 36 layers, so this means the activations of at least 26 layers will be in memory at the same time. I'd suggest adding many more checkpoints or trying the other modes.
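
As an illustration, checkpointing every layer inside the model's transformer loop would look roughly like this (the loop shape follows GPT-2's model.py, but treat it as a sketch rather than the exact upstream code):

    # In model.py's block loop: register every layer output as a checkpoint so
    # 'collection' mode can recompute intermediate activations on the backward
    # pass instead of keeping all 36 layers' activations in memory at once.
    for layer, past in enumerate(pasts):
        h, present = block(h, 'h%d' % layer, past=past, hparams=hparams)
        presents.append(present)
        tf.compat.v1.add_to_collection('checkpoints', h)  # was: if layer == 10: ...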

@saippuakauppias commented Aug 21, 2019

@dantuluri, I misunderstood the discussion on "Hacker News" (recompilation is not needed).
FP16 is already available in tensorflow version 1.14:

Can anyone check if this helps for a Colab or for Cloud Run?

PS: if you suddenly need to recompile TF, then here is the easiest way: https://github.com/yaroslavvb/tensorflow-community-wheels/pull/121/files

@minimaxir (Owner)

If FP16 is indeed in TensorFlow 1.14 via pip, I'll give it a test.

@sdan commented Aug 21, 2019

Looking into the code it seems @minimaxir used https://github.com/cybertronai/gradient-checkpointing for gradient checkpointing.

I used the variations:

  • collection (which appears to be default)
  • speed (ran into the same OOM problems)
  • memory (ran into: 'unable to find bottleneck tensors! please provide checkpoint nodes manually, or use checkpoints="speed"')

Here are the definitions of each variation just for reference:

  • 'collection' (default): checkpoints all tensors returned by tf.get_collection('checkpoints'). You then need to make sure you add tensors to this collection using tf.add_to_collection('checkpoints', tensor) when you define your model.
  • 'memory': uses a heuristic to automatically select a set of nodes to checkpoint, achieving the desired O(sqrt(n)) memory usage. The heuristic works by automatically identifying articulation points in the graph, i.e. tensors which split the graph into two disconnected parts when removed, and then checkpointing a suitable number of these tensors. This currently works well for many, but not all, models.
  • 'speed': tries to maximize running speed by checkpointing the outputs of all ops that are typically expensive to compute, namely convolutions and matrix multiplies.

I think FP16 is probably the way to go if it works as @minimaxir said.

@sdan commented Aug 21, 2019

@saippuakauppias do you know how to use FP16? I'm not too familiar with how to start using it.

@saippuakauppias commented Aug 21, 2019

@dantuluri No, but now I'm trying to figure out how to use it.

An example of how to enable FP16: https://colab.research.google.com/github/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/docs/amp/notebook_v1.14/auto_mixed_precision_demo_cifar10.ipynb

You just need to wrap the optimizer in tensorflow.compat.v1.train.experimental.enable_mixed_precision_graph_rewrite and that's it!

Documentation:
https://www.tensorflow.org/api_docs/python/tf/train/experimental/enable_mixed_precision_graph_rewrite
https://gist.github.com/tlkh/fa20c5bf3c8b48def4501cccff8b3559
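
A minimal sketch of that optimizer wrapping on a TF 1.14 graph (the optimizer, learning rate, and loss below are placeholders, not the package's actual code):

    import tensorflow as tf

    opt = tf.compat.v1.train.AdamOptimizer(learning_rate=1e-4)
    # The graph rewrite inserts float16 casts where safe, keeps float32
    # master weights, and applies automatic loss scaling.
    opt = tf.compat.v1.train.experimental.enable_mixed_precision_graph_rewrite(opt)
    train_op = opt.minimize(loss)  # `loss` is the model's training loss tensor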

@AdamDanielKing commented Aug 22, 2019

There should definitely still be more gradient checkpoints. I just ran some tests on a K80, setting accumulate_gradients to 1 and seeing how many samples can fit in a batch without running out of memory.

Model | Checkpointing just layer 10 | Checkpointing all layer outputs
345M  | 1 sample fits               | 8 samples fit

The only code change is removing the if layer == 10: line. This makes the large internal activations of each layer (attention layer, MLP layer) be recomputed, with only the skip connections between layers being stored. Still, the optimal strategy is likely to be a bit different from this.

Unfortunately I'm still struggling to fit 1 sample of 774M into memory mainly because the attn function inside each layer requires a lot of memory.

Edit: By the way, adding more checkpoints doesn't have a performance hit because its only effect is to not deallocate and recompute the checkpointed layer. So you just want to choose the checkpoints in a way that minimizes the peak memory usage.

@sdan commented Aug 22, 2019

How does nshepperd's fork deal with this?
It seems like he puts gradient checkpointing at all layers (if args.accumulate_gradients > 1:), but I'm not sure.

@AdamDanielKing

@dantuluri He also has the if layer == 10: line, so only one checkpoint. When memory_saving_gradients is in 'collection' mode (the default) it only uses checkpoints that you explicitly add.

@sdan commented Aug 22, 2019

Interesting.
Going through the speed, memory, and collection modes with if layer == 10 removed.
Will update once done (running on a 16 GB VRAM V100).

@sdan commented Aug 22, 2019

Update:
The speed and memory options don't work. Just collection works (the default).
All you need to do is delete the if layer == 10: line (I also tried if layer == 5 and == 2, which still didn't work), as @AdamDanielKing said.

Currently running on a V100. Will try lower-VRAM GPUs.

Edit: Can't vouch for the quality of training. I just saw it training and assumed it works.
Edit: Running on a P100 works fine.

@AdamDanielKing commented Aug 22, 2019

@dantuluri Perfect! This is with 774M, right? How many samples fit if you set accumulate_gradients to 1 and vary batch_size? Can you get more than one?

I think the boundary between not fitting a sample and fitting one is between 12 GB and 16 GB. I wasn't able to get one of the K80s that Google offers (12 GB) to work. So it seems we still can't train for free on Colab.

Edit: One place in particular that seemed helpful to add a checkpoint was at the model's output:

output = model.model(hparams=hparams, X=context)

I suggest experimenting with adding it

    output = model.model(hparams=hparams, X=context)
    tf.compat.v1.add_to_collection('checkpoints', output['logits'])

and seeing if that increases the number of samples you can fit on the GPU. Edit Sept 21: The line above had a bug but should work now. Still not certain that it lowers the overall memory usage but it's worth trying.

Memory peaks around there, and checkpointing seemed to bring the peak usage down while I was playing with the K80.

@AdamDanielKing

Updated my code suggestion one last time. ^

@sdan commented Aug 22, 2019

At the moment it's training, with accumulate_gradients = 1 and batch_size = 1 as defaults.

I think my input may be wrong, because it's structured like this:

<|startoftext|>
hello world
more text more text more text
more text more text more text
<|endoftext|>
<|startoftext|>
more text more text more text
more text more text more text
<|endoftext|>

and so on.
But I'm getting the start and end tags in my results...
like this:

something
something
something
<|endoftext|>
<|startoftext|>
something
something
sometimes weird characters

Because I'm a bit more familiar with nshepperd's code, do you know where he did the checkpointing (layer == 10)? I got better results training 345M using his code.

Otherwise, any help on getting this code to work with my data would be much appreciated.

In regards to GPU memory usage:
I'm using a P100, which has 16 GB of VRAM. When training, regardless of model (345M or 774M), it always maxes out at around 99% (except when generating samples, when it drops to around 50%).

In regards to loss:
The loss is really low compared to 345M. I'm getting around 1.5 out of the gate, as opposed to high 2's with 345M.

Quality of results:
For the short time I've been training it, it's not a whole lot better than 345M. This will hopefully change. And as said before, the <|endoftext|> <|startoftext|> tags are somewhat annoying when they show up in the middle of the results... not to mention... where can I delete ======== SAMPLE 1 ========? It's always showing up in all my samples. And when the program saves these samples, it doesn't save them in .txt files, just samples-100 with no extension.
With nshepperd's code I could easily make these adjustments. I'm not sure where I can make them in this code.

In regards to your suggestion:
Haven't tried output = model.model(hparams=hparams, X=context) plus tf.compat.v1.add_to_collection('checkpoints', output) yet; will update when I get the data issues out of the way.

@saippuakauppias

Maybe @nshepperd already tried to solve this problem too?

minimaxir pinned this issue Aug 23, 2019
@AdamDanielKing

@saippuakauppias That line I gave had a bug. What I meant was to try adding output['logits'].

@saippuakauppias

@AdamDanielKing thanks for quick reply!

Unfortunately, memory consumption has not decreased: Adam = 16.7 GB, SGD = 8.5 GB.
Should there be a decrease in loss or step time?

@AdamDanielKing commented Sep 21, 2019

@saippuakauppias In some cases adding a checkpoint will speed things up slightly but I was mainly trying for a memory decrease. The outcomes of each step will not be changed by adding a checkpoint.

Are those numbers the peak memory usage? Memory usage varies significantly throughout each training step, going up and down more than once. I was using a script that showed every (de)allocation and the total memory usage at each point, but I can't seem to remember where I got it.

@saippuakauppias commented Sep 21, 2019

@AdamDanielKing, no, these are not peak usage. I ran nvidia-smi -l 1 (it displays information every second in a loop) and went through about 100 epochs (steps); during that time the memory usage did not change.
A script that logs memory would make this easier. :)


Tests with SM3 optimizer (with momentum=0.9 as in the implementation of @dantuluri ):

  • only SM3 optimizer added: OOM error
  • SM3 + remove if layer == 10:: 16.7GB
  • SM3 + remove if layer == 10: + add tf.compat.v1.add_to_collection('checkpoints', output['logits']): 16.7GB

Rechecked Adam and SGD without removing if layer == 10: (no changes, all as in the repository):

  • Adam: OOM error
  • SGD: 16.6GB

Currently there is only one way out: use SGD and remove if layer == 10:.
But how will this affect quality and speed?

@jkraybill commented Sep 22, 2019

So I have found a quick-and-dirty way to fine-tune 774M under Colab without OOM's, thanks to this tweet by @basedblue: https://twitter.com/BasedBlue/status/1169601983046672385?s=20

under this line in train.py

train_vars = [v for v in tf.trainable_variables() if 'model' in v.name]

add this line:

train_vars = train_vars[-12:]

This just tunes h35; according to his tweet here, you can train h15-h35: https://twitter.com/BasedBlue/status/1169600560535916544?s=20

I can't attest to the results yet - running my own tests as we speak - but I can attest that this is the first time I've been able to do any sort of tuning of 774M under Colab.

EDIT: Can confirm I was able to go up to 336 train_vars ("-336" instead of "-12" above) before getting OOMs. I do still get them intermittently at lower numbers, which seems to be a Colab quirk. The results of my finetuning on a set I'm using for benchmarking were markedly better than the 345M model on both validation and test. Also, if you're doing this, make sure to pass the FULL list of vars to your Saver, not the shortened tuning list, or you won't be able to restore into the full model.
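
A minimal sketch of that idea (variable names are illustrative, loss is assumed to already be defined, and the actual train.py differs in details):

    import tensorflow as tf

    all_vars = [v for v in tf.compat.v1.trainable_variables() if 'model' in v.name]
    train_vars = all_vars[-12:]  # only finetune the last transformer block (h35)

    opt = tf.compat.v1.train.AdamOptimizer(learning_rate=1e-4)
    train_op = opt.minimize(loss, var_list=train_vars)

    # The Saver must see ALL model variables, not just the ones being tuned,
    # so the checkpoint can be restored into the full model later.
    saver = tf.compat.v1.train.Saver(var_list=all_vars)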

@jkraybill

Here's my notebook that demonstrates fine-tuning 774M with reduced parameters in Colab:

https://github.com/jkraybill/gpt-2/blob/finetuning/GPT2-finetuning2-774M.ipynb

You may need to adjust the "layers_to_train" parameter; I've had success with numbers up to 336, but I still get intermittent OOMs before training starts with values between 200 and 336. 200 or less has always worked so far, but the results I got with 336 seemed more coherent for text generation tuned on a large corpus than the results at 200 and below.

On a very non-scientific experimental benchmark, which is a randomly selected set of 100 hard trivia questions, after fine-tuning on a large trivia question set, 117M scored 6/100, 345M scored 11/100, and 774M scored 19/100 and 20/100 in two different runs. (I'm decent at trivia and my score was 43/100.) These scores interestingly scale closely with model size, so at this rate the full-size model (if able to be fine-tuned) will score around what I scored.

@AdamDanielKing

under this line in train.py

train_vars = [v for v in tf.trainable_variables() if 'model' in v.name]

add this line:

train_vars = train_vars[-12:]

That's a nice approach in some ways. It may even produce better results because the risk of overfitting is less when training fewer parameters.

@jkraybill

I got pretty poor results when just training on 12 vars, but with 336 the results seemed superior for the tests I ran vs 345M. As the number goes lower, unconditional output seems to deviate from the training set format more than when doing "full stack" training on 345M. This is very preliminary and I have a lot more testing to do!

@saippuakauppias commented Sep 23, 2019

Checked memory usage on V100 (32 GB):

PS: I watched memory usage only for the first 100 steps (as before). I don't know why the memory numbers coincide; this is not a mistake.

PPS: Interesting thing from Mixed Precision log:
Adam:

Converted 2397/25089 nodes to float16 precision using 435 cast(s) to float16 (excluding Const and Variable casts)

SGD:

Converted 2397/23311 nodes to float16 precision using 435 cast(s) to float16 (excluding Const and Variable casts)

SM3:

Converted 2397/34689 nodes to float16 precision using 435 cast(s) to float16 (excluding Const and Variable casts)

@saippuakauppias commented Sep 23, 2019

Forgot to mention an interesting detail. When I started testing the V100, I accidentally started finetuning without using the GPU. The server had a weak 8-core processor and 64 GB of memory, but the training worked (Adam optimizer, no changes). Very slow, but it runs on the CPU using less than 32 GB of RAM.

@saippuakauppias commented Sep 23, 2019

Can anyone adapt the Adafactor optimizer for this repository?

In several recently proposed stochastic optimization methods (e.g. RMSProp, Adam, Adadelta), parameter updates are scaled by the inverse square roots of exponential moving averages of squared past gradients. Maintaining these per-parameter second-moment estimators requires memory equal to the number of parameters. For the case of neural network weight matrices, we propose maintaining only the per-row and per-column sums of these moving averages, and estimating the per-parameter second moments based on these sums. We demonstrate empirically that this method produces similar results to the baseline. Secondly, we show that adaptive methods can produce larger-than-desired updates when the decay rate of the second moment accumulator is too slow. We propose update clipping and a gradually increasing decay rate scheme as remedies. Combining these methods and dropping momentum, we achieve comparable results to the published Adam regime in training the Transformer model on the WMT 2014 English-German machine translation task, while using very little auxiliary storage in the optimizer. Finally, we propose scaling the parameter updates based on the scale of the parameters themselves.

https://arxiv.org/abs/1804.04235
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/adafactor.py

https://github.com/ConnorJL/GPT2/blob/master/optimizers.py
https://github.com/rowanz/grover/blob/master/lm/optimization_adafactor.py
tensorflow/addons#522
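
For anyone attempting it, a rough sketch of what swapping in Adafactor could look like inside the finetuning code (the import path and class name are taken from tensor2tensor and should be verified against whichever implementation is chosen; loss, train_vars, and learning_rate are assumed to already exist):

    # Adafactor keeps only per-row and per-column second-moment statistics,
    # so it needs far less optimizer state memory than Adam.
    from tensor2tensor.utils.adafactor import AdafactorOptimizer

    opt = AdafactorOptimizer(learning_rate=learning_rate)
    opt_grads = tf.gradients(loss, train_vars)
    opt_apply = opt.apply_gradients(zip(opt_grads, train_vars))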

@saippuakauppias

My quick-and-dirty implementation of the Adafactor optimizer, plus removing the if layer == 10: line, did not hit an out-of-memory error in Colab, and it is now finetuning!
I hope it will be able to generate text rather than something terrible.

I will now check the memory usage for the V100 (32 GB) and write results here.

@saippuakauppias

  • batch_size=1:

    • only Adafactor optimizer: 16974MiB / 32480MiB
    • Adafactor + remove if layer == 10:: 8782MiB / 32480MiB
    • Adafactor + remove if layer == 10: + Mixed Precision: 31046MiB / 32480MiB
  • batch_size=10:

    • Adafactor + remove if layer == 10:: 16982MiB / 32480MiB
  • batch_size=25:

    • Adafactor + remove if layer == 10:: 31046MiB / 32480MiB

A few other tests for an example:


Can anyone recommend which other settings should be changed (like turning off only_train_transformer_layers or changing the AdafactorOptimizer default parameters) in order to reduce memory usage and/or improve generation quality? I could check them.

@jkraybill

Nice work. I'm currently experimenting with adafactor, train_vars numbers, and also checkpointing less frequently than every single layer. Checkpointing seems to have resulted in a 5-10x slowdown in training in exchange for better memory usage, so I'm trying to find a happy medium.

The good news is that it is clear that combos of adafactor, train_vars, and layer checkpointing definitively allow finetuning of 774M!

I don't have any empirical results from adafactor on my trivia set yet, but I will post them in an edit here once I have them. I have heard that non-Adam optimizers can be quite detrimental to the learning process, so we'll see what comes out.

@AdamDanielKing

This progress looks good. :) Though I still think it would be more informative to look at peak memory usage rather than what nvidia-smi is giving. The memory usage varies substantially over a single training step and it's really the peak usage that decides if the program will crash. One method is shown here with the tf.contrib.memory_stats.MaxBytesInUse() op. This Stack Overflow answer also suggests a way to view a timeline of memory usage.

Checkpointing seems to have resulted in a 5-10x slowdown in training in exchange for better memory usage, so trying to find a happy medium.

@jkraybill I think most of that slowdown must be coming from somewhere else. Checkpointing should result in each operation being computed at most twice, so I don't think it should be possible to get more than a 2x slowdown. Especially since only the forward step is recomputed. The gradient checkpointing repo suggests a 1.2x slowdown.
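
For reference, a small sketch of the peak-memory check mentioned above (TF 1.x only, since tf.contrib is gone in TF 2; the training steps are omitted):

    import tensorflow as tf

    # Reports the peak number of bytes the GPU allocator has used so far
    # in this process; run it after some training steps.
    max_bytes_op = tf.contrib.memory_stats.MaxBytesInUse()

    with tf.compat.v1.Session() as sess:
        # ... run training steps here ...
        peak_bytes = sess.run(max_bytes_op)
        print('Peak GPU memory so far: %.2f GB' % (peak_bytes / 1e9))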

@jkraybill

@AdamDanielKing thanks for the tip -- I need to spend a lot more time with the code to determine the cause. I updated my notebook codebase from a quite old @Tenoke fork to the most recent @nshepperd fork to use adafactor etc., so it's quite likely the cause is other changes in the codebase. The most glaring change was the checkpointing, but I agree with you that 5-10x doesn't match what should be happening. I'll do a bunch more testing and try to pin it down. I've done a little cleanup and the code is less slow, but still maybe 4-5x slower than before. I'll figure it out and share results here.

BTW thank you so much for TalkToTransformer -- it's by far been the most useful tool for getting non-AI people around me interested in this stuff, and is a great implementation!

@ronnyli commented Sep 25, 2019

@saippuakauppias Thanks so much for sharing. I noticed that the finetuning_774M branch still has this line (link): assert model_name not in ['774M', '1558M']. Should '774M' be removed?

@saippuakauppias commented Sep 25, 2019

I delete this line in a bash script during initial setup (check your package installation path in the GPT2PY_PATH variable):

echo -e "Fix finetuning 774M:"
PYTHON_VERSION=`python3 -c 'import sys; print("%s.%s" % (sys.version_info.major, sys.version_info.minor))'`
GPT2PY_PATH="/usr/local/lib/python${PYTHON_VERSION}/dist-packages/gpt_2_simple/gpt_2.py"
NEED_FIX=`cat ${GPT2PY_PATH} | grep 'cannot finetune the 774M' | wc -l`
if [ ${NEED_FIX} -ne 0 ]
then
    mv ${GPT2PY_PATH} gpt_2.py
    grep -vwE 'cannot finetune the 774M' gpt_2.py > ${GPT2PY_PATH}
    if [ $? -ne 0 ]
    then
        echo -e "FAILED grep"
        exit 1
    fi
    rm gpt_2.py
fi

@ronnyli commented Sep 28, 2019

For the benefit of people coming after me: I can confirm that the finetuning_774M branch in the fork @saippuakauppias made works after adjusting this line. It would be great if this fork could be merged into master in the near future!

@minimaxir (Owner) commented Sep 29, 2019

If there's a PR, I can test and merge it. (as long as there aren't any other preconditions/hacks outside of the package required to get it working)

@saippuakauppias

@minimaxir, all that is needed now is to add additional checkpoints and use the SGD optimizer.

What do you think about adding the SM3 and Adafactor optimizers?

@saippuakauppias commented Oct 8, 2019

Can anyone help me implement the RectifiedAdam optimizer (https://arxiv.org/pdf/1908.03265v1.pdf)? Or the Lookahead optimizer (https://arxiv.org/abs/1907.08610v1)? I found them here.

I'm trying to check RAdam here: https://colab.research.google.com/drive/1waCfxIgrrY-s4gZm7R3uEZDAk2hdCKVK (install gpt-2-simple from my fork)

@saippuakauppias commented Oct 10, 2019

Does anyone know why nonsense is generated after training with the SGD optimizer?
SGD uses as much memory as Adafactor, but Adafactor generates text like that in the training file.

Here are two of my Colabs with examples for comparing the generation results (at the end of each Colab).
Does anyone have any thoughts on this?

SGD: https://colab.research.google.com/drive/1LjZ8Z7oIjQVdgDM1FsWLusjbMXW925eV
Adafactor: https://colab.research.google.com/drive/1-F8NIc2XANEqQd8MbX3QC4aEiFbG7weR

@gpt2ent commented Dec 13, 2019

FYI, @shawwn managed to finetune the 1558M model in Colab using a TPU.

@pearsonkyle

Will it work in Colab Pro?

@Merzmensch

I have a similar issue with the 355M model in a Colab notebook (even in Colab Pro) with a small training dataset: #277
Can somebody please check whether something has to be updated in the notebook? I continuously get "ResourceExhaustedError: failed to allocate memory"; I tried Colab Pro and other accounts, always the same issue.

Please help - I wanted to show this process to students, and until last month everything worked (I had been using this Colab notebook for a long time without any issues).

I am trying to reproduce a training session that worked perfectly in August 2021; now it doesn't work.

@minimaxir (Owner)

Recently, Google made it so that you generally only get K80s on the free plan, which with 12GB VRAM didn't work well with the 355M model, so that might explain it.

@Merzmensch

Thank you! That explains it indeed - 124M works without issues. Still, I'm using Colab Pro and wonder why 355M doesn't work for me... But 124M is also fine for my experiments (at least it's working) :)

@dean-dalianis commented Nov 11, 2021

I made 355M work on Colab Pro, but using gpt-2-simple==0.7.2.
For some reason, even after installing it, I was still getting tensorflow>2, so in essence I ran these:

!pip install gpt-2-simple==0.7.2
!pip show tensorflow
!pip install tensorflow==1.15.2
!pip show tensorflow
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files

I did this so I could use

              use_memory_saving_gradients=True,
              only_train_transformer_layers=True

which are not available with tensorflow>2.
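
For reference, a sketch of how those flags are passed to gpt2.finetune() (the dataset path, model, and run name below are placeholders):

    sess = gpt2.start_tf_sess()
    gpt2.finetune(sess,
                  dataset='dataset.txt',   # placeholder path to training text
                  model_name='355M',
                  run_name='run1',
                  steps=1000,
                  use_memory_saving_gradients=True,
                  only_train_transformer_layers=True)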

I have no idea if this is the right approach though.

Here's the caveat: TensorFlow 1.15.2 no longer has GPU support (as of Jan 30, 2020) :(
