Colab session crashed after using all available RAM #54

Closed
Tylersuard opened this issue Dec 24, 2020 · 13 comments

@Tylersuard

I am using the premium High-Ram instance. I appreciate that you made a Colab notebook. How can I fix this issue to run it?

@Tylersuard (Author) commented Dec 24, 2020

The crash happened when running this cell:

# Random performance without fine-tuning.
get_accuracy(params_repl)

@andsteing (Collaborator)

Hi Tyler

Do you run out of CPU RAM or GPU/TPU RAM?

Also, how much RAM do you have?
(You can check with !free -mh)

The provided Colab works fine with CIFAR datasets and the default settings (default Colab currently has 12G of RAM).
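
For reference, a quick way to check both from the notebook (a sketch; it assumes a Colab cell with JAX already installed):

!free -mh                  # host (CPU) RAM

import jax
print(jax.devices())       # the accelerator devices JAX can see (TPU/GPU/CPU)
print(jax.device_count())  # how many of them there are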

@lucasliunju

I cannot run the code on Colab either; I hit the same problem, and the session crashes on those cells.

@andsteing (Collaborator)

I just checked; the Colab runs fine (at least up to the "Fine-tune" section, which is below the "Random performance without fine-tuning." comment that you point out above).

@lucasliunju

I get an error at the last two lines of the "Fine-tune" cell:

# The world's simplest training loop.
# Completes in ~20 min on the TPU runtime.
for step, batch, lr_repl in zip(
    tqdm.notebook.trange(1, total_steps + 1),
    ds_train.as_numpy_iterator(),
    lr_iter
):
  opt_repl, loss_repl, update_rngs = update_fn_repl(
      opt_repl, lr_repl, batch, update_rngs)

Thank you very much! I'm looking forward to your reply.

@GuardianWang commented Mar 22, 2021

Hi Tyler,

The crash happened when running this cell:

# Random performance without fine-tuning.
get_accuracy(params_repl)

Did you run the cell "TPU setup : Boilerplate for connecting JAX to TPU"?

If you double-click that cell, you will find that it contains the following code:


#@markdown TPU setup : Boilerplate for connecting JAX to TPU.

import os
if 'google.colab' in str(get_ipython()) and 'COLAB_TPU_ADDR' in os.environ:
  # Make sure the Colab Runtime is set to Accelerator: TPU.
  import requests
  if 'TPU_DRIVER_MODE' not in globals():
    url = 'http://' + os.environ['COLAB_TPU_ADDR'].split(':')[0] + ':8475/requestversion/tpu_driver0.1-dev20191206'
    resp = requests.post(url)
    TPU_DRIVER_MODE = 1

  # The following is required to use TPU Driver as JAX's backend.
  from jax.config import config
  config.FLAGS.jax_xla_backend = "tpu_driver"
  config.FLAGS.jax_backend_target = "grpc://" + os.environ['COLAB_TPU_ADDR']
  print('Registered TPU:', config.FLAGS.jax_backend_target)
else:
  print('No TPU detected. Can be changed under "Runtime/Change runtime type".')

I think this code registers your TPU.
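
As a quick sanity check (a sketch, assuming the cell above ran without errors), you can confirm that JAX actually sees the TPU before calling get_accuracy:

import jax
print(jax.devices())       # on a Colab TPU runtime this should list 8 TpuDevice entries
print(jax.device_count())  # should print 8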

@GuardianWang commented Mar 22, 2021

@lucasliunju

I get an error at the last two lines of the "Fine-tune" cell:

# The world's simplest training loop.
# Completes in ~20 min on the TPU runtime.
for step, batch, lr_repl in zip(
    tqdm.notebook.trange(1, total_steps + 1),
    ds_train.as_numpy_iterator(),
    lr_iter
):
  opt_repl, loss_repl, update_rngs = update_fn_repl(
      opt_repl, lr_repl, batch, update_rngs)

Thank you very much! I'm looking forward to your reply.

#81 (comment)

To be specific, the fix is to import flax.optim as optim.
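
A minimal sketch of that workaround (assumptions: the installed Flax version still ships the flax.optim module, and the failure is that flax.optim is not reachable without an explicit import; newer Flax releases drop flax.optim in favor of optax, so pinning an older Flax may also be needed):

# Run this before the fine-tuning cells that reference the optimizer.
import flax.optim as optim  # explicitly import the submodule

# Sanity check that the optimizer classes are now reachable:
print(hasattr(optim, 'Momentum'))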

@lucasliunju

Hi @GuardianWang

Thanks for your reply. That works!
Do you know why the official optimizer does not work?
I would also like to ask whether you have used this code on a multi-host TPU (such as v3-32 or v3-64).

Thank you very much!

@andsteing (Collaborator)

I just re-ran the code in the provided Colab on TPU and it worked without problems.

@Tylersuard can you try again?
(I think it was a temporary regression in the Colab setup and/or the JAX code, independent of the code in this repo.)

@Tylersuard (Author)

@andsteing I just tried again, same issue:

# The world's simplest training loop.
# Completes in ~20 min on the TPU runtime.
for step, batch, lr_repl in zip(
    tqdm.notebook.trange(1, total_steps + 1),
    ds_train.as_numpy_iterator(),
    lr_iter
):
  opt_repl, loss_repl, update_rngs = update_fn_repl(
      opt_repl, lr_repl, batch, update_rngs)

Response:
Your session crashed after using all available RAM.

Log:

Apr 21, 2021, 6:52:28 AM WARNING WARNING:root:kernel 0cb38d64-9225-470e-badb-c668f208fe42 restarted
Apr 21, 2021, 6:52:28 AM INFO KernelRestarter: restarting kernel (1/5), keep random ports
Apr 21, 2021, 6:27:14 AM WARNING 2021-04-21 13:27:14.614322: W tensorflow/core/kernels/data/cache_dataset_ops.cc:798] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to dataset.cache().take(k).repeat(). You should use dataset.take(k).cache().repeat() instead.
Apr 21, 2021, 6:23:38 AM INFO Adapting to protocol v5.1 for kernel 0cb38d64-9225-470e-badb-c668f208fe42
Apr 21, 2021, 6:23:37 AM INFO Kernel started: 0cb38d64-9225-470e-badb-c668f208fe42
Apr 21, 2021, 6:23:29 AM INFO Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
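
As an aside, the tf.data warning in that log recommends applying take() before cache() so the cache can be fully populated. A minimal illustration of the two pipelines it contrasts (a hypothetical toy dataset, not the notebook's actual input pipeline):

import tensorflow as tf

ds = tf.data.Dataset.range(1000)

# Pattern the warning complains about: the cache is never fully populated,
# so the partially cached contents are discarded.
bad = ds.cache().take(100).repeat()

# Recommended ordering: truncate first, then cache the smaller dataset.
good = ds.take(100).cache().repeat()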

@andsteing (Collaborator)

Just to confirm:

  1. You're loading this notebook here : https://colab.sandbox.google.com/github/google-research/vision_transformer/blob/master/vit_jax.ipynb
  2. You're using a Runtime type = TPU kernel
  3. You start with a fresh kernel.
  4. You run end-to-end without modifications.

I'm asking because I tried multiple times and was not able to reproduce your error.

@Tylersuard (Author)

I used a slightly different link, the one shown in this repo's readme:

https://colab.research.google.com/github/google-research/vision_transformer/blob/master/vit_jax.ipynb

@Tylersuard (Author)

@andsteing Ok, I got it to work. For some reason, it does not work with the "high-ram" instance option, but it does work with the regular option. Thank you for your help.
