
Accelerator not recognizing TPU in Google Colab and Kaggle Kernels #29

Closed
tanaymeh opened this issue Apr 18, 2021 · 18 comments

Comments

@tanaymeh

I installed and imported accelerate in both Kaggle Kernels and Google Colab with TPU turned on but it doesn't seem to detect the TPU and instead detects CPU when running the following code:

!pip install -q accelerate

import accelerate

acc = accelerate.Accelerator()
device = acc.device
print(device)

The above snippet just outputs cpu when run on both of the aforementioned platforms with a TPU enabled.

Is there something that I am doing wrong?

PyTorch version: 1.7.0
Python version: 3.7.9

@sgugger
Collaborator

sgugger commented Apr 19, 2021

Yes, the TPU devices are only recognized once you launch your training function with the xmp.spawn method (as you would usually do with PyTorch/XLA).
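For example, a minimal sketch of that pattern (training_function and its contents are illustrative, not taken from the notebook below):

import torch_xla.distributed.xla_multiprocessing as xmp
from accelerate import Accelerator

def training_function():
    # Created inside the spawned process, the Accelerator picks up
    # the TPU core assigned to that process.
    accelerator = Accelerator()
    print(accelerator.device)  # should print an xla device, e.g. xla:0

def _mp_fn(rank, flags):
    training_function()

FLAGS = {}
# 'fork' is the start method commonly used in Colab/Kaggle TPU notebooks.
xmp.spawn(_mp_fn, args=(FLAGS,), nprocs=8, start_method='fork')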

@romanoss

romanoss commented Apr 19, 2021

Same problem here. I changed my launch code to:

def _mp_fn(rank, flags):
    global acc_list
    torch.set_default_tensor_type('torch.FloatTensor')
    res = run_training()

FLAGS = {}
xmp.spawn(_mp_fn, args=(FLAGS,), nprocs=8, start_method='fork')

and in run_training() I do:

trainloader = torch.utils.data.DataLoader(...)  # no parallel loader
optimizer = Ranger(model.parameters(), lr=Config.SCHEDULER_PARAMS['lr_start'])
scheduler = ShopeeScheduler(optimizer, **Config.SCHEDULER_PARAMS)
model, optimizer, trainloader = accelerator.prepare(model, optimizer, trainloader)

for i in range(Config.EPOCHS):
    avg_loss_train = train_fn(model, trainloader, optimizer, scheduler, i)

and I get:

Exception in device=TPU:0: Cannot replicate if number of devices (1) is different from 8
Exception in device=TPU:1: Cannot replicate if number of devices (1) is different from 8
...

for all TPU devices.

The value of acc.device is xla:1.

What did I do wrong?
Best Roman

@sgugger
Collaborator

sgugger commented Apr 19, 2021

Thanks for reporting. I haven't fully tested on Colab (the primary use case is running scripts), so I will need some time to set up a way of reproducing. In the meantime, could you share your notebook so I can have a full reproducer?

@romanoss

romanoss commented Apr 19, 2021

This is a Kaggle notebook from the current Shopee competition - the easiest way is probably to join the competition and upload it:
shopee-pytorch-eca-nfnet-l0-image-training.zip

I think the only things you have to change are RETRAIN_MODEL = '' and switching USE_TPU_AND_ACCELERATE.

One small bug: in train_fn(), change
data[k] = v.to(Config.DEVICE) to data[k] = v  # in case of USE_TPU_AND_ACCELERATE=True

@sgugger
Collaborator

sgugger commented Apr 20, 2021

Could you share which notebook it is? There are a lot of notebooks for this comp and a simple search with "accelerate" as a keyword does not yield anything.

@sgugger
Collaborator

sgugger commented Apr 20, 2021

Also one thing that might be of help: I just did some testing on a Colab (will share an example soon) and I get your error (Exception in device=TPU:0: Cannot replicate if number of devices (1) is different from 8) if I have an accelerator = Accelerator() outside of the training function.

Since this call initializes the distributed state, it can't be made in a notebook cell outside of run_training; it has to be inside it.

I will see if I can work on a fix for this to make it easier to use.
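In other words, roughly this (a sketch; run_training stands for the training function from the notebook above):

# Problematic: creating the Accelerator in a top-level notebook cell, before
# xmp.spawn, initializes the distributed state in the main process and leads to
# "Cannot replicate if number of devices (1) is different from 8".
# accelerator = Accelerator()

def run_training():
    # Correct: each process spawned by xmp.spawn creates its own Accelerator.
    accelerator = Accelerator()
    device = accelerator.device
    ...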

@romanoss

Thanks for the fast answer!
I tried to put the creation of the accelerator inside the training function, like:

def run_training():
    if USE_TPU_AND_ACCELERATE:
        accelerator = Accelerator()
        device = accelerator.device
        print(device)
        Config.DEVICE = device
    df = pd.read_csv(Config.TRAIN_CSV)
    .......

and get "Exception in device=TPU:5: name 'accelerator' is not defined" errors.

Did I misunderstand "Since this call initializes the distributed state, it can't be made in a notebook cell outside of run_training; it has to be inside it."?

@sgugger
Collaborator

sgugger commented Apr 20, 2021

It looks like the accelerator object is not defined in some other place in your notebook. Again, it would be much easier if you shared a link to the notebook in question.

@romanoss

shopee-pytorch-eca-nfnet-l0-image-training (1).zip

Sorry, I thought just referencing the last notebook would give enough information.

@sgugger
Collaborator

sgugger commented Apr 20, 2021

Ah, my mistake, I hadn't realized the notebook was in the zip file; I thought it was the data 🤦

You are not passing the accelerator to your train_fn and eval_fn; that's why you get the undefined error.
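Roughly, the fix is to thread the accelerator through explicitly (the signature below is illustrative, not the exact one in the notebook):

def train_fn(model, dataloader, optimizer, scheduler, epoch, accelerator):
    for data in dataloader:
        ...
        # use the accelerator that was passed in, not a global
        accelerator.backward(loss)

# inside run_training():
# avg_loss_train = train_fn(model, trainloader, optimizer, scheduler, i, accelerator)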

@romanoss

Thanks - this solved the problem - but the next experiment was very strange.
Training started but was very slow (on GPU ~30 min/epoch, the TPU showed 4h). RAM usage was very high and went over 16 GB at step 8, then it exited. CPU and TPU usage were at 0. I took a screenshot (CPU and RAM usage shown after restarting the VM).

[screenshot: accelerate]

@sgugger
Collaborator

sgugger commented Apr 20, 2021

Oh, that's because you have a call to loss.item() at every step. This should be avoided on TPUs (I'll add a mention of that in the accelerate docs) as it triggers an annoying sync.

You should accumulate loss.detach() into your running loss and only log it every, say, 100 steps (instead of every step), so that .item() is only called when you actually log.

Another thing to look at is to make sure you have fixed shapes (which you seem to have). In any case, the first step will always be very slow on TPU (as it compiles the whole graph for your training); it's after the first step that it gets faster. Hope that helps!
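A sketch of that logging pattern (variable names are illustrative):

running_loss = 0.0
for step, batch in enumerate(trainloader):
    ...
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()

    # Accumulate a detached tensor instead of calling .item() every step,
    # so the TPU is not forced to sync on each iteration.
    running_loss += loss.detach()

    if step % 100 == 0:
        # .item() (and the sync it implies) only happens when we log.
        print(f"step {step}: avg loss {running_loss.item() / (step + 1):.4f}")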

@romanoss

I know TPUs are slow in the beginning from my TF runs. I am new to PyTorch and will try to implement your tips. If it goes well, I will report back. Thanks again :)

@romanoss

It did not work out well. I commented out the loss.item() call (#fin_loss += loss.item()) and the results were:

- RAM as high as before - it crashed a bit later
- sometimes high CPU usage, but not often
- the time prediction for an epoch stayed about the same

@sgugger
Collaborator

sgugger commented Apr 26, 2021

This is fixed by #44: there is now a notebook_launcher that helps you run your training function in a Colab or Kaggle notebook. Will add some examples soon!
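Usage ends up looking roughly like this (training_function is whatever you define in the notebook; num_processes=8 assumes an 8-core TPU):

from accelerate import notebook_launcher

def training_function():
    from accelerate import Accelerator
    accelerator = Accelerator()
    ...  # build the model, optimizer and dataloaders, call accelerator.prepare, train

# Launches training_function on all TPU cores directly from the notebook.
notebook_launcher(training_function, args=(), num_processes=8)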

@sgugger sgugger closed this as completed Apr 26, 2021
@nbroad1881
Contributor

I tried to run the notebook example (simple_nlp_example.ipynb) in Kaggle and I got the following error when trying to import transformers modules:

ImportError: /opt/conda/lib/python3.7/site-packages/_XLAC.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZNK3c104Type14isSubtypeOfExtERKSt10shared_ptrIS0_EPSo

Please try running it here and see for yourself.

@sgugger
Collaborator

sgugger commented Jun 16, 2021

Sounds like an issue on the Kaggle kernels, as it works fine on Colab. Does the error persist if you restart and retry?

@nbroad1881
Contributor

> Sounds like an issue on the Kaggle kernels, as it works fine on Colab. Does the error persist if you restart and retry?

Yes, it still persists.

I found this discussion, which was similar but did not help me solve my problem: https://www.kaggle.com/discussion/201365
