
Accelerator not recognizing TPU in Google Colab and Kaggle Kernels #29

Closed
tanaymeh opened this issue Apr 18, 2021 · 18 comments

Comments

@tanaymeh

I installed and imported accelerate in both Kaggle Kernels and Google Colab with TPU turned on but it doesn't seem to detect the TPU and instead detects CPU when running the following code:

!pip install -q accelerate

import accelerate

acc = accelerate.Accelerator()
device = acc.device
print(device)

The above snippet just outputs cpu when run on both of the aforementioned platforms with a TPU enabled.

Is there something that I am doing wrong?

PyTorch version: 1.7.0
Python version: 3.7.9

@sgugger
Collaborator

sgugger commented Apr 19, 2021

Yes, the TPU devices are only recognized once you launch your training function with the xmp.spawn method (as you would usually do with PyTorch/XLA).
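For example, a minimal sketch of that pattern (training_function and its contents are illustrative, not taken from the notebook below):

import torch_xla.distributed.xla_multiprocessing as xmp
from accelerate import Accelerator

def training_function():
    # Created inside the spawned process, the Accelerator picks up
    # the TPU core assigned to that process.
    accelerator = Accelerator()
    print(accelerator.device)  # should print an xla device, e.g. xla:0

def _mp_fn(rank, flags):
    training_function()

FLAGS = {}
# 'fork' is the start method commonly used in Colab/Kaggle TPU notebooks.
xmp.spawn(_mp_fn, args=(FLAGS,), nprocs=8, start_method='fork')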

@romanoss

romanoss commented Apr 19, 2021

Same problem here. I changed my launch code to:

def _mp_fn(rank, flags):
    global acc_list
    torch.set_default_tensor_type('torch.FloatTensor')
    res = run_training()

FLAGS = {}
xmp.spawn(_mp_fn, args=(FLAGS,), nprocs=8, start_method='fork')

and in run_training() I do:

trainloader = torch.utils.data.DataLoader(...)  # no parallel loader
optimizer = Ranger(model.parameters(), lr=Config.SCHEDULER_PARAMS['lr_start'])
scheduler = ShopeeScheduler(optimizer, **Config.SCHEDULER_PARAMS)
model, optimizer, trainloader = accelerator.prepare(model, optimizer, trainloader)

for i in range(Config.EPOCHS):
    avg_loss_train = train_fn(model, trainloader, optimizer, scheduler, i)

and I get:

Exception in device=TPU:0: Cannot replicate if number of devices (1) is different from 8
Exception in device=TPU:1: Cannot replicate if number of devices (1) is different from 8
...

for all TPU devices.

The value of acc.device is xla:1.

What did I do wrong?
Best Roman

@sgugger
Collaborator

sgugger commented Apr 19, 2021

Thanks for reporting. I haven't fully tested on Colab (the primary use case is running scripts), so I will need some time to set up a way of reproducing. In the meantime, could you share your notebook so I can have a full reproducer?

@romanoss

romanoss commented Apr 19, 2021

This is a Kaggle notebook from the current Shopee competition - the easiest way is probably to join the competition and upload it:
shopee-pytorch-eca-nfnet-l0-image-training.zip

I think the only things you have to change are RETRAIN_MODEL = '' and switching USE_TPU_AND_ACCELERATE.

One small bug: in train_fn(), change
data[k] = v.to(Config.DEVICE) to data[k] = v  # in case of USE_TPU_AND_ACCELERATE=True

@sgugger
Collaborator

sgugger commented Apr 20, 2021

Could you share which notebook it is? There are a lot of notebooks for this comp and a simple search with "accelerate" as a keyword does not yield anything.

@sgugger
Collaborator

sgugger commented Apr 20, 2021

Also one thing that might be of help: I just did some testing on a Colab (will share an example soon) and I get your error (Exception in device=TPU:0: Cannot replicate if number of devices (1) is different from 8) if I have an accelerator = Accelerator() outside of the training function.

Since this call initializes the distributed state, it can't be made in a notebook cell outside of run_training; it has to be inside it.

I will see if I can work on a fix for this to make it easier to use.
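In other words, roughly this (a sketch; run_training stands for the training function from the notebook above):

# Problematic: creating the Accelerator in a top-level notebook cell, before
# xmp.spawn, initializes the distributed state in the main process and leads to
# "Cannot replicate if number of devices (1) is different from 8".
# accelerator = Accelerator()

def run_training():
    # Correct: each process spawned by xmp.spawn creates its own Accelerator.
    accelerator = Accelerator()
    device = accelerator.device
    ...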

@romanoss

Thanks for the fast answer!
I tried to put the creation of the accelerator inside the training function, like:

def run_training():
    if USE_TPU_AND_ACCELERATE:
        accelerator = Accelerator()
        device = accelerator.device
        print(device)
        Config.DEVICE = device
    df = pd.read_csv(Config.TRAIN_CSV)
    .......

and get "Exception in device=TPU:5: name 'accelerator' is not defined" errors.

Did I misunderstand "Since this call initializes the distributed state, it can't be made in a notebook cell outside of run_training; it has to be inside it."?

@sgugger
Collaborator

sgugger commented Apr 20, 2021

It looks like the accelerator object is not defined in some other place in your notebook. Again, it would be much easier if you shared a link to the notebook in question.

@romanoss

shopee-pytorch-eca-nfnet-l0-image-training (1).zip

Sorry, I thought just referencing the last notebook would give enough information.

@sgugger
Collaborator

sgugger commented Apr 20, 2021

Ah, my mistake, I hadn't realized the notebook was in the zip file; I thought it was the data 🤦

You are not passing the accelerator to your train_fn and eval_fn; that's why you get the undefined error.
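Roughly, the fix is to thread the accelerator through explicitly (the signature below is illustrative, not the exact one in the notebook):

def train_fn(model, dataloader, optimizer, scheduler, epoch, accelerator):
    for data in dataloader:
        ...
        # use the accelerator that was passed in, not a global
        accelerator.backward(loss)

# inside run_training():
# avg_loss_train = train_fn(model, trainloader, optimizer, scheduler, i, accelerator)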

@romanoss

Thanks - this solved the problem - but the next experiment was very strange.
Training started but was very slow (on GPU ~30 min/epoch, the TPU showed 4h). RAM usage was very high and went over 16 GB at step 8, then it exited. CPU and TPU usage were at 0. I took a screenshot (CPU and RAM usage shown after restarting the VM).

[screenshot: accelerate]

@sgugger
Collaborator

sgugger commented Apr 20, 2021

Oh, that's because you have a call to loss.item() at every step. This should be avoided on TPUs (I'll add a mention of that in the accelerate docs) as it triggers an annoying sync.

You should accumulate loss.detach() into your running loss and only log it every, say, 100 steps (instead of every step), so that .item() is only called when you actually log.

Another thing to look at is to make sure you have fixed shapes (which you seem to have). In any case, the first step will always be very slow on TPU (as it compiles the whole graph for your training); it's after the first step that it gets faster. Hope that helps!
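A sketch of that logging pattern (variable names are illustrative):

running_loss = 0.0
for step, batch in enumerate(trainloader):
    ...
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()

    # Accumulate a detached tensor instead of calling .item() every step,
    # so the TPU is not forced to sync on each iteration.
    running_loss += loss.detach()

    if step % 100 == 0:
        # .item() (and the sync it implies) only happens when we log.
        print(f"step {step}: avg loss {running_loss.item() / (step + 1):.4f}")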

@romanoss

I know TPUs are slow in the beginning from my TF runs. I am new to PyTorch and will try to implement your tips. If it goes well, I will report back. Thanks again :)

@romanoss

It did not work out well. I commented out the loss.item() call (#fin_loss += loss.item()) and the results were:

- RAM as high as before - it crashed a bit later
- sometimes high CPU usage, but not often
- the time prediction for an epoch stayed about the same

@sgugger
Collaborator

sgugger commented Apr 26, 2021

This is fixed by #44: there is now a notebook_launcher that helps you run your training function in a Colab or Kaggle notebook. Will add some examples soon!
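Usage ends up looking roughly like this (training_function is whatever you define in the notebook; num_processes=8 assumes an 8-core TPU):

from accelerate import notebook_launcher

def training_function():
    from accelerate import Accelerator
    accelerator = Accelerator()
    ...  # build the model, optimizer and dataloaders, call accelerator.prepare, train

# Launches training_function on all TPU cores directly from the notebook.
notebook_launcher(training_function, args=(), num_processes=8)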

@sgugger sgugger closed this as completed Apr 26, 2021
@nbroad1881
Contributor

I tried to run the notebook example (simple_nlp_example.ipynb) in Kaggle and I got the following error when trying to import transformers modules:

ImportError: /opt/conda/lib/python3.7/site-packages/_XLAC.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZNK3c104Type14isSubtypeOfExtERKSt10shared_ptrIS0_EPSo

Please try running it here and see for yourself.

@sgugger
Collaborator

sgugger commented Jun 16, 2021

Sounds like an issue on the Kaggle kernels, as it works fine on Colab. Does the error persist if you restart and retry?

@nbroad1881
Contributor

> Sounds like an issue on the Kaggle kernels, as it works fine on Colab. Does the error persist if you restart and retry?

Yes, it still persists.

I found this discussion, which was similar but did not help me solve my problem: https://www.kaggle.com/discussion/201365
