Illegal memory access (cudaErrorIllegalAddress) #5002

Closed

prajjwal1 opened this issue Jun 15, 2020 · 4 comments

Labels: dependencies, wontfix

Comments

@prajjwal1 (Contributor) commented Jun 15, 2020

🐛 Bug

Information

This bug has also been discussed on PyTorch/Apex and here previously (the bot marked that thread as stale).

I'm using Albert on GLUE (although this issue is model/dataset agnostic).
I've made slight modifications to my training loop (compared to train() in Trainer()).
The main one, which throws this error, is where I compute the gradients:

grad = torch.autograd.grad(loss, model.parameters(), allow_unused=True)

where loss is simply model(**inputs)[0]
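
For context, here is roughly what the failing step looks like, as a minimal sketch. The checkpoint and the dummy batch below are assumptions for illustration, not my actual GLUE training script:

```python
import torch
from transformers import AlbertForSequenceClassification, AlbertTokenizer

# Assumed checkpoint and dummy batch; not the exact GLUE script.
device = torch.device("cuda:0")
tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertForSequenceClassification.from_pretrained("albert-base-v2").to(device)

batch = tokenizer.batch_encode_plus(
    ["a short sentence", "another short sentence"],
    pad_to_max_length=True,
    return_tensors="pt",
)
inputs = {k: v.to(device) for k, v in batch.items()}
inputs["labels"] = torch.tensor([0, 1], device=device)

loss = model(**inputs)[0]  # first output is the loss when labels are passed
# Manual gradient computation instead of loss.backward(); this is the call
# that raises cudaErrorIllegalAddress in my setup.
grads = torch.autograd.grad(loss, model.parameters(), allow_unused=True)
```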

I'm using PyTorch 1.5.0+cu101 and transformers 2.11 on a single GPU (no multi-GPU), although the instance has two; I restrict it to one with CUDA_VISIBLE_DEVICES=0. I also tried torch.cuda.set_device().

Can you suggest a workaround?

@prajjwal1 (Contributor, Author) commented:

Reducing the batch size further makes this error go away, but a lot of memory is left free. If memory demand were the issue, an out-of-memory error should be raised instead.
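
As a diagnostic (not something from my original script), logging the CUDA allocator state around the failing call would show whether the crash coincides with the allocator running near the device limit:

```python
import torch

def log_cuda_memory(tag: str, device: int = 0) -> None:
    # Report current, reserved, and peak allocations in MiB.
    allocated = torch.cuda.memory_allocated(device) / 2 ** 20
    reserved = torch.cuda.memory_reserved(device) / 2 ** 20
    peak = torch.cuda.max_memory_allocated(device) / 2 ** 20
    print(f"[{tag}] allocated={allocated:.0f} MiB "
          f"reserved={reserved:.0f} MiB peak={peak:.0f} MiB")

# Usage around the failing call:
# log_cuda_memory("before grad")
# grads = torch.autograd.grad(loss, model.parameters(), allow_unused=True)
# log_cuda_memory("after grad")
```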

@sshleifer (Contributor) commented:

I've seen this error, and I think it happens right before an out-of-memory error.
I agree the traceback should be different.
Marking this wontfix for now since, as you suggest, it is a torch/apex issue, not a transformers issue.

sshleifer added the dependencies and wontfix labels on Jun 15, 2020
The stale bot removed the wontfix label on Jun 15, 2020
@prajjwal1 (Contributor, Author) commented Jun 16, 2020

I don't think it's an Apex issue either, because I ran my code without fp16 integration earlier; it's most likely a PyTorch issue. I'm not sure how memory usage could spike so quickly: initially about 10 GB of memory is free, and then this error suddenly pops up. Halving the batch size helped, but there are no signs of a memory leak. Not really sure what's happening.
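
If this really is memory pressure, one possible workaround sketch (the function, variable names, and the placeholder manual update step below are assumptions, not my actual code) is to halve the per-step batch size, as above, and accumulate the manually computed gradients over two micro-batches so the effective batch size stays the same:

```python
import torch

def train_with_accumulation(model, dataloader, device, lr=1e-5, accumulation_steps=2):
    # Accumulate manually computed gradients over `accumulation_steps`
    # micro-batches, so a halved per-step batch keeps the same effective size.
    accumulated = [torch.zeros_like(p) for p in model.parameters()]
    for step, batch in enumerate(dataloader):
        inputs = {k: v.to(device) for k, v in batch.items()}
        loss = model(**inputs)[0] / accumulation_steps
        grads = torch.autograd.grad(loss, model.parameters(), allow_unused=True)
        for acc, g in zip(accumulated, grads):
            if g is not None:
                acc += g  # sum micro-batch gradients in place
        if (step + 1) % accumulation_steps == 0:
            with torch.no_grad():
                # Placeholder manual SGD update; a real optimizer step would go here.
                for p, acc in zip(model.parameters(), accumulated):
                    p -= lr * acc
            for acc in accumulated:
                acc.zero_()
```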

@stale (bot) commented Aug 15, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The stale bot added the wontfix label on Aug 15, 2020
The stale bot closed this as completed on Aug 22, 2020