Illegal memory access (cudaErrorIllegalAddress) #5002

Closed

prajjwal1 opened this issue Jun 15, 2020 · 4 comments

Labels: dependencies, wontfix

Comments

@prajjwal1 (Contributor) commented Jun 15, 2020

🐛 Bug

Information

This bug has also been discussed on PyTorch/Apex and here previously (the bot marked that thread as stale).

I'm using Albert on GLUE (although this issue is model/dataset agnostic).
I've made slight modifications to my training loop (compared to train() in Trainer()).
The main one, which throws this error, is where I compute the gradients:

grad = torch.autograd.grad(loss, model.parameters(), allow_unused=True)

where loss is simply model(**inputs)[0]
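
For context, here is roughly what the failing step looks like, as a minimal sketch. The checkpoint and the dummy batch below are assumptions for illustration, not my actual GLUE training script:

```python
import torch
from transformers import AlbertForSequenceClassification, AlbertTokenizer

# Assumed checkpoint and dummy batch; not the exact GLUE script.
device = torch.device("cuda:0")
tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertForSequenceClassification.from_pretrained("albert-base-v2").to(device)

batch = tokenizer.batch_encode_plus(
    ["a short sentence", "another short sentence"],
    pad_to_max_length=True,
    return_tensors="pt",
)
inputs = {k: v.to(device) for k, v in batch.items()}
inputs["labels"] = torch.tensor([0, 1], device=device)

loss = model(**inputs)[0]  # first output is the loss when labels are passed
# Manual gradient computation instead of loss.backward(); this is the call
# that raises cudaErrorIllegalAddress in my setup.
grads = torch.autograd.grad(loss, model.parameters(), allow_unused=True)
```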

I'm using PyTorch 1.5.0+cu101 and transformers 2.11 on a single GPU (no multi-GPU), although the instance has two; I restrict it to one with CUDA_VISIBLE_DEVICES=0. I also tried torch.cuda.set_device().

Can you suggest a workaround?

@prajjwal1 (Contributor, Author) commented:

Reducing the batch size further makes this error go away, but a lot of memory is left free. If memory demand were the issue, an out-of-memory error should be raised instead.
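
As a diagnostic (not something from my original script), logging the CUDA allocator state around the failing call would show whether the crash coincides with the allocator running near the device limit:

```python
import torch

def log_cuda_memory(tag: str, device: int = 0) -> None:
    # Report current, reserved, and peak allocations in MiB.
    allocated = torch.cuda.memory_allocated(device) / 2 ** 20
    reserved = torch.cuda.memory_reserved(device) / 2 ** 20
    peak = torch.cuda.max_memory_allocated(device) / 2 ** 20
    print(f"[{tag}] allocated={allocated:.0f} MiB "
          f"reserved={reserved:.0f} MiB peak={peak:.0f} MiB")

# Usage around the failing call:
# log_cuda_memory("before grad")
# grads = torch.autograd.grad(loss, model.parameters(), allow_unused=True)
# log_cuda_memory("after grad")
```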

@sshleifer (Contributor) commented:

I've seen this error, and I think it happens right before an out-of-memory error.
I agree the traceback should be different.
Marking this wontfix for now since, as you suggest, it is a torch/apex issue, not a transformers issue.

sshleifer added the dependencies and wontfix labels on Jun 15, 2020
The stale bot removed the wontfix label on Jun 15, 2020
@prajjwal1 (Contributor, Author) commented Jun 16, 2020

I don't think it's an Apex issue either, because I ran my code without fp16 integration earlier; it's most likely a PyTorch issue. I'm not sure how memory usage could spike so quickly: initially about 10 GB of memory is free, and then this error suddenly pops up. Halving the batch size helped, but there are no signs of a memory leak. Not really sure what's happening.
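
If this really is memory pressure, one possible workaround sketch (the function, variable names, and the placeholder manual update step below are assumptions, not my actual code) is to halve the per-step batch size, as above, and accumulate the manually computed gradients over two micro-batches so the effective batch size stays the same:

```python
import torch

def train_with_accumulation(model, dataloader, device, lr=1e-5, accumulation_steps=2):
    # Accumulate manually computed gradients over `accumulation_steps`
    # micro-batches, so a halved per-step batch keeps the same effective size.
    accumulated = [torch.zeros_like(p) for p in model.parameters()]
    for step, batch in enumerate(dataloader):
        inputs = {k: v.to(device) for k, v in batch.items()}
        loss = model(**inputs)[0] / accumulation_steps
        grads = torch.autograd.grad(loss, model.parameters(), allow_unused=True)
        for acc, g in zip(accumulated, grads):
            if g is not None:
                acc += g  # sum micro-batch gradients in place
        if (step + 1) % accumulation_steps == 0:
            with torch.no_grad():
                # Placeholder manual SGD update; a real optimizer step would go here.
                for p, acc in zip(model.parameters(), accumulated):
                    p -= lr * acc
            for acc in accumulated:
                acc.zero_()
```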

@stale (bot) commented Aug 15, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The stale bot added the wontfix label on Aug 15, 2020
The stale bot closed this as completed on Aug 22, 2020