CUDA ERROR an illegal memory access was encountered #114
Comments
How strange - I can reproduce. Must have been the last commit? I thought I checked it before push |
Yes, I tested c02bae2ebc684a2e068c0dc59be00ff43167b44d |
Actually now I'm not sure what caused it. I thought maybe it's our change from a few minutes ago to use cuBLAS tf32, but that can't be it? Btw I think it is non-deterministic. When I re-make / re-run it works, then it doesn't. My favorite kinds of issues with C. |
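For reference, the TF32 change mentioned above is typically a one-line setting on the cuBLAS handle; a minimal sketch (not necessarily the exact change in the repo) would be:

#include <cublas_v2.h>

// Sketch: allow cuBLAS to use TF32 tensor-op math on Ampere+ GPUs.
// This only relaxes matmul precision; by itself it should not cause
// illegal memory accesses, consistent with the doubt expressed above.
void enable_tf32(cublasHandle_t handle) {
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);
}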
It can be reproduced on my device every time :P, I am looking into it but I'm not an expert on CUDA. |
I am not sure if it is caused by a lack of VRAM when doing the inference during training, but I changed to a smaller batch size of 2, which only used 5 GB of VRAM, and the issue persisted.
If set
It is the same error when I tried earlier today. |
I got the same issue at the end. |
Also seeing this on NVIDIA GeForce RTX 2080 Ti. The exact error message is non-deterministic, but it always seems to fail after step 19. [cuBLAS ERROR]: 13 train_gpt2.cu 509 or [CUDA ERROR] at file train_gpt2.cu:1238: |
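For context, messages in that "[CUDA ERROR] at file ...:line:" format come from an error-checking wrapper around CUDA calls; a minimal sketch of such a wrapper (the one in train_gpt2.cu may differ in details) is:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Sketch: report the failing file/line and exit on any CUDA error.
void cuda_check(cudaError_t error, const char *file, int line) {
    if (error != cudaSuccess) {
        printf("[CUDA ERROR] at file %s:%d:\n%s\n", file, line, cudaGetErrorString(error));
        exit(EXIT_FAILURE);
    }
}
#define cudaCheck(err) (cuda_check((err), __FILE__, __LINE__))

Because kernel launches are asynchronous, the error is usually reported by whichever later checked call happens to run after the faulting kernel, which is one reason the reported line number (509 vs. 1238) moves around.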
using GTX 1050 on WSL:
|
There's definitely a bug. I'll have to stare at this a bit longer tomorrow and do a bisection to find the commit where it was introduced. It must have been pretty recent, probably somewhere today. |
Bisecting, it starts with 6b49ed1 for me. |
Alright, this is a weird issue. For this kind of error, one would normally use compute-sanitizer. Surprisingly, I do not see the error when I enable compute-sanitizer! The most likely explanation is that we have made poor choices about synchronization boundaries; recall that CUDA kernel launches are asynchronous. Anyway, I cannot debug further without a codebase expert (I've spent ~20 minutes in this codebase), so here is what I have found so far. You should be able to reproduce this debugging.
So, I had to explicitly force synchronization points to determine which kernel was causing this error (to find the culprit):
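For anyone reproducing this, a minimal sketch of forcing synchronization points (not necessarily the exact debugging code used here) is to synchronize and re-check errors right after each kernel launch:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Sketch: force each kernel to finish and surface its error at the launch site,
// instead of at some later, unrelated API call.
static void sync_check(const char *where) {
    cudaError_t err = cudaDeviceSynchronize();        // wait for the kernel to complete
    if (err == cudaSuccess) err = cudaGetLastError(); // also catch launch-time errors
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA failure after %s: %s\n", where, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// usage (illustrative; kernel names taken from the discussion in this thread):
//   encoder_forward_kernel2<<<grid, block>>>(...);
//   sync_check("encoder_forward_kernel2");

Setting the environment variable CUDA_LAUNCH_BLOCKING=1 achieves much the same thing without code changes; and as noted above, compute-sanitizer changes timing and synchronization enough that it can hide this kind of failure.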
Now I have no idea what is happening inside the kernel itself. How do I know? Use
An important observation here is where the failure occurs. Given this, there are three possible culprits in encoder_forward_kernel2.
I can force each of the lines to fixed index values and see which of these lines causes the error. Doing so, I found the error occurs when we read wte_ix to compute the output. The correct way this code should have been written requires a boundary check before the output write, and also when reading wte_ix. Also, this begs the question: if we are observing a perf benefit over PyTorch, then the two codes may not be functionally equivalent if the boundary condition checks are not implemented correctly. We may be getting correct results after all, but it might be a fluke. So I would be cautious about the perf claim against PyTorch at the current stage (request to add a disclaimer). |
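A sketch of encoder_forward_kernel2 with the kind of boundary checks described above; the indexing follows the usual [B, T, C] embedding lookup, and the extra vocab-size parameter V and the asserts are additions for illustration, not the repo's current code:

#include <assert.h>

// Sketch: encoder forward with explicit guards on both the output write and
// the token-id read used to index the wte table.
__global__ void encoder_forward_checked(float* out, const int* inp,
                                        const float* wte, const float* wpe,
                                        int B, int T, int C, int V) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int N = B * T * C;
    if (idx < N) {                  // boundary check before the output write
        int bt = idx / C;
        int b  = bt / T;
        int t  = bt % T;
        int c  = idx % C;
        int ix = inp[b * T + t];    // token id used to index into wte
        assert(ix >= 0 && ix < V);  // boundary check on the wte_ix read; an
                                    // out-of-range token id here would be exactly
                                    // the kind of illegal read described above
        out[b * T * C + t * C + c] = wte[(size_t)ix * C + c] + wpe[t * C + c];
    }
}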
@msharmavikram the positional encoder code has been there for a very long time. Looking at the diff, most of it should be harmless (e.g. the zero_grad and backward functions that I accidentally added are actually not called at all). The core issue, I think, is the addition of softmax_forward_kernel5 and how it interacts with us doing inference, when we truncate the time dimension in an effort to be faster. We only check kernels on a fixed B,T, and here I am meddling with T and changing it dynamically, and I think that's messing up the code. Not 100% sure how it goes wrong, potentially some memory corruption. For example when, in the kernel, we try to do:
I don't think this actually gets the row of |
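The snippet being referred to is elided above, but the general failure mode being described can be illustrated in isolation: if a row offset into an attention-style buffer is computed with a different T than the one the buffer was laid out with, it points at the wrong row or outside the allocation. A tiny host-side illustration (hypothetical sizes, not the repo's code and not a claim about the exact snippet above):

#include <stdio.h>

int main(void) {
    // hypothetical shapes: buffer laid out for T_alloc, forward pass run with T_run
    int NH = 12, T_alloc = 64, T_run = 4;
    int h = 11, t = 3;  // last head, some position
    size_t off_layout  = ((size_t)h * T_alloc + t) * T_alloc; // consistent with the layout
    size_t off_runtime = ((size_t)h * T_run   + t) * T_run;   // mixes in the runtime T
    (void)NH;
    printf("row offset using layout T:  %zu\n", off_layout);  // 45248
    printf("row offset using runtime T: %zu\n", off_runtime); // 188
    return 0;
}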
The dynamic resizing happens with this code:
// once in a while do model inference to print generated text
if (step > 0 && step % 20 == 0) {
gen_tokens[0] = GPT2_EOT; // the GPT-2 EOT token kicks off the generation
for (int t = 1; t < gen_max_length; t++) {
// note that inference is wasteful here because
// for each t, we re-compute all activations between 0 and t
// leaving this alone because you want separate code for inference anyway
// the inference here is just for sanity checking purposes
int t4 = (t + 3) & ~3; // clever way to round up to multiple of 4
gpt2_forward(&model, gen_tokens, NULL, 1, t4);
float* probs = model.acts.probs + (t-1) * model.config.vocab_size;
float coin = random_f32(&rng_state);
// move probs back to CPU and sample
cudaCheck(cudaMemcpy(cpu_probs, probs, model.config.vocab_size * sizeof(float), cudaMemcpyDeviceToHost));
int next_token = sample_mult(cpu_probs, model.config.vocab_size, coin);
gen_tokens[t] = next_token;
}
printf("generated: ");
for (int t = 0; t < gen_max_length; t++) {
printf("%d ", gen_tokens[t]);
}
printf("\n");
}
which only runs every 20 iterations, which is exactly where we see the crash. |
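For context, sample_mult above draws a token index from the probability vector using the uniform coin value; a typical inverse-CDF implementation (a sketch, the version in the repo may differ) looks like:

// Sketch: sample an index i with probability probabilities[i], given a
// uniform random number coin in [0, 1).
int sample_mult(float* probabilities, int n, float coin) {
    float cdf = 0.0f;
    for (int i = 0; i < n; i++) {
        cdf += probabilities[i];
        if (coin < cdf) {
            return i;
        }
    }
    return n - 1; // in case of rounding errors
}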
Interesting. Then I agree with you that the issue is somewhere in the dynamic resizing and how it interacts with the new softmax kernel. |
Could you test #122? |
I'm seeing both these errors. Either the error 13
on NVIDIA V100 or T10 (on Amazon AWS machines running Debian 11). |
I'm also seeing a similar error. Error message:
sys-info
It's also giving me some compilation warnings about the return value of an fread call not being used.
@zocterminal @g8392 Is that with #122? Or the master branch? |
@zocterminal just check out that branch (it's on my fork, so you might need to add this as a remote), recompile, and see if it works |
#122 seems to fix it for me @ngc92, though it also looks like each step takes longer, 62 -> 108 ms:
|
I'm a bit surprised that this would slow down anything. Can you check what happens if you compile with -DNDEBUG? |
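For context: -DNDEBUG defines NDEBUG, which compiles out assert() checks on the host and, for device-side asserts, inside CUDA kernels. So if the extra time comes from added assert-style bounds checks (an assumption here, not something confirmed in the thread), an NDEBUG build should recover most of the speed. A minimal illustration:

#include <assert.h>

// With -DNDEBUG the assert below compiles to nothing, so the kernel pays no
// checking cost; without it, an out-of-range index aborts with a clear message.
__global__ void checked_gather(float* out, const float* table, const int* idx, int n, int V) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int ix = idx[i];
        assert(ix >= 0 && ix < V);
        out[i] = table[ix];
    }
}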
Tentatively: I think this commit is the one that breaks it. Maybe someone can verify. Crashes:
Works:
|
@ngc92 now that I (maybe) narrowed it down, let me play with your fix for a bit and give a less tentative answer. BRB. |
I could reproduce this here rather consistently (see also #114 (comment)). I would say #122 fixes it. Initial download after cloning master:
Build master and run:
Patch #122 and run again:
Also, no noticeable speed changes. Good job @ngc92! |
The -DNDEBUG doesn't seem to affect it @ngc92, this is what happens when I do the same @zocterminal:
./train_gpt2cu
.....
step 0: train loss 4.367631 (took 59.288083 ms)
step 1: train loss 4.406341 (took 60.256468 ms)
step 2: train loss 4.484756 (took 57.820963 ms)
.....
[CUDA ERROR] at file train_gpt2.cu:1238:
an illegal memory access was encountered
patch train_gpt2.cu -p1 < 122.diff
patching file train_gpt2.cu
make train_gpt2cu ; ./train_gpt2cu
NICE Compiling with OpenMP support
nvcc -O3 --use_fast_math train_gpt2.cu -lcublas -lcublasLt -o train_gpt2cu
train_gpt2.cu(640): warning #1650-D: result of call is not used
fread(model_header, sizeof(int), 256, model_file);
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
train_gpt2.cu(689): warning #1650-D: result of call is not used
fread(params_memory_cpu, sizeof(float), num_parameters, model_file);
........
train_gpt2.cu: In function 'void dataloader_next_batch(DataLoader*)'
train_gpt2.cu:937:6: warning: ignoring return value of 'size_t fread(void*, size_t, size_t, FILE*) declared with attribute 'warn_unused_result' [-Wunused-result]
937 | fread(loader->batch, sizeof(int), B*T+1, loader->tokens_file);
| ~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[GPT-2]
.....
step 0: train loss 4.367857 (took 109.131925 ms)
step 1: train loss 4.406483 (took 108.166702 ms)
step 2: train loss 4.484839 (took 108.053300 ms)
step 3: train loss 4.345326 (took 108.359766 ms)
step 4: train loss 4.043288 (took 109.111375 ms)
step 5: train loss 4.229303 (took 108.113061 ms) |
I haven't been able to pinpoint the source of the difference, but after setting everything up again and using train_gpt2.cu from #122, everything seems to work and the speed is the same too.
|
…ight now for safety. We can later bring back the <= B, <= T forward pass, but we need to do it carefully and have tests that make sure that a strictly smaller configuration produces the exact same results for that chunk of b,t. In other words, we'd want to make sure that the entire forward pass is range invariant. Currently it is not, but only because of the attention kernel. I think this is fixable and not too difficult, but it requires careful thought and associated tests ensuring the range invariance. When those tests pass, we can bring back the old, more efficient behavior. For now it's just causing bugs, so I am putting in this highly defensive, but a lot more likely correct, code.
commit
c02bae2ebc684a2e068c0dc59be00ff43167b44d
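A sketch of the kind of range-invariance test the commit message describes, reusing gpt2_forward, model->acts.probs, model->config.vocab_size, and cudaCheck as they appear earlier in this thread; the [b, t, vocab] layout of probs and the exact-equality check are assumptions taken from that code and the commit message, not the repo's actual test:

#include <assert.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Sketch: run the forward pass at (B, T) and again at (B, t_small < T), and
// require that the probabilities for the shared prefix positions match exactly.
void test_range_invariance(GPT2* model, int* tokens, int B, int T, int t_small) {
    size_t V = model->config.vocab_size;

    gpt2_forward(model, tokens, NULL, B, T);
    float* probs_full = (float*)malloc(B * T * V * sizeof(float));
    cudaCheck(cudaMemcpy(probs_full, model->acts.probs,
                         B * T * V * sizeof(float), cudaMemcpyDeviceToHost));

    gpt2_forward(model, tokens, NULL, B, t_small);
    float* probs_small = (float*)malloc(B * t_small * V * sizeof(float));
    cudaCheck(cudaMemcpy(probs_small, model->acts.probs,
                         B * t_small * V * sizeof(float), cudaMemcpyDeviceToHost));

    for (int b = 0; b < B; b++) {
        for (int t = 0; t < t_small; t++) {
            const float* full  = probs_full  + ((size_t)b * T + t) * V;
            const float* small = probs_small + ((size_t)b * t_small + t) * V;
            for (size_t v = 0; v < V; v++) {
                assert(full[v] == small[v]); // range invariance: exact match on the prefix
            }
        }
    }
    free(probs_full);
    free(probs_small);
}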
I got this error when running train_gpt2cu:
nvidia-smi
nvcc
The file train_gpt2.cu:1211 indicates: