
fix #114 #122

Merged: 1 commit merged into karpathy:master from fix-illegal-access on Apr 14, 2024
Conversation

@ngc92 (Contributor) commented Apr 14, 2024

In order to ensure that the sequence length is divisible by four, we pad the token buffer. We need to make sure that the padded tokens actually have a valid token id, so that they don't cause any out-of-bounds accesses.
Also added some asserts to the forward call, so that we can hopefully avoid similar problems in the future (or at least trace them much more easily).
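For illustration, a minimal C sketch of the idea rather than the exact diff; pad_tokens_to_multiple_of_4 is a hypothetical helper name, and the assert stands in for the checks added to the forward call:

    #include <assert.h>

    #define GPT2_EOT 50256  // GPT-2 end-of-text id; a valid token, since V = 50257

    // Round the sequence length up to a multiple of 4 and fill the padded
    // tail with GPT2_EOT, so every slot holds a valid token id and no
    // embedding lookup can read out of bounds.
    static int pad_tokens_to_multiple_of_4(int* tokens, int t, int capacity) {
        int t_padded = (t + 3) & ~3;   // next multiple of 4
        assert(t_padded <= capacity);  // caller must provide enough space
        for (int i = t; i < t_padded; i++) {
            tokens[i] = GPT2_EOT;
        }
        return t_padded;
    }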

@msharmavikram (Contributor)

I have yet to check this fix, but it will be slow in terms of perf. I wonder if this check can be added in the CUDA kernel instead, assuming artificial padding.

@ngc92 (Contributor, Author) commented Apr 14, 2024

Why would it be slow? It doesn't change anything inside the training loop, only the autoregressive test generation, which is atrocious in terms of performance anyway because there is no caching of previous tokens.

So the only thing that changes is the additional asserts, and those go away if you compile with -DNDEBUG (which you should if you care about performance).
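For reference, assert is the standard macro from <assert.h>, so a release build with -DNDEBUG removes these checks entirely. A tiny standalone example:

    #include <assert.h>
    #include <stdio.h>

    // Built with `cc -O3 -DNDEBUG check.c`, the assert below is a no-op;
    // built without -DNDEBUG, an invalid token id aborts with a message
    // instead of causing a silent out-of-bounds read.
    int main(void) {
        int V = 50257;      // GPT-2 vocabulary size
        int token = 50256;  // GPT2_EOT, a valid id
        assert(0 <= token && token < V);
        printf("token %d is in range\n", token);
        return 0;
    }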

@ngc92 (Contributor, Author) commented Apr 14, 2024

I agree that we might want to place the check inside the kernel if this turns out to have a significant performance impact on debug builds.
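A rough sketch of what such an in-forward check could look like, modeled on the CPU encoder forward in train_gpt2.c; passing V explicitly is an assumption made for this sketch, not the existing signature:

    #include <assert.h>

    void encoder_forward_checked(float* out, const int* inp, const float* wte,
                                 const float* wpe, int B, int T, int C, int V) {
        for (int b = 0; b < B; b++) {
            for (int t = 0; t < T; t++) {
                int ix = inp[b * T + t];
                assert(0 <= ix && ix < V);  // catch bad ids before indexing wte
                float* out_bt = out + b * T * C + t * C;
                const float* wte_ix = wte + ix * C;
                const float* wpe_t = wpe + t * C;
                for (int c = 0; c < C; c++) {
                    out_bt[c] = wte_ix[c] + wpe_t[c];
                }
            }
        }
    }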

@msharmavikram (Contributor) commented Apr 14, 2024

I think we can move this out of the for loop. The padding doesn't need to be done for each step; it only needs to be done once (didn't check, just looked at the code for dependencies).

@tnorman42

This fix works for me. The issue is reproducible for me when any of the entries of gen_tokens are >= V (50257) because they were never initialized. Moving the initialization for loop to just after the gen_tokens declaration (so it only needs to run once) also seems to solve the issue.
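A standalone sketch of that variant; the names follow train_gpt2.c, and the buffer size of 64 is an assumption for this snippet:

    #include <stdio.h>

    #define GPT2_EOT 50256

    int main(void) {
        const int gen_max_length = 64;  // assumed generation buffer size
        int gen_tokens[gen_max_length];
        // Initialize once, right after the declaration, so every slot holds
        // a valid token id before the first generation step.
        for (int i = 0; i < gen_max_length; i++) {
            gen_tokens[i] = GPT2_EOT;
        }
        printf("gen_tokens[0] = %d\n", gen_tokens[0]);  // 50256
        return 0;
    }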

@zocterminal (Contributor)

Works for me.

A quicker version would probably be

int gen_tokens[gen_max_length] = { 0, };

@ngc92 (Contributor, Author) commented Apr 14, 2024

> Works for me.
>
> A quicker version would probably be
>
> int gen_tokens[gen_max_length] = { 0, };

I would argue strongly against that. Now I need to look up on cppreference whether this syntax actually initializes the entire array or just the first element, and I need to check whether 0 == GPT2_EOT, since the starting token is no longer obvious from the code.

Finally, I don't think this is an operation that needs to be optimized anyway. We're currently running quadratic-time inference for the LLM here, reprocessing all past tokens over and over again. Even if we fixed that, this is an extremely small operation (we're writing 64 ints to memory, nothing more!).

@zocterminal (Contributor)

It will initialize the whole array with zeros. The 0'th element will still be set to GPT2_EOT as in the current code at line 1227 (GPT2_EOT is 50256). It just avoids the first call receiving an array with incorrect values (after that it will be filled with values from the previous step's run).

A slightly more intuitive version would probably be:

int gen_tokens[gen_max_length] = { GPT2_EOT, 0, };

But this is just academic, because yes, I know that the loop you added is trivial in terms of time (based on how often that part is actually executed and the fact that it does a ton of printf). It would just be slightly cleaner code IMO.

@ngc92 (Contributor, Author) commented Apr 14, 2024

Yes, it initializes with zeros, but to me this code always reads highly unintuitively. Why would you mix explicit and implicit initialization when the implicit values are exactly the same as the explicit ones?
Why not write

int gen_tokens[gen_max_length] = { GPT2_EOT, };

so that all zeros are implicit?
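For context, the C rule at play: elements not named in the initializer list are implicitly zero-initialized, so the tail is filled with zeros either way. A small compilable check, with gen_max_length assumed to be a compile-time constant here:

    #include <assert.h>

    #define GPT2_EOT 50256

    int main(void) {
        enum { gen_max_length = 64 };
        // The first element is set explicitly; all remaining elements are
        // implicitly zero-initialized by the aggregate initializer.
        int gen_tokens[gen_max_length] = { GPT2_EOT, };
        assert(gen_tokens[0] == GPT2_EOT);
        assert(gen_tokens[1] == 0);
        assert(gen_tokens[gen_max_length - 1] == 0);
        return 0;
    }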

@zocterminal (Contributor)

Yes, agree.

@karpathy merged commit 312b043 into karpathy:master on Apr 14, 2024
zocterminal referenced this pull request Apr 15, 2024
…ight now for safety. We can later bring back the <= B, <= T forward pass, but we need to do it carefully and have tests that make sure that a strictly smaller configuration produces the exact same results for that chunk of b,t. In other words, we'd want to make sure that the entire forward pass is range invariant. Currently it is not, only because of the attention kernel. I think this is fixable, not too difficult, but it requires careful thought and associated tests for ensuring the range invariance. When those tests pass we can bring back the old and more efficient behavior. For now it's just causing bugs, so I am putting in this highly defensive, but a lot more likely correct, code.
@ngc92 deleted the fix-illegal-access branch on April 28, 2024 at 08:44