
fix #114 #122

Merged: 1 commit merged into karpathy:master from fix-illegal-access on Apr 14, 2024
Conversation

@ngc92 (Contributor) commented Apr 14, 2024

In order to ensure that the sequence length is divisible by four, we pad the token buffer. We need to make sure that the padded tokens actually have a valid token id, so that they don't cause any out-of-bounds accesses.
Also added some asserts to the forward call, so that we can hopefully avoid similar problems in the future (or at least trace them much more easily).
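For illustration, a minimal C sketch of the idea rather than the exact diff; pad_tokens_to_multiple_of_4 is a hypothetical helper name, and the assert stands in for the checks added to the forward call:

    #include <assert.h>

    #define GPT2_EOT 50256  // GPT-2 end-of-text id; a valid token, since V = 50257

    // Round the sequence length up to a multiple of 4 and fill the padded
    // tail with GPT2_EOT, so every slot holds a valid token id and no
    // embedding lookup can read out of bounds.
    static int pad_tokens_to_multiple_of_4(int* tokens, int t, int capacity) {
        int t_padded = (t + 3) & ~3;   // next multiple of 4
        assert(t_padded <= capacity);  // caller must provide enough space
        for (int i = t; i < t_padded; i++) {
            tokens[i] = GPT2_EOT;
        }
        return t_padded;
    }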

@msharmavikram (Contributor)

I have yet to check this fix, but it will be slow in terms of perf. I wonder if this check can be added in the CUDA kernel instead, assuming artificial padding.

@ngc92 (Contributor, Author) commented Apr 14, 2024

Why would it be slow? It doesn't change anything inside the training loop, only the autoregressive test generation, which is atrocious in terms of performance anyway because there is no caching of previous tokens.

So the only thing that changes is the additional asserts, and those go away if you compile with -DNDEBUG (which you should if you care about performance).
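For reference, assert is the standard macro from <assert.h>, so a release build with -DNDEBUG removes these checks entirely. A tiny standalone example:

    #include <assert.h>
    #include <stdio.h>

    // Built with `cc -O3 -DNDEBUG check.c`, the assert below is a no-op;
    // built without -DNDEBUG, an invalid token id aborts with a message
    // instead of causing a silent out-of-bounds read.
    int main(void) {
        int V = 50257;      // GPT-2 vocabulary size
        int token = 50256;  // GPT2_EOT, a valid id
        assert(0 <= token && token < V);
        printf("token %d is in range\n", token);
        return 0;
    }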

@ngc92 (Contributor, Author) commented Apr 14, 2024

I agree that we might want to place the check inside the kernel if this turns out to have a significant performance impact on debug builds.
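A rough sketch of what such an in-forward check could look like, modeled on the CPU encoder forward in train_gpt2.c; passing V explicitly is an assumption made for this sketch, not the existing signature:

    #include <assert.h>

    void encoder_forward_checked(float* out, const int* inp, const float* wte,
                                 const float* wpe, int B, int T, int C, int V) {
        for (int b = 0; b < B; b++) {
            for (int t = 0; t < T; t++) {
                int ix = inp[b * T + t];
                assert(0 <= ix && ix < V);  // catch bad ids before indexing wte
                float* out_bt = out + b * T * C + t * C;
                const float* wte_ix = wte + ix * C;
                const float* wpe_t = wpe + t * C;
                for (int c = 0; c < C; c++) {
                    out_bt[c] = wte_ix[c] + wpe_t[c];
                }
            }
        }
    }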

@msharmavikram (Contributor) commented Apr 14, 2024

I think we can move this out of the for loop. The padding doesn't need to be done for each step; it only needs to be done once (didn't check, just looked at the code for dependencies).

@tnorman42

This fix works for me. The issue is reproducible for me when any of the entries of gen_tokens are >= V (50257) because they were never initialized. Moving the initialization for loop to just after the gen_tokens declaration (so it only needs to run once) also seems to solve the issue.
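A standalone sketch of that variant; the names follow train_gpt2.c, and the buffer size of 64 is an assumption for this snippet:

    #include <stdio.h>

    #define GPT2_EOT 50256

    int main(void) {
        const int gen_max_length = 64;  // assumed generation buffer size
        int gen_tokens[gen_max_length];
        // Initialize once, right after the declaration, so every slot holds
        // a valid token id before the first generation step.
        for (int i = 0; i < gen_max_length; i++) {
            gen_tokens[i] = GPT2_EOT;
        }
        printf("gen_tokens[0] = %d\n", gen_tokens[0]);  // 50256
        return 0;
    }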

@zocterminal (Contributor)

Works for me.

A quicker version would probably be

int gen_tokens[gen_max_length] = { 0, };

@ngc92 (Contributor, Author) commented Apr 14, 2024

> Works for me.
>
> A quicker version would probably be
>
> int gen_tokens[gen_max_length] = { 0, };

I would argue strongly against that. Now I need to look up on cppreference whether this syntax actually initializes the entire array or just the first element, and I need to check whether 0 == GPT2_EOT, since the starting token is no longer obvious from the code.

Finally, I don't think this is an operation that needs to be optimized anyway. We're currently running quadratic-time inference for the LLM here, reprocessing all past tokens over and over again. Even if we fixed that, this is an extremely small operation (we're writing 64 ints to memory, nothing more!).

@zocterminal (Contributor)

It will initialize the whole array with zeros. The 0'th element will still be set to GPT2_EOT as in the current code at line 1227 (GPT2_EOT is 50256). It just avoids the first call receiving an array with incorrect values (after that it will be filled with values from the previous step's run).

A slightly more intuitive version would probably be:

int gen_tokens[gen_max_length] = { GPT2_EOT, 0, };

But this is just academic, because yes, I know that the loop you added is trivial in terms of time (based on how often that part is actually executed and the fact that it does a ton of printf). It would just be slightly cleaner code IMO.

@ngc92 (Contributor, Author) commented Apr 14, 2024

Yes, it initializes with zeros, but to me this code always reads highly unintuitively. Why would you mix explicit and implicit initialization when the implicit values are exactly the same as the explicit ones?
Why not write

int gen_tokens[gen_max_length] = { GPT2_EOT, };

so that all zeros are implicit?
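For context, the C rule at play: elements not named in the initializer list are implicitly zero-initialized, so the tail is filled with zeros either way. A small compilable check, with gen_max_length assumed to be a compile-time constant here:

    #include <assert.h>

    #define GPT2_EOT 50256

    int main(void) {
        enum { gen_max_length = 64 };
        // The first element is set explicitly; all remaining elements are
        // implicitly zero-initialized by the aggregate initializer.
        int gen_tokens[gen_max_length] = { GPT2_EOT, };
        assert(gen_tokens[0] == GPT2_EOT);
        assert(gen_tokens[1] == 0);
        assert(gen_tokens[gen_max_length - 1] == 0);
        return 0;
    }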

@zocterminal (Contributor)

Yes, agree.

@karpathy merged commit 312b043 into karpathy:master on Apr 14, 2024
zocterminal referenced this pull request Apr 15, 2024
…ight now for safety. We can later bring back the <= B, <= T forward pass, but we need to do it carefully and have tests that make sure that a strictly smaller configuration produces the exact same results for that chunk of b,t. In other words, we'd want to make sure that the entire forward pass is range invariant. Currently it is not, only because of the attention kernel. I think this is fixable, not too difficult, but it requires careful thought and associated tests for ensuring the range invariance. When those tests pass we can bring back the old and more efficient behavior. For now it's just causing bugs, so I am putting in this highly defensive, but a lot more likely correct, code.
@ngc92 deleted the fix-illegal-access branch on April 28, 2024 at 08:44