fix #114 #122
Conversation
I've yet to check this fix, but this will be slow in terms of perf. I wonder if this check could be added in the CUDA kernel instead, assuming artificial padding.
Why would it be slow? It doesn't actually change anything inside the training loop, just the autoregressive test generation, which is atrocious in terms of performance anyway because there is no caching of previous tokens. So the only things that change are the additional asserts, which go away if you compile with -DNDEBUG (which you should, if you care about performance).
I agree that we might want to place the check inside the kernel if this turns out to have a significant performance impact on debug builds.
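For reference, a minimal sketch of why the asserts are free in release builds (nothing here is specific to this repo):

```c
#include <assert.h>

int check_token(int token, int vocab_size) {
    // In a plain debug build this aborts on an out-of-range token.
    // Compiling with -DNDEBUG (e.g. `cc -O3 -DNDEBUG ...`) removes the
    // assert entirely, so there is no runtime cost in optimized builds.
    assert(0 <= token && token < vocab_size);
    return token;
}
```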
I think we can move this out of the for loop. The padding doesn't need to be done for each step; it only needs to be done once (didn't check, just looked at the code for dependencies).
This fix works for me. The issue is reproducible for me when any of the entries of gen_tokens are >= V (50257) due to not being initialized. Moving the initialization for loop to just after the gen_tokens declaration (so it only needs to run once) also seems to solve the issue.
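A minimal sketch of that arrangement (GEN_MAX_LENGTH and init_gen_tokens are illustrative names; gen_tokens and GPT2_EOT come from the code under discussion):

```c
#define GPT2_EOT 50256        // value quoted in this thread
#define GEN_MAX_LENGTH 64     // "we're writing 64 ints to memory"

int gen_tokens[GEN_MAX_LENGTH];

void init_gen_tokens(void) {
    // Fill the buffer once, right after it comes into existence, so the
    // very first generation step never sees uninitialized token ids.
    for (int i = 0; i < GEN_MAX_LENGTH; i++) {
        gen_tokens[i] = GPT2_EOT;
    }
}
```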
Works for me. A quicker version would probably be:
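(The snippet itself is not preserved here; judging from the replies about zero-initialization below, it was presumably something along these lines, assuming the array size is a compile-time constant:)

```c
int gen_tokens[64] = {0};  // zero-initializes all 64 elements;
                           // gen_tokens[0] is still set to GPT2_EOT later
```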
I would argue strongly against that. Now I need to look up on cppreference whether this syntax actually initializes the entire array or just the first element. I need to check if 0 == GPT2_EOT, which is no longer the obvious starting token in the code. Finally, I don't think this is an operation that needs to be optimized anyway. We're currently running quadratic-time inference for the LLM here, reprocessing all past tokens over and over again. Even if we fixed that, this is an extremely small operation (we're writing 64 ints to memory, nothing more!).
It will initialize the whole array with zeros. The 0th element will still be set to GPT2_EOT as in the current code line 1227 (GPT2_EOT is 50256). It just avoids the first call receiving an array with incorrect values (after that, it will be filled with values from the previous step's run). A slightly more intuitive version would probably be:
But this is just academic, because yes, I know that the loop you added is trivial in terms of time (based on how often that part is actually executed, and on the fact that it does a ton of printf). It would just be slightly cleaner code IMO.
Yes, it initializes with zero, but to me this code always reads highly nonintuitively. Why would you mix explicit and implicit initialization, when the implicit values are exactly the same as the explicit ones, so that all zeros are implicit?
Yes, agree.
…ight now for safety. We can later bring back the <= B, <= T forward pass, but we need to do it carefully and have tests that make sure that a strictly smaller configuration produces the exact same results for that chunk of b,t. In other words, we'd want to make sure that the entire forward pass is range invariant. Currently it is not, if only because of the attention kernel. I think this is fixable and not too difficult, but it requires careful thought and associated tests ensuring the range invariance. When those tests pass we can bring back the old, more efficient behavior. For now it's just causing bugs, so I am putting in this highly defensive, but much more likely correct, code.
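As a toy illustration of the kind of range-invariance test meant here (a stand-in forward function, not the actual GPT-2 code): run the same inputs through the full (B, T) configuration and a strictly smaller (b, t) one, and require the overlapping chunk of outputs to match exactly.

```c
#include <assert.h>
#include <string.h>

// Toy stand-in for a forward pass: a causal prefix sum over each row.
// Range invariance means the (b, t) chunk of the output must be identical
// whether it was computed as part of a larger (B, T) run or on its own.
void toy_forward(const float* in, float* out, int B, int T) {
    for (int row = 0; row < B; row++) {
        float acc = 0.0f;
        for (int col = 0; col < T; col++) {
            acc += in[row * T + col];
            out[row * T + col] = acc;
        }
    }
}

int main(void) {
    enum { B = 4, T = 8, b = 2, t = 5 };   // full config and a strictly smaller one
    float in[B * T], out_big[B * T], in_small[b * t], out_small[b * t];
    for (int i = 0; i < B * T; i++) in[i] = (float)(i % 7) * 0.5f;

    toy_forward(in, out_big, B, T);        // full-size run

    // Same data restricted to the first b rows and first t positions.
    for (int row = 0; row < b; row++)
        memcpy(&in_small[row * t], &in[row * T], t * sizeof(float));
    toy_forward(in_small, out_small, b, t);

    // The overlapping chunk must match bit-for-bit.
    for (int row = 0; row < b; row++)
        for (int col = 0; col < t; col++)
            assert(out_big[row * T + col] == out_small[row * t + col]);
    return 0;
}
```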
In order to ensure that the sequence length is divisible by four, we do some padding.
We need to make sure that the padded tokens actually do have a valid token-id, so that they don't cause any out-of-bounds access.
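A minimal sketch of what such padding can look like (variable names are illustrative, not the actual diff): round the sequence length up to the next multiple of 4 and fill the extra slots with a real token id such as GPT2_EOT, so every position indexes a valid row of the embedding table.

```c
#define GPT2_EOT 50256  // a valid token id to pad with

// Round T up to the next multiple of 4 and fill the tail with GPT2_EOT.
// Returns the padded length; `tokens` must have room for it.
int pad_tokens(int* tokens, int T) {
    int T_padded = (T + 3) & ~3;         // e.g. 61 -> 64, 64 -> 64
    for (int t = T; t < T_padded; t++) {
        tokens[t] = GPT2_EOT;            // valid id, no out-of-bounds lookup
    }
    return T_padded;
}
```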
Also added some asserts as part of the forward call, so that we hopefully can avoid similar problems in the future (or at least, they will be much easier to trace).
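The kind of check meant here, sketched (the exact asserts in the diff may differ): validate every incoming token id against the vocabulary size at the top of the forward call, so a bad id fails loudly at the call site instead of corrupting memory deep inside a kernel.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical validation at the top of a forward pass over B*T tokens.
void validate_inputs(const int* inputs, const int* targets, int B, int T, int V) {
    for (int i = 0; i < B * T; i++) {
        assert(0 <= inputs[i] && inputs[i] < V);
        if (targets != NULL) {
            assert(0 <= targets[i] && targets[i] < V);
        }
    }
}
```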