move all kernels into a dedicated cuda stream #448

ngc92 · 2024-05-22T20:46:46Z

In preparation for #361, this restores a single "main stream" CUDA stream on which all kernels are launched.
To make reasoning about parallelism easier, at least in the near future, this change also makes each gpt2_* function explicitly synchronous.
It also moves the loss calculation from forward to backward, because that gives us a better opportunity to overlap things.
This enables a slight optimization as well: we no longer update dlogits in the validation code path.
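
For readers unfamiliar with the pattern, here is a minimal sketch of what "one main stream plus explicitly synchronous functions" can look like. It is not the code from this PR: `scale_kernel`, `scale_sync`, `demo`, and the `CUDA_CHECK` macro are hypothetical names made up for illustration, and only the global `main_stream` mirrors the idea described above. The function enqueues all of its work on the one stream and synchronizes that stream before returning, so callers can treat it as blocking.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-checking helper (not taken from the PR).
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Illustrative stand-in for the single "main stream".
static cudaStream_t main_stream = nullptr;

// Toy kernel: multiply every element by s.
__global__ void scale_kernel(float* x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { x[i] *= s; }
}

// Hypothetical analogue of a gpt2_* function: every launch goes to
// main_stream, and the function waits for that stream before returning.
void scale_sync(float* d_x, float s, int n) {
    assert(n > 0);
    const int block = 256;
    const int grid = (n + block - 1) / block;
    scale_kernel<<<grid, block, 0, main_stream>>>(d_x, s, n);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaStreamSynchronize(main_stream));  // explicit synchronization point
}

// Small driver: fill a buffer with ones, scale by 2, return the last element.
float demo() {
    const int n = 1024;
    CUDA_CHECK(cudaStreamCreate(&main_stream));

    float* h = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) { h[i] = 1.0f; }

    float* d = nullptr;
    CUDA_CHECK(cudaMalloc((void**)&d, n * sizeof(float)));
    // Copies are also issued on main_stream, so they are ordered with the kernel.
    CUDA_CHECK(cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, main_stream));
    scale_sync(d, 2.0f, n);
    CUDA_CHECK(cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, main_stream));
    CUDA_CHECK(cudaStreamSynchronize(main_stream));

    float result = h[n - 1];
    CUDA_CHECK(cudaFree(d));
    free(h);
    CUDA_CHECK(cudaStreamDestroy(main_stream));
    main_stream = nullptr;
    return result;
}
```

Keeping the synchronization inside the function costs some potential overlap between calls, but it makes the ordering easy to reason about, which matches the stated goal of simplifying parallelism "at least in the near future".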

ngc92 force-pushed the stream branch 2 times, most recently from c728993 to e619cf7 Compare May 23, 2024 09:41

ngc92 added 4 commits May 24, 2024 23:14

move all kernels into a dedicated cuda stream

5d02a15

make all gpt2_* functions synchronous, and use streams extensively inside each function

c7ac55e

nvtx ranges and better overlap of validation

7042f4a

moved loss calculation to backward part

c712f43

ngc92 force-pushed the stream branch from e619cf7 to c712f43 Compare May 24, 2024 21:46
