Feature/streams #552
Conversation
Main difference is that I pulled out the main stream to be a global inside train_gpt2.cu, because I think the stream is not a property of the model itself; it's a property of the trainer run configuration, so it makes more sense to me there. That said, we are still not 100% on the "main stream". My nsys (which I run as …
train_gpt2.cu
Outdated
@@ -562,7 +563,7 @@ void gpt2_write_to_checkpoint(GPT2 *model, const char* checkpoint_path) {
      fwrite(model_header, sizeof(int), 256, model_file);
      // write the parameters
      void* params_memory_cpu = (void*)mallocCheck(model->num_parameters_bytes);
-     cudaCheck(cudaMemcpy(params_memory_cpu, model->params_memory, model->num_parameters_bytes, cudaMemcpyDeviceToHost));
+     cudaCheck(cudaMemcpyAsync(params_memory_cpu, model->params_memory, model->num_parameters_bytes, cudaMemcpyDeviceToHost, main_stream));
missing sync
this might actually be fine, because the copy involves memory which is not page-locked, so it isn't actually going to run async, I believe, but at the very least this code looks like a time bomb
Is it better to use cudaMemcpy here without Async, or with Async and then synchronize? I don't have enough background in C / CUDA / multi-stream here. If it's the non-Async function, does it use the default stream, and is it then synchronous?
The docs are not super complete:
https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior__memcpy-sync
Which stream is used for cudaMemcpy?
And they don't specify what happens when you copy from device to normal, non-pinned memory, as far as I can see.
Not sure what the correct solution is here.
Yes, things are a bit vague:
If you want to look at nsight systems and see that nothing is using the legacy stream, then memcpyAsync followed by synchronize is maybe the better solution. It also lets you search for synchronize to find the places where we have to wait.
A lot of these would also become at least partially async with #556, where we overlap device<->host and host<->disk transfers.
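To make the suggestion above concrete, here is a minimal sketch of the memcpyAsync-followed-by-synchronize pattern being discussed, assuming the `main_stream` global, `cudaCheck`, and `mallocCheck` helpers from train_gpt2.cu; this is an illustration of the idea, not the code the PR ended up with:

```cuda
// Issue the D2H copy on main_stream rather than the legacy default stream,
// then block the host until it completes. The explicit synchronize is what
// makes the pageable (non-pinned) host buffer safe to read afterwards,
// and it gives you a "synchronize" keyword to grep for.
void* params_memory_cpu = (void*)mallocCheck(model->num_parameters_bytes);
cudaCheck(cudaMemcpyAsync(params_memory_cpu, model->params_memory,
                          model->num_parameters_bytes,
                          cudaMemcpyDeviceToHost, main_stream));
cudaCheck(cudaStreamSynchronize(main_stream)); // copy is now visible to the CPU
fwrite(params_memory_cpu, 1, model->num_parameters_bytes, model_file);
```

In nsight systems this keeps all traffic attributable to `main_stream`, at the cost of the two-line dependency ("Async here requires a synchronize there") that the thread later decides against.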
train_gpt2.cu
Outdated
@@ -628,11 +629,13 @@ void gpt2_build_from_checkpoint(GPT2 *model, const char* checkpoint_path) {
      // read in all the parameters from file and copy them to device
      void* params_memory_cpu = (void*)mallocCheck(model->num_parameters_bytes);
      freadCheck(params_memory_cpu, 1, model->num_parameters_bytes, model_file);
-     cudaCheck(cudaMemcpy(model->params_memory, params_memory_cpu, model->num_parameters_bytes, cudaMemcpyHostToDevice));
+     cudaCheck(cudaMemcpyAsync(model->params_memory, params_memory_cpu, model->num_parameters_bytes, cudaMemcpyHostToDevice, main_stream));
missing sync
train_gpt2.cu
Outdated
@@ -718,13 +721,14 @@ void gpt2_build_from_random(GPT2 *model, int depth) {
      }

      // copy them to GPU
-     cudaCheck(cudaMemcpy(model->params_memory, params_memory_cpu, model->num_parameters_bytes, cudaMemcpyHostToDevice));
+     cudaCheck(cudaMemcpyAsync(model->params_memory, params_memory_cpu, model->num_parameters_bytes, cudaMemcpyHostToDevice, main_stream));
missing sync
train_gpt2.cu
Outdated
@@ -1331,9 +1348,9 @@ void save_state(const char* filename, int step, GPT2* model, DataLoader* loader)
      // write AdamW m, v, and master_weights here (they are all float)
      size_t shard_num_parameters = multi_gpu_config.shard_num_parameters;
      float* cpu_buffer = (float*)mallocCheck(shard_num_parameters * sizeof(float));
-     cudaCheck(cudaMemcpy(cpu_buffer, model->m_memory, shard_num_parameters * sizeof(float), cudaMemcpyDeviceToHost));
+     cudaCheck(cudaMemcpyAsync(cpu_buffer, model->m_memory, shard_num_parameters * sizeof(float), cudaMemcpyDeviceToHost, main_stream));
more syncs missing
I decided that the Asyncs look scary and that we should minimize dependencies "across lines of code" (e.g. requiring a synchronize right after), so I reverted them. This way we can also easily search for "Async" to look for possible trouble. We'll have memory traffic on the default stream, but that's ok. Last thought: we should minimize use of parallelism outside of the "critical path" that makes the code fast. So anything we do a single time or rarely (e.g. load, store, checkpoint, etc.) would remain sync; it just doesn't seem worth it.
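For reference, the reverted form is just the plain synchronous copy. For transfers to or from pageable (non-pinned) host memory, `cudaMemcpy` blocks the host until the transfer completes, so no explicit synchronize is needed and there is no cross-line dependency to forget; a minimal sketch, assuming the `cudaCheck` and `mallocCheck` helpers from train_gpt2.cu:

```cuda
// Plain cudaMemcpy: synchronous with respect to the host for pageable
// memory, so the buffer is safe to use on the very next line. This runs
// on the legacy default stream, which the PR accepts for rare operations
// like checkpoint load/store.
void* params_memory_cpu = (void*)mallocCheck(model->num_parameters_bytes);
cudaCheck(cudaMemcpy(params_memory_cpu, model->params_memory,
                     model->num_parameters_bytes, cudaMemcpyDeviceToHost));
```

The trade-off is some serialization on the default stream, which only matters on the hot path, not for once-per-run operations.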
Bringing back streams: this PR brings back a single "main stream" to start.