
Utilities for cuda streams + disk IO #556

Merged: 10 commits into karpathy:master from streams-io, Jun 23, 2024
Conversation

@ngc92 (Contributor) commented Jun 5, 2024:

Handling disk IO for checkpointing with CUDA streams is a nontrivial task. If you're not careful, you can easily end up with broken code (you need to wait for the data to be on the CPU before you can start writing the buffer to disk) or with synchronous behaviour (because the memory is not page-locked, so async copies are not possible).
Therefore, this PR introduces two new utility functions that handle the disk <-> device data transfer.

In addition to being less error-prone and reducing code duplication, this also gives us a single point at which we can implement double buffering, so that device transfers actually overlap with disk writes. An added bonus is that we no longer need to allocate giant CPU-side arrays (think: 8 A100s on the same node, each wanting to write 40 GB of model state; we'd attempt to allocate 320 GB of host memory. Maybe the boxes are big enough even under that scenario, but do we really want to do that?)
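To illustrate the scheme, here is a minimal sketch of what a double-buffered file_to_device could look like (simplified, not the exact PR implementation; it reuses cudaCheck and freadCheck from the repo's utils):

void file_to_device(void* dest, FILE* file, size_t num_bytes,
                    size_t buffer_size, cudaStream_t stream) {
    // pinned (page-locked) host memory is required for truly async H2D copies
    char* buffer_space;
    cudaCheck(cudaMallocHost((void**)&buffer_space, 2 * buffer_size));
    char* buffers[2] = {buffer_space, buffer_space + buffer_size};

    char* gpu_write_ptr = (char*)dest;
    int which = 0;
    // prime the pipeline: read the first chunk from disk
    size_t chunk = std::min(buffer_size, num_bytes);
    freadCheck(buffers[which], 1, chunk, file);

    while (num_bytes > 0) {
        // kick off the async host->device copy of the chunk we just read
        cudaCheck(cudaMemcpyAsync(gpu_write_ptr, buffers[which], chunk,
                                  cudaMemcpyHostToDevice, stream));
        gpu_write_ptr += chunk;
        num_bytes -= chunk;
        which ^= 1;
        if (num_bytes > 0) {
            // overlap: read the next chunk from disk while the copy is in flight
            chunk = std::min(buffer_size, num_bytes);
            freadCheck(buffers[which], 1, chunk, file);
        }
        // wait before the next iteration reuses the other buffer
        cudaCheck(cudaStreamSynchronize(stream));
    }
    cudaCheck(cudaFreeHost(buffer_space));
}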

I've also added a unit test for these functions. It's very rough: I am not going to touch the Makefile, so for now you need to compile it yourself, and it leaves behind its temp file. But it has already paid off, because my first implementation had a wrong offset somewhere, and the test caught it :)

There is also the problem, first noticed in #522, that we currently miss master weights in the saved state. This is hacked in here very quickly, but it's not really a good solution; it should be combined at least with #522's addition of a flag in the file that indicates whether to expect master weights or not.

Because we now do file IO from CUDA code, cuda_common.h includes utils.h, which requires us to mark all the functions there as inline. I also had to add a checked write function, for which I've just copied the existing error handling; I'm not 100% sure that makes sense.

For the double-buffered transfers, I've used a buffer size of 32 MiB, but that is not based on any actual data; I just wanted to pick something that is neither super tiny nor super big :)

@ngc92 mentioned this pull request Jun 8, 2024
@ngc92 marked this pull request as ready for review June 17, 2024 12:52
@ngc92 changed the base branch from feature/streams to master June 17, 2024 22:34
train_gpt2.cu (outdated diff)
-cudaCheck(cudaMemcpy(model->params_memory, params_memory_cpu, model->num_parameters_bytes, cudaMemcpyHostToDevice));
-free(params_memory_cpu);
+file_to_device(model->params_memory, model_file, model->num_parameters_bytes,
+               32*1024*1024, main_stream);
@gordicaleksa (Contributor) commented Jun 18, 2024:

Any particular reason why the buffers are hardcoded to 32 MiB?

nit: maybe extract this into a global constant and pass it in everywhere; easier for maintenance if we want to change it later on
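The suggested refactor might look like this (the constant name is hypothetical):

// one named constant instead of a magic 32*1024*1024 at every call site
constexpr size_t IO_BUFFER_SIZE = 32 * 1024 * 1024;  // 32 MiB staging buffer

file_to_device(model->params_memory, model_file, model->num_parameters_bytes,
               IO_BUFFER_SIZE, main_stream);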

// prime the read buffer; first copy means we have to wait
char* gpu_read_ptr = (char*)src;
size_t copy_amount = std::min(buffer_size, num_bytes);
cudaCheck(cudaMemcpyAsync(read_buffer, gpu_read_ptr, copy_amount, cudaMemcpyDeviceToHost, stream));
Contributor commented:

curious: does async matter here given that we call synchronize immediately on the next line?

@ngc92 (Author) replied:

No, it's just for consistency: since the non-async versions of this function don't take a stream argument, if you want this transfer to show up in the "right" stream in Nsight, you need to use the async version.
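For example (with placeholder dst/src/n):

// synchronous copy: takes no stream argument, so the profiler attributes it
// to the default stream
cudaMemcpy(dst, src, n, cudaMemcpyDeviceToHost);

// async copy + immediate sync: blocks just the same, but the transfer is
// recorded on `stream`, so it shows up in the right lane in Nsight
cudaMemcpyAsync(dst, src, n, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);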


// copy the last remaining write buffer to gpu
cudaCheck(cudaMemcpyAsync(gpu_write_ptr, write_buffer, write_buffer_size, cudaMemcpyHostToDevice, stream));
cudaCheck(cudaFreeHost(buffer_space));
Contributor commented:

Is cudaFreeHost blocked until the line above finishes?

@ngc92 (Author) replied:

Good question. Allocation is listed as an implicit synchronization point; I would assume deallocation also needs to be, but I'm not 100% sure, so maybe we should be explicit here.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/#implicit-synchronization
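Being explicit would look something like this (a sketch reusing the names from the snippet above):

// copy the last remaining write buffer to gpu
cudaCheck(cudaMemcpyAsync(gpu_write_ptr, write_buffer, write_buffer_size,
                          cudaMemcpyHostToDevice, stream));
// explicitly wait for the in-flight copy instead of relying on cudaFreeHost
// to synchronize implicitly
cudaCheck(cudaStreamSynchronize(stream));
cudaCheck(cudaFreeHost(buffer_space));  // now safe to release the pinned buffer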

train_gpt2.cu (outdated diff)
@@ -1231,20 +1222,24 @@ void load_state(int* step, GPT2* model, DataLoader* loader, const char* filename
printf0("allocating %zu MiB for AdamW optimizer state v\n", (shard_num_parameters * sizeof(float)) >> 20);
cudaCheck(cudaMalloc((void**)&model->v_memory, shard_num_parameters * sizeof(float)));
}

if(state_header[4] == 1 && !model->use_master_weights) {
Contributor commented:

This will have to be refactored a bit due to recent changes.
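For context, the case analysis such a master-weights flag implies might look like this (a hypothetical sketch, given the refactoring noted above):

int file_has_master = state_header[4];  // 1 if the checkpoint stores master weights
if (file_has_master && !model->use_master_weights) {
    // checkpoint contains master weights we won't use: skip past them in the file
} else if (!file_has_master && model->use_master_weights) {
    // checkpoint lacks master weights: reconstruct them from the parameters instead
}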

@gordicaleksa (Contributor) commented:
Left some comments - lgtm!

@karpathy merged commit 2543b62 into karpathy:master on Jun 23, 2024
11 checks passed
@ngc92 deleted the streams-io branch July 11, 2024 12:42