Async optimizer state and model checkpointing #651

Open

chinthysl wants to merge 6 commits into base: master
Conversation

@chinthysl (Contributor) commented Jun 27, 2024

This adds a feature to checkpoint the optimizer state and model parameters using a non-blocking background thread. The device buffers are memcpy'd to a pinned host buffer in one shot, and the background thread then performs the I/O operations.

In my 8xA100 setup, checkpointing latency improves from roughly 5.4 sec to 2.3 sec, about a 2x improvement. For larger model sizes this feature will save a lot of time.
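The general pattern could look like the sketch below. Names such as checkpoint_job_t, writer_thread, and checkpoint_async are illustrative only, not the PR's actual identifiers: a single blocking device-to-host copy into a pinned buffer, then a detached thread handles the slow file I/O while training continues.

// Hypothetical sketch of async checkpointing: one device-to-host memcpy into a
// pinned staging buffer, then a detached background thread does the file write.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

typedef struct {
    char* host_buffer;      // pinned host copy of the checkpoint payload
    size_t num_bytes;       // size of the payload
    char filename[512];     // destination path
} checkpoint_job_t;

static void* writer_thread(void* arg) {
    checkpoint_job_t* job = (checkpoint_job_t*)arg;
    FILE* f = fopen(job->filename, "wb");
    if (f != NULL) {
        fwrite(job->host_buffer, 1, job->num_bytes, f);
        fclose(f);
    }
    cudaFreeHost(job->host_buffer);   // release the pinned staging buffer
    free(job);
    return NULL;
}

// Copies device data to pinned host memory in one shot, then returns while a
// background thread writes it to disk (error checking omitted for brevity).
void checkpoint_async(const void* device_data, size_t num_bytes, const char* filename) {
    checkpoint_job_t* job = (checkpoint_job_t*)malloc(sizeof(checkpoint_job_t));
    job->num_bytes = num_bytes;
    snprintf(job->filename, sizeof(job->filename), "%s", filename);
    cudaMallocHost((void**)&job->host_buffer, num_bytes);                             // pinned allocation
    cudaMemcpy(job->host_buffer, device_data, num_bytes, cudaMemcpyDeviceToHost);     // one-shot copy
    pthread_t tid;
    pthread_create(&tid, NULL, writer_thread, job);
    pthread_detach(tid);              // training loop continues without waiting for the I/O
}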

A contributor left a review comment on the following snippet:

else {
    // transfer device data to host memory
    char* buffer_space;
    cudaCheck(cudaMallocHost(&buffer_space, model->num_parameters_bytes));

My experience with cudaMallocHost is that it can be much slower than a regular malloc; it eventually "pays for itself" because subsequent data transfers are so much faster, but for a single transfer I suspect it might end up slower than just a regular malloc.

So if that holds true here, this might need to either keep that memory allocated permanently or just use a regular malloc.
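One way to follow the "keep that memory allocated permanently" suggestion would be to allocate the pinned staging buffer lazily and reuse it for the lifetime of the run, so the cudaMallocHost cost is paid once rather than on every checkpoint. The sketch below uses hypothetical names (g_ckpt_buffer, get_checkpoint_buffer); it is not the PR's code.

// Persistent pinned staging buffer, allocated once and reused across checkpoints.
#include <stddef.h>
#include <cuda_runtime.h>

static char* g_ckpt_buffer = NULL;      // long-lived pinned staging buffer
static size_t g_ckpt_buffer_bytes = 0;  // its current capacity

char* get_checkpoint_buffer(size_t num_bytes) {
    if (g_ckpt_buffer_bytes < num_bytes) {
        if (g_ckpt_buffer != NULL) {
            cudaFreeHost(g_ckpt_buffer);                      // grow: drop the too-small buffer
        }
        cudaMallocHost((void**)&g_ckpt_buffer, num_bytes);    // pinned allocation, paid once
        g_ckpt_buffer_bytes = num_bytes;
    }
    return g_ckpt_buffer;
}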
