Async optimizer state and model checkpointing #651

Open

chinthysl wants to merge 6 commits into base: master
Conversation

@chinthysl (Contributor) commented Jun 27, 2024

This adds a feature to checkpoint the optimizer state and model parameters using a non-blocking background thread. The device buffers are memcpy'd to a pinned host buffer in one shot, and the background thread then performs the I/O operations.

In my 8xA100 setup, checkpointing latency improves from roughly 5.4 sec to 2.3 sec, about a 2x improvement. For larger model sizes this feature will save a lot of time.
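The general pattern could look like the sketch below. Names such as checkpoint_job_t, writer_thread, and checkpoint_async are illustrative only, not the PR's actual identifiers: a single blocking device-to-host copy into a pinned buffer, then a detached thread handles the slow file I/O while training continues.

// Hypothetical sketch of async checkpointing: one device-to-host memcpy into a
// pinned staging buffer, then a detached background thread does the file write.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

typedef struct {
    char* host_buffer;      // pinned host copy of the checkpoint payload
    size_t num_bytes;       // size of the payload
    char filename[512];     // destination path
} checkpoint_job_t;

static void* writer_thread(void* arg) {
    checkpoint_job_t* job = (checkpoint_job_t*)arg;
    FILE* f = fopen(job->filename, "wb");
    if (f != NULL) {
        fwrite(job->host_buffer, 1, job->num_bytes, f);
        fclose(f);
    }
    cudaFreeHost(job->host_buffer);   // release the pinned staging buffer
    free(job);
    return NULL;
}

// Copies device data to pinned host memory in one shot, then returns while a
// background thread writes it to disk (error checking omitted for brevity).
void checkpoint_async(const void* device_data, size_t num_bytes, const char* filename) {
    checkpoint_job_t* job = (checkpoint_job_t*)malloc(sizeof(checkpoint_job_t));
    job->num_bytes = num_bytes;
    snprintf(job->filename, sizeof(job->filename), "%s", filename);
    cudaMallocHost((void**)&job->host_buffer, num_bytes);                             // pinned allocation
    cudaMemcpy(job->host_buffer, device_data, num_bytes, cudaMemcpyDeviceToHost);     // one-shot copy
    pthread_t tid;
    pthread_create(&tid, NULL, writer_thread, job);
    pthread_detach(tid);              // training loop continues without waiting for the I/O
}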

A contributor left a review comment on the following snippet:

else {
    // transfer device data to host memory
    char* buffer_space;
    cudaCheck(cudaMallocHost(&buffer_space, model->num_parameters_bytes));

My experience with cudaMallocHost is that it can be much slower than a regular malloc; it eventually "pays for itself" because subsequent data transfers are so much faster, but for a single transfer I suspect it might end up slower than just a regular malloc.

So if that holds true here, this might need to either keep that memory allocated permanently or just use a regular malloc.
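One way to follow the "keep that memory allocated permanently" suggestion would be to allocate the pinned staging buffer lazily and reuse it for the lifetime of the run, so the cudaMallocHost cost is paid once rather than on every checkpoint. The sketch below uses hypothetical names (g_ckpt_buffer, get_checkpoint_buffer); it is not the PR's code.

// Persistent pinned staging buffer, allocated once and reused across checkpoints.
#include <stddef.h>
#include <cuda_runtime.h>

static char* g_ckpt_buffer = NULL;      // long-lived pinned staging buffer
static size_t g_ckpt_buffer_bytes = 0;  // its current capacity

char* get_checkpoint_buffer(size_t num_bytes) {
    if (g_ckpt_buffer_bytes < num_bytes) {
        if (g_ckpt_buffer != NULL) {
            cudaFreeHost(g_ckpt_buffer);                      // grow: drop the too-small buffer
        }
        cudaMallocHost((void**)&g_ckpt_buffer, num_bytes);    // pinned allocation, paid once
        g_ckpt_buffer_bytes = num_bytes;
    }
    return g_ckpt_buffer;
}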
