
rolling checkpoints #636

Merged: 2 commits merged into master on Jun 25, 2024
Conversation

karpathy (Owner)

Checkpoints are either MINOR or MAJOR, and minor checkpoints get deleted with a rolling window. This is an optimization that allows us to save state more often while preserving disk space overall. And if we ever see loss spikes and such, it is easier to reset the state, because a recent checkpoint will exist just before the spike.

Introduces two new flags:

    fprintf(stderr, "  -nk <int>   max number of checkpoints to keep in the directory, removing old ones (0 = disable, default)\n");
    fprintf(stderr, "  -nm <int>   every how many step checkpoints are considered major? major checkpoints never get deleted.\n");

Example usage:

        -n 20 \
        -nk 3 \
        -nm 100 \

This writes a checkpoint every 20 steps and keeps at most 3 minor checkpoints in the directory at any time, except that checkpoints at every 100th step are considered MAJOR and are never deleted.
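
To make the policy concrete, here is a minimal standalone sketch. It reuses the keep/delete condition from the diff below verbatim; how step_delete is derived (the checkpoint that just fell out of the rolling window) is an assumption about the implementation, not quoted from the PR:

    // prints, for the example flags above, which checkpoint (if any) gets
    // deleted each time a new one is written
    #include <stdio.h>

    int main(void) {
        int checkpoint_every = 20;         // -n 20
        int checkpoints_keep = 3;          // -nk 3
        int major_checkpoint_every = 100;  // -nm 100
        for (int step = checkpoint_every; step <= 200; step += checkpoint_every) {
            // assumption: the deletion candidate trails the newest checkpoint by nk slots
            int step_delete = step - checkpoints_keep * checkpoint_every;
            if (checkpoints_keep > 0 && step_delete > 0 &&
                (major_checkpoint_every == 0 || step_delete % major_checkpoint_every != 0)) {
                printf("step %3d: delete minor checkpoint %d\n", step, step_delete);
            } else if (step_delete > 0) {
                printf("step %3d: keep MAJOR checkpoint %d\n", step, step_delete);
            }
        }
        return 0;
    }

With these settings the directory holds at most 3 minor checkpoints at any time, while the checkpoints from steps 100, 200, … survive forever.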

@@ -1284,7 +1284,7 @@ void load_state(int* step, GPT2* model, DataLoader* loader, const char* filename

 // ----------------------------------------------------------------------------
 // CLI, poor man's argparse
-// unclaimed flags lol: p
+// (all single letters have been claimed now)
Contributor:
:)

train_gpt2.cu (outdated):
    if (checkpoints_keep > 0 && step_delete > 0 &&
        (major_checkpoint_every == 0 || step_delete % major_checkpoint_every != 0)
    ) {
        printf0("deleting minor checkpoint %d\n", step_delete);
Contributor:
make a dedicated delete_checkpoint function?
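
A hypothetical shape for that helper (the filename pattern and directory argument are assumptions, not taken from the merged code; train_gpt2.cu would log via its printf0 macro rather than plain printf):

    // hypothetical helper: build the checkpoint filename for `step` and
    // remove it from disk; the naming scheme is an assumed example
    #include <stdio.h>

    void delete_checkpoint(const char* output_log_dir, int step) {
        char filename[512];
        snprintf(filename, sizeof(filename), "%s/model_%08d.bin", output_log_dir, step);
        printf("deleting minor checkpoint %d\n", step);
        remove(filename);
    }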

karpathy merged commit 16b5bd5 into master on Jun 25, 2024
11 checks passed