Loading partial networks from checkpoints #1

Open
joegilkes opened this issue Oct 16, 2023 · 0 comments

Currently, networks explored via IterativeExplore that are terminated early (due to an error, exceeding walltime, etc.) can be restarted directly from the contents of their rdir_head. However, this only works while rdir_head still exists.

When running on distributed resources such as HPC clusters, network exploration should be performed within a scratch space to accommodate the currently heavy IO requirements of CDE runs. However, these scratch spaces are usually semi-volatile and in many cases cease to exist once a job finishes. This wipes the entire rdir_head directory tree, preventing restarts.

While rdir_head could be periodically backed up to non-volatile storage, this would be incredibly expensive and would nullify many of the benefits of performing exploration on a scratch space. Instead, we could use the already implemented incomplete network saves (which can be written to a non-scratch directory) as checkpoints, and allow partial (or full) network restoration from them when rdir_head is not present (e.g. when it has been wiped at the end of a job). This would work as follows:

  1. Check if rdir_head exists. If it does, the network within may be either full (present in the directory tree from the initial level) or partial (present in the directory tree only from a certain level onwards, because it was previously loaded from a checkpoint).
  2. If rdir_head does not exist, check if checkpoints exist. If they do, read in the latest checkpoint, establish the next level's seeds and create a new partial directory tree starting from this level.
  3. If no checkpoints exist either, start a new exploration from scratch.

In step 1, a full network can be loaded directly. However, when only a partial network is present, a checkpoint file covering the exploration progress made in the level(s) preceding those that exist in rdir_head MUST be available for exploration to continue without error.
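
The three-step check described above could be sketched roughly as follows. This is only an illustration in Python, not the project's API: the function name `plan_restart`, the `level_<n>` directory naming, the `checkpoint_<n>.bin` file naming, the convention that the initial level is 1, and the idea that checkpoint `n` covers progress up to and including level `n` are all assumptions made for the example.

```python
from __future__ import annotations

import re
from pathlib import Path

# Hypothetical naming conventions, assumed only for this sketch.
LEVEL_DIR = re.compile(r"^level_(\d+)$")
CHECKPOINT = re.compile(r"^checkpoint_(\d+)\.bin$")


def _indices(directory: Path, pattern: re.Pattern) -> list[int]:
    """Sorted integer indices of entries in `directory` whose names match `pattern`."""
    if not directory.is_dir():
        return []
    found = []
    for entry in directory.iterdir():
        match = pattern.match(entry.name)
        if match:
            found.append(int(match.group(1)))
    return sorted(found)


def plan_restart(rdir_head: Path, checkpoint_dir: Path) -> tuple[str, int | None]:
    """Decide how to (re)start an exploration.

    Returns (action, checkpoint_index), where checkpoint_index is None
    when no checkpoint needs to be read.
    """
    levels = _indices(rdir_head, LEVEL_DIR)
    checkpoints = _indices(checkpoint_dir, CHECKPOINT)

    # Step 1: rdir_head exists and contains level directories.
    # (An existing but empty rdir_head is treated like a missing one here.)
    if levels:
        first_level = levels[0]
        if first_level == 1:
            # Full network: the tree starts at the assumed initial level.
            return ("load_full", None)
        # Partial network: the checkpoint covering the preceding levels is mandatory.
        needed = first_level - 1
        if needed not in checkpoints:
            raise FileNotFoundError(
                f"partial tree starts at level {first_level}, "
                f"but checkpoint {needed} is missing"
            )
        return ("load_partial", needed)

    # Step 2: no usable rdir_head, so fall back to the latest checkpoint.
    if checkpoints:
        return ("restore_from_checkpoint", checkpoints[-1])

    # Step 3: nothing to restore from.
    return ("new_exploration", None)
```

Raising an error in the partial-network case, rather than silently falling back to an older checkpoint or a fresh run, mirrors the requirement that the matching checkpoint MUST be available for exploration to continue.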

@joegilkes joegilkes added the enhancement New feature or request label Oct 16, 2023
@joegilkes joegilkes self-assigned this Oct 16, 2023