-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed bug in EthosBinary dataset class and model directory copying logic in RayTuneReportCallback #1129
Fixed bug in EthosBinary dataset class and model directory copying logic in RayTuneReportCallback #1129
Conversation
…n process_downloaded_dataset
…n downloading sst datasets
@@ -722,7 +723,14 @@ def on_epoch_end(self, trainer, progress_tracker, save_path): | |||
if trainer.is_coordinator(): | |||
with tune.checkpoint_dir(step=progress_tracker.epoch) as checkpoint_dir: | |||
checkpoint_model = os.path.join(checkpoint_dir, 'model') | |||
shutil.copytree(save_path, checkpoint_model) | |||
if not os.path.isdir(checkpoint_model): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a comment here about the fact that the previous implementation was copytree (you can leave that line commented) but that that's not atomic and so we have to do this temp + move thing, otherwise in the future we may forget about it and maybe replace it with copytree again :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great feedback -- addressed in this commit here: c924b9f
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Atomic copy is awesome.
This PR fixes two bugs:
RayTuneReportCallback
is atomic: prior to this addition, issues arose where files within the model checkpoint directory were being accessed before being completely written.cc: @tgaddair