Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fixing the full state path in checkpoint handler+loss report calculation #51

Merged
merged 16 commits into from
Aug 1, 2023

Conversation

HamidShojanazeri
Copy link
Contributor

@HamidShojanazeri HamidShojanazeri commented Jul 26, 2023

This PR fixes

utils/train_utils.py Outdated Show resolved Hide resolved
utils/train_utils.py Outdated Show resolved Hide resolved
@HamidShojanazeri HamidShojanazeri changed the title fixing the full state path in checkpoint handler fixing the full state path in checkpoint handler+loss report calculation Jul 30, 2023
utils/train_utils.py Outdated Show resolved Hide resolved
utils/train_utils.py Outdated Show resolved Hide resolved
utils/train_utils.py Outdated Show resolved Hide resolved
Copy link
Contributor

@chauhang chauhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HamidShojanazeri Thanks for the fixes. Please see comments inline.

@DhruvaBansal00
Copy link

Seems like this is supposed to fix #65? I am going to try this out and will keep you folks updated!

@HamidShojanazeri
Copy link
Contributor Author

@DhruvaBansal00 thanks for trying out the PR, I am not able to repro on my end can you pls share the command your running? and you env usingpython -m "torch.utils.collect_env".

@DhruvaBansal00
Copy link

@HamidShojanazeri I hadn't applied your changes correctly before. Just finished running three epochs and seems like everything is running as expected. Thanks for the changes and appreciate the fast turnaround :)

@HamidShojanazeri
Copy link
Contributor Author

@HamidShojanazeri I hadn't applied your changes correctly before. Just finished running three epochs and seems like everything is running as expected. Thanks for the changes and appreciate the fast turnaround :)

Thanks very much @DhruvaBansal00 for confirmation.

Copy link
Contributor

@chauhang chauhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @HamidShojanazeri for all the fixes and updates.

Thanks @jeonsworld your changes for PR #24 have also been rolled into this PR

@chauhang chauhang merged commit 1387b76 into main Aug 1, 2023
3 checks passed
@chauhang chauhang deleted the checkpoint_handler_path_fix branch September 1, 2023 01:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants