resume training with different numbers of nodes. #1895
tangjiasheng started this conversation in Ideas
Replies: 1 comment
-
@tangjiasheng, thanks for sharing this idea. In fact, we have an effort along these lines, but it has been slow going due to bandwidth issues. Expressions of interest from valued users like you will help us prioritize it appropriately. We hope you can engage with the discussion. Thanks!
-
I think it would be useful to support resuming training with a different number of nodes, e.g. going from 128 cards to 64 cards.
Training is not a stable process and it often breaks down. When I train models on the cloud, resources are not guaranteed, so it is very common that the original number of nodes is not available when it is time to resume. Merging all saved checkpoint shards and splitting them again for the new node count would be a good way to avoid wasting time waiting for resources.
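
To make the "combine and split again" idea concrete, here is a minimal sketch in plain Python. It assumes each rank's checkpoint holds an equal-sized, zero-padded slice of one flat state vector, which is only one possible layout. The names `partition`, `merge`, and `reshard` are hypothetical and are not the API of any particular framework; a real implementation would also have to handle optimizer state, parameter groups, and tensor/pipeline parallel layouts.

```python
from typing import List


def partition(flat: List[float], world_size: int) -> List[List[float]]:
    """Split a flat state vector into world_size equal shards, zero-padding the tail."""
    shard_len = -(-len(flat) // world_size)  # ceiling division
    padded = flat + [0.0] * (shard_len * world_size - len(flat))
    return [padded[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]


def merge(shards: List[List[float]], numel: int) -> List[float]:
    """Concatenate shards in rank order and drop the padding at the end."""
    flat = [x for shard in shards for x in shard]
    return flat[:numel]


def reshard(shards: List[List[float]], numel: int, new_world_size: int) -> List[List[float]]:
    """Merge shards saved by the old world size, then split for the new one."""
    return partition(merge(shards, numel), new_world_size)


# Hypothetical example: 10 values saved by 4 ranks, resumed on 2 ranks.
state = [float(i) for i in range(10)]
old_shards = partition(state, 4)
new_shards = reshard(old_shards, len(state), 2)
```

Note that the true element count (`numel`) has to be stored alongside the shards, because the padding added for the old world size must be removed before splitting for the new one.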