Resuming training #36

artnoage · 2023-09-22T06:37:49Z

I am training a 120M model from scratch because I would like to run some experiments myself. When I stop and try to resume, I have to reduce the batch size significantly; otherwise I get a memory error. Any ideas why?

Also, please consider creating a Discord server where people can discuss the project.

jzhang38 · 2023-09-23T07:53:37Z

Hi, artnoage. If you check the training log, you will see that I actually resumed the process twice and did not notice any memory error. I am not sure why it happens on your end.

Thanks for your suggestion about opening a Discord. I think I will open one soon.

artnoage · 2023-09-23T19:06:18Z

Yes, I already knew that. I was just thinking that maybe it has to do with the size of the network, like some missed parameter. When you did the first run, did you check the memory usage? If it is not a big issue, please leave the question open for a while, in case someone else tries it.

jzhang38 · 2023-09-24T00:44:47Z

> When you did the first run, did you check the memory usage?

The memory usage is always 39G on my end.

dtxwhzw · 2023-10-16T03:46:59Z

Hi! My training crashed, and I couldn't find the code to resume training from the last saved checkpoint. How can I resume my training? How do you handle this?

ChaosCodes · 2023-10-16T03:53:13Z

> Hi! My training crashed, and I couldn't find the code to resume training from the last saved checkpoint. How can I resume my training? How do you handle this?

You can add --resume your_checkpoint.pth to your pretraining command to resume training.
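For anyone landing here later, below is a rough sketch of how a --resume flag like this is typically wired up. It is not the repository's actual code: the names save_checkpoint, load_checkpoint and train, the layout of the state dictionary, and the use of pickle are all assumptions made for illustration. The only thing taken from this thread is the --resume flag itself.

```python
import argparse
import pickle
from pathlib import Path


def save_checkpoint(path, state):
    """Write the training state to disk (hypothetical layout, pickle as a stand-in)."""
    with open(path, "wb") as f:
        pickle.dump(state, f)


def load_checkpoint(path):
    """Read back a state dictionary written by save_checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Toy pretraining script")
    # Optional path to a checkpoint; when omitted, training starts from scratch.
    parser.add_argument("--resume", type=Path, default=None)
    return parser.parse_args(argv)


def train(args, max_steps=5):
    """Run a toy loop and return (final state, number of steps executed in this run)."""
    # Fresh state: step counter plus a stand-in for model/optimizer state.
    state = {"step": 0, "weights": [0.0]}
    if args.resume is not None:
        # Restore everything, including the step counter, so the loop
        # continues where the previous run stopped instead of restarting.
        state = load_checkpoint(args.resume)

    steps_run = 0
    while state["step"] < max_steps:
        state["weights"][0] += 0.1  # stand-in for a real optimizer step
        state["step"] += 1
        steps_run += 1
    return state, steps_run
```

With this layout, calling train(parse_args(["--resume", "your_checkpoint.pth"])) on a checkpoint saved at step 3 would run only the remaining steps, while train(parse_args([])) starts from step 0. The important design point is that the step counter lives in the checkpoint alongside the weights, so resuming does not repeat work.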

dtxwhzw · 2023-10-16T04:21:19Z

> Hi! My training crashed, and I couldn't find the code to resume training from the last saved checkpoint. How can I resume my training? How do you handle this?

> You can add --resume your_checkpoint.pth to your pretraining command to resume training.

Thanks, I found that.

dustinwloring1988 · 2024-05-20T13:34:59Z

@artnoage, can you please post this project? I would like to try the same with two 4060s, training at home.

artnoage · 2024-05-20T16:37:37Z

I am not sure what you are referring to, because it has been a while. However, if you would like to have a quick chat about what you want to do, you can find me on Discord under the same name (artnoage).

