Resuming training #36
Comments
Hi, Artnoage. If you check the training log, you will see that I actually resumed the process twice and did not notice any memory error. I am not sure why it happens on your end. Thanks for the suggestion to open a Discord server. I think I will open one soon.
Yes, I already knew that. I was just thinking that maybe it has to do with the size of the network, like some missed parameter. When you did the first run, did you check the memory usage? If it is not a big issue, please leave the question open for a while in case someone else tries it.
The memory usage is always 39G on my end.
Hi! My training crashed, and I couldn't find the code to resume training from the last saved checkpoint. How can I resume my training? How do you handle this?
You can add |
Thanks, I found that.
@artnoage can you please post this project? I would like to try the same with two 4060s, training at home.
I am not sure what you are referring to, because it has been a while.
However, if you would like to have a quick chat about what you want to do, you can
find me on Discord under the same name (artnoage).
On Mon, 20 May 2024, 16:35 Dustin Loring wrote:
> @artnoage <https://github.com/artnoage> can you please post this project?
> I would like to try the same with two 4060s, training at home.
I am training a 120M model from scratch because I would like to run some experiments myself. When I stop and try to resume, I have to drop the batch size significantly, otherwise I get a memory error. Any ideas why?
Also, please consider creating a Discord server where people can discuss the project.