How to restart training of a saved model? #3
Hi,

I recommend saving the cloned sequences of modules instead of the full `protos`. If you want to just serialize the `params`, that works as well.

Cheers,
Brendan
Wow, thanks for the quick reply :) I don't really understand how you can get the […]. I'm sort of confused because […].
Here's the sequence of operations:

1. Build the `protos` (one instance of each module).
2. Call `model_utils.combine_all_parameters` on them to get the flattened `params` and `grad_params`.
3. Call `model_utils.clone_many_times` to create the per-timestep clones, which share their parameters with the originals.

I think step 3 is where you're a bit confused here. We create the sequences by cloning, so the clones' weights all point into the same flattened `params`. Does that help?

On Fri, Apr 17, 2015 at 5:26 PM, Ajay Talati notifications@github.com wrote:
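This sequence can be sketched in Lua, assuming the char-rnn-style `model_utils` helpers from this repo (`build_rnn` and the variable names here are hypothetical placeholders):

```lua
require 'torch'
local model_utils = require 'model_utils'

-- 1) Build one instance of each module (the "protos").
local protos = { rnn = build_rnn(opt) }  -- build_rnn is a placeholder

-- 2) Flatten all parameters into single params/grad_params tensors.
-- After this call, every weight tensor inside protos.rnn is a view
-- into the flat `params` storage.
local params, grad_params = model_utils.combine_all_parameters(protos.rnn)

-- 3) Make one clone per timestep; the clones share the same
-- underlying parameter storage as protos.rnn.
local clones = model_utils.clone_many_times(protos.rnn, opt.seq_length)
```

Because of step 2, writing into `params` (e.g. `params:copy(saved)`) updates the weights of the proto and of every clone at once.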
Yes, I think so? 👍 So the variables […]. That seems a bit magical, but easy to do? Basically I just reload the closure […].
Sorry for the confusion; in my first reply I listed two options:

1. save the cloned sequences of modules themselves, or
2. save just the parameter values and copy them back into a freshly constructed model.
If you save clones like you just mentioned, that's correct as long as you serialize […]. (Torch's serialization system will see that their Tensors point to the same Storage objects as all the params inside the modules, and so they'll point to the same thing on deserialization too.)

Edit: to comment on your initial post, which I missed in part because I read it before you edited: your way, with copying parameter values, would work as well, but option 1 above is the way I usually do it.
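A minimal sketch of the parenthetical above, using a plain `nn.Linear` (the filename is hypothetical): because `params` and the module's weight tensors share Storage, saving both in a single `torch.save` call preserves the aliasing across reload.

```lua
require 'nn'

local net = nn.Linear(10, 10)
local params, grad_params = net:getParameters()

-- Save both in ONE torch.save call so the serializer can record
-- that they point at the same Storage.
torch.save('checkpoint.t7', { params = params, net = net })

local ckpt = torch.load('checkpoint.t7')
ckpt.params:zero()
-- ckpt.net.weight is now all zeros as well: the sharing survived.
-- Saving them in two separate files would have broken the link.
```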
Wow, thanks a lot for the great explanation 👍 I'm trying it now. It takes a long time for my model to train, so it's not easy to tell if it's working, but I think it is. So basically, just saving the 'things' in the following table, then reloading and unpacking them to their original names, i.e.

save it, reload it and unpack it

is all you have to do? All the parameter/memory-sharing technicalities are […].

On a different level though, it's still kind of confusing/unsatisfying. This is a bit theoretical, so bear with me, but in terms of information theory and source coding, my system is a variational autoencoder. So if I add the number of […]. If I […]. What do you think? I need to do some more experiments, just for my own sanity, to make sure both methods work 👍
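The "save it, reload it and unpack it" step might look like this (all names here are hypothetical, following the char-rnn-style layout):

```lua
-- Save everything in one torch.save call so parameter sharing
-- between params, protos, and clones is preserved.
torch.save('checkpoint.t7', {
  opt = opt,
  protos = protos,
  clones = clones,
  params = params,
  grad_params = grad_params,
})

-- Later: reload and unpack to the original names, then resume training.
local ckpt = torch.load('checkpoint.t7')
opt, protos, clones = ckpt.opt, ckpt.protos, ckpt.clones
params, grad_params = ckpt.params, ckpt.grad_params
```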
That's the correct way to serialize, yes, and sharing/references are preserved for you. The amount of space is large because the activations and gradients in each clone are stored as well.
Brilliant - thank you very much for the clear explanation 👍
Hi Brendan, thanks for all the great help you've given me. Just to share with you a little trick I found:

i) train a system for, say, `T` timesteps/clones,
ii) then rebuild your system fresh with an extra timestep/clone, and initialize it with the shorter system's parameters.

I think this little trick is working (at least for the variational autoencoder I'm working on 👍).
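The trick could be sketched like this (hypothetical names, `build_rnn` is a placeholder; note that the follow-up below reports it did not help in more controlled experiments):

```lua
-- 1) Train with T clones, then keep just the flat parameter vector
-- (params comes from model_utils.combine_all_parameters, as usual).
torch.save('short_run_params.t7', params)

-- 2) Rebuild the whole system from scratch with T + 1 clones.
local new_protos = { rnn = build_rnn(opt) }      -- placeholder constructor
local new_params, new_grad_params =
    model_utils.combine_all_parameters(new_protos.rnn)
local new_clones = model_utils.clone_many_times(new_protos.rnn, T + 1)

-- 3) Warm-start from the shorter run. The sizes match because the
-- per-timestep weights are shared across clones, not duplicated.
new_params:copy(torch.load('short_run_params.t7'))
```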
Hi, just an update on my suggested trick of restarting training with a rebuilt system with an added clone: after doing more controlled experiments, it does not seem to be working. Basically I've found that there's no substitute for fixing a number of clones/timesteps and being patient, waiting for the system to start breaking its symmetries. My suggested trick of using the parameters of a shorter system as the initial parameters of a system with an extra clone/timestep seems to restrict the parameter space and result in a higher final loss. The standard method of simply training with the desired number of timesteps and being patient, or finding better ways to initialize the system, seems to result in a lower final loss. Very sorry for the half-baked idea 👎
Thanks a lot for all your great help Brendan 👍 I think anyone who reads through this issue will get a few choices for how to save networks and restart training them.

Best regards, Aj
Would be nice if I could get the code to work.
Hi,

I've been experimenting with using the `model_utils.lua` file on some of my own concatenations of gModules. I was just wondering if you could give an example of how to use `model_utils.combine_all_parameters` and `model_utils.clone_many_times` to get the `params` and `grad_params` of a saved `protos`, which can then be used with some of the appropriate lines of `train.lua` to restart training?

Just to give some context: what I've tried, instead of saving the full `protos`, is just saving the following table,

```lua
table_to_save = { options = opt, saved_params = params, saved_grad_params = grad_params }
```

Then I used basically all of `train.lua`, with the following,

```lua
saved_data = torch.load(saved_filename)
opt = saved_data.options
params:copy(saved_data.saved_params)
grad_params:copy(saved_data.saved_grad_params)
```

That is, I recreate the system using the same options and clone it in the same way; the main change is simply transferring the saved `params` and `grad_params` before starting the optimization. I was just wondering if this is the right way to do it?

Thanks for your help 👍

Best regards,
Aj