
How to restart training of a saved model? #3

Closed
ghost opened this issue Apr 17, 2015 · 12 comments

@ghost commented Apr 17, 2015
Hi,

I've been experimenting with the model_utils.lua file on some of my own concatenations of gModules. I was just wondering if you could give an example of how to use

model_utils.combine_all_parameters

and

model_utils.clone_many_times

to get the params and grad_params of a saved protos table, which can then be used with the appropriate lines of train.lua to restart training?

Just to give some context: instead of saving the full protos, what I've tried is saving just the following table,

table_to_save = { options = opt , saved_params=params, saved_grad_params=grad_params }

Then I used basically all of train.lua, with the following:

saved_data = torch.load(saved_filename)

opt = saved_data.options

params:copy( saved_data.saved_params )

grad_params:copy( saved_data.saved_grad_params )

That is, I recreate the system using the same options and clone it in the same way - the main change is simply transferring the saved params and grad_params before starting the optimization.

I was just wondering if this is the right way to do it?

Thanks for your help 👍

Best regards,

Aj

@bshillingford
Member

Hi, I recommend saving the cloned sequences of modules instead of the protos, just to make things easier. This would of course result in larger saved files, though, since the activations would be saved as well (but still the same number of weight matrices).

If you want to serialize just the protos: in train.lua, from line 53 onward, the protos will have all their params pointing to subtensors of one shared tensor. So you can serialize just the protos periodically in the training loop; I think I already do that there. To reload the protos, wrap lines 43-53 in an if statement that checks whether you want to load them from a file, then either recreate the protos as lines 43-53 already do, or deserialize them from a file (see the sketch below).

The clone_many_times call must be done after this (see line 54 for an explanation of why).

Cheers,

Brendan
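
A minimal sketch of that load-or-create wrapper, assuming the model_utils.lua from this repo; opt.init_from and create_protos are hypothetical stand-ins for a checkpoint-path option and for the proto-building code on lines 43-53 of train.lua:

```lua
local model_utils = require 'model_utils'   -- adjust the require path to your checkout

local protos
if opt.init_from and opt.init_from ~= '' then
  protos = torch.load(opt.init_from)        -- deserialize previously saved prototypes
else
  protos = create_protos(opt)               -- build them fresh, as lines 43-53 already do
end

-- flatten every prototype's weights/biases into one shared tensor for the optimizer
local params, grad_params = model_utils.combine_all_parameters(
    protos.encoder, protos.decoder)         -- substitute your own modules here

-- cloning must happen only after the parameters have been flattened
local clones = {}
for name, proto in pairs(protos) do
  clones[name] = model_utils.clone_many_times(proto, opt.seq_length)
end
```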

@ghost
Author

ghost commented Apr 17, 2015

Wow thanks for the quick reply :)

I don't really understand how you can get the params of the cloned sequence of modules and put them into memory as a shared 1D tensor that the optimizer can use.

I'm sort of confused because clones is a table of modules, not a single module? Is there some trick I'm missing?

@bshillingford
Member

Here's the sequence of operations:

  1. The prototypes are generated (protos).
  2. Their parameters are flattened by allocating a new tensor that holds all of their weights/biases, so that the optimizer can access them easily, and by recursively replacing the protos' parameters with new tensors pointing into this one tensor.
  3. Now we create the cloned sequence of modules using the prototypes... remember that no new params are allocated now, and each instance in the clone just has a reference to the same tensors.

I think step 3 is where you're a bit confused here. We create the sequences of clones from the protos after the parameters of the prototypes all point to the shared tensor for optim.

Does that help?
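
For concreteness, here is a tiny sketch of those three steps, using nn.Linear as a hypothetical stand-in for a prototype (your protos are gModules) and the model_utils.lua from this repo:

```lua
require 'nn'
local model_utils = require 'model_utils'   -- adjust the require path to your checkout

-- 1. build the prototype (stand-in for your gModule)
local proto = nn.Linear(10, 10)

-- 2. flatten its parameters; afterwards proto's weight and bias are views
--    into the single flat tensor `params`
local params, grad_params = model_utils.combine_all_parameters(proto)

-- 3. clone it T times; no new weights are allocated, each clone's
--    parameters reference the same underlying storage
local T = 5
local clones = model_utils.clone_many_times(proto, T)

-- sanity check (should hold with the standard model_utils implementation):
-- every clone's weight tensor shares storage with params
for t = 1, T do
  assert(clones[t].weight:storage() == params:storage())
end
```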


@ghost
Author

ghost commented Apr 17, 2015

Yes - I think so? 👍

So the variables params and grad_params are basically references to the shared tensor, and they get saved (for want of a better word) along with the clones when I serialize them? So when I reload the sequence of clones, the variables params and grad_params will be reloaded and will still point to the shared tensor (which holds the weights/biases).

That seems a bit magical, but easy to do? Basically I just reload the clones and start the optimization loop as before, with no need to use either of the functions from model_utils.lua?

@bshillingford
Member

Sorry for the confusion, in my first reply I listed two options:

  1. Serialize only protos, params, and grad_params to save space in the saved model, then recreate clones on load (I explained this one in the "lines 43-53" paragraph in the first reply: just serialize {params, grad_params, protos}, then recreate clones after you deserialize).
  2. Just save clones, protos, params, and grad_params all together in one table.

If you save clones as you just described, that's correct as long as you serialize params and grad_params at the same time (e.g. put them all in one table: {params, grad_params, clones, protos}), which I neglected to mention. There's no need to use model_utils.lua if you serialize everything together. IIRC protos isn't used past line 50 or so, so you probably don't need to serialize it; I'd serialize it anyway though.

(Torch's serialization system will see that their Tensors point to the same Storage objects as all the params inside the modules, and so they'll point to the same thing on deserialization too.)

Edit: to comment on your initial post (I missed part of it because I read it before you edited): your way of copying parameter values would work as well, but option 1 above is the way I usually do it.
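
To make option 1 concrete, here is a hedged sketch, with checkpoint_file as a hypothetical filename and opt.seq_length standing in for the number of clones:

```lua
-- during training, periodically save a small checkpoint (no clones)
local checkpoint = {
  opt = opt,
  params = params,
  grad_params = grad_params,
  protos = protos,
}
torch.save(checkpoint_file, checkpoint)

-- on restart, reload it and rebuild only the clones
local checkpoint = torch.load(checkpoint_file)
opt, protos = checkpoint.opt, checkpoint.protos
params, grad_params = checkpoint.params, checkpoint.grad_params
-- Torch's serializer preserves the Storage sharing, so params still views
-- the protos' weights; only the clones need to be recreated
local clones = {}
for name, proto in pairs(protos) do
  clones[name] = model_utils.clone_many_times(proto, opt.seq_length)
end
```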

@ghost
Author

ghost commented Apr 17, 2015

Wow, thanks a lot for the great explanation 👍 - I'm trying it now. It takes a long time for my model to train, so it's not easy to tell if it's working, but I think it is.

So basically, saving the 'things' in the following table and unpacking them back to their original names, i.e.

table_to_save = { params, grad_params, clones, protos, opt }

save it, reload it and unpack it

params, grad_params, clones, protos, opt = unpack( table_to_save )

is all you have to do? All the parameter/memory-sharing technicalities are magically reloaded. I didn't think it would be that easy!

On a different level, though, it's still kind of confusing/unsatisfying. This is a bit theoretical, so bear with me, but in terms of information theory and source coding, my system is a variational autoencoder. So if I add up the number of bits it takes for the LuaJIT compiler, the essential Torch modules I use, my VAE system's .lua files, and my trained model's saved parameters (which are just a few million 64-bit numbers), I should have the amount of information coded into my algorithmic/generative model, which is basically a probability distribution over my dataset (cluttered MNIST32). In total I guess that's 1 GB at the most.

If I clone my trained modules and then save them all, the full amount of data saved comes to about 5 GB each time. So it just seems that, in terms of information and data compression, it's more satisfactory to recreate the system fresh using model_utils, and then :copy the saved parameter values into the shared param tensor of the freshly rebuilt system.

What do you think?

I need to do some more experiments just for my own sanity to make sure both methods work 👍

@bshillingford
Member

That's the correct way to serialize, yes, and sharing/references are handled correctly. Without going into too much detail, there are a few different levels of a serialization system's complexity regarding pointers/references. In C/C++ notation, if &a.b == &c.b, then when serializing a and c together we'd expect &a.b == &c.b when deserializing too. In the case of parameter sharing, a and c are Tensors, and b is the shared underlying Storage. Remember there's only one Storage for the parameters in the entire network. More advanced serialization libraries can correctly (de)serialize pointer/reference cycles (Torch probably does as well, but I haven't checked, and this situation is probably rare for most Torch code anyway).

The amount of space is large because the activations and gradients in each clone in the network are being serialized too (i.e. module.output and module.gradOutput for each module). The values in these are obviously useless. To avoid this, serialize just params, grad_params, protos, and opt, and recreate clones using clone_many_times when you start; or just use your solution of serializing parameter values and copying them (but remember to do the copy after calling combine_all_parameters).
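
A minimal sketch of that copy-based route, assuming the table from the first post ({ options = opt, saved_params = ..., saved_grad_params = ... }) was written to a hypothetical saved_filename:

```lua
local saved = torch.load(saved_filename)
opt = saved.options

-- recreate the protos from scratch exactly as train.lua does, then flatten them
local params, grad_params = model_utils.combine_all_parameters(
    protos.encoder, protos.decoder)   -- substitute your own modules here

-- the copy must come AFTER combine_all_parameters, so the saved values land
-- in the flat tensor that the clones will later share
params:copy(saved.saved_params)
grad_params:copy(saved.saved_grad_params)

-- only now create the clones with model_utils.clone_many_times
```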


@ghost
Author

ghost commented Apr 17, 2015

Brilliant - thank you very much for the clear explanation 👍

@ghost
Author

ghost commented Apr 21, 2015

Hi Brendan,

Thanks for all the great help you've given me. Just to share a little trick I found:

i) Train a system for, say, n timesteps/clones of the master modules, and save (serialize) the parameter and grad-parameter tensors together with the first clone, using the method you explained above.

ii) Then rebuild your system fresh with an extra timestep/clone, n+1, by calling model_utils.clone_many_times on the first clone.

I think this little trick is working (at least for the variational autoencoder I'm working on 👍).
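
In code, the trick amounts to something like this rough sketch (the filename and n are hypothetical names):

```lua
-- i) after training with n clones, save the flat tensors and the first clone
torch.save('short_run.t7', {
  params = params,
  grad_params = grad_params,
  first_clone = clones[1],   -- or e.g. clones.rnn[1], depending on how the clones table is structured
})

-- ii) on restart, rebuild a longer sequence from that saved clone
local saved = torch.load('short_run.t7')
params, grad_params = saved.params, saved.grad_params
clones = model_utils.clone_many_times(saved.first_clone, n + 1)
```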

@ghost
Author

ghost commented Apr 23, 2015

Hi, just an update on my suggested trick of restarting training with a rebuilt system with an added clone: after doing more controlled experiments, it does not seem to be working.

Basically I've found that there's no substitute for fixing the number of clones/timesteps and being patient, waiting for the system to start breaking its symmetries.

My suggested trick of using the parameters of a shorter system as the initial parameters of a system with an extra clone/timestep seems to restrict the parameter space and results in a higher final loss. The standard method of simply training with the desired number of timesteps and being patient, or finding better ways to initialize the system, seems to result in a lower final loss.

Very sorry for the half-baked idea 👎

@ghost
Author

ghost commented May 1, 2015

Thanks a lot for all your great help Brendan 👍

I think anyone who reads through this issue will find a few options for how to save a network and restart training.

Best regards, Aj

@ghost ghost closed this as completed May 1, 2015
@mszlazak

mszlazak commented May 1, 2015

Nice if I could get the code to work.

96749c8#commitcomment-10954747
