Dropout for ephemeral connections #988
Comments
I am becoming more convinced that this method should be quite useful, and I'd really like someone (or people) to help with this.

It seems reasonable. I would have to think more about whether it could be implemented quickly on GPU (it probably can be, despite having to cache values for backpropagation for a long time). Overall, though, my main concern is that it strikes me that we could get easy gains by adopting other innovations in the neural-network literature like batch norm, residual networks, and persistent RNNs (I don't quite understand persistent RNNs yet, but a 30-times speed-up would make experimenting with RNNs a lot easier).
Dan
I had a look into ResNets... it's where every other layer, you have
component-node name=Tdnn_5_relu component=Tdnn_5_relu input=Tdnn_5_affine
and you could easily change this to:
component-node name=Tdnn_5_relu component=Tdnn_5_relu
[and, say, only do this for odd-numbered components].
Dan
As far as I know, ResNets are especially beneficial for training very deep networks (hundreds of layers or more). Not sure if it would improve over the current tdnn setup. But anyway, I can pick it up.
Yiming
Regarding ResNets (and this is also of broader interest): they are also using lattice-free MMI, and they cite our lattice-free MMI paper.
Hi Dan. This idea in ResNet is exactly the same as my LADNN paper at MSR. Actually, we proposed and submitted this idea to ICASSP before the ResNet publication, but they didn't cite our work, even though it was internal MSR work and they should have cited our paper.
OK sure. Yiming has tons of stuff to do anyway, I think.
I am already working on it. Perhaps Pegah could try other configurations.
Yiming
Oh OK.
Good idea! I can implement the Dropout component in the nnet3 setup.
Not yet. I am testing it on tdnn now.
Yiming Wang
@pegahgh Since I've given this some thought already, feel free to ping me.
Daniel Galvez
Makes sense. Pegah, if you put an early draft of the nnet3 dropout stuff...
Regarding the ephemeral connections: I just noticed that in this paper https://arxiv.org/pdf/1510.08983.pdf about highway LSTMs, they start with dropout on these connections at 0.1 and increase it to 0.8 -- so it's a bit like the proposed ephemeral-connections idea (except they don't increase the dropout all the way to 1 and then remove the component). Anyway, to me it confirms that there is something to the idea.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed by a bot strictly because of inactivity. This does not mean that we think that this issue is not important! If you believe it has been closed hastily, add a comment to the issue and mention @kkm000, and I'll gladly reopen it.
I have an idea that I'd like help to implement in nnet3. I'm not sure who to ask to work on this.
It's related to dropout, but it's not normal dropout. With normal dropout we'd typically have a schedule where we start off with dropout (e.g. a dropout probability of 0.5) and gradually make it more like a regular non-stochastic node (dropout probability of 0.0). In this idea we start off with not much dropout, and end up dropping out with probability 1.0, after which we delete the associated node.
In this idea, the dropout is to be applied to weight matrices for skip connections and (in RNNs) to time-delay connections with a bigger-than-one time delay. These are connections which won't exist in the final network and which exist only early in training. The idea is that allowing the network to depend on these skip-connections early on in training will help it to later learn that same information via the regular channels (e.g. in a regular DNN, through the non-skip connections; or in an RNN, through the regular recurrence). Imagine, for instance, that there is useful information to be had from 10 frames ago in an RNN. By giving the rest of the network a "taste" of this information, then gradually reducing its availability, we encourage the network to find ways to get that information through the recurrent connections.
The dropout-proportion would most likely increase from somewhere around 0.0-0.5 at the start of training to 1.0 perhaps a third of the way through the iterations. Once the dropout-proportion reaches 1.0, the corresponding weight matrix could be removed entirely, since it has no effect.
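As a rough illustration only (the linear shape and the exact numbers here are assumptions, not something the proposal pins down), the per-iteration schedule could be computed like this:

```c++
#include <algorithm>

// Illustrative schedule for the ephemeral-connection dropout proportion:
// linear from an assumed initial value up to 1.0 by an assumed fraction of
// the total iterations, then held at 1.0 (at which point the ephemeral
// component can be deleted).  The default initial value (0.5) and fraction
// (1/3) are examples only.
float EphemeralDropoutProportion(int iter, int num_iters,
                                 float initial_proportion = 0.5f,
                                 float full_by_fraction = 1.0f / 3.0f) {
  float progress = static_cast<float>(iter) /
                   (full_by_fraction * static_cast<float>(num_iters));
  return std::min(1.0f, initial_proportion +
                            (1.0f - initial_proportion) * progress);
}
```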
Implementing this will require implementing a DropoutComponent in nnet-simple-component.h.
This will be a little similar to the dropout component in nnet2, except there will be no concept of a scale-- it will scale its input by zero or one, doing dropout with scale==0 with a specified probability. Be careful with the flags returned by Properties()... the backprop needs both the input and output values to be present so that it can figure out the scaling factor [kBackpropNeedsInput|kBackpropNeedsOutput]. This component will have a function to set the dropout probability; and you'll declare in nnet-utils.h a function
void SetDropoutProbability(BaseFloat dropout_prob, Nnet *nnet);
and a corresponding option to nnet3-copy, which calls this function, that can be used to set the dropout prob from the command line.
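A minimal standalone sketch of the intended behaviour follows. This is not the actual Kaldi nnet3 Component interface (it operates on plain std::vector rather than CuMatrix, and the class and method names are only illustrative); it is just meant to show the forward/backward logic described above.

```c++
#include <random>
#include <vector>

// Standalone sketch of the proposed dropout behaviour.  The forward pass
// multiplies each value by 0 or 1 (no 1/(1-p) rescaling, matching the
// "no concept of a scale" remark above); the backprop recovers the 0/1 mask
// from the stored input and output values, which is why the real component
// would need kBackpropNeedsInput|kBackpropNeedsOutput.
struct DropoutSketch {
  float dropout_proportion = 0.5f;  // probability of zeroing each value
  std::mt19937 rng{0};

  void SetDropoutProportion(float p) { dropout_proportion = p; }

  // Forward: out[i] = mask[i] * in[i], where mask[i] is 0 with probability
  // dropout_proportion and 1 otherwise.
  void Propagate(const std::vector<float> &in, std::vector<float> *out) {
    std::bernoulli_distribution keep(1.0 - dropout_proportion);
    out->resize(in.size());
    for (size_t i = 0; i < in.size(); ++i)
      (*out)[i] = keep(rng) ? in[i] : 0.0f;
  }

  // Backprop: the per-element scale (0 or 1) is recovered as out/in wherever
  // in != 0; the input derivative is that scale times the output derivative.
  void Backprop(const std::vector<float> &in, const std::vector<float> &out,
                const std::vector<float> &out_deriv,
                std::vector<float> *in_deriv) const {
    in_deriv->resize(in.size());
    for (size_t i = 0; i < in.size(); ++i) {
      float scale = (in[i] != 0.0f) ? out[i] / in[i] : 0.0f;
      (*in_deriv)[i] = scale * out_deriv[i];
    }
  }
};
```

The SetDropoutProbability() function in nnet-utils.h would then just loop over the network's components and call the corresponding setter on every component of this type.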
Removal of components once the dropout probability rises to 1.0 can be accomplished by nnet3-init with a suitable config. This can't actually remove the component yet [we'd need to implement config commands like
delete-component <component-name>
and delete-node <node-name>
], but you can replace the node input descriptors in such a way as to 'orphan' the component and its node, so it won't actually participate in the computation.

In order to facilitate the removal of nodes, it will be easiest if, instead of splicing together the dropped-out skip-connection with the regular time-spliced inputs, you use
Sum(..)
after the affine component, to sum together the output of the regular affine layer with the output of an affine layer whose input is the dropped-out skip-connection. That is, instead of modifying the regular TDNN layer, which looks like:
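(The config examples from the original issue are not reproduced here, so the following three sketches are illustrative reconstructions with hypothetical component names -- tdnn5.affine, tdnn5.relu, tdnn5.renorm, skip.dropout, skip.affine -- and hypothetical time offsets.) A plain TDNN layer might look roughly like:

```
component-node name=tdnn5.affine component=tdnn5.affine input=Append(Offset(tdnn4.renorm, -3), tdnn4.renorm, Offset(tdnn4.renorm, 3))
component-node name=tdnn5.relu component=tdnn5.relu input=tdnn5.affine
component-node name=tdnn5.renorm component=tdnn5.renorm input=tdnn5.relu
```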
to look like:
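i.e. with the dropped-out skip-connection spliced into the same affine component's input, roughly:

```
component-node name=skip.dropout component=skip.dropout input=Offset(tdnn2.renorm, -6)
component-node name=tdnn5.affine component=tdnn5.affine input=Append(Offset(tdnn4.renorm, -3), tdnn4.renorm, Offset(tdnn4.renorm, 3), skip.dropout)
component-node name=tdnn5.relu component=tdnn5.relu input=tdnn5.affine
component-node name=tdnn5.renorm component=tdnn5.renorm input=tdnn5.relu
```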
instead you should modify it to look like:
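i.e. with a separate affine component on the dropped-out skip-connection, whose output is summed with the regular affine output after the affine stage (again, component names are illustrative):

```
component-node name=skip.dropout component=skip.dropout input=Offset(tdnn2.renorm, -6)
component-node name=skip.affine component=skip.affine input=skip.dropout
component-node name=tdnn5.affine component=tdnn5.affine input=Append(Offset(tdnn4.renorm, -3), tdnn4.renorm, Offset(tdnn4.renorm, 3))
component-node name=tdnn5.relu component=tdnn5.relu input=Sum(tdnn5.affine, skip.affine)
component-node name=tdnn5.renorm component=tdnn5.renorm input=tdnn5.relu
```

With this structure, orphaning the ephemeral path once the dropout proportion reaches 1.0 is just a matter of changing the tdnn5.relu node's input descriptor back to tdnn5.affine; skip.dropout and skip.affine then no longer participate in the computation.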
Initially I wouldn't worry too much about making the scripts too nice, since this may not even work.
In what I've sketched out above, I've assumed that the dropout precedes the affine component. In fact, it might be better to have the dropout follow the affine component. The reason for this relates to the bias term in the affine component: by discarding the dropout path we won't get a network that's equivalent to the original network with dropout-probability of 1.0, because of the bias term. And we don't have a convenient way to get rid of the bias term (there is no NaturalGradientLinearComponent implemented). This problem disappears if the dropout comes after the affine component. If this turns out to be useful we can figure out how to solve this in a more elegant way later on.
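To spell out the bias issue in symbols (with W and b the skip path's affine parameters, x its input, and m in {0, 1} the dropout mask; this is just a restatement of the argument above, not notation used in the codebase):

```latex
% Dropout before the affine component: at dropout proportion 1.0 (m = 0) the
% skip path still contributes the bias, so deleting it changes the network:
y_{\mathrm{skip}} = W(m \odot x) + b \;\xrightarrow{\,m = 0\,}\; b \quad (\text{generally} \neq 0)
% Dropout after the affine component: the contribution vanishes entirely,
% so the path can be deleted without changing the computation:
y_{\mathrm{skip}} = m \odot (Wx + b) \;\xrightarrow{\,m = 0\,}\; 0
```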