Dropout for ephemeral connections #988

Closed
danpovey opened this issue Aug 21, 2016 · 17 comments
Labels
stale Stale bot on the loose

Comments

@danpovey
Contributor

I have an idea that I'd like help to implement in nnet3. I'm not sure who to ask to work on this.

It's related to dropout, but it's not normal dropout. With normal dropout we'd typically have a schedule where we start off with dropout (e.g. a dropout probability of 0.5) and gradually make the node more like a regular non-stochastic node (dropout probability of 0.0). In this idea we start off with not much dropout, and end up dropping out with probability 1.0, after which we delete the associated node.

In this idea, the dropout is to be applied to weight matrices for skip connections and (in RNNs) to time-delay connections with a bigger-than-one time delay. These are connections which won't exist in the final network and which exist only near the start of training. The idea is that allowing the network to depend on these skip-connections early on in training will help it to later learn that same information via the regular channels (e.g. in a regular DNN, through the non-skip connections; or in an RNN, through the regular recurrence). Imagine, for instance, that there is useful information to be had from 10 frames ago in an RNN. By giving the rest of the network a "taste" of this information, then gradually reducing its availability, we encourage the network to find ways to get that information through the recurrent connections.

The dropout proportion would most likely start somewhere between 0.0 and 0.5 at the beginning of training and increase to 1.0 perhaps a third of the way through the iterations. Once the dropout proportion reaches 1.0, the corresponding weight matrix could be removed entirely, since it has no effect.
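For concreteness, a sketch of one possible realization of this schedule; the function name and the exact numbers are illustrative and not part of any existing script (BaseFloat and int32 are the usual Kaldi typedefs):

#include <algorithm>

// Illustrative schedule: start at a dropout proportion of 0.5, reach 1.0
// about a third of the way through the training iterations, then stay at 1.0
// (at which point the skip-connection can be removed).
BaseFloat EphemeralDropoutProb(int32 iter, int32 num_iters) {
  BaseFloat ramp_end = num_iters / 3.0;
  BaseFloat frac = iter / ramp_end;
  return std::min<BaseFloat>(1.0, 0.5 + 0.5 * frac);
}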

Implementing this will require implementing a DropoutComponent in nnet-simple-component.h.
This will be a little similar to the dropout component in nnet2, except there will be no concept of a scale-- it will scale its input by zero or one, doing dropout (i.e. scale == 0) with a specified probability. Be careful with the flags returned by Properties(): the backprop needs both the input and output values to be present so that it can figure out the scaling factor [kBackpropNeedsInput|kBackpropNeedsOutput]. This component will have a function to set the dropout probability, and you'll declare in nnet-utils.h a function
void SetDropoutProbability(BaseFloat dropout_prob, Nnet *nnet);
and a corresponding option to nnet3-copy, which calls this function and can be used to set the dropout probability from the command line.
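A minimal sketch of what that helper might look like, assuming nnet3's Nnet::NumComponents()/GetComponent() accessors; DropoutComponent and its SetDropoutProportion() setter are part of the proposal, not existing code:

// Proposed nnet-utils.h helper: walk the network's components and update the
// dropout probability on anything that is a DropoutComponent.
void SetDropoutProbability(BaseFloat dropout_prob, Nnet *nnet) {
  for (int32 c = 0; c < nnet->NumComponents(); c++) {
    Component *comp = nnet->GetComponent(c);
    DropoutComponent *dropout = dynamic_cast<DropoutComponent*>(comp);
    if (dropout != NULL)
      dropout->SetDropoutProportion(dropout_prob);  // hypothetical setter on the proposed component
  }
}

From the command line this would then be exposed through the proposed nnet3-copy option, e.g. something like "nnet3-copy --set-dropout-probability=0.7 in.raw out.raw" (the option name here is illustrative).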

Removal of components once the dropout probability rises to 1.0 (i.e. once the connection's contribution has fallen to zero) can be accomplished by nnet3-init with a suitable config; see the sketch after the config examples below. This can't actually remove the component yet [we'd need to implement config commands like delete-component <component-name> and delete-node <node-name>], but you can replace the node input descriptors in such a way as to 'orphan' the component and its node, so it won't actually participate in the computation.

In order to facilitate the removal of nodes, it will be easiest if, instead of splicing the dropped-out skip-connection together with the regular time-spliced inputs, you use Sum(..) after the affine component to sum the output of the regular affine layer with the output of an affine layer whose input is the dropped-out skip-connection.
That is, instead of modifying the regular TDNN layer, which looks like:

component-node name=Tdnn_4_affine component=Tdnn_4_affine input=Append(Offset(Tdnn_3_renorm, -3) , Offset(Tdnn_3_renorm, 3))
component-node name=Tdnn_4_relu component=Tdnn_4_relu input=Tdnn_4_affine
component-node name=Tdnn_4_renorm component=Tdnn_4_renorm input=Tdnn_4_relu
...

to look like:

component-node name=Tdnn_4_affine component=Tdnn_4_affine input=Append(Offset(Tdnn_3_renorm, -3) , Offset(Tdnn_3_renorm, 3), Tdnn_2_dropout)
# note: Tdnn_2_dropout is the same as Tdnn_2_renorm but followed by dropout.
component-node name=Tdnn_4_relu component=Tdnn_4_relu input=Tdnn_4_affine
component-node name=Tdnn_4_renorm component=Tdnn_4_renorm input=Tdnn_4_relu
...

instead you should modify it to look like:

component-node name=Tdnn_4_affine component=Tdnn_4_affine input=Append(Offset(Tdnn_3_renorm, -3) , Offset(Tdnn_3_renorm, 3))
component-node name=Tdnn_4_relu component=Tdnn_4_relu input=Sum(Tdnn_4_affine, Tdnn_2_dropout_affine)
component-node name=Tdnn_4_renorm component=Tdnn_4_renorm input=Tdnn_4_relu
# note: Tdnn_2_dropout_affine is the same as Tdnn_2_renorm but followed by dropout then an affine component.
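For example, once the dropout proportion reaches 1.0, orphaning the skip path amounts to re-defining the consuming node so it no longer references Tdnn_2_dropout_affine; assuming that re-specifying an existing node in a config passed to nnet3-init replaces its input descriptor, the config would just be:

component-node name=Tdnn_4_relu component=Tdnn_4_relu input=Tdnn_4_affine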

Initially I wouldn't worry too much about making the scripts too nice, since this may not even work.

In what I've sketched out above, I've assumed that the dropout precedes the affine component. In fact, it might be better to have the dropout follow the affine component. The reason relates to the bias term in the affine component: if the dropout comes first, then discarding the dropout path does not give a network equivalent to the original one with a dropout probability of 1.0, because the affine component's bias term would still have contributed. And we don't have a convenient way to get rid of the bias term (there is no NaturalGradientLinearComponent implemented). This problem disappears if the dropout comes after the affine component. If this turns out to be useful we can figure out how to solve this in a more elegant way later on.
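A hypothetical config fragment for that variant (the component names on the skip path are illustrative; the point is only that the dropout node comes after the affine node, so setting its proportion to 1.0 zeroes the whole contribution):

component-node name=Tdnn_2_skip_affine component=Tdnn_2_skip_affine input=Tdnn_2_renorm
component-node name=Tdnn_2_skip_dropout component=Tdnn_2_skip_dropout input=Tdnn_2_skip_affine
component-node name=Tdnn_4_relu component=Tdnn_4_relu input=Sum(Tdnn_4_affine, Tdnn_2_skip_dropout)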

@danpovey
Contributor Author

I am becoming more convinced that this method should be quite useful, and I'd really like someone (or people) to help with this.
@galv, is there any chance you could take on the problem of creating the dropout component for nnet3? It will just be a simplification of the nnet2 code, for the most part.
Then maybe @freewym can do the modification of the scripts and test this out.

@galv
Contributor

galv commented Sep 12, 2016

It seems reasonable. I would have to think more about whether it could be implemented quickly on GPU (it probably can be, despite having to cache values for backpropagation for a long time).

Overall, though, my main concern is that it strikes me that we could get easy gains by adopting other innovations from the neural-network literature, like batch norm, residual networks, and persistent RNNs (I don't quite understand persistent RNNs yet, but a 30x speed-up would make experimenting with RNNs a lot easier).

@danpovey
Contributor Author

> It seems reasonable. I would have to think more about whether it could be implemented quickly on GPU (it probably can be, despite having to cache values for backpropagation for a long time).

It's already been done in nnet2, and it's quite easy; no caching is needed-- the component just requires both the inputs and outputs, and can figure out the mask from that (see the sketch below).

> Overall, though, my main concern is that it strikes me that we could get easy gains by adopting other innovations from the neural-network literature, like batch norm, residual networks, and persistent RNNs (I don't quite understand persistent RNNs yet, but a 30x speed-up would make experimenting with RNNs a lot easier).

The natural gradient has a similar effect to batch norm, which is why I have not put effort into implementing that. The other things are not things that I have heard about in a speech context.

Dan
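A minimal sketch of that mask-recovery logic, using Kaldi's CuMatrix elementwise operations; this is not the actual nnet3 Backprop() signature, and the division would need care wherever the input is exactly zero:

// Since out_value = mask .* in_value with mask entries in {0, 1}, the mask can
// be recovered elementwise as out_value / in_value instead of being cached.
void DropoutBackpropSketch(const CuMatrixBase<BaseFloat> &in_value,
                           const CuMatrixBase<BaseFloat> &out_value,
                           const CuMatrixBase<BaseFloat> &out_deriv,
                           CuMatrixBase<BaseFloat> *in_deriv) {
  in_deriv->CopyFromMat(out_value);  // = mask .* in_value
  in_deriv->DivElements(in_value);   // recovers the 0/1 mask (where in_value != 0)
  in_deriv->MulElements(out_deriv);  // input derivative = mask .* output derivative
}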

@danpovey
Contributor Author

I had a look into ResNets... it's where every other layer, you have a skip connection to the layer before, with the identity matrix. That could very easily be incorporated into our TDNN setups, using the Sum() expressions in the config file. It's some very simple scripting.
@freewym, I don't know if you have time for this?
In the TDNN configs, we have lines like:

component-node name=Tdnn_5_relu component=Tdnn_5_relu input=Tdnn_5_affine

and you could easily change this to:

component-node name=Tdnn_5_relu component=Tdnn_5_relu input=Sum(Tdnn_5_affine, Tdnn_3_affine)

[and, say, only do this for odd-numbered components].
Of course, there are many other ways to do this, but this seems closest in spirit to the way it was originally done.
Anyway it's a very low-cost experiment.

Dan

@freewym
Contributor

freewym commented Sep 13, 2016

As far as I know, ResNets are especially beneficial for training very deep networks (hundreds of layers or more). Not sure if it would improve over the current TDNN setup. But anyway, I can pick it up.

Yiming

@danpovey
Contributor Author

Regarding ResNets (and this is also of broader interest): Vijay just pointed out to me this arXiv paper from Microsoft Research, http://arxiv.org/pdf/1609.03528v1.pdf, where they report the best-ever results on Switchboard, at 6.3%. A variety of ResNet is one of their systems, although it seems to involve convolutional concepts-- it's maybe not a standard feed-forward ResNet (but neither would ours be).

They are also using lattice-free MMI, and they cite our lattice-free MMI paper that I just presented at Interspeech... however, it is probably something they were doing already, as Geoff had implemented lattice-free MMI before at IBM; and it's at a conventional 10ms frame rate.

@pegahgh
Contributor

pegahgh commented Sep 13, 2016

Hi Dan. This idea in ResNet is exactly the same as my LADNN paper at MSR. Actually we proposed and submitted this idea to ICASSP before the ResNet publication, but they didn't cite our work, although it was internal MSR work and they should have cited our paper.
This is the reason that MSR people decided to put their papers on arXiv after submitting them to conferences-- and they submitted their own version of LF-MMI to ICASSP on Monday!!
Since it is close to what I did before, I'd like to pick up this issue if no one has already started working on it!

@danpovey
Contributor Author

OK sure. Yiming has tons of stuff to do anyway, I think.

@freewym
Contributor

freewym commented Sep 13, 2016

I am already working on it. Perhaps Pegah could try it on other configurations.

Yiming

@danpovey
Contributor Author

Oh OK.
Pegah, if you're interested in this area, maybe you could do that thing about the "ephemeral connections", e.g. implement the dropout component in nnet3. But I want to merge the dropout component before you do the experiments, because otherwise the project will be too big to easily review.
Dan

@pegahgh
Contributor

pegahgh commented Sep 13, 2016

Good idea! I can implement the Dropout component in the nnet3 setup.
@freewym Did you start using the bypass connections in the LSTM setup?

@freewym
Contributor

freewym commented Sep 13, 2016

Not yet. I am testing it on tdnn now.

@galv
Contributor

galv commented Sep 13, 2016

@pegahgh Since I've given this some thought already, feel free to ping me for code review or discussion of implementation.

@danpovey
Contributor Author

Makes sense. Pegah, if you put an early draft of the nnet3 dropout stuff up as a pull request, it will be fastest.

@danpovey
Contributor Author

danpovey commented Oct 2, 2016

Regarding the ephemeral connections: I just noticed that in this paper about highway LSTMs, https://arxiv.org/pdf/1510.08983.pdf, they start with dropout on these connections at 0.1 and increase it to 0.8-- so it's a bit like the proposed ephemeral-connections idea (except they don't increase the dropout all the way to 1 and then remove the component). Anyway, to me it confirms that there is something to the idea.

@stale

stale bot commented Jun 19, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Stale bot on the loose label Jun 19, 2020
@stale

stale bot commented Jul 19, 2020

This issue has been automatically closed by a bot strictly because of inactivity. This does not mean that we think that this issue is not important! If you believe it has been closed hastily, add a comment to the issue and mention @kkm000, and I'll gladly reopen it.

@stale stale bot closed this as completed Jul 19, 2020
@kkm000 kkm000 added the ask-dan label Jul 19, 2020