
Implementation of RNN #89

Closed
RatanRSur opened this issue Aug 23, 2015 · 14 comments

@RatanRSur

So I want to work on adding RNN functionality mainly to help myself understand them better and to do something of a larger scale in Julia! I did want to open this issue though so that there would be a forum for discussion about implementation.

Here are my current thoughts. I don't know if they're consistent with Mocha's architecture, or even with the principles of RNNs, since I only spent a little time getting acquainted, but here goes. Please point out any of my misunderstandings!

RNN Specific Stuff

  • Not strictly a single forward and backward pass
  • Backprop is instead unrolled through time, which essentially yields an "equivalent" feed-forward net whose depth depends on the number of time steps being backpropagated through (see the sketch after this list)
  • LSTM to prevent exploding/vanishing gradients
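
To make the unrolling concrete, here is a minimal, self-contained sketch of backprop through time for a vanilla RNN in plain Julia (not Mocha code; the function name, dimensions, and loss are made up for illustration). The forward pass unrolls the recurrence over T steps, and the backward pass walks the same chain in reverse, accumulating gradients into the single shared weight matrices:

```julia
# Minimal BPTT sketch for a vanilla RNN (plain Julia, hypothetical names):
# h_t = tanh.(Wxh * x_t .+ Whh * h_{t-1}),  loss = 0.5 * sum((h_T - target).^2)
function bptt(Wxh, Whh, xs, h0, target)
    T  = length(xs)
    hs = [h0]
    for t in 1:T                                  # forward: unroll over time
        push!(hs, tanh.(Wxh * xs[t] .+ Whh * hs[end]))
    end

    dWxh, dWhh = zero(Wxh), zero(Whh)
    dh = hs[end] .- target                        # dL/dh_T
    for t in T:-1:1                               # backward through the unrolled net
        dpre  = dh .* (1 .- hs[t + 1] .^ 2)       # through the tanh nonlinearity
        dWxh .+= dpre * xs[t]'                    # same weights at every step,
        dWhh .+= dpre * hs[t]'                    # so gradients accumulate
        dh    = Whh' * dpre                       # pass gradient to h_{t-1}
    end
    return dWxh, dWhh
end

# toy usage with random data
Wxh, Whh = randn(4, 3), randn(4, 4)
xs = [randn(3) for _ in 1:5]
dWxh, dWhh = bptt(Wxh, Whh, xs, zeros(4), randn(4))
```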

Topology of an RNN in Mocha

To my understanding, there are split layers which allow a layer's output to be sent to two different layers and still play nicely with backprop. An RNN implementation would likely need to use this. Additionally, would something like a join layer be necessary?

Caffe

I think BVLC/caffe#1873 is the relevant thread from Caffe.

If I'm understanding correctly, one of the inputs to a recurrent layer is a stream that represents the past states of that layer. Understandably, the forward prop is exact, as it only depends on the current value of an input layer and the most recent past value, presumably stored at one end of the stream. He mentions, however, that the backprop is approximate. This is the part I don't understand at all: how is the backprop being approximated?

Thanks for reading!

@jskDr

jskDr commented Aug 23, 2015

It is great news to hear about an RNN implementation for Mocha. I guess it will be useful for modeling highly nonlinear chemical molecule properties.


@pluskid

pluskid commented Aug 23, 2015

Regarding the approximate gradient computation issue in the Caffe discussion: it is because backpropagation through time gets truncated at the boundary of minibatches when the sequence is longer than the minibatch size, so the gradient is approximate.
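
A toy illustration of that truncation (plain Julia, not Mocha or Caffe code; all names here are hypothetical): the same gradient is computed once with BPTT over the whole sequence and once with the backward pass stopped after a minibatch-sized window of k steps. The forward pass, and hence the carried hidden state, stays exact, but the gradient terms that would flow through earlier minibatches are dropped, which is the approximation.

```julia
using LinearAlgebra

# Gradient of a toy loss w.r.t. the recurrent weights Whh, backpropagating
# through only the last k time steps (k = length(xs) gives full, exact BPTT).
function grad_Whh(Wxh, Whh, xs, h0, target; k = length(xs))
    T  = length(xs)
    hs = [h0]
    for t in 1:T                                   # forward pass is always exact
        push!(hs, tanh.(Wxh * xs[t] .+ Whh * hs[end]))
    end
    dWhh = zero(Whh)
    dh   = hs[end] .- target
    for t in T:-1:(T - k + 1)                      # truncation: stop after k steps
        dpre  = dh .* (1 .- hs[t + 1] .^ 2)
        dWhh .+= dpre * hs[t]'
        dh    = Whh' * dpre
    end
    return dWhh
end

Wxh, Whh = randn(4, 3), randn(4, 4)
xs, tgt  = [randn(3) for _ in 1:30], randn(4)
exact     = grad_Whh(Wxh, Whh, xs, zeros(4), tgt)          # full BPTT
truncated = grad_Whh(Wxh, Whh, xs, zeros(4), tgt; k = 10)  # minibatch-sized window
println(norm(exact - truncated) / norm(exact))             # nonzero truncation error
```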

@RatanRSur

Ah, ok. Any other comments on my post?

@Andy-P

Andy-P commented Aug 24, 2015

Just thought I would mention that there is a pure Julia implementation of various RNN models (RNN, LSTM, etc.) in the RecurrentNN.jl package.

https://github.com/Andy-P/RecurrentNN.jl

That might be a useful starting point.

Andre

@jskDr

jskDr commented Aug 24, 2015

That's wonderful information.


@RatanRSur

Thanks for pointing me to that @Andy-P, I'll definitely take a look at those when I need help with the conceptual stuff :)

@pluskid

pluskid commented Aug 24, 2015

@RatanRSur Thanks for your interest in doing this! Some other comments:

  • Yes, a SplitLayer is needed to pass the output blob both to the current time step's outputs and to the input of the next time step. I do not think a join layer is needed, because ultimately the outputs at each time step go to the loss layer, and Mocha automatically accumulates all the losses.
  • When unrolled in time, an RNN becomes an ordinary deep net with a large depth (depth = number of time steps unrolled). You will need to use the parameter sharing mechanism so that the unrolled layers share the same parameters (see the sketch below).
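
A rough, untested sketch of what those two points could look like for two unrolled time steps. The constructors and keyword names (InnerProductLayer, SplitLayer, bottoms, tops, param_key) are recalled from the Mocha docs and should be treated as assumptions to verify; how the input at step 2 gets combined with the recurrent blob is deliberately left out.

```julia
using Mocha

# time step 1: hidden layer, then split its output so it can feed both the
# loss side of the net and the next time step (the SplitLayer point above)
hidden_t1 = InnerProductLayer(name="hidden-t1", param_key="hidden-shared",
                              output_dim=128, bottoms=[:x_t1], tops=[:h_t1])
split_t1  = SplitLayer(name="split-t1", bottoms=[:h_t1],
                       tops=[:h_t1_to_loss, :h_t1_to_next])

# time step 2: the same param_key means this unrolled copy shares weights with
# hidden-t1 (the parameter-sharing point above). The blob :x_t2_with_h_t1 is a
# stand-in for however :x_t2 and :h_t1_to_next get combined.
hidden_t2 = InnerProductLayer(name="hidden-t2", param_key="hidden-shared",
                              output_dim=128, bottoms=[:x_t2_with_h_t1],
                              tops=[:h_t2])
```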

@RatanRSur reopened this Aug 24, 2015
@RatanRSur

Oops, didn't mean to close the issue.

@jskDr

jskDr commented Aug 24, 2015

It happens sometimes. That's okay.

Now, I have a question: is there anything special in Mocha yet that would support adopting RNNs? Maybe not, right?


@RatanRSur

Assuming a simple net like:

X_t-1 ----> Y_t-1 ----> H_t-1
                          |
                          V
X_t   ----> Y_t   ----> H_t

So, in some way, the user specifies the recurrence of the hidden layer (more on this later) and it is converted into the unrolled RNN by Net.jl? Is this what the solver eventually sees?
[image: imag0486 (https://cloud.githubusercontent.com/assets/4733314/9446758/4bab76be-4a61-11e5-8cae-0eefff87003b.jpg)]

Regarding designating a layer as recurrent, I'm guessing this would be implemented through a characterization?

@jskDr

jskDr commented Aug 24, 2015

Thank you for sharing your model, Ratan. It seems to be a canonical form of RNN. Is my understanding right?

Then, while we are implementing RNN, we could include modes for both the full RNN and the canonical RNN. In linear models, if I remember correctly, this is called a decision feedback model.


@pluskid

pluskid commented Aug 25, 2015

@RatanRSur Yes, conceptually the unrolled network looks exactly like what you described.

@pluskid

pluskid commented Nov 13, 2015

For those who are interested in RNN/LSTM in Julia: please check out the char-rnn LSTM example in MXNet.jl. It uses explicit unrolling, so everything fits in the current FeedForward model and multi-GPU training can be used directly. For more general-purpose variable-length RNNs without unrolling, we will still need to develop the modeling interface. I will add a tutorial document soon.

@RatanRSur

Awesome, thanks!

@pluskid closed this as completed Mar 23, 2016