
GRU layer for ANN module #1018

Merged: 33 commits into mlpack:master from sumedhghaisas:gru, Aug 15, 2017

Conversation
sumedhghaisas (Member) commented Jun 5, 2017

Implementation of the Gated Recurrent Unit for recurrent networks in the ANN module.
Tests are implemented similarly to the LSTM tests.

#include "layer_types.hpp"
#include "add_merge.hpp"
#include "sequential.hpp"

zoq (Member) commented Jun 5, 2017

Would you mind citing the "Gated Feedback Recurrent Neural Networks" paper here, as we did for e.g. https://github.com/mlpack/mlpack/blob/master/src/mlpack/methods/ann/layer/elu.hpp?

sumedhghaisas (Author, Member) commented Jun 6, 2017

Ahh yes. I will add that first thing.
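For reference, a citation block in the style of elu.hpp might look like the following sketch (the reference data comes from the paper's arXiv listing; the exact block that ends up in gru.hpp may differ):

/**
 * For more information, see the following paper.
 *
 * @code
 * @article{Chung2015,
 *   author  = {Chung, Junyoung and Gulcehre, Caglar and Cho, Kyunghyun and
 *              Bengio, Yoshua},
 *   title   = {Gated Feedback Recurrent Neural Networks},
 *   journal = {arXiv preprint arXiv:1502.02367},
 *   year    = {2015}
 * }
 * @endcode
 */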

gradientStep(0),
deterministic(false)
{
// input specific linear layers(for zt, rt, ot)

zoq (Member) commented Jun 5, 2017

Can you take a look at https://github.com/mlpack/mlpack/wiki/DesignGuidelines#comments and use the correct format (punctuation) for each comment?

sumedhghaisas (Author, Member) commented Jun 6, 2017

Surely. I will add better documentation than that.

zoq (Member) commented Jun 6, 2017

I'm fine with the documentation; I'd just like to follow the style guidelines. So instead of "input specific linear layers(for zt, rt, ot)" we write "Input specific linear layers (for zt, rt, ot).". I think using equations/symbols from the paper is a great idea.

sumedhghaisas (Author, Member) commented Jun 7, 2017

Sure. Will do that.
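For reference, the documentation could use the standard GRU update equations with the zt, rt, ot symbols mentioned above; one common formulation (papers differ slightly on which term is weighted by z_t) is:

\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1}) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1}) \\
o_t &= \tanh(W_o x_t + U_o (r_t \circ h_{t-1})) \\
h_t &= (1 - z_t) \circ h_{t-1} + z_t \circ o_t
\end{aligned}

Here h_{t-1} corresponds to prevOutput in the code, and z_t, r_t, o_t to the zt, rt, ot linear layers.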

boost::apply_visitor(outputParameterVisitor,
hiddenStateModule));

prevOutput = output;

zoq (Member) commented Jun 5, 2017

I wonder whether we can avoid the extra copy if we do

if (!deterministic)
{
  outParameter.push_back(prevOutput);
}

here. The problem I see is that the first output is zero, but we could preallocate the vector, use forwardStep as the index, and put the first output into the preallocated slot.

sumedhghaisas (Author, Member) commented Jun 6, 2017

Hmm... I agree. We could use the last entry in outParameter rather than prevOutput, right? Or am I getting it wrong? I think it would save two copies if we push the temporarily computed output directly into outParameter with C++11 move semantics and then make a copy of that for returning. I think we can do the same thing for LSTM as well. What do you think?

zoq (Member) commented Jun 6, 2017

Yes, I was thinking the same; the impact might be minimal, but I think it's worth a test.
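A minimal sketch of the copy-avoidance idea being discussed, assuming an outParameter vector that always holds the previous step's output as its last entry (the function name and the stand-in gate computation are illustrative, not the merged code):

#include <armadillo>
#include <vector>

void StepForward(std::vector<arma::mat>& outParameter,
                 const arma::mat& input,
                 arma::mat& output)
{
  // Read the previous state from the container instead of a prevOutput copy.
  const arma::mat& prevOutput = outParameter.back();

  // Stand-in for the real GRU gate computations.
  arma::mat newOutput = arma::tanh(input + prevOutput);

  output = newOutput;                            // one copy for the caller
  outParameter.push_back(std::move(newOutput));  // stored without a second copy
}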

rcurtin (Member) commented Jun 6, 2017

You could also replace the std::vector<arma::mat> with an arma::cube; that may provide some very minor additional gain.

sumedhghaisas (Author, Member) commented Jun 7, 2017

Hmm... I think it will improve the locality of the data. I will try to run a benchmark.

zoq (Member) commented Jun 7, 2017

If we test the cube approach we should also test resize, since we do support a dynamic sequence length, e.g. rho might change over time.

zoq (Member) commented Jun 13, 2017

You said cube is slightly faster than std::vector; do you think it's worth introducing a parameter that allows us to select the 'right' container, e.g. if the sequence length is constant we could use cube?
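A small illustration of the container trade-off under discussion: a preallocated arma::cube keeps every step's output in one contiguous block, while a std::vector<arma::mat> allocates per step but grows naturally when rho changes. This is only a sketch for comparison, not the benchmark code.

#include <armadillo>
#include <vector>

int main()
{
  const arma::uword outSize = 8, rho = 16;

  // Cube: one contiguous allocation, indexed by the forward step.
  arma::cube outParameterCube(outSize, 1, rho, arma::fill::zeros);
  outParameterCube.slice(3) = arma::randu<arma::mat>(outSize, 1);

  // A dynamic sequence length needs an explicit (potentially costly) resize.
  outParameterCube.resize(outSize, 1, 2 * rho);

  // Vector of matrices: each step is a separate buffer, but push_back is cheap.
  std::vector<arma::mat> outParameterVec;
  outParameterVec.push_back(arma::zeros<arma::mat>(outSize, 1));

  return 0;
}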

{
DistractedSequenceRecallTestNetwork();
DistractedSequenceRecallTestNetwork<GRU<>>();

zoq (Member) commented Jun 5, 2017

I like the idea of reusing the existing test code, but can we tune the GRU parameters to speed up the test, or do we require the same settings to get stable results? The problem I see is the current runtimes on the sparc machine: http://masterblaster.mlpack.org/job/mlpack%20-%20nightly%20matrix%20build/arch=sparc64,armadillo=armadillo-5.500.2,buildmode=debug/lastCompletedBuild/testReport/mlpackTest/RecurrentNetworkTest/

I thought the latest change improved the timings, but it looks like that's not true, at least for the sparc machines.

sumedhghaisas (Author, Member) commented Jun 6, 2017

Sure. I will try to tweak the parameters for the GRU-specific tests. The current tests are just for validating the implementation. I have usually observed that GRU converges faster than LSTM; do you think we should add that as a test as well?

zoq (Member) commented Jun 6, 2017

I like the idea, but I'm not sure the test would provide much insight into the correctness of the GRU or LSTM implementation; in some situations the LSTM might converge faster than the GRU layer.

rcurtin (Member) left a comment

Looks good to me so far, but I think I might like to see some more runtime comparisons. I can see that the Travis build added ~3 minutes to the build time with the two GRU tests, so maybe it is a bit slow, or maybe those tests just take a long time.

I also think it would be good to add some more unit-test-like tests for GRU, like a test that checks Forward(), Backward(), and Gradient() with some inputs. It might be a little difficult to design those; I haven't thought it through. But that strategy could allow more fine-grained tests to see that the GRU is working right---it's always possible that there is some bug but the rest of the network is able to learn "around" the bug so that the EmbeddedReberGrammarTest still passes.

LayerTypes hiddenStateModule;

//! Locally-stored forget gate module.
LayerTypes forgetGateModule;

rcurtin (Member) commented Jun 6, 2017

A couple of comments here:

  • It might be better to use the actual type of the layer here, just for clarity in the definition.

  • It's possible that it may be better to manually implement each of the layers here. This could lead to a more optimized implementation of Forward() and Backward(). In order to determine whether it's necessary to take the time to do that, I'd suggest doing a quick benchmark of mlpack vs. Keras on some simple GRU network.

sumedhghaisas (Author, Member) commented Jun 7, 2017

I agree with changing the definition to the actual types. It will provide more clarity when reading the headers.

sumedhghaisas (Author, Member) commented Jun 7, 2017

About implementing Forward and Backward manually, I am not so sure. I did a quick and dirty check on the paper (writing down the equations involved) and did not find any operation that could be avoided with a manual implementation. All the matrices that are currently saved inside the sub-layers would still be saved in the GRU layer, and they all help in Backward and Gradient. We would only be saving on locality and function calls. Nonetheless, I will do the calculations again tomorrow, in case I missed something.

rcurtin (Member) commented Jun 7, 2017

If you don't see an easy route toward any optimization with hand implementation and the existing implementation is comparably fast to, e.g., Keras or whatever other tools, then I think it's fine as-is. :)


@sumedhghaisas sumedhghaisas force-pushed the sumedhghaisas:gru branch from 0bf1003 to ef26d94 Jun 12, 2017

@@ -183,7 +183,7 @@ void GenerateReber(const arma::Mat<char>& transitions, std::string& reber)
grammerIdx + 2)) - '0';
} while (idx != 0);

reber = "BPTVVE";
//reber = "BPTVVE";

zoq (Member) commented Jun 21, 2017

Nice catch!

/**
* Train the specified networks on a Reber grammar dataset.
*/
BOOST_AUTO_TEST_CASE(ReberGrammarTest)
BOOST_AUTO_TEST_CASE(GRUReberGrammarTest)

zoq (Member) commented Jun 21, 2017

I end up with the error Mat::rows(): indices out of bounds or incorrectly used when running RecurrentNetworkTest/GRUReberGrammarTest; I wonder if you forgot to check in something crucial?

zoq (Member) commented Jun 21, 2017

Same error for LSTMReberGrammarTest.

@sumedhghaisas sumedhghaisas force-pushed the sumedhghaisas:gru branch from e241f5e to 9d06d22 Jun 23, 2017

namespace ann /** Artificial Neural Network. */ {

template <
typename InputDataType = arma::mat,

zoq (Member) commented Jun 25, 2017

Can you comment the template parameters?

@@ -0,0 +1,307 @@
/**
* @file lstm_impl.hpp

zoq (Member) commented Jun 25, 2017

Minor issue, use the correct file name.

* @file lstm_impl.hpp
* @author Sumedh Ghaisas
*
* Implementation of the LSTM class, which implements a lstm network

zoq (Member) commented Jun 25, 2017

Use the correct layer description.

#define MLPACK_METHODS_ANN_LAYER_GRU_IMPL_HPP

// In case it hasn't yet been included.
#include "linear.hpp"

zoq (Member) commented Jun 25, 2017

We should include gru.hpp here, right? It looks like I made the same mistake in lstm_impl.hpp.

arma::mat modInput = (boost::apply_visitor(outputParameterVisitor,
forgetGateModule) % *prevOutput);

// Pass that through the outputHidden2GateModule

zoq (Member) commented Jun 25, 2017

Pedantic style comment: missing . at the end.

0, 0, 1 * outSize - 1, 0)), std::move(boost::apply_visitor(
outputParameterVisitor, inputGateModule))), inputGateModule);

// Pass the second through forgetGate

zoq (Member) commented Jun 25, 2017

Pedantic style comment: missing . at the end.

hiddenStateModule);

// Update the output (nextOutput): cmul1 + cmul2
// Wwhere cmul1 is input gate * prevOutput and

zoq (Member) commented Jun 25, 2017

Looks like you mean "Where".

void GRU<InputDataType, OutputDataType>::ResetCell()
{
outParameter.clear();
outParameter.push_back(arma::zeros<arma::mat>(outSize, 1));

zoq (Member) commented Jun 25, 2017

I was wondering if we could preserve the existing data and only allocate if list.size() < rho. That way we don't have to push arma::zeros<arma::mat>(outSize, 1) again?

sumedhghaisas (Author, Member) commented Jun 26, 2017

I am sorry, I did not get that. You mean we clear the list up to the first element?

zoq (Member) commented Jun 26, 2017

Yes, not sure that is feasible or even necessary, just an idea.

rcurtin (Member) commented Jul 10, 2017

Correct me if I am misunderstanding---I think the first element of outParameter will always be a matrix of zeros and it is never modified. This leads me to three thoughts:

  1. If outParameter.front() is always arma::zeros<arma::mat>(outSize, 1) and outSize does not change between calls to ResetCell(), we can just do:

arma::mat tmp = std::move(outParameter[0]);
outParameter.clear();
outParameter.push_back(arma::mat()); // not 100% sure on the syntax here, I think this is right
outParameter.front() = std::move(tmp); // maybe this could be compressed with the previous line but you get the idea

  2. In another place you also push back a zeros matrix onto outParameter. If outParameter held pointers to matrices instead, you could do outParameter.push_back(outParameter.front()). But you could also create an alias matrix of outParameter.front(), with one of the special Armadillo constructors, and push that onto the list.

  3. Would it be possible to avoid allocating the zeros matrix entirely, and then detect when you are at the front of outParameter? You use outParameter in code like

arma::mat d_zt = gy % (*backIterator - stuff);

but if we could instead do something like

if (we are at the first element)
  arma::mat d_zt = gy % stuff;
else
  arma::mat d_zt = gy % (*backIterator - stuff);

then we can save a possibly significant amount of computation. Since we are pushing the zeros matrix into something of type arma::mat, it is no longer known at compile time that the matrix only has zeros. Thus Armadillo is not able to see something like *backIterator - stuff, know that *backIterator is actually a matrix of zeros, and skip the unnecessary loop over the values.

The third option is the most complex of the three, but it will result in the most speedup, so I think it would be worth it to implement. If you don't want to do that now, we should open an issue for it. But at the very least we should do #1 (and #2), plus possibly the improvement suggested by Marcus above.

rcurtin (Member) commented Jun 28, 2017

Hey Sumedh, I wanted to ask what the status of this PR was. Based on the weekly report it seems like you are done with GRU, but maybe there is something I misunderstood? Let me know, I am looking forward to when we can merge this. :)

@sumedhghaisas sumedhghaisas force-pushed the sumedhghaisas:gru branch from 6c5ed3e to c1c8b76 Jun 28, 2017

sumedhghaisas (Author, Member) commented Jun 28, 2017

Hey Ryan. If the current checks all pass, the branch is ready to be merged. :)

@sumedhghaisas sumedhghaisas force-pushed the sumedhghaisas:gru branch from c1c8b76 to 3a28b3e Jun 28, 2017

@@ -0,0 +1,246 @@
/**
* @file gru.hpp
* @authorSumedh Ghaisas

zoq (Member) commented Jun 29, 2017

Minor style issue, missing space between tag and author.

/**
* An implementation of a gru network layer.
*
* This class allows specification of the type of the activation functions used

zoq (Member) commented Jun 29, 2017

The comment doesn't match the layer.

@@ -572,8 +583,8 @@ void DistractedSequenceRecallTestNetwork()
{
RNN<MeanSquaredError<> > model(rho);
model.Add<IdentityLayer<> >();
model.Add<Linear<> >(inputSize, 14);
model.Add<LSTM<> >(14, 7, rho);
model.Add<Linear<> >(inputSize, 16);

zoq (Member) commented Jun 29, 2017

I wonder if a hidden size < 16 would pass the test too?

sumedhghaisas (Author, Member) commented Jul 3, 2017

I tried some tests... to my surprise, the Reber grammar tests pass with just 1 unit each. The distracted sequence tests pass with 3 units.

rcurtin (Member) commented Jul 2, 2017

@sumedhghaisas: do you think you could add some unit tests for the GRU, like for the Gradient(), Forward(), and Backward() functions? It might be a little tricky to do this outside of a network, but I think it should be possible, and I think that will be a good supplement to the grammar tests.

It looks like there are some other comments to be addressed too; if you can take care of those, we can get closer to merging this.

sumedhghaisas (Author, Member) commented Jul 3, 2017

Hey Ryan. I am not sure how I would write the Forward, Backward, and Gradient tests, because it would involve setting weights for every parameter used in the cell; random initialization would give different results. What exactly do you mean by a Forward test?

zoq (Member) commented Jul 3, 2017

What about checking the gradient, as done in ann_layer_test.cpp? We could also test the forward function using some generated input and compare against manually calculated values.

sumedhghaisas (Author, Member) commented Jul 4, 2017

So, according to the current implementation, the parameters have to be accessed through the model stored inside the layer. If we set all the parameters to the value '1', we can check the first couple of outputs by sending in '1's as inputs. Is that a good idea?

zoq (Member) commented Jul 5, 2017

Sounds reasonable to me.

/**
* This class is used to initialize randomly the weight matrix.
*/
class UnitInitialization

zoq (Member) commented Jul 7, 2017

This class is only for testing, right? So I think we can use the RandomInitialization class and set lowerBound and upperBound to 1. What do you think?

sumedhghaisas (Author, Member) commented Jul 8, 2017

Yeah, we can. I thought a separate initialization policy might be simpler to read, and it could also be used for different purposes later. The better design would be to have a constant initialization policy and make zero and one special cases. Is that better?

zoq (Member) commented Jul 9, 2017

ConstInitialization sounds good to me.

@@ -202,6 +202,8 @@ class RNN
//! Modify the initial point for the optimization.
arma::mat& Parameters() { return parameter; }

size_t& Rho() { return rho; }

zoq (Member) commented Jul 7, 2017

Can you add a comment here?

/**
* GRU layer manual forward test.
*/
BOOST_AUTO_TEST_CASE(ForwardGRULayerTest)

zoq (Member) commented Jul 7, 2017

Can we add a similar test for the backward function?

sumedhghaisas (Author, Member) commented Jul 8, 2017

Umm... I can. But I thought that if the gradients are right at each layer, then Backward has to be correct, right?

zoq (Member) commented Jul 9, 2017

Yeah, you are right, it's just an additional test.

rcurtin (Member) commented Jul 10, 2017

I took a look at Backward(); I think the code there is non-trivial, not just a call to Gradient() or anything. So even if the gradient test is passing, I think it would still be a good idea to write a test for Backward().

@sumedhghaisas sumedhghaisas force-pushed the sumedhghaisas:gru branch from 3545af9 to af06340 Jul 10, 2017

*/
class ZeroInitialization
template<size_t initConst>

zoq (Member) commented Jul 10, 2017

I think we should use the constructor to pass the constant value; that way we could also use double as the type and pass e.g. 0.5.

rcurtin (Member) commented Jul 10, 2017

Agreed, I think the time savings from using a compile-time constant aren't going to be noticeable in this case, and it's pretty restrictive to disallow non-size_t values.

sumedhghaisas (Author, Member) commented Jul 11, 2017

I agree. Will make the appropriate change.
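A minimal sketch of what the constructor-based constant initialization policy could look like (the class name and exact interface here are assumptions; the version merged into mlpack may differ):

#include <mlpack/prereqs.hpp>

namespace mlpack {
namespace ann {

class ConstInitialization
{
 public:
  //! Initialize every weight to the given constant (e.g. 0.0, 0.5, or 1.0).
  ConstInitialization(const double initVal = 0.0) : initVal(initVal) { }

  //! Fill the given weight matrix with the constant value.
  void Initialize(arma::mat& W, const size_t rows, const size_t cols)
  {
    W.set_size(rows, cols);
    W.fill(initVal);
  }

 private:
  //! The value assigned to every element.
  double initVal;
};

} // namespace ann
} // namespace mlpack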

rcurtin (Member) left a comment

Thanks for adding the tests, I think it is a big improvement. Just a few more comments. I think this is getting close to ready to merge, at least from my side.

ar & data::CreateNVP(weights, "weights");
ar & data::CreateNVP(inSize, "inSize");
ar & data::CreateNVP(outSize, "outSize");
ar & data::CreateNVP(rho, "rho");

rcurtin (Member) commented Jul 10, 2017

Do we also need to serialize network here?

@@ -94,6 +94,11 @@ class LSTM
arma::Mat<eT>&& /* error */,
arma::Mat<eT>&& /* gradient */);

/*
* Resets the cell(BPTT) to accept a new input.

rcurtin (Member) commented Jul 10, 2017

Can you clarify this comment a little? Specifically, the (BPTT) doesn't make sense to me; would it be better to say Resets the cell to accept a new input. This is part of backpropagation through time (BPTT).?

@@ -779,6 +780,92 @@ BOOST_AUTO_TEST_CASE(GradientLSTMLayerTest)
}

/**
* GRU layer numerically gradient test.

rcurtin (Member) commented Jul 10, 2017

This is a nice test, thanks for writing it. It might be good to expand the comment a little bit here; it took me a few minutes to fully understand it. To me it looks like the test ensures that an approximation of the gradient is close enough to the actual gradient returned by the GRU<> layer.
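The idea is the usual central-difference check. A rough standalone sketch, in which the loss callback is assumed to re-run the network on the current contents of parameters (the names and tolerance handling are illustrative, not the exact code in ann_layer_test.cpp):

#include <armadillo>
#include <functional>

double GradientRelativeError(const std::function<double()>& loss,
                             arma::mat& parameters,
                             const arma::mat& analyticGradient,
                             const double eps = 1e-6)
{
  arma::mat numericGradient(arma::size(parameters), arma::fill::zeros);

  for (arma::uword i = 0; i < parameters.n_elem; ++i)
  {
    const double original = parameters(i);

    parameters(i) = original + eps;   // f(w + eps)
    const double lossPlus = loss();

    parameters(i) = original - eps;   // f(w - eps)
    const double lossMinus = loss();

    parameters(i) = original;         // restore the weight
    numericGradient(i) = (lossPlus - lossMinus) / (2 * eps);
  }

  // The test passes if this relative difference is below a small tolerance.
  return arma::norm(numericGradient - analyticGradient) /
      arma::norm(numericGradient + analyticGradient);
}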

expectedOutput *= -4;
expectedOutput = arma::exp(expectedOutput);
expectedOutput = arma::ones(3, 1) / (arma::ones(3, 1) + expectedOutput);
expectedOutput = (arma::ones(3, 1) - expectedOutput) % expectedOutput;

rcurtin (Member) commented Jul 10, 2017

If you can add a comment here on why this is the correct output, it will probably be really helpful to someone in the future who is trying to debug a failure. :)

sumedhghaisas (Author, Member) commented Jul 16, 2017

Regarding the Backward() test: the code is definitely non-trivial, but given the way we test gradients, if the Backward function is not correct for all the layers, the gradients of the input won't be correct. The gradient computed at each layer depends on the signal coming from the layer above, which is computed by the Backward call. Unless the Backward function of the layer above is correct, the gradient of the layer below won't be correct. Writing the test is not a very complex task, but I feel that it is unnecessary.

rcurtin (Member) commented Jul 18, 2017

Fair enough---having a separate test for Backward() can help rule out whether a failure in the gradient test is caused by an incorrect Backward() implementation or an incorrect implementation in some other part of the code that is being tested. But if it is tedious to add, I am ok with leaving it out, as long as the Backward() implementation is indeed being tested at least somewhere.


@sumedhghaisas sumedhghaisas force-pushed the sumedhghaisas:gru branch 3 times, most recently from f0b3b1a to 671e6f2 Jul 11, 2017

@sumedhghaisas sumedhghaisas force-pushed the sumedhghaisas:gru branch from 671e6f2 to acce147 Jul 19, 2017

}

// Delta zt.
arma::mat d_zt = gy % (*backIterator -

zoq (Member) commented Jul 20, 2017

Pedantic style issue; can you use camel casing for all names?

void GRU<InputDataType, OutputDataType>::Forward(
arma::Mat<eT>&& input, arma::Mat<eT>&& output)
{
arma::mat temp = arma::zeros<arma::mat>(outSize, 1);

zoq (Member) commented Jul 20, 2017

It looks like this parameter is unused.


#include <mlpack/prereqs.hpp>

#include <boost/ptr_container/ptr_vector.hpp>

zoq (Member) commented Jul 23, 2017

I think we can remove the header.

forgetGateModule);

// Put delta zt.
prevError.submat(0, 0, 1 * outSize - 1, 0) = boost::apply_visitor(

zoq (Member) commented Jul 23, 2017

That works if we use batchSize = 1; do you mind refactoring the parameter so that it works with batchSize > 1? You could get the batchSize by looking at input.n_cols, so instead of using prevError = arma::zeros<arma::mat>(3 * outSize, 1); we have to use prevError.set_size(4 * outSize, batchSize); and prevError.submat(outSize, 0, 2 * outSize - 1, batchSize - 1). I can refactor the LSTM layer, but it would be great if you could update the GRU layer.

sumedhghaisas (Author, Member) commented Jul 24, 2017

Ohh, I did not know that batchSize is also a factor. I will refactor GRU and LSTM.

sumedhghaisas (Author, Member) commented Jul 25, 2017

Is the batchSize constant, or may it change during training?

rcurtin (Member) commented Jul 25, 2017

I am not 100% sure on this, so @zoq please correct me if I am wrong, but I think that for now batchSize will be constant during training. If we wanted the user to be able to set batchSize differently during training, we would probably need to add a BatchSizeVisitor that could operate on many different layer types, so a change like that I think would be outside the scope of this PR.

zoq (Member) commented Jul 25, 2017

The layer should be able to handle different sizes, but I agree adding another visitor shouldn't be part of this PR.

It's correct that in most cases the batchSize is constant during training; I guess we could say that we only train on full batches. However, it probably makes sense to evaluate more or fewer samples during the prediction process.

sumedhghaisas (Author, Member) commented Jul 26, 2017

So should I resize prevError in every call, or set it once and keep using it?

zoq (Member) commented Jul 26, 2017

We could do something like:

if (input.n_cols != batchSize)
{
  batchSize = input.n_cols;
  // reset prevError;
}

What do you think?
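A tiny self-contained demo of that idea, assuming a prevError matrix with the GRU's 3 * outSize gate block and a cached batchSize (names mirror the discussion; this is not the merged layer code):

#include <armadillo>

int main()
{
  const arma::uword outSize = 4;
  arma::uword batchSize = 0;
  arma::mat prevError;

  // Simulate incoming batches of different sizes.
  for (const arma::uword cols : {1, 8, 8, 3})
  {
    arma::mat input(2 * outSize, cols, arma::fill::randu);

    if (input.n_cols != batchSize)   // reallocate only when the size changes
    {
      batchSize = input.n_cols;
      prevError.set_size(3 * outSize, batchSize);
    }

    // Gate errors now address all batch columns, not just column 0, e.g.:
    prevError.submat(0, 0, outSize - 1, batchSize - 1).zeros();
  }

  return 0;
}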

zoq (Member) commented Jul 23, 2017

I think once we've fixed the issue with the batch size, this is ready to be merged.

@sumedhghaisas sumedhghaisas force-pushed the sumedhghaisas:gru branch from 6059358 to 5ee5f24 Aug 13, 2017

sumedhghaisas (Author, Member) commented Aug 13, 2017

So I rebased the branch on master and made the necessary changes to the RNN module regarding 'rho'.

sumedhghaisas (Author, Member) commented Aug 13, 2017

I also replaced the EmbeddedReberTest with RecursiveReberTest. RecursiveReberTest is a generalization of EmbeddedReberTest: in the embedded tests, a Reber string is embedded inside another Reber string, whereas the recursive Reber grammar recursively embeds a Reber string to a given depth.

But then I thought of making it better, so rather than a constant depth in the recursive Reber grammar, I added options for average depth and max depth. This is where things get interesting. In a normal Reber string, the start symbol is 'B', followed by a 'P' or 'T'. So I assigned 'P' to recursion and 'T' to termination. After emitting 'B', the generator selects either 'P' or 'T' with a given probability, which, as you can guess, is 1 / (average depth). If 'P' comes, it selects again, but if 'T' comes the selection stops and a normal Reber string is attached to the current construction. After this, the entire string is closed with the proper closing sequence.
In context-free grammar form:
s -> B a E
a -> P a P
a -> T r T
r -> {reber string}

As a precaution, I have added the max depth option, which terminates the 'P'/'T' selection if the depth reaches the maximum and forces the selection of 'T'.

Experiments with such a recursive Reber grammar are more interesting than the constant-depth one. Here, the LSTM has to remember how many 'P's to reproduce while ending the sequence, which is a kind of stack storage. Both GRU and LSTM pass the recursive Reber test for a recursion depth of 3 (max 5) with a cell state size of 15. No cell size less than 15 passes the test.

It would be an interesting experiment to see the relation between the average depth and the required cell state size. I will set that up in models when I get time.
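A minimal sketch of the generator described above; the plain-Reber generator is passed in as a callback standing in for the test suite's own GenerateReber() helper, and treating 1 / averageDepth as the termination ('T') probability is an assumption based on the description:

#include <cstddef>
#include <functional>
#include <random>
#include <string>

std::string GenerateRecursiveReber(
    const std::function<std::string()>& generatePlainReber,
    const size_t averageDepth,
    const size_t maxDepth,
    std::mt19937& rng)
{
  // 'T' (terminate) is chosen with probability 1 / averageDepth, 'P' otherwise.
  std::bernoulli_distribution terminate(1.0 / averageDepth);

  // s -> B a E: open with 'B'; each recursion step a -> P a P adds a matching
  // 'P' on both sides, until 'T' is chosen or maxDepth forces termination.
  std::string prefix = "B";
  std::string suffix = "E";
  for (size_t depth = 0; depth < maxDepth && !terminate(rng); ++depth)
  {
    prefix += "P";
    suffix = "P" + suffix;
  }

  // a -> T r T: wrap an ordinary Reber string and close the whole sequence.
  return prefix + "T" + generatePlainReber() + "T" + suffix;
}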

zoq (Member) commented Aug 13, 2017

Really nice idea and description; I also like the idea of running some experiments on the task. If we do something like that, a simple graphic might help to understand the "recursive" embedded Reber grammar task, something like: https://www.willamette.edu/~gorr/classes/cs449/reber.html

rcurtin (Member) commented Aug 14, 2017

I think some of the refactoring here has introduced a bug in the LSTM class, based on the Travis output. Is it better to revert the changes for Rho() and merge GRU, then open another PR for the Rho() changes?

Overall I agree though, I think ResetCells() is a much nicer solution.

One additional thought I have is that allocating memory for the allZeros matrix is not the fastest thing to do. Armadillo will still go through all the effort of using that allZeros matrix, but really it might be better to do something like

if (firstStep)
{
  // Perform computation with arma::zeros()
}
else
{
  // Perform computation as usual.
}

Although the code gets a little longer, it is surely faster to check the condition than to perform the operation with zeros.

sumedhghaisas (Author, Member) commented Aug 15, 2017

Hmm... I am not sure. The gradient tests are passing, and the Rho() changes are mostly in the RNN module. Do you want me to remove the ResetCell code?

sumedhghaisas (Author, Member) commented Aug 15, 2017

So I checked the diff between the current RNN, GRU, and LSTM and my versions of them before I rebased. There is no algorithmic difference; I wonder why this is happening.

sumedhghaisas (Author, Member) commented Aug 15, 2017

And regarding the allZeros thing: most of the current usage of LSTM, GRU, or any other recurrent unit uses a linear layer initialization. We may need to come up with a design for that. That's why I thought there is no point in specializing for zeros. But maybe we can use some specialization tricks?

sumedhghaisas (Author, Member) commented Aug 15, 2017

Okay. So I reverted to the old network sizes for the tests. That solved the problem.

rcurtin (Member) commented Aug 15, 2017

> And regarding the allZeros thing: most of the current usage of LSTM, GRU, or any other recurrent unit uses a linear layer initialization. We may need to come up with a design for that. That's why I thought there is no point in specializing for zeros. But maybe we can use some specialization tricks?

Hm, I am not sure, I think maybe we are talking about different things. But I think, in the end, my idea will not work, because the visitors require an arma::mat to be passed (so we can't pass some expression like arma::zeros<arma::mat>(...) which simply represents a zeros matrix without allocating it). Therefore I think what you've done seems like the best option for now, and never mind my comments. :)

Thanks for solving the LSTM test issue!

@rcurtin rcurtin merged commit 406cef4 into mlpack:master Aug 15, 2017

3 of 4 checks passed:
Static Code Analysis Checks: Build finished.
Style Checks: Build finished.
continuous-integration/appveyor/pr: AppVeyor build succeeded.
continuous-integration/travis-ci/pr: The Travis CI build passed.
rcurtin (Member) commented Aug 15, 2017

I merged and also fixed the static code analysis issues. Nice work with the GRU! I am happy that we have merged it now. :)
