Implementation of Neural Turing Machine. #1072

Open · wants to merge 920 commits into base: master

Conversation

@sumedhghaisas
Member

sumedhghaisas commented Jul 29, 2017

After a million tries to fix the gradients, here is the implementation that passed the gradient checks. :)
Tests have also been performed with the Reber grammar task.

This implementation introduces the concept of memory into our framework for the first time, so some new 'Forward' and 'Backward' signatures are introduced (backward compatible) to deal with this extra memory parameter.

I found out that layers that deal with memory are harder to debug because the gradient comes back in two directions, input and memory. To deal with this, I have added a new layer called 'MemoryTest' which fakes memory content with a linear layer and checks the gradient w.r.t. both input and memory. Thus all the layers that accept memory can be checked individually before adding them into a bigger framework such as NTM, which simulates memory with a combination of these layers.
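
For reference, here is a rough, self-contained sketch of that debugging idea: perturb either the input or the fake memory and compare a central finite-difference estimate against the layer's analytic gradient. The Forward(input, memory, output) signature and the scalar loss (sum of the outputs) are assumptions for illustration, not the exact API added in this PR.

#include <armadillo>

// Numerical gradient of sum(layer output) w.r.t. either the input or the
// (fake) memory, via central differences.  Compare the result against the
// analytic gradient produced by the layer's Backward() pass.
template<typename LayerType>
arma::mat NumericGradient(LayerType& layer,
                          arma::mat input,
                          arma::mat memory,
                          const bool wrtMemory,
                          const double eps = 1e-6)
{
  arma::mat& x = wrtMemory ? memory : input;
  arma::mat grad(x.n_rows, x.n_cols);

  for (size_t i = 0; i < x.n_elem; ++i)
  {
    arma::mat output;

    x(i) += eps;
    layer.Forward(input, memory, output);  // assumed signature
    const double plus = arma::accu(output);

    x(i) -= 2 * eps;
    layer.Forward(input, memory, output);
    const double minus = arma::accu(output);

    x(i) += eps;  // restore the original value
    grad(i) = (plus - minus) / (2 * eps);
  }

  return grad;
}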

The NTM implementation is broken down into four main modules:

  1. Memory Head: takes the controller output and generates weights over the memory locations.

  2. Read Memory: incorporates a Memory Head to read memory content, given the controller output and the memory.

  3. Write Memory: incorporates a Memory Head to create new memory content from the old memory content and the controller output.

  4. Controller: currently hardcoded as a 2-layer feedforward network; a better design is needed to abstract the controller network from the NTM.

This modularization will allow us to implement networks more complicated than the NTM in the future, for example an NTM with multiple read and write heads. The best example of this is the Differentiable Neural Computer.

One major thing that is missing is a design for different initializations of the memory. Currently the memory is initialized as constant ones, but according to the paper it should be initialized with a bias layer so that the network also learns the correct initialization. The framework for that is actually similar to the one used in the MemoryTest layer, but some design decisions are needed.
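
For illustration, a tiny sketch contrasting the two initialization schemes; the names numMem and memSize are made up, and this is not the code in the PR.

#include <armadillo>

// Current approach: every sequence starts from an all-ones memory matrix.
arma::mat ConstantMemoryInit(const size_t numMem, const size_t memSize)
{
  return arma::ones(numMem, memSize);
}

// Approach from the paper: treat the initial memory as a learnable bias
// that is part of the network parameters, so the gradient w.r.t. the
// initial memory updates it during training like any other weight.
arma::mat LearnedMemoryInit(const arma::mat& memoryBias)
{
  return memoryBias;
}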

I still need to go through the implementation once more to try to optimize wherever possible, because until now I have been paying more attention to a correct implementation than to the fastest one. This PR review should help with that as well.

@sumedhghaisas


Member

sumedhghaisas commented Aug 1, 2017

Found another bug in the gradients of the Memory Head while testing the Copy Task. Maybe that was the reason for the failing tests on Travis, although the gradient tests are passing on my system. I don't know the reason for that. Will update as soon as it is resolved...

@sumedhghaisas


Member

sumedhghaisas commented Aug 3, 2017

@zoq @rcurtin The PR is ready for review. I have fixed the gradient errors and optimized the implementation of the Memory Head a bit. I am currently trying to run the copy task for the NTM; I will send that PR to the models repo.

@sumedhghaisas


Member

sumedhghaisas commented Aug 3, 2017

The gradient test for the NTM is failing online, but on my machine it is passing. I checked the gradients of the NTM again but could not find anything. The tests I am running with the copy task are performing correctly; I am getting above 70% accuracy with a sequence length of 10. But the network could be learning to compensate for the wrong gradients. @zoq Can you take a quick look at my gradients in the NeuralTuringMachine class?

@zoq


Member

zoq commented Aug 3, 2017

Couldn't see anything right away, I'll have to take a closer look into the issue; I'll do that later today.

@zoq


Member

zoq commented Aug 6, 2017

Okay, I can reproduce the issue; you said it passes the gradient check on your system?

@zoq


Member

zoq commented Aug 6, 2017

Just printed the gradients (currentGradient in the RNN class), and it looks like a big chunk is just zero. Can you reproduce the issue? I think that might be a good starting point to look into the issue. Let's see if we can identify the corresponding layer.

@sumedhghaisas


Member

sumedhghaisas commented Aug 6, 2017

I am going to do a clean build again; let's see if I can reproduce the issue. But are the gradients passing for MemoryHead, ReadMemory, and WriteMemory? Then we can concentrate on the NTM module alone; it's just the combination of all three modules.

@zoq


Member

zoq commented Aug 6, 2017

GradientMemoryHeadTest, GradientWriteMemoryTest, and GradientReadMemoryTest are fine, at least on my machine.

@sumedhghaisas


Member

sumedhghaisas commented Aug 11, 2017

Any clue how I could fix the errors on the Windows build? I am not sure where the problem lies. Is it the lambda function support in Armadillo?

@zoq


Member

zoq commented Aug 11, 2017

AppVeyor builds against Armadillo 7.800.2, which should support lambda functions; it looks like we have to switch to a for loop to solve the problem.
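
For reference, a minimal sketch of the workaround being discussed here, replacing an .each_col() lambda with an explicit loop over columns; the function name and the operation inside it are made up for illustration.

#include <armadillo>

void NormalizeColumns(arma::mat& weights)
{
  // Lambda-based form, which relies on Armadillo's .each_col() lambda
  // overload (the part that seems to break on the Windows/MSVC build):
  // weights.each_col([](arma::vec& col) { col /= arma::accu(col); });

  // Equivalent explicit for loop, avoiding the lambda overload entirely.
  for (size_t i = 0; i < weights.n_cols; ++i)
    weights.col(i) /= arma::accu(weights.col(i));
}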

@sumedhghaisas


Member

sumedhghaisas commented Aug 12, 2017

@rcurtin @zoq I am not sure if for loops are equivalent to for_each. Are they? I am scared to make any assumptions about Armadillo, to be honest :)

@17minutes


Contributor

17minutes commented Aug 14, 2017

Looks like this one already has a lot of neat stuff implemented 👍 About LayerTypes and FFN integration: is there any way to add FFN<T> for all T? As far as I understand, the current code only uses NegativeLogLikelihood<> as T:

FFN<NegativeLogLikelihood<>, RandomInitialization>*,

It would be really neat, however, if one could also use the MSE error (e.g., for the EMBED function in HAM), cross-entropy (for the OUTPUT function of HAM), or, in fact, any predefined error.

@zoq


Member

zoq commented Aug 14, 2017

We are working on that; right now you have to add the types you need manually.

17minutes added a commit to 17minutes/mlpack that referenced this pull request Aug 15, 2017

@rcurtin

I'm really excited to see this PR start to come together. It seems like maybe some tests (like the gradient tests for NTM) are not checked in; maybe I overlooked something?

I made some other comments; probably each of them will need to be discussed a bit. Sorry if that takes a long time but I think it will be helpful in the end. :)

Outdated review comments on:
src/mlpack/methods/ann/ffn.hpp
src/mlpack/tests/cv_test.cpp
src/mlpack/tests/init_rules_test.cpp
src/mlpack/methods/ann/layer/base_layer.hpp
src/mlpack/methods/ann/layer/neural_turing_machine.hpp
src/mlpack/methods/ann/layer/memory_head.hpp
src/mlpack/methods/ann/layer/read_memory.hpp
src/mlpack/methods/ann/layer/neural_turing_machine_impl.hpp

@@ -92,6 +109,7 @@ using LayerTypes = boost::variant<
DropConnect<arma::mat, arma::mat>*,
Dropout<arma::mat, arma::mat>*,
ELU<arma::mat, arma::mat>*,
FFN<NegativeLogLikelihood<>, RandomInitialization>*,


@rcurtin

rcurtin Aug 15, 2017

Member

It seems like this has to be added here so that you can pass a single layer or an entire FFN to the constructor of NeuralTuringMachine. But unfortunately this means there is a restriction that NegativeLogLikelihood<> and RandomInitialization must be used. I have a different idea which might be more helpful and robust; let me know what you think.

Instead of adding FFN<> to LayerTypes, add an additional constructor to NeuralTuringMachine that takes some FFN type (I guess a template type) and extracts the layers from it and adds them to NeuralTuringMachine::network. I think we might have to add some functions to FFN so that individual layers could be accessed (not sure about that).
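
To make the idea concrete, here is a rough, self-contained sketch (not mlpack's actual API): a layer class that accepts any controller network as a template parameter and copies its layers into its own list, so no particular OutputLayerType or InitializationRuleType is forced on the user. The Model() accessor for the controller's layers is an assumption for illustration.

#include <vector>

template<typename LayerType>
class NeuralTuringMachineSketch
{
 public:
  // Accept any controller network type and take over its layers; the NTM
  // then drives them directly in its own Forward()/Backward() passes.
  template<typename ControllerType>
  explicit NeuralTuringMachineSketch(ControllerType& controller)
  {
    for (LayerType* layer : controller.Model())  // assumed accessor
      network.push_back(layer);
  }

 private:
  // The extracted controller layers.
  std::vector<LayerType*> network;
};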


@rcurtin

rcurtin Aug 15, 2017

Member

@partobs-mdp: I think this comment is relevant to the comment you made on this PR too; if you need to add networks to internal layers that you're building (I think you have to for HAM), this could be another approach that could work for that without the drawback you pointed out.


@sumedhghaisas

sumedhghaisas Aug 16, 2017

Member

Ohh, that's not the case. The template parameters passed to the FFN are irrelevant; they won't be used at all. The initialization is done by the base class, which in this case is the RNN to which the NTM layer will be added. There the user can specify the appropriate initialization and error function. Both of those template arguments are dummies for this use of the FFN as a controller.


@sumedhghaisas


Member

sumedhghaisas commented Aug 16, 2017

@partobs-mdp In the case of the NTM, the parameters passed to the FFN object used as the controller are dummy parameters; the parameters passed to the base RNN class will be used instead to initialize the controller network. So my design is: the NTM is just a layer which is added to an RNN, and the user can add multiple layers on top of that and use the appropriate error function in the RNN class.

I am not sure about the design that you are using. Can you tell me the details of your design? Maybe I can help you.
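
A minimal usage sketch of this design, under the assumption that the NTM is exposed as a regular layer added to an RNN; the NeuralTuringMachine constructor arguments below are illustrative guesses, not the final API.

#include <mlpack/methods/ann/ffn.hpp>
#include <mlpack/methods/ann/rnn.hpp>
#include <mlpack/methods/ann/layer/layer.hpp>

using namespace mlpack::ann;

void BuildNtmModel()
{
  // The controller's own template parameters are dummies here; the RNN
  // below decides the actual error function and initialization.
  FFN<>* controller = new FFN<>();
  controller->Add<Linear<>>(10, 64);
  controller->Add<ReLULayer<>>();

  // The RNN owns the error function and initialization rule; the NTM is
  // just another layer in its stack, and further layers can sit on top.
  RNN<MeanSquaredError<>> model(10 /* rho */);
  model.Add<NeuralTuringMachine<>>(controller, 64, 20);  // args are guesses
  model.Add<Linear<>>(64, 10);
  model.Add<SigmoidLayer<>>();
}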

@sumedhghaisas


Member

sumedhghaisas commented Aug 16, 2017

@partobs-mdp Also let me rebase this branch on the merged GRU. This is on the older GRU. I can imagine the horrible errors that have come up for you :) I apologize for that. I will get it done as soon as possible.

@sumedhghaisas


Member

sumedhghaisas commented Aug 16, 2017

@rcurtin The gradient and forward tests are implemented in ann_layer_test.cpp, but I have yet to add the recurrent tests in recurrent_layer_test.cpp; I just have to copy the LSTM and GRU tests for the NTM.

@sumedhghaisas


Member

sumedhghaisas commented Aug 16, 2017

@zoq I like the idea of creating a visual representation of the recursive grammar; it would make an amazing blog post. I also need to write those scripts for the models to get the minimum cell state size required versus the average recursion depth. I am really looking forward to that, although this M.Sc. dissertation submission is keeping me away from these pleasures. Four more days and I will be back with some fun experiments :)

@17minutes


Contributor

17minutes commented Aug 16, 2017

@sumedhghaisas: HAM mostly follows the same design (using FFNs to run some sub-steps), but it is more complicated because it uses five different FFNs: one for embedding sequence vectors into memory, one for traversing the memory (which is tree-based, the noteworthy part of HAM), one for replacing memory contents, one for evaluating inner memory nodes from their children, and one for emitting the output sequence.

@rcurtin Before (unsuccessfully) trying to migrate to LayerTypes, I was trying to make a templated class. Personally, the only drawback I see with that is that the class definition will be really messy (HAM<FFN, FFN, FFN, FFN, RNN>, where FFN and RNN are themselves templated). What do you think?

As a follow-up: can I use boost::visitor for the FFN (or RNN) in the templated setting? (Maybe a dumb question, but I can't clearly answer it.)

@zoq


Member

zoq commented Aug 16, 2017

I think both strategies have their advantages: as pointed out by @partobs-mdp, the class definition is much cleaner, but on the other side we are more restricted, since we have to add each type to LayerTypes. On the other hand, we could provide reasonable default values for HAM<FFN, FFN, FFN, FFN, RNN> so that we could use HAM<> later on.

About the visitor: if we know the type, we don't need any visitor; we can call the function directly.
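
A small sketch of the "reasonable defaults" idea: declare the class with defaulted template parameters so that HAM<> works out of the box, while custom FFN/RNN instantiations can still be plugged in explicitly. The parameter names are purely illustrative, not the planned HAM interface.

#include <mlpack/methods/ann/ffn.hpp>
#include <mlpack/methods/ann/rnn.hpp>

// HAMUnitSketch<> picks default sub-networks; a user can still write e.g.
// HAMUnitSketch<mlpack::ann::FFN<mlpack::ann::MeanSquaredError<>>> to swap
// in a different error function for one of the sub-networks.
template<typename EmbedType = mlpack::ann::FFN<>,
         typename SearchType = mlpack::ann::FFN<>,
         typename WriteType = mlpack::ann::FFN<>,
         typename JoinType = mlpack::ann::FFN<>,
         typename ControllerType = mlpack::ann::RNN<>>
class HAMUnitSketch
{
  // The five sub-networks would live here as members.
};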

@zoq


Member

zoq commented Aug 22, 2017

Have you tested the code on the recursive embedded Reber grammar task?

@sumedhghaisas


Member

sumedhghaisas commented Aug 23, 2017

@zoq Some documentation changes and the recursive Reber grammar checks are still remaining; I will add them today.

@sumedhghaisas


Member

sumedhghaisas commented Aug 25, 2017

@zoq @rcurtin On Windows, maybe Armadillo does not have the lambda-function overloads for each_row and each_col. Do you know how to solve this?

@zoq


Member

zoq commented Aug 25, 2017

Have you tested:

I'm wondering if removing the first & would solve the Windows problem.

@sumedhghaisas


Member

sumedhghaisas commented Aug 25, 2017

@zoq You mean the '&' inside the closure? That would mean we would incur a copy overhead.

@zoq


Member

zoq commented Aug 25, 2017

Just wanted to test if that works, since MSVC supports lambda functions.

zoq and others added some commits Feb 12, 2018

Merge pull request #1245 from ShikharJ/Misc
Update COPYRIGHT.txt and core.hpp
Merge pull request #1159 from zoq/svrg
Add SVRG, SARAH, Katyusha.
Merge pull request #1248 from zoq/no_decay_merge
Avoid redefinition issue (NoDecay).
Merge pull request #1234 from manish7294/GreedyFix
Greedy Traverser Fix for KNN
@zoq


Member

zoq commented Feb 22, 2018

Thanks for the updates, I'll review the code over the next few days.

kTNonLinear);
// Build bT with non linearity
lBt.push_back(bTNonLinear.Fn(lOutput(memSize, 0)));


@zoq

zoq Feb 25, 2018

Member

Do you think instead of using a vector to hold the states, we could use arma::cube? I think this would improve the overall readability.

lBt.push_back(bTNonLinear.Fn(lOutput(memSize, 0)));
const double& bT = lBt.back();

would become

gT.slice(step) = gTNonLinear.Fn(lOutput(memSize + 1, 0));
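
To illustrate the suggestion, a small sketch contrasting the two storage styles; the dimensions and variable names are illustrative only, not taken from the PR.

#include <vector>
#include <armadillo>

void StoreStates(const size_t memSize, const size_t seqLen)
{
  // Current style: the state for step t is pushed onto a std::vector and
  // retrieved with .back() or an index.
  std::vector<arma::mat> statesVec;
  statesVec.push_back(arma::ones(memSize, 1));

  // Cube style: the state for step t lives in states.slice(t); the whole
  // history is one contiguous object that is easy to index by time step.
  arma::cube states(memSize, 1, seqLen, arma::fill::zeros);
  states.slice(0) = arma::ones(memSize, 1);
}
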
// Build wC with bT and softmax
lWe.push_back(arma::exp(bT * cosSimilarity));
const arma::vec& wE = lWe.back();


@zoq

zoq Feb 25, 2018

Member

We should remove the comment line here, since lWe doesn't contain the softmax; lWc does.
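
For clarity, a standalone helper showing which quantity holds the softmax; it only mirrors the snippet above and is not the PR's code.

#include <armadillo>

// wE (what lWe stores) is only the exponentiated, bT-scaled cosine
// similarity; the softmax appears once it is normalized into wC (lWc).
arma::vec ContentWeights(const arma::vec& cosSimilarity, const double bT)
{
  const arma::vec wE = arma::exp(bT * cosSimilarity);
  return wE / arma::accu(wE);
}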


auto fun = [&](RNN<MeanSquaredError<> >& model)
{
FFN<>* controller = new FFN<>();


@zoq

zoq Feb 25, 2018

Member

Should we test the NTM code with an LSTM controller as well?


arma::mat nMemory = cMemory;
auto addVecIt = addVec.begin();


@zoq

zoq Feb 25, 2018

Member

Do you mind avoiding auto here? I think it helps with understanding the code if the type is immediately visible.
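
For example, the explicit-type version of that line (assuming addVec is an arma::vec here) would be:

#include <armadillo>

void Walk(arma::vec& addVec)
{
  // Instead of: auto addVecIt = addVec.begin();
  // spell out the iterator type so it is visible at a glance.
  arma::vec::iterator addVecIt = addVec.begin();
  (void) addVecIt;
}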


@zoq


Member

zoq commented Feb 25, 2018

I think it would be a good idea to test the code on more tasks, like the copy task mentioned in the paper.
