
[WIP] Separate the MlpSoftmaxDecoder's components (re #289) #298

Merged (22 commits) on Apr 5, 2018

Conversation

@mbollmann (Contributor)

This is WIP for separating the decoder components as discussed in #289.

Currently, this code:

  • Implements multiple layers and state functionality for UniLSTMSeqTransducer
  • Refactors the MlpSoftmaxDecoder to take a UniLSTMSeqTransducer directly

What's missing:

  • Get default arguments to work (see below) so that the implementation can be continued
  • Refactor residual RNN to work as an RNN layer for the decoder
  • Refactor the MLP components into a separate object as well
  • Make sure all examples, tests etc. are adapted to the new architecture

As discussed in #289, default arguments don't work right now due to input feeding. That means a lot of tests fail simply because of that. I would strongly prefer finding a solution to this before continuing this WIP.

I was thinking of explicitly making UniLSTMSeqTransducer aware of input feeding during __init__, so that it can modify its default dimensions accordingly. However, the main roadblock I encountered is that I don't see a way to detect (within UniLSTMSeqTransducer) whether it's a child of a decoder or not.

Anyway, comments/suggestions welcome.

@msperber (Contributor)

That's a good idea, having UniLSTMSeqTransducer try to detect whether it's a child of a decoder. It should be possible, similar to how the vocab size is determined: by taking yaml_path in the constructor, which is set automatically when the YAML file is loaded: https://github.com/neulab/xnmt/blob/master/xnmt/embedder.py#L57
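To make the idea concrete, here is a minimal, self-contained sketch of yaml_path-based detection. The dimension arithmetic (adding hidden_dim to the input when input feeding is on) is illustrative and may differ from what xnmt actually does:

```python
class UniLSTMSeqTransducer:
    def __init__(self, input_dim=512, hidden_dim=512,
                 yaml_path=None, decoder_input_dim=None,
                 decoder_input_feeding=True):
        # yaml_path is filled in automatically during YAML deserialization;
        # it tells the component where it sits in the config tree.
        if yaml_path is not None and "decoder" in yaml_path:
            # Inside a decoder with input feeding, each recurrent input is the
            # target embedding concatenated with the previous context vector,
            # so the effective input dimension grows by hidden_dim.
            if decoder_input_feeding:
                input_dim = (decoder_input_dim or input_dim) + hidden_dim
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
```

With this, a transducer constructed at a path like "model.decoder.rnn_layer" would widen its input dimension automatically, while one used as a plain encoder keeps its defaults.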

xnmt/decoder.py Outdated
self.rnn_layer = rnn_layer

#=== TODO: what does this do and why is it needed? how to preserve it with
#=== the refactoring?
Contributor

You can safely delete this. The DyNet RNN builders are hard to configure regarding parameter initialization, so this was a kind of workaround, but with the XNMT builders parameter initialization is fully supported and already implemented.

@mbollmann (Contributor, Author)

Thanks @msperber, utilizing yaml_path works like a charm!

Marcel Bollmann added 7 commits March 15, 2018 16:47
Caveat: I could not get the residual LSTM to work as a decoder component even
on the master branch.  I don't see how it could currently work at all, as the
decoder attempts to set state (.set_s) etc. which the residual's PseudoState
doesn't support.
@mbollmann (Contributor, Author) commented Mar 15, 2018

This should be functional now!

Some notes:

  1. The decoder instantiation in the tests has become quite verbose, as the yaml_path trick doesn't work there automatically to make the sub-components aware they're in a decoder. Currently, I've solved this by adding a yaml_path everywhere. This also means I had to explicitly instantiate the decoder components everywhere just for that. If you can think of a more elegant solution here, please go ahead.
  2. The ResidualLSTMSeqTransducer is not working as a decoder component at the moment. However, I also couldn't make it work on the master branch, and there seems to be no working example or test for this scenario. The decoder expects some functionality (like .set_s()) that just isn't there on the PseudoState object. Since I'm not familiar enough with the residual LSTM implementation, I decided to leave this be for now.

From March 18 to 31, I'll be on vacation and unable to do anything with the code, but feel free to continue with the code or leave comments for me to work on when I return -- whatever you prefer.

@msperber (Contributor) left a comment

Thanks, looking great! I have 2 small comments inline, otherwise it looks fine unless @neubig has any more comments. I think we can leave your 2 remarks as-is for now.

Before merging, we should probably test this on a benchmark to make sure there is no regression. An example with dropout, a bridge, and shared softmax/target embeddings would probably be best. The only minor difference to the master branch is that the decoder LSTM is initialized slightly differently now: with UniLSTMSeqTransducer, the Glorot initializer treats the 4 weight matrices as separate matrices instead of one bigger matrix, as was the case with the DyNet builder. Would you be able to run something like this? I'm also planning to add some one-button recipes that could be used for this, and could also help with a benchmark, but won't get to it immediately.
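The initialization difference described here can be illustrated numerically. This sketch assumes standard Glorot scaling, sqrt(2 / (fan_in + fan_out)), and 512-dimensional layers; the exact constants in DyNet/xnmt may differ:

```python
import math

def glorot_std(fan_in, fan_out):
    # Glorot/Xavier scale: sqrt(2 / (fan_in + fan_out))
    return math.sqrt(2.0 / (fan_in + fan_out))

hidden_dim, input_dim = 512, 512

# DyNet's compact builder: one big (4*hidden_dim x input_dim) weight matrix
combined_scale = glorot_std(input_dim, 4 * hidden_dim)

# UniLSTMSeqTransducer: four separate (hidden_dim x input_dim) matrices,
# one per gate, each initialized on its own
separate_scale = glorot_std(input_dim, hidden_dim)

# Treating the gates separately yields a larger initialization scale
print(separate_scale > combined_scale)  # True
```

So the per-gate matrices start with somewhat larger weights than the fused matrix would give them, which is why a benchmark comparison (rather than bitwise-identical results) is the practical check here.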

xnmt/lstm.py Outdated

def __init__(self, exp_global=Ref(Path("exp_global")), layers=1, input_dim=None, hidden_dim=None,
dropout=None, weightnoise_std=None, param_init=None, bias_init=None,
yaml_path=None, decoder_input_dim=None, decoder_input_feeding=True):
Contributor

The new arguments should be documented.

Contributor Author

Should yaml_path be documented as well, even though it's kind of for internal use? Because the decoder_... arguments only get interpreted when "decoder" is in yaml_path ...

Contributor

Hm, maybe you could add yaml_path as an argument but leave the description empty? And then somehow the information should be conveyed that decoder_... is only interpreted when yaml_path is given.
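A hedged sketch of what such a docstring could look like; the parameter descriptions are paraphrased from this discussion, not taken from xnmt's actual documentation:

```python
class UniLSTMSeqTransducer:
    def __init__(self, layers=1, input_dim=None, hidden_dim=None,
                 yaml_path=None, decoder_input_dim=None,
                 decoder_input_feeding=True):
        """
        Args:
          layers (int): number of stacked LSTM layers
          input_dim (int): input dimension
          hidden_dim (int): hidden dimension
          yaml_path (str): set automatically during YAML deserialization;
            for internal use
          decoder_input_dim (int): input dimension to use when this
            transducer is a decoder component; only interpreted if
            yaml_path contains "decoder"
          decoder_input_feeding (bool): whether the decoder uses input
            feeding; only interpreted if yaml_path contains "decoder"
        """
        self.layers = layers
```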

xnmt/mlp.py Outdated
vocab_size (int): vocab size or None; only relevant if MLP is used as a decoder component
vocab (Vocab): vocab or None; only relevant if MLP is used as a decoder component
trg_reader (InputReader): Model's trg_reader, if exists and unambiguous; only relevant if MLP is used as a decoder component
"""
Contributor

It should be mentioned that vocab size etc. will overwrite output_dim if given.
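The requested overwrite behavior could be sketched like this; resolve_output_dim is a hypothetical helper written for illustration, not xnmt's actual code, and the precedence order is my reading of the docstring quoted above:

```python
def resolve_output_dim(output_dim, vocab_size=None, vocab=None, trg_reader=None):
    # Hypothetical helper: an explicit vocab_size wins, then the vocab,
    # then the target reader's vocab; output_dim is only the fallback.
    if vocab_size is not None:
        return vocab_size
    if vocab is not None:
        return len(vocab)
    if trg_reader is not None and getattr(trg_reader, "vocab", None) is not None:
        return len(trg_reader.vocab)
    return output_dim
```

Documenting this precedence next to the vocab_size / vocab / trg_reader arguments would make the overwrite explicit for users.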

@neubig (Contributor) commented Mar 16, 2018

@msperber If you think it's OK, go ahead and merge. I trust your judgement and think we only need one reviewer per PR.

@mbollmann (Contributor, Author)

I think a benchmark is a good idea, but I don't have any prepared benchmarks or anything that I could easily run at the moment. Do any immediate recommendations come to mind?

@msperber (Contributor)

The easiest to use would probably be the Stanford benchmark because it's already preprocessed: https://nlp.stanford.edu/projects/nmt/
I often use TED because it's smaller and fast to train a model, but you'd have to do the preprocessing yourself: https://wit3.fbk.eu/mt.php?release=2017-01-trnted

If you can somehow guarantee that both models are initialized with the exact same random parameters, it would also be enough to run just one training epoch plus evaluation and make sure scores are exactly the same.

@mbollmann (Contributor, Author)

> The only minor difference to the master branch is that the decoder LSTM is initialized slightly differently now: with UniLSTMSeqTransducer, the Glorot initializer treats the 4 weight matrices as separate matrices instead of one bigger matrix, as was the case with the DyNet builder.

I think this might actually be a good thing, as it makes the initialization consistent with the encoder LSTM. However, I'm not sure what the easiest way would be to get the exact same random initialization for both the old and new architecture, as simply fixing the random seed won't suffice then.

Either way, I don't think I'll get around to this (and the benchmarks in general) before my vacation. If you want to take this up, please go ahead, otherwise I can continue that in April.

@msperber (Contributor)

Yes, it's definitely a good thing! Getting the parameters to be identical would probably require some custom parameter saving/loading code, so I'd start with the complete benchmark as it involves less human labor (although more computer labor) :) In any case, I might also give it a shot depending on when I get around to the recipes.

@msperber (Contributor) commented Apr 3, 2018

To follow up on this, I haven't had a chance to actually run it, but I would suggest the new stanford-iwslt recipe which uses a small data set and requires minimal effort to run (basically run a download script and then start training).

@mbollmann (Contributor, Author)

Oh, that's cool, I actually downloaded that very dataset an hour ago and am looking for a machine to run it on now. ;)

@msperber (Contributor) commented Apr 3, 2018

Sorry also for the merge conflicts, let me know if you need help resolving them.

@mbollmann (Contributor, Author)

Results for a single run of the IWSLT recipe on master:

Experiment                    | Final Scores
-----------------------------------------------------------------------
iwslt-experiment              | BLEU4: 0.2475964750860351, 0.586186/0.332474/0.197394/0.120693 (BP = 0.948510, ratio=0.95, hyp_len=32575, ref_len=34297)
                              | BLEU4: 0.27212110133216566, 0.623154/0.360932/0.221823/0.140138 (BP = 0.941058, ratio=0.94, hyp_len=31753, ref_len=33682)

And on this feature branch:

Experiment                    | Final Scores
-----------------------------------------------------------------------
iwslt-experiment              | BLEU4: 0.24962836908499383, 0.584297/0.329941/0.194742/0.118162 (BP = 0.967257, ratio=0.97, hyp_len=33192, ref_len=34297)
                              | BLEU4: 0.2732730662351511, 0.618037/0.357961/0.219796/0.138445 (BP = 0.954025, ratio=0.96, hyp_len=32168, ref_len=33682)

@msperber (Contributor) commented Apr 4, 2018

Great, thanks! Would you mind adding these numbers (the new ones) to the recipe's README file? Otherwise, I think we can merge once the remaining failing test is resolved.

@neubig (Contributor) commented Apr 4, 2018

Looks good! Could we set the following:

rnn_layer: !UniLSTMSeqTransducer
  layers: 1

to be the default and remove it from the config files, unless it would be useful to have for illustrative purposes? Maybe similarly for the MLP layer?

Also, this could maybe be for another commit, but we could potentially generalize UniLSTMSeqTransducer to UnidirectionalSeqTransducer that defines the interface for sequence transducers that can add elements one-by-one left to right.
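The suggested generalization might look roughly like this. The name UnidirectionalSeqTransducer comes from the comment above, but the method names here are invented for illustration and are not xnmt's actual API:

```python
from abc import ABC, abstractmethod

class UnidirectionalSeqTransducer(ABC):
    """Interface for transducers that consume inputs one at a time,
    left to right (method names are illustrative, not xnmt's API)."""

    @abstractmethod
    def initial_state(self):
        """Return the initial recurrent state."""

    @abstractmethod
    def add_input(self, state, x):
        """Consume one input and return the updated state."""

# A trivial implementation, just to show the contract:
class CountingTransducer(UnidirectionalSeqTransducer):
    def initial_state(self):
        return 0

    def add_input(self, state, x):
        return state + 1
```

An abstract class like this would also give the decoder a correct type to request for rnn_layer, addressing the typing concern raised in the review below.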

@mbollmann (Contributor, Author)

@neubig, that is the default already, so it's actually safe to remove. But then again, there's a lot of redundancy in the example configs anyway with the explicit dimensionality definitions (even though they usually match default_layer_dim).

I noticed that there are a few new examples that I hadn't adapted yet, so I'm also taking care of those now.

@mbollmann (Contributor, Author)

Unless I missed some config file to adapt, or we want to remove more of the rnn_layer definitions from them, this should be ready for merging from my side.

xnmt/decoder.py Outdated
bias_init_context (ParamInitializer): how to initialize context bias vectors
param_init_output (ParamInitializer): how to initialize output weight matrices
bias_init_output (ParamInitializer): how to initialize output bias vectors
rnn_layer (SeqTransducer): recurrent layer of the decoder; defaults to UniLSTMSeqTransducer
Contributor

SeqTransducer is not the correct type here, as SeqTransducers can't consume inputs one by one. For lack of an appropriate abstract class, the requested type should probably be UniLSTMSeqTransducer.

Contributor Author

Yeah, that's where @neubig's suggestion would probably come in handy. Also, it's not like you should really use the BiLSTMSeqTransducer either, so...

@msperber (Contributor) commented Apr 5, 2018

I took another look and had one mini comment but other than that we can merge this.

@msperber (Contributor) commented Apr 5, 2018

Cool, will merge!

@msperber msperber merged commit e0faea6 into neulab:master Apr 5, 2018
@mbollmann mbollmann deleted the separate-decoder branch April 5, 2018 12:55