[WIP] Separate the MlpSoftmaxDecoder's components (re #289) #298
Conversation
That's a good idea to have …
xnmt/decoder.py (outdated):

```python
self.rnn_layer = rnn_layer

#=== TODO: what does this do and why is it needed? how to preserve it with
#=== the refactoring?
```
You can safely delete this. The DyNet RNN builders are hard to configure regarding parameter initialization, so this was a kind of workaround, but with the XNMT builders parameter initialization is fully supported and already implemented.
Thanks @msperber, utilizing …
Caveat: I could not get the residual LSTM to work as a decoder component even on the master branch. I don't see how it could currently work at all, as the decoder attempts to set state (.set_s) etc. which the residual's PseudoState doesn't support.
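The incompatibility is easy to see in a stripped-down sketch (the class and method names mirror the discussion, but the code below is illustrative, not XNMT's actual implementation): the residual builder's `PseudoState` forwards only `add_input()`/`output()`, so a decoder that calls `set_s()` on it fails.

```python
class PseudoState:
    """Stand-in for the residual builder's state wrapper: it forwards
    only add_input()/output(), and deliberately has no set_s()."""
    def __init__(self, states):
        self.states = states
    def add_input(self, x):
        return PseudoState([s.add_input(x) for s in self.states])
    def output(self):
        return self.states[-1].output()

class DecoderSketch:
    """Mimics how the decoder initializes its RNN from the encoder state."""
    def initial_state(self, rnn_state, enc_final_states):
        # this is the call that breaks with a residual LSTM:
        return rnn_state.set_s(enc_final_states)

try:
    DecoderSketch().initial_state(PseudoState([]), enc_final_states=[])
except AttributeError as err:
    print("fails as described:", err)
```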
This should be functional now! Some notes:
From March 18 to 31, I'll be on vacation and unable to do anything with the code, but feel free to continue with the code or leave comments for me to work on when I return -- whatever you prefer.
Thanks, looking great! I have 2 small comments inline, otherwise it looks fine unless @neubig has any more comments. I think we can leave your 2 remarks as-is for now.
Before merging, we should probably test this on a benchmark to make sure there is no regression. Maybe an example with dropout, a bridge, and shared softmax/target embeddings would be best. The only minor difference from the master branch is that the decoder LSTM is now initialized slightly differently: with UniLSTMSeqTransducer, the Glorot initializer treats the four weight matrices as separate matrices instead of one larger stacked matrix, as was the case with the DyNet builder. Would you be able to run something like this? I'm also planning on adding some one-button recipes that could be used for this, and could also help with a benchmark, but won't get to it immediately.
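To make that initialization difference concrete, here is the standard Glorot-uniform bound for the two layouts (illustrative arithmetic only; DyNet's exact formula and gain factor may differ slightly):

```python
import math

def glorot_bound(fan_in, fan_out):
    # standard Glorot/Xavier uniform bound: weights ~ U(-b, b)
    return math.sqrt(6.0 / (fan_in + fan_out))

h, d = 512, 512  # hidden and input dims, illustrative values

# DyNet builder: the four gate matrices stacked into one (4h x d) matrix
combined = glorot_bound(4 * h, d)
# UniLSTMSeqTransducer: four separate (h x d) matrices
separate = glorot_bound(h, d)

print(f"stacked: +/-{combined:.4f}, separate: +/-{separate:.4f}")
```

With separate matrices each gate's weights are drawn from a wider range, which is also why fixing the random seed alone cannot reproduce the old initialization.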
xnmt/lstm.py (outdated):

```python
def __init__(self, exp_global=Ref(Path("exp_global")), layers=1, input_dim=None, hidden_dim=None,
             dropout=None, weightnoise_std=None, param_init=None, bias_init=None,
             yaml_path=None, decoder_input_dim=None, decoder_input_feeding=True):
```
The new arguments should be documented.
Should I document `yaml_path` as well, even though it's kind of for internal use? Because the `decoder_...` arguments only get interpreted when `"decoder" in yaml_path` ...
Hm, maybe you could add `yaml_path` as an argument but leave the description empty? And then somehow the information should be conveyed that the `decoder_...` arguments are only interpreted when `yaml_path` is given.
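A minimal sketch of the gating being discussed (names follow the signature above, but the dimension arithmetic and behavior are my assumptions, not XNMT's actual code): the `decoder_*` arguments only take effect when the component's `yaml_path` places it under a decoder.

```python
class UniLSTMSeqTransducerSketch:
    def __init__(self, input_dim=512, yaml_path="",
                 decoder_input_dim=None, decoder_input_feeding=True):
        self.input_dim = input_dim
        if "decoder" in str(yaml_path):
            # hypothetical arithmetic: input feeding concatenates extra
            # context, growing the effective input dimension
            if decoder_input_dim is not None and decoder_input_feeding:
                self.input_dim = decoder_input_dim + input_dim
        # outside a decoder, the decoder_* arguments are silently ignored

enc = UniLSTMSeqTransducerSketch(yaml_path="model.encoder",
                                 decoder_input_dim=1024)
dec = UniLSTMSeqTransducerSketch(yaml_path="model.decoder.rnn_layer",
                                 decoder_input_dim=1024)
print(enc.input_dim, dec.input_dim)  # 512 1536
```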
xnmt/mlp.py (outdated):

```python
  vocab_size (int): vocab size or None; only relevant if MLP is used as a decoder component
  vocab (Vocab): vocab or None; only relevant if MLP is used as a decoder component
  trg_reader (InputReader): model's trg_reader, if it exists and is unambiguous; only relevant if MLP is used as a decoder component
  """
```
It should be mentioned that `vocab_size` etc. will override `output_dim` when given.
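A hedged sketch of the precedence this comment asks to have documented (the helper function is hypothetical; XNMT's actual resolution logic may differ): any of `vocab_size`, `vocab`, or the `trg_reader`'s vocab takes precedence over `output_dim`.

```python
def resolve_output_dim(output_dim, vocab_size=None, vocab=None,
                       trg_reader_vocab=None):
    """Return the MLP's output dimension; vocab information, if given,
    overrides the plain output_dim."""
    if vocab_size is not None:
        return vocab_size
    if vocab is not None:
        return len(vocab)
    if trg_reader_vocab is not None:
        return len(trg_reader_vocab)
    return output_dim

print(resolve_output_dim(512))                         # plain MLP: 512
print(resolve_output_dim(512, vocab=["a", "b", "c"]))  # decoder use: 3
```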
@msperber If you think it's OK, go ahead and merge. I trust your judgement and think we only need one reviewer per PR.
I think a benchmark is a good idea, but I don't have any prepared benchmarks or anything that I could easily run at the moment. Any immediate recommendations come to mind?
The easiest to use would probably be the Stanford benchmark, because it's already preprocessed: https://nlp.stanford.edu/projects/nmt/. If you can somehow guarantee that both models are initialized with the exact same random parameters, it would also be enough to run just one training epoch plus evaluation and make sure the scores are exactly the same.
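One way to implement that "exact same parameters" check (everything here, including dumping the parameters to `.npz` files, is an assumption about how one might do it, not existing XNMT functionality):

```python
import io
import numpy as np

def models_identical(file_a, file_b):
    """True iff two .npz parameter dumps contain identical arrays."""
    a, b = np.load(file_a), np.load(file_b)
    if set(a.files) != set(b.files):
        return False
    return all(np.array_equal(a[k], b[k]) for k in a.files)

# self-contained demo with in-memory "checkpoints"
buf_a, buf_b = io.BytesIO(), io.BytesIO()
np.savez(buf_a, W_x=np.ones((4, 4)), b=np.zeros(4))
np.savez(buf_b, W_x=np.ones((4, 4)), b=np.zeros(4))
buf_a.seek(0); buf_b.seek(0)
print(models_identical(buf_a, buf_b))  # True
```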
I think this might actually be a good thing, as it makes the initialization consistent with the encoder LSTM. However, I'm not sure what the easiest way would be to get the exact same random initialization for both the old and new architecture, as simply fixing the random seed won't suffice then. Either way, I don't think I'll get around to this (and the benchmarks in general) before my vacation. If you want to take this up, please go ahead, otherwise I can continue that in April.
Yes, it's definitely a good thing! Getting the parameters to be identical would probably require some custom parameter saving/loading code, so I'd probably start with the complete benchmark as it involves less human labor (although more computer labor) :) In any case, I might also give it a shot depending on when I get around to the recipes.
To follow up on this, I haven't had a chance to actually run it, but I would suggest the new stanford-iwslt recipe, which uses a small data set and requires minimal effort to run (basically run a download script and then start training).
Oh, that's cool, I actually downloaded that very dataset an hour ago and am looking for a machine to run it on now. ;)
Sorry also for the merge conflicts, let me know if you need help resolving them.
Results for a single run of the IWSLT recipe on master:
And on this feature branch:
Great, thanks! Would you mind adding these numbers (the new ones) to the recipe's README file? Otherwise, I think we can merge once the remaining failing test is resolved. |
Looks good! Could we set the following:

to be the default and remove it from the config files, unless it would be useful to have for illustrative purposes? Maybe similarly for the MLP layer? Also, this could maybe be for another commit, but we could potentially generalize …
@neubig, that is the default already, so it's actually safe to remove. But then again, there's a lot of redundancy in the example configs anyway with the explicit dimensionality definitions (even though they usually match). I noticed that there are a few new examples that I hadn't adapted yet, so I'm also taking care of those now.
Unless I missed some config file to adapt, or we want to remove more of the …
xnmt/decoder.py (outdated):

```python
  bias_init_context (ParamInitializer): how to initialize context bias vectors
  param_init_output (ParamInitializer): how to initialize output weight matrices
  bias_init_output (ParamInitializer): how to initialize output bias vectors
  rnn_layer (SeqTransducer): recurrent layer of the decoder; defaults to UniLSTMSeqTransducer
```
SeqTransducer is not the correct type here, as SeqTransducers can't consume inputs one by one. For lack of an appropriate abstract class, the requested type should probably be UniLSTMSeqTransducer.
Yeah, that's where @neubig's suggestion would probably come in handy. Also, it's not like you should really use the BiLSTMSeqTransducer either, so...
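For what it's worth, the missing abstract class could look roughly like this (purely a sketch of a class that does not exist in XNMT at this point; the name is made up):

```python
from abc import ABC, abstractmethod

class UniDiSeqTransducer(ABC):
    """Hypothetical base class for transducers that can be driven one
    step at a time, which is what a decoder needs (satisfied by
    UniLSTMSeqTransducer, but not by BiLSTMSeqTransducer)."""

    @abstractmethod
    def initial_state(self):
        """Return a state supporting add_input()/output()/set_s()."""

    @abstractmethod
    def add_input(self, state, x):
        """Consume a single input and return the updated state."""

# abstract classes cannot be instantiated directly
try:
    UniDiSeqTransducer()
except TypeError as err:
    print("abstract:", err)
```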
I took another look and had one mini comment but other than that we can merge this.
Cool, will merge!
This is WIP for separating the decoder components as discussed in #289.

Currently, this code:
- … `UniLSTMSeqTransducer`
- … `MlpSoftmaxDecoder` to take a `UniLSTMSeqTransducer` directly

What's missing:
As discussed in #289, default arguments don't work right now due to input feeding. That means a lot of tests fail simply because of that. I would strongly prefer finding a solution to this before continuing this WIP.
I was thinking to explicitly make `UniLSTMSeqTransducer` aware of input feeding during `__init__`, so that it can modify its default dims accordingly. However, the main roadblock I encountered is that I don't see a way to detect (within `UniLSTMSeqTransducer`) whether it's a child of a decoder or not.

Anyway, comments/suggestions welcome.