
NMT #20

Merged: 58 commits, Aug 9, 2015

Conversation

orhanf (Contributor) commented Jun 30, 2015

@dmitriy-serdyuk, this is the initial implementation of the RNN encoder-decoder with attention for machine translation. It works with the following versions (latest commits as of June 30, 2015): blocks, fuel, and picklable_itertools.

Items TODO:

  • Documentation; not complete yet, but decent
  • Sampling; not tested, but should be fine
  • Early stopping based on BLEU; not tested, I have to clean it up and adapt the changes/fixes from NMT
  • The example uses WMT'15 Czech->English translation, but the necessary input files (preprocessed bitext, vocabularies, validation sets) are not provided. This will be handled by adding a script that does all the pre-processing and puts everything into the corresponding folder. We have the scripts we used for WMT'15; I will clean them up and add them here soon.
  • Fix logging saving and loading issues.
  • Add tests
  • Anything else?

I will continue working on these items this week; they are all minor issues compared to the PR as a whole.

@ejls, can you please take a look to see if I am missing something?
@rizar, your comments/recommendations are also highly welcome.

batch_size=source_sentence.shape[0],
attended=representation,
attended_mask=tensor.ones(source_sentence.shape).T,
glimpses=self.attention.take_glimpses.outputs[0])
orhanf (Author):

glimpses is unnecessary, forgot to remove it

orhanf (Author) commented Jul 1, 2015

@kyunghyuncho

rizar (Contributor) commented Jul 2, 2015

A quick question: why English and Czech?

orhanf (Author) commented Jul 2, 2015

Among the WMT'15 pairs, Czech-English required the least preprocessing (only tokenization), so I thought it would be the easiest for others to set up. We can add other pairs as well; there is nothing specific to or hard-coded for Cs-En (only a few names, which won't be a problem to change).

# send end of file, read output.
mb_subprocess.stdin.close()
stdout = mb_subprocess.stdout.readline()
print "output ", stdout
Contributor:

It shouldn't be there.

orhanf (Author):

fixed with 6df83ab

dmitriy-serdyuk (Contributor):

You use the logger and prints at the same time. I think we should stick with the logger.
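
A minimal sketch of the suggested change using the standard logging module (the message and the stand-in value are illustrative, not from the PR):

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

stdout = "example line"  # stands in for mb_subprocess.stdout.readline()
# instead of: print "output ", stdout
logger.info("output %s", stdout)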


if j == 0:
# Write to subprocess and file if it exists
print >> mb_subprocess.stdin, trans_out
Contributor:

We use the Python 3-style print everywhere else (from __future__ import print_function).
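
A sketch of that style when writing to a subprocess; 'cat' is only a stand-in here, since the script the PR actually pipes to is not shown in this hunk:

from __future__ import print_function
import subprocess

mb_subprocess = subprocess.Popen(['cat'], stdin=subprocess.PIPE,
                                 stdout=subprocess.PIPE,
                                 universal_newlines=True)
trans_out = "a translated sentence"

# instead of: print >> mb_subprocess.stdin, trans_out
print(trans_out, file=mb_subprocess.stdin)
mb_subprocess.stdin.close()
print(mb_subprocess.stdout.readline())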

orhanf (Author):

done with ecbc38c

fhirschmann:

Thanks for this example!

In __init__.py:319 you save the model to search_model_cs2en_model (due to using save_separately), but in
__init__.py:351 the model is loaded from search_model_cs2en. Hence loading does not work.

A related issue: when config['reload'] is set to True, the first run fails because sampling.py:155 creates a directory named search_model_cs2en and the check in saveload.py:157 uses os.path.exists instead of os.path.isfile.

I'm currently trying to figure out how to save and load the machine translation model, but unfortunately I haven't had much success so far, even when hardcoding the paths the model is saved to and loaded from. I'd really appreciate it if you could look into this.
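
For reference, a minimal sketch of an isfile-based check like the one implied above (the parameter file name is illustrative, not the PR's actual path):

import os

path_to_parameters = os.path.join('search_model_cs2en', 'params.npz')  # illustrative

# os.path.isfile stays False when sampling.py has only pre-created the
# directory, so reloading is attempted only if parameters were written.
if os.path.isfile(path_to_parameters):
    print('reloading parameters')
else:
    print('starting from scratch')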

orhanf (Author) commented Jul 8, 2015

@fhirschmann thanks a lot for the pointers; I've changed the whole checkpointing structure and made it more specific to the NMT example. Currently only the parameters, log, and iteration_state are saved, which is mostly what we need for experiments.

Sampling and beam search still need to be tested.

fhirschmann:

@orhanf, thank you very very much for this. I was under the impression that the Save/Load architecture in blocks would suffice for this.

There are some small problems with the current version:

  • The self.config['saveto'] directory still gets created by sampling.py:157, hence os.path.exists in __init__.py:134 never actually returns False. I suggest changing the exists check to work on self.path_to_parameters. Then maybe the three except Exception as e checks in __init__.py:186-199 are not required anymore.
  • When resuming the first time, the resumed_from log entry contains some binary identifier, e.g. MW÷GRªpN
  • After running the experiment the second time, something happens to the log and it can't be loaded anymore. Loading the log file using blocks.serialization.load works after the experiment has been run once, but after the second time it produces the following traceback:
In [4]: load("model/log")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-d4d06dd157f9> in <module>()
----> 1 load("model/log")

/home/fabian/msc/exp/blocks-examples/env/src/theano/theano/misc/pkl_utils.pyc in load(f, persistent_load)
    318         p = pickle.Unpickler(BytesIO(zip_file.open('pkl').read()))
    319         p.persistent_load = persistent_load(zip_file)
--> 320         return p.load()
    321 
    322 

/home/fabian/msc/exp/blocks-examples/local/lib/python2.7/pickle.pyc in load(self)
    856             while 1:
    857                 key = read(1)
--> 858                 dispatch[key](self)
    859         except _Stop, stopinst:
    860             return stopinst.value

/home/fabian/msc/exp/blocks-examples/local/lib/python2.7/pickle.pyc in load_newobj(self)
   1081         args = self.stack.pop()
   1082         cls = self.stack[-1]
-> 1083         obj = cls.__new__(cls, *args)
   1084         self.stack[-1] = obj
   1085     dispatch[NEWOBJ] = load_newobj

TypeError: buffer() takes at least 1 argument (0 given)

May I ask what version of Python you are using? The last point may actually be a bug in Python 2.7.

fhirschmann:

Please see this pull request as far as sampling is concerned. I have not yet gotten to the BLEU Validator.

fhirschmann:

Another issue I found; it may be limited to sampling, but is more likely an issue with the NMT model itself:

In stream_cs2en.py:51 you set the end-of-sequence marker to the size of the vocabulary. However, the EOS marker is never actually added when the model is computed. In GroundHog this was solved by setting the last element in the sequence to the EOS token, and indeed there are some remnants of this in sampling.py:42, which do not seem to get executed at all.

An example input sequence now looks like this (with a vocabulary size of 220):

array([[ 22, 114,  11,  23, 143,   2,  10, 156,  89,   1,  27,  32,  33,
         38, 165, 119,   2, 137,  85, 154,  63, 120,  54, 208,   6, 182,
          2,  20,   8,  83,   1,   1,   3,   1,   0,   0,   0,   0,   0,
          0,   0,   0,   0]])

Likewise, a sequence does not start with a BOS token, but I believe this was also the case in GroundHog. I also noticed that, disregarding the 0-padding, all sequences end with 1 (the UNK token).

fhirschmann:

I figured out why the EOS token is not present. While fuel.datasets.TextFile does append it, stream_cs2en.py:26 replaces it with 1 (UNK) because it checks using <. Note that using <= does not work due to a Theano out-of-bounds exception. I just set EOS to the vocabulary size minus one (the last element).
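
A minimal sketch of the off-by-one described here (names are illustrative, not the exact code in stream_cs2en.py):

vocab_size = 220
unk_idx = 1
eos_idx = vocab_size - 1  # the fix: keep EOS inside [0, vocab_size)

def cap_vocabulary(seq, vocab_size, unk_idx):
    # With eos_idx == vocab_size, the check `x < vocab_size` is False for
    # EOS, so EOS gets mapped to UNK: the trailing 1s seen earlier.
    return [x if x < vocab_size else unk_idx for x in seq]

seq = [22, 114, 3, eos_idx]
assert cap_vocabulary(seq, vocab_size, unk_idx) == seq  # EOS now survives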

orhanf (Author) commented Jul 9, 2015

@fhirschmann thanks a lot for the pointers and the testing effort again, I really appreciate it :) Please see my comments below.

  • The self.config['saveto'] directory still gets created by sampling.py:157, hence os.path.exists in __init__.py:134 never actually returns False. I suggest changing the exists check to work on self.path_to_parameters. Then maybe the three except Exception as e checks in __init__.py:186-199 are not required anymore.

This will be fixed as I start testing sampling/beam search. The reason we have a separate except Exception for each of the three is that we sometimes provide only one of them, e.g. when initializing with a pre-trained model or changing the training corpus at some point. So we still need them, but yes, the overlaps should be removed; see the sketch below.
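
A minimal sketch of per-artifact loading with independent fallbacks (the artifact names and loaders are illustrative assumptions, not the PR's exact code):

import logging

logger = logging.getLogger(__name__)

def try_load(path, loader):
    # Load one checkpoint artifact; warn and continue if it is missing or
    # broken, since each artifact is optional (e.g. a pre-trained model
    # may ship parameters only).
    try:
        return loader(path)
    except Exception as e:
        logger.warning("Could not load %s: %s", path, e)
        return None

# Usage (loaders are placeholders):
# params = try_load('model/params.npz', numpy.load)
# log = try_load('model/log', blocks.serialization.load)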

  • When resuming the first time, the resumed_from log entry contains some binary identifier, e.g. MW÷GRªpN

This seems more like a Blocks issue caused by log resuming; I will take a closer look at this one soon.

  • After running the experiment the second time, something happens to the log and it can't be loaded anymore. Loading the log file using blocks.serialization.load works after the experiment has been run once, but after the second time it produces the following traceback:

Nice catch again; I will try to figure out the problem, but again, its source is probably beyond the scope of this PR.

  • May I ask what version of Python you are using?

Python 2.7.6 (64-bit) is the default here at MILA.

  • I just set EOS to the vocabulary size minus one (the last element).

This is also apparently a sync blunder of mine: in GroundHog we set the vocabulary size to V and the EOS index to V-1. So please either increase the vocabulary size by one or set the EOS index to one minus the vocabulary size (as you suggested), depending on your problem.

I am out of town attending a conference and will be back at MILA in one week; I will try to resolve these issues as I find time.

fhirschmann:

Thanks @orhanf, I would very much like to continue to test and help fix the rest of this code.

# Add early stopping based on bleu
if config['bleu_script'] is not None:
    logger.info("Building bleu validator")
    BleuValidator(sampling_input, samples=samples, config=config,

Contributor:

You forgot to extensions.append() here.
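
A sketch of the fix, assuming the surrounding scope from the hunk above (the remaining BleuValidator arguments, truncated in the hunk, are elided here too):

# Add early stopping based on bleu
if config['bleu_script'] is not None:
    logger.info("Building bleu validator")
    extensions.append(
        BleuValidator(sampling_input, samples=samples, config=config))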

bartvm (Member) commented Jul 13, 2015

The binary value for resumed_from is correct; it is the binary universally unique identifier (UUID) used to refer to the previous log.
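
For the curious, a raw 16-byte UUID prints as gibberish but can be rendered readable with the standard uuid module (a generic sketch; that the log stores exactly these raw bytes is an assumption):

import uuid

raw = uuid.uuid4().bytes       # 16 raw bytes, like the value quoted above
print(repr(raw))               # binary gibberish
print(uuid.UUID(bytes=raw))    # canonical hexadecimal form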

dmitriy-serdyuk (Contributor):

I restarted Travis; the test should pass now.

dmitriy-serdyuk (Contributor):

That's weird. @orhanf, can you rebase?

dmitriy-serdyuk (Contributor):

Well, it's a huge PR already. I'll merge it and open an issue to refactor it sometime in the future.

dmitriy-serdyuk mentioned this pull request on Aug 9, 2015 (3 tasks).
dmitriy-serdyuk added a commit that referenced this pull request on Aug 9, 2015.
dmitriy-serdyuk merged commit 7143f3b into mila-iqia:master on Aug 9, 2015.