
NMT #20

Merged: 58 commits, Aug 9, 2015

Conversation

orhanf (Contributor) commented Jun 30, 2015

@dmitriy-serdyuk, this is the initial implementation of the RNN encoder-decoder with attention for machine translation. It works with the following versions (latest commits as of June 30, 2015): blocks, fuel, and picklable_itertools.

Items TODO:

  • Documentation; not complete yet, but decent
  • Sampling; not tested, but should be fine
  • Early stopping based on BLEU; not tested, I have to clean it up and adapt the changes/fixes from NMT
  • The example uses WMT'15 Czech->English translation, but the necessary input files (preprocessed bitext, vocabularies, validation sets) are not provided. This will be handled by adding a script that does all the pre-processing and puts everything into the corresponding folder. We have the scripts we used for WMT'15; I will clean them up and add them here soon.
  • Fix logging saving and loading issues.
  • Add tests
  • Anything else?

I will continue working on these items this week; they are all minor issues compared to the PR as a whole.

@ejls, can you please take a look to see if I am missing something?
@rizar, your comments/recommendations are also highly welcome.

batch_size=source_sentence.shape[0],
attended=representation,
attended_mask=tensor.ones(source_sentence.shape).T,
glimpses=self.attention.take_glimpses.outputs[0])
orhanf (Author):

glimpses is unnecessary, forgot to remove it

orhanf (Author) commented Jul 1, 2015

@kyunghyuncho

rizar (Contributor) commented Jul 2, 2015

A quick question: why English and Czech?

orhanf (Author) commented Jul 2, 2015

Among the WMT'15 pairs, Czech-English required the least preprocessing (only tokenization), so I thought it would be the easiest for others to set up. We can add other pairs as well; there is nothing specific to or hard-coded for Cs-En (only a few names, which won't be a problem to change).

# send end of file, read output.
mb_subprocess.stdin.close()
stdout = mb_subprocess.stdout.readline()
print "output ", stdout
Contributor:

It shouldn't be there.

orhanf (Author):

fixed with 6df83ab

dmitriy-serdyuk (Contributor):

You use the logger and prints at the same time. I think we should stick with the logger.
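
A minimal sketch of the suggested change using the standard logging module (the message and the stand-in value are illustrative, not from the PR):

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

stdout = "example line"  # stands in for mb_subprocess.stdout.readline()
# instead of: print "output ", stdout
logger.info("output %s", stdout)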


if j == 0:
# Write to subprocess and file if it exists
print >> mb_subprocess.stdin, trans_out
Contributor:

We use the Python 3-style print everywhere else (from __future__ import print_function).
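
A sketch of that style when writing to a subprocess; 'cat' is only a stand-in here, since the script the PR actually pipes to is not shown in this hunk:

from __future__ import print_function
import subprocess

mb_subprocess = subprocess.Popen(['cat'], stdin=subprocess.PIPE,
                                 stdout=subprocess.PIPE,
                                 universal_newlines=True)
trans_out = "a translated sentence"

# instead of: print >> mb_subprocess.stdin, trans_out
print(trans_out, file=mb_subprocess.stdin)
mb_subprocess.stdin.close()
print(mb_subprocess.stdout.readline())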

orhanf (Author):

done with ecbc38c

fhirschmann:

Thanks for this example!

In __init__.py:319 you save the model to search_model_cs2en_model (due to using save_separately), but in
__init__.py:351 the model is loaded from search_model_cs2en. Hence loading does not work.

A related issue: when config['reload'] is set to True, the first run fails because sampling.py:155 creates a directory named search_model_cs2en and the check in saveload.py:157 uses os.path.exists instead of os.path.isfile.

I'm currently trying to figure out how to save and load the machine translation model, but unfortunately I haven't had much success so far, even when hardcoding the paths the model is saved to and loaded from. I'd really appreciate it if you could look into this.
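
For reference, a minimal sketch of an isfile-based check like the one implied above (the parameter file name is illustrative, not the PR's actual path):

import os

path_to_parameters = os.path.join('search_model_cs2en', 'params.npz')  # illustrative

# os.path.isfile stays False when sampling.py has only pre-created the
# directory, so reloading is attempted only if parameters were written.
if os.path.isfile(path_to_parameters):
    print('reloading parameters')
else:
    print('starting from scratch')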

orhanf (Author) commented Jul 8, 2015

@fhirschmann thanks a lot for the pointers; I've changed the whole checkpointing structure and made it more specific to the NMT example. Currently only the parameters, log, and iteration_state are saved, which is mostly what we need for experiments.

Sampling and beam search still need to be tested.

fhirschmann:

@orhanf, thank you very very much for this. I was under the impression that the Save/Load architecture in blocks would suffice for this.

There are some small problems with the current version:

  • The self.config['saveto'] directory still gets created by sampling.py:157, hence os.path.exists in __init__.py:134 never actually returns False. I suggest changing the exists check to work on self.path_to_parameters. Then maybe the three except Exception as e checks in __init__.py:186-199 are not required anymore.
  • When resuming the first time, the resumed_from log entry contains some binary identifier, e.g. MW÷GRªpN
  • After running the experiment the second time, something happens to the log and it can't be loaded anymore. Loading the log file using blocks.serialization.load works after the experiment has been run once, but after the second time it produces the following traceback:
In [4]: load("model/log")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-d4d06dd157f9> in <module>()
----> 1 load("model/log")

/home/fabian/msc/exp/blocks-examples/env/src/theano/theano/misc/pkl_utils.pyc in load(f, persistent_load)
    318         p = pickle.Unpickler(BytesIO(zip_file.open('pkl').read()))
    319         p.persistent_load = persistent_load(zip_file)
--> 320         return p.load()
    321 
    322 

/home/fabian/msc/exp/blocks-examples/local/lib/python2.7/pickle.pyc in load(self)
    856             while 1:
    857                 key = read(1)
--> 858                 dispatch[key](self)
    859         except _Stop, stopinst:
    860             return stopinst.value

/home/fabian/msc/exp/blocks-examples/local/lib/python2.7/pickle.pyc in load_newobj(self)
   1081         args = self.stack.pop()
   1082         cls = self.stack[-1]
-> 1083         obj = cls.__new__(cls, *args)
   1084         self.stack[-1] = obj
   1085     dispatch[NEWOBJ] = load_newobj

TypeError: buffer() takes at least 1 argument (0 given)

May I ask what version of Python you are using? The last point may actually be a bug in Python 2.7.

fhirschmann:

Please see this pull request as far as sampling is concerned. I have not yet gotten to the BLEU Validator.

fhirschmann:

Another issue I found; it may be limited to sampling, but is more likely an issue with the NMT model itself:

In stream_cs2en.py:51 you set the end-of-sequence marker to the size of the vocabulary. However, the EOS marker is never actually added when the model is computed. In GroundHog this was solved by setting the last element in the sequence to the EOS token, and indeed there are some remnants of this in sampling.py:42, which do not seem to get executed at all.

An example input sequence now looks like this (with a vocabulary size of 220):

array([[ 22, 114,  11,  23, 143,   2,  10, 156,  89,   1,  27,  32,  33,
         38, 165, 119,   2, 137,  85, 154,  63, 120,  54, 208,   6, 182,
          2,  20,   8,  83,   1,   1,   3,   1,   0,   0,   0,   0,   0,
          0,   0,   0,   0]])

Likewise, a sequence does not start with a BOS token, but I believe this was also the case in GroundHog. I also noticed that, disregarding the 0-padding, all sequences end with 1 (the UNK token).

fhirschmann:

I figured out why the EOS token is not present. While fuel.datasets.TextFile does append it, stream_cs2en.py:26 replaces it with 1 (UNK) because it checks using <. Note that using <= does not work due to a Theano out-of-bounds exception. I just set EOS to the vocabulary size minus one (the last element).
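
A minimal sketch of the off-by-one described here (names are illustrative, not the exact code in stream_cs2en.py):

vocab_size = 220
unk_idx = 1
eos_idx = vocab_size - 1  # the fix: keep EOS inside [0, vocab_size)

def cap_vocabulary(seq, vocab_size, unk_idx):
    # With eos_idx == vocab_size, the check `x < vocab_size` is False for
    # EOS, so EOS gets mapped to UNK: the trailing 1s seen earlier.
    return [x if x < vocab_size else unk_idx for x in seq]

seq = [22, 114, 3, eos_idx]
assert cap_vocabulary(seq, vocab_size, unk_idx) == seq  # EOS now survives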

orhanf (Author) commented Jul 9, 2015

@fhirschmann thanks a lot for the pointers and the testing effort again, I really appreciate it :) Please see my comments below.

  • The self.config['saveto'] directory still gets created by sampling.py:157, hence os.path.exists in __init__.py:134 never actually returns False. I suggest changing the exists check to work on self.path_to_parameters. Then maybe the three except Exception as e checks in __init__.py:186-199 are not required anymore.

This will be fixed as I start testing sampling/beam search. The reason we have a separate except Exception for each of the three is that we sometimes provide only one of them, e.g. when initializing with a pre-trained model or changing the training corpus at some point. So we still need them, but yes, the overlaps should be removed; see the sketch below.
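
A minimal sketch of per-artifact loading with independent fallbacks (the artifact names and loaders are illustrative assumptions, not the PR's exact code):

import logging

logger = logging.getLogger(__name__)

def try_load(path, loader):
    # Load one checkpoint artifact; warn and continue if it is missing or
    # broken, since each artifact is optional (e.g. a pre-trained model
    # may ship parameters only).
    try:
        return loader(path)
    except Exception as e:
        logger.warning("Could not load %s: %s", path, e)
        return None

# Usage (loaders are placeholders):
# params = try_load('model/params.npz', numpy.load)
# log = try_load('model/log', blocks.serialization.load)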

  • When resuming the first time, the resumed_from log entry contains some binary identifier, e.g. MW÷GRªpN

This seems more like a Blocks issue caused by log resuming; I will take a closer look at this one soon.

  • After running the experiment the second time, something happens to the log and it can't be loaded anymore. Loading the log file using blocks.serialization.load works after the experiment has been run once, but after the second time it produces the following traceback:

Nice catch again; I will try to figure out the problem, but again, its source is probably beyond the scope of this PR.

  • May I ask what version of Python you are using?

Python 2.7.6 (64-bit) is the default here at MILA.

  • I just set EOS to the vocabulary size minus one (the last element).

This is also apparently a sync blunder of mine: in GroundHog we set the vocabulary size to V and the EOS index to V-1. So please either increase the vocabulary size by one or set the EOS index to one minus the vocabulary size (as you suggested), depending on your problem.

I am out of town attending a conference and will be back at MILA in one week; I will try to resolve these issues as I find time.

fhirschmann:

Thanks @orhanf, I would very much like to continue to test and help fix the rest of this code.

# Add early stopping based on bleu
if config['bleu_script'] is not None:
    logger.info("Building bleu validator")
    BleuValidator(sampling_input, samples=samples, config=config,

Contributor:

You forgot to extensions.append() here.
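
A sketch of the fix, assuming the surrounding scope from the hunk above (the remaining BleuValidator arguments, truncated in the hunk, are elided here too):

# Add early stopping based on bleu
if config['bleu_script'] is not None:
    logger.info("Building bleu validator")
    extensions.append(
        BleuValidator(sampling_input, samples=samples, config=config))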

bartvm (Member) commented Jul 13, 2015

The binary value for resumed_from is correct; it is the binary universally unique identifier (UUID) used to refer to the previous log.
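
For the curious, a raw 16-byte UUID prints as gibberish but can be rendered readable with the standard uuid module (a generic sketch; that the log stores exactly these raw bytes is an assumption):

import uuid

raw = uuid.uuid4().bytes       # 16 raw bytes, like the value quoted above
print(repr(raw))               # binary gibberish
print(uuid.UUID(bytes=raw))    # canonical hexadecimal form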

dmitriy-serdyuk (Contributor):

I restarted Travis; the test should pass now.

dmitriy-serdyuk (Contributor):

That's weird. @orhanf, can you rebase?

dmitriy-serdyuk (Contributor):

Well, it's a huge PR already. I'll merge it and open an issue to refactor it sometime in the future.

dmitriy-serdyuk mentioned this pull request on Aug 9, 2015 (3 tasks).
dmitriy-serdyuk added a commit that referenced this pull request on Aug 9, 2015.
dmitriy-serdyuk merged commit 7143f3b into mila-iqia:master on Aug 9, 2015.