hitochan777/forest_aligner

Forest Aligner

a hierarchical, forest-based discriminative alignment package

This document describes how to use and run forest-aligner.

1. Requirements

forest-aligner currently depends on a few packages for logging, ui, implementation, and parallelization:

  1. python-gflags: a commandline flags module for python
  2. pyglog: a logging facility for python based on google-glog
  3. svector: a python module for sparse vectors, by David Chiang. A version is included in this distribution under svector/. You can download the latest version from: http://www.isi.edu/~chiang/software/svector.tgz
  4. An MPI Implementation. We use MPICH2. Download the latest stable release for your architecture. Then follow the Installer's Guide, available in the documentation section of the website. See Section II for more information.
  5. Boost MPI Python bindings. (http://www.boost.org/users/download/) Installation instructions: http://www.boost.org/doc/libs/1_49_0/doc/html/mpi.html

2. Preparing your data

  1. To train an alignment model you will need some data. We use some simple canonical filenames below in describing each, but you can call them anything you'd like.

    1. train.f: a file of source-language sentences, one per line.

    2. train.e: a file of target-language sentences, one per line.

    3. train.a: a file of gold-standard alignments for each sentence pair in train.f and train.e; each line in the file should be a sequence of space-separated strings encoding a single link in f-e format as follows:

       0-1 1-2 2-2 2-3 4-5
      
    4. train.e-parse: a file of target-language parse trees, one for each line in train.e; trees should be in standard Penn Treebank format as follows:

      (TOP (S (NP (DT the) (NN man)) (VP (VBD ate))))
      

      We use tokens -RRB- and -LRB- to represent right and left parentheses, respectively (see below).

    5. train.f-parse: a file of source-language parse-trees, one for each line in train.f (OPTIONAL)

    Also prepare heldout development and test data in the same manner. Source-tree files are optional, but all others are required. Throughout the rest of this document we use the same filename extensions as above for our development and test data, e.g.:

    dev.e  <-- target-language sentences in heldout development data
    dev.f  <-- source-language sentences in heldout development data
    test.e <-- target-language sentences in heldout test data
    test.f <-- source-language sentences in heldout test data
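To make the alignment file format described above concrete, here is a minimal sketch that parses one line of train.a into (f, e) index pairs. The function name `parse_alignment_line` is hypothetical and not part of forest-aligner; the sketch assumes every link is written as two non-negative integers joined by a hyphen.

```python
def parse_alignment_line(line):
    """Parse f-e links such as '0-1 1-2' into a list of (f, e) integer pairs."""
    links = []
    for token in line.split():
        f_index, e_index = token.split("-")
        links.append((int(f_index), int(e_index)))
    return links


if __name__ == "__main__":
    # The example line from the format description above.
    print(parse_alignment_line("0-1 1-2 2-2 2-3 4-5"))
```

A sentence pair with no links is simply an empty line, which this sketch maps to an empty list.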

    ADDITIONAL NOTES:

    1. Why use a heldout development (dev) and test set?

      After every epoch of training, forest-aligner checks its current performance on this dev set. When performance is no longer increasing on this dev set, we say that we've converged and we stop training. (Section III(D) and III(E))

      Since we select the alignment model to ultimately use based on performance on our development set, we need a second heldout set as a way to predict performance on truly unseen future data. Using the model we select based on development performance in Section III(E), we align our test.* data and note the accuracy.

    2. We relabel parentheses tokens before parsing, i.e. "(" -> -LRB- and ")" -> -RRB-. For example:

      $ sed -e 's/(/-LRB-/g' -e 's/)/-RRB-/g' < input > input.clean
      

      And then we parse, by doing:

        java -Xmx2600m -Xms2600m -jar berkeleyParser.jar \
            -gr eng_grammar.gr \
            -binarize \
            -maxLength 1000 < input.clean > output
      

      Because of the way cube pruning works, you will encounter far fewer search errors if you binarize your trees before training by using the -binarize flag.

    3. In case of sentences that failed to parse: Use a blank line, a 0 on a line by itself, or the Berkeley parser default failure string: (()) to tell forest-aligner to skip the affected sentence pair.
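Notes 2 and 3 above can be sketched in Python as follows. This is an illustration only: `relabel_parentheses` mirrors the sed command shown earlier, and `is_failed_parse` checks for the three skip markers listed in note 3. Both function names are hypothetical and do not come from forest-aligner itself.

```python
# The three ways a parse-tree line can mark a sentence pair to be skipped.
FAILED_PARSE_MARKERS = {"", "0", "(())"}


def relabel_parentheses(sentence):
    """Replace literal parentheses with -LRB- / -RRB- tokens before parsing."""
    return sentence.replace("(", "-LRB-").replace(")", "-RRB-")


def is_failed_parse(tree_line):
    """Return True if this tree line marks a sentence that failed to parse."""
    return tree_line.strip() in FAILED_PARSE_MARKERS


if __name__ == "__main__":
    print(relabel_parentheses("he ( she said ) left"))
    print(is_failed_parse("(())"))
```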

3. Tables from GIZA++ output (Brown et al., 1993; Och and Ney, 2003)

We run GIZA++ Model-4 on a large corpus, and compute p(e|f) and p(f|e) word association tables by simply counting links in the final Viterbi alignment. If you don't have time to run Model-4, that's fine. We've seen benefits from using counts from just HMM or Model-1 training.

p(e|f) file format:
<e-word> <f-word> p(e|f)

p(f|e) file format:
<f-word> <e-word> p(f|e)
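The counting step described above can be sketched as relative-frequency estimation over aligned links. This is a minimal sketch under assumptions: the function name `estimate_tables` and the in-memory input layout (word lists plus (f, e) index pairs) are hypothetical, and any smoothing or thresholding used in practice is not shown.

```python
from collections import Counter


def estimate_tables(sentence_pairs):
    """Estimate p(e|f) and p(f|e) by counting aligned word pairs.

    sentence_pairs: iterable of (f_words, e_words, links), where links is a
    list of (f_index, e_index) pairs taken from a Viterbi alignment.
    """
    pair_counts = Counter()
    f_counts = Counter()
    e_counts = Counter()
    for f_words, e_words, links in sentence_pairs:
        for f_index, e_index in links:
            f_word, e_word = f_words[f_index], e_words[e_index]
            pair_counts[(f_word, e_word)] += 1
            f_counts[f_word] += 1
            e_counts[e_word] += 1
    # Keys follow the column order of the two file formats above.
    p_e_given_f = {(e, f): c / f_counts[f] for (f, e), c in pair_counts.items()}
    p_f_given_e = {(f, e): c / e_counts[e] for (f, e), c in pair_counts.items()}
    return p_e_given_f, p_f_given_e


if __name__ == "__main__":
    data = [(["la", "casa"], ["the", "house"], [(0, 0), (1, 1)])]
    pef, pfe = estimate_tables(data)
    for (e_word, f_word), prob in sorted(pef.items()):
        print(e_word, f_word, prob)
```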

4. Alignment files from GIZA++ (OPTIONAL)

You can pass up to two third-party alignment files to the trainer with flags --a1 and --a2 in nile.py. For --a1 we use the intersection of Model-4 alignments from the e->f and f->e directions. For --a2 we use grow-diag-final-and symmetrized alignments. These alignments will allow the trainer to fire indicator features for making the same predictions as your supplied alignments. Feel free to substitute any other type of alignments here as input. Using GIZA++ Model-4 intersection and grow-diag-final-and alignments here, we generally see a large F-score increase.
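As a rough illustration of the intersection used for --a1, the sketch below keeps only the links that appear in both directional alignments for one sentence pair. It assumes both input lines have already been converted to the same f-e orientation; the function name `intersect_alignment_lines` is hypothetical. Grow-diag-final-and symmetrization is more involved and is not shown.

```python
def intersect_alignment_lines(line_a, line_b):
    """Return the links present in both alignment lines, sorted by index."""
    common = set(line_a.split()) & set(line_b.split())

    def link_key(link):
        # Sort numerically by (f, e) rather than as strings.
        f_index, e_index = link.split("-")
        return int(f_index), int(e_index)

    return " ".join(sorted(common, key=link_key))


if __name__ == "__main__":
    print(intersect_alignment_lines("0-0 1-1 2-1", "0-0 2-1 3-2"))
```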

5. Vocabulary files.

We'll need to give the trainer (and aligner) some vocabulary files it will use to filter potentially large p(e|f) and p(f|e) data files. Keeping these full data files in memory can be prohibitively expensive.

Concatenate your training and development e and f files and run
prepare-vocab.py:
$ cat train.e dev.e | ./prepare-vocab.py > e.vcb
$ cat train.f dev.f | ./prepare-vocab.py > f.vcb

Use these files as input to nile.py with flags --evcb and --fvcb.
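The idea behind vocabulary filtering can be sketched as follows. The output format of prepare-vocab.py is not described in this document, so the sketch keeps vocabularies as in-memory sets instead of reading .vcb files; `build_vocabulary` and `filter_table` are hypothetical names, and the table lines are assumed to follow the three-column format from Section 3.

```python
def build_vocabulary(sentences):
    """Collect the set of word types appearing in whitespace-tokenized sentences."""
    vocabulary = set()
    for sentence in sentences:
        vocabulary.update(sentence.split())
    return vocabulary


def filter_table(table_lines, first_vocab, second_vocab):
    """Keep '<word1> <word2> <prob>' lines whose words are both in vocabulary."""
    kept = []
    for line in table_lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip malformed lines
        if fields[0] in first_vocab and fields[1] in second_vocab:
            kept.append(line)
    return kept


if __name__ == "__main__":
    e_vocab = build_vocabulary(["the man ate"])
    f_vocab = build_vocabulary(["el hombre comio"])
    table = ["the el 0.9", "dog perro 0.8", "man hombre 0.7"]
    print(filter_table(table, e_vocab, f_vocab))
```

Only the rows that can ever be looked up for the given data survive, which is what keeps the memory footprint small.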

Training

Training a new model with forest-aligner involves (1) specifying your data files as commandline arguments, and (2) invoking training mode. We provide a sample training script, train.sh, in this distribution invoking only the flags required to get going.

  1. Cluster computing. The sample training script uses the Portable Batch System (PBS), a popular networked subsystem for controlling jobs on a computing cluster. You can remove the PBS directives at the top of the file if you are running locally on a single machine (we strongly recommend machines with multiple CPUs), or just modify the file to suit your architecture.

  2. MPI. Take note of where your MPI binaries, libraries, and MPI Python bindings live. Then modify the MPI Initialization section with the appropriate paths.

  3. Training name. Every training run has a name. Your run's default name is: d.k.n..target-tree.0

  4. Running the program. On PBS, do:

    $ qsub train.sh

    Or, on a local machine with multiple CPUs, do:

    $ ./train.sh

  5. Inspecting accuracy on the held-out data. To inspect held-out F-scores, do:

    $ grep F-score-dev .err

    To sort held-out F-scores in descending order, do:

    $ grep F-score-dev .err | awk '{print $2}' | cat -n | sort -nr -k 2

  6. Convergence. If the highest-scoring epoch, H, is much earlier than your current epoch number, you have probably converged. Kill the training job and extract weights from epoch H:

    $ ./weights H

    Weights will be written to file: .weights-H
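The grep/awk/sort pipeline for ranking held-out F-scores can be mirrored in Python. This sketch assumes, as the awk command implies, that the score is the second whitespace-separated field on each line containing F-score-dev, and that matching lines appear in epoch order. The function name `rank_dev_fscores` is hypothetical.

```python
def rank_dev_fscores(log_lines):
    """Return (position, score) pairs sorted by score, best first.

    position is the 1-based order in which F-score-dev lines appear,
    matching the numbering produced by `cat -n` in the shell pipeline.
    """
    scores = []
    for line in log_lines:
        if "F-score-dev" in line:
            scores.append(float(line.split()[1]))
    numbered = enumerate(scores, start=1)
    return sorted(numbered, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up log lines, for illustration only.
    log = ["F-score-dev 0.61", "F-score-dev 0.74", "F-score-dev 0.70"]
    ranked = rank_dev_fscores(log)
    print("best position:", ranked[0][0], "score:", ranked[0][1])
```

If the best position is much earlier than the most recent one, that is the convergence signal described in step 6.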

User-defined features

You can add your own feature functions to Features.py or maintain several different Feature modules for different language pairs.

If you want to have Feature modules for, say, Arabic-English and Chinese-English, name them: Features_ar_en.py and Features_zh_en.py respectively.

Setting the --langpair flag with argument LANG1_LANG2 will tell forest-aligner to use these modules. forest-aligner will look for a file called: Features_LANG1_LANG2.py.

    nile.py --e train.e \
            --f train.f \
            ... --langpair ar_en
Iterative Viterbi training & inference (optional)

This procedure is somewhat time-consuming because you will need to train several models, and align your data several times. However, if you have the time, the improvement in alignment quality may be worth it.

Parse trees for both target and source text are required for this procedure.

  1. Train a target-tree model as in section III.

  2. Train a source-tree model by:

    1. Transform your gold-standard data to e-f format; source-tree models will read and output alignments in e-f format as opposed to f-e format.

      $ perl -pe 's/(\d+)-(\d+)/$2-$1/g' < train.a.f-e > train.a.e-f

    2. Flip the argument flags for your e and f data when you run forest-aligner. For example:

      python aligner.py \
          --e train.f \
          --f train.e \
          --gold train.a.e-f \
          --ftrees train.e-parse \
          --etrees train.f-parse \
          --fdev dev.e \
          --edev dev.f \
          --ftreesdev dev.e-parse \
          --etreesdev dev.f-parse \
          --golddev dev.a.e-f \
          --fvcb e.vcb \
          --evcb f.vcb \
          --pfe GIZA++.m4.pef \
          --pef GIZA++.m4.pfe \
          --a1 train.m4i.e-f \
          --a2 train.m4gdfa.e-f \
          --a1_dev dev.m4i.e-f \
          --a2_dev dev.m4gdfa.e-f \
          --langpair zh_en \
          --train \
          --k 128

  3. The next step of training involves learning target-tree and source-tree models again, but this time giving as input the outputs of the models learned in the first round. You do this with the --inverse and --inverse_dev flags.

  4. Run forest-aligner in --align mode and align your training data and then dev data with the source-tree model you've learned.

  5. Flip the alignment links to f-e format and supply these to your next target-tree training with --inverse and --inverse_dev, e.g.:

    python aligner.py --e train.e --f train.f --a train.a.f-e --inverse train-st.a.f-e ... etc.
    

    forest-aligner will fire features to softly enforce agreement between the two models.

  6. Analogously, for your next source-tree model, flip the aligned 1-best alignments of your training and dev data from the target-tree model to e-f format, and supply it to forest-aligner with the --inverse and --inverse_dev flags:

    python aligner.py --e train.f --f train.e --a train.a.e-f --inverse train-tt.a.e-f ... etc.
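The link-flipping step used throughout this procedure (shown earlier as a perl one-liner) can also be written in Python. The function name `flip_links` is hypothetical; the regular expression is the same one the perl command uses.

```python
import re


def flip_links(line):
    """Swap the two indices of every link, e.g. f-e format to e-f format."""
    return re.sub(r"(\d+)-(\d+)", r"\2-\1", line)


if __name__ == "__main__":
    f_e = "0-1 1-2 2-2 2-3 4-5"
    e_f = flip_links(f_e)
    print(e_f)
    # Flipping twice restores the original orientation.
    print(flip_links(e_f) == f_e)
```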
    

Testing

At test time, it is important to use the same types of parameters and input data you used during training. If you trained a model with a beam of K=128, then keep that beam at test time. If you used GIZA++ Model-4 alignments as input with flags --a1 and --a2, then similarly also supply alignment predictions from GIZA++ at test time. Finally, binarize your trees on test data the same way you did for training and development data.

  1. Preparing Vocabulary files: As with the training and development data, prepare source and target vcb files.

    $ ./prepare-vocab.py < test.e > test.e.vcb
    $ ./prepare-vocab.py < test.f > test.f.vcb
    

    Use these files as arguments for the --evcb and --fvcb flags to nile.py in your testing script.

  2. Editing test.sh. Edit the WEIGHTS= line in test.sh for your weights filename.

  3. Set forest-aligner to "align" mode. At test time, we replace the --train flag with the --align flag when running aligner.py.

  4. Running test.sh. Then, run the test script test.sh. Using PBS, do:

    $ qsub test.sh
    

    Or, on a local machine with multiple CPUs:

    $ ./test.sh
    

    By default, alignment output is written in f-e format to: .weights-H.test-output.a

  5. Evaluation. If you are aligning data for which you have gold-standard alignments, you can calculate F-measure using our provided scripts. Remember, alignments are output in f-e format and should be compared against data in the same format.

    $ ./Fmeasure.py <your-file> <gold-file>
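For reference, a balanced F-measure over alignment links can be sketched as below. This is not the code of Fmeasure.py, whose exact definition (for example, corpus-level versus per-sentence counting) is not given here; `f_measure` is a hypothetical per-sentence helper that treats each link string as a set element.

```python
def f_measure(predicted_line, gold_line):
    """Return (precision, recall, F1) for one pair of alignment lines."""
    predicted = set(predicted_line.split())
    gold = set(gold_line.split())
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    p, r, f = f_measure("0-0 1-1 2-2", "0-0 1-1 2-3 3-3")
    print("precision=%.3f recall=%.3f F=%.3f" % (p, r, f))
```

Both lines must use the same link orientation (f-e), otherwise correct links will not match.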
    

Other options

A. L1 Regularization (Feature Selection; experimental)

forest-aligner implements a parallelized version of L1 regularization via projection after each epoch (Tibshirani, 1996; Duchi et al., 2008; Martins et al., 2011).

Enabling this feature will allow you to learn a much smaller model that, in our experiments, should achieve essentially the same accuracy or better. This is useful for scaling to very large training sets, and you may also see some generalization benefits.

To enable, set forest-aligner's L1 Tau coefficient variable to 1 with commandline flag: --tau 1
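To give a sense of what projection-based L1 regularization does, here is a minimal sketch of the sort-based Euclidean projection onto an L1 ball in the style of Duchi et al. (2008), applied to a sparse weight dictionary. It is not forest-aligner's implementation: the function name `project_onto_l1_ball` is hypothetical, the parallelization is not shown, and how `radius` relates to the --tau coefficient is an assumption this document does not settle.

```python
def project_onto_l1_ball(weights, radius):
    """Project a {feature: weight} dict onto the L1 ball of the given radius."""
    if radius <= 0:
        raise ValueError("radius must be positive")
    magnitudes = sorted((abs(v) for v in weights.values()), reverse=True)
    if sum(magnitudes) <= radius:
        return dict(weights)  # already inside the ball, nothing to do
    running_sum = 0.0
    theta = 0.0
    for count, magnitude in enumerate(magnitudes, start=1):
        running_sum += magnitude
        candidate = (running_sum - radius) / count
        if magnitude - candidate > 0:
            theta = candidate  # threshold from the largest valid prefix
    projected = {}
    for name, value in weights.items():
        shrunk = abs(value) - theta
        if shrunk > 0:  # weights shrunk to zero are dropped (sparsity)
            projected[name] = shrunk if value > 0 else -shrunk
    return projected


if __name__ == "__main__":
    print(project_onto_l1_ball({"a": 3.0, "b": -1.0}, 2.0))
```

Small weights are driven exactly to zero and removed, which is where the smaller model comes from.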

B. Debiasing (experimental)

While L1 yields sparse solutions, these solutions are known to be biased in magnitude, which may negatively affect accuracy. After learning a sparse model with L1, we train a new model, but this time we only allow the features we learned in our sparse model to fire. This is called debiasing.

To enable:

  1. Turn debiasing mode on: --debiasing
  2. Tell forest-aligner about the weight vector you learned during the L1 feature selection step: use --debiasing_weights and supply a weight vector in svector format: --debiasing_weights <sparse-model weights>. Make sure you have removed the --tau flag from your forest-aligner invocation.
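The restriction that debiasing imposes can be sketched as a filter over feature vectors. This is an illustration with plain dictionaries rather than svector objects, and `restrict_to_selected` is a hypothetical name: only features with a nonzero weight in the sparse L1 model are allowed to fire in the new model.

```python
def restrict_to_selected(features, sparse_weights):
    """Drop every feature that has no nonzero weight in the sparse model."""
    selected = {name for name, weight in sparse_weights.items() if weight != 0.0}
    return {name: value for name, value in features.items() if name in selected}


if __name__ == "__main__":
    # Made-up feature names, for illustration only.
    sparse_model = {"lex_match": 1.2, "same_pos": 0.0}
    fired = {"lex_match": 1.0, "same_pos": 1.0, "distortion": 3.0}
    print(restrict_to_selected(fired, sparse_model))
```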

C. Advanced Perceptron Updates

  1. Changing the default oracle. In selecting the oracle towards which we update, Chiang et al. (2008) modify, for their task, the traditional selection criterion from minimized loss to a linear combination of minimized loss and model score. We call this the "hope" oracle, because we have more of a chance to reach it; it has high model score and low loss. To use a "hope" oracle, use flag and argument: --oracle hope
  2. Changing the default hypothesis. In selecting the hypothesis that we update our model away from (and towards the oracle), we can select a hypothesis somewhat analogously to selecting the "hope" oracle as described above. In this case, we modify the default selection criterion from maximum model score to a linear combination of maximum model score and maximum loss. We call this the "fear" hypothesis; it has the nefarious property of having both a high model score (our model likes it), but also very high loss (it is a bad alignment). To use a "fear" hypothesis, use flag and argument: --hyp fear
  3. Changing the default learning rate. There is a single learning rate parameter used in the standard perceptron update which affects the magnitude of each update. It is set to 1.0 by default. To use a different learning rate, use: --learning_rate <new learning rate>
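The three options above can be sketched together. This is a simplified illustration, not forest-aligner's code: candidates are plain dictionaries with hypothetical "score", "loss", and "features" keys, the linear combinations use equal weights (an assumption), and the update is the standard perceptron step scaled by a learning rate.

```python
def select_hope(candidates):
    """Oracle with high model score and low loss."""
    return max(candidates, key=lambda c: c["score"] - c["loss"])


def select_fear(candidates):
    """Hypothesis with high model score and high loss."""
    return max(candidates, key=lambda c: c["score"] + c["loss"])


def perceptron_update(weights, oracle_features, hyp_features, learning_rate=1.0):
    """Move weights towards the oracle's features and away from the hypothesis'."""
    updated = dict(weights)
    for name, value in oracle_features.items():
        updated[name] = updated.get(name, 0.0) + learning_rate * value
    for name, value in hyp_features.items():
        updated[name] = updated.get(name, 0.0) - learning_rate * value
    return updated


if __name__ == "__main__":
    # Made-up candidates, for illustration only.
    candidates = [
        {"score": 2.0, "loss": 0.0, "features": {"x": 1.0, "y": 1.0}},
        {"score": 3.0, "loss": 2.0, "features": {"x": 1.0, "z": 2.0}},
    ]
    hope = select_hope(candidates)
    fear = select_fear(candidates)
    print(perceptron_update({"x": 1.0}, hope["features"], fear["features"], 0.5))
```

Features shared by the oracle and the hypothesis cancel out, so only the features on which the two disagree change weight.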

Questions/comments

Troubleshooting: If you've gone through this brief guide and are having trouble getting this software to work for you, send mail to Jason Riesa riesa@isi.edu.

Technical correspondence also welcomed.

If you are interested in contributing to this project please also let us know!

References

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, Robert L. Mercer. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, Volume 19, Number 2, pages 263-311. June 1993.

David Chiang, Yuval Marton, and Philip Resnik. Online Large-Margin Training of Syntactic and Structural Translation Features. 2008. Proceedings of EMNLP, pages 224-233.

John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient Projections onto the L1-Ball for Learning in High Dimensions. 2008. Proceedings of ICML.

Andre F. T. Martins, Noah A. Smith, Pedro M. Q. Aguiar, Mario A. T. Figueiredo. Structured Sparsity in Structured Prediction. 2011. Proceedings of EMNLP, pages 1500-1511.

Franz Josef Och and Hermann Ney. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, Volume 29, Number 1, pages 19-51. March 2003.

Jason Riesa and Daniel Marcu. Hierarchical Search for Word Alignment. 2010. Proceedings of ACL, pages 157-166.

Jason Riesa, Ann Irvine, and Daniel Marcu. Feature-Rich Language-Independent Syntax-Based Alignment for Statistical Machine Translation. 2011. Proceedings of EMNLP, pages 497-507.

Jason Riesa and Daniel Marcu. Automatic Parallel Fragment Extraction from Noisy Data. 2012. Proceedings of the NAACL HLT. To appear.

Robert Tibshirani. Regression shrinkage and selection via the lasso. 1996. J. Royal. Statist. Soc B., Vol. 58, No. 1, pages 267-288.