A word reordering model.
scripts
.gitignore
CMakeLists.txt
README.md
aligned_tree.cc
aligned_tree.h
alignment_constructor.cc
alignment_constructor.h
alignment_heuristic.cc
alignment_heuristic.h
definitions.h
dictionary.cc
dictionary.h
distributed_rule_counts.cc
distributed_rule_counts.h
filter.cc
generate_alignments.cc
grammar.cc
grammar.h
heuristic.cc
log_add.h
multi_sample_reorderer.cc
multi_sample_reorderer.h
node.cc
node.h
pcfg_table.cc
pcfg_table.h
reorder_main.cc
reorderer.cc
reorderer.h
reorderer_base.h
restaurant.h
restaurant_process.h
restaurant_process_inl.h
rule_extractor.cc
rule_extractor.h
rule_matcher.cc
rule_matcher.h
rule_reorderer.cc
rule_reorderer.h
rule_stats_reporter.cc
rule_stats_reporter.h
sampler.cc
sampler.h
sampler_main.cc
single_sample_reorderer.cc
single_sample_reorderer.h
time_util.cc
time_util.h
translation_table.cc
translation_table.h
tree.h
util.cc
util.h
viterbi_reorderer.cc
viterbi_reorderer.h

README.md

worm

A word reordering model that includes an implementation of the nonparametric Bayesian model for synchronous tree substitution grammar induction proposed by Cohn and Blunsom (2009).

Getting started

Before starting, you should:

  • Build the sampler using CMake.
  • Install cdec. Its root directory is referred to as $CDEC in this document.

Preparing your parallel data

This section assumes you have a parallel corpus of source trees in corpus.parsed-zh, in a one-tree-per-line format as follows:

(IP (NP (NR 伊犁)) (VP (ADVP (AD 大规模)) (VP (VV 开展) (NP (IP (VP (VRD (PU “) (VV 面对面) (PU ”) (VV 宣讲)))) (NP (NN 活动))))))
...

The target strings are in corpus.en, one sentence per line, tokenized (and, optionally, lowercased):

yili launches large - scale ' face - to - face ' propaganda activity
...

You will need to extract the terminal sentences from corpus.parsed-zh, which can be done with the following command.

./worm/scripts/scripts/extract-terminals.pl corpus.parsed-zh > corpus.zh
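To make the expected result concrete, here is a minimal Python sketch of terminal extraction from the tree format shown above. It is not the Perl script shipped with this repository; `extract_terminals` is a hypothetical helper, and it assumes every leaf appears as a `(TAG word)` pair with no unescaped parentheses inside the word.

```python
import re
from typing import List

# A leaf is a "(TAG word)" pair where neither part contains
# whitespace or parentheses (assumption about the tree format).
LEAF = re.compile(r"\([^()\s]+ ([^()\s]+)\)")


def extract_terminals(tree: str) -> List[str]:
    """Return the leaf tokens of one bracketed parse tree, left to right."""
    return LEAF.findall(tree)


if __name__ == "__main__":
    line = "(IP (NP (NR 伊犁)) (VP (ADVP (AD 大规模)) (VP (VV 开展) (NP (NN 活动)))))"
    # Join the leaves with single spaces, one sentence per tree.
    print(" ".join(extract_terminals(line)))
```

Each line of corpus.zh should then contain exactly the leaf tokens of the corresponding tree, separated by spaces.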

You will now need to create training data corpus.zh-en for the word aligner that is used as the base distribution for the nonparametric tree aligner:

$CDEC/corpus/paste-corpus.pl corpus.zh corpus.en > corpus.zh-en
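The joined file uses the `source ||| target` line format that fast_align reads. The Python sketch below illustrates that format only; `paste_corpus` is a hypothetical helper, and any extra cleanup that cdec's paste-corpus.pl may perform is not reproduced here.

```python
from itertools import zip_longest
from typing import Iterable, Iterator


def paste_corpus(source_lines: Iterable[str],
                 target_lines: Iterable[str]) -> Iterator[str]:
    """Yield 'source ||| target' lines from two parallel line sequences."""
    missing = object()  # sentinel marking the shorter side running out
    pairs = zip_longest(source_lines, target_lines, fillvalue=missing)
    for number, (src, tgt) in enumerate(pairs, start=1):
        if src is missing or tgt is missing:
            raise ValueError(f"corpora differ in length at line {number}")
        yield f"{src.strip()} ||| {tgt.strip()}"


if __name__ == "__main__":
    zh = ["伊犁 大规模 开展 活动\n"]
    en = ["yili launches large - scale activity\n"]
    for line in paste_corpus(zh, en):
        print(line)
```

Failing on a length mismatch is a deliberate choice in this sketch: a silently truncated corpus would misalign every later step.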

Initial word alignment and base parameters

This section creates a basic word alignment, corpus.gdfa, along with the lexical translation probabilities fwd.probs and rev.probs.

$CDEC/word-aligner/fast_align -i corpus.zh-en \
    -v -o -d -p fwd.probs -t -10000 > fwd.al
$CDEC/word-aligner/fast_align -i corpus.zh-en \
    -v -o -d -p rev.probs -t -10000 -r > rev.al
$CDEC/utils/atools -i fwd.al -j rev.al -c grow-diag-final-and > corpus.gdfa
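The grow-diag-final-and heuristic starts from the links both directional alignments agree on and adds selected links from their union. The Python sketch below computes only those two bounding sets, not the growing step itself. It assumes fwd.al and rev.al hold one sentence per line in the `i-j` link format with the same orientation; `parse_alignment` and `intersect_and_union` are hypothetical names.

```python
from typing import Set, Tuple

Link = Tuple[int, int]


def parse_alignment(line: str) -> Set[Link]:
    """Parse a line such as '0-0 1-2' into a set of (source, target) index pairs."""
    links = set()
    for item in line.split():
        src, tgt = item.split("-")
        links.add((int(src), int(tgt)))
    return links


def intersect_and_union(fwd_line: str, rev_line: str) -> Tuple[Set[Link], Set[Link]]:
    """Return (intersection, union) of two directional alignments for one sentence."""
    fwd = parse_alignment(fwd_line)
    rev = parse_alignment(rev_line)
    return fwd & rev, fwd | rev


if __name__ == "__main__":
    both, either = intersect_and_union("0-0 1-2 2-1", "0-0 1-2 3-3")
    print(sorted(both))
    print(sorted(either))
```

The symmetrized alignment for a sentence lies between these two sets: it contains every link in the intersection and no link outside the union.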

Infer grammars

This section describes running the TSG aligner. Outputs are written every 10 iterations to worm-out/.

./worm/sampler --alignment corpus.gdfa \
               --trees corpus.parsed-zh \
               --strings corpus.en \
               --forward-prob fwd.probs \
               --reverse-prob rev.probs \
               --output worm-out \
               --threads 6 \
               --iterations 1000 &>log.sampler &
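Because the sampler runs for a long time in the background, it can be worth checking first that its per-sentence inputs line up. The Python sketch below compares line counts only. It assumes the trees, strings, and alignment files each hold one sentence per line; `count_lines` and `check_parallel` are hypothetical helpers, not part of this repository.

```python
from typing import Dict, Sequence


def count_lines(path: str) -> int:
    """Count the lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel(paths: Sequence[str]) -> int:
    """Return the shared line count, or raise if the files disagree."""
    if not paths:
        raise ValueError("no files given")
    counts: Dict[str, int] = {path: count_lines(path) for path in paths}
    if len(set(counts.values())) != 1:
        raise ValueError(f"line counts differ: {counts}")
    return counts[paths[0]]


if __name__ == "__main__":
    # File names as used in the sampler command above.
    n = check_parallel(["corpus.parsed-zh", "corpus.en", "corpus.gdfa"])
    print(f"{n} parallel sentences")
```

A matching line count does not prove the files are correctly aligned, but a mismatch reliably indicates a problem in an earlier step.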

Convert grammars to cdec format

cdec supports tree-to-string translation. To use the grammars produced by the sampler with cdec, convert them into cdec's grammar format as follows:

./worm/scripts/convert-worm-to-cdec.pl worm-out/output.grammar |
   ./worm/scripts/featurize-cdec-grammar.pl | gzip -9 > rules.t2s.gz