diff --git a/paper/acl2014.tex b/paper/acl2014.tex index 0f280052..8168d918 100644 --- a/paper/acl2014.tex +++ b/paper/acl2014.tex @@ -58,10 +58,11 @@ \section{Introduction} %The algorithm starts by examining boundary words: the last word of each hypothesis and the first word of each phrase. Options that score well together are refined by examining additional words from hypotheses with matching suffixes and phrases with matching prefixes. This refinement process continues until some combinations have been fully scored by examining all $2N-2$ relevant words. -As with most search algorithms for phrase-based machine translation, our algorithm is approximate. One can trade between CPU time and search accuracy by choosing how many hypotheses to keep in each step of the search. The primary claim is that our algorithm offers a better trade-off between time and accuracy when compared with the popular cube pruning algorithm \cite{cubit}. +Like cube pruning \cite{cubit}, our search algorithm is approximate. Our primary claim is that we can attain higher accuracy in less time than cube pruning. +%As with most search algorithms for phrase-based machine translation, our algorithm is approximate. One can trade between CPU time and search accuracy by choosing how many hypotheses to keep in each step of the search. The primary claim is that our algorithm offers a better trade-off between time and accuracy when compared with the popular cube pruning algorithm \cite{cubit}. \section{Related Work} -Part of our phrase-based decoding algorithm is inspired by the syntactic decoding algorithm of \newcite{search}. Their work exploited common prefixes and suffixes of translation hypotheses in order to efficiently reason over many hypotheses at once. In some sense, phrase-based translation is simpler because hypotheses are constructed from left to right, so there is no need to worry about the prefix of a hypothesis. However, this simplification comes with a different cost: phrase-based translation implements reordering by allowing hypotheses to correspond to discontiguous words in the source sentence. There are exponentially many ways to cover the source sentence, so we developed an efficient way for the language model to reason over hypotheses that cover different parts of the source sentence. In contrast, syntactic machine translation hypotheses correspond to contiguous spans in the source sentence, so \newcite{search} simply ran their algorithm separately in every possible span. +The coarse-to-fine refinement portion of our contribution is inspired by the syntactic decoding algorithm of \newcite{search}. Their work exploited common prefixes and suffixes of translation hypotheses in order to efficiently reason over many hypotheses at once. In some sense, phrase-based translation is simpler because hypotheses are constructed from left to right, so there is no need to worry about the prefix of a hypothesis. However, this simplification comes with a different cost: phrase-based translation implements reordering by allowing hypotheses to correspond to discontiguous words in the source sentence. There are exponentially many ways to cover the source sentence, so we developed an efficient way for the language model to ignore this information. In contrast, syntactic machine translation hypotheses correspond to contiguous spans in the source sentence, so \newcite{search} simply ran their algorithm separately in every possible span. 
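+To make this concrete, consider a minimal sketch (hypothetical words, a toy coverage encoding, and an assumed order $N=5$; not our implementation) of the observation that lets the language model ignore coverage: hypotheses that cover different source words but end in the same $N-1$ target words are indistinguishable to the language model when a target phrase is appended.
\begin{verbatim}
# Sketch: the language model score of appending a target phrase
# depends only on a hypothesis's last N-1 words, never on which
# source words it covers.  Words and coverage bits are made up.
N = 5

def lm_relevant_state(target_words):
    return tuple(target_words[-(N - 1):])

hyp_a = {"words": ["he", "said", "that", "the", "country"],
         "coverage": 0b110011}
hyp_b = {"words": ["reports", "said", "that", "the", "country"],
         "coverage": 0b001111}

# Identical language model state despite different coverage, so
# language model work for appending a phrase can be shared.
assert lm_relevant_state(hyp_a["words"]) == lm_relevant_state(hyp_b["words"])
\end{verbatim}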
Another difference from \newcite{search} is that they made no effort to exploit common words that appear in translation rules, which in our case are analogous to phrases. In this work, we explicitly group target phrases by common prefixes, doing so directly in the phrase table. @@ -69,10 +70,12 @@ \section{Related Work} Our baseline is cube pruning, which was originally developed for syntactic machine translation \cite{cubepruning} and subsequently ported to phrase-based translation by \newcite{cubit}. We have largely adopted their search strategy, which we summarize in Section \ref{basic_search}. However, as noted in Section \ref{intro_label}, cube pruning repeatedly calls the language model regarding hypotheses that differ only in coverage, while we collapse these calls. Moreover, we take a coarse-to-fine approach to finding good combinations of hypotheses and phrases rather than simply trying a large number of them. -\newcite{lagrangian-phrase} developed an exact decoding algorithm based on Lagrangian relaxation. However, they experimented with trigram language models and it remains unclear whether their algorithm would tractably handle the 5-gram language models used by many modern machine translation systems. We evaluate our approximate search algorithm using a 5-gram language model. +\newcite{lagrangian-phrase} developed an exact decoding algorithm based on Lagrangian relaxation. However, it has not been shown to tractably scale to the $5$-gram language models used by many modern translation systems. + +%However, they experimented with trigram language models and it remains unclear whether their algorithm would tractably handle the 5-gram language models used by many modern machine translation systems. We evaluate our approximate search algorithm using a 5-gram language model. \section{Decoding} -We begin by summarizing the high-level organization of phrase-based cube pruning in Moses \cite{moses}, which is largely based upon Cubit \cite{cubit}, and in turn based upon Pharaoh \cite{pharaoh}. Sections \ref{contribution} and later show our contribution. +We begin by summarizing the high-level organization of phrase-based cube pruning \cite{pharaoh,moses,cubit}. Sections \ref{contribution} and later show our contribution. \subsection{Search Organization} \label{basic_search} @@ -116,13 +119,13 @@ \subsection{Search Organization} \caption{\label{stacks}Stacks to translate the French ``le chat .'' into English. Filled circles indicate that the source word has been translated. A phrase translates ``le chat'' as simply ``cat'', emphasizing that stacks are organized by the number of source words rather than the number of target words.} \end{figure} -In practice, the decoder enforces a reordering limit that prevents the search process from jumping around the source sentence too much and dramatically reduces the size of the search space. Formally, when the reordering limit is $R$, the decoder must translate source words at indices $[0,n-R)$ before, or at the same time as, it can translate the $n$th source word. +The decoder enforces a reordering limit that prevents the search process from jumping around the source sentence too much and dramatically reduces the size of the search space. Formally, the decoder can translate the $n$th source word only if it has translated, or is currently translating, the words at indices $[0,n-R)$, where $R$ is the reordering limit. -The second practical constraint is a limit on the number of hypotheses in each stack. 
There are generally too many possible hypotheses, so the decoder approximates by remembering at most $k$ hypotheses in each stack, where $k$ is a number chosen by the user. Small $k$ makes search fast but may prune good hypotheses, while large $k$ is more thorough but takes more CPU time, thereby comprising a time-accuracy trade-off. The central question in this paper is how to select these $k$ hypotheses while improving the time-accuracy trade-off. +In practice, the decoder limits stacks to $k$ hypotheses, where $k$ is set by the user. Small $k$ makes search fast but may prune good hypotheses, while large $k$ is more thorough but takes more CPU time, yielding a time-accuracy trade-off. The central question in this paper is how to select these $k$ hypotheses. % while improving the time-accuracy trade-off. %To formalize the preceding paragraph, stack $s_i$ is a set of hypotheses that have translated $i$ source words. The initial stack $s_0$ contains a single hypothesis that translates nothing while the subsequent stacks are defined inductively %\[s_i = \bigcup_{j=0}^{i-1} \text{extend}(s_j, \text{source}_{i-j}) \] -Populating a stack can be boiled down into two steps. In the first step, the decoder matches hypotheses with source phrases subject to three constraints: the total source length matches the stack being populated, none of the source words has already been translated by the hypothesis, and the reordering limit. We do not improve this first step, which is largely driven by checking whether a hypothesis and source phrase are compatible along with some knowledge about the reordering limit. In the second step, the decoder runs an algorithm that searches through these matches to select $k$ high-scoring hypotheses for placement in the stack. We improve this second step. +Populating a stack can be broken down into two steps. In the first step, the decoder matches hypotheses with source phrases subject to three constraints: the total source length matches the stack being populated, none of the source words has already been translated by the hypothesis, and the reordering limit. We do not improve this first step. In the second step, the decoder runs an algorithm that searches through these matches to select $k$ high-scoring hypotheses for placement in the stack. We improve this second step. %First, the decoder matches hypotheses with source phrases that they have yet to translate and that will meet the source length requirement of the stack. The purpose of the reordering limit is to substantially reduce the size of the search space to make search tractable and as a workaround for models that are too weak to handle long-distance reordering. Second, the decoder searches through @@ -139,7 +142,7 @@ \subsection{Tries} \label{contribution} -For each source phrase, we collect the set of compatible hypotheses. We then place these hypotheses in a trie that emphasizes the suffix words because these matter most when appending a target phrase. Figure~\ref{hypsuff} shows an example. While it suffices to build this trie on the last $N-1$ words that matter to the language model, \newcite{zhifei} have shown that fewer words are necessary in cases where the language model will provably back off. Therefore, the trie does not necessarily have uniform depth. The leaves of the trie are complete hypotheses and reveal information irrelevant to the language model, such as coverage of the source sentence and the state of other features. 
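+As a minimal sketch of the hypothesis trie described in this subsection (hypothetical words, scores, and coverage labels; the actual decoder bounds the depth by what matters to the language model rather than using a fixed depth), hypotheses can be grouped by their final words so that a shared suffix is represented once and each node remembers the best score beneath it:
\begin{verbatim}
# Sketch: a trie over hypothesis suffixes, last word nearest the
# root.  Leaves carry state the language model does not care
# about, such as coverage.  All values below are hypothetical.
class Node:
    def __init__(self):
        self.children = {}          # word -> Node; the decoder keeps
                                    # these sorted by score
        self.best = float("-inf")   # best hypothesis score below
        self.leaves = []            # (state, score) of hypotheses

def insert(root, words, score, state, depth=2):
    node = root
    node.best = max(node.best, score)
    for word in reversed(words[-depth:]):   # suffix, last word first
        node = node.children.setdefault(word, Node())
        node.best = max(node.best, score)
    node.leaves.append((state, score))

root = Node()
insert(root, ["is", "a", "country"], -3.1, "coverage A")
insert(root, ["is", "one", "country"], -3.9, "coverage B")
insert(root, ["of", "the", "few", "countries"], -4.2, "coverage C")
# Hypotheses ending in "country" share the root's "country" child;
# the root's best score (-3.1) stands in for every hypothesis until
# refinement reveals more words.
\end{verbatim}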
+For each source phrase, we collect the set of compatible hypotheses. We then place these hypotheses in a trie that emphasizes the suffix words because these matter most when appending a target phrase. Figure~\ref{hypsuff} shows an example. While it suffices to build this trie on the last $N-1$ words that matter to the language model, \newcite{zhifei} have identified cases where fewer words are necessary because the language model will back off. The leaves of the trie are complete hypotheses and reveal information irrelevant to the language model, such as coverage of the source sentence and the state of other features. \begin{figure}\centering \begin{tikzpicture}[grow=left,->,arrows={-angle 90},line width=1pt,inner sep=1pt] @@ -185,7 +188,7 @@ \subsection{Tries} \caption{\label{hypsuff}Hypothesis suffixes arranged into a trie. The leaves indicate source coverage and any other hypothesis state.} \end{figure} -Each source phrase translates to a set of target phrases. Because these phrases will be appended to a hypothesis, the first few words matter the most to the language model. We therefore arrange the source phrases into a prefix trie. An example is shown in Figure~\ref{tgtpre}. Similar to the hypothesis trie, the depth may be shorter than $N-1$ in cases where the language model will provably back off \cite{zhifei}. The trie can also be short because the target phrase has fewer than $N-1$ words. We currently store this trie data structure directly in the phrase table, though it could also be computed on demand to save memory. Empirically, our phrase table uses less RAM than Moses's memory-based phrase table. +Each source phrase translates to a set of target phrases. Because these phrases will be appended to a hypothesis, the first few words matter the most to the language model. We therefore arrange the target phrases into a prefix trie. An example is shown in Figure~\ref{tgtpre}. Similar to the hypothesis trie, the depth may be shorter than $N-1$ in cases where the language model will provably back off \cite{zhifei}. The trie can also be short because the target phrase has fewer than $N-1$ words. We currently store this trie data structure directly in the phrase table, though it could also be computed on demand to save memory. Empirically, our phrase table uses less RAM than Moses's memory-based phrase table. \begin{figure}\centering \begin{tikzpicture}[grow=right,->,arrows={-angle 90},line width=1pt,inner sep=1pt] @@ -213,18 +216,18 @@ \subsection{Boundary Pairs} A boundary pair consists of a node in the hypothesis trie and a node in the target phrase trie. For example, the decoder starts at the root of each trie with the boundary pair $(\epsilon, \epsilon)$. The score of a boundary pair is the sum of the scores of the underlying trie nodes. However, once some words have been revealed, the decoder calls the language model to compute a score adjustment. For example, the boundary pair $(\text{country}, \text{that})$ has score adjustment \[\log \frac{p(\text{that}\mid\text{country})}{p(\text{that})} \] -times the weight of the language model. This has the effect of cancelling out the estimate made when the phrase was scored in isolation and replacing the estimate with a more accurate one based on available context. These score adjustments are efficient to compute because the decoder retains a pointer to the entry for ``that'' in the language model's data structure \cite{iwslt}. +times the weight of the language model. 
This has the effect of cancelling out the estimate made when the phrase was scored in isolation, replacing it with a more accurate estimate based on available context. These score adjustments are efficient to compute because the decoder retained a pointer to ``that'' in the language model's data structure \cite{iwslt}. \subsection{Splitting} Refinement is the notion that the boundary pair $(\epsilon, \epsilon)$ divides into several boundary pairs that reveal specific words from hypotheses or target phrases. The most straightforward way to do this is simply to split into all children of a trie node. Continuing the example from Figure~\ref{hypsuff}, we could split $(\epsilon, \epsilon)$ into three boundary pairs: $(\text{country}, \epsilon)$, $(\text{nations}, \epsilon)$, and $(\text{countries}, \epsilon)$. However, it is somewhat inefficient to separately consider the low-scoring child $(\text{countries}, \epsilon)$. Instead, we continue to split off the best child $(\text{country}, \epsilon)$ and leave a note that the zeroth child has been split off, denoted $(\epsilon[1^+], \epsilon)$. The index increases each time a child is split off. -For purposes of scoring, the best child no longer counts as a descendant of $(\epsilon[1^+], \epsilon)$, so its score decreases. +For purposes of scoring, the best child $(\text{country}, \epsilon)$ no longer counts as a descendant of $(\epsilon[1^+], \epsilon)$, so the score of $(\epsilon[1^+], \epsilon)$ is lower. -Splitting alternates sides, so splitting $(\text{countries}, \epsilon)$ will reveal a word from the target phrase. The exception is that language model scores are completely resolved before hypotheses reveal coverage vectors and other feature state. +Splitting alternates sides. For example, $(\text{country}, \epsilon)$ splits into $(\text{country}, \text{that})$ and $(\text{country}, \epsilon[1^+])$. As an exception, language model scores are completely resolved before hypotheses reveal coverage vectors and other feature state. \begin{figure*}[t]% \input{plot/model.tex}\input{plot/bleu.tex} -\caption{\label{results}Performance of our decoder and Moses for various pop limits $k$.} +\caption{\label{results}Performance of our decoder and Moses for various stack sizes $k$.} \end{figure*} \subsection{Priority Queue} @@ -233,22 +236,23 @@ \subsection{Overall Algorithm} We build hypotheses from left-to-right and manage stacks just like cube pruning. The only difference is how the $k$ elements of these stacks are selected. -When the decoder matches a hypothesis with a compatible source phrase, we immediately evaluate the distortion feature and update future costs. Our future costs are exactly the same as those used in Moses \cite{moses}: the highest-scoring way to cover the rest of the source sentence. This includes the language model score within target phrases but ignores the change in language model score that would occur were these phrases to be appended together. The hypotheses compatible with each source phrase are arranged into a trie. Finally, the priority queue algorithm from the preceding section searches for options that the language model likes. +When the decoder matches a hypothesis with a compatible source phrase, we immediately evaluate the distortion feature and update future costs, both of which are independent of the target phrase. Our future costs are exactly the same as those used in Moses \cite{moses}: the highest-scoring way to cover the rest of the source sentence. 
This includes the language model score within target phrases but ignores the change in language model score that would occur were these phrases to be appended together. The hypotheses compatible with each source phrase are arranged into a trie. Finally, the priority queue algorithm from the preceding section searches for options that the language model likes. \section{Experiments} The primary claim is that our algorithm performs better than cube pruning in terms of the trade-off between time and accuracy. We compare our new decoder implementation with Moses \cite{moses} by translating 1677 sentences from Chinese to English. These sentences are a deduplicated subset of the NIST Open MT 2012 test set and were drawn from Chinese online text sources, such as discussion forums. We trained our phrase table using a bitext of 10.8 million sentence pairs, which after tokenization amounts to approximately 290 million words on the English side. The bitext contains data from several sources, including news articles, UN proceedings, Hong Kong government documents, online forum data, and specialized sources such as an idiom translation table. We also trained our language model on the English half of this bitext using unpruned interpolated modified Kneser-Ney smoothing \cite{kn,kn-modified}. %We trained weights for our translation system using the Phrasal toolkit, using the online Adagrad technique to minimize a smoothed version of BLEU\cite{}. All our translation models used a simple feature set consisting of forward and backward translation models, a language model, a target word generation penalty and a linear distortion feature. -The system is limited to standard phrase table features, the distortion feature, and a single language model. We set the reordering limit to 15. The phrase table was pre-filtered to at most 20 target-side phrases per source phrase using the total score of the target phrase including the language model---the same method Moses uses internally. We then disabled phrase table pruning in the individual decoders to ensure a consistent set of target phrases. This system is not designed to be competitive, but rather a benchmark that removes as many confounds as possible. +The system is limited to standard phrase table features, the distortion feature, and a single language model. We set the reordering limit to 15. The phrase table was pre-pruned to at most 20 target-side phrases per source phrase. We implemented the same pruning heuristic as Moses: select the top 20 target phrases by total score, including the language model. We then disabled phrase table pruning in both decoders to ensure a consistent set of target phrases. This system is not designed to be competitive, but rather a benchmark that removes as many confounds as possible. -Moses \cite{moses} revision d6df825 was compiled with all optimizations recommended in the documentation. We use the in-memory phrase table for speed. Tests were run on otherwise-idle identical machines with 32 GB RAM; the processes did not come close to running out of memory. Binarized KenLM language models in probing format were placed in a RAM disk and text phrase tables were forced into the disk cache before each run. Timing is based on CPU usage reported by the kernel (user plus system) minus loading time, as measured by running on empty input; our decoder is also faster at loading. All results are single-threaded. Model score is averaged over all 1677 sentences; higher is better. 
We have verified that the model scores are comparable across decoders. +Moses \cite{moses} revision d6df825 was compiled with all optimizations recommended in the documentation. We use the in-memory phrase table for speed. Tests were run on otherwise-idle identical machines with 32 GB RAM; the processes did not come close to running out of memory. Binarized KenLM language models in probing format were placed in a RAM disk and text phrase tables were forced into the disk cache before each run. Timing is based on CPU usage reported by the kernel (user plus system) minus loading time, as measured by running on empty input; our decoder is also faster at loading. All results are single-threaded. Model score is averaged over all 1677 sentences; higher is better. We have verified that the model scores are comparable across decoders. We approximate translation quality with uncased BLEU \cite{bleu}; due to model errors, the relationship between model score and BLEU is noisy. -Figure~\ref{results} shows the results for pop limits $k$ ranging from $5$ to $10000$. For Moses, we also set the stack size to $k$ to disable a second pruning pass, as is common. Because Moses is slower, we also ran our decoder with higher beam sizes to fill in the graph. Our decoder is faster and attains higher accuracy. We can interpret accuracy improvments as speed improvements by asking how much time is required to attain the same accuracy as the baseline. By this metric, our decoder is 4.0 to 7.7 times as fast as Moses. +Figure~\ref{results} shows the results for pop limits $k$ ranging from $5$ to $10000$. For Moses, we also set the stack size to $k$ to disable a second pruning pass, as is common. Because Moses is slower, we also ran our decoder with higher beam sizes to fill in the graph. Our decoder is faster and attains higher accuracy. We can interpret accuracy improvements as speed improvements by asking how much time is required to attain the same accuracy as the baseline. By this metric, our decoder is 4.0 to 7.7 times as fast as Moses, depending on $k$. \section{Conclusion} +We have contributed a new phrase-based search algorithm based on the principle that the language model cares the most about boundary words. This leads to two contributions: a way to hide irrelevant state, such as coverage, from the language model, and an incremental refinement algorithm to find high-scoring combinations. This algorithm is implemented in a new fast phrase-based decoder, which we release as open-source under the LGPL. %TODO -When the decoder matches a hypothesis with a compatible source phrase, we immediately evaluate the distortion feature and update future costs. Our future costs are exactly the same as those used in Moses \cite{moses}: the highest-scoring way to cover the rest of the source sentence. This includes the language model score within target phrases but ignores the change in language model score that would occur were these phrases to be appended together. The hypotheses compatible with each source phrase are arranged into a trie and the language model algorithm searches for good combinations of hypotheses and target phrases. +%When the decoder matches a hypothesis with a compatible source phrase, we immediately evaluate the distortion feature and update future costs. Our future costs are exactly the same as those used in Moses \cite{moses}: the highest-scoring way to cover the rest of the source sentence. 
This includes the language model score within target phrases but ignores the change in language model score that would occur were these phrases to be appended together. The hypotheses compatible with each source phrase are arranged into a trie and the language model algorithm searches for good combinations of hypotheses and target phrases. \bibliographystyle{acl}