# Phrase-based models


Motivation: Capture many-to-many mapping to include word context

State of the art in MT until about 2017, when neural models took over (e.g. in Google Translate)

Generative phrase-based models [1,2]: hidden variables for phrase segmentation and alignment

Heuristic phrase-based models: alignment symmetrisation and relative counts [3, Chapter 5]

## Alignment symmetrisation and phrase extraction


Learn two word-based alignment models: one from target to source, giving alignment $a$, and one from source to target, giving alignment $b$

Start from the intersection $a \cap b$ and heuristically add alignment points from the union $a \cup b$
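One way to picture the symmetrisation step is the minimal sketch below, written in the spirit of the grow-diag family of heuristics. It assumes both alignments are given as sets of `(i, j)` index pairs in the same orientation; the function name `symmetrise` and the toy alignments are illustrative, not part of NLTK or of the notebook.

```python
def symmetrise(a, b):
    """Start from the intersection of a and b and grow it with union points."""
    a, b = set(a), set(b)
    alignment = a & b          # high-precision starting point
    union = a | b              # candidate points that may be added
    # Horizontal, vertical and diagonal neighbours of an alignment point
    neighbours = [(-1, 0), (1, 0), (0, -1), (0, 1),
                  (-1, -1), (-1, 1), (1, -1), (1, 1)]
    added = True
    while added:
        added = False
        for i, j in sorted(alignment):
            for di, dj in neighbours:
                cand = (i + di, j + dj)
                if cand in union and cand not in alignment:
                    # Add the point only if one of its two words is still unaligned
                    row_free = all(p[0] != cand[0] for p in alignment)
                    col_free = all(p[1] != cand[1] for p in alignment)
                    if row_free or col_free:
                        alignment.add(cand)
                        added = True
    return alignment


# Toy alignments (hypothetical): the two directions agree on (0, 0) and (1, 1)
a = {(0, 0), (1, 1), (2, 1)}
b = {(0, 0), (1, 1), (3, 3)}
symmetrised = symmetrise(a, b)
print(sorted(symmetrised))
```

In this toy case the point `(2, 1)` is added because it neighbours `(1, 1)`, while `(3, 3)` is left out because it touches no selected point; fuller heuristics usually add a final pass for such isolated points.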

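Extract all possible phrase pairs *consistent* with the alignment previously selected

Consistent means that the words of the phrase pair $(\bar{x},\bar{y})$ are aligned only to each other: no word in $\bar{x}$ is aligned to a word outside $\bar{y}$, nor vice versa, and at least one alignment point links $\bar{x}$ and $\bar{y}$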
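The consistency condition can be checked directly, as in the following sketch. It assumes half-open spans `(start, end)` and an alignment given as a list of index pairs; `is_consistent` is a hypothetical helper written for illustration, not the routine used inside NLTK.

```python
def is_consistent(alignment, src_span, trg_span):
    """Return True if no alignment point crosses the border of the phrase pair."""
    (s0, s1), (t0, t1) = src_span, trg_span
    inside = 0
    for i, j in alignment:
        in_src = s0 <= i < s1
        in_trg = t0 <= j < t1
        if in_src != in_trg:
            # One end of the link is inside the pair and the other is outside
            return False
        if in_src and in_trg:
            inside += 1
    # Require at least one alignment point inside the phrase pair
    return inside > 0


# Alignment 0-3 1-1 2-2 3-3 written as index pairs
alignment = [(0, 3), (1, 1), (2, 2), (3, 3)]
print(is_consistent(alignment, (1, 2), (1, 2)))  # words 1..1 on both sides
print(is_consistent(alignment, (0, 1), (3, 4)))  # word 0 with word 3 only
```

The second call fails because word 3 on the second side is also linked to word 3 on the first side, which lies outside the proposed span.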

In [20]:
from nltk.translate import AlignedSent, IBMModel2, phrase_based

esText = ['la casa es azul','mi casa era blanca','mi perro es blanco','el perro era azul']
enText = ['the house is blue', 'my house was white','my dog is white', 'the dog was blue']

# Source language is Spanish and target language is English
corpus = []
for enSent, esSent in zip(enText,esText):
    corpus.append(AlignedSent(enSent.split(),esSent.split()))

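# Train IBM Model 2 with 5 EM iterations
m2 = IBMModel2(corpus, 5)
# Store the best alignment in each AlignedSent of the corpus
m2.align_all(corpus)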

In [28]:
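for biSent in corpus:
    # AlignedSent keeps the target (English) in .words and the source (Spanish) in .mots
    enSent = " ".join(biSent.words)
    esSent = " ".join(biSent.mots)
    print(f'{enSent} > {esSent}: {biSent.alignment}')
    # Alignment points are (English index, Spanish index), so English goes first
    phrases = phrase_based.phrase_extraction(enSent, esSent, biSent.alignment)
    for phrase in sorted(phrases):
        print(phrase)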

the house is blue > la casa es azul: 0-3 1-1 2-2 3-3
((0, 4), (0, 4), 'the house is blue', 'la casa es azul')
((0, 4), (1, 4), 'the house is blue', 'casa es azul')
((1, 2), (0, 2), 'house', 'la casa')
((1, 2), (1, 2), 'house', 'casa')
((1, 3), (0, 3), 'house is', 'la casa es')
((1, 3), (1, 3), 'house is', 'casa es')
((2, 3), (2, 3), 'is', 'es')
my house was white > mi casa era blanca: 0-3 1-1 2-2 3-3
((0, 4), (0, 4), 'my house was white', 'mi casa era blanca')
((0, 4), (1, 4), 'my house was white', 'casa era blanca')
((1, 2), (0, 2), 'house', 'mi casa')
((1, 2), (1, 2), 'house', 'casa')
((1, 3), (0, 3), 'house was', 'mi casa era')
((1, 3), (1, 3), 'house was', 'casa era')
((2, 3), (2, 3), 'was', 'era')
my dog is white > mi perro es blanco: 0-3 1-1 2-2 3-3
((0, 4), (0, 4), 'my dog is white', 'mi perro es blanco')
((0, 4), (1, 4), 'my dog is white', 'perro es blanco')
((1, 2), (0, 2), 'dog', 'mi perro')
((1, 2), (1, 2), 'dog', 'perro')
((1, 3), (0, 3), 'dog is', 'mi perro es')
((1, 3), (

## Model estimation

Phrase translation probabilities $p(\bar{y} \mid \bar{x})$ are estimated as relative counts (relative frequencies) over the extracted phrase pairs
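A relative-count estimate simply divides the count of each phrase pair by the count of its source phrase. The sketch below assumes the extracted pairs are collected in a list of `(source, target)` tuples; the helper name and the toy pairs are made up for illustration and do not come from the corpus above.

```python
from collections import Counter


def relative_frequencies(phrase_pairs):
    """Estimate p(target | source) as count(source, target) / count(source)."""
    pair_counts = Counter(phrase_pairs)
    src_counts = Counter(src for src, _ in phrase_pairs)
    return {(src, trg): count / src_counts[src]
            for (src, trg), count in pair_counts.items()}


# Hypothetical extracted phrase pairs (source, target)
pairs = [("casa", "house"), ("casa", "house"), ("casa", "home"), ("perro", "dog")]
probs = relative_frequencies(pairs)
for (src, trg), p in sorted(probs.items()):
    print(f"p({trg} | {src}) = {p:.3f}")
```

The probabilities for a given source phrase sum to one by construction.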

In practice, phrase-based systems relied not on a single phrase-based model but on several models (feature functions):

- Phrase-based models in both directions

- Word-based models in both directions

- Target language model

- Reordering model

- Phrase penalty: constant cost per phrase, so its log-score is proportional to the number of phrases used

- etc.

combined in a log-linear fashion:

$$
\begin{align*}
\hat{y} &= \operatorname*{arg\,max}_{y} P(y \mid x)\\
        &= \operatorname*{arg\,max}_{y} \sum_m \lambda_m h_m (x, y)
\end{align*}
$$
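The log-linear decision rule amounts to a weighted sum of feature values followed by an argmax over candidates. The sketch below assumes each candidate translation comes with a dictionary of feature values $h_m(x, y)$; the feature names, values and weights $\lambda_m$ are invented for illustration.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values: sum_m lambda_m * h_m(x, y)."""
    return sum(weights[name] * value for name, value in features.items())


def best_translation(candidates, weights):
    """Return the candidate with the highest log-linear score."""
    return max(candidates, key=lambda y: loglinear_score(candidates[y], weights))


# Hypothetical weights and feature values (log-probabilities and a phrase count)
weights = {"tm": 1.0, "lm": 0.5, "phrase_penalty": -0.1}
candidates = {
    "the house is blue": {"tm": -1.2, "lm": -2.0, "phrase_penalty": 2},
    "the blue house is": {"tm": -1.0, "lm": -5.5, "phrase_penalty": 3},
}
best = best_translation(candidates, weights)
print(best, loglinear_score(candidates[best], weights))
```

A real decoder does not enumerate candidates like this; the search section describes how the space of translations is explored incrementally.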

## Search

A* algorithm based on dynamic programming, multi-stack decoding and hypothesis pruning

Incremental development of partial hypotheses

Each hypothesis is characterised by a coverage of the source sentence, a target prefix and a score

Each stack stores hypotheses with the same coverage of the source sentence sorted by score from highest to lowest

Hypotheses from the top of the stacks are selected for expansion
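The expansion of a hypothesis can be sketched as follows. This is a simplified illustration: the `Hypothesis` class, the `expand` function and the toy phrase table are assumptions, and the score only accumulates phrase log-probabilities, ignoring the language and reordering models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    coverage: frozenset   # indices of the source words already translated
    prefix: tuple         # target words produced so far
    score: float          # accumulated log-score


def expand(hyp, source, phrase_table, max_len=3):
    """Yield every hypothesis obtained by translating one uncovered source span."""
    n = len(source)
    for start in range(n):
        for end in range(start + 1, min(start + max_len, n) + 1):
            span = frozenset(range(start, end))
            if span & hyp.coverage:
                continue  # the span overlaps words that are already covered
            for target, logp in phrase_table.get(tuple(source[start:end]), []):
                yield Hypothesis(hyp.coverage | span,
                                 hyp.prefix + tuple(target.split()),
                                 hyp.score + logp)


# Hypothetical phrase table: source phrase -> list of (target phrase, log-prob)
phrase_table = {
    ("la",): [("the", -0.1)],
    ("casa",): [("house", -0.2)],
    ("la", "casa"): [("the house", -0.3)],
}
source = ["la", "casa"]
empty = Hypothesis(frozenset(), (), 0.0)
expansions = list(expand(empty, source, phrase_table))
for h in expansions:
    print(sorted(h.coverage), " ".join(h.prefix), h.score)
```

Each new hypothesis would then be placed in the stack that corresponds to its coverage.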

Hypothesis pruning based on:

- Beam search (relative difference w.r.t. the best-scoring hypothesis) or threshold

- Histogram pruning (max. number of hypotheses per stack)
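Both pruning criteria can be applied to a single stack as in the sketch below. It assumes scores are log-scores (higher is better) and that the beam is an additive margin in the log domain; the function `prune` and its parameters are hypothetical names chosen for illustration.

```python
def prune(stack, beam=None, histogram=None):
    """Prune a stack given as a list of (score, hypothesis) pairs."""
    if not stack:
        return []
    # Sort from highest to lowest score
    ordered = sorted(stack, key=lambda item: item[0], reverse=True)
    if beam is not None:
        # Beam pruning: drop hypotheses too far below the best one
        best = ordered[0][0]
        ordered = [item for item in ordered if item[0] >= best - beam]
    if histogram is not None:
        # Histogram pruning: keep at most `histogram` hypotheses
        ordered = ordered[:histogram]
    return ordered


# Toy stack with hypothetical scores
stack = [(-1.0, "a"), (-1.5, "b"), (-4.0, "c"), (-1.2, "d")]
print(prune(stack, beam=1.0))
print(prune(stack, histogram=2))
```

With a beam of 1.0, hypothesis `c` is discarded because it lies 3.0 below the best score; with a histogram size of 2, only the two best hypotheses survive.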

## Additional bibliography

<ol>
<li><a href="https://aclanthology.org/W02-1018.pdf" target="_blank">D. Marcu and W. Wong. A Phrase-Based, Joint Probability Model for Statistical Machine Translation, EMNLP 2002.</a></li>
<li><a href="https://aclanthology.org/2009.eamt-1.23.pdf" target="_blank">J. Andrés-Ferrer and A. Juan. A Phrase-Based Hidden Semi-Markov Approach to Machine Translation, EAMT 2009.</a></li>
<li><a href="https://doi.org/10.1017/CBO9780511815829" target="_blank">P. Koehn. Statistical Machine Translation, Cambridge University Press 2010.</a></li>
</ol>