## Learning from data

<ul>
<li><b>Monolingual data</b><br>
    Ex.: Mary did not slap the green witch.</li>
<li><b>Multilingual data</b><br>
    Ex.: Mary did not slap the green witch. Mary no dio una bofetada a la bruja verde.</li>
<li><b>Parallel data</b>
<ul>
<li><b>Text-To-Text.</b><br>
    Ex.: Mary did not slap the green witch. <b>||</b> Mary no dio una bofetada a la bruja verde.</li>
<li><b>Speech-To-Text.</b> Automatic speech recognition or speech translation</li>
<li><b>Text-To-Speech.</b> Speech synthesis</li>
<li><b>Speech-To-Speech</b></li>
</ul>
</li>
</ul>


## Learning from parallel data: text-to-text

Example of parallel text (English on the left, Basque on the right):
<table>
<tr><td>the house is blue</td><td>etxea urdina da</td></tr>
<tr><td>my house was white</td><td>nire etxea zuria zen</td></tr>
<tr><td>my dog is white</td><td>nire txakurra zuria da</td></tr>
<tr><td>the dog was blue</td><td>txakurra urdina zen</td></tr>
</table>

Exercise: Can you identify which words are mutual translations? That is, can you build a bilingual dictionary from these four sentence pairs?

Solution:

<table>
<tr><td>my</td><td>nire</td></tr>
<tr><td>house</td><td>etxea</td></tr>
<tr><td>is</td><td>da</td></tr>
<tr><td>blue</td><td>urdina</td></tr>
<tr><td>dog</td><td>txakurra</td></tr>
<tr><td>was</td><td>zen</td></tr>
<tr><td>white</td><td>zuria</td></tr>
<tr><td>the</td><td>NULL</td></tr>
</table>

<ul>
<li>The concept of <b>alignment</b> between source and target words naturally arises.</li>
<li>If alignments were available, it would be straightforward to derive a bilingual dictionary.</li>
<li>Can we automatically learn word alignments from parallel text?</li>
</ul>
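If word alignments were given, a dictionary could be read off simply by counting aligned word pairs. The following is a minimal sketch under stated assumptions: the alignments are written by hand for the four sentence pairs above (one Basque position per English word, with 0 meaning NULL), and `dictionary_from_alignments` is a helper name introduced only for this illustration.

```python
from collections import Counter, defaultdict

# Toy corpus from the table above: (English, Basque) sentence pairs.
pairs = [
    ("the house is blue", "etxea urdina da"),
    ("my house was white", "nire etxea zuria zen"),
    ("my dog is white", "nire txakurra zuria da"),
    ("the dog was blue", "txakurra urdina zen"),
]

# Hand-written alignments (an assumption for illustration): for each English
# word, the 1-based position of its Basque counterpart, or 0 for NULL.
alignments = [
    (0, 1, 3, 2),
    (1, 2, 4, 3),
    (1, 2, 4, 3),
    (0, 1, 3, 2),
]


def dictionary_from_alignments(pairs, alignments):
    """Map each English word to its most frequently aligned Basque word."""
    counts = defaultdict(Counter)
    for (en, eu), a in zip(pairs, alignments):
        en_words = en.split()
        eu_words = ["NULL"] + eu.split()  # position 0 is the NULL word
        for en_word, j in zip(en_words, a):
            counts[en_word][eu_words[j]] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


dictionary = dictionary_from_alignments(pairs, alignments)
for en_word, eu_word in dictionary.items():
    print(f"{en_word:>5} -> {eu_word}")
```

In practice the alignments are not given, which is what motivates learning them from the parallel text itself.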

## Word-based alignment models


Let $x = x_1 \cdots x_{|x|} = x_1^{|x|}$ and $y = y_1 \cdots y_{|y|} = y_1^{|y|}$ be source and target sentences that are mutual translations. The variables $x_j$ and $y_i$ denote the $j$-th source word and the $i$-th target word, respectively. For the sake of clarity, let $J=|x|$ and $I=|y|$ be the number of source and target words, respectively.

Let $a = a_1 \cdots a_I$ be an alignment variable that assigns a source position to each target position. That is, $a_i \in \{1,\cdots,J\}$. For example, taking Basque as the source and English as the target, the alignment of the second sentence pair above is $a=(1, 2, 4, 3)$.

More precisely, a fictitious source position $j=0$ (NULL word) is defined to account for those target words that are not aligned to any source word. Thus, $a_i \in \{0, 1,\cdots,J\}$. So, the alignment of the last sentence pair would be $a=(0, 1, 3, 2)$.
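To make the notation concrete, the sketch below reads an alignment vector as a list of word links. It assumes Basque as the source $x$ and English as the target $y$, with one source position per target word and 0 standing for the NULL word; `aligned_pairs` is a hypothetical helper written for this example.

```python
def aligned_pairs(source, target, a):
    """Return (target word, source word) links for an alignment vector a.

    a[i-1] is the 1-based source position aligned to target position i;
    0 denotes the fictitious NULL word.
    """
    x = ["NULL"] + source.split()  # x[0] is the NULL word
    y = target.split()
    assert len(a) == len(y), "one alignment value per target word"
    assert all(0 <= j < len(x) for j in a), "positions must lie in 0..J"
    return [(y_i, x[j]) for y_i, j in zip(y, a)]


# Last sentence pair of the example: "the" has no Basque counterpart.
links = aligned_pairs("txakurra urdina zen", "the dog was blue", (0, 1, 3, 2))
for y_i, x_j in links:
    print(f"{y_i} -> {x_j}")
```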

The alignment is considered a hidden variable, so that we sum over all its possible values:

$$
\begin{align*}
P(y \mid x) &= P(y, I \mid x)\\%
            &= P(I \mid x) \, P(y \mid I, x)\\
            &= P(I \mid x) \sum_a P(y, a \mid I, x)\\%
\end{align*}
$$

with

$$
\begin{align*}
P(y, a \mid I, x) &= \prod_{i=1}^I P(y_i, a_i \mid x, y_1^{i-1}, a_1^{i-1})\\%
                  &= \prod_{i=1}^I P(y_i \mid x, y_1^{i-1}, a_1^{i}) \, P(a_i \mid x, y_1^{i-1}, a_1^{i-1})%
\end{align*}
$$


### Model 1

Assumptions and model parameters:

$$
\begin{align*}
P(y_i \mid x, y_1^{i-1}, a_1^{i})   &:= p(y_i \mid x_{a_i})\\ 
P(a_i \mid x, y_1^{i-1}, a_1^{i-1}) &:= \frac{1}{J+1}
\end{align*}
$$

Dropping the length term $P(I \mid x)$ (which is why $\sim$ is used instead of $=$), Model 1 is defined as:

$$
\begin{align*}
P(y \mid x) &\sim \sum_a \prod_{i=1}^I \frac{1}{J+1} \, p(y_i \mid x_{a_i})\\%
            &=       \prod_{i=1}^I \sum_{a_i=0}^J \frac{1}{J+1} \, p(y_i \mid x_{a_i})\\%
            &= \frac{1}{(J+1)^I} \, \prod_{i=1}^I \sum_{a_i=0}^J p(y_i \mid x_{a_i})\\%
            &= \frac{1}{(J+1)^I} \, \prod_{i=1}^I \sum_{j=0}^J p(y_i \mid x_j)
\end{align*}
$$
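The derivation swaps the sum over alignments with the product over target positions, which turns an exponential sum into a product of $I$ small sums. As a sanity check, the following sketch compares a brute-force sum over all $(J+1)^I$ alignments with the factorized form on the first sentence pair. The function names and the lexicon values in `p` are hypothetical, chosen only for illustration; they are not trained probabilities.

```python
from itertools import product
from math import isclose, prod


def model1_brute_force(y, x, p):
    """Sum over all (J+1)^I alignments of prod_i p(y_i | x_{a_i}) / (J+1)."""
    xs = ["NULL"] + x  # position 0 is the NULL word
    J = len(x)
    total = 0.0
    for a in product(range(J + 1), repeat=len(y)):
        total += prod(p[(y_i, xs[j])] / (J + 1) for y_i, j in zip(y, a))
    return total


def model1_factorized(y, x, p):
    """Factorized form: (J+1)^(-I) * prod_i sum_j p(y_i | x_j)."""
    xs = ["NULL"] + x
    J, I = len(x), len(y)
    return prod(sum(p[(y_i, x_j)] for x_j in xs) for y_i in y) / (J + 1) ** I


x = "etxea urdina da".split()       # source sentence (Basque)
y = "the house is blue".split()     # target sentence (English)

# Hypothetical lexicon p[(target word, source word)], not estimated from data.
p = {(y_i, x_j): 0.1 for y_i in y for x_j in ["NULL"] + x}
p.update({("the", "NULL"): 0.7, ("house", "etxea"): 0.7,
          ("is", "da"): 0.7, ("blue", "urdina"): 0.7})

brute = model1_brute_force(y, x, p)
fact = model1_factorized(y, x, p)
print(brute, fact, isclose(brute, fact))
```

The brute-force version visits $4^4 = 256$ alignments even for this tiny pair, while the factorized version needs only $I \cdot (J+1) = 16$ table lookups.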

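The parameters are estimated by maximizing the log-likelihood of a training set of sentence pairs $(x_n, y_n)$ with the EM algorithm. Here $a_{nij}$ is the posterior probability that target position $i$ of pair $n$ is aligned to source position $j$, and the M-step counts are normalized so that $\sum_u p(u \mid v) = 1$: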

$$
\begin{align*}
\text{E step}: a_{nij} &= \frac{p(y_{ni} \mid x_{nj})}{\sum_{j'} p(y_{ni} \mid x_{nj'})}\\%
\text{M step}: p(u \mid v) &\sim  \sum_n \sum_{i:y_{ni}=u} \sum_{j:x_{nj}=v} a_{nij}
\end{align*}
$$
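The E and M steps above can be sketched in a few lines of plain Python. This is a minimal illustration on the toy corpus, assuming Basque as the source (with an explicit `NULL` token at position 0), English as the target, and a uniform initial table; `train_model1` is a name introduced here, not a library function.

```python
from collections import defaultdict

eu_text = ["etxea urdina da", "nire etxea zuria zen",
           "nire txakurra zuria da", "txakurra urdina zen"]
en_text = ["the house is blue", "my house was white",
           "my dog is white", "the dog was blue"]

# Each pair is (x, y): source words with NULL at position 0, target words.
corpus = [(["NULL"] + eu.split(), en.split()) for eu, en in zip(eu_text, en_text)]
target_vocab = sorted({u for _, y in corpus for u in y})


def train_model1(corpus, iterations=5):
    """Estimate p[(u, v)] = p(u | v) with EM, starting from a uniform table."""
    vocab_size = len({u for _, y in corpus for u in y})
    p = defaultdict(lambda: 1.0 / vocab_size)
    for _ in range(iterations):
        counts = defaultdict(float)  # expected number of (u, v) links
        totals = defaultdict(float)  # expected number of links leaving v
        for x, y in corpus:
            for u in y:
                norm = sum(p[(u, v)] for v in x)
                for v in x:
                    a = p[(u, v)] / norm  # E step: posterior a_{nij}
                    counts[(u, v)] += a
                    totals[v] += a
        # M step: normalize expected counts over target words u.
        p = defaultdict(float, {(u, v): c / totals[v] for (u, v), c in counts.items()})
    return p


p = train_model1(corpus)
for v in ["etxea", "txakurra", "da", "zen"]:
    best = max(target_vocab, key=lambda u: p[(u, v)])
    print(f"best translation of {v}: {best} ({p[(best, v)]:.2f})")
```

With the uniform initialization, the first iteration amounts to length-weighted co-occurrence counting; later iterations concentrate probability mass on word pairs that co-occur consistently.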


```python
from nltk.translate import AlignedSent, IBMModel1

euText = ['etxea urdina da', 'nire etxea zuria zen', 'nire txakurra zuria da', 'txakurra urdina zen']
enText = ['the house is blue', 'my house was white', 'my dog is white', 'the dog was blue']

# Source language is Basque (Euskera) and target language is English:
# AlignedSent(words, mots) takes the target sentence first, then the source.
corpus = []
for enSent, euSent in zip(enText, euText):
    corpus.append(AlignedSent(enSent.split(), euSent.split()))

# Train Model 1 for 5 EM iterations
# p(trg_word | src_word): m1.translation_table[trgWord][srcWord]
m1 = IBMModel1(corpus, 5)

for trgWord in m1.translation_table:
    print(f'p({trgWord:>5} | x ) = ', end="")
    for srcWord in m1.translation_table[trgWord]:
        print(f'{m1.translation_table[trgWord][srcWord]:.2f} ({srcWord}) ', end="")
    print("")

# Compute the best alignment of each pair (0-based target-source index pairs)
m1.align_all(corpus)
for sentencePair in corpus:
    print(f'{sentencePair.mots} > {sentencePair.words}: {sentencePair.alignment}')
```

```
p(  was | x ) = 0.11 (None) 0.02 (nire) 0.02 (etxea) 0.02 (zuria) 0.83 (zen) 0.03 (txakurra) 0.01 (urdina) 
p(  the | x ) = 0.21 (None) 0.05 (etxea) 0.48 (urdina) 0.05 (da) 0.05 (txakurra) 0.05 (zen) 
p( blue | x ) = 0.21 (None) 0.05 (etxea) 0.48 (urdina) 0.05 (da) 0.05 (txakurra) 0.05 (zen) 
p(house | x ) = 0.11 (None) 0.83 (etxea) 0.01 (urdina) 0.03 (da) 0.02 (nire) 0.02 (zuria) 0.02 (zen) 
p(  dog | x ) = 0.11 (None) 0.02 (nire) 0.83 (txakurra) 0.02 (zuria) 0.02 (da) 0.01 (urdina) 0.03 (zen) 
p(white | x ) = 0.06 (None) 0.46 (nire) 0.01 (etxea) 0.46 (zuria) 0.01 (zen) 0.01 (txakurra) 0.01 (da) 
p(   is | x ) = 0.11 (None) 0.03 (etxea) 0.01 (urdina) 0.83 (da) 0.02 (nire) 0.02 (txakurra) 0.02 (zuria) 
p(   my | x ) = 0.06 (None) 0.46 (nire) 0.01 (etxea) 0.46 (zuria) 0.01 (zen) 0.01 (txakurra) 0.01 (da) 
['etxea', 'urdina', 'da'] > ['the', 'house', 'is', 'blue']: 0-1 1-0 2-2 3-1
['nire', 'etxea', 'zuria', 'zen'] > ['my', 'house', 'was', 'white']: 0-2 1-1 2-3 3-2
['nire', 'txakurra', 'z
```

### Model 2

Assumptions and model parameters:

$$
\begin{align*}
P(y_i \mid x, y_1^{i-1}, a_1^{i})   &:= p(y_i \mid x_{a_i})\\ 
P(a_i \mid x, y_1^{i-1}, a_1^{i-1}) &:= p(a_i \mid i, J, I)
\end{align*}
$$

Model 2 is defined as:

$$
\begin{align*}
P(y \mid x) &\sim \prod_{i=1}^I \sum_{a_i=0}^J p(a_i \mid i, J, I) \, p(y_i \mid x_{a_i})\\%
            &=    \prod_{i=1}^I \sum_{j=0}^J   p(j \mid i, J, I) \, p(y_i \mid x_j)
\end{align*}
$$

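The parameters are again estimated by maximizing the log-likelihood with the EM algorithm; the alignment probabilities $p(j \mid i, J, I)$ are now re-estimated together with the lexical probabilities $p(u \mid v)$: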

$$
\begin{align*}
\text{E step}: a_{nij} &= \frac{p(j \mid i, J, I) \, p(y_{ni} \mid x_{nj})}{\sum_{j'} p(j' \mid i, J, I) \, p(y_{ni} \mid x_{nj'})}\\%
\text{M step}: p(j \mid i, J, I) & \sim  \sum_{n:|x_n|=J \wedge |y_n|=I} a_{nij} \\%
               p(u \mid v) &\sim  \sum_n \sum_{i:y_{ni}=u} \sum_{j:x_{nj}=v} a_{nij}
\end{align*}
$$
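The two-table update can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the routine of any particular toolkit: both tables start uniform (toolkits usually initialize the lexical table from Model 1, so the numbers will differ from theirs), the source is Basque with an explicit `NULL` token at position 0, and `train_model2` is a name introduced here.

```python
from collections import defaultdict

eu_text = ["etxea urdina da", "nire etxea zuria zen",
           "nire txakurra zuria da", "txakurra urdina zen"]
en_text = ["the house is blue", "my house was white",
           "my dog is white", "the dog was blue"]
corpus = [(["NULL"] + eu.split(), en.split()) for eu, en in zip(eu_text, en_text)]


def train_model2(corpus, iterations=5):
    """EM for lex[(u, v)] = p(u | v) and ali[(j, i, J, I)] = p(j | i, J, I)."""
    vocab_size = len({u for _, y in corpus for u in y})
    lex = defaultdict(lambda: 1.0 / vocab_size)
    ali = {}
    for x, y in corpus:  # uniform alignment table: 1 / (J + 1)
        J, I = len(x) - 1, len(y)
        for i in range(1, I + 1):
            for j in range(J + 1):
                ali[(j, i, J, I)] = 1.0 / (J + 1)
    for _ in range(iterations):
        lex_counts, lex_totals = defaultdict(float), defaultdict(float)
        ali_counts, ali_totals = defaultdict(float), defaultdict(float)
        for x, y in corpus:
            J, I = len(x) - 1, len(y)
            for i, u in enumerate(y, start=1):
                scores = [ali[(j, i, J, I)] * lex[(u, x[j])] for j in range(J + 1)]
                norm = sum(scores)
                for j, score in enumerate(scores):
                    a = score / norm  # E step: posterior a_{nij}
                    lex_counts[(u, x[j])] += a
                    lex_totals[x[j]] += a
                    ali_counts[(j, i, J, I)] += a
                    ali_totals[(i, J, I)] += a
        # M step: normalize over target words u and over source positions j.
        lex = defaultdict(float, {(u, v): c / lex_totals[v]
                                  for (u, v), c in lex_counts.items()})
        ali = {key: c / ali_totals[key[1:]] for key, c in ali_counts.items()}
    return lex, ali


lex, ali = train_model2(corpus)
J, I = 4, 4
for i in range(1, I + 1):
    row = " ".join(f"{ali[(j, i, J, I)]:.2f}" for j in range(J + 1))
    print(f"p(j | i = {i}, J = {J}, I = {I}): {row}")
```

Because the alignment table is indexed by the pair of sentence lengths, each $(J, I)$ combination is estimated only from sentence pairs with exactly those lengths, which is why such a small corpus yields very peaked distributions.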

```python
from nltk.translate import IBMModel2

# Reuses `corpus` from the Model 1 cell above.
# Training M1 model for 10 iterations followed by M2 model for 5 iterations
# p(j | i, J, I): m2.alignment_table[j][i][J][I]
m2 = IBMModel2(corpus, 5)

# Alignment distribution for sentence pairs with J = 4 source and I = 4 target words
J = 4
I = 4
for i in range(1, I + 1):
    print(f'p(j | i = {i}, J = {J}, I = {I}) = ', end="")
    for j in range(J + 1):
        print(f'{m2.alignment_table[j][i][J][I]:.2f} (j = {j}) ', end="")
    print("")
```


```
p(j | i = 1, J = 4, I = 4) = 0.00 (j = 0) 0.50 (j = 1) 0.00 (j = 2) 0.50 (j = 3) 0.00 (j = 4) 
p(j | i = 2, J = 4, I = 4) = 0.00 (j = 0) 0.00 (j = 1) 1.00 (j = 2) 0.00 (j = 3) 0.00 (j = 4) 
p(j | i = 3, J = 4, I = 4) = 0.00 (j = 0) 0.00 (j = 1) 0.00 (j = 2) 0.00 (j = 3) 1.00 (j = 4) 
p(j | i = 4, J = 4, I = 4) = 0.00 (j = 0) 0.50 (j = 1) 0.00 (j = 2) 0.50 (j = 3) 0.00 (j = 4) 
```


## Other word-based models

<ul>
<li>In addition to Models 1 and 2, the IBM research group proposed Models 3 through 5, which add fertility and more refined reordering (distortion) parameters.</li>
<li>HMM alignment model</li>
<li>Mixture models</li>
<li>etc.</li>
</ul>

## Additional bibliography

<ul>
<li><a href="https://aclanthology.org/J93-2003.pdf" target="_blank">P.F. Brown et al. The Mathematics of Statistical Machine Translation: Parameter Estimation, Computational Linguistics, 1993.</a></li>
<li><a href="https://kevincrawfordknight.github.io/papers/wkbk-rw.pdf" target="_blank">K. Knight. A Statistical MT Tutorial Workbook, August 1999.</a></li>
<li><a href="https://github.com/moses-smt/giza-pp" target="_blank">F. Och. GIZA++ toolkit and the mkcls tool.</a></li>
</ul>