## Learning from data

<ul>
<li><b>Monolingual data</b></li>
    Ex.: Mary did not slap the green witch.
<li><b>Multilingual data</b></li>
    Ex.: Mary did not slap the green witch. Mary no dió una bofetada a la bruja verde.
<li><b>Parallel data</b></li>
<ul>
<li><b>Text-To-Text.</b></li>
    Ex.: Mary did not slap the green witch. <b>||</b> Mary no dió una bofetada a la bruja verde.
<li><b>Speech-To-Text.</b> Automatic speech recognition or speech translation</li> 
<li><b>Text-To-Speech.</b> Speech synthesis</li>
<li><b>Speech-To-Speech</b></li>
</ul>
</ul>


## Learning from parallel data: text-to-text

Example of parallel text:
<table>
<tr><td>my house is blue</td><td>nire etxea urdina da</td></tr>
<tr><td>my house is white</td><td>nire etxea zuria da</td></tr>
<tr><td>my dog was white</td><td>nire txakurra zuria zen</td></tr>
<tr><td>the dog was blue</td><td>txakurra urdina zen</td></tr>
</table>

Exercise: Can you identify which words are mutual translations? That is, define a bilingual dictionary.

Solution:

<table>
<tr><td>my</td><td>nire</td></tr>
<tr><td>house</td><td>etxea</td></tr>
<tr><td>is</td><td>da</td></tr>
<tr><td>blue</td><td>urdina</td></tr>
<tr><td>white</td><td>zuria</td></tr>
<tr><td>dog</td><td>txakurra</td></tr>
<tr><td>was</td><td>zen</td></tr>
<tr><td>the</td><td>NULL</td></tr>
</table>

<ul>
<li>The concept of <b>alignment</b> between source and target words naturally arises.</li>
<li>If alignments were available, it would be straightforward to derive a bilingual dictionary.</li>
<li>Can we automatically learn word alignments from parallel text?</li>
</ul>
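The second point can be sketched in a few lines of Python. The `links` below are written by hand for the four example sentence pairs (an assumption of this sketch, not something learned from data): each link is a pair of 0-based (source index, target index) positions, and a source word without a link is mapped to NULL. The names `links` and `build_dictionary` are illustrative.

```python
from collections import Counter, defaultdict

# Toy parallel corpus from the example above (English source, Basque target).
corpus = [
    ("my house is blue", "nire etxea urdina da"),
    ("my house is white", "nire etxea zuria da"),
    ("my dog was white", "nire txakurra zuria zen"),
    ("the dog was blue", "txakurra urdina zen"),
]

# Hand-made links (assumed given): 0-based (source index, target index) pairs.
# In the last pair, "the" (index 0) has no link.
links = [
    [(0, 0), (1, 1), (2, 3), (3, 2)],
    [(0, 0), (1, 1), (2, 3), (3, 2)],
    [(0, 0), (1, 1), (2, 3), (3, 2)],
    [(1, 0), (2, 2), (3, 1)],
]


def build_dictionary(corpus, links):
    """Count aligned word pairs and keep the most frequent target per source word."""
    counts = defaultdict(Counter)
    for (src, tgt), sentence_links in zip(corpus, links):
        src_words, tgt_words = src.split(), tgt.split()
        linked = dict(sentence_links)  # source index -> target index
        for j, word in enumerate(src_words):
            target = tgt_words[linked[j]] if j in linked else "NULL"
            counts[word][target] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


dictionary = build_dictionary(corpus, links)
for word, translation in dictionary.items():
    print(word, "->", translation)
```

With alignments in hand, the dictionary reduces to counting. The hard part, addressed by the alignment models that follow, is obtaining the links when only the sentence pairs are observed.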

## Word-based alignment models


Let $x = x_1 \cdots x_{|x|} = x_1^{|x|}$ and $y = y_1 \cdots y_{|y|} = y_1^{|y|}$ be source and target sentences that are mutual translations. The variables $x_j$ and $y_i$ denote the $j$-th source word and the $i$-th target word, respectively. For the sake of clarity, let $J=|x|$ and $I=|y|$ be the number of source and target words, respectively.

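Let $a = a_1 \cdots a_J$ be an alignment variable that assigns a target position to each source position. That is, $a_j \in \{1,\cdots,I\}$ is the position of the target word aligned to the source word $x_j$. For example, in the first sentence pair above, with English as the source and Basque as the target, $a=(1, 2, 4, 3)$.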

More precisely, a fictitious target position $i=0$ is defined to account for those positions in the source sentence that are not aligned to any target position. Thus, $a_j \in \{0, 1,\cdots,I\}$. For the last sentence pair, where "the" has no counterpart among the $I=3$ target words, the alignment is $a=(0, 1, 3, 2)$.
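As a minimal sketch of this representation, the Python snippet below reads off the alignment vector of a sentence pair from the bilingual dictionary of the exercise. The helper `alignment_vector` and the `dictionary` variable are hypothetical names, and the lookup assumes that each word occurs at most once per sentence, which holds for the toy corpus but not in general.

```python
# Bilingual dictionary from the exercise; "the" is deliberately absent
# because it has no Basque counterpart.
dictionary = {
    "my": "nire", "house": "etxea", "is": "da", "blue": "urdina",
    "white": "zuria", "dog": "txakurra", "was": "zen",
}


def alignment_vector(source, target, dictionary):
    """Return (a_1, ..., a_J): 1-based target positions, 0 for the NULL position."""
    tgt_words = target.split()
    a = []
    for word in source.split():
        translation = dictionary.get(word)
        if translation in tgt_words:
            a.append(tgt_words.index(translation) + 1)  # 1-based position
        else:
            a.append(0)  # fictitious target position i = 0
    return tuple(a)


first = alignment_vector("my house is blue", "nire etxea urdina da", dictionary)
last = alignment_vector("the dog was blue", "txakurra urdina zen", dictionary)
print(first)  # (1, 2, 4, 3)
print(last)   # the entry for "the" is 0
```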

The alignment is treated as a hidden variable, so we marginalize over all its possible values. Taking the source length $J$ as given and applying the chain rule:

$$
\begin{align*}
P(x \mid y) &= \sum_a P(x, a \mid y)\\%
            &= \sum_a \prod_j P(x_j, a_j \mid y, x_1^{j-1}, a_1^{j-1}, J)\\%
            &= \sum_a \prod_j P(x_j \mid y, x_1^{j-1}, a_1^{j}, J) \, P(a_j \mid y, x_1^{j-1}, a_1^{j-1}, J)%
\end{align*}
$$

### Model 1

Assumptions and model parameters:

$$
\begin{align*}
P(x_j \mid y, x_1^{j-1}, a_1^{j}, J)   &:= p(x_j \mid y_{a_j})\\ 
P(a_j \mid y, x_1^{j-1}, a_1^{j-1}, J) &:= \frac{1}{I+1}
\end{align*}
$$

Model 1 is defined as:

$$
\begin{align*}
P(x \mid y) &\approx \sum_a \prod_j \frac{1}{I+1} \, p(x_j \mid y_{a_j})\\%
            &=       \prod_j \sum_{a_j} \frac{1}{I+1} \, p(x_j \mid y_{a_j})\\%
            &= \frac{1}{(I+1)^J} \, \prod_j \sum_{a_j} p(x_j \mid y_{a_j})
\end{align*}
$$
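The step from the first to the second line exchanges the sum over whole alignment vectors with the product over source positions, which turns a sum of $(I+1)^J$ terms into $J$ sums of $I+1$ terms each. The Python sketch below checks this numerically on the last example sentence pair. The table `p` holds arbitrary, untrained values that merely stand in for $p(x_j \mid y_i)$; it is an assumption made for illustration and is not normalized.

```python
from itertools import product
from math import isclose, prod

src = "the dog was blue".split()
tgt = ["NULL"] + "txakurra urdina zen".split()  # position 0 is the NULL word
J, I = len(src), len(tgt) - 1

# Arbitrary, untrained values standing in for p(x_j | y_i) (illustration only).
p = {(u, v): 1.0 / (1 + j + 2 * i)
     for j, u in enumerate(src) for i, v in enumerate(tgt)}

# First line: explicit sum over all (I+1)^J alignment vectors.
brute_force = sum(
    prod(p[(src[j], tgt[a[j]])] / (I + 1) for j in range(J))
    for a in product(range(I + 1), repeat=J)
)

# Last line: one sum per source position, then a single product.
factored = prod(sum(p[(u, v)] for v in tgt) for u in src) / (I + 1) ** J

print((I + 1) ** J, "alignments enumerated")
print("both forms agree:", isclose(brute_force, factored))
```

The two values agree up to floating-point rounding, but the brute-force sum grows exponentially with the sentence length while the factored form costs only $J \cdot (I+1)$ lookups.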

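The parameters $p(u \mid v)$ are estimated by maximizing the log-likelihood of a training set of sentence pairs $(x_n, y_n)$ with the EM algorithm. In the E step, $a_{nji}$ is the posterior probability that source position $j$ of pair $n$ is aligned to target position $i$. In the M step, the accumulated posteriors are normalized so that $\sum_u p(u \mid v) = 1$ for every target word $v$:

$$
\begin{align*}
\text{E step}: a_{nji} &= \frac{p(x_{nj} \mid y_{ni})}{\sum_{i'} p(x_{nj} \mid y_{ni'})}\\%
\text{M step}: p(u \mid v) &\propto  \sum_n \sum_{j:x_{nj}=u} \sum_{i:y_{ni}=v} a_{nji}
\end{align*}
$$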
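The two steps can be sketched in plain Python on the toy English-Basque corpus. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `train_model1`, the uniform initialization, and the fixed number of iterations are choices made here, and the NULL word is prepended to every target sentence as position 0.

```python
from collections import defaultdict

corpus = [
    ("my house is blue", "nire etxea urdina da"),
    ("my house is white", "nire etxea zuria da"),
    ("my dog was white", "nire txakurra zuria zen"),
    ("the dog was blue", "txakurra urdina zen"),
]


def train_model1(corpus, iterations=20):
    """Run EM for Model 1 and return a dict mapping (u, v) to p(u | v)."""
    # Position 0 of every target sentence is the fictitious NULL word.
    pairs = [(src.split(), ["NULL"] + tgt.split()) for src, tgt in corpus]
    source_vocab = {u for src, _ in pairs for u in src}
    p = defaultdict(lambda: 1.0 / len(source_vocab))  # uniform start (assumption)
    for _ in range(iterations):
        counts = defaultdict(float)  # expected count of u aligned to v
        totals = defaultdict(float)  # expected count of v being aligned to anything
        for src, tgt in pairs:
            for u in src:
                norm = sum(p[(u, v)] for v in tgt)
                for v in tgt:
                    posterior = p[(u, v)] / norm  # E step: a_{nji}
                    counts[(u, v)] += posterior
                    totals[v] += posterior
        # M step: normalize the expected counts for each target word v.
        p = {(u, v): c / totals[v] for (u, v), c in counts.items()}
    return p


p = train_model1(corpus)
for u, v in [("blue", "urdina"), ("white", "zuria"),
             ("house", "etxea"), ("house", "da")]:
    print(f"p({u} | {v}) = {p[(u, v)]:.3f}")
```

Since Model 1 ignores word order, it can only exploit co-occurrence. In this tiny corpus "house" and "is" always appear together, and so do "etxea" and "da", so the estimates for `("house", "etxea")` and `("house", "da")` come out equal: the data alone cannot separate them without more sentences or a model that uses positions.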


## Other word-based models

<ul>
<li>IBM Models 1 through 5, proposed by the IBM research group</li>
<li>HMM alignment model</li>
<li>Mixture models</li>
<li>etc.</li>
</ul>

## Additional bibliography

<ul>
<li><a href="https://kevincrawfordknight.github.io/papers/wkbk-rw.pdf" target="_blank">K. Knight. A Statistical MT Tutorial Workbook, August 1999.</a></li>
<li><a href="https://github.com/moses-smt/giza-pp" target="_blank">F. Och. GIZA++ toolkit and the mkcls tool.</a></li>
</ul>