## Learning from data

<ul>
<li><b>Monolingual data</b></li>
    Ex.: Mary did not slap the green witch.
<li><b>Multilingual data</b></li>
    Ex.: Mary did not slap the green witch. Mary no dio una bofetada a la bruja verde.
<li><b>Parallel data</b></li>
<ul>
<li><b>Text-To-Text.</b></li>
    Ex.: Mary did not slap the green witch. <b>||</b> Mary no dio una bofetada a la bruja verde.
<li><b>Speech-To-Text.</b> Automatic speech recognition or speech translation</li> 
<li><b>Text-To-Speech.</b> Speech synthesis</li>
<li><b>Speech-To-Speech</b></li>
</ul>
</ul>


## Learning from parallel data: text-to-text

Example of parallel text:
<table>
<tr><td>the house is blue</td><td>etxea urdina da</td></tr>
<tr><td>my house was white</td><td>nire etxea zuria zen</td></tr>
<tr><td>my dog is white</td><td>nire txakurra zuria da</td></tr>
<tr><td>the dog was blue</td><td>txakurra urdina zen</td></tr>
</table>

Exercise: Can you identify which words are mutual translations? That is, define a bilingual dictionary.

Solution:

<table>
<tr><td>my</td><td>nire</td></tr>
<tr><td>house</td><td>etxea</td></tr>
<tr><td>is</td><td>da</td></tr>
<tr><td>blue</td><td>urdina</td></tr>
<tr><td>dog</td><td>txakurra</td></tr>
<tr><td>was</td><td>zen</td></tr>
<tr><td>white</td><td>zuria</td></tr>
<tr><td>the</td><td>NULL</td></tr>
</table>

<ul>
<li>The concept of <b>alignment</b> between source and target words naturally arises.</li>
<li>If alignments were available, it would be straightforward to derive a bilingual dictionary.</li>
<li>Can we automatically learn word alignments from parallel text?</li>
</ul>
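
To illustrate the second point, here is a minimal sketch of how a dictionary could be read off if word alignments were given. The alignment links below are written by hand for the example corpus (an assumption of this sketch, not the output of any model), and `derive_dictionary` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Parallel corpus from the example above (English, Basque)
corpus = [
    ("the house is blue", "etxea urdina da"),
    ("my house was white", "nire etxea zuria zen"),
    ("my dog is white", "nire txakurra zuria da"),
    ("the dog was blue", "txakurra urdina zen"),
]

# Hand-written alignments (an assumption for illustration): for each English
# position, the 0-based Basque position it is linked to, or None for NULL.
alignments = [
    [None, 0, 2, 1],
    [0, 1, 3, 2],
    [0, 1, 3, 2],
    [None, 0, 2, 1],
]

def derive_dictionary(corpus, alignments):
    """Count aligned word pairs and keep the most frequent translation."""
    counts = defaultdict(Counter)
    for (en, eu), links in zip(corpus, alignments):
        en_words, eu_words = en.split(), eu.split()
        for en_word, link in zip(en_words, links):
            eu_word = "NULL" if link is None else eu_words[link]
            counts[en_word][eu_word] += 1
    return {en_word: c.most_common(1)[0][0] for en_word, c in counts.items()}

dictionary = derive_dictionary(corpus, alignments)
for en_word, eu_word in dictionary.items():
    print(f"{en_word} -> {eu_word}")
```

Without such links, the counts have to be replaced by expected counts under a model, which is what the alignment models in the next section provide.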

## Word-based alignment models


Let $x = x_1 \cdots x_{|x|} = x_1^{|x|}$ and $y = y_1 \cdots y_{|y|} = y_1^{|y|}$ be source and target sentences that are mutual translations. The variables $x_j$ and $y_i$ denote the $j$-th source word and the $i$-th target word, respectively. For the sake of clarity, let $J=|x|$ and $I=|y|$ be the number of source and target words, respectively.

Let $a = a_1 \cdots a_I$ be an alignment variable that assigns a source position to each target position. That is, $a_i \in \{1,\cdots,J\}$.

More precisely, a fictitious source position $j=0$ (NULL word) is defined to account for those target words that are not aligned to any source word. Thus, $a_i \in \{0, 1,\cdots,J\}$. For example, taking the Basque sentence "etxea urdina da" as source and the English sentence "the house is blue" as target, the alignment is $a=(0, 1, 3, 2)$: "the" is aligned to NULL, "house" to "etxea", "is" to "da" and "blue" to "urdina".
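
As a small illustration, an alignment can be stored as a list with one entry per target word. The sketch below assumes the convention of the formulas that follow (entry $a_i$ is the source position aligned to target position $i$, with 0 standing for NULL) and takes the Basque sentence as source and the English one as target. The variable names are chosen here for illustration only.

```python
# Source sentence with the NULL word at position 0, so that the real
# words occupy positions 1..J
source = ["NULL"] + "etxea urdina da".split()
target = "the house is blue".split()

# a[i-1] is the source position aligned to the i-th target word
a = [0, 1, 3, 2]

aligned_pairs = [(y_i, source[a_i]) for y_i, a_i in zip(target, a)]
for y_i, x_j in aligned_pairs:
    print(f"{y_i} -> {x_j}")

# Each of the I target words can pick any of the J+1 source positions,
# so there are (J+1)^I possible alignments for this sentence pair
n_alignments = len(source) ** len(target)
print(f"number of possible alignments: {n_alignments}")
```

Even for this short pair there are 256 candidate alignments, which is why the alignment is treated as a hidden variable instead of being enumerated by hand.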

The alignment is considered a hidden variable, so that we sum over all its possible values:

$$
\begin{align*}
P(y \mid x) &= P(y, I \mid x)\\%
            &= P(I \mid x) \, P(y \mid I, x)\\
            &= P(I \mid x) \sum_a P(y, a \mid I, x)\\%
\end{align*}
$$

with

$$
\begin{align*}
P(y, a \mid I, x) &= \prod_{i=1}^I P(y_i, a_i \mid x, y_1^{i-1}, a_1^{i-1})\\%
                  &= \prod_{i=1}^I P(y_i \mid x, y_1^{i-1}, a_1^{i}) \, P(a_i \mid x, y_1^{i-1}, a_1^{i-1})%
\end{align*}
$$


### Model 1

Assumptions and model parameters:

$$
\begin{align*}
P(y_i \mid x, y_1^{i-1}, a_1^{i})   &:= p(y_i \mid x_{a_i})\\ 
P(a_i \mid x, y_1^{i-1}, a_1^{i-1}) &:= \frac{1}{J+1}
\end{align*}
$$

Model 1 is defined as:

$$
\begin{align*}
P(y \mid x) &\sim \sum_a \prod_{i=1}^I \frac{1}{J+1} \, p(y_i \mid x_{a_i})\\%
            &=       \prod_{i=1}^I \sum_{a_i=0}^J \frac{1}{J+1} \, p(y_i \mid x_{a_i})\\%
            &= \frac{1}{(J+1)^I} \, \prod_{i=1}^I \sum_{a_i=0}^J p(y_i \mid x_{a_i})\\%
            &= \frac{1}{(J+1)^I} \, \prod_{i=1}^I \sum_{j=0}^J p(y_i \mid x_j)
\end{align*}
$$
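
The step that moves the sum inside the product works because each $a_i$ appears in exactly one factor, so the sum over all $(J+1)^I$ alignments factorises into $I$ independent sums over $J+1$ source positions. A quick numerical check of this identity is sketched below. The table `p` holds arbitrary positive numbers standing in for $p(y_i \mid x_j)$; they are not trained probabilities.

```python
import itertools
import math
import random

random.seed(0)

source = ["NULL", "etxea", "urdina", "da"]   # x_0 .. x_J, with x_0 = NULL
target = ["the", "house", "is", "blue"]      # y_1 .. y_I
J, I = len(source) - 1, len(target)

# Arbitrary positive values standing in for p(y_i | x_j) (not trained)
p = {(y, x): random.random() for y in target for x in source}

# Explicit sum over all (J+1)^I alignments
brute_force = 0.0
for a in itertools.product(range(J + 1), repeat=I):
    brute_force += math.prod(
        p[(target[i], source[a[i]])] / (J + 1) for i in range(I)
    )

# Product over target positions of a sum over source positions
factorised = math.prod(sum(p[(y, x)] for x in source) for y in target) / (J + 1) ** I

print(brute_force, factorised)
```

The factorised form needs on the order of $I \cdot (J+1)$ operations instead of $(J+1)^I$, which is what makes Model 1 tractable.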

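The parameters are estimated by maximizing the log-likelihood with the EM algorithm, where $a_{nij}$ denotes the posterior probability that the $i$-th target word of the $n$-th sentence pair is aligned to the $j$-th source word: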

$$
\begin{align*}
\text{E step}: a_{nij} &= \frac{p(y_{ni} \mid x_{nj})}{\sum_{j'} p(y_{ni} \mid x_{nj'})}\\%
\text{M step}: p(u \mid v) &\sim  \sum_n \sum_{i:y_{ni}=u} \sum_{j:x_{nj}=v} a_{nij}
\end{align*}
$$
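
The two steps can be sketched in plain Python, following the direction of the formulas, $p(y_i \mid x_j)$, with Basque as source and English as target. This is a minimal sketch under those assumptions, not a reference implementation: the dictionary-based table `p`, the `"NULL"` token and the choice of 5 iterations are illustrative.

```python
from collections import defaultdict

# Source x = Basque, target y = English, so the table stores p(u | v)
# with u an English word and v a Basque word (or NULL)
src_sents = [s.split() for s in ["etxea urdina da", "nire etxea zuria zen",
                                 "nire txakurra zuria da", "txakurra urdina zen"]]
trg_sents = [s.split() for s in ["the house is blue", "my house was white",
                                 "my dog is white", "the dog was blue"]]

trg_vocab = {u for sent in trg_sents for u in sent}

# Uniform initialisation: every p(u | v) starts at 1 / |target vocabulary|
p = defaultdict(lambda: 1.0 / len(trg_vocab))

for _ in range(5):
    counts = defaultdict(float)   # expected counts for each pair (u, v)
    totals = defaultdict(float)   # expected counts for each source word v
    for x, y in zip(src_sents, trg_sents):
        x = ["NULL"] + x          # position j = 0 is the NULL word
        for u in y:
            # E step: posterior of aligning target word u to each source word v
            norm = sum(p[(u, v)] for v in x)
            for v in x:
                a = p[(u, v)] / norm
                counts[(u, v)] += a
                totals[v] += a
    # M step: renormalise the expected counts for each source word v
    p = defaultdict(float, {(u, v): c / totals[v] for (u, v), c in counts.items()})

print(f"p(was | zen) = {p[('was', 'zen')]:.3f}")
```

After each M step the values $p(u \mid v)$ sum to one over the target words $u$ for every source word $v$, which is the normalisation implied by the proportionality sign in the M step.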


In [166]:
from nltk.translate import AlignedSent, IBMModel1

enText = ['the house is blue', 'my house was white','my dog is white', 'the dog was blue']
euText = ['etxea urdina da','nire etxea zuria zen','nire txakurra zuria da','txakurra urdina zen']

# Source language is Basque (Euskera) and target language is English:
# AlignedSent takes the target sentence (words) first and the source sentence (mots) second
corpus = []
for enSent, euSent in zip(enText, euText):
    corpus.append(AlignedSent(enSent.split(), euSent.split()))

for sentencePair in corpus:
    print(f'{sentencePair.mots} > {sentencePair.words}')

# Training p(trg_word | src_word) with 5 EM iterations: m1.translation_table[trgWord][srcWord]
m1 = IBMModel1(corpus, 5)

# Store the most probable alignment in each sentence pair
m1.align_all(corpus)

print(corpus)

for trgWord in m1.translation_table:
    for srcWord in m1.translation_table[trgWord]:
        print(f'p({trgWord} | {srcWord}) = {m1.translation_table[trgWord][srcWord]}')

"""
s={}
for trgWord in m1.translation_table:
    for srcWord in m1.translation_table[trgWord]:
        if srcWord not in s:
            s[srcWord]  = m1.translation_table[trgWord][srcWord]
        else:
            s[srcWord] += m1.translation_table[trgWord][srcWord]

for srcWord in s:
    print(f's[{srcWord}] = {s[srcWord]}')
"""

['etxea', 'urdina', 'da'] > ['the', 'house', 'is', 'blue']
['nire', 'etxea', 'zuria', 'zen'] > ['my', 'house', 'was', 'white']
['nire', 'txakurra', 'zuria', 'da'] > ['my', 'dog', 'is', 'white']
['txakurra', 'urdina', 'zen'] > ['the', 'dog', 'was', 'blue']
[AlignedSent(['the', 'house', 'is', 'blue'], ['etxea', 'urdina', 'da'], Alignment([(0, 1), (1, 0), (2, 2), (3, 1)])), AlignedSent(['my', 'house', 'was', 'white'], ['nire', 'etxea', 'zuria', 'zen'], Alignment([(0, 2), (1, 1), (2, 3), (3, 2)])), AlignedSent(['my', 'dog', 'is', 'white'], ['nire', 'txakurra', 'zuria', 'da'], Alignment([(0, 2), (1, 1), (2, 3), (3, 2)])), AlignedSent(['the', 'dog', 'was', 'blue'], ['txakurra', 'urdina', 'zen'], Alignment([(0, 1), (1, 0), (2, 2), (3, 1)]))]
p(was | None) = 0.11476098738559218
p(was | nire) = 0.019614016975134548
p(was | etxea) = 0.019819949432785266
p(was | zuria) = 0.019614016975134548
p(was | zen) = 0.8256223171010132
p(was | txakurra) = 0.03293937092327597
p(was | urdina) = 0.010651973572

In [81]:
import numpy as np
np.set_printoptions(precision=3)

def create_dataset(sents):
    # Map each word to an integer id (starting at 1) and encode the sentences as lists of ids
    word2id = {}
    id2word = {}
    idSents = []
    nextId = 1
    for sent in sents:
        idSent = []
        for word in sent.split():
            if word not in word2id:
                word2id[word] = nextId
                id2word[nextId] = word
                nextId += 1
            idSent.append(word2id[word])
        idSents.append(idSent)
    return word2id, id2word, idSents

srcSents = ['the house is blue','my house was white','my dog is white','the dog was blue']
trgSents = ['etxea urdina da','nire etxea zuria zen','nire txakurra zuria da','txakurra urdina zen']

srcDict, isrcDict, srcData = create_dataset(srcSents)
trgDict, itrgDict, trgData = create_dataset(trgSents)

print(f'srcData = {srcData}')
print(f'trgData = {trgData}')

# M1 dictionary: row 0 is the NULL word, row t is target word t and column s-1 is source word s.
# Rows are normalised over the source words, so M1Dict[t][s-1] estimates p(source word s | target word t).
# Initialise with a uniform distribution
M1Dict = np.full((len(trgDict)+1, len(srcDict)), 1.0/len(srcDict))
print(f'M1Dict = {M1Dict}')


for iteration in range(5):
    newM1Dict = np.zeros((len(trgDict)+1, len(srcDict)), dtype=float)
    for n in range(len(srcData)):
        # E-step: a[j][0] is the posterior of aligning source word j to NULL
        # and a[j][i+1] the posterior of aligning it to target word i
        a = np.zeros((len(srcData[n]), len(trgData[n])+1), dtype=float)
        for j in range(len(srcData[n])):
            a[j][0] = M1Dict[0][srcData[n][j]-1]
            for i in range(len(trgData[n])):
                a[j][i+1] = M1Dict[trgData[n][i]][srcData[n][j]-1]
            a[j] /= np.sum(a[j])
        # M-step: accumulate the expected counts
        for j in range(len(srcData[n])):
            newM1Dict[0][srcData[n][j]-1] += a[j][0]
            for i in range(len(trgData[n])):
                newM1Dict[trgData[n][i]][srcData[n][j]-1] += a[j][i+1]

    # Normalise each row to obtain probabilities
    newM1Dict /= newM1Dict.sum(axis=1, keepdims=True)

    # Update M1 dictionary
    M1Dict = newM1Dict
print(f'M1Dict = {M1Dict}')

# Most probable source word for NULL and for each target word
print(f'{isrcDict[np.argmax(M1Dict[0])+1]} -> NULL')
for trgWord in range(1,len(M1Dict)):
    print(f'{isrcDict[np.argmax(M1Dict[trgWord])+1]} -> {itrgDict[trgWord]}')


srcData = [[1, 2, 3, 4], [5, 2, 6, 7], [5, 8, 3, 7], [1, 8, 6, 4]]
trgData = [[1, 2, 3], [4, 1, 5, 6], [4, 7, 5, 3], [7, 2, 6]]
M1Dict = [[0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125]
 [0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125]
 [0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125]
 [0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125]
 [0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125]
 [0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125]
 [0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125]
 [0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125]]

## Other word-based models

<ul>
<li>The IBM research group proposed Models 1 through 5, which extend Model 1 with, among others, an alignment distribution (Model 2) and fertility (Model 3)</li>
<li>HMM alignment model</li>
<li>Mixture models</li>
<li>etc.</li>
</ul>

## Additional bibliography

<ul>
<li><a href="https://aclanthology.org/J93-2003.pdf" target="_blank">P.F. Brown et al. The Mathematics of Statistical Machine Translation: Parameter Estimation, Computational Linguistics, 1993.</a></li>
<li><a href="https://kevincrawfordknight.github.io/papers/wkbk-rw.pdf" target="_blank">K. Knight. A Statistical MT Tutorial Workbook, August 1999.</a></li>
<li><a href="https://github.com/moses-smt/giza-pp" target="_blank">F. Och. GIZA++ toolkit and the mkcls tool.</a></li>
</ul>