# Statistical Machine Translation

If you have seen the SMT lecture, you already know what the task of machine translation is and what word alignment is. With the help of this notebook, let's try to figure it all out in practice.

# IBM MODEL 1

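1. We need to calculate the posterior probability of an alignment: $$p(a|f,e,m)=\frac{p(f,a|e,m)}{\sum_{a \in A} p(f,a|e,m)}$$
* English sentences consist of $l=2$ words.  
* German sentences consist of $m=2$ words.  
* An alignment $a$ is $\{a_{1},..., a_{m}\}$, where each $a_{j} \in \{0,...,l\}$ (with $a_{j}=0$ meaning that the $j$'th German word is aligned to a special NULL word). 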

2. There are $(l+1)^m$ possible alignments, and in IBM Model 1 all alignments $a$ are equally likely:
$$p(a|e,m) = \frac{1}{(l+1)^m}$$ 
3. To generate a German string $f$ from an English string $e$:  
* Step 1: Pick an alignment $a$ with probability $\frac{1}{(l+1)^m}$  
* Step 2: Pick the German words with probability $$p(f|a,e,m)={\prod_{j=1}^{m} t(f_{j}|e_{a_{j}})}$$
 
4. The final result:  $$p(f,a|e,m)=p(a|e,m)\times{p(f|a,e,m)}=\frac{1}{(l+1)^m}{\prod_{j=1}^{m} t(f_{j}|e_{a_{j}})}$$
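
To make the steps above concrete, here is a minimal sketch that enumerates all $(l+1)^m$ alignments for one sentence pair and computes $p(f,a|e,m)$ for each of them. The translation table `t` holds made-up values chosen only for illustration (they are not the output of `train`), and the helper name `model1_joint` is hypothetical.

```python
from itertools import product

# Hypothetical translation table t(f | e); values are illustrative only.
t = {
    ("das", "NULL"): 0.1, ("Haus", "NULL"): 0.1,
    ("das", "the"): 0.7, ("Haus", "the"): 0.2,
    ("das", "house"): 0.2, ("Haus", "house"): 0.8,
}

e = ["NULL", "the", "house"]  # l = 2 English words, NULL at position 0
f = ["das", "Haus"]           # m = 2 German words


def model1_joint(f, a, e, t):
    """p(f, a | e, m) = 1 / (l+1)^m * prod_j t(f_j | e_{a_j})."""
    l, m = len(e) - 1, len(f)
    prob = 1.0 / (l + 1) ** m
    for j, i in enumerate(a):
        prob *= t[(f[j], e[i])]
    return prob


l, m = len(e) - 1, len(f)
# Every alignment assigns one English position (0..l) to each German word.
alignments = list(product(range(l + 1), repeat=m))
joints = {a: model1_joint(f, a, e, t) for a in alignments}

# Normalising over all alignments gives the posterior p(a | f, e, m).
total = sum(joints.values())
posterior = {a: p / total for a, p in joints.items()}
best = max(posterior, key=posterior.get)
print(len(alignments), best)
```

With these illustrative numbers the most probable alignment links "das" to "the" and "Haus" to "house". Enumerating alignments is only feasible for tiny sentences, since their number grows as $(l+1)^m$.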


In [None]:
# Import the training function for IBM Model 1

from smt.ibmmodel.ibmmodel1 import train

In [None]:
# Our (English, German) sentence pairs

sent_pairs = [("the house", "das Haus"),
              ("the book", "das Buch"),
              ("a book", "ein Buch"),
              ]

In [None]:
# Train IBM Model 1 and display the result

train(sent_pairs, loop_count=300)
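
For intuition about what training does, below is a minimal sketch of the EM procedure commonly used to estimate $t(f|e)$ for IBM Model 1. It is not the implementation behind `train` in the `smt` package: the function name `em_ibm1` is hypothetical, and the sketch leaves out the NULL word for brevity.

```python
from collections import defaultdict

sent_pairs = [("the house", "das Haus"),
              ("the book", "das Buch"),
              ("a book", "ein Buch")]


def em_ibm1(pairs, iterations=50):
    """Estimate t(f | e) with EM; pairs are (English, German) strings."""
    corpus = [(e.split(), f.split()) for e, f in pairs]
    f_vocab = {w for _, fs in corpus for w in fs}
    # Start from a uniform distribution over the German vocabulary.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected number of (f, e) links
        total = defaultdict(float)  # expected number of links leaving e
        for es, fs in corpus:
            for f in fs:
                # E-step: share one unit of mass for f among the English words.
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: renormalise the expected counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


t = em_ibm1(sent_pairs)
for (f, e), p in sorted(t.items(), key=lambda kv: -kv[1])[:4]:
    print(f"t({f} | {e}) = {p:.3f}")
```

Even though every word starts with the same probability, co-occurrence across the three pairs is enough for the estimates to drift toward the intuitive translations.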

# IBM MODEL 2

1. The main difference between Model 1 and Model 2 is that we introduce alignment (distortion) parameters: 
$q(i|j,l,m)=$ 'probability that the $j$'th German word is connected with the $i$'th English word, given that the sentence lengths of $e$ and $f$ are $l$ and $m$ respectively'.  
2. Let's define $$p(a|e,m)={\prod_{j=1}^{m} q(a_{j}|j,l,m)},$$ where $a=\{a_{1},...,a_{m}\}$
3. This gives $$p(f,a|e,m)={\prod_{j=1}^{m} q(a_{j}|j,l,m)t(f_{j}|e_{a_{j}})}$$
4. To generate a German string $f$ from an English string $e$: 
* Step 1: Pick an alignment $a$ with probability $$\prod_{j=1}^{m} q(a_{j}|j,l,m)$$ 
* Step 2: Pick the German words with probability $$p(f|a,e,m)={\prod_{j=1}^{m} t(f_{j}|e_{a_{j}})}$$
 
5. The final result:  $$p(f,a|e,m)=p(a|e,m)\times{p(f|a,e,m)}={\prod_{j=1}^{m} q(a_{j}|j,l,m)t(f_{j}|e_{a_{j}})}$$
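
The Model 2 formula can be checked with a few lines of code. The sketch below uses made-up tables `t` and `q` (illustrative values, not trained ones) and a hypothetical helper `model2_joint`. It also shows that setting every $q(i|j,l,m)$ to $\frac{1}{l+1}$ gives back the Model 1 probability.

```python
# Hypothetical translation table t(f | e); values are illustrative only.
t = {
    ("das", "NULL"): 0.1, ("Buch", "NULL"): 0.1,
    ("das", "the"): 0.7, ("Buch", "the"): 0.2,
    ("das", "book"): 0.2, ("Buch", "book"): 0.8,
}

e = ["NULL", "the", "book"]  # l = 2, NULL at position 0
f = ["das", "Buch"]          # m = 2

# Hypothetical distortion table q(i | j, l, m), keyed by (i, j, l, m).
# Each row sums to 1 and favours the diagonal (word j aligned to word j).
q = {}
for j, row in {1: [0.1, 0.7, 0.2], 2: [0.1, 0.2, 0.7]}.items():
    for i, p in enumerate(row):
        q[(i, j, 2, 2)] = p


def model2_joint(f, a, e, t, q):
    """p(f, a | e, m) = prod_j q(a_j | j, l, m) * t(f_j | e_{a_j})."""
    l, m = len(e) - 1, len(f)
    prob = 1.0
    for j, i in enumerate(a, start=1):  # j is the 1-based German position
        prob *= q[(i, j, l, m)] * t[(f[j - 1], e[i])]
    return prob


p_diag = model2_joint(f, (1, 2), e, t, q)

# With a uniform q the result equals the Model 1 joint probability.
uniform_q = {(i, j, 2, 2): 1.0 / 3 for i in range(3) for j in (1, 2)}
p_uniform = model2_joint(f, (1, 2), e, t, uniform_q)
print(p_diag, p_uniform)
```

Because `q` rewards the diagonal here, the monotone alignment gets a noticeably higher probability than it does under the uniform alignment prior of Model 1.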
    


In [None]:
# Import the Model 2 training function
# show_matrix: helper for visualising how the alignments work

from smt.ibmmodel.ibmmodel2 import train as ibm2_train
from smt.ibmmodel.ibmmodel2 import show_matrix
from smt.utils.utility import matrix

In [None]:
# Train IBM Model 2 on the same sentence pairs

sent_pairs = [("the house", "das Haus"),
              ("the book", "das Buch"),
              ("a book", "ein Buch"),
              ]
t, a = ibm2_train(sent_pairs, loop_count=1)

In [None]:
# Helper function: print a multi-line string one line at a time
def print_lines(text):
    for row in text.split("\n"):
        print(row)

## IBM model 2 results

In [None]:
es, fs = "the book".split(), "das Buch".split()

In [None]:
print_lines(show_matrix(es, fs, t, a))

## Now on Japanese! ##

In [None]:
sentences = [("僕 は 男 です", "I am a man"),
             ("私 は 女 です", "I am a girl"),
             ("私 は 先生 です", "I am a teacher"),
             ("彼女 は 先生 です", "She is a teacher"),
             ("彼 は 先生 です", "He is a teacher"),
             ]

In [None]:
t, a = ibm2_train(sentences, loop_count=10)

In [None]:
es = "私 は 先生 です".split()
fs = "I am a teacher".split()

In [None]:
print_lines(show_matrix(es, fs, t, a))

## Alignments and symmetrization ##

Let's see how alignment and symmetrization work!

In [None]:
from smt.phrase.word_alignment import _alignment

In [None]:
es = "michael assumes that he will stay in the house".split()
fs = "michael geht davon aus , dass er im haus bleibt".split()
e2f = [(1, 1), (2, 2), (2, 3), (2, 4), (3, 6),
       (4, 7), (7, 8), (9, 9), (6, 10)]
f2e = [(1, 1), (2, 2), (3, 6), (4, 7), (7, 8),
       (8, 8), (9, 9), (5, 10), (6, 10)]

In [None]:
print(matrix(len(es), len(fs), f2e, es, fs))

In [None]:
ali = _alignment(es, fs, e2f, f2e)
print(matrix(len(es), len(fs), ali, es, fs))
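
Symmetrization combines the two directional alignments into one. The two simplest combinations are the intersection (high precision) and the union (high recall), sketched below. The sketch assumes that both lists store 1-based (English index, German index) pairs. It is not necessarily what `_alignment` returns, since symmetrization heuristics typically start from the intersection and then add selected points from the union.

```python
e2f = [(1, 1), (2, 2), (2, 3), (2, 4), (3, 6),
       (4, 7), (7, 8), (9, 9), (6, 10)]
f2e = [(1, 1), (2, 2), (3, 6), (4, 7), (7, 8),
       (8, 8), (9, 9), (5, 10), (6, 10)]

# Points found in both directions: few but reliable.
intersection = sorted(set(e2f) & set(f2e))
# Points found in at least one direction: more coverage, more noise.
union = sorted(set(e2f) | set(f2e))

# Candidate points a heuristic could add on top of the intersection.
only_e2f = sorted(set(e2f) - set(f2e))
only_f2e = sorted(set(f2e) - set(e2f))

print("intersection:", intersection)
print("union:", union)
print("only in e2f:", only_e2f)
print("only in f2e:", only_f2e)
```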

In [None]:
from smt.utils.utility import mkcorpus
from smt.phrase.word_alignment import symmetrization
from pprint import pprint

In [None]:
corpus = mkcorpus(sentences)
es = "私 は 先生 です".split()
fs = "I am a teacher".split()  

In [None]:
syn = symmetrization(es, fs, corpus)
pprint(syn)
print(matrix(len(es), len(fs), syn, es, fs))