# Statistical Machine Translation

If you have seen the SMT lecture, you already know what the task of machine translation is and what word alignment is. With the help of this notebook, let's try to figure it all out in practice.

# IBM MODEL 1

1. We need to calculate the posterior probability of an alignment: $$p(a|f,e,m)=\frac{p(f,a|e,m)}{\sum_{a \in A} p(f,a|e,m)}$$
* English sentences consist of $l=2$ words.  
* German sentences consist of $m=2$ words.  
* An alignment $a$ is $\{a_{1},..., a_{m}\}$, where each $a_{j} \in \{0,...,l\}$ 

2. There are $(l+1)^m$ possible alignments, and in IBM Model 1 all alignments $a$ are equally likely:
$$p(a|e,m) = \frac{1}{(l+1)^m}$$ 

3. To generate a German string $f$ from an English string $e$:  
* Step 1: Pick an alignment $a$ with probability: $\frac{1}{(l+1)^m}$  
* Step 2: Pick the German words with probabilities: $$p(f|a,e,m)={\prod_{j=1}^{m} t(f_{j}|e_{a_{j}})}$$
 
4. The final result:  $$p(f,a|e,m)=p(a|e,m)\times{p(f|a,e,m)}=\frac{1}{(l+1)^m}{\prod_{j=1}^{m} t(f_{j}|e_{a_{j}})}$$
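
The formulas above can be checked by hand on a two-word pair. The sketch below is only an illustration: the translation table `t` holds made-up values, and the helper name `model1_joint` is hypothetical (it is not part of the `smt` package). It enumerates all $(l+1)^m$ alignments, computes $p(f,a|e,m)$ for each, and normalises over alignments as in step 1.

```python
from itertools import product

# Hypothetical translation table t(f | e); the numbers are invented for illustration.
t = {
    ("das", "the"): 0.7, ("Haus", "the"): 0.1,
    ("das", "house"): 0.2, ("Haus", "house"): 0.8,
    ("das", "NULL"): 0.1, ("Haus", "NULL"): 0.1,
}

def model1_joint(f_words, e_words, alignment, t):
    """p(f, a | e, m) = 1 / (l + 1)^m * prod_j t(f_j | e_{a_j})."""
    e_with_null = ["NULL"] + e_words      # index 0 is the NULL word
    l, m = len(e_words), len(f_words)
    prob = 1.0 / (l + 1) ** m             # uniform alignment probability
    for j, a_j in enumerate(alignment):
        prob *= t[(f_words[j], e_with_null[a_j])]
    return prob

e_words, f_words = ["the", "house"], ["das", "Haus"]
l, m = len(e_words), len(f_words)

# Enumerate every alignment a = (a_1, ..., a_m) with a_j in {0, ..., l}.
alignments = list(product(range(l + 1), repeat=m))
joint = {a: model1_joint(f_words, e_words, a, t) for a in alignments}

# Normalise the joint over all alignments to get the posterior of each one.
total = sum(joint.values())
posterior = {a: p / total for a, p in joint.items()}

best = max(posterior, key=posterior.get)
print(len(alignments), best)
```

With these invented values the most probable alignment is `(1, 2)`, i.e. "das" links to "the" and "Haus" links to "house".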


In [49]:
# Import the training function for IBM Model 1

from smt.ibmmodel.ibmmodel1 import train

In [91]:
# Our (English, German) sentence pairs

sent_pairs = [("the house", "das Haus"),
              ("the book", "das Buch"),
              ("a book", "ein Buch"),
              ]

In [59]:
# Results of train

train(sent_pairs, loop_count=300)

defaultdict(<function smt.ibmmodel.ibmmodel1._constant_factory.<locals>.<lambda>()>,
            {('the', 'das'): Decimal('1'),
             ('the', 'Haus'): Decimal('0.001690039679655961751022473679'),
             ('house', 'das'): Decimal('4.609745521815247181063352303E-89'),
             ('house', 'Haus'): Decimal('0.9983099603203440382489775265'),
             ('the', 'Buch'): Decimal('2.815024286894250589491379615E-90'),
             ('book', 'das'): Decimal('2.815024286894250589491379615E-90'),
             ('book', 'Buch'): Decimal('1'),
             ('a', 'ein'): Decimal('0.9983099603203440382489775265'),
             ('a', 'Buch'): Decimal('4.609745521815247181063352311E-89'),
             ('book', 'ein'): Decimal('0.001690039679655961751022473679')})
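
The `train` call above comes from the `smt` package, whose internals are not shown here. As a rough illustration only, the sketch below runs the standard EM updates for Model 1 under some simplifying assumptions: no NULL word, plain floats instead of `Decimal`, and a table keyed as `(german, english)` that stores $t(f|e)$. The name `train_model1` is hypothetical and is not the library's API.

```python
from collections import defaultdict

def train_model1(sent_pairs, iterations=10):
    """EM for IBM Model 1 without the NULL word; returns t[(f, e)] = t(f | e)."""
    pairs = [(e.split(), f.split()) for e, f in sent_pairs]
    f_vocab = {f for _, fs in pairs for f in fs}
    # Uniform initialisation over the German vocabulary.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected number of (f, e) links
        total = defaultdict(float)   # expected number of links involving e
        # E-step: distribute each German word over the English words of its sentence.
        for es, fs in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: re-estimate t(f | e) from the expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

sent_pairs = [("the house", "das Haus"),
              ("the book", "das Buch"),
              ("a book", "ein Buch")]
t_sketch = train_model1(sent_pairs, iterations=50)
print(t_sketch[("Haus", "house")], t_sketch[("das", "the")])
```

As in the library output, the probability mass concentrates on the intuitive word pairs, while pairs such as ("das", "house") shrink only slowly because "house" occurs in a single sentence.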

# IBM MODEL 2

1. The main difference between Model 1 and Model 2 is that we introduce alignment (distortion) parameters:  
$q(i|j,l,m)=$ the probability that the $j$-th German word is connected with the $i$-th English word, given that the sentence lengths of $e$ and $f$ are $l$ and $m$ respectively.  

2. Let's define $$p(a|e,m)={\prod_{j=1}^{m} q(a_{j}|j,l,m)}$$ where $a=\{a_{1},...,a_{m}\}$

3. This gives $$p(f,a|e,m)={\prod_{j=1}^{m} q(a_{j}|j,l,m)t(f_{j}|e_{a_{j}})}$$

4. To generate a German string $f$ from an English string $e$: 
* Step 1: Pick an alignment $a$ with probability: $$\prod_{j=1}^{m} q(a_{j}|j,l,m)$$ 
* Step 2: Pick the German words with probabilities: $$p(f|a,e,m)={\prod_{j=1}^{m} t(f_{j}|e_{a_{j}})}$$
 
5. The final result:  $$p(f,a|e,m)=p(a|e,m)\times{p(f|a,e,m)}={\prod_{j=1}^{m} q(a_{j}|j,l,m)t(f_{j}|e_{a_{j}})}$$
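
To make the Model 2 product concrete, here is a small sketch that evaluates $p(f,a|e,m)$ for one alignment. Both tables are invented for illustration, and the helper name `model2_joint` is hypothetical. The second call uses a uniform $q$ to show that Model 2 then reduces to the Model 1 formula.

```python
def model2_joint(f_words, e_words, alignment, t, q):
    """p(f, a | e, m) = prod_j q(a_j | j, l, m) * t(f_j | e_{a_j})."""
    e_with_null = ["NULL"] + e_words      # index 0 is the NULL word
    l, m = len(e_words), len(f_words)
    prob = 1.0
    for j, a_j in enumerate(alignment, start=1):
        prob *= q[(a_j, j, l, m)] * t[(f_words[j - 1], e_with_null[a_j])]
    return prob

# Hypothetical translation table t(f | e).
t = {
    ("das", "the"): 0.7, ("Haus", "the"): 0.1,
    ("das", "house"): 0.2, ("Haus", "house"): 0.8,
    ("das", "NULL"): 0.1, ("Haus", "NULL"): 0.1,
}

# Hypothetical distortion table q(i | j, l, m) for l = m = 2, keyed as (i, j, l, m).
q = {
    (0, 1, 2, 2): 0.1, (1, 1, 2, 2): 0.7, (2, 1, 2, 2): 0.2,
    (0, 2, 2, 2): 0.1, (1, 2, 2, 2): 0.2, (2, 2, 2, 2): 0.7,
}

e_words, f_words = ["the", "house"], ["das", "Haus"]
p_model2 = model2_joint(f_words, e_words, (1, 2), t, q)

# With a uniform q(i | j, l, m) = 1 / (l + 1), the result matches Model 1.
q_uniform = {key: 1.0 / 3 for key in q}
p_uniform = model2_joint(f_words, e_words, (1, 2), t, q_uniform)
print(p_model2, p_uniform)
```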
    


In [60]:
# Import the Model 2 training function
# show_matrix: helper that displays the resulting alignments as a matrix

from smt.ibmmodel.ibmmodel2 import train as ibm2_train
from smt.ibmmodel.ibmmodel2 import show_matrix
from smt.utils.utility import matrix

In [58]:
#Results of our train

sent_pairs = [("the house", "das Haus"),
              ("the book", "das Buch"),
              ("a book", "ein Buch"),
              ]
t, a = ibm2_train(sent_pairs, loop_count=1)

the das:    0.6363636363636364
the Haus:   0.42857142857142855
house das:  0.18181818181818182
house Haus: 0.5714285714285714
the Buch:   0.18181818181818182
book das:   0.18181818181818182
book Buch:  0.6363636363636364
a ein:      0.5714285714285714
a Buch:     0.18181818181818182
book ein:   0.42857142857142855
1 1 2 2	0.6111111111111112
2 1 2 2	0.3888888888888889
1 2 2 2	0.3888888888888889
2 2 2 2	0.6111111111111112


In [48]:
# Helper function: print a multi-line string one line at a time
def print_lines(text):
    for line in text.split("\n"):
        print(line)

## IBM model 2 results

In [26]:
es, fs = "the book".split(), "das Buch".split()

In [27]:
print_lines(show_matrix(es, fs, t, a))

     das Buch
the  |x| |
book | |x|
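
Because the Model 2 probability is a product over positions $j$, the most probable alignment can be found one position at a time by maximising $q(i|j,l,m)\,t(f_j|e_i)$ over $i$. The sketch below illustrates that idea with invented tables and no NULL word. It is an assumption about how such a matrix could be produced, not a description of what `show_matrix` does internally, and `best_alignment` is a hypothetical name.

```python
def best_alignment(f_words, e_words, t, q):
    """For each position j, pick the i that maximises q(i | j, l, m) * t(f_j | e_i)."""
    l, m = len(e_words), len(f_words)
    result = []
    for j, f in enumerate(f_words, start=1):
        scores = {i: q[(i, j, l, m)] * t[(f, e_words[i - 1])]
                  for i in range(1, l + 1)}
        result.append(max(scores, key=scores.get))
    return result

# Hypothetical tables for the pair "the book" / "das Buch".
t_toy = {("das", "the"): 0.9, ("das", "book"): 0.1,
         ("Buch", "the"): 0.1, ("Buch", "book"): 0.9}
q_toy = {(1, 1, 2, 2): 0.6, (2, 1, 2, 2): 0.4,
         (1, 2, 2, 2): 0.4, (2, 2, 2, 2): 0.6}

links = best_alignment(["das", "Buch"], ["the", "book"], t_toy, q_toy)
print(links)
```

Here `links[j - 1]` is the 1-based index of the English word chosen for the $j$-th German word.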



## Now on Japanese! ##

In [63]:
sentences = [("僕 は 男 です", "I am a man"),
                     ("私 は 女 です", "I am a girl"),
                     ("私 は 先生 です", "I am a teacher"),
                     ("彼女 は 先生 です", "She is a teacher"),
                     ("彼 は 先生 です", "He is a teacher"),
                     ]

In [64]:
t, a = ibm2_train(sentences, loop_count=10)

僕 I:        0.33333333333333326
僕 am:       2.7953990516595328e-182
僕 a:        0.0
僕 man:      4.2427149655047526e-16
は I:        2.538173158018456e-144
は am:       0.5
は a:        0.5
は man:      0.0
男 I:        0.0
男 am:       0.0
男 a:        0.0
男 man:      0.9999999999999996
です I:       2.538173158018456e-144
です am:      0.5
です a:       0.5
です man:     0.0
私 I:        0.6666666666666667
私 am:       5.5907981033190676e-182
私 a:        0.0
私 girl:     6.236730531464904e-20
は girl:     0.0
女 I:        0.0
女 am:       0.0
女 a:        0.0
女 girl:     1.0
です girl:    0.0
私 teacher:  3.729343612242216e-27
は teacher:  0.0
先生 I:       0.0
先生 am:      0.0
先生 a:       0.0
先生 teacher: 1.0
です teacher: 0.0
彼女 She:     1.0
彼女 is:      2.032497523355089e-186
彼女 a:       0.0
彼女 teacher: 3.524703786008006e-29
は She:      9.987490445724654e-146
は is:       0.5
先生 She:     0.0
先生 is:      0.0
です She:     9.987490445724654e-146
です is:      0.5
彼 He:       1.0
彼 is:       2.032497523355089e-186
彼 a:   

In [65]:
es = "私 は 先生 です".split()
fs = "I am a teacher".split()

In [66]:
print_lines(show_matrix(es, fs, t, a))

     I am a teacher
私    |x| | | |
は    | | |x| |
先生   | | | |x|
です   | | |x| |



## Alignments and symmetrization ##

Let's see how alignment and symmetrization work!

In [73]:
from smt.phrase.word_alignment import _alignment

In [74]:
es = "michael assumes that he will stay in the house".split()
fs = "michael geht davon aus , dass er im haus bleibt".split()
e2f = [(1, 1), (2, 2), (2, 3), (2, 4), (3, 6),
       (4, 7), (7, 8), (9, 9), (6, 10)]
f2e = [(1, 1), (2, 2), (3, 6), (4, 7), (7, 8),
       (8, 8), (9, 9), (5, 10), (6, 10)]

In [75]:
print(matrix(len(es), len(fs), f2e, es, fs))

     michael geht davon aus , dass er im haus bleibt
mich |x| | | | | | | | | |
assu | |x| | | | | | | | |
that | | | | | |x| | | | |
he   | | | | | | |x| | | |
will | | | | | | | | | |x|
stay | | | | | | | | | |x|
in   | | | | | | | |x| | |
the  | | | | | | | |x| | |
hous | | | | | | | | |x| |



In [90]:
ali = _alignment(es, fs, e2f, f2e)
print(matrix(len(es), len(fs), ali, es, fs))

     michael geht davon aus , dass er im haus bleibt
mich |x| | | | | | | | | |
assu | |x|x|x| | | | | | |
that | | | | | |x| | | | |
he   | | | | | | |x| | | |
will | | | | | | | | | |x|
stay | | | | | | | | | |x|
in   | | | | | | | |x| | |
the  | | | | | | | |x| | |
hous | | | | | | | | |x| |
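
Symmetrization heuristics are usually described in terms of the intersection and the union of the two directional alignments. The sketch below computes both with plain Python sets, treating each pair as (English index, German index) as in the matrices above. It does not reproduce the internals of `_alignment`. In this particular example the combined matrix printed above contains exactly the links of the union.

```python
e2f = [(1, 1), (2, 2), (2, 3), (2, 4), (3, 6),
       (4, 7), (7, 8), (9, 9), (6, 10)]
f2e = [(1, 1), (2, 2), (3, 6), (4, 7), (7, 8),
       (8, 8), (9, 9), (5, 10), (6, 10)]

# Links found in both directions: high precision, low recall.
intersection = set(e2f) & set(f2e)
# Links found in at least one direction: high recall, lower precision.
union = set(e2f) | set(f2e)

# Links that only one of the two directions proposes.
only_one_direction = sorted(union - intersection)

print(sorted(intersection))
print(only_one_direction)
```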



In [76]:
from smt.utils.utility import mkcorpus
from smt.phrase.word_alignment import symmetrization
from pprint import pprint

In [38]:
corpus = mkcorpus(sentences)
es = "私 は 先生 です".split()
fs = "I am a teacher".split()  

In [69]:
syn = symmetrization(es, fs, corpus)
pprint(syn)
print(matrix(len(es), len(fs), syn, es, fs))

僕 I:        0.33333333333333326
僕 am:       2.7953990516595328e-182
僕 a:        0.0
僕 man:      4.2427149655047526e-16
は I:        2.538173158018456e-144
は am:       0.5
は a:        0.5
は man:      0.0
男 I:        0.0
男 am:       0.0
男 a:        0.0
男 man:      0.9999999999999996
です I:       2.538173158018456e-144
です am:      0.5
です a:       0.5
です man:     0.0
私 I:        0.6666666666666667
私 am:       5.5907981033190676e-182
私 a:        0.0
私 girl:     6.236730531464904e-20
は girl:     0.0
女 I:        0.0
女 am:       0.0
女 a:        0.0
女 girl:     1.0
です girl:    0.0
私 teacher:  3.729343612242216e-27
は teacher:  0.0
先生 I:       0.0
先生 am:      0.0
先生 a:       0.0
先生 teacher: 1.0
です teacher: 0.0
彼女 She:     1.0
彼女 is:      2.032497523355089e-186
彼女 a:       0.0
彼女 teacher: 3.524703786008006e-29
は She:      9.987490445724654e-146
は is:       0.5
先生 She:     0.0
先生 is:      0.0
です She:     9.987490445724654e-146
です is:      0.5
彼 He:       1.0
彼 is:       2.032497523355089e-186
彼 a:   