In [1]:
from tqdm import tqdm
from collections import defaultdict
from functools import lru_cache
from IPython.display import clear_output

## Problem 1

This function returns the text and the vocabulary of a language corpus, given its path.

In [2]:
@lru_cache(4)
def get_text_and_vocab(path):
    text = ""
    words = []
    with open(path, encoding='utf8') as corpus:
        text = corpus.read().lower()
        words = list(set(text.split()))
    return (text, words)

This function appends `NULL` to a sentence (a list of words) in place and returns it.

In [3]:
def append(x):
    x.append("NULL")
    return x

This function preprocesses and returns pairs of sentences. Preprocessing involves converting to lowercase, splitting into words, and appending `NULL`.

In [4]:
@lru_cache(16)
def sent_pairs(lang_a="de-en.de", lang_b="de-en.en"):
    pairs = []
    with open(lang_a, encoding="utf8") as a_file:
        with open(lang_b, encoding="utf8") as b_file:
            pairs = [(append(a.strip().lower().split()), append(b.strip().lower().split())) for (a, b) in zip(a_file.readlines(), b_file.readlines())]
    
    return pairs
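The preprocessing that `sent_pairs()` applies to each line can be tried in isolation. The sketch below is only an illustration: `preprocess_line` is a hypothetical helper that is not part of the notebook, and it mirrors the strip, lowercase, split, and append-`NULL` steps on a single string.

```python
def preprocess_line(line):
    # Hypothetical helper: same steps sent_pairs() applies to each line.
    words = line.strip().lower().split()
    words.append("NULL")  # NULL token, as added by append() in the notebook
    return words


print(preprocess_line("Wiederaufnahme der Sitzungsperiode\n"))
# ['wiederaufnahme', 'der', 'sitzungsperiode', 'NULL']
```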

This function implements the solution to Problem 1. It returns the translation probabilities stored in a dictionary.

To retrieve the translation probability for the word pair `A` and `B`, access the key `(A, B)` in the dictionary returned by this function.

For example: 
    ```
    probs = problem_1()
    print(probs[(A, B)])
    ```

In [5]:
@lru_cache(16)
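def problem_1(lang_a="de-en.de", lang_b="de-en.en", eps=(5*10e-6)):
    # Note: 10e-6 == 1e-5, so the default eps is about 5e-5.
    a_text, a_words = get_text_and_vocab(lang_a)
    b_text, b_words = get_text_and_vocab(lang_b)

    # Unseen (a_word, b_word) pairs start at a uniform probability over the lang_b vocab.
    probs = defaultdict(lambda: 1.0 / len(b_words))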
    
    iters = []
    for x in range(1000):
        counts = defaultdict(lambda: 0)
        total_b = defaultdict(lambda: 0)
        
        for (a_sent, b_sent) in sent_pairs(lang_a, lang_b):
            for a_word in a_sent:
                _sum = 0
                for b_word in b_sent:
                    _sum += probs[(a_word, b_word)]

                for b_word in b_sent:
                    counts[(a_word, b_word)] += probs[(a_word, b_word)] / _sum
                    total_b[b_word] += probs[(a_word, b_word)] / _sum
        
        deltas = []
        for a_word in tqdm(a_words):
            for b_word in b_words:
                new = counts[(a_word, b_word)] / total_b[b_word]
                deltas.append(abs(probs[(a_word, b_word)] - new))
                probs[(a_word, b_word)] = new
        
        if (sum(deltas) / len(deltas)) < eps:
            print(f"Iteration {x}:\tFinal avg. delta is {sum(deltas) / len(deltas)}")
            break
        else:
            clear_output()
            res = f"Iteration {x}:\tCurrent avg. delta is {sum(deltas) / len(deltas)}"
            iters.append(res)
            print("\n".join(iters))
            
    return probs            
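To see the update loop at a smaller scale, here is a minimal sketch of the same expectation and re-estimation steps on a two-sentence toy corpus. The function name `toy_em`, the toy sentences, and the fixed iteration count are assumptions made for illustration; the sketch skips `NULL`, caching, and the convergence check used in `problem_1()`.

```python
from collections import defaultdict


def toy_em(pairs, iterations=10):
    # Hypothetical toy version of the loop in problem_1(), for illustration only.
    a_vocab = {w for a_sent, _ in pairs for w in a_sent}
    b_vocab = {w for _, b_sent in pairs for w in b_sent}
    probs = defaultdict(lambda: 1.0 / len(b_vocab))  # uniform start

    for _ in range(iterations):
        counts = defaultdict(float)
        total_b = defaultdict(float)

        # Collect fractional counts from every sentence pair.
        for a_sent, b_sent in pairs:
            for a_word in a_sent:
                norm = sum(probs[(a_word, b_word)] for b_word in b_sent)
                for b_word in b_sent:
                    frac = probs[(a_word, b_word)] / norm
                    counts[(a_word, b_word)] += frac
                    total_b[b_word] += frac

        # Re-estimate every pair; pairs that never co-occur drop to 0.
        for a_word in a_vocab:
            for b_word in b_vocab:
                probs[(a_word, b_word)] = counts[(a_word, b_word)] / total_b[b_word]

    return probs


toy_pairs = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
]
toy_probs = toy_em(toy_pairs)
print(toy_probs[("haus", "house")], toy_probs[("das", "house")])
```

On this toy input the probability for `("haus", "house")` ends up larger than the one for `("das", "house")`, because `das` is also explained by `the` in both sentences.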

If you print the return value of `problem_1()`, you'll see that it is a giant dict of translation probabilities.

In [6]:
probs = problem_1()

Iteration 0:	Current avg. delta is 0.0004908649509178974
Iteration 1:	Current avg. delta is 8.719428439089653e-05
Iteration 2:	Current avg. delta is 6.009279241161375e-05


100%|████████████████████████████████████████████████████████████| 4580/4580 [00:26<00:00, 171.55it/s]


Iteration 3:	Final avg. delta is 4.617691852228025e-05


The following prints the translation probability for the pair `parlament` and `parliament`.

In [7]:
probs[('parlament', 'parliament')]

0.5972480821940453

## Problem 2

A convenience function that joins the lists of words returned by `sent_pairs()` back into sentences, dropping the trailing `NULL`.

In [8]:
def get_sent(x):
    return (" ".join(x[0][:-1]), " ".join(x[1][:-1]))

A function that implements the solution for Problem 2 and returns a dictionary with sentence pairs as keys and their alignments as values.

In [9]:
def problem_2(lang_a="de-en.de", lang_b="de-en.en", probs=None, threshold=0.2):
    if probs is None:
        probs = problem_1(lang_a, lang_b)
    
    alignments = defaultdict(lambda: defaultdict(list))
    
    for (a_sent, b_sent) in tqdm(sent_pairs(lang_a, lang_b)):
        alignment = defaultdict(set)
        for a_word in a_sent:
            for b_word in b_sent:
                if probs[(a_word, b_word)] > threshold:
                    alignment[a_word].add(b_word)
        
        alignments[get_sent((a_sent, b_sent))] = alignment
        
    return alignments
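The thresholding step can be sketched for a single sentence pair. The probabilities below are made up for illustration and do not come from the corpus; `align_pair` and `made_up_probs` are hypothetical names, and the sketch uses a plain dict lookup with a default of 0.0 instead of the `defaultdict` returned by `problem_1()`.

```python
from collections import defaultdict


def align_pair(a_sent, b_sent, probs, threshold=0.2):
    # Hypothetical helper: keep every b_word whose probability exceeds the threshold.
    alignment = defaultdict(set)
    for a_word in a_sent:
        for b_word in b_sent:
            if probs.get((a_word, b_word), 0.0) > threshold:
                alignment[a_word].add(b_word)
    return alignment


# Made-up probabilities, for illustration only.
made_up_probs = {
    ("das", "the"): 0.9,
    ("das", "house"): 0.1,
    ("haus", "the"): 0.15,
    ("haus", "house"): 0.8,
}

toy_alignment = align_pair(["das", "haus"], ["the", "house"], made_up_probs)
print(dict(toy_alignment))
# {'das': {'the'}, 'haus': {'house'}}
```

A word can align to several words, or to none, since each pair is tested against the threshold independently.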

In [10]:
alignments = problem_2(probs=probs)

100%|███████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 2846.43it/s]


Consider the first sentence pair in the German-English parallel corpus -

In [11]:
first_sentence = get_sent(sent_pairs()[0])
first_sentence

('wiederaufnahme der sitzungsperiode', 'resumption of the session')

You can retrieve its alignment by looking up `first_sentence` as the key, as follows -

In [12]:
alignments[first_sentence] # alignments stores the return value of `problem_2(probs=probs)`

defaultdict(set,
            {'wiederaufnahme': {'resumption'},
             'der': {'of', 'resumption'},
             'sitzungsperiode': {'resumption', 'session'}})

The alignment is stored as another dictionary: the keys are words of the original sentence, and the values are the sets of words that the algorithm thinks they align to. A word with no candidate above the threshold has no entry.

I'll write a function to pretty-print this alignment in case my explanation is not well received.

In [13]:
def pretty_print(alignment):
    print("Sentence: ", " ".join(alignment.keys()), "\n")
    for a_word in alignment.keys():
        values = alignment[a_word]
        print(f"{a_word} -> {' '.join(values)}")

In [14]:
pretty_print(alignments[first_sentence])

Sentence:  wiederaufnahme der sitzungsperiode 

wiederaufnahme -> resumption
der -> of resumption
sitzungsperiode -> session resumption


Another example -

In [15]:
pretty_print(alignments[get_sent(sent_pairs()[243])])

Sentence:  abstimmung morgen um uhr 

abstimmung -> vote
morgen -> tomorrow
um -> p.m.
uhr -> p.m.


## Problem 3

In [16]:
_probs = problem_1("fr-en.fr", "fr-en.en")

Iteration 0:	Current avg. delta is 0.0005132330551919902
Iteration 1:	Current avg. delta is 0.00010263965665483475
Iteration 2:	Current avg. delta is 7.423916522707603e-05
Iteration 3:	Current avg. delta is 5.779563241439832e-05


100%|████████████████████████████████████████████████████████████| 4130/4130 [00:22<00:00, 187.63it/s]


Iteration 4:	Final avg. delta is 4.249691062978729e-05


In [17]:
_alignments = problem_2("fr-en.fr", "fr-en.en", probs=_probs)

100%|████████████████████████████████████████████████████████████| 1000/1000 [00:01<00:00, 659.44it/s]


Let's try it out for the first sentence -

In [18]:
first_sentence = get_sent(sent_pairs("fr-en.fr", "fr-en.en")[0])
_alignments[first_sentence]

defaultdict(set,
            {'reprise': {'resumption', 'session'},
             'la': {'resumption'},
             'session': {'resumption', 'session'}})

Pretty printing -

In [19]:
pretty_print(_alignments[first_sentence])

Sentence:  reprise la session 

reprise -> session resumption
la -> resumption
session -> session resumption


Another example -

In [21]:
pretty_print(_alignments[get_sent(sent_pairs("fr-en.fr", "fr-en.en")[1])])

Sentence:  je reprise session parlement vendredi 17 décembre dernier et vous vux que . 

je -> like i
reprise -> session
session -> session
parlement -> parliament
vendredi -> friday
17 -> 17
décembre -> december
dernier -> friday
et -> and
vous -> you
vux -> happy
que -> that
. -> .
