# Task 3 (4 points)

Suppose that we have two languages: Upper and Lower. This is an example Upper sentence:

<pre>
THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG.
</pre>

And this is its translation into Lower:

<pre>
the quick brown fox jumps over the lazy dog
</pre>

You have two corpora for these languages (with different sentences). Your task is to train word embeddings for both languages jointly, so that the embeddings of words that are translations of each other are as close as possible. Unfortunately, your budget allows you to prepare translations for only 1000 words (we call this set D; you have to decide which words to include in D).

Prepare a corpus which contains three kinds of sentences:
* Upper corpus sentences
* Lower corpus sentences
* sentences derived from Upper/Lower corpus, modified using D
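
The third kind of sentence can be sketched on a toy example. The snippet below assumes a hypothetical dictionary `toy_D` that maps a few Lower words to their Upper translations, and a helper `mix_sentence` that is not part of the task statement. It swaps every word found in the dictionary and leaves the rest untouched, which is only one possible scheme.

```python
# Hypothetical toy dictionary: Lower word -> Upper translation.
toy_D = {"quick": "QUICK", "dog": "DOG"}


def mix_sentence(sentence, dictionary):
    """Replace each whitespace-separated token found in `dictionary` by its translation."""
    return " ".join(dictionary.get(token, token) for token in sentence.split())


mixed = mix_sentence("the quick brown fox jumps over the lazy dog", toy_D)
print(mixed)
```

In such a mixed sentence, translated and untranslated words share context windows, which is what gives the embedding model a signal linking the two vocabularies.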

There are many possible ways of doing this, for instance this one (ROT13.COM: hfr rirel fragrapr sebz obgu pbecben gjvpr: jvgubhg nal zbqvsvpngvbaf, naq jvgu rirel jbeqf sebz Q ercynprq ol vgf genafyngvba)

We define the score for an Upper WORD as $\frac{1}{p}$, where $p$ is the position of its translation in the list of **Lower** words most similar to WORD. For instance, when the words most similar to DOG are:

<pre>
WOLF, CAT, WOLVES, LION, gopher, dog
</pre>

then the score for the word DOG is 0.5 (among the Lower words, gopher is first and dog is second). Compute the average score separately for words from D and for words outside D (hint: if the computation takes too much time, do it for a random sample).
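
The score definition can be checked on the DOG example with a few lines of plain Python. This is a minimal sketch: the helper name `translation_score` is hypothetical, and returning 0.0 when the translation is missing from the list is an assumption, since the task does not define that case.

```python
def translation_score(word, neighbours):
    """Return 1/p, where p is the 1-based position of word.lower() among the Lower neighbours.

    `neighbours` must be ordered from most to least similar. Returns 0.0 if the
    translation is absent (an assumption; the task leaves this case undefined).
    """
    position = 0
    for candidate in neighbours:
        if candidate.islower():  # only Lower words count towards the position
            position += 1
            if candidate == word.lower():
                return 1 / position
    return 0.0


neighbours = ["WOLF", "CAT", "WOLVES", "LION", "gopher", "dog"]
score = translation_score("DOG", neighbours)
print(score)  # 0.5
```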


In [1]:
from collections import Counter
from pathlib import Path

import gensim
import numpy as np
from tqdm import tqdm

In [2]:
# Count how often each lowercase token occurs in the Lower corpus
c = Counter()
with open("../data/L4/task3_polish_lower.txt", "rt") as f:
    for words in map(str.split, f):
        c.update(filter(str.islower, words))

In [3]:
# D: the 1000 most frequent lowercase words; nonD: the rest of the vocabulary
D = [w for w, _ in c.most_common(1000)]
nonD = [w for w, _ in c.most_common()][1000:]

In [4]:
translations = {l: l.upper() for l in D}

In [5]:
with open("../data/L4/task3_polish_lower.txt", "rt") as f_in, open("../data/L4/task3_polish_translations.txt", "wt") as f_out:
    for l in f_in:
        f_out.write(" ".join([translations.get(word, word) for word in l.split()]))
        f_out.write("\n")

In [6]:
!cat ../data/L4/task3_polish* >../data/L4/task3_sentences.txt

In [7]:
def average_score(model, words):
    """Mean of 1/p over `words`, where p is the rank of word.upper() among Upper neighbours."""
    scores = []
    for word in tqdm(words):
        position = 0
        # topn=None returns the similarity to every vocabulary entry, in index_to_key order
        similarities = model.wv.most_similar(word, topn=None)
        for word_idx in np.argsort(similarities)[::-1]:
            candidate = model.wv.index_to_key[word_idx]
            if candidate.isupper():
                position += 1
                if candidate.lower() == word:
                    scores.append(1 / position)
                    break
    # Words whose translation never appears in the vocabulary are skipped
    return np.mean(scores)


def evaluate_model(model):
    print("Average (D):", average_score(model, D))
    print("Average (outside D):", average_score(model, nonD))

In [8]:
for vector_size in [30]:
    print("Vector size:", vector_size)
    filepath = Path(f"../data/L4/task3_{vector_size}_model.model")
    if filepath.exists():
        model = gensim.models.Word2Vec.load(str(filepath))
        print("Loaded pretrained")
    else:
        print("Training...")
        model = gensim.models.Word2Vec(corpus_file="../data/L4/task3_sentences.txt", vector_size=vector_size, window=5, min_count=1, workers=6)
        model.save(str(filepath))
        print("Completed.")
    evaluate_model(model)

Vector size: 30
Loaded pretrained


100%|██████████| 1000/1000 [00:07<00:00, 129.35it/s]


Average (D): 0.6344844364838972


100%|██████████| 27286/27286 [03:31<00:00, 128.87it/s]

Average (outside D): 0.18288539900797587



