## Task 2 (4 points)

Your task is to train embeddings for Simple Wikipedia titles using the gensim library. As the example below shows, training is really simple:

```python
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
model = Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")
```
*sentences* can be a list of lists of tokens; you can also use *gensim.models.word2vec.LineSentence(source)* to create a restartable iterator from a file. To start, use [this file], which contains pairs of titles such that one article links to the other.
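The point of a restartable iterator is that Word2Vec reads the corpus more than once (a vocabulary scan, then one pass per epoch), so a plain generator would be exhausted after the first pass. The following is a minimal sketch of the idea, not gensim's implementation; the class name `RestartableLines` and the demo file are hypothetical.

```python
import tempfile
from pathlib import Path


class RestartableLines:
    """Yield one whitespace-tokenised line at a time, reopening the file on every pass."""

    def __init__(self, path):
        self.path = Path(path)

    def __iter__(self):
        # A fresh file handle per call to __iter__ is what makes the object
        # restartable: each pass starts again from the first line.
        with self.path.open("rt", encoding="utf-8") as f:
            for line in f:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens


# Tiny made-up file standing in for the real links file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_links.txt"
    demo_path.write_text("dog animal\ndog pet\n", encoding="utf-8")
    sentences = RestartableLines(demo_path)
    first_pass = list(sentences)
    second_pass = list(sentences)  # iterating again yields the same data
```

An object like this could be passed wherever an iterable of token lists is expected; for real use, `LineSentence` already provides this behaviour.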

We say that two titles are *related* if they share a word (or a word bigram) that is not very popular, i.e. one that occurs in only a few titles. Make this definition more precise, and create a corpus that contains pairs of related titles. Mix the original corpus with the new one, then train the title vectors again.
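One possible way to make the definition precise is sketched below, under these assumptions: titles are underscore-separated, a "key" is either a single word or two adjacent words, and a key counts as unpopular when it occurs in between `min_titles` and `max_titles` distinct titles (inclusive). The function names, parameter names, and thresholds are illustrative choices, not part of the task.

```python
import itertools
from collections import defaultdict


def title_keys(title):
    """Return the words and adjacent-word bigrams of an underscore-separated title."""
    words = [w for w in title.split("_") if w]
    bigrams = [f"{a}_{b}" for a, b in zip(words, words[1:])]
    return set(words) | set(bigrams)


def related_pairs(titles, min_titles=2, max_titles=5):
    """Return sorted pairs of titles that share at least one unpopular key."""
    # Map every word/bigram to the set of distinct titles that contain it.
    index = defaultdict(set)
    for title in titles:
        for key in title_keys(title):
            index[key].add(title)

    pairs = set()
    for group in index.values():
        # A key is "unpopular" if it appears in only a few distinct titles.
        if min_titles <= len(group) <= max_titles:
            pairs.update(itertools.combinations(sorted(group), 2))
    return sorted(pairs)


# Made-up titles for illustration only.
demo_titles = ["east_berlin", "west_berlin", "berlin_wall", "great_wall_of_china", "china"]
demo_pairs = related_pairs(demo_titles, min_titles=2, max_titles=2)
```

With these thresholds, `berlin` occurs in three titles and is treated as too popular, while `wall` and `china` each link exactly two titles. Collecting pairs in a set avoids writing the same pair twice when two titles share several unpopular keys.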

Compare the two approaches using code similar to that from Task 1.
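Besides reading the neighbour lists side by side, the comparison can be summarised with a single number per query word. The sketch below computes the Jaccard overlap of two neighbour lists in the `(word, score)` format returned by `most_similar`; the helper name `neighbour_overlap` is hypothetical, and the demo lists reuse the first three neighbours of `dog` from the output further down.

```python
def neighbour_overlap(neighbours_a, neighbours_b):
    """Jaccard overlap between two lists of (word, score) neighbours."""
    set_a = {word for word, _ in neighbours_a}
    set_b = {word for word, _ in neighbours_b}
    if not set_a and not set_b:
        return 1.0  # two empty lists are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# First three neighbours of "dog" from each model, as printed in this notebook.
demo_a = [("poison", 0.969), ("environment", 0.966), ("anatomy", 0.965)]
demo_b = [("poison", 0.975), ("anatomy", 0.973), ("horse", 0.972)]
demo_score = neighbour_overlap(demo_a, demo_b)
```

A low overlap only shows that the two models disagree; it does not say which one is better, so the lists still need to be inspected by hand.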

In [1]:
import itertools
from collections import Counter
from pathlib import Path

import gensim

In [2]:
word_counter = Counter()
with open("../data/L4/task2_simple.wiki.links.txt", "rt") as f:
    for line in f:
        for link in line.split():
            word_counter.update(link.split("_"))

In [3]:
words, counts = zip(*word_counter.most_common())

In [4]:
min_count = 2
max_count = 10

In [5]:
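# Words whose token count satisfies min_count < count <= max_count.
# A direct filter avoids the ValueError that counts.index(...) raises when
# no word has exactly max_count or min_count occurrences.
# Note: this set is recomputed in In [8] from distinct-title counts.
unpopular_words = {w for w, c in word_counter.items() if min_count < c <= max_count}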

In [6]:
from collections import defaultdict
related_sentences = defaultdict(set)

with open("../data/L4/task2_simple.wiki.links.txt", "rt") as f:
    for line in f:
        for link in line.split():
            for word in link.split("_"):
                related_sentences[word].add(link)

In [7]:
len(related_sentences)

839469

In [8]:
# Keep words that occur in more than min_count and fewer than max_count distinct titles
unpopular_words = [k for (k, v) in related_sentences.items() if min_count < len(v) < max_count]
len(unpopular_words)

88836

In [9]:
with open("../data/L4/task2_similar_links.txt", "wt") as f:
    for word, links in related_sentences.items():
        if min_count < len(links) < max_count:
            for link1, link2 in itertools.combinations(links, 2):
                f.write(f"{link1} {link2}\n")

In [10]:
!cat ../data/L4/task2_similar_links.txt ../data/L4/task2_simple.wiki.links.txt  >../data/L4/task2_extended.txt
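`cat` concatenates the two files, so all related-title pairs come before all link pairs. If an interleaved mixture is wanted instead, the lines can be shuffled. The sketch below assumes both files fit in memory; the function name `mix_corpora` and the demo files are hypothetical, and whether line order affects the trained vectors is not tested here.

```python
import random
import tempfile
from pathlib import Path


def mix_corpora(paths, out_path, seed=0):
    """Write the lines of all input files to out_path in a reproducible shuffled order."""
    lines = []
    for path in paths:
        with open(path, "rt", encoding="utf-8") as f:
            lines.extend(line.rstrip("\n") for line in f if line.strip())
    # A seeded generator keeps the shuffle reproducible between runs.
    random.Random(seed).shuffle(lines)
    with open(out_path, "wt", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
    return len(lines)


# Made-up stand-ins for the links file and the related-titles file.
with tempfile.TemporaryDirectory() as tmp:
    links = Path(tmp) / "links.txt"
    similar = Path(tmp) / "similar.txt"
    mixed = Path(tmp) / "mixed.txt"
    links.write_text("dog animal\ndog pet\n", encoding="utf-8")
    similar.write_text("east_berlin west_berlin\n", encoding="utf-8")
    n_lines = mix_corpora([similar, links], mixed)
    mixed_lines = mixed.read_text(encoding="utf-8").splitlines()
```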

In [11]:
TASK2_FILEPATH = Path("../data/L4/task2.model")
if not TASK2_FILEPATH.exists():
    model = gensim.models.Word2Vec(corpus_file="../data/L4/task2_simple.wiki.links.txt", vector_size=30, window=2, min_count=1, workers=4)
    model.save(str(TASK2_FILEPATH))
else:
    model = gensim.models.Word2Vec.load(str(TASK2_FILEPATH))

In [12]:
TASK2EXTENDED_FILEPATH = Path("../data/L4/task2_extended.model")
if not TASK2EXTENDED_FILEPATH.exists():
    model_ex = gensim.models.Word2Vec(corpus_file="../data/L4/task2_extended.txt", vector_size=30, window=2, min_count=1, workers=4)
    model_ex.save(str(TASK2EXTENDED_FILEPATH))
else:
    model_ex = gensim.models.Word2Vec.load(str(TASK2EXTENDED_FILEPATH))

In [13]:
example_english_words = ['dog', 'dragon', 'love', 'bicycle', 'marathon', 'logic', 'butterfly']  # replace, or add your own examples

for w0 in example_english_words:
    print ('WORD:', w0)
    for (w1, v1), (w2, v2) in zip(model.wv.most_similar(w0), model_ex.wv.most_similar(w0)):
        print (f'\t{w1:35} [{v1:.3f}]\t{w2:35} [{v2:.3f}]')
    print ()

WORD: dog
	poison                              [0.969]	poison                              [0.975]
	environment                         [0.966]	anatomy                             [0.973]
	anatomy                             [0.965]	horse                               [0.972]
	ecology                             [0.956]	environment                         [0.972]
	female                              [0.955]	genetics                            [0.967]
	color                               [0.955]	geology                             [0.966]
	genetics                            [0.955]	rock_(geology)                      [0.966]
	alcohol                             [0.955]	monkey                              [0.964]
	nature                              [0.954]	sea                                 [0.963]
	pig                                 [0.952]	mineral                             [0.962]

WORD: dragon
	evergreen                           [0.987]	existence                           [0.98

In [14]:
import matplotlib.pyplot as plt
import numpy as np

In [15]:
from gensim.models import KeyedVectors

# Vectors trained in Task 1, stored in the word2vec text format
task1_wv = KeyedVectors.load_word2vec_format('task1_w2v_vectors.txt', binary=False)

example_english_words = ['dog', 'dragon', 'love', 'bicycle', 'marathon', 'logic', 'butterfly']  # replace, or add your own examples
example_polish_words = ['pies', 'smok', 'miłość', 'rower', 'maraton', 'logika', 'motyl']

example_words = example_polish_words

for w0 in example_words:
    print ('WORD:', w0)  # print the query word, not the loop variable of the inner loop
    for w, v in task1_wv.most_similar(w0):
        print ('   ', w, v)
    print ()

In [None]:
model.wv.most_similar("east_berlin", topn=10)

[('west_berlin', 0.9919248223304749),
 ('sint_maarten', 0.9869701266288757),
 ("people's_republic_of_bulgaria", 0.9867256283760071),
 ('vojvodina', 0.9854137897491455),
 ('kingdom_of_greece', 0.9836078882217407),
 ('pitcairn_islands', 0.9828757643699646),
 ('eastern_europe', 0.9828706383705139),
 ('åland_islands', 0.9816504120826721),
 ('tokelau', 0.9813432693481445),
 ('holy_see', 0.980989933013916)]

In [None]:
model.save("word2vec.model")

In [None]:
# The cell for your presentation