# An Introduction to Word Embeddings

One of the breakthroughs of neural networks in Natural Language Processing is the use of word embeddings. Rather than using the words themselves as features, neural network methods typically take as input dense, relatively low-dimensional vectors that model the meaning and usage of a word. Word embeddings were first popularized through the [Word2Vec](https://arxiv.org/abs/1301.3781) model, developed by Tomas Mikolov and colleagues at Google. Since then, scores of alternative approaches have been developed, such as [GloVe](https://nlp.stanford.edu/projects/glove/) and [FastText](https://fasttext.cc/) embeddings. In this notebook, we'll explore word embeddings with the original Word2Vec approach, as implemented in the [Gensim](https://radimrehurek.com/gensim/) library.

## Training word embeddings

Training word embeddings with Gensim couldn't be easier. The only thing we need is a corpus of sentences in the language under investigation. Wikipedia is a good choice for training generic embeddings. For our experiments, we're going to use 5,000,000 sentences from Dutch Wikipedia, which we've tokenized and lowercased in advance. This means we can feed lists of sentence tokens to Word2Vec by reading the lines in our Wikipedia file and splitting them on spaces.

In [1]:
import os

class SentenceCorpus(object):

    def __init__(self, filename):
        self.filename = filename

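    def __iter__(self):
        # Stream the file line by line so the whole corpus never has to fit in memory.
        # The explicit encoding assumes the corpus file is stored as UTF-8.
        with open(self.filename, "r", encoding="utf-8") as f:
            for line in f:
                tokens = line.strip().split()
                yield tokens

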
WIKI_FILE = os.path.join(os.path.expanduser("~"), "Corpora/NL/Wikipedia", "nlwiki_20170620_tok_small.txt")
sentences = SentenceCorpus(WIKI_FILE)

When we train our word embeddings, gensim allows us to set a number of parameters. The most important of these are `min_count`, `window`, `size` and `sg`:

- `min_count` is the minimum frequency of the words in our corpus. For infrequent words, we just don't have enough information to train reliable word embeddings. It therefore makes sense to set this minimum frequency to at least 10. In these experiments, we'll set it to 100 to limit the size of our model even more.
- `window` is the number of words to the left and to the right that make up the context that word2vec will take into account.
- `size` is the dimensionality of the word vectors (Gensim 4.0 and later call this parameter `vector_size`). This is generally between 100 and 1000. You often have to make a trade-off: embeddings with a higher dimensionality are able to model more information, but also need more data to train.
- `sg`: there are two algorithms to train word2vec: skip-gram and CBOW. Skip-gram tries to predict the context on the basis of the target word; CBOW tries to find the target on the basis of the context. By default, Gensim uses CBOW (`sg=0`).

We'll investigate the impact of some of these parameters later.
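
To make `window` and the two algorithms more concrete, here is a minimal sketch of how a context window slides over a tokenized sentence. The helper `context_windows` and the example sentence are our own illustration, not part of Gensim, and the real training loop adds refinements that we leave out here.

```python
def context_windows(tokens, window=2):
    """Yield (target, context) for every position in a tokenized sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after the target
        yield target, left + right

# Hypothetical example sentence, already tokenized and lowercased.
sentence = "de koning woont in het paleis".split()
for target, context in context_windows(sentence, window=2):
    print(f"{target:>7} <-> {context}")
```

Skip-gram would use each pair to predict the context words from the target, while CBOW would predict the target from the context words taken together.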

In [3]:
import gensim

model = gensim.models.Word2Vec(sentences, min_count=100, window=5, size=100)


## Using word embeddings

Let's take a look at the model. The word embeddings are on its `wv` attribute, and we can access them by using the token as key. For example, here is the embedding for Dutch *koning* (king), with the requested 100 dimensions.

In [4]:
model.wv["koning"]


array([-0.7436782 , -2.5546741 , -2.4181092 ,  0.53079987, -0.6392997 ,
        2.7601945 , -2.836296  ,  1.1442246 ,  2.261504  , -1.8250332 ,
        1.8353037 ,  0.8417728 , -4.150364  ,  0.389364  , -3.0675495 ,
        1.7651662 ,  1.3213423 , -1.8265532 , -1.5197517 , -0.55600065,
        1.3073932 ,  0.15334386,  0.02926308, -0.14631374,  2.5769231 ,
       -0.53777665, -1.6289988 , -1.441241  , -0.93412006,  0.44386926,
       -3.227865  ,  0.16452734,  2.0498326 ,  1.1050102 , -3.7508855 ,
       -0.71464   , -1.6540393 ,  1.1486468 ,  0.602774  , -1.5581201 ,
       -0.6466161 ,  4.055801  ,  1.3687848 , -1.9568108 ,  1.2429739 ,
       -1.4464447 ,  4.8698716 , -0.35930628, -0.18051736, -4.080876  ,
       -1.8546008 , -0.63234687,  4.99867   , -1.4174942 , -1.1202643 ,
       -0.47730613,  1.6716621 , -1.9697584 ,  0.81117696,  3.4258103 ,
        1.1309042 ,  1.2451673 , -1.2035362 , -1.78328   ,  2.8833058 ,
        2.025087  , -2.2452092 ,  0.4089499 ,  1.9777402 ,  0.44

We can also easily find the similarity between two words. Similarity is measured as the cosine between the two word embeddings, and ranges between -1 and +1. The higher the cosine, the more similar two words are. As expected, the figures below show that *koning* (king) is closer to *koningin* (queen) than to *koffie* (coffee).

In [17]:
print(model.wv.similarity("koning", "koningin"))
print(model.wv.similarity("koning", "koffie"))

0.73005563
0.0014626831
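
For readers who want to see what the cosine measures, here is a small standalone sketch. The function `cosine_similarity` and the three-dimensional toy vectors are invented for illustration; Gensim computes the same quantity on the real embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, made up for this example only.
print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))    # same direction
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))    # orthogonal
print(cosine_similarity([1.0, 2.0, 0.0], [-1.0, -2.0, 0.0]))  # opposite direction
```

Vectors pointing the same way score close to +1, orthogonal vectors score 0, and opposite vectors score close to -1, regardless of their length.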



In a similar vein, we can find the words that are most similar to a target word. The words with the most similar embedding to *koning* are all similar titles (such as *keizer* (emperor) and *hertog* (duke)) or are semantically related to royalty (such as *troon* (throne)).

In [19]:
model.wv.similar_by_word("koning", topn=10)




[('keizer', 0.8240664601325989),
 ('kroonprins', 0.8125288486480713),
 ('vorst', 0.7871984243392944),
 ('troonpretendent', 0.7620712518692017),
 ('stadhouder', 0.7611129879951477),
 ('troonopvolger', 0.7572510242462158),
 ('hofmeier', 0.755607545375824),
 ('hertog', 0.7424051761627197),
 ('vorstin', 0.7386618852615356),
 ('prins', 0.7371718883514404)]

Interestingly, we can look for words that are similar to one set of words and dissimilar to another set of words at the same time. This allows us to look for analogies of the type *king (koning) is to man (man) as ... is to woman (vrouw)*. Although the most similar word is not the correct answer (which would be *koningin*, queen), notice how female titles, such as *echtgenote* (wife), *keizerin* (empress) and *koningin* (queen), are now present in the top 10 most similar words. This wasn't the case above.

In [20]:
model.wv.most_similar(positive=['vrouw', 'koning'], negative=["man"], topn=10)




[('kroonprins', 0.7548491358757019),
 ('troonopvolger', 0.701501727104187),
 ('echtgenote', 0.7009282112121582),
 ('keizerin', 0.7000778317451477),
 ('gemalin', 0.6954908967018127),
 ('koningin', 0.6939809322357178),
 ('groothertog', 0.6857107877731323),
 ('troonpretendent', 0.674928605556488),
 ('onderkoning', 0.6680958271026611),
 ('isabella', 0.6667591333389282)]
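
The arithmetic behind such a query can be sketched in a few lines of NumPy. This is our own simplified version, assuming the usual recipe: add the unit vectors of the positive words, subtract those of the negative words, average, and rank the remaining words by cosine similarity to the result. The function `analogy` and the two-dimensional toy vectors are hypothetical and only serve to show the idea.

```python
import numpy as np

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to the averaged query vector."""
    signed = [unit(vectors[w]) for w in positive] + [-unit(vectors[w]) for w in negative]
    query = unit(np.mean(signed, axis=0))
    exclude = set(positive) | set(negative)  # the query words themselves are not candidates
    scores = {w: float(np.dot(unit(v), query))
              for w, v in vectors.items() if w not in exclude}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:topn]

# Hypothetical 2-d vectors: axis 0 stands for "royalty", axis 1 for "gender".
toy_vectors = {
    "koning":   np.array([1.0, 1.0]),
    "koningin": np.array([1.0, -1.0]),
    "man":      np.array([0.0, 1.0]),
    "vrouw":    np.array([0.0, -1.0]),
    "koffie":   np.array([-1.0, 0.1]),
}
print(analogy(toy_vectors, positive=["vrouw", "koning"], negative=["man"]))
```

In this hand-built space *koningin* comes out on top. Real embeddings are far noisier, which is why the correct answer only reaches sixth place in the output above.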

Similarly, we can also zoom in on one of the meanings of ambiguous words. For example, as in English, *muis* (mouse) in Dutch can refer to two things: an animal and a computer mouse. If we look at the 10 nearest neighbours to *muis*, most of them are animals: *papegaai* (parrot), *kat* (cat), *ezel* (donkey), etc. This suggests the animal meaning is much more frequent on Wikipedia than the other one.

In [22]:
model.wv.most_similar(positive=["muis"], topn=10)



[('papegaai', 0.745761513710022),
 ('kat', 0.7276036739349365),
 ('ezel', 0.7258143424987793),
 ('slang', 0.7252289056777954),
 ('geit', 0.721189022064209),
 ('hond', 0.7077101469039917),
 ('aap', 0.7041506767272949),
 ('schotel', 0.7023760080337524),
 ('verrekijker', 0.6976540088653564),
 ('zeester', 0.6953004598617554)]

However, if we specify we're looking for words that are similar to *muis* (mouse), but dissimilar to *dier* (animal), suddenly the computer meaning takes over. We now find similar devices in the top ten nearest neighbours: *afstandsbediening* (remote control), *controller*, *switch* and *stick*.

In [26]:
model.wv.most_similar(positive=["muis"], negative=["dier"], topn=10)



[('afstandsbediening', 0.5098882913589478),
 ('bel', 0.45299047231674194),
 ('dealer', 0.4435589015483856),
 ('controller', 0.44276514649391174),
 ('belt', 0.44036799669265747),
 ('switch', 0.43773093819618225),
 ('stick', 0.43422454595565796),
 ('driver', 0.42898130416870117),
 ('thin', 0.4241107106208801),
 ('r8', 0.4202088713645935)]

Finally, we can present the word2vec model with a list of words and ask it to identify the odd one out. It then uses the word embeddings to identify the word that is least similar to the other ones. For example, in the list *auto fiets bus koffie* (car, bike, bus, coffee), it correctly identifies *koffie* as the odd one out. In the list *koffie auto thee melk* (coffee, car, tea, milk), it correctly singles out *auto*.

In [27]:
print(model.wv.doesnt_match("auto fiets bus koffie".split()))
print(model.wv.doesnt_match("koffie auto thee melk".split()))

koffie
auto
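
One plausible way to implement such an odd-one-out test is to average the unit vectors of all words in the list and pick the word that is least similar to that average. The sketch below follows this idea; the function `odd_one_out` and the toy vectors are our own and are meant as an illustration rather than a copy of Gensim's code.

```python
import numpy as np

def odd_one_out(vectors, words):
    """Return the word whose unit vector is least similar to the group mean."""
    unit_vectors = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    centre = unit_vectors.mean(axis=0)
    centre /= np.linalg.norm(centre)
    similarities = unit_vectors @ centre  # cosine of each word with the centre
    return words[int(np.argmin(similarities))]

# Hypothetical 2-d vectors: three vehicles point one way, the drink another.
toy_vehicles = {
    "auto":   np.array([1.0, 0.1]),
    "fiets":  np.array([0.9, 0.2]),
    "bus":    np.array([1.0, 0.0]),
    "koffie": np.array([0.1, 1.0]),
}
print(odd_one_out(toy_vehicles, ["auto", "fiets", "bus", "koffie"]))
```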




## Exploring hyperparameters

In [10]:
from pattern.nl import pluralize, parse
from tqdm import tqdm_notebook as tqdm

In [11]:
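test_pairs = []
for word in tqdm(model.wv.vocab):
    # pattern's parse() returns a slash-separated tagged string; field 1 is the POS tag.
    ling = parse(word)
    if ling.split("/")[1] == "NN":
        # Keep only nouns whose generated plural is also in the vocabulary.
        plural = pluralize(word)
        if plural in model.wv.vocab:
            test_pairs.append((word, plural))

print(test_pairs)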



[('architect', 'architecten'), ('minister', 'ministers'), ('hitler', 'hitlers'), ('gold', 'golden'), ('rijk', 'rijken'), ('jaar', 'jaren'), ('gezin', 'gezinnen'), ('vader', 'vaders'), ('zomer', 'zomers'), ('studie', 'studies'), ('hogeschool', 'hogescholen'), ('school', 'scholen'), ('stijl', 'stijlen'), ('behaald', 'behaalden'), ('toespraak', 'toespraken'), ('maand', 'maanden'), ('woord', 'woorden'), ('partij', 'partijen'), ('vorm', 'vormen'), ('idee', 'ideeën'), ('veld', 'velden'), ('recht', 'rechten'), ('licht', 'lichten'), ('afdeling', 'afdelingen'), ('aanleg', 'aanleggen'), ('hand', 'handen'), ('plek', 'plekken'), ('kroon', 'kronen'), ('traditie', 'tradities'), ('stad', 'steden'), ('volk', 'volken'), ('bouw', 'bouwen'), ('vergroot', 'vergroten'), ('bron', 'bronnen'), ('grond', 'gronden'), ('prijs', 'prijzen'), ('rand', 'randen'), ('cilinder', 'cilinders'), ('begin', 'beginnen'), ('bouwwerk', 'bouwwerken'), ('plaats', 'plaatsen'), ('m', 'men'), ('maal', 'malen'), ('meter', 'meters')

In [12]:
len(test_pairs)

948

In [13]:
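import numpy as np

def evaluate(test_pairs, model):
    """Mean reciprocal rank of each plural among its singular's nearest neighbours."""
    reciprocal_ranks = []
    for sing, plur in tqdm(test_pairs):
        # rank 1 means the plural is the single closest word to the singular
        rank = model.wv.rank(sing, plur)
        reciprocal_ranks.append(1/rank)
    return np.mean(reciprocal_ranks)

mrp = evaluate(test_pairs, model)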





In [14]:
mrp

0.09843961931034355
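
To get a feeling for this score, it helps to compute a mean reciprocal rank by hand. The sketch below uses made-up ranks; `mean_reciprocal_rank` is a hypothetical helper that mirrors what `evaluate` does once the ranks are known.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test items; ranks start at 1."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Hypothetical ranks for three singular/plural pairs.
print(mean_reciprocal_rank([1, 2, 4]))      # (1 + 0.5 + 0.25) / 3
print(mean_reciprocal_rank([10, 10, 10]))   # every plural sits at rank 10
```

A score just under 0.1 is what we would see if every plural sat at roughly rank 10, although the same average can also come from a mix of a few top-ranked plurals and many distant ones.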

In [15]:
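import pandas as pd

sizes = [100, 200, 300]
windows = [2, 5, 10]

# Rows are window sizes, columns are embedding dimensionalities.
df = pd.DataFrame(index=windows, columns=sizes)

for size in sizes:
    for window in windows:
        # Retrain from scratch for every combination; note that this overwrites `model`.
        model = gensim.models.Word2Vec(sentences, min_count=100, window=window, size=size)
        mrp = evaluate(test_pairs, model)
        # .loc avoids chained indexing, which may assign to a temporary copy
        df.loc[window, size] = mrp

df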

KeyboardInterrupt: 

In [17]:
model.wv

<gensim.models.keyedvectors.Word2VecKeyedVectors at 0x1a1bda2438>

In [9]:
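from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import normalize

vocab = list(model.wv.vocab)
vectors = [model.wv[w] for w in vocab]
# L2-normalize so that Euclidean distance between vectors tracks cosine similarity.
vectors_norm = normalize(vectors)

# Group the vocabulary into 500 clusters (Ward linkage is scikit-learn's default).
clusterer = AgglomerativeClustering(n_clusters=500)
clusters = clusterer.fit_predict(vectors_norm)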
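
Why normalize before clustering? For vectors of length 1, the squared Euclidean distance equals `2 * (1 - cosine)`, so a Euclidean clustering of the normalized vectors groups words by cosine similarity. The quick check below uses random vectors of our own making, not the trained embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=100), rng.normal(size=100)

# Scale both vectors to unit length, as normalize() does row by row.
u_hat, v_hat = u / np.linalg.norm(u), v / np.linalg.norm(v)

cosine = float(u_hat @ v_hat)
squared_distance = float(np.sum((u_hat - v_hat) ** 2))
print(np.isclose(squared_distance, 2 * (1 - cosine)))
```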


In [47]:
cluster_dictionary = {}
for cluster, word in zip(clusters, vocab):
    # setdefault creates the list the first time a cluster id is seen
    cluster_dictionary.setdefault(cluster, []).append(word)

In [48]:
for x in cluster_dictionary:
    if "italië" in cluster_dictionary[x]:
        print(cluster_dictionary[x])

['duitsland', 'frankrijk', 'oostenrijk', 'italië', 'zwitserland', 'slovenië', 'liechtenstein', 'monaco', 'joegoslavië', 'spanje', 'polen', 'turkije', 'luxemburg', 'macedonië', 'sovjet-unie', 'andorra', 'vaticaanstad', 'rusland', 'ussr', 'griekenland', 'portugal', 'tsjecho-slowakije', 'tsjechië', 'hongarije', 'tsjechoslowakije', 'roemenië', 'bulgarije', 'denemarken', 'noorwegen', 'zweden', 'finland', 'oekraïne', 'oost-duitsland', 'bondsrepubliek', 'servië', 'west-duitsland', 'armenië', 'bosnië', 'cyprus', 'wit-rusland', 'litouwen', 'letland', 'albanië', 'slowakije', 'ddr', 'georgië', 'malta', 'oost-berlijn', 'west-berlijn', 'estland', 'montenegro', 'kroatië', 'herzegovina', 'azerbeidzjan']


In [13]:
with open("data/clusters_nl.tsv", "w", encoding="utf-8") as o:
    for c in cluster_dictionary:
        for w in cluster_dictionary[c]:
            o.write(f"{w}\t{c}\n")