<a href="https://colab.research.google.com/github/jansoe/dl_workshop/blob/main/Word2Vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word Vectors

In [None]:
import matplotlib.pyplot as plt
import nltk
import gensim
from pprint import pprint

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import gensim.downloader as api
# api.info()

In [None]:
model = api.load('glove-wiki-gigaword-300')



### Word Similarities

In [None]:
# api.load returns KeyedVectors, which are queried directly (gensim 4 has no .wv here)
model.most_similar('king')

[('queen', 0.6336469054222107),
 ('prince', 0.619662344455719),
 ('monarch', 0.5899620652198792),
 ('kingdom', 0.5791267156600952),
 ('throne', 0.5606487989425659),
 ('ii', 0.5562329888343811),
 ('iii', 0.5503199100494385),
 ('crown', 0.5224862694740295),
 ('reign', 0.521735429763794),
 ('kings', 0.5066401362419128)]
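
The scores above are cosine similarities between word vectors. As a minimal sketch of what such a ranking computes, the cell below uses hypothetical 3-dimensional toy vectors (not taken from the GloVe model) and a hypothetical helper `most_similar_toy`; it is an illustration, not gensim's implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, invented for illustration only
toy_vectors = {
    'king':  [0.9, 0.8, 0.1],
    'queen': [0.8, 0.9, 0.2],
    'apple': [0.1, 0.2, 0.9],
}

def most_similar_toy(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = most_similar_toy('king', toy_vectors)
print(ranking)
```

Because cosine similarity only depends on the angle between two vectors, their lengths do not influence the ranking.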

### Visualisation with TSNE
Introduction to TSNE: https://distill.pub/2016/misread-tsne/

TSNE is a popular technique for visualizing high-dimensional vector spaces. It often yields good insights into your data, but the results have to be treated with care.
Loosely speaking, it tries to preserve the distances between direct neighbours at the cost of misrepresenting long-range distances. The `perplexity` parameter roughly determines the size of the neighbourhood.
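
To make "size of the neighbourhood" more concrete: in TSNE, every point gets a Gaussian distribution over its neighbours, and perplexity is 2 raised to the entropy of that distribution. The sketch below, using hypothetical toy distances and a hypothetical helper `perplexity`, shows how a wider Gaussian spreads probability over more neighbours. The real algorithm works the other way round: you choose the perplexity and it searches for a matching width per point.

```python
import math

def perplexity(distances, sigma):
    """Perplexity 2**H of the Gaussian neighbour distribution around one point."""
    weights = [math.exp(-d * d / (2 * sigma * sigma)) for d in distances]
    total = sum(weights)
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy

# Hypothetical distances from one point to ten other points
toy_distances = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# A narrow Gaussian sees roughly one neighbour, a wide one nearly all ten
for sigma in (0.5, 2.0, 10.0):
    print(sigma, round(perplexity(toy_distances, sigma), 2))
```

A perplexity of about `k` can therefore be read as "each point effectively considers about `k` neighbours".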

In [None]:
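# To keep this computationally and visually feasible, we select only the
# 2000 words closest to 'luxembourg'.

# api.load returns KeyedVectors, so we query the model directly (no .wv)
top_similarity = model.most_similar(positive=['luxembourg'], topn=2000)
selected_words = [word for word, similarity in top_similarity]
# norm=True returns the unit-length version of each vector (gensim 4 API)
vectors = [model.get_vector(word, norm=True) for word in selected_words]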

In [None]:
import plotly.express as px
import pandas as pd
from sklearn.manifold import TSNE

tsne = TSNE(
    n_components=2,
    perplexity = 50
)
vectors_2D = tsne.fit_transform(vectors)

to_plot = pd.DataFrame(vectors_2D, columns=['x', 'y'])
to_plot['labels'] = selected_words

px.scatter(to_plot, x='x', y='y', hover_name='labels')

**Task**: Play around with the `perplexity` parameter. What do you observe?

Inspect other Word2Vec spaces:  https://projector.tensorflow.org/

### Analogies 

We can formulate analogies such as

`berlin` is to `germany` as `???` is to `spain`

as vector arithmetic: `berlin - germany + spain`

In [None]:
model.most_similar(positive=['berlin', 'spain'], negative=['germany'], topn=3)

[('madrid', 0.6979373693466187),
 ('spanish', 0.580416202545166),
 ('barcelona', 0.5777491927146912)]
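
The arithmetic behind such an analogy query can be sketched on hypothetical toy vectors whose three dimensions loosely stand for "is a capital", "Germany" and "Spain". The helper `analogy_toy` and all numbers are invented for illustration. gensim also length-normalises the input vectors before combining them, which this sketch omits for brevity.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors: [is a capital, Germany, Spain]
toy_vectors = {
    'germany':   [0.0, 1.0, 0.0],
    'berlin':    [1.0, 1.0, 0.0],
    'spain':     [0.0, 0.0, 1.0],
    'madrid':    [1.0, 0.0, 1.0],
    'barcelona': [0.3, 0.0, 1.0],
}

def analogy_toy(positive, negative, vectors, topn=1):
    """Add the positive vectors, subtract the negative ones, return the nearest words."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # The query words themselves are excluded from the candidates
    excluded = set(positive) | set(negative)
    scores = [
        (word, cosine_similarity(target, vec))
        for word, vec in vectors.items()
        if word not in excluded
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

result = analogy_toy(['berlin', 'spain'], ['germany'], toy_vectors, topn=2)
print(result)
```

Subtracting `germany` removes the country component from `berlin` and leaves the "capital" direction, which is then added to `spain`.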

In [None]:
model.most_similar(positive=['beer', 'poland'], negative=['germany'], topn=1)

[('vodka', 0.4887145161628723)]

In [None]:
model.most_similar(positive=['beer', 'italy'], negative=['germany'], topn=1)

[('wine', 0.5321523547172546)]

### Visualisation with PCA

In [None]:
words_to_vis = [
    'germany', 'italy', 'france', 'spain', 'uk', 'poland', 'russia', 'austria',
    'berlin', 'rome', 'paris', 'madrid', 'london', 'warsaw', 'moscow', 'vienna', 
]

We obtain all word vectors ...



In [None]:
vectors = [model.get_vector(word, norm=True) for word in words_to_vis]

... and visualize them with PCA

In [None]:
from sklearn.decomposition import PCA
import plotly.express as px
import pandas as pd

In [None]:
pca = PCA(n_components=2)
vectors_2D = pca.fit_transform(vectors)

In [None]:
to_plot = pd.DataFrame(vectors_2D, columns=['x', 'y'])
to_plot['labels'] = words_to_vis

px.scatter(to_plot, x='x', y='y', hover_name='labels')
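
PCA projects the data onto the directions of largest variance. As a rough sketch of that idea, the cell below finds the first principal component of hypothetical 2D toy points by power iteration on the covariance matrix. This is a swapped-in technique chosen because it needs no libraries; scikit-learn's `PCA` is based on a singular value decomposition instead. The helper `first_principal_component` is invented for illustration.

```python
import math

def first_principal_component(points, iterations=100):
    """Direction of maximal variance, found by power iteration on the covariance matrix."""
    n, dim = len(points), len(points[0])
    # Centre the data so that variance is measured around the mean
    means = [sum(p[k] for p in points) / n for k in range(dim)]
    centered = [[p[k] - means[k] for k in range(dim)] for p in points]
    # Sample covariance matrix (dim x dim)
    cov = [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]
    # Repeatedly multiply by the covariance matrix and renormalise
    direction = [1.0] * dim
    for _ in range(iterations):
        direction = [sum(cov[i][j] * direction[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in direction))
        direction = [x / norm for x in direction]
    # 1D coordinates of every point along the principal direction
    projections = [sum(row[k] * direction[k] for k in range(dim)) for row in centered]
    return direction, projections

# Hypothetical toy points scattered around the line y = 2x
toy_points = [(0.0, 0.0), (1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
direction, projections = first_principal_component(toy_points)
print(direction)
print(projections)
```

With `n_components=2`, as in the cells above, PCA keeps the two strongest such directions and uses the projections onto them as the `x` and `y` coordinates.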