# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>
  

# <font color="#003660">Week 2: Unsupervised NLP</font>

# <font color="#003660">Notebook 1: Explore Pre-trained Word Embeddings</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you ...</b><br><br>
        ... understand what word embeddings are, <br>
        ... are able to load and use pre-trained word embeddings for determining semantic similarity between words, <br>
        ... are able to visualize word embeddings, and <br>
        ... can answer word analogy queries with word embeddings.
    </font>
</div>
</center>
</p>

# Import packages

As always, we first need to load a number of required Python packages:
- `pandas` provides high-performance, easy-to-use data structures and data analysis tools.
- `numpy` is a library adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
- `spacy` offers industrial-strength natural language processing.
- `gensim` is a fast library for training of vector embeddings and topic models.
- `sklearn` is the de-facto standard machine learning package in Python.
- `plotly` is a library for creating interactive plots.

In [None]:
import pandas as pd
import numpy as np
import pickle
import spacy
from gensim.models import word2vec
from gensim.models import KeyedVectors
import gensim.downloader as api
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")

# What are word embeddings?

Word embeddings (e.g., word2vec, GloVe) are an alternative to representing words through one-hot encoding. In contrast to one-hot vectors, which are hard-coded, high-dimensional, and sparse, word embeddings are lower-dimensional, dense representations that are learned from data. Each word in the vocabulary is represented by a real-valued numeric vector. (Chollet, 2018)

<center><img width=512 src="https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/week_3/images/one-hot_vs_word-embeddings.png"/></center>
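
The contrast between the two representations can be sketched in a few lines of plain Python. The vocabulary and the embedding values below are made up for illustration; real embeddings are learned from data and typically have hundreds of dimensions.

```python
# Toy vocabulary; the words and all numeric values are hypothetical.
vocab = ["apple", "dog", "kitchen", "fridge"]

def one_hot(word, vocab):
    """Return a sparse 0/1 vector with one dimension per vocabulary word."""
    return [1 if w == word else 0 for w in vocab]

# Hypothetical 2-dimensional dense embeddings (hand-picked, not learned).
toy_embeddings = {
    "apple":   [0.9, 0.1],
    "dog":     [-0.7, 0.6],
    "kitchen": [0.2, -0.8],
    "fridge":  [0.3, -0.7],
}

one_hot_fridge = one_hot("fridge", vocab)
dense_fridge = toy_embeddings["fridge"]

print(one_hot_fridge)  # length grows with the size of the vocabulary
print(dense_fridge)    # length is fixed by the chosen embedding dimension
```

Note that any two distinct one-hot vectors are equally far apart, so they carry no information about which words are related, whereas dense vectors can place related words near each other.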

In addition, word embeddings capture the semantic meaning of words and map it into geometric space. Each word in the vocabulary is assigned a numeric vector such that the distance (e.g., cosine distance) between any two word vectors reflects part of the semantic relationship between the two words. For example, "apple" and "dog" are semantically quite different, so a reasonable embedding space would place their vectors far apart. But "kitchen" and "fridge" are related words, so their vectors should be close to each other. (Chollet, 2018)
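
As a minimal sketch of what "close" means here, the following computes cosine similarity by hand (cosine distance is 1 minus this value). The three-dimensional vectors are hypothetical and chosen only so that "kitchen" and "fridge" point in similar directions; the helper name `cosine_similarity` is likewise just an illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 = same direction, -1 = opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not taken from any trained model.
toy_vectors = {
    "apple":   [0.9, 0.1, 0.0],
    "dog":     [-0.6, 0.7, 0.2],
    "kitchen": [0.1, -0.8, 0.5],
    "fridge":  [0.2, -0.7, 0.6],
}

sim_related = cosine_similarity(toy_vectors["kitchen"], toy_vectors["fridge"])
sim_unrelated = cosine_similarity(toy_vectors["apple"], toy_vectors["dog"])

print(f"kitchen vs. fridge: {sim_related:.3f}")
print(f"apple vs. dog:      {sim_unrelated:.3f}")
```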

Ideally, in a good embedding space, the path (which is itself a vector) from "kitchen" to "fridge" would capture precisely the semantic relationship between these two concepts. This idea is illustrated in the following figure (adapted from Chollet, 2018).

<center><img width=768 src="https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/week_3/images/wolf_dog_tiger_cat.png"/></center>

# Download pre-trained word embeddings

Download pre-trained GloVe word vectors with Gensim's downloader (gensim-data). The vectors have 300 dimensions and were trained on 6 billion tokens from Wikipedia (2014) and Gigaword 5 (https://catalog.ldc.upenn.edu/LDC2011T07). See https://radimrehurek.com/gensim/models/keyedvectors.html#module-gensim.models.keyedvectors for the documentation of the returned `KeyedVectors` object.

In [None]:
word_vectors = api.load("glove-wiki-gigaword-300")



Save word embeddings to disk.

In [None]:
# Run this code to mount your Google Drive.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
word_vectors.save("/content/drive/MyDrive/colab_notebooks/AML4TA2022/Session_03/data/glove")

Load word embeddings from your disk.

In [None]:
word_vectors = KeyedVectors.load("/content/drive/MyDrive/colab_notebooks/AML4TA2022/Session_03/data/glove")

# Explore embeddings

Look at a single vector.

In [None]:
word_vectors["man"]

array([-0.29784  , -0.13255  , -0.14505  , -0.22752  , -0.027429 ,
        0.11005  , -0.039245 , -0.0089607, -0.18866  , -1.1213   ,
        0.34793  , -0.30056  , -0.50103  , -0.031383 , -0.032185 ,
        0.018318 , -0.090429 , -0.14427  , -0.14306  , -0.057477 ,
       -0.020931 ,  0.56276  , -0.018557 ,  0.15168  , -0.25586  ,
       -0.081564 ,  0.2803   , -0.10585  , -0.16777  ,  0.21814  ,
       -0.11845  ,  0.56475  , -0.12645  , -0.062461 , -0.68043  ,
        0.10507  ,  0.24793  , -0.20249  , -0.30726  ,  0.42815  ,
        0.38378  , -0.19371  , -0.075951 , -0.058287 , -0.067195 ,
        0.2192   ,  0.56116  , -0.28156  , -0.13705  ,  0.45754  ,
       -0.14671  , -0.18562  , -0.074146 ,  0.60737  ,  0.07952  ,
        0.41023  ,  0.18377  , -0.08532  ,  0.43795  , -0.34727  ,
        0.2077   ,  0.50454  ,  0.40244  ,  0.1095   , -0.48078  ,
       -0.22372  , -0.54619  , -0.20782  ,  0.13751  , -0.16206  ,
       -0.24835  ,  0.17124  ,  0.037355 ,  0.14547  , -0.0562

In [None]:
len(word_vectors["man"])

300

Use Gensim's built-in method `most_similar()` to retrieve the words most similar to a given word.

In [None]:
word_vectors.most_similar("man")

[('woman', 0.6998662948608398),
 ('person', 0.6443442106246948),
 ('boy', 0.620827853679657),
 ('he', 0.5926738977432251),
 ('men', 0.5819568634033203),
 ('himself', 0.5810033082962036),
 ('one', 0.5779520869255066),
 ('another', 0.5721587538719177),
 ('who', 0.5703631639480591),
 ('him', 0.5670831203460693)]

In [None]:
word_vectors.most_similar("woman")

[('girl', 0.7296419143676758),
 ('man', 0.6998662948608398),
 ('mother', 0.689943790435791),
 ('she', 0.6433226466178894),
 ('her', 0.6327143311500549),
 ('female', 0.6251603960990906),
 ('herself', 0.6215280294418335),
 ('person', 0.6170896887779236),
 ('women', 0.604761004447937),
 ('wife', 0.5986992120742798)]

Which word doesn't belong to the set?

In [None]:
word_vectors.doesnt_match(["red", "green", "blue", "sky"])

'sky'
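
One plausible way to find the odd one out, and roughly the idea behind `doesnt_match()` as far as we understand it, is to average the length-normalized vectors of all words and return the word that is least similar to that average. The sketch below uses hypothetical two-dimensional vectors; `unit` and `odd_one_out` are illustrative helpers, not Gensim functions.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def odd_one_out(words, vectors):
    """Return the word whose vector is least similar to the mean of all vectors."""
    unit_vectors = [unit(vectors[w]) for w in words]
    dim = len(unit_vectors[0])
    mean = unit([sum(v[i] for v in unit_vectors) / len(unit_vectors) for i in range(dim)])
    # For unit vectors, the dot product equals the cosine similarity.
    sims = [sum(a * b for a, b in zip(v, mean)) for v in unit_vectors]
    return words[sims.index(min(sims))]

# Hypothetical toy vectors: the three colors point in a similar direction.
toy_vectors = {
    "red":   [0.90, 0.10],
    "green": [0.80, 0.20],
    "blue":  [0.85, 0.15],
    "sky":   [0.10, 0.90],
}

outlier = odd_one_out(["red", "green", "blue", "sky"], toy_vectors)
print(outlier)
```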

Let's look at some analogies using vector arithmetic: King – Man + Woman = ?

In [None]:
word_vectors.most_similar(positive=['king', 'woman'], negative=['man'])

[('queen', 0.6713277101516724),
 ('princess', 0.5432624220848083),
 ('throne', 0.5386104583740234),
 ('monarch', 0.5347574949264526),
 ('daughter', 0.498025119304657),
 ('mother', 0.4956442713737488),
 ('elizabeth', 0.4832652509212494),
 ('kingdom', 0.47747087478637695),
 ('prince', 0.4668239951133728),
 ('wife', 0.4647327661514282)]
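
To make the arithmetic behind such a query more tangible, here is a small sketch on made-up vectors: add the normalized "positive" vectors, subtract the normalized "negative" ones, and rank the remaining words by cosine similarity to the result. The two dimensions (loosely "royalty" and "gender") and the helper `analogy` are assumptions for illustration; this is not Gensim's implementation.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def analogy(positive, negative, vectors, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # query words are never returned
    scored = [(w, dot(unit(v), query)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical vectors: first dimension ~ "royalty", second ~ "gender".
toy_vectors = {
    "king":   [1.0, 1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.0, 1.0],
    "woman":  [0.0, -1.0],
    "prince": [0.8, 0.9],
    "throne": [0.9, 0.0],
}

results = analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
for word, score in results:
    print(word, round(score, 3))
```

Excluding the query words matters: without it, the nearest neighbor of the combined vector is often one of the input words themselves.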

Berlin – Germany + France = ?

In [None]:
word_vectors.most_similar(positive=['berlin', 'france'], negative=['germany'])

[('paris', 0.7981377840042114),
 ('prohertrib', 0.6862130761146545),
 ('french', 0.6102409362792969),
 ('brussels', 0.5589067935943604),
 ('london', 0.4632590413093567),
 ('le', 0.45709025859832764),
 ('strasbourg', 0.44893765449523926),
 ('parisian', 0.44582897424697876),
 ('des', 0.44490712881088257),
 ('rome', 0.4429358243942261)]

Wolf – Dog + Cat = ?

In [None]:
word_vectors.most_similar(positive=['wolf', 'cat'], negative=['dog'])

[('leopard', 0.42748594284057617),
 ('coyote', 0.39971569180488586),
 ('wolves', 0.3622685968875885),
 ('cats', 0.3573024272918701),
 ('zebra', 0.34423384070396423),
 ('blitzer', 0.3414064943790436),
 ('owl', 0.33941614627838135),
 ('deer', 0.3378032147884369),
 ('markus', 0.33442988991737366),
 ('biermann', 0.33183789253234863)]

Now it's your turn. Explore the similarities and analogies between word embeddings...

In [None]:
# YOUR CODE GOES HERE


# Visualize embeddings

Get a list of all the words in the vocabulary.

In [None]:
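# Gensim >= 4.0 exposes the vocabulary as index_to_key; on Gensim 3.x use list(word_vectors.vocab) instead.
vocab = list(word_vectors.index_to_key)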

Retrieve the associated word embedding vectors from the model.

In [None]:
X = word_vectors[vocab]

Reduce the dimensionality of the data with PCA.

In [None]:
X_pca = PCA(n_components=2).fit_transform(X)
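
For intuition about what this step does, PCA can be sketched with NumPy alone: center the data, compute a singular value decomposition, and project onto the first two right singular vectors. The helper `pca_project` and the random toy data are assumptions for illustration; the signs of the components may differ from what scikit-learn returns.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto their first n_components principal axes."""
    X_centered = X - X.mean(axis=0)            # PCA works on centered data
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T    # coordinates along the top axes

# Hypothetical stand-in for the embedding matrix: 50 "words" with 5 dimensions.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 5))

X_toy_2d = pca_project(X_toy)
print(X_toy_2d.shape)
```

The first output column has the largest variance and the second the next largest, which is why the two columns are a reasonable choice for the x and y axes of a scatter plot.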

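Put the projected coordinates into a DataFrame, add each word's similarity to a "seed" word (0 for words outside the seed's 100 nearest neighbors), and create an interactive scatter plot with plotly.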

In [None]:
pca_df = pd.DataFrame(X_pca, index=vocab, columns=['x', 'y'])
pca_df["word"] = vocab

seed = "berlin"
pca_df["sim"] = 0.0  # float default so similarity scores can be assigned without a dtype change

for word, sim in word_vectors.most_similar(seed, topn=100):
    pca_df.loc[word, 'sim'] = sim

# filter to 100 most similar words?
# pca_df = pca_df[pca_df["sim"]>0]

fig = px.scatter(pca_df, x="x", y="y", color="sim",
                 hover_data=["word"],
                 range_x = [-6, 6], range_y = [-4, 4],
                 opacity = 0.2, color_continuous_scale='agsunset_r')
fig.show()
