In [2]:
import os

from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

from helpers import DownloadProgressBar, pprint

glove_path = os.path.abspath('resources/glove.bin')
w2v_path = os.path.abspath('resources/w2v.bin')

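The `helpers` module is not included in this post. As a rough guide, a `pprint` that produces the `word (score)` lines shown in the outputs further down could look like the sketch below. Both `format_results` and this version of `pprint` are assumptions made for illustration, not the original helper, and the input pairs are placeholder values rather than model output.

```python
def format_results(results, decimals=3):
    """Turn (word, score) pairs into strings such as "alpha (0.712)"."""
    return [f"{word} ({score:.{decimals}f})" for word, score in results]


def pprint(results):
    """Hypothetical stand-in for helpers.pprint: print one result per line."""
    print("\n".join(format_results(results)))


# Placeholder pairs in the (word, score) shape that most_similar returns.
pprint([("alpha", 0.7118), ("beta", 0.5)])
# alpha (0.712)
# beta (0.500)
```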

In [2]:
if not os.path.exists(w2v_path):
    if not os.path.exists(glove_path):
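        # Adjacent string literals are concatenated into a single message;
        # separating them with commas would pass three arguments instead.
        raise OSError(f"No file found at {glove_path}. Either change this "
                      "path in the script configuration or download a pre-"
                      "trained model from nlp.stanford.edu/projects/glove/")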
    glove2word2vec(glove_path, w2v_path)

In [3]:
# Load the converted word2vec-format file produced (or found) by the previous cell
model = KeyedVectors.load_word2vec_format(w2v_path, binary=True)

## Why word embeddings

Human language seems intuitive to humans, but not so much to a computer. We can tell whether two words share a similar meaning, but if asked to give a quantitative reason why, we start to struggle. Our shortcoming is where computers thrive, as they are remarkably good with numbers and tabular data. Word embeddings give words this computer-friendly format, so we can reap the rewards of NLP.

## The need for Word2Vec

Early attempts, before word embeddings, relied on one-hot encoding. Essentially, each word is given a vector which distinguishes it from every other word in the vocabulary, as each vector is orthogonal to every other vector. This isn't very practical for modelling anything on the scale of a human language, as each vector needs as many dimensions as there are words in the vocabulary. But more importantly, it doesn't capture any sort of similarity between two words. As far as this method is concerned, the words "dolphin" and "neoliberal" are equally similar to "shark".

Word2Vec aims to solve this problem by providing word embeddings which take into account the relations between words. In essence, Word2Vec provides a canvas (albeit a strange multi-dimensional one) where any word in the language could lie, and effectively plots a point for each word on this canvas. The closer any two points lie on this canvas (captured mathematically by the cosine distance), the more likely we humans are to describe the represented words as "similar".
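To make the one-hot limitation concrete, here is a minimal sketch using a made-up three-word vocabulary. It shows that every pair of distinct one-hot vectors has a dot product of zero and sits at exactly the same distance, so "dolphin" ends up no closer to "shark" than "neoliberal" is.

```python
import numpy as np

# Toy vocabulary chosen for illustration; a real one has many thousands of words.
vocab = ["shark", "dolphin", "neoliberal"]
# Row i of the identity matrix is the one-hot vector for vocab[i].
one_hot = {word: row for word, row in zip(vocab, np.eye(len(vocab)))}

shark = one_hot["shark"]
for other in ("dolphin", "neoliberal"):
    dot = float(shark @ one_hot[other])
    dist = float(np.linalg.norm(shark - one_hot[other]))
    print(f"shark vs {other}: dot={dot:.1f}, distance={dist:.3f}")
# shark vs dolphin: dot=0.0, distance=1.414
# shark vs neoliberal: dot=0.0, distance=1.414
```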

In [1]:
vis_glove_path = os.path.abspath('resources/glovevis.txt')
glove_model = KeyedVectors.load_word2vec_format(vis_glove_path, binary=False)



In [None]:
import numpy as np
from sklearn.manifold import TSNE


def reduce_dimensions(model):
    """Project every word vector in `model` down to 2D with t-SNE."""
    num_dimensions = 2  # final num dimensions (2D, 3D, etc)
    print("reduce dimensions")

    # `model.vocab` is the gensim < 4.0 attribute; gensim >= 4.0 exposes
    # the same words through `model.key_to_index`.
    labels = list(model.vocab)  # keep track of words to label our data again later
    vectors = np.asarray([model[word] for word in labels])  # positions in vector space

    # reduce using t-SNE
    tsne = TSNE(n_components=num_dimensions, random_state=0)
    vectors = tsne.fit_transform(vectors)

    x_vals = vectors[:, 0]
    y_vals = vectors[:, 1]
    return x_vals, y_vals, np.asarray(labels)


In [None]:
import random

import matplotlib.pyplot as plt


def plot_with_matplotlib(x_vals, y_vals, labels):
    print("plot with matplotlib")

    random.seed(0)  # fixed seed so the same points are labelled on every run

    plt.figure(figsize=(12, 12))
    plt.scatter(x_vals, y_vals)

    # Label up to 25 randomly subsampled data points
    indices = list(range(len(labels)))
    selected_indices = random.sample(indices, min(25, len(indices)))
    for i in selected_indices:
        plt.annotate(labels[i], (x_vals[i], y_vals[i]))
    plt.show()

In [None]:
x_vals, y_vals, labels = reduce_dimensions(glove_model)
plot_function = plot_with_matplotlib
plot_function(x_vals, y_vals, labels)

## What is commonality? - deriving the word embeddings

To derive each word embedding, the Word2Vec model is usually trained using a method called skip-gram with negative sampling (SGNS). Essentially, a large corpus (billions of words) is fed to the model. A sliding window of size n captures the words that lie either side of each word in the corpus, which gives that word's context. In simplified terms, the model learns each word's embedding vector by telling its real context words apart from added noise: random words which will most likely have nothing to do with the target word, hence the negative sampling aspect. Because words with similar contexts usually have closely linked meanings, these words end up with similar embedding vectors.
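The sliding window and the noise words can be sketched in a few lines. The function below only generates training examples and does not train anything. The name `skipgram_pairs`, its parameters, and the toy sentence are all assumptions made for illustration. Negatives are drawn uniformly here for simplicity, whereas the original word2vec draws them from a smoothed unigram distribution.

```python
import random


def skipgram_pairs(tokens, window=2, num_negatives=2, seed=0):
    """Yield (target, word, label) triples: label 1 for words observed inside
    the window around the target, label 0 for randomly drawn noise words."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    for i, target in enumerate(tokens):
        start, stop = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j == i:
                continue  # a word is not its own context
            yield target, tokens[j], 1
            for _ in range(num_negatives):
                # Uniform noise for simplicity (an assumption of this sketch).
                yield target, rng.choice(vocab), 0


tokens = "the quick brown fox jumps".split()
pairs = list(skipgram_pairs(tokens, window=1, num_negatives=1))
positives = [(t, c) for t, c, label in pairs if label == 1]
print(positives[:3])
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```

A model trained on these triples is pushed to score the label-1 pairs highly and the label-0 pairs poorly, which is what pulls words with shared contexts towards each other.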


## The power of Word2Vec 

We can measure the commonality between two words through the cosine similarity of their vectors: the higher the value, the closer the two words lie in the vector space.
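For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, and cosine distance is one minus that value. The sketch below uses tiny hand-picked vectors rather than real embeddings. The helper name `cosine_similarity` is our own, although the quantity is the same kind of score that `model.similarity` reports.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between u and v, ranging from -1 to 1."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


print(cosine_similarity([1, 0], [2, 0]))   # same direction: 1.0
print(cosine_similarity([1, 0], [0, 3]))   # orthogonal: 0.0
print(cosine_similarity([1, 0], [-1, 0]))  # opposite direction: -1.0
```

Note that the length of a vector does not matter, only its direction.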

In [37]:
model.similarity('Queen', 'throne')

0.38950953

In [26]:
model.similarity('Queen', 'forklift')

-0.002614806

In [33]:
model.similarity('Queen', 'Bowie')

0.20833212

As expected, "forklift" is relatively distant from "Queen" compared to "throne". But what's fascinating is that multiple facets of "Queen" are captured. "Bowie" is also relatively close to "Queen", like "throne", but almost certainly because of its relation to the iconic rock band.

Naturally, with vectors come mathematical operations, and this is where the real power of Word2Vec starts to shine. Vector differences are the crux of **analogies**, which are best explained through examples...


## Analogies

{% note info %}
The analogies below are derived from the model trained on the [Google News dataset](https://code.google.com/archive/p/word2vec/).
{% endnote %}

### A classic: king - man + woman -> ...

In [4]:
pprint(model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# king - man + woman

queen (0.712)


This example is rather intuitive; the female version of the male title "king" is "queen". What's going on is that the vector sum of "woman" and the vector difference of "king" and "man" gives a vector which is rather similar to the vector "queen". In other words, the ("king" - "man") vector is approximately equal to the ("queen" - "woman") vector.
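The arithmetic can be reproduced on a toy scale. The sketch below uses hand-made 2D vectors (not real embeddings) and a hypothetical `analogy` function that, roughly in the spirit of `most_similar`, adds the normalised positive vectors, subtracts the normalised negative ones, and returns the remaining word whose vector is most cosine-similar to the result.

```python
import numpy as np

# Hand-made toy vectors: axis 0 loosely means "royalty", axis 1 "gender".
toy = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}


def unit(v):
    """Scale v to length 1 so that only its direction matters."""
    return v / np.linalg.norm(v)


def analogy(positive, negative, vectors):
    """Return the (word, cosine) pair closest to sum(positive) - sum(negative),
    skipping the query words themselves."""
    query = sum(unit(vectors[w]) for w in positive) - sum(unit(vectors[w]) for w in negative)
    query = unit(query)
    candidates = [w for w in vectors if w not in positive and w not in negative]
    scored = [(w, float(unit(vectors[w]) @ query)) for w in candidates]
    return max(scored, key=lambda pair: pair[1])


word, score = analogy(positive=["king", "woman"], negative=["man"], vectors=toy)
print(word)  # queen
```

Skipping the query words matters: without that filter, one of the input words themselves would often be the nearest neighbour.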

### Plurals

With a rather mundane example such as bike - bikes + gloves, it's not surprising that the model returns glove; that could be had by spotting that the pattern is simply removing the "s", which is hardly groundbreaking. But with irregular plurals, the task shifts from spotting a simple pattern to seemingly needing a human-like understanding of the structure and complexities of the English language.


In [16]:
pprint(model.most_similar(positive=['cacti', 'foot'], negative=['cactus'], topn=1))
# cacti - cactus + foot -> 

feet (0.568)


In [6]:
pprint(model.most_similar(positive=['sheep', 'child'], negative=['sheep'], topn=3))
# sheep - sheep + child ->

children (0.726)
infant (0.702)
chid (0.685)


Here, "sheep" is both the singular and the plural, so the two vectors cancel out and the resulting vector is simply "child". Because `most_similar` does not return the input words themselves, "children" comes out on top due to its similarity to "child".

### Geographical analogies

Find the city...

In [7]:
pprint(model.most_similar(positive=['Moscow', 'Portugal'], negative=['Russia'], topn=1))
# Moscow - Russia + Portugal -> 

Lisbon (0.655)


and now find the country...

In [8]:
pprint(model.most_similar(positive=['Spain', 'Delhi'], negative=['Barcelona'], topn=1))
# Spain - Barcelona + Delhi ->

India (0.626)


In [4]:
pprint(model.most_similar(positive=['Africa', 'Cambodia'], negative=['Egypt'], topn=1))
# Africa - Egypt + Cambodia ->


In [3]:
pprint(model.most_similar(positive=['Iran', 'war'],topn=3))
# Iran + war ->


An interesting example. "Iraq" and "Islamic_republic" most likely reference the Iran-Iraq war. In addition, "Iraq" and "Syria" are both war-stricken countries near Iran, which could be why they lie close to the (Iran + war) vector.

### Opposites


With a well-known opposite, the result makes sense...

In [None]:
pprint(model.most_similar(positive=['big', 'high'], negative=['small'], topn=1))
# big - small + high ->

...but a word with no obvious opposite can give rather surprising results. Ever wondered what the opposite of a "bottle" is? Look no further...

In [10]:
pprint(model.most_similar(negative=['bottle'], topn=3))
#  -bottle ->

Publish_Date (0.266)
By_CHRIS_MARCHESO (0.252)
BY_LYNN_ARDITI (0.252)


It'd make sense to view the vector representing the opposite of a word as the vector pointing in the opposite direction. But as this example shows, that usually makes little sense: it turns out humans have very little understanding of what the "opposite" generally is.
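One way to see why the result looks so arbitrary: the cosine similarity to a negated vector is just the negative of the similarity to the original, so searching near -"bottle" simply returns whatever is least similar to "bottle". The sketch below checks this on random stand-in vectors (not real embeddings). The names `cosine`, `vectors` and `query` are our own.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 8))  # five random stand-in "word" vectors
query = rng.normal(size=8)         # stand-in for the "bottle" vector

sims_to_query = np.array([cosine(query, v) for v in vectors])
sims_to_negated = np.array([cosine(-query, v) for v in vectors])

# Negating the query flips the sign of every similarity score...
assert np.allclose(sims_to_negated, -sims_to_query)
# ...so the nearest neighbour of -query is the least similar word to query.
print(np.argmax(sims_to_negated) == np.argmin(sims_to_query))  # True
```

In a real vocabulary, the least similar entries tend to be rare, unrelated tokens rather than anything a human would call an antonym.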


### A few vector sums

In [5]:
pprint(model.most_similar(positive=['death', 'water'], topn=1))



In [6]:
pprint(model.most_similar(positive=['death', 'knife'], topn=2))



In [13]:
pprint(model.most_similar(positive=['girlfriend'], negative=['love'], topn=1))


ex_girlfriend (0.517)


In [14]:
pprint(model.most_similar(positive=['colleage', 'love'], topn=2))

loved (0.580)
friend (0.551)


### Miscellaneous 

In [15]:
pprint(model.most_similar(positive=['Obama', 'Russia'], negative=['USA'], topn=3))

Medvedev (0.674)
Putin (0.647)
Kremlin (0.617)


In [17]:
pprint(model.most_similar(positive=['Hitler', 'UK'], negative=['Germany'], topn=3))

Tony_Blair (0.522)
Oliver_Cromwell (0.509)
Maggie_Thatcher (0.506)


Make of that what you will...

In [19]:
pprint(model.most_similar(positive=['Gates', 'Apple'], negative=['Microsoft'], topn=1))
# Gates - Microsoft + Apple ->                          

Steve_Jobs (0.523)


In [None]:
pprint(model.most_similar(positive=['Barack_Obama', 'Victoria_Beckham'], negative=['Michelle'], topn=1))
# Barack_Obama - Michelle + Victoria_Beckham ->

In [None]:
pprint(model.most_similar(positive=['Anfield', 'Manchester'], negative=['Liverpool'], topn=1))
# Anfield - Liverpool + Manchester ->



## Wider Applications

There are many application scenarios for Word2Vec, such as sentiment analysis and recommender systems. But aside from these e-commerce-centric use cases, word embeddings have also been put to use in scientific fields such as BioNLP. Hopefully these examples have showcased the potential power of Word2Vec.