# Embedding Exploration

This notebook explores a few simple properties of pre-trained word embeddings: nearest neighbors and analogical reasoning.

In [1]:
import numpy as np
import pandas as pd

from sklearn.neighbors import NearestNeighbors

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

Set the plotting style

In [2]:
sns.set(style="whitegrid", font_scale=1.3)
matplotlib.rcParams["legend.framealpha"] = 1
matplotlib.rcParams["legend.frameon"] = True

Fix the random seed for reproducibility

In [3]:
np.random.seed(21)

# Load embedding

The pre-trained embedding can be downloaded [here](https://www.dropbox.com/s/9pu6mt769kj8803/pretrained_embedding.tar.gz?dl=0).

In [4]:
!wget https://www.dropbox.com/s/9pu6mt769kj8803/pretrained_embedding.tar.gz
!tar -xvzf pretrained_embedding.tar.gz

--2017-10-21 14:21:14--  https://www.dropbox.com/s/9pu6mt769kj8803/pretrained_embedding.tar.gz
Resolving www.dropbox.com (www.dropbox.com)... 162.125.66.1, 2620:100:6022:1::a27d:4201
Connecting to www.dropbox.com (www.dropbox.com)|162.125.66.1|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://dl.dropboxusercontent.com/content_link/qqLtLP5EFuLZd9jJJ3B4mNo2zxKWyvN1x3dSUyiVWdvTRGa9KZAv7SUmgJC39Uey/file [following]
--2017-10-21 14:21:15--  https://dl.dropboxusercontent.com/content_link/qqLtLP5EFuLZd9jJJ3B4mNo2zxKWyvN1x3dSUyiVWdvTRGa9KZAv7SUmgJC39Uey/file
Resolving dl.dropboxusercontent.com (dl.dropboxusercontent.com)... 162.125.66.6, 2620:100:6022:6::a27d:4206
Connecting to dl.dropboxusercontent.com (dl.dropboxusercontent.com)|162.125.66.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 866689471 (827M) [application/octet-stream]
Saving to: 'pretrained_embedding.tar.gz'


2017-10-21 14:21:33 (48.0 MB/s) - 'pretrained_embedding.

In [5]:
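embeddings_index = {}
# the context manager closes the file even if parsing fails
with open("./wiki_w2v.vec") as f:
    f.readline()  # skip the first line (assumed to be the file header)
    for line in f:
        values = line.split()
        word = values[0]  # first token is the word
        coefs = np.asarray(values[1:], dtype="float32")  # the rest is its vector
        embeddings_index[word] = coefs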
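The loader above skips the first line of the file and treats every other line as a word followed by its coefficients. Here is a minimal sketch of the same parsing on a tiny in-memory sample, assuming the file follows the word2vec text format (a `<vocab_size> <dim>` header, then one word per line). The sample contents and the helper name `load_vec` are hypothetical, not part of the original notebook.

```python
import io
import numpy as np

# Hypothetical three-word sample in the assumed word2vec text format:
# a "<vocab_size> <dim>" header, then "<word> <v1> ... <vd>" per line.
sample = io.StringIO(
    "3 4\n"
    "the 0.1 0.2 0.3 0.4\n"
    "cat 0.5 0.6 0.7 0.8\n"
    "dog 0.9 1.0 1.1 1.2\n"
)

def load_vec(handle):
    """Parse an open .vec text handle into a {word: float32 vector} dict."""
    n_words, dim = map(int, handle.readline().split())
    index = {}
    for line in handle:
        values = line.split()
        index[values[0]] = np.asarray(values[1:], dtype="float32")
    # sanity checks against the header
    assert len(index) == n_words
    assert all(v.shape == (dim,) for v in index.values())
    return index

toy_index = load_vec(sample)
print(len(toy_index), toy_index["cat"].shape)
```

Checking the parsed result against the header is a cheap way to catch truncated downloads or lines with stray whitespace before building anything on top of the dictionary.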

In [6]:
EMBEDDING_DIM = len(embeddings_index["the"])
print("Embedding dimension is", EMBEDDING_DIM)

Embedding dimension is 100


Example of a word vector

In [7]:
print("\"the\"\n", embeddings_index["the"])

"the"
 [-0.139744    0.20813    -0.118162    0.061692   -0.002767    0.19353899
 -0.137596   -0.081763    0.10003    -0.023349   -0.35396001 -0.086922
 -0.10971     0.14912     0.24900199  0.090235   -0.175136    0.100408
  0.227097   -0.38284701  0.20857599  0.203307   -0.060234    0.137311
  0.100868   -0.005166   -0.06901    -0.29135701 -0.077014    0.077363
 -0.67388499  0.33003101 -0.32567599 -0.003312   -0.069184   -0.016198
 -0.20462    -0.01192     0.105787    0.039009    0.38602901  0.014063
 -0.34556401 -0.185229   -0.19212601 -0.051685   -0.055161   -0.279562
 -0.31859699 -0.17527901  0.223395   -0.40426299  0.030896   -0.27991199
 -0.018625   -0.072526    0.0099      0.31887901  0.26463899 -0.157149
 -0.41845399  0.030546    0.30279899  0.126863    0.057413   -0.278595
 -0.49134201 -0.37145299  0.37209201 -0.18141     0.34714001  0.076957
 -0.20196401 -0.257523    0.095072    0.019826   -0.176595    0.39072999
  0.165888    0.17529801 -0.37434199 -0.088488    0.097903   -0.

# Closest word (Quiz 6.1)

Let's implement a simple class that finds the words closest to a given one.

In [8]:
class Embeddings(object):
    
    def __init__(self, embedding_dict):
        """
        Implements simple routines for working with word vectors
        using a pre-trained embedding and the sklearn nearest
        neighbors module
        
        Args:
            embedding_dict(dict): words as keys and vectors as values

        Return:
            self
        """
        # initialize word and vector arrays
        self.words = np.array(list(embedding_dict.keys()))
        self.vectors = np.array(list(embedding_dict.values()))
        # fit a nearest neighbors model using cosine distance between word vectors
        self._nn = NearestNeighbors(metric="cosine", n_jobs=1, algorithm="brute")
        self._nn.fit(self.vectors)
        
    def get_vector(self, word):
        """
        Return the word-vector for given word if it is in the vocabulary
        
        Args:
            word(str): word to look up
            
        Return:
            vector(ndarray): vector for the given word, with shape (1, dim)
        """
        
        assert (word in self.words), "Word is not in the vocabulary"
        word_idx = np.where(self.words == word)[0]
        return self.vectors[word_idx]
            
    def closest(self, word, n_closest=1, include_itself=False):
        """
        Find the closest word in the vocabulary to the given data
        
        Args:
            word(str or ndarray): if str then the corresponding
                                  vector will be retrieved first;
                                  ndarray will be treated as a
                                  word-vector
            n_closest(int): number of closest words to return
            include_itself(bool): whether to include the vector
                                  itself if it is in the vocabulary
                                  
        Return:
            words(ndarray of str): array with closest words
        """
        # a string is looked up in the vocabulary; anything else is treated as a vector
        if isinstance(word, str):
            word_vector = self.get_vector(word)
        else:
            word_vector = word
        # when excluding the query itself, fetch one extra neighbor and drop the first
        if include_itself:
            closest_idxs = self._nn.kneighbors(word_vector, n_neighbors=n_closest)
            return self.words[closest_idxs[1][0]]
        else:
            closest_idxs = self._nn.kneighbors(word_vector, n_neighbors=n_closest + 1)
            return self.words[closest_idxs[1][0][1:]]

Create an instance of the class

In [9]:
w2v = Embeddings(embeddings_index)

Find the closest word (*Quiz 6.1*)

In [10]:
w2v.closest("assuage")

array(['allay'],
      dtype='<U110')
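To make the lookup less of a black box, here is a minimal NumPy-only sketch of a brute-force search by cosine distance, where cosine distance is taken as one minus cosine similarity. The toy words, their vectors, and the helper `closest_cosine` are made up for illustration and are not taken from the pre-trained embedding.

```python
import numpy as np

# Hypothetical toy vocabulary; the vectors are made up for illustration.
words = np.array(["calm", "soothe", "anger", "table"])
vectors = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.1],
    [-1.0, -0.8, 0.0],
    [0.0, 0.1, 1.0],
])

def closest_cosine(query, vectors, words, n_closest=1):
    """Return the n_closest words by cosine distance to `query`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    distances = 1.0 - unit @ q      # cosine distance = 1 - cosine similarity
    order = np.argsort(distances)   # ascending: nearest first
    return words[order[:n_closest]]

# Query with the vector for "calm": the word itself comes first
# (distance close to 0), followed by its nearest neighbor.
print(closest_cosine(vectors[0], vectors, words, n_closest=2))
```

The query word ranks first here because its distance to itself is essentially zero, which is why the class above requests one extra neighbor and drops the first when `include_itself` is `False`.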

# Analogical reasoning (Quiz 6.2)

Let's try an analogical reasoning task (for the definition, see the lectures or the [original paper](https://arxiv.org/pdf/1301.3781.pdf)).

Here we want to explore semantic relations, e.g., **Country-Capital** (*Quiz 6.2*)

The question is: which word relates to Moscow the way Ireland relates to Dublin? To answer it, recall that the embedding space is approximately linear, i.e., a relation such as country-capital corresponds to a roughly constant vector offset. So let's take the vector for each word and perform simple arithmetic on them, i.e.,

$$
w[\text{ireland}] - w[\text{dublin}] + w[\text{moscow}] = X
$$

and then find the word whose vector in our embedding space is closest to the resulting vector $X$. Hopefully it will be Russia.

In [11]:
X = w2v.get_vector("ireland") - w2v.get_vector("dublin") + w2v.get_vector("moscow")

In [12]:
w2v.closest(X, include_itself=True, n_closest=3)

array(['russia', 'ukraine', 'belarus'],
      dtype='<U110')
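The same arithmetic can be checked on hand-built toy vectors, where the country-capital offset is exactly linear by construction. This is a sketch under that assumption, not a claim about the real embedding: the vectors and the helper `analogy` are hypothetical. Unlike the call above, the helper skips the three query words when ranking, which is a common precaution because the result of the arithmetic often stays close to one of its inputs.

```python
import numpy as np

# Hand-built toy vectors (an assumption for illustration): the first three
# coordinates encode the nation, the last one marks "is a country".
toy = {
    "ireland": [1, 0, 0, 1], "dublin": [1, 0, 0, 0],
    "russia":  [0, 1, 0, 1], "moscow": [0, 1, 0, 0],
    "france":  [0, 0, 1, 1], "paris":  [0, 0, 1, 0],
}
toy_words = list(toy)
toy_vectors = np.array(list(toy.values()), dtype="float32")

def analogy(a, b, c, n_closest=1):
    """Rank words by cosine similarity to w[a] - w[b] + w[c], skipping a, b, c."""
    w = {word: np.asarray(vec, dtype="float32") for word, vec in toy.items()}
    x = w[a] - w[b] + w[c]
    sims = toy_vectors @ x / (np.linalg.norm(toy_vectors, axis=1) * np.linalg.norm(x))
    ranked = [toy_words[i] for i in np.argsort(-sims) if toy_words[i] not in (a, b, c)]
    return ranked[:n_closest]

print(analogy("ireland", "dublin", "moscow"))
```

In real embeddings the offset is only approximately constant, so the top neighbor is not guaranteed to be the expected word; inspecting several neighbors, as done above with `n_closest=3`, gives a better picture.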