# GloVe as a TensorFlow Embedding layer

In this tutorial, we'll see how to load GloVe embeddings into a TensorFlow embedding layer. The same approach should also work with embeddings generated by word2vec.

First, we'll download the embedding we need. 

Second, we'll load it into TensorFlow so that input words are converted to word features through an embedding lookup. The lookup happens inside TensorFlow, so it can run on batches on the GPU. It is also possible to run this tutorial with just a CPU. We'll play with word representations once the embedding is loaded.

What you'll need: 
- A working installation of TensorFlow.
- 4 to 6 GB of disk space to download embeddings.

## First, some theory

### Representations

We need a way to represent content in neural networks. For audio, it's possible to use a [spectrogram](https://github.com/guillaume-chevalier/filtering-stft-and-laplace-transform). For images, it's possible to directly use the pixels and then get feature maps from a convolutional neural network. For text, analyzing every letter is costly, so it's better to use word representations to embed words or documents as vectors for artificial neural networks and other machine learning algorithms.
> ![Features from content](https://www.tensorflow.org/images/audio-image-text.png)
> https://www.tensorflow.org/tutorials/word2vec

As described by Keras, an embedding:

> "Turns positive integers (indexes) into dense vectors of fixed size".

That's it: an embedding extracts features from words. An embedding is a huge matrix in which each row is a word and each column is a feature of that word. To summarize, it's possible to convert a word to a vector of a certain length, such as 25, 100, 200, 1000, and so on. In practice, a length of 100 to 300 features is acceptable. With fewer than 100, we would risk underfitting our linguistic dataset. Word embeddings can eat a lot of RAM: the 25-dimensional Twitter vectors are the lightest option supported by the code below, while the outputs shown in this tutorial come from the 300-dimensional Common Crawl vectors.
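
To make the matrix picture concrete, here is a minimal sketch in plain Python. The three-word vocabulary, the 4-dimensional values, and the helper name `embed` are all made up for illustration; they are not taken from GloVe.

```python
# Toy embedding: each row is a word, each column is a feature of that word.
# The vocabulary and the values are invented for illustration only.
toy_word_to_index = {"king": 0, "queen": 1, "apple": 2}
toy_embedding_matrix = [
    [0.8, 0.3, 0.1, 0.9],  # row 0: "king"
    [0.7, 0.9, 0.1, 0.9],  # row 1: "queen"
    [0.1, 0.2, 0.9, 0.0],  # row 2: "apple"
]


def embed(words):
    """Turn a list of words into a list of vectors by looking up rows."""
    return [toy_embedding_matrix[toy_word_to_index[w]] for w in words]


vectors = embed(["queen", "apple"])
print(len(vectors), "vectors of size", len(vectors[0]))  # 2 vectors of size 4
```

An embedding layer does the same row lookup, only on a much larger matrix and on whole batches of indexes at once.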

### You can compute word analogies

The word representations (features) are linear, so it's possible to add and subtract words with word embeddings. For example, here's the best-known word analogy:

<!--- $$\text{King} - \text{Man} = \text{Queen} - \text{Woman}$$ -->
<!--- $$\Longleftrightarrow$$ -->
<!--- $$\text{King} - \text{Man} + \text{Woman} = \text{Queen}$$ -->

<p align="center">
  <img src="https://raw.githubusercontent.com/guillaume-chevalier/GloVe-as-TensorFlow-Embedding/master/images/word_analogy.png" />
</p>

For example, such vector offsets can capture the change between:
- Masculine and feminine
- Country and capital
- Singular and plural
- Verb tenses
- And the list goes on...

> ![Word features from embeddings](https://www.tensorflow.org/images/linear-relationships.png)
> https://www.tensorflow.org/tutorials/word2vec
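
The analogy arithmetic can be sketched on a handful of hand-made vectors. Everything in this block is an assumption for illustration: the three features (royalty, gender, fruitiness), the values, and the helper names `cosine` and `analogy`. Real embeddings are learned and far noisier, so the result is rarely an exact match.

```python
import math

# Hand-made 3-feature vectors: [royalty, gender, fruitiness]. Illustration only.
toy_vectors = {
    "king":  [1.0,  1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "man":   [0.0,  1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
    "apple": [0.0,  0.0, 1.0],
}


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in
              zip(toy_vectors[a], toy_vectors[b], toy_vectors[c])]
    # The three input words are excluded from the candidates.
    candidates = [w for w in toy_vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, toy_vectors[w]))


print(analogy("king", "man", "woman"))  # queen
```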

It's also possible to compute the [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between a word A and a word B, which is the cosine of the angle between the two word vectors. A cosine similarity of -1 means the vectors point in opposite directions, while a cosine similarity of 1 means they point in the same direction. Here's the formula to compare two words:

<!--- $$\text{Cosine Similarity}=cos({\theta}_{AB})=\frac{A \cdot B}{|A|_2 |B|_2}$$ -->

<p align="center">
  <img src="https://raw.githubusercontent.com/guillaume-chevalier/GloVe-as-TensorFlow-Embedding/master/images/cosine_similarity.png" />
</p>

Here, the norm (such as |A|₂) is the **L2 norm**: the Euclidean distance from the origin to the point, generalized to a higher-dimensional space, for example with $n=300$:

<!--- $$|A|_2=\sqrt{A_1^2 + A_2^2 + A_3^2 + ... + A_n^2}$$ -->

<p align="center">
  <img src="https://raw.githubusercontent.com/guillaume-chevalier/GloVe-as-TensorFlow-Embedding/master/images/L2_norm.png" />
</p>
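
Both formulas fit in a few lines of plain Python. This is a minimal sketch on 2-dimensional vectors chosen so the arithmetic is easy to check by hand; the function names `l2_norm` and `cosine_similarity` are illustrative.

```python
import math


def l2_norm(v):
    """Square root of the sum of the squared components."""
    return math.sqrt(sum(x * x for x in v))


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (l2_norm(a) * l2_norm(b))


a = [3.0, 4.0]
print(l2_norm(a))                           # 5.0
print(cosine_similarity(a, [6.0, 8.0]))     # 1.0  (same direction)
print(cosine_similarity(a, [-3.0, -4.0]))   # -1.0 (opposite direction)
print(cosine_similarity(a, [4.0, -3.0]))    # 0.0  (perpendicular)
```

Note that an all-zero vector has a norm of 0, so this sketch would divide by zero for it.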

### What does it look like concretely?

For example, here are some cosine similarities to the word "king", computed from the code explained below: 

| Other word | Cosine similarity |
| ---------- | ----------------- |
| prince     | 0.933741          |
| queen      | 0.9202421         |
| aka        | 0.91769224        |
| lady       | 0.9163239         |
| jack       | 0.91473544        |
| 's         | 0.90668976        |
| stone      | 0.8982374         |
| mr.        | 0.89194083        |
| the        | 0.88934386        |
| star       | 0.88920873        |

Finally, notice how similar words are close in space: 
> ![](https://www.tensorflow.org/images/embedding-nearest-points.png)
> https://www.tensorflow.org/programmers_guide/embedding

Note: in the image above, the embeddings have been projected to a lower-dimensional 3D space with PCA (Principal Component Analysis) or t-SNE (t-distributed Stochastic Neighbor Embedding) to be explorable. This is possible with TensorBoard for inspection. 300 dimensions can't be visualised easily.

### But how are those word representations obtained?

Before continuing to the practical part where we'll use pretrained embeddings, it's good to know that embeddings can be obtained from unsupervised training on large text datasets. That's at least one way to put the text off the internet to use! To train an embedding, it's possible to go with the word2vec approach, or with GloVe (Global Vectors for Word Representation). GloVe is a more recent approach that builds upon the theory of word2vec. Here, we'll use GloVe embeddings. To summarize how the unsupervised training happens, let's see what John Rupert Firth has to say:

> You shall know a word by the company it keeps (Firth, J. R., 1957)

It's amazing that by comparing words and trying to guess the surrounding words, it's possible to find their meaning. To learn more about that, I'd recommend the [5th course of the Deep Learning Specialization](https://www.coursera.org/learn/nlp-sequence-models) on Coursera by Andrew Ng, a course which can lead to the Deep Learning Specialization [certificate](https://www.coursera.org/account/accomplishments/specialization/U7VNC3ZD9YD8).

## Let's get practical! 

### First, download the pretrained embeddings with the code below

Careful, the download will take 4 to 6 GB of disk space. If you have already downloaded the embeddings, they will be located under the `./embeddings/` folder relative to here, and won't be downloaded again.

Note: the Twitter zip file bundles several embeddings with different dimension sizes, but we only need one.

In [1]:
import chakin
import gzip
import json
import numpy as np
import os
import shutil
import zipfile

from collections import defaultdict

import tensorflow.compat.v1 as tf  # TF1-style graph/session API, through the TF2 compatibility module
tf.disable_v2_behavior()



Instructions for updating:
non-resource variables are not supported in the long term


In [2]:
help(chakin.search)

Help on function search in module chakin.downloader:

search(lang='')
    Search pre-trained word vectors by their language
    :param lang: str, default ''
    :return: None
        print search result as pandas DataFrame



In [3]:
chakin.search(lang='English')

# Twitter is garbage
# Look at 300-dimension Wikipedia and Google News

                   Name  Dimension                     Corpus VocabularySize  \
2          fastText(en)        300                  Wikipedia           2.5M   
11         GloVe.6B.50d         50  Wikipedia+Gigaword 5 (6B)           400K   
12        GloVe.6B.100d        100  Wikipedia+Gigaword 5 (6B)           400K   
13        GloVe.6B.200d        200  Wikipedia+Gigaword 5 (6B)           400K   
14        GloVe.6B.300d        300  Wikipedia+Gigaword 5 (6B)           400K   
15       GloVe.42B.300d        300          Common Crawl(42B)           1.9M   
16      GloVe.840B.300d        300         Common Crawl(840B)           2.2M   
17    GloVe.Twitter.25d         25               Twitter(27B)           1.2M   
18    GloVe.Twitter.50d         50               Twitter(27B)           1.2M   
19   GloVe.Twitter.100d        100               Twitter(27B)           1.2M   
20   GloVe.Twitter.200d        200               Twitter(27B)           1.2M   
21  word2vec.GoogleNews        300      

In [4]:
# ENTER THE INDEX YOU WISH TO USE
# Use 2 for the 300 dimension embeddings created using fastText on Wikipedia data
# Use 16 for the 300 dimension embeddings created using GloVe on Common Crawl data
# Use 17 for the 25 dimension embeddings created using GloVe on Twitter data
# Other embedding schemes are not supported and may require further code modification to work.
# As a side note, chakin has some rough edges:
# - The Common Crawl (index 16) embeddings cannot be downloaded using the library
#   and must be downloaded manually from https://github.com/stanfordnlp/GloVe.
# - The formats are inconsistent: the Wikipedia data (index 2) is a single file
#   compressed as .gz, while the other two embeddings described here are .zip files.
# - The 25-D Twitter embeddings (index 17) come packed together with the 50-D, 100-D, and 200-D
#   Twitter embeddings, even though chakin lets us choose between them,
#   while the Common Crawl (index 16) 300-D embeddings come by themselves.
# - The Wikipedia embedding file has a header line that the provided code cannot parse
#   and which must be removed, while the other embedding files have no such header.

CHAKIN_INDEX = 16

if CHAKIN_INDEX == 2:
    NUMBER_OF_DIMENSIONS = 300
    FILE_NAME = "cc.en.300.vec"

    DATA_FOLDER = "embeddings"
    GZ_FILE = os.path.join(DATA_FOLDER, "{}.gz".format(FILE_NAME))

    GLOVE_FILENAME = os.path.join(DATA_FOLDER, FILE_NAME)

    if not os.path.exists(GZ_FILE) and not os.path.exists(GLOVE_FILENAME):
        # GloVe by Stanford is licensed Apache 2.0: 
        #     https://github.com/stanfordnlp/GloVe/blob/master/LICENSE
        #     http://nlp.stanford.edu/data/glove.twitter.27B.zip
        #     Copyright 2014 The Board of Trustees of The Leland Stanford Junior University
        print("Downloading embeddings to '{}'".format(GZ_FILE))
        chakin.download(number=CHAKIN_INDEX, save_dir='./{}'.format(DATA_FOLDER))
    else:
        print("Embeddings already downloaded.")
        
    if not os.path.exists(GLOVE_FILENAME):
        print("Extracting embeddings to '{}'".format(GLOVE_FILENAME))

        with gzip.open(GZ_FILE, "rt", encoding="utf-8") as f_in, open(GLOVE_FILENAME, "w", encoding="utf-8") as f_out:
            chunk_size = 1024 # Read one chunk at a time, to avoid filling up RAM
            
            is_first_chunk = True
            while True:
                chunk = f_in.read(chunk_size)
                if is_first_chunk:
                    # We need to omit the first line of the embedding file, as it
                    # contains a header that the provided code cannot parse
                    chunk = chunk.split("\n", maxsplit=1)[1] # index 0 will contain "2000000 300"
                    is_first_chunk = False
                f_out.write(chunk)
                # We are done
                if (len(chunk)) <= 0:
                    break

    else:
        print("Embeddings already extracted.")
elif CHAKIN_INDEX == 16:
    NUMBER_OF_DIMENSIONS = 300
    SUBFOLDER_NAME = "glove.840B.300d"

    DATA_FOLDER = "embeddings"
    ZIP_FILE = os.path.join(DATA_FOLDER, "{}.zip".format(SUBFOLDER_NAME))
    UNZIP_FOLDER = os.path.join(DATA_FOLDER, SUBFOLDER_NAME)


    print(ZIP_FILE)

    GLOVE_FILENAME = os.path.join(UNZIP_FOLDER, "{}.txt".format(SUBFOLDER_NAME))

    if not os.path.exists(ZIP_FILE) and not os.path.exists(UNZIP_FOLDER):
        # Stop with an explicit error rather than `assert(False)`,
        # which is skipped when Python runs with the -O flag.
        raise FileNotFoundError(
            "Please download glove.840B.300d.zip from https://github.com/stanfordnlp/GloVe "
            "and put it into the embeddings folder. "
            "It is no longer available for download from the chakin library."
        )
    else:
        print("Embeddings already downloaded.")
        
    if not os.path.exists(UNZIP_FOLDER):
        with zipfile.ZipFile(ZIP_FILE,"r") as zip_ref:
            print("Extracting embeddings to '{}'".format(UNZIP_FOLDER))
            zip_ref.extractall(UNZIP_FOLDER)
    else:
        print("Embeddings already extracted.")
elif CHAKIN_INDEX == 17:
    NUMBER_OF_DIMENSIONS = 25
    SUBFOLDER_NAME = "glove.twitter.27B"

    DATA_FOLDER = "embeddings"
    ZIP_FILE = os.path.join(DATA_FOLDER, "{}.zip".format(SUBFOLDER_NAME))
    UNZIP_FOLDER = os.path.join(DATA_FOLDER, SUBFOLDER_NAME)

    GLOVE_FILENAME = os.path.join(UNZIP_FOLDER, "{}.{}d.txt".format(SUBFOLDER_NAME, NUMBER_OF_DIMENSIONS))

    if not os.path.exists(ZIP_FILE) and not os.path.exists(UNZIP_FOLDER):
        # GloVe by Stanford is licensed Apache 2.0: 
        #     https://github.com/stanfordnlp/GloVe/blob/master/LICENSE
        #     http://nlp.stanford.edu/data/glove.twitter.27B.zip
        #     Copyright 2014 The Board of Trustees of The Leland Stanford Junior University
        print("Downloading embeddings to '{}'".format(ZIP_FILE))
        chakin.download(number=CHAKIN_INDEX, save_dir='./{}'.format(DATA_FOLDER))
    else:
        print("Embeddings already downloaded.")
        
    if not os.path.exists(UNZIP_FOLDER):
        with zipfile.ZipFile(ZIP_FILE,"r") as zip_ref:
            print("Extracting embeddings to '{}'".format(UNZIP_FOLDER))
            zip_ref.extractall(UNZIP_FOLDER)
    else:
        print("Embeddings already extracted.")
else:
    print("Those embeddings are not currently supported")

embeddings\glove.840B.300d.zip
Embeddings already downloaded.
Embeddings already extracted.


### Let's read the embedding from disks here

First, we load the embeddings, then we demonstrate their usage. 

In [5]:
def load_embedding_from_disks(glove_filename, with_indexes=True):
    """
    Read a GloVe txt file. If `with_indexes=True`, we return a tuple
    `(word_to_index_dict, index_to_embedding_array)` made of a dictionary and a numpy array,
    otherwise we return only a direct `word_to_embedding_dict` dictionary
    mapping from a string to a numpy array.
    """
    if with_indexes:
        word_to_index_dict = dict()
        index_to_embedding_array = []
    else:
        word_to_embedding_dict = dict()

    
    with open(glove_filename, 'r', encoding="utf8") as glove_file:  # The file is UTF-8 encoded.
        for (i, line) in enumerate(glove_file):
            
            split = line.split(' ')
            
            word = split[0]
            
            representation = split[1:]
            representation = np.array(
                [float(val) for val in representation]
            )
            
            if with_indexes:
                word_to_index_dict[word] = i
                index_to_embedding_array.append(representation)
            else:
                word_to_embedding_dict[word] = representation

    _WORD_NOT_FOUND = [0.0] * len(representation)  # All-zero representation for unknown words.
    if with_indexes:
        _LAST_INDEX = i + 1  # One extra row, after the last word, reserved for unknown words.
        word_to_index_dict = defaultdict(lambda: _LAST_INDEX, word_to_index_dict)
        index_to_embedding_array = np.array(index_to_embedding_array + [_WORD_NOT_FOUND])
        return word_to_index_dict, index_to_embedding_array
    else:
        # Pass the loaded dictionary to defaultdict, otherwise the loaded words are discarded.
        word_to_embedding_dict = defaultdict(lambda: _WORD_NOT_FOUND, word_to_embedding_dict)
        return word_to_embedding_dict

In [6]:
print("Loading embedding from disks...")
word_to_index, index_to_embedding = load_embedding_from_disks(GLOVE_FILENAME, with_indexes=True)
print("Embedding loaded from disks.")

Loading embedding from disks...
Embedding loaded from disks.
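
The text format read by the loader is simple: each line holds a token followed by its values, separated by single spaces. The sketch below parses one such line. It splits from the right with `rsplit`, a choice that only matters under the assumption that a token may itself contain a space; the helper name `parse_glove_line` and the sample lines are hypothetical.

```python
def parse_glove_line(line, dim):
    """Split one text line into (token, list of `dim` floats)."""
    # Splitting from the right keeps the last `dim` fields as values,
    # so whatever precedes them is treated as the token.
    fields = line.rstrip("\n").rsplit(" ", dim)
    if len(fields) != dim + 1:
        raise ValueError("expected a token followed by {} values".format(dim))
    token = fields[0]
    values = [float(v) for v in fields[1:]]
    return token, values


token, values = parse_glove_line("hello 0.25 -0.5 1.0\n", dim=3)
print(token, values)  # hello [0.25, -0.5, 1.0]
```

Passing the expected dimension also turns a malformed line into an explicit error rather than a vector of the wrong length.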


### Unknown words have representations with values of zero, such as [0, 0, ..., 0]

In [7]:
vocab_size, embedding_dim = index_to_embedding.shape
print("Embedding is of shape: {}".format(index_to_embedding.shape))
print("This means (number of words, number of dimensions per word)\n")
print("The first words are words that tend to occur more often.")

print("Note: for unknown words, the representation is an empty vector,\n"
      "and the index is the last one. The dictionary has a limit:")
print("    {} --> {} --> {}".format("A word", "Index in embedding", "Representation"))
word = "worsdfkljsdf"
idx = word_to_index[word]
embd = list(np.array(index_to_embedding[idx], dtype=int))  # "int" for compact print only.
print("    {} --> {} --> {}".format(word, idx, embd))
word = "the"
idx = word_to_index[word]
embd = list(index_to_embedding[idx])  # Full float values of a known word.
print("    {} --> {} --> {}".format(word, idx, embd))

# A second unknown word maps to the same last index:
word = "aewaiuhUGRYUgreu"
idx = word_to_index[word]
embd = list(index_to_embedding[idx])  # All zeros, printed as floats this time.
print("    {} --> {} --> {}".format(word, idx, embd))

Embedding is of shape: (2196019, 300)
This means (number of words, number of dimensions per word)

The first words are words that tend to occur more often.
Note: for unknown words, the representation is an empty vector,
and the index is the last one. The dictionary has a limit:
    A word --> Index in embedding --> Representation
    worsdfkljsdf --> 2196018 --> [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
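
One detail of `defaultdict` is worth knowing here: looking up a missing key stores that key in the dictionary, so the dictionary grows each time an unknown word is queried. A minimal sketch with a made-up two-word vocabulary (the names `vocab`, `word_to_index_demo`, and `plain` are illustrative):

```python
from collections import defaultdict

vocab = {"the": 0, "cat": 1}
unknown_index = len(vocab)  # One extra row reserved for unknown words.

word_to_index_demo = defaultdict(lambda: unknown_index, vocab)
print(word_to_index_demo["dog"])  # 2
print(len(word_to_index_demo))    # 3: the lookup inserted "dog"

# dict.get returns the fallback without modifying the dictionary.
plain = dict(vocab)
print(plain.get("dog", unknown_index), len(plain))  # 2 2
```

This matters whenever the size of the dictionary is later used as the size of the vocabulary.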

In [8]:
# Look at one of the embedding vectors and the word it represents
print(index_to_embedding[0])
for word in word_to_index.keys():
    if word_to_index[word] == 0:
        print(word)

[-0.082752   0.67204   -0.14987   -0.064983   0.056491   0.40228
  0.0027747 -0.3311    -0.30691    2.0817     0.031819   0.013643
  0.30265    0.0071297 -0.5819    -0.2774    -0.062254   1.1451
 -0.24232    0.1235    -0.12243    0.33152   -0.006162  -0.30541
 -0.13057   -0.054601   0.037083  -0.070552   0.5893    -0.30385
  0.2898    -0.14653   -0.27052    0.37161    0.32031   -0.29125
  0.0052483 -0.13212   -0.052736   0.087349  -0.26668   -0.16897
  0.015162  -0.0083746 -0.14871    0.23413   -0.20719   -0.091386
  0.40075   -0.17223    0.18145    0.37586   -0.28682    0.37289
 -0.16185    0.18008    0.3032    -0.13216    0.18352    0.095759
  0.094916   0.008289   0.11761    0.34046    0.03677   -0.29077
  0.058303  -0.027814   0.082941   0.1862    -0.031494   0.27985
 -0.074412  -0.13762   -0.21866    0.18138    0.040855  -0.113
  0.24107    0.3657    -0.27525   -0.05684    0.34872    0.011884
  0.14517   -0.71395    0.48497    0.14807    0.62287    0.20599
  0.58379   -0.13438    

### The L2 norm of some words can vary
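Notice how the norm changes from one word to another: in the output below, very frequent tokens such as "the" or "." have a smaller norm than rarer words such as "bacon". Notice also that some French words are part of the vocabulary, that a token missing from the vocabulary ("aujourd'hui") gets a norm of 0, and how punctuation is handled.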

In [9]:
words = [
    "The", "Teh", "A", "It", "Its", "Bacon", "Star", "Clone", "Bonjour", "Intelligence", 
    "À", "A", "Ça", "Ca", "Été", "C'est", "Aujourd'hui", "Aujourd", "'", "hui", "?", "!", ",", ".", "-", "/", "~"
]

for word in words:
    word_ = word.lower()
    embedding = index_to_embedding[word_to_index[word_]]
    norm = str(np.linalg.norm(embedding))
    print((word + ": ").ljust(15) + norm)
print("Note: here we printed words starting with capital letters, \n"
      "however to take their embeddings we need their lowercase version (str.lower())")

The:           4.709349891895434
Teh:           6.929913774699059
A:             5.306696239670178
It:            4.940976347088982
Its:           5.6994022787014424
Bacon:         7.252612813214055
Star:          6.804042131778104
Clone:         6.326065057352606
Bonjour:       6.961348818803134
Intelligence:  7.082624039779232
À:             8.446534394969143
A:             5.306696239670178
Ça:            8.33402193877499
Ca:            5.479705953036163
Été:           7.607633619292368
C'est:         7.635576121212556
Aujourd'hui:   0.0
Aujourd:       7.124803532887244
':             6.1379290208952595
hui:           7.287160749674351
?:             5.160823253233168
!:             5.620569023196788
,:             5.094723344738617
.:             4.931635594482644
-:             5.603344705010764
/:             5.907700055660746
~:             5.99084175122729
Note: here we printed words starting with capital letters, 
however to take their embeddings we need their lowercase versio

### Let's load the embedding in TensorFlow

We simply create a non-trainable (frozen) `tf.Variable` that holds the value of the big embedding matrix.

First, let's define the variables and graph:

In [10]:
batch_size = None  # Any size is accepted

tf.reset_default_graph()

sess = tf.InteractiveSession()  # sess = tf.Session()

# Define the variable that will hold the embedding:
tf_embedding = tf.Variable(
    tf.constant(0.0, shape=index_to_embedding.shape),
    trainable=False,
    name="Embedding"
)

tf_word_ids = tf.placeholder(tf.int32, shape=[batch_size])

tf_word_representation_layer = tf.nn.embedding_lookup(
    params=tf_embedding,
    ids=tf_word_ids
)

In [11]:
tf_embedding.shape

TensorShape([Dimension(2196019), Dimension(300)])

The embedding is sent to TensorFlow below. From now on, it lives on the GPU (or on the CPU if no GPU is available):

In [12]:
tf_embedding_placeholder = tf.placeholder(tf.float32, shape=index_to_embedding.shape)
tf_embedding_init = tf_embedding.assign(tf_embedding_placeholder)
_ = sess.run(
    tf_embedding_init, 
    feed_dict={
        tf_embedding_placeholder: index_to_embedding
    }
)

print("Embedding now stored in TensorFlow. Can delete numpy array to clear some CPU RAM.")
del index_to_embedding

Embedding now stored in TensorFlow. Can delete numpy array to clear some CPU RAM.


Now we can use or fetch representations, for example:

In [13]:
batch_of_words = ["Hello", "World", "!"]
batch_indexes = [word_to_index[w.lower()] for w in batch_of_words]

embedding_from_batch_lookup = sess.run(
    tf_word_representation_layer, 
    feed_dict={
        tf_word_ids: batch_indexes
    }
)
print("Representations for {}:".format(batch_of_words))
print(embedding_from_batch_lookup)

Representations for ['Hello', 'World', '!']:
[[ 2.5233e-01  1.0176e-01 -6.7485e-01  2.1117e-01  4.3492e-01  1.6542e-01
   4.8261e-01 -8.1222e-01  4.1321e-02  7.8502e-01 -7.7857e-02 -6.6324e-01
   1.4640e-01 -2.9289e-01 -2.5488e-01  1.9293e-02 -2.0265e-01  9.8232e-01
   2.8312e-02 -8.1276e-02 -1.2140e-01  1.3126e-01 -1.7648e-01  1.3556e-01
  -1.6361e-01 -2.2574e-01  5.5006e-02 -2.0308e-01  2.0718e-01  9.5785e-02
   2.2481e-01  2.1537e-01 -3.2982e-01 -1.2241e-01 -4.0031e-01 -7.9381e-02
  -1.9958e-01 -1.5083e-02 -7.9139e-02 -1.8132e-01  2.0681e-01 -3.6196e-01
  -3.0744e-01 -2.4422e-01 -2.3113e-01  9.7980e-02  1.4630e-01 -6.2738e-02
   4.2934e-01 -7.8038e-02 -1.9627e-01  6.5093e-01 -2.2807e-01 -3.0308e-01
  -1.2483e-01 -1.7568e-01 -1.4651e-01  1.5361e-01 -2.9518e-01  1.5099e-01
  -5.1726e-01 -3.3564e-02 -2.3109e-01 -7.8330e-01  1.8029e-02 -1.5719e-01
   2.2930e-02  4.9639e-01  2.9225e-02  5.6690e-02  1.4616e-01 -1.9195e-01
   1.6244e-01  2.3898e-01  3.6431e-01  4.5263e-01  2.4560e-01  2.38

### To avoid loading the embedding twice in RAM, make TensorFlow able to load them from disks directly

In [14]:
if CHAKIN_INDEX == 2:
    prefix = FILE_NAME
elif CHAKIN_INDEX == 16:
    prefix = SUBFOLDER_NAME
elif CHAKIN_INDEX == 17:
    prefix = SUBFOLDER_NAME + "." + str(NUMBER_OF_DIMENSIONS) + "d"
TF_EMBEDDINGS_FILE_NAME = os.path.join(DATA_FOLDER, prefix + ".ckpt")
DICT_WORD_TO_INDEX_FILE_NAME = os.path.join(DATA_FOLDER, prefix + ".json")

variables_to_save = [tf_embedding]
embedding_saver = tf.train.Saver(variables_to_save)
embedding_saver.save(sess, save_path=TF_EMBEDDINGS_FILE_NAME)
print("TF embeddings saved to '{}'.".format(TF_EMBEDDINGS_FILE_NAME))
sess.close()

with open(DICT_WORD_TO_INDEX_FILE_NAME, 'w') as f:
    json.dump(word_to_index, f)
print("word_to_index dict saved to '{}'.".format(DICT_WORD_TO_INDEX_FILE_NAME))


TF embeddings saved to 'embeddings\glove.840B.300d.ckpt'.
word_to_index dict saved to 'embeddings\glove.840B.300d.json'.
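
A note on the JSON file: a `defaultdict` is written like a plain dictionary, so the default value is not stored and has to be set again after loading. A small sketch of that round trip, using a temporary file and a made-up vocabulary (the names `demo` and `restored` are illustrative):

```python
import json
import os
import tempfile
from collections import defaultdict

demo = defaultdict(lambda: 2, {"the": 0, "cat": 1})

path = os.path.join(tempfile.mkdtemp(), "word_to_index_demo.json")
with open(path, "w") as f:
    json.dump(demo, f)  # Written like a plain dict; the default is not saved.

with open(path, "r") as f:
    restored = json.load(f)  # Comes back as a plain dict.

print(type(restored).__name__)  # dict
restored = defaultdict(lambda: 2, restored)  # Set the default again.
print(restored["unseen"])  # 2
```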


In [15]:
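words_B = "like absolutely crazy not hate bag sand rock soap"
# Caution: iterating over a string yields characters, not words, so the list printed
# below holds one index per character (49 entries), each space mapping to the
# unknown-word index. Iterate over `words_B.split()` to look up whole words instead.
r = [word_to_index[w.strip()] for w in words_B]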
print(words_B)
print(r)

like absolutely crazy not hate bag sand rock soap
[4512, 108, 5364, 1939, 2196018, 6, 1504, 269, 2476, 4512, 1042, 1161, 1939, 4512, 2937, 2196018, 1825, 3228, 6, 9279, 2937, 2196018, 1479, 2476, 1161, 2196018, 4127, 6, 1161, 1939, 2196018, 1504, 6, 3577, 2196018, 269, 6, 1479, 1674, 2196018, 3228, 2476, 1825, 5364, 2196018, 269, 2476, 6, 3523]


## Build a model to get word similarities from word A to a list of many words B

This is for demo purposes. With a GPU, we can fetch many words quickly and compute on them. 

### Restarting from scratch: resetting the Jupyter notebook and loading embeddings from disks, the good way

Now that we have a TensorFlow checkpoint, let's load the embedding without having to parse the txt file into NumPy on the CPU:

In [16]:
# Magic IPython/Jupyter command that deletes all the variables of the namespace (it does not restart the kernel)
%reset

In [17]:
import json
import numpy as np
import os

from collections import defaultdict
from string import punctuation

import tensorflow.compat.v1 as tf  # TF1-style graph/session API, through the TF2 compatibility module
tf.disable_v2_behavior()

In [18]:
batch_size = None  # Any size is accepted

In [19]:
# ENTER THE INDEX YOU WISH TO USE
# Use 2 for the 300 dimension embeddings created using fastText on Wikipedia data
# Use 16 for the 300 dimension embeddings created using GloVe on Common Crawl data
# Use 17 for the 25 dimension embeddings created using GloVe on Twitter data
# Other embedding schemes are not supported and may require further code modification to work

CHAKIN_INDEX = 16

if CHAKIN_INDEX == 2:
    word_representations_dimensions = 300  # Embedding of size (vocab_len, nb_dimensions)

    DATA_FOLDER = "embeddings"
    FILE_NAME = "cc.en.300.vec" 
    TF_EMBEDDING_FILE_NAME = "{}.ckpt".format(FILE_NAME) #cc.en.300.vec.ckpt
    SUFFIX = FILE_NAME  # cc.en.300.vec
    TF_EMBEDDINGS_FILE_PATH = os.path.join(DATA_FOLDER, SUFFIX + ".ckpt")  # embeddings\cc.en.300.vec.ckpt
    DICT_WORD_TO_INDEX_FILE_NAME = os.path.join(DATA_FOLDER, SUFFIX + ".json") # embeddings\cc.en.300.vec.json
elif CHAKIN_INDEX == 16:
    word_representations_dimensions = 300  # Embedding of size (vocab_len, nb_dimensions)
    DATA_FOLDER = "embeddings"
    SUBFOLDER_NAME = "glove.840B.300d"
    TF_EMBEDDING_FILE_NAME = "{}.ckpt".format(SUBFOLDER_NAME) #glove.840B.300d.ckpt
    TF_EMBEDDINGS_FILE_PATH = os.path.join(DATA_FOLDER, TF_EMBEDDING_FILE_NAME)  # embeddings\glove.840B.300d.ckpt
    DICT_WORD_TO_INDEX_FILE_NAME = os.path.join(DATA_FOLDER, SUBFOLDER_NAME + ".json") # embeddings\glove.840B.300d.json
elif CHAKIN_INDEX == 17:
    word_representations_dimensions = 25  # Embedding of size (vocab_len, nb_dimensions)

    DATA_FOLDER = "embeddings"
    SUBFOLDER_NAME = "glove.twitter.27B"
    TF_EMBEDDING_FILE_NAME = "{}.ckpt".format(SUBFOLDER_NAME) # glove.twitter.27B.ckpt
    SUFFIX = SUBFOLDER_NAME + "." + str(word_representations_dimensions) # glove.twitter.27B.25
    TF_EMBEDDINGS_FILE_PATH = os.path.join(DATA_FOLDER, SUFFIX + "d.ckpt") # embeddings/glove.twitter.27B.25d.ckpt
    DICT_WORD_TO_INDEX_FILE_NAME = os.path.join(DATA_FOLDER, SUFFIX + "d.json")
else:
    print("Those embeddings are not currently supported")

In [20]:
def load_word_to_index(dict_word_to_index_file_name):
    """
    Load a `word_to_index` dict mapping words to their id, with a default value
    of pointing to the last index when not found, which is the unknown word.
    """
    with open(dict_word_to_index_file_name, 'r') as f:
        word_to_index = json.load(f)
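    # The unknown-word index is recovered as the largest saved index. This relies on at least
    # one unknown word having been looked up before saving, because a lookup in a defaultdict
    # stores the missing key, with the last index as its value, in the dictionary.
    _LAST_INDEX = max(word_to_index.values())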
    print("word_to_index dict restored from '{}'.".format(dict_word_to_index_file_name))
    word_to_index = defaultdict(lambda: _LAST_INDEX, word_to_index)

    return word_to_index

def load_embedding_tf(word_to_index, tf_embeddings_file_path, nb_dims):
    """
    Define the embedding tf.Variable and load it.
    """
    # The number of rows is taken from the largest index rather than from len(word_to_index).
    # The two can differ: every unknown word looked up through the defaultdict before saving
    # was stored as an extra key pointing to the last index, and a word that appears twice
    # in the embedding file only counts once as a key.
    # This is a likely reason why the original len(word_to_index) - 1 did not fit every embedding.

    # 1. Define the variable that will hold the embedding:
    tf_embedding = tf.Variable(
        tf.constant(0.0, shape=[np.array(list(word_to_index.values())).max() + 1, nb_dims]),
        trainable=False,
        name="Embedding"
    )

    # print("BEFORE")
    # print(type(tf_embedding))
    # print(tf_embedding.shape)
    # print("AFTER")

    # 2. Restore the embedding from disks to TensorFlow, GPU (or CPU if GPU unavailable):
    variables_to_restore = [tf_embedding]
    embedding_saver = tf.train.Saver(variables_to_restore)
    embedding_saver.restore(sess, save_path=tf_embeddings_file_path)
    print("TF embeddings restored from '{}'.".format(tf_embeddings_file_path))
    
    return tf_embedding
    
def cosine_similarity_tensorflow(tf_word_representation_A, tf_words_representation_B):
    """
    Returns the `cosine_similarity = cos(angle_between_a_and_b_in_space)`
    between the word A and each of the words B.
    The first input must be a Tensor holding a single word representation,
    of shape (word_representation,) or (1, word_representation).
    The second input must be a 2D Tensor of shape (batch_size, word_representation).
    The result is a tf tensor that must be fetched with `sess.run`.
    """
    a_normalized = tf.nn.l2_normalize(tf_word_representation_A, axis=-1)
    b_normalized = tf.nn.l2_normalize(tf_words_representation_B, axis=-1)
    similarity = tf.reduce_sum(
        tf.multiply(a_normalized, b_normalized), 
        axis=-1
    )
    
    return similarity


# In case you didn't do the "%reset": 
tf.reset_default_graph()
sess = tf.InteractiveSession()  # sess = tf.Session()

# Load the embedding matrix in tf
word_to_index = load_word_to_index(
    DICT_WORD_TO_INDEX_FILE_NAME)
tf_embedding = load_embedding_tf(
    word_to_index,
    TF_EMBEDDINGS_FILE_PATH, 
    word_representations_dimensions)

# Input to the graph where word IDs can be sent in batch. Look at the "shape" args:
tf_word_A_id = tf.placeholder(tf.int32, shape=[1])
tf_words_B_ids = tf.placeholder(tf.int32, shape=[batch_size])

# Conversion of words to a representation
tf_word_representation_A = tf.nn.embedding_lookup(
    params=tf_embedding, ids=tf_word_A_id)
tf_words_representation_B = tf.nn.embedding_lookup(
    params=tf_embedding, ids=tf_words_B_ids)

# The graph output are the "cosine_similarities" which we want to fetch in sess.run(...). 
cosine_similarities = cosine_similarity_tensorflow(
    tf_word_representation_A, 
    tf_words_representation_B)

print("Model created.")


word_to_index dict restored from 'embeddings\glove.840B.300d.json'.
INFO:tensorflow:Restoring parameters from embeddings\glove.840B.300d.ckpt
TF embeddings restored from 'embeddings\glove.840B.300d.ckpt'.
Model created.


Testing the fetch:

In [21]:
def sentence_to_word_ids(sentence, word_to_index):
    """
    Note: there might be a better way to split sentences for GloVe.
    Please look at the documentation or open an issue to suggest a fix.
    """
    # Separating punctuation from words:
    for punctuation_character in punctuation:
        sentence = sentence.replace(punctuation_character, " {} ".format(punctuation_character))
    # Lowercasing, then splitting on any run of whitespace (this also drops empty tokens):
    split_sentence = sentence.lower().split()
    # Converting to IDs:
    ids = [(word_to_index[w.strip()]) for w in split_sentence] 
    return ids, split_sentence

def predict_cosine_similarities(sess, word_A, words_B):
    """
    Use the model in sess to predict cosine similarities.
    """

    word_A_id, _ = sentence_to_word_ids(word_A, word_to_index)
    words_B_ids, split_sentence = sentence_to_word_ids(words_B, word_to_index)

    evaluated_cos_similarities = sess.run(
        cosine_similarities, 
        feed_dict={
            tf_word_A_id: word_A_id,
            tf_words_B_ids: words_B_ids
        }
    )
    return evaluated_cos_similarities, split_sentence


word_A = "Science"
words_B = "Hello internet, a vocano erupt like the bitcoin out of the blue and there is an unknownWord00!"

evaluated_cos_similarities, splitted = predict_cosine_similarities(sess, word_A, words_B)

print("Cosine similarities with \"{}\":".format(word_A))
for word, similarity in zip(splitted, evaluated_cos_similarities):
    print("    {}{}".format((word+":").ljust(15), similarity))

Cosine similarities with "Science":
    hello:         0.10984746366739273
    internet:      0.29033395648002625
    ,:             0.21967126429080963
    a:             0.22639356553554535
    vocano:        0.0
    erupt:         0.06329665333032608
    like:          0.31537872552871704
    the:           0.30549389123916626
    bitcoin:       -0.002926294459030032
    out:           0.2886187732219696
    of:            0.3249181807041168
    the:           0.30549389123916626
    blue:          0.1310815066099167
    and:           0.2947123050689697
    there:         0.3527158498764038
    is:            0.30085262656211853
    an:            0.252271831035614
    unknownword00: 0.0
    !:             0.16768565773963928
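
The two `0.0` scores above presumably belong to tokens that are not in the vocabulary. The sketch below walks through the same tokenization steps on a toy vocabulary. It assumes that unknown words fall back to a reserved index through a `defaultdict`; how `load_word_to_index` actually handles this is defined earlier in the notebook. The names `toy_word_to_index`, `UNKNOWN_ID`, and `tokenize` are hypothetical and used only for illustration.

```python
from collections import defaultdict
from string import punctuation

UNKNOWN_ID = 0  # hypothetical reserved index for out-of-vocabulary tokens

# Toy vocabulary; any missing key maps to UNKNOWN_ID.
toy_word_to_index = defaultdict(lambda: UNKNOWN_ID)
toy_word_to_index.update({"hello": 1, "internet": 2, ",": 3, "!": 4})


def tokenize(sentence):
    """Pad punctuation with spaces, lowercase, and split on whitespace."""
    for punctuation_character in punctuation:
        sentence = sentence.replace(
            punctuation_character, " {} ".format(punctuation_character))
    return sentence.lower().split()


tokens = tokenize("Hello internet, unknownWord00!")
ids = [toy_word_to_index[token] for token in tokens]
print(list(zip(tokens, ids)))
```

If the row of the embedding matrix at the reserved index is all zeros, every unknown token gets a cosine similarity of exactly 0.0 with any other word.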


## Getting the top k most similar words to a word with the embedding matrix

Let's take an input word and compare it to every other word in the embedding matrix to return the most similar words. 

In [22]:
# Transpose word_to_index dict:
index_to_word = dict((val, key) for key, val in word_to_index.items())

# New graph
tf.reset_default_graph()
sess = tf.InteractiveSession()

# Load the embedding matrix in tf
tf_word_to_index = load_word_to_index(
    DICT_WORD_TO_INDEX_FILE_NAME)
tf_embedding = load_embedding_tf(
    tf_word_to_index,
    TF_EMBEDDINGS_FILE_PATH, 
    word_representations_dimensions)

# An input word 
tf_word_id = tf.placeholder(tf.int32, shape=[1])
tf_word_representation = tf.nn.embedding_lookup(
    params=tf_embedding, ids=tf_word_id)

# An input: the number of similar words to fetch
tf_nb_similar_words_to_get = tf.placeholder(tf.int32)

# Cosine similarity between the input word and every row of the embedding
tf_all_cosine_similarities = cosine_similarity_tensorflow(
    tf_word_representation, 
    tf_embedding)

# Getting the top cosine similarities. 
tf_top_cosine_similarities, tf_top_word_indices = tf.nn.top_k(
    tf_all_cosine_similarities,
    k=tf_nb_similar_words_to_get+1,
    sorted=True
)

# Discard the first word because it's the input word itself:
tf_top_cosine_similarities = tf_top_cosine_similarities[1:]
tf_top_word_indices = tf_top_word_indices[1:]

# Get the top words' representations by fetching 
# tf_top_words_representation = "tf_embedding[tf_top_word_indices]":
tf_top_words_representation = tf.gather(
    tf_embedding,
    tf_top_word_indices)



word_to_index dict restored from 'embeddings\glove.840B.300d.json'.
INFO:tensorflow:Restoring parameters from embeddings\glove.840B.300d.ckpt
TF embeddings restored from 'embeddings\glove.840B.300d.ckpt'.


In [23]:
# Fetch 10 similar words:
nb_similar_words_to_get = 10

word = "king"
word_id = word_to_index[word]

top_cosine_similarities, top_word_indices, top_words_representation = sess.run(
    [tf_top_cosine_similarities, tf_top_word_indices, tf_top_words_representation],
    feed_dict={
        tf_word_id: [word_id],
        tf_nb_similar_words_to_get: nb_similar_words_to_get
    }
)

print("Top similar words to \"{}\":\n".format(word))
loop = zip(top_cosine_similarities, top_word_indices, top_words_representation)
for cos_sim, word_id, word_repr in loop:
    print(
        (index_to_word[word_id]+ ":").ljust(15),
        (str(cos_sim) + ",").ljust(15),
        np.linalg.norm(word_repr)
    )

# Note: the most-similar-words lookup worked without changes.

Top similar words to "king":

kings:          0.7876614,      7.1117706
prince:         0.73377365,     6.5258965
queen:          0.72526103,     6.82974
King:           0.71067923,     6.142591
throne:         0.67260045,     7.1247444
kingdom:        0.66040456,     6.905996
lord:           0.64396936,     6.844843
royal:          0.6168811,      6.81116
reign:          0.6128068,      6.4697604
princes:        0.59786516,     7.16161


Notice the poor quality of the similar words. Embeddings with more than 25 dimensions would give better results. 

Reminder: we chose 25 dimensions for tutorial purposes, so as not to eat all our RAM. There are better embeddings out there.

### Jeremy: 

The most similar words seem to be better for the 300-dimension embeddings created using fastText on Wikipedia.

## What's next?

I think getting the embeddings into TensorFlow is a good step toward building a language model. You may want to grab some data, such as [here](https://github.com/awesomedata/awesome-public-datasets#naturallanguage) and [here](https://github.com/niderhoff/nlp-datasets). You may also want to learn more about how recurrent neural networks can read features such as sentences or signals of varying length, for example [an LSTM (RNN) encoder reading a signal](https://github.com/guillaume-chevalier/LSTM-Human-Activity-Recognition) or a [signal predictor from a seq2seq GRU (RNN)](https://github.com/guillaume-chevalier/seq2seq-signal-prediction), which could be used in practice to [predict the next words in a sentence](https://blog.openai.com/unsupervised-sentiment-neuron/). Since signals are closely related to sentences of embedded words, RNNs can be applied to both. 


## References

The pretrained word vectors can be found here: 
- Repo https://github.com/stanfordnlp/GloVe
- Manual download: http://nlp.stanford.edu/data/glove.twitter.27B.zip

Chakin was used to download those word embeddings: 
- https://github.com/chakki-works/chakin

Some images in this notebook are references/links from the TensorFlow website: 
- https://www.tensorflow.org/

To cite my work, point to the URL of the GitHub repository: 
- https://github.com/guillaume-chevalier/GloVe-as-TensorFlow-Embedding

My code is available under the [MIT License](https://github.com/guillaume-chevalier/GloVe-as-TensorFlow-Embedding/blob/master/LICENSE). 

## Connect with me

- https://ca.linkedin.com/in/chevalierg 
- https://twitter.com/guillaume_che
- https://github.com/guillaume-chevalier/


In [24]:
# # Let's convert this notebook to a README for the GitHub project's title page:
# !ipython3 nbconvert --to markdown "GloVe-as-TensorFlow-Embedding-Tutorial.ipynb"
# !mv "GloVe-as-TensorFlow-Embedding-Tutorial.md" README.md