# Reproduce Paragraph Vector - Distributed Memory

## Using Stanford's Sentiment Analysis dataset, based on Rotten Tomatoes ratings

Based on the paper:

Le, Q. V., & Mikolov, T. (2014, June). Distributed Representations of Sentences and Documents. In ICML (Vol. 14, pp. 1188-1196).

And the work described in:

https://amsterdam.luminis.eu/2017/01/30/implementing-doc2vec/

and:

https://github.com/wangz10/tensorflow-playground/blob/master/doc2vec.py

Install nltk and download the punkt tokenizer package if this is the first time you run the notebook.

In [1]:
!pip install nltk
import nltk
# opens the interactive NLTK downloader: enter d, then punkt
nltk.download()

Collecting nltk
  Downloading nltk-3.2.2.tar.gz (1.2MB)
    100% |████████████████████████████████| 1.2MB 341kB/s
Building wheels for collected packages: nltk
  Running setup.py bdist_wheel for nltk ... done
  Stored in directory: /home/jovyan/.cache/pip/wheels/42/b5/27/718985cd9719e8a44a405d264d98214c7a607fb65f3a006f28
Successfully built nltk
Installing collected packages: nltk
Successfully installed nltk-3.2.2
NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> punkt
    Downloading package punkt to /home/jovyan/nltk_data...
      Unzipping tokenizers/punkt.zip.

True

Import the common constants and functions, including the functions that build the dictionary and run the logistic regression used to test the models.

In [2]:
from reproduce_par2vec_commons import *

ImportError: No module named utils

Load the labels from the Stanford dataset. The dataset gives each phrase a positivity probability; we recover the 5 classes by mapping that probability with the following cut-offs:
[0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]
for very negative, negative, neutral, positive, and very positive, respectively.
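
The cut-offs above can be sketched as a small lookup. This is only an illustration of the mapping: how `get_labels` implements it is not shown in this notebook, and the names `sentiment_class`, `CUTOFFS` and `CLASS_NAMES` are hypothetical.

```python
from bisect import bisect_left

# Upper bounds of the first four bins; anything above 0.8 falls in the last bin.
CUTOFFS = [0.2, 0.4, 0.6, 0.8]
CLASS_NAMES = ["very negative", "negative", "neutral", "positive", "very positive"]


def sentiment_class(probability):
    """Map a positivity probability in [0, 1] to a class index 0..4."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # bisect_left puts a value equal to a cut-off into the lower bin,
    # which matches the right-closed intervals (0.2, 0.4], (0.4, 0.6], ...
    return bisect_left(CUTOFFS, probability)
```

For instance, `CLASS_NAMES[sentiment_class(0.4)]` is `"negative"`, because 0.4 is the closed upper end of the second interval.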

In [3]:
orig_labels = get_labels()


10000


Build the dictionary of the words present in the training dataset. The function also removes the top N most frequent words, where N is defined in the shared constants, and keeps only a fixed number of the remaining words, discarding the least frequent ones.

In [4]:
dictionary, vocab_size, data, doclens = build_dictionary()

10000
10433
10335
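
The real `build_dictionary` lives in `reproduce_par2vec_commons` and is not shown here. As a rough sketch of the idea described above (drop the most frequent words, keep a fixed number of the rest), one could write something like the following. The names `build_vocab`, `encode`, `skip_top_n`, `max_words` and the `UNK` token are assumptions made for the illustration, not the notebook's API.

```python
from collections import Counter


def build_vocab(tokens, skip_top_n, max_words, unk_token="UNK"):
    """Return a word -> id mapping that skips the most frequent words."""
    counts = Counter(tokens)
    # Rank by descending frequency; ties are broken alphabetically so that
    # the result is deterministic.
    ranked = sorted(counts, key=lambda word: (-counts[word], word))
    kept = ranked[skip_top_n:skip_top_n + max_words]
    vocab = {unk_token: 0}  # assumption: id 0 stands for every discarded word
    for word in kept:
        vocab[word] = len(vocab)
    return vocab


def encode(tokens, vocab, unk_token="UNK"):
    """Replace each token by its id, falling back to the unknown id."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]
```

With `skip_top_n=1` and `max_words=2`, the most frequent word and everything after the third rank both end up mapped to the unknown id.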


Compute the text window center positions (twcp) for all documents: slide the window through the text and record each center position that will be used to train the model. The positions are shuffled before they are used in training.

In [5]:
twcp = get_text_window_center_positions(data)
print(len(twcp))
np.random.shuffle(twcp)
twcp_train_gen = repeater_shuffler(twcp)
del twcp  # save some memory

53398
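
Neither `get_text_window_center_positions` nor `repeater_shuffler` is defined in this notebook, so the following is only a guess at their behaviour. It assumes that `data` is a flat list of `(doc_id, word_id)` pairs with the words of each document stored contiguously, and that a center is valid only when the whole window stays inside one document. The function names below are hypothetical.

```python
import random


def window_center_positions(data, window_size):
    """Indices whose full window lies inside a single document."""
    left = (window_size - 1) // 2
    right = window_size // 2
    centers = []
    for pos in range(left, len(data) - right):
        doc_id = data[pos][0]
        # Documents are contiguous, so checking both window ends is enough.
        if data[pos - left][0] == doc_id and data[pos + right][0] == doc_id:
            centers.append(pos)
    return centers


def repeating_shuffler(items, rng=None):
    """Yield the items forever, reshuffling them before every pass."""
    rng = rng or random
    items = list(items)
    while True:
        rng.shuffle(items)
        for item in items:
            yield item
```

An endless generator of this kind lets the batch builder call `next()` without tracking epoch boundaries.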


The Paragraph Vector model, in its Distributed Memory (DM) version, combines the embedding of the document with the embeddings of the words in the window that encloses the word to predict. In this implementation the vectors are combined by concatenation.
In the DM model we introduce embeddings for the documents and for the words, plus variables for the softmax weights and biases used to predict the center word. The goal is to predict the center word of each window from the remaining words in the window and the document embedding.
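
To make the concatenation concrete, here is a dependency-free sketch of one forward pass for a single window. It uses a full softmax for readability, whereas the notebook trains with a sampled softmax, and the function names `dm_input` and `dm_predict` are hypothetical.

```python
import math


def dm_input(context_word_ids, doc_id, word_emb, doc_emb):
    """Concatenate the context word vectors, then the document vector."""
    combined = []
    for word_id in context_word_ids:
        combined.extend(word_emb[word_id])
    combined.extend(doc_emb[doc_id])
    return combined


def dm_predict(combined, softmax_w, softmax_b):
    """Full softmax over the vocabulary for one combined input vector."""
    logits = [sum(w * x for w, x in zip(row, combined)) + b
              for row, b in zip(softmax_w, softmax_b)]
    # Subtract the maximum logit for numerical stability.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With a window of `TEXT_WINDOW_SIZE` entries (the context words plus one document id), the combined vector has `EMBEDDING_SIZE * TEXT_WINDOW_SIZE` components, which is why each row of the softmax weights has that length.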

In [6]:
def create_training_graph():
    # Input data
    dataset = tf.placeholder(tf.int32, shape=[BATCH_SIZE, TEXT_WINDOW_SIZE])
    labels = tf.placeholder(tf.int32, shape=[BATCH_SIZE, 1])
    # Variables.
    # embeddings for words, W in paper
    word_embeddings = tf.Variable(
        tf.random_uniform([vocab_size, EMBEDDING_SIZE], -1.0, 1.0))
    # embedding for documents (can be sentences or paragraph), D in paper
    doc_embeddings = tf.Variable(
        tf.random_uniform([len(doclens), EMBEDDING_SIZE], -1.0, 1.0))
    combined_embed_vector_length = EMBEDDING_SIZE * TEXT_WINDOW_SIZE
    # softmax weights, W and D vectors should be concatenated before applying softmax
    softmax_weights = tf.Variable(
        tf.truncated_normal([vocab_size, combined_embed_vector_length],
                            stddev=1.0 / np.sqrt(combined_embed_vector_length)))
    # softmax biases
    softmax_biases = tf.Variable(tf.zeros([vocab_size]))
    # Model.
    # Look up embeddings for inputs.
    # shape: (batch_size, embeddings_size)
    embed = []  # collect embedding matrices with shape=(batch_size, embedding_size)
    for j in range(TEXT_WINDOW_SIZE - 1):
        embed_w = tf.nn.embedding_lookup(word_embeddings, dataset[:, j])
        embed.append(embed_w)
    embed_d = tf.nn.embedding_lookup(doc_embeddings, dataset[:, TEXT_WINDOW_SIZE - 1])
    embed.append(embed_d)
    # concat word and doc vectors
    embed = tf.concat(embed, 1)
    # Compute the softmax loss, using a sample of the negative
    # labels each time
    loss = tf.reduce_mean(
        tf.nn.sampled_softmax_loss(
            softmax_weights, softmax_biases, labels,
            embed, NUM_SAMPLED, vocab_size))
    # Optimizer
    optimizer = tf.train.AdagradOptimizer(LEARNING_RATE).minimize(
        loss)
    # We use the cosine distance:
    norm_w = tf.sqrt(tf.reduce_sum(tf.square(word_embeddings), 1, keep_dims=True))
    normalized_word_embeddings = word_embeddings / norm_w
    norm_d = tf.sqrt(tf.reduce_sum(tf.square(doc_embeddings), 1, keep_dims=True))
    normalized_doc_embeddings = doc_embeddings / norm_d
    session = tf.Session()
    session.run(tf.global_variables_initializer())

    return optimizer, loss, dataset, labels,\
           normalized_word_embeddings, \
           normalized_doc_embeddings, \
           session, softmax_weights, softmax_biases

The loss is optimized with mini-batch gradient descent, using the Adagrad optimizer defined in the training graph. Each batch is composed of a set of text window center positions (twcp). For each twcp we build a list of the surrounding word ids and append the id of the document to it. The label to predict is the center word.

In [7]:
def generate_batch_single_twcp(twcp, i, batch, labels):
    tw_start = twcp - (TEXT_WINDOW_SIZE - 1) // 2
    tw_end = twcp + TEXT_WINDOW_SIZE // 2 + 1
    docids, wordids = zip(*data[tw_start:tw_end])

    wordids_list = list(wordids)
    twcp_index = (TEXT_WINDOW_SIZE - 1) // 2
    twcp_docid = data[twcp][0]
    twcp_wordid = data[twcp][1]
    del wordids_list[twcp_index]
    wordids_list.append(twcp_docid)

    batch[i] = wordids_list
    labels[i] = twcp_wordid


def generate_batch(twcp_gen):
    batch = np.ndarray(shape=(BATCH_SIZE, TEXT_WINDOW_SIZE), dtype=np.int32)
    labels = np.ndarray(shape=(BATCH_SIZE, 1), dtype=np.int32)
    for i in range(BATCH_SIZE):
        generate_batch_single_twcp(next(twcp_gen), i, batch, labels)
    return batch, labels
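
As a worked example of the index arithmetic in `generate_batch_single_twcp`, the same slicing is restated here as a standalone function with explicit arguments. The name `build_row` is hypothetical, and `data` is assumed to be a list of `(doc_id, word_id)` pairs, as in the cell above.

```python
def build_row(data, center, window_size):
    """Return (input row, label) for one text window center position."""
    start = center - (window_size - 1) // 2
    end = center + window_size // 2 + 1
    doc_ids, word_ids = zip(*data[start:end])
    row = list(word_ids)
    # Drop the center word: it is the label, not part of the input.
    del row[(window_size - 1) // 2]
    # The last column holds the document id, not a word id.
    row.append(data[center][0])
    return row, data[center][1]


# Document 7 contains the word ids 100..104; the center is position 2.
example = [(7, 100), (7, 101), (7, 102), (7, 103), (7, 104)]
row, label = build_row(example, 2, 5)
```

Here `row` is `[100, 101, 103, 104, 7]` and `label` is `102`: four context word ids followed by the document id, which the graph looks up in `doc_embeddings` rather than `word_embeddings`.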

In [8]:
def train(optimizer, loss, dataset, labels):
    avg_training_loss = 0
    for step in range(NUM_STEPS):
        batch_data, batch_labels = generate_batch(twcp_train_gen)
        _, l = session.run(
            [optimizer, loss],
            feed_dict={dataset: batch_data, labels: batch_labels})
        avg_training_loss += l
        if step > 0 and step % REPORT_EVERY_X_STEPS == 0:
            avg_training_loss = \
                avg_training_loss / REPORT_EVERY_X_STEPS
            # The average loss is an estimate of the loss over the
            # last REPORT_EVERY_X_STEPS batches
            print('Average loss at step {:d}: {:.1f}'.format(
                step, avg_training_loss))
            # Reset the accumulator so the next report only covers new batches
            avg_training_loss = 0

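We train the model and then read the learned document embeddings, word embeddings, and softmax weights and biases from the session. Note that the embeddings returned by `create_training_graph` are the L2-normalized ones.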

In [9]:
optimizer, loss, dataset, labels, word_embeddings, doc_embeddings, session, softmax_weights, softmax_biases = create_training_graph()
train(optimizer, loss, dataset, labels)
current_embeddings = session.run(doc_embeddings)
current_word_embeddings = session.run(word_embeddings)
current_softmax_weights = session.run(softmax_weights)
current_softmax_biases = session.run(softmax_biases)

Average loss at step 2000: 1.3
Average loss at step 4000: 0.4
Average loss at step 6000: 0.2
Average loss at step 8000: 0.1
Average loss at step 10000: 0.1


For testing we repeat the process, but this time we fix the word embeddings and the softmax weights and biases obtained in the training phase. We train the model again on each test document in order to compute its embedding.
First we compute the twcp for the test document, using the same dictionary that was used during training.
Then the test graph is built. Now the only variable is the document embedding, since the word embeddings and the softmax weights and biases were learned during training.
The dataset is prepared by extracting the words from the windows around each twcp, together with the document id, and using the center word id as the label to predict from the new embedding.

In [10]:
def test(doc, train_word_embeddings, train_softmax_weights, train_softmax_biases):
    test_data, test_twcp = build_test_twcp(doc, dictionary)
    # Input data
    combined_embed_vector_length = EMBEDDING_SIZE * TEXT_WINDOW_SIZE
    test_dataset = tf.placeholder(tf.int32, shape=[len(test_twcp), TEXT_WINDOW_SIZE])
    test_labels = tf.placeholder(tf.int32, shape=[len(test_twcp), 1])
    test_softmax_weights = tf.placeholder(tf.float32, shape=[vocab_size, combined_embed_vector_length])
    test_softmax_biases = tf.placeholder(tf.float32, shape=[vocab_size])
    test_word_embeddings = tf.placeholder(tf.float32, shape=[vocab_size, EMBEDDING_SIZE])
    # Variables.
    # embedding for documents (can be sentences or paragraph), D in paper
    test_doc_embeddings = tf.Variable(
        tf.random_uniform([1, EMBEDDING_SIZE], -1.0, 1.0))

    # Look up embeddings for inputs.
    # shape: (batch_size, embeddings_size)
    test_embed = []  # collect embedding matrices with shape=(batch_size, embedding_size)
    for j in range(TEXT_WINDOW_SIZE - 1):
        test_embed_w = tf.gather(test_word_embeddings, test_dataset[:,j])
        test_embed.append(test_embed_w)
    test_embed_d = tf.nn.embedding_lookup(test_doc_embeddings, test_dataset[:, TEXT_WINDOW_SIZE - 1])
    test_embed.append(test_embed_d)
    # concat word and doc vectors
    test_embed = tf.concat(test_embed, 1)
    # Compute the softmax loss, using a sample of the negative
    # labels each time
    test_loss = tf.reduce_mean(
        tf.nn.sampled_softmax_loss(
            test_softmax_weights, test_softmax_biases, test_labels,
            test_embed, NUM_SAMPLED, vocab_size))
    # Optimizer
    test_optimizer = tf.train.AdagradOptimizer(LEARNING_RATE).minimize(
        test_loss)
    # We use the cosine distance:
    test_norm_d = tf.sqrt(tf.reduce_sum(tf.square(test_doc_embeddings), 1, keep_dims=True))
    test_normalized_doc_embeddings = test_doc_embeddings / test_norm_d
    session = tf.Session()
    session.run(tf.global_variables_initializer())

    for step in range(NUM_STEPS):
        test_input = np.ndarray(shape=(len(test_twcp), TEXT_WINDOW_SIZE), dtype=np.int32)
        labels_values = np.ndarray(shape=(len(test_twcp), 1), dtype=np.int32)
        i = 0
        for twcp in test_twcp:
            tw_start = twcp - (TEXT_WINDOW_SIZE - 1) // 2
            tw_end = twcp + TEXT_WINDOW_SIZE // 2 + 1
            docids, wordids = zip(*test_data[tw_start:tw_end])

            wordids_list = list(wordids)
            twcp_index = (TEXT_WINDOW_SIZE - 1) // 2
            twcp_docid = test_data[twcp][0]
            twcp_wordid = test_data[twcp][1]
            del wordids_list[twcp_index]
            wordids_list.append(twcp_docid)

            test_input[i] = wordids_list
            labels_values[i] = twcp_wordid
            i += 1
        _, l = session.run(
            [test_optimizer, test_loss],
            feed_dict={test_dataset: test_input, test_labels: labels_values,
                       test_word_embeddings: train_word_embeddings,
                       test_softmax_weights: train_softmax_weights,
                       test_softmax_biases: train_softmax_biases
                       })
    current_test_embedding = session.run(test_normalized_doc_embeddings)
    session.close()  # release the resources held by this per-document session
    return current_test_embedding
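
The idea of the inference step (freeze everything except the new document vector) can be shown without TensorFlow. The sketch below is deliberately simplified and is not the notebook's method: it handles a single window, uses a full softmax instead of the sampled softmax, and applies plain gradient descent instead of Adagrad. The name `infer_doc_vector` and its parameters are hypothetical.

```python
import math
import random


def infer_doc_vector(context_vecs, label, softmax_w, softmax_b,
                     emb_size, steps=50, lr=0.1, seed=0):
    """Fit only the document vector; all other parameters stay fixed."""
    rng = random.Random(seed)
    doc_vec = [rng.uniform(-1.0, 1.0) for _ in range(emb_size)]
    fixed = [x for vec in context_vecs for x in vec]
    offset = len(fixed)  # the document vector starts here in the input
    losses = []
    for _ in range(steps):
        combined = fixed + doc_vec
        logits = [sum(w * x for w, x in zip(row, combined)) + b
                  for row, b in zip(softmax_w, softmax_b)]
        top = max(logits)
        exps = [math.exp(z - top) for z in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        losses.append(-math.log(probs[label]))
        # Cross-entropy gradient w.r.t. logit k is p_k - 1[k == label].
        err = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
        for j in range(emb_size):
            grad = sum(err[k] * softmax_w[k][offset + j]
                       for k in range(len(err)))
            doc_vec[j] -= lr * grad
    return doc_vec, losses
```

Because the gradient is only applied to `doc_vec`, the word vectors and softmax parameters passed in are left untouched, which mirrors feeding them through placeholders in the cell above.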

To validate the embeddings obtained at test time, we compute the embedding twice for the same text and then compute the cosine distance between the two results, checking that it is close to 0.

In [11]:
test_embedding_1 = test('something cringe-inducing about seeing an American football stadium nuked as pop entertainment',
                        current_word_embeddings, current_softmax_weights, current_softmax_biases)
test_embedding_2 = test('something cringe-inducing about seeing an American football stadium nuked as pop entertainment',
                        current_word_embeddings, current_softmax_weights, current_softmax_biases)
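# flatten the (1, EMBEDDING_SIZE) arrays: the cosine distance is defined on 1-D vectors
distance = spatial.distance.cosine(test_embedding_1.ravel(), test_embedding_2.ravel())
print(distance)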

0.024357977433
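
For reference, the cosine distance used in this check is one minus the cosine similarity of the two vectors. A minimal pure-Python version, written here only to make the definition explicit (the function name `cosine_distance` is ours), could look like this:

```python
import math


def cosine_distance(u, v):
    """1 - cos(angle between u and v); 0 means the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)
```

The distance is 0 for vectors pointing the same way, 1 for orthogonal vectors and 2 for opposite ones, so a small value such as the one above indicates that the two independently initialised runs ended up pointing in nearly the same direction.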


Finally, we fit a logistic regression that takes the embeddings of the phrases in the dataset as inputs and the sentiment labels computed at the beginning as targets. The accuracy should be around 0.487 (48.7%) or above, the fine-grained (5-class) accuracy reported by Le and Mikolov in the original Paragraph Vector paper.

In [12]:
test_logistic_regression(current_embeddings, orig_labels)

0.5235
