## Word2Vec

The TensorFlow library has made our lives easier by introducing multiple
predefined functions to be used in the implementation of word2vec
algorithms.


This notebook includes the implementation of the skip-gram
word2vec algorithm.

https://www.tensorflow.org/tutorials/word2vec

**Note:** The data used for our exercise is a compressed format of the
English Wikipedia dump made on March 3, 2006. It is available from the
following link: http://mattmahoney.net/dc/textdata.html.

In [14]:
"""Importing the required packages"""
import random
import collections
import math
import os
import zipfile
import time
import re 

import numpy as np
import tensorflow as tf

from matplotlib import pylab
%matplotlib inline

from six.moves import range
from six.moves.urllib.request import urlretrieve

from tensorflow.contrib.tensorboard.plugins import projector
%config InlineBackend.figure_format = 'retina'

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

**Downloading and extracting the dataset**

In [None]:
"""Make sure the dataset link is copied correctly"""

dataset_link = 'http://mattmahoney.net/dc/'
zip_file = 'text8.zip'

def data_download(zip_file):
    """Downloading the required file"""
    if not os.path.exists(zip_file):
        zip_file, _ = urlretrieve(dataset_link + zip_file, zip_file)
        print('File downloaded successfully!')
    return None

data_download(zip_file)

"""Extracting the dataset in separate folder"""
extracted_folder = 'dataset'

if not os.path.isdir(extracted_folder):
    with zipfile.ZipFile(zip_file) as zf:
        zf.extractall(extracted_folder)
        

**Save data into a variable**

In [10]:
with open('dataset/text8') as ft_ :
    full_text = ft_.read()

**Function to replace punctuation with tokens**

The input text contains punctuation marks and other symbols. Each is
replaced with a token that names the punctuation mark or symbol, so the
model can treat every one as a separate word and learn a vector for it.
The function text_processing() performs this operation. It takes the
Wikipedia text data as input and returns a list of tokens.

In [11]:
def text_processing(ft8_text):
    """Replacing punctuation marks with tokens"""
    ft8_text = ft8_text.lower()
    ft8_text = ft8_text.replace('.', ' <period> ')
    ft8_text = ft8_text.replace(',', ' <comma> ')
    ft8_text = ft8_text.replace('"', ' <quotation> ')
    ft8_text = ft8_text.replace(';', ' <semicolon> ')
    ft8_text = ft8_text.replace('!', ' <exclamation> ')
    ft8_text = ft8_text.replace('?', ' <question> ')
    ft8_text = ft8_text.replace('(', ' <paren_l> ')
    ft8_text = ft8_text.replace(')', ' <paren_r> ')
    ft8_text = ft8_text.replace('--', ' <hyphen> ')
    ft8_text = ft8_text.replace(':', ' <colon> ')
    ft8_text_tokens = ft8_text.split()
    
    return ft8_text_tokens

In [12]:
ft_tokens = text_processing(full_text)
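
To see what this preprocessing produces, here is a minimal sketch on a made-up sentence. The helper name `tokenize_sample` and the sample string are assumptions for illustration, and only three of the replacements above are reproduced.

```python
def tokenize_sample(text):
    """Lowercase the text, swap three punctuation marks for tokens, then split on whitespace."""
    replacements = {'.': ' <period> ', ',': ' <comma> ', '?': ' <question> '}
    text = text.lower()
    for mark, token in replacements.items():
        text = text.replace(mark, token)
    return text.split()

sample_tokens = tokenize_sample("Hello, world. Ready?")
print(sample_tokens)
# ['hello', '<comma>', 'world', '<period>', 'ready', '<question>']
```

Padding each token with spaces is what lets `split()` separate the punctuation token from the neighboring word.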

**Selecting the words with frequency higher than a threshold**

To improve the quality of the vector representations produced, it is
recommended to remove rare words, which act as noise: here, words that
occur 7 times or fewer in the input dataset, as they appear too rarely to
provide enough information about the contexts they occur in.
One can change this threshold after checking the distribution of word
counts in the dataset. For convenience, we have taken it as 7 here.

In [15]:
"""Shortlisting words with frequency more than 7"""
word_cnt = collections.Counter(ft_tokens)
shortlisted_words = [w for w in ft_tokens if word_cnt[w] > 7]
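
The same filtering step can be checked on a toy token list. This is only a sketch: the list `toy_tokens` and the threshold name `min_count` are made up for illustration, and the threshold is lowered to 1 so that the effect is visible on seven tokens.

```python
import collections

toy_tokens = ['the', 'cat', 'the', 'dog', 'the', 'cat', 'emu']
min_count = 1  # hypothetical threshold for this toy example; the notebook uses 7

toy_counts = collections.Counter(toy_tokens)
# Keep every occurrence of a word whose total count exceeds the threshold.
kept = [w for w in toy_tokens if toy_counts[w] > min_count]
print(kept)
# ['the', 'cat', 'the', 'the', 'cat']
```

Note that the filter keeps the original order and repetitions of the surviving words, which the later windowing steps rely on.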

Print the first 15 shortlisted words, in the order they appear in the
corpus, as follows:

In [16]:
print(shortlisted_words[:15])

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including']


Check the stats of the total words present in the dataset.

In [17]:
print("Total number of shortlisted words : ",len(shortlisted_words))
print("Unique number of shortlisted words : ",len(set(shortlisted_words)))

Total number of shortlisted words :  16616688
Unique number of shortlisted words :  53721


**Order words by frequency**

To process the unique words present in the corpus, we count each word's
frequency in the training dataset and sort the vocabulary by it. The
following function creates two dictionaries, one mapping words to integers
and one mapping integers back to words. The most frequent word is assigned
the lowest value, 0, and the remaining words are numbered in order of
decreasing frequency. The corpus, converted from words to integers, is
stored in a separate list, words_cnt.

In [18]:
def dict_creation(shortlisted_words):
    """The function creates a dictionary of the words present in dataset along with their frequency order"""
    counts = collections.Counter(shortlisted_words)
    vocabulary = sorted(counts, key=counts.get, reverse=True)
    rev_dictionary_ = {ii: word for ii, word in enumerate(vocabulary)}
    dictionary_ = {word: ii for ii, word in rev_dictionary_.items()}
    return dictionary_, rev_dictionary_

In [19]:
dictionary_, rev_dictionary_ = dict_creation(shortlisted_words)
words_cnt = [dictionary_[word] for word in shortlisted_words]
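
As a quick check of the ordering, the sketch below applies the same steps to a six-word toy corpus. The names `toy_words`, `word_to_id`, `id_to_word`, and `encoded` are made up for this example.

```python
import collections

toy_words = ['b', 'a', 'b', 'c', 'b', 'a']
toy_counts = collections.Counter(toy_words)

# Sort the vocabulary by descending count, so the most frequent word gets ID 0.
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)
word_to_id = {word: idx for idx, word in enumerate(toy_vocab)}
id_to_word = {idx: word for word, idx in word_to_id.items()}

# Encode the corpus as a list of integer IDs.
encoded = [word_to_id[w] for w in toy_words]
print(word_to_id)  # {'b': 0, 'a': 1, 'c': 2}
print(encoded)     # [0, 1, 0, 2, 0, 1]
```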

## Skip-Gram

Very frequent words, which carry little information about the context
around the center word, are randomly discarded (subsampled) according to
a threshold on their frequency. This results in faster training and
better word vector representations.

We have made use of the probability score function given in
the paper on skip-gram for the implementation here. For each word,
$w_i$, in the training set, we’ll discard it with the probability given by: 


$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$

where $t$ is a threshold parameter and $f(w_i)$ is the frequency of word $w_i$ in the total dataset.

In [20]:
"""Creating the threshold and performing the subsampling"""
thresh = 0.00005
word_counts = collections.Counter(words_cnt)
total_count = len(words_cnt)
freqs = {word: count / total_count for word, count in word_counts.items()}
p_drop = {word: 1 - np.sqrt(thresh/freqs[word]) for word in word_counts}
train_words = [word for word in words_cnt if p_drop[word] < random.random()]
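
To get a feel for the formula, the following sketch evaluates the discard probability for a few made-up relative frequencies, using the same threshold as above. The function name `drop_probability` is an assumption for illustration.

```python
import math

def drop_probability(frequency, threshold=0.00005):
    """P(w) = 1 - sqrt(t / f(w)); a value at or below zero means the word is never dropped."""
    return 1 - math.sqrt(threshold / frequency)

# Hypothetical relative frequencies, from very common to rare.
for freq in (0.05, 0.0005, 0.00005, 0.00001):
    print(freq, round(drop_probability(freq), 4))
```

A word that makes up 5% of the corpus is discarded about 97% of the time, while any word whose frequency is at or below the threshold gets a probability of zero or less and is always kept, since random.random() never returns a negative number.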

As the skip-gram model takes the center word and predicts the words
surrounding it, the skipG_target_set_generation() function creates the
input for the skip-gram model in the desired format:

In [21]:
def skipG_target_set_generation(batch_, batch_index, word_window): 
    """The function combines the words of given word_window size next to the index, for the SkipGram model"""
    random_num = np.random.randint(1, word_window+1)
    words_start = batch_index - random_num if (batch_index - random_num) > 0 else 0
    words_stop = batch_index + random_num
    window_target = set(batch_[words_start:batch_index] + batch_[batch_index+1:words_stop+1])
    return list(window_target)
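
The slicing logic is easier to follow with a fixed window. The sketch below is a simplified variant, not the function above: the name `context_window` is hypothetical, the radius is passed in rather than drawn at random, and the result is sorted so that the output is deterministic.

```python
def context_window(tokens, index, radius):
    """Return the distinct tokens within `radius` positions of tokens[index], excluding that position."""
    start = max(index - radius, 0)   # clamp at the start of the batch
    stop = index + radius            # slicing past the end is safe in Python
    return sorted(set(tokens[start:index] + tokens[index + 1:stop + 1]))

toy_ids = [10, 11, 12, 13, 14]
print(context_window(toy_ids, 2, 1))  # [11, 13]
print(context_window(toy_ids, 0, 2))  # [11, 12]
```

Drawing the radius at random, as the notebook does, means nearby words are sampled as targets more often than distant ones.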

The skipG_batch_creation() function makes use of the
skipG_target_set_generation() function, pairs each center word with the
words surrounding it on either side as target text, and returns the batch
output, as follows:

In [22]:
def skipG_batch_creation(short_words, batch_length, word_window):
    """The function internally makes use of the skipG_target_set_generation() function and combines each of the label 
    words in the shortlisted_words with the words of word_window size around"""
    batch_cnt = len(short_words)//batch_length
    short_words = short_words[:batch_cnt*batch_length]  
    
    for word_index in range(0, len(short_words), batch_length):
        input_words, label_words = [], []
        word_batch = short_words[word_index:word_index+batch_length]
        for index_ in range(len(word_batch)):
            batch_input = word_batch[index_]
            batch_label = skipG_target_set_generation(word_batch, index_, word_window)
            # Appending the label and inputs to the initial list. Replicating input to the size of labels in the window 
            label_words.extend(batch_label)
            input_words.extend([batch_input]*len(batch_label))
        yield input_words, label_words
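
The pairing of inputs and labels can be illustrated on three word IDs. This is a sketch under simplifying assumptions: the helper `skipgram_pairs` is hypothetical, it uses a fixed radius, and it does not deduplicate context words through a set as the function above does.

```python
def skipgram_pairs(tokens, radius):
    """Build parallel lists of center words and context words with a fixed window radius."""
    inputs, labels = [], []
    for index, center in enumerate(tokens):
        start = max(index - radius, 0)
        context = tokens[start:index] + tokens[index + 1:index + radius + 1]
        labels.extend(context)
        # Repeat the center word once per context word so both lists stay aligned.
        inputs.extend([center] * len(context))
    return inputs, labels

toy_inputs, toy_labels = skipgram_pairs([10, 11, 12], radius=1)
print(list(zip(toy_inputs, toy_labels)))
# [(10, 11), (11, 10), (11, 12), (12, 11)]
```

Because the number of context words differs from one center word to the next, the length of each batch fed to the model varies.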

The following code creates a TensorFlow graph for the skip-gram
implementation and declares the input_ and label_ placeholders. These hold
the integer IDs of the center words and of their surrounding words (the
IDs index the embedding matrix, in place of explicit one-hot vectors).
Their first dimension is left unspecified, because the batch size varies
with the combination of center and surrounding words:

In [23]:
tf_graph = tf.Graph()
with tf_graph.as_default():
    input_ = tf.placeholder(tf.int32, [None], name='input_')
    label_ = tf.placeholder(tf.int32, [None, None], name='label_')





The following code declares the embedding matrix, whose shape is the size
of the vocabulary by the dimension of the word embedding vector (300
here), and looks up the embeddings of the input words:

In [25]:
with tf_graph.as_default():
    word_embed = tf.Variable(tf.random_uniform((len(rev_dictionary_), 300), -1, 1))
    embedding = tf.nn.embedding_lookup(word_embed, input_)
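
An embedding lookup on integer word IDs selects rows of the embedding matrix, which gives the same result as multiplying a one-hot vector by that matrix. The sketch below shows this equivalence with plain Python lists. The helper names and the tiny 3 x 2 matrix are made up for illustration, and no TensorFlow is involved.

```python
def one_hot(index, size):
    """Return a list of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vector_times_matrix(vector, matrix):
    """Multiply a row vector by a matrix using plain lists."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]

toy_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 words, 2 dimensions
word_id = 2

looked_up = toy_embeddings[word_id]                                  # row selection
via_one_hot = vector_times_matrix(one_hot(word_id, 3), toy_embeddings)
print(looked_up, via_one_hot)
```

Row selection avoids building and multiplying a vocabulary-sized vector for every word, which matters with tens of thousands of words.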

The tf.train.AdamOptimizer uses Kingma and Ba's Adam algorithm (http://arxiv.org/pdf/1412.6980v8.pdf) to control the learning rate. For further reference, one can also refer to the following paper by Bengio: http://arxiv.org/pdf/1206.5533.pdf

In [24]:
"""The code includes the following  :
 # Initializing weights and bias to be used in the softmax layer
 # Loss function calculation using the Negative Sampling
 # Usage of Adam Optimizer
 # Negative sampling on 100 words, to be included in the loss function
 # 300 is the word embedding vector size
"""
vocabulary_size = len(rev_dictionary_)

with tf_graph.as_default():
    sf_weights = tf.Variable(tf.truncated_normal((vocabulary_size, 300), stddev=0.1) )
    sf_bias = tf.Variable(tf.zeros(vocabulary_size) )

    loss_fn = tf.nn.sampled_softmax_loss(weights=sf_weights, biases=sf_bias, 
                                         labels=label_, inputs=embedding, 
                                         num_sampled=100, num_classes=vocabulary_size)
    cost_fn = tf.reduce_mean(loss_fn)
    optim = tf.train.AdamOptimizer().minimize(cost_fn)
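
The idea behind a sampled loss is to normalize over the true class and a handful of sampled classes rather than over the whole vocabulary. The sketch below illustrates only that idea in plain Python. It is a simplification and not a reimplementation of tf.nn.sampled_softmax_loss: it assumes a hand-picked sample, and it omits the correction for sampling probabilities that the TensorFlow function applies. All names and numbers are made up.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax_loss_over(classes, true_class, hidden, weights, biases):
    """Cross-entropy of true_class under a softmax restricted to `classes`."""
    logits = {c: dot(weights[c], hidden) + biases[c] for c in classes}
    max_logit = max(logits.values())  # subtract the max for numerical stability
    log_norm = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits.values()))
    return log_norm - logits[true_class]

weights = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1], [0.0, 0.6]]  # 4 classes, 2 dimensions
biases = [0.0, 0.0, 0.0, 0.0]
hidden = [1.0, 2.0]   # embedding of the input word
true_class = 1

full_loss = softmax_loss_over(range(4), true_class, hidden, weights, biases)
sampled_loss = softmax_loss_over([true_class, 0, 3], true_class, hidden, weights, biases)
print(full_loss, sampled_loss)
```

With a vocabulary of tens of thousands of words and 100 sampled classes, each training step touches only a small fraction of the output weights.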



To ensure that the word vector representation is holding the semantic
similarity among words, a validation set is generated in the following
section of code. This will select a combination of common and uncommon
words across the corpus and return the words closest to them on the basis
of the cosine similarity between the word vectors.

In [26]:
"""The below code performs the following operations :
 # Performing validation on a random selection of 16 words from the dictionary
 # Selecting 8 words randomly from the 100 most frequent words (IDs 0-99) and 8 from IDs 1000-1099
 # Using the cosine similarity to compare the word vectors
"""
with tf_graph.as_default():
    validation_cnt = 16
    validation_dict = 100
    
    validation_words = np.array(random.sample(range(validation_dict), validation_cnt//2))
    validation_words = np.append(validation_words, random.sample(range(1000,1000+validation_dict), validation_cnt//2))
    validation_data = tf.constant(validation_words, dtype=tf.int32)

    normalization_embed = word_embed / (tf.sqrt(tf.reduce_sum(tf.square(word_embed), 1, keepdims=True)))
    validation_embed = tf.nn.embedding_lookup(normalization_embed, validation_data)
    word_similarity = tf.matmul(validation_embed, tf.transpose(normalization_embed))
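
The similarity computation divides each embedding by its length and then takes dot products, which is the cosine similarity. A minimal sketch of the same calculation for two vectors, in plain Python with made-up inputs and helper names:

```python
import math

def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def cosine_similarity(u, v):
    """Dot product of the two unit-length vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))

same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 1.0], [-1.0, -1.0])
print(same_direction, orthogonal, opposite)
```

The score depends only on direction, not on vector length, so it ranges from -1 (opposite) through 0 (orthogonal) to 1 (same direction).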



Create a folder model_checkpoint in the current working directory to
store the model checkpoints.

In [27]:
"""Creating the model checkpoint directory"""
!mkdir model_checkpoint

A subdirectory or file model_checkpoint already exists.
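
The message above appears because the folder was already present when the cell ran. One portable alternative to the shell command, sketched here as an option rather than as part of the original notebook, is os.makedirs with exist_ok=True, which succeeds whether or not the folder exists. The helper name `ensure_dir` is made up, and the demonstration runs in a temporary directory.

```python
import os
import tempfile

def ensure_dir(path):
    """Create the directory if it is missing; do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return os.path.isdir(path)

# Demonstrated inside a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    checkpoint_dir = os.path.join(tmp, 'model_checkpoint')
    first_call = ensure_dir(checkpoint_dir)
    second_call = ensure_dir(checkpoint_dir)  # no error on the repeat call
    print(first_call, second_call)
```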


In [28]:
epochs = 2            # Kept low so the run is quick to replicate; increase (e.g., to 100 or more) as computation resources allow
batch_length = 1000
word_window = 10

with tf_graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=tf_graph) as sess:
    iteration = 1
    loss = 0
    sess.run(tf.global_variables_initializer())

    for e in range(1, epochs+1):
        batches = skipG_batch_creation(train_words, batch_length, word_window)
        start = time.time()
        for x, y in batches:
            train_loss, _ = sess.run([cost_fn, optim], 
                                     feed_dict={input_: x, label_: np.array(y)[:, None]})
            loss += train_loss
            
            if iteration % 100 == 0: 
                end = time.time()
                print("Epoch {}/{}".format(e, epochs), ", Iteration: {}".format(iteration),
                      ", Avg. Training loss: {:.4f}".format(loss/100),", Processing : {:.4f} sec/batch".format((end-start)/100))
                loss = 0
                start = time.time()
            
            if iteration % 2000 == 0:
                similarity_ = word_similarity.eval()
                for i in range(validation_cnt):
                    validated_words = rev_dictionary_[validation_words[i]]
                    top_k = 8 # number of nearest neighbors
                    nearest = (-similarity_[i, :]).argsort()[1:top_k+1]
                    log = 'Nearest to %s:' % validated_words
                    for k in range(top_k):
                        close_word = rev_dictionary_[nearest[k]]
                        log = '%s %s,' % (log, close_word)
                    print(log)
            
            iteration += 1
    save_path = saver.save(sess, "model_checkpoint/skipGram_text8.ckpt")
    embed_mat = sess.run(normalization_embed)

Epoch 1/2 , Iteration: 100 , Avg. Training loss: 6.1708 , Processing : 0.3223 sec/batch
Epoch 1/2 , Iteration: 200 , Avg. Training loss: 6.1667 , Processing : 0.2973 sec/batch
Epoch 1/2 , Iteration: 300 , Avg. Training loss: 6.0667 , Processing : 0.3203 sec/batch
Epoch 1/2 , Iteration: 400 , Avg. Training loss: 6.0098 , Processing : 0.3148 sec/batch
Epoch 1/2 , Iteration: 500 , Avg. Training loss: 5.9529 , Processing : 0.2981 sec/batch
Epoch 1/2 , Iteration: 600 , Avg. Training loss: 5.9786 , Processing : 0.2960 sec/batch
Epoch 1/2 , Iteration: 700 , Avg. Training loss: 5.8557 , Processing : 0.2977 sec/batch
Epoch 1/2 , Iteration: 800 , Avg. Training loss: 5.7385 , Processing : 0.3006 sec/batch
Epoch 1/2 , Iteration: 900 , Avg. Training loss: 5.6670 , Processing : 0.2972 sec/batch
Epoch 1/2 , Iteration: 1000 , Avg. Training loss: 5.5703 , Processing : 0.3254 sec/batch
Epoch 1/2 , Iteration: 1100 , Avg. Training loss: 5.4528 , Processing : 0.3336 sec/batch
Epoch 1/2 , Iteration: 1200 , 

KeyboardInterrupt: 

A similar output is printed for the remaining iterations. Once training
completes, the model is saved to the model_checkpoint folder, from which
the trained network can be restored for further use.
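
Every 2,000 iterations, the training loop also prints the nearest neighbors of each validation word. It does so by sorting a row of the similarity matrix in descending order and skipping the first entry, which is normally the word itself. A minimal sketch of that ranking step in plain Python, with a hypothetical helper `nearest_neighbors` and a made-up similarity row:

```python
def nearest_neighbors(similarity_row, top_k):
    """Return the indices of the top_k highest similarities, skipping the single best match."""
    ranked = sorted(range(len(similarity_row)), key=lambda i: -similarity_row[i])
    return ranked[1:top_k + 1]

# Similarities of one query word (index 1) to a five-word vocabulary.
toy_row = [0.10, 1.00, 0.85, -0.20, 0.60]
print(nearest_neighbors(toy_row, 2))
# [2, 4]
```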

In [None]:
"""The Saver class adds ops to save and restore variables to and from checkpoints."""
with tf_graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=tf_graph) as sess:
    """Restoring the trained network"""
    saver.restore(sess, tf.train.latest_checkpoint('model_checkpoint'))
    embed_mat = sess.run(word_embed)

We have used t-distributed stochastic neighbor embedding (t-SNE)
for the purpose of visualization (https://lvdmaaten.github.io/tsne/).
The 300-dimensional vector representations of the first 250 words in the
vocabulary (the most frequent ones) are projected onto a two-dimensional
space. t-SNE aims to preserve the local structure of the vectors in the
new space, so that words that are close together in 300 dimensions stay
close after conversion.

In [None]:
word_graph = 250
tsne = TSNE()
word_embedding_tsne = tsne.fit_transform(embed_mat[:word_graph, :])

As we can observe in Figure 2-13, words with semantic similarity
have been placed closer to one another in their representation in the
two-dimensional space, thereby retaining their similarity even after the
dimensions have been further reduced.

For example, words such as year,
years, and age have been placed near one another and far from words such as international and religious.

The model can be trained for a higher
number of iterations, to achieve a better representation of the word
embeddings, and further changes can be made in the threshold values, to
fine-tune the results.

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))
for idx in range(word_graph):
    plt.scatter(*word_embedding_tsne[idx, :], color='steelblue')
    plt.annotate(rev_dictionary_[idx], (word_embedding_tsne[idx, 0], word_embedding_tsne[idx, 1]), alpha=0.6)