# Word Embeddings

## Disclaimer: Material is credited to
+ [Stanford 224N's](http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture01-wordvecs1.pdf) lecture on word embeddings
+ [Pat Coady's](https://github.com/pat-coady/word2vec) repo
+ [This](https://gist.github.com/aneesh-joshi/c8a451502958fa367d84bf038081ee4b) amazing GitHub gist

# Word Representations

## How can we turn words into objects that a computer can process?

## Words as discrete representations:
+ In traditional NLP, words are often represented as unique one-hot encoded vectors, e.g. cat = [0, 0, 0, 1], dog = [0, 0, 1, 0]
+ The length of these vectors is determined by the vocabulary size of the training data, i.e. how many unique words there are
+ The problems with this approach?
    - A lot of memory is required to represent a large corpus: we need a [V x V] matrix, where V is the vocabulary size
    - These word vectors are sparse: most of the elements in the word vector matrix are 0
    - There is no natural measure of similarity between these word vectors,
        i.e. hotel = [0, 0, 1], motel = [0, 1, 0] are orthogonal vectors with similar meaning
    - **We want to encode similarity within the word vectors**
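
To make the last few problems concrete, here is a small sketch in plain Python. The three-word vocabulary and the `one_hot` and `dot` helpers are made up for illustration; the vocabulary size of 2500 matches the cap set in this notebook's `Config`.

```python
def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# hypothetical toy vocabulary
vocab = ["cat", "motel", "hotel"]
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

# any two distinct one-hot vectors are orthogonal, so the dot product
# cannot tell "hotel"/"motel" apart from "hotel"/"cat"
print(dot(vectors["hotel"], vectors["motel"]))  # 0
print(dot(vectors["hotel"], vectors["cat"]))    # 0
print(dot(vectors["hotel"], vectors["hotel"]))  # 1

# number of entries in a one-hot matrix over V words
V = 2500
print(V * V)  # 6250000
```

Every pair of different words gets the same score of 0, which is exactly the missing notion of similarity that dense vectors are meant to supply.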
        
## Words as continuous representations
+ Distributional semantics is the notion that a word's meaning is given by the words that appear close by
+ “You shall know a word by the company it keeps” - J. R. Firth 1957
+ One of the most successful ideas of modern statistical NLP!
+ We will use a dense vector for each word, chosen so that it is similar to the vectors of words that often appear in the same contexts
+ Example: cat = [0.3, 1.5, -0.4, 1.9]
+ With continuous representations, we can naturally measure similarity by using the dot product in the vector space formed by our vectors!
+ The length of the vectors in this case is a hyperparameter, so we only need a [V x D] matrix with D < V
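
As a rough illustration of the last two bullets, the sketch below compares a few dense vectors using the dot product and its length-normalized variant, cosine similarity. Only the `cat` vector comes from the example above; the `dog` and `car` vectors are invented values, not learned embeddings.

```python
import math

# toy 4-dimensional embeddings; "dog" and "car" are invented for illustration
embeddings = {
    "cat": [0.3, 1.5, -0.4, 1.9],
    "dog": [0.4, 1.2, -0.1, 1.7],
    "car": [-1.1, 0.2, 1.3, -0.5],
}

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Dot product divided by the product of the two vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# a larger score means the vectors point in a more similar direction
print(dot(embeddings["cat"], embeddings["dog"]))
print(dot(embeddings["cat"], embeddings["car"]))
print(cosine(embeddings["cat"], embeddings["dog"]))
print(cosine(embeddings["cat"], embeddings["car"]))

# storage comparison: dense [V x D] versus one-hot [V x V]
V, D = 2500, 50
print(V * D, V * V)  # 125000 6250000
```

Cosine similarity is often preferred over the raw dot product because it ignores vector length and always falls between -1 and 1.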

In [None]:
!pip3 install -r requirements.txt

In [None]:
# import dependencies
import numpy as np
import tensorflow as tf
import data.dataloader as dataloader
import nltk

nltk.download('punkt')

# for deterministic results
tf.random.set_random_seed(1021)

assert (tf.__version__ == "1.13.1")

In [None]:
# download our data
!curl https://www.gutenberg.org/files/11/11-0.txt -o data/alice-in-wonderland.txt

# Word2Vec
+ In 2013, Tomas Mikolov and his co-authors published [two](https://arxiv.org/pdf/1310.4546.pdf) [papers](https://arxiv.org/pdf/1301.3781.pdf) that shook the NLP community
+ Using a simple neural network, he demonstrated impressive results in the structure of learned word embeddings
+ We will be implementing a naive version of the skip-gram model
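
For reference, here is the skip-gram objective as given in the first paper linked above. Given a training sequence $w_1, \dots, w_T$ and a context window of size $c$, the model maximizes the average log probability of the context words, where the probability is a softmax over the vocabulary. Here $v_w$ and $v'_w$ are the input and output vectors of word $w$, and $V$ is the vocabulary size (the paper writes $W$).

```latex
\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t),
\qquad
p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{V} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}
```

In the code later in this notebook, the rows of `W1` play the role of the input vectors and the columns of `W2` the role of the output vectors. The bias terms `b1` and `b2` are an addition that does not appear in this formula.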

In [None]:
class Config:
    """Hyperparameters for the word2vec model
    """
    # embedding dimension
    emb_dim = 50

    # Training options.
    # The training text file.
    train_data = "data/alice-in-wonderland.txt"

    # batch size for training
    batch_size = 64

    # Number of epochs to train. After this many epochs, the learning
    # rate decays linearly to zero and the training stops.
    epochs_to_train = 4

    # The number of words to predict to the left and right of the target word.
    window_size = 5

    # The minimum number of word occurrences for it to be included in the
    # vocabulary.
    min_count = 5

    # Upper limit on the size of our vocabulary
    vocab_size = 2500
    
    # train test split
    train_test_split = 0.8
    
    # interval for printing loss statistics
    print_interval = 1000


# Data Processing
+ To construct our dataset, we will simply tokenize the entire text of the book we are interested in
+ Then we will form the pairs of words within our window 
+ Finally, we'll shuffle and batch our training data and pack it into an iterator for the skip-gram model
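
The first step above is delegated to `dataloader.build_word_array`, whose source is not shown in this notebook. As a rough idea of what such a step might involve, the sketch below lowercases a toy sentence, splits it on whitespace, keeps the most frequent words, and maps everything else to an unknown token. The helper name `build_vocab_sketch`, the `<unk>` token, and whitespace splitting (instead of an `nltk` tokenizer) are all assumptions made for illustration.

```python
from collections import Counter

def build_vocab_sketch(text, vocab_size, min_count=1):
    """Hypothetical helper: map each token of `text` to an integer id.

    Id 0 is reserved for the unknown token; the remaining ids go to the
    most frequent words that occur at least `min_count` times.
    """
    tokens = text.lower().split()
    counts = Counter(tokens)
    kept = [w for w, c in counts.most_common(vocab_size - 1) if c >= min_count]
    dictionary = {"<unk>": 0}
    for word in kept:
        dictionary[word] = len(dictionary)
    # words outside the vocabulary fall back to the unknown id 0
    word_array = [dictionary.get(tok, 0) for tok in tokens]
    return word_array, dictionary

text = "the cat sat on the mat the cat ran"
word_array, dictionary = build_vocab_sketch(text, vocab_size=4)
print(dictionary)
print(word_array)
```

Capping the vocabulary keeps the embedding matrices small, at the cost of collapsing rare words into a single id.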

In [None]:
word_array, dictionary, _, _ = dataloader.build_word_array(Config.train_data, vocab_size=Config.vocab_size)
int2word = {dictionary[k]: k for k in dictionary}

In [None]:
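def build_pairs(words):
    """Builds pairs of center words and context words."""
    w = Config.window_size
    # pair each center word with every word at most w positions away
    return [(center, words[j])
            for i, center in enumerate(words)
            for j in range(max(0, i - w), min(len(words), i + w + 1)) if j != i]


def batch_and_shuffle(dataset, batch_size):
    """Returns an iterator over the shuffled and batched dataset."""
    order = np.random.permutation(len(dataset))
    for start in range(0, len(order), batch_size):
        # unzip each batch into parallel tuples of center and context ids
        yield tuple(zip(*(dataset[k] for k in order[start:start + batch_size])))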


In [None]:
dataset = build_pairs(word_array)
print(f"Vocab size: {len(dictionary)}, Dataset size: {len(dataset)}")

# TensorFlow 1.x.x

+ Numerical computation library that supports automatic differentiation -> perfect for deep learning!
+ Often difficult and annoying to use, since it behaves almost like a compiled language embedded in an interpreted language (Python)
+ The user first defines "nodes" (operations and tensors) in the TensorFlow graph
+ The user must then launch the graph in a TensorFlow session, at which point data can be fed into the nodes and the computation can execute

# TensorFlow example

In [None]:
x = tf.placeholder(tf.int32, shape=[])
y = tf.constant(9)
z = tf.constant(10)

# We never supplied a value for x, but building the graph doesn't throw an error!
s = x + y + z

# Prints the signature of the tensors and not the data
print("Outside session")
print("x: ", x)
print("y: ", y)
print("z: ", z)
print("s: ", s)


with tf.Session() as session:
    print("Inside session")
    try:
        print("x: ", session.run(x))
    except tf.errors.InvalidArgumentError:
        print("x: ", "Error! We have to feed a value into a placeholder inside a session!")
    print("y: ", session.run(y))
    print("z: ", session.run(z))
    print("s: ", session.run(s, feed_dict={x: 1}))

# The Skip-Gram Model

In [None]:
# making placeholders for x_train and y_train
x_train = tf.placeholder(tf.int32, shape=(None,))
y_label = tf.placeholder(tf.int32, shape=(None,))

# one-hot encode the input vector
x_onehot = tf.one_hot(x_train, Config.vocab_size, dtype=tf.float32)
# convert the labels to one hot vectors
y_onehot = tf.one_hot(y_label, Config.vocab_size, dtype=tf.float32)

### forward pass ###
W1 = tf.Variable(tf.random_normal([Config.vocab_size, Config.emb_dim]))
b1 = tf.Variable(tf.random_normal([Config.emb_dim]))

# compute the hidden representation
hidden = tf.matmul(x_onehot, W1) + b1  # affine transformation

W2 = tf.Variable(tf.random_normal([Config.emb_dim, Config.vocab_size]))
b2 = tf.Variable(tf.random_normal([Config.vocab_size]))

# compute the predictions
prediction = tf.nn.softmax(tf.matmul(hidden, W2) + b2)  # affine transformation, then softmax normalization
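
To see what the forward pass in this cell computes, it can help to trace it by hand on a tiny example. The plain-Python sketch below uses a made-up vocabulary of 4 words and 2 embedding dimensions with hand-picked weights (all hypothetical). It shows that multiplying a one-hot row by `W1` simply selects one row of `W1`, which is why that row can be read as the word's embedding.

```python
import math

def matvec(x, W):
    """Row vector x (length n) times matrix W (n x m) -> list of length m."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical toy parameters: V = 4 words, D = 2 dimensions
W1 = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.0, 0.5]]   # [V x D]
b1 = [0.0, 0.0]
W2 = [[0.2, -0.3, 0.1, 0.4], [0.5, 0.1, -0.2, 0.0]]       # [D x V]
b2 = [0.0, 0.0, 0.0, 0.0]

x_onehot = [0, 0, 1, 0]  # center word with id 2

# affine transformation: the one-hot input picks out row 2 of W1
hidden = [h + b for h, b in zip(matvec(x_onehot, W1), b1)]
print(hidden)  # [-0.2, 0.4]

# affine transformation followed by softmax normalization
logits = [z + b for z, b in zip(matvec(hidden, W2), b2)]
prediction = softmax(logits)
print(prediction)
```

The output is a probability distribution over the 4 words, so its entries are positive and sum to 1.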


In [None]:
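# define the loss function: mean cross-entropy between labels and predictions
# (the small constant guards against log(0))
loss_op = tf.reduce_mean(-tf.reduce_sum(y_onehot * tf.log(prediction + 1e-10), axis=1))

# define the training step:
# plain gradient descent; the learning rate of 0.1 is an assumed value, tune as needed
train_op = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(loss_op)

# start the tensorflow session
with tf.Session() as sess:

    # initialize all global variables
    sess.run(tf.global_variables_initializer())

    # run the training loop
    for epoch in range(Config.epochs_to_train):

        # train on batches
        for i, (x, y) in enumerate(batch_and_shuffle(dataset, Config.batch_size)):

            # run the tensorflow training ops
            _, loss = sess.run([train_op, loss_op], feed_dict={x_train: x, y_label: y})

            # log the progress
            if i % Config.print_interval == 0:
                print(f"Epoch: {epoch + 1}, Iteration: {i:4d}, Loss: {loss:.4f}")

    # average the two embedding matrices
    vectors = sess.run((W1 + tf.transpose(W2)) / 2)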
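
For intuition about what an optimizer does on each training pair, here is a plain-Python sketch of one possible training loop: softmax cross-entropy loss with hand-written gradients and per-pair stochastic gradient descent. It is not this notebook's TensorFlow code. The toy vocabulary of 4 ids, the pairs, the learning rate, and the epoch count are all assumptions chosen to keep the example small. At the end it averages the input matrix with the transposed output matrix, mirroring the "average the two embedding matrices" step.

```python
import math
import random

rng = random.Random(0)  # fixed seed for repeatable results
V, D = 4, 2             # toy vocabulary size and embedding dimension

W1 = [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]  # [V x D]
b1 = [0.0] * D
W2 = [[rng.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(D)]  # [D x V]
b2 = [0.0] * V

# hypothetical (center, context) id pairs
pairs = [(0, 1), (1, 0), (2, 3), (3, 2)]

def forward(center):
    """Return the hidden vector and softmax probabilities for a center id."""
    hidden = [W1[center][d] + b1[d] for d in range(D)]
    logits = [sum(hidden[d] * W2[d][v] for d in range(D)) + b2[v] for v in range(V)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return hidden, [e / total for e in exps]

def mean_loss():
    """Average cross-entropy over all pairs."""
    return sum(-math.log(forward(c)[1][o]) for c, o in pairs) / len(pairs)

def sgd_step(center, context, lr):
    """One gradient-descent update on a single (center, context) pair."""
    hidden, probs = forward(center)
    # gradient of the loss w.r.t. the logits: softmax output minus one-hot label
    dlogits = [p - (1.0 if v == context else 0.0) for v, p in enumerate(probs)]
    # backpropagate to the hidden vector before W2 is modified
    dhidden = [sum(W2[d][v] * dlogits[v] for v in range(V)) for d in range(D)]
    for d in range(D):
        for v in range(V):
            W2[d][v] -= lr * hidden[d] * dlogits[v]
    for v in range(V):
        b2[v] -= lr * dlogits[v]
    for d in range(D):
        W1[center][d] -= lr * dhidden[d]
        b1[d] -= lr * dhidden[d]

initial_loss = mean_loss()
for epoch in range(200):
    rng.shuffle(pairs)
    for center, context in pairs:
        sgd_step(center, context, lr=0.5)
final_loss = mean_loss()
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")

# average the input embeddings with the transposed output embeddings
vectors = [[(W1[v][d] + W2[d][v]) / 2 for d in range(D)] for v in range(V)]
```

In TensorFlow 1.x the gradient computation and parameter updates are generated automatically from the loss, so none of the backward-pass code needs to be written by hand.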

# Embedding Visualization
+ Once we have trained our model, we can pull the weights from the network and visualize them
+ We will use the TensorFlow [Embedding Projector](https://projector.tensorflow.org) to visualize our learned embeddings
+ We need to write the embedding vectors and our vocab into TSV files for the Embedding Projector to plot

In [None]:
def write_embeddings(vectors, filename=f"data/word2vec_{Config.emb_dim}d.tsv"):
    with open(filename, 'w') as f:
        for i, vector in enumerate(vectors):
            # convert the embedding to string
            write_str = '\t'.join([int2word[i]] + [str(v) for v in vector]) 
            if i < vectors.shape[0] - 1:
                write_str += '\n'
                                  
            f.write(write_str)
    print(f"Wrote embeddings to {filename}")
    
def write_labels(int2word, filename="data/label_metadata.tsv"):
    with open(filename, 'w') as f:
        # iterate in index order so that label row i matches embedding row i
        for i in sorted(int2word):
            write_str = int2word[i]
            if i < len(int2word) - 1:
                write_str += "\n"
            f.write(write_str)
    print(f"Wrote labels to {filename}")
                

In [None]:
write_embeddings(vectors)
write_labels(int2word)