In [1]:
import tensorflow as tf

### Chapter 1: Word Embeddings

In this section, we'll use word embeddings to represent words as numeric vectors.

#### What is NLP?

Natural language processing (NLP) encompasses any task in which machines deal with spoken or written language.

#### Using text data

Before raw text can be used as input to NLP algorithms, a machine learning engineer must first convert it into a numeric form that a machine can process.

### Vocabulary

#### Corpus vocabulary

A **text corpus** is a set of texts used for a task. 

The **vocabulary** is the set of unique words in the text corpus.

You can use either a **character-based vocabulary** (the set of unique characters in the text corpus) or a **word-based vocabulary** (the set of unique words in the text corpus).

#### Tokenization

**Tokenization**: the process of representing a piece of text as a vector/list of vocabulary words, rather than one long string. Each vocabulary word in a piece of text is a **token**. 
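
As a rough illustration of the definition above, here is a minimal pure-Python sketch that tokenizes a small corpus and builds a word-based vocabulary. The `tokenize` helper is hypothetical and much simpler than a real tokenizer; it only lowercases, strips punctuation, and splits on whitespace.

```python
import string

def tokenize(text):
    # Lowercase, replace punctuation with spaces, then split on whitespace.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table).split()

text_corpus = ["bob ate apples, and pears", "fred ate apples!"]
tokenized = [tokenize(text) for text in text_corpus]

# The word-based vocabulary is the set of unique tokens in the corpus.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

print(tokenized[0])  # ['bob', 'ate', 'apples', 'and', 'pears']
print(vocabulary)    # ['and', 'apples', 'ate', 'bob', 'fred', 'pears']
```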

#### Tokenizer object

Using TensorFlow, we can convert a text corpus into tokenized sequences with the `Tokenizer` object (`tf.keras.preprocessing.text.Tokenizer`). The `Tokenizer` provides the methods `fit_on_texts` and `texts_to_sequences`, which initialize the object with a text corpus and convert pieces of text into sequences of tokens, respectively.

The `Tokenizer` converts each vocabulary word into an integer ID (IDs are given to words by descending frequency). This allows the tokenized sequences to be used in NLP algorithms. 
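
To make the ID assignment concrete, the sketch below mimics it in plain Python. It is not the Keras implementation: `build_word_index` and `to_sequence` are hypothetical helpers, and the sketch assumes IDs start at 1 and that ties in frequency keep first-seen order.

```python
from collections import Counter

def build_word_index(tokenized_texts):
    counts = Counter(token for tokens in tokenized_texts for token in tokens)
    # most_common() orders by descending count; ties keep first-seen order.
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(tokens, word_index):
    # Words that are missing from the index are simply dropped.
    return [word_index[t] for t in tokens if t in word_index]

tokenized_corpus = [["bob", "ate", "apples", "and", "pears"],
                    ["fred", "ate", "apples"]]
word_index = build_word_index(tokenized_corpus)

print(word_index)
# {'ate': 1, 'apples': 2, 'bob': 3, 'and': 4, 'pears': 5, 'fred': 6}
print(to_sequence(["bob", "ate", "bacon"], word_index))  # [3, 1]
```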

#### Tokenizer parameters

By default, the `Tokenizer` filters out punctuation and splits on whitespace (though you can customize this with the `filters` parameter). When a new text contains words not in the corpus vocabulary, those words are known as out-of-vocabulary (OOV) words. By default, `texts_to_sequences` filters out all OOV words. If we instead want to represent each OOV word with a special vocabulary token, we can initialize the `Tokenizer` with the `oov_token` parameter.

The `num_words` parameter lets us cap the number of vocabulary words to use, based on word frequency. Note that Keras keeps only word IDs below `num_words`, so if `num_words = 100`, the tokenizer uses the 99 most frequent words and filters out the rest. This is helpful if the text corpus is large and you need to limit the vocabulary size to increase training speed or prevent overfitting on infrequent words.



In [None]:
"""
import tensorflow as tf
tokenizer = tf.keras.preprocessing.text.Tokenizer(
  oov_token='OOV')
text_corpus = ['bob ate apples, and pears', 'fred ate apples!']
tokenizer.fit_on_texts(text_corpus)
print(tokenizer.texts_to_sequences(['bob ate bacon']))
print(tokenizer.word_index)

import tensorflow as tf
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=2)
text_corpus = ['bob ate apples, and pears', 'fred ate apples!']
tokenizer.fit_on_texts(text_corpus)

# the two most common words are 'ate' (ID 1) and 'apples' (ID 2),
# but num_words=2 keeps only IDs below 2, i.e. just 'ate'
# for the sentence 'bob ate pears', only 'ate' will be kept,
# so the only value in the token sequence will be 1
print(tokenizer.texts_to_sequences(['bob ate pears']))
"""

### Embeddings

#### Word representation

So far, when we tokenize words, we give each word a unique integer ID. However, these integer IDs don't tell us how different words might be related to each other (e.g., if "milk" and "cereal" had IDs of 14 and 103 respectively, we wouldn't know that these two words are often used in the same context).

The solution is to use an **embedding vector**, which is a higher-dimensional vector representation of a vocabulary word. Embedding vectors give us a word representation that captures relationships between words (e.g., words similar to "milk" are closer to "milk" than words that aren't similar to "milk").

When creating embedding vectors, we need to consider how large the vectors are (the number of dimensions). Larger vectors capture more relational tendencies and are better if you have a large vocabulary size, but they also take up more computational resources and can overfit on smaller vocabularies.

As a rule of thumb, we can set the number of dimensions in the embedding vectors equal to the 4th root of the vocabulary size (e.g., if the vocabulary has 10000 words, the rule says that your embedding vectors should have 10 dimensions). This is only a starting point: we still need to try different embedding sizes to see which is best for the task at hand.
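
The rule of thumb is a one-line calculation. The helper name `suggested_embedding_dim` below is made up for illustration, and rounding to the nearest integer is an assumption, since the text does not say how to handle non-integer roots.

```python
import math

def suggested_embedding_dim(vocab_size):
    # Fourth root computed as two square roots, rounded to the nearest integer.
    return max(1, round(math.sqrt(math.sqrt(vocab_size))))

print(suggested_embedding_dim(10000))  # 10
```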

#### Target-context

The basis for embedding vectors comes from the concepts of **target word** and **context window**. For each word of each tokenized sequence, we treat it as a ***target***, and the words around it as the ***context window***. Words that often appear near each other tend to be related. 

We refer to the size of the context window as the **window size**. Since we want the context to be symmetric on both sides of the target word, the window size should be an odd number (since the target word is also included in the context window).
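
The following sketch shows how a window of a given size could be cut out of a tokenized sequence, including what happens near the edges where the window is truncated. `context_window` is a hypothetical standalone helper, and clipping at the sequence boundaries (rather than padding) is an assumption.

```python
def context_window(sequence, target_index, window_size):
    half = window_size // 2
    # Clip the window so it never runs past either end of the sequence.
    left = max(0, target_index - half)
    right = min(len(sequence), target_index + half + 1)
    context = [sequence[j] for j in range(left, right) if j != target_index]
    return sequence[target_index], context

sequence = [10, 20, 30, 40, 50]
print(context_window(sequence, 2, 3))  # (30, [20, 40])
print(context_window(sequence, 0, 3))  # (10, [20])
print(context_window(sequence, 4, 5))  # (50, [30, 40])
```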


In [5]:
"""
import tensorflow as tf

def get_target_and_size(sequence, target_index, window_size):
    # define the target word and the size of the window
    target_word = sequence[target_index]
    half_window_size = window_size // 2
    return (target_word, half_window_size)

# Skip-gram embedding model
class EmbeddingModel(object):
    # Model Initialization
    def __init__(self, vocab_size, embedding_dim):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=self.vocab_size)

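    # Get the target word and the context window bounds for a position
    def get_target_and_context(self, sequence, target_index, window_size):
        target_word, half_window_size = get_target_and_size(
            sequence, target_index, window_size
        )
        left_incl, right_excl = get_window_indices(
            sequence, target_index, half_window_size)
        return target_word, left_incl, right_excl
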
def get_window_indices(sequence, target_index, half_window_size):
    # define a helper function to get the word embeddings window
    left_incl = max([0, target_index - half_window_size])
    right_excl = min([len(sequence), target_index + half_window_size + 1])
    return (left_incl, right_excl)
"""


### Skip-gram

#### Embedding models

Earlier, we discussed how embedding vectors are based on target words and context windows. Depending on the type of embedding model used, the training pairs will be either target-context pairs or context-target pairs.

#### Skip-gram

The **skip-gram** model uses target-context training pairs. Each pair has (target word, context word). So, every context window will produce multiple training pairs. 

For example, if the context window were "I like hot Philly cheesesteaks" (where "hot" is the target word), the training pairs would be (hot, I), (hot, like), (hot, Philly), and (hot, cheesesteaks).
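
The example above can be reproduced with a few lines of Python. This is only a sketch operating on raw word strings rather than token IDs; `skip_gram_pairs` is a hypothetical helper name.

```python
def skip_gram_pairs(tokens, window_size):
    half = window_size // 2
    pairs = []
    for i, target in enumerate(tokens):
        left = max(0, i - half)
        right = min(len(tokens), i + half + 1)
        # One (target, context) pair per context word in the window.
        pairs.extend((target, tokens[j]) for j in range(left, right) if j != i)
    return pairs

tokens = ["I", "like", "hot", "Philly", "cheesesteaks"]
pairs = skip_gram_pairs(tokens, window_size=5)
hot_pairs = [pair for pair in pairs if pair[0] == "hot"]
print(hot_pairs)
# [('hot', 'I'), ('hot', 'like'), ('hot', 'Philly'), ('hot', 'cheesesteaks')]
```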

When training an embedding model, we convert the first element in the pair (here, the target word) into its corresponding embedding vector. 

#### Continuous bag of words (CBOW)

The **continuous-bag-of-words (CBOW)** model uses context-target training pairs. Each pair will have all the context words as the first element and the target word as the second element.

For example, if the context window were "I could use a hand", the training pair would be ((I, could, a, hand), use). 

Since the context element for the CBOW model contains multiple words, we use the average context embedding vector when training our embedding model.

For example, the context embedding vector would be the average between the four embedding vectors for "I", "could", "a", and "hand". 
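
A small sketch of both steps, building the context-target pair and averaging the context embeddings, might look like this. The two-dimensional `toy_embeddings` values are invented purely for illustration, and `cbow_pair` and `average_vectors` are hypothetical helpers.

```python
def cbow_pair(tokens, target_index, window_size):
    half = window_size // 2
    left = max(0, target_index - half)
    right = min(len(tokens), target_index + half + 1)
    context = tuple(tokens[j] for j in range(left, right) if j != target_index)
    return context, tokens[target_index]

def average_vectors(vectors):
    # Element-wise mean across all vectors.
    return [sum(components) / len(vectors) for components in zip(*vectors)]

tokens = ["I", "could", "use", "a", "hand"]
context, target = cbow_pair(tokens, target_index=2, window_size=5)
print((context, target))  # (('I', 'could', 'a', 'hand'), 'use')

# Made-up embedding vectors, only to show the averaging step.
toy_embeddings = {"I": [1.0, 0.0], "could": [0.0, 1.0],
                  "a": [1.0, 1.0], "hand": [2.0, 2.0]}
context_vector = average_vectors([toy_embeddings[w] for w in context])
print(context_vector)  # [1.0, 1.0]
```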

#### Skip-gram vs. CBOW

Since the skip-gram model creates a separate training pair for each context word, it yields more training samples than the CBOW model from the same corpus. This also means that the CBOW model is faster to train.

Furthermore, since the skip-gram model creates multiple training pairs for each target word, it gives more training samples for rarer words and phrases, so it does better in these cases than the CBOW model. The CBOW model, on the other hand, provides more accurate embeddings for more common words.

In [None]:
"""
def create_target_context_pairs(self, texts, window_size):
    pairs = []
    sequences = self.tokenize_text_corpus(texts)      
    for sequence in sequences:
        for i in range(len(sequence)):
            target_word, left_incl, right_excl = self.get_target_and_context(
                sequence, i, window_size)
            for j in range(left_incl, right_excl):
                if j != i:
                    pairs.append((target_word, sequence[j]))
    return pairs
"""

### Embedding matrix

We can use a trainable embedding matrix to calculate word embeddings.

#### Variable creation

For our embedding model, we need to create a 2-D matrix that contains the embedding vectors for each vocabulary word ID.

In TensorFlow, we need to create this manually using the `tf.get_variable` function (a TensorFlow 1.x API, available as `tf.compat.v1.get_variable` in TensorFlow 2).

`import tensorflow as tf
print(tf.get_variable('v1', shape=(1, 3)))
print(tf.get_variable('v2', shape=(2,), dtype=tf.int64))`

In the example above, we create the first variable with just a shape keyword (so its type is the default tf.float32) while the second is manually set to type tf.int64. 

During training, the first call to `tf.get_variable` with a given variable name creates a new variable. A subsequent call with the same name retrieves the existing variable rather than creating a new one, provided the enclosing variable scope allows reuse (for example, `tf.variable_scope(..., reuse=tf.AUTO_REUSE)`); without reuse enabled, TensorFlow 1.x raises a `ValueError` instead. This lets us continue training with the same embedding matrix.

#### Variable initializers

We can specify how the variable is initialized with the `initializer` argument of `tf.get_variable`. The default initializer for variables created by `tf.get_variable` is `glorot_uniform_initializer`, but we can also draw the initial values from a probability distribution of our choosing. A popular choice is `tf.random_uniform`.

`import tensorflow as tf
init = tf.random_uniform((5, 10),minval=-1,maxval=2)
v = tf.get_variable('v1', initializer=init)`

Above, we used `tf.random_uniform` to return a tensor with shape (5, 10) containing randomly chosen values in the half-open interval [-1, 2). We then used `init` to initialize `v` with the randomly chosen values.

#### Embedding lookup

When training the embedding model, the "forward" run consists of variable initialization/retrieval followed by "embedding lookup" for the current iteration's training batch.

Embedding lookup refers to retrieving the embedding vectors for each word ID. We perform the lookup by retrieving the rows of the embedding matrix that correspond to the training batch's word IDs.

To do so, we use `tf.nn.embedding_lookup`. It takes two arguments: the embedding matrix variable and the vocabulary IDs to look up.

`import tensorflow as tf
emb_mat = tf.get_variable('v1', shape=(5, 10))
word_ids = tf.constant([0, 3])
emb_vecs = tf.nn.embedding_lookup(emb_mat, word_ids)
print(emb_vecs)
Output: Tensor("embedding_lookup/Identity:0", shape=(2, 10), dtype=float32)`

Here, we used tf.nn.embedding_lookup to retrieve embedding vectors from 'emb_mat' for the word IDs 0 and 3. The output tensor contains the embedding vector for ID = 0 in the first row and the embedding vector for ID = 3 in the second row. 
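
Conceptually, the lookup is just row selection. The plain-Python sketch below illustrates the idea with nested lists standing in for the embedding matrix; it is not how TensorFlow implements the operation, and the matrix values are made up so that each row is easy to recognize.

```python
def embedding_lookup(embedding_matrix, word_ids):
    # Row i of the matrix is the embedding vector for word ID i.
    return [embedding_matrix[word_id] for word_id in word_ids]

# A 5 x 3 matrix whose row r contains the values 10*r, 10*r + 1, 10*r + 2.
matrix = [[float(10 * r + c) for c in range(3)] for r in range(5)]

vectors = embedding_lookup(matrix, [0, 3])
print(vectors)  # [[0.0, 1.0, 2.0], [30.0, 31.0, 32.0]]
```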


In [None]:
"""
import tensorflow as tf

def get_initializer(embedding_dim, vocab_size):
    initial_bounds = 0.5 / embedding_dim
    initializer = tf.random_uniform(
        (vocab_size, embedding_dim),
        minval=-initial_bounds,
        maxval=initial_bounds)
    return initializer


# Skip-gram embedding model
class EmbeddingModel(object):
    # Model Initialization
    def __init__(self, vocab_size, embedding_dim):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=self.vocab_size)

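    # Forward run of the embedding model to retrieve embeddings
    def forward(self, target_ids):
        initializer = get_initializer(
            self.embedding_dim, self.vocab_size)
        # Create (or retrieve) the embedding matrix, then look up the rows
        # for the target word IDs
        self.embedding_matrix = tf.get_variable('embedding_matrix',
            initializer=initializer)
        embeddings = tf.nn.embedding_lookup(self.embedding_matrix, target_ids)
        return embeddings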
"""

### Candidate Sampling

#### Large vocabularies

To obtain good word embeddings, we need to train an embedding model on a large amount of text data. This often means that the vocabulary size will be very large, and having a large vocabulary size can significantly slow down training. 

Training an embedding model is equivalent to multiclass classification, where the possible classes include every single vocabulary word. We need to calculate a softmax loss across every single vocabulary word during training, which can be incredibly time-consuming for large vocabularies. 

To get around this, we use **candidate sampling**, where we choose a smaller fraction of the possible words when we compute the loss for training the embedding model. If we choose the proper candidate samplers and loss function, we can significantly speed up training while maintaining performance.

#### Computing logits

When we calculate the loss for an embedding model, we need to compute the model's logits. We set up trainable weights and biases, and these compute the logits, which are then converted into the loss based on the loss function. 
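
To see why the full softmax is costly, it helps to write it out. In this sketch (pure Python, with hypothetical helper names and made-up numbers), one logit is computed per vocabulary word as a dot product with that word's weight row plus its bias, and the loss then normalizes over all of them.

```python
import math

def compute_logits(embedding, weights, biases):
    # One logit per vocabulary word: dot(weight_row, embedding) + bias.
    return [sum(w * e for w, e in zip(row, embedding)) + b
            for row, b in zip(weights, biases)]

def softmax_cross_entropy(logits, true_index):
    # log(sum(exp(logits))) - logits[true_index], computed stably.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[true_index]

embedding = [1.0, 2.0]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # vocab_size x embedding_dim
biases = [0.0, 0.5, -1.0]

logits = compute_logits(embedding, weights, biases)
print(logits)  # [1.0, 2.5, 2.0]
print(softmax_cross_entropy(logits, true_index=1))
```

Both functions loop over every vocabulary word, which is the cost that candidate sampling avoids.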



In [None]:
"""
import tensorflow as tf

# Skip-gram embedding model
class EmbeddingModel(object):
    # Model Initialization
    def __init__(self, vocab_size, embedding_dim):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=self.vocab_size)

    # Get bias and weights for calculating loss
    def get_bias_weights(self):
        weights_initializer = tf.zeros(shape = (self.vocab_size, self.embedding_dim))
        bias_initializer = tf.zeros(shape = (self.vocab_size, ))
        weights = tf.get_variable('weights', initializer = weights_initializer)
        bias = tf.get_variable('bias', initializer = bias_initializer)
        return (weights, bias)
"""

### Embedding Loss

#### Loss functions

Candidate sampling avoids performing a costly full softmax operation to calculate the embedding loss. There are two main loss functions that are used instead: sampled softmax and NCE loss.

#### Sampled softmax

This is a softmax loss with "sampled" classes. The classes used to calculate the softmax include the actual context vocabulary word as well as a randomly chosen set of words from the entire vocabulary as a set of false labels. 

In TensorFlow, we can compute the sampled softmax loss using the `tf.nn.sampled_softmax_loss` function.

![title](sampled_softmax_example.png)

The sampled softmax loss gives a good approximation for the full softmax loss, while being more efficient. 

However, when evaluating a model's loss on a validation or test set, it's best to use the full softmax cross entropy for the most accurate metrics.
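
The idea behind sampled softmax can be sketched in plain Python as follows. This is a simplification under stated assumptions: negatives are drawn uniformly without replacement, and the sketch leaves out the additional corrections that the library function applies, so it should not be read as a reimplementation of `tf.nn.sampled_softmax_loss`. The function name and arguments are hypothetical.

```python
import math
import random

def sampled_softmax_sketch(embedding, weights, biases, true_id, num_sampled, rng):
    vocab_size = len(weights)
    # Draw false labels uniformly from every word except the true one.
    negatives = rng.sample(
        [i for i in range(vocab_size) if i != true_id], num_sampled)
    candidates = [true_id] + negatives
    # Logits are computed only for the candidates, not the whole vocabulary.
    logits = [sum(w * e for w, e in zip(weights[i], embedding)) + biases[i]
              for i in candidates]
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]  # the true class sits at position 0

rng = random.Random(0)
vocab_size, embedding_dim = 1000, 4
weights = [[0.0] * embedding_dim for _ in range(vocab_size)]
biases = [0.0] * vocab_size
loss = sampled_softmax_sketch([1.0, 2.0, 3.0, 4.0], weights, biases,
                              true_id=7, num_sampled=9, rng=rng)
print(loss)  # with all-zero weights this is log(10), about 2.3026
```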

#### NCE Loss

The NCE (noise-contrastive estimation) loss takes a different approach. It converts the multiclass classification problem into a binary classification problem. 

With the softmax-based loss function, the embedding model is trained to predict the single correct context word out of the entire set of vocabulary words. With noise-contrastive estimation, in contrast, the model uses a sigmoid-based probabilistic approach. For each sampled word, the model should output a low probability that the sampled word is part of the target word's context, while it should output a high probability for the true context word.

![title](nce_example.png)

This is simpler than calculating softmax and thus is normally favored when training an embedding model. In TensorFlow, we can compute the NCE loss using the `tf.nn.nce_loss` function.

The NCE loss provides a good loss approximation during training, but should be replaced by the full sigmoid cross entropy loss during evaluation or testing, for the most accurate metrics.
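
The binary-classification view can be illustrated with a short pure-Python sketch: the true context word is scored against label 1 and each sampled word against label 0, using sigmoid cross entropy. This is a simplified sketch with hypothetical helper names; it omits the noise-distribution correction that a full NCE implementation such as `tf.nn.nce_loss` includes.

```python
import math

def sigmoid_cross_entropy(logit, label):
    # Stable form of -[label*log(s) + (1 - label)*log(1 - s)], s = sigmoid(logit).
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def binary_candidate_loss(true_logit, sampled_logits):
    # The true context word should be classified as 1, sampled words as 0.
    loss = sigmoid_cross_entropy(true_logit, 1.0)
    loss += sum(sigmoid_cross_entropy(l, 0.0) for l in sampled_logits)
    return loss

# High logit for the true word and low logits for sampled words: small loss.
good = binary_candidate_loss(5.0, [-5.0, -5.0])
# The reverse situation is penalized heavily.
bad = binary_candidate_loss(-5.0, [5.0, 5.0])
print(good < bad)  # True
```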

In [None]:
"""
import tensorflow as tf

# Skip-gram embedding model
class EmbeddingModel(object):
    # Model Initialization
    def __init__(self, vocab_size, embedding_dim):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=self.vocab_size)

    # Get bias and weights for calculating loss
    def get_bias_weights(self):
        weights_initializer = tf.zeros([self.vocab_size, self.embedding_dim])
        bias_initializer = tf.zeros([self.vocab_size])
        weights = tf.get_variable('weights',
            initializer=weights_initializer)
        bias = tf.get_variable('bias',
            initializer=bias_initializer)
        return weights, bias
    
    # Calculate NCE Loss based on the retrieved embedding and context
    def calculate_loss(self, embeddings, context_ids, num_negative_samples):
        weights, bias = self.get_bias_weights()
        nce_losses = tf.nn.nce_loss(
            weights=weights,
            biases=bias,
            labels=context_ids,
            inputs=embeddings,
            num_sampled=num_negative_samples,
            num_classes=self.vocab_size)
        overall_loss = tf.reduce_mean(nce_losses)
        return overall_loss
"""

### Cosine Similarity

We can use normalized cosine similarity to evaluate the embedding model. 

#### Vector comparison

Since word embeddings are just vectors of real numbers, we can use cosine similarity to compare how "close" two embeddings are.

#### Correlation

Cosine similarity measures how closely two vectors point in the same direction, regardless of their magnitudes. Its value ranges from -1 to 1. A value of 1 means that the vectors point in the same direction, a value of -1 means they point in opposite directions, and a value of 0 means that they are orthogonal.
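
The three reference values can be checked with a small pure-Python sketch. The `cosine_similarity` helper below is written for illustration and assumes both vectors are non-zero.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Rounded to hide floating-point noise in the last digits.
print(round(cosine_similarity([1.0, 2.0], [2.0, 4.0]), 6))   # 1.0 (same direction)
print(round(cosine_similarity([1.0, 0.0], [-1.0, 0.0]), 6))  # -1.0 (opposite)
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 6))   # 0.0 (orthogonal)
```

Note that `[1.0, 2.0]` and `[2.0, 4.0]` are not identical vectors, yet their cosine similarity is 1 because only direction matters.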

In [None]:
"""
import tensorflow as tf

# Skip-gram embedding model
class EmbeddingModel(object):
    # Model Initialization
    def __init__(self, vocab_size, embedding_dim):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=self.vocab_size)

    # Forward run of the embedding model to retrieve embeddings
    def forward(self, target_ids):
        initial_bounds = 0.5 / self.embedding_dim
        initializer = tf.random_uniform(
            [self.vocab_size, self.embedding_dim],
            minval=-initial_bounds,
            maxval=initial_bounds)
        self.embedding_matrix = tf.get_variable('embedding_matrix',
            initializer=initializer)
        embeddings = tf.nn.embedding_lookup(self.embedding_matrix, target_ids)
        return embeddings
    
    # Compute cosine similarities between the word's embedding
    # and all other embeddings for each vocabulary word
    def compute_cos_sims(self, word, training_texts):
        self.tokenizer.fit_on_texts(training_texts)
        word_id = self.tokenizer.word_index[word]
        word_embedding = self.forward([word_id])
        normalized_embedding = tf.nn.l2_normalize(word_embedding)
        normalized_matrix = tf.nn.l2_normalize(self.embedding_matrix, axis = 1)
        cos_sims = tf.matmul(normalized_embedding, normalized_matrix, transpose_b = True)
        return cos_sims
"""

### K-Nearest Neighbors

We can use cosine similarity as the closeness measure for K-nearest neighbors (KNN): a word's nearest neighbors are the K words whose embeddings have the highest cosine similarity to it. Using KNN helps us evaluate our embedding model and make sure that it was trained properly. If the K-nearest neighbors for any given word are always the same K words, then something may have gone wrong with the training. Likewise, if the K-nearest neighbors are completely different from what we'd expect to see, there may also be an error.

When the embedding model is well-trained, the KNN metric can provide useful grouping insights. 
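
As a sanity check of the idea, here is a minimal pure-Python sketch of KNN over a handful of made-up two-dimensional embeddings. The vectors are invented for illustration and do not come from a trained model. Unlike a comparison against every row of the embedding matrix, this sketch excludes the query word itself from its own neighbors.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def k_nearest_neighbors(word, embeddings, k):
    query = embeddings[word]
    scored = [(cosine_similarity(query, vec), other)
              for other, vec in embeddings.items() if other != word]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [other for _, other in scored[:k]]

# Made-up embeddings: food words point one way, vehicle words another.
toy_embeddings = {
    "milk": [1.0, 0.1],
    "cereal": [0.9, 0.2],
    "car": [0.0, 1.0],
    "truck": [0.1, 0.9],
}
print(k_nearest_neighbors("milk", toy_embeddings, k=2))  # ['cereal', 'truck']
```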

In [None]:
"""
import tensorflow as tf

# Skip-gram embedding model
class EmbeddingModel(object):
    # Model Initialization
    def __init__(self, vocab_size, embedding_dim):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=self.vocab_size)

    # Forward run of the embedding model to retrieve embeddings
    def forward(self, target_ids):
        initial_bounds = 0.5 / self.embedding_dim
        initializer = tf.random_uniform(
            [self.vocab_size, self.embedding_dim],
            minval=-initial_bounds,
            maxval=initial_bounds)
        self.embedding_matrix = tf.get_variable('embedding_matrix',
            initializer=initializer)
        embeddings = tf.nn.embedding_lookup(self.embedding_matrix, target_ids)
        return embeddings
    
    # Compute cosine similarities between the word's embedding
    # and all other embeddings for each vocabulary word
    def compute_cos_sims(self, word, training_texts):
        self.tokenizer.fit_on_texts(training_texts)
        word_id = self.tokenizer.word_index[word]
        word_embedding = self.forward([word_id])
        normalized_embedding = tf.nn.l2_normalize(word_embedding)
        normalized_matrix = tf.nn.l2_normalize(self.embedding_matrix, axis=1)
        cos_sims = tf.matmul(normalized_embedding, normalized_matrix,
            transpose_b=True)
        return cos_sims
    
    # Compute K-nearest neighbors for input word
    def k_nearest_neighbors(self, word, k, training_texts):
        cos_sims = self.compute_cos_sims(word, training_texts)
        # cos_sims has shape (1, self.vocab_size); drop the dimension of size 1
        squeezed_cos_sims = tf.squeeze(cos_sims)
        # get the k-nearest neighbors for the word
        top_k_output = tf.math.top_k(squeezed_cos_sims, k)
        # top_k_output is a tuple: the first element holds the top K cosine
        # similarities, the second holds the corresponding word IDs
        return top_k_output

"""