# Word2vec

word2vec is not a single algorithm; rather, it is a family of model architectures and optimizations that can be used to learn word embeddings from large datasets.

The two most popular methods for learning representations of words are:

- **Continuous bag of words model**: It predicts the middle word based on surrounding context words. The context consists of a few words before and after the middle word. This architecture is called a bag-of-words model as the order of words in the context is not important. 
- **Continuous skip-gram model**: It predicts words within a certain range before and after the current word in the same sentence.
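
To make the difference between the two architectures concrete, here is a minimal plain-Python sketch of the training examples each one would see. The helper names `cbow_examples` and `skipgram_pairs` and the toy sentence are illustrative assumptions, not part of any library.

```python
def cbow_examples(tokens, window_size):
    """Return (context_words, middle_word) examples, as CBOW would see them."""
    examples = []
    for i, middle in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, middle))
    return examples


def skipgram_pairs(tokens, window_size):
    """Return (current_word, nearby_word) pairs, as skip-gram would see them."""
    pairs = []
    for i, current in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((current, tokens[j]))
    return pairs


tokens = "the cat sits on the mat".split()  # hypothetical toy sentence
print(cbow_examples(tokens, 1)[1])    # (['the', 'sits'], 'cat')
print(skipgram_pairs(tokens, 1)[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sits')]
```

CBOW produces one example per position (many context words, one prediction), while skip-gram produces one pair per (word, neighbor) combination.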

## Skip-gram and negative sampling

Skip-gram is a technique used in natural language processing to learn word representations (word embeddings). Imagine you're trying to understand words by looking at their neighbors.
Here's how Skip-gram works:

- Take a word in a sentence
- Try to predict the words that are likely to appear around it
- Example: In the sentence "The cat sits on the mat"
  - If we're looking at the word "cat", the model tries to predict nearby words like "the", "sits"



Negative sampling is a clever trick to make this learning process more efficient:

- Instead of scoring every single word in the vocabulary for each training pair (which would be very slow), the model samples a few "negative" words at random
- Because they are drawn at random, these words are unlikely to appear near the target word
- This helps the model learn to distinguish between words that are likely and unlikely to be context words

Think of it like a game:

- Positive example: "cat" is near "sits" ✓
- Negative examples: "cat" is probably NOT near "computer" or "rocket" ✗

The magic happens when the model learns to:

- Recognize which words are likely to be together
- Create vector representations that capture word meanings
- Do this efficiently by only checking a few random words instead of all possible words

Essentially, Skip-gram with negative sampling is a smart way to teach computers to understand word relationships by looking at how words typically appear together in text.
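
The game described above can be written as a small loss function. The following is a minimal sketch in plain Python, assuming the common sigmoid-based negative-sampling objective; the two-dimensional vectors for "cat", "sits", "computer" and "rocket" are made-up values for illustration only, not learned embeddings.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_loss(target_vec, positive_vec, negative_vecs):
    """Loss for one target word: reward a high score for the true context
    word and a low score for each sampled negative word."""
    loss = -math.log(sigmoid(dot(target_vec, positive_vec)))
    for negative_vec in negative_vecs:
        loss -= math.log(sigmoid(-dot(target_vec, negative_vec)))
    return loss


# Hypothetical 2-d embeddings, chosen by hand for the example.
cat = [1.0, 0.5]
sits = [0.9, 0.4]
computer = [-1.0, 0.2]
rocket = [-0.7, -0.6]

good = negative_sampling_loss(cat, sits, [computer, rocket])
bad = negative_sampling_loss(cat, computer, [sits, rocket])
print(good < bad)  # True: treating "sits" as the true context fits these vectors better
```

Only `1 + len(negative_vecs)` dot products are needed per example, which is where the efficiency gain over scoring the whole vocabulary comes from.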

In [91]:
tf.config.experimental.list_physical_devices('GPU')

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

## Import Dependencies

In [1]:
import io 
import re
import string 
import tqdm

import numpy as np 
import tensorflow as tf 

In [3]:
%load_ext tensorboard

In [2]:
SEED = 42 
AUTOTUNE = tf.data.AUTOTUNE

### Vectorize an example sentence

In [4]:
sentence = 'The wide road shimmered in the hot sun'
tokens = list(sentence.lower().split())
tokens

['the', 'wide', 'road', 'shimmered', 'in', 'the', 'hot', 'sun']

Create a vocabulary to save mappings from tokens to integer indices:

In [5]:
vocab = {}
index = 1 
vocab['<pad>'] = 0 # add a padding token

for token in tokens: 
    if token not in vocab: 
        vocab[token] = index
        index += 1

vocab_size = len(vocab)
vocab

{'<pad>': 0,
 'the': 1,
 'wide': 2,
 'road': 3,
 'shimmered': 4,
 'in': 5,
 'hot': 6,
 'sun': 7}

Create an inverse vocabulary to save mappings from integer indices to tokens:

In [6]:
inverse_vocab = {}
for token, index in vocab.items():
    inverse_vocab[index] = token
inverse_vocab

{0: '<pad>',
 1: 'the',
 2: 'wide',
 3: 'road',
 4: 'shimmered',
 5: 'in',
 6: 'hot',
 7: 'sun'}

### Vectorize the Sentence

In [7]:
example_sequence = []

for word in tokens: 
    example_sequence.append(vocab[word])
example_sequence

[1, 2, 3, 4, 5, 1, 6, 7]

### Generate skip-grams from one sentence

The tf.keras.preprocessing.sequence module provides useful functions that simplify data preparation for word2vec. You can use the tf.keras.preprocessing.sequence.skipgrams to generate skip-gram pairs from the example_sequence with a given window_size from tokens in the range [0, vocab_size).

In [8]:
window_size = 2

positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
    example_sequence, 
    vocabulary_size=vocab_size, 
    window_size=window_size, 
    negative_samples=0
)

len(positive_skip_grams)

26
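
The count of 26 can be checked by hand: each position contributes one pair per neighbor that falls inside the window. The sketch below is plain Python, not the Keras implementation, and the helper name `count_skip_grams` is hypothetical. It assumes the sequence has no padding tokens (index 0) and that no subsampling table is used.

```python
def count_skip_grams(sequence_length, window_size):
    """Count (target, context) pairs for a sequence without padding tokens."""
    total = 0
    for i in range(sequence_length):
        left = min(i, window_size)                         # neighbors before position i
        right = min(sequence_length - 1 - i, window_size)  # neighbors after position i
        total += left + right
    return total


print(count_skip_grams(8, 2))  # 26
```

For the eight-token example sentence with `window_size = 2`, the per-position counts are 2, 3, 4, 4, 4, 4, 3, 2.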

Print a few positive skip-grams:

In [9]:
for target, context in positive_skip_grams[:5]: 
    print(f"({target}, {context}): ({inverse_vocab[target]}, {inverse_vocab[context]})")

(7, 1): (sun, the)
(5, 4): (in, shimmered)
(2, 4): (wide, shimmered)
(6, 1): (hot, the)
(5, 1): (in, the)


### Negative sampling for one skip-gram

The `skipgrams` function returns all positive skip-gram pairs by sliding over a given window span. To produce additional skip-gram pairs that would serve as negative samples for training, you need to sample random words from the vocabulary.

In [10]:
target_word, context_word = positive_skip_grams[0]

num_ns = 4 # the number of negative samples per positive context

context_class = tf.reshape(tf.constant(context_word, dtype='int64'),(1, 1))

negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
    true_classes=context_class, # class that should be sampled as positive
    num_true=1, # each positive skip gram has 1 positive context class 
    num_sampled=num_ns, # number of negative context words to sample
    unique=True, # all the negative samples should be unique 
    range_max=vocab_size, # pick index of the samples from [0, vocab_size)
    seed=SEED, 
    name='negative_sampling'
)
print(negative_sampling_candidates)
print([inverse_vocab[index.numpy()] for index in negative_sampling_candidates])

tf.Tensor([2 1 4 3], shape=(4,), dtype=int64)
['wide', 'the', 'shimmered', 'road']
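
The sampler draws candidates from an approximately log-uniform (Zipfian) distribution, so low indices are sampled much more often than high ones. Below is a minimal plain-Python sketch of that distribution, assuming the formula given in the TensorFlow documentation for `tf.random.log_uniform_candidate_sampler`; the helper name is hypothetical.

```python
import math


def log_uniform_probability(class_id, range_max):
    """P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1)."""
    return (math.log(class_id + 2) - math.log(class_id + 1)) / math.log(range_max + 1)


range_max = 8  # same as vocab_size in the example above
probs = [log_uniform_probability(c, range_max) for c in range(range_max)]
for class_id, p in enumerate(probs):
    print(class_id, round(p, 4))
```

The probabilities sum to 1 and decrease with the index. This may be part of why low indices such as 1 ("the") can appear among the sampled candidates even when they are also the true class.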




### Constructing One Training Example

For a given positive (target_word, context_word) skip-gram, you now also have num_ns negative sampled context words that do not appear in the window size neighborhood of target_word. Batch the 1 positive context_word and num_ns negative context words into one tensor. This produces a set of positive skip-grams (labeled as 1) and negative samples (labeled as 0) for each target word.

In [11]:
squeezed_context_class = tf.squeeze(context_class, 1) # Reduce a dimension so you can use concatenation in the next step
squeezed_context_class.shape

TensorShape([1])

In [12]:
context = tf.concat([squeezed_context_class, negative_sampling_candidates], 0) # concatenate a positive context word with negative sampled words
context

<tf.Tensor: shape=(5,), dtype=int64, numpy=array([1, 2, 1, 4, 3])>

In [13]:
label = tf.constant([1] + [0]*num_ns, dtype='int64') # label the first context word as 1 (positive), followed by num_ns 0s (negative)
target = target_word

Check out the context and the corresponding labels for the target word from the skip-gram example above:

In [14]:
print(f"target_index    : {target}")
print(f"target_word     : {inverse_vocab[target_word]}")
print(f"context_indices : {context}")
print(f"context_words   : {[inverse_vocab[c.numpy()] for c in context]}")
print(f"label           : {label}")

target_index    : 7
target_word     : sun
context_indices : [1 2 1 4 3]
context_words   : ['the', 'wide', 'the', 'shimmered', 'road']
label           : [1 0 0 0 0]


A tuple of (target, context, label) tensors constitutes one training example for training your skip-gram negative sampling word2vec model. Notice that the target is a single index, while the context and label are of shape (1+num_ns,).

In [15]:
print("target  :", target)
print("context :", context)
print("label   :", label)

target  : 7
context : tf.Tensor([1 2 1 4 3], shape=(5,), dtype=int64)
label   : tf.Tensor([1 0 0 0 0], shape=(5,), dtype=int64)


### Compile all steps into one function 

In [16]:
sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(size=10)

In [17]:
sampling_table

array([0.00315225, 0.00315225, 0.00547597, 0.00741556, 0.00912817,
       0.01068435, 0.01212381, 0.01347162, 0.01474487, 0.0159558 ])
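
The values above can be reproduced by hand. The sketch below is plain Python and assumes the Zipf-based approximation that Keras documents for `make_sampling_table`, with a default `sampling_factor` of `1e-5`; the constant `0.577` and the handling of rank 0 are assumptions chosen to match the printed table, not a copy of the library source.

```python
import math


def sampling_table(size, sampling_factor=1e-5):
    """Approximate keep-probabilities per word rank (rank 0 = most frequent)."""
    gamma = 0.577  # rough value of the Euler-Mascheroni constant
    table = []
    for rank in range(size):
        rank = max(rank, 1)  # treat rank 0 like rank 1 to avoid log(0)
        # Approximate inverse word frequency under a Zipf distribution.
        inv_freq = rank * (math.log(rank) + gamma) + 0.5 - 1.0 / (12.0 * rank)
        f = sampling_factor * inv_freq
        table.append(min(1.0, f / math.sqrt(f)))
    return table


table = sampling_table(10)
print([round(p, 8) for p in table])
```

Each entry is the probability of keeping a word of that rank as a target, so the most frequent words (lowest ranks) are kept least often.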

### Generate Training data 

In [63]:
# This function generates training data for the skip-gram model with negative sampling
def generate_training_data(sequences, window_size, num_ns, vocab_size, seed): 
    """ 
    Parameters: 
    sequences: Input sequences (sentences encoded as integers)
    window_size: size of the context window around each target word
    num_ns: number of negative samples to generate
    vocab_size: total number of unique words in the vocabulary
    seed: random seed for reproducibility
    """
    targets, contexts, labels = [], [], []

    # Creates a sampling table to help with sub-sampling frequent words
    # this reduces the probability of sampling very common words too often
    sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)

    # tqdm provides a progress bar to track progress
    for sequence in tqdm.tqdm(sequences): 
        # Generates positive skip-gram pairs for the current sequence
        # skipgrams() creates context word pairs within the specified window size
        # negative_samples=0 means only positive pairs are generated at this stage
        positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
            sequence, 
            vocabulary_size=vocab_size, 
            sampling_table=sampling_table, 
            window_size=window_size, 
            negative_samples=0
        )

        # Iterates through each positive skip-gram pair
        for target_word, context_word in positive_skip_grams: 
            # Creates a tensor for the context word to prepare for negative sampling
            context_class = tf.expand_dims(tf.constant([context_word], dtype='int64'), 1)

            # Generates negative samples using log-uniform sampling
            # Ensures sampled words are unique and within vocabulary range
            negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
                true_classes=context_class,
                num_true=1, 
                num_sampled=num_ns, 
                unique=True, 
                range_max=vocab_size, 
                seed=seed, 
                name='negative_sampling'
            )

            # Combines the positive context word with negative samples 
            context = tf.concat([tf.squeeze(context_class, 1), negative_sampling_candidates], 0)

            # Creates corresponding labels (1 for positive, 0 for negative samples)
            label = tf.constant([1] + [0]*num_ns, dtype="int64")

            targets.append(target_word)
            contexts.append(context)
            labels.append(label)
    
    return targets, contexts, labels

### Prepare training data for word2vec

#### Download text corpus

In [64]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt
1115394/1115394 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


In [65]:
with open(path_to_file) as f:
  lines = f.read().splitlines()
for line in lines[:20]:
  print(line)

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.


Use the non-empty lines to construct a tf.data.TextLineDataset object for the next steps:

In [74]:
text_ds = tf.data.TextLineDataset(path_to_file).filter(lambda x: tf.cast(tf.strings.length(x), bool))

### Vectorize sentences from the corpus

In [71]:
# Now, create a custom standardization function to lowercase the text and
# remove punctuation.
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  return tf.strings.regex_replace(lowercase,
                                  '[%s]' % re.escape(string.punctuation), '')
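
To see what this standardization does to a single line, here is an equivalent sketch in plain Python using the standard `re` module instead of `tf.strings`. The function name `standardize` is hypothetical, and the sample line is taken from the corpus excerpt above.

```python
import re
import string

# Character class matching any single ASCII punctuation character.
PUNCTUATION_PATTERN = "[%s]" % re.escape(string.punctuation)


def standardize(text):
    """Lowercase the text and delete punctuation characters."""
    return re.sub(PUNCTUATION_PATTERN, "", text.lower())


print(standardize("We know't, we know't."))  # we knowt we knowt
```

Note that apostrophes are removed too, so contractions such as "know't" collapse into a single token.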

In [73]:

# Define the vocabulary size and the number of words in a sequence.
vocab_size = 4096
sequence_length = 10

# Use the `TextVectorization` layer to normalize, split, and map strings to
# integers. Set the `output_sequence_length` length to pad all samples to the
# same length.
vectorize_layer = tf.keras.layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)

Call TextVectorization.adapt on the text dataset to create vocabulary.

In [76]:
vectorize_layer.adapt(text_ds.batch(1024))



Once the state of the layer has been adapted to represent the text corpus, the vocabulary can be accessed with TextVectorization.get_vocabulary. This function returns a list of all vocabulary tokens sorted (descending) by their frequency.

In [77]:
# Save the created vocabulary for reference.
inverse_vocab = vectorize_layer.get_vocabulary()
print(inverse_vocab[:20])

['', '[UNK]', 'the', 'and', 'to', 'i', 'of', 'you', 'my', 'a', 'that', 'in', 'is', 'not', 'for', 'with', 'me', 'it', 'be', 'your']
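
The layout of this vocabulary (padding at index 0, `[UNK]` at index 1, then tokens by descending frequency) and the padding to a fixed length can be imitated in plain Python. This is a simplified sketch, not the `TextVectorization` implementation: the helper names and toy lines are hypothetical, and tie-breaking between equally frequent tokens may differ from the real layer.

```python
from collections import Counter


def build_vocabulary(lines, max_tokens):
    """Reserve index 0 for padding and 1 for unknown tokens, then add the most frequent tokens."""
    counts = Counter(token for line in lines for token in line.split())
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


def encode(line, vocabulary, sequence_length):
    """Map tokens to indices, truncate to sequence_length, and pad with 0."""
    index = {token: i for i, token in enumerate(vocabulary)}
    ids = [index.get(token, 1) for token in line.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


lines = ["speak speak speak", "hear me", "hear"]  # toy corpus, already standardized
vocabulary = build_vocabulary(lines, max_tokens=5)
print(vocabulary)                                # ['', '[UNK]', 'speak', 'hear', 'me']
print(encode("hear the speak", vocabulary, 5))   # [3, 1, 2, 0, 0]
```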


The vectorize_layer can now be used to generate vectors for each element in the text_ds:

In [78]:
# Vectorize the data in text_ds.
text_vector_ds = text_ds.batch(1024).prefetch(AUTOTUNE).map(vectorize_layer).unbatch()

### Obtain Sequences from the dataset

In [79]:
sequences = list(text_vector_ds.as_numpy_iterator())
print(len(sequences))

32777




In [81]:
for seq in sequences[:10]:
  print(f"{seq} => {[inverse_vocab[i] for i in seq]}")

[ 89 270   0   0   0   0   0   0   0   0] => ['first', 'citizen', '', '', '', '', '', '', '', '']
[138  36 982 144 673 125  16 106   0   0] => ['before', 'we', 'proceed', 'any', 'further', 'hear', 'me', 'speak', '', '']
[34  0  0  0  0  0  0  0  0  0] => ['all', '', '', '', '', '', '', '', '', '']
[106 106   0   0   0   0   0   0   0   0] => ['speak', 'speak', '', '', '', '', '', '', '', '']
[ 89 270   0   0   0   0   0   0   0   0] => ['first', 'citizen', '', '', '', '', '', '', '', '']
[   7   41   34 1286  344    4  200   64    4 3690] => ['you', 'are', 'all', 'resolved', 'rather', 'to', 'die', 'than', 'to', 'famish']
[34  0  0  0  0  0  0  0  0  0] => ['all', '', '', '', '', '', '', '', '', '']
[1286 1286    0    0    0    0    0    0    0    0] => ['resolved', 'resolved', '', '', '', '', '', '', '', '']
[ 89 270   0   0   0   0   0   0   0   0] => ['first', 'citizen', '', '', '', '', '', '', '', '']
[  89    7   93 1187  225   12 2442  592    4    2] => ['first', 'you', 'know', 'c

### Generate training examples from sequences 

sequences is now a list of int-encoded sentences. Call the generate_training_data function defined earlier to generate training examples for the word2vec model. To recap, the function iterates over each word from each sequence to collect positive and negative context words. The lengths of targets, contexts, and labels should be the same, representing the total number of training examples.

In [82]:
targets, contexts, labels = generate_training_data(
    sequences=sequences, 
    window_size=2, 
    num_ns=4, 
    vocab_size=vocab_size, 
    seed=SEED
)

targets = np.array(targets)
contexts = np.array(contexts)
labels = np.array(labels)


100%|███████████████████████████████████████████████████████████████████████████████| 32777/32777 [01:02<00:00, 523.12it/s]

### Configure the dataset for performance

In [83]:
BATCH_SIZE = 1024
BUFFER_SIZE = 10000
dataset = tf.data.Dataset.from_tensor_slices(((targets, contexts), labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
dataset

<_BatchDataset element_spec=((TensorSpec(shape=(1024,), dtype=tf.int64, name=None), TensorSpec(shape=(1024, 5), dtype=tf.int64, name=None)), TensorSpec(shape=(1024, 5), dtype=tf.int64, name=None))>


Apply Dataset.cache and Dataset.prefetch to improve performance:

In [84]:
dataset = dataset.cache().prefetch(buffer_size=AUTOTUNE)
dataset

<_PrefetchDataset element_spec=((TensorSpec(shape=(1024,), dtype=tf.int64, name=None), TensorSpec(shape=(1024, 5), dtype=tf.int64, name=None)), TensorSpec(shape=(1024, 5), dtype=tf.int64, name=None))>

### Model 

The word2vec model can be implemented as a classifier to distinguish between true context words from skip-grams and false context words obtained through negative sampling. You can perform a dot product multiplication between the embeddings of target and context words to obtain predictions for labels and compute the loss function against true labels in the dataset.

### Subclassed word2vec model

In [105]:
# Define a custom keras model for the word2vec algorithm
# Inherits from `tf.keras.Model` to create a neural network model
class Word2Vec(tf.keras.Model): 
    def __init__(self, vocab_size, embedding_dim): 
        """
        Parameters:
        vocab_size: Total number of unique words in the vocabulary
        embedding_dim: Dimension of the word embedding vector
        """
        super(Word2Vec, self).__init__()
        # Create two embedding layers 
        self.target_embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim, name='w2v_embedding') #  Represents the input (target) words
        self.context_embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim) # Represents the context words

    def call(self, pair): # This is a forward pass method that defines how input is processed
        target, context = pair 
        if len(target.shape) == 2: 
            target = tf.squeeze(target, axis=1)
        # Converts integer-encoded words to their embedding representations
        word_emb = self.target_embedding(target) 
        context_emb = self.context_embedding(context)  # Context words use their own embedding layer, not target_embedding
        
        # Uses Einstein summation (tf.einsum) to compute dot products
        dots = tf.einsum('be,bce->bc', word_emb, context_emb) 
        return dots
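
The `'be,bce->bc'` contraction computes, for every example in the batch, the dot product between the target embedding and each of its candidate context embeddings. The sketch below checks this with NumPy on random arrays; the shapes (batch of 2, 5 candidates, 3 dimensions) are arbitrary choices for illustration, and `np.einsum` is used here in place of `tf.einsum`.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, num_candidates, embedding_dim = 2, 5, 3

word_emb = rng.normal(size=(batch, embedding_dim))                     # shape (b, e)
context_emb = rng.normal(size=(batch, num_candidates, embedding_dim))  # shape (b, c, e)

# Sum over the embedding axis e, keeping batch b and candidate c.
dots = np.einsum("be,bce->bc", word_emb, context_emb)

# The same result written as explicit per-pair dot products.
expected = np.array([
    [word_emb[b] @ context_emb[b, c] for c in range(num_candidates)]
    for b in range(batch)
])

print(dots.shape)                   # (2, 5)
print(np.allclose(dots, expected))  # True
```

Each row of `dots` holds one logit per candidate context word, which lines up with the `(1 + num_ns,)` label vector of that example.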

### Define loss function and compile model

In [102]:
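def custom_loss(y_true, y_pred):
    # Keras calls a loss as loss(y_true, y_pred); cast labels to the logits dtype.
    labels = tf.cast(y_true, y_pred.dtype)
    return tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=y_pred)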

In [106]:
embedding_dim = 128
word2vec = Word2Vec(vocab_size, embedding_dim)
word2vec.compile(optimizer='adam',
                 loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                 metrics=['accuracy'])

In [98]:
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

In [107]:
word2vec.fit(dataset, epochs=20, callbacks=[tensorboard_callback])

Epoch 1/20
62/62 ━━━━━━━━━━━━━━━━━━━━ 3s 21ms/step - accuracy: 0.2183 - loss: 1.6089
Epoch 2/20
62/62 ━━━━━━━━━━━━━━━━━━━━ 1s 23ms/step - accuracy: 0.5993 - loss: 1.5900
Epoch 3/20
62/62 ━━━━━━━━━━━━━━━━━━━━ 1s 21ms/step - accuracy: 0.6112 - loss: 1.5343
Epoch 4/20
62/62 ━━━━━━━━━━━━━━━━━━━━ 1s 22ms/step - accuracy: 0.5601 - loss: 1.4458
Epoch 5/20
62/62 ━━━━━━━━━━━━━━━━━━━━ 1s 21ms/step - accuracy: 0.5757 - loss: 1.3497
Epoch 6/20
62/62 ━━━━━━━━━━━━━━━━━━━━ 1s 22ms/step - accuracy: 0.6094 - loss: 1.2539
Epoch 7/20
62/62 ━━━━━━━━━━━━━━━━━━━━ 1s 23ms/step - accuracy: 0.6461 - loss: 1.1640
Epoch 8/20
62/62 ━━━━━━━━━━━━━━━━━━━━ 1s 20ms/step - accuracy: 0.6805 - loss: 1.0805
Epoch 9/20
62/62 ━━━━━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x34713e1c0>
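
The first-epoch loss of about 1.609 in the log above is close to what one would expect from an untrained model. With `from_logits=True`, the categorical cross-entropy treats the `1 + num_ns = 5` scores of each example as softmax logits; if all logits are near zero, each candidate gets probability 1/5 and the loss is ln 5. The sketch below checks that arithmetic in plain Python. The helper name and the all-zero logits are assumptions for illustration and are not taken from the trained model.

```python
import math


def softmax_cross_entropy(logits, labels):
    """Categorical cross-entropy computed from raw logits for one example."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    log_probs = [z - log_sum for z in logits]
    return -sum(y * lp for y, lp in zip(labels, log_probs))


labels = [1, 0, 0, 0, 0]               # one positive followed by num_ns = 4 negatives
untrained_logits = [0.0] * 5           # assumed: dot products near zero before training
loss = softmax_cross_entropy(untrained_logits, labels)
print(round(loss, 4))                  # 1.6094
```

As training raises the logit of the true context word relative to the negatives, this value drops, which matches the decreasing loss in the log.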

In [108]:
%tensorboard --logdir logs

### Embedding lookup and analysis

Obtain the weights from the model using `Model.get_layer` and `Layer.get_weights`. The `TextVectorization.get_vocabulary` function provides the vocabulary to build a metadata file with one token per line.

In [None]:
weights = word2vec.get_layer('w2v_embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

Create and save the vectors and metadata files:

In [110]:
# Context managers close both files even if a write fails.
with io.open('vectors.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('metadata.tsv', 'w', encoding='utf-8') as out_m:
  for index, word in enumerate(vocab):
    if index == 0:
      continue  # skip 0, it's padding.
    vec = weights[index]
    out_v.write('\t'.join([str(x) for x in vec]) + "\n")
    out_m.write(word + "\n")
