From *TensorFlow*:

**word2vec** is not a singular algorithm, rather, it is a family of model architectures and optimizations that can be used to learn word embeddings from large datasets. Embeddings learned through word2vec have proven to be successful on a variety of downstream natural language processing tasks.

Two methods for learning word representations:

- **Continuous bag-of-words (CBOW)**: predicts a word given the neighboring context
- **Skip-gram**: predicts the neighboring context of a word, given the word itself
    - The full **loss** needs the probability of every word in the vocabulary appearing in the *context* (a softmax over the whole vocabulary); with a large vocabulary this is expensive to compute
    - So noise contrastive estimation (**NCE**) loss is used as an efficient approximation of the full **softmax**
        - When the goal is learning embeddings, NCE can be simplified to **negative sampling**
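To make the cost difference concrete, here is a minimal pure-Python sketch that compares the full softmax loss with a negative-sampling loss for a single (target, context) pair. The helper names (`full_softmax_loss`, `negative_sampling_loss`), the tiny embedding size, and the random vectors are all assumptions for illustration, not code from the TensorFlow tutorial.

```python
import math
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def full_softmax_loss(target_vec, context_vecs, context_id):
    """-log softmax probability of the true context word.

    Touches every row of context_vecs, so the cost grows with vocabulary size.
    """
    scores = [dot(target_vec, u) for u in context_vecs]
    max_s = max(scores)
    log_norm = max_s + math.log(sum(math.exp(s - max_s) for s in scores))
    return log_norm - scores[context_id]


def negative_sampling_loss(target_vec, context_vecs, context_id, negative_ids):
    """Binary logistic loss on 1 positive and len(negative_ids) negatives.

    Touches only 1 + len(negative_ids) rows, regardless of vocabulary size.
    """
    loss = -log_sigmoid(dot(target_vec, context_vecs[context_id]))
    for n in negative_ids:
        loss -= log_sigmoid(-dot(target_vec, context_vecs[n]))
    return loss


# Hypothetical toy setup: 8 words, 4-dimensional embeddings.
rng = random.Random(42)
toy_vocab_size, dim = 8, 4
context_vecs = [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
                for _ in range(toy_vocab_size)]
target_vec = [rng.uniform(-0.5, 0.5) for _ in range(dim)]

print(full_softmax_loss(target_vec, context_vecs, context_id=7))
print(negative_sampling_loss(target_vec, context_vecs, context_id=7,
                             negative_ids=[2, 1, 4, 3]))
```

The two numbers are not directly comparable as values; the point is that the second function only looks at five context vectors, while the first looks at all of them.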

From *TensorFlow*:

A negative sample is defined as a (`target_word`, `context_word`) pair such that the `context_word` does not appear in the `window_size` neighborhood of the `target_word`.
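The definition can be checked directly. The sketch below uses the same example sentence as the cells further down in this notebook; `context_window` and `is_negative_pair` are hypothetical helper names, not TensorFlow functions.

```python
def context_window(sequence, position, window_size):
    """Words within window_size positions of `position`, excluding the word itself."""
    start = max(0, position - window_size)
    end = min(len(sequence), position + window_size + 1)
    return [sequence[i] for i in range(start, end) if i != position]


def is_negative_pair(sequence, target_word, context_word, window_size):
    """True if context_word never falls inside the window of any occurrence of target_word."""
    for pos, word in enumerate(sequence):
        if word == target_word and context_word in context_window(sequence, pos, window_size):
            return False
    return True


tokens = "the wide road shimmered in the hot sun".split()

print(is_negative_pair(tokens, "the", "sun", window_size=2))   # "sun" is 2 steps from the second "the"
print(is_negative_pair(tokens, "wide", "sun", window_size=2))  # "sun" is never near "wide"
```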

In [1]:
import io
import re
import string
import tqdm

import numpy as np

import tensorflow as tf
from tensorflow.keras import layers

In [2]:
SEED = 42
AUTOTUNE = tf.data.AUTOTUNE

# Tokenize and Vectorize

In [3]:
# tokenize
sentence = "The wide road shimmered in the hot sun"
tokens = list(sentence.lower().split())
print(len(tokens))

8


In [4]:
# create vocab
vocab, index = {}, 1  # start indexing from 1
vocab['<pad>'] = 0  # add a padding token
for token in tokens:
  if token not in vocab:
    vocab[token] = index
    index += 1
vocab_size = len(vocab)
print(vocab)

{'<pad>': 0, 'the': 1, 'wide': 2, 'road': 3, 'shimmered': 4, 'in': 5, 'hot': 6, 'sun': 7}


In [5]:
# inverse vocab, to use numerical indexes
inverse_vocab = {index: token for token, index in vocab.items()}
print(inverse_vocab)

{0: '<pad>', 1: 'the', 2: 'wide', 3: 'road', 4: 'shimmered', 5: 'in', 6: 'hot', 7: 'sun'}


In [6]:
# vectorize
example_sequence = [vocab[word] for word in tokens]
print(example_sequence)

[1, 2, 3, 4, 5, 1, 6, 7]


# Skip-grams

In [7]:
window_size = 2

positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
      example_sequence,
      vocabulary_size=vocab_size,
      window_size=window_size,
      negative_samples=0
)
print(len(positive_skip_grams))

26


In [8]:
# positive sampling
for target, context in positive_skip_grams[:5]:
  print(f"({target}, {context}): ({inverse_vocab[target]}, {inverse_vocab[context]})")

(1, 7): (the, sun)
(7, 6): (sun, hot)
(4, 3): (shimmered, road)
(3, 4): (road, shimmered)
(1, 3): (the, road)
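To see where the 26 pairs come from, the windowing can be reproduced by hand. This is a minimal sketch, assuming every token inside the window counts as context; `manual_skip_grams` is a hypothetical name, and unlike the Keras helper it does not shuffle the pairs or skip the padding index.

```python
def manual_skip_grams(sequence, window_size):
    """All (target, context) pairs where context is within window_size of target."""
    pairs = []
    for i, target in enumerate(sequence):
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, sequence[j]))
    return pairs


example_sequence = [1, 2, 3, 4, 5, 1, 6, 7]
pairs = manual_skip_grams(example_sequence, window_size=2)
print(len(pairs))
print(pairs[:5])
```

Words at the edges of the sentence have fewer neighbors (2 or 3 instead of 4), which is why the total is 26 rather than 8 × 4 = 32.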


In [9]:
# negative sampling
target_word, context_word = positive_skip_grams[0]

num_ns = 4 # Set the number of negative samples per positive context

context_class = tf.reshape(tf.constant(context_word, dtype="int64"), (1, 1))

negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
    true_classes=context_class,  # class that should be sampled as 'positive'
    num_true=1,  # each positive skip-gram has 1 positive context class
    num_sampled=num_ns,  # number of negative context words to sample
    unique=True,  # all the negative samples should be unique
    range_max=vocab_size,  # pick index of the samples from [0, vocab_size)
    seed=SEED,  # seed for reproducibility
    name="negative_sampling"  # name of this operation
)
print(negative_sampling_candidates)
print([inverse_vocab[index.numpy()] for index in negative_sampling_candidates])

tf.Tensor([2 1 4 3], shape=(4,), dtype=int64)
['wide', 'the', 'shimmered', 'road']
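The sampled indices lean toward small values because the sampler's base distribution is approximately log-uniform (Zipfian). The sketch below evaluates the formula given in the TensorFlow documentation, `P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1)`; the function name `log_uniform_prob` is an assumption for illustration. With `unique=True` the sampler rejects repeats, so the frequencies actually observed can differ from these base probabilities.

```python
import math


def log_uniform_prob(class_id, range_max):
    """Base probability of class_id under the log-uniform (Zipfian) distribution."""
    return (math.log(class_id + 2) - math.log(class_id + 1)) / math.log(range_max + 1)


vocab_size = 8  # same size as the toy vocabulary, including '<pad>'
probs = [log_uniform_prob(i, vocab_size) for i in range(vocab_size)]

for index, p in enumerate(probs):
    print(index, round(p, 3))
```

The distribution assumes that low indices correspond to frequent words. In this toy vocabulary the indices follow order of appearance instead, and index 0 (`<pad>`) receives the largest base probability.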
