## 2. Apply word2vec to the Wilding chord dataset
We wish to apply the skip-gram word2vec model to learn an embedding for each chord. Ideally, chords that appear in similar contexts will have similar word vectors.

We will follow [this TensorFlow tutorial](https://www.tensorflow.org/tutorials/text/word2vec) to train the model with negative sampling.

In [6]:
import re
import tqdm
import io

from pathlib import Path

import numpy as np

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.sequence import skipgrams

In [2]:
%load_ext tensorboard

In [3]:
AUTOTUNE = tf.data.AUTOTUNE
SEED = 42

In [7]:
CHORDS_SAVEPATH = Path('chords.txt')

As a reminder, the skip-gram model works by asking a neural net to maximize the probability of predicting "nearby" words or "neighbor words" given a single word.

<img src="https://miro.medium.com/max/1200/0*FTfdlZ7yDBoQ8c9W.png" width="500px">
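
To make the idea of "nearby" chords concrete, here is a minimal sketch of how (target, context) pairs fall out of a chord sequence. The `skipgram_pairs` helper and the example progression are hypothetical and only for illustration. It is not the Keras `skipgrams` function used in this notebook, which additionally subsamples frequent tokens when given a sampling table and shuffles the pairs.

```python
def skipgram_pairs(sequence, window_size):
    """Return every (target, context) pair within `window_size` positions."""
    pairs = []
    for i, target in enumerate(sequence):
        # Clamp the window to the bounds of the sequence.
        start = max(0, i - window_size)
        stop = min(len(sequence), i + window_size + 1)
        for j in range(start, stop):
            if j != i:  # a token is not its own context
                pairs.append((target, sequence[j]))
    return pairs


# Hypothetical progression, window of one chord on each side.
progression = ["Dm7", "G7", "CM7", "A7"]
for target, context in skipgram_pairs(progression, window_size=1):
    print(f"{target} -> {context}")
```

With a window of 1, each interior chord is paired with both of its neighbours and each end chord with its single neighbour.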

**Negative sampling** is an optimization technique. Without it, the model would need to update the weights of *every* word that is unrelated to the target word (everywhere it should output 0). This scales with the vocabulary, so negative sampling offers a way of only updating a few of these "negative" words to save time and boost performance. The original paper recommends using 5-20 words for a smaller dataset, so we'll choose 20 negative samples.
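
As a rough illustration of what one training example looks like under negative sampling, the sketch below builds a single context row (the true context followed by `num_ns` negatives) and its label row. The `make_training_row` helper is hypothetical, and it draws negatives uniformly at random for simplicity. The notebook itself uses `tf.random.log_uniform_candidate_sampler`, which favors low (frequent) token indices.

```python
import random


def make_training_row(context_id, vocab_size, num_ns, rng):
    """Build one (context, label) row: the true context, then `num_ns` negatives."""
    if vocab_size <= num_ns:
        raise ValueError("vocab_size must exceed num_ns to draw unique negatives")
    negatives = []
    while len(negatives) < num_ns:
        candidate = rng.randrange(vocab_size)
        # Reject the true context and duplicates so the negatives are unique.
        if candidate != context_id and candidate not in negatives:
            negatives.append(candidate)
    context = [context_id] + negatives
    label = [1] + [0] * num_ns  # only the first entry is a real neighbour
    return context, label


rng = random.Random(42)
context, label = make_training_row(context_id=3, vocab_size=50, num_ns=4, rng=rng)
print(context)
print(label)
```

The model only has to score these `num_ns + 1` candidates per example, rather than every token in the vocabulary.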

In [33]:
# Generates skip-gram pairs with negative sampling for a list of sequences
# (int-encoded sentences) based on window size, number of negative samples
# and vocabulary size.
def generate_training_data(sequences, window_size, num_ns, vocab_size, seed):
    # Elements of each training example are appended to these lists.
    targets, contexts, labels = [], [], []

    # Build the sampling table for `vocab_size` tokens.
    sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)

    # Iterate over all sequences (sentences) in the dataset.
    for sequence in tqdm.tqdm(sequences):
        # Generate positive skip-gram pairs for a sequence (sentence).
        positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
              sequence,
              vocabulary_size=vocab_size,
              sampling_table=sampling_table,
              window_size=window_size,
              negative_samples=0)

        # Iterate over each positive skip-gram pair to produce training examples
        # with a positive context word and negative samples.
        for target_word, context_word in positive_skip_grams:
            context_class = tf.expand_dims(
                tf.constant([context_word], dtype="int64"), 1)
            negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
                true_classes=context_class,
                num_true=1,
                num_sampled=num_ns,
                unique=True,
                range_max=vocab_size,
                seed=seed,
                name="negative_sampling")

            # Build context and label vectors (for one target word)
            negative_sampling_candidates = tf.expand_dims(
                negative_sampling_candidates, 1)

            context = tf.concat([context_class, negative_sampling_candidates], 0)
            label = tf.constant([1] + [0]*num_ns, dtype="int64")

            # Append each element from the training example to global lists.
            targets.append(target_word)
            contexts.append(context)
            labels.append(label)

    return targets, contexts, labels

Create a TF TextLineDataset from the chords we parsed earlier.

In [34]:
text_ds = tf.data.TextLineDataset(CHORDS_SAVEPATH).filter(lambda x: tf.cast(tf.strings.length(x), bool))

Find the maximum sequence length:

In [35]:
max_len = 0
for x in text_ds:
    length = len(x.numpy().decode("utf-8").split())
    max_len = max(max_len, length)
max_len

93

### Define vectorizer

Now we must define a vectorization layer for our Word2Vec model:

In [36]:
# Define the vocabulary size and the number of words in a sequence.
vocab_size = 4096
sequence_length = max_len

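# Use the `TextVectorization` layer to split strings on whitespace and map
# tokens to integers. The identity `standardize` keeps chord symbols intact
# (no lowercasing or punctuation stripping). No `output_sequence_length` is
# set, so each batch is padded to the length of its longest sequence.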
vectorize_layer = layers.TextVectorization(
    standardize=lambda data: data,
    max_tokens=vocab_size,
    pad_to_max_tokens=True,
    output_mode='int')

In [37]:
vectorize_layer.adapt(text_ds.batch(1024))

Let's print the first 20 tokens in our vocabulary:

In [38]:
inverse_vocab = vectorize_layer.get_vocabulary()
print(inverse_vocab[:20])

['', '[UNK]', 'G7', 'Dm7', 'Am7', 'CM7', 'C', 'C7', 'A7', 'E7', 'D7', 'F7', 'Em7', 'FM7', 'B%7', 'Am', 'Eaug7', 'Bb7', 'Gm7', 'B7']
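
The ordering above (padding token, then `[UNK]`, then chords by descending frequency) can be mimicked in plain Python. This is only a sketch of the idea, not the `TextVectorization` implementation: `build_vocab` and `encode` are hypothetical helpers, the input lines are made up, and tie-breaking between equally frequent tokens may differ from the layer's.

```python
from collections import Counter


def build_vocab(lines, max_tokens=None):
    """Index 0 is padding, index 1 is [UNK], then tokens by descending count."""
    counts = Counter(token for line in lines for token in line.split())
    tokens = [token for token, _ in counts.most_common()]
    vocab = ["", "[UNK]"] + tokens
    return vocab if max_tokens is None else vocab[:max_tokens]


def encode(line, vocab):
    """Map each whitespace-separated token to its index, or 1 if unknown."""
    index = {token: i for i, token in enumerate(vocab)}
    return [index.get(token, 1) for token in line.split()]


# Made-up chord lines, for illustration only.
toy_lines = ["Dm7 G7 CM7", "Dm7 G7", "G7"]
toy_vocab = build_vocab(toy_lines)
print(toy_vocab)                         # ['', '[UNK]', 'G7', 'Dm7', 'CM7']
print(encode("Dm7 G7 Bb7", toy_vocab))   # [3, 2, 1]
```

Note that a chord missing from the vocabulary (here `Bb7`) silently maps to the `[UNK]` index, which is worth keeping in mind when looking up embeddings later.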


In [39]:
text_vector_ds = text_ds.batch(1024).prefetch(AUTOTUNE).map(vectorize_layer).unbatch()

In [40]:
sequences = list(text_vector_ds.as_numpy_iterator())

We can also illustrate our vectorizer in action:

In [41]:
for seq in sequences[:2]:
  print(f"{seq} => {[inverse_vocab[i] for i in seq]}")

[15  3  9 15  8  3  2  5 13 14 19 89  9 45  4 16  9 66 15 69  4 16  9 66
 15 15 30 19  9  8  3  2  7 11 19  9 15 30 19  9  8  3  2  7 11  9 15  8
 31  8 31  8 31  8 66  8 31 14  9  8 10 18  7 11 17 11 17 11  9 15  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0] => ['Am', 'Dm7', 'E7', 'Am', 'A7', 'Dm7', 'G7', 'CM7', 'FM7', 'B%7', 'B7', 'EM', 'E7', 'AM7', 'Am7', 'Eaug7', 'E7', 'Do7', 'Am', 'Am,M7', 'Am7', 'Eaug7', 'E7', 'Do7', 'Am', 'Am', 'Gb%7', 'B7', 'E7', 'A7', 'Dm7', 'G7', 'C7', 'F7', 'B7', 'E7', 'Am', 'Gb%7', 'B7', 'E7', 'A7', 'Dm7', 'G7', 'C7', 'F7', 'E7', 'Am', 'A7', 'Dm', 'A7', 'Dm', 'A7', 'Dm', 'A7', 'Do7', 'A7', 'Dm', 'B%7', 'E7', 'A7', 'D7', 'Gm7', 'C7', 'F7', 'Bb7', 'F7', 'Bb7', 'F7', 'E7', 'Am', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
[ 5 37 11 39 34 22 29  3  2  5  4  3  2  5 37 11 39 34 22 29  3  2  5  4
  3  2  5  4  3  2 52 42  3  2  5 37 11 39 34 22 29  3  2  5  0  0  0  0
  0  0  0  0  0  0  0  0

### Prepare data

In [42]:
targets, contexts, labels = generate_training_data(
    sequences=sequences,
    window_size=2,
    num_ns=20,
    vocab_size=vocab_size,
    seed=SEED)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 76/76 [00:00<00:00, 1958.25it/s]


In [43]:
targets = np.array(targets)
contexts = np.array(contexts)[:,:,0]
labels = np.array(labels)
print(f"targets.shape: {targets.shape}")
print(f"contexts.shape: {contexts.shape}")
print(f"labels.shape: {labels.shape}")

targets.shape: (227,)
contexts.shape: (227, 21)
labels.shape: (227, 21)


Now we create a dataset from our targets, contexts, and labels.

In [44]:
BATCH_SIZE = 16
BUFFER_SIZE = 50
dataset = tf.data.Dataset.from_tensor_slices(((targets, contexts), labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
print(dataset)

<BatchDataset element_spec=((TensorSpec(shape=(16,), dtype=tf.int64, name=None), TensorSpec(shape=(16, 21), dtype=tf.int64, name=None)), TensorSpec(shape=(16, 21), dtype=tf.int64, name=None))>


In [45]:
dataset = dataset.cache().prefetch(buffer_size=AUTOTUNE)
print(dataset)

<PrefetchDataset element_spec=((TensorSpec(shape=(16,), dtype=tf.int64, name=None), TensorSpec(shape=(16, 21), dtype=tf.int64, name=None)), TensorSpec(shape=(16, 21), dtype=tf.int64, name=None))>


### Define and train model

In [46]:
class Word2Vec(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, num_ns):
        super(Word2Vec, self).__init__()
        self.target_embedding = layers.Embedding(vocab_size,
                                                 embedding_dim,
                                                 input_length=1,
                                                 name="w2v_embedding")
        self.context_embedding = layers.Embedding(vocab_size,
                                                  embedding_dim,
                                                  input_length=num_ns+1)

    def call(self, pair):
        target, context = pair
        # target: (batch, dummy?)
        # context: (batch, context)
        if len(target.shape) == 2:
            target = tf.squeeze(target, axis=1)
        # target: (batch,)
        word_emb = self.target_embedding(target)
        # word_emb: (batch, embed)
        context_emb = self.context_embedding(context)
        # context_emb: (batch, context, embed)
        dots = tf.einsum('be,bce->bc', word_emb, context_emb)
        # dots: (batch, context)
        return dots
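
The `einsum` expression `'be,bce->bc'` in `call` computes, for each example in the batch, the dot product between the target embedding and each of its context embeddings. The NumPy sketch below checks this against an explicit loop. The shapes and random values are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, context, embed = 2, 5, 8  # arbitrary sizes for the demonstration

word_emb = rng.normal(size=(batch, embed))              # (batch, embed)
context_emb = rng.normal(size=(batch, context, embed))  # (batch, context, embed)

# Same contraction as in the model: sum over the embedding axis `e`.
dots = np.einsum("be,bce->bc", word_emb, context_emb)

# Explicit version: one dot product per (example, context slot).
expected = np.empty((batch, context))
for b in range(batch):
    for c in range(context):
        expected[b, c] = np.dot(word_emb[b], context_emb[b, c])

print(dots.shape)                    # (2, 5)
print(np.allclose(dots, expected))   # True
```

Each row of `dots` therefore holds one logit per candidate context, with the true context in the first column.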

We'll more or less arbitrarily define our embedding dimension to be 128. We set `num_ns` to 20 so that it matches the 20 negative samples per positive pair generated above.

In [167]:
embedding_dim = 128
word2vec = Word2Vec(vocab_size, embedding_dim, num_ns=20)
word2vec.compile(optimizer='adam',
                 loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                 metrics=['accuracy'])
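
For intuition about the loss, the following sketch computes softmax cross-entropy by hand for a single example, which is the per-example quantity that `CategoricalCrossentropy(from_logits=True)` evaluates. The `softmax_cross_entropy` helper and the logit values are made up for illustration, and only four negatives are used to keep the example short.

```python
import math


def softmax_cross_entropy(logits, labels):
    """Cross-entropy between one-hot `labels` and softmax(`logits`)."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(y * (z - log_norm) for y, z in zip(labels, logits))


labels = [1, 0, 0, 0, 0]  # true context first, then four negatives

# All-zero logits: the model cannot tell the candidates apart.
loss_untrained = softmax_cross_entropy([0.0] * 5, labels)
# A high score for the true context and low scores for the negatives.
loss_trained = softmax_cross_entropy([4.0, -1.0, -1.0, -1.0, -1.0], labels)

print(round(loss_untrained, 4))  # 1.6094, i.e. ln(5)
print(loss_trained < loss_untrained)  # True
```

Training pushes the dot product for the true context up relative to the negatives, which drives this loss toward zero.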

In [168]:
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

Now we fit our Word2Vec model to our dataset.

In [169]:
word2vec.fit(dataset, epochs=30, callbacks=[tensorboard_callback])

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x7fd9c57206d8>

In [42]:
%tensorboard --logdir logs

Reusing TensorBoard on port 6006 (pid 72687), started 0:27:01 ago. (Use '!kill 72687' to kill it.)

### Save embedding matrix

Let's load the weights and vocabulary from the embedding layer of our Word2Vec model.

In [170]:
weights = word2vec.get_layer('w2v_embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

Finally, let's save the embeddings and vocabulary to file (`.tsv` for tab-separated-values).

In [171]:
Path('embeddings/wilding-w2v/').mkdir(parents=True, exist_ok=True)

In [172]:
# Context managers close both files even if a write fails.
with io.open('embeddings/wilding-w2v/embedding.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('embeddings/wilding-w2v/vocab.tsv', 'w', encoding='utf-8') as out_m:
    for index, word in enumerate(vocab):
        if index == 0:
            continue  # skip 0, it's padding.
        vec = weights[index]
        out_v.write('\t'.join([str(x) for x in vec]) + "\n")
        out_m.write(word + "\n")
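
If the saved files need to be read back later, something along these lines should work. It is a sketch that assumes the two-file layout written above (one tab-separated vector per line, one token per line, in the same order). `load_embeddings` is a hypothetical helper, and the demonstration uses tiny made-up files in a temporary directory rather than the real output.

```python
import tempfile
from pathlib import Path


def load_embeddings(vectors_path, vocab_path):
    """Read the two TSV files back into a {token: vector} dictionary."""
    words = Path(vocab_path).read_text(encoding="utf-8").splitlines()
    rows = Path(vectors_path).read_text(encoding="utf-8").splitlines()
    vectors = [[float(x) for x in row.split("\t")] for row in rows]
    if len(words) != len(vectors):
        raise ValueError("vocabulary and vector files have different lengths")
    return dict(zip(words, vectors))


# Round trip on two tiny, made-up files.
with tempfile.TemporaryDirectory() as tmp:
    vectors_file = Path(tmp) / "embedding.tsv"
    vocab_file = Path(tmp) / "vocab.tsv"
    vectors_file.write_text("0.1\t0.2\n0.3\t0.4\n", encoding="utf-8")
    vocab_file.write_text("[UNK]\nG7\n", encoding="utf-8")
    embeddings = load_embeddings(vectors_file, vocab_file)

print(embeddings["G7"])  # [0.3, 0.4]
```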

These embeddings can be loaded in the [TensorFlow Embedding Projector](https://projector.tensorflow.org/), as shown below.

<img src="images/wilding_embeddings.png" width="500px">

### Explore chord similarity

Let's experimentally verify whether the chord embeddings have learned something about the similarity of chords.

In [173]:
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
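
Beyond comparing hand-picked pairs, cosine similarity can also rank the whole vocabulary against one chord. The sketch below shows one way to do that. `most_similar` is a hypothetical helper and the two-dimensional vectors are made up, so the ranking says nothing about the trained embeddings.

```python
import numpy as np


def most_similar(query, vocab, weights, top_k=3):
    """Rank vocabulary entries by cosine similarity to `query`."""
    index = vocab.index(query)
    norms = np.linalg.norm(weights, axis=1)
    # Guard against division by zero for all-zero rows.
    norms = np.where(norms == 0, 1.0, norms)
    unit = weights / norms[:, None]
    sims = unit @ unit[index]
    order = np.argsort(-sims)  # highest similarity first
    ranked = [(vocab[i], float(sims[i])) for i in order if i != index]
    return ranked[:top_k]


# Made-up vocabulary and vectors, for illustration only.
toy_vocab = ["D7", "Ab7", "FM7"]
toy_weights = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]])
neighbours = most_similar("D7", toy_vocab, toy_weights, top_k=2)
print(neighbours)
```

With the notebook's own `vocab` and `weights`, the same call would list the chords whose embeddings point in the most similar direction to the query chord.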

In [174]:
# Our "starting" chord
chord = 'D7'
# Gbo7 is another voicing for D7b9
# Ab7 is what's known as the tritone substitution for D7
# Gb%7 is another voicing for D9
substitutes = ['Gbo7', 'Ab7', 'Gb%7']
# These chords are very unrelated to D7
not_substitutes = ['EM7', 'FM7', 'Dbm7']
# These are chords that often follow D7
next_chords = ['Gm7', 'G7', 'GM7']

In [175]:
chord_vec = vectorize_layer([chord])[0, 0].numpy()
subs_vec = np.squeeze(vectorize_layer(substitutes).numpy(), axis=-1)
not_subs_vec = np.squeeze(vectorize_layer(not_substitutes).numpy(), axis=-1)
next_chords_vec = np.squeeze(vectorize_layer(next_chords).numpy(), axis=-1)

In [176]:
chord_embedding = weights[chord_vec]
subs_embeddings = [weights[i] for i in subs_vec]
not_subs_embeddings = [weights[i] for i in not_subs_vec]
next_chords_embeddings = [weights[i] for i in next_chords_vec]

In [177]:
print("SUBSTITUTES")
for i, similar in enumerate(subs_embeddings):
    similarity = cosine_similarity(chord_embedding, subs_embeddings[i])
    print(f"{chord} and {substitutes[i]}: {similarity}")

SUBSTITUTES
D7 and Gbo7: 0.11925406754016876
D7 and Ab7: 0.03695443272590637
D7 and Gb%7: -0.07435572147369385


In [178]:
print("NOT SUBSTITUTES")
for i, similar in enumerate(not_subs_embeddings):
    similarity = cosine_similarity(chord_embedding, not_subs_embeddings[i])
    print(f"{chord} and {not_substitutes[i]}: {similarity}")

NOT SUBSTITUTES
D7 and EM7: -0.10294283926486969
D7 and FM7: -0.02963419258594513
D7 and Dbm7: 0.11339980363845825


In [179]:
print("NEXT CHORDS")
for i, similar in enumerate(next_chords_embeddings):
    similarity = cosine_similarity(chord_embedding, next_chords_embeddings[i])
    print(f"{chord} and {next_chords[i]}: {similarity}")

NEXT CHORDS
D7 and Gm7: -0.19272486865520477
D7 and G7: -0.09428456425666809
D7 and GM7: 0.0013562807580456138


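Overall, this is not very promising. Every similarity is close to zero (all fall between roughly -0.19 and 0.12), and the substitutes do not score consistently higher than the unrelated chords. It's difficult to glean from these similarities whether the Word2Vec model is actually learning anything about the function of chords in a jazz context. Next, we'll try to fit an LSTM model to the chords in this dataset and see how it performs.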