# Deep learning Bach. A step-by-step guide.

I decided to merge two of my great passions: for Artificial Intelligence and for Bach music and try to guide You step-by-step through creating a computer system, which by examining Bach's collected works learns itself to compose "like Bach"... Seriously, at the end of this tutorial you will have a working artificial composer and hopefuly you will also understand how and why it works. And I am not going to lie to you: similar systems of course already exist, created by very smart university and corporate reserchers, but also by hobbyists.

We live in exciting times. Amount of data available on-line in public domain is incredible and tools that allow to manipulate that data in really interesting ways (read: Machine Learning) matured. One thing in all of that is really to be thankful for: somehow the idea of sharing prevails and lot of extremely valuable stuff just lies there, waiting to be used and very smart people spend incredible amounts of their time making even more stuff publicly available and understandable. Kudos. I owe them. Hence this guide.

Callout: Why Bach?

Callout: What is a Machine Learning / Deep Learning

So, who is the audience of this guide?

Cerainly you are not an expert in deep learning nor in musicology. You 

So, what are we going to do?

Lets try to sketch

## Plan of attack

As always divide and concquer is highly succesful strategy, so let's try:

1. We will start by creating a workbench we're going to use.
2. Then we need to get a lot of example data to teach our composer
3. Understand the data enough to make it useful
4. Prepare it so it's suitable for ML
5. Build a neural network
6. And teach it using the data we prepared
7. Paradoxically, this is not the end. We now need to uncover all that innate knowledge and make it express itself in writing
8. Now, let's try to ask it to compose something for us
9. Turn that partiture into playable MIDI file and...
10. Finally, play it!

Callout: Useful links.

In [27]:
import numpy as np
import glob
import sys
import collections
import random
import math
from os.path import basename
from itertools import permutations

vocabulary_size = 30**4
MAX_VOICES = 4

def build_dataset(words):
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)
  data = list()
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 0  # dictionary['UNK']
      unk_count = unk_count + 1
    data.append(index)
  count[0][1] = unk_count
  reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys())) 
  return data, count, dictionary, reverse_dictionary

def permutate(s, pat):
    assert len(s) == len(pat), "string len: {} != pattern len: {}".format(len(s), len(pat))
    r = []
    for ind in pat:
        r.append(s[ind])
    return "".join(r)

def read_data_files(path, validation=True):
    """Read data files according to the specified glob pattern
    Optionnaly set aside the last file as validation data.
    No validation data is returned if there are 5 files or less.
    :param directory: for example "data/*.txt"
    :param validation: if True (default), sets the last file aside as validation data
    :return: training data, validation data, list of loaded file names with ranges
     If validation is
    """
    codetext = []
    opusranges = []
    bachlist = glob.glob(path + '/**/*.txt', recursive=True)
    for bachfile in bachlist:
        bachtext = open(bachfile, "r")
        start = len(codetext)
        bars = (bachtext.read()).split("!")
        bars2 = []
        nb_voices = len(bars[0])
        if nb_voices<=MAX_VOICES:
            print("Loading file: {} ; {} voices".format(bachfile, nb_voices))
            for bar in bars: 
                bar2 = bar.ljust(MAX_VOICES, " ")
                #print("'{}', '{}'".format(bar, bar2))
                bars2.append(bar2)
            #print("bars2:", bars2)
            codetext.extend(bars2)
            end = len(codetext)
            opusranges.append({"start": start, "end": end, "name": basename(bachfile).split(".")[0]})
            bachtext.close()

            patterns = list(permutations(range(MAX_VOICES)))

            for pattern_no in range(1,len(patterns)):
                start2 = len(codetext)
                bars = []
                for j in range(start, end-1): #iterate over the whole opus except the final divider ("********")
                    #print(codetext[j], patterns[pattern_no])
                    bars.append(permutate(codetext[j],patterns[pattern_no]))
                assert codetext[end-1] == "********", "expecting '********', instead '{}'".format(codetext[end])
                bars.append(codetext[end-1]) 
                codetext.extend(bars)
                end2 = len(codetext)
                opusranges.append({"start": start2, "end": end2, "name": "Perm"+str(pattern_no)+basename(bachfile).split(".")[0]})
        else:
            print("Skipping file: {} ; {} voices".format(bachfile, nb_voices))
    if len(opusranges) == 0:
        sys.exit("No training data has been found. Aborting.")
    
    total_len = len(codetext)
    
    data, count, dictionary, reverse_dictionary = build_dataset(codetext)
    
    # For validation, use roughly 90K of text,
    # but no more than 10% of the entire text
    # and no more than 1 book in 5 => no validation at all for 5 files or fewer.

    # 10% of the text is how many files ?
    validation_len = 0
    nb_opus1 = 0
    for opus in reversed(opusranges):
        validation_len += opus["end"]-opus["start"]
        nb_opus1 += 1
        if validation_len > total_len // 10:
            break

    # 90K of text is how many books ?
    validation_len = 0
    nb_opus2 = 0
    for opus in reversed(opusranges):
        validation_len += opus["end"]-opus["start"]
        nb_opus2 += 1
        if validation_len > 90*1024:
            break

    # 20% of the books is how many books ?
    nb_opus3 = len(opusranges) // 5

    # pick the smallest
    nb_opus = min(nb_opus1, nb_opus2, nb_opus3)

    if nb_opus == 0 or not validation:
        cutoff = total_len
    else:
        cutoff = opusranges[-nb_opus]["start"]
    validata = data[cutoff:]
    codedata = data[:cutoff]
    return data, codedata, validata, opusranges, count, dictionary, reverse_dictionary

In [18]:
PATH = "../../ml/Untitled Folder/music_rnn/bach_new/txt"

data, codetext, valitext, opusranges, count, dictionary, reverse_dictionary = read_data_files(PATH, validation=True)

#data, count, dictionary, reverse_dictionary = build_dataset(dypthongs)
#print('Most common words (+UNK)', count[:5])
#print('Sample data', data[:10])

Skipping file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/major/9/bjsbmm12.txt ; 9 voices
Skipping file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/major/9/bjsbmm07.txt ; 9 voices
Skipping file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/major/9/bjsbmm14.txt ; 9 voices
Skipping file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/major/9/bwv29sin.txt ; 9 voices
Skipping file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/major/9/bwv667.txt ; 9 voices
Skipping file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/major/11/bwv0202.txt ; 11 voices
Skipping file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/major/7/bwv668.txt ; 7 voices
Skipping file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/major/7/bwv1041b.txt ; 7 voices
Skipping file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/major/7/bwv988.txt ; 7 voices
Skipping file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/major/16/BOURREE.txt ; 16 voices
Skipping file: ../../ml/Untitled Folder/music_rnn/bach

Loading file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/major/4/988-v10.txt ; 4 voices
Loading file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/major/4/988-v04.txt ; 4 voices
Loading file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/major/4/Wtcii17b.txt ; 4 voices
Loading file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/major/4/Wtcii01b.txt ; 4 voices
Loading file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/major/4/vs2-3and.txt ; 4 voices
Loading file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/major/4/cs3-4sar.txt ; 4 voices
Loading file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/major/4/bwv663.txt ; 4 voices
Loading file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/major/4/cs3-6gig.txt ; 4 voices
Loading file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/major/4/bwv653-2.txt ; 4 voices
Loading file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/major/4/bwv552p.txt ; 4 voices
Loading file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/maj

Skipping file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/minor/1/bwv875.txt ; 8 voices
Skipping file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/minor/1/bwv861.txt ; 8 voices
Skipping file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/minor/1/bwv849.txt ; 8 voices
Skipping file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/minor/1/bwv877.txt ; 8 voices
Skipping file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/minor/1/bwv863.txt ; 8 voices
Skipping file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/minor/1/bwv903.txt ; 8 voices
Skipping file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/minor/1/bwv811.txt ; 8 voices
Loading file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/minor/1/vp1-6sad.txt ; 1 voices
Skipping file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/minor/1/bwv971.txt ; 8 voices
Skipping file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/minor/1/Wtcii14b.txt ; 8 voices
Skipping file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/mino

Loading file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/minor/3/sonat_4d.txt ; 3 voices
Loading file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/minor/3/sonat_5a.txt ; 3 voices
Loading file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/minor/3/bwv630sc.txt ; 3 voices
Loading file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/minor/3/1079-02.txt ; 3 voices
Loading file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/minor/3/sonat_5c.txt ; 3 voices
Loading file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/minor/3/988-v21.txt ; 3 voices
Loading file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/minor/3/sonat_5b.txt ; 3 voices
Loading file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/minor/3/Wtcii18b.txt ; 3 voices
Loading file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/minor/3/bwv525-2.txt ; 3 voices
Loading file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/minor/3/Wtcii19b.txt ; 3 voices
Loading file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/

Skipping file: ../../ml/Untitled Folder/music_rnn/bach_new/txt/minor/5/bwv659.txt ; 5 voices


In [19]:
len(data)

6006672

In [24]:
from tensorflow.contrib.keras import models as tfm
from tensorflow.contrib.keras import layers as tfl
from tensorflow.contrib.keras import optimizers
from tensorflow.contrib.keras import regularizers
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from tensorflow.contrib.keras import wrappers as tfw
import tensorflow as tf

n_vocab = len(dictionary)+1
n_embed_size = 20
p_dropout = 0.0

SEQLEN = 64
BATCHSIZE = 32
INTERNALSIZE = 256
NLAYERS = 3
NB_EPOCHS = 30

def rnn_minibatch_sequencer(raw_data, batch_size, sequence_size, nb_epochs):
    """
    Divides the data into batches of sequences so that all the sequences in one batch
    continue in the next batch. This is a generator that will keep returning batches
    until the input data has been seen nb_epochs times. Sequences are continued even
    between epochs, apart from one, the one corresponding to the end of raw_data.
    The remainder at the end of raw_data that does not fit in an full batch is ignored.
    :param raw_data: the training text
    :param batch_size: the size of a training minibatch
    :param sequence_size: the unroll size of the RNN
    :param nb_epochs: number of epochs to train on
    :return:
        x: one batch of training sequences
        y: on batch of target sequences, i.e. training sequences shifted by 1
        epoch: the current epoch number (starting at 0)
    """
    data = np.array(raw_data)
    data_len = data.shape[0]
    # using (data_len-1) because we must provide for the sequence shifted by 1 too
    nb_batches = (data_len - 1) // (batch_size * sequence_size)
    assert nb_batches > 0, "Not enough data, even for a single batch. Try using a smaller batch_size."
    rounded_data_len = nb_batches * batch_size * sequence_size
    xdata = np.reshape(data[0:rounded_data_len], [batch_size, nb_batches * sequence_size])
    ydata = np.reshape(data[1:rounded_data_len + 1], [batch_size, nb_batches * sequence_size])

    for epoch in range(nb_epochs):
        for batch in range(nb_batches):
            x = xdata[:, batch * sequence_size:(batch + 1) * sequence_size]
            y = ydata[:, batch * sequence_size:(batch + 1) * sequence_size]
            x = np.roll(x, -epoch, axis=0)  # to continue the text from epoch to epoch (do not reset rnn state!)
            y = np.roll(y, -epoch, axis=0)
            yield x, y, epoch
            
def rnn_validata_sequencer(raw_data, batch_size, sequence_size, nb_epochs):
    """
    Divides the data into batches of sequences so that all the sequences in one batch
    continue in the next batch. This is a generator that will keep returning batches
    until the input data has been seen nb_epochs times. Sequences are continued even
    between epochs, apart from one, the one corresponding to the end of raw_data.
    The remainder at the end of raw_data that does not fit in an full batch is ignored.
    :param raw_data: the training text
    :param batch_size: the size of a training minibatch
    :param sequence_size: the unroll size of the RNN
    :param nb_epochs: number of epochs to train on
    :return:
        x: one batch of training sequences
        y: on batch of target sequences, i.e. training sequences shifted by 1
        epoch: the current epoch number (starting at 0)
    """
    data = np.array(raw_data)
    data_len = data.shape[0]
    # using (data_len-1) because we must provide for the sequence shifted by 1 too
    nb_batches = (data_len - 1) // (batch_size * sequence_size)
    assert nb_batches > 0, "Not enough data, even for a single batch. Try using a smaller batch_size."
    rounded_data_len = nb_batches * batch_size * sequence_size
    xdata = np.reshape(data[0:rounded_data_len], [batch_size, nb_batches * sequence_size])
    ydata = np.reshape(data[1:rounded_data_len + 1], [batch_size, nb_batches * sequence_size])

    for epoch in range(nb_epochs):
        for batch in range(nb_batches):
            x = xdata[:, batch * sequence_size:(batch + 1) * sequence_size]
            y = ydata[:, batch * sequence_size:(batch + 1) * sequence_size]
            x = np.roll(x, -epoch, axis=0)  # to continue the text from epoch to epoch (do not reset rnn state!)
            y = np.roll(y, -epoch, axis=0)
            yield x, y, epoch
  
def create_model():
# create model
    model = tfm.Sequential()
    model.add(tfl.Embedding(n_vocab, n_embed_size, weigths = [final_embeddings], trainable = False, input_length = SEQLEN))
    model.add(tfl.LSTM(INTERNALSIZE, return_sequences=True,
               input_shape=(SEQLEN, n_embed_size)))  # returns a sequence of vectors of dimension 32
    #model.add(tfl.LSTM(INTERNALSIZE, return_sequences=True))  # returns a sequence of vectors of dimension 32
    model.add(tfl.LSTM(INTERNALSIZE))  # return a single vector of dimension 32
    model.add(tfl.Dense(SEQLEN, activation='softmax'))

    model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
    return model

model = create_model()
print(model.summary())

step = 0
for x, y_, epoch in rnn_minibatch_sequencer(codetext, BATCHSIZE, SEQLEN, nb_epochs=NB_EPOCHS):
    model.train_on_batch(x,y_)
    if step % 50 == 0 and len(valitext) > 0:
        print("Traning epoch: {} / batch:{}, samples: {}".format(epoch, step, (epoch+1)*step*BATCHSIZE*SEQLEN))
        vali_x, vali_y, _ = next(rnn_validata_sequencer(valitext, BATCHSIZE, SEQLEN, nb_epochs=NB_EPOCHS)) 
        loss, accuracy = model.evaluate(vali_x, vali_y, verbose =1)
        print("Loss: {}, Accuracy: {}".format(loss, accuracy))       
    step += 1
    
     

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 64, 20)            3675880   
_________________________________________________________________
lstm_13 (LSTM)               (None, None, 256)         283648    
_________________________________________________________________
lstm_14 (LSTM)               (None, 256)               525312    
_________________________________________________________________
dense_7 (Dense)              (None, 64)                16448     
Total params: 4,501,288
Trainable params: 4,501,288
Non-trainable params: 0
_________________________________________________________________
None
Traning epoch: 0 / batch:0, samples: 0
Loss: 658643.5, Accuracy: 0.0
Traning epoch: 0 / batch:50, samples: 102400
Loss: 658541.4375, Accuracy: 0.0
Traning epoch: 0 / batch:100, samples: 204800
Loss: 660419.0, Accuracy: 0.0
Traning epoch: 0 / batch:150, sample

KeyboardInterrupt: 

In [28]:
data_index = 0

def generate_batch(batch_size, num_skips, skip_window):
  global data_index
  assert batch_size % num_skips == 0
  assert num_skips <= 2 * skip_window
  batch = np.ndarray(shape=(batch_size), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
  span = 2 * skip_window + 1 # [ skip_window target skip_window ]
  buffer = collections.deque(maxlen=span)
  for _ in range(span):
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  for i in range(batch_size // num_skips):
    target = skip_window  # target label at the center of the buffer
    targets_to_avoid = [ skip_window ]
    for j in range(num_skips):
      while target in targets_to_avoid:
        target = random.randint(0, span - 1)
      targets_to_avoid.append(target)
      batch[i * num_skips + j] = buffer[skip_window]
      labels[i * num_skips + j, 0] = buffer[target]
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  return batch, labels

In [83]:
batch_size = 128
embedding_size = 20 # Dimension of the embedding vector.
skip_window = 1 # How many words to consider left and right.
num_skips = 2 # How many times to reuse an input to generate a label.
# We pick a random validation set to sample nearest neighbors. here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent. 
valid_size = 16 # Random set of words to evaluate similarity on.
valid_window = 100 # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))
num_sampled = 64 # Number of negative examples to sample.

graph = tf.Graph()

with graph.as_default(), tf.device('/cpu:0'):

  # Input data.
  train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
  train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
  valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
  
  # Variables.
  # embeddings = tf.Variable(tf.random_uniform([n_vocab, embedding_size], -1.0, 1.0))
  embeddings = tf.Variable(final_embeddings)
  softmax_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                         stddev=1.0 / math.sqrt(embedding_size)))
  softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Model.
  # Look up embeddings for inputs.
  embed = tf.nn.embedding_lookup(embeddings, train_dataset)
  # Compute the softmax loss, using a sample of the negative labels each time.
  loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases, inputs=embed,
                               labels=train_labels, num_sampled=num_sampled, num_classes=vocabulary_size))

  # Optimizer.
  # Note: The optimizer will optimize the softmax_weights AND the embeddings.
  # This is because the embeddings are defined as a variable quantity and the
  # optimizer's `minimize` method will by default modify all variable quantities 
  # that contribute to the tensor it is passed.
  # See docs on `tf.train.Optimizer.minimize()` for more details.
  optimizer = tf.train.AdamOptimizer(0.001).minimize(loss)
  
  # Compute the similarity between minibatch examples and all embeddings.
  # We use the cosine distance:
  norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
  normalized_embeddings = embeddings / norm
  valid_embeddings = tf.nn.embedding_lookup(
    normalized_embeddings, valid_dataset)
  similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))

In [84]:
num_steps = 142001

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  average_loss = 0
  for step in range(num_steps):
    batch_data, batch_labels = generate_batch(
      batch_size, num_skips, skip_window)
    feed_dict = {train_dataset : batch_data, train_labels : batch_labels}
    _, l = session.run([optimizer, loss], feed_dict=feed_dict)
    average_loss += l
    if step % 2000 == 0:
        if step > 0:
            average_loss = average_loss / 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print('Average loss at step %d: %f' % (step, average_loss))
            average_loss = 0
        sim = similarity.eval()
        for i in range(valid_size):
            valid_word = reverse_dictionary[valid_examples[i]]
            top_k = 6 # number of nearest neighbors
            nearest = (-sim[i, :]).argsort()[1:top_k+1]
            log = 'Nearest to %s:' % valid_word
            for k in range(top_k):
                close_word = reverse_dictionary[nearest[k]]
                log = '%s %s,' % (log, close_word)
            print(log)
  final_embeddings = normalized_embeddings.eval()

Initialized
Nearest to J   : cUf , O^ T, G   , S_Z , V^bO, F   ,
Nearest to    J:    I,  @ Q,    H, i ZN,  R@ ,    G,
Nearest to  Q  : kG [, Q`]N,  S  ,  P  , MVj , SZ J,
Nearest to   Y :   [ , NV\_,  QZn,   X , E[Ra, wn V,
Nearest to i   : h   , k   , j   , PNLQ, l   , n   ,
Nearest to T   :  \@S, N gb, O`  , \aP , S   , T[ q,
Nearest to  J  :  I  ,  TEQ, O ZS, kO[ ,  Ja ,  eXH,
Nearest to   ] : dLPl,   \ ,   _ , Zjp , nd\ ,   \N,
Nearest to    `:  lCg,    b,    _,  W[ , [lw , QSe ,
Nearest to `   : _   , b   , X m_, iQJ`,  SId, b  D,
Nearest to     : ********,  UL^, sd]m,  u]N, e]Y ,  jaO,
Nearest to U   : V   , S   , nEa , h]XN, L ^U, L G ,
Nearest to    Y:    [, V\ a, Rmm ,  L M,    Z, XbdQ,
Nearest to X   : W   , Z   , bG S, V   , mZp , h c_,
Nearest to   a :  Lch, VM Y, SVS , M_bV,  K m, h Nh,
Nearest to  O  : ]d L, ]Z >, jL  ,  MlY,  isV,  N  ,
Average loss at step 2000: 7.656385
Nearest to J   : O^ T, F   ,  ba], I  ], fiP , S_Z ,
Nearest to    J:    M,    I,  R@ ,  Jqg,    G, 

Nearest to    `:    _,    a,    c,    b,    ],    d,
Nearest to `   : _   , a   , b   , ^   , c   , X m_,
Nearest to     : ********, 4 @ ,  Ofc, k\ d, V  J, f G ,
Nearest to U   : V   , S   , W   , X   , Q   , T   ,
Nearest to    Y:    W,    Z,    X,    \,    [,    ],
Nearest to X   : W   , Z   , Y   , V   , U   , [   ,
Nearest to   a :   b ,   _ ,   d ,   c ,   ` ,   ] ,
Nearest to  O  :  N  ,  M  ,  L  ,  Q  ,  P  ,  Sa ,
Average loss at step 20000: 4.101001
Nearest to J   : I   , L   , F   , fiP , C   , E   ,
Nearest to    J:    I,    L,    G,    E,    N,    K,
Nearest to  Q  :  P  ,  S  , kG [,  U  ,  R  ,  O  ,
Nearest to   Y :   W ,   Z ,   X ,   U ,   \ ,   V ,
Nearest to i   : h   , k   , k  a, j   , c   ,  \md,
Nearest to T   : R   , S   , V   , F   , W   , M   ,
Nearest to  J  :  I  ,  E  ,  G  ,  K  ,  L  ,  C  ,
Nearest to   ] :   \ ,   _ ,   Z ,   [ ,   W ,   Y ,
Nearest to    `:    _,    a,    c, P  b,  Q f, ]dm\,
Nearest to `   : _   , a   , b   , ^   , c   , `  G,
Neare

Nearest to   a :   b ,   d ,   _ ,   c ,   \ ,   ] ,
Nearest to  O  :  S K,  N  ,  M  ,  R  ,  T  , dK\ ,
Average loss at step 38000: 0.833514
Nearest to J   : I   , F   , L   , ?   , @  h, fiP ,
Nearest to    J:    I,    L,    H,    =,    M,    K,
Nearest to  Q  :  P  ,  S  ,  O  , EP  , kq `, kG [,
Nearest to   Y :   X ,   ] ,   [ ,   Z ,   V , kd s,
Nearest to i   : k   , j   , h   , g   , l   , d   ,
Nearest to T   : S   , R   , V   , Q   , X  D, O   ,
Nearest to  J  :  Dd ,  I  ,  @ h,  L  , O ZS,  >  ,
Nearest to   ] :   _ ,   ` ,   ^ ,   \ ,   b ,   Y ,
Nearest to    `:    b,    _,    d,    ],    e,    ^,
Nearest to `   : _   , b   , d   , ]   , a   , ^   ,
Nearest to     : ********, k\ d, 4 @ ,  hI , h @ ,  Ofc,
Nearest to U   : V   , S   , X   , W   , T   , Y   ,
Nearest to    Y:    [,    X,    ],  U [,    ^,    Z,
Nearest to X   : V   , Z   , \   , W   , Wf \, X @ ,
Nearest to   a :   b ,   d ,   _ ,   c ,   ` ,   ] ,
Nearest to  O  :  S K,  Q  ,  N  ,  T  ,  M  ,  R  ,
Avera

Nearest to i   : k   , h   , j   , g   , TdnX, j U ,
Nearest to T   : Q  O, T  @, T  E, S   , S  J, T  H,
Nearest to  J  :  I  ,  @ h, `q i, O ZS,  Dd ,  F  ,
Nearest to   ] :   _ ,  D_ ,   \N,  U[ ,   \ ,   Z ,
Nearest to    `:    _,    b, P  b, P  `,  M _,    d,
Nearest to `   : b   , _   , `  G, a   , UO L, d   ,
Nearest to     : ********,  hI ,   = , k\ d, h @ ,  ukK,
Nearest to U   : aM h, g Cd, UFN , _ e[, C SX, >Y e,
Nearest to    Y:  \Qa, c N_,    [,  I [, J bL,   ET,
Nearest to X   : bG S, Z K , Wf \,  EU\, X @ , V   ,
Nearest to   a :  D_ ,   b ,   d ,  Ed ,   c , P p ,
Nearest to  O  : dK\ , @P  ,  S K, GT  ,  MlY, LP  ,
Average loss at step 58000: 1.508191
Nearest to J   : I   , F   , fiP , ]O [, M  E,  ba],
Nearest to    J:    H,    I,    =,  E P, gv[ ,   _@,
Nearest to  Q  : EP  ,  O  ,  S  , kG [, kq `, ES  ,
Nearest to   Y :   [ , kd s,   X , ]UQ , ]fWZ, bSXL,
Nearest to i   : k   , h   , j   , g   , f   , j U ,
Nearest to T   : S   , V   , Q  O, Y  E, S  E, T  E,
Neare

Nearest to  O  : @P  ,  MlY,  O G, XTid, GT  , dK\ ,
Average loss at step 76000: 1.816220
Nearest to J   : F   , fiP , ]O [, I   , M  E, J  a,
Nearest to    J: gv[ ,    I,    H, N EX,    =,    M,
Nearest to  Q  : EP  , HP  , LVE , EO  , kG [, gM  ,
Nearest to   Y : kd s, ]UQ ,   [ , ] Y],  Lch, biNQ,
Nearest to i   : k   , h   , j   , g ^ , m[  , g   ,
Nearest to T   : T  @, Q  O,  L_T, S   , S  L, Y  E,
Nearest to  J  :  @ h,  Dd , O ZS, VQ M, `q i,  O G,
Nearest to   ] :   _ ,   [ ,  D_ ,   ` ,   Z ,   \N,
Nearest to    `:    _,    b, d_X],  N ^, an N, P  `,
Nearest to `   : _   , b   , \ W , ]   , `  G, ` B ,
Nearest to     : ********,  ukK, k\ d, ]VLE,   Ya, hM V,
Nearest to U   : UFN , C SX, W   , p hM, P p , SL E,
Nearest to    Y: c N_,    X, j O_, M[ S,    [,   @Z,
Nearest to X   : bG S, V   , Z   , Wf \, X @ ,  EU\,
Nearest to   a :   b ,   c ,  Ih ,   ^N, VJY_,  =]],
Nearest to  O  : @P  ,  MlY,  O G, XTid, dK\ ,  S K,


KeyboardInterrupt: 

In [74]:
142000

11.833333333333334

In [80]:
for i in range(16):
    valid_word = reverse_dictionary[valid_examples[i]]
    top_k = 12 # number of nearest neighbors
    nearest = (-sim[i, :]).argsort()[1:top_k+1]
    log = 'Nearest to %s:' % valid_word
    for k in range(top_k):
          close_word = reverse_dictionary[nearest[k]]
          log = '%s %s,' % (log, close_word)
    print(log)

Nearest to b   : `   , cXp , a   , d   , _ H , dg]X, f  O, NRZ , r` i, nN Q, LT\ , fT W,
Nearest to i   : h   , k   , j   , PNLQ, l   , n   , TdnX, ^[g , r   , m   , p   , q   ,
Nearest to  i  :  k  ,  j  ,  h  ,  r  ,  l  ,  k W, Nh  ,  dEl, X ma,  f  ,  hN ,  p  ,
Nearest to  f  :  d  ,  g  ,  i  ,  c  ,  h  ,  e  , ^Qda, [@ d,  fP , anf , EXLT,  g R,
Nearest to  P  : SZ J,  N  ,  Q  , sVf , Uh M,  hPe,  V]n, P nX, gQ^V, S G\,  Mbq, P_fV,
Nearest to U   : V   , S   , nEa , h]XN, L ^U, L G , H]Lg,  K i, Em ], W   , [I  , VP f,
Nearest to T   :  \@S, N gb, O`  , \aP , S   , T[ q, X   , V   , L I_, Lk @, `[Rd,  IOQ,
Nearest to P   : I Y , N   , @[  , E` f, VY_e, @j  , Q   , Xh H,   @f, th S, SHL , e Le,
Nearest to    _:  AQT,    `,    ], QSe ,  Xd^, lXdi, O jb, d\m , U^dN,  O c,   Ub, XQ\ ,
Nearest to    \:    ],    Z,   N^, Sn X, [dS_,   G`, iT]\,  L^d, OLS , WWP , P TD, SLQ ,
Nearest to   X :   Z ,   W ,   V , kX[d, Rrf , ^pU , ]O[ , nbU ,   YC, [aX@,   Y , BSN ,
Nearest to   k :   l 

In [79]:
final_embeddings.shape

(183794, 20)

In [85]:
from pathlib import Path
outfile = Path(PATH + "/final_embeddings_3.npy")
np.save(outfile, final_embeddings)

In [47]:
%pwd

'/Users/Zufek/ml/music_rnn'

In [164]:
b = list(permutations(range(5)))

def permutate(s, pat):
    assert len(s) == len(pat), "Pattern length must match input length"
    r = []
    for ind in pat:
        r.append(s[ind])
    return "".join(r)

for i in range(1,len(b)):
    print(permutate("abcde", b[i]))

abced
abdce
abdec
abecd
abedc
acbde
acbed
acdbe
acdeb
acebd
acedb
adbce
adbec
adcbe
adceb
adebc
adecb
aebcd
aebdc
aecbd
aecdb
aedbc
aedcb
bacde
baced
badce
badec
baecd
baedc
bcade
bcaed
bcdae
bcdea
bcead
bceda
bdace
bdaec
bdcae
bdcea
bdeac
bdeca
beacd
beadc
becad
becda
bedac
bedca
cabde
cabed
cadbe
cadeb
caebd
caedb
cbade
cbaed
cbdae
cbdea
cbead
cbeda
cdabe
cdaeb
cdbae
cdbea
cdeab
cdeba
ceabd
ceadb
cebad
cebda
cedab
cedba
dabce
dabec
dacbe
daceb
daebc
daecb
dbace
dbaec
dbcae
dbcea
dbeac
dbeca
dcabe
dcaeb
dcbae
dcbea
dceab
dceba
deabc
deacb
debac
debca
decab
decba
eabcd
eabdc
eacbd
eacdb
eadbc
eadcb
ebacd
ebadc
ebcad
ebcda
ebdac
ebdca
ecabd
ecadb
ecbad
ecbda
ecdab
ecdba
edabc
edacb
edbac
edbca
edcab
edcba


In [161]:
for i in range(1,len(b)):
    print(permutate("abcd", b[i]))
    

abdc
acbd
acdb
adbc
adcb
bacd
badc
bcad
bcda
bdac
bdca
cabd
cadb
cbad
cbda
cdab
cdba
dabc
dacb
dbac
dbca
dcab
dcba


In [156]:
b

[(0, 1, 2, 3),
 (0, 1, 3, 2),
 (0, 2, 1, 3),
 (0, 2, 3, 1),
 (0, 3, 1, 2),
 (0, 3, 2, 1),
 (1, 0, 2, 3),
 (1, 0, 3, 2),
 (1, 2, 0, 3),
 (1, 2, 3, 0),
 (1, 3, 0, 2),
 (1, 3, 2, 0),
 (2, 0, 1, 3),
 (2, 0, 3, 1),
 (2, 1, 0, 3),
 (2, 1, 3, 0),
 (2, 3, 0, 1),
 (2, 3, 1, 0),
 (3, 0, 1, 2),
 (3, 0, 2, 1),
 (3, 1, 0, 2),
 (3, 1, 2, 0),
 (3, 2, 0, 1),
 (3, 2, 1, 0)]

In [36]:
words.pop()

'********'

In [220]:
from math import log2

In [221]:
log2(len(dictionary))

17.48772229515735

In [112]:
words

['aabb    ', '    hhii', '  kkll  ']