# Deep Lyrics Generation


## Introduction

In this assignment we examine the application of deep learning in lyrics generation. Specifically we make use of LSTM networks to perform language modeling -a form of self-supervised learning- on thousands of lyrics. We evaluate the model using the perplexity score and visualize the underlying language model and embeddings.
The trained models are used to generate new lyrics or verses

## Dataset

The dataset consists mainly on greek pop and trap songs. The lyrics were downloaded from genius.com using a third-party library *lyricsgenius*. 

In [None]:
token = 'WQXXT8S-yLdN5nlPiG-Xp9Ux7qQAOirSfMLFw2gEYhvsF8NYwLowbSY9zUZwXZW1'
genius = Genius(token)
artist = genius.search_artist('Mazonakis')
artist.save_lyrics()

### Preprocessing

Because the data are scrapped directly from the website, we need to perform some preprocessing like removing special characters and headers like [Intro], [Chorous], etc..

This can be done using simple regex rules like below:

In [1]:
import re

def parse_lyrics(lyrics):
  pattern = '\w+'
  regex = re.compile(pattern)
  sentences = [regex.findall(it) for it in lyrics.replace('Lyrics', '\n').replace('Embed', '\n').split('\n') if it not in ('') and not it.startswith('[')][1:]
  filtered = [[t for t in sent] for sent in sentences]
  return filtered

raw = "Gucci Forema[Chorus]\n\
Μα έλα που δε μπορώ\n\
Πλέον ν’ αντισταθώ\n\
Σ' αυτό το Gucci φόρεμα που φοράς\n\
Και στο ρυθμό που απόψε βράδυ το κορμί σου κουνάς\n\
[Verse2]\n\
Γέννημα θρέμμα δυτικής Αττικής\n\
Έχουμε περηφάνεια εδώ εμείς και λόγο τιμής\n\
Μπουρνάζι, Αιγάλεω, Περιστέρι, ξέρω ζόρι τραβάς\n\
Μέρη που εσύ κι οι φίλες σου τα θεωρείτε μπας κλας"

parsed = parse_lyrics(raw)
parsed

[['Μα', 'έλα', 'που', 'δε', 'μπορώ'],
 ['Πλέον', 'ν', 'αντισταθώ'],
 ['Σ', 'αυτό', 'το', 'Gucci', 'φόρεμα', 'που', 'φοράς'],
 ['Και',
  'στο',
  'ρυθμό',
  'που',
  'απόψε',
  'βράδυ',
  'το',
  'κορμί',
  'σου',
  'κουνάς'],
 ['Γέννημα', 'θρέμμα', 'δυτικής', 'Αττικής'],
 ['Έχουμε', 'περηφάνεια', 'εδώ', 'εμείς', 'και', 'λόγο', 'τιμής'],
 ['Μπουρνάζι', 'Αιγάλεω', 'Περιστέρι', 'ξέρω', 'ζόρι', 'τραβάς'],
 ['Μέρη',
  'που',
  'εσύ',
  'κι',
  'οι',
  'φίλες',
  'σου',
  'τα',
  'θεωρείτε',
  'μπας',
  'κλας']]

We can train the model lyric-per-pyric or we can join multiple lyrics together to create verses. This allows the model to generate lyrics that have rhyme:

In [2]:
def join_lyrics(lyrics, k):
  joined = []
  for i in range(0, len(lyrics)-k+1, k):
    item = []
    for j in range(0, k):
      item.extend(lyrics[i+j] + ["."])
    item = item[:-1]
    joined.append(item)
  return joined

join_lyrics(parsed, 2)

[['Μα', 'έλα', 'που', 'δε', 'μπορώ', '.', 'Πλέον', 'ν', 'αντισταθώ'],
 ['Σ',
  'αυτό',
  'το',
  'Gucci',
  'φόρεμα',
  'που',
  'φοράς',
  '.',
  'Και',
  'στο',
  'ρυθμό',
  'που',
  'απόψε',
  'βράδυ',
  'το',
  'κορμί',
  'σου',
  'κουνάς'],
 ['Γέννημα',
  'θρέμμα',
  'δυτικής',
  'Αττικής',
  '.',
  'Έχουμε',
  'περηφάνεια',
  'εδώ',
  'εμείς',
  'και',
  'λόγο',
  'τιμής'],
 ['Μπουρνάζι',
  'Αιγάλεω',
  'Περιστέρι',
  'ξέρω',
  'ζόρι',
  'τραβάς',
  '.',
  'Μέρη',
  'που',
  'εσύ',
  'κι',
  'οι',
  'φίλες',
  'σου',
  'τα',
  'θεωρείτε',
  'μπας',
  'κλας']]

### Tokenization

A simple method to tokenize is to iterate through the lyrics and consider each unseen word as a new token. 

The following code counts all word occurances and creates the frequency histogram:



In [3]:
from collections import Counter, OrderedDict
import json

def extract_lyrics_from_files(files_list: list, concat_lyrics = 1, format="genius") -> list:
  lyrics = []

  if format == "genius":
    for p in files_list:
      f = open('lyrics/' + p)
      data = json.load(f)

      for song in data['songs']:
        new_lyrics = join_lyrics(parse_lyrics(song['lyrics']), concat_lyrics)
        lyrics.extend(new_lyrics)
        
  return lyrics

greek = [
          'Lyrics_AntonisRemos.json',
          'Lyrics_Giannisploutarhos.json',
          'Lyrics_GiorgosMazonakis.json',
          'Lyrics_NikosOikonomopoulos.json',
          'Lyrics_PanosKiamos.json',
          'Lyrics_GiorgosTsalikis.json',
          'Lyrics_IliasVrettos.json',
          'Lyrics_PantelisPantelidis.json',
          'Lyrics_ΜιχάληςΧατζηγιάννηςMichalisHatzigiannis.json',
          'Lyrics_SteliosRokkos.json',
          'Lyrics_GiorgosSabanis.json',
          'Lyrics_Yianniskotsiras.json',
          'Lyrics_GiorgosKakosaios.json',
          'Lyrics_SakisRouvas.json',
          'Lyrics_Stavento.json',
          'Lyrics_Notissfakianakis.json',
          'Lyrics_Thanospetrelis.json',
          'Lyrics_LefterisPantazis.json',
          'Lyrics_DionisisShinas.json',
          'Lyrics_AnnaVissi.json',
          'Lyrics_DespinaVandi.json',
          'Lyrics_ElliKokkinou.json',
          'Lyrics_NatasaTheodoridou.json',
          'Lyrics_FoivosDelivorias.json',
          'Lyrics_JosephineGR.json',
          'Lyrics_HelenaPaparizou.json',
          'Lyrics_KonstantinosKoufos.json',
          'Lyrics_PeggyZina.json',
          'Lyrics_KonstantinosArgiros.json',
          'Lyrics_Melisses.json'
          ]

lyrics = extract_lyrics_from_files(greek, concat_lyrics=1)
word_list = [w for l in lyrics for w in l]
token_set = OrderedDict(Counter(word_list).most_common())

print("Total unique words: {}".format(len([k for k,v in Counter(word_list).items()])))
print("Percentage of unique words that appear only once: {:.2f}%".format(len([k for k,v in Counter(word_list).items() if v == 1])/len([k for k,v in Counter(word_list).items()])*100))

import plotly.graph_objects as go

fig = go.Figure([go.Bar(x=list(token_set.keys()), y=list(token_set.values()))])
fig.update_yaxes(title="Word occurances (log scale)", type="log")
fig.update_layout(bargap=0.0,bargroupgap=0.0,  title="Frequencies of words in lyrics corpus")
fig.show()

Total unique words: 20550
Percentage of unique words that appear only once: 41.84%


As we'd also expect from Zipf's law, almost 40-60% of the words appear only once in the lyrics corpus. This poses a problem for our model, as there won't be enough samples of the rare words. One way to counter this issue is to replace all words that occur less than a threshold value, with a special '\<UNK>' token. 

By doing so, the model generalizes better and also learns to handle unseen words that may be provided at inference time.

The occurance threshold is a hyperparameter that must be tuned depending on the task. A high threshold results in a smaller vocabulary and may result in the model producing very generic sentences with little context. On the contrary, a low threshold increases the vocabulary size and the input size of the model. 

The following code defines a torch dataset from the provided lyrics, using the above tokenization (regex alphanumeric matching with unk replacement):

In [4]:
from torch import utils

class LyricsDatasetRegex(utils.data.Dataset):
  def __init__(self, lyrics, vocab=None, sent_freq=1, token_freq=0, lowercase = False) -> None:
    super().__init__()
    s = self.append_start_end(lyrics)
    self.lowercase = lowercase
    tokenized_sentences, self.token_to_idx, self.idx_to_token, self.token_set = self.tokens_to_indices(s, vocab, token_freq)
    filtered_sentences = self.filter_sents(tokenized_sentences, sent_freq=sent_freq)

    self.padding_idx = 0
    self.dataset = self.create_dataset(filtered_sentences)
    print(f"Dataset samples: {len(self.dataset)}, vocabulary size: {len(self.token_set)} tokens")


  def tokens_to_indices(self, sentences, vocab, token_freq):
    if vocab:
      unique_token_list = vocab
    else:
      if self.lowercase:
        token_list = [t.lower() for s in sentences for t in s]
      else:
        token_list = [t for s in sentences for t in s]
      token_set = Counter(token_list)
      unique_token_list = [k for k,v in token_set.items() if v >= token_freq]
      unique_token_list = sorted(unique_token_list)
      unique_token_list.insert(0, '<pad>')
      unique_token_list.insert(1, '<unk>')

    self.token_to_idx = {ch: i for i, ch in enumerate(unique_token_list)} 
    self.idx_to_token = {i: ch for i, ch in enumerate(unique_token_list)} 

    return [[self.token_index(t) for t in s] for s in sentences], self.token_to_idx, self.idx_to_token, unique_token_list


  def append_start_end(self, sentences):
    return [['#'] + sent + ['&'] for sent in sentences]

  def create_dataset(self, tokenized_sentences):
    dataset = []
    for sent in tokenized_sentences:
      x = sent[:-1]
      y = sent[1:]
      dataset.append([x, y])
    return dataset

  def filter_sents(self, tokenized_sentences, sent_freq):
    filtered_sentences = []
    for i in range(len(tokenized_sentences)):
      c = 0
      for j,v in enumerate(tokenized_sentences[i]):
        if v == 1:
          c += 1
      if c/len(tokenized_sentences[i]) <= sent_freq:
        filtered_sentences.append(tokenized_sentences[i])

    
    unk_counter = 0
    total_words = 0
    for s in filtered_sentences:
      for idx in s:
        total_words += 1
        if idx == self.token_index('<unk>'):
          unk_counter += 1

    print(f'total tokens: {total_words}, unk tokens: {unk_counter}, percentage of unk tokens: {unk_counter/total_words*100}%')
    print(f"Initial sentences: {len(tokenized_sentences)}, filtered sentences: {len(filtered_sentences)}")
    return filtered_sentences

  def token_index(self, ch):
    try:
      if self.lowercase:
        return self.token_to_idx[ch.lower()]
      else:
        return self.token_to_idx[ch]
    except Exception:
      return self.token_to_idx['<unk>']

  def ids_to_tokens(self, ids):
    return " ".join(list(map(lambda it: self.idx_to_token[it], ids)))

  def tokens_to_ids(self, tokens):
    tokens = tokens.split(' ')
    return list(map(lambda it: self.token_index(it), tokens))

  def __len__(self):
    return len(self.dataset)

  def __getitem__(self, index):
    return (
        torch.tensor(self.dataset[index][0]),
        torch.tensor(self.dataset[index][1])
    )

There are two significant limitations with the aforementioned approach.

One is that words with different suffixes or prefixes are treated as completely different tokens, thus have to be learned seperately. 

Consider the words "θέλω", "θέλεις", "ήθελα", "θέλησα", "θέληση". The model cannot directly associate the above words, although they have the same meaning and only the suffix differs.

The second issue arises from the nature of the problem discussed in this assignment. In songwriting, one major aspect is rythm. In fact, rythm between lyrics is more important than maintaining context. Take for example the following chorus, where there is clear rythm between each pair of lyrics, but little contextual meaning:

*'νύχτες δίχως όνομα νύχτες χωρίς σκοπό, 
χαμένοι από χέρι χαμένοι και οι δυο,
ανόητες αγάπες ανόητα φιλιά,
λόγια λόγια λόγια λόγια ψεύτικα'*

Tokenizing purely on words cannot capture directly the relationship between suffixes on different lyrics and, therefore, is more difficult to produce rythm.

For this reason we tried an alternative way of tokenizing words based on subwords. We used the **Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)** library, which provides subword tokenization along with pre-trained embeddings https://github.com/bheinzerling/bpemb . As an example, with BRE the words "θέλω", "θέλεις" could be split as "_θέλ", "ω",  _θέλ", "εις". Because the first subword is common, the model would learn the association of these words.


In [5]:
import torch
from collections import Counter
from torch import utils
from bpemb import BPEmb

class LyricsDatasetBPE(utils.data.Dataset):
  def __init__(self, lyrics, n_vocab=10000) -> None:
    super().__init__()
    s = self.append_start_end(lyrics)

    self.bpemb_el = BPEmb(lang="el", dim=100, vs=n_vocab, add_pad_emb=True)
    self.padding_idx = len(self.bpemb_el.emb)-1

    self.dataset = self.create_dataset(s)
    self.n_vocab = self.bpemb_el.vs+1
    print(f"Dataset samples: {len(self.dataset)}")

  def append_start_end(self, sentences):
    return [['#'] + sent + ['&'] for sent in sentences]

  def create_dataset(self, sentences):
    dataset = []
    for sent in sentences:
      x = " ".join([w.lower() for w in sent[:-1]])
      x = self.bpemb_el.encode_ids(x)

      y = " ".join([w.lower() for w in sent[1:]])
      y = self.bpemb_el.encode_ids(y)
      dataset.append([x, y])
    return dataset

  def embed(self, sentence):
    return self.bpemb_el.embed(sentence)

  def token_to_idx(self, token):
    return self.bpemb_el.encode_ids(token)[0]

  def tokens_to_ids(self, tokens):
    return self.bpemb_el.encode_ids(tokens)

  def ids_to_tokens(self, ids):
    return self.bpemb_el.decode_ids(ids)

  def __len__(self):
    return len(self.dataset)

  def __getitem__(self, index):
    return (
        torch.tensor(self.dataset[index][0]),
        torch.tensor(self.dataset[index][1])
    )

**Byte pair encoding tuning**

The wrapper model takes as input the embeddings dimension as well as the vocabulary size. A smaller vocabulary means that words are split into more subwords. Thus the model can learn to also associate suffixes like "ώ", "είς", "εί" and produce rythm. However, the model must concatenate more subwords to make words, thus increasing the likelihood of producing words that sound corrent but have no meaning.

In our experiments we demonstrate the effect of vocabulary size to the produced lyrics.
We found out that regex tokenization performs slightly better in terms of context, because BPE models need to concatenate more syllables into words, thus
it's easier to lost context.

# Language Model

## Definition

The language model task consists of the AI model trying to predict the next word of a sentence, given the words so far. By training on rich corpora, the model can capture the underlying grammatical, syntactic and contextual information of the text and is able to generate new unseen sentences. Since the nature of text is sequential, LSTM models are often used for this task. There are also non-deep learning models such as hidden markov models and n-grams that use probabilities of counted word sequences to produce the next output. In the recent years, though, Transformers such as GPT-3 have, by far, outperformed the previous architectures, and are considered state of the art. 

In this assignment we chose to train and evaluate LSTM models of various sizes for this task, due to the lack of extended hardware and data resources required to train more complicated architectures like GPT-3. Although the model is simple, we argue that it is capable of creating output that resembles human-crafted lyrics with a reasonable training size and time.

## Training process

We prepare the dataset by adding special tokens at start and end of each sample and in between lyrics. Then we tokenize the sentences with one of the above methods and we create the input and output sequences (the output sentence is the input sentence shifted by one timestep). The model processes the input sentence one word at a time and produces the next predicted word. Then for each pair of predicted/actual word we calculate the mean cross entropy loss and use it to tune the weights.

![image.png](https://miro.medium.com/max/687/1*FCVyju8lPTvfFfxT-rzInA.png)

To speed up the learning process, we train on GPU on batches of 32 samples per batch.


In [6]:
from torch import nn

class LSTM(nn.Module):
  def __init__(self, n_vocab, padding_idx, embedding_weights=None, embedding_size=50, hidden_size=128, num_layers=2) -> None:
      super(LSTM, self).__init__()
      self.n_vocab = n_vocab
      if embedding_weights is not None:
        self.embedding = nn.Embedding.from_pretrained(embedding_weights, freeze=False, padding_idx=padding_idx)
      else:
        self.embedding = nn.Embedding(num_embeddings=n_vocab, embedding_dim=embedding_size, padding_idx=padding_idx)


      self.LSTM = nn.LSTM(embedding_size, hidden_size, num_layers, dropout=0.2)
      self.dropout = nn.Dropout(0.2)
      self.num_layers = num_layers
      self.hidden_size = hidden_size
      self.fc = nn.Linear(hidden_size, n_vocab)

  def forward(self, x, seq_len, state):
    x = self.embedding(x)
    x = self.dropout(x)
    packed = torch.nn.utils.rnn.pack_padded_sequence(x, seq_len, enforce_sorted=False)
    output, (h, c) = self.LSTM(packed, state)
    x, lengths = torch.nn.utils.rnn.pad_packed_sequence(output)
    x = self.fc(x)
    return x, (h,c)

  def init_state(self, batch_size):
        return (torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device),
                torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device))

In [7]:
lyrics = [['Δεν έχει σίδερα η καρδιά σου να με κλείσει . Δεν έχει σίδερα για να με φυλακίσει.']]
print("Training sample: ", lyrics)
dataset = LyricsDatasetBPE(lyrics)
print("---------------------------")

print("Input/Output sequences (ids):")
print(dataset.dataset)
print("---------------------------")

print("Input/output sequences (subwords):")
for sample in dataset.dataset:
    in_ = []
    for w in sample[0]:
      in_.append(dataset.ids_to_tokens([w]))
    print("Input: ", in_)

    out_ = []
    for w in sample[1]:
      out_.append(dataset.ids_to_tokens([w]))

    print("Output: ", out_)

Training sample:  [['Δεν έχει σίδερα η καρδιά σου να με κλείσει . Δεν έχει σίδερα για να με φυλακίσει.']]
Dataset samples: 1
---------------------------
Input/Output sequences (ids):
[[[680, 300, 341, 5, 177, 607, 45, 6780, 1331, 76, 35, 5520, 128, 1573, 300, 341, 5, 177, 607, 101, 76, 35, 7658, 2186, 9894], [300, 341, 5, 177, 607, 45, 6780, 1331, 76, 35, 5520, 128, 1573, 300, 341, 5, 177, 607, 101, 76, 35, 7658, 2186, 9894, 1580]]]
---------------------------
Input/output sequences (subwords):
Input:  ['#', 'δεν', 'έχει', 'σ', 'ίδ', 'ερα', 'η', 'καρδιά', 'σου', 'να', 'με', 'κλεί', 'σει', '.', 'δεν', 'έχει', 'σ', 'ίδ', 'ερα', 'για', 'να', 'με', 'φυλακ', 'ίσει', '.']
Output:  ['δεν', 'έχει', 'σ', 'ίδ', 'ερα', 'η', 'καρδιά', 'σου', 'να', 'με', 'κλεί', 'σει', '.', 'δεν', 'έχει', 'σ', 'ίδ', 'ερα', 'για', 'να', 'με', 'φυλακ', 'ίσει', '.', '&']






## Evaluation process

Evaluation is made using the perplexity score. Perplexity is a measure of how good the language model is at predicting the next word, given the previous words. Low perplexity means that the model can predict the test sequence more confidently. Mathematically, given a test sequence:

$$
X = x_0, x_1, ..., x_n
$$

the score is calculated by the formula:

$$
PP(X) = e^{-\frac{1}{n}∑_{i=1}^n{\log{p(x_i|{x_0, x_1, ..., x_{i-1}})}}}
$$

We can observe that the term in the exponent is exactly the cross entropy loss between the actual and predicted output. This allows us to easily calculate the score during the training process.

The detaset is split at a 85/5/10 ratio (training, validation and test set). At each epoch we perform a forward pass of the model, and calculate the training and validation loss and perplexity. The training stops when the validation perplexity stops decreasing. We also generate a few random lyrics at each epoch to verify the model's perfomance. 

Note that the perplexity metric is dependent on the tokenization process. In the case of regex tokenization with unk tokens, for example, if the vocabulary is small, the model may demonstrate a low perplexity score due to the number of \<unk> predictions, giving a false impression of a good language model.

In [8]:
import torch
import numpy as np

def train(model, train_set, validation_set, max_epochs, batch_size, weights):

  evaluation_data = {
      'train_loss': [],
      'validation_loss': [],
      'validation_perplexity': [],
  }

  # Define model and loss functions
  model = model.to(device)
  criterion = torch.nn.CrossEntropyLoss(ignore_index=train_set.padding_idx, weight=weights)
  optimizer = torch.optim.Adam(model.parameters(), lr=0.002)


  # Define dataloaders
  def collate_pad(batch):    
    in_ = []
    out_ = []
    seq_len_x = []
    seq_len_y = []
    for x,y in batch:
      in_.append(x)
      out_.append(y)
      seq_len_x.append(len(x))
      seq_len_y.append(len(y))

    return torch.nn.utils.rnn.pad_sequence(in_, padding_value=train_set.padding_idx).cuda(), torch.nn.utils.rnn.pad_sequence(out_, padding_value=train_set.padding_idx).cuda(), seq_len_x, seq_len_y

  dataloader = torch.utils.data.DataLoader(train_set, batch_size=batch_size, collate_fn=collate_pad)
  validation_dataloader = torch.utils.data.DataLoader(validation_set, batch_size=64, collate_fn=collate_pad)

  # Train loop
  for epoch in range(max_epochs):
    model.train()

    ## Train in batches
    for batch, (x,y, seq_len_x, seq_len_y) in enumerate(dataloader):
      h, c = model.init_state(batch_size)

      optimizer.zero_grad()
      x = x.to(device)
      y = y.to(device)

      y_pred, (h,c) = model(x, seq_len_x, (h,c))

      loss = criterion(y_pred.transpose(0,1).transpose(1,2), y.transpose(0,1))
      loss.backward()
      optimizer.step()
    
    
    ## Evaluate performance on train set
    with torch.no_grad():
      model.eval()
      epoch_loss = 0
      loss_counter = 0
      for batch, (x,y, seq_len_x, seq_len_y) in enumerate(dataloader):
        h, c = model.init_state(batch_size)

        x = x.to(device)
        y = y.to(device)

        y_pred, (h,c) = model(x, seq_len_x, (h,c))

        loss = criterion(y_pred.transpose(0,1).transpose(1,2), y.transpose(0,1))

        epoch_loss += loss.item()
        loss_counter +=1

      epoch_loss /= loss_counter

    ## Evaluate performance on validation set
    with torch.no_grad():
      model.eval()
      valid_loss = 0
      loss_counter = 0
      for batch, (x,y, seq_len_x, seq_len_y) in enumerate(validation_dataloader):
        h, c = model.init_state(64)

        x = x.to(device)
        y = y.to(device)

        y_pred, (h,c) = model(x, seq_len_x, (h,c))

        loss = criterion(y_pred.transpose(0,1).transpose(1,2), y.transpose(0,1))

        valid_loss += loss.item()
        loss_counter +=1

      valid_loss /= loss_counter
      perplexity = np.exp(valid_loss)

    ## update evaluation data
    evaluation_data['train_loss'].append(epoch_loss)
    evaluation_data['validation_loss'].append(valid_loss)
    evaluation_data['validation_perplexity'].append(perplexity)

    print("Train Epoch {} Loss - {} | Validation loss - {}".format(epoch, epoch_loss, valid_loss))
    print(f"perplexity on validation set: {perplexity}")
    seq = "Θέλω".split(' ')
    max_seq = 100
    print(f"Probabilistic predictions (top 5): {['Θέλω ' + predict(model, train_set, seq, max_seq, deterministic=False, no_unk=True, top_only=5) for i in range(5)]}")
    print("----------------------------------------")
  return evaluation_data


## Predictions

At inference time, the model receives a sequence of words as input and outputs an array of length equal to the size of the vocabulary. Each element represents the probability of that word being the next word in the sequence. A random model would assign the same probability to each word, therefore each would have a probability of 1/N. 

Because the vocabulary size is usually several thousand words, even with a well-trained language model there is still a significant probability of selecting an irrelevent word. This is why, in practice, we limit the available words to choose from, by sorting them by their probability in decreasing order and keeping only the top words to select from. 

The number of top words to select from, is a hyperparameter: Too small and the model will yield more deterministic and relevant outputs at the cost of repeatability and less variety. On the other hand a high value will produce more imaginative lyrics, but also more irrelative and error prone, especially with smaller, less trained models. We can tweak this parameter at runtime to experiment and produce more conservative or creative lyrics.

In [9]:
from random import choices

def predict(model, dataset, test_seq, max_seq, deterministic=False, no_unk=False, top_only=False):
    model.eval()
    test_seq = " ".join(['#'] + test_seq)
    input = dataset.tokens_to_ids(test_seq)
    h,c = model.init_state(1)

    with torch.no_grad():
      values = []

      for i in range(max_seq):

        # Reinitialize weights
        h,c = model.init_state(1)
        
        # Prepare input
        length = len(input)
        x = torch.tensor(input)
        x = torch.nn.utils.rnn.pad_sequence([x], padding_value=dataset.padding_idx)

        # Predict next token probabilities
        out, (h,c) = model.forward(x.to(device), [length], (h,c))
        p = torch.nn.functional.softmax(out[-1].squeeze(0)).detach().cpu().numpy()


        idx = 1
        if deterministic:
          idx = p.argmax()
        else:
          if top_only:
            indices = p.argsort()[::-1][0:top_only]
            if no_unk:
              while idx == 1:
                idx = choices(indices, p[indices]/sum(p[indices]))[0]
            else:
              idx = choices(indices, p[indices]/sum(p[indices]))[0]
          else:
            if no_unk:
              while idx == 1:
                idx = choices(np.arange(0, model.n_vocab), p)[0]
            else:
              idx = choices(np.arange(0, model.n_vocab), p)[0]
          
        idx = int(idx)
        values.append(idx)

        decoded = dataset.ids_to_tokens([idx])

        if type(decoded) == list:
          token_pred = decoded[0]
        else:
          token_pred = decoded
        
        if token_pred in ('&', '_&'):
          break

        input.append(idx)


      return dataset.ids_to_tokens(values)

## Experiments

### Generating single lyrics

The first experiment consists of the models generating a single lyric each time.
For this task we test the following models and evaluate their performance:

* Model RT1: One-Layer LSTM, hidden units: 128, embedding size: 100, Regex Tokenizer with vocabulary of ~3000 words
* Model BT1: One-Layer LSTM, hidden units: 128, embedding size: 100, BPE Tokenizer with vocabulary of 10000 subwords
* Model RT2: One-Layer LSTM, hidden units: 128, embedding size: 100, Regex Tokenizer with vocabulary of ~10000 words
* Model BT2: One-Layer LSTM, hidden units: 128, embedding size: 100, BPE Tokenizer with vocabulary of 3000 subwords


In [43]:
device="cuda"
print("Generation of single lyrics")

train_lyrics = extract_lyrics_from_files(greek[:-4], concat_lyrics=1, format="genius")
test_lyrics = extract_lyrics_from_files(greek[-4:-2], concat_lyrics=1, format="genius")
validation_lyrics = extract_lyrics_from_files(greek[-2:], concat_lyrics=1, format="genius")

print("-------------------------------------------------------------------------")
print("Training RT1 model")

# Define the train/valid/test set
print("Train set:")
train_set = LyricsDatasetRegex(lyrics = train_lyrics, sent_freq=0.2, token_freq=8, lowercase=True)
print("Validation set:")
validation_set = LyricsDatasetRegex(lyrics = validation_lyrics, vocab=train_set.token_set, lowercase=True)
print("Test set:")
test_set = LyricsDatasetRegex(lyrics = test_lyrics, vocab=train_set.token_set, lowercase=True)

# Define the model
model_rt1 = LSTM(len(train_set.token_set), padding_idx=train_set.padding_idx, embedding_size=100, hidden_size=128, num_layers=1)

# Define training parameters
weights = torch.ones(len(train_set.token_set)).to(device)
weights[1] = 0.1 # Assign a lower weight for the <unk> tokens in the loss function
weights[0] = 0 # The <pad> token should have zero weight

args = {
    'max_epochs': 10,
    'batch_size': 32,
    'weights': weights
}

#Train the model and collect eval data
model_rt1_eval = train(model_rt1, train_set, validation_set, **args)


print("-------------------------------------------------------------------------")
print("Training BT1 model")

print("Train set:")
train_set = LyricsDatasetBPE(lyrics = train_lyrics, n_vocab=10000)
print("Validation set:")
validation_set = LyricsDatasetBPE(lyrics = validation_lyrics, n_vocab=10000)
print("Test set:")
test_set = LyricsDatasetBPE(lyrics = test_lyrics, n_vocab=10000)

weight_model = train_set.bpemb_el.emb
embedding_weights = torch.FloatTensor(weight_model.vectors)

# Define the model
model_bt1 = LSTM(train_set.n_vocab, padding_idx=train_set.padding_idx, embedding_weights=embedding_weights, embedding_size=100, hidden_size=128, num_layers=1)

args = {
    'max_epochs': 10,
    'batch_size': 32,
    'weights': None
}

#Train the model and collect eval data
model_bt1_eval = train(model_bt1, train_set, validation_set, **args)


print("-------------------------------------------------------------------------")
print("Training RT2 model")

# Define the train/valid/test set
print("Train set:")
train_set = LyricsDatasetRegex(lyrics = train_lyrics, sent_freq=0.2, token_freq=2, lowercase=True)
print("Validation set:")
validation_set = LyricsDatasetRegex(lyrics = validation_lyrics, vocab=train_set.token_set, lowercase=True)
print("Test set:")
test_set = LyricsDatasetRegex(lyrics = test_lyrics, vocab=train_set.token_set, lowercase=True)

# Define the model
model_rt2 = LSTM(len(train_set.token_set), padding_idx=train_set.padding_idx, embedding_size=100, hidden_size=128, num_layers=1)

# Define training parameters
weights = torch.ones(len(train_set.token_set)).to(device)
weights[1] = 0.1 # Assign a lower weight for the <unk> tokens in the loss function
weights[0] = 0 # The <pad> token should have zero weight

args = {
    'max_epochs': 10,
    'batch_size': 32,
    'weights': weights
}

#Train the model and collect eval data
model_rt2_eval = train(model_rt2, train_set, validation_set, **args)

print("-------------------------------------------------------------------------")
print("Training BT2 model")

print("Train set:")
train_set = LyricsDatasetBPE(lyrics = train_lyrics, n_vocab=3000)
print("Validation set:")
validation_set = LyricsDatasetBPE(lyrics = validation_lyrics, n_vocab=3000)
print("Test set:")
test_set = LyricsDatasetBPE(lyrics = test_lyrics, n_vocab=3000)

weight_model = train_set.bpemb_el.emb
embedding_weights = torch.FloatTensor(weight_model.vectors)

# Define the model
model_bt2 = LSTM(train_set.n_vocab, padding_idx=train_set.padding_idx, embedding_weights=embedding_weights, embedding_size=100, hidden_size=128, num_layers=1)

args = {
    'max_epochs': 10,
    'batch_size': 32,
    'weights': None
}

#Train the model and collect eval data
model_bt2_eval = train(model_bt2, train_set, validation_set, **args)



Generation of single lyrics
-------------------------------------------------------------------------
Training RT1 model
Train set:
total tokens: 319249, unk tokens: 15242, percentage of unk tokens: 4.7743297551441035%
Initial sentences: 48232, filtered sentences: 42181
Dataset samples: 42181, vocabulary size: 3119 tokens
Validation set:
total tokens: 20357, unk tokens: 1776, percentage of unk tokens: 8.724271749275433%
Initial sentences: 2640, filtered sentences: 2640
Dataset samples: 2640, vocabulary size: 3119 tokens
Test set:
total tokens: 36870, unk tokens: 2853, percentage of unk tokens: 7.7379983726607%
Initial sentences: 4654, filtered sentences: 4654
Dataset samples: 4654, vocabulary size: 3119 tokens



dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.2 and num_layers=1



Train Epoch 0 Loss - 4.685808054328959 | Validation loss - 4.7440949735187345
perplexity on validation set: 114.90376747926139
Probabilistic predictions (top 5): ['Θέλω το σώμα μου &', 'Θέλω κι εγώ θα χαθώ &', 'Θέλω να σε δω &', 'Θέλω να σε ξεχάσω &', 'Θέλω μια ζωή μου να ζήσω &']
----------------------------------------



Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.



Train Epoch 1 Loss - 4.278316253961443 | Validation loss - 4.583861850556874
perplexity on validation set: 97.89170832105135
Probabilistic predictions (top 5): ['Θέλω να σου πω η ζωή &', 'Θέλω να μαι εδώ και να ζήσω &', 'Θέλω να με δεις &', 'Θέλω να μαι να σε ξεχάσω &', 'Θέλω να με δεις &']
----------------------------------------
Train Epoch 2 Loss - 4.018175714390128 | Validation loss - 4.541741632279896
perplexity on validation set: 93.85411721917544
Probabilistic predictions (top 5): ['Θέλω να σαι εγώ να με νοιάζει &', 'Θέλω να μαι εγώ &', 'Θέλω να σαι καλά να μην αφήσω &', 'Θέλω να σε δω &', 'Θέλω να σαι καλά σε σένα &']
----------------------------------------
Train Epoch 3 Loss - 3.829811420831109 | Validation loss - 4.543407099587577
perplexity on validation set: 94.01055842078749
Probabilistic predictions (top 5): ['Θέλω να μαι εγώ &', 'Θέλω να με θυμάσαι &', 'Θέλω να σε ξεχάσω &', 'Θέλω να σου πω &', 'Θέλω να σε δω &']
----------------------------------------
Train Epoch 4 Lo

In [44]:
import plotly.express as px
import pandas as pd

rt1_df = pd.DataFrame(data = model_rt1_eval)
rt1_df['model'] = 'rt1'

rt2_df = pd.DataFrame(data = model_rt2_eval)
rt2_df['model'] = 'rt2'

bt1_df = pd.DataFrame(data = model_bt1_eval)
bt1_df['model'] = 'bt1'

bt2_df = pd.DataFrame(data = model_bt2_eval)
bt2_df['model'] = 'bt2'

df = pd.concat([rt1_df, rt2_df, bt1_df, bt2_df])
df['epoch'] = df.index
print("Model Evaluation")

df1 = df.melt(id_vars=['model', 'epoch'], value_vars=['train_loss', 'validation_loss'], var_name='loss', value_name="value")
fig1 = px.line(df1, x='epoch', y='value', color='loss', facet_col="model", facet_col_wrap=2)
fig1.show()

fig2 = px.line(df, x='epoch', y='validation_perplexity', color='model')
fig2.show()



Model Evaluation


### Explaining the results

We can examine that the models take about 2-3 epochs to train before the validation loss increases. One reasonable question to ask is why the perplexity exibits such variance between the models. The reason is that perplexity is heavily affected by the tokenization process.

Subword tokenization will produce lower perplexity score, because the metric is calculated for each subword. For example, suppose we have the lyric: 

'# Αγαπώ σημαίνει &' 

and we want to calculate the perplexity for models with regex and bpe tokenization. The first model's perplexity would be (omiting the exponent):

\(-1/3) *  (log(p(Αγαπώ|#)) + log(p(σημαίνει|#,Αγαπώ)) + log(p(&|#, Αγαπώ, σημαίνει)))

The second models perplexity would be:

\(-1/6) *  (log(p(_Αγα|#)) + log(p(πώ|#,_Αγα)) + log(p(_σημ|#, _Αγα, πώ)) + log(p(αίν|#, _Αγα, πώ, _σημ))+ log(p(ει|#, _Αγα, πώ, _σημ, αίν)) + log(p(&|#, _Αγα, πώ, _σημ, αίν, ει))

Since the model is very good at predicting suffixes, terms like log(p(πώ|#,_Αγα))
will positively influence the score, thus the model will exibit lower perplexity.

Next we train the RT1 and BT1 models for the optimal number of epochs, evaluate the final perfomance on the test set and use them to generate some lyrics.



In [10]:
def evaluate_test(model, test_set, padding_idx, weights):
  # Define dataloaders
  def collate_pad(batch):    
    in_ = []
    out_ = []
    seq_len_x = []
    seq_len_y = []
    for x,y in batch:
      in_.append(x)
      out_.append(y)
      seq_len_x.append(len(x))
      seq_len_y.append(len(y))

    return torch.nn.utils.rnn.pad_sequence(in_, padding_value=padding_idx).cuda(), torch.nn.utils.rnn.pad_sequence(out_, padding_value=padding_idx).cuda(), seq_len_x, seq_len_y

  dataloader = torch.utils.data.DataLoader(test_set, batch_size=64, collate_fn=collate_pad)
  criterion = torch.nn.CrossEntropyLoss(ignore_index=padding_idx, weight=weights)


  with torch.no_grad():
      model.eval()
      test_loss = 0
      loss_counter = 0
      for batch, (x,y, seq_len_x, seq_len_y) in enumerate(dataloader):
        h, c = model.init_state(64)

        x = x.to(device)
        y = y.to(device)

        y_pred, (h,c) = model(x, seq_len_x, (h,c))

        loss = criterion(y_pred.transpose(0,1).transpose(1,2), y.transpose(0,1))

        test_loss += loss.item()
        loss_counter +=1

      test_loss /= loss_counter
      perplexity = np.exp(test_loss)

  print(f"Perplexity on test set: {perplexity}")

In [45]:
print("Training models")

print("-------------------------------------------------------------------------")
print("Training RT1 model")

# Define the train/valid/test set
print("Train set:")
train_set = LyricsDatasetRegex(lyrics = train_lyrics, sent_freq=0.2, token_freq=8, lowercase=True)
print("Validation set:")
validation_set = LyricsDatasetRegex(lyrics = validation_lyrics, vocab=train_set.token_set, lowercase=True)
print("Test set:")
test_set = LyricsDatasetRegex(lyrics = test_lyrics, vocab=train_set.token_set, lowercase=True)

# Define the model
model_rt1 = LSTM(len(train_set.token_set), padding_idx=train_set.padding_idx, embedding_size=100, hidden_size=128, num_layers=1)
train_set_rt1 = train_set

# Define training parameters
weights = torch.ones(len(train_set.token_set)).to(device)
weights[1] = 0.1 # Assign a lower weight for the <unk> tokens in the loss function
weights[0] = 0 # The <pad> token should have zero weight

args = {
    'max_epochs': 3,
    'batch_size': 32,
    'weights': weights
}

#Train the model and collect eval data
model_rt1_eval = train(model_rt1, train_set, validation_set, **args)

#Performance on test set
evaluate_test(model_rt1, test_set, train_set.padding_idx, weights=weights)


print("-------------------------------------------------------------------------")
print("Training BT1 model")

print("Train set:")
train_set = LyricsDatasetBPE(lyrics = train_lyrics, n_vocab=10000)
print("Validation set:")
validation_set = LyricsDatasetBPE(lyrics = validation_lyrics, n_vocab=10000)
print("Test set:")
test_set = LyricsDatasetBPE(lyrics = test_lyrics, n_vocab=10000)

weight_model = train_set.bpemb_el.emb
embedding_weights = torch.FloatTensor(weight_model.vectors)

# Define the model
model_bt1 = LSTM(train_set.n_vocab, padding_idx=train_set.padding_idx, embedding_weights=embedding_weights, embedding_size=100, hidden_size=128, num_layers=1)
train_set_bt1 = train_set

args = {
    'max_epochs': 4,
    'batch_size': 32,
    'weights': None
}

#Train the model and collect eval data
model_bt1_eval = train(model_bt1, train_set, validation_set, **args)

#Performance on test set
evaluate_test(model_bt1, test_set, train_set.padding_idx, weights=None)

Training models
-------------------------------------------------------------------------
Training RT1 model
Train set:
total tokens: 319249, unk tokens: 15242, percentage of unk tokens: 4.7743297551441035%
Initial sentences: 48232, filtered sentences: 42181
Dataset samples: 42181, vocabulary size: 3119 tokens
Validation set:
total tokens: 20357, unk tokens: 1776, percentage of unk tokens: 8.724271749275433%
Initial sentences: 2640, filtered sentences: 2640
Dataset samples: 2640, vocabulary size: 3119 tokens
Test set:
total tokens: 36870, unk tokens: 2853, percentage of unk tokens: 7.7379983726607%
Initial sentences: 4654, filtered sentences: 4654
Dataset samples: 4654, vocabulary size: 3119 tokens



dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.2 and num_layers=1



Train Epoch 0 Loss - 4.67122659491625 | Validation loss - 4.730627502713885
perplexity on validation set: 113.36667793178194
Probabilistic predictions (top 5): ['Θέλω να με κοιτάς &', 'Θέλω να σε δω &', 'Θέλω το μυαλό μου &', 'Θέλω να μου λες πως μ αγαπάς &', 'Θέλω να σε δω &']
----------------------------------------



Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.



Train Epoch 1 Loss - 4.267085333138004 | Validation loss - 4.574681020918346
perplexity on validation set: 96.99709415777767
Probabilistic predictions (top 5): ['Θέλω να με βρεις &', 'Θέλω ν ένα φιλί &', 'Θέλω να μαι η νύχτα που δεν μπορώ &', 'Θέλω να μαι για να σε δω &', 'Θέλω ν αλλάξει &']
----------------------------------------
Train Epoch 2 Loss - 4.00831067760936 | Validation loss - 4.533466952187674
perplexity on validation set: 93.08070869088142
Probabilistic predictions (top 5): ['Θέλω να σε ξεχάσω &', 'Θέλω να μη με νοιάζει &', 'Θέλω να μη το λες πως μπορείς &', 'Θέλω να με δεις πως &', 'Θέλω να μαι η καρδιά &']
----------------------------------------
Perplexity on test set: 99.32579596898395
-------------------------------------------------------------------------
Training BT1 model
Train set:
Dataset samples: 48232
Validation set:
Dataset samples: 2640
Test set:
Dataset samples: 4654
Train Epoch 0 Loss - 4.465670468636469 | Validation loss - 4.494274792217073
perplexity on

In [None]:
import random

print("Generating lyrics")
print("RT1 model")
model = model_rt1
train_set = train_set_rt1
max_seq = 100
starter_words = [['Μια νύχτα'], ['Αν'], ['Δεν μπορείς να λες']]

print("-> Low temperature (3 top words):")
for starter in starter_words:
  print(f"Lyrics with context: {[starter[0] + ' ' + predict(model, train_set, starter, max_seq, top_only=3, no_unk=True) for i in range(3)]}")
print(f"Random lyrics: {[predict(model, train_set, [], max_seq, top_only=3, no_unk=True) for i in range(3)]}")

print("-> Medium temperature (20 top words):")
for starter in starter_words:
  print(f"Lyrics with context: {[starter[0] + ' ' + predict(model, train_set, starter, max_seq, top_only=20, no_unk=True) for i in range(3)]}")
print(f"Random lyrics: {[predict(model, train_set, [], max_seq, top_only=20, no_unk=True) for i in range(3)]}")

print("-> High temperature (50 top words):")
for starter in starter_words:
  print(f"Lyrics with context: {[starter[0] + ' ' + predict(model, train_set, starter, max_seq, top_only=50, no_unk=True) for i in range(3)]}")
print(f"Random lyrics: {[predict(model, train_set, [], max_seq, top_only=50, no_unk=True) for i in range(3)]}")

print("--------------------------")
print("BT1 model")
model = model_bt1
train_set = train_set_bt1
max_seq = 100
starter_words = [['Μια νύχτα'], ['Αν'], ['Δεν μπορείς να λες']]

print("-> Low temperature (3 top words):")
for starter in starter_words:
  print(f"Lyrics with context: {[starter[0] + ' ' + predict(model, train_set, starter, max_seq, top_only=3) for i in range(3)]}")
print(f"Random lyrics: {[predict(model, train_set, [], max_seq, top_only=3) for i in range(3)]}")

print("-> Medium temperature (20 top words):")
for starter in starter_words:
  print(f"Lyrics with context: {[starter[0] + ' ' + predict(model, train_set, starter, max_seq, top_only=20) for i in range(3)]}")
print(f"Random lyrics: {[predict(model, train_set, [], max_seq, top_only=20) for i in range(3)]}")

print("-> High temperature (50 top words):")
for starter in starter_words:
  print(f"Lyrics with context: {[starter[0] + ' ' + predict(model, train_set, starter, max_seq, top_only=50) for i in range(3)]}")
print(f"Random lyrics: {[predict(model, train_set, [], max_seq, top_only=50) for i in range(3)]}")


Generating lyrics
RT1 model
-> Low temperature (3 top words):
Lyrics with context: ['Μια νύχτα και να ζήσω &', 'Μια νύχτα σαν κύμα να με κρατήσεις &', 'Μια νύχτα και το μυαλό &']
Lyrics with context: ['Αν δε θέλω να μαι μαζί &', 'Αν μ αγαπάς κι εγώ &', 'Αν δε θα πω εγώ &']
Lyrics with context: ['Δεν μπορείς να λες &', 'Δεν μπορείς να λες &', 'Δεν μπορείς να λες πως μ αγαπάς &']
Random lyrics: ['και θα ναι ο έρωτας καημός &', 'και να με δεις &', 'και θα ναι ο έρωτας καημός &']
-> Medium temperature (20 top words):
Lyrics with context: ['Μια νύχτα και πάθος &', 'Μια νύχτα μη γυρνάς &', 'Μια νύχτα για τον εαυτό μου το βράδυ &']
Lyrics with context: ['Αν είμαι πια εσύ το φως &', 'Αν ήσουνα μια στάλα πιο πολύ &', 'Αν θέλεις το μυαλό μου &']
Lyrics with context: ['Δεν μπορείς να λες σε σκέφτομαι &', 'Δεν μπορείς να λες &', 'Δεν μπορείς να λες &']
Random lyrics: ['να ξερες τι ζητάς να σαι εδώ &', 'i was it s go soul &', 'με ρωτάς που δεν έχω κάνει &']
-> High temperature (50 top words):



Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.



Lyrics with context: ['Μια νύχτα ακόμα στο μυαλό &', 'Μια νύχτα και οι καρδιές &', 'Μια νύχτα σαν κερί στο κορμί &']
Lyrics with context: ['Αν κι άλλοι &', 'Αν δε μ αγαπάς &', 'Αν θες να χαθώ &']
Lyrics with context: ['Δεν μπορείς να λες &', 'Δεν μπορείς να λες μακριά &', 'Δεν μπορείς να λες &']
Random lyrics: ['oh oh to take the solution &', 'it crazy just like a prayer &', 'μ εσένα δίπλα μου &']
--------------------------
BT1 model
-> Low temperature (3 top words):
Lyrics with context: ['Μια νύχτα που δεν αντέχω &', 'Μια νύχτα θα μαι η αγάπη &', 'Μια νύχτα θα βγω &']
Lyrics with context: ['Αν το ξέρω πως θα βρω &', 'Αν το ξέρει η καρδιά &', 'Αν μ αγαπάς μη μου λες &']
Lyrics with context: ['Δεν μπορείς να λες να μαι &', 'Δεν μπορείς να λες &', 'Δεν μπορείς να λες να ζω &']
Random lyrics: ['και να βρω και μες στης ζωής μου &', 'και να μαι η νύχτα &', 'i can take me &']
-> Medium temperature (20 top words):
Lyrics with context: ['Μια νύχτα θα το πω και θα βγεις &', 'Μια νύχτα με σένα γ

As shown, the models' performance is similar, with the regex model producing slightly better results. The models are capable of identifying the context and naturally continuing the lyrics with relevant words. Generated lyrics like:

* '*Μια νύχτα σαν κύμα να με κρατήσεις*'
* '*Μια νύχτα σαν κερί στο κορμί*'
* '*Δεν μπορείς να λες πως μ αγαπάς*'
* '*Δεν μπορείς να λες σε σκέφτομαι*'
* '*και θα ναι ο έρωτας καημός*'

actually resemble human-like songwriting. However there are cases where the weakness of the AI models is obvious such as:

* '*Δεν μπορείς να λες είν αυτό ποτέ*'
* '*το μυαλό μου στο πρωί να σε βρω*'

### Generating verses

In the second experiment the models are trained on verses (each sample contains 4 lyrics, concatenated with "."). The goal is to generate verses with correct grammar and syntax, and also rythm if possible.

* Model RT3: Two-Layer LSTM, hidden units: 256, embedding size: 100, Regex Tokenizer with vocabulary of ~3000 words
* Model BT3: Two-Layer LSTM, hidden units: 256, embedding size: 100, BPE Tokenizer with vocabulary of 10000 subwords

In [11]:
device="cuda"
print("Generation of verses")

train_lyrics = extract_lyrics_from_files(greek[:-4], concat_lyrics=4, format="genius")
test_lyrics = extract_lyrics_from_files(greek[-4:-2], concat_lyrics=4, format="genius")
validation_lyrics = extract_lyrics_from_files(greek[-2:], concat_lyrics=4, format="genius")

print("-------------------------------------------------------------------------")
print("Training RT3 model")

# Define the train/valid/test set
print("Train set:")
train_set = LyricsDatasetRegex(lyrics = train_lyrics, sent_freq=0.2, token_freq=8, lowercase=True)
print("Validation set:")
validation_set = LyricsDatasetRegex(lyrics = validation_lyrics, vocab=train_set.token_set, lowercase=True)
print("Test set:")
test_set = LyricsDatasetRegex(lyrics = test_lyrics, vocab=train_set.token_set, lowercase=True)

# Define the model
model_rt3 = LSTM(len(train_set.token_set), padding_idx=train_set.padding_idx, embedding_size=100, hidden_size=256, num_layers=2)
train_set_rt3 = train_set

# Define training parameters
weights = torch.ones(len(train_set.token_set)).to(device)
weights[1] = 0.1 # Assign a lower weight for the <unk> tokens in the loss function
weights[0] = 0 # The <pad> token should have zero weight

args = {
    'max_epochs': 10,
    'batch_size': 32,
    'weights': weights
}

#Train the model and collect eval data
model_rt3_eval = train(model_rt3, train_set, validation_set, **args)
#Performance on test set
evaluate_test(model_rt3, test_set, train_set.padding_idx, weights=weights)


print("-------------------------------------------------------------------------")
print("Training BT3 model")

print("Train set:")
train_set = LyricsDatasetBPE(lyrics = train_lyrics, n_vocab=10000)
print("Validation set:")
validation_set = LyricsDatasetBPE(lyrics = validation_lyrics, n_vocab=10000)
print("Test set:")
test_set = LyricsDatasetBPE(lyrics = test_lyrics, n_vocab=10000)

weight_model = train_set.bpemb_el.emb
embedding_weights = torch.FloatTensor(weight_model.vectors)

# Define the model
model_bt3 = LSTM(train_set.n_vocab, padding_idx=train_set.padding_idx, embedding_weights=embedding_weights, embedding_size=100, hidden_size=256, num_layers=2)
train_set_bt3 = train_set

args = {
    'max_epochs': 10,
    'batch_size': 32,
    'weights': None
}

#Train the model and collect eval data
model_bt3_eval = train(model_bt3, train_set, validation_set, **args)
#Performance on test set
evaluate_test(model_bt3, test_set, train_set.padding_idx, weights=None)

Generation of verses
-------------------------------------------------------------------------
Training RT3 model
Train set:
total tokens: 284538, unk tokens: 19749, percentage of unk tokens: 6.940724964679585%
Initial sentences: 11620, filtered sentences: 10458
Dataset samples: 10458, vocabulary size: 3025 tokens
Validation set:
total tokens: 17616, unk tokens: 1722, percentage of unk tokens: 9.775204359673024%
Initial sentences: 633, filtered sentences: 633
Dataset samples: 633, vocabulary size: 3025 tokens
Test set:
total tokens: 32059, unk tokens: 2792, percentage of unk tokens: 8.708942886552919%
Initial sentences: 1118, filtered sentences: 1118
Dataset samples: 1118, vocabulary size: 3025 tokens
Train Epoch 0 Loss - 5.641179453706887 | Validation loss - 5.600625705718994
perplexity on validation set: 270.5956677238922



Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.



Probabilistic predictions (top 5): ['Θέλω . i το καρδιά . και να το να . να με . κι να μου . i να να μου . i να το να you . να θα you &', 'Θέλω i you . κι i i να you &', 'Θέλω να you μου . i να το να you . you i . i you you you . και you μου . και θα ζωή i be you . i να το να μου . i μου i you . κι i i i μου μου να ζωή να &', 'Θέλω να το καρδιά . και να να το ζωή . και θα το . you να να μου you . κι και θα . i να να you μου . και i να το ζωή . να μου i . και το το μου . i να το καρδιά . και i να να το &', 'Θέλω i . i το να . και i you . να μου you . you i το καρδιά &']
----------------------------------------
Train Epoch 1 Loss - 5.069081338538307 | Validation loss - 5.064779949188233
perplexity on validation set: 158.34559416313132
Probabilistic predictions (top 5): ['Θέλω να σε αγάπη . θα μαι θα σε πω . θα με αγαπώ &', 'Θέλω you you . i you . the you &', 'Θέλω i m i . you i . a t you you you &', 'Θέλω . κι με να σ πω . να σε αγάπη μου . κι αν σ αγάπη &', 'Θέλω you the i i be . i ll t

In [12]:
import plotly.express as px
import pandas as pd

rt3_df = pd.DataFrame(data = model_rt3_eval)
rt3_df['model'] = 'rt3'

bt3_df = pd.DataFrame(data = model_bt3_eval)
bt3_df['model'] = 'bt3'

df = pd.concat([rt3_df, bt3_df])
df['epoch'] = df.index
print("Model Evaluation")

df3 = df.melt(id_vars=['model', 'epoch'], value_vars=['train_loss', 'validation_loss'], var_name='loss', value_name="value")
fig3 = px.line(df3, x='epoch', y='value', color='loss', facet_col="model", facet_col_wrap=2)
fig3.show()

fig4 = px.line(df, x='epoch', y='validation_perplexity', color='model')
fig4.show()

Model Evaluation


In [13]:
import random

print("Generating lyrics")
print("RT3 model")
model = model_rt3
train_set = train_set_rt3
max_seq = 100
starter_words = [['Μια νύχτα'], ['Αν'], ['Δεν μπορείς να λες']]

print("-> Low temperature (3 top words):")
for starter in starter_words:
  print(f"Lyrics with context: {[starter[0] + ' ' + predict(model, train_set, starter, max_seq, top_only=3, no_unk=True) for i in range(3)]}")
print(f"Random lyrics: {[predict(model, train_set, [], max_seq, top_only=3, no_unk=True) for i in range(3)]}")

print("-> Medium temperature (20 top words):")
for starter in starter_words:
  print(f"Lyrics with context: {[starter[0] + ' ' + predict(model, train_set, starter, max_seq, top_only=20, no_unk=True) for i in range(3)]}")
print(f"Random lyrics: {[predict(model, train_set, [], max_seq, top_only=20, no_unk=True) for i in range(3)]}")

print("-> High temperature (50 top words):")
for starter in starter_words:
  print(f"Lyrics with context: {[starter[0] + ' ' + predict(model, train_set, starter, max_seq, top_only=50, no_unk=True) for i in range(3)]}")
print(f"Random lyrics: {[predict(model, train_set, [], max_seq, top_only=50, no_unk=True) for i in range(3)]}")

print("--------------------------")
print("BT3 model")
model = model_bt3
train_set = train_set_bt3
max_seq = 100
starter_words = [['Μια νύχτα'], ['Αν'], ['Δεν μπορείς να λες']]

print("-> Low temperature (3 top words):")
for starter in starter_words:
  print(f"Lyrics with context: {[starter[0] + ' ' + predict(model, train_set, starter, max_seq, top_only=3) for i in range(3)]}")
print(f"Random lyrics: {[predict(model, train_set, [], max_seq, top_only=3) for i in range(3)]}")

print("-> Medium temperature (20 top words):")
for starter in starter_words:
  print(f"Lyrics with context: {[starter[0] + ' ' + predict(model, train_set, starter, max_seq, top_only=20) for i in range(3)]}")
print(f"Random lyrics: {[predict(model, train_set, [], max_seq, top_only=20) for i in range(3)]}")

print("-> High temperature (50 top words):")
for starter in starter_words:
  print(f"Lyrics with context: {[starter[0] + ' ' + predict(model, train_set, starter, max_seq, top_only=50) for i in range(3)]}")
print(f"Random lyrics: {[predict(model, train_set, [], max_seq, top_only=50) for i in range(3)]}")


Generating lyrics
RT3 model
-> Low temperature (3 top words):
Lyrics with context: ['Μια νύχτα που θα σ αγαπώ . και να μαι η καρδιά μου . και θα ναι εδώ . δε θα μαι εδώ &', 'Μια νύχτα που θα σ αγαπώ . και να με νοιάζει . και θα μαι εδώ . και θα σ αγαπώ &', 'Μια νύχτα θα μαι εδώ για μένα . και να μαι εδώ . και μη μη μη μη μη μη μη μη . αν δεν έχει τίποτα &']



Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.



Lyrics with context: ['Αν μ αρέσει να με δεις . δε θα μαι εγώ . και θα μαι μαζί σου . μα δεν μπορώ να ζήσεις &', 'Αν με βλέπεις να σε βρω . να μη μ αγαπάς . και να σε δω . δε θα μαι εγώ &', 'Αν θα σ έχω να χαθώ . να μη με νοιάζει τι θα πω . δε μ αγαπάς . δε μ αγαπάς &']
Lyrics with context: ['Δεν μπορείς να λες . να μη σε νοιάζει . κι ας μη μ αγαπάς . δε θα μαι μαζί &', 'Δεν μπορείς να λες . να μαι η καρδιά . να μη σε ξεχάσω . δε θα σ αγαπώ &', 'Δεν μπορείς να λες . να μη σε ξεχάσω . να μη μ αφήσεις . και να μαι μαζί σου &']
Random lyrics: ['και θα ναι εδώ . και θα σ έχω αγκαλιά . και θα ναι η αγάπη θα μαι εγώ . και θα ναι εδώ &', 'και θα μαι εγώ να χαθώ . δε θα σε ξεχάσω . δε θα βγεις . να μαι εγώ θα χαθώ &', 'i know you go . let s a dream . you re be popular . you better be the sea &']
-> Medium temperature (20 top words):
Lyrics with context: ['Μια νύχτα θα ναι πια τώρα εδώ . για μένα θα ναι εγώ . κι αν η αλήθεια θα ζω μαζί . κι αν ζω και οι δυο ουρανοί &', 'Μια νύχτα αμαρτία πια να

In [18]:
torch.save(model_bt3.state_dict(), 'model_bt3.pt')
torch.save(train_set_bt3, 'train_set_bt3.pt')
tempd = torch.load('train_set_bt3.pt')
tempd.dataset = None
torch.save(tempd, 'train_set_bt3.pt')
temp = LSTM(train_set.n_vocab, padding_idx=train_set.padding_idx, embedding_weights=None, embedding_size=100, hidden_size=256, num_layers=2)
temp.load_state_dict(torch.load('model_bt3.pt'))

<All keys matched successfully>

## Visualizing the language model

To gain some insight on the trained language model, we can use PCA to visualize the hidden vectors of the LSTM. For this example we will train the model on pop and trap lyrics since the two kinds show some difference in the lyrics context. Then we perform a forward pass on the test set, which contains unseen samples of pop and trap lyrics, of equal sizes. We keep the representations of the hidden lstm layer and perform a PCA dimensionality reduction, visualizing the data on the two most important dimensions.

In [None]:
device="cuda"
trap = ['Lyrics_FY.json', 
          'Lyrics_MadClip.json', 
          'Lyrics_Light.json', 
          'Lyrics_Snik.json', 
          'Lyrics_Ypo.json',
          'Lyrics_TrannosGRC.json',
          'Lyrics_Toquel.json',
          'Lyrics_MenteFuerteGRC.json',
          'Lyrics_VlospaGRC.json',
          'Lyrics_iLLEOoGRC.json',
          'Lyrics_RICTAGRC.json',
          'Lyrics_HawkGRC.json',
          'Lyrics_BillySioGRC.json']
pop = [
          'Lyrics_AntonisRemos.json',
          'Lyrics_Giannisploutarhos.json',
          'Lyrics_GiorgosMazonakis.json',
          'Lyrics_NikosOikonomopoulos.json',
          'Lyrics_PanosKiamos.json',
          'Lyrics_GiorgosTsalikis.json',
          'Lyrics_IliasVrettos.json',
          'Lyrics_PantelisPantelidis.json',
          'Lyrics_ΜιχάληςΧατζηγιάννηςMichalisHatzigiannis.json',
          'Lyrics_SteliosRokkos.json',
          'Lyrics_GiorgosSabanis.json',
          'Lyrics_Yianniskotsiras.json',
          'Lyrics_GiorgosKakosaios.json',
          'Lyrics_SakisRouvas.json',
          'Lyrics_Stavento.json',
          'Lyrics_Notissfakianakis.json',
          'Lyrics_Thanospetrelis.json',
          'Lyrics_LefterisPantazis.json',
          'Lyrics_DionisisShinas.json',
          'Lyrics_AnnaVissi.json',
          'Lyrics_DespinaVandi.json',
          'Lyrics_ElliKokkinou.json',
          'Lyrics_NatasaTheodoridou.json',
          'Lyrics_FoivosDelivorias.json',
          'Lyrics_JosephineGR.json',
          ]


print("-------------------------------------------------------------------------")
print("Training RT1 model with trap and pop")

train_lyrics = extract_lyrics_from_files(pop[:-4] + trap[:-4], concat_lyrics=1, format="genius")
test_lyrics = extract_lyrics_from_files(pop[-4:-2] + trap[:-4:-2], concat_lyrics=1, format="genius")
validation_lyrics = extract_lyrics_from_files(pop[-2:] + trap[-2:], concat_lyrics=1, format="genius")


# Define the train/valid/test set
print("Train set:")
train_set = LyricsDatasetRegex(lyrics = train_lyrics, sent_freq=0.2, token_freq=2, lowercase=True)
print("Validation set:")
validation_set = LyricsDatasetRegex(lyrics = validation_lyrics, vocab=train_set.token_set, lowercase=True)
print("Test set:")
test_set = LyricsDatasetRegex(lyrics = test_lyrics, vocab=train_set.token_set, lowercase=True)

# Define the model
model_mixed = LSTM(len(train_set.token_set), padding_idx=train_set.padding_idx, embedding_size=100, hidden_size=128, num_layers=1)
train_set_mixed = train_set

# Define training parameters
weights = torch.ones(len(train_set.token_set)).to(device)
weights[1] = 0.1 # Assign a lower weight for the <unk> tokens in the loss function
weights[0] = 0 # The <pad> token should have zero weight

args = {
    'max_epochs': 3,
    'batch_size': 32,
    'weights': weights
}

#Train the model and collect eval data
model_mixed_eval = train(model_mixed, train_set, validation_set, **args)
#Performance on test set
evaluate_test(model_mixed, test_set, train_set.padding_idx, weights=None)

-------------------------------------------------------------------------
Training RT1 model with trap and pop
Train set:
total tokens: 486549, unk tokens: 8188, percentage of unk tokens: 1.6828726397546803%
Initial sentences: 60281, filtered sentences: 58711
Dataset samples: 58711, vocabulary size: 14139 tokens
Validation set:
total tokens: 51089, unk tokens: 4981, percentage of unk tokens: 9.749652567088805%
Initial sentences: 6128, filtered sentences: 6128
Dataset samples: 6128, vocabulary size: 14139 tokens
Test set:
total tokens: 97587, unk tokens: 7303, percentage of unk tokens: 7.483578755367006%
Initial sentences: 11963, filtered sentences: 11963
Dataset samples: 11963, vocabulary size: 14139 tokens



dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.2 and num_layers=1



Train Epoch 0 Loss - 5.556022600127176 | Validation loss - 5.707387362917264
perplexity on validation set: 301.08341810565327
Probabilistic predictions (top 5): ['Θέλω να το χω &', 'Θέλω να μας πεις yah yah &', 'Θέλω για να το κάνω &', 'Θέλω να τους πάρω για να μας πεις &', 'Θέλω το κάνω &']
----------------------------------------



Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.



Train Epoch 1 Loss - 5.07731520304563 | Validation loss - 5.6332485775152845
perplexity on validation set: 279.56884511258505
Probabilistic predictions (top 5): ['Θέλω να μας πεις yah yah yah yah &', 'Θέλω να με φάνε δεν ξέρω τι να κάνω &', 'Θέλω να με κρίνει και δεν είμαι εγώ το ξες &', 'Θέλω ferrari μου για να μας πεις &', 'Θέλω να μου πεις yah &']
----------------------------------------
Train Epoch 2 Loss - 4.76750993702652 | Validation loss - 5.666759883364041
perplexity on validation set: 289.0963094717534
Probabilistic predictions (top 5): ['Θέλω να σε δω &', 'Θέλω να με βρεις να με δεις &', 'Θέλω να με φάνε &', 'Θέλω να σε βρω &', 'Θέλω να με βρεις &']
----------------------------------------
Perplexity on test set: 238.6567329295838


In [None]:
def get_hidden_vector(model, dataset, test_seq):
    model.eval()
    test_seq = " ".join(['#'] + test_seq)
    input = dataset.tokens_to_ids(test_seq)
    h,c = model.init_state(1)

    with torch.no_grad():

      # Reinitialize weights
      h,c = model.init_state(1)
      
      # Prepare input
      length = len(input)
      x = torch.tensor(input)
      x = torch.nn.utils.rnn.pad_sequence([x], padding_value=dataset.padding_idx)

      # Predict next token probabilities
      out, (h,c) = model.forward(x.to(device), [length], (h,c))

      # Get last hidden layer of lstm
      return h.view(-1)


# Forward-pass the test dataset and capture last hidden layer values
pop_samples = extract_lyrics_from_files(pop[-4:-2], concat_lyrics=1, format="genius")
trap_samples = extract_lyrics_from_files(trap[-4:-2], concat_lyrics=1, format="genius")

hidden_states = list(map(lambda it: list(get_hidden_vector(model_mixed, train_set_mixed, it).cpu().numpy()), pop_samples+trap_samples))

#Generate scatter plot

import numpy as np
import plotly.express as px
from sklearn.decomposition import PCA
X = np.array(hidden_states)
pca = PCA(n_components=2)

pca.fit(X)

Y = pca.transform(hidden_states)
c = ['blue']*len(pop_samples) + ['red']*len(trap_samples)

for i in range(len(pop_samples)):
  for j in range(len(pop_samples[i])):
    pop_samples[i][j] = pop_samples[i][j] if train_set_mixed.token_index(pop_samples[i][j]) != 1 else '<unk>'

for i in range(len(trap_samples)):
  for j in range(len(trap_samples[i])):
    trap_samples[i][j] = trap_samples[i][j] if train_set_mixed.token_index(trap_samples[i][j]) != 1 else '<unk>'

l = list(map(lambda it: " ".join(it), pop_samples)) + list(map(lambda it: " ".join(it), trap_samples))

fig = px.scatter_matrix(Y, dimensions=[0,1], color=c, hover_name=l)
fig.show()

From the above diagram we can identify some core regions:
* In one region, we have mostly trap songs with lyrics containing foreign words like: "Ολα είναι clean no stress", "Πέφτεις knock out σβήνουν τα φώτα black out", "Hey club όλα τα \<unk> όλα τα \<unk>". Those are distringuishable lyrics that differ greatly from pop songs.
* At the center there are mixed pop and trap lyrics.
* In one other region we have mostly pop lyrics and also some trap lyrics that resemble common lyrics like "Τι κι αν δεν αλλάξω τι κι αν δεν αλλάξεις", "Τώρα πια δε θυμάμαι".
* Finally we have a region containing all lyrics that end with a pronoun "Αν την καρδιά μου", "Εγώ θα αντέξω εδώ μπροστά σου", "Έχασα φίλους εχθρούς που δεν θελαν να είχα τα μάτια στην πλάτη μου". 

### Visualizing the embedding layer

For this task we'll use the RT1 model, trained on single pop lyrics. We visualize the embeddings of some sample words onto a lower dimensional space. As we can see, words that have strong contextual meaning are mapped close to one another. On the other hand pronouns are grouped together, further apart from other words.

In [171]:
import random
import pandas as pd
words = train_set_rt1.token_set

ids = torch.tensor([train_set_rt1.token_index(word) for word in words]).to(device)
with torch.no_grad():
  embeddings = model_rt1.embedding(ids)

  X = embeddings.cpu().numpy()
  pca = PCA(n_components=3)

  pca.fit(X)

  strong_words = ['πεθαίνω', 'ζω', 'αγαπώ', 'μισώ', 'λιώνω', 'ζωή', 'θάνατος', 'πιώ', 'γυρνώ', 'στιγμή', 'μωρό', 'πεθάνω', 'βλέμμα']
  general_words = ['νιώθω', 'δει', 'πει', 'ρθει', 'γιατί', 'θέλω', 'έλα', 'απόψε', 'σήμερα', 'αυτό']
  pronouns = ['να', 'σε', 'με', 'μη', 'του', 'σου', 'μου', 'ο', 'η', 'το', 'μια', 'ένα']

  rand_words = strong_words + general_words + pronouns
  rand_ids = torch.tensor([train_set_rt1.token_index(word) for word in rand_words]).to(device)
  rand_embeddings = model_rt1.embedding(rand_ids)

  Y = pca.transform(rand_embeddings.cpu().numpy())
  fig_embed = px.scatter_matrix(Y, dimensions=[0,1], hover_name=rand_words, color=pd.Series(['blue']*len(strong_words) + ['red']*len(general_words) + ['green']*len(pronouns)))
  fig_embed.show()

[0.01620061 0.01399792 0.01317108]


## Conclusion

In this assigment we demonstrated the application of Deep Learning to automatic lyrics generation. It was shown that small LSTM models were able to generate decent lyrics with just a few epochs of training. The underlying language model is able to capture the relationship between words and also provide rythm when generating verses.

Note:

In the project's directory you may find pre-trained models and python scripts with which you can generate your own lyrics.