# Generating Chorales With RNN

## Overview

Step into the captivating realm of automated music composition: I will walk you through the code of a machine learning model that aims to compose music in the style of Bach chorales. Johann Sebastian Bach was one of the most influential composers of all time, inspiring generations of artists with his masterful use of compositional techniques such as counterpoint. Bach chorales, specifically, are harmonisations of hymn melodies in four parts, meaning that four notes are written to be played at the same time. Today, with the help of machine learning, you too can become a musical genius like Bach.

Imitating the styles of renowned artists has been a key interest of computational creativity research (a branch of artificial intelligence). The purpose of these studies is to assess whether machine-generated art has reached human-like quality. Within the context of music, Bach chorale generation is one of the most widely used case studies, because J.S. Bach wrote over 400 chorale pieces that all follow the same compositional rules - characteristics that are ideal for machine learning models. The bigger and more homogeneous the collection of examples, the easier it is for a machine learning model to pick up the complex underlying patterns. Imagine someone showed you different songs by the same artist and asked you what aspects of the music you would describe as their unique "signature sound". It would be much harder to give a useful response if you only heard two songs instead of 400. Similarly, it would be very difficult to summarise what the music has in common if the artist ventured out into a lot of different styles instead of sticking to the same format.

[Source: DeepBach: a Steerable Model for Bach Chorales Generation](https://arxiv.org/pdf/1612.01010.pdf)


![musical notes being converted into numbers](notes-numbers.jpg)
Author's own work.

The dataset used in this exercise contains the sheet music of 382 J.S. Bach chorales. For the music to be machine-readable, it has to be converted into numbers. This is achieved by representing each note as its index on the piano, with number 0 indicating the absence of a note. The chorales stored in the dataset are broken down into 1/16th time units, each unit containing four numbers which indicate the four notes being played on this count.

[Dataset Source](https://github.com/ageron/handson-ml2/blob/master/15_processing_sequences_using_rnns_and_cnns.ipynb)
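
To make this encoding more concrete, here is a minimal sketch of how a short fragment could look once it has been turned into numbers. The note values are invented for illustration and are not taken from the dataset, and the helper `note_name` is hypothetical: it assumes the numbering in which note 69 is A4, which is the convention the synthesiser code later in this notebook relies on.

```python
# Hypothetical fragment: each inner list is one 1/16th time step with four voices.
# The values are made up for illustration; a 0 means that no note is played.
fragment = [
    [74, 70, 65, 58],
    [74, 70, 65, 58],
    [75, 70, 58, 55],
    [75, 70, 58, 0],
]

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def note_name(index):
    """Translate a note index into a name, assuming note 69 is A4."""
    if index == 0:
        return "rest"
    octave = index // 12 - 1
    return f"{NOTE_NAMES[index % 12]}{octave}"

for step in fragment:
    print([note_name(n) for n in step])
```

Reading the rows from top to bottom corresponds to moving forward in time, while each column follows one of the four voices.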

The dataset is fed into a "neural network", an advanced machine learning model loosely inspired by the structure of the human brain. It consists of different layers of neurons (the equivalent of brain cells), each of which passes signals on to the neurons in the next layer in an effort to learn patterns from the data they receive. Each neuron (also called a node) in the network is like a sensor that picks up one specific pattern and sends its insights onwards, so that increasingly complex information is processed with each subsequent layer. In this case, the neural network should learn to recognise the relationships between notes in the chorales and thereby analyse the compositional style of J.S. Bach. The goal is for the neural network to be able to predict the notes which J.S. Bach would have chosen to continue a given melody. If successful, it can be used to produce novel compositions that sound musical and, ideally, are hard to distinguish from real Bach chorales.

Although what is considered musical is highly subjective, there is some common understanding of which melodies are aesthetically pleasing. The Oxford dictionary defines musical as "having a pleasant sound; melodious or tuneful." Have you ever heard a child improvise on their instrument for the first time? While they might be generating notes on it, it can be quite painful to sit through. Now compare that to a solo by a professional jazz musician. Or listen to the [new Dua Lipa single](https://www.youtube.com/watch?v=suAR1PYFNYA) side-by-side with [Stockhausen's Gesang der Jünglinge](https://www.youtube.com/watch?v=nffOJXcJCDg). Which one do you perceive as more melodious and pleasant-sounding?


In [None]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/suAR1PYFNYA?si=yJlSXf-IoSEvkWBd" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" ></iframe>

<iframe width="560" height="315" src="https://www.youtube.com/embed/nffOJXcJCDg?si=snzl5sORWvuYFBoS" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" ></iframe>

# Getting the Data

The first code snippet downloads the chorale dataset from [GitHub](https://github.com), a cloud platform that is used to store and share files, similar to Dropbox but with special features for developers (e.g. collaboration features, version history). The dataset is stored in a compressed format (similar to a .zip file) to save storage space, so the code also needs to extract the raw data to be able to work with it. [Dataset Source](https://github.com/ageron/handson-ml2/blob/master/datasets/jsb_chorales/README.md)

Moreover, the code imports a Python library called TensorFlow, which can be thought of as a collection of pre-made machine learning tools. Importing this library allows you to use advanced machine learning models without having to program them from scratch. [TensorFlow](https://www.tensorflow.org/)

In [None]:
import tensorflow as tf

tf.keras.utils.get_file(
    "jsb_chorales.tgz",
    "https://github.com/ageron/data/raw/main/jsb_chorales.tgz",
    cache_dir=".",
    extract=True)

### Training, Validation & Test Data

For the model to be able to learn from the chorales dataset, the dataset must be divided into separate training, validation and test sets, each of which supports either the training or the evaluation process.

- The training dataset is used to teach the model the distinctive musical patterns and styles of J.S. Bach.


- The validation dataset is used during training to check that the model can also make accurate predictions on chorales that were not included in the training set. It reveals whether the model "overfits" the data input during training, meaning that it merely memorises the training material instead of abstracting from it.


- The test set serves as a final check to see how well the model generalises to new, unseen examples. The model is shown the remaining chorales from the collection and evaluated on how accurately it can predict sequences of notes. This gives us an indication of how well the model is expected to perform in real-world scenarios. 


The training and validation datasets are used for preliminary assessments of the model's performance, based on which it is improved. The test set produces an accuracy score that is representative of the model in its final form. You can think of this as a summative assessment carried out at the end of a university course, whereas the training and validation exercises are more akin to informal feedback or formative assessments that seek to track and refine the model's progress.

In this case, the three subsets were already pre-selected and the code only had to separate them accordingly.

In [None]:
from pathlib import Path

jsb_chorales_dir = Path("datasets/jsb_chorales")
train_files = sorted(jsb_chorales_dir.glob("train/chorale_*.csv"))
valid_files = sorted(jsb_chorales_dir.glob("valid/chorale_*.csv"))
test_files = sorted(jsb_chorales_dir.glob("test/chorale_*.csv"))

The next code snippet loads the data from the three subsets into this notebook. Previously, they were stored in separate files, but now they can be accessed directly through the variables train_chorales, valid_chorales and test_chorales. You can think of this as copy-pasting the contents from separate Excel files into this document to ensure they are immediately accessible and can be edited.

In [None]:
import pandas as pd

def load_chorales(filepaths):
    return [pd.read_csv(filepath).values.tolist() for filepath in filepaths]

train_chorales = load_chorales(train_files)
valid_chorales = load_chorales(valid_files)
test_chorales = load_chorales(test_files)
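
As a rough illustration of what `load_chorales` returns for a single file, the sketch below parses a small CSV string with Python's standard `csv` module instead of reading a real file with pandas. The column names `note0` to `note3` and the note values are assumptions made for this example; the actual files may look different.

```python
import csv
import io

# Hypothetical CSV content standing in for one chorale file.
csv_text = """note0,note1,note2,note3
74,70,65,58
74,70,65,58
75,70,58,55
"""

def parse_chorale(text):
    """Return a list of chords, each chord being a list of four integers."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return [[int(value) for value in row] for row in reader]

chorale = parse_chorale(csv_text)
print(len(chorale), "time steps")
print(chorale[0])
```

Each loaded chorale is therefore a list of time steps, and `train_chorales` is simply a list of such lists.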

# Preparing the Data

This code snippet checks that the data has loaded correctly. It collects every individual note in the dataset and computes a few statistics: the total number of unique notes, and the lowest and highest note. The lowest and highest notes are then compared against the expected values; if they match, one can reasonably assume that the dataset has loaded correctly. Since the model bases all its predictions on patterns within the training data, its output can only be as good as what goes into it. This is referred to as the "garbage in, garbage out" principle of machine learning, and it is why it is worth double-checking that the training data has loaded correctly and completely.

The notes cover a range from 36 to 81, with the additional inclusion of 0 to denote silence. In the numbering used by the synthesiser code below, where note 69 is A4 (440 Hz), 36 corresponds to C2 and 81 to A5.

In [None]:
notes = set()
for chorales in (train_chorales, valid_chorales, test_chorales):
    for chorale in chorales:
        for chord in chorale:
            notes |= set(chord)

n_notes = len(notes)
min_note = min(notes - {0})  # 0 denotes silence, so exclude it from the minimum
max_note = max(notes)

assert min_note == 36
assert max_note == 81

### Code for Synthesiser

The following cell contains code for a simple synthesiser that turns note numbers into sound. It is not part of the machine learning code that generates Bach-style music, but it is useful for listening to the results and to the samples used for training!

In [None]:
from IPython.display import Audio, display
import numpy as np

def notes_to_frequencies(notes):
    # Frequency doubles when you go up one octave; there are 12 semi-tones
    # per octave; Note A on octave 4 is 440 Hz, and it is note number 69.
    return 2 ** ((np.array(notes) - 69) / 12) * 440

def frequencies_to_samples(frequencies, tempo, sample_rate):
    note_duration = 60 / tempo # the tempo is measured in beats per minute
    # To reduce click sound at every beat, we round the frequencies to try to
    # get the samples close to zero at the end of each note.
    frequencies = (note_duration * frequencies).round() / note_duration
    n_samples = int(note_duration * sample_rate)
    time = np.linspace(0, note_duration, n_samples)
    sine_waves = np.sin(2 * np.pi * frequencies.reshape(-1, 1) * time)
    # Removing all notes with frequencies ≤ 9 Hz (includes note 0 = silence)
    sine_waves *= (frequencies > 9.).reshape(-1, 1)
    return sine_waves.reshape(-1)

def chords_to_samples(chords, tempo, sample_rate):
    freqs = notes_to_frequencies(chords)
    freqs = np.r_[freqs, freqs[-1:]] # make last note a bit longer
    merged = np.mean([frequencies_to_samples(melody, tempo, sample_rate)
                     for melody in freqs.T], axis=0)
    n_fade_out_samples = sample_rate * 60 // tempo # fade out last note
    fade_out = np.linspace(1., 0., n_fade_out_samples)**2
    merged[-n_fade_out_samples:] *= fade_out
    return merged

def play_chords(chords, tempo=160, amplitude=0.1, sample_rate=44100, filepath=None):
    samples = amplitude * chords_to_samples(chords, tempo, sample_rate)
    if filepath:
        from scipy.io import wavfile
        samples = (2**15 * samples).astype(np.int16)
        wavfile.write(filepath, sample_rate, samples)
        return display(Audio(filepath))
    else:
        return display(Audio(samples, rate=sample_rate))

## testing the synthesiser
for index in range(3):
    play_chords(train_chorales[index])
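
The conversion in `notes_to_frequencies` follows the equal-temperament formula f = 440 · 2^((n − 69) / 12), where n is the note number. A small standalone check of that formula, written in plain Python rather than NumPy, might look like this (the function name `note_to_frequency` is introduced only for this sketch):

```python
def note_to_frequency(note):
    """Equal-temperament frequency in Hz, with note 69 (A4) tuned to 440 Hz."""
    return 440 * 2 ** ((note - 69) / 12)

# One octave below A4, A4 itself, and one octave above A4.
for note in (57, 69, 81):
    print(note, round(note_to_frequency(note), 2))

# Note 0 (silence) lands at roughly 8.2 Hz, below the 9 Hz cut-off in the synthesiser.
print(round(note_to_frequency(0), 2))
```

Going up twelve notes doubles the frequency, which is why the silence marker 0 ends up at a frequency low enough for the synthesiser to filter out.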

## Converting chords into arpeggios

To generate new chorales, the goal is to train a model that uses the preceding chords to predict subsequent chords. However, if the model tries to predict all four notes of each chord at the same time, the risk is higher that the result will be non-musical. Bach chorales are harmonisations of existing hymn melodies, which implies that the connections between individual notes within a voice are more significant than dependencies at the chord level. To shift the focus onto individual notes, the developer of the model decided to convert the chords into arpeggios (see image 1). This means that the four notes of each chord are turned into a sequence of notes and each note is predicted individually.

![chord vs. arpeggio demonstration](chord-vs-arpeggio.jpg)

[Image 1: classicalguitar.org](https://www.classicalguitar.org/classical-guitar-technique/)

[Reproduced under CC BY-NC-ND License](https://creativecommons.org/licenses/by-nc-nd/4.0/)

The chorales are converted into a format called windows, built with the dataset tools of the previously imported "TensorFlow" framework. In this case, windows are overlapping excerpts of a chorale. Each window holds 33 consecutive chords (32 plus one extra, so that the targets can be the inputs shifted by one note), flattened into 132 sequential notes. By utilising this technique, the model is trained on "bite-sized" melodic fragments rather than whole chorales at once.
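
The windowing and arpeggio conversion can be sketched in plain Python, without TensorFlow. This is a simplified illustration under stated assumptions, not the notebook's actual pipeline: the helper names `make_windows` and `to_arpeggio` and the toy chorale are made up, while the window size of 32 (plus one extra chord), the shift of 16 and the lowest note of 36 mirror the values used in the code cell that follows.

```python
def make_windows(chorale, window_size=32, window_shift=16):
    """Cut a chorale (a list of chords) into overlapping windows of window_size + 1 chords."""
    length = window_size + 1
    return [chorale[start:start + length]
            for start in range(0, len(chorale) - length + 1, window_shift)]

def to_arpeggio(window, min_note=36):
    """Shift notes so that 0 stays silence and min_note becomes 1, then flatten the chords."""
    return [0 if note == 0 else note - min_note + 1
            for chord in window for note in chord]

# Toy chorale: 65 identical chords (made-up values, not real data).
toy_chorale = [[74, 70, 65, 58]] * 65

windows = make_windows(toy_chorale)
arpeggio = to_arpeggio(windows[0])
inputs, targets = arpeggio[:-1], arpeggio[1:]  # targets are the inputs shifted by one note

print(len(windows), len(arpeggio), len(inputs))
```

Because the targets are the inputs shifted by one position, the model is asked at every step to predict the note that comes next.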

### Sources:

[Source 1: Aurélien Géron](https://github.com/ageron/handson-ml2/blob/master/15_processing_sequences_using_rnns_and_cnns.ipynb)

[Source 2: TensorFlow Documentation](https://www.tensorflow.org/api_docs/python/tf/data/Dataset)

In [None]:
import tensorflow as tf

def create_target(batch):
    X = batch[:, :-1]
    Y = batch[:, 1:] # predict next note in each arpeggio, at each step
    return X, Y

def preprocess(window):
    window = tf.where(window == 0, window, window - min_note + 1) # shift values
    return tf.reshape(window, [-1]) # convert to arpeggio

def bach_dataset(chorales, batch_size=32, shuffle_buffer_size=None,
                 window_size=32, window_shift=16, cache=True):
    def batch_window(window):
        return window.batch(window_size + 1)

    def to_windows(chorale):
        dataset = tf.data.Dataset.from_tensor_slices(chorale)
        dataset = dataset.window(window_size + 1, window_shift, drop_remainder=True)
        return dataset.flat_map(batch_window)

    chorales = tf.ragged.constant(chorales, ragged_rank=1)
    dataset = tf.data.Dataset.from_tensor_slices(chorales)
    dataset = dataset.flat_map(to_windows).map(preprocess)
    if cache:
        dataset = dataset.cache()
    if shuffle_buffer_size:
        dataset = dataset.shuffle(shuffle_buffer_size)
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(create_target)
    return dataset.prefetch(1)

Each of the three subsets (training, validation and test) is now converted into this sequential form. Only the training set is shuffled.

In [None]:
train_set = bach_dataset(train_chorales, shuffle_buffer_size=1000)
valid_set = bach_dataset(valid_chorales)
test_set = bach_dataset(test_chorales)

# Building the Model

The following code defines the architecture of the neural network.

- The first layer is a so-called embedding layer. It is used to represent discrete categorical variables (in this case musical notes) as continuous vectors. It turns each note into a specific point in a five-dimensional space. The position of each note in this space is determined by these five numbers, and notes that are similar or share certain musical characteristics will be closer to each other in this space. Essentially, one can envision these vectors as coordinates for a map of all musical notes and each melody a route for a journey throughout this map. By having five numbers to capture attributes in, the computer can develop a more nuanced understanding of the relationships between notes compared to having a single number represent it. [Source: Keras Documentation](https://keras.io/api/layers/core_layers/embedding/)


- This is followed by four hidden "Conv1D" layers. A Conv1D layer scans through the input sequence, identifies patterns using filters, and passes this learned information to the next layers of the neural network. In this context, these layers act like musical filters that look for specific patterns in the sequence of notes. Imagine you are reading the sheet music for the chorales, and with each filter, you focus on different aspects of the notes. For example, the first layer might look at pairs of consecutive notes, searching for simple patterns. Each layer analyses more complex and intricate sequences than the previous one and passes its insights on to the next. [Source: Keras Documentation](https://keras.io/api/layers/convolution_layers/convolution1d/)


- After the convolutional layers, there is an LSTM layer. LSTM stands for Long Short-Term Memory and, as the name suggests, acts like the memory of the model. It helps the model understand the context and dependencies between notes over longer sequences. This is important to capture the musical flow and structure of the chorales because temporal elements such as repetition play a big role in our perception of music. [Source: Music Generation using an LSTM](https://arxiv.org/abs/2203.12105)


- The output layer gives out a likelihood for every note in the corpus (there are 46 unique notes and a "0" for silence). The note which receives the highest likelihood score is the note that the network thinks J.S. Bach himself would have used at this point in the composition, given the previous notes. In other words, this is the note that is statistically most likely to appear at this position. [Source: Aurélien Géron](https://github.com/ageron/handson-ml2/blob/master/15_processing_sequences_using_rnns_and_cnns.ipynb)
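
To give a feel for the first and last layers described above, the following sketch mimics them in plain Python. It is only an illustration under simplifying assumptions: the embedding values and output scores are random rather than learnt, the helper names `embed` and `softmax` are made up for this example, and the convolutional and LSTM layers in between are left out entirely.

```python
import math
import random

n_notes = 47           # 46 pitches plus 0 for silence, as described in the text
n_embedding_dims = 5

random.seed(0)
# Embedding table: one row of five numbers per note. Here the values are random;
# in the real model they are learnt during training.
embedding_table = [[random.uniform(-1, 1) for _ in range(n_embedding_dims)]
                   for _ in range(n_notes)]

def embed(note_ids):
    """Look up the five-dimensional vector for every note id in a sequence."""
    return [embedding_table[i] for i in note_ids]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

vectors = embed([39, 35, 30, 23])
probabilities = softmax([random.uniform(-2, 2) for _ in range(n_notes)])
most_likely = max(range(n_notes), key=lambda i: probabilities[i])
print(len(vectors), len(vectors[0]), most_likely)
```

The embedding is nothing more than a table lookup, and the output is one probability per possible note, from which the most likely note can be read off.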

In [None]:
n_embedding_dims = 5

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=n_notes, output_dim=n_embedding_dims,
                           input_shape=[None]),
    tf.keras.layers.Conv1D(32, kernel_size=2, padding="causal", activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Conv1D(48, kernel_size=2, padding="causal", activation="relu", dilation_rate=2),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Conv1D(64, kernel_size=2, padding="causal", activation="relu", dilation_rate=4),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Conv1D(96, kernel_size=2, padding="causal", activation="relu", dilation_rate=8),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.LSTM(256, return_sequences=True),
    tf.keras.layers.Dense(n_notes, activation="softmax")
])

In [None]:
model.summary()
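
The four convolutional layers in the model definition use `kernel_size=2` with `dilation_rate` values of 1 (the default), 2, 4 and 8. One way to see why the dilation doubles each time is to work out how many earlier notes a single output of the last convolutional layer can "see". The sketch below uses the standard rule for stacked stride-1 convolutions, where each layer adds (kernel_size − 1) × dilation_rate steps; the function name `receptive_field` is an assumption made for this example.

```python
def receptive_field(layers):
    """Number of input steps that one output step can see.

    `layers` is a list of (kernel_size, dilation_rate) pairs for stacked
    causal convolutions with stride 1.
    """
    field = 1
    for kernel_size, dilation_rate in layers:
        field += (kernel_size - 1) * dilation_rate
    return field

conv_layers = [(2, 1), (2, 2), (2, 4), (2, 8)]  # values read off the model definition
notes_seen = receptive_field(conv_layers)
print(notes_seen, "notes =", notes_seen // 4, "chords")
```

Under these assumptions the convolutional stack covers 16 notes, i.e. four full chords, before the LSTM layer takes over the longer-range context.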

# Training the Model

The next code cell trains the model using the training and validation sets.

Here is a brief overview of the training process.

![Visualisation of ML training pipeline](training-process.png)
Source: Author's own work.

- At first, the predictions of the model are going to be very random and non-musical. To be able to accurately predict the next note in a sequence, the model needs to learn which of the patterns it picks up on are the most important indicators for which note comes next. This is determined through internal parameters called weights and biases.

- In each iteration of the training process, the model receives feedback on how well it did. If its prediction was close to the actual next note in a Bach chorale, it gets positive feedback. If it's wrong, it learns from the mistake. A so-called loss function is the model's measure of how wrong its prediction was, i.e. how much the predicted note diverged from the real one.  

- Now, just like a musician adjusting their performance based on feedback, the model tweaks its parameters to minimise the loss function. An algorithm called an optimiser adjusts the model's weights and biases in ways that make the loss value a little smaller with each iteration. Through this process, the model becomes incrementally better at suggesting the most Bach-like notes in the sequence.

- During the validation process, the model performs on a set of Bach chorales it has not seen during training. This is like checking how well the model can now compose new chorales in Bach's style, not just repeat what it has memorised.

- The training and validation process is repeated 20 times. Each pass through the entire training dataset is referred to as an "epoch".

The resulting model can predict the next note in a sequence with an accuracy of 93.26% on the training chorales and 81.57% on the validation set.
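
The loss and accuracy figures mentioned above can be illustrated with a few lines of plain Python. This is a simplified sketch, not the Keras implementation: the probabilities are made up and only cover three possible notes, and the function names are chosen for this example. The loss for a single prediction is the negative logarithm of the probability the model assigned to the true note; Keras averages this over all notes in a batch.

```python
import math

def sparse_categorical_crossentropy(probabilities, target):
    """Loss for one prediction: negative log of the probability given to the true note."""
    return -math.log(probabilities[target])

def accuracy(predictions, targets):
    """Share of steps where the highest-scoring note equals the true note."""
    hits = sum(1 for probs, target in zip(predictions, targets)
               if max(range(len(probs)), key=probs.__getitem__) == target)
    return hits / len(targets)

# Made-up probabilities over three possible notes, for illustration only.
predictions = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.5, 0.4, 0.1]]
targets = [0, 2, 1]

losses = [sparse_categorical_crossentropy(p, t) for p, t in zip(predictions, targets)]
print([round(loss, 3) for loss in losses], accuracy(predictions, targets))
```

A confident correct prediction yields a small loss, while the third example, where the true note only received a probability of 0.4, is penalised more heavily and also counts as a miss for the accuracy.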



In [None]:
optimizer = tf.keras.optimizers.Nadam(learning_rate=1e-3)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
              metrics=["accuracy"])
model.fit(train_set, epochs=20, validation_data=valid_set)

# Saving and Evaluating Your Model

Here it becomes relevant again that we set aside a number of chorales for evaluation purposes. The model's performance is now evaluated on the test set, containing chorales the model has not seen before. It achieves an accuracy of 81.33% in identifying the correct note following an input sequence.

In [None]:
model.save("my_bach_model", save_format="tf")
model.evaluate(test_set)

# Generating Chorales

The final blocks of code put the newly trained and evaluated model to use.

The code provides the model with a short sequence of initial "seed" chords, which it transforms into an arpeggio. The model then predicts the succeeding note, appends it to the sequence, and continues this process iteratively. Finally, the generated notes are organised into groups of four to reconstruct chords, forming the resulting chorale.



In [None]:
def generate_chorale_v2(model, seed_chords, length, temperature=1):
    arpegio = preprocess(tf.constant(seed_chords, dtype=tf.int64))
    arpegio = tf.reshape(arpegio, [1, -1])
    for chord in range(length):
        for note in range(4):
            next_note_probas = model.predict(arpegio)[0, -1:]
            rescaled_logits = tf.math.log(next_note_probas) / temperature
            next_note = tf.random.categorical(rescaled_logits, num_samples=1)
            arpegio = tf.concat([arpegio, next_note], axis=1)
    arpegio = tf.where(arpegio == 0, arpegio, arpegio + min_note - 1)
    return tf.reshape(arpegio, shape=[-1, 4])
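
The structure of this loop can be tried out without a trained model. The sketch below is a simplified stand-in, not the function above: `generate` and `dummy_predictor` are hypothetical names, the seed values are made up, and the "model" simply repeats the note it produced one chord earlier. It is only meant to show the flatten, extend and regroup steps.

```python
def generate(predict_next, seed_chords, length, min_note=36):
    """Extend seed chords note by note, then regroup the notes into four-part chords."""
    # Flatten the chords and shift pitches so that min_note becomes 1 (0 stays silence).
    arpeggio = [0 if n == 0 else n - min_note + 1
                for chord in seed_chords for n in chord]
    for _ in range(length * 4):  # four notes per new chord
        arpeggio.append(predict_next(arpeggio))
    # Undo the shift and cut the flat sequence back into chords of four notes.
    notes = [0 if n == 0 else n + min_note - 1 for n in arpeggio]
    return [notes[i:i + 4] for i in range(0, len(notes), 4)]

def dummy_predictor(arpeggio):
    """Stand-in for the trained model: repeats the note from one chord earlier."""
    return arpeggio[-4]

seed = [[74, 70, 65, 58], [74, 70, 65, 58]]
chorale = generate(dummy_predictor, seed, length=3)
print(len(chorale), chorale[-1])
```

In the real function, the stand-in predictor is replaced by a call to the trained network followed by a random draw from its predicted probabilities.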

## Seed Chords

This short code snippet extracts the first eight chords from one of the test chorales (the third one in the list). (Technically, it only picks out two chords, which repeat four times each.)

The synthesiser programmed earlier makes the selected chords audible.

In [None]:
seed_chords = test_chorales[2][:8]
play_chords(seed_chords, amplitude=0.2)

## Chorale Generation Parameters

To generate a new chorale, the model requires the initial chords and for the user to set a desired length of the chorale. In this case, the model will generate 56 chords in addition to the 8 initial chords.

If the model always opted for the note with the highest score, it would play it safe, which often leads to monotonous melodies. Running the model multiple times would also result in identical melodies.

To address this issue, the developer introduced an additional parameter called temperature. The temperature determines how "daring" the model acts. Instead of always choosing the note with the highest score, the model randomly chooses between all candidate notes according to their probability. The higher the temperature, the more often the model picks a note that is unlikely to come next in the sequence. Think of this as allowing the model to "go wild" and free itself from the rules it has learnt.

![Animal Muppet GIF playing the drums](animal.gif)
[Image Source: Tenor](https://tenor.com/bjEhG.gif)

[Source: Aurélien Géron](https://github.com/ageron/handson-ml2/blob/master/15_processing_sequences_using_rnns_and_cnns.ipynb)
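
The effect of the temperature can be made visible with a small plain-Python sketch. It follows the same idea as the generation function (take the logarithm of the probabilities, divide by the temperature, then sample), but the function names `apply_temperature` and `sample` and the three example probabilities are assumptions made purely for illustration.

```python
import math
import random

def apply_temperature(probabilities, temperature):
    """Rescale probabilities: log, divide by temperature, then renormalise with softmax."""
    logits = [math.log(p) / temperature for p in probabilities]
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probabilities, temperature=1.0, rng=random):
    """Draw one index at random according to the rescaled probabilities."""
    rescaled = apply_temperature(probabilities, temperature)
    return rng.choices(range(len(rescaled)), weights=rescaled, k=1)[0]

probs = [0.7, 0.2, 0.1]  # made-up model output for three candidate notes
for temperature in (0.8, 1.0, 1.5):
    print(temperature, [round(p, 3) for p in apply_temperature(probs, temperature)])
```

A temperature below 1 sharpens the distribution in favour of the most likely note, a temperature of 1 leaves it unchanged, and a temperature above 1 flattens it so that unlikely notes are drawn more often.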



In [None]:
new_chorale_v2_cold = generate_chorale_v2(model, seed_chords, 56, temperature=0.8)
play_chords(new_chorale_v2_cold, filepath="bach_cold.wav")

In [None]:
new_chorale_v2_medium = generate_chorale_v2(model, seed_chords, 56, temperature=1.0)
play_chords(new_chorale_v2_medium, filepath="bach_medium.wav")

In [None]:
new_chorale_v2_hot = generate_chorale_v2(model, seed_chords, 56, temperature=1.5)
play_chords(new_chorale_v2_hot, filepath="bach_hot.wav")

# Reflection

### Evaluating the results

- Unlike other forms of generative AI, music generation systems are usually evaluated by humans listening to the machine compositions ([Carnovalini and Rodà (2020)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7861321/)). Therefore, I decided to judge the performance of the model by listening to the generated examples. The chorale generated with the "cold" temperature setting sounds the most musical, but it often holds notes significantly longer than the listener would expect. Consequently, while the harmonisations sound Bach-like and aesthetically pleasing, the rhythmic composition of the chorale sounds uninteresting and artificial. While increasing the temperature brings more diversity into the compositions, it makes them sound less aesthetically pleasing and less stylistically accurate. Overall, while the statistical evaluation resulted in an accuracy of over 80%, the results still lack a deeper sense of musicality that can be hard to capture with numbers.

### Own Application

- I would like to apply the machine learning techniques featured in the code to a different musical context, such as pop music. The required dataset would need to be structured similarly to the existing one. To achieve a similar four-voice format, it could contain one main vocal melody, one backing vocal harmony, one high-pitched and one low-pitched instrumental part.

- I am interested in exploring whether the model would perform better with pop music, since it is often considered to contain simpler melodies and harmonies. At the same time, the compositional rules are generally less strict and it will be hard to find a comparably homogeneous training corpus. I am also just curious to see what kind of songs the model would come up with and if they sound musical at all.

- To feed the pop songs into the computer, they would first need to be manually transcribed to sheet music and then converted into numerical form. Most modern productions include more than four voices, so one would have to decide which instrumental parts to use. The preprocessing workflow would be very similar to the one outlined in this exercise.

- This model and similar AI systems generate music based on existing compositions. While J.S. Bach's music is in the public domain, datasets containing more recent compositions are often protected by copyright, which raises concerns about the originality and ownership of the generated music. If any AI-generated compositions are published, there is a risk of unintentional plagiarism. In addition, composers are usually not compensated when their music is used as training data.

