# Chapter 15: Music Generation with MuseGAN


You'll learn to generate music in this chapter and the next, but the two chapters take very different approaches. In this chapter, you'll apply the techniques you learned for image GANs and treat a piece of music as a multi-dimensional object, similar to an image. Like all GAN models, MuseGAN has a discriminator and a generator. The generator creates a whole piece of music and presents it to the discriminator. Based on the discriminator's feedback, the generator gradually fine-tunes its output until it can pass as a real piece of music from the training set. In the next chapter, by contrast, we'll treat a piece of music as a sequence and use techniques from natural language processing to create music. 

The MuseGAN model was proposed by Dong, Hsiao, Yang, and Yang in 2017 (https://arxiv.org/abs/1709.06298). The code in this chapter is adapted from Azamat Kanametov's GitHub repository (https://github.com/akanametov/musegan). 

Start a new cell in ch15.ipynb and execute the following lines of code in it:

In [1]:
import os

os.makedirs("files/ch15", exist_ok=True)

# 1. Multi-Track Music Files
Multi-track music pieces have sounds from multiple instruments such as bass, drums, pianos, or strings. 

In this section, we'll first download the training data and learn how to convert music files to a format that the MuseGAN can understand. 

## 1.1. Download the Music Files
We'll use the JSB Chorales piano roll data set from https://github.com/czhuang/JSB-Chorales-dataset. Download the music file *Jsb16thSeparated.npz* and place it in the folder /Desktop/ai/files/ch15/ on your computer.

Next, download the two utility modules *midi_util.py* and *MuseGAN_util.py* from the book's GitHub repository and place them in /Desktop/ai/utils/ on your computer. We can now load up the music files and organize them in batches:

In [2]:
from torch.utils.data import DataLoader
from utils.midi_util import MidiDataset

dataset = MidiDataset('files/ch15/Jsb16thSeparated.npz')
loader = DataLoader(dataset, batch_size=64, 
                        shuffle=True, drop_last=True)

We can print out a song and see the data format:

In [3]:
songs=next(iter(loader))
first_song=songs[0]
print(first_song.shape)

torch.Size([4, 2, 16, 84])


The shape of each song is (4, 2, 16, 84), meaning 4 tracks, 2 bars, 16 steps per bar, and 84 pitches per step. All the values are normalized to the range between -1 and 1, which you can verify as follows:

In [4]:
flat=first_song.reshape(-1,)
print(min(flat), max(flat))

tensor(-1.) tensor(1.)


If you recall, in image GANs, we also normalize all image data (i.e., pixels) to the range -1 to 1 and gradually adjust the values during training. The MuseGAN takes a similar approach, with the exception that the numbers represent music notes instead of image pixels. 
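
To make the data format concrete, the sketch below builds a toy song tensor with the same (4, 2, 16, 84) layout and fills it by hand. It assumes that -1 marks a silent pitch and 1 marks a sounding one, which is consistent with the minimum and maximum printed above but is not confirmed by the data set itself. The name `toy_song` and the chosen track, bar, step, and pitch indices are arbitrary.

```python
import torch

# Toy song: 4 tracks, 2 bars, 16 steps per bar, 84 pitches.
# Assumption: -1 marks a silent pitch and 1 marks a sounding one.
toy_song = -torch.ones(4, 2, 16, 84)

# Hold pitch index 40 on track 0 for the first four steps of bar 0.
toy_song[0, 0, 0:4, 40] = 1.0

# Count how many pitches sound at each step of track 0, bar 0.
active = toy_song[0, 0] > 0            # boolean tensor of shape (16, 84)
notes_per_step = active.sum(dim=1)     # one count per step

print(toy_song.shape)
print(toy_song.min().item(), toy_song.max().item())
print(notes_per_step.tolist())
```

Reading the tensor this way shows why a song can be handled like an image: each track is a grid of on/off cells over time and pitch.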

## 1.2. Convert Data to Songs
Right now, the songs are formatted as PyTorch tensors and ready to be fed to a neural network. But before we do that, we want to convert the songs to music formats and hear them so we know what type of songs the model will likely generate. 

Below we convert a song to a MIDI file:

In [5]:
from utils.midi_util import convert_to_midi

music_data=convert_to_midi(first_song.unsqueeze(0))
music_data.write("midi","files/ch15/song1.mid")

'files/ch15/song1.mid'

Now go to the folder /files/ch15/ and open the file *song1.mid* with a music player and you should hear a short piece of piano music. It lasts a few seconds. Alternatively, you can run the following code cell and use the *music21* library to play it:

In [6]:
from music21 import midi

mf = midi.MidiFile()
mf.open("files/ch15/song1.mid") 
mf.read()
mf.close()
stream = midi.translate.midiFileToStream(mf)
stream.show('midi')

Below, we extract the second and third songs and convert them into one single piece of longer music, lasting about 16 seconds: 

In [7]:
two_songs=songs[1:3]
music_data=convert_to_midi(two_songs)
music_data.write("midi","files/ch15/song2.mid")

'files/ch15/song2.mid'

Open the file *song2.mid* with a music player and you should hear a longer piece of piano music. Alternatively, you can press the play button below:

<audio src="https://gattonweb.uky.edu/faculty/lium/ml/song2.mp3" type="audio/mpeg" controls="" controlsList="nodownload"></audio>

# 2. Create a MuseGAN
In this section, we create a deep 3-D convolutional GAN model so that we can train the model later to generate music pieces. 

## 2.1. A Critic in MuseGAN
As we discussed in Chapter 7, using the Wasserstein distance in the loss function can stabilize training. We therefore follow what we did in Chapter 7 and use a critic rather than a discriminator in the MuseGAN. Specifically, the critic is not a binary classifier. Instead, the critic evaluates the work by the generator (in this case, a piece of music) and returns a score between $-\infty$ and $\infty$. The higher the score, the better the quality of the music. 

We create a music critic neural network as follows. It is defined in the file *MuseGAN_util.py* that you just downloaded:

In [8]:
import torch.nn as nn
import torch

class MuseCritic(nn.Module):
    def __init__(self,hid_channels=128,hid_features=1024,
        out_features=1,n_tracks=4,n_bars=2,n_steps_per_bar=16,
        n_pitches=84):
        super().__init__()
        self.n_tracks = n_tracks
        self.n_bars = n_bars
        self.n_steps_per_bar = n_steps_per_bar
        self.n_pitches = n_pitches
        in_features = 4 * hid_channels if n_bars == 2\
            else 12 * hid_channels
        self.seq = nn.Sequential(
            nn.Conv3d(self.n_tracks, hid_channels, 
                      (2, 1, 1), (1, 1, 1), padding=0),
            nn.LeakyReLU(0.3, inplace=True),
            nn.Conv3d(hid_channels, hid_channels, 
              (self.n_bars - 1, 1, 1), (1, 1, 1), padding=0),
            nn.LeakyReLU(0.3, inplace=True),
            nn.Conv3d(hid_channels, hid_channels, 
                      (1, 1, 12), (1, 1, 12), padding=0),
            nn.LeakyReLU(0.3, inplace=True),
            nn.Conv3d(hid_channels, hid_channels, 
                      (1, 1, 7), (1, 1, 7), padding=0),
            nn.LeakyReLU(0.3, inplace=True),
            nn.Conv3d(hid_channels, hid_channels, 
                      (1, 2, 1), (1, 2, 1), padding=0),
            nn.LeakyReLU(0.3, inplace=True),
            nn.Conv3d(hid_channels, hid_channels, 
                      (1, 2, 1), (1, 2, 1), padding=0),
            nn.LeakyReLU(0.3, inplace=True),
            nn.Conv3d(hid_channels, 2 * hid_channels, 
                      (1, 4, 1), (1, 2, 1), padding=(0, 1, 0)),
            nn.LeakyReLU(0.3, inplace=True),
            nn.Conv3d(2 * hid_channels, 4 * hid_channels, 
                      (1, 3, 1), (1, 2, 1), padding=(0, 1, 0)),
            nn.LeakyReLU(0.3, inplace=True),
            nn.Flatten(),
            nn.Linear(in_features, hid_features),
            nn.LeakyReLU(0.3, inplace=True),
            nn.Linear(hid_features, out_features))
    def forward(self, x):  
        return self.seq(x)

The input to the critic model is a piece of music with a shape of 4 by 2 by 16 by 84. Each Conv3d layer treats the music in each track as a three-dimensional object and applies filters to it to extract spatial features. A Conv3d layer works like the Conv2d layer used on images, except that its input is three-dimensional instead of two-dimensional. 

Notice that the last layer in the critic model is linear and we don't apply an activation function to its output. Therefore, the output from the critic model is a value from $-\infty$ to $\infty$.
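
The value `in_features = 4 * hid_channels` in the critic can be checked by tracing how each Conv3d layer shrinks the (bars, steps, pitches) axes. The sketch below uses the standard convolution output-size formula with dilation 1 and the kernel, stride, and padding values copied from the *MuseCritic()* class for `n_bars=2`. The helper `conv_out` is a hypothetical name introduced only for this illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of one convolution axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

# (kernel, stride, padding) of each Conv3d in MuseCritic with n_bars=2,
# listed per axis in the order (bars, steps, pitches).
layers = [
    ((2, 1, 1), (1, 1, 1), (0, 0, 0)),
    ((1, 1, 1), (1, 1, 1), (0, 0, 0)),    # kernel is (n_bars - 1, 1, 1)
    ((1, 1, 12), (1, 1, 12), (0, 0, 0)),
    ((1, 1, 7), (1, 1, 7), (0, 0, 0)),
    ((1, 2, 1), (1, 2, 1), (0, 0, 0)),
    ((1, 2, 1), (1, 2, 1), (0, 0, 0)),
    ((1, 4, 1), (1, 2, 1), (0, 1, 0)),
    ((1, 3, 1), (1, 2, 1), (0, 1, 0)),
]

shape = (2, 16, 84)   # bars, steps per bar, pitches
shapes = [shape]
for kernel, stride, padding in layers:
    shape = tuple(
        conv_out(n, k, s, p)
        for n, k, s, p in zip(shape, kernel, stride, padding)
    )
    shapes.append(shape)
    print(shape)
```

The trace ends at (1, 1, 1), so after `nn.Flatten()` only the channel axis remains. The last convolution outputs `4 * hid_channels` channels, which matches the input size of the first linear layer.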

## 2.2. The Generator G in MuseGAN
The generator G's job is to create a piece of music that the music critic rates as highly as possible. We create the following neural network to represent the generator G:

In [9]:
class MuseGenerator(nn.Module):
    def __init__(self,z_dimension=32,hid_channels=1024,
        hid_features=1024,out_channels=1,n_tracks=4,
        n_bars=2,n_steps_per_bar=16,n_pitches=84):
        super().__init__()
        self.n_tracks = n_tracks
        self.n_bars = n_bars
        self.n_steps_per_bar = n_steps_per_bar
        self.n_pitches = n_pitches
        self.chords_network=TemporalNetwork(z_dimension, 
                            hid_channels, n_bars=n_bars)
        self.melody_networks = nn.ModuleDict({})
        for n in range(self.n_tracks):
            self.melody_networks.add_module(
                "melodygen_" + str(n),
                TemporalNetwork(z_dimension, 
                 hid_channels, n_bars=n_bars))
        self.bar_generators = nn.ModuleDict({})
        for n in range(self.n_tracks):
            self.bar_generators.add_module(
                "bargen_" + str(n),BarGenerator(z_dimension,
            hid_features,hid_channels // 2,out_channels,
            n_steps_per_bar=n_steps_per_bar,n_pitches=n_pitches))
    def forward(self,chords,style,melody,groove):
        chord_outs = self.chords_network(chords)
        bar_outs = []
        for bar in range(self.n_bars):
            track_outs = []
            chord_out = chord_outs[:, :, bar]
            style_out = style
            for track in range(self.n_tracks):
                melody_in = melody[:, track, :]
                melody_out = self.melody_networks["melodygen_"\
                          + str(track)](melody_in)[:, :, bar]
                groove_out = groove[:, track, :]
                z = torch.cat([chord_out, style_out, melody_out,\
                               groove_out], dim=1)
                track_outs.append(self.bar_generators["bargen_"\
                                          + str(track)](z))
            track_out = torch.cat(track_outs, dim=1)
            bar_outs.append(track_out)
        out = torch.cat(bar_outs, dim=2)
        return out

We'll feed random noise from four latent inputs (chords, style, melody, and groove) to the generator. The chords and style vectors are shared by all tracks, while melody and groove each hold one vector per track. From these inputs, the generator produces a piece of music in the shape of (4, 2, 16, 84) with values between -1 and 1. 

Note that the *MuseGenerator()* class uses several other classes such as *BarGenerator()* and *TemporalNetwork()* that are defined in the file *MuseGAN_util.py*. 
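
The way the `forward()` method combines the latent inputs can be sketched without the real sub-networks. In the example below, `fake_temporal` and `fake_bar_generator` are hypothetical stand-ins for *TemporalNetwork()* and *BarGenerator()*. Their output shapes are assumptions inferred from how `forward()` indexes and concatenates the results, not taken from *MuseGAN_util.py*.

```python
import torch

batch, z_dim, n_tracks, n_bars = 3, 32, 4, 2

chords = torch.randn(batch, z_dim)
style = torch.randn(batch, z_dim)
melody = torch.randn(batch, n_tracks, z_dim)
groove = torch.randn(batch, n_tracks, z_dim)

# Hypothetical stand-ins for the real sub-networks (assumed shapes only).
def fake_temporal(z):
    # Assumed output: one z_dim-sized vector per bar, (batch, z_dim, n_bars).
    return z.unsqueeze(-1).repeat(1, 1, n_bars)

def fake_bar_generator(z):
    # Assumed output: one track and one bar, (batch, 1, 1, 16, 84).
    return torch.zeros(z.size(0), 1, 1, 16, 84)

chord_outs = fake_temporal(chords)
bar_outs = []
for bar in range(n_bars):
    track_outs = []
    for track in range(n_tracks):
        melody_out = fake_temporal(melody[:, track, :])[:, :, bar]
        # Four 32-dimensional vectors are joined into one 128-dimensional input.
        z = torch.cat([chord_outs[:, :, bar], style, melody_out,
                       groove[:, track, :]], dim=1)
        track_outs.append(fake_bar_generator(z))
    bar_outs.append(torch.cat(track_outs, dim=1))   # join tracks
out = torch.cat(bar_outs, dim=2)                    # join bars

print(z.shape)    # torch.Size([3, 128])
print(out.shape)  # torch.Size([3, 4, 2, 16, 84])
```

Each bar of each track gets its own latent input, which is how the generator can vary the tracks independently while chords and style keep them coordinated.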

## 2.3. Optimizers and the Loss Function

We'll create a generator and a critic based on the *MuseGenerator()* and *MuseCritic()* classes in the local module, as follows:

In [10]:
import torch
from utils.MuseGAN_util import (
    init_weights, MuseGenerator, MuseCritic)

device = "cuda" if torch.cuda.is_available() else "cpu"
generator = MuseGenerator(z_dimension=32, hid_channels=1024, 
              hid_features=1024, out_channels=1).to(device)
critic = MuseCritic(hid_channels=128,
                    hid_features=1024,
                    out_features=1).to(device)
generator = generator.apply(init_weights)
critic = critic.apply(init_weights)

Since the critic produces a rating rather than a classification, the loss function is defined as the negative average of the product between the prediction and the target, as follows:

In [11]:
def loss_fn(pred,target):
    return -torch.mean(pred*target)

During training, we'll set the target to 1 for the generator, so its objective is to produce music whose rating (i.e., the variable *pred* in the above function) is as high as possible. For the critic, we'll set the target to 1 for real music and -1 for fake music in the loss function. That is, the critic's job is to assign a high rating to real music and a low rating to fake music. 
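
A small numeric check makes the sign conventions easier to follow. The sketch below repeats the `loss_fn()` definition from above and applies it to made-up critic scores; the score values are purely illustrative.

```python
import torch

def loss_fn(pred, target):
    return -torch.mean(pred * target)

real_pred = torch.tensor([2.0, 4.0])    # toy critic scores for real music
fake_pred = torch.tensor([-1.0, 0.0])   # toy critic scores for fake music

# Critic: target 1 for real music, -1 for fake music.
real_loss = loss_fn(real_pred, torch.ones_like(real_pred))    # -mean(real) = -3.0
fake_loss = loss_fn(fake_pred, -torch.ones_like(fake_pred))   #  mean(fake) = -0.5

# Generator: target 1 for its own fake music.
gen_loss = loss_fn(fake_pred, torch.ones_like(fake_pred))     # -mean(fake) = 0.5

print(real_loss.item(), fake_loss.item(), gen_loss.item())
```

Before the gradient penalty is added, the critic's loss is the mean fake score minus the mean real score, so minimizing it widens the gap between the two. The generator's loss falls as the critic's scores for fake music rise.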

Similar to what we did in Chapter 7, we use the Wasserstein distance with a gradient penalty in the critic's loss function to stabilize training. The gradient penalty is defined in the file *MuseGAN_util.py*, as follows:

In [12]:
class GradientPenalty(nn.Module):
    def __init__(self):
        super().__init__()
    def forward(self, inputs, outputs):
        grad = torch.autograd.grad(
            inputs=inputs,
            outputs=outputs,
            grad_outputs=torch.ones_like(outputs),
            create_graph=True,
            retain_graph=True,
        )[0]
        grad_=torch.norm(grad.view(grad.size(0),-1),p=2,dim=1)
        penalty = torch.mean((1. - grad_) ** 2)
        return penalty

We'll use the Adam optimizer for both the critic and the generator:

In [13]:
lr = 0.001
g_optimizer = torch.optim.Adam(generator.parameters(),
                               lr=lr, betas=(0.5, 0.9))
c_optimizer = torch.optim.Adam(critic.parameters(),
                               lr=lr, betas=(0.5, 0.9))

# 3. Train the MuseGAN
Now that we have the training data and the two networks, we'll train the MuseGAN model. After that, we'll discard the critic network and use the generator network to create multi-track music that resembles pieces from the training set. 

First we define a few hyperparameters:

In [14]:
from utils.MuseGAN_util import loss_fn, GradientPenalty

batch_size=64
repeat=5
display_step=10
epochs=500
alpha=torch.rand((batch_size,1,1,1,1)).requires_grad_().to(device)

The following function *train_epoch()* trains the model for one epoch:

In [15]:
def train_epoch():
    e_gloss = 0
    e_closs = 0
    for real in loader:
        real = real.to(device)
        # Train Critic
        for _ in range(repeat):
            chords = torch.randn(batch_size, 32).to(device)
            style = torch.randn(batch_size, 32).to(device)
            melody = torch.randn(batch_size, 4, 32).to(device)
            groove = torch.randn(batch_size, 4, 32).to(device)
            c_optimizer.zero_grad()
            with torch.no_grad():
                fake = generator(chords, style, melody,\
                                 groove).detach()
            realfake = alpha * real + (1 - alpha) * fake
            fake_pred = critic(fake)
            real_pred = critic(real)
            realfake_pred = critic(realfake)
            fake_loss =  loss_fn(fake_pred, \
                                 - torch.ones_like(fake_pred))
            real_loss = loss_fn(real_pred,\
                                torch.ones_like(real_pred))
            gp = GradientPenalty()
            penalty = gp(realfake, realfake_pred)
            closs = fake_loss + real_loss + 10 * penalty
            closs.backward(retain_graph=True)
            c_optimizer.step()
            e_closs += closs.item() / (repeat*len(loader))
        # Train Generator
        g_optimizer.zero_grad()
        chords = torch.randn(batch_size, 32).to(device)
        style = torch.randn(batch_size, 32).to(device)
        melody = torch.randn(batch_size, 4, 32).to(device)
        groove = torch.randn(batch_size, 4, 32).to(device)
        fake = generator(chords, style, melody, groove)
        fake_pred = critic(fake)
        gloss = loss_fn(fake_pred, torch.ones_like(fake_pred))
        gloss.backward()
        g_optimizer.step()
        e_gloss += gloss.item() / len(loader)
    return e_gloss, e_closs

The training process closely follows the one we used in Chapter 7 when we trained the conditional GAN with gradient penalty. 
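
The critic step interpolates between real and fake samples and penalizes the critic when the gradient norm at those points moves away from 1. The sketch below reproduces that computation on a toy critic whose gradient is known in advance. The linear critic `f(x) = 2 * sum(x)` and the tensor sizes are assumptions chosen only to make the result easy to verify by hand.

```python
import torch

torch.manual_seed(0)
real = torch.randn(5, 4)
fake = torch.randn(5, 4)

# One mixing weight per sample, broadcast over the feature axis.
alpha = torch.rand(5, 1)
realfake = (alpha * real + (1 - alpha) * fake).requires_grad_()

# Toy critic: f(x) = 2 * sum(x), so the gradient is 2 in each of the 4 slots.
scores = (2.0 * realfake).sum(dim=1)

grad = torch.autograd.grad(
    outputs=scores,
    inputs=realfake,
    grad_outputs=torch.ones_like(scores),
    create_graph=True,
)[0]

# Per-sample L2 norm: sqrt(4 * 2^2) = 4.
grad_norm = torch.norm(grad.reshape(grad.size(0), -1), p=2, dim=1)
# Penalty: mean of (1 - 4)^2 = 9.
penalty = torch.mean((1.0 - grad_norm) ** 2)

print(grad_norm.tolist(), penalty.item())
```

In `train_epoch()` the same penalty is multiplied by 10 and added to the critic's loss, which pushes the critic toward gradients of norm 1 on the interpolated samples.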

We now train the model for 500 epochs:

In [16]:
for epoch in range(1,epochs+1):
    e_gloss, e_closs = train_epoch()
    if epoch % display_step == 0:
        print(f"Epoch {epoch}, G loss {e_gloss} C loss {e_closs}")

If you train on a GPU, this takes about an hour. Otherwise, it may take several hours. Once training is done, you can save the generator to the local folder as follows:

In [17]:
torch.save(generator.state_dict(),'files/ch15/MuseGAN_G.pth')

# 4. Music Generation with MuseGAN
We first reload the generator as follows:

In [18]:
generator.load_state_dict(torch.load('files/ch15/MuseGAN_G.pth'))

<All keys matched successfully>
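
If the generator was trained on a GPU and is reloaded on a machine without one, the saved tensors need to be remapped when loading. The sketch below shows the pattern on a small stand-in model: `nn.Linear(4, 2)` and the temporary file name are hypothetical and only illustrate the `map_location` argument of `torch.load()`.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical stand-in model; the chapter's real model is MuseGenerator.
model = nn.Linear(4, 2)
path = os.path.join(tempfile.mkdtemp(), "toy_G.pth")
torch.save(model.state_dict(), path)

restored = nn.Linear(4, 2)
# map_location remaps any GPU tensors in the checkpoint onto the CPU.
state = torch.load(path, map_location="cpu")
result = restored.load_state_dict(state)
restored.eval()   # inference mode for layers such as dropout or batch norm

print(result)
```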

To generate music, we first sample from the latent spaces:

In [19]:
num_pieces=5

chords = torch.rand(num_pieces, 32).to(device)
style = torch.rand(num_pieces, 32).to(device)
melody = torch.rand(num_pieces, 4, 32).to(device)
groove = torch.rand(num_pieces, 4, 32).to(device)

Notice that we generate five pieces at once here, so that they can be joined into one longer piece of music. You can change the value of the variable *num_pieces* to your liking. 

We then feed the latent variables to the generator:

In [20]:
preds = generator(chords, style, melody, groove).detach().cpu()

Finally, we convert the generated music to the MIDI format, like so:

In [21]:
music_data = convert_to_midi(preds.numpy())
music_data.write('midi', 'files/ch15/MuseGAN_song.mid')

'files/ch15/MuseGAN_song.mid'

You can listen to the generated song like this:

In [22]:
mf = midi.MidiFile()
mf.open("files/ch15/MuseGAN_song.mid") 
mf.read()
mf.close()
stream = midi.translate.midiFileToStream(mf)
stream.show('midi')

Or you can listen to the music by pressing the play button below:

<audio src="https://gattonweb.uky.edu/faculty/lium/ml/MuseGAN_song.mp3" type="audio/mpeg" controls="" controlsList="nodownload"></audio>