
AudioGeneration_MNIST

FULL DOCUMENTATION FOUND HERE:

Autonomous Audio Generation with Variational Autoencoders using the FSDD Dataset (PDF File)


About the project

This project employs a Variational Autoencoder for autonomous audio generation and reconstruction, based on .wav audio files transformed into Short-Time Fourier Transform (STFT) spectrograms. The dataset employed is the free-spoken-digit-dataset (FSDD), also known as an audio version of the MNIST digit dataset.
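As a rough illustration of this preprocessing step, here is a minimal sketch that turns a .wav file into a log-magnitude STFT spectrogram with librosa; the function name, sample rate, duration, and FFT parameters are illustrative assumptions, not necessarily the values used in this repository.

import librosa
import numpy as np

def wav_to_log_spectrogram(path, sr=22050, duration=0.74, n_fft=512, hop_length=256):
    # Load and pad/trim the recording so every spectrogram has the same shape
    signal, _ = librosa.load(path, sr=sr, duration=duration)
    signal = np.pad(signal, (0, max(0, int(sr * duration) - len(signal))))
    # Short-Time Fourier Transform -> magnitude -> log (dB) scale
    stft = librosa.stft(signal, n_fft=n_fft, hop_length=hop_length)
    return librosa.amplitude_to_db(np.abs(stft))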

A Google Colaboratory notebook is set up here!

About the model

A Variational Autoencoder (VAE) is a type of generative model in deep learning that consists of an encoder and a decoder, similar to traditional autoencoders. However, the key distinction lies in how it handles the encoding process. Unlike standard autoencoders that compress input data into a fixed point in the latent space, a VAE encodes data as a distribution within this space. This approach allows the model to not only learn the specific features of the input data but also the variability and uncertainty associated with these features, enabling it to generate new data points that are similar yet diverse.

[Figure 1: VAE diagram]

Training a VAE step by step (illustrated in the sketch after this list):

  • The input is encoded as a distribution over the latent space.
  • A point is sampled from that distribution in the latent space.
  • The sampled point is decoded and the reconstruction error is computed.
  • The reconstruction error is backpropagated through the network.
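A minimal sketch of these four steps as a single training step in TensorFlow is shown below; the encode, sample, decode, and loss method names are hypothetical stand-ins, since the actual models in this repository are trained through Keras' compile/fit API.

import tensorflow as tf

@tf.function
def train_step(vae, optimizer, batch):
    # One illustrative VAE training step following the four steps above
    with tf.GradientTape() as tape:
        mu, log_variance = vae.encode(batch)            # 1. encode the input as a distribution
        z = vae.sample(mu, log_variance)                # 2. sample a latent point
        reconstruction = vae.decode(z)                  # 3. decode the sampled point
        loss = vae.loss(batch, reconstruction, mu, log_variance)  # 4. reconstruction + KL loss
    gradients = tape.gradient(loss, vae.trainable_variables)      # backpropagate
    optimizer.apply_gradients(zip(gradients, vae.trainable_variables))
    return loss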

During my work on this project, I tried two types of variational autoencoders to solve this problem. In the end, the best performing model was the Standard VAE.

Standard VAE

In Standard_VAE.py we have defined a standard variational autoencoder. The architecture is the following:

VAE

[Figure: full VAE architecture]

Encoder

[Figure: encoder architecture]

Decoder

[Figure: decoder architecture]

At the end of the encoder we map the input to a mean and a log-variance over 128 latent dimensions, and then sample a point from the resulting normal distribution. That sampled point is the actual output of the encoder, which is passed on to the decoder.

x = Flatten()(x)
mu = Dense(latent_space_dim, name="mu")(x)                      # mean of the latent distribution
log_variance = Dense(latent_space_dim, name="log_variance")(x)  # log-variance of the latent distribution

def sample_point_from_normal_distribution(args):
    # Reparameterization: z = mu + sigma * epsilon, with epsilon ~ N(0, 1)
    mu, log_variance = args
    epsilon = K.random_normal(shape=K.shape(mu), mean=0., stddev=1.)
    return mu + K.exp(log_variance / 2) * epsilon

encoder_output = Lambda(
    sample_point_from_normal_distribution,
    name="encoder_output"
)([mu, log_variance])
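This Lambda layer implements the reparameterization trick: instead of sampling directly from N(mu, sigma^2), we sample epsilon from a fixed standard normal and compute mu + sigma * epsilon, which keeps the sampling step differentiable with respect to mu and log_variance so the reconstruction error can be backpropagated through the encoder.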

The loss functions we use in this standard VAE are the Mean Squared Error (MSE) and the Kullback–Leibler (KL) Divergence. The MSE term is the reconstruction loss, which measures how similar the output of the VAE is to its input during training, and the KL Divergence term quantifies the difference between two distributions.

Which distributions, though? The standard normal distribution and the encoded distribution: the KL Divergence acts as a regularizer that keeps the encoded distributions close to the standard normal prior, which helps avoid overfitting by preventing the means from drifting too far apart and the variances from collapsing towards zero.
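For reference, the closed-form KL divergence between the encoded Gaussian and the standard normal prior, which the kl_loss term in the code below implements, is the standard result

$$\mathrm{KL}\left(\mathcal{N}(\mu,\sigma^2)\,\|\,\mathcal{N}(0, I)\right) = -\frac{1}{2}\sum_{j=1}^{d}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$

where $\log\sigma_j^2$ corresponds to log_variance in the code.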


The code for this combined loss function is the following; we pass it when we compile the model. In the Standard VAE the trade-off between these two losses is easier to manage.

def combined_loss(y_target, y_predicted, mu, log_variance, reconstruction_loss_weight):
    # MSE reconstruction loss, averaged over the spectrogram dimensions
    reconstruction_loss = K.mean(K.square(y_target - y_predicted), axis=[1, 2, 3])
    # Closed-form KL divergence between N(mu, sigma^2) and the standard normal prior
    kl_loss = -0.5 * K.sum(1 + log_variance - K.square(mu) - K.exp(log_variance), axis=1)
    return reconstruction_loss_weight * reconstruction_loss + kl_loss
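Because a Keras loss function only receives y_target and y_predicted, the remaining arguments have to be bound before compiling. A minimal sketch of one way to do this is shown below, assuming mu and log_variance are the encoder tensors defined earlier and model is the assembled VAE; the closure, optimizer settings, and weight value are illustrative, and closing over encoder tensors like this generally relies on the graph-mode Keras setup used in implementations of this kind.

from tensorflow.keras.optimizers import Adam

def make_vae_loss(mu, log_variance, reconstruction_loss_weight=1000.0):
    # Bind the latent tensors and the weighting factor so the result
    # matches Keras' (y_true, y_pred) loss signature
    def vae_loss(y_target, y_predicted):
        return combined_loss(y_target, y_predicted, mu, log_variance,
                             reconstruction_loss_weight)
    return vae_loss

model.compile(optimizer=Adam(learning_rate=0.0005),
              loss=make_vae_loss(mu, log_variance))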

Mixture of Gaussian Distributions VAE

Here we can find the implementation for a VAE based on a mixture of Gaussian distributions. In this implementation of the VAE we have to define a prior distribution, which we do with the help of TensorFlow Probability. This prior distribution is a mixture of Gaussian distributions, parametrized by their means and a covariance matrix.
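A minimal sketch of how such a mixture-of-Gaussians prior can be defined with TensorFlow Probability is shown below; the function name, component count, and trainable parametrization are illustrative assumptions rather than the repository's exact code.

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

def make_mixture_prior(latent_dim=128, n_components=10):
    # Trainable mixture weights, component means and (softplus-constrained) scales
    logits = tf.Variable(tf.zeros([n_components]), name="mixture_logits")
    locs = tf.Variable(tf.random.normal([n_components, latent_dim]), name="component_means")
    raw_scales = tf.Variable(tf.ones([n_components, latent_dim]), name="component_scales")
    return tfd.MixtureSameFamily(
        mixture_distribution=tfd.Categorical(logits=logits),
        components_distribution=tfd.MultivariateNormalDiag(
            loc=locs, scale_diag=tf.nn.softplus(raw_scales)))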

These are the key differences between the models: the complexity of the latent space representation and the way the loss is calculated. In this VAE we also use the KL Divergence, but the reconstruction loss is defined as the negative log-likelihood of the input under the decoding distribution.

def reconstruction_loss(batch_of_images, decoding_dist):
    # Negative log-likelihood of the inputs under the decoder's output distribution
    return -tf.reduce_sum(decoding_dist.log_prob(batch_of_images), axis=0)

The architecture itself is very similar to that of the Standard VAE.

Encoder

[Figure: encoder architecture]

Decoder

[Figure: decoder architecture]

Why Standard VAE?

It seems that the reconstruction loss made the Mixture of Gaussians VAE stagnate, even though it appeared to learn in fewer epochs than the Standard VAE. The MSE reconstruction loss kept decreasing and allowed the Standard VAE to learn the patterns of the data after around 250 epochs.

Note: Both of these approaches had been previously implemented in other open-source projects; in the case of the Gaussian mixture VAE I based my work on the CelebA VAE, and in the case of the standard VAE, on Valerio Velardo's sound generation VAE implementation.

Results

For audio samples click here

Here are some reconstructions of files in the training dataset.

[Figures: spectrogram reconstructions of training samples]

For comparison, here are reconstructions of files in the test dataset, which were never seen by the model and were recorded by me.

[Figures: spectrogram reconstructions of test samples]

Then the generation:

[Figure: generated spectrogram]

And some plots of the latent space representation:

[Figures: latent space plots]

References

Understanding Mel Spectrograms

Generating Sound with Neural Networks

VAE for the CelebA dataset

Natsiou, Anastasia et al. “An Exploration of the Latent Space of a Convolutional Variational Autoencoder for the Generation of Musical Instrument Tones.” xAI (2023).

Briot, JP., Pachet, F. Deep learning for music generation: challenges and directions. Neural Comput & Applic 32, 981–993 (2020). https://doi.org/10.1007/s00521-018-3813-6

Dhariwal, P., Jun, H., Payne, C., Kim, J.W., Radford, A., & Sutskever, I. (2020). Jukebox: A Generative Model for Music. ArXiv, abs/2005.00341. https://openai.com/research/jukebox

Previous Work

LSTM Model for music generation using MIDI files: Github Repository and Presentation

GM-VAE Model for music generation using spectrograms: Github Repository