Testing the network on music datasets #104

Open · ibab opened this issue Sep 28, 2016 · 21 comments
ibab (Owner) commented Sep 28, 2016

I've started to play around with the MagnaTagATune dataset.
There's a small change that needs to be made to the code when training on this dataset:
Because it uses mp3 instead of wav, the pattern in wavenet/audio_reader.py needs to be adjusted.
It would be nice to write a MagnaReader class that inherits from (or wraps) the AudioReader and can filter the content by genre using the provided metadata.
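For the genre filtering, something along these lines could serve as a starting point -- just a sketch, assuming the tab-separated annotations_final.csv that ships with MagnaTagATune (one binary column per tag plus an mp3_path column); the resulting file list would then be handed to an AudioReader:

import csv
import os

def find_magna_files(directory, tag, annotations='annotations_final.csv'):
    '''Hypothetical helper: return paths of MagnaTagATune clips that
    carry the given tag, read from the tab-separated annotations file.'''
    files = []
    with open(os.path.join(directory, annotations)) as metadata:
        for row in csv.DictReader(metadata, delimiter='\t'):
            if row.get(tag) == '1':
                files.append(os.path.join(directory, row['mp3_path']))
    return files

# e.g. files = find_magna_files('/data/magnatagatune', 'classical')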

lemonzi (Collaborator) commented Sep 28, 2016

I started training it on violin samples today as well, we'll see how it goes! The AudioReader should be a bit more generic indeed, and then we can either write wrappers for VCTK and Magna or subclass it.

EDIT: I really need a GPU...

Zeta36 commented Sep 29, 2016

I've made a little experiment. I trained the WaveNet model using just a single wav file (a 4-second piano wav, 16-bit, mono). As I expected, the loss dropped to ~0.060 in barely 4k steps, but when I generated audio (passing 64000 samples as the parameter, i.e. 4 seconds), the result was a little strange. You can hear that the model is playing something like a piano sound (and the tempo looks pretty similar), but there is still a noisy background that I thought would not be there, given that the loss was only 0.060. Could this noise be caused by the residual cross-entropy error in the prediction?

It's worth noting that I had to lower the sample window at training time to 10,000 (SAMPLE_SIZE = 10000), because I have no GPU and can't wait 20 sec/step.

The other parameters are the defaults from the json file on master.
I lowered the learning rate every 1000 steps, from 0.01 down to 0.0001.

This is the original wav file: https://soundcloud.com/samu-283712554/piano-base

And this is the progression of the training:
499 steps - https://soundcloud.com/samu-283712554/piano-499steps
1000 steps - https://soundcloud.com/samu-283712554/piano-1000
3000 steps - https://soundcloud.com/samu-283712554/piano-3000
3500 steps - https://soundcloud.com/samu-283712554/piano-3500

Could somebody with a good GPU repeat this experiment using the same base wav file (https://soundcloud.com/samu-283712554/piano-base), but with a SAMPLE_SIZE of 100k and other parameters?

I may be wrong, but I think that if the model is right, this experiment should quickly end up with a very smooth piano sound (without any noise), shouldn't it?

Well, if somebody eventually does this, I'll be very pleased :).

Regards,
Samu.

I made another test with this. I generated audio again, but this time I seeded the model with the original base wav file.

This is the result: https://soundcloud.com/samu-283712554/piano-3500-seed (the first 4 seconds are the original seed wav file, followed by 4 seconds generated with the WaveNet model trained for 4k steps).

ibab (Owner) commented Sep 29, 2016

@Zeta36: I don't think there's an easy way to download your audio file from soundcloud.

Zeta36 commented Sep 29, 2016

genekogan (Contributor) commented:

this is a simple change to find_files which looks for several audio extensions instead of just one. i am experimenting with vocal tracks from mixed music. not sure there's any real need to limit to one extension, since librosa handles pretty much any audio file.

import fnmatch
import os

def find_files(directory):
    '''Recursively finds all audio files matching one of the patterns.'''
    files = []
    for root, dirnames, filenames in os.walk(directory):
        for pattern in ['*.wav', '*.mp3', '*.aiff', '*.flac', '*.ape']:
            for filename in fnmatch.filter(filenames, pattern):
                files.append(os.path.join(root, filename))
    print("found %d files" % len(files))
    return files

Zeta36 commented Sep 29, 2016

@ibab, don't you think I should have gotten a clean and smooth wav after reaching a loss of only 0.060 in the test I described above? Did you try it yourself?

ibab (Owner) commented Sep 29, 2016

@Zeta36: Can't use the GPUs at the moment, as they're needed for other things.
I don't think that a loss of 0.060 means that the generated samples will sound clean.
If the network highly overfits the sample, it could be very sensitive to even tiny differences in the input during generation.
That means that if one of the waveform samples it created deviates from what it received during training, the output will quickly leave the region of sample space that the network is familiar with.

If you look at #47, several people have posted very clean sounding samples, so it appears that this is something that gets better with more training samples and a substantially longer training period.

Also, the network configuration that you've used could be relevant to this. The results in #47 were achieved with substantially deeper networks than the one in the default wavenet_params.json.
Using 4 stacks of 1...512 dilation layers seems to be a good starting point.
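For reference, a dilation list like that doesn't have to be written out by hand; a small sketch of generating 4 stacks of ten layers each (dilations 1 through 512):

# Four stacks of dilations 1, 2, 4, ..., 512 (ten layers per stack),
# matching the "1...512" stacks mentioned above.
dilations = [2 ** i for _ in range(4) for i in range(10)]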

Finally, there's also the possibility that things are better when using the waveform directly as an input to the network instead of one-hot encoding it first, which is implemented in #106.

Zeta36 commented Sep 29, 2016

"That means that if one of the waveform samples it created deviates from what it received during training, the output will quickly leave the region of sample space that the network is familiar with."

Yes, that's true.

Thank you for the comments.

maxhodak (Contributor) commented Oct 4, 2016

Here's a 5-second sample that is kind of not completely terrible (first second is seed): https://soundcloud.com/maxhodak/tycho-wavenet-version-1/s-YHbBX

trained to 20k steps, loss around ~2.7, wav files of tycho's whole discography (individual tracks per file)

{
    "filter_width": 2,
    "sample_rate": 16000,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128,
                  1, 2, 4, 8, 16, 32, 64],
    "residual_channels": 32,
    "dilation_channels": 32,
    "quantization_channels": 256,
    "skip_channels": 512,
    "use_biases": true
}

I tried training that out to 40k steps and it actually sounded a lot worse. Smoothed loss was pretty flat, between 2.6 and 2.8, from 20k to 40k steps. I'm trying to understand why it subjectively sounded much worse at 40k.

nakosung (Contributor) commented Oct 4, 2016

@maxhodak I love Tycho! :)

genekogan (Contributor) commented:

@maxhodak nice! the drums sound spot on...
what did you set the learning rate at?
i've also noticed that the loss does not fully translate to subjective quality. i have a few experiments i've been training, will post them here shortly.

ibab (Owner) commented Oct 4, 2016

Trying to understand why subjectively it sounded much worse at 40k.

This could be a result of the network overfitting the dataset.
We can leave some files out of the training and monitor the loss on them to find out whether that's the case. (This is also something that's pointed out in the WaveNet paper)
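A minimal sketch of such a holdout, assuming the corpus is just a list of file paths (hooking the validation loss into the training loop is left out here):

import random

def train_validation_split(files, validation_fraction=0.1, seed=0):
    '''Hold out a fraction of the corpus so the loss on unseen files
    can be monitored alongside the training loss.'''
    files = sorted(files)
    random.Random(seed).shuffle(files)
    n_valid = max(1, int(len(files) * validation_fraction))
    return files[n_valid:], files[:n_valid]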

lemonzi (Collaborator) commented Oct 4, 2016

Quick note -- for a cloud service with online listening, easy downloads, and no upload limit, you can use Freesound rather than SoundCloud. Downside: it doesn't allow the cool time-stamped comments.


ibab (Owner) commented Oct 4, 2016

@lemonzi: I wasn't aware of Freesound, thanks for pointing it out!

robinsloan (Contributor) commented:

Just another music test to share… I think this is pretty nice/interesting!

https://soundcloud.com/robinsloan/sets/tensorflow-wavenet-temperature-demo

Model trained on this album to a loss of ~2.8 with these params:

{
    "filter_width": 2,
    "sample_rate": 8000,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64],
    "residual_channels": 32,
    "dilation_channels": 32,
    "quantization_channels": 256,
    "skip_channels": 1024,
    "use_biases": false
}

genekogan (Contributor) commented:

@robinsloan sounds very nice! i am also sharing some early results, though mine are much less coherent.

this is a sample generated on 2.5 hours of opera (Paisiello's Il Barbiere di Siviglia / Barber of Seville), with minimal pre-processing: https://soundcloud.com/genekogan/il-barbiere-di-siviglia-wavenet

this was generated after 25k steps with the following parameters and learning rate 0.001, loss around ~2.6.

{
    "filter_width": 2,
    "sample_rate": 16000,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64],
    "residual_channels": 32,
    "dilation_channels": 32,
    "quantization_channels": 256,
    "skip_channels": 1024,
    "use_biases": true
}

it's mostly noise/gibberish, but to my ears there is some coherent operatic material in there. i hear a sort of orchestral drone (sounds kind of like a warm-up/tuning) along with some faint baritone voices. right now it sounds more like granular synthesis than something coherent. i know i was too ambitious trying to train it on too many different sections, so i've started working on a script that uses librosa's analytical tools to narrow a folder of music down to a more homogeneous subset (see the sketch below). i need to clean it up a bit and will post it as an ipython notebook shortly.
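not the actual script, just a rough sketch of the idea -- keep the files whose MFCC means sit closest to the corpus centroid (the feature choice, the 16 kHz / 30 s analysis window, and the keep fraction are all placeholders):

import numpy as np
import librosa

def homogeneous_subset(files, keep_fraction=0.5):
    '''Keep the files whose MFCC means are closest to the corpus centroid,
    as a crude way to narrow a folder down to a more uniform subset.'''
    feats = []
    for path in files:
        audio, sr = librosa.load(path, sr=16000, duration=30.0)
        mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
        feats.append(mfcc.mean(axis=1))
    feats = np.array(feats)
    dists = np.linalg.norm(feats - feats.mean(axis=0), axis=1)
    keep = np.argsort(dists)[:int(len(files) * keep_fraction)]
    return [files[i] for i in keep]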

trying to get a sense of ways i can improve it. the obvious one i mentioned: limit to a smaller, more uniform set of audio. on the parameter side, perhaps more dilations, a lower sample rate, etc.

robinsloan (Contributor) commented:

@genekogan, I like that a lot! There are some really interesting things happening in that clip. I mean it's almost like, forget simulation; on its own merits that's a novel, compelling sound. More, more!

Corpus selection is interesting. I went with SK Kakraba's gyil music because it's (a) stylistically pretty uniform, and (b) naturally quite noisy -- both of which seemed useful in this context.

dunnevan commented Oct 27, 2016

I've been training on the MagnaTagATune dataset with clips that are tagged as solo piano.

https://soundcloud.com/evan-dunn-676478257/sets/magnatagatune-solo-piano

I am training with batches of 4, with a loss fluctuating between ~1.7 and ~2.4. The first second of audio is seeded.

{
    "filter_width": 2,
    "sample_rate": 8192,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024],
    "residual_channels": 32,
    "dilation_channels": 32,
    "quantization_channels": 256,
    "skip_channels": 1024,
    "use_biases": true,
    "scalar_input": false,
    "initial_filter_width": 32
}

lmaxwell commented:

@dunnevan nice! What optimizer did you use? I noticed you ran 400,000 steps; did you adjust the learning rate during training?

dunnevan commented Oct 28, 2016

@lmaxwell I'm using Adam. I started the learning rate at 1e-3 and L2 at 1e-4 for the first 80,000 steps, then moved to 1e-4 and 1e-5. I just recently pushed it down another order of magnitude to see how it helps. My feeling is that feeding it more data is the most important thing; even at 400,000 steps, 4 buckets at 8192 Hz sampling is only about 3 minutes of trained audio.
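In code, that schedule is roughly the following (a hypothetical helper, not something from the repo; the boundary for the most recent drop is a placeholder since the exact step isn't given above):

def learning_rate_for_step(step):
    # Piecewise schedule roughly matching the description above:
    # 1e-3 for the first 80k steps, then 1e-4, and later 1e-5.
    # The 200k boundary is a placeholder for "just recently".
    if step < 80000:
        return 1e-3
    elif step < 200000:
        return 1e-4
    return 1e-5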

eliphatfs commented:

@dunnevan
I tried with your parameters, but I didn't get satisfactory results... Could it be that different music datasets need different parameters...?
