Testing the network on music datasets #104

Open · ibab opened this issue Sep 28, 2016 · 21 comments
ibab (Owner) commented Sep 28, 2016

I've started to play around with the MagnaTagATune dataset.
There's a small change that needs to be made to the code when training on this dataset:
Because it uses mp3 instead of wav, the pattern in wavenet/audio_reader.py needs to be adjusted.
It would be nice to write a MagnaReader class that inherits from (or wraps) the AudioReader and can filter the content by genre using the provided metadata.
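For the genre filtering, something along these lines could serve as a starting point -- just a sketch, assuming the tab-separated annotations_final.csv that ships with MagnaTagATune (one binary column per tag plus an mp3_path column); the resulting file list would then be handed to an AudioReader:

import csv
import os

def find_magna_files(directory, tag, annotations='annotations_final.csv'):
    '''Hypothetical helper: return paths of MagnaTagATune clips that
    carry the given tag, read from the tab-separated annotations file.'''
    files = []
    with open(os.path.join(directory, annotations)) as metadata:
        for row in csv.DictReader(metadata, delimiter='\t'):
            if row.get(tag) == '1':
                files.append(os.path.join(directory, row['mp3_path']))
    return files

# e.g. files = find_magna_files('/data/magnatagatune', 'classical')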

lemonzi (Collaborator) commented Sep 28, 2016

I started training it on violin samples today as well, we'll see how it goes! The AudioReader should be a bit more generic indeed, and then we can either write wrappers for VCTK and Magna or subclass it.

EDIT: I really need a GPU...

Zeta36 commented Sep 29, 2016

I've made a little experiment. I trained the WaveNet model using just a single wav file (a 4-second piano wav, 16-bit, mono). As I expected, the loss dropped to ~0.060 in barely 4k steps, but when I generated audio (passing 64000 samples as the parameter, i.e. 4 seconds), the result was a little strange. You can hear that the model is playing something like a piano sound (and the tempo looks pretty similar), but there is still a noisy background that I thought would not be there, given that the loss was only 0.060. Could this noise be caused by the residual cross-entropy error in the prediction?

It's worth noting that I had to lower the sample window at training time to 10,000 (SAMPLE_SIZE = 10000), because I have no GPU and can't wait 20 sec/step.

The other parameters are the defaults from the json file on master.
I lowered the learning rate every 1000 steps, from 0.01 down to 0.0001.

This is the original wav file: https://soundcloud.com/samu-283712554/piano-base

And this is the progression of the training:
499 steps - https://soundcloud.com/samu-283712554/piano-499steps
1000 steps - https://soundcloud.com/samu-283712554/piano-1000
3000 steps - https://soundcloud.com/samu-283712554/piano-3000
3500 steps - https://soundcloud.com/samu-283712554/piano-3500

Could somebody with a good GPU repeat this experiment using the same base wav file (https://soundcloud.com/samu-283712554/piano-base), but with a SAMPLE_SIZE of 100k and other parameters?

I may be wrong, but I think that if the model is right, this experiment should quickly end up with a very smooth piano sound (without any noise), shouldn't it?

Well, if somebody eventually does this, I'll be very pleased :).

Regards,
Samu.

I made another test with this. I generated audio again, but this time I seeded the model with the original base wav file.

This is the result: https://soundcloud.com/samu-283712554/piano-3500-seed (the first 4 seconds are the original seed wav file, followed by 4 seconds generated with the WaveNet model trained for 4k steps).

ibab (Owner) commented Sep 29, 2016

@Zeta36: I don't think there's an easy way to download your audio file from soundcloud.

Zeta36 commented Sep 29, 2016

genekogan (Contributor) commented:

this is a simple change to find_files which looks for several audio extensions instead of just one. i am experimenting with vocal tracks from mixed music. not sure there's any real need to limit to one extension, since librosa handles pretty much any audio file.

import fnmatch
import os

def find_files(directory):
    '''Recursively finds all audio files matching one of the patterns.'''
    files = []
    for root, dirnames, filenames in os.walk(directory):
        for pattern in ['*.wav', '*.mp3', '*.aiff', '*.flac', '*.ape']:
            for filename in fnmatch.filter(filenames, pattern):
                files.append(os.path.join(root, filename))
    print("found %d files" % len(files))
    return files

Zeta36 commented Sep 29, 2016

@ibab, don't you think I should have gotten a clean and smooth wav after reaching a loss of only 0.060 in the test I described above? Did you try it yourself?

ibab (Owner) commented Sep 29, 2016

@Zeta36: Can't use the GPUs at the moment, as they're needed for other things.
I don't think that a loss of 0.060 means that the generated samples will sound clean.
If the network highly overfits the sample, it could be very sensitive to even tiny differences in the input during generation.
That means that if one of the waveform samples it created deviates from what it received during training, the output will quickly leave the region of sample space that the network is familiar with.

If you look at #47, several people have posted very clean sounding samples, so it appears that this is something that gets better with more training samples and a substantially longer training period.

Also, the network configuration that you've used could be relevant to this. The results in #47 were achieved with substantially deeper networks than the one in the default wavenet_params.json.
Using 4 stacks of 1...512 dilation layers seems to be a good starting point.
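For reference, a dilation list like that doesn't have to be written out by hand; a small sketch of generating 4 stacks of ten layers each (dilations 1 through 512):

# Four stacks of dilations 1, 2, 4, ..., 512 (ten layers per stack),
# matching the "1...512" stacks mentioned above.
dilations = [2 ** i for _ in range(4) for i in range(10)]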

Finally, there's also the possibility that things are better when using the waveform directly as an input to the network instead of one-hot encoding it first, which is implemented in #106.

Zeta36 commented Sep 29, 2016

"That means that if one of the waveform samples it created deviates from what it received during training, the output will quickly leave the region of sample space that the network is familiar with."

Yes, that's true.

Thank you for the comments.

maxhodak (Contributor) commented Oct 4, 2016

Here's a 5-second sample that is kind of not completely terrible (first second is seed): https://soundcloud.com/maxhodak/tycho-wavenet-version-1/s-YHbBX

trained to 20k steps, loss around ~2.7, wav files of tycho's whole discography (individual tracks per file)

{
    "filter_width": 2,
    "sample_rate": 16000,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128,
                  1, 2, 4, 8, 16, 32, 64],
    "residual_channels": 32,
    "dilation_channels": 32,
    "quantization_channels": 256,
    "skip_channels": 512,
    "use_biases": true
}

I tried training that out to 40k steps and it actually sounded a lot worse. Smoothed loss was pretty flat, between 2.6 and 2.8, from 20k to 40k steps. I'm trying to understand why it subjectively sounded much worse at 40k.

nakosung (Contributor) commented Oct 4, 2016

@maxhodak I love Tycho! :)

genekogan (Contributor) commented:

@maxhodak nice! the drums sound spot on...
what did you set the learning rate at?
i've also noticed that the loss does not fully translate to subjective quality. i have a few experiments i've been training, will post them here shortly.

ibab (Owner) commented Oct 4, 2016

Trying to understand why subjectively it sounded much worse at 40k.

This could be a result of the network overfitting the dataset.
We can leave some files out of the training and monitor the loss on them to find out whether that's the case. (This is also something that's pointed out in the WaveNet paper)
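A minimal sketch of such a holdout, assuming the corpus is just a list of file paths (hooking the validation loss into the training loop is left out here):

import random

def train_validation_split(files, validation_fraction=0.1, seed=0):
    '''Hold out a fraction of the corpus so the loss on unseen files
    can be monitored alongside the training loss.'''
    files = sorted(files)
    random.Random(seed).shuffle(files)
    n_valid = max(1, int(len(files) * validation_fraction))
    return files[n_valid:], files[:n_valid]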

lemonzi (Collaborator) commented Oct 4, 2016

Quick note -- for a cloud service with online listening, easy downloads, and no upload limit, you can use Freesound rather than SoundCloud. Downside: it doesn't allow the cool time-stamped comments.


ibab (Owner) commented Oct 4, 2016

@lemonzi: I wasn't aware of Freesound, thanks for pointing it out!

robinsloan (Contributor) commented:

Just another music test to share… I think this is pretty nice/interesting!

https://soundcloud.com/robinsloan/sets/tensorflow-wavenet-temperature-demo

Model trained on this album to a loss of ~2.8 with these params:

{
    "filter_width": 2,
    "sample_rate": 8000,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64],
    "residual_channels": 32,
    "dilation_channels": 32,
    "quantization_channels": 256,
    "skip_channels": 1024,
    "use_biases": false
}

genekogan (Contributor) commented:

@robinsloan sounds very nice! i am also sharing some early results, though mine are much less coherent.

this is a sample generated on 2.5 hours of opera (Paisiello's Il Barbiere di Siviglia / Barber of Seville), with minimal pre-processing: https://soundcloud.com/genekogan/il-barbiere-di-siviglia-wavenet

this was generated after 25k steps with the following parameters and learning rate 0.001, loss around ~2.6.

{
    "filter_width": 2,
    "sample_rate": 16000,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64],
    "residual_channels": 32,
    "dilation_channels": 32,
    "quantization_channels": 256,
    "skip_channels": 1024,
    "use_biases": true
}

it's mostly noise/gibberish, but to my ears there is some coherent operatic material in there. i hear a sort of orchestral drone (sounds kind of like a warm-up/tuning) along with some faint baritone voices. right now it sounds more like granular synthesis than something coherent. i know i was too ambitious trying to train it on too many different sections, so i've started working on a script that uses librosa's analytical tools to narrow a folder of music down to a more homogeneous subset (see the sketch below). i need to clean it up a bit and will post it as an ipython notebook shortly.
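not the actual script, just a rough sketch of the idea -- keep the files whose MFCC means sit closest to the corpus centroid (the feature choice, the 16 kHz / 30 s analysis window, and the keep fraction are all placeholders):

import numpy as np
import librosa

def homogeneous_subset(files, keep_fraction=0.5):
    '''Keep the files whose MFCC means are closest to the corpus centroid,
    as a crude way to narrow a folder down to a more uniform subset.'''
    feats = []
    for path in files:
        audio, sr = librosa.load(path, sr=16000, duration=30.0)
        mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
        feats.append(mfcc.mean(axis=1))
    feats = np.array(feats)
    dists = np.linalg.norm(feats - feats.mean(axis=0), axis=1)
    keep = np.argsort(dists)[:int(len(files) * keep_fraction)]
    return [files[i] for i in keep]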

trying to get a sense of ways i can improve it. the obvious one i mentioned: limit to a smaller, more uniform set of audio. on the parameter side, perhaps more dilations, a lower sample rate, etc.

robinsloan (Contributor) commented:

@genekogan, I like that a lot! There are some really interesting things happening in that clip. I mean it's almost like, forget simulation; on its own merits that's a novel, compelling sound. More, more!

Corpus selection is interesting. I went with SK Kakraba's gyil music because it's (a) stylistically pretty uniform, and (b) naturally quite noisy -- both of which seemed useful in this context.

dunnevan commented Oct 27, 2016

I've been training on the MagnaTagATune dataset with clips that are tagged as solo piano.

https://soundcloud.com/evan-dunn-676478257/sets/magnatagatune-solo-piano

I am training with batches of 4, with a loss fluctuating between ~1.7 and ~2.4. The first second of audio is seeded.

{
    "filter_width": 2,
    "sample_rate": 8192,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024],
    "residual_channels": 32,
    "dilation_channels": 32,
    "quantization_channels": 256,
    "skip_channels": 1024,
    "use_biases": true,
    "scalar_input": false,
    "initial_filter_width": 32
}

lmaxwell commented:

@dunnevan nice! What optimizer did you use? I noticed you ran 400,000 steps; did you adjust the learning rate during training?

dunnevan commented Oct 28, 2016

@lmaxwell I'm using Adam. I started the learning rate at 1e-3 and L2 at 1e-4 for the first 80,000 steps, then moved to 1e-4 and 1e-5. I just recently pushed it down another order of magnitude to see how it helps. My feeling is that feeding it more data is the most important thing; even at 400,000 steps, 4 buckets at 8192 Hz sampling is only about 3 minutes of trained audio.
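In code, that schedule is roughly the following (a hypothetical helper, not something from the repo; the boundary for the most recent drop is a placeholder since the exact step isn't given above):

def learning_rate_for_step(step):
    # Piecewise schedule roughly matching the description above:
    # 1e-3 for the first 80k steps, then 1e-4, and later 1e-5.
    # The 200k boundary is a placeholder for "just recently".
    if step < 80000:
        return 1e-3
    elif step < 200000:
        return 1e-4
    return 1e-5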

eliphatfs commented:

@dunnevan
I tried with your parameters, but I didn't get satisfactory results... Could it be that different music datasets need different parameters...?
