Sample RNN #192

nakosung opened this Issue Dec 5, 2016 · 28 comments

tszumowski commented Dec 15, 2016

@nakosung This is a very interesting alternative to WaveNet, and they published their code. Did you (or anyone else here) try this out for comparison?

nakosung commented Dec 15, 2016

@tszumowski I haven't tested it yet. ;)

vonpost commented Dec 26, 2016

@tszumowski I tried to get it running, but I encountered too many bugs and inconsistencies in the code to get it working without hacking the whole thing.

If anyone was successful in starting a training session using that code I'd be very interested to see their solution.

richardassar commented Dec 28, 2016

I've submitted a pull request which makes downloading and generating the MUSIC dataset pain-free; other than that, no modification to the code was required.

The output occasionally becomes unstable but I've managed to generate long samples which remain coherent.

https://soundcloud.com/psylent-v/samplernn-sample-e33-i246301-t1800-tr10121-v10330-2

nakosung commented Jan 15, 2017

SampleRNN seems to perform as well as WaveNet.

https://soundcloud.com/nako-sung/sample_e6_i54546_t72-00_tr0

weixsong commented Jan 22, 2017

@richardassar, your generated music sounds very good. What piano corpus did you use to train the model? Could you share it with me?

richardassar commented Jan 22, 2017

Hi, the piano corpus is from archive.org

If you clone the SampleRNN repo and run the download script, it will gather the corpus for you.

https://github.com/soroushmehr/sampleRNN_ICLR2017/blob/master/datasets/music/download_archive_preprocess.sh

I've trained a model on some other music, by the band Tangerine Dream; maybe it can be called "Deep Tangerine Dream" :) I'll upload that when I have decided on the best output sample.

If you decide to train using your own corpus, be sure to compute your own mean/variance normalisation stats.
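Something like this is enough to get the two numbers (a minimal sketch of my own, not the repo's code; the glob path is a placeholder, and dataset.py expects the stats in its own format):

```python
import glob
import numpy as np
from scipy.io import wavfile

chunks = []
for path in glob.glob('datasets/music/*.wav'):  # hypothetical corpus location
    _, audio = wavfile.read(path)
    chunks.append(audio.astype(np.float64))

corpus = np.concatenate(chunks)
print('mean = %f, std = %f' % (corpus.mean(), corpus.std()))
```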

weixsong commented Jan 22, 2017

@richardassar, thanks very much.

weixsong commented Jan 24, 2017

Hi @richardassar, I don't quite understand "compute your own mean/variance normalisation stats". In machine learning we usually normalize features for better model convergence. For this WaveNet experiment, do I need to normalize the audio myself?
Right now I'm using the piano music from SampleRNN.

richardassar commented Jan 24, 2017

devinplatt commented Feb 6, 2017

Hey @richardassar and @weixsong. From my reading of the SampleRNN paper (I haven't tried their code just yet), the normalization of inputs is only applied for the GMM-based models, which weren't the best-performing models anyway. You can see in that same file (dataset.py) that normalization is only applied if real_valued == True (False is the default), so I don't think computing your own stats is necessary unless you want to use the real-valued input models.
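Paraphrasing the branch in dataset.py (my own sketch of the behaviour described above, not the repo's code):

```python
import numpy as np

def preprocess(audio, real_valued=False, mean=0.0, std=1.0):
    """Rough paraphrase; assumes float audio in [-1, 1]."""
    audio = audio.astype(np.float32)
    if real_valued:
        # GMM-based models: standardize with precomputed corpus stats
        return (audio - mean) / std
    # default (real_valued=False): quantize to 256 levels, stats unused
    audio = np.clip(audio, -1.0, 1.0)
    return np.floor((audio + 1.0) / 2.0 * 255.0).astype(np.int32)
```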

nakosung commented Feb 6, 2017

In general, audio clips should be normalized by their RMS value instead of their peak value.
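For example, in numpy (a sketch of mine; the target RMS of 0.1 is an arbitrary choice):

```python
import numpy as np

def peak_normalize(x):
    # scales so the loudest sample hits +/-1; quiet recordings stay quiet
    return x / np.max(np.abs(x))

def rms_normalize(x, target_rms=0.1):
    # scales so average energy matches across clips, which equalizes
    # perceived loudness better (may need clipping to stay in [-1, 1])
    return x * (target_rms / np.sqrt(np.mean(x ** 2)))
```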

richardassar commented Feb 7, 2017

The signal needs to be bounded because we model it as a conditional distribution over quantised amplitudes.

They should apply DC removal when using the mu-law nonlinearity; it's not required for the linear mode.
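Roughly, assuming float audio in [-1, 1] (my own illustration, not the repo's code):

```python
import numpy as np

def remove_dc(x):
    # subtract the mean so a DC offset doesn't waste mu-law resolution
    return x - np.mean(x)

def mu_law_encode(x, mu=255):
    # standard mu-law companding; output in [-1, 1], ready to quantise
    x = np.clip(x, -1.0, 1.0)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
```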

abhilashi commented Feb 10, 2017

@nakosung Thanks for the post. Could you please share metrics on SampleRNN performance? Time to generate? On what hardware?

nakosung commented Feb 11, 2017

@abhilashi Trained and generated on a Titan XP. Generation speed is 0.5 s per 1 s clip, which is super fast compared to WaveNet.

abhilashi commented Feb 11, 2017

@nakosung Thanks much for the info 👍 I'm going to run it on Hindi data now!

richardassar commented Feb 21, 2017

It handles multi-instrument music datasets quite nicely.

https://soundcloud.com/psylent-v/samplernn-tangerine-dream-1
https://soundcloud.com/psylent-v/samplernn-tangerine-dream-2

Trained on 32 hours of Tangerine Dream. I have plenty of other nice samples it generated.

abhilashi commented Feb 21, 2017

dannybtran commented Feb 21, 2017

Could SampleRNN be used for TTS? The paper uses the term "unconditional", which makes me think it cannot?

lemonzi commented Feb 21, 2017

Zeta36 commented Feb 21, 2017

And here is the source code: https://github.com/sotelo/parrot, and the paper: https://openreview.net/forum?id=B1VWyySKx

Cortexelus commented Dec 29, 2017

We're using SampleRNN for music.
Training: 1-2 days on an NVIDIA V100.
Inference: 100 four-minute audio clips generated in 10 minutes.

http://dadabots.com/
http://dadabots.com/nips2017/generating-black-metal-and-math-rock.pdf
https://theoutline.com/post/2556/this-frostbitten-black-metal-album-was-created-by-an-artificial-intelligence

devinroth commented Dec 29, 2017

richardassar commented Dec 29, 2017

Cortexelus commented Dec 29, 2017

@devinroth re: more musical output

Thanks!

I could be wrong, but IMO decreasing the WaveNet receptive field is the answer to more musical output.

The ablation studies in the Tacotron 2 paper showed us that 10.5-21 ms is a good receptive field size if the WaveNet conditions on a high-level representation.
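Back-of-envelope, using the standard dilated-stack arithmetic (a sketch of mine; the 16 kHz rate and single 8-layer stack are illustrative, not from any paper discussed here):

```python
# receptive field of a dilated conv stack with kernel size 2:
# (kernel_size - 1) * sum(dilations) + 1 samples
dilations = [2 ** i for i in range(8)]   # 1, 2, 4, ..., 128
receptive_field = sum(dilations) + 1     # 256 samples
print(1000.0 * receptive_field / 16000)  # 16 ms at 16 kHz, inside that range
```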

WaveNet is great at the low level: it makes audio sound natural, and its MOS scores are close to real speech. Keep it there, at the ~20 ms level, and condition it. Dedicate high-level structure to MIDI nets, symbolic music nets, or intermediate representations. Or go top-down with progressively upsampled mel spectrograms. Do both bottom-up and top-down.

Because these unconditioned predict-the-next-sample models only learn bottom-up as they train: first noise, then texture, then individual hits and notes, then phrases and riffs, then rhythm if you're lucky. The last thing they would learn is song composition or music theory. They struggle to see the forest for the trees.

Even SampleRNN, whose receptive field is "sorta unbounded", runs into this problem of learning composition. (LSTMs are able to hold onto memory for long periods of time, but because of TBPTT they are limited in learning when to recall/forget.)
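A toy PyTorch sketch of the TBPTT limitation (my own illustration, nothing from the SampleRNN code): the hidden state carries across chunks, but gradients are cut at every chunk boundary, so dependencies longer than the chunk length are never trained directly.

```python
import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=1, hidden_size=32, batch_first=True)
head = nn.Linear(32, 1)
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()))

seq = torch.randn(4, 512, 1)  # toy (batch, time, feature) data
hidden = None
for chunk in seq.split(64, dim=1):            # chunk length = 64 samples
    out, hidden = rnn(chunk, hidden)
    loss = ((head(out) - chunk) ** 2).mean()  # toy reconstruction loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    hidden = tuple(h.detach() for h in hidden)  # state flows on, grads stop
```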

Food for thought: A dataset of mp3s has 100 million examples of low-level structure, but only dozens of examples of song-level structure.

Better to learn English from a text corpus than from a speech corpus, no?

Cortexelus commented Dec 29, 2017

Logistic mixtures are a good idea, because a pure samplewise loss is like trying to compare images pixelwise, using one-hot encodings for RGB.
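For reference, the discretized mixture-of-logistics likelihood (as in PixelCNN++) looks roughly like this for one sample; a simplified numpy sketch of mine that ignores the edge bins at ±1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dmol_prob(x, log_pi, mu, log_s, half_bin=1.0 / 255.0):
    """P(sample falls in x's quantization bin) under K logistics.
    x: scalar in [-1, 1]; log_pi, mu, log_s: length-K parameter arrays."""
    s = np.exp(log_s)
    p_bin = sigmoid((x + half_bin - mu) / s) - sigmoid((x - half_bin - mu) / s)
    pi = np.exp(log_pi - log_pi.max())
    pi /= pi.sum()                     # softmax over mixture weights
    return float(np.sum(pi * p_bin))  # train by minimizing -log of this
```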

devinroth commented Dec 29, 2017