44.1K Sample Rate Strategies #124
Comments
@paperkettle Thanks, interesting. What's your learning rate and silence trimming history? |
@paperkettle OK, one observation: that config you cited has a ~500 ms receptive field only at 16 kHz. At 44.1 kHz it would have only a ~180 ms receptive window. The receptive field size is determined by the dilations (and filter width = 2). But that determines the receptive field in terms of number of samples; to get to wall-clock time you have to take the sample rate into account. I think that probably explains the "short repetitive window" aspect of the result. Cranking up the dilations to get to 500 ms at 44.1 kHz would require a really deep and (memory-wise) large net. You could try that. I suspect that to get to larger receptive fields (in terms of samples) efficiently, we would need to implement context stacks as described in section 2.6 of the paper. |
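A back-of-envelope sketch of that sample-rate dependence (the receptive field of 8000 samples is an assumed value, chosen only to roughly match the ~500 ms figure above; the exact number depends on the dilation config):

```python
# The receptive field is fixed in *samples* by the network config,
# so its wall-clock duration shrinks as the sample rate goes up.
receptive_field_samples = 8000  # assumed value, ~500 ms at 16 kHz

for sample_rate in (16000, 44100):
    duration_ms = 1000.0 * receptive_field_samples / sample_rate
    print(f"{sample_rate} Hz -> {duration_ms:.0f} ms receptive window")

# 16000 Hz -> 500 ms receptive window
# 44100 Hz -> 181 ms receptive window
```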
@paperkettle Sorry to spam you with so many messages, but another observation. There's a command-line argument, sample_size (default 100000), into which the input gets chopped up during training if its length exceeds that number. That's a little over 2 seconds at a 44.1 kHz sample rate, and most audio clips are longer than that. I would try turning it off completely by specifying --sample_size 0 on the command line. |
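A rough sketch of what that chunking amounts to (illustrative only; the repo's actual audio reader also handles queueing, silence trimming, and receptive-field padding, which are omitted here):

```python
import numpy as np

def chunk_audio(audio, sample_size):
    """Cut a long clip into training pieces of at most sample_size samples.
    sample_size <= 0 (i.e. --sample_size 0) means: use the whole clip at once."""
    if sample_size <= 0:
        return [audio]
    return [audio[i:i + sample_size] for i in range(0, len(audio), sample_size)]

clip = np.zeros(5 * 44100)          # a 5-second clip at 44.1 kHz
pieces = chunk_audio(clip, 100000)  # the default sample_size
print(len(pieces), "pieces of at most", round(100000 / 44100, 2), "seconds each")
```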
Thank you for the issue. I am actually facing the same problem, though with a different dataset. |
What's with the clicks? Any idea? 44.1 kHz is not that important for TTS (the energy above 8-10 kHz is very low). It would be good to see if we can achieve decent results at 10 bits as well. |
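For context on the bit-depth question: WaveNet quantizes mu-law-companded audio to 8 bits (256 output classes), and 10 bits would mean 1024 classes. A minimal sketch of that companding step (the standard mu-law formula, not necessarily this repo's exact implementation):

```python
import numpy as np

def mu_law_encode(audio, bits):
    """Mu-law compand a float signal in [-1, 1] and quantize it to 2**bits levels."""
    mu = 2 ** bits - 1
    companded = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    return ((companded + 1) / 2 * mu + 0.5).astype(np.int32)

x = np.linspace(-1, 1, 5)
print(mu_law_encode(x, 8))   # class indices in [0, 255]
print(mu_law_encode(x, 10))  # class indices in [0, 1023]
```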
You can work it out with paper and pencil from figure 3 of the paper to get an intuition for it. Or search for "compute_receptive_field_size" in this source file: |
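For reference, a sketch of the usual formula (the repo's own helper may differ in details such as how the initial causal layer is counted):

```python
def receptive_field(filter_width, dilations):
    """Receptive field in samples for a stack of dilated causal convolutions,
    including the initial causal layer, per the usual WaveNet formula."""
    return (filter_width - 1) * sum(dilations) + filter_width

# Example: dilations 1..512 repeated 5 times, filter width 2.
dilations = [2 ** i for i in range(10)] * 5
print(receptive_field(2, dilations))  # 5117 samples, ~320 ms at 16 kHz
# (the ~500 ms config discussed above uses a correspondingly larger dilation stack)
```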
@jyegerlehner — spam away, it's very helpful to get feedback! Learning rate is .001 (didn't change it with steps either). Silence trimming is default.

"I would try turning that off completely by specifying --sample_size 0 on the command-line." Ah, good, yes, will try. Still wrapping my head around receptive field <-> sample rate & size <-> net behavior — haven't known how to think about the chunking. I knew I had cut the receptive field and figured that was related to the short-window sound (the current output reminds me of granular synths). As a result, I tried to do a training pass with a 275000 sample size (not looking at the formula, I just naively scaled the sample size along with my jump from 16000 to 44100). That was when I hit what I think were memory errors (running a GTX 1080, which has 8 GB), though it was late and I will test again and look more closely. It successfully ran at around a 125000 sample size.

Let's say I feed in a single long audio file with a sample_size of 0 — won't I bump into the same issue? If not, under what circumstances would one want to specify a sample size? Just to tune the network into looking at specific lengths of time in relation to the content one is getting it to learn? Or, asked differently, what is the design rationale for chopping up inputs? To be able to dump in a massive folder of varied data and ensure that it looks at as many different files as possible?

Have some more abstract questions, but going to give the paper & codebase some time this evening while I run some more tests before asking. |
@jyegerlehner thank you very much! :) |
@paperkettle For RNNs, truncated BPTT is common due to GPU memory limits. |
Yes probably. The sample_size "chunking" was put in for memory reasons.
If you don't specify one at all, you get the default, which is 100000 samples.
I don't see how chunking helps performance. It probably hurts, because a larger fraction of the training data is trained on while the input receptive field is not yet fully filled. And it's learning to predict a discontinuity in the audio wherever the first sample in the chunk happens to be, since there are no preceding samples. So ideally I think we wouldn't chunk at all; you'd always be working on one long continuous stream of data. Except the memory required grows linearly with duration, so I'd make the sample size as large as your memory permits.

A couple of possibilities for overcoming the memory constraint: switch from float32 to float16, so every tensor uses exactly half as much memory. I tried a hack where I naively replaced all dtype=float32 with dtype=float16 and immediately got NaNs. I don't know if that's because fp16 is that much more numerically unstable, or because there's something else in the code that I missed that was expecting float32. Also, I think choosing the sgd optimizer uses less memory than adam; it doesn't require as many copies of tensors, if I'm not mistaken. That could be a factor of two or so as well. |
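A back-of-envelope illustration of why the usable sample size is memory-bound and what halving the dtype buys. The layer and channel counts here are assumed placeholder numbers, and the estimate ignores gradients, optimizer slots (Adam keeps extra per-weight copies), and framework overhead:

```python
def activation_bytes(sample_size, layers, channels, bytes_per_value):
    """Very rough activation-memory estimate: one (sample_size x channels)
    tensor kept per layer for the backward pass."""
    return sample_size * layers * channels * bytes_per_value

for dtype, nbytes in (("float32", 4), ("float16", 2)):
    gb = activation_bytes(275000, 50, 32, nbytes) / 1e9  # assumed 50 layers, 32 channels
    print(f"{dtype}: ~{gb:.1f} GB of activations at sample_size 275000")
```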
Randomly cropping samples might be beneficial, so that truncation doesn't introduce unnecessary side effects. Although we use the same training data each epoch, the input sample sequences are randomly cropped, so the 'boundary effect' gets blurred. |
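A minimal sketch of that random-crop idea (function and variable names are illustrative, not this repo's API): the chunk boundary then lands at a different offset each time a clip is seen.

```python
import numpy as np

def random_crop(audio, sample_size, rng=np.random):
    """Take a sample_size-long window starting at a random offset, so the
    artificial chunk boundary falls somewhere different on every pass."""
    if len(audio) <= sample_size:
        return audio
    start = rng.randint(0, len(audio) - sample_size + 1)
    return audio[start:start + sample_size]

clip = np.random.randn(5 * 44100)   # a 5-second clip at 44.1 kHz
piece = random_crop(clip, 100000)
print(len(piece))                   # 100000 samples, from a different offset each call
```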
Yeah, I'd recommend setting the batch size to 1 and making the sample size as large as possible. |
Here is another clip using those same settings, except
I do hear more quality top-end data than in the 16k clips I've generated, and the receptive field is large enough to generate language-like chunks (unlike the 44.1k clips I made, which felt stuck in granular synth / timestretch hiss / reverse reverb territory) |
Could you post the settings used? |
~40k iterations with { |
@paperkettle, you said: "trained on a dataset of my own voice". How well does the model imitate your voice in that clip? And how did you prepare the dataset? Did you record yourself saying the same phrases as one of the VCTK speakers? |
@Zeta36 The clip reproduces some of the character of my recorded dataset, for sure. Glitchier and otherworldly (I suspect a larger receptive field would help). "Did you record yourself saying the same phrases as one of the VCTK speakers?" — yeah. I re-recorded a vocal set from VCTK (toward future experiments of training on my voice + the full dataset). I added some compression and a little bit of de-essing. Recorded on a $100 condenser mic in a living room (so yes, some room sound in my dataset — more than the VCTK, but not that much). I found the VCTK texts awkward to read (I would never speak in the style they're written) and thus the personality of my dataset is pretty announcerly. I'm planning to sample my own writing and do more passes in the near future. |
@paperkettle, some people over in #112 are beginning to develop local conditioning. Your voice could be one of the first used to test this feature. I recommend forking @alexbeloi's development and giving it a try. Regards. |
Hello, I have tried several configurations with many datasets, and I have seen a pattern when training on 44.1 kHz songs versus 16 kHz songs: with the same configuration, the loss differs by about 1. Any guess why this happens? I have been training with a very low sample size of 16k or 32k, as my GPU runs out of memory with larger values. Regards. |
Here is a 44.1k sample rate clip, trained ~50k steps on VCTK speaker 280 (with a 100k sample size).
Any suggestions for how to improve it? I'm noticing:
Settings I used were the ones @jyegerlehner posted here: #47 (comment)