
44.1K Sample Rate Strategies #124

Open
chrisnovello opened this issue Oct 6, 2016 · 19 comments

chrisnovello commented Oct 6, 2016

Here is a 44.1k sample rate clip, trained ~50k steps on VCTK speaker 280 (with a 100k sample size): https://soundcloud.com/paperkettle/wavenet-441k-sample-rate-vctk-speaker-280-50k-steps

Any suggestions for how to improve it? I'm noticing:

  • Those pops / that top-end distortion. It sounds a bit like zero-crossing pops to me?
  • This clip also sounds like it has a repetitive short window, more so than other clips I'm hearing. I think this is a sample size issue? I ran into problems trying to scale my sample size along with the sample rate (most people here are using a 16k sample rate with a 100k sample size); I think they were memory problems. I'll check into this tomorrow, and also test the checkpoints I already have with a wav_seed.
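
(For scale: matching the wall-clock duration of a 100k-sample chunk at 16 kHz takes 100000 × 44100 / 16000 ≈ 276k samples at 44.1 kHz.)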

Settings I used were the ones @jyegerlehner posted here: #47 (comment)

{
"filter_width": 2,
"sample_rate": 44100,
"dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512],
"residual_channels": 32,
"dilation_channels":32,
"quantization_channels": 256,
"skip_channels": 1024,
"use_biases": true
}

@chrisnovello changed the title from "44100 Sample Rate Strategies" to "44.1K Sample Rate Strategies" on Oct 6, 2016
@jyegerlehner (Contributor)

@paperkettle Thanks, interesting. What's your learning rate, and how are you handling silence trimming?

jyegerlehner (Contributor) commented Oct 6, 2016

this clip sounds like it has a repetitive short window

@paperkettle OK, one observation: the config you cited has a ~500 msec receptive field only at 16 kHz. At 44.1 kHz it would have a ~180 msec receptive window. The receptive field size is determined by the dilations (and filter width = 2), but that gives the receptive size in number of samples; to get to wall-clock time you have to divide by the sample rate. I think that probably explains the "short repetitive window" aspect of the result.

Cranking up the dilations to reach 500 msec at 44.1 kHz would require a really deep and (memory-wise) large net. You could try that. I suspect that to reach larger receptive fields (in samples) efficiently, we would need to implement context stacks as described in section 2.6 of the paper.
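
A minimal sketch of that arithmetic (assuming receptive field = (filter_width - 1) * sum(dilations) + filter_width, which matches the numbers above; see compute_receptive_field_size in the model code for the authoritative version):

# Receptive field of the posted config, converted to wall-clock time.
dilations = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512] * 8
filter_width = 2

# Each dilated layer extends the field by (filter_width - 1) * dilation samples.
receptive_field = (filter_width - 1) * sum(dilations) + filter_width  # 8186 samples

for sample_rate in (16000, 44100):
    print("%d Hz: %.0f msec" % (sample_rate, 1000.0 * receptive_field / sample_rate))
# 16000 Hz: 512 msec
# 44100 Hz: 186 msec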

jyegerlehner (Contributor) commented Oct 6, 2016

@paperkettle Sorry to spam you with so many messages, but another observation: there's a default command-line argument, sample_size = 100000, into which the input gets chopped up during training if its length exceeds that number. That's a little over 2 seconds at a 44.1 kHz sample rate, and most audio clips are longer than that. I would try turning that off completely by specifying --sample_size 0 on the command line.
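
For example (assuming the repo's train.py and its flags; the corpus path is illustrative):

python train.py --data_dir=./VCTK-Corpus --sample_size=0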

shiba24 commented Oct 6, 2016

Thank you for the issue. I'm actually facing the same problem, though with a different dataset.
So the dilations, filter width, and sampling frequency determine the receptive field in msec: is there a formula for that? Thanks in advance.

lemonzi (Collaborator) commented Oct 6, 2016

What's with the clicks? Any idea?

44.1 kHz is not that important for TTS (the energy above 8-10 kHz is very low anyway), but it's good that we benchmark at that sample rate for when we want to train on music.

It would be good to see if we can achieve decent results at 10 bits as well (1024 levels seems reasonable; 16 bits is probably too much).

@jyegerlehner (Contributor)

@shiba24

the dilations, filter width and sampling frequency determine the "receptive field" [ms]: Is there any formula for that?

You can work it out with paper and pencil from figure 3 of the paper, to build an intuition for it. Or search for "compute_receptive_field_size" in this source file:
https://github.com/jyegerlehner/tensorflow-wavenet/blob/skip-receptive-field/wavenet/model.py
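
In closed form, what that function computes works out to roughly the following (a sketch; the exact input-layer bookkeeping may differ between forks):

def receptive_field_msec(filter_width, dilations, sample_rate):
    # Each dilated layer widens the field by (filter_width - 1) * dilation
    # samples; the trailing filter_width accounts for the input layer itself.
    samples = (filter_width - 1) * sum(dilations) + filter_width
    return 1000.0 * samples / sample_rate

With filter_width 2 and the eight 1..512 stacks above, this gives ~512 msec at 16 kHz and ~186 msec at 44.1 kHz, matching the numbers earlier in the thread.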

@chrisnovello (Author)

@jyegerlehner spam away, it's very helpful to get feedback! The learning rate is .001 (I didn't change it over the course of training either). Silence trimming is at the default.

"I would try turning that off completely by specifying --sample_size 0 on the command-line."

Ah, good, yes, will try. Still wrapping my head around how receptive field, sample rate, sample size, and net behavior relate; I haven't known how to think about the chunking.

I knew I had cut the receptive field and figured that was related to the short-window sound (the current output reminds me of granular synths). So I tried a training pass with a 275000 sample size (not looking at the formula, I just naively scaled the sample size along with my jump from 16000 to 44100). That was when I hit what I think were memory errors (running a GTX 1080, which has 8 GB), though it was late and I will test again and look more closely. It ran successfully at around a 125000 sample size.

Let's say I feed in a single long audio file with --sample_size 0: won't I bump into the same issue?

If not, under what circumstances would one want to specify a sample size? Just to tune the network toward looking at specific lengths of time in relation to the content one is getting it to learn?

Or, asked differently, what is the design rationale for chopping up inputs? To be able to dump in a massive folder of varied data and ensure the net looks at as many different files as possible?

I have some more abstract questions, but I'm going to give the paper and codebase some time this evening while I run more tests before asking.

shiba24 commented Oct 7, 2016

@jyegerlehner thank you very much! :)

nakosung (Contributor) commented Oct 7, 2016

@paperkettle For RNNs, truncated BPTT is common due to GPU memory limits.

jyegerlehner (Contributor) commented Oct 7, 2016

Let's say I feed in a single long audio file with --sample_size 0: won't I bump into the same issue?

Yes probably. The sample_size "chunking" was put in for memory reasons.

want to specify a sample size

If you don't specify one at all, you get the default which is 100000 samples.

Or, asked differently, what is the design rationale for chopping up inputs? To be able to dump in a massive folder of varied data and ensure the net looks at as many different files as possible?

I don't see how chunking helps performance. It probably hurts, because a larger fraction of the training data is consumed while the input receptive field is not yet fully filled. And the net is learning to predict a discontinuity in the audio wherever the first sample of a chunk happens to fall, since there are no preceding samples. So ideally I think we wouldn't chunk at all; you'd always be working on one long continuous stream of data. Except the memory required grows linearly with duration. So I'd make the sample size as large as your memory permits.

A couple of possibilities for overcoming the memory constraint: switch from float32 to float16, so every tensor uses exactly half as much memory. I tried a hack where I naively replaced all dtype=float32 with dtype=float16, and immediately got NaNs. I don't know if that's because fp16 is that much more numerically unstable, or because there's something else in the code I missed that was expecting float32. Also, I think choosing the sgd optimizer uses less memory than adam; it doesn't keep as many copies of tensors, if I'm not mistaken. That could be another factor of two or so.
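
A rough illustration of the fp16 idea (a sketch with assumed shapes, not the repo's model): keep weights and activations in float16, but cast the logits up to float32 before the softmax loss, which is the usual place fp16 overflows into NaNs:

import tensorflow as tf

dtype = tf.float16  # halve the memory of weights and activations

inputs = tf.placeholder(dtype, [1, None, 256])   # one-hot encoded audio
targets = tf.placeholder(tf.int32, [None])       # next-sample class ids
filt = tf.Variable(tf.truncated_normal([2, 256, 256], stddev=0.05, dtype=dtype))
logits = tf.nn.conv1d(inputs, filt, stride=1, padding='VALID')

# Compute the loss in float32 to avoid fp16 overflow producing NaNs.
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=targets,
        logits=tf.reshape(tf.cast(logits, tf.float32), [-1, 256])))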

nakosung (Contributor) commented Oct 8, 2016

Randomly cropped samples might help keep the truncation from introducing unnecessary side effects. Even though we use the same training set each epoch, if the input sequences are randomly cropped, the unnecessary 'boundary effect' gets blurred across different positions.
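
A minimal sketch of that idea (random_chunk is a hypothetical helper, not the repo's audio reader):

import numpy as np

def random_chunk(audio, sample_size):
    # Start each training chunk at a random offset instead of sample 0,
    # so the artificial boundary lands somewhere different every epoch.
    if len(audio) <= sample_size:
        return audio
    start = np.random.randint(0, len(audio) - sample_size + 1)
    return audio[start:start + sample_size]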

ibab (Owner) commented Oct 10, 2016

Yeah, I'd recommend setting the batch size to 1 and choosing the sample size as large as possible.
Cutting the samples to a fixed size is useful, as it prevents us from crashing when one of the samples is particularly large.
batch_size > 1 should come in handy if we want to use things like batch normalization.

chrisnovello (Author) commented Oct 17, 2016

Here is another clip using those same settings, except:

  • trained/generated using a 22.5k sample rate
  • trained on a dataset of my own voice (as large as each individual VCTK entry)

I do hear more quality in the top end than in the 16k clips I've generated, and the receptive field is large enough to generate language-like chunks (unlike the 44.1k clips I made, which felt stuck in granular synth / timestretch hiss / reverse reverb territory).

https://soundcloud.com/paperkettle/wavenet-babble-test-trained-a-neural-network-to-speak-with-my-voice

@nakosung (Contributor)

Could you post the settings you used?

@chrisnovello (Author)

~40k iterations with
sample size 100000
learning rate .001

{
"filter_width": 2,
"sample_rate": 22500,
"dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512],
"residual_channels": 32,
"dilation_channels":32,
"quantization_channels": 256,
"skip_channels": 1024,
"use_biases": true,
"scalar_input": false,
"initial_filter_width": 32
}
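
By the receptive-field arithmetic above, this config spans 8186 samples (initial_filter_width shouldn't change that while scalar_input is false), i.e. roughly 8186 / 22500 ≈ 364 msec at 22.5 kHz, about twice the wall-clock context of the 44.1k runs.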

Zeta36 commented Oct 17, 2016

@paperkettle, you said: "trained on a dataset of my own voice". How well does the model imitate your voice in that clip? And how did you prepare the dataset? Did you record yourself saying the same phrases as one of the VCTK speakers?

chrisnovello (Author) commented Oct 17, 2016

@Zeta36 The clip reproduces some of the character of my recorded dataset, for sure. It's glitchier and more otherworldly (I suspect a larger receptive field would help).

"Did you record yourself saying the same phrases than in one VCTK speaker?" — yeah.

I re-recorded a vocal set from VCTK (toward future experiments training on my voice plus the full dataset). I added some compression and a little de-essing. It was recorded on a $100 condenser mic in a living room (so yes, there's some room sound in my dataset; more than the VCTK, but not that much).

I found the VCTK texts awkward to read (I would never speak in the style they're written), and thus the personality of my dataset is pretty announcer-like. I'm planning to sample my own writing and do more passes in the near future.

Zeta36 commented Oct 17, 2016

@paperkettle, some people over in #112 are beginning to develop local conditioning. Your voice could be one of the first datasets used to test this feature. I recommend forking @alexbeloi's work on that and giving it a try.

Regards.

adamalpi commented Jan 6, 2017

Hello, I have tried several configurations across many datasets, and I've seen a consistent pattern when training on 44.1 kHz versus 16 kHz songs: with the same configuration, the loss differs by about 1. Any guess why that happens?

I have been training with a very low sample size of 16k or 32k, as my GPU runs out of memory with bigger numbers.

Regards.
