
How to train a model that can fully extract the 44100 Hz frequency #35

Open

dingjibang opened this issue Mar 28, 2022 · 8 comments

@dingjibang

dingjibang commented Mar 28, 2022

I want to train a 2-stem model.

I noticed that in the yaml configuration of each model there are some parameters that affect the final frequency cutoff. It seems that multigpu_drums.yaml can handle the full 44100 Hz band, but with the reduction of num_blocks (11 => 9), the model size also decreases accordingly (29 MB => 21 MB).

So although something like multigpu_drums.yaml can handle 44100 Hz in full, the model shrinks instead. Does this affect the final accuracy?

The parameters dim_t, hop_length, overlap, and num_blocks seem to have a complementarity that I cannot understand. Maybe this 'complementarity' was designed for the competition (mix to demucs), but I want to apply this in the real world without demucs (only mdx-net; after some testing, I think mdx-net's potential is higher than demucs's).

When I try to change num_blocks from 9 to 11, the inference results have overlapping and broken voices... Do you have any good parameter recommendations for training a full 44100 Hz model without loss of accuracy (i.e. without the model shrinking)?

@Zokhoi
Collaborator

Zokhoi commented Mar 29, 2022

Disclaimer: I'm not part of the original team; my collaborator role here is to update some of the documentation.

I don't fully understand what you mean, but I think what you're trying to achieve here is to train models that do not have a frequency cutoff?

If so, maybe take a look at their presentation slides, which touch on this, and try changing both num_blocks and dim_f (2^num_blocks = dim_f?) and related parameters?

@dingjibang
Author

Currently I am using the configuration below to train without a frequency cutoff:

```yaml
num_blocks: 9
l: 3
g: 32
k: 3
bn: 8
bias: False

n_fft: 4096
dim_f: 2048
dim_t: 128
dim_c: 4
hop_length: 1024

overlap: 2048
```

Although it works and there is no frequency cutoff, the generated onnx/ckpt files are smaller than the pretrained vocals/bass/others files.

File sizes:

- my onnx without freq cutoff: 21.417 MB
- pretrained onnx with freq cutoff (vocals/bass/others): 29.008 MB

So what I want to ask is:

  1. Does the smaller model file mean that it contains less information, and does that affect the quality of the model?
  2. If 1 is true, how can I make the model file larger without a frequency cutoff?

@Zokhoi
Collaborator

Zokhoi commented Mar 29, 2022

For 1: from my understanding, reducing num_blocks decreases the number of intermediate blocks/layers that the model uses to recognize patterns, so with fewer layers the model holds less information.
Paper on TFC-TDF-U-Net v1
Brief explanation of what the intermediate blocks do
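
As a toy illustration (this is only a sketch, not the actual TFC-TDF block code): stacking more blocks means more parameters, hence a larger checkpoint/onnx file.

```python
import torch.nn as nn

# Toy sketch only, not the repo's TFC-TDF blocks: stacking more
# intermediate blocks adds parameters, hence a larger file on disk.
def toy_net(num_blocks, g=32, k=3):
    return nn.Sequential(
        *[nn.Conv2d(g, g, k, padding=k // 2) for _ in range(num_blocks)]
    )

for n in (9, 11):
    n_params = sum(p.numel() for p in toy_net(n).parameters())
    print(n, n_params)  # parameter count grows roughly linearly with num_blocks
```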

For 2, from the paper on MDX-Net:

> ... high frequencies above the target source’s expected frequency range were cut off from the mixture spectrogram. This way, we can increase n_fft while using the same input spectrogram size (which we needed to constrain for the separation time limit), and using a larger n_fft usually leads to better SDR. It is also why we did not use a multi-target model (a single model that is trained to estimate all four sources), where we could not use source-specific frequency cutting.

So for no frequency cutoff, you would probably want n_fft and dim_f to be the same.
If you were to increase the model size, you would probably increase dim_f and num_blocks.
Also, here's a brief explanation of the frequency cutoff from the author.
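
As a rough sanity check (my own sketch assuming a 44100 Hz sample rate, not code from this repo), you can compute where the cutoff of a given config lands:

```python
# Sketch: where the frequency cutoff lands for a given config (sr assumed 44100).
sr = 44100
n_fft = 4096        # STFT size
dim_f = 2048        # frequency bins the model keeps

n_bins = n_fft // 2 + 1          # one-sided STFT bins: 2049
cutoff_hz = dim_f * sr / n_fft   # 22050.0 Hz here, i.e. effectively no cutoff
print(n_bins, cutoff_hz)
```

With your config above (n_fft: 4096, dim_f: 2048), the kept bins reach right up to Nyquist, which matches you seeing no cutoff.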

@dingjibang
Author

dingjibang commented Mar 29, 2022

Thanks for your reply (double thanks^_^)

Changing n_fft and dim_f to be the same causes an error:

```
RuntimeError: Error instantiating 'src.models.mdxnet.ConvTDFNet' : Trying to create tensor with negative dimension -2047: [1, 4, -2047, 256]
```

Error stack & source code: src/models/mdxnet.py#L33

```python
self.freq_pad = nn.Parameter(torch.zeros([1, dim_c, self.n_bins - self.dim_f, self.dim_t]), requires_grad=False)
```

It seems that n_fft and dim_f cannot be the same in the code; dim_f can be at most about n_fft / 2 for it to work properly.
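
If I read the shapes right, the arithmetic behind the error is (my own sketch):

```python
# The model zero-pads the discarded bins back onto the spectrogram,
# so dim_f can be at most n_bins = n_fft // 2 + 1.
n_fft = 4096
n_bins = n_fft // 2 + 1   # 2049
dim_f = 4096              # setting dim_f = n_fft ...
print(n_bins - dim_f)     # ... gives -2047, the negative dimension in the error
```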

Sorry, I'm a layman in this field and don't know much about these complex things... I just want to get a correct config to train 😭😭

@ws-choi
Collaborator

ws-choi commented Mar 29, 2022

Hi @dingjibang,
Can you share the inference code that you used for the result below, along with an audio sample?

> When I try to change num_blocks from 9 to 11, the results of inference have overlapping and broken voices...

Thank you @Zokhoi for your contributions, by the way.

@dingjibang
Author

dingjibang commented Mar 29, 2022

```python
Conv_TDF_net_trim(
    device=device, load=load,
    model_name='Conv-TDF', target_name='guitar',
    lr=0.0002, epoch=470,
    L=9, l=3, g=32, bn=8, bias=False,  # when I change num_blocks from 9 to 11, this "L" value changes with it
    dim_f=11, dim_t=7
)
```

and n_fft_scale['guitar'] = 2
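
(If I understand this inference helper correctly, dim_f and dim_t here seem to be log2 exponents rather than the raw sizes from the training yaml; this is just my guess from matching the numbers:)

```python
# My assumption from matching values, not verified against the source:
# dim_f=11 and dim_t=7 look like exponents of 2.
dim_f_bins = 2 ** 11    # = 2048, matching dim_f: 2048 in my training yaml
dim_t_frames = 2 ** 7   # = 128, matching dim_t: 128
print(dim_f_bins, dim_t_frames)
```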

I found that the overlapping and broken sound was caused by too little training time; I was too impatient... After quickly training both for 10 epochs, the problems above were gone.

So things seem to end very simply 😭. With the other parameters of the above configuration unchanged, just increasing num_blocks seems to increase the size of the final model.

Sorry for an extra question: does n_fft also affect the final quality? (I don't consider the time cost of training.) If so, with the above configuration, how do I safely increase this value? Do I need to change other associated parameter values?

Thank you

@Zokhoi
Collaborator

Zokhoi commented Mar 29, 2022

@ws-choi What is the significance of this line,

```python
self.n_bins = n_fft // 2 + 1
```

which makes n_bins (roughly) half of n_fft? Is it because of the sampling theorem?

@dingjibang I think that since the harmonic series of instruments like bass is squashed into one frequency region instead of spread across the spectrum, a larger n_fft with fixed dim_f would classify only the lower frequencies into more bins, making it clearer for the model to find the patterns corresponding to those compressed bass harmonics. That's probably why "using a larger n_fft usually leads to better SDR" when the spectrogram size is fixed.
Different instruments occupy different regions of the spectrum, so each instrument has a different upper limit for its frequency cutoff, and when scaled to the same dim_f, the n_fft for different instruments would be different.
From what I can read in the code, if you change n_fft you would probably also need to change dim_f to retain the ratio between them (n_fft:dim_f of 2:1 for no cutoff?).
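
Roughly, with dim_f fixed, a larger n_fft buys finer per-bin resolution at the cost of a lower cutoff (again a sketch, assuming a 44100 Hz sample rate):

```python
# With dim_f fixed, a larger n_fft gives finer Hz-per-bin resolution
# but a lower cutoff, since the kept bins cover a narrower band.
sr, dim_f = 44100, 2048
for n_fft in (4096, 6144, 8192):
    hz_per_bin = sr / n_fft
    print(n_fft, round(hz_per_bin, 2), round(dim_f * hz_per_bin))
# 4096 -> 10.77 Hz/bin, cutoff ~22050 Hz (full band)
# 6144 -> 7.18 Hz/bin,  cutoff ~14700 Hz
# 8192 -> 5.38 Hz/bin,  cutoff ~11025 Hz
```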

@ws-choi
Collaborator

ws-choi commented Nov 4, 2022

Hi @Zokhoi, sorry I didn't notice this for a while.
That's simply because we used the one-sided output mode of torch.stft: https://pytorch.org/docs/stable/generated/torch.stft.html

Below is the explanation of the onesided mode.


> If onesided is True (default for real input), only values for ω in [0, 1, 2, …, ⌊n_fft / 2⌋ + 1] are returned, because the real-to-complex Fourier transform satisfies the conjugate symmetry, i.e., X[m, ω] = X[m, n_fft − ω]*. Note if the input or window tensors are complex, then onesided output is not possible.
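
For example:

```python
import torch

# One-sided STFT of real input: the frequency axis has n_fft // 2 + 1 bins.
n_fft = 4096
x = torch.randn(44100)  # 1 second of audio at 44.1 kHz
spec = torch.stft(x, n_fft=n_fft, hop_length=1024,
                  window=torch.hann_window(n_fft),
                  onesided=True, return_complex=True)
print(spec.shape)  # torch.Size([2049, 44]); 2049 == 4096 // 2 + 1
```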
