
Source Separation Integration: sum(sources + background_noise) != mixture with mels. #38

Closed
popcornell opened this issue Jun 28, 2020 · 5 comments


@popcornell
Contributor

I am experimenting a bit with Lhotse integration in Asteroid here:
https://github.com/mpariente/asteroid/blob/lhotse_integration_test/egs/MiniLibriMix/lhotse/

One thing I noticed (unless I did something completely wrong) is that the sum of the source features plus the background-noise features is different from the mixture features:
https://github.com/mpariente/asteroid/blob/lhotse_integration_test/egs/MiniLibriMix/lhotse/test_additive.py
This could be a problem when training a separation model, since the underlying assumption is basically that the process is additive.
I guess this is due to the fact that the feature computation via torchaudio.compliance.kaldi.fbank must involve some non-linear operations (aside from the log operation, of course!).
I guess so because dithering is disabled by default (see pytorch/audio#371).
Does anyone have a clue why this happens? The difference seems too substantial (first decimal digit) to be ascribed to truncation etc.
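For what it's worth, even before the log there is a cross term: the power spectrum of a sum of signals is not the sum of their power spectra, so mel energies computed from the mixture waveform need not match the summed per-source mel energies. A minimal NumPy sketch with synthetic white-noise "sources" (not the actual MiniLibriMix data):

```python
import numpy as np

# Synthetic "sources": white noise, just to illustrate the algebra.
rng = np.random.default_rng(0)
s1 = rng.standard_normal(512)
s2 = rng.standard_normal(512)

X1 = np.fft.rfft(s1)
X2 = np.fft.rfft(s2)

# |X1 + X2|^2 = |X1|^2 + |X2|^2 + 2 * Re(X1 * conj(X2))
P_mix = np.abs(X1 + X2) ** 2
P_sum = np.abs(X1) ** 2 + np.abs(X2) ** 2
cross = 2 * np.real(X1 * np.conj(X2))

# The cross term is exactly the mismatch, and it is not small.
print(np.allclose(P_mix, P_sum + cross))  # True
print(np.max(np.abs(cross)) > 1.0)        # True
```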

BTW, the problem is easily side-stepped by summing the source and noise mels at training time to get the mixture. It is inexpensive, and you also save disk space by not dumping the mixture feats.
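For log-mel features, the on-the-fly sum can be done stably with the log-sum-exp trick. A sketch, assuming features stacked as `(num_sources, frames, bins)`; the function name and shapes here are made up:

```python
import numpy as np

def mix_log_mels(log_mels):
    """Combine per-source log-mel features into a mixture by summing
    in the linear domain, using log-sum-exp for numerical stability.

    log_mels: array of shape (num_sources, frames, mel_bins).
    """
    m = np.max(log_mels, axis=0)
    return m + np.log(np.sum(np.exp(log_mels - m), axis=0))

# Toy example: two sources plus noise, random log-mel feats.
rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 100, 80))
mix = mix_log_mels(feats)

# Equivalent (but less stable) direct form:
direct = np.log(np.sum(np.exp(feats), axis=0))
print(np.allclose(mix, direct))  # True
```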

@danpovey
Collaborator

Can you please show an example? Wonder how often it's that different; which mel bin is most different; etc.

@popcornell
Contributor Author

I don't know how useful it can be, but here are some plots for now (I can also compute some stats on the distribution of the difference). There is a difference of over 3.9 for one bin, which is very strange.

Loaded mixture feats:
[plot: c_mix]
On-the-fly np.log(np.sum(np.exp(c_sources), 0) + np.exp(c_noise)):
[plot: onthefly]
Absolute difference between the two:
[plot: difference]

@danpovey
Collaborator

danpovey commented Jun 28, 2020 via email

@popcornell
Contributor Author

Thank you very much.
I'll try to train two separation systems in the feature domain (when I have some spare GPUs): one with on-the-fly mixing as above and one without, and see what happens. In the past I have always mixed the features on the fly and had decent results.

My main concern is that it is sort of like using "noisy labels" for separation.
And because the separation is done on mels (and not on log-mels), those differences can be even more substantial, and it could be difficult for the DNN to learn a mask for each speaker with that amount of "noise" in the oracle targets.
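To make the "noisy oracle targets" concern concrete, here is a toy NumPy sketch of ratio masks in the linear mel domain (all numbers synthetic; `mel_s1`, `mel_s2`, and the perturbed stored mixture are hypothetical stand-ins for real features):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two sources' linear-domain mel energies (frames x bins), synthetic.
mel_s1 = rng.random((100, 80)) + 1e-3
mel_s2 = rng.random((100, 80)) + 1e-3

# Oracle ratio masks against a *consistent* mixture
# (mixture == sum of sources): the masks sum to exactly 1.
mixture = mel_s1 + mel_s2
masks = np.stack([mel_s1, mel_s2]) / mixture
print(np.allclose(masks.sum(axis=0), 1.0))  # True

# Against an inconsistent, separately-computed mixture (simulated here
# by a 10% multiplicative perturbation), the same "oracle" masks no
# longer sum to 1 -- the targets are effectively noisy.
mixture_stored = mixture * (1 + 0.1 * rng.standard_normal(mixture.shape))
masks_noisy = np.stack([mel_s1, mel_s2]) / mixture_stored
print(np.allclose(masks_noisy.sum(axis=0), 1.0))  # False
```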

@pzelasko
Collaborator

pzelasko commented Nov 4, 2020

I'm closing this as it seems stale; if there are any new developments, be sure to let us know!

@pzelasko pzelasko closed this as completed Nov 4, 2020