MFCC Feature Request: Log vs dB, and Documentation #1093

Closed
Novak3 opened this issue Mar 31, 2020 · 1 comment
Labels
discussion Open-ended discussion for developers and users

Comments

@Novak3

Novak3 commented Mar 31, 2020

There appear to be two rival methods for calculating MFCCs.

One, used in Librosa by default if a specific melspectrogram is not supplied, uses a (power) dB-scaled melspectrogram. Per issue #573, this may be to comply with a reference implementation in Matlab, although I did not see where in the reference implementation that was happening.

The other, used in packages such as python_speech_features, uses a log-scaled melspectrogram. This technique matches my understanding of Davis and Mermelstein, although I welcome correction if I have read it wrong.

This causes considerable confusion when users compare different packages and get wildly different results. It is likely the underlying issue in issue #573, it comes up on Stack Exchange and other forums, and it has led torchaudio to implement an argument that switches between the two behaviors.

I suggest/request the following:

  1. Documentation mentioning both main approaches and making clear what is the default behavior here
  2. An example in the documentation showing how to force the existing implementation to conform with the other methodology (i.e., construct an alternate melspectrogram input and use it, rather than raw audio samples)
  3. If possible, an argument similar to torchaudio's implementation which will, for raw audio only, switch between the two implementations.

bmcfee added the "discussion" label Apr 2, 2020

@bmcfee
Member

bmcfee commented Apr 2, 2020

> There appear to be two rival methods for calculating MFCCs.

I think this might be severely under-estimating the degree of variability in MFCC implementations. 😁 Since there isn't really a single canonical reference implementation, the best we can do is provide a flexible API and defaults which behave sanely and correspond to a well-known reference.

> Per issue #573, this may be to comply with a reference implementation in Matlab, although I did not see where in the reference implementation that was happening.

That's done in https://labrosa.ee.columbia.edu/matlab/rastamat/ (see: melfcc, powspec, audspec), where the starting point is a power spectrum (rather than magnitude spectrum).

> This causes considerable confusion when users compare different packages and get wildly different results

This one difference shouldn't cause too much divergence in the results, since after log-scaling, the change between magnitude and power becomes a scaling factor of 2. It's been a while since I looked into it, but I would expect much larger sources of variation to come from differences in how the log is actually computed (e.g., bias stabilization), the definition of the mel scale itself, and how the filter banks are normalized, not to mention other parameters involving the STFT windows, pre-emphasis, liftering, etc.
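
To make the scaling argument concrete, here is a minimal sketch in plain Python. It is an illustration only, not code from either library. The helper names `power_to_db_scalar` and `natural_log_scalar`, the floor value `amin`, and the sample magnitudes are all assumptions made up for this example. It checks two things for individual values: a dB scale is a natural-log scale multiplied by the constant 10 / ln(10), and taking the log of a power value gives twice the log of the corresponding magnitude. Note that the factor of 2 is exact only per value; once a mel filter bank sums several bins before the log is taken, it holds only approximately.

```python
import math

# Hypothetical helpers for illustration; the floor `amin` is an assumed
# stabilizer that avoids log(0), not a value taken from any specific package.
def power_to_db_scalar(p, amin=1e-10):
    """Convert a power value to decibels: 10 * log10(p)."""
    return 10.0 * math.log10(max(p, amin))

def natural_log_scalar(p, amin=1e-10):
    """Convert a value to natural-log scale: ln(p)."""
    return math.log(max(p, amin))

# Constant that relates dB of a power value to its natural log.
DB_PER_LN_UNIT = 10.0 / math.log(10.0)  # roughly 4.343

# Made-up magnitude values and their corresponding powers.
magnitudes = [0.01, 0.5, 1.0, 3.0, 40.0]
powers = [m ** 2 for m in magnitudes]

db_of_power = [power_to_db_scalar(p) for p in powers]
ln_of_power = [natural_log_scalar(p) for p in powers]
ln_of_magnitude = [natural_log_scalar(m) for m in magnitudes]

for m, db, lp, lm in zip(magnitudes, db_of_power, ln_of_power, ln_of_magnitude):
    print(f"magnitude={m:>6}: dB(power)={db:9.4f}  "
          f"ln(power)={lp:9.4f}  ln(magnitude)={lm:9.4f}")
```

Because the DCT used for MFCCs is linear, a constant multiplier on the log-mel input carries through to the coefficients unchanged, which is why this particular difference amounts to a rescaling rather than a change of shape.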

> 1. Documentation mentioning both main approaches and making clear what is the default behavior here

I hesitate to go down this route because there are so many parameters to explore that documenting "both main approaches" is almost surely going to be inadequate and lead to more confusion.

> 2. An example in the documentation showing how to force the existing implementation to conform with the other methodology (i.e., construct an alternate melspectrogram input and use it, rather than raw audio samples)

This is a great idea, and could be seen as building off the previous issue #804. I think the best way to go about this is to provide an "advanced example" notebook that demonstrates how to exactly replicate the behavior of one or two well-known implementations (e.g., HTK).

> 3. If possible, an argument similar to torchaudio's implementation which will, for raw audio only, switch between the two implementations.

This is already implicit via pass-through parameters from mfcc to melspectrogram. You can call mfcc with power=1 to get log amplitude (instead of dB) behavior for audio input.
