MFCC Feature Request: Log vs dB, and Documentation #1093
Comments
I think this might be severely under-estimating the degree of variability in MFCC implementations. 😁 Since there isn't really a single canonical reference implementation, the best we can do is provide a flexible API and defaults which behave sanely and correspond to a well-known reference.
That's done in https://labrosa.ee.columbia.edu/matlab/rastamat/ (see: melfcc, powspec, audspec), where the starting point is a power spectrum (rather than magnitude spectrum).
This one difference shouldn't cause too much divergence in the results, since after log-scaling, the change between magnitude and power becomes a scaling factor of 2. It's been a while since I looked into it, but I would expect much larger sources of variation to come from differences in how the log is actually computed (e.g., bias stabilization), the definition of the mel scale itself, and how the filter banks are normalized, not to mention other parameters involving the STFT windows, pre-emphasis, liftering, etc.
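As a quick sanity check of the factor-of-2 point, here is a minimal NumPy sketch. The random array is only a stand-in for an STFT magnitude spectrogram (an assumption for illustration, not real audio), and the identity is shown on the raw spectrum, before any mel filter bank is applied; after filter-bank summation the relationship is only approximate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for an STFT magnitude spectrogram
# (frequency bins x frames); kept strictly positive so the log is defined.
magnitude = rng.uniform(0.1, 10.0, size=(257, 20))
power = magnitude ** 2

log_magnitude = np.log(magnitude)
log_power = np.log(power)

# log(|X|^2) == 2 * log(|X|): the two differ only by a constant factor.
factor_of_two_holds = np.allclose(log_power, 2.0 * log_magnitude)
print(factor_of_two_holds)
```

Any bias term added before the log (for stabilization) breaks this exact proportionality, which is one reason the other sources of variation listed above tend to matter more.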
I hesitate to go down this route because there are so many parameters to explore that documenting "both main approaches" is almost surely going to be inadequate and lead to more confusion.
This is a great idea, and could be seen as building off the previous issue #804. I think the best way to go about this is to provide an "advanced example" notebook that demonstrates how to exactly replicate the behavior of one or two well-known implementations (e.g., HTK).
This is already implicit via pass-through parameters from mfcc to |
There appear to be two rival methods for calculating MFCCs.
One, used in librosa by default if a specific melspectrogram is not supplied, uses a dB-scaled (power) melspectrogram. Per issue #573, this may be to comply with a reference implementation in MATLAB, although I did not see where in the reference implementation that was happening.
The other, used in packages such as python_speech_features, uses a log-scaled melspectrogram. This technique matches my understanding of Davis and Mermelstein, although I welcome correction if I have read it wrong.
This causes considerable confusion when users compare different packages and get wildly different results. It is likely the underlying issue in #573, it comes up on Stack Exchange and other forums, and it has led torchaudio to implement an argument that switches between the two behaviors.
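To make the two scalings described above concrete, here is a minimal NumPy sketch. It assumes an idealized pipeline: the mel power values are random placeholders, `dct_ii_matrix` is a hypothetical helper written for this example (an unnormalized DCT-II), and no floor, reference level, or dynamic-range clipping is applied. Under those assumptions, dB-scaled and natural-log-scaled MFCCs differ only by the constant 10 / ln(10), roughly 4.34, because the DCT is linear.

```python
import numpy as np


def dct_ii_matrix(n_out, n_in):
    """Unnormalized DCT-II basis (hypothetical helper for this sketch)."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    return np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))


rng = np.random.default_rng(0)

# Placeholder mel power spectrogram (mel bands x frames), strictly positive.
mel_power = rng.uniform(1e-3, 1e3, size=(40, 25))

db_scaled = 10.0 * np.log10(mel_power)   # decibel scaling
log_scaled = np.log(mel_power)           # natural-log scaling

basis = dct_ii_matrix(13, 40)
mfcc_db = basis @ db_scaled
mfcc_log = basis @ log_scaled

# 10 * log10(x) == (10 / ln 10) * ln(x), and the DCT preserves the factor.
scale = 10.0 / np.log(10.0)
same_up_to_scale = np.allclose(mfcc_db, scale * mfcc_log)
print(same_up_to_scale)
```

In real packages the relationship is usually not this clean: a floor on the power values, a reference level, or a dynamic-range limit such as `top_db` in librosa's `power_to_db` changes the result in a non-linear way, so the outputs of two libraries generally cannot be matched by rescaling alone.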
I suggest/request the following: