
MFCC to Wav file #660

Closed
Edresson opened this issue Jan 25, 2018 · 13 comments
Labels
question Issues asking for help doing something

Comments

@Edresson

Edresson commented Jan 25, 2018

Hello, could anyone give an example of MFCC to WAV with librosa?

I've tried several algorithms, but the reconstruction quality is quite poor.

An example of what I need: https://www.research.ibm.com/haifa/projects/multimedia/recovc/demo/index.html

Thanks in advance for your help :)

@Edresson Edresson changed the title MFCC to Wave MFCC to Wav file Jan 25, 2018
@rafaelvalle

rafaelvalle commented Jan 25, 2018

Have you checked this #424 in our repo, or this http://amyang.xyz/post/Inverse%20MFCC%20to%20WAV on the web?
You can also train an MFCC-to-audio decoder, e.g. a WaveNet, but that will probably be overkill.

@Edresson
Author

Edresson commented Jan 26, 2018

Hello, thanks for your reply. I had tried the link: http://amyang.xyz/post/Inverse%20MFCC%20to%20WAV

but the audio quality is badly compromised. I'm looking for a more effective method. I had already thought about training an artificial neural network to do this work, but that seemed excessive...

Are the values below fixed, or should they be changed?
n_mel = 128
n_fft = 2048

The person's voice becomes hoarse in the reconstruction; for the application I intend, I need a good reconstruction.

@bmcfee
Member

bmcfee commented Jan 26, 2018

The short answer here is that you're not going to get a good reconstruction from mfccs for two reasons:

  1. MFCCs discard a lot of information by a low-rank linear projection of the mel spectrum. An MFCC representation with n_mel=128 and n_mfcc=40 is analogous to a jpeg image with quality set to 30%.
  2. You lose phase information (though there are ways to estimate it, e.g., Griffin-Lim).

@Edresson
Author

Thanks for your answer. If I train an artificial neural network with input X containing the MFCCs and output Y containing the wav file, I think it is possible to improve the quality of the reconstructed audio. Will it really work?

@rafaelvalle

rafaelvalle commented Jan 27, 2018

For training a neural net to decode a spectrogram-like representation to audio, look at Ryuichi's work here: https://github.com/r9y9/wavenet_vocoder/.

@bmcfee bmcfee added the question Issues asking for help doing something label Feb 3, 2018
@bmcfee bmcfee closed this as completed Feb 12, 2018
@shamoons

shamoons commented Sep 1, 2020

For training a neural net to decode a spectrogram-like representation to audio, look at Ryuichi's work here: https://github.com/r9y9/wavenet_vocoder/.

If you already have a spectrogram, then do you need to train a NN to recover back to audio? Isn’t the process invertible with something like Griffin-Lim?

@JoelStansbury

JoelStansbury commented Sep 14, 2020

Note: Mistakes were made here, see discussion below

@shamoons Spectrograms are reversible as they are just a bunch of Fourier transformations. But perfect (loss-less) reversibility requires an infinite number of coefficients. (Fast) Fourier transformations (FFTs) are useful for dimensionality reduction because they prioritize the important parts of the wave structure, making it easy to throw out the less important parts. It seems to me that MelSpectrograms (MFCCs) take it to another level in that their goal is to throw out everything that is not important to human speech, although I've never looked into how they are made.

I don't know the math behind MFCCs in particular, but for the usual spectrogram, perfect reconstruction requires infinite memory. I'm just guessing that MFCCs operate on a similar principle.

@bmcfee
Member

bmcfee commented Sep 14, 2020

But perfect (loss-less) reversibility requires an infinite number of coefficients. (Fast) Fourier transformations (FFTs) are useful for dimensionality reduction because they prioritize the important parts of the wave structure, making it easy to throw out the less important parts.
...
perfect reconstruction requires infinite memory.

I don't want to seem rude, but this is quite incorrect.

The DFT is a fully invertible transformation: n samples in, n frequencies (coefficients) out, and you can always recover the original samples up to numerical precision. If the samples are real-valued, you can reduce this to n/2 frequencies due to conjugate symmetry. There is no "dimensionality reduction" going on here, nor is there any "prioritization of important parts". You do not need infinitely many coefficients.
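This invertibility is easy to verify with plain NumPy: `rfft` on n real samples gives n // 2 + 1 complex coefficients, and the round trip is exact to numerical precision.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)

# 1024 real samples -> 513 complex coefficients (conjugate symmetry).
X = np.fft.rfft(x)
x_rec = np.fft.irfft(X, n=1024)

print(X.shape)                # (513,)
print(np.allclose(x, x_rec))  # True
```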

Where you lose information is in discarding the phase of the DFT. If you only have magnitude information, then the DFT is not directly recoverable. The Griffin-Lim algorithm (and other phase retrieval methods) use multiple overlapping frames to infer the phase of each DFT coefficient according to what could plausibly produce the observed magnitude spectrogram.
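The loss from discarding phase can also be checked in plain NumPy: inverting the magnitude alone (i.e., with zero phase) does not recover the signal.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1024)
X = np.fft.rfft(x)

# Keep only the magnitude (zero phase) and invert: the signal is NOT
# recovered, because the discarded phase carried essential information.
x_bad = np.fft.irfft(np.abs(X), n=1024)
print(np.allclose(x, x_bad))  # False
```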

Mel spectrograms (and MFCCs) are, however, low-dimensional projections of linear spectra, so there is loss of information in that stage beyond what you lose by discarding phase.

@shamoons

@JoelStansbury / @bmcfee thank you both for commenting and increasing my understanding of how this all works

@JoelStansbury

But perfect (loss-less) reversibility requires an infinite number of coefficients. (Fast) Fourier transformations (FFTs) are useful for dimensionality reduction because they prioritize the important parts of the wave structure, making it easy to throw out the less important parts.
...
perfect reconstruction requires infinite memory.

I don't want to seem rude, but this is quite incorrect.

The DFT is a fully invertible transformation: n samples in, n frequencies (coefficients) out, and you can always recover the original samples up to numerical precision. If the samples are real-valued, you can reduce this to n/2 frequencies due to conjugate symmetry. There is no "dimensionality reduction" going on here, nor is there any "prioritization of important parts". You do not need infinitely many coefficients.

Where you lose information is in discarding the phase of the DFT. If you only have magnitude information, then the DFT is not directly recoverable. The Griffin-Lim algorithm (and other phase retrieval methods) use multiple overlapping frames to infer the phase of each DFT coefficient according to what could plausibly produce the observed magnitude spectrogram.

Mel spectrograms (and MFCCs) are, however, low-dimensional projections of linear spectra, so there is loss of information in that stage beyond what you lose by discarding phase.

No offense taken.
Thank you for the correction. I was mixing DFT with continuous FT.

In the context of generating a spectrogram from a series of DFTs, it seems it would be necessary to use a moving window instead of a single FFT across the entire waveform. My understanding is that this is a trade-off between frequency resolution and temporal resolution, i.e. as you make the window smaller, you lose some of the lower frequencies. Is this much correct?

@JoelStansbury

@JoelStansbury / @bmcfee thank you both for commenting and increasing my understanding of how this all works

Happy to help! Even if it was an unintentional example of Cunningham's Law
"the best way to get the right answer on the internet is not to ask a question; it's to post the wrong answer."

@bmcfee
Member

bmcfee commented Sep 14, 2020

My understanding is that this is a trade-off between frequency resolution and temporal resolution, i.e. as you make the window smaller, you lose some of the lower frequencies. Is this much correct?

I'll hedge and say "sort of" 😁

If you have a sampling rate sr and a window length N, then the DFT will have frequencies [0, sr/N, 2*sr/N, 3*sr/N, ..., sr/2] (assuming real-valued input). So the longer your window, the lower a minimum frequency you can measure: this should make sense, since low frequencies will take a long time to cycle, and you'll need more time to tell them apart. The upper frequency range is always fixed by the sampling rate to Nyquist (sr/2).
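The bin frequencies can be read off with `np.fft.rfftfreq`; a quick check of the spacing and Nyquist claims:

```python
import numpy as np

sr, N = 22050, 2048
freqs = np.fft.rfftfreq(N, d=1.0 / sr)

print(freqs[1])   # sr / N, about 10.77 Hz: the lowest nonzero bin
print(freqs[-1])  # sr / 2 = 11025.0 Hz: Nyquist, fixed by the sampling rate
```

Doubling the window length N halves the bin spacing, which is the "longer window, lower measurable frequency" effect described above.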

There isn't really a "trade-off" beyond that though: the frame rate of the spectrogram and the size of the window (N) are basically independent, though you'll need frames to overlap by at least a little if you hope to have any chance of phase recovery. You can have high time and frequency resolution by using a large frame length and a small hop length. (Some authors conflate these two, and fix the hop length to be a constant fraction of the frame length, which does induce a "trade-off", but this isn't really necessary.)

@JoelStansbury

You can have high time and frequency resolution by using a large frame length and a small hop length.

Gotcha. Thank you for the insight.
