
MFCC to Wav file #660

Closed
Edresson opened this issue Jan 25, 2018 · 13 comments
Labels
question Issues asking for help doing something

Comments

@Edresson

Edresson commented Jan 25, 2018

Hello, could anyone give an example of MFCC to WAV with librosa?

I've tried several algorithms, but the reconstruction quality is quite poor.

An example of what I need: https://www.research.ibm.com/haifa/projects/multimedia/recovc/demo/index.html

Thanks in advance for your help :)

@Edresson Edresson changed the title MFCC to Wave MFCC to Wav file Jan 25, 2018
@rafaelvalle

rafaelvalle commented Jan 25, 2018

Have you checked this #424 in our repo, or this http://amyang.xyz/post/Inverse%20MFCC%20to%20WAV on the web?
You can also train an MFCC-to-audio decoder, e.g. a WaveNet, but that will probably be overkill.

@Edresson
Author

Edresson commented Jan 26, 2018

Hello, thanks for your reply. I had tried the link: http://amyang.xyz/post/Inverse%20MFCC%20to%20WAV

but the audio quality is badly compromised. I'm looking for a more effective method. I had already thought about training an artificial neural network to do this work, but that seemed excessive...

Are the values below fixed, or should they be changed?
n_mel = 128
n_fft = 2048

The person's voice becomes hoarse in the reconstruction; for the application I intend, I need a good reconstruction.

@bmcfee
Member

bmcfee commented Jan 26, 2018

The short answer here is that you're not going to get a good reconstruction from mfccs for two reasons:

  1. MFCCs discard a lot of information by a low-rank linear projection of the mel spectrum. An MFCC representation with n_mel=128 and n_mfcc=40 is analogous to a jpeg image with quality set to 30%.
  2. You lose phase information (though there are ways to estimate it, e.g., Griffin-Lim).

@Edresson
Author

Thanks for your answer. If I train an artificial neural network with input X containing the MFCCs and output Y containing the wav file, I think it is possible to improve the quality of the reconstructed audio. Will it really work?

@rafaelvalle

rafaelvalle commented Jan 27, 2018

For training a neural net to decode a spectrogram-like representation to audio, look at Ryuichi's work here: https://github.com/r9y9/wavenet_vocoder/.

@bmcfee bmcfee added the question Issues asking for help doing something label Feb 3, 2018
@bmcfee bmcfee closed this as completed Feb 12, 2018
@shamoons

shamoons commented Sep 1, 2020

For training a neural net to decode a spectrogram-like representation to audio, look at Ryuichi's work here: https://github.com/r9y9/wavenet_vocoder/.

If you already have a spectrogram, then do you need to train a NN to recover back to audio? Isn’t the process invertible with something like Griffin-Lim?

@JoelStansbury

JoelStansbury commented Sep 14, 2020

Note: Mistakes were made here, see discussion below

@shamoons Spectrograms are reversible as they are just a bunch of Fourier transformations. But perfect (loss-less) reversibility requires an infinite number of coefficients. (Fast) Fourier transformations (FFTs) are useful for dimensionality reduction because they prioritize the important parts of the wave structure, making it easy to throw out the less important parts. It seems to me that MelSpectrograms (MFCCs) take it to another level in that their goal is to throw out everything that is not important to human speech, although I've never looked into how they are made.

I don't know the math behind MFCCs in particular, but for the usual spectrogram, perfect reconstruction requires infinite memory. I'm just guessing that MFCCs operate on a similar principle.

@bmcfee
Member

bmcfee commented Sep 14, 2020

But perfect (loss-less) reversibility requires an infinite number of coefficients. (Fast) Fourier transformations (FFTs) are useful for dimensionality reduction because they prioritize the important parts of the wave structure, making it easy to throw out the less important parts.
...
perfect reconstruction requires infinite memory.

I don't want to seem rude, but this is quite incorrect.

The DFT is a fully invertible transformation: n samples in, n frequencies (coefficients) out, and you can always recover the original samples up to numerical precision. If the samples are real-valued, you can reduce this to n/2 frequencies due to conjugate symmetry. There is no "dimensionality reduction" going on here, nor is there any "prioritization of important parts". You do not need infinitely many coefficients.
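This invertibility is easy to verify with plain NumPy: `rfft` on n real samples gives n // 2 + 1 complex coefficients, and the round trip is exact to numerical precision.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)

# 1024 real samples -> 513 complex coefficients (conjugate symmetry).
X = np.fft.rfft(x)
x_rec = np.fft.irfft(X, n=1024)

print(X.shape)                # (513,)
print(np.allclose(x, x_rec))  # True
```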

Where you lose information is in discarding the phase of the DFT. If you only have magnitude information, then the DFT is not directly recoverable. The Griffin-Lim algorithm (and other phase retrieval methods) use multiple overlapping frames to infer the phase of each DFT coefficient according to what could plausibly produce the observed magnitude spectrogram.
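The loss from discarding phase can also be checked in plain NumPy: inverting the magnitude alone (i.e., with zero phase) does not recover the signal.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1024)
X = np.fft.rfft(x)

# Keep only the magnitude (zero phase) and invert: the signal is NOT
# recovered, because the discarded phase carried essential information.
x_bad = np.fft.irfft(np.abs(X), n=1024)
print(np.allclose(x, x_bad))  # False
```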

Mel spectrograms (and MFCCs) are, however, low-dimensional projections of linear spectra, so there is loss of information in that stage beyond what you lose by discarding phase.

@shamoons

@JoelStansbury / @bmcfee thank you both for commenting and increasing my understanding of how this all works

@JoelStansbury

But perfect (loss-less) reversibility requires an infinite number of coefficients. (Fast) Fourier transformations (FFTs) are useful for dimensionality reduction because they prioritize the important parts of the wave structure, making it easy to throw out the less important parts.
...
perfect reconstruction requires infinite memory.

I don't want to seem rude, but this is quite incorrect.

The DFT is a fully invertible transformation: n samples in, n frequencies (coefficients) out, and you can always recover the original samples up to numerical precision. If the samples are real-valued, you can reduce this to n/2 frequencies due to conjugate symmetry. There is no "dimensionality reduction" going on here, nor is there any "prioritization of important parts". You do not need infinitely many coefficients.

Where you lose information is in discarding the phase of the DFT. If you only have magnitude information, then the DFT is not directly recoverable. The Griffin-Lim algorithm (and other phase retrieval methods) use multiple overlapping frames to infer the phase of each DFT coefficient according to what could plausibly produce the observed magnitude spectrogram.

Mel spectrograms (and MFCCs) are, however, low-dimensional projections of linear spectra, so there is loss of information in that stage beyond what you lose by discarding phase.

No offense taken.
Thank you for the correction. I was mixing DFT with continuous FT.

In the context of generating a spectrogram from a series of DFTs, it seems it would be necessary to use a moving window instead of a single FFT across the entire waveform. My understanding is that this is a trade-off between frequency resolution and temporal resolution, i.e. as you make the window smaller, you lose some of the lower frequencies. Is this much correct?

@JoelStansbury

@JoelStansbury / @bmcfee thank you both for commenting and increasing my understanding of how this all works

Happy to help! Even if it was an unintentional example of Cunningham's Law
"the best way to get the right answer on the internet is not to ask a question; it's to post the wrong answer."

@bmcfee
Member

bmcfee commented Sep 14, 2020

My understanding is that this is a trade-off between frequency resolution and temporal resolution, i.e. as you make the window smaller, you lose some of the lower frequencies. Is this much correct?

I'll hedge and say "sort of" 😁

If you have a sampling rate sr and a window length N, then the DFT will have frequencies [0, sr/N, 2*sr/N, 3*sr/N, ..., sr/2] (assuming real-valued input). So the longer your window, the lower a minimum frequency you can measure: this should make sense, since low frequencies will take a long time to cycle, and you'll need more time to tell them apart. The upper frequency range is always fixed by the sampling rate to Nyquist (sr/2).
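The bin frequencies can be read off with `np.fft.rfftfreq`; a quick check of the spacing and Nyquist claims:

```python
import numpy as np

sr, N = 22050, 2048
freqs = np.fft.rfftfreq(N, d=1.0 / sr)

print(freqs[1])   # sr / N, about 10.77 Hz: the lowest nonzero bin
print(freqs[-1])  # sr / 2 = 11025.0 Hz: Nyquist, fixed by the sampling rate
```

Doubling the window length N halves the bin spacing, which is the "longer window, lower measurable frequency" effect described above.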

There isn't really a "trade-off" beyond that though: the frame rate of the spectrogram and the size of the window (N) are basically independent, though you'll need frames to overlap by at least a little if you hope to have any chance of phase recovery. You can have high time and frequency resolution by using a large frame length and a small hop length. (Some authors conflate these two, and fix the hop length to be a constant fraction of the frame length, which does induce a "trade-off", but this isn't really necessary.)

@JoelStansbury

You can have high time and frequency resolution by using a large frame length and a small hop length.

Gotcha. Thank you for the insight.
