MFCC to Wav file #660
Comments
Have you checked #424 in our repo, or this post on the web: http://amyang.xyz/post/Inverse%20MFCC%20to%20WAV ?
Hello, thanks for your reply. I tried the link (http://amyang.xyz/post/Inverse%20MFCC%20to%20WAV) but the quality of the audio is badly degraded: the speaker's voice comes out hoarse. I'm looking for a better method. I had thought about training an artificial neural network to do this, but that seemed excessive. Are the lower values fixed, or should they be changed? For my intended application I need a good reconstruction.
The short answer here is that you're not going to get a good reconstruction from MFCCs for two reasons:
Thanks for your answer. If I train an artificial neural network with MFCCs as the input X and the corresponding waveform (the WAV audio) as the output Y, I think it should be possible to improve the quality of the reconstructed audio. Would that really work?
For training a neural net to decode some spectrogram-like representation to audio, look at Ryuichi's work here: https://github.com/r9y9/wavenet_vocoder/
If you already have a spectrogram, then do you need to train a NN to recover back to audio? Isn’t the process invertible with something like Griffin-Lim?
@shamoons Spectrograms are reversible as they are just a bunch of Fourier transformations. But perfect (lossless) reversibility requires an infinite number of coefficients. (Fast) Fourier transformations (FFTs) are useful for dimensionality reduction because they prioritize the important parts of the wave structure, making it easy to throw out the less important parts. It seems to me that MelSpectrograms (MFCCs) take it to another level in that their goal is to throw out everything that is not important to human speech, although I've never looked into how they are made. I don't know the math behind MFCCs in particular, but for the usual spectrogram, perfect reconstruction requires infinite memory. I'm just guessing that MFCCs operate on a similar principle.
I don't want to seem rude, but this is quite incorrect. The DFT is a fully invertible transformation: n samples in, n frequencies (coefficients) out, and you can always recover the original samples up to numerical precision. If the samples are real-valued, you can reduce this to about n/2 frequencies (n/2 + 1 bins for even n, counting DC and Nyquist) due to conjugate symmetry. There is no "dimensionality reduction" going on here, nor is there any "prioritization of important parts". You do not need infinitely many coefficients. Where you lose information is in discarding the phase of the DFT. If you only have magnitude information, then the DFT is not directly recoverable. The Griffin-Lim algorithm (like other phase retrieval methods) uses multiple overlapping frames to infer the phase of each DFT coefficient according to what could plausibly produce the observed magnitude spectrogram. Mel spectrograms (and MFCCs) are, however, low-dimensional projections of linear spectra, so there is loss of information in that stage beyond what you lose by discarding phase.
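To make the invertibility point in the comment above concrete, here is a minimal sketch using a naive pure-Python DFT. The helper names `dft` and `idft` and the test signal are made up for illustration (this is not how librosa computes its transforms). It shows that the full complex DFT round-trips to floating-point precision, while keeping only the magnitudes does not recover the signal.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform (illustrative helper)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(X):
    """Inverse DFT: recovers n samples from n complex coefficients."""
    n = len(X)
    return [
        sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


# A short real-valued signal; the values are arbitrary.
signal = [0.0, 1.0, 0.5, -0.25, -1.0, 0.75, 0.1, -0.6]

# n samples in, n complex coefficients out.
coeffs = dft(signal)

# With magnitude and phase intact, the round trip is exact up to
# floating-point error.
recovered = [v.real for v in idft(coeffs)]
roundtrip_error = max(abs(a - b) for a, b in zip(signal, recovered))

# Keep only the magnitudes (as a magnitude spectrogram does) and invert
# with all phases set to zero: the result no longer matches the signal.
magnitude_only = [abs(c) for c in coeffs]
zero_phase = [v.real for v in idft(magnitude_only)]
phase_loss_error = max(abs(a - b) for a, b in zip(signal, zero_phase))

print(f"round-trip error:       {roundtrip_error:.2e}")
print(f"error after phase loss: {phase_loss_error:.2e}")
```

Setting the phase to zero is only the crudest possible guess. Phase retrieval methods such as Griffin-Lim do better by exploiting the redundancy between overlapping frames, which a single DFT frame like this one does not have.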
@JoelStansbury / @bmcfee thank you both for commenting and increasing my understanding of how this all works.
No offense taken. In the context of generating a spectrogram from a series of DFTs, it seems like it would be necessary to have a moving window instead of a single FFT across the entire waveform. My understanding is that this is a trade-off between frequency resolution and temporal resolution, i.e. as you make the window smaller, you lose some of the lower frequencies. Is this much correct?
Happy to help! Even if it was an unintentional example of Cunningham's Law.
I'll hedge and say "sort of" 😁 If you have a sampling rate [...] There isn't really a "trade-off" beyond that though: the frame rate of the spectrogram and the size of the window ([...]
Gotcha. Thank you for the insight.
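The window-size question in this exchange can be made concrete with a little arithmetic. The sketch below assumes the usual STFT parameterisation and borrows the names `sr`, `n_fft`, and `hop_length` from librosa's conventions (the truncated reply above does not show which symbols it used). The helper functions are hypothetical. The spacing between DFT bins, which is also the lowest non-zero frequency a frame can resolve, depends only on the window length, while the frame rate depends only on the hop.

```python
def bin_spacing_hz(sr, n_fft):
    """Spacing between adjacent DFT bins, in Hz.

    This is also the lowest non-zero frequency a window of n_fft samples
    can represent.
    """
    return sr / n_fft


def frame_rate_hz(sr, hop_length):
    """Number of spectrogram frames per second."""
    return sr / hop_length


sr = 22050        # assumed sampling rate in Hz
hop_length = 512  # assumed hop between successive frames, in samples

for n_fft in (512, 2048, 8192):
    print(
        f"n_fft={n_fft:5d}: bin spacing {bin_spacing_hz(sr, n_fft):7.2f} Hz, "
        f"frame rate {frame_rate_hz(sr, hop_length):.2f} frames/s"
    )
```

Shrinking the window coarsens the frequency grid, which is the sense in which low frequencies become harder to resolve, but the frame rate stays the same as long as the hop is left alone.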
Hello, could anyone give an example of MFCC to WAV with librosa?
I've tried several algorithms, but the reconstructed audio is pretty bad.
An example of what I need: https://www.research.ibm.com/haifa/projects/multimedia/recovc/demo/index.html
Thanks in advance for your help :)
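For readers looking for a starting point, here is a minimal sketch of an MFCC round trip. It assumes librosa 0.7 or later, where `librosa.feature.inverse.mfcc_to_audio` is available. The wrapper `mfcc_roundtrip` and its default values are hypothetical choices for illustration, not a recommended configuration.

```python
def mfcc_roundtrip(y, sr, n_mfcc=20, n_mels=128):
    """Encode a mono signal as MFCCs, then approximately invert them.

    Hypothetical helper for illustration; it is not part of librosa.
    Returns the MFCC matrix and the reconstructed waveform.
    """
    # Imported lazily so the sketch can be loaded without librosa installed.
    import librosa

    # Forward transform: mel power spectrogram -> dB -> DCT, truncated
    # to the first n_mfcc coefficients.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_mels=n_mels)

    # Approximate inverse: inverse DCT back to a mel spectrogram, an
    # approximate mapping back to a linear-frequency spectrogram, then
    # Griffin-Lim phase estimation. n_mels must match the forward
    # transform; length trims the output to the original duration.
    y_hat = librosa.feature.inverse.mfcc_to_audio(
        mfcc, n_mels=n_mels, sr=sr, length=len(y)
    )
    return mfcc, y_hat
```

With a signal loaded by `y, sr = librosa.load(path)`, calling `mfcc, y_hat = mfcc_roundtrip(y, sr)` gives a waveform that can be written out with `soundfile.write("reconstructed.wav", y_hat, sr)`. Raising `n_mfcc` toward `n_mels` reduces the loss from truncating the DCT, but the mel projection and the discarded phase still limit quality, so audible artifacts of the kind described in this thread are to be expected.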