
[audio_to_spectrogram] audio_to_spectrum.py publishes wrong amplitude? #2761

Closed
pazeshun opened this issue Jan 20, 2023 · 6 comments · Fixed by #2767

@pazeshun (Contributor) commented Jan 20, 2023

This may be too late to fix even if my analysis is correct, so I am writing it down mainly for reference.

Currently, audio_to_spectrum.py calculates "amplitude" by applying abs and log to the FFT result:

amplitude = np.log(np.abs(amplitude))

However, I think this calculation cannot produce the "real" amplitude (one consistent with the amplitude of the original signal).
If you want to get the "real" amplitude, you have to divide the FFT result by self.audio_buffer.audio_buffer_len / 2 and take the abs:
https://helve-blog.com/posts/python/numpy-fast-fourier-transform/
https://ryo-iijima.com/fftresult/
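
For reference, here is a minimal standalone sketch of the calculation described above (the sampling rate, buffer length, and test tone are made-up example values, and this is not the node's actual code): dividing the FFT magnitude by N / 2 recovers the amplitude of the original signal, while the current log-of-magnitude value does not.

```python
import numpy as np

# Made-up example values; n plays the role of self.audio_buffer.audio_buffer_len.
fs = 16000                                    # sampling rate [Hz]
n = 1024                                      # FFT length
t = np.arange(n) / float(fs)
signal = 0.5 * np.sin(2 * np.pi * 500.0 * t)  # 500 Hz tone, amplitude 0.5 (exact FFT bin)

spectrum = np.fft.rfft(signal)

# Current style: log of the raw magnitude (not a physical amplitude).
log_magnitude = np.log(np.abs(spectrum) + 1e-12)

# "Real" amplitude: divide the magnitude by n / 2.
amplitude = np.abs(spectrum) / (n / 2.0)

print(amplitude.max())       # ~0.5, matching the amplitude of the input tone
print(log_magnitude.max())   # ~5.55, an arbitrary-looking number by comparison
```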

Unfortunately, if we fix this, the spectrogram image will change, and networks trained on the previous images will no longer work.

@iory (Member) commented Feb 26, 2023

Unfortunately, if we fix this, the spectrogram image will change, and networks trained on the previous images will no longer work.

Being able to select the correct calculation as an option would be a good direction.

@708yamaguchi (Member) commented Feb 27, 2023

I am sorry for the lack of explanation.

However, I think this calculation cannot produce the "real" amplitude.
If you want to get the "real" amplitude, you have to divide the FFT result by self.audio_buffer.audio_buffer_len / 2 and take the abs:

The word "amplitude" is not appropriate here. Sorry.

The reason for using log was so that the spectrogram would capture even quiet sounds.
When the spectrogram was made without log, quiet sounds could not be represented once the signal intensity was scaled to 0~255 across the whole image (a small illustration follows the quotes below).

In addition, I chose the log scale because I found opinions that it is more suitable for learning and closer to the way humans hear.
As far as I know, the log-scaled spectrogram is called a melspectrogram.

When training deep networks, it is often advantageous to use a logarithmic representation of the signal, because the logarithm acts like a dynamic-range compressor, boosting representation values that are small in magnitude (amplitude) but still carry important information. In this example, the log spectrogram performs better than the spectrogram.
https://jp.mathworks.com/help/signal/ug/spoken-digit-recognition-with-custom-log-spectrogram-layer-and-deep-learning.html

The mel scale is a scale based on human hearing, that is, on how sounds are perceived.
It was devised from the observation that human hearing is sensitive to low-frequency sounds and insensitive to high-frequency sounds.
https://fast-d.hmcom.co.jp/techblog/melspectrum-mfcc/
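
To illustrate the 0~255 point above, here is a rough standalone sketch with made-up intensity values (not the node's code):

```python
import numpy as np

# A loud bin, a quiet bin, and the noise floor sharing one 0-255 grayscale range.
intensity = np.array([1.0, 1e-3, 1e-6])

# Linear scaling: the quiet bin becomes indistinguishable from the noise floor.
linear = np.round(255 * intensity / intensity.max())
print(linear)                # [255.   0.   0.]

# Log scaling: the quiet bin stays clearly visible.
log_intensity = np.log(intensity)
scaled = 255 * (log_intensity - log_intensity.min()) / np.ptp(log_intensity)
print(np.round(scaled))      # [255. 128.   0.]
```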

@708yamaguchi (Member) commented Feb 27, 2023

Maybe the correct thing to do is the following (but it seems too late...):

  • Avoid using log in the spectrum calculation in audio_to_spectrum.py (follow pazeshun's calculation).
  • Create a new audio_to_melspectrogram.py that applies log to the intensity of the spectrum (this node would output the same image as our previous spectrogram).

@pazeshun (Contributor, Author) commented

@708yamaguchi I see, thank you for your explanation.
How about setting the following pipeline as the default? Is this OK from your point of view?

audio_to_spectrum.py -> spectrum
                     -> log_spectrum -> spectrum_to_spectrogram.py -> spectrogram -> recognition node

My understanding is that all of our recognition nodes use spectrogram, not spectrum. Is this correct?
Also, I don't use the name melspectrum because a mel spectrum seems to be calculated with a more complicated equation, according to https://fast-d.hmcom.co.jp/techblog/melspectrum-mfcc/. Is this correct? Suggestions for another name are also welcome.

@708yamaguchi (Member) commented

Thank you for your suggestion.
I think it's OK.

My understanding is that all of our recognition nodes use spectrogram, not spectrum. Is this correct?

I have heard of JSK programmers watching the spectrum topic to check the properties of sounds, but I have never heard of an example of feeding the spectrum into a recognition node.
So changing the topic name from spectrum to log_spectrum is not a big problem.

Also, I don't use the name melspectrum because a mel spectrum seems to be calculated with a more complicated equation, according to fast-d.hmcom.co.jp/techblog/melspectrum-mfcc. Is this correct? Suggestions for another name are also welcome.

I think this is correct. The log scale and the mel scale are similar, but they are different; see fast-d.hmcom.co.jp/techblog/melspectrum-mfcc:

$$mel = 2595.0 \log_{10} \left( 1.0 + \frac{f}{700.0} \right)$$
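
For reference, the formula above as a small sketch (the function name is just for illustration), showing that the mel scale is not the same as simply taking the log of the spectrum magnitude:

```python
import numpy as np

def hz_to_mel(f_hz):
    """Mel value of a frequency in Hz, per the formula above."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

print(hz_to_mel(440.0))    # ~549.6 mel
print(hz_to_mel(8000.0))   # ~2840.0 mel
```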

@pazeshun (Contributor, Author) commented

Thank you, I'll create a PR introducing the new pipeline.
One note:

So changing the topic name from spectrum to log_spectrum is not a big problem.

I'll make the node publish both spectrum and log_spectrum.
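
Roughly what I have in mind, as a hypothetical sketch (the topic names, message fields, and helper function here are my own assumptions, not the actual PR):

```python
#!/usr/bin/env python

import numpy as np
import rospy
from jsk_recognition_msgs.msg import Spectrum  # message type assumed here

def publish_spectra(pub_spectrum, pub_log_spectrum, stamp, freqs, fft, buffer_len):
    # Normalized ("real") amplitude, as discussed above.
    amplitude = np.abs(fft) / (buffer_len / 2.0)

    msg = Spectrum()
    msg.header.stamp = stamp
    msg.frequency = freqs
    msg.amplitude = amplitude
    pub_spectrum.publish(msg)

    # Log-scaled version kept for the existing spectrogram pipeline.
    log_msg = Spectrum()
    log_msg.header.stamp = stamp
    log_msg.frequency = freqs
    log_msg.amplitude = np.log(amplitude + 1e-12)
    pub_log_spectrum.publish(log_msg)

if __name__ == '__main__':
    rospy.init_node('audio_to_spectrum')
    pub_spectrum = rospy.Publisher('~spectrum', Spectrum, queue_size=1)
    pub_log_spectrum = rospy.Publisher('~log_spectrum', Spectrum, queue_size=1)
```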
