
[audio_to_spectrogram] audio_to_spectrum.py publishes wrong amplitude? #2761

Closed
pazeshun opened this issue Jan 20, 2023 · 6 comments · Fixed by #2767

@pazeshun (Contributor) commented Jan 20, 2023

This may be too late to fix even if my analysis is correct, so I am writing it down mainly for reference.

Currently, audio_to_spectrum.py calculates "amplitude" by applying abs and log to the FFT result:

amplitude = np.log(np.abs(amplitude))

However, I think this calculation cannot produce the "real" amplitude (one consistent with the amplitude of the original signal).
If you want to get the "real" amplitude, you have to divide the FFT result by self.audio_buffer.audio_buffer_len / 2 and take the abs:
https://helve-blog.com/posts/python/numpy-fast-fourier-transform/
https://ryo-iijima.com/fftresult/
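
For reference, here is a minimal standalone sketch of the calculation described above (the sampling rate, buffer length, and test tone are made-up example values, and this is not the node's actual code): dividing the FFT magnitude by N / 2 recovers the amplitude of the original signal, while the current log-of-magnitude value does not.

```python
import numpy as np

# Made-up example values; n plays the role of self.audio_buffer.audio_buffer_len.
fs = 16000                                    # sampling rate [Hz]
n = 1024                                      # FFT length
t = np.arange(n) / float(fs)
signal = 0.5 * np.sin(2 * np.pi * 500.0 * t)  # 500 Hz tone, amplitude 0.5 (exact FFT bin)

spectrum = np.fft.rfft(signal)

# Current style: log of the raw magnitude (not a physical amplitude).
log_magnitude = np.log(np.abs(spectrum) + 1e-12)

# "Real" amplitude: divide the magnitude by n / 2.
amplitude = np.abs(spectrum) / (n / 2.0)

print(amplitude.max())       # ~0.5, matching the amplitude of the input tone
print(log_magnitude.max())   # ~5.55, an arbitrary-looking number by comparison
```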

Unfortunately, if we fix this, the spectrogram image will change, and networks trained on the previous images will no longer work.

@iory (Member) commented Feb 26, 2023

Unfortunately, if we fix this, the spectrogram image will change, and networks trained on the previous images will no longer work.

Being able to select the correct calculation as an option would be a good direction.

@708yamaguchi (Member) commented Feb 27, 2023

I am sorry for the lack of explanation.

However, I think this calculation cannot produce the "real" amplitude.
If you want to get the "real" amplitude, you have to divide the FFT result by self.audio_buffer.audio_buffer_len / 2 and take the abs:

The word "amplitude" is not appropriate here. Sorry.

The reason for using log was so that the spectrogram would capture even quiet sounds.
When the spectrogram was made without log, quiet sounds could not be represented once the signal intensity was scaled to 0~255 across the whole image (a small illustration follows the quotes below).

In addition, I chose the log scale because I found opinions that it is more suitable for learning and closer to the way humans hear.
As far as I know, the log-scaled spectrogram is called a melspectrogram.

When training deep networks, it is often advantageous to use a logarithmic representation of the signal, because the logarithm acts like a dynamic-range compressor, boosting representation values that are small in magnitude (amplitude) but still carry important information. In this example, the log spectrogram performs better than the spectrogram.
https://jp.mathworks.com/help/signal/ug/spoken-digit-recognition-with-custom-log-spectrogram-layer-and-deep-learning.html

The mel scale is a scale based on human hearing, that is, on how sounds are perceived.
It was devised from the observation that human hearing is sensitive to low-frequency sounds and insensitive to high-frequency sounds.
https://fast-d.hmcom.co.jp/techblog/melspectrum-mfcc/
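
To illustrate the 0~255 point above, here is a rough standalone sketch with made-up intensity values (not the node's code):

```python
import numpy as np

# A loud bin, a quiet bin, and the noise floor sharing one 0-255 grayscale range.
intensity = np.array([1.0, 1e-3, 1e-6])

# Linear scaling: the quiet bin becomes indistinguishable from the noise floor.
linear = np.round(255 * intensity / intensity.max())
print(linear)                # [255.   0.   0.]

# Log scaling: the quiet bin stays clearly visible.
log_intensity = np.log(intensity)
scaled = 255 * (log_intensity - log_intensity.min()) / np.ptp(log_intensity)
print(np.round(scaled))      # [255. 128.   0.]
```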

@708yamaguchi (Member) commented Feb 27, 2023

Maybe the correct thing to do is the following (but it seems too late...):

  • Avoid using log in the spectrum calculation in audio_to_spectrum.py (follow pazeshun's calculation).
  • Create a new audio_to_melspectrogram.py that applies log to the intensity of the spectrum (this node would output the same image as our previous spectrogram).

@pazeshun (Contributor, Author) commented

@708yamaguchi I see, thank you for your explanation.
How about setting the following pipeline as the default? Is this OK from your point of view?

audio_to_spectrum.py -> spectrum
                     -> log_spectrum -> spectrum_to_spectrogram.py -> spectrogram -> recognition node

My understanding is that all of our recognition nodes use spectrogram, not spectrum. Is this correct?
Also, I don't use the name melspectrum because a mel spectrum seems to be calculated with a more complicated equation, according to https://fast-d.hmcom.co.jp/techblog/melspectrum-mfcc/. Is this correct? Suggestions for another name are also welcome.

@708yamaguchi (Member) commented

Thank you for your suggestion.
I think it's OK.

My understanding is that all of our recognition nodes use spectrogram, not spectrum. Is this correct?

I have heard of JSK programmers watching the spectrum topic to check the properties of sounds, but I have never heard of an example of feeding the spectrum into a recognition node.
So changing the topic name from spectrum to log_spectrum is not a big problem.

Also, I don't use the name melspectrum because a mel spectrum seems to be calculated with a more complicated equation, according to fast-d.hmcom.co.jp/techblog/melspectrum-mfcc. Is this correct? Suggestions for another name are also welcome.

I think this is correct. The log scale and the mel scale are similar, but they are different; see fast-d.hmcom.co.jp/techblog/melspectrum-mfcc:

$$mel = 2595.0 \log_{10} \left( 1.0 + \frac{f}{700.0} \right)$$
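
For reference, the formula above as a small sketch (the function name is just for illustration), showing that the mel scale is not the same as simply taking the log of the spectrum magnitude:

```python
import numpy as np

def hz_to_mel(f_hz):
    """Mel value of a frequency in Hz, per the formula above."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

print(hz_to_mel(440.0))    # ~549.6 mel
print(hz_to_mel(8000.0))   # ~2840.0 mel
```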

@pazeshun (Contributor, Author) commented

Thank you, I'll create a PR introducing the new pipeline.
One note:

So changing the topic name from spectrum to log_spectrum is not a big problem.

I'll make the node publish both spectrum and log_spectrum.
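
Roughly what I have in mind, as a hypothetical sketch (the topic names, message fields, and helper function here are my own assumptions, not the actual PR):

```python
#!/usr/bin/env python

import numpy as np
import rospy
from jsk_recognition_msgs.msg import Spectrum  # message type assumed here

def publish_spectra(pub_spectrum, pub_log_spectrum, stamp, freqs, fft, buffer_len):
    # Normalized ("real") amplitude, as discussed above.
    amplitude = np.abs(fft) / (buffer_len / 2.0)

    msg = Spectrum()
    msg.header.stamp = stamp
    msg.frequency = freqs
    msg.amplitude = amplitude
    pub_spectrum.publish(msg)

    # Log-scaled version kept for the existing spectrogram pipeline.
    log_msg = Spectrum()
    log_msg.header.stamp = stamp
    log_msg.frequency = freqs
    log_msg.amplitude = np.log(amplitude + 1e-12)
    pub_log_spectrum.publish(log_msg)

if __name__ == '__main__':
    rospy.init_node('audio_to_spectrum')
    pub_spectrum = rospy.Publisher('~spectrum', Spectrum, queue_size=1)
    pub_log_spectrum = rospy.Publisher('~log_spectrum', Spectrum, queue_size=1)
```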
