Query about Mel spectrogram normalisation #902

Neuhaus24 · 2023-01-26T11:29:54Z

In lines 121-124 of audio.py, the log of the mel spectrogram is taken and the result is then normalised, but not around zero. I'm familiar with the practice of normalising data around zero in neural nets to maximise the gradient, but I'm unsure why here it is normalised around a quarter of the maximum value. Is there a specific reason, or is it just what worked experimentally?

jongwook (Maintainer, marked as answer) · 2023-01-27T07:50:38Z

The main reason for L122 was that during training we used librosa.power_to_db, which does this with the default argument top_db=80.0, and we wanted to replicate this without having to depend on librosa.

L123 was to put the numbers roughly into [-1, 1], but as you noticed, it doesn't strictly put them within [-1, 1]. In practice, this shouldn't matter much as long as we do the same in training and inference.
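The "roughly [-1, 1]" point can be checked with a small sketch. It uses plain Python lists instead of torch, and the helper name `normalize_log_spec` and the sample values are assumptions made for illustration, not part of whisper. Because the clamp leaves every value within 8 units of the peak, the rescaled values fall in [(peak - 4) / 4, (peak + 4) / 4], which is exactly [-1, 1] only when the peak of the log spectrogram is 0.

```python
def normalize_log_spec(log_spec):
    """Apply the clamp and rescale from audio.py to a flat list (hypothetical helper)."""
    peak = max(log_spec)
    # Clamp: nothing may sit more than 8 log10 units below the peak.
    clamped = [max(v, peak - 8.0) for v in log_spec]
    # Shift and scale: values end up in [(peak - 4) / 4, (peak + 4) / 4].
    return [(v + 4.0) / 4.0 for v in clamped]


# Made-up log10-power values with three different peaks.
for peak in (0.0, 2.0, -3.0):
    values = [peak - 12.0, peak - 5.0, peak]
    out = normalize_log_spec(values)
    print(f"peak={peak:+.1f}: min={min(out):+.2f}, max={max(out):+.2f}")
```

The output interval always has width at most 2, but its position moves with the peak of each input, which is why consistency between training and inference matters more than the exact bounds.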

whisper/whisper/audio.py

Lines 122 to 123 in 4e635c6

    log_spec = torch.maximum(log_spec, log_spec.max() - 8.0)
    log_spec = (log_spec + 4.0) / 4.0
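To make the link between the constant 8.0 and top_db=80.0 concrete, here is a minimal sketch. It assumes that log_spec holds log10 of the mel power values floored at 1e-10 (that line is not quoted in this thread), so 8 log10 units correspond to 80 dB. The clipping of librosa.power_to_db is re-implemented by hand in plain Python rather than by calling librosa; `power_to_db_sketch` and `whisper_style_clamp` are hypothetical names and the input values are made up.

```python
import math

AMIN = 1e-10  # floor applied before the logarithm, to avoid log(0)


def power_to_db_sketch(power, top_db=80.0):
    """Hand-written stand-in for librosa.power_to_db with ref=1.0 (not the library code)."""
    db = [10.0 * math.log10(max(p, AMIN)) for p in power]
    # Clip everything that lies more than top_db below the peak.
    floor = max(db) - top_db
    return [max(v, floor) for v in db]


def whisper_style_clamp(power):
    """log10 followed by the clamp from L122, applied to a flat list."""
    log_spec = [math.log10(max(p, AMIN)) for p in power]
    floor = max(log_spec) - 8.0
    return [max(v, floor) for v in log_spec]


power = [1e-12, 3e-7, 0.02, 5.0]  # made-up mel power values
db = power_to_db_sketch(power)
log10_clamped = whisper_style_clamp(power)

# The two results agree up to the factor of 10 between dB and log10 units.
for d, l in zip(db, log10_clamped):
    assert math.isclose(d, 10.0 * l, rel_tol=1e-9, abs_tol=1e-9)
```

Under these assumptions the clamp in L122 is the same operation as top_db=80.0, expressed in log10 units instead of decibels.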

As for why that became the default argument for librosa.power_to_db, maybe @bmcfee (hi!!) can answer?

1 reply

bmcfee Jan 27, 2023

> As for why that became the default argument for librosa.power_to_db, maybe @bmcfee (hi!!) can answer?

👋

This was carried over for compatibility with a reference implementation in MATLAB. Details are a bit hazy 10 years later, but IIRC it was the mel spectrogram implementation extracted from Dan Ellis's beat tracker code. There were several slightly different implementations kicking around, even within our reference code base, and we settled on that one, I suppose, because it's numerically well-behaved (bounded) and because it makes some intuitive sense to limit attention to something near human perceptual limits.
