In lines 121-124 of audio.py, the mel spectrogram has a log taken and is then sort of normalised, but not around zero. I'm familiar with the practice of normalising data around zero in neural nets to maximise gradient, but I'm unsure why here it is being normalised around a quarter of the maximum value. Just wondering if there is a specific reason, or just that experimentally that's what worked.
The main reason for L122 was that during training we used `librosa.power_to_db`, which does this with the default argument `top_db=80.0`, and we wanted to replicate this without having to depend on librosa.

L123 was to put the numbers roughly into [-1, 1], but as you noticed it doesn't strictly put them within [-1, 1]. In practice, this shouldn't matter much as long as we do the same between training and inference.
whisper/whisper/audio.py
Lines 122 to 123 in 4e635c6
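For readers without the file open, here is a minimal sketch of the two steps described above, written in plain Python rather than the tensor code in audio.py. The function names, the `1e-10` floor and the sample values are assumptions for illustration only. The `8.0` is the log10 equivalent of `top_db=80.0` (since dB = 10 * log10 of power), and the `(x + 4) / 4` rescaling is my reading of L123, not a quote of the file.

```python
import math


def log_compress(power, floor=1e-10, dynamic_range=8.0):
    """Take log10 of power values, then clamp to a fixed range below the peak.

    `floor` and the function name are hypothetical; `dynamic_range=8.0`
    mirrors top_db=80.0 expressed in log10 units (80 dB / 10).
    """
    log_spec = [math.log10(max(p, floor)) for p in power]
    peak = max(log_spec)
    # Anything more than `dynamic_range` below the peak is raised to that level.
    return [max(v, peak - dynamic_range) for v in log_spec]


def rescale(log_spec, offset=4.0, scale=4.0):
    """Shift and scale so typical values land roughly in [-1, 1] (assumed constants)."""
    return [(v + offset) / scale for v in log_spec]


# Made-up power values spanning many orders of magnitude.
power = [1e-12, 1e-6, 1e-2, 1.0, 100.0]
clamped = log_compress(power)
scaled = rescale(clamped)
print(clamped)
print(scaled)
```

After the clamp, all values lie within 8.0 of the peak, so the rescaled output always spans an interval of width at most 2. That interval is exactly [-1, 1] only when the peak log10 value is 0; a louder peak, as in the sample above, shifts it upward.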
As for why that became the default argument for `librosa.power_to_db`, maybe @bmcfee (hi!!) can answer?