Maximum window size is 3 milliseconds #85
Since commit fa5dcc8
It seems that
I propose one of the following solutions:
I think the latter solution is the way to go. However, it may be beneficial to store the max window size in the state struct for performance reasons.
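For illustration, a rough sketch of what storing it could look like (the field name, the access via `st->d`, and the choice of error code are my assumptions, not actual code):

```c
/* Sketch only; window_max and the surrounding details are assumptions. */
struct ebur128_state_internal {
    /* ... existing fields ... */
    unsigned long window_max; /* largest window (in ms) the audio buffer can hold */
};

int ebur128_loudness_window(ebur128_state* st, unsigned long window, double* out) {
    if (window > st->d->window_max) {
        return EBUR128_ERROR_INVALID_MODE; /* window exceeds the allocated buffer */
    }
    /* ... existing calculation over the requested window ... */
    return EBUR128_SUCCESS;
}
```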
Good catch! What is a good limit for the window size?
Let's say we have a 30-second window, 32 channels, and a sample rate of 384000 Hz.
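That worst case works out to roughly 2.9 GB for the window buffer alone:

```
30 s * 384000 Hz * 32 channels * 8 bytes (double) = 2,949,120,000 bytes ≈ 2.75 GiB
```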
I guess we could calculate all limits dynamically, but this would complicate the validation logic. I would like to avoid that if possible.
I haven't looked at the code (just glanced at the header file in question) so if I'm speaking out of my ass, my apologies for that.
1000 ms (1 second) should be enough, which would be 98,304,000 bytes total for 32 channels of 64-bit floats at 384000 Hz.
This is loudness scanning, not audio processing (like reverb or echo, etc.), so there is no need to "remember" more than 1 second at a time.
On a private project I calculate RMS on audio. There I calculate the RMS over each buffer (which could be 1 second or only 100 ms, for example), then sum that result with the previous ones; after the scan is over, the sum is divided by the number of buffers scanned. This means that during scanning, regardless of buffer size, only 8 bytes (a 64-bit float, aka double) are used per channel. It's hard to get more memory efficient than that.
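Roughly like this (a minimal sketch of that approach; the names are made up, it's not code from any particular library):

```c
#include <math.h>
#include <stddef.h>

typedef struct {
    double sum;     /* running sum of per-buffer RMS values */
    size_t buffers; /* number of buffers scanned so far */
} rms_accum;

/* RMS of one buffer of one channel: sqrt of the mean of squares. */
static double buffer_rms(const double *samples, size_t frames)
{
    double sq = 0.0;
    for (size_t i = 0; i < frames; ++i)
        sq += samples[i] * samples[i];
    return sqrt(sq / (double)frames);
}

/* Fold one buffer into the running state: 16 bytes per channel, total. */
static void rms_accum_add(rms_accum *acc, const double *samples, size_t frames)
{
    acc->sum += buffer_rms(samples, frames);
    acc->buffers++;
}

/* Final value after scanning: mean of the per-buffer RMS values. */
static double rms_accum_result(const rms_accum *acc)
{
    return acc->buffers ? acc->sum / (double)acc->buffers : 0.0;
}
```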
Sure, there is some precision loss once you start summing thousands or millions of buffers (hours or days of audio), but at that stage the RMS value (or, in the case of EBU R-128, the integrated loudness) should be fairly stable and within the margin of error anyway. The longest audio would be continuous double albums (1-2 hours) or long films or documentaries (3-5 hours), but those are rare; things tend to be split into parts at that point.
Another solution (though this is more lossy) is to resample to 48 kHz and then do the 4x true-peak oversampling on that. That would simplify the code and use less memory, but I have no clue whether that breaks the EBU R-128 spec or not.
In MLT (and therefore Shotcut), the windowed loudness measurement is an input into an Automatic Gain Control algorithm which uses that measurement to correct the audio loudness. The speed of the gain control is user configurable by setting the size of the window. 1 second is too short for most applications like this. I think that 30 seconds (which I offered previously) would be the smallest duration that I would want to offer as the "maximum window size".
I suppose I could implement this in MLT. The disadvantage is that it is less convenient and accurate than the current method. But this would be extra work with no benefit. The practical applications for this feature are: <= 6 channels, <= 48 kHz, <= 30 seconds.
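Even at the top of that range the buffer stays modest:

```
30 s * 48000 Hz * 6 channels * 8 bytes (double) = 69,120,000 bytes ≈ 69 MB
```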
My suggestion would be that we not worry too much about the worst-case scenarios (32 channels, 384000 Hz) because there aren't real use cases that require that worst-case configuration.
For example, a program I'm working on will have an RMS lib (that I've made myself), and I'll also use libebur128. The user can choose either (a choice between EBU R128 and Z-weighted RMS, basically). I'll be displaying a visualization of the values for the entire audio track's loudness as well as handling crossfades based on these values. It would make little sense to use the windowing in libebur128 and roll my own just for the RMS lib.
BTW! The following is pretty much off-topic (in regards to libebur128, mostly), so ignore it if not interested.
Why use AGC? If you do EBU R128 on the entire clip, you just need to adjust the gain for the entire clip. Sure, if the user splits a video into multiple clips for editing this is no longer correct, but then you just recalculate EBU R128 for each clip. Using multithreading (and assuming the user has a 6+ core CPU), this should be relatively quick. If you are relying on the loudness scanner lib to handle your AGC window, you might be attacking the problem the wrong way.
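To make the whole-clip idea concrete, here's a rough sketch using libebur128's integrated mode (the -23 LUFS target is the EBU R128 reference; the helper function is hypothetical):

```c
#include <ebur128.h>
#include <math.h>

/* Sketch: derive one fixed gain for a clip from its integrated loudness.
   Assumes st was created with EBUR128_MODE_I and fed the whole clip. */
static double clip_gain(ebur128_state *st)
{
    double lufs;
    ebur128_loudness_global(st, &lufs); /* integrated loudness of the clip, LUFS */
    double offset_db = -23.0 - lufs;    /* distance to the EBU R128 target */
    return pow(10.0, offset_db / 20.0); /* linear gain to apply to every sample */
}
```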
Sorry for rambling on; I just find it kind of ludicrous to use several gigs of memory if one happens to use many channels of audio, memory that would be better used (in this case) for, I don't know, caching video frames and images/overlays, etc. If using AGC gobbles up half a gig to a gig on someone's rig and they are a DIY home video creator, that could cause issues (the OS would dive into the page file, system slowdown, etc.). So for the long term, maybe find a more memory-efficient way to do this with Shotcut and MLT, perhaps(?).
You say "<= 6 channels, <= 48 kHz", but is this true if they import multiple audio tracks for layered sound? 6 channels seems "OK" as far as rendering out video, but for editing you might have 20 audio tracks, possibly with different bit depths and sample rates (they might load a 24-bit 96 kHz FLAC for background music, for example). Or do you run the AGC just on the final rendered output? (Which would be odd, as you'd probably want it on dialog, etc.)
Sorry for droning on; I'm sorta the guy at the back of the room who always goes "um, hang on a sec" and questions everything.
I would suggest that it is in scope. R128 never even specifies the 0.4s and 3s time constants. It is just a recommendation that references other technical documents. One of those documents is EBU Tech Doc 3341 which does specify the 0.4s and 3s time constants. That same document also says:
Another way to look at it: One could consider the time constant specific functions (ebur128_loudness_momentary & ebur128_loudness_shortterm) to be "extra code". The application could just call ebur128_loudness_window() with the appropriate time constants.
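If I read the header right, something like this should give matching values (error checking and the actual audio feeding omitted):

```c
#include <ebur128.h>
#include <stdio.h>

int main(void)
{
    ebur128_state *st = ebur128_init(2, 48000, EBUR128_MODE_M | EBUR128_MODE_S);
    /* ... feed audio with ebur128_add_frames_double(st, buf, frames) ... */
    double m, s, wm, ws;
    ebur128_loudness_momentary(st, &m);     /* fixed 400 ms window */
    ebur128_loudness_shortterm(st, &s);     /* fixed 3000 ms window */
    ebur128_loudness_window(st, 400, &wm);  /* same 400 ms via the generic call */
    ebur128_loudness_window(st, 3000, &ws); /* same 3000 ms via the generic call */
    printf("M: %f vs %f, S: %f vs %f\n", m, wm, s, ws);
    ebur128_destroy(&st);
    return 0;
}
```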
I find your application interesting, but I'm not sure I understand how it will work. Shortterm and momentary values are both windowed. They just use different time constants. Which do you plan to use in your application? Momentary (0.4s) or Shortterm (3s)? And you plan to take those windowed values and then feed them into another windowing function?
Your question presumes that the clips are pre-mastered and just need a fixed gain offset. But many clips are not pre-mastered. Imagine a video recording that a parent makes of their child's band concert. While the band is playing, the loudness may be fine. But during applause it could be too high, and while the conductor is announcing the next song, it could be too low. A fixed gain offset would not produce satisfactory results. The user could chop up the clip and apply a fixed gain adjustment to each piece, but that is not convenient. The AGC provides satisfactory results with very high convenience. But the user needs to be able to optimize the window duration for their specific situation to get the best results.