WASAPI backend doesn't obey latency parameter #324

Open
ligfx opened this Issue Jun 5, 2017 · 58 comments

@ligfx
Contributor

ligfx commented Jun 5, 2017

I believe this is a limitation of WASAPI shared mode: no matter the requested buffer size, callback events still come in at the default device period.

For example, on my computer (Windows 10, Cirrus Logic CS4208 speakers set to 24-bit 48000Hz), the default device period is 10 milliseconds / 480 frames. In test_audio, which requests a latency of 4096 frames, the first callback gets to fill the entire buffer (4096 frames), but each additional callback gets a number of frames equivalent to the rate-adjusted device period.

One strategy that seems to work is to ignore WASAPI refill events (e.g. don't call IAudioRenderClient::GetBuffer) until the desired number of frames is available to be written. Of course, when the buffer size is equal to the desired number of frames, this causes audio glitches because the audio engine runs out of data. The default buffer size seems to be twice the period plus some margin, so I tried a buffer that's a little more than twice the requested latency, and that seemed to fix the glitches.
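
The skip-until-enough-room logic can be sketched portably. The names below are illustrative, not cubeb's actual code; in practice `padding_frames` would come from IAudioClient::GetCurrentPadding:

```cpp
#include <cstdint>

// Hypothetical helper types/names, for illustration only.
struct refill_state {
  uint32_t buffer_frames;   // total WASAPI buffer size
  uint32_t padding_frames;  // frames already queued (GetCurrentPadding)
  uint32_t wanted_frames;   // latency the caller asked for
};

// Skip this refill event unless at least `wanted_frames` can be written.
bool should_refill(const refill_state &s) {
  uint32_t available = s.buffer_frames - s.padding_frames;
  return available >= s.wanted_frames;
}

// Oversize the buffer so the engine never starves while we wait:
// a bit more than twice the requested latency, as described above.
uint32_t pick_buffer_frames(uint32_t wanted_frames, uint32_t period_frames) {
  return 2 * wanted_frames + period_frames;
}
```

With the numbers from test_audio (4096 requested frames, a 480-frame period), this yields an 8672-frame buffer, and refills are skipped until at least 4096 frames of room exist.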

@padenot


Collaborator

padenot commented Jun 6, 2017

Does it still work if you load the callback like crazy? Try putting a Sleep or something in it.

For now, we're mostly using the lowest latency all the time (because we're using this for WebRTC and other real-time stuff), but we need to implement higher-latency channels in the future (for battery consumption and CPU usage reasons), so it would be great if we can get this solved and solid.

Let me know if I can help, maybe by running some tests on sound card/OS combinations that you don't have?

@ligfx


Contributor

ligfx commented Jun 8, 2017

If I Sleep in the callback, the number of frames available on the next call increases (makes sense to me, the audio engine has had more time to process the buffer), but the audio starts glitching. I guess that does "work," in one sense 😆

I looked around some more, and it does seem that WASAPI doesn't offer any latency control at all before Windows 8, and only supports latency control for desktop apps starting in Windows 10.

The Chromium project looked at this in the context of reducing power usage (Reduce audio power usage across all platforms by using larger buffer sizes) and solving various audio glitches (AV glitches during high CPU load), and settled for sending large chunks to their audio thread and FIFO'ing them down to the correct WASAPI buffer size (Enable higher latency audio on Windows for basic playback: "The WASAPI callback frequency is fixed irrespective of the buffer size we specify, so the best we can do is reduce the number of IPC round trips.").
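
The FIFO idea can be sketched in a few lines (illustrative names, not Chromium's or cubeb's actual classes): the user callback produces large chunks, and each WASAPI period event drains one device-period-sized slice.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Sketch of the large-chunk-to-small-period FIFO described above.
class frame_fifo {
public:
  // The (infrequent) user callback pushes a large chunk of frames.
  void push(const std::vector<float> &chunk) {
    fifo_.insert(fifo_.end(), chunk.begin(), chunk.end());
  }
  // Each WASAPI period event pops exactly `period` frames; returns false
  // when not enough is buffered, signalling that the producer must run again.
  bool pop(size_t period, std::vector<float> *out) {
    if (fifo_.size() < period) return false;
    out->assign(fifo_.begin(), fifo_.begin() + period);
    fifo_.erase(fifo_.begin(), fifo_.begin() + period);
    return true;
  }
  size_t size() const { return fifo_.size(); }

private:
  std::deque<float> fifo_;
};
```

This doesn't change when the WASAPI callbacks fire, but it does reduce how often the expensive user callback (or IPC round trip) runs.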

Windows 8 added Store Apps support for hardware-offloaded audio processing on supported systems through the IAudioClient2 interface (Enabling Great Audio Experiences in Windows 10 (PPTX) claims that offloaded streams use 1s buffers instead of 10ms ones).

In Windows 10, WASAPI gained the IAudioClient3 interface, which allows querying supported buffer sizes and creating streams with smaller-than-default buffer sizes (Low Latency Audio: "This makes it possible for an application to choose between the default buffer size (10ms) or a small buffer (<10ms)"). This still doesn't allow higher-latency audio, and, unfortunately, smaller-than-default buffers are simply ignored by the old IAudioClient interface. As an example of the supported buffer sizes, on my computer IAudioClient3::GetSharedModeEnginePeriod returns a minimum size of 128 frames and a maximum of 448 frames (which is also the default).
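
For illustration, a caller-requested period has to be clamped to something IAudioClient3 will accept: a multiple of the fundamental period, between the reported minimum and maximum. The helper name is hypothetical, and the fundamental of 8 frames in the test values below is an assumption based on numbers reported in this thread:

```cpp
#include <cstdint>

// Illustrative helper (not a cubeb function): snap a requested period
// to a valid IAudioClient3 engine period, i.e. a multiple of the
// fundamental period clamped to [min_period, max_period].
uint32_t pick_engine_period(uint32_t requested, uint32_t fundamental,
                            uint32_t min_period, uint32_t max_period) {
  if (requested < min_period) requested = min_period;
  if (requested > max_period) requested = max_period;
  // Round down to a multiple of the fundamental, without going below min.
  uint32_t rounded = requested - (requested % fundamental);
  return rounded < min_period ? min_period : rounded;
}
```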

To summarize, the options seem to be:

  • For lower-than-10ms-latency: on Windows 10, use IAudioClient3 instead of IAudioClient.
  • For much-higher-than-10ms-latency: on Windows 8 on supported hardware, use IAudioClient2 in a non-desktop app and request hardware-offloaded processing (does Firefox even care about this? A quick search tells me Mozilla isn't going to make Firefox for Windows Phone).
  • For potentially reducing IPC costs / lowering thread priority in the rest of the app: trigger cubeb user callback less frequently, with larger buffer sizes.

What do you think? How much of this, if any, fits into Firefox's/cubeb's goals?

@padenot


Collaborator

padenot commented Jun 9, 2017

That's great research, thank you very much! Also, you're right that we don't currently have plans to port Firefox to Windows Phone.

About our goal here:

  • We need to implement latencyHint for the Web Audio API, so we need to be able to have large or very small buffers. It looks like IAudioClient3 is what we want here. Depending on the OS version of the user, we can dynamically switch between IAudioClient and IAudioClient3.
  • For regular playback in an HTMLMediaElement, we don't care too much about the latency (we prefer saving CPU by having large buffers, and simply offset the video frames by a bit to preserve A/V sync). As you noted, we can't reliably get Windows to ask for large buffers, so maybe we can stick to the default.
  • We're currently implementing audio remoting (i.e. the audio stream lives in a parent process that can do system calls, sending callback requests to a child process that is completely locked down in a sandbox and can't do system calls), so we'll have IPC soon, and we'll probably end up doing what Chromium does for regular (non-low-latency) playback.

So I suppose we can make a tentative plan (based on your list):

  • Write some code to use IAudioClient3 if possible, requesting lower-than-default latency if that's what the caller asked for.
  • Change wasapi_get_min_latency if IAudioClient3 is available. Gather a bunch of hardware of different grade (crappy laptop up to server-grade workstation, with different sound cards, that we have here in the office) and load the callbacks to check whether the lowest latency is tenable in practice.
  • When we have our audio remoting code, implement your third bullet point.

What are the priorities for Dolphin? It looks like we can't really get Windows to respect the number of frames we want (reliably), so you might want to implement something like Stéphane Letz's Callback Adaptation technique?

@padenot


Collaborator

padenot commented Jun 9, 2017

Also, here are some stats about the OS versions our users run. Windows 10 is rapidly rising, so using IAudioClient3 seems like a good idea, based on those numbers.

@ligfx


Contributor

ligfx commented Jun 13, 2017

Glad it's helpful. I agree, dynamically choosing IAudioClient3 where supported seems like the way to go for lower latency. How would that play into Mozilla's build process? It looks like the Windows 8 SDK is used for official builds (please correct me if I'm wrong!), so to take advantage of the new interface would require either updating the SDK at some point or vendoring the relevant parts of headers into cubeb.

Dolphin doesn't really have concrete goals around audio, though the capability to run (anything, not just audio) at lower latencies is always nice. I started looking into getting higher latencies because of some work going on around software Dolby Pro Logic II demultiplexing (which needs larger buffer sizes to work well). That adaptation technique looks helpful, I'll pass it on, thanks!

@padenot


Collaborator

padenot commented Jun 14, 2017

That's outdated info, it seems; I'll fix it. We build with VS2015 and the Windows 10 SDK.

I have no clear timeline to work on this myself; however, it does not seem too hard and could be quite beneficial, so maybe I'll find some time.

If we can make high-latency streams work nicely as well, that would be great. Please keep us posted if/when you find something!

@sidewayss


sidewayss commented Nov 13, 2017

...5 months later: I have forked this project and will be attempting to contribute low-latency functionality. I see none of the changes described here in cubeb_wasapi.cpp, e.g. I see no references to IAudioClient3 in the code. Have any of these changes been made but not yet integrated? If so, maybe I could take over from that point in some way. If not, I'll just start as if none of the work has been done. Thanks!

@kinetiknz


Owner

kinetiknz commented Nov 14, 2017

I don't think anyone has started working on this yet (@padenot can confirm in case I've missed something). Looking forward to seeing the patches!

@padenot


Collaborator

padenot commented Nov 14, 2017

Indeed, no work has started on this. Thanks for working on it!

@sidewayss


sidewayss commented Nov 14, 2017

I am in the middle of some PC reconfiguring (new SSD, dual boot to Mac OS), but I'll be getting into this soon. My approach will be to create a version of cubeb_wasapi.cpp that uses IAudioClient3 instead of IAudioClient. If I can create a working, low-latency version using IAudioClient3, then we can work out the backwards compatibility and when to use IAudioClient. Then we can integrate it all.

Henrika@chromium says that the buffer size (periodicity) has global effects (see here). I have confirmed this behavior with Microsoft. This will be something worth testing to see how it really behaves. The periodicity should obviously be reset to its previous or default state when this code is done with the stream. The flip side of the sharing equation is that for this to be robust, the code needs to deal with another client of the same stream changing the periodicity.

There's no way to be any more polite about it. Other clients of the stream will start receiving data in the new buffer size. Is the effect of this on other applications relevant or important to you? Certainly robustness is important, so handling an outside change to periodicity is important. Some of the early comments in this very issue involve ignoring the refill events and dealing with buffer size more directly. That might provide insulation against unexpected periodicity changes. Though I don't see any mention of clear success with that approach either. Do you have examples for testing stream sharing?

Other than that, don't thank me yet! I have accomplished nothing so far, and I need to reset my mind to program Windows SDKs again. It should be fine, but I hate to over-promise and under-deliver...

@sidewayss


sidewayss commented Nov 14, 2017

I have hit a basic obstacle early on. I have Qt Creator installed as an IDE for building another project (MuseScore). This seems to be creating two problems:

  1. Qt has its own mingw setup, and even after installing VS2017, the CC and CXX environment variables point to the Qt executables. Microsoft's compiler is cl.exe, right? Is that what I should be using for both of these env variables? I tried changing those vars and it made no difference; it's still using the Qt compilers, even after a reboot. Is it a CMake setting?

  2. Maybe once I clear up 1), this will go away. Inside the Developer Command Prompt, the INCLUDE env var has the correct path for finding Audioclient.h, and it has no Qt paths in it. Yet when I execute

    cmake --build .

it fails because it cannot find IAudioClient3, among other things. It is using the Qt include files instead of the Microsoft ones, the compiler output says so. Are there settings for CMake that I need to change somewhere? Please excuse my inexperience in this. I seem to have exhausted my options for trial & error troubleshooting.

As far as I can tell, the correct include path for Audioclient.h is
C:\Program Files (x86)\Windows Kits\10\Include\10.0.16299.0\um

A side question: Should I be building 32 bit, using the Developer Command Prompt?

And finally: How do I test and debug this once I get it configured properly? I see the executables it builds, but executing the test_latency.exe just blinks a command prompt window at me. The test_audio.exe works, but that's not what I'm working on. Are there results to see somewhere? Once I have it working with IAudioClient3, how do I test it for latency?

Thanks!

@kinetiknz


Owner

kinetiknz commented Nov 14, 2017

My approach will be to create a version of cubeb_wasapi.cpp that uses IAudioClient3 instead of IAudioClient. If I can create a working, low-latency version using IAudioClient3, then we can work out the backwards compatibility and when to use IAudioClient. Then we can integrate it all.

I think it'd be better to make the changes to cubeb_wasapi.cpp directly. I think in theory all you need to do is replace the existing IAudioClient::Initialize in setup_wasapi_stream_one_side with a call to IAudioClient3::InitializeSharedAudioStream along with the appropriate code to get the IAudioClient3 interface (which needs to be done in a way that falls back correctly to retain compatibility with older Windows versions).
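
That fallback pattern might look roughly like the following Windows-only sketch (untested here; error handling is trimmed, and `init_client` is an illustrative name, not cubeb's actual code):

```cpp
#include <audioclient.h>
#include <mmdeviceapi.h>

// Activate the plain IAudioClient, then QueryInterface for IAudioClient3
// and use InitializeSharedAudioStream only when it is available, so the
// code keeps working on pre-Windows 10 systems.
HRESULT init_client(IMMDevice *device, WAVEFORMATEX *format,
                    UINT32 period_frames, IAudioClient **out) {
  IAudioClient *client = nullptr;
  HRESULT hr = device->Activate(__uuidof(IAudioClient), CLSCTX_ALL,
                                nullptr, reinterpret_cast<void **>(&client));
  if (FAILED(hr)) return hr;

  IAudioClient3 *client3 = nullptr;
  if (SUCCEEDED(client->QueryInterface(__uuidof(IAudioClient3),
                                       reinterpret_cast<void **>(&client3)))) {
    // Windows 10 path: request a specific engine period, in frames.
    hr = client3->InitializeSharedAudioStream(
        AUDCLNT_STREAMFLAGS_EVENTCALLBACK, period_frames, format, nullptr);
    client3->Release();
  } else {
    // Older path: duration-based initialization, in 100-ns units.
    REFERENCE_TIME duration =
        REFERENCE_TIME(period_frames) * 10000000 / format->nSamplesPerSec;
    hr = client->Initialize(AUDCLNT_SHAREMODE_SHARED,
                            AUDCLNT_STREAMFLAGS_EVENTCALLBACK,
                            duration, 0, format, nullptr);
  }
  if (FAILED(hr)) { client->Release(); return hr; }
  *out = client;  // after init, the base IAudioClient interface suffices
  return S_OK;
}
```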

As far as Qt Creator, I can't help with that as I don't use it. You're probably best off using Visual Studio 2017 and opening the CMake project via "Open Folder". That'll also allow you to debug easily with the MSVC debugger by running the tests directly in the debugger.

@kinetiknz


Owner

kinetiknz commented Nov 14, 2017

test_latency is actually a test of the cubeb_get_min_latency API, so it should run very quickly. You'll probably need to devise some kind of loopback testing to test and measure actual latency, it might be somewhat involved.

@sidewayss


sidewayss commented Nov 14, 2017

OK, I'll keep that in mind once I get this up and running. I've already made the first global search and replace of IAudioClient with IAudioClient3 :-)
I did open this in VS2017 using the CMake "Open Folder" feature. That works fine. But when I build, within the IDE or in the cmd prompt, it's still using the Qt compilers and include folders. The fact that it's ignoring the environment variables for the compiler executables is what puzzles me. Where is it getting those paths?
Sorry to hear about the "somewhat involved" setup for testing latency. I can place a mic in front of a speaker, the way they do it here: https://sound.io/latency/
Excuse my quick response - I now see that what you're saying is that I will have to devise my own way to test it. That's what I was asking. So the cubeb_wasapi.cpp file is part of which executable(s)?

@kinetiknz


Owner

kinetiknz commented Nov 14, 2017

I've already made the first global search and replace of IAudioClient with IAudioClient3 :-)

It shouldn't be necessary to do that, and it'll make retaining compatibility with older Windows versions more difficult. You only need the IAudioClient3 interface for the duration of the initialization, after that the code can continue using the interface exposed by the older IAudioClient just fine.

Sorry to hear about the "somewhat involved" setup for testing latency. I can place a mic in front of a speaker, the way they do it here: https://sound.io/latency/

That's probably fine for manual testing. I was talking about automated testing, which we don't have much of as yet.

Excuse my quick response - I now see that what you're saying is that I will have to devise my own way to test it. That's what I was asking. So the cubeb_wasapi.cpp file is part of which executable(s)?

The project builds a library (libcubeb) and the various tests link against the library and test it via the public API. So cubeb_wasapi.cpp is part of libcubeb.

@sidewayss


sidewayss commented Nov 14, 2017

I understand what you're saying about IAudioClient3 and initialization.
I'll have to get this to compile properly before I can fully grasp how I will debug and test it. The global search and replace is easy to undo.

@ligfx


Contributor

ligfx commented Nov 15, 2017

@kinetiknz Would it be easier to start by just testing that the callbacks request an expected number of frames? Not necessarily an exact number, but it should be approximately centered around the requested latency and not the default 10ms.
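
Such a check could be as simple as the sketch below (illustrative, not an existing cubeb test): record how many frames each callback asks for, then verify the average is near the requested latency rather than the default period.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Returns true when the mean callback request size is within `tolerance`
// (a fraction) of the requested latency in frames.
bool frames_centered_on(const std::vector<size_t> &callback_frames,
                        size_t requested, double tolerance = 0.25) {
  if (callback_frames.empty()) return false;
  double mean = std::accumulate(callback_frames.begin(),
                                callback_frames.end(), 0.0) /
                callback_frames.size();
  return mean > requested * (1.0 - tolerance) &&
         mean < requested * (1.0 + tolerance);
}
```

With a requested latency of 4096 frames, a stream stuck at the 480-frame default period would fail this check immediately.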

@kinetiknz


Owner

kinetiknz commented Nov 15, 2017

That'd definitely be worthwhile and simple to automate.

@sidewayss


sidewayss commented Nov 15, 2017

Progress: I cleared the CMake cache using CMake GUI and then cleaned and built the project. It is now using the correct paths and compilers. I can now build this in the Developer Command Prompt, but not in the IDE. The IDE still uses the Qt compilers/paths (is that strange, or am I clueless?).

OK, more progress: I cleared the cache folders via the VS2017 CMake menu, ..., Delete Cache Folders menu option. VS seems to have its own CMake cache. Now it is compiling using the correct compilers inside the IDE. I completed the build test by using a cubeb_wasapi.cpp with a full search/replace of IAudioClient with IAudioClient3, just to verify that it compiles using the latest includes and libraries.

Maybe I should put these as additional tips in INSTALL.md?
I'll start using the debugger now and orient myself inside the library. Once I fully understood that it's a library (only statically linked; is there a dll somewhere?), it made a lot more sense to me. My main experience using Windows SDKs in C++ is building dlls, though our clients were VB and VBA for Office.

@sidewayss


sidewayss commented Nov 15, 2017

I have built a version of libcubeb that has zero changes. I am running the debugger inside VS2017. It is running RUN_ALL_TESTS() and failing here while running test_overload_callback.exe:

DWORD r = WaitForSingleObject(stm->thread, 5000);

It is timing out (r == WAIT_TIMEOUT). The console shows:

stream started
Sleeping...
stream stopped
stream error

Then it closes. Is there something wrong, or should I be debugging a different way, using a different .exe file? I understand the idea of testing the periodicity using the callbacks, if you mean the WASAPI callbacks for buffer refilling. But I'm still not up and running, or so it seems.

Now I have switched the build directory using the CMakeSettings.json file:

"name": "x86-Debug",
"buildRoot": "z:\\cubeb-build",

The project is installed in z:\cubeb. ..\cubeb-build is the directory name used in INSTALL.md.
Now the debugger acts differently. It runs test_tone.exe as the first test in the gtests, returns successfully, and exits the debugger. Is that a more correct default behavior? Why would changing the build folder impact that? INSTALL.md mentions changing the folder, but does not indicate that it is required. Should I be running the gtests, or should I configure things differently in CMakeLists.txt? That's where all this is configured, right?

Thanks for your patience!

@sidewayss


sidewayss commented Nov 16, 2017

I am still not debugging in a 100% productive way, but by stepping the code during the test_tone.exe execution I can see that things are functioning correctly at a basic level. All I have done so far is create the initial, backwards compatible code for initializing the two different interfaces, IAudioClient and IAudioClient3. You can see the changes here:
master...sidewayss:master
This is only testing very basic output streaming, but it works on this machine that is compatible with IAudioClient3, and it should work on older, non-compatible setups.

My next step will be to figure out how to get into a better debugging setup, likely writing another test_xxx app. I need to start testing audio input for latency. On the default output device I was testing via test_tone, all the periodicities available were 480 frames, i.e. 10ms at 48kHz. I have other devices that offer alternate frame rates, or so I believe.

Do you all have an idea of how this will work externally? Will the caller be requesting a specific latency or will it simply be a preset menu of choices like "low", "high", and "default" latency? For now, if IAudioClient3 is available, I'm going for the lowest frame rate available, but I assume that's only temporary.

Also, I think that all calls to GetDevicePeriod() should use GetCurrentSharedModeEnginePeriod() if IAudioClient3 is being used. That and the initialization functions are the two key differences, so the code can be isolated to those cases.

FYI: InitializeSharedAudioStream() does not support the NOPERSIST flag; that flag must be abandoned for this form of initialization. Oh well. Volume and mute settings for output (not capture) will persist across sessions. I have been communicating with Matthew van Eerde, an audio developer at Microsoft who is kind enough to post his email address on his blog. He seemed as surprised as I was that NOPERSIST was not supported, but that doesn't mean it will be supported any time soon.

Thanks! And if you have any tips for helping me understand the googletest setup and get my debugging to be more productive, I'd appreciate it.

@sidewayss


sidewayss commented Nov 16, 2017

Now that I look at it, the check for IAudioClient3 would be best done in wasapi_init(), so the results can be available not just to the initialization functions, but also to the other functions that don't initialize a stream, like min_latency() and create_device(). This means putting some of the new member(s) in struct cubeb{}.

setup_wasapi_stream_one_side() is the only function that calls get_endpoint() instead of get_default_endpoint(), and it is called internally using the default devices by the main initialization function. That seems like a flaw. It seems like the whole module is designed to deal only with the default devices. Am I missing something?

Unless I get a better idea, I'm simply going to modify my copy of test_tone.cpp to include the tests I want to run, borrowing heavily from the other test_xxx.cpp files. It will be a small pain at rebase time, but that's how it goes.

@sidewayss


sidewayss commented Nov 17, 2017

Some disappointing results from IAudioClient3::GetSharedModeEnginePeriod(): I've run it against 4 different devices, 2 x input, 2 x output, 3 different pieces of hardware, and they all have only 1 frame rate available. "Low-latency audio requires hardware support." So this approach to lowering latency may be a dead end, as I don't know what hardware has drivers that support this for Windows. My PreSonus USB audio interface is only a year old, and I tested its input and output - no joy.

On the other hand, 10ms, which is the period returned by all these devices, is much lower than the lowest latency I've measured on my PC in any browser, which is 125ms (input-to-output delay). Firefox measured at over 200ms. That's a huge difference. What is going on?

@ligfx


Contributor

ligfx commented Nov 17, 2017

@sidewayss I remember when testing on a Windows install on a MacBook Pro (which I don't have in front of me at the moment), it gave me minimum latencies of ~2.5ms, and a max of ~10ms.

Regarding the high latencies from browsers, I can't really speak to what they may or may not be doing internally (larger buffers to save power? see link above to discussion on Chrome), but remember that 10ms is only the latency between your app and the sound server. There's additional latency between the sound server and what finally gets emitted from the device.

@sidewayss


sidewayss commented Nov 17, 2017

I got involved in this through this bugzilla:
https://bugzilla.mozilla.org/show_bug.cgi?id=1375466#c22
So my interest in cubeb is oriented around Firefox and audio input latency.
What hardware were you using? What audio device gave you the option for minimum 2.5ms?

@ligfx


Contributor

ligfx commented Nov 17, 2017

This is on a 2013 MacBook Pro, whose audio system Windows claims is "Cirrus Logic CS4208 (AB 93)". I'm using the Microsoft-provided driver.

On this computer, a call to GetSharedModeEnginePeriod with the default mix format returns a default period of 480, a fundamental of 8, a minimum of 120, and maximum of 480 (this also matches up with GetDevicePeriod's minimum reported exclusive latency, though I don't know if that's guaranteed).

If none of your devices support a latency smaller than 10ms, have you checked what drivers they're running? You might try upgrading/downgrading to the generic Microsoft driver—my understanding is it's intended to support lower latencies on most devices.

@sidewayss


sidewayss commented Nov 17, 2017

Yes, I'm puzzled by this too, especially with the PreSonus box, which has plenty of features. My contact at Microsoft was not surprised though, which made me pessimistic. I'm glad to hear that someone has had some success with this. The PreSonus box uses ASIO when hooked to a DAW, but I believe that it is using something else in this case. I should verify that. I downloaded and installed a more recent driver for this audio interface, but it gives the same results.

I just tried this with a 3rd input device, a Microsoft-branded web cam with microphone. All the values come back at 441 frames, a 10ms period at 44.1kHz. That driver is the generic USB audio driver by Microsoft, dated July 7, 2017. That's recent and pure Microsoft.

My built-in audio is Realtek Hi Definition Audio, and it uses a driver dated 8/8/2017, Driver Provider = Realtek, Digital Signer = Microsoft Windows Hardware Compatibility Publisher. That's plenty recent and signed by Microsoft. I'll keep digging...

@sidewayss


sidewayss commented Nov 17, 2017

I installed the generic "High Definition Audio Device" audio driver for my built-in hardware, instead of the Realtek drivers. Now I see variable frame rates for input/output on that device! Apparently the USB subsystem does not "opt in" to the variable frame rate scheme, which is why the PreSonus box offers only 10ms. The webcam is also USB, as is my bluetooth headset - the radio receiver is a USB device. Though apparently the bluetooth audio subsystem has this same deficiency.
It's a shame that it requires user intervention via Device Manager to enable this feature. Here are instructions on how to make the change.

At the urging of Matthew at Microsoft, I submitted a feature request to update the USB driver files in question so that they can take advantage of these features. In the realm of home recording, USB is by far the most popular choice for audio interfaces, even more so for Windows, where Thunderbolt is just barely an option. And there are plenty of consumer audio accessories that connect via USB, like my webcam w/microphone. At least the issue is on the Microsoft radar. We'll see what that's worth. Here is the bug number, the link is only useful if you have the proper credentials:
Bug 14725594: usbaudio.sys and usbaudio2.sys do not opt in to Win10 low latency (variable frame rates)

So, for now, only built-in audio hardware (on the motherboard) can reliably utilize the variable frame rates, and that might involve switching to the "High Definition Audio Device" driver. PCI audio hardware must write its own drivers w/o the Microsoft layer, so they are not limited by that layer, but they must do the work to "opt in" to low latency in Windows 10. I cannot reproduce this problem with bluetooth because I have no Windows machine with built-in bluetooth, only fake USB bluetooth. If this is going to change for bluetooth on Windows, someone who can reproduce it needs to submit the feedback request to get it started. Feel free!

@sidewayss


sidewayss commented Nov 18, 2017

For reference, here are the output device values available from the High Definition Audio Device driver on my machine. I can see they are slightly different from yours, @ligfx.

bit depth     16       16      16      16      24       24      24      24
sample rate   44.1kHz  48kHz   96kHz   192kHz  44.1kHz  48kHz   96kHz   192kHz
fundamental   32       32      32      32      16       16      16      16
minimum       128      128     256     480     112      128     240     480
default/max   448      480     960     1920    448      480     960     1920
min period    2.90ms   2.67ms  2.67ms  2.50ms  2.54ms   2.67ms  2.50ms  2.50ms
@sidewayss


sidewayss commented Nov 18, 2017

I have confirmed that setting the period to the minimum in InitializeSharedAudioStream() causes the callbacks to execute more frequently, at least for output devices. Tomorrow I hope to purchase a microphone with a 3.5mm connector, or an adapter, so I can input via my built-in hardware and test with low latency input.

Is it desirable to match the input and output periodicity? Or does it not matter? It strikes me as desirable in that it avoids excess callbacks, especially if the output device has a lower period (higher frame rate) than the input. The duplex callback will get called a bunch of times when there is no input available to send to the output.

I'm seeing a pattern in the callback frequency that I do not understand. The image below is from stdout for my version of test_tone.exe, which is actually a copy of test_duplex.exe. (I haven't been able to find a log file, so I'm using stdout.) It is running input and output at 16bit 48kHz, using the default periodicity of 480 == 10ms. The buffer size returned by GetBufferSize() is 1056 for both input and output. That is 2.2 periods, spread over three callbacks, to fill the buffer (480 x 2.2 = 1056).

  1. The callbacks are 15-16ms or 0ms apart, not the requested/default 10ms.
  2. The first numeric column is the number of milliseconds since the last call to refill_callback_duplex() or since wasapi_stream_start(). The next column is total elapsed time since stream_start(). After the first row, you'll see a pattern of one or two 15-16ms delays, then a 0ms delay between callbacks. What is causing a callback 0ms after the last one? There is data for every callback, as illustrated by the last column, which shows the packet size from get_input_buffer() always at 480. There is input and output data every call to refill_callback_duplex(). Is this pattern of mostly 3 callbacks that add up to 30ms tied in with the buffer size of 2.2 times the period? Sorry if this is a rookie question, but I am a rookie at this level of audio programming. I'd like to understand this pattern before I tackle measuring round-trip latency.

image

Here is a different behavior, in a more detailed report. All of a sudden I started getting the single 31ms delay every time I run it (2nd row). Note in the 1st and 2nd rows that the output frames are different from the input frames. Interesting that this suddenly changed within the same Visual Studio session. Interesting to me, maybe it's run-of-the-mill to you all. It supports my callback-frequency-rounding hypothesis: every 1, 2, or 3 callbacks, whenever they occur, will add up to a full buffer after ~30ms, which is 3 x periodicity. That is in line with the Microsoft audio round-trip latency I saw:
latency = (3 x periodicity) + constant-factors or r = 3p + k
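
As a quick sanity check of that hypothesis, the inter-callback gaps (one or two 15-16ms gaps plus a 0ms gap) should, in groups of three, add up to roughly 3 x the 10ms periodicity. The helper below is illustrative, not part of any test code:

```cpp
#include <cstddef>
#include <vector>

// Check that each consecutive group of three inter-callback gaps sums
// to target_ms give or take slack_ms (e.g. ~30ms for a 10ms period).
bool groups_of_three_sum_to(const std::vector<int> &gaps_ms, int target_ms,
                            int slack_ms) {
  for (size_t i = 0; i + 3 <= gaps_ms.size(); i += 3) {
    int sum = gaps_ms[i] + gaps_ms[i + 1] + gaps_ms[i + 2];
    if (sum < target_ms - slack_ms || sum > target_ms + slack_ms) return false;
  }
  return true;
}
```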

image

@sidewayss


sidewayss commented Nov 18, 2017

I'd like to raise two issues with this wasapi module which I raised previously, but that probably got lost in my tech support issues:

  1. Every externally exposed function except wasapi_stream_init() only deals with the default devices. Is that as it should be? Or should there be a way to specify a non-default device for these functions? Or should it follow the initialized device? These functions all activate their own device so that they can be called without initializing a stream, but you must still call wasapi_init() prior to calling any of these. But wasapi_init() doesn't have a device arg, and it uses the default output device to do a basic functionality test (see here).
  2. wasapi_get_min_latency() only gets output latency. See here where it uses eRender only. Is there not a need to get both input and output latency, depending on your working configuration?
@ligfx

Contributor

ligfx commented Nov 19, 2017

@sidewayss When you're running at 48kHz and seeing that callback timeline, what's the global sound server frequency set at? Is it also 48kHz?

@sidewayss

sidewayss commented Nov 19, 2017

What is "the global sound server"? Part of cubeb? I can't find anything by that name relating to Windows.

@sidewayss

sidewayss commented Nov 19, 2017

The new final number in each line is the "elapsed" time between refill_callback_duplex() and get_input_buffer(). These 0ms callback delays are part of a cycle where get_input_buffer() gets called immediately, then the callback is called immediately, then there is a 16ms delay until get_input_buffer(), then the callback is called immediately. It's like an extra pair of calls - but the callback initiates the call to get_input_buffer(). And this debug output waits until the very end of the callback, so it never returns early.

The output from get_input_buffer() only happens after the call to getNextPacketSize() - that's the "buffer:" value in the output, the next variable on this line of code. It is always 480, in every output line. But if I output timing from the start of get_output_buffer() it's the same.

Normally it looks like this:
60seconds
4 times it has looked like this:
image

I wanted to see if this pattern sustained past the 500ms it was originally set to run, so I run it for 1min now. The pattern runs for the whole 60 seconds, groups of one or two 15-16ms plus one 0ms.

But there's more. I'm now only outputting when the milliseconds since wasapi_stream_init() is exactly divisible by 1000 (ms % 1000 == 0), once a second, on the second, and I was surprised to find:

a) The callback is called on the second, every second for 60 seconds straight, at millisecond precision. I have run this over 20 times now with the same results. And it is regardless of the initial delay, which is either 109, 125, 140, 141, or 156ms. There seems to be an intentional sync every second.

b) It normally outputs 120 lines in pairs of 15-16ms/0ms. Now 4 times it has output only 1 line per second, every second for 60 seconds straight. No 0ms callback delays.

@ligfx

Contributor

ligfx commented Nov 19, 2017

This is what I mean: https://forums.adobe.com/thread/973133

By global sound server I mean Windows' built-in mixer that's used for non-exclusive audio streams. I don't remember the actual name they use to refer to it in docs.

@sidewayss

sidewayss commented Nov 19, 2017

Yes, in that window (Sound Settings from the Settings/Devices) both my input and output device are running at 16-bit 48kHz. That's what I meant when I said that it was running at that spec originally. The mic is mono (my webcam) and the output is stereo. I aligned all that at the start, and used the default 10ms period just to get everything aligned between input and output and see how things were working.

@sidewayss

sidewayss commented Nov 19, 2017

I am testing this now with the original test_tone.exe, output only. The results are identical. Everything is 15, 16, or 0ms, in spite of the requested/default period of 10ms. So this is not a duplex-related thing.

This is what it looks like when I use the min period of 128 (2.67ms):
image

It varies between five or six 0ms callbacks between 15-16ms callbacks (six or seven callbacks in each group). 16ms / 2.67 = 6, which aligns with the image above showing five 0ms callbacks between 15-16ms callbacks, for a total of 6 callbacks in the 15-16ms time frame. The buffer size is 280, and the period is 128, again at a ~2.2 to 1 ratio. In the space of 16ms it's handling 768 frames, which is 2.74 buffers worth of data. That seems misaligned due to the constant 16ms factor. Maybe it's the groups where there are 7 callbacks that compensate for this misalignment every once in a while. But even when it's aligned (period = 160/3.33ms), the number of callbacks in the group varies because of the fluctuation between/around 15-16ms in the clock, relative to the constant PeriodInFrames.

Regardless, it appears that Windows 10 is running at some kind of internal 15-16ms clock. When it hits the callback late, it decides to make up for lost time by running the callback repeatedly, instantly, in order to get back in sync. As I fiddle with different PeriodInFrames values in InitializeSharedAudioStream(), the debug output confirms this.

@sidewayss

sidewayss commented Nov 19, 2017

I have confirmed the same output-only results using IAudioClient instead of IAudioClient3 (using the default 480/10ms period). I have changed the default output device's settings to 24-bit 96kHz using the Windows settings aka "global mixer", and the same results. This appears to be a generalized Windows behavior. Has anyone seen this previously? Has anyone looked at this previously?

@sidewayss

sidewayss commented Nov 21, 2017

It looks like GetTickCount() is the cause of my strange looking callback frequency numbers. I switched to QueryPerformanceCounter(), and at the default 10ms period, the callback is getting called every 10ms +/- 0.23ms.
Sorry for the bother about this, but I assumed a basic SDK function would return a reasonable value in any situation, and I am not up-to-date in my Windows programming experience.

@sidewayss

sidewayss commented Nov 21, 2017

A question about test_duplex.cpp. I'm borrowing from it to build a latency tester.
This if statement in the data callback will always return false, won't it? Should it be OR instead of AND?

if (ib[i] <= -1.0 && ib[i] >= 1.0) {
    seen_audio = 0;
    break;
}

Is seen_audio a way to indicate clipping occurred? If not, what is it?

@achronop

Collaborator

achronop commented Nov 21, 2017

Floating-point audio data takes values from -1.0 to 1.0. This check verifies that the whole buffer contains values in that range. If so, it's assumed (and it's very likely) that the buffer contains audio data, so seen_audio remains true. If not, the buffer contains garbage, seen_audio becomes false, and the test fails.

@sidewayss

sidewayss commented Nov 21, 2017

Thus -1.0 and 1.0 are not legal audio data values either. The data can approach those values but not equal them. Thanks for the info.

@sidewayss

sidewayss commented Nov 21, 2017

I now have a simple latency tester with which I can measure. I grabbed the tone generation code from test_tone.cpp and merged it with test_duplex.exe. I am now testing by connecting my built-in headphone output to my built-in mic input using a cable with stereo 3.5mm connectors. The mic is stereo, but the duplex stream is setting it up in mono - I have not changed that code. Both input and output devices are configured at 24-bit 48kHz.

I am measuring the time between the start of the tone generator and the start of any signal on the input, any absolute value greater than 0.1. Until the tone begins, the input buffer is all 0.0 values. I'm outputting the results in 1/100th of a millisecond precision. Each line of output is a frame in the stream. In both screenshots, the "data_cb:" number right after the circled numbers is the max absolute value in the buffer for this frame. In the first screenshot you can see it jump from 0.0 to 0.91 in the frame where the "m:" value appears (start of input values >0).

The tone starts at the beginning of a frame, and I measure until the input_buffer[i] > 0.1. I believe that is the correct way to calculate the full round-trip. Start-of-frame to start-of-frame plus whatever input offset. Or should it just be start-of-frame to start-of-frame? Now that I think about it the difference is minimal, as it's all within the data_callback.

At the default 10ms period the latency is ~60ms. Here is a sample start/end time:
latency10ms

At the minimum period, 2.67ms, it tends to run between 21ms and 23ms, but I have seen values >25ms, and this screenshot is my current all-time-low, 17.72ms:
latency2 5ms

This is a bigger change than I would have expected, at an average of 38ms improvement in latency. I was expecting 22ms (3 x (10 - 2.67)). This is more like 5x than 3x. Not that I'm complaining...

I have some other buffering latency to look at, and I'll continue testing and fiddling with this. The only changes I have made to my code since I checked it in last week have been for outputting debugging info. So the enabling of this is done. Though it certainly needs more testing, especially on devices that don't have IAudioClient3.

@padenot

Collaborator

padenot commented Nov 22, 2017

What is the frequency of the tone that you're generating? If you're waiting for the sample to be > 0.1 in amplitude, then it's going to be a bit inaccurate, because it takes some time for a sine wave to get to > 0.1, and it depends on the frequency of course. If it's 440 (I think that's the default in the test), then it does not matter; here are some maths that show that it's very quick and does not matter, but let's make sure.

This is usually done using Dirac impulses or noise bursts.

The results look amazing regardless, thanks for working on this!

@sidewayss

sidewayss commented Nov 22, 2017

I adapted the test_tone.cpp tone from 16-bit int to 32-bit float. It's supposed to be a US dial tone. The code is this:

t1 = sin(2 * M_PI * (i + u->position) * 350 / SAMPLE_FREQUENCY);
t2 = sin(2 * M_PI * (i + u->position) * 440 / SAMPLE_FREQUENCY);
ob[output_index] = ob[output_index + 1] =
    (((SHRT_MAX / 2) * t1) + ((SHRT_MAX / 2) * t2)) / 32767;
output_index += 2;

It only runs for 2 or 3 frames. But the latency is frame-based anyway, isn't it? This is the first frame that has any data at all, anything over 0.00. My data_cb value is the max within a frame, and that's 0.00 until this point in time. If you have a better formula, I'd be glad to use it.

Also, I set it up to pass the input into the output buffer after the tone completes, so I can measure the echo, at least by frame. The echoes are the same latency as the original, with a one or two frame variation sometimes at the min period, but mostly identical.

@sidewayss

sidewayss commented Nov 22, 2017

I'm going back to this if statement:

if (ib[i] <= -1.0 && ib[i] >= 1.0) {
    seen_audio = 0;
    break;
}

This will still never resolve to true. A number cannot be simultaneously >1 and <-1. That was my original question, and I forgot about it during the seen_audio explanation.

@achronop

Collaborator

achronop commented Nov 22, 2017

Yeah, that's right. You can make a PR, otherwise I'll do it.

@sidewayss

sidewayss commented Nov 22, 2017

This is a test app, not the actual cubeb code, though I understand the importance of maintaining the test apps too. I thought that someone might have done it intentionally to ignore the garbage data. Maybe that's what happened, someone made that change intending it to be temporary, then forgot to reset it.

I'd prefer it if you submit the PR. I'm new to the project and don't want to presume. Plus, my fork of this project is not clean at this time. I'll make the change in my code for my own testing purposes. It's certainly an easy change.

Thanks for confirming and PRing.

@sidewayss

sidewayss commented Nov 22, 2017

I have made a change suggested by @padenot, to this code in setup_wasapi_stream():

#if !defined(DEBUG)
    const int silent_buffer_count = 2;
#else
    const int silent_buffer_count = 4;
#endif
    stm->linear_input_buffer->push_silence(stm->input_buffer_frame_count *
                                          stm->input_stream_params.channels *
                                          silent_buffer_count);

I set silent_buffer_count to 1 instead of 2. I am running inside the debugger with DEBUG not defined, and the test app exits with code 3 if I set it to zero. I'll get around to testing it at zero in a full release .exe sometime soon - unless someone tells me that it's a pointless exercise.

This reduces the round-trip latency by approximately 1 frame. There is a lot of 1-2 frame variation in round-trip latency, even at the 10ms period, in spite of the stability I saw early on in my tests. So it's difficult to give a precise latency reduction number.
At the minimum period of 2.67ms, I am seeing more round-trip values less than 20ms, which indicates a 1 frame reduction in latency.
At the default/max period of 10ms, here are the results, in order:
40ms, 50ms, 40ms, exit code 3, 50ms, 60ms, 40ms, 50ms, 50ms, 60ms
I had not seen 40ms previously, and I was getting 70ms regularly before this change. I'm going to watch for the code 3 failures more specifically. I know they have happened in the past, but I have dismissed them because I had bigger problems. I do not know where the error is happening. It's probably a buffer underrun/overrun - something related to this initial padding in the input buffer. I need to get a log file up and running...

@sidewayss

sidewayss commented Nov 22, 2017

After making that last change I am receiving an unacceptable number of failures. I don't know what exactly is failing, but I got effectively zero failures before I made that change, and at least 15% failures after the change. I have reverted the value and am not able to cause a failure in >10 attempts.

Yes, I am running a debug executable in the debugger. I will endeavor to get a set_log_callback() function happening so I can improve the reliability and ease-of-use of the test results. If anyone can point me to where they think the failures are happening, I'd appreciate it. But it does look like 2 is the minimum debug number, so I can reduce that from 4 to 2.

@sidewayss

sidewayss commented Nov 24, 2017

I made another change suggested by @padenot, and the results are better now. I am not inserting any silence at the beginning of the input buffer. Instead I am discarding the first 9 frames in the refill_callback, which gives the input stream time to stabilize. I got the occasional error discarding the first 4, then 5 frames, so I set it to 9 and have not seen any errors since. I'm still running in the debugger, and we can bring that number down. At 10ms period that's 100ms wait for the duplex stream to effectively start (the first frame of output buffer is always goofy).

But the trade off for a small delay in starting the stream is a clear 20ms reduction in latency at the 10ms period. I am seeing a 40ms latency 75% of the time, and a 50ms latency the other 25%. Never as high as 60ms in >30 test runs.

At the min period, 2.67ms, I fail 75% of the time, but the times I don't fail I get latency as low as 12.6ms. I am now inserting 1 frame of silence again and no more failures, with latency of mostly 21ms, as low as 15ms. I'm still running in the debugger, so these numbers are not precise, and we still might be able to avoid the frame of silence, even in low latency.

But it's looking like the worst case is inserting 1 frame of silence only when low latency is requested, where it adds less than 3ms of latency. I'd be OK with that.
I will get the logging working soon, and these results will get more precise.

@sidewayss

sidewayss commented Nov 24, 2017

I have a release .exe and I am using ALOGV() to log my values asynchronously. They aren't much different from my debug values, the ones I have been reporting. At low latency I'm seeing an average of 18.2ms round-trip latency. I'll run it various times over the weekend and accumulate results in a table. I have minimized the logging to one number per test, elapsed time, aka latency.

The main thing is that now others can build this version of test_tone.exe and generate their own logs on different hardware. My code is here, only two files are changed: cubeb_wasapi.cpp and test_tone.cpp. It creates a log file named w_latency_log.txt in the same folder as test_tone.exe. The exe overwrites the log file each time it runs. The exe runs for just over 60 seconds to accumulate 60 latency values. If you like I can post an exe here or elsewhere.

This code always selects low latency if it is available. It still chooses the period based on the minimum period of the default output device. I have been using the same built-in hardware for input and output, as it is the only hardware that is currently supported by Windows for low latency.

Is there a way to stop the logging from including .cpp file line numbers? It makes it harder to parse the results for analysis.

@sidewayss

sidewayss commented Nov 24, 2017

Do you all build this for Windows only as x86 32bit, or do you build 64bit versions as well? I'm running a 64bit OS, but I've been building everything as x86 so far. I'm wondering if it's worth building and testing a 64bit exe.

@sidewayss

sidewayss commented Nov 28, 2017

test_sanity.cpp:

#define STREAM_RATE 44100
#define STREAM_LATENCY 100 * STREAM_RATE / 1000

STREAM_LATENCY = 4410. That is a 100ms period. On Windows, AFAIK, that isn't available, and it will use 10ms instead. Is 100ms available on other platforms, and this is intentional, or should it be defining STREAM_LATENCY as 10 * STREAM_RATE / 1000?

@padenot

Collaborator

padenot commented Nov 29, 2017

This is intended. For example, we use a latency of 100ms in Firefox, regardless of the platform, for regular media playback (non-real-time). This GitHub issue is about Windows not obeying this parameter.

Also, this is the total latency, not the callback period.

@sidewayss

sidewayss commented Nov 29, 2017

It might not be intended as the callback period, and it is reset to 10ms by Windows, but it is the callback period that is requested in IAudioClient.Initialize() in most of test_sanity.cpp. That's why I inquired. For example:

  r = cubeb_stream_init(ctx, &stream, "test", NULL, NULL, NULL, &params, STREAM_LATENCY,
                        test_data_callback, test_state_callback, &dummy);
@sidewayss

sidewayss commented Nov 29, 2017

Maybe I should rephrase: The way I am making Windows obey this parameter is to treat it as the callback period, the value returned by wasapi_get_min_latency() and the value set in wasapi_stream_init() by IAudioClient3::InitializeSharedAudioStream(). That is the way it was setup when I arrived. I don't understand what you mean by "total latency". I have no way to set total latency, I can only set the callback period.

@sidewayss

sidewayss commented Dec 2, 2017

I have been testing with the few devices that I have, collecting round-trip latency data. These tests are with the microphone facing the speaker. My webcam is delivering 80ms latency, but with too many inconsistent data points (it's not designed to be a close-range microphone?). My bluetooth-to-USB headset mic is getting 120ms latency, very consistent results. These devices clearly have more things increasing the latency than just a 10ms period. I have not had any success testing my iOne USB interface with loopback cables. So I tested that with a mic placed in front of a speaker (the SM57 is designed to be a close-up mic). It's averaging 70ms, which is a bit above the built-in devices when using 10ms. Some of the difference might be attributable to the speaker-to-microphone audio transfer versus a short loopback cable.

I'm collecting a bunch of data. I've designed my test_.exe to execute 60 round-trip measurements in 60 seconds. It logs the results asynchronously. I have a spreadsheet that accumulates the numbers and calculates a few things. Hopefully we can enlarge the range of devices and keep collecting data in an organized way. The log file results are simple to copy/paste into a spreadsheet or a relational table. Each test run is 1 column with 60 rows of data. But it could be 1 row with 60 columns if that's preferred.

I would appreciate it if anyone can provide me with a better click/pop/white/pink-noise formula. My dial-tone is flamming at this short duration. I don't think it's affecting the results in any noticeable way, but it's not generating a tone with consistent level, and it has a flam double-attack. I'd like to clean that up. I'm noticing it now that I'm listening to it over the speakers.

There are other interesting trends in the data, the most interesting being better results inside the debugger than running a release exe: lower latency and less variation in the latency values across tests. It's filling the buffer more consistently on time. Does anyone know why that might be? Is there a way we can make the release exe run faster? It should run faster and more consistently than the debugger, shouldn't it?
