Argon containers use only weak cores on systems with heterogeneous CPUs #397
Comments
@Howard-Haiyang-Hao, are you familiar with this issue?
@fady-azmy-msft, it's the first time I've heard of this issue. I'm in the process of investigating it and will keep you guys posted.
@conioh I have successfully replicated the issue locally and am collaborating with the feature teams to gain a deeper understanding of the underlying reasons behind this behavior.
@Howard-Haiyang-Hao, that's great to hear. Since opening the issue we have made the following findings, both obviously related to scheduling:
I hope that helps.
This issue has been open for 30 days with no updates. |
@Howard-Haiyang-Hao, is there any update? |
Thanks @conioh for the updates. Here's the reason why we observed the scheduling behaviors:

In the host case, `primesieve.exe` was launched with the following parent chain:

In the container case, `primesieve.exe` has the following parent chain:

By setting the process priority to above normal, we can make the scheduler treat the container process as high priority, overcoming the QoS behavior that we experienced.

I have initiated an email thread regarding the hyperthreaded core scheduling behaviors you described. I will keep you updated. Thanks!
@Howard-Haiyang-Hao, thanks for the information. That's very interesting. I made a couple of tests, and indeed I see that if I set the process inside the container to HighQoS it utilizes all the cores, and if I set the process outside the container to EcoQoS it utilizes only the E-cores. Unfortunately, unlike priority and affinity, thread QoS isn't inherited between processes as far as I can see, so setting my shell inside the container to HighQoS doesn't help its child processes.

Also, although thread QoS in general explains this behavior, there still seems to be something missing here: if I take the documentation to be the intended behavior, it seems like the scheduler on ARM does the right thing, while the scheduler on x64 has a bug which reduces the performance of these (Low/Utility) threads even when it shouldn't.

Thank you.
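For anyone who wants to repeat these QoS experiments: the per-thread knob is the documented `SetThreadInformation` API with the `ThreadPowerThrottling` info class. A rough, untested sketch follows; the helper name is mine, and note that the public API only lets you request "eco" (throttled) or "high" (unthrottled):

```c
#include <windows.h>

/* Hypothetical helper (not from this thread): request EcoQoS or HighQoS
 * for a single thread via the documented ThreadPowerThrottling class. */
static BOOL set_thread_qos(HANDLE thread, BOOL eco)
{
    THREAD_POWER_THROTTLING_STATE state;
    ZeroMemory(&state, sizeof(state));
    state.Version = THREAD_POWER_THROTTLING_CURRENT_VERSION;
    state.ControlMask = THREAD_POWER_THROTTLING_EXECUTION_SPEED;
    /* StateMask = EXECUTION_SPEED requests EcoQoS (throttled);
     * StateMask = 0 requests HighQoS (unthrottled). */
    state.StateMask = eco ? THREAD_POWER_THROTTLING_EXECUTION_SPEED : 0;
    return SetThreadInformation(thread, ThreadPowerThrottling,
                                &state, sizeof(state));
}
```

This has to be applied per thread (or per process via `SetProcessInformation`), which is exactly why the lack of inheritance described above is a problem for child processes.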
Wait, ARM machines? I don't think Windows Containers supports ARM ;). |
Obviously off-topic here, but you're confusing two meanings of the word "support":
Microsoft does not offer support (1) to use Windows containers on ARM64 Windows. They don't offer ARM64 base images for Docker for Windows, and so on. Windows 11 for ARM64 supports (2) launching Windows process-isolation ("server silo"-based) containers. The Windows software binaries have that code and it works. You can test it yourself with the simple command I provided above. Using that simple command is unsupported (1) by Microsoft.
So it's been over two years since the Alder Lake CPUs were released, six months since I opened this issue, and two months since an internal email thread on this issue was initiated. Is there any news?
@conioh, an internal discussion is underway regarding the treatment of processes running inside containers, aiming to handle them as regular processes instead of background processes. Implementing this change won't be straightforward, as it significantly affects system behaviors. The workaround you suggested, prioritizing the process, appears to address the issue. Regarding ARM containers, they are not officially supported at the moment. I have communicated your feedback to the feature team for an examination of the behaviors you highlighted. |
@Howard-Haiyang-Hao, treating container foreground processes (e.g. run via

Even true background processes, which legitimately run at utility QoS, should be scheduled to all cores when running on wall power.

Setting the process priority to above normal isn't a viable workaround, since the entire system stops responding until the build ends. Because the build is highly parallelized and the compiler doesn't yield CPU, giving it above-normal priority causes CPU starvation for all the other processes running at normal priority.
Thank you, @conioh. I've relayed the message to the feature team for further discussion.
Is there any news? |
Apologies for the delayed response.

If you'd like your app to opt out of the "background" QoS behavior, you can use the SetProcessInformation API, with the ProcessPowerThrottling info class. More information can be found here: https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-setprocessinformation

For example, doing something similar to the following should opt out of the background QoS:

```c
PROCESS_POWER_THROTTLING_STATE PowerThrottling;
RtlZeroMemory(&PowerThrottling, sizeof(PowerThrottling));
PowerThrottling.Version = PROCESS_POWER_THROTTLING_CURRENT_VERSION;
PowerThrottling.ControlMask = PROCESS_POWER_THROTTLING_EXECUTION_SPEED;
PowerThrottling.StateMask = 0;

SetProcessInformation(GetCurrentProcess(),
                      ProcessPowerThrottling,
                      &PowerThrottling,
                      sizeof(PowerThrottling));
```

This should opt the current process out of the heuristics the system would normally have applied to determine the QoS to use for the process. If you need to apply the opt-out for a child process, you could specify the handle to the child process in the API. Note that you'd need the PROCESS_SET_INFORMATION access right if operating on another process.

It wasn't clear from the discussion above if this is already something you tried, or if this would work for your needs. Please let me know if you have tried this approach and it wasn't sufficient for some reason. Thanks, and I hope this helps.
@fjs4, we're aware of this API and it's not useful in this scenario. First,
(Second, but less critical: the code you've suggested doesn't simply "opt the current process out of the heuristics" but rather forcefully and explicitly sets the QoS to High. Why not Medium? Well, for one, because there's no public API to set the QoS to anything but Eco or High. But do we really want High? Not necessarily. We want the default. The correct default that works correctly.)

Third, there's no justification to push this onto us. As I've already said above:
See also more details in microsoft/Windows-Dev-Performance#117 My employer pays good money to Microsoft and we expect Windows to function at least on some basic level. |
I bumped into this issue recently at work. My laptop exhibits this problem, but a co-worker's desktop does not. Both CPUs are Raptor Lake, with a mix of performance and efficiency cores. Laptop:
Desktop:
My use case sounds very similar to @conioh's. I am attempting to run a build and compile software inside a container. Until this is resolved, running a build inside a Windows container on a laptop is not practical. Performance is significantly degraded (e.g. 90-150% longer build times in my experience).
Hi @nbrinks. Thank you so much for this.

I didn't mention it here, but we did check it on a desktop machine, and we did encounter the same issue there. We don't have Alder Lake or Raptor Lake desktops at work, but one of my colleagues has a personal custom-built (not from a brand) desktop with an Intel i9 13900K and had the same problem there. To make sure, we tried it again and the problem is still there.

But the data you provided made me think. Now the differentiating parameter wasn't just ARM vs x64, and it also wasn't Dell vs non-Dell. I thought it might be that something in your desktop environment tinkered with the priority or affinity: if you set the process priority to above normal, its threads are scheduled to all the cores, and similarly, if you set its affinity to only include P-cores, it also causes it to run on them. But that was pretty unlikely.

I was about to ask you for a bunch of data, including a WPR trace, to see what's going on on your desktop. To do that I needed to compile a list of the pieces of information I wanted, and while doing that I seem to have stumbled upon something interesting. It's not a solution, because it works on one of our laptop models, doesn't work on our desktop, and we didn't test it on a second laptop model yet, but it's interesting. Basically, somehow between Microsoft, Intel and Dell, the power management settings were to schedule utility-QoS threads only to E-cores, contrary to what the documentation says.

It's a bit delicate, not something I consider production-ready (and indeed works on one machine but not on the other 🤷), and may require tweaking depending on your specific configuration, so I'm reluctant to share it publicly. If you'd like, you're welcome to contact me privately and we can see if it helps you. Slack might be the best (I'm just assuming you use it too), but other means are possible.

In the meanwhile, if you don't mind, could you share the contents of the following Registry keys?
(This includes almost exclusively power management configuration related information, at least on my machine, and in particular excludes all non-power related settings under |
NOTE: Edited on 2023-07-11 to correct inaccurate data about ARM64 machines.
Describe the bug
Argon containers use only weak cores on systems with heterogeneous CPUs.
We use Docker for our build environment. This is required for many reasons.
For example, it's quite common for our code to be incompatible with past and future Visual C++ versions, because of new C++ features going in and bugfixes (or new bugs) modifying the behavior of old code.
Generally, people can always build the `master` branch on their host machine, but they can't build an older version (required when servicing an issue with an older but still supported version). But people can't refrain from upgrading Visual Studio, because then the current code won't build, and it's impractical to keep all the versions of the Visual C++ build tools installed.

So we have Docker images with the entire build environment, and a text file in the code repository pointing to the tag of the corresponding build-tools Docker image. That way, in order to build any commit from the product source repository, we only need to pull the referenced Docker image. This is also what our CI system does.
We make every effort to run Docker containers using process isolation (mostly by making sure our images start from base images compatible with the host), as our tests have shown that running under process isolation has no overhead compared to running directly on the host (sometimes it's even slightly faster¹), while running under Hyper-V isolation has a significant overhead, among other problems.
Unfortunately, we have recently discovered that when running on systems with heterogeneous CPUs (Intel Alder Lake and Raptor Lake CPUs, and ARM64 CPUs under certain conditions), the processes inside process-isolated containers utilize only the "weak" cores (E-cores on Intel, LITTLE cores on ARM). See: docker/for-win#13562
Using sophisticated debugging techniques we have also discovered that the issue is not with Docker/Moby but rather with Windows. See below in the reproduction section.
On all of our machines with Intel CPUs prior to Alder Lake, building inside process-isolated Docker containers runs as fast (or slightly faster¹), but on our Raptor Lake machines with Intel i9 13900H CPUs (with 6 P-cores = 12 logical cores, plus 8 E-cores) we get the following results:
(The numbers are rounded averages. For example, Debug on the host actually takes 490s-497s.)
This is pretty awful. We found out it is actually faster to build on an older model of the same computer, from 2 iterations back, with an Intel i9 11900H CPU, since it doesn't have two kinds of cores and it actually uses all of them.

On the aforementioned Raptor Lake machine, the 12 logical P-cores do nothing while the 8 E-cores do all the work. We're assuming Release is more compute-intensive due to the optimizations, and there we see a x2.5 factor (which is close to the 40% CPU utilization claimed by some tools seeing 100% on 8 cores and practically nothing on the 12 other cores), while Debug is not so compute-bound, so there the slowdown factor is "only" x2.
A workaround we've considered is using Hyper-V isolation. It kind of works but not really. A complete table with Hyper-V isolation would be:
The Debug build has a significant overhead compared to running directly on the host and compared to what process isolation should have been, but it's still better than what process isolation does on this machine. But the Release build just crashed `MSBuild.exe` for lack of memory.

You see, Hyper-V isolation has the "nice" property of causing `MSBuild.exe` to crash with screams about not enough memory when launching the containers with as little as `--memory 64GB`, and refusing to run at all with, let's say, `--memory 96GB`:

screenshot
That's a separate issue, with Hyper-V isolation being less than useful, but we're not here to solve that. The point is that using Hyper-V isolation isn't a valid workaround, on the grounds of not working. Not that it would be a good workaround even if it had worked, on the grounds of being slow, and there being no reason for process isolation not to work properly.
¹ One element we have discovered that makes processes inside process-isolated containers run slightly faster than directly on the host is less interference by certain security software. We assume there may be other causes. Generally, we say that the performance of process-isolated Docker containers is approximately equal to that of running directly on the host, except in pathological cases. Like the one we have here, unfortunately.
To Reproduce
Execute the following PowerShell commands:
Take a look at your favorite CPU utilization tool. I used Sysinternals Process Explorer. You're welcome to use Task Manager, `perfmon.msc`, or whatever rocks your boat.

Dell XPS 9530 with i9 13900H, 64GB RAM, aforementioned Micron SSD:
If you're not sure which core is which, you can hover over the core. Process Explorer, unlike Task Manager, tells you which logical core belongs to which physical core. Or you can use Sysinternals Coreinfo:
coreinfo.exe output:
You can see that the first 6 pairs of cores share caches and have larger caches than the other 8 cores.
This also happens on ARM devices such as the Samsung Galaxy Book Go (with Snapdragon 7c Gen 2, 4GB RAM) and the Surface Pro X (SQ1, 8GB RAM; SQ2, 16GB RAM):
On both these devices the problem manifests itself only when using Argon containers and running on battery power. That is:
Only when running inside a container and on battery power, the big cores spike for a moment, like on the Alder/Raptor Lake machines, and then aren't utilized by the container anymore. If connected to AC power they are utilized again, and if disconnected they stop being utilized once more.
(The small fluctuations after the spike are due to the 7c being extremely weak; even running Task Manager and a browser requires non-negligible CPU power. On the Surface Pro X it is smoother after the spike.)
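Since the big cores on these devices come and go with the charger, it may help to log the AC line state while reproducing. A sketch against the documented `GetSystemPowerStatus` API (untested; the program layout is my own):

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    SYSTEM_POWER_STATUS sps;

    if (GetSystemPowerStatus(&sps)) {
        /* ACLineStatus: 0 = on battery, 1 = on AC, 255 = unknown */
        printf("AC line status: %u\n", (unsigned)sps.ACLineStatus);
    } else {
        fprintf(stderr, "GetSystemPowerStatus failed: %lu\n", GetLastError());
    }
    return 0;
}
```

Running this inside and outside the container, on battery and on AC, would pin down whether the container sees the same power state as the host.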
The first six cores are the LITTLE ones and the other two are the big ones:
coreinfo.exe output:
Expected behavior
All cores utilized.
Configuration:

`CmDiag.exe` in reproduction. Because it doesn't matter.

Additional context
It certainly doesn't seem to be an issue with the specific container engine, runtime or base image.
Due to the behavior on the ARM64 devices it might be related to power management. Perhaps the container is somehow "confused" about the power state or configuration?