Vmmem high CPU usage #6982
Comments
Today, after waking up from sleep, I managed to repro. |
The day after, also waking up from night sleep. Like @Matsemann |
The high CPU usage can be seen from Task Manager |
Yes, and from the htop screenshot. But you cannot see what is using the CPU (other than the generic |
Update: I can reproduce this 100% of the time by turning off my 2nd monitor (one core in Linux gets pegged to 100%, but htop doesn't show any processes using CPU power). If I run 'xeyes', as soon as xeyes opens the mystery CPU usage disappears, so I assume it's something to do with the new graphics integration, and actually running a program gives it a kick? |
Great find! I've always experienced it after computer has been sleeping (like when coming back after lunch), but I also can reproduce it by reconnecting a monitor. Which I guess is what happens when the computer starts after sleep. Running xeyes or any wslg app doesn't seem to stop it here, though. |
Yes, if someone from wsl team could tell us how to give them better diagnostics that would be nice. As we've said, nothing shows up as using resources in the wsl images themselves. But we can see a high cpu core usage in htop, or vmmem cpu usage in task manager. But don't know how to see what is causing it. |
Can confirm,
I tried messing with the refresh rate (my first monitor runs at 144 Hz, my second at 60 Hz), but no luck. In the process, I found that when I changed the refresh rate, the screen "refreshed" and the 100% core in htop went away. Right now, after a fresh restart, it reproduces reliably, but with only 1 core spinning up; in the past I had about 3 cores spinning up out of nowhere. I also use Docker for Windows, which I turned off first for this test, because I have also had a bad experience with Docker hogging CPU while no container was running. Update: Windows 10, build 21390.2025 |
I think this is indeed related to WSLg, which creates a hidden "system distro" instance for each running user distro. That's why you can't see anything in htop. When the 100% core usage happens you can run this from the terminal: I bet the culprit is the |
As a possible confirmation of this, I had this issue continuously (with vmmem using 50% of CPU without anything running on WSL2). I disabled WSLg in wslconfig and restarted the distributions. The issue (so far) seems to be fixed. |
With |
If it helps, I was already running 1.0.24 (updated yesterday) and still seeing the issue. Disabling WSLg seems to have fixed it. I'll keep testing. |
I too can reproduce this 100% of the time by turning my second monitor on and off. Is there a user-side workaround for this issue, or do we have to wait for Microsoft? If so, is there a way to expedite such a serious issue? Side note: opening xeyes only dropped the CPU usage from 25% to 17%, and after I closed xeyes the CPU usage went back up to 25%. So far I have not found a reliable way to make it run normally again other than to disable wslg with |
@lishan89uc have you tried |
I occasionally get this issue as well and it usually gets fixed when I shut down the WSL completely. This time I tried pressing the Win+Ctrl+Shift+B combo and that seems to have fixed the CPU usage in this instance. |
I also experienced this today and the Worth mentioning that this is now happening on Windows 11:
|
Also happened on mine. Consistently happens after locking the screen. My system is a laptop with internal 4K display (switched off) and external 5K display via Thunderbolt (set to second screen only). |
When vmmem's CPU usage is very high, in my case, stopping Docker desktop solved the problem.
|
This seems to work for me as a temporary fix. I've seen this in a few different states, either 1 CPU core or 3 CPU cores all at 100%, with nothing showing up in top etc. as actually using CPU.
|
If you don't need gui apps then you can disable wslg while waiting for a fix. I did it a couple days ago and it seems to have stopped the high-cpu-after-sleep issue. Add the following to
|
I do not use sleep mode on my office PC, it is always on. And I have the same problem. Perhaps it appears after connecting via RDP, but I cannot confirm this yet. |
Could somebody who is hitting this provide the output of dmesg? I suspect this is WSLg-related. |
Hi folks, I was having issues with I hope this is relevant and helps someone else in a similar predicament. |
I also use power toys, but at the moment the cpu load is normal and cannot confirm. Interesting) |
Another thing I notice is that Steps:
|
@ctataryn sorry, can't confirm. As soon as I shutdown the wsl, vmmem is gone |
I see high CPU, sustained 50-70% after resuming from sleep on Windows 11 21H2 22000.2538. WSL becomes completely unresponsive though, so I'm not able to open a terminal to run the script mentioned. |
I'm seeing high load from vmmemwsl on Windows 11 too - it's not bad enough to render my laptop useless or even cause WSL to freeze - but I think it impacts things running inside WSL - I feel it causes minikube tunnel to lose connectivity and give this error
Running my interrupts monitor script - https://github.com/zmajeed/simpleprog/blob/zmajeed_general/wsl_stats/wsl_interrupt_stats.sh - it does not show the extreme level of HVS interrupts rate like on Windows 10 but it's relatively high and there are other interrupts that seem high
On the Windows side I see these Hyper-V performance counter changes
|
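For anyone who wants to turn the raw counters into rates without the linked shell script, here is a minimal sketch in Python. It is not the script from the comment above; the function names are made up for illustration, and it assumes the usual `/proc/interrupts` layout (a header row of CPU names, then one row per interrupt source such as `HYP` or `HVS`).

```python
def parse_interrupts(text):
    """Parse the text of /proc/interrupts into {source name: [count per CPU]}."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())          # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        name, _, rest = line.partition(":")
        counts = []
        for field in rest.split()[:ncpus]:
            if not field.isdigit():
                break                      # reached the description text
            counts.append(int(field))
        if counts:
            table[name.strip()] = counts
    return table


def per_second(before, after, seconds):
    """Per-CPU interrupt rate for every source present in both snapshots."""
    return {
        name: [(new - old) / seconds for old, new in zip(before[name], after[name])]
        for name in after
        if name in before
    }


def snapshot(path="/proc/interrupts"):
    """Read and parse the live counters (run this inside the WSL distro)."""
    with open(path) as f:
        return parse_interrupts(f.read())
```

Take two `snapshot()` results a known number of seconds apart and pass them to `per_second`; index 0 of the `HYP` entry is the HYP.0 rate discussed in this thread.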
I'm the same person as "kelleymh" who previously commented, but now as "mhklinux". I have officially retired from Microsoft so I don't have internal Microsoft connections like I did a month ago. But I'm independently contributing to Linux kernel work, and am interested in helping to get to the root cause of these WSL issues. The data you've collected shows a problem that's different from the synthetic timer interrupt storm. From the Linux side, the HYP.0 interrupt rate definitely seems high. The HYP interrupts are almost exclusively associated with VMBus devices such as the synthetic SCSI controller, synthetic network interface, the memory balloon synthetic device, and various others. Each such device is identified by an instance GUID. You can see the list of devices and their drivers with this Linux command: "ls -ls /sys/bus/vmbus/devices/*/driver". Whenever the Hyper-V host needs to send the guest an interrupt for such a device, the HYP interrupt is sent. In Linux, the HYP interrupt handler de-multiplexes by looking at some additional data structures to determine which VMBus device the interrupt belongs to, and then runs the interrupt handler for that VMBus device. So my first thought is to determine which VMBus device is receiving the high number of interrupts. The picture is slightly more complex because the interrupts are actually associated with VMBus channels. Most VMBus devices have a single channel, so a channel and a device are equivalent. For these devices, the HYP interrupt always comes to CPU 0. But some VMBus devices (the synthetic SCSI controller and synthetic NIC) have multiple channels that can interrupt multiple CPUs in parallel, only one of which might be CPU 0. Altogether, the cumulative interrupt count for a VMBus channel is available in /sys/bus/vmbus/devices/*/channels/*/interrupts. The CPU that a channel will interrupt is similarly available at /sys/bus/vmbus/devices/*/channels/*/cpu.
So the trick is to write a script that iterates through /sys/bus/vmbus/devices/*/channels/* and gets the value of "interrupts" whenever "cpu" is 0. For each such "interrupts" count we would want to know the instance GUID (i.e., the /sys/bus/vmbus/devices/* value) and the driver (i.e., /sys/bus/vmbus/devices/*/driver). If we monitor these interrupt counts, I would expect that one (or more?) VMBus device is taking an unexpectedly high rate of interrupts, which is in turn producing the high value of HYP.0. Learning which VMBus device it is would be the first step in narrowing down the root cause of this high HYP.0 interrupt rate. I'll try to write such a script in the next few days and post it here. But if someone else on this thread gets there first, so much the better. :-) |
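The iteration described above could be sketched roughly as follows. This is only an illustration in Python, not the script that was later posted in this thread; the function names are hypothetical, and the only paths it relies on are the sysfs ones named in the comment (`channels/*/cpu`, `channels/*/interrupts`, and the `driver` symlink).

```python
import os
from pathlib import Path


def cpu0_channel_interrupts(root="/sys/bus/vmbus/devices"):
    """Return (interrupts, guid, channel, driver) for each VMBus channel bound to CPU 0."""
    rows = []
    for dev in sorted(Path(root).glob("*")):
        link = dev / "driver"
        # The driver entry is a symlink; its target's last component is the driver name.
        driver = os.path.basename(os.path.realpath(link)) if link.exists() else "(no driver)"
        for chan in sorted((dev / "channels").glob("*")):
            try:
                cpu = int((chan / "cpu").read_text())
                count = int((chan / "interrupts").read_text())
            except (OSError, ValueError):
                continue                   # channel vanished or file unreadable
            if cpu == 0:
                rows.append((count, dev.name, chan.name, driver))
    rows.sort(reverse=True)                # highest cumulative count first
    return rows


def busiest(before, after):
    """Order channels by how many interrupts arrived between two snapshots."""
    old = {(guid, chan): count for count, guid, chan, _ in before}
    deltas = [
        (count - old.get((guid, chan), 0), guid, chan, driver)
        for count, guid, chan, driver in after
    ]
    return sorted(deltas, reverse=True)
```

Calling `cpu0_channel_interrupts()` twice a few seconds apart and feeding both results to `busiest` should put the device responsible for a high HYP.0 rate at the top of the list, assuming the sysfs layout matches the description above.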
Thanks @mhklinux for the detailed writeup - I'll add the data to the monitor script - I see a clear winner Sorted interrupt counts by device and channel
Device fd1d2cbd-ce7c-535c-966b-eb5f811c95f0 channel 5 interrupts CPU 0
Device fd1d2cbd-ce7c-535c-966b-eb5f811c95f0 is hv_storvsc - Hyper-V Storage Virtual Service Consumer
https://manpages.ubuntu.com/manpages/jammy/man4/hv_storvsc.4freebsd.html My laptop has its root SSD and a USB drive |
Since it's over 500 comments here now I've lost a bit track. I experience this daily. Is there anything I can do to help when it occurs? Some setup, diagnostics etc I can do to help pinpoint the issue. |
Just making a quick note here, as a non-technical person. Whenever this happens (which is essentially every weekday morning) I just restart Docker Desktop to clear the issue. Obviously I wish this was not necessary, but it does work, for whatever reason. |
@zmajeed Some possibilities:
You can check for (2) using the "dmesg" command in Linux. Do you see a huge number of disk I/O errors reported by hv_storvsc? If so, can you copy one of those errors and post it here? They are probably a single line each. For (1), you could do "apt install iotop" to install and run the "iotop" command. It should show which Linux processes are doing the most disk I/O. |
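If the dmesg output is long, a small filter makes check (2) quicker. The sketch below is only an illustration with made-up function names; it keeps every kernel log line that mentions `hv_storvsc` and deliberately does not assume any particular error message format.

```python
import subprocess


def storvsc_lines(dmesg_text):
    """Return the kernel log lines that mention the hv_storvsc driver."""
    return [line for line in dmesg_text.splitlines() if "hv_storvsc" in line]


def read_dmesg():
    """Run dmesg and return its output (may require root on some systems)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    return result.stdout
```

`len(storvsc_lines(read_dmesg()))` gives a rough count, and the first few entries are the ones worth pasting into the thread.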
Yeah - the high HYP.0 rate is mainly from /sys/bus/vmbus/devices/fd1d2cbd-ce7c-535c-966b-eb5f811c95f0/channels/5/interrupts
Script that generated above data
I don't see disk errors in dmesg but there is memory pressure
top and iotop activity is from Kubernetes processes - kube-apiserver, kubelet, envoy, containerd-shim-runc-v2, etcd - and systemd-journald
I'll see if increasing WSL memory helps |
Since there are no disk I/O errors in the dmesg log, I suspect some user process in Linux is generating the disk I/O requests. From "iotop", did you see if the disk I/O requests are primarily disk reads or disk writes? In the big picture, 5000 I/O requests per second isn't a problem, particularly for an SSD. An SSD intended for a laptop can probably do at least 100,000 I/Os per second, so it isn't working very hard to do 5000 per second. In Linux, file system I/O requests are usually 4096 bytes, so 5000 I/Os per second is about 20 megabytes/second, which again isn't necessarily a big problem. But the problem would be if you think the WSL Linux should be pretty much idle. Then something unexpected is generating 20 megabytes/second of disk I/O, and that's probably some user space application. If the 20 megabytes/second of I/O is mostly writes, that can sometimes be some kind of continuous error in an application that is spewing error messages into the application's log file. The log file just keeps growing. So you might want to check the available disk space in your WSL Linux to make sure the Linux disk isn't getting filled up. If the 20 megabytes/second of I/O is mostly reads, then I don't have any particular suggestions on what to look for. Hopefully the "iotop" command could tell you which user space process is generating most of the I/O requests. |
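To make the arithmetic above concrete, and to cover the suggested disk space check, here is a small Python sketch. The function names are invented for illustration; 4096 bytes per request is the typical figure quoted in the comment, not a measured value.

```python
import shutil


def throughput_mb_per_s(iops, request_bytes=4096):
    """Approximate throughput in megabytes/second for a given request rate."""
    return iops * request_bytes / 1_000_000


def free_fraction(path="/"):
    """Fraction of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


print(f"{throughput_mb_per_s(5000):.2f} MB/s")   # 5000 * 4096 bytes = 20.48 MB/s
print(f"{free_fraction('/'):.0%} of / is free")
```

If the free fraction keeps shrinking between runs while the system should be idle, that would be consistent with the runaway log file scenario described above.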
Trying out disabling fast startup on my Windows 10 laptop- thanks @jcmalone98 - startup doesn't feel any slower than before - also found this post https://enterprisesecurity.hp.com/s/article/Disabling-Windows-10-Fast-Startup - just so happens HP Wolf Security is running on my laptop - here's powershell commands to view and change fast startup settings via Windows registry Show fast startup setting
Disable fast startup - as admin
|
It's been a week since I disabled fast startup on my Windows 10 laptop - it's gone through many sleeps and holding steady so far - I wonder if @mhklinux has any thoughts on fast startup as a possible trigger for the issue - and if there's anything I could track on its impact when enabled |
So in other words you avoided the bug with WSL & hibernation by not hibernating. |
I didn't even know what "fast startup" is in Windows -- I had to go look it up. :-( It is a modified version of hibernation. When you do a shutdown, you are logged out and your apps are stopped. Once Windows is in a state with no users running, it hibernates. Then when you restart the laptop, it effectively resumes from hibernation rather than doing a fresh boot. However, if you do a "restart" instead of a shutdown, Windows really does a full shutdown and fresh reboot. So "fast startup" only affects the "shutdown" case followed by manually starting the laptop again. It definitely seems relevant that the problem (i.e., the timer interrupt storm in WSL Linux) is associated with a hibernation-based startup rather than a full fresh startup. But I don't have any insight beyond that. Hopefully someone from Microsoft with expertise on the Windows/Hyper-V side can comment. Somehow that hibernation-based startup is causing Hyper-V to get confused about the reference time in the Linux guest. |
@chris-findlay - not sure what you mean - I've never enabled or used hibernation |
Fast startup is just a different form of hibernation. |
Why would it affect sleep? Ostensibly the feature only concerns a startup after shutdown |
For what it's worth, I have had "fast startup" disabled for ever* and I do have almost daily issues with resuming from hibernate. *) when I shut down my system I want it to actually shutdown and not keep any RAM around. |
Isn't fast startup a laptop only setting? I can't even see that on my Windows 11 desktop installation.. and still I'm also affected by this issue.. it definitely doesn't help it being so random.. In my case weeks can go by without it manifesting.. |
I've never had fast startup turned on and can reliably reproduce the WSL problem after hibernation or sleep. |
It happens without fast startup or hibernate :-( |
It was good while it lasted! WSL blew up today with HVS interrupt rate approaching 500,000 per second - despite fast startup being disabled |
For anyone following along, the WSL team is doing an AMA @ 8AM PST today for Microsoft's Technical Takeoff week. Maybe, just maybe, a good dog-pile might get them to take this seriously. |
Hi folks, Wanted to add on that we are investigating this, and will post any updates to this thread as they're available. Thank you for your patience! |
@pbodnar when investigating this issue we do look through the thread, so luckily those comments aren't lost on us and we love that the community here helps upload them to troubleshoot and debug this! And agreed on the frustration, so as a gentle reminder please if you're commenting on this issue try to keep it as relevant to the issue as possible, and please keep the Code of conduct for this repo in mind when commenting. Thank you again for all of your passion on this! |
@craigloewen-msft, thanks for your kind reply. I realize that the comment I'm writing now will end as off-topic again, yet to make things clear about my suggestion above: I've expected that you probably keep all the important findings somewhere in your private corporate systems, but I think that the endless discussion thread, with title and description which are objectively out of date, just don't lead to the situation when the community could help you efficiently (and also get less frustrated). Thank you for the efforts anyway and wishing you good luck with finally resolving this issue. 🤞 |
Windows Build Number
Microsoft Windows [Version 10.0.21387.1]
WSL Version
Kernel Version
5.10.16
Distro Version
Ubuntu 20.04
Other Software
No response
Repro Steps
I'm struggling to repro this issue deterministically, but it seems to be more prominent when waking up from sleep.
With this issue I'd like to gather feedback on how to collect the most useful information to debug this behavior.
I tried to isolate the failing component (e.g. not running Docker Desktop, WSL2 "only"), but I failed miserably.
Usually my workflow:
In the past I tried to alleviate the issue with a
.wslconfig
but it didn't work.
My machine (Lenovo X1 Carbon) has 16GB of RAM, and Vmmem RAM consumption usually floats around ~4GB.
Apart from those hiccups, the WSL2 experience is pretty neat and flawless.
Expected Behavior
WSL2 stays idle when no user workloads are launched.
Actual Behavior
Vmmem "randomly" uses a high amount of CPU/power (60%-70%) for a couple of minutes (2 to 5 min) before settling down. This also happens on battery without doing anything WSL2-related (but with Docker Desktop running), killing battery life.
Diagnostic Logs
I'll provide an .etl dump when the next hiccup occurs.