
Vmmem high CPU usage #6982

Open · 1 of 2 tasks
nitroin opened this issue May 24, 2021 · 580 comments

@nitroin

nitroin commented May 24, 2021

Windows Build Number

Microsoft Windows [Version 10.0.21387.1]

WSL Version

  • [x] WSL 2
  • [ ] WSL 1

Kernel Version

5.10.16

Distro Version

Ubuntu 20.04

Other Software

No response

Repro Steps

I'm struggling to reproduce this issue deterministically, but it seems most prominent when waking up from sleep.
With this issue I'd like to gather feedback on how to collect the most useful information to debug this behavior.

I tried to isolate the failing component (e.g. running WSL2 "only", without Docker Desktop), but I failed miserably.

My usual workflow:

  • VSCode Stable v1.56.2 with WSL Remote v0.56.2 for work (usually Java + SQL)
  • VSCode Insiders v1.57.0 with WSL Remote v0.56.2 for personal projects (usually JS related)
  • Docker Desktop (Windows) v3.3.3 (64133) running (with mongo or postgres)

In the past I tried to alleviate the issue with a .wslconfig, but it didn't work.

My machine (Lenovo X1 Carbon) has 16GB of RAM and Vmmem's RAM consumption usually floats around ~4GB.

Aside from those hiccups, the WSL2 experience is pretty neat and smooth.

Expected Behavior

WSL2 stays idle when no user workloads are launched.

Actual Behavior

Vmmem "randomly" uses high cpu/power amount (60%-70%) for couple of minutes (2 to 5 min) before settling down. This also happen when on battery without doing anything WSL2 releated (but with Docker Desktop running), killing autonomy.

Diagnostic Logs

I'll provide an .etl dump when the next hiccup occurs.

@nitroin
Author

nitroin commented May 25, 2021

Today, after waking up from sleep, I managed to repro.

wsl.zip

@Matsemann

Matsemann commented May 25, 2021

I also see high vmmem CPU usage from time to time (stuck at ~25% CPU). When checking htop in my only running distro, I see that one core is being used 100%, but none of the processes are using any CPU. So I'm not sure what's happening or what the culprit is. Docker images are not running at the moment.

C:\Users\mats>wsl -l -v
  NAME                   STATE           VERSION
* Ubuntu-20.04           Running         2
  docker-desktop-data    Stopped         2
  docker-desktop         Stopped         2

I have the opposite problem with Docker, though. Suddenly it will use all available RAM even when not running any containers (but the WSL distributions are running).

Edit: a screenshot of htop when this occurs. Nothing visible is using CPU, but multiple cores show high usage anyway.
[htop screenshot]

@nitroin
Author

nitroin commented May 26, 2021

The day after, it happened again after waking from overnight sleep.

Like @Matsemann, htop does not show anything relevant.

wsl2.zip

@ll931217

> The day after, it happened again after waking from overnight sleep.
>
> Like @Matsemann, htop does not show anything relevant.
>
> wsl2.zip

The high CPU usage can be seen in Task Manager.

@Matsemann

Matsemann commented May 27, 2021

Yes, and in the htop screenshot. But you cannot see what is using the CPU (other than the generic vmmem, which can be anything). So there is something spinning somewhere, draining resources for no good reason.

@ChrisJefferson

This comment was marked as outdated.

@ChrisJefferson

ChrisJefferson commented Jun 17, 2021

Update, I can reproduce this 100% of the time by turning off my 2nd monitor (one core in Linux gets pegged to 100%, but htop doesn't show any processes using CPU power).

If I run xeyes, the mystery CPU usage disappears as soon as xeyes opens, so I assume it's something to do with the new graphics integration, and actually running a program gives it a kick?

@Matsemann

Great find! I've always experienced it after the computer has been sleeping (like when coming back after lunch), but I can also reproduce it by reconnecting a monitor, which I guess is what happens when the computer wakes from sleep.

Running xeyes or any wslg app doesn't seem to stop it here, though.

@Matsemann

> I'm hitting this too, so if anyone has any suggestions on how to capture what is going on, I am happy to record it.

Yes, if someone from the WSL team could tell us how to give them better diagnostics, that would be nice. As we've said, nothing shows up as using resources inside the WSL distros themselves, but we can see high CPU core usage in htop, or vmmem CPU usage in Task Manager. We just don't know how to see what is causing it.

@fmiqbal

fmiqbal commented Jun 19, 2021

> Update, I can reproduce this 100% of the time by turning off my 2nd monitor (one core in Linux gets pegged to 100%, but htop doesn't show any processes using CPU power).

Can confirm,

  1. wsl --shutdown, set monitor projection to a single screen
  2. Start WSL back up; htop looks normal
  3. Change monitor projection to extend
  4. One core suddenly spins up in htop

I tried messing with the refresh rate (my first monitor is 144 Hz, my second 60 Hz), but no luck. In the process I found that when I change the refresh rate, the screen refreshes and the 100% core in htop is gone.
So I tried just resetting the graphics driver (Win+Ctrl+Shift+B), and the 100% core is also gone.

Currently this is on a fresh restart and it reproduces reliably, but with only 1 core spinning up. In the past I had something like 3 cores spinning up out of nowhere. I also use Docker for Windows, which I turned off first for this test, because I've also had bad experiences with Docker hogging CPU while no containers are running.

Update
Using Docker for Windows, I get 3 cores spinning up when projecting to the second screen, and Win+Ctrl+Shift+B still works like a charm.

Windows 10, build 21390.2025
WSL2, Arch Linux
2 monitor setup

@onomatopellan

onomatopellan commented Jun 22, 2021

I think this is indeed related to WSLg, which creates a hidden "system distro" instance for each running user distro. That's why you can't see anything in htop. When the 100% core usage happens you can run this from the terminal:
wsl.exe --system -d $WSL_DISTRO_NAME top

I bet the culprit is the weston binary inside the system distro. Recently there was a similar problem that has been fixed in the weston repository.
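Building on that command, here is a minimal sketch (assuming your WSL build supports the --system flag shown above and that the system distro's top accepts the standard batch-mode flags) that takes one snapshot of the system distro's top CPU consumers, so you can see whether weston is the process pinning the core:

# Run from inside your user distro while the 100% core usage is happening.
wsl.exe --system -d "$WSL_DISTRO_NAME" top -b -n 1 | head -n 20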

@joewood

joewood commented Jun 24, 2021

As a possible confirmation of this, I had this issue continuously (with vmmem using 50% of CPU without anything running on WSL2). I disabled WSLg in wslconfig and restarted the distributions. The issue (so far) seems to be fixed.

@onomatopellan

With wsl.exe --update you will get WSLg 1.0.24 which contains some fixes about high cpu usage.

@joewood

joewood commented Jun 24, 2021

If it helps, I was already running 1.0.24 (updated yesterday) and still seeing the issue. Disabling WSLg seems to have fixed it. I'll keep testing.

@lishan89uc

lishan89uc commented Jul 2, 2021

I too can reproduce this 100% of the time by turning my second monitor on/off. Is this issue user-workaroundable, or do we have to wait for Microsoft? If so, is there a way to expedite such a serious issue?

Side note: opening xeyes only drops the CPU usage from 25% to 17%, and after I closed xeyes the CPU went back up to 25%. So far I have not found a reliable way to make it run normally again other than disabling WSLg (adding guiApplications=false to the .wslconfig in my user profile folder), doing a complete wsl shutdown and restart, or waiting it out. If anyone finds a better way to keep the functionality (preferably automatable) as a workaround, that would be great.

@fmiqbal

fmiqbal commented Jul 2, 2021

@lishan89uc have you tried Win+Ctrl+Shift+B?

@GoaLitiuM

GoaLitiuM commented Jul 5, 2021

I occasionally get this issue as well and it usually gets fixed when I shut down the WSL completely. This time I tried pressing the Win+Ctrl+Shift+B combo and that seems to have fixed the CPU usage in this instance.

@nitroin
Author

nitroin commented Jul 6, 2021

I also experienced this today and the Win + Ctrl + Shift + B trick solved the issue.

Worth mentioning that this is now happening on Windows 11:

Edition	Windows 11 Pro
Update	21H2
Installed on	‎01-‎Jul-‎21
OS build	22000.51
Experience	Windows Feature Experience Pack 421.16300.0.3

@jamestut

Also happened on mine. Consistently happens after locking the screen. My system is a laptop with internal 4K display (switched off) and external 5K display via Thunderbolt (set to second screen only).
Build 22000.71, WSL2 kernel version 5.10.43, WSL2 last updated 18 July 2021.

@letrhee

letrhee commented Jul 22, 2021

When vmmem's CPU usage is very high, in my case, stopping Docker desktop solved the problem.

  • I usually use hibernate to shut down my computer.
    And I use WSLg (and dual monitors), Docker desktop, and VSCode.

@addshore

addshore commented Aug 2, 2021

> I occasionally get this issue as well and it usually gets fixed when I shut down the WSL completely. This time I tried pressing the Win+Ctrl+Shift+B combo and that seems to have fixed the CPU usage in this instance.

This seems to work for me as a temporary fix.

I've seen this in a few different states: either 1 CPU core or 3 CPU cores all at 100%, and nothing showing up in top etc. as actually using the CPU.

Edition	Windows 11 Pro
Version	21H2
Installed on	‎29/‎07/‎2021
OS build	22000.100
Experience	Windows Feature Experience Pack 421.18901.0.3

@jibbers42

If you don't need GUI apps then you can disable WSLg while waiting for a fix. I did it a couple of days ago and it seems to have stopped the high-CPU-after-sleep issue.

Add the following to <USERPROFILE>\.wslconfig and restart the machine (or maybe wsl --shutdown is good enough).

[wsl2]
guiApplications=false

See https://github.com/microsoft/wslg#wslg-system-distro

@Mystic8b

I do not use sleep mode on my office PC; it is always on. And I have the same problem. Perhaps it appears after connecting via RDP, but I cannot confirm this yet.
@jibbers42 thanks man, I'll try it.

@benhillis benhillis self-assigned this Aug 10, 2021
@benhillis
Member

Could somebody who is hitting this provide the output of dmesg? I suspect this is WSLg-related.
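For anyone who wants to attach that, a minimal way to capture the requested log from inside the affected distro (the --ctime flag for readable timestamps is an assumption about your util-linux build; plain dmesg works too):

sudo dmesg --ctime > /tmp/wsl-dmesg.txt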

@ctataryn

Hi folks, I was having issues with vmmem causing my laptop fans to kick in after I had installed Microsoft PowerToys when I needed to do some batch image resizing. If I quit "Microsoft PowerToys v0.43.0" from the task tray then vmmem went back to normal.

I hope this is relevant and helps someone else in a similar predicament.

@Mystic8b

I also use PowerToys, but at the moment the CPU load is normal so I cannot confirm. Interesting :)

@ctataryn

> I also use PowerToys, but at the moment the CPU load is normal so I cannot confirm. Interesting :)

Another thing I notice is that vmmem won't actually go away until I issue another wsl --shutdown after quitting PowerToys.

Steps:

  1. Quit Docker Desktop from task tray
  2. from cmd.exe: wsl --shutdown
  3. Observe vmmem is still running as a process and taking about 5% CPU
  4. Quit PowerToys from task tray
  5. Observe vmmem is still running, however CPU usage is 0%
  6. from cmd.exe, again issue: wsl --shutdown
  7. Observe that vmmem is no longer present in the Task Manager

@fmiqbal

fmiqbal commented Aug 18, 2021

@ctataryn sorry, can't confirm. As soon as I shut down WSL, vmmem is gone.
[screenshot]

@jared-is-ray

I see high CPU, sustained 50-70% after resuming from sleep on Windows 11 21H2 22000.2538. WSL becomes completely unresponsive though, so I'm not able to open a terminal to run the script mentioned.

@zmajeed

zmajeed commented Nov 9, 2023

I'm seeing high load from vmmemwsl on Windows 11 too - it's not bad enough to render my laptop useless or even cause WSL to freeze - but I think it impacts things running inside WSL - I feel it causes minikube tunnel to lose connectivity and give this error

E1109 10:42:33.067502 3998570 ssh_tunnel.go:84] error listing ingresses: Get "https://127.0.0.1:32779/apis/networking.k8s.io/v1/ingresses": http2: client connection lost
E1109 10:42:44.862831 3998570 ssh_tunnel.go:79] error listing services: Get "https://127.0.0.1:32779/api/v1/services": net/http: TLS handshake timeout

Running my interrupts monitor script - https://github.com/zmajeed/simpleprog/blob/zmajeed_general/wsl_stats/wsl_interrupt_stats.sh - it does not show the extreme HVS interrupt rate seen on Windows 10, but it's relatively high and there are other interrupts that seem high

wsl_interrupt_stats.sh -n4

2023-11-09_10:35:53.985916_CST: Start WSL interrupt stats, frequency_secs 10, num_iterations 4, hvs_rate_thresholds []

2023-11-09_10:35:53.987013_CST: System info
2023-11-09_10:35:53.987665_CST: uname
Linux 5.15.90.1-microsoft-standard-WSL2 #1 SMP Fri Jan 27 02:56:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

2023-11-09_10:35:53.989335_CST: WSL version info
WSL version: 1.2.5.0
Kernel version: 5.15.90.1
WSLg version: 1.0.51
MSRDC version: 1.2.3770
Direct3D version: 1.608.2-61064218
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22621.2428

2023-11-09_10:35:54.147511_CST: Available clocksource
==> /sys/devices/system/clocksource/clocksource0/available_clocksource <==
tsc hyperv_clocksource_tsc_page

2023-11-09_10:35:54.149364_CST: Current clocksource
==> /sys/devices/system/clocksource/clocksource0/current_clocksource <==
tsc

2023-11-09_10:36:04.883775_CST: Iteration 1:
interval_sec 10.732084
current_clocksource tsc

2023-11-09_10:36:04.887469_CST: Iteration 1: IRQ diff stats
irq_cpu       rate_per_sec   pct_change_in_rate          change       new_count       old_count
-------       ------------   ------------------          ------       ---------       ---------
HYP.0              3640.30                  Inf           39068        64971385        64932317
CAL.3              1344.01                  Inf           14424        69604307        69589883
CAL.2              1324.91                  Inf           14219        71084127        71069908
CAL.1              1259.12                  Inf           13513        69468278        69454765
HVS.2              1209.27                  Inf           12978        90338906        90325928
HVS.0              1199.21                  Inf           12870        90152803        90139933
HVS.3              1114.69                  Inf           11963        83908378        83896415
HVS.1              1111.71                  Inf           11931        83021790        83009859
CAL.0               819.60                  Inf            8796        63782431        63773635
RES.3                60.47                  Inf             649         4326012         4325363
RES.1                56.47                  Inf             606         4474569         4473963
RES.2                51.16                  Inf             549         4477209         4476660
RES.0                24.23                  Inf             260         2915193         2914933
32.2                  0.37                  Inf               4          153285          153281
HYP.1                 0.19                  Inf               2          363743          363741

2023-11-09_10:36:15.087146_CST: Iteration 2:
interval_sec 10.203455
current_clocksource tsc

2023-11-09_10:36:15.090491_CST: Iteration 2: IRQ diff stats
irq_cpu       rate_per_sec   pct_change_in_rate          change       new_count       old_count
-------       ------------   ------------------          ------       ---------       ---------
HYP.0              9228.34               153.50           94161        65065546        64971385
CAL.3              2895.39               115.43           29543        69633850        69604307
CAL.2              2851.19               115.20           29092        71113219        71084127
CAL.1              2511.50                99.46           25626        69493904        69468278
HVS.0              1462.54                21.96           14923        90167726        90152803
HVS.3              1231.05                10.44           12561        83920939        83908378
HVS.1              1203.12                 8.22           12276        83034066        83021790
HVS.2              1180.48                -2.38           12045        90350951        90338906
CAL.0              1010.15                23.25           10307        63792738        63782431
RES.2                58.22                13.80             594         4477803         4477209
RES.1                57.73                 2.23             589         4475158         4474569
RES.3                45.28               -25.12             462         4326474         4326012
RES.0                40.18                65.83             410         2915603         2915193

2023-11-09_10:36:25.284642_CST: Iteration 3:
interval_sec 10.197577
current_clocksource tsc

2023-11-09_10:36:25.290028_CST: Iteration 3: IRQ diff stats
irq_cpu       rate_per_sec   pct_change_in_rate          change       new_count       old_count
-------       ------------   ------------------          ------       ---------       ---------
HYP.0              8244.61               -10.66           84075        65149621        65065546
CAL.3              2566.20               -11.37           26169        69660019        69633850
CAL.2              2494.02               -12.53           25433        71138652        71113219
CAL.1              2472.84                -1.54           25217        69519121        69493904
HVS.0              1434.26                -1.93           14626        90182352        90167726
HVS.2              1223.33                 3.63           12475        90363426        90350951
HVS.1              1109.48                -7.78           11314        83045380        83034066
HVS.3              1091.24               -11.36           11128        83932067        83920939
CAL.0               884.33               -12.46            9018        63801756        63792738
RES.2                34.81               -40.21             355         4478158         4477803
RES.1                31.38               -45.64             320         4475478         4475158
RES.3                29.61               -34.61             302         4326776         4326474
RES.0                25.10               -37.53             256         2915859         2915603

On the Windows side I see these Hyper-V performance counter changes

cmd                                   pct_cpu
---                                   -------
idle                                    95.12
vmmemwsl                                 4.37
chrome                                   0.38
wmiprvse                                 0.23

name                                    rate_per_sec pct_change   change new_value old_value source group
----                                    ------------ ----------   ------ --------- --------- ------ -----
Other Intercepts Cost                       13494.65     191.82 17300.55  26319.70   9019.15 _total Hyper-V Hypervisor Root Virtual Processor
VTL1 Average Run Time                        3147.35      20.56  4035.00  23660.00  19625.00 _total Hyper-V Hypervisor Root Virtual Processor
HLT Instructions Cost                        2901.77      54.07  3720.15  10600.67   6880.52 _total Hyper-V Hypervisor Virtual Processor
MSR Accesses/sec                             1587.83       8.98  2035.64  24705.36  22669.72 _total Hyper-V Hypervisor Root Virtual Processor
Total Intercepts/sec                         1554.36       5.40  1992.74  38895.59  36902.85 _total Hyper-V Hypervisor Root Virtual Processor
Hypercalls Cost                               904.25      86.59  1159.28   2498.11   1338.83 _total Hyper-V Hypervisor Virtual Processor
CPU Wait Time Per Dispatch                    850.53       5.29  1090.40  21693.69  20603.29 _total Hyper-V Hypervisor Root Virtual Processor
APIC MMIO Accesses/sec                        837.79     149.46  1074.08   1792.70    718.62 _total Hyper-V Hypervisor Root Virtual Processor
Total Interrupts/sec                          825.55       8.89  1058.38  12968.58  11910.20 _total Hyper-V Hypervisor Logical Processor
C1 Transitions/sec                            765.45      66.17   981.32   2464.37   1483.05 _total Hyper-V Hypervisor Logical Processor
External Interrupts/sec                       712.60      62.90   913.57   2366.08   1452.51 _total Hyper-V Hypervisor Root Virtual Processor
Hardware Interrupts/sec                       692.30      51.38   887.56   2614.93   1727.38 _total Hyper-V Hypervisor Logical Processor
Performance Monitoring Interrupts/sec         419.82     153.63   538.22    888.55    350.33 _total Hyper-V Hypervisor Root Virtual Processor
Performance Monitoring Interrupts/sec         416.17     152.30   533.54    883.87    350.33 _total Hyper-V Hypervisor Logical Processor
Other Intercepts Cost                         371.98      17.25   476.89   3241.12   2764.23 _total Hyper-V Hypervisor Virtual Processor
APIC IPIs Sent/sec                            305.65       5.31   391.85   7767.57   7375.72 _total Hyper-V Hypervisor Root Virtual Processor
C2 Transitions/sec                            240.92      39.43   308.86   1092.16    783.29 _total Hyper-V Hypervisor Logical Processor
Hardware Interrupts/sec                       225.75      26.96   289.42   1362.86   1073.44 _total Hyper-V Hypervisor Root Virtual Processor
Total Intercepts Cost                         218.79      15.16   280.50   2130.52   1850.02 _total Hyper-V Hypervisor Virtual Processor
Inter-Processor Interrupts/sec                192.78      43.33   247.15    817.56    570.40 _total Hyper-V Hypervisor Logical Processor
External Interrupts/sec                        85.34       3.68   109.41   3079.10   2969.69 _total Hyper-V Hypervisor Virtual Processor
Scheduler Interrupts/sec                       61.97       1.09    79.45   7368.94   7289.49 _total Hyper-V Hypervisor Logical Processor
Posted Interrupt Notifications/sec             40.89      17.06    52.42    359.63    307.21 _total Hyper-V Hypervisor Logical Processor
External Interrupts Cost                       39.78       2.68    50.99   1953.69   1902.69 _total Hyper-V Hypervisor Virtual Processor
Posted Interrupt Notifications/sec             33.50      13.94    42.94    351.05    308.11 _total Hyper-V Hypervisor Root Virtual Processor
Virtual MMU Hypercalls/sec                     33.06     428.97    42.39     52.27      9.88 _total Hyper-V Hypervisor Root Virtual Processor
Global GVA Range Flushes/sec                   24.58     877.01    31.51     35.11      3.59 _total Hyper-V Hypervisor Root Virtual Processor
MSR Accesses Cost                              16.61       3.95    21.30    560.22    538.92 _total Hyper-V Hypervisor Virtual Processor
Synthetic Interrupts/sec                        9.61       1.61    12.32    778.55    766.23 _total Hyper-V Hypervisor Virtual Processor
Other Intercepts/sec                            7.23      21.95     9.27     51.49     42.22 _total Hyper-V Hypervisor Root Virtual Processor
Interrupts Sent/sec                             6.44       1.07     8.26    781.67    773.41 unique Hyper-V Virtual Machine Bus
External Interrupts Cost                        6.00       0.65     7.69   1190.37   1182.68 _total Hyper-V Hypervisor Root Virtual Processor
Local Flushed GVA Ranges/sec                    5.29     377.65     6.78      8.58      1.80 _total Hyper-V Hypervisor Root Virtual Processor
VTL1 Dispatches/sec                             3.28     117.11     4.21      7.80      3.59 _total Hyper-V Hypervisor Root Virtual Processor
Virtual Processor Hypercalls/sec                2.34     334.23     3.00      3.90      0.90 _total Hyper-V Hypervisor Root Virtual Processor
Address Space Flushes/sec                       2.16     102.64     2.77      5.46      2.69 _total Hyper-V Hypervisor Root Virtual Processor
Address Domain Flushes/sec                      1.55      73.69     1.99      4.68      2.69 _total Hyper-V Hypervisor Root Virtual Processor
% Guest Run Time                                1.53      52.16     1.96      5.73      3.77 _total Hyper-V Hypervisor Root Virtual Processor
% Total Run Time                                1.52      45.82     1.95      6.21      4.26 _total Hyper-V Hypervisor Root Virtual Processor
% Guest-Relative Utilization                    1.43      38.30     1.83      6.62      4.79 _total Hyper-V Hypervisor Root Virtual Processor
Interrupts Received/sec                         1.34       0.40     1.72    427.50    425.78 unique Hyper-V Virtual Machine Bus
% Guest Run Time                                0.81       7.35     1.03     15.10     14.07 _total Hyper-V Hypervisor Logical Processor
% Total Run Time                                0.75       6.52     0.96     15.73     14.77 _total Hyper-V Hypervisor Logical Processor
% C2 Time                                       0.53      24.77     0.68      3.42      2.74 _total Hyper-V Hypervisor Logical Processor
Receive QoS - Exempt Messages/sec               0.52      73.69     0.66      1.56      0.90 _total Hyper-V VM Virtual Device Pipe IO
Synthetic Interrupt Hypercalls/sec              0.52      73.69     0.66      1.56      0.90 _total Hyper-V Hypervisor Virtual Processor
% C1 Time                                       0.20       8.66     0.26      3.29      3.03 _total Hyper-V Hypervisor Logical Processor
% VTL1 Run Time                                 0.00     160.21     0.00      0.00      0.00 _total Hyper-V Hypervisor Root Virtual Processor
% Hypervisor Run Time                          -0.01      -2.76    -0.01      0.48      0.49 _total Hyper-V Hypervisor Root Virtual Processor
% Hypervisor Run Time                          -0.06     -10.16    -0.07      0.63      0.70 _total Hyper-V Hypervisor Logical Processor
% Hypervisor Run Time                          -0.08     -27.56    -0.10      0.26      0.36 _total Hyper-V Hypervisor Virtual Processor
Posted Interrupt Notifications/sec             -0.37     -13.15    -0.47      3.12      3.59 _total Hyper-V Hypervisor Virtual Processor
% Guest-Relative Utilization                   -0.63      -1.55    -0.81     51.28     52.09 _total Hyper-V Hypervisor Virtual Processor
% C3 Time                                      -1.48      -2.40    -1.90     77.55     79.46 _total Hyper-V Hypervisor Logical Processor
% Guest Run Time                               -1.51      -3.80    -1.93     49.01     50.94 _total Hyper-V Hypervisor Virtual Processor
% Total Run Time                               -1.59      -3.96    -2.03     49.27     51.30 _total Hyper-V Hypervisor Virtual Processor
Address Space Flushes/sec                      -2.10    -100.00    -2.69      0.00      2.69 _total Hyper-V Hypervisor Virtual Processor
APIC EOI Accesses/sec                          -2.14      -7.10    -2.74     35.89     38.63 _total Hyper-V Hypervisor Root Virtual Processor
Posted Interrupt Scans/sec                     -3.17     -56.58    -4.07      3.12      7.19 _total Hyper-V Hypervisor Virtual Processor
MSR Accesses Cost                              -3.45      -0.41    -4.42   1085.51   1089.93 _total Hyper-V Hypervisor Root Virtual Processor
% Idle Time                                    -4.04      -6.01    -5.19     81.04     86.23 _total Hyper-V Hypervisor Logical Processor
Memory Intercept Messages/sec                  -4.20    -100.00    -5.39      0.00      5.39 _total Hyper-V Hypervisor Virtual Processor
Total Messages/sec                             -4.20    -100.00    -5.39      0.00      5.39 _total Hyper-V Hypervisor Virtual Processor
Nested Page Fault Intercepts/sec               -4.20    -100.00    -5.39      0.00      5.39 _total Hyper-V Hypervisor Virtual Processor
Virtual TLB Flush Entires/sec                  -8.79     -12.30   -11.27     80.35     91.62 _total Hyper-V Hypervisor Partition
Posted Interrupt Scans/sec                    -10.12      -7.64   -12.97    156.80    169.77 _total Hyper-V Hypervisor Root Virtual Processor
GPA Space Hypercalls/sec                      -13.00     -17.17   -16.66     80.35     97.01 _total Hyper-V Hypervisor Root Virtual Processor
Logical Processor Hypercalls/sec              -13.75      -5.87   -17.62    282.40    300.02 _total Hyper-V Hypervisor Root Virtual Processor
Inter-Processor Interrupts Sent/sec           -14.50      -3.93   -18.58    454.81    473.39 _total Hyper-V Hypervisor Logical Processor
Other Intercepts/sec                          -30.88     -32.17   -39.59     83.47    123.06 _total Hyper-V Hypervisor Virtual Processor
Long Spin Wait Hypercalls/sec                 -32.19     -33.29   -41.27     82.69    123.96 _total Hyper-V Hypervisor Virtual Processor
Address Domain Flushes/sec                    -42.99     -13.94   -55.11    340.13    395.24 _total Hyper-V Hypervisor Virtual Processor
CPU Wait Time Per Dispatch                    -59.60      -3.69   -76.41   1992.61   2069.02 _total Hyper-V Hypervisor Virtual Processor
Hypercalls Cost                               -62.24      -1.43   -79.79   5519.31   5599.10 _total Hyper-V Hypervisor Root Virtual Processor
Timer Interrupts/sec                         -121.51      -6.71  -155.78   2167.15   2322.93 _total Hyper-V Hypervisor Logical Processor
Total Intercepts Cost                        -149.33      -7.20  -191.45   2467.48   2658.92 _total Hyper-V Hypervisor Root Virtual Processor
Virtual Interrupt Hypercalls/sec             -149.76     -24.88  -191.99    579.62    771.62 _total Hyper-V Hypervisor Root Virtual Processor
MSR Accesses/sec                             -197.26     -19.02  -252.89   1076.55   1329.44 _total Hyper-V Hypervisor Virtual Processor
CPUID Instructions Cost                      -269.13    -100.00  -345.04      0.00    345.04 _total Hyper-V Hypervisor Virtual Processor
Virtual Interrupt Hypercalls/sec             -284.34     -76.57  -364.53    111.56    476.08 _total Hyper-V Hypervisor Virtual Processor
Synthetic Interrupts/sec                     -351.60     -20.89  -450.77   1706.88   2157.65 _total Hyper-V Hypervisor Root Virtual Processor
HLT Instructions/sec                         -361.07     -66.49  -462.91    233.25    696.16 _total Hyper-V Hypervisor Virtual Processor
Hypervisor Branch Predictor Flushes/sec      -532.66     -24.17  -682.88   2142.19   2825.07 _total Hyper-V Hypervisor Logical Processor
CPUID Instructions/sec                       -536.71    -100.00  -688.08      0.00    688.08 _total Hyper-V Hypervisor Virtual Processor
Logical Processor Migrations/sec             -559.53     -47.34  -717.33    798.05   1515.39 _total Hyper-V Hypervisor Virtual Processor
Other Hypercalls/sec                         -602.33      -6.68  -772.21  10779.58  11551.79 _total Hyper-V Hypervisor Root Virtual Processor
Logical Processor Dispatches/sec             -702.35      -8.20  -900.43  10078.26  10978.69 _total Hyper-V Hypervisor Root Virtual Processor
Hypercalls/sec                               -748.06      -7.53  -959.03  11775.79  12734.82 _total Hyper-V Hypervisor Root Virtual Processor
Local Flushed GVA Ranges/sec                 -847.80    -100.00 -1086.91      0.00   1086.91 _total Hyper-V Hypervisor Virtual Processor
Virtual MMU Hypercalls/sec                   -892.89     -77.09 -1144.72    340.13   1484.85 _total Hyper-V Hypervisor Virtual Processor
Logical Processor Dispatches/sec             -993.43     -35.00 -1273.61   2365.30   3638.90 _total Hyper-V Hypervisor Virtual Processor
C3 Transitions/sec                          -1117.00     -25.11 -1432.02   4271.11   5703.14 _total Hyper-V Hypervisor Logical Processor
Hypercalls/sec                              -1176.71     -76.90 -1508.58    453.24   1961.83 _total Hyper-V Hypervisor Virtual Processor
Context Switches/sec                        -1623.53      -9.23 -2081.41  20478.71  22560.13 _total Hyper-V Hypervisor Logical Processor
Total Intercepts/sec                        -2221.50     -36.64 -2848.03   4925.63   7773.65 _total Hyper-V Hypervisor Virtual Processor
Nested Page Fault Intercepts Cost           -2964.05    -100.00 -3800.00      0.00   3800.00 _total Hyper-V Hypervisor Virtual Processor

@mhklinux

mhklinux commented Nov 10, 2023

I'm the same person as "kelleymh" who previously commented, but now as "mhklinux". I have officially retired from Microsoft, so I don't have internal Microsoft connections like I did a month ago. But I'm independently contributing to Linux kernel work, and am interested in helping to get to the root cause of these WSL issues.

The data you've collected shows a problem that's different from the synthetic timer interrupt storm. From the Linux side, the HYP.0 interrupt rate definitely seems high. The HYP interrupts are almost exclusively associated with VMBus devices such as the synthetic SCSI controller, synthetic network interface, the memory balloon synthetic device, and various others. Each such device is identified by an instance GUID. You can see the list of devices and their drivers with this Linux command: "ls -ls /sys/bus/vmbus/devices/*/driver". Whenever the Hyper-V host needs to send the guest an interrupt for such a device, the HYP interrupt is sent. In Linux, the HYP interrupt handler de-multiplexes by looking at some additional data structures to determine which VMBus device the interrupt belongs to, and then runs the interrupt handler for that VMBus device.

So my first thought is to determine which VMBus device is receiving the high number of interrupts. The picture is slightly more complex because the interrupts are actually associated with VMBus channels. Most VMBus devices have a single channel, so a channel and a device are equivalent. For these devices, the HYP interrupt always comes to CPU 0. But some VMBus devices (the synthetic SCSI controller and synthetic NIC) have multiple channels that can interrupt multiple CPUs in parallel, only one of which might be CPU 0.

Altogether, the cumulative interrupt count for a VMBus channel is available in /sys/bus/vmbus/devices/*/channels/*/interrupts. The CPU that a channel will interrupt is similarly available at /sys/bus/vmbus/devices/*/channels/*/cpu. So the trick is to write a script that iterates through /sys/bus/vmbus/devices/*/channels/* and gets the value of "interrupts" whenever "cpu" is 0. For each such "interrupt" count we would want to know the instance GUID (i.e., the /sys/bus/vmbus/devices/* value) and the driver (i.e., /sys/bus/vmbus/devices/*/driver). If we monitor these interrupts counts, I would expect that one (or more?) VMBus device is taking an unexpectedly high rate of interrupts, which is in turn producing the high value of HYP.0. Learning which VMBus device is the first step in narrowing down the root cause of this high HYP.0 interrupt rate.

I'll try to write such a script in the next few days and post it here. But if someone else on this thread gets there first, so much the better. :-)
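In the meantime, here is a rough sketch of what is described above (my own attempt, not the script mhklinux intends to post; it assumes the standard vmbus sysfs layout with per-channel "interrupts" and "cpu" files mentioned earlier):

# For every VMBus channel bound to CPU 0, print its cumulative interrupt count,
# the device instance GUID, and the bound driver, highest count first.
for chan in /sys/bus/vmbus/devices/*/channels/*; do
  [ -r "$chan/cpu" ] && [ "$(cat "$chan/cpu")" -eq 0 ] || continue
  dev=${chan%/channels/*}                  # /sys/bus/vmbus/devices/<GUID>
  guid=${dev##*/}
  driver=$(basename "$(readlink -f "$dev/driver" 2>/dev/null)" 2>/dev/null)
  printf '%12s  guid=%s  driver=%s\n' "$(cat "$chan/interrupts")" "$guid" "${driver:-none}"
done | sort -nr | head

Sampling this twice a few seconds apart and diffing the counts should show which VMBus device is driving the high HYP.0 rate.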

@zmajeed

zmajeed commented Nov 11, 2023

Thanks @mhklinux for the detailed writeup - I'll add the data to the monitor script - I see a clear winner

Sorted interrupt counts by device and channel

head -10 /sys/bus/vmbus/devices/*/channels/*/interrupts | sed '/^$/d' | paste - - | sort -nr -k4,4 | head -5
==> /sys/bus/vmbus/devices/fd1d2cbd-ce7c-535c-966b-eb5f811c95f0/channels/5/interrupts <==       111374556
==> /sys/bus/vmbus/devices/932e2acb-5576-43a8-b727-e4fca57cb084/channels/15/interrupts <==      369195
==> /sys/bus/vmbus/devices/2dd1ce17-079e-403c-b352-a1921ee207ee/channels/4/interrupts <==       106788
==> /sys/bus/vmbus/devices/a5014576-facb-11e6-bd58-64006a7986d3/channels/21/interrupts <==      8031
==> /sys/bus/vmbus/devices/97e112b8-7c63-11ee-9907-f4ce2392c03d/channels/41/interrupts <==      5897

Device fd1d2cbd-ce7c-535c-966b-eb5f811c95f0 channel 5 interrupts CPU 0

head -10 -v /sys/bus/vmbus/devices/*/channels/5/cpu
==> /sys/bus/vmbus/devices/fd1d2cbd-ce7c-535c-966b-eb5f811c95f0/channels/5/cpu <==
0

Device fd1d2cbd-ce7c-535c-966b-eb5f811c95f0 is hv_storvsc - Hyper-V Storage Virtual Service Consumer

ls -l /sys/bus/vmbus/devices/fd1d2cbd-ce7c-535c-966b-eb5f811c95f0/driver
lrwxrwxrwx 1 root root 0 Nov 11 09:42 /sys/bus/vmbus/devices/fd1d2cbd-ce7c-535c-966b-eb5f811c95f0/driver -> ../../../../../../bus/vmbus/drivers/hv_storvsc

https://manpages.ubuntu.com/manpages/jammy/man4/hv_storvsc.4freebsd.html

My laptop has its root SSD and a USB drive
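If it helps anyone doing the same mapping: the block devices served by that storvsc instance should appear as SCSI children under the vmbus device node (a sketch assuming the usual sysfs layout; the GUID is the one from the output above):

ls -d /sys/bus/vmbus/devices/fd1d2cbd-ce7c-535c-966b-eb5f811c95f0/host*/target*/*/block/*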

@Matsemann

Since there are over 500 comments here now, I've lost track a bit. I experience this daily. Is there anything I can do to help when it occurs? Some setup, diagnostics etc. I can do to help pinpoint the issue?

@Bilge

Bilge commented Nov 11, 2023

Just making a quick note here, as a non-technical person. Whenever this happens (which is essentially every weekday morning) I just restart Docker Desktop to clear the issue. Obviously I wish this was not necessary, but it does work, for whatever reason.

@mhklinux

@zmajeed
Just to confirm, the "interrupts" value for that channel is increasing at the 8000 to 9000 per second rate that roughly matches the HYP.0 interrupt rate? The storvsc driver is the Linux driver for the Hyper-V synthetic SCSI controller, so that high interrupt rate indicates a high rate of disk I/O operations being submitted by Linux to Hyper-V. Hyper-V performs the disk I/O operation, puts a completion message in the VMBus channel back to the Linux guest, and then interrupts the guest.

Some possibilities:

  1. Something in Linux has gotten stuck in a loop doing lots of disk I/O, and these are real disk I/O operations submitted by Linux.
  2. I've seen cases where an error occurs on the Hyper-V side when performing a disk I/O operation on behalf of the Linux guest, and Hyper-V returns a bogus error status that causes the Linux driver to get stuck in a loop resubmitting the same I/O request over-and-over again.
  3. Something else has gone wrong with the basic handshake between the Linux guest and Hyper-V, resulting in Hyper-V sending lots of unnecessary interrupts. To me, this seems less likely than (1) or (2).

You can check for (2) using the "dmesg" command in Linux. Do you see a huge number of disk I/O errors reported by hv_storvsc? If so, can you copy one of those errors and post it here? They are probably a single line each.

For (1), you could do "apt install iotop" to install and run the "iotop" command. It should show which Linux processes are doing the most disk I/O.
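A couple of concrete commands for those two checks, assuming a Debian/Ubuntu-based distro (adjust the package manager otherwise):

sudo dmesg | grep -i storvsc | tail -n 20   # possibility (2): repeated hv_storvsc I/O errors?
sudo apt install -y iotop
sudo iotop -o -a                            # possibility (1): only show processes doing I/O, with accumulated totals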

@zmajeed

zmajeed commented Nov 11, 2023

Yeah - the high HYP.0 rate is mainly from /sys/bus/vmbus/devices/fd1d2cbd-ce7c-535c-966b-eb5f811c95f0/channels/5/interrupts

 change_per_sec           change           oldval           newval   secs
           3774            18874        121973224        121992098      5
           5729             5729        121992098        121997827      1
           5253             5253        121997827        122003080      1
           5332             5332        122003080        122008412      1
           5215             5215        122008412        122013627      1
           5349             5349        122013627        122018976      1
           5918             5918        122018976        122024894      1
           7310             7310        122024894        122032204      1
           2576             2576        122032204        122034780      1
           1327             1327        122034780        122036107      1

Script that generated the above data:

printf "%15s  %15s  %15s  %15s  %5s\n" change_per_sec change oldval newval secs
for i in {0..10}; do 
  newtime=$SECONDS
  newval=$(</sys/bus/vmbus/devices/fd1d2cbd-ce7c-535c-966b-eb5f811c95f0/channels/5/interrupts)
  if ((i > 0)); then
    change=$((newval - oldval))
    secs=$((newtime - oldtime))
    rate=$((change/secs))
    printf "%15d  %15d  %15d  %15d  %5d\n" $rate $change $oldval $newval $secs
  fi
  oldtime=$newtime
  oldval=$newval
  sleep 1
done

I don't see disk errors in dmesg but there is memory pressure

[537133.623377] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=bcd9e58530d8010ba1a2b5411c216f0dd25cb5817f1a4b9d618f537da8a8e366,mems_allowed=0,oom_memcg=/docker/bcd9e58530d8010ba1a2b5411c216f0dd25cb5817f1a4b9d618f537da8a8e366,task_memcg=/docker/bcd9e58530d8010ba1a2b5411c216f0dd25cb5817f1a4b9d618f537da8a8e366/kubepods/burstable/pode3c40779-07f3-4221-bff2-d0d4ea6ff91f/e5830c17bcc660f2f0c8ec1f7c52cd53b349580830a95b2095bafbbee1c1e6cd,task=python3,pid=72589,uid=0
[537133.623392] Memory cgroup out of memory: Killed process 72589 (python3) total-vm:2880636kB, anon-rss:1212328kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:4816kB oom_score_adj:935
[537133.774455] potentially unexpected fatal signal 6.
[537133.774469] CPU: 1 PID: 29620 Comm: systemd-journal Not tainted 5.15.90.1-microsoft-standard-WSL2 #1
[537134.148399] systemd-journald[1062182]: File /run/log/journal/cb20689e1f7346d58b19cbaa056bc6d0/system.journal corrupted or uncleanly shut down, renaming and replacing.

top and iotop activity is from Kubernetes processes - kube-apiserver, kubelet, envoy, containerd-shim-runc-v2, etcd - and systemd-journald

top - 12:13:28 up 6 days,  8:21,  2 users,  load average: 57.19, 108.22, 117.15
Tasks: 183 total,  14 running, 169 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.5 us, 51.5 sy,  0.0 ni,  0.0 id, 47.5 wa,  0.0 hi,  0.5 si,  0.0 st
MiB Mem :   3920.8 total,    206.2 free,   2851.4 used,    863.2 buff/cache
MiB Swap:   4096.0 total,   1456.6 free,   2639.4 used.    780.2 avail Mem

I'll see if increasing WSL memory helps
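For reference, the memory cap is raised the same way WSLg was disabled earlier in this thread, via <USERPROFILE>\.wslconfig followed by wsl --shutdown (8GB below is just an example value, not a recommendation):

[wsl2]
memory=8GB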

@mhklinux

Since there are no disk I/O errors in the dmesg log, I suspect some user process in Linux is generating the disk I/O requests.

From "iotop", did you see if the disk I/O requests are primarily disk reads or disk writes? In the big picture, 5000 I/O requests per second isn't a problem, particularly for an SSD. An SSD intended for a laptop can probably do at least 100,000 I/Os per second, so it isn't working very hard to do 5000 per second. In Linux, file system I/O requests are usually 4096 bytes, so 5000 I/Os per second is about 20 Megabyte/second, which again isn't necessarily a big problem. But the problem would be if you think the WSL Linux should be pretty much idle. Then something unexpected is generating 20 Mbytes/second of disk I/O, and that's probably some user space application. If the 20 Mbytes/second of I/O is mostly writes, that can sometimes be some kind of continuous error in an application that is spewing error messages into the application's log file. The log file just keeps growing. So you might want to check the available disk space in your WSL Linux to make sure the Linux disk isn't getting filled up. If the 20 Mbytes/second of I/O is mostly reads, then I don't have any particular suggestions on what to look for. Hopefully the "iotop" command could tell you which user space process is generating most of the I/O requests.

@jcmalone98

My team has been running into this issue for a while. We're using Rancher Desktop and we frequently see Vmmem steadily consume more and more CPU until it gets to 100%. Once Vmmem hits 100%, it makes our Rancher instance unresponsive.

We recently changed a few settings under "Define power buttons settings": Control Panel -> System and Security -> Power Options -> Change what the power buttons do. This may not be available on all devices depending on what you're working with, but we unchecked "Turn on fast startup" and "Sleep". We're still working to determine which one fixes the Vmmem issue, but that seems to fix the problem for us.

We were able to consistently reproduce our issue by simply shutting down and starting back up, so it seems like the problem has something to do with Rancher Desktop getting shut down ungracefully. But after this change we haven't seen Vmmem spin out of control, and Rancher Desktop seems to be staying up consistently. Hope this helps. I'll reply back if I find that it doesn't actually fix the issue, but so far so good.
[screenshot: power button settings]

@zmajeed

zmajeed commented Nov 14, 2023

Trying out disabling fast startup on my Windows 10 laptop - thanks @jcmalone98 - startup doesn't feel any slower than before - also found this post https://enterprisesecurity.hp.com/s/article/Disabling-Windows-10-Fast-Startup - it just so happens HP Wolf Security is running on my laptop - here are the PowerShell commands to view and change the fast startup setting via the Windows registry

Show fast startup setting

powershell 'gp "hklm:/system/currentcontrolset/control/session manager/power" -name hiberbootenabled'

Disable fast startup - as admin

powershell 'sp "hklm:/system/currentcontrolset/control/session manager/power" -name hiberbootenabled -value 0'

@zmajeed

zmajeed commented Nov 21, 2023

It's been a week since I disabled fast startup on my Windows 10 laptop - it's gone through many sleeps and is holding steady so far - I wonder if @mhklinux has any thoughts on fast startup as a possible trigger for the issue - and if there's anything I could track on its impact when enabled

@chris-findlay

> It's been a week since I disabled fast startup on my Windows 10 laptop - it's gone through many sleeps and is holding steady so far - I wonder if @mhklinux has any thoughts on fast startup as a possible trigger for the issue - and if there's anything I could track on its impact when enabled

So in other words you avoided the bug with WSL & hibernation by not hibernating.

@mhklinux

I didn't even know what "fast startup" is in Windows -- I had to go look it up. :-( It is a modified version of hibernation. When you do a shutdown, you are logged out and your apps are stopped. Once Windows is in a state with no users running, it hibernates. Then when you restart the laptop, it effectively resumes from hibernation rather than doing a fresh boot. However, if you do a "restart" instead of a shutdown, Windows really does a full shutdown and fresh reboot. So "fast startup" only affects the "shutdown" case followed by manually starting the laptop again.

It definitely seems relevant that the problem (i.e., the timer interrupt storm in WSL Linux) is associated with a hibernation-based startup rather than a full fresh startup. But I don't have any insight beyond that. Hopefully someone from Microsoft with expertise on the Windows/Hyper-V side can comment. Somehow that hibernation-based startup is causing Hyper-V to get confused about the reference time in the Linux guest.

@zmajeed

zmajeed commented Nov 21, 2023

@chris-findlay - not sure what you mean - I've never enabled or used hibernation

@jeffska

jeffska commented Nov 22, 2023

> @chris-findlay - not sure what you mean - I've never enabled or used hibernation

Fast startup is just a different form of hibernation.

@zmajeed

zmajeed commented Nov 22, 2023

Why would it affect sleep? Ostensibly the feature only concerns startup after a shutdown.

@elmuerte

For what it's worth, I have had "fast startup" disabled forever* and I do have almost daily issues with resuming from hibernate.

*) When I shut down my system I want it to actually shut down and not keep any RAM state around.

@fonziemedia

> It's been a week since I disabled fast startup on my Windows 10 laptop - it's gone through many sleeps and is holding steady so far - I wonder if @mhklinux has any thoughts on fast startup as a possible trigger for the issue - and if there's anything I could track on its impact when enabled

Isn't fast startup a laptop-only setting? I can't even see it on my Windows 11 desktop installation, and still I'm also affected by this issue. It definitely doesn't help that it's so random - in my case weeks can go by without it manifesting.

@skovbaer

I've never had fast startup turned on and can reliably reproduce the WSL problem after hibernation or sleep.
Windows 10 Enterprise, 22H2, 19045.3570

@ricardok1

It happens without fast startup or hibernate :-(

@zmajeed

zmajeed commented Nov 28, 2023

It was good while it lasted! WSL blew up today with HVS interrupt rate approaching 500,000 per second - despite fast startup being disabled

@bryandam

bryandam commented Nov 30, 2023

For anyone following along, the WSL team is doing an AMA @ 8AM PST today for Microsoft's Technical Takeoff week. Maybe, just maybe, a good dog-pile might get them to take this seriously.

https://techcommunity.microsoft.com/t5/windows-events/ama-windows-subsystem-for-linux-wsl/ev-p/3971579

@jeffska

This comment was marked as off-topic.

@pmartincic

This comment was marked as off-topic.

@craigloewen-msft
Member

Hi folks,

Wanted to add on that we are investigating this, and will post any updates to this thread as they're available.

Thank you for your patience!

@pbodnar

This comment was marked as off-topic.

@craigloewen-msft
Member

@pbodnar when investigating this issue we do look through the thread, so luckily those comments aren't lost on us and we love that the community here helps upload them to troubleshoot and debug this!

And agreed on the frustration. As a gentle reminder, if you're commenting on this issue please try to keep it as relevant to the issue as possible, and please keep the Code of Conduct for this repo in mind when commenting. Thank you again for all of your passion on this!

@pbodnar

pbodnar commented Dec 2, 2023

@craigloewen-msft, thanks for your kind reply. I realize that the comment I'm writing now will end up as off-topic again, yet to make things clear about my suggestion above: I expected that you probably keep all the important findings somewhere in your private corporate systems, but I think this endless discussion thread, with a title and description that are objectively out of date, just doesn't lead to a situation where the community can help you efficiently (and get less frustrated). Thank you for the efforts anyway, and good luck with finally resolving this issue. 🤞
