WSL2 reliably crashes when handling files > 10GB #5410

Open
luastoned opened this issue Jun 15, 2020 · 16 comments
Labels
needs-investigation likely actionable and/or needs more investigation

Comments

@luastoned

luastoned commented Jun 15, 2020

Environment

Windows build number: Microsoft Windows [Version 10.0.19041.329]
Your Distribution version: Ubuntu Release: 20.04
Whether the issue is on WSL 2 and/or WSL 1: WSL 2
Linux version 4.19.104-microsoft-standard (oe-user@oe-host) (gcc version 8.2.0 (GCC)) #1 SMP Wed Feb 19 06:37:35 UTC 2020

Steps to reproduce

dd if=/dev/zero of=./dump.bin bs=4k iflag=fullblock,count_bytes count=20G

This crashes at around 10GB; here are three runs:

-rw-r--r--  1 root root 12729712640 Jun 15 16:10 dump1.bin
-rw-r--r--  1 root root  9093251072 Jun 15 16:11 dump2.bin
-rw-r--r--  1 root root 11588861952 Jun 15 16:12 dump3.bin

The host OS has plenty of free disk space and 32GB of RAM.

Expected behavior

It should not crash the entire WSL host along with possible docker containers.

Actual behavior

Ubuntu/WSL window closes, all processes / docker containers crash with it.

I'm actually working with larger gz files, unzipping them, copying them, etc.
Basically every operation touching files > 10GB does not work.

Files mounted via /mnt/c/ do not appear to have this problem.

@therealkenc
Collaborator

image

@luastoned
Author

Are any other host/os characteristics needed that might help debug this?

@therealkenc
Collaborator

Detailed logs might shed some light. Your best hope is some me-toos.

@luastoned
Author

luastoned commented Jun 17, 2020

With logman.exe running, three runs of 20GB each went through; upping the size to 50GB made the problem appear again.

lxcore.zip

@benhillis
Member

@luastoned - do you see any errors in any of the Hyper-V logs in Event Viewer? I suspect the kernel is panicking.

@luastoned
Author

Sure enough in Hyper-V-Worker there were some critical logs.


Log Name:      Microsoft-Windows-Hyper-V-Worker-Admin
Source:        Microsoft-Windows-Hyper-V-Worker
Date:          24.06.2020 16:16:16
Event ID:      18560
Task Category: None
Level:         Critical
Keywords:
User:          S-1-5-83-1-1155001845-1082957944-1894944902-2866184998
Computer:      23core
Description:
'Virtual Machine' was reset because an unrecoverable error occurred on a virtual processor that caused a triple fault. If the problem persists, contact Product Support. (Virtual machine ID 44D7EDF5-A078-408C-8690-F2702683D6AA)
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Hyper-V-Worker" Guid="{51ddfa29-d5c8-4803-be4b-2ecb715570fe}" />
    <EventID>18560</EventID>
    <Version>0</Version>
    <Level>1</Level>
    <Task>0</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000000</Keywords>
    <TimeCreated SystemTime="2020-06-24T14:16:16.1494269Z" />
    <EventRecordID>475</EventRecordID>
    <Correlation />
    <Execution ProcessID="24368" ThreadID="19312" />
    <Channel>Microsoft-Windows-Hyper-V-Worker-Admin</Channel>
    <Computer>23core</Computer>
    <Security UserID="S-1-5-83-1-1155001845-1082957944-1894944902-2866184998" />
  </System>
  <UserData>
    <VmlEventLog xmlns="http://www.microsoft.com/Windows/Virtualization/Events">
      <VmName>Virtual Machine</VmName>
      <VmId>44D7EDF5-A078-408C-8690-F2702683D6AA</VmId>
      <Rax>0x0</Rax>
      <Rbx>0x0</Rbx>
      <Rcx>0x1000</Rcx>
      <Rdx>0xffff888589d99c00</Rdx>
      <Rsp>0xffffc9000429fcc0</Rsp>
      <Rbp>0xffff88864b623a38</Rbp>
      <Rsi>0x0</Rsi>
      <Rdi>0xffff88840a750000</Rdi>
      <R8>0xffffea001029d400</R8>
      <R9>0x1</R9>
      <R10>0x0</R10>
      <R11>0x0</R11>
      <R12>0xffffffff82531240</R12>
      <R13>0xffffea001029d400</R13>
      <R14>0x0</R14>
      <R15>0xffffea001029d400</R15>
      <Rip>0xffffffff81ac4ca7</Rip>
      <Rflags>0x10246</Rflags>
      <FpControlStatus>7F03000000000000B382D9B7737F0000</FpControlStatus>
      <XmmControlStatus>0000000000000000A31F0000FFFF0000</XmmControlStatus>
      <Cr0>0x80050033</Cr0>
      <Cr2>0x3163d53d0004</Cr2>
      <Cr3>0x5bf208003</Cr3>
      <Cr4>0x3606b0</Cr4>
      <Cr8>0x0</Cr8>
      <Xfem>0x7</Xfem>
      <Dr0>0x0</Dr0>
      <Dr1>0x0</Dr1>
      <Dr2>0x0</Dr2>
      <Dr3>0x0</Dr3>
      <Dr6>0xffff0ff0</Dr6>
      <Dr7>0x400</Dr7>
      <Es>0000000000000000FFFFFFFF00000000</Es>
      <Cs>0000000000000000FFFFFFFF10009BA0</Cs>
      <Ss>0000000000000000FFFFFFFF00000000</Ss>
      <Ds>0000000000000000FFFFFFFF00000000</Ds>
      <Fs>40F793B7737F0000FFFFFFFF00000000</Fs>
      <Gs>0000604B8688FFFFFFFFFFFF00000000</Gs>
      <Ldtr>0000000000000000FFFFFFFF00000000</Ldtr>
      <Tr>0030000000FEFFFF6F20000040008B00</Tr>
      <Idtr>000000000000FF0F0000000000FEFFFF</Idtr>
      <Gdtr>0000000000007F000010000000FEFFFF</Gdtr>
      <Tsc>0x30b6ce792698</Tsc>
      <ApicBase>0xfee00900</ApicBase>
      <SysenterCs>0x10</SysenterCs>
      <SysenterEip>0xffffffff81c01410</SysenterEip>
      <SysenterEsp>0xfffffe0000002200</SysenterEsp>
    </VmlEventLog>
  </UserData>
</Event>
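
For anyone comparing several of these crash events, a minimal Python sketch for pulling the key fields out of the Event XML might look like the following. The namespaces and element names are taken from the event above; the function name `summarize_triple_fault` and the trimmed `SAMPLE` document are illustrative assumptions, not part of any Microsoft tooling.

```python
import xml.etree.ElementTree as ET

# Namespaces as they appear in the Hyper-V-Worker event XML.
NS = {
    "ev": "http://schemas.microsoft.com/win/2004/08/events/event",
    "vm": "http://www.microsoft.com/Windows/Virtualization/Events",
}

# Trimmed copy of the event above, kept only so the sketch is self-contained.
SAMPLE = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <EventID>18560</EventID>
    <TimeCreated SystemTime="2020-06-24T14:16:16.1494269Z" />
  </System>
  <UserData>
    <VmlEventLog xmlns="http://www.microsoft.com/Windows/Virtualization/Events">
      <VmId>44D7EDF5-A078-408C-8690-F2702683D6AA</VmId>
      <Rip>0xffffffff81ac4ca7</Rip>
      <Cr2>0x3163d53d0004</Cr2>
    </VmlEventLog>
  </UserData>
</Event>"""


def summarize_triple_fault(event_xml):
    """Return the event ID, timestamp, VM ID, RIP and CR2 from one event (hypothetical helper)."""
    root = ET.fromstring(event_xml)
    log = root.find("ev:UserData/vm:VmlEventLog", NS)
    return {
        "event_id": int(root.findtext("ev:System/ev:EventID", namespaces=NS)),
        "time": root.find("ev:System/ev:TimeCreated", NS).get("SystemTime"),
        "vm_id": log.findtext("vm:VmId", namespaces=NS),
        # Register values are hex strings with a 0x prefix.
        "rip": int(log.findtext("vm:Rip", namespaces=NS), 16),
        "cr2": int(log.findtext("vm:Cr2", namespaces=NS), 16),
    }


summary = summarize_triple_fault(SAMPLE)
print(f"event {summary['event_id']} at {summary['time']}: rip={summary['rip']:#x}")
```

Collecting the RIP values across crashes would at least show whether the guest dies at the same instruction each time.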

@luastoned
Author

It appears I never paid attention to memory usage.

After limiting memory via .wslconfig to 12GB I can create the 20GB dummy file. #4166

Seems like file operations inside the WSL2 filesystem consume/block memory to a point where it crashes.
This also explains why the Windows-mounted filesystem (/mnt/c) was not affected by it.
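
A minimal sketch of that memory cap, assuming the documented `[wsl2]` section of `.wslconfig` with its `memory` key (plus the optional `swap` and `processors` keys). The helper name `render_wslconfig` is hypothetical; it only generates the file text.

```python
import configparser
import io


def render_wslconfig(memory_gb, swap_gb=None, processors=None):
    """Return .wslconfig text that caps the WSL 2 VM (hypothetical helper)."""
    config = configparser.ConfigParser()
    config.optionxform = str  # keep key names exactly as written
    config["wsl2"] = {"memory": f"{memory_gb}GB"}
    if swap_gb is not None:
        config["wsl2"]["swap"] = f"{swap_gb}GB"
    if processors is not None:
        config["wsl2"]["processors"] = str(processors)
    buffer = io.StringIO()
    config.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


# The 12GB limit described above.
print(render_wslconfig(12))
```

The generated text belongs in `%UserProfile%\.wslconfig` on the Windows side, and the VM has to be restarted with `wsl --shutdown` before the limit applies.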

@therealkenc
Collaborator

I was still not able to repro here using count=130G on a 128GB box. I watched vmmem working set go to ~106GB. No news there; the page cache filled up to the brim, and that's perfectly okay. But when all my available memory (in Windows) ran dry, inside the VM the Linux RCU became unhappy. Here's the gist of the errors in dmesg. Highlights:

[ 2838.721141] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 2838.721141] rcu:     1-...!: (3287 GPs behind) idle=74a/1/0x4000000000000002 softirq=320/320 fqs=732
[ 2838.721141] rcu:     2-...!: (1 GPs behind) idle=cfa/1/0x4000000000000000 softirq=869/4545 fqs=732
[ 2838.721141] rcu:     5-...!: (1 GPs behind) idle=ffa/1/0x4000000000000000 softirq=288/288 fqs=732
...more of same
[ 2838.721141] rcu:     (detected by 26, t=1464 jiffies, g=25105, q=1072)
[ 2838.721141] Sending NMI from CPU 26 to CPUs 1:
[ 3018.488316] NMI backtrace for cpu 1
[ 3018.488317] CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 4.19.121-microsoft-standard #1
[ 3018.488317] RIP: 0010:account_system_index_time+0x20/0x90
[ 3018.488318] Code: 01 30 48 01 70 40 c3 0f 1f 00 0f 1f 44 00 00 41 54 41 89 d4 55 48 89 f5 53 48 8b 87 a8 06 00 00 48 89 fb 48 01 b7 a8 05 00 00 <0f> b6 90 10 01 00 00 84 d2 74 1e 48 83 bf b0 06 00 00 00 74 14 48
...registers
[ 3018.488320] Call Trace:
[ 2838.721141]  <IRQ>
[ 2838.721141]  dump_stack+0x66/0x90
[ 2838.721141]  nmi_cpu_backtrace.cold.3+0x13/0x50
[ 2838.721141]  ? lapic_can_unplug_cpu.cold.31+0x40/0x40
[ 2838.721141]  nmi_trigger_cpumask_backtrace+0xc8/0xca
[ 2838.721141]  rcu_dump_cpu_stacks+0x9b/0xcb
[ 2838.721141]  rcu_check_callbacks.cold.82+0x296/0x359
...

I could not get the kernel to panic despite several tries. But it is possible (probable, even) that the above isn't a long walk from the kernel going down hard.

@luastoned
Author

luastoned commented Jul 8, 2020

No luck here when I disabled the limit again.
Tried to catch some info with dmesg --follow > dmesg.log but the crash does not appear to log anything.

[   49.190857] hv_balloon: Max. dynamic memory size: 26068 MB
[   83.652768] WSL2: Performing memory compaction.
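
Since the crash leaves nothing in dmesg, one option is to record memory counters from `/proc/meminfo` while the `dd` runs and force every sample to disk, so the last line written shows how memory looked just before the VM went down. This is only a sketch: the function names, the watched fields and the sample values in the demonstration are assumptions, and whether the final sample survives the reset is untested.

```python
import os
import time

# Fields worth watching while a large write is in flight.
WATCHED = ("MemTotal", "MemAvailable", "Cached", "Dirty", "Writeback")


def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: value}; most values are in kB."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    return fields


def log_memory(log_path, interval_s=1.0, samples=60):
    """Append one line per sample, syncing each line to disk before sleeping."""
    with open(log_path, "a") as log:
        for _ in range(samples):
            with open("/proc/meminfo") as src:
                info = parse_meminfo(src.read())
            row = " ".join(f"{name}={info.get(name, 0)}" for name in WATCHED)
            log.write(f"{time.time():.0f} {row}\n")
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval_s)


# Demonstration on made-up input so the parser can be checked on any machine.
demo = "MemTotal:       26693632 kB\nCached:         20971520 kB\nDirty:            524288 kB\n"
print(parse_meminfo(demo))
```

Calling `log_memory()` with a log path of your choice in a second terminal, before starting the `dd`, would produce the trace. A path under `/mnt/c` may be more likely to survive the reset than one inside the VM's own disk, though that is a guess.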

@luastoned
Author

After a couple of reboots / wsl --shutdown it appears the crash is linked to VSCode's WSL2 backend.
I was able to dd ~50GB files without crashing when VSCode was closed (i.e. after a reboot).

This happens on my work machine, so I tend to have VSCode open, and then the crashes happen reliably.

@therealkenc
Collaborator

A work-around might be to give yourself a big Windows page file and let it thrash. Instead of "automatically manage paging" I set a custom size of 128GB. It is not obvious that this should have helped: the default managed size is supposed to be 3x physical RAM. But after the change (and obligatory reboot) I'm not seeing the dmesg errors anymore, regardless of whether all physical memory is consumed.

@fawdlstty

I hit the same problem. If I create a .wslconfig file, I can't start Docker Desktop (Windows 10 Home).

@karlmutch

karlmutch commented Sep 10, 2021

I also have this happening consistently during long running file IO on large files regardless of the mounted file system, ext4 on VHDX, Windows 11 native file system mounts etc...

'Virtual Machine' was reset because an unrecoverable error occurred on a virtual processor that caused a triple fault. If the problem persists, contact Product Support. (Virtual machine ID BA977E36-057A-4180-AFA0-4C86DC421029)

<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Hyper-V-Worker" Guid="{51ddfa29-d5c8-4803-be4b-2ecb715570fe}" />
    <EventID>18560</EventID>
    <Version>0</Version>
    <Level>1</Level>
    <Task>0</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000000</Keywords>
    <TimeCreated SystemTime="2021-09-10T01:40:11.8986074Z" />
    <EventRecordID>3081</EventRecordID>
    <Correlation />
    <Execution ProcessID="17280" ThreadID="33108" />
    <Channel>Microsoft-Windows-Hyper-V-Worker-Admin</Channel>
    <Computer>Razer</Computer>
    <Security UserID="S-1-5-83-1-3130490422-1098909050-2253168815-688931548" />
  </System>
  <UserData>
    <VmlEventLog xmlns="http://www.microsoft.com/Windows/Virtualization/Events">
      <VmName>Virtual Machine</VmName>
      <VmId>BA977E36-057A-4180-AFA0-4C86DC421029</VmId>
      <Rax>0xffffffff81d1a3d0</Rax>
      <Rbx>0xf</Rbx>
      <Rcx>0xffff8883f37e9f40</Rcx>
      <Rdx>0x2b156</Rdx>
      <Rsp>0xffffc9000010bee8</Rsp>
      <Rbp>0xffff888100352b80</Rbp>
      <Rsi>0x7ffffe9734db53bf</Rsi>
      <Rdi>0xffff8883f37dd580</Rdi>
      <R8>0x66a1710248</R8>
      <R9>0x5</R9>
      <R10>0x100</R10>
      <R11>0x0</R11>
      <R12>0xffff888100352b80</R12>
      <R13>0x0</R13>
      <R14>0x0</R14>
      <R15>0xffff888100352b80</R15>
      <Rip>0xffffffff81d1a3e3</Rip>
      <Rflags>0x202</Rflags>
      <FpControlStatus>7F030000000000000000000000000000</FpControlStatus>
      <XmmControlStatus>0000000000000000801F0000FFFF0000</XmmControlStatus>
      <Cr0>0x80050033</Cr0>
      <Cr2>0x5638612e0658</Cr2>
      <Cr3>0x13a8e0005</Cr3>
      <Cr4>0x3706a0</Cr4>
      <Cr8>0x1</Cr8>
      <Xfem>0x7</Xfem>
      <Dr0>0x0</Dr0>
      <Dr1>0x0</Dr1>
      <Dr2>0x0</Dr2>
      <Dr3>0x0</Dr3>
      <Dr6>0xffff0ff0</Dr6>
      <Dr7>0x400</Dr7>
      <Es>0000000000000000FFFFFFFF00000000</Es>
      <Cs>0000000000000000FFFFFFFF10009BA0</Cs>
      <Ss>0000000000000000FFFFFFFF180093C0</Ss>
      <Ds>0000000000000000FFFFFFFF00000000</Ds>
      <Fs>0000000000000000FFFFFFFF00000000</Fs>
      <Gs>00007CF38388FFFFFFFFFFFF00000000</Gs>
      <Ldtr>0000000000000000FFFFFFFF00000000</Ldtr>
      <Tr>00E0310000FEFFFF6700000040008B00</Tr>
      <Idtr>000000000000FF0F0000000000FEFFFF</Idtr>
      <Gdtr>0000000000007F0000C0310000FEFFFF</Gdtr>
      <Tsc>0x33f58a5a309</Tsc>
      <ApicBase>0xfee00800</ApicBase>
      <SysenterCs>0x10</SysenterCs>
      <SysenterEip>0xffffffff81e01340</SysenterEip>
      <SysenterEsp>0xfffffe000031e000</SysenterEsp>
    </VmlEventLog>
  </UserData>
</Event>

@fawdlstty

After installing Windows 11, I rolled back to WSL 1 😆

@berlincount

I was doing a reasonably simple diff -u on two 12GB files, using a 24GB /tmp/bigswap file for swapping (in addition to the defaults), and it blew up reproducibly. Hmpf.

@tymscar

tymscar commented Jan 31, 2024

While attempting to compile Chromium on WSL2, I encountered a consistent crash issue. After extensive research and troubleshooting, detailed in my blog post, I discovered a workable solution:

  • Allocate as much RAM as possible to your VM and ensure there's a substantial amount of SWAP space to prevent OOM (Out of Memory) errors.
  • Address the WSL2 crash issue by setting up large pagefiles on Windows—I used 256GB, but 64GB might suffice for compiling Chromium.
  • Avoid utilizing all CPU threads during compilation to prevent WSL from crashing; using half or up to two-thirds of the threads seems stable.
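
The last point can be sketched as a small helper that picks a parallel job count from the machine's thread count. The name `job_count` and the idea of passing the result to a build tool's `-j` option are assumptions for illustration; the half and two-thirds fractions are the ones mentioned above.

```python
import os


def job_count(total_threads, numerator=1, denominator=2):
    """Return numerator/denominator of the threads, rounded down, but at least 1."""
    if not 0 < numerator <= denominator:
        raise ValueError("fraction must be in (0, 1]")
    # Integer arithmetic avoids floating-point rounding surprises.
    return max(1, total_threads * numerator // denominator)


threads = os.cpu_count() or 1
print(f"threads={threads} half={job_count(threads)} two_thirds={job_count(threads, 2, 3)}")
```

On a 16-thread machine this gives 8 jobs for the conservative setting and 10 for the two-thirds setting.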
