Windows Kubernetes worker node throws BSoD #66947

ghost · 2018-08-03T01:10:29Z

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug

/kind feature

What happened:
A BSoD occurs on the Windows worker nodes two or three times a week or more, although not at a fixed interval. The stop code shown on the BSoD is UNEXPECTED_KERNEL_MODE_TRAP, and the module it names is NDIS.sys.

What you expected to happen:
There is no kernel panic when I configure and run multiple Linux Kubernetes worker nodes.

How to reproduce it (as minimally and precisely as possible):
We used KOPS to build the Linux nodes, kubenet for the existing Linux cluster, and a Flannel for Windows + L2Bridge configuration for the newly built Windows node.

Anything else we need to know?:
The same problem occurred with WinCNI, and it also occurs with Flannel + L2Bridge. I suspect it is triggered when an incorrect configuration request is sent to HNS.

Environment:

Kubernetes version (use kubectl version): Existing linux worker nodes are v1.9.4, and Windows worker nodes are v1.10.4
Cloud provider or hardware configuration: AWS EC2
OS (e.g. from /etc/os-release): Existing linux worker nodes are 'Debian GNU/Linux 9.3 (stretch)', and Windows worker nodes are 'Windows Server 1803'.
Kernel (e.g. uname -a): Existing linux worker nodes are 'Linux ip-x-y-z 4.9.0-5-amd64 #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04) x86_64 GNU/Linux', and Windows worker nodes are '10.0.17134.137'.
Install tools: Existing linux worker nodes built with KOPS, and Windows nodes installed manually.
Others: I attach the BSoD screenshot. After restarting the instance, I will collect the memory dump and try to analyze it with WinDBG.

ghost · 2018-08-03T01:11:08Z

/sig windows

ghost · 2018-08-03T07:24:58Z

@PatrickLang The memory dump file was not created properly on AWS EC2, so I am first looking for a way to contact AWS Technical Support to collect one. It would help if you could tell me whether a kernel dump or a full dump is needed.

ghost · 2018-08-04T16:50:12Z

@PatrickLang

Luckily, I found a "MEMORY.DMP" file on another Kubernetes node that had hit the same BSoD, and I analyzed it with WinDbg. The result is Bug Check 0x7F (UNEXPECTED_KERNEL_MODE_TRAP) with trap number 0x00000008 (Double Fault). ena.sys appears as an associated external module, so I presume there is a conflict between the AWS ENA driver and HNS.

I have difficulty doing detailed analysis. However, AWS ENA seems to be the cause of the problem, so first I want to disable ENA and test again.

It seems hard to share the entire memory dump file on GitHub, so I am attaching only the analysis report I obtained with WinDbg. I can also provide the memory dump file if needed.

dump-report.txt
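
For readers following the WinDbg result above, here is a minimal sketch of how the first parameter of Bug Check 0x7F maps to a trap name. The table lists only a few common values from Microsoft's bug check reference, and the function name `describe_trap` is hypothetical, not part of any tool used in this thread.

```python
# Parameter 1 of Bug Check 0x7F (UNEXPECTED_KERNEL_MODE_TRAP) is the trap number.
# Illustrative subset only; this table is not a complete list.
TRAP_NAMES = {
    0x00: "Divide by Zero Error",
    0x04: "Overflow",
    0x05: "Bounds Check Fault",
    0x06: "Invalid Opcode",
    0x08: "Double Fault",
}


def describe_trap(bugcheck_code: int, param1: int) -> str:
    """Return a readable label for the first parameter of a 0x7F bug check."""
    if bugcheck_code != 0x7F:
        raise ValueError(f"not a 0x7F bug check: {bugcheck_code:#x}")
    name = TRAP_NAMES.get(param1, "unknown trap")
    return f"0x7F trap {param1:#010x}: {name}"


# The values reported in the comment above.
print(describe_trap(0x7F, 0x00000008))
```

A double fault on a 64-bit kernel is commonly caused by a kernel stack overflow, which matches the later analysis in this thread.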

ghost · 2018-08-06T10:12:21Z

@PatrickLang A few more crashes occurred, and I collected and briefly analyzed the memory dumps. Based on the additional information, I think internal errors are particularly frequent when HNS runs on the Xen hypervisor. I would like to know whether there is a problem running Windows HNS on a hypervisor other than Hyper-V.

rkttu · 2018-08-06T14:40:44Z

@PatrickLang I analyzed the memory dump with the help of my colleagues, and the findings are below. A colleague who helped with the analysis asked me to tell Microsoft that the cause is either a wrong configuration on my side or a bug in HNS, but that neither should bring the whole system down. I agree, and I hope Microsoft addresses this issue at a higher level.

To handle received network data ("Receiving network data") after a network interrupt in Windows, the NIC driver called NdisMIndicateReceiveNetBufferLists. The data passed through vmswitch to vfpext.sys, and the call sequence vswitch - vfpext - vswitch - vfpext then repeated through VmsForwardNetBufferListsBySourcePortsAndNblChains.

https://docs.microsoft.com/en-us/windows-hardware/drivers/network/receiving-network-data

These repeated calls exhausted the kernel stack (24k on 64-bit systems), which triggered the bug check and the memory dump. Every dump file collected so far shows the same cause. (NDIS.SYS, VFPEXT.SYS)

https://blogs.msdn.microsoft.com/ntdebugging/2008/02/01/kernel-stack-overflows/

As a result, Microsoft needs to find out why the vswitch - vfpext - vswitch - vfpext is being called repeatedly.
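
The repeating pattern described above can be spotted mechanically. The following is a rough sketch, assuming the call stack has already been reduced to a list of module names, top of stack first. The sample frames are hypothetical and are not copied from the attached report. It is a brute-force search, which is adequate for stacks of a few hundred frames.

```python
from typing import List, Tuple


def longest_repeating_cycle(modules: List[str], max_period: int = 8) -> Tuple[List[str], int]:
    """Find the module pattern that repeats back-to-back over the most frames.

    Returns (pattern, repeat_count). repeat_count is 1 when nothing repeats
    and 0 for an empty stack.
    """
    best_pattern: List[str] = modules[:1]
    best_repeats = 1 if modules else 0
    n = len(modules)
    for period in range(1, max_period + 1):
        for start in range(0, n - period + 1):
            pattern = modules[start:start + period]
            repeats = 1
            pos = start + period
            # Count how many times the pattern follows itself immediately.
            while modules[pos:pos + period] == pattern:
                repeats += 1
                pos += period
            # Prefer the cycle covering the most frames; ties keep the earlier find.
            if repeats > 1 and repeats * period > best_repeats * len(best_pattern):
                best_pattern, best_repeats = pattern, repeats
    return best_pattern, best_repeats


# Hypothetical, shortened stack (top of stack first).
frames = ["vfpext", "vmswitch", "vfpext", "vmswitch", "vfpext", "vmswitch", "NDIS", "ena"]
pattern, count = longest_repeating_cycle(frames)
print(pattern, count)
```

A high repeat count for a short pattern such as vfpext/vmswitch points to unbounded re-entry between the two modules, as opposed to a single deep but finite call chain.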

I attach a report file that my colleague analyzed with WinDbg.

0001.docx

PatrickLang · 2018-08-06T20:40:16Z

Thanks for linking the doc with the stacks! @daschott is going to ping the HNS devs to see if they can identify what needs to happen next. If you have any way to get Amazon to open a support case for this it would also be helpful as ENA.sys isn't something Microsoft has access to.

PatrickLang · 2018-08-06T22:48:56Z

Good & bad news. Good is - it looks like it was fixed in Windows Server 2019 already (MS Bug#17415345). Bad news is I don't have an ETA on a fix for 1803 yet. The Windows Server Insider program has ISOs available so if you want to try it for a proof of concept that should have the fix.

ghost · 2018-08-07T01:21:41Z

@PatrickLang Thank you for your quick reply. I have some inquiries.

A. For my testing, I want to start a Windows Server 2016 EC2 instance and install the Windows Server 2019 Insider Preview on it through an in-place upgrade. Is this feasible?

B. And I would appreciate it if you can give me a rough idea of when I can get updates for Server 1803.

Thanks.

daschott · 2018-08-21T18:21:38Z

The official HNS-level fix should come out October 16th as a KB.

rkttu · 2018-09-25T14:55:23Z

@PatrickLang @daschott October 16 is not just the date of the HNS patch. It is also the expected release date for Windows Server 2019 and Windows Server 1809. Unless it is absolutely necessary, I think installing the newer LTSC operating system is a better choice than patching Windows Server 1803 to fix this problem.

I think I can close this issue once I have successfully built a production-level service on Windows Kubernetes in AWS using Windows Server 2019 and Kubernetes 1.13. In the meantime I cannot test anything, so nobody really knows whether the problem has been resolved, and that worries me.

daschott · 2018-09-25T16:19:14Z

@rkttu The official KB for this issue was delayed internally, as a complete fix requires changes in other critical components (VFP) + another subsequent HNS patch. However, if you have a Microsoft support engineer & business justification, we should be able to give you a private hotfix for Windows Server 1803 earlier than October 16th.

This issue will also require a patch on Windows Server 2019 which we are generating. Windows Server 2019 contains only one mitigation patch for the most common scenario of this issue, but to remove it 100% in all cases you need another patch which will be out shortly after release.

rkttu · 2018-09-25T17:58:02Z

@daschott I don't know how long it will take for Server 2019 to reach GA, or how long after that it will be published as an AWS AMI, so I think it is worth testing the patch for 1803 first.

Can you tell me more about how to request a patch? I am a Microsoft MVP and know that I have the right to use technical support incidents.

ghost · 2018-10-22T16:25:15Z

@daschott I checked the contents of KB4462932 for the 1709 update on October 18th. I found the following description, is this a fix for MS Bug #17415345?

Addresses an issue that might cause TCP connections opened for an application running on Windows Container to fail sporadically. This occurs when the container runs on a Network Address Translation (NAT) Network provided by Windows Network Address Translation (WinNAT). A SYN timeout occurs after reaching the maximum SYN Retransmit count.

And if the above is correct, when will I receive updates for 1803 and updates for Server 2019?

daschott · 2018-10-22T16:52:36Z

@rkttu @rkttu-devsisters We are ready to distribute the hotfix privately only at this time; the official hotfix is scheduled to come out on 11C (November 27th) under KB4467682 on Windows Server, version 1803. We can give you a private hotfix if you can contact Microsoft or Azure Support and request this hotfix!

daschott · 2018-11-08T18:35:56Z

Contingent on passing regression tests, this issue will be fixed officially in mid-late January 2019 for Windows Server 2019. For Windows Server, version 1803 the official fix date remains the same as in my above comment.

Should you need this fix sooner, we can continue distributing private hotfixes in the meanwhile for both Windows Server, version 1803 and Windows Server 2019 through the regular support channels today.

PatrickLang · 2019-02-05T18:25:34Z

/label sig-aws

daschott · 2019-02-08T17:48:32Z

@rkttu-devsisters Can you confirm whether there are still crashes after installing KB4476976?

rkttu · 2019-02-10T12:17:18Z

@daschott I could not re-create the same workload I had when I first filed this issue. However, the BSoD that often occurred with long-running IIS 5 Pods no longer happens. To test this more thoroughly, I need to re-create the workload on the Windows Server 2019 AMI.

rkttu · 2019-02-10T12:18:24Z

@daschott I heard that the hotfix for Windows Server 2019 will be released in February. Do you have a hotfix available now?

daschott · 2019-02-11T18:03:53Z

@rkttu yes, on Windows Server 2019, please see KB4476976
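
To check a node against the fixes named in this thread, a small helper along these lines may be useful. It is only a sketch: the function name and the release keys are hypothetical, it assumes the list of installed KB IDs was collected separately (for example from the `HotFixID` column of PowerShell's `Get-HotFix`), and it does not account for later cumulative updates that may supersede these KBs.

```python
from typing import Iterable

# Fix KBs as named in this thread. Later cumulative updates are NOT modelled.
FIX_KB_BY_RELEASE = {
    "1803": "KB4467682",  # Windows Server, version 1803
    "2019": "KB4476976",  # Windows Server 2019
}


def has_thread_fix(release: str, installed_kbs: Iterable[str]) -> bool:
    """Return True if the KB this thread names for `release` is installed."""
    kb = FIX_KB_BY_RELEASE.get(release)
    if kb is None:
        raise KeyError(f"no fix KB recorded in this thread for release {release!r}")
    # Normalize case and whitespace so 'kb4476976 ' still matches.
    normalized = {item.strip().upper() for item in installed_kbs}
    return kb in normalized


# Hypothetical inventory of one Windows Server 2019 node.
print(has_thread_fix("2019", ["KB4476976", "KB4470788"]))
```

A `False` result only means the exact KB is absent, so the node's build number should still be checked before concluding the fix is missing.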

k8s-ci-robot added the needs-sig and kind/bug labels Aug 3, 2018

k8s-ci-robot added the sig/windows label and removed the needs-sig label Aug 3, 2018

ghost mentioned this issue Aug 3, 2018

vNIC in Windows container suddenly disappears #65163

Closed

PatrickLang added this to Backlog in SIG-Windows Jan 11, 2019

ghost closed this as completed Mar 2, 2019

SIG-Windows automation moved this from Backlog to Done (v.1.14) Mar 2, 2019
