Windows Kubernetes worker node throws BSoD #66947

Closed
ghost opened this issue Aug 3, 2018 · 20 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/windows Categorizes an issue or PR as relevant to SIG Windows.

Comments

@ghost

ghost commented Aug 3, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug

/kind feature

What happened:
The Windows worker nodes hit a BSoD two or three times a week or more, although not at regular intervals. The stop code shown on the BSoD is UNEXPECTED_KERNEL_MODE_TRAP, and the module it names is NDIS.sys.

What you expected to happen:
No crash. The Linux Kubernetes worker nodes that I configure and run show no kernel panics.

How to reproduce it (as minimally and precisely as possible):
We used KOPS to build the existing Linux nodes, with kubenet for the existing Linux cluster, and a Flannel for Windows + L2Bridge configuration for the newly built Windows node.

Anything else we need to know?:
The same problem occurs with both WinCNI and Flannel + L2Bridge. I suspect it is triggered when an incorrect configuration request is sent to HNS.

Environment:

  • Kubernetes version (use kubectl version): Existing linux worker nodes are v1.9.4, and Windows worker nodes are v1.10.4
  • Cloud provider or hardware configuration: AWS EC2
  • OS (e.g. from /etc/os-release): Existing linux worker nodes are 'Debian GNU/Linux 9.3 (stretch)', and Windows worker nodes are 'Windows Server 1803'.
  • Kernel (e.g. uname -a): Existing linux worker nodes are 'Linux ip-x-y-z 4.9.0-5-amd64 #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04) x86_64 GNU/Linux', and Windows worker nodes are '10.0.17134.137'.
  • Install tools: Existing linux worker nodes built with KOPS, and Windows nodes installed manually.
  • Others: I attach the BSoD screenshot. After restarting the instance, I will collect the memory dump and try to analyze it with WinDBG.

unexpected_kernel_mode_trap_ndis_sys

@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. kind/bug Categorizes issue or PR as related to a bug. labels Aug 3, 2018
@ghost
Author

ghost commented Aug 3, 2018

/sig windows

@k8s-ci-robot k8s-ci-robot added sig/windows Categorizes an issue or PR as relevant to SIG Windows. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 3, 2018
@ghost
Author

ghost commented Aug 3, 2018

@PatrickLang The memory dump file was not created properly on AWS EC2, so I am first looking for a way to contact AWS Technical Support to collect one. It would help if you could tell me whether you need a kernel dump or a full dump.

@ghost
Author

ghost commented Aug 4, 2018

@PatrickLang

Luckily, I found a "MEMORY.DMP" file on another Kubernetes node that had hit the same BSoD, and analyzed it with WinDbg. It shows Bug Check 0x7F (UNEXPECTED_KERNEL_MODE_TRAP) with trap number 0x00000008 (Double Fault). ena.sys appears as an associated third-party module, so I presume there is a conflict between the AWS ENA driver and HNS.

A detailed analysis is difficult for me. However, since AWS ENA seems to be the cause of the problem, I first want to disable ENA and test again.

It seems hard to share the entire memory dump file on GitHub, so I am attaching only the analysis report I obtained with WinDbg. I can also provide the memory dump file if needed.

dump-report.txt
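
For readers following along, the bug check reported in this comment can be decoded with a small lookup. This is only a sketch: the trap numbers listed are the common ones documented for bug check 0x7F, and the function name `describe_bugcheck` is made up for illustration; it is not part of WinDbg or any Microsoft tool.

```python
# Common trap numbers documented for bug check 0x7F
# (UNEXPECTED_KERNEL_MODE_TRAP). The first bug check parameter
# carries the trap number.
TRAP_NAMES = {
    0x00: "Divide by zero",
    0x04: "Overflow",
    0x05: "Bounds check fault",
    0x06: "Invalid opcode",
    0x08: "Double fault",
}


def describe_bugcheck(code: int, param1: int) -> str:
    """Return a one-line description of a bug check (hypothetical helper)."""
    if code != 0x7F:
        return f"Bug check 0x{code:X}: not decoded by this sketch"
    trap = TRAP_NAMES.get(param1, "unknown trap")
    return (
        "Bug check 0x7F (UNEXPECTED_KERNEL_MODE_TRAP), "
        f"trap 0x{param1:08X}: {trap}"
    )


if __name__ == "__main__":
    # Values taken from the dump described above.
    print(describe_bugcheck(0x7F, 0x00000008))
```

A double fault means an exception was raised while the processor was already handling an earlier exception; a kernel stack overflow is one common way to get there.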

@ghost
Author

ghost commented Aug 6, 2018

@PatrickLang I collected a few more memory dumps and analyzed them briefly. Judging from the additional information, the internal errors seem particularly frequent when HNS runs on the Xen hypervisor. I would like to know whether there is a known problem running Windows HNS on a hypervisor other than Hyper-V.

@rkttu

rkttu commented Aug 6, 2018

@PatrickLang I analyzed the memory dump with the help of my colleagues; the results are below. A colleague who helped with the analysis asked me to tell Microsoft that, whether the cause is a wrong configuration on my side or a bug in HNS, neither should be able to bring down the whole system. I agree, and I hope Microsoft addresses this issue at a higher level.

To handle received network data after a network interrupt in Windows, the NIC driver calls NdisMIndicateReceiveNetBufferLists. The data is passed through vmswitch to vfpext.sys, and the call chain then repeats vswitch - vfpext - vswitch - vfpext through VmsForwardNetBufferListsBySourcePortsAndNblChains.

https://docs.microsoft.com/en-us/windows-hardware/drivers/network/receiving-network-data

These repeated calls exceeded the kernel stack (24 KB on 64-bit systems), which caused the crash and the memory dump. Every dump file collected across multiple crashes shows the same cause. (NDIS.SYS, VFPEXT.SYS)

https://blogs.msdn.microsoft.com/ntdebugging/2008/02/01/kernel-stack-overflows/

As a result, Microsoft needs to find out why vswitch - vfpext - vswitch - vfpext is being called repeatedly.

I attach the report that my colleague produced with WinDbg.

0001.docx
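
The stack exhaustion described in this comment can be illustrated with a toy model. This is a sketch under loud assumptions: the 24 KB figure comes from the comment above, but the per-call frame sizes are invented for illustration and are not measured from vmswitch or vfpext.sys, and `calls_until_overflow` is a hypothetical helper, not anything in Windows.

```python
# x64 kernel stack size cited in the comment above.
KERNEL_STACK_BYTES = 24 * 1024

# ASSUMPTION: made-up stack usage per call, for illustration only.
FRAME_BYTES = {"vswitch": 700, "vfpext": 500}


def calls_until_overflow(stack_bytes=KERNEL_STACK_BYTES, frames=FRAME_BYTES):
    """Count how many alternating vswitch/vfpext calls fit on the stack.

    Models the vswitch -> vfpext -> vswitch -> ... re-entry as frames
    pushed onto a fixed-size stack, and stops at the first call that
    would not fit.
    """
    order = ["vswitch", "vfpext"]
    used = 0
    calls = 0
    while True:
        frame = frames[order[calls % 2]]
        if used + frame > stack_bytes:
            return calls
        used += frame
        calls += 1


if __name__ == "__main__":
    print(calls_until_overflow())
```

The point is only that a fixed 24 KB budget allows a bounded number of nested re-entries, so any forwarding loop that keeps re-entering the switch will eventually fault, regardless of the exact frame sizes.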

@PatrickLang
Contributor

Thanks for linking the doc with the stacks! @daschott is going to ping the HNS devs to see if they can identify what needs to happen next. If you have any way to get Amazon to open a support case for this it would also be helpful as ENA.sys isn't something Microsoft has access to.

@PatrickLang
Contributor

Good & bad news. Good is - it looks like it was fixed in Windows Server 2019 already (MS Bug#17415345). Bad news is I don't have an ETA on a fix for 1803 yet. The Windows Server Insider program has ISOs available so if you want to try it for a proof of concept that should have the fix.

@ghost
Author

ghost commented Aug 7, 2018

@PatrickLang Thank you for your quick reply. I have some inquiries.

A. For my testing, I want to start a Windows Server 2016 EC2 instance and install the Windows Server 2019 Insider Preview via an in-place upgrade. Is this a feasible method?

B. And I would appreciate it if you can give me a rough idea of when I can get updates for Server 1803.

Thanks.

@daschott
Contributor

The official HNS-level fix should come out October 16th as a KB.

@rkttu

rkttu commented Sep 25, 2018

@PatrickLang @daschott October 16 is not just the date of the HNS patch; it is also the expected release date for Windows Server 2019 and Windows Server 1809. Unless it is absolutely necessary, I think installing the newer LTSC operating system is a better choice than patching Windows Server 1803 to fix this problem.

I think I can close this issue once I can build a production-level service on Windows Kubernetes in AWS using Windows Server 2019 and Kubernetes 1.13. In the meantime, though, I have not been able to test anything, so no one really knows whether this problem has been resolved, and that worries me.

@daschott
Contributor

daschott commented Sep 25, 2018

@rkttu The official KB for this issue was delayed internally, as a complete fix requires changes in other critical components (VFP) plus another subsequent HNS patch. However, if you have a Microsoft support engineer and a business justification, we should be able to give you a private hotfix for Windows Server 1803 earlier than October 16th.

This issue will also require a patch on Windows Server 2019 which we are generating. Windows Server 2019 contains only one mitigation patch for the most common scenario of this issue, but to remove it 100% in all cases you need another patch which will be out shortly after release.

@rkttu

rkttu commented Sep 25, 2018

@daschott I don't know how long it will take for Server 2019 to reach GA, or how long after that it will be published as an AWS AMI, so I think it is worth testing the patch for 1803 first.

Can you tell me more about how to request a patch? I am a Microsoft MVP and know that I have the right to use technical support incidents.

@ghost
Author

ghost commented Oct 22, 2018

@daschott I checked the contents of KB4462932, the October 18th update for 1709, and found the following description. Is this a fix for MS Bug #17415345?

Addresses an issue that might cause TCP connections opened for an application running on Windows Container to fail sporadically. This occurs when the container runs on a Network Address Translation (NAT) Network provided by Windows Network Address Translation (WinNAT). A SYN timeout occurs after reaching the maximum SYN Retransmit count. 

And if the above is correct, when will I receive updates for 1803 and updates for Server 2019?

@daschott
Contributor

@rkttu @rkttu-devsisters At this time we can distribute the hotfix only privately; the official hotfix is scheduled to come out on 11C (November 27th) under KB4467682 for Windows Server, version 1803. We can give you a private hotfix if you contact Microsoft or Azure Support and request it!

@daschott
Contributor

daschott commented Nov 8, 2018

Contingent on passing regression tests, this issue will be fixed officially in mid-late January 2019 for Windows Server 2019. For Windows Server, version 1803 the official fix date remains the same as in my above comment.

Should you need this fix sooner, we can continue distributing private hotfixes in the meantime for both Windows Server, version 1803 and Windows Server 2019 through the regular support channels.

@PatrickLang PatrickLang added this to Backlog in SIG-Windows Jan 11, 2019
@PatrickLang
Contributor

/label sig-aws

@daschott
Contributor

daschott commented Feb 8, 2019

@rkttu-devsisters Can you confirm whether there are still crashes after installing KB4476976?

@rkttu

rkttu commented Feb 10, 2019

@daschott I could not create the same workload as when I first registered this issue. However, BSoD, which was often happening with long running IIS 5 Pods, is no longer happening. I need to re-create the workload for the Windows Server 2019 AMI in order to test it better.

@rkttu

rkttu commented Feb 10, 2019

@daschott I heard that the hotfix for Windows Server 2019 will be released in February. Do you have a hotfix available now?

@daschott
Contributor

@rkttu yes, on Windows Server 2019, please see KB4476976
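
To summarize which KB applies to which node, here is a minimal sketch that maps the Windows version string reported in the issue (for example '10.0.17134.137') to the fix named in this thread. The KB numbers are the ones quoted above; the build-number mapping (17134 for version 1803, 17763 for Windows Server 2019) is added here as an assumption, and `fix_for` is a hypothetical helper. Whether later cumulative updates supersede these KBs is not checked.

```python
# Build number -> (release name, KB named in this thread).
# ASSUMPTION: build numbers are added for illustration; the KB
# numbers are the ones quoted in the comments above.
FIXES = {
    17134: ("Windows Server, version 1803", "KB4467682"),
    17763: ("Windows Server 2019", "KB4476976"),
}


def fix_for(os_version: str):
    """Return (release, KB) for a version string like '10.0.17134.137'.

    Returns None when the build is not one discussed in this thread.
    """
    parts = os_version.split(".")
    if len(parts) < 3 or not parts[2].isdigit():
        return None
    return FIXES.get(int(parts[2]))


if __name__ == "__main__":
    # Version string of the Windows worker nodes from the issue description.
    print(fix_for("10.0.17134.137"))
```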

@ghost ghost closed this as completed Mar 2, 2019
SIG-Windows automation moved this from Backlog to Done (v.1.14) Mar 2, 2019

4 participants