Windows Kubernetes worker node throws BSoD #66947
Comments
/sig windows |
@PatrickLang The memory dump file on AWS EC2 was not created properly, so I am first looking for a way to contact AWS Technical Support to collect a memory dump. Could you tell me whether a kernel dump or a full dump is needed? |
Luckily, I found a "MEMORY.DMP" file on another Kubernetes node that had hit the same BSoD and analyzed it with WinDbg. The result is Bug Check 0x7F (UNEXPECTED_KERNEL_MODE_TRAP) with trap number 0x00000008 (Double Fault). ena.sys appears as an associated external module, so I suspect a conflict between the AWS ENA driver and HNS. A detailed analysis is difficult for me, but since AWS ENA seems to be the cause, I will first disable ENA and test again. It seems hard to share the entire memory dump file on GitHub, so I am attaching only the analysis report I obtained with WinDbg. I can also provide the memory dump file if needed. |
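For readers triaging a similar dump: the first parameter of Bug Check 0x7F is the processor trap number, which is how 0x00000008 maps to a double fault. Below is a minimal Python sketch of that lookup. The helper name `describe_trap` is hypothetical, and the table is only a partial list of the trap numbers given in Microsoft's Bug Check 0x7F documentation.

```python
# Hypothetical helper: maps the first parameter of Bug Check 0x7F
# (UNEXPECTED_KERNEL_MODE_TRAP) to a readable trap name.
# Assumption: only a subset of documented trap numbers is listed here.
TRAP_NAMES = {
    0x00: "Divide Error",
    0x04: "Overflow",
    0x05: "Bounds Check Fault",
    0x06: "Invalid Opcode",
    0x08: "Double Fault",
}


def describe_trap(parameter1: int) -> str:
    """Return a one-line description for a 0x7F trap number."""
    name = TRAP_NAMES.get(parameter1, "unknown trap (not in this table)")
    return f"Bug Check 0x7F, trap 0x{parameter1:08X}: {name}"


print(describe_trap(0x00000008))
```

A double fault means a second exception was raised while the processor was handling an earlier one. That fits a kernel-mode driver problem, but the trap number alone does not identify which module is at fault.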
@PatrickLang I collected a few more memory dumps and analyzed them briefly. Judging from the additional information, the frequency of internal errors seems particularly high when running HNS on the Xen hypervisor. Is there a known problem with running Windows HNS on a hypervisor other than Hyper-V? |
@PatrickLang I analyzed the memory dump with the help of my colleagues. The colleague who helped with the analysis asked me to tell Microsoft that, whether I made a configuration mistake or this is a bug in HNS, neither should bring down the whole system. I agree, and I hope Microsoft addresses this issue at a higher level.
I am attaching the report file that my colleague produced with WinDbg. |
Thanks for linking the doc with the stacks! @daschott is going to ping the HNS devs to see if they can identify what needs to happen next. If you have any way to get Amazon to open a support case for this it would also be helpful as ENA.sys isn't something Microsoft has access to. |
Good & bad news. Good is - it looks like it was fixed in Windows Server 2019 already (MS Bug#17415345). Bad news is I don't have an ETA on a fix for 1803 yet. The Windows Server Insider program has ISOs available so if you want to try it for a proof of concept that should have the fix. |
@PatrickLang Thank you for your quick reply. I have two questions. A. For my testing, I want to start a Windows Server 2016 EC2 instance and install the Windows Server 2019 Insider Preview through an in-place upgrade. Is this feasible? B. I would appreciate a rough idea of when I can get updates for Server 1803. Thanks. |
The official HNS-level fix should come out October 16th as a KB. |
@PatrickLang @daschott October 16 is not only the date of the HNS patch; it is also the expected release date for Windows Server 2019 and Windows Server 1809. Unless it is absolutely necessary, I think installing the newer LTSC operating system is a better choice than patching Windows Server 1803 to fix this problem. I think I can close this issue once I have successfully built a production-level service on Windows Kubernetes in AWS using Windows Server 2019 and Kubernetes 1.13. Until then, however, I cannot test anything, so no one really knows whether this problem has been resolved, and that worries me. |
@rkttu The official KB for this issue was delayed internally, as a complete fix requires changes in other critical components (VFP) plus another subsequent HNS patch. However, if you have a Microsoft support engineer and a business justification, we should be able to give you a private hotfix for Windows Server 1803 earlier than October 16th. This issue will also require a patch on Windows Server 2019, which we are generating. Windows Server 2019 contains only one mitigation patch for the most common scenario of this issue; to remove it 100% in all cases you need another patch, which will be out shortly after release. |
@daschott I don't know how long it will take for Server 2019 to reach GA and then be published as an AWS AMI, so I think it is worth testing the patch for 1803 first. Can you tell me more about how to request the patch? I am a Microsoft MVP, and I understand that I am entitled to use technical support incidents. |
@daschott I checked the contents of KB4462932 for the 1709 update on October 18th. I found the following description, is this a fix for MS Bug #17415345?
And if the above is correct, when will I receive updates for 1803 and updates for Server 2019? |
@rkttu @rkttu-devsisters We are ready to distribute the hotfix privately only at this time; the official hotfix is scheduled to come out on 11C (November 27th) under KB4467682 on Windows Server, version 1803. We can give you a private hotfix if you can contact Microsoft or Azure Support and request this hotfix! |
Contingent on passing regression tests, this issue will be fixed officially in mid-late January 2019 for Windows Server 2019. For Windows Server, version 1803 the official fix date remains the same as in my above comment. Should you need this fix sooner, we can continue distributing private hotfixes in the meanwhile for both Windows Server, version 1803 and Windows Server 2019 through the regular support channels today. |
/label sig-aws |
@rkttu-devsisters Can you confirm whether there are still crashes after installing KB4476976? |
@daschott I could not re-create the same workload as when I first filed this issue. However, the BSoD that often occurred with long-running IIS 5 Pods no longer occurs. I need to re-create the workload on the Windows Server 2019 AMI in order to test it properly. |
@daschott I heard that the hotfix for Windows Server 2019 will be released in February. Do you have a hotfix available now? |
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
A BSoD occurs on Windows worker nodes two or three times a week or more, although the frequency is not constant. The stop code I saw on the BSoD is UNEXPECTED_KERNEL_MODE_TRAP, and the related module name is NDIS.sys.
What you expected to happen:
No kernel panic should occur, just as none occurs when I configure and run multiple Linux Kubernetes worker nodes.
How to reproduce it (as minimally and precisely as possible):
We used KOPS to build a kernel node, kubenet for an existing kernel cluster, and Flannel Windows + L2Bridge configuration for a newly built Windows node.
Anything else we need to know?:
The same problem occurs with both WinCNI and Flannel + L2Bridge. I expect that it is triggered when an incorrect configuration request is sent to HNS.
Environment:
Kubernetes version (use kubectl version): Existing Linux worker nodes are v1.9.4, and Windows worker nodes are v1.10.4
Kernel (e.g. uname -a): Existing Linux worker nodes are 'Linux ip-x-y-z 4.9.0-5-amd64 #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04) x86_64 GNU/Linux', and Windows worker nodes are '10.0.17134.137'.
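The Windows version string reported above, '10.0.17134.137', carries the build number (17134) that identifies the release discussed in the comments. The following is a rough Python sketch of that mapping. The function name `release_for` is hypothetical; the build numbers are the publicly documented ones for these releases, and the table is deliberately incomplete.

```python
# Illustrative sketch: map a Windows version string to a release name.
# Assumption: the third dot-separated field is the build number.
WINDOWS_BUILDS = {
    14393: "Windows Server 2016 (version 1607)",
    16299: "Windows Server, version 1709",
    17134: "Windows Server, version 1803",
    17763: "Windows Server 2019 / Windows Server, version 1809",
}


def release_for(version: str) -> str:
    """Return the release name for a version string like '10.0.17134.137'."""
    parts = version.split(".")
    if len(parts) < 3 or not parts[2].isdigit():
        raise ValueError(f"unexpected version string: {version!r}")
    build = int(parts[2])
    return WINDOWS_BUILDS.get(build, f"unknown build {build}")


print(release_for("10.0.17134.137"))
```

Checking the build this way helps confirm which release's fix applies to a given node before requesting or installing a hotfix.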