WindowsRS5 CI is borked #38114
What's the current status?
No change, but I'll try to ping some people to see if someone can look into this, or otherwise just remove the RS5 CI.
I've been OOF (surgery) and only just back. I'll try to look at it today or tomorrow. I think it's somehow related to when the CI script moved into the repo.
Welcome back @jhowardmsft 🎉
Making progress. I have a manual workaround, but I'd probably need to do that every few days, which isn't sustainable. There's a leak somewhere in the platform. I'm trying to repro on a newer OS version and involve the relevant kernel folks.
(Should add, a couple of nodes are currently back and working as of this moment.)
Thanks @jhowardmsft - would this be something we could do with a cronjob?
What's cron 😆 This is Windows 😇. No, it's a kernel leak.
Haha, thought about that after writing it. That's a pity 🤷♂️ I was hoping we could make these machines less "pet", more "cattle".
Quick update - being tracked internally by VSO#19599026. I've got it down to a minimal repro. Unfortunately, once the system is in this state, a reboot appears to be the only solution, as kernel resources are locked exclusively - it appears to relate to a silo not completely exiting. There's at least one CI test which I know hits this condition. As a temporary measure, I'll disable that test, and possibly one of three others (still narrowing down) which seem to also sometimes cause it. That will at least make the CI servers resilient, so they don't have to be cleaned up every few days due to leaks. Investigation is now with the kernel team here to find the bug, and hopefully we can get a fix in a future Windows update.
Signed-off-by: John Howard <jhoward@microsoft.com> Certain tests here have testRequires(c, NotWindowsRS5Plus). This is being tracked internally by VSO#19599026, and externally through moby#38114. @jhowardmsft. As of 11/12/2018, there's no workaround except a reboot. Under certain circumstances, silos are not completely exiting, causing resources to remain locked exclusively in the kernel, which can't be cleaned up. This is causing the RS5 CI servers to run out of disk space. The bug seems to occur when a container is stopped and then re-started almost immediately after, which is a typical "restart" pattern. Unfortunately, that covers almost all of the tests here.
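The `testRequires(c, NotWindowsRS5Plus)` guard mentioned above skips the affected tests on RS5 and later. A minimal standalone sketch of how such a build-number predicate could work is below; the function names and structure here are illustrative, not moby's actual implementation (which inspects the host OS version at runtime). Only the RS5 build number, 17763, comes from this thread's later comments:

```go
package main

import "fmt"

// rs5Build is the Windows Server 2019 / Windows 10 "RS5" build number.
// (The thread later references host builds 17763.1, 17763.55 and .253.)
const rs5Build = 17763

// notWindowsRS5Plus reports whether a host build predates RS5, mirroring
// the intent of the NotWindowsRS5Plus test requirement: a guarded test
// runs only when this returns true. (Hypothetical standalone version.)
func notWindowsRS5Plus(build int) bool {
	return build < rs5Build
}

func main() {
	// Pre-RS5 builds (RS3 = 16299, RS4 = 17134) would still run the
	// guarded tests; RS5 (17763) would skip them.
	for _, b := range []int{16299, 17134, 17763} {
		fmt.Printf("build %d: run guarded test = %v\n", b, notWindowsRS5Plus(b))
	}
}
```

Guarding at the requirement level keeps the leaking tests out of RS5 CI runs without deleting them, so they can be re-enabled once a kernel fix ships.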
@jhowardmsft I found that all three RS5 CI servers were in a state where all builds were failing, so I did some investigation with my PR #38391, and found that after all tests have run, the second Nuke fails (though it does not fail the test):
Also, just to add here the same thing I said in #38376 (comment): the rebuild/windowsRS5-process label does not work.
This appears to be an issue related to the fontdrvhost package directory. After a full, successful run, I saw 20 instances of open file handles in the system process to the
Thanks @johnstep |
@jhowardmsft any news from kernel team to this one? |
Nothing I can share externally yet, but folks are back looking at it now that Christmas/New Year is over. |
@olljanat @thaJeztah @johnstep We think we might have a fix. I'm going to re-deploy these machines in the next 24 hrs and hopefully all will be good again. |
Awesome, thanks! (and thanks to the team!) |
OK, nodes 2 and 3 are redeployed and online. 4 is on its way. 1 I'll have to do this evening.
@jhowardmsft I do not want to spoil the party here, but I can see from this build which I just started that the OS build is now 17763.55 (it was 17763.1 earlier), yet it looks like removing data from the CI folder still fails:
@olljanat That's actually still an older build. It's the .253 machines you want to be looking at. But no, in a full CI run, it's not fixed. Back to the kernel team.... |
Further update - the storage folks have found a leak in the filter driver which could likely be the cause. I'll deploy a private fix on the servers later this week once I have a binary and have done some validation locally. |
@jhowardmsft good. Do you know the KB number for the hotfix already? Will it be released in February?
It’s not available publicly. I don’t think it will realistically be available until April or May at the earliest. That is a best guess only, not a commitment. I'll update here when I know more, but bear in mind I have no control over Windows servicing or its timelines.
For example on #38103; https://jenkins.dockerproject.org/job/Docker-PRs-WoW-RS5-Process/237/console
These scripts were last updated in #37715, but I'm not sure if that's related, or if it's something with the machines.
ping @ddebroy @jhowardmsft @johnstep PTAL