
WindowsRS5 CI is borked #38114

Closed
thaJeztah opened this issue Oct 31, 2018 · 25 comments
Labels
area/testing kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. platform/windows

Comments

@thaJeztah
Member

For example on #38103; https://jenkins.dockerproject.org/job/Docker-PRs-WoW-RS5-Process/237/console

23:11:20 INFO: Base image for tests is microsoft/windowsservercore
23:11:20 INFO: Loading windowsservercore .tar from disk into the daemon under test. This may take some time...
23:12:08 
23:12:08 
23:12:08 ERROR: Failed 'ERROR: Failed to load c:\baseimages\windowsservercore.tar into daemon under test' at 10/30/2018 23:12:08
23:12:08 At C:\gopath\src\github.com\docker\docker\hack\ci\windows.ps1:705 char:21
23:12:08 + ...             Throw $("ERROR: Failed to load $readBaseFrom`:\baseimages ...
23:12:08 +                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These scripts were last updated in #37715, but I'm not sure if that's related, or if it's something with the machines.

ping @ddebroy @jhowardmsft @johnstep PTAL

@thaJeztah thaJeztah added the platform/windows, kind/bug, and area/testing labels Oct 31, 2018
@AkihiroSuda
Member

What's current status?

@thaJeztah
Member Author

No change, but I'll try to ping some people to see if someone can look into this, or otherwise just remove the RS5 CI

@lowenna
Member

lowenna commented Nov 8, 2018

I've been OOF (surgery) and only just back. I'll try to look at it today or tomorrow. I think it's somehow relating to when the CI script moved into the repo.

@thaJeztah
Member Author

Welcome back @jhowardmsft 🎉

@lowenna
Member

lowenna commented Nov 9, 2018

Making progress. I have a manual workaround, but I'd probably need to do that every few days which isn't sustainable. There's a leak somewhere in the platform. I'm trying to repro on a newer OS version and involve the relevant kernel folks.

@lowenna
Member

lowenna commented Nov 9, 2018

(Should add that a couple of nodes are currently back and working as of this moment.)

@thaJeztah
Copy link
Member Author

Thanks @jhowardmsft - would this be something we could do with a cronjob?

@lowenna
Member

lowenna commented Nov 9, 2018

What's cron? 😆 This is Windows 😇. No, it's a kernel leak.

@thaJeztah
Member Author

Haha, thought about that after writing it. CRON.EXE then 😉😂☺️

That's a pity 🤷‍♂️ was hoping we could make these machines less "pet", more "cattle"

@lowenna
Member

lowenna commented Nov 12, 2018

Quick update - being tracked internally by VSO#19599026. I've got it down to a minimal repro. Unfortunately, once the system is in this state, a reboot appears to be the only solution as kernel resources are locked exclusively - it appears that it relates to a silo not completely exiting. There's at least one CI test which I know to hit this condition. As a temporary measure, I'll disable that test, and possibly one of three others (still narrowing down) which seem to also sometimes cause it. That will at least make the CI servers resilient and not have to be cleaned up every few days due to leaks. Investigation is now with the kernel team here to find the bug and hopefully we can get a fix in a future Windows update.

lowenna pushed a commit to microsoft/docker that referenced this issue Nov 13, 2018
Signed-off-by: John Howard <jhoward@microsoft.com>

Certain tests here have testRequires(c, NotWindowsRS5Plus).

This is being tracked internally by VSO#19599026, and externally
through moby#38114. @jhowardmsft.
As of 11/12/2018, there's no workaround except a reboot.

Under certain circumstances, silos are not completely exiting
causing resources to remain locked exclusively in the kernel,
and can't be cleaned up. This is causing the RS5 CI servers to
fill up with disk space.

The bug seems to occur when a container is stopped then re-started
almost immediately after, which is a typical "restart" pattern.
Unfortunately, that's almost all of the tests here.
@olljanat
Contributor

@jhowardmsft I found that all three RS5 CI servers were in a state where all builds were failing.

So I did some investigations about this with my PR #38391

  • Added commit 2c9e03d to get more detailed errors out of it. They were:
18:07:46 docker : re-exec error: exit status 1: output: write \\?\D:\control\tmp\hcs008067931\Files\go.zip: There is not enough 
18:07:46 space on the disk.
18:07:46 At C:\gopath\src\github.com\docker\docker\hack\ci\windows.ps1:461 char:39
18:07:46 + ... sure-Command { docker build -t docker -f Dockerfile.windows . | Out-H ...
18:07:46 +                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
18:07:46     + CategoryInfo          : NotSpecified: (re-exec error: ...ce on the disk.:String) [], RemoteException
18:07:46     + FullyQualifiedErrorId : NativeCommandError

and

18:45:51 INFO: Base image for tests is microsoft/windowsservercore
18:45:52 INFO: Loading windowsservercore .tar from disk into the daemon under test. This may take some time...
18:46:41 docker-3fcffea1dc.exe : re-exec error: exit status 1: output: BackupWrite \\?\D:\CI\CI-3fcffea1dc\daemon\windowsfilter\
18:46:41 3610ea027b15254dbd127fad6cc562b9ed03a20979723c04edf904585e775393\Files\Windows\Boot\Fonts\jpn_boot.ttf: There is not 
18:46:41 enough space on the disk.
18:46:41 At C:\gopath\src\github.com\docker\docker\hack\ci\windows.ps1:710 char:17
18:46:41 + ...             & "$env:TEMP\binary\docker-$COMMITHASH" "-H=$($DASHH_CUT) ...
18:46:41 +                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
18:46:41     + CategoryInfo          : NotSpecified: (re-exec error: ...ce on the disk.:String) [], RemoteException
18:46:41     + FullyQualifiedErrorId : NativeCommandError
  • I used a couple of versions of commit ff5b444 to get all three nodes rebooted.
  • Then I reset #38391 back to a state where it only contains the needed commit, and saw that the build was now working on RS5. Looking at the log more carefully, I can see that the first Nuke after the reboot works just fine:
19:39:50 INFO: Commit hash is dd82172cc4
19:39:50 INFO: Nuke-Everything...
19:39:50 INFO: Container count on control daemon to delete is 0
19:39:51 INFO: Nuking d:\CI
19:55:51 INFO: Zapped successfully
19:55:51 INFO: Location for testing is d:\CI\CI-dd82172cc4
19:55:51 INFO: Validating GOLang consistency in Dockerfile.windows...

but after all the tests have run, the second Nuke fails (though it does not fail the build):

20:40:10 INFO: Tidying pidfile d:\CI\CI-dd82172cc4\docker.pid
20:40:10 INFO: Nuking d:\CI
20:40:12 ERROR:  hcsshim::DestroyLayer - Win32 API call returned error r1=0x80070020 err=The process cannot access the file because it is being used by another process.id=CI flavour=0
20:40:12 
20:40:12 INFO: executeCI.ps1 exiting at Tue Dec 18 20:40:12 CUT 2018. Duration 01:00:22.9026065
20:40:12 
20:40:12 Set build name.
20:40:12 New build name is '#38391'
20:40:13 [PostBuildScript] - Executing post build scripts.
20:40:13 Notifying endpoint with url 'https://leeroy.dockerproject.org/notification/jenkins'
20:40:14 Finished: SUCCESS

Also, just to add here the same thing I said in #38376 (comment): the rebuild/windowsRS5-process label does not work.

@johnstep
Member

This appears to be an issue related to the fontdrvhost package directory. After a full, successful run, I saw 20 instances of open file handles in the system process to the Windows\System32\config\systemprofile\AppData\Local\Packages\microsoft.windows.fontdrvhost directory in one of the test images, and the same count for each of its subdirectories, recursively (AC, AC\INetCache, AC\INetCookies, AC\INetHistory, and AC\Temp).

@thaJeztah
Member Author

Thanks @johnstep

@olljanat
Contributor

olljanat commented Jan 2, 2019

@jhowardmsft any news from the kernel team on this one?

@lowenna
Member

lowenna commented Jan 2, 2019

Nothing I can share externally yet, but folks are back looking at it now that Christmas/New Year is over.

@lowenna
Member

lowenna commented Jan 3, 2019

@olljanat @thaJeztah @johnstep We think we might have a fix. I'm going to re-deploy these machines in the next 24 hrs and hopefully all will be good again.

@thaJeztah
Member Author

Awesome, thanks! (and thanks to the team!)

@lowenna
Member

lowenna commented Jan 4, 2019

OK, nodes 2 and 3 are redeployed and online. 4 is on its way. 1 I'll have to do this evening.

@olljanat
Contributor

olljanat commented Jan 4, 2019

@jhowardmsft I do not want to spoil the party here, but I can see from this build which I just started that the OS build is now 17763.55 (it was 17763.1 earlier), and it looks like removing data from the CI folder still fails:

05:15:28 INFO: Nuking d:\CI
05:15:29 ERROR:  hcsshim::DestroyLayer - Win32 API call returned error r1=0x80070020 err=The process cannot access the file because it is being used by another process.id=CI flavour=0

@lowenna
Member

lowenna commented Jan 4, 2019

@olljanat That's actually still an older build. It's the .253 machines you want to be looking at. But no, in a full CI run, it's not fixed. Back to the kernel team....

@lowenna
Member

lowenna commented Jan 7, 2019

Further update - the storage folks have found a leak in the filter driver which could likely be the cause. I'll deploy a private fix on the servers later this week once I have a binary and have done some validation locally.

@lowenna
Member

lowenna commented Jan 7, 2019

Validated with a private fix on a local run. I'll patch the servers this afternoon. Hence closing this.


@lowenna lowenna closed this as completed Jan 7, 2019
@thaJeztah
Member Author

zap zap zap

@olljanat
Contributor

@jhowardmsft good. Do you know the KB number for the hotfix already? Will it be released in February?

@lowenna
Member

lowenna commented Jan 11, 2019

It’s not available publicly. I don’t think realistically it will be available until April or May at the earliest. That is a best guess only, not a commitment. I will update here when I know more, but bear in mind I have no control over Windows servicing and timelines.
