
WindowsRS5 CI is borked #38114

Closed
thaJeztah opened this issue Oct 31, 2018 · 25 comments
Labels
area/testing kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. platform/windows

Comments

@thaJeztah
Member

For example on #38103; https://jenkins.dockerproject.org/job/Docker-PRs-WoW-RS5-Process/237/console

23:11:20 INFO: Base image for tests is microsoft/windowsservercore
23:11:20 INFO: Loading windowsservercore .tar from disk into the daemon under test. This may take some time...
23:12:08 
23:12:08 
23:12:08 ERROR: Failed 'ERROR: Failed to load c:\baseimages\windowsservercore.tar into daemon under test' at 10/30/2018 23:12:08
23:12:08 At C:\gopath\src\github.com\docker\docker\hack\ci\windows.ps1:705 char:21
23:12:08 + ...             Throw $("ERROR: Failed to load $readBaseFrom`:\baseimages ...
23:12:08 +                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These scripts were last updated in #37715, but I'm not sure if that's related, or if it's something with the machines.

ping @ddebroy @jhowardmsft @johnstep PTAL

@thaJeztah thaJeztah added the platform/windows, kind/bug, and area/testing labels Oct 31, 2018
@AkihiroSuda
Member

What's current status?

@thaJeztah
Member Author

No change, but I'll try to ping some people to see if someone can look into this, or otherwise just remove the RS5 CI

@lowenna
Member

lowenna commented Nov 8, 2018

I've been OOF (surgery) and only just back. I'll try to look at it today or tomorrow. I think it's somehow relating to when the CI script moved into the repo.

@thaJeztah
Member Author

Welcome back @jhowardmsft 🎉

@lowenna
Member

lowenna commented Nov 9, 2018

Making progress. I have a manual workaround, but I'd probably need to do that every few days which isn't sustainable. There's a leak somewhere in the platform. I'm trying to repro on a newer OS version and involve the relevant kernel folks.

@lowenna
Member

lowenna commented Nov 9, 2018

(Should add that a couple of nodes are currently back and working as of this moment.)

@thaJeztah
Copy link
Member Author

Thanks @jhowardmsft - would this be something we could do with a cronjob?

@lowenna
Member

lowenna commented Nov 9, 2018

What's cron? 😆 This is Windows 😇. No, it's a kernel leak.

@thaJeztah
Member Author

Haha, thought about that after writing it. CRON.EXE then 😉😂☺️

That's a pity 🤷‍♂️ was hoping we could make these machines less "pet", more "cattle"

@lowenna
Member

lowenna commented Nov 12, 2018

Quick update - being tracked internally by VSO#19599026. I've got it down to a minimal repro. Unfortunately, once the system is in this state, a reboot appears to be the only solution as kernel resources are locked exclusively - it appears that it relates to a silo not completely exiting. There's at least one CI test which I know to hit this condition. As a temporary measure, I'll disable that test, and possibly one of three others (still narrowing down) which seem to also sometimes cause it. That will at least make the CI servers resilient and not have to be cleaned up every few days due to leaks. Investigation is now with the kernel team here to find the bug and hopefully we can get a fix in a future Windows update.

lowenna pushed a commit to microsoft/docker that referenced this issue Nov 13, 2018
Signed-off-by: John Howard <jhoward@microsoft.com>

Certain tests here have testRequires(c, NotWindowsRS5Plus).

This is being tracked internally by VSO#19599026, and externally
through moby#38114. @jhowardmsft.
As of 11/12/2018, there's no workaround except a reboot.

Under certain circumstances, silos are not completely exiting
causing resources to remain locked exclusively in the kernel,
and can't be cleaned up. This is causing the RS5 CI servers to
fill up with disk space.

The bug seems to occur when a container is stopped then re-started
almost immediately after, which is a typical "restart" pattern.
Unfortunately, that's almost all of the tests here.
@olljanat
Contributor

@jhowardmsft I found that all three RS5 CI servers were in a state where all builds were failing.

So I did some investigations about this with my PR #38391

  • Added commit 2c9e03d to get more detailed errors out of it. They were:
18:07:46 docker : re-exec error: exit status 1: output: write \\?\D:\control\tmp\hcs008067931\Files\go.zip: There is not enough 
18:07:46 space on the disk.
18:07:46 At C:\gopath\src\github.com\docker\docker\hack\ci\windows.ps1:461 char:39
18:07:46 + ... sure-Command { docker build -t docker -f Dockerfile.windows . | Out-H ...
18:07:46 +                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
18:07:46     + CategoryInfo          : NotSpecified: (re-exec error: ...ce on the disk.:String) [], RemoteException
18:07:46     + FullyQualifiedErrorId : NativeCommandError

and

18:45:51 INFO: Base image for tests is microsoft/windowsservercore
18:45:52 INFO: Loading windowsservercore .tar from disk into the daemon under test. This may take some time...
18:46:41 docker-3fcffea1dc.exe : re-exec error: exit status 1: output: BackupWrite \\?\D:\CI\CI-3fcffea1dc\daemon\windowsfilter\
18:46:41 3610ea027b15254dbd127fad6cc562b9ed03a20979723c04edf904585e775393\Files\Windows\Boot\Fonts\jpn_boot.ttf: There is not 
18:46:41 enough space on the disk.
18:46:41 At C:\gopath\src\github.com\docker\docker\hack\ci\windows.ps1:710 char:17
18:46:41 + ...             & "$env:TEMP\binary\docker-$COMMITHASH" "-H=$($DASHH_CUT) ...
18:46:41 +                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
18:46:41     + CategoryInfo          : NotSpecified: (re-exec error: ...ce on the disk.:String) [], RemoteException
18:46:41     + FullyQualifiedErrorId : NativeCommandError
  • I used a couple of versions of commit ff5b444 to get all three nodes rebooted.
  • Then I reset #38391 back to a state where it only contains the needed commit, and saw that the build was now working on RS5. Looking at the log more carefully, I can see that the first Nuke after the reboot works just fine:
19:39:50 INFO: Commit hash is dd82172cc4
19:39:50 INFO: Nuke-Everything...
19:39:50 INFO: Container count on control daemon to delete is 0
19:39:51 INFO: Nuking d:\CI
19:55:51 INFO: Zapped successfully
19:55:51 INFO: Location for testing is d:\CI\CI-dd82172cc4
19:55:51 INFO: Validating GOLang consistency in Dockerfile.windows...

but after all the tests have run, the second Nuke fails (though it does not fail the build):

20:40:10 INFO: Tidying pidfile d:\CI\CI-dd82172cc4\docker.pid
20:40:10 INFO: Nuking d:\CI
20:40:12 ERROR:  hcsshim::DestroyLayer - Win32 API call returned error r1=0x80070020 err=The process cannot access the file because it is being used by another process.id=CI flavour=0
20:40:12 
20:40:12 INFO: executeCI.ps1 exiting at Tue Dec 18 20:40:12 CUT 2018. Duration 01:00:22.9026065
20:40:12 
20:40:12 Set build name.
20:40:12 New build name is '#38391'
20:40:13 [PostBuildScript] - Executing post build scripts.
20:40:13 Notifying endpoint with url 'https://leeroy.dockerproject.org/notification/jenkins'
20:40:14 Finished: SUCCESS

Also, just to add here the same thing I said in #38376 (comment): the rebuild/windowsRS5-process label does not work.

@johnstep
Member

This appears to be an issue related to the fontdrvhost package directory. After a full, successful run, I saw 20 instances of open file handles in the system process to the Windows\System32\config\systemprofile\AppData\Local\Packages\microsoft.windows.fontdrvhost directory in one of the test images, and the same count for each of its subdirectories, recursively (AC, AC\INetCache, AC\INetCookies, AC\INetHistory, and AC\Temp).

@thaJeztah
Member Author

Thanks @johnstep

@olljanat
Contributor

olljanat commented Jan 2, 2019

@jhowardmsft any news from the kernel team on this one?

@lowenna
Member

lowenna commented Jan 2, 2019

Nothing I can share externally yet, but folks are back looking at it now that Christmas/New Year is over.

@lowenna
Member

lowenna commented Jan 3, 2019

@olljanat @thaJeztah @johnstep We think we might have a fix. I'm going to re-deploy these machines in the next 24 hrs and hopefully all will be good again.

@thaJeztah
Member Author

Awesome, thanks! (and thanks to the team!)

@lowenna
Member

lowenna commented Jan 4, 2019

OK, nodes 2 and 3 are redeployed and online. 4 is on its way. 1 I'll have to do this evening.

@olljanat
Contributor

olljanat commented Jan 4, 2019

@jhowardmsft I do not want to spoil the party here, but I can see from this build which I just started that the OS build is now 17763.55 (it was 17763.1 earlier), and it looks like removing data from the CI folder still fails:

05:15:28 INFO: Nuking d:\CI
05:15:29 ERROR:  hcsshim::DestroyLayer - Win32 API call returned error r1=0x80070020 err=The process cannot access the file because it is being used by another process.id=CI flavour=0

@lowenna
Member

lowenna commented Jan 4, 2019

@olljanat That's actually still an older build. It's the .253 machines you want to be looking at. But no, in a full CI run, it's not fixed. Back to the kernel team....

@lowenna
Member

lowenna commented Jan 7, 2019

Further update - the storage folks have found a leak in the filter driver which could likely be the cause. I'll deploy a private fix on the servers later this week once I have a binary and have done some validation locally.

@lowenna
Member

lowenna commented Jan 7, 2019

Validated with a private fix on a local run. I'll patch the servers this afternoon. Hence closing this.


@lowenna lowenna closed this as completed Jan 7, 2019
@thaJeztah
Member Author

zap zap zap

@olljanat
Contributor

@jhowardmsft good. Do you know the KB number for the hotfix already? Will it be released in February?

@lowenna
Member

lowenna commented Jan 11, 2019

It’s not available publicly. I don’t think realistically it will be available until April or May at the earliest. That is a best guess only, not a commitment. I will update here when I know more, but bear in mind I have no control over Windows servicing and timelines.
