Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky test: TestBuildWCOWSandboxSize (insufficient disk space?) #42743

Open
thaJeztah opened this issue Aug 12, 2021 · 8 comments
Open

Flaky test: TestBuildWCOWSandboxSize (insufficient disk space?) #42743

thaJeztah opened this issue Aug 12, 2021 · 8 comments
Labels
area/testing kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. platform/windows

Comments

@thaJeztah
Copy link
Member

thaJeztah commented Aug 12, 2021

This test is failing frequently with a There is not enough space on the disk. failure;

=== RUN   TestBuildWCOWSandboxSize
    build_test.go:572: assertion failed: value "{\"stream\":\"Step 1/8 : FROM busybox AS intermediate\"}\r\n{\"stream\":\"\\n\"}\r\n{\"stream\":\" ---\\u003e ea8f1b6e5c44\\n\"}\r\n{\"stream\":\"Step 2/8 : WORKDIR C:\\\\\\\\stuff\"}\r\n{\"stream\":\"\\n\"}\r\n{\"stream\":\" ---\\u003e Running in a2d453ddd861\\n\"}\r\n{\"stream\":\"Removing intermediate container a2d453ddd861\\n\"}\r\n{\"stream\":\" ---\\u003e 395a1b07f269\\n\"}\r\n{\"stream\":\"Step 3/8 : RUN fsutil file createnew C:\\\\\\\\stuff\\\\\\\\bigfile_0.txt 22548578304 \\u0026\\u0026 del bigfile_0.txt\"}\r\n{\"stream\":\"\\n\"}\r\n{\"stream\":\" ---\\u003e Running in 1f6957162b60\\n\"}\r\n{\"stream\":\"File C:\\\\stuff\\\\bigfile_0.txt is created\\r\\n\"}\r\n{\"stream\":\"Removing intermediate container 1f6957162b60\\n\"}\r\n{\"stream\":\" ---\\u003e 0a7a087b0669\\n\"}\r\n{\"stream\":\"Step 4/8 : RUN fsutil file createnew C:\\\\\\\\stuff\\\\\\\\bigfile_1.txt 7516192768\"}\r\n{\"stream\":\"\\n\"}\r\n{\"stream\":\" ---\\u003e Running in 478b5492a998\\n\"}\r\n{\"stream\":\"File C:\\\\stuff\\\\bigfile_1.txt is created\\r\\n\"}\r\n{\"stream\":\"Removing intermediate container 478b5492a998\\n\"}\r\n{\"stream\":\" ---\\u003e d26ac7985ca8\\n\"}\r\n{\"stream\":\"Step 5/8 : RUN fsutil file createnew C:\\\\\\\\stuff\\\\\\\\bigfile_2.txt 7516192768\"}\r\n{\"stream\":\"\\n\"}\r\n{\"stream\":\" ---\\u003e Running in a776a76797c8\\n\"}\r\n{\"stream\":\"File C:\\\\stuff\\\\bigfile_2.txt is created\\r\\n\"}\r\n{\"stream\":\"Removing intermediate container a776a76797c8\\n\"}\r\n{\"stream\":\" ---\\u003e 0a0dd247d26f\\n\"}\r\n{\"stream\":\"Step 6/8 : RUN fsutil file createnew C:\\\\\\\\stuff\\\\\\\\bigfile_3.txt 7516192768\"}\r\n{\"stream\":\"\\n\"}\r\n{\"stream\":\" ---\\u003e Running in 8dcb313b2e5c\\n\"}\r\n{\"stream\":\"File C:\\\\stuff\\\\bigfile_3.txt is created\\r\\n\"}\r\n{\"stream\":\"Removing intermediate container 8dcb313b2e5c\\n\"}\r\n{\"stream\":\" ---\\u003e 9111caa5c35e\\n\"}\r\n{\"aux\":{\"ID\":\"sha256:9111caa5c35ea45c8a2bcc410a5e2a62532ce48fe877fd065c1e457c2fafd7f6\"}}\r\n{\"stream\":\"Step 7/8 : FROM busybox\"}\r\n{\"stream\":\"\\n\"}\r\n{\"stream\":\" ---\\u003e ea8f1b6e5c44\\n\"}\r\n{\"stream\":\"Step 8/8 : COPY --from=intermediate C:\\\\\\\\stuff C:\\\\\\\\stuff\"}\r\n{\"stream\":\"\\n\"}\r\n{\"errorDetail\":{\"message\":\"re-exec error: exit status 1: output: hcsshim::ImportLayer - failed failed in Win32: The system cannot find the path specified. (0x3)\"},\"error\":\"re-exec error: exit status 1: output: hcsshim::ImportLayer - failed failed in Win32: The system cannot find the path specified. (0x3)\"}\r\n" does not match regexp "Successfully built|re-exec error: exit status 1: output: write.*daemon\\\\\\\\tmp\\\\\\\\hcs.*bigfile_[1-3].txt: There is not enough space on the disk."
--- FAIL: TestBuildWCOWSandboxSize (547.71s)

Cleaned-up output:

{"stream":"Step 1/8 : FROM busybox AS intermediate"}
{"stream":"\\n"}
{"stream":" ---\\u003e ea8f1b6e5c44\\n"}
{"stream":"Step 2/8 : WORKDIR C:\\\\\\\\stuff"}
{"stream":"\\n"}
{"stream":" ---\\u003e Running in a2d453ddd861\\n"}
{"stream":"Removing intermediate container a2d453ddd861\\n"}
{"stream":" ---\\u003e 395a1b07f269\\n"}
{"stream":"Step 3/8 : RUN fsutil file createnew C:\\\\\\\\stuff\\\\\\\\bigfile_0.txt 22548578304 \\u0026\\u0026 del bigfile_0.txt"}
{"stream":"\\n"}
{"stream":" ---\\u003e Running in 1f6957162b60\\n"}
{"stream":"File C:\\\\stuff\\\\bigfile_0.txt is created\\r\\n"}
{"stream":"Removing intermediate container 1f6957162b60\\n"}
{"stream":" ---\\u003e 0a7a087b0669\\n"}
{"stream":"Step 4/8 : RUN fsutil file createnew C:\\\\\\\\stuff\\\\\\\\bigfile_1.txt 7516192768"}
{"stream":"\\n"}
{"stream":" ---\\u003e Running in 478b5492a998\\n"}
{"stream":"File C:\\\\stuff\\\\bigfile_1.txt is created\\r\\n"}
{"stream":"Removing intermediate container 478b5492a998\\n"}
{"stream":" ---\\u003e d26ac7985ca8\\n"}
{"stream":"Step 5/8 : RUN fsutil file createnew C:\\\\\\\\stuff\\\\\\\\bigfile_2.txt 7516192768"}
{"stream":"\\n"}
{"stream":" ---\\u003e Running in a776a76797c8\\n"}
{"stream":"File C:\\\\stuff\\\\bigfile_2.txt is created\\r\\n"}
{"stream":"Removing intermediate container a776a76797c8\\n"}
{"stream":" ---\\u003e 0a0dd247d26f\\n"}
{"stream":"Step 6/8 : RUN fsutil file createnew C:\\\\\\\\stuff\\\\\\\\bigfile_3.txt 7516192768"}
{"stream":"\\n"}
{"stream":" ---\\u003e Running in 8dcb313b2e5c\\n"}
{"stream":"File C:\\\\stuff\\\\bigfile_3.txt is created\\r\\n"}
{"stream":"Removing intermediate container 8dcb313b2e5c\\n"}
{"stream":" ---\\u003e 9111caa5c35e\\n"}
{"aux":{"ID":"sha256:9111caa5c35ea45c8a2bcc410a5e2a62532ce48fe877fd065c1e457c2fafd7f6"}}
{"stream":"Step 7/8 : FROM busybox"}
{"stream":"\\n"}
{"stream":" ---\\u003e ea8f1b6e5c44\\n"}
{"stream":"Step 8/8 : COPY --from=intermediate C:\\\\\\\\stuff C:\\\\\\\\stuff"}
{"stream":"\\n"}
{"errorDetail":{"message":"re-exec error: exit status 1: output: hcsshim::ImportLayer - failed failed in Win32: The system cannot find the path specified. (0x3)"},"error":"re-exec error: exit status 1: output: hcsshim::ImportLayer - failed failed in Win32: The system cannot find the path specified. (0x3)"}

We should

  • check if the error is genuine (is there indeed insufficient disk space, or is it an error that's converted incorrectly?
  • if disk space is insufficient for the test, we should check space before running the test, and skip if if there's not enough space.

It's possible that the Jenkins agents run out of space over time, but they are purged/rotated frequently, which should still give us coverage for that specific test (I think?)

@thaJeztah thaJeztah added platform/windows kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. area/testing labels Aug 12, 2021
@thaJeztah
Copy link
Member Author

Ahm, right, so looking more closely at the test, the There is not enough space on the disk is not the error in CI, but the string that the test expects to find.

The test is a bit weird, because it passes if it either "builds successfully" or "fails" because it ran out of disk space.

The test passes if either:

  • the image build succeeded; or
  • The COPY --from=intermediate step ran out of space during re-exec'd writing of the transport layer information to hcsshim's temp directory

The latter case means we finished the COPY operation, so the sandbox must have been larger than 20GB, which was the test, and then ran out of space on the host during importLayer in the WindowsFilter graph driver, while committing the layer. See #41636 (comment) for more details on the operations being done here.
Specifically, this happens on the Docker Jenkins CI Windows-RS5 build nodes. The two parts of the acceptable-failure case are on different lines, so we need two regexp checks.

For the failure case, the output should contain:

re-exec error: exit status 1: output: write.*daemon\\tmp\\hcs.*bigfile_[1-3].txt: There is not enough space on the disk.

But it sometimes gets:

re-exec error: exit status 1: output: hcsshim::ImportLayer - failed failed in Win32: The system cannot find the path specified. (0x3)

Perhaps the error changed in Windows or in hcsshim?

@thaJeztah
Copy link
Member Author

Gotta love the double failed in the error message (hcsshim::ImportLayer - failed failed)

@thaJeztah
Copy link
Member Author

Opened microsoft/hcsshim#1139 to fix the duplicate failed failed in the error messages

@thaJeztah
Copy link
Member Author

So the (0x3) looks to indeed correspond with ERROR_PATH_NOT_FOUND; https://docs.microsoft.com/en-us/windows/win32/debug/system-error-codes--0-499-

Not sure why it now fails with a different error. I somewhat suspect this is a change in Windows ?

@thaJeztah
Copy link
Member Author

@TBBle any ideas for this one?

@TBBle
Copy link
Contributor

TBBle commented Sep 1, 2021

I was looking at this when it showed up in CI earlier, and I'm mystified. And then forgot to come back, sorry. >_<

This is coming from Win32, so it's some HCS API call that's failing. The data layout should be fine, because this is importing from a tarball it created with exportLayer elsewhere, so it's also HCS that generated the data being consumed.

So I can find it later, I believe we're failing this block:

	err = importLayer(&stdDriverInfo, path, importFolderPath, layers)
	if err != nil {
		return hcserror.New(err, title+" - failed", "")
	}

where importLayer is a syscall to vmcompute.dll's ImportLayer.

I don't know why this would have suddenly appeared, since I put this test through its paces pretty heavily at the time I added it.

My first guess is that something is cleaning up the temp directory from which the import is running, while the import is running, but that would be very odd; unless running low/out of space is somehow triggering a naive temp-dir clean-up, e.g. Storage Sense self-activating (see the FAQ).

Oddly enough, earlier this or last week we saw a problem from Storage Sense self-activating at my work on ue4-docker Discord and deleting files from $TEMP between creation and consumption, which is why I know about the behaviour.

We might get more info (like, which file was missing? That would help a lot) if it reproduces without re-exec and maybe hook up the opencensus tracer to dockerd's log output. Unless I misremember, the equivalent of this doesn't appear in this project, so we haven't had hcsshim logs since hcsshim moved from logrus to opencensus internally. (I assume that was an oversight, but perhaps it was intentional?)

It's also possible that there is no more information to be had from the API call, and we'd need to reproduce it under Process Monitor or something to see the filesystem access that's failing inside vmcompute.dll.

Poking around, there's microsoft/hcsshim#835 which is the same error message (but that doesn't mean a lot; I reckon there's three different issues just in that ticket which lead to the same error text). Possibilities that arise from that ticket are:

@thaJeztah
Copy link
Member Author

Virus scanner or other filesystem messer-with is removing a file (or making it inaccessible temporarily) I really really hope isn't a thing on the CI nodes? Did a virus scanner get recently installed or enabled? I'd have expected 'file in use' for this case though, if it was being scanned, and if it's been quarantined... Why would the virus scanner quarantine a multi-gigabyte file full of 0?

Good point on virus scanner; I recall there were issues in CI w.r.t. Defender (could've been because CI uses non-standard paths for some storage, don't recall the details). And there's some code in place to check if it's enabled;

moby/hack/ci/windows.ps1

Lines 273 to 283 in 8207c05

# Make sure Windows Defender is disabled
$defender = $false
Try {
$status = Get-MpComputerStatus
if ($status) {
if ($status.RealTimeProtectionEnabled) {
$defender = $true
}
}
} Catch {}
if ($defender) { Write-Host -ForegroundColor Magenta "WARN: Windows Defender real time protection is enabled, which may cause some integration tests to fail" }

(Haven't checked the logs yet if it's printing that warning)

@thaJeztah
Copy link
Member Author

Nope; just checked a failed run, and it doesn't have that message, so shouldn't be the cause 🤔 (https://ci-next.docker.com/public/blue/rest/organizations/jenkins/pipelines/moby/branches/PR-42736/runs/5/nodes/305/log/?start=0)

thaJeztah added a commit to thaJeztah/docker that referenced this issue Nov 10, 2021
This test is failing frequently once nodes have less disk space
available. Skipping the test for now, but we can continue looking
for a good solution.

Tracked through moby#42743

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
docker-jenkins pushed a commit to docker-archive/docker-ce that referenced this issue Nov 10, 2021
This test is failing frequently once nodes have less disk space
available. Skipping the test for now, but we can continue looking
for a good solution.

Tracked through moby/moby#42743

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Upstream-commit: 2a6a4587fab0c89317585965dd80e6bb4bd03293
Component: engine
evol262 pushed a commit to evol262/moby that referenced this issue Jan 12, 2022
This test is failing frequently once nodes have less disk space
available. Skipping the test for now, but we can continue looking
for a good solution.

Tracked through moby#42743

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/testing kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. platform/windows
Projects
None yet
Development

No branches or pull requests

2 participants