CNTRLPLANE-3329: Fix gocacheprog cache corruption with atomic writes#8624
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@celebdor: This pull request references CNTRLPLANE-3329 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
📝 WalkthroughWalkthroughAdded writeFileAtomic to write bytes to a temp file in the target directory and atomically rename it into place. handlePut now uses writeFileAtomic for both the output data file and the action entry file and returns errors if those atomic writes fail. A new test TestConcurrentPutSameOutputID runs concurrent PUTs (same OutputID) and interleaved GETs to detect any partial/corrupted reads. Sequence Diagram(s)sequenceDiagram
participant Client
participant handlePut
participant writeFileAtomic
participant Filesystem
participant Reader
Client->>handlePut: PUT request (OutputID, Body)
handlePut->>writeFileAtomic: writeFileAtomic(dPath, Body)
writeFileAtomic->>Filesystem: create temp file & write bytes
writeFileAtomic->>Filesystem: close temp file
writeFileAtomic->>Filesystem: rename temp -> dPath (atomic)
Filesystem-->>handlePut: write success
handlePut-->>Client: respond with DiskPath
Reader->>Filesystem: GET dPath (concurrent reads)
Filesystem-->>Reader: return full file contents (post-rename)
Possibly related PRs
🚥 Pre-merge checks | ✅ 11✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
/override ci/prow/images |
|
/override ci/prow/okd-scos-images |
|
/override ci/prow/verify-deps |
|
@celebdor: Overrode contexts on behalf of celebdor: ci/prow/verify-deps DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@celebdor: Overrode contexts on behalf of celebdor: ci/prow/okd-scos-images DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@celebdor: Overrode contexts on behalf of celebdor: ci/prow/images DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Actionable comments posted: 0 |
|
/hold Adding tests and fine tuning |
os.WriteFile uses O_TRUNC which temporarily zeros the file. During concurrent PUTs, a data file can be truncated while Go reads it via DiskPath, causing "bad checksum" errors in golangci-lint. Use temp-file-then-rename for atomic writes and skip writing data files that already exist with the correct size. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
b10fff2 to
64c2027
Compare
|
/hold cancel |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@contrib/ci/gocacheprog/main_test.go`:
- Around line 463-471: The test currently ignores GET misses and file read
errors; update the loop where getReq is created and handleGet is called (see
getReq, handleGet, resp.Miss, and os.ReadFile) so that if resp.Miss is true or
os.ReadFile returns an error you increment badReads (or the test failure
counter) just like when the bytes differ, and do not silently return; this
ensures misses and read errors are counted as test failures.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 1b0226b8-c931-497f-8399-90fb3393c8fc
📒 Files selected for processing (2)
contrib/ci/gocacheprog/main.gocontrib/ci/gocacheprog/main_test.go
| getReq := &request{ID: int64(i + 5000), Command: "get", ActionID: seedAction} | ||
| resp := handleGet(getReq, "", rwDir) | ||
| if resp.Miss { | ||
| return | ||
| } | ||
| data, err := os.ReadFile(resp.DiskPath) | ||
| if err != nil { | ||
| return | ||
| } |
There was a problem hiding this comment.
Count GET misses and read errors as test failures.
Right now this only increments badReads when the bytes differ, so the test still passes if concurrent PUTs make handleGet return a miss or make DiskPath unreadable. Those are part of the failure mode this test is supposed to catch.
🔧 Suggested fix
go func() {
defer wg.Done()
getReq := &request{ID: int64(i + 5000), Command: "get", ActionID: seedAction}
resp := handleGet(getReq, "", rwDir)
if resp.Miss {
+ badReads.Add(1)
return
}
data, err := os.ReadFile(resp.DiskPath)
if err != nil {
+ badReads.Add(1)
return
}
if string(data) != string(body) {
badReads.Add(1)
}
}()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| getReq := &request{ID: int64(i + 5000), Command: "get", ActionID: seedAction} | |
| resp := handleGet(getReq, "", rwDir) | |
| if resp.Miss { | |
| return | |
| } | |
| data, err := os.ReadFile(resp.DiskPath) | |
| if err != nil { | |
| return | |
| } | |
| go func() { | |
| defer wg.Done() | |
| getReq := &request{ID: int64(i + 5000), Command: "get", ActionID: seedAction} | |
| resp := handleGet(getReq, "", rwDir) | |
| if resp.Miss { | |
| badReads.Add(1) | |
| return | |
| } | |
| data, err := os.ReadFile(resp.DiskPath) | |
| if err != nil { | |
| badReads.Add(1) | |
| return | |
| } | |
| if string(data) != string(body) { | |
| badReads.Add(1) | |
| } | |
| }() |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@contrib/ci/gocacheprog/main_test.go` around lines 463 - 471, The test
currently ignores GET misses and file read errors; update the loop where getReq
is created and handleGet is called (see getReq, handleGet, resp.Miss, and
os.ReadFile) so that if resp.Miss is true or os.ReadFile returns an error you
increment badReads (or the test failure counter) just like when the bytes
differ, and do not silently return; this ensures misses and read errors are
counted as test failures.
Run gocacheprog unit tests on PRs that modify contrib/ci/gocacheprog/. Since gocacheprog is a separate Go module, the main repo test workflow does not cover it. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/gocacheprog-test.yaml:
- Line 13: The workflow currently references the reusable workflow as
"openshift/hypershift/.github/workflows/gocacheprog-test-reusable.yaml@main",
which is mutable; update that reference to an immutable revision by replacing
"`@main`" with a specific commit SHA (e.g. "@<commit-sha>") or point to a local
workflow file (e.g. use a relative path to a vendored copy) so CI runs a
deterministic reusable workflow revision.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: da86a048-52fa-4fc7-93e3-fd97fd5bb262
📒 Files selected for processing (2)
.github/workflows/gocacheprog-test-reusable.yaml.github/workflows/gocacheprog-test.yaml
|
|
||
| jobs: | ||
| test: | ||
| uses: openshift/hypershift/.github/workflows/gocacheprog-test-reusable.yaml@main |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== File: .github/workflows/gocacheprog-test.yaml =="
if [ -f ".github/workflows/gocacheprog-test.yaml" ]; then
nl -ba .github/workflows/gocacheprog-test.yaml | sed -n '1,160p'
else
echo "Missing file .github/workflows/gocacheprog-test.yaml"
fi
echo
echo "== Look for local reusable workflow file =="
ls -la .github/workflows | head -n 200 || true
echo
fd -H "gocacheprog-test-reusable\.yaml" .github/workflows || true
echo
echo "== Search for references to gocacheprog-test-reusable.yaml =="
rg -n "gocacheprog-test-reusable\.yaml(@|$)" .github/workflows -S || trueRepository: openshift/hypershift
Length of output: 177
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE=".github/workflows/gocacheprog-test.yaml"
echo "== File: $FILE =="
if [ -f "$FILE" ]; then
cat -n "$FILE" | sed -n '1,220p'
else
echo "Missing $FILE"
fi
echo
echo "== Listing .github/workflows (first 200 entries) =="
ls -la .github/workflows | head -n 200 || true
echo
echo "== Find local reusable workflow candidates =="
fd -H "gocacheprog-test-reusable\.yaml$" .github/workflows || true
fd -H "gocacheprog-test-reusable" .github/workflows || true
echo
echo "== Search for references to gocacheprog-test-reusable.yaml =="
rg -n "gocacheprog-test-reusable\.yaml(@|$)" .github/workflows -S || true
echo
echo "== Show any uses: .../gocacheprog-test-reusable... lines =="
rg -n "uses:\s*openshift/hypershift/\.github/workflows/gocacheprog-test-reusable\.yaml@" .github/workflows -S || trueRepository: openshift/hypershift
Length of output: 2943
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE=".github/workflows/gocacheprog-test-reusable.yaml"
echo "== File: $FILE =="
if [ -f "$FILE" ]; then
cat -n "$FILE" | sed -n '1,220p'
else
echo "Missing $FILE"
fi
echo
echo "== Search for tag/mutable refs in reusable workflow =="
rg -n "uses:\s*[^ ]+@((main)|(master)|(v[0-9]+)|(\w+))\b" .github/workflows/gocacheprog-test-reusable.yaml -S || true
echo
echo "== Search specifically for `@main` in reusable workflow =="
rg -n "`@main`\b" .github/workflows/gocacheprog-test-reusable.yaml -S || trueRepository: openshift/hypershift
Length of output: 1217
Pin reusable workflow reference to an immutable revision (avoid @main).
In .github/workflows/gocacheprog-test.yaml, the job calls openshift/hypershift/.github/workflows/gocacheprog-test-reusable.yaml@main (mutable), so CI won’t necessarily run the same reusable workflow revision deterministically. Use a local workflow reference (or a full commit SHA) instead.
🔒 Proposed fix
- uses: openshift/hypershift/.github/workflows/gocacheprog-test-reusable.yaml@main
+ uses: ./.github/workflows/gocacheprog-test-reusable.yaml🧰 Tools
🪛 zizmor (1.25.2)
[error] 13-13: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)
(unpinned-uses)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/gocacheprog-test.yaml at line 13, The workflow currently
references the reusable workflow as
"openshift/hypershift/.github/workflows/gocacheprog-test-reusable.yaml@main",
which is mutable; update that reference to an immutable revision by replacing
"`@main`" with a specific commit SHA (e.g. "@<commit-sha>") or point to a local
workflow file (e.g. use a relative path to a vendored copy) so CI runs a
deterministic reusable workflow revision.
|
@celebdor: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
Now I have the complete picture. Let me produce the report. Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryThe lint job fails with a Go toolchain Root CauseThe The shared read-only cache at The corruption is non-deterministic in which file it manifests — different runs hit different files (e.g., Ironically, PR #8624 itself introduces the fix for this class of problem — it replaces Recommendations
Evidence
|
|
/approve |
|
Pipeline controller notification No second-stage tests were triggered for this PR. This can happen when:
Use |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: celebdor, csrwng The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Summary
typecheckfailures caused byGOCACHEPROGserving corrupted cache entriesos.WriteFileusesO_TRUNCwhich temporarily zeros a file — concurrent readers viaDiskPathsee partial content and get "bad checksum" errorsrenameis atomic on POSIX, so readers never see partial content)TestConcurrentPutSameOutputIDwhich deterministically reproduces the corruption (50/50 failures without the fix, 0 failures with it)contrib/ci/gocacheprog/(since it's a separate Go module, the main test workflow doesn't cover it)Supersedes #8623 (the lint workaround) once this merges and the runner image rebuilds.
Test plan
-raceTestConcurrentPutSameOutputIDfails 50/50 without fix, passes 20/20 with fixgo build ./...thenmake lintvia gocacheprog — fails without fix, passes with fix🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests
Chores