OCPBUGS-33013: certsyncpod: Swap secret/cm directories atomically #2009

tchap · 2025-09-09T15:41:28Z

Currently, cert-syncer can replace some of the secret/configmap files successfully and then fail, which leaves the directory inconsistent. This is a problem when the files are, e.g., a TLS cert/key pair. The inconsistency may seem transient, but when cert-syncer is terminated in the middle of an update, it can later fail to start because the whole kube-apiserver gets into a crash loop.

This introduces a new staticpod.SwapDirectoriesAtomic, which uses unix.Renameat2 with the RENAME_EXCHANGE flag set.

openshift-ci-robot · 2025-09-10T12:56:08Z

@tchap: This pull request references Jira Issue OCPBUGS-33013, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.21.0) matches configured target version for branch (4.21.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

The bug has been updated to refer to the pull request using the external bug tracker.

openshift-ci-robot · 2025-09-10T12:57:22Z

@tchap: This pull request references Jira Issue OCPBUGS-33013, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.21.0) matches configured target version for branch (4.21.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

In response to this:

Currently it can happen that cert-syncer replaces some of the secret/configmap files successfully and then fails. This can lead to problems when these are e.g. TLS cert/key files and the directory gets inconsistent. This may seem transient, but when cert-syncer is terminated in the middle, it can later fail to start as the whole kube-apiserver gets into a crash loop.

This introduces a new staticpod.SwapDirectoriesAtomic, which uses unix.Renameat2 with RENAME_EXCHANGE flag set. This should not be a problem as this call is supported since Linux 3.15 on all modern file systems.

openshift-ci-robot · 2025-09-10T13:11:53Z

@tchap: This pull request references Jira Issue OCPBUGS-33013, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.21.0) matches configured target version for branch (4.21.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

The bug has been updated to refer to the pull request using the external bug tracker.

tchap · 2025-09-10T13:12:25Z

I actually have to make sure this can be merged as this is only supported on Linux 3.15 or later.

/hold

tchap · 2025-09-10T13:47:04Z

This patch should be OK for RHEL 8 or later based on https://access.redhat.com/articles/3078

The latest CI for OCP 4.21 actually uses RHEL 9.6.

tchap · 2025-09-11T12:35:32Z

The PR using this change in cluster-kube-apiserver-operator seems to be passing CI, so I deem this ready.

/unhold

p0lyn0mial · 2025-09-15T07:51:42Z

@tchap is there a must-gather from an incident i could take a look at ?

tchap · 2025-09-15T07:57:19Z

@p0lyn0mial I think that I consulted this one: https://access.redhat.com/support/cases/#/case/03849958/discussion?attachmentId=a096R00003JpGgMQAV

p0lyn0mial · 2025-09-15T08:08:24Z

@vrutkovs do you have time to take a look at this issue ?

I think that the issue might be real. I think the issue occurs when a two-file cert is replaced. It can happen that the server picks up the update, notices the public/private key mismatch, and crashes. Is there a way to repro this issue?
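The mismatch described here can be illustrated outside a cluster. The sketch below generates two self-signed certificates and shows that Go's tls.X509KeyPair refuses a new certificate paired with an old private key. Whether kube-apiserver hits exactly this code path is an assumption; newKeyAndCert and pairLoads are hypothetical helpers written for this example.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"math/big"
	"time"
)

// newKeyAndCert generates an ECDSA key and a self-signed certificate for it,
// both PEM-encoded, standing in for one version of a TLS secret.
func newKeyAndCert(commonName string) (certPEM, keyPEM []byte, err error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: commonName},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
	}
	certDER, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	keyDER, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		return nil, nil, err
	}
	certPEM = pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: certDER})
	keyPEM = pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
	return certPEM, keyPEM, nil
}

// pairLoads reports whether the certificate and key form a usable pair.
// tls.X509KeyPair fails when the private key does not match the public key.
func pairLoads(certPEM, keyPEM []byte) bool {
	_, err := tls.X509KeyPair(certPEM, keyPEM)
	return err == nil
}
```

A reader that loads the directory between the two file writes would be in the pairLoads(newCert, oldKey) situation, which returns false.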

tchap · 2025-09-23T08:58:47Z

> at some point we should vendor this PR to kas-o and run the above tests to make sure the changes in this pr work.

@p0lyn0mial Yeah, there is already openshift/cluster-kube-apiserver-operator#1917 that I used initially to run additional tests, but I will update it once I attend to all review issues.

vrutkovs · 2025-09-23T09:04:17Z

Pretty sure the previous approach would break on dynamic configmaps (it's already broken on static TLS secrets!). The difference in approach is important: previously we treated files one by one; now we treat an entire configmap/secret as a directory and make each element depend on the success of the entire resource update.

vrutkovs · 2025-09-23T09:04:39Z

/approve

to make sure we agree on the approach; we still need to sort out the details

tchap · 2025-09-25T13:42:46Z

There is really only one last pending issue regarding code from my perspective, which is that @p0lyn0mial mentioned we should have SwapDirectoriesAtomic implemented for all platforms. This is simply not possible because Linux is the only OS that supports such operation. So I am not really sure how to proceed. It can be implemented in a non-atomic manner for other OSes, but I am not sure we really want that. Or is that just for running tests and we can be sure this is never gonna be deployed?

vrutkovs · 2025-09-26T08:42:58Z

> we should have SwapDirectoriesAtomic implemented for all platforms. This is simply not possible because Linux is the only OS that supports such operation

Iiuc we want an implementation for unit tests, which means it doesn't have to be truly atomic, just good enough to pass unit tests on other platforms (mac primarily, let's hope no one uses Windows/BSD to run unit tests)

tchap · 2025-09-26T09:50:56Z

I added the mock implementation and tested it on my Mac.

vrutkovs

/lgtm

Let's bump the kube-apiserver test PR and make sure it doesn't cause any side effects on e2e runs?

tchap · 2025-09-26T11:14:46Z

I squashed all the commits into one, LGTM removed 😐

Updated the KASO PR to include the current version of this PR and also added the extra tests mentioned by @vrutkovs

openshift/cluster-kube-apiserver-operator#1917

vrutkovs · 2025-09-26T11:56:48Z

/lgtm

openshift-ci · 2025-09-26T11:57:37Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tchap, vrutkovs
Once this PR has been reviewed and has the lgtm label, please assign tkashem for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

pkg/operator/staticpod/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tchap · 2025-09-26T11:59:24Z

/hold

until we are sure the testing KASO PR works just fine.

p0lyn0mial · 2025-09-30T10:42:55Z

pkg/operator/staticpod/certsyncpod/certsync_controller.go

+		eventRecorder.Warningf("CertificateUpdateFailed", "Failed to hash current content directory for %s: %s/%s: %v", typeName, o.Namespace, o.Name, err)
+		return err
+	}
+	if !bytes.Equal(targetDirHashBefore, targetDirHashAfter) {


If two processes are writing to the same directory, hashing won’t guarantee atomicity, because the second process could start modifying the directory after the check.

A possible solution would be to make the second process also swap the directory atomically, perhaps using the same technique.

I don’t think the order matters, so atomically swapping the contents should be enough. Thoughts?

It needs to happen twice - when both targetDirHashBefore and targetDirHashAfter are calculated and needs to end up in the consistent state (if the second write is a bit slower/faster, the hashes will mismatch). IMO it's too rare to guard against

The number of times we calculate and compare the hashes doesn’t matter, because the second process can start writing or changing the target directory after the last comparison or calculation.

In this case, the second process is the installer pod.

We should change the installer pod to perform an atomic swap rather than rely on the hashes.

In my understanding, the desired procedure is:

1. make a hash of source dir
2. prepare a new dir from source dir
3. hash source dir again to make sure it didn't change during previous step
4. we swap atomically if hashes match

So the only chance of invalid swap here is if it initiates after step 1 and between step 3 and 4 - that would be a very rare event, wouldn't it?

> step 3 and 4 - that would be a very rare event, wouldn't it?

This PR addresses a rare bug where kas can observe an old prv key together with a new pub key, which causes a crash :(

I think we should take this potential issue into account and fix it.

I aligned installerpod to use the same atomic swap. I also refactored the PR a bit because I had to pull shared code into a separate package. The swapping function is now called SyncDirectory since it replaces the content actually.

openshift-ci · 2025-09-30T13:31:50Z

New changes are detected. LGTM label has been removed.

p0lyn0mial

@tchap for easier review please open a new pr with the SwapDirectories function + tests.

and then a new PR for the SyncDirectory function.

pkg/operator/staticpod/internal/atomicfiles/sync_directory.go

p0lyn0mial · 2025-10-01T09:19:50Z

pkg/operator/staticpod/internal/atomicfiles/sync_directory.go

+//
+// SyncDirectory is supposed to be used with secrets/configmaps, so typeName is expected to be "configmap" or "secret".
+// This does not affect the logic, but it's included in error messages.
+func SyncDirectory[C string | []byte](


How about changing this method to: Write(targetDir string, files map[string][]byte, filePerm os.FileMode)

After all, we are writing files to the target directory in an atomic way.
Callers have to transform files to []byte, which is easy since strings in Go convert directly to []byte.

I used Sync instead of Write because Write feels like it doesn't affect existing files. But I don't have a strong opinion really.

p0lyn0mial · 2025-10-01T09:20:32Z

pkg/operator/staticpod/internal/atomicfiles/sync_directory.go

+	}
+	defer func() {
+		if err := fs.RemoveAll(tmpDir); err != nil {
+			klog.Errorf("Failed to remove temporary directory %q during cleanup: %v", tmpDir, err)


We should return the error to the callers, not only log it.

pkg/operator/staticpod/internal/atomicfiles/swap_directories_linux.go

pkg/operator/staticpod/internal/atomicfiles/swap_directories_other.go

openshift-ci · 2025-10-02T14:17:09Z

@tchap: all tests passed!

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 9, 2025

openshift-ci bot requested review from p0lyn0mial and tkashem September 9, 2025 15:44

tchap force-pushed the atomic-certsync branch from b3eca57 to a389cbd Compare September 10, 2025 09:12

tchap mentioned this pull request Sep 10, 2025

WIP: Update library-go to improve cert-syncer openshift/cluster-kube-apiserver-operator#1917

Open

tchap force-pushed the atomic-certsync branch 4 times, most recently from 8dec327 to 60f05a8 Compare September 10, 2025 12:45

tchap changed the title ~~WIP: certsyncpod: Swap secret/cm directories atomically~~ OCPBUGS-33013: certsyncpod: Swap secret/cm directories atomically Sep 10, 2025

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 10, 2025

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 10, 2025

openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Sep 10, 2025

openshift-ci bot requested a review from wangke19 September 10, 2025 12:56

tchap changed the title ~~OCPBUGS-33013: certsyncpod: Swap secret/cm directories atomically~~ WIP: OCPBUGS-33013: certsyncpod: Swap secret/cm directories atomically Sep 10, 2025

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 10, 2025

tchap changed the title ~~WIP: OCPBUGS-33013: certsyncpod: Swap secret/cm directories atomically~~ OCPBUGS-33013: certsyncpod: Swap secret/cm directories atomically Sep 10, 2025

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 10, 2025

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 10, 2025

tchap force-pushed the atomic-certsync branch from 60f05a8 to f6df27a Compare September 10, 2025 14:03

openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 11, 2025

tchap force-pushed the atomic-certsync branch from 8087421 to 2eb032a Compare September 26, 2025 09:31

vrutkovs approved these changes Sep 26, 2025

openshift-ci bot assigned vrutkovs Sep 26, 2025

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 26, 2025

tchap force-pushed the atomic-certsync branch from 2eb032a to c8b5cec Compare September 26, 2025 11:09

openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 26, 2025

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 26, 2025

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 26, 2025

p0lyn0mial reviewed Sep 30, 2025

openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 30, 2025

tchap force-pushed the atomic-certsync branch from 5c2ed6a to ccd7851 Compare September 30, 2025 14:01

p0lyn0mial reviewed Oct 1, 2025

This was referenced Oct 1, 2025

OCPBUGS-33013: Add atomicdir package #2026

Merged

OCPBUGS-33013: Add atomicdir.Sync function #2027

Open

openshift-ci-robot mentioned this pull request Oct 2, 2025

OCPBUGS-33013: operator/staticpod/internal/atomicdir/swap_other: add missing import #2028

Merged

tchap added 2 commits October 2, 2025 15:25

Add atomicdir.Sync function

4bebc96

The function can be used to atomically sync a directory with the desired state. This uses atomicdir.swap implemented earlier.

certsync+installerpod: Use atomicdir.Sync

b57293f

tchap force-pushed the atomic-certsync branch from ccd7851 to b57293f Compare October 2, 2025 13:58
