
tchap
Contributor

@tchap tchap commented Sep 9, 2025

Currently it can happen that cert-syncer replaces some of the secret/configmap files successfully and then fails. This can lead to problems when these are e.g. TLS cert/key files and the directory gets inconsistent. This may seem transient, but when cert-syncer is terminated in the middle, it can later fail to start as the whole kube-apiserver gets into a crash loop.

This introduces a new staticpod.SwapDirectoriesAtomic, which uses unix.Renameat2 with RENAME_EXCHANGE flag set.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 9, 2025
@openshift-ci openshift-ci bot requested review from p0lyn0mial and tkashem September 9, 2025 15:44
@tchap tchap force-pushed the atomic-certsync branch 4 times, most recently from 8dec327 to 60f05a8 Compare September 10, 2025 12:45
@tchap tchap changed the title WIP: certsyncpod: Swap secret/cm directories atomically OCPBUGS-33013: certsyncpod: Swap secret/cm directories atomically Sep 10, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 10, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 10, 2025
@openshift-ci-robot

@tchap: This pull request references Jira Issue OCPBUGS-33013, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

The bug has been updated to refer to the pull request using the external bug tracker.


@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Sep 10, 2025
@openshift-ci openshift-ci bot requested a review from wangke19 September 10, 2025 12:56
@openshift-ci-robot

@tchap: This pull request references Jira Issue OCPBUGS-33013, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

In response to this:

Currently it can happen that cert-syncer replaces some of the secret/configmap files successfully and then fails. This can lead to problems when these are e.g. TLS cert/key files and the directory gets inconsistent. This may seem transient, but when cert-syncer is terminated in the middle, it can later fail to start as the whole kube-apiserver gets into a crash loop.

This introduces a new staticpod.SwapDirectoriesAtomic, which uses unix.Renameat2 with RENAME_EXCHANGE flag set. This should not be a problem as this call is supported since Linux 3.15 on all modern file systems.


@tchap tchap changed the title OCPBUGS-33013: certsyncpod: Swap secret/cm directories atomically WIP: OCPBUGS-33013: certsyncpod: Swap secret/cm directories atomically Sep 10, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 10, 2025
@openshift-ci-robot

@tchap: This pull request references Jira Issue OCPBUGS-33013, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

The bug has been updated to refer to the pull request using the external bug tracker.


@tchap tchap changed the title WIP: OCPBUGS-33013: certsyncpod: Swap secret/cm directories atomically OCPBUGS-33013: certsyncpod: Swap secret/cm directories atomically Sep 10, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 10, 2025
@tchap
Contributor Author

tchap commented Sep 10, 2025

I actually have to make sure this can be merged as this is only supported on Linux 3.15 or later.

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 10, 2025
@tchap
Contributor Author

tchap commented Sep 10, 2025

This patch should be OK for RHEL 8 or later based on https://access.redhat.com/articles/3078

The latest CI for OCP 4.21 actually uses RHEL 9.6.

@tchap
Contributor Author

tchap commented Sep 11, 2025

The PR using this change in cluster-kube-apiserver-operator seems to be passing on CI, I deem this ready.

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 11, 2025
@p0lyn0mial
Contributor

@tchap is there a must-gather from an incident I could take a look at?

@tchap
Contributor Author

tchap commented Sep 15, 2025

@p0lyn0mial
Contributor

@vrutkovs do you have time to take a look at this issue?

I think the issue might be real. It seems to occur when a two-file cert is replaced: the server can pick up the update midway, notice the public/private key mismatch, and crash. Is there a way to repro this issue?

@tchap
Contributor Author

tchap commented Sep 23, 2025

at some point we should vendor this PR to kas-o and run the above tests to make sure the changes in this pr work.

@p0lyn0mial Yeah, there is already openshift/cluster-kube-apiserver-operator#1917 that I used initially to run additional tests, but I will update it once I attend to all review issues.

@vrutkovs
Member

Pretty sure the previous approach would break on dynamic configmaps (it's already broken on static TLS secrets!). The difference in approach is important: previously we treated files one by one, now we treat an entire configmap/secret as a directory and make each element depend on the success of the entire resource update.

@vrutkovs
Member

/approve

to signal that we agree on the approach and only need to sort out the details

@tchap
Contributor Author

tchap commented Sep 25, 2025

There is really only one last pending issue regarding code from my perspective, which is that @p0lyn0mial mentioned we should have SwapDirectoriesAtomic implemented for all platforms. This is simply not possible because Linux is the only OS that supports such operation. So I am not really sure how to proceed. It can be implemented in a non-atomic manner for other OSes, but I am not sure we really want that. Or is that just for running tests and we can be sure this is never gonna be deployed?

@vrutkovs
Member

we should have SwapDirectoriesAtomic implemented for all platforms. This is simply not possible because Linux is the only OS that supports such operation

IIUC we want an implementation for unit tests, which means it doesn't have to be truly atomic, just good enough to pass unit tests on other platforms (Mac primarily; let's hope no one uses Windows/BSD to run unit tests)

@tchap
Contributor Author

tchap commented Sep 26, 2025

I added the mock implementation and tested it on my Mac.
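
The mock itself is not shown in this thread. As a rough illustration only, a non-atomic fallback of the kind discussed above could be built from three plain renames. The sketch below is an assumption, not the PR's code: `swapDirectoriesFallback` and `demoSwap` are hypothetical names, and per the PR description the Linux implementation relies on `unix.Renameat2` with `RENAME_EXCHANGE` instead.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// swapDirectoriesFallback exchanges the directories at dirA and dirB using
// three plain renames. Hypothetical helper, NOT atomic: a crash between the
// renames leaves one of the two paths missing, so it only suits unit tests.
func swapDirectoriesFallback(dirA, dirB string) error {
	// Reserve a unique sibling name so every rename stays on one filesystem.
	parking, err := os.MkdirTemp(filepath.Dir(filepath.Clean(dirA)), ".swap-")
	if err != nil {
		return fmt.Errorf("reserving parking name: %w", err)
	}
	// Drop the empty placeholder; only its unique name is needed.
	if err := os.Remove(parking); err != nil {
		return fmt.Errorf("removing placeholder %q: %w", parking, err)
	}
	if err := os.Rename(dirA, parking); err != nil {
		return fmt.Errorf("parking %q: %w", dirA, err)
	}
	if err := os.Rename(dirB, dirA); err != nil {
		_ = os.Rename(parking, dirA) // best-effort rollback
		return fmt.Errorf("moving %q to %q: %w", dirB, dirA, err)
	}
	if err := os.Rename(parking, dirB); err != nil {
		return fmt.Errorf("moving %q to %q: %w", parking, dirB, err)
	}
	return nil
}

// demoSwap builds two directories holding different certificates, swaps
// them, and returns what each path contains afterwards.
func demoSwap() (live, staged string, err error) {
	root, err := os.MkdirTemp("", "swap-demo-")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(root)

	liveDir := filepath.Join(root, "live")
	stagedDir := filepath.Join(root, "staged")
	for dir, content := range map[string]string{liveDir: "old cert", stagedDir: "new cert"} {
		if err := os.MkdirAll(dir, 0o755); err != nil {
			return "", "", err
		}
		if err := os.WriteFile(filepath.Join(dir, "tls.crt"), []byte(content), 0o600); err != nil {
			return "", "", err
		}
	}
	if err := swapDirectoriesFallback(liveDir, stagedDir); err != nil {
		return "", "", err
	}
	a, err := os.ReadFile(filepath.Join(liveDir, "tls.crt"))
	if err != nil {
		return "", "", err
	}
	b, err := os.ReadFile(filepath.Join(stagedDir, "tls.crt"))
	if err != nil {
		return "", "", err
	}
	return string(a), string(b), nil
}

func main() {
	live, staged, err := demoSwap()
	fmt.Println(live, "|", staged, "|", err)
}
```

Because a reader can observe the window in which one path is missing, a fallback like this would presumably be confined to test builds on non-Linux platforms rather than deployed.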

Member

@vrutkovs vrutkovs left a comment

/lgtm

Let's bump the kube-apiserver test PR and make sure it doesn't cause any side effects on e2e runs?

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 26, 2025
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 26, 2025
@tchap
Contributor Author

tchap commented Sep 26, 2025

I squashed all the commits into one, LGTM removed 😐

Updated the KASO PR to include the current version of this PR and also added the extra tests mentioned by @vrutkovs

openshift/cluster-kube-apiserver-operator#1917

@vrutkovs
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 26, 2025
Contributor

openshift-ci bot commented Sep 26, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tchap, vrutkovs
Once this PR has been reviewed and has the lgtm label, please assign tkashem for approval. For more information see the Code Review Process.


@tchap
Contributor Author

tchap commented Sep 26, 2025

/hold

until we are sure the testing KASO PR works just fine.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 26, 2025
eventRecorder.Warningf("CertificateUpdateFailed", "Failed to hash current content directory for %s: %s/%s: %v", typeName, o.Namespace, o.Name, err)
return err
}
if !bytes.Equal(targetDirHashBefore, targetDirHashAfter) {
Contributor

If two processes are writing to the same directory, hashing won’t guarantee atomicity, because the second process could start modifying the directory after the check.

A possible solution would be to make the second process also swap the directory atomically, perhaps using the same technique.

I don’t think the order matters, so atomically swapping the contents should be enough. Thoughts?

Member

The interfering write would need to happen twice, once around each of the targetDirHashBefore and targetDirHashAfter calculations, and still leave the directory in a consistent state (if the second write is a bit slower or faster, the hashes will mismatch). IMO it's too rare to guard against.

Contributor

The number of times we calculate and compare the hashes doesn’t matter, because the second process can start writing or changing the target directory after the last comparison or calculation.

In this case, the second process is the installer pod.

We should change the installer pod to perform an atomic swap rather than rely on the hashes.

Member

In my understanding, the desired procedure is:

  1. make a hash of source dir
  2. prepare a new dir from source dir
  3. hash source dir again to make sure it didn't change during previous step
  4. we swap atomically if hashes match

So the only chance of an invalid swap is if the competing write starts after step 1 and lands between step 3 and 4 - that would be a very rare event, wouldn't it?
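
The four steps above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `hashDir`, `copyDir`, `stageIfUnchanged`, and `demoStage` are hypothetical names, SHA-256 is an arbitrary choice of hash, and step 4 (the atomic swap) is left to the caller.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"os"
	"path/filepath"
)

// hashDir digests the names and contents of the regular files directly in
// dir. os.ReadDir returns entries sorted by name, so the result is stable.
func hashDir(dir string) ([]byte, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	h := sha256.New()
	for _, e := range entries {
		if !e.Type().IsRegular() {
			continue
		}
		data, err := os.ReadFile(filepath.Join(dir, e.Name()))
		if err != nil {
			return nil, err
		}
		// Length-prefix each field so adjacent files cannot be confused.
		fmt.Fprintf(h, "%d:%s%d:", len(e.Name()), e.Name(), len(data))
		h.Write(data)
	}
	return h.Sum(nil), nil
}

// copyDir copies the regular files in src into dst (step 2).
func copyDir(src, dst string) error {
	entries, err := os.ReadDir(src)
	if err != nil {
		return err
	}
	if err := os.MkdirAll(dst, 0o755); err != nil {
		return err
	}
	for _, e := range entries {
		if !e.Type().IsRegular() {
			continue
		}
		data, err := os.ReadFile(filepath.Join(src, e.Name()))
		if err != nil {
			return err
		}
		if err := os.WriteFile(filepath.Join(dst, e.Name()), data, 0o600); err != nil {
			return err
		}
	}
	return nil
}

// stageIfUnchanged performs steps 1-3 and reports whether src hashed the
// same before and after the copy. The caller swaps only when it is true.
func stageIfUnchanged(src, staging string) (bool, error) {
	before, err := hashDir(src) // step 1
	if err != nil {
		return false, err
	}
	if err := copyDir(src, staging); err != nil { // step 2
		return false, err
	}
	after, err := hashDir(src) // step 3
	if err != nil {
		return false, err
	}
	return bytes.Equal(before, after), nil
}

// demoStage stages a two-file certificate directory and checks the copy.
func demoStage() (unchanged, stagedMatches bool, err error) {
	root, err := os.MkdirTemp("", "hash-demo-")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(root)
	src, staging := filepath.Join(root, "src"), filepath.Join(root, "staging")
	if err := os.MkdirAll(src, 0o755); err != nil {
		return false, false, err
	}
	for name, content := range map[string]string{"tls.crt": "cert", "tls.key": "key"} {
		if err := os.WriteFile(filepath.Join(src, name), []byte(content), 0o600); err != nil {
			return false, false, err
		}
	}
	if unchanged, err = stageIfUnchanged(src, staging); err != nil {
		return false, false, err
	}
	srcHash, err := hashDir(src)
	if err != nil {
		return false, false, err
	}
	stagedHash, err := hashDir(staging)
	if err != nil {
		return false, false, err
	}
	return unchanged, bytes.Equal(srcHash, stagedHash), nil
}

func main() {
	fmt.Println(demoStage())
}
```

Note that a matching hash only says the source was stable while it was being copied. Nothing stops another writer from modifying a directory after the final hash and before the swap, which is the window this review thread is debating.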

Contributor

step 3 and 4 - that would be a very rare event, wouldn't it?

This PR addresses a rare bug where kas can observe an old private key together with a new public key, which causes a crash :(

I think we should take this potential issue into account and fix it.

Contributor Author

I aligned installerpod to use the same atomic swap. I also refactored the PR a bit because I had to pull shared code into a separate package. The swapping function is now called SyncDirectory since it actually replaces the content.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 30, 2025
Contributor

openshift-ci bot commented Sep 30, 2025

New changes are detected. LGTM label has been removed.

Contributor

@p0lyn0mial p0lyn0mial left a comment

@tchap, for easier review, please open a new PR with the SwapDirectories function + tests,

and then another PR for the SyncDirectory function.

//
// SyncDirectory is supposed to be used with secrets/configmaps, so typeName is expected to be "configmap" or "secret".
// This does not affect the logic, but it's included in error messages.
func SyncDirectory[C string | []byte](
Contributor

How about changing this method to: Write(targetDir string, files map[string][]byte, filePerm os.FileMode)

After all we are writing files to the target directory in an atomic way.
Callers have to convert the file contents to []byte, which is easy since a Go string converts directly with []byte(s).
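
To make the conversion concrete, here is a minimal sketch of how a caller could normalize both ConfigMap-style (`map[string]string`) and Secret-style (`map[string][]byte`) data before calling a `[]byte`-only function. `toBytes` is a hypothetical helper, not part of the PR; its constraint simply mirrors the `SyncDirectory[C string | []byte]` signature quoted above.

```go
package main

import "fmt"

// toBytes converts every value in files to []byte. Hypothetical helper: the
// conversion is legal because each type in the constraint converts to []byte.
// For []byte values no copy is made, so the result aliases the input slices.
func toBytes[C string | []byte](files map[string]C) map[string][]byte {
	out := make(map[string][]byte, len(files))
	for name, content := range files {
		out[name] = []byte(content)
	}
	return out
}

func main() {
	configMapData := map[string]string{"ca-bundle.crt": "PEM data"}
	secretData := map[string][]byte{"tls.key": []byte("key data")}

	// Both shapes end up as map[string][]byte.
	fmt.Println(len(toBytes(configMapData)), len(toBytes(secretData)))
}
```

The trade-off is between keeping the generic parameter on the exported function and moving a small conversion like this to each call site.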

Contributor Author

I used Sync instead of Write because Write feels like it doesn't affect existing files. But I don't have a strong opinion really.

}
defer func() {
if err := fs.RemoveAll(tmpDir); err != nil {
klog.Errorf("Failed to remove temporary directory %q during cleanup: %v", tmpDir, err)
Contributor

We should return the error to the callers, not only log it.
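
One common way to do that in Go is a named return value that the deferred cleanup can amend. The sketch below is an assumption about how it might look, not the PR's code: `withTempDir` and `demoCleanup` are hypothetical, it uses `os` directly rather than the `fs` abstraction in the snippet above, and `errors.Join` requires Go 1.20 or later.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// withTempDir creates a temporary directory under parent, runs fn with it,
// and always removes it afterwards. A cleanup failure is joined into the
// returned error instead of only being logged. Hypothetical helper.
func withTempDir(parent string, fn func(tmpDir string) error) (err error) {
	tmpDir, err := os.MkdirTemp(parent, "sync-")
	if err != nil {
		return fmt.Errorf("creating temporary directory: %w", err)
	}
	defer func() {
		if rmErr := os.RemoveAll(tmpDir); rmErr != nil {
			// err is the named result, so the caller sees both failures.
			err = errors.Join(err, fmt.Errorf("removing temporary directory %q: %w", tmpDir, rmErr))
		}
	}()
	return fn(tmpDir)
}

// demoCleanup runs a failing callback and reports whether the temporary
// directory was removed, along with the error the caller received.
func demoCleanup() (removed bool, err error) {
	var seen string
	err = withTempDir("", func(dir string) error {
		seen = dir
		return errors.New("sync failed")
	})
	_, statErr := os.Stat(seen)
	return errors.Is(statErr, os.ErrNotExist), err
}

func main() {
	removed, err := demoCleanup()
	fmt.Println("removed:", removed, "err:", err)
}
```

Whether a failed cleanup should fail the whole sync is a judgment call: the swap may already have succeeded by the time the temporary directory is removed, so a caller may want to tell the two failures apart.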

tchap added 2 commits October 2, 2025 15:25
The function can be used to atomically sync a directory with the desired
state. This uses atomicdir.swap implemented earlier.
Contributor

openshift-ci bot commented Oct 2, 2025

@tchap: all tests passed!


Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

4 participants