Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create tls-artifacts-registry enhancement #1502

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

vrutkovs
Copy link
Member

@vrutkovs vrutkovs commented Oct 26, 2023

cc @dgrisonnet @dinhxuanvu @deads2k

TODO:

  • Describe dev workflows
  • Add information about e2e test verifying the information and a link to PR (see autoregenerate after expiry)

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 26, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign dustymabe for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

and consults it to ensure that a particular certificate is platform-managed. The administrator can use JIRA component metadata to file a
new issue related to this certificate.

Similarly, TRT members use this registry to file issues found in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably not TRT.

Can you describe the workflow for adding a new cert? I think it looks like

  1. open PR adding new cert to operator repo
  2. CI fails due to new cert appearing
  3. CI includes the new raw-tls info required, but it's still missing an owner
  4. dev takes the file and adds it to origin PR
  5. origin PR fails CI because he forgot to update
  6. dev runs the hack/update
  7. origin PR fails CI because dev doesn't have an owner (or other annotation like "works with recert tool) for his new cert
  8. dev fakes in his origin PR, thus taking ownership. OR
  9. dev goes to his operator PR and adds the necessary annotations, redoes his origin, PR
  10. origin PR now passes CI and merges
  11. operator PR can now land.

Copy link
Member Author

@vrutkovs vrutkovs Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. dev runs the hack/update

would fail if owner is unset, so step 7 won't happen.

I expect the happy path to look like this:

  1. Developer is aware of this enhancement, so they create a PR with necessary annotation
  2. PR merges and TLS test in origin shows a flaking test - certificate is not registered in TLS registry
  3. Dev creates a PR to origin, updates ownership, hack/update generates new expected raw TLS info file
  4. After PR is merged TLS info test no longer flakes

Most unhappy path:

  1. open PR adding new cert to operator repo
  2. CI fails due to new cert appearing
  3. CI includes the new raw-tls info required, but it's still missing an owner
  4. dev takes the file and adds it to origin PR
  5. origin PR fails CI because he forgot to update
  6. dev runs the hack/update. hack/validate-tls-info requires owner to be set
  7. dev updates the owner thus taking ownership, origin PR is approved, operator PR can now land

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, TRT may be interested in this too, as if they encounter a problem with certificate its easy to find the team owning it - and quickly find a bug before reverting the offending PR (if available)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if adding a new cert is considered a failure based on the scenario that David described? Adding a new cert that meets all requirements (such as valid annotations, ownership info and proper expiration time) should be fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added expected flow in "Workflow Description" section

Copy link
Member

@dinhxuanvu dinhxuanvu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few questions.


### Workflow Description

The cluster administrator opens the TLS artifacts registry in the `origin` repo
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we have a test designed to detect issues with certificates and I expect folks will look at the test to see what certificates needs to be fixed or issues need to be addressed. However, I wonder if we can generate a text-based output that will summarize all of failures that admin/dev can see quickly instead of having to go through the whole markdown that contains both good and bad certs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test will exclude known violations. Currently all certs are added there, so the test would fail only on the certificates added in the PR

and consults it to ensure that a particular certificate is platform-managed. The administrator can use JIRA component metadata to file a
new issue related to this certificate.

Similarly, TRT members use this registry to file issues found in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if adding a new cert is considered a failure based on the scenario that David described? Adding a new cert that meets all requirements (such as valid annotations, ownership info and proper expiration time) should be fine.

being generated, so that most operators would use the recommended
method.

### Workflow Description
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should make it clear in this enhancement that the registry is a snapshot and auto-generated of the current stage of certificates on cluster and serves as a source of truth for validation purposes. It is meant to be immutable from viewer perspective as you shouldn't open a PR in origin to change the registry. Instead, you should open PR to address the issues directly to where the problematic certs are.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if adding a new cert is considered a failure based on the scenario that David described?

Yes. The cert name first needs to be registered in origin repo and then have the PR adding it merged.

Before we start adding more metadata requirements in the test we'll ensure PRs to add necessary metadata are merged

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

serves as a source of truth for validation purposes

Good point, will add this to the enhancement

It is meant to be immutable from viewer perspective as you shouldn't open a PR in origin to change the registry.

The violations list is meant to be remove-only, but users are expected to update expected metadata (i.e. description) and move certs from violations list after certificate metadata (==annotations) are updated


Similarly, TRT members use this registry to file issues found in
periodic or release-blocking jobs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a section about, "how to add capability/X to all certificates".

Intro about potential capabilities: on demand rotation, on demand revocation, auto regeneration on expiry, recert for single node.

  1. create new directory under origin/tls/<capability> peer to ownership
  2. directory must contain a readme.MD that is manually created
  3. directory must contain a generated <capability>.md that organized by owning component (jira components) . Generator lives in blah and has matching verify script in blah, point to examples. The markdown file is for humans to follow progress achieving capability/X
  4. directory must contain json file for machine reability of progress.
  5. generator must read from rawTLS to generate the json and markdown. raw TLS gathers data from annotations specified on kube resources.
  6. to have PKI indicate they support capability/X, the owning team will merge a PR adding the annotation AFTER that team is satisfied that it is supported and has some regression protection.
  7. more and more.

@vrutkovs
Copy link
Member Author

vrutkovs commented Dec 6, 2023

cc @omertuc @jakobmoellerdev PTAL

## Summary

This enhancement describes a single registry to store all produced
TLS artifacts. Both engineering and customers have expressed concern
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TLS artifacts

The registry should possibly also cover non-TLS cryptography, such as JWT tokens, the private keys used to sign them and their corresponding public keys

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JWT tokens

Could you give me an example where these can be stored and detected?

the private keys used to sign them and their corresponding public keys

Yes, that is something we cover. Should we specify the way we find those?
Currently its "list secrets and look for values which can be decoded as either certificate or the key" and "find files which can be decoded as certificate or a key", but I don't mind having it expanded.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you give me an example where these can be stored and detected?

There are many of them, but one particular example:

e.g. a JWT could be found in Secret openshift-kube-apiserver:default-token-bzg9f .data.token

Signed by the private key at Secret openshift-kube-controller-manager:service-account-private-key-4 .data."service-account.key"

And the matching public key at ConfigMap openshift-kube-apiserver:sa-token-signing-certs-3 .data."service-account-001.pub"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Service account secret and public keys are certainly being included, but things like JWT are not, as they are derivatives and thus harder to be tracked

Copy link
Contributor

@omertuc omertuc Dec 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. recert has to re-sign all JWTs after regenerating the service account private key, so it needs to know where they all are. So without them being tracked in the registry recert would still have to do scanning to find JWTs.

Also, pedantic nit: you might want to change the wording of the enhancement from "TLS" to something else since those service account private/public keys which the registry tracks are used to the best of my knowledge exclusively for signing/verifying JWTs, and that technically doesn't have anything to do with TLS, so the term TLS is overly specific

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think TLS is correct here - certificates and CA bundles unlike JWT have a set of properties like key algorithm, key size, validity period etc. Hopefully it would be useful for recert tool, but its not meant to cover all secrets recert is working with.

Copy link
Contributor

@omertuc omertuc Dec 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think TLS is correct here

The registry will cover service account keys (which don't even have certs associated with them, and even if they did, that still doesn't make them TLS-related, TLS uses certs, but certs are not exclusive to TLS), they have nothing to do with TLS, they're used just for signing and verifying JWTs, so mentioning TLS is technically too specific and excludes them, despite the registry including them. Of course I'm just being pedantic here, it's just a suggestion to fix the terminology in the enhancement to avoid confusion

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added few examples to Non goals to cover this discussion

TLS artifacts registry is useful for several teams in Red Hat (i.e.
which component manages a particular certificate) and customers -
for instance, helping them to separate platform-managed certificates
from user-managed.
Copy link
Contributor

@omertuc omertuc Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from user-managed.

Beyond user-managed vs platform-managed, within the umbrella of platform-managed it would also be nice if the registry could make a distinction between cluster internal certificates (which are platform managed) and well-known Internet CA certificates (which are also technically platform managed). The latter are static and provided from outside the cluster, they don't relate cryptographically to the certificates and keys within the cluster. The former are internal to the cluster and can be rotated as long as all their related keys and certs are rotated as well

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the registry could make a distinction between cluster

This would be a property of the cert, similar to the "owning-component" we have now - but we don't need to list all properties in the enhancement describing TLS registry as a structure. This enhancement however should include a way of adding new metadata requirements - which way would fit you?

Copy link
Contributor

@omertuc omertuc Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not critical to me whether it's covered within the scope of the enhancement or some generic metadata field in the future - I guess it depends on the definition of "Ownership" - who owns a trusted e.g. Verisign certificate - it's platform-managed (i.e. not user provided), but it's not our certificate - it's a globally well known Internet CA certificate. Who owns such certificate? (not looking for an answer, just something to consider. I'm not opinionated on whether this is something the enhancement should cover)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its a matter of correctly phrasing requirements. So far we used "ownership" as a substitute for "owning component", meaning "JIRA component which is responsible for its lifecycle". One of valid values there however is "End User" (if no component created it).
Its certainly a problem to define terms correctly, but I don't think there is a way to prevent this problem in principle

@omertuc
Copy link
Contributor

omertuc commented Dec 11, 2023

Another obstacle we have with recert is that some keys and certs can be found in really obscure locations:

  • APIService have certs under /spec/caBundle
  • ControllerConfig have certs under /spec/kubeAPIServerServingCAData
  • MachineConfig has certs under e.g. /spec/config/storage/files[15]/contents/source
  • MutatingWebhookConfiguration has certs under e.g. /webhooks[0]/clientConfig/caBundle
  • ValidatingWebhookConfiguration has certs under e.g. /webhooks[0]/clientConfig/caBundle

We take care of them for now, but it's inconvenient and I think we should restrict keys/certs to be only in configmaps/secrets

But even within secrets/configmaps, some are not directly inside the secret/configmap, they're nested within a kubeconfig that's inside the secret/configmap, which is so convoluted that we have implemented specialized workarounds to handle them in particular, these currently include:

  • Some secrets in the openshift-kube-apiserver namespace called webhook-authenticator* have a .data.kubeConfig entry that contains a kubeconfig and within that kubeconfig you can find certs/keys. That's very obscure and I think we should disallow such obscure behavior as a rule
    • Another secret in that same namespace called node-kubeconfigs has 4 kubeconfigs nested within it called lb-ext.kubeconfig lb-int.kubeconfig localhost-recovery.kubeconfig localhost.kubeconfig

Our current solution for some of them is to delete them and let them reconcile, but for others that reconcile takes too long so we have to give them special handling for performance reasons, and we wish we could stop doing that

@openshift-bot
Copy link

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 9, 2024
Add a test to the conformance suite that verifies the actual cluster TLS
artifacts have necessary metadata.

### Non-Goals
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd call out here the bits from earlier comments, ie. JWT tokens and cluster external certificates.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, added few items here

registry
* Create a test that verifies that TLS artifacts in a cluster have
necessary annotations to provide required metadata.
* Some TLS artifacts can be added to known violation list so that the test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably add it as non-goal, since this will be only a temporary approach while we get to the point where the full registry is working as expected.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maintaining a list of violations is part of the registry, its not a non-goal really


#### Adding a new TLS artifact

The cluster administrator opens the TLS artifacts registry in the `origin` repo
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you consider publishing the registry in our docs? Would it be feasible?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or at least a link from the docs describing this problem to a particular file in origin repo.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have docs generated in openshift/api about which certificates we expect and how they interact. This enhancement is more focused on ensuring that certs in cluster actually match this model. These features definitely overlap, but there is no need to publish actual certs from CI cluster to docs.

Its worth adding a link to the api docs though, its useful to keep expected and actual state in sync

### Non-Goals

* Include certificates provided by user (additional CA bundle, custom ingress / API certificates)
* Include other encrypted artifacts - e.g. JWT tokens
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe getting back to this in light of the recent discussions above. Is there any plan or idea on how this registry would be usable for the IBI/IBU flows or replacing recert if they are not including JWT or other derivatives? AFAIK @omertuc comments are basic requirements for replacement of recert.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with IBI/IBU flows - do you have a link to describe those?

This registry won't replace recert, but hopefully it would be useful for recert tool - it would be a source of truth on cert locations and its properties, e.g. which signer has issued this cert and where on disk or in cluster this cert needs to be present etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding IBI/IBU - Theres this high level write up which I recommend, but I think its irrelevant to the point im making.

I think this registry will not be that useful to recert because recert will still have to iterate over all objects for crypto analysis due to JWT explicitly being excluded from this EP.

This is because before looking into the objects subject to modification, recert does not know where the JWTs / Certs are it has to modify. With this enhancement we could make recert use the certificate directory to skip one part of this, however JWTs would need either

  • A) A new registry that is similar to the one discussed here (leading to a new EP or to the existing recert logic staying as is)
  • B) An extension to this registry so that they can be mapped in this EP (removing it from non-goals)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its worth extending this registry for JWT, although for this first phase its listed as a non-goal

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@omertuc @jerpeter1 Do we concur then that we need to build out this extension before being able to use it for recert or should we introudce an intermediary step?

@openshift-bot
Copy link

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 25, 2024
@openshift-bot
Copy link

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Feb 2, 2024
Copy link
Contributor

openshift-ci bot commented Feb 2, 2024

@openshift-bot: Closed this PR.

In response to this:

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vrutkovs
Copy link
Member Author

vrutkovs commented Feb 2, 2024

/reopen
/remove-lifecycle rotten

@openshift-ci openshift-ci bot reopened this Feb 2, 2024
Copy link
Contributor

openshift-ci bot commented Feb 2, 2024

@vrutkovs: Reopened this PR.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 2, 2024
@dhellmann
Copy link
Contributor

#1555 is changing the enhancement template in a way that will cause the header check in the linter job to fail for existing PRs. If this PR is merged within the development period for 4.16 you may override the linter if the only failures are caused by issues with the headers (please make sure the markdown formatting is correct). If this PR is not merged before 4.16 development closes, please update the enhancement to conform to the new template.

Copy link
Contributor

openshift-ci bot commented Feb 14, 2024

@vrutkovs: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Copy link

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 13, 2024
@vrutkovs
Copy link
Member Author

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 13, 2024
@openshift-bot
Copy link

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 8, 2024
@vrutkovs
Copy link
Member Author

vrutkovs commented May 8, 2024

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants