---
title: tls-artifacts-registry
authors:
  - "@vrutkovs"
  - "@dgrisonnet"
  - "@dinhxuanvu"
reviewers:
  - "@deads2k"
approvers:
  - "@deads2k"
api-approvers:
  - "@deads2k"
creation-date: 2023-10-26
last-updated: 2024-02-14
tracking-link:
  - https://issues.redhat.com/browse/API-1603
---

# TLS Artifacts Registry

## Summary

This enhancement describes a single registry that stores all produced
TLS artifacts. Both engineering and customers have expressed concern
about the difficulty of understanding the purpose of a particular
certificate or CA bundle, who issued it, and which team is responsible
for its contents and lifecycle. This enhancement describes a registry
that stores the necessary information about all certificates in the
cluster and sets requirements for TLS artifact metadata.

## Motivation

A TLS artifacts registry is useful both for Red Hat teams (e.g. to find
which component manages a particular certificate) and for customers -
for instance, helping them separate platform-managed certificates
from user-managed ones.

### User Stories

- As an OpenShift administrator, I want to differentiate
  platform-managed TLS artifacts from user-managed ones.
- As an OpenShift developer, I want to be able to find which team
  manages a particular certificate.
- As an OpenShift developer, I want a test which verifies
  that all certificates managed by my team have sufficient metadata
  to explain their purpose.
- As a member of OpenShift concerned with the release process
  (TRT, dev, staff engineer, maybe even PM), I want to make sure all
  managed certificates have sufficient metadata to find the team
  responsible for their lifecycle.

### Goals

Create a registry of all managed TLS artifacts and ensure that no new
certificates are added without sufficient metadata.
Add a test to the conformance suite that verifies that the actual cluster
TLS artifacts have the necessary metadata.

### Non-Goals

* Include certificates provided by the user (additional CA bundle, custom ingress / API certificates)
* Include other cryptographic artifacts - e.g. JWT tokens
* Include objects which contain TLS artifacts - e.g. kubeconfigs
  * These contain CA bundles, certificates, and keys, but verifying them requires parsing the kubeconfig
* List all TLS artifact users. This registry only keeps track of TLS artifact producers, not consumers
* Track MicroShift TLS artifacts

## Proposal

* Define a structure for the certificate registry in the `origin` repo
* Each TLS artifact can be either:
  * an on-disk file
  * an in-cluster secret or config map
* Additional expected metadata is stored for each artifact:
  * JIRA component
  * TLS artifact description (one sentence describing the purpose
    of this artifact)
* Each TLS artifact is associated with a set of required metadata
* Create a test which saves existing information about TLS artifacts
  in the cluster.
  * Some secrets may be versioned or contain a hash, IP address, or DNS
    name - i.e. `etcd-peer-ip-10-10-10-10.us-east-1.internal` becomes
    `etcd-peer-<master-2>` etc.
    The test does not store versioned copies and replaces IPs with "master-n"
    to make the entries look uniform when saving raw TLS information
* Create a new test which ensures that all secrets are present in the
  registry
* Create a test that verifies that TLS artifacts in a cluster have the
  necessary annotations to provide the required metadata.
* Some TLS artifacts can be added to a known-violations list so that the test
  is not permafailing, giving teams time to update metadata. Each metadata
  property has its own violation list.
* Metadata is split into required and optional properties. The test
  records missing optional metadata and fails if required metadata is missing.
  This gives teams time to update certificate metadata when optional metadata
  requirements become required. Violation files are meant to be "remove-only",
  meaning that when a metadata property becomes required, new entries are not
  meant to be added to the respective violation file.
* Initially the required metadata properties are "owning JIRA component" and
  "one-line description of artifact purpose". Later on, other properties will
  be added and eventually become required for new certificates.
* The TLS artifact registry is composed as a JSON file from several
  parts (usually generated from the test job artifacts).
* In the `origin` repo, create `hack/update-tls-artifacts.sh` and
  `hack/verify-tls-artifacts.sh` scripts to compose the TLS artifact
  registry JSON from several JSONs and to verify that all necessary
  metadata is set, respectively. The `update` script also generates violation
  lists in JSON and Markdown formats. For each required metadata field the
  Markdown report can be customized.
* Update library-go functions to set metadata when a TLS artifact is
  generated, so that most operators use the recommended method.
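The secret-name normalization described above could look roughly like the
following sketch. This is purely illustrative: `normalizeSecretName` and its
regular expression are assumptions, not the actual implementation in the
`origin` repo.

```golang
package main

import (
	"fmt"
	"regexp"
)

// ipRe matches node-scoped name fragments such as
// "ip-10-10-10-10.us-east-1.internal". Hypothetical sketch only.
var ipRe = regexp.MustCompile(`ip-\d+-\d+-\d+-\d+\.[a-z0-9.-]+`)

// normalizeSecretName replaces the node-specific portion of a secret name
// with a stable placeholder so raw TLS data looks uniform across clusters.
func normalizeSecretName(name, placeholder string) string {
	return ipRe.ReplaceAllString(name, placeholder)
}

func main() {
	// prints "etcd-peer-<master-2>"
	fmt.Println(normalizeSecretName("etcd-peer-ip-10-10-10-10.us-east-1.internal", "<master-2>"))
}
```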

### Workflow Description

The registry is an auto-generated snapshot of the current state of
certificates in the cluster and serves as a source of truth for validation
purposes. It is not meant to be edited directly: instead of opening a PR in
`origin` to change the registry, open a PR against the component where the
problematic certificate originates.

#### Adding a new TLS artifact

The cluster administrator opens the TLS artifacts registry in the `origin` repo
and consults it to ensure that a particular certificate is platform-managed.
The administrator can use the JIRA component metadata to file a
new issue related to this certificate.

When a component requires a new certificate to be created in the cluster, the
developer should do the following to properly register it:

* create a pull request adding this functionality. The
  "all tls certificates should be registered" test will fail.
  The test job generates a `rawTLSInformation` file which
  contains information collected from the prow job, including details about
  this new certificate.
* create a PR to the `origin` repo adding this file to the
  `tls/raw-data` directory, and run the `hack/update-tls-artifacts.sh` script.
  The script updates `tls/ownership/tls-ownership.json` so that the test no
  longer fails. It also updates the `violation` files so that the test
  verifying required metadata includes this certificate.
* rerun the tests and make sure that the certificate is indeed registered,
  its metadata has the required properties, and it matches the expected values
  stored in the `origin` repo. If the TLS artifact tests still fail, repeat
  the previous step, making the necessary changes.

Similarly, TRT members use this registry to file issues found in
periodic or release-blocking jobs.

#### Adding a new requirement for TLS artifacts

The reports and violations mechanisms can be extended to add new requirements.
To add a new certificate metadata requirement, developers need to:
* Implement the [`Requirement` interface](https://github.com/openshift/origin/blob/master/pkg/cmd/update-tls-artifacts/generate-owners/tlsmetadatainterfaces/types.go#L9-L14):
```golang
type Requirement interface {
	GetName() string

	// InspectRequirement generates and returns the result for a particular set of raw data
	InspectRequirement(rawData []*certgraphapi.PKIList) (RequirementResult, error)
}

type RequirementResult interface {
	GetName() string

	// WriteResultToTLSDir writes the content this requirement expects in a directory.
	// tlsDir is the parent directory and must be nested as: <tlsDir>/<GetName()>/<content here>.
	// The content MUST include
	// 1. <tlsDir>/<GetName()>/<GetName()>.md
	// 2. <tlsDir>/<GetName()>/<GetName()>.json
	// 3. <tlsDir>/violations/<GetName()>/<GetName()>-violations.json
	WriteResultToTLSDir(tlsDir string) error

	// DiffExistingContent compares the content of the result with what currently exists in the tlsDir.
	// returns
	//  string representation to display to the user (ideally a diff)
	//  bool that is true when content matches and false when content does not match
	//  error which is non-nil ONLY when the comparison itself could not complete. A completed diff that is non-zero is not an error
	DiffExistingContent(tlsDir string) (string, bool, error)

	// HaveViolationsRegressed compares the violations of the result with what was passed in and returns
	// allViolationsFS is the tls/violations/<GetName()> directory
	// returns
	//  string representation to display to the user (ideally a diff of what is worse)
	//  bool that is true when no regressions have been introduced and false when content has gotten worse
	//  error which is non-nil ONLY when the comparison itself could not complete. A completed check that is non-zero is not an error
	HaveViolationsRegressed(allViolationsFS embed.FS) ([]string, bool, error)
}
```

A simplified type exists for annotation requirements:
```golang
type annotationRequirement struct {
	// requirementName is a unique name for the metadata requirement
	requirementName string
	// annotationName is the annotation looked up in cert metadata
	annotationName string
	// title for the markdown
	title string
	// explanationMD is exactly the markdown to include that explains the purposes of the check
	explanationMD string
}

func NewAnnotationRequirement(requirementName, annotationName, title, explanationMD string) AnnotationRequirement {
	return annotationRequirement{
		requirementName: requirementName,
		annotationName:  annotationName,
		title:           title,
		explanationMD:   explanationMD,
	}
}
```

Example:
```golang
package autoregenerate_after_expiry

import "github.com/openshift/origin/pkg/cmd/update-tls-artifacts/generate-owners/tlsmetadatainterfaces"

const annotationName string = "certificates.openshift.io/auto-regenerate-after-offline-expiry"

type AutoRegenerateAfterOfflineExpiryRequirement struct{}

func NewAutoRegenerateAfterOfflineExpiryRequirement() tlsmetadatainterfaces.Requirement {
	md := tlsmetadatainterfaces.NewMarkdown("")
	md.Text("This is a report header, customized for a particular requirement")
	md.Text("Afterwards `NewAnnotationRequirement` will add a list of TLS artifacts, grouped by type and owning component")

	return tlsmetadatainterfaces.NewAnnotationRequirement(
		// requirement name
		"autoregenerate-after-expiry",
		// cert or configmap annotation
		annotationName,
		// report title
		"Auto Regenerate After Offline Expiry",
		// report explanation markdown
		string(md.ExactBytes()),
	)
}
```

To ensure that the requirement is backed by e2e coverage, the annotation value
should contain the name of the e2e test verifying this requirement:
```
annotations:
  certificates.openshift.io/auto-regenerate-after-offline-expiry: https://github.com/link/to/pr/adding/annotation, "quote escaped formatted name of e2e test that ensures the PKI artifact functions properly"
```
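A tool consuming this convention might split the annotation value into its two
parts along these lines. This is an illustrative sketch only:
`parseRequirementAnnotation` and the example PR link are hypothetical and not
part of the `origin` repo.

```golang
package main

import (
	"fmt"
	"strings"
)

// parseRequirementAnnotation splits an annotation value of the form
//   <link-to-PR>, "<e2e test name>"
// into its two parts. Hypothetical helper for illustration.
func parseRequirementAnnotation(value string) (link, testName string, err error) {
	parts := strings.SplitN(value, ",", 2)
	if len(parts) != 2 {
		return "", "", fmt.Errorf("expected '<link>, \"<test name>\"', got %q", value)
	}
	link = strings.TrimSpace(parts[0])
	testName = strings.Trim(strings.TrimSpace(parts[1]), `"`)
	return link, testName, nil
}

func main() {
	// The link below is a made-up placeholder, not a real PR.
	link, test, err := parseRequirementAnnotation(
		`https://github.com/link/to/pr/adding/annotation, "[sig-arch] cert regenerates after offline expiry"`)
	fmt.Println(link, test, err)
}
```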

### API Extensions

Not required

### Topology Considerations

#### Hypershift / Hosted Control Planes

No specific changes required

#### Standalone Clusters

No specific changes required

#### Single-node Deployments or MicroShift

No specific changes required for SNO. Tracking TLS artifacts in a MicroShift
cluster is a non-goal.

### Implementation Details/Notes/Constraints

None


### Risks and Mitigations

Teams are responsible for keeping TLS artifact descriptions accurate and
ownership information up to date.

### Drawbacks

None

## Design Details

### Open Questions

* Which metadata should be present for all TLS artifacts?
  * Minimum set:
    * Description
    * JIRA component
  * Additional data:
    * Validity period
    * Refreshed after n days / at n% of lifetime
    * Can be versioned / can contain an IP address or hash
  * Suggested by customers:
    * SSLConnections - which IP address / DNS name this certificate
      is valid for.
      * This can be derived from the certificate content, is not required to
        be included in the TLS artifact registry, and may depend on the base
        domain
    * Parent - a reference to another TLS artifact this cert is derived
      from.
    * Management type - operator-managed or user-managed.

## Test Plan

An `origin` unit test verifies that the TLS artifact registry is valid:
* All TLS artifacts have owners
* All TLS artifacts have the necessary metadata

e2e conformance tests ensure that:
* All certificates in `openshift-*` namespaces or on disk are
  included in the TLS registry.
* In-cluster certificates have the necessary annotations and their values
  match the registry entries.

## Graduation Criteria

Not applicable

### Dev Preview -> Tech Preview

Not applicable

### Tech Preview -> GA

Not applicable

### Removing a deprecated feature

Not applicable

## Upgrade / Downgrade Strategy

Not applicable

## Version Skew Strategy

Not applicable

## Operational Aspects of API Extensions

Not applicable

## Support Procedures

Not applicable

## Implementation History

* Current implementation: https://github.com/openshift/origin/pull/28305

## Alternatives

None known