Create tls-artifacts-registry enhancement #1502

350 changes: 350 additions & 0 deletions enhancements/security/tls-artifacts-registry.md
@@ -0,0 +1,350 @@
---
title: tls-artifacts-registry
authors:
- "@vrutkovs"
- "@dgrisonnet"
- "@dinhxuanvu"
reviewers:
- "@deads2k"
approvers:
- "@deads2k"
api-approvers:
- "@deads2k"
creation-date: 2023-10-26
last-updated: 2024-02-14
tracking-link:
- https://issues.redhat.com/browse/API-1603
---

# TLS Artifacts Registry

## Summary

This enhancement describes a single registry that stores all produced
TLS artifacts. Both engineering teams and customers have expressed concern

**Contributor:**

> TLS artifacts

The registry should possibly also cover non-TLS cryptography, such as JWT tokens, the private keys used to sign them, and their corresponding public keys.

**Member Author:**

> JWT tokens

Could you give me an example of where these can be stored and detected?

> the private keys used to sign them and their corresponding public keys

Yes, that is something we cover. Should we specify the way we find those?
Currently it's "list secrets and look for values that decode as either a certificate or a key" and "find files that decode as a certificate or a key", but I don't mind having it expanded.
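
(For illustration, a minimal sketch of that detection heuristic, assuming plain PEM decoding of secret values and on-disk files; the actual origin implementation may differ:)

```golang
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
)

// classifyValue reports whether a secret value or file content decodes as a
// certificate, a private key, or neither.
func classifyValue(data []byte) string {
	block, _ := pem.Decode(data)
	if block == nil {
		return "not a TLS artifact"
	}
	switch block.Type {
	case "CERTIFICATE":
		if _, err := x509.ParseCertificate(block.Bytes); err == nil {
			return "certificate"
		}
	case "RSA PRIVATE KEY", "EC PRIVATE KEY", "PRIVATE KEY":
		return "private key"
	}
	return "unrecognized PEM block"
}

func main() {
	fmt.Println(classifyValue([]byte("plain text, neither cert nor key")))
}
```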

**Contributor:**

> Could you give me an example of where these can be stored and detected?

There are many of them, but one particular example:

e.g. a JWT could be found in Secret `openshift-kube-apiserver:default-token-bzg9f` `.data.token`

Signed by the private key at Secret `openshift-kube-controller-manager:service-account-private-key-4` `.data."service-account.key"`

And the matching public key at ConfigMap `openshift-kube-apiserver:sa-token-signing-certs-3` `.data."service-account-001.pub"`

**Member Author:**

Service account secrets and public keys are certainly being included, but things like JWTs are not, as they are derivatives and thus harder to track.

**Contributor @omertuc (Dec 8, 2023):**

I see. recert has to re-sign all JWTs after regenerating the service account private key, so it needs to know where they all are. So without them being tracked in the registry, recert would still have to scan to find JWTs.

Also, a pedantic nit: you might want to change the wording of the enhancement from "TLS" to something else, since those service account private/public keys which the registry tracks are, to the best of my knowledge, used exclusively for signing/verifying JWTs, and that technically doesn't have anything to do with TLS, so the term "TLS" is overly specific.
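
(To illustrate why scanning is needed: a JWT can only be recognized structurally. A rough sketch, not recert's actual logic:)

```golang
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// looksLikeJWT is a rough structural check: three dot-separated
// base64url-encoded segments (header.payload.signature).
func looksLikeJWT(s string) bool {
	parts := strings.Split(s, ".")
	if len(parts) != 3 {
		return false
	}
	for _, p := range parts {
		if _, err := base64.RawURLEncoding.DecodeString(p); err != nil {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(looksLikeJWT("eyJhbGciOiJSUzI1NiJ9.eyJzdWIiOiJzeXN0ZW0ifQ.c2ln"))
}
```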

**Member Author:**

I think TLS is correct here - certificates and CA bundles, unlike JWTs, have a set of properties like key algorithm, key size, validity period, etc. Hopefully it will be useful for the recert tool, but it's not meant to cover all secrets recert works with.

**Contributor @omertuc (Dec 11, 2023):**

> I think TLS is correct here

The registry will cover service account keys, which don't even have certs associated with them (and even if they did, that still wouldn't make them TLS-related: TLS uses certs, but certs are not exclusive to TLS). They have nothing to do with TLS; they're used just for signing and verifying JWTs, so mentioning TLS is technically too specific and excludes them, despite the registry including them. Of course I'm just being pedantic here; it's just a suggestion to fix the terminology in the enhancement to avoid confusion.

**Member Author:**

Added a few examples to Non-Goals to cover this discussion.

about a lack of understanding of the purpose of a particular
certificate or CA bundle, who issued it, and which team is responsible
for its contents and lifecycle. This enhancement describes a registry
to store the necessary information about all certificates in the cluster
and sets requirements for TLS artifacts metadata.

## Motivation

A TLS artifacts registry is useful for several teams in Red Hat (e.g.
to find which component manages a particular certificate) and for customers -
for instance, helping them separate platform-managed certificates
from user-managed ones.

**Contributor @omertuc (Dec 6, 2023):**

> from user-managed.

Beyond user-managed vs. platform-managed, within the umbrella of platform-managed it would also be nice if the registry could make a distinction between cluster-internal certificates (which are platform-managed) and well-known Internet CA certificates (which are also technically platform-managed). The latter are static and provided from outside the cluster; they don't relate cryptographically to the certificates and keys within the cluster. The former are internal to the cluster and can be rotated, as long as all their related keys and certs are rotated as well.

**Member Author:**

> the registry could make a distinction between cluster

This would be a property of the cert, similar to the "owning-component" we have now - but we don't need to list all properties in the enhancement describing the TLS registry as a structure. This enhancement, however, should include a way of adding new metadata requirements - which approach would work for you?

**Contributor @omertuc (Dec 6, 2023):**

It's not critical to me whether it's covered within the scope of the enhancement or by some generic metadata field in the future - I guess it depends on the definition of "ownership". Who owns a trusted certificate from, e.g., Verisign? It's platform-managed (i.e. not user-provided), but it's not our certificate - it's a globally well-known Internet CA certificate. Who owns such a certificate? (Not looking for an answer, just something to consider. I'm not opinionated on whether this is something the enhancement should cover.)

**Member Author:**

It's a matter of phrasing the requirements correctly. So far we have used "ownership" as a substitute for "owning component", meaning "the JIRA component responsible for its lifecycle". One of the valid values there, however, is "End User" (if no component created it).
It's certainly a problem to define terms correctly, but I don't think there is a way to prevent this problem in principle.


### User Stories

- As an OpenShift administrator, I want to differentiate
platform-managed TLS artifacts from user-managed ones.
- As an OpenShift developer, I want to be able to find which team
manages a particular certificate.
- As an OpenShift developer, I want to have a test which verifies
that all certificates managed by the team have sufficient metadata
to explain their purpose.
- As a member of OpenShift concerned with the release process
(TRT, dev, staff engineer, maybe even PM), I want to make sure all
managed certificates have sufficient metadata to find the team
responsible for their lifecycle.

### Goals

Create a registry of all managed TLS artifacts, and ensure that no new
certificates are added without sufficient metadata.
Add a test to the conformance suite that verifies that the actual cluster TLS
artifacts have the necessary metadata.

### Non-Goals

**Member:**

I'd call out here the bits from earlier comments, i.e. JWT tokens and cluster-external certificates.

**Member Author:**

Good idea, added a few items here.


* Include certificates provided by the user (additional CA bundle, custom ingress / API certificates)
* Include other cryptographic artifacts - e.g. JWT tokens

**Contributor:**

Maybe getting back to this in light of the recent discussions above: is there any plan or idea for how this registry would be usable for the IBI/IBU flows, or for replacing recert, if JWTs and other derivatives are not included? AFAIK @omertuc's comments are basic requirements for a replacement of recert.

**Member Author:**

I'm not familiar with IBI/IBU flows - do you have a link describing those?

This registry won't replace recert, but hopefully it will be useful to the recert tool - it would be a source of truth on cert locations and their properties, e.g. which signer issued a cert and where on disk or in the cluster it needs to be present.

**Contributor:**

Regarding IBI/IBU - there's this high-level write-up which I recommend, but I think it's irrelevant to the point I'm making.

I think this registry will not be that useful to recert, because recert will still have to iterate over all objects for crypto analysis due to JWTs explicitly being excluded from this EP.

This is because, before looking into the objects subject to modification, recert does not know where the JWTs / certs it has to modify are. With this enhancement we could make recert use the certificate directory to skip one part of this; however, JWTs would need either

  • A) a new registry similar to the one discussed here (leading to a new EP, or to the existing recert logic staying as-is), or
  • B) an extension to this registry so that they can be mapped in this EP (removing it from the non-goals).

**Member Author:**

I think it's worth extending this registry for JWTs, although for this first phase it's listed as a non-goal.

**Contributor:**

@omertuc @jerpeter1 Do we concur then that we need to build out this extension before being able to use it for recert, or should we introduce an intermediary step?

* Include objects that embed TLS artifacts - e.g. kubeconfigs
  * They contain CA bundles, certs, and keys - but verifying them requires kubeconfig parsing
* List all TLS artifact users. This registry only keeps track of TLS artifact producers, not consumers
* Track MicroShift TLS artifacts

## Proposal

* Define a structure for the certificate registry in the origin repo
* Each TLS artifact can be either
  * an on-disk file
  * an in-cluster secret or config map
* Additional expected metadata is stored for each artifact
  * JIRA component
  * TLS artifact description (one sentence describing the purpose
of this artifact)
* Each TLS artifact is associated with a set of required metadata
* Create a test which saves existing information about TLS artifacts
in the cluster.
  * Some secrets may be versioned, or contain a hash or an IP address / DNS
name - e.g. `etcd-peer-ip-10-10-10-10.us-east-1.internal` becomes
`etcd-peer-<master-2>`.
The test would not store versioned copies, and would replace IPs with "master-n"
to make them look uniform when saving raw TLS information (see the sketch after
this list)
* Create a new test which ensures that all secrets are present in the
registry
* Create a test that verifies that TLS artifacts in a cluster have
the necessary annotations to provide the required metadata.
* Some TLS artifacts can be added to a known-violations list so that the test

**Member:**

I'd probably add it as a non-goal, since this will only be a temporary approach while we get to the point where the full registry is working as expected.

**Member Author:**

Maintaining a list of violations is part of the registry; it's not really a non-goal.

is not permafailing, to give teams time to update metadata. Each metadata
property has its own violation list.
* Metadata is split into required and optional properties. The test would
record missing optional metadata and fail if required metadata is missing.
This gives time to update certificate metadata when optional metadata
becomes required. Violation files are meant to be "remove-only",
meaning that when a metadata property becomes required, new entries are not
meant to be added to the respective violation file.
* Initially, the required metadata properties would be "owning JIRA component" and
"one-line description of artifact purpose". Later on, other properties would be
added and eventually become required for new certificates.
* The TLS artifact registry is composed as a JSON file from several
parts (usually generated from the test job artifacts).
* In the `origin` repo, create `hack/update-tls-artifacts.sh` and
`hack/verify-tls-artifacts.sh` scripts to compose the TLS artifact
registry JSON from several JSONs and to verify that all necessary
metadata is set, respectively. The `update` script also generates violation lists
in JSON and Markdown formats. For each required metadata field, the Markdown
report can be customized.
* Update library-go functions to set metadata when the TLS artifact is
generated, so that most operators use the recommended method.
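
The name normalization mentioned above could look roughly like this (an illustrative sketch; the regex and placeholder format are assumptions, not the exact origin implementation):

```golang
package main

import (
	"fmt"
	"regexp"
)

// ipNameRE matches a dash-separated IPv4 address plus domain suffix embedded in
// a secret name, e.g. the "ip-10-10-10-10.us-east-1.internal" part of
// "etcd-peer-ip-10-10-10-10.us-east-1.internal".
var ipNameRE = regexp.MustCompile(`ip-(\d+-){3}\d+\.[a-z0-9.-]+`)

// normalizeSecretName replaces the node-specific part of a secret name with a
// stable placeholder so that raw TLS data looks uniform across clusters.
func normalizeSecretName(name string, masterIndex int) string {
	return ipNameRE.ReplaceAllString(name, fmt.Sprintf("<master-%d>", masterIndex))
}

func main() {
	fmt.Println(normalizeSecretName("etcd-peer-ip-10-10-10-10.us-east-1.internal", 2))
	// Prints: etcd-peer-<master-2>
}
```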

### Workflow Description

**Member:**

I think we should make it clear in this enhancement that the registry is an auto-generated snapshot of the current state of certificates on the cluster and serves as a source of truth for validation purposes. It is meant to be immutable from the viewer's perspective: you shouldn't open a PR in origin to change the registry. Instead, you should open PRs that address the issues directly where the problematic certs are.

**Member Author:**

> I wonder if adding a new cert is considered a failure based on the scenario that David described?

Yes. The cert name first needs to be registered in the origin repo, and then the PR adding it can be merged.

Before we start adding more metadata requirements to the test, we'll ensure PRs adding the necessary metadata are merged.

**Member Author:**

> serves as a source of truth for validation purposes

Good point, will add this to the enhancement.

> It is meant to be immutable from viewer perspective as you shouldn't open a PR in origin to change the registry.

The violations list is meant to be remove-only, but users are expected to update the expected metadata (i.e. description) and move certs off the violations list after the certificate metadata (== annotations) is updated.


#### Adding a new TLS artifact

The cluster administrator opens the TLS artifacts registry in the `origin` repo

**Member:**

I know we have a test designed to detect issues with certificates, and I expect folks will look at the test to see which certificates need to be fixed or which issues need to be addressed. However, I wonder if we can generate a text-based output that summarizes all of the failures, so that an admin/dev can see them quickly instead of having to go through the whole Markdown file that contains both good and bad certs.

**Member Author:**

The test will exclude known violations. Currently all certs are added there, so the test would fail only on the certificates added in the PR.

**Member:**

Have you considered publishing the registry in our docs? Would it be feasible?

**Member:**

Or at least a link from the docs describing this problem to the particular file in the origin repo.

**Member Author:**

We already have docs generated in openshift/api about which certificates we expect and how they interact. This enhancement is more focused on ensuring that certs in the cluster actually match this model. These features definitely overlap, but there is no need to publish the actual certs from a CI cluster to the docs.

It's worth adding a link to the api docs though; it's useful to keep the expected and actual state in sync.

and consults it to ensure that a particular certificate is platform-managed.
The administrator can use the JIRA component metadata to file a
new issue related to this certificate.

When a component requires a new certificate to be created in the cluster, the
developer should do the following to properly register the new TLS certificate:

* The developer creates a pull request adding this functionality. The
"all tls certificates should be registered" test would fail.
The test job generates a `rawTLSInformation` file which
contains information collected from the prow job, including info about this new
certificate.
* The developer creates a PR to the `origin` repo, adding this file to the
`tls/raw-data` directory, and runs the `hack/update-tls-artifacts.sh` script.
The script updates `tls/ownership/tls-ownership.json` so that the test no longer fails.
It also updates the `violation` files, so that the test verifying required metadata
includes this certificate.
* The developer reruns the tests and makes sure that the certificate is indeed registered,
that its metadata has the required properties, and that it matches the expected values
stored in the `origin` repo. If the TLS artifact tests still fail, the developer repeats
the previous step, making the necessary changes.
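
The metadata these tests check for is carried as annotations on the generated secret or config map. A minimal sketch of what a component might set at creation time (the annotation keys here are illustrative assumptions, not a final API):

```golang
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Illustrative annotation keys; the exact keys are defined by the registry
// tooling in openshift/origin and openshift/api.
const (
	owningComponentAnnotation = "openshift.io/owning-component"
	descriptionAnnotation     = "openshift.io/description"
)

// newServingCertSecret sketches how a component could attach the required
// registry metadata when it creates a TLS secret.
func newServingCertSecret(name, namespace, component, description string, certPEM, keyPEM []byte) *corev1.Secret {
	return &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:      name,
			Namespace: namespace,
			Annotations: map[string]string{
				owningComponentAnnotation: component,
				descriptionAnnotation:     description,
			},
		},
		Type: corev1.SecretTypeTLS,
		Data: map[string][]byte{
			corev1.TLSCertKey:       certPEM,
			corev1.TLSPrivateKeyKey: keyPEM,
		},
	}
}

func main() {
	s := newServingCertSecret("etcd-peer-master-0", "openshift-etcd",
		"Etcd", "Peer certificate for etcd member master-0.", nil, nil)
	fmt.Println(s.Annotations)
}
```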

Similarly, TRT members use this registry to file issues found in

**Contributor:**

Probably not TRT.

Can you describe the workflow for adding a new cert? I think it looks like:

  1. open PR adding new cert to operator repo
  2. CI fails due to new cert appearing
  3. CI includes the new raw-tls info required, but it's still missing an owner
  4. dev takes the file and adds it to an origin PR
  5. origin PR fails CI because he forgot to update
  6. dev runs the hack/update
  7. origin PR fails CI because dev doesn't have an owner (or another annotation, like "works with recert tool") for his new cert
  8. dev fakes it in his origin PR, thus taking ownership, OR
  9. dev goes to his operator PR, adds the necessary annotations, and redoes his origin PR
  10. origin PR now passes CI and merges
  11. operator PR can now land

**Member Author @vrutkovs (Oct 27, 2023):**

> 6. dev runs the hack/update

This would fail if the owner is unset, so step 7 won't happen.

I expect the happy path to look like this:

  1. Developer is aware of this enhancement, so they create a PR with the necessary annotation
  2. PR merges and the TLS test in origin shows a flaking test - certificate is not registered in the TLS registry
  3. Dev creates a PR to origin and updates ownership; hack/update generates the new expected raw TLS info file
  4. After the PR is merged, the TLS info test no longer flakes

Most unhappy path:

  1. open PR adding new cert to operator repo
  2. CI fails due to new cert appearing
  3. CI includes the new raw-tls info required, but it's still missing an owner
  4. dev takes the file and adds it to an origin PR
  5. origin PR fails CI because he forgot to update
  6. dev runs the hack/update; hack/validate-tls-info requires the owner to be set
  7. dev updates the owner, thus taking ownership; origin PR is approved, operator PR can now land

**Member Author:**

However, TRT may be interested in this too: if they encounter a problem with a certificate, it's easy to find the team owning it - and to quickly find a bug before reverting the offending PR (if available).

**Member:**

I wonder if adding a new cert is considered a failure based on the scenario that David described? Adding a new cert that meets all requirements (such as valid annotations, ownership info, and a proper expiration time) should be fine.

**Member Author:**

Added the expected flow in the "Workflow Description" section.

periodic or release-blocking jobs.

**Contributor:**

Add a section about "how to add capability/X to all certificates".

Intro about potential capabilities: on-demand rotation, on-demand revocation, auto-regeneration on expiry, recert for single node.

  1. create a new directory under origin/tls/<capability>, peer to ownership
  2. directory must contain a manually created readme.md
  3. directory must contain a generated <capability>.md organized by owning component (JIRA components). Generator lives in blah and has a matching verify script in blah; point to examples. The markdown file is for humans to follow progress achieving capability/X
  4. directory must contain a json file for machine readability of progress
  5. generator must read from rawTLS to generate the json and markdown. rawTLS gathers data from annotations specified on kube resources
  6. to have PKI indicate they support capability/X, the owning team will merge a PR adding the annotation AFTER that team is satisfied that it is supported and has some regression protection
  7. more and more

#### Adding a new requirement for TLS artifacts

The reports and violations mechanisms can be extended to add new requirements. To add a new
certificate metadata requirement, developers would need to:
* Implement the [`Requirement` interface](https://github.com/openshift/origin/blob/master/pkg/cmd/update-tls-artifacts/generate-owners/tlsmetadatainterfaces/types.go#L9-L14):
```golang
type Requirement interface {
	GetName() string

	// InspectRequirement generates and returns the result for a particular set of raw data
	InspectRequirement(rawData []*certgraphapi.PKIList) (RequirementResult, error)
}

type RequirementResult interface {
	GetName() string

	// WriteResultToTLSDir writes the content this requirement expects in a directory.
	// tlsDir is the parent directory and must be nested as: <tlsDir>/<GetName()>/<content here>.
	// The content MUST include
	// 1. <tlsDir>/<GetName()>/<GetName()>.md
	// 2. <tlsDir>/<GetName()>/<GetName()>.json
	// 3. <tlsDir>/violations/<GetName()>/<GetName()>-violations.json
	WriteResultToTLSDir(tlsDir string) error

	// DiffExistingContent compares the content of the result with what currently exists in the tlsDir.
	// returns
	//   a string representation to display to the user (ideally a diff)
	//   a bool that is true when the content matches and false when it does not
	//   an error which is non-nil ONLY when the comparison itself could not complete. A completed diff that is non-zero is not an error
	DiffExistingContent(tlsDir string) (string, bool, error)

	// HaveViolationsRegressed compares the violations of the result with what was passed in.
	// allViolationsFS is the tls/violations/<GetName()> directory
	// returns
	//   a string representation to display to the user (ideally a diff of what got worse)
	//   a bool that is true when no regressions have been introduced and false when content has gotten worse
	//   an error which is non-nil ONLY when the comparison itself could not complete. A completed check that is non-zero is not an error
	HaveViolationsRegressed(allViolationsFS embed.FS) ([]string, bool, error)
}
```

A simplified type exists for annotation requirements:
```golang
type annotationRequirement struct {
	// requirementName is a unique name for the metadata requirement
	requirementName string
	// annotationName is the annotation looked up in cert metadata
	annotationName string
	// title for the markdown
	title string
	// explanationMD is exactly the markdown to include that explains the purpose of the check
	explanationMD string
}

func NewAnnotationRequirement(requirementName, annotationName, title, explanationMD string) AnnotationRequirement {
	return annotationRequirement{
		requirementName: requirementName,
		annotationName:  annotationName,
		title:           title,
		explanationMD:   explanationMD,
	}
}
```

Example:
```golang
package autoregenerate_after_expiry

import "github.com/openshift/origin/pkg/cmd/update-tls-artifacts/generate-owners/tlsmetadatainterfaces"

const annotationName string = "certificates.openshift.io/auto-regenerate-after-offline-expiry"

type AutoRegenerateAfterOfflineExpiryRequirement struct{}

func NewAutoRegenerateAfterOfflineExpiryRequirement() tlsmetadatainterfaces.Requirement {
	md := tlsmetadatainterfaces.NewMarkdown("")
	md.Text("This is a report header, customized for the particular requirement")
	md.Text("Afterwards `NewAnnotationRequirement` will add a list of TLS artifacts, grouped by type and owning component")

	return tlsmetadatainterfaces.NewAnnotationRequirement(
		// requirement name
		"autoregenerate-after-expiry",
		// cert or configmap annotation
		annotationName,
		// report title
		"Auto Regenerate After Offline Expiry",
		// explanation markdown for the report
		string(md.ExactBytes()),
	)
}
```

In order to ensure that the requirement is backed by an e2e test, the annotation value should
contain a link to the PR adding the annotation and the name of the e2e test verifying this
requirement:
```
annotations:
  certificates.openshift.io/auto-regenerate-after-offline-expiry: https://github.com/link/to/pr/adding/annotation, "quote escaped formatted name of e2e test that ensures the PKI artifact functions properly"
```
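
For illustration, a hypothetical helper (not part of the proposal) that splits such an annotation value into its PR link and e2e test name, assuming the comma-separated convention above:

```golang
package main

import (
	"fmt"
	"strings"
)

// parseRequirementValue splits an annotation value of the assumed form
//   <link to PR>, "<e2e test name>"
// into its two parts. Purely illustrative; the real convention is defined
// by the origin tooling.
func parseRequirementValue(value string) (prLink, testName string, err error) {
	parts := strings.SplitN(value, ",", 2)
	if len(parts) != 2 {
		return "", "", fmt.Errorf("expected %q to contain a PR link and a test name", value)
	}
	prLink = strings.TrimSpace(parts[0])
	testName = strings.Trim(strings.TrimSpace(parts[1]), `"`)
	if prLink == "" || testName == "" {
		return "", "", fmt.Errorf("empty PR link or test name in %q", value)
	}
	return prLink, testName, nil
}

func main() {
	link, test, err := parseRequirementValue(
		`https://github.com/link/to/pr/adding/annotation, "[sig-arch] certificate regenerates after offline expiry"`)
	fmt.Println(link, test, err)
}
```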


### API Extensions

Not required

### Topology Considerations

#### Hypershift / Hosted Control Planes

No specific changes required

#### Standalone Clusters

No specific changes required

#### Single-node Deployments or MicroShift

No specific changes required for SNO. Tracking TLS artifacts in a MicroShift cluster is a non-goal.

### Implementation Details/Notes/Constraints

None

### Risks and Mitigations

Teams are responsible for keeping TLS artifact descriptions accurate and
owner information up to date.

### Drawbacks

None

## Design Details

### Open Questions

* Which metadata should be present for all TLS artifacts?
  * Minimum set:
    * Description
    * JIRA component
  * Additional data:
    * Validity period
    * Refreshed after n days / at n% of lifetime
    * Can be versioned / can contain an IP address or hash
* Suggested by customers:
  * SSLConnections - which IP address / DNS name this certificate
is valid for.
    * This can be derived from the certificate content, is not required to
be included in the TLS artifact registry, and may depend on the base
domain
  * Parent - a reference to another TLS artifact this cert is derived
from.
  * Management type - operator-managed or user-managed.
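
To make this metadata set concrete, here is a hypothetical Go shape for a single registry entry covering the fields above (field and type names are illustrative; the actual registry is built from the `certgraphapi` types in openshift/origin):

```golang
package main

import "fmt"

// RegistryEntry is an illustrative shape for one TLS artifact in the registry.
type RegistryEntry struct {
	// Required metadata.
	Description         string `json:"description"`
	OwningJiraComponent string `json:"owningJiraComponent"`

	// Optional metadata, to be promoted to required over time.
	ValidityDays   int    `json:"validityDays,omitempty"`
	RefreshedAfter string `json:"refreshedAfter,omitempty"` // e.g. "30d" or "80% of lifetime"
	Versioned      bool   `json:"versioned,omitempty"`
	Parent         string `json:"parent,omitempty"`         // TLS artifact this cert is derived from
	ManagementType string `json:"managementType,omitempty"` // "operator-managed" or "user-managed"
}

func main() {
	entry := RegistryEntry{
		Description:         "Peer certificate for etcd members.",
		OwningJiraComponent: "Etcd",
	}
	fmt.Printf("%+v\n", entry)
}
```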


## Test Plan

`origin` unit tests would verify that the TLS artifact registry is valid:
* All TLS artifacts have owners
* All TLS artifacts have the necessary metadata

e2e conformance tests would ensure that:
* All certificates in `openshift-*` namespaces or on disk are
included in the TLS registry.
* In-cluster certificates have the necessary annotations, and their values
match registry entries.
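
A rough sketch of the annotation-vs-registry comparison such a conformance test could perform (the helper and annotation names are assumptions for illustration):

```golang
package main

import "fmt"

// expected holds the registry metadata for one in-cluster secret
// (illustrative shape; see the registry entry sketch above).
type expected struct {
	Description         string
	OwningJiraComponent string
}

// checkSecretMetadata compares the annotations found on an in-cluster secret
// against its registry entry and returns a list of mismatches. The annotation
// keys are illustrative assumptions.
func checkSecretMetadata(annotations map[string]string, want expected) []string {
	var failures []string
	if got := annotations["openshift.io/owning-component"]; got != want.OwningJiraComponent {
		failures = append(failures, fmt.Sprintf("owning component: got %q, want %q", got, want.OwningJiraComponent))
	}
	if got := annotations["openshift.io/description"]; got != want.Description {
		failures = append(failures, fmt.Sprintf("description: got %q, want %q", got, want.Description))
	}
	return failures
}

func main() {
	fmt.Println(checkSecretMetadata(
		map[string]string{"openshift.io/owning-component": "Etcd"},
		expected{OwningJiraComponent: "Etcd", Description: "Peer certificate for etcd members."},
	))
}
```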

## Graduation Criteria

Not applicable

### Dev Preview -> Tech Preview

Not applicable

### Tech Preview -> GA

Not applicable

### Removing a deprecated feature

Not applicable

## Upgrade / Downgrade Strategy

Not applicable

## Version Skew Strategy

Not applicable

## Operational Aspects of API Extensions

Not applicable

## Support Procedures

Not applicable

## Implementation History

* Current implementation: https://github.com/openshift/origin/pull/28305

## Alternatives

None known