Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Basics of Cert-Count Non-Locking Telemetry #16676

Merged
merged 26 commits into from
Sep 20, 2022

Conversation

kitography
Copy link
Contributor

@kitography kitography commented Aug 10, 2022

This is the basics of the Certificate Count Telemetry.

Lock is Removed. See Comment.

  • Test sort and duplicate-remove
  • Fix "early" reporting of count metric.
  • Add count-metric to tidy-status (with warning about search)

When we store a cert now, if certs haven't been counted, that is added to a slice which will be checked for double counting.

I'm pretty sure I got every place that a cert is stored:

cert_util.go:218:       if err = req.Storage.Put(ctx, certEntry); err != nil {
* This is path-update for a legacy cert storage.  This could plausibly result in a double counting during initialise (if the 'read' - which happens first - is picked up, but not the 'delete').  Unfortunately, I'm not sure there's anything to be done about that without a lot more complexity (and considerations of error cases).
cert_util_test.go:42:           err := storage.Put(context.Background(), &logical.StorageEntry{
* This is a test
cert_util_test.go:77:           err := storage.Put(context.Background(), &logical.StorageEntry{
* This is also a test
crl_util.go:235:                err = req.Storage.Put(ctx, revEntry)
* Increments when we successfully write a revoked cert
crl_util.go:567:                        err = req.Storage.Put(ctx, revokedEntry)
* This is an update call, not a new entry
crl_util.go:643:        err = sc.Storage.Put(sc.Context, &logical.StorageEntry{
* Path here is crls/ (not a cert)
path_config_crl.go:112: err = req.Storage.Put(ctx, entry)
* Path here is config/crl (not a cert)
path_config_urls.go:92: err = storage.Put(ctx, entry)
* Path here is urls/ (not a cert)
path_issue_sign.go:413:         err = req.Storage.Put(ctx, &logical.StorageEntry{
* We increment the certificate count here
path_roles.go:608:              if err := s.Put(ctx, jsonEntry); err != nil {
* Path here is role/ (not a cert)
path_roles.go:762:      if err := req.Storage.Put(ctx, jsonEntry); err != nil {
* Path here is role/ (not a cert)
path_roles.go:967:      if err := req.Storage.Put(ctx, jsonEntry); err != nil {
* Path here is role/ (not a cert)
path_roles_test.go:79:  if err := storage.Put(context.Background(), entry); err != nil {
* Path here is role/testrole (not a cert)
path_roles_test.go:179: if err := storage.Put(context.Background(), entry); err != nil {
* Path here is role/testrole (not a cert)
path_roles_test.go:275: if err := storage.Put(context.Background(), entry); err != nil {
* Path here is role/testrole (not a cert)
path_roles_test.go:373: if err := storage.Put(context.Background(), entry); err != nil {
* Path here is role/testrole (not a cert)
path_root.go:260:       err = req.Storage.Put(ctx, &logical.StorageEntry{
* We increment after a successful save
path_root.go:444:       err = req.Storage.Put(ctx, &logical.StorageEntry{
* We increment after a successful save
storage.go:222: return sc.Storage.Put(sc.Context, json)
* This is keys, not certs
storage.go:566: return sc.Storage.Put(sc.Context, json)
* This is issuer, not certs
storage.go:739: return sc.Storage.Put(sc.Context, json)
* This is CRL Config, not certs
storage.go:772: return sc.Storage.Put(sc.Context, json)
* This is keys config, not certs
storage.go:797: return sc.Storage.Put(sc.Context, json)
* This is issuers config, not certs
storage_migrations.go:167:      return s.Put(ctx, json)
* This is legacy bundle, not certs
storage_migrations_test.go:77:  err = s.Put(ctx, json)
* Also legacy bundle, not certs
storage_migrations_test.go:182: err = s.Put(ctx, json)
* Also legacy bundle, not certs

@kitography kitography marked this pull request as ready for review August 12, 2022 14:39
@kitography kitography requested a review from a team August 12, 2022 14:39
@kitography
Copy link
Contributor Author

For anyone looking in: after discussion internally, we've settled on it being more important not to lock (and block all issuance and revocation) on startup than to have perfectly correct metrics here. Instead, we'll do a "best effort" metric - which might include some double counting of certificate entries.

If we do miscount, we want to make sure we 'overcount' rather than undercount the number of certs in storage.

( https://hashicorp.slack.com/archives/C0386B7KPHR/p1660332542341629 )

Copy link
Contributor

@stevendpclark stevendpclark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff, I do like where this is headed. Sorry though I went a little nuts on the comments...

builtin/logical/pki/backend.go Outdated Show resolved Hide resolved
builtin/logical/pki/backend.go Show resolved Hide resolved
builtin/logical/pki/backend.go Outdated Show resolved Hide resolved
builtin/logical/pki/backend.go Show resolved Hide resolved
builtin/logical/pki/backend.go Outdated Show resolved Hide resolved
builtin/logical/pki/backend.go Show resolved Hide resolved
}

func (b *backend) decrementTotalRevokedCertificatesCount() {
atomic.AddUint32(b.revokedCertCount, ^uint32(0))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure this can't happen in the current code/usage I'm seeing with the two decrements, but should we add a guard here for b.revokedCertCount being 0 and we under-flowing the unit32? The only safe way I can think of this is this ugly code block

testInt := uint32(1)

for i := 0; i <= 5; i++ {
	oldVal := atomic.LoadUint32(&testInt)
	if oldVal == 0 {
		break
	}
	newVal := oldVal - 1
	if atomic.CompareAndSwapUint32(&testInt, oldVal, newVal) {
		break
	}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I admit I don't like this. If someone ever saw an underflow it would be solvable by "turning off and turning on" and would be a really really helpful report that we aren't honouring the promise of "always overcount".

If someone else agrees with this solution, I'll go ahead and add it, but in general I don't think we can protect against people writing bad code in the future with uglier code now.

@kitography kitography changed the title Basics of Cert-Count Telemetry, including locks. Basics of Cert-Count Non-Locking Telemetry Sep 1, 2022
Copy link
Contributor

@stevendpclark stevendpclark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks for addressing all my concerns overall.

@kitography kitography enabled auto-merge (squash) September 9, 2022 19:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants