Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Basics of Cert-Count Non-Locking Telemetry #16676

Merged
merged 26 commits into from
Sep 20, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
26 commits
Select commit Hold shift + click to select a range
71587d3
Basics of Cert-Count Telemetry, including locks.
kitography Aug 10, 2022
9065949
Missing cert-writing counter locations.
kitography Aug 12, 2022
cb009a6
Add changelog.
kitography Aug 12, 2022
eb427f2
Remove issuance/cert-storage lock. Replace with "best attempt" slice…
kitography Aug 17, 2022
64a16eb
make fmt.
kitography Aug 17, 2022
8771730
Make the certsCounted non-lock an atomic boolean.
kitography Aug 22, 2022
78acc72
Move sorting of possibleDoubleCountedRevokedSerials to after compare …
kitography Aug 22, 2022
5355dd2
Add values to counter when still initializing.
kitography Aug 22, 2022
485044e
Remove true.
kitography Aug 22, 2022
3c10e8e
Merge branch 'main' into VAULT-7543-telemetry-stored-cert-count
kitography Aug 22, 2022
0e09502
Merge branch 'main' into VAULT-7543-telemetry-stored-cert-count
kitography Aug 23, 2022
635bce9
Merge branch 'main' into VAULT-7543-telemetry-stored-cert-count
kitography Aug 24, 2022
c86daae
make fmt.
kitography Aug 24, 2022
b269ab2
Fix atomic2 import (removed on main).
kitography Aug 24, 2022
8d6fa59
make fmt
kitography Sep 1, 2022
09b5130
Merge branch 'main' into VAULT-7543-telemetry-stored-cert-count
kitography Sep 1, 2022
f74928c
Delay reporting metrics until after deduplication has completed.
kitography Sep 1, 2022
15acb79
Fix lock, allow tidy to report status (and report our status).
kitography Sep 1, 2022
1fdf9ce
make fmt
kitography Sep 1, 2022
f8ad83e
The test works now.
kitography Sep 3, 2022
eb441f5
Move string slice to helper function; make fmt.
kitography Sep 8, 2022
29e01e6
Add the extra pki-tidy field expectations (test-fix, oops)
kitography Sep 8, 2022
8bb37d3
Add backendUUID to gauge name.
kitography Sep 9, 2022
8912bc3
Fix test - backendUUID drawn from earlier in the test.
kitography Sep 19, 2022
0cfc3a4
Merge branch 'main' into VAULT-7543-telemetry-stored-cert-count
kitography Sep 20, 2022
ccff788
make fmt
kitography Sep 20, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion builtin/logical/aws/iam_policies_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -207,7 +207,7 @@ func Test_combinePolicyDocuments(t *testing.T) {
`{"Version": "2012-10-17", "Statement": [{"Effect": "Allow", "NotAction": "ec2:DescribeAvailabilityZones", "Resource": "*"}]}`,
},
expectedOutput: `{"Version": "2012-10-17","Statement":[{"Effect": "Allow","NotAction": "ec2:DescribeAvailabilityZones", "Resource": "*"}]}`,
expectedErr: false,
expectedErr: false,
},
{
description: "one blank policy",
Expand Down
188 changes: 188 additions & 0 deletions builtin/logical/pki/backend.go
Original file line number Diff line number Diff line change
Expand Up @@ -3,11 +3,14 @@ package pki
import (
"context"
"fmt"
"sort"
"strings"
"sync"
"sync/atomic"
"time"

atomic2 "go.uber.org/atomic"

"github.com/hashicorp/vault/sdk/helper/consts"

"github.com/armon/go-metrics"
Expand Down Expand Up @@ -203,6 +206,13 @@ func Backend(conf *logical.BackendConfig) *backend {
// Delay the first tidy until after we've started up.
b.lastTidy = time.Now()

// Metrics initialization for count of certificates in storage
b.certsCounted = atomic2.NewBool(false)
b.certCount = new(uint32)
b.revokedCertCount = new(uint32)
b.possibleDoubleCountedSerials = make([]string, 0, 250)
b.possibleDoubleCountedRevokedSerials = make([]string, 0, 250)

return &b
}

Expand All @@ -219,6 +229,12 @@ type backend struct {
tidyStatus *tidyStatus
lastTidy time.Time

certCount *uint32
revokedCertCount *uint32
certsCounted *atomic2.Bool
possibleDoubleCountedSerials []string
possibleDoubleCountedRevokedSerials []string

pkiStorageVersion atomic.Value
crlBuilder *crlBuilder

Expand Down Expand Up @@ -330,6 +346,21 @@ func (b *backend) initialize(ctx context.Context, _ *logical.InitializationReque
return err
}

err := b.initializePKIIssuersStorage(ctx)
if err != nil {
return err
}

// Initialize also needs to populate our certificate and revoked certificate count
err = b.initializeStoredCertificateCounts(ctx)
if err != nil {
return err
}

return nil
}

func (b *backend) initializePKIIssuersStorage(ctx context.Context) error {
// Grab the lock prior to the updating of the storage lock preventing us flipping
// the storage flag midway through the request stream of other requests.
b.issuersLock.Lock()
Expand Down Expand Up @@ -532,3 +563,160 @@ func (b *backend) periodicFunc(ctx context.Context, request *logical.Request) er
// All good!
return nil
}

func (b *backend) initializeStoredCertificateCounts(ctx context.Context) error {
b.tidyStatusLock.RLock()
defer b.tidyStatusLock.RUnlock()
// For performance reasons, we can't lock on issuance/storage of certs until a list operation completes,
// but we want to limit possible miscounts / double-counts to over-counting, so we take the tidy lock which
// prevents (most) deletions - in particular we take a read lock (sufficient to block the write lock in
// tidyStatusStart while allowing tidy to still acquire a read lock to report via its endpoint)

entries, err := b.storage.List(ctx, "certs/")
if err != nil {
return err
}
atomic.AddUint32(b.certCount, uint32(len(entries)))

revokedEntries, err := b.storage.List(ctx, "revoked/")
if err != nil {
return err
}
atomic.AddUint32(b.revokedCertCount, uint32(len(revokedEntries)))

b.certsCounted.Store(true)
// Now that the metrics are set, we can switch from appending newly-stored certificates to the possible double-count
// list, and instead have them update the counter directly. We need to do this so that we are looking at a static
// slice of possibly double counted serials. Note that certsCounted is computed before the storage operation, so
// there may be some delay here.

// Sort the listed-entries first, to accommodate that delay.
sort.Slice(entries, func(i, j int) bool {
return entries[i] < entries[j]
})

sort.Slice(revokedEntries, func(i, j int) bool {
return revokedEntries[i] < revokedEntries[j]
})

// We assume here that these lists are now complete.
sort.Slice(b.possibleDoubleCountedSerials, func(i, j int) bool {
return b.possibleDoubleCountedSerials[i] < b.possibleDoubleCountedSerials[j]
})

listEntriesIndex := 0
possibleDoubleCountIndex := 0
for {
if listEntriesIndex >= len(entries) {
break
}
if possibleDoubleCountIndex >= len(b.possibleDoubleCountedSerials) {
break
}
if entries[listEntriesIndex] == b.possibleDoubleCountedSerials[possibleDoubleCountIndex] {
kitography marked this conversation as resolved.
Show resolved Hide resolved
// This represents a double-counted entry
b.decrementTotalCertificatesCountNoReport()
listEntriesIndex = listEntriesIndex + 1
possibleDoubleCountIndex = possibleDoubleCountIndex + 1
continue
}
if entries[listEntriesIndex] < b.possibleDoubleCountedSerials[possibleDoubleCountIndex] {
listEntriesIndex = listEntriesIndex + 1
continue
}
if entries[listEntriesIndex] > b.possibleDoubleCountedSerials[possibleDoubleCountIndex] {
possibleDoubleCountIndex = possibleDoubleCountIndex + 1
continue
}
}

sort.Slice(b.possibleDoubleCountedRevokedSerials, func(i, j int) bool {
return b.possibleDoubleCountedRevokedSerials[i] < b.possibleDoubleCountedRevokedSerials[j]
})

listRevokedEntriesIndex := 0
possibleRevokedDoubleCountIndex := 0
for {
if listRevokedEntriesIndex >= len(revokedEntries) {
break
}
if possibleRevokedDoubleCountIndex >= len(b.possibleDoubleCountedRevokedSerials) {
break
}
if revokedEntries[listRevokedEntriesIndex] == b.possibleDoubleCountedRevokedSerials[possibleRevokedDoubleCountIndex] {
// This represents a double-counted revoked entry
b.decrementTotalRevokedCertificatesCountNoReport()
listRevokedEntriesIndex = listRevokedEntriesIndex + 1
possibleRevokedDoubleCountIndex = possibleRevokedDoubleCountIndex + 1
continue
}
if revokedEntries[listRevokedEntriesIndex] < b.possibleDoubleCountedRevokedSerials[possibleRevokedDoubleCountIndex] {
listRevokedEntriesIndex = listRevokedEntriesIndex + 1
continue
}
if revokedEntries[listRevokedEntriesIndex] > b.possibleDoubleCountedRevokedSerials[possibleRevokedDoubleCountIndex] {
possibleRevokedDoubleCountIndex = possibleRevokedDoubleCountIndex + 1
continue
}
}

b.possibleDoubleCountedRevokedSerials = nil
b.possibleDoubleCountedSerials = nil

metrics.SetGauge([]string{"secrets", "pki", b.backendUUID, "total_certificates_stored"}, float32(*b.certCount))
metrics.SetGauge([]string{"secrets", "pki", b.backendUUID, "total_revoked_certificates_stored"}, float32(*b.revokedCertCount))

return nil
kitography marked this conversation as resolved.
Show resolved Hide resolved
}

// The "certsCounted" boolean here should be loaded from the backend certsCounted before the corresponding storage call:
// eg. certsCounted := b.certsCounted.Load()
func (b *backend) incrementTotalCertificatesCount(certsCounted bool, newSerial string) {
atomic.AddUint32(b.certCount, 1)
switch {
case !certsCounted:
// This is unsafe, but a good best-attempt
if strings.HasPrefix(newSerial, "certs/") {
newSerial = newSerial[6:]
}
b.possibleDoubleCountedSerials = append(b.possibleDoubleCountedSerials, newSerial)
default:
metrics.SetGauge([]string{"secrets", "pki", b.backendUUID, "total_certificates_stored"}, float32(*b.certCount))
}
}

func (b *backend) decrementTotalCertificatesCountReport() {
b.decrementTotalCertificatesCountNoReport()
metrics.SetGauge([]string{"secrets", "pki", b.backendUUID, "total_certificates_stored"}, float32(*b.certCount))
}

// Called directly only by the initialize function to deduplicate the count, when we don't have a full count yet
func (b *backend) decrementTotalCertificatesCountNoReport() {
atomic.AddUint32(b.certCount, ^uint32(0))
}

// The "certsCounted" boolean here should be loaded from the backend certsCounted before the corresponding storage call:
// eg. certsCounted := b.certsCounted.Load()
func (b *backend) incrementTotalRevokedCertificatesCount(certsCounted bool, newSerial string) {
kitography marked this conversation as resolved.
Show resolved Hide resolved
atomic.AddUint32(b.revokedCertCount, 1)
switch {
case !certsCounted:
// This is unsafe, but a good best-attempt
if strings.HasPrefix(newSerial, "revoked/") { // allow passing in the path (revoked/serial) OR the serial
newSerial = newSerial[8:]
}
b.possibleDoubleCountedRevokedSerials = append(b.possibleDoubleCountedRevokedSerials, newSerial)
default:
metrics.SetGauge([]string{"secrets", "pki", b.backendUUID, "total_revoked_certificates_stored"}, float32(*b.revokedCertCount))
}
}

func (b *backend) decrementTotalRevokedCertificatesCountReport() {
b.decrementTotalRevokedCertificatesCountNoReport()
metrics.SetGauge([]string{"secrets", "pki", b.backendUUID, "total_revoked_certificates_stored"}, float32(*b.revokedCertCount))
}

// Called directly only by the initialize function to deduplicate the count, when we don't have a full count yet
func (b *backend) decrementTotalRevokedCertificatesCountNoReport() {
atomic.AddUint32(b.revokedCertCount, ^uint32(0))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure this can't happen in the current code/usage I'm seeing with the two decrements, but should we add a guard here for b.revokedCertCount being 0 and we under-flowing the unit32? The only safe way I can think of this is this ugly code block

testInt := uint32(1)

for i := 0; i <= 5; i++ {
	oldVal := atomic.LoadUint32(&testInt)
	if oldVal == 0 {
		break
	}
	newVal := oldVal - 1
	if atomic.CompareAndSwapUint32(&testInt, oldVal, newVal) {
		break
	}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I admit I don't like this. If someone ever saw an underflow it would be solvable by "turning off and turning on" and would be a really really helpful report that we aren't honouring the promise of "always overcount".

If someone else agrees with this solution, I'll go ahead and add it, but in general I don't think we can protect against people writing bad code in the future with uglier code now.

}
Loading