ca: improve test coverage for RenewIntermediate #11780
@@ -56,8 +56,6 @@ func TestCAManager_Initialize_Vault_Secondary_SharedVault(t *testing.T) {
 	testrpc.WaitForTestAgent(t, serverDC1.RPC, "dc1")
 
 	codec := rpcClient(t, serverDC1)
-	defer codec.Close()
The rpcClient function already does this with t.Cleanup.
TIL about t.Cleanup. Thanks! 🙌🏻
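For readers who have not used t.Cleanup, here is a minimal sketch of the pattern the comment above refers to: a helper registers its own teardown, so callers no longer need a defer. Everything here is hypothetical and for illustration only. newClient stands in for a helper like rpcClient, and fakeT is a stand-in for *testing.T so the sketch can run outside of go test; none of these names come from the Consul code base.

```go
package main

import "fmt"

// cleanupRegistrar is the single method of *testing.T this sketch relies on.
// *testing.T satisfies it, so a real test could pass t directly.
type cleanupRegistrar interface {
	Cleanup(func())
}

// fakeT is a stand-in for *testing.T that records cleanup functions.
type fakeT struct {
	cleanups []func()
}

func (f *fakeT) Cleanup(fn func()) { f.cleanups = append(f.cleanups, fn) }

// finish runs the registered functions last-added first, which is the order
// the testing package uses when a test completes.
func (f *fakeT) finish() {
	for i := len(f.cleanups) - 1; i >= 0; i-- {
		f.cleanups[i]()
	}
}

// client is a hypothetical resource that must be closed after the test.
type client struct {
	closed bool
}

func (c *client) Close() { c.closed = true }

// newClient is a hypothetical helper in the style of rpcClient: it registers
// Close with the test, so the caller does not write "defer c.Close()".
func newClient(t cleanupRegistrar) *client {
	c := &client{}
	t.Cleanup(c.Close)
	return c
}

func main() {
	ft := &fakeT{}
	c := newClient(ft)
	fmt.Println("closed during test:", c.closed)
	ft.finish()
	fmt.Println("closed after test:", c.closed)
}
```

With a helper written this way, an extra defer in the caller is redundant, which is why the defer lines are dropped in this PR.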
_, activeRoot, err = store.CARootActive(nil)
codec := rpcClient(t, s1)
roots := structs.IndexedCARoots{}
err = msgpackrpc.CallWithCodec(codec, "ConnectCA.Roots", &structs.DCSpecificRequest{}, &roots)
I changed from a direct state store call to an RPC call because we need the TrustDomain below to create a valid spiffe ID.
I'm attempting to move where we store this value (see #11687) so I prefer fetching it from a public interface instead of reaching into the state store config.
agent/consul/leader_connect_test.go
Outdated
leafCsr, err := connect.ParseCSR(raw)
require.NoError(err)

leafPEM, err := provider.Sign(leafCsr)
require.NoError(err)

cert, err := connect.ParseCert(leafPEM)
require.NoError(err)
Replaced the direct call to provider.Sign with an RPC to ConnectCA.Sign because those two things return different data. The RPC call returns a PEM that includes intermediate certs, which is required to verify the cert.
In general using public interfaces in tests is preferable, so this felt like a good change.
agent/consul/leader_connect_test.go
Outdated
_, err = cert.Verify(x509.VerifyOptions{
	Intermediates: intermediatePool,
	Roots:         rootPool,
})
This is now done in verifyLeafCert, which will prevent bugs like #11671 because verifyLeafCert checks the cert with intermediates from both locations (leafCertPEM and Root.IntermediateCerts).
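The body of verifyLeafCert is not shown in this conversation, so the following is only a sketch of the idea described above, using the standard crypto/x509 package: verify the leaf once per source of intermediates, so that a missing intermediate in either source is caught. The helper names (newCert, verifyWith, verifyFromBothSources) and the in-memory certificates are assumptions made for illustration and are not taken from the Consul code.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newCert creates a certificate signed by parent, or self-signed if parent is nil.
func newCert(cn string, serial int64, isCA bool, parent *x509.Certificate, parentKey *ecdsa.PrivateKey) (*x509.Certificate, *ecdsa.PrivateKey) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(serial),
		Subject:               pkix.Name{CommonName: cn},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  isCA,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageDigitalSignature,
	}
	if isCA {
		tmpl.KeyUsage |= x509.KeyUsageCertSign
	}
	if parent == nil {
		parent, parentKey = tmpl, key // self-signed root
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, &key.PublicKey, parentKey)
	if err != nil {
		panic(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return cert, key
}

// verifyWith verifies leaf against root using only the given intermediates.
func verifyWith(leaf, root *x509.Certificate, intermediates ...*x509.Certificate) error {
	opts := x509.VerifyOptions{
		Roots:         x509.NewCertPool(),
		Intermediates: x509.NewCertPool(),
		KeyUsages:     []x509.ExtKeyUsage{x509.ExtKeyUsageAny},
	}
	opts.Roots.AddCert(root)
	for _, c := range intermediates {
		opts.Intermediates.AddCert(c)
	}
	_, err := leaf.Verify(opts)
	return err
}

// verifyFromBothSources checks the leaf once per source of intermediates.
func verifyFromBothSources(leaf, root *x509.Certificate, fromLeafPEM, fromRoot []*x509.Certificate) error {
	if err := verifyWith(leaf, root, fromLeafPEM...); err != nil {
		return fmt.Errorf("intermediates from leaf PEM: %w", err)
	}
	if err := verifyWith(leaf, root, fromRoot...); err != nil {
		return fmt.Errorf("intermediates from root: %w", err)
	}
	return nil
}

// demo returns the result with the intermediate in both sources, and the
// result of verifying with no intermediates at all.
func demo() (both, none error) {
	root, rootKey := newCert("root", 1, true, nil, nil)
	inter, interKey := newCert("intermediate", 2, true, root, rootKey)
	leaf, _ := newCert("leaf", 3, false, inter, interKey)
	chain := []*x509.Certificate{inter}
	return verifyFromBothSources(leaf, root, chain, chain), verifyWith(leaf, root)
}

func main() {
	both, none := demo()
	fmt.Println("both sources:", both)
	fmt.Println("no intermediates:", none)
}
```

Checking each source separately, rather than merging them into one pool, is the design choice that matters here: a merged pool would hide the case where only one of the two locations carries the intermediate.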
func (r IndexedCARoots) Active() *CARoot {
	for _, root := range r.Roots {
		if root.ID == r.ActiveRootID {
			return root
		}
	}
	return nil
}
I will be slowly replacing the many places we do this with this new method in future PRs.
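As a rough illustration of what call sites look like with the new method, here is a self-contained sketch. The CARoot and IndexedCARoots types below are trimmed stand-ins with only the fields the method touches (the Name field and the sample values are assumptions); the Active method itself is copied from the diff above.

```go
package main

import "fmt"

// CARoot is a trimmed stand-in for structs.CARoot.
type CARoot struct {
	ID   string
	Name string
}

// IndexedCARoots is a trimmed stand-in for structs.IndexedCARoots.
type IndexedCARoots struct {
	ActiveRootID string
	Roots        []*CARoot
}

// Active returns the root whose ID matches ActiveRootID, or nil if none does.
func (r IndexedCARoots) Active() *CARoot {
	for _, root := range r.Roots {
		if root.ID == r.ActiveRootID {
			return root
		}
	}
	return nil
}

func main() {
	roots := IndexedCARoots{
		ActiveRootID: "b",
		Roots: []*CARoot{
			{ID: "a", Name: "old root"},
			{ID: "b", Name: "current root"},
		},
	}

	// Callers must handle nil: no root may match the active ID.
	if active := roots.Active(); active != nil {
		fmt.Println("active root:", active.Name)
	}
}
```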
@@ -330,10 +330,8 @@ func TestCAManager_RenewIntermediate_Vault_Primary(t *testing.T) {
 	require := require.New(t)
 
 	testVault := ca.NewTestVaultServer(t)
-	defer testVault.Stop()
Already done by NewTestVaultServer using t.Cleanup.

dir1, s1 := testServerWithConfig(t, func(c *Config) {
	c.Build = "1.6.0"
I think these were copy/paste from other tests where they were relevant. The build number does not matter for these tests.
"LeafCertTTL": "2s",
"IntermediateCertTTL": "7s",
This was another source of flakes. The leaf cert could expire before we got to checking if it was valid.
Increasing these values does not actually slow down the test at all, because we use a "NotBefore" date for the cert that is a full minute in the past, so we only end up waiting on the "renew interval", which is 200ms.
provider, _ := getCAProviderWithLock(s1)
intermediatePEM, err := provider.ActiveIntermediate()
Reading the cert from the provider directly was another possible source of flakes, because that value will be updated before the state store.
I've tried to change these tests to always use either the RPC interface (preferred) or the state store (backs the RPC interface). That should make the tests less flaky.
I think we should make a similar change in other tests, but that is for a future PR.
@@ -162,7 +163,7 @@ func runTestVault(t testing.T) (*TestVaultServer, error) {
 	}
 	t.Cleanup(func() {
 		if err := testVault.Stop(); err != nil {
-			t.Log("failed to stop vault server: %w", err)
+			t.Logf("failed to stop vault server: %v", err)
Not sure why %w was not working here, but it was not formatting correctly, so %v should be fine. Maybe %w only works in fmt.Errorf?
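To make the question above concrete, here is a small sketch of the two behaviours involved. t.Log formats its arguments in the manner of fmt.Println, so it never interprets format verbs, and the %w verb is only given its wrapping meaning by fmt.Errorf. The function names and the errBoom value are made up for this example; fmt.Sprintln and fmt.Sprintf stand in for t.Log and t.Logf so the sketch runs outside of go test.

```go
package main

import (
	"errors"
	"fmt"
)

// errBoom is a sample error used only for this illustration.
var errBoom = errors.New("boom")

// logStyle formats the way t.Log does: like fmt.Sprintln, it does not
// interpret format verbs, so any verb in msg is printed literally.
func logStyle(msg string, err error) string {
	return fmt.Sprintln(msg, err)
}

// logfStyle formats the way t.Logf does: like fmt.Sprintf.
func logfStyle(format string, err error) string {
	return fmt.Sprintf(format, err)
}

// wrapsWithErrorf reports whether fmt.Errorf with %w produces an error
// that errors.Is can unwrap back to the original.
func wrapsWithErrorf(err error) bool {
	wrapped := fmt.Errorf("failed to stop vault server: %w", err)
	return errors.Is(wrapped, err)
}

func main() {
	// The verb is left as-is and the error is appended after a space.
	fmt.Print(logStyle("failed to stop vault server: %w", errBoom))

	// With a formatting function, %v renders the error message.
	fmt.Println(logfStyle("failed to stop vault server: %v", errBoom))

	// %w is meaningful in fmt.Errorf, where it records the wrapped error.
	fmt.Println("Errorf wraps:", wrapsWithErrorf(errBoom))
}
```

So the change in the diff fixes two things at once: t.Logf interprets the format string, and %v is the appropriate verb outside of fmt.Errorf.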
provider, _ = getCAProviderWithLock(s1)
newIntermediatePEM, err := provider.ActiveIntermediate()
store := s1.caManager.delegate.State()
_, storedRoot, err := store.CARootActive(nil)
This is another cause of flakes. Mentioned in another comment maybe, but it relates to this flake so I'll mention it here as well.
We poll checking the provider, which is updated first, but it is the state store that backs RPC requests, so we need to poll the state store (or RPC) not the provider.
The provider is updated before the state store, which causes a race in tests.
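The race described above can be modelled with a small, self-contained sketch. It is not the Consul test code: fakeCA, waitFor, and the values used are all hypothetical, and the fixed lag simply simulates the provider being updated before the state store. The point is that the wait loop polls the same source the assertion later reads from.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// waitFor polls check every interval until it returns true or timeout elapses.
func waitFor(timeout, interval time.Duration, check func() bool) bool {
	deadline := time.Now().Add(timeout)
	for {
		if check() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

// fakeCA models the ordering from the comment: on renewal the provider
// value changes first and the state store catches up a little later.
type fakeCA struct {
	provider atomic.Value // holds a string
	store    atomic.Value // holds a string
}

func (c *fakeCA) renew(next string, lag time.Duration) {
	c.provider.Store(next)
	time.Sleep(lag) // window in which provider and store disagree
	c.store.Store(next)
}

// renewAndWait starts a renewal and waits on the store, which is the
// value a user-facing request would be served from in this model.
func renewAndWait() (string, bool) {
	ca := &fakeCA{}
	ca.provider.Store("intermediate-1")
	ca.store.Store("intermediate-1")

	go ca.renew("intermediate-2", 20*time.Millisecond)

	ok := waitFor(2*time.Second, 5*time.Millisecond, func() bool {
		return ca.store.Load().(string) != "intermediate-1"
	})
	return ca.store.Load().(string), ok
}

func main() {
	got, ok := renewAndWait()
	fmt.Println("store updated:", ok, "value:", got)
}
```

Had the loop polled ca.provider instead, it could return during the lag window, and a following read of the store would still see the old value.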
Which will be used in the next commit.
Use the new verifyLeafCert to show the cert verifies with intermediates from both sources. This required using the RPC interface so that the leaf pem was constructed correctly. Add IndexedCARoots.Active since that is a common operation we see in a few places.
I suspect one problem was that we set structs.IntermediateCertRenewInterval to 1ms, which meant that in some cases the intermediate could renew before we stored the original value. Another problem was that the 'wait for intermediate' loop was calling the provider.ActiveIntermediate, but the comparison needs to use the RPC endpoint to accurately represent a user request. So changing the 'wait for' to use the state store ensures we don't race.
Also moves the patching into a separate function.
Removes the addition of ca.CertificateTimeDriftBuffer as part of calculating halfTime. This was added in a previous commit to attempt to fix the flake, but it did not appear to fix the problem. Adding the time here was making the tests fail when using the shared patch function. It's not clear to me why, but there's no reason we should be including this time in the halfTime calculation.
6525f13 to 4116a14
	})
}

func verifyLeafCert(t *testing.T, root *structs.CARoot, leafCertPEM string) {
This looks super useful. We don't have to do it in this PR, but I'm curious how we can propagate/export this across server/agent wherever we deal with certs.
If we find a use for it outside of this package, I generally like to export test helpers in their own internal packages (ref).
So maybe internal/catest or internal/pkitest.
Thanks for all the comments! It made reviewing easier -- LGTM! 🚀
🍒 If backport labels were added before merging, cherry-picking will start automatically. To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/522605.
I'll batch up a few changes to manually cherry-pick at the same time.
Nope, was still on my list. I'll do that now along with #11721. Thanks for the reminder!
🍒 If backport labels were added before merging, cherry-picking will start automatically. To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/549495.
🍒✅ Cherry pick of commit f9647ec onto
…ondary ca: improve test coverage for RenewIntermediate
I managed to get this auto-backported into 1.11 by backporting #11721. I'll do the other two release branches manually now.
There's a bunch of stuff here, so probably best viewed by individual commit. I'll try to comment on any interesting lines. I've been running the tests with -count=100 in a loop and the flakes appear to be fixed now.
Primarily this PR:
Other changes made:
Adds an Active() method to both the IndexedCARoots (the RPC response) and CARoots (the state store result) so that we can stop repeating the "lookup active root" loop in dozens of places.
I'm planning on backporting this to reduce conflicts when I backport other fixes.
Fixes #7520 (hopefully for real this time)
You can see a recent flake of this on main here.