Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ca: improve test coverage for RenewIntermediate #11780

Merged
merged 7 commits into from Dec 9, 2021

Conversation

dnephin
Copy link
Contributor

@dnephin dnephin commented Dec 8, 2021

There's a bunch of stuff here, so probably best viewed by individual commit. I'll try to comment on any interesting lines. I've been running the tests with -count=100 in a loop and the flakes appear to be fixed now.

Primarily this PR:

  1. adds a test for using the Vault provider in a secondary PR (previously we mostly only tested with Vault provider in the primary)
  2. improves some existing RenewIntermediate tests to be less flaky (hopefully not flaky at all!) and to be less verbose, so easier to read.

Other changes made:

  • adds an Active() method to both the IndexedCARoots (the RPC response) and CARoots (the state store result) so that we can stop repeating the "lookup active root" loop in dozens of places.

I'm planning on back porting to reduce conflicts when I back port other fixes.

Fixes #7520 (hopefully for real this time)

You can see a recent flake of this on main here.

@dnephin dnephin added theme/testing Testing, and related enhancements theme/certificates Related to creating, distributing, and rotating certificates in Consul backport/1.9 pr/no-changelog PR does not need a corresponding .changelog entry labels Dec 8, 2021
@dnephin dnephin requested a review from a team December 8, 2021 18:40
@github-actions github-actions bot added theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies and removed theme/certificates Related to creating, distributing, and rotating certificates in Consul theme/testing Testing, and related enhancements labels Dec 8, 2021
@@ -56,8 +56,6 @@ func TestCAManager_Initialize_Vault_Secondary_SharedVault(t *testing.T) {
testrpc.WaitForTestAgent(t, serverDC1.RPC, "dc1")

codec := rpcClient(t, serverDC1)
defer codec.Close()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rpcClient function already does this with t.Cleanup

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL about t.Cleanup. Thanks! 🙌🏻

Comment on lines -423 to +384
_, activeRoot, err = store.CARootActive(nil)
codec := rpcClient(t, s1)
roots := structs.IndexedCARoots{}
err = msgpackrpc.CallWithCodec(codec, "ConnectCA.Roots", &structs.DCSpecificRequest{}, &roots)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed from a direct state store call to an RPC call because we need the TrustDomain below to create a valid spiffe ID.

I'm attempting to move where we store this value (see #11687) so I prefer fetching it from a public interface instead of reaching into the state store config.

Comment on lines 441 to 448
leafCsr, err := connect.ParseCSR(raw)
require.NoError(err)

leafPEM, err := provider.Sign(leafCsr)
require.NoError(err)

cert, err := connect.ParseCert(leafPEM)
require.NoError(err)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced the direct call to provider.Sign with an RPC to ConenctCA.Sign because those two things return different data. The RPC call returns a pem that includes intermediate certs, which is required to verify the cert.

In general using public interfaces in tests is preferable, so this felt like a good change.

Comment on lines 459 to 462
_, err = cert.Verify(x509.VerifyOptions{
Intermediates: intermediatePool,
Roots: rootPool,
})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now done in verifyLeafCert, which will prevent bugs like #11671 because verifyLeafCert checks the cert with intermediates from both locations (leafCertPEM and Root.IntermediateCerts)

Comment on lines +58 to +65
func (r IndexedCARoots) Active() *CARoot {
for _, root := range r.Roots {
if root.ID == r.ActiveRootID {
return root
}
}
return nil
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will be slowly replacing the many places we do this with this new method in future PRs.

@@ -330,10 +330,8 @@ func TestCAManager_RenewIntermediate_Vault_Primary(t *testing.T) {
require := require.New(t)

testVault := ca.NewTestVaultServer(t)
defer testVault.Stop()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

already done by NewTestVaultServer using t.Cleanup.


dir1, s1 := testServerWithConfig(t, func(c *Config) {
c.Build = "1.6.0"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these were copy/paste from other tests where they were relevant. The build number does not matter for these tests.

Comment on lines +343 to +344
"LeafCertTTL": "2s",
"IntermediateCertTTL": "7s",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was another source of flakes. The leaf cert could expire before we got to checking if it was valid.

Increasing these values does not actually slow down the test at all , because we use a "NotBefore" date for the cert that is a full minute in the past, so we only end up waiting on the "renew interval", which is 200ms.

Comment on lines -372 to -397
provider, _ := getCAProviderWithLock(s1)
intermediatePEM, err := provider.ActiveIntermediate()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading the cert from the provider directly was another possible source of flakes, because that value will be updated before the state store.

I've tried to change these tests to always use either the RPC interface (preferred) or the state store (backs the RPC interface). That should make the tests less flaky.

I think we should make a similar change in other tests, but that is for a future PR.

@@ -162,7 +163,7 @@ func runTestVault(t testing.T) (*TestVaultServer, error) {
}
t.Cleanup(func() {
if err := testVault.Stop(); err != nil {
t.Log("failed to stop vault server: %w", err)
t.Logf("failed to stop vault server: %v", err)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why %w was not working here, but it was not formatting correctly, so %v should be fine. Maybe %w only works in fmt.Errorf ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that seems right, since %w doesn't show up as a verb in this list, and is only documented alongside Errorf

Comment on lines -412 to +370
provider, _ = getCAProviderWithLock(s1)
newIntermediatePEM, err := provider.ActiveIntermediate()
store := s1.caManager.delegate.State()
_, storedRoot, err := store.CARootActive(nil)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another cause of flakes. Mentioned in another comment maybe, but it relates to this flake so I'll mention it here as well.

We poll checking the provider, which is updated first, but it is the state store that backs RPC requests, so we need to poll the state store (or RPC) not the provider.

The provider is updated before the state store, which causes a race in tests.

dnephin and others added 7 commits December 8, 2021 18:41
Which will be used in the next commit.
Use the new verifyLearfCert to show the cert verifies with intermediates
from both sources. This required using the RPC interface so that the
leaf pem was constructed correctly.

Add IndexedCARoots.Active since that is a common operation we see in a
few places.
I suspect one problem was that we set structs.IntermediateCertRenewInterval to 1ms, which meant
that in some cases the intermediate could renew before we stored the original value.

Another problem was that the 'wait for intermediate' loop was calling the provider.ActiveIntermediate,
but the comparison needs to use the RPC endpoint to accurately represent a user request. So
changing the 'wait for' to use the state store ensures we don't race.

Also moves the patching into a separate function.

Removes the addition of ca.CertificateTimeDriftBuffer as part of calculating halfTime. This was added
in a previous commit to attempt to fix the flake, but it did not appear to fix the problem. Adding the
time here was making the tests fail when using the shared patch
function. It's not clear to me why, but there's no reason we should be
including this time in the halfTime calculation.
@dnephin dnephin force-pushed the dnephin/ca-test-vault-in-secondary branch from 6525f13 to 4116a14 Compare December 8, 2021 23:43
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging December 8, 2021 23:43 Inactive
@vercel vercel bot temporarily deployed to Preview – consul December 8, 2021 23:43 Inactive
@@ -162,7 +163,7 @@ func runTestVault(t testing.T) (*TestVaultServer, error) {
}
t.Cleanup(func() {
if err := testVault.Stop(); err != nil {
t.Log("failed to stop vault server: %w", err)
t.Logf("failed to stop vault server: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that seems right, since %w doesn't show up as a verb in this list, and is only documented alongside Errorf

})
}

func verifyLeafCert(t *testing.T, root *structs.CARoot, leafCertPEM string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks super useful -- we don't have to do it with this PR but curious how we can propagate/ export this across server/ agent wherever we deal w certs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we find a use for it outside of this package, I generally like to export test helpers in their own internal packages (ref).

So maybe internal/catest or internal/pkitest.

Copy link
Contributor

@acpana acpana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the comments! It made reviewing easier -- LGTM! 🚀

@dnephin dnephin merged commit f9647ec into main Dec 9, 2021
@dnephin dnephin deleted the dnephin/ca-test-vault-in-secondary branch December 9, 2021 17:29
@hc-github-team-consul-core
Copy link
Collaborator

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/522605.

@hc-github-team-consul-core
Copy link
Collaborator

🍒❌ Cherry pick of commit f9647ec onto release/1.11.x failed! Build Log

@hc-github-team-consul-core
Copy link
Collaborator

🍒❌ Cherry pick of commit f9647ec onto release/1.10.x failed! Build Log

@hc-github-team-consul-core
Copy link
Collaborator

🍒❌ Cherry pick of commit f9647ec onto release/1.9.x failed! Build Log

@dnephin
Copy link
Contributor Author

dnephin commented Dec 9, 2021

I'll batch up a few changes to manually cherry-pick at the same time.

@kisunji
Copy link
Contributor

kisunji commented Jan 11, 2022

@dnephin Have these been cherry-picked? I think 1.11.x is failing on CA tests related to this change

@dnephin
Copy link
Contributor Author

dnephin commented Jan 11, 2022

Nope, was still on my list. I'll do that now along with #11721. Thanks for the reminder!

@hc-github-team-consul-core
Copy link
Collaborator

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/549495.

@hc-github-team-consul-core
Copy link
Collaborator

🍒✅ Cherry pick of commit f9647ec onto release/1.11.x succeeded!

hc-github-team-consul-core pushed a commit that referenced this pull request Jan 11, 2022
…ondary

ca: improve test coverage for RenewIntermediate
@hc-github-team-consul-core
Copy link
Collaborator

🍒❌ Cherry pick of commit f9647ec onto release/1.10.x failed! Build Log

@hc-github-team-consul-core
Copy link
Collaborator

🍒❌ Cherry pick of commit f9647ec onto release/1.9.x failed! Build Log

@dnephin
Copy link
Contributor Author

dnephin commented Jan 11, 2022

I managed to get this auto-backported into 1.11 by backporting #11721.

I'll do the other two release branches manually now.

dnephin added a commit that referenced this pull request Jan 11, 2022
…ondary

ca: improve test coverage for RenewIntermediate
dnephin added a commit that referenced this pull request Jan 11, 2022
…ondary

ca: improve test coverage for RenewIntermediate
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr/no-changelog PR does not need a corresponding .changelog entry theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies
Projects
None yet
Development

Successfully merging this pull request may close these issues.

flaky test: agent/consul TestLeader_SecondaryCA_IntermediateRenew
6 participants