Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check if intermediate cert needs to be renewed. #6835

Merged
merged 8 commits into from
Jan 17, 2020

Conversation

hanshasselberg
Copy link
Member

@hanshasselberg hanshasselberg commented Nov 25, 2019

Fixes #6542.

Currently when using the built-in CA provider for Connect, root certificates are valid for 10 years, however secondary DCs get intermediates that are valid for only 1 year. There is no mechanism currently short of rotating the root in the primary that will cause the secondary DCs to renew their intermediates.
This PR adds a check that renews the cert if it is half way through its validity period.

In order to be able to test these changes, a new configuration option was added: IntermediateCertTTL which is set extremely low in the tests.

The commits for this PR are done so that you could read them one by one and get a better overview. It contains some bugfixes and refactorings. Only the last commit has the new feature!

@hanshasselberg hanshasselberg changed the base branch from release/1.6.x to master December 13, 2019 13:31
@hanshasselberg hanshasselberg requested review from a team and removed request for a team December 13, 2019 15:23
@hanshasselberg hanshasselberg requested a review from a team December 13, 2019 16:45
Copy link
Contributor

@freddygv freddygv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not very familiar with the Connect CA, so I won't approve the PR, but I have a couple comments/questions:

  1. The definition for waitForActiveCARoot in TestLeader_SecondaryCA_IntermediateRenew is missing
  2. Would using syscall.Getrlimit cause problems if someone runs the test suite in a non-unix platform like Windows?

@hanshasselberg hanshasselberg self-assigned this Dec 17, 2019
@hanshasselberg
Copy link
Member Author

hanshasselberg commented Dec 17, 2019

@freddygv thanks for the review

  1. not sure how it ever worked before, but I swear I ran the tests! fixing it.
  2. you are correct, I fixed it so that it is a noop on windows.

Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job @i0rek!

Overall I think this is great and super close. I have one significant suggestion about how we trigger this which I think will also help you solve the test flakyness by not relying on blocking semantics at all.

intermediateCert.NotAfter) {
needsNewIntermediate = true
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This currently relies on our blocking query implementation which times out every 10 mins to implement a "periodic check". I'm not sure that's the best approach for a few reasons:

  1. When we upgrade this to streaming there will be no periodic timeouts
  2. We are using an implementation detail (that this method is called regularly even when there are no changes) in an unexpected/undocumented way. Even without streaming it would be easy to miss this assumption and for example add a line to the replication loop that skips calling this in the case of a timeout where the index is the same. I doubt that's plausibly going to happen in practice but is an example of why it feels wrong to implicitly assume regular iterations of this loop in the steady state.

Could we instead have a dedicated timer loop that somehow triggers the renewal? I realise it's more complicated having two different goroutines call this as they could do so concurrently but I don't think it's that hard to either trigger it via the same goroutine as now - just have it select on either watch update or timer update - or else jsut synchronize them with a mutex?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thats a good point. It required me redoing most of the work, but it is better now!

logf("INFO", "blockSize %d too big for system limit %d. Adjusting...", blockSize, limit)
blockSize = limit - 3
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IS this solving the darwin annoying warning in tests? If so cool :) I wasn't sure if it's relevant to this PR or not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. But it solves the case where your system limits are lower than the lower bound of freeport, which makes its own test fail instantly on OSX.

@hanshasselberg hanshasselberg force-pushed the connect_secondary_renew branch 9 times, most recently from e0c9e51 to 5e8bb04 Compare January 14, 2020 13:43
@hanshasselberg hanshasselberg requested review from banks and a team January 14, 2020 19:53
@hanshasselberg
Copy link
Member Author

Tracking test failures atm.

@hanshasselberg hanshasselberg force-pushed the connect_secondary_renew branch 4 times, most recently from 63a1502 to 4839682 Compare January 15, 2020 12:06
Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Hans, Looking super good!

I think most of this is just naming bikesheds - I really like the split of some of the code that was getting kinda hairy though once names are clear!

The main blocker I see is that the renew can race with noticing a new root which could corrupt the CA config state just like the issue we had the other day. I don't think we need to do anything really sophisticated here - maybe just have a mutex held for the entire process of checking renewal and hold same mutex for any other secondary CA reconfiguration so we can be sure that there is only one reconfig going at once.

agent/connect/ca/provider_consul.go Outdated Show resolved Hide resolved
agent/consul/leader_connect.go Outdated Show resolved Hide resolved
agent/consul/leader_connect.go Outdated Show resolved Hide resolved
agent/consul/leader_connect.go Outdated Show resolved Hide resolved
agent/consul/leader_connect.go Outdated Show resolved Hide resolved
agent/consul/leader_connect.go Outdated Show resolved Hide resolved
agent/consul/leader_connect.go Show resolved Hide resolved
agent/consul/leader_connect.go Outdated Show resolved Hide resolved
@hanshasselberg
Copy link
Member Author

hanshasselberg commented Jan 16, 2020

Thank you for the review @banks!
The function names were an oversight on my part, I forgot to come up with proper names after I just picked a placeholder name during the refactoring. Your suggestions were very welcome!

The provider reconfiguration now also has its own lock. Despite the existence of caProviderLock which is being used to lock reading and writing the provider from the state. I chose to not reuse that to not block other operations during a possibly longer running intermediate cert renewal.

@hanshasselberg hanshasselberg added the waiting-reply Waiting on response from Original Poster or another individual in the thread label Jan 16, 2020
@stale stale bot removed the waiting-reply Waiting on response from Original Poster or another individual in the thread label Jan 16, 2020
Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thanks Hans! 🎉

@@ -173,6 +177,9 @@ func (s *Server) initializeCA() error {
if err != nil {
return err
}

s.caProviderReconfigurationLock.Lock()
defer s.caProviderReconfigurationLock.Unlock()
s.setCAProvider(provider, nil)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be a good idea to document in the comment for setCAProvider that it is often called while holding caProviderReconfigurationLock and so must never take that or call anything that does.

I think it's unlikely it would but it's not super obvious that would cause a deadlock which might bite us eventually.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added that warning in a bunch of places.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

connect: Secondary DC Intermediate Certificates are not renewed automatically
3 participants