Skip to content

Conversation

cretz
Copy link

@cretz cretz commented Jul 1, 2016

This addresses issue #906. I may have gone a bit overboard w/ the test harness :-). Quick notes:

  • Test harness added in a separate package (storagetest) in the same vein as iotest or httptest since I knew I would end up creating one anyways when I get around to writing my own impl
  • In-memory storage impl in the storagetest package for use by others during testing if desired
  • Basic approach would be for plugins to set httpserver.GetConfig(c).TLS.StorageCreator (or course w/ TLS field null check) in a similar way to how middleware is added. This allows plugins to register themselves as normal, create directives as normal, etc.
  • I will write the wiki for how to create a custom TLS storage impl once this PR is given the thumbs up
  • I likely will be writing either a Vault or Consul+AES impl

As a reminder the whole point of wanting to do this, at least for me, is so I can finally have multiple load balanced end-to-end encrypted web servers that respond to HTTP challenges. Basically we'll be able to update the auto-https faq at https://github.com/caddyserver/caddyserver.com/blob/f27a911a9276cdb6d15ce762504b22db8d082721/public/docs/automatic-https.html#L71 to include the fact that if you use an alternative shared storage plugin, your stuff will work.

Another thing I want to ask: I didn't dig too deeply, but I trust that Caddy obviously caches the cert loaded from storage. So remote storage wouldn't constantly be hit, right?

@CLAassistant
Copy link

CLAassistant commented Jul 1, 2016

CLA assistant check
All committers have signed the CLA.

@cretz
Copy link
Author

cretz commented Jul 1, 2016

Note, all comments between my initial description at the top and this comment were based on an initial review of the first commit.

@mholt
Copy link
Member

mholt commented Jul 1, 2016

Nice, I'll look at this in more scrutiny later! Thanks for the PR. Really looking good from what I saw earlier.

Another thing I want to ask: I didn't dig too deeply, but I trust that Caddy obviously caches the cert loaded from storage. So remote storage wouldn't constantly be hit, right?

Right. That's what certCache is in certificates.go.

Regarding InMemoryStorage:

I was going to use it as an example of an alternative impl and to show that my test harness works on just the interface so it validates both this and the file one. Since it has no runtime purpose per se, shall I remove it?

Ah, gotcha. Yeah, let's remove it if it's not used; feel free to keep what you need for tests in the test file. :)

Is this level of abstraction sufficient for your needs? Do you foresee maybe even a general-purpose plugin that we could put on the download page for others to use, thanks to this change? 😄

@cretz
Copy link
Author

cretz commented Jul 2, 2016

Once you peek at the code you'll see where I just made the in memory store as a tool for other testers in the storagetest package. I think it's fine there.

Also, I think this level of abstraction is perfect. I think once I (or someone else) develops a good shared storage plugin (i.e. one that has its own directives and what not) then it definitely would be worth including in the download. If only Go had better shared lib support it wouldn't have to be bundled.

@mholt mholt added the under review 🧐 Review is pending before merging label Jul 2, 2016

// InMemoryStorageCreator is a caddytls.Storage.StorageCreator to create
// InMemoryStorage instances for testing.
func InMemoryStorageCreator(caURL *url.URL) (caddytls.Storage, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type is only used for testing; can we move it into a test file?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I figured this was used for testing like those items in iotest and httptest are only used for testing. This can be used by other packages for their testing (same with the test harness), ref this SO answer

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, okay. I get you now.

@cretz
Copy link
Author

cretz commented Jul 2, 2016

@mholt Before I dig too deep in the code, can you help me understand cache invalidation and auto renewal? Specifically, it seems like there is a CacheManagedCertificate that caches in a local map but would not get updated if renewal happened on another server.

Specifically, take the following scenario. Say I am using a multi-server setup (i.e. server A and server B) w/ shared storage to some DB backend or something. When server A and server B both feel the need to renew, what kind of race conditions persist here I wonder. Are overwrites and multiple challenges acceptable to the CA? Should I add some logic in RenewManagedCertificates that can tell the storage interface to use a mutex because we are in renewal? Basically this is the same question as if I were running two Caddy processes on the same machine for the same host (different ports) and sharing the same assets dir.

Now I'm thinking that I may need two more abstractions in the storage interface: LockRenewal(domain string) (bool, error) (which would return false if lock is already there, and of course the implementers would be encouraged to have expiring mutexes) and UnlockRenewal(domain string) error so servers don't step on each other. Of course the default local file storage impl would simple return true from the former and the latter would be a no-op.

@mholt
Copy link
Member

mholt commented Jul 2, 2016

@cretz

Before I dig too deep in the code, can you help me understand cache invalidation and auto renewal? Specifically, it seems like there is a CacheManagedCertificate that caches in a local map but would not get updated if renewal happened on another server.

Correct; that doesn't synchronize, just loads from storage into memory.

When server A and server B both feel the need to renew, what kind of race conditions persist here I wonder.

Each Caddy operates independently, renewing the certificates it is managing separately from other Caddy instances. Caddy would have to be programmed to coordinate with a fleet in order to synchronize renewals and other changes. (I'm working on a tool to do this, independently of Caddy, which will one day plug into Caddy. I do not believe this kind of logic belongs in the caddytls package, as it is highly diverse and complicated on its own...)

Edit: If what you need in caddytls is really as simple as adding some sort of mutex/locking to the interface, then that shouldn't be too burdensome and might be more reasonable indeed.

@cretz
Copy link
Author

cretz commented Jul 2, 2016

Each Caddy operates independently, renewing the certificates it is managing separately from other Caddy instances. Caddy would have to be programmed to coordinate with a fleet in order to synchronize renewals and other changes. (I'm working on a tool to do this, independently of Caddy, which will one day plug into Caddy. I do not believe this kind of logic belongs in the caddytls package, as it is highly diverse and complicated on its own...)

So, my whole idea of using shared storage to solve HA-multi-server HTTP challenges fails under race condition here (technically keys and certs stored at the same time could get out of order)? Would you mind if I added those two functions? I don't want Caddy to do the coordination either (but I do want my plugin to), but I do need to solve storage race conditions. Otherwise, do you foresee an issue of two servers issuing renewals at the same time? I am open to any solution here, but my whole purpose for pluggable storage was to allow multiple servers to share certs (and the renewal of them).

I suppose one solution in my storage provider is to do leader election w/ consul and if it's not the leader, just refuse to store the new cert (without error). Seems hacky.

Edit: another option is I could do the consul semaphone on Store* and only unlock it on the last of the 3 calls in saveCertResource. In that case though, I am tempted to rework my interface to have it store all three things at once and load all three things at once instead of separate calls.

Edit: another option is to have a channel to invalidate the in memory cert cache. Still not enough to prevent race conditions, but at least it can update if the storage provider wants it to (e.g. a consul watch)

@mholt
Copy link
Member

mholt commented Jul 3, 2016

So, my whole idea of using shared storage to solve HA-multi-server HTTP challenges fails under race condition here (technically keys and certs stored at the same time could get out of order)?

Yep, right. Across multiple machines, each Caddy would do their renewals independently which would run into LE rate limits faster. So yeah, you'll need a way to sync their access (including checking their expiration date when scanning for renewals - making sure to update the in-memory cached cert inside the lock).

Would you mind if I added those two functions?

Go for it.

I don't want Caddy to do the coordination either (but I do want my plugin to), but I do need to solve storage race conditions.

Good clarification. That's fine.

@cretz
Copy link
Author

cretz commented Jul 3, 2016

Welp, I may have been a bit too eager with this PR...I'll have some changes incoming at some point.

@mholt
Copy link
Member

mholt commented Jul 3, 2016

I probably should have mentioned it, I just hadn't thought much of Caddy sharing its certificates with other processes on different machines.

Anyway, looking forward to the revisions whenever they show up. :)

@mholt mholt removed the under review 🧐 Review is pending before merging label Jul 3, 2016
cretz added 2 commits July 2, 2016 21:40
* Change storage interface to persist all site or user data in one call
* Add lock/unlock calls for renewal and cert obtaining
@cretz
Copy link
Author

cretz commented Jul 3, 2016

Ok, I did my best to try to alleviate my problems. I reduced site/user loading/storing to one call each. I also added a locking mechanism at strategic parts of the code to prevent duplicate renewals.

I hope this hasn't introduced too much complexity. If it has, we can take this PR off the table. I fear the locking parts I added to the somewhat-confusing code paths already taken for obtain/renew cert only make the code look a bit more confusing. I took care to remain compatible with how it works today. In my impl, I'll also have to watch for changes and call CacheManagedCertificate when I notice a change elsewhere.

Review appreciated and if the benefit of including this layer doesn't outweigh the cost of complexity added, we can kill this PR and I will find other approaches.

// emailUsername returns the username portion of an email address (part before
// '@') or the original input if it can't find the "@" symbol.
func emailUsername(email string) string {
at := strings.Index(email, "@")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will be cleaner using strings.SplitN(email, "@", 2)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While that might be true, this is taken verbatim from https://github.com/mholt/caddy/blob/fdb6d64f9d2afdeb6f843862da7dd243abb2399b/caddytls/storage.go#L119 and I wasn't wanting to change any of that logic during this refactoring. I think to clean up that should be a separate effort.

@weingart
Copy link

weingart commented Jul 3, 2016

FYI, take my comments as suggestions. This is looking like it's going in the right direction. 😄

@cretz
Copy link
Author

cretz commented Jul 3, 2016

Gladly. Thank you for looking! :-)

@mholt mholt added the under review 🧐 Review is pending before merging label Jul 3, 2016
@mholt
Copy link
Member

mholt commented Jul 6, 2016

I want this to go in beta 3.

@cretz Is this a change you are willing to maintain (at least in the foreseeable future) if problems or questions arise? I'll make you a collaborator so you have push rights when I merge this PR.

@cretz
Copy link
Author

cretz commented Jul 6, 2016

Sure, I can maintain for the near future, though I can't promise I'll be able to for a really long time (though I hope it's clear enough for anyone to understand/contribute). I also plan on providing a consul plugin right away. I may request help w/ integration testing the multi-server-HTTP-challenge setup w/ the CA.

@mholt mholt removed the under review 🧐 Review is pending before merging label Jul 8, 2016
@mholt mholt merged commit 88a2811 into caddyserver:master Jul 8, 2016
@mholt
Copy link
Member

mholt commented Jul 8, 2016

Thanks @cretz for all your work! Looking forward to what this makes possible.

// Note, the renewal inside here may not actually occur and no error will be returned
// due to renewal lock (i.e. because a renewal is already happening). This lack of
// error is by intention to force cache invalidation as though it has renewed.
err := cert.Config.RenewCert(allowPrompts)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cretz Just a thought: what if another renewal process updates the certificate between the beginning of this for loop (above on line 67) and actually calling RenewCert? Even though we have a read lock on the certCache, something else could have renewed it by now, and finished, which would cause this certificate to be renewed twice.

Could you double-check my thinking here and see if that's a plausible 'race' condition? I wonder if the lock needs to be put over this whole function. (It's still early here; I could be wrong...)

Copy link
Author

@cretz cretz Jul 8, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this. It appears RenewCert re-obtains the site data from storage but yeah, it doesn't check if it still needs to be renewed. So yes, in theory it could double renew here. With how the RenewCertificate function is set up, I am not sure of the best place to place the lock. Open to suggestions. If I place the lock in the for loop (remember it is a domain specific lock), you still suffer the problem where it might need to reload from cert cache. Instead, maybe we just do a "continue" on renewal lock? Or just make sure the re-fetched cert is the one we are expecting to renew inside of RenewCert?

Copy link
Member

@mholt mholt Jul 9, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just looking at some options... Can you move the logic that checks the expiration date into that method, where the expiration check happens inside the lock, maybe?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I'm thinking. I'll send across another PR at some point.

@mholt mholt mentioned this pull request Jul 11, 2016
return err
} else if !lockObtained {
log.Printf("[INFO] Certificate for %v is already being renewed elsewhere", name)
return nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cretz I've been looking at this for a new class of Caddy plugins for TLS storage. In this branch, some other instance is already renewing the certificate, but here we return nil instead of waiting until the renew finishes and then loading it below and continuing along. When does the renewed certificate get loaded? What if the lock was blocking instead?

Copy link
Author

@cretz cretz Aug 24, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mholt - I simply added this for good measure (like if a plugin used it or something) but best I can see, this method is never called anyways. See Config::obtainCertName and Config::renewCertName for where renewal actually appears to happen.

Copy link
Member

@mholt mholt Aug 24, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, thanks @cretz. Was there a specific reason that LockRegister() was not put into ACMEClient.Obtain() but was put into ACMEClient.Renew()? (Note: I'm not criticizing, just understanding since I think we're going to make some changes.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mholt - Tbh, I don't remember exactly why. If I had to guess, it is because I wanted to lock the storage more than the renewal/registration process. Obtain only uses a single (atomic) StoreSite whereas Renew does a LoadSite and then a StoreSite later and I need to make that combination atomic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants