several instances behind layer 4 load-balancer / new "event" when challenges are created #237

daum3ns · 2021-01-28T15:20:01Z

as discussed a few times, it is quite hard to set up mod_md when running multiple instances behind a load balancer.

in our current setup, certificates are renewed manually and then distributed via our configuration management tool.
now we want to enable mod_md to automate cert renewals and have ocsp handled correctly.

the main problem is:

as the certificate renewal request is initiated by apache, and there is a load balancer in front, it is not guaranteed that the follow-up requests for the challenge end up on the same host.

now, i see two possible approaches for this problem:

1. share the knowledge between the hosts, so that it doesn't matter which one will answer the challenge
2. ensure that the follow-up requests always reach the correct host.

1
the latest work on the "event like" behavior of MDMessageCmd, especially the new "renewing" event, makes it possible to share the knowledge between the hosts via NFS, as @root360-AndreasUlm described here: #234 (comment)

Now when NFS is not an option, the problem still exists, because the "renewing" event is triggered before the challenges are created, so it is not possible to manually distribute them among the nodes (via the called script) at this time.

With another new event "challenge available" introduced, our workflow would look like this:

1. a node starts the renewal process, the script then gets called with the new "challenge" event
2. the script can now trigger synchronization of the MDStoreDir among the nodes.
3. at the point the script returns, all nodes are ready to answer the challenge
4. the node that answers the challenge triggers the synchronization again after successful creation of the certificates
5. step by step the nodes are reloaded.

then, by using slightly different renew windows on each node, we can avoid the situation where this whole process starts on two nodes simultaneously, but we are still able to renew certificates as long as one host is up and running...

so i think this would be a solution..

2

the second approach seems almost impossible. i was thinking about some combination of:

a new directive to define an "acme master node" on all other nodes
an ap_internal_login_redirect to a special location (which exists on each md-enabled vhost), in case a "non master node" gets a challenge request
and in this location a proxypass to the "acme master node"

but i'm not sure whether it is possible for mod_md to detect an "unexpected" challenge request. it is also a problem if the "master node" is unavailable.

icing · 2021-01-28T16:05:43Z

I think cluster setups are very specific and as long as the server is not aware of them, what can a little module do?

Therefore I prefer the approach where MDMessageCmd gets invoked when a challenge has been created (and maybe also when a challenge is done, I have to look into the code more for this). The script needs only to return when the cluster sync is done.

As you wrote, there is ideally only one cluster node attempting a renewal at any given time. To prevent simultaneous efforts, the new "renewing" event can suppress it. How you want to realize that behaviour on your cluster, you can judge better than me.

Maybe you could synchronize that info as well and attach a timestamp on it, so that a burning node will not block renewals indefinitely.
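The timestamp idea can be sketched roughly as follows. This is only an illustration, not something mod_md provides: the marker layout (`node`, `started`), the six-hour threshold, and both function names are assumptions, and it presumes the script signals "do not renew now" to the "renewing" event through a non-zero exit code.

```python
import time

# Hypothetical threshold: after this many seconds a renewal marker written by
# another node is treated as stale (that node may have died mid-renewal).
STALE_AFTER_SECONDS = 6 * 60 * 60


def may_start_renewal(marker, this_node, now=None, stale_after=STALE_AFTER_SECONDS):
    """Decide whether this node may begin a renewal.

    `marker` is None when no renewal is recorded in the cluster, otherwise a
    dict like {"node": "web1", "started": 1611846000.0} (assumed layout).
    """
    if now is None:
        now = time.time()
    if marker is None:
        return True  # nobody is renewing
    if marker["node"] == this_node:
        return True  # our own earlier attempt, safe to continue
    # Another node holds the marker: only proceed once it has gone stale.
    return now - marker["started"] > stale_after


def renewing_exit_code(marker, this_node, now=None):
    """Exit code for the 'renewing' event: 0 to proceed, 1 to hold off."""
    return 0 if may_start_renewal(marker, this_node, now) else 1
```

How the marker itself is shared between nodes (file sync, a key-value store, and so on) is left open here, since that depends entirely on the cluster.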

icing · 2021-02-02T15:31:13Z

In v2.3.7 BETA release I added the challenge-setup event that calls MDMessageCmd when the files have been created, but before the ACME server is asked to verify the challenges.

That means MDMessageCmd can distribute the files in your cluster and the module will continue when the command is done. For the exact event sent, please see the description in the README.md.
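A minimal sketch of what the command behind MDMessageCmd might do with this event. It assumes the script receives the event string as its first argument and the MDomain name as its second, and that the event name may carry colon-separated details; the README is the authority on the exact format. `parse_event`, `handle_event`, `SYNC_EVENTS` and the `sync` callback are hypothetical names for illustration.

```python
def parse_event(event):
    """Split an event string into its name and optional colon-separated details."""
    name, _, rest = event.partition(":")
    details = rest.split(":") if rest else []
    return name, details


# Events after which the store should be pushed to the other nodes (assumption:
# once when the challenge files exist, once when the new certificate exists).
SYNC_EVENTS = {"challenge-setup", "renewed"}


def handle_event(argv, sync):
    """Return the exit code for one invocation.

    `argv` is the argument list without the program name; `sync` is a
    caller-supplied function (mdomain, name, details) -> bool that performs
    the actual cluster distribution and reports success.
    """
    if len(argv) < 2:
        return 2  # called incorrectly
    event, mdomain = argv[0], argv[1]
    name, details = parse_event(event)
    if name not in SYNC_EVENTS:
        return 0  # not an event this script cares about
    return 0 if sync(mdomain, name, details) else 1
```

A real script would end with something like `sys.exit(handle_event(sys.argv[1:], my_sync))`, where `my_sync` only returns once every node has the files.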

daum3ns · 2021-09-08T10:06:50Z

@icing while testing i see the following behaviour:
when the script returns an error (i.e. not 0), mod_md still continues to renew the certificate and calls the script again shortly after with the "renewed" event. Is this intentional? in my case the script returns an error if for some reason distribution to another node failed, so i would like to abort the process at this point.

icing · 2021-09-08T12:11:08Z

You are correct. Right now, the 'challenge-setup' is an event without a return code. Someone is listening on that to trigger the script. But the return code does not propagate back into the process.

I agree that this should be changed.

…!= 0 exit), the renewal process is aborted and an error is reported for the MDomain. As discussed in #237, this allows scripts that distribute information in a cluster to abort early without bothering an ACME server to validate a dns name that will not work. The common retry logic will make another attempt in the future, as with other failures.
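For a script that relies on this behaviour, the distribution step could look roughly like the sketch below. It is an assumption-laden illustration: it uses `rsync -a` over SSH to copy the store directory to each peer, and the function names, store path and host names are hypothetical. The only point is that any failed node turns into a non-zero exit code.

```python
import subprocess


def rsync_command(store_dir, node):
    """Build an rsync call that mirrors store_dir to the same path on node."""
    src = store_dir.rstrip("/") + "/"
    return ["rsync", "-a", src, f"{node}:{src}"]


def distribute(store_dir, nodes, run=subprocess.run):
    """Copy the store to every node and return the list of nodes that failed.

    `run` is injectable so the logic can be exercised without a real cluster.
    """
    failed = []
    for node in nodes:
        try:
            ok = run(rsync_command(store_dir, node)).returncode == 0
        except OSError:
            ok = False  # e.g. rsync is not installed on this host
        if not ok:
            failed.append(node)
    return failed


def exit_code(failed):
    """Non-zero as soon as a single node could not be updated."""
    return 1 if failed else 0
```

Usage would be along the lines of `sys.exit(exit_code(distribute("/path/to/md-store", ["web2", "web3"])))`, with the path and hosts replaced by the cluster's own.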

icing · 2021-09-17T11:10:48Z

I changed this in master and also added a test case for it. Will make a new release soon.

icing · 2021-09-17T12:04:03Z

Released in v2.4.7.

icing closed this as completed Sep 17, 2021

daum3ns mentioned this issue Oct 6, 2021

also abort in tls and dns challenges if the script exits with != 0 #266

Merged
