Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong probe_ssl_earliest_cert_expiry value for certificates with multiple trust chains #340

Closed
oleg-glushak opened this issue Jul 3, 2018 · 50 comments · Fixed by #636
Closed

Comments

@oleg-glushak
Copy link

We try to monitor cert expiration date for this certificate: tvonline.swb-gruppe.de
With openssl and using Google Chrome browser we get an expected value (Oct 26 23:59:59 2019).

echo | openssl s_client -servername tvonline.swb-gruppe.de -connect tvonline.swb-gruppe.de:443 2>/dev/null | openssl x509 -noout -dates notBefore=Oct 26 00:00:00 2017 GMT notAfter=Oct 26 23:59:59 2019 GMT

But with blackbox_exporter we get:

probe_ssl_earliest_cert_expiry 1.534824e+09

which is GMT: Tuesday, 21 August 2018 04:00:00.

It seems a similar problem is described here:
https://security.stackexchange.com/questions/66487/what-happens-when-certificates-further-up-the-chain-expires-before-mine-equifa

It's recommended to update openssl to v1.0.2 there which has a fix for this issue but I guess golang/blackbox_exporter use some other mechanism to work with SSL. Is there any workaround or fix for this issue?

@brian-brazil
Copy link
Contributor

The metric purposefully works off the cert that expires first, so that you don't get caught out by something in your chain expiring before the main cert. It doesn't understand multiple trust chains, and it's not clear to me if we should change this.

@dkutetsky
Copy link

We have the same issue when Chrome and OpenSSL shows one date, but blackbox shows date earlier.
How it could be fixed or can we use any workaround for this?

@dkutetsky
Copy link

It will be great to have another metric to return probe_ssl_cert_expiry to have the same date that we have now in all browsers/openssl.

@jrdemasi
Copy link

jrdemasi commented Jan 8, 2020

So turns out this is definitely still an issue - has there been any discussion in terms of how to move forward? The behavior as-is is basically useless.

Edit: I understand the use case for maybe a private CA, but for people trying to monitor externally issued certs, this probably isn't helpful. I'm very interested in a solution that just adds a new value for the end of the chain, and would be willing to work on a patch if one isn't started and that seems a reasonable path forward that would be accepted.

@ribbybibby
Copy link
Contributor

For those of you interested in only alerting on specific certificates in the chain, that can be achieved with: https://github.com/ribbybibby/ssl_exporter.

pir8g33k referenced this issue in greg-solutions/blackbox_exporter Apr 23, 2020
…data

Extend information about last cert of chain
@mr-karan
Copy link

mr-karan commented Apr 30, 2020

https://thesslonline.com/blog/sectigo-addtrust-external-ca-root-expiring-may-30-2020

Sectigo operates a root certificate named the AddTrust External CA Root used to establish cross-certificates to Sectigo’s modern root certificates, the COMODO RSA Certification Authority and USERTrust RSA Certification Authority. Until 2038, those roots do not expire.
The AddTrust External CA Root, however, expires on May 30th 2020.

After this date, clients and browsers will be chaining back to the modern roots used to cross sign with the older AddTrust. No errors will be shown on any patched, existing or modified system or network.

A lot of certificates use Sectigo AddTrust External CA in their chain and it's going to expire next month on May 30th 2020. Is there a way to disable the checks on these kind of intermediate certs?

@jrdemasi
Copy link

jrdemasi commented Apr 30, 2020 via email

@brian-brazil
Copy link
Contributor

From the blackbox exporter standpoint that is a true positive, and you should look at either updating that cert or removing it if it's no longer relevant.

@jrdemasi
Copy link

That isn't always an option, though. It even says in the documentation:

After this date, clients and browsers will be chaining back to the modern roots used to cross sign with the older AddTrust. No errors will be shown on any patched, existing or modified system or network.

That doesn't mean we should have to upgrade our (still valid) certs. This is a shortcoming, and frankly an issue, with blackbox at this point.

@brian-brazil
Copy link
Contributor

If you're sure it's not a problem then I suggest silencing the alert for a month, and then removing it at that point.

@jrdemasi
Copy link

It isn't just going to be a month, though, and it still means that other certs in the chain could be missed. I just don't understand your resistance to just adding some functionality to only report on the end of the chain, which a lot of people have asked for, provided patches for, etc. If this many people are asking, it's clearly sought after and would be helpful. As a sysadmin, I can tell you, this is a HUGE limiting factor in needing to run blackbox vs. having to export our own SSL metrics. It's not possible to always control certs in the way you describe. They are issued to us, often on a per-cert basis, and we aren't looking to renew 200 certs just because blackbox is going to yell at us for another year about them. That isn't a good use of money.

@brian-brazil
Copy link
Contributor

I just don't understand your resistance to just adding some functionality to only report on the end of the chain, which a lot of people have asked for, provided patches for,

I have yet to receive a patch which would avoid the situation where "other certs in the chain could be missed.", as all proposed patches thus far only looked at the last cert in the file.

They are issued to us, often on a per-cert basis, and we aren't looking to renew 200 certs

It's not something I've done myself, but I believe you can remove the particular certs from the set returned to clients as each individual cert is self-contained - which will also reduce bandwidth usage.

@jrdemasi
Copy link

jrdemasi commented Apr 30, 2020

Right, and in 90% of cases, that's what we want to monitor as sysadmins - the last cert in the file or at the end of the trust chain. I don't care if another cert in the chain is going to expire, so I would disable alerting for that entirely, because they're outside of my purview or control. Certs expire in certain parts of chains, especially when bridging is occurring, just like in this scenario, all the time when talking about certs that come from huge places like Sectigo/Comodo. Some people would still find it useful, so I don't propose getting rid of the functionality at all. Manual cert manipulation isn't worth it to save the kiliobytes of bandwidth, load time, etc. It's more risky that you will break something or return something a client will find invalid. We get certs delivered to us as a bundle that we stick into place and call it good until it's time to renew. That's the industry behavior.

Edit: it just seems unreasonable to not ADD functionality to an already great tool that would make it better for a lot of users and completely not impact people who don't want to use it vs. asking people to change their entire config management and cert management paradigms. Truly, not trying to be argumentative. I want to find a solution to this, but I don't understand your opposition, and I would like to.

@brian-brazil
Copy link
Contributor

want to find a solution to this, but I don't understand your opposition, and I would like to.

I'm not willing to accept a feature that enables users to purposefully ignore a definitive upcoming breakage of their application.

To be clear I have rejected PRs that only look at the last cert in the list, what no one has sent me is a PR that provides the time when the last chain will expire.

@jrdemasi
Copy link

I'm not willing to accept a feature that enables users to purposefully ignore a definitive upcoming breakage of their application.

In the case of the presented issue this morning, it will NOT break almost anyone using the certs.

Are you saying if I submit a PR that actually looks at the time when the last chain will expire you will consider it? I am more than willing to put in the time. It is worth noting, though, that a majority of server certs are the last in the list because most software requires that to be the order. Root -> intermediates -> server cert.

@brian-brazil
Copy link
Contributor

Are you saying if I submit a PR that actually looks at the time when the last chain will expire you will consider it?

Yes.

@jrdemasi
Copy link

jrdemasi commented Apr 30, 2020

Do you have a specific test case where the last cert in the list is NOT the end of the chain for reference so I have something to test against?

Would you agree that the following code will return the entire chain as presented by the server, the equivalent of doing an openssl s_client -connect. In many cases, this is going to mirror the same data as what people have presented in PRs before:

conn, err := tls.DialWithDialer(d, "tcp", host+":"+port, &tls.Config{
        InsecureSkipVerify: SkipVerify,
        CipherSuites:       cs,
        MaxVersion:         tlsVersion(),
    })
    if err != nil {
        return []*x509.Certificate{&x509.Certificate{}}, "", err
    }
    defer conn.Close()

@brian-brazil
Copy link
Contributor

My point was more that those PRs weren't considering intermediaries. Based on your responses above it sounds like you already have a good selection of potential test cases available to you.

I'd have to dig into the code, but I expect the challenge here will be what to do after you have all the certs that were returned.

@jrdemasi
Copy link

jrdemasi commented Apr 30, 2020 via email

@brian-brazil
Copy link
Contributor

Any intermediates would be accounted for in the already existing metric,
though, right?

All potential intermediates will be accounted for. What you're looking to implement is to only check one chain of intermediates. Remember that there's no requirement that the last cert in a chain be the first one to expire, which is what the blackbox exporter is trying to protect you against.

@azilber
Copy link

azilber commented May 3, 2020

I think what needs to happen is that we report on expiration of certificate paths, rather than the certificate itself, and have an option whether the expiration of a path triggers an alert.

Right now I have to remove the alert because the legacy chain cert is expiring, but the 2nd path is still going to be fine. Here's some more info for the Sectigo certs: https://support.sectigo.com/Com_KnowledgeDetailPage?Id=kA03l00000117LT

I'm surprised this hasn't been a larger issue for people.

@brian-brazil
Copy link
Contributor

Looking at the docs, Go already derives the chains for us in VerifiedChains so this shouldn't be too difficult to code from there.

@mmakaay
Copy link

mmakaay commented May 4, 2020

I think an important topic is missing from this discussion: availability of trust achors.
The kind of certficiate chain we're looking at here is:

A -> B -> C -> AddTrust CA

The thing is that the AddTrust CA certificate will be expiring by the end of May 2020 and blackbox, correctly, reports this. However, on most systems, this CA is not a trust anchor on the system, while C is (because the C certificate is in the OS store or available as a builtin object in the browser CA store).

With C as a valid trust anchor, an SSL client that has to investigate the validity of the certificate chain, will look at A, then B and then C, stopping there because C is a trust anchor. There is no need for the client to investigate AddTrust CA, since C has explicitly been trusted on the system.

The fact that AddTrust CA will expire by the end of May is not an issue for clients that have an updated trust anchor for C, and this is exactly what is the case for most clients. Things will work as intended (not by accident) with this certificate chain. There is no need to replace the certificate with a new one that does not have the AddTrust CA dangling beneath the certificate tree.

So what would be correct behavior?

What could be done is to only check the certificate tree up to and including the level of a trust root that is available on the system. In the example case from above: only check A, B and C.
This might arguably be the wrong thing to do, since the perspective that blackbox has is not guaranteed to be the same perspective als the regular client that visits the TLS service. E.g. a web browser visiting a https service might work when it has a builtin object C, while blackbox might report an issue when it doesn't have C but only the expiring/expired AddTrust CA

Because it might be hard to make sure that blackbox has the same perspective on valid trust anchors as the visiting clients have, a better option might be some exclude list to flag those certificates that should not be considered in the check. Blackbox could even provide this list for well-known cases like the mentioned AddTrust CA.

My $0.02, I hope they can help this discussion forward.

@brian-brazil
Copy link
Contributor

I had thought a bit about this, and you can already specify which CAs you'd like Go to use which seems to me like it'd cover this sort of case (in both directions).

@mmakaay
Copy link

mmakaay commented May 4, 2020

Is that an answer to my post? I'm not talking about whitelisting CA certificates there. In fact, my proposal was to allow blacklisting CA certificates, for which it's known that they are going to expire, but which will not disappear quickly since they were used for cross signing like AddTrust CA was.

With the example chain that I mentioned in mind: about everybody in the world has the trust anchor for C (in our case COMODO) whitelisted on their system. Therefore, there is no problem in AddTrust CA not being valid anymore. By whitelisting that C trust anchor, we make sure that the TLS connections work as intended. Clients experience no issues, also not after the end of May (we did test that by time shifting tests). Blackbox however is warning us that things go haywire here.

@brian-brazil
Copy link
Contributor

It is. The blackbox exporter already allows you to specify exactly the set of CA roots you want to apply to a tls connection.

@mmakaay
Copy link

mmakaay commented May 4, 2020

I would like to be able to specify exactly what set of CA roots not to consider when computing the minimum expiry date for a certificate chain.

By specifying that I want to apply the COMODO root certificate, and by not specifying the AddTrust CA root, the A -> B -> COMODO -> AddTrust CA chain would be valid from the TLS client perspective and I would be able to connect perfectly well.
However, the computation in Blackbox to find the minimum expiry date to warn about an upcoming expiry would still consider AddTrust CA's expiry date of end May, and therefore would warn about the upcoming (non) issue.

But well, it seems one of us is missing the point here.
I will leave it up to others to make a decision about who that is.
If it's me, then I apologize.

@jrdemasi
Copy link

jrdemasi commented May 5, 2020

The reason for this metric, earliest expiry, is to warn us about upcoming errors, no?

Then I suppose the metric should Ignore any chains that will expire soon if there is another chain that will extend that period.

So, in effect, doing something like this:

func getExpiry(chain []*x509.Certificate) time.Time {
	earliest := time.Time{}
	for _, cert := range chain {
		if (earliest.IsZero() || cert.NotAfter.Before(earliest)) && !cert.NotAfter.IsZero() {
			earliest = cert.NotAfter
		}
	}
	return earliest
}

// we want to know when errors will occur
func noErrorsUntil(state *tls.ConnectionState) time.Time {
	latest := time.Time{}
	for _, chain := range state.VerifiedChains {
		exp := getExpiry(chain)
		if latest.IsZero() || exp.After(latest) {
			latest = exp
		}
	}
	return latest
}

I really like this idea. @brian-brazil, what are your thoughts in the context of what we discussed last week?

@brian-brazil
Copy link
Contributor

That's exactly the sort of new metric I indicated I'd accept a PR for.

@mmakaay
Copy link

mmakaay commented May 6, 2020

New metric? I'd say this is the fixed version of the existing metric @brian-brazil
If you'd make this a new metric, all that will be done is introducing confusion about the meaning of the multiple related metrics, while the current implementation really is not that interesting.

I think if you'd poll your users "what do you expect to gain from this metric? A) the first upcoming expiry of any of the currently valid certificate chains B) the expiry time at which my users will run into problems connecting", that they will go massively for B).

@brian-brazil
Copy link
Contributor

Yes, a new metric. probe_ssl_earliest_cert_expiry describes exactly what the current metric does. Different semantics require a different metric name.

@mmakaay
Copy link

mmakaay commented May 6, 2020

It describes what the metric does, it does not do what people expected from it.

"Earliest" was likely chosen as a description for "the expiry date for the first certificate in the full certificate chain that will expire" (that is what the code reflects as well).

The thing is that there are multiple chains to consider and your code disregards this. However, you know seem to argue that "earliest" is all of a sudden the semantic for "the expiry date for the first certificate that will expire in the first certificate chain that will expire". The semantics have been modified to match the current (unexpected) behavior.

If you want to go for your strict semantics, then it's time to fix your code I'd say, since you're only considering a single certificate chain and not both chains to see which one wins the "earliest expiry" price. Once you've done that, you'll have produced a rather useless metric, since nobody cares that for example end of May one of the chains will expire, while actual problems will only occur in 2038. Your users are interested in 2038 here.

If you don't care about what your users expect from this metric, then have fun with that.
We'll find a different way to get a usable metric here, that doesn't generate false positives.

@brian-brazil
Copy link
Contributor

As a general thing metric names should describe what they do. If a metric name correctly describes what it does but doesn't do what a user expects, then that's a case for finding/adding a metric that does - not creating a confusing situation where the metric doesn't match its name.

In this particular case the current metric name says that it is the earliest cert that expires, which is what it does.

If there is to be a metric that looks at chains instead of certs, and looking for the latest rather than the earliest, then that needs a new metric name as earliest_cert does not accurately describe such a metric.

@mmakaay
Copy link

mmakaay commented May 6, 2020

I get your point, don't worry. It's just a very bureaucratic point.

If I would follow your line of strict reasoning, then the current metric should be fully dropped, in favor of a new metric probe_ssl_earliest_cert_expiry_of_random_chain.

Following strics semantics, you cannot state that probe_ssl_earliest_cert_expiry guarantees to deliver the earliest cert expiry, when only considering one of multiple chains. When you'd get the 2038 chain instead of the end-of-May chain, then the output would differ.

I'm sure you will not agree.

@brian-brazil
Copy link
Contributor

I'd suggest that if you want a resolution to this issue that preparing a PR along the lines discussed above would be more productive than debating metric naming best practices.

@mmakaay
Copy link

mmakaay commented May 6, 2020

I'm really not the one starting semantics debates here.
I'm the one that tries to argue that it would be best to fix the current metric behavior instead of introducing extra metrics, so all users that use the current metric get what they expected from it in the first place and get a fix for the issue.

If @giganteous wants to finish a pull request to provide a different metric altogether, that would be great. But could you please state what would be the one and only possible correct name for such metric in your world? I'm afraid that otherwise a PR might simply lead to yet another discussion about metric naming, instead of an actual fix for this issue.

@brainbug95
Copy link

brainbug95 commented May 7, 2020

What about making probe_ssl_earliest_cert_expiry configurable via prober http? That way the default behavior of probe_ssl_earliest_cert_expiry would not change if one does not configure it differently - i.e. would not break anything. Anyone who wants the other behavior like "find any valid chain" needs to configure the module properly.

blackbox_exporter.yml

http_tls_cert_any:
  prober: http
  http:
    valid_status_codes: [200,302,301,403,404]
    fail_if_not_ssl: true
    valid_chain: any

default behavior for valid_chain would be full which represents the way it works now.

I'm not a developer, therefor I cannot provide any code/PRs. Just an Idea.

@brian-brazil
Copy link
Contributor

What about making probe_ssl_earliest_cert_expiry configurable via prober http?

A given metric should only have one meaning, and adding configurability would only make things worse as now you have to guess what a given metric means (in addition to the metric now being sometimes misleadingly named). If you want a different meaning you have to use a different metric name.

@mmakaay
Copy link

mmakaay commented May 7, 2020

The idea of making the behavior configurable seems a good way of handling things to me. The proposed naming doesn't cover the issue at hand though I think.

The case here is that there are multiple valid chains, all with valid certificates in them, but one of those chains has a certificate in it that is going to expire soon.

Chain A: expires by the end of this month
Chain B: expires in 2038

"expires" here means "earliest expiry date", since we're looking at all certificates in each of these two chains to decide which one is going to expire the soonest within each chain. It doesn't matter if it's a root certificate, intermediate certificate or a host certificate. Once one level of the chain expires, the whole chain becomes invalid.
That is what probe_ssl_earliest_cert_expiry does reflect correctly: it takes a chain and finds the earliest expiry. However, it only reflects on the chain that it gets handed over and does not consider the "earliest expiry date" for both chains.

Hence, the issue was created because we are running into warnings that the certificate chain A is soon to expire. However, the real expiry date for the available chains is in 2038, so the sky is blue and all is good for now. No need for warnings.
From an administrator perspective, the expiry date for chain B is the interesting one. Chain A was created for backward compatibility with older clients and it is of no real interest to most of us.

Back to the proposal of configuring the behavior

Based on the stuff I wrote above, the problem does not lie within the logic that determines "earliest expiry date for a chain". The full chain is checked and that is fine. The problem lies within the absence of logic when there are multiple chains available. The configuration option would be able to tell the exporter to either:

  • Check only the chain from ConnectionState.PeerCertificates
  • Check all chains from ConnectionState.VerifiedChains and use max(earliest expiry date)

A good key and value for configuring the behavior doesn't pop up right away. That doesn't matter too much as I highly doubt that Brian will accept the idea of changing behavior for determining a metric.
(ah, yes, the rejection was already posted while I was typing out this reply)

@mmakaay
Copy link

mmakaay commented May 7, 2020

A given metric should only have one meaning, and adding configurability would only make things worse as now you have to guess what a given metric means

Yet, something like fail_if_not_ssl: true does introduce a change in metric behavior, so what determines whether or not something like that is acceptable?

I am starting to understand why this project has way more forks than pull requests.

@brainbug95
Copy link

brainbug95 commented May 7, 2020

If neither configuration nor changing the behavior of probe_ssl_earliest_cert_expiry are options this would be more like a feature request. New metric may be:

probe_ssl_user_cert_expiry

This one only shows the expiry of the first certificate in the chain. That is the one that most of us can control like mentioned in this thread some time ago.

@mmakaay
Copy link

mmakaay commented May 7, 2020

Ignoring part of the chain and only looking at the host certificate would be possible, however it's cleaner to actually validate all chains and use the expiry date of the chain that will expire last (since the user's browser does the same thing).
This is what the proposal above takes care of and that would fix it in a more strict manner.

The term user_cert might be confusing, since its so close to client_cert which has its whole own meaning in TLS land. What about probe_ssl_time_until_connections_will_fail ? That's what we want to monitor for, right?

@sanvila
Copy link

sanvila commented May 18, 2020

I was affected by this issue. What I did as a workaround was to edit the PEM file to remove the intermediate certificates expiring on 2020-05-30.

(Using the BEGIN CERTIFICATE / END CERTIFICATE separators and
using "openssl x509 -in foo.pem -text -noout" to know the expiry date of each).

(Just in case somebody finds this useful after a google search).

@dragoangel
Copy link

Moved to SSL exporter to validate SSL connection. Thank you @ribbybibby for the link.

itkq added a commit to itkq/blackbox_exporter that referenced this issue Jun 6, 2020
Resolves prometheus#340

Based on disscution in the issue above, this metric will help determine
when the SSL/TLS certificate expiration error actually happens on clients
like a browser that attempts to verify certificates by building one or
more chains from peer certificates.
itkq added a commit to itkq/blackbox_exporter that referenced this issue Jun 6, 2020
Resolves prometheus#340

Based on discussion in the issue above, this metric will help determine
when the SSL/TLS certificate expiration error actually happens on clients
like a browser that attempts to verify certificates by building one or
more chains from peer certificates.
itkq added a commit to itkq/blackbox_exporter that referenced this issue Jun 6, 2020
Resolves prometheus#340

Based on the discussion in the issue above, this metric will help determine
when the SSL/TLS certificate expiration error actually happens on clients
like a browser that attempts to verify certificates by building one or
more chains from peer certificates.
itkq added a commit to itkq/blackbox_exporter that referenced this issue Jun 7, 2020
Resolves prometheus#340

Based on the discussion in the issue above, this metric will help determine
when the SSL/TLS certificate expiration error actually happens on clients
like a browser that attempts to verify certificates by building one or
more chains from peer certificates.
itkq added a commit to itkq/blackbox_exporter that referenced this issue Jun 7, 2020
Resolves prometheus#340

Based on the discussion in the issue above, this metric will help determine
when the SSL/TLS certificate expiration error actually happens on clients
like a browser that attempts to verify certificates by building one or
more chains from peer certificates.

Signed-off-by: Takuya Kosugiyama <re@itkq.jp>
brian-brazil pushed a commit that referenced this issue Jun 10, 2020
* Add new probe_ssl_latest_verified_chain_expiry metric

Resolves #340

Based on the discussion in the issue above, this metric will help determine
when the SSL/TLS certificate expiration error actually happens on clients
like a browser that attempts to verify certificates by building one or
more chains from peer certificates.

Signed-off-by: Takuya Kosugiyama <re@itkq.jp>
@mr-karan
Copy link

Thanks a lot for this :) Can we expect a release with this commit sometime soon?

@brian-brazil
Copy link
Contributor

I'm planning on getting #635 in and then releasing, so hopefully in the next week or so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.