Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert client_java to OpenMetrics #615

Merged
merged 11 commits into from Jan 25, 2021
Merged

Convert client_java to OpenMetrics #615

merged 11 commits into from Jan 25, 2021

Conversation

brian-brazil
Copy link
Contributor

This switches the core data model to OM, adds support for negotiating and exposing OpenMetrics format, adds units, created time, and adds new direct instrumentation for Info and Enum.

This is done as gracefully as practical, but this will break users - in particular if a) they have counters lacking a _total suffix in which case the sample names will change or b) if they happen to be using UNTYPED which is now called UNKNOWN in which case they'll need up update their code.

Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
List<Sample> mungedSamples = samples;
// Deal with _total from pre-OM automatically.
if (type == Type.COUNTER) {
if (name.endsWith("_total")) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Worth making _total a const? Then can name.endsWith(counterSuffix) and name = name.substring(0, name.length() - counterSuffix.length()).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like indirection like that, it only makes the code harder to understand for me as I have to bounce around the file for something that could be understood one its own line.

if (name.endsWith("_total")) {
name = name.substring(0, name.length() - 6);
}
String withTotal = name + "_total";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This could also use the const.

names.add(family.name + "_created");
names.add(family.name);
break;
case GAUGE_HISTOGRAM:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Believe gauge histogram should also have _created?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, gauges don't have a created.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If so, should we change the proto then? HistogramValue can be used with GAUGE_HISTOGRAM:
https://github.com/OpenObservability/OpenMetrics/blob/master/proto/openmetrics_data_model.proto#L136-L138

I suppose implementors could enforce that's not combined together?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should clarify that then.

Comment on lines +78 to +79
if (name.endsWith("_total")) {
name = name.substring(0, name.length() - 6);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: These could reference the _total const?

/**
* Get the created time of the counter in milliseconds.
*/
public long created() {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this maybe be a float in seconds? That way callers don't need to divide by 1000.0 to get the value they should expose, also since this accessor doesn't have the unit (seconds/milliseconds/etc) maybe it's best to return the unit that's expected to be exposed for exposition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java's current time function returns milliseconds, so that's the effective standard inside Java - and thus also what the client uses everywhere already.

@@ -209,6 +215,13 @@ public B namespace(String namespace) {
this.namespace = namespace;
return (B)this;
}
/**
* Set the uit of the metric. Required.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Believe uit should be unit?

Comment on lines +150 to +154
Gauge g = Gauge.build().name("a").unit("seconds").help("h").create();
assertEquals("a_seconds", g.fullname);

Gauge g2 = Gauge.build().name("a_seconds").unit("seconds").help("h").create();
assertEquals("a_seconds", g2.fullname);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth adding a test to observe the unit being added if the name doesn't correctly have the unit on it already when built?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what the test right above this is doing.

@@ -70,6 +70,7 @@ public void loadingCacheExposesMetricsForLoadsAndExceptions() throws Exception {
}
cache.get("user3");


Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Probably could be removed this line change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

for (String accepts : acceptHeader.split(",")) {
if ("application/openmetrics-text".equals(accepts.split(";")[0].trim())) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Worth making this a const? It's used both in the content-type + version string const as well as here when detecting the accept type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it'd aid readability to have a constant based on other constants, this is a well known string - I find it easier to have the raw text rather than digging through layers of indirection to see what this code is actually doing.

Comment on lines +69 to +71
if (metricFamilySamples.type == Collector.Type.COUNTER) {
writer.write("_total");
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm is Prometheus text version output also meant to have the suffixes appended? I thought only OpenMetrics format would get that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, no matter which format is negotiated the exact same samples should end up inside Prometheus.

@@ -53,6 +124,10 @@ public static void write004(Writer writer, Enumeration<Collector.MetricFamilySam
writer.write('\n');
}
}
// Write out any OM-specific samples.
if (!omFamilies.isEmpty()) {
write004(writer, Collections.enumeration(omFamilies.values()));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, could this not be recursive since this is in write004 method itself? i.e. could it not each time collect into omFamilies and recurse but not actually writing it in the child recursive call but simply recursing again infinitely?

Haven't looked to see if there's a test or not for this just yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if is there for a reason. Java lacks inner functions, so this is the easy way to not repeat the rendering logic.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, I see - and these will be gauges in the recurse function. Makes sense.

@@ -100,8 +175,101 @@ private static String typeString(Collector.Type t) {
return "summary";
case HISTOGRAM:
return "histogram";
case GAUGE_HISTOGRAM:
return "histogram";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, isn't this written out as a gauge? As per detecting if gsum/gcount is present?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Histogram is the closest structurally, even if it's technically a gauge. The Python client does the same.

@beorn7
Copy link
Member

beorn7 commented Jan 21, 2021

Since the Prometheus server cannot be configured to not negotiate OpenMetrics, is it perhaps worth it to make OpenMetrics opt-in here, at least for a transition period? I assume this instrumentation library has loads of users, and many will naively update and then be taken by surprise by the changes in naming and the potential of name collisions.

The opt-in has worked quite well in client_golang. People still run into surprising (for them) issues, but at least they have very explicitly activated OpenMetrics and can easily switch it off again if needed.

@brian-brazil
Copy link
Contributor Author

This PR is at its core a fundamental data model change, the addition of additional output formats and new metric types are incidental to that. No matter what format is negotiated, Prometheus will ultimately ingest the same samples. Having which internal data model to use would be non-trivial, and thus it does not make sense to make this configurable. I'm taking the same approach here as with client_python, clearing calling out the change in the release notes and announcements which is fine as this library is not 1.0. If users wish to continue to use the old code, they can not upgrade.

client_golang still uses the Prometheus data model internally, so it's not a good comparison.

@beorn7
Copy link
Member

beorn7 commented Jan 21, 2021

My argument is purely from the users' perspective. They don't care about the internal data model.

I'm not disputing that we are technically entitled to any breaking changes in a pre-1.0 release. However, I think we should still try to be nice to our users if we can. And even if we are technically not breaking any promises, many users will still see it as "our fault" if things break unexpectedly (from their perspective).

The Python client already caused some trouble with the many added _created timeseries (which could only be prevented by not upgrading the Python client). That's however, just a load problem. Name collisions and suddenly enforced _total suffixes on counters are way more problematic. I'd expect client_java to run more commonly in large, complex, and highly critical production environments. I'm afraid releasing this as is under the current circumstances could hit the trust in the project quite a bit.

If at least Prometheus could have a configurable priority for content negotiation, we could at least offer a quick mitigation strategy.

@RichiH
Copy link
Member

RichiH commented Jan 21, 2021

Would cutting a 0.10 (or 0.9.1) before merging this PR be a compromise?

@beorn7
Copy link
Member

beorn7 commented Jan 21, 2021

It's technically a compromise, but it doesn't address the problem.

Users will suffer the same fate with 0.11. Please also take into account that instrumentation happens in code that is not your own, where you are not in control of which version to use. (See the recent Caddy issues as an example from the Go world. At least, Caddy could fix it by adding a configuration option for their users. And zero blame was put on the Prometheus project.)

I think the following migration path would be the least bad:

  • 0.10 has OM as opt-in, so users can test it in their environments without any risk of activating OM without even knowing it.
  • 0.11 has OM as opt-out, so whoever didn't test it and didn't get the message will now learn it the hard way, but can easily switch off.
  • 0.12 will always negotiate OM if possible.

@brian-brazil
Copy link
Contributor Author

I think the following migration path would be the least bad:

The problem with that suggested path is that it requires making a data model configurable, which is not a reasonable request as that's such a fundamental aspect of the code and would be a major amount of work. This PR was tricky enough as-is without having to duplicate all the relevant code and make it all cleanly and correctly interoperate somehow.

has OM as opt-in, ... has OM as opt-out, ... negotiate OM if possible.

Once again this is primarily a data model change, which metric format(s) can be negotiated is irrelevant as we'll always end up with the exact same samples being ingested into Prometheus. The way you can opt in or opt out is by choosing the version of client_java to use. If you don't like the change, it's easy to downgrade again.

Users will suffer the same fate with 0.11. Please also take into account that instrumentation happens in code that is not your own, where you are not in control of which version to use.

That's not a problem, as that's not how Java dependency resolution works. Whoever builds the final "binary" gets to choose which version is used, not the individual library developers. An end-user can also readily produce such a "binary" with a different version. You can't somehow end up with two different versions of the relevant code inside the binary claiming to both be the simpleclient. There's also no backwards incompatible ABI changes here (I'm always careful with that), so users can roll forwards/backwards as long as they don't write code that uses any of the new features.

Ultimately this is a change which has been carefully considered and publicly communicated for well over a year, including how users can migrate in this grace period. This is the PR that ends that ends the grace period Java, so if users choose to upgrade then they get the OM data model and new metric names with that. If they want to stay as-is, they're free to do that too.

@juliusv
Copy link
Member

juliusv commented Jan 21, 2021

Without having reviewed the code or knowing Java, I'd tend to agree with @brian-brazil here if it's true that the binary author is in control of all dependency versions in Java, and if it's indeed a lot of extra effort to make the behavior configurable within the same version. But again, this is just my 2c without having taken a deep look at things.

@roidelapluie
Copy link
Member

roidelapluie commented Jan 21, 2021

Thanks for this pull request.

By looking at the code, it seems that we still support prometheus text format. So it seems reasonable to add a feature to not check for _total, _unit and keep the original provided name. And that if you opt-in for OpenMetrics, we serve OM when negociated and run the checks about unit and total.

It also seems that the pre-om functions are not marked as @deprecated, which is something that I think we should do.

@brian-brazil
Copy link
Contributor Author

Thanks for the discussion everyone, rollouts like these are always a bit tricky but all things considered I think this is the best path.

@brian-brazil brian-brazil merged commit 5c66e58 into master Jan 25, 2021
isomarcte added a commit to isomarcte/http4s that referenced this pull request Jan 25, 2021
Prometheus is changing the naming scheme of some of their metrics in order to be compatible with the OpenMetrics standard. Part of this means that all counters will have `_total` or `_created` as a suffix. There is no way to create a counter without a `_total` suffix.

See: prometheus/client_java#615
@@ -209,6 +215,13 @@ public B namespace(String namespace) {
this.namespace = namespace;
return (B)this;
}
/**
* Set the unit of the metric. Required.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't a unit optional, not required? From the OpenMetrics spec:

If non-empty, it MUST be a suffix of the MetricFamily name separated by an underscore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed this is a copy&paste error, would you like to send a PR to fix this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I've opened #619

shakuzen added a commit to shakuzen/client_java that referenced this pull request Jan 27, 2021
It was incorrectly documented as required when initially introduced.
See prometheus#615 (comment)

Signed-off-by: Tommy Ludwig <8924140+shakuzen@users.noreply.github.com>
brian-brazil pushed a commit that referenced this pull request Jan 27, 2021
It was incorrectly documented as required when initially introduced.
See #615 (comment)

Signed-off-by: Tommy Ludwig <8924140+shakuzen@users.noreply.github.com>
@justinabrahms
Copy link

Just hopping in to mention that this broke a few services in my company due to the backwards incompatible change. :(

@nobletrout
Copy link

yup this broke stuff.

armanbilge pushed a commit to http4s/http4s-prometheus-metrics that referenced this pull request May 2, 2022
Prometheus is changing the naming scheme of some of their metrics in order to be compatible with the OpenMetrics standard. Part of this means that all counters will have `_total` or `_created` as a suffix. There is no way to create a counter without a `_total` suffix.

See: prometheus/client_java#615
@deepak-auto
Copy link

deepak-auto commented Jun 13, 2022

We are blocked from upgrading to a newer version of Spring Boot because of this 🤷 Now our backlog involves fixing dozens, if not hundreds, of alerts and dashboards.

@fwbrasil
Copy link
Contributor

We're also blocked from upgrading at Nubank. Fixing all of our alerts, dashboards, and integrations that consume prometheus data is going to be challenging. Would it be possible to provide a better migration path where the new behavior is configurable?

@bwplotka
Copy link
Member

bwplotka commented Jun 25, 2022

Ack, thanks for the reports (feel free to drop more of them or +1 the issue, it lets us know how important this issue is). 🤗

We are aware of this and we are actively discussing the way to unblock you and the way forward with OM transitions here and with other Prometheus clients.

@fwbrasil
Copy link
Contributor

fwbrasil commented Jul 1, 2022

thank you @bwplotka! Please let me know if we can help. Upgrading the client would solve an important performance issue for us (#777)

@fwbrasil
Copy link
Contributor

hi folks! any updates on this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet