Skip to content

Operator metrics server#1928

Merged
SamBarker merged 17 commits intokroxylicious:mainfrom
SamBarker:operatorMetricsServer
Mar 18, 2025
Merged

Operator metrics server#1928
SamBarker merged 17 commits intokroxylicious:mainfrom
SamBarker:operatorMetricsServer

Conversation

@SamBarker
Copy link
Copy Markdown
Member

@SamBarker SamBarker commented Mar 13, 2025

Type of change

  • Enhancement / new feature

Description

Following on from #1865 this PR starts a HTTP server and registers a handler at /metrics to expose those metrics.

Additional Context

Fixes: #1866, #1896

This PR starts a standard Java HTTP server (this is not public or intended for high volume requests so we don't need anything more sophisticated).

Note with an eye to the future and the need to support kubernetes health monitoring this PR builds its own webserver rather than using the one offered by prometheus-metrics-exporter-httpserver. I investigated using the prometheus one however we can't register our own handlers and it felt excessive to have two different http servers requiring two different ports.

Things this PR doesn't address:

  • HTTPS support
  • Any configuration of the HTTP server. Such as binding address or port.
  • Support any metrics registry apart from Prometheus

Checklist

Please go through this checklist and make sure all applicable tasks have been done

  • PR raised from a fork of this repository and made from a branch rather than main.
  • Write tests
  • Update documentation
  • Make sure all unit/integration tests pass
  • Make sure all Sonarcloud warnings are addressed or are justifiably ignored.
  • If applicable to the change, trigger the system test suite. Make sure tests pass.
  • If applicable to the change, trigger the performance test suite. Ensure that any degradations to performance numbers are understood and justified.
  • Ensure the PR references relevant issue(s) so they are closed on merging.
  • For user facing changes, update CHANGELOG.md (remember to include changes affecting the API of the test artefacts too).

NOTE: You must be a member of @kroxylicious/developers to trigger the system test and performance test suites. If you are not part of this group, comment on the PR requesting a trigger, tagging @kroxylicious/developers.

@SamBarker SamBarker requested a review from a team as a code owner March 13, 2025 01:16
@SamBarker SamBarker force-pushed the operatorMetricsServer branch from fbbbe0a to 47c821d Compare March 13, 2025 02:18
Comment thread CHANGELOG.md
Comment thread kroxylicious-operator/pom.xml Outdated
Copy link
Copy Markdown
Member

@k-wall k-wall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of niggles, but otherwise heading in the right direction.

@SamBarker SamBarker force-pushed the operatorMetricsServer branch from 72ea8e3 to 3c52791 Compare March 14, 2025 00:12
delegate.handle(exchange);
}
else {
exchange.sendResponseHeaders(404, -1);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to be careful to terminate the exchange correctly after we're done:

Terminating exchanges Exchanges are terminated when both the request InputStream and response OutputStream are closed. Closing the OutputStream, implicitly closes the InputStream (if it is not already closed). However, it is recommended to consume all the data from the InputStream before closing it. The convenience method close() does all of these tasks. Closing an exchange without consuming all of the request body is not an error but may make the underlying TCP connection unusable for following exchanges. The effect of failing to terminate an exchange is undefined, but will typically result in resources failing to be freed/ reused.

The docs are kind of ambiguous there, saying close does all the tasks but I think it doesn't consume the input stream fully before close.

In the delegation case I guess we can leave cleanup to the delegate?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hoping there is JDK one-liner for eating the inputstream, I think we can do is.transferTo(OutputStream.nullOutputStream()) give or take a try with resources.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, good find.

Copy link
Copy Markdown
Member

@k-wall k-wall Mar 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of #1039 is to prevent a DoS attack against the HTTP port. The attacker could POST a large amount of data, cause resources to the chewed on the server, preventing service to others.

If I understand is.transferTo(OutputStream.nullOutputStream()) correctly, it will read the data from the network and discard it. Can we do better? Can we just send the response, cause the socket to close (somehow) and avoid reading the response?

Copy link
Copy Markdown
Member

@robobario robobario Mar 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like throwing an exception in the HttpHandler before a response has finished being written would close the socket. So maybe we could send headers and throw.

FWIW the prometheus MetricsHandler doesn't look like it drains the input on the successful path, I can see in their HTTPServer implementation (which we don't use) they do have a drainInputAndClose in one unauthorized 403 case, but that looks like the only place.

Having a play around I couldn't provoke any issues by sending multiple GETs with request body at the /metrics endpoint on the same connection.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I read the HttpServer impl code (though I'm not claiming any expertise here) sending the response and leaving the inputStream untouched read will trigger a close as it it will not be EOF.

I'm not sure if we are actually buying anything much here as the InputStream will still be allocated...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I read the HttpServer impl code (though I'm not claiming any expertise here) sending the response and leaving the inputStream untouched read will trigger a close as it it will not be EOF.

ah cool, so maybe it's fine. probably why my test with multiple curl requests is fine, was probably using a new connection each time.

I'm not sure if we are actually buying anything much here as the InputStream will still be allocated...

The inputstream is some lazy thing though that doesn't buffer all the data from the client.


@AfterEach
void afterEach() {
if (kafkaProxy != null) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should do what the JOSDK extension is doing and create a test namespace that we can delete in one shot

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A task for a separate PR.

@robobario
Copy link
Copy Markdown
Member

robobario commented Mar 14, 2025

HTTPS support

think we've talked about this before for Kroxy and setting up TLS on the management port appears uncommon so we've not done it in the proxy

Copy link
Copy Markdown
Member

@robobario robobario left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Playing around I get:

curl localhost:8080 -> 404
curl localhost:8080/ -> 404
curl localhost:8080/bla -> 404
curl localhost:8080/ -XPOST -> 405 with Allow: GET header
curl localhost:8080 -I (head request) -> 405 with Allow: GET header
curl localhost:8080/metrics -> dumps metrics as text/plain
curl localhost:8080/metrics -XPOST -> 405 with Allow: GET header
curl localhost:8080/metricsaaaargh -> dumps metrics as text/plain
curl localhost:8080/metrics/1/2/3 -> dumps metrics as text/plain

I guess the MetricsHandler doesn't care about the path. Could fix as part of this PR or another.

@SamBarker
Copy link
Copy Markdown
Member Author

I guess the MetricsHandler doesn't care about the path. Could fix as part of this PR or another.

IIRC upstream (as in Prometheus) doesn't consider that a bug...

@robobario
Copy link
Copy Markdown
Member

another thing spotted in the Prometheus HTTPServer is they're defaulting a couple of system props around the http server timeouts

Looks quite prudent as the defaults are forever https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/jdk.httpserver/share/classes/sun/net/httpserver/ServerConfig.java#L46

@SamBarker SamBarker force-pushed the operatorMetricsServer branch from 002f1d3 to 01c936f Compare March 16, 2025 23:55
@SamBarker
Copy link
Copy Markdown
Member Author

another thing spotted in the Prometheus HTTPServer is they're defaulting a couple of system props around the http server timeouts

Looks quite prudent as the defaults are forever https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/jdk.httpserver/share/classes/sun/net/httpserver/ServerConfig.java#L46

I've done that though I wasn't quite as generous with the max response time as 600 seconds felt rather long...

}

if (!systemProps.containsKey("sun.net.httpserver.maxRspTime")) {
System.setProperty("sun.net.httpserver.maxRspTime", "120");
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should maybe support overriding this from an ENV Var for better support in a k8s env. Thoughts? Maybe we should leave that until we see an actual need. As two mins seems rather generous anyway.

@k-wall
Copy link
Copy Markdown
Member

k-wall commented Mar 17, 2025

LGTM thanks @SamBarker

@SamBarker SamBarker force-pushed the operatorMetricsServer branch from d9105e8 to 7f591d5 Compare March 17, 2025 21:33
@SamBarker SamBarker enabled auto-merge March 17, 2025 21:43
@SamBarker SamBarker force-pushed the operatorMetricsServer branch 2 times, most recently from b6c4b6c to 178b9eb Compare March 17, 2025 22:48
Signed-off-by: Sam Barker <sbarker@redhat.com>

rh-pre-commit.version: 2.0.1
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Sam Barker <sbarker@redhat.com>

rh-pre-commit.version: 2.0.1
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Sam Barker <sbarker@redhat.com>

rh-pre-commit.version: 2.0.1
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Sam Barker <sbarker@redhat.com>

rh-pre-commit.version: 2.0.1
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Sam Barker <sbarker@redhat.com>

rh-pre-commit.version: 2.0.1
rh-pre-commit.check-secrets: ENABLED
This partially reverts commit 62a8e57

Signed-off-by: Sam Barker <sbarker@redhat.com>

rh-pre-commit.version: 2.0.1
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Sam Barker <sbarker@redhat.com>

rh-pre-commit.version: 2.0.1
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Sam Barker <sbarker@redhat.com>

rh-pre-commit.version: 2.0.1
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Sam Barker <sbarker@redhat.com>

rh-pre-commit.version: 2.0.1
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Sam Barker <sbarker@redhat.com>

rh-pre-commit.version: 2.0.1
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Sam Barker <sbarker@redhat.com>

rh-pre-commit.version: 2.0.1
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Sam Barker <sbarker@redhat.com>

rh-pre-commit.version: 2.0.1
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Sam Barker <sbarker@redhat.com>

rh-pre-commit.version: 2.0.1
rh-pre-commit.check-secrets: ENABLED
…outs

Signed-off-by: Sam Barker <sbarker@redhat.com>

rh-pre-commit.version: 2.0.1
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Sam Barker <sbarker@redhat.com>

rh-pre-commit.version: 2.0.1
rh-pre-commit.check-secrets: ENABLED
It made sense to add as messing around with defaults for timeouts anyway.

Signed-off-by: Sam Barker <sbarker@redhat.com>

rh-pre-commit.version: 2.0.1
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Sam Barker <sbarker@redhat.com>

rh-pre-commit.version: 2.0.1
rh-pre-commit.check-secrets: ENABLED
@SamBarker SamBarker force-pushed the operatorMetricsServer branch from 178b9eb to 7685e2a Compare March 17, 2025 23:25
@sonarqubecloud
Copy link
Copy Markdown

@SamBarker SamBarker merged commit d61f2f5 into kroxylicious:main Mar 18, 2025
@SamBarker SamBarker deleted the operatorMetricsServer branch March 18, 2025 00:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Expose metrics to collectors.

3 participants