OpAMP Agent heartbeats #183

jaronoff97 · 2024-04-05T18:11:03Z

Currently, if an agent is functioning successfully and without any changes to its health or component status, no messages will be sent after the initial handshake. For WebSockets, because no messages are sent from the agent to the server, most of these connections will idle after a certain period and be shut down, causing a broken socket. I propose that we add support for a heartbeat interval as part of the specification. I have already implemented this capability in the opamp-bridge, where a user is able to configure a heartbeat interval to periodically send the agent's health message. HTTP theoretically gets around this with its polling interval.
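For illustration, here is a minimal sketch of what an agent-side heartbeat loop could look like. It is not the opamp-bridge implementation: `runHeartbeat`, `countHeartbeats`, the `send` callback, and the demo constants are all hypothetical names, and `send` stands in for whatever would actually emit the agent's health message.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Demo values only; a real agent would take the interval from its configuration.
const (
	demoInterval = 20 * time.Millisecond
	demoRun      = 105 * time.Millisecond
)

// runHeartbeat calls send once per interval until ctx is cancelled or send fails.
// send is a hypothetical callback that would emit an AgentToServer health message.
func runHeartbeat(ctx context.Context, interval time.Duration, send func() error) error {
	if interval <= 0 {
		return errors.New("heartbeat interval must be positive")
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if err := send(); err != nil {
				return fmt.Errorf("send heartbeat: %w", err)
			}
		}
	}
}

// countHeartbeats runs the loop for runFor and reports how many heartbeats were sent.
func countHeartbeats(interval, runFor time.Duration) int {
	ctx, cancel := context.WithTimeout(context.Background(), runFor)
	defer cancel()
	sent := 0
	_ = runHeartbeat(ctx, interval, func() error {
		sent++
		return nil
	})
	return sent
}

func main() {
	fmt.Println("heartbeats sent:", countHeartbeats(demoInterval, demoRun))
}
```

Because the loop only depends on a callback, the same sketch would apply to either transport: over WebSockets `send` would write a frame, and over HTTP it would issue a poll request.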

This heartbeat implementation is important for the bridge, where many of the events in Kubernetes happen asynchronously. The bridge is not informed directly and therefore must poll state to send this message. The collector's opamp extension, however, watches for some changes but is currently not busy. For the extension, the collector will idle after some time and prevent the server from being able to send any more messages to the extension.

I think a heartbeat interval could be optional; however, it must be communicated as part of the initial AgentToServer message. This would allow the server to know when to mark the agent as unhealthy.
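As a rough sketch of the server side of this idea, the server could record the interval each agent advertised and flag agents that stay silent for too long. Everything here is an assumption for illustration: `livenessTracker`, `observe`, `unhealthy`, and the "missed budget" of two intervals are hypothetical and not part of the spec or of opamp-go.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// Demo values only.
var demoStart = time.Date(2024, time.April, 5, 18, 0, 0, 0, time.UTC)

const demoInterval = 30 * time.Second

// agentRecord holds what the server remembers about one agent.
type agentRecord struct {
	interval time.Duration // interval the agent advertised; zero means none
	lastSeen time.Time     // time of the most recent message from the agent
}

// livenessTracker flags agents that have been silent for more than
// missedBudget heartbeat intervals (a hypothetical policy).
type livenessTracker struct {
	mu           sync.Mutex
	missedBudget int
	agents       map[string]agentRecord
}

func newLivenessTracker(missedBudget int) *livenessTracker {
	return &livenessTracker{missedBudget: missedBudget, agents: make(map[string]agentRecord)}
}

// observe records a message from the agent along with its advertised interval.
func (t *livenessTracker) observe(id string, interval time.Duration, now time.Time) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.agents[id] = agentRecord{interval: interval, lastSeen: now}
}

// unhealthy returns the sorted IDs of agents whose deadline has passed.
// Agents that advertised no interval are never flagged.
func (t *livenessTracker) unhealthy(now time.Time) []string {
	t.mu.Lock()
	defer t.mu.Unlock()
	var ids []string
	for id, rec := range t.agents {
		if rec.interval <= 0 {
			continue
		}
		deadline := rec.lastSeen.Add(time.Duration(t.missedBudget) * rec.interval)
		if now.After(deadline) {
			ids = append(ids, id)
		}
	}
	sort.Strings(ids)
	return ids
}

func main() {
	tracker := newLivenessTracker(2)
	tracker.observe("agent-a", demoInterval, demoStart)
	fmt.Println(tracker.unhealthy(demoStart.Add(3 * demoInterval)))
}
```

Treating a zero interval as "never flag" keeps the heartbeat optional, which matches the suggestion that agents only need to communicate an interval if they support one.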

Open Questions

- Should this functionality only exist for the socket transport?
  - I believe it would be valuable for this to exist regardless of the transport and that the HTTP poll interval could be deprecated in favor of this. This would allow the server to make decisions about the liveness of the agent regardless of the transport.
- What should the default interval be? 30s?
- Should the heartbeat interval be negotiable from the server?

I'm happy to write the spec change for this issue, but would love everyone's thoughts here.

JaredTan95 · 2024-04-08T05:26:00Z

I think it is reasonable for the agent to actively report a healthy heartbeat; this helps the OpAMP server keep its information up to date.

But there's a question: what's the difference from #28?

jaronoff97 · 2024-04-08T16:30:20Z

I think the main difference is that my proposal relates to heartbeats as a means of solving #28, allowing clients to automatically report health on an interval. A goal of this is definitely to keep the connection alive, but it also means that the client has a responsibility to report its health on an interval. I can understand closing this issue in favor of that though...

BinaryFissionGames · 2024-04-09T12:57:28Z

Want to give a +1 to this. We recently encountered an issue with a misbehaving proxy that kept the OpAMP connection alive despite the fact that the agent pod didn't exist anymore, and I believe that could have been detected with a heartbeat mechanism like this.

I do like the idea of having it be independent of transport, since it seems to be very similar conceptually to the poll interval.

There's also this PR that's been open for a while proposing a heartbeat mechanism, which seems similar to what's proposed here: #176

haoqixu · 2024-04-18T06:05:17Z

A crashed or misbehaving client may cause connection/goroutine leaks in the OpAMP server (open-telemetry/opamp-go#271). A heartbeat mechanism can help the server detect unresponsive peers and also defend against intermediaries (LB, proxy, network equipment) which may time out and terminate idle connections (#28).

gdfast · 2024-05-07T18:32:17Z

I also wanted to give a big +1 to this. The regular status reports from the opamp-bridge have been really useful in:

- providing a regular update on status
- acting as a heartbeat
- making sure the opamp agent checks in (especially if it's connecting over HTTP) so we can send any ServerToAgent message it needs

I agree with @BinaryFissionGames that for the sake of the OpAMP protocol, this behavior should be transport agnostic (i.e. should work for both HTTP and Websockets given both are valid transport protocols for OpAMP).

tpaschalis · 2024-07-03T12:12:10Z

Just chiming in that this does sound like a worthwhile addition and that it should be the same for both transports.

My 2c would be that it is indeed useful for clients to either respect any interval negotiated by the server (or, e.g., a Retry-After header), or at least set a reasonable maximum polling frequency to avoid overloading the server by accident.
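One way to picture this is a small client-side helper that honors a server-offered interval but never goes below a floor. This is only a sketch under assumptions: `effectiveInterval` and `minInterval` are hypothetical, the 5s floor is an arbitrary illustrative value, and the 30s default is simply the figure floated as an open question earlier in this thread.

```go
package main

import (
	"fmt"
	"time"
)

const (
	defaultInterval = 30 * time.Second // default floated in this thread, not a spec value
	minInterval     = 5 * time.Second  // hypothetical floor to avoid overloading the server
)

// effectiveInterval picks the heartbeat interval the client will actually use.
// A non-positive offer means the server expressed no preference.
func effectiveInterval(serverOffered time.Duration) time.Duration {
	if serverOffered <= 0 {
		return defaultInterval
	}
	if serverOffered < minInterval {
		return minInterval
	}
	return serverOffered
}

func main() {
	fmt.Println(effectiveInterval(0))
	fmt.Println(effectiveInterval(time.Second))
	fmt.Println(effectiveInterval(time.Minute))
}
```

The floor caps the polling frequency even if a server (or a misconfiguration) asks for something very aggressive, which is the accidental-overload case mentioned above.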

jaronoff97 · 2024-07-03T14:06:49Z

@tpaschalis I have the PR up here if you can take a look!

jaronoff97 mentioned this issue Jul 3, 2024

Introduce heartbeats #190

Merged

tigrannajaryan closed this as completed in #190 Jul 29, 2024

tigrannajaryan closed this as completed in 58acf6b Jul 29, 2024
