Skip to content

Implement a gRPC health service#654

Merged
guvenc merged 1 commit intomainfrom
feature/grpc_health
Mar 26, 2025
Merged

Implement a gRPC health service#654
guvenc merged 1 commit intomainfrom
feature/grpc_health

Conversation

@PlagueCZ
Copy link
Copy Markdown
Contributor

Fixes #653

I implemented the service "from scratch" because I could not find a built-in service (unlike golang has). But it's not that big in the end.

One difference to golng implementation - golang uses TCP keepalive (packets of length 0), I was unable to find such option here (without overriding some methods to set it for all sockets by calling setsockopt), so I used GRPC_ARG_KEEPALIVE_TIME_MS. This is layer 7 though, so instead of two 0-length packets, this uses two packets of length 17 and an additional ACK.
I chose the same keepalive interval as observed by TCPdump for golang implementation, 15s.

This implementation uses std::mutex and std::condition_variable to prevent periodic sending of current status, simply only send a "change" in the value of status of the service (that's what the lock+cv is used for).

@github-actions github-actions bot added enhancement New feature or request size/L labels Mar 17, 2025
@PlagueCZ PlagueCZ force-pushed the feature/grpc_health branch from 03b5a99 to 6977819 Compare March 18, 2025 23:05
@PlagueCZ
Copy link
Copy Markdown
Contributor Author

This is now running in OSC LAB.
We have replaced the old liveness and startup probes:

livenessProbe:
        exec:
          command:
            - dpservice-cli
            - '--address=127.0.0.1:1338'
            - get
            - init
        initialDelaySeconds: 90
        timeoutSeconds: 3
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 10
      startupProbe:
        exec:
          command:
            - dpservice-cli
            - '--address=127.0.0.1:1338'
            - get
            - init
        initialDelaySeconds: 1
        timeoutSeconds: 3
        periodSeconds: 5
        successThreshold: 1
        failureThreshold: 60

with

livenessProbe:
        grpc:
          port: 1338
          service: ''
        initialDelaySeconds: 90
        timeoutSeconds: 3
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 10

Florin also changed the health check inside metalnet to use the gprc health service and it seems to be working.

Thus I think this feature is implemented correctly.


Of course checkpatch is complaining about C++ code, please ignore.

@PlagueCZ PlagueCZ marked this pull request as ready for review March 18, 2025 23:11
@PlagueCZ PlagueCZ requested a review from a team as a code owner March 18, 2025 23:12
@PlagueCZ PlagueCZ force-pushed the feature/grpc_health branch from 6977819 to dc54706 Compare March 19, 2025 14:39
@PlagueCZ PlagueCZ force-pushed the feature/grpc_health branch from dc54706 to 5aa9628 Compare March 25, 2025 14:29
Copy link
Copy Markdown
Contributor

@guvenc guvenc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is quite good and can be merged and offers an equivalent functionality as before. (with a best practice approach)
One thing which was missing before and also with this implementation is that we can not detect the hang of the worker thread in dpservice.
Some sort of timeout mechanism is needed in this busy loop to be able to detect this

while (cq_->Next(&tag, &ok) && ok) {
call = static_cast<BaseCall*>(tag);
while (call->HandleRpc() == CallState::AWAIT_MSG) {
// wait for response from worker
};
}

what do you think ? @PlagueCZ
This is probably a separate thing.

@guvenc guvenc merged commit 5aa9628 into main Mar 26, 2025
5 of 6 checks passed
@guvenc guvenc deleted the feature/grpc_health branch March 26, 2025 14:24
@hardikdr hardikdr added this to Roadmap Jun 26, 2025
@hardikdr hardikdr moved this to Done in Roadmap Oct 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Add a gRPC health service

3 participants