
Replace REST and RabbitMQ with gRPC#176

Open
benthecarman wants to merge 9 commits into lightningdevkit:main from benthecarman:from-scratch-grpc

Conversation

@benthecarman
Collaborator

@benthecarman benthecarman commented Mar 27, 2026

Implement gRPC directly on hyper's HTTP/2 support, without tonic, to keep the dependency tree small. This consolidates two separate protocols (protobuf-over-REST for RPCs, RabbitMQ for events) into a single gRPC interface.

#175 proved the migration works with tonic. We opted for a from-scratch approach instead, informed by the plan at https://gist.github.com/tnull/1f6ffc5ae71c418f28844e27eee8c62b

Events are now delivered via a server-streaming SubscribeEvents RPC backed by a tokio broadcast channel, eliminating the RabbitMQ operational dependency. HMAC auth is simplified to timestamp-only signing since TLS guarantees body integrity.

Had Claude break the work into several commits that should make it easy to review.

Also tested with the CLI built in #175, and it works, so it seems our implementation is compliant.

benthecarman and others added 2 commits March 27, 2026 13:22
Define the LightningNode service with all 38 unary RPCs and a
server-streaming SubscribeEvents RPC. Add SubscribeEventsRequest
message and events.proto import. Update response comments from
REST/HTTP style to gRPC style.

prost-build ignores service blocks, so this is a documentation-
only change that prepares for the gRPC migration without altering
any generated code or runtime behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Take &Context instead of Context by value in all 37 handler
functions. The handlers only read from context.node and
context.paginated_kv_store (both Arc), so borrowing avoids
unnecessary Arc clones per request and prepares for the gRPC
service rewrite where Context lives in the service struct.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ldk-reviews-bot

ldk-reviews-bot commented Mar 27, 2026

👋 Thanks for assigning @joostjager as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

Contributor

@joostjager joostjager left a comment


Quick Claude check for missing parts

@@ -0,0 +1,265 @@
// This file is Copyright its original authors, visible in version control
Contributor

@joostjager joostjager Mar 28, 2026


What's missing vs. the gRPC spec

Critical for external client compatibility

  1. Trailers-Only encoding may be wrong - For error responses with no body (GrpcBody::Error), the gRPC spec says to use "Trailers-Only" mode: a single HEADERS frame containing both the HTTP headers (:status, content-type) and the gRPC trailers (grpc-status, grpc-message). The current implementation sends the HTTP response headers first, then the grpc-status as a separate trailers frame. Some client libraries (especially grpc-go, grpc-java) may not find grpc-status where they expect it, or may interpret the empty-body-with-trailers differently. This affects every error path.
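To make item 1 concrete, here is a simplified model of the two response shapes (the `H2Response` type and helper are hypothetical illustrations, not the PR's actual code): in Trailers-Only mode, `grpc-status` rides in the single initial HEADERS frame, with no body and no separate trailers frame.

```rust
/// Simplified model of an HTTP/2 response: a leading HEADERS frame,
/// optional DATA, and an optional trailing HEADERS frame (trailers).
struct H2Response {
    headers: Vec<(&'static str, String)>,
    body: Vec<u8>,
    trailers: Vec<(&'static str, String)>,
}

/// Trailers-Only error: everything, including grpc-status, goes into
/// the one-and-only HEADERS frame. The current implementation instead
/// puts grpc-status into a separate trailers frame after the headers.
fn trailers_only_error(code: u8, msg: &str) -> H2Response {
    H2Response {
        headers: vec![
            (":status", "200".to_string()),
            ("content-type", "application/grpc+proto".to_string()),
            ("grpc-status", code.to_string()),
            // Note: per the spec, grpc-message must be percent-encoded
            // if it contains non-ASCII characters (omitted here).
            ("grpc-message", msg.to_string()),
        ],
        body: Vec::new(),
        trailers: Vec::new(),
    }
}
```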

  2. Content-type check is too loose - starts_with("application/grpc") (line 220) accepts application/grpcfoo or application/grpc+json. The spec requires exactly application/grpc, application/grpc+proto, or application/grpc+<codec> for a codec you support. A client sending application/grpc+json would be accepted but get protobuf back. Should reject anything other than application/grpc and application/grpc+proto.
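A strict check per item 2 could be as small as this (a sketch with a hypothetical helper name, assuming the server only supports the protobuf codec):

```rust
/// Accept only the exact content-types this server can decode.
/// `application/grpc` with no suffix implies the proto codec.
fn is_supported_grpc_content_type(ct: &str) -> bool {
    matches!(ct, "application/grpc" | "application/grpc+proto")
}
```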

  3. No grpc-timeout / deadline support - External clients routinely set grpc-timeout (e.g., grpc-timeout: 5S). This server ignores it entirely, so a client expecting a DEADLINE_EXCEEDED after 5 seconds will instead hang until the handler finishes or the connection drops.
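Parsing the header from item 3 is cheap: the value is an ASCII integer followed by a one-letter unit (H, M, S, m, u, n per the gRPC spec). A sketch of a parser (hypothetical helper, not in the PR) that the server could use to wrap handlers in a deadline:

```rust
use std::time::Duration;

/// Parse a grpc-timeout header value, e.g. "5S" = 5 seconds,
/// "100m" = 100 milliseconds. Returns None on malformed input.
fn parse_grpc_timeout(v: &str) -> Option<Duration> {
    let (num, unit) = v.split_at(v.len().checked_sub(1)?);
    let n: u64 = num.parse().ok()?;
    Some(match unit {
        "H" => Duration::from_secs(n * 3600),
        "M" => Duration::from_secs(n * 60),
        "S" => Duration::from_secs(n),
        "m" => Duration::from_millis(n),
        "u" => Duration::from_micros(n),
        "n" => Duration::from_nanos(n),
        _ => return None,
    })
}
```

The resulting `Duration` would typically feed a `tokio::time::timeout` around the handler, returning DEADLINE_EXCEEDED (status 4) on expiry.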

  4. No grpc-accept-encoding advertisement - The server doesn't send grpc-accept-encoding: identity in responses. External clients that default to gzip compression will try it, get UNIMPLEMENTED, then retry without compression on every single call (doubling latency on first request per stream). Advertising identity upfront avoids this.

  5. No server reflection - Tools like grpcurl, Postman, and Kreya that external developers use for testing/debugging rely on the grpc.reflection.v1 service to discover the API. Without it, external developers need the .proto files out-of-band, which is a significant friction point.

Functional issues (affect all clients)

  1. Stream errors always appear as OK - GrpcBody::Stream sends OK trailers when the mpsc channel closes (line 191-193), regardless of why it closed. If the server encounters an error mid-stream (or the broadcast sender is dropped), the client sees grpc-status: 0. There's no mechanism to send error trailers on a stream.

  2. Broadcast lag silently terminates streams - The broadcast->mpsc bridge (line 307-313) doesn't handle RecvError::Lagged. If a slow client falls behind the 1024-message broadcast buffer, rx.recv() returns Err(Lagged), which breaks the loop and closes the stream with OK status. The client has no idea events were lost.

  3. No graceful shutdown for streams - On SIGTERM, active SubscribeEvents streams aren't drained or notified. The TCP connection just drops, so external clients see a transport error rather than a clean gRPC status.

  4. Trailing data after gRPC frame is silently ignored - decode_grpc_body reads only one frame. If a client sends multiple gRPC frames in a single request body, the extra data is discarded silently. (Only matters if a client reuses a connection oddly, since these are all unary RPCs.)

  5. No request cancellation detection - When a client cancels (RST_STREAM), the handler keeps running to completion. Handlers that do expensive work (e.g., ExportPathfindingScores) waste resources on cancelled requests.
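Functional issues 1 and 2 above share a fix: record why the stream ended and map that to the trailing status, instead of unconditionally sending OK. A sketch with a hypothetical enum (the status numbers are the standard gRPC codes):

```rust
/// Why a server-streaming response terminated (hypothetical type,
/// not the PR's code). The broadcast->mpsc bridge would produce one
/// of these instead of silently breaking its loop.
enum StreamEnd {
    Clean,
    Lagged(u64),   // client fell behind; `n` events were dropped
    SenderDropped, // event source went away mid-stream
}

/// Map the termination cause to (grpc-status, grpc-message) trailers.
fn trailer_for(end: StreamEnd) -> (u8, String) {
    match end {
        StreamEnd::Clean => (0, String::new()), // OK
        // DATA_LOSS (15): tells the client events were missed.
        StreamEnd::Lagged(n) => (15, format!("client lagged; {n} events dropped")),
        // UNAVAILABLE (14): clean gRPC-level signal rather than a
        // raw transport error.
        StreamEnd::SenderDropped => (14, "event source closed".to_string()),
    }
}
```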

Nice-to-have for production with external clients

  1. No health checking protocol - No grpc.health.v1.Health service. External infrastructure (load balancers, Kubernetes, service meshes) uses this standard protocol for liveness/readiness probes.

  2. No binary metadata support - The gRPC spec says headers ending in -bin are base64-encoded binary. External clients/interceptors may send binary metadata (e.g., tracing context in grpc-trace-bin), which this server would misinterpret as text.

  3. No compression support - Correctly returns UNIMPLEMENTED per spec, which is fine. But combined with 4 (no grpc-accept-encoding), external clients pay a round-trip penalty discovering this.

Verdict

For the controlled ldk-server-client, most of this works because both sides agree on the wire format quirks. For external clients, the five critical items are the real blockers: the Trailers-Only encoding can cause error responses to be misinterpreted, the loose content-type check can silently accept incompatible codecs, missing grpc-timeout support breaks client-side timeout contracts, the absent grpc-accept-encoding advertisement causes unnecessary round-trips, and the lack of reflection makes the API hard to discover. The functional issues are correctness problems that affect every client. The nice-to-have items are quality-of-life gaps for production deployments.

Collaborator Author

@benthecarman benthecarman Mar 28, 2026


Did most of these besides:

reflection: tonic doesn't even support this out of the box

health check: can just use get-node-info

compression: added the flag so people won't try it, and it's not really needed

benthecarman and others added 7 commits March 28, 2026 13:38
Implement gRPC wire protocol primitives (encode/decode frames,
status codes, GrpcBody, trailers) directly on hyper's HTTP/2
types without tonic. No callers yet — this is the building block
for the server gRPC conversion in the next commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
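The 5-byte length-prefixed framing these primitives implement can be sketched in a few lines (a minimal illustration under the spec's layout, not the PR's actual module): one compression-flag byte, a big-endian u32 length, then the payload.

```rust
/// Encode one gRPC message frame: flag byte (0 = uncompressed),
/// big-endian u32 payload length, then the payload itself.
fn encode_grpc_frame(msg: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(5 + msg.len());
    out.push(0);
    out.extend_from_slice(&(msg.len() as u32).to_be_bytes());
    out.extend_from_slice(msg);
    out
}

/// Decode one frame, returning (compressed, payload).
/// Returns None if the buffer is truncated.
fn decode_grpc_frame(buf: &[u8]) -> Option<(bool, &[u8])> {
    if buf.len() < 5 {
        return None;
    }
    let compressed = buf[0] == 1;
    let len = u32::from_be_bytes([buf[1], buf[2], buf[3], buf[4]]) as usize;
    let payload = buf.get(5..5 + len)?;
    Some((compressed, payload))
}
```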
Switch from HTTP/1.1 protobuf-over-REST to gRPC over HTTP/2,
implemented directly on hyper without tonic. The gRPC framing
module handles encode/decode of the 5-byte length-prefixed wire
format, status codes, and HTTP/2 trailers.

Key changes:
- hyper http1 → http2, add TokioExecutor for HTTP/2 connections
- TLS ALPN set to h2 for HTTP/2 negotiation
- HMAC auth simplified to timestamp-only (no body) since TLS
  guarantees integrity
- Events delivered via tokio broadcast channel, replacing the
  RabbitMQ EventPublisher infrastructure
- Config renamed rest_service_addr → grpc_service_addr
- Removed lapin, async-trait, events-rabbitmq feature

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Switch from protobuf-over-REST to gRPC wire format. Requests use
gRPC framing (5-byte length-prefix) with application/grpc+proto
content-type, routed to /api.LightningNode/{method}. HMAC auth
simplified to timestamp-only signing to match the server change.

Error handling uses Trailers-Only mode: when the server returns a
gRPC error without a body, grpc-status appears as a regular HTTP/2
header that reqwest can read.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implement the SubscribeEvents RPC end-to-end: a Stream variant in
GrpcBody that delivers multiple gRPC-framed messages from an mpsc
channel, a handler in service.rs that bridges the broadcast channel
to the streaming response, a client EventStream type that reads
gRPC frames incrementally from the HTTP/2 body, and e2e test
assertions that verify payment events arrive via the stream.

Also regenerates proto code to include SubscribeEventsRequest.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rename rest_service_address to grpc_service_address in CLI config
and e2e test harness.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change all handler functions from sync to async, taking
Arc<Context> by value instead of &Context by reference. This
enables future cancellation support and allows handlers to
perform async operations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
