Add proposal for a router impl#112
Conversation
Signed-off-by: Tom Bentley <tbentley@redhat.com>
Co-authored-by: Gunnar Morling <gunnar.morling@googlemail.com> Signed-off-by: Tom Bentley <tombentley@users.noreply.github.com>
Signed-off-by: Tom Bentley <tbentley@redhat.com>
Signed-off-by: Tom Bentley <tbentley@redhat.com>
Signed-off-by: Tom Bentley <tbentley@redhat.com>
Signed-off-by: Tom Bentley <tbentley@redhat.com>
Signed-off-by: Tom Bentley <tbentley@redhat.com>
Signed-off-by: Tom Bentley <tbentley@redhat.com>
Signed-off-by: Tom Bentley <tbentley@redhat.com>
Signed-off-by: Tom Bentley <tbentley@redhat.com>
Signed-off-by: Tom Bentley <tbentley@redhat.com>
| **Topic-based partitioning.** | ||
| Each topic is owned by exactly one upstream cluster. | ||
| The router determines the owning cluster from the topic name (via an unambiguous mapping) and routes data-plane operations accordingly. | ||
| When a single client request references topics on different clusters (e.g. a `PRODUCE` batch containing records for topics on two clusters), the router decomposes the request, fans out the sub-requests, fans-in the sub-responses and the recomposes a single response to be returned to the client. |
There was a problem hiding this comment.
"PRODUCE batch containing records for topics on two clusters". When would this happen in Kafka?
Something is confused here. With the virtualised node id scheme described in the Route API proposal every broker had a unique node id. The client knows all the node ids. Right?
Producing Kafka clients work by connecting to the broker that is the topic leader. So in the in the case where a client needs to send to two topic on different clusters, the client knows this and will be making two broker connections.
It will send a produce request for topic A to the first, and a produce request for topic B in the second.
The produce request that the router sees won't need to be decomposed because it will only contain topic partition request for the 'correct' broker.
Am I missing something?
There was a problem hiding this comment.
You're right that with the current dedicated (one-to-one) BijectiveNodeIdMapping, a well-behaved Kafka producer will not send a PRODUCE request spanning routes. Each virtual node belongs to exactly one route, the client batches by partition leader, and so each ProduceRequest naturally contains only topics from one route.
However, the Router API is designed to support different NodeIdMapping strategies (see proposal 070's "Node ID mapping implementation" section). The current dedicated mapping is one strategy (the first one we'd implement).
I wanted to have an API which didn't lock router implementation into only working like that. I think it would be a strategic mistake to pour a lot of effort into implementing routers that could only work with a single kind of deployment if there's a relatively simple way to do something much more general.
For example: We may want to support future strategies where a single virtual node may serve brokers from multiple routes — for example, if proxy instances were presented to clients as broker nodes. With such a shared (many-to-one) mapping, a single virtual node fronts brokers from different clusters, and a client's leader-directed PRODUCE request can contain topics from different routes.
The router doesn't know which mapping is in use — that's a runtime implementation detail. So it must handle the general case. Decomposition logic would ensure correctness regardless of the mapping strategy.
That said, your observation prompted a clarification this in the proposal — the overview now explains when decomposition is and isn't needed depending on the mapping, and proposal 070 now describes the dedicated and shared strategies concretely.
There was a problem hiding this comment.
Thank you. The explanation make sense. Having the API shaped in such a way to support the alternative pattern is useful.
It worries me that the Data Virtualisation Router PR resulting from this proposal will have code paths that we can't hit in production. I mean it will only ever see requests Produce/Fetch requests that don't need to be decomposed - so it we'll have redundant code. YAGNI tells us that is a the wrong approach. We can enhance Data Virtualisation to support other NodeIdMapping later if the need arrives.
Signed-off-by: Tom Bentley <tbentley@redhat.com>
| | **Data** | `FETCH` | Split by partition leader (see _Leader caching_ below). | | ||
| | **Data** | `LIST_OFFSETS` | Split by partition leader. | | ||
| | **Metadata** | `METADATA` | Three variants: all-topics (fan out to all routes, filter responses to owned topics), empty-topics/broker-info-only (fan out to all routes for broker discovery), specific topics (split by ownership). Broker lists unioned by node ID. | | ||
| | **Metadata** | `API_VERSIONS` | Forwarded to the specific broker via `sendRequestToNode(route, virtualNodeId, ...)`; response capped (see _Version capping_). | |
There was a problem hiding this comment.
I don't understand how the Router implementation knows the node id in order to call RoutingContext#sendRequest
Neither the Router or RouterContext have sight of it.
On the branch you have a version of the io.kroxylicious.proxy.routing.RoutingContext#sendRequest that doesn't require a node id.
| | **Data** | `FETCH` | Split by partition leader (see _Leader caching_ below). | | ||
| | **Data** | `LIST_OFFSETS` | Split by partition leader. | | ||
| | **Metadata** | `METADATA` | Three variants: all-topics (fan out to all routes, filter responses to owned topics), empty-topics/broker-info-only (fan out to all routes for broker discovery), specific topics (split by ownership). Broker lists unioned by node ID. | | ||
| | **Metadata** | `API_VERSIONS` | Forwarded to the specific broker via `sendRequestToNode(route, virtualNodeId, ...)`; response capped (see _Version capping_). | |
There was a problem hiding this comment.
If the Router implementation is agnostic to the NodeIdMapping strategy, then surely the ApiVersions response needs to be a logical AND of the all the brokers. I say this because we need to make sure the client sends requests that the can are intelligible to any of the brokers and the router can't predict in advance which broker(s) the channel will end up needing to talk to.
| **Security consideration.** | ||
| The topic ID resolution cache does not perform authorisation checks. | ||
| A client that knows a topic ID can discover the corresponding topic name (and confirm the topic's existence) by submitting a request containing that ID, regardless of whether the client is authorised to access the topic. | ||
| In practice we do not consider this a significant weakness: topic IDs are UUIDs with a vast keyspace that is neither guessable nor enumerable (for pratical purposes), so an attacker would need to have obtained a valid topic ID from another source (e.g. logs or a compromised client), at which point they likely already know the topic name. |
There was a problem hiding this comment.
| In practice we do not consider this a significant weakness: topic IDs are UUIDs with a vast keyspace that is neither guessable nor enumerable (for pratical purposes), so an attacker would need to have obtained a valid topic ID from another source (e.g. logs or a compromised client), at which point they likely already know the topic name. | |
| In practice we do not consider this a significant weakness: topic IDs are UUIDs with a vast keyspace that is neither guessable nor enumerable (for practical purposes), so an attacker would need to have obtained a valid topic ID from another source (e.g. logs or a compromised client), at which point they likely already know the topic name. |
| Subject-based routing is only needed for transactional producers, which have the stronger coordinator-affinity requirements described in the _Overview_. | ||
|
|
||
| PID mappings are evicted after a configurable TTL (default: 7 days). | ||
| If a mapping is evicted and a subsequent `PRODUCE` arrives, the router returns `UNKNOWN_PRODUCER_ID`, which causes the Kafka producer to re-initialise. |
There was a problem hiding this comment.
I think this strategy works even if proxy replicas are scaled, right. If a producing client reconnects any lands on a different proxy, it will follow the UNKNOWN_PRODUCER_ID path. The flow is the same as for eviction case.
| If a mapping is evicted and a subsequent `PRODUCE` arrives, the router returns `UNKNOWN_PRODUCER_ID`, which causes the Kafka producer to re-initialise. | ||
|
|
||
|
|
||
| ### KIP-227 fetch session management |
There was a problem hiding this comment.
I think this section could use a bit more details.
| These two sides are independent: a pre-v7 client (no session support) can still benefit from server-side sessions with backends, and vice versa. | ||
|
|
||
| When a backend sends an _incremental_ response (containing only changed partitions), the router reconstructs a _full_ response using its cached session state before merging across routes. | ||
| This ensures the client always receives a complete picture. |
There was a problem hiding this comment.
Are you saying the client will always receive a full-response when using this Router? Isn't that negaing KIP-227?
| * `CONSUMER_GROUP_HEARTBEAT`, `CONSUMER_GROUP_DESCRIBE` | ||
| * `OFFSET_COMMIT`, `OFFSET_FETCH` (when the subject is assigned to a route) | ||
|
|
||
| **Prerequisite**: subject routing requires that the client's identity is known, which in practice means SASL termination must occur before the router in the topology. |
There was a problem hiding this comment.
Why doesn't downstream TLS client auth work for Subject routing?
|
|
||
| ### Assignment rejection | ||
|
|
||
| `CREATE_TOPICS` and `CREATE_PARTITIONS` accept optional explicit broker assignments (replica placement lists that name specific broker node IDs). |
There was a problem hiding this comment.
Shouldn't create_topic need reject topic creations where topic name would be unroutable? Otherwise, the client would be allowed to create topics that would be invisible.
| | **Offsets** | `OFFSET_FOR_LEADER_EPOCH` | Split by partition leader. | | ||
| | **Admin** | `CREATE_TOPICS` | Split by topic ownership; reject explicit broker assignments (see _Assignment rejection_). | | ||
| | **Admin** | `DELETE_TOPICS` | Split by topic ownership. | | ||
| | **Admin** | `CREATE_PARTITIONS` | Split by topic ownership; reject explicit broker assignments. | |
There was a problem hiding this comment.
ALTER_PARTITION_REASSIGNMENTS should split by topic ownership. Same for LIST_PARTITION_REASSIGNMENTS It also needs to validate that the request does not try to move partitions between clusters. To be honest, I think it might be better to deliberately leave these out of scope for v1.
| Many request types batch operations across multiple topics in a single request (e.g. a `PRODUCE` request may contain records for topics owned by different clusters). | ||
| The router must decompose such requests, fan out the sub-requests in parallel, and recompose a single response. | ||
| Beyond basic produce and consume, stateful protocol features — idempotent producers, transactions, consumer groups, fetch sessions — each add further complexity. | ||
| This proposal addresses these features incrementally, with the simplest cases (produce, metadata) first and the more complex cases (transactions, consumer groups) building on that foundation. |
There was a problem hiding this comment.
I presume we are deferring things like Share Consumers for another proposal.
There was a problem hiding this comment.
There are a lot of Kafka features which were omitted from the POC in the name of time, including share groups. They use topic ids, so there's a prereq to solve that. But otherwise I'm not aware of any showstoppers (which doesn't mean there aren't any). We could try to bite off more features as part of this proposal and its implementation (aiming for completeness), or we could prioritise timely delivery. The bottleneck will likely be the code review which we'd all need to feel comfortable that we had an understanding of the code.
There was a problem hiding this comment.
They use topic ids, so there's a prereq to solve that. But otherwise I'm not aware of any showstoppers (which doesn't mean there aren't any). We could try to bite off more features as part of this proposal and its implementation (aiming for completeness), or we could prioritise timely delivery.
I'd favour the latter. The sooner we get a release out with the feature, the sooner we get feedback on real use-cases. There's also a chance we can decompose work on the missing parts into separate issues, that could be worked in parallel, perhaps by community members.
The proposal just need to enumerate what's out of scope.
|
Couple more comments, but otherwise heading in the right direction. |
|
@tombentley please add a TL;DR to the PR's description. Also fix up the PR's title to reflect the proposal's name. Thank you. |
No description provided.