You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Runtime Filter (Dynamic Filter Pushdown) for MSE Hash Joins
TL;DR
For inner hash joins in the Multi-Stage Query Engine (MSE), materialize a compact Bloom filter over the build-side join keys and push it down to the probe-side segment scan. Non-matching rows are pruned during I/O instead of being fetched, shuffled across the network, and dropped at the join operator. This is a well-known MPP optimization (Trino, StarRocks, Apache Doris, Spark, DuckDB). Pinot doesn't have it today; this issue proposes the design.
Design follows the Trino / StarRocks / Doris pattern: a per-query, broker-mediated runtime-filter registry that producers (the HashJoinOperator in the intermediate stage) write to on build completion, and consumers (leaf-stage segment scans) poll between row batches. Timing is asynchronous and best-effort — late rows are pruned, rows already in flight when the filter arrives are not.
Revised on 2026-05-28 based on review feedback from @siddharthteotia (issue comments). The earlier "same-OpChain hand-off via OpChainExecutionContext" sketch was incorrect: build (intermediate stage) and probe leaf are always different OpChains, so cross-stage transport is now first-class instead of a follow-up.
1. Status & Scope
In scope (v1): Inner hash joins with single equi-condition; Bloom filter built on the build side; cross-stage transport via a broker-mediated registry; pushed into the leaf-stage segment scan as a new BLOOM_MEMBERSHIP predicate.
Timing model: Trino-style asynchronous / best-effort. Snowflake schema (FACT × N-DIM, multi-producer to one fact scan) supported day one.
Out of scope (follow-ups, no longer including cross-stage transport): Semi-join and anti-join runtime filters; multi-stage runtime-filter chains beyond simple snowflake; non-equi joins; Spark-style synchronous scheduling barrier; dictionary-encoded leaf path (pre-materialized matching dict-ID set).
2. Background
2.1 JOIN basic knowledge
A JOIN combines rows from two tables that share a value in some column (the join key). Example: a 10-billion-row orders table joined against a small customers table on customer_id to find VIP orders.
2.2 What's a hash join?
There are two phases for hash join:
Build phase — load the smaller side (the 3 VIPs) into an in-memory hash table keyed on customer_id.
Probe phase — stream the bigger side row-by-row, look each customer_id up in the hash table; emit matches, discard misses.
2.3 What's the problem?
In a distributed engine like MSE, the probe side and the build side often run on different workers. Every probe-side row gets read from storage and sent across the network, only to be discarded at the join operator if its key isn't in the hash table. For a selective join (small build side, large probe side), that's a lot of wasted I/O and bandwidth.
2.4 What's a "runtime filter"?
A small "did the build side maybe contain this key?" check that the build side ships to the probe side at runtime, after the hash table is built. The probe side applies it as early as possible — ideally inside the segment scan, so non-matching rows are skipped during I/O. The check is a Bloom filter, which has zero false negatives (so the join stays correct) and a tunable false-positive rate (so it stays compact).
This is the same trick Spark, Trino, Presto, and DuckDB use. Pinot already does a related thing for SEMI joins via PinotJoinToDynamicBroadcastRule (broadcasts an IN-list); the new feature does it for INNER hash joins via a Bloom filter, transported via a registry rather than baked into the physical plan.
3. Design
3.1 Three layers + a control plane
Layer
Module
Responsibility
Planner
pinot-query-planner
Detect eligible inner joins (single equi-condition, build-side row count below threshold). Insert a physical runtime-filter node that names the registry key (requestId, joinColumn).
Runtime
pinot-query-runtime
HashJoinOperator builds the Bloom on finishBuildingRightTable() and publishes to the registry. Leaf-scan OpChain polls the registry between row batches.
Leaf
pinot-core
Accept the Bloom as a BLOOM_MEMBERSHIP predicate inside the segment scan via FilterPlanNode.combineWithRuntimeBloomFilters (already landed in #197).
Control plane (new)
broker
Hosts the per-query RuntimeBloomFilterRegistry keyed by (requestId, columnName). Producers write, consumers read. Cleaned up at query end.
3.2 Key design decisions
Inner joins, single equi-condition first. Mirrors the restriction in the existing PinotJoinToDynamicBroadcastRule. Composite keys and outer joins are out of scope for v1.
Bloom filter as the filter form for v1. Compact (KB-level), zero false negatives. Implementation reuses Guava's BloomFilter (matching BloomFilterIdSet). The registry shape is polymorphic so IN-list / min-max can be added later without rework (cf. StarRocks / Doris multi-form pattern in §3.4).
New Predicate.Type.BLOOM_MEMBERSHIP. Dedicated predicate keeps evaluator routing clean and avoids regressions in the existing IN path.
QueryContext carries runtime Bloom filters per column. Leaf-stage FilterPlanNode AND-combines a synthetic BLOOM_MEMBERSHIP predicate per column into the segment-scan filter tree. Queries without runtime filters allocate nothing.
Broker-mediated registry, not in-process hand-off.OpChainId = (requestId, virtualServerId, stageId) is stage-bound, every PhysicalExchange creates a stage boundary, and build / probe always live in different OpChains. There is no in-process channel that survives the stage boundary. The registry lives on the broker, producers publish(requestId, column, bloom), consumers get(requestId, column).
Asynchronous / best-effort timing. Probe-side leaf scans start immediately. The leaf execution loop polls the registry between batches; when an entry appears, it merges into QueryContext.runtimeBloomFilters and the next batch is filtered. Rows already in flight when the filter arrives are not retroactively filtered — this is the same trade-off Trino / StarRocks / Doris accept. We instrument the "leak" (rows emitted before the filter arrived) so a future decision to add a synchronous barrier is data-driven.
Multi-producer to single consumer day one. Snowflake schemas (FACT × N-DIM) work without rework: each producer HashJoin writes its own column key into the same per-query registry, the fact scan reads all columns it touches. The consumer-side Map<String, RuntimeBloomFilter> already supports this (Inject runtime filters at the leaf-stage filter boundary #197).
Automatic triggering, hint-overridable. The optimizer rule fires when build-side RelMetadataQuery.getRowCount() is below pinot.broker.mse.runtime.filter.build.cardinality.threshold (default 1M). Per-join enable_runtime_filter hint can force on/off.
Flag-gated, disabled by default. Same rollout pattern as the recent MSE fingerprint change. No behavior change for existing queries until the flag flips.
3.3 Data flow
The diagram below was sketched against the original same-OpChain design and is partially superseded — the new control-plane path (registry between intermediate-stage HashJoin and leaf-stage scan) replaces the in-OpChain hand-off shown here. The textual description in §3.2 is authoritative.
3.4 Industry comparison
Surveyed along the dimensions that matter operationally:
Async / registry is industry consensus for general-purpose distributed MPP engines. Trino, StarRocks, Doris all converged here independently. Our proposal sits in the same row.
Spark's synchronous barrier is cheap only because Catalyst already has plan-time barriers (DynamicPruningSubquery, BroadcastExchange with synchronous materialization). MSE has neither. Replicating Spark's coverage in Pinot is not a planner-rule patch — it would require building an OpChain-level scheduling primitive, on the order of quarters of work. That's a v2 decision, not a v1 default.
No SOTA system tries to retroactively prune the early-row leak. They all manage it with timeouts and instrumentation. We follow suit.
3.5 Why async (Trino-style) not synchronous (Spark-style)
Trino-style (proposed v1)
Spark-style (deferred to v2)
Coverage
Late rows pruned; early rows leak
~100% (probe blocks on Bloom)
Change to MSE scheduling
None
Significant — new OpChain-level barrier
Multi-producer (snowflake)
Falls out for free
Requires plan-shape rewrite per dim
Cost to add
Weeks
Quarters
Risk
Low — additive to existing pipeline
High — new scheduling primitive
Decision driver to v2
Measured leak rate in production
—
We instrument the leak (rows-emitted-before-Bloom-arrived vs. after) on Phase 7 so the v2 decision is data-driven.
4. Implementation plan
Total: 9 phases. PRs #195, #196, #197 are already up.
Map<String, RuntimeBloomFilter> on QueryContext; FilterPlanNode.combineWithRuntimeBloomFilters AND-injects BLOOM_MEMBERSHIP predicates. PR description to be revised to drop the OpChainExecutionContext framing.
3
RuntimeBloomFilterRegistry (new)
To do
Per-query, broker-resident registry keyed by (requestId, columnName). Producers call publish(...), consumers call get(...). Cleaned up at query end.
4
HashJoinOperator publish hook
To do
In finishBuildingRightTable(): iterate the PrimitiveLookupTable keys, build a RuntimeBloomFilter sized to the build cardinality, publish to the registry. Replaces the discarded "publish via OpChainExecutionContext" idea.
5
Leaf-scan consumer (best-effort poll)
To do
Between row batches in the leaf execution loop: poll the registry for columns touched by the scan, merge into QueryContext.runtimeBloomFilters. Phase 2's injection logic does the rest. Instrument rows-emitted-before-Bloom-arrived vs. after.
6
RuntimeFilterInsertRule planner rule
To do
Calcite rule detecting eligible inner equi-joins; inserts a physical node carrying the registry key. Mirrors PinotJoinToDynamicBroadcastRule's shape (sibling rule), registered in PhysicalOptRuleSet.
7
Snowflake integration test + benchmark
To do
FACT × 2-DIM test per @siddharthteotia's example. pinot-perf harness measures end-to-end latency and leak rate. Decision driver for Phase 8.
8
Spark-style synchronous barrier
Tracked, not committed
OpChain-level barrier so leaf scans wait on the registry. Closes the leak entirely but invasive. Only if Phase 7 numbers warrant.
4.1 Out of scope for v1 (follow-up issues to file once v1 lands)
Dictionary-encoded leaf path for BLOOM_MEMBERSHIP (pre-materialized matching dict-ID set)
Planner golden plans — PhysicalOptimizerPlans.json cases: INNER triggers, OUTER does not, build-too-large does not, hint-disabled does not, hint-forced does, snowflake (multi-producer) produces correct registry keys.
Snowflake integration test — TPC-H-shaped FACT × 2-DIM in pinot-integration-tests: two HashJoins in different intermediate stages both publish to the same fact-scan registry, leaf scan pulls both Blooms. Assert leaf-stage emitted row count drops vs. flag-off baseline.
Performance benchmark with leak instrumentation — pinot-perf measures latency delta and the (rows-before-Bloom / total-rows) ratio. Target ≥ 20% latency improvement on a representative selective join; leak ratio is the v2 decision driver.
6. Rollout
Land all phases behind pinot.broker.mse.runtime.filter.enabled=false (no behavior change for existing queries).
Enable in a staging cluster, monitor latency, CPU, and the leak ratio on join-heavy workloads.
Enable cluster-wide once stable; consider per-table overrides.
If staging leak ratio is acceptable (TBD threshold), stay on async. If unacceptable on representative workloads, open the Phase 8 discussion.
Runtime Filter (Dynamic Filter Pushdown) for MSE Hash Joins
TL;DR
For inner hash joins in the Multi-Stage Query Engine (MSE), materialize a compact Bloom filter over the build-side join keys and push it down to the probe-side segment scan. Non-matching rows are pruned during I/O instead of being fetched, shuffled across the network, and dropped at the join operator. This is a well-known MPP optimization (Trino, StarRocks, Apache Doris, Spark, DuckDB). Pinot doesn't have it today; this issue proposes the design.
Design follows the Trino / StarRocks / Doris pattern: a per-query, broker-mediated runtime-filter registry that producers (the
HashJoinOperatorin the intermediate stage) write to on build completion, and consumers (leaf-stage segment scans) poll between row batches. Timing is asynchronous and best-effort — late rows are pruned, rows already in flight when the filter arrives are not.1. Status & Scope
BLOOM_MEMBERSHIPpredicate.2. Background
2.1 JOIN basic knowledge
A JOIN combines rows from two tables that share a value in some column (the join key). Example: a 10-billion-row
orderstable joined against a smallcustomerstable oncustomer_idto find VIP orders.2.2 What's a hash join?
There are two phases for hash join:
customer_id.customer_idup in the hash table; emit matches, discard misses.2.3 What's the problem?
In a distributed engine like MSE, the probe side and the build side often run on different workers. Every probe-side row gets read from storage and sent across the network, only to be discarded at the join operator if its key isn't in the hash table. For a selective join (small build side, large probe side), that's a lot of wasted I/O and bandwidth.
2.4 What's a "runtime filter"?
A small "did the build side maybe contain this key?" check that the build side ships to the probe side at runtime, after the hash table is built. The probe side applies it as early as possible — ideally inside the segment scan, so non-matching rows are skipped during I/O. The check is a Bloom filter, which has zero false negatives (so the join stays correct) and a tunable false-positive rate (so it stays compact).
This is the same trick Spark, Trino, Presto, and DuckDB use. Pinot already does a related thing for SEMI joins via
PinotJoinToDynamicBroadcastRule(broadcasts an IN-list); the new feature does it for INNER hash joins via a Bloom filter, transported via a registry rather than baked into the physical plan.3. Design
3.1 Three layers + a control plane
pinot-query-planner(requestId, joinColumn).pinot-query-runtimeHashJoinOperatorbuilds the Bloom onfinishBuildingRightTable()and publishes to the registry. Leaf-scan OpChain polls the registry between row batches.pinot-coreBLOOM_MEMBERSHIPpredicate inside the segment scan viaFilterPlanNode.combineWithRuntimeBloomFilters(already landed in #197).RuntimeBloomFilterRegistrykeyed by(requestId, columnName). Producers write, consumers read. Cleaned up at query end.3.2 Key design decisions
PinotJoinToDynamicBroadcastRule. Composite keys and outer joins are out of scope for v1.BloomFilter(matchingBloomFilterIdSet). The registry shape is polymorphic so IN-list / min-max can be added later without rework (cf. StarRocks / Doris multi-form pattern in §3.4).Predicate.Type.BLOOM_MEMBERSHIP. Dedicated predicate keeps evaluator routing clean and avoids regressions in the existingINpath.QueryContextcarries runtime Bloom filters per column. Leaf-stageFilterPlanNodeAND-combines a syntheticBLOOM_MEMBERSHIPpredicate per column into the segment-scan filter tree. Queries without runtime filters allocate nothing.OpChainId = (requestId, virtualServerId, stageId)is stage-bound, everyPhysicalExchangecreates a stage boundary, and build / probe always live in different OpChains. There is no in-process channel that survives the stage boundary. The registry lives on the broker, producerspublish(requestId, column, bloom), consumersget(requestId, column).QueryContext.runtimeBloomFiltersand the next batch is filtered. Rows already in flight when the filter arrives are not retroactively filtered — this is the same trade-off Trino / StarRocks / Doris accept. We instrument the "leak" (rows emitted before the filter arrived) so a future decision to add a synchronous barrier is data-driven.Map<String, RuntimeBloomFilter>already supports this (Inject runtime filters at the leaf-stage filter boundary #197).RelMetadataQuery.getRowCount()is belowpinot.broker.mse.runtime.filter.build.cardinality.threshold(default 1M). Per-joinenable_runtime_filterhint can force on/off.3.3 Data flow
The diagram below was sketched against the original same-OpChain design and is partially superseded — the new control-plane path (registry between intermediate-stage HashJoin and leaf-stage scan) replaces the in-OpChain hand-off shown here. The textual description in §3.2 is authoritative.
3.4 Industry comparison
Surveyed along the dimensions that matter operationally:
DynamicFilterService)BroadcastExchangesynchronous broadcastGLOBAL JOINPinotJoinToDynamicBroadcastRule)BLOOM_MEMBERSHIPpredicate (PR #196 + #197)Three takeaways:
DynamicPruningSubquery,BroadcastExchangewith synchronous materialization). MSE has neither. Replicating Spark's coverage in Pinot is not a planner-rule patch — it would require building an OpChain-level scheduling primitive, on the order of quarters of work. That's a v2 decision, not a v1 default.3.5 Why async (Trino-style) not synchronous (Spark-style)
We instrument the leak (rows-emitted-before-Bloom-arrived vs. after) on Phase 7 so the v2 decision is data-driven.
4. Implementation plan
Total: 9 phases. PRs #195, #196, #197 are already up.
pinot.broker.mse.runtime.filter.enabled(default false), build-cardinality threshold,enable_runtime_filterhint.BLOOM_MEMBERSHIPpredicate primitiveBloomMembershipPredicate,RuntimeBloomFilter, per-DataTypeevaluator factory, provider dispatch.QueryContextmap + leaf injectionMap<String, RuntimeBloomFilter>onQueryContext;FilterPlanNode.combineWithRuntimeBloomFiltersAND-injectsBLOOM_MEMBERSHIPpredicates. PR description to be revised to drop theOpChainExecutionContextframing.RuntimeBloomFilterRegistry(new)(requestId, columnName). Producers callpublish(...), consumers callget(...). Cleaned up at query end.HashJoinOperatorpublish hookfinishBuildingRightTable(): iterate thePrimitiveLookupTablekeys, build aRuntimeBloomFiltersized to the build cardinality, publish to the registry. Replaces the discarded "publish viaOpChainExecutionContext" idea.QueryContext.runtimeBloomFilters. Phase 2's injection logic does the rest. Instrument rows-emitted-before-Bloom-arrived vs. after.RuntimeFilterInsertRuleplanner rulePinotJoinToDynamicBroadcastRule's shape (sibling rule), registered inPhysicalOptRuleSet.pinot-perfharness measures end-to-end latency and leak rate. Decision driver for Phase 8.4.1 Out of scope for v1 (follow-up issues to file once v1 lands)
BLOOM_MEMBERSHIP(pre-materialized matching dict-ID set)5. Testing strategy
PhysicalOptimizerPlans.jsoncases: INNER triggers, OUTER does not, build-too-large does not, hint-disabled does not, hint-forced does, snowflake (multi-producer) produces correct registry keys.pinot-integration-tests: two HashJoins in different intermediate stages both publish to the same fact-scan registry, leaf scan pulls both Blooms. Assert leaf-stage emitted row count drops vs. flag-off baseline.pinot-perfmeasures latency delta and the(rows-before-Bloom / total-rows)ratio. Target ≥ 20% latency improvement on a representative selective join; leak ratio is the v2 decision driver.6. Rollout
pinot.broker.mse.runtime.filter.enabled=false(no behavior change for existing queries).