feat(join): unify async_execute into N-way streaming MJoin [ENG-301] #101
eywalker merged 2 commits into nauticalab:dev from
Conversation
…ctness fixes for suffix collisions and system tag sorting
Pull request overview
This PR refactors Join.async_execute from a mixed strategy (2-way streaming + N>2 barrier/static_process) into a unified N-way streaming hash join (MJoin), aiming to emit matches as soon as all sides contribute matching keys and to align system-tag behavior with static_process.
Changes:
- Replace the async join execution path with an N-way streaming MJoin implementation.
- Fix/adjust system-tag suffix canonicalization for duplicate pipeline hashes and add commutativity sorting of system-tag values.
- Expand async operator tests to cover 3-way/4-way joins, early emission, empty inputs, and cross-products.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/test_channels/test_native_async_operators.py | Adds/updates async Join tests for N-way streaming behavior and sync/async equivalence. |
| src/orcapod/core/operators/join.py | Implements N-way streaming MJoin, updates system-tag suffixing and merged system-tag sorting for commutativity. |
```python
# All sides represented; compute shared keys as the
# intersection of all sides' tag key sets
all_key_sets = []
for s in range(n):
    if s == side:
        all_key_sets.append(set(tag.keys()))
    else:
        all_key_sets.append(set(buffers[s][0][0].keys()))
shared_keys = tuple(sorted(set.intersection(*all_key_sets)))
```
The async MJoin computes shared_keys as the intersection of tag keys across all inputs. This diverges from static_process, which joins each new stream on the keys it shares with the accumulated table (and otherwise falls back to the cartesian _common key). With partially-overlapping tag schemas (e.g., streams with tags {a}, {b}, {a}), static_process enforces a==a between the two {a} sides, but the current async logic will treat shared_keys as empty and produce a full cartesian product. Consider matching static_process semantics by deriving join constraints per key that appears in 2+ inputs (and treating missing keys as unconstrained), rather than requiring keys to be present in every input.
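One way to express the suggested semantics, sketched with a hypothetical helper (`constrained_keys` is not part of the codebase): a tag key becomes a join constraint when at least two sides carry it, and sides that lack the key are simply unconstrained.

```python
# Hypothetical sketch of the suggested rule: constrain on any key that
# appears in two or more inputs, instead of requiring it in every input.
from collections import Counter

def constrained_keys(tag_key_sets: list[set[str]]) -> tuple[str, ...]:
    counts = Counter(k for keys in tag_key_sets for k in keys)
    # A key constrains the join if at least two sides carry it; sides
    # missing the key are left unconstrained (no filtering applied).
    return tuple(sorted(k for k, n in counts.items() if n >= 2))

# With schemas {a}, {b}, {a}: full-intersection logic yields () and a
# cartesian product, while the 2+-sides rule keeps "a" as a constraint.
print(constrained_keys([{"a"}, {"b"}, {"a"}]))  # → ('a',)
```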
```python
n = len(inputs)
suffixes = (
    self._compute_system_tag_suffixes(input_pipeline_hashes)
    if input_pipeline_hashes is not None
    else [str(i) for i in range(n)]
)
await self._mjoin(inputs, output, suffixes)
```
async_execute accepts input_pipeline_hashes, but there’s no validation that its length matches len(inputs). If it’s shorter/longer, _compute_system_tag_suffixes (and later _merge_rows via suffixes[i]) can raise an IndexError or silently mis-suffix system tags. Consider raising a clear InputValidationError/ValueError when input_pipeline_hashes is not None and len(input_pipeline_hashes) != len(inputs).
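A minimal sketch of the suggested guard (the function name and error wording here are assumptions, not the project's API):

```python
# Hypothetical fail-fast check for async_execute: a mismatched hash list
# should raise a clear error rather than a later IndexError or silently
# mis-suffixed system tags.
def validate_pipeline_hashes(inputs, input_pipeline_hashes) -> None:
    if input_pipeline_hashes is None:
        return  # suffixes fall back to positional indices
    if len(input_pipeline_hashes) != len(inputs):
        raise ValueError(
            f"expected {len(inputs)} pipeline hashes, "
            f"got {len(input_pipeline_hashes)}"
        )
```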
```python
async def _emit_buffered_matches(
    self,
    buffers: list[list[tuple[TagProtocol, PacketProtocol]]],
    indexes: list[dict[tuple, list[int]]],
    shared_keys: tuple[str, ...],
    suffixes: list[str],
    block_sep: str,
    output: WritableChannel[tuple[TagProtocol, PacketProtocol]],
) -> None:
```
_emit_buffered_matches takes shared_keys but doesn’t use it. Dropping the unused parameter (and updating the call site) would reduce cognitive overhead and avoid implying that the function depends on it.
```python
@pytest.mark.asyncio
async def test_matches_sync_three_way(self):
    """Three-way MJoin must produce the same data as sync static_process."""
    t1 = pa.table(
        {"id": pa.array([1, 2, 3], type=pa.int64()), "a": pa.array([10, 20, 30], type=pa.int64())}
    )
    t2 = pa.table(
        {"id": pa.array([1, 2, 3], type=pa.int64()), "b": pa.array([100, 200, 300], type=pa.int64())}
    )
    t3 = pa.table(
        {"id": pa.array([1, 2, 3], type=pa.int64()), "c": pa.array([1000, 2000, 3000], type=pa.int64())}
    )
```
The new async MJoin tests cover cases where all inputs share the same join tag(s) (e.g. id) and the fully-disjoint cartesian case, but there’s no test for partially overlapping tag schemas (some sides share a tag key, others don’t). Join.static_process currently constrains equality on keys shared by a subset of streams (due to the iterative join), so adding an async-vs-sync regression test for a case like tags {a}, {b}, {a} would help ensure async semantics stay aligned.
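To make the expected semantics concrete, here is a minimal model (plain dicts, not the project's API) of `static_process`-style iterative joining, showing how the {a}, {b}, {a} case constrains `a == a` while the {b} side contributes a cross product:

```python
# Toy model of iterative pairwise joining: each new side joins on the keys
# it shares with the accumulated rows, falling back to a cross product
# when no keys are shared. Not the orcapod implementation.
def iterative_join(sides: list[list[dict]]) -> list[dict]:
    acc = [dict(r) for r in sides[0]]
    for side in sides[1:]:
        shared = set(acc[0]) & set(side[0]) if acc and side else set()
        acc = [
            {**l, **r}
            for l in acc
            for r in side
            if all(l[k] == r[k] for k in shared)
        ]
    return acc

# Tags {a}, {b}, {a}: the two {a} sides are constrained to match,
# so only a=1 survives; the {b} side is a cross product.
rows = iterative_join([
    [{"a": 1}, {"a": 2}],
    [{"b": 9}],
    [{"a": 1}],
])
print(rows)  # → [{'a': 1, 'b': 9}]
```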
…for correct partially-overlapping tag semantics

Changes introduced in commit "fix(join): replace N-way MJoin with staggered pairwise binary joins for correct partially-overlapping tag semantics":
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| ) | ||
| await output.send(merged) | ||
| renamed[new_key] = v | ||
| tag = Tag(tag.as_dict(), system_tags=renamed) |
_rename_sys_tags reconstructs Tag via Tag(tag.as_dict(), system_tags=renamed). Because Tag.as_dict() defaults to excluding meta/context, this drops any meta columns (and context key) carried by the incoming Tag. This diverges from the sync join path, which preserves meta columns via table joins, and from other streaming operators that preserve datagram metadata. Consider building the new Tag from tag.as_dict(columns={"meta": True, "context": True}) (excluding system tags) or copying the existing Tag and only rewriting its system-tag dict.
```diff
-tag = Tag(tag.as_dict(), system_tags=renamed)
+tag = Tag(
+    tag.as_dict(columns={"meta": True, "context": True}),
+    system_tags=renamed,
+)
```
Re: _rename_sys_tags drops meta/context columns (line 540)
Investigated — this is a false positive. Tag and Packet objects in the async flow are produced by iter_packets(), which constructs them from Arrow record batches that have already had prefixed columns (meta `__`, source `_source_`, etc.) separated out by prepare_prefixed_columns. So tag.as_dict() doesn't drop meta columns because the Tag never carried them in the first place.
The sync path preserves meta through Arrow table joins because as_table(columns={"meta": True}) re-materializes them from the separated prefix tables — but this happens at the table/stream layer, not at the Tag/Packet dict layer. Both paths are consistent: meta columns are a table-level concern, not a datagram-level one.
```python
# Merge packet dicts (non-overlapping by Join's validation)
merged_pkt_d: dict = {}
merged_si: dict = {}
merged_pkt_d.update(left_pkt.as_dict())
merged_pkt_d.update(right_pkt.as_dict())
merged_si.update(left_pkt.source_info())
merged_si.update(right_pkt.source_info())

merged_pkt = Packet(merged_pkt_d, source_info=merged_si)
return merged_tag, merged_pkt
```
In _merge_pair_rename / _merge_pair_passthrough, the merged Packet is constructed from left_pkt.as_dict() + right_pkt.as_dict(). Since Packet.as_dict() excludes meta columns by default, any meta info present on either input Packet is silently dropped in the async join output. The sync static_process joins Arrow tables that include meta columns, so this breaks parity between sync and async for callers that rely on packet meta. Consider explicitly merging/preserving get_meta_info() (and any other requested info) when constructing the merged Packet (and likewise for Tags if needed), handling key collisions consistently with static_process's column-renaming behavior.
Re: _merge_pair_* drops Packet meta (line 594)
Same finding as the _rename_sys_tags comment — this is a false positive. Packet objects in the async flow never carry meta columns. They're constructed by iter_packets() from Arrow record batches where prepare_prefixed_columns has already separated meta columns into a distinct prefix table. pkt.as_dict() returns exactly the data columns — nothing is lost.
```python
# Output should be empty — side 3 hasn't contributed yet
with pytest.raises(asyncio.TimeoutError):
    await asyncio.wait_for(out.reader.receive(), timeout=0.05)
```
This test uses a very small negative-assertion timeout (0.05s) to prove the join doesn't emit before the third side arrives. On slower CI runners or under load, this can be flaky (false failures if the scheduler delays). Consider increasing the timeout and/or using a more deterministic synchronization mechanism (e.g., wait until the join task has consumed the first two sends before asserting no output).
```diff
-await asyncio.wait_for(out.reader.receive(), timeout=0.05)
+await asyncio.wait_for(out.reader.receive(), timeout=0.5)
```
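For the fully deterministic variant, one option is to wait until the join task has actually consumed the sends before making the negative assertion. A sketch using plain asyncio primitives (not the project's channel API):

```python
# Sketch: Queue.join() blocks until every put() has been matched by a
# task_done(), so the "no output yet" assertion cannot race the scheduler.
import asyncio

async def demo() -> bool:
    inputs: asyncio.Queue = asyncio.Queue()
    out: asyncio.Queue = asyncio.Queue()

    async def join_task():
        while True:
            await inputs.get()
            inputs.task_done()  # consumed, but nothing emitted yet

    task = asyncio.create_task(join_task())
    await inputs.put(("tag1", "packet1"))  # stand-ins for two sides' rows
    await inputs.put(("tag2", "packet2"))
    await inputs.join()        # deterministic: both sends were consumed
    is_empty = out.empty()     # safe negative assertion, no timeout race
    task.cancel()
    return is_empty

print(asyncio.run(demo()))  # → True
```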
Fixed — increased timeout from 0.05s to 0.5s.
```python
await ch1.writer.send((Tag({"id": 2}), Packet({"a": 20})))

# Give the event loop a chance to process
await asyncio.sleep(0.01)
```
await asyncio.sleep(0.01) is used to “give the event loop a chance to process”. Fixed sleeps tend to be flaky and slow tests down unnecessarily. Prefer a deterministic wait (e.g., awaiting a specific output/event, or yielding with await asyncio.sleep(0) if you only need to allow task switching).
```diff
-await asyncio.sleep(0.01)
+await asyncio.sleep(0)
```
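Why `asyncio.sleep(0)` suffices: it yields control to the event loop exactly once, letting any already-scheduled task run, with no wall-clock delay. A self-contained sketch:

```python
# Demonstrates that awaiting asyncio.sleep(0) lets a pending task execute
# before the current coroutine resumes.
import asyncio

async def demo() -> list[str]:
    order: list[str] = []

    async def worker():
        order.append("worker ran")

    asyncio.create_task(worker())  # scheduled but not yet started
    order.append("before yield")
    await asyncio.sleep(0)         # yield once; the worker runs here
    order.append("after yield")
    return order

print(asyncio.run(demo()))  # → ['before yield', 'worker ran', 'after yield']
```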
Fixed — changed to await asyncio.sleep(0) for a deterministic event loop yield.
```diff
 class TestJoinNativeAsync:
-    """Tests for Join.async_execute (symmetric hash join + N>2 barrier)."""
+    """Tests for Join.async_execute (N-way streaming MJoin)."""
```
The module/class docstrings refer to a “streaming N-way MJoin” implementation, but Join.async_execute is documented/implemented as a staggered chain of pairwise binary joins (join(join(x, y), z)). This mismatch makes the behavior/algorithm harder to understand when maintaining the tests. Consider renaming the test/docstring wording to match the implemented semantics (or updating the implementation if true N-way MJoin was intended).
Fixed — updated all MJoin references in module docstring, class docstring, and individual test docstrings to "staggered pairwise streaming join".
```diff
 """Async streaming join with pairwise iterative semantics.

 Single input: streams through directly without any buffering.

-Two inputs: symmetric hash join — each arriving row is
-immediately probed against the opposite side's buffer, emitting
-matches as soon as found. System-tag columns are correctly
-renamed using the ``input_pipeline_hashes``.
+Two inputs: binary symmetric hash join — each arriving row is
+probed against the opposite side's buffer, emitting matches as
+soon as found.

-Three or more inputs: collects all inputs concurrently, then
-delegates to ``static_process`` for the Polars N-way join.
+Three or more inputs: staggered pairwise binary joins in
+canonical order — ``join(join(x, y), z)`` — matching
+``static_process``'s iterative accumulation. Each binary join
+uses the per-pair intersection of tag keys, so partially
+overlapping tag schemas are handled correctly.
```
PR description/title and the test plan describe a unified N-way “MJoin” that indexes/probes across all sides, but the current async implementation is a staggered chain of pairwise joins. If the intent is the staggered iterative semantics (matching static_process), consider adjusting the PR description and/or naming to avoid implying a different algorithm/behavior.
Fixed — the async_execute docstring was already updated to describe staggered pairwise semantics in this commit. The PR title/description will be updated to match before merge.
The Join operator's async_execute was replaced from a split implementation (2-input streaming / 3+ blocking
barrier) with a unified N-way streaming hash join (MJoin algorithm). Each arriving row is immediately indexed
and probed against all other sides, emitting matches as soon as all N sides have a matching key — so downstream
can start work before any input is fully consumed. The merge logic was generalized from pair-wise to N-way, and
three correctness issues were fixed: duplicate pipeline hashes now get distinct canonical positions in system
tag suffixes, system tag values are sorted for commutativity (matching static_process), and a stale docstring
was corrected. Tests cover 2/3/4-way joins, early emission, partial matches, empty inputs, cartesian products,
and multi-row cross-products.
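The symmetric hash join idea the description refers to can be sketched in a few lines. This is an illustrative toy (plain dicts and a generator, not the orcapod implementation): each arriving row is indexed on its side, then probed against the opposite side's buffer, so a match is emitted as soon as both sides have contributed the key.

```python
# Toy 2-way symmetric hash join: index each arriving row, then probe the
# opposite side's buffer and emit matches immediately.
from collections import defaultdict

def symmetric_hash_join(events):
    """events: iterable of (side, key, row) with side in {0, 1}."""
    buffers = [defaultdict(list), defaultdict(list)]
    for side, key, row in events:
        buffers[side][key].append(row)        # index the new row
        for other in buffers[1 - side][key]:  # probe the opposite side
            left, right = (row, other) if side == 0 else (other, row)
            yield key, {**left, **right}

matches = list(symmetric_hash_join([
    (0, 1, {"a": 10}),
    (1, 2, {"b": 200}),
    (1, 1, {"b": 100}),  # emits immediately: side 0 already has key 1
]))
print(matches)  # → [(1, {'a': 10, 'b': 100})]
```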