Conversation

@lukemartinlogan
Contributor

No description provided.

lukemartinlogan and others added 19 commits February 8, 2026 01:19
Created a simplified, GPU-specific version of MakeCopyFuture that works
correctly on the GPU and allows CPU deserialization from FutureShm.

Key changes:
- Added MakeCopyFutureGpu() in ipc_manager.h (GPU-only function)
- Made Future constructors GPU-compatible with HSHM_CROSS_FUN
- Use __threadfence() for GPU memory fencing
- Fixed UniqueId operators to be GPU-compatible
- Test validates: GPU NewTask → MakeCopyFutureGpu → CPU deserialize

The function mirrors the pattern from passing serialization tests,
using task->SerializeIn(archive) directly for reliable GPU execution.

Test results: 100% pass rate on GPU IPC buffer allocation tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implemented GPU kernel task creation and serialization using MakeCopyFutureGpu.

Changes:
- GPU kernel now creates tasks using NewTask on GPU
- Uses MakeCopyFutureGpu to serialize tasks for future processing
- Added error diagnostic for MakeCopyFutureGpu failures (-14)
- All GPU submission tests now pass (100% success rate)

Test flow:
1. GPU kernel initializes with CHIMAERA_GPU_INIT
2. Creates task with NewTask<GpuSubmitTask>
3. Serializes with MakeCopyFutureGpu
4. Returns success (result == 1)

Test results: 4/4 tests passing (gpu_init, cpu_submission,
multiple_executions, kernel_task_submission)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Demonstrates that vkWaitSemaphores efficiently sleeps a CPU thread
(~0ms CPU time over ~5s wall-clock) instead of busy-polling, validating
it as a GPU→CPU notification primitive for the ring buffer architecture.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
GPU device code cannot signal a CPU thread to wake — all Vulkan/CUDA
semaphore signaling is stream-ordered and fires only after a kernel
completes, making the approach unsuitable for persistent GPU kernels.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Designate one worker (N-2) as the GPU worker that polls GPU lanes,
while regular workers no longer receive GPU lane assignments. Refactor
ProcessNewTasks to accept a TaskLane* parameter and extract per-task
logic into ProcessNewTask. The GPU worker forwards dequeued tasks to
scheduler workers via round-robin in RuntimeMapTask. GPU workers never
sleep to ensure continuous polling. Also remove GpuTaskQueue alias in
favor of TaskQueue.
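
The round-robin hand-off described above can be sketched roughly as follows. This is an illustration only, not the project's code: `SketchTask`, `SketchWorker`, and `SketchGpuForwarder` are hypothetical stand-ins, and the real `RuntimeMapTask` signature is not shown in this PR text. The sketch assumes at least one scheduler worker exists.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a task dequeued from a GPU lane.
struct SketchTask {
  int id;
};

// Hypothetical scheduler worker that only records what it was handed.
struct SketchWorker {
  std::vector<int> received;
  void Enqueue(const SketchTask &task) { received.push_back(task.id); }
};

// Models a GPU worker that polls and forwards but never executes tasks.
class SketchGpuForwarder {
 public:
  // Assumes `workers` is non-empty and outlives the forwarder.
  explicit SketchGpuForwarder(std::vector<SketchWorker> &workers)
      : workers_(workers), next_(0) {}

  // Hands the task to the next scheduler worker, wrapping around at the end.
  // Returns the index of the worker that received the task.
  std::size_t Forward(const SketchTask &task) {
    std::size_t idx = next_;
    workers_[idx].Enqueue(task);
    next_ = (next_ + 1) % workers_.size();
    return idx;
  }

 private:
  std::vector<SketchWorker> &workers_;
  std::size_t next_;
};
```

With three workers and seven forwarded tasks, the first worker ends up with three tasks and the other two with two each, so no scheduler worker is favored beyond the remainder.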

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace raw ZeroMQ calls with lightbeam PUSH/PULL transport for client
task submission (TCP/IPC modes). Add inline bulk data serialization to
LocalSaveTaskArchive/LocalLoadTaskArchive so TCP/IPC transport can
transfer actual data bytes instead of ShmPtr addresses. Add real bdev
task round-trip tests (Create, AllocateBlocks, Write+Read) to all
transport mode tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove inline_bulk_ flag from LocalSaveTaskArchive. Instead, bulk()
checks whether the ShmPtr's alloc_id_ is null to decide if data must
be inlined (private memory) or if the ShmPtr itself suffices (shared
memory).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lukemartinlogan lukemartinlogan linked an issue Feb 10, 2026 that may be closed by this pull request
lukemartinlogan and others added 10 commits February 10, 2026 18:41
Replace LocalSaveTaskArchive/LocalLoadTaskArchive with SaveTaskArchive/
LoadTaskArchive in SendZmq, ClientRecv, ClientSend, RecvZmqClientThread,
and Recv. This eliminates the manual wire protocol and uses lightbeam's
multi-frame bulk transfer (2 copies: ZMQ send + recv) instead of inlining
bulk data into the serialized stream (4-5 copies).

Also add ContinueBlockedTasks(true) after epoll_wait in SuspendMe() so
periodic tasks like ClientRecv/Send execute immediately on wake.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add ShmClient/ShmServer that transfer data through a shared copy_space
buffer with atomic flag synchronization, eliminating kernel crossings
for same-node IPC. Bulks with non-null alloc_id skip the data copy and
pass only the ShmPtr; the receiver sets ptr_ to nullptr for the caller
to resolve.
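
A minimal sketch of the copy-space idea, assuming a single fixed-size slot guarded by one atomic flag. All names here (`SketchCopySpace`, `SketchTrySend`, `SketchTryRecv`, `kSketchCapacity`) are hypothetical, and the actual ShmClient/ShmServer layout may differ. In a real deployment the struct would live in a shared-memory segment mapped by both processes; here it is an ordinary object so the example stays self-contained.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical slot size; the real copy_space size is not stated in the PR.
constexpr std::size_t kSketchCapacity = 64;

// One-slot staging buffer shared by a sender and a receiver.
struct SketchCopySpace {
  std::atomic<bool> full{false};  // true while a chunk awaits the receiver
  std::size_t size = 0;           // number of valid bytes in data
  char data[kSketchCapacity];
};

// Copies up to one slot's worth of bytes into the copy space.
// Returns the number of bytes staged, or 0 if the slot is still occupied.
inline std::size_t SketchTrySend(SketchCopySpace &cs, const char *src,
                                 std::size_t len) {
  if (len == 0) return 0;
  if (cs.full.load(std::memory_order_acquire)) return 0;
  std::size_t n = std::min(kSketchCapacity, len);
  std::memcpy(cs.data, src, n);
  cs.size = n;
  // Release store publishes data and size before the flag becomes visible.
  cs.full.store(true, std::memory_order_release);
  return n;
}

// Appends the pending chunk to `out` and frees the slot.
// Returns false if no chunk is pending.
inline bool SketchTryRecv(SketchCopySpace &cs, std::string &out) {
  if (!cs.full.load(std::memory_order_acquire)) return false;
  out.append(cs.data, cs.size);
  // Release store hands the slot back to the sender after the read is done.
  cs.full.store(false, std::memory_order_release);
  return true;
}
```

Both sides only spin on a flag in user space, which is what avoids a system call per message. A payload larger than the slot is moved by calling `SketchTrySend` repeatedly with an advancing offset while the receiver drains each chunk.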

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
lukemartinlogan and others added 7 commits February 12, 2026 19:20
Fix LoadTaskArchive::bulk() to use ptr.IsNull() instead of
ptr.alloc_id_.IsNull() when checking for caller-provided buffers.
MallocAllocator uses null alloc_id_ for all allocations, so the old
check always took the zero-copy path, causing read data to never
reach the caller's buffer over TCP.
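
The difference between the two checks can be illustrated with the sketch below. It is a guess at the shape of the failure, not the project's implementation: `SketchAllocId`, `SketchShmPtr`, `SketchBulkOld`, and `SketchBulkFixed` are hypothetical, and the real ShmPtr and LoadTaskArchive types are not shown in this PR text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical allocator id; a negative value stands for "null".
struct SketchAllocId {
  int value = -1;
  bool IsNull() const { return value < 0; }
};

// Hypothetical pointer that carries an allocator id next to a raw address.
struct SketchShmPtr {
  SketchAllocId alloc_id_;
  char *ptr_ = nullptr;
  bool IsNull() const { return ptr_ == nullptr; }
};

enum class SketchPath { kCopiedIntoCallerBuffer, kZeroCopy };

// Old check: a null alloc_id_ is treated as "no caller buffer". A buffer that
// came from a malloc-style allocator also has a null alloc_id_, so this
// branch is taken and the caller's buffer is never written.
inline SketchPath SketchBulkOld(SketchShmPtr &ptr, char *wire,
                                std::size_t len) {
  if (ptr.alloc_id_.IsNull()) {
    ptr.ptr_ = wire;  // adopt the received buffer instead of copying
    return SketchPath::kZeroCopy;
  }
  std::memcpy(ptr.ptr_, wire, len);
  return SketchPath::kCopiedIntoCallerBuffer;
}

// Fixed check: only a null pointer means that no buffer was provided.
inline SketchPath SketchBulkFixed(SketchShmPtr &ptr, char *wire,
                                  std::size_t len) {
  if (ptr.IsNull()) {
    ptr.ptr_ = wire;
    return SketchPath::kZeroCopy;
  }
  std::memcpy(ptr.ptr_, wire, len);
  return SketchPath::kCopiedIntoCallerBuffer;
}
```

With a caller-provided buffer whose `alloc_id_` is null, the old check leaves the buffer untouched while the fixed check copies the received bytes into it.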

Split bdev_file_explicit_backend test into three per-mode variants
(SHM, TCP, IPC) that each run as separate processes. Update
docker-compose to only start the runtime on node1, with run_tests.sh
driving test execution via docker exec.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The CTE run_tests.sh unconditionally overwrote IOWARP_CORE_ROOT with
the devcontainer-internal path (/workspace), but Docker volume mounts
need the host path. Respect the existing IOWARP_CORE_ROOT set by the
devcontainer (matching the bdev test pattern).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lukemartinlogan lukemartinlogan marked this pull request as ready for review February 12, 2026 23:16
lukemartinlogan and others added 2 commits February 12, 2026 23:34
- Add local_sched.h and local_sched.cc that were missing from git
  (scheduler_factory.cc includes local_sched.h)
- Add #include <algorithm> to shm_transport.h for std::min with
  initializer_list

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lukemartinlogan lukemartinlogan merged commit 790067c into main Feb 13, 2026
2 checks passed
Development

Successfully merging this pull request may close these issues.

Fix context transport primitives for the GPU

1 participant