# Fluster Demo Notebook

Interactive exploration of the fluster cluster system.

This notebook connects to a running fluster cluster and demonstrates how to:
- Submit jobs to the cluster
- Wait for job completion
- Submit jobs with arguments
- Submit several jobs and wait for each
- Poll job state with `client.status`
- Run stateful actors and call methods on them
- Use WorkerPool for parallel task execution

## Setup and Connection

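Connect to the demo cluster using the `FLUSTER_CONTROLLER_ADDRESS` and `FLUSTER_WORKSPACE`
environment variables set by `demo_cluster.py`. The address falls back to
`http://127.0.0.1:8080` when unset; a missing workspace raises an error.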

In [1]:
import os
from pathlib import Path

from fluster.client import FlusterClient
from fluster.cluster.types import Entrypoint
from fluster.rpc import cluster_pb2

# Connect to the demo cluster
# FLUSTER_CONTROLLER_ADDRESS and FLUSTER_WORKSPACE are set by demo_cluster.py
controller_address = os.environ.get("FLUSTER_CONTROLLER_ADDRESS", "http://127.0.0.1:8080")
workspace_str = os.environ.get("FLUSTER_WORKSPACE")
if workspace_str is None:
    raise RuntimeError(
        "FLUSTER_WORKSPACE not set. Run this notebook via: "
        "uv run python examples/demo_cluster.py"
    )
workspace = Path(workspace_str)

client = FlusterClient.remote(controller_address, workspace=workspace)
print(f"Connected to cluster at {controller_address}")

Connected to cluster at http://127.0.0.1:60670
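
The environment lookup above can be pulled into a small function so it can be checked without a running cluster. This is a minimal sketch: `read_cluster_env` is a hypothetical helper and not part of fluster. It only reproduces the fallback and error behavior of the cell above.

```python
import os
from pathlib import Path
from typing import Mapping


def read_cluster_env(environ: Mapping[str, str] = os.environ) -> tuple[str, Path]:
    """Return (controller_address, workspace) from an environment mapping.

    Hypothetical helper for illustration; not a fluster API.
    """
    # Same default address as the setup cell.
    address = environ.get("FLUSTER_CONTROLLER_ADDRESS", "http://127.0.0.1:8080")
    workspace = environ.get("FLUSTER_WORKSPACE")
    if workspace is None:
        # The workspace has no sensible default, so fail loudly.
        raise RuntimeError("FLUSTER_WORKSPACE not set")
    return address, Path(workspace)
```

Passing a plain dict instead of `os.environ` makes the behavior easy to test, for example `read_cluster_env({"FLUSTER_WORKSPACE": "/tmp/demo"})`.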


## Submit a Simple Job

Submit a basic job that prints a message and returns a value.

In [2]:
def hello_world():
    print("Hello from the cluster!")
    return 42

job_id = client.submit(
    entrypoint=Entrypoint.from_callable(hello_world),
    name="notebook-hello",
    resources=cluster_pb2.ResourceSpec(cpu=1, memory="512m"),
)
print(f"Submitted job: {job_id}")

Submitted job: notebook-hello


## Wait and Check Status

Block until the job completes (up to 30 seconds, with `stream_logs=True`), then print its final state.

In [3]:
status = client.wait(job_id, timeout=30.0, stream_logs=True)
print(f"Job {job_id}: {cluster_pb2.JobState.Name(status.state)}")

Job notebook-hello: JOB_STATE_SUCCEEDED


## Submit a Job with Arguments

Submit a job that takes arguments and performs a computation. The arguments are passed
positionally to `Entrypoint.from_callable` after the function itself.

In [4]:
def compute(a: int, b: int) -> int:
    result = a * b
    print(f"Computing {a} * {b} = {result}")
    return result

job_id = client.submit(
    entrypoint=Entrypoint.from_callable(compute, 7, 6),
    name="multiply-job",
    resources=cluster_pb2.ResourceSpec(cpu=1, memory="512m"),
)
status = client.wait(job_id, stream_logs=True)
# Prints the job's final state, not the value returned by compute()
print(f"Result: {cluster_pb2.JobState.Name(status.state)}")

Result: JOB_STATE_SUCCEEDED


## Submit Multiple Jobs

Submit five jobs in a loop, then wait for each one in submission order.

In [5]:
def square(n: int) -> int:
    result = n * n
    print(f"{n}^2 = {result}")
    return result

# Submit 5 jobs
job_ids = []
for i in range(1, 6):
    job_id = client.submit(
        entrypoint=Entrypoint.from_callable(square, i),
        name=f"square-{i}",
        resources=cluster_pb2.ResourceSpec(cpu=1, memory="512m"),
    )
    job_ids.append(job_id)
    print(f"Submitted: {job_id}")

# Wait for all
print("\nWaiting for jobs...")
for job_id in job_ids:
    status = client.wait(job_id, stream_logs=True)
    print(f"{job_id}: {cluster_pb2.JobState.Name(status.state)}")

Submitted: square-1
Submitted: square-2
Submitted: square-3
Submitted: square-4
Submitted: square-5

Waiting for jobs...
square-1: JOB_STATE_SUCCEEDED
square-2: JOB_STATE_SUCCEEDED
square-3: JOB_STATE_SUCCEEDED
square-4: JOB_STATE_SUCCEEDED
square-5: JOB_STATE_SUCCEEDED
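
When many jobs are in flight, it can help to collect the final state names and report only the ones that did not succeed. The sketch below works on plain strings, so it runs without a cluster. `failed_jobs` and `TERMINAL_FAILURES` are hypothetical names, and the set only covers the failure states that appear in this notebook.

```python
# State names as printed by cluster_pb2.JobState.Name(...) in this notebook.
TERMINAL_FAILURES = {"JOB_STATE_FAILED", "JOB_STATE_KILLED"}


def failed_jobs(states: dict[str, str]) -> list[str]:
    """Return the job ids whose final state is a failure state.

    Hypothetical helper for illustration; not a fluster API.
    """
    return [job_id for job_id, state in states.items() if state in TERMINAL_FAILURES]


# Example input mirroring the output of the cell above.
states = {f"square-{i}": "JOB_STATE_SUCCEEDED" for i in range(1, 6)}
assert failed_jobs(states) == []
```

In the notebook, the mapping could be built inside the wait loop from `cluster_pb2.JobState.Name(status.state)` for each `job_id`.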


## Remote Actor Demo

Demonstrate running a stateful actor as a remote job. The actor maintains state
across method calls, enabling persistent services within the cluster.

This example:
1. Submits a job that starts an ActorServer with a Counter actor
2. Discovers the actor endpoint via the controller
3. Calls methods and verifies state is maintained across calls

In [6]:
class Counter:
    """A simple stateful actor that maintains a count."""

    def __init__(self):
        self._count = 0

    def increment(self, amount: int = 1) -> int:
        """Increment the counter and return the new value."""
        self._count += amount
        return self._count

    def get_count(self) -> int:
        """Return the current count."""
        return self._count

    def reset(self) -> int:
        """Reset the counter to zero and return the old value."""
        old = self._count
        self._count = 0
        return old

print("Counter actor class defined (for illustration)")

Counter actor class defined (for illustration)
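
Before running the actor remotely, the same `Counter` logic can be exercised as an ordinary local object. This sketch repeats the class so it is self-contained, and it walks through the same sequence of calls that the remote test performs later in this notebook.

```python
class Counter:
    """A simple stateful object that maintains a count."""

    def __init__(self):
        self._count = 0

    def increment(self, amount: int = 1) -> int:
        """Increment the counter and return the new value."""
        self._count += amount
        return self._count

    def get_count(self) -> int:
        """Return the current count."""
        return self._count

    def reset(self) -> int:
        """Reset the counter to zero and return the old value."""
        old = self._count
        self._count = 0
        return old


# Local check: state persists across calls on the same instance.
counter = Counter()
assert counter.get_count() == 0
for expected in (1, 2, 3):
    assert counter.increment() == expected
assert counter.increment(10) == 13
assert counter.reset() == 13
assert counter.get_count() == 0
```

If these assertions hold locally but fail against the remote actor, the problem is more likely in registration or transport than in the class itself.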


In [7]:
import time

from fluster.actor import ActorServer
from fluster.client import fluster_ctx


def run_counter_actor():
    """Job entrypoint that starts a Counter actor server.

    The actor server:
    1. Binds to the allocated port (from context)
    2. Registers its endpoint with the controller for discovery
    3. Serves requests until the job is terminated
    """
    # Define Counter inside the function so it gets pickled with the entrypoint
    # (required for Docker mode where the class isn't available in the container)
    class Counter:
        """A simple stateful actor that maintains a count."""

        def __init__(self):
            self._count = 0

        def increment(self, amount: int = 1) -> int:
            """Increment the counter and return the new value."""
            self._count += amount
            return self._count

        def get_count(self) -> int:
            """Return the current count."""
            return self._count

        def reset(self) -> int:
            """Reset the counter to zero and return the old value."""
            old = self._count
            self._count = 0
            return old
    
    ctx = fluster_ctx()
    print(f"Starting counter actor for job {ctx.job_id}")

    # Get the allocated port from context (works in both Docker and local modes)
    port = ctx.get_port("actor")

    # Bind to all interfaces - works in both Docker and local modes
    bind_host = "0.0.0.0"
    
    # Create and register the actor
    server = ActorServer(host=bind_host, port=port)
    server.register("counter", Counter())

    # Start the server (uses port from __init__)
    actual_port = server.serve_background()
    print(f"Actor server started on {bind_host}:{actual_port}")

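    # Register endpoint with controller for discovery via context registry
    # The registry handles namespace prefixing automatically
    # Note: the advertised address is 127.0.0.1, so callers must be able to
    # reach this port over loopback (i.e. from the same host)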
    if ctx.registry:
        endpoint_id = ctx.registry.register("counter", f"127.0.0.1:{actual_port}")
        print(f"Registered endpoint: counter -> 127.0.0.1:{actual_port} (id={endpoint_id})")
    else:
        print("WARNING: No registry in context, endpoint not registered")

    # Keep running to serve requests
    # The job will be terminated externally when no longer needed
    print("Actor ready, serving requests...")
    while True:
        time.sleep(1)


print("Actor job function defined")

Actor job function defined


In [8]:
# Submit the actor job
# Request ports=["actor"] so the worker allocates a port for the actor server
actor_job_id = client.submit(
    entrypoint=Entrypoint.from_callable(run_counter_actor),
    name="counter-actor",
    resources=cluster_pb2.ResourceSpec(cpu=1, memory="512m"),
    ports=["actor"],
)
print(f"Submitted actor job: {actor_job_id}")

Submitted actor job: counter-actor


In [9]:
from fluster.time_utils import ExponentialBackoff

# Wait for the actor job to start running
# Unlike regular jobs, we don't wait for completion - we wait for RUNNING state
print("Waiting for actor to start...")

_job_status = None

def job_is_running_or_failed() -> bool:
    global _job_status
    _job_status = client.status(actor_job_id)
    return _job_status.state in (
        cluster_pb2.JOB_STATE_RUNNING,
        cluster_pb2.JOB_STATE_FAILED,
        cluster_pb2.JOB_STATE_KILLED,
    )

ExponentialBackoff(initial=0.1, maximum=1.0).wait_until_or_raise(
    job_is_running_or_failed,
    timeout=15.0,
    error_message="Actor job did not start in time",
)

if _job_status.state != cluster_pb2.JOB_STATE_RUNNING:
    raise RuntimeError(f"Actor job failed: {cluster_pb2.JobState.Name(_job_status.state)}")

print("Actor job is running")

Waiting for actor to start...
Actor job is running
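
The polling pattern used above can be sketched in plain Python. This is not fluster's implementation of `ExponentialBackoff`. The function below is a hypothetical stand-in with an assumed growth factor of 2 and an assumed `TimeoutError` on expiry, and it only shows the general idea of polling a condition with growing, capped delays.

```python
import time
from typing import Callable


def wait_until_or_raise(
    condition: Callable[[], bool],
    timeout: float,
    error_message: str,
    initial: float = 0.1,
    maximum: float = 1.0,
    factor: float = 2.0,  # assumed growth factor, not taken from fluster
) -> None:
    """Poll condition() with exponentially growing sleeps until it is true.

    Raises TimeoutError (an assumption) if the timeout elapses first.
    """
    deadline = time.monotonic() + timeout
    delay = initial
    while not condition():
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(error_message)
        # Never sleep past the deadline.
        time.sleep(min(delay, remaining))
        # Grow the delay, but cap it at `maximum`.
        delay = min(delay * factor, maximum)
```

Using `time.monotonic()` rather than `time.time()` keeps the deadline unaffected by system clock adjustments.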


In [10]:
from fluster.actor import ActorClient
from fluster.client import FlusterContext, fluster_ctx_scope, ClusterResolver
from fluster.time_utils import ExponentialBackoff

# Enter the actor job's namespace context
# This enables ClusterResolver to discover endpoints in that namespace
ctx = FlusterContext(job_id=actor_job_id, client=client)

with fluster_ctx_scope(ctx):
    resolver = ClusterResolver(controller_address)
    
    # Wait for the actor to register its endpoint
    print("Waiting for actor endpoint to be registered...")
    
    ExponentialBackoff(initial=0.1, maximum=1.0).wait_until_or_raise(
        lambda: not resolver.resolve("counter").is_empty,
        timeout=15.0,
        error_message="Actor endpoint not registered in time",
    )
    
    # Log the resolved endpoint
    resolved = resolver.resolve("counter")
    print(f"Resolved endpoint: {resolved.endpoints}")
    
    # Create actor client using ClusterResolver
    counter = ActorClient(resolver, "counter", timeout=10.0)
    print("Actor client created")
    
    # Test the actor: verify state is maintained across calls
    print("\nTesting actor state persistence...")
    
    # Initial count should be 0
    initial = counter.get_count()
    print(f"Initial count: {initial}")
    assert initial == 0, f"Expected 0, got {initial}"
    
    # Increment 3 times
    for i in range(1, 4):
        result = counter.increment()
        print(f"After increment {i}: {result}")
        assert result == i, f"Expected {i}, got {result}"
    
    # Verify final count
    final = counter.get_count()
    print(f"Final count: {final}")
    assert final == 3, f"Expected 3, got {final}"
    
    # Test increment with custom amount
    result = counter.increment(10)
    print(f"After increment(10): {result}")
    assert result == 13, f"Expected 13, got {result}"
    
    # Test reset
    old_value = counter.reset()
    print(f"Reset returned: {old_value}")
    assert old_value == 13, f"Expected 13, got {old_value}"
    
    # Verify count is now 0
    after_reset = counter.get_count()
    print(f"After reset: {after_reset}")
    assert after_reset == 0, f"Expected 0, got {after_reset}"
    
    print("\nAll actor tests passed! State is correctly maintained across calls.")

Waiting for actor endpoint to be registered...
Resolved endpoint: [ResolvedEndpoint(url='http://127.0.0.1:30001', actor_id='030ec174-3d9e-4bbe-b666-321127b795ee', metadata={})]
Actor client created

Testing actor state persistence...
Initial count: 0
After increment 1: 1
After increment 2: 2
After increment 3: 3
Final count: 3
After increment(10): 13
Reset returned: 13
After reset: 0

All actor tests passed! State is correctly maintained across calls.


In [11]:
# Cleanup: terminate the actor job
print(f"Terminating actor job: {actor_job_id}")
client.terminate(actor_job_id)

# Wait for termination to complete
status = client.wait(actor_job_id, timeout=10.0)
print(f"Actor job terminated: {cluster_pb2.JobState.Name(status.state)}")

Terminating actor job: counter-actor
Actor job terminated: JOB_STATE_KILLED


## WorkerPool Demo

WorkerPool provides a high-level interface for parallel task execution. Unlike
submitting individual jobs (which have scheduling overhead), WorkerPool maintains
a persistent pool of workers that can execute arbitrary callables with minimal latency.

Key features:
- **Persistent workers**: Workers stay running and accept tasks via RPC
- **Task queuing**: Submit many tasks; they queue and dispatch to idle workers
- **map() interface**: Familiar parallel map semantics for batch processing

WorkerPool must run from within a job context (it needs FlusterContext for
endpoint discovery). This demo submits a "coordinator" job that creates and
uses a WorkerPool internally.
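
As a local analogy for the submit-then-collect pattern, the standard library's `ThreadPoolExecutor` behaves similarly: tasks queue up, a fixed number of workers pick them up, and each submission yields a future. The sketch below uses only `concurrent.futures` and makes no claim about how WorkerPool is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def square(n: int) -> int:
    return n * n


items = list(range(1, 11))

# Three local threads stand in for three pool workers.
with ThreadPoolExecutor(max_workers=3) as executor:
    # One future per item; tasks queue until a worker is idle.
    futures = [executor.submit(square, n) for n in items]
    # Collecting in submission order keeps results aligned with items.
    results = [f.result(timeout=30.0) for f in futures]

print(results)  # [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
```

The main difference is scope: the executor's workers are threads in one process, while the pool described above runs its workers as separate jobs on the cluster.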

In [12]:
def workerpool_coordinator():
    """Coordinator job that demonstrates WorkerPool usage.
    
    This runs inside a job context, which provides the FlusterContext needed
    for WorkerPool's endpoint discovery mechanism.
    """
    from fluster.client import fluster_ctx, WorkerPool, WorkerPoolConfig
    from fluster.rpc import cluster_pb2

    ctx = fluster_ctx()
    print(f"Coordinator starting (job_id={ctx.job_id})")

    # Define a simple computation function
    def square(n: int) -> int:
        return n * n

    # Create pool configuration: 3 workers with minimal resources
    config = WorkerPoolConfig(
        num_workers=3,
        resources=cluster_pb2.ResourceSpec(cpu=1, memory="512m"),
        name_prefix="pool-worker",
    )

    print(f"Creating WorkerPool with {config.num_workers} workers...")

    # Use WorkerPool as a context manager for automatic cleanup
    with WorkerPool(ctx.client, config, timeout=30.0) as pool:
        print(f"Pool ready: {pool.size} workers available")
        pool.print_status()

        # Use map() to compute squares of 1-10 in parallel
        items = list(range(1, 11))
        print(f"\nComputing squares of {items}...")

        futures = pool.map(square, items)
        results = [f.result(timeout=30.0) for f in futures]

        print(f"Results: {results}")
        
        # Verify results
        expected = [i * i for i in items]
        assert results == expected, f"Expected {expected}, got {results}"
        
        print("\nFinal pool status:")
        pool.print_status()

    print("\nWorkerPool demo completed successfully!")


print("Coordinator function defined")

Coordinator function defined


In [13]:
# Submit the coordinator job
coordinator_job_id = client.submit(
    entrypoint=Entrypoint.from_callable(workerpool_coordinator),
    name="workerpool-demo",
    resources=cluster_pb2.ResourceSpec(cpu=1, memory="512m"),
)
print(f"Submitted coordinator job: {coordinator_job_id}")

Submitted coordinator job: workerpool-demo


In [14]:
# Wait for the coordinator to complete with log streaming
# This may take a minute as it launches worker sub-jobs and waits for them
print("Waiting for WorkerPool demo to complete...")
print("(The coordinator will launch 3 worker sub-jobs internally)")
print()

status = client.wait(coordinator_job_id, timeout=120.0, stream_logs=True)
state_name = cluster_pb2.JobState.Name(status.state)
print(f"Coordinator job finished: {state_name}")

if status.state != cluster_pb2.JOB_STATE_SUCCEEDED:
    print(f"WARNING: Job did not succeed (state={state_name})")
else:
    print("\nWorkerPool demo completed successfully!")

Waiting for WorkerPool demo to complete...
(The coordinator will launch 3 worker sub-jobs internally)

Coordinator job finished: JOB_STATE_FAILED
