diff --git a/.github/workflows/env-examples.yml b/.github/workflows/env-examples.yml index 94632516..0fad9853 100644 --- a/.github/workflows/env-examples.yml +++ b/.github/workflows/env-examples.yml @@ -4,7 +4,13 @@ on: push: branches: - main + paths-ignore: + - '**.md' + - 'docs/**' pull_request: + paths-ignore: + - '**.md' + - 'docs/**' permissions: contents: read diff --git a/.github/workflows/lint-typecheck.yml b/.github/workflows/lint-typecheck.yml index 742576f6..e1808c38 100644 --- a/.github/workflows/lint-typecheck.yml +++ b/.github/workflows/lint-typecheck.yml @@ -4,7 +4,13 @@ on: push: branches: - main + paths-ignore: + - '**.md' + - 'docs/**' pull_request: + paths-ignore: + - '**.md' + - 'docs/**' permissions: contents: read diff --git a/.github/workflows/requirements-sync.yml b/.github/workflows/requirements-sync.yml index ac2de1e5..cfb38fb4 100644 --- a/.github/workflows/requirements-sync.yml +++ b/.github/workflows/requirements-sync.yml @@ -4,7 +4,13 @@ on: push: branches: - main + paths-ignore: + - '**.md' + - 'docs/**' pull_request: + paths-ignore: + - '**.md' + - 'docs/**' permissions: contents: read diff --git a/.github/workflows/security.yml b/.github/workflows/security.yml index 1a46ab3d..6a3a047e 100644 --- a/.github/workflows/security.yml +++ b/.github/workflows/security.yml @@ -2,8 +2,14 @@ name: security on: pull_request: + paths-ignore: + - '**.md' + - 'docs/**' push: branches: [main] + paths-ignore: + - '**.md' + - 'docs/**' permissions: contents: read diff --git a/.github/workflows/unit-tests.yml b/.github/workflows/unit-tests.yml index 3c24da35..ea66cb07 100644 --- a/.github/workflows/unit-tests.yml +++ b/.github/workflows/unit-tests.yml @@ -4,7 +4,13 @@ on: push: branches: - main + paths-ignore: + - '**.md' + - 'docs/**' pull_request: + paths-ignore: + - '**.md' + - 'docs/**' permissions: contents: read diff --git a/AGENTS.md b/AGENTS.md index 6b985607..f10a9d0d 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -1,710 +1,80 @@ # AGENTS.md — FlowMesh -This file provides guidance to coding agents (Claude Code, Codex, Cursor, and -similar tools) working in this repository. It follows the cross-agent -`AGENTS.md` convention; agent-specific entry files (e.g. `CLAUDE.md`) at this -level simply import from here. - -## Project Overview - -FlowMesh is a service fabric for running LLM agentic workflows on distributed -GPU workers. It accepts workflow definitions (YAML, JSON, or n8n graph format), -parses them into a DAG of tasks, schedules and dispatches each task to a -suitable worker, and collects results and artifacts. It supports inference -(vLLM, HF transformers, diffusers), training (SFT, LoRA, DPO, PPO), -retrieval-augmented generation, agent execution, SSH-style interactive -sessions, and arbitrary container jobs. - -The codebase is organized as a **uv workspace** with SDK and CLI packages: - -| Package | Path | Purpose | -|---------|------|---------| -| `flowmesh` (root) | `src/` | Server, Worker, shared schemas | -| `flowmesh-sdk` | `sdk/` | Public Python SDK for the server API | -| `flowmesh-sdk-stack` | `sdk/stack/` | Stack and node helpers for local deployments | -| `flowmesh-cli` | `cli/` | Typer CLI entry point + core commands | -| `flowmesh-cli-stack` | `cli/stack/` | Stack deployment commands | - -## Architecture - -``` -Client (CLI / SDK / API) - │ - ▼ HTTP (default 8000) -Server ─── FastAPI orchestrator - │ - YAML / JSON / n8n workflow parsing - │ - DAG dependency resolution, task scheduling - │ - Dispatch: best-fit / first-fit / min-satisfying worker selection - │ - Task merging, stage stickiness, context reuse, epoch scheduling - │ - SSE log streaming, metrics recording - │ - Redis (control + telemetry channels) for pub/sub and task state - │ - ├──▶ Redis Pub/Sub ──▶ Supervisor(s) ──▶ gRPC ──▶ Worker(s) - │ │ │ - │ │ Worker lifecycle │ Stateless GPU executors - │ │ Docker / Vast.ai │ 19 executor types - │ │ adapters │ - │ │ │ - │ ▼ ▼ - │ Worker Registry Results / Artifacts - │ - └──▶ Redis streams: logs, events, task queues -``` - -### Components - -The runtime is two top-level processes: - -1. **Server** (`src/server/`) — Central FastAPI orchestrator (default HTTP - port 8000). Hosts workflow / task / dispatch logic and the **Supervisor - subsystem** under `src/server/supervisor/`, which manages per-node worker - lifecycle, runs the worker-facing gRPC server (default port 50051), and - drives the Docker / Vast.ai worker adapters. - -2. **Worker** (`src/worker/`) — Stateless executor process. Connects to a - supervisor via gRPC, receives tasks, runs executors, reports results. - -### Communication Protocols - -- **Server ↔ Supervisor (within a node)**: `multiprocessing.Queue` between - the parent server and its supervisor child process for command/response, - plus `nodes:events` Redis pub/sub for telemetry. -- **Server ↔ Supervisor (across nodes)**: Redis pub/sub on - `node:{id}:dispatch`, `node:{id}:cmds`, `nodes:events`, `nodes:responses`. -- **Supervisor ↔ Worker**: bidirectional gRPC (proto stubs at - `src/shared/grpc/supervisor/v1/`). -- **Client ↔ Server**: REST API over HTTP. - -### Prefer CLI and SDK - -When interacting with FlowMesh, prefer the **CLI** (`flowmesh`) or **SDK** -(`flowmesh` Python package) over raw HTTP calls or shell scripts. The CLI and -SDK handle pagination, error formatting, and SSE streaming. - -### Hook Plugin Extension Points - -External integrations (auth, submission policy, usage tracking) plug into the -server through three protocol hooks defined in `src/server/hooks/`: - -- `IdentityProvider` — resolve a bearer token to a `PrincipalContext` - (iterated from `auth/security.py`). With no providers registered, auth is - a no-op and `authenticate_api_key` returns a default admin principal. -- `SubmissionGuard` — pre-submit precondition (iterated from - `routers/v1/workflows.py`). -- `UsageSink` — fan-out per-task usage rows after a task completes - (iterated from `services/monitoring.py`). Typical consumers: billing, - audit, observability. - -A plugin is any Python module that exposes a top-level `install()`. The -server loads `FLOWMESH_PLUGINS` (comma-separated module names) inside its -FastAPI lifespan and treats `install()` as either: - -- a sync function returning `None` — the plugin appends its adapters to the - registries in `server.hooks` and returns; or -- an `@asynccontextmanager async def install()` — the plugin owns resources - with a lifecycle (a SQLAlchemy engine, an HTTP client, a background task) - that need teardown on server shutdown. The loader holds an - `AsyncExitStack`, enters each ctx-manager `install()` on startup, and - unwinds them in reverse order on shutdown. - -Plugins live anywhere on `sys.path` — in-tree under `src/server//`, -sibling-mounted under `/app/src//`, or a pip-installable wheel. Core -never references plugin module names; each plugin self-filters internally. - -OSS ships no DB itself. Plugins that need persistence bring their own engine -and manage it via the ctx-manager `install()` form. Example: - -```python -# myorg_auth_plugin/__init__.py -import os -from contextlib import asynccontextmanager -from sqlalchemy.ext.asyncio import async_sessionmaker, create_async_engine - -from server.hooks import IDENTITY_PROVIDERS - - -class _MyOrgAuth: - name = "myorg.auth" - def __init__(self, sessionmaker): self._sm = sessionmaker - async def resolve(self, raw_token, logger): - async with self._sm() as session: - ... - -@asynccontextmanager -async def install(): - engine = create_async_engine(os.environ["MYORG_DATABASE_URL"]) - IDENTITY_PROVIDERS.append(_MyOrgAuth(async_sessionmaker(engine))) - try: - yield - finally: - await engine.dispose() -``` - -## API Reference (Server: `http://localhost:8000`) - -### Workflows - -| Method | Path | Description | -|--------|------|-------------| -| POST | `/api/v1/workflows` | Submit workflow (YAML `text/plain` or JSON; set `Workflow-Format: n8n` header for n8n) | -| POST | `/api/v1/workflows/validate` | Validate without executing | -| GET | `/api/v1/workflows` | List workflows (filterable) | -| GET | `/api/v1/workflows/{id}` | Get workflow details | -| GET | `/api/v1/workflows/{id}/logs` | Query logs (limit, before/after cursors) | -| GET | `/api/v1/workflows/{id}/logs/stream` | SSE log stream | -| POST | `/api/v1/workflows/{id}/cancel` | Cancel workflow | - -Workflow statuses: `PENDING`, `DISPATCHED`, `FAILED`, `CANCELLED`, `DONE` - -### Tasks - -| Method | Path | Description | -|--------|------|-------------| -| GET | `/api/v1/tasks` | List tasks (filter by workflow_id, status, task_type, assigned_worker) | -| GET | `/api/v1/tasks/{id}` | Task details | -| GET | `/api/v1/tasks/{id}/logs` | Query task logs | -| GET | `/api/v1/tasks/{id}/logs/stream` | SSE task log stream | - -Task statuses: `PENDING`, `DISPATCHED`, `FAILED`, `CANCELLED`, `DONE` - -### SSH - -| Method | Path | Description | -|--------|------|-------------| -| WS | `/api/v1/ssh/tasks/{task_id}/proxy` | WebSocket SSH proxy for proxy-mode SSH tasks | -| GET | `/api/v1/ssh/connections` | List active server-audited SSH proxy/forward connections | -| TCP | `:` | Raw TCP per-task forward-mode port | - -Server policy toggles: -- `ENABLE_SERVER_SSH_PROXY` — WebSocket SSH proxy endpoint -- `ENABLE_SERVER_SSH_FORWARD` — TCP forward listener -- `ENABLE_SERVER_SSH_CONNECTION_AUDIT` — best-effort SSH connection audit - -### Results - -| Method | Path | Description | -|--------|------|-------------| -| POST | `/api/v1/results` | Submit task result (worker → server) | -| GET | `/api/v1/results/{task_id}` | Get task result | -| GET | `/api/v1/results/{task_id}/bundle` | Download tar.gz bundle (`?include=results,artifacts,logs,all`) | -| POST | `/api/v1/results/{task_id}/files` | Upload artifact (multipart) | -| GET | `/api/v1/results/{task_id}/files/{filename}` | Download artifact | -| GET | `/api/v1/results/{task_id}/logs` | Download archived logs.jsonl | - -### Workers - -| Method | Path | Description | -|--------|------|-------------| -| GET | `/api/v1/workers` | List workers (filter by alias, namespace, cluster, status, tags) | -| GET | `/api/v1/workers/{id}` | Worker details | - -Worker statuses: `STARTING`, `IDLE`, `BUSY`, `STOPPING`, `STOPPED` - -### Nodes (Supervisors) - -| Method | Path | Description | -|--------|------|-------------| -| GET | `/api/v1/nodes` | List nodes | -| POST | `/api/v1/nodes/register` | Register a node | -| GET | `/api/v1/nodes/{id}` | Node details | -| GET | `/api/v1/nodes/{id}/workers` | List workers under a node | -| POST | `/api/v1/nodes/{id}/workers/register` | Register worker under node | -| POST | `/api/v1/nodes/{id}/workers/{name}/start` | Start worker | -| POST | `/api/v1/nodes/{id}/workers/{name}/stop` | Stop worker | - -### Stack (local single-node lifecycle) - -`/api/v1/stack/workers/...` — wraps node-registered workers with local-only -container lifecycle, used by `flowmesh stack worker {up,down,start,stop}`. - -### System - -| Method | Path | Description | -|--------|------|-------------| -| GET | `/healthz` | Top-level health check | -| GET | `/api/v1/system/metrics` | System metrics snapshot | - -## Object Identifiers - -Object IDs use 3-character type prefixes: - -- `wfl-` workflows -- `tsk-` tasks -- `ssn-` SSH session containers -- `scn-` SSH connection (server-side audit row) -- `cmd-` supervisor commands - -ID factories live in `src/shared/utils/ids.py`. Always use `new_*_id()` -helpers; never use `uuid4()` or `secrets.token_hex` for IDs. - -## SDK Usage - -### Sync Client - -```python -from flowmesh import FlowMesh - -client = FlowMesh(base_url="http://localhost:8000", api_key="...") - -# Submit a workflow -wf = client.workflows.submit_yaml(open("templates/echo_local.yaml").read()) - -# Watch progress -for ev in client.workflows.stream_logs(wf.workflow_id): - print(ev.line) - -# Inspect tasks -tasks = client.tasks.list(workflow_id=wf.workflow_id) -result = client.results.get(tasks[0].task_id) -``` - -### Async Client - -```python -from flowmesh import AsyncFlowMesh - -async with AsyncFlowMesh(base_url="...", api_key="...") as client: - wf = await client.workflows.submit_yaml(...) - async for ev in client.workflows.stream_logs(wf.workflow_id): - ... -``` - -## CLI Commands - -The CLI entry point is `flowmesh`. Top-level commands and command groups: - -``` -flowmesh info | health | logout -flowmesh workflow {submit, validate, list, info, watch, cancel, logs} -flowmesh task {list, info, watch, stop, logs} -flowmesh worker {list, info} -flowmesh node {list, info, worker} -flowmesh node worker {list, start, stop} -flowmesh ssh {connect, run, proxy, connections} -flowmesh result {fetch, download} -flowmesh system {metrics} -flowmesh stack {build, push, pull, pullall, up, down, restart, ps, logs} -flowmesh stack worker {up, start, stop, down, list, pull} -``` - -`flowmesh stack` (from `flowmesh-cli-stack`) builds Docker images, runs the -local Compose stack, and manages local-only worker containers. Core commands -work against any reachable FlowMesh server. - -## Workflow YAML Format - -### Single Task - -```yaml -apiVersion: flowmesh/v1 -kind: InferenceTask -metadata: - name: hello-inference -spec: - taskType: inference - resources: - hardware: { gpu: { type: any, count: 1 } } - model: - source: { type: huggingface, identifier: TinyLlama/TinyLlama-1.1B-Chat-v1.0 } - vllm: { gpu_memory_utilization: 0.5 } - data: - type: list - items: - - - role: user - content: What is the capital of France? - inference: { max_tokens: 64, temperature: 0.0 } - output: - destination: { type: http } -``` - -### Multi-Stage DAG - -```yaml -apiVersion: flowmesh/v1 -kind: Workflow -spec: - stages: - - name: extract - spec: - taskType: inference - ... - data: - type: list - items: - - - role: user - content: "Extract entities from: {{input}}" - - name: summarize - dependsOn: [extract] - spec: - taskType: inference - ... - data: - type: list - items: - - - role: user - content: "Summarize: {{extract.output}}" -``` - -### Graph DAG - -`taskType: graph_template` — topology-aware multi-input prompts with parent -output substitution and validation. See -`src/worker/executors/utils/graph_templates.py` for the templating contract. - -### Schedule Hints - -Workflows can declare scheduling preferences via -`metadata.annotations.schedule_hint`: - -- `epoch_groups: [[, ...], ...]` — epoch-ordered execution; tasks - in epoch `n` only dispatch after every task in epoch `n-1` succeeds. -- `schedule_in_epoch_order: true` — for dependent DAGs, prefer - position-in-epoch tie-breaks during dispatch. - -## Task Types & Executor Registry - -The worker resolves `spec.taskType` against an executor registry in -`src/worker/runner.py`. Built-in executors: - -| `taskType` | Executor | Use case | -|-----------|----------|----------| -| `echo` | `EchoExecutor` | Echo input back as result (smoke tests) | -| `inference` | `VLLMExecutor` / `TransformersExecutor` | LLM inference | -| `diffusion` | `DiffusersExecutor` | Image / video diffusion models | -| `omni_text2{audio,image,speech,general}` | `Omni*Executor` | Multimodal generation | -| `training` | `SFTExecutor` / `LoRASFTExecutor` / `DPOExecutor` / `PPOExecutor` | Fine-tuning | -| `rag` | `RAGExecutor` | Retrieval-augmented generation | -| `agent` | `AgentExecutor` | Tool-using LLM agent (utu / youtu-agent backend) | -| `data_profiling` | `DataProfilingExecutor` | DataFrame profiling | -| `data_retrieval` | `DataRetrievalExecutor` | DataFrame loading from sources | -| `ssh` | `SSHExecutor` | Interactive SSH session or non-interactive container job | - -Helper utilities live in `src/worker/executors/utils/` (`artifacts`, -`checkpoints`, `data_utils`, `graph_templates`, `huggingface`, `safe_eval`). -Cross-cutting behavior is in `src/worker/executors/mixins/` -(`data`, `governance`, `inference`, `training`). - -### Agent Executor (utu / youtu-agent) - -Requires `UTU_LLM_TYPE`, `UTU_LLM_MODEL`, `UTU_LLM_BASE_URL`, `UTU_LLM_API_KEY` -to run. Optional: `SERPER_API_KEY`, `JINA_API_KEY` for search tools. - -## Environment Variables - -The canonical declared set lives in -`cli/stack/src/flowmesh_cli_stack/env_schema.py` and is mirrored to -`cli/stack/src/flowmesh_cli_stack/assets/.env.example`. Run -`uv run scripts/dev/check_env_examples.py --write` after schema edits. - -### Server (selected) - -| Variable | Default | Description | -|----------|---------|-------------| -| `REDIS_CONTROL_URL` | `redis://localhost:6379/0` | Redis control channel | -| `REDIS_TELEMETRY_URL` | `redis://localhost:6380/0` | Redis telemetry channel | -| `DATABASE_URL` | – | Postgres connection string | -| `RESULTS_DIR` | `./results` | Server-side results directory | -| `SERVER_HTTP_PORT` | `8000` | Public HTTP port | -| `SERVER_GRPC_PORT` | `50051` | Supervisor gRPC port | -| `ORCHESTRATOR_DISPATCH_MODE` | `adaptive` | Scheduler mode | -| `ORCHESTRATOR_WORKER_SELECTION` | `best_fit` | `best_fit`, `first_fit`, `min_satisfying` | -| `SCHEDULER_LAMBDA_INFERENCE` | `0.4` | Inference task weight | -| `SCHEDULER_LAMBDA_TRAINING` | `0.8` | Training task weight | -| `SCHEDULER_LAMBDA_OTHER` | `0.5` | Other-task weight | -| `SCHEDULER_SELECTION_JITTER` | `1e-3` | Tie-break jitter | -| `ENABLE_TASK_MERGE` | `true` | DAG-level task coalescing | -| `TASK_MERGE_MAX_BATCH_SIZE` | `4` | Max merged tasks per dispatch | -| `ENABLE_CONTEXT_REUSE` | `true` | Bias toward workers with cached models | -| `WORKER_CACHE_TTL_SEC` | `3600` | Cache metadata TTL | -| `ENABLE_STAGE_WEIGHT_STICKINESS` | `false` | Pin stages to checkpoint-producing workers | -| `ENABLE_WORKER_WATCHDOG` | `true` | Worker death detection | -| `WORKER_DEATH_GRACE_SEC` | `60` | Grace period before marking dead | -| `FLOWMESH_PLUGINS` | – | Comma-separated plugin module names | -| `LOG_LEVEL` | `INFO` | Server log level | - -### Worker (selected) - -| Variable | Default | Description | -|----------|---------|-------------| -| `WORKER_TOKEN` | – | Auth token for supervisor gRPC | -| `SUPERVISOR_GRPC_TARGET` | – | Supervisor gRPC endpoint | -| `RESULTS_DIR` | `./results_workers` | Task output directory | -| `WORKER_TAGS` | `` | Scheduler hints | -| `WORKER_COST_PER_HOUR` | `1.0` | Cost metadata | -| `WORKER_UPLOAD_RESULTS` | `false` | Upload results when no destination set | -| `HF_CACHE_DIR` | – | Shared HuggingFace cache mount | -| `HEARTBEAT_INTERVAL_SEC` | `30` | Heartbeat cadence | - -### Supervisor (selected) - -| Variable | Default | Description | -|----------|---------|-------------| -| `NODE_NAMESPACE` / `NODE_CLUSTER` / `NODE_ALIAS` | defaults | Identity | -| `NODE_TAGS` | `` | Scheduler hints (CSV) | -| `SUPERVISOR_GRPC_DISABLE_SERVER_TLS` | `false` | Local-only insecure gRPC | -| `SUPERVISOR_GRPC_KEEPALIVE_PERMIT_WITHOUT_CALLS` | `true` | gRPC keepalive | -| `SUPERVISOR_GRPC_EXTERNAL_PORT` | – | External port (when port-forwarded) | -| `SERVER_GRPC_TLS_*` | – | TLS certificate files | - -## Directory Structure - -``` -src/ - server/ # FastAPI orchestrator - auth/ # OSS auth shim (no-op when IDENTITY_PROVIDERS is empty; otherwise delegates to the chain) - clients/ # Redis client(s) - config.py # Server config + DispatchConfig - db/ # SQLAlchemy session factory and migrations (no-op in OSS) - dispatcher/ # Dispatch loop, worker selector, stage stickiness, context reuse - env.py # Centralized env reads - hooks/ # IdentityProvider / SubmissionGuard / UsageSink ABCs + registries - main.py # Entrypoint, FLOWMESH_PLUGINS loader, EventMonitor wiring - registries/ # Worker / Node registries (Redis-backed) - routers/v1/ # workflows, tasks, results, workers, nodes, ssh, stack, system - schemas/ # Pydantic models for API + DB - services/ # monitoring, log streaming, ssh forwarding, runtime - supervisor/ # Per-node agent (gRPC server, adapters, lifecycle) - adapters/ # Docker / Vast.ai worker adapters - services/ # gRPC server, command listener, task listener, relay - task/ # parser, runtime, models, merge / epoch helpers - shared/ - schemas/ # Cross-cutting schemas (NodeInfo, etc.) - grpc/supervisor/v1/ # Generated proto stubs (used by server's supervisor subsystem + worker) - tasks/ # Workflow/task spec models - utils/ # JSON, parsing, time, ids - worker/ - config.py # Worker config (loaded from env) - docker/ # Worker Dockerfiles (CPU + GPU) - executors/ # Executor implementations - mixins/ # data, governance, inference, training - utils/ # artifacts, checkpoints, data_utils, graph_templates, huggingface, safe_eval - runner.py # Task lifecycle (execute, write results, upload artifacts) - supervisor_client.py # gRPC client to supervisor -cli/ - src/flowmesh_cli/ # Core CLI commands (flowmesh-cli) - stack/src/flowmesh_cli_stack/ # Stack deployment commands (flowmesh-cli-stack) -sdk/ - src/flowmesh/ # Public SDK (flowmesh-sdk) - stack/src/flowmesh_stack/ # Stack helpers (flowmesh-sdk-stack) -proto/ - supervisor/v1/supervisor.proto # gRPC service definition -templates/ # Example workflow YAMLs -tests/ - server/ # Server-side unit tests - worker/ # Worker-side unit tests - shared/ # Shared schema tests - cli/ # CLI tests - sdk/ # SDK tests -scripts/dev/ # Developer scripts (compile_protos.sh, sync_requirements.py, check_env_examples.py) -``` - -## Development Workflow - -### Setup - -```bash -pip install uv -uv sync --all-extras # All Python deps including dev tooling -uv run scripts/dev/compile_protos.sh # Regenerate proto stubs (only when proto changes) -``` - -### Format / lint / type-check - -```bash -uv run pre-commit run --all-files -``` - -Hooks: gitleaks, isort, black, ruff, codespell, mypy, sync_requirements, -check_env_examples (via `scripts/dev/check_env_examples.py`). - -### Tests - -```bash -uv run pytest tests/ --ignore=tests/worker/test_mp_executor_cleanup_gpu.py -``` - -The cleanup-gpu test requires NVIDIA hardware and isolated processes; skip it -in normal CI. - -### Run locally - -```bash -uv run flowmesh stack up # Server + Postgres + Redis + Supervisor -uv run flowmesh stack worker up cpu 1 # 1 CPU worker -uv run flowmesh stack worker up gpu --targets 0 # 1 GPU worker pinned to GPU 0 -uv run flowmesh workflow submit templates/echo_local.yaml -``` - -### Docker - -Build images locally: - -```bash -uv run flowmesh stack build server -uv run flowmesh stack build flowmesh_worker_cpu flowmesh_worker_gpu -uv run flowmesh stack build flowmesh_ssh_cpu flowmesh_ssh_gpu -``` - -Always rebuild the affected worker image after changing executor code; stale -images silently mask code regressions. - -## Code Style - -- Python 3.12+ (see `[project]` in `pyproject.toml`). -- Top-level imports only; inline imports only to break circular imports. -- Prefer `typing.Any` over `object` in annotations; only use `object` when - `Any` is semantically wrong. -- Use `# type: ignore[]` only after exhausting alternatives. Never - use a bare `# type: ignore`. -- Prefer `X | Y` over `typing.Union[X, Y]` and `X | None` over `typing.Optional[X]`. -- Don't write `from __future__ import annotations` unless strictly necessary; - use `typing.Self` or quoted forward references instead. -- Avoid `hasattr` / `getattr` that bypass type checking. Use `isinstance` - guards. Acceptable `getattr` uses: dynamic dispatch, providing a default, - accessing untyped third-party APIs. -- Default to no comments. Comment only when the *why* is non-obvious. Names - self-document. -- Don't add comments referencing the current task, fix, or callers ("used by - X", "added for Y", "handles issue #123") — those rot as the codebase - evolves. - -### Security Rules (bandit-enforced) - -CI runs `bandit` with no severity / confidence threshold. Every finding must -either have a source-level fix, a documented skip in `[tool.bandit]` in -`pyproject.toml`, or a per-line `# nosec BXXX` paired with an inline TODO -that names the planned fix. A bare `# nosec` with no rule code and no TODO -is disallowed — silencing a finding without a written reason defeats the -audit. - -When writing new code, follow these rules so bandit stays green: - -- **B113 (request timeouts)** — every `requests.get/post/...` call must pass - `timeout=`. Hung connections are denial-of-service; no implicit defaults. -- **B202 (archive extraction)** — `tarfile.extractall(..., filter="data")` is - required (Python 3.12+). For zipfile, iterate `infolist()`, validate that - each member resolves under the destination, and extract per-member; never - call `zipfile.extractall` on untrusted archives. -- **B310 (urlopen)** — don't use `urllib.request.urlopen`; use the `requests` - library and validate the URL scheme (`http`/`https` only) before fetching. -- **B324 (insecure hash)** — `hashlib.md5(..., usedforsecurity=False)` is - required when MD5 is used for cache keys / fingerprints. Never use MD5 for - anything that crosses a security boundary. -- **B506 (yaml load)** — always use `yaml.safe_load`. `yaml.load(..., - Loader=yaml.FullLoader)` is forbidden. -- **B603 (subprocess call site)** — every `subprocess.run/call/Popen/...` - needs a per-line `# nosec B603` with a one-line written rationale (e.g. - `argv list, no shell=True, absolute path via shutil.which()`). The B404 - blanket rule is project-skipped because B602 (`shell=True`) and B607 - (partial path) catch the actually-dangerous patterns; B603 is enforced - per-site so every shellout is visible at the call line. -- **B607 (subprocess with partial path)** — prefer the vendored SDK - (`pynvml`, `docker-py`, `GitPython`) over shelling out via `nvidia-smi` / - `docker` / `git`. If shelling out is unavoidable, the absolute path must - be provided. -- **B614 (torch.load)** — `torch.load(..., weights_only=True)` is required. - Pickle deserialization is RCE waiting to happen. -- **B701 (jinja2 autoescape)** — `Environment(autoescape=select_autoescape())` - is required. The default `False` is unsafe even for non-HTML templates. -- **B108 (hardcoded /tmp)** — use `tempfile.gettempdir()` or - `tempfile.NamedTemporaryFile` for local-host work. The literal `"/tmp"` in - Python source is forbidden; if a Linux container path is genuinely - intended (e.g. an in-container sentinel), construct it from - `PurePosixPath` segments rather than as a single string constant. - -Skipped rules and the rationale for each are listed in `[tool.bandit]`. -When the rationale stops being true (e.g. a sandbox stops being a sandbox), -remove the skip and fix the call sites — don't widen the skip list silently. - -### Dependency CVE scanning (pip-audit) - -CI runs `pip-audit` against each generated requirements file -(`src/server/requirements.txt`, `src/worker/requirements/requirements.txt`, -`src/worker/requirements/requirements.gpu.txt`). The job lives in -`.github/workflows/security.yml`. - -When pip-audit reports a new CVE, the only real fix is to bump the -offending dep in `pyproject.toml`, then `uv lock` + `uv run -scripts/dev/sync_requirements.py --write`. Silencing a finding via -`--ignore-vuln` is a last resort; every silenced GHSA needs a written -reason, same rule as bandit. The currently-ignored advisories and the -upgrade blocker that justifies each are listed below. - -| GHSA | Package | Fix version | Why ignored | -|------|---------|-------------|-------------| -| GHSA-69w3-r845-3855 | transformers | 5.0.0rc3 | held by vllm/vllm-omni 0.18 compatibility | -| GHSA-pf3h-qjgv-vcpr | vllm | 0.19.0 | held by transformers 4.57 + adjacent inference deps | -| GHSA-pq5c-rjhq-qp7p | vllm | 0.19.0 | same | -| GHSA-3mwp-wvh9-7528 | vllm | 0.19.0 | same | -| GHSA-cfh3-3jmp-rvhc | pillow | 12.1.1 | gradio 5.50 caps pillow<12 (transitive via vllm-omni) | -| GHSA-whj4-6x5x-4v2j | pillow | 12.2.0 | same cap | -| GHSA-vfmq-68hx-4jfw | lxml | 6.1.0 | crawl4ai 0.8.6 caps lxml<6 | -| GHSA-39mp-8hj3-5c49 | gradio | 6.7.0 | vllm-omni 0.18 pins gradio==5.50 | -| GHSA-h3h8-3v2v-rg7m | gradio | 6.6.0 | same pin | -| GHSA-jmh7-g254-2cq9 | gradio | 6.6.0 | same pin | -| GHSA-pfjf-5gxr-995x | gradio | 6.6.0 | same pin | -| GHSA-w8v5-vhqr-4h9v | diskcache | (none) | upstream unmaintained, no fixed version published | - -When a blocker lifts (e.g. transformers 5 ↔ vllm 0.19 line stabilizes), -drop the corresponding `--ignore-vuln` flag from the workflow rather than -extending the rationale to unrelated packages. - -## Commit Conventions - -- Single-line subject in imperative mood; no body unless a non-obvious "why" - is needed. -- Conventional prefixes: `feat:`, `fix:`, `refactor:`, `chore:`, `docs:`, - `style:`, `test:`. Scope optional: `feat(server): ...`. -- Sign off (`--signoff`) is required for code coming from coding agents - (Claude Code, Codex, Cursor, etc.). -- One logical change per commit. Don't batch unrelated changes. - -## Key Patterns - -### Task State Machine - -`PENDING` → `DISPATCHED` → (`DONE` | `FAILED` | `CANCELLED`). -A retried task transitions back to `PENDING` until exhausted. - -### Redis Channel Conventions - -- `flowmesh:control:*` — control plane (task assignments, cancellations, - worker lifecycle). -- `flowmesh:telemetry:*` — telemetry (heartbeats, status updates). -- `flowmesh:logs:task:{task_id}` — per-task log stream. -- `flowmesh:logs:workflow:{workflow_id}` — per-workflow log stream. - -Stream lengths are bounded via `LOG_STREAM_MAXLEN_TASK` and -`LOG_STREAM_MAXLEN_WORKFLOW`. Streams expire `LOG_STREAM_TTL_SEC` after close. - -### Cursor-Based Pagination - -List endpoints (`/api/v1/workflows`, `/api/v1/tasks`, log queries) accept -`limit` and `before` / `after` cursors. The cursor is an opaque base64 -encoding of `(timestamp, id)`; do not parse client-side. - -### Task Merging - -Compatible adjacent tasks in a DAG (same `taskType`, model, hardware shape, -and merge key) coalesce into a single dispatch. The merged children are -carried in `WorkerTaskMessage.merged_children`. The runtime hands the worker -a single message; the worker writes per-child results into a `children` -section of the response. The dispatcher then fans out synthetic -TASK_SUCCEEDED / TASK_FAILED events for each child. - -Merge can be disabled with `ENABLE_TASK_MERGE=false`. - -### Stage Stickiness - -When `ENABLE_STAGE_WEIGHT_STICKINESS=true`, the dispatcher pins stages that -reference an upstream stage's checkpoint to the worker that produced it, -falling back to normal selection if that worker is unavailable or stale. -Mostly relevant for training pipelines where reusing the on-disk checkpoint -saves repeated downloads. - -### Context Reuse / Cache Affinity - -Workers report cached models and datasets via `WorkerHardware`. The -dispatcher's `_cached_worker_candidates` filters the candidate pool to -workers whose cache covers the task's model/dataset references. Stale cache -entries (older than `WORKER_CACHE_TTL_SEC`) are ignored. +Routing doc for coding agents (Claude Code, Codex, Cursor, …) working in +this repository. The full project rules live in dedicated docs; read the +ones relevant to your task before editing. + +FlowMesh is a service fabric for running LLM agentic workflows on +distributed GPU workers. The server parses a workflow, turns it into a +DAG of tasks, dispatches each task to a worker, and collects results +and artifacts. + +## Where to read what + +- **[`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md)** — topology diagram, + components (server / supervisor / worker), communication protocols, + object IDs, task state machine, directory map, runtime behavior + (task merging, stage stickiness, context reuse, log streams), plugin + hooks. Read before any cross-component change. +- **[`docs/CODE_STYLE.md`](docs/CODE_STYLE.md)** — Python rules, + docstring conventions, bandit security rules and `# nosec` policy, + pip-audit policy. Read before writing any source code. +- **[`CONTRIBUTING.md`](CONTRIBUTING.md)** — setup, pre-commit hooks, + testing, dependency-pin workflow, **commit and PR title conventions**, + DCO sign-off. Read before committing or opening a PR. +- **[`docs/API.md`](docs/API.md)** — common server REST endpoints + (workflows, tasks, results, workers, nodes, SSH, system) and cursor + pagination contract. Read before calling the server directly or + changing a router. +- **[`docs/SDK.md`](docs/SDK.md)** — client usage, common operations, error contract. +- **[`docs/CLI.md`](docs/CLI.md)** — `flowmesh ...` command groups, + common workflows (submit/watch/logs), local stack lifecycle, SSH tasks. +- **[`docs/EXECUTORS.md`](docs/EXECUTORS.md)** — `taskType → Executor` + registry table, helper utilities, and the `AgentExecutor` env + requirements (`UTU_LLM_*`, `SERPER_API_KEY`, `JINA_API_KEY`). +- **[`docs/WORKFLOWS.md`](docs/WORKFLOWS.md)** — workflow YAML format + hierarchy: single task, multi-stage DAG (`spec.stages`), graph DAG + (`taskType: graph_template`), and schedule hints (`epoch_groups`, + `schedule_in_epoch_order`). +- **[`docs/ENV.md`](docs/ENV.md)** — curated server / worker / + supervisor env var tables (the knobs you actually tune). Full schema + in `cli/stack/src/flowmesh_cli_stack/env_schema.py`. +- **[`docs/PLUGINS.md`](docs/PLUGINS.md)** — plugin extension contract, + loader semantics (`FLOWMESH_PLUGINS`), and a worked example. + +Concrete examples and runnable workflows live in `templates/`. +When code, APIs, CLI commands, SDK methods, env vars, workflow formats, or +runtime behavior change, update the corresponding docs in the same PR. + +## Reminders + +The full justification lives in the linked docs; these are the rules +that come up most often, surfaced here so you can't miss them. + +- **PR title type**: one of `feat | fix | refactor | chore | test | + perf | docs`. Anything else fails the title check + (`scripts/ci/check_pr_title.py`). Use `docs:` for doc-only PRs — + those skip the code-related CI jobs (lint, tests, security, + env/requirements sync). +- **Docstrings**: describe what the code *does*, not what it + *replaced*. No "in-process replacement for X", no "previously did Y". +- **Comments**: default to none. Add a short one only when the *why* is + non-obvious; let names self-document. +- **Test failures**: never ignore one. CI guards the suite, so a red + test on `main` is a CI gap, not a permission to skip — fix it as part + of your PR even when the failure looks unrelated to your change. + Every PR ships a runnable build. +- **Cluster management goes through SDK / CLI**: reach for `flowmesh ...` + or the Python SDK. Raw `docker` is a last-resort escape hatch when the + SDK / CLI doesn't expose what you need. + +## Dev workflow + +Use `CONTRIBUTING.md` for setup, hooks, tests, dependency pins, and +commit rules. Use `docs/CLI.md` for local stack and workflow commands. + +## Auto-loaded references + +@docs/ARCHITECTURE.md +@docs/CODE_STYLE.md +@CONTRIBUTING.md diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index dd9cdfc6..044bd88f 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -56,8 +56,27 @@ This installs three hook stages: [DCO sign-off](#signing-off-commits-dco) line to your commit message. - **commit-msg** — verifies the sign-off is present (safety net). +## Commit message and PR title conventions + +- Allowed prefixes for commit messages and PR titles (enforced for **PR titles** by + `scripts/ci/check_pr_title.py`): `feat:`, `fix:`, `refactor:`, + `chore:`, `test:`, `perf:`, `docs:`. Use `docs:` for doc-only PRs — + those skip the code-related CI jobs (lint, tests, security, + env/requirements sync). +- Optional scope: `feat(server): ...`. Optional `[BREAKING]` prefix + *before* the type for breaking changes: `[BREAKING] feat: ...`. +- For commit messages, the subject should be a short sentence about the key changes; + add body only when the reason for the changes need elaboration. +- Sign off (`--signoff`) is required for every commit. +- One logical change per commit. Don't batch unrelated changes. + ## Code Style +See [`docs/CODE_STYLE.md`](docs/CODE_STYLE.md) for the project's Python +rules, docstring conventions, bandit security rules, and pip-audit +policy. The tools below run automatically via pre-commit; running them +manually is rarely necessary. + | Tool | Purpose | Config | |------|---------|--------| | [gitleaks](https://github.com/gitleaks/gitleaks) | Committed-secret detection | - | @@ -91,7 +110,7 @@ If your change touches shared schemas or proto definitions, verify downstream co Dependency versions live in two places with different styles: - **`pyproject.toml`** — `>=X.Y.Z` lower bounds. Expresses a compatibility floor; lets uv resolve the current acceptable version. -- **`src/worker/requirements/requirements{,.gpu}.txt`** — exact `==X.Y.Z` pins. These feed the worker Docker images (`uv pip install --requirement …`), which need deterministic, reproducible installs. +- **`src/worker/requirements/requirements{,.gpu}.txt`** — exact `==X.Y.Z` pins for deterministic, reproducible environment of worker environment. The requirements files are **auto-generated** from `pyproject.toml` + `uv.lock` by `scripts/dev/sync_requirements.py`. Do not edit them by hand. @@ -123,5 +142,7 @@ git push --force-with-lease ## Project Layout -FlowMesh follows a multi-tier architecture (Server / Worker) with -shared schemas, SDK, and CLI packages. +FlowMesh follows a multi-tier architecture (Server / Worker) with shared +schemas, SDK, and CLI packages. See [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md) +for the topology diagram, communication protocols, directory map, and +runtime behavior (task merging, stickiness, context reuse, etc.). diff --git a/docs/API.md b/docs/API.md new file mode 100644 index 00000000..cebac686 --- /dev/null +++ b/docs/API.md @@ -0,0 +1,87 @@ +# API reference (common endpoints) + +Server runs at `http://localhost:8000` by default. The router source of +truth is `src/server/routers/v1/*.py`. + +Workflow and task statuses (used in payloads and filters): +`PENDING`, `DISPATCHED`, `FAILED`, `CANCELLED`, `DONE`. Worker statuses: +`STARTING`, `IDLE`, `BUSY`, `STOPPING`, `STOPPED`. + +## Workflows + +| Method | Path | Description | +|--------|------|-------------| +| POST | `/api/v1/workflows` | Submit a workflow. Body is YAML (`text/plain`) or JSON; set `Workflow-Format: n8n` for n8n graphs. | +| POST | `/api/v1/workflows/validate` | Parse without executing. | +| GET | `/api/v1/workflows` | List workflows (`workflow_id`, `owner`, `status`, cursor pagination). | +| GET | `/api/v1/workflows/{id}` | Workflow details + per-task summary. | +| GET | `/api/v1/workflows/{id}/logs` | Query logs (`limit`, `before`/`after` cursors). | +| GET | `/api/v1/workflows/{id}/logs/stream` | SSE log stream. | +| POST | `/api/v1/workflows/{id}/cancel` | Cancel a workflow and all in-flight tasks. | + +## Tasks + +| Method | Path | Description | +|--------|------|-------------| +| GET | `/api/v1/tasks` | List tasks. Filters: `workflow_id`, `status`, `task_type`, `assigned_worker`. | +| GET | `/api/v1/tasks/{id}` | Task details. | +| GET | `/api/v1/tasks/{id}/logs` | Query task logs. | +| GET | `/api/v1/tasks/{id}/logs/stream` | SSE task log stream. | + +## Results + +| Method | Path | Description | +|--------|------|-------------| +| POST | `/api/v1/results` | Submit task result (worker → server). | +| GET | `/api/v1/results/{task_id}` | Get task result JSON. | +| GET | `/api/v1/results/{task_id}/bundle` | Download tar.gz bundle (`?include=results,artifacts,logs,all`). | +| POST | `/api/v1/results/{task_id}/files` | Upload artifact (multipart). | +| GET | `/api/v1/results/{task_id}/files/{filename}` | Download artifact. | +| GET | `/api/v1/results/{task_id}/logs` | Download archived `logs.jsonl`. | + +## Traces + +| Method | Path | Description | +|--------|------|-------------| +| GET | `/api/v1/traces/workflows/{workflow_id}/{trace_type}` | Fetch workflow trace JSONL rows. `trace_type` is `spans`, `assets`, or `lineage`. | +| GET | `/api/v1/traces/workflows/analyze/{workflow_id}` | Run the trace analyzer and return a profile summary. | +| POST | `/api/v1/traces/tasks/{task_id}/{trace_type}` | Upload a per-task trace JSONL file. `trace_type` is `spans`, `assets`, or `lineage`. | + +## Workers and nodes + +| Method | Path | Description | +|--------|------|-------------| +| GET | `/api/v1/workers` | List workers. Filters: `alias`, `namespace`, `cluster`, `status`, `tags`. | +| GET | `/api/v1/workers/{id}` | Worker details + hardware. | +| GET | `/api/v1/nodes` | List nodes (supervisors). | +| POST | `/api/v1/nodes/register` | Register a node. | +| GET | `/api/v1/nodes/{id}/workers` | List workers under a node. | +| POST | `/api/v1/nodes/{id}/workers/register` | Register worker under node. | +| POST | `/api/v1/nodes/{id}/workers/{name}/{start,stop}` | Start/stop a worker. | + +`/api/v1/stack/workers/...` wraps node-registered workers with local-only +container lifecycle and is what `flowmesh stack worker {up,down,...}` +calls. + +## SSH + +| Method | Path | Description | +|--------|------|-------------| +| WS | `/api/v1/ssh/tasks/{task_id}/proxy` | WebSocket SSH proxy for proxy-mode SSH tasks. | +| GET | `/api/v1/ssh/connections` | List active server-audited SSH proxy/forward connections. | + +Server policy toggles: `ENABLE_SERVER_SSH_PROXY`, +`ENABLE_SERVER_SSH_FORWARD`, `ENABLE_SERVER_SSH_CONNECTION_AUDIT`. + +## System + +| Method | Path | Description | +|--------|------|-------------| +| GET | `/healthz` | Top-level health check. | +| GET | `/api/v1/system/metrics` | System metrics snapshot. | + +## Cursor pagination + +List endpoints (`/api/v1/workflows`, `/api/v1/tasks`, log queries) +accept `limit` and `before` / `after` cursors. The cursor is an opaque +base64 of `(timestamp, id)`; do not parse client-side. diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md new file mode 100644 index 00000000..2a3ca71c --- /dev/null +++ b/docs/ARCHITECTURE.md @@ -0,0 +1,129 @@ +# Architecture + +FlowMesh is a service fabric for running LLM agentic workflows on +distributed GPU workers. The server parses a workflow (YAML / JSON / n8n), +turns it into a DAG of tasks, dispatches each task to a worker, and +collects results and artifacts. + +## Workspace layout + +The codebase is a **uv workspace** with these packages: + +| Package | Path | Purpose | +|---------|------|---------| +| `flowmesh` (root) | `src/` | Server, Worker, shared schemas | +| `flowmesh-sdk` | `sdk/` | Public Python SDK | +| `flowmesh-sdk-stack` | `sdk/stack/` | Stack/node helpers | +| `flowmesh-cli` | `cli/` | Typer CLI (`flowmesh ...`) | +| `flowmesh-cli-stack` | `cli/stack/` | Stack deployment commands | + +## Topology + +``` +Client (CLI / SDK / API) ──▶ Server (FastAPI orchestrator, :8000) + │ + ├── Redis (control + telemetry pub/sub, log streams) + │ + └─▶ Supervisor (per-node) ──gRPC──▶ Worker (executor) +``` + +The runtime is two top-level processes: + +1. **Server** (`src/server/`) — FastAPI orchestrator at `:8000`. Hosts + workflow / task / dispatch logic and the **Supervisor subsystem** + (`src/server/supervisor/`), which manages per-node worker lifecycle, + runs the worker-facing gRPC server (`:50051`), and drives the + Docker / Vast.ai worker adapters. +2. **Worker** (`src/worker/`) — stateless executor. Connects to a + supervisor via gRPC, receives tasks, runs the matching executor, + reports results. + +## Communication + +- **server ↔ supervisor (same node)** — `multiprocessing.Queue`. +- **server ↔ supervisor (across nodes)** — Redis pub/sub. +- **supervisor ↔ worker** — bidirectional gRPC. Proto stubs at + `src/shared/grpc/supervisor/v1/`. +- **client ↔ server** — REST. + +## Object IDs + +3-char prefixes: `wfl-` workflows, `tsk-` tasks, `ssn-` SSH sessions, +`scn-` SSH connection rows, `cmd-` supervisor commands. Always use +`new_*_id()` helpers in `src/shared/utils/ids.py`. Never use `uuid4()` or +`secrets.token_hex` for IDs. + +## Task state machine + +`PENDING → DISPATCHED → (DONE | FAILED | CANCELLED)`. Retried tasks +cycle back to `PENDING` until exhausted. + +## Directory map + +``` +src/ + server/ FastAPI orchestrator + auth/ OSS no-op auth shim; delegates to IDENTITY_PROVIDERS chain + db/ SQLAlchemy session factory + migrations (no-op in OSS) + dispatcher/ Dispatch loop, worker selector, stage stickiness, context reuse + hooks/ Plugin extension ABCs + registries + main.py Entrypoint, FLOWMESH_PLUGINS loader, EventMonitor wiring + registries/ Worker / Node registries (Redis-backed) + routers/v1/ workflows, tasks, results, workers, nodes, ssh, stack, system + services/ monitoring, log streaming, ssh forwarding, runtime + supervisor/ Per-node agent (gRPC server, adapters, lifecycle) + task/ parser, runtime, models, merge / epoch helpers + shared/ + grpc/supervisor/v1/ Generated proto stubs (server + worker) + schemas/ Cross-cutting schemas + tasks/ Workflow/task spec models + utils/ JSON, parsing, time, ids + worker/ + docker/ Worker Dockerfiles (CPU + GPU) + executors/ Executor implementations + mixins/ data, governance, inference, training + utils/ artifacts, checkpoints, data_utils, distributed, + graph_templates, huggingface, safe_eval + runner.py Task lifecycle (execute, write results, upload artifacts) +cli/ Typer CLI (`flowmesh`) +sdk/ Public Python SDK +proto/ gRPC service definition +templates/ Example workflow YAMLs +tests/{server,worker,shared,cli,sdk}/ +scripts/dev/ compile_protos, sync_requirements, check_env_examples +``` + +## Key runtime behavior + +- **Task merging.** Compatible adjacent tasks in a DAG (same `taskType`, + model, hardware shape, and merge key) coalesce into a single dispatch. + Merged children ride on `WorkerTaskMessage.merged_children`; the worker + writes per-child results into `result.children`; the dispatcher fans + out synthetic `TASK_SUCCEEDED` / `TASK_FAILED` events. Disable with + `ENABLE_TASK_MERGE=false`. +- **Stage stickiness** (`ENABLE_STAGE_WEIGHT_STICKINESS=true`) — the + dispatcher pins stages that reference an upstream stage's checkpoint + to the worker that produced it, falling back to normal selection when + unavailable or stale. Mostly relevant for training pipelines reusing + on-disk checkpoints. +- **Context reuse.** Workers report cached models/datasets in their + `WorkerHardware`. The dispatcher's `_cached_worker_candidates` filters + to workers whose cache covers the task's references; entries older + than `WORKER_CACHE_TTL_SEC` are ignored. +- **Cursor pagination.** List endpoints accept `limit` and `before` / + `after` cursors. The cursor is an opaque base64 of `(timestamp, id)`; + do not parse client-side. +- **Redis channels.** The runtime uses three namespaces: + - `flowmesh:control:*` — control plane (task assignments, + cancellations, worker lifecycle). + - `flowmesh:telemetry:*` — telemetry (heartbeats, status updates). + - `flowmesh:logs:task:{task_id}` and + `flowmesh:logs:workflow:{wfl_id}` — log streams, bounded by + `LOG_STREAM_MAXLEN_TASK` / `LOG_STREAM_MAXLEN_WORKFLOW` and + expired `LOG_STREAM_TTL_SEC` after close. + +## Plugin extension points + +Server extension points are loaded via the `FLOWMESH_PLUGINS` env var. +Full contract, loader semantics, and a worked example live in +[`docs/PLUGINS.md`](PLUGINS.md). diff --git a/docs/CLI.md b/docs/CLI.md new file mode 100644 index 00000000..0bfd67eb --- /dev/null +++ b/docs/CLI.md @@ -0,0 +1,90 @@ +# CLI usage (`flowmesh`) + +The CLI entry point is `flowmesh` (provided by `flowmesh-cli`). Run any +subcommand with `-h` / `--help` for the full flag list — this doc covers +the common ones. Stack commands (`flowmesh stack ...`) come from the +separate `flowmesh-cli-stack` package and operate on the local Compose +stack and its containers; everything else works against any reachable +FlowMesh server. + +## Top-level groups + +``` +flowmesh info | health | logout +flowmesh workflow {submit, validate, list, info, watch, cancel, logs} +flowmesh task {list, info, watch, stop, logs} +flowmesh worker {list, info} +flowmesh node {list, info, worker {list, start, stop}} +flowmesh ssh {connect, run, proxy, connections} +flowmesh result {fetch, download} +flowmesh trace {fetch, analyze} +flowmesh system {metrics} +flowmesh stack {build, push, pull, pullall, up, down, restart, ps, logs} +flowmesh stack worker {up, start, stop, down, list, pull} +``` + +## Common workflows + +Submit a workflow and watch it: + +```bash +flowmesh workflow submit templates/echo_local.yaml +flowmesh workflow watch # blocks until DONE / FAILED +flowmesh workflow logs show # recent log entries +flowmesh workflow logs stream # SSE log stream +flowmesh workflow logs download -o logs/ +``` + +List, filter, paginate: + +```bash +flowmesh workflow list --status DONE +flowmesh task list --workflow-id --status FAILED +``` + +Pull results / artifacts: + +```bash +flowmesh result fetch # JSON result +flowmesh result download --include all -o bundle.tgz # tar.gz bundle +``` + +Fetch or analyze workflow traces: + +```bash +flowmesh trace fetch spans -o spans.jsonl +flowmesh trace analyze --format critical-path +``` + +`trace fetch` accepts `spans`, `assets`, or `lineage`. `trace analyze +--format` accepts `rich`, `critical-path` (`cp`), `end-to-end` (`e2e`), +`queuing`, `lineage`, or `json`. + +## Local stack lifecycle + +`.env` controls registry, version tag, and ports — see +`cli/stack/src/flowmesh_cli_stack/assets/.env.example`. Set +`FLOWMESH_VERSION` to a PR-identifying slug (e.g. `myfeature`) so parallel +PRs don't overwrite each other's local images. + +```bash +flowmesh stack up # Server + Redis + Supervisor +flowmesh stack worker up cpu 1 # 1 CPU worker +flowmesh stack worker up gpu --targets 0 # 1 GPU worker pinned to GPU 0 +flowmesh stack down +``` + +After changing executor code, rebuild the affected image before bringing +the stack back up — running containers don't pick up source changes: + +```bash +flowmesh stack build flowmesh_worker_cpu flowmesh_worker_gpu +``` + +## SSH tasks + +```bash +flowmesh ssh connect # interactive shell into an SSH task +flowmesh ssh run -- # one-shot exec +flowmesh ssh connections # list active proxy/forward connections +``` diff --git a/docs/CODE_STYLE.md b/docs/CODE_STYLE.md new file mode 100644 index 00000000..86b0b72d --- /dev/null +++ b/docs/CODE_STYLE.md @@ -0,0 +1,110 @@ +# Code style + +## Python + +- Python 3.12+ (see `[project]` in `pyproject.toml`). +- Top-level imports only; inline imports only to break circular imports. +- Prefer `typing.Any` over `object`; use `object` only when `Any` is + semantically wrong. +- Prefer `X | Y` and `X | None` over `typing.Union[X, Y]` / + `typing.Optional[X]`. +- Don't write `from __future__ import annotations` — use `typing.Self` or + quoted forward references instead. +- No `hasattr` / `getattr` that bypasses type checking. Use `isinstance` + guards. Acceptable `getattr` uses: dynamic dispatch, providing a + default, reaching into untyped third-party APIs. +- `# type: ignore[]` only after exhausting alternatives. + Never a bare `# type: ignore`. + +## Comments and docstrings + +- Default to **no comments**. Comment only when *why* is non-obvious. + Names self-document. +- Don't reference the current task / fix / caller in comments ("used by + X", "added for Y", "handles issue #123") — those rot. +- Docstrings describe what the function/module *does*, not what it + *replaced*. No "in-process replacement for X", no "previously did Y" — + read the docstring as if seeing the code for the first time. + +## Object IDs + +3-char prefixes: `wfl-`, `tsk-`, `ssn-`, `scn-`, `cmd-`. Always use +`new_*_id()` helpers in `src/shared/utils/ids.py`. Never use `uuid4()` +or `secrets.token_hex` for IDs. + +## Security rules (bandit-enforced) + +CI runs `bandit` with no severity / confidence threshold. Every finding +must have a source-level fix, a documented skip in `[tool.bandit]` in +`pyproject.toml`, or a per-line `# nosec BXXX` with a one-line written +rationale at the call site. A bare `# nosec` (no rule code, no reason) +is disallowed. + +When writing new code, follow these rules: + +- **B113** — every `requests.get/post/...` call passes `timeout=`. No + implicit defaults; hung connections are a DoS. +- **B202** — `tarfile.extractall(..., filter="data")` (Python 3.12+). + For zipfile, iterate `infolist()`, validate each member resolves under + the destination, extract per-member. Never `zipfile.extractall` on + untrusted archives. +- **B310** — don't use `urllib.request.urlopen`. Use `requests` and + validate the URL scheme (`http`/`https` only) before fetching. +- **B324** — `hashlib.md5(..., usedforsecurity=False)` is required for + cache-key / fingerprint use. Never MD5 across a security boundary. +- **B506** — `yaml.safe_load`, never `yaml.load(..., Loader=FullLoader)`. +- **B603** — every `subprocess.run/call/Popen/...` needs a per-line + `# nosec B603` with a one-line rationale (e.g. `argv list, no + shell=True, absolute path via shutil.which()`). The B404 import-level + rule is project-skipped because B602/B607 catch the actually-dangerous + patterns; B603 is enforced per-site so every shellout is visible at + the call line. +- **B607** — prefer the vendored SDK (`pynvml`, `docker-py`, `GitPython`) + over shelling out via `nvidia-smi` / `docker` / `git`. If shelling out + is unavoidable, the absolute path must be provided. +- **B614** — `torch.load(..., weights_only=True)`. Pickle deserialization + is RCE waiting to happen. +- **B701** — `Environment(autoescape=select_autoescape())`. The default + `False` is unsafe even for non-HTML templates. +- **B108** — use `tempfile.gettempdir()` or `tempfile.NamedTemporaryFile`. + The literal `"/tmp"` in Python source is forbidden; for an + in-container sentinel, build it from `PurePosixPath` segments. + +When a documented `[tool.bandit]` skip stops being true (e.g. a sandbox +stops being a sandbox), remove the skip and fix the call sites — don't +widen the skip list silently. + +## Dependency CVE scanning (pip-audit) + +CI runs `pip-audit` against each generated requirements file +(`src/server/requirements.txt`, +`src/worker/requirements/requirements.txt`, +`src/worker/requirements/requirements.gpu.txt`). The job lives in +`.github/workflows/security.yml`. + +When pip-audit reports a new CVE, the only real fix is to bump the +offending dep in `pyproject.toml`, then `uv lock` and `uv run +scripts/dev/sync_requirements.py --write`. Silencing via `--ignore-vuln` +is a last resort; every silenced GHSA needs a written upgrade-blocker. +The currently-ignored advisories and the upgrade blocker that justifies +each are listed below; the same list is encoded as `--ignore-vuln` +flags in `.github/workflows/security.yml`. + +| GHSA | Package | Fix version | Why ignored | +|------|---------|-------------|-------------| +| GHSA-69w3-r845-3855 | transformers | 5.0.0rc3 | held by vllm/vllm-omni 0.18 compatibility | +| GHSA-pf3h-qjgv-vcpr | vllm | 0.19.0 | held by transformers 4.57 + adjacent inference deps | +| GHSA-pq5c-rjhq-qp7p | vllm | 0.19.0 | same | +| GHSA-3mwp-wvh9-7528 | vllm | 0.19.0 | same | +| GHSA-cfh3-3jmp-rvhc | pillow | 12.1.1 | gradio 5.50 caps pillow<12 (transitive via vllm-omni) | +| GHSA-whj4-6x5x-4v2j | pillow | 12.2.0 | same cap | +| GHSA-vfmq-68hx-4jfw | lxml | 6.1.0 | crawl4ai 0.8.6 caps lxml<6 | +| GHSA-39mp-8hj3-5c49 | gradio | 6.7.0 | vllm-omni 0.18 pins gradio==5.50 | +| GHSA-h3h8-3v2v-rg7m | gradio | 6.6.0 | same pin | +| GHSA-jmh7-g254-2cq9 | gradio | 6.6.0 | same pin | +| GHSA-pfjf-5gxr-995x | gradio | 6.6.0 | same pin | +| GHSA-w8v5-vhqr-4h9v | diskcache | (none) | upstream unmaintained, no fixed version published | + +When a blocker lifts (e.g. transformers 5 ↔ vllm 0.19 line stabilizes), +drop the corresponding `--ignore-vuln` flag from the workflow and the +row from this table — don't extend the rationale to unrelated packages. diff --git a/docs/ENV.md b/docs/ENV.md new file mode 100644 index 00000000..efe5cf6b --- /dev/null +++ b/docs/ENV.md @@ -0,0 +1,59 @@ +# Environment variables (curated) + +The canonical declared set lives in +`cli/stack/src/flowmesh_cli_stack/env_schema.py` and is mirrored to +`cli/stack/src/flowmesh_cli_stack/assets/.env.example`. Run +`uv run scripts/dev/check_env_examples.py --write` after schema edits. + +The tables below curate the knobs you actually tune. Anything not +listed here is in `.env.example`. + +## Server + +| Variable | Default | Description | +|----------|---------|-------------| +| `REDIS_CONTROL_URL` | `redis://localhost:6379/0` | Redis control channel | +| `REDIS_TELEMETRY_URL` | `redis://localhost:6380/0` | Redis telemetry channel | +| `DATABASE_URL` | – | Postgres connection string | +| `RESULTS_DIR` | `./results` | Server-side results directory | +| `SERVER_HTTP_PORT` | `8000` | Public HTTP port | +| `SERVER_GRPC_PORT` | `50051` | Supervisor gRPC port | +| `ORCHESTRATOR_DISPATCH_MODE` | `adaptive` | Scheduler mode | +| `ORCHESTRATOR_WORKER_SELECTION` | `best_fit` | `best_fit`, `first_fit`, `min_satisfying` | +| `SCHEDULER_LAMBDA_INFERENCE` | `0.4` | Inference task weight | +| `SCHEDULER_LAMBDA_TRAINING` | `0.8` | Training task weight | +| `SCHEDULER_LAMBDA_OTHER` | `0.5` | Other-task weight | +| `SCHEDULER_SELECTION_JITTER` | `1e-3` | Tie-break jitter | +| `ENABLE_TASK_MERGE` | `true` | DAG-level task coalescing | +| `TASK_MERGE_MAX_BATCH_SIZE` | `4` | Max merged tasks per dispatch | +| `ENABLE_CONTEXT_REUSE` | `true` | Bias toward workers with cached models | +| `WORKER_CACHE_TTL_SEC` | `3600` | Cache metadata TTL | +| `ENABLE_STAGE_WEIGHT_STICKINESS` | `false` | Pin stages to checkpoint-producing workers | +| `ENABLE_WORKER_WATCHDOG` | `true` | Worker death detection | +| `WORKER_DEATH_GRACE_SEC` | `60` | Grace period before marking dead | +| `FLOWMESH_PLUGINS` | – | Comma-separated plugin module names | +| `LOG_LEVEL` | `INFO` | Server log level | + +## Worker + +| Variable | Default | Description | +|----------|---------|-------------| +| `WORKER_TOKEN` | – | Auth token for supervisor gRPC | +| `SUPERVISOR_GRPC_TARGET` | – | Supervisor gRPC endpoint | +| `RESULTS_DIR` | `./results_workers` | Task output directory | +| `WORKER_TAGS` | `` | Scheduler hints | +| `WORKER_COST_PER_HOUR` | `1.0` | Cost metadata | +| `WORKER_UPLOAD_RESULTS` | `false` | Upload results when no destination set | +| `HF_CACHE_DIR` | – | Shared HuggingFace cache mount | +| `HEARTBEAT_INTERVAL_SEC` | `30` | Heartbeat cadence | + +## Supervisor + +| Variable | Default | Description | +|----------|---------|-------------| +| `NODE_NAMESPACE` / `NODE_CLUSTER` / `NODE_ALIAS` | defaults | Identity | +| `NODE_TAGS` | `` | Scheduler hints (CSV) | +| `SUPERVISOR_GRPC_DISABLE_SERVER_TLS` | `false` | Local-only insecure gRPC | +| `SUPERVISOR_GRPC_KEEPALIVE_PERMIT_WITHOUT_CALLS` | `true` | gRPC keepalive | +| `SUPERVISOR_GRPC_EXTERNAL_PORT` | – | External port (when port-forwarded) | +| `SERVER_GRPC_TLS_*` | – | TLS certificate files | diff --git a/docs/EXECUTORS.md b/docs/EXECUTORS.md new file mode 100644 index 00000000..ed02e903 --- /dev/null +++ b/docs/EXECUTORS.md @@ -0,0 +1,39 @@ +# Task types and executor registry + +The worker resolves `spec.taskType` against an executor registry in +`src/worker/runner.py`. Built-in executors: + +| `taskType` | Executor | Use case | +|-----------|----------|----------| +| `echo` | `EchoExecutor` | Echo input back as result (smoke tests) | +| `inference` | `VLLMExecutor` / `TransformersExecutor` | LLM inference | +| `diffusion` | `DiffusersExecutor` | Image / video diffusion models | +| `omni_text2{audio,image,speech,general}` | `Omni*Executor` | Multimodal generation | +| `training` | `SFTExecutor` / `LoRASFTExecutor` / `DPOExecutor` / `PPOExecutor` | Fine-tuning | +| `rag` | `RAGExecutor` | Retrieval-augmented generation | +| `agent` | `AgentExecutor` | Tool-using LLM agent (utu / youtu-agent backend) | +| `data_profiling` | `DataProfilingExecutor` | DataFrame profiling | +| `data_retrieval` | `DataRetrievalExecutor` | DataFrame loading from sources | +| `ssh` | `SSHExecutor` | Interactive SSH session or non-interactive container job | + +Helper utilities live in `src/worker/executors/utils/` (`artifacts`, +`checkpoints`, `data_utils`, `distributed`, `graph_templates`, +`huggingface`, `safe_eval`). Cross-cutting behavior is in +`src/worker/executors/mixins/` (`data`, `governance`, `inference`, +`training`). + +## Agent executor (utu / youtu-agent) + +`AgentExecutor` requires the following env vars to run; the executor +asserts them at import time, so a worker without them fails the task +immediately: + +- `UTU_LLM_TYPE` — provider kind (e.g. `chat.completions`). +- `UTU_LLM_MODEL` — model identifier. +- `UTU_LLM_BASE_URL` — LLM endpoint base URL. +- `UTU_LLM_API_KEY` — LLM API key. + +Optional, for the search tools: + +- `SERPER_API_KEY` +- `JINA_API_KEY` diff --git a/docs/PLUGINS.md b/docs/PLUGINS.md new file mode 100644 index 00000000..69e8476d --- /dev/null +++ b/docs/PLUGINS.md @@ -0,0 +1,66 @@ +# Plugin extension points + +External integrations (auth, submission policy, usage tracking) plug +into the server through three protocol hooks defined in +`src/server/hooks/`: + +- `IdentityProvider` — resolve a bearer token to a `PrincipalContext` + (iterated from `auth/security.py`). With no providers registered, + auth is a no-op and `authenticate_api_key` returns a default admin + principal. +- `SubmissionGuard` — pre-submit precondition (iterated from + `routers/v1/workflows.py`). +- `UsageSink` — fan-out per-task usage rows after a task completes + (iterated from `services/monitoring.py`). Typical consumers: billing, + audit, observability. + +## How plugins are loaded + +A plugin is any Python module that exposes a top-level `install()`. The +server loads `FLOWMESH_PLUGINS` (comma-separated module names) inside +its FastAPI lifespan and treats `install()` as either: + +- a sync function returning `None` — the plugin appends its adapters to + the registries in `server.hooks` and returns; or +- an `@asynccontextmanager async def install()` — the plugin owns + resources with a lifecycle (a SQLAlchemy engine, an HTTP client, a + background task) that need teardown on server shutdown. The loader + holds an `AsyncExitStack`, enters each ctx-manager `install()` on + startup, and unwinds them in reverse order on shutdown. + +Plugins live anywhere on `sys.path` — in-tree under +`src/server//`, sibling-mounted under `/app/src//`, or a +pip-installable wheel. Core never references plugin module names; each +plugin self-filters internally. + +## Plugins with their own DB + +OSS ships no DB itself. Plugins that need persistence bring their own +engine and manage it via the ctx-manager `install()` form: + +```python +# myorg_auth_plugin/__init__.py +import os +from contextlib import asynccontextmanager +from sqlalchemy.ext.asyncio import async_sessionmaker, create_async_engine + +from server.hooks import IDENTITY_PROVIDERS + + +class _MyOrgAuth: + name = "myorg.auth" + def __init__(self, sessionmaker): self._sm = sessionmaker + async def resolve(self, raw_token, logger): + async with self._sm() as session: + ... + + +@asynccontextmanager +async def install(): + engine = create_async_engine(os.environ["MYORG_DATABASE_URL"]) + IDENTITY_PROVIDERS.append(_MyOrgAuth(async_sessionmaker(engine))) + try: + yield + finally: + await engine.dispose() +``` diff --git a/docs/SDK.md b/docs/SDK.md new file mode 100644 index 00000000..fa3615e2 --- /dev/null +++ b/docs/SDK.md @@ -0,0 +1,70 @@ +# SDK usage (Python) + +The Python SDK lives in `sdk/src/flowmesh/`. The public surface exposes +two clients — `FlowMesh` (sync) and `AsyncFlowMesh` (async) — both +configured the same way and exposing the same resource groups +(`workflows`, `tasks`, `workers`, `nodes`, `results`, `system`, +`ssh`, `traces`). Both handle pagination, error formatting, and SSE +streaming for you; prefer them over raw HTTP calls. + +## Sync client + +```python +from flowmesh import FlowMesh + +client = FlowMesh(base_url="http://localhost:8000", api_key="...") + +# Submit a workflow from a YAML file. +workflow_text = open("templates/echo_local.yaml").read() +wf = client.workflows.submit(workflow_text) + +# Stream logs until the workflow finishes. +for ev in client.workflows.stream_logs(wf.workflow_id): + print(ev.line) + +# Inspect tasks and pull a result. +tasks = client.tasks.list(workflow_id=wf.workflow_id) +result = client.results.get(tasks[0].task_id) +``` + +## Async client + +```python +from flowmesh import AsyncFlowMesh + +async with AsyncFlowMesh(base_url="...", api_key="...") as client: + wf = await client.workflows.submit(yaml_text) + async for ev in client.workflows.stream_logs(wf.workflow_id): + ... +``` + +## Common operations + +- **Submit YAML / JSON / n8n** — `client.workflows.submit(text_or_mapping)`; + pass `workflow_format="n8n"` for n8n graphs. +- **Validate without executing** — `client.workflows.validate(text_or_mapping)`. +- **List with filters and pagination** — `client.workflows.list(status=...)` + and `client.tasks.list(workflow_id=..., status=...)`; pass raw cursor + params with `query_params=[("before", cursor), ("limit", "100")]`. +- **Stream logs** — `client.workflows.stream_logs(wf_id)` and + `client.tasks.stream_logs(task_id)` yield server-sent events; the + iterator stops when the source closes. +- **Pull artifacts** — `client.results.get(task_id)` for the result + payload, `client.results.download_bundle(task_id, include="all")` + for the tar.gz. +- **Fetch and analyze traces** — `client.traces.fetch(wf_id, "spans")` + yields JSONL rows; `client.traces.analyze(wf_id)` returns a profile + summary. +- **Cancel** — `client.workflows.cancel(wf_id)`. + +## Cursor pagination + +Cursor-enabled calls take `limit` and `before` / `after` params. +Cursors are opaque base64 strings — pass them through; do not parse +them. + +## Errors + +The SDK raises `flowmesh.FlowMeshError` (and a small set of subclasses +for auth / not-found / rate-limit) instead of returning HTTP error +shapes. Wrap your calls accordingly. diff --git a/docs/WORKFLOWS.md b/docs/WORKFLOWS.md new file mode 100644 index 00000000..c2fa4b6d --- /dev/null +++ b/docs/WORKFLOWS.md @@ -0,0 +1,81 @@ +# Workflow YAML format + +Workflows are submitted as YAML (or JSON) to `POST /api/v1/workflows` +(see [`docs/API.md`](API.md)). The `templates/` directory contains +runnable examples for each shape; this page documents the spec +hierarchy and the cross-cutting features. + +## Single task + +```yaml +apiVersion: flowmesh/v1 +kind: InferenceTask +metadata: + name: hello-inference +spec: + taskType: inference + resources: + hardware: { gpu: { type: any, count: 1 } } + model: + source: { type: huggingface, identifier: TinyLlama/TinyLlama-1.1B-Chat-v1.0 } + vllm: { gpu_memory_utilization: 0.5 } + data: + type: list + items: + - - role: user + content: What is the capital of France? + inference: { max_tokens: 64, temperature: 0.0 } + output: + destination: { type: http } +``` + +## Multi-stage DAG + +```yaml +apiVersion: flowmesh/v1 +kind: Workflow +spec: + stages: + - name: extract + spec: + taskType: inference + ... + data: + type: list + items: + - - role: user + content: "Extract entities from: {{input}}" + - name: summarize + dependsOn: [extract] + spec: + taskType: inference + ... + data: + type: list + items: + - - role: user + content: "Summarize: {{extract.output}}" +``` + +`spec.stages[].dependsOn` declares the DAG edges; the dispatcher +schedules each stage once all of its dependencies are `DONE`. +Substitutions like `{{extract.output}}` are resolved against the +upstream stage's result. + +## Graph DAG + +`taskType: graph_template` — topology-aware multi-input prompts with +parent output substitution and validation. See +`src/worker/executors/utils/graph_templates.py` for the templating +contract. + +## Schedule hints + +Workflows can declare scheduling preferences via +`metadata.annotations.schedule_hint`: + +- `epoch_groups: [[, ...], ...]` — epoch-ordered execution; + tasks in epoch `n` only dispatch after every task in epoch `n-1` + succeeds. +- `schedule_in_epoch_order: true` — for dependent DAGs, prefer + position-in-epoch tie-breaks during dispatch. diff --git a/scripts/ci/check_pr_title.py b/scripts/ci/check_pr_title.py index e6e02915..33c5d9c9 100644 --- a/scripts/ci/check_pr_title.py +++ b/scripts/ci/check_pr_title.py @@ -9,7 +9,7 @@ print("❌ PR title is empty.") sys.exit(1) -ALLOWED_TYPES = ["feat", "fix", "refactor", "chore", "test", "perf"] +ALLOWED_TYPES = ["feat", "fix", "refactor", "chore", "test", "perf", "docs"] # Strip optional [1/N] progress prefix progress_match = re.match(r"^\[\d+/[\dNn]+\]\s*(.+)$", pr_title, re.IGNORECASE) @@ -25,7 +25,7 @@ core_title = pr_title is_breaking = False -# Validate type: feat, fix, refactor, chore, test, perf +# Validate type: feat, fix, refactor, chore, test, perf, docs types_re = "|".join(re.escape(t) for t in ALLOWED_TYPES) type_match = re.match(rf"^({types_re}):\s+.+$", core_title, re.IGNORECASE) if not type_match: