feat: data agent connector for lumid.data #19
Merged
489b8ca to
6d8c5ba
Wire ``data.type == "agent"`` into ``DataRetrievalExecutor`` so a worker task can describe what it wants in natural language plus a schema scope and let lumid.data's ``/retrieve/v1`` plan + replay the chain server-side. Each item carries the materialized DataFrame and the typed access chain so a downstream consumer binds to either via the existing ``path: items.X`` resolver. Worker delivery wiring: ``analytics`` extra picks up the ``lumid-data-sdk`` git+ pin; ``sync_requirements.py`` regex extended for the PEP 508 ``name @ git+url`` form; the security workflow filters ``@ git+`` deps before ``pip-audit --strict`` since PyPI doesn't carry git-source deps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
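The requirement-line handling described in this commit can be sketched as follows. This is a minimal illustration, not the code in ``sync_requirements.py`` or the security workflow: the pattern ``REQ_NAME``, the helpers ``package_name`` / ``strip_git_deps``, and the ``example.invalid`` URL are all hypothetical names chosen for the example.

```python
import re
from typing import List, Optional

# Hypothetical pattern: a distribution name, optional extras, then an
# optional PEP 508 direct reference of the form "name @ git+url".
REQ_NAME = re.compile(
    r"^(?P<name>[A-Za-z0-9][A-Za-z0-9._-]*)"  # distribution name
    r"(?:\[[^\]]*\])?"                        # optional extras, e.g. [socks]
    r"\s*(?P<direct>@\s*git\+\S+)?"           # optional "@ git+url" reference
)


def package_name(line: str) -> Optional[str]:
    """Return the distribution name of a requirement line, or None."""
    match = REQ_NAME.match(line.strip())
    return match.group("name") if match else None


def is_git_dep(line: str) -> bool:
    """True when the line pins a package to a git source."""
    match = REQ_NAME.match(line.strip())
    return bool(match and match.group("direct"))


def strip_git_deps(lines: List[str]) -> List[str]:
    """Drop git-source requirements, which a PyPI-based audit cannot resolve."""
    return [line for line in lines if not is_git_dep(line)]


requirements = [
    "pandas==2.2.0",
    "lumid-data-sdk @ git+https://example.invalid/org/repo.git@main",
    "# a comment line",
]
audit_input = strip_git_deps(requirements)
```

Here ``audit_input`` keeps the ``pandas`` pin and the comment but not the git-pinned SDK, which mirrors the filtering step the workflow performs before ``pip-audit --strict``.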
Two-stage workflow: agent retrieval against lumid.data, then a Qwen 1.5B summary node consuming the materialized table and access chain. One ``flowmesh workflow submit`` exercises the connector and downstream consumption end-to-end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
6d8c5ba to
31ddd4a
Drop ``schema_scope`` from the executor's required-keys validation and let the connector forward ``None`` to the SDK so a workflow can omit the field when it wants lumid.data to default to all visible schemas. The e2e template drops its explicit scope to exercise the new path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
Track ``a800051`` so the SDK pin reflects PR #8 (optional ``schema_scope``) on lumid.data main. Wire contract is unchanged from the FlowMesh side — the connector already passes ``None`` through ``model_dump(exclude_none=True)``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
kaiitunnz
requested changes
May 6, 2026
Collaborator
kaiitunnz
left a comment
A few minor comments.
6fd20d9 to
0bc2ca3
… mismatch Single-node deployments can share the results volume between the server (root) and supervisor-spawned workers (appuser). Both call sync_manifest, so the prior direct write_text raced to EACCES on the second writer when the manifest was already owned by the first writer's UID. prepare_output_dir now chmods each managed directory to 0o0777 (best-effort, tolerant of cross-UID ownership). sync_manifest writes the manifest with write_text and then chmods it to 0o0666 so the next sync_manifest call from a peer UID can overwrite the file directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
…prompts Aggregate templates that bind a per-row pd.DataFrame value into the prompt rendered the cell via tabulate's to_markdown, which falls back to pandas' default __str__ on each DataFrame entry. The default 80-col display width clipped middle columns to '...', so the consumer LLM only saw the first and last few columns of any wide retrieval result. Wrap the to_markdown sites in pd.option_context with max_columns/width/ max_colwidth set to None so DataFrames render in full regardless of the calling environment's pandas display defaults. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
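A small sketch of the display-option effect this commit relies on. It uses ``repr`` rather than ``to_markdown`` to avoid depending on tabulate, and it forces a narrow ``display.max_columns`` to reproduce the clipping deterministically; the frame ``wide`` and its column names are made up for the example. Note that ``pd.option_context`` takes alternating option-name / value arguments.

```python
import pandas as pd

# Hypothetical wide result: eight columns, two rows.
wide = pd.DataFrame({f"metric_{i}": [i, i + 1] for i in range(8)})

# Force a tight column cap so the clipping shows up regardless of the
# environment's own pandas display defaults.
with pd.option_context("display.max_columns", 4, "display.width", 80):
    clipped = repr(wide)

# Lift the caps: None disables the column, width, and cell-width limits.
with pd.option_context(
    "display.max_columns", None,
    "display.width", None,
    "display.max_colwidth", None,
):
    full = repr(wide)
```

``clipped`` shows only the outer columns with a ``...`` placeholder between them, while ``full`` contains every column name. The options are restored when each ``with`` block exits, so the change does not leak into other rendering code.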
Four cleanups raised in code review:
- Move the duplicated ``prepare_output_dir`` / ``sync_manifest`` (and helpers) from ``src/server/utils/manifest.py`` and ``src/worker/utils/manifest.py`` into a single ``src/shared/utils/manifest.py`` exposing everything either side uses (including the worker-only ``scratch_dir`` / ``SCRATCH_DIR``); rewrite every call site to import from ``shared.utils.manifest`` and move the helper tests under ``tests/shared/utils/``.
- Drop two unnecessary ``# type: ignore`` comments on ``self._normalize_params`` calls in ``data_retrieval_executor``; mypy resolves them cleanly without an escape.
- Replace the ``# type: ignore[import-untyped]`` on ``lumid_data.sdk.Client`` with a ``follow_untyped_imports`` override in ``pyproject.toml`` so the override applies to the whole SDK and goes away as soon as upstream ships type stubs.
- Name the worker-CPU pip-audit input file ``/tmp/requirements-worker-cpu-audit.txt`` so its purpose reads at a glance.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
GHSA-x368-4g9h-fvv4 (vllm 0.18.0, fix 0.19.1) and GHSA-83vm-p52w-f9pw (vllm 0.18.0, fix 0.20.0) join the existing list — both are blocked by the same transformers 4.57 / inference-deps pin that already keeps the other vllm advisories on the ignore list. GHSA-j7w6-vpvq-j3gm (diffusers 0.36.0, fix 0.38.0) is added separately: diffusers 0.38 requires safetensors>=0.8.0rc0, which uv lock refuses to resolve without an explicit pre-release opt-in. Holding the floor at 0.36 and ignoring until safetensors ships a non-rc 0.8. Update the upgrade-blocker table in CODE_STYLE.md alongside the workflow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
0bc2ca3 to
b3d8fb5
Replaces the tempfile + os.replace approach with a direct write_text + chmod 0o0666 on the manifest, and a guarded mkdir + chmod 0o0777 on the output directories. Both rely on the file/dir's owner being the only caller that needs to run chmod, which holds because sync_manifest is the sole writer of the manifest and prepare_output_dir's chmod runs only on creation. The "best-effort" PermissionError swallow on chmod is gone — a chmod failure now propagates loudly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
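The direct-write approach in this commit can be sketched roughly as below, assuming a POSIX filesystem. The helper names ``prepare_dir`` and ``write_shared_text`` are hypothetical stand-ins, not the signatures of ``prepare_output_dir`` or ``sync_manifest``.

```python
import stat
import tempfile
from pathlib import Path

DIR_MODE = 0o777   # world-writable directory
FILE_MODE = 0o666  # world-writable manifest


def prepare_dir(path: Path) -> Path:
    """Create `path`; chmod it only when this call created it."""
    try:
        path.mkdir(parents=True)  # raises FileExistsError if already present
    except FileExistsError:
        return path  # created earlier, possibly by a peer UID: leave it alone
    # Explicit chmod, because the mode passed to mkdir is masked by the umask.
    path.chmod(DIR_MODE)
    return path


def write_shared_text(path: Path, text: str) -> None:
    """Write in place, then open the mode; a chmod failure is not swallowed."""
    path.write_text(text, encoding="utf-8")
    path.chmod(FILE_MODE)


with tempfile.TemporaryDirectory() as tmp:
    out_dir = prepare_dir(Path(tmp) / "results")
    prepare_dir(out_dir)  # second call finds the directory and skips chmod
    manifest = out_dir / "manifest.json"
    write_shared_text(manifest, '{"files": []}')
    dir_mode = stat.S_IMODE(out_dir.stat().st_mode)
    file_mode = stat.S_IMODE(manifest.stat().st_mode)
```

Because only the owner of a file or directory (or root) may change its mode, the sketch chmods the directory on creation only, and any ``PermissionError`` from ``chmod`` surfaces to the caller instead of being ignored.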
b3d8fb5 to
7bf2e63
kaiitunnz
requested changes
May 7, 2026
…arties Promote ``shared.utils.manifest`` 's tempfile + os.replace path to a standalone ``shared.utils.atomic.atomic_write_text`` helper and apply it to every file the server and the worker can both write: the per-task ``manifest.json`` and ``results.json``. Each write goes through a tempfile in the same directory, gets chmodded to ``0o0666``, and is swapped in via ``os.replace`` so a peer-UID writer can replace it without permission issues and a crash mid-write leaves either the old file or the new one — never a half-written one. Drop the manifest-permission/overwrite tests in ``tests/worker/test_task_output.py`` that the shared-utils suite already covers, and add a one-line comment over the two ``pd.option_context`` sites in graph_templates so the intent of the width-cap toggles is obvious. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
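A minimal sketch of the tempfile + ``os.replace`` pattern this commit describes, assuming a POSIX filesystem. The function below is an illustration written for this example and may differ from the actual ``shared.utils.atomic.atomic_write_text``; the ``mode`` parameter and the temp-file naming are assumptions.

```python
import contextlib
import os
import stat
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, text: str, mode: int = 0o666) -> None:
    """Write `text` to `path` so readers see the old or new file, never a partial one."""
    # The temp file lives in the target's directory so os.replace stays on
    # one filesystem, where the rename is atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=f".{path.name}.", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before the swap
        os.chmod(tmp_name, mode)       # mkstemp creates the file as 0o600
        os.replace(tmp_name, path)     # atomic swap over any existing file
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)        # do not leave a stray temp file behind
        raise


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "manifest.json"
    atomic_write_text(target, '{"version": 1}')
    atomic_write_text(target, '{"version": 2}')  # overwrite via a fresh temp file
    final_text = target.read_text(encoding="utf-8")
    final_mode = stat.S_IMODE(target.stat().st_mode)
    leftovers = sorted(p.name for p in Path(tmp).iterdir())
```

Replacing a file through a rename needs write permission on the containing directory rather than on the file being replaced, which is why this pattern works for a peer UID only when the directory itself is writable by both parties.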
kaiitunnz
approved these changes
May 7, 2026
Purpose
This PR adds data access through our data plane (lumid.data), which supports natural-language data retrieval via a data agent. A workflow node describes what it wants in natural language, optionally constrains the search to a schema scope, and gets back the materialized rows alongside the canonical SQL chain the agent emitted, so a downstream node can bind to either.
Changes
- src/worker/connectors/agent_connector.py (new) + connectors/__init__.py: thin wrapper around lumid_data.sdk.Client.retrieve_to_file.
- src/worker/executors/data_retrieval_executor.py: adds the data.type == "agent" branch and emits per-row items[] with table + access chain + run metadata; schema_scope is optional.
- tests/worker/test_agent_connector.py (new): connector tests with mocked SDK.
- pyproject.toml, uv.lock, src/worker/requirements/requirements.txt: add lumid-data-sdk to the analytics extra; mypy follow_untyped_imports for lumid_data.sdk[.*].
- scripts/dev/sync_requirements.py: accept the name @ git+url form in the package-name regex.
- .github/workflows/security.yml: strip @ git+ deps before pip-audit --strict; rename the temp file to /tmp/requirements-worker-cpu-audit.txt; ignore two new vllm advisories (GHSA-x368-4g9h-fvv4, GHSA-83vm-p52w-f9pw) blocked by the same transformers 4.57 pin.
- templates/data_retrieval_agent.yaml (new): two-stage e2e: agent retrieval → Qwen 1.5B summary.
- src/shared/utils/manifest.py (new) + deletions of src/{server,worker}/utils/manifest.py: single home for prepare_output_dir / sync_manifest. prepare_output_dir creates each directory at 0o0777, and sync_manifest chmods the manifest to 0o0666 after each write, so single-node deployments where the server (root) and worker (appuser) share the results volume can both overwrite the manifest from either UID.
- src/worker/executors/utils/graph_templates.py: wrap the to_markdown sites in pd.option_context with max_columns, width, and max_colwidth set to None, so nested DataFrame cells render in full instead of pandas' default '...'-clipped layout.
Test Plan
Test Result
Output from the first node:
Output of the second node:
{
  "ok": true,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "items": [
    {
      "output": [
        "The headline trend across the rows for the leading metric is a significant increase in revenue and earnings per share (EPS) from Q4 FY2025 to Q1 FY2025, followed by a decline in Q2 FY2025. This dataset enables analysts to assess NVDA's financial health and performance trends over the past 10 quarters, providing insights into its revenue growth, profitability, and market valuation."
      ]
    }
  ],
  "usage": {
    "prompt_tokens": 1518,
    "completion_tokens": 89,
    "total_tokens": 1607,
    "latency_sec": 0.30570387840270996,
    "num_requests": 1
  },
  "_artifacts": {
    "base_dir": "/var/lib/flowmesh-results/tsk-***",
    "base_url": null
  }
}