DataForge is a production-grade autonomous data analysis agent built with FastAPI and a multi-provider LLM routing layer. Upload any CSV dataset and DataForge executes a 9-step analytical workflow — generating statistical KPIs, publication-quality charts, domain research, and a final analyst brief — all streamed live to a dashboard UI. After the workflow completes, the agent enters an interactive chat mode for follow-up questions and custom visualisations.
Dataforge.Demo.mp4
The project started with a prompt written from scratch: initial prompt. This prompt defined DataForge's identity, tool-calling syntax, card output schema, provider failover chain, and the nine-step workflow. It was deliberately minimal — a skeleton that described what the agent should do and how it should respond, without the polish needed for production LLMs.
The initial prompt was submitted to a Prompt Evaluation framework. More details can be found in the link. Claude assessed the prompt against nine criteria:
| Criterion | Initial Prompt |
|---|---|
| Explicit Reasoning | ✅ <thinking> blocks required |
| Structured Output | ✅ Pydantic JSON + tool-call format |
| Tool Separation | ✅ Reasoning vs. computation cleanly split |
| Conversation Loop | ✅ Post-workflow chat mode defined |
| Instructional Framing | ✅ Tool format, script rules, workflow steps |
| Internal Self-Checks | ❌ No verification gates before card push |
| Reasoning Type Awareness | ❌ All steps undifferentiated |
| Error Handling / Fallbacks | ✅ (partial) Infrastructure only; no analytical fallbacks |
| Overall Clarity | Strong |
Score: 6.5 / 9. The evaluation surfaced two critical gaps: (1) no self-verification protocol before pushing cards — errors in early steps could propagate silently; (2) no reasoning type taxonomy — different steps require different modes of thinking with different failure modes.
The gaps were closed in the final prompt, which added:
- A six-value reasoning type taxonomy (
statistical,data-quality,domain-research,synthesis,visualisation,planning) — each<thinking>block must declare which type applies - A full Self-Verification Protocol gating every card push (plausibility checks, file existence, source quality, causal language guards)
- An Analytical Edge-Case Decision Matrix covering 8 real-world scenarios (high missingness, low-cardinality numerics, insufficient data for scatter, inconclusive web search, and more)
- An example card payload to give weaker models in the failover chain a concrete reference
With the final prompt in hand, the ai-agents-architect skill was invoked to expand the prompt into a full system architecture: Architecture Plan.md. The plan covers the agent loop design, component structure, tool definitions, provider failover chain, failure scenarios, MLflow tracing integration, memory schema, and a 7-phase implementation blueprint. The codebase was then built against this plan.
CSV Upload → 9-Step Workflow → Live Dashboard → Interactive Chat
- Scan & Plan — Reads the first 50 rows, infers shape and types, identifies columns to exclude or reclassify, and emits a
StepPlanCardlisting every subsequent step with its reasoning type. - Dataset Summary — Runs a pre-computed column profile and asks the LLM to write a 3–5 sentence prose narrative and per-column descriptions. Pushes a
DataSummaryCard. - Domain Research — Issues 2–3 targeted web search queries derived from column names, evaluates source quality, and synthesises domain context into a
ResearchCard. If results are inconclusive the card says so. - Statistical KPIs — Computes descriptive stats, missing value percentages, and a correlation matrix via a pre-run Python script. Pushes a
KPICardwith all values plausibility-checked. - Correlation Heatmap — Generates an annotated correlation heatmap for all numeric columns. Pushes a
VisualizationCardwith a data-driven interpretation. - Distribution Histograms — Generates histograms (log-transformed for skewed columns) for up to 8 numeric columns in a single card.
- Categorical Bar Charts — Horizontal bar charts for categorical columns with 2–20 unique values (falls back to top-20 if higher cardinality).
- Scatter Plots — Scatter plot of the most correlated numeric pair (only if
|r| ≥ 0.1and at least 2 valid columns exist), with regression line. - Analyst Insights Brief — Synthesises all prior findings into an executive brief using associative language (never causal), cross-referenced against the KPIs from step 4.
After step 9, the agent enters chat mode. Users can request custom charts, new statistics, follow-up research, or clarifications. Chat responses are nested under the session's MLflow run for full traceability.
DataForge System
════════════════
┌───────────────┐ ┌──────────────────────────┐
│ Browser │ POST /upload ──► │ │
│ Dashboard │ │ FastAPI Application │
│ (HTML) │ ◄── SSE /cards ── │ (uvicorn + asyncio) │
└───────────────┘ POST /chat └─────────────┬────────────┘
│
▼
┌───────────────────────┐
│ Orchestrator │
│ 9-step workflow │
└───────────┬───────────┘
│
▼
┌───────────────────────┐
│ Step Executor │
│ Pre-compute │
│ → LLM loop │
│ → Parse │
│ → Verify │
│ → Push │
└───────────┬───────────┘
│
▼
┌───────────────────────┐
│ Provider Router │
│ NIM → Groq │
│ Cerebras → Mistral │
│ Gemini │
│ (sequential failover)│
└───────────────────────┘
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────────┐
│ Session Store │ │ MLflow Tracer │ │ Tool Registry │
│ (in-memory) │ │ (SQLite DB) │ │ execute_python │
└─────────────────┘ └─────────────────┘ │ search │
└─────────────────────┘
DataForge uses the ReAct (Reason + Act) pattern inside every step. Each iteration of the LLM loop follows:
REASON → ACT → OBSERVE → [repeat or emit card]
flowchart TD
A[Build messages with session context] --> B[LLM call via provider router]
B --> C{Response contains\nFUNCTION_CALL?}
C -- Yes --> D[Dispatch tool\nexecute_python / web_search / fetch_url]
D --> E[Append TOOL_RESULT to messages]
E --> B
C -- No --> F{Response contains\nvalid card JSON?}
F -- No --> G[Append format correction message]
G --> B
F -- Yes --> H{card_type matches\nexpected for step?}
H -- No --> I[Append card_type correction message]
I --> B
H -- Yes --> J[verify_card: self-verification gate]
J -- Failed --> K[Append verification issues]
K --> B
J -- Passed --> L[push_card → SSE queue]
L --> M[Return StepResult]
B -- MaxIterations --> N[push ErrorCard]
For computational steps (steps 3–8), the system pre-runs a Python script before entering the LLM loop, injecting the result as a DATA_PROFILE message. This offloads the heavy computation to deterministic Python code and lets weaker LLMs in the failover chain focus purely on interpretation and schema compliance.
sequenceDiagram
participant U as User
participant API as FastAPI
participant O as Orchestrator
participant E as Step Executor
participant R as Router
participant T as Tools
participant M as MLflow
U->>API: POST /upload (CSV file)
API->>O: run_workflow(session_id, dataset_path)
M<<->>O: start_session_run()
loop Steps 1–9
O->>M: start_step_run(parent=session)
O->>E: execute_step(step_id, state)
opt Pre-computation (steps 2–8)
E->>T: execute_python(profile_script)
T-->>E: DATA_PROFILE JSON
end
loop ReAct iterations (max 5)
E->>R: route(messages)
R-->>E: LLMResponse
M<<->>E: log_llm_call()
opt Tool call in response
E->>T: dispatch_tool()
T-->>E: observation
M<<->>E: log_tool_call()
end
end
E->>API: push_card (SSE)
API-->>U: Server-Sent Event
O->>M: finish_step_run(status, provider, tokens)
end
O->>M: finish_session_run()
API-->>U: analysis_complete event
U->>API: POST /chat (follow-up question)
API->>M: start_step_run(parent=session, "Chat")
Every <thinking> block must declare one of six reasoning types. The ExplicitReasoner validates the block is present and logs a warning if the type doesn't match the step's expectation.
| Step | Name | Expected Reasoning Type |
|---|---|---|
| step_01 | Scan & Plan | planning |
| step_02 | Dataset Summary | synthesis |
| step_03 | Domain Research | domain-research |
| step_04 | Statistical KPIs | statistical |
| step_05 | Correlation Heatmap | visualisation |
| step_06 | Distribution Histograms | visualisation |
| step_07 | Categorical Bar Charts | visualisation |
| step_08 | Scatter Plots | visualisation |
| step_09 | Analyst Insights Brief | synthesis |
Different types have different failure modes: statistical reasoning needs plausibility checks, domain research needs source credibility checks, synthesis needs causal-language guards. Tagging reasoning type makes those checks step-specific rather than generic.
For every step that requires data analysis, the executor pre-runs a purpose-built Python script before calling the LLM, injecting the result as context:
flowchart LR
A[Step starts] --> B{Step has\npre-compute?}
B -- No --> C[LLM loop immediately]
B -- Yes --> D[Run Python script\nvia execute_python]
D --> E{Script\nsucceeded?}
E -- Yes --> F[Inject DATA_PROFILE\ninto messages]
F --> G[LLM formats data\ninto card schema]
E -- No --> H[Log warning]
H --> C
G --> I[Card pushed]
C --> I
| Step | Pre-Computed Data |
|---|---|
| step_02 | Column profiles: dtypes, missing %, unique counts, sample values |
| step_03 | Shape, column names, dtypes, top categorical values for query building |
| step_04 | Full KPI package: stats, correlations, missing fractions |
| step_05–08 | Dataset exploration profile + chart image (generated ahead of the LLM call) |
For steps 5–8, the chart PNG is generated before the LLM is even called. The LLM only needs to write the interpretation — it cannot get the image path wrong because it's injected verbatim.
flowchart LR
A[route messages] --> B{NIM\nconfigured?}
B -- Yes --> C[Try NIM]
C -- OK --> Z[Return response]
C -- Auth fail --> skip1[ ]
C -- Rate limit --> D
B -- No --> D
D{Groq?} -- Yes --> E[Try Groq]
E -- OK --> Z
E -- fail --> F
D -- No --> F
F{Cerebras?} -- Yes --> G[Try Cerebras]
G -- OK --> Z
G -- fail --> H
F -- No --> H
H{Mistral?} -- Yes --> I[Try Mistral]
I -- OK --> Z
I -- fail --> J
H -- No --> J
J{Gemini?} -- Yes --> K[Try Gemini]
K -- OK --> Z
K -- fail --> L[AllProvidersFailedError]
J -- No --> L
Each provider call uses temperature=0.2 for determinism. Context windows are respected per-provider (8k for Cerebras, 32k for Groq/Mistral, 128k for NIM, 1M for Gemini). Messages are trimmed oldest-first to fit within 85% of the provider's limit.
Context limits:
| Provider | Limit |
|---|---|
| NVIDIA NIM | 128,000 tokens |
| Groq | 32,768 tokens |
| Cerebras | 8,192 tokens |
| Mistral | 32,768 tokens |
| Gemini | 1,000,000 tokens |
Runs a self-contained Python script in a subprocess with a configurable timeout (default 90s). The tool:
- Prepends UTF-8 encoding declarations and
matplotlib.use("Agg")to every script - Blocks dangerous patterns:
subprocess,os.system,eval,exec,socket,urllib,requests - Parses stdout as JSON into the
resultfield of the response - Returns
{stdout, stderr, exit_code, result, error}— all failures are surfaced, never swallowed
# LLM emits:
FUNCTION_CALL: execute_python
ARGUMENTS: {"script": "import pandas as pd\nimport json\n..."}
END_FUNCTION_CALL
# Tool returns:
{
"result": {"rows": 16598, "cols": 11, ...},
"exit_code": 0,
"error": null
}Dual-engine search with automatic quality scoring:
- Primary: Exa neural search API
- Fallback: Firecrawl
- Quality tiers:
preferred(.gov, .edu, arxiv.org, WHO, etc.),acceptable,low
Results include title, url, snippet, quality, and published_date. The LLM is instructed to prefer preferred sources in its ResearchCard.
Fetches a URL and extracts clean text. Only accepts http/https. Strips HTML, collapses whitespace, truncates to max_chars (default 2000). 15-second timeout.
Enqueues a serialised card to an in-memory asyncio.Queue keyed by session_id. The SSE endpoint /cards/{session_id} polls this queue and streams each card as an event to the browser. Retries up to 3 times if the queue is full.
@dataclass
class SessionState:
session_id: str
dataset_path: str
dataset_shape: tuple[int, int]
dataset_encoding: str
completed_steps: list[str]
failed_steps: list[str]
pushed_card_ids: list[str]
excluded_columns: list[str]
reclassified_columns: dict[str, str]
kpi_summary: dict[str, Any]
top_correlated_pairs: list[tuple[str, str, float]]
conversation_history: list[dict]
sampling_applied: boolState is held in a module-level dict[session_id → SessionState] and updated after each step completes. The conversation_history field accumulates user/assistant turns during chat mode, truncated to the last 10 messages to prevent context growth.
Before any card is pushed, verify_card(card, step_id) runs type-specific checks:
| Card Type | Checks |
|---|---|
KPICard |
top_missing values are 0–1 floats; top_correlated_pairs are valid 3-tuples |
VisualizationCard |
image_path is non-empty; rationale is present; key_insight is present |
ResearchCard |
At least one source with a non-empty URL |
DataSummaryCard (step_09) |
No causal language ("causes", "leads to", etc.) in prose_summary |
Failures append the issue list to the conversation and the LLM must re-emit. This is the second line of defence after the pre-computation scripts — the agent cannot push a card it hasn't validated.
Pydantic v2 is used at every layer of the stack.
class Settings(BaseSettings):
model_config = SettingsConfigDict(env_file=".env")
nvidia_nim_api_key: str = Field(default="", alias="NVIDIA_NIM_API_KEY")
groq_api_key: str = Field(default="", alias="GROQ_API_KEY")
mlflow_tracking_uri: str = Field(default="sqlite:///mlflow.db", alias="MLFLOW_TRACKING_URI")
max_iterations_per_step: int = 5
large_dataset_row_threshold: int = 500_000
correlation_threshold: float = 0.3
# ... 30+ settingspydantic-settings reads from .env, environment variables, and defaults in priority order. Every numeric threshold used in the analysis scripts — sampling size, scatter plot cutoff, missing value threshold — is configurable here.
All six card types inherit from BaseCard. The schema enforces exact Literal values for card_type so parse_card() can dispatch to the right class without ambiguity.
LLM type coercion — LLMs occasionally emit type mismatches (e.g. "11.1%" for a field typed float, a dict where a string is expected, or a list as a string). A @model_validator(mode="before") on BaseCard handles all of these before Pydantic's own validation runs:
@model_validator(mode="before")
@classmethod
def coerce_field_types(cls, data: dict) -> dict:
for field_name, field_info in cls.model_fields.items():
annotation = field_info.annotation
origin = get_origin(annotation)
args = get_args(annotation)
val = data.get(field_name)
if val is None:
continue
# str field received dict/list → join to markdown string
if annotation is str or str in flat_args(annotation):
if not isinstance(val, str):
data[field_name] = _coerce_str(val)
# list field received str → parse JSON or wrap
elif origin is list:
if not isinstance(val, list):
data[field_name] = _coerce_list(val)
# dict field received str → parse JSON
elif origin is dict:
if not isinstance(val, dict):
data[field_name] = _coerce_dict(val)
# dict[str, float] with "11.1%" string values → 0.111
if len(args) == 2 and args[1] in (float, int):
data[field_name] = _coerce_dict_values(data[field_name], args[1])
return dataStepSpec has its own @model_validator that backfills name from step_id when the LLM omits it — a common failure mode across weaker providers in the chain.
The upload endpoint uses Form/UploadFile directly. The card endpoints return list[dict] serialised from the Pydantic models. The chat endpoint receives a ChatRequest model:
class ChatRequest(BaseModel):
message: str
session_id: strEvery session and every step is tracked in MLflow using the explicit runs API (MlflowClient) rather than the context-manager API — necessary because the context-manager uses thread-local state that is incompatible with async code.
dataforge_2026-05-14_15-08-22 (session parent run)
├── step_01_Scan-Plan (nested step run)
├── step_02_Dataset-Summary
├── step_03_Domain-Research
│ └── artifacts/
│ └── llm_calls/
│ ├── iter_00_request.json
│ └── iter_00_response.json
├── step_04_Statistical-KPIs
├── step_05_Correlation-Heatmap
│ └── artifacts/
│ ├── llm_calls/
│ └── charts/
│ └── step_05_viz.png
├── step_06_Distribution-Histograms
├── step_07_Categorical-Bar-Charts
├── step_08_Scatter-Plots
├── step_09_Analyst-Insights-Brief
└── step_chat_01 (nested chat run, same parent)
| Artifact / Metric | Where |
|---|---|
| LLM request messages (JSON) | llm_calls/iter_NN_request.json |
| LLM response content (JSON) | llm_calls/iter_NN_response.json |
| Tool call + result (JSON) | tool_calls/NN_tool_name.json |
| Chart images (PNG) | charts/ |
input_tokens, output_tokens |
MLflow metrics (per-step index) |
latency_ms |
MLflow metrics |
provider, model |
MLflow tags |
card_type, iterations_used |
Step run tags |
status (complete/failed) |
Run status |
All tracer calls are wrapped in try/except — a tracing failure never aborts the workflow.
Run the app with python run.py and open http://localhost:5001. The MLflow UI shows the dataforge experiment with one run per session. Click a session run to see all 9 nested step runs and their artifacts.
flowchart LR
A[Session run created\nby orchestrator] --> B[Step run created\nbefore execute_step]
B --> C[Each LLM call\nlogged as artifact]
B --> D[Each tool call\nlogged as artifact]
B --> E[Chart PNG\nlogged as artifact]
B --> F[Metrics:\ntokens, latency]
B --> G[Step run finished\nwith status + tags]
G --> H[Next step...]
H --> I[Session run finished\nwith counts + status]
The dashboard is a single-page HTML/CSS/JS application (src/dataforge/static/index.html) served statically by FastAPI. It uses no build step and no framework — vanilla JS with a dark Indigo theme.
Cards are pushed to the browser as Server-Sent Events. The JavaScript opens an EventSource to /cards/{session_id} and appends each card to the DOM as it arrives:
const es = new EventSource(`/cards/${sessionId}`);
es.onmessage = (event) => {
const card = JSON.parse(event.data);
renderCard(card); // append to main content
};Each card type is rendered differently:
StepPlan→ collapsible step list with step IDs and reasoning typesDataSummaryCard→ prose paragraph + column tableKPICard→ stat tiles (row count, numeric/categorical column counts) + missing value barsVisualizationCard→ full-width chart image + interpretation textResearchCard→ source links with quality badges + synthesis paragraphErrorCard→ red alert with error type and recovery hint
After analysis_complete fires, the chat input unlocks. Messages are POST /chat with {session_id, message}. Responses are a single card pushed through the same SSE channel, so chat replies appear inline in the card stream.
dataforge/
├── run.py # Launch script (MLflow UI + uvicorn)
├── pyproject.toml # Dependencies, build config
├── design_prompts/
│ ├── dataforge_initial_prompt.md # v1 system prompt
│ ├── dataforge_prompt_evaluation.md # Claude's evaluation of v1
│ └── dataforge_final_prompt.md # v2 system prompt (production)
├── skills/
│ ├── data_analysis_agent_skill.md
│ ├── dataset_research_skill.md
│ └── visualization_selection_skill.md
├── Architecture Plan.md # Full system design document
└── src/dataforge/
├── config.py # Settings (pydantic-settings)
├── __main__.py # uvicorn entrypoint
├── agent/
│ ├── orchestrator.py # 9-step workflow runner
│ ├── step_executor.py # ReAct loop + pre-computation
│ ├── card_builder.py # JSON → card type dispatch
│ ├── self_verifier.py # Pre-push verification gate
│ ├── conversation.py # Chat mode handler
│ └── prompts.py # SYSTEM_PROMPT + per-step hints
├── api/
│ ├── main.py # FastAPI app + lifespan
│ └── routes/
│ ├── upload.py # POST /upload → workflow trigger
│ ├── cards.py # GET /cards/{id} SSE stream
│ ├── chat.py # POST /chat
│ └── traces.py # GET /api/traces (MLflow)
├── memory/
│ └── session_store.py # In-memory SessionState registry
├── reasoning/
│ ├── explicit_reasoner.py # <thinking> block validator
│ └── types.py # ReasoningType enum
├── router/
│ ├── failover.py # Sequential provider failover
│ └── providers/
│ ├── nvidia_nim.py
│ ├── groq.py
│ ├── cerebras.py
│ ├── mistral.py
│ └── gemini.py
├── schemas/
│ └── cards.py # All card types + coercion validators
├── tools/
│ ├── execute_python.py # Sandboxed subprocess runner
│ ├── web_search.py # Exa / Firecrawl search
│ ├── fetch_url.py # HTML-stripped URL fetcher
│ └── push_card.py # SSE queue pusher
├── tracing/
│ └── mlflow_tracer.py # MLflow span management
└── static/
└── index.html # Dashboard UI
The skills/ directory contains three behavioural reference documents that are embedded directly into the agent's system prompt (via src/dataforge/agent/prompts.py). They are not user-facing documentation — they are machine-readable specifications that shape how the LLM reasons, selects visualisations, and researches datasets. Each skill targets a specific failure mode identified during the prompt evaluation phase.
What it is: A comprehensive standards document that defines how the DataForge agent must approach every analytical step. It establishes a priority order (understand the data → discover insights → create visualisations → provide recommendations) and maps each phase to a specific workflow step and card type.
What it does:
- Defines the 4-Part Insight Framework that every card field (
key_insight,interpretation,synthesis) must satisfy:WHAT— state the finding with actual numbersWHY— hypothesise mechanism, hedged with "may suggest" / "is consistent with"IMPACT— significance for the domain or analysis goalACTION— a concrete, specific next step
- Sets the Quantification Standard: forbids adjective-only findings ("strongly correlated") and requires numerical equivalents ("r = 0.74, meaning 55% of variance is shared")
- Specifies the Step 09 Synthesis Card structure (Executive Summary → Key Insights → Data Quality Notes → Suggested Next Steps) and the rule that synthesis must elevate and integrate prior findings rather than repeat them
- Documents chart-type eligibility rules and per-step use cases (heatmap, histograms, bar charts, scatter)
- Defines an edge-case table covering high missingness, low-cardinality numerics, small datasets, and all-weak-correlation scenarios
How it helps: Without this skill, weaker LLMs in the failover chain produce vague, adjective-heavy insight text or duplicate observations across cards. The skill is a quality floor — it gives every provider in the chain the same rubric, so the output quality degrades gracefully rather than catastrophically when failover kicks in.
Where it's invoked: The skill content is injected as a section of the system prompt in src/dataforge/agent/prompts.py. It applies globally to all 9 steps and to chat mode. The key_insight, interpretation, and prose_summary fields in src/dataforge/schemas/cards.py are the primary card fields this skill governs. The self-verifier in src/dataforge/agent/self_verifier.py enforces the causal-language guard (step_09 DataSummaryCard) that originates from this skill.
What it is: A targeted playbook for step_03 (ResearchCard). It defines an exact ReAct sequence for profiling the dataset, building ranked search queries, evaluating results, and degrading gracefully when searches fail.
What it does:
- Specifies the full ReAct sequence for step_03:
- Profile the data with
execute_python(shape, dtypes, top values, missing percentages) - Infer domain from the profile, build 3 ranked queries (specific → domain-level → broadest fallback)
- Execute searches in order, evaluate each result, fall back when results are irrelevant
- Synthesise into a ResearchCard using data-profile facts when web results fail
- Profile the data with
- Defines a Query Construction Strategy with examples showing how column names drive queries (e.g.
upvote_ratio + subreddit→"Reddit posts dataset upvote_ratio subreddit NLP analysis") and explicitly bans generic templates like"dataset analysis" - Documents Result Evaluation Criteria (source quality tiers:
preferred/acceptable/low, snippet relevance signals) - Defines 4 ResearchCard status values (
complete,partial,data-only,inconclusive) with strict rules — most importantly,"complete"with an emptysourcesarray is invalid - Specifies the 5-paragraph synthesis structure (what it is → what it measures → use cases → data quality → analytical potential)
How it helps: Step 03 is the only step where the agent reaches out to the web and must evaluate source credibility. Without this skill, the agent either invents dataset facts or uses generic queries that return useless results. The ranked-query strategy and explicit fallback logic ensure the step always produces a useful card — even when all 3 searches fail — by falling back to a data-profile-only description.
Where it's invoked: The skill is embedded as a step-specific hint for step_03 in src/dataforge/agent/prompts.py, injected alongside the step-level ReAct loop in src/dataforge/agent/step_executor.py. The web_search and fetch_url tools in src/dataforge/tools/ are the execution layer this skill coordinates. The ResearchCard schema in src/dataforge/schemas/cards.py (with its status, sources, queries_attempted, and synthesis fields) directly mirrors the skill's output specification.
What it is: A decision framework for steps 05–08 that governs which columns are charted, which chart type is used, and whether a visualisation is warranted at all. It exists to prevent the most common LLM visualisation failure: producing charts because data is available rather than because they reveal something meaningful.
What it does:
- Defines a Column Priority Framework with three tiers:
HIGH— always worth visualising (outcome variables, key business drivers, time columns, segmentation columns)MEDIUM— only if they show a strong relationship with HIGH columnsLOW / AVOID— IDs, index columns, near-zero variance columns
- Defines a 5-question Insight Validation Gate that must score ≥3 YES before
execute_pythonis called to generate a chart. If the gate fails, the step produces anErrorCardexplaining why — a chart that cannot justify its existence should not be produced - Specifies step-specific guidance for each visualisation step:
- step_05 (heatmap): filter to HIGH-priority columns; only render if ≥2 pairs have |r| > 0.25 with domain logic
- step_06 (histograms): select 4–8 HIGH-priority numeric columns by analytical importance, not by column order
- step_07 (bar charts): only columns with 2–20 unique values; horizontal bars ordered by frequency
- step_08 (scatter): one business-relevant pair; regression line if |r| ≥ 0.3; colour-code by segmentation if sub-group patterns exist
- Defines an anti-patterns reference with specific bad patterns (pairplot of every feature, correlation heatmap of all numeric columns,
"Distribution of column X"titles) and their correct alternatives
How it helps: Visualisation steps are the most likely to produce shallow output when run by a weaker provider. Without the column priority framework, an LLM will include ID or index columns in heatmaps, generate histograms in column order rather than by analytical importance, and produce charts titled "Distribution of column 3". The insight validation gate is a hard gate — if the data genuinely does not support a meaningful scatter plot, the correct output is an ErrorCard, not a weak chart.
Where it's invoked: The skill is embedded as the _viz_react_hint in src/dataforge/agent/prompts.py and is injected for every visualisation step (05–08). The pre-computation scripts in src/dataforge/agent/step_executor.py generate the chart PNG before the LLM loop, but the column-selection decisions and insight gate happen inside the LLM reasoning pass guided by this skill. The VisualizationCard and ErrorCard schemas in src/dataforge/schemas/cards.py are the output types the skill governs.
- Python 3.11+
- uv package manager
git clone <repo>
cd dataforge
uv syncCopy .env.example to .env and add at least one API key:
# At least one provider required
NVIDIA_NIM_API_KEY=nvapi-...
GROQ_API_KEY=gsk_...
CEREBRAS_API_KEY=csk-...
MISTRAL_API_KEY=...
GEMINI_API_KEY=AIza...
# Optional search tools
EXA_API_KEY=...
FIRECRAWL_API_KEY=...
# Overrides
MLFLOW_TRACKING_URI=sqlite:///mlflow.db
APP_PORT=8000
MLFLOW_UI_PORT=5001
WORKSPACE_DIR=/tmp/workspacepython run.pyThis starts both services:
- DataForge → http://localhost:8000
- MLflow UI → http://localhost:5001
- Open http://localhost:8000
- Click Upload and select a CSV file
- Watch cards stream in as the 9-step workflow runs
- When complete, type a question in the chat input
- Open http://localhost:5001 to explore traces
| Variable | Default | Description |
|---|---|---|
NVIDIA_NIM_MODEL |
meta/llama-3.1-70b-instruct |
NIM model name |
GROQ_MODEL |
llama-3.3-70b-versatile |
Groq model name |
CEREBRAS_MODEL |
llama3.1-70b |
Cerebras model |
MISTRAL_MODEL |
mistral-large-latest |
Mistral model |
GEMINI_MODEL |
gemini-1.5-pro |
Gemini model |
MAX_ITERATIONS_PER_STEP |
5 |
Max ReAct loop iterations |
STEP_TIMEOUT_SECONDS |
600 |
Per-step hard timeout |
SCRIPT_TIMEOUT_SECONDS |
90 |
execute_python timeout |
MAX_TOKENS_PER_CALL |
8192 |
LLM max_tokens |
MAX_COST_CENTS_PER_SESSION |
500 |
Cost circuit-breaker |
LARGE_DATASET_ROW_THRESHOLD |
500,000 |
Triggers sampling |
SAMPLE_SIZE |
100,000 |
Rows after sampling |
CORRELATION_THRESHOLD |
0.3 |
Min |
HIGH_MISSING_THRESHOLD |
0.5 |
Missing % to exclude column |
MAX_CATEGORICAL_UNIQUE |
30 |
Max unique values for bar chart |
| Layer | Technology |
|---|---|
| Web framework | FastAPI + uvicorn |
| Schema validation | Pydantic v2 |
| Settings | pydantic-settings |
| LLM providers | NVIDIA NIM, Groq, Cerebras, Mistral, Gemini |
| Data analysis | pandas, numpy, scipy, scikit-learn |
| Visualisation | matplotlib, seaborn |
| Web search | Exa, Firecrawl |
| Observability | MLflow (SQLite backend) |
| Async runtime | asyncio (Python 3.11+) |
| Streaming | SSE via sse-starlette |
| Package manager | uv |
| UI | Vanilla HTML/CSS/JS (no build step) |