Skip to content

mkthoma/dataforge

Repository files navigation

DataForge — Autonomous Data Analysis Agent

DataForge is a production-grade autonomous data analysis agent built with FastAPI and a multi-provider LLM routing layer. Upload any CSV dataset and DataForge executes a 9-step analytical workflow — generating statistical KPIs, publication-quality charts, domain research, and a final analyst brief — all streamed live to a dashboard UI. After the workflow completes, the agent enters an interactive chat mode for follow-up questions and custom visualisations.


🎬 Demo video

Dataforge.Demo.mp4

Raw File


Origin Story

Step 1 — Initial Prompt

The project started with a prompt written from scratch: initial prompt. This prompt defined DataForge's identity, tool-calling syntax, card output schema, provider failover chain, and the nine-step workflow. It was deliberately minimal — a skeleton that described what the agent should do and how it should respond, without the polish needed for production LLMs.

Step 2 — Prompt Evaluation with Claude

The initial prompt was submitted to a Prompt Evaluation framework. More details can be found in the link. Claude assessed the prompt against nine criteria:

Criterion Initial Prompt
Explicit Reasoning <thinking> blocks required
Structured Output ✅ Pydantic JSON + tool-call format
Tool Separation ✅ Reasoning vs. computation cleanly split
Conversation Loop ✅ Post-workflow chat mode defined
Instructional Framing ✅ Tool format, script rules, workflow steps
Internal Self-Checks ❌ No verification gates before card push
Reasoning Type Awareness ❌ All steps undifferentiated
Error Handling / Fallbacks ✅ (partial) Infrastructure only; no analytical fallbacks
Overall Clarity Strong

Score: 6.5 / 9. The evaluation surfaced two critical gaps: (1) no self-verification protocol before pushing cards — errors in early steps could propagate silently; (2) no reasoning type taxonomy — different steps require different modes of thinking with different failure modes.

Step 3 — Final Prompt

The gaps were closed in the final prompt, which added:

  • A six-value reasoning type taxonomy (statistical, data-quality, domain-research, synthesis, visualisation, planning) — each <thinking> block must declare which type applies
  • A full Self-Verification Protocol gating every card push (plausibility checks, file existence, source quality, causal language guards)
  • An Analytical Edge-Case Decision Matrix covering 8 real-world scenarios (high missingness, low-cardinality numerics, insufficient data for scatter, inconclusive web search, and more)
  • An example card payload to give weaker models in the failover chain a concrete reference

Step 4 — Architecture

With the final prompt in hand, the ai-agents-architect skill was invoked to expand the prompt into a full system architecture: Architecture Plan.md. The plan covers the agent loop design, component structure, tool definitions, provider failover chain, failure scenarios, MLflow tracing integration, memory schema, and a 7-phase implementation blueprint. The codebase was then built against this plan.


What DataForge Does

CSV Upload → 9-Step Workflow → Live Dashboard → Interactive Chat
  1. Scan & Plan — Reads the first 50 rows, infers shape and types, identifies columns to exclude or reclassify, and emits a StepPlanCard listing every subsequent step with its reasoning type.
  2. Dataset Summary — Runs a pre-computed column profile and asks the LLM to write a 3–5 sentence prose narrative and per-column descriptions. Pushes a DataSummaryCard.
  3. Domain Research — Issues 2–3 targeted web search queries derived from column names, evaluates source quality, and synthesises domain context into a ResearchCard. If results are inconclusive the card says so.
  4. Statistical KPIs — Computes descriptive stats, missing value percentages, and a correlation matrix via a pre-run Python script. Pushes a KPICard with all values plausibility-checked.
  5. Correlation Heatmap — Generates an annotated correlation heatmap for all numeric columns. Pushes a VisualizationCard with a data-driven interpretation.
  6. Distribution Histograms — Generates histograms (log-transformed for skewed columns) for up to 8 numeric columns in a single card.
  7. Categorical Bar Charts — Horizontal bar charts for categorical columns with 2–20 unique values (falls back to top-20 if higher cardinality).
  8. Scatter Plots — Scatter plot of the most correlated numeric pair (only if |r| ≥ 0.1 and at least 2 valid columns exist), with regression line.
  9. Analyst Insights Brief — Synthesises all prior findings into an executive brief using associative language (never causal), cross-referenced against the KPIs from step 4.

After step 9, the agent enters chat mode. Users can request custom charts, new statistics, follow-up research, or clarifications. Chat responses are nested under the session's MLflow run for full traceability.


Architecture Overview

                         DataForge System
                         ════════════════

  ┌───────────────┐                    ┌──────────────────────────┐
  │    Browser    │  POST /upload  ──► │                          │
  │   Dashboard   │                    │    FastAPI Application   │
  │    (HTML)     │ ◄── SSE /cards ──  │    (uvicorn + asyncio)   │
  └───────────────┘   POST /chat       └─────────────┬────────────┘
                                                      │
                                                      ▼
                                          ┌───────────────────────┐
                                          │      Orchestrator     │
                                          │    9-step workflow    │
                                          └───────────┬───────────┘
                                                      │
                                                      ▼
                                          ┌───────────────────────┐
                                          │     Step Executor     │
                                          │  Pre-compute          │
                                          │  → LLM loop           │
                                          │  → Parse              │
                                          │  → Verify             │
                                          │  → Push               │
                                          └───────────┬───────────┘
                                                      │
                                                      ▼
                                          ┌───────────────────────┐
                                          │    Provider Router    │
                                          │  NIM  →  Groq         │
                                          │  Cerebras  →  Mistral │
                                          │  Gemini               │
                                          │  (sequential failover)│
                                          └───────────────────────┘

  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────────┐
  │  Session Store  │  │  MLflow Tracer  │  │   Tool Registry     │
  │  (in-memory)    │  │  (SQLite DB)    │  │  execute_python     │
  └─────────────────┘  └─────────────────┘  │  search             │
                                            └─────────────────────┘

Agent Design

The ReAct Loop

DataForge uses the ReAct (Reason + Act) pattern inside every step. Each iteration of the LLM loop follows:

REASON → ACT → OBSERVE → [repeat or emit card]
flowchart TD
    A[Build messages with session context] --> B[LLM call via provider router]
    B --> C{Response contains\nFUNCTION_CALL?}
    C -- Yes --> D[Dispatch tool\nexecute_python / web_search / fetch_url]
    D --> E[Append TOOL_RESULT to messages]
    E --> B
    C -- No --> F{Response contains\nvalid card JSON?}
    F -- No --> G[Append format correction message]
    G --> B
    F -- Yes --> H{card_type matches\nexpected for step?}
    H -- No --> I[Append card_type correction message]
    I --> B
    H -- Yes --> J[verify_card: self-verification gate]
    J -- Failed --> K[Append verification issues]
    K --> B
    J -- Passed --> L[push_card → SSE queue]
    L --> M[Return StepResult]
    B -- MaxIterations --> N[push ErrorCard]
Loading

For computational steps (steps 3–8), the system pre-runs a Python script before entering the LLM loop, injecting the result as a DATA_PROFILE message. This offloads the heavy computation to deterministic Python code and lets weaker LLMs in the failover chain focus purely on interpretation and schema compliance.

The 9-Step Workflow

sequenceDiagram
    participant U as User
    participant API as FastAPI
    participant O as Orchestrator
    participant E as Step Executor
    participant R as Router
    participant T as Tools
    participant M as MLflow

    U->>API: POST /upload (CSV file)
    API->>O: run_workflow(session_id, dataset_path)
    M<<->>O: start_session_run()

    loop Steps 1–9
        O->>M: start_step_run(parent=session)
        O->>E: execute_step(step_id, state)

        opt Pre-computation (steps 2–8)
            E->>T: execute_python(profile_script)
            T-->>E: DATA_PROFILE JSON
        end

        loop ReAct iterations (max 5)
            E->>R: route(messages)
            R-->>E: LLMResponse
            M<<->>E: log_llm_call()

            opt Tool call in response
                E->>T: dispatch_tool()
                T-->>E: observation
                M<<->>E: log_tool_call()
            end
        end

        E->>API: push_card (SSE)
        API-->>U: Server-Sent Event
        O->>M: finish_step_run(status, provider, tokens)
    end

    O->>M: finish_session_run()
    API-->>U: analysis_complete event
    U->>API: POST /chat (follow-up question)
    API->>M: start_step_run(parent=session, "Chat")
Loading

Reasoning Type System

Every <thinking> block must declare one of six reasoning types. The ExplicitReasoner validates the block is present and logs a warning if the type doesn't match the step's expectation.

Step Name Expected Reasoning Type
step_01 Scan & Plan planning
step_02 Dataset Summary synthesis
step_03 Domain Research domain-research
step_04 Statistical KPIs statistical
step_05 Correlation Heatmap visualisation
step_06 Distribution Histograms visualisation
step_07 Categorical Bar Charts visualisation
step_08 Scatter Plots visualisation
step_09 Analyst Insights Brief synthesis

Different types have different failure modes: statistical reasoning needs plausibility checks, domain research needs source credibility checks, synthesis needs causal-language guards. Tagging reasoning type makes those checks step-specific rather than generic.

Pre-Computation Pattern

For every step that requires data analysis, the executor pre-runs a purpose-built Python script before calling the LLM, injecting the result as context:

flowchart LR
    A[Step starts] --> B{Step has\npre-compute?}
    B -- No --> C[LLM loop immediately]
    B -- Yes --> D[Run Python script\nvia execute_python]
    D --> E{Script\nsucceeded?}
    E -- Yes --> F[Inject DATA_PROFILE\ninto messages]
    F --> G[LLM formats data\ninto card schema]
    E -- No --> H[Log warning]
    H --> C
    G --> I[Card pushed]
    C --> I
Loading
Step Pre-Computed Data
step_02 Column profiles: dtypes, missing %, unique counts, sample values
step_03 Shape, column names, dtypes, top categorical values for query building
step_04 Full KPI package: stats, correlations, missing fractions
step_05–08 Dataset exploration profile + chart image (generated ahead of the LLM call)

For steps 5–8, the chart PNG is generated before the LLM is even called. The LLM only needs to write the interpretation — it cannot get the image path wrong because it's injected verbatim.


Component Reference

Provider Router (src/dataforge/router/)

flowchart LR
    A[route messages] --> B{NIM\nconfigured?}
    B -- Yes --> C[Try NIM]
    C -- OK --> Z[Return response]
    C -- Auth fail --> skip1[ ]
    C -- Rate limit --> D
    B -- No --> D
    D{Groq?} -- Yes --> E[Try Groq]
    E -- OK --> Z
    E -- fail --> F
    D -- No --> F
    F{Cerebras?} -- Yes --> G[Try Cerebras]
    G -- OK --> Z
    G -- fail --> H
    F -- No --> H
    H{Mistral?} -- Yes --> I[Try Mistral]
    I -- OK --> Z
    I -- fail --> J
    H -- No --> J
    J{Gemini?} -- Yes --> K[Try Gemini]
    K -- OK --> Z
    K -- fail --> L[AllProvidersFailedError]
    J -- No --> L
Loading

Each provider call uses temperature=0.2 for determinism. Context windows are respected per-provider (8k for Cerebras, 32k for Groq/Mistral, 128k for NIM, 1M for Gemini). Messages are trimmed oldest-first to fit within 85% of the provider's limit.

Context limits:

Provider Limit
NVIDIA NIM 128,000 tokens
Groq 32,768 tokens
Cerebras 8,192 tokens
Mistral 32,768 tokens
Gemini 1,000,000 tokens

Tool Registry (src/dataforge/tools/)

execute_python

Runs a self-contained Python script in a subprocess with a configurable timeout (default 90s). The tool:

  • Prepends UTF-8 encoding declarations and matplotlib.use("Agg") to every script
  • Blocks dangerous patterns: subprocess, os.system, eval, exec, socket, urllib, requests
  • Parses stdout as JSON into the result field of the response
  • Returns {stdout, stderr, exit_code, result, error} — all failures are surfaced, never swallowed
# LLM emits:
FUNCTION_CALL: execute_python
ARGUMENTS: {"script": "import pandas as pd\nimport json\n..."}
END_FUNCTION_CALL

# Tool returns:
{
  "result": {"rows": 16598, "cols": 11, ...},
  "exit_code": 0,
  "error": null
}

web_search

Dual-engine search with automatic quality scoring:

  • Primary: Exa neural search API
  • Fallback: Firecrawl
  • Quality tiers: preferred (.gov, .edu, arxiv.org, WHO, etc.), acceptable, low

Results include title, url, snippet, quality, and published_date. The LLM is instructed to prefer preferred sources in its ResearchCard.

fetch_url

Fetches a URL and extracts clean text. Only accepts http/https. Strips HTML, collapses whitespace, truncates to max_chars (default 2000). 15-second timeout.

push_card

Enqueues a serialised card to an in-memory asyncio.Queue keyed by session_id. The SSE endpoint /cards/{session_id} polls this queue and streams each card as an event to the browser. Retries up to 3 times if the queue is full.

Memory & Session State (src/dataforge/memory/)

@dataclass
class SessionState:
    session_id: str
    dataset_path: str
    dataset_shape: tuple[int, int]
    dataset_encoding: str
    completed_steps: list[str]
    failed_steps: list[str]
    pushed_card_ids: list[str]
    excluded_columns: list[str]
    reclassified_columns: dict[str, str]
    kpi_summary: dict[str, Any]
    top_correlated_pairs: list[tuple[str, str, float]]
    conversation_history: list[dict]
    sampling_applied: bool

State is held in a module-level dict[session_id → SessionState] and updated after each step completes. The conversation_history field accumulates user/assistant turns during chat mode, truncated to the last 10 messages to prevent context growth.

Self-Verification Gate (src/dataforge/agent/self_verifier.py)

Before any card is pushed, verify_card(card, step_id) runs type-specific checks:

Card Type Checks
KPICard top_missing values are 0–1 floats; top_correlated_pairs are valid 3-tuples
VisualizationCard image_path is non-empty; rationale is present; key_insight is present
ResearchCard At least one source with a non-empty URL
DataSummaryCard (step_09) No causal language ("causes", "leads to", etc.) in prose_summary

Failures append the issue list to the conversation and the LLM must re-emit. This is the second line of defence after the pre-computation scripts — the agent cannot push a card it hasn't validated.


Pydantic Throughout

Pydantic v2 is used at every layer of the stack.

Settings (src/dataforge/config.py)

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    nvidia_nim_api_key: str = Field(default="", alias="NVIDIA_NIM_API_KEY")
    groq_api_key: str = Field(default="", alias="GROQ_API_KEY")
    mlflow_tracking_uri: str = Field(default="sqlite:///mlflow.db", alias="MLFLOW_TRACKING_URI")
    max_iterations_per_step: int = 5
    large_dataset_row_threshold: int = 500_000
    correlation_threshold: float = 0.3
    # ... 30+ settings

pydantic-settings reads from .env, environment variables, and defaults in priority order. Every numeric threshold used in the analysis scripts — sampling size, scatter plot cutoff, missing value threshold — is configurable here.

Card Schemas (src/dataforge/schemas/cards.py)

All six card types inherit from BaseCard. The schema enforces exact Literal values for card_type so parse_card() can dispatch to the right class without ambiguity.

LLM type coercion — LLMs occasionally emit type mismatches (e.g. "11.1%" for a field typed float, a dict where a string is expected, or a list as a string). A @model_validator(mode="before") on BaseCard handles all of these before Pydantic's own validation runs:

@model_validator(mode="before")
@classmethod
def coerce_field_types(cls, data: dict) -> dict:
    for field_name, field_info in cls.model_fields.items():
        annotation = field_info.annotation
        origin = get_origin(annotation)
        args = get_args(annotation)
        val = data.get(field_name)
        if val is None:
            continue

        # str field received dict/list → join to markdown string
        if annotation is str or str in flat_args(annotation):
            if not isinstance(val, str):
                data[field_name] = _coerce_str(val)

        # list field received str → parse JSON or wrap
        elif origin is list:
            if not isinstance(val, list):
                data[field_name] = _coerce_list(val)

        # dict field received str → parse JSON
        elif origin is dict:
            if not isinstance(val, dict):
                data[field_name] = _coerce_dict(val)
            # dict[str, float] with "11.1%" string values → 0.111
            if len(args) == 2 and args[1] in (float, int):
                data[field_name] = _coerce_dict_values(data[field_name], args[1])
    return data

StepSpec has its own @model_validator that backfills name from step_id when the LLM omits it — a common failure mode across weaker providers in the chain.

API Request/Response

The upload endpoint uses Form/UploadFile directly. The card endpoints return list[dict] serialised from the Pydantic models. The chat endpoint receives a ChatRequest model:

class ChatRequest(BaseModel):
    message: str
    session_id: str

MLflow Tracing

Every session and every step is tracked in MLflow using the explicit runs API (MlflowClient) rather than the context-manager API — necessary because the context-manager uses thread-local state that is incompatible with async code.

Run Hierarchy

dataforge_2026-05-14_15-08-22  (session parent run)
├── step_01_Scan-Plan           (nested step run)
├── step_02_Dataset-Summary
├── step_03_Domain-Research
│   └── artifacts/
│       └── llm_calls/
│           ├── iter_00_request.json
│           └── iter_00_response.json
├── step_04_Statistical-KPIs
├── step_05_Correlation-Heatmap
│   └── artifacts/
│       ├── llm_calls/
│       └── charts/
│           └── step_05_viz.png
├── step_06_Distribution-Histograms
├── step_07_Categorical-Bar-Charts
├── step_08_Scatter-Plots
├── step_09_Analyst-Insights-Brief
└── step_chat_01               (nested chat run, same parent)

What Gets Logged

Artifact / Metric Where
LLM request messages (JSON) llm_calls/iter_NN_request.json
LLM response content (JSON) llm_calls/iter_NN_response.json
Tool call + result (JSON) tool_calls/NN_tool_name.json
Chart images (PNG) charts/
input_tokens, output_tokens MLflow metrics (per-step index)
latency_ms MLflow metrics
provider, model MLflow tags
card_type, iterations_used Step run tags
status (complete/failed) Run status

All tracer calls are wrapped in try/except — a tracing failure never aborts the workflow.

MLflow UI

Run the app with python run.py and open http://localhost:5001. The MLflow UI shows the dataforge experiment with one run per session. Click a session run to see all 9 nested step runs and their artifacts.

flowchart LR
    A[Session run created\nby orchestrator] --> B[Step run created\nbefore execute_step]
    B --> C[Each LLM call\nlogged as artifact]
    B --> D[Each tool call\nlogged as artifact]
    B --> E[Chart PNG\nlogged as artifact]
    B --> F[Metrics:\ntokens, latency]
    B --> G[Step run finished\nwith status + tags]
    G --> H[Next step...]
    H --> I[Session run finished\nwith counts + status]
Loading

Dashboard UI

The dashboard is a single-page HTML/CSS/JS application (src/dataforge/static/index.html) served statically by FastAPI. It uses no build step and no framework — vanilla JS with a dark Indigo theme.

Layout

SSE Card Streaming

Cards are pushed to the browser as Server-Sent Events. The JavaScript opens an EventSource to /cards/{session_id} and appends each card to the DOM as it arrives:

const es = new EventSource(`/cards/${sessionId}`);
es.onmessage = (event) => {
    const card = JSON.parse(event.data);
    renderCard(card);  // append to main content
};

Each card type is rendered differently:

  • StepPlan → collapsible step list with step IDs and reasoning types
  • DataSummaryCard → prose paragraph + column table
  • KPICard → stat tiles (row count, numeric/categorical column counts) + missing value bars
  • VisualizationCard → full-width chart image + interpretation text
  • ResearchCard → source links with quality badges + synthesis paragraph
  • ErrorCard → red alert with error type and recovery hint

Chat Mode

After analysis_complete fires, the chat input unlocks. Messages are POST /chat with {session_id, message}. Responses are a single card pushed through the same SSE channel, so chat replies appear inline in the card stream.


Project Structure

dataforge/
├── run.py                          # Launch script (MLflow UI + uvicorn)
├── pyproject.toml                  # Dependencies, build config
├── design_prompts/
│   ├── dataforge_initial_prompt.md  # v1 system prompt
│   ├── dataforge_prompt_evaluation.md  # Claude's evaluation of v1
│   └── dataforge_final_prompt.md   # v2 system prompt (production)
├── skills/
│   ├── data_analysis_agent_skill.md
│   ├── dataset_research_skill.md
│   └── visualization_selection_skill.md
├── Architecture Plan.md            # Full system design document
└── src/dataforge/
    ├── config.py                   # Settings (pydantic-settings)
    ├── __main__.py                 # uvicorn entrypoint
    ├── agent/
    │   ├── orchestrator.py         # 9-step workflow runner
    │   ├── step_executor.py        # ReAct loop + pre-computation
    │   ├── card_builder.py         # JSON → card type dispatch
    │   ├── self_verifier.py        # Pre-push verification gate
    │   ├── conversation.py         # Chat mode handler
    │   └── prompts.py              # SYSTEM_PROMPT + per-step hints
    ├── api/
    │   ├── main.py                 # FastAPI app + lifespan
    │   └── routes/
    │       ├── upload.py           # POST /upload → workflow trigger
    │       ├── cards.py            # GET /cards/{id} SSE stream
    │       ├── chat.py             # POST /chat
    │       └── traces.py           # GET /api/traces (MLflow)
    ├── memory/
    │   └── session_store.py        # In-memory SessionState registry
    ├── reasoning/
    │   ├── explicit_reasoner.py    # <thinking> block validator
    │   └── types.py                # ReasoningType enum
    ├── router/
    │   ├── failover.py             # Sequential provider failover
    │   └── providers/
    │       ├── nvidia_nim.py
    │       ├── groq.py
    │       ├── cerebras.py
    │       ├── mistral.py
    │       └── gemini.py
    ├── schemas/
    │   └── cards.py                # All card types + coercion validators
    ├── tools/
    │   ├── execute_python.py       # Sandboxed subprocess runner
    │   ├── web_search.py           # Exa / Firecrawl search
    │   ├── fetch_url.py            # HTML-stripped URL fetcher
    │   └── push_card.py            # SSE queue pusher
    ├── tracing/
    │   └── mlflow_tracer.py        # MLflow span management
    └── static/
        └── index.html              # Dashboard UI

Agent Skills

The skills/ directory contains three behavioural reference documents that are embedded directly into the agent's system prompt (via src/dataforge/agent/prompts.py). They are not user-facing documentation — they are machine-readable specifications that shape how the LLM reasons, selects visualisations, and researches datasets. Each skill targets a specific failure mode identified during the prompt evaluation phase.


data_analysis_agent_skill.md — Analysis Standards & Insight Framework

What it is: A comprehensive standards document that defines how the DataForge agent must approach every analytical step. It establishes a priority order (understand the data → discover insights → create visualisations → provide recommendations) and maps each phase to a specific workflow step and card type.

What it does:

  • Defines the 4-Part Insight Framework that every card field (key_insight, interpretation, synthesis) must satisfy:
    • WHAT — state the finding with actual numbers
    • WHY — hypothesise mechanism, hedged with "may suggest" / "is consistent with"
    • IMPACT — significance for the domain or analysis goal
    • ACTION — a concrete, specific next step
  • Sets the Quantification Standard: forbids adjective-only findings ("strongly correlated") and requires numerical equivalents ("r = 0.74, meaning 55% of variance is shared")
  • Specifies the Step 09 Synthesis Card structure (Executive Summary → Key Insights → Data Quality Notes → Suggested Next Steps) and the rule that synthesis must elevate and integrate prior findings rather than repeat them
  • Documents chart-type eligibility rules and per-step use cases (heatmap, histograms, bar charts, scatter)
  • Defines an edge-case table covering high missingness, low-cardinality numerics, small datasets, and all-weak-correlation scenarios

How it helps: Without this skill, weaker LLMs in the failover chain produce vague, adjective-heavy insight text or duplicate observations across cards. The skill is a quality floor — it gives every provider in the chain the same rubric, so the output quality degrades gracefully rather than catastrophically when failover kicks in.

Where it's invoked: The skill content is injected as a section of the system prompt in src/dataforge/agent/prompts.py. It applies globally to all 9 steps and to chat mode. The key_insight, interpretation, and prose_summary fields in src/dataforge/schemas/cards.py are the primary card fields this skill governs. The self-verifier in src/dataforge/agent/self_verifier.py enforces the causal-language guard (step_09 DataSummaryCard) that originates from this skill.


dataset_research_skill.md — Domain Research & Description (Step 03)

What it is: A targeted playbook for step_03 (ResearchCard). It defines an exact ReAct sequence for profiling the dataset, building ranked search queries, evaluating results, and degrading gracefully when searches fail.

What it does:

  • Specifies the full ReAct sequence for step_03:
    1. Profile the data with execute_python (shape, dtypes, top values, missing percentages)
    2. Infer domain from the profile, build 3 ranked queries (specific → domain-level → broadest fallback)
    3. Execute searches in order, evaluate each result, fall back when results are irrelevant
    4. Synthesise into a ResearchCard using data-profile facts when web results fail
  • Defines a Query Construction Strategy with examples showing how column names drive queries (e.g. upvote_ratio + subreddit"Reddit posts dataset upvote_ratio subreddit NLP analysis") and explicitly bans generic templates like "dataset analysis"
  • Documents Result Evaluation Criteria (source quality tiers: preferred / acceptable / low, snippet relevance signals)
  • Defines 4 ResearchCard status values (complete, partial, data-only, inconclusive) with strict rules — most importantly, "complete" with an empty sources array is invalid
  • Specifies the 5-paragraph synthesis structure (what it is → what it measures → use cases → data quality → analytical potential)

How it helps: Step 03 is the only step where the agent reaches out to the web and must evaluate source credibility. Without this skill, the agent either invents dataset facts or uses generic queries that return useless results. The ranked-query strategy and explicit fallback logic ensure the step always produces a useful card — even when all 3 searches fail — by falling back to a data-profile-only description.

Where it's invoked: The skill is embedded as a step-specific hint for step_03 in src/dataforge/agent/prompts.py, injected alongside the step-level ReAct loop in src/dataforge/agent/step_executor.py. The web_search and fetch_url tools in src/dataforge/tools/ are the execution layer this skill coordinates. The ResearchCard schema in src/dataforge/schemas/cards.py (with its status, sources, queries_attempted, and synthesis fields) directly mirrors the skill's output specification.


visualization_selection_skill.md — Chart Selection & Column Prioritisation (Steps 05–08)

What it is: A decision framework for steps 05–08 that governs which columns are charted, which chart type is used, and whether a visualisation is warranted at all. It exists to prevent the most common LLM visualisation failure: producing charts because data is available rather than because they reveal something meaningful.

What it does:

  • Defines a Column Priority Framework with three tiers:
    • HIGH — always worth visualising (outcome variables, key business drivers, time columns, segmentation columns)
    • MEDIUM — only if they show a strong relationship with HIGH columns
    • LOW / AVOID — IDs, index columns, near-zero variance columns
  • Defines a 5-question Insight Validation Gate that must score ≥3 YES before execute_python is called to generate a chart. If the gate fails, the step produces an ErrorCard explaining why — a chart that cannot justify its existence should not be produced
  • Specifies step-specific guidance for each visualisation step:
    • step_05 (heatmap): filter to HIGH-priority columns; only render if ≥2 pairs have |r| > 0.25 with domain logic
    • step_06 (histograms): select 4–8 HIGH-priority numeric columns by analytical importance, not by column order
    • step_07 (bar charts): only columns with 2–20 unique values; horizontal bars ordered by frequency
    • step_08 (scatter): one business-relevant pair; regression line if |r| ≥ 0.3; colour-code by segmentation if sub-group patterns exist
  • Defines an anti-patterns reference with specific bad patterns (pairplot of every feature, correlation heatmap of all numeric columns, "Distribution of column X" titles) and their correct alternatives

How it helps: Visualisation steps are the most likely to produce shallow output when run by a weaker provider. Without the column priority framework, an LLM will include ID or index columns in heatmaps, generate histograms in column order rather than by analytical importance, and produce charts titled "Distribution of column 3". The insight validation gate is a hard gate — if the data genuinely does not support a meaningful scatter plot, the correct output is an ErrorCard, not a weak chart.

Where it's invoked: The skill is embedded as the _viz_react_hint in src/dataforge/agent/prompts.py and is injected for every visualisation step (05–08). The pre-computation scripts in src/dataforge/agent/step_executor.py generate the chart PNG before the LLM loop, but the column-selection decisions and insight gate happen inside the LLM reasoning pass guided by this skill. The VisualizationCard and ErrorCard schemas in src/dataforge/schemas/cards.py are the output types the skill governs.


Getting Started

Prerequisites

  • Python 3.11+
  • uv package manager

Installation

git clone <repo>
cd dataforge
uv sync

Configuration

Copy .env.example to .env and add at least one API key:

# At least one provider required
NVIDIA_NIM_API_KEY=nvapi-...
GROQ_API_KEY=gsk_...
CEREBRAS_API_KEY=csk-...
MISTRAL_API_KEY=...
GEMINI_API_KEY=AIza...

# Optional search tools
EXA_API_KEY=...
FIRECRAWL_API_KEY=...

# Overrides
MLFLOW_TRACKING_URI=sqlite:///mlflow.db
APP_PORT=8000
MLFLOW_UI_PORT=5001
WORKSPACE_DIR=/tmp/workspace

Running

python run.py

This starts both services:

Usage

  1. Open http://localhost:8000
  2. Click Upload and select a CSV file
  3. Watch cards stream in as the 9-step workflow runs
  4. When complete, type a question in the chat input
  5. Open http://localhost:5001 to explore traces

Configuration Reference

Variable Default Description
NVIDIA_NIM_MODEL meta/llama-3.1-70b-instruct NIM model name
GROQ_MODEL llama-3.3-70b-versatile Groq model name
CEREBRAS_MODEL llama3.1-70b Cerebras model
MISTRAL_MODEL mistral-large-latest Mistral model
GEMINI_MODEL gemini-1.5-pro Gemini model
MAX_ITERATIONS_PER_STEP 5 Max ReAct loop iterations
STEP_TIMEOUT_SECONDS 600 Per-step hard timeout
SCRIPT_TIMEOUT_SECONDS 90 execute_python timeout
MAX_TOKENS_PER_CALL 8192 LLM max_tokens
MAX_COST_CENTS_PER_SESSION 500 Cost circuit-breaker
LARGE_DATASET_ROW_THRESHOLD 500,000 Triggers sampling
SAMPLE_SIZE 100,000 Rows after sampling
CORRELATION_THRESHOLD 0.3 Min
HIGH_MISSING_THRESHOLD 0.5 Missing % to exclude column
MAX_CATEGORICAL_UNIQUE 30 Max unique values for bar chart

Technology Stack

Layer Technology
Web framework FastAPI + uvicorn
Schema validation Pydantic v2
Settings pydantic-settings
LLM providers NVIDIA NIM, Groq, Cerebras, Mistral, Gemini
Data analysis pandas, numpy, scipy, scikit-learn
Visualisation matplotlib, seaborn
Web search Exa, Firecrawl
Observability MLflow (SQLite backend)
Async runtime asyncio (Python 3.11+)
Streaming SSE via sse-starlette
Package manager uv
UI Vanilla HTML/CSS/JS (no build step)

About

Autonomous data analysis agent - upload a CSV, get KPIs, charts, domain research & an analyst brief streamed live to a dashboard. Multi-provider LLM routing with MLflow tracing, Pydantic validation, and post-workflow chat.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors