DataForge — Autonomous Data Analysis Agent

DataForge is a production-grade autonomous data analysis agent built with FastAPI and a multi-provider LLM routing layer. Upload any CSV dataset and DataForge executes a 9-step analytical workflow — generating statistical KPIs, publication-quality charts, domain research, and a final analyst brief — all streamed live to a dashboard UI. After the workflow completes, the agent enters an interactive chat mode for follow-up questions and custom visualisations.

🎬 Demo video

Dataforge.Demo.mp4

Raw File

Origin Story

Step 1 — Initial Prompt

The project started with a prompt written from scratch: initial prompt. This prompt defined DataForge's identity, tool-calling syntax, card output schema, provider failover chain, and the nine-step workflow. It was deliberately minimal — a skeleton that described what the agent should do and how it should respond, without the polish needed for production LLMs.

Step 2 — Prompt Evaluation with Claude

The initial prompt was submitted to a Prompt Evaluation framework. More details can be found in the link. Claude assessed the prompt against nine criteria:

Criterion	Initial Prompt
Explicit Reasoning	✅ `<thinking>` blocks required
Structured Output	✅ Pydantic JSON + tool-call format
Tool Separation	✅ Reasoning vs. computation cleanly split
Conversation Loop	✅ Post-workflow chat mode defined
Instructional Framing	✅ Tool format, script rules, workflow steps
Internal Self-Checks	❌ No verification gates before card push
Reasoning Type Awareness	❌ All steps undifferentiated
Error Handling / Fallbacks	✅ (partial) Infrastructure only; no analytical fallbacks
Overall Clarity	Strong

Score: 6.5 / 9. The evaluation surfaced two critical gaps: (1) no self-verification protocol before pushing cards — errors in early steps could propagate silently; (2) no reasoning type taxonomy — different steps require different modes of thinking with different failure modes.

Step 3 — Final Prompt

The gaps were closed in the final prompt, which added:

A six-value reasoning type taxonomy (statistical, data-quality, domain-research, synthesis, visualisation, planning) — each <thinking> block must declare which type applies
A full Self-Verification Protocol gating every card push (plausibility checks, file existence, source quality, causal language guards)
An Analytical Edge-Case Decision Matrix covering 8 real-world scenarios (high missingness, low-cardinality numerics, insufficient data for scatter, inconclusive web search, and more)
An example card payload to give weaker models in the failover chain a concrete reference

Step 4 — Architecture

With the final prompt in hand, the ai-agents-architect skill was invoked to expand the prompt into a full system architecture: Architecture Plan.md. The plan covers the agent loop design, component structure, tool definitions, provider failover chain, failure scenarios, MLflow tracing integration, memory schema, and a 7-phase implementation blueprint. The codebase was then built against this plan.

What DataForge Does

CSV Upload → 9-Step Workflow → Live Dashboard → Interactive Chat

Scan & Plan — Reads the first 50 rows, infers shape and types, identifies columns to exclude or reclassify, and emits a StepPlanCard listing every subsequent step with its reasoning type.
Dataset Summary — Runs a pre-computed column profile and asks the LLM to write a 3–5 sentence prose narrative and per-column descriptions. Pushes a DataSummaryCard.
Domain Research — Issues 2–3 targeted web search queries derived from column names, evaluates source quality, and synthesises domain context into a ResearchCard. If results are inconclusive the card says so.
Statistical KPIs — Computes descriptive stats, missing value percentages, and a correlation matrix via a pre-run Python script. Pushes a KPICard with all values plausibility-checked.
Correlation Heatmap — Generates an annotated correlation heatmap for all numeric columns. Pushes a VisualizationCard with a data-driven interpretation.
Distribution Histograms — Generates histograms (log-transformed for skewed columns) for up to 8 numeric columns in a single card.
Categorical Bar Charts — Horizontal bar charts for categorical columns with 2–20 unique values (falls back to top-20 if higher cardinality).
Scatter Plots — Scatter plot of the most correlated numeric pair (only if |r| ≥ 0.1 and at least 2 valid columns exist), with regression line.
Analyst Insights Brief — Synthesises all prior findings into an executive brief using associative language (never causal), cross-referenced against the KPIs from step 4.

After step 9, the agent enters chat mode. Users can request custom charts, new statistics, follow-up research, or clarifications. Chat responses are nested under the session's MLflow run for full traceability.

Architecture Overview

                         DataForge System
                         ════════════════

  ┌───────────────┐                    ┌──────────────────────────┐
  │    Browser    │  POST /upload  ──► │                          │
  │   Dashboard   │                    │    FastAPI Application   │
  │    (HTML)     │ ◄── SSE /cards ──  │    (uvicorn + asyncio)   │
  └───────────────┘   POST /chat       └─────────────┬────────────┘
                                                      │
                                                      ▼
                                          ┌───────────────────────┐
                                          │      Orchestrator     │
                                          │    9-step workflow    │
                                          └───────────┬───────────┘
                                                      │
                                                      ▼
                                          ┌───────────────────────┐
                                          │     Step Executor     │
                                          │  Pre-compute          │
                                          │  → LLM loop           │
                                          │  → Parse              │
                                          │  → Verify             │
                                          │  → Push               │
                                          └───────────┬───────────┘
                                                      │
                                                      ▼
                                          ┌───────────────────────┐
                                          │    Provider Router    │
                                          │  NIM  →  Groq         │
                                          │  Cerebras  →  Mistral │
                                          │  Gemini               │
                                          │  (sequential failover)│
                                          └───────────────────────┘

  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────────┐
  │  Session Store  │  │  MLflow Tracer  │  │   Tool Registry     │
  │  (in-memory)    │  │  (SQLite DB)    │  │  execute_python     │
  └─────────────────┘  └─────────────────┘  │  search             │
                                            └─────────────────────┘

Agent Design

The ReAct Loop

DataForge uses the ReAct (Reason + Act) pattern inside every step. Each iteration of the LLM loop follows:

REASON → ACT → OBSERVE → [repeat or emit card]

flowchart TD
    A[Build messages with session context] --> B[LLM call via provider router]
    B --> C{Response contains\nFUNCTION_CALL?}
    C -- Yes --> D[Dispatch tool\nexecute_python / web_search / fetch_url]
    D --> E[Append TOOL_RESULT to messages]
    E --> B
    C -- No --> F{Response contains\nvalid card JSON?}
    F -- No --> G[Append format correction message]
    G --> B
    F -- Yes --> H{card_type matches\nexpected for step?}
    H -- No --> I[Append card_type correction message]
    I --> B
    H -- Yes --> J[verify_card: self-verification gate]
    J -- Failed --> K[Append verification issues]
    K --> B
    J -- Passed --> L[push_card → SSE queue]
    L --> M[Return StepResult]
    B -- MaxIterations --> N[push ErrorCard]

For computational steps (steps 3–8), the system pre-runs a Python script before entering the LLM loop, injecting the result as a DATA_PROFILE message. This offloads the heavy computation to deterministic Python code and lets weaker LLMs in the failover chain focus purely on interpretation and schema compliance.

The 9-Step Workflow

sequenceDiagram
    participant U as User
    participant API as FastAPI
    participant O as Orchestrator
    participant E as Step Executor
    participant R as Router
    participant T as Tools
    participant M as MLflow

    U->>API: POST /upload (CSV file)
    API->>O: run_workflow(session_id, dataset_path)
    M<<->>O: start_session_run()

    loop Steps 1–9
        O->>M: start_step_run(parent=session)
        O->>E: execute_step(step_id, state)

        opt Pre-computation (steps 2–8)
            E->>T: execute_python(profile_script)
            T-->>E: DATA_PROFILE JSON
        end

        loop ReAct iterations (max 5)
            E->>R: route(messages)
            R-->>E: LLMResponse
            M<<->>E: log_llm_call()

            opt Tool call in response
                E->>T: dispatch_tool()
                T-->>E: observation
                M<<->>E: log_tool_call()
            end
        end

        E->>API: push_card (SSE)
        API-->>U: Server-Sent Event
        O->>M: finish_step_run(status, provider, tokens)
    end

    O->>M: finish_session_run()
    API-->>U: analysis_complete event
    U->>API: POST /chat (follow-up question)
    API->>M: start_step_run(parent=session, "Chat")

Reasoning Type System

Every <thinking> block must declare one of six reasoning types. The ExplicitReasoner validates the block is present and logs a warning if the type doesn't match the step's expectation.

Step	Name	Expected Reasoning Type
step_01	Scan & Plan	`planning`
step_02	Dataset Summary	`synthesis`
step_03	Domain Research	`domain-research`
step_04	Statistical KPIs	`statistical`
step_05	Correlation Heatmap	`visualisation`
step_06	Distribution Histograms	`visualisation`
step_07	Categorical Bar Charts	`visualisation`
step_08	Scatter Plots	`visualisation`
step_09	Analyst Insights Brief	`synthesis`

Different types have different failure modes: statistical reasoning needs plausibility checks, domain research needs source credibility checks, synthesis needs causal-language guards. Tagging reasoning type makes those checks step-specific rather than generic.

Pre-Computation Pattern

For every step that requires data analysis, the executor pre-runs a purpose-built Python script before calling the LLM, injecting the result as context:

flowchart LR
    A[Step starts] --> B{Step has\npre-compute?}
    B -- No --> C[LLM loop immediately]
    B -- Yes --> D[Run Python script\nvia execute_python]
    D --> E{Script\nsucceeded?}
    E -- Yes --> F[Inject DATA_PROFILE\ninto messages]
    F --> G[LLM formats data\ninto card schema]
    E -- No --> H[Log warning]
    H --> C
    G --> I[Card pushed]
    C --> I

Step	Pre-Computed Data
step_02	Column profiles: dtypes, missing %, unique counts, sample values
step_03	Shape, column names, dtypes, top categorical values for query building
step_04	Full KPI package: stats, correlations, missing fractions
step_05–08	Dataset exploration profile + chart image (generated ahead of the LLM call)

For steps 5–8, the chart PNG is generated before the LLM is even called. The LLM only needs to write the interpretation — it cannot get the image path wrong because it's injected verbatim.

Component Reference

Provider Router (`src/dataforge/router/`)

flowchart LR
    A[route messages] --> B{NIM\nconfigured?}
    B -- Yes --> C[Try NIM]
    C -- OK --> Z[Return response]
    C -- Auth fail --> skip1[ ]
    C -- Rate limit --> D
    B -- No --> D
    D{Groq?} -- Yes --> E[Try Groq]
    E -- OK --> Z
    E -- fail --> F
    D -- No --> F
    F{Cerebras?} -- Yes --> G[Try Cerebras]
    G -- OK --> Z
    G -- fail --> H
    F -- No --> H
    H{Mistral?} -- Yes --> I[Try Mistral]
    I -- OK --> Z
    I -- fail --> J
    H -- No --> J
    J{Gemini?} -- Yes --> K[Try Gemini]
    K -- OK --> Z
    K -- fail --> L[AllProvidersFailedError]
    J -- No --> L

Each provider call uses temperature=0.2 for determinism. Context windows are respected per-provider (8k for Cerebras, 32k for Groq/Mistral, 128k for NIM, 1M for Gemini). Messages are trimmed oldest-first to fit within 85% of the provider's limit.

Context limits:

Provider	Limit
NVIDIA NIM	128,000 tokens
Groq	32,768 tokens
Cerebras	8,192 tokens
Mistral	32,768 tokens
Gemini	1,000,000 tokens

Tool Registry (`src/dataforge/tools/`)

`execute_python`

Runs a self-contained Python script in a subprocess with a configurable timeout (default 90s). The tool:

Prepends UTF-8 encoding declarations and matplotlib.use("Agg") to every script
Blocks dangerous patterns: subprocess, os.system, eval, exec, socket, urllib, requests
Parses stdout as JSON into the result field of the response
Returns {stdout, stderr, exit_code, result, error} — all failures are surfaced, never swallowed

# LLM emits:
FUNCTION_CALL: execute_python
ARGUMENTS: {"script": "import pandas as pd\nimport json\n..."}
END_FUNCTION_CALL

# Tool returns:
{
  "result": {"rows": 16598, "cols": 11, ...},
  "exit_code": 0,
  "error": null
}

`web_search`

Dual-engine search with automatic quality scoring:

Primary: Exa neural search API
Fallback: Firecrawl
Quality tiers: preferred (.gov, .edu, arxiv.org, WHO, etc.), acceptable, low

Results include title, url, snippet, quality, and published_date. The LLM is instructed to prefer preferred sources in its ResearchCard.

`fetch_url`

Fetches a URL and extracts clean text. Only accepts http/https. Strips HTML, collapses whitespace, truncates to max_chars (default 2000). 15-second timeout.

`push_card`

Enqueues a serialised card to an in-memory asyncio.Queue keyed by session_id. The SSE endpoint /cards/{session_id} polls this queue and streams each card as an event to the browser. Retries up to 3 times if the queue is full.

Memory & Session State (`src/dataforge/memory/`)

@dataclass
class SessionState:
    session_id: str
    dataset_path: str
    dataset_shape: tuple[int, int]
    dataset_encoding: str
    completed_steps: list[str]
    failed_steps: list[str]
    pushed_card_ids: list[str]
    excluded_columns: list[str]
    reclassified_columns: dict[str, str]
    kpi_summary: dict[str, Any]
    top_correlated_pairs: list[tuple[str, str, float]]
    conversation_history: list[dict]
    sampling_applied: bool

State is held in a module-level dict[session_id → SessionState] and updated after each step completes. The conversation_history field accumulates user/assistant turns during chat mode, truncated to the last 10 messages to prevent context growth.

Self-Verification Gate (`src/dataforge/agent/self_verifier.py`)

Before any card is pushed, verify_card(card, step_id) runs type-specific checks:

Card Type	Checks
`KPICard`	`top_missing` values are 0–1 floats; `top_correlated_pairs` are valid 3-tuples
`VisualizationCard`	`image_path` is non-empty; `rationale` is present; `key_insight` is present
`ResearchCard`	At least one source with a non-empty URL
`DataSummaryCard` (step_09)	No causal language (`"causes"`, `"leads to"`, etc.) in `prose_summary`

Failures append the issue list to the conversation and the LLM must re-emit. This is the second line of defence after the pre-computation scripts — the agent cannot push a card it hasn't validated.

Pydantic Throughout

Pydantic v2 is used at every layer of the stack.

Settings (`src/dataforge/config.py`)

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    nvidia_nim_api_key: str = Field(default="", alias="NVIDIA_NIM_API_KEY")
    groq_api_key: str = Field(default="", alias="GROQ_API_KEY")
    mlflow_tracking_uri: str = Field(default="sqlite:///mlflow.db", alias="MLFLOW_TRACKING_URI")
    max_iterations_per_step: int = 5
    large_dataset_row_threshold: int = 500_000
    correlation_threshold: float = 0.3
    # ... 30+ settings

pydantic-settings reads from .env, environment variables, and defaults in priority order. Every numeric threshold used in the analysis scripts — sampling size, scatter plot cutoff, missing value threshold — is configurable here.

Card Schemas (`src/dataforge/schemas/cards.py`)

All six card types inherit from BaseCard. The schema enforces exact Literal values for card_type so parse_card() can dispatch to the right class without ambiguity.

LLM type coercion — LLMs occasionally emit type mismatches (e.g. "11.1%" for a field typed float, a dict where a string is expected, or a list as a string). A @model_validator(mode="before") on BaseCard handles all of these before Pydantic's own validation runs:

@model_validator(mode="before")
@classmethod
def coerce_field_types(cls, data: dict) -> dict:
    for field_name, field_info in cls.model_fields.items():
        annotation = field_info.annotation
        origin = get_origin(annotation)
        args = get_args(annotation)
        val = data.get(field_name)
        if val is None:
            continue

        # str field received dict/list → join to markdown string
        if annotation is str or str in flat_args(annotation):
            if not isinstance(val, str):
                data[field_name] = _coerce_str(val)

        # list field received str → parse JSON or wrap
        elif origin is list:
            if not isinstance(val, list):
                data[field_name] = _coerce_list(val)

        # dict field received str → parse JSON
        elif origin is dict:
            if not isinstance(val, dict):
                data[field_name] = _coerce_dict(val)
            # dict[str, float] with "11.1%" string values → 0.111
            if len(args) == 2 and args[1] in (float, int):
                data[field_name] = _coerce_dict_values(data[field_name], args[1])
    return data

StepSpec has its own @model_validator that backfills name from step_id when the LLM omits it — a common failure mode across weaker providers in the chain.

API Request/Response

The upload endpoint uses Form/UploadFile directly. The card endpoints return list[dict] serialised from the Pydantic models. The chat endpoint receives a ChatRequest model:

class ChatRequest(BaseModel):
    message: str
    session_id: str

MLflow Tracing

Every session and every step is tracked in MLflow using the explicit runs API (MlflowClient) rather than the context-manager API — necessary because the context-manager uses thread-local state that is incompatible with async code.

Run Hierarchy

dataforge_2026-05-14_15-08-22  (session parent run)
├── step_01_Scan-Plan           (nested step run)
├── step_02_Dataset-Summary
├── step_03_Domain-Research
│   └── artifacts/
│       └── llm_calls/
│           ├── iter_00_request.json
│           └── iter_00_response.json
├── step_04_Statistical-KPIs
├── step_05_Correlation-Heatmap
│   └── artifacts/
│       ├── llm_calls/
│       └── charts/
│           └── step_05_viz.png
├── step_06_Distribution-Histograms
├── step_07_Categorical-Bar-Charts
├── step_08_Scatter-Plots
├── step_09_Analyst-Insights-Brief
└── step_chat_01               (nested chat run, same parent)

What Gets Logged

Artifact / Metric	Where
LLM request messages (JSON)	`llm_calls/iter_NN_request.json`
LLM response content (JSON)	`llm_calls/iter_NN_response.json`
Tool call + result (JSON)	`tool_calls/NN_tool_name.json`
Chart images (PNG)	`charts/`
`input_tokens`, `output_tokens`	MLflow metrics (per-step index)
`latency_ms`	MLflow metrics
`provider`, `model`	MLflow tags
`card_type`, `iterations_used`	Step run tags
`status` (complete/failed)	Run status

All tracer calls are wrapped in try/except — a tracing failure never aborts the workflow.

MLflow UI

Run the app with python run.py and open http://localhost:5001. The MLflow UI shows the dataforge experiment with one run per session. Click a session run to see all 9 nested step runs and their artifacts.

flowchart LR
    A[Session run created\nby orchestrator] --> B[Step run created\nbefore execute_step]
    B --> C[Each LLM call\nlogged as artifact]
    B --> D[Each tool call\nlogged as artifact]
    B --> E[Chart PNG\nlogged as artifact]
    B --> F[Metrics:\ntokens, latency]
    B --> G[Step run finished\nwith status + tags]
    G --> H[Next step...]
    H --> I[Session run finished\nwith counts + status]

Dashboard UI

The dashboard is a single-page HTML/CSS/JS application (src/dataforge/static/index.html) served statically by FastAPI. It uses no build step and no framework — vanilla JS with a dark Indigo theme.

Layout

SSE Card Streaming

Cards are pushed to the browser as Server-Sent Events. The JavaScript opens an EventSource to /cards/{session_id} and appends each card to the DOM as it arrives:

const es = new EventSource(`/cards/${sessionId}`);
es.onmessage = (event) => {
    const card = JSON.parse(event.data);
    renderCard(card);  // append to main content
};

Each card type is rendered differently:

StepPlan → collapsible step list with step IDs and reasoning types
DataSummaryCard → prose paragraph + column table
KPICard → stat tiles (row count, numeric/categorical column counts) + missing value bars
VisualizationCard → full-width chart image + interpretation text
ResearchCard → source links with quality badges + synthesis paragraph
ErrorCard → red alert with error type and recovery hint

Chat Mode

After analysis_complete fires, the chat input unlocks. Messages are POST /chat with {session_id, message}. Responses are a single card pushed through the same SSE channel, so chat replies appear inline in the card stream.

Project Structure

dataforge/
├── run.py                          # Launch script (MLflow UI + uvicorn)
├── pyproject.toml                  # Dependencies, build config
├── design_prompts/
│   ├── dataforge_initial_prompt.md  # v1 system prompt
│   ├── dataforge_prompt_evaluation.md  # Claude's evaluation of v1
│   └── dataforge_final_prompt.md   # v2 system prompt (production)
├── skills/
│   ├── data_analysis_agent_skill.md
│   ├── dataset_research_skill.md
│   └── visualization_selection_skill.md
├── Architecture Plan.md            # Full system design document
└── src/dataforge/
    ├── config.py                   # Settings (pydantic-settings)
    ├── __main__.py                 # uvicorn entrypoint
    ├── agent/
    │   ├── orchestrator.py         # 9-step workflow runner
    │   ├── step_executor.py        # ReAct loop + pre-computation
    │   ├── card_builder.py         # JSON → card type dispatch
    │   ├── self_verifier.py        # Pre-push verification gate
    │   ├── conversation.py         # Chat mode handler
    │   └── prompts.py              # SYSTEM_PROMPT + per-step hints
    ├── api/
    │   ├── main.py                 # FastAPI app + lifespan
    │   └── routes/
    │       ├── upload.py           # POST /upload → workflow trigger
    │       ├── cards.py            # GET /cards/{id} SSE stream
    │       ├── chat.py             # POST /chat
    │       └── traces.py           # GET /api/traces (MLflow)
    ├── memory/
    │   └── session_store.py        # In-memory SessionState registry
    ├── reasoning/
    │   ├── explicit_reasoner.py    # <thinking> block validator
    │   └── types.py                # ReasoningType enum
    ├── router/
    │   ├── failover.py             # Sequential provider failover
    │   └── providers/
    │       ├── nvidia_nim.py
    │       ├── groq.py
    │       ├── cerebras.py
    │       ├── mistral.py
    │       └── gemini.py
    ├── schemas/
    │   └── cards.py                # All card types + coercion validators
    ├── tools/
    │   ├── execute_python.py       # Sandboxed subprocess runner
    │   ├── web_search.py           # Exa / Firecrawl search
    │   ├── fetch_url.py            # HTML-stripped URL fetcher
    │   └── push_card.py            # SSE queue pusher
    ├── tracing/
    │   └── mlflow_tracer.py        # MLflow span management
    └── static/
        └── index.html              # Dashboard UI

Agent Skills

The skills/ directory contains three behavioural reference documents that are embedded directly into the agent's system prompt (via src/dataforge/agent/prompts.py). They are not user-facing documentation — they are machine-readable specifications that shape how the LLM reasons, selects visualisations, and researches datasets. Each skill targets a specific failure mode identified during the prompt evaluation phase.

`data_analysis_agent_skill.md` — Analysis Standards & Insight Framework

What it is: A comprehensive standards document that defines how the DataForge agent must approach every analytical step. It establishes a priority order (understand the data → discover insights → create visualisations → provide recommendations) and maps each phase to a specific workflow step and card type.

What it does:

Defines the 4-Part Insight Framework that every card field (key_insight, interpretation, synthesis) must satisfy:
- WHAT — state the finding with actual numbers
- WHY — hypothesise mechanism, hedged with "may suggest" / "is consistent with"
- IMPACT — significance for the domain or analysis goal
- ACTION — a concrete, specific next step
Sets the Quantification Standard: forbids adjective-only findings ("strongly correlated") and requires numerical equivalents ("r = 0.74, meaning 55% of variance is shared")
Specifies the Step 09 Synthesis Card structure (Executive Summary → Key Insights → Data Quality Notes → Suggested Next Steps) and the rule that synthesis must elevate and integrate prior findings rather than repeat them
Documents chart-type eligibility rules and per-step use cases (heatmap, histograms, bar charts, scatter)
Defines an edge-case table covering high missingness, low-cardinality numerics, small datasets, and all-weak-correlation scenarios

How it helps: Without this skill, weaker LLMs in the failover chain produce vague, adjective-heavy insight text or duplicate observations across cards. The skill is a quality floor — it gives every provider in the chain the same rubric, so the output quality degrades gracefully rather than catastrophically when failover kicks in.

Where it's invoked: The skill content is injected as a section of the system prompt in src/dataforge/agent/prompts.py. It applies globally to all 9 steps and to chat mode. The key_insight, interpretation, and prose_summary fields in src/dataforge/schemas/cards.py are the primary card fields this skill governs. The self-verifier in src/dataforge/agent/self_verifier.py enforces the causal-language guard (step_09 DataSummaryCard) that originates from this skill.

`dataset_research_skill.md` — Domain Research & Description (Step 03)

What it is: A targeted playbook for step_03 (ResearchCard). It defines an exact ReAct sequence for profiling the dataset, building ranked search queries, evaluating results, and degrading gracefully when searches fail.

What it does:

Specifies the full ReAct sequence for step_03:
1. Profile the data with execute_python (shape, dtypes, top values, missing percentages)
2. Infer domain from the profile, build 3 ranked queries (specific → domain-level → broadest fallback)
3. Execute searches in order, evaluate each result, fall back when results are irrelevant
4. Synthesise into a ResearchCard using data-profile facts when web results fail
Defines a Query Construction Strategy with examples showing how column names drive queries (e.g. upvote_ratio + subreddit → "Reddit posts dataset upvote_ratio subreddit NLP analysis") and explicitly bans generic templates like "dataset analysis"
Documents Result Evaluation Criteria (source quality tiers: preferred / acceptable / low, snippet relevance signals)
Defines 4 ResearchCard status values (complete, partial, data-only, inconclusive) with strict rules — most importantly, "complete" with an empty sources array is invalid
Specifies the 5-paragraph synthesis structure (what it is → what it measures → use cases → data quality → analytical potential)

How it helps: Step 03 is the only step where the agent reaches out to the web and must evaluate source credibility. Without this skill, the agent either invents dataset facts or uses generic queries that return useless results. The ranked-query strategy and explicit fallback logic ensure the step always produces a useful card — even when all 3 searches fail — by falling back to a data-profile-only description.

Where it's invoked: The skill is embedded as a step-specific hint for step_03 in src/dataforge/agent/prompts.py, injected alongside the step-level ReAct loop in src/dataforge/agent/step_executor.py. The web_search and fetch_url tools in src/dataforge/tools/ are the execution layer this skill coordinates. The ResearchCard schema in src/dataforge/schemas/cards.py (with its status, sources, queries_attempted, and synthesis fields) directly mirrors the skill's output specification.

`visualization_selection_skill.md` — Chart Selection & Column Prioritisation (Steps 05–08)

What it is: A decision framework for steps 05–08 that governs which columns are charted, which chart type is used, and whether a visualisation is warranted at all. It exists to prevent the most common LLM visualisation failure: producing charts because data is available rather than because they reveal something meaningful.

What it does:

Defines a Column Priority Framework with three tiers:
- HIGH — always worth visualising (outcome variables, key business drivers, time columns, segmentation columns)
- MEDIUM — only if they show a strong relationship with HIGH columns
- LOW / AVOID — IDs, index columns, near-zero variance columns
Defines a 5-question Insight Validation Gate that must score ≥3 YES before execute_python is called to generate a chart. If the gate fails, the step produces an ErrorCard explaining why — a chart that cannot justify its existence should not be produced
Specifies step-specific guidance for each visualisation step:
- step_05 (heatmap): filter to HIGH-priority columns; only render if ≥2 pairs have |r| > 0.25 with domain logic
- step_06 (histograms): select 4–8 HIGH-priority numeric columns by analytical importance, not by column order
- step_07 (bar charts): only columns with 2–20 unique values; horizontal bars ordered by frequency
- step_08 (scatter): one business-relevant pair; regression line if |r| ≥ 0.3; colour-code by segmentation if sub-group patterns exist
Defines an anti-patterns reference with specific bad patterns (pairplot of every feature, correlation heatmap of all numeric columns, "Distribution of column X" titles) and their correct alternatives

How it helps: Visualisation steps are the most likely to produce shallow output when run by a weaker provider. Without the column priority framework, an LLM will include ID or index columns in heatmaps, generate histograms in column order rather than by analytical importance, and produce charts titled "Distribution of column 3". The insight validation gate is a hard gate — if the data genuinely does not support a meaningful scatter plot, the correct output is an ErrorCard, not a weak chart.

Where it's invoked: The skill is embedded as the _viz_react_hint in src/dataforge/agent/prompts.py and is injected for every visualisation step (05–08). The pre-computation scripts in src/dataforge/agent/step_executor.py generate the chart PNG before the LLM loop, but the column-selection decisions and insight gate happen inside the LLM reasoning pass guided by this skill. The VisualizationCard and ErrorCard schemas in src/dataforge/schemas/cards.py are the output types the skill governs.

Getting Started

Prerequisites

Python 3.11+
uv package manager

Installation

git clone <repo>
cd dataforge
uv sync

Configuration

Copy .env.example to .env and add at least one API key:

# At least one provider required
NVIDIA_NIM_API_KEY=nvapi-...
GROQ_API_KEY=gsk_...
CEREBRAS_API_KEY=csk-...
MISTRAL_API_KEY=...
GEMINI_API_KEY=AIza...

# Optional search tools
EXA_API_KEY=...
FIRECRAWL_API_KEY=...

# Overrides
MLFLOW_TRACKING_URI=sqlite:///mlflow.db
APP_PORT=8000
MLFLOW_UI_PORT=5001
WORKSPACE_DIR=/tmp/workspace

Running

python run.py

This starts both services:

DataForge → http://localhost:8000
MLflow UI → http://localhost:5001

Usage

1. Open http://localhost:8000
2. Click Upload and select a CSV file
3. Watch cards stream in as the 9-step workflow runs
4. When complete, type a question in the chat input
5. Open http://localhost:5001 to explore traces

Configuration Reference

Variable	Default	Description
`NVIDIA_NIM_MODEL`	`meta/llama-3.1-70b-instruct`	NIM model name
`GROQ_MODEL`	`llama-3.3-70b-versatile`	Groq model name
`CEREBRAS_MODEL`	`llama3.1-70b`	Cerebras model
`MISTRAL_MODEL`	`mistral-large-latest`	Mistral model
`GEMINI_MODEL`	`gemini-1.5-pro`	Gemini model
`MAX_ITERATIONS_PER_STEP`	`5`	Max ReAct loop iterations
`STEP_TIMEOUT_SECONDS`	`600`	Per-step hard timeout
`SCRIPT_TIMEOUT_SECONDS`	`90`	execute_python timeout
`MAX_TOKENS_PER_CALL`	`8192`	LLM max_tokens
`MAX_COST_CENTS_PER_SESSION`	`500`	Cost circuit-breaker
`LARGE_DATASET_ROW_THRESHOLD`	`500,000`	Triggers sampling
`SAMPLE_SIZE`	`100,000`	Rows after sampling
`CORRELATION_THRESHOLD`	`0.3`	Min |r| (absolute correlation)
`HIGH_MISSING_THRESHOLD`	`0.5`	Missing fraction (0.5 = 50%) that excludes a column
`MAX_CATEGORICAL_UNIQUE`	`30`	Max unique values for bar chart

Technology Stack

Layer	Technology
Web framework	FastAPI + uvicorn
Schema validation	Pydantic v2
Settings	pydantic-settings
LLM providers	NVIDIA NIM, Groq, Cerebras, Mistral, Gemini
Data analysis	pandas, numpy, scipy, scikit-learn
Visualisation	matplotlib, seaborn
Web search	Exa, Firecrawl
Observability	MLflow (SQLite backend)
Async runtime	asyncio (Python 3.11+)
Streaming	SSE via sse-starlette
Package manager	uv
UI	Vanilla HTML/CSS/JS (no build step)
