data-analyzer

Large-scale JSON/JSONL data analysis CLI using local LLMs.

Uses a sliding window with progressive summarization to work within the model's context window limit, avoiding the information loss of a Map-Reduce approach.

Features

Analyze 100K+ JSON/JSONL records with local LLMs
Sliding window engine with overlap for boundary context preservation
Citation verification — every finding cites source records, verified against originals
Checkpoint-based resume for long-running analysis
Idempotent job execution
Interactive parameter builder (with file input support)
Markdown and HTML report output
Output language control (--lang Japanese)

Requirements

Go 1.23+
Local LLM with OpenAI-compatible API (e.g., LM Studio)
Recommended model: google/gemma-4-26b-a4b (Think OFF)

Installation

make build    # → dist/data-analyzer

Setup

# Option 1: Environment variables
export DATA_ANALYZER_API_ENDPOINT="http://localhost:1234/v1"
export DATA_ANALYZER_API_MODEL="google/gemma-4-26b-a4b"

# Option 2: Config file (~/.config/data-analyzer/config.toml)
mkdir -p ~/.config/data-analyzer
cp config.example.toml ~/.config/data-analyzer/config.toml

Usage

1. Prepare analysis parameters

Build parameters interactively with LLM assistance:

# Interactive mode (supports multi-line input; end with an empty line)
data-analyzer prepare --output params.json

# With sample data — LLM sees actual field names and values
data-analyzer prepare --sample logs.jsonl --output params.json

# Load requirements from file + sample data, then refine interactively
data-analyzer prepare --input requirements.txt --sample logs.jsonl --output params.json

Or create params.json manually:

{
  "perspective": "Detect insider threats and unauthorized access",
  "target_fields": ["user", "action", "source_ip", "timestamp"],
  "attention_points": [
    "Multiple failed login attempts",
    "Privilege escalation",
    "Large data transfers to external services"
  ],
  "user_findings": [],
  "lang": "Japanese"
}
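
If params.json is generated by a script rather than written by hand, a small sketch like the following may help. It uses only the field names shown in the example above. The `build_params` helper and its type checks are illustrative assumptions, not the tool's actual schema validation.

```python
import json

# Field names come from the example above; the types are inferred from it
# and are an assumption, not a documented schema.
EXPECTED_TYPES = {
    "perspective": str,
    "target_fields": list,
    "attention_points": list,
    "user_findings": list,
    "lang": str,
}


def build_params(perspective, target_fields, attention_points, lang="Japanese"):
    """Hypothetical helper: assemble a params dict shaped like the example."""
    params = {
        "perspective": perspective,
        "target_fields": list(target_fields),
        "attention_points": list(attention_points),
        "user_findings": [],
        "lang": lang,
    }
    for key, expected in EXPECTED_TYPES.items():
        if not isinstance(params[key], expected):
            raise TypeError(f"{key} should be {expected.__name__}")
    return params


if __name__ == "__main__":
    params = build_params(
        "Detect insider threats and unauthorized access",
        ["user", "action", "source_ip", "timestamp"],
        ["Multiple failed login attempts", "Privilege escalation"],
    )
    # Redirect this output to params.json, or write it with open(...).
    print(json.dumps(params, ensure_ascii=False, indent=2))
```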

2. Run analysis

# Single file
data-analyzer analyze --params params.json logs.jsonl

# Directory (all .json/.jsonl files)
data-analyzer analyze --params params.json ./log_dir/

# With output file and language
data-analyzer analyze --params params.json --lang Japanese --output result.json logs.jsonl

# Resume interrupted analysis
data-analyzer analyze --params params.json --resume <job-id> logs.jsonl

3. Generate report

# Markdown to stdout
data-analyzer compile result.json

# HTML report
data-analyzer compile --format html --output report.html result.json

# Both Markdown and HTML
data-analyzer compile --format both --output report result.json

# From stdin
cat result.json | data-analyzer compile -

Configuration

Settings are loaded in the following order, with later sources overriding earlier ones: defaults → config file → environment variables → CLI flags.

Variable	Default	Description
`DATA_ANALYZER_API_ENDPOINT`	`http://localhost:1234/v1`	OpenAI-compatible API endpoint
`DATA_ANALYZER_API_MODEL`	`google/gemma-4-26b-a4b`	Model name
`DATA_ANALYZER_API_KEY`	—	API key (optional)
`DATA_ANALYZER_CONTEXT_LIMIT`	`131072`	Context window budget (tokens)
`DATA_ANALYZER_OVERLAP_RATIO`	`0.1`	Window overlap ratio (0.0–1.0)
`DATA_ANALYZER_MAX_FINDINGS`	`100`	Max findings to accumulate
`DATA_ANALYZER_MAX_RECORDS_PER_WINDOW`	`200`	Max records per window (quality guard)
`DATA_ANALYZER_LANG`	—	Output language (e.g., `Japanese`)
`DATA_ANALYZER_TEMP_DIR`	`$TMPDIR/data-analyzer`	Checkpoint directory
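
The layering can be pictured as a series of dictionary merges. The sketch below assumes that each later source overrides the earlier ones. For simplicity it keys every layer by the environment variable names from the table; the real config file keys and flag names may differ. `resolve_settings` is a hypothetical name.

```python
import os

# Defaults copied from the table above (kept as strings, like env values).
DEFAULTS = {
    "DATA_ANALYZER_API_ENDPOINT": "http://localhost:1234/v1",
    "DATA_ANALYZER_API_MODEL": "google/gemma-4-26b-a4b",
    "DATA_ANALYZER_CONTEXT_LIMIT": "131072",
    "DATA_ANALYZER_OVERLAP_RATIO": "0.1",
    "DATA_ANALYZER_MAX_FINDINGS": "100",
    "DATA_ANALYZER_MAX_RECORDS_PER_WINDOW": "200",
}


def resolve_settings(config_file=None, env=None, cli_flags=None):
    """Merge the layers in order; a later layer overwrites an earlier one."""
    merged = dict(DEFAULTS)
    for layer in (config_file or {}, env or {}, cli_flags or {}):
        # Skip unset values so an absent flag does not clobber a lower layer.
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged


if __name__ == "__main__":
    env = {k: v for k, v in os.environ.items() if k.startswith("DATA_ANALYZER_")}
    settings = resolve_settings(env=env, cli_flags={"DATA_ANALYZER_LANG": "Japanese"})
    print(settings["DATA_ANALYZER_LANG"])
```

Treating an unset flag as "no opinion" rather than as an empty value is a design choice of this sketch. It keeps a value from the config file or environment from being erased by a flag the user never passed.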

Resilience

When the LLM backend crashes or unloads the model during a long analysis session, the client automatically detects the failure and waits for the model to reload. This behavior is configurable via the [resilience] section of config.toml:

Setting	Default	Description
`max_retries`	`10`	Max retry attempts per LLM call
`max_backoff_sec`	`120`	Max backoff wait between retries (seconds)
`health_check_interval_sec`	`10`	Polling interval for model readiness (seconds)
`health_check_timeout_sec`	`300`	Max wait for model to become ready (seconds)

On each retry, the client polls /v1/models to confirm the model is loaded before sending the next request. This prevents wasting retries while the backend is still reloading.
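
A minimal sketch of this retry loop is shown below, using the default values from the table. The doubling backoff, the `ConnectionError` trigger, and the names `call_with_retry`, `wait_until_ready`, `send`, and `is_ready` are all assumptions made for illustration. In practice `is_ready` would be a check against /v1/models.

```python
import time


def wait_until_ready(is_ready, interval_sec=10, timeout_sec=300, sleep=time.sleep):
    """Poll is_ready() until it returns True or timeout_sec has elapsed."""
    waited = 0
    while True:
        if is_ready():
            return True
        if waited >= timeout_sec:
            return False
        sleep(interval_sec)
        waited += interval_sec


def call_with_retry(send, is_ready, max_retries=10, max_backoff_sec=120,
                    interval_sec=10, timeout_sec=300, sleep=time.sleep):
    """Call send(); on failure back off, wait for readiness, then retry."""
    backoff = 1
    for attempt in range(max_retries + 1):
        try:
            return send()
        except ConnectionError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the last error
            sleep(min(backoff, max_backoff_sec))
            backoff *= 2  # exponential growth is an assumption of this sketch
            # Do not spend the next attempt until the model reports ready.
            if not wait_until_ready(is_ready, interval_sec, timeout_sec, sleep):
                raise TimeoutError("model did not become ready in time")
```

Passing `sleep` as a parameter lets the loop be exercised in tests without real waiting.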

How It Works

┌─────────────┐    ┌──────────────┐    ┌──────────────┐
│   prepare    │───▶│   analyze    │───▶│   compile    │
│ (interactive)│    │(sliding win) │    │(md/html/both)│
└─────────────┘    └──────────────┘    └──────────────┘
   params.json        result.json       report.md/.html

Sliding Window Algorithm:

1. Divide records into overlapping windows (max 200 records per window).
2. For each window, send [Previous Summary] + [Findings] + [New Data] to the LLM.
3. The LLM returns an updated summary and new findings with record citations.
4. Citation verification checks excerpt relevance and replaces each excerpt with the full original record.
5. A checkpoint is saved after each window, so an interrupted run can be resumed.
6. The final report is generated from the accumulated findings.
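
The steps above can be sketched roughly as follows. This is a simplified illustration, not the tool's implementation: it windows by record count only (ignoring the token budget), omits citation verification and checkpointing, and uses a hypothetical `call_llm` callable in place of the real LLM request.

```python
def make_windows(n_records, window_size=200, overlap_ratio=0.1):
    """Return (start, end) index pairs for overlapping windows over n_records."""
    if window_size <= 0 or not 0.0 <= overlap_ratio < 1.0:
        raise ValueError("window_size must be > 0 and overlap_ratio in [0, 1)")
    overlap = int(window_size * overlap_ratio)
    step = window_size - overlap  # always >= 1 given the checks above
    windows = []
    start = 0
    while start < n_records:
        end = min(start + window_size, n_records)
        windows.append((start, end))
        if end == n_records:
            break
        start += step
    return windows


def analyze(records, call_llm, window_size=200, overlap_ratio=0.1):
    """Carry a running summary and findings list across all windows."""
    summary, findings = "", []
    for start, end in make_windows(len(records), window_size, overlap_ratio):
        # call_llm(summary, findings, window) -> (new_summary, new_findings)
        summary, new_findings = call_llm(summary, findings, records[start:end])
        findings.extend(new_findings)
    return summary, findings


if __name__ == "__main__":
    print(make_windows(450))  # [(0, 200), (180, 380), (360, 450)]
```

With the defaults, each window shares its first 20 records with the tail of the previous one, which is what preserves context across window boundaries. The sketch rejects an overlap ratio of exactly 1.0 because the window would never advance.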

Citation Verification:

Every citation from the LLM is verified against the original data:

Excerpt values checked for relevance against the original record
Excerpts always replaced with the full original record (no field omission)
Non-matching excerpts flagged as possible hallucination
Missing citations recovered from Record #N references in description text
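
The checks listed above might look like the following sketch. The citation shape (`record`, `excerpt`), the zero-based record index, and the use of exact field equality as the relevance test are assumptions made for illustration. The tool's actual matching rules are not described here.

```python
import re

RECORD_REF = re.compile(r"Record #(\d+)")


def verify_citation(citation, records):
    """Check a citation's excerpt against the original record it points to."""
    index = citation["record"]
    if not 0 <= index < len(records):
        return {"record": index, "verified": False, "original": None}
    original = records[index]
    excerpt = citation.get("excerpt", {})
    # Here an excerpt "matches" if every field it quotes equals the original.
    matches = bool(excerpt) and all(original.get(k) == v for k, v in excerpt.items())
    # The full original record always replaces the excerpt; a False
    # "verified" flag marks a possible hallucination.
    return {"record": index, "verified": matches, "original": original}


def recover_citations(description):
    """Pull record numbers out of 'Record #N' mentions in free text."""
    return [int(n) for n in RECORD_REF.findall(description)]
```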

Memory Map (128K token budget):

Section	Allocation (tokens)
System prompt	~2K (fixed)
Previous summary	0→15K (grows, then stabilizes)
Accumulated findings	0→20K (grows, priority eviction)
New RAW data	Remainder (~86K–106K)
Response buffer	~5K (fixed)
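
As a worked example of the arithmetic behind the table, the sketch below computes the space left for raw data once the other sections are reserved. It uses the table's rounded figures (1K = 1,000 tokens), and `raw_data_budget` is a hypothetical helper, not part of the tool.

```python
def raw_data_budget(summary_tokens, findings_tokens, context_limit=128_000,
                    system_prompt=2_000, response_buffer=5_000):
    """Tokens left for new raw data after the other sections are reserved."""
    remainder = context_limit - system_prompt - response_buffer
    remainder -= summary_tokens + findings_tokens
    if remainder <= 0:
        raise ValueError("no room left for raw data")
    return remainder


if __name__ == "__main__":
    # Summary and findings both at their caps (15K and 20K).
    print(raw_data_budget(15_000, 20_000))  # 86000
```

With the summary and findings at their caps, this gives the ~86K lower bound shown in the table. The per-window record cap (`DATA_ANALYZER_MAX_RECORDS_PER_WINDOW`) applies in addition to this token budget.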

4. Clean up old jobs

# Remove completed jobs older than 7 days (default)
data-analyzer clean

# Remove completed jobs older than 1 day
data-analyzer clean --max-age 1d

# Remove all jobs (including incomplete)
data-analyzer clean --all

License

MIT
