Large-scale JSON/JSONL data analysis CLI using local LLMs.
Uses a sliding window + progressive summarization approach to overcome context window limitations, avoiding the information loss of Map-Reduce-style chunking.
- Analyze up to 100K+ JSON/JSONL records with local LLMs
- Sliding window engine with overlap for boundary context preservation
- Citation verification — every finding cites source records, verified against originals
- Checkpoint-based resume for long-running analysis
- Idempotent job execution
- Interactive parameter builder (with file input support)
- Markdown and HTML report output
- Output language control (`--lang Japanese`)
Requirements:
- Go 1.23+
- Local LLM with an OpenAI-compatible API (e.g., LM Studio)
- Recommended model: `google/gemma-4-26b-a4b` (Think OFF)
```bash
make build   # → dist/data-analyzer
```

```bash
# Option 1: Environment variables
export DATA_ANALYZER_API_ENDPOINT="http://localhost:1234/v1"
export DATA_ANALYZER_API_MODEL="google/gemma-4-26b-a4b"

# Option 2: Config file (~/.config/data-analyzer/config.toml)
mkdir -p ~/.config/data-analyzer
cp config.example.toml ~/.config/data-analyzer/config.toml
```

Build parameters interactively with LLM assistance:
```bash
# Interactive mode (supports multi-line input, end with empty line)
data-analyzer prepare --output params.json

# With sample data — LLM sees actual field names and values
data-analyzer prepare --sample logs.jsonl --output params.json

# Load requirements from file + sample data, then refine interactively
data-analyzer prepare --input requirements.txt --sample logs.jsonl --output params.json
```

Or create `params.json` manually:
```json
{
  "perspective": "Detect insider threats and unauthorized access",
  "target_fields": ["user", "action", "source_ip", "timestamp"],
  "attention_points": [
    "Multiple failed login attempts",
    "Privilege escalation",
    "Large data transfers to external services"
  ],
  "user_findings": [],
  "lang": "Japanese"
}
```

```bash
# Single file
data-analyzer analyze --params params.json logs.jsonl

# Directory (all .json/.jsonl files)
data-analyzer analyze --params params.json ./log_dir/

# With output file and language
data-analyzer analyze --params params.json --lang Japanese --output result.json logs.jsonl

# Resume interrupted analysis
data-analyzer analyze --params params.json --resume <job-id> logs.jsonl
```

```bash
# Markdown to stdout
data-analyzer compile result.json

# HTML report
data-analyzer compile --format html --output report.html result.json

# Both Markdown and HTML
data-analyzer compile --format both --output report result.json

# From stdin
cat result.json | data-analyzer compile -
```

Settings are loaded in order: defaults → config file → env vars → CLI flags.
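The layered precedence above (later sources override earlier ones, but only for values they actually set) can be sketched as follows. The `Config` struct and `merge` helper are illustrative, not the tool's actual types:

```go
package main

import "fmt"

// Config holds a subset of settings for the sketch; field names
// are hypothetical, not the tool's real structs.
type Config struct {
	Endpoint string
	Model    string
}

// merge overlays non-empty fields of src onto dst, so a later
// layer only overrides values it explicitly sets.
func merge(dst, src Config) Config {
	if src.Endpoint != "" {
		dst.Endpoint = src.Endpoint
	}
	if src.Model != "" {
		dst.Model = src.Model
	}
	return dst
}

// resolve applies the documented order: defaults → file → env → flags.
func resolve(defaults, file, env, flags Config) Config {
	c := defaults
	for _, layer := range []Config{file, env, flags} {
		c = merge(c, layer)
	}
	return c
}

func main() {
	defaults := Config{Endpoint: "http://localhost:1234/v1", Model: "google/gemma-4-26b-a4b"}
	env := Config{Model: "my-other-model"} // e.g. DATA_ANALYZER_API_MODEL set
	got := resolve(defaults, Config{}, env, Config{})
	fmt.Println(got.Endpoint, got.Model) // endpoint from defaults, model from env
}
```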
| Variable | Default | Description |
|---|---|---|
| `DATA_ANALYZER_API_ENDPOINT` | `http://localhost:1234/v1` | OpenAI-compatible API endpoint |
| `DATA_ANALYZER_API_MODEL` | `google/gemma-4-26b-a4b` | Model name |
| `DATA_ANALYZER_API_KEY` | — | API key (optional) |
| `DATA_ANALYZER_CONTEXT_LIMIT` | `131072` | Context window budget (tokens) |
| `DATA_ANALYZER_OVERLAP_RATIO` | `0.1` | Window overlap ratio (0.0–1.0) |
| `DATA_ANALYZER_MAX_FINDINGS` | `100` | Max findings to accumulate |
| `DATA_ANALYZER_MAX_RECORDS_PER_WINDOW` | `200` | Max records per window (quality guard) |
| `DATA_ANALYZER_LANG` | — | Output language (e.g., Japanese) |
| `DATA_ANALYZER_TEMP_DIR` | `$TMPDIR/data-analyzer` | Checkpoint directory |
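A config-file equivalent of the defaults above might look like this. The section and key names are guesses mirroring the environment variables; `config.example.toml` in the repo is the authoritative layout:

```toml
# ~/.config/data-analyzer/config.toml (illustrative key names)
[api]
endpoint = "http://localhost:1234/v1"
model = "google/gemma-4-26b-a4b"

[analysis]
context_limit = 131072
overlap_ratio = 0.1
max_findings = 100
max_records_per_window = 200
```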
When the LLM backend crashes or unloads the model during long analysis sessions,
the client automatically detects the failure and waits for the model to reload.
Configurable via the `[resilience]` section in `config.toml`:
| Setting | Default | Description |
|---|---|---|
| `max_retries` | `10` | Max retry attempts per LLM call |
| `max_backoff_sec` | `120` | Max backoff wait between retries (seconds) |
| `health_check_interval_sec` | `10` | Polling interval for model readiness (seconds) |
| `health_check_timeout_sec` | `300` | Max wait for model to become ready (seconds) |
On each retry, the client polls `/v1/models` to confirm the model is loaded
before sending the next request. This prevents wasting retries while the
backend is still reloading.
┌─────────────┐ ┌──────────────┐ ┌──────────────┐
│ prepare │───▶│ analyze │───▶│ compile │
│ (interactive)│ │(sliding win) │ │(md/html/both)│
└─────────────┘ └──────────────┘ └──────────────┘
params.json result.json report.md/.html
Sliding Window Algorithm:
- Divide records into overlapping windows (max 200 records per window)
- For each window, send `[Previous Summary] + [Findings] + [New Data]` to the LLM
- LLM returns an updated summary + new findings with record citations
- Citation verification: check excerpt relevance, replace with full original record
- Checkpoint saved after each window (resume on interruption)
- Final report generated from accumulated findings
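The window partitioning in the first step can be sketched as below. Function and type names are illustrative, not the tool's internals; the step size follows from the overlap ratio (e.g. 200 records at 0.1 overlap advances 180 records per window):

```go
package main

import "fmt"

// window is a half-open record range [start, end).
type window struct{ start, end int }

// slidingWindows splits n records into windows of at most size
// records, each advancing by size*(1-overlap) so consecutive
// windows share boundary records for context preservation.
func slidingWindows(n, size int, overlap float64) []window {
	step := int(float64(size) * (1 - overlap))
	if step < 1 {
		step = 1
	}
	var ws []window
	for start := 0; start < n; start += step {
		end := start + size
		if end > n {
			end = n
		}
		ws = append(ws, window{start, end})
		if end == n {
			break
		}
	}
	return ws
}

func main() {
	// 500 records, 200 per window, 10% overlap → step of 180
	for _, w := range slidingWindows(500, 200, 0.1) {
		fmt.Printf("records %d..%d\n", w.start, w.end)
	}
	// → records 0..200, 180..380, 360..500
}
```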
Citation Verification:
Every citation from the LLM is verified against the original data:
- Excerpt values checked for relevance against the original record
- Excerpts always replaced with the full original record (no field omission)
- Non-matching excerpts flagged as possible hallucination
- Missing citations recovered from `Record #N` references in description text
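The relevance check in the first bullet might reduce to something like the sketch below: every value the LLM quoted must appear in the serialized original record, otherwise the citation is flagged. This is a simplified assumption about the matching logic, not the tool's actual verifier:

```go
package main

import (
	"fmt"
	"strings"
)

// verifyExcerpt reports whether every value the LLM quoted in its
// excerpt actually appears in the original record's serialized
// form. A miss marks the citation as a possible hallucination.
func verifyExcerpt(original string, excerptValues []string) bool {
	for _, v := range excerptValues {
		if !strings.Contains(original, v) {
			return false
		}
	}
	return true
}

func main() {
	original := `{"user":"alice","action":"login_failed","source_ip":"10.0.0.5"}`
	fmt.Println(verifyExcerpt(original, []string{"alice", "login_failed"})) // true: values match
	fmt.Println(verifyExcerpt(original, []string{"bob"}))                   // false: hallucinated value
}
```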
Memory Map (128K token budget):
| Section | Allocation |
|---|---|
| System prompt | ~2K (fixed) |
| Previous summary | 0→15K (grows, then stabilizes) |
| Accumulated findings | 0→20K (grows, priority eviction) |
| New RAW data | Remainder (~86K–106K) |
| Response buffer | ~5K (fixed) |
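The remainder row follows from simple subtraction: the RAW-data budget is what is left after the fixed sections and the summary's ceiling, shrinking as findings accumulate. The sketch below assumes K = 1024 and that the summary is reserved at its 15K ceiling; the real allocator may differ:

```go
package main

import "fmt"

// rawDataBudget returns the tokens left for new RAW data after
// reserving the fixed sections of the context, per the memory map
// above: system prompt (~2K), summary ceiling (15K), response
// buffer (~5K), plus whatever findings currently occupy.
func rawDataBudget(limit, findingsUsed int) int {
	const (
		systemPrompt = 2 * 1024
		summaryCap   = 15 * 1024
		responseBuf  = 5 * 1024
	)
	return limit - systemPrompt - summaryCap - responseBuf - findingsUsed
}

func main() {
	fmt.Println(rawDataBudget(131072, 0))       // findings empty → 108544 (~106K)
	fmt.Println(rawDataBudget(131072, 20*1024)) // findings full  → 88064 (~86K)
}
```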
```bash
# Remove completed jobs older than 7 days (default)
data-analyzer clean

# Remove completed jobs older than 1 day
data-analyzer clean --max-age 1d

# Remove all jobs (including incomplete)
data-analyzer clean --all
```

License: MIT