A Python CLI for turning raw coding-agent session logs (Claude Code, OpenCode, Qwen Code) into clean, redacted, quality-scored JSONL that loads directly into a HuggingFace SFT pipeline.
The pipeline is a chain of JSONL → JSONL stages. Each stage has its own
subcommand and writes a file you can inspect; run chains them all together.
raw logs ──parse──▶ *_raw.jsonl
│
redact (scrub secrets, anonymize paths)
▼
*_redacted.jsonl
│
clean (drop bad tool calls, empty turns, orphans)
▼
*_cleaned.jsonl
│
evaluate (attach quality `score`)
▼
*_scored.jsonl
│
filter (keep score ≥ threshold, format for training)
▼
final.jsonl
pip install -e . # core CLI (stdlib only)
pip install -e '.[huggingface]' # optional: enable Hugging Face dataset uploadsOnly validate pulls in transformers — the rest of the pipeline runs on the
standard library alone, so it works in constrained sandboxes with no network.
End-to-end on Claude Code logs (auto-discovers ~/.claude/projects):
python -m logminer run --source claude --output training.jsonlThis writes the final dataset plus four intermediate files
(training_raw.jsonl, _redacted.jsonl, _cleaned.jsonl,
_scored.jsonl) so you can diff stages when debugging.
Higher-quality cut, all supported providers:
python -m logminer run --source all --output data/out.jsonl --min-score 0.7Upload the final dataset to a Hugging Face dataset repo when a hub token is in
HF_TOKEN, HUGGINGFACE_HUB_TOKEN, or HUGGING_FACE_HUB_TOKEN:
export HF_TOKEN=hf_xxx
python -m logminer run --source claude --output training.jsonl --hf-repo your-name/logminer-dataWhen upload runs, logminer also writes a dataset-card README.md with a
consistent logminer tag and a link back to this GitHub repo so those datasets
are easier to find on Hugging Face.
Pass --hf-private if you want the dataset repo created as private.
If you already ran the earlier stages yourself, filter can upload the final
JSONL too:
python -m logminer filter --input data/scored.jsonl --output data/training.jsonl --hf-repo your-name/logminer-data --hf-privateParse only, then sanity-check records against a real tokenizer:
python -m logminer parse --source claude --output data/raw.jsonl
python -m logminer validate --input data/raw.jsonl --model Qwen/Qwen3.5-4BWhen you omit --input, the parser falls back to each provider's standard
directory:
- claude →
~/.claude/projects - opencode →
~/.local/share/opencode - qwen →
~/.qwen/projects
Pass --input <path> to point at a specific export instead.
--min-score— 0.5 is the default and is forgiving. Raise to 0.7+ when you have plenty of source logs and want a tighter dataset; lower it when you're data-starved or still tuning.--min-turns/--min-token-count— what counts as a "real" conversation. The evaluator hard-floors anything below these to score 0, regardless of other signals. Drop them for sparse logs; raise them when you only want substantive sessions.--source all— convenient, but it parses every provider's default log dir. Pair--source <name>with--input <path>to target one export.
A parsed record:
{
"id": "<session id>",
"source": "claude",
"messages": [
{"role": "user", "content": "..."},
{"role": "assistant", "content": "...", "tool_calls": [...]},
{"role": "tool", "tool_call_id": "...", "content": "..."}
]
}After evaluate, records gain score, label, reasons, and a metrics
block. After filter, the first message also carries a tools array (the
OpenAI-style tool schemas inferred from observed tool calls), and messages
is serialized to list[str] (one JSON-encoded turn per element) so PyArrow
can unify the schema across rows.
The final JSONL loads cleanly via vanilla datasets.load_dataset:
import json
from datasets import load_dataset
ds = load_dataset("json", data_files="data/training.jsonl", split="train")
for rec in ds:
msgs = [json.loads(m) for m in rec["messages"]] # list[str] → list[dict]
tools = msgs[0].pop("tools", None) if msgs else None
# tokenizer.apply_chat_template(msgs, tools=tools, ...)Adding a new agent (e.g. Codex) usually means one new file in
logminer/parsers/ plus a one-line entry in parsers/__init__.py. The
pipeline stages are agent-agnostic and don't need changes.
See SKILL.md for the parser contract (BaseParser), step-by-step
instructions, and notes on extending the redaction, scoring, and cleaning
stages.
pytest # full suite
pytest tests/test_parsers.py -k claude # one parser at a timeInstall pre-commit hooks once per clone so lint and formatting run on every commit:
pip install pre-commit
pre-commit install
pre-commit run --all-files # optional: run against the whole repo right nowRuff handles both linting and formatting; config lives in pyproject.toml under
[tool.ruff]. CI runs ruff check, ruff format --check, and pytest against
Python 3.10, 3.11, and 3.12 on every push to main and every pull request
(.github/workflows/ci.yml).
Publishing uses Trusted Publishing (OIDC) — no API tokens stored as
secrets. The workflow at .github/workflows/publish.yml builds an sdist + wheel
with python -m build and uploads via pypa/gh-action-pypi-publish.
- PyPI: create the project (or register a pending publisher for the first
release). Under the project's Publishing settings, add a Trusted
Publisher with:
- Owner: your GitHub username/org
- Repository:
logminer - Workflow:
publish.yml - Environment:
pypi
- GitHub: in repo Settings → Environments, create an environment named
pypi. Optionally add required reviewers for an extra approval step before any release runs.
- Bump
versioninpyproject.toml(follow semver). - Commit the bump on
main:git commit -am "Release v0.1.1" - Tag and push — this auto-triggers the publish workflow:
git tag v0.1.1 git push origin main --tags
Alternatively, trigger the workflow manually from the Actions tab
(Publish to PyPI → Run workflow). Manual runs publish whatever is on the
selected branch — make sure version in pyproject.toml is already bumped, or
PyPI will reject the upload as a duplicate.
- Watch the run under the Actions tab; the publish step prints the uploaded files.
- Confirm the new version appears at
https://pypi.org/project/logminer/. - Smoke test in a clean venv:
pip install --upgrade logminer logminer --help
- To refresh the bundled copy when qwen-code updates, re-run QWEN_WRITE_SYSTEM_MD=logminer/parsers/qwen_system_prompt.md qwen -p "x".