Domain-specific synthetic dataset generation — from a single keyword to thousands of training examples.
BFS trees, Markov chains, graph expansion, concurrent generation, and quality filtering. All in one CLI.
How It Works • Features • Quick Start • Dataset Types • Providers
```
Input: Domain keyword (e.g., "medical imaging")
                │
                ▼
┌────────────────────────────────┐
│     Prompt Expansion Engine    │
│                                │
│  ┌──────────┐  ┌───────────┐   │
│  │ BFS Tree │  │  Markov   │   │
│  │          │  │  Chain    │   │
│  └──────────┘  └───────────┘   │
│  ┌──────────┐  ┌───────────┐   │
│  │  Random  │  │   Graph   │   │
│  │   Walk   │  │ Expansion │   │
│  └──────────┘  └───────────┘   │
└────────────────────────────────┘
                │
                ▼  (hundreds of diverse prompts)
┌────────────────────────────────┐
│   Concurrent LLM Generation    │
│   (via LiteLLM — any model)    │
└────────────────────────────────┘
                │
                ▼
┌────────────────────────────────┐
│       Quality Filtering        │
│   - Length checks              │
│   - Diversity scoring          │
│   - Embedding similarity       │
└────────────────────────────────┘
                │
                ▼
   📦 Dataset: CSV / JSONL / TXT
   (Q&A, corpus, CoT, agent trajectories)
```
| Feature | Description |
|---|---|
| 🧠 Algorithmic Prompt Expansion | BFS trees, random walks, Markov chains, hierarchical clustering, graph traversal |
| 🔌 Multi-Provider Support | OpenAI, Anthropic (Claude), Google (Gemini), Cohere, Groq, Ollama — unified via LiteLLM |
| 📊 Multiple Dataset Types | Q&A pairs, text corpora, chain-of-thought, agent trajectories |
| 📁 Flexible Output Formats | CSV, JSONL, TXT — ready for fine-tuning pipelines |
| ⚡ Concurrent Generation | Parallel workers for fast, large-scale dataset creation |
| 🔍 Quality Filtering | Length checks, diversity scoring, embedding-based deduplication |
| 💾 Resumable Runs | Save state and continue interrupted generation jobs |
| 📝 YAML Templates | Fully customizable prompts and system messages |
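The embedding-based deduplication listed above can be sketched roughly as follows. This is an illustrative stand-in, not dscurator's implementation: a real run would embed texts with sentence-transformers, while here toy vectors stand in for embeddings.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def dedup_by_similarity(items, embeddings, threshold=0.9):
    # Keep an item only if it is not too similar to anything already kept.
    kept, kept_vecs = [], []
    for item, vec in zip(items, embeddings):
        if all(cosine(vec, kv) < threshold for kv in kept_vecs):
            kept.append(item)
            kept_vecs.append(vec)
    return kept

# Toy vectors standing in for sentence embeddings:
texts = ["What is an MRI?", "What is an MRI scan?", "How does a CT scanner work?"]
vecs = [[1.0, 0.1, 0.0], [0.99, 0.12, 0.01], [0.1, 0.9, 0.3]]
print(dedup_by_similarity(texts, vecs))  # near-duplicate second question is dropped
```

The greedy keep-first policy means output order follows input order; the threshold trades recall for diversity.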
```bash
# Clone
git clone https://github.com/PeakScripter/dscurator.git
cd dscurator

# Install
pip install -r requirements.txt

# Set your API key (any supported provider)
export OPENAI_API_KEY=your_key_here
# or GOOGLE_API_KEY, ANTHROPIC_API_KEY, etc.
```

```bash
# Generate a Q&A dataset for "medical imaging"
python main.py --domain "medical imaging" --type qa --output dataset.jsonl

# Generate with Gemini, 500 examples, concurrent
python main.py --domain "astronomy" --type corpus --provider gemini --count 500 --workers 8
```

| Type | Description | Use Case |
|---|---|---|
| `qa` | Question-answer pairs | Instruction fine-tuning |
| `corpus` | Domain text passages | Language model pre-training |
| `cot` | Chain-of-thought reasoning | Reasoning fine-tuning |
| `agent` | Tool-use / agent trajectories | Agent fine-tuning (ReAct, etc.) |
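For reference, JSONL output stores one JSON object per line. The snippet below writes and reads back two Q&A records; the field names are illustrative, not necessarily dscurator's exact schema.

```python
import json

# Hypothetical Q&A records; "question"/"answer" field names are assumptions.
examples = [
    {"question": "What is an X-ray?", "answer": "Electromagnetic radiation used in imaging."},
    {"question": "What does MRI stand for?", "answer": "Magnetic resonance imaging."},
]

with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")  # one JSON object per line

# Reading it back:
with open("dataset.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded))  # 2
```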
dscurator uses LiteLLM as a unified gateway — switch providers with a single flag:
```bash
--provider openai   # GPT-4o, GPT-3.5, etc.
--provider gemini   # Gemini 1.5 Pro / Flash
--provider claude   # Claude 3.5 Sonnet / Haiku
--provider groq     # Llama, Mixtral (fast inference)
--provider ollama   # Local models (no API key needed)
```

- BFS Tree — breadth-first expansion of topic subtopics
- Random Walk — stochastic exploration of the topic space
- Markov Chain — probabilistic next-topic generation
- Graph Expansion — knowledge-graph-style traversal
- Hierarchical — cluster-then-expand for diverse coverage
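As a rough illustration of the BFS strategy, a breadth-first expansion loop might look like this. The `expand` callable is a hypothetical stand-in for an LLM call that proposes subtopics; here a small dict mocks it.

```python
from collections import deque

def bfs_expand(root, expand, max_depth=2):
    # Breadth-first, depth-limited expansion of a topic into subtopics.
    prompts, queue, seen = [], deque([(root, 0)]), {root}
    while queue:
        topic, depth = queue.popleft()
        prompts.append(topic)
        if depth < max_depth:
            for sub in expand(topic):
                if sub not in seen:       # avoid revisiting topics
                    seen.add(sub)
                    queue.append((sub, depth + 1))
    return prompts

# Mock expander standing in for an LLM call:
tree = {
    "medical imaging": ["MRI", "CT"],
    "MRI": ["MRI safety", "MRI physics"],
    "CT": ["CT contrast agents"],
}
print(bfs_expand("medical imaging", lambda t: tree.get(t, [])))
```

Each collected topic would then be templated into a generation prompt; the `seen` set keeps the expansion from looping on cyclic topic graphs.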
```
output/
├── dataset.jsonl      # Generated examples (JSONL)
├── dataset.csv        # Same data in CSV format
├── prompts_used.txt   # All expanded prompts (for inspection)
└── run_state.json     # Checkpoint for resumable runs
```
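A minimal sketch of how a `run_state.json` checkpoint could make runs resumable, assuming the state is simply the list of prompts already completed (the tool's actual file layout may differ):

```python
import json
import os

STATE_FILE = "run_state.json"

def load_state():
    # Resume from a previous checkpoint if one exists.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return set(json.load(f)["done"])
    return set()

def save_state(done):
    # Write to a temp file, then rename, so a crash can't corrupt the checkpoint.
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"done": sorted(done)}, f)
    os.replace(tmp, STATE_FILE)

prompts = ["p1", "p2", "p3"]
done = load_state()
for p in prompts:
    if p in done:
        continue  # already generated in an earlier run
    # ... generate an example for p here ...
    done.add(p)
    save_state(done)
print(sorted(done))
```

Re-running the script after an interruption skips every prompt recorded in the checkpoint and picks up where it left off.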
- Language: Python 3.10+
- LLM Gateway: LiteLLM (multi-provider)
- Graph Algorithms: NetworkX
- Embeddings: sentence-transformers (for quality filtering)
- Concurrency: asyncio + ThreadPoolExecutor
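The worker-pool side of the concurrency model can be sketched with a `ThreadPoolExecutor` fanning out prompt completions. `generate` below is a mock standing in for an I/O-bound LLM API call, and `max_workers` mirrors the `--workers` flag:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def generate(prompt):
    # Stand-in for a slow, I/O-bound LLM API call.
    return f"response to: {prompt}"

prompts = [f"prompt {i}" for i in range(8)]

# Parallel workers; threads suit I/O-bound API calls.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(generate, p): p for p in prompts}
    results = {futures[f]: f.result() for f in as_completed(futures)}

print(len(results))  # 8
```

Collecting via `as_completed` lets fast responses land first; keying results by prompt restores the original association regardless of completion order.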
MIT License — see LICENSE for details.