Optimize a skill document / system prompt like you would optimize code: iterate, evaluate, and keep the model weights frozen.
This open-source fork is focused on local AI workflows:
- run SkillOpt against any OpenAI-compatible local server
- use the included DotNetDebug example for cheap end-to-end smoke tests
- keep private configs, outputs, and secrets out of git
- document the local path first, while still supporting cloud backends
Important: SkillOpt optimizes a skill document / system prompt, not model weights. Your model stays frozen; SkillOpt improves the instructions it receives.
- optimize prompts/skills for benchmarked agent tasks
- compare skill revisions with validation-gated training loops
- run local experiments through openai_compat
- inspect generated skills, histories, patches, and evaluation summaries
Requirements: Python 3.10+
git clone https://github.com/mitkox/SkillOpt.git
cd SkillOpt
python -m venv .venv
source .venv/bin/activate
pip install -e .
# Optional extras:
pip install -e ".[webui]"
pip install -e ".[alfworld]"

If you install the ALFWorld extra, also download its assets:
alfworld-download

Copy the environment template and load it:
cp .env.example .env
set -a
source .env
set +a

The default local workflow expects an OpenAI-compatible endpoint such as llama.cpp server, vLLM, LM Studio, Ollama's OpenAI bridge, or your own local server.
export OPENAI_COMPAT_BASE_URL="http://localhost:8000/v1"
export OPENAI_COMPAT_API_KEY="local"

The included local sample config is:
- config: configs/dotnetdebug/local_mitko.yaml
- backend: openai_compat
- default model name: mitko
- sample dataset: data/dotnetdebug/tasks.json
- seed skill: skillopt/envs/dotnetdebug/skills/initial.md
If your server exposes a different model name, change model.optimizer and model.target in the config or override them with --cfg-options.
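Before launching a run, it can help to confirm which model names the server actually reports. The following is a minimal sketch and not part of SkillOpt: it assumes the server implements the OpenAI-style model-listing route (GET on /models under the base URL), and the helper names models_url, parse_model_ids, and list_models are hypothetical.

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL from an OpenAI-compatible base URL."""
    return base_url.rstrip("/") + "/models"


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style {"data": [{"id": ...}]} response."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def list_models(base_url: str, api_key: str, timeout: float = 5.0) -> list[str]:
    """Query the server and return the model ids it reports (hypothetical helper)."""
    request = urllib.request.Request(
        models_url(base_url),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_model_ids(json.load(response))


# Example call, assuming a server is listening on the URL exported above:
# print(list_models("http://localhost:8000/v1", "local"))
```

If the name you expect is missing from the returned list, that is the value to put in model.optimizer and model.target.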
This is the fastest way to verify the local setup end to end.
python scripts/train.py \
--config configs/dotnetdebug/local_mitko.yaml \
--cfg-options \
train.num_epochs=1 \
train.batch_size=2 \
gradient.minibatch_size=2 \
gradient.analyst_workers=1 \
env.workers=1 \
env.limit=2 \
optimizer.learning_rate=2 \
env.out_root=outputs/dotnetdebug_smoke

Inspect the main artifact at:
outputs/dotnetdebug_smoke/best_skill.md
Other useful artifacts:
- outputs/dotnetdebug_smoke/history.json
- outputs/dotnetdebug_smoke/summary.json (if present)
- outputs/dotnetdebug_smoke/steps/
python scripts/eval_only.py \
--config configs/dotnetdebug/local_mitko.yaml \
--skill outputs/dotnetdebug_smoke/best_skill.md \
--split test \
--cfg-options \
env.limit=2 \
env.workers=1 \
env.out_root=outputs/dotnetdebug_eval_smoke

Local is the default path in this fork, but SkillOpt also supports hosted backends.
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
# Option 1: API key auth
export AZURE_OPENAI_API_KEY="your-key"
# Option 2: Azure CLI auth
export AZURE_OPENAI_AUTH_MODE="azure_cli"

export OPENAI_API_KEY="sk-..."

export ANTHROPIC_API_KEY="sk-ant-..."

export QWEN_CHAT_BASE_URL="http://localhost:8000/v1"
export QWEN_CHAT_MODEL="Qwen/Qwen3.5-4B"

| Benchmark | Type | Config |
|---|---|---|
| SearchQA | QA | configs/searchqa/default.yaml |
| ALFWorld | Embodied agent | configs/alfworld/default.yaml |
| DocVQA | Document QA | configs/docvqa/default.yaml |
| LiveMathematicianBench | Math | configs/livemathematicianbench/default.yaml |
| SpreadsheetBench | Code generation | configs/spreadsheetbench/default.yaml |
| OfficeQA | Tool-augmented QA | configs/officeqa/default.yaml |
| DotNetDebug | C# debugging example | configs/dotnetdebug/default.yaml |
SkillOpt expects data in a split directory with train/, val/, and test/ subdirectories, each containing a JSON file such as items.json.
data/my_split/
├── train/items.json
├── val/items.json
└── test/items.json
Each JSON file is an array of task items. The exact schema depends on the benchmark. For example, SearchQA items look like:
[
  {
    "id": "unique_item_id",
    "question": "Who wrote the novel ...",
    "context": "[DOC] relevant passage text ...",
    "answers": ["expected answer"]
  }
]

See skillopt/envs/<benchmark>/dataloader.py for benchmark-specific formats.
Note: Most benchmark datasets are not included in this repository. The bundled exception is data/dotnetdebug/tasks.json, which exists specifically to support a runnable local smoke test.
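When preparing your own split directory, a quick layout check can catch missing files before a run starts. This is a minimal sketch under stated assumptions: it checks only the train/val/test layout and the top-level JSON array described above, assumes the file is named items.json (the README gives that name only as an example), and does not validate any benchmark-specific schema. check_split_dir is a hypothetical helper, not a SkillOpt function.

```python
import json
from pathlib import Path

SPLITS = ("train", "val", "test")


def check_split_dir(split_dir, filename="items.json"):
    """Return a list of problems found; an empty list means the layout looks right."""
    problems = []
    for split in SPLITS:
        path = Path(split_dir) / split / filename
        if not path.is_file():
            problems.append(f"missing file: {path}")
            continue
        try:
            items = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"invalid JSON in {path}: {exc}")
            continue
        # Each split file is expected to be an array of task items.
        if not isinstance(items, list):
            problems.append(f"{path} should contain a JSON array of task items")
    return problems


# Example: print any problems for a split directory of your own.
# for problem in check_split_dir("data/my_split"):
#     print(problem)
```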
| Argument | Description | Example |
|---|---|---|
| --config | Benchmark config YAML | configs/dotnetdebug/local_mitko.yaml |
| --split_dir | Path to data split directory | /path/to/split |
| --skill | Skill document to evaluate | outputs/my_run/best_skill.md |
| --split | Split to evaluate | test |
| --cfg-options | Inline config overrides | env.limit=2 env.workers=1 |
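To make the dotted key=value form of --cfg-options concrete, here is an illustrative sketch of how such overrides could map onto a nested config. It is an assumption about the general technique, not SkillOpt's actual parser, which may handle types and edge cases differently. apply_overrides is a hypothetical helper, and the starting values in config are made up for the example.

```python
import json


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Apply dotted key=value overrides to a nested dict, in place."""
    for item in overrides:
        key, sep, raw = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        try:
            value = json.loads(raw)  # numbers, booleans, null, quoted strings
        except json.JSONDecodeError:
            value = raw  # anything else stays a plain string
        node = config
        *parents, leaf = key.split(".")
        for part in parents:
            node = node.setdefault(part, {})  # create missing sections on the way down
        node[leaf] = value
    return config


# Hypothetical starting values, for illustration only.
config = {"env": {"limit": 100, "workers": 8}}
apply_overrides(
    config,
    ["env.limit=2", "env.workers=1", "env.out_root=outputs/dotnetdebug_smoke"],
)
print(config)
```

In this sketch, numeric values such as env.limit=2 become integers, while paths such as env.out_root stay strings.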
Each run writes to a structured output directory:
outputs/<run_name>/
├── config.json # Flattened runtime config
├── history.json # Per-step training history
├── runtime_state.json # Resume checkpoint
├── best_skill.md # Best validated skill document
├── skills/skill_vXXXX.md # Skill snapshot per step
├── steps/step_XXXX/ # Per-step artifacts
├── slow_update/epoch_XX/ # Slow-update logs
└── meta_skill/epoch_XX/ # Meta-skill logs
Re-running the same command resumes from the last completed step when possible.
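To see at a glance which of the artifacts listed above a run has produced so far, a small check like the following can help. It is a minimal sketch: the names come from the directory listing above, artifact_report is a hypothetical helper, and it assumes nothing about file contents. Some entries may be absent for a given run, for example before the first epoch completes.

```python
from pathlib import Path

# Top-level names taken from the output directory listing.
EXPECTED = [
    "config.json",
    "history.json",
    "runtime_state.json",
    "best_skill.md",
    "skills",
    "steps",
    "slow_update",
    "meta_skill",
]


def artifact_report(run_dir) -> dict[str, bool]:
    """Map each expected artifact name to whether it exists under run_dir."""
    root = Path(run_dir)
    return {name: (root / name).exists() for name in EXPECTED}


# Example: report on the smoke-test output directory.
# for name, present in artifact_report("outputs/dotnetdebug_smoke").items():
#     print(f"{'ok     ' if present else 'missing'} {name}")
```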
Launch the optional monitoring dashboard:
python -m skillopt_webui.app

Common flags:
| Flag | Default | Description |
|---|---|---|
| --port | 7860 | Server port |
| --host | 0.0.0.0 | Bind address |
| --share | off | Create a public Gradio share link |
This repo is grounded in the original SkillOpt research. If you want the paper/demo context, see:
- Project page: https://microsoft.github.io/SkillOpt/
- Paper: https://arxiv.org/abs/2605.23904
- Demo video: https://youtu.be/JUBMDTCiM0M
@article{skillopt2026,
title={SKILLOPT: Executive Strategy for Self-Evolving Agent Skills},
author={SkillOpt Team},
year={2026}
}