Skip to content

microsoft/SkillOpt

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Train agent skills like you train neural networks β€” with epochs, learning rates, and validation gates β€” but without touching model weights.

Project Page Paper Project Video Python 3.10+ License: MIT

🎬 SkillOpt Demo Video

64c8f76086bed7bd7a5ce664a7a14f40_raw.mp4

β–Ά Watch the full demo on YouTube


Install

Requirements: Python 3.10+

git clone https://github.com/microsoft/SkillOpt.git
cd SkillOpt
pip install -e .

# For ALFWorld benchmark (optional):
pip install -e ".[alfworld]"
alfworld-download

Configure API Credentials

cp .env.example .env
# Edit .env with your API credentials, then:
source .env

Azure OpenAI (recommended):

export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
# Option 1: API key auth
export AZURE_OPENAI_API_KEY="your-key"
# Option 2: Azure CLI auth (no API key needed)
export AZURE_OPENAI_AUTH_MODE="azure_cli"

Note: AZURE_OPENAI_ENDPOINT is always required. Without it, all LLM calls will fail.

OpenAI directly:

export OPENAI_API_KEY="sk-..."

Anthropic Claude:

export ANTHROPIC_API_KEY="sk-ant-..."

Qwen (local vLLM):

export QWEN_CHAT_BASE_URL="http://localhost:8000/v1"
export QWEN_CHAT_MODEL="Qwen/Qwen3.5-4B"

Data Preparation

SkillOpt expects data in a split directory with train/, val/, test/ subdirectories, each containing a JSON file (e.g., items.json).

data/my_split/
β”œβ”€β”€ train/items.json
β”œβ”€β”€ val/items.json
└── test/items.json

Each JSON file is an array of task items. The required fields depend on the benchmark. For example, SearchQA items look like:

[
  {
    "id": "unique_item_id",
    "question": "Who wrote the novel ...",
    "context": "[DOC] relevant passage text ...",
    "answers": ["expected answer"]
  }
]

See skillopt/envs/<benchmark>/dataloader.py for the exact format each benchmark expects.

Note: Benchmark datasets are not included in this repository. Prepare your own data following the format above.

Supported Benchmarks

Benchmark Type Config
SearchQA QA configs/searchqa/default.yaml
ALFWorld Embodied agent configs/alfworld/default.yaml
DocVQA Document QA configs/docvqa/default.yaml
LiveMathematicianBench Math configs/livemathematicianbench/default.yaml
SpreadsheetBench Code generation configs/spreadsheetbench/default.yaml
OfficeQA Tool-augmented QA configs/officeqa/default.yaml

Quick Start

Training

# Minimal example β€” train on SearchQA:
python scripts/train.py \
    --config configs/searchqa/default.yaml \
    --split_dir /path/to/your/searchqa_split \
    --azure_openai_endpoint https://your-resource.openai.azure.com/ \
    --optimizer_model gpt-5.5 \
    --target_model gpt-5.5

# Train on LiveMathematicianBench:
python scripts/train.py \
    --config configs/livemathematicianbench/default.yaml \
    --split_dir /path/to/your/livemath_split \
    --azure_openai_endpoint https://your-resource.openai.azure.com/ \
    --optimizer_model gpt-5.5 \
    --target_model gpt-5.5

# Train on ALFWorld:
python scripts/train.py \
    --config configs/alfworld/default.yaml \
    --split_dir /path/to/your/alfworld_split \
    --azure_openai_endpoint https://your-resource.openai.azure.com/ \
    --optimizer_model gpt-5.5 \
    --target_model gpt-5.5

Key CLI arguments:

Argument Description Example
--config Benchmark config YAML configs/searchqa/default.yaml
--split_dir Path to data split directory /path/to/split
--azure_openai_endpoint Azure OpenAI endpoint URL https://your-resource.openai.azure.com/
--optimizer_model Optimizer model deployment name gpt-5.5
--target_model Target model deployment name gpt-5.5
--num_epochs Number of training epochs 4
--batch_size Batch size per step 40
--workers Parallel rollout workers 8
--out_root Output directory outputs/my_run

Eval Only

Evaluate a trained skill on specific data splits without training:

# Evaluate on test set only:
python scripts/eval_only.py \
  --config configs/searchqa/default.yaml \
  --skill outputs/my_run/best_skill.md \
  --split valid_unseen \
  --split_dir /path/to/searchqa_split \
  --azure_openai_endpoint https://your-resource.openai.azure.com/

# Evaluate on all splits (train + val + test):
python scripts/eval_only.py \
  --config configs/searchqa/default.yaml \
  --skill outputs/my_run/best_skill.md \
  --split all \
  --split_dir /path/to/searchqa_split \
  --azure_openai_endpoint https://your-resource.openai.azure.com/
Split Description
valid_unseen Test set
valid_seen Validation set
train Training set
all All splits combined (default)

Output Structure

Each run writes to a structured output directory:

outputs/<run_name>/
β”œβ”€β”€ config.json              # Flattened runtime config
β”œβ”€β”€ history.json             # Per-step training history
β”œβ”€β”€ runtime_state.json       # Resume checkpoint
β”œβ”€β”€ best_skill.md            # Best validated skill document
β”œβ”€β”€ skills/skill_vXXXX.md   # Skill snapshot per step
β”œβ”€β”€ steps/step_XXXX/        # Per-step artifacts (patches, evals)
β”œβ”€β”€ slow_update/epoch_XX/   # Slow update logs
└── meta_skill/epoch_XX/    # Meta skill logs

Re-running the same command auto-resumes from the last completed step.


WebUI

Launch the monitoring dashboard (optional):

pip install -e ".[webui]"
python -m skillopt_webui.app
Flag Default Description
--port 7860 Server port
--host 0.0.0.0 Bind address
--share off Create a public Gradio share link
# With public share link (useful for remote servers)
python -m skillopt_webui.app --share

Citation

@article{skillopt2026,
  title={SKILLOPT: Executive Strategy for Self-Evolving Agent Skills},
  author={SkillOpt Team},
  year={2026}
}

About

SkillOpt is a text-space optimizer that trains reusable natural-language skills for frozen LLM agents through trajectory-driven edits, validation-gated updates, and deployable best_skill.md artifacts.

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors