Bringing Reinforcement Learning to OpenClaw/Krab agents — Self-improving AI that learns from real actions and rewards.
Transform your static agents into adaptive learners that continuously improve from experience. Built on Stable-Baselines3 and Gymnasium, ClawReinforcementLearning enables your agents to:
- 📚 Learn from real task outcomes and user feedback
- 🎯 Adapt strategies based on reward signals
- 🔄 Improve automatically over time with minimal human intervention
- ⚡ Train locally with full privacy and control
- 🌐 Integrate seamlessly with ClawFlow ecosystem
```bash
# Install
pip install -r requirements.txt

# Run basic training
python scripts/train.py
```

```python
# Use in your agent
from src.agent.rl_agent import RLAgent

agent = RLAgent()
action = agent.predict(obs)
```

| Feature | Details |
|---|---|
| 🧠 Multiple Algorithms | PPO (default), DQN, A2C |
| 📊 Custom Environment | Gymnasium-compatible ClawEnv |
| 💾 Smart Model Management | Auto-save, checkpointing, best model tracking |
| 🔗 Deep Integration | ClawGraph, ClawMemory, ClawFlow compatible |
| 📈 Monitoring | TensorBoard integration, reward tracking |
| 🚀 Production Ready | Error handling, logging, config management |
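The "best model tracking" feature can be pictured as keeping whichever checkpoint scores highest on evaluation reward. A minimal sketch of that idea — the class and method names here are illustrative, not the library's actual API:

```python
class BestModelTracker:
    """Keep the checkpoint with the highest evaluation reward.

    Illustrative sketch only; ClawReinforcementLearning's real model
    manager may use different names and persistence logic.
    """

    def __init__(self):
        self.best_reward = float("-inf")
        self.best_path = None

    def update(self, mean_reward, checkpoint_path):
        """Return True if this checkpoint is the new best (caller saves it)."""
        if mean_reward > self.best_reward:
            self.best_reward = mean_reward
            self.best_path = checkpoint_path
            return True
        return False
```

In practice the training loop would call `update()` after each evaluation pass and only write the checkpoint to disk when it returns `True`.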
Install via ClawFlow:

```bash
clawflow install openkrab/claw-reinforcement-learning
```

Or from source:

```bash
git clone https://github.com/OpenKrab/ClawReinforcementLearning.git
cd ClawReinforcementLearning
pip install -r requirements.txt
```

```python
from src.agent.rl_agent import RLAgent

# Initialize
agent = RLAgent()

# Train
agent.train(timesteps=50000)

# Predict
obs, _ = agent.env.reset()
action = agent.predict(obs)
print(f"Smart action: {action}")
```

```bash
# Train with custom parameters
python scripts/train.py \
  --algorithm PPO \
  --timesteps 100000 \
  --learning-rate 0.0001

# Monitor with TensorBoard
tensorboard --logdir ./logs
```

```yaml
# In your ClawFlow config
integrations:
  rl:
    enabled: true
    auto_train_interval: "1d"    # Train daily
    reward_source: clawgraph     # Get rewards from ClawGraph
```

```
┌─────────────────────────────────────┐
│ ClawReinforcementLearning           │
├─────────────────────────────────────┤
│                                     │
│ RLAgent (Stable-Baselines3)         │
│ ├─ Train: Learn from rewards        │
│ ├─ Predict: Choose actions          │
│ └─ Save/Load: Model management      │
│                                     │
│ ClawEnv (Gymnasium)                 │
│ ├─ Observations: State from graph   │
│ ├─ Actions: Discrete action space   │
│ └─ Rewards: Custom reward function  │
│                                     │
└─────────────────────────────────────┘
              │
              ├─→ ClawGraph (state/rewards)
              ├─→ ClawMemory (observations)
              ├─→ ClawFlow (orchestration)
              └─→ ClawTeam (multi-agent)
```
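The ClawEnv layer above can be pictured as a Gymnasium-style environment whose spaces and rewards mirror the defaults in `config.yaml`. Here is a dependency-free sketch of that interface — the real `ClawEnv` subclasses `gymnasium.Env`, and the success condition below is a toy stand-in:

```python
import random


class MiniClawEnv:
    """Dependency-free sketch of the ClawEnv interface (illustrative only).

    Mirrors the config defaults: 10 discrete actions, a 20-dim observation
    vector, 50-step episodes, and the success/failure/step rewards.
    """

    def __init__(self, n_actions=10, obs_size=20, max_steps=50):
        self.n_actions = n_actions
        self.obs_size = obs_size
        self.max_steps = max_steps

    def reset(self, seed=None):
        self.rng = random.Random(seed)
        self.steps = 0
        return self._obs(), {}

    def step(self, action):
        assert 0 <= action < self.n_actions
        self.steps += 1
        success = action == 0                 # toy success condition
        truncated = self.steps >= self.max_steps
        if success:
            reward = 10.0                     # reward.success
        elif truncated:
            reward = -5.0                     # reward.failure
        else:
            reward = -0.1                     # reward.step_penalty
        # Gymnasium-style 5-tuple: obs, reward, terminated, truncated, info
        return self._obs(), reward, success, truncated, {}

    def _obs(self):
        # Stand-in for state pulled from ClawGraph/ClawMemory
        return [self.rng.uniform(-1, 1) for _ in range(self.obs_size)]
```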
Edit `config.yaml`:

```yaml
rl:
  algorithm: PPO           # Algorithm to use
  total_timesteps: 10000   # Training iterations
  learning_rate: 0.0003    # Learning rate
  batch_size: 64           # Batch size

env:
  max_steps: 50            # Episode length
  action_space: 10         # Number of actions
  observation_space: 20    # State vector size

reward:
  success: 10.0            # Success reward
  failure: -5.0            # Failure penalty
  step_penalty: -0.1       # Per-step cost

save_path: ~/.openkrab/rl_models
```

```
ClawReinforcementLearning/
├── src/
│   ├── __init__.py
│   ├── env/
│   │   ├── __init__.py
│   │   └── claw_env.py          # Gymnasium environment
│   └── agent/
│       ├── __init__.py
│       └── rl_agent.py          # RL agent wrapper
│
├── scripts/
│   └── train.py                 # Training script
│
├── examples/
│   └── basic_training.py        # Usage examples
│
├── tests/
│   └── test_env.py              # Unit tests
│
├── config.yaml                  # Configuration
├── requirements.txt             # Dependencies
├── SKILL.md                     # ClawHub skill manifest
└── README.md                    # This file
```
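Under the reward settings above, an episode's return is the sum of per-step penalties plus a terminal success reward or failure penalty. A quick sanity check of that arithmetic (the helper function is illustrative, not part of the library):

```python
def episode_return(steps, success,
                   success_r=10.0, failure_r=-5.0, step_penalty=-0.1):
    """Return under the default reward config: per-step cost plus terminal bonus/penalty."""
    terminal = success_r if success else failure_r
    return steps * step_penalty + terminal


# A 20-step successful episode nets 10.0 - 2.0 = 8.0
print(episode_return(20, True))
```

This makes the trade-off explicit: with the defaults, success still pays off even after 99 steps, while a failure compounds the accumulated step cost.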
```python
from src.env.claw_env import ClawEnv

env = ClawEnv()
obs, info = env.reset()
```

```python
from src.agent.rl_agent import RLAgent

agent = RLAgent()
agent.train(timesteps=50000)  # Train
```

```python
# Load and use
agent = RLAgent()
agent.model = agent.load_model()

obs, _ = agent.env.reset()
for _ in range(100):
    action = agent.predict(obs)
    obs, reward, terminated, truncated, info = agent.env.step(action)
    if terminated or truncated:
        break
```

Once trained, use your model in ClawFlow:

```yaml
# clawflow.yaml
agents:
  main:
    skills:
      - name: rl
        enabled: true
        model_path: ~/.openkrab/rl_models/ppo_claw.zip
      - name: claw_browser
      - name: claw_tools
    decision_policy: "reinforcement"  # Use RL for decisions
```
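Conceptually, setting `decision_policy: "reinforcement"` routes action selection through the trained model instead of a scripted rule. A stub of that dispatch — these names are hypothetical illustrations, not the actual ClawFlow API:

```python
class ReinforcementPolicy:
    """Prefer the trained RL model; fall back to a scripted default.

    Hypothetical sketch of what a 'reinforcement' decision policy implies;
    the real ClawFlow dispatch mechanism is not documented here.
    """

    def __init__(self, model=None, fallback=None):
        self.model = model
        self.fallback = fallback or (lambda obs: 0)  # default scripted action

    def decide(self, obs):
        if self.model is not None:
            return self.model.predict(obs)
        return self.fallback(obs)
```

The useful property of this shape is graceful degradation: if no trained model is available yet, the agent still acts via its scripted fallback.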
- **Increase training time**: more timesteps generally means better convergence.

  ```python
  agent.train(timesteps=500000)
  ```

- **Tune the reward function** in `config.yaml`:

  ```yaml
  reward:
    success: 20.0        # Higher reward for wins
    step_penalty: -0.2   # Stronger penalty for slow actions
  ```

- **Use a GPU**: install PyTorch with CUDA support.

  ```bash
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  ```

- **Monitor progress** with TensorBoard:

  ```bash
  tensorboard --logdir ./logs
  ```
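When judging whether a tuning change actually helped, compare mean episode return over fixed evaluation rollouts rather than eyeballing the training curve. A dependency-free sketch of that loop — the stub environment stands in for ClawEnv, and `evaluate` is an illustrative helper, not part of the library:

```python
def evaluate(policy, env_factory, episodes=10):
    """Mean episode return over several Gymnasium-style rollouts."""
    returns = []
    for _ in range(episodes):
        env = env_factory()
        obs, _ = env.reset()
        total, done = 0.0, False
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            total += reward
            done = terminated or truncated
        returns.append(total)
    return sum(returns) / len(returns)


class StubEnv:
    """Toy stand-in for ClawEnv: ends after 5 steps with a +10 success bonus."""

    def reset(self):
        self.t = 0
        return 0, {}

    def step(self, action):
        self.t += 1
        terminated = self.t >= 5
        reward = 10.0 if terminated else -0.1
        return 0, reward, terminated, False, {}
```

With a real agent you would pass `agent.predict` as the policy and `ClawEnv` as the factory, and rerun the same evaluation after each config change.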
- ✅ v0.1: Basic PPO training, Gymnasium integration
- 🔜 v0.2: ClawGraph integration for smart state observations
- 🔜 v0.3: Multi-agent RL with ClawTeam
- 🔜 v0.4: Safe training in ClawSandbox
- 🔜 v0.5: User feedback reward integration
- 🔜 v1.0: Production monitoring & auto-scaling
- Force a fresh training run:

  ```bash
  python scripts/train.py --force-retrain
  ```

- Upgrade dependencies:

  ```bash
  pip install -r requirements.txt --upgrade
  ```

- Running out of memory? Reduce the batch size in `config.yaml`:

  ```yaml
  batch_size: 32  # From 64
  ```

- Training too slow?
  - Use a GPU (install CUDA-enabled PyTorch)
  - Increase `batch_size` (if memory allows)
  - Reduce the `observation_space` size
  - Use a faster algorithm (PPO is fastest)
Contributions are welcome! Please:

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
MIT License - See LICENSE
Built with 🦞 by the OpenKrab community
Ready to train smarter agents? Start with:

```bash
python scripts/train.py
```