Official implementation of
FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control
Donghu Kim*1, Youngdo Lee*2,3, Minho Park2, Kinam Kim2, Takuma Seno4, I Made Aswin Nahrendra3, Sehee Min1, Daniel Palenicek5,6, Florian Vogt7, Danica Kragic7, Jan Peters5,6,8, Jaegul Choo2, Hojoon Lee1
1Holiday Robotics, 2KAIST, 3KRAFTON, 4Turing Inc, 5TU Darmstadt, 6hessian.AI, 7KTH Royal Institute of Technology, 8German Research Center for AI (DFKI)
(* indicates equal contribution)
arXiv'2026.
FlashSAC is a fast and stable off-policy reinforcement learning algorithm that achieves the highest asymptotic performance in the shortest wall-clock time for high-dimensional robotic control.
This repository (FlashSAC) provides the full training framework, agent implementations, and environment integrations used in the paper, supporting over 100 tasks across diverse simulators: IsaacLab, MuJoCo Playground, ManiSkill, Genesis, HumanoidBench, MyoSuite, MuJoCo, Meta-World, and DeepMind Control Suite.
If you're using PPO, try FlashSAC!
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"
```

| Configuration | Ubuntu | GPU | Python |
|---|---|---|---|
| Config 1 | 22.04 | RTX 30x0, 40x0 | `uv python pin 3.10.18` |
| Config 2 | 24.04 | RTX 50x0, Bx00 (Blackwell) | `uv python pin 3.11.14` |
```bash
uv sync
```

```bash
wget https://github.com/deepmind/mujoco/releases/download/2.1.0/mujoco210-linux-x86_64.tar.gz
tar xvf mujoco210-linux-x86_64.tar.gz && rm mujoco210-linux-x86_64.tar.gz
mkdir -p ~/.mujoco && mv mujoco210 ~/.mujoco/mujoco210
```

Add to `~/.bashrc`:
```bash
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/home/$USER/.mujoco/mujoco210/bin:${LD_LIBRARY_PATH}
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/nvidia
export MUJOCO_GL="egl"
export MUJOCO_EGL_DEVICE_ID="0"
export MKL_SERVICE_FORCE_INTEL="0"
```

Verify:
```bash
source ~/.bashrc
uv run python -c "import gymnasium; gymnasium.make('HalfCheetah-v4')"
```

By default, only MuJoCo and DMC are available. Install additional environments with:
```bash
uv sync --extra <environment>
```

Available extras: `isaaclab`, `mujoco-playground`, `maniskill`, `genesis`, `humanoid-bench`, `myosuite`, `metaworld`, `all`
> [!NOTE]
> `mujoco-playground` has known issues with JAX > 0.5.2 (NaN values, training collapse — see issue #153) and may not work with Python 3.11.
> [!NOTE]
> `isaaclab` cannot be installed alongside `genesis` or `humanoid-bench` due to dependency conflicts. If you need IsaacLab, install it in a separate virtual environment with `uv sync --extra isaaclab`. For the same reason, `all` installs every extra except `isaaclab`.
```bash
uv run python train.py
```

Override config values via `--overrides`:
```bash
uv run python train.py --overrides env=dmc --overrides env.env_name='humanoid-walk'
```

Example scripts for each environment are provided in `scripts/`:
```bash
bash scripts/run_mujoco.sh
bash scripts/run_isaaclab.sh
```

Configs are managed via Hydra. The base config is `configs/flashSAC_base.yaml`, with modular sub-configs under `configs/agent/` and `configs/env/`.
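For quick experimentation, the same configs can also be composed programmatically with Hydra's Python API. A minimal sketch, assuming the config group names shown above (the exact schema of the merged config is not spelled out in this README):

```python
# Sketch: compose the training config with Hydra's Python API.
# Config names (flashSAC_base, env=dmc) come from this README; the
# structure of the resulting config object is an assumption.
from hydra import initialize, compose
from omegaconf import OmegaConf

with initialize(version_base=None, config_path="configs"):
    cfg = compose(
        config_name="flashSAC_base",
        overrides=["env=dmc", "env.env_name=humanoid-walk"],
    )

print(OmegaConf.to_yaml(cfg))  # inspect the merged config before launching training
```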
Both Weights & Biases and TensorBoard are supported. Set `logger_type` in `configs/flashSAC_base.yaml`:

```yaml
logger_type: 'wandb' # or 'tensorboard'
```

TensorBoard logs are saved to `runs/`. Launch with:
```bash
tensorboard --logdir runs
```

FlashSAC adapts its configuration based on the simulator type for optimal speed:
| | GPU simulators (IsaacLab, MJP, Genesis, ManiSkill) | CPU simulators (MuJoCo, DMC, HBench, MyoSuite) |
|---|---|---|
| `num_envs` | 1024 | 1 |
| `batch_size` | 2048 | 512 |
| AMP | On | Off |
| Buffer device | `cuda:0` | `cpu` |
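A minimal sketch of how such simulator-dependent defaults could be selected. The helper and dataclass names here are hypothetical, not the repository's actual code; the values mirror the table above:

```python
# Hypothetical sketch of per-simulator default selection; names are illustrative.
from dataclasses import dataclass

GPU_SIMULATORS = {"isaaclab", "mujoco_playground", "genesis", "maniskill"}

@dataclass
class SimulatorDefaults:
    num_envs: int
    batch_size: int
    use_amp: bool
    buffer_device: str

def defaults_for(simulator: str) -> SimulatorDefaults:
    if simulator in GPU_SIMULATORS:
        # Massively parallel GPU simulators: large batches, AMP, GPU-resident buffer.
        return SimulatorDefaults(num_envs=1024, batch_size=2048, use_amp=True, buffer_device="cuda:0")
    # Single-environment CPU simulators: small batches, AMP off, CPU buffer.
    return SimulatorDefaults(num_envs=1, batch_size=512, use_amp=False, buffer_device="cpu")
```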
> [!NOTE]
> The `torch.compile` mode is determined by Python version. This is configured automatically — do not change it manually.
| Python | Compile mode | PyTorch | Notes |
|---|---|---|---|
| 3.10 | `reduce-overhead` | 2.5.1 | Legacy default |
| 3.11 | `max-autotune` | 2.9.1 | `reduce-overhead` causes 5–10x slowdowns after PyTorch 2.8 |
We use PyTorch 2.9.1 for Python 3.11 instead of 2.7.1 (IsaacLab's default), since IsaacLab will eventually migrate to newer versions. See `pyproject.toml` for version pinning details.
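A sketch of the version-based mode selection described above, assuming a simple helper around `torch.compile` (the function names are illustrative, not the repository's code):

```python
# Illustrative helper: pick the torch.compile mode from the interpreter version,
# mirroring the table above.
import sys
import torch

def compile_mode() -> str:
    # On Python 3.11 (PyTorch 2.9.1 here), "reduce-overhead" regressed badly
    # after PyTorch 2.8, so "max-autotune" is used instead.
    return "max-autotune" if sys.version_info >= (3, 11) else "reduce-overhead"

def compile_fn(fn):
    return torch.compile(fn, mode=compile_mode())
```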
Key design choices:
- AMP off for small batches — AMP incurs a GPU/CPU sync that becomes a bottleneck when batch and model sizes are small.
- CPU buffer for CPU simulators — With only 1 env, the overhead of GPU buffer operations outweighs the benefit. GPU buffer only pays off with large parallel envs.
- Compiled critical paths — Weight normalization, target critic EMA, `_select_min_q_log_probs`, and `_compute_categorical_td_target` are compiled for speed (a sketch of the EMA update follows below).
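As an illustration, the target critic EMA is a Polyak average that can be wrapped in `torch.compile`. A minimal sketch; parameter handling and the `tau` value are placeholders, not the repository's actual implementation:

```python
# Sketch of a compiled target-critic EMA (Polyak) update; tau is a placeholder.
import torch

@torch.compile
def ema_update(target_params: list, online_params: list, tau: float = 0.005) -> None:
    with torch.no_grad():
        for tgt, src in zip(target_params, online_params):
            tgt.lerp_(src, tau)  # tgt <- (1 - tau) * tgt + tau * src

# Usage (hypothetical networks):
# ema_update(list(target_critic.parameters()), list(critic.parameters()))
```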
See the scripts/ directory for recommended per-environment configurations.
Agent checkpoints and replay buffers can be saved and loaded during training.
Checkpoints are saved automatically at the end of training by default. To save at regular intervals, set `save_checkpoint_per_interaction_step` and optionally `save_buffer_per_interaction_step`:
```bash
uv run python train.py \
  --overrides save_checkpoint_per_interaction_step=24400 \
  --overrides save_buffer_per_interaction_step=24400
```

Checkpoints are saved to `models/<group>/<exp>/<env_name>/seed<seed>-<timestamp>/step<N>/` and include the actor, critic, target critic, temperature, reward normalizer, and agent state (update step, grad scaler).
To resume training from a checkpoint, provide `agent_load_path` and optionally `buffer_load_path`:
```bash
uv run python train.py \
  --overrides agent_load_path='models/.../step24400' \
  --overrides buffer_load_path='models/.../step24400'
```

By default, optimizer and reward normalizer states are also restored. This can be configured via `agent.load_optimizer` and `agent.load_reward_normalizer` in the agent config.
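To inspect a checkpoint outside of `train.py`, a step directory can be loaded directly with `torch.load`. A minimal sketch; only `actor.pt` is named in this README, so treat the directory layout and the path below as assumptions:

```python
# Sketch: inspect a saved checkpoint directory (path and file layout assumed).
import os
import torch

ckpt_dir = "models/my_group/my_exp/humanoid-walk/seed0-20260101/step24400"  # hypothetical path
print(os.listdir(ckpt_dir))  # check what was actually saved before loading

# Depending on how the file was serialized, you may need weights_only=False.
actor_state = torch.load(os.path.join(ckpt_dir, "actor.pt"), map_location="cpu")
# actor_state can then be restored into a matching actor network via load_state_dict.
```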
Trained IsaacLab agents can be visualized in the Isaac Sim viewport using `play_isaaclab.py`. This uses the same Hydra config system as training — pass the same `--overrides` you trained with so the network architecture matches the checkpoint.
```bash
uv run python play_isaaclab.py \
  --checkpoint_path 'models/.../step24400' \
  --num_envs 16 \
  --num_episodes 10 \
  --overrides env=isaaclab \
  --overrides env.env_name='Isaac-Velocity-Flat-G1-v0' \
  --overrides agent=flashSAC \
  --overrides agent.asymmetric_observation=true \
  --overrides agent.buffer_max_length=1
```

Key arguments:
| Argument | Description |
|---|---|
| `--checkpoint_path` | Path to the saved checkpoint directory (contains `actor.pt`, etc.) |
| `--num_envs` | Number of parallel environments to visualize (default: 16) |
| `--num_episodes` | Number of episodes to run (default: 10) |
| `--overrides` | Same Hydra overrides used during training |
> [!NOTE]
> `agent.buffer_max_length` can be set to a small value (e.g., 1) since the replay buffer is not used during play.
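Conceptually, playback is just a deterministic rollout of the restored actor. A generic sketch under the Gymnasium-style API used by the wrappers in `flash_rl/envs/`; the `env` and `actor` objects here are placeholders, not the script's actual interfaces:

```python
# Generic single-env rollout sketch; `env` and `actor` stand in for whatever
# play_isaaclab.py constructs from the Hydra config and the checkpoint.
import torch

@torch.no_grad()
def rollout(env, actor, num_episodes: int = 10) -> None:
    for _ in range(num_episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            action = actor(torch.as_tensor(obs, dtype=torch.float32))
            obs, reward, terminated, truncated, _ = env.step(action.numpy())
            done = terminated or truncated
```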
```
flash_rl/
  agents/            # Agent implementations (FlashSAC, random)
  buffers/           # Replay buffer implementations
  common/            # Logger (wandb / tensorboard)
  envs/              # Environment wrappers (Gymnasium 1.1 API)
  evaluation.py      # Evaluation and video recording
configs/             # Hydra configs (base, agent, env)
scripts/             # Launch scripts per environment
results/             # Experiment results and plots
train.py             # Training entry point
play_isaaclab.py     # IsaacLab visualization entry point
```
```bash
uv sync --dev   # install formatters, linter, type checker
./bin/lint      # run Black, Ruff, Mypy
```

```bibtex
@article{kim2026flashsac,
  title={FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control},
  author={Kim, Donghu and Lee, Youngdo and Park, Minho and Kim, Kinam and Nahrendra, I Made Aswin and Seno, Takuma and Min, Sehee and Palenicek, Daniel and Vogt, Florian and Kragic, Danica and Peters, Jan and Choo, Jaegul and Lee, Hojoon},
  journal={arXiv preprint arXiv:2604.04539},
  year={2026}
}
```