Reinforcement Learning with Deep Energy-Based Policies
Switch branches/tags
Nothing to show
Clone or download
haarnoja Merge pull request #10 from haarnoja/composition
Adds policy composition code.
Latest commit e7ee214 Mar 22, 2018

Soft Q-Learning

Soft Q-learning (SQL) is a deep reinforcement learning framework for training maximum entropy policies in continuous domains. The algorithm is based on the paper Reinforcement Learning with Deep Energy-Based Policies presented at the International Conference on Machine Learning (ICML), 2017.

Getting Started

Soft Q-learning can be run either locally or through Docker.


You will need to have Docker and Docker Compose installed unless you want to run the environment locally.

Most of the models require a MuJoCo license.

Docker Installation

Currently, rendering of simulations is not supported on Docker due to a missing display setup. As a fix, you can use a local installation. If you want to run the MuJoCo environments without rendering, the docker environment needs to know where to find your MuJoCo license key (mjkey.txt). You can either copy your key into <PATH_TO_THIS_REPOSITY>/.mujoco/mjkey.txt, or you can specify the path to the key in your environment variables:

export MUJOCO_LICENSE_PATH=<path_to_mujoco>/mjkey.txt

Once that's done, you can run the Docker container with

docker-compose up

Docker compose creates a Docker container named soft-q-learning and automatically sets the needed environment variables and volumes.

You can access the container with the typical Docker exec-command, i.e.

docker exec -it soft-q-learning bash

See examples section for examples of how to train and simulate the agents.

To clean up the setup:

docker-compose down

Local Installation

To get the environment installed correctly, you will first need to clone rllab, and have its path added to your PYTHONPATH environment variable.

  1. Clone rllab
cd <installation_path_of_your_choice>
git clone
cd rllab
git checkout b3a28992eca103cab3cb58363dd7a4bb07f250a0
  1. Download and copy MuJoCo files to rllab path: If you're running on OSX, download instead, and copy the .dylib files instead of .so files.
mkdir -p /tmp/mujoco_tmp && cd /tmp/mujoco_tmp
wget -P .
mkdir <installation_path_of_your_choice>/rllab/vendor/mujoco
cp ./mjpro131/bin/ <installation_path_of_your_choice>/rllab/vendor/mujoco
cp ./mjpro131/bin/ <installation_path_of_your_choice>/rllab/vendor/mujoco
cd ..
rm -rf /tmp/mujoco_tmp
  1. Copy your MuJoCo license key (mjkey.txt) to rllab path:
cp <mujoco_key_folder>/mjkey.txt <installation_path_of_your_choice>/rllab/vendor/mujoco
  1. Clone softqlearning
cd <installation_path_of_your_choice>
git clone
  1. Create and activate conda environment
cd softqlearning
conda env create -f environment.yml
source activate sql

The environment should be ready to run. See examples section for examples of how to train and simulate the agents.

Finally, to deactivate and remove the conda environment:

source deactivate
conda remove --name sql --all


Training and simulating an agent

  1. To train the agent
python ./examples/ --env=swimmer --log_dir="/root/sql/data/swimmer-experiment"
  1. To simulate the agent (NOTE: This step currently fails with the Docker installation, due to missing display.)
python ./scripts/ /root/sql/data/swimmer-experiment/itr_<iteration>.pkl contains several different environments and there are more example scripts available in the /examples folder. For more information about the agents and configurations, run the scripts with --help flag. For example:

python ./examples/ --help
usage: [-h]
                         [--env {ant,walker,swimmer,half-cheetah,humanoid,hopper}]
                         [--exp_name EXP_NAME] [--mode MODE]
                         [--log_dir LOG_DIR]

Training and combining policies

It is also possible to merge two existing maximum entropy policies to form a new composed skill that approximately optimizes both constituent tasks simultaneously as discussed in Composable Deep Reinforcement Learning for Robotic Manipulation. To run the pusher experiment described in the paper, you can first train two policies for the constituent tasks ("push the object to the given x-coordinate" and "push the object to the given y-coordinate") by running

python ./examples/ --log_dir=/root/sql/data/pusher

You can then combine the two policies to form a combined skill ("push the object to the given x and y coordinates"), without collecting more experience form the environment, with

python ./examples/ --log_dir=/root/sql/data/pusher/combined \
--snapshot1=/root/sql/data/pusher/00/params.pkl \


The soft q-learning algorithm was developed by Haoran Tang and Tuomas Haarnoja under the supervision of Prof. Sergey Levine and Prof. Pieter Abbeel at UC Berkeley. Special thanks to Vitchyr Pong, who wrote some parts of the code, and Kristian Hartikainen who helped testing, documenting, and polishing the code and streamlining the installation process. The work was supported by Berkeley Deep Drive.


  title={Reinforcement Learning with Deep Energy-Based Policies},
  author={Haarnoja, Tuomas and Tang, Haoran and Abbeel, Pieter and Levine, Sergey},
  booktitle={International Conference on Machine Learning},
  title={Composable Deep Reinforcement Learning for Robotic Manipulation},
  author={Tuomas Haarnoja, Vitchyr Pong, Aurick Zhou, Murtaza Dalal, Pieter Abbeel, Sergey Levine},
  booktitle={International Conference on Robotics and Automation},