TensorFlow/Keras code and trained models for Episodic Curiosity Through Reachability

Episodic Curiosity Through Reachability

In ICLR 2019 [Project Website][Paper]

Nikolay Savinov¹, Anton Raichuk², Raphaël Marinier², Damien Vincent², Marc Pollefeys¹, Timothy Lillicrap³, Sylvain Gelly²
¹ETH Zurich, ²Google AI, ³DeepMind

[Videos: Navigation out of curiosity | Locomotion out of curiosity]

This is an implementation of our ICLR 2019 paper Episodic Curiosity Through Reachability. If you use this work, please cite:

    Author = {Savinov, Nikolay and Raichuk, Anton and Marinier, Rapha{\"e}l and Vincent, Damien and Pollefeys, Marc and Lillicrap, Timothy and Gelly, Sylvain},
    Title = {Episodic Curiosity through Reachability},
    Booktitle = {International Conference on Learning Representations ({ICLR})},
    Year = {2019}

Installation

The code was tested on Linux only, and it assumes that the command "python" invokes Python 2.7. We recommend using virtualenv:

sudo apt-get install python-pip
pip install virtualenv
python -m virtualenv episodic_curiosity_env
source episodic_curiosity_env/bin/activate


Clone this repository:

git clone
cd episodic-curiosity

We require a modified version of DeepMind Lab:

Clone DeepMind Lab:

git clone
cd lab

Apply our patch to DeepMind Lab:

git checkout 7b851dcbf6171fa184bf8a25bf2c87fe6d3f5380
git checkout -b modified_dmlab
git apply ../third_party/dmlab/dmlab_min_goal_distance.patch

Install DMLab as a pip module by following these instructions.

In a nutshell, once you've installed DMLab dependencies, you need to run:

bazel build -c opt python/pip_package:build_pip_package
./bazel-bin/python/pip_package/build_pip_package /tmp/dmlab_pkg
pip install /tmp/dmlab_pkg/DeepMind_Lab-1.0-py2-none-any.whl --force-reinstall

Finally, install episodic curiosity and its pip dependencies:

cd ..  # back from lab/ to the episodic-curiosity root
pip install -e .
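After installation, a quick sanity check can confirm that the interpreter is Python 2.7 and that both DMLab and this package import cleanly. The sketch below assumes the DMLab pip package exposes the module deepmind_lab (the name used by DeepMind Lab's Python API) and that this repository installs as episodic_curiosity (the package name used by the visualization command further down); the helper name check_installation is hypothetical.

```python
import importlib
import sys


def check_installation(modules=("deepmind_lab", "episodic_curiosity")):
    """Report whether the interpreter is 2.7 and which modules import."""
    report = {"python_2_7": sys.version_info[:2] == (2, 7)}
    for name in modules:
        try:
            importlib.import_module(name)
            report[name] = True
        except ImportError:
            # Not installed (or one of its own dependencies is missing).
            report[name] = False
    return report


if __name__ == "__main__":
    for key, ok in sorted(check_installation().items()):
        print("%-20s %s" % (key, "OK" if ok else "NO"))
```

Run it from the activated virtualenv; any "NO" line points at the installation step to revisit.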

Resource requirements for training

Environment | Training method | Required GPU | Recommended RAM
DMLab | PPO | No | 32GBs
DMLab | PPO + Grid Oracle | No | 32GBs
DMLab | PPO + EC using already trained R-networks | No | 32GBs
DMLab | PPO + EC with R-network training | Yes, otherwise training is slower by >20x. Required GPU RAM: 5GBs | Tip: reduce dataset_buffer_size to use less RAM, at the expense of policy performance.
DMLab | PPO + ECO | Yes, otherwise training is slower by >20x. Required GPU RAM: 5GBs | Tip: reduce observation_history_size to use less RAM, at the expense of policy performance.
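Why the tips help: if a buffer holds raw observations, its memory grows linearly with the number of stored items. The sketch below assumes, purely for illustration, that each item is an uncompressed uint8 RGB frame and ignores per-item overhead; the helper buffer_ram_gib and the example frame size and buffer length are hypothetical, not the repository's defaults.

```python
def buffer_ram_gib(num_items, height, width, channels=3, bytes_per_value=1):
    """Approximate RAM (GiB) to store num_items raw frames.

    Assumption: each item is an uncompressed array of shape
    (height, width, channels) with bytes_per_value bytes per entry.
    """
    total_bytes = num_items * height * width * channels * bytes_per_value
    return total_bytes / float(1024 ** 3)


if __name__ == "__main__":
    # Hypothetical values: 100000 frames of 120x160 RGB.
    print("%.2f GiB" % buffer_ram_gib(100000, 120, 160))
    print("%.2f GiB" % buffer_ram_gib(50000, 120, 160))  # half the buffer
```

With those hypothetical numbers the buffer needs about 5.36 GiB, and halving the buffer size halves that figure.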

Trained models

Trained R-networks and policies can be found in the episodic-curiosity Google Cloud Storage bucket. You can access them via the web interface, or copy them with the gsutil command from the Google Cloud SDK:

gsutil -m cp -r gs://episodic-curiosity/r_networks .
gsutil -m cp -r gs://episodic-curiosity/policies .

Example command to visualize a trained policy over two episodes of 1000 steps and create videos similar to the ones at the top of this page:

python -m episodic_curiosity.visualize_curiosity_reward --workdir=/tmp/ec_visualizations --r_net_weights=<path_to_r_network> --policy_path=<path_to_trained_policy> --alsologtostderr --num_episodes=2 --num_steps=1000 --visualization_type=surrogate_reward --trajectory_mode=do_nothing

This requires that you install extra dependencies for generating videos, with pip install -e .[video]
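If you prefer to drive the visualization from Python, for example to loop over several downloaded checkpoints, the flags of the example command can be assembled into an argument list. This is a hypothetical helper, not part of the repository; it only reproduces the flags shown in that command and uses sys.executable so the job runs under the same interpreter (Python 2.7 per the installation notes).

```python
import sys


def visualization_command(r_net_weights, policy_path,
                          workdir="/tmp/ec_visualizations",
                          num_episodes=2, num_steps=1000):
    """Build argv mirroring the README's visualization example."""
    return [
        sys.executable, "-m", "episodic_curiosity.visualize_curiosity_reward",
        "--workdir=%s" % workdir,
        "--r_net_weights=%s" % r_net_weights,
        "--policy_path=%s" % policy_path,
        "--alsologtostderr",
        "--num_episodes=%d" % num_episodes,
        "--num_steps=%d" % num_steps,
        "--visualization_type=surrogate_reward",
        "--trajectory_mode=do_nothing",
    ]


if __name__ == "__main__":
    # Replace the placeholders with real paths before running the command.
    print(" ".join(visualization_command("<path_to_r_network>",
                                         "<path_to_trained_policy>")))
```

Passing the list to subprocess.check_call runs the visualization; keeping argv as a list avoids shell-quoting issues with paths.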

Training

On a single machine

scripts/ is the main entry point to reproduce the results of Table 1 in the paper. For instance, the following command line launches training of the PPO + EC method on the Sparse+Doors scenario:

python episodic_curiosity/scripts/ --workdir=/tmp/ec_workdir --method=ppo_plus_ec --scenarios=sparseplusdoors

Main flags:

Flag Description
--method Solving method to use; corresponds to the rows of Table 1 in the paper. Possible values: ppo, ppo_plus_ec, ppo_plus_eco, ppo_plus_grid_oracle
--scenario Scenario to launch; corresponds to the columns of Table 1 in the paper. Possible values: noreward, norewardnofire, sparse, verysparse, sparseplusdoors, dense1, dense2
--workdir Directory where logs and checkpoints will be stored.
--run_number Run number of the current run. This is used to create an appropriate subdir in workdir.
--r_networks_path Only meaningful for the ppo_plus_ec method. Path to the root dir of pre-trained R-networks. If specified, the policy is trained using those pre-trained R-networks. If not specified, we first generate the R-network training data, train the R-network, and then train the policy.
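The flag constraints in the table can be checked before starting a multi-day run. The sketch below is hypothetical: launcher_command is not part of the repository, the launcher script path is passed in because its file name is not given here, and it emits --scenarios to match the example command (the table spells the flag --scenario). It rejects --r_networks_path for methods other than ppo_plus_ec, which is stricter than the table's "only meaningful" wording.

```python
# Possible values copied from the flags table.
METHODS = ("ppo", "ppo_plus_ec", "ppo_plus_eco", "ppo_plus_grid_oracle")
SCENARIOS = ("noreward", "norewardnofire", "sparse", "verysparse",
             "sparseplusdoors", "dense1", "dense2")


def launcher_command(launcher_script, workdir, method, scenario,
                     run_number=None, r_networks_path=None):
    """Validate flag values and return argv for one training launch."""
    if method not in METHODS:
        raise ValueError("unknown method: %r" % method)
    if scenario not in SCENARIOS:
        raise ValueError("unknown scenario: %r" % scenario)
    if r_networks_path is not None and method != "ppo_plus_ec":
        raise ValueError("--r_networks_path only applies to ppo_plus_ec")
    argv = ["python", launcher_script,
            "--workdir=%s" % workdir,
            "--method=%s" % method,
            "--scenarios=%s" % scenario]
    if run_number is not None:
        argv.append("--run_number=%d" % run_number)
    if r_networks_path is not None:
        argv.append("--r_networks_path=%s" % r_networks_path)
    return argv
```

For example, launcher_command("<launcher_script>", "/tmp/ec_workdir", "ppo_plus_ec", "sparseplusdoors") reproduces the example command with the missing script name left as a placeholder.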

Training takes a couple of days. We used CPUs with 16 hyper-threads, but smaller CPUs should do.

Under the hood, the launcher script starts the training jobs with the right hyperparameters. For the method ppo_plus_ec, it first runs a job that accumulates training data for the R-network using a random policy, then a job that trains the R-network, and finally a job that trains the policy. In the method ppo_plus_eco, all of this happens online as part of the policy training.
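The job ordering can be summarized as a small function. The stage names are hypothetical labels, not the repository's script names; the logic only encodes what the flags table and the paragraph above state: ppo_plus_ec without --r_networks_path runs three jobs, ppo_plus_ec with pre-trained R-networks goes straight to policy training, and the other methods (including ppo_plus_eco, which trains its R-network online) run a single policy-training job.

```python
METHODS = ("ppo", "ppo_plus_ec", "ppo_plus_eco", "ppo_plus_grid_oracle")


def training_stages(method, r_networks_path=None):
    """Return the ordered (hypothetical) jobs for one training run."""
    if method not in METHODS:
        raise ValueError("unknown method: %r" % method)
    if method == "ppo_plus_ec" and r_networks_path is None:
        # Random-policy data -> R-network training -> policy training.
        return ["generate_r_training_data", "train_r_network", "train_policy"]
    # Pre-trained R-networks supplied, or a method whose R-network (if any)
    # is trained online inside the policy job (ppo_plus_eco).
    return ["train_policy"]
```

This also explains the resource table: only the stages that train an R-network need a GPU to avoid the >20x slowdown.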

On Google Cloud

First, make sure you have the Google Cloud SDK installed.

scripts/ is the main entry point. Edit the script and replace the FILL-MEs with the details of your GCP project. In particular, you will need to point it to a GCP disk snapshot with the installed dependencies as described in the Installation section.

IMPORTANT: By default, the script reproduces all results in Table 1 and launches ~300 VMs with GPUs on Google Cloud (7 scenarios x 4 methods x 10 runs = 280). Running all those VMs is very expensive: on the order of USD 30 per day per VM, based on early-2019 GCP pricing. Pass --i_understand_launching_vms_is_expensive to scripts/ to acknowledge this.
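The VM count and daily cost follow directly from these numbers. This sketch enumerates the (scenario, method, run_number) tuples from the flags table and applies the rough USD 30 per VM-day figure quoted above; the 0-based run numbering and the two-day duration in the example (based on "a couple of days" of training) are assumptions.

```python
import itertools

# Values copied from the flags table.
METHODS = ("ppo", "ppo_plus_ec", "ppo_plus_eco", "ppo_plus_grid_oracle")
SCENARIOS = ("noreward", "norewardnofire", "sparse", "verysparse",
             "sparseplusdoors", "dense1", "dense2")
RUN_NUMBERS = range(10)  # assumption: runs numbered 0..9

USD_PER_VM_DAY = 30  # rough early-2019 figure quoted in this README


def all_jobs():
    """One (scenario, method, run_number) tuple per VM."""
    return list(itertools.product(SCENARIOS, METHODS, RUN_NUMBERS))


def estimated_cost_usd(num_vms, days, usd_per_vm_day=USD_PER_VM_DAY):
    return num_vms * days * usd_per_vm_day


if __name__ == "__main__":
    n = len(all_jobs())
    print("%d VMs, ~USD %d per day, ~USD %d for two days"
          % (n, estimated_cost_usd(n, 1), estimated_cost_usd(n, 2)))
```

Running it prints "280 VMs, ~USD 8400 per day, ~USD 16800 for two days", which is why trimming the scenario, method, or run lists before launching matters.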

Under the hood, the script launches one VM for each (scenario, method, run_number) tuple. The VMs use startup scripts to launch training and retrieve the parameters of the run through Instance Metadata.
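On Compute Engine, custom instance metadata is served by the metadata server and must be requested with the "Metadata-Flavor: Google" header. The sketch below shows that protocol only; the attribute keys (for example "method") are hypothetical, since the keys set by the repository's startup scripts are not listed here, and read_run_parameter only works from inside a GCE VM.

```python
try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen  # Python 2.7

METADATA_URL = ("http://metadata.google.internal/computeMetadata/v1/"
                "instance/attributes/%s")


def metadata_request(key):
    """Build a request for one custom instance-metadata attribute."""
    return Request(METADATA_URL % key, headers={"Metadata-Flavor": "Google"})


def read_run_parameter(key, timeout=5):
    """Fetch the attribute value as text (only reachable on a GCE VM)."""
    response = urlopen(metadata_request(key), timeout=timeout)
    try:
        return response.read().decode("utf-8")
    finally:
        response.close()
```

Without the header the metadata server rejects the request, which guards against requests that were not explicitly meant for it.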

TIP: Use sudo journalctl -u google-startup-scripts.service to see the logs of the startup script.

Training logs

Each training job stores logs and checkpoints in a workdir. The workdir is organized as follows:

File or Directory Description
r_training_data/{R_TRAINING,VALIDATION}/ TFRecords with data generated by a random policy for R-network training. Only for method ppo_plus_ec without supplying pre-trained R-networks.
r_networks/ Keras checkpoints of trained R-networks. Only for method ppo_plus_ec without supplying pre-trained R-networks.
reward_{train,valid,test}.csv CSV files with {train,valid,test} rewards, tracking the performance of the policy at multiple training steps.
checkpoints/ Checkpoints of the policy.
log.txt, progress.csv Training logs and CSV from OpenAI's PPO2 code.

On cloud, the workdir of each job will be synced to a cloud bucket directory of the form <cloud_bucket_root>/<vm_id>/<method>/<scenario>/run_number_<d>/.
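The synced directory for a given job can be computed from its tuple. A minimal sketch; cloud_run_dir is a hypothetical helper, and it uses posixpath so the separators stay "/" as in gs:// URLs regardless of the local OS.

```python
import posixpath


def cloud_run_dir(cloud_bucket_root, vm_id, method, scenario, run_number):
    """Return <cloud_bucket_root>/<vm_id>/<method>/<scenario>/run_number_<d>/."""
    return posixpath.join(cloud_bucket_root, vm_id, method, scenario,
                          "run_number_%d" % run_number) + "/"


if __name__ == "__main__":
    # Hypothetical bucket root and VM id.
    print(cloud_run_dir("gs://my-ec-bucket/results", "vm-0",
                        "ppo_plus_ec", "sparseplusdoors", 0))
```

The resulting path can be passed to gsutil -m cp -r to fetch one job's workdir.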

We provide a colab to plot graphs during training of the policies, using data from the reward_{train,valid,test}.csv files.
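Outside the colab, the same CSV files can be loaded with the standard library. The column names are parameters because the exact header written by the training code is not documented here (check the first line of the file); read_reward_curve is a hypothetical helper, and it sorts by step so that out-of-order rows still plot correctly.

```python
import csv


def read_reward_curve(csv_path, step_column, reward_column):
    """Return sorted (step, reward) pairs from a reward_*.csv file."""
    points = []
    with open(csv_path) as f:
        for row in csv.DictReader(f):
            points.append((float(row[step_column]), float(row[reward_column])))
    points.sort()
    return points
```

The returned pairs can be handed directly to any plotting library, for example one curve per reward_{train,valid,test}.csv file.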

Known limitations

  • As of 2019/02/20, the ppo_plus_eco method is not robust to restarts, because the R-network trained online is not checkpointed.
  • This repo only covers training on DeepMind Lab. We are also considering releasing the code for training on MuJoCo in the future.