This work introduces a new, scalable, model-free actor-critic algorithm based on Proximal Policy Optimization (PPO) that uses a deep Gaussian process to directly approximate both the policy and the value function.
This project uses Poetry for dependency management. You can set up a virtual environment with Poetry, which can itself be installed using pip:

```bash
pip install poetry
```
Then create the virtual environment and install the required dependencies (see `poetry.lock` and `pyproject.toml`):

```bash
poetry config virtualenvs.in-project true  # keep the virtual environment inside the project directory
poetry install
```
The virtual environment can be accessed from the shell using:

```bash
poetry shell
```
IDEs like PyCharm will be able to detect the interpreter of this virtual environment.
Alternatively, generate a `requirements.txt`:

```bash
poetry export -f requirements.txt --without-hashes > requirements.txt
```

and install from it with pip:

```bash
python3 -m pip install -r requirements.txt
```
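Note that this route installs into whichever Python environment is currently active, so you will usually want to create and activate one first. A minimal sketch using the standard library `venv` module:

```bash
python3 -m venv .venv                       # create a virtual environment in ./.venv
source .venv/bin/activate                   # activate it (Windows: .venv\Scripts\activate)
python3 -m pip install -r requirements.txt  # install the exported dependencies
```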
This project uses Hydra for configuration management. To run the code with a specific mode and agent (or just `python main.py` for the default configs):

```bash
python main.py mode=train agent=gppo_walker2d  # or another algorithm's agent config
```
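Because Hydra composes the run configuration from `config.yaml` plus the selected `mode` and `agent` config groups, either group can be swapped on the command line. For example, assuming an `eval` mode config exists alongside `train.yaml` (the group values here are illustrative):

```bash
python main.py mode=eval agent=gppo_walker2d  # evaluate a trained agent instead of training
```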
The codebase currently supports training and evaluation, with optional experiment tracking via Weights & Biases. Models can be saved and loaded, and the results for each agent (GPPO, PPO, etc.) are saved in the specified results directory. To reproduce results, ensure that `train.yaml` and `config.yaml` are configured as follows:
`train.yaml`:

```yaml
name: train
train: True
load_model: False
save_model: True
import_metrics: True
export_metrics: True
```
`config.yaml`:

```yaml
defaults:
  - mode: train # default mode; can be overridden by command-line args
  - agent: gppo_walker2d # replace with the desired algorithm
  - _self_

num_episodes: 10000
num_bootstrap_samples: 100
num_runs: 1
results_save_path: "./results/"

environment: "Walker2d-v5"
normalize_obs: True
normalize_act: False
clip_obs: 10.0

wandb:
  project: "gppo-drl"
  entity: "ml_exp"
  use_wandb: True # disable if you do not want tracking
```
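Individual fields from `config.yaml` can also be overridden on the command line without editing the file, using the keys shown above, e.g.:

```bash
python main.py num_episodes=5000 wandb.use_wandb=False  # shorter run, W&B tracking disabled
```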
The codebase is organised as follows:

- `agents` contains the core RL implementations (e.g., `PPO`, `GPPO`) and a factory class for instantiating them.
- `gp` contains implementations of deep Gaussian processes (vanilla and a Deep Sigma Point Process variant), as well as GPPO-specific implementation variants and objective functions.
- `hyperparam_tuning` contains a generic implementation of the Bayesian optimization algorithm and additional helper functions.
- `metrics` contains the `MetricsTracker` class, which can be used to aggregate metrics across runs (see the sketch after this list).
- `util` contains utility functions and classes such as `replaybuffer` and `rolloutbuffer`.
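The repository's `MetricsTracker` API is not reproduced here, but the underlying idea of aggregating metrics across runs is straightforward. A minimal, self-contained sketch (all class and method names below are illustrative, not the repo's actual API):

```python
from collections import defaultdict
import statistics

class SimpleMetricsTracker:
    """Collects scalar metrics per run and aggregates them across runs."""

    def __init__(self):
        # metric name -> list of one value per run
        self._values = defaultdict(list)

    def record(self, name: str, value: float) -> None:
        self._values[name].append(value)

    def aggregate(self) -> dict:
        # Mean and sample standard deviation across runs for each metric.
        return {
            name: {
                "mean": statistics.mean(vals),
                "std": statistics.stdev(vals) if len(vals) > 1 else 0.0,
            }
            for name, vals in self._values.items()
        }

tracker = SimpleMetricsTracker()
for final_return in [512.3, 498.7, 530.1]:  # e.g. final returns from three runs
    tracker.record("episode_return", final_return)
print(tracker.aggregate())
```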