
PyTorch MOPO

Python 3.11

This repository provides a PyTorch implementation of MOPO: Model-based Offline Policy Optimization, based on SheepRL and accelerated with Lightning Fabric.

Getting Started

Installation

  1. Download MuJoCo and put it in the ~/.mujoco folder (https://github.com/openai/mujoco-py?tab=readme-ov-file#install-mujoco).
  2. Update the LD_LIBRARY_PATH environment variable:
    • export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/.mujoco/mujoco210/bin
    • export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/nvidia
  3. Install Poetry (https://python-poetry.org/docs/#installing-with-the-official-installer).
  4. Create a virtual environment (Python 3.11).
  5. sudo apt-get install libosmesa6-dev patchelf
  6. pip install "pyproject-hooks==1.0.0"
  7. poetry install

Train and Evaluate MOPO

Training

To run an experiment you can launch the following commands:

  1. cd torch-mopo
  2. MUJOCO_GL=egl mopo exp=halfcheetah-medium

Of course, you can add overrides or change the experiment name. The exp argument must specify a config filename in the ./torch_mopo/configs/exp/ folder. The MUJOCO_GL environment variable must be set according to the accelerator you are using:

  • egl: if you have one or more GPUs.
  • osmesa: if you are using the CPU.

Evaluation

You can also evaluate the trained agent: from the root folder of the repository (~/torch-mopo), run the following command.

MUJOCO_GL=egl mopo_eval checkpoint_path=/path/to/checkpoint.ckpt

A bit of theory

MOPO is a model-based offline RL algorithm. Model-based means that it learns the dynamics of the environment, i.e., how actions affect the state of the environment and how much reward they yield. MOPO handles regions outside the data distribution conservatively, by penalizing the reward in those areas. The penalty is model-dependent and does not necessarily penalize all uncertain (out-of-distribution) states and actions equally.

Goals

MOPO aims to solve some of the critical issues of offline RL:

  • Generalize the MDP dynamics using only the data available in the dataset, which usually offers sub-optimal coverage of both states and actions. The agent should be able to make good decisions also in out-of-distribution states and actions.
  • Solve a (target) task different from the one the dataset was collected for. This is useful when collecting a significant amount of samples for the target task is difficult or when there is not enough data for it.

MOPO seeks to achieve these two objectives with a trade-off between the return and the risk of taking actions in unknown states.

Dynamics Model

The dynamics model (also known as the world model) is responsible for learning the environment dynamics, i.e., it predicts the next state and the reward, given the current state and the action taken in that state. It is composed of an ensemble of $N$ MLPs that model a Gaussian distribution over the next state and reward and are trained independently via maximum likelihood.

$$\hat{T}_{\theta, \phi} \left( s_{t + 1}, r | s_t, a_t \right) = \mathcal{N} \left( \mu_{\theta}(s_t, a_t), \Sigma_{\phi} (s_t, a_t) \right)$$
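
As a concrete illustration, below is a minimal sketch of a single ensemble member that outputs the mean and standard deviation of a diagonal Gaussian over the next state and reward. It is not the repository's actual model; the architecture, activation, and clamping values are assumptions.

```python
import torch
from torch import nn


class GaussianDynamicsMember(nn.Module):
    """One MLP of the ensemble: a diagonal Gaussian over (next_state, reward)."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 200):
        super().__init__()
        out_dim = state_dim + 1  # next state + scalar reward
        self.backbone = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.SiLU(),
        )
        self.mean = nn.Linear(hidden_dim, out_dim)
        self.log_std = nn.Linear(hidden_dim, out_dim)

    def forward(self, state: torch.Tensor, action: torch.Tensor):
        h = self.backbone(torch.cat([state, action], dim=-1))
        mean = self.mean(h)
        std = self.log_std(h).clamp(-10.0, 2.0).exp()  # keep the std in a sane range
        return torch.distributions.Normal(mean, std)


# Maximum-likelihood training of one member, independently of the others:
# loss = -member(s, a).log_prob(torch.cat([next_s, r.unsqueeze(-1)], -1)).sum(-1).mean()
```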

The uncertainty estimator is built from the bootstrap ensemble: the penalty $u(s,a)$ is defined as the maximum, over the models in the ensemble, of the norm of the predicted standard deviation. While this estimator lacks theoretical guarantees, the authors find that it is sufficiently accurate to achieve good performance in practice.

$$u(s,a) = \max_{i = 1, \dots, N} \lVert \Sigma^{i}_{\phi} (s, a) \rVert _{F}$$

Finally, the predicted reward is penalized by subtracting this quantity multiplied by a penalty coefficient.

$$\tilde{r}(s, a) = \hat{r}(s, a) - \lambda \max_{i = 1, \dots, N} \lVert \Sigma^{i}_{\phi} (s, a) \rVert _{F}$$

$\lambda$ is a hyper-parameter that should be set according to the ability of the error estimator to estimate the true error: if the heuristic ($u(s, a)$) underestimates the true error, $\lambda$ should be large; if it overestimates the true error, it should be small.
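
For illustration, here is a minimal sketch of the penalty and the penalized reward, assuming each ensemble member exposes its predicted per-dimension standard deviations as a tensor (for a diagonal Gaussian, the Frobenius norm reduces to the vector 2-norm); shapes and names are placeholders, not the repository's API.

```python
import torch


def penalized_reward(pred_reward: torch.Tensor,
                     ensemble_stds: torch.Tensor,
                     penalty_coef: float) -> torch.Tensor:
    """pred_reward: (batch,); ensemble_stds: (num_models, batch, state_dim + 1)."""
    # u(s, a) = max_i ||Sigma_i(s, a)||_F; with a diagonal Gaussian this is the
    # 2-norm of the per-dimension standard deviations of each ensemble member.
    penalty = ensemble_stds.norm(dim=-1).max(dim=0).values  # (batch,)
    return pred_reward - penalty_coef * penalty


# Toy usage with random tensors:
stds = torch.rand(7, 32, 18)   # 7 models, batch of 32, |S| + 1 = 18
rewards = torch.randn(32)
print(penalized_reward(rewards, stds, penalty_coef=1.0).shape)  # torch.Size([32])
```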

Dataset and Policy Update

The dataset is composed of the "environment dataset" ($\mathcal{D}_{\text{env}}$), which consists of the previously collected experiences. This dataset is augmented during training with the states (together with actions and rewards) predicted by the algorithm: starting from a state in $\mathcal{D}_{\text{env}}$, the actor selects $h$ actions in sequence and the dynamics models predict the next states and rewards, so the world model acts as a simulator of the real environment. The reward predicted by the dynamics model is penalized by the predicted error multiplied by $\lambda$. The termination signal is computed by applying the environment rules to the predicted observations, so it is a function of the next (predicted) observations. The new dataset constructed from these model rollouts is called $\mathcal{D}_{\text{model}}$.
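
The branched rollout described above can be sketched as follows. The policy, dynamics, and termination function are passed in as plain callables, so this is an illustration of the procedure rather than the repository's implementation.

```python
import torch


@torch.no_grad()
def model_rollout(start_states, policy, dynamics, termination_fn, horizon):
    """Roll the learned dynamics for up to `horizon` steps from states sampled
    from D_env, collecting (s, a, penalized r, s', done) transitions for D_model."""
    transitions = []
    state = start_states
    for _ in range(horizon):
        action = policy(state)
        next_state, reward = dynamics(state, action)  # reward is already penalized
        done = termination_fn(next_state)             # env rules on predicted obs
        transitions.append((state, action, reward, next_state, done))
        if done.all():
            break
        state = next_state[~done]                     # continue only alive rollouts
    return transitions


# Toy usage with stand-in callables (the real ones are the trained models):
out = model_rollout(
    torch.randn(16, 17),
    policy=lambda s: torch.randn(s.shape[0], 6),
    dynamics=lambda s, a: (s + 0.01 * torch.randn_like(s), torch.randn(s.shape[0])),
    termination_fn=lambda s: torch.zeros(s.shape[0], dtype=torch.bool),
    horizon=5,
)
print(len(out))  # 5
```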

Finally, the policy is updated with SAC on a mixture of $\mathcal{D}_{\text{env}}$ (5% of each batch) and $\mathcal{D}_{\text{model}}$ (95% of each batch). The pseudocode of the algorithm is shown below.

[Image: pseudocode of the MOPO algorithm]
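
A minimal sketch of the 5%/95% batch mixing used for the SAC update is shown below; the buffer layout and field names are placeholders, not the repository's replay-buffer API.

```python
import torch


def sample_mixed_batch(d_env: dict, d_model: dict, batch_size: int,
                       real_ratio: float = 0.05) -> dict:
    """Sample `real_ratio` of the batch from D_env and the rest from D_model.
    Both buffers are dicts of tensors sharing the same keys."""
    n_env = int(batch_size * real_ratio)
    env_idx = torch.randint(len(d_env["obs"]), (n_env,))
    model_idx = torch.randint(len(d_model["obs"]), (batch_size - n_env,))
    return {k: torch.cat([d_env[k][env_idx], d_model[k][model_idx]]) for k in d_env}


# Toy usage: 5% of the 256 transitions come from D_env, 95% from D_model.
env_buf = {"obs": torch.randn(1_000, 17), "reward": torch.randn(1_000)}
model_buf = {"obs": torch.randn(5_000, 17), "reward": torch.randn(5_000)}
print(sample_mixed_batch(env_buf, model_buf, batch_size=256)["obs"].shape)  # (256, 17)
```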

Implementation

Since this MOPO implementation is based on SheepRL, you can run different experiments by defining different configurations (thanks to Hydra). A set of experiments is already provided in this repository; you can start from them and then define your own.

Note

This implementation of MOPO works only with vector observations. Images are not supported.

Environment Configuration

The first file to define is the ~/torch-mopo/torch_mopo/configs/env/d4rl.yaml config file, where all the environment-related arguments are defined. Its fields are described below.

The environment must have the id, the validation_split, and wrapper fields. The id identifies the environment and how the dataset was created; a list of available environments can be found here (only Gym environments). The validation_split specifies the validation dataset split percentage. Finally, the wrapper contains everything needed to instantiate the environment class, in this case the _target_, the id, and the render_mode.
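
To make the role of the wrapper field concrete, the following sketch shows how Hydra turns a config with a _target_ into an instance; the target used here (gymnasium.make) and the id are placeholders for illustration, while the real d4rl.yaml points at the repository's wrapper class.

```python
from hydra.utils import instantiate
from omegaconf import OmegaConf

# Placeholder wrapper config: the real d4rl.yaml points `_target_` at the
# repository's environment wrapper class rather than gymnasium.make.
wrapper_cfg = OmegaConf.create({
    "_target_": "gymnasium.make",
    "id": "CartPole-v1",        # illustrative id, not a D4RL dataset id
    "render_mode": "rgb_array",
})
env = instantiate(wrapper_cfg)  # calls gymnasium.make(id=..., render_mode=...)
print(env.observation_space)
```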

Note

The environment should not need wrappers such as action repeat or frame stack. The environment should already provide the observations and behave in the same way as when the dataset was collected.

Algorithm Configuration

The second file to define is the ~/torch-mopo/torch_mopo/configs/algo/mopo.yaml config file, where all the model-related arguments are defined.

The config inherits from the SAC configuration, since all the SAC components are necessary for learning an optimal policy; in addition, new arguments must be defined. Below is an explanation of the arguments in the ~/torch-mopo/torch_mopo/configs/algo/mopo.yaml config file (a short sketch after the list shows how they fit together):

  • total_steps: The total number of SAC updates.
  • h: The rollout length when simulating the environment.
  • num_epochs: The number of epochs for policy updates (total_steps // num_epochs SAC updates for each epoch).
  • rollout_batch_size: The number of initial states to sample from the d_env (b in the pseudo-code in Figure 4).
  • per_rank_batch_size: The batch size when updating SAC.
  • ensembles: The configurations of the dynamics model.
  • ensembles.num_ensembles: The number of MLPs in the dynamics model.
  • ensembles.num_elites: The number of models of the ensemble to use for predicting the next observation and the penalized reward.
  • ensembles.penalty_coef: The penalty coefficient to apply to the penalty (Figure 3).
  • ensembles.batch_size: The batch size used when learning the environment dynamics.
  • ensembles.max_epochs: The maximum number of epochs to perform during dynamics model learning (it can be null).
  • ensembles.max_epochs_since_update: The maximum number of epochs without improvements in the validation loss.
  • ensembles.optimizer: The configs of the dynamics model optimizer.
  • ensembles.optimizer.weight_decays: The list of weight decays to apply to the layers of the models in the ensemble.
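
To see how these arguments fit together, here is a rough, self-contained sketch of the resulting training schedule; the numbers are made up and are not the repository defaults.

```python
# Illustrative values only; the real ones live in the algo/exp config files.
total_steps = 1_000_000      # total number of SAC updates
num_epochs = 1_000           # policy-update epochs
rollout_batch_size = 50_000  # initial states sampled from D_env per epoch
h = 5                        # rollout length in the learned model

updates_per_epoch = total_steps // num_epochs  # SAC updates in each epoch
max_new_transitions = rollout_batch_size * h   # upper bound on D_model growth per epoch
print(updates_per_epoch, max_new_transitions)  # 1000 250000
```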

Experiment Configuration

The third config file that must be defined is the experiment config file; the ~/torch-mopo/torch_mopo/configs/exp/mopo.yaml file can be taken as an example.

As you can see, only a few arguments need to be defined. The algorithm and the environment are overridden with the configurations defined in the previous sections; then you need to define the env.id parameter and specify which metrics to log.

Note

You can use this configuration file as a basis for new configurations, as done in the ~/torch-mopo/torch_mopo/configs/exp/halfcheetah-medium.yaml file.

Add a Custom Environment for the Offline RL

A further step you can take is to use MOPO to learn the optimal policy in your own environment. This studio provides the D4RL environments, but nothing prevents you from training on another environment.

The procedure for adding a custom environment is almost identical to the one described in SheepRL - How to add an environment (you can find a tutorial on how to add a custom environment in this studio). The only difference is that your environment class must provide an additional method, get_dataset(), which takes as input the validation_split (integer) and the seed (integer, optional).
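
The snippet below sketches what such a class could look like; the class name, base class, dataset keys, and return format are assumptions made for illustration, so check the provided D4RL wrapper for the exact contract expected by the training scripts.

```python
import gymnasium as gym
import numpy as np


class MyOfflineEnv(gym.Env):
    """Hypothetical custom environment exposing an offline dataset."""

    def __init__(self, id: str, render_mode: str | None = None):
        super().__init__()
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(4,))
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(2,))
        self.render_mode = render_mode

    def get_dataset(self, validation_split: int, seed: int | None = None):
        """Return train/validation splits of previously collected transitions.
        The dict keys and the return format are assumptions for illustration."""
        rng = np.random.default_rng(seed)
        n = 1_000  # dummy data: a real env would load its offline dataset here
        data = {
            "observations": rng.normal(size=(n, 4)).astype(np.float32),
            "actions": rng.uniform(-1, 1, size=(n, 2)).astype(np.float32),
            "rewards": rng.normal(size=(n,)).astype(np.float32),
            "terminals": np.zeros(n, dtype=bool),
        }
        split = n - n * validation_split // 100  # validation_split is a percentage
        train = {k: v[:split] for k, v in data.items()}
        val = {k: v[split:] for k, v in data.items()}
        return train, val
```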

Once the class is implemented, you must define the configs of your custom environment like the ones in the environment configuration section. You can specify additional parameters if needed; just remember to put them under the env.wrapper field in the configs if they are to be passed to the constructor.