This repository provides a PyTorch implementation of MOPO: Model-based Offline Policy Optimization, based on SheepRL and accelerated with Lightning Fabric.
- Download MuJoCo and put it in the `~/.mujoco` folder (https://github.com/openai/mujoco-py?tab=readme-ov-file#install-mujoco).
- Update the `LD_LIBRARY_PATH` environment variable:

```bash
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/.mujoco/mujoco210/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/nvidia
```

- Install Poetry (https://python-poetry.org/docs/#installing-with-the-official-installer).
- Create a venv (Python 3.11) and install the dependencies:

```bash
sudo apt-get install libosmesa6-dev patchelf
pip install "pyproject-hooks==1.0.0"
poetry install
```
To run an experiment, you can launch the following commands:

```bash
cd torch-mopo
MUJOCO_GL=egl mopo exp=halfcheetah-medium
```
Of course, you can add overrides or change the experiment name. The `exp` argument must specify a config filename in the `./torch_mopo/configs/exp/` folder. The `MUJOCO_GL` environment variable must be set according to the accelerator you are using:

- `egl`: if you have a GPU (or more than one).
- `osmesa`: if you are using CPUs.
You can also evaluate the trained agent: from the root folder of the repository (`~/torch-mopo`), run the following command:

```bash
MUJOCO_GL=egl mopo_eval checkpoint_path=/path/to/checkpoint.ckpt
```
MOPO is a model-based offline RL algorithm. Model-based means that it learns the dynamics of the environment, i.e., how actions affect the state of the environment and the reward obtained for the actions taken. MOPO learns the dynamics with a conservative approach in regions outside the data distribution, penalizing the reward in those areas. The penalty given to the reward is model-dependent and does not necessarily penalize all uncertain (out-of-distribution) states and actions equally.
MOPO aims to solve some of the critical issues of offline RL:
- Generalize the MDP dynamics using only the data available in the dataset. This is necessary because the dataset is usually sub-optimal in terms of both the states and the actions it covers. The agent should be able to make good decisions also in out-of-distribution states and actions.
- Solve a different (target) task from the one in the dataset. This can happen when collecting a significant amount of samples for the target task is difficult, or when there is not enough data for it.
MOPO seeks to achieve these two objectives with a trade-off between the return and the risk of taking actions in unknown states.
The dynamics model (also known as the world model) is responsible for learning the environment dynamics, i.e., this model predicts the next state and the reward, given the current state and the action taken in that state. It is composed of an ensemble of MLPs, each one predicting a Gaussian distribution over the next observation and the reward.
The uncertainty estimator is composed of bootstrap ensembles, and the penalty is the largest standard deviation (in norm) predicted by the models of the ensemble, i.e., the most uncertain prediction.
Finally, the reward is discounted by the penalty multiplied by a penalty coefficient.
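A minimal sketch of how such a penalized reward could be computed from the ensemble's predicted standard deviations (the function and argument names are illustrative, not the identifiers used in this repository):

```python
import torch


def penalized_reward(pred_reward: torch.Tensor,
                     ensemble_stds: torch.Tensor,
                     penalty_coef: float) -> torch.Tensor:
    """Apply a MOPO-style reward penalty.

    pred_reward: (batch,) reward predicted by the dynamics model.
    ensemble_stds: (num_elites, batch, obs_dim) standard deviations of the
        next-observation distributions predicted by each ensemble member.
    penalty_coef: coefficient scaling the uncertainty penalty.
    """
    # Uncertainty estimate: the largest predicted standard-deviation norm
    # across the ensemble members.
    penalty = ensemble_stds.norm(dim=-1).max(dim=0).values  # (batch,)
    # The reward is discounted by the penalty scaled by the coefficient.
    return pred_reward - penalty_coef * penalty
```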
The dataset is composed of the "environment dataset" (d_env), i.e., the offline data collected in the real environment, and the "model dataset" (d_model), which contains the transitions generated by rolling out the policy in the learned dynamics model.
Finally, the policy is updated using SAC on the union of d_env and d_model.
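As an illustration, a hedged sketch of how a SAC batch could be drawn from the two datasets; the `real_ratio` mixing and the buffer layout are assumptions, not necessarily what this repository does:

```python
import numpy as np


def sample_mixed_batch(d_env: dict, d_model: dict, batch_size: int,
                       real_ratio: float = 0.05) -> dict:
    """Sample a SAC training batch from the union of the two datasets.

    d_env / d_model: dicts of numpy arrays (e.g. "obs", "actions", "rewards", ...).
    real_ratio: fraction of the batch drawn from the real (offline) data.
    """
    n_real = int(batch_size * real_ratio)
    idx_env = np.random.randint(0, len(d_env["obs"]), size=n_real)
    idx_model = np.random.randint(0, len(d_model["obs"]), size=batch_size - n_real)
    # Concatenate the real and model-generated transitions field by field.
    return {
        key: np.concatenate([d_env[key][idx_env], d_model[key][idx_model]], axis=0)
        for key in d_env
    }
```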
Since this MOPO implementation is based on SheepRL, you can run different experiments by defining different configurations (thanks to Hydra). In this repository, a set of experiments is already provided; you can start from them and then define your own.
Note
This implementation of MOPO works only with vector observations. Images are not supported.
The first file to define is the `~/torch-mopo/torch_mopo/configs/env/d4rl.yaml` config file, where all the environment-related arguments are defined. A sketch of it is reported below.
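The wrapper `_target_` path and the values shown here are illustrative placeholders, not necessarily the ones used in the repository:

```yaml
id: halfcheetah-medium-v2   # environment id and dataset variant
validation_split: 20        # percentage of the dataset reserved for validation
wrapper:
  _target_: torch_mopo.envs.d4rl.D4RLWrapper   # hypothetical class path
  id: ${env.id}             # interpolation of the field above (assumption)
  render_mode: rgb_array
```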
The environment must have the `id`, the `validation_split`, and the `wrapper` fields. The `id` identifies the environment and how the dataset was created. A list of available environments can be found here (only Gym environments). The `validation_split` specifies the validation dataset split percentage. Finally, the `wrapper` contains everything necessary for instantiating the environment class, in this case the `_target_`, the `id`, and the `render_mode`.
Note
The environment should not need wrappers such as action repeat or frame stack. The environment should already provide the observations and behave in the same way as when the dataset was collected.
The second file to define is the `~/torch-mopo/torch_mopo/configs/algo/mopo.yaml` config file, where all the model-related arguments are defined. A sketch of it is reported below.
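The values below are illustrative placeholders assembled from the arguments explained next, not the repository defaults:

```yaml
defaults:
  - sac      # inherit all the SAC-related arguments
  - _self_

name: mopo
total_steps: 1000000        # total number of SAC updates
h: 5                        # rollout length in the learned model
num_epochs: 1000            # total_steps // num_epochs SAC updates per epoch
rollout_batch_size: 50000   # initial states sampled from d_env
per_rank_batch_size: 256    # batch size of each SAC update

ensembles:
  num_ensembles: 7            # MLPs in the dynamics model
  num_elites: 5               # elites used for prediction
  penalty_coef: 1.0           # coefficient applied to the uncertainty penalty
  batch_size: 256             # batch size for dynamics learning
  max_epochs: null            # maximum number of dynamics-learning epochs
  max_epochs_since_update: 5  # early stopping on the validation loss
  optimizer:
    _target_: torch.optim.Adam
    lr: 1e-3
    weight_decays: [2.5e-5, 5e-5, 7.5e-5, 7.5e-5, 1e-4]  # one per layer
```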
The config inherits from the SAC configuration, since all the SAC components are necessary for learning an optimal policy; moreover, new arguments must be defined. Below is an explanation of the arguments defined in the `~/torch-mopo/torch_mopo/configs/algo/mopo.yaml` config file:
- `total_steps`: the total number of SAC updates.
- `h`: the rollout length when simulating the environment.
- `num_epochs`: the number of epochs for policy updates (`total_steps // num_epochs` SAC updates for each epoch).
- `rollout_batch_size`: the number of initial states to sample from the `d_env` (`b` in the pseudo-code in Figure 4).
- `per_rank_batch_size`: the batch size when updating SAC.
- `ensembles`: the configurations of the dynamics model.
- `ensembles.num_ensembles`: the number of MLPs in the dynamics model.
- `ensembles.num_elites`: the number of models of the ensemble to use for predicting the next observation and the penalized reward.
- `ensembles.penalty_coef`: the penalty coefficient to apply to the penalty (Figure 3).
- `ensembles.batch_size`: the batch size used when learning the environment dynamics.
- `ensembles.max_epochs`: the maximum number of epochs to perform during dynamics model learning (it can be `null`).
- `ensembles.max_epochs_since_update`: the maximum number of epochs without improvements in the validation loss.
- `ensembles.optimizer`: the configs of the dynamics model optimizer.
- `ensembles.optimizer.weight_decays`: the list of weight decays to apply to the layers of the models in the ensemble.
The third config file that must be defined is the experiment config file; a sketch of the `~/torch-mopo/torch_mopo/configs/exp/mopo.yaml` file is reported below as an example.
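The environment id and the logged metrics shown here are placeholders following the usual SheepRL experiment-config layout, not necessarily the exact contents of the file:

```yaml
# @package _global_

defaults:
  - override /algo: mopo
  - override /env: d4rl
  - _self_

env:
  id: halfcheetah-medium-v2

metric:
  aggregator:
    metrics:
      Rewards/rew_avg:
        _target_: torchmetrics.MeanMetric
      Loss/value_loss:
        _target_: torchmetrics.MeanMetric
```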
As you can notice, there are not many arguments to define. The algorithm and the environment are overridden with the configurations defined in the previous sections. Then you need to define the `env.id` parameter and specify which metrics to log.
Note
You can use this configuration file as a basis for new configurations, as done in the `~/torch-mopo/torch_mopo/configs/exp/halfcheetah-medium.yaml` file.
A further step you can perform is to use MOPO to learn the optimal policy in your environment. This studio provides the D4RL environments, but nothing prohibits the use of another environment for training.
The procedure for adding a custom environment is almost identical to the one described in SheepRL - How to add an environment (you can find a tutorial on how to add a custom environment in this studio). The only difference is that your environment class must provide an additional method, `get_dataset()`, which takes as input the `validation_split` (integer) and the `seed` (integer, optional).
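As a reference, here is a hedged sketch of such a class; the class name, the base class, and what `get_dataset()` returns are assumptions, so check the D4RL wrapper shipped with this repository for the exact interface:

```python
import gymnasium as gym


class MyOfflineEnv(gym.Wrapper):
    """Hypothetical custom environment exposing an offline dataset for MOPO."""

    def __init__(self, id: str, render_mode: str | None = None):
        # Wrap the underlying environment; extra constructor arguments go
        # under `env.wrapper` in the configs.
        super().__init__(gym.make(id, render_mode=render_mode))

    def get_dataset(self, validation_split: int, seed: int | None = None):
        """Return the offline dataset split into training and validation parts.

        validation_split: percentage of the samples reserved for validation.
        seed: optional seed controlling how the split is performed.
        """
        # Load your transitions (observations, actions, rewards, terminations, ...)
        # from disk and split them according to `validation_split`.
        raise NotImplementedError
```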
Once the class is implemented, you must define the configs of your custom environment like the ones in the environment configuration section. You can specify additional parameters if needed; just remember to put them under the `env.wrapper` field in the configs if they are to be passed to the constructor, as shown in the sketch below.
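For instance, if your constructor takes a hypothetical extra `dataset_path` argument, it could be declared as follows (class path and field names are illustrative):

```yaml
env:
  wrapper:
    _target_: my_package.envs.MyOfflineEnv   # hypothetical class path
    id: ${env.id}
    dataset_path: /path/to/my/dataset.hdf5   # extra constructor argument
```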