<a href="https://colab.research.google.com/github/jonbaer/googlecolab/blob/master/morl_tutorial_with_solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multi-Objective RL Tutorial

Author: Lucas N. Alegre

https://lucasalegre.github.io




This is a tutorial on running MORL Baselines algorithms on MO-Gymnasium's environments.

* MO-Gymnasium's documentation: https://mo-gymnasium.farama.org/

* MORL Baselines' documentation: https://lucasalegre.github.io/morl-baselines

* Create a Weights and Biases account (https://wandb.ai) to see the dashboard with all evaluation and learning metrics!

# 1) **Set-up MO-Gymnasium**


First, let's install MO-Gymnasium!

## Step 1: install and import the libraries

In [None]:
!pip install mo-gymnasium



This is required to record videos of our learned agents:

In [None]:
import sys
!{sys.executable} -m pip install moviepy



In [None]:
import gymnasium as gym
import mo_gymnasium as mo_gym

## Step 2: create an environment

In [None]:
env = mo_gym.make("mo-mountaincar-v0", render_mode="rgb_array")

## Step 3: extract environment information

The state of the agent is a 2-d vector containing:

* position of the car along the x-axis (between -1.2 and 0.6)
* velocity of the car (between -0.07 and 0.07)

In [None]:
env.observation_space

Box([-1.2  -0.07], [0.6  0.07], (2,), float32)

There are 3 discrete deterministic actions:

* 0: Accelerate to the left
* 1: Don’t accelerate
* 2: Accelerate to the right

In [None]:
env.action_space

Discrete(3)

The reward is a 3-dimensional vector containing a time penalty and penalties for reversing and for going forward:

* time penalty: -1.0 for each time step
* reverse penalty: -1.0 for each time step the action is 0 (reverse) else 0
* forward penalty: -1.0 for each time step the action is 2 (forward) else 0
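
As a rough sketch of the rules listed above, the per-step reward vector can be written as a plain function of the action. The helper name `mountaincar_step_reward` is hypothetical and is not part of MO-Gymnasium; it also ignores any special handling the environment may apply on the last step of an episode.

```python
def mountaincar_step_reward(action: int) -> tuple:
    """Per-step reward vector (time, reverse, forward) as described in the list above.

    Hypothetical helper for illustration only, not the environment's own code.
    """
    if action not in (0, 1, 2):
        raise ValueError("action must be 0, 1 or 2")
    time_penalty = -1.0                              # paid on every time step
    reverse_penalty = -1.0 if action == 0 else 0.0   # action 0: accelerate to the left
    forward_penalty = -1.0 if action == 2 else 0.0   # action 2: accelerate to the right
    return (time_penalty, reverse_penalty, forward_penalty)


for a in (0, 1, 2):
    print(a, mountaincar_step_reward(a))
```

Only the "don't accelerate" action avoids both action penalties, so the three objectives pull the agent in different directions.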

In [None]:
env.reward_space

Box(-1.0, [-1.  0.  0.], (3,), float32)

## Step 4: use Gymnasium features in MO-Gymnasium

MO-Gymnasium is compatible with most of Gymnasium's (https://gymnasium.farama.org/) features, such as wrappers.

In [None]:
from gymnasium.wrappers import RecordVideo

In [None]:
env = RecordVideo(env, "videos/demo", episode_trigger=lambda e: True)

## Step 5: go through the environment with a random agent

Below, we follow a random policy and save a video of the episode in the "videos/demo" folder.

In [None]:
env.reset()
done = False

while not done:
    obs, vec_reward, terminated, truncated, info = env.step(env.action_space.sample())
    done = terminated or truncated

Moviepy - Building video /content/videos/demo/rl-video-episode-0.mp4.
Moviepy - Writing video /content/videos/demo/rl-video-episode-0.mp4





Moviepy - Done !
Moviepy - video ready /content/videos/demo/rl-video-episode-0.mp4


## Step 6: scalarize env and run it with stable-baselines3



It is also possible to scalarize the reward function and run standard single-objective RL libraries!
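
Linear scalarization collapses the reward vector into a single number by taking a weighted sum. The sketch below shows the arithmetic in plain Python, assuming the scalarized reward is the dot product of the weight vector and the reward vector; `linear_scalarize` is a hypothetical helper, not the MO-Gymnasium wrapper itself.

```python
def linear_scalarize(vec_reward, weight):
    """Weighted sum of a reward vector: sum_i weight[i] * vec_reward[i]."""
    if len(vec_reward) != len(weight):
        raise ValueError("vec_reward and weight must have the same length")
    return sum(w * r for w, r in zip(weight, vec_reward))


# Weights over (time, reverse, forward); the third objective is ignored here.
weight = (0.9, 0.1, 0.0)

print(linear_scalarize((-1.0, -1.0, 0.0), weight))  # a step spent reversing
print(linear_scalarize((-1.0, 0.0, -1.0), weight))  # a step spent going forward
```

With a zero weight on the forward penalty, going forward costs the agent less than reversing, so a single-objective learner trained on this signal has no reason to avoid accelerating to the right.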

In [None]:
!pip install stable-baselines3 --upgrade


In [None]:
import stable_baselines3 as sb3
import numpy as np
import mo_gymnasium as mo_gym

In [None]:
# Linearly scalarize the environment's vector reward
env = mo_gym.LinearReward(mo_gym.make("mo-mountaincar-v0"), weight=np.array([0.9, 0.1, 0.0]))

# Run DQN agent!
agent = sb3.DQN("MlpPolicy", env)
agent.learn(100000)

<stable_baselines3.dqn.dqn.DQN at 0x7f9b66f70ee0>

# 2) **Running Experiments with MORL Baselines**

Now, we will start playing with MORL algorithms!

## Install MORL Baselines

In [None]:
!apt-get install libgmp-dev python3-dev
!pip install pycddlib  # required for computing linear support


In [None]:
!pip install git+https://github.com/LucasAlegre/morl-baselines.git


## Deep Sea Treasure

Here, we will use Pareto Q-Learning (https://lucasalegre.github.io/morl-baselines/algos/multi_policy/pareto_q_learning/) to solve the DST problem (https://mo-gymnasium.farama.org/environments/deep-sea-treasure/).

In [None]:
import gymnasium as gym
import mo_gymnasium as mo_gym
from mo_gymnasium.utils import MORecordEpisodeStatistics

GAMMA = 0.99

env = mo_gym.make("deep-sea-treasure-v0")
env = MORecordEpisodeStatistics(env, gamma=GAMMA)  # wrapper for recording statistics

eval_env = mo_gym.make("deep-sea-treasure-v0") # environment used for evaluation


### Pareto Q-Learning

Check the metrics on Weights and Biases!

Tip: in the runs.summary["eval/front"] panel, change "Render as" to "Combined plot" to see a 2D plot of the Pareto front.

In [None]:
import numpy as np
from morl_baselines.multi_policy.pareto_q_learning.pql import PQL

agent = PQL(
    env=env,
    ref_point=np.array([0, -50]),  # used to compute hypervolume
    gamma=GAMMA,
    log=True,  # use weights and biases to see the results!
)

agent.train(total_timesteps=100000, eval_env=eval_env, ref_point=np.array([0, -50]))
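
The `ref_point` anchors the hypervolume metric: with two objectives that are both maximized, the hypervolume is the area dominated by the front and bounded from below by the reference point. The sketch below illustrates that idea for the 2D case only; `hypervolume_2d` is a hypothetical helper and the two example points are made up, so it should not be read as the MORL Baselines implementation.

```python
def hypervolume_2d(points, ref_point):
    """Area dominated by `points` and bounded below by `ref_point` (maximization)."""
    rx, ry = ref_point
    # Only points strictly better than the reference point contribute.
    pts = [(x, y) for x, y in points if x > rx and y > ry]
    # Sweep from the best first objective to the worst.
    pts.sort(key=lambda p: p[0], reverse=True)
    area = 0.0
    best_y = ry
    for x, y in pts:
        if y > best_y:
            # Add the horizontal strip this point contributes on top of the previous ones.
            area += (x - rx) * (y - best_y)
            best_y = y
    return area


# Same reference point as in the cell above, two illustrative value vectors.
print(hypervolume_2d([(0.7, -1.0), (8.2, -3.0)], ref_point=(0.0, -50.0)))
```

A reference point that is worse than every reachable return in both objectives ensures each point on the front contributes a positive area.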


{(0.699999988079071, -1.0),
 (8.03681981306076, -2.9701),
 (11.046854115, -4.90099501),
 (13.180722091614, -6.793465209301),
 (14.074187108950262, -7.72553055720799),
 (14.85618993228868, -8.64827525163591),
 (17.373143823765123, -12.247897700103202),
 (17.813676097383638, -13.12541872310217),
 (19.07265374771985, -15.705680661607312),
 (19.777976783050544, -17.383137616441328)}
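
Each pair above is a discounted vector return (treasure, time). As a minimal sketch of how such a value arises with GAMMA = 0.99, the made-up episode below pays a time penalty of -1 on every step and collects a treasure of 8.2 on the third step; both the episode and the helper `discounted_vector_return` are assumptions for illustration.

```python
def discounted_vector_return(rewards, gamma):
    """Sum gamma**t * r_t separately for each objective."""
    if not rewards:
        return ()
    totals = [0.0] * len(rewards[0])
    discount = 1.0
    for reward in rewards:
        for i, r in enumerate(reward):
            totals[i] += discount * r
        discount *= gamma  # the next step is discounted once more
    return tuple(totals)


# Hypothetical three-step episode: (treasure, time) per step.
episode = [(0.0, -1.0), (0.0, -1.0), (8.2, -1.0)]
print(discounted_vector_return(episode, gamma=0.99))
```

The result is approximately (8.03682, -2.9701), which lines up with the second entry of the set printed above.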

### Exercise 1:

Solving Resource Gathering using GPI-LS / GPI-PD

- Resource Gathering environment (see https://mo-gymnasium.farama.org/environments/resource-gathering/)
- GPI-LS (see https://lucasalegre.github.io/morl-baselines/algos/multi_policy/mp_mo_q_learning/).

Tips:
- To use GPI-LS, set weight_selection_algo='gpi-ls' and use_gpi_policy=True in the MPMOQLearning constructor.
- To use GPI-PD (the model-based version of GPI-LS), also set dyna=True and gpi_pd=True.
- Because this environment has stochastic transitions, set num_eval_episodes_for_front=50 in the train() method so that the value of each policy is estimated more precisely.
- Use 10 iterations of 10k steps: total_timesteps=100000, timesteps_per_iteration=10000.
- Use epsilon-greedy exploration with initial_epsilon=1.0, final_epsilon=0.05, and epsilon_decay_steps=100000.
- Observe the metrics in the "eval/" panels of Weights and Biases (e.g., "eval/eum" for expected utility).
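
To get a feel for the exploration settings, here is a minimal sketch of an epsilon schedule, assuming epsilon is annealed linearly from initial_epsilon to final_epsilon over epsilon_decay_steps. The function `linear_epsilon` is a hypothetical stand-in, and the linear shape is an assumption rather than a statement about the library's internals.

```python
def linear_epsilon(step, initial_epsilon=1.0, final_epsilon=0.05, epsilon_decay_steps=100_000):
    """Linearly interpolate epsilon, then hold it at final_epsilon."""
    fraction = min(max(step, 0) / epsilon_decay_steps, 1.0)
    return initial_epsilon + fraction * (final_epsilon - initial_epsilon)


# Epsilon at the end of each 10k-step iteration.
for step in range(10_000, 100_001, 10_000):
    print(step, round(linear_epsilon(step), 3))
```

Under this assumption the agent still acts randomly about half of the time midway through training, which matters when reading the early evaluation curves.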

In [None]:
import gymnasium as gym
import mo_gymnasium as mo_gym
from mo_gymnasium.utils import MORecordEpisodeStatistics
import numpy as np

GAMMA = 0.9
ref_point = np.array([-1., -1., -2.])

env = mo_gym.make("resource-gathering-v0")
env = MORecordEpisodeStatistics(env, gamma=GAMMA)  # wrapper for recording statistics

eval_env = mo_gym.make("resource-gathering-v0") # environment used for evaluation

env.pareto_front(GAMMA) # known Pareto front


In [None]:
from morl_baselines.multi_policy.multi_policy_moqlearning.mp_mo_q_learning import MPMOQLearning

# Your code here:
agent = MPMOQLearning(
    env,
    initial_epsilon=1.0,
    final_epsilon=0.05,
    epsilon_decay_steps=100000,
    gamma=GAMMA,
    dyna=True,
    gpi_pd=True,
    weight_selection_algo='gpi-ls',
    use_gpi_policy=True
)

agent.train(total_timesteps=100000, timesteps_per_iteration=10000, eval_env=eval_env, num_eval_episodes_for_front=50, ref_point=ref_point)


CCS: [] CCS size: 0
Next weight: [1. 0. 0.]
Adding value: [0. 0. 0.] to CCS.
W_corner: [array([1., 0., 0.]), array([0., 1., 0.]), array([0., 0., 1.])] W_corner size: 3
CCS: [array([0., 0., 0.], dtype=float32)] CCS size: 1
Next weight: [0. 1. 0.]
Adding value: [-0.0849  0.4209  0.    ] to CCS.
W_corner: [array([0.8321, 0.1679, 0.    ]), array([1., 0., 0.]), array([0., 1., 0.]), array([0., 0., 1.])] W_corner size: 4
CCS: [array([0., 0., 0.], dtype=float32), array([-0.0849,  0.4209,  0.    ])] CCS size: 2
Next weight: [0.8321 0.1679 0.    ]
Adding value: [0.     0.3138 0.    ] to CCS.
removed value [0. 0. 0.]
W_corner: [array([0.5578, 0.4422, 0.    ]), array([1., 0., 0.]), array([0., 1., 0.]), array([0., 0., 1.])] W_corner size: 4
CCS: [array([-0.0849,  0.4209,  0.    ]), array([0.    , 0.3138, 0.    ])] CCS size: 2
Next weight: [0. 0. 1.]
Adding value: [0.     0.     0.3874] to CCS.
W_corner: [array([1., 0., 0.]), array([0.5578, 0.4422, 0.    ]), array([0.4107, 0.3256, 0.2637]), array([0

| Metric | History |
|---|---|
| charts_0/epsilon | █▇▆▆▅▄▃▃▂▁ |
| charts_1/epsilon | █▇▆▆▅▄▃▃▂▁ |
| charts_2/epsilon | █▇▆▆▅▄▃▃▂▁ |
| charts_3/epsilon | █▇▆▆▅▄▃▃▂▁ |
| charts_4/SPS | █▁ |
| charts_4/episode_time | █▁ |
| charts_4/epsilon | █▇▆▆▅▄▃▃▂▁ |
| charts_4/timesteps_per_episode | ▁█ |
| charts_5/epsilon | █▇▆▆▅▄▃▃▂▁ |
| charts_6/epsilon | █▇▆▆▅▄▃▃▂▁ |

| Metric | Value |
|---|---|
| charts_0/epsilon | 0.905 |
| charts_1/epsilon | 0.81 |
| charts_2/epsilon | 0.715 |
| charts_3/epsilon | 0.62 |
| charts_4/SPS | 1835.0 |
| charts_4/episode_time | 0.09082 |
| charts_4/epsilon | 0.525 |
| charts_4/timesteps_per_episode | 39.0 |
| charts_5/epsilon | 0.43 |
| charts_6/epsilon | 0.335 |


In [None]:
env.pareto_front(0.9)

[array([-0.0827,  0.2288,  0.2288]),
 array([-0.126 ,  0.3487,  0.    ]),
 array([0.    , 0.    , 0.3487]),
 array([-0.0478,  0.2059,  0.2059]),
 array([-0.0478,  0.3138,  0.    ]),
 array([0.    , 0.2824, 0.    ])]

In [None]:
agent.linear_support.ccs

[array([-0.0849,  0.4209,  0.    ]),
 array([0.    , 0.3138, 0.    ]),
 array([0.    , 0.    , 0.3874]),
 array([-0.1243,  0.2573,  0.2573]),
 array([-0.1304,  0.2573,  0.2573])]
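
When comparing a learned set with the known front, it helps to check which value vectors are Pareto-dominated. The sketch below applies a plain dominance filter to the rounded values printed above; `dominates` and `pareto_filter` are hypothetical helpers, and because the learned values are noisy estimates the result is only indicative.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` in every objective and strictly better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_filter(vectors):
    """Keep only the vectors that no other vector dominates."""
    return [v for v in vectors if not any(dominates(u, v) for u in vectors)]


# Rounded values copied from the learned CCS printed above.
learned_ccs = [
    (-0.0849, 0.4209, 0.0),
    (0.0, 0.3138, 0.0),
    (0.0, 0.0, 0.3874),
    (-0.1243, 0.2573, 0.2573),
    (-0.1304, 0.2573, 0.2573),
]

for v in pareto_filter(learned_ccs):
    print(v)
```

On these numbers the last vector is dropped, since it matches the fourth one in two objectives and is slightly worse in the first.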

### Exercise 2:

Use your learned agent and visualize how the learned behaviours change depending on the utility!

Use the make_gif function of morl-baselines (https://lucasalegre.github.io/morl-baselines/features/misc/#morl_baselines.common.utils.make_gif).

How does the policy for the following linear weights differ?
* [0.9, 0.1, 0.0]
* [0.3, 0.7, 0.0]
* [0.0, 1.0, 0.0]
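
One way to anticipate the answer: under a linear utility, the preferred value vector is the one with the largest dot product with the weights. The sketch below applies this to the rounded known Pareto front printed earlier; `best_for_weight` is a hypothetical helper and says nothing about how the learned agent actually behaves.

```python
# Rounded known Pareto front of resource-gathering-v0 for gamma = 0.9 (from the output above).
known_front = [
    (-0.0827, 0.2288, 0.2288),
    (-0.126, 0.3487, 0.0),
    (0.0, 0.0, 0.3487),
    (-0.0478, 0.2059, 0.2059),
    (-0.0478, 0.3138, 0.0),
    (0.0, 0.2824, 0.0),
]


def best_for_weight(front, weight):
    """Return the value vector that maximizes the weighted sum for `weight`."""
    return max(front, key=lambda v: sum(w * x for w, x in zip(weight, v)))


for weight in [(0.9, 0.1, 0.0), (0.3, 0.7, 0.0), (0.0, 1.0, 0.0)]:
    print(weight, "->", best_for_weight(known_front, weight))
```

A large weight on the first objective favors vectors whose first component is zero, while shifting weight to the second objective makes a more negative first component acceptable in exchange for a higher second one.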

In [None]:
from morl_baselines.common.utils import make_gif

env2 = mo_gym.make("resource-gathering-v0", render_mode='rgb_array')  # you need to set the render_mode to render the gifs

# Your code here:
make_gif(env2, agent, weight=np.array([0.9, 0.1, 0.0]), fps=10, fullpath="./myagent1")
make_gif(env2, agent, weight=np.array([0.3, 0.7, 0.0]), fps=10, fullpath="./myagent2")
make_gif(env2, agent, weight=np.array([0.0, 1.0, 0.0]), fps=10, fullpath="./myagent3")

MoviePy - Building file ./myagent1.gif with imageio.
Saved gif at: ./myagent1.gif
MoviePy - Building file ./myagent2.gif with imageio.
Saved gif at: ./myagent2.gif
MoviePy - Building file ./myagent3.gif with imageio.
Saved gif at: ./myagent3.gif




## Minecart

Next, we will play with continuous states and function approximation!

In [None]:
import gymnasium as gym
import mo_gymnasium as mo_gym
import numpy as np
from mo_gymnasium.utils import MORecordEpisodeStatistics


### Pareto Conditioned Network (PCN)

Let's solve the Minecart problem (https://mo-gymnasium.farama.org/environments/minecart-deterministic) using PCN (https://lucasalegre.github.io/morl-baselines/algos/multi_policy/pcn)!

In [None]:
from morl_baselines.multi_policy.pcn.pcn import PCN


GAMMA = 1.0

env = mo_gym.make("minecart-deterministic-v0")
env = MORecordEpisodeStatistics(env, gamma=GAMMA)  # wrapper for recording statistics

eval_env = mo_gym.make("minecart-deterministic-v0") # environment used for evaluation

agent = PCN(
    env,
    scaling_factor=np.array([1.0, 1.0, 0.1, 0.1]),
)

agent.train(
    1000000,
    eval_env=eval_env,
    ref_point=np.array([-1, -1, -200]),
    max_return=np.array([1.5, 1.5, 0.0]),
    max_buffer_size=200,
)

| Metric | History |
|---|---|
| eval/eum | ▁▆▆████████████ |
| eval/hypervolume | ▁▆▆████████████ |
| eval/sparsity | █▅▅▁▁▁▁▁▁▁▁▁▁▁▁ |
| global_step | ▁▁▁▁▂▂▃▃▃▃▃▃▄▄▅▅▅▅▅▅▆▆▇▇▇▇▇▇██ |

| Metric | Value |
|---|---|
| eval/eum | 5.60456 |
| eval/hypervolume | 854.85038 |
| eval/sparsity | 12.61937 |
| global_step | 150000.0 |



step 6062 	 return [  0.45       0.38     -12.942019], ([0.59999996 0.50705034 7.158154  ]) 	 loss 1.791E+00
step 8320 	 return [  0.375      0.385    -15.606035], ([0.32190838 0.34788647 6.6768785 ]) 	 loss 1.789E+00
step 10573 	 return [  0.3        0.3      -15.596042], ([ 0.501996  0.501996 11.752216]) 	 loss 1.788E+00
step 12188 	 return [  0.35999998   0.17       -11.290022  ], ([0.5969925 0.3407345 9.013392 ]) 	 loss 1.786E+00
step 13789 	 return [  0.45       0.55     -10.822019], ([0.32863355 0.3847077  8.187723  ]) 	 loss 1.786E+00
step 14619 	 return [ 0.15        0.53000003 -6.029998  ], ([0.3074085  0.59169245 2.2210867 ]) 	 loss 1.787E+00
step 16065 	 return [  0.182      0.368    -10.872014], ([0.29071635 0.520784   6.7859497 ]) 	 loss 1.787E+00
step 18032 	 return [  0.385        0.14500001 -14.024031  ], ([0.51090604 0.29021543 8.266772  ]) 	 loss 1.784E+00
step 18990 	 return [ 0.12799999  0.47199997 -6.8460007 ], ([0.20846103 0.5941851  3.290925  ]) 	 loss 1.785E+00


KeyboardInterrupt: ignored
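
The `eval/eum` metric reported above is an expected utility: the average, over a set of user preferences, of the best utility achievable with the current front. Here is a minimal sketch assuming linear utilities and a finite set of weight vectors; `expected_utility` is a standalone illustration rather than the MORL Baselines function, and the front and weights are made up.

```python
def expected_utility(front, weights):
    """Average over weights of the best weighted sum achievable on the front."""
    def utility(weight, value):
        return sum(w * v for w, v in zip(weight, value))

    return sum(max(utility(w, v) for v in front) for w in weights) / len(weights)


# Illustrative two-objective front and three preference vectors.
front = [(1.0, 0.0), (0.0, 1.0)]
weights = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
print(expected_utility(front, weights))
```

Adding a balanced point such as (0.8, 0.8) to the front would raise the utility for the (0.5, 0.5) preference, which is why this metric rewards fronts that serve many different users well.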

### GPI-Linear Support (GPI-LS)

### Exercise 3

Now try to solve the stochastic version of Minecart (minecart-v0) with GPI-LS (https://lucasalegre.github.io/morl-baselines/algos/multi_policy/gpi_pd)!

In [None]:
from morl_baselines.multi_policy.gpi_pd.gpi_pd import GPILS

GAMMA = 0.98

env = mo_gym.make("minecart-v0")
env = MORecordEpisodeStatistics(env, gamma=GAMMA)  # wrapper for recording statistics

eval_env = mo_gym.make("minecart-v0") # environment used for evaluation

# Your code here:
agent = GPILS(
    env,
    per=True,
    initial_epsilon=1.0,
    final_epsilon=0.05,
    epsilon_decay_steps=200000,
    target_net_update_freq=200,
    gradient_updates=10
)

agent.train(total_timesteps=200000,
            eval_env=eval_env,
            ref_point=np.array([-1,-1,-200])
)

| Metric | History |
|---|---|
| eval/eum | ▃▁▄▅▃▄▆▆▆▆▆▆▆▆▆▆▅▆▆▅▆▇▆▆▆▇▆▆▆▇▆▆▅▆▇▆▇▆█▆ |
| eval/hypervolume | █▇▆▇▇█▇▇▂█▇▇▇▇█▇▇▇▅▃▇▇▆█▇▇▃▇▇▇▁▆▅▇▇▂▇▇█▇ |
| eval/sparsity | ▇▇▂▂▂▃▁▁▄▂▁▂▁▁▁█▁▁▁▂▁▁▂▂▁▁▁▁▁▁▁▁▂▂▁▃▁▁▁▁ |
| global_step | ▁▁▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇████ |
| train/desired_return_0 | ▇▄▇▃▅▁▄▇▄▄▃▅▅▃▃▃▃▂▄▅▁▃▇▁▄▇▇▁▃▂▃▂█▃▄▄▃▃▃▅ |
| train/desired_return_1 | ▂▅▁▅▅█▅▃▅▅▆▄▅▆▇▅▅█▅▄▁▅▁▁▆▁▂▅▅▇▇▁▁▅▆▅█▅▆▅ |
| train/desired_return_2 | ▆▇█▅▁▆▇▆▇▇▆▇▇▅▅█▁▆▆▄█▅▇▇▇▇▇▇▇▆▇▇▇▅▄▇▅▇▅▇ |
| train/entropy | ███████▇▇▇▇▇▆▆▆▆▆▅▅▅▆▅▅▄▃▃▃▃▃▂▂▂▁▁▁▁▂▁▁▁ |
| train/episode | ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
| train/horizon_desired | ▂▂▂▄█▂▃▂▁▁▃▂▁▃▃▁█▂▃▄▁▄▁▁▁▁▁▁▁▂▂▁▁▄▄▂▃▁▃▁ |

| Metric | Value |
|---|---|
| eval/eum | 0.05932 |
| eval/hypervolume | 934.17215 |
| eval/sparsity | 1.57886 |
| global_step | 59576.0 |
| train/desired_return_0 | 0.96368 |
| train/desired_return_1 | 0.75 |
| train/desired_return_2 | -2.26 |
| train/entropy | 441.63425 |
| train/episode | 550.0 |
| train/horizon_desired | 31.0 |


CCS: [] CCS size: 0
Next weight: [1. 0. 0.]
Next weight vector: [1. 0. 0.]
Episode infos:
Steps: 26, Time: 0.011230000294744968
Total Reward: [ 0.    0.   -1.02], Discounted: [ 0.      0.     -0.8909]
Scalarized Reward: 0.0, Discounted: 0.0
Episode infos:
Steps: 231, Time: 6.251953125
Total Reward: [  0.7    0.8  -15.12], Discounted: [ 0.0687  0.0785 -5.7549]
Scalarized Reward: 0.699999988079071, Discounted: 0.06867963820695877
Episode infos:
Steps: 75, Time: 2.1308600902557373
Total Reward: [ 0.   0.  -5.6], Discounted: [ 0.      0.     -3.9749]
Scalarized Reward: 0.0, Discounted: 0.0
Episode infos:
Steps: 310, Time: 9.98876953125
Total Reward: [  0.75     0.75   -20.2001], Discounted: [ 0.0333  0.0333 -6.1631]
Scalarized Reward: 0.75, Discounted: 0.03326378017663956
Episode infos:
Steps: 259, Time: 8.2783203125
Total Reward: [  1.5      0.     -17.9801], Discounted: [ 0.1111  0.     -6.7974]
Scalarized Reward: 0.0, Discounted: 0.0
Episode infos:
Steps: 76, Time: 2.241698980331421
Tot