## Using the Stable-Baselines3 library for reinforcement learning

In this notebook we test an implementation of the proximal policy optimization (PPO) algorithm, using the `MaskablePPO` class from `sb3_contrib`.
PPO is described in detail in https://arxiv.org/abs/1707.06347. It is a variant of Trust Region Policy Optimization (TRPO), described [in this paper](https://arxiv.org/abs/1502.05477). The PPO algorithm works in two phases. In the first phase, a large number of rollouts are performed (in parallel). The rollouts are then aggregated and a surrogate optimization objective is defined based on them. In the second phase, we use SGD to find the policy that maximizes that objective, with a penalty term for diverging too much from the current policy.

![ppo](https://raw.githubusercontent.com/ucbrise/risecamp/risecamp2018/ray/tutorial/rllib_exercises/ppo.png)
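
To make the surrogate objective more concrete, here is a minimal pure-Python sketch of the clipped variant from the PPO paper, which limits divergence from the current policy by clipping the probability ratio rather than adding an explicit penalty. The function name `clipped_surrogate` and the toy inputs are illustrative assumptions, not part of Stable-Baselines3; the value `clip_range=0.2` follows the paper's suggested setting.

```python
def clipped_surrogate(ratios, advantages, clip_range=0.2):
    """Mean clipped surrogate objective (hypothetical helper, for illustration).

    ratios: pi_new(a|s) / pi_old(a|s) for each sampled action
    advantages: advantage estimate for each sampled action
    """
    total = 0.0
    for ratio, advantage in zip(ratios, advantages):
        # Restrict the ratio to [1 - clip_range, 1 + clip_range].
        clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
        # Take the pessimistic (smaller) of the unclipped and clipped terms.
        total += min(ratio * advantage, clipped_ratio * advantage)
    return total / len(ratios)


# Toy data: three sampled actions.
objective = clipped_surrogate([1.5, 0.5, 1.0], [1.0, -1.0, 2.0])
print(round(objective, 6))
```

Because of the `min`, moving the ratio far from 1 never increases the objective beyond the clipped value, which removes the incentive for large policy updates.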

## Setup

We begin by importing the required libraries and our OpenAI Gym-compatible environment.

In [40]:
import warnings

import gym
from sb3_contrib.ppo_mask import MaskablePPO
from stable_baselines3.common.env_checker import check_env

from src.utils import boxes_generator

from plotly_gif import GIF

import io
from PIL import Image

In [41]:
def make_env(
        container_size,
        num_boxes,
        num_visible_boxes=1,
        seed=0,
        render_mode=None,
        random_boxes=False,
        only_terminal_reward=False,
):
    """
    Create a "PackingEnv-v0" environment with generated box sizes.

    Parameters
    ----------
    container_size: size of the container
    num_boxes: number of boxes to be packed
    num_visible_boxes: number of boxes visible to the agent
    seed: seed for RNG
    render_mode: render mode for the environment
    random_boxes: whether to use random boxes or not
    only_terminal_reward: whether to use only terminal reward or not

    Returns
    -------
    env: the configured environment
    """
    env = gym.make(
        "PackingEnv-v0",
        container_size=container_size,
        box_sizes=boxes_generator(container_size, num_boxes, seed),
        num_visible_boxes=num_visible_boxes,
        render_mode=render_mode,
        random_boxes=random_boxes,
        only_terminal_reward=only_terminal_reward,
    )
    # The original function built the environment but never returned it.
    return env

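Next, we set up two copies of the environment: `orig_env` with step-wise rewards, which we use later for the rollout, and `env` with only a terminal reward, which we use for training.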

In [42]:
warnings.filterwarnings("ignore")
container_size = [10, 10, 10]
box_sizes2 = [[3, 3, 3], [3, 2, 3], [3, 4, 2], [3, 2, 4], [3, 2, 3]]

orig_env = gym.make(
    "PackingEnv-v0",
    container_size=container_size,
    box_sizes=box_sizes2,
    num_visible_boxes=1,
    render_mode="human",
    random_boxes=False,
    only_terminal_reward=False,
)

env = gym.make(
    "PackingEnv-v0",
    container_size=container_size,
    box_sizes=box_sizes2,
    num_visible_boxes=1,
    render_mode="human",
    random_boxes=False,
    only_terminal_reward=True,
)

check_env(env, warn=True)

We train the agent with `MultiInputPolicy`, which handles dictionary observations and uses an MLP.

In [43]:
model = MaskablePPO("MultiInputPolicy", env, verbose=1)
print("begin training")
model.learn(total_timesteps=10)
print("done training")
model.save("ppo_mask")

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
begin training
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 5        |
|    ep_rew_mean     | 0.111    |
| time/              |          |
|    fps             | 15       |
|    iterations      | 1        |
|    time_elapsed    | 130      |
|    total_timesteps | 2048     |
---------------------------------
done training
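
The log reports 2048 timesteps even though `total_timesteps=10` was requested. As far as we understand the Stable-Baselines3 training loop, on-policy algorithms always collect a full rollout of `n_steps` steps per environment before checking the timestep budget, and the default `n_steps` for PPO is 2048. The helper below is a hypothetical sketch of that rounding, not a library function.

```python
import math


def effective_timesteps(total_timesteps, n_steps=2048, n_envs=1):
    """Estimate how many timesteps are actually collected (illustrative helper).

    Assumes training proceeds in whole rollouts of n_steps * n_envs steps and
    stops once the requested budget has been reached or exceeded.
    """
    rollout_size = n_steps * n_envs
    n_rollouts = max(1, math.ceil(total_timesteps / rollout_size))
    return n_rollouts * rollout_size


print(effective_timesteps(10))    # 2048
print(effective_timesteps(5000))  # 6144
```

To train for a meaningful duration, `total_timesteps` should therefore be set to a multiple of the rollout size rather than a small number such as 10.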


Next, we roll out the trained agent on `orig_env` and collect one rendered frame per step.

In [44]:
from sb3_contrib.common.maskable.utils import get_action_masks

obs = orig_env.reset()
done = False
figs = []
step = 1
while not done:
    print(step)
    # Query the masks from the environment we actually step (orig_env),
    # so that they match its current state.
    action_masks = get_action_masks(orig_env)
    action, _states = model.predict(obs, deterministic=True, action_masks=action_masks)
    obs, reward, done, info = orig_env.step(action)
    # Render the current packing and convert the figure to a PIL image.
    fig = orig_env.render(mode="human")
    fig_png = fig.to_image(format="png")
    buf = io.BytesIO(fig_png)
    img = Image.open(buf)
    figs.append(img)
    step += 1
    print(step)
print("done packing")
orig_env.close()
env.close()

1
2
2
3
3
4
4
5
5
6
done packing
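
The rollout passes `action_masks` to `model.predict` so that the agent can only choose valid placements. The sketch below illustrates the general idea of action masking in pure Python: invalid actions receive a logit of negative infinity, so they get zero probability after the softmax. The function names are hypothetical, and this is a simplified illustration rather than the actual `sb3_contrib` implementation.

```python
import math


def masked_probabilities(logits, action_mask):
    """Softmax over logits where masked-out actions get probability 0."""
    if not any(action_mask):
        raise ValueError("at least one action must be valid")
    # Replace the logits of invalid actions with -inf.
    masked = [l if ok else -math.inf for l, ok in zip(logits, action_mask)]
    # Subtract the maximum for numerical stability before exponentiating.
    top = max(masked)
    exps = [math.exp(x - top) for x in masked]
    norm = sum(exps)
    return [e / norm for e in exps]


def greedy_action(logits, action_mask):
    """Pick the most probable valid action (analogous to deterministic=True)."""
    probs = masked_probabilities(logits, action_mask)
    return max(range(len(probs)), key=probs.__getitem__)


# Action 0 has the highest logit but is masked out, so action 1 is chosen.
logits = [2.0, 1.0, 0.5]
mask = [False, True, True]
print(greedy_action(logits, mask))  # 1
```

This is also why the masks must come from the same environment instance that is being stepped: a mask computed for a different state could allow an invalid action or forbid a valid one.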


In [45]:
import os
cwd = os.getcwd()
print(cwd)

/Users/luis/Documents/code/fourthbrain-mle-course/repos/3D-bin-packing/nb


Next, we save the rollout frames as an animated GIF.

In [46]:
figs[0].save('../gifs/train_5_boxes.gif', format='GIF',
             append_images=figs[1:],
             save_all=True,
             duration=300, loop=10)