## Using the Stable-Baselines3 library for reinforcement learning

In this notebook we test an implementation of the proximal policy optimization (PPO) algorithm.
PPO is described in detail in https://arxiv.org/abs/1707.06347. It is a variant of Trust Region Policy Optimization (TRPO), described [in this paper](https://arxiv.org/abs/1502.05477). The PPO algorithm works in two phases. In the first phase, a large number of rollouts are performed (in parallel). In the second phase, the rollouts are aggregated on the driver and a surrogate optimization objective is defined based on them. We then use SGD to find the policy that maximizes that objective, with a penalty term for diverging too much from the current policy.

![ppo](https://raw.githubusercontent.com/ucbrise/risecamp/risecamp2018/ray/tutorial/rllib_exercises/ppo.png)
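
The surrogate objective described above can be illustrated with a small, self-contained sketch. The PPO paper discusses two forms: a clipped objective and one with a KL penalty (the penalty form is the one the text above refers to). The function names, the defaults `clip_range=0.2` and `beta=1.0`, and the toy numbers are illustrative assumptions, not code taken from Stable-Baselines3.

```python
def clipped_surrogate(ratios, advantages, clip_range=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over the samples."""
    terms = []
    for ratio, advantage in zip(ratios, advantages):
        clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
        terms.append(min(ratio * advantage, clipped_ratio * advantage))
    return sum(terms) / len(terms)


def kl_penalized_surrogate(ratios, advantages, kl_values, beta=1.0):
    """Mean of r * A - beta * KL over the samples."""
    terms = [
        ratio * advantage - beta * kl
        for ratio, advantage, kl in zip(ratios, advantages, kl_values)
    ]
    return sum(terms) / len(terms)


# Toy batch: ratio = pi_new(a|s) / pi_old(a|s), one entry per sampled action.
ratios = [1.5, 0.5]
advantages = [1.0, -1.0]
kl_values = [0.1, 0.3]  # hypothetical per-state KL divergences

print(clipped_surrogate(ratios, advantages))
print(kl_penalized_surrogate(ratios, advantages, kl_values, beta=2.0))
```

In both forms the optimizer is discouraged from moving the new policy far from the one that generated the rollouts: clipping removes the incentive once the ratio leaves the interval, while the penalty subtracts a cost proportional to the divergence.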

We begin by installing Python 3.8 in our environment, mounting Google Drive and cloning the repository with the 3D bin packing environment (only if you are using Google Colab).

In [1]:
!sudo apt-get install python3.8

#change alternatives
!sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.7 1
!sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 2

#check python version
!python --version

#install pip
!sudo apt-get install python-pip
!sudo apt install python3.8-distutils

#force reinstall pip (fixes issue with pip not working)
#see: https://askubuntu.com/questions/1025189/pip-is-not-working-importerror-no-module-named-pip-internal
!curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
!python3 get-pip.py --force-reinstall



Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'sudo apt autoremove' to remove it.
The following additional packages will be installed:
  libpython3.8-minimal libpython3.8-stdlib python3.8-minimal
Suggested packages:
  python3.8-venv binfmt-support
The following NEW packages will be installed:
  libpython3.8-minimal libpython3.8-stdlib python3.8 python3.8-minimal
0 upgraded, 4 newly installed, 0 to remove and 20 not upgraded.
Need to get 4,695 kB of archives.
After this operation, 18.5 MB of additional disk space will be used.
Get:1 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu bionic/main amd64 libpython3.8-minimal amd64 3.8.14-1+bionic1 [762 kB]
Get:2 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu bionic/main amd64 python3.8-minimal amd64 3.8.14-1+bionic1 [1,839 kB]
Get:3 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu bionic/main am

In [1]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [16]:
%cd /content/drive/MyDrive
!rm -rf /Github/3D-bin-packing
!git clone https://github.com/luisgarciar/3D-bin-packing.git
%cd Github/3D-bin-packing/
!git checkout dev-luis

/content/drive/MyDrive
Cloning into '3D-bin-packing'...
remote: Enumerating objects: 400, done.
remote: Counting objects: 100% (31/31), done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 400 (delta 3), reused 8 (delta 1), pack-reused 369
Receiving objects: 100% (400/400), 8.84 MiB | 12.13 MiB/s, done.
Resolving deltas: 100% (211/211), done.
/content/drive/MyDrive/3D-bin-packing
Branch 'dev-luis' set up to track remote branch 'dev-luis' from 'origin'.
Switched to a new branch 'dev-luis'


We now install the required libraries. We need a special version of Stable-Baselines3 (the `fix_tests` branch of the fork installed below) that is compatible with OpenAI Gym version >= 0.24.

In [4]:
!pip install git+https://github.com/carlosluis/stable-baselines3@fix_tests

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/carlosluis/stable-baselines3@fix_tests
  Cloning https://github.com/carlosluis/stable-baselines3 (to revision fix_tests) to /tmp/pip-req-build-5lcodqdf
  Running command git clone -q https://github.com/carlosluis/stable-baselines3 /tmp/pip-req-build-5lcodqdf
  Running command git checkout -b fix_tests --track origin/fix_tests
  Switched to a new branch 'fix_tests'
  Branch 'fix_tests' set up to track remote branch 'fix_tests' from 'origin'.
Collecting gym==0.25
  Downloading gym-0.25.0.tar.gz (720 kB)
     |████████████████████████████████| 720 kB 2.1 MB/s 
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: stable-baselines3, gym
  Building wheel for stable-baselines3 (setup.py) ... don

We complete the installation of the remaining required libraries with the requirements.txt file from the repository.

In [6]:
!pip install -r /content/drive/MyDrive/Github/3D-bin-packing/requirements.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting nptyping==2.3.1
  Using cached nptyping-2.3.1-py3-none-any.whl (32 kB)
Collecting numpy==1.21.5
  Using cached numpy-1.21.5-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
Collecting Pillow==9.2.0
  Using cached Pillow-9.2.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Collecting plotly==5.9.0
  Using cached plotly-5.9.0-py2.py3-none-any.whl (15.2 MB)
Collecting pytest==7.1.2
  Using cached pytest-7.1.2-py3-none-any.whl (297 kB)
Collecting sb3_contrib==1.6.0
  Using cached sb3_contrib-1.6.0-py3-none-any.whl (78 kB)
Collecting pluggy<2.0,>=0.12
  Downloading pluggy-1.0.0-py2.py3-none-any.whl (13 kB)
Collecting iniconfig
  Downloading iniconfig-1.1.1-py2.py3-none-any.whl (5.0 kB)
Installing collected packages: numpy, pluggy, iniconfig, sb3-contrib, pytest, plotly, Pillow, nptyping
  Attempting uninstall: numpy
    Found existing installat

Next we add the repository root to `sys.path` so that Python can import the environment modules (for example `src.utils`).

In [13]:
import sys
import os

py_file_location = "/content/drive/MyDrive/Github/3D-bin-packing"
sys.path.append(os.path.abspath(py_file_location))
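
Re-running the cell above appends the same directory to `sys.path` again each time. A minimal sketch of a guard against that is shown below; `add_to_sys_path` is a hypothetical helper name introduced here for illustration, not part of the repository.

```python
import os
import sys


def add_to_sys_path(directory):
    """Append `directory` to sys.path once; return True if it was added."""
    path = os.path.abspath(directory)
    if not os.path.isdir(path):
        raise FileNotFoundError(f"not a directory: {path}")
    if path in sys.path:
        return False  # already importable, nothing to do
    sys.path.append(path)
    return True


# Example with the current working directory, which always exists.
add_to_sys_path(os.getcwd())
print(os.path.abspath(os.getcwd()) in sys.path)
```

Raising on a missing directory also surfaces a failed Drive mount or clone early, instead of as an `ImportError` in a later cell.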

We now test the PPO algorithm with the 3D bin packing environment.

In [6]:
import warnings

import gym
from numpy.typing import NDArray
from sb3_contrib.common.wrappers import ActionMasker
from sb3_contrib.ppo_mask import MaskablePPO
from stable_baselines3.common.callbacks import CheckpointCallback
from stable_baselines3.common.env_util import make_vec_env

from src.utils import boxes_generator


In [7]:

def mask_fn(env: gym.Env) -> NDArray:
    return env.get_action_mask


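def make_env(container_size, num_boxes, num_visible_boxes=1, seed=0, render_mode=None):
    """
    Utility function for initializing the bin packing env with action masking
    :param container_size: (list of int) dimensions of the container
    :param num_boxes: (int) number of boxes to generate
    :param num_visible_boxes: (int) number of boxes visible to the agent
    :param seed: (int) the initial seed for the box generator
    :param render_mode: (str or None) forwarded to gym.make; accepted here
        because env_kwargs passes it (assumes the env supports this argument)
    """

    env = gym.make(
        "PackingEnv-v0",
        container_size=container_size,
        box_sizes=boxes_generator(container_size, num_boxes, seed),
        num_visible_boxes=num_visible_boxes,
        render_mode=render_mode,
    )
    # Wrap the env so that MaskablePPO can query the mask of valid actions
    env = ActionMasker(env, mask_fn)
    return env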
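
Conceptually, action masking restricts the policy's action distribution to the actions that are valid in the current state. The sketch below shows that idea with a masked softmax in plain Python. It is an illustration under that assumption, not the implementation used by `sb3_contrib`, and `masked_softmax` is a hypothetical name.

```python
import math


def masked_softmax(logits, mask):
    """Softmax over `logits`, restricted to entries where `mask` is True."""
    if not any(mask):
        raise ValueError("at least one action must be valid")
    # Invalid actions get a logit of -inf, so their probability becomes 0.
    masked = [x if ok else -math.inf for x, ok in zip(logits, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 2.0, 3.0]
mask = [True, False, True]  # action 1 is invalid in this hypothetical state
print(masked_softmax(logits, mask))
```

Because invalid actions receive zero probability, the agent never samples them, which avoids spending training steps on box placements the environment would reject.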

In [None]:
warnings.filterwarnings("ignore", category=DeprecationWarning)
# Environment initialization
container_size = [5, 5, 5]
num_boxes = 10
num_visible_boxes = 10
num_env = 4
env_kwargs = dict(
    container_size=container_size,
    num_boxes=num_boxes,
    num_visible_boxes=num_visible_boxes,
    seed=42,
    render_mode="rgb_array",
)
env = make_vec_env(make_env, n_envs=num_env, env_kwargs=env_kwargs)
print("finished initialization of vectorized environment")
print("beginning training")

# MaskablePPO initialization
model = MaskablePPO("MultiInputPolicy", env, gamma=0.4, verbose=1)
checkpoint_callback = CheckpointCallback(
    save_freq=50, save_path="../logs/", name_prefix="rl_model"
)
model.learn(5000, callback=checkpoint_callback)
print("done training")
model.save("ppo_mask")

Process ForkServerProcess-13:
Process ForkServerProcess-14:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/opt/anaconda3/envs/3D-bin-packing/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/anaconda3/envs/3D-bin-packing/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/anaconda3/envs/3D-bin-packing/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/anaconda3/envs/3D-bin-packing/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/anaconda3/envs/3D-bin-packing/lib/python3.9/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 27, in _worker
    env = env_fn_wrapper.var()
  File "/opt/anaconda3/envs/3D-bin-packing/lib/python3.9/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 27, in _worker
 

KeyboardInterrupt: 
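
A note on the checkpoint settings in the training cell above: with a vectorized environment, each call to `env.step()` advances every sub-environment by one step, so `save_freq` counts calls rather than total timesteps. The sketch below works out the resulting checkpoint schedule under that assumption. `checkpoint_timesteps` is a hypothetical helper, and it ignores that PPO collects whole rollouts, so training may run past the requested number of timesteps.

```python
def checkpoint_timesteps(save_freq, n_envs, total_timesteps):
    """Timesteps at which a checkpoint would be written.

    Assumes the callback fires every `save_freq` calls to env.step() and
    that each call advances `n_envs` timesteps.
    """
    step = save_freq * n_envs
    return list(range(step, total_timesteps + 1, step))


steps = checkpoint_timesteps(save_freq=50, n_envs=4, total_timesteps=5000)
print(len(steps), steps[:3])
```

To checkpoint every N timesteps regardless of the number of environments, one option is to pass `save_freq=max(N // num_env, 1)` to `CheckpointCallback`.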

In [None]:
# from sb3_contrib.common.maskable.utils import get_action_masks
#
# obs = env.reset()
# while True:
#     # Retrieve current action mask
#     action_masks = get_action_masks(env)
#     action, _states = model.predict(obs, action_masks=action_masks)
#     obs, rewards, dones, info = env.step(action)
#     env.render()
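
The commented loop above never terminates. A minimal sketch of a bounded variant for a single, non-vectorized environment follows. It assumes that `step()` returns the 4-tuple `(obs, reward, done, info)` and that the model's `predict()` accepts `action_masks`; `run_episode` and its `get_masks` parameter are hypothetical names introduced for illustration.

```python
def run_episode(model, env, get_masks, max_steps=1000):
    """Roll out one episode with action masking; return the total reward.

    Assumes a single (non-vectorized) env whose step() returns the 4-tuple
    (obs, reward, done, info), and a model whose predict() accepts
    `action_masks`.
    """
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action_masks = get_masks(env)  # mask of currently valid actions
        action, _state = model.predict(obs, action_masks=action_masks)
        obs, reward, done, _info = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```

With the objects from this notebook, `get_masks` could be `get_action_masks` from `sb3_contrib.common.maskable.utils`, `model` the trained `MaskablePPO`, and `env` a single environment created with `make_env(...)` rather than the vectorized one.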
