## Using the StableBaselines3 library for reinforcement learning

In this notebook we test an implementation of the proximal policy optimization (PPO)
PPO is described in detail in https://arxiv.org/abs/1707.06347. It is a variant of Trust Region Policy Optimization (TRPO) described (in this paper )[https://arxiv.org/abs/1502.05477]. The PPO algorithm works in two phases. In one phase, a large number of rollouts are performed (in parallel). The rollouts are then aggregated on the driver and a surrogate optimization objective is defined based on those rollouts. We then use SGD to find the policy that maximizes that objective with a penalty term for diverging too much from the current policy.

![ppo](https://raw.githubusercontent.com/ucbrise/risecamp/risecamp2018/ray/tutorial/rllib_exercises/ppo.png)

We begin by installing Python 3.8 in our environment, mounting Google Drive and cloning the repository with the 3D bin packing environment (only if you are using Google Colab).

In [1]:
!sudo apt-get install python3.8

#change alternatives
!sudo update-alternatives --install /usr/bin/python3  python3/usr/bin/python3.7.1
!sudo update-alternatives --install /usr/bin/python3 python3/usr/bin/python3.8.2

#check python version
!python --version

#install pip
!sudo apt-get install python-pip
!sudo apt install python3.8-distutils

#force reinstall pip (fixes issue with pip not working)
#see: https://askubuntu.com/questions/1025189/pip-is-not-working-importerror-no-module-named-pip-internal
!curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
!python3 get-pip.py --force-reinstall


Reading package lists... Done
Building dependency tree       
Reading state information... Done
python3.8 is already the newest version (3.8.10-0ubuntu1~20.04.6).
0 upgraded, 0 newly installed, 0 to remove and 23 not upgraded.
update-alternatives: --install needs <link> <name> <path> <priority>

Use 'update-alternatives --help' for program usage information.
update-alternatives: --install needs <link> <name> <path> <priority>

Use 'update-alternatives --help' for program usage information.
Python 3.8.10
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Package python-pip is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
However the following packages replace it:
  python3-pip

E: Package 'python-pip' has no installation candidate
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Note, s

In [4]:
from google.colab import drive

ROOT = "/content/drive"     # default location for the drive

drive.mount(ROOT)           # we mount the google drive at /content/drive




Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [30]:
from os.path import join  

MY_GOOGLE_DRIVE_PATH = 'content/drive/MyDrive/Github/3D-bin-packing' 
# replace with your Github username 
GIT_USERNAME = "luisgarciar" 
# definitely replace with your
GIT_TOKEN = "aaa"
# Replace with your github repository 
GIT_REPOSITORY = "3D-bin-packing" 

PROJECT_PATH = join(ROOT, MY_GOOGLE_DRIVE_PATH)

# It's good to print out the value if you are not sure 
print("PROJECT_PATH: ", PROJECT_PATH)   

# In case we haven't created the folder already; we will create a folder in the project path 
!rm -rf "PROJECT_PATH"
!mkdir "PROJECT_PATH"    

#GIT_PATH = "https://{GIT_TOKEN}@github.com/{GIT_USERNAME}/{GIT_REPOSITORY}.git" this return 400 Bad Request for me
GIT_PATH = "https://" + GIT_TOKEN + "@github.com/" + GIT_USERNAME + "/" + GIT_REPOSITORY + ".git"
print("GIT_PATH: ", GIT_PATH)

PROJECT_PATH:  /content/drive/content/drive/MyDrive/Github/3D-bin-packing
GIT_PATH:  https://aaa@github.com/luisgarciar/3D-bin-packing.git


In [15]:
%cd /content/drive/MyDrive/Github
%rm -rf 3D-bin-packing/
!git clone 

/content/drive/MyDrive/Github
Cloning into '3D-bin-packing'...
remote: Enumerating objects: 1064, done.[K
remote: Counting objects: 100% (695/695), done.[K
remote: Compressing objects: 100% (183/183), done.[K
remote: Total 1064 (delta 581), reused 595 (delta 507), pack-reused 369[K
Receiving objects: 100% (1064/1064), 23.57 MiB | 11.78 MiB/s, done.
Resolving deltas: 100% (789/789), done.
Updating files: 100% (265/265), done.


In [19]:
%cd 3D-bin-packing/
!git checkout dev-luis

/content/drive/MyDrive/Github/3D-bin-packing
M	logs/rl_model_4600_steps.zip
Branch 'dev-luis' set up to track remote branch 'dev-luis' from 'origin'.
Switched to a new branch 'dev-luis'


We now install the required libraries from the requirements.txt file from the repository. 

In [20]:
!pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
[0m

Next we add the path to be able to find the environment files.

In [21]:
import sys
import os

py_file_location = "/content/drive/MyDrive/Github/3D-bin-packing"
sys.path.append(os.path.abspath(py_file_location))

We now test the PPO algorithm with the 3D bin packing environment.

In [22]:
import io
import warnings

import gym
from PIL import Image
from sb3_contrib.ppo_mask import MaskablePPO
from stable_baselines3.common.env_checker import check_env
from stable_baselines3.common.callbacks import CheckpointCallback


from src.utils import boxes_generator

In [23]:
def make_env(
        container_size,
        num_boxes,
        num_visible_boxes=1,
        seed=0,
        render_mode=None,
        random_boxes=False,
        only_terminal_reward=False,
):
    """
    Parameters

    ----------
    container_size: size of the container
    num_boxes: number of boxes to be packed
    num_visible_boxes: number of boxes visible to the agent
    seed: seed for RNG
    render_mode: render mode for the environment
    random_boxes: whether to use random boxes or not
    only_terminal_reward: whether to use only terminal reward or not
    """
    env = gym.make(
        "PackingEnv-v0",
        container_size=container_size,
        box_sizes=boxes_generator(container_size, num_boxes, seed),
        num_visible_boxes=num_visible_boxes,
        render_mode=render_mode,
        random_boxes=random_boxes,
        only_terminal_reward=only_terminal_reward,
    )

We initialize two environments: one for training and another one for testing.

In [24]:
warnings.filterwarnings("ignore")
container_size = [10, 10, 10]
box_sizes2 = [[3, 3, 3], [3, 2, 3], [3, 4, 2], [3, 2, 4], [3, 2, 3]]

train_env = gym.make(
    "PackingEnv-v0",
    container_size=container_size,
    box_sizes=boxes_generator([10,10,10], num_items = 15),
    num_visible_boxes=5,
    render_mode="human",
    only_terminal_reward=True,
)

test_env = gym.make(
    "PackingEnv-v0",
    box_sizes=boxes_generator([10,10,10], num_items = 15),
    container_size=container_size,
    num_visible_boxes=5,
    render_mode="human",
    only_terminal_reward=True,
)

check_env(train_env, warn=True)
check_env(test_env, warn=True)

Next, we train our model with the default MLP multiinput policy. 

In [25]:
model = MaskablePPO("MultiInputPolicy", train_env, verbose=1)
checkpoint_callback = CheckpointCallback(
    save_freq=50, save_path="../logs/", name_prefix="rl_model"
)
print("begin training")
model.learn(total_timesteps=10000)
print("done training")
model.save("ppo_mask")

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
begin training
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 22       |
|    ep_rew_mean     | 0.433    |
| time/              |          |
|    fps             | 24       |
|    iterations      | 1        |
|    time_elapsed    | 83       |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 20.7        |
|    ep_rew_mean          | 0.47        |
| time/                   |             |
|    fps                  | 23          |
|    iterations           | 2           |
|    time_elapsed         | 176         |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.018508974 |
|    clip_fraction        | 0.193       |
|    clip_range           | 0.2         |
|   

Finally, we test the trained model in the test environment.

In [26]:
from sb3_contrib.common.maskable.utils import get_action_masks

test_env = gym.make(
    "PackingEnv-v0",
    box_sizes=boxes_generator([10,10,10], num_items = 15, seed = 3),
    container_size=container_size,
    num_visible_boxes=5,
    render_mode="human",
    only_terminal_reward=True,
)


obs = test_env.reset()
done = False
figs = []
fig = test_env.render(mode="human")
fig_png = fig.to_image(format="png")
buf = io.BytesIO(fig_png)
img = Image.open(buf)
figs.append(img)
step = 1
while not done:
    action_masks = get_action_masks(test_env)
    action, _states = model.predict(obs, deterministic=False, action_masks=action_masks)
    obs, rewards, done, info = test_env.step(action)
    fig = test_env.render(mode="human")
    fig_png = fig.to_image(format="png")
    buf = io.BytesIO(fig_png)
    img = Image.open(buf)
    figs.append(img)
    step += 1
print("done packing")
test_env.close()

print("begin saving packing rollout")
figs[0].save('gifs/train_15_boxes.gif', format='GIF',
             append_images=figs[1:],
             save_all=True,
             duration=300, loop=10)
print("end saving packing rollout")
# Save gif


done packing
begin saving packing rollout
end saving packing rollout


In [27]:
%cd ~/../content/drive/MyDrive/Github/3D-bin-packing/

/content/drive/MyDrive/Github/3D-bin-packing


In [28]:
!git add -A
!git commit -m "test nb training in colab"
!git config --global user.email "luisgarciar@gmail.com"
!git config --global user.name "luisgarciar"

[dev-luis d562aa6] test nb training in colab
 2 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 gifs/train_15_boxes.gif
 create mode 100644 ppo_mask.zip


In [29]:
!git push -u origin dev-luis

Enumerating objects: 7, done.
Counting objects:  14% (1/7)Counting objects:  28% (2/7)Counting objects:  42% (3/7)Counting objects:  57% (4/7)Counting objects:  71% (5/7)Counting objects:  85% (6/7)Counting objects: 100% (7/7)Counting objects: 100% (7/7), done.
Delta compression using up to 2 threads
Compressing objects: 100% (5/5), done.
Writing objects: 100% (5/5), 2.17 MiB | 1.62 MiB/s, done.
Total 5 (delta 2), reused 0 (delta 0)
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.[K
To https://github.com/luisgarciar/3D-bin-packing.git
   1ddeb7f..d562aa6  dev-luis -> dev-luis
Branch 'dev-luis' set up to track remote branch 'dev-luis' from 'origin'.
