**NOTE: This notebook is written for the Google Colab platform, which provides free hardware acceleration. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook using a local GPU.**



In [1]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
import shutil
USE_NBCAP = False

if shutil.which('apt') is not None:
    !apt update
    !apt install -y xvfb x11-utils
    !{sys.executable} -m pip install pyscreenshot pyvirtualdisplay
    !{sys.executable} -m pip install --upgrade pyglet
    !{sys.executable} -m pip install git+https://github.com/michalgregor/nbcap.git

    USE_NBCAP = True

!{sys.executable} -m pip install gym[classic_control]
!{sys.executable} -m pip install class_utils[tensorboard]@git+https://github.com/michalgregor/class_utils.git
!{sys.executable} -m pip install tianshou
!{sys.executable} -m pip install git+https://github.com/michalgregor/tianshou_agents.git


In [2]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
%load_ext tensorboard

import shutil
if shutil.which('apt') is None:
    USE_NBCAP = False
else:
    USE_NBCAP = True

    from nbcap import ShowVideoCallback, ScreenRecorder, OutputManager, DisplayProcess

from tianshou_agents.utils import VectorEnvRenderWrapper
from tianshou_agents.methods.sac import sac_simple
from tianshou.data import Collector
from tianshou.env import BaseVectorEnv
from tianshou_agents.components.preset import AgentPresetWrapper

In [3]:
#@title -- Auxiliary Functions -- { display-mode: "form" }

if USE_NBCAP:
    display_size=(600, 400)
    show_video = ShowVideoCallback(dimensions=display_size)

    # make sure that only one instance
    # of the display is ever created
    try:
        DISP_PROC
    except NameError:
        DISP_PROC = DisplayProcess(display_size=display_size)

    def make_screen_recorder(max_gui_outputs=1):
        video_path="output"
        segment_time=10

        output_manager = OutputManager(max_gui_outputs=max_gui_outputs)
        video_callback=output_manager(show_video)
        display = DISP_PROC.id

        screen_recorder = ScreenRecorder(
            display, display_size, video_path,
            segment_time=segment_time, video_callback=video_callback
        )
        
        return screen_recorder

    SCREEN_RECORDER = make_screen_recorder()
else:
    from contextlib import suppress
    SCREEN_RECORDER = suppress()

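class RenderCollector(Collector):
    """Wraps an existing collector so that every collect() call renders the environment.

    Collector.__init__ is not called; attribute access falls through to
    the wrapped collector via __getattr__.
    """
    def __init__(self, collector, render=0.01):
        self.collector = collector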

        if isinstance(self.collector.env, BaseVectorEnv):
            self.collector.env = VectorEnvRenderWrapper(
                self.collector.env)

        self._render = render

    @property
    def collect_time(self):
        return max(self.collector.collect_time, 1e-20)

    @collect_time.setter
    def collect_time(self, val):
        self.collector.collect_time = val

    def collect(
        self, n_step = None, n_episode = None, random = False,
        render = None, no_grad = True,
    ):
        with SCREEN_RECORDER:
            render = render or self._render
            return self.collector.collect(n_step, n_episode, random, render, no_grad)

    def __getattr__(self, name):
        if name.startswith('_'):
            raise AttributeError("attempted to get missing private attribute '{}'".format(name))
        return getattr(self.collector, name)

    def __str__(self):
        return '<{}{}>'.format(type(self).__name__, self.collector)

    def __repr__(self):
        return str(self)

class AgentPresetPatch(AgentPresetWrapper):
    def __init__(self, preset, render=0.01):
        super().__init__(preset)
        self._prev_test_envs = None
        self.render = render

    def __call__(self, *args, **kwargs):
        agent = self._preset(*args, **kwargs)

        # we close the previous pyglet window before
        # opening a new one to work around a bug on Windows
        if self._prev_test_envs is not None:
            self._prev_test_envs.close()

        agent.test_collector = RenderCollector(
            agent.test_collector, render=self.render
        )
        
        self._prev_test_envs = agent.test_envs
        
        return agent
        
sac_simple = AgentPresetPatch(sac_simple)

## Continuous Control using SAC: The Inverted Pendulum

As our next exercise, we are going to look at continuous control using RL, with a method known as the soft actor-critic (SAC). Like DQN, SAC is an off-policy method that uses a replay buffer: it collects experience on the fly, stores it, and then repeatedly samples random batches from it to update the policy.
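To make the replay-buffer idea concrete, here is a minimal sketch of a fixed-capacity buffer with uniform sampling. It is not Tianshou's implementation; the class name `SimpleReplayBuffer`, its methods, and the dummy transitions are all assumptions made for illustration.

```python
import random
from collections import deque


class SimpleReplayBuffer:
    """Hypothetical fixed-capacity transition store; oldest entries are dropped first."""

    def __init__(self, capacity):
        self._storage = deque(maxlen=capacity)

    def add(self, obs, act, rew, next_obs, done):
        self._storage.append((obs, act, rew, next_obs, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement from everything stored so far.
        return random.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


buffer = SimpleReplayBuffer(capacity=1000)
for t in range(1500):
    # Dummy transitions standing in for real environment interaction.
    buffer.add(obs=t, act=0.0, rew=-1.0, next_obs=t + 1, done=False)

batch = buffer.sample(batch_size=64)
print(len(buffer), len(batch))
```

Because the batch is drawn from old as well as recent transitions, the same sample can contribute to many policy updates, which is what makes off-policy methods comparatively sample-efficient.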

Like we did with DQN, we are again going to use a preset, this time to construct our agent together with its environment. In addition to specifying the Gym environment as `Pendulum-v0`, we are going to fix the seed of the random generators to `0`.

We are also going to use a stop criterion: at the end of each training epoch we run a few testing episodes, and once the mean test reward reaches a predefined threshold (here `-250`), training stops.
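The stop criterion can be pictured as a simple threshold check on the mean test reward. The sketch below is an assumption about the logic only: how the preset wires `stop_criterion` into the trainer is not shown in this notebook, and the name `stop_fn` is hypothetical. The reward values are the mean test rewards from the first training run's log in this notebook.

```python
STOP_CRITERION = -250.0


def stop_fn(mean_test_reward, threshold=STOP_CRITERION):
    """Hypothetical check: stop once the mean test reward reaches the threshold."""
    return mean_test_reward >= threshold


# Mean test reward per epoch, copied from the first training log in this notebook.
test_rewards = [-1658.078074, -936.753510, -1247.662167, -123.516245]

# First epoch (1-based) whose mean test reward satisfies the criterion.
stop_epoch = next(
    (epoch for epoch, rew in enumerate(test_rewards, start=1) if stop_fn(rew)),
    None,
)
print(stop_epoch)  # 4
```

Pendulum rewards are never positive, so a threshold of `-250` asks for a reward close to zero rather than a large positive score.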



In [4]:
agent = sac_simple('Pendulum-v0', stop_criterion=-250, seed=0)

In [5]:
%%time
train_results = agent.train(max_epoch=10, step_per_epoch=1000)
print(f"\nGathered {agent.env_step} samples from the environment and took {agent.gradient_step} gradient steps.")

Epoch #1: 1001it [00:43, 23.04it/s, alpha=0.752, env_step=1000, len=200, loss/actor=24.248, loss/alpha=-0.661, loss/critic1=0.381, loss/critic2=0.440, n/ep=0, n/st=1, rew=-1127.31]                          


Epoch #1: test_reward: -1658.078074 ± 226.072297, best_reward: -1212.540195 ± 272.279226 in #0


Epoch #2: 1001it [00:44, 22.48it/s, alpha=0.566, env_step=2000, len=200, loss/actor=52.786, loss/alpha=-1.126, loss/critic1=0.460, loss/critic2=0.562, n/ep=0, n/st=1, rew=-1023.78]                          


Epoch #2: test_reward: -936.753510 ± 174.024905, best_reward: -936.753510 ± 174.024905 in #2


Epoch #3: 1001it [00:41, 23.84it/s, alpha=0.444, env_step=3000, len=200, loss/actor=70.094, loss/alpha=-1.091, loss/critic1=1.431, loss/critic2=2.143, n/ep=0, n/st=1, rew=-1149.94]                          


Epoch #3: test_reward: -1247.662167 ± 56.787020, best_reward: -936.753510 ± 174.024905 in #2


Epoch #4: 1001it [00:39, 25.64it/s, alpha=0.356, env_step=4000, len=200, loss/actor=80.697, loss/alpha=-1.086, loss/critic1=1.920, loss/critic2=2.578, n/ep=0, n/st=1, rew=-384.52]                          


Epoch #4: test_reward: -123.516245 ± 94.270022, best_reward: -123.516245 ± 94.270022 in #4

Gathered 4000 samples from the environment and took 4000 gradient steps.
Wall time: 4min 18s


As before, further testing episodes can be run using `agent.test()`.



In [7]:
test_results = agent.test()

### The Replay Buffer and Sample Efficiency

By default, our agent collects experience from the environment and, after each collection, draws a batch from the replay buffer and updates the policy. With the replay buffer at our disposal, however, we can choose how many times we draw a batch and update the policy per collection of experience.
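The relationship between collections and updates can be sketched as a pair of nested loops. This is only a counting model, not the trainer's code: the function `count_steps` and its parameter `step_per_collect` are hypothetical, and one environment step per collection is assumed because the training logs in this notebook report `n/st=1`.

```python
def count_steps(n_collects, step_per_collect=1, update_per_collect=1):
    """Hypothetical counting model of a collect-then-update training loop."""
    env_step = 0
    gradient_step = 0
    for _ in range(n_collects):
        env_step += step_per_collect      # gather fresh experience
        for _ in range(update_per_collect):
            gradient_step += 1            # draw one batch, do one update
    return env_step, gradient_step


print(count_steps(4000, update_per_collect=1))  # (4000, 4000)
print(count_steps(1050, update_per_collect=5))  # (1050, 5250)
```

Under these assumptions the counts agree with the "Gathered ... samples ... took ... gradient steps" lines printed after the two training runs in this notebook.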

This matters little when fresh samples are easy to get. In our case, for example, the simulation is very cheap in terms of time and computation. Very often, though, new samples are expensive to gather, so we want to make as much use as possible of the samples we already have.

---
#### Task 1: Changing the Number of Updates per Collect

**Experiment with changing the `update_per_collect` hyperparameter (it equals 1 by default) and observe how it influences the results. What do you observe when you increase `update_per_collect`? How does that influence the total time it takes to train the agent? How does it influence the number of samples that we need to collect from the environment?**

---


In [None]:
agent = sac_simple('Pendulum-v0',
    update_per_collect=5,
    stop_criterion=-250, seed=0
)

In [None]:
%%time
train_results = agent.train(max_epoch=20, step_per_epoch=150)
print(f"\nGathered {agent.env_step} samples from the environment and took {agent.gradient_step} gradient steps.")

Epoch #1: 151it [00:10, 14.53it/s, alpha=0.812, env_step=150, len=200, loss/actor=21.752, loss/alpha=-0.473, loss/critic1=0.677, loss/critic2=0.620, n/ep=0, n/st=1, rew=-1623.52]                         

Epoch #1: test_reward: -1761.546694 ± 129.250209, best_reward: -1276.069056 ± 334.114004 in #0


Epoch #2: 151it [00:10, 13.73it/s, alpha=0.652, env_step=300, len=200, loss/actor=44.769, loss/alpha=-0.939, loss/critic1=0.264, loss/critic2=0.286, n/ep=0, n/st=1, rew=-1534.82]                         

Epoch #2: test_reward: -1640.565342 ± 147.378831, best_reward: -1276.069056 ± 334.114004 in #0


Epoch #3: 151it [00:10, 14.26it/s, alpha=0.528, env_step=450, len=200, loss/actor=68.112, loss/alpha=-1.218, loss/critic1=0.290, loss/critic2=0.281, n/ep=0, n/st=1, rew=-1534.82]                         

Epoch #3: test_reward: -1103.608403 ± 108.747141, best_reward: -1103.608403 ± 108.747141 in #3


Epoch #4: 151it [00:10, 14.36it/s, alpha=0.439, env_step=600, len=200, loss/actor=83.289, loss/alpha=-1.169, loss/critic1=0.607, loss/critic2=0.659, n/ep=0, n/st=1, rew=-1510.06]                         

Epoch #4: test_reward: -1338.751805 ± 139.804951, best_reward: -1103.608403 ± 108.747141 in #3


Epoch #5: 151it [00:11, 13.27it/s, alpha=0.371, env_step=750, len=200, loss/actor=97.157, loss/alpha=-1.184, loss/critic1=2.761, loss/critic2=2.148, n/ep=0, n/st=1, rew=-915.44]                         

Epoch #5: test_reward: -1016.901824 ± 536.669973, best_reward: -1016.901824 ± 536.669973 in #5


Epoch #6: 151it [00:11, 13.47it/s, alpha=0.311, env_step=900, len=200, loss/actor=109.430, loss/alpha=-1.296, loss/critic1=1.785, loss/critic2=1.742, n/ep=0, n/st=1, rew=-1344.91]                         

Epoch #6: test_reward: -992.132114 ± 541.335989, best_reward: -992.132114 ± 541.335989 in #6


Epoch #7: 151it [00:10, 13.94it/s, alpha=0.258, env_step=1050, len=200, loss/actor=121.669, loss/alpha=-1.409, loss/critic1=2.198, loss/critic2=1.850, n/ep=0, n/st=1, rew=-1344.91]                         


Epoch #7: test_reward: -218.999008 ± 132.020717, best_reward: -218.999008 ± 132.020717 in #7

Gathered 1050 samples from the environment and took 5250 gradient steps.
Wall time: 2min 18s
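The two training runs logged in this notebook can be compared with a little arithmetic. The figures below are copied from those logs; the dictionary layout and variable names are assumptions for illustration. Each figure comes from a single run with seed 0, and the runs also differ in `step_per_epoch`, so this is a rough indication rather than a benchmark.

```python
# Figures copied from the two training logs in this notebook (single runs, seed 0).
runs = {
    1: {"env_steps": 4000, "gradient_steps": 4000, "wall_seconds": 4 * 60 + 18},
    5: {"env_steps": 1050, "gradient_steps": 5250, "wall_seconds": 2 * 60 + 18},
}

baseline = runs[1]
ratios = {}
for update_per_collect, run in runs.items():
    # Express every figure relative to the update_per_collect=1 run.
    ratios[update_per_collect] = {key: run[key] / baseline[key] for key in run}
    summary = ", ".join(
        f"{key} x{value:.2f}" for key, value in ratios[update_per_collect].items()
    )
    print(f"update_per_collect={update_per_collect}: {summary}")
```

In these two runs, the agent with `update_per_collect=5` needed roughly a quarter of the environment samples while taking somewhat more gradient steps. Wall time includes the test episodes run at the end of every epoch, so it is not a pure measure of training cost.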


To run further testing episodes, invoke `agent.test()`.



In [None]:
test_results = agent.test()