**NOTE: This notebook is written for the Google Colab platform, which provides free hardware acceleration. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook, using a local GPU.**



In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
import shutil
USE_NBCAP = False

if shutil.which('apt') is not None:
    !apt update
    !apt install -y xvfb x11-utils
    !apt install -y swig
    !{sys.executable} -m pip install pyscreenshot pyvirtualdisplay
    !{sys.executable} -m pip install --upgrade pyglet
    !{sys.executable} -m pip install git+https://github.com/michalgregor/nbcap.git

    USE_NBCAP = True

!{sys.executable} -m pip install "gymnasium[box2d]"
!{sys.executable} -m pip install "class_utils[tensorboard]@git+https://github.com/michalgregor/class_utils.git"
!{sys.executable} -m pip install "ray[rllib]"

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
import shutil
if shutil.which('apt') is None:
    USE_NBCAP = False
else:
    USE_NBCAP = True

    from nbcap import ShowVideoCallback, ScreenRecorder, OutputManager, DisplayProcess
    from pyscreenshot import grab as img_grab  

import gymnasium as gym
from ray.tune.registry import register_env

_lunar_max_episode_steps = 250

def timed_lunar_lander(**kwargs):
    env = gym.make('LunarLander-v2', **kwargs)
    env = gym.wrappers.TimeLimit(env,
        max_episode_steps=_lunar_max_episode_steps)
    return env

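gym.envs.registration.register(
    id='TimedLunarLander-v0',
    entry_point='__main__:timed_lunar_lander',
)

# RLlib calls the env creator with a single env_config dict, so unpack it here
register_env("TimedLunarLander-v0", lambda env_config: timed_lunar_lander(**env_config))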

In [None]:
#@title -- Auxiliary Functions -- { display-mode: "form" }

if USE_NBCAP:
    display_size=(600, 400)
    show_video = ShowVideoCallback(dimensions=display_size)

    # make sure that only one instance
    # of the display is ever created
    try:
        DISP_PROC
    except NameError:
        DISP_PROC = DisplayProcess(display_size=display_size)

    def make_screen_recorder(max_gui_outputs=3):
        video_path="output"
        segment_time=10

        output_manager = OutputManager(max_gui_outputs=max_gui_outputs)
        video_callback=output_manager(show_video)
        display = DISP_PROC.id

        screen_recorder = ScreenRecorder(
            display, display_size, video_path,
            segment_time=segment_time, video_callback=video_callback
        )
        
        return screen_recorder

    SCREEN_RECORDER = make_screen_recorder()
else:
    from contextlib import suppress
    SCREEN_RECORDER = suppress()

    def img_grab():
        pass

## OpenAI Gym Environments

Before we explore how deep reinforcement learning works, we are first going to play around with one Gym environment: [Lunar Lander](https://gym.openai.com/envs/LunarLander-v2/) from [OpenAI's Gym package](https://gym.openai.com/). The code in this notebook uses Gymnasium, the maintained successor of Gym, which keeps the same interface. Gym contains a number of useful RL benchmark tasks, all sharing the same unified interface.

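To construct the environment, we call `gym.make`. Instead of plain `'LunarLander-v2'`, we use `'TimedLunarLander-v0'`, the variant registered above, which wraps `LunarLander-v2` in a `TimeLimit` of 250 steps. We specify `render_mode='human'` so that the environment also gets visualized.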



In [None]:
env = gym.make('TimedLunarLander-v0', render_mode='human')

To start an episode and set up the initial state, we need to call `reset()`. This will return our initial observation and an info dict (which we are going to ignore here).

We are not going to go into the details of what all the individual values in the observation mean just now. As an example, though, the first two numbers are the x and y coordinates of our lunar lander.



In [None]:
obs, info = env.reset()
obs

We can query the shape and bounds of the observation space programmatically using `env.observation_space`. In our case, its shape tells us that each observation consists of 8 numbers.



In [None]:
env.observation_space.shape

They say that an image is worth a thousand words, so let's display the visualization of our initial state. The state has already been rendered, but since we are running our code on a remote server, the rendering happened there. We can display the result by grabbing a screenshot of the remote screen using `img_grab`.



In [None]:
img_grab()

As our next step, let's find out what actions our agent will be able to take. We can inquire about this using `env.action_space`. In our case, the action space is discrete and consists of 4 different actions – the meaning of these is as follows:

* **0:**  do nothing;
* **1:**  fire left orientation engine;
* **2:**  fire main engine;
* **3:**  fire right orientation engine.


In [None]:
env.action_space

Actions are executed by calling `env.step(action)`. The function takes care of performing the action and also returns a tuple with:

* `obs`: the next observation;
* `reward`: the immediate reward;
* `terminated`: whether the episode has ended naturally (e.g., the lander crashed);
* `truncated`: whether the episode was cut short by an external condition, such as the 250-step time limit of `TimedLunarLander-v0`;
* `info`: an environment-specific dictionary with additional info.

Let's try doing nothing (action 0) for a few steps to see what happens to our lunar lander.



In [None]:
obs, reward, terminated, truncated, info = env.step(0)
print("observation:", obs)
print("immediate reward:", reward)

To make the difference more visible, we repeat the same action 60 more times before inspecting the rendering. The lander should now be noticeably lower than before.



In [None]:
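for i in range(60):
    obs, reward, terminated, truncated, info = env.step(0)
    # stop early if the episode has ended (e.g. the lander crashed)
    if terminated or truncated:
        break

img_grab()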
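
Putting `reset()` and `step()` together, a complete episode is a loop that keeps stepping until either `terminated` or `truncated` is true. The following is a minimal sketch of such a loop. `run_episode` and the toy `CountdownEnv` are hypothetical names introduced only for illustration; the toy environment merely mimics the `(obs, info)` return of `reset()` and the five-element return of `step()`, so the loop can be tried without Box2D.

```python
from typing import Any, Callable, Optional, Tuple


def run_episode(env, policy: Callable[[Any], int],
                seed: Optional[int] = None) -> Tuple[float, int]:
    """Play one episode using the Gymnasium reset/step convention.

    Returns the total (undiscounted) reward and the number of steps taken.
    """
    obs, info = env.reset(seed=seed)
    total_reward, steps = 0.0, 0
    while True:
        action = policy(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        steps += 1
        # The episode ends either naturally (terminated) or because it was
        # cut short, e.g. by a TimeLimit wrapper (truncated).
        if terminated or truncated:
            return total_reward, steps


class CountdownEnv:
    """Hypothetical toy environment, used only to exercise run_episode.

    Every step gives reward 1.0; action 1 terminates the episode, and the
    episode is truncated once `limit` steps have been taken.
    """

    def __init__(self, limit: int = 5):
        self.limit = limit
        self.t = 0

    def reset(self, seed=None):
        self.t = 0
        return self.t, {}

    def step(self, action):
        self.t += 1
        terminated = action == 1
        truncated = self.t >= self.limit
        return self.t, 1.0, terminated, truncated, {}
```

With the Lunar Lander environment, a random agent could be run as `run_episode(env, lambda obs: env.action_space.sample())`. Because `run_episode` always calls `reset()` first, it does not matter where the previous cells left the lander.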

If you like, you can now experiment with the interface further and perhaps even try to design a rudimentary controller by hand.
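
As one possible starting point for such a controller, here is a minimal sketch loosely modelled on the `heuristic` function that ships with Gymnasium's Lunar Lander source. It assumes Gymnasium's observation layout (x and y position, x and y velocity, angle, angular velocity, and two leg-contact flags). `hand_controller` is a hypothetical name, and the constants are a starting point to experiment with rather than tuned values.

```python
from typing import Sequence


def hand_controller(obs: Sequence[float]) -> int:
    """Choose a discrete Lunar Lander action (0-3) from an 8-number observation.

    Assumed layout: x, y, vx, vy, angle, angular velocity,
    left-leg contact, right-leg contact.
    """
    x, y, vx, vy, angle, ang_vel, left_contact, right_contact = obs

    # Target tilt grows with horizontal offset and velocity, clipped to +/-0.4 rad.
    angle_target = max(-0.4, min(0.4, 0.5 * x + 1.0 * vx))
    # Target height grows with horizontal distance from the pad.
    hover_target = 0.55 * abs(x)

    # Proportional error terms, damped by the current (angular) velocity.
    angle_todo = 0.5 * (angle_target - angle) - 1.0 * ang_vel
    hover_todo = 0.5 * (hover_target - y) - 0.5 * vy

    if left_contact or right_contact:
        # After touchdown, only try to slow the vertical motion.
        angle_todo = 0.0
        hover_todo = -0.5 * vy

    if hover_todo > abs(angle_todo) and hover_todo > 0.05:
        return 2  # main engine: height error dominates
    if angle_todo < -0.05:
        return 3  # orientation engine for a negative angle error
    if angle_todo > 0.05:
        return 1  # orientation engine for a positive angle error
    return 0  # errors are small: do nothing
```

To try it, drive the environment with `obs, reward, terminated, truncated, info = env.step(hand_controller(obs))` in a loop, feeding each new `obs` back in. Changing the gains or the 0.05 thresholds is a good way to get a feel for how the observation values relate to the lander's behavior.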

