
# OpenAI Gym API and Gymnasium

We'll learn:
- the basics of `Gymnasium`, a maintained fork of `OpenAI Gym` that implements the same API. The library provides a uniform API between an RL agent and many RL environments (this API was originally implemented by the `OpenAI Gym` library, which is no longer maintained)
- to write our first randomly behaving agent and become familiar with the basic concepts of RL that we covered so far

## 1. The anatomy of the agent

1. **The agent**: in practice, some piece of code that implements some policy. Basically, this policy decides what action is needed at every time step, given our observations
2. **The environment**: everything that is external to the agent and has the responsibility of providing observations and giving rewards. The environment changes its state based on the agent's actions

Let's define an environment that will give the agent random rewards for a limited number of steps, regardless of the agent's actions ([agent anatomy script](01_agent_anatomy.py)):

In [1]:
import random
from typing import List


class Environment:
    def __init__(self):
        self.steps_left = 10

    def get_observation(self) -> List[float]:
        return [0.0, 0.0, 0.0]

    def get_actions(self) -> List[int]:
        return [0, 1]

    def is_done(self) -> bool:
        return self.steps_left == 0

    def action(self, action: int) -> float:
        if self.is_done():
            raise Exception("Game is over")
        self.steps_left -= 1
        return random.random()

- **Line 5-7**: the environment initializes its internal state. In this case, the state is just a counter that limits the number of time steps that the agent is allowed to take to interact with the environment

- **Line 9-10**: `get_observation()` method returns the current environment's observation to the agent. It is usually implemented as some function of the internal state of the environment
    - in this case, the observation vector is always zero, as the environment basically has no internal state

- **Line 12-13**: `get_actions()` method allows the agent to query the set of actions it can execute
    - normally, the set of actions does not change over time, but some actions can become impossible in different states
    - in this case, there are only two possible actions that the agent can carry out

- **Line 15-16**: `is_done()` method signals the end of the episode to the agent

- **Line 18-22**: `action()` method handles an agent's action and returns the reward for this action
    - in this case, the reward is random and the agent's action is discarded
    - additionally, we update the count of steps and don't continue the episodes that are over


Let's take a look at the agent:


In [2]:
class Agent:
    def __init__(self):
        self.total_reward = 0.0

    def step(self, env: Environment):
        current_obs = env.get_observation()
        actions = env.get_actions()
        reward = env.action(random.choice(actions))
        self.total_reward += reward

**Line 2-3**: the constructor initializes the counter that will keep the total reward accumulated by the agent during the episode

**Line 5-9**: `step()` function accepts the environment as an argument. It allows the agent to perform the following actions:
- **Line 6**: observe the environment
- **Line 7**: query the actions that are currently available
- **Line 8**: choose an action (here, at random), submit it to the environment, and get the reward for the current step
- **Line 9**: add the reward to the running total

In this case, the agent is dull and ignores the observations it obtains when deciding which action to take. Instead, every action is selected randomly

Let's create both classes and run one episode:

In [17]:
if __name__ == "__main__":
    env = Environment()
    agent = Agent()

    while not env.is_done():
        agent.step(env)

    print("Total reward got: %.4f" % agent.total_reward)

Total reward got: 3.7976
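
Each reward is drawn uniformly from $[0, 1)$ and an episode lasts $10$ steps, so a single total such as the one above is noisy, while the expected total is $5$. The sketch below repeats the episode many times and averages the totals. The helper names `run_episode` and `average_reward` are introduced here for illustration and are not part of the original script:

```python
import random
from typing import List


class Environment:
    """Same toy environment as above: 10 steps, random rewards."""

    def __init__(self):
        self.steps_left = 10

    def get_observation(self) -> List[float]:
        return [0.0, 0.0, 0.0]

    def get_actions(self) -> List[int]:
        return [0, 1]

    def is_done(self) -> bool:
        return self.steps_left == 0

    def action(self, action: int) -> float:
        if self.is_done():
            raise Exception("Game is over")
        self.steps_left -= 1
        return random.random()


def run_episode() -> float:
    """Play one episode with uniformly random actions; return the total reward."""
    env = Environment()
    total_reward = 0.0
    while not env.is_done():
        total_reward += env.action(random.choice(env.get_actions()))
    return total_reward


def average_reward(n_episodes: int) -> float:
    """Mean total reward over n_episodes independent episodes."""
    return sum(run_episode() for _ in range(n_episodes)) / n_episodes


if __name__ == "__main__":
    random.seed(0)  # fixed seed only to make the run repeatable
    print("Average over 1000 episodes: %.4f" % average_reward(1000))
```

The average should land close to $5$, which is a quick sanity check that the environment and the loop behave as described.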


## 2. Hardware and software requirements

1. The external libraries we'll use: 
    - `Numpy`
    - `OpenCV Python bindings`: computer vision library, provides many functions for image processing
    - `Gymnasium`: a maintained fork of OpenAI Gym library and an RL framework that has various environments that can be communicated with in a unified way
    - `Pytorch`: a flexible and expressive DL library
    - [`Pytorch Ignite`](https://pytorch-ignite.ai/): a set of high-level tools on top of Pytorch used to reduce boilerplate code
    - [`PTAN`](https://github.com/Shmuma/ptan): an open source extension to the OpenAI Gym API to support modern DL methods and building blocks

2. refer to [Deep Reinforcement Learning Hands-on (Page 62)](../Deep%20Reinforcement%20Learning%20Hands-on%203rd%20Edition%20-%20Maxim%20Lapan.pdf)

## 3. The OpenAI Gym API and Gymnasium

**Main goal of Gym**: provide a rich collection of environments for RL experiments using a unified interface:
1. Central class in this library is an environment `Env`. Instances of this class expose several methods and fields that provide the required information about its capabilities
2. At a high level, every environment provides these pieces of information and functionality:
    - a set of actions that is allowed to be executed in the environment. Gym supports both discrete and continuous actions, as well as their combination
    - the shape and boundaries of the observations that the environment provides the agent with
    - a `step()` method that executes an action and returns the current observation, the reward, and flags indicating whether the episode is over or was truncated
    - a `reset()` method that returns the environment to its initial state and obtains the first observation

Let's talk about these components of the environment in detail:

### 3.1 The action space

The actions that an agent can execute can be discrete, continuous, or a combination of the two:
1. **Discrete actions**: a fixed set of things that an agent can do. Their main characteristic is that they are mutually exclusive: only one action from a finite set of actions is possible at a time
2. **Continuous actions**: each has a value attached to it. A description of a continuous action includes the boundaries of the value that the action could have
3. **Multiple actions**: the environment could take several actions at once. To support such cases, Gym defines a special container class that allows the nesting of several action spaces into one unified action

### 3.2 The observation space

1. Observations are pieces of information that an environment provides the agent with, on every time step, besides the reward
- You can see the similarity between actions and observations, and that is how they have been represented in Gym's classes. Let's look at a class diagram: ![The hierarchy of the Space class in Gym](../images/figure_2-1.png)
- the basic abstract `Space` class includes $1$ property and $3$ methods:
    - `shape`: contain the shape of the space, identical to Numpy array
    - `sample()`: return a random sample from the space
    - `contains(x)`: check whether the argument `x` belongs to the space's domain
    - `seed()`: initialize a random number generator for the space and all subspaces (useful for getting reproducible environment behavior across several runs)
- all these methods are abstract and reimplemented in each of the `Space` subclasses:
    - `Discrete` class: represent a mutually exclusive set of items, numbered from $0$ to $n-1$
        - can redefine the starting index with the optional constructor argument `start` if needed
        - $n$ is a count of the items our `Discrete` object describes
        - example: `Discrete(n=4)` used for an action space of $4$ directions to move in `[left, right, up, down]`
    - `Box` class: represent an $n$-dimensional tensor of rational numbers with intervals `[low, high]`:
        - examples:
            1. `Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)`
            2. `Box(low=0, high=255, shape=(210, 160, 3), dtype=np.uint8)`
    - `Tuple` class: combine several `Space` class instances together. This enables us to create action and observation spaces of any complexity that we want:
        - example:
        ```python
                    Tuple(spaces=(Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32),
                                  Discrete(n=3),
                                  Discrete(n=2)))
        ```
    - other `Space` subclasses:
        - `Sequence`: represent variable-length sequences
        - `Text`: strings
        - `Graph`: where space is a set of nodes with connections between them

2. Every environment has $2$ members of type `Space`: `action_space` and `observation_space`

### 3.3 The environment

1. The environment is represented in Gym by the `Env` class, which has the following members:
    - `action_space`: the field of the `Space` class and provides a specification for allowed actions in the environment
    - `observation_space`: this field has the same `Space` class, but specifies the observations provided by the environment
    - `reset()`: resets the environment to its initial state, returning the initial observation vector and the dict with extra information from the environment
    - `step()`: allows the agent to take the action and returns information about the outcome of the action:
        - the next observation
        - the local reward
        - the end-of-episode flag
        - the flag indicating a truncated episode
        - a dictionary with extra information from the environment
    - `render()` (won't use): obtains the observation in a human-friendly form

2. Let's focus on the core `Env` methods: `reset()` and `step()`

3. `reset()` takes no required arguments (Gymnasium accepts optional `seed` and `options` keyword arguments); it instructs an environment to reset into its initial state and returns the initial observation
    - note: have to call `reset()` after the creation of the environment
    - the agent's communication with the environment may have an end. Such sessions are called episodes, and after the end of the episode, an agent needs to start over
    - the value returned by this method is the first observation of the environment
    - `reset()` also returns the dictionary with extra environment-specific information (empty in most standard environments)

4. `step()` - central piece in the environment's functionality
    - it does several things in one call:
        1. tell the environment which action we'll execute in the next step
        2. get new observation from the environment after this action
        3. get reward the agent gained with this step
        4. get the indication that the episode is over
        5. get the flag which signals an episode truncation (when time limit is enabled, for example)
        6. get the dict with extra environment-specific information
    - the first item in the preceding list (`action`) is passed as the only argument to the `step()` method, and the rest are returned as a tuple of $5$ elements (`observation`, `reward`, `done`, `truncated`, and `info`). They have these types and meanings:
        1. `observation`: Numpy vector/matrix with observation data
        2. `reward`: float value of the reward
        3. `done`: Boolean - `True` when the episode is over. If this value is `True`, we have to call `reset()` in the environment, as no more actions are possible
        4. `truncated`: Boolean - `True` when the episode is truncated.
            - for most environments, this is a `TimeLimit`, but it might have a different meaning in some environments
            - if this value is `True`, we have to call `reset()` in the environment
        5. `info`: extra environment-specific information. Usual practice: ignore this value in general RL methods

5. environment usage in an agent's code:
    1. in a loop, call `step()` method with an action to perform until `done` or `truncated` flags become `True`
    2. then, call `reset()` to start over
    3. a missing piece: how to create `Env` object?
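
The loop described above can be sketched without any real environment. `CountdownEnv` below is a hypothetical stand-in written for this note (it is not a Gymnasium class); it only mimics the return shapes of `reset()` and `step()` so that the control flow around `done` and `truncated` is visible:

```python
import random
from typing import Any, Callable, Dict, List, Tuple

Observation = List[float]


class CountdownEnv:
    """Hypothetical stand-in, not a Gymnasium class: it only mimics the
    reset()/step() return shapes. Episodes are cut off after `limit` steps."""

    def __init__(self, limit: int = 10):
        self.limit = limit
        self.steps = 0

    def reset(self) -> Tuple[Observation, Dict[str, Any]]:
        self.steps = 0
        return [0.0, 0.0, 0.0], {}

    def step(self, action: int) -> Tuple[Observation, float, bool, bool, Dict[str, Any]]:
        self.steps += 1
        truncated = self.steps >= self.limit  # time limit reached
        done = False                          # this toy task has no natural ending
        return [0.0, 0.0, 0.0], random.random(), done, truncated, {}


def run_episode(env: CountdownEnv, policy: Callable[[Observation], int]) -> Tuple[float, int]:
    """Step until `done` or `truncated`; return (total reward, number of steps)."""
    obs, _info = env.reset()
    total_reward, steps = 0.0, 0
    while True:
        obs, reward, done, truncated, _info = env.step(policy(obs))
        total_reward += reward
        steps += 1
        if done or truncated:
            return total_reward, steps


if __name__ == "__main__":
    env = CountdownEnv(limit=10)
    # run_episode() calls reset() each time, so the same env object is reusable
    for episode in range(3):
        reward, steps = run_episode(env, policy=lambda obs: random.choice([0, 1]))
        print(f"episode {episode}: {steps} steps, reward {reward:.2f}")
```

Checking both flags matters: an environment that only ever ends by time limit, like this one, would loop forever if the code tested `done` alone.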

### 3.4 Creating an environment

1. Every environment has a unique name of the `EnvironmentName-vN` form, where `N` is the number used to distinguish between different versions of the same environment
    - use `gymnasium`'s `make(name)` function to create an environment

2. The same environment can have different variations in its settings and observation spaces. For example, the **Atari** game **Breakout** has these environment names:
    - `Breakout-v0`, `Breakout-v4`: original Breakout with a random initial position and direction of the ball
    - `BreakoutDeterministic-v0`, `BreakoutDeterministic-v4`: Breakout with the same initial placement and speed vector of the ball
    - `BreakoutNoFrameskip-v0`, `BreakoutNoFrameskip-v4`: Breakout with every frame displayed to the agent. Without this, every action is executed for several consecutive frames
    - `Breakout-ram-v0`, `Breakout-ram-v4`: Breakout with the observation of the full Atari emulation memory (128 bytes) instead of screen pixels
    - `Breakout-ramDeterministic-v0`, `Breakout-ramDeterministic-v4`: memory observation with the same initial state
    - `Breakout-ramNoFrameskip-v0`, `Breakout-ramNoFrameskip-v4`: memory observation without frame skipping

3. Even after removal of such duplicates, Gymnasium comes with an impressive list of $198$ unique environments, which can be divided into several groups:
    - `classic control problems`: toy tasks used in optimal control theory and RL papers as benchmarks or demonstrations
        - usually simple, with low-dimension observation and action spaces, but are useful as quick checks when implementing algorithms
        - think about them as `MNIST for RL`
    - `Atari 2600`: games from a classic game platform from the 1970s. There are 63 unique games
    - `Algorithmic`: problems aim to perform small computation tasks, such as copying the observed sequence or adding numbers
    - `Box2D`: environments that use the Box2D physics simulator to learn walking or car control
    - `MuJoCo`: physics simulator used for several continuous control problems
    - `Parameter tuning`: RL used to optimize NN parameters
    - `Toy text`: simple grid world text environments

4. Total number of RL environments supporting the `Gym API` is much larger:
    - the `Farama Foundation` maintains several repos related to special RL topics like multi-agent RL, 3D navigation, robotics, and web automation
    - [third-party repos](https://gymnasium.farama.org/environments/third_party_environments/)

### 3.5 The CartPole session

In [18]:
import gymnasium as gym
e = gym.make("CartPole-v1")

We call the `CartPole` environment from the `gymnasium` package
- this environment is from the classic control group and its gist is to control the platform with a stick attached to its bottom part
- the observation is $4$ floating-point numbers containing information about:
    1. `x` coordinate of the stick's center of mass
    2. its `speed`
    3. its `angle` to the platform
    4. its `angular speed`
- the reward is $1$, given on every time step
- the episode continues until the stick falls, so to accumulate more reward, we need to balance the platform in a way that keeps the stick from falling

In [20]:
obs, info = e.reset()
obs, info

(array([ 0.02871413, -0.01405009, -0.0225007 ,  0.04278427], dtype=float32),
 {})

We reset the environment and obtain the first observation, then we examine the action space and observation space:
- `action_space` field is of `Discrete` type, so our actions will be just $0$ and $1$:
    - $0$: push the platform to the left
    - $1$: push the platform to the right
- `observation_space` is of `Box(4,)`, i.e. a vector of length $4$
    - the first list is the low bound and the second is the high bound of parameters
    - refer to [Gymnasium repo - cartpole.py](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/classic_control/cartpole.py#L40) for meanings of these values
        1. `Cart position`: in $[-4.8, 4.8]$
        2. `Cart velocity`: in $[-\infty, \infty]$
        3. `Pole angle`: in $[-0.418, 0.418]$
        4. `Pole angular velocity`: in $[-\infty, \infty]$

In [21]:
e.action_space

Discrete(2)

In [22]:
e.observation_space

Box([-4.8               -inf -0.41887903        -inf], [4.8               inf 0.41887903        inf], (4,), float32)

In [23]:
e.step(0)

(array([ 0.02843313, -0.20884228, -0.02164502,  0.32828394], dtype=float32),
 1.0,
 False,
 False,
 {})

Let's send an action to the environment:
- `e.step(0)` pushed platform to the left by executing the action `0` and got a tuple of $5$ elements:
    1. a new observation: a new vector of length $4$
    2. a reward of $1.0$
    3. `done` flag with value `False` --> episode is not over yet and we are more or less okay with balancing the pole
    4. `truncated` flag with value `False` --> episode was not truncated
    5. extra information about the environment - an empty dictionary

In [24]:
e.action_space.sample()

np.int64(0)

In [30]:
e.action_space.sample()

np.int64(1)

In [31]:
e.observation_space.sample()

array([0.49597013, 1.0956569 , 0.23275931, 1.0554833 ], dtype=float32)

In [32]:
e.observation_space.sample()

array([ 3.4710777 , -0.409212  ,  0.14004734,  0.3030727 ], dtype=float32)

We used the `sample()` method of the `Space` class on the `action_space` and `observation_space`:
- this method returns a random sample from the underlying space
- `Discrete` `action_space` will return a random number of $0$ or $1$ (could be used when we're not sure how to perform an action)
- `observation_space` will return a random vector of length $4$ (not very useful)
- this feature is especially handy because we don't know any RL methods yet, but still want to play around with the Gym environment

Now, let's implement our first randomly behaving agent for `CartPole`:

## 4. The random CartPole agent

1. Refer to [cartpole random script](02_cartpole_random.py):
    - **Line 4-8**: create the environment and initialize the counter of steps and the reward accumulator:
        - **Line 8**: reset the environment to obtain the first observation (which we'll not use, as our agent is stochastic)
    - **Line 10-18**: in the loop, after sampling a random action, we ask the environment to execute it and return to us the next observation `obs`, the `reward`, the `is_done`, and the `is_trunc` flags
        - if the episode is over, we stop the loop and show how many steps we have taken and how much reward has been accumulated

2. Most of the environments in Gym have a `reward boundary`, which is the average reward that the agent should gain during $100$ consecutive episodes to "solve" the environment
    - for `CartPole-v0`, this boundary is $195$, i.e. on average, the agent must hold the stick for $195$ time steps or longer (for `CartPole-v1`, which is used here, the registered boundary is $475$)
    - using this perspective, our random agent's performance looks poor

In [38]:
import gymnasium as gym


if __name__ == "__main__":
    env = gym.make("CartPole-v1")
    total_reward = 0.0
    total_steps = 0
    obs, _ = env.reset()

    while True:
        action = env.action_space.sample()
        obs, reward, is_done, is_trunc, _ = env.step(action)
        total_reward += reward
        total_steps += 1
        if is_done or is_trunc:
            break

    print("Episode done in %d steps, total reward %.2f" % (total_steps, total_reward))

Episode done in 44 steps, total reward 44.00


## 5. Extra Gym API functionality

The rest of the API we can live without, but it will make our life easier and the code cleaner. Let's briefly cover them:


### 5.1 Wrappers




1. Very frequently, we'll want to extend the environment's functionality in some generic way. E.g.
    1. imagine an environment gives us some observations, but we want to accumulate them in some buffer and provide to the agent the last $N$ observations (common scenario for computer games, when one single frame is just not enough to get the full information about the game state)
    2. to be able to crop or preprocess an image's pixels to make it more convenient for the agent to digest
    3. to normalize the reward scores

2. There are many such situations that have the same structure: we want to "wrap" the existing environment and add some extra logic for doing something. Gym provides the `Wrapper` class for this: ![The hierarchy of the Wrapper class in Gym](../images/figure_2-4.png)
    - inherit the `Env` class
    - constructor accepts the only argument - the instance of the `Env` class to be wrapped
    - to add extra functionality, redefine the methods we want to extend, such as `step()` or `reset()`. The only requirement is to call the original method of the superclass
    - to simplify accessing the environment being wrapped, `Wrapper` has $2$ properties:
        1. `env`, of the immediate environment we're wrapping (which could be another wrapper as well), and
        2. `unwrapped`, which is an `Env` without any wrappers

3. To handle more specific requirements, such as a `Wrapper` class that wants to process only observations from the environment, or only actions, there are subclasses of `Wrapper` that allow the filtering of only a specific portion of information. They are as follows:
    1. `ObservationWrapper`: need to redefine the `observation(obs)` method of the parent
        - `obs` argument is an observation from the wrapped environment, and this method should return the observation that will be given to the agent
    2. `RewardWrapper`: this exposes the `reward(rew)` method, which can modify the reward value given to the agent
        - example: scale it to the needed range, add a discount based on some previous actions, etc.
    3. `ActionWrapper`: need to override `action(a)` method, which can tweak the action passed to the wrapped environment by the agent

4. To make it slightly more practical, let's imagine a situation where we want to intervene in the stream of actions sent by the agent and, with a probability of $10\%$, replace the current action with a random one
    - might look unwise, but this simple trick is one of the most practical and powerful methods for solving the `exploration/exploitation problem`
    - by issuing random actions, we make our agent explore the environment and from time to time drift away from the beaten track of its policy
    - this is an easy thing to do using the `ActionWrapper` class

5. Refer to [random action wrapper script](03_random_action_wrapper.py):

In [39]:
import gymnasium as gym
import random


class RandomActionWrapper(gym.ActionWrapper):
    def __init__(self, env: gym.Env, epsilon: float = 0.1):
        super(RandomActionWrapper, self).__init__(env)
        self.epsilon = epsilon

    def action(self, action: gym.core.WrapperActType) -> gym.core.WrapperActType:
        if random.random() < self.epsilon:
            action = self.env.action_space.sample()
            print(f"Random action {action}")
            return action
        return action

- **Line 5-8**: initialize wrapper by calling a parent's `__init__()` method and saving `epsilon` (the probability of a random action)
- **Line 10-15**: override `action()` method from a parent's class to tweak the agent's actions
    - note: using `action_space` and wrapper abstraction, we were able to write abstract code, which will work with any environment from Gym

Let's apply our wrapper:

In [45]:
if __name__ == "__main__":
    env = RandomActionWrapper(gym.make("CartPole-v1"))

    obs, info = env.reset()
    total_reward = 0.0

    while True:
        obs, reward, done, truncated, _ = env.step(0)
        total_reward += reward
        if done or truncated:
            break

    print(f"Reward got: {total_reward:.2f}")

Random action 0
Random action 0
Random action 0
Reward got: 8.00


- **Line 2**: create a normal `CartPole` environment and pass it to our `Wrapper` constructor
    - from here on, we'll use our wrapper as a normal `Env` instance, instead of the original `CartPole`
    - as the `Wrapper` class inherits `Env` class and exposes the same interface, we can nest our wrappers as deep as we want. This is a powerful, elegant, and generic solution
- **Line 4-13**: almost the same code as in the random agent, except that every time, we issue the same action $0$, so our agent is dull and does the same thing

Let's look at how to render the environment during execution:

### 5.2 Rendering the environment (works in Python scripts, not in Jupyter notebooks)

It is implemented with $2$ wrappers:
1. `HumanRendering`: open a separate graphical window in which the image from the environment is being shown interactively
    - to be able to render the environment (`CartPole` in this case), it has to be initialized with the `render_mode="rgb_array"` argument
    - this argument tells the environment to return pixels from its `render()` method, which is being called by the `HumanRendering` wrapper
    - refer to [cartpole random monitor script](04_cartpole_random_monitor.py):
2. `RecordVideo`: captures the pixels from the environment and produces a video file of our agent in action
    - used in the same way as `HumanRendering`, but requires an extra argument specifying the directory to store video files (will create one if directory doesn't exist)
    - especially useful in situations when we're running our agent on a remote machine without the GUI

In [46]:
import gymnasium as gym


if __name__ == "__main__":
    env = gym.make("CartPole-v1", render_mode="rgb_array")
    env = gym.wrappers.HumanRendering(env)
    # env = gym.wrappers.RecordVideo(env, video_folder="video")

    total_reward = 0.0
    total_steps = 0
    obs, info = env.reset()

    while True:
        action = env.action_space.sample()
        obs, reward, done, truncated, _ = env.step(action)
        total_reward += reward
        total_steps += 1
        if done or truncated:
            break

    print(f"Episode done in {total_steps} steps, total reward {total_reward:.2f}")
    env.close()



Episode done in 18 steps, total reward 18.00


When the code is run as a Python script (`python .\Chapter02\04_cartpole_random_monitor.py`), a window with the environment rendering will appear. As our agent cannot balance the pole for long (10-30 steps max), the window disappears quite quickly, once the `env.close()` method is called

In [53]:
import gymnasium as gym


if __name__ == "__main__":
    env = gym.make("CartPole-v1", render_mode="rgb_array")
    # env = gym.wrappers.HumanRendering(env)
    env = gym.wrappers.RecordVideo(env, video_folder="video")

    total_reward = 0.0
    total_steps = 0
    obs, info = env.reset()

    while True:
        action = env.action_space.sample()
        obs, reward, done, truncated, _ = env.step(action)
        total_reward += reward
        total_steps += 1
        if done or truncated:
            break

    print(f"Episode done in {total_steps} steps, total reward {total_reward:.2f}")
    env.close()

Episode done in 53 steps, total reward 53.00


### 5.3 More wrappers

Refer to [Gymnasium Wrappers](https://gymnasium.farama.org/api/wrappers/)