### Default RL framework

In [2]:
from typing import List
import random

In [3]:
class Environment:
    def __init__(self):
        self.steps_left = 100
        
    def get_observation(self) -> List[float]:
        return [0.0, 0.0, 0.0]
    
    def get_actions(self) -> List[int]:
        return [0, 1]
    
    def is_done(self) -> bool:
        return self.steps_left == 0
    
    def action(self, action: int) -> float:
        
        # Handles an agent's action and returns the reward
        
        if self.is_done():
            raise Exception("Game is over")
        self.steps_left -= 1
        return random.random() # Reward is random in this example

In [4]:
class Agent:
    def __init__(self):
        self.total_reward = 0.0
        
    def step(self, env: Environment):
        """
        Observe the environment
        Make a decision about the action to take based on the observations
        Submit the action to the environment
        Get the reward for the current step
        """
        current_obs = env.get_observation()
        actions = env.get_actions()
        reward = env.action(random.choice(actions))
        self.total_reward += reward

In [5]:
if __name__ == "__main__":
    env = Environment()
    agent = Agent()
    
    while not env.is_done():
        agent.step(env)
        
    print("Total reward got: %.4f" % agent.total_reward)

Total reward got: 50.6003
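
Each step's reward is drawn uniformly from [0, 1) and an episode lasts 100 steps, so the total should hover around 50, which is consistent with the 50.6003 printed above. A minimal sketch that checks this by averaging many episodes; the `run_episode` helper, the seed, and the episode count are illustrative assumptions, not part of the original cells.

```python
import random


class Environment:
    """Same toy environment as above: 100 steps, uniformly random reward."""

    def __init__(self) -> None:
        self.steps_left = 100

    def is_done(self) -> bool:
        return self.steps_left == 0

    def action(self, action: int) -> float:
        if self.is_done():
            raise RuntimeError("Game is over")
        self.steps_left -= 1
        return random.random()


def run_episode() -> float:
    """Hypothetical helper: play one episode with random actions, return the total reward."""
    env = Environment()
    total = 0.0
    while not env.is_done():
        total += env.action(random.choice([0, 1]))
    return total


random.seed(0)  # arbitrary seed, used only for repeatability
totals = [run_episode() for _ in range(1000)]
mean_total = sum(totals) / len(totals)
print(f"Mean total reward over {len(totals)} episodes: {mean_total:.2f}")
```

Since the reward ignores the action, no agent can do better than this average in the toy environment.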


## **The OpenAI Gym API**

At a high level, every environment provides these pieces of information and functionality:
- A set of actions that is allowed to be executed in the environment. Gym supports both discrete and continuous actions, as well as their combination
- The shape and boundaries of the observations that the environment provides the agent with
- A method called step to execute an action, which returns the current observation, the reward, and the indication that the episode is over
- A method called reset, which returns the environment to its initial state and obtains the first observation

### ***The observation space***

The basic abstract class Space includes two methods that are relevant to us:
-  sample(): This returns a random sample from the space
-  contains(x): This checks whether the argument, x, belongs to the space's domain
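
To make the two methods concrete, here is a minimal sketch of a discrete space written from scratch. `ToyDiscrete` is a hypothetical stand-in for illustration only; it is not Gym's implementation and covers just `sample()` and `contains(x)`.

```python
import random


class ToyDiscrete:
    """Illustrative stand-in for a discrete space with items 0..n-1 (not Gym's class)."""

    def __init__(self, n: int) -> None:
        if n <= 0:
            raise ValueError("n must be positive")
        self.n = n

    def sample(self) -> int:
        """Return a uniformly random item from the space."""
        return random.randrange(self.n)

    def contains(self, x: object) -> bool:
        """Check whether x is an integer inside [0, n)."""
        return isinstance(x, int) and not isinstance(x, bool) and 0 <= x < self.n


space = ToyDiscrete(4)  # a space of four items: 0, 1, 2, 3
drawn = space.sample()
print(drawn, space.contains(drawn), space.contains(4))
```

Anything returned by `sample()` should pass `contains()`, which is a handy sanity check when writing a custom space.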

More classes:
- The Discrete class represents a mutually exclusive set of items, numbered from 0 to n – 1. Its only field, n, is a count of the items it describes. For example, Discrete(n=4) can be used for an action space of four directions to move in [left, right, up, or down].
- The Box class represents an n-dimensional tensor of rational numbers with intervals [low, high]. For instance, this could be an accelerator pedal with one single value between 0.0 and 1.0, which could be encoded by Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32). Here the shape argument is a tuple of length 1 with a single value of 1, which gives us a one-dimensional tensor with a single value, and the dtype parameter specifies the space's value type, in this case a NumPy 32-bit float. Another example of Box could be an Atari screen observation (we will cover lots of Atari environments later), which is an RGB (red, green, and blue) image of size 210×160: Box(low=0, high=255, shape=(210, 160, 3), dtype=np.uint8). In this case, the shape argument is a tuple of three elements: the first dimension is the height of the image, the second is the width, and the third equals 3, corresponding to the three color planes for red, green, and blue. So, in total, every observation is a three-dimensional tensor of 210 × 160 × 3 = 100,800 bytes.
- The final child of Space is the Tuple class, which allows us to combine several Space class instances. This enables us to create action and observation spaces of any complexity that we want. For example, imagine we want to create an action space specification for a car. The car has several controls that can be changed at every time step, including the steering wheel angle, brake pedal position, and accelerator pedal position. These three controls can be specified by three float values in one single Box instance. Besides these essential controls, the car has extra discrete controls, like a turn signal (which could be off, right, or left) or horn (on or off). To combine all of this into one action space specification class, we can create Tuple(spaces=(Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32), Discrete(n=3), Discrete(n=2))). This flexibility is rarely used; for example, in this book, you will see only the Box and Discrete action and observation spaces, but the Tuple class can be useful in some cases.
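
The byte count quoted for the Atari observation follows directly from its shape and value type. A small sketch of the arithmetic, assuming one byte per value because the dtype is an unsigned 8-bit integer:

```python
import math

# Shape of one Atari RGB frame as described above: height, width, color planes.
frame_shape = (210, 160, 3)
bytes_per_value = 1  # a uint8 value occupies one byte

n_values = math.prod(frame_shape)
frame_bytes = n_values * bytes_per_value
print(f"{n_values} values, {frame_bytes} bytes per observation")
```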

For example, Discrete.sample() returns a random element from a discrete range, and Box.sample() will be a random tensor with proper dimensions and values lying inside the given range.

Every environment has two members of type Space: the action_space and observation_space. 

### ***The environment***

Every environment has the following members:

- action_space: This field is of the Space class and provides a specification for allowed actions in the environment.
- observation_space: This field is of the same Space class, but specifies the observations provided by the environment.
- reset(): This resets the environment to its initial state, returning the initial observation vector.
- step(): This method allows the agent to take the action and returns information about the outcome of the action – the next observation, the local reward, and the end-of-episode flag.

The step() method is the central piece in the environment's functionality. It does
several things in one call, which are as follows:
- Telling the environment which action we will execute on the next step
- Getting the new observation from the environment after this action
- Getting the reward the agent gained with this step
- Getting the indication that the episode is over

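The first item (action) is passed as the only argument to this method, and the rest are
returned by the step() method. In the classic Gym API, this is a tuple (a Python tuple, not the
Tuple class we discussed in the previous section) of four elements (observation,
reward, done, and info). Gym 0.26 and later, as well as Gymnasium, return five elements instead, splitting done into terminated (the episode reached a terminal state) and truncated (the episode was cut short, for example by a time limit). The elements have these types and meanings: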
- observation: This is a NumPy vector or a matrix with observation data.
- reward: This is the float value of the reward.
- done: This is a Boolean indicator, which is True when the episode is over.
- info: This could be anything environment-specific with extra information about the environment. The usual practice is to ignore this value in general RL methods (not taking into account the specific details of the particular environment).
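
The reset() and step() contract can be sketched without Gym at all. `ToyCounterEnv` below is a hypothetical environment invented for illustration; it follows the classic four-element return described above, and its reward rule is an arbitrary assumption.

```python
import random
from typing import Dict, List, Tuple


class ToyCounterEnv:
    """Hypothetical environment: the episode ends after a fixed number of steps."""

    def __init__(self, max_steps: int = 5) -> None:
        self.max_steps = max_steps
        self.steps_taken = 0

    def reset(self) -> List[float]:
        """Return the environment to its initial state and give the first observation."""
        self.steps_taken = 0
        return [0.0]

    def step(self, action: int) -> Tuple[List[float], float, bool, Dict]:
        """Apply the action and return (observation, reward, done, info)."""
        self.steps_taken += 1
        observation = [float(self.steps_taken)]
        reward = 1.0 if action == 1 else 0.0  # arbitrary rule: action 1 pays off
        done = self.steps_taken >= self.max_steps
        return observation, reward, done, {}


env = ToyCounterEnv()
obs = env.reset()
total_reward = 0.0
done = False
while not done:
    action = random.choice([0, 1])
    obs, reward, done, info = env.step(action)  # info is ignored here
    total_reward += reward
print(f"Final observation {obs}, total reward {total_reward:.1f}")
```

The loop has the same shape as the agent loops in the cells of this notebook: reset once, then call step() until the end-of-episode flag is set.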

### ***Creating an environment***

To create an environment, the gym package provides the make(env_name) function, whose only argument is the environment's name in string form. The full list of environments can be found at https://gym.openai.com/envs

In [6]:
import gym
import warnings
# warnings.simplefilter(action='ignore', category=UserWarning)

In [7]:
e = gym.make('CartPole-v0')



In [8]:
# The observation of this environment is four floating-point numbers containing
# information about the x coordinate of the stick's center of mass, its speed, its angle to
# the platform, and its angular speed.
obs = e.reset()
obs

(array([-0.00273233,  0.01887221, -0.02566512, -0.00512351], dtype=float32),
 {})

In [9]:
# The action_space field is of the Discrete type, so our actions will be just 0 or
# 1, where 0 means pushing the platform to the left and 1 means to the right.
e.action_space

Discrete(2)

In [10]:
e.observation_space

Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
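
The bounds printed above look odd at first glance, but they can be reproduced with plain arithmetic. The sketch below assumes that the ±3.4028235e+38 entries are the largest finite 32-bit float (standing in for unbounded velocities) and that the angle bound is 24 degrees expressed in radians.

```python
import math

# Largest finite IEEE 754 single-precision value: (2 - 2**-23) * 2**127.
float32_max = (2 - 2 ** -23) * 2.0 ** 127
# Assumption: the angle bound is 24 degrees converted to radians.
angle_bound = math.radians(24)

print(f"float32 max = {float32_max:.7e}")
print(f"24 degrees  = {angle_bound:.7e} rad")
```

This also explains why observation_space.sample() can return velocities on the order of 1e38: it samples from the declared bounds, not from states the simulation can actually reach.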

In [11]:
e.action_space.sample()

1

In [12]:
e.action_space.sample()

0

In [13]:
e.observation_space.sample()

array([-3.1253421e-01,  1.9566858e+38,  1.8909226e-01, -2.4372871e+38],
      dtype=float32)

### ***Random CartPole agent***

In [14]:
if __name__ == "__main__":
    env = gym.make("CartPole-v0")
    total_reward = 0.0
    total_steps = 0
    obs, info = env.reset()  # reset() returns (observation, info) in this Gym version
    
    while True:
        action = env.action_space.sample()
        # step() now returns five values: done is split into terminated and truncated
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        total_steps += 1
        if terminated or truncated:
            break
    print("Episode done in %d steps, total reward %.2f" % (total_steps, total_reward))

Episode done in 13 steps, total reward 13.00




### ***Extra Gym functionality - wrappers and monitors***

- Wrappers: Sometimes you want to "wrap" an existing environment and add some extra logic to it. Gym provides a convenient framework for these situations: the Wrapper class. The Wrapper class inherits from the Env class. Its constructor accepts a single argument, the instance of the Env class to be "wrapped." To add extra functionality, you redefine the methods you want to extend, such as step() or reset(). The only requirement is to call the original method of the superclass. There are also subclasses of Wrapper that allow filtering of only a specific portion of information:

    - ObservationWrapper: You need to redefine the observation(obs) method of the parent. The obs argument is an observation from the wrapped environment, and this method should return the observation that will be given to the agent.

    - RewardWrapper: This exposes the reward(rew) method, which can modify the reward value given to the agent.
    
    - ActionWrapper: You need to override the action(act) method, which can tweak the action passed to the wrapped environment by the agent.

- Monitors: The Monitor class lets you take a look at your agent's life inside the environment by recording it. The first argument is the environment to wrap, and the second is the name of the directory that it will write the results to. This directory shouldn't exist, otherwise your program will fail with an exception (to overcome this, you could either remove the existing directory or pass the force=True argument to the Monitor class' constructor). Newer Gym releases and Gymnasium no longer ship Monitor; wrappers such as RecordVideo cover the recording use case there.
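
The wrapper pattern itself does not depend on Gym. The sketch below uses hypothetical classes (`ToyEnv`, `ToyWrapper`, `ScaleReward`) to show the idea under simplified assumptions: the wrapper holds the wrapped environment, forwards calls to it, and a subclass extends step() by calling the parent method first and then changing only the reward.

```python
class ToyEnv:
    """Hypothetical base environment: reward 1.0 per step, three steps per episode."""

    def __init__(self) -> None:
        self.steps = 0

    def reset(self) -> int:
        self.steps = 0
        return self.steps

    def step(self, action: int):
        self.steps += 1
        return self.steps, 1.0, self.steps >= 3, {}


class ToyWrapper(ToyEnv):
    """Holds another environment and forwards calls to it unchanged."""

    def __init__(self, env: ToyEnv) -> None:
        super().__init__()
        self.env = env

    def reset(self) -> int:
        return self.env.reset()

    def step(self, action: int):
        return self.env.step(action)


class ScaleReward(ToyWrapper):
    """Extends step(): calls the parent first, then rescales only the reward."""

    def __init__(self, env: ToyEnv, factor: float) -> None:
        super().__init__(env)
        self.factor = factor

    def step(self, action: int):
        obs, reward, done, info = super().step(action)
        return obs, reward * self.factor, done, info


env = ScaleReward(ToyEnv(), factor=0.5)
env.reset()
rewards = []
done = False
while not done:
    _, reward, done, _ = env.step(0)
    rewards.append(reward)
print(rewards)
```

Because the wrapper exposes the same interface as the environment it wraps, wrappers can be stacked and the agent code does not need to change.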

In [15]:
# imagine a situation where we want to intervene in the stream of actions sent by the agent and, 
# with a probability of 10%, replace the current action with a random one.

import gymnasium as gym
import random

class RandomActionWrapper(gym.ActionWrapper):
    def __init__(self, env: gym.Env, epsilon: float = 0.1):
        super().__init__(env)
        self.epsilon = epsilon
    
    def action(self, action: gym.core.WrapperActType) -> gym.core.WrapperActType:
        # With probability epsilon, replace the agent's action with a random one
        if random.random() < self.epsilon:
            action = self.env.action_space.sample()
            print(f"Random action {action}")
        return action


In [16]:
if __name__ == "__main__":
    env = RandomActionWrapper(gym.make("CartPole-v1"))

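    obs, info = env.reset()
    total_reward = 0.0

    while True:
        # Always send action 0; the wrapper occasionally swaps in a random action
        obs, reward, terminated, truncated, _ = env.step(0)
        total_reward += reward
        if terminated or truncated:
            break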
    
    print(f"Reward got: {total_reward:.2f}")

Random action 1
Random action 0
Random action 0
Reward got: 11.00


In [None]:
if __name__ == "__main__":
    env = gym.make("CartPole-v1", render_mode="rgb_array")
    env = gym.wrappers.HumanRendering(env)
    # env = gym.wrappers.RecordVideo(env, video_folder="video")

    total_reward = 0.0
    total_steps = 0
    obs, info = env.reset()

    while True:
        action = env.action_space.sample()
        obs, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        total_steps += 1
        if terminated or truncated:
            break

    print(f"Episode done in {total_steps} steps, total reward {total_reward:.2f}")
    env.close()



Episode done in 24 steps, total reward 24.00

