<a href="https://colab.research.google.com/github/jeffheaton/app_deep_learning/blob/main/t81_558_class_12_1_ai_gym.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-558: Applications of Deep Neural Networks
**Module 12: Reinforcement Learning**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 12 Video Material

* **Part 12.1: Introduction to Gymnasium** [[Video]](https://www.youtube.com/watch?v=FvuyrpzvwdI&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_12_1_reinforcement.ipynb)
* Part 12.2: Introduction to Q-Learning [[Video]](https://www.youtube.com/watch?v=VKuqvbG_KAw&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_12_2_qlearningreinforcement.ipynb)
* Part 12.3: Stable Baselines Q-Learning [[Video]](https://www.youtube.com/watch?v=kl7zsCjULN0&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_12_3_pytorch_reinforce.ipynb)
* Part 12.4: Atari Games with Stable Baselines Neural Networks [[Video]](https://www.youtube.com/watch?v=maLA1_d4pzQ&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_12_4_atari.ipynb)
* Part 12.5: Future of Reinforcement Learning [[Video]](https://www.youtube.com/watch?v=-euo5pTjP8E&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_12_5_rl_future.ipynb)


# Part 12.1: Introduction to Gymnasium

[Gymnasium](https://github.com/Farama-Foundation/Gymnasium) aims to provide an easy-to-setup general-intelligence benchmark with various environments. The goal is to standardize how environments are defined in AI research publications to make published research more easily reproducible. The project claims to provide the user with a simple interface. Gymnasium is a fork of the OpenAI Gym, for which OpenAI ceased support in October 2021. Gymnasium is currently supported by [The Farama Foundation](https://farama.org/).

Gymnasium is pip-installed onto your local machine. There are a few significant limitations to be aware of:

* Gymnasium Atari only **directly** supports Linux and Macintosh
* Gymnasium Atari can be used with Windows; however, it requires a particular [installation procedure](https://towardsdatascience.com/how-to-install-openai-gym-in-a-windows-environment-338969e24d30)
* Gymnasium cannot directly render animated games in Google CoLab.

Because Gymnasium requires a graphics display, an embedded video is the only way to display Gymnasium in Google CoLab. The presentation of Gymnasium game animations in Google CoLab is discussed later in this module.

## Looking at Gymnasium Environments

The centerpiece of Gymnasium is the environment, which defines the "game" in which your reinforcement algorithm will compete. An environment does not need to be a game; however, it describes the following game-like features:
* **action space**: What actions can we take on the environment at each step to alter the environment.
* **observation space**: What is the current state of the portion of the environment that we can observe. Usually, we can see the entire environment.

Before we begin to look at Gymnasium, it is essential to understand some of the terminology used by this library.

* **Agent** - The machine learning program or model that controls the actions.
* **Step** - One round of issuing actions that affect the observation space.
* **Episode** - A collection of steps that terminates when the agent fails to meet the environment's objective or the episode reaches the maximum number of allowed steps.
* **Render** - Gymnasium can render one frame for display after each step.
* **Reward** - A reinforcement signal, positive or negative, that the environment returns after each step in response to the agent's action.
* **Non-deterministic** - For some environments, randomness is a factor in deciding what effects actions have on reward and changes to the observation space.
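To make these terms concrete, the following sketch runs an agent through one episode of a tiny corridor world. The `CorridorEnv` class and the `run_episode` helper are hypothetical names invented for this illustration and are not part of Gymnasium; they only imitate the `reset`/`step` calling convention that Gymnasium environments use.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment (not part of Gymnasium).

    The agent starts in cell 0 of a short corridor and must reach the
    last cell. Actions: 0 = move left, 1 = move right.
    """

    def __init__(self, length=5, max_steps=20):
        self.length = length
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position, {}  # observation, info

    def step(self, action):
        move = 1 if action == 1 else -1
        # Walls at both ends keep the agent inside the corridor.
        self.position = min(max(self.position + move, 0), self.length - 1)
        self.steps += 1
        terminated = self.position == self.length - 1  # goal reached
        truncated = (not terminated) and self.steps >= self.max_steps  # out of time
        reward = 1.0 if terminated else 0.0
        return self.position, reward, terminated, truncated, {}


def run_episode(env, policy):
    """Run one episode and return (total_reward, number_of_steps)."""
    observation, info = env.reset()
    total_reward, steps = 0.0, 0
    terminated = truncated = False
    while not (terminated or truncated):
        action = policy(observation)  # the agent chooses an action
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward  # the reward arrives after every step
        steps += 1
    return total_reward, steps


rng = random.Random(0)
print(run_episode(CorridorEnv(), lambda obs: rng.choice([0, 1])))  # random agent
print(run_episode(CorridorEnv(), lambda obs: 1))  # agent that always moves right
```

The policy is passed in as a plain function so that the same loop can drive a random agent, a hand-written rule, or eventually a trained model.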

Gymnasium must be installed with the following command.


In [None]:
!pip install gymnasium[accept-rom-license,atari]

It is important to note that many Gymnasium environments specify that they are not non-deterministic even though they use random numbers to process actions. Based on the Gymnasium GitHub issue tracker, the non-deterministic property means that an environment still behaves randomly even when you give it a consistent seed value. An environment that only draws from a seeded random number generator is therefore still reported as deterministic. The program can seed the random number generator for an environment by passing a seed value to the environment's reset method.

The Gymnasium library allows us to query some of these attributes from environments. I created the following function to query Gymnasium environments.

In [None]:
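import gymnasium as gym

def query_environment(name):
    """Print the key properties of a Gymnasium environment."""
    env = gym.make(name)
    spec = gym.spec(name)
    print(f"Action Space: {env.action_space}")
    print(f"Observation Space: {env.observation_space}")
    print(f"Max Episode Steps: {spec.max_episode_steps}")
    print(f"Nondeterministic: {spec.nondeterministic}")
    # Some Gymnasium releases do not define reward_range, so fall back gracefully.
    reward_range = getattr(env.unwrapped, "reward_range", "not defined")
    print(f"Reward Range: {reward_range}")
    print(f"Reward Threshold: {spec.reward_threshold}")
    env.close()  # release any resources held by the environment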


We will look at the **MountainCar-v0** environment, which challenges an underpowered car to escape the valley between two mountains. The following code describes the Mountain Car environment.

In [None]:
query_environment("MountainCar-v0")

This environment allows three distinct actions: accelerate to the left, do not accelerate, or accelerate to the right. The observation space contains two continuous (floating point) values, as evidenced by the Box object. The observation space is simply the position and velocity of the car. The car has 200 steps to escape for each episode. You would have to look at the code, but the mountain car receives no incremental reward for progress toward the goal. Instead, every step carries a reward of -1, so the only way for the vehicle to improve its total is to escape the valley sooner.

In [None]:
query_environment("CartPole-v1")

The **CartPole-v1** environment challenges the agent to balance a pole on top of a cart while the agent moves the cart. The environment has an observation space of 4 continuous numbers:

* Cart Position
* Cart Velocity
* Pole Angle
* Pole Velocity At Tip

To achieve this goal, the agent can take the following actions:

* Push cart to the left
* Push cart to the right

There is also a continuous variant of the mountain car. This version does not simply have the motor on or off. Instead, the action space is a single floating-point number that specifies how much forward or backward force the car currently applies.

In [None]:
query_environment("MountainCarContinuous-v0")

Gymnasium provides a versatile platform for developing and comparing reinforcement learning algorithms. It supports a wide range of environments, including classic Atari games through the Arcade Learning Environment (ALE) emulator. This integration allows researchers and enthusiasts to access a suite of retro video games originally designed for the Atari 2600 console, using them as benchmarks for AI performance. By interfacing with ALE, Gymnasium users can easily implement their algorithms and test them against the nuanced challenges presented by these games. Each game presents unique scenarios that can help in training algorithms to learn various tasks, making Gymnasium an invaluable tool for advancing the field of artificial intelligence through these interactive and complex environments.

Reinforcement learning (RL) algorithms can receive input from an Atari game in two primary ways, which cater to different aspects of the game's state and complexity.

The first method involves monitoring the game "screen" or the visual output that the game generates. In this approach, the RL algorithm processes the pixels of the game display as its environment's state. This is akin to how a human player would see and interpret the game. The algorithm analyzes the patterns, movements, and changes within the frames to make decisions about the best action to take at each step. This method requires the RL model to handle high-dimensional data and learn to associate visual cues with game outcomes.

The second method is by monitoring the Atari system's RAM. Despite its limited capacity, the RAM of an Atari system contains all the information about the game's internal state, such as the location of objects, player scores, and game status. By tapping into this memory directly, an RL algorithm can access a more compact and less noisy representation of the game state than the pixel data provides. This can be beneficial for learning more efficiently as the system's state is represented in a more structured and lower-dimensional form.

Both methods have their merits. The screen capture approach forces the algorithm to learn directly from visual input, which is a more general approach and closer to how humans play games. On the other hand, the RAM monitoring method can lead to quicker training times and potentially a deeper understanding of the game mechanics, as it bypasses the need to interpret visual data. Choosing between these methods depends on the specific goals and constraints of the RL project at hand.
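The difference in dimensionality is easy to quantify. This sketch assumes the usual Atari observation shapes, a 210 by 160 RGB frame for the screen and 128 bytes for the RAM, and the variable names are only illustrative.

```python
import math

# Assumed observation shapes for an Atari game.
SCREEN_SHAPE = (210, 160, 3)  # height, width, RGB color channels
RAM_SHAPE = (128,)            # bytes of console memory

screen_values = math.prod(SCREEN_SHAPE)
ram_values = math.prod(RAM_SHAPE)

print(f"Values per screen observation: {screen_values}")
print(f"Values per RAM observation: {ram_values}")
print(f"The screen is {screen_values / ram_values:.1f} times larger")
```

Querying the two Breakout environments shows the observation spaces that Gymnasium actually reports, which can be compared against these assumed shapes.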

First, we see how to monitor the screen of the game [Breakout](https://gymnasium.farama.org/environments/atari/breakout/).

In [None]:
query_environment("ALE/Breakout-v5")

Similarly, we can monitor the RAM of Breakout.

In [None]:
query_environment("ALE/Breakout-ram-v5")

## Render Gymnasium Environments from CoLab

It is possible to visualize the game your agent is playing, even on CoLab. This section provides information on generating a video in CoLab that shows you an episode of the game your agent is playing. I based this video process on suggestions found [here](https://colab.research.google.com/drive/1flu31ulJlgiRL1dnN2ir8wGh9p7Zij2t).

Begin by installing **pyvirtualdisplay**, along with the **xvfb** and **ffmpeg** system packages.

In [None]:
# HIDE OUTPUT
!pip install pyvirtualdisplay
!sudo apt-get install -y xvfb ffmpeg

Next, we install the needed requirements to display an Atari game.

In [None]:
# HIDE OUTPUT
!sudo apt-get install xvfb

Note that the above cell may request a runtime restart; if this occurs, please restart the CoLab runtime. Next, we record an episode of the game to a video file and show it by embedding the video in the CoLab notebook.

Now we are ready to play the game.  We use a simple random agent.

In [None]:
import glob
import base64

import gymnasium as gym
from gymnasium.wrappers import RecordVideo
from IPython.display import HTML
from IPython import display as ipythondisplay
from pyvirtualdisplay import Display

# Start a virtual display so rendering works without a physical screen
display = Display(visible=0, size=(1400, 900))
display.start()

# Create Atlantis environment; rgb_array mode returns each frame as an array
env = gym.make('Atlantis-v4', render_mode="rgb_array")
env.metadata['render_fps'] = 30

# Setup the wrapper to record a video of every episode
video_callable = lambda episode_id: True
env = RecordVideo(env, video_folder='./videos', episode_trigger=video_callable)

# Reset through the wrapper so that the recording starts with the episode
env.reset()

# Run the environment until done
terminated = False
truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()  # replace with your own policy!
    obs, reward, terminated, truncated, info = env.step(action)

env.close()  # closing the environment finalizes the video file

# Display the video by embedding it in the notebook as base64 data
with open(glob.glob('videos/*.mp4')[0], 'rb') as video_file:
    video = video_file.read()
encoded = base64.b64encode(video)
ipythondisplay.display(HTML(data='''
    <video width="640" height="480" controls>
        <source src="data:video/mp4;base64,{0}" type="video/mp4" />
    </video>
'''.format(encoded.decode('ascii'))))


You will note that the **step** function returns several values. The **reset** function returns only the first and last of these, the observation and the info dictionary:

* **observation** (ObsType): An element of the environment's observation_space as the next observation due to the agent actions. An example is a numpy array containing the positions and velocities of the pole in CartPole.

* **reward** (SupportsFloat): The reward as a result of taking the action.

* **terminated** (bool): Whether the agent reaches the terminal state (as defined under the MDP of the task), which can be positive or negative. An example is reaching the goal state or moving into the lava in the Sutton and Barto Gridworld. If true, the user needs to call reset().

* **truncated** (bool): Whether the truncation condition outside the scope of the MDP is satisfied. Typically, this is a timelimit, but could also be used to indicate an agent physically going out of bounds. Can be used to end the episode prematurely before a terminal state is reached. If true, the user needs to call reset().

* **info** (dict): Contains auxiliary diagnostic information (helpful for debugging, learning, and logging). This might, for instance, contain: metrics that describe the agent's performance state, variables that are hidden from observations, or individual reward terms that are combined to produce the total reward.