<a href="https://colab.research.google.com/github/jeffheaton/app_deep_learning/blob/main/t81_558_class_12_3_pytorch_reinforce.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-558: Applications of Deep Neural Networks
**Module 12: Reinforcement Learning**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 12 Video Material

* Part 12.1: Introduction to Gymnasium [[Video]](https://www.youtube.com/watch?v=FvuyrpzvwdI&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_12_1_reinforcement.ipynb)
* Part 12.2: Introduction to Q-Learning [[Video]](https://www.youtube.com/watch?v=VKuqvbG_KAw&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_12_2_qlearningreinforcement.ipynb)
* **Part 12.3: Stable Baselines Q-Learning** [[Video]](https://www.youtube.com/watch?v=kl7zsCjULN0&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_12_3_pytorch_reinforce.ipynb)
* Part 12.4: Atari Games with Stable Baselines Neural Networks [[Video]](https://www.youtube.com/watch?v=maLA1_d4pzQ&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_12_4_atari.ipynb)
* Part 12.5: Future of Reinforcement Learning [[Video]](https://www.youtube.com/watch?v=-euo5pTjP8E&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_12_5_rl_future.ipynb)


# Google CoLab Instructions

The following code detects whether the notebook is running in Google CoLab and sets the `COLAB` flag, which later cells use to decide whether to install the required packages.

In [None]:
try:
    from google.colab import drive
    COLAB = True
    print("Note: using Google CoLab")
except ImportError:
    print("Note: not using Google CoLab")
    COLAB = False

# Part 12.3: Stable-Baselines Q-Learning in Gymnasium

As we covered in the previous part, Q-Learning is a robust machine learning algorithm. Unfortunately, Q-Learning requires that the Q-table contain an entry for every possible state that the environment can take. Traditional Q-learning might be a good learning algorithm if the environment only includes a handful of discrete state elements. However, the Q-table can become prohibitively large if the state space is large.
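To make the growth of the Q-table concrete, the following minimal sketch counts table entries when each continuous state variable is split into a fixed number of bins. The helper name `q_table_entries` and the bin counts are hypothetical choices for illustration; the four state dimensions and two actions mirror the cart-pole problem used later in this part.

```python
def q_table_entries(bins_per_dimension: int, state_dimensions: int, actions: int) -> int:
    """Count the Q-values in a table over a uniformly discretized state space."""
    # Every combination of bins is a distinct state, and each state
    # needs one Q-value per action.
    return (bins_per_dimension ** state_dimensions) * actions


# Hypothetical bin counts; 4 state dimensions and 2 actions, as in cart-pole.
for bins in (10, 100, 1000):
    entries = q_table_entries(bins, state_dimensions=4, actions=2)
    print(f"{bins:>5} bins per dimension -> {entries:,} Q-table entries")
```

Even with only four state variables, a finer discretization multiplies the table size by the fourth power of the bin count, which is why a table stops being practical for large state spaces.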

Creating policies for large state spaces is a task that Deep Q-Learning Networks (DQN) can usually handle. Neural networks can generalize these states and learn commonalities. Unlike a table, a neural network does not require the program to represent every combination of state and action. A DQN maps the state to its input neurons and the action Q-values to the output neurons. The DQN effectively becomes a function that accepts the state and suggests action by returning the expected reward for each possible action. Figure 12.DQL demonstrates the DQN structure and mapping between state and action.

**Figure 12.DQL: Deep Q-Learning (DQL)**
![Deep Q-Learning](https://raw.githubusercontent.com/jeffheaton/t81_558_deep_learning/master/images/deepqlearning.png "Reinforcement Learning")
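The mapping from state to per-action Q-values can be sketched in plain Python. This is only an illustration of the structure, assuming a single hidden layer with randomly initialized (untrained) weights; the layer sizes, helper names, and sample state are hypothetical, and a real DQN would learn its weights from experience.

```python
import random


def linear(inputs, weights, biases):
    """Fully connected layer: one output per row of weights."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Rectified linear activation applied element-wise."""
    return [max(0.0, v) for v in values]


def random_layer(n_in, n_out, rng):
    """Create untrained weights and zero biases for one layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(42)
STATE_SIZE, HIDDEN_SIZE, N_ACTIONS = 4, 8, 2  # illustrative sizes
w1, b1 = random_layer(STATE_SIZE, HIDDEN_SIZE, rng)
w2, b2 = random_layer(HIDDEN_SIZE, N_ACTIONS, rng)


def q_values(state):
    """State goes in at the input layer; one Q-value per action comes out."""
    return linear(relu(linear(state, w1, b1)), w2, b2)


def greedy_action(state):
    """Suggest the action whose estimated Q-value is largest."""
    q = q_values(state)
    return max(range(len(q)), key=q.__getitem__)


state = [0.01, -0.02, 0.03, 0.04]  # hypothetical four-element state
print("Q-values:", q_values(state))
print("Greedy action:", greedy_action(state))
```

Note that the network has one output neuron per action, so a single forward pass scores every action at once.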

As this diagram illustrates, the environment state contains several elements. For the basic DQN, the state can be a mix of continuous and categorical/discrete values. The program typically encodes the discrete state elements as dummy variables. The actions should be discrete when your program implements a DQN. Other algorithms support continuous outputs, which we will discuss later in this chapter.
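As a small illustration of dummy-variable encoding, the sketch below builds a network input from one continuous value and two categorical values. The state elements (`speed`, `gear`, `weather`) and their categories are invented for this example and do not come from any particular environment.

```python
GEARS = ("first", "second", "third")   # hypothetical categorical element
WEATHER = ("clear", "rain")            # hypothetical categorical element


def one_hot(value, categories):
    """Encode a categorical value as dummy variables (one slot per category)."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == category else 0.0 for category in categories]


def encode_state(speed, gear, weather):
    """Concatenate continuous values and dummy variables into one input vector."""
    return [float(speed)] + one_hot(gear, GEARS) + one_hot(weather, WEATHER)


encoded = encode_state(12.5, "second", "rain")
print(encoded)  # one continuous value followed by 3 + 2 dummy variables
```

The resulting vector has one input neuron for the continuous value plus one per category, so the input layer in this sketch would need six neurons.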

The [Stable Baselines 3](https://stable-baselines3.readthedocs.io/en/master/) library provides reinforcement learning (RL) algorithms for researchers and practitioners who use PyTorch. It is a rewrite of the original Stable Baselines framework on a PyTorch backend and offers a suite of reliable RL algorithm implementations. Its straightforward API lets both novices and experienced practitioners implement, experiment with, and extend RL methods, and it offers customizable neural network architectures and comprehensive documentation. Because it is built on PyTorch, it integrates with the wider ecosystem of deep learning tools, which supports rapid prototyping and research iteration, whether the goal is to solve discrete control tasks or to handle high-dimensional environments.

## DQN and the Cart-Pole Problem

Barto (1983) first described the cart-pole problem. [[Cite:barto1983neuronlike]](http://www.derongliu.org/adp/adp-cdrom/Barto1983.pdf) A cart is connected to a rigid hinged pole. The cart is free to move only in the vertical plane of the cart/track. The agent can apply an impulsive "left" or "right" force F of a fixed magnitude to the cart at discrete time intervals. The cart-pole environment simulates the physics behind keeping the pole in a reasonably upright position on the cart. The environment has four state variables:
* $x$ The position of the cart on the track.
* $\theta$ The angle of the pole with the vertical.
* $\dot{x}$ The cart velocity.
* $\dot{\theta}$ The rate of change of the angle.

The action space consists of discrete actions:
* Apply force left
* Apply force right

First, we must install Stable Baselines.


In [None]:
# HIDE OUTPUT
if COLAB:
  !pip install stable-baselines3[extra] gymnasium
  !pip install gymnasium[accept-rom-license,atari]
  !pip install pyvirtualdisplay
  !sudo apt-get install -y xvfb python-opengl ffmpeg

### The Cartpole Environment
In the Cartpole environment:

-   `observation` is an array of 4 floats:
    -   the position and velocity of the cart
    -   the angular position and velocity of the pole
-   `reward` is a scalar float value
-   `action` is a scalar integer with only two possible values:
    -   `0` — "move left"
    -   `1` — "move right"
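To show how an observation and an action fit together, here is a hand-written baseline policy that is not part of the original notebook. It assumes the observation order is cart position, cart velocity, pole angle, pole angular velocity, and that a positive pole angle means the pole leans to the right; the function name `heuristic_action` is hypothetical.

```python
def heuristic_action(observation):
    """Push the cart toward the side the pole is leaning to.

    Assumes observation = [cart position, cart velocity,
    pole angle, pole angular velocity] and that a positive
    angle means the pole leans right.
    """
    cart_position, cart_velocity, pole_angle, pole_angular_velocity = observation
    return 1 if pole_angle > 0 else 0  # 1 = "move right", 0 = "move left"


# Hypothetical observations: pole leaning slightly right, then slightly left.
print(heuristic_action([0.0, 0.0, 0.05, 0.0]))
print(heuristic_action([0.0, 0.0, -0.05, 0.0]))
```

A rule like this ignores the velocities entirely, which is one reason a learned policy can do better than a simple hand-coded one.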


In [None]:
import gymnasium as gym

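# Create and initialize the CartPole environment
env = gym.make('CartPole-v1', render_mode="rgb_array")

# reset() returns a tuple: (observation, info)
time_step = env.reset()
print('Time step:')
print(time_step)

# Action 1 pushes the cart to the right
action = 1

# step() returns (observation, reward, terminated, truncated, info)
next_time_step = env.step(action)
print('Next time step:')
print(next_time_step)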


We can also visualize this environment.

In [None]:
from PIL import Image

# Render the environment's state to a numpy array
frame = env.render()

# Convert the numpy array to an image and display it
image = Image.fromarray(frame)

# Don't forget to close the environment when you're done!
env.close()

display(image)

The goal is to move the above cart without causing the pole to fall over.

## Training the agent

We will use Stable-Baselines3 to train an agent for this environment. The code below trains with the library's PPO (Proximal Policy Optimization) algorithm and the `MlpPolicy`, which uses a Multi-Layer Perceptron (MLP), a standard feedforward neural network.


In [None]:
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

# Create the CartPole environment
env = make_vec_env('CartPole-v1', n_envs=1)

# Instantiate the agent
model = PPO('MlpPolicy', env, verbose=1)

# Train the agent
model.learn(total_timesteps=10000)

# Save the agent
model.save("ppo_cartpole")

# Create a fresh environment for evaluation
eval_env = gym.make('CartPole-v1')

# Evaluate the agent
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10)

print(f"Mean reward: {mean_reward} +/- {std_reward}")
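The mean and standard deviation reported above summarize the total reward of each evaluation episode. The sketch below reproduces that summary by hand for a list of made-up episode rewards; it assumes the population standard deviation (the NumPy default), and the reward values are invented for illustration rather than taken from a real run.

```python
import statistics

# Hypothetical total rewards from five evaluation episodes (not real results).
episode_rewards = [500.0, 432.0, 500.0, 287.0, 500.0]

# Average episode reward and its spread across episodes.
example_mean = statistics.fmean(episode_rewards)
example_std = statistics.pstdev(episode_rewards)  # population standard deviation

print(f"Mean reward: {example_mean:.1f} +/- {example_std:.1f}")
```

A large standard deviation relative to the mean suggests the agent behaves inconsistently from episode to episode, so evaluating over more episodes gives a more trustworthy estimate.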

## Videos

We can easily visualize the cart-pole agent in a video.

In [None]:
from stable_baselines3.common.vec_env import VecVideoRecorder, DummyVecEnv
import base64
from IPython import display as ipythondisplay
from pathlib import Path

# Record the agent playing
video_folder = 'videos'  # relative path, so it is writable outside CoLab as well
video_length = 1500

env = make_vec_env('CartPole-v1', n_envs=1)
env = VecVideoRecorder(env, video_folder,
                       record_video_trigger=lambda x: x == 0, video_length=video_length,
                       name_prefix="ppo-cartpole")

obs = env.reset()
for _ in range(video_length):
    action, _ = model.predict(obs, deterministic=True)
    obs, _, _, _ = env.step(action)

# Close the environment and video recorder
env.close()

# Display the video
video_path = Path(video_folder) / "ppo-cartpole-step-0-to-step-1500.mp4"
video = video_path.read_bytes()
encoded = base64.b64encode(video)

ipythondisplay.display(ipythondisplay.HTML(data=f'<video alt="test" autoplay loop controls style="height: 400px;">'
                                        f'<source src="data:video/mp4;base64,{encoded.decode()}" type="video/mp4" />'
                                        f'</video>'))
