# ENN585 - Advanced Machine Learning - Week 4

Welcome to Week 4 of ENN585!


Implementing reinforcement learning algorithms yourself from scratch can be insightful and educational, especially for the basic algorithms. However, implementing more advanced algorithms can be quite complex, and small implementation details that are often not described in the papers can have a large impact on performance. Therefore, it is often more practical to use existing libraries.

This week's notebook lets you explore one of those libraries, called Stable-Baselines3, or SB3 for short. It is a popular library for reinforcement learning and contains implementations of many state-of-the-art algorithms.

You can access the documentation for SB3 [here](https://stable-baselines3.readthedocs.io/en/master/). This [blogpost](https://araffin.github.io/post/sb3/) is a good resource too, explaining the motivation behind the library and giving a short overview.

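In this notebook, we will use SB3 to train a reinforcement learning agent to solve the CartPole environment. CartPole is a classic control problem, where the goal is to balance a pole on a cart. The agent receives a reward of +1 for every time step the pole stays upright. In `CartPole-v1`, the version used below, an episode is cut off after 500 time steps and the registered reward threshold for solving it is 475 (the older `CartPole-v0` stops at 200 time steps).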




## Install and Setup

For this notebook you will need the [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/index.html) library, which you can install with `pip install stable-baselines3`.

In [None]:
#@title Install packages - (Run this once at the start)

try:
    import gymnasium as gym
    gym.spec('FetchSlide-v2')  # raises if the robotics environments are not registered
except Exception:
    !pip install gymnasium-robotics
    import gymnasium as gym

try:
    import stable_baselines3
except ImportError:
    !pip install stable-baselines3

## First Steps

Let's start by training an actor-critic agent using the Advantage Actor-Critic (A2C) algorithm.

A2C is a popular reinforcement learning algorithm based on the policy gradient method. It is an on-policy algorithm, meaning that it learns only from data collected with the current policy. The algorithm uses a value function (the critic) to estimate the expected return of a state, and a policy (the actor) to select actions. The actor and the critic are trained together, and the algorithm uses the advantage, i.e. how much better an action turned out than the critic expected for that state, to decide how strongly to reinforce each action.
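
To make the advantage idea concrete, here is a minimal sketch in plain Python. It is not SB3's implementation (SB3's A2C uses the more general generalized advantage estimation); the function names and the three-step rollout are made up for illustration.

```python
def discounted_returns(rewards, bootstrap_value, gamma=0.99):
    """Return the discounted return G_t for every step of a short rollout.

    `bootstrap_value` is the critic's estimate for the state that follows the
    last step (use 0.0 if the episode terminated there).
    """
    returns = []
    running = bootstrap_value
    # Walk backwards so each step can reuse the return of the step after it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Advantage A_t = G_t - V(s_t): how much better the rollout went than the critic expected."""
    returns = discounted_returns(rewards, bootstrap_value, gamma)
    return [g - v for g, v in zip(returns, values)]


# Hypothetical three-step rollout that ends in termination; all numbers are made up.
rewards = [1.0, 1.0, 1.0]
values = [2.5, 2.0, 1.0]  # critic estimates V(s_0), V(s_1), V(s_2)
adv = advantages(rewards, values, bootstrap_value=0.0, gamma=1.0)
print(adv)  # [0.5, 0.0, 0.0]
```

A positive advantage (the first step here) means the action led to a better outcome than the critic predicted, so the actor update makes that action more likely; a negative advantage makes it less likely.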

In [None]:
import gymnasium as gym
from stable_baselines3 import A2C
import numpy as np

# Create the environment
env = gym.make("CartPole-v1", render_mode="rgb_array")

# Creating an RL agent is easy; notice how the environment is passed to the agent
model = A2C("MlpPolicy", env)

# Train the agent, this should take around a minute
print('Training ...')
model.learn(total_timesteps=20_000)
print('Done.')

After training the agent, let's see it in action and test it out. 

Notice how we can call `model.predict()` to get the action that the agent wants to take in a given state. 

In [None]:
# we can re-use the environment from above, or just create a new one
env = gym.make("CartPole-v1", render_mode="rgb_array")

total_rewards = []

# we will perform 10 test runs
for run in range(10):

    # make sure to reset the environment before each run
    observation, info = env.reset()
    total_reward = 0

    while True:
        # we can use the model to predict the next action
        action, _ = model.predict(observation)
        
        # and then apply the action to the environment
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        env.render()

        # we stop the loop if we terminate or run out of time (truncated)
        if terminated or truncated:
            break

    print(f'Agent {run} achieved a total reward of: {total_reward}')
    total_rewards.append(total_reward)


print(f'Average reward over {run+1} runs: {np.mean(total_rewards)}')

A useful feature is to generate a .gif that shows the agent in action. 

Notice how we use the `env.render()` function to render the environment into an image, save all the images into a list, and then use the `imageio` library to save the list of images into a .gif file.

In [None]:
import imageio

env = gym.make("CartPole-v1", render_mode="rgb_array")
observation, info = env.reset()
images = []

while True:
    # we can use the model to predict the next action
    action, _ = model.predict(observation)
    
    # and then apply the action to the environment
    observation, reward, terminated, truncated, info = env.step(action)

    # render the environment and store the image in a list
    img = env.render()
    images.append(img)

    # we stop the loop if we terminate or run out of time (truncated)
    if terminated or truncated:
        break

# now we can generate a .gif animation from all the images
imageio.mimsave("cartpole_a2c.gif", [np.array(img) for img in images], fps=29)

# we can also visualise the gif in the notebook
from IPython.display import Image
display(Image('cartpole_a2c.gif'))

## Diving Deeper into SB3

### Changing the Network Architectures 

SB3 provides default architectures for its policy and value networks. However, it is possible to change the architecture of the networks in a simple way. 

This [documentation page](https://stable-baselines3.readthedocs.io/en/master/guide/custom_policy.html) provides details on how to create custom network architectures.
Have a look at how to do this, especially for the actor-critic agents that need two networks (see [here](https://stable-baselines3.readthedocs.io/en/master/guide/custom_policy.html#on-policy-algorithms)).


### Speeding Up Training with Parallel Environments

Some algorithms (e.g. A2C but also PPO or SAC) can be parallelized to speed up training. This is done by letting the agent collect data from several copies of the environment at once, a so-called vectorized environment. In SB3, `DummyVecEnv` steps the copies one after another in a single process, while `SubprocVecEnv` runs each copy in its own worker process.

The code cell below demonstrates how to use parallel environments in SB3. Compare the timing of training with and without them.
 

In [None]:
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

# train A2C using a single environment
env = gym.make("CartPole-v1", render_mode="rgb_array")
model = A2C("MlpPolicy", env)
print('Training A2C with a single environment...')
%time model.learn(total_timesteps=10_000)

# train A2C using 8 environments (make_vec_env defaults to DummyVecEnv, which steps them in a single process)
env = make_vec_env("CartPole-v1", n_envs=8)
model = A2C("MlpPolicy", env)
print('\nTraining A2C with 8 parallel environments...')
%time model.learn(total_timesteps=10_000)

# train A2C using 8 environments, each in its own process (SubprocVecEnv), with the model forced onto the CPU
env = make_vec_env("CartPole-v1", n_envs=8, vec_env_cls=SubprocVecEnv)
model = A2C("MlpPolicy", env, device="cpu")
print('\nTraining A2C with 8 parallel environments on the CPU ...')
%time model.learn(total_timesteps=10_000)

### Compare the Performance of Different Algorithms

**YOUR TURN!**

Now that you have seen how to train an A2C agent, try training agents using different algorithms.

When choosing algorithms, be aware that some are designed for discrete action spaces, while others are designed for continuous action spaces. The CartPole environment has a discrete action space, so you have to choose algorithms that are designed for discrete action spaces.

Task:
- Train 3 agents, using 3 different algorithms: A2C, PPO, and DQN.
- Compare the performance of the agents by evaluating each of them over 10 runs and reporting its average total reward.
- Change the number of training steps and the number of parallel environments to see how they affect the training time and the performance of the agents.

In [None]:
# YOUR TURN!

# Modify the code cells above to train and evaluate agents using three different algorithms.

## Beyond Discrete Action Spaces

The CartPole environment has a discrete action space, but many environments have continuous action spaces. This includes the FetchSlide environment you are using for your Assessment 1 project.

**YOUR TURN**

Explore the algorithms in SB3 that are designed for continuous action spaces. Train and compare agents using these algorithms on a continuous action space environment.


In [None]:
# YOUR TURN!
