# Acrobot Study
### This notebook is a study of the Acrobot environment from the Gym library
### The algorithm library is StableBaselines3

*This notebook was created in Jupyter Notebooks*

Environment: [Acrobot](https://www.gymlibrary.dev/environments/classic_control/acrobot/)

RL-Library: [StableBaselines3](https://stable-baselines3.readthedocs.io/en/master/)

In [1]:
# import required libraries
import gym

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import (DummyVecEnv, VecMonitor)
# video_save_utility is a local Python file containing useful scripts such as an mp4 video generator
import video_save_utility

### Acrobot-v1:
The goal of this simulation is to swing the free end of the two-link chain above the target line. The only actuated joint on the model is the one between the two links, and the agent applies torque to this joint so that the links swing above the target.

Theta1 is the angle of the first joint, where 0 indicates that the first link points directly down. Theta2 is measured relative to the angle of the first link, where an angle of 0 indicates that the two links are aligned. The angular velocities of theta1 and theta2 are bounded at +/-4pi and +/-9pi respectively.

All steps that do not reach the goal incur a reward of -1. Achieving the goal results in termination with a reward of 0. The reward threshold is -100.
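
The reward scheme above can be turned into a small calculation of the undiscounted episode return. This is a minimal sketch, not part of the original notebook: the helper name `episode_return` is hypothetical, and the 500-step time limit is an assumption about the default `Acrobot-v1` registration.

```python
# Assumption: Acrobot-v1 episodes are cut off after 500 steps by default.
MAX_EPISODE_STEPS = 500


def episode_return(steps_to_goal=None):
    """Undiscounted return of one episode.

    steps_to_goal is the 1-based step on which the goal is reached,
    or None if the goal is never reached before the time limit.
    """
    if steps_to_goal is None or steps_to_goal > MAX_EPISODE_STEPS:
        # Every step is penalized with -1 until the time limit.
        return -float(MAX_EPISODE_STEPS)
    # All steps before the final one cost -1; the goal step itself gives 0.
    return -float(steps_to_goal - 1)


print(episode_return(82))    # goal reached on step 82
print(episode_return(None))  # goal never reached
```

Under these assumptions, an agent has to reach the goal within about 100 steps on average to meet the reward threshold of -100.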

Observation Space (ndarray of size 6 that provides data about the rotational joint angles and their angular velocity):
- Cosine of theta1 (-1 <= cos(theta1) <= 1)
- Sine of theta1 (-1 <= sin(theta1) <= 1)
- Cosine of theta2 (-1 <= cos(theta2) <= 1)
- Sine of theta2 (-1 <= sin(theta2) <= 1)
- Angular velocity of theta1
- Angular velocity of theta2
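
The observation layout above can be decoded back into joint angles. The sketch below is illustrative only: the function names are hypothetical, both links are assumed to have length 1, and the goal test `-cos(theta1) - cos(theta1 + theta2) > 1.0` is the termination condition given in the Gym documentation for Acrobot.

```python
import math


def angles_from_observation(obs):
    """Recover (theta1, theta2) in radians from the cos/sin entries."""
    cos1, sin1, cos2, sin2 = obs[:4]
    return math.atan2(sin1, cos1), math.atan2(sin2, cos2)


def tip_height(obs):
    """Height of the free end relative to the fixed pivot (unit link lengths)."""
    theta1, theta2 = angles_from_observation(obs)
    return -math.cos(theta1) - math.cos(theta1 + theta2)


def reached_goal(obs):
    """True when the free end is more than one link length above the pivot."""
    return tip_height(obs) > 1.0


hanging = [1.0, 0.0, 1.0, 0.0, 0.0, 0.0]   # both links pointing straight down
upright = [-1.0, 0.0, 1.0, 0.0, 0.0, 0.0]  # both links pointing straight up
print(tip_height(hanging), reached_goal(hanging))
print(tip_height(upright), reached_goal(upright))
```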

The action space includes the following:
- Apply -1 torque to actuated joint
- Apply 0 torque to actuated joint
- Apply 1 torque to actuated joint

In [2]:
# create a gym environment
env = gym.make("Acrobot-v1")
# reset the gym environment
observation = env.reset()

In [3]:
print("Observation Space:")
# prints out the shape of our observation space
print("Shape: {}".format(observation.shape))
# prints out a random sample from our observation space
print("Sample: {}".format(env.observation_space.sample()))

Observation Space:
Shape: (6,)
Sample: [ 0.8083192  -0.6782452  -0.39639205  0.13636802  0.8217263  22.775116  ]


In [4]:
print("Action Space:")
# prints out the number of discrete actions in our action space
print("Shape: {}".format(env.action_space.n))
# prints out a random sample from our action space
print("Sample: {}".format(env.action_space.sample()))

Action Space:
Shape: 3
Sample: 0


---
For our training we want to use a vectorized environment so that each update draws on more diverse experience. This method runs multiple independent copies of the same environment and collects one transition from every copy at each step. Gymnasium describes a roughly linear speedup in steps taken when the sub-environments are sampled at the same time ([Gymnasium Vectors](https://gymnasium.farama.org/api/vector/)). Note that `make_vec_env` uses `DummyVecEnv` by default, which steps the copies one after another in a single process.


In [5]:
v_env = make_vec_env('Acrobot-v1', n_envs = 16)

Now the vectorized environment returns a batch of 16 observation vectors (an array of shape 16x6) from `reset` and `step` instead of a single observation vector.
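
To make the batching idea concrete without depending on any RL library, here is a toy sketch. `ToyEnv` and `ToyVecEnv` are hypothetical stand-ins written for this illustration; they are not StableBaselines3 classes and only mimic the sequential stepping style of a dummy vectorized environment.

```python
import random


class ToyEnv:
    """Hypothetical stand-in that produces a 6-element observation."""

    def __init__(self, seed):
        self.rng = random.Random(seed)

    def reset(self):
        return [self.rng.uniform(-0.1, 0.1) for _ in range(6)]


class ToyVecEnv:
    """Holds several environments and resets them one after another."""

    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        # One observation per sub-environment, stacked into a batch.
        return [env.reset() for env in self.envs]


vec = ToyVecEnv([lambda s=s: ToyEnv(s) for s in range(16)])
batch = vec.reset()
print(len(batch), len(batch[0]))  # 16 6
```

Each sub-environment has its own seed here, so the 16 rows differ, which is the source of the extra diversity in a training batch.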

---

### PPO: Proximal Policy Optimization

PPO combines ideas from A2C (multiple workers) and TRPO (a trust region to improve the actor) ([sb3](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#example%5D)). The main idea is that after an update, the new policy should not be too far from the old policy. According to the authors, the method alternates between sampling data through interaction with the environment and optimizing a "surrogate" objective function using stochastic gradient ascent ([arxiv.org](https://arxiv.org/abs/1707.06347)).
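
The "not too far from the old policy" idea is usually expressed through the clipped surrogate objective from the PPO paper. The function below is a minimal per-sample sketch of that objective, not the StableBaselines3 implementation; the name `clipped_surrogate` is hypothetical, and `clip_range=0.2` is taken from the `clip_range` value shown in the training log of this notebook.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio is pi_new(a|s) / pi_old(a|s); advantage is the estimated advantage.
    """
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


# A good action (positive advantage) stops being rewarded once the ratio
# moves more than clip_range above 1.
print(clipped_surrogate(1.5, 1.0))
# A bad action (negative advantage) keeps its full penalty when the ratio grows.
print(clipped_surrogate(1.5, -1.0))
```

Taking the minimum makes the objective a pessimistic bound, so the optimizer gains nothing from pushing the probability ratio outside the clip range.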

##### Hyperparameters:

Reinforcement learning is highly sensitive to hyperparameters, and PPO has several that we can tune. Our inputs are a vector rather than image frames of a game, so we use an MlpPolicy (multilayer perceptron). RLZoo has already tuned hyperparameters for this gym environment, and we can use their parameters as a starting point ([RLZoo](https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/hyperparams/ppo.yml)).
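
Two of the hyperparameters used in this notebook, `gamma = 0.99` and `gae_lambda = 0.94`, enter through Generalized Advantage Estimation. The sketch below shows the standard GAE recursion in plain Python. It is an illustration under simplifying assumptions (a single trajectory segment with no episode boundary inside it), and the function name `gae_advantages` is hypothetical rather than a library call.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, gae_lambda=0.94):
    """Generalized Advantage Estimation for one trajectory segment.

    rewards[t] and values[t] belong to step t; last_value is the value
    estimate of the state after the final step (0.0 if it is terminal).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * gae_lambda * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# Three Acrobot-style steps: two penalized steps, then the goal step.
print(gae_advantages([-1.0, -1.0, 0.0], [0.0, 0.0, 0.0], last_value=0.0))
```

A `gae_lambda` close to 1 relies more on observed returns (higher variance), while a smaller value relies more on the value function (higher bias).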

In [6]:
mlp = 'MlpPolicy'
# learning rate
lr = 0.0003
# number of steps to run in each environment per update (rollout length)
n_steps = 256
# number of epochs when optimizing surrogate loss
n_epochs = 4
# Discount factor
gamma = 0.99
# Factor for tradeoff of bias vs variance for Generalized Advantage Estimator
gae_lambda = 0.94
# Entropy coefficient
ent_coef = 0.0
# total number of timesteps to train the agent
n_timesteps = 1_000_000


In [7]:
# define a model using our above hyperparameters
model = PPO(policy = mlp,
            env = v_env,
            learning_rate = lr,
            n_steps = n_steps,
            n_epochs = n_epochs,
            gamma = gamma,
            gae_lambda = gae_lambda,
            ent_coef = ent_coef,
            verbose = 1
           )

Using cuda device


Next we train the agent. This process can be time-consuming. We train for 1 million timesteps, matching the recommended hyperparameters from RLZoo.

In [8]:
model.learn(total_timesteps=n_timesteps)

-----------------------------
| time/              |      |
|    fps             | 1020 |
|    iterations      | 1    |
|    time_elapsed    | 4    |
|    total_timesteps | 4096 |
-----------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 500          |
|    ep_rew_mean          | -500         |
| time/                   |              |
|    fps                  | 1245         |
|    iterations           | 2            |
|    time_elapsed         | 6            |
|    total_timesteps      | 8192         |
| train/                  |              |
|    approx_kl            | 0.0020780861 |
|    clip_fraction        | 0.000549     |
|    clip_range           | 0.2          |
|    entropy_loss         | -1.1         |
|    explained_variance   | 5.37e-05     |
|    learning_rate        | 0.0003       |
|    loss                 | 12.4         |
|    n_updates            | 4            |

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 427         |
|    ep_rew_mean          | -427        |
| time/                   |             |
|    fps                  | 1633        |
|    iterations           | 11          |
|    time_elapsed         | 27          |
|    total_timesteps      | 45056       |
| train/                  |             |
|    approx_kl            | 0.007000312 |
|    clip_fraction        | 0.0281      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.978      |
|    explained_variance   | 0.135       |
|    learning_rate        | 0.0003      |
|    loss                 | 47.1        |
|    n_updates            | 40          |
|    policy_gradient_loss | -0.00248    |
|    value_loss           | 67.6        |
-----------------------------------------
----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 370     

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 100          |
|    ep_rew_mean          | -99.3        |
| time/                   |              |
|    fps                  | 1701         |
|    iterations           | 21           |
|    time_elapsed         | 50           |
|    total_timesteps      | 86016        |
| train/                  |              |
|    approx_kl            | 0.0026426257 |
|    clip_fraction        | 0.0232       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.548       |
|    explained_variance   | 0.912        |
|    learning_rate        | 0.0003       |
|    loss                 | 19.7         |
|    n_updates            | 80           |
|    policy_gradient_loss | -0.00299     |
|    value_loss           | 28.1         |
------------------------------------------

-------------------------------------------
| rollout/                |               |
|    ep_len_mean          | 92.8          |
|    ep_rew_mean          | -91.8         |
| time/                   |               |
|    fps                  | 1697          |
|    iterations           | 31            |
|    time_elapsed         | 74            |
|    total_timesteps      | 126976        |
| train/                  |               |
|    approx_kl            | 0.00083502196 |
|    clip_fraction        | 0.00793       |
|    clip_range           | 0.2           |
|    entropy_loss         | -0.295        |
|    explained_variance   | 0.862         |
|    learning_rate        | 0.0003        |
|    loss                 | 10.1          |
|    n_updates            | 120           |
|    policy_gradient_loss | -0.000212     |
|    value_loss           | 27.5          |
-------------------------------------------

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 82           |
|    ep_rew_mean          | -81          |
| time/                   |              |
|    fps                  | 1716         |
|    iterations           | 41           |
|    time_elapsed         | 97           |
|    total_timesteps      | 167936       |
| train/                  |              |
|    approx_kl            | 0.0011315494 |
|    clip_fraction        | 0.0109       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.185       |
|    explained_variance   | 0.922        |
|    learning_rate        | 0.0003       |
|    loss                 | 5.94         |
|    n_updates            | 160          |
|    policy_gradient_loss | -0.00106     |
|    value_loss           | 16.2         |
------------------------------------------

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 84.8         |
|    ep_rew_mean          | -83.8        |
| time/                   |              |
|    fps                  | 1721         |
|    iterations           | 51           |
|    time_elapsed         | 121          |
|    total_timesteps      | 208896       |
| train/                  |              |
|    approx_kl            | 0.0012385685 |
|    clip_fraction        | 0.00824      |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.155       |
|    explained_variance   | 0.923        |
|    learning_rate        | 0.0003       |
|    loss                 | 11.3         |
|    n_updates            | 200          |
|    policy_gradient_loss | -0.00115     |
|    value_loss           | 15.5         |
------------------------------------------

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 84.7        |
|    ep_rew_mean          | -83.7       |
| time/                   |             |
|    fps                  | 1730        |
|    iterations           | 61          |
|    time_elapsed         | 144         |
|    total_timesteps      | 249856      |
| train/                  |             |
|    approx_kl            | 0.000761994 |
|    clip_fraction        | 0.0112      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.108      |
|    explained_variance   | 0.935       |
|    learning_rate        | 0.0003      |
|    loss                 | 7.21        |
|    n_updates            | 240         |
|    policy_gradient_loss | -0.000467   |
|    value_loss           | 15.5        |
-----------------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 83.4

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 83.9         |
|    ep_rew_mean          | -82.9        |
| time/                   |              |
|    fps                  | 1715         |
|    iterations           | 71           |
|    time_elapsed         | 169          |
|    total_timesteps      | 290816       |
| train/                  |              |
|    approx_kl            | 0.0005182517 |
|    clip_fraction        | 0.0069       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.107       |
|    explained_variance   | 0.943        |
|    learning_rate        | 0.0003       |
|    loss                 | 3.94         |
|    n_updates            | 280          |
|    policy_gradient_loss | -0.000624    |
|    value_loss           | 12.4         |
------------------------------------------

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 83           |
|    ep_rew_mean          | -82          |
| time/                   |              |
|    fps                  | 1693         |
|    iterations           | 81           |
|    time_elapsed         | 195          |
|    total_timesteps      | 331776       |
| train/                  |              |
|    approx_kl            | 0.0013123925 |
|    clip_fraction        | 0.012        |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.0901      |
|    explained_variance   | 0.92         |
|    learning_rate        | 0.0003       |
|    loss                 | 6.56         |
|    n_updates            | 320          |
|    policy_gradient_loss | -0.00116     |
|    value_loss           | 16.7         |
------------------------------------------

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 84.9         |
|    ep_rew_mean          | -83.9        |
| time/                   |              |
|    fps                  | 1674         |
|    iterations           | 91           |
|    time_elapsed         | 222          |
|    total_timesteps      | 372736       |
| train/                  |              |
|    approx_kl            | 0.0010613964 |
|    clip_fraction        | 0.012        |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.0789      |
|    explained_variance   | 0.932        |
|    learning_rate        | 0.0003       |
|    loss                 | 3.92         |
|    n_updates            | 360          |
|    policy_gradient_loss | -0.000704    |
|    value_loss           | 12.3         |
------------------------------------------

-------------------------------------------
| rollout/                |               |
|    ep_len_mean          | 81.2          |
|    ep_rew_mean          | -80.2         |
| time/                   |               |
|    fps                  | 1661          |
|    iterations           | 100           |
|    time_elapsed         | 246           |
|    total_timesteps      | 409600        |
| train/                  |               |
|    approx_kl            | 0.00069126755 |
|    clip_fraction        | 0.00458       |
|    clip_range           | 0.2           |
|    entropy_loss         | -0.0828       |
|    explained_variance   | 0.946         |
|    learning_rate        | 0.0003        |
|    loss                 | 2.65          |
|    n_updates            | 396           |
|    policy_gradient_loss | 0.0008        |
|    value_loss           | 12.6          |
-------------------------------------------

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 85.7         |
|    ep_rew_mean          | -84.7        |
| time/                   |              |
|    fps                  | 1661         |
|    iterations           | 110          |
|    time_elapsed         | 271          |
|    total_timesteps      | 450560       |
| train/                  |              |
|    approx_kl            | 0.0005876682 |
|    clip_fraction        | 0.00653      |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.0686      |
|    explained_variance   | 0.925        |
|    learning_rate        | 0.0003       |
|    loss                 | 9.2          |
|    n_updates            | 436          |
|    policy_gradient_loss | -0.00088     |
|    value_loss           | 14.7         |
------------------------------------------

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 80           |
|    ep_rew_mean          | -79          |
| time/                   |              |
|    fps                  | 1659         |
|    iterations           | 120          |
|    time_elapsed         | 296          |
|    total_timesteps      | 491520       |
| train/                  |              |
|    approx_kl            | 0.0010040753 |
|    clip_fraction        | 0.0109       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.0697      |
|    explained_variance   | 0.964        |
|    learning_rate        | 0.0003       |
|    loss                 | 5.94         |
|    n_updates            | 476          |
|    policy_gradient_loss | -0.000696    |
|    value_loss           | 7.64         |
------------------------------------------

-------------------------------------------
| rollout/                |               |
|    ep_len_mean          | 79.8          |
|    ep_rew_mean          | -78.8         |
| time/                   |               |
|    fps                  | 1661          |
|    iterations           | 130           |
|    time_elapsed         | 320           |
|    total_timesteps      | 532480        |
| train/                  |               |
|    approx_kl            | 0.00084755605 |
|    clip_fraction        | 0.0135        |
|    clip_range           | 0.2           |
|    entropy_loss         | -0.0691       |
|    explained_variance   | 0.967         |
|    learning_rate        | 0.0003        |
|    loss                 | 5             |
|    n_updates            | 516           |
|    policy_gradient_loss | -0.000287     |
|    value_loss           | 7.59          |
-------------------------------------------

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 84.2         |
|    ep_rew_mean          | -83.2        |
| time/                   |              |
|    fps                  | 1666         |
|    iterations           | 140          |
|    time_elapsed         | 344          |
|    total_timesteps      | 573440       |
| train/                  |              |
|    approx_kl            | 0.0008666402 |
|    clip_fraction        | 0.00854      |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.0693      |
|    explained_variance   | 0.889        |
|    learning_rate        | 0.0003       |
|    loss                 | 8.46         |
|    n_updates            | 556          |
|    policy_gradient_loss | -0.000426    |
|    value_loss           | 23.3         |
------------------------------------------

-------------------------------------------
| rollout/                |               |
|    ep_len_mean          | 84.3          |
|    ep_rew_mean          | -83.3         |
| time/                   |               |
|    fps                  | 1658          |
|    iterations           | 149           |
|    time_elapsed         | 368           |
|    total_timesteps      | 610304        |
| train/                  |               |
|    approx_kl            | 0.00035786052 |
|    clip_fraction        | 0.00623       |
|    clip_range           | 0.2           |
|    entropy_loss         | -0.0626       |
|    explained_variance   | 0.924         |
|    learning_rate        | 0.0003        |
|    loss                 | 6.46          |
|    n_updates            | 592           |
|    policy_gradient_loss | -6.05e-06     |
|    value_loss           | 16.3          |
-------------------------------------------

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 83.5         |
|    ep_rew_mean          | -82.5        |
| time/                   |              |
|    fps                  | 1653         |
|    iterations           | 159          |
|    time_elapsed         | 393          |
|    total_timesteps      | 651264       |
| train/                  |              |
|    approx_kl            | 0.0008648481 |
|    clip_fraction        | 0.0104       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.0666      |
|    explained_variance   | 0.902        |
|    learning_rate        | 0.0003       |
|    loss                 | 8.39         |
|    n_updates            | 632          |
|    policy_gradient_loss | -0.000924    |
|    value_loss           | 20.8         |
------------------------------------------

-------------------------------------------
| rollout/                |               |
|    ep_len_mean          | 85.2          |
|    ep_rew_mean          | -84.2         |
| time/                   |               |
|    fps                  | 1651          |
|    iterations           | 168           |
|    time_elapsed         | 416           |
|    total_timesteps      | 688128        |
| train/                  |               |
|    approx_kl            | 0.00048260196 |
|    clip_fraction        | 0.00763       |
|    clip_range           | 0.2           |
|    entropy_loss         | -0.0551       |
|    explained_variance   | 0.886         |
|    learning_rate        | 0.0003        |
|    loss                 | 10.9          |
|    n_updates            | 668           |
|    policy_gradient_loss | -5.75e-05     |
|    value_loss           | 23.4          |
-------------------------------------------

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 79.7         |
|    ep_rew_mean          | -78.7        |
| time/                   |              |
|    fps                  | 1645         |
|    iterations           | 178          |
|    time_elapsed         | 443          |
|    total_timesteps      | 729088       |
| train/                  |              |
|    approx_kl            | 0.0009137723 |
|    clip_fraction        | 0.008        |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.0563      |
|    explained_variance   | 0.927        |
|    learning_rate        | 0.0003       |
|    loss                 | 5            |
|    n_updates            | 708          |
|    policy_gradient_loss | -0.000433    |
|    value_loss           | 15.6         |
------------------------------------------

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 85           |
|    ep_rew_mean          | -84          |
| time/                   |              |
|    fps                  | 1638         |
|    iterations           | 187          |
|    time_elapsed         | 467          |
|    total_timesteps      | 765952       |
| train/                  |              |
|    approx_kl            | 0.0007498831 |
|    clip_fraction        | 0.0069       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.0447      |
|    explained_variance   | 0.862        |
|    learning_rate        | 0.0003       |
|    loss                 | 9.38         |
|    n_updates            | 744          |
|    policy_gradient_loss | -0.000688    |
|    value_loss           | 26.5         |
------------------------------------------

-------------------------------------------
| rollout/                |               |
|    ep_len_mean          | 90            |
|    ep_rew_mean          | -89           |
| time/                   |               |
|    fps                  | 1637          |
|    iterations           | 197           |
|    time_elapsed         | 492           |
|    total_timesteps      | 806912        |
| train/                  |               |
|    approx_kl            | 0.00045041053 |
|    clip_fraction        | 0.00464       |
|    clip_range           | 0.2           |
|    entropy_loss         | -0.0401       |
|    explained_variance   | 0.871         |
|    learning_rate        | 0.0003        |
|    loss                 | 9.57          |
|    n_updates            | 784           |
|    policy_gradient_loss | -0.00014      |
|    value_loss           | 21.2          |
-------------------------------------------

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 82.7        |
|    ep_rew_mean          | -81.7       |
| time/                   |             |
|    fps                  | 1636        |
|    iterations           | 207         |
|    time_elapsed         | 518         |
|    total_timesteps      | 847872      |
| train/                  |             |
|    approx_kl            | 0.000657188 |
|    clip_fraction        | 0.00836     |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.0465     |
|    explained_variance   | 0.9         |
|    learning_rate        | 0.0003      |
|    loss                 | 5.79        |
|    n_updates            | 824         |
|    policy_gradient_loss | -0.000724   |
|    value_loss           | 19          |
-----------------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 83.3

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 78.7         |
|    ep_rew_mean          | -77.7        |
| time/                   |              |
|    fps                  | 1640         |
|    iterations           | 217          |
|    time_elapsed         | 541          |
|    total_timesteps      | 888832       |
| train/                  |              |
|    approx_kl            | 0.0009006205 |
|    clip_fraction        | 0.00769      |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.0446      |
|    explained_variance   | 0.927        |
|    learning_rate        | 0.0003       |
|    loss                 | 5.88         |
|    n_updates            | 864          |
|    policy_gradient_loss | -0.000628    |
|    value_loss           | 15.3         |
------------------------------------------

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 78.5         |
|    ep_rew_mean          | -77.5        |
| time/                   |              |
|    fps                  | 1641         |
|    iterations           | 227          |
|    time_elapsed         | 566          |
|    total_timesteps      | 929792       |
| train/                  |              |
|    approx_kl            | 0.0018705723 |
|    clip_fraction        | 0.0126       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.0384      |
|    explained_variance   | 0.912        |
|    learning_rate        | 0.0003       |
|    loss                 | 7.38         |
|    n_updates            | 904          |
|    policy_gradient_loss | -7.1e-05     |
|    value_loss           | 17.3         |
------------------------------------------

-------------------------------------------
| rollout/                |               |
|    ep_len_mean          | 79.4          |
|    ep_rew_mean          | -78.4         |
| time/                   |               |
|    fps                  | 1635          |
|    iterations           | 237           |
|    time_elapsed         | 593           |
|    total_timesteps      | 970752        |
| train/                  |               |
|    approx_kl            | 0.00079671864 |
|    clip_fraction        | 0.00806       |
|    clip_range           | 0.2           |
|    entropy_loss         | -0.046        |
|    explained_variance   | 0.914         |
|    learning_rate        | 0.0003        |
|    loss                 | 5.53          |
|    n_updates            | 944           |
|    policy_gradient_loss | -0.000552     |
|    value_loss           | 16.1          |
-------------------------------------------

<stable_baselines3.ppo.ppo.PPO at 0x19b0a25d970>

### Training function outputs:
- Rollout:
    - ep_len_mean: mean episode length
    - ep_rew_mean: mean episodic training reward averaged over 100 episodes
- Time:
    - fps: number of frames per second including the time taken by gradient updates
    - iterations: number of iterations (data collection + policy update for A2C/PPO)
    - time_elapsed: time in seconds since beginning of training
    - total_timesteps: total number of timesteps since beginning of training
- Train:
    - approx_kl: approximate mean KL divergence between old and new policy (for PPO), an estimate of how much the policy changed in the update
    - clip_fraction: mean fraction of surrogate loss that was clipped (above clip range threshold)
    - clip_range: current value of clipping factor for surrogate loss
    - entropy_loss: mean value of entropy loss (negative of the average of policy entropy)
    - explained_variance: fraction of the return variance explained by the value function (ev=0 => might as well have predicted 0, ev=1 => perfect prediction, ev<0 => worse than predicting 0)
    - learning_rate: current learning rate
    - loss: current total loss
    - n_updates: number of gradient updates so far
    - policy_gradient_loss: current value of policy gradient loss (value does not have much meaning)
    - value_loss: current value of the value function loss for on-policy algorithms, usually the error between the value function output and the Monte-Carlo estimate
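
The `time/` fields and `n_updates` can be cross-checked against the hyperparameters with a little arithmetic. The sketch below is a consistency check written for this notebook's settings, not library code. It assumes that training always collects whole rollouts and that the logged `n_updates` reflects the updates completed before the current iteration, which matches the log above (for example `n_updates = 396` at iteration 100).

```python
import math

n_envs = 16                  # sub-environments in the vectorized environment
n_steps = 256                # steps per environment per rollout
n_epochs = 4                 # optimization epochs per rollout
total_timesteps = 1_000_000  # training budget

# Timesteps gathered in one iteration (one rollout across all environments).
rollout_size = n_envs * n_steps

# Whole rollouts needed to reach the budget, and timesteps actually collected.
iterations = math.ceil(total_timesteps / rollout_size)
collected = iterations * rollout_size


def n_updates_logged(iteration):
    """n_updates shown in the log at a given iteration (assumption, see above)."""
    return n_epochs * (iteration - 1)


print(rollout_size, iterations, collected)
print(n_updates_logged(11), n_updates_logged(100))
```

The rollout size of 4096 explains why `total_timesteps` in the log grows in steps of 4096 per iteration.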

In [9]:
#save model
model.save('ppo-Acrobot')

After creating and saving the model, we evaluate it to see how well the trained policy performs.

In [10]:
model = PPO.load('ppo-Acrobot')
# Create an evaluation environment
dummy_env = DummyVecEnv([lambda: env])
eval_env = VecMonitor(dummy_env)
# mean_reward and std_reward are the mean and standard deviation of the total reward per episode
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes = 10, deterministic = True)

print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")

mean_reward=-80.40 +/- 23.938251495361328


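This is a good result! A mean reward of about -80 is above the reward threshold of -100, which means the agent typically swings the end of the links above the target line in roughly 80 steps, well before the 500-step episode limit seen at the start of training. Below we will save an mp4 file so that we can visualize the results.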
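
The summary statistics reported by the evaluation can be reproduced from a list of per-episode returns. The sketch below uses only the standard library; the helper name `summarize_returns` and the sample returns are hypothetical (they are not the episodes measured above), and the population standard deviation is used on the assumption that it matches a NumPy-style `std`.

```python
import statistics

REWARD_THRESHOLD = -100.0  # reward threshold quoted for Acrobot-v1


def summarize_returns(episode_returns):
    """Return (mean, population standard deviation) of episode returns."""
    mean = statistics.fmean(episode_returns)
    std = statistics.pstdev(episode_returns)
    return mean, std


# Hypothetical returns for ten evaluation episodes (illustrative only).
returns = [-70, -75, -80, -85, -90, -70, -75, -80, -85, -90]
mean, std = summarize_returns(returns)
print(f"mean_reward={mean:.2f} +/- {std:.2f}")
print("meets threshold:", mean >= REWARD_THRESHOLD)
```

With only 10 evaluation episodes the standard deviation is a rough estimate, so increasing `n_eval_episodes` gives a more reliable comparison against the threshold.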

---

In [None]:
# reuse the evaluation environment as the replay environment (same object, not a copy)
replay_env = eval_env
# length of the video in timesteps
video_length = 250
# use deterministic actions during the replay
is_d = True

video_save_utility.generate_replay(model, replay_env, video_length, is_d)
