In [1]:
import gymnasium as gym

#### 1. Setup the environment with render mode to observe

In [2]:
env =gym.make('LunarLander-v2',render_mode='human')

#### 2. Reset the environment

In [3]:
env.reset()

(array([ 0.00406256,  1.403513  ,  0.41147438, -0.32920933, -0.00470068,
        -0.09320503,  0.        ,  0.        ], dtype=float32),
 {})

#### 3. Sample Actions, Observation Space and Sample Observation Space

In [4]:
print ('sample action', env.action_space.sample())
print ('observation space shape',env.observation_space.shape)
print ('sample observation',env.observation_space.sample)

sample action 1
observation space shape (8,)
sample observation <bound method Box.sample of Box([-1.5       -1.5       -5.        -5.        -3.1415927 -5.
 -0.        -0.       ], [1.5       1.5       5.        5.        3.1415927 5.        1.
 1.       ], (8,), float32)>


#### 4. Close the env. 

In [5]:
env.close()

#### 5. Final reward is -100 at the end of episode. With random action observe not -100 on ~141th epoch

In [None]:
reward = -100
episode = 1
while reward ==-100:
    env =gym.make('LunarLander-v2',render_mode='human')
    env.reset()
    for step in range (200):
        env.render()
        obs,reward,done, info, _ =env.step(env.action_space.sample())
        
    env.close()
    episode +=1 
    print (episode,reward)

##### Key principles to consider in a game

# Gains in RL algorithms (ranked)
- Altering algorithms
- Altering reward space parameters
- Hyper parameter tuning for the algorithm

####| https://stable-baselines3.readthedocs.io/en/master/guide/algos.html

#### 6. Train via A2C

In [6]:
from stable_baselines3 import A2C

In [7]:
env =gym.make('LunarLander-v2')

In [8]:
model = A2C('MlpPolicy', env, verbose=1)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [9]:
model.learn(total_timesteps=100000)
episodes=10
for ep in range (episodes):
    obs = env.reset()
    done = False
    while not done:
        env.render()
        obs,reward,done, info, _ =env.step(env.action_space.sample())

-------------------------------------
| rollout/              |           |
|    ep_len_mean        | 87        |
|    ep_rew_mean        | -281      |
| time/                 |           |
|    fps                | 229       |
|    iterations         | 100       |
|    time_elapsed       | 2         |
|    total_timesteps    | 500       |
| train/                |           |
|    entropy_loss       | -1.37     |
|    explained_variance | -8.79e-05 |
|    learning_rate      | 0.0007    |
|    n_updates          | 99        |
|    policy_loss        | -23.9     |
|    value_loss         | 286       |
-------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 88.9     |
|    ep_rew_mean        | -292     |
| time/                 |          |
|    fps                | 334      |
|    iterations         | 200      |
|    time_elapsed       | 2        |
|    total_timesteps    | 1000     |
| train/             

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 173      |
|    ep_rew_mean        | -230     |
| time/                 |          |
|    fps                | 498      |
|    iterations         | 1400     |
|    time_elapsed       | 14       |
|    total_timesteps    | 7000     |
| train/                |          |
|    entropy_loss       | -0.704   |
|    explained_variance | -0.00689 |
|    learning_rate      | 0.0007   |
|    n_updates          | 1399     |
|    policy_loss        | 9.54     |
|    value_loss         | 51.9     |
------------------------------------
-------------------------------------
| rollout/              |           |
|    ep_len_mean        | 177       |
|    ep_rew_mean        | -227      |
| time/                 |           |
|    fps                | 498       |
|    iterations         | 1500      |
|    time_elapsed       | 15        |
|    total_timesteps    | 7500      |
| train/                |    

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 207      |
|    ep_rew_mean        | -198     |
| time/                 |          |
|    fps                | 516      |
|    iterations         | 2700     |
|    time_elapsed       | 26       |
|    total_timesteps    | 13500    |
| train/                |          |
|    entropy_loss       | -1.14    |
|    explained_variance | 0.000225 |
|    learning_rate      | 0.0007   |
|    n_updates          | 2699     |
|    policy_loss        | 2.58     |
|    value_loss         | 4.88     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 209      |
|    ep_rew_mean        | -199     |
| time/                 |          |
|    fps                | 515      |
|    iterations         | 2800     |
|    time_elapsed       | 27       |
|    total_timesteps    | 14000    |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 212      |
|    ep_rew_mean        | -170     |
| time/                 |          |
|    fps                | 514      |
|    iterations         | 4100     |
|    time_elapsed       | 39       |
|    total_timesteps    | 20500    |
| train/                |          |
|    entropy_loss       | -0.495   |
|    explained_variance | 0.127    |
|    learning_rate      | 0.0007   |
|    n_updates          | 4099     |
|    policy_loss        | -1.2     |
|    value_loss         | 59.6     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 212      |
|    ep_rew_mean        | -171     |
| time/                 |          |
|    fps                | 514      |
|    iterations         | 4200     |
|    time_elapsed       | 40       |
|    total_timesteps    | 21000    |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 236      |
|    ep_rew_mean        | -122     |
| time/                 |          |
|    fps                | 516      |
|    iterations         | 5500     |
|    time_elapsed       | 53       |
|    total_timesteps    | 27500    |
| train/                |          |
|    entropy_loss       | -0.663   |
|    explained_variance | -0.773   |
|    learning_rate      | 0.0007   |
|    n_updates          | 5499     |
|    policy_loss        | -0.233   |
|    value_loss         | 1.44     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 233      |
|    ep_rew_mean        | -121     |
| time/                 |          |
|    fps                | 516      |
|    iterations         | 5600     |
|    time_elapsed       | 54       |
|    total_timesteps    | 28000    |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 222      |
|    ep_rew_mean        | -103     |
| time/                 |          |
|    fps                | 515      |
|    iterations         | 6900     |
|    time_elapsed       | 66       |
|    total_timesteps    | 34500    |
| train/                |          |
|    entropy_loss       | -0.484   |
|    explained_variance | -0.0278  |
|    learning_rate      | 0.0007   |
|    n_updates          | 6899     |
|    policy_loss        | 2.27     |
|    value_loss         | 16.3     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 219      |
|    ep_rew_mean        | -105     |
| time/                 |          |
|    fps                | 515      |
|    iterations         | 7000     |
|    time_elapsed       | 67       |
|    total_timesteps    | 35000    |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 221      |
|    ep_rew_mean        | -105     |
| time/                 |          |
|    fps                | 513      |
|    iterations         | 8300     |
|    time_elapsed       | 80       |
|    total_timesteps    | 41500    |
| train/                |          |
|    entropy_loss       | -0.551   |
|    explained_variance | 0.895    |
|    learning_rate      | 0.0007   |
|    n_updates          | 8299     |
|    policy_loss        | -3.78    |
|    value_loss         | 28.3     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 221      |
|    ep_rew_mean        | -104     |
| time/                 |          |
|    fps                | 513      |
|    iterations         | 8400     |
|    time_elapsed       | 81       |
|    total_timesteps    | 42000    |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 231      |
|    ep_rew_mean        | -72.1    |
| time/                 |          |
|    fps                | 511      |
|    iterations         | 9700     |
|    time_elapsed       | 94       |
|    total_timesteps    | 48500    |
| train/                |          |
|    entropy_loss       | -0.32    |
|    explained_variance | 0.8      |
|    learning_rate      | 0.0007   |
|    n_updates          | 9699     |
|    policy_loss        | -0.888   |
|    value_loss         | 35.5     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 232      |
|    ep_rew_mean        | -71.4    |
| time/                 |          |
|    fps                | 511      |
|    iterations         | 9800     |
|    time_elapsed       | 95       |
|    total_timesteps    | 49000    |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 239      |
|    ep_rew_mean        | -59.8    |
| time/                 |          |
|    fps                | 510      |
|    iterations         | 11100    |
|    time_elapsed       | 108      |
|    total_timesteps    | 55500    |
| train/                |          |
|    entropy_loss       | -0.557   |
|    explained_variance | -0.982   |
|    learning_rate      | 0.0007   |
|    n_updates          | 11099    |
|    policy_loss        | 0.211    |
|    value_loss         | 2.48     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 238      |
|    ep_rew_mean        | -64.6    |
| time/                 |          |
|    fps                | 510      |
|    iterations         | 11200    |
|    time_elapsed       | 109      |
|    total_timesteps    | 56000    |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 242      |
|    ep_rew_mean        | -41.3    |
| time/                 |          |
|    fps                | 509      |
|    iterations         | 12500    |
|    time_elapsed       | 122      |
|    total_timesteps    | 62500    |
| train/                |          |
|    entropy_loss       | -0.277   |
|    explained_variance | -4.86    |
|    learning_rate      | 0.0007   |
|    n_updates          | 12499    |
|    policy_loss        | -1.01    |
|    value_loss         | 128      |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 245      |
|    ep_rew_mean        | -37.4    |
| time/                 |          |
|    fps                | 509      |
|    iterations         | 12600    |
|    time_elapsed       | 123      |
|    total_timesteps    | 63000    |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 261      |
|    ep_rew_mean        | -32.8    |
| time/                 |          |
|    fps                | 508      |
|    iterations         | 13800    |
|    time_elapsed       | 135      |
|    total_timesteps    | 69000    |
| train/                |          |
|    entropy_loss       | -0.326   |
|    explained_variance | 0.0547   |
|    learning_rate      | 0.0007   |
|    n_updates          | 13799    |
|    policy_loss        | -4.34    |
|    value_loss         | 67       |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 260      |
|    ep_rew_mean        | -31.9    |
| time/                 |          |
|    fps                | 508      |
|    iterations         | 13900    |
|    time_elapsed       | 136      |
|    total_timesteps    | 69500    |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 286      |
|    ep_rew_mean        | -31.8    |
| time/                 |          |
|    fps                | 508      |
|    iterations         | 15200    |
|    time_elapsed       | 149      |
|    total_timesteps    | 76000    |
| train/                |          |
|    entropy_loss       | -0.495   |
|    explained_variance | 0.00678  |
|    learning_rate      | 0.0007   |
|    n_updates          | 15199    |
|    policy_loss        | 0.107    |
|    value_loss         | 0.396    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 292      |
|    ep_rew_mean        | -29.9    |
| time/                 |          |
|    fps                | 508      |
|    iterations         | 15300    |
|    time_elapsed       | 150      |
|    total_timesteps    | 76500    |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 312      |
|    ep_rew_mean        | -21.3    |
| time/                 |          |
|    fps                | 507      |
|    iterations         | 16500    |
|    time_elapsed       | 162      |
|    total_timesteps    | 82500    |
| train/                |          |
|    entropy_loss       | -1.05    |
|    explained_variance | 0.561    |
|    learning_rate      | 0.0007   |
|    n_updates          | 16499    |
|    policy_loss        | 0.025    |
|    value_loss         | 1.36     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 317      |
|    ep_rew_mean        | -20.9    |
| time/                 |          |
|    fps                | 507      |
|    iterations         | 16600    |
|    time_elapsed       | 163      |
|    total_timesteps    | 83000    |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 337      |
|    ep_rew_mean        | -11.7    |
| time/                 |          |
|    fps                | 507      |
|    iterations         | 17900    |
|    time_elapsed       | 176      |
|    total_timesteps    | 89500    |
| train/                |          |
|    entropy_loss       | -0.619   |
|    explained_variance | -1.48    |
|    learning_rate      | 0.0007   |
|    n_updates          | 17899    |
|    policy_loss        | 1.34     |
|    value_loss         | 12.6     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 341      |
|    ep_rew_mean        | -11.9    |
| time/                 |          |
|    fps                | 507      |
|    iterations         | 18000    |
|    time_elapsed       | 177      |
|    total_timesteps    | 90000    |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 353      |
|    ep_rew_mean        | -5.88    |
| time/                 |          |
|    fps                | 506      |
|    iterations         | 19300    |
|    time_elapsed       | 190      |
|    total_timesteps    | 96500    |
| train/                |          |
|    entropy_loss       | -0.803   |
|    explained_variance | -1.26    |
|    learning_rate      | 0.0007   |
|    n_updates          | 19299    |
|    policy_loss        | -1.41    |
|    value_loss         | 1.16     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 359      |
|    ep_rew_mean        | -7.27    |
| time/                 |          |
|    fps                | 506      |
|    iterations         | 19400    |
|    time_elapsed       | 191      |
|    total_timesteps    | 97000    |
| train/                |          |
|

  gym.logger.warn(


In [None]:
env.close()

#### 7. Try PPO

In [10]:
from stable_baselines3 import PPO

In [11]:
model = PPO('MlpPolicy', env, verbose=1)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [13]:
model.learn(total_timesteps=100000)
episodes=10
for ep in range (episodes):
    obs = env.reset()
    done = False
    while not done:
        env.render()
        obs,reward,done, info, _ =env.step(env.action_space.sample())

---------------------------------
| rollout/           |          |
|    ep_len_mean     | 886      |
|    ep_rew_mean     | -75.9    |
| time/              |          |
|    fps             | 740      |
|    iterations      | 1        |
|    time_elapsed    | 2        |
|    total_timesteps | 2048     |
---------------------------------
----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 943        |
|    ep_rew_mean          | -42.4      |
| time/                   |            |
|    fps                  | 625        |
|    iterations           | 2          |
|    time_elapsed         | 6          |
|    total_timesteps      | 4096       |
| train/                  |            |
|    approx_kl            | 0.00767214 |
|    clip_fraction        | 0.0633     |
|    clip_range           | 0.2        |
|    entropy_loss         | -0.887     |
|    explained_variance   | 0.525      |
|    learning_rate        | 0.0003     |
|   

----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 957        |
|    ep_rew_mean          | -19.9      |
| time/                   |            |
|    fps                  | 571        |
|    iterations           | 11         |
|    time_elapsed         | 39         |
|    total_timesteps      | 22528      |
| train/                  |            |
|    approx_kl            | 0.00451665 |
|    clip_fraction        | 0.0344     |
|    clip_range           | 0.2        |
|    entropy_loss         | -0.862     |
|    explained_variance   | 0.342      |
|    learning_rate        | 0.0003     |
|    loss                 | 11.7       |
|    n_updates            | 590        |
|    policy_gradient_loss | -0.00241   |
|    value_loss           | 60.7       |
----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 960         |
|    ep_rew_m

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 950         |
|    ep_rew_mean          | 27.5        |
| time/                   |             |
|    fps                  | 573         |
|    iterations           | 21          |
|    time_elapsed         | 74          |
|    total_timesteps      | 43008       |
| train/                  |             |
|    approx_kl            | 0.006881933 |
|    clip_fraction        | 0.0535      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.863      |
|    explained_variance   | 0.424       |
|    learning_rate        | 0.0003      |
|    loss                 | 52          |
|    n_updates            | 690         |
|    policy_gradient_loss | -0.00125    |
|    value_loss           | 82.2        |
-----------------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 945 

----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 926        |
|    ep_rew_mean          | 52         |
| time/                   |            |
|    fps                  | 571        |
|    iterations           | 31         |
|    time_elapsed         | 111        |
|    total_timesteps      | 63488      |
| train/                  |            |
|    approx_kl            | 0.00417194 |
|    clip_fraction        | 0.0475     |
|    clip_range           | 0.2        |
|    entropy_loss         | -0.756     |
|    explained_variance   | 0.512      |
|    learning_rate        | 0.0003     |
|    loss                 | 116        |
|    n_updates            | 790        |
|    policy_gradient_loss | -0.00239   |
|    value_loss           | 118        |
----------------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 925          |
|    ep_re

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 889          |
|    ep_rew_mean          | 57.8         |
| time/                   |              |
|    fps                  | 567          |
|    iterations           | 41           |
|    time_elapsed         | 147          |
|    total_timesteps      | 83968        |
| train/                  |              |
|    approx_kl            | 0.0018656413 |
|    clip_fraction        | 0.0347       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.798       |
|    explained_variance   | 0.474        |
|    learning_rate        | 0.0003       |
|    loss                 | 19           |
|    n_updates            | 890          |
|    policy_gradient_loss | -0.000525    |
|    value_loss           | 55.8         |
------------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_m

#### 8. Saving interim models

In [14]:
import os

In [31]:
model = PPO('MlpPolicy', env, verbose=1,tensorboard_log=logdir)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [32]:
# Create directories for saving models and logs if they don't exist
models_dir = 'models/PPO'
logdir = 'logs'

# Create the models directory if it doesn't exist
if not os.path.exists(models_dir):
    os.makedirs(models_dir)

# Create the logs directory if it doesn't exist
if not os.path.exists(logdir):
    os.makedirs(logdir)


### Understanding the below PPO Snippet

#### Initialization: `model = PPO('MlpPolicy', env, verbose=1)`

- **`'MlpPolicy'`**: This specifies that the policy network architecture will be a Multi-Layer Perceptron (MLP). MLPs are generally good for simpler, lower-dimensional observation spaces.
  
- **`env`**: This is the environment object where the agent will be trained. It should comply with OpenAI's Gym API, providing methods like `reset()` and `step()`.

- **`verbose=1`**: This sets the logging level to verbose, meaning that training progress will be printed to the console. This is useful for debugging and monitoring.

#### Training Loop: `for i in range(1, 30):`

- **`TIMESTEPS = 10000`**: This sets the number of timesteps for which the model will be trained in each iteration of the loop. 10,000 timesteps is a reasonable starting point for many problems but may need to be adjusted based on the complexity of the task.

- **`model.learn(...)`**: This is where the actual training happens.

  - **`total_timesteps=TIMESTEPS`**: Specifies the number of timesteps for this training iteration.
  
  - **`reset_num_timesteps=False`**: This ensures that the learning continues from where it left off in the previous iteration, rather than resetting. This is crucial for incremental learning.
  
  - **`tb_log_name='PPO'`**: This sets the name of the TensorBoard log, useful for monitoring training metrics.

- **`model.save(f"(models_dir)/{TIMESTEPS*i})`**: This saves the model after each training iteration. The filename includes the total number of timesteps the model has been trained for, which is useful for keeping track of training progress and for potential rollbacks to previous states.

The rationale behind this code is to incrementally train a PPO model for a total of 30 iterations, each with 10,000 timesteps, while saving the model at each step. This allows for monitoring and potentially resuming training from a specific point. The verbose logging and TensorBoard support provide avenues for debugging and performance tracking.

In [33]:
model = PPO('MlpPolicy', env, verbose=1,tensorboard_log=logdir)

TIMESTEPS = 10000
for i in range (1,30):
    model.learn(total_timesteps=TIMESTEPS,reset_num_timesteps=False, tb_log_name='PPO')
    model.save(f"{models_dir}/{TIMESTEPS*i}")



Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Logging to logs\PPO_0
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 93       |
|    ep_rew_mean     | -227     |
| time/              |          |
|    fps             | 740      |
|    iterations      | 1        |
|    time_elapsed    | 2        |
|    total_timesteps | 2048     |
---------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 98.3         |
|    ep_rew_mean          | -198         |
| time/                   |              |
|    fps                  | 637          |
|    iterations           | 2            |
|    time_elapsed         | 6            |
|    total_timesteps      | 4096         |
| train/                  |              |
|    approx_kl            | 0.0059515145 |
|    clip_fraction        | 0.011        |
|    clip_range           |

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 130         |
|    ep_rew_mean          | -109        |
| time/                   |             |
|    fps                  | 697         |
|    iterations           | 2           |
|    time_elapsed         | 5           |
|    total_timesteps      | 24576       |
| train/                  |             |
|    approx_kl            | 0.012837132 |
|    clip_fraction        | 0.0855      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.18       |
|    explained_variance   | -0.0792     |
|    learning_rate        | 0.0003      |
|    loss                 | 108         |
|    n_updates            | 110         |
|    policy_gradient_loss | -0.00424    |
|    value_loss           | 401         |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 139   

----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 278        |
|    ep_rew_mean          | -79.8      |
| time/                   |            |
|    fps                  | 651        |
|    iterations           | 3          |
|    time_elapsed         | 9          |
|    total_timesteps      | 47104      |
| train/                  |            |
|    approx_kl            | 0.01179202 |
|    clip_fraction        | 0.123      |
|    clip_range           | 0.2        |
|    entropy_loss         | -1.04      |
|    explained_variance   | -0.0999    |
|    learning_rate        | 0.0003     |
|    loss                 | 30.8       |
|    n_updates            | 220        |
|    policy_gradient_loss | -0.00559   |
|    value_loss           | 164        |
----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 289         |
|    ep_rew_m

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 468         |
|    ep_rew_mean          | -35         |
| time/                   |             |
|    fps                  | 593         |
|    iterations           | 4           |
|    time_elapsed         | 13          |
|    total_timesteps      | 69632       |
| train/                  |             |
|    approx_kl            | 0.006384302 |
|    clip_fraction        | 0.089       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.753      |
|    explained_variance   | 0.424       |
|    learning_rate        | 0.0003      |
|    loss                 | 20.9        |
|    n_updates            | 330         |
|    policy_gradient_loss | -0.00359    |
|    value_loss           | 110         |
-----------------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 485 

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 630          |
|    ep_rew_mean          | 16.9         |
| time/                   |              |
|    fps                  | 595          |
|    iterations           | 5            |
|    time_elapsed         | 17           |
|    total_timesteps      | 92160        |
| train/                  |              |
|    approx_kl            | 0.0022811186 |
|    clip_fraction        | 0.00723      |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.956       |
|    explained_variance   | 0.613        |
|    learning_rate        | 0.0003       |
|    loss                 | 115          |
|    n_updates            | 440          |
|    policy_gradient_loss | -0.00268     |
|    value_loss           | 113          |
------------------------------------------
Logging to logs\PPO_0
---------------------------------
| rollout/           |          |
|    ep

Logging to logs\PPO_0
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 763      |
|    ep_rew_mean     | 55.8     |
| time/              |          |
|    fps             | 885      |
|    iterations      | 1        |
|    time_elapsed    | 2        |
|    total_timesteps | 114688   |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 773         |
|    ep_rew_mean          | 55.8        |
| time/                   |             |
|    fps                  | 685         |
|    iterations           | 2           |
|    time_elapsed         | 5           |
|    total_timesteps      | 116736      |
| train/                  |             |
|    approx_kl            | 0.013987621 |
|    clip_fraction        | 0.196       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.01       |
|    explained_variance   | 0.866       |
|    lea

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 806         |
|    ep_rew_mean          | 73.1        |
| time/                   |             |
|    fps                  | 661         |
|    iterations           | 2           |
|    time_elapsed         | 6           |
|    total_timesteps      | 137216      |
| train/                  |             |
|    approx_kl            | 0.006992192 |
|    clip_fraction        | 0.0605      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.927      |
|    explained_variance   | 0.918       |
|    learning_rate        | 0.0003      |
|    loss                 | 3.04        |
|    n_updates            | 660         |
|    policy_gradient_loss | -0.00407    |
|    value_loss           | 6.33        |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 804   

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 778         |
|    ep_rew_mean          | 71.3        |
| time/                   |             |
|    fps                  | 649         |
|    iterations           | 3           |
|    time_elapsed         | 9           |
|    total_timesteps      | 159744      |
| train/                  |             |
|    approx_kl            | 0.012516519 |
|    clip_fraction        | 0.0723      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.812      |
|    explained_variance   | 0.517       |
|    learning_rate        | 0.0003      |
|    loss                 | 108         |
|    n_updates            | 770         |
|    policy_gradient_loss | -0.00361    |
|    value_loss           | 163         |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 765   

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 677          |
|    ep_rew_mean          | 85.9         |
| time/                   |              |
|    fps                  | 588          |
|    iterations           | 4            |
|    time_elapsed         | 13           |
|    total_timesteps      | 182272       |
| train/                  |              |
|    approx_kl            | 0.0048139626 |
|    clip_fraction        | 0.0238       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.932       |
|    explained_variance   | 0.789        |
|    learning_rate        | 0.0003       |
|    loss                 | 39.1         |
|    n_updates            | 880          |
|    policy_gradient_loss | -0.000556    |
|    value_loss           | 79.3         |
------------------------------------------
----------------------------------------
| rollout/                |            |
|    ep_len_mea

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 583         |
|    ep_rew_mean          | 88.2        |
| time/                   |             |
|    fps                  | 599         |
|    iterations           | 5           |
|    time_elapsed         | 17          |
|    total_timesteps      | 204800      |
| train/                  |             |
|    approx_kl            | 0.011719844 |
|    clip_fraction        | 0.147       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.817      |
|    explained_variance   | 0.899       |
|    learning_rate        | 0.0003      |
|    loss                 | 5.34        |
|    n_updates            | 990         |
|    policy_gradient_loss | -0.00204    |
|    value_loss           | 19.6        |
-----------------------------------------
Logging to logs\PPO_0
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 561  

Logging to logs\PPO_0
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 607      |
|    ep_rew_mean     | 86.8     |
| time/              |          |
|    fps             | 880      |
|    iterations      | 1        |
|    time_elapsed    | 2        |
|    total_timesteps | 227328   |
---------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 609          |
|    ep_rew_mean          | 90.5         |
| time/                   |              |
|    fps                  | 688          |
|    iterations           | 2            |
|    time_elapsed         | 5            |
|    total_timesteps      | 229376       |
| train/                  |              |
|    approx_kl            | 0.0056414604 |
|    clip_fraction        | 0.0439       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.775       |
|    explained_variance   | 0.699   

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 621         |
|    ep_rew_mean          | 92.8        |
| time/                   |             |
|    fps                  | 637         |
|    iterations           | 3           |
|    time_elapsed         | 9           |
|    total_timesteps      | 251904      |
| train/                  |             |
|    approx_kl            | 0.007307873 |
|    clip_fraction        | 0.0465      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.797      |
|    explained_variance   | 0.762       |
|    learning_rate        | 0.0003      |
|    loss                 | 63.1        |
|    n_updates            | 1220        |
|    policy_gradient_loss | -0.00252    |
|    value_loss           | 123         |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 628   

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 611          |
|    ep_rew_mean          | 105          |
| time/                   |              |
|    fps                  | 609          |
|    iterations           | 4            |
|    time_elapsed         | 13           |
|    total_timesteps      | 274432       |
| train/                  |              |
|    approx_kl            | 0.0066701174 |
|    clip_fraction        | 0.0572       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.768       |
|    explained_variance   | 0.729        |
|    learning_rate        | 0.0003       |
|    loss                 | 53.3         |
|    n_updates            | 1330         |
|    policy_gradient_loss | -0.00232     |
|    value_loss           | 123          |
------------------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 582         |
|    ep_rew_mean          | 101         |
| time/                   |             |
|    fps                  | 598         |
|    iterations           | 5           |
|    time_elapsed         | 17          |
|    total_timesteps      | 296960      |
| train/                  |             |
|    approx_kl            | 0.001930079 |
|    clip_fraction        | 0.00195     |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.666      |
|    explained_variance   | 0.68        |
|    learning_rate        | 0.0003      |
|    loss                 | 58.3        |
|    n_updates            | 1440        |
|    policy_gradient_loss | -0.00066    |
|    value_loss           | 219         |
-----------------------------------------


##### also do this with A2C

In [34]:
# Create directories for saving models and logs if they don't exist
models_dir = 'models/A2C'
logdir = 'logs'

# Create the models directory if it doesn't exist
if not os.path.exists(models_dir):
    os.makedirs(models_dir)

# Create the logs directory if it doesn't exist
if not os.path.exists(logdir):
    os.makedirs(logdir)

In [35]:
model = A2C('MlpPolicy', env, verbose=1,tensorboard_log=logdir)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [36]:
TIMESTEPS = 10000
for i in range (1,30):
    model.learn(total_timesteps=TIMESTEPS,reset_num_timesteps=False, tb_log_name='A2C')
    model.save(f"{models_dir}/{TIMESTEPS*i}")

Logging to logs\A2C_0
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 114      |
|    ep_rew_mean        | -324     |
| time/                 |          |
|    fps                | 542      |
|    iterations         | 100      |
|    time_elapsed       | 0        |
|    total_timesteps    | 500      |
| train/                |          |
|    entropy_loss       | -1.24    |
|    explained_variance | 0.00477  |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | -13.1    |
|    value_loss         | 200      |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 117      |
|    ep_rew_mean        | -317     |
| time/                 |          |
|    fps                | 562      |
|    iterations         | 200      |
|    time_elapsed       | 1        |
|    total_timesteps    | 1000     |
| train/        

-------------------------------------
| rollout/              |           |
|    ep_len_mean        | 167       |
|    ep_rew_mean        | -273      |
| time/                 |           |
|    fps                | 543       |
|    iterations         | 1400      |
|    time_elapsed       | 12        |
|    total_timesteps    | 7000      |
| train/                |           |
|    entropy_loss       | -0.801    |
|    explained_variance | -5.73e-05 |
|    learning_rate      | 0.0007    |
|    n_updates          | 1399      |
|    policy_loss        | -16.8     |
|    value_loss         | 202       |
-------------------------------------
-------------------------------------
| rollout/              |           |
|    ep_len_mean        | 169       |
|    ep_rew_mean        | -268      |
| time/                 |           |
|    fps                | 541       |
|    iterations         | 1500      |
|    time_elapsed       | 13        |
|    total_timesteps    | 7500      |
| train/    

-------------------------------------
| rollout/              |           |
|    ep_len_mean        | 193       |
|    ep_rew_mean        | -215      |
| time/                 |           |
|    fps                | 551       |
|    iterations         | 700       |
|    time_elapsed       | 6         |
|    total_timesteps    | 13500     |
| train/                |           |
|    entropy_loss       | -0.209    |
|    explained_variance | -1.07e-06 |
|    learning_rate      | 0.0007    |
|    n_updates          | 2699      |
|    policy_loss        | 0.253     |
|    value_loss         | 60.8      |
-------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 196      |
|    ep_rew_mean        | -213     |
| time/                 |          |
|    fps                | 543      |
|    iterations         | 800      |
|    time_elapsed       | 7        |
|    total_timesteps    | 14000    |
| train/             

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 216      |
|    ep_rew_mean        | -193     |
| time/                 |          |
|    fps                | 528      |
|    iterations         | 2000     |
|    time_elapsed       | 18       |
|    total_timesteps    | 20000    |
| train/                |          |
|    entropy_loss       | -0.612   |
|    explained_variance | 0.000177 |
|    learning_rate      | 0.0007   |
|    n_updates          | 3999     |
|    policy_loss        | -2.04    |
|    value_loss         | 76.4     |
------------------------------------
Logging to logs\A2C_0
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 216      |
|    ep_rew_mean        | -189     |
| time/                 |          |
|    fps                | 576      |
|    iterations         | 100      |
|    time_elapsed       | 0        |
|    total_timesteps    | 20500    |
| train/        

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 236      |
|    ep_rew_mean        | -148     |
| time/                 |          |
|    fps                | 527      |
|    iterations         | 1300     |
|    time_elapsed       | 12       |
|    total_timesteps    | 26500    |
| train/                |          |
|    entropy_loss       | -0.687   |
|    explained_variance | 0        |
|    learning_rate      | 0.0007   |
|    n_updates          | 5299     |
|    policy_loss        | 2.87     |
|    value_loss         | 23.9     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 236      |
|    ep_rew_mean        | -148     |
| time/                 |          |
|    fps                | 524      |
|    iterations         | 1400     |
|    time_elapsed       | 13       |
|    total_timesteps    | 27000    |
| train/                |          |
|

-------------------------------------
| rollout/              |           |
|    ep_len_mean        | 279       |
|    ep_rew_mean        | -126      |
| time/                 |           |
|    fps                | 502       |
|    iterations         | 600       |
|    time_elapsed       | 5         |
|    total_timesteps    | 33000     |
| train/                |           |
|    entropy_loss       | -0.316    |
|    explained_variance | -0.000693 |
|    learning_rate      | 0.0007    |
|    n_updates          | 6599      |
|    policy_loss        | -0.211    |
|    value_loss         | 2.96      |
-------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 280      |
|    ep_rew_mean        | -125     |
| time/                 |          |
|    fps                | 502      |
|    iterations         | 700      |
|    time_elapsed       | 6        |
|    total_timesteps    | 33500    |
| train/             

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 297      |
|    ep_rew_mean        | -105     |
| time/                 |          |
|    fps                | 500      |
|    iterations         | 1900     |
|    time_elapsed       | 18       |
|    total_timesteps    | 39500    |
| train/                |          |
|    entropy_loss       | -0.434   |
|    explained_variance | 0.00897  |
|    learning_rate      | 0.0007   |
|    n_updates          | 7899     |
|    policy_loss        | -0.131   |
|    value_loss         | 1.38     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 297      |
|    ep_rew_mean        | -104     |
| time/                 |          |
|    fps                | 499      |
|    iterations         | 2000     |
|    time_elapsed       | 20       |
|    total_timesteps    | 40000    |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 300      |
|    ep_rew_mean        | -95.7    |
| time/                 |          |
|    fps                | 509      |
|    iterations         | 1200     |
|    time_elapsed       | 11       |
|    total_timesteps    | 46000    |
| train/                |          |
|    entropy_loss       | -0.429   |
|    explained_variance | -0.00242 |
|    learning_rate      | 0.0007   |
|    n_updates          | 9199     |
|    policy_loss        | 8.43     |
|    value_loss         | 156      |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 298      |
|    ep_rew_mean        | -97.6    |
| time/                 |          |
|    fps                | 509      |
|    iterations         | 1300     |
|    time_elapsed       | 12       |
|    total_timesteps    | 46500    |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 307      |
|    ep_rew_mean        | -77.1    |
| time/                 |          |
|    fps                | 505      |
|    iterations         | 500      |
|    time_elapsed       | 4        |
|    total_timesteps    | 52500    |
| train/                |          |
|    entropy_loss       | -0.844   |
|    explained_variance | 0.367    |
|    learning_rate      | 0.0007   |
|    n_updates          | 10499    |
|    policy_loss        | -1.01    |
|    value_loss         | 25.6     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 308      |
|    ep_rew_mean        | -75.6    |
| time/                 |          |
|    fps                | 502      |
|    iterations         | 600      |
|    time_elapsed       | 5        |
|    total_timesteps    | 53000    |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 289      |
|    ep_rew_mean        | -62.4    |
| time/                 |          |
|    fps                | 505      |
|    iterations         | 1900     |
|    time_elapsed       | 18       |
|    total_timesteps    | 59500    |
| train/                |          |
|    entropy_loss       | -0.618   |
|    explained_variance | -0.187   |
|    learning_rate      | 0.0007   |
|    n_updates          | 11899    |
|    policy_loss        | 0.769    |
|    value_loss         | 11.3     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 287      |
|    ep_rew_mean        | -59.4    |
| time/                 |          |
|    fps                | 506      |
|    iterations         | 2000     |
|    time_elapsed       | 19       |
|    total_timesteps    | 60000    |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 274      |
|    ep_rew_mean        | -60.2    |
| time/                 |          |
|    fps                | 510      |
|    iterations         | 1200     |
|    time_elapsed       | 11       |
|    total_timesteps    | 66000    |
| train/                |          |
|    entropy_loss       | -0.39    |
|    explained_variance | 0.592    |
|    learning_rate      | 0.0007   |
|    n_updates          | 13199    |
|    policy_loss        | 2.64     |
|    value_loss         | 17.5     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 273      |
|    ep_rew_mean        | -61.7    |
| time/                 |          |
|    fps                | 516      |
|    iterations         | 1300     |
|    time_elapsed       | 12       |
|    total_timesteps    | 66500    |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 273      |
|    ep_rew_mean        | -53.6    |
| time/                 |          |
|    fps                | 504      |
|    iterations         | 500      |
|    time_elapsed       | 4        |
|    total_timesteps    | 72500    |
| train/                |          |
|    entropy_loss       | -0.587   |
|    explained_variance | 0.957    |
|    learning_rate      | 0.0007   |
|    n_updates          | 14499    |
|    policy_loss        | 0.336    |
|    value_loss         | 1.71     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 273      |
|    ep_rew_mean        | -50.3    |
| time/                 |          |
|    fps                | 517      |
|    iterations         | 600      |
|    time_elapsed       | 5        |
|    total_timesteps    | 73000    |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 253      |
|    ep_rew_mean        | -38.1    |
| time/                 |          |
|    fps                | 532      |
|    iterations         | 1800     |
|    time_elapsed       | 16       |
|    total_timesteps    | 79000    |
| train/                |          |
|    entropy_loss       | -0.498   |
|    explained_variance | 0.296    |
|    learning_rate      | 0.0007   |
|    n_updates          | 15799    |
|    policy_loss        | 1.99     |
|    value_loss         | 8.93     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 255      |
|    ep_rew_mean        | -33.5    |
| time/                 |          |
|    fps                | 531      |
|    iterations         | 1900     |
|    time_elapsed       | 17       |
|    total_timesteps    | 79500    |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 252      |
|    ep_rew_mean        | -30.2    |
| time/                 |          |
|    fps                | 507      |
|    iterations         | 1100     |
|    time_elapsed       | 10       |
|    total_timesteps    | 85500    |
| train/                |          |
|    entropy_loss       | -0.0574  |
|    explained_variance | -3.63    |
|    learning_rate      | 0.0007   |
|    n_updates          | 17099    |
|    policy_loss        | 0.165    |
|    value_loss         | 0.45     |
------------------------------------
-------------------------------------
| rollout/              |           |
|    ep_len_mean        | 251       |
|    ep_rew_mean        | -26.9     |
| time/                 |           |
|    fps                | 506       |
|    iterations         | 1200      |
|    time_elapsed       | 11        |
|    total_timesteps    | 86000     |
| train/                |    

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 221      |
|    ep_rew_mean        | -6.56    |
| time/                 |          |
|    fps                | 487      |
|    iterations         | 400      |
|    time_elapsed       | 4        |
|    total_timesteps    | 92000    |
| train/                |          |
|    entropy_loss       | -0.593   |
|    explained_variance | 0.728    |
|    learning_rate      | 0.0007   |
|    n_updates          | 18399    |
|    policy_loss        | -1.02    |
|    value_loss         | 2.75     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 216      |
|    ep_rew_mean        | -9.48    |
| time/                 |          |
|    fps                | 488      |
|    iterations         | 500      |
|    time_elapsed       | 5        |
|    total_timesteps    | 92500    |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 157      |
|    ep_rew_mean        | -34.1    |
| time/                 |          |
|    fps                | 489      |
|    iterations         | 1800     |
|    time_elapsed       | 18       |
|    total_timesteps    | 99000    |
| train/                |          |
|    entropy_loss       | -0.411   |
|    explained_variance | 0.406    |
|    learning_rate      | 0.0007   |
|    n_updates          | 19799    |
|    policy_loss        | 2.1      |
|    value_loss         | 20.9     |
------------------------------------
-------------------------------------
| rollout/              |           |
|    ep_len_mean        | 151       |
|    ep_rew_mean        | -34.5     |
| time/                 |           |
|    fps                | 489       |
|    iterations         | 1900      |
|    time_elapsed       | 19        |
|    total_timesteps    | 99500     |
| train/                |    

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 140      |
|    ep_rew_mean        | -52.1    |
| time/                 |          |
|    fps                | 496      |
|    iterations         | 1100     |
|    time_elapsed       | 11       |
|    total_timesteps    | 105500   |
| train/                |          |
|    entropy_loss       | -0.391   |
|    explained_variance | -0.047   |
|    learning_rate      | 0.0007   |
|    n_updates          | 21099    |
|    policy_loss        | -0.306   |
|    value_loss         | 0.525    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 143      |
|    ep_rew_mean        | -48.4    |
| time/                 |          |
|    fps                | 496      |
|    iterations         | 1200     |
|    time_elapsed       | 12       |
|    total_timesteps    | 106000   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 187      |
|    ep_rew_mean        | -31.2    |
| time/                 |          |
|    fps                | 493      |
|    iterations         | 400      |
|    time_elapsed       | 4        |
|    total_timesteps    | 112000   |
| train/                |          |
|    entropy_loss       | -0.451   |
|    explained_variance | 0.958    |
|    learning_rate      | 0.0007   |
|    n_updates          | 22399    |
|    policy_loss        | -2.28    |
|    value_loss         | 16.2     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 191      |
|    ep_rew_mean        | -27.1    |
| time/                 |          |
|    fps                | 492      |
|    iterations         | 500      |
|    time_elapsed       | 5        |
|    total_timesteps    | 112500   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 192      |
|    ep_rew_mean        | -26.9    |
| time/                 |          |
|    fps                | 538      |
|    iterations         | 1800     |
|    time_elapsed       | 16       |
|    total_timesteps    | 119000   |
| train/                |          |
|    entropy_loss       | -0.615   |
|    explained_variance | 0.982    |
|    learning_rate      | 0.0007   |
|    n_updates          | 23799    |
|    policy_loss        | -0.441   |
|    value_loss         | 0.888    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 194      |
|    ep_rew_mean        | -26.4    |
| time/                 |          |
|    fps                | 539      |
|    iterations         | 1900     |
|    time_elapsed       | 17       |
|    total_timesteps    | 119500   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 228      |
|    ep_rew_mean        | -19.9    |
| time/                 |          |
|    fps                | 493      |
|    iterations         | 1100     |
|    time_elapsed       | 11       |
|    total_timesteps    | 125500   |
| train/                |          |
|    entropy_loss       | -0.00416 |
|    explained_variance | 0.042    |
|    learning_rate      | 0.0007   |
|    n_updates          | 25099    |
|    policy_loss        | 0.000106 |
|    value_loss         | 0.0612   |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 238      |
|    ep_rew_mean        | -17.6    |
| time/                 |          |
|    fps                | 493      |
|    iterations         | 1200     |
|    time_elapsed       | 12       |
|    total_timesteps    | 126000   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 188      |
|    ep_rew_mean        | -58.5    |
| time/                 |          |
|    fps                | 504      |
|    iterations         | 400      |
|    time_elapsed       | 3        |
|    total_timesteps    | 132000   |
| train/                |          |
|    entropy_loss       | -0.648   |
|    explained_variance | -3.9     |
|    learning_rate      | 0.0007   |
|    n_updates          | 26399    |
|    policy_loss        | -3.28    |
|    value_loss         | 71.1     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 185      |
|    ep_rew_mean        | -56.2    |
| time/                 |          |
|    fps                | 500      |
|    iterations         | 500      |
|    time_elapsed       | 4        |
|    total_timesteps    | 132500   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 194      |
|    ep_rew_mean        | -54.6    |
| time/                 |          |
|    fps                | 499      |
|    iterations         | 1700     |
|    time_elapsed       | 17       |
|    total_timesteps    | 138500   |
| train/                |          |
|    entropy_loss       | -0.556   |
|    explained_variance | 0.58     |
|    learning_rate      | 0.0007   |
|    n_updates          | 27699    |
|    policy_loss        | 0.518    |
|    value_loss         | 3.01     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 194      |
|    ep_rew_mean        | -54.6    |
| time/                 |          |
|    fps                | 499      |
|    iterations         | 1800     |
|    time_elapsed       | 18       |
|    total_timesteps    | 139000   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 209      |
|    ep_rew_mean        | -53.6    |
| time/                 |          |
|    fps                | 542      |
|    iterations         | 1000     |
|    time_elapsed       | 9        |
|    total_timesteps    | 145000   |
| train/                |          |
|    entropy_loss       | -0.508   |
|    explained_variance | 0.338    |
|    learning_rate      | 0.0007   |
|    n_updates          | 28999    |
|    policy_loss        | -0.49    |
|    value_loss         | 0.606    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 218      |
|    ep_rew_mean        | -52.9    |
| time/                 |          |
|    fps                | 540      |
|    iterations         | 1100     |
|    time_elapsed       | 10       |
|    total_timesteps    | 145500   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 195      |
|    ep_rew_mean        | -36.6    |
| time/                 |          |
|    fps                | 501      |
|    iterations         | 300      |
|    time_elapsed       | 2        |
|    total_timesteps    | 151500   |
| train/                |          |
|    entropy_loss       | -0.589   |
|    explained_variance | -0.606   |
|    learning_rate      | 0.0007   |
|    n_updates          | 30299    |
|    policy_loss        | -0.53    |
|    value_loss         | 4.05     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 199      |
|    ep_rew_mean        | -36.2    |
| time/                 |          |
|    fps                | 499      |
|    iterations         | 400      |
|    time_elapsed       | 4        |
|    total_timesteps    | 152000   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 205      |
|    ep_rew_mean        | -33.9    |
| time/                 |          |
|    fps                | 497      |
|    iterations         | 1600     |
|    time_elapsed       | 16       |
|    total_timesteps    | 158000   |
| train/                |          |
|    entropy_loss       | -0.155   |
|    explained_variance | -3.77    |
|    learning_rate      | 0.0007   |
|    n_updates          | 31599    |
|    policy_loss        | 0.207    |
|    value_loss         | 22.6     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 205      |
|    ep_rew_mean        | -30.3    |
| time/                 |          |
|    fps                | 497      |
|    iterations         | 1700     |
|    time_elapsed       | 17       |
|    total_timesteps    | 158500   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 172      |
|    ep_rew_mean        | -13.2    |
| time/                 |          |
|    fps                | 498      |
|    iterations         | 900      |
|    time_elapsed       | 9        |
|    total_timesteps    | 164500   |
| train/                |          |
|    entropy_loss       | -0.873   |
|    explained_variance | -0.398   |
|    learning_rate      | 0.0007   |
|    n_updates          | 32899    |
|    policy_loss        | 0.397    |
|    value_loss         | 1.25     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 173      |
|    ep_rew_mean        | -13.6    |
| time/                 |          |
|    fps                | 498      |
|    iterations         | 1000     |
|    time_elapsed       | 10       |
|    total_timesteps    | 165000   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 176      |
|    ep_rew_mean        | 25.6     |
| time/                 |          |
|    fps                | 500      |
|    iterations         | 200      |
|    time_elapsed       | 1        |
|    total_timesteps    | 171000   |
| train/                |          |
|    entropy_loss       | -0.457   |
|    explained_variance | 0.81     |
|    learning_rate      | 0.0007   |
|    n_updates          | 34199    |
|    policy_loss        | -0.342   |
|    value_loss         | 1.52     |
------------------------------------
-------------------------------------
| rollout/              |           |
|    ep_len_mean        | 180       |
|    ep_rew_mean        | 29.3      |
| time/                 |           |
|    fps                | 498       |
|    iterations         | 300       |
|    time_elapsed       | 3         |
|    total_timesteps    | 171500    |
| train/                |    

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 190      |
|    ep_rew_mean        | 62.5     |
| time/                 |          |
|    fps                | 498      |
|    iterations         | 1500     |
|    time_elapsed       | 15       |
|    total_timesteps    | 177500   |
| train/                |          |
|    entropy_loss       | -0.289   |
|    explained_variance | 0.00594  |
|    learning_rate      | 0.0007   |
|    n_updates          | 35499    |
|    policy_loss        | -1.79    |
|    value_loss         | 2.64     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 192      |
|    ep_rew_mean        | 62.7     |
| time/                 |          |
|    fps                | 498      |
|    iterations         | 1600     |
|    time_elapsed       | 16       |
|    total_timesteps    | 178000   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 231      |
|    ep_rew_mean        | 79.5     |
| time/                 |          |
|    fps                | 503      |
|    iterations         | 800      |
|    time_elapsed       | 7        |
|    total_timesteps    | 184000   |
| train/                |          |
|    entropy_loss       | -0.565   |
|    explained_variance | 0.341    |
|    learning_rate      | 0.0007   |
|    n_updates          | 36799    |
|    policy_loss        | -0.217   |
|    value_loss         | 1.04     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 231      |
|    ep_rew_mean        | 79.5     |
| time/                 |          |
|    fps                | 503      |
|    iterations         | 900      |
|    time_elapsed       | 8        |
|    total_timesteps    | 184500   |
| train/                |          |
|

Logging to logs\A2C_0
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 263      |
|    ep_rew_mean        | 87.2     |
| time/                 |          |
|    fps                | 498      |
|    iterations         | 100      |
|    time_elapsed       | 1        |
|    total_timesteps    | 190500   |
| train/                |          |
|    entropy_loss       | -0.465   |
|    explained_variance | 0.439    |
|    learning_rate      | 0.0007   |
|    n_updates          | 38099    |
|    policy_loss        | 0.293    |
|    value_loss         | 2.58     |
------------------------------------
-------------------------------------
| rollout/              |           |
|    ep_len_mean        | 262       |
|    ep_rew_mean        | 87.3      |
| time/                 |           |
|    fps                | 500       |
|    iterations         | 200       |
|    time_elapsed       | 1         |
|    total_timesteps    | 191000    |
| train

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 240      |
|    ep_rew_mean        | 58.4     |
| time/                 |          |
|    fps                | 499      |
|    iterations         | 1400     |
|    time_elapsed       | 14       |
|    total_timesteps    | 197000   |
| train/                |          |
|    entropy_loss       | -0.507   |
|    explained_variance | 0.62     |
|    learning_rate      | 0.0007   |
|    n_updates          | 39399    |
|    policy_loss        | -1.74    |
|    value_loss         | 10.6     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 233      |
|    ep_rew_mean        | 49       |
| time/                 |          |
|    fps                | 498      |
|    iterations         | 1500     |
|    time_elapsed       | 15       |
|    total_timesteps    | 197500   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 263      |
|    ep_rew_mean        | 46.9     |
| time/                 |          |
|    fps                | 496      |
|    iterations         | 700      |
|    time_elapsed       | 7        |
|    total_timesteps    | 203500   |
| train/                |          |
|    entropy_loss       | -0.434   |
|    explained_variance | 0.721    |
|    learning_rate      | 0.0007   |
|    n_updates          | 40699    |
|    policy_loss        | 0.402    |
|    value_loss         | 0.876    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 271      |
|    ep_rew_mean        | 48.9     |
| time/                 |          |
|    fps                | 496      |
|    iterations         | 800      |
|    time_elapsed       | 8        |
|    total_timesteps    | 204000   |
| train/                |          |
|

Logging to logs\A2C_0
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 300      |
|    ep_rew_mean        | 24.7     |
| time/                 |          |
|    fps                | 508      |
|    iterations         | 100      |
|    time_elapsed       | 0        |
|    total_timesteps    | 210500   |
| train/                |          |
|    entropy_loss       | -0.616   |
|    explained_variance | -1.53    |
|    learning_rate      | 0.0007   |
|    n_updates          | 42099    |
|    policy_loss        | -0.0916  |
|    value_loss         | 0.501    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 300      |
|    ep_rew_mean        | 24.7     |
| time/                 |          |
|    fps                | 506      |
|    iterations         | 200      |
|    time_elapsed       | 1        |
|    total_timesteps    | 211000   |
| train/        

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 325      |
|    ep_rew_mean        | 11.8     |
| time/                 |          |
|    fps                | 505      |
|    iterations         | 1400     |
|    time_elapsed       | 13       |
|    total_timesteps    | 217000   |
| train/                |          |
|    entropy_loss       | -0.637   |
|    explained_variance | 0.762    |
|    learning_rate      | 0.0007   |
|    n_updates          | 43399    |
|    policy_loss        | -1.26    |
|    value_loss         | 3.04     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 325      |
|    ep_rew_mean        | 11       |
| time/                 |          |
|    fps                | 505      |
|    iterations         | 1500     |
|    time_elapsed       | 14       |
|    total_timesteps    | 217500   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 360      |
|    ep_rew_mean        | 11.3     |
| time/                 |          |
|    fps                | 500      |
|    iterations         | 700      |
|    time_elapsed       | 6        |
|    total_timesteps    | 223500   |
| train/                |          |
|    entropy_loss       | -0.724   |
|    explained_variance | 0.964    |
|    learning_rate      | 0.0007   |
|    n_updates          | 44699    |
|    policy_loss        | -0.257   |
|    value_loss         | 0.213    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 355      |
|    ep_rew_mean        | 10.7     |
| time/                 |          |
|    fps                | 499      |
|    iterations         | 800      |
|    time_elapsed       | 8        |
|    total_timesteps    | 224000   |
| train/                |          |
|

Logging to logs\A2C_0
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 390      |
|    ep_rew_mean        | 13.8     |
| time/                 |          |
|    fps                | 500      |
|    iterations         | 100      |
|    time_elapsed       | 0        |
|    total_timesteps    | 230500   |
| train/                |          |
|    entropy_loss       | -0.484   |
|    explained_variance | 0.592    |
|    learning_rate      | 0.0007   |
|    n_updates          | 46099    |
|    policy_loss        | 0.398    |
|    value_loss         | 7.49     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 399      |
|    ep_rew_mean        | 14.4     |
| time/                 |          |
|    fps                | 503      |
|    iterations         | 200      |
|    time_elapsed       | 1        |
|    total_timesteps    | 231000   |
| train/        

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 449      |
|    ep_rew_mean        | 16.2     |
| time/                 |          |
|    fps                | 505      |
|    iterations         | 1400     |
|    time_elapsed       | 13       |
|    total_timesteps    | 237000   |
| train/                |          |
|    entropy_loss       | -0.483   |
|    explained_variance | -0.795   |
|    learning_rate      | 0.0007   |
|    n_updates          | 47399    |
|    policy_loss        | 0.0942   |
|    value_loss         | 0.528    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 458      |
|    ep_rew_mean        | 16.7     |
| time/                 |          |
|    fps                | 504      |
|    iterations         | 1500     |
|    time_elapsed       | 14       |
|    total_timesteps    | 237500   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 511      |
|    ep_rew_mean        | 14.7     |
| time/                 |          |
|    fps                | 505      |
|    iterations         | 700      |
|    time_elapsed       | 6        |
|    total_timesteps    | 243500   |
| train/                |          |
|    entropy_loss       | -0.469   |
|    explained_variance | 0.293    |
|    learning_rate      | 0.0007   |
|    n_updates          | 48699    |
|    policy_loss        | 0.0625   |
|    value_loss         | 0.393    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 511      |
|    ep_rew_mean        | 14.7     |
| time/                 |          |
|    fps                | 504      |
|    iterations         | 800      |
|    time_elapsed       | 7        |
|    total_timesteps    | 244000   |
| train/                |          |
|

Logging to logs\A2C_0
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 563      |
|    ep_rew_mean        | 7.76     |
| time/                 |          |
|    fps                | 501      |
|    iterations         | 100      |
|    time_elapsed       | 0        |
|    total_timesteps    | 250500   |
| train/                |          |
|    entropy_loss       | -0.614   |
|    explained_variance | 0.33     |
|    learning_rate      | 0.0007   |
|    n_updates          | 50099    |
|    policy_loss        | -0.715   |
|    value_loss         | 2.36     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 567      |
|    ep_rew_mean        | 3.92     |
| time/                 |          |
|    fps                | 498      |
|    iterations         | 200      |
|    time_elapsed       | 2        |
|    total_timesteps    | 251000   |
| train/        

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 621      |
|    ep_rew_mean        | -1.44    |
| time/                 |          |
|    fps                | 539      |
|    iterations         | 1400     |
|    time_elapsed       | 12       |
|    total_timesteps    | 257000   |
| train/                |          |
|    entropy_loss       | -0.251   |
|    explained_variance | 0.816    |
|    learning_rate      | 0.0007   |
|    n_updates          | 51399    |
|    policy_loss        | 0.0377   |
|    value_loss         | 0.285    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 621      |
|    ep_rew_mean        | -1.44    |
| time/                 |          |
|    fps                | 541      |
|    iterations         | 1500     |
|    time_elapsed       | 13       |
|    total_timesteps    | 257500   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 673      |
|    ep_rew_mean        | -7.4     |
| time/                 |          |
|    fps                | 502      |
|    iterations         | 700      |
|    time_elapsed       | 6        |
|    total_timesteps    | 263500   |
| train/                |          |
|    entropy_loss       | -0.437   |
|    explained_variance | 0.757    |
|    learning_rate      | 0.0007   |
|    n_updates          | 52699    |
|    policy_loss        | 1.38     |
|    value_loss         | 13.8     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 682      |
|    ep_rew_mean        | -8.28    |
| time/                 |          |
|    fps                | 503      |
|    iterations         | 800      |
|    time_elapsed       | 7        |
|    total_timesteps    | 264000   |
| train/                |          |
|

Logging to logs\A2C_0
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 736      |
|    ep_rew_mean        | -9.8     |
| time/                 |          |
|    fps                | 504      |
|    iterations         | 100      |
|    time_elapsed       | 0        |
|    total_timesteps    | 270500   |
| train/                |          |
|    entropy_loss       | -0.504   |
|    explained_variance | 0.842    |
|    learning_rate      | 0.0007   |
|    n_updates          | 54099    |
|    policy_loss        | 0.392    |
|    value_loss         | 2.01     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 745      |
|    ep_rew_mean        | -10.3    |
| time/                 |          |
|    fps                | 500      |
|    iterations         | 200      |
|    time_elapsed       | 1        |
|    total_timesteps    | 271000   |
| train/        

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 798      |
|    ep_rew_mean        | -16.2    |
| time/                 |          |
|    fps                | 520      |
|    iterations         | 1400     |
|    time_elapsed       | 13       |
|    total_timesteps    | 277000   |
| train/                |          |
|    entropy_loss       | -0.484   |
|    explained_variance | 0.979    |
|    learning_rate      | 0.0007   |
|    n_updates          | 55399    |
|    policy_loss        | 0.0869   |
|    value_loss         | 0.143    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 798      |
|    ep_rew_mean        | -16.2    |
| time/                 |          |
|    fps                | 519      |
|    iterations         | 1500     |
|    time_elapsed       | 14       |
|    total_timesteps    | 277500   |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 852      |
|    ep_rew_mean        | -20.6    |
| time/                 |          |
|    fps                | 492      |
|    iterations         | 700      |
|    time_elapsed       | 7        |
|    total_timesteps    | 283500   |
| train/                |          |
|    entropy_loss       | -0.456   |
|    explained_variance | 0.723    |
|    learning_rate      | 0.0007   |
|    n_updates          | 56699    |
|    policy_loss        | 0.388    |
|    value_loss         | 0.155    |
------------------------------------
-------------------------------------
| rollout/              |           |
|    ep_len_mean        | 861       |
|    ep_rew_mean        | -21.3     |
| time/                 |           |
|    fps                | 491       |
|    iterations         | 800       |
|    time_elapsed       | 8         |
|    total_timesteps    | 284000    |
| train/                |    

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 892      |
|    ep_rew_mean        | -29.6    |
| time/                 |          |
|    fps                | 521      |
|    iterations         | 2000     |
|    time_elapsed       | 19       |
|    total_timesteps    | 290000   |
| train/                |          |
|    entropy_loss       | -0.379   |
|    explained_variance | 0.552    |
|    learning_rate      | 0.0007   |
|    n_updates          | 57999    |
|    policy_loss        | -0.469   |
|    value_loss         | 1.2      |
------------------------------------


In [37]:
from stable_baselines3 import DQN

In [38]:
# Create directories for saving models and logs if they don't exist
models_dir = 'models/DQN'
logdir = 'logs'

# Create the models directory if it doesn't exist
if not os.path.exists(models_dir):
    os.makedirs(models_dir)

# Create the logs directory if it doesn't exist
if not os.path.exists(logdir):
    os.makedirs(logdir)

In [39]:
model = DQN('MlpPolicy', env, verbose=1,tensorboard_log=logdir)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [None]:
TIMESTEPS = 10000
for i in range (1,30):
    model.learn(total_timesteps=TIMESTEPS,reset_num_timesteps=False, tb_log_name='DQN')
    model.save(f"{models_dir}/{TIMESTEPS*i}")

Logging to logs\DQN_0
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 108      |
|    ep_rew_mean      | -188     |
|    exploration_rate | 0.591    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 10767    |
|    time_elapsed     | 0        |
|    total_timesteps  | 431      |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 98.8     |
|    ep_rew_mean      | -185     |
|    exploration_rate | 0.25     |
| time/               |          |
|    episodes         | 8        |
|    fps              | 11117    |
|    time_elapsed     | 0        |
|    total_timesteps  | 790      |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 98.8     |
|    ep_rew_mean      | -190     |
|    exploration_rate | 0.05     |
| time/               |          

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 100      |
|    ep_rew_mean      | -178     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 92       |
|    fps              | 11663    |
|    time_elapsed     | 0        |
|    total_timesteps  | 9216     |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 100      |
|    ep_rew_mean      | -177     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 96       |
|    fps              | 11674    |
|    time_elapsed     | 0        |
|    total_timesteps  | 9622     |
----------------------------------
Logging to logs\DQN_0
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 100      |
|    ep_rew_mean      | -182     |
|    exploration_rate | 0.05     |
| time/               |          

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 95.3     |
|    ep_rew_mean      | -174     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 180      |
|    fps              | 11764    |
|    time_elapsed     | 0        |
|    total_timesteps  | 17640    |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 96       |
|    ep_rew_mean      | -178     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 184      |
|    fps              | 11780    |
|    time_elapsed     | 0        |
|    total_timesteps  | 18028    |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 95.3     |
|    ep_rew_mean      | -180     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes       

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 91.3     |
|    ep_rew_mean      | -197     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 268      |
|    fps              | 11655    |
|    time_elapsed     | 0        |
|    total_timesteps  | 25728    |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 91.4     |
|    ep_rew_mean      | -195     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 272      |
|    fps              | 11669    |
|    time_elapsed     | 0        |
|    total_timesteps  | 26062    |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 92.6     |
|    ep_rew_mean      | -196     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes       

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 93.7     |
|    ep_rew_mean      | -210     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 356      |
|    fps              | 11518    |
|    time_elapsed     | 0        |
|    total_timesteps  | 34018    |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 93.1     |
|    ep_rew_mean      | -208     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 360      |
|    fps              | 11554    |
|    time_elapsed     | 0        |
|    total_timesteps  | 34331    |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 92.7     |
|    ep_rew_mean      | -212     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes       

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 89.4     |
|    ep_rew_mean      | -180     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 444      |
|    fps              | 11326    |
|    time_elapsed     | 0        |
|    total_timesteps  | 41916    |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 90.4     |
|    ep_rew_mean      | -178     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 448      |
|    fps              | 11394    |
|    time_elapsed     | 0        |
|    total_timesteps  | 42338    |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 90.3     |
|    ep_rew_mean      | -176     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes       

Logging to logs\DQN_0
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 92.6     |
|    ep_rew_mean      | -173     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 532      |
|    fps              | 333      |
|    time_elapsed     | 0        |
|    total_timesteps  | 50011    |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 7.34     |
|    n_updates        | 2        |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 91.8     |
|    ep_rew_mean      | -185     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 536      |
|    fps              | 613      |
|    time_elapsed     | 0        |
|    total_timesteps  | 50360    |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 1.44     

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 204      |
|    ep_rew_mean      | -303     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 596      |
|    fps              | 926      |
|    time_elapsed     | 7        |
|    total_timesteps  | 66862    |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.778    |
|    n_updates        | 4215     |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 214      |
|    ep_rew_mean      | -309     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 600      |
|    fps              | 924      |
|    time_elapsed     | 8        |
|    total_timesteps  | 68313    |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.456    |
|    n_updates      

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 385      |
|    ep_rew_mean      | -172     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 660      |
|    fps              | 893      |
|    time_elapsed     | 4        |
|    total_timesteps  | 94322    |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.709    |
|    n_updates        | 11080    |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 390      |
|    ep_rew_mean      | -169     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 664      |
|    fps              | 900      |
|    time_elapsed     | 6        |
|    total_timesteps  | 96080    |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.793    |
|    n_updates      

Logging to logs\DQN_0
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 778      |
|    ep_rew_mean      | -110     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 724      |
|    fps              | 952      |
|    time_elapsed     | 3        |
|    total_timesteps  | 153697   |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 1.03     |
|    n_updates        | 25924    |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 803      |
|    ep_rew_mean      | -107     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 728      |
|    fps              | 941      |
|    time_elapsed     | 8        |
|    total_timesteps  | 157697   |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.576    

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 999      |
|    ep_rew_mean      | -112     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 788      |
|    fps              | 943      |
|    time_elapsed     | 8        |
|    total_timesteps  | 217549   |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.543    |
|    n_updates        | 41887    |
----------------------------------
Logging to logs\DQN_0
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 999      |
|    ep_rew_mean      | -114     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 792      |
|    fps              | 952      |
|    time_elapsed     | 1        |
|    total_timesteps  | 221549   |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.455    

# Understanding the output

Sample:
| rollout/            |          |
|    ep_len_mean      | 101      |
|    ep_rew_mean      | -283     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 552      |
|    fps              | 879      |
|    time_elapsed     | 3        |
|    total_timesteps  | 52787    |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 1.35     |
|    n_updates     | 696     |
----------------------------------


let's break down each of the output parameters from this sample training log, which is often seen when training reinforcement learning models using libraries like Stable Baselines or OpenAI Gym.

### Rollout Parameters

- **`ep_len_mean` (Mean Episode Length)**: This is the average length of episodes over a certain period. A value of 101 suggests that, on average, episodes last for 101 timesteps. This can be an indicator of how long the agent is able to "survive" or maintain a certain level of performance in the environment.

- **`ep_rew_mean` (Mean Episode Reward)**: This is the average reward obtained per episode over a certain period. A value of -283 suggests that the agent is, on average, receiving a negative reward, which usually indicates poor performance.

- **`exploration_rate`**: This is the rate at which the agent explores the environment, usually governed by an ε-greedy strategy or similar. A value of 0.05 means that 5% of the time, the agent takes a random action to explore the state space.

### Time Parameters

- **`episodes`**: This indicates the total number of episodes that have been completed. A value of 552 suggests that the agent has completed 552 episodes.

- **`fps` (Frames Per Second)**: This is the speed of the simulation, indicating how many timesteps the model can process per second. A value of 879 means the model is processing 879 timesteps per second.

- **`time_elapsed`**: This is the total time elapsed in seconds since the training started. A value of 3 suggests that the training has been running for 3 seconds.

- **`total_timesteps`**: This indicates the total number of timesteps that have been processed so far. A value of 52,787 means the agent has interacted with the environment for that many timesteps.

### Training Parameters

- **`learning_rate`**: This is the learning rate used in the optimization algorithm. A value of 0.0001 is fairly typical for many RL problems.

- **`loss`**: This is the value of the loss function, which the optimization algorithm is trying to minimize. A value of 1.35 suggests the current "error" or "loss" in the model's predictions.

- **`n_updates`**: This indicates the number of times the model's parameters have been updated. A value of 696 suggests that 696 updates have been made.

Each of these metrics provides valuable insights into the training process, helping you understand the agent's performance, the efficiency of the training, and the stability of the learning.

In [1]:
from IPython.display import Image
Image(filename="tensorboard_pic.JPG")

<IPython.core.display.Image object>

The command `tensorboard --logdir=logs` is used to launch TensorBoard, a visualization tool for machine learning experiments, and point it to the directory where the log files are stored. In this case, the log files are expected to be in a directory named `logs`. Once TensorBoard is running, it will continuously monitor the `logs` directory for any updates and reflect those changes in its visualizations. This is particularly useful for tracking metrics, visualizing model architectures, and monitoring training progress in real-time. By setting up TensorBoard in this manner, you can gain valuable insights into the behavior and performance of your reinforcement learning models, making it easier to debug issues, optimize hyperparameters, and understand the learning dynamics.

!pip install tensorflow to install tensorboard #see https://stackoverflow.com/questions/33634008/how-do-i-install-tensorflows-tensorboard/33634101#33634101

tensorboard --logdir=logs
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.13.0 at http://localhost:6006/ (Press CTRL+C to quit)
