### To view TensorBoard:
    1: Run: tensorboard --logdir ray_results
    2: Open http://localhost:6006/ in a browser

In [1]:
import ray
import ray.rllib.agents.ppo as ppo
from ray.tune.logger import pretty_print
from ray import tune

## 0: RLlib Training APIs: 
1: At a high level, RLlib provides a Trainer class that holds a policy for environment interaction. Through the trainer interface, the policy can be trained, checkpointed, or used to compute an action. In multi-agent training, the trainer manages the querying and optimization of multiple policies at once.

2: rllib train --run DQN --env CartPole-v0  --config '{"num_workers": 8}'
    To see the tensorboard: tensorboard --logdir=~/ray_results

3: rllib rollout ~/ray_results/default/DQN_CartPole-v0_0upjmdgr0/checkpoint_1/checkpoint-1 \
    --run DQN --env CartPole-v0 --steps 10000

4: Loading and restoring a trained agent from a checkpoint is simple:
    
    agent = ppo.PPOTrainer(config=config, env=env_class)
    agent.restore(checkpoint_path)
    
5: Computing Actions

The simplest way to programmatically compute actions from a trained agent is to use trainer.compute_action(). This method preprocesses and filters the observation before passing it to the agent policy. Here is a simple example of testing a trained agent for one episode:

    # instantiate env class
    env = env_class(env_config)

    # run until episode ends
    episode_reward = 0
    done = False
    obs = env.reset()
    while not done:
        action = agent.compute_action(obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward
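
To try the loop above without a trained agent, the sketch below substitutes two hypothetical stand-ins: `ToyEnv` and `RandomAgent` are not part of RLlib or Gym and only mimic the `reset()`/`step()` and `compute_action()` call shapes used above. With a real setup, `agent` would be the restored `PPOTrainer` and `env` an instance of `env_class`.

```python
import random


class ToyEnv:
    """Hypothetical stand-in environment with the reset()/step() shape used above."""

    def __init__(self, env_config):
        self.max_steps = env_config.get("max_steps", 10)
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0  # dummy observation

    def step(self, action):
        self.t += 1
        reward = 1.0                      # +1 per step survived
        done = self.t >= self.max_steps   # episode ends after max_steps
        return float(self.t), reward, done, {}


class RandomAgent:
    """Hypothetical stand-in exposing compute_action() like a trainer."""

    def compute_action(self, obs):
        return random.choice([0, 1])


def run_episode(agent, env):
    """Run one episode and return the total reward."""
    episode_reward = 0
    done = False
    obs = env.reset()
    while not done:
        action = agent.compute_action(obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward
    return episode_reward


total = run_episode(RandomAgent(), ToyEnv({"max_steps": 10}))
print(total)
```

Wrapping the loop in a function such as `run_episode` makes it easy to average the reward over several evaluation episodes.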

In [2]:
ray.shutdown()
ray.init()

2021-10-15 10:50:05,689	INFO services.py:1263 -- View the Ray dashboard at http://127.0.0.1:8266


{'node_ip_address': '192.168.0.42',
 'raylet_ip_address': '192.168.0.42',
 'redis_address': '192.168.0.42:27869',
 'object_store_address': '/tmp/ray/session_2021-10-15_10-50-04_282399_1061/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2021-10-15_10-50-04_282399_1061/sockets/raylet',
 'webui_url': '127.0.0.1:8266',
 'session_dir': '/tmp/ray/session_2021-10-15_10-50-04_282399_1061',
 'metrics_export_port': 57969,
 'node_id': '7d4c98cc3697a0840fae3f832d847c4bc04e4aae4ee62657176f6e19'}

#### 1 Example of Training a PPO Agent

In [4]:
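# Start from PPO's defaults and override only what this example needs.
config = ppo.DEFAULT_CONFIG.copy()
config['num_gpus'] = 0      # CPU-only training
config['num_workers'] = 2   # two parallel rollout workers
trainer = ppo.PPOTrainer(config=config, env='CartPole-v0')

for i in range(30):
    result = trainer.train()  # one training iteration
    if i % 10 == 0:
        # Save a checkpoint on iterations 0, 10, and 20.
        checkpoint = trainer.save()
        print('checkpoint saved')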

2021-09-30 21:13:18,338	INFO trainer.py:714 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
2021-09-30 21:13:18,341	INFO ppo.py:158 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
2021-09-30 21:13:18,342	INFO trainer.py:726 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


checkpoint saved
checkpoint saved
checkpoint saved
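
Each call to `trainer.train()` returns a nested result dictionary. Assuming it has the same layout as the `Result for ...` blocks printed in the Tune example of this notebook, a few fields can be pulled out as sketched below. The helper name `summarize` is made up for illustration, and the sample values are copied by hand from the first `PPO_CartPole-v0_97f47_00001` report rather than produced by the run above.

```python
def summarize(result):
    """Return a one-line summary of a training result dictionary."""
    stats = result["info"]["learner"]["default_policy"]["learner_stats"]
    return (
        f"ts={result['agent_timesteps_total']} "
        f"reward_mean={result['episode_reward_mean']:.2f} "
        f"lr={stats['cur_lr']:.4f} "
        f"kl={stats['kl']:.4f}"
    )


# Sample values copied by hand from a result block in this notebook.
sample_result = {
    "agent_timesteps_total": 4000,
    "episode_reward_mean": 23.023121387283236,
    "info": {
        "learner": {
            "default_policy": {
                "learner_stats": {
                    "cur_lr": 0.0010000000474974513,
                    "kl": 0.032735127955675125,
                }
            }
        }
    },
}

print(summarize(sample_result))
```

Inside the training loop, `print(summarize(result))` would give a compact progress line; `pretty_print(result)`, imported at the top of the notebook, formats the full dictionary instead.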


#### 2 Example of Using Tune

In [3]:
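alg = 'PPO'
tune.run(
    alg,
    stop={'episode_reward_mean': 200},  # stop a trial once mean reward reaches 200
    config={
        'env': 'CartPole-v0',
        'num_gpus': 0,
        'num_workers': 2,
        # grid_search launches one trial per learning rate
        'lr': tune.grid_search([.01, .001, .0001]),
    },
)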
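
The call above asks Tune to expand `lr` into one trial per listed value and to stop each trial once `episode_reward_mean` reaches 200. The plain-Python sketch below only illustrates that behaviour: `expand_grid` and `should_stop` are hypothetical helpers, not Tune functions, and Tune's own implementation is more general.

```python
from itertools import product


def expand_grid(config, grid):
    """Yield one config dict per combination of grid values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        trial_config = dict(config)          # copy so the base config is untouched
        trial_config.update(zip(keys, values))
        yield trial_config


def should_stop(result, stop):
    """Return True once any stop metric has reached its threshold."""
    return any(result[name] >= threshold for name, threshold in stop.items())


base = {"env": "CartPole-v0", "num_gpus": 0, "num_workers": 2}
trials = list(expand_grid(base, {"lr": [0.01, 0.001, 0.0001]}))
print(len(trials))                # three trials, one per learning rate
print([t["lr"] for t in trials])

stop = {"episode_reward_mean": 200}
print(should_stop({"episode_reward_mean": 199.51}, stop))  # below threshold
print(should_stop({"episode_reward_mean": 200.0}, stop))   # threshold reached
```

This matches the output that follows: three trials appear in the status table, and a trial is marked `TERMINATED` once its reward column reaches 200.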

Trial name,status,loc,lr
PPO_CartPole-v0_97f47_00000,PENDING,,0.01
PPO_CartPole-v0_97f47_00001,PENDING,,0.001
PPO_CartPole-v0_97f47_00002,PENDING,,0.0001


[2m[36m(pid=1096)[0m 2021-10-15 10:50:22,046	INFO trainer.py:714 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=1096)[0m 2021-10-15 10:50:22,046	INFO ppo.py:158 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
[2m[36m(pid=1096)[0m 2021-10-15 10:50:22,046	INFO trainer.py:726 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=1100)[0m 2021-10-15 10:50:22,046	INFO trainer.py:714 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=1100)[0m 2021-10-15 10:50:22,046	INFO ppo.py:158 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
[2m[36m(pid=1100)[0m 2021-10-15 10:50:22,046	INFO trainer.py:7

Result for PPO_CartPole-v0_97f47_00001:
  agent_timesteps_total: 4000
  custom_metrics: {}
  date: 2021-10-15_10-50-31
  done: false
  episode_len_mean: 23.023121387283236
  episode_media: {}
  episode_reward_max: 64.0
  episode_reward_mean: 23.023121387283236
  episode_reward_min: 8.0
  episodes_this_iter: 173
  episodes_total: 173
  experiment_id: a7d670197b8449aca715ca10f26be361
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.0010000000474974513
          entropy: 0.6616495251655579
          entropy_coeff: 0.0
          kl: 0.032735127955675125
          model: {}
          policy_loss: -0.05019322782754898
          total_loss: 91.09257507324219
          vf_explained_var: 0.28667253255844116
          vf_loss: 91.13622283935547
    num_agent_steps_sampled: 4000
    num_agent_steps_trained: 4000
    num_steps_sampled: 4000
    num_steps_trained

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,,0.01,,,,,,,
PPO_CartPole-v0_97f47_00001,RUNNING,192.168.0.42:1099,0.001,1.0,3.87207,4000.0,23.0231,64.0,8.0,23.0231
PPO_CartPole-v0_97f47_00002,RUNNING,,0.0001,,,,,,,


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 4000
  custom_metrics: {}
  date: 2021-10-15_10-50-31
  done: false
  episode_len_mean: 23.53846153846154
  episode_media: {}
  episode_reward_max: 80.0
  episode_reward_mean: 23.53846153846154
  episode_reward_min: 9.0
  episodes_this_iter: 169
  episodes_total: 169
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.009999999776482582
          entropy: 0.6601420640945435
          entropy_coeff: 0.0
          kl: 0.03442293405532837
          model: {}
          policy_loss: -0.03388642519712448
          total_loss: 99.54802703857422
          vf_explained_var: 0.3093220889568329
          vf_loss: 99.57502746582031
    num_agent_steps_sampled: 4000
    num_agent_steps_trained: 4000
    num_steps_sampled: 4000
    num_steps_trained: 400

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,2,7.7063,8000,42.69,151,10,42.69
PPO_CartPole-v0_97f47_00001,RUNNING,192.168.0.42:1099,0.001,3,11.2618,12000,74.56,200,10,74.56
PPO_CartPole-v0_97f47_00002,RUNNING,192.168.0.42:1100,0.0001,2,7.69307,8000,43.16,148,10,43.16


Result for PPO_CartPole-v0_97f47_00002:
  agent_timesteps_total: 12000
  custom_metrics: {}
  date: 2021-10-15_10-50-39
  done: false
  episode_len_mean: 73.29
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 73.29
  episode_reward_min: 10.0
  episodes_this_iter: 31
  episodes_total: 305
  experiment_id: d20693eb10024d9ab0f562ddf7dfc3be
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 9.999999747378752e-05
          entropy: 0.5829192996025085
          entropy_coeff: 0.0
          kl: 0.008481566794216633
          model: {}
          policy_loss: -0.0166219063103199
          total_loss: 716.9107666015625
          vf_explained_var: 0.12892436981201172
          vf_loss: 716.9248657226562
    num_agent_steps_sampled: 12000
    num_agent_steps_trained: 12000
    num_steps_sampled: 12000
    num_steps_trained: 12000
  iterations_s

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,4,15.1435,16000,93.33,200,10,93.33
PPO_CartPole-v0_97f47_00001,RUNNING,192.168.0.42:1099,0.001,5,18.6934,20000,126.87,200,14,126.87
PPO_CartPole-v0_97f47_00002,RUNNING,192.168.0.42:1100,0.0001,4,15.0983,16000,101.77,200,11,101.77


Result for PPO_CartPole-v0_97f47_00002:
  agent_timesteps_total: 20000
  custom_metrics: {}
  date: 2021-10-15_10-50-46
  done: false
  episode_len_mean: 129.05
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 129.05
  episode_reward_min: 11.0
  episodes_this_iter: 21
  episodes_total: 350
  experiment_id: d20693eb10024d9ab0f562ddf7dfc3be
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.15000000596046448
          cur_lr: 9.999999747378752e-05
          entropy: 0.5596845746040344
          entropy_coeff: 0.0
          kl: 0.00785556435585022
          model: {}
          policy_loss: -0.009856896474957466
          total_loss: 341.8866271972656
          vf_explained_var: 0.37634173035621643
          vf_loss: 341.8952941894531
    num_agent_steps_sampled: 20000
    num_agent_steps_trained: 20000
    num_steps_sampled: 20000
    num_steps_trained: 20000
  iteration

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,6,22.5009,24000,148.91,200,18,148.91
PPO_CartPole-v0_97f47_00001,RUNNING,192.168.0.42:1099,0.001,7,25.7851,28000,159.84,200,15,159.84
PPO_CartPole-v0_97f47_00002,RUNNING,192.168.0.42:1100,0.0001,6,22.4216,24000,158.32,200,13,158.32


Result for PPO_CartPole-v0_97f47_00002:
  agent_timesteps_total: 28000
  custom_metrics: {}
  date: 2021-10-15_10-50-53
  done: false
  episode_len_mean: 169.02
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 169.02
  episode_reward_min: 16.0
  episodes_this_iter: 22
  episodes_total: 394
  experiment_id: d20693eb10024d9ab0f562ddf7dfc3be
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.15000000596046448
          cur_lr: 9.999999747378752e-05
          entropy: 0.567983865737915
          entropy_coeff: 0.0
          kl: 0.0063636647537350655
          model: {}
          policy_loss: -0.0028207108844071627
          total_loss: 323.7685546875
          vf_explained_var: 0.38853269815444946
          vf_loss: 323.7704162597656
    num_agent_steps_sampled: 28000
    num_agent_steps_trained: 28000
    num_steps_sampled: 28000
    num_steps_trained: 28000
  iterations

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,8,29.5732,32000,178.08,200,22,178.08
PPO_CartPole-v0_97f47_00001,RUNNING,192.168.0.42:1099,0.001,9,32.7769,36000,180.46,200,26,180.46
PPO_CartPole-v0_97f47_00002,RUNNING,192.168.0.42:1100,0.0001,8,29.4613,32000,183.15,200,57,183.15


Result for PPO_CartPole-v0_97f47_00002:
  agent_timesteps_total: 36000
  custom_metrics: {}
  date: 2021-10-15_10-51-00
  done: false
  episode_len_mean: 188.57
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 188.57
  episode_reward_min: 92.0
  episodes_this_iter: 20
  episodes_total: 435
  experiment_id: d20693eb10024d9ab0f562ddf7dfc3be
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.15000000596046448
          cur_lr: 9.999999747378752e-05
          entropy: 0.5437036156654358
          entropy_coeff: 0.0
          kl: 0.0033580162562429905
          model: {}
          policy_loss: 0.0004592960758600384
          total_loss: 577.7265014648438
          vf_explained_var: 0.09605903178453445
          vf_loss: 577.7254638671875
    num_agent_steps_sampled: 36000
    num_agent_steps_trained: 36000
    num_steps_sampled: 36000
    num_steps_trained: 36000
  iterati

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,10,36.6239,40000,176.5,200,64,176.5
PPO_CartPole-v0_97f47_00001,RUNNING,192.168.0.42:1099,0.001,11,39.7845,44000,194.15,200,90,194.15
PPO_CartPole-v0_97f47_00002,RUNNING,192.168.0.42:1100,0.0001,10,36.4843,40000,192.06,200,92,192.06


Result for PPO_CartPole-v0_97f47_00002:
  agent_timesteps_total: 44000
  custom_metrics: {}
  date: 2021-10-15_10-51-07
  done: false
  episode_len_mean: 196.19
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 196.19
  episode_reward_min: 92.0
  episodes_this_iter: 20
  episodes_total: 475
  experiment_id: d20693eb10024d9ab0f562ddf7dfc3be
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.07500000298023224
          cur_lr: 9.999999747378752e-05
          entropy: 0.5528479814529419
          entropy_coeff: 0.0
          kl: 0.003560141660273075
          model: {}
          policy_loss: -0.0008763718069531024
          total_loss: 568.4649658203125
          vf_explained_var: -0.00033595471177250147
          vf_loss: 568.465576171875
    num_agent_steps_sampled: 44000
    num_agent_steps_trained: 44000
    num_steps_sampled: 44000
    num_steps_trained: 44000
  iter

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,12,43.6388,48000,174.65,200,64,174.65
PPO_CartPole-v0_97f47_00001,RUNNING,192.168.0.42:1099,0.001,13,46.715,52000,197.19,200,114,197.19
PPO_CartPole-v0_97f47_00002,RUNNING,192.168.0.42:1100,0.0001,12,43.5018,48000,197.25,200,92,197.25


Result for PPO_CartPole-v0_97f47_00002:
  agent_timesteps_total: 52000
  custom_metrics: {}
  date: 2021-10-15_10-51-14
  done: false
  episode_len_mean: 198.77
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 198.77
  episode_reward_min: 126.0
  episodes_this_iter: 20
  episodes_total: 515
  experiment_id: d20693eb10024d9ab0f562ddf7dfc3be
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.03750000149011612
          cur_lr: 9.999999747378752e-05
          entropy: 0.5157675743103027
          entropy_coeff: 0.0
          kl: 0.003722771303728223
          model: {}
          policy_loss: -0.004048094619065523
          total_loss: 462.3150939941406
          vf_explained_var: 0.15132272243499756
          vf_loss: 462.31903076171875
    num_agent_steps_sampled: 52000
    num_agent_steps_trained: 52000
    num_steps_sampled: 52000
    num_steps_trained: 52000
  iterat

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,14,50.5949,56000,182.13,200,64,182.13
PPO_CartPole-v0_97f47_00002,RUNNING,192.168.0.42:1100,0.0001,14,50.4287,56000,199.51,200,151,199.51
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00002:
  agent_timesteps_total: 60000
  custom_metrics: {}
  date: 2021-10-15_10-51-21
  done: false
  episode_len_mean: 199.51
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 199.51
  episode_reward_min: 151.0
  episodes_this_iter: 20
  episodes_total: 555
  experiment_id: d20693eb10024d9ab0f562ddf7dfc3be
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.01875000074505806
          cur_lr: 9.999999747378752e-05
          entropy: 0.49684691429138184
          entropy_coeff: 0.0
          kl: 0.007127116899937391
          model: {}
          policy_loss: -0.0019269874319434166
          total_loss: 472.16656494140625
          vf_explained_var: 0.22081834077835083
          vf_loss: 472.1683349609375
    num_agent_steps_sampled: 60000
    num_agent_steps_trained: 60000
    num_steps_sampled: 60000
    num_steps_trained: 60000
  iter

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,16,56.8174,64000,191.99,200,146,191.99
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 68000
  custom_metrics: {}
  date: 2021-10-15_10-51-27
  done: false
  episode_len_mean: 193.3
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 193.3
  episode_reward_min: 146.0
  episodes_this_iter: 20
  episodes_total: 608
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 3.417187452316284
          cur_lr: 0.009999999776482582
          entropy: 0.15705595910549164
          entropy_coeff: 0.0
          kl: 0.017499247565865517
          model: {}
          policy_loss: 0.006452866364270449
          total_loss: 406.8672180175781
          vf_explained_var: 0.4457138180732727
          vf_loss: 406.8009033203125
    num_agent_steps_sampled: 68000
    num_agent_steps_trained: 68000
    num_steps_sampled: 68000
    num_steps_trained: 68000
  iterations_si

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,20,66.4922,80000,192.34,200,140,192.34
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 92000
  custom_metrics: {}
  date: 2021-10-15_10-51-41
  done: false
  episode_len_mean: 192.1
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 192.1
  episode_reward_min: 140.0
  episodes_this_iter: 21
  episodes_total: 734
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 2.5628905296325684
          cur_lr: 0.009999999776482582
          entropy: 0.09845804423093796
          entropy_coeff: 0.0
          kl: 0.012499627657234669
          model: {}
          policy_loss: 0.0034497424494475126
          total_loss: 256.9383544921875
          vf_explained_var: 0.629697859287262
          vf_loss: 256.9028625488281
    num_agent_steps_sampled: 92000
    num_agent_steps_trained: 92000
    num_steps_sampled: 92000
    num_steps_trained: 92000
  iterations_s

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,23,73.4116,92000,192.1,200,140,192.1
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 104000
  custom_metrics: {}
  date: 2021-10-15_10-51-48
  done: false
  episode_len_mean: 193.52
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 193.52
  episode_reward_min: 141.0
  episodes_this_iter: 21
  episodes_total: 795
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 3.8443360328674316
          cur_lr: 0.009999999776482582
          entropy: 0.0751693919301033
          entropy_coeff: 0.0
          kl: 0.0021325002890080214
          model: {}
          policy_loss: 0.005034258123487234
          total_loss: 212.39666748046875
          vf_explained_var: 0.7189307808876038
          vf_loss: 212.38343811035156
    num_agent_steps_sampled: 104000
    num_agent_steps_trained: 104000
    num_steps_sampled: 104000
    num_steps_trained: 104000
  ite

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,26,80.3135,104000,193.52,200,141,193.52
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 116000
  custom_metrics: {}
  date: 2021-10-15_10-51-55
  done: false
  episode_len_mean: 195.38
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 195.38
  episode_reward_min: 146.0
  episodes_this_iter: 20
  episodes_total: 856
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.9610840082168579
          cur_lr: 0.009999999776482582
          entropy: 0.07955187559127808
          entropy_coeff: 0.0
          kl: 0.011995332315564156
          model: {}
          policy_loss: 0.0031312373466789722
          total_loss: 193.80494689941406
          vf_explained_var: 0.6406446099281311
          vf_loss: 193.790283203125
    num_agent_steps_sampled: 116000
    num_agent_steps_trained: 116000
    num_steps_sampled: 116000
    num_steps_trained: 116000
  iter

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,29,87.297,116000,195.38,200,146,195.38
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 128000
  custom_metrics: {}
  date: 2021-10-15_10-52-02
  done: false
  episode_len_mean: 193.17
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 193.17
  episode_reward_min: 144.0
  episodes_this_iter: 21
  episodes_total: 919
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.9610840082168579
          cur_lr: 0.009999999776482582
          entropy: 0.06383916735649109
          entropy_coeff: 0.0
          kl: 0.009253853000700474
          model: {}
          policy_loss: 0.000588017632253468
          total_loss: 278.4478454589844
          vf_explained_var: 0.6496924757957458
          vf_loss: 278.4383850097656
    num_agent_steps_sampled: 128000
    num_agent_steps_trained: 128000
    num_steps_sampled: 128000
    num_steps_trained: 128000
  itera

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,32,94.2268,128000,193.17,200,144,193.17
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 140000
  custom_metrics: {}
  date: 2021-10-15_10-52-09
  done: false
  episode_len_mean: 195.02
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 195.02
  episode_reward_min: 143.0
  episodes_this_iter: 20
  episodes_total: 980
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 1.441625952720642
          cur_lr: 0.009999999776482582
          entropy: 0.0636262446641922
          entropy_coeff: 0.0
          kl: 0.030850810930132866
          model: {}
          policy_loss: 0.008732305839657784
          total_loss: 293.4499206542969
          vf_explained_var: 0.5388401746749878
          vf_loss: 293.3966979980469
    num_agent_steps_sampled: 140000
    num_agent_steps_trained: 140000
    num_steps_sampled: 140000
    num_steps_trained: 140000
  iterati

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,35,101.131,140000,195.02,200,143,195.02
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 152000
  custom_metrics: {}
  date: 2021-10-15_10-52-16
  done: false
  episode_len_mean: 194.35
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 194.35
  episode_reward_min: 132.0
  episodes_this_iter: 21
  episodes_total: 1042
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 4.865487575531006
          cur_lr: 0.009999999776482582
          entropy: 0.07182817161083221
          entropy_coeff: 0.0
          kl: 0.014971381984651089
          model: {}
          policy_loss: 0.003922880627214909
          total_loss: 193.9669647216797
          vf_explained_var: 0.7505271434783936
          vf_loss: 193.89019775390625
    num_agent_steps_sampled: 152000
    num_agent_steps_trained: 152000
    num_steps_sampled: 152000
    num_steps_trained: 152000
  iter

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,38,108.046,152000,194.35,200,132,194.35
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 164000
  custom_metrics: {}
  date: 2021-10-15_10-52-23
  done: false
  episode_len_mean: 195.06
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 195.06
  episode_reward_min: 132.0
  episodes_this_iter: 20
  episodes_total: 1103
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 4.865487575531006
          cur_lr: 0.009999999776482582
          entropy: 0.07145307213068008
          entropy_coeff: 0.0
          kl: 0.01787406951189041
          model: {}
          policy_loss: 0.0057264831848442554
          total_loss: 162.24905395507812
          vf_explained_var: 0.673206090927124
          vf_loss: 162.1563720703125
    num_agent_steps_sampled: 164000
    num_agent_steps_trained: 164000
    num_steps_sampled: 164000
    num_steps_trained: 164000
  itera

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,41,114.965,164000,195.06,200,132,195.06
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 176000
  custom_metrics: {}
  date: 2021-10-15_10-52-30
  done: false
  episode_len_mean: 195.08
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 195.08
  episode_reward_min: 135.0
  episodes_this_iter: 20
  episodes_total: 1165
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 4.865487575531006
          cur_lr: 0.009999999776482582
          entropy: 0.07095066457986832
          entropy_coeff: 0.0
          kl: 0.018658123910427094
          model: {}
          policy_loss: 0.010569063015282154
          total_loss: 266.86920166015625
          vf_explained_var: 0.629869282245636
          vf_loss: 266.7678527832031
    num_agent_steps_sampled: 176000
    num_agent_steps_trained: 176000
    num_steps_sampled: 176000
    num_steps_trained: 176000
  itera

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,44,121.877,176000,195.08,200,135,195.08
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 188000
  custom_metrics: {}
  date: 2021-10-15_10-52-37
  done: false
  episode_len_mean: 193.63
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 193.63
  episode_reward_min: 135.0
  episodes_this_iter: 20
  episodes_total: 1227
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 4.865487575531006
          cur_lr: 0.009999999776482582
          entropy: 0.05926322564482689
          entropy_coeff: 0.0
          kl: 0.02037595957517624
          model: {}
          policy_loss: 0.004495890345424414
          total_loss: 164.83383178710938
          vf_explained_var: 0.7539454102516174
          vf_loss: 164.73020935058594
    num_agent_steps_sampled: 188000
    num_agent_steps_trained: 188000
    num_steps_sampled: 188000
    num_steps_trained: 188000
  iter

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,47,128.782,188000,193.63,200,135,193.63
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 200000
  custom_metrics: {}
  date: 2021-10-15_10-52-44
  done: false
  episode_len_mean: 194.7
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 194.7
  episode_reward_min: 137.0
  episodes_this_iter: 20
  episodes_total: 1289
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 16.4210205078125
          cur_lr: 0.009999999776482582
          entropy: 0.06893593072891235
          entropy_coeff: 0.0
          kl: 0.01765809766948223
          model: {}
          policy_loss: 0.0012831453932449222
          total_loss: 185.15457153320312
          vf_explained_var: 0.7418593764305115
          vf_loss: 184.8633270263672
    num_agent_steps_sampled: 200000
    num_agent_steps_trained: 200000
    num_steps_sampled: 200000
    num_steps_trained: 200000
  iterati

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,50,135.688,200000,194.7,200,137,194.7
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 212000
  custom_metrics: {}
  date: 2021-10-15_10-52-50
  done: false
  episode_len_mean: 190.78
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 190.78
  episode_reward_min: 136.0
  episodes_this_iter: 21
  episodes_total: 1353
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 8.21051025390625
          cur_lr: 0.009999999776482582
          entropy: 0.06738966703414917
          entropy_coeff: 0.0
          kl: 0.0019532055594027042
          model: {}
          policy_loss: 0.0035769108217209578
          total_loss: 259.8935241699219
          vf_explained_var: 0.5878764390945435
          vf_loss: 259.8739318847656
    num_agent_steps_sampled: 212000
    num_agent_steps_trained: 212000
    num_steps_sampled: 212000
    num_steps_trained: 212000
  iter

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,53,142.597,212000,190.78,200,136,190.78
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 224000
  custom_metrics: {}
  date: 2021-10-15_10-52-57
  done: false
  episode_len_mean: 191.46
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 191.46
  episode_reward_min: 123.0
  episodes_this_iter: 22
  episodes_total: 1416
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 1.0263137817382812
          cur_lr: 0.009999999776482582
          entropy: 0.06617111712694168
          entropy_coeff: 0.0
          kl: 0.00456637516617775
          model: {}
          policy_loss: 0.001747599570080638
          total_loss: 217.2114715576172
          vf_explained_var: 0.7338910698890686
          vf_loss: 217.20504760742188
    num_agent_steps_sampled: 224000
    num_agent_steps_trained: 224000
    num_steps_sampled: 224000
    num_steps_trained: 224000
  iter

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,56,149.499,224000,191.46,200,123,191.46
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 236000
  custom_metrics: {}
  date: 2021-10-15_10-53-04
  done: false
  episode_len_mean: 192.94
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 192.94
  episode_reward_min: 121.0
  episodes_this_iter: 20
  episodes_total: 1477
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.7697353363037109
          cur_lr: 0.009999999776482582
          entropy: 0.053205594420433044
          entropy_coeff: 0.0
          kl: 0.035875000059604645
          model: {}
          policy_loss: 0.009056166745722294
          total_loss: 283.8392333984375
          vf_explained_var: 0.43887683749198914
          vf_loss: 283.80255126953125
    num_agent_steps_sampled: 236000
    num_agent_steps_trained: 236000
    num_steps_sampled: 236000
    num_steps_trained: 236000
  i

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,59,156.413,236000,192.94,200,121,192.94
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 248000
  custom_metrics: {}
  date: 2021-10-15_10-53-11
  done: false
  episode_len_mean: 192.18
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 192.18
  episode_reward_min: 107.0
  episodes_this_iter: 21
  episodes_total: 1540
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 2.5978567600250244
          cur_lr: 0.009999999776482582
          entropy: 0.045058246701955795
          entropy_coeff: 0.0
          kl: 0.026448369026184082
          model: {}
          policy_loss: 0.0026669239159673452
          total_loss: 308.12518310546875
          vf_explained_var: 0.5753390192985535
          vf_loss: 308.0538330078125
    num_agent_steps_sampled: 248000
    num_agent_steps_trained: 248000
    num_steps_sampled: 248000
    num_steps_trained: 248000
  i

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,62,163.302,248000,192.18,200,107,192.18
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 260000
  custom_metrics: {}
  date: 2021-10-15_10-53-18
  done: false
  episode_len_mean: 193.17
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 193.17
  episode_reward_min: 118.0
  episodes_this_iter: 20
  episodes_total: 1601
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 5.84517765045166
          cur_lr: 0.009999999776482582
          entropy: 0.036678045988082886
          entropy_coeff: 0.0
          kl: 0.010560956783592701
          model: {}
          policy_loss: 0.0032543730922043324
          total_loss: 260.6420593261719
          vf_explained_var: 0.6794266700744629
          vf_loss: 260.57708740234375
    num_agent_steps_sampled: 260000
    num_agent_steps_trained: 260000
    num_steps_sampled: 260000
    num_steps_trained: 260000
  ite

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,65,170.207,260000,193.17,200,118,193.17
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 272000
  custom_metrics: {}
  date: 2021-10-15_10-53-25
  done: false
  episode_len_mean: 192.73
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 192.73
  episode_reward_min: 143.0
  episodes_this_iter: 21
  episodes_total: 1665
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 5.84517765045166
          cur_lr: 0.009999999776482582
          entropy: 0.03346505016088486
          entropy_coeff: 0.0
          kl: 0.020436739549040794
          model: {}
          policy_loss: 0.002027573063969612
          total_loss: 152.76043701171875
          vf_explained_var: 0.6178138852119446
          vf_loss: 152.63893127441406
    num_agent_steps_sampled: 272000
    num_agent_steps_trained: 272000
    num_steps_sampled: 272000
    num_steps_trained: 272000
  iter

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,68,177.123,272000,192.73,200,143,192.73
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 284000
  custom_metrics: {}
  date: 2021-10-15_10-53-32
  done: false
  episode_len_mean: 194.15
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 194.15
  episode_reward_min: 143.0
  episodes_this_iter: 20
  episodes_total: 1725
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 8.767766952514648
          cur_lr: 0.009999999776482582
          entropy: 0.03575684875249863
          entropy_coeff: 0.0
          kl: 0.02222457528114319
          model: {}
          policy_loss: 0.009206690825521946
          total_loss: 203.05517578125
          vf_explained_var: 0.6771593689918518
          vf_loss: 202.85108947753906
    num_agent_steps_sampled: 284000
    num_agent_steps_trained: 284000
    num_steps_sampled: 284000
    num_steps_trained: 284000
  iterati

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,71,184.027,284000,194.15,200,143,194.15
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 296000
  custom_metrics: {}
  date: 2021-10-15_10-53-39
  done: false
  episode_len_mean: 197.72
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 197.72
  episode_reward_min: 146.0
  episodes_this_iter: 21
  episodes_total: 1786
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 13.151650428771973
          cur_lr: 0.009999999776482582
          entropy: 0.02526565082371235
          entropy_coeff: 0.0
          kl: 0.009685052558779716
          model: {}
          policy_loss: 0.005840488243848085
          total_loss: 415.63616943359375
          vf_explained_var: 0.3977401852607727
          vf_loss: 415.5029602050781
    num_agent_steps_sampled: 296000
    num_agent_steps_trained: 296000
    num_steps_sampled: 296000
    num_steps_trained: 296000
  ite

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,74,190.974,296000,197.72,200,146,197.72
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 308000
  custom_metrics: {}
  date: 2021-10-15_10-53-46
  done: false
  episode_len_mean: 196.62
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 196.62
  episode_reward_min: 142.0
  episodes_this_iter: 21
  episodes_total: 1847
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 19.727476119995117
          cur_lr: 0.009999999776482582
          entropy: 0.019349191337823868
          entropy_coeff: 0.0
          kl: 0.022581012919545174
          model: {}
          policy_loss: 0.006893371231853962
          total_loss: 226.9624786376953
          vf_explained_var: 0.6425521969795227
          vf_loss: 226.51011657714844
    num_agent_steps_sampled: 308000
    num_agent_steps_trained: 308000
    num_steps_sampled: 308000
    num_steps_trained: 308000
  it

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,77,197.887,308000,196.62,200,142,196.62
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 320000
  custom_metrics: {}
  date: 2021-10-15_10-53-53
  done: false
  episode_len_mean: 195.41
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 195.41
  episode_reward_min: 142.0
  episodes_this_iter: 20
  episodes_total: 1908
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 66.58023071289062
          cur_lr: 0.009999999776482582
          entropy: 0.01946626417338848
          entropy_coeff: 0.0
          kl: 0.02720554545521736
          model: {}
          policy_loss: 0.007632967084646225
          total_loss: 381.4862976074219
          vf_explained_var: 0.5989806652069092
          vf_loss: 379.66729736328125
    num_agent_steps_sampled: 320000
    num_agent_steps_trained: 320000
    num_steps_sampled: 320000
    num_steps_trained: 320000
  itera

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,80,204.795,320000,195.41,200,142,195.41
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 332000
  custom_metrics: {}
  date: 2021-10-15_10-54-00
  done: false
  episode_len_mean: 196.24
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 196.24
  episode_reward_min: 140.0
  episodes_this_iter: 21
  episodes_total: 1970
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 224.70826721191406
          cur_lr: 0.009999999776482582
          entropy: 0.021055147051811218
          entropy_coeff: 0.0
          kl: 0.016842177137732506
          model: {}
          policy_loss: 0.0024194265715777874
          total_loss: 196.8347930908203
          vf_explained_var: 0.6409368515014648
          vf_loss: 193.0478057861328
    num_agent_steps_sampled: 332000
    num_agent_steps_trained: 332000
    num_steps_sampled: 332000
    num_steps_trained: 332000
  it

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,83,211.696,332000,196.24,200,140,196.24
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 344000
  custom_metrics: {}
  date: 2021-10-15_10-54-07
  done: false
  episode_len_mean: 195.69
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 195.69
  episode_reward_min: 140.0
  episodes_this_iter: 20
  episodes_total: 2031
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 224.70826721191406
          cur_lr: 0.009999999776482582
          entropy: 0.027449363842606544
          entropy_coeff: 0.0
          kl: 0.01374414935708046
          model: {}
          policy_loss: 0.004829069133847952
          total_loss: 185.62564086914062
          vf_explained_var: 0.7453989386558533
          vf_loss: 182.5323944091797
    num_agent_steps_sampled: 344000
    num_agent_steps_trained: 344000
    num_steps_sampled: 344000
    num_steps_trained: 344000
  ite

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,86,218.613,344000,195.69,200,140,195.69
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 356000
  custom_metrics: {}
  date: 2021-10-15_10-54-14
  done: false
  episode_len_mean: 196.21
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 196.21
  episode_reward_min: 146.0
  episodes_this_iter: 20
  episodes_total: 2093
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 224.70826721191406
          cur_lr: 0.009999999776482582
          entropy: 0.028140677139163017
          entropy_coeff: 0.0
          kl: 0.024204334244132042
          model: {}
          policy_loss: 0.006712507456541061
          total_loss: 247.04502868652344
          vf_explained_var: 0.5732285380363464
          vf_loss: 241.59939575195312
    num_agent_steps_sampled: 356000
    num_agent_steps_trained: 356000
    num_steps_sampled: 356000
    num_steps_trained: 356000
  i

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,89,225.518,356000,196.21,200,146,196.21
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 368000
  custom_metrics: {}
  date: 2021-10-15_10-54-21
  done: false
  episode_len_mean: 194.39
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 194.39
  episode_reward_min: 142.0
  episodes_this_iter: 20
  episodes_total: 2154
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 758.3904418945312
          cur_lr: 0.009999999776482582
          entropy: 0.023459644988179207
          entropy_coeff: 0.0
          kl: 0.02090902253985405
          model: {}
          policy_loss: 0.0016999930376186967
          total_loss: 182.66595458984375
          vf_explained_var: 0.7748510241508484
          vf_loss: 166.80706787109375
    num_agent_steps_sampled: 368000
    num_agent_steps_trained: 368000
    num_steps_sampled: 368000
    num_steps_trained: 368000
  it

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,92,232.418,368000,194.39,200,142,194.39
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 380000
  custom_metrics: {}
  date: 2021-10-15_10-54-28
  done: false
  episode_len_mean: 191.99
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 191.99
  episode_reward_min: 128.0
  episodes_this_iter: 21
  episodes_total: 2218
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 1706.37841796875
          cur_lr: 0.009999999776482582
          entropy: 0.02332630567252636
          entropy_coeff: 0.0
          kl: 0.018093610182404518
          model: {}
          policy_loss: 0.0061448197811841965
          total_loss: 292.1915283203125
          vf_explained_var: 0.6776750683784485
          vf_loss: 261.31085205078125
    num_agent_steps_sampled: 380000
    num_agent_steps_trained: 380000
    num_steps_sampled: 380000
    num_steps_trained: 380000
  iter

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,95,239.342,380000,191.99,200,128,191.99
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 392000
  custom_metrics: {}
  date: 2021-10-15_10-54-35
  done: false
  episode_len_mean: 194.3
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 194.3
  episode_reward_min: 141.0
  episodes_this_iter: 20
  episodes_total: 2279
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 3839.3515625
          cur_lr: 0.009999999776482582
          entropy: 0.023386547341942787
          entropy_coeff: 0.0
          kl: 0.017789440229535103
          model: {}
          policy_loss: 0.001947210170328617
          total_loss: 348.0703430175781
          vf_explained_var: 0.5118612051010132
          vf_loss: 279.7684326171875
    num_agent_steps_sampled: 392000
    num_agent_steps_trained: 392000
    num_steps_sampled: 392000
    num_steps_trained: 392000
  iterations_

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,98,246.243,392000,194.3,200,141,194.3
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 404000
  custom_metrics: {}
  date: 2021-10-15_10-54-42
  done: false
  episode_len_mean: 195.03
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 195.03
  episode_reward_min: 142.0
  episodes_this_iter: 21
  episodes_total: 2342
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 8638.541015625
          cur_lr: 0.009999999776482582
          entropy: 0.022322041913866997
          entropy_coeff: 0.0
          kl: 0.017135147005319595
          model: {}
          policy_loss: 0.010686936788260937
          total_loss: 350.6269836425781
          vf_explained_var: 0.7103278040885925
          vf_loss: 202.59361267089844
    num_agent_steps_sampled: 404000
    num_agent_steps_trained: 404000
    num_steps_sampled: 404000
    num_steps_trained: 404000
  iterat

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,101,253.13,404000,195.03,200,142,195.03
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 416000
  custom_metrics: {}
  date: 2021-10-15_10-54-49
  done: false
  episode_len_mean: 198.79
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 198.79
  episode_reward_min: 160.0
  episodes_this_iter: 20
  episodes_total: 2402
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 19436.716796875
          cur_lr: 0.009999999776482582
          entropy: 0.021654171869158745
          entropy_coeff: 0.0
          kl: 0.022682569921016693
          model: {}
          policy_loss: -0.004589298274368048
          total_loss: 691.1602783203125
          vf_explained_var: 0.6154492497444153
          vf_loss: 250.29017639160156
    num_agent_steps_sampled: 416000
    num_agent_steps_trained: 416000
    num_steps_sampled: 416000
    num_steps_trained: 416000
  iter

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,104,260.047,416000,198.79,200,160,198.79
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 428000
  custom_metrics: {}
  date: 2021-10-15_10-54-56
  done: false
  episode_len_mean: 198.13
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 198.13
  episode_reward_min: 169.0
  episodes_this_iter: 20
  episodes_total: 2462
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 29155.076171875
          cur_lr: 0.009999999776482582
          entropy: 0.022679787129163742
          entropy_coeff: 0.0
          kl: 0.02036457695066929
          model: {}
          policy_loss: 0.0032970828469842672
          total_loss: 728.3359985351562
          vf_explained_var: 0.8014825582504272
          vf_loss: 134.60195922851562
    num_agent_steps_sampled: 428000
    num_agent_steps_trained: 428000
    num_steps_sampled: 428000
    num_steps_trained: 428000
  itera

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,107,266.958,428000,198.13,200,169,198.13
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 440000
  custom_metrics: {}
  date: 2021-10-15_10-55-02
  done: false
  episode_len_mean: 196.82
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 196.82
  episode_reward_min: 146.0
  episodes_this_iter: 21
  episodes_total: 2524
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 43732.61328125
          cur_lr: 0.009999999776482582
          entropy: 0.02334582805633545
          entropy_coeff: 0.0
          kl: 0.012013006955385208
          model: {}
          policy_loss: 0.003686140524223447
          total_loss: 674.8966674804688
          vf_explained_var: 0.7558326125144958
          vf_loss: 149.5327911376953
    num_agent_steps_sampled: 440000
    num_agent_steps_trained: 440000
    num_steps_sampled: 440000
    num_steps_trained: 440000
  iteratio

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,110,273.853,440000,196.82,200,146,196.82
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 452000
  custom_metrics: {}
  date: 2021-10-15_10-55-09
  done: false
  episode_len_mean: 194.98
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 194.98
  episode_reward_min: 141.0
  episodes_this_iter: 20
  episodes_total: 2585
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 65598.921875
          cur_lr: 0.009999999776482582
          entropy: 0.02137686498463154
          entropy_coeff: 0.0
          kl: 0.0178182665258646
          model: {}
          policy_loss: 0.005861296784132719
          total_loss: 1397.6710205078125
          vf_explained_var: 0.6271872520446777
          vf_loss: 228.80621337890625
    num_agent_steps_sampled: 452000
    num_agent_steps_trained: 452000
    num_steps_sampled: 452000
    num_steps_trained: 452000
  iterations

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,113,280.76,452000,194.98,200,141,194.98
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0
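
Note on reading the learner_stats above: for the lr=0.01 trial, total_loss grows far faster than vf_loss as cur_kl_coeff climbs. A minimal sketch follows, assuming total_loss is roughly policy_loss + cur_kl_coeff * kl + vf_loss_coeff * vf_loss - entropy_coeff * entropy, with vf_loss_coeff at an assumed value of 1.0 (it is not shown in this log). The numbers are copied from the result block at 452000 timesteps; the helper name reconstruct_total_loss is hypothetical and not part of RLlib.

```python
import math

# Values copied from learner_stats of the result block at 452000 timesteps.
stats = {
    "cur_kl_coeff": 65598.921875,
    "kl": 0.0178182665258646,
    "policy_loss": 0.005861296784132719,
    "vf_loss": 228.80621337890625,
    "entropy": 0.02137686498463154,
    "entropy_coeff": 0.0,
    "total_loss": 1397.6710205078125,
}

# Assumption: the value-function loss coefficient is 1.0 (not shown in the log).
VF_LOSS_COEFF = 1.0


def reconstruct_total_loss(s, vf_loss_coeff=VF_LOSS_COEFF):
    """Hypothetical helper: rebuild total_loss from its logged components."""
    kl_penalty = s["cur_kl_coeff"] * s["kl"]
    return (
        s["policy_loss"]
        + kl_penalty
        + vf_loss_coeff * s["vf_loss"]
        - s["entropy_coeff"] * s["entropy"]
    )


kl_penalty = stats["cur_kl_coeff"] * stats["kl"]
reconstructed = reconstruct_total_loss(stats)

print(f"KL penalty term : {kl_penalty:.2f}")
print(f"vf_loss term    : {stats['vf_loss']:.2f}")
print(f"reconstructed   : {reconstructed:.2f}")
print(f"logged total    : {stats['total_loss']:.2f}")
```

If the assumed decomposition holds, the KL penalty term accounts for most of total_loss at this point, even though kl itself stays near 0.02. That would be consistent with the lr=0.01 trial still showing RUNNING while the two lower-lr trials are already TERMINATED in the status tables.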


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 464000
  custom_metrics: {}
  date: 2021-10-15_10-55-16
  done: false
  episode_len_mean: 194.45
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 194.45
  episode_reward_min: 123.0
  episodes_this_iter: 21
  episodes_total: 2647
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 32799.4609375
          cur_lr: 0.009999999776482582
          entropy: 0.025732535868883133
          entropy_coeff: 0.0
          kl: 0.0010122357634827495
          model: {}
          policy_loss: 0.002971413778141141
          total_loss: 297.4949645996094
          vf_explained_var: 0.6478614807128906
          vf_loss: 264.29119873046875
    num_agent_steps_sampled: 464000
    num_agent_steps_trained: 464000
    num_steps_sampled: 464000
    num_steps_trained: 464000
  iterat

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,116,287.664,464000,194.45,200,123,194.45
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 476000
  custom_metrics: {}
  date: 2021-10-15_10-55-23
  done: false
  episode_len_mean: 193.32
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 193.32
  episode_reward_min: 129.0
  episodes_this_iter: 21
  episodes_total: 2709
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 4099.9326171875
          cur_lr: 0.009999999776482582
          entropy: 0.021833982318639755
          entropy_coeff: 0.0
          kl: 4.932177034788765e-05
          model: {}
          policy_loss: 0.0017498948145657778
          total_loss: 214.48110961914062
          vf_explained_var: 0.6783857345581055
          vf_loss: 214.2771453857422
    num_agent_steps_sampled: 476000
    num_agent_steps_trained: 476000
    num_steps_sampled: 476000
    num_steps_trained: 476000
  ite

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,119,294.577,476000,193.32,200,129,193.32
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 488000
  custom_metrics: {}
  date: 2021-10-15_10-55-30
  done: false
  episode_len_mean: 191.34
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 191.34
  episode_reward_min: 117.0
  episodes_this_iter: 21
  episodes_total: 2772
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 512.4915771484375
          cur_lr: 0.009999999776482582
          entropy: 0.02447798289358616
          entropy_coeff: 0.0
          kl: 1.5047836541270954e-06
          model: {}
          policy_loss: 0.0010183107806369662
          total_loss: 280.0281982421875
          vf_explained_var: 0.5318291187286377
          vf_loss: 280.0263977050781
    num_agent_steps_sampled: 488000
    num_agent_steps_trained: 488000
    num_steps_sampled: 488000
    num_steps_trained: 488000

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,122,301.484,488000,191.34,200,117,191.34
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 500000
  custom_metrics: {}
  date: 2021-10-15_10-55-37
  done: false
  episode_len_mean: 193.39
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 193.39
  episode_reward_min: 125.0
  episodes_this_iter: 20
  episodes_total: 2834
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 64.06144714355469
          cur_lr: 0.009999999776482582
          entropy: 0.026976358145475388
          entropy_coeff: 0.0
          kl: 7.330835160246352e-06
          model: {}
          policy_loss: -0.0073959254659712315
          total_loss: 251.12245178222656
          vf_explained_var: 0.6693288683891296
          vf_loss: 251.12936401367188
    num_agent_steps_sampled: 500000
    num_agent_steps_trained: 500000
    num_steps_sampled: 500000
    num_steps_trained: 500000
 

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,125,308.403,500000,193.39,200,125,193.39
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 512000
  custom_metrics: {}
  date: 2021-10-15_10-55-44
  done: false
  episode_len_mean: 195.45
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 195.45
  episode_reward_min: 141.0
  episodes_this_iter: 20
  episodes_total: 2895
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 8.007680892944336
          cur_lr: 0.009999999776482582
          entropy: 0.022639010101556778
          entropy_coeff: 0.0
          kl: 3.507618384901434e-05
          model: {}
          policy_loss: -0.004638582468032837
          total_loss: 223.90518188476562
          vf_explained_var: 0.6234924793243408
          vf_loss: 223.90953063964844
    num_agent_steps_sampled: 512000
    num_agent_steps_trained: 512000
    num_steps_sampled: 512000
    num_steps_trained: 512000
  

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,128,315.326,512000,195.45,200,141,195.45
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 524000
  custom_metrics: {}
  date: 2021-10-15_10-55-51
  done: false
  episode_len_mean: 194.36
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 194.36
  episode_reward_min: 126.0
  episodes_this_iter: 20
  episodes_total: 2957
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 1.000960111618042
          cur_lr: 0.009999999776482582
          entropy: 0.0245366208255291
          entropy_coeff: 0.0
          kl: 0.0003373818763066083
          model: {}
          policy_loss: 0.0012311121681705117
          total_loss: 187.77406311035156
          vf_explained_var: 0.6041390299797058
          vf_loss: 187.77249145507812
    num_agent_steps_sampled: 524000
    num_agent_steps_trained: 524000
    num_steps_sampled: 524000
    num_steps_trained: 524000

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,131,322.321,524000,194.36,200,126,194.36
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 536000
  custom_metrics: {}
  date: 2021-10-15_10-55-58
  done: false
  episode_len_mean: 193.43
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 193.43
  episode_reward_min: 136.0
  episodes_this_iter: 20
  episodes_total: 3019
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.12512001395225525
          cur_lr: 0.009999999776482582
          entropy: 0.02591009996831417
          entropy_coeff: 0.0
          kl: 0.0021655333694070578
          model: {}
          policy_loss: -0.001225546351633966
          total_loss: 260.375732421875
          vf_explained_var: 0.6275671720504761
          vf_loss: 260.3766784667969
    num_agent_steps_sampled: 536000
    num_agent_steps_trained: 536000
    num_steps_sampled: 536000
    num_steps_trained: 536000

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,134,329.255,536000,193.43,200,136,193.43
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 548000
  custom_metrics: {}
  date: 2021-10-15_10-56-05
  done: false
  episode_len_mean: 191.78
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 191.78
  episode_reward_min: 136.0
  episodes_this_iter: 21
  episodes_total: 3082
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.03128000348806381
          cur_lr: 0.009999999776482582
          entropy: 0.030147982761263847
          entropy_coeff: 0.0
          kl: 0.021021060645580292
          model: {}
          policy_loss: 0.007217974402010441
          total_loss: 259.5378112792969
          vf_explained_var: 0.6014902591705322
          vf_loss: 259.52996826171875
    num_agent_steps_sampled: 548000
    num_agent_steps_trained: 548000
    num_steps_sampled: 548000
    num_steps_trained: 548000

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,137,336.174,548000,191.78,200,136,191.78
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 560000
  custom_metrics: {}
  date: 2021-10-15_10-56-12
  done: false
  episode_len_mean: 193.44
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 193.44
  episode_reward_min: 136.0
  episodes_this_iter: 22
  episodes_total: 3144
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.10557001084089279
          cur_lr: 0.009999999776482582
          entropy: 0.017136044800281525
          entropy_coeff: 0.0
          kl: 0.03332974761724472
          model: {}
          policy_loss: 0.0011737802997231483
          total_loss: 181.55380249023438
          vf_explained_var: 0.6397672295570374
          vf_loss: 181.54910278320312
    num_agent_steps_sampled: 560000
    num_agent_steps_trained: 560000
    num_steps_sampled: 560000
    num_steps_trained: 560000
  

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,140,343.114,560000,193.44,200,136,193.44
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 572000
  custom_metrics: {}
  date: 2021-10-15_10-56-19
  done: false
  episode_len_mean: 193.98
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 193.98
  episode_reward_min: 128.0
  episodes_this_iter: 21
  episodes_total: 3205
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.3562987744808197
          cur_lr: 0.009999999776482582
          entropy: 0.021979190409183502
          entropy_coeff: 0.0
          kl: 0.050189364701509476
          model: {}
          policy_loss: -0.017488112673163414
          total_loss: 262.1922912597656
          vf_explained_var: 0.6160300374031067
          vf_loss: 262.19189453125
    num_agent_steps_sampled: 572000
    num_agent_steps_trained: 572000
    num_steps_sampled: 572000
    num_steps_trained: 572000

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,143,350.035,572000,193.98,200,128,193.98
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 584000
  custom_metrics: {}
  date: 2021-10-15_10-56-26
  done: false
  episode_len_mean: 194.68
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 194.68
  episode_reward_min: 128.0
  episodes_this_iter: 21
  episodes_total: 3267
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 1.2025083303451538
          cur_lr: 0.009999999776482582
          entropy: 0.034568868577480316
          entropy_coeff: 0.0
          kl: 0.037081968039274216
          model: {}
          policy_loss: 0.0033723257947713137
          total_loss: 250.09112548828125
          vf_explained_var: 0.6437906622886658
          vf_loss: 250.0431671142578
    num_agent_steps_sampled: 584000
    num_agent_steps_trained: 584000
    num_steps_sampled: 584000
    num_steps_trained: 584000

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,146,356.949,584000,194.68,200,128,194.68
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 596000
  custom_metrics: {}
  date: 2021-10-15_10-56-33
  done: false
  episode_len_mean: 196.82
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 196.82
  episode_reward_min: 133.0
  episodes_this_iter: 20
  episodes_total: 3327
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 4.058465957641602
          cur_lr: 0.009999999776482582
          entropy: 0.03258999064564705
          entropy_coeff: 0.0
          kl: 0.01804187148809433
          model: {}
          policy_loss: 0.007334898225963116
          total_loss: 375.8227844238281
          vf_explained_var: 0.43233582377433777
          vf_loss: 375.7422790527344
    num_agent_steps_sampled: 596000
    num_agent_steps_trained: 596000
    num_steps_sampled: 596000
    num_steps_trained: 596000

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,149,363.859,596000,196.82,200,133,196.82
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 608000
  custom_metrics: {}
  date: 2021-10-15_10-56-40
  done: false
  episode_len_mean: 198.85
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 198.85
  episode_reward_min: 171.0
  episodes_this_iter: 20
  episodes_total: 3387
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 9.131547927856445
          cur_lr: 0.009999999776482582
          entropy: 0.03856142982840538
          entropy_coeff: 0.0
          kl: 0.041313789784908295
          model: {}
          policy_loss: 0.011208759620785713
          total_loss: 225.64028930664062
          vf_explained_var: 0.6634702682495117
          vf_loss: 225.2518310546875
    num_agent_steps_sampled: 608000
    num_agent_steps_trained: 608000
    num_steps_sampled: 608000
    num_steps_trained: 608000

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,152,370.935,608000,198.85,200,171,198.85
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 620000
  custom_metrics: {}
  date: 2021-10-15_10-56-47
  done: false
  episode_len_mean: 199.8
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 199.8
  episode_reward_min: 180.0
  episodes_this_iter: 20
  episodes_total: 3447
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 20.545982360839844
          cur_lr: 0.009999999776482582
          entropy: 0.03140302747488022
          entropy_coeff: 0.0
          kl: 0.03467385843396187
          model: {}
          policy_loss: 0.009347275830805302
          total_loss: 164.9791259765625
          vf_explained_var: 0.775076150894165
          vf_loss: 164.25738525390625
    num_agent_steps_sampled: 620000
    num_agent_steps_trained: 620000
    num_steps_sampled: 620000
    num_steps_trained: 620000

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,155,377.839,620000,199.8,200,180,199.8
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 632000
  custom_metrics: {}
  date: 2021-10-15_10-56-54
  done: false
  episode_len_mean: 199.3
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 199.3
  episode_reward_min: 164.0
  episodes_this_iter: 20
  episodes_total: 3507
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 46.22846221923828
          cur_lr: 0.009999999776482582
          entropy: 0.03377624601125717
          entropy_coeff: 0.0
          kl: 0.029570963233709335
          model: {}
          policy_loss: 0.0057837641797959805
          total_loss: 214.0142364501953
          vf_explained_var: 0.8016908168792725
          vf_loss: 212.6414337158203
    num_agent_steps_sampled: 632000
    num_agent_steps_trained: 632000
    num_steps_sampled: 632000
    num_steps_trained: 632000

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,158,384.759,632000,199.3,200,164,199.3
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 644000
  custom_metrics: {}
  date: 2021-10-15_10-57-01
  done: false
  episode_len_mean: 195.46
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 195.46
  episode_reward_min: 111.0
  episodes_this_iter: 22
  episodes_total: 3570
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 104.0140380859375
          cur_lr: 0.009999999776482582
          entropy: 0.028327936306595802
          entropy_coeff: 0.0
          kl: 0.023265670984983444
          model: {}
          policy_loss: 0.0018994394922628999
          total_loss: 230.52706909179688
          vf_explained_var: 0.5045852661132812
          vf_loss: 228.10520935058594
    num_agent_steps_sampled: 644000
    num_agent_steps_trained: 644000
    num_steps_sampled: 644000
    num_steps_trained: 644000

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,161,391.686,644000,195.46,200,111,195.46
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 656000
  custom_metrics: {}
  date: 2021-10-15_10-57-08
  done: false
  episode_len_mean: 194.23
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 194.23
  episode_reward_min: 111.0
  episodes_this_iter: 20
  episodes_total: 3631
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 351.0473937988281
          cur_lr: 0.009999999776482582
          entropy: 0.024887733161449432
          entropy_coeff: 0.0
          kl: 0.018525080755352974
          model: {}
          policy_loss: 0.008267736993730068
          total_loss: 251.63320922851562
          vf_explained_var: 0.5858461260795593
          vf_loss: 245.1217498779297
    num_agent_steps_sampled: 656000
    num_agent_steps_trained: 656000
    num_steps_sampled: 656000
    num_steps_trained: 656000

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,164,398.624,656000,194.23,200,111,194.23
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 668000
  custom_metrics: {}
  date: 2021-10-15_10-57-15
  done: false
  episode_len_mean: 194.95
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 194.95
  episode_reward_min: 103.0
  episodes_this_iter: 21
  episodes_total: 3692
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 351.0473937988281
          cur_lr: 0.009999999776482582
          entropy: 0.025736594572663307
          entropy_coeff: 0.0
          kl: 0.01904536969959736
          model: {}
          policy_loss: 0.001065110438503325
          total_loss: 205.11697387695312
          vf_explained_var: 0.6688538193702698
          vf_loss: 198.43008422851562
    num_agent_steps_sampled: 668000
    num_agent_steps_trained: 668000
    num_steps_sampled: 668000
    num_steps_trained: 668000

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,167,405.533,668000,194.95,200,103,194.95
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 680000
  custom_metrics: {}
  date: 2021-10-15_10-57-22
  done: false
  episode_len_mean: 195.29
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 195.29
  episode_reward_min: 103.0
  episodes_this_iter: 20
  episodes_total: 3754
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 789.8566284179688
          cur_lr: 0.009999999776482582
          entropy: 0.023882951587438583
          entropy_coeff: 0.0
          kl: 0.021309973672032356
          model: {}
          policy_loss: 0.0005092198844067752
          total_loss: 341.5623779296875
          vf_explained_var: 0.4555712640285492
          vf_loss: 324.7300109863281
    num_agent_steps_sampled: 680000
    num_agent_steps_trained: 680000
    num_steps_sampled: 680000
    num_steps_trained: 680000

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,170,412.452,680000,195.29,200,103,195.29
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 692000
  custom_metrics: {}
  date: 2021-10-15_10-57-29
  done: false
  episode_len_mean: 194.57
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 194.57
  episode_reward_min: 110.0
  episodes_this_iter: 20
  episodes_total: 3816
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 2665.76611328125
          cur_lr: 0.009999999776482582
          entropy: 0.027045657858252525
          entropy_coeff: 0.0
          kl: 0.01053878478705883
          model: {}
          policy_loss: 0.001407517702318728
          total_loss: 383.3592834472656
          vf_explained_var: 0.1400946080684662
          vf_loss: 355.2639465332031
    num_agent_steps_sampled: 692000
    num_agent_steps_trained: 692000
    num_steps_sampled: 692000
    num_steps_trained: 692000

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,173,419.37,692000,194.57,200,110,194.57
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 704000
  custom_metrics: {}
  date: 2021-10-15_10-57-36
  done: false
  episode_len_mean: 192.76
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 192.76
  episode_reward_min: 103.0
  episodes_this_iter: 21
  episodes_total: 3879
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 3998.649169921875
          cur_lr: 0.009999999776482582
          entropy: 0.03227750584483147
          entropy_coeff: 0.0
          kl: 0.014185648411512375
          model: {}
          policy_loss: 0.005689940880984068
          total_loss: 415.17218017578125
          vf_explained_var: 0.5092543959617615
          vf_loss: 358.4430236816406
    num_agent_steps_sampled: 704000
    num_agent_steps_trained: 704000
    num_steps_sampled: 704000
    num_steps_trained: 704000

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,176,426.312,704000,192.76,200,103,192.76
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 716000
  custom_metrics: {}
  date: 2021-10-15_10-57-43
  done: false
  episode_len_mean: 193.36
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 193.36
  episode_reward_min: 103.0
  episodes_this_iter: 21
  episodes_total: 3940
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 5997.9736328125
          cur_lr: 0.009999999776482582
          entropy: 0.030675478279590607
          entropy_coeff: 0.0
          kl: 0.03457265719771385
          model: {}
          policy_loss: 0.009151836857199669
          total_loss: 443.317138671875
          vf_explained_var: 0.5839788913726807
          vf_loss: 235.94204711914062
    num_agent_steps_sampled: 716000
    num_agent_steps_trained: 716000
    num_steps_sampled: 716000
    num_steps_trained: 716000

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,179,433.221,716000,193.36,200,103,193.36
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 728000
  custom_metrics: {}
  date: 2021-10-15_10-57-50
  done: false
  episode_len_mean: 194.37
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 194.37
  episode_reward_min: 137.0
  episodes_this_iter: 20
  episodes_total: 4002
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 20243.16015625
          cur_lr: 0.009999999776482582
          entropy: 0.02080993354320526
          entropy_coeff: 0.0
          kl: 0.01546079758554697
          model: {}
          policy_loss: 0.0073468307964503765
          total_loss: 564.1886596679688
          vf_explained_var: 0.5882787704467773
          vf_loss: 251.2059326171875
    num_agent_steps_sampled: 728000
    num_agent_steps_trained: 728000
    num_steps_sampled: 728000
    num_steps_trained: 728000

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,182,440.146,728000,194.37,200,137,194.37
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 740000
  custom_metrics: {}
  date: 2021-10-15_10-57-57
  done: false
  episode_len_mean: 194.77
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 194.77
  episode_reward_min: 124.0
  episodes_this_iter: 20
  episodes_total: 4063
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 45547.11328125
          cur_lr: 0.009999999776482582
          entropy: 0.019286973401904106
          entropy_coeff: 0.0
          kl: 0.03301788121461868
          model: {}
          policy_loss: 0.009217972867190838
          total_loss: 1910.1221923828125
          vf_explained_var: 0.36658698320388794
          vf_loss: 406.2437744140625
    num_agent_steps_sampled: 740000
    num_agent_steps_trained: 740000
    num_steps_sampled: 740000
    num_steps_trained: 740000

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,185,447.071,740000,194.77,200,124,194.77
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 752000
  custom_metrics: {}
  date: 2021-10-15_10-58-04
  done: false
  episode_len_mean: 196.2
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 196.2
  episode_reward_min: 124.0
  episodes_this_iter: 20
  episodes_total: 4124
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 153721.5
          cur_lr: 0.009999999776482582
          entropy: 0.023309336975216866
          entropy_coeff: 0.0
          kl: 0.04794064164161682
          model: {}
          policy_loss: 0.013489311560988426
          total_loss: 7566.92236328125
          vf_explained_var: 0.6164832711219788
          vf_loss: 197.40150451660156
    num_agent_steps_sampled: 752000
    num_agent_steps_trained: 752000
    num_steps_sampled: 752000
    num_steps_trained: 752000

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,188,453.999,752000,196.2,200,124,196.2
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 764000
  custom_metrics: {}
  date: 2021-10-15_10-58-11
  done: false
  episode_len_mean: 195.69
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 195.69
  episode_reward_min: 119.0
  episodes_this_iter: 21
  episodes_total: 4186
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 518810.0625
          cur_lr: 0.009999999776482582
          entropy: 0.021534398198127747
          entropy_coeff: 0.0
          kl: 0.026961898431181908
          model: {}
          policy_loss: 0.006563364062458277
          total_loss: 14330.7802734375
          vf_explained_var: 0.595470666885376
          vf_loss: 342.6692199707031
    num_agent_steps_sampled: 764000
    num_agent_steps_trained: 764000
    num_steps_sampled: 764000
    num_steps_trained: 764000
  iterations_s

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,191,460.936,764000,195.69,200,119,195.69
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 776000
  custom_metrics: {}
  date: 2021-10-15_10-58-18
  done: false
  episode_len_mean: 192.75
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 192.75
  episode_reward_min: 104.0
  episodes_this_iter: 22
  episodes_total: 4249
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 1750984.0
          cur_lr: 0.009999999776482582
          entropy: 0.031994014978408813
          entropy_coeff: 0.0
          kl: 0.021958844736218452
          model: {}
          policy_loss: -0.01428163144737482
          total_loss: 38739.5859375
          vf_explained_var: 0.5673038363456726
          vf_loss: 290.0136413574219
    num_agent_steps_sampled: 776000
    num_agent_steps_trained: 776000
    num_steps_sampled: 776000
    num_steps_trained: 776000
  iterations_since

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,194,467.872,776000,192.75,200,104,192.75
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 788000
  custom_metrics: {}
  date: 2021-10-15_10-58-25
  done: false
  episode_len_mean: 189.91
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 189.91
  episode_reward_min: 104.0
  episodes_this_iter: 21
  episodes_total: 4312
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 5909571.0
          cur_lr: 0.009999999776482582
          entropy: 0.017025498673319817
          entropy_coeff: 0.0
          kl: 0.016995375975966454
          model: {}
          policy_loss: 0.002879914827644825
          total_loss: 100704.0859375
          vf_explained_var: 0.49949559569358826
          vf_loss: 268.6899108886719
    num_agent_steps_sampled: 788000
    num_agent_steps_trained: 788000
    num_steps_sampled: 788000
    num_steps_trained: 788000
  iterations_sin

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,197,474.804,788000,189.91,200,104,189.91
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 800000
  custom_metrics: {}
  date: 2021-10-15_10-58-32
  done: false
  episode_len_mean: 189.88
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 189.88
  episode_reward_min: 112.0
  episodes_this_iter: 22
  episodes_total: 4376
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 13296535.0
          cur_lr: 0.009999999776482582
          entropy: 0.02348567359149456
          entropy_coeff: 0.0
          kl: 0.04315152391791344
          model: {}
          policy_loss: 0.001680226530879736
          total_loss: 573934.4375
          vf_explained_var: 0.7847064137458801
          vf_loss: 168.73873901367188
    num_agent_steps_sampled: 800000
    num_agent_steps_trained: 800000
    num_steps_sampled: 800000
    num_steps_trained: 800000
  iterations_since_r

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,200,481.72,800000,189.88,200,112,189.88
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 812000
  custom_metrics: {}
  date: 2021-10-15_10-58-39
  done: false
  episode_len_mean: 191.33
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 191.33
  episode_reward_min: 116.0
  episodes_this_iter: 21
  episodes_total: 4438
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 29917204.0
          cur_lr: 0.009999999776482582
          entropy: 0.02115635573863983
          entropy_coeff: 0.0
          kl: 0.027731379494071007
          model: {}
          policy_loss: 0.019519612193107605
          total_loss: 829976.6875
          vf_explained_var: 0.4529228210449219
          vf_loss: 331.32989501953125
    num_agent_steps_sampled: 812000
    num_agent_steps_trained: 812000
    num_steps_sampled: 812000
    num_steps_trained: 812000
  iterations_since_

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,203,488.65,812000,191.33,200,116,191.33
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 824000
  custom_metrics: {}
  date: 2021-10-15_10-58-45
  done: false
  episode_len_mean: 192.21
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 192.21
  episode_reward_min: 116.0
  episodes_this_iter: 20
  episodes_total: 4499
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 67313704.0
          cur_lr: 0.009999999776482582
          entropy: 0.021896956488490105
          entropy_coeff: 0.0
          kl: 0.03206692636013031
          model: {}
          policy_loss: 0.0048996577970683575
          total_loss: 2158780.75
          vf_explained_var: 0.5775529742240906
          vf_loss: 237.25743103027344
    num_agent_steps_sampled: 824000
    num_agent_steps_trained: 824000
    num_steps_sampled: 824000
    num_steps_trained: 824000
  iterations_since_

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,206,495.569,824000,192.21,200,116,192.21
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 836000
  custom_metrics: {}
  date: 2021-10-15_10-58-52
  done: false
  episode_len_mean: 192.71
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 192.71
  episode_reward_min: 125.0
  episodes_this_iter: 22
  episodes_total: 4562
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 227183760.0
          cur_lr: 0.009999999776482582
          entropy: 0.0352695994079113
          entropy_coeff: 0.0
          kl: 0.024791201576590538
          model: {}
          policy_loss: 0.004102627281099558
          total_loss: 5632440.5
          vf_explained_var: 0.6282277703285217
          vf_loss: 281.7555236816406
    num_agent_steps_sampled: 836000
    num_agent_steps_trained: 836000
    num_steps_sampled: 836000
    num_steps_trained: 836000
  iterations_since_res

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,209,502.496,836000,192.71,200,125,192.71
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 848000
  custom_metrics: {}
  date: 2021-10-15_10-58-59
  done: false
  episode_len_mean: 196.93
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 196.93
  episode_reward_min: 127.0
  episodes_this_iter: 20
  episodes_total: 4622
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 511163456.0
          cur_lr: 0.009999999776482582
          entropy: 0.030641235411167145
          entropy_coeff: 0.0
          kl: 0.017854493111371994
          model: {}
          policy_loss: 0.007640997413545847
          total_loss: 9126719.0
          vf_explained_var: 0.7636121511459351
          vf_loss: 153.9801788330078
    num_agent_steps_sampled: 848000
    num_agent_steps_trained: 848000
    num_steps_sampled: 848000
    num_steps_trained: 848000
  iterations_since_r

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,212,509.498,848000,196.93,200,127,196.93
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 860000
  custom_metrics: {}
  date: 2021-10-15_10-59-06
  done: false
  episode_len_mean: 198.25
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 198.25
  episode_reward_min: 150.0
  episodes_this_iter: 20
  episodes_total: 4683
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 1150117760.0
          cur_lr: 0.009999999776482582
          entropy: 0.027932295575737953
          entropy_coeff: 0.0
          kl: 0.03444070369005203
          model: {}
          policy_loss: 0.009583555161952972
          total_loss: 39611084.0
          vf_explained_var: 0.6858659386634827
          vf_loss: 223.56442260742188
    num_agent_steps_sampled: 860000
    num_agent_steps_trained: 860000
    num_steps_sampled: 860000
    num_steps_trained: 860000
  iterations_since

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,215,516.47,860000,198.25,200,150,198.25
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 872000
  custom_metrics: {}
  date: 2021-10-15_10-59-14
  done: false
  episode_len_mean: 197.64
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 197.64
  episode_reward_min: 136.0
  episodes_this_iter: 21
  episodes_total: 4744
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 3881647616.0
          cur_lr: 0.009999999776482582
          entropy: 0.019189799204468727
          entropy_coeff: 0.0
          kl: 0.01766931638121605
          model: {}
          policy_loss: -0.00022242992417886853
          total_loss: 68586368.0
          vf_explained_var: 0.6173312067985535
          vf_loss: 311.75860595703125
    num_agent_steps_sampled: 872000
    num_agent_steps_trained: 872000
    num_steps_sampled: 872000
    num_steps_trained: 872000
  iterations_si

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,218,523.481,872000,197.64,200,136,197.64
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 884000
  custom_metrics: {}
  date: 2021-10-15_10-59-21
  done: false
  episode_len_mean: 195.57
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 195.57
  episode_reward_min: 119.0
  episodes_this_iter: 20
  episodes_total: 4805
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 5822471168.0
          cur_lr: 0.009999999776482582
          entropy: 0.02485632710158825
          entropy_coeff: 0.0
          kl: 0.034962113946676254
          model: {}
          policy_loss: 0.006677574012428522
          total_loss: 203566096.0
          vf_explained_var: 0.680285632610321
          vf_loss: 192.2202606201172
    num_agent_steps_sampled: 884000
    num_agent_steps_trained: 884000
    num_steps_sampled: 884000
    num_steps_trained: 884000
  iterations_since_

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,221,530.422,884000,195.57,200,119,195.57
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 896000
  custom_metrics: {}
  date: 2021-10-15_10-59-27
  done: false
  episode_len_mean: 194.4
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 194.4
  episode_reward_min: 119.0
  episodes_this_iter: 20
  episodes_total: 4867
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 8733707264.0
          cur_lr: 0.009999999776482582
          entropy: 0.03407536819577217
          entropy_coeff: 0.0
          kl: 0.012303782626986504
          model: {}
          policy_loss: 0.008334421552717686
          total_loss: 107457832.0
          vf_explained_var: 0.5875982046127319
          vf_loss: 197.11773681640625
    num_agent_steps_sampled: 896000
    num_agent_steps_trained: 896000
    num_steps_sampled: 896000
    num_steps_trained: 896000
  iterations_since_

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,224,537.357,896000,194.4,200,119,194.4
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 908000
  custom_metrics: {}
  date: 2021-10-15_10-59-34
  done: false
  episode_len_mean: 190.88
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 190.88
  episode_reward_min: 111.0
  episodes_this_iter: 21
  episodes_total: 4931
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 8733707264.0
          cur_lr: 0.009999999776482582
          entropy: 0.024730896577239037
          entropy_coeff: 0.0
          kl: 0.009019691497087479
          model: {}
          policy_loss: -0.0017479113303124905
          total_loss: 78775504.0
          vf_explained_var: 0.7395049333572388
          vf_loss: 156.16004943847656
    num_agent_steps_sampled: 908000
    num_agent_steps_trained: 908000
    num_steps_sampled: 908000
    num_steps_trained: 908000
  iterations_si

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,227,544.3,908000,190.88,200,111,190.88
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 920000
  custom_metrics: {}
  date: 2021-10-15_10-59-41
  done: false
  episode_len_mean: 193.41
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 193.41
  episode_reward_min: 112.0
  episodes_this_iter: 20
  episodes_total: 4992
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 8733707264.0
          cur_lr: 0.009999999776482582
          entropy: 0.028100520372390747
          entropy_coeff: 0.0
          kl: 0.013197510503232479
          model: {}
          policy_loss: 0.0031167808920145035
          total_loss: 115263456.0
          vf_explained_var: 0.5506707429885864
          vf_loss: 261.7149353027344
    num_agent_steps_sampled: 920000
    num_agent_steps_trained: 920000
    num_steps_sampled: 920000
    num_steps_trained: 920000
  iterations_sin

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,230,551.248,920000,193.41,200,112,193.41
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 932000
  custom_metrics: {}
  date: 2021-10-15_10-59-48
  done: false
  episode_len_mean: 196.37
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 196.37
  episode_reward_min: 131.0
  episodes_this_iter: 20
  episodes_total: 5053
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 8733707264.0
          cur_lr: 0.009999999776482582
          entropy: 0.023616168648004532
          entropy_coeff: 0.0
          kl: 0.004240493290126324
          model: {}
          policy_loss: 0.00014298449968919158
          total_loss: 37035564.0
          vf_explained_var: 0.4620436728000641
          vf_loss: 343.8981628417969
    num_agent_steps_sampled: 932000
    num_agent_steps_trained: 932000
    num_steps_sampled: 932000
    num_steps_trained: 932000
  iterations_sin

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,233,558.157,932000,196.37,200,131,196.37
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 944000
  custom_metrics: {}
  date: 2021-10-15_10-59-55
  done: false
  episode_len_mean: 196.78
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 196.78
  episode_reward_min: 134.0
  episodes_this_iter: 21
  episodes_total: 5116
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 1091713408.0
          cur_lr: 0.009999999776482582
          entropy: 0.025478346273303032
          entropy_coeff: 0.0
          kl: 0.00023892277386039495
          model: {}
          policy_loss: 0.0026462336536496878
          total_loss: 261046.03125
          vf_explained_var: 0.7254422307014465
          vf_loss: 210.84754943847656
    num_agent_steps_sampled: 944000
    num_agent_steps_trained: 944000
    num_steps_sampled: 944000
    num_steps_trained: 944000
  iterations

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,236,565.077,944000,196.78,200,134,196.78
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 956000
  custom_metrics: {}
  date: 2021-10-15_11-00-02
  done: false
  episode_len_mean: 195.98
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 195.98
  episode_reward_min: 128.0
  episodes_this_iter: 20
  episodes_total: 5176
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 136464176.0
          cur_lr: 0.009999999776482582
          entropy: 0.022412654012441635
          entropy_coeff: 0.0
          kl: 1.7414913600077853e-05
          model: {}
          policy_loss: 0.008121867664158344
          total_loss: 2536.642578125
          vf_explained_var: 0.6902815103530884
          vf_loss: 160.12290954589844
    num_agent_steps_sampled: 956000
    num_agent_steps_trained: 956000
    num_steps_sampled: 956000
    num_steps_trained: 956000
  iterations

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,239,572.019,956000,195.98,200,128,195.98
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 968000
  custom_metrics: {}
  date: 2021-10-15_11-00-09
  done: false
  episode_len_mean: 196.73
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 196.73
  episode_reward_min: 128.0
  episodes_this_iter: 20
  episodes_total: 5237
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 17058022.0
          cur_lr: 0.009999999776482582
          entropy: 0.026489922776818275
          entropy_coeff: 0.0
          kl: 8.358487684745342e-05
          model: {}
          policy_loss: -0.003357628360390663
          total_loss: 1659.3350830078125
          vf_explained_var: 0.5304856300354004
          vf_loss: 233.54586791992188
    num_agent_steps_sampled: 968000
    num_agent_steps_trained: 968000
    num_steps_sampled: 968000
    num_steps_trained: 968000
  iterati

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,242,579.09,968000,196.73,200,128,196.73
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 980000
  custom_metrics: {}
  date: 2021-10-15_11-00-16
  done: false
  episode_len_mean: 195.71
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 195.71
  episode_reward_min: 149.0
  episodes_this_iter: 22
  episodes_total: 5299
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 2132252.75
          cur_lr: 0.009999999776482582
          entropy: 0.026114661246538162
          entropy_coeff: 0.0
          kl: 1.811305082810577e-05
          model: {}
          policy_loss: -0.010423549450933933
          total_loss: 239.53477478027344
          vf_explained_var: 0.5148080587387085
          vf_loss: 200.92359924316406
    num_agent_steps_sampled: 980000
    num_agent_steps_trained: 980000
    num_steps_sampled: 980000
    num_steps_trained: 980000
  iterati

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,245,586.011,980000,195.71,200,149,195.71
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 992000
  custom_metrics: {}
  date: 2021-10-15_11-00-23
  done: false
  episode_len_mean: 195.88
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 195.88
  episode_reward_min: 144.0
  episodes_this_iter: 20
  episodes_total: 5359
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 266531.59375
          cur_lr: 0.009999999776482582
          entropy: 0.028410296887159348
          entropy_coeff: 0.0
          kl: 3.7436437283489e-10
          model: {}
          policy_loss: 0.003789838869124651
          total_loss: 318.15093994140625
          vf_explained_var: 0.5341739654541016
          vf_loss: 318.1470031738281
    num_agent_steps_sampled: 992000
    num_agent_steps_trained: 992000
    num_steps_sampled: 992000
    num_steps_trained: 992000
  iteration

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,248,592.925,992000,195.88,200,144,195.88
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 1004000
  custom_metrics: {}
  date: 2021-10-15_11-00-30
  done: false
  episode_len_mean: 197.64
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 197.64
  episode_reward_min: 134.0
  episodes_this_iter: 21
  episodes_total: 5421
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 33316.44921875
          cur_lr: 0.009999999776482582
          entropy: 0.025386707857251167
          entropy_coeff: 0.0
          kl: -2.981513091970811e-10
          model: {}
          policy_loss: -0.005782286170870066
          total_loss: 343.87481689453125
          vf_explained_var: 0.6897535920143127
          vf_loss: 343.8806457519531
    num_agent_steps_sampled: 1004000
    num_agent_steps_trained: 1004000
    num_steps_sampled: 1004000
    num_steps_trained: 1004000


Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,251,599.854,1004000,197.64,200,134,197.64
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 1016000
  custom_metrics: {}
  date: 2021-10-15_11-00-37
  done: false
  episode_len_mean: 198.03
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 198.03
  episode_reward_min: 134.0
  episodes_this_iter: 20
  episodes_total: 5481
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 4164.55615234375
          cur_lr: 0.009999999776482582
          entropy: 0.023829441517591476
          entropy_coeff: 0.0
          kl: 1.6749061859666625e-10
          model: {}
          policy_loss: 0.008326556533575058
          total_loss: 254.33033752441406
          vf_explained_var: 0.6106386184692383
          vf_loss: 254.322021484375
    num_agent_steps_sampled: 1016000
    num_agent_steps_trained: 1016000
    num_steps_sampled: 1016000
    num_steps_trained: 1016000


Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,254,606.779,1016000,198.03,200,134,198.03
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 1028000
  custom_metrics: {}
  date: 2021-10-15_11-00-44
  done: false
  episode_len_mean: 197.46
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 197.46
  episode_reward_min: 128.0
  episodes_this_iter: 20
  episodes_total: 5541
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 520.5695190429688
          cur_lr: 0.009999999776482582
          entropy: 0.027320418506860733
          entropy_coeff: 0.0
          kl: 8.933809803046699e-10
          model: {}
          policy_loss: 0.0029551705811172724
          total_loss: 147.1508331298828
          vf_explained_var: 0.7828692197799683
          vf_loss: 147.14788818359375
    num_agent_steps_sampled: 1028000
    num_agent_steps_trained: 1028000
    num_steps_sampled: 1028000
    num_steps_trained: 102800

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,257,613.686,1028000,197.46,200,128,197.46
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 1040000
  custom_metrics: {}
  date: 2021-10-15_11-00-51
  done: false
  episode_len_mean: 197.21
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 197.21
  episode_reward_min: 128.0
  episodes_this_iter: 20
  episodes_total: 5603
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 65.0711898803711
          cur_lr: 0.009999999776482582
          entropy: 0.026989445090293884
          entropy_coeff: 0.0
          kl: 1.9392402350604243e-08
          model: {}
          policy_loss: 0.005751763936132193
          total_loss: 375.0908508300781
          vf_explained_var: 0.6021448373794556
          vf_loss: 375.0850830078125
    num_agent_steps_sampled: 1040000
    num_agent_steps_trained: 1040000
    num_steps_sampled: 1040000
    num_steps_trained: 1040000


Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,260,620.591,1040000,197.21,200,128,197.21
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 1052000
  custom_metrics: {}
  date: 2021-10-15_11-00-58
  done: false
  episode_len_mean: 197.4
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 197.4
  episode_reward_min: 126.0
  episodes_this_iter: 20
  episodes_total: 5663
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 8.133898735046387
          cur_lr: 0.009999999776482582
          entropy: 0.026945369318127632
          entropy_coeff: 0.0
          kl: 1.3249793937575305e-06
          model: {}
          policy_loss: 0.000999676063656807
          total_loss: 325.82733154296875
          vf_explained_var: 0.561767578125
          vf_loss: 325.82635498046875
    num_agent_steps_sampled: 1052000
    num_agent_steps_trained: 1052000
    num_steps_sampled: 1052000
    num_steps_trained: 1052000

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,263,627.511,1052000,197.4,200,126,197.4
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 1064000
  custom_metrics: {}
  date: 2021-10-15_11-01-05
  done: false
  episode_len_mean: 198.02
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 198.02
  episode_reward_min: 161.0
  episodes_this_iter: 20
  episodes_total: 5724
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 1.0167373418807983
          cur_lr: 0.009999999776482582
          entropy: 0.023746836930513382
          entropy_coeff: 0.0
          kl: 3.557523086783476e-05
          model: {}
          policy_loss: 0.0023269110824912786
          total_loss: 165.41827392578125
          vf_explained_var: 0.7044664025306702
          vf_loss: 165.41592407226562
    num_agent_steps_sampled: 1064000
    num_agent_steps_trained: 1064000
    num_steps_sampled: 1064000
    num_steps_trained: 1064

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,266,634.573,1064000,198.02,200,161,198.02
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 1076000
  custom_metrics: {}
  date: 2021-10-15_11-01-12
  done: false
  episode_len_mean: 198.7
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 198.7
  episode_reward_min: 155.0
  episodes_this_iter: 20
  episodes_total: 5784
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.1270921677350998
          cur_lr: 0.009999999776482582
          entropy: 0.02892422303557396
          entropy_coeff: 0.0
          kl: 0.00016042716742958874
          model: {}
          policy_loss: 0.0010385289788246155
          total_loss: 219.11968994140625
          vf_explained_var: 0.6915619969367981
          vf_loss: 219.11862182617188
    num_agent_steps_sampled: 1076000
    num_agent_steps_trained: 1076000
    num_steps_sampled: 1076000
    num_steps_trained: 107600

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,269,641.57,1076000,198.7,200,155,198.7
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 1088000
  custom_metrics: {}
  date: 2021-10-15_11-01-19
  done: false
  episode_len_mean: 199.17
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 199.17
  episode_reward_min: 155.0
  episodes_this_iter: 20
  episodes_total: 5844
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.015886520966887474
          cur_lr: 0.009999999776482582
          entropy: 0.028329340741038322
          entropy_coeff: 0.0
          kl: 0.00047230214113369584
          model: {}
          policy_loss: -0.0011466203723102808
          total_loss: 305.0863342285156
          vf_explained_var: 0.40951842069625854
          vf_loss: 305.08746337890625
    num_agent_steps_sampled: 1088000
    num_agent_steps_trained: 1088000
    num_steps_sampled: 1088000
    num_steps_trained: 

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,272,648.467,1088000,199.17,200,155,199.17
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 1100000
  custom_metrics: {}
  date: 2021-10-15_11-01-51
  done: false
  episode_len_mean: 198.01
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 198.01
  episode_reward_min: 144.0
  episodes_this_iter: 20
  episodes_total: 5905
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.0019858151208609343
          cur_lr: 0.009999999776482582
          entropy: 0.024646740406751633
          entropy_coeff: 0.0
          kl: 0.0007612459594383836
          model: {}
          policy_loss: -0.000705285114236176
          total_loss: 305.65521240234375
          vf_explained_var: 0.5105039477348328
          vf_loss: 305.6559143066406
    num_agent_steps_sampled: 1100000
    num_agent_steps_trained: 1100000
    num_steps_sampled: 1100000
    num_steps_trained: 11

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,275,679.75,1100000,198.01,200,144,198.01
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 1112000
  custom_metrics: {}
  date: 2021-10-15_11-01-58
  done: false
  episode_len_mean: 197.38
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 197.38
  episode_reward_min: 144.0
  episodes_this_iter: 20
  episodes_total: 5966
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.0002482268901076168
          cur_lr: 0.009999999776482582
          entropy: 0.0356253907084465
          entropy_coeff: 0.0
          kl: 0.0038500267546623945
          model: {}
          policy_loss: 0.0024661272764205933
          total_loss: 334.39520263671875
          vf_explained_var: 0.4850882291793823
          vf_loss: 334.3927001953125
    num_agent_steps_sampled: 1112000
    num_agent_steps_trained: 1112000
    num_steps_sampled: 1112000
    num_steps_trained: 1112

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,278,686.694,1112000,197.38,200,144,197.38
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 1124000
  custom_metrics: {}
  date: 2021-10-15_11-02-07
  done: false
  episode_len_mean: 196.3
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 196.3
  episode_reward_min: 130.0
  episodes_this_iter: 20
  episodes_total: 6027
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.00018617016030475497
          cur_lr: 0.009999999776482582
          entropy: 0.023129213601350784
          entropy_coeff: 0.0
          kl: 0.03502307087182999
          model: {}
          policy_loss: 0.008444827049970627
          total_loss: 198.7564239501953
          vf_explained_var: 0.6857905983924866
          vf_loss: 198.7479705810547
    num_agent_steps_sampled: 1124000
    num_agent_steps_trained: 1124000
    num_steps_sampled: 1124000
    num_steps_trained: 1124000

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,281,696.412,1124000,196.3,200,130,196.3
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 1128000
  custom_metrics: {}
  date: 2021-10-15_11-02-16
  done: false
  episode_len_mean: 196.07
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 196.07
  episode_reward_min: 130.0
  episodes_this_iter: 21
  episodes_total: 6048
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.00027925524045713246
          cur_lr: 0.009999999776482582
          entropy: 0.020569102838635445
          entropy_coeff: 0.0
          kl: 0.05568401515483856
          model: {}
          policy_loss: 0.0036334909964352846
          total_loss: 256.29388427734375
          vf_explained_var: 0.6911838054656982
          vf_loss: 256.2902526855469
    num_agent_steps_sampled: 1128000
    num_agent_steps_trained: 1128000
    num_steps_sampled: 1128000
    num_steps_trained: 112

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,282,704.761,1128000,196.07,200,130,196.07
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 1136000
  custom_metrics: {}
  date: 2021-10-15_11-02-23
  done: false
  episode_len_mean: 194.01
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 194.01
  episode_reward_min: 130.0
  episodes_this_iter: 20
  episodes_total: 6089
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.0006283242837525904
          cur_lr: 0.009999999776482582
          entropy: 0.012516897171735764
          entropy_coeff: 0.0
          kl: 0.503638505935669
          model: {}
          policy_loss: 0.012160060927271843
          total_loss: 365.1748046875
          vf_explained_var: 0.40124407410621643
          vf_loss: 365.162353515625
    num_agent_steps_sampled: 1136000
    num_agent_steps_trained: 1136000
    num_steps_sampled: 1136000
    num_steps_trained: 1136000

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,284,712.07,1136000,194.01,200,130,194.01
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 1144000
  custom_metrics: {}
  date: 2021-10-15_11-02-31
  done: false
  episode_len_mean: 192.6
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 192.6
  episode_reward_min: 132.0
  episodes_this_iter: 20
  episodes_total: 6131
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.0014137296238914132
          cur_lr: 0.009999999776482582
          entropy: 0.014949028380215168
          entropy_coeff: 0.0
          kl: 0.5027269124984741
          model: {}
          policy_loss: 0.01573752798140049
          total_loss: 262.5810546875
          vf_explained_var: 0.6576183438301086
          vf_loss: 262.5646057128906
    num_agent_steps_sampled: 1144000
    num_agent_steps_trained: 1144000
    num_steps_sampled: 1144000
    num_steps_trained: 1144000

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,286,719.583,1144000,192.6,200,132,192.6
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 1152000
  custom_metrics: {}
  date: 2021-10-15_11-02-38
  done: false
  episode_len_mean: 194.77
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 194.77
  episode_reward_min: 132.0
  episodes_this_iter: 20
  episodes_total: 6171
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.003180891741067171
          cur_lr: 0.009999999776482582
          entropy: 0.014601812697947025
          entropy_coeff: 0.0
          kl: 1.0165799856185913
          model: {}
          policy_loss: 0.03437485918402672
          total_loss: 228.48565673828125
          vf_explained_var: 0.628554105758667
          vf_loss: 228.4480438232422
    num_agent_steps_sampled: 1152000
    num_agent_steps_trained: 1152000
    num_steps_sampled: 1152000
    num_steps_trained: 1152000
 

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,288,727.106,1152000,194.77,200,132,194.77
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 1156000
  custom_metrics: {}
  date: 2021-10-15_11-31-24
  done: false
  episode_len_mean: 195.87
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 195.87
  episode_reward_min: 139.0
  episodes_this_iter: 20
  episodes_total: 6191
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.0047713378444314
          cur_lr: 0.009999999776482582
          entropy: 0.007644451688975096
          entropy_coeff: 0.0
          kl: 2.001596689224243
          model: {}
          policy_loss: 0.028907151892781258
          total_loss: 223.6616973876953
          vf_explained_var: 0.6034905910491943
          vf_loss: 223.6232452392578
    num_agent_steps_sampled: 1156000
    num_agent_steps_trained: 1156000
    num_steps_sampled: 1156000
    num_steps_trained: 1156000

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,RUNNING,192.168.0.42:1096,0.01,289,2452.95,1156000,195.87,200,139,195.87
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200.0,200,200,200.0
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200.0,200,200,200.0


Result for PPO_CartPole-v0_97f47_00000:
  agent_timesteps_total: 1168000
  custom_metrics: {}
  date: 2021-10-15_11-31-31
  done: true
  episode_len_mean: 200.0
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 200.0
  episode_reward_min: 200.0
  episodes_this_iter: 20
  episodes_total: 6251
  experiment_id: f750806d183f44688b827b772469e4d4
  hostname: Liweis-iMac.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.016103263944387436
          cur_lr: 0.009999999776482582
          entropy: 0.008555091917514801
          entropy_coeff: 0.0
          kl: 0.10656091570854187
          model: {}
          policy_loss: 0.005454843398183584
          total_loss: 164.05479431152344
          vf_explained_var: 0.5440415740013123
          vf_loss: 164.047607421875
    num_agent_steps_sampled: 1168000
    num_agent_steps_trained: 1168000
    num_steps_sampled: 1168000
    num_steps_trained: 1168000
  

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,TERMINATED,,0.01,292,2459.61,1168000,200,200,200,200
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200,200,200,200
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200,200,200,200
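The final status table can be compared programmatically. The sketch below is only an illustration: it copies the three rows of the table above into a string and parses them with the standard library, rather than using any Ray API. The helper name `parse_status_table` and the shortened dictionary keys are made up for this example.

```python
import csv
import io

# Final status table, copied verbatim from the Tune output above.
STATUS_TABLE = """\
Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_97f47_00000,TERMINATED,,0.01,292,2459.61,1168000,200,200,200,200
PPO_CartPole-v0_97f47_00001,TERMINATED,,0.001,15,53.5871,60000,200,200,200,200
PPO_CartPole-v0_97f47_00002,TERMINATED,,0.0001,17,59.4017,68000,200,200,200,200
"""


def parse_status_table(text):
    """Hypothetical helper: turn the CSV-style status table into typed dicts."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "trial": row["Trial name"],
            "lr": float(row["lr"]),
            "iters": int(row["iter"]),
            "timesteps": int(row["ts"]),
            "reward": float(row["reward"]),
        })
    return rows


trials = parse_status_table(STATUS_TABLE)

# All three trials end at a mean reward of 200, so rank them by the number
# of environment timesteps they needed to get there.
fastest = min(trials, key=lambda t: t["timesteps"])
print(fastest["trial"], fastest["lr"], fastest["timesteps"])
```

Timesteps are used for the ranking instead of `total time (s)` because the wall-clock time of the `lr=0.01` trial jumps from 727.106 s to 2452.95 s between iterations 288 and 289, which looks like a pause of the machine and not training cost.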


2021-10-15 11:31:32,087	INFO tune.py:561 -- Total run time: 2474.84 seconds (2474.02 seconds for the tuning loop).
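The `cur_kl_coeff` values in the PPO results above shrink and grow between iterations. This is consistent with an adaptive KL penalty: the coefficient is multiplied by 1.5 when the measured `kl` exceeds twice a target, and halved when it falls below half the target. The sketch below is a plain-Python illustration of that rule, not RLlib's own code. The target of 0.01 is an assumption (the `kl_target` setting is not printed in this log), and `update_kl_coeff` is a name chosen for this example.

```python
import math

# Assumption: kl_target = 0.01. The value is not shown anywhere in this log.
KL_TARGET = 0.01


def update_kl_coeff(kl_coeff, sampled_kl, kl_target=KL_TARGET):
    """Illustrative adaptive-KL rule: grow the penalty when the policy moved
    too far, shrink it when the policy barely moved, otherwise keep it."""
    if sampled_kl > 2.0 * kl_target:
        return kl_coeff * 1.5
    if sampled_kl < 0.5 * kl_target:
        return kl_coeff * 0.5
    return kl_coeff


# Values logged at iteration 281 (1124000 timesteps).
coeff_281 = 0.00018617016030475497
kl_281 = 0.03502307087182999

# Value logged at iteration 282 (1128000 timesteps).
logged_coeff_282 = 0.00027925524045713246

predicted_coeff_282 = update_kl_coeff(coeff_281, kl_281)
print(predicted_coeff_282, logged_coeff_282)
print(math.isclose(predicted_coeff_282, logged_coeff_282, rel_tol=1e-6))
```

The same rule would explain the early part of the excerpt: with `kl` around 1e-9 to 1e-6, the coefficient halves every iteration, and three iterations take it from 520.57 to 65.07 (a factor of 8).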


<ray.tune.analysis.experiment_analysis.ExperimentAnalysis at 0x7fcba3e1e9a0>
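The `learner_stats` entries in the PPO results above can be cross-checked against each other. The sketch below assumes the usual PPO objective, where the total loss is the policy loss plus a KL penalty plus a weighted value-function loss minus an entropy bonus. The value-function weight of 1.0 is an assumption, since `vf_loss_coeff` is not printed in this log, and `ppo_total_loss` is a name chosen for this example.

```python
import math


def ppo_total_loss(policy_loss, vf_loss, kl, kl_coeff, entropy,
                   entropy_coeff=0.0, vf_loss_coeff=1.0):
    """Recombine the logged loss terms (vf_loss_coeff=1.0 is an assumption)."""
    return (policy_loss
            + kl_coeff * kl
            + vf_loss_coeff * vf_loss
            - entropy_coeff * entropy)


# learner_stats logged at iteration 288 (1152000 timesteps).
stats = {
    "policy_loss": 0.03437485918402672,
    "vf_loss": 228.4480438232422,
    "kl": 1.0165799856185913,
    "kl_coeff": 0.003180891741067171,   # logged as cur_kl_coeff
    "entropy": 0.014601812697947025,
    "entropy_coeff": 0.0,
}
logged_total_loss = 228.48565673828125

reconstructed = ppo_total_loss(**stats)
print(reconstructed, logged_total_loss)
```

The two numbers agree to within float32 rounding. The value-function term dominates: `vf_loss` is in the hundreds while `policy_loss` stays near zero, which is why `total_loss` tracks `vf_loss` so closely throughout this run.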

In [4]:
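alg = 'DDPG'
# Train DDPG on Pendulum-v0 for 30 training iterations, on CPU only,
# with 2 rollout workers.
tune.run(
    alg,
    stop={"training_iteration": 30},
    config={
        'env': 'Pendulum-v0',
        'num_gpus': 0,
        'num_workers': 2,
        # A grid search over a single value launches exactly one trial.
        'lr': tune.grid_search([0.001]),
    },
)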


Trial name,status,loc,lr
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,,0.001


[2m[36m(pid=17553)[0m Instructions for updating:
[2m[36m(pid=17553)[0m non-resource variables are not supported in the long term
[2m[36m(pid=17553)[0m 2021-07-13 20:54:31,552	INFO trainer.py:591 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=17553)[0m 2021-07-13 20:54:31,552	INFO trainer.py:616 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=17556)[0m Instructions for updating:
[2m[36m(pid=17556)[0m non-resource variables are not supported in the long term
[2m[36m(pid=17555)[0m Instructions for updating:
[2m[36m(pid=17555)[0m non-resource variables are not supported in the long term


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-54-41
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -781.1207639597676
  episode_reward_mean: -1131.8902231793475
  episode_reward_min: -1410.9375467664963
  episodes_this_iter: 6
  episodes_total: 6
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 1500
    learner:
      default_policy:
        max_q: 0.1796478033065796
        mean_q: -0.10519210994243622
        min_q: -0.6069953441619873
        model: {}
    num_steps_sampled: 1500
    num_steps_trained: 256
    num_target_updates: 1
  iterations_since_restore: 1
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 25.799999999999997
    ram_util_percent: 63.55
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.0973585918644

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,1,2.53136,1500,-1131.89,-781.121,-1410.94,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-54-58
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -781.1207639597676
  episode_reward_mean: -1331.3202113249129
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 6
  episodes_total: 12
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 2500
    learner:
      default_policy:
        max_q: -0.50157630443573
        mean_q: -11.473435401916504
        min_q: -24.651611328125
        model: {}
    num_steps_sampled: 2500
    num_steps_trained: 128256
    num_target_updates: 501
  iterations_since_restore: 2
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 23.545833333333334
    ram_util_percent: 63.48333333333334
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,2,19.6074,2500,-1331.32,-781.121,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-55-15
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -781.1207639597676
  episode_reward_mean: -1398.030697460801
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 4
  episodes_total: 16
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 3500
    learner:
      default_policy:
        max_q: -0.1660407930612564
        mean_q: -18.283344268798828
        min_q: -33.41071319580078
        model: {}
    num_steps_sampled: 3500
    num_steps_trained: 256256
    num_target_updates: 1001
  iterations_since_restore: 3
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 26.1125
    ram_util_percent: 64.00416666666666
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.0981330

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,3,36.4448,3500,-1398.03,-781.121,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-55-32
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -781.1207639597676
  episode_reward_mean: -1425.8668538197078
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 6
  episodes_total: 22
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 4500
    learner:
      default_policy:
        max_q: -0.6811314225196838
        mean_q: -23.862667083740234
        min_q: -44.24247360229492
        model: {}
    num_steps_sampled: 4500
    num_steps_trained: 384256
    num_target_updates: 1501
  iterations_since_restore: 4
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 27.125
    ram_util_percent: 64.6375
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.09838679174479754

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,4,53.642,4500,-1425.87,-781.121,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-55-50
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -781.1207639597676
  episode_reward_mean: -1444.8272991160627
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 4
  episodes_total: 26
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 5500
    learner:
      default_policy:
        max_q: 0.3510587811470032
        mean_q: -29.40019416809082
        min_q: -47.306602478027344
        model: {}
    num_steps_sampled: 5500
    num_steps_trained: 512256
    num_target_updates: 2001
  iterations_since_restore: 5
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 28.332000000000004
    ram_util_percent: 64.232
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.0986227

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,5,71.5977,5500,-1444.83,-781.121,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-56-07
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -781.1207639597676
  episode_reward_mean: -1436.744265061477
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 6
  episodes_total: 32
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 6500
    learner:
      default_policy:
        max_q: -0.09803511202335358
        mean_q: -34.90961456298828
        min_q: -63.530277252197266
        model: {}
    num_steps_sampled: 6500
    num_steps_trained: 640256
    num_target_updates: 2501
  iterations_since_restore: 6
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 22.962500000000002
    ram_util_percent: 63.50416666666667
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_m

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,6,89.2018,6500,-1436.74,-781.121,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-56-26
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -781.1207639597676
  episode_reward_mean: -1438.6196838828807
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 4
  episodes_total: 36
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 7500
    learner:
      default_policy:
        max_q: 0.008759599179029465
        mean_q: -40.90607452392578
        min_q: -63.068790435791016
        model: {}
    num_steps_sampled: 7500
    num_steps_trained: 768256
    num_target_updates: 3001
  iterations_since_restore: 7
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 28.47307692307692
    ram_util_percent: 65.36153846153846
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_m

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,7,107.375,7500,-1438.62,-781.121,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-56-44
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -781.1207639597676
  episode_reward_mean: -1412.6190355939557
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 6
  episodes_total: 42
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 8500
    learner:
      default_policy:
        max_q: 1.5346838235855103
        mean_q: -44.83782958984375
        min_q: -71.63545227050781
        model: {}
    num_steps_sampled: 8500
    num_steps_trained: 896256
    num_target_updates: 3501
  iterations_since_restore: 8
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 26.483999999999995
    ram_util_percent: 60.803999999999995
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,8,125.718,8500,-1412.62,-781.121,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-57-02
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -781.1207639597676
  episode_reward_mean: -1398.6154951909177
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 4
  episodes_total: 46
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 9500
    learner:
      default_policy:
        max_q: -0.022161278873682022
        mean_q: -50.20433044433594
        min_q: -73.2963638305664
        model: {}
    num_steps_sampled: 9500
    num_steps_trained: 1024256
    num_target_updates: 4001
  iterations_since_restore: 9
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 21.676
    ram_util_percent: 60.964
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.09955409264438625

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,9,143.433,9500,-1398.62,-781.121,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-57-20
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -11.722382280782014
  episode_reward_mean: -1348.1319338104424
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 6
  episodes_total: 52
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 10500
    learner:
      default_policy:
        max_q: 0.9547820687294006
        mean_q: -53.53691864013672
        min_q: -75.73155975341797
        model: {}
    num_steps_sampled: 10500
    num_steps_trained: 1152256
    num_target_updates: 4501
  iterations_since_restore: 10
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 26.58076923076923
    ram_util_percent: 62.80384615384616
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,10,161.598,10500,-1348.13,-11.7224,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-57-38
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -11.722382280782014
  episode_reward_mean: -1336.2461581544399
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 4
  episodes_total: 56
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 11500
    learner:
      default_policy:
        max_q: -1.291930079460144
        mean_q: -56.992698669433594
        min_q: -83.67649841308594
        model: {}
    num_steps_sampled: 11500
    num_steps_trained: 1280256
    num_target_updates: 5001
  iterations_since_restore: 11
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 25.019999999999996
    ram_util_percent: 63.53200000000001
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processi

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,11,179.777,11500,-1336.25,-11.7224,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-57-57
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -11.722382280782014
  episode_reward_mean: -1288.7165801323201
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 6
  episodes_total: 62
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 12500
    learner:
      default_policy:
        max_q: 1.1449592113494873
        mean_q: -62.394222259521484
        min_q: -90.33869934082031
        model: {}
    num_steps_sampled: 12500
    num_steps_trained: 1408256
    num_target_updates: 5501
  iterations_since_restore: 12
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 26.78846153846154
    ram_util_percent: 61.83461538461539
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processin

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,12,198.182,12500,-1288.72,-11.7224,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-58-14
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -11.722382280782014
  episode_reward_mean: -1269.864064156319
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 4
  episodes_total: 66
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 13500
    learner:
      default_policy:
        max_q: 0.9233301877975464
        mean_q: -62.598182678222656
        min_q: -89.0087890625
        model: {}
    num_steps_sampled: 13500
    num_steps_trained: 1536256
    num_target_updates: 6001
  iterations_since_restore: 13
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 22.896
    ram_util_percent: 62.232
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.10005915854217143
 

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,13,216.094,13500,-1269.86,-11.7224,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-58-33
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -3.1396012759307346
  episode_reward_mean: -1193.5720942077833
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 6
  episodes_total: 72
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 14500
    learner:
      default_policy:
        max_q: 2.8817272186279297
        mean_q: -63.85460662841797
        min_q: -94.13409423828125
        model: {}
    num_steps_sampled: 14500
    num_steps_trained: 1664256
    num_target_updates: 6501
  iterations_since_restore: 14
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 24.12
    ram_util_percent: 63.36
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.1001575898790833


Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,14,234.42,14500,-1193.57,-3.1396,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-58-51
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -3.1396012759307346
  episode_reward_mean: -1179.2687786561391
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 4
  episodes_total: 76
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 15500
    learner:
      default_policy:
        max_q: 2.4560861587524414
        mean_q: -63.485443115234375
        min_q: -99.68254089355469
        model: {}
    num_steps_sampled: 15500
    num_steps_trained: 1792256
    num_target_updates: 7001
  iterations_since_restore: 15
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 21.6
    ram_util_percent: 63.26
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.10020802807056446

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,15,252.232,15500,-1179.27,-3.1396,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-59-08
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.8323446584304053
  episode_reward_mean: -1113.7970069430512
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 6
  episodes_total: 82
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 16500
    learner:
      default_policy:
        max_q: 3.410090684890747
        mean_q: -70.59851837158203
        min_q: -103.12985229492188
        model: {}
    num_steps_sampled: 16500
    num_steps_trained: 1920256
    num_target_updates: 7501
  iterations_since_restore: 16
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 23.516
    ram_util_percent: 63.868
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.100262394661624

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,16,269.855,16500,-1113.8,-1.83234,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-59-26
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.8323446584304053
  episode_reward_mean: -1085.0581441625716
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 4
  episodes_total: 86
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 17500
    learner:
      default_policy:
        max_q: 3.329087018966675
        mean_q: -69.15478515625
        min_q: -105.27165222167969
        model: {}
    num_steps_sampled: 17500
    num_steps_trained: 2048256
    num_target_updates: 8001
  iterations_since_restore: 17
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 22.852000000000004
    ram_util_percent: 61.42
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.1002903

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,17,287.561,17500,-1085.06,-1.83234,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-59-44
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.1116457547643066
  episode_reward_mean: -1034.789899227471
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 6
  episodes_total: 92
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 18500
    learner:
      default_policy:
        max_q: 3.6690609455108643
        mean_q: -67.78599548339844
        min_q: -111.97421264648438
        model: {}
    num_steps_sampled: 18500
    num_steps_trained: 2176256
    num_target_updates: 8501
  iterations_since_restore: 18
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 21.984
    ram_util_percent: 61.907999999999994
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.100

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,18,305.711,18500,-1034.79,-1.11165,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_21-00-02
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.1116457547643066
  episode_reward_mean: -997.122177864407
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 4
  episodes_total: 96
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 19500
    learner:
      default_policy:
        max_q: 5.729738235473633
        mean_q: -72.16944122314453
        min_q: -117.57238006591797
        model: {}
    num_steps_sampled: 19500
    num_steps_trained: 2304256
    num_target_updates: 9001
  iterations_since_restore: 19
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 20.772
    ram_util_percent: 63.343999999999994
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.10034

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,19,323.599,19500,-997.122,-1.11165,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_21-00-20
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.1116457547643066
  episode_reward_mean: -962.5109391533188
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 6
  episodes_total: 102
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 20500
    learner:
      default_policy:
        max_q: 5.6856279373168945
        mean_q: -70.61603546142578
        min_q: -114.73045349121094
        model: {}
    num_steps_sampled: 20500
    num_steps_trained: 2432256
    num_target_updates: 9501
  iterations_since_restore: 20
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 21.34
    ram_util_percent: 63.876
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.100398433589643

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,20,341.518,20500,-962.511,-1.11165,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_21-00-38
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.1116457547643066
  episode_reward_mean: -926.8117919657061
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 4
  episodes_total: 106
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 21500
    learner:
      default_policy:
        max_q: 4.623827934265137
        mean_q: -73.03113555908203
        min_q: -121.44898986816406
        model: {}
    num_steps_sampled: 21500
    num_steps_trained: 2560256
    num_target_updates: 10001
  iterations_since_restore: 21
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 20.892
    ram_util_percent: 64.264
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.10054363674887

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,21,359.284,21500,-926.812,-1.11165,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_21-00-56
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.1116457547643066
  episode_reward_mean: -845.3818124377199
  episode_reward_min: -1702.4878189939052
  episodes_this_iter: 6
  episodes_total: 112
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 22500
    learner:
      default_policy:
        max_q: 5.559916973114014
        mean_q: -77.15824890136719
        min_q: -124.76898956298828
        model: {}
    num_steps_sampled: 22500
    num_steps_trained: 2688256
    num_target_updates: 10501
  iterations_since_restore: 22
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 23.011538461538464
    ram_util_percent: 62.280769230769245
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_process

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,22,377.508,22500,-845.382,-1.11165,-1702.49,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_21-01-14
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.1116457547643066
  episode_reward_mean: -789.2500149746359
  episode_reward_min: -1702.4878189939052
  episodes_this_iter: 4
  episodes_total: 116
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 23500
    learner:
      default_policy:
        max_q: 5.940964221954346
        mean_q: -75.83016967773438
        min_q: -128.67138671875
        model: {}
    num_steps_sampled: 23500
    num_steps_trained: 2816256
    num_target_updates: 11001
  iterations_since_restore: 23
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 21.7
    ram_util_percent: 61.724
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.1007344092037897
  

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,23,395.654,23500,-789.25,-1.11165,-1702.49,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_21-01-33
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.1116457547643066
  episode_reward_mean: -711.1807312858703
  episode_reward_min: -1702.4878189939052
  episodes_this_iter: 6
  episodes_total: 122
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 24500
    learner:
      default_policy:
        max_q: 5.302407264709473
        mean_q: -76.78132629394531
        min_q: -129.2058563232422
        model: {}
    num_steps_sampled: 24500
    num_steps_trained: 2944256
    num_target_updates: 11501
  iterations_since_restore: 24
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 22.71153846153846
    ram_util_percent: 63.11538461538461
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,24,414.032,24500,-711.181,-1.11165,-1702.49,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_21-01-50
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.1116457547643066
  episode_reward_mean: -654.9789331537673
  episode_reward_min: -1543.9091369921614
  episodes_this_iter: 4
  episodes_total: 126
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 25500
    learner:
      default_policy:
        max_q: 8.509936332702637
        mean_q: -74.9183349609375
        min_q: -133.55616760253906
        model: {}
    num_steps_sampled: 25500
    num_steps_trained: 3072256
    num_target_updates: 12001
  iterations_since_restore: 25
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 25.025000000000002
    ram_util_percent: 63.50416666666666
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processin

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,25,431.321,25500,-654.979,-1.11165,-1543.91,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_21-02-09
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.1116457547643066
  episode_reward_mean: -581.5873670837315
  episode_reward_min: -1519.854635629977
  episodes_this_iter: 6
  episodes_total: 132
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 26500
    learner:
      default_policy:
        max_q: 5.026442050933838
        mean_q: -76.33003234863281
        min_q: -136.049560546875
        model: {}
    num_steps_sampled: 26500
    num_steps_trained: 3200256
    num_target_updates: 12501
  iterations_since_restore: 26
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 26.35
    ram_util_percent: 62.79615384615385
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.1008556

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,26,450.279,26500,-581.587,-1.11165,-1519.85,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_21-02-28
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.1116457547643066
  episode_reward_mean: -543.6693148351875
  episode_reward_min: -1519.854635629977
  episodes_this_iter: 4
  episodes_total: 136
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 27500
    learner:
      default_policy:
        max_q: 4.031546592712402
        mean_q: -76.26606750488281
        min_q: -138.93124389648438
        model: {}
    num_steps_sampled: 27500
    num_steps_trained: 3328256
    num_target_updates: 13001
  iterations_since_restore: 27
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 24.355555555555558
    ram_util_percent: 62.829629629629636
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processi

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,27,469.222,27500,-543.669,-1.11165,-1519.85,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_21-02-47
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.1116457547643066
  episode_reward_mean: -503.3479170812064
  episode_reward_min: -1519.854635629977
  episodes_this_iter: 6
  episodes_total: 142
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 28500
    learner:
      default_policy:
        max_q: 8.385159492492676
        mean_q: -79.76459503173828
        min_q: -143.51950073242188
        model: {}
    num_steps_sampled: 28500
    num_steps_trained: 3456256
    num_target_updates: 13501
  iterations_since_restore: 28
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 24.71153846153846
    ram_util_percent: 62.20384615384616
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,28,488.013,28500,-503.348,-1.11165,-1519.85,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_21-03-06
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.1116457547643066
  episode_reward_mean: -461.2358136139652
  episode_reward_min: -1519.854635629977
  episodes_this_iter: 4
  episodes_total: 146
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 29500
    learner:
      default_policy:
        max_q: 6.932671070098877
        mean_q: -74.86430358886719
        min_q: -146.92617797851562
        model: {}
    num_steps_sampled: 29500
    num_steps_trained: 3584256
    num_target_updates: 14001
  iterations_since_restore: 29
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 25.81851851851852
    ram_util_percent: 62.31481481481482
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,29,507.24,29500,-461.236,-1.11165,-1519.85,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_21-03-25
  done: true
  episode_len_mean: 200.0
  episode_reward_max: -1.1116457547643066
  episode_reward_mean: -444.3119269581459
  episode_reward_min: -1524.5879513507703
  episodes_this_iter: 6
  episodes_total: 152
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 30500
    learner:
      default_policy:
        max_q: 5.001977920532227
        mean_q: -80.31802368164062
        min_q: -147.42247009277344
        model: {}
    num_steps_sampled: 30500
    num_steps_trained: 3712256
    num_target_updates: 14501
  iterations_since_restore: 30
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 25.751851851851853
    ram_util_percent: 63.059259259259264
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processi

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,TERMINATED,,0.001,30,526.471,30500,-444.312,-1.11165,-1524.59,200



2021-07-13 21:03:26,318	INFO tune.py:448 -- Total run time: 540.24 seconds (539.55 seconds for the tuning loop).


<ray.tune.analysis.experiment_analysis.ExperimentAnalysis at 0x7fadfcd0d490>
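The status rows printed during the run are comma-separated, so the training progress can be tabulated with the standard library alone. The following is a minimal sketch, not part of the original notebook: it parses three rows copied verbatim from the output above (iterations 10, 20, and 30) and reports the change in mean episode reward. The helper name `parse_status_rows` and the output keys are assumptions made for illustration; the `reward` column is taken to be `episode_reward_mean`, which matches the values in the detailed result blocks.

```python
import csv
import io

# Three status rows copied verbatim from the Tune output above.
STATUS_ROWS = """\
Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,10,161.598,10500,-1348.13,-11.7224,-1796.95,200
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,20,341.518,20500,-962.511,-1.11165,-1796.95,200
DDPG_Pendulum-v0_6b1dd_00000,TERMINATED,,0.001,30,526.471,30500,-444.312,-1.11165,-1524.59,200
"""


def parse_status_rows(text):
    """Return one dict per status row, with numeric fields converted."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append({
            "iter": int(rec["iter"]),
            "timesteps": int(rec["ts"]),
            "seconds": float(rec["total time (s)"]),
            "reward_mean": float(rec["reward"]),  # mean episode reward
        })
    return rows


progress = parse_status_rows(STATUS_ROWS)
for row in progress:
    print(row)

# Change in mean episode reward between the first and last parsed row.
gain = progress[-1]["reward_mean"] - progress[0]["reward_mean"]
iters = progress[-1]["iter"] - progress[0]["iter"]
print(f"mean reward changed by {gain:.3f} over {iters} iterations")
```

For anything beyond a quick look, the `ExperimentAnalysis` object returned by `tune.run` is probably the better starting point, since it already holds the per-trial results.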

## 1: RLlib Environments

1: RLlib works with several types of environments: OpenAI Gym, user-defined, multi-agent, and batched environments.

2: RLlib uses Gym as its environment interface for single-agent training.



#### 1: Configuring Environments

    https://github.com/ray-project/ray/blob/master/rllib/examples/custom_env.py

In [3]:
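import gym, ray
import numpy as np
from gym import spaces
from ray.rllib.agents import ppo

class MyEnv(gym.Env):
    """Illustrative toy env; the spaces and dynamics are placeholders to replace with your own."""
    def __init__(self, env_config):
        # env_config is the "env_config" dict from the trainer config
        self.max_steps = env_config.get("max_steps", 10)
        self.action_space = spaces.Discrete(2)  # any gym.Space
        self.observation_space = spaces.Box(0.0, 1.0, shape=(1,), dtype=np.float32)  # any gym.Space
        self.t = 0
    def reset(self):
        self.t = 0
        return np.zeros(1, dtype=np.float32)  # initial obs
    def step(self, action):
        self.t += 1
        obs = np.array([self.t / self.max_steps], dtype=np.float32)
        # returns obs, reward (float), done (bool), info (dict)
        return obs, float(action), self.t >= self.max_steps, {}

ray.init(ignore_reinit_error=True)  # Ray may already be running in this notebook
trainer = ppo.PPOTrainer(env=MyEnv, config={
    "env_config": {},  # config to pass to env class
})

for _ in range(3):  # a few iterations; loop longer for real training
    print(trainer.train())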
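Before handing a custom environment to a trainer, it can help to step it by hand and confirm that `reset()` and `step()` behave as expected. The sketch below is an illustration only: `CountdownEnv` and `rollout` are hypothetical names, and the environment deliberately does not subclass `gym.Env` or declare spaces, so the snippet runs without Gym or Ray installed. It only mirrors the `reset()` / `step()` contract used in the cell above.

```python
import random


class CountdownEnv:
    """Hypothetical stand-in exposing the Gym-style reset/step interface.

    A real RLlib environment would subclass gym.Env and declare
    action_space and observation_space; both are omitted here so the
    sketch has no third-party dependencies.
    """

    def __init__(self, env_config):
        self.horizon = env_config.get("horizon", 5)
        self.steps_left = self.horizon

    def reset(self):
        self.steps_left = self.horizon
        return self.steps_left  # observation: steps remaining

    def step(self, action):
        self.steps_left -= 1
        reward = 1.0 if action == 1 else 0.0
        done = self.steps_left <= 0
        return self.steps_left, reward, done, {}


def rollout(env, policy, max_steps=1000):
    """Run one episode and return (total_reward, episode_length)."""
    obs = env.reset()
    total, length, done = 0.0, 0, False
    while not done and length < max_steps:
        obs, reward, done, info = env.step(policy(obs))
        total += reward
        length += 1
    return total, length


env = CountdownEnv({"horizon": 5})

# A fixed policy that always picks action 1.
fixed_total, fixed_length = rollout(env, policy=lambda obs: 1)
print("fixed policy:", fixed_total, fixed_length)

# A random policy over the two actions, as a smoke test.
random_total, random_length = rollout(env, policy=lambda obs: random.choice([0, 1]))
print("random policy:", random_total, random_length)
```

The `max_steps` guard is there so that an environment which never sets `done` cannot hang the notebook.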