### To view TensorBoard:
    1: tensorboard --logdir ~/ray_results
    2: open http://localhost:6006/ in a browser

In [2]:
import ray
import ray.rllib.agents.ppo as ppo
from ray.tune.logger import pretty_print
from ray import tune




## 0: RLlib Training APIs: 
1: At a high level, RLlib provides a Trainer class that holds a policy for environment interaction. Through the trainer interface, the policy can be trained and checkpointed, and actions can be computed from it. In multi-agent training, the trainer manages the querying and optimization of multiple policies at once.

2: rllib train --run DQN --env CartPole-v0  --config '{"num_workers": 8}'
    To view TensorBoard: tensorboard --logdir=~/ray_results

3: rllib rollout ~/ray_results/default/DQN_CartPole-v0_0upjmdgr0/checkpoint_1/checkpoint-1 \
    --run DQN --env CartPole-v0 --steps 10000

4: Loading and restoring a trained agent from a checkpoint is simple:
    
    agent = ppo.PPOTrainer(config=config, env=env_class)
    agent.restore(checkpoint_path)
    
5: Computing Actions

The simplest way to programmatically compute actions from a trained agent is to use trainer.compute_action(). This method preprocesses and filters the observation before passing it to the agent policy. Here is a simple example of testing a trained agent for one episode:

    # instantiate env class
    env = env_class(env_config)

    # run until episode ends
    episode_reward = 0
    done = False
    obs = env.reset()
    while not done:
        action = agent.compute_action(obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward
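
The single-episode loop above can be wrapped in a helper and averaged over several episodes, which gives a steadier estimate than one rollout. The following is a minimal sketch under stated assumptions: StubEnv and StubAgent are hypothetical stand-ins (not RLlib or Gym classes) so the example runs without Ray installed, and run_episode and evaluate are illustrative names. With a real setup, the restored agent and an env_class(env_config) instance would take their place.

```python
# Hypothetical stand-ins so the sketch runs without Ray or Gym installed.
class StubEnv:
    """Toy environment: pays reward 1.0 per step and ends after `horizon` steps."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # dummy observation

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return self.t, 1.0, done, {}


class StubAgent:
    """Stands in for a restored trainer; always returns action 0."""

    def compute_action(self, obs):
        return 0


def run_episode(agent, env):
    """Roll out one episode and return its total reward."""
    episode_reward = 0.0
    done = False
    obs = env.reset()
    while not done:
        action = agent.compute_action(obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward
    return episode_reward


def evaluate(agent, env, num_episodes=10):
    """Average the episode reward over several rollouts."""
    rewards = [run_episode(agent, env) for _ in range(num_episodes)]
    return sum(rewards) / len(rewards)


mean_reward = evaluate(StubAgent(), StubEnv(horizon=5), num_episodes=3)
print(mean_reward)
```

The helper only relies on compute_action(), reset(), and step(), the same calls used in the loop above.
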
6: It's recommended that you run RLlib trainers with Tune, for easy experiment management and visualization of results. Just set "run": ALG_NAME, "env": ENV_NAME in the experiment config. All RLlib trainers are compatible with the Tune API, so they can be used directly in Tune experiments.

7: tune.run() returns an ExperimentAnalysis object that allows further analysis of the training results and retrieving the checkpoint(s) of the trained agent. It also simplifies saving the trained agent. For example:

a: tune.run() allows setting a custom log directory (other than ~/ray_results) and automatically saving the trained agent

    analysis = ray.tune.run(
        ppo.PPOTrainer,
        config=config,
        local_dir=log_dir,
        stop=stop_criteria,
        checkpoint_at_end=True)

b: get_trial_checkpoints_paths() returns a list of lists, one per checkpoint; each inner list holds the checkpoint path first and the metric value second

        checkpoints = analysis.get_trial_checkpoints_paths(
            trial=analysis.get_best_trial("episode_reward_mean"),
            metric="episode_reward_mean")

c: or simply get the last checkpoint (with highest "training_iteration")

        last_checkpoint = analysis.get_last_checkpoint()
    
d: if there are multiple trials, select a specific trial or automatically choose the best one according to a given metric

        last_checkpoint = analysis.get_last_checkpoint(
            metric="episode_reward_mean", mode="max"
        )

e: Loading and restoring a trained agent from a checkpoint is simple:

    agent = ppo.PPOTrainer(config=config, env=env_class)
    agent.restore(checkpoint_path)
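
Step b describes the checkpoint list as pairs of path and metric value. A short sketch of picking the best entry from such a list follows. The paths and numbers are made-up placeholder data, and best_checkpoint is a hypothetical helper written for this note, not a Tune function.

```python
# Made-up data in the [path, metric_value] shape described in step b.
checkpoints = [
    ["/tmp/example/checkpoint_000001/checkpoint-1", 22.9],
    ["/tmp/example/checkpoint_000011/checkpoint-11", 139.1],
    ["/tmp/example/checkpoint_000021/checkpoint-21", 117.4],
]


def best_checkpoint(checkpoints, mode="max"):
    """Return (path, metric_value) of the best entry; mode is 'max' or 'min'."""
    if not checkpoints:
        raise ValueError("no checkpoints to choose from")
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    pick = max if mode == "max" else min
    path, value = pick(checkpoints, key=lambda entry: entry[1])
    return path, value


checkpoint_path, metric_value = best_checkpoint(checkpoints, mode="max")
print(checkpoint_path, metric_value)
```

The resulting checkpoint_path is the kind of value that agent.restore() in step e expects.
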

In [2]:
ray.shutdown()
ray.init()

2021-10-16 02:47:55,173	INFO services.py:1252 -- View the Ray dashboard at http://127.0.0.1:8267


{'node_ip_address': '172.16.21.53',
 'raylet_ip_address': '172.16.21.53',
 'redis_address': '172.16.21.53:48500',
 'object_store_address': '/tmp/ray/session_2021-10-16_02-47-53_348974_46171/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2021-10-16_02-47-53_348974_46171/sockets/raylet',
 'webui_url': '127.0.0.1:8267',
 'session_dir': '/tmp/ray/session_2021-10-16_02-47-53_348974_46171',
 'metrics_export_port': 64532,
 'node_id': 'fc79172c023b86b61972689708962114fb1c722f304004b0b5893f8a'}

#### 1 Example of Training a PPO Agent

In [5]:
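config = ppo.DEFAULT_CONFIG.copy()
config['num_gpus'] = 1      # GPUs for the trainer process; use 0 on a CPU-only machine
config['num_workers'] = 2   # number of parallel rollout workers
trainer = ppo.PPOTrainer(config=config, env='CartPole-v1')

for i in range(30):
    result = trainer.train()
    # save a checkpoint every 10th loop pass (i = 0, 10, 20)
    if i % 10 == 0:
        checkpoint = trainer.save()
        print('checkpoint saved at:', checkpoint)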

[2m[36m(pid=48648)[0m 
[2m[36m(pid=48647)[0m 
[2m[36m(pid=48648)[0m 2021-10-16 16:08:08.679896: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
[2m[36m(pid=48648)[0m Instructions for updating:
[2m[36m(pid=48648)[0m If using Keras pass *_constraint arguments to layers.
[2m[36m(pid=48647)[0m 2021-10-16 16:08:08.682076: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
[2m[36m(pid=48647)[0m Instructions for updating:
[2m[36m(pid=48647)[0m If using Keras pass *_constraint arguments to layers.
[2m[36m(pid=48648)[0m Instructions for updating:
[2m[36m(pid=48648)[0m Use tf.where in 2.0, which has the same broadcast rule as np.where
[2m[36m(pid=48647)[0m Instructions for updating:
[2m[36m(pid=48647)[0m Use tf.where in 2.0, which has the same broadcast rule as np.where
[2m[36m(pid=48648)[0m

[2m[36m(RolloutWorker pid=48648)[0m [2021-10-16 16:08:09.558 ip-172-16-19-112:48648 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2m[36m(RolloutWorker pid=48648)[0m [2021-10-16 16:08:09.592 ip-172-16-19-112:48648 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
[2m[36m(RolloutWorker pid=48647)[0m [2021-10-16 16:08:09.611 ip-172-16-19-112:48647 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2m[36m(RolloutWorker pid=48647)[0m [2021-10-16 16:08:09.646 ip-172-16-19-112:48647 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.




checkpoint saved at: /home/ec2-user/ray_results/PPO_CartPole-v1_2021-10-16_16-08-04wkiy6cxa/checkpoint_000001/checkpoint-1
checkpoint saved at: /home/ec2-user/ray_results/PPO_CartPole-v1_2021-10-16_16-08-04wkiy6cxa/checkpoint_000011/checkpoint-11
checkpoint saved at: /home/ec2-user/ray_results/PPO_CartPole-v1_2021-10-16_16-08-04wkiy6cxa/checkpoint_000021/checkpoint-21
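
The checkpoint paths above end in checkpoint-1, checkpoint-11, and checkpoint-21 rather than 10, 20, and 30. That matches a loop that tests i % 10 == 0 on a 0-based counter while each checkpoint is numbered by the count of completed train() calls. The sketch below is plain Python that only reproduces this arithmetic; checkpoint_iterations and its offset parameter are hypothetical names introduced here, not part of RLlib.

```python
def checkpoint_iterations(num_iters, every, offset=0):
    """List the 1-based training iterations at which a checkpoint is saved
    when the loop condition is (i + offset) % every == 0 for i in range(num_iters)."""
    return [i + 1 for i in range(num_iters) if (i + offset) % every == 0]


# Condition used in the cell above: i % 10 == 0
as_written = checkpoint_iterations(30, 10)
# Alternative condition: (i + 1) % 10 == 0
shifted = checkpoint_iterations(30, 10, offset=1)

print(as_written)
print(shifted)
```

If checkpoints at iterations 10, 20, and 30 are preferred, testing (i + 1) % 10 == 0 in the training loop would be one way to get them.
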


#### 2 Example of Using Tune

In [4]:
alg = 'PPO'
tune.run(
    alg,
    stop={'episode_reward_mean': 200},
    config={
        'env': 'CartPole-v0',
        'num_gpus': 1,
        'num_workers': 2,
        'lr': tune.grid_search([.01, .001, .0001]),
    },
)

Trial name,status,loc,lr
PPO_CartPole-v0_eda9d_00000,PENDING,,0.01
PPO_CartPole-v0_eda9d_00001,PENDING,,0.001
PPO_CartPole-v0_eda9d_00002,PENDING,,0.0001
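
The three PENDING trials correspond to the three learning rates passed to tune.grid_search: each grid value yields one trial with an otherwise identical config. The sketch below illustrates that expansion in plain Python. expand_grid is a hypothetical helper written for this note and is not how Tune implements grid search internally.

```python
from itertools import product


def expand_grid(fixed, grid):
    """Yield one concrete config per combination of grid values.

    fixed: dict of settings shared by all trials.
    grid:  dict mapping a key to the list of values to sweep over.
    """
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        trial_config = dict(fixed)
        trial_config.update(zip(keys, values))
        yield trial_config


fixed = {"env": "CartPole-v0", "num_gpus": 1, "num_workers": 2}
trials = list(expand_grid(fixed, {"lr": [0.01, 0.001, 0.0001]}))
for trial_config in trials:
    print(trial_config)
```

With more than one swept key, the number of trials is the product of the list lengths, so grids grow quickly.
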


[2m[36m(pid=41261)[0m 
[2m[36m(pid=41259)[0m 
[2m[36m(pid=41260)[0m 
[2m[36m(pid=41259)[0m 2021-10-16 15:44:24,968	INFO trainer.py:741 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=41259)[0m 2021-10-16 15:44:24,968	INFO ppo.py:165 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
[2m[36m(pid=41259)[0m 2021-10-16 15:44:24,968	INFO trainer.py:760 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=41261)[0m 2021-10-16 15:44:25,001	INFO trainer.py:741 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=41261)[0m 2021-10-16 15:44:25,001	INFO ppo.py:165 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this does

Trial name,status,loc,lr
PPO_CartPole-v0_eda9d_00000,RUNNING,,0.01
PPO_CartPole-v0_eda9d_00001,RUNNING,,0.001
PPO_CartPole-v0_eda9d_00002,RUNNING,,0.0001


[2m[36m(pid=41366)[0m 
[2m[36m(pid=41367)[0m 
[2m[36m(pid=41370)[0m 
[2m[36m(pid=41373)[0m 
[2m[36m(pid=41368)[0m 
[2m[36m(pid=41369)[0m 
[2m[36m(pid=41373)[0m 2021-10-16 15:44:30.430709: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
[2m[36m(pid=41366)[0m 2021-10-16 15:44:30.466143: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
[2m[36m(pid=41366)[0m Instructions for updating:
[2m[36m(pid=41366)[0m If using Keras pass *_constraint arguments to layers.
[2m[36m(pid=41367)[0m 2021-10-16 15:44:30.490393: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
[2m[36m(pid=41367)[0m Instructions for updating:
[2m[36m(pid=41367)[0m If using Keras pass *_constraint arguments to layers.
[2m[36m(pid=41370)


[2m[36m(pid=41366)[0m Instructions for updating:
[2m[36m(pid=41366)[0m Use tf.where in 2.0, which has the same broadcast rule as np.where
[2m[36m(pid=41367)[0m Instructions for updating:
[2m[36m(pid=41367)[0m Use tf.where in 2.0, which has the same broadcast rule as np.where
[2m[36m(pid=41370)[0m Instructions for updating:
[2m[36m(pid=41370)[0m Use tf.where in 2.0, which has the same broadcast rule as np.where
[2m[36m(pid=41373)[0m Instructions for updating:
[2m[36m(pid=41373)[0m Use tf.where in 2.0, which has the same broadcast rule as np.where
[2m[36m(pid=41368)[0m Instructions for updating:
[2m[36m(pid=41368)[0m Use tf.where in 2.0, which has the same broadcast rule as np.where
[2m[36m(pid=41369)[0m Instructions for updating:
[2m[36m(pid=41369)[0m Use tf.where in 2.0, which has the same broadcast rule as np.where
[2m[36m(pid=41366)[0m 
[2m[36m(pid=41366)[0m 
[2m[36m(pid=41370)[0m 
[2m[36m(pid=41370)[0m 
[2m[36m(pid=41373)[0m 
[2m[

[2m[36m(RolloutWorker pid=41373)[0m [2021-10-16 15:44:31.333 ip-172-16-19-112:41373 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2m[36m(RolloutWorker pid=41366)[0m [2021-10-16 15:44:31.357 ip-172-16-19-112:41366 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2m[36m(RolloutWorker pid=41366)[0m [2021-10-16 15:44:31.391 ip-172-16-19-112:41366 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
[2m[36m(RolloutWorker pid=41373)[0m [2021-10-16 15:44:31.367 ip-172-16-19-112:41373 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
[2m[36m(RolloutWorker pid=41367)[0m [2021-10-16 15:44:31.403 ip-172-16-19-112:41367 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2m[36m(RolloutWorker pid=41367)[0m [2021-10-16 15:44:31.437 ip-172-16-19-112:41367 INFO profiler_config_parser.py:111] Unable to find config at /opt/

[2m[36m(pid=41261)[0m Instructions for updating:
[2m[36m(pid=41261)[0m If using Keras pass *_constraint arguments to layers.
[2m[36m(pid=41259)[0m Instructions for updating:
[2m[36m(pid=41259)[0m If using Keras pass *_constraint arguments to layers.
[2m[36m(pid=41260)[0m Instructions for updating:
[2m[36m(pid=41260)[0m If using Keras pass *_constraint arguments to layers.
[2m[36m(pid=41259)[0m 2021-10-16 15:44:32.666603: F tensorflow/stream_executor/cuda/cuda_driver.cc:175] Check failed: err == cudaSuccess || err == cudaErrorInvalidValue Unexpected CUDA error: out of memory
[2m[36m(pid=41259)[0m *** SIGABRT received at time=1634399072 on cpu 5 ***
[2m[36m(pid=41259)[0m PC: @     0x7fef6733a3b7  (unknown)  raise
[2m[36m(pid=41259)[0m     @     0x7fef67ff2600  (unknown)  (unknown)
[2m[36m(pid=41261)[0m Instructions for updating:
[2m[36m(pid=41261)[0m Use tf.where in 2.0, which has the same broadcast rule as np.where
[2m[36m(pid=41260)[0m Instruction

[2m[36m(PPO pid=41260)[0m [2021-10-16 15:44:34.095 ip-172-16-19-112:41260 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2m[36m(PPO pid=41260)[0m [2021-10-16 15:44:34.126 ip-172-16-19-112:41260 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
[2m[36m(PPO pid=41261)[0m [2021-10-16 15:44:34.128 ip-172-16-19-112:41261 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2m[36m(PPO pid=41261)[0m [2021-10-16 15:44:34.160 ip-172-16-19-112:41261 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.


[2m[36m(pid=41261)[0m 
[2m[36m(pid=41261)[0m 
[2m[36m(pid=41259)[0m     @     0x7fcb32dc61aa        720  tensorflow::GPUUtil::CopyCPUTensorToGPU()
[2m[36m(pid=41259)[0m     @     0x7fcb32dc79f9        128  tensorflow::GPUDeviceContext::CopyCPUTensorToDevice()
[2m[36m(pid=41259)[0m     @     0x7fcb32dafe82        400  tensorflow::BaseGPUDevice::MaybeCopyTensorToGPU()
[2m[36m(pid=41259)[0m     @     0x7fcb32db2bb0        352  tensorflow::BaseGPUDevice::MakeTensorFromProto()
[2m[36m(pid=41259)[0m     @     0x7fcb3a10f211        192  tensorflow::ConstantOp::ConstantOp()
[2m[36m(pid=41259)[0m     @     0x7fcb3a10f792         32  tensorflow::{lambda()#15}::_FUN()
[2m[36m(pid=41259)[0m     @     0x7fcb32b63d65       1040  tensorflow::CreateOpKernel()
[2m[36m(pid=41259)[0m     @     0x7fcb32e0bde1        144  tensorflow::CreateNonCachedKernel()
[2m[36m(pid=41259)[0m     @     0x7fcb32e28cc1       1200  tensorflow::FunctionLibraryRuntimeImpl::CreateKernel()
[2m


[2m[36m(pid=41259)[0m     @     0x5608068faef4  (unknown)  call_function
[2m[36m(pid=41259)[0m     @     0x560806ae06e0  (unknown)  (unknown)
[2m[36m(pid=41259)[0m [2021-10-16 15:44:35,585 E 41259 41259] logging.cc:315: *** SIGABRT received at time=1634399075 on cpu 5 ***
[2m[36m(pid=41259)[0m [2021-10-16 15:44:35,586 E 41259 41259] logging.cc:315: PC: @     0x7fef6733a3b7  (unknown)  raise
[2m[36m(pid=41259)[0m [2021-10-16 15:44:35,589 E 41259 41259] logging.cc:315:     @     0x7fef67ff2600  (unknown)  (unknown)
[2m[36m(pid=41259)[0m [2021-10-16 15:44:35,589 E 41259 41259] logging.cc:315:     @     0x7fcb3e813f7c        640  stream_executor::gpu::(anonymous namespace)::CheckPointerIsValid<>()
[2m[36m(pid=41259)[0m [2021-10-16 15:44:35,589 E 41259 41259] logging.cc:315:     @     0x7fcb3e81c49b        560  stream_executor::gpu::GpuDriver::AsynchronousMemcpyH2D()
[2m[36m(pid=41259)[0m [2021-10-16 15:44:35,589 E 41259 41259] logging.cc:315:     @     0x7fcb3e8db8e

Result for PPO_CartPole-v0_eda9d_00001:
  {}
  


[2m[36m(pid=41260)[0m 2021-10-16 15:44:36,295	INFO trainable.py:112 -- Trainable.setup took 11.291 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=41261)[0m 2021-10-16 15:44:36,464	INFO trainable.py:112 -- Trainable.setup took 11.463 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.


Trial name,status,loc,lr
PPO_CartPole-v0_eda9d_00000,RUNNING,,0.01
PPO_CartPole-v0_eda9d_00002,RUNNING,,0.0001
PPO_CartPole-v0_eda9d_00001,ERROR,,0.001

Trial name,# failures,error file
PPO_CartPole-v0_eda9d_00001,1,/home/ec2-user/ray_results/PPO/PPO_CartPole-v0_eda9d_00001_1_lr=0.001_2021-10-16_15-44-21/error.txt


Result for PPO_CartPole-v0_eda9d_00000:
  agent_timesteps_total: 4000
  custom_metrics: {}
  date: 2021-10-16_15-44-44
  done: false
  episode_len_mean: 22.92485549132948
  episode_media: {}
  episode_reward_max: 78.0
  episode_reward_mean: 22.92485549132948
  episode_reward_min: 8.0
  episodes_this_iter: 173
  episodes_total: 173
  experiment_id: fee4be6e696a4767b3e6c9df50a77f99
  hostname: ip-172-16-19-112
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.009999999776482582
          entropy: 0.6559051871299744
          entropy_coeff: 0.0
          kl: 0.039926424622535706
          model: {}
          policy_loss: -0.04267517849802971
          total_loss: 92.9194107055664
          vf_explained_var: 0.3246573805809021
          vf_loss: 92.9541015625
        train: null
    num_agent_steps_sampled: 4000
    num_agent_steps_trained: 4000
    num_steps_sampled: 4000
    num_steps_trained: 4000
  iteratio

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_eda9d_00000,RUNNING,172.16.19.112:41260,0.01,1.0,8.66673,4000.0,22.9249,78.0,8.0,22.9249
PPO_CartPole-v0_eda9d_00002,RUNNING,172.16.19.112:41261,0.0001,1.0,8.59015,4000.0,22.9827,73.0,9.0,22.9827
PPO_CartPole-v0_eda9d_00001,ERROR,,0.001,,,,,,,

Trial name,# failures,error file
PPO_CartPole-v0_eda9d_00001,1,/home/ec2-user/ray_results/PPO/PPO_CartPole-v0_eda9d_00001_1_lr=0.001_2021-10-16_15-44-21/error.txt


Result for PPO_CartPole-v0_eda9d_00000:
  agent_timesteps_total: 8000
  custom_metrics: {}
  date: 2021-10-16_15-44-52
  done: false
  episode_len_mean: 50.89
  episode_media: {}
  episode_reward_max: 177.0
  episode_reward_mean: 50.89
  episode_reward_min: 8.0
  episodes_this_iter: 51
  episodes_total: 224
  experiment_id: fee4be6e696a4767b3e6c9df50a77f99
  hostname: ip-172-16-19-112
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.009999999776482582
          entropy: 0.5796728134155273
          entropy_coeff: 0.0
          kl: 0.02672765962779522
          model: {}
          policy_loss: -0.015455672517418861
          total_loss: 441.8940124511719
          vf_explained_var: 0.24405714869499207
          vf_loss: 441.9041442871094
        train: null
    num_agent_steps_sampled: 8000
    num_agent_steps_trained: 8000
    num_steps_sampled: 8000
    num_steps_trained: 8000
  iterations_since_restore: 

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_eda9d_00000,RUNNING,172.16.19.112:41260,0.01,2.0,16.555,8000.0,50.89,177.0,8.0,50.89
PPO_CartPole-v0_eda9d_00002,RUNNING,172.16.19.112:41261,0.0001,2.0,16.4461,8000.0,42.62,181.0,10.0,42.62
PPO_CartPole-v0_eda9d_00001,ERROR,,0.001,,,,,,,

Trial name,# failures,error file
PPO_CartPole-v0_eda9d_00001,1,/home/ec2-user/ray_results/PPO/PPO_CartPole-v0_eda9d_00001_1_lr=0.001_2021-10-16_15-44-21/error.txt


Result for PPO_CartPole-v0_eda9d_00002:
  agent_timesteps_total: 12000
  custom_metrics: {}
  date: 2021-10-16_15-45-00
  done: false
  episode_len_mean: 65.2
  episode_media: {}
  episode_reward_max: 194.0
  episode_reward_mean: 65.2
  episode_reward_min: 10.0
  episodes_this_iter: 41
  episodes_total: 303
  experiment_id: 36a8061aa2d24b2a90c2e8a7584a9cbe
  hostname: ip-172-16-19-112
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 9.999999747378752e-05
          entropy: 0.566329836845398
          entropy_coeff: 0.0
          kl: 0.010319194756448269
          model: {}
          policy_loss: -0.02498255856335163
          total_loss: 481.6850891113281
          vf_explained_var: 0.19725137948989868
          vf_loss: 481.7080078125
        train: null
    num_agent_steps_sampled: 12000
    num_agent_steps_trained: 12000
    num_steps_sampled: 12000
    num_steps_trained: 12000
  iterations_since_restore:

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_eda9d_00000,RUNNING,172.16.19.112:41260,0.01,3.0,24.4522,12000.0,83.84,200.0,9.0,83.84
PPO_CartPole-v0_eda9d_00002,RUNNING,172.16.19.112:41261,0.0001,3.0,24.2784,12000.0,65.2,194.0,10.0,65.2
PPO_CartPole-v0_eda9d_00001,ERROR,,0.001,,,,,,,

Trial name,# failures,error file
PPO_CartPole-v0_eda9d_00001,1,/home/ec2-user/ray_results/PPO/PPO_CartPole-v0_eda9d_00001_1_lr=0.001_2021-10-16_15-44-21/error.txt


Result for PPO_CartPole-v0_eda9d_00002:
  agent_timesteps_total: 16000
  custom_metrics: {}
  date: 2021-10-16_15-45-08
  done: false
  episode_len_mean: 94.56
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 94.56
  episode_reward_min: 10.0
  episodes_this_iter: 26
  episodes_total: 329
  experiment_id: 36a8061aa2d24b2a90c2e8a7584a9cbe
  hostname: ip-172-16-19-112
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 9.999999747378752e-05
          entropy: 0.5372269153594971
          entropy_coeff: 0.0
          kl: 0.006838078144937754
          model: {}
          policy_loss: -0.016161803156137466
          total_loss: 417.2235107421875
          vf_explained_var: 0.33221960067749023
          vf_loss: 417.2383117675781
        train: null
    num_agent_steps_sampled: 16000
    num_agent_steps_trained: 16000
    num_steps_sampled: 16000
    num_steps_trained: 16000
  iterations_since_r

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_eda9d_00000,RUNNING,172.16.19.112:41260,0.01,4.0,32.444,16000.0,117.39,200.0,13.0,117.39
PPO_CartPole-v0_eda9d_00002,RUNNING,172.16.19.112:41261,0.0001,4.0,32.0559,16000.0,94.56,200.0,10.0,94.56
PPO_CartPole-v0_eda9d_00001,ERROR,,0.001,,,,,,,

Trial name,# failures,error file
PPO_CartPole-v0_eda9d_00001,1,/home/ec2-user/ray_results/PPO/PPO_CartPole-v0_eda9d_00001_1_lr=0.001_2021-10-16_15-44-21/error.txt


Result for PPO_CartPole-v0_eda9d_00002:
  agent_timesteps_total: 20000
  custom_metrics: {}
  date: 2021-10-16_15-45-16
  done: false
  episode_len_mean: 124.1
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 124.1
  episode_reward_min: 10.0
  episodes_this_iter: 21
  episodes_total: 350
  experiment_id: 36a8061aa2d24b2a90c2e8a7584a9cbe
  hostname: ip-172-16-19-112
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 9.999999747378752e-05
          entropy: 0.5300061702728271
          entropy_coeff: 0.0
          kl: 0.00630270317196846
          model: {}
          policy_loss: -0.008943001739680767
          total_loss: 356.69757080078125
          vf_explained_var: 0.4037415683269501
          vf_loss: 356.705322265625
        train: null
    num_agent_steps_sampled: 20000
    num_agent_steps_trained: 20000
    num_steps_sampled: 20000
    num_steps_trained: 20000
  iterations_since_res

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_eda9d_00000,RUNNING,172.16.19.112:41260,0.01,5.0,40.3958,20000.0,139.13,200.0,13.0,139.13
PPO_CartPole-v0_eda9d_00002,RUNNING,172.16.19.112:41261,0.0001,5.0,40.0145,20000.0,124.1,200.0,10.0,124.1
PPO_CartPole-v0_eda9d_00001,ERROR,,0.001,,,,,,,

Trial name,# failures,error file
PPO_CartPole-v0_eda9d_00001,1,/home/ec2-user/ray_results/PPO/PPO_CartPole-v0_eda9d_00001_1_lr=0.001_2021-10-16_15-44-21/error.txt


Result for PPO_CartPole-v0_eda9d_00002:
  agent_timesteps_total: 24000
  custom_metrics: {}
  date: 2021-10-16_15-45-24
  done: false
  episode_len_mean: 148.77
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 148.77
  episode_reward_min: 15.0
  episodes_this_iter: 20
  episodes_total: 370
  experiment_id: 36a8061aa2d24b2a90c2e8a7584a9cbe
  hostname: ip-172-16-19-112
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 9.999999747378752e-05
          entropy: 0.5496629476547241
          entropy_coeff: 0.0
          kl: 0.005510628689080477
          model: {}
          policy_loss: -0.00802704319357872
          total_loss: 316.805419921875
          vf_explained_var: 0.47900041937828064
          vf_loss: 316.81231689453125
        train: null
    num_agent_steps_sampled: 24000
    num_agent_steps_trained: 24000
    num_steps_sampled: 24000
    num_steps_trained: 24000
  iterations_since_

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_eda9d_00000,RUNNING,172.16.19.112:41260,0.01,6.0,48.294,24000.0,160.57,200.0,13.0,160.57
PPO_CartPole-v0_eda9d_00002,RUNNING,172.16.19.112:41261,0.0001,6.0,47.8176,24000.0,148.77,200.0,15.0,148.77
PPO_CartPole-v0_eda9d_00001,ERROR,,0.001,,,,,,,

Trial name,# failures,error file
PPO_CartPole-v0_eda9d_00001,1,/home/ec2-user/ray_results/PPO/PPO_CartPole-v0_eda9d_00001_1_lr=0.001_2021-10-16_15-44-21/error.txt


Result for PPO_CartPole-v0_eda9d_00002:
  agent_timesteps_total: 28000
  custom_metrics: {}
  date: 2021-10-16_15-45-32
  done: false
  episode_len_mean: 171.24
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 171.24
  episode_reward_min: 15.0
  episodes_this_iter: 20
  episodes_total: 390
  experiment_id: 36a8061aa2d24b2a90c2e8a7584a9cbe
  hostname: ip-172-16-19-112
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 9.999999747378752e-05
          entropy: 0.5375571250915527
          entropy_coeff: 0.0
          kl: 0.0032317060977220535
          model: {}
          policy_loss: -0.0022186448331922293
          total_loss: 318.508056640625
          vf_explained_var: 0.4775434136390686
          vf_loss: 318.5096740722656
        train: null
    num_agent_steps_sampled: 28000
    num_agent_steps_trained: 28000
    num_steps_sampled: 28000
    num_steps_trained: 28000
  iterations_since

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_eda9d_00000,RUNNING,172.16.19.112:41260,0.01,7.0,56.1889,28000.0,174.66,200.0,44.0,174.66
PPO_CartPole-v0_eda9d_00002,RUNNING,172.16.19.112:41261,0.0001,7.0,55.6513,28000.0,171.24,200.0,15.0,171.24
PPO_CartPole-v0_eda9d_00001,ERROR,,0.001,,,,,,,

Trial name,# failures,error file
PPO_CartPole-v0_eda9d_00001,1,/home/ec2-user/ray_results/PPO/PPO_CartPole-v0_eda9d_00001_1_lr=0.001_2021-10-16_15-44-21/error.txt


Result for PPO_CartPole-v0_eda9d_00002:
  agent_timesteps_total: 32000
  custom_metrics: {}
  date: 2021-10-16_15-45-40
  done: false
  episode_len_mean: 187.34
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 187.34
  episode_reward_min: 57.0
  episodes_this_iter: 21
  episodes_total: 411
  experiment_id: 36a8061aa2d24b2a90c2e8a7584a9cbe
  hostname: ip-172-16-19-112
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 9.999999747378752e-05
          entropy: 0.5155300498008728
          entropy_coeff: 0.0
          kl: 0.0047530303709208965
          model: {}
          policy_loss: -0.0044758738949894905
          total_loss: 369.3384094238281
          vf_explained_var: 0.38220083713531494
          vf_loss: 369.3419494628906
        train: null
    num_agent_steps_sampled: 32000
    num_agent_steps_trained: 32000
    num_steps_sampled: 32000
    num_steps_trained: 32000
  iterations_sin

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_eda9d_00000,RUNNING,172.16.19.112:41260,0.01,8.0,64.2579,32000.0,183.9,200.0,48.0,183.9
PPO_CartPole-v0_eda9d_00002,RUNNING,172.16.19.112:41261,0.0001,8.0,63.5449,32000.0,187.34,200.0,57.0,187.34
PPO_CartPole-v0_eda9d_00001,ERROR,,0.001,,,,,,,

Trial name,# failures,error file
PPO_CartPole-v0_eda9d_00001,1,/home/ec2-user/ray_results/PPO/PPO_CartPole-v0_eda9d_00001_1_lr=0.001_2021-10-16_15-44-21/error.txt


Result for PPO_CartPole-v0_eda9d_00002:
  agent_timesteps_total: 36000
  custom_metrics: {}
  date: 2021-10-16_15-45-48
  done: false
  episode_len_mean: 195.29
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 195.29
  episode_reward_min: 76.0
  episodes_this_iter: 20
  episodes_total: 431
  experiment_id: 36a8061aa2d24b2a90c2e8a7584a9cbe
  hostname: ip-172-16-19-112
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 9.999999747378752e-05
          entropy: 0.506653368473053
          entropy_coeff: 0.0
          kl: 0.0024979093577712774
          model: {}
          policy_loss: 0.0024217134341597557
          total_loss: 437.5098571777344
          vf_explained_var: 0.47176593542099
          vf_loss: 437.5069274902344
        train: null
    num_agent_steps_sampled: 36000
    num_agent_steps_trained: 36000
    num_steps_sampled: 36000
    num_steps_trained: 36000
  iterations_since_re

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_eda9d_00000,RUNNING,172.16.19.112:41260,0.01,9.0,72.2257,36000.0,183.93,200.0,48.0,183.93
PPO_CartPole-v0_eda9d_00002,RUNNING,172.16.19.112:41261,0.0001,9.0,71.4315,36000.0,195.29,200.0,76.0,195.29
PPO_CartPole-v0_eda9d_00001,ERROR,,0.001,,,,,,,

Trial name,# failures,error file
PPO_CartPole-v0_eda9d_00001,1,/home/ec2-user/ray_results/PPO/PPO_CartPole-v0_eda9d_00001_1_lr=0.001_2021-10-16_15-44-21/error.txt


Result for PPO_CartPole-v0_eda9d_00002:
  agent_timesteps_total: 40000
  custom_metrics: {}
  date: 2021-10-16_15-45-56
  done: false
  episode_len_mean: 198.74
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 198.74
  episode_reward_min: 158.0
  episodes_this_iter: 20
  episodes_total: 451
  experiment_id: 36a8061aa2d24b2a90c2e8a7584a9cbe
  hostname: ip-172-16-19-112
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 9.999999747378752e-05
          entropy: 0.5164790153503418
          entropy_coeff: 0.0
          kl: 0.002633011667057872
          model: {}
          policy_loss: 0.0010574019979685545
          total_loss: 471.9617004394531
          vf_explained_var: 0.29917800426483154
          vf_loss: 471.9600830078125
        train: null
    num_agent_steps_sampled: 40000
    num_agent_steps_trained: 40000
    num_steps_sampled: 40000
    num_steps_trained: 40000
  iterations_sinc

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_eda9d_00000,RUNNING,172.16.19.112:41260,0.01,10.0,80.1968,40000.0,189.66,200.0,48.0,189.66
PPO_CartPole-v0_eda9d_00002,RUNNING,172.16.19.112:41261,0.0001,10.0,79.3437,40000.0,198.74,200.0,158.0,198.74
PPO_CartPole-v0_eda9d_00001,ERROR,,0.001,,,,,,,

Trial name,# failures,error file
PPO_CartPole-v0_eda9d_00001,1,/home/ec2-user/ray_results/PPO/PPO_CartPole-v0_eda9d_00001_1_lr=0.001_2021-10-16_15-44-21/error.txt


Result for PPO_CartPole-v0_eda9d_00002:
  agent_timesteps_total: 44000
  custom_metrics: {}
  date: 2021-10-16_15-46-04
  done: false
  episode_len_mean: 198.87
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 198.87
  episode_reward_min: 158.0
  episodes_this_iter: 20
  episodes_total: 471
  experiment_id: 36a8061aa2d24b2a90c2e8a7584a9cbe
  hostname: ip-172-16-19-112
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 9.999999747378752e-05
          entropy: 0.5321366786956787
          entropy_coeff: 0.0
          kl: 0.0026774462312459946
          model: {}
          policy_loss: 0.001468151924200356
          total_loss: 388.29180908203125
          vf_explained_var: 0.45292162895202637
          vf_loss: 388.289794921875
        train: null
    num_agent_steps_sampled: 44000
    num_agent_steps_trained: 44000
    num_steps_sampled: 44000
    num_steps_trained: 44000
  iterations_sinc

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_eda9d_00000,RUNNING,172.16.19.112:41260,0.01,11.0,88.0935,44000.0,195.54,200.0,122.0,195.54
PPO_CartPole-v0_eda9d_00002,RUNNING,172.16.19.112:41261,0.0001,11.0,87.1435,44000.0,198.87,200.0,158.0,198.87
PPO_CartPole-v0_eda9d_00001,ERROR,,0.001,,,,,,,

Trial name,# failures,error file
PPO_CartPole-v0_eda9d_00001,1,/home/ec2-user/ray_results/PPO/PPO_CartPole-v0_eda9d_00001_1_lr=0.001_2021-10-16_15-44-21/error.txt


Result for PPO_CartPole-v0_eda9d_00002:
  agent_timesteps_total: 48000
  custom_metrics: {}
  date: 2021-10-16_15-46-12
  done: false
  episode_len_mean: 199.01
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 199.01
  episode_reward_min: 158.0
  episodes_this_iter: 20
  episodes_total: 491
  experiment_id: 36a8061aa2d24b2a90c2e8a7584a9cbe
  hostname: ip-172-16-19-112
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 9.999999747378752e-05
          entropy: 0.4930127263069153
          entropy_coeff: 0.0
          kl: 0.0037080482579767704
          model: {}
          policy_loss: -0.005831567104905844
          total_loss: 409.0596923828125
          vf_explained_var: 0.4206683933734894
          vf_loss: 409.0647888183594
        train: null
    num_agent_steps_sampled: 48000
    num_agent_steps_trained: 48000
    num_steps_sampled: 48000
    num_steps_trained: 48000
  iterations_sinc

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_eda9d_00000,RUNNING,172.16.19.112:41260,0.01,12.0,95.9913,48000.0,198.89,200.0,137.0,198.89
PPO_CartPole-v0_eda9d_00002,RUNNING,172.16.19.112:41261,0.0001,12.0,95.0504,48000.0,199.01,200.0,158.0,199.01
PPO_CartPole-v0_eda9d_00001,ERROR,,0.001,,,,,,,

Trial name,# failures,error file
PPO_CartPole-v0_eda9d_00001,1,/home/ec2-user/ray_results/PPO/PPO_CartPole-v0_eda9d_00001_1_lr=0.001_2021-10-16_15-44-21/error.txt


Result for PPO_CartPole-v0_eda9d_00002:
  agent_timesteps_total: 52000
  custom_metrics: {}
  date: 2021-10-16_15-46-20
  done: false
  episode_len_mean: 198.79
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 198.79
  episode_reward_min: 115.0
  episodes_this_iter: 21
  episodes_total: 512
  experiment_id: 36a8061aa2d24b2a90c2e8a7584a9cbe
  hostname: ip-172-16-19-112
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 9.999999747378752e-05
          entropy: 0.47625434398651123
          entropy_coeff: 0.0
          kl: 0.0026906304992735386
          model: {}
          policy_loss: 0.0019425859209150076
          total_loss: 370.4873046875
          vf_explained_var: 0.4837858974933624
          vf_loss: 370.48480224609375
        train: null
    num_agent_steps_sampled: 52000
    num_agent_steps_trained: 52000
    num_steps_sampled: 52000
    num_steps_trained: 52000
  iterations_since

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_eda9d_00000,RUNNING,172.16.19.112:41260,0.01,13.0,103.913,52000.0,198.37,200.0,137.0,198.37
PPO_CartPole-v0_eda9d_00002,RUNNING,172.16.19.112:41261,0.0001,13.0,102.987,52000.0,198.79,200.0,115.0,198.79
PPO_CartPole-v0_eda9d_00001,ERROR,,0.001,,,,,,,

Trial name,# failures,error file
PPO_CartPole-v0_eda9d_00001,1,/home/ec2-user/ray_results/PPO/PPO_CartPole-v0_eda9d_00001_1_lr=0.001_2021-10-16_15-44-21/error.txt


Result for PPO_CartPole-v0_eda9d_00002:
  agent_timesteps_total: 56000
  custom_metrics: {}
  date: 2021-10-16_15-46-28
  done: false
  episode_len_mean: 198.79
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 198.79
  episode_reward_min: 115.0
  episodes_this_iter: 20
  episodes_total: 532
  experiment_id: 36a8061aa2d24b2a90c2e8a7584a9cbe
  hostname: ip-172-16-19-112
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 9.999999747378752e-05
          entropy: 0.4743359088897705
          entropy_coeff: 0.0
          kl: 0.002090497175231576
          model: {}
          policy_loss: -0.0012050993973389268
          total_loss: 273.14276123046875
          vf_explained_var: 0.6138594150543213
          vf_loss: 273.1435852050781
        train: null
    num_agent_steps_sampled: 56000
    num_agent_steps_trained: 56000
    num_steps_sampled: 56000
    num_steps_trained: 56000
  iterations_sin

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_eda9d_00000,RUNNING,172.16.19.112:41260,0.01,14.0,111.902,56000.0,198.37,200.0,137.0,198.37
PPO_CartPole-v0_eda9d_00002,RUNNING,172.16.19.112:41261,0.0001,14.0,110.9,56000.0,198.79,200.0,115.0,198.79
PPO_CartPole-v0_eda9d_00001,ERROR,,0.001,,,,,,,

Trial name,# failures,error file
PPO_CartPole-v0_eda9d_00001,1,/home/ec2-user/ray_results/PPO/PPO_CartPole-v0_eda9d_00001_1_lr=0.001_2021-10-16_15-44-21/error.txt


Result for PPO_CartPole-v0_eda9d_00002:
  agent_timesteps_total: 60000
  custom_metrics: {}
  date: 2021-10-16_15-46-36
  done: false
  episode_len_mean: 198.99
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 198.99
  episode_reward_min: 115.0
  episodes_this_iter: 20
  episodes_total: 552
  experiment_id: 36a8061aa2d24b2a90c2e8a7584a9cbe
  hostname: ip-172-16-19-112
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 9.999999747378752e-05
          entropy: 0.4306418001651764
          entropy_coeff: 0.0
          kl: 0.002831242047250271
          model: {}
          policy_loss: 0.0013957979390397668
          total_loss: 332.7289733886719
          vf_explained_var: 0.4252530336380005
          vf_loss: 332.7269592285156
        train: null
    num_agent_steps_sampled: 60000
    num_agent_steps_trained: 60000
    num_steps_sampled: 60000
    num_steps_trained: 60000
  iterations_since

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_eda9d_00000,RUNNING,172.16.19.112:41260,0.01,15.0,119.908,60000.0,198.37,200.0,137.0,198.37
PPO_CartPole-v0_eda9d_00002,RUNNING,172.16.19.112:41261,0.0001,15.0,118.85,60000.0,198.99,200.0,115.0,198.99
PPO_CartPole-v0_eda9d_00001,ERROR,,0.001,,,,,,,

Trial name,# failures,error file
PPO_CartPole-v0_eda9d_00001,1,/home/ec2-user/ray_results/PPO/PPO_CartPole-v0_eda9d_00001_1_lr=0.001_2021-10-16_15-44-21/error.txt


Result for PPO_CartPole-v0_eda9d_00002:
  agent_timesteps_total: 64000
  custom_metrics: {}
  date: 2021-10-16_15-46-44
  done: false
  episode_len_mean: 198.99
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 198.99
  episode_reward_min: 115.0
  episodes_this_iter: 20
  episodes_total: 572
  experiment_id: 36a8061aa2d24b2a90c2e8a7584a9cbe
  hostname: ip-172-16-19-112
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 9.999999747378752e-05
          entropy: 0.44470614194869995
          entropy_coeff: 0.0
          kl: 0.0044144997373223305
          model: {}
          policy_loss: -0.002049433533102274
          total_loss: 271.29095458984375
          vf_explained_var: 0.46868187189102173
          vf_loss: 271.2921142578125
        train: null
    num_agent_steps_sampled: 64000
    num_agent_steps_trained: 64000
    num_steps_sampled: 64000
    num_steps_trained: 64000
  iterations_s

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_eda9d_00000,RUNNING,172.16.19.112:41260,0.01,15.0,119.908,60000.0,198.37,200.0,137.0,198.37
PPO_CartPole-v0_eda9d_00002,RUNNING,172.16.19.112:41261,0.0001,16.0,126.712,64000.0,198.99,200.0,115.0,198.99
PPO_CartPole-v0_eda9d_00001,ERROR,,0.001,,,,,,,

Trial name,# failures,error file
PPO_CartPole-v0_eda9d_00001,1,/home/ec2-user/ray_results/PPO/PPO_CartPole-v0_eda9d_00001_1_lr=0.001_2021-10-16_15-44-21/error.txt


Result for PPO_CartPole-v0_eda9d_00000:
  agent_timesteps_total: 64000
  custom_metrics: {}
  date: 2021-10-16_15-46-44
  done: false
  episode_len_mean: 198.59
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 198.59
  episode_reward_min: 137.0
  episodes_this_iter: 20
  episodes_total: 524
  experiment_id: fee4be6e696a4767b3e6c9df50a77f99
  hostname: ip-172-16-19-112
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.009999999776482582
          entropy: 0.13560442626476288
          entropy_coeff: 0.0
          kl: 0.09108387678861618
          model: {}
          policy_loss: 0.015016925521194935
          total_loss: 341.0106201171875
          vf_explained_var: 0.4172094464302063
          vf_loss: 340.9773864746094
        train: null
    num_agent_steps_sampled: 64000
    num_agent_steps_trained: 64000
    num_steps_sampled: 64000
    num_steps_trained: 64000
  iterations_since_r

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_eda9d_00000,RUNNING,172.16.19.112:41260,0.01,16.0,127.778,64000.0,198.59,200.0,137.0,198.59
PPO_CartPole-v0_eda9d_00002,RUNNING,172.16.19.112:41261,0.0001,16.0,126.712,64000.0,198.99,200.0,115.0,198.99
PPO_CartPole-v0_eda9d_00001,ERROR,,0.001,,,,,,,

Trial name,# failures,error file
PPO_CartPole-v0_eda9d_00001,1,/home/ec2-user/ray_results/PPO/PPO_CartPole-v0_eda9d_00001_1_lr=0.001_2021-10-16_15-44-21/error.txt


Result for PPO_CartPole-v0_eda9d_00002:
  agent_timesteps_total: 68000
  custom_metrics: {}
  date: 2021-10-16_15-46-52
  done: false
  episode_len_mean: 198.99
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 198.99
  episode_reward_min: 115.0
  episodes_this_iter: 20
  episodes_total: 592
  experiment_id: 36a8061aa2d24b2a90c2e8a7584a9cbe
  hostname: ip-172-16-19-112
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 9.999999747378752e-05
          entropy: 0.4091958701610565
          entropy_coeff: 0.0
          kl: 0.0030504337046295404
          model: {}
          policy_loss: 0.0008771381690166891
          total_loss: 283.1387023925781
          vf_explained_var: 0.48283061385154724
          vf_loss: 283.1372375488281
        train: null
    num_agent_steps_sampled: 68000
    num_agent_steps_trained: 68000
    num_steps_sampled: 68000
    num_steps_trained: 68000
  iterations_sin

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_eda9d_00000,RUNNING,172.16.19.112:41260,0.01,17.0,135.916,68000.0,199.48,200.0,148.0,199.48
PPO_CartPole-v0_eda9d_00002,RUNNING,172.16.19.112:41261,0.0001,17.0,134.809,68000.0,198.99,200.0,115.0,198.99
PPO_CartPole-v0_eda9d_00001,ERROR,,0.001,,,,,,,

Trial name,# failures,error file
PPO_CartPole-v0_eda9d_00001,1,/home/ec2-user/ray_results/PPO/PPO_CartPole-v0_eda9d_00001_1_lr=0.001_2021-10-16_15-44-21/error.txt


Result for PPO_CartPole-v0_eda9d_00002:
  agent_timesteps_total: 72000
  custom_metrics: {}
  date: 2021-10-16_15-47-00
  done: false
  episode_len_mean: 199.95
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 199.95
  episode_reward_min: 195.0
  episodes_this_iter: 20
  episodes_total: 612
  experiment_id: 36a8061aa2d24b2a90c2e8a7584a9cbe
  hostname: ip-172-16-19-112
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 9.999999747378752e-05
          entropy: 0.419415146112442
          entropy_coeff: 0.0
          kl: 0.0035245378967374563
          model: {}
          policy_loss: -0.002441737800836563
          total_loss: 200.3439178466797
          vf_explained_var: 0.6449847221374512
          vf_loss: 200.34564208984375
        train: null
    num_agent_steps_sampled: 72000
    num_agent_steps_trained: 72000
    num_steps_sampled: 72000
    num_steps_trained: 72000
  iterations_sinc

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_eda9d_00000,RUNNING,172.16.19.112:41260,0.01,17.0,135.916,68000.0,199.48,200.0,148.0,199.48
PPO_CartPole-v0_eda9d_00002,RUNNING,172.16.19.112:41261,0.0001,18.0,142.722,72000.0,199.95,200.0,195.0,199.95
PPO_CartPole-v0_eda9d_00001,ERROR,,0.001,,,,,,,

Trial name,# failures,error file
PPO_CartPole-v0_eda9d_00001,1,/home/ec2-user/ray_results/PPO/PPO_CartPole-v0_eda9d_00001_1_lr=0.001_2021-10-16_15-44-21/error.txt


Result for PPO_CartPole-v0_eda9d_00000:
  agent_timesteps_total: 72000
  custom_metrics: {}
  date: 2021-10-16_15-47-01
  done: true
  episode_len_mean: 200.0
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 200.0
  episode_reward_min: 200.0
  episodes_this_iter: 20
  episodes_total: 564
  experiment_id: fee4be6e696a4767b3e6c9df50a77f99
  hostname: ip-172-16-19-112
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.009999999776482582
          entropy: 0.1091565415263176
          entropy_coeff: 0.0
          kl: 0.07244950532913208
          model: {}
          policy_loss: 0.005926945246756077
          total_loss: 325.9423522949219
          vf_explained_var: 0.4758451282978058
          vf_loss: 325.92193603515625
        train: null
    num_agent_steps_sampled: 72000
    num_agent_steps_trained: 72000
    num_steps_sampled: 72000
    num_steps_trained: 72000
  iterations_since_rest

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_eda9d_00002,RUNNING,172.16.19.112:41261,0.0001,18.0,142.722,72000.0,199.95,200.0,195.0,199.95
PPO_CartPole-v0_eda9d_00000,TERMINATED,,0.01,18.0,143.921,72000.0,200.0,200.0,200.0,200.0
PPO_CartPole-v0_eda9d_00001,ERROR,,0.001,,,,,,,

Trial name,# failures,error file
PPO_CartPole-v0_eda9d_00001,1,/home/ec2-user/ray_results/PPO/PPO_CartPole-v0_eda9d_00001_1_lr=0.001_2021-10-16_15-44-21/error.txt


Result for PPO_CartPole-v0_eda9d_00002:
  agent_timesteps_total: 76000
  custom_metrics: {}
  date: 2021-10-16_15-47-07
  done: false
  episode_len_mean: 199.95
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 199.95
  episode_reward_min: 195.0
  episodes_this_iter: 20
  episodes_total: 632
  experiment_id: 36a8061aa2d24b2a90c2e8a7584a9cbe
  hostname: ip-172-16-19-112
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 9.999999747378752e-05
          entropy: 0.4171346127986908
          entropy_coeff: 0.0
          kl: 0.0031355975661426783
          model: {}
          policy_loss: 0.0022702447604388
          total_loss: 273.60052490234375
          vf_explained_var: 0.536511242389679
          vf_loss: 273.5976257324219
        train: null
    num_agent_steps_sampled: 76000
    num_agent_steps_trained: 76000
    num_steps_sampled: 76000
    num_steps_trained: 76000
  iterations_since_r

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_eda9d_00002,RUNNING,172.16.19.112:41261,0.0001,19.0,150.304,76000.0,199.95,200.0,195.0,199.95
PPO_CartPole-v0_eda9d_00000,TERMINATED,,0.01,18.0,143.921,72000.0,200.0,200.0,200.0,200.0
PPO_CartPole-v0_eda9d_00001,ERROR,,0.001,,,,,,,

Trial name,# failures,error file
PPO_CartPole-v0_eda9d_00001,1,/home/ec2-user/ray_results/PPO/PPO_CartPole-v0_eda9d_00001_1_lr=0.001_2021-10-16_15-44-21/error.txt


Result for PPO_CartPole-v0_eda9d_00002:
  agent_timesteps_total: 80000
  custom_metrics: {}
  date: 2021-10-16_15-47-15
  done: true
  episode_len_mean: 200.0
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 200.0
  episode_reward_min: 200.0
  episodes_this_iter: 20
  episodes_total: 652
  experiment_id: 36a8061aa2d24b2a90c2e8a7584a9cbe
  hostname: ip-172-16-19-112
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 9.999999747378752e-05
          entropy: 0.39377790689468384
          entropy_coeff: 0.0
          kl: 0.002887232694774866
          model: {}
          policy_loss: -0.0012949386145919561
          total_loss: 222.5024871826172
          vf_explained_var: 0.6581946015357971
          vf_loss: 222.503173828125
        train: null
    num_agent_steps_sampled: 80000
    num_agent_steps_trained: 80000
    num_steps_sampled: 80000
    num_steps_trained: 80000
  iterations_since_r

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_eda9d_00000,TERMINATED,,0.01,18.0,143.921,72000.0,200.0,200.0,200.0,200.0
PPO_CartPole-v0_eda9d_00002,TERMINATED,,0.0001,20.0,157.822,80000.0,200.0,200.0,200.0,200.0
PPO_CartPole-v0_eda9d_00001,ERROR,,0.001,,,,,,,

Trial name,# failures,error file
PPO_CartPole-v0_eda9d_00001,1,/home/ec2-user/ray_results/PPO/PPO_CartPole-v0_eda9d_00001_1_lr=0.001_2021-10-16_15-44-21/error.txt


[2m[36m(pid=41370)[0m [2021-10-16 15:47:15,698 E 41370 41370] raylet_client.cc:159: IOError: Broken pipe [RayletClient] Failed to disconnect from raylet.
[2m[36m(pid=41373)[0m 2021-10-16 15:47:15,697	ERROR worker.py:428 -- SystemExit was raised from the worker
[2m[36m(pid=41373)[0m Traceback (most recent call last):
[2m[36m(pid=41373)[0m   File "python/ray/_raylet.pyx", line 561, in ray._raylet.execute_task
[2m[36m(pid=41373)[0m   File "python/ray/_raylet.pyx", line 568, in ray._raylet.execute_task
[2m[36m(pid=41373)[0m   File "python/ray/_raylet.pyx", line 572, in ray._raylet.execute_task
[2m[36m(pid=41373)[0m   File "python/ray/_raylet.pyx", line 522, in ray._raylet.execute_task.function_executor
[2m[36m(pid=41373)[0m   File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/_private/function_manager.py", line 579, in actor_method_executor
[2m[36m(pid=41373)[0m     return method(__ray_actor, *args, **kwargs)
[2m[36m(pid=41373)[

TuneError: ('Trials did not complete', [PPO_CartPole-v0_eda9d_00001])
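
The TuneError above is raised because one of the three grid-search trials (lr=0.001) ended in ERROR, even though the other two reached a mean reward of 200. As a minimal sketch, the final status table can be parsed with the standard library to separate finished trials from failed ones. The table text is copied from the output above; the helper name summarize_trials is hypothetical and not part of Ray or Tune.

```python
import csv
import io

# Final status table, copied verbatim from the Tune output above.
STATUS_TABLE = """\
Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_CartPole-v0_eda9d_00000,TERMINATED,,0.01,18.0,143.921,72000.0,200.0,200.0,200.0,200.0
PPO_CartPole-v0_eda9d_00002,TERMINATED,,0.0001,20.0,157.822,80000.0,200.0,200.0,200.0,200.0
PPO_CartPole-v0_eda9d_00001,ERROR,,0.001,,,,,,,
"""


def summarize_trials(table_text):
    """Hypothetical helper: split trials into finished (name -> reward) and failed (names)."""
    rows = list(csv.DictReader(io.StringIO(table_text)))
    # Trials that stopped normally report a numeric mean reward.
    finished = {
        row["Trial name"]: float(row["reward"])
        for row in rows
        if row["status"] == "TERMINATED"
    }
    # Errored trials have empty metric columns, so only the name is kept.
    failed = [row["Trial name"] for row in rows if row["status"] == "ERROR"]
    return finished, failed


finished, failed = summarize_trials(STATUS_TABLE)
print("finished:", finished)
print("failed:", failed)
```

The cause of the failure is recorded in the error.txt file listed in the failures table. If partial results are acceptable, Ray versions from this period accept raise_on_failed_trial=False in tune.run() so that the call returns instead of raising; check that your installed version supports it before relying on it.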

In [4]:
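alg = 'DDPG'
tune.run(
    alg,
    stop={"training_iteration": 30},  # stop the trial after 30 training iterations
    config={
        'env': 'Pendulum-v0',
        'num_gpus': 0,      # train on CPU only
        'num_workers': 2,   # two rollout workers collect samples
        'lr': tune.grid_search([0.001]),  # single-value grid, so exactly one trial
    },
)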


Trial name,status,loc,lr
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,,0.001


[2m[36m(pid=17553)[0m Instructions for updating:
[2m[36m(pid=17553)[0m non-resource variables are not supported in the long term
[2m[36m(pid=17553)[0m 2021-07-13 20:54:31,552	INFO trainer.py:591 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=17553)[0m 2021-07-13 20:54:31,552	INFO trainer.py:616 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=17556)[0m Instructions for updating:
[2m[36m(pid=17556)[0m non-resource variables are not supported in the long term
[2m[36m(pid=17555)[0m Instructions for updating:
[2m[36m(pid=17555)[0m non-resource variables are not supported in the long term


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-54-41
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -781.1207639597676
  episode_reward_mean: -1131.8902231793475
  episode_reward_min: -1410.9375467664963
  episodes_this_iter: 6
  episodes_total: 6
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 1500
    learner:
      default_policy:
        max_q: 0.1796478033065796
        mean_q: -0.10519210994243622
        min_q: -0.6069953441619873
        model: {}
    num_steps_sampled: 1500
    num_steps_trained: 256
    num_target_updates: 1
  iterations_since_restore: 1
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 25.799999999999997
    ram_util_percent: 63.55
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.0973585918644

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,1,2.53136,1500,-1131.89,-781.121,-1410.94,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-54-58
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -781.1207639597676
  episode_reward_mean: -1331.3202113249129
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 6
  episodes_total: 12
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 2500
    learner:
      default_policy:
        max_q: -0.50157630443573
        mean_q: -11.473435401916504
        min_q: -24.651611328125
        model: {}
    num_steps_sampled: 2500
    num_steps_trained: 128256
    num_target_updates: 501
  iterations_since_restore: 2
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 23.545833333333334
    ram_util_percent: 63.48333333333334
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,2,19.6074,2500,-1331.32,-781.121,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-55-15
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -781.1207639597676
  episode_reward_mean: -1398.030697460801
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 4
  episodes_total: 16
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 3500
    learner:
      default_policy:
        max_q: -0.1660407930612564
        mean_q: -18.283344268798828
        min_q: -33.41071319580078
        model: {}
    num_steps_sampled: 3500
    num_steps_trained: 256256
    num_target_updates: 1001
  iterations_since_restore: 3
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 26.1125
    ram_util_percent: 64.00416666666666
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.0981330

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,3,36.4448,3500,-1398.03,-781.121,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-55-32
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -781.1207639597676
  episode_reward_mean: -1425.8668538197078
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 6
  episodes_total: 22
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 4500
    learner:
      default_policy:
        max_q: -0.6811314225196838
        mean_q: -23.862667083740234
        min_q: -44.24247360229492
        model: {}
    num_steps_sampled: 4500
    num_steps_trained: 384256
    num_target_updates: 1501
  iterations_since_restore: 4
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 27.125
    ram_util_percent: 64.6375
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.09838679174479754

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,4,53.642,4500,-1425.87,-781.121,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-55-50
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -781.1207639597676
  episode_reward_mean: -1444.8272991160627
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 4
  episodes_total: 26
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 5500
    learner:
      default_policy:
        max_q: 0.3510587811470032
        mean_q: -29.40019416809082
        min_q: -47.306602478027344
        model: {}
    num_steps_sampled: 5500
    num_steps_trained: 512256
    num_target_updates: 2001
  iterations_since_restore: 5
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 28.332000000000004
    ram_util_percent: 64.232
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.0986227

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,5,71.5977,5500,-1444.83,-781.121,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-56-07
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -781.1207639597676
  episode_reward_mean: -1436.744265061477
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 6
  episodes_total: 32
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 6500
    learner:
      default_policy:
        max_q: -0.09803511202335358
        mean_q: -34.90961456298828
        min_q: -63.530277252197266
        model: {}
    num_steps_sampled: 6500
    num_steps_trained: 640256
    num_target_updates: 2501
  iterations_since_restore: 6
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 22.962500000000002
    ram_util_percent: 63.50416666666667
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_m

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,6,89.2018,6500,-1436.74,-781.121,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-56-26
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -781.1207639597676
  episode_reward_mean: -1438.6196838828807
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 4
  episodes_total: 36
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 7500
    learner:
      default_policy:
        max_q: 0.008759599179029465
        mean_q: -40.90607452392578
        min_q: -63.068790435791016
        model: {}
    num_steps_sampled: 7500
    num_steps_trained: 768256
    num_target_updates: 3001
  iterations_since_restore: 7
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 28.47307692307692
    ram_util_percent: 65.36153846153846
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_m

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,7,107.375,7500,-1438.62,-781.121,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-56-44
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -781.1207639597676
  episode_reward_mean: -1412.6190355939557
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 6
  episodes_total: 42
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 8500
    learner:
      default_policy:
        max_q: 1.5346838235855103
        mean_q: -44.83782958984375
        min_q: -71.63545227050781
        model: {}
    num_steps_sampled: 8500
    num_steps_trained: 896256
    num_target_updates: 3501
  iterations_since_restore: 8
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 26.483999999999995
    ram_util_percent: 60.803999999999995
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,8,125.718,8500,-1412.62,-781.121,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-57-02
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -781.1207639597676
  episode_reward_mean: -1398.6154951909177
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 4
  episodes_total: 46
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 9500
    learner:
      default_policy:
        max_q: -0.022161278873682022
        mean_q: -50.20433044433594
        min_q: -73.2963638305664
        model: {}
    num_steps_sampled: 9500
    num_steps_trained: 1024256
    num_target_updates: 4001
  iterations_since_restore: 9
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 21.676
    ram_util_percent: 60.964
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.09955409264438625

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,9,143.433,9500,-1398.62,-781.121,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-57-20
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -11.722382280782014
  episode_reward_mean: -1348.1319338104424
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 6
  episodes_total: 52
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 10500
    learner:
      default_policy:
        max_q: 0.9547820687294006
        mean_q: -53.53691864013672
        min_q: -75.73155975341797
        model: {}
    num_steps_sampled: 10500
    num_steps_trained: 1152256
    num_target_updates: 4501
  iterations_since_restore: 10
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 26.58076923076923
    ram_util_percent: 62.80384615384616
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,10,161.598,10500,-1348.13,-11.7224,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-57-38
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -11.722382280782014
  episode_reward_mean: -1336.2461581544399
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 4
  episodes_total: 56
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 11500
    learner:
      default_policy:
        max_q: -1.291930079460144
        mean_q: -56.992698669433594
        min_q: -83.67649841308594
        model: {}
    num_steps_sampled: 11500
    num_steps_trained: 1280256
    num_target_updates: 5001
  iterations_since_restore: 11
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 25.019999999999996
    ram_util_percent: 63.53200000000001
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processi

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,11,179.777,11500,-1336.25,-11.7224,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-57-57
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -11.722382280782014
  episode_reward_mean: -1288.7165801323201
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 6
  episodes_total: 62
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 12500
    learner:
      default_policy:
        max_q: 1.1449592113494873
        mean_q: -62.394222259521484
        min_q: -90.33869934082031
        model: {}
    num_steps_sampled: 12500
    num_steps_trained: 1408256
    num_target_updates: 5501
  iterations_since_restore: 12
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 26.78846153846154
    ram_util_percent: 61.83461538461539
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processin

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,12,198.182,12500,-1288.72,-11.7224,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-58-14
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -11.722382280782014
  episode_reward_mean: -1269.864064156319
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 4
  episodes_total: 66
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 13500
    learner:
      default_policy:
        max_q: 0.9233301877975464
        mean_q: -62.598182678222656
        min_q: -89.0087890625
        model: {}
    num_steps_sampled: 13500
    num_steps_trained: 1536256
    num_target_updates: 6001
  iterations_since_restore: 13
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 22.896
    ram_util_percent: 62.232
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.10005915854217143
 

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,13,216.094,13500,-1269.86,-11.7224,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-58-33
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -3.1396012759307346
  episode_reward_mean: -1193.5720942077833
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 6
  episodes_total: 72
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 14500
    learner:
      default_policy:
        max_q: 2.8817272186279297
        mean_q: -63.85460662841797
        min_q: -94.13409423828125
        model: {}
    num_steps_sampled: 14500
    num_steps_trained: 1664256
    num_target_updates: 6501
  iterations_since_restore: 14
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 24.12
    ram_util_percent: 63.36
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.1001575898790833


Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,14,234.42,14500,-1193.57,-3.1396,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-58-51
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -3.1396012759307346
  episode_reward_mean: -1179.2687786561391
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 4
  episodes_total: 76
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 15500
    learner:
      default_policy:
        max_q: 2.4560861587524414
        mean_q: -63.485443115234375
        min_q: -99.68254089355469
        model: {}
    num_steps_sampled: 15500
    num_steps_trained: 1792256
    num_target_updates: 7001
  iterations_since_restore: 15
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 21.6
    ram_util_percent: 63.26
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.10020802807056446

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,15,252.232,15500,-1179.27,-3.1396,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-59-08
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.8323446584304053
  episode_reward_mean: -1113.7970069430512
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 6
  episodes_total: 82
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 16500
    learner:
      default_policy:
        max_q: 3.410090684890747
        mean_q: -70.59851837158203
        min_q: -103.12985229492188
        model: {}
    num_steps_sampled: 16500
    num_steps_trained: 1920256
    num_target_updates: 7501
  iterations_since_restore: 16
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 23.516
    ram_util_percent: 63.868
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.100262394661624

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,16,269.855,16500,-1113.8,-1.83234,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-59-26
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.8323446584304053
  episode_reward_mean: -1085.0581441625716
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 4
  episodes_total: 86
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 17500
    learner:
      default_policy:
        max_q: 3.329087018966675
        mean_q: -69.15478515625
        min_q: -105.27165222167969
        model: {}
    num_steps_sampled: 17500
    num_steps_trained: 2048256
    num_target_updates: 8001
  iterations_since_restore: 17
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 22.852000000000004
    ram_util_percent: 61.42
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.1002903

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,17,287.561,17500,-1085.06,-1.83234,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_20-59-44
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.1116457547643066
  episode_reward_mean: -1034.789899227471
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 6
  episodes_total: 92
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 18500
    learner:
      default_policy:
        max_q: 3.6690609455108643
        mean_q: -67.78599548339844
        min_q: -111.97421264648438
        model: {}
    num_steps_sampled: 18500
    num_steps_trained: 2176256
    num_target_updates: 8501
  iterations_since_restore: 18
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 21.984
    ram_util_percent: 61.907999999999994
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.100

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,18,305.711,18500,-1034.79,-1.11165,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_21-00-02
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.1116457547643066
  episode_reward_mean: -997.122177864407
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 4
  episodes_total: 96
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 19500
    learner:
      default_policy:
        max_q: 5.729738235473633
        mean_q: -72.16944122314453
        min_q: -117.57238006591797
        model: {}
    num_steps_sampled: 19500
    num_steps_trained: 2304256
    num_target_updates: 9001
  iterations_since_restore: 19
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 20.772
    ram_util_percent: 63.343999999999994
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.10034

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,19,323.599,19500,-997.122,-1.11165,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_21-00-20
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.1116457547643066
  episode_reward_mean: -962.5109391533188
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 6
  episodes_total: 102
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 20500
    learner:
      default_policy:
        max_q: 5.6856279373168945
        mean_q: -70.61603546142578
        min_q: -114.73045349121094
        model: {}
    num_steps_sampled: 20500
    num_steps_trained: 2432256
    num_target_updates: 9501
  iterations_since_restore: 20
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 21.34
    ram_util_percent: 63.876
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.100398433589643

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,20,341.518,20500,-962.511,-1.11165,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_21-00-38
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.1116457547643066
  episode_reward_mean: -926.8117919657061
  episode_reward_min: -1796.9534032278605
  episodes_this_iter: 4
  episodes_total: 106
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 21500
    learner:
      default_policy:
        max_q: 4.623827934265137
        mean_q: -73.03113555908203
        min_q: -121.44898986816406
        model: {}
    num_steps_sampled: 21500
    num_steps_trained: 2560256
    num_target_updates: 10001
  iterations_since_restore: 21
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 20.892
    ram_util_percent: 64.264
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.10054363674887

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,21,359.284,21500,-926.812,-1.11165,-1796.95,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_21-00-56
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.1116457547643066
  episode_reward_mean: -845.3818124377199
  episode_reward_min: -1702.4878189939052
  episodes_this_iter: 6
  episodes_total: 112
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 22500
    learner:
      default_policy:
        max_q: 5.559916973114014
        mean_q: -77.15824890136719
        min_q: -124.76898956298828
        model: {}
    num_steps_sampled: 22500
    num_steps_trained: 2688256
    num_target_updates: 10501
  iterations_since_restore: 22
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 23.011538461538464
    ram_util_percent: 62.280769230769245
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_process

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,22,377.508,22500,-845.382,-1.11165,-1702.49,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_21-01-14
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.1116457547643066
  episode_reward_mean: -789.2500149746359
  episode_reward_min: -1702.4878189939052
  episodes_this_iter: 4
  episodes_total: 116
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 23500
    learner:
      default_policy:
        max_q: 5.940964221954346
        mean_q: -75.83016967773438
        min_q: -128.67138671875
        model: {}
    num_steps_sampled: 23500
    num_steps_trained: 2816256
    num_target_updates: 11001
  iterations_since_restore: 23
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 21.7
    ram_util_percent: 61.724
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.1007344092037897
  

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,23,395.654,23500,-789.25,-1.11165,-1702.49,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_21-01-33
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.1116457547643066
  episode_reward_mean: -711.1807312858703
  episode_reward_min: -1702.4878189939052
  episodes_this_iter: 6
  episodes_total: 122
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 24500
    learner:
      default_policy:
        max_q: 5.302407264709473
        mean_q: -76.78132629394531
        min_q: -129.2058563232422
        model: {}
    num_steps_sampled: 24500
    num_steps_trained: 2944256
    num_target_updates: 11501
  iterations_since_restore: 24
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 22.71153846153846
    ram_util_percent: 63.11538461538461
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,24,414.032,24500,-711.181,-1.11165,-1702.49,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_21-01-50
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.1116457547643066
  episode_reward_mean: -654.9789331537673
  episode_reward_min: -1543.9091369921614
  episodes_this_iter: 4
  episodes_total: 126
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 25500
    learner:
      default_policy:
        max_q: 8.509936332702637
        mean_q: -74.9183349609375
        min_q: -133.55616760253906
        model: {}
    num_steps_sampled: 25500
    num_steps_trained: 3072256
    num_target_updates: 12001
  iterations_since_restore: 25
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 25.025000000000002
    ram_util_percent: 63.50416666666666
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processin

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,25,431.321,25500,-654.979,-1.11165,-1543.91,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_21-02-09
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.1116457547643066
  episode_reward_mean: -581.5873670837315
  episode_reward_min: -1519.854635629977
  episodes_this_iter: 6
  episodes_total: 132
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 26500
    learner:
      default_policy:
        max_q: 5.026442050933838
        mean_q: -76.33003234863281
        min_q: -136.049560546875
        model: {}
    num_steps_sampled: 26500
    num_steps_trained: 3200256
    num_target_updates: 12501
  iterations_since_restore: 26
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 26.35
    ram_util_percent: 62.79615384615385
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.1008556

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,26,450.279,26500,-581.587,-1.11165,-1519.85,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_21-02-28
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.1116457547643066
  episode_reward_mean: -543.6693148351875
  episode_reward_min: -1519.854635629977
  episodes_this_iter: 4
  episodes_total: 136
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 27500
    learner:
      default_policy:
        max_q: 4.031546592712402
        mean_q: -76.26606750488281
        min_q: -138.93124389648438
        model: {}
    num_steps_sampled: 27500
    num_steps_trained: 3328256
    num_target_updates: 13001
  iterations_since_restore: 27
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 24.355555555555558
    ram_util_percent: 62.829629629629636
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processi

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,27,469.222,27500,-543.669,-1.11165,-1519.85,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_21-02-47
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.1116457547643066
  episode_reward_mean: -503.3479170812064
  episode_reward_min: -1519.854635629977
  episodes_this_iter: 6
  episodes_total: 142
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 28500
    learner:
      default_policy:
        max_q: 8.385159492492676
        mean_q: -79.76459503173828
        min_q: -143.51950073242188
        model: {}
    num_steps_sampled: 28500
    num_steps_trained: 3456256
    num_target_updates: 13501
  iterations_since_restore: 28
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 24.71153846153846
    ram_util_percent: 62.20384615384616
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,28,488.013,28500,-503.348,-1.11165,-1519.85,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_21-03-06
  done: false
  episode_len_mean: 200.0
  episode_reward_max: -1.1116457547643066
  episode_reward_mean: -461.2358136139652
  episode_reward_min: -1519.854635629977
  episodes_this_iter: 4
  episodes_total: 146
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 29500
    learner:
      default_policy:
        max_q: 6.932671070098877
        mean_q: -74.86430358886719
        min_q: -146.92617797851562
        model: {}
    num_steps_sampled: 29500
    num_steps_trained: 3584256
    num_target_updates: 14001
  iterations_since_restore: 29
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 25.81851851851852
    ram_util_percent: 62.31481481481482
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,29,507.24,29500,-461.236,-1.11165,-1519.85,200


Result for DDPG_Pendulum-v0_6b1dd_00000:
  custom_metrics: {}
  date: 2021-07-13_21-03-25
  done: true
  episode_len_mean: 200.0
  episode_reward_max: -1.1116457547643066
  episode_reward_mean: -444.3119269581459
  episode_reward_min: -1524.5879513507703
  episodes_this_iter: 6
  episodes_total: 152
  experiment_id: 43107a19371e49529eb221515032f0aa
  hostname: Mingjuns-MacBook-Pro.local
  info:
    last_target_update_ts: 30500
    learner:
      default_policy:
        max_q: 5.001977920532227
        mean_q: -80.31802368164062
        min_q: -147.42247009277344
        model: {}
    num_steps_sampled: 30500
    num_steps_trained: 3712256
    num_target_updates: 14501
  iterations_since_restore: 30
  node_ip: 192.168.0.23
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 25.751851851851853
    ram_util_percent: 63.059259259259264
  pid: 17553
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processi

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,TERMINATED,,0.001,30,526.471,30500,-444.312,-1.11165,-1524.59,200



2021-07-13 21:03:26,318	INFO tune.py:448 -- Total run time: 540.24 seconds (539.55 seconds for the tuning loop).


<ray.tune.analysis.experiment_analysis.ExperimentAnalysis at 0x7fadfcd0d490>
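
The status rows Tune prints between result blocks are comma-separated, so the learning curve can be recovered from the log with the standard library alone. The sketch below copies four of those rows verbatim (iterations 6, 14, 22 and 30) and extracts the mean episode reward per iteration. The helper name `reward_curve` is an assumption made for illustration and is not part of Ray or Tune.

```python
import csv
import io

# Status rows copied verbatim from the Tune output above (iterations 6, 14, 22, 30).
STATUS = """\
Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,6,89.2018,6500,-1436.74,-781.121,-1796.95,200
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,14,234.42,14500,-1193.57,-3.1396,-1796.95,200
DDPG_Pendulum-v0_6b1dd_00000,RUNNING,192.168.0.23:17553,0.001,22,377.508,22500,-845.382,-1.11165,-1702.49,200
DDPG_Pendulum-v0_6b1dd_00000,TERMINATED,,0.001,30,526.471,30500,-444.312,-1.11165,-1524.59,200
"""


def reward_curve(text):
    """Return (iteration, mean episode reward) pairs from Tune status rows."""
    rows = csv.DictReader(io.StringIO(text))
    return [(int(row["iter"]), float(row["reward"])) for row in rows]


curve = reward_curve(STATUS)
for iteration, reward in curve:
    print(f"iter {iteration:>2}: mean reward {reward:9.2f}")
```

Over these rows the mean reward rises from about -1437 to about -444, which matches the steady improvement visible in the `episode_reward_mean` fields of the result blocks.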

## 1: RLlib Environments

1: RLlib works with several types of environments: OpenAI Gym, user-defined, multi-agent, and batched environments.

2: RLlib uses Gym as its environment interface for single-agent training.
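
The contract behind that interface is small: `reset()` returns the first observation, and `step(action)` returns an `(observation, reward, done, info)` tuple. The sketch below imitates that contract in plain Python, without importing Gym, so it runs anywhere. `CountdownEnv`, its `horizon` key and the `rollout` helper are hypothetical names chosen for illustration; a real RLlib environment would subclass `gym.Env` and also declare `action_space` and `observation_space`.

```python
class CountdownEnv:
    """Hypothetical toy environment following the classic Gym contract."""

    def __init__(self, env_config=None):
        # "horizon" is an illustrative key, not an RLlib setting.
        self.horizon = (env_config or {}).get("horizon", 10)
        self.remaining = self.horizon

    def reset(self):
        # reset() starts a new episode and returns the first observation.
        self.remaining = self.horizon
        return self.remaining

    def step(self, action):
        # step() returns (observation, reward, done, info).
        self.remaining -= 1
        reward = 1.0 if action == 1 else 0.0
        done = self.remaining <= 0
        return self.remaining, reward, done, {}


def rollout(env, policy):
    """Run one episode and return (total reward, number of steps)."""
    obs, done = env.reset(), False
    total, steps = 0.0, 0
    while not done:
        obs, reward, done, info = env.step(policy(obs))
        total += reward
        steps += 1
    return total, steps


total, steps = rollout(CountdownEnv({"horizon": 4}), policy=lambda obs: 1)
print(total, steps)
```

The `rollout` loop has the same shape as the `compute_action` evaluation loop shown earlier, with the trained agent replaced by a fixed policy.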



#### 1: Configuring Environments

    https://github.com/ray-project/ray/blob/master/rllib/examples/custom_env.py

In [3]:
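import gym, ray
import numpy as np
from gym import spaces
from ray.rllib.agents import ppo

class MyEnv(gym.Env):
    # Illustrative corridor task, modeled on the linked custom_env.py example:
    # the agent starts at position 0 and must walk right to reach the end.
    def __init__(self, env_config):
        # env_config is the "env_config" dict from the trainer config
        self.end_pos = env_config.get("corridor_length", 5)
        self.cur_pos = 0
        self.action_space = spaces.Discrete(2)  # 0 = left, 1 = right
        self.observation_space = spaces.Box(
            0.0, self.end_pos, shape=(1,), dtype=np.float32)
    def reset(self):
        self.cur_pos = 0
        return np.array([self.cur_pos], dtype=np.float32)
    def step(self, action):
        if action == 0 and self.cur_pos > 0:
            self.cur_pos -= 1
        elif action == 1:
            self.cur_pos += 1
        done = self.cur_pos >= self.end_pos
        reward = 1.0 if done else -0.1  # small penalty per step before the goal
        return np.array([self.cur_pos], dtype=np.float32), reward, done, {}

ray.init()
trainer = ppo.PPOTrainer(env=MyEnv, config={
    "env_config": {},  # config to pass to env class
})

while True:  # trains until the cell is interrupted
    print(trainer.train())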