In [1]:
import os
import ray
import supersuit as ss 
from ray import tune    # experiment runner
from pettingzoo.butterfly import cooperative_pong_v5
from ray.tune.registry import register_env
from ray.rllib.env import ParallelPettingZooEnv
from ray.rllib.algorithms.ppo import PPOConfig

ray.shutdown()   # restart ray
ray.init(num_cpus=6)

2024-10-31 15:35:54,973	INFO worker.py:1816 -- Started a local Ray instance.


Python version: 3.11.6
Ray version: 2.38.0


# Ray, RLlib and PettingZoo
To perform RL experiments with Ray, four things are needed: 
1. RL Environment 
2. RL Algorithm
3. Configuration of the environment, algorithm and the experiment
4. Experiment Runner


The multi-agent environments from PettingZoo are not directly compatible with Ray and need to be wrapped with either the ```PettingZooEnv``` or the ```ParallelPettingZooEnv``` wrapper, depending on whether an ```AEC``` or a ```ParallelEnv``` environment is being used. The following uses the ```ParallelEnv``` option with the ```cooperative_pong_v5``` environment.

In [2]:
env_name = 'cooperative_pong'

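def env_creator(env_config):
    # Parallel API: all agents act at every step
    env = cooperative_pong_v5.parallel_env(render_mode=env_config.get("render_mode", "human"))
    env = ss.color_reduction_v0(env, mode='B')      # keep only the blue channel
    env = ss.resize_v1(env, x_size=84, y_size=84)   # downscale frames to 84x84
    env = ss.frame_stack_v1(env, 4)                 # stack the last 4 frames
    env = ss.dtype_v0(env, 'float32')               # cast observations to float32
    return env

# Wrap the PettingZoo env so that RLlib can use it as a multi-agent env
register_env(env_name, lambda config: ParallelPettingZooEnv(env_creator(config)))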

In the ```env_creator``` function we initialise the environment and wrap it with ```Supersuit``` wrappers so that its observations are compatible with the ```rllib``` algorithms. The individual wrappers are described [here](https://pypi.org/project/SuperSuit/3.3.1/).
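To make the effect of the wrapper chain concrete, the sketch below tracks only the observation shape through the four preprocessing steps. It is a plain-Python illustration, not SuperSuit code: the helper names are hypothetical, and the raw observation shape of `(280, 480, 3)` is an assumption about ```cooperative_pong_v5``` that should be checked against the environment's observation space.

```python
def color_reduction_shape(shape):
    # Keeping a single colour channel drops the trailing channel axis.
    height, width, _channels = shape
    return (height, width)


def resize_shape(shape, x_size, y_size):
    # Resizing replaces height and width, leaving any trailing axes untouched.
    return (y_size, x_size) + tuple(shape[2:])


def frame_stack_shape(shape, stack_size):
    # 2D frames are stacked along a new last axis,
    # 3D frames are concatenated along the existing last axis.
    if len(shape) == 2:
        return tuple(shape) + (stack_size,)
    return tuple(shape[:-1]) + (shape[-1] * stack_size,)


raw_shape = (280, 480, 3)  # assumed raw RGB observation shape
shape = color_reduction_shape(raw_shape)
shape = resize_shape(shape, x_size=84, y_size=84)
shape = frame_stack_shape(shape, 4)
print(shape)
```

Under these assumptions each agent ends up with four stacked 84x84 single-channel frames, which ```ss.dtype_v0``` then casts to ```float32``` without changing the shape.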

By registering the environment under the ```env_name``` we can access it from the ```rllib``` API. When ```rllib``` creates the environment, it invokes the ```env_creator``` function defined above and passes the environment configuration in ```config```. The ```.get()``` method returns the value stored under the given key and falls back to the specified default value if that key doesn't exist in the dictionary.
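The registration mechanism and the ```.get()``` fallback can be illustrated with a toy registry. This is a minimal sketch in plain Python and not how ```rllib``` implements ```register_env```; all names prefixed with `toy_` are hypothetical.

```python
from typing import Any, Callable, Dict

# Toy stand-in for a name -> creator registry (hypothetical, not RLlib code).
_TOY_REGISTRY: Dict[str, Callable[[Dict[str, Any]], Any]] = {}


def toy_register_env(name, creator):
    # Store the creator function under the given name.
    _TOY_REGISTRY[name] = creator


def toy_make_env(name, env_config=None):
    # Look up the creator by name and call it with the config dictionary.
    return _TOY_REGISTRY[name](env_config or {})


def toy_env_creator(env_config):
    # dict.get returns the stored value, or the default if the key is absent.
    return {"render_mode": env_config.get("render_mode", "human")}


toy_register_env("toy_pong", toy_env_creator)
default_env = toy_make_env("toy_pong")                          # key missing
headless_env = toy_make_env("toy_pong", {"render_mode": None})  # key present
print(default_env, headless_env)
```

Note that the default only applies when the key is missing: passing `{"render_mode": None}` explicitly yields `None`, not `"human"`.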

## Algorithm and AlgorithmConfig

There are two variants of configuring and running an algorithm. In the one shown here, all configuration settings are set directly on the ```PPOConfig``` object and the configuration is run using ```tune.run()```; a ``tune.Tuner`` object can be used to run the experiment instead.

In [6]:
config = (
    PPOConfig()
    .environment(env=env_name)
    .framework("torch")
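    .training(
        train_batch_size=512,   # env steps collected per training iteration
        lr=2e-5,                # learning rate
        gamma=0.99,             # discount factor
        lambda_=0.9,            # GAE lambda
        use_gae=True,           # use generalized advantage estimation
        clip_param=0.4,         # PPO clipping range
        grad_clip=None,         # no gradient clipping
        entropy_coeff=0.1,      # weight of the entropy bonus
        vf_loss_coeff=0.25,     # weight of the value function loss
        num_sgd_iter=10,        # SGD passes over each train batch
    )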
)





## Ray Train vs. Tune
When a user-defined training function is desired, using the ``Trainer`` from the ``ray.train`` module can be useful, as it allows for custom training logic, model and dataset loading, a custom training loop, checkpoint saving and metric logging. In our case, no custom training function is necessary, so we can let ``ray.tune.run()`` handle the training logic for us.

We take the previously defined configuration, pass the name of our experiment, and set stop criteria based on the mean episode reward and the total number of agent timesteps.
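When the stop criteria are given as a dictionary, the trial ends as soon as any one of the listed metrics reaches its threshold, and nested result entries are addressed with `/`-separated keys. The sketch below mimics that behaviour in plain Python; it is an illustration rather than Tune's code, and the helper names and the two example results are hypothetical.

```python
def flatten(result, prefix=""):
    # Turn {"a": {"b": 1}} into {"a/b": 1}.
    flat = {}
    for key, value in result.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=path + "/"))
        else:
            flat[path] = value
    return flat


def should_stop(result, stop):
    # Stop once ANY listed metric has reached its threshold.
    flat = flatten(result)
    return any(flat[metric] >= threshold for metric, threshold in stop.items())


stop = {"agent_timesteps_total": 1e4, "env_runners/episode_reward_mean": 10}

# Illustrative results: an early iteration and one past the timestep budget.
early = {"agent_timesteps_total": 1024, "env_runners": {"episode_reward_mean": -5.0}}
late = {"agent_timesteps_total": 13312, "env_runners": {"episode_reward_mean": -3.9}}

print(should_stop(early, stop), should_stop(late, stop))
```

In the second example the timestep budget alone ends the trial, even though the mean reward is still far below its threshold.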

In [10]:
results = tune.run(
    'PPO',
    config=config.to_dict(),
    name='ppo_cooperative_pong',
    stop={'agent_timesteps_total': 1e4, 'env_runners/episode_reward_mean': 10},
)

2024-10-31 16:52:54,332	INFO tune.py:616 -- [output] This uses the legacy output and progress reporter, as Jupyter notebooks are not supported by the new engine, yet. For more information, please see https://github.com/ray-project/ray/issues/36949


Current time: 2024-10-31 17:04:06
Running for: 00:11:12.40
Memory: 19.9/31.3 GiB

| Trial name | status | loc | iter | total time (s) | ts | num_healthy_workers | num_in_flight_async_sample_reqs | num_remote_worker_restarts |
|---|---|---|---|---|---|---|---|---|
| PPO_cooperative_pong_30efb_00000 | RUNNING | 127.0.0.1:34840 | 13 | 645.765 | 6656 | 2 | 0 | 0 |


Last reported result for trial PPO_cooperative_pong_30efb_00000 (scalar metrics):

| Metric | Value |
|---|---|
| agent_timesteps_total | 13312 |
| num_agent_steps_sampled | 13312 |
| num_agent_steps_trained | 13312 |
| num_env_steps_sampled | 6656 |
| num_env_steps_trained | 6656 |
| num_env_steps_sampled_this_iter | 512 |
| num_env_steps_sampled_throughput_per_sec | 9.64331 |
| num_env_steps_trained_this_iter | 512 |
| num_env_steps_trained_throughput_per_sec | 9.64331 |
| num_steps_trained_this_iter | 512 |
| num_healthy_workers | 2 |
| num_in_flight_async_sample_reqs | 0 |
| num_remote_worker_restarts | 0 |
| env_runners/episode_reward_mean | -3.906172839506155 |
| env_runners/episode_reward_max | 49.333333333334295 |
| env_runners/episode_reward_min | -13.555555555555564 |
| env_runners/episode_len_mean | 73.42222222222222 |
| env_runners/episodes_timesteps_total | 6608 |
| env_runners/episodes_this_iter | 8 |
| env_runners/num_faulty_episodes | 0 |
| env_runners/policy_reward_mean/default_policy | -1.9530864197530997 |
| env_runners/policy_reward_max/default_policy | 24.666666666666643 |
| env_runners/policy_reward_min/default_policy | -6.777777777777777 |
| env_runners/sampler_perf/mean_env_wait_ms | 125.28802776120958 |
| env_runners/sampler_perf/mean_inference_ms | 6.990539197788687 |
| env_runners/sampler_perf/mean_raw_obs_processing_ms | 1.5577672024148892 |
| env_runners/sampler_perf/mean_action_processing_ms | 0.15441445226364314 |
| info/learner/default_policy/learner_stats/total_loss | 0.19177034539170562 |
| info/learner/default_policy/learner_stats/policy_loss | -0.07176895532757044 |
| info/learner/default_policy/learner_stats/vf_loss | 1.2496330261230468 |
| info/learner/default_policy/learner_stats/vf_explained_var | 0.7856563657522202 |
| info/learner/default_policy/learner_stats/kl | 0.006372157076655965 |
| info/learner/default_policy/learner_stats/cur_kl_coeff | 5.125781250000001 |
| info/learner/default_policy/learner_stats/entropy | 0.8153123579919338 |
| info/learner/default_policy/learner_stats/entropy_coeff | 0.09999999999999999 |
| info/learner/default_policy/learner_stats/cur_lr | 2e-05 |
| info/learner/default_policy/learner_stats/grad_gnorm | 44.679874 |
| perf/cpu_util_percent | 5.902666666666666 |
| perf/ram_util_percent | 62.65866666666666 |
| timers/training_iteration_time_ms | 49458.651 |
| timers/sample_time_ms | 34466.41 |
| timers/learn_time_ms | 14979.38 |
| timers/synch_weights_time_ms | 8.725 |
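The learner statistics in the result row above can be cross-checked against the loss coefficients set in the configuration. The sketch below assumes that the reported PPO ```total_loss``` is composed as policy loss plus weighted value function loss, minus the weighted entropy bonus, plus the KL penalty; this decomposition is an assumption about how the statistics are combined, and the numbers are copied from the ```learner_stats``` entry of this run.

```python
# Coefficients set via PPOConfig.training() in this notebook.
vf_loss_coeff = 0.25
entropy_coeff = 0.1

# Values copied from learner_stats in the result row of this run.
policy_loss = -0.07176895532757044
vf_loss = 1.2496330261230468
entropy = 0.8153123579919338
kl = 0.006372157076655965
cur_kl_coeff = 5.125781250000001
reported_total_loss = 0.19177034539170562

# Assumed decomposition of the PPO loss.
reconstructed = (
    policy_loss
    + vf_loss_coeff * vf_loss
    - entropy_coeff * entropy
    + cur_kl_coeff * kl
)

print(f"reconstructed: {reconstructed:.8f}")
print(f"reported:      {reported_total_loss:.8f}")
```

The reconstructed and reported values agree closely, which shows how strongly each term contributes: with ```entropy_coeff=0.1``` the entropy bonus is of the same order of magnitude as the policy loss itself.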


2024-10-31 17:04:06,742	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to 'C:/Users/ushe/ray_results/ppo_cooperative_pong' in 0.0171s.
2024-10-31 17:04:16,859	INFO tune.py:1041 -- Total run time: 682.53 seconds (672.38 seconds for the tuning loop).
Resume experiment with: tune.run(..., resume=True)
