# CartPole with Ray and RLlib
This notebook trains an agent to solve the ``CartPole-v1`` environment from ``gymnasium`` with several different algorithms via the ``ray`` API. For a basic understanding of how RLlib is structured, [this video](https://www.youtube.com/watch?v=nF02NWK1Rug&t=28s) can be quite helpful. One representation from the video is shown below. The [code](https://github.com/DeUmbraTX/practical_rllib_tutorial/tree/main) from the video contains simple examples of all aspects of RLlib.

<img src="./imgs/rllib_overview.png" width="600" />

Although RLlib can handle multi-agent problems and environments, the ``CartPole`` example is a simple single-agent problem. To perform any RL experiment with the Ray framework we need four things:
1. RL Environment 
2. RL Algorithm
3. Configuration of the environment, algorithm and the experiment
4. Experiment Runner

To begin, we import all packages and initialise Ray to use 4 CPUs.

In [None]:
import os
import ray
import gymnasium
from ray import tune, train    # experiment runner
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.algorithms.dqn.dqn import DQNConfig

ray.init(num_cpus=4)

2024-10-30 16:28:04,677	INFO worker.py:1816 -- Started a local Ray instance.


Python version: 3.11.6
Ray version: 2.38.0


## Environment
Using ``gymnasium`` with Ray is particularly easy: RLlib natively supports Gymnasium environments, so we can simply pass the name of the ``gymnasium`` environment to the configurator. Other environment types, such as ``AEC`` or ``ParallelEnv`` from PettingZoo, would need to be preprocessed and wrapped first.

## Algorithm
In this example we test two built-in algorithms from RLlib, PPO and DQN. To begin, we configure each algorithm using its dedicated config class, ``PPOConfig`` and ``DQNConfig`` respectively, both of which inherit from ``AlgorithmConfig``. There are two ways to train an algorithm:

1. Build the ``Algorithm`` from the config object and use its methods to train.
2. Pass the ``AlgorithmConfig`` object to a ``ray.tune.Tuner`` object and perform an experiment.

For PPO, both approaches are shown in the [documentation](https://docs.ray.io/en/latest/rllib/rllib-training.html?_gl=1*wviehk*_up*MQ..*_ga*MTA1MDM5NzY0Ny4xNzMwMTEwMTE0*_ga_0LCWHW1N3S*MTczMDExMDExMy4xLjEuMTczMDExMDE5MC4wLjAuMA..#using-the-python-api). Beginning with the first method, we build the PPO and DQN algorithms from their respective ``AlgorithmConfig`` objects and train each for 19 iterations, printing the result of the final training iteration. More details on the ``Algorithm`` class, and on how to create a custom algorithm, can be found [here](https://docs.ray.io/en/latest/rllib/package_ref/algorithm.html?_gl=1*1s6zgz8*_up*MQ..*_ga*MjExNDg5MjYzMC4xNzMwMjg1NDg1*_ga_0LCWHW1N3S*MTczMDMwMjgzMC4yLjAuMTczMDMwMjgzMC4wLjAuMA..#building-custom-algorithm-classes).
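Printing only the final result hides how the return develops during training. The following is a minimal sketch of a loop that records the mean episode return after every iteration and optionally stops early. ``train_with_history`` and ``FakeAlgo`` are hypothetical names introduced here; ``FakeAlgo`` is a stand-in with made-up numbers so that the snippet runs without Ray, and a built ``Algorithm`` could be passed in its place.

```python
def train_with_history(algo, num_iterations, target_return=None):
    """Call algo.train() repeatedly and record the mean episode return per iteration."""
    history = []
    for iteration in range(1, num_iterations + 1):
        result = algo.train()
        # The key may be absent before the first episode has finished, hence the default.
        mean_return = result["env_runners"].get("episode_return_mean", float("nan"))
        history.append(mean_return)
        print(f"iteration {iteration:3d}: episode_return_mean = {mean_return:.2f}")
        if target_return is not None and mean_return >= target_return:
            break  # target reached, stop early
    return history


class FakeAlgo:
    """Hypothetical stand-in for a built Algorithm so the sketch runs without Ray."""

    def __init__(self):
        self._iteration = 0

    def train(self):
        self._iteration += 1
        # Made-up numbers: the return simply grows by 50 per iteration.
        return {"env_runners": {"episode_return_mean": 50.0 * self._iteration}}


history = train_with_history(FakeAlgo(), num_iterations=19, target_return=200.0)
```

With the stand-in, the loop stops after four iterations because the made-up return reaches the target of 200.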

In [6]:
from pprint import pprint

ppo_config = (
    PPOConfig()
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .environment('CartPole-v1')
    .env_runners(num_env_runners=1)
)

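# Build the Algorithm instance from the config.
algo = ppo_config.build()

# Each call to train() runs one training iteration and returns a result dict.
for _ in range(19):
    ppo_result = algo.train()

# Drop the bulky config entry before printing the final result.
ppo_result.pop('config')
pprint(ppo_result)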

{'date': '2024-10-30_17-11-17',
 'done': False,
 'env_runners': {'agent_episode_returns_mean': {'default_agent': 479.92},
                 'episode_duration_sec_mean': 0.2996239929999865,
                 'episode_len_max': 500,
                 'episode_len_mean': 479.92,
                 'episode_len_min': 120,
                 'episode_return_max': 500.0,
                 'episode_return_mean': 479.92,
                 'episode_return_min': 120.0,
                 'module_episode_returns_mean': {'default_policy': 479.92},
                 'num_agent_steps_sampled': {'default_agent': 4000},
                 'num_agent_steps_sampled_lifetime': {'default_agent': 760000},
                 'num_env_steps_sampled': 4000,
                 'num_env_steps_sampled_lifetime': 760000,
                 'num_episodes': 8,
                 'num_module_steps_sampled': {'default_policy': 4000},
                 'num_module_steps_sampled_lifetime': {'default_policy': 760000},
                 'sample

We then do the same with ``DQNConfig``. Note that we pass the same parameters even though the two algorithms differ, because all algorithm-specific settings are left at their default values. [Algorithm-specific configuration options](https://docs.ray.io/en/latest/rllib/rllib-algorithms.html?_gl=1*mfabvl*_up*MQ..*_ga*MjExNDg5MjYzMC4xNzMwMjg1NDg1*_ga_0LCWHW1N3S*MTczMDI4NTQ4NS4xLjAuMTczMDI4NTQ4NS4wLjAuMA..) are available in the Ray documentation.

In [7]:
from ray.rllib.algorithms.dqn.dqn import DQNConfig

dqn_config = (
    DQNConfig()
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .environment('CartPole-v1')
    .env_runners(num_env_runners=1)
)

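# Build the Algorithm instance from the config.
dqn_algo = dqn_config.build()

# Each call to train() runs one training iteration and returns a result dict.
for _ in range(19):
    dqn_result = dqn_algo.train()

# Drop the bulky config entry before printing the final result.
dqn_result.pop('config')
pprint(dqn_result)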

{'date': '2024-10-30_17-20-50',
 'done': False,
 'env_runners': {'agent_episode_returns_mean': {'default_agent': 129.04},
                 'episode_duration_sec_mean': 0.0,
                 'episode_len_max': 500,
                 'episode_len_mean': 129.04,
                 'episode_len_min': 10,
                 'episode_return_max': 500.0,
                 'episode_return_mean': 129.04,
                 'episode_return_min': 10.0,
                 'module_episode_returns_mean': {'default_policy': 129.04},
                 'num_agent_steps_sampled': {'default_agent': 1},
                 'num_agent_steps_sampled_lifetime': {'default_agent': 180509500},
                 'num_env_steps_sampled': 1,
                 'num_env_steps_sampled_lifetime': 180509500,
                 'num_episodes': 0,
                 'num_module_steps_sampled': {'default_policy': 1},
                 'num_module_steps_sampled_lifetime': {'default_policy': 180509500},
                 'sample': 0.001875815295

We can now compare the output of the two algorithms. The most relevant information is found under the ``env_runners`` key, which holds statistics on the environments and agents that were run. In particular, we are interested in ``agent_episode_returns_mean``, which gives the average episode return for each agent. In our case there is only one agent, denoted ``default_agent``.

In [16]:
ppo_agent_results = ppo_result['env_runners']
print('------ PPO RESULT ------')
print(f'Average Episode Reward: {ppo_agent_results["agent_episode_returns_mean"]["default_agent"]}')
print(f'Average Episode Duration: {round(ppo_agent_results["episode_duration_sec_mean"], 4)} seconds')
print(f'Reward: (max: {ppo_agent_results["episode_return_max"]}, mean: {ppo_agent_results["episode_return_mean"]}, min: {ppo_agent_results["episode_return_min"]})')
print(f'Episode Length: (max: {ppo_agent_results["episode_len_max"]}, mean: {ppo_agent_results["episode_len_mean"]}, min: {ppo_agent_results["episode_len_min"]})')

dqn_agent_results = dqn_result['env_runners']
print()
print('------ DQN RESULT ------')
print(f'Average Episode Reward: {dqn_agent_results["agent_episode_returns_mean"]["default_agent"]}')
print(f'Average Episode Duration: {round(dqn_agent_results["episode_duration_sec_mean"], 4)} seconds')
print(f'Reward: (max: {dqn_agent_results["episode_return_max"]}, mean: {dqn_agent_results["episode_return_mean"]}, min: {dqn_agent_results["episode_return_min"]})')
print(f'Episode Length: (max: {dqn_agent_results["episode_len_max"]}, mean: {dqn_agent_results["episode_len_mean"]}, min: {dqn_agent_results["episode_len_min"]})')

------ PPO RESULT ------
Average Episode Reward: 479.92
Average Episode Duration: 0.2996 seconds
Reward: (max: 500.0, mean: 479.92, min: 120.0)
Episode Length: (max: 500, mean: 479.92, min: 120)

------ DQN RESULT ------
Average Episode Reward: 129.04
Average Episode Duration: 0.0 seconds
Reward: (max: 500.0, mean: 129.04, min: 10.0)
Episode Length: (max: 500, mean: 129.04, min: 10)


The second method uses ``ray.tune`` to run an experiment with an ``AlgorithmConfig`` object. ``ray.tune`` was originally intended for hyperparameter tuning, hence its name. In the following examples we run only a single experiment, i.e. we do not pass any search spaces for hyperparameters (this will be analysed in a different notebook). In the ``Tuner`` object we indicate that training should stop once a mean episode return of 100 has been reached.
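The stopping criterion is keyed by a slash-delimited path into the nested result dictionary, such as ``env_runners/episode_return_mean``. The following sketch illustrates that idea in plain Python. It is not Tune's actual implementation: ``get_metric`` is a hypothetical helper, and ``ppo_excerpt`` copies a few values from the PPO result printed earlier.

```python
def get_metric(result, path, sep="/"):
    """Look up a nested value in a result dict using a slash-delimited path."""
    value = result
    for key in path.split(sep):
        value = value[key]
    return value


# Excerpt of the PPO result printed earlier in this notebook.
ppo_excerpt = {
    "done": False,
    "env_runners": {
        "episode_return_mean": 479.92,
        "agent_episode_returns_mean": {"default_agent": 479.92},
    },
}

# Same form as the `stop` argument passed to RunConfig.
stop = {"env_runners/episode_return_mean": 100}

# A trial stops as soon as any listed metric reaches its threshold.
should_stop = any(
    get_metric(ppo_excerpt, path) >= threshold for path, threshold in stop.items()
)
print(should_stop)
```

The same path notation reaches deeper levels as well, for example ``env_runners/agent_episode_returns_mean/default_agent``.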

In [None]:
ppo_config = (
    PPOConfig()
    .api_stack(
        enable_rl_module_and_learner=True, 
        enable_env_runner_and_connector_v2=True,
    )
    .environment('CartPole-v1')
)

tuner = tune.Tuner(
    "PPO",
    param_space=ppo_config,
    run_config=train.RunConfig(
        stop={'env_runners/episode_return_mean': 100},
    )
)

tuner.fit()

Current time: 2024-10-31 09:55:02
Running for: 00:01:50.40
Memory: 19.8/31.3 GiB

Trial name                   status   loc
PPO_CartPole-v1_8f6ed_00000  PENDING


2024-10-31 09:55:02,979	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to 'C:/Users/ushe/ray_results/PPO_2024-10-31_09-53-12' in 0.0060s.
2024-10-31 09:55:02,995	INFO tune.py:1041 -- Total run time: 110.43 seconds (110.40 seconds for the tuning loop).
Resume experiment with: Tuner.restore(path="C:/Users/ushe/ray_results/PPO_2024-10-31_09-53-12", trainable=...)
- PPO_CartPole-v1_8f6ed_00000: FileNotFoundError('Could not fetch metrics for PPO_CartPole-v1_8f6ed_00000: both result.json and progress.csv were not found at C:/Users/ushe/ray_results/PPO_2024-10-31_09-53-12/PPO_CartPole-v1_8f6ed_00000_0_2024-10-31_09-53-12')


ResultGrid<[
  Result(
    metrics={},
    path='C:/Users/ushe/ray_results/PPO_2024-10-31_09-53-12/PPO_CartPole-v1_8f6ed_00000_0_2024-10-31_09-53-12',
    filesystem='local',
    checkpoint=None
  )
]>

In [None]:
dqn_config = (
    DQNConfig()
    .api_stack(
        enable_env_runner_and_connector_v2=True,
        enable_rl_module_and_learner=True,
    )
    .environment('CartPole-v1')
)

tuner = tune.Tuner(
    "DQN",
    param_space=dqn_config,
    run_config=train.RunConfig(
        stop={'env_runners/episode_return_mean': 100},
    )
)