# CartPole with Ray and RLLib
This notebook trains an agent to solve the ``CartPole-v1`` environment from ``gymnasium`` with several different algorithms via the ``ray`` API. For a basic understanding of how RLLib is structured, [this](https://www.youtube.com/watch?v=nF02NWK1Rug&t=28s) video can be quite helpful. One representation from the video is shown below. The [code](https://github.com/DeUmbraTX/practical_rllib_tutorial/tree/main) from the video contains simple examples of all aspects of RLLib.

<img src="./imgs/rllib_overview.png" width="600" />

Although RLLib is capable of handling multi-agent problems and environments, in the ``CartPole`` example, we consider a simple single-agent problem. To perform any RL experiment with the Ray framework we need four things: 
1. RL Environment 
2. RL Algorithm
3. Configuration of the environment, algorithm and the experiment
4. Experiment Runner

To begin, we import all packages and initialise Ray to use 4 CPUs.

In [1]:
import os
import ray
import gymnasium
from ray import tune, train    # experiment runner
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.algorithms.dqn.dqn import DQNConfig

ray.init(num_cpus=4)

2024-10-31 12:30:04,976	INFO worker.py:1816 -- Started a local Ray instance.


Python version: 3.11.6
Ray version: 2.38.0


## Environment
Using ``gymnasium`` is particularly easy with Ray, as it has native support for all ``gymnasium`` environments: we can simply pass the name of the environment to the config. With other environments, such as ``AEC`` or ``ParallelEnv`` from PettingZoo, preprocessing and wrapping would be necessary here.

## Algorithm
In this example we test two built-in algorithms from Ray, PPO and DQN. To begin, we must configure the algorithms using their dedicated config classes, ``PPOConfig`` and ``DQNConfig``, which inherit from ``AlgorithmConfig``. There are two methods of training algorithms:

1. Build the ``Algorithm`` from the config object and use its methods to train
2. Pass the ``AlgorithmConfig`` object to a ``ray.tune.Tuner`` object and perform an experiment.

Both methods are shown for PPO in the [documentation](https://docs.ray.io/en/latest/rllib/rllib-training.html?_gl=1*wviehk*_up*MQ..*_ga*MTA1MDM5NzY0Ny4xNzMwMTEwMTE0*_ga_0LCWHW1N3S*MTczMDExMDExMy4xLjEuMTczMDExMDE5MC4wLjAuMA..#using-the-python-api). Beginning with the first method, we build the PPO and DQN algorithms from their respective ``AlgorithmConfig`` classes and train for 19 iterations, printing the result of the final training iteration. More details on the ``Algorithm`` class and information on how to create a custom algorithm can be found [here](https://docs.ray.io/en/latest/rllib/package_ref/algorithm.html?_gl=1*1s6zgz8*_up*MQ..*_ga*MjExNDg5MjYzMC4xNzMwMjg1NDg1*_ga_0LCWHW1N3S*MTczMDMwMjgzMC4yLjAuMTczMDMwMjgzMC4wLjAuMA..#building-custom-algorithm-classes).

In [None]:
from pprint import pprint

ppo_config = (
    PPOConfig()
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .environment('CartPole-v1')
    .env_runners(num_env_runners=1)
)

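# Build the Algorithm instance from the config.
algo = ppo_config.build()

# Train for 19 iterations, keeping only the result of the last one.
for _ in range(19):
    ppo_result = algo.train()

# Drop the (large) config entry, if present, before printing.
ppo_result.pop('config', None)
pprint(ppo_result)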

We then do the same with ``DQNConfig``. Note that we pass the same parameters, even though the two algorithms are different. This is because we leave all algorithm-specific settings at their default values. [Algorithm-specific configuration options](https://docs.ray.io/en/latest/rllib/rllib-algorithms.html?_gl=1*mfabvl*_up*MQ..*_ga*MjExNDg5MjYzMC4xNzMwMjg1NDg1*_ga_0LCWHW1N3S*MTczMDI4NTQ4NS4xLjAuMTczMDI4NTQ4NS4wLjAuMA..) are available in the Ray documentation.

In [None]:
from ray.rllib.algorithms.dqn.dqn import DQNConfig

dqn_config = (
    DQNConfig()
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .environment('CartPole-v1')
    .env_runners(num_env_runners=1)
)

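# Build the Algorithm instance from the config.
dqn_algo = dqn_config.build()

# Train for 19 iterations, keeping only the result of the last one.
for _ in range(19):
    dqn_result = dqn_algo.train()

# Drop the (large) config entry, if present, before printing.
dqn_result.pop('config', None)
pprint(dqn_result)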

We can compare the output of the two models. The most relevant information can be found in the ``env_runners`` key, which contains information on the environments and agents run. In particular we are interested in the ``agent_episode_returns_mean``, which gives us the average return over all episodes for each agent. In our case there is only one agent, which is denoted with ``default_agent``. 

In [None]:
ppo_agent_results = ppo_result['env_runners']
print('------ PPO RESULT ------')
print(f'Average Episode Reward: {ppo_agent_results["agent_episode_returns_mean"]["default_agent"]}')
print(f'Average Episode Duration: {round(ppo_agent_results["episode_duration_sec_mean"], 4)} seconds')
print(f'Reward: (max: {ppo_agent_results["episode_return_max"]}, mean: {ppo_agent_results["episode_return_mean"]}, min: {ppo_agent_results["episode_return_min"]})')
print(f'Episode Length: (max: {ppo_agent_results["episode_len_max"]}, mean: {ppo_agent_results["episode_len_mean"]}, min: {ppo_agent_results["episode_len_min"]})')

dqn_agent_results = dqn_result['env_runners']
print()
print('------ DQN RESULT ------')
print(f'Average Episode Reward: {dqn_agent_results["agent_episode_returns_mean"]["default_agent"]}')
print(f'Average Episode Duration: {round(dqn_agent_results["episode_duration_sec_mean"], 4)} seconds')
print(f'Reward: (max: {dqn_agent_results["episode_return_max"]}, mean: {dqn_agent_results["episode_return_mean"]}, min: {dqn_agent_results["episode_return_min"]})')
print(f'Episode Length: (max: {dqn_agent_results["episode_len_max"]}, mean: {dqn_agent_results["episode_len_mean"]}, min: {dqn_agent_results["episode_len_min"]})')
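
Since the same metrics are printed for both algorithms, the reporting can be factored into one helper. The following is a minimal sketch: ``summarize_env_runners`` is a hypothetical helper name, it only reads the keys already used in the cell above, and the values in ``example_metrics`` are illustrative placeholders rather than real training output.

```python
def summarize_env_runners(name, metrics, agent_id="default_agent"):
    """Format the env-runner metrics of one training result as text."""
    lines = [
        f"------ {name} RESULT ------",
        f"Average Episode Reward: {metrics['agent_episode_returns_mean'][agent_id]}",
        f"Average Episode Duration: {round(metrics['episode_duration_sec_mean'], 4)} seconds",
        f"Reward: (max: {metrics['episode_return_max']}, "
        f"mean: {metrics['episode_return_mean']}, "
        f"min: {metrics['episode_return_min']})",
        f"Episode Length: (max: {metrics['episode_len_max']}, "
        f"mean: {metrics['episode_len_mean']}, "
        f"min: {metrics['episode_len_min']})",
    ]
    return "\n".join(lines)


# Illustrative placeholder values only (not real training output).
example_metrics = {
    "agent_episode_returns_mean": {"default_agent": 103.94},
    "episode_duration_sec_mean": 0.0539683410001453,
    "episode_return_max": 500.0,
    "episode_return_mean": 103.94,
    "episode_return_min": 12.0,
    "episode_len_max": 500,
    "episode_len_mean": 103.94,
    "episode_len_min": 12,
}

summary = summarize_env_runners("PPO", example_metrics)
print(summary)
```

In the notebook, the helper would be called as ``summarize_env_runners('PPO', ppo_result['env_runners'])`` and likewise for DQN.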

The second method is to use ``ray.tune`` to run an experiment with an ``AlgorithmConfig`` object. ``ray.tune`` was originally intended for hyperparameter tuning, hence its name. In the following examples we only run a single experiment, i.e. we do not pass any search spaces for hyperparameters (this will be analysed in a different notebook). In the ``Tuner`` object we indicate that we want to stop training once a mean episode return of 100 has been reached.
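
The stop criterion is addressed with a slash-separated key such as ``env_runners/episode_return_mean``, which refers to the ``episode_return_mean`` entry nested under ``env_runners`` in the result dictionary. The sketch below mimics that lookup in plain Python to make the key format concrete. ``lookup_metric`` and ``should_stop`` are hypothetical helpers written for illustration, not Ray's implementation, and the two sample results are made up.

```python
def lookup_metric(result, key, sep="/"):
    """Follow a slash-separated key through nested dictionaries."""
    value = result
    for part in key.split(sep):
        value = value[part]
    return value


def should_stop(result, stop):
    """Return True once any listed metric reaches or exceeds its threshold."""
    return any(
        lookup_metric(result, key) >= threshold
        for key, threshold in stop.items()
    )


stop = {"env_runners/episode_return_mean": 100}

# Made-up results from an early and a late training iteration.
early_result = {"env_runners": {"episode_return_mean": 42.5}}
late_result = {"env_runners": {"episode_return_mean": 103.94}}

print(should_stop(early_result, stop))
print(should_stop(late_result, stop))
```

Under these assumptions, only the second result satisfies the criterion, so a trial would keep running until its mean episode return reaches 100.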

In [None]:
ppo_config = (
    PPOConfig()
    .api_stack(
        enable_rl_module_and_learner=True, 
        enable_env_runner_and_connector_v2=True,
    )
    .environment('CartPole-v1')
)

tuner = tune.Tuner(
    "PPO",
    param_space=ppo_config,
    run_config=train.RunConfig(
        stop={'env_runners/episode_return_mean': 100},
    )
)

ray.cluster_resources()
results = tuner.fit()
pprint(results)


Current time: 2024-10-31 12:35:29
Running for: 00:00:42.38
Memory: 19.6/31.3 GiB

| Trial name | status | loc | iter | total time (s) | num_env_steps_sampled_lifetime | num_episodes_lifetime | num_env_steps_trained_lifetime |
|---|---|---|---|---|---|---|---|
| PPO_CartPole-v1_21906_00000 | TERMINATED | 127.0.0.1:38160 | 5 | 31.9567 | 20000 | 417 | 20427 |


2024-10-31 12:35:29,052	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to 'C:/Users/ushe/ray_results/PPO_2024-10-31_12-34-46' in 0.0090s.
2024-10-31 12:35:29,469	INFO tune.py:1041 -- Total run time: 42.81 seconds (42.37 seconds for the tuning loop).


ResultGrid<[
  Result(
    metrics={'timers': {'env_runner_sampling_timer': 1.127984262819409, 'learner_update_timer': 5.340735067917285, 'synch_weights': 0.006793373029419923, 'synch_env_connectors': 0.008673361914479588, 'training_iteration_time_sec': 6.323401594161988, 'restore_workers_time_sec': 0.0, 'training_step_time_sec': 6.323201084136963}, 'env_runners': {'num_episodes': 16, 'episode_len_mean': 103.94, 'num_module_steps_sampled': {'default_policy': 4000}, 'episode_return_mean': 103.94, 'module_episode_returns_mean': {'default_policy': 103.94}, 'episode_return_min': 12.0, 'num_env_steps_sampled': 4000, 'episode_len_max': 500, 'episode_len_min': 12, 'num_agent_steps_sampled_lifetime': {'default_agent': 60000}, 'num_agent_steps_sampled': {'default_agent': 4000}, 'sample': np.float64(1.1043094172224353), 'num_module_steps_sampled_lifetime': {'default_policy': 60000}, 'num_env_steps_sampled_lifetime': 100000, 'episode_duration_sec_mean': 0.0539683410001453, 'agent_episode_returns_

In [None]:
dqn_config = (
    DQNConfig()
    .api_stack(
        enable_env_runner_and_connector_v2=True,
        enable_rl_module_and_learner=True,
    )
    .environment('CartPole-v1')
)

tuner = tune.Tuner(
    "DQN",
    param_space=dqn_config,
    run_config=train.RunConfig(
        stop={'env_runners/episode_return_mean': 100},
    )
)

results = tuner.fit()
pprint(results)

Current time: 2024-10-31 12:40:40
Running for: 00:04:23.65
Memory: 18.7/31.3 GiB

| Trial name | status | loc | iter | total time (s) | num_env_steps_sampled_lifetime | num_episodes_lifetime | num_env_steps_trained_lifetime |
|---|---|---|---|---|---|---|---|
| DQN_CartPole-v1_5790a_00000 | TERMINATED | 127.0.0.1:41688 | 16 | 257.817 | 16000 | 303 | 480032 |


2024-10-31 12:40:40,913	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to 'C:/Users/ushe/ray_results/DQN_2024-10-31_12-36-17' in 0.0110s.
2024-10-31 12:40:40,929	INFO tune.py:1041 -- Total run time: 263.67 seconds (263.63 seconds for the tuning loop).


ResultGrid<[
  Result(
    metrics={'timers': {'env_runner_sampling_timer': 0.0016692082588963132, 'replay_buffer_add_data_timer': 0.00046086340736068527, 'replay_buffer_sampling_timer': 0.0037381102358033423, 'learner_update_timer': 0.005862153423527652, 'replay_buffer_update_prios_timer': 0.00038514787872257644, 'synch_weights': 0.004218789637538372, 'synch_env_connectors': 2.5466819840972386e-05, 'training_iteration_time_sec': 17.99058496952057, 'restore_workers_time_sec': 0.0, 'training_step_time_sec': 0.015424990653991699}, 'env_runners': {'num_env_steps_sampled': 1, 'num_episodes': 0, 'sample': 0.001348025295885876, 'num_agent_steps_sampled': {'default_agent': 1}, 'num_agent_steps_sampled_lifetime': {'default_agent': 128008000}, 'num_env_steps_sampled_lifetime': 128008000, 'num_module_steps_sampled': {'default_policy': 1}, 'num_module_steps_sampled_lifetime': {'default_policy': 128008000}, 'time_between_sampling': 0.015254951781264391, 'agent_episode_returns_mean': {'default_agen

Two things are noteworthy when comparing these two runs. The PPO algorithm uses more of the allocated resources (3/4 CPUs for PPO vs. 1/4 CPUs for DQN). That said, the DQN algorithm takes significantly longer to reach the same average episode reward: about 258 seconds over 16 iterations, compared with about 32 seconds over 5 iterations for PPO.
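
The difference in training time can be quantified from the two Tune status tables. The short calculation below is a sketch that uses only the iteration counts, total times and sampled environment steps reported above; the dictionary layout and variable names are chosen for illustration.

```python
# Values copied from the Tune status tables of the two runs.
runs = {
    "PPO": {"iterations": 5, "total_time_s": 31.9567, "env_steps_sampled": 20000},
    "DQN": {"iterations": 16, "total_time_s": 257.817, "env_steps_sampled": 16000},
}

for name, run in runs.items():
    seconds_per_iteration = run["total_time_s"] / run["iterations"]
    steps_per_second = run["env_steps_sampled"] / run["total_time_s"]
    print(
        f"{name}: {seconds_per_iteration:.2f} s per iteration, "
        f"{steps_per_second:.0f} sampled env steps per second"
    )

# Ratio of wall-clock training time needed to reach the stop criterion.
slowdown = runs["DQN"]["total_time_s"] / runs["PPO"]["total_time_s"]
print(f"DQN needed {slowdown:.1f}x the training time of PPO")
```

These figures come from a single run per algorithm with default settings, so they indicate a trend rather than a benchmark.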