<a href="https://colab.research.google.com/github/lcipolina/Ray/blob/main/RLLIB_Ray2_0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial on how to run RLLIB with Tune and Ray 2.0

* First, Ray and RLLIB will remain backwards compatible for some time.

* The changes are mostly in how we configure the experiment and set the parameters of the net. Ray needs those two things separately:

  **1) A way to configure the global run (called an "experiment" in Ray).**

  For example, how many iterations, how many checkpoints, where to save results. This is done through a dictionary where the parameters are the keys and the values are chosen from a list of possibilities.

  **2) A way to configure the net (i.e. the RL algo).**

  This is model dependent. For example, PPO and QMIX have different parameters. For this, you import the algorithm's "configuration class" and either use its default values or set your own.
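
To make the split concrete, the sketch below keeps the two kinds of settings in separate plain dictionaries and only combines them at the end, in the keyword layout of a `tune.run(**experiment)` call. It makes no Ray calls; the helper `build_experiment` is a hypothetical name used for illustration, and the values are examples only.

```python
# Illustrative only: plain dictionaries, no Ray import needed.

# 1) Experiment-level settings: how long to run, what to save, under which name.
run_settings = {
    "name": "my_env_experiment",
    "stop": {"timesteps_total": 10},
    "checkpoint_freq": 20,
}

# 2) Algorithm-level settings: these depend on the algorithm (PPO here).
algo_settings = {
    "env": "my_env",
    "framework": "torch",
    "lr": 5e-3,
    "train_batch_size": 256,
}


def build_experiment(algo_name, run_settings, algo_settings):
    """Hypothetical helper: nest the algorithm settings under "config"."""
    experiment = dict(run_settings)  # copy so the inputs stay untouched
    experiment["run_or_experiment"] = algo_name
    experiment["config"] = dict(algo_settings)
    return experiment


experiment = build_experiment("PPO", run_settings, algo_settings)
print(sorted(experiment))
# ['checkpoint_freq', 'config', 'name', 'run_or_experiment', 'stop']
```

Keeping the two dictionaries apart means the same run settings can be reused with a different algorithm by swapping only `algo_settings`.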

## Create a simple environment
The steps are similar to a gym environment:

1) Custom environments inherit from a base class (`gym.Env` in this example; RLLIB also defines its own, such as `MultiAgentEnv`) and need to override certain methods. The inputs and outputs of these methods are fixed and can't be changed.

2) Register the environment.

3) Configure RLLIB to run the experiment (we will use Tune combined with Ray, because it gives us more functionality).

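In [None]:
!pip install "ray[all]" #--quiet

In [None]:
# Restart the runtime so the freshly installed packages are picked up.
# Nothing placed after os._exit(0) in the same cell would run.
import os
os._exit(0)

In [None]:
import gym
import numpy as np
import ray
from ray import air, tune  # air provides RunConfig and CheckpointConfig, used below
from ray.rllib.algorithms.ppo import PPOConfig
from ray.tune.registry import register_env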

In [None]:
#Environment taken from here: https://discuss.ray.io/t/observation-space-not-provided-in-policyspec/6501/16

class MyEnv(gym.Env):
    def __init__(self, config=None):
        super().__init__()

        self.action_space = gym.spaces.Box(
            low=-1, high=1, shape=(2,), dtype=np.float32)
        self.observation_space = gym.spaces.Box(low=-np.inf, high=np.inf, shape=(42410,), dtype=np.float32)

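    def _next_observation(self):
        # Cast to float32 so the observation matches the dtype declared in observation_space
        obs = np.random.rand(42410).astype(np.float32)
        return obs

    def _take_action(self, action):
        self._reward = 1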

    def step(self, action):
        # Execute one time step within the environment
        self._reward = 0
        self._take_action(action)
        done = False
        obs = self._next_observation()
        return obs, self._reward, done, {}

    def reset(self):
        self._reward = 0
        self.total_reward = 0
        self.visualization = None
        return self._next_observation()
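
Before handing an environment to RLLIB, it can help to step through it by hand with random actions. The sketch below does this for a small stand-in class, `TinyEnv` (a hypothetical name), which follows the same `reset()`/`step()` contract as `MyEnv` but uses only the standard library so that it runs anywhere. The `rollout` helper and its `max_steps` argument are also assumptions made for illustration.

```python
import random


class TinyEnv:
    """Stand-in with the same reset()/step() contract as MyEnv (4-tuple step)."""

    def __init__(self, obs_size=4):
        self.obs_size = obs_size

    def _next_observation(self):
        return [random.random() for _ in range(self.obs_size)]

    def reset(self):
        return self._next_observation()

    def step(self, action):
        reward = 1
        done = False  # like MyEnv, this environment never ends on its own
        return self._next_observation(), reward, done, {}


def rollout(env, sample_action, max_steps=10):
    """Step through env for at most max_steps; return (steps, total_reward)."""
    env.reset()
    total_reward = 0.0
    steps = 0
    for _ in range(max_steps):
        _obs, reward, done, _info = env.step(sample_action())
        total_reward += reward
        steps += 1
        if done:
            break
    return steps, total_reward


# Two-dimensional actions in [-1, 1], matching the action space declared above.
steps, total = rollout(TinyEnv(), lambda: [random.uniform(-1, 1) for _ in range(2)])
print(steps, total)  # 10 10.0
```

Because `done` is always `False`, the loop only stops at `max_steps`. To try the same loop on the real environment, pass `MyEnv()` and `env.action_space.sample` instead.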

### Driver code for training with RLLIB

In [None]:
# REGISTER THE ENVIRONMENT

env_name = 'my_env'
# register_env needs a callable that takes the env config and returns an environment instance
tune.register_env(env_name, lambda env_ctx: MyEnv())
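
The reason `register_env` takes a function rather than an environment instance is that each rollout worker has to build its own copy of the environment. The following sketch mimics that name-to-creator pattern with a plain dictionary. It is not Ray's implementation, and `_ENV_CREATORS`, `register_creator`, `make_env` and `DummyEnv` are hypothetical names.

```python
_ENV_CREATORS = {}  # hypothetical stand-in for a registry: name -> creator


def register_creator(name, creator):
    """Store a creator; it must be callable so it can be invoked once per worker."""
    if not callable(creator):
        raise TypeError("creator must be a callable that takes an env config")
    _ENV_CREATORS[name] = creator


def make_env(name, env_config=None):
    """Build a fresh environment by calling the registered creator."""
    return _ENV_CREATORS[name](env_config or {})


class DummyEnv:
    """Placeholder environment that just remembers its config."""

    def __init__(self, config=None):
        self.config = config or {}


register_creator("my_env", lambda env_ctx: DummyEnv(env_ctx))

env_a = make_env("my_env", {"num_workers": 1})
env_b = make_env("my_env", {"num_workers": 1})
print(env_a is env_b)  # False: every call builds a new instance
print(env_a.config)    # {'num_workers': 1}
```

Note that the lambda in this sketch forwards `env_ctx` to the constructor. A creator that ignores its argument never passes the `env_config` values on to the environment.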

In [None]:
# OPTIONAL: CONFIGURE THE PARAMETERS OF PPO

# this is done via importing the PPO configuration class and setting its values. 
# The customized configuration is actually not needed, one can use the default setting for PPO. 
# here I am just showing how to do customized configuration of some parameters.

# If one wants to be serious about PPO and RLLIB, it is necessary to understand exactly how RLLIB treats the different parameters
# For example, it has an "adaptive KL term" - meaning that it adapts the weight on the KL component of the loss. 

# see here for all the PPO parameters: 
# https://chuacheowhuan.github.io/RLlib_trainer_config/
# https://discuss.ray.io/t/rllib-ray-rllib-config-parameters-for-ppo/691

N_CPUS = 4
config = (
    PPOConfig()
    .training(lr=5e-3, num_sgd_iter=10, train_batch_size=256)
    .framework("torch")
    .rollouts(num_rollout_workers=1, horizon=10, rollout_fragment_length=10)
    .resources(num_gpus=0, num_cpus_per_worker=1)
    # env_config: arguments passed to the environment.
    # "num_workers" is the number of parallel workers (could also be N_CPUS - 1).
    .environment(env=env_name, env_config={"num_workers": 1})
)
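
The "adaptive KL term" mentioned in the comments above can be sketched in a few lines. This is a simplified model of the rule as I understand it from RLLIB's PPO: the coefficient grows when the measured KL divergence is well above the target and shrinks when it is well below. The function name `update_kl_coeff` is hypothetical, and the thresholds (2.0 and 0.5), the factors (1.5 and 0.5) and the starting values (`kl_coeff=0.2`, `kl_target=0.01`) are assumptions to check against your installed version.

```python
def update_kl_coeff(kl_coeff, sampled_kl, kl_target):
    """Return the new weight of the KL penalty after one training iteration."""
    if sampled_kl > 2.0 * kl_target:
        # The policy moved too far: penalize KL more strongly.
        return kl_coeff * 1.5
    if sampled_kl < 0.5 * kl_target:
        # The policy barely moved: relax the penalty.
        return kl_coeff * 0.5
    # Close enough to the target: leave the weight unchanged.
    return kl_coeff


kl_coeff = 0.2     # assumed starting value
kl_target = 0.01   # assumed target
for sampled_kl in (0.03, 0.03, 0.01, 0.004):
    kl_coeff = update_kl_coeff(kl_coeff, sampled_kl, kl_target)
    print(f"sampled_kl={sampled_kl:.3f} -> kl_coeff={kl_coeff:.3f}")
```

The practical consequence is that the KL weight in the loss is not a fixed hyperparameter: it drifts during training, so two runs with the same starting value can end up with different effective penalties.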

In [2]:
# INITIALIZE RAY AND RUN

# Remember: Ray is the general library, RLLIB is the library (on top of Ray) for RL experiments and Tune + Air are the libraries for Hyperparam tuning.
# In this example I am running RLLIB with Tune. This is just to give us more functionalities in the case we want to do hyperparam sweeping. 
# But RLLIB can be run either by command line or without Tune (just with fixed parameters) in a much simpler way. I'm just showing the most advanced way.

# Additional information on how to Run RLLIB with different modalities
# https://docs.ray.io/en/latest/rllib/core-concepts.html
# https://docs.ray.io/en/master/rllib/index.html
# https://github.com/ray-project/ray/blob/master/rllib/algorithms/algorithm_config.py



In [None]:
if ray.is_initialized():
    ray.shutdown()
ray.init(include_dashboard=True, ignore_reinit_error=True)  # Prints the address of the dashboard running on a local port
experiment_name = 'my_env_experiment'
tuner = tune.Tuner(
    "PPO",
    param_space=config.to_dict(),  # to run with Tune
    run_config=air.RunConfig(
        name=experiment_name,
        stop={"timesteps_total": 10},
        # verbose=1,
        checkpoint_config=air.CheckpointConfig(
            checkpoint_frequency=50, checkpoint_at_end=True
        ),
    ),
)

results = tuner.fit()
ray.shutdown()

# That's it! Results are printed on the screen; also look for a folder called "ray_results" (by default under your home directory).
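
The stop and checkpoint settings in the run above interact with the batch settings in a way that is easy to miss, so here is a back-of-the-envelope model. It assumes that workers return whole rollout fragments until the train batch is full and that the stop criterion is only checked after an iteration reports its result. All three function names are hypothetical, and the numbers are estimates rather than logged output.

```python
import math


def env_steps_per_iteration(train_batch_size, rollout_fragment_length, num_workers=1):
    """Rough model: workers return whole fragments until the batch is full."""
    steps_per_round = rollout_fragment_length * max(num_workers, 1)
    rounds = math.ceil(train_batch_size / steps_per_round)
    return rounds * steps_per_round


def iterations_until_stop(timesteps_total, steps_per_iteration):
    """At least one iteration runs before the stop criterion can be checked."""
    return max(1, math.ceil(timesteps_total / steps_per_iteration))


def periodic_checkpoints(iterations, checkpoint_frequency):
    """Number of checkpoints written by the frequency rule alone."""
    return iterations // checkpoint_frequency if checkpoint_frequency > 0 else 0


steps = env_steps_per_iteration(train_batch_size=256, rollout_fragment_length=10)
iters = iterations_until_stop(timesteps_total=10, steps_per_iteration=steps)
print(steps, iters, periodic_checkpoints(iters, checkpoint_frequency=50))  # 260 1 0
```

Under these assumptions the run stops after a single iteration, and the only checkpoint comes from `checkpoint_at_end=True`. For a longer experiment, raise `timesteps_total` well above the train batch size.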

Other ways to run that also work:

In [None]:
# 1) Run Tune with the config class (without the second experiment_config dict)
# https://docs.google.com/document/d/1a3UEHEz6_Jth9O_2GR9Qp9c9IFyD5dObHCNSBsRgPGU/edit#
results = tune.run('PPO', config=config.to_dict(), stop={"timesteps_total": 10})

# 2) Alternatively: https://github.com/ray-project/ray/blob/master/rllib/algorithms/algorithm_config.py
# Note: no stop criterion is given here, so this runs until interrupted.
tune.Tuner("PPO", param_space=config.to_dict()).fit()

# 3) Run Tune without "Air", using a second experiment_config dict
# Define experiment details
experiment_name = 'my_env_experiment'
# Configure the experiment
experiment_dict = {
    'name': experiment_name,
    'run_or_experiment': 'PPO',
    "stop": {
        "timesteps_total": 10
    },
    'checkpoint_freq': 20,
    "config": config.to_dict()  # to run with Tune, convert the config class back to a dict
}
# Run with Tune
results = tune.run(**experiment_dict)

# 4) Run the old style with the config as a dict (still compatible)
config = {
    "env": env_name,
    "num_workers": 1,  # parallelism
    "horizon": 10,
    "rollout_fragment_length": 10,
    "train_batch_size": 256,
}
stop = {
    "timesteps_total": 10
}
results = tune.run("PPO", config=config, stop=stop)

# 5) Run with Tune and Air (simple example from documentation), reusing config and stop from 4):
tuner = tune.Tuner("PPO", param_space=config, run_config=air.RunConfig(stop=stop))
results = tuner.fit()

