### To view TensorBoard:
    1: run tensorboard --logdir ~/ray_results
    2: open http://localhost:6006/ in a browser

In [2]:
import ray
import ray.rllib.agents.ppo as ppo
from ray.tune.logger import pretty_print
from ray import tune




## 0: RLlib Training APIs: 
1: At a high level, RLlib provides a Trainer class which holds a policy for environment interaction. Through the trainer interface, the policy can be trained, checkpointed, or queried for an action. In multi-agent training, the trainer manages the querying and optimization of multiple policies at once.

2: rllib train --run DQN --env CartPole-v0  --config '{"num_workers": 8}'
    To see the tensorboard: tensorboard --logdir=~/ray_results

3: rllib rollout ~/ray_results/default/DQN_CartPole-v0_0upjmdgr0/checkpoint_1/checkpoint-1 \
    --run DQN --env CartPole-v0 --steps 10000

4: Loading and restoring a trained agent from a checkpoint is simple:
    
    agent = ppo.PPOTrainer(config=config, env=env_class)
    agent.restore(checkpoint_path)
    
5: Computing Actions

The simplest way to programmatically compute actions from a trained agent is to use trainer.compute_action(). This method preprocesses and filters the observation before passing it to the agent policy. Here is a simple example of testing a trained agent for one episode:

    # instantiate env class
    env = env_class(env_config)

    # run until episode ends
    episode_reward = 0
    done = False
    obs = env.reset()
    while not done:
        action = agent.compute_action(obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward
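
The episode loop above can be wrapped in a helper that averages the reward over several episodes. The following is a minimal sketch under stated assumptions: `FixedLengthEnv`, `AlwaysOneAgent`, and `evaluate` are hypothetical names that are not part of RLlib. The stand-in classes only mimic the `reset()`/`step()` and `compute_action()` calls used above, so the loop can be tried without Ray installed.

```python
class FixedLengthEnv:
    """Hypothetical toy env following the reset()/step() contract used above."""

    def __init__(self, env_config):
        self.episode_length = env_config.get("episode_length", 5)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return 0  # dummy observation

    def step(self, action):
        self.steps += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.steps >= self.episode_length
        return 0, reward, done, {}


class AlwaysOneAgent:
    """Hypothetical stand-in exposing only compute_action()."""

    def compute_action(self, obs):
        return 1


def evaluate(agent, env, num_episodes=3):
    """Run full episodes and return the mean episode reward."""
    totals = []
    for _ in range(num_episodes):
        obs = env.reset()
        done = False
        episode_reward = 0.0
        while not done:
            action = agent.compute_action(obs)
            obs, reward, done, info = env.step(action)
            episode_reward += reward
        totals.append(episode_reward)
    return sum(totals) / len(totals)


mean_reward = evaluate(AlwaysOneAgent(), FixedLengthEnv({"episode_length": 5}))
print(mean_reward)
```

With a restored trainer, the same `evaluate` function should work as long as the agent object provides `compute_action(obs)` and the environment returns the four-tuple shown above.
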
6: It's recommended that you run RLlib trainers with Tune, for easy experiment management and visualization of results. Just set "run": ALG_NAME, "env": ENV_NAME in the experiment config. All RLlib trainers are compatible with the Tune API, so they can be used directly in Tune experiments.

7: tune.run() returns an ExperimentAnalysis object that allows further analysis of the training results and retrieving the checkpoint(s) of the trained agent. It also simplifies saving the trained agent. For example:

a: tune.run() allows setting a custom log directory (other than ``~/ray_results``) and automatically saving the trained agent

    analysis = ray.tune.run(
        ppo.PPOTrainer,
        config=config,
        local_dir=log_dir,
        stop=stop_criteria,
        checkpoint_at_end=True)

b: get_trial_checkpoints_paths() returns a list of lists, one per checkpoint; each inner list holds the checkpoint path first and the metric value second

        checkpoints = analysis.get_trial_checkpoints_paths(
            trial=analysis.get_best_trial("episode_reward_mean"),
            metric="episode_reward_mean")

c: or simply get the last checkpoint (with highest "training_iteration")

        last_checkpoint = analysis.get_last_checkpoint()
    
d: if there are multiple trials, select a specific trial or automatically choose the best one according to a given metric

        last_checkpoint = analysis.get_last_checkpoint(
            metric="episode_reward_mean", mode="max"
        )

e: Loading and restoring a trained agent from a checkpoint is simple:

    agent = ppo.PPOTrainer(config=config, env=env_class)
    agent.restore(checkpoint_path)
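
Step b returns the checkpoints as `[path, metric value]` pairs, so picking the best one is a plain `max` over the second element. This is a small sketch only: the paths and metric values below are made up for illustration, and `best_checkpoint` is a hypothetical helper, not a Tune function.

```python
# Made-up data in the shape described in step b: [path, metric value] per checkpoint.
checkpoints = [
    ["/tmp/example/checkpoint_10/checkpoint-10", 87.5],
    ["/tmp/example/checkpoint_20/checkpoint-20", 142.0],
    ["/tmp/example/checkpoint_30/checkpoint-30", 131.2],
]


def best_checkpoint(checkpoints):
    """Return (path, metric) of the entry with the highest metric value."""
    path, metric = max(checkpoints, key=lambda entry: entry[1])
    return path, metric


best_path, best_metric = best_checkpoint(checkpoints)
print(best_path, best_metric)
```

The resulting path could then be passed to `agent.restore()` as in step e.
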

In [None]:
ray.shutdown()
ray.init()

#### 1 Example of Training a PPO Agent

In [None]:
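# Start from PPO's default config and override the resource settings.
config = ppo.DEFAULT_CONFIG.copy()
config['num_gpus'] = 1  # requires one GPU; set to 0 on CPU-only machines
config['num_workers'] = 2
trainer = ppo.PPOTrainer(config=config, env='CartPole-v1')

for i in range(30):
    result = trainer.train()
    # Save a checkpoint every 10 iterations (i = 0, 10, 20).
    if i % 10 == 0:
        checkpoint = trainer.save()
        print('checkpoint saved at:', checkpoint)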
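
The fixed 30-iteration loop above could instead stop as soon as a reward target is reached, which is roughly what a Tune stop criterion such as `{'episode_reward_mean': 200}` does. The sketch below assumes only that `train()` returns a dict containing `episode_reward_mean`. `FakeTrainer` and `train_until` are hypothetical names, and the fake reward curve is invented so the example runs without Ray.

```python
class FakeTrainer:
    """Hypothetical stand-in for a trainer: train() returns a result dict."""

    def __init__(self):
        self.iteration = 0

    def train(self):
        self.iteration += 1
        # Pretend the mean reward improves by 25 per iteration.
        return {
            "training_iteration": self.iteration,
            "episode_reward_mean": 25.0 * self.iteration,
        }

    def save(self):
        return "checkpoint-%d" % self.iteration


def train_until(trainer, target_reward, max_iterations=30, checkpoint_every=10):
    """Train until the mean episode reward reaches the target or the budget runs out."""
    result = None
    for i in range(max_iterations):
        result = trainer.train()
        if i % checkpoint_every == 0:
            print("checkpoint saved at:", trainer.save())
        if result["episode_reward_mean"] >= target_reward:
            break
    return result


final = train_until(FakeTrainer(), target_reward=200)
print(final)
```

Passing a real `ppo.PPOTrainer` in place of `FakeTrainer` should behave the same way, since the helper only calls `train()` and `save()`.
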

#### 2 Example of Using Tune

In [None]:
alg = 'PPO'
tune.run(
    alg,
    stop={'episode_reward_mean': 200},
    config={
        'env': 'CartPole-v0',
        'num_gpus': 1,
        'num_workers': 2,
        # one trial per learning rate
        'lr': tune.grid_search([0.01, 0.001, 0.0001]),
    },
)
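
The `tune.grid_search` over three learning rates above yields one trial per value. The sketch below shows that expansion in plain Python, to make clear what set of configs results. It is an illustration only, not Tune's implementation, and `expand_grid` is a hypothetical helper.

```python
import itertools


def expand_grid(base_config, grid):
    """Return one config dict per combination of the grid values."""
    keys = list(grid)
    configs = []
    for values in itertools.product(*(grid[k] for k in keys)):
        config = dict(base_config)
        config.update(zip(keys, values))
        configs.append(config)
    return configs


trial_configs = expand_grid(
    {"env": "CartPole-v0", "num_gpus": 1, "num_workers": 2},
    {"lr": [0.01, 0.001, 0.0001]},
)
for config in trial_configs:
    print(config)
```

Adding a second searched key to the grid multiplies the number of configs, since every combination of values gets its own trial.
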

In [None]:
alg = 'DDPG'
tune.run(
    alg,
    stop={'training_iteration': 30},
    config={
        'env': 'Pendulum-v0',
        'num_gpus': 0,
        'num_workers': 2,
        # single-value grid: one trial
        'lr': tune.grid_search([0.001]),
    },
)


## 1: RLlib Environments

1: RLlib works with several different types of environments, including OpenAI Gym, user-defined, multi-agent, and also batched environments.

2: RLlib uses Gym as its environment interface for single-agent training.



#### 1: Configuring Environments

    https://github.com/ray-project/ray/blob/master/rllib/examples/custom_env.py

In [3]:
import gym, ray
from ray.rllib.agents import ppo

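class MyEnv(gym.Env):
    """Toy environment; the spaces, reward, and episode length are placeholders."""

    def __init__(self, env_config):
        # env_config is the dict passed via "env_config" in the trainer config.
        self.action_space = gym.spaces.Discrete(2)
        self.observation_space = gym.spaces.Box(low=0.0, high=1.0, shape=(1,))
        self.max_steps = env_config.get("max_steps", 10)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return self.observation_space.sample()

    def step(self, action):
        self.steps += 1
        reward = float(action)  # placeholder reward
        done = self.steps >= self.max_steps
        return self.observation_space.sample(), reward, done, {}

ray.init(ignore_reinit_error=True)  # Ray may already be running from an earlier cell
trainer = ppo.PPOTrainer(env=MyEnv, config={
    "env_config": {},  # config to pass to env class
})

while True:  # trains until interrupted
    print(trainer.train())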