# Data Science Summit
## Fantastical bugs and where to find them in HRL systems
This tutorial is designed to familiarize you with the most frequent problems and bugs encountered while developing Reinforcement Learning systems, especially hierarchical ones.
Author: Michał Bortkiewicz

## Imports

In [None]:
%load_ext autoreload
%autoreload 2
%load_ext nb_black

### Packages

In [None]:
import functools
import json
import os
import sys
import wandb
from typing import TypeVar, Callable, Any, cast, Tuple, Union
import numpy as np
import pandas as pd
from IPython.lib.display import IFrame
import matplotlib.pyplot as plt
sys.path.append("..")

### Project files

In [None]:
from config import ROOT_DIR
from scripts.run.train import override_params, create_wandb_config, create_wandb_name
from scripts.run.core import get_env_and_graph, load_params, run_session
from run.run_from_files import seed_everything

# Introduction
## The HRL setup and source code organization
The [Graph RL](https://github.com/nicoguertler/graph_rl) library, implemented by Nico Gürtler et al. and used in [Hierarchical Reinforcement Learning with Timed Subgoals](https://proceedings.neurips.cc/paper/2021/file/b59c21a078fde074a6750e91ed19fb21-Paper.pdf) and in this tutorial, provides a graph abstraction over agent elements. These elements are represented by nodes, each composed of three parts:
- Policy
- Subtask
- Algorithm

![graph_rl](graph_and_node.png)


## The environment
We will use the [Platforms environment](https://www.youtube.com/watch?v=JkPaI3uZU6c&t=432s&ab_channel=NicoG%C3%BCrtler), introduced by Nico Gürtler et al. in [Hierarchical Reinforcement Learning with Timed Subgoals](https://proceedings.neurips.cc/paper/2021/file/b59c21a078fde074a6750e91ed19fb21-Paper.pdf). The goal of the (black) ball in this environment is to reach the black circle by crossing the moving platforms. The blue ball represents the agent's subgoals.
![Platforms](platforms.png)
For the purposes of this tutorial, we will use an easier version of the Platforms environment (for faster policy convergence), in which the final goal is located nearer, just above the first platform.
![EasyPlatforms](easy_platforms.png)

# Running the training and evaluation loop

## Run params and wandb setup

Main run parameters

In [None]:
# Environment and Agent
env_name = "EasyPlatforms"
algo_name = "SAC"
# Additional (overriding the json files) parameters for graph generation and run
graph_params = []
run_params = []
# Paths
env_algo_path = os.path.join(f"{ROOT_DIR}/data", env_name, algo_name + "_trained")
wandb_params_path = os.path.join(
    env_algo_path, "wandb_params.json"
)
log_dir = os.path.join(env_algo_path, "log")
# Session
step = 0

Create overriding functions and load graph and run parameters

In [None]:
graph_params_override = functools.partial(override_params, params=graph_params)
run_params_override = functools.partial(override_params, params=run_params)
run_params, graph_params, _ = load_params(env_algo_path)
if graph_params_override:
    graph_params_override(graph_params)
if run_params_override:
    run_params_override(run_params)

Load wandb parameters

In [None]:
with open(wandb_params_path) as json_file:
    wandb_params = json.load(json_file)

Create config that can be logged to wandb

In [None]:
config_to_log = create_wandb_config(
    env_algo_path,
    override_graph_params=graph_params_override,
    override_run_params=run_params_override,
)

In [None]:
config_to_log

Init wandb run

In [None]:
!wandb login

In [None]:
!wandb online  # Remove this if not debugging

In [None]:
run_wandb = wandb.init(
    dir=os.path.join(ROOT_DIR),
    project=wandb_params["project"],
    entity=wandb_params["entity"],
    name=create_wandb_name(wandb_params["name"], config_to_log),
    group=wandb_params["group"],
    tags=wandb_params["tags"] if "tags" in wandb_params else [],
    sync_tensorboard=wandb_params["sync_tensorboard"],
    monitor_gym=wandb_params["monitor_gym"],
    config=config_to_log,
    save_code=wandb_params["save_code"],
)
wandb_params["run_id"] = run_wandb.id
run_wandb.log_code()

## Session

### Environment and Graph setup
Seed Python, NumPy and PyTorch

In [None]:
seed = run_params["seed"]
seed_everything(seed)
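
To make the seeding step less of a black box, here is a minimal sketch of what such a helper could look like. This is not the project's `seed_everything`; the name `seed_everything_sketch` is hypothetical, and PyTorch is seeded only if it happens to be installed.

```python
import random

import numpy as np


def seed_everything_sketch(seed: int) -> None:
    """Hypothetical stand-in for the project's seed_everything helper."""
    random.seed(seed)  # Python's built-in RNG
    np.random.seed(seed)  # NumPy's legacy global RNG
    try:
        import torch  # optional dependency in this sketch
    except ImportError:
        return
    torch.manual_seed(seed)  # PyTorch RNG


# Seeding twice with the same value should reproduce the same draws.
seed_everything_sketch(0)
first = (random.random(), np.random.rand())
seed_everything_sketch(0)
second = (random.random(), np.random.rand())
print(first == second)
```

If the two tuples ever differ, some generator is not covered by the seeding helper.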

Create environment and graph

In [None]:
run_params

In [None]:
env, graph = get_env_and_graph(run_params, graph_params, wandb_params)

Environment information

In [None]:
env.action_space

In [None]:
env.reset()

Graph information

In [None]:
graph

### Training session

In [None]:
sess_props = run_session(
    env_algo_path,
    graph,
    env,
    run_params,
    step,
    wandb_params=wandb_params,
    run_wandb=run_wandb,
    graph_params=graph_params,
)

In [None]:
print(run_wandb.url)
IFrame(run_wandb.url, width="100%", height=720)

# Flat Agent
First, we will implement the flat agent, which is based on SAC. Being flat, the agent is composed of a single node.

## Bug #1 - wrong alpha in SAC
Based on [SpinningUp blog by OpenAI](https://spinningup.openai.com/en/latest/algorithms/sac.html):
> SAC trains a stochastic policy with entropy regularization, and explores in an on-policy way. The entropy regularization coefficient $\alpha$ explicitly controls the explore-exploit tradeoff, with higher $\alpha$ corresponding to more exploration, and lower $\alpha$ corresponding to more exploitation. The right coefficient (the one which leads to the stablest / highest-reward learning) may vary from environment to environment, and could require careful tuning.

Thus, we should first find an $\alpha$ that gives good enough exploration in the flat agent, so that we can later examine the benefits of hierarchical reinforcement learning.
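
To see where $\alpha$ enters the computation, here is a minimal NumPy sketch of the entropy-regularized TD target that SAC critics are trained on, following the formulation in the Spinning Up description. The function name and the toy numbers are made up for illustration and are not taken from the Graph RL code.

```python
import numpy as np


def soft_td_target(reward, done, q1_next, q2_next, logp_next, alpha, gamma=0.99):
    """Entropy-regularized TD target for SAC critics (illustrative sketch)."""
    # Clipped double-Q value of the next action plus the entropy bonus.
    soft_value = np.minimum(q1_next, q2_next) - alpha * logp_next
    return reward + gamma * (1.0 - done) * soft_value


# Toy numbers (made up): an unlikely next action has a low log-probability,
# so a larger alpha rewards it more strongly.
for alpha in (0.0, 0.2, 1.0):
    y = soft_td_target(
        reward=1.0, done=0.0, q1_next=5.0, q2_next=4.5, logp_next=-2.0, alpha=alpha
    )
    print(f"alpha={alpha:.1f} -> target={y:.3f}")
```

With too large an $\alpha$ the entropy bonus can dominate the reward signal, while with too small an $\alpha$ the policy may stop exploring early.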

In [None]:
group_url = f"https://wandb.ai/piotrczernecki/herald-hits/groups/DSS-flat/workspace?workspace=user-michalbortkiewicz"
print(group_url)
IFrame(src=group_url, width="100%", height=720)

To examine the results more closely, we can use the wandb API to retrieve the results and perform further analysis:

In [None]:
run_id='obv9hhbo'

api = wandb.Api()
entity, project = "piotrczernecki", "herald-hits"  # set to your entity and project
runs = api.runs(entity + "/" + project)
run = api.run(f"{entity}/{project}/{run_id}")

In [None]:
history = run.scan_history()
entropy_key = "hac_node_layer_0_algorithm/train/entropy"
rows = []
for i, row in enumerate(history):
    # Skip rows without the entropy metric and the first 10,000 logged rows.
    if entropy_key not in row or i <= 10000:
        continue
    rows.append({k: row[k] for k in ("global_step", entropy_key) if k in row})
    # Stop once the window of interest (up to step 30,000) is covered.
    if row["global_step"] >= 30000:
        break
df = pd.DataFrame(rows)
df[entropy_key]

In [None]:
plt.plot(df['global_step'], df['hac_node_layer_0_algorithm/train/entropy'])

### Solution: Start with the most suspicious plot of logged metrics
In the case of SAC, the most insightful plots are *entropy* and *q-value*. However, proceed to them only after you have verified that the returns are calculated properly and that the loss is correct.
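
As an example of the kind of check meant here, the sketch below compares a discounted-return computation against a case small enough to verify by hand. The helper `discounted_returns` is hypothetical and only illustrates the idea; the actual training code may compute its targets differently.

```python
import numpy as np


def discounted_returns(rewards, gamma):
    """Discounted return from every time step, computed backwards in time."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hand-checkable case: with gamma=0.5 and rewards [1, 1, 1]
# the returns are [1 + 0.5 + 0.25, 1 + 0.5, 1].
assert np.allclose(discounted_returns([1.0, 1.0, 1.0], gamma=0.5), [1.75, 1.5, 1.0])
```

A few assertions like this are cheap and rule out a whole class of bugs before any plot is inspected.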

# Hierarchical Agent
The hierarchical agent used in this tutorial is composed of two levels, i.e. two nodes, each corresponding to a policy with a different temporal resolution. The main goal is to achieve temporal abstraction in the higher-level policy that enables the agent to explore the environment efficiently and converge faster than the flat agent.

## Bug #2 - weak temporal abstraction due to wrong budget of HL actions
In HiTS, there is a budget: the maximum number of higher-level (HL) actions that can be performed during an episode. If the agent runs out of budget, the episode is terminated and an additional penalty is given for using up all available HL actions. Even though the budget mechanism is not perfect (ideally, an HRL algorithm should not need a budget at all), it is crucial to tune it separately for every environment to achieve decent performance in HiTS.
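
The budget mechanism can be pictured with the toy loop below. It is only a sketch under stated assumptions: the function `run_hl_episode`, the `hl_step` interface and the penalty value are hypothetical and do not come from the HiTS implementation.

```python
from typing import Callable, Tuple


def run_hl_episode(
    hl_step: Callable[[], Tuple[float, bool]],
    budget: int,
    budget_penalty: float = -1.0,  # assumed value, for illustration only
) -> Tuple[float, int, bool]:
    """Run HL actions until success or until the budget is exhausted.

    Returns (total reward, HL actions used, whether the budget ran out).
    """
    total_reward = 0.0
    for used in range(1, budget + 1):
        reward, done = hl_step()
        total_reward += reward
        if done:
            return total_reward, used, False
    # Budget exhausted: terminate the episode and apply the extra penalty.
    return total_reward + budget_penalty, budget, True


def make_toy_task(actions_needed: int) -> Callable[[], Tuple[float, bool]]:
    """Toy sparse-reward task that succeeds on the n-th HL action."""
    calls = {"n": 0}

    def hl_step() -> Tuple[float, bool]:
        calls["n"] += 1
        done = calls["n"] >= actions_needed
        return (1.0 if done else 0.0), done

    return hl_step


print(run_hl_episode(make_toy_task(3), budget=5))  # generous budget
print(run_hl_episode(make_toy_task(3), budget=2))  # budget too tight
```

A generous budget lets the agent succeed with many short HL actions, whereas a tight one forces each HL action to cover more ground.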

In [None]:
algo_name = "HiTS"
env_algo_path = os.path.join(f"{ROOT_DIR}/data", env_name, algo_name + "_trained")
run_params, graph_params, _ = load_params(env_algo_path)
env, graph = get_env_and_graph(run_params, graph_params, wandb_params)

In [None]:
graph

In [None]:
for node in graph._nodes:
    print(f"NODE:\n{node}\nACTION SPACE:\n{node.policy.action_space}\n")

In [None]:
group_url = f"https://wandb.ai/piotrczernecki/herald-hits/groups/DSS-hierarchical-budget/workspace?workspace=user-michalbortkiewicz"
print(group_url)
IFrame(src=group_url, width="100%", height=720)

### Solution: Set a challenging budget for HL policy
Even though the budget is far from a perfect solution and should be abandoned in future work, it is the hyperparameter that most strongly widens temporal abstraction. Thus, it should be set tight enough that achieving success in a sparse-reward environment is challenging for the agent.

## Bug #3 - incorrect reproducibility of experiments due to hidden random number generation or numerical error
The bug sounds obvious; however, because of the complex logic in hierarchical methods, it may be hard to locate. Some algorithms may call the random number generator more often than others. Thus, runs of almost identical experiment setups may differ significantly due to the chaotic behavior of the RL training algorithm.

### Hidden random number generator calls

In [None]:
F = TypeVar('F', bound=Callable[..., Any])

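def reset_seed(func: F) -> F:
    """Restore NumPy's global RNG state after `func` returns, hiding its random calls."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def new_func(*args, **kwargs):
        st0 = np.random.get_state()  # snapshot the global RNG state
        try:
            return func(*args, **kwargs)
        finally:
            np.random.set_state(st0)  # restore it even if func raises
    return cast(F, new_func)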

In [None]:
def random_outer1() -> Tuple[Union[float, None], float]:
    outer = np.random.randn()
    return None, outer

def random_inner2() -> float:
    return np.random.randn()

def random_outer2() -> Tuple[float, float]:
    inner = random_inner2()
    outer = np.random.randn()
    return inner, outer

@reset_seed
def random_inner3() -> float:
    return np.random.randn()

def random_outer3() -> Tuple[float, float]:
    inner = random_inner3()
    outer = np.random.randn()
    return inner, outer

In [None]:
results_outer = {}
for func in (random_outer1, random_outer2, random_outer3):
    np.random.seed(10)  # same seed for every variant
    results_outer[func.__name__] = {}
    for j in range(10):
        inner, outer = func()
        results_outer[func.__name__][j] = outer

In [None]:
df = pd.DataFrame(results_outer)
df["r_o1==r_o2"] = df["random_outer1"]==df["random_outer2"]
df["r_o1==r_o3"] = df["random_outer1"]==df["random_outer3"]
df

### Numerical error

In [None]:
group_url = f"https://wandb.ai/piotrczernecki/herald-hits/groups/22-v1-reproducibility/workspace?workspace=user-michalbortkiewicz"
print(group_url)
IFrame(src=group_url, width="100%", height=720)

### Solution: To obtain high reproducibility initialize additional random generators for auxiliary functions
It is good practice to build new methods incrementally. However, whenever new components are added, make sure they are decoupled from the original method so that the original and modified methods can be compared under the same seed.
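
One way to achieve this decoupling, as an alternative to saving and restoring the global state with `reset_seed`, is to give auxiliary code its own generator. The sketch below assumes NumPy's `default_rng`; the function names and the seed value 123 are made up for illustration.

```python
import numpy as np

# Dedicated generator for auxiliary code (hypothetical name and seed).
aux_rng = np.random.default_rng(seed=123)


def auxiliary_noise() -> float:
    """New component: draws from its own generator, not the global one."""
    return float(aux_rng.standard_normal())


def baseline_step() -> float:
    """Original method: draws from NumPy's global generator."""
    return float(np.random.randn())


def modified_step() -> float:
    """Modified method: calls the new component, then behaves as before."""
    auxiliary_noise()
    return float(np.random.randn())


np.random.seed(10)
baseline = [baseline_step() for _ in range(5)]
np.random.seed(10)
modified = [modified_step() for _ in range(5)]
print(baseline == modified)
```

Because the auxiliary draws never touch the global stream, both variants see identical random numbers for the same seed.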

In addition, numerical error may accumulate during a run, for instance due to different hardware, CUDA setup or package versions. As a result, identical experiment configurations can yield completely different results for the SAME random seed!
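
One source of such differences is that floating-point addition is not associative, so summing the same numbers in a different order can change the last bits of the result. The snippet below is a plain Python illustration of that effect and is not tied to any particular hardware.

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # one summation order
right = a + (b + c)  # the same numbers, grouped differently

print(left == right)
print(abs(left - right))
```

The discrepancy is tiny, but over many training updates such differences can be amplified until two runs diverge visibly.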

# EAT: Emergency action termination for immediate reaction in hierarchical reinforcement learning
Together with Jakub Łyskawa, Paweł Wawrzyński, Mateusz Ostaszewski, Artur Grudkowski and Tomasz Trzciński, we submitted EAT to the AAMAS 2023 conference. Our main contributions are:
- We introduce a method, EAT, of monitoring and possibly terminating higher level actions in hierarchical RL. This method allows a hierarchical policy to immediately react to random events in the environment.
- We design two strategies for monitoring and terminating the higher level actions.
- We introduce a framework for hierarchical decomposition of Markov Decision Processes into subprocesses in which rewards for future events are discounted over the time elapsed until their occurrence rather than over the number of actions until their occurrence.
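
The difference between the two ways of discounting in the last point can be shown with a small numeric sketch. The durations and the discount factor below are made-up values, and the snippet illustrates only the general idea, not the formal framework from the paper.

```python
gamma = 0.99
durations = [3, 5, 2]  # hypothetical: environment steps taken by three HL actions

# Discounting over the number of HL actions until the event.
per_action_discount = gamma ** len(durations)

# Discounting over the elapsed time (environment steps) until the event.
per_time_discount = gamma ** sum(durations)

print(f"per action: {per_action_discount:.4f}")
print(f"per time:   {per_time_discount:.4f}")
```

With time-based discounting, the value of a future event does not depend on how many HL actions the elapsed time is split into.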

EAT introduces higher-level action interruption based on a heuristic (evaluated at every environment step) that rejects the continuation of the current (obsolete) action according to one of two approaches:
- *Changing Q*. In this strategy, we terminate the current action if it seems to be worse than the proposed alternative.
- *Changing target*. In this strategy, we terminate the current action if it significantly differs from the proposed one.
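
The two strategies can be sketched as simple predicates. The exact criteria used in EAT are not given here, so the margin, the threshold and the use of Euclidean distance below are assumptions made for illustration, and both function names are hypothetical.

```python
import numpy as np


def should_terminate_changing_q(q_current: float, q_proposed: float, margin: float = 0.0) -> bool:
    """Changing Q: terminate if the proposed action looks better by more than `margin`."""
    return bool(q_proposed > q_current + margin)


def should_terminate_changing_target(current_action, proposed_action, threshold: float) -> bool:
    """Changing target: terminate if the proposed action is far from the current one."""
    difference = np.asarray(proposed_action, dtype=float) - np.asarray(current_action, dtype=float)
    return bool(np.linalg.norm(difference) > threshold)


# Toy values (made up): a clearly better alternative and a distant new target.
print(should_terminate_changing_q(q_current=1.0, q_proposed=1.5))
print(should_terminate_changing_target([0.0, 0.0], [3.0, 4.0], threshold=1.0))
```

In both cases the check would run at every environment step, comparing the HL action in progress with the one the HL policy would propose now.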

![idea](idea.png)

We found that this action interruption mechanism does not decrease the performance of the hierarchical agent in regular environments. At the same time, EAT significantly improves the success rate in environments with a certain kind of noise, where immediate reaction is necessary to avoid emerging obstacles.

EAT in action is presented below, in the modified Platforms environment in which every platform might be frozen with a particular probability at every time step.
![gif](nb.gif)

In [None]:
group_url = f"https://wandb.ai/piotrczernecki/herald-hits/groups/AAMAS-q-noisy-platforms/workspace?workspace=user-michalbortkiewicz"
print(group_url)
IFrame(src=group_url, width="100%", height=720)

## Using wandb for ablation studies and further artifacts analysis

One can easily import a model from wandb for further analysis or training using the wandb API as follows:

In [None]:
api = wandb.Api()
artifact = api.artifact('piotrczernecki/herald-hits/HAC_obv9hhbo:v0', type='model')
artifact_dir = artifact.download()
os.listdir(artifact_dir)

In [None]:
model_path = os.path.join(artifact_dir, os.listdir(artifact_dir)[0])
graph.load_parameters(model_path)

Now, we can continue training from the loaded checkpoint.

In [None]:
sess_props = run_session(
    env_algo_path,
    graph,
    env,
    run_params,
    step,
    wandb_params=wandb_params,
    run_wandb=run_wandb,
    graph_params=graph_params,
)