# **Explainable Reinforcement Learning Tutorial**

Welcome to this tutorial on **Explainable Reinforcement Learning (XRL)**! In this guide, we'll explore how to interpret and explain the decisions made by reinforcement learning agents using the SHAP (SHapley Additive exPlanations) library. We'll work through a practical example involving a simulation in a reinforcement learning setting and demonstrate how to compute and visualize feature attributions for the agent's actions.

## **Table of Contents**

1. [Introduction](#introduction)

    1.1. [Multi-Agent Deep Reinforcement Learning](#11-multi-agent-deep-reinforcement-learning)

    1.2. Prerequisites

2. [Explainable AI and SHAP Values](#2-explainable-ai-and-shap-values)

    2.1 Understanding Explainable AI 

    2.2 Introduction to SHAP Values 

3. [Calculating SHAP values](#3-calculating-shap-values)

    3.1. [Loading and Preparing Data](#loading-and-preparing-data)

    3.2. [Creating a SHAP Explainer](#32-creating-a-shap-explainer)
    
4. [Visualizing SHAP Values](#visualizing-shap-values)
5. [Conclusion](#conclusion)
6. [Additional Resources](#additional-resources)

## 1. Introduction <a name="introduction"></a>

Reinforcement Learning (RL) has achieved remarkable success in various domains, such as game playing, robotics, and autonomous systems. However, RL models, particularly those using deep neural networks, are often seen as black boxes due to their complex architectures and non-linear computations. This opacity poses challenges in understanding and trusting the decisions made by RL agents, especially in critical applications.

**Explainable Reinforcement Learning (XRL)** aims to bridge this gap by providing insights into the agent's decision-making process. By leveraging explainability techniques, we can interpret the actions of an RL agent, understand the influence of input features, and potentially improve the model's performance and fairness.

In this tutorial, we will demonstrate how to apply SHAP values to a trained actor neural network within an RL framework to explain the agent's actions.

### 1.1 Multi-Agent Deep Reinforcement Learning <a name="MARL"></a>

In ASSUME, we implement RL agents using a Multi-Agent Deep Reinforcement Learning (MADRL) approach. Key aspects include:


- **Observations**: Each agent receives observations comprising market forecasts, unit-specific information, and past actions.
- **Actions**: Agents decide on bidding strategies, such as bid prices for inflexible and flexible capacities.
- **Rewards**: Agents receive rewards based on profits and opportunity costs, guiding them to learn optimal bidding strategies.
- **Algorithm**: We utilize a multi-agent version of the TD3 algorithm, ensuring stable learning in a non-stationary environment.

For a deep dive into the RL configuration, see one of the other tutorials, such as the
[Deep Reinforcement Learning Tutorial](https://example.com/deep-rl-tutorial).

Agents need observations to make informed decisions. Observations include:

- **Residual Load Forecast**: Forecasted net demand over the next 24 hours.
- **Price Forecast**: Forecasted market prices over the next 24 hours.
- **Marginal Cost**: Current marginal cost of the unit.
- **Previous Output**: Dispatched capacity from the previous time step.


Agents choose actions based on the observations. The action space is two-dimensional, corresponding to:

- Bid Price for Inflexible Capacity (p_inflex): The price at which the agent offers its minimum power output (must-run capacity) to the market.
- Bid Price for Flexible Capacity (p_flex): The price for the additional capacity above the minimum output that the agent can flexibly adjust.


#### Run a MADRL Simulation

As in the other tutorials, we can run ASSUME as follows.

In [None]:
#!pip install 'assume-framework[learning]'
#!git clone https://github.com/assume-framework/assume.git assume-repo

In [None]:
import importlib.util

# Check if 'google.colab' is available
IN_COLAB = importlib.util.find_spec("google.colab") is not None

colab_inputs_path = "assume-repo/examples/inputs"
local_inputs_path = "../inputs"

inputs_path = colab_inputs_path if IN_COLAB else local_inputs_path

For XRL, we need enhanced logging of the learning process, which is not currently a feature of ASSUME itself. Therefore, we are overriding some functions to enable this logging specifically for the purpose of this tutorial.

In [None]:
# @title Overwrite run_learning function with enhanced logging

import json
import logging
import os
from collections import defaultdict
from pathlib import Path

import numpy as np
import yaml
from tqdm import tqdm

from assume.common.exceptions import AssumeException
from assume.scenario.loader_csv import (
    load_config_and_create_forecaster,
    setup_world,
)
from assume.world import World

logger = logging.getLogger(__name__)


def run_learning(
    world: World,
    inputs_path: str,
    scenario: str,
    study_case: str,
    verbose: bool = False,
) -> None:
    """
    Train Deep Reinforcement Learning (DRL) agents to act in a simulated market environment.

    This function runs multiple episodes of simulation to train DRL agents, performs evaluation, and saves the best runs. It maintains the buffer and learned agents in memory to avoid resetting them with each new run.

    Args:
        world (World): An instance of the World class representing the simulation environment.
        inputs_path (str): The path to the folder containing input files necessary for the simulation.
        scenario (str): The name of the scenario for the simulation.
        study_case (str): The specific study case for the simulation.
        verbose (bool): If False, the logger level is raised to WARNING to reduce output. Defaults to False.

    Note:
        - The function uses a ReplayBuffer to store experiences for training the DRL agents.
        - It iterates through training episodes, updating the agents and evaluating their performance at regular intervals.
        - Initial exploration is active at the beginning and is disabled after a certain number of episodes to improve the performance of DRL algorithms.
        - Upon completion of training, the function performs an evaluation run using the best policy learned during training.
        - The best policies are chosen based on the average reward obtained during the evaluation runs, and they are saved for future use.
    """
    from assume.reinforcement_learning.buffer import ReplayBuffer

    if not verbose:
        logger.setLevel(logging.WARNING)

    # remove csv path so that nothing is written while learning
    temp_csv_path = world.export_csv_path
    world.export_csv_path = ""

    # initialize policies already here to set the obs_dim and act_dim in the learning role
    actors_and_critics = None
    world.learning_role.initialize_policy(actors_and_critics=actors_and_critics)
    world.output_role.del_similar_runs()

    # check if we already stored policies for this simulation
    save_path = world.learning_config["trained_policies_save_path"]

    if Path(save_path).is_dir():
        # we are in learning mode and about to train new policies, which might overwrite existing ones
        accept = input(
            f"{save_path=} exists - should we overwrite current learnings? (y/N) "
        )
        if not accept.lower().startswith("y"):
            # stop here - do not start learning or save anything
            raise AssumeException("don't overwrite existing strategies")

    # -----------------------------------------
    # Load scenario data to reuse across episodes
    scenario_data = load_config_and_create_forecaster(inputs_path, scenario, study_case)

    # -----------------------------------------
    # Information that needs to be stored across episodes, aka one simulation run
    inter_episodic_data = {
        "buffer": ReplayBuffer(
            buffer_size=int(world.learning_config.get("replay_buffer_size", 5e5)),
            obs_dim=world.learning_role.rl_algorithm.obs_dim,
            act_dim=world.learning_role.rl_algorithm.act_dim,
            n_rl_units=len(world.learning_role.rl_strats),
            device=world.learning_role.device,
            float_type=world.learning_role.float_type,
        ),
        "actors_and_critics": None,
        "max_eval": defaultdict(lambda: -1e9),
        "all_eval": defaultdict(list),
        "avg_all_eval": [],
        "episodes_done": 0,
        "eval_episodes_done": 0,
        "noise_scale": world.learning_config.get("noise_scale", 1.0),
    }

    # -----------------------------------------

    validation_interval = min(
        world.learning_role.training_episodes,
        world.learning_config.get("validation_episodes_interval", 5),
    )

    eval_episode = 1

    for episode in tqdm(
        range(1, world.learning_role.training_episodes + 1),
        desc="Training Episodes",
    ):
        # TODO normally, loading twice should not create issues, somehow a scheduling issue is raised currently
        if episode != 1:
            setup_world(
                world=world,
                scenario_data=scenario_data,
                study_case=study_case,
                episode=episode,
            )

        # -----------------------------------------
        # Give the newly initialized learning role the needed information across episodes
        world.learning_role.load_inter_episodic_data(inter_episodic_data)

        world.run()

        # -----------------------------------------
        # Store updated information across episodes
        inter_episodic_data = world.learning_role.get_inter_episodic_data()
        inter_episodic_data["episodes_done"] = episode

        # evaluation run:
        if (
            episode % validation_interval == 0
            and episode
            >= world.learning_role.episodes_collecting_initial_experience
            + validation_interval
        ):
            world.reset()

            # load evaluation run
            setup_world(
                world=world,
                scenario_data=scenario_data,
                study_case=study_case,
                perform_evaluation=True,
                eval_episode=eval_episode,
            )

            world.learning_role.load_inter_episodic_data(inter_episodic_data)

            world.run()

            total_rewards = world.output_role.get_sum_reward()
            avg_reward = np.mean(total_rewards)
            # check reward improvement in evaluation run
            # and store best run in eval folder
            terminate = world.learning_role.compare_and_save_policies(
                {"avg_reward": avg_reward}
            )

            inter_episodic_data["eval_episodes_done"] = eval_episode

            # if we have not improved in the last x evaluations, we stop loop
            if terminate:
                break

            eval_episode += 1

        world.reset()

        # if at end of simulation save last policies
        if episode == (world.learning_role.training_episodes):
            world.learning_role.rl_algorithm.save_params(
                directory=f"{world.learning_role.trained_policies_save_path}/last_policies"
            )

            # export buffer_obs.json in the last training episode to get observations later
            export = inter_episodic_data["buffer"].observations.tolist()
            path = f"{world.learning_role.trained_policies_save_path}/buffer_obs"
            os.makedirs(path, exist_ok=True)
            with open(os.path.join(path, "buffer_obs.json"), "w") as f:
                json.dump(export, f)

        # container shutdown implicitly with new initialisation
    logger.info("################")
    logger.info("Training finished, Start evaluation run")
    world.export_csv_path = temp_csv_path

    world.reset()

    # load scenario for evaluation
    setup_world(
        world=world,
        scenario_data=scenario_data,
        study_case=study_case,
        terminate_learning=True,
    )

    world.learning_role.load_inter_episodic_data(inter_episodic_data)

In [None]:
#!cd assume-repo && assume -s example_02a -db "sqlite:///./examples/local_db/assume_db_example_02a.db"

### 1.2. Prerequisites

To follow along with this tutorial, we need some additional libraries.

- `matplotlib`
- `shap`
- `scikit-learn`

In [None]:
!pip install matplotlib
!pip install shap==0.42.1
!pip install scikit-learn==1.3.0

## 2. Explainable AI and SHAP Values <a name="explainable-ai-and-shap-values"></a>

### 2.1 Understanding Explainable AI 
Explainable AI (XAI) refers to techniques and methods that make the behavior and decisions of AI systems understandable to humans. In the context of complex models like deep neural networks, XAI helps to:
- Increase Transparency: Providing insights into how models make decisions.
- Build Trust: Users and stakeholders can trust AI systems if they understand them.
- Ensure Compliance: Regulatory requirements often demand explainability.
- Improve Models: Identifying weaknesses or biases in models.


### 2.2 Introduction to SHAP Values 
Shapley values are a method from cooperative game theory used to explain the contribution of each feature to the prediction of a machine learning model, such as a neural network. They provide an interpretability technique by distributing the "payout" (the prediction) among the input features, attributing the importance of each feature to the prediction.

For a given prediction, the Shapley value of a feature represents the average contribution of that feature to the prediction, considering all possible combinations of other features.

1. **Marginal Contribution**: 
   The marginal contribution of a feature is the difference between the prediction with and without that feature.

2. **Average over all subsets**: 
   The Shapley value is calculated by averaging the marginal contributions over all possible subsets of features.

The formula for the Shapley value of feature $i$ is:

$$
\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!(|N| - |S| - 1)!}{|N|!} \cdot \left( f(S \cup \{i\}) - f(S) \right)
$$

Where:
- $N$ is the set of all features.
- $S$ is a subset of features.
- $f(S)$ is the model’s prediction when using only the features in subset $S$.
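
As a worked illustration of the formula, consider a model with only two features, $N = \{1, 2\}$. Feature 1 can join two coalitions, $S = \emptyset$ and $S = \{2\}$, and both receive the weight $\frac{0! \, 1!}{2!} = \frac{1! \, 0!}{2!} = \frac{1}{2}$:

$$
\phi_1 = \frac{1}{2} \left( f(\{1\}) - f(\emptyset) \right) + \frac{1}{2} \left( f(\{1, 2\}) - f(\{2\}) \right)
$$

By symmetry, $\phi_2 = \frac{1}{2} \left( f(\{2\}) - f(\emptyset) \right) + \frac{1}{2} \left( f(\{1, 2\}) - f(\{1\}) \right)$, so the two values add up to $\phi_1 + \phi_2 = f(\{1, 2\}) - f(\emptyset)$: the full prediction minus the prediction without any features.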


The `shap` library is a popular tool for computing Shapley values for machine learning models, including neural networks.



**Why Use SHAP in RL?**

- Model-Agnostic: Kernel SHAP, which we use in this tutorial, treats the model as a black box and therefore applies to any machine learning model, including neural networks.
- Local Explanations: Provides explanations for individual predictions (actions).
- Consistency: A feature that contributes more to the prediction never receives a smaller Shapley value as a result.


**Properties of SHAP:**

1. Local Accuracy: The sum of the Shapley values equals the difference between the model output for the explained input and the expected model output.
2. Missingness: A feature that is missing from the input receives a Shapley value of zero.
3. Consistency: If a model changes so that a feature contributes more to the prediction, the Shapley value of that feature should not decrease.
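
The formula and the local accuracy property can be checked by brute force on a small example. The sketch below is illustrative only: the toy `model`, the three-row `background`, and the helper names are assumptions made for this example, not part of ASSUME or `shap`. Because a trained model cannot simply drop an input, $f(S)$ is evaluated here by keeping the features in $S$ from the explained input, filling the remaining features from each background row, and averaging the predictions.

```python
from itertools import combinations
from math import factorial


def model(x):
    """Toy stand-in for the actor network: an interaction term plus a linear term."""
    return x[0] * x[1] + 2.0 * x[2]


def coalition_value(x, subset, background):
    """f(S): keep features in `subset` from x, fill the rest from the background, average."""
    total = 0.0
    for row in background:
        mixed = [x[j] if j in subset else row[j] for j in range(len(x))]
        total += model(mixed)
    return total / len(background)


def shapley_values(x, background):
    """Exact Shapley values by enumerating every subset S that excludes feature i."""
    n = len(x)
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        value = 0.0
        for size in range(n):
            # weight |S|! (|N| - |S| - 1)! / |N|! from the formula above
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = coalition_value(x, set(subset) | {i}, background)
                without_i = coalition_value(x, set(subset), background)
                value += weight * (with_i - without_i)
        phi.append(value)
    return phi


# Hypothetical data: three background rows and one input to explain
background = [[0.0, 1.0, 2.0], [2.0, 3.0, 0.0], [1.0, -1.0, 1.0]]
x = [3.0, 2.0, 1.0]

phi = shapley_values(x, background)
expected_output = coalition_value(x, set(), background)  # mean prediction over the background

print("Shapley values:", phi)
print("Sum of Shapley values:", sum(phi))
print("Model output minus expected output:", model(x) - expected_output)
```

The last two printed numbers agree, which is the local accuracy property. The enumeration visits every subset, so its cost grows exponentially with the number of features; it is only practical for a handful of inputs.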

## 3. Calculating SHAP values <a name="calculating-shap-values"></a>

We will work with:

- **Observations (`input_data`)**: These are the inputs to our actor neural network, representing the state of the environment.
- **Trained Actor Model**: A neural network representing the decision making of one RL power plant that outputs actions based on the observations.

Our goal is to:

- Load the observations and the trained actor model.
- Use the model to predict actions.
- Apply SHAP to explain the model's predictions.

### 3.1. Loading and Preparing Data <a name="loading-and-preparing-data"></a>

First, let's load the necessary libraries and the data.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import shap
import torch as th
from sklearn.model_selection import train_test_split

# common.py contains utility functions and class definitions
from common import load_observations, Actor

**Define the Actor Neural Network Class**

We define the actor neural network class that will be used to predict actions based on observations.

In [None]:
from assume.reinforcement_learning.neural_network_architecture import MLPActor

In [None]:
def load_config(file_path):
    """
    Load the configuration file.
    """
    with open(file_path) as file:
        config = yaml.safe_load(file)
    return config


# Some Variable definitions:

EPISODES = 3
NUMBER_OF_AGENTS = 1
SIM_TIMESPAN_DAYS = 31
ACTOR_NUM = 1
EXAMPLE = 1

SIM_TIMESPAN_HOURS = SIM_TIMESPAN_DAYS * 24

# actor 1-5 are the default non-rl actors, so we just skip those
ACTOR_NUM_ADJ = ACTOR_NUM + 6  # 6 #9


# Get the current working directory
current_dir = os.getcwd()
# Go up one level
one_level_up = os.path.dirname(current_dir)
# Go up two levels
two_levels_up = os.path.dirname(one_level_up)

# Paths
path = os.path.join(
    two_levels_up,
    f"assume/examples/output/{EXAMPLE}/{EPISODES}_episodes_{SIM_TIMESPAN_DAYS}_simDays_{NUMBER_OF_AGENTS}_rlAgents",
)
actor_path = os.path.join(path, f"actor_pp_{ACTOR_NUM_ADJ}.pt")

# DEFINITIONS

We define a utility function to load the observations from a specified path. Analyzing the SHAP values for all observations and all parameters would make this notebook quite lengthy, so we filter the observation data frame to include only the first 700 observations.

In [None]:
# @title Load observations function


def load_observations(path, ACTOR_NUM, feature_names):
    # Load observations
    obs_path = f"{path}/buffer_obs.json"

    with open(obs_path) as file:
        json_data = json.load(file)

    # Convert the list of lists into a 2D numpy array
    input_data = np.array(json_data)
    input_data = np.squeeze(input_data)

    # filter out rows where all values are 0
    input_data = input_data[~np.all(input_data == 0, axis=1)]

    # filter only first 700 observations
    input_data = input_data[:700]

    if NUMBER_OF_AGENTS == 1:
        return pd.DataFrame(input_data, columns=feature_names), input_data
    else:
        return pd.DataFrame(
            input_data[:, ACTOR_NUM], columns=feature_names
        ), input_data[:, ACTOR_NUM]

**Define Paths and Parameters**

Adjust the following paths and parameters according to your data and model.

In [None]:
path = (
    inputs_path + "/example_02a/learned_strategies/base/buffer_obs"
)  # Replace with your data path

In [None]:
# Define feature names (replace with actual feature names)
# make columns names
names_1 = ["price forecast t+" + str(x) for x in range(1, 25)]
names_2 = ["residual load forecast t+" + str(x) for x in range(1, 25)]
feature_names = names_1 + names_2 + ["total capacity t-1"] + ["marginal costs t-1"]

**Load Observations and Input Data**

Load the observations and input data using the utility function.

In [None]:
df_obs, input_data = load_observations(path, ACTOR_NUM, feature_names)

df_obs

**Load the Trained Actor Model**

We initialize and load the trained actor neural network.

In [None]:
# Initialize the model
obs_dim = len(feature_names)
act_dim = 2  # Adjust if your model outputs a different number of actions
model = MLPActor(obs_dim=obs_dim, act_dim=act_dim, float_type=th.float)

In [None]:
ACTOR_NUM = 1  # Replace with the actor number or identifier
actor_path = (
    inputs_path
    + "/example_02a/learned_strategies/base/last_policies/actors/actor_pp_6.pt"
)  # Path to the trained actor model

# Load the trained model parameters
model_state = th.load(actor_path, map_location=th.device("cpu"))
model.load_state_dict(model_state["actor"])

Get the actions based on the observations we just loaded.

In [None]:
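# Predict the action for each observation; no gradients are needed for inference
model.eval()
predictions = []
with th.no_grad():
    for obs in input_data:
        obs_tensor = th.tensor(obs, dtype=th.float)
        prediction = model(obs_tensor)
        predictions.append(prediction)
predictions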

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    input_data, predictions, test_size=0.15, random_state=42
)

In [None]:
# Convert data to tensors
y_train = th.stack(y_train)
y_test = th.stack(y_test)

X_train_tensor = th.tensor(X_train, dtype=th.float32)
# y_train and y_test are already tensors, so detach and cast them
# instead of re-wrapping them with th.tensor (which raises a UserWarning)
y_train_tensor = y_train.detach().to(th.float32)
X_test_tensor = th.tensor(X_test, dtype=th.float32)
y_test_tensor = y_test.detach().to(th.float32)

### 3.2. Creating a SHAP Explainer <a name="creating-a-shap-explainer"></a>

We define a prediction function compatible with SHAP and create a Kernel SHAP explainer.

In [None]:
# Define a prediction function for SHAP
def model_predict(X):
    X_tensor = th.tensor(X, dtype=th.float32)
    model.eval()
    with th.no_grad():
        return model(X_tensor).numpy()

In [None]:
# Use a subset of training data for the background dataset
background_size = 100  # Adjust the size as needed
background = X_train[:background_size]

In [None]:
# Create the SHAP Kernel Explainer
explainer = shap.KernelExplainer(model_predict, background)

In [None]:
# Calculate SHAP values for the test set
shap_values = explainer.shap_values(X_test)
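
To get a feel for why Kernel SHAP samples coalitions instead of enumerating them, the following back-of-the-envelope sketch compares the two. It assumes the `nsamples="auto"` default documented by `shap` (`2 * M + 2048` sampled coalitions for `M` features) and that each sampled coalition is evaluated once per background row. The helper `kernel_shap_cost` and the figure of 105 explained rows (15% of 700 observations) are illustrative assumptions, not part of the tutorial code.

```python
def kernel_shap_cost(n_features, background_size, n_explained_rows, nsamples=None):
    """Rough count of model evaluations for Kernel SHAP (illustrative estimate only)."""
    if nsamples is None:
        # assumed "auto" default of shap.KernelExplainer: 2 * M + 2048
        nsamples = 2 * n_features + 2048
    evals_per_row = nsamples * background_size
    return {
        "exact_coalitions": 2**n_features,  # subsets an exact computation would visit
        "sampled_coalitions": nsamples,
        "model_evals_per_row": evals_per_row,
        "model_evals_total": evals_per_row * n_explained_rows,
    }


cost = kernel_shap_cost(n_features=50, background_size=100, n_explained_rows=105)
for key, value in cost.items():
    print(f"{key}: {value:,}")
```

The estimate shows that runtime scales linearly with the background size and with the number of explained rows, so reducing `background_size` or explaining fewer rows are the simplest ways to speed up the cell above.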

In [None]:
shap_values

## 4. Visualizing SHAP Values <a name="visualizing-shap-values"></a>

We generate summary plots to visualize feature importance for each output dimension.

In [None]:
print(shap_values[0].shape)
print(X_test.shape)

In [None]:
# Summary plot for the first output dimension
shap.summary_plot(shap_values[0], X_test, feature_names=feature_names, show=False)
plt.title("Summary Plot for Output Dimension 0, p_inflex")
plt.show()

# Summary plot for the second output dimension
shap.summary_plot(shap_values[1], X_test, feature_names=feature_names, show=False)
plt.title("Summary Plot for Output Dimension 1, p_flex")
plt.show()

shap.summary_plot(
    shap_values[0],
    X_test,
    feature_names=feature_names,
    plot_type="bar",
    title="Summary Bar Plot for Output Dimension 0",
)

shap.summary_plot(
    shap_values[1],
    X_test,
    feature_names=feature_names,
    plot_type="bar",
    title="Summary Bar Plot for Output Dimension 1",
)

The SHAP summary plots show the impact of each feature on the model's predictions for each output dimension (action). Features with larger absolute SHAP values have a more significant influence on the decision-making process of the RL agent.

- **Positive SHAP Value**: Indicates that the feature contributes positively to the predicted action value.
- **Negative SHAP Value**: Indicates that the feature contributes negatively to the predicted action value.

By analyzing these plots, we can identify which features are most influential and understand how changes in feature values affect the agent's actions.
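
The bar-type summary plot ranks features by their mean absolute SHAP value across the explained samples. The sketch below reproduces that ranking by hand on made-up numbers; the matrix `toy_shap`, the three feature names, and the helper `rank_features` are hypothetical and only serve to show the calculation.

```python
def rank_features(shap_matrix, names):
    """Rank features by mean |SHAP| over samples (rows = samples, columns = features)."""
    n_samples = len(shap_matrix)
    importance = [
        sum(abs(row[j]) for row in shap_matrix) / n_samples
        for j in range(len(names))
    ]
    # most influential feature first
    return sorted(zip(names, importance), key=lambda pair: pair[1], reverse=True)


# Made-up SHAP values for three samples and three features
toy_shap = [
    [0.50, -0.10, 0.00],
    [-0.30, 0.20, 0.05],
    [0.40, -0.30, -0.05],
]
toy_names = ["price forecast t+1", "residual load forecast t+1", "marginal costs t-1"]

ranking = rank_features(toy_shap, toy_names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Taking the absolute value before averaging matters: a feature whose positive and negative contributions cancel out on average can still be highly influential for individual actions.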

## 5. Conclusion <a name="conclusion"></a>

In this tutorial, we've demonstrated how to apply SHAP to a reinforcement learning agent to explain its decision-making process. By interpreting the SHAP values, we gain valuable insights into which features influence the agent's actions, enhancing transparency and trust in the model.

Explainability is crucial, especially when deploying RL agents in real-world applications where understanding the rationale behind decisions is essential for safety, fairness, and compliance.

## 6. Additional Resources <a name="additional-resources"></a>

- **SHAP Documentation**: [https://shap.readthedocs.io/en/latest/](https://shap.readthedocs.io/en/latest/)
- **PyTorch Documentation**: [https://pytorch.org/docs/stable/index.html](https://pytorch.org/docs/stable/index.html)
- **Reinforcement Learning Introduction**: [Richard S. Sutton and Andrew G. Barto, "Reinforcement Learning: An Introduction"](http://incompleteideas.net/book/the-book-2nd.html)
- **Interpretable Machine Learning Book**: [https://christophm.github.io/interpretable-ml-book/](https://christophm.github.io/interpretable-ml-book/)

**Feel free to experiment with the code and explore different explainability techniques. Happy learning!**