# 4. Reinforcement learning tutorial

This tutorial introduces the MATD3 implementation in ASSUME and, with it, how we use reinforcement learning (RL). The main objective is to ensure that participants grasp the steps required to equip ASSUME with an RL algorithm. It therefore starts one level deeper than the RL_application example, and the knowledge from this tutorial is not required if the preconfigured algorithm in ASSUME is used as is. The algorithm explained here is usable as a plug-and-play solution in the framework. The following coding tasks highlight the key parts of the algorithm class and explain the interactions with the learning role and other classes along the way.

The outline of this tutorial is as follows. We start with an introduction to the changed simulation flow when we use reinforcement learning (1. From one simulation year to learning episodes). If you need a refresher on RL in general, please visit our readthedocs (https://assume.readthedocs.io/en/latest/). Afterwards, we look at the tasks of a learning role and the reasoning behind it (2. What role has a learning role) and then dive into the characteristics of the algorithm (3. The MATD3).

**As a whole, this tutorial covers the following coding tasks:**

1. xxx

## 0. Install Assume

First we need to install ASSUME in this Colab. Here we just install the ASSUME core package via pip. General installation instructions can be found here: https://assume.readthedocs.io/en/latest/installation.html. All required steps are executed here, and since we are working in Colab, creating a venv is not necessary.

In [None]:
!pip install assume-framework

And just like that, ASSUME is installed and ready to run. Please note, though, that we cannot use the functionalities tied to Docker in Colab and hence cannot access the predefined dashboards. For these, please install Docker and ASSUME on your personal machine.

Further we would like to access the predefined scenarios in ASSUME which are stored on the git repository. Hence, we clone the repository.

In [None]:
!git clone https://github.com/assume-framework/assume.git

**Let the magic happen.** Now you can run your first ever simulation in ASSUME. The following code navigates to the respective assume folder and starts the simulation example example_01b using the local database here in colab.

When running locally, you can also just run `assume -s example_01b -db "sqlite:///./examples/local_db/assume_db_example_01b.db"` in a shell.

In [None]:
!cd assume && assume -s example_01b -db "sqlite:///./examples/local_db/assume_db_example_01b.db"

## 1. From one simulation year to learning episodes

In a normal simulation without reinforcement learning, we run the time horizon of the simulation only once. With RL, the agents need to learn their strategy from interactions. For that to work, an RL agent has to see a situation, i.e. a simulation hour, multiple times, and hence we need to run the entire simulation horizon multiple times as well.

To enable this we define a run learning function that will be called if the simulation is started and we defined in our config that we want to activate learning.  
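
The idea of repeating the simulation horizon while keeping the collected experience alive can be sketched in plain Python. This is a toy illustration and not ASSUME code: `run_episode`, `run_toy_learning`, the list used as a buffer, and the placeholder action and reward are all hypothetical stand-ins.

```python
import random


def run_episode(n_hours, buffer, rng):
    """Simulate one pass over the horizon and store one transition per hour."""
    for hour in range(n_hours):
        action = rng.uniform(-1.0, 1.0)  # placeholder for an agent's bid
        reward = -abs(action)  # placeholder for the market feedback
        buffer.append((hour, action, reward))


def run_toy_learning(n_episodes, n_hours, seed=0):
    """Run the same horizon several times with one shared buffer."""
    rng = random.Random(seed)
    buffer = []  # created once, outside the episode loop, so it is never reset
    for _ in range(n_episodes):
        run_episode(n_hours, buffer, rng)
    return buffer


buffer = run_toy_learning(n_episodes=3, n_hours=24)
print(len(buffer))  # 3 episodes x 24 hours = 72 stored transitions
```

Each simulation hour ends up in the buffer once per episode, which is what allows an agent to revisit the same situation with different actions.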

**But first some imports:**

In [None]:
# install jdc for some in line magic,
# that allows us defining functions of classes across different cells

!pip install jdc

In [1]:
import logging
from collections import defaultdict
from datetime import datetime, timedelta
from pathlib import Path
from typing import Optional

import dateutil.rrule as rr
import numpy as np
import pandas as pd
import torch as th
import yaml
from tqdm import tqdm

from assume.common.base import LearningConfig
from assume.common.exceptions import AssumeException
from assume.common.forecasts import CsvForecaster, Forecaster
from assume.common.market_objects import MarketConfig, MarketProduct
from assume.world import World
from assume.scenario.loader_csv import load_scenario_folder
from assume.reinforcement_learning.buffer import ReplayBuffer

# module-level logger used by the functions and classes defined below
logger = logging.getLogger(__name__)

In [2]:
def run_learning(
    world: World, inputs_path: str, scenario: str, study_case: str
) -> None:
    """
    Train Deep Reinforcement Learning (DRL) agents to act in a simulated market environment.

    This function runs multiple episodes of simulation to train DRL agents, performs evaluation, and saves the best runs. It maintains the buffer and learned agents in memory to avoid resetting them with each new run.

    Args:
        world (World): An instance of the World class representing the simulation environment.
        inputs_path (str): The path to the folder containing input files necessary for the simulation.
        scenario (str): The name of the scenario for the simulation.
        study_case (str): The specific study case for the simulation.

    Note:
        - The function uses a ReplayBuffer to store experiences for training the DRL agents.
        - It iterates through training episodes, updating the agents and evaluating their performance at regular intervals.
        - Initial exploration is active at the beginning and is disabled after a certain number of episodes to improve the performance of DRL algorithms.
        - Upon completion of training, the function performs an evaluation run using the best policy learned during training.
        - The best policies are chosen based on the average reward obtained during the evaluation runs, and they are saved for future use.
    """
    # remove csv path so that nothing is written while learning
    temp_csv_path = world.export_csv_path
    world.export_csv_path = ""
    
    initialize_buffer()

    handle_storage_paths()


    run_learning_loop()


    # container shutdown implicitly with new initialisation
    logger.info("################")
    logger.info("Training finished, Start evaluation run")
    world.export_csv_path = temp_csv_path

    # load scenario for evaluation
    load_scenario_folder(
        world,
        inputs_path,
        scenario,
        study_case,
        perform_learning=False,
    )

In [None]:
def initialize_buffer():

    buffer = ReplayBuffer(
        buffer_size=int(world.learning_config.get("replay_buffer_size", 5e5)),
        obs_dim=world.learning_role.obs_dim,
        act_dim=world.learning_role.act_dim,
        n_rl_units=len(world.learning_role.rl_strats),
        device=world.learning_role.device,
        float_type=world.learning_role.float_type,
    )
    
    actors_and_critics = None

In [None]:
def handle_storage_paths():


    world.output_role.del_similar_runs()


    save_path = world.learning_config["trained_policies_save_path"]

    if Path(save_path).is_dir():
        # we are in learning mode and about to train new policies, which might overwrite existing ones
        accept = input(
            f"{save_path=} exists - should we overwrite current learnings? (y/N)"
        )
        if not accept.lower().startswith("y"):
            # stop here - do not start learning or save anything
            raise AssumeException("don't overwrite existing strategies")

In [None]:
def run_learning_loop():
    
    validation_interval = min(
        world.learning_role.training_episodes,
        world.learning_config.get("validation_episodes_interval", 5),
    )

    eval_episode = 1

    for episode in tqdm(
        range(1, world.learning_role.training_episodes + 1),
        desc="Training Episodes",
    ):
        # TODO normally, loading twice should not create issues, somehow a scheduling issue is raised currently
        if episode != 1:
            load_scenario_folder(
                world,
                inputs_path,
                scenario,
                study_case,
                perform_learning=True,
                episode=episode,
            )

        # give the newly created rl_agent the buffer that we stored from the beginning
        world.learning_role.initialize_policy(actors_and_critics=actors_and_critics)

        world.learning_role.buffer = buffer
        world.learning_role.episodes_done = episode

        if episode > world.learning_role.episodes_collecting_initial_experience:
            world.learning_role.turn_off_initial_exploration()

        world.run()

        actors_and_critics = world.learning_role.rl_algorithm.extract_policy()

        if (
            episode % validation_interval == 0
            and episode > world.learning_role.episodes_collecting_initial_experience
        ):
            # save current params in training path
            world.learning_role.rl_algorithm.save_params(directory=save_path)
            world.reset()

            # load validation run
            load_scenario_folder(
                world,
                inputs_path,
                scenario,
                study_case,
                perform_learning=False,
                perform_evaluation=True,
                eval_episode=eval_episode,
            )

            world.run()

            total_rewards = world.output_role.get_sum_reward()
            avg_reward = np.mean(total_rewards)
            # check reward improvement in validation run
            # and store best run in eval folder
            world.learning_role.compare_and_save_policies({"avg_reward": avg_reward})

            eval_episode += 1
        world.reset()

        # in load_scenario_folder_async, we initiate new container and kill old if present
        # as long as we do not skip setup container should be handled correctly
        # if enough initial experience was collected according to specifications in learning config
        # turn off initial exploration and go into full learning mode
        if episode >= world.learning_role.episodes_collecting_initial_experience:
            world.learning_role.turn_off_initial_exploration()
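
Which episodes trigger a validation run follows from the two conditions in the loop above. The helper below is a hypothetical sketch (the function name and the example numbers are not part of ASSUME) that reproduces only this selection logic.

```python
def validation_episodes(training_episodes, interval, initial_experience_episodes):
    """List the training episodes after which a validation run is started."""
    # the interval is capped at the total number of training episodes
    interval = min(training_episodes, interval)
    return [
        episode
        for episode in range(1, training_episodes + 1)
        if episode % interval == 0 and episode > initial_experience_episodes
    ]


# 20 training episodes, validation every 5 episodes, 5 exploration episodes
print(validation_episodes(20, 5, 5))  # [10, 15, 20]
```

Episode 5 is skipped in this example because the agents are still collecting initial experience and no policy update has happened yet.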

## 2. What role has a learning role



In [None]:
class Learning(Role):
    """
    This class manages the learning process of reinforcement learning agents, including initializing key components such as
    neural networks, replay buffer, and learning hyperparameters. It handles both training and evaluation modes based on
    the provided learning configuration.

    Args:
        simulation_start (datetime.datetime): The start of the simulation.
        simulation_end (datetime.datetime): The end of the simulation.
        learning_config (LearningConfig): The configuration for the learning process.

    """

    def __init__(
        self,
        learning_config: LearningConfig,
        start: datetime,
        end: datetime,
    ):
        self.simulation_start = start
        self.simulation_end = end

        # how many learning roles do exist and how are they named
        self.buffer: ReplayBuffer = None
        self.obs_dim = learning_config["observation_dimension"]
        self.act_dim = learning_config["action_dimension"]
        self.episodes_done = 0
        self.rl_strats: dict[int, LearningStrategy] = {}
        self.rl_algorithm = learning_config["algorithm"]
        self.critics = {}
        self.target_critics = {}

        # define whether we train model or evaluate it
        self.training_episodes = learning_config["training_episodes"]
        self.learning_mode = learning_config["learning_mode"]
        self.continue_learning = learning_config["continue_learning"]
        self.trained_policies_save_path = learning_config["trained_policies_save_path"]
        self.trained_policies_load_path = learning_config.get(
            "trained_policies_load_path", self.trained_policies_save_path
        )

        cuda_device = (
            learning_config["device"]
            if "cuda" in learning_config.get("device", "cpu")
            else "cpu"
        )
        self.device = th.device(cuda_device if th.cuda.is_available() else "cpu")

        # future: add option to choose between float16 and float32
        # float_type = learning_config.get("float_type", "float32")
        self.float_type = th.float

        th.backends.cuda.matmul.allow_tf32 = True
        th.backends.cudnn.allow_tf32 = True

        self.learning_rate = learning_config.get("learning_rate", 1e-4)

        # if we do not have initial experience collected we will get an error as no samples are available in the
        # buffer from which we can draw experience to adapt the strategy, hence we set it to a minimum of one episode

        self.episodes_collecting_initial_experience = max(
            learning_config.get("episodes_collecting_initial_experience", 5), 1
        )

        self.train_freq = learning_config.get("train_freq", 1)
        self.gradient_steps = (
            self.train_freq
            if learning_config.get("gradient_steps", -1) == -1
            else learning_config["gradient_steps"]
        )
        self.batch_size = learning_config.get("batch_size", 128)
        self.gamma = learning_config.get("gamma", 0.99)

        self.eval_episodes_done = 0

        # function that initializes learning, needs to be an extra function so that it can be called after buffer is given to Role
        self.create_learning_algorithm(self.rl_algorithm)

        # store evaluation values
        self.max_eval = defaultdict(lambda: -1e9)
        self.rl_eval = defaultdict(list)
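
The fallback rules in the constructor can be reproduced in isolation. The helper below is a hypothetical sketch (its name is not part of ASSUME) that mirrors how `gradient_steps` falls back to `train_freq` when it is set to -1 and how the number of initial-experience episodes is forced to be at least one.

```python
def resolve_training_settings(learning_config):
    """Apply the same defaults as the constructor to a plain dict."""
    train_freq = learning_config.get("train_freq", 1)

    # -1 means "use as many gradient steps as hours between two updates"
    gradient_steps = learning_config.get("gradient_steps", -1)
    if gradient_steps == -1:
        gradient_steps = train_freq

    # at least one episode is needed to fill the buffer before sampling
    initial_episodes = max(
        learning_config.get("episodes_collecting_initial_experience", 5), 1
    )

    return {
        "train_freq": train_freq,
        "gradient_steps": gradient_steps,
        "episodes_collecting_initial_experience": initial_episodes,
    }


print(resolve_training_settings({"train_freq": 24}))
print(resolve_training_settings({"episodes_collecting_initial_experience": 0}))
```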

In [None]:
%%add_to Learning
# magic to enable class definitions across colab cells (the cell magic must be the first line of the cell)

def setup(self) -> None:
        """
        Set up the learning role for reinforcement learning training.

        Notes:
            This method prepares the learning role for the reinforcement learning training process. It subscribes to relevant messages
            for handling the training process and schedules recurrent tasks for policy updates based on the specified training frequency.
        """
        # subscribe to messages for handling the training process
        self.context.subscribe_message(
            self,
            self.handle_message,
            lambda content, meta: content.get("context") == "rl_training",
        )

        recurrency_task = rr.rrule(
            freq=rr.HOURLY,
            interval=self.train_freq,
            dtstart=self.simulation_start,
            until=self.simulation_end,
            cache=True,
        )

        self.context.schedule_recurrent_task(self.update_policy, recurrency_task)


In [None]:
%%add_to Learning
# magic to enable class definitions across colab cells (the cell magic must be the first line of the cell)


def create_learning_algorithm(self, algorithm: RLAlgorithm):
    """
    Create and initialize the reinforcement learning algorithm.

    This method creates and initializes the reinforcement learning algorithm based on the specified algorithm name. The algorithm
    is associated with the learning role and configured with relevant hyperparameters.

    Args:
        algorithm (RLAlgorithm): The name of the reinforcement learning algorithm.
    """
    if algorithm == "matd3":
        self.rl_algorithm = TD3(
            learning_role=self,
            learning_rate=self.learning_rate,
            episodes_collecting_initial_experience=self.episodes_collecting_initial_experience,
            gradient_steps=self.gradient_steps,
            batch_size=self.batch_size,
            gamma=self.gamma,
        )
    else:
        logger.error(f"Learning algorithm {algorithm} not implemented!")


In [None]:
%%add_to Learning

def initialize_policy(self, actors_and_critics: dict = None) -> None:
    """
    Initialize the policy of the reinforcement learning agent considering the respective algorithm.

    This method initializes the policy (actor) of the reinforcement learning agent. It tests if we want to continue the learning process with
    stored policies from a former training process. If so, it loads the policies from the specified directory. Otherwise, it initializes the
    respective new policies.
    """

    self.rl_algorithm.initialize_policy(actors_and_critics)

    if self.continue_learning is True and actors_and_critics is None:
        directory = self.trained_policies_load_path
        if Path(directory).is_dir():
            logger.info(f"Loading pretrained policies from {directory}!")
            self.rl_algorithm.load_params(directory)
        else:
            logger.warning(
                f"Folder with pretrained policies {directory} does not exist"
            )

async def update_policy(self) -> None:
    """
    Update the policy of the reinforcement learning agent.

    This method is responsible for updating the policy (actor) of the reinforcement learning agent asynchronously. It checks if
    the number of episodes completed is greater than the number of episodes required for initial experience collection. If so,
    it triggers the policy update process by calling the `update_policy` method of the associated reinforcement learning algorithm.

    Notes:
        This method is typically scheduled to run periodically during training to continuously improve the agent's policy.
    """
    if self.episodes_done > self.episodes_collecting_initial_experience:
        self.rl_algorithm.update_policy()

def compare_and_save_policies(self, metrics: dict) -> None:
    """
    Compare evaluation metrics and save policies based on the best achieved performance according to the metrics calculated.

    This method compares the evaluation metrics, such as reward, profit, and regret, and saves the policies if they achieve the
    best performance in their respective categories. It iterates through the specified modes, compares the current evaluation
    value with the previous best, and updates the best value if necessary. If an improvement is detected, it saves the policy
    and associated parameters.

    metrics contain a metric key like "reward" and the current value.
    This function stores the policies with the highest metric.
    So if minimize is required one should add for example "minus_regret" which is then maximized.

    Notes:
        This method is typically used during the evaluation phase to save policies that achieve superior performance.
        Currently the best evaluation metric is still assessed by the development team and preliminary we use the average rewards.
    """
    if not metrics:
        logger.error("tried to save policies but did not get any metrics")
        return
    # if the current values are a new max in one of the metrics - we store them in the default folder
    first_has_new_max = False

    # add current reward to list of all rewards
    for metric, value in metrics.items():
        self.rl_eval[metric].append(value)
        if self.rl_eval[metric][-1] > self.max_eval[metric]:
            self.max_eval[metric] = self.rl_eval[metric][-1]
            if metric == list(metrics.keys())[0]:
                first_has_new_max = True
            # store the best for our current metric in its folder
            self.rl_algorithm.save_params(
                directory=f"{self.trained_policies_save_path}/{metric}"
            )

    # use first metric as default
    if first_has_new_max:
        self.rl_algorithm.save_params(directory=self.trained_policies_save_path)
        logger.info(
            f"Policies saved, episode: {self.eval_episodes_done + 1}, {metric=}, value={value:.2f}"
        )
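
The bookkeeping behind `compare_and_save_policies` can be tried out without any policies or file access. The class below is a simplified, hypothetical stand-in (`BestMetricTracker` is not part of ASSUME) that only tracks which metrics reached a new maximum; the actual saving of parameters is left out.

```python
from collections import defaultdict


class BestMetricTracker:
    """Track the best value seen so far for each evaluation metric."""

    def __init__(self):
        # start far below any realistic value so the first result always wins
        self.max_eval = defaultdict(lambda: -1e9)
        self.history = defaultdict(list)

    def update(self, metrics):
        """Store the new values and return the metrics that improved."""
        improved = []
        for name, value in metrics.items():
            self.history[name].append(value)
            if value > self.max_eval[name]:
                self.max_eval[name] = value
                improved.append(name)
        return improved


tracker = BestMetricTracker()
print(tracker.update({"avg_reward": 1.0, "minus_regret": -4.0}))
print(tracker.update({"avg_reward": 0.5, "minus_regret": -2.0}))
```

Because every metric is maximized, a quantity that should be minimized has to be passed with a flipped sign, as done here with the illustrative `minus_regret` key.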

#### **Solution 1**

First why do we scale?

Scaling observations is a crucial preprocessing step in machine learning, including reinforcement learning. It involves transforming the features so that they all fall within a similar numerical range. This is important for several reasons. Firstly, it aids in numerical stability during training. Large input values can lead to numerical precision issues, potentially causing the algorithm to perform poorly or even fail to converge. By scaling the features, we mitigate this risk, ensuring a more stable and reliable learning process.

Additionally, scaling promotes uniformity in the learning process. Many optimization algorithms, like gradient descent, adjust model parameters based on the magnitude of gradients. When features have vastly different scales, some may dominate the learning process, while others receive less attention. This imbalance can hinder convergence and result in a suboptimal model. Scaling addresses this issue, allowing the algorithm to treat all features equally and progress more efficiently towards an optimal solution. This not only expedites the learning process but also enhances the model's ability to generalize to new, unseen data. In essence, scaling observations is a fundamental practice that enhances the performance and robustness of machine learning models across a wide array of applications.

Accordingly, the scaling should ensure a similar range for all input parameters. You can achieve that by choosing the following scaling factors. If you add new observations, choose your scaling factors wisely.

In [None]:
"""
#scaling factors for all observations
#residual load forecast
scaling_factor_res_load = self.max_demand

# price forecast
scaling_factor_price = self.max_bid_price

# total capacity
scaling_factor_total_capacity = unit.max_power

# marginal cost
scaling_factor_marginal_cost = self.max_bid_price
"""
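
To see what these scaling factors do to the numbers, the following sketch applies them to a small hand-made observation. The function name and all input values are hypothetical and chosen only for illustration; plain Python lists stand in for the tensors used in the strategy.

```python
def scale_observation(
    res_load_forecast,
    price_forecast,
    total_capacity,
    marginal_cost,
    max_demand,
    max_bid_price,
    max_power,
):
    """Divide each observation part by its scaling factor and concatenate."""
    scaled_res_load = [value / max_demand for value in res_load_forecast]
    scaled_price = [value / max_bid_price for value in price_forecast]
    scaled_capacity = total_capacity / max_power
    scaled_marginal_cost = marginal_cost / max_bid_price
    # marginal cost is kept as the last entry of the observation
    return scaled_res_load + scaled_price + [scaled_capacity, scaled_marginal_cost]


observation = scale_observation(
    res_load_forecast=[3000.0, 4500.0],  # MW, illustrative
    price_forecast=[30.0, 60.0],  # EUR/MWh, illustrative
    total_capacity=400.0,
    marginal_cost=45.0,
    max_demand=6000.0,
    max_bid_price=100.0,
    max_power=500.0,
)
print(observation)
```

All entries now lie between 0 and 1, even though the raw values differed by orders of magnitude.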

### 3.3 Choose an action

To differentiate between the inflexible and flexible parts of a plant's generation capacity, we split the bids into two parts. The first bid part allows agents to bid a very low or even negative price for the inflexible capacity; this reflects the agent's motivation to stay infra-marginal during periods of very low net load (e.g., in periods of high solar and wind power generation) to avoid the cost of a shut-down and subsequent start-up of the plant. The flexible part of the capacity can be offered at a higher price to provide chances for higher profits. The actions of agent $i$ at time-step $t$ are defined as $a_{i,t} = [ep^\mathrm{inflex}_{i,t}, ep^\mathrm{flex}_{i,t}] \in [ep^{min},ep^{max}]$, where $ep^\mathrm{inflex}_{i,t}$ and $ep^\mathrm{flex}_{i,t}$ are bid prices for the inflexible and flexible capacities, and $ep^{min},ep^{max}$ are minimal and maximal bid prices, respectively.

How do we learn to make good decisions? Basically by trial and error, also known as **exploration**. Exploration is a fundamental concept in reinforcement learning, representing the strategy by which an agent interacts with its environment to gather information about the consequences of its actions. This is crucial because without exploration, the agent might settle for suboptimal policies based on its initial knowledge, limiting its ability to discover more rewarding states or actions.

In the initial stages of training, also often called initial exploration, it's imperative to employ almost random actions. This means having the agent take actions purely by chance. This seemingly counterintuitive approach serves a critical purpose. Initially, the agent lacks any meaningful information about the environment, making it impossible to make informed decisions. By taking random actions, it can quickly gather a broad range of experiences, allowing it to grasp the fundamental structure of the environment. These random actions serve as a kind of "baseline exploration," providing a starting point from which the agent can refine its policy through learning. With our domain knowledge we can even guide the initial exploration process, to enhance learning capabilities.


Following up on these concepts the following tasks will:
1. obtain the action values from the neural net in the bidding strategy and
2. then transform these values into the actual bids of an order.

#### **Task 2.1**
**Goal**: With the observations and noise we generate actions

In the following task we define the actions for the initial exploration mode. As described before, we can guide it by not letting it choose random actions but by defining a base bid to which we add a good amount of noise. In this way the initial strategy starts from a solution that we know works somewhat well. Define the respective base bid in the following code. Remember that we are defining bids for a conventional power plant bidding in an energy-only market with a uniform pricing auction.

In [None]:
%%add_to RLStrategy
# magic to enable class definitions across colab cells (the cell magic must be the first line of the cell)
def get_actions(self, next_observation):
        """
        Get actions
        """

        # distinguish whether we are in learning mode or not to handle exploration realised with noise
        if self.learning_mode:
            # if we are in learning mode the first x episodes we want to explore the entire action space
            # to get a good initial experience, in the area around the costs of the agent
            if self.collect_initial_experience_mode:
                # define current action as solely noise
                noise = (
                    th.normal(
                        mean=0.0, std=0.2, size=(1, self.act_dim), dtype=self.float_type
                    )
                    .to(self.device)
                    .squeeze()
                )

                # =============================================================================
                # 2.1 Get Actions and handle exploration
                # =============================================================================
                #==> YOUR CODE HERE
                base_bid = #TODO

                # add noise to the last dimension of the observation
                # needs to be adjusted if observation space is changed, because only makes sense
                # if the last dimension of the observation space are the marginal cost
                curr_action = noise + base_bid.clone().detach()

            else:
                # if we are not in the initial exploration phase we choose the action with the actor neural net
                # and add noise to the action
                curr_action = self.actor(next_observation).detach()
                noise = th.tensor(
                    self.action_noise.noise(), device=self.device, dtype=self.float_type
                )
                curr_action += noise
        else:
            # if we are not in learning mode we just use the actor neural net to get the action without adding noise

            curr_action = self.actor(next_observation).detach()
            noise = tuple(0 for _ in range(self.act_dim))

        curr_action = curr_action.clamp(-1, 1)

        return curr_action, noise


#### **Solution 2.1**

So how do we define the base bid?

Assuming the described auction is an efficient market with full information and competition, we know that bidding the marginal costs of the power plant is the economically best bid. With the RL strategy we can recreate the abuse of market power and incomplete information, which enables us to model different market settings. Yet, starting off with the theoretically stylized optimal solution guides our RL agents properly. As the marginal costs of the power plant are part of the observations, we can define the base bid in the following way.

In [None]:
"""
#base_bid = marginal costs
base_bid = next_observation[-1] # = marginal_costs
"""
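
The effect of this guided exploration can be illustrated without PyTorch. The sketch below uses the standard library only and is an assumption-laden stand-in for the tensor code above: the function name is hypothetical, the observation is a plain list whose last entry is taken to be the scaled marginal cost, and the noise standard deviation of 0.2 and the clamp to [-1, 1] are copied from `get_actions`.

```python
import random


def initial_exploration_action(observation, act_dim, noise_std=0.2, rng=None):
    """Return base bid plus Gaussian noise for each action dimension."""
    rng = rng or random.Random()
    base_bid = observation[-1]  # assumes the scaled marginal cost comes last
    noisy = [base_bid + rng.gauss(0.0, noise_std) for _ in range(act_dim)]
    # keep the actions inside the range the actor network would also produce
    return [max(-1.0, min(1.0, value)) for value in noisy]


observation = [0.5, 0.75, 0.3, 0.6, 0.8, 0.45]  # illustrative scaled values
actions = initial_exploration_action(observation, act_dim=2, rng=random.Random(42))
print(actions)
```

The sampled actions scatter around the marginal cost of 0.45 instead of covering the whole action range uniformly, which is the intended guidance.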

#### **Task 2.2**
**Goal: Define the actual bids with the outputs of the actors**

Like the outputs of most neural networks, the actions lie in a bounded range; `get_actions` above clamps them to [-1, 1]. These values need to be translated into the actual bids $a_{i,t} = [ep^\mathrm{inflex}_{i,t}, ep^\mathrm{flex}_{i,t}] \in [ep^{min},ep^{max}]$. This can be done in a way that further helps the RL agent to learn, if we put some thought into it.

For this we go back into the calculate_bids() function and, instead of just defining bids=actions, which was only a placeholder, we actually turn the actions into bids. Think about a smart way to transform them and fill the gaps in the following code. Remember:

  - *bid_quantity_inflex* represents the inflexible part of the bid. This is the minimum run capacity of the unit.
  - *bid_quantity_flex* represents the flexible part of the bid. This is the flexible capacity of the unit.

In [None]:
%%add_to RLStrategy
# magic to enable class definitions across colab cells (the cell magic must be the first line of the cell)
def calculate_bids(
    self,
    unit: SupportsMinMax,
    market_config: MarketConfig,
    product_tuples: list[Product],
    **kwargs,
) -> Orderbook:
    """
    Calculate bids for a unit
    """

    bid_quantity_inflex, bid_price_inflex = 0, 0
    bid_quantity_flex, bid_price_flex = 0, 0

    start = product_tuples[0][0]
    end = product_tuples[0][1]
    # get technical bounds for the unit output from the unit
    min_power, max_power = unit.calculate_min_max_power(start, end)
    min_power = min_power[start]
    max_power = max_power[start]

    # =============================================================================
    # 1. Get the Observations, which are the basis of the action decision
    # =============================================================================
    next_observation = self.create_observation(
        unit=unit,
        market_id=market_config.market_id,
        start=start,
        end=end,
    )

    # =============================================================================
    # 2. Get the Actions, based on the observations
    # =============================================================================
    actions, noise = self.get_actions(next_observation)

    bids = actions

    # =============================================================================
    # 3. Transform Actions into bids
    # =============================================================================
    #==> YOUR CODE HERE
    # actions are bounded (get_actions clamps them to [-1, 1]), we need to transform them into actual bids
    # we can use our domain knowledge to guide the bid formulation
    bid_prices = actions * self.max_bid_price

    # 3.1 formulate the bids for Pmin
    # Pmin, the minimum run capacity, is the inflexible part of the bid, which should always be accepted
    bid_quantity_inflex = min_power
    bid_price_inflex = #TODO

    # 3.2 formulate the bids for Pmax - Pmin
    # Pmax - Pmin, the capacity above the minimum run capacity, is the flexible part of the bid
    bid_quantity_flex = max_power - bid_quantity_inflex
    bid_price_flex = #TODO

    # actually formulate bids in orderbook format
    bids = [
        {
            "start_time": start,
            "end_time": end,
            "only_hours": None,
            "price": bid_price_inflex,
            "volume": bid_quantity_inflex,
        },
        {
            "start_time": start,
            "end_time": end,
            "only_hours": None,
            "price": bid_price_flex,
            "volume": bid_quantity_flex,
        },
    ]

    # store results in unit outputs which are written to database by unit operator
    unit.outputs["rl_observations"][start] = next_observation
    unit.outputs["rl_actions"][start] = actions
    unit.outputs["rl_exploration_noise"][start] = noise
    
    bids = self.remove_empty_bids(bids)

    return bids

#### **Solution 2.2**

So how do we define the actual bid from the action?

We have one bid price for the minimum power (inflex) and one for the rest of the power (flex). As the power plant needs to run at least at its minimum power in order to offer any generation at all, it makes sense to offer this generation at a lower price than the rest of the power. Hence, we can allocate the actions to the bid prices in the following way. In addition, the actions of course need to be rescaled.


In [None]:
"""
#calculate actual bids
#rescale actions to actual prices
bid_prices = actions * self.max_bid_price

#calculate inflexible part of the bid
bid_quantity_inflex = min_power
bid_price_inflex = min(bid_prices)

#calculate flexible part of the bid
bid_quantity_flex = max_power - bid_quantity_inflex
bid_price_flex = max(bid_prices)
"""
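
The mapping from two actions to two bids can be checked with concrete numbers. The sketch below is a simplified, hypothetical version of the solution: it uses plain floats instead of tensors, reduces each bid to its price and volume, and all input values are illustrative.

```python
def actions_to_bids(actions, max_bid_price, min_power, max_power):
    """Rescale actions to prices and assign them to the two bid parts."""
    bid_prices = [action * max_bid_price for action in actions]

    # the lower price goes to the inflexible part (minimum run capacity)
    inflex_bid = {"price": min(bid_prices), "volume": min_power}
    # the higher price goes to the remaining, flexible capacity
    flex_bid = {"price": max(bid_prices), "volume": max_power - min_power}
    return [inflex_bid, flex_bid]


bids = actions_to_bids(
    actions=[0.5, -0.25], max_bid_price=100.0, min_power=200.0, max_power=500.0
)
print(bids)
```

Taking the minimum and maximum means the order of the two network outputs does not matter: the inflexible part always receives the lower price, which here is negative because the corresponding action is below zero.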

### 3.4 Get a reward
This step is done in the *calculate_reward()* function, which is called after the market is cleared and we get the market feedback, so we can calculate the profit. In RL, the design of a reward function is as important as the choice of the correct algorithm. During the initial phase of the work, a purely economic reward in the form of the agent's profit was used. Typically, electricity market models consider only a single restart cost. Still, in the case of using RL, the split into shut-down and start-up costs allows the agents to better differentiate between these two events and learn a better policy.


\begin{equation}
\pi_{i,t} =
\begin{cases}
P^\text{conf}_{i,t} (M_t - mc_{i,t}) dt - c^{su}_i & \text{if $P^\text{conf}_{i,t}$ $\geq  P^{min}_i$} \\
& \text{and $P_{i,t-1}$ $= 0$} \\
P^\text{conf}_{i,t} (M_t - mc_{i,t}) dt & \text{if $P^\text{conf}_{i,t}$ $\geq  P^{min}_i$} \\
& \text{and $P_{i,t-1}$ $\neq 0$} \\
- c^{sd}_i & \text{if $P^\text{conf}_{i,t}$ $<  P^{min}_i$} \\
& \text{and $P_{i,t-1}$ $\neq 0$} \\
0 & \text{otherwise} \\
\end{cases}
\end{equation}


In this equation, the variables are:
* $P^\text{conf}$ the confirmed capacity on the market
* $P^{min}$ the minimal stable capacity
* $M$ the market clearing price
* $mc$ the marginal generation cost
* $dt$ the market time resolution
* $c^{su}, c^{sd}$ the start-up and shut-down costs, respectively
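
As a rough illustration, the piecewise profit can be written as a small standalone function. This is only a sketch of the equation, not the code used in ASSUME: the function name and all numbers are hypothetical, and the shut-down case is read as "confirmed capacity below the minimum", since the case of equality is already covered by the second branch.

```python
def piecewise_profit(p_conf, p_prev, p_min, price, marginal_cost, dt, c_su, c_sd):
    """Profit of one unit for one market interval (hypothetical helper)."""
    if p_conf >= p_min and p_prev == 0:
        # unit was off and is dispatched: margin minus start-up cost
        return p_conf * (price - marginal_cost) * dt - c_su
    if p_conf >= p_min and p_prev != 0:
        # unit keeps running: margin only
        return p_conf * (price - marginal_cost) * dt
    if p_prev != 0:
        # unit was running and is not dispatched: shut-down cost
        return -c_sd
    # unit was off and stays off
    return 0.0


# made-up unit: 200 MW minimum, 30 EUR/MWh marginal cost, hourly market
common = dict(p_min=200, price=50, marginal_cost=30, dt=1, c_su=1000, c_sd=1000)

print(piecewise_profit(p_conf=300, p_prev=0, **common))    # start-up
print(piecewise_profit(p_conf=300, p_prev=300, **common))  # keeps running
print(piecewise_profit(p_conf=0, p_prev=300, **common))    # shut-down
print(piecewise_profit(p_conf=0, p_prev=0, **common))      # stays off
```

With these numbers the start-up interval earns 1000 less than the interval in which the unit simply keeps running, which is exactly the start-up cost.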

The profit-driven reward function was sufficient for a few agents, but the learning performance decreased significantly with more agents. Therefore, we extend the reward with a regret term $cm$.

#### **Task 3**
**Goal**: Define the reward guiding the learning process of the agent.

As the reward plays such a crucial role in the learning process, think of ways to integrate further signals that go beyond the monetary profit. One example is a regret term, namely the opportunity costs. Your task is to define the reward using the opportunity costs and to scale it.

In [None]:
#magic to enable class definitions across colab cells
%%add_to RLStrategy
def calculate_reward(
        self,
        unit,
        marketconfig: MarketConfig,
        orderbook: Orderbook,
    ):
    """
    Calculate reward
    """

    # =============================================================================
    # 3. Calculate Reward
    # =============================================================================
    # function is called after the market is cleared and we get the market feedback,
    # so we can calculate the profit

    product_type = marketconfig.product_type

    profit = 0
    reward = 0
    opportunity_cost = 0

    # iterate over all orders in the orderbook, to calculate order specific profit
    for order in orderbook:
        start = order["start_time"]
        end = order["end_time"]
        end_excl = end - unit.index.freq

        # depending on how the unit calculates marginal costs, we either read them
        # from the precomputed values or compute them with partial-load efficiency
        if unit.marginal_cost is not None:
            marginal_cost = (
                unit.marginal_cost[start]
                if len(unit.marginal_cost) > 1
                else unit.marginal_cost
            )
        else:
            marginal_cost = unit.calc_marginal_cost_with_partial_eff(
                power_output=unit.outputs[product_type].loc[start:end_excl],
                timestep=start,
            )

        duration = (end - start) / timedelta(hours=1)

        # calculate profit as income - running_cost from this event
        price_difference = order["accepted_price"] - marginal_cost
        order_profit = price_difference * order["accepted_volume"] * duration

        # calculate opportunity cost
        # as the loss of income we have because we are not running at full power
        order_opportunity_cost = (
            price_difference
            * (
                unit.max_power - unit.outputs[product_type].loc[start:end_excl]
            ).sum()
            * duration
        )

        # if our opportunity costs are negative, we did not miss an opportunity to earn money and we set them to 0
        order_opportunity_cost = max(order_opportunity_cost, 0)

        # collect profit and opportunity cost for all orders
        opportunity_cost += order_opportunity_cost
        profit += order_profit

    # consideration of start-up costs, which are evenly divided between the
    # upward and downward regulation events
    if (
        unit.outputs[product_type].loc[start] != 0
        and unit.outputs[product_type].loc[start - unit.index.freq] == 0
    ):
        profit = profit - unit.hot_start_cost / 2
    elif (
        unit.outputs[product_type].loc[start] == 0
        and unit.outputs[product_type].loc[start - unit.index.freq] != 0
    ):
        profit = profit - unit.hot_start_cost / 2

    # =============================================================================
    # =============================================================================
    # ==> YOUR CODE HERE
    # The straightforward implementation would be reward = profit, yet we would like to give the agent more guidance
    # in the learning process, so we add a regret term to the reward, which is the opportunity cost.
    # Define the reward and scale it.

    scaling = #TODO
    regret_scale = #TODO
    reward = #TODO

    # store results in unit outputs which are written to database by unit operator
    unit.outputs["profit"].loc[start:end_excl] += profit
    unit.outputs["reward"].loc[start:end_excl] = reward
    unit.outputs["regret"].loc[start:end_excl] = opportunity_cost

#### **Solution 3**

So how do we define the actual reward?

We use the opportunity costs for further guidance, which quantify the expected contribution margin, as defined by the following equation, with $P^{max}$ as the maximal available capacity.

\begin{equation}
cm_{i,t} = \max[(P^{max}_i - P^\text{conf}_{i,t}) (M_t - mc_{i,t}) dt, 0]
\end{equation}

The regret term gives a negative signal to the agent when there is opportunity cost due to the unsold capacity, thus correcting the agent's actions. This term also introduces an increased influence of the competition between agents in learning. By minimizing the regret, the agents drive the bid prices closer to the marginal generation cost, which drives the market price down.

The reward of agent $i$ at time-step $t$ is defined by the equation below.

\begin{equation}
R_{i,t}  = \pi_{i,t} - \beta cm_{i,t}
\end{equation}

Here, $\beta$ is the regret scaling factor to adjust the ratio between profit-maximizing and regret-minimizing learning.

The described reward function has proven to perform well even with many agents and to accelerate learning convergence. This is because minimizing the regret term drives the overall system to equilibrium. At a point close to the equilibrium point, the average reward of all agents would converge to a constant value since further policy changes would not lead to an additional reduction in regrets or an increase in profits. Therefore, the average reward value can also be a good indicator of learning performance and convergence.

In [None]:
"""
scaling = 0.1 / unit.max_power
regret_scale = 0.2
reward = float(profit - regret_scale * opportunity_cost) * scaling
"""
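
The following standalone sketch plugs made-up numbers into the regret and reward definitions to show the effect of the regret term. The helper names `regret` and `scaled_reward` and all unit values are hypothetical; only the scaling of `0.1 / max_power` and the regret scale of `0.2` are taken from the solution above.

```python
def regret(p_max, p_conf, price, marginal_cost, dt):
    """Opportunity cost of unsold capacity, never negative (hypothetical helper)."""
    return max((p_max - p_conf) * (price - marginal_cost) * dt, 0.0)


def scaled_reward(profit, opportunity_cost, max_power, regret_scale=0.2):
    """Profit minus scaled regret, normalised by the unit size (hypothetical helper)."""
    scaling = 0.1 / max_power
    return float(profit - regret_scale * opportunity_cost) * scaling


# made-up unit: 500 MW maximum, 30 EUR/MWh marginal cost, hourly market
max_power, marginal_cost, dt = 500.0, 30.0, 1.0
price = 50.0

# case 1: only 300 MW accepted although the price is above marginal cost
profit_partial = 300.0 * (price - marginal_cost) * dt
regret_partial = regret(max_power, 300.0, price, marginal_cost, dt)
reward_partial = scaled_reward(profit_partial, regret_partial, max_power)

# case 2: full capacity accepted, so there is no regret
profit_full = 500.0 * (price - marginal_cost) * dt
regret_full = regret(max_power, 500.0, price, marginal_cost, dt)
reward_full = scaled_reward(profit_full, regret_full, max_power)

print(regret_partial, round(reward_partial, 4))
print(regret_full, round(reward_full, 4))
```

In the first case the unsold 200 MW produce a regret of 4000, which lowers the reward below the pure profit signal. If the market price were below the marginal cost, the `max(..., 0)` would set the regret to zero, so the agent is not penalised for staying out of an unprofitable hour.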

### 3.5 Start the simulation

We are almost done with all the changes needed to make ASSUME learn here in Google Colab. If you would rather load our pretrained strategies, you need a function for loading parameters, which can be found below.



In [None]:
# magic to enable class definitions across colab cells
%%add_to RLStrategy


def load_actor_params(self, load_path):
    """
    Load actor parameters
    """
    directory = f"{load_path}/actors/actor_{self.unit_id}.pt"

    params = th.load(directory, map_location=self.device)

    self.actor = Actor(self.obs_dim, self.act_dim, self.float_type)
    self.actor.load_state_dict(params["actor"])

    if self.learning_mode:
        self.actor_target = Actor(self.obs_dim, self.act_dim, self.float_type)
        self.actor_target.load_state_dict(params["actor_target"])
        self.actor_target.eval()
        self.actor.optimizer.load_state_dict(params["actor_optimizer"])

To control the learning process, the config file determines the parameters of the learning algorithm. As we want to tamper with these values in the notebook, we overwrite the learning config in the next cell and then load it into our world.

In [None]:
learning_config = {
    "observation_dimension": 50,
    "action_dimension": 2,
    "continue_learning": False,
    "trained_policies_save_path": "None",
    "max_bid_price": 100,
    "algorithm": "matd3",
    "learning_rate": 0.001,
    "training_episodes": 100,
    "episodes_collecting_initial_experience": 5,
    "train_freq": 24,
    "gradient_steps": -1,
    "batch_size": 256,
    "gamma": 0.99,
    "device": "cpu",
    "noise_sigma": 0.1,
    "noise_scale": 1,
    "noise_dt": 1,
    "validation_episodes_interval": 5,
}
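
To get a feeling for one of these values, the short sketch below looks at `gamma`, assuming it is the usual discount factor of the underlying algorithm. It computes how strongly a reward that lies 24 steps in the future (one day, if the market runs hourly) is weighted, and the rough "effective horizon" $1 / (1 - \gamma)$. The variable names are illustrative only.

```python
gamma = 0.99  # value from the learning config above

# weight of a reward received 24 steps in the future
weight_after_24_steps = gamma**24

# rule-of-thumb horizon over which future rewards still matter
effective_horizon = 1 / (1 - gamma)

print(round(weight_after_24_steps, 3))
print(round(effective_horizon, 1))
```

A reward one day ahead still counts with roughly 79 % of its value, so with this setting the agent looks well beyond the next market interval.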

In [None]:
# Read the YAML file
with open("assume/examples/inputs/example_02a/config.yaml", "r") as file:
    data = yaml.safe_load(file)

# store our modifications to the config file
data["base"]["learning_mode"] = True
data["base"]["learning_config"] = learning_config

# Write the modified data back to the file
with open("assume/examples/inputs/example_02a/config.yaml", "w") as file:
    yaml.safe_dump(data, file)

In order to let the simulation run with the integrated learning we need to touch up the main file that runs it in the following way.

In [None]:
log = logging.getLogger(__name__)

csv_path = "./outputs"
os.makedirs("./local_db", exist_ok=True)

if __name__ == "__main__":
    """
    Available examples:
    - local_db: without database and grafana
    - timescale: with database and grafana (note: you need docker installed)
    """
    data_format = "local_db"  # "local_db" or "timescale"

    if data_format == "local_db":
        db_uri = "sqlite:///./local_db/assume_db.db"
    elif data_format == "timescale":
        db_uri = "postgresql://assume:assume@localhost:5432/assume"

    input_path = "assume/examples/inputs"
    scenario = "example_02a"
    study_case = "base"

    # create world
    world = World(database_uri=db_uri, export_csv_path=csv_path)

    # we register our bidding strategy class, including the learning, in the world's bidding strategies
    # in the example files we provided, the name of the learning bidding strategy in the input csv is "pp_learning"
    # hence we map this name to our learning strategy class
    world.bidding_strategies["pp_learning"] = RLStrategy

    # then we load the scenario specified above from the respective input files
    load_scenario_folder(
        world,
        inputs_path=input_path,
        scenario=scenario,
        study_case=study_case,
    )

    # run learning if learning mode is enabled
    # needed as we simulate the modelling horizon multiple times to train reinforcement learning

    if world.learning_config.get("learning_mode", False):
        run_learning(
            world,
            inputs_path=input_path,
            scenario=scenario,
            study_case=study_case,
        )

    # after the learning is done we make a normal run of the simulation, which equals a test run
    world.run()