# Sarsa/Q-Learning/Actor-Critic

_By_

_Henri Lemoine; 261056402; henri.lemoine@mail.mcgill.ca_

_Frederic Baroz; 261118133; frederic.baroz@mail.mcgill.ca_

We provide our results in this notebook. We describe the process alongside with the code and provide a discussion (report) at the end of each question.

Because running the experiments sometimes takes several minutes, we provide screen-shots of the results within the discussion sections.

## 2.&nbsp;Imports


In [2]:
import random
import logging
import time
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import IPython
from typing import List, Tuple

## 3.&nbsp;Common constants & config

**Note on standard error**

We understood there was some freedom as to whether to show standard error or standard deviation. We thus again provide a `USE_STD` boolean to allow for both. Standard error is given by: $\ se = \frac{std}{\sqrt{N}}$.

**Note on seeding:**

- The seeding behavior has changed several times throughout versions of the gym environments. According to the `env.reset()` section in the [documentation](https://gymnasium.farama.org/api/env/#gymnasium.Env.reset), the seed should be set just after creating the environment through a call to `env.reset(seed = SEED)` and never again (subsequent calls to `env.reset()` at the start of each episode should be done without passing a seed).
- Calling `env.reset(seed = SEED)` only seeds the environmnent and since we use random number generators in our agent, python and/or numpy should be seeded as well. We provide a `seed_everything()` function that seeds the environment, python's and numpy's random number generators.


In [3]:
# Use standard deviation (True) or standard error (false)
USE_STD: bool = False
# Seed used for python, numpy and gymnasium random number generators
SEED: int = 42


# This utility function allows to seed all random at the same time
def seed_everything(seed, *gym_environments):
    """
    :gymnasium_environment: an instance of gym environment returned by the gym.make() function
    :seed: Seed to be set to all RNG. If None, uses the default SEED.
    """
    if seed is None:
        seed = SEED

    random.seed(seed)
    np.random.seed(seed)
    for env in gym_environments:
        env.reset(seed=seed)


logging.basicConfig(level=logging.INFO, force=True)
logger = logging.getLogger(__name__)

## 4.&nbsp;Question 1: Tabular RL

In this problem, you will compare the performance of SARSA and expected SARSA on the Frozen
Lake domain from the Gym environment suite:

https://gymnasium.farama.org/environments/toy_text/frozen_lake/

Use a tabular representation of the state space. Exploration should be softmax (Boltzmann). You will do 10 independent runs. Each run consists of 500 segments, in each segment there are 10 episodes of training, followed by 1 episode in which you simply run the optimal policy so far (i.e. you pick actions greedily based on the current value estimates). Pick 3 settings of the temperature parameter used in the exploration and 3 settings of the learning rate.

- One u-shaped graph that shows the effect of the parameters on the final training performance, expressed as the return of the agent (averaged over the last 10 training episodes and the 10 runs); note that this will typically end up as an upside-down u.
- One u-shaped graph that shows the effect of the parameters on the final testing performance, expressed as the return of the agent (during the final testing episode, averaged over the 10 runs)
- Learning curves (mean and standard deviation computed based on the 10 runs) for what you pick as the best parameter setting for each algorithm

Write a small report that describes your experiment, your choices of parameters, and the conclu- sions you draw from the graphs.


### 4.1.&nbsp;Q1 constants

Class `Q1C` encapsulating constants used in question 1.

- `USE_SEED` allows to set whether to use the seed or not in Q1. It is not used in Q2 as it is a requirement.
- `MODE_TEST`and `MODE_TRAIN` is used to run a series of episodes in each mode. `MODE_TRAIN` makes the agent use the Boltzman softmax policy whereas in `MODE_TEST` makes the agent use the optimal policy. This constant is also use to decide what to plot in graphs.
- We set a series of constants for calculating the return metrics and set a `DEFAULT_RETURN_METRICS`. These constants are only used for plotting the results.
  - `RETURN_METRICS_CUMULATIVE` returns the _undiscounted sum of rewards received throughout episodes_;
  - `RETURN_METRICS_AVERAGE`returns the _average of the rewards received throughout episodes_;
  - `RETURN_METRICS_DISCOUNTED` returns the _discounted sum of rewards received tourhout episodes_ (discounted using gamma).
- `DEFAULT_GAMMA` is the discount factor used in Sarsa and Expected Sarsa updates. Both questions 1 and 2 are episodic but provide rewards differently (+1 at the very end VS +1 throughout the episode) and we decided to use disctinct constants in each question in order to have the possibility to use distinct values.


In [4]:
class Q1C:
    # Use or do not use seeding
    USE_SEED = True

    # Episode modes
    MODE_TRAIN = 0
    MODE_TEST = 1

    # Types of returns
    RETURN_METRICS_CUMULATIVE = 0
    RETURN_METRICS_AVERAGE = 1
    RETURN_METRICS_DISCOUNTED = 2
    DEFAULT_RETURN_METRICS = RETURN_METRICS_CUMULATIVE

    # Gamma
    DEFAULT_GAMMA = 0.99

### 4.2.&nbsp;Sarsa & Expected Sarsa agents

We implemented one class for both Sarsa and Expected Sarsa as these algorithms are very similar.

The argument `is_expected_sarsa` defines if the agent uses the standard TD(0) Sarsa or the Expected Sarsa algorithm.

- `alpha` is the learning rate and `beta`the temperature for Boltzman softmax action selection.
- `select_action()` implements the Boltzman softmax action selection policy and calls `get_action_softmax_probabilities()` to get the probability of picking each action.
- `run_episodes()` is called to run a series of episodes. The parameter `mode` defines if those episodes are training or testing.
- `update_value_using_sarsa()` is called by `run_episodes()` if `is_expected_sarsa` is `False`.
- `update_value_using_expected_sarsa()` is called by `run_episodes()` if `is_expected_sarsa` is `True`.
- `update_return()` calculate the return for each of the series of episodes that the agent is currently running. It uses `DEFAULT_RETURN_METRICS` to select whether it calculates undiscounted or discounted cumulative rewards, or averaged rewards.


In [5]:
class SarsaAgent:
    def __init__(
        self, environment, alpha, beta, gamma=None, is_expected_sarsa=None, return_metrics=None
    ):
        """
        :environmnent: the frozen lake environment.
        :alpha: the learning rate used in policy evaluation.
        :beta: the temperature parameter used for Boltzman soft-max action selection.
        :use_expected_sarsa:(optional, defaults to False) indicates whether this agent is Sarsa or Expected Sarsa.
        :gamma:(optional, defaults to 0.98) discount factor for cumulative discounted return.
        :return_metrics:(optional, defaults to...) specify how return should be calculated (cumulative, averaged or discounted cumulative).
        :return: None.
        """

        # setting class parameters
        self.environment = environment
        self.alpha = alpha
        self.beta = beta
        # optional parameters
        self.is_expected_sarsa = False if is_expected_sarsa is None else is_expected_sarsa
        self.gamma = Q1C.DEFAULT_GAMMA if gamma is None else gamma
        self.return_metrics = (
            Q1C.DEFAULT_RETURN_METRICS if return_metrics is None else return_metrics
        )

        # the state-action values, matrix of shape: number of states x number of actions
        # note that if beta is very small, the softmax exponent can result in nan (exp(>=710) = inf)
        if self.beta < 0.01:
            self.q_values = np.zeros(
                (environment.observation_space.n, environment.action_space.n), dtype=np.longdouble
            )
        else:
            self.q_values = np.zeros(
                (environment.observation_space.n, environment.action_space.n), dtype=np.float64
            )

    # --------------------------------------
    # ACTION SELECTION
    # --------------------------------------
    def select_action(self, state, mode):
        """
        This method selects an action from a given state.
        Uses Boltzman softmax when in training mode and greedy action selection in testing mode.

        :state: the state from which to select an action.
        :mode: the mode this episode currently lies in.
        :return: the index of the selected action [0,3].
        """

        if mode == Q1C.MODE_TRAIN:
            action_probabilities = self.get_action_softmax_probabilities(state)
            return np.random.choice([*range(len(action_probabilities))], 1, p=action_probabilities)[
                0
            ]
        else:
            return np.random.choice(
                np.where(self.q_values[state, :] == np.max(self.q_values[state, :]))[0]
            )

    def get_action_softmax_probabilities(self, state):
        # getting all q values of actions for the particular state
        action_q_values = self.q_values[state, :]
        # calculating the Boltzman softmax denominator
        denominator = sum([np.exp(qv / self.beta) for qv in action_q_values])
        # for each action calculate the Boltzman softmax probability distribution and return the array
        return [np.exp(qv / self.beta) / denominator for qv in action_q_values]

    # --------------------------------------
    # RUNNING EPISODES
    # --------------------------------------
    def run_episodes(self, nb_episodes, mode):
        """
        :mode:Int: if the episodes should be run in training or testing mode.
        :nb_episodes:Int the number of successive episodes to run.

        :return:List(Float): numpy 1d array containing the return of each of the episodes we ran.
        """

        # initialize 0 return for each episode
        episode_returns = np.zeros(nb_episodes)

        # looping through all episodes
        for episode_index in range(nb_episodes):
            # at the start of each episode: reset the environment and set current state
            # passing the seed if we use a seed
            current_state, _ = self.environment.reset()

            # select the first action given initial state
            current_action = self.select_action(current_state, mode)

            # set a flag to check whether we have reached a terminal state
            # every episode is very likely to terminate in slippery mode, but infinite
            # loop could happen more easily with deterministic environmnent => we thus use a is_truncated flag as well
            is_terminal_state = False
            is_truncated = False

            # define step index to be incremented (only for average return)
            step_index = 0

            # loop until we have reached a terminal state
            while not is_terminal_state and not is_truncated:
                # first take the current action and obtain the reward, next_state and
                # whether we end up in a terminal state
                next_state, reward, is_terminal_state, is_truncated, _ = self.environment.step(
                    current_action
                )
                # we now get the next action with the same action selection method
                next_action = self.select_action(next_state, mode)

                # update the state-action value only if in training mode, otherwise do not update
                # perform Expected Sarsa update or standard Sarsa update
                if mode == Q1C.MODE_TRAIN:
                    if self.is_expected_sarsa:
                        self.update_value_using_expected_sarsa(
                            current_state, current_action, reward, next_state
                        )
                    else:
                        self.update_value_using_sarsa(
                            current_state, current_action, reward, next_state, next_action
                        )

                # update the current state and the current_action to be the next state and action
                current_action = next_action
                current_state = next_state

                # saving the return for graphs
                episode_returns[episode_index] = self.update_return(
                    episode_returns[episode_index], reward, step_index
                )

                step_index += 1

        return episode_returns

    # --------------------------------------
    # UPDATE ACTION-STATE VALUES
    # --------------------------------------
    def update_value_using_sarsa(self, s, a, r, s_prime, a_prime):
        """
        This method updates the q-table using TD(0) Sarsa update rule.

        :s: the current state.
        :a: the current action taken.
        :r: the reward received from taking action a from state s.
        :s_prime: the destination state.
        :a_prime: the next action taken.
        :is_terminal_state: whether the s_prime is a terminal state.
        """

        # Calculate the TD-error
        error = r + self.gamma * self.q_values[s_prime, a_prime] - self.q_values[s, a]
        self.q_values[s, a] = self.q_values[s, a] + self.alpha * error

    def update_value_using_expected_sarsa(self, s, a, r, s_prime):
        """
        Updating the q-table using Expected Sarsa update rule.

        :s: current state.
        :a: current action.
        :r: reward received for taking current action from current state.
        :s_prime: next state.
        """
        action_values = self.q_values[s_prime, :]
        action_probabilities = self.get_action_softmax_probabilities(s_prime)
        expected_value = 0
        for action_index, action_value in enumerate(action_values):
            expected_value += action_probabilities[action_index] * action_value

        error = r + self.gamma * expected_value - self.q_values[s, a]
        self.q_values[s, a] = self.q_values[s, a] + self.alpha * error

    # --------------------------------------
    # EPISODE RETURN (for graph)
    # --------------------------------------
    def update_return(self, current_return, reward, step_index):
        """
        This method update the current episode's return with the reward earned at each step.
        We implement three different return metrics: cumulative undiscounted return, cumulative discounted return, averaged return
        """
        if self.return_metrics == Q1C.RETURN_METRICS_CUMULATIVE:
            return current_return + reward
        elif self.return_metrics == Q1C.RETURN_METRICS_DISCOUNTED:
            return current_return + reward * (self.gamma**step_index)
        else:
            return current_return + (1 / (step_index + 1) * (reward - current_return))

### 4.3. Testing the implementation

We wanted to have a visual representation of q values and the learned policy. It helped assess whether the agent was performing adequately before running the actual experiment. We decided to include it in this document.


#### 4.3.1.&nbsp;Viewer class

We construct viewer for the FrozenLake 4x4 environment that builds a representation of the grid and displays q-values for each state and action.


In [6]:
class FrozenGridViewer:
    def __init__(self, agent):
        self.agent = agent
        self.html = ""

    @staticmethod
    def indexToCoord(index, nb_columns):
        x = index % nb_columns
        y = index // nb_columns
        return x, y

    @staticmethod
    def coordToIndex(y, x, nb_columns):
        return y * nb_columns + x

    @staticmethod
    def get_cell_colors(y, x):
        cell_index = FrozenGridViewer.coordToIndex(y, x, 4)
        if cell_index == 5 or cell_index == 7 or cell_index == 11 or cell_index == 12:
            return "1887b8"
        elif cell_index == 0:
            return "86ff40"
        elif cell_index == 15:
            return "fcff40"
        else:
            return "c7eeff"

    def build_grid(self):
        html = """<table style="border:1px solid black; border-collapse:collapse;">"""

        for y in range(4):
            html += "<tr>"
            for x in range(4):
                bg_col = FrozenGridViewer.get_cell_colors(y, x)
                html += f'<td style="border:1px solid black;background-color:#{bg_col};">{self.build_cell(y, x)}</td>'
            html += "</tr>"

        html += "</table>"
        return html

    def build_cell(self, y, x):
        cell_index = FrozenGridViewer.coordToIndex(y, x, 4)
        action_values = self.agent.q_values[cell_index, :]
        max_actions = np.where(action_values == np.max(action_values))[0]

        bold_css_0 = (
            "font-weight:bold;font-size:12px;color:red;"
            if 0 in max_actions and action_values[0] != 0
            else ""
        )
        bold_css_1 = (
            "font-weight:bold;font-size:12px;color:red;"
            if 1 in max_actions and action_values[1] != 0
            else ""
        )
        bold_css_2 = (
            "font-weight:bold;font-size:12px;color:red;"
            if 2 in max_actions and action_values[2] != 0
            else ""
        )
        bold_css_3 = (
            "font-weight:bold;font-size:12px;color:red;"
            if 3 in max_actions and action_values[3] != 0
            else ""
        )

        html = '<div style="font-size:9px;">'
        html += f'<div style="display:block;width:100px;height:25px;text-align:center;line-height:20px;{bold_css_3}">{round(action_values[3], 4)}</div>'
        html += f'<div style="display:block;width:50px;height:25px;float:left;text-align:center;line-height:20px;{bold_css_0}">{round(action_values[0], 4)}</div>'
        html += f'<div style="display:block;width:50px;height:25px;float:right;text-align:center;line-height:20px;{bold_css_2}">{round(action_values[2], 4)}</div>'
        html += '<div style="display:block;height:0px;width:0px;clear:both;"></div>'
        html += f'<div style="display:block;width:100px;height:25px;text-align:center;line-height:20px;{bold_css_1}">{round(action_values[1], 4)}</div>'
        html += "</div>"
        return html

    def display(self):
        return self.build_grid()

#### 4.3.2.&nbsp;Training agents

We train 4 agents over 1000 episodes:

- Sarsa on deterministic environment.
- Sarsa on stochastic environment.
- Expected Sarsa on deterministic environment.
- Expected Sarsa on stochastic environment.

All agents'parameters are arbitrarily set to $\alpha = 0.1$ and $\beta = 0.01$.

**_Approx. 4 seconds to run in Colab._**


In [None]:
env_deterministic = gym.make("FrozenLake-v1", desc=None, map_name="4x4", is_slippery=False)
env_stochastic = gym.make("FrozenLake-v1", desc=None, map_name="4x4", is_slippery=True)

if Q1C.USE_SEED:
    seed_everything(None, env_deterministic, env_stochastic)

sarsa_deterministic = SarsaAgent(env_deterministic, alpha=0.1, beta=0.01, is_expected_sarsa=False)
sarsa_stochastic = SarsaAgent(env_stochastic, alpha=0.1, beta=0.01, is_expected_sarsa=False)
exp_sarsa_deterministic = SarsaAgent(
    env_deterministic, alpha=0.1, beta=0.01, is_expected_sarsa=True
)
exp_sarsa_stochastic = SarsaAgent(env_stochastic, alpha=0.1, beta=0.01, is_expected_sarsa=True)

starttime = time.time()

_ = sarsa_deterministic.run_episodes(1000, Q1C.MODE_TRAIN)
_ = sarsa_stochastic.run_episodes(1000, Q1C.MODE_TRAIN)
_ = exp_sarsa_deterministic.run_episodes(1000, Q1C.MODE_TRAIN)
_ = exp_sarsa_stochastic.run_episodes(1000, Q1C.MODE_TRAIN)

exec_time = time.time() - starttime
logger.info(f"Training 4 agents over 1000 episodes took {round(exec_time, 2)} seconds.")

#### 4.3.3.&nbsp;Displaying learned policy

Here we construct an HTML table to show the learned policy for the 4 agent we just trained.


In [None]:
fg1 = FrozenGridViewer(sarsa_deterministic)
fg2 = FrozenGridViewer(sarsa_stochastic)
fg3 = FrozenGridViewer(exp_sarsa_deterministic)
fg4 = FrozenGridViewer(exp_sarsa_stochastic)

html = '<table style="border:1px solid black;">'

html += "<tr>"
html += '<td colspan=2 style="background-color:#cccccc;color:#000000;font-size:14px;text-align:center;padding:3px;font-weight:bold;">Sarsa</td>'
html += "</tr>"

html += "<tr>"
html += '<td style="background-color:#efefef;color:#000000;font-size:14px;text-align:center;padding:3px;">Non-slippery</td>'
html += '<td style="background-color:#efefef;color:#000000;font-size:14px;text-align:center;padding:3px;">Slippery</td>'
html += "</tr>"

html += "<tr>"
html += '<td style="padding:5px">' + fg1.display() + "</td>"
html += '<td style="padding:5px">' + fg2.display() + "</td>"
html += "</tr>"


html += "<tr>"
html += '<td colspan=2 style="background-color:#cccccc;color:#000000;font-size:14px;text-align:center;padding:3px;font-weight:bold;">Expected Sarsa</td>'
html += "</tr>"

html += "<tr>"
html += '<td style="background-color:#efefef;color:#000000;font-size:14px;text-align:center;padding:3px;">Non-slippery</td>'
html += '<td style="background-color:#efefef;color:#000000;font-size:14px;text-align:center;padding:3px;">Slippery</td>'
html += "</tr>"

html += "<tr>"
html += '<td style="padding:5px">' + fg3.display() + "</td>"
html += '<td style="padding:5px">' + fg4.display() + "</td>"
html += "</tr>"

html += "<tr>"
html += '<td colspan=2 style="font-size:12px;">'
html += '<div style="display:block; max-width:800px; padding:5px; margin:auto"><strong>FrozenLake v1 4x4 grid with action-state values and policies after training.</strong> '
html += 'The <span style="background-color:#86ff40;padding:1px;">green cell</span> is the starting position. The <span style="background-color:#fcff40;padding:1px;">yellow cell</span> represents the goal. <span style="background-color:#c7eeff;padding:1px;">Light-blue cells</span> are normal cells and <span style="background-color:#1887b8;padding:1px;">dark-blue cells</span> are holes. '
html += "In every cell, state-action values are represented at the position for each action (north, east, south, west). Bold red values indicate maximum values and thus the learned policy.</div>"
html += "</td>"
html += "</tr>"

html += "</table>"

display(IPython.display.HTML(html))

### 4.4.&nbsp;Experiment


#### 4.4.1.&nbsp;Experiment class

We create a class to run the experiment on a number of different agents. An agent is defined by the type of algorithm (Sarsa or Expected Sarsa), the value for $\alpha$ and the value for $\beta$.

**Utility methods:**

- `init_results()` creates for each agent a 2d numpy array of shape (number of runs, total number of episodes).
- `make_agents()` creates and returns a Sarsa and Expected Sarsa agent for each combination of $\alpha$ and $\beta$.
- `get_segment_boundaries()` takes the segment index as an argument and return the index of its first episode, of its last training episode and of it last testing episode.

**Running experiment:**

- `run_experiment()` runs the experiment for all the agents and stores the episode returns (according to the default return metrics) in the `results` member. At the start of each run, agents are instantiated, the environment is re-created, and everything is seeded with a new seed (if `Q1C.USE_SEED` is `True`).

**Accessing data:**

- `get_parameter_evaluation_data()` returns the data arranged in a suitable format for the first and second part of question 1: assessing the effect of hyperparameters. The argument `mode` controls whether we return the returns of the training or testing episodes of the last segment (averaged over all the episode of the last segment and the runs).
- `get_learning_curve_data()` returns the data arranged for the third part of question 1: learning curves. `best_sarsa` and `best_expected_sarsa` are 2-element arrays where the first element is the index of $\alpha$ and the second element is the index of $\beta$ that we considered the _best_ parameters. `only_testing` determines if it returns the return of all the episodes (testing & training) or only the testing episodes.

**Graphs:**

- `make_param_plot()` makes and display the plot for the first and second part of question 1, as determined by the value of `mode` (training or testing).
- `make_learning_curve_plot()` makes and display the plot for the third part of question 1. Again, `best_sarsa` and `best_expected_sarsa` are 2-element arrays containing the index for the best $\alpha$ and $\beta$.


In [9]:
class SarsaExperiment:
    ALGO_TYPE_SARSA = 0
    ALGO_TYPE_EXPECTED_SARSA = 1

    def __init__(
        self, alphas, betas, nb_runs, nb_segments, nb_training_episodes, nb_testing_episodes
    ):
        # arrays for different values of alpha and beta
        self.alphas = alphas
        self.betas = betas
        # run settings
        self.nb_runs = nb_runs
        self.nb_segments = nb_segments
        self.nb_training_episodes = nb_training_episodes
        self.nb_testing_episodes = nb_testing_episodes

        # standard dict for storing the results
        # for each algorithm, stores a 2d np.array of shape (run, episode_return)
        # the dictionnary keys are of form ('algo_type', alpha_index, beta_index)
        self.results = {}
        self.init_results()

        # Instantiate the FrozenLake 4x4 environment
        self.environment = None

    # --------------------------------------
    # UTILITY METHODS
    # --------------------------------------
    def init_results(self):
        """
        Creates and sets a dictionary to store the returns of the experiment, where each key is an agent algorithm and each value is a 2d numpy array of shape (run_inde, episode_index).
        """
        total_nb_episodes_per_run = (self.nb_segments * self.nb_training_episodes) + (
            self.nb_segments * self.nb_testing_episodes
        )
        for alpha_index, _ in enumerate(self.alphas):
            for beta_index, _ in enumerate(self.betas):
                self.results[(SarsaExperiment.ALGO_TYPE_SARSA, alpha_index, beta_index)] = np.zeros(
                    (self.nb_runs, total_nb_episodes_per_run)
                )
                self.results[
                    (SarsaExperiment.ALGO_TYPE_EXPECTED_SARSA, alpha_index, beta_index)
                ] = np.zeros((self.nb_runs, total_nb_episodes_per_run))

    def make_agents(self):
        """
        Instantiate all the agent algorithms. One algorithm is defined by a combination of the agent type (Sarsa or Expected Sarsa), and its value of alpha and beta.
        """
        agents = {}
        for alpha_index, alpha in enumerate(self.alphas):
            for beta_index, beta in enumerate(self.betas):
                agents[(SarsaExperiment.ALGO_TYPE_SARSA, alpha_index, beta_index)] = SarsaAgent(
                    self.environment, alpha, beta, is_expected_sarsa=False
                )
                agents[(SarsaExperiment.ALGO_TYPE_EXPECTED_SARSA, alpha_index, beta_index)] = (
                    SarsaAgent(self.environment, alpha, beta, is_expected_sarsa=True)
                )
        return agents

    def get_segment_boundaries(self, segment_index):
        """
        Return the first episode, the last training episode and the last testing episode of the segment.

        segment_index:Int: the segment we want to find the boundary of.
        """
        start = (segment_index * self.nb_training_episodes) + (
            segment_index * self.nb_testing_episodes
        )
        training_end = start + self.nb_training_episodes
        testing_end = training_end + self.nb_testing_episodes
        return start, training_end, testing_end

    # --------------------------------------
    # RUNNING EXPERIMENT
    # --------------------------------------
    def run_experiment(self):
        """
        Runs the experiment for all algorithms. At each new run, makes a new environment, new agents, and uses new seed (if USE_SEED = True)
        """
        logger.info("Starting experiment.")
        exp_starttime = time.time()

        # Loop through all runs
        for run_index in range(self.nb_runs):
            run_starttime = time.time()
            logger.info(f"> Starting run {run_index}.\n")

            # At start of run, instantiate all 18 agents (1 for each combination of sarsa type, alpha and beta)
            self.environment = gym.make(
                "FrozenLake-v1", desc=None, map_name="4x4", is_slippery=True
            )
            agents = self.make_agents()

            # Setting the seed
            if Q1C.USE_SEED:
                seed_everything(SEED + run_index, self.environment)

            # keeping a count of segment to log messages every 50 segments (avoid modulo because slow)
            segment_count = 0
            # Loop through all segments
            for segment_index in range(self.nb_segments):
                if segment_count == 0:
                    segm_starttime = time.time()
                    logger.info(f"    > Starting segment {segment_index}.")

                # Find start and stop indices for total episodes
                start, training_end, testing_end = self.get_segment_boundaries(segment_index)
                # for each agent, stores the returns of each episode in results
                for agent_key, agent in agents.items():
                    # run segment's training episodes
                    training_returns = agent.run_episodes(
                        self.nb_training_episodes, mode=Q1C.MODE_TRAIN
                    )
                    # run segment's testing episodes
                    testing_returns = agent.run_episodes(
                        self.nb_testing_episodes, mode=Q1C.MODE_TEST
                    )
                    # store episodes' returns for training and testing
                    self.results[agent_key][run_index, start:training_end] = training_returns
                    self.results[agent_key][run_index, training_end:testing_end] = testing_returns

                if segment_count == 49:
                    exec_time = time.time() - segm_starttime
                    logger.info(
                        f"    > Finished 50 segments at segment {segment_index} in {round(exec_time, 2)} seconds.\n"
                    )
                    segment_count = 0
                elif segment_index == self.nb_segments - 1:
                    exec_time = time.time() - segm_starttime
                    logger.info(
                        f"    > Finished last segments of run at segment {segment_index} in {round(exec_time, 2)} seconds.\n"
                    )
                    segment_count = 0
                else:
                    segment_count += 1

            run_exec_time = time.time() - run_starttime
            logger.info(f"> Finished run {run_index} in {round(run_exec_time, 2)} seconds.\n\n")

        exp_exec_time = time.time() - exp_starttime
        logger.info(f"Finished experiment in {round(exp_exec_time, 2)} seconds.")

    # --------------------------------------
    # ACCESSING DATA
    # --------------------------------------
    def get_parameter_evaluation_data(self, mode):
        """
        This method returns the return data for parameter curves. It returns the mean of the episodes' return over the episodes of the last segment and the runs.

        :mode: Int: If MODE_TRAIN, plots the return of the training episodes of the last segment (averaged over the runs). If MODE_TEST, plots the return of the testing episodes of the last segment (averaged over runs).
        """
        # dictionary containing the data for plots
        #   with keys representing algorithms (Sarsa or Expected Sarsa with each value for alpha) for a total of 6 algorithms (= 6 curves)
        #   with values being an array of 3 values containing the averaged return for each value of beta
        data = {}
        for beta_index, _ in enumerate(self.betas):
            data[(SarsaExperiment.ALGO_TYPE_SARSA, beta_index)] = np.zeros(len(self.alphas))
            data[(SarsaExperiment.ALGO_TYPE_EXPECTED_SARSA, beta_index)] = np.zeros(
                len(self.alphas)
            )

        for agent_key, agent_results in self.results.items():
            # get the boundaries of the last segment's training/testing episodes
            last_segment_start, last_segment_training_end, last_segment_testing_end = (
                self.get_segment_boundaries(self.nb_segments - 1)
            )
            # get a 2d array of all the runs x all the returns of the episodes in the boundaries
            # calculate the mean on the flattened array and storing it in the data dictionnary
            if mode == Q1C.MODE_TRAIN:
                mean = np.mean(agent_results[:, last_segment_start:last_segment_training_end])
            else:
                mean = np.mean(agent_results[:, last_segment_training_end:last_segment_testing_end])

            data[(agent_key[0], agent_key[2])][agent_key[1]] = mean
        return data

    def get_learning_curve_data(self, best_sarsa, best_expected_sarsa, only_testing):
        """
        This method returns the data for agents' learning curves as measured by their return averaged over the runs.

        :best_sarsa: List(Float): a list of 2 elements. The first element is the index of the chosen alpha from the experiment's alphas. The second element is the index of the chosen beta from the experiment's betas.
        :best_expected_sarsa: List(Float): similar list as best_sarsa, but for expected_sarsa's parameters.
        :only_testing: Bool: specifies if only testing episodes or all episodes are plotted. Defaults to False.
        """
        if only_testing:
            best_sarsa_results = np.zeros(
                (self.nb_runs, self.nb_segments * self.nb_testing_episodes)
            )
            best_expected_sarsa_results = np.zeros(
                (self.nb_runs, self.nb_segments * self.nb_testing_episodes)
            )

            for segment_index in range(self.nb_segments):
                # start & end are the start/end indices of the testing episodes of the current segment within the total results
                # this_start & this_end are the start/end indices of the data for only test episodes used in this graph
                _, start, end = self.get_segment_boundaries(segment_index)
                this_start = segment_index * self.nb_testing_episodes
                this_end = segment_index * self.nb_testing_episodes + self.nb_testing_episodes

                best_sarsa_results[:, this_start:this_end] = self.results[
                    (SarsaExperiment.ALGO_TYPE_SARSA, best_sarsa[0], best_sarsa[1])
                ][:, start:end]
                best_expected_sarsa_results[:, this_start:this_end] = self.results[
                    (
                        SarsaExperiment.ALGO_TYPE_EXPECTED_SARSA,
                        best_expected_sarsa[0],
                        best_expected_sarsa[1],
                    )
                ][:, start:end]
        else:
            best_sarsa_results = self.results[
                (SarsaExperiment.ALGO_TYPE_SARSA, best_sarsa[0], best_sarsa[1])
            ]
            best_expected_sarsa_results = self.results[
                (
                    SarsaExperiment.ALGO_TYPE_EXPECTED_SARSA,
                    best_expected_sarsa[0],
                    best_expected_sarsa[1],
                )
            ]

        best_sarsa_mean = np.mean(best_sarsa_results, axis=0)
        best_expected_sarsa_mean = np.mean(best_expected_sarsa_results, axis=0)

        best_sarsa_std = np.std(best_sarsa_results, axis=0)
        best_expected_sarsa_std = np.std(best_expected_sarsa_results, axis=0)

        if USE_STD:
            best_sarsa_std = best_sarsa_std / np.sqrt(self.nb_runs)
            best_expected_sarsa_std = best_expected_sarsa_std / np.sqrt(self.nb_runs)

        return best_sarsa_mean, best_sarsa_std, best_expected_sarsa_mean, best_expected_sarsa_std

    # --------------------------------------
    # GRAPHS
    # --------------------------------------
    def make_params_plot(self, mode, title):
        """
        This methods draws a graph showing agents' performances depending on hyperparameters values alpha (learning rate) and beta (Boltzman soft-max temperature).

        :mode: Int: If MODE_TRAIN, plots the return of the training episodes of the last segment (averaged over the runs). If MODE_TEST, plots the return of the testing episodes of the last segment (averaged over runs).
        :title: Str: The title of the figure.
        """

        colors = list(mcolors.TABLEAU_COLORS.keys())
        plt.style.use("seaborn")

        fig, ax = plt.subplots(figsize=[13, 6])
        i = 0
        for agent_key, agent_data in self.get_parameter_evaluation_data(mode).items():
            agent_name = (
                "Expected Sarsa"
                if agent_key[0] == SarsaExperiment.ALGO_TYPE_EXPECTED_SARSA
                else "Sarsa(0)"
            )
            agent_style = "--" if agent_key[0] == SarsaExperiment.ALGO_TYPE_EXPECTED_SARSA else "-"

            ax.plot(
                self.alphas,
                agent_data,
                linestyle=agent_style,
                color=colors[i],
                label=f"{agent_name}, β = {self.betas[agent_key[1]]}",
            )

            i += 1

        ax.legend()
        ax.set_xlabel("Learning rate α")
        if Q1C.DEFAULT_RETURN_METRICS == Q1C.RETURN_METRICS_AVERAGE:
            ax.set_ylabel("Return \n(average reward per episode)")
        elif Q1C.DEFAULT_RETURN_METRICS == Q1C.RETURN_METRICS_DISCOUNTED:
            ax.set_ylabel("Return \n(discounted cumulative return)")
        else:
            ax.set_ylabel("Return \n(undiscounted cumulative return)")

        ax.set_title(title, fontsize=12, fontweight="bold")

        plt.show()

    def make_learning_curve_plot(
        self, best_sarsa, best_expected_sarsa, title, only_testing=False, draw_individual=True
    ):
        """
        This method draws a graph for agents' learning curves as measured by their return averaged over the runs.

        :best_sarsa: List(Float): a list of 2 elements. The first element is the index of the chosen alpha from the experiment's alphas. The second element is the index of the chosen beta from the experiment's betas.
        :best_expected_sarsa: List(Float): similar list as best_sarsa, but for expected_sarsa's parameters.
        :title: String: the title of the figure.
        :only_testing: Bool: specifies if only testing episodes or all episodes are plotted. Defaults to False.
        :draw_individual: Bool: specifies if should provide 3 plots instead of one: for both algorithms and each of the algorithms individually (readability purpose).
        """
        colors = list(mcolors.TABLEAU_COLORS.keys())
        plt.style.use("seaborn")

        if draw_individual:
            fig, (ax1, ax2, ax3) = plt.subplots(ncols=1, nrows=3, figsize=[16, 18])
            fig.suptitle(title, fontsize=16, fontweight="bold", y=0.94)
        else:
            fig, ax1 = plt.subplots(figsize=[16, 8])
            fig.suptitle(title, fontsize=16, fontweight="bold")

        best_sarsa_mean, best_sarsa_std, best_expected_sarsa_mean, best_expected_sarsa_std = (
            self.get_learning_curve_data(best_sarsa, best_expected_sarsa, only_testing)
        )

        ax1.plot(
            best_sarsa_mean,
            label=f"Sarsa, α = {self.alphas[best_sarsa[0]]}, β = {self.betas[best_sarsa[1]]}",
            linestyle="-",
            color="tab:blue",
        )
        ax1.fill_between(
            np.arange(best_sarsa_mean.shape[0]),
            best_sarsa_mean + best_sarsa_std,
            best_sarsa_mean - best_sarsa_std,
            alpha=0.2,
            color="tab:blue",
        )
        ax1.plot(
            best_expected_sarsa_mean,
            label=f"Expected Sarsa, α = {self.alphas[best_expected_sarsa[0]]}, β = {self.betas[best_expected_sarsa[1]]}",
            linestyle="-",
            color="tab:orange",
        )
        ax1.fill_between(
            np.arange(best_expected_sarsa_mean.shape[0]),
            best_expected_sarsa_mean + best_expected_sarsa_std,
            best_expected_sarsa_mean - best_expected_sarsa_std,
            alpha=0.2,
            color="tab:orange",
        )

        ax1.legend()

        if draw_individual:
            ax2.plot(
                best_sarsa_mean,
                label=f"Sarsa, α = {self.alphas[best_sarsa[0]]}, β = {self.betas[best_sarsa[1]]}",
                linestyle="-",
                color="tab:blue",
            )
            ax2.fill_between(
                np.arange(best_sarsa_mean.shape[0]),
                best_sarsa_mean + best_sarsa_std,
                best_sarsa_mean - best_sarsa_std,
                alpha=0.2,
                color="tab:blue",
            )
            ax2.legend()

            ax3.plot(
                best_expected_sarsa_mean,
                label=f"Expected Sarsa, α = {self.alphas[best_expected_sarsa[0]]}, β = {self.betas[best_expected_sarsa[1]]}",
                linestyle="-",
                color="tab:orange",
            )
            ax3.fill_between(
                np.arange(best_expected_sarsa_mean.shape[0]),
                best_expected_sarsa_mean + best_expected_sarsa_std,
                best_expected_sarsa_mean - best_expected_sarsa_std,
                alpha=0.2,
                color="tab:orange",
            )
            ax3.legend()

        if only_testing:
            ax1.set_xlabel("Episodes (testing episodes only)")
            if draw_individual:
                ax1.set_xlabel("")
                ax2.set_xlabel("")
                ax3.set_xlabel("Episodes (testing episodes only)")
        else:
            ax1.set_xlabel("Episodes (training and testing episodes)")
            if draw_individual:
                ax1.set_xlabel("")
                ax2.set_xlabel("")
                ax3.set_xlabel("Episodes (training and testing episodes)")

        if Q1C.DEFAULT_RETURN_METRICS == Q1C.RETURN_METRICS_AVERAGE:
            ax1.set_ylabel("Return \n(average reward per episode)")
            if draw_individual:
                ax2.set_ylabel("Return \n(average reward per episode)")
                ax3.set_ylabel("Return \n(average reward per episode)")
        elif Q1C.DEFAULT_RETURN_METRICS == Q1C.RETURN_METRICS_DISCOUNTED:
            ax1.set_ylabel("Return \n(discounted cumulative return)")
            if draw_individual:
                ax2.set_ylabel("Return \n(discounted cumulative return)")
                ax3.set_ylabel("Return \n(discounted cumulative return)")
        else:
            ax1.set_ylabel("Return \n(undiscounted cumulative return)")
            if draw_individual:
                ax2.set_ylabel("Return \n(undiscounted cumulative return)")
                ax3.set_ylabel("Return \n(undiscounted cumulative return)")

        if draw_individual:
            ax1.set_title("A. Sarsa and Expected Sarsa", fontsize=12, fontweight="bold")
            ax2.set_title("B. Sarsa only", fontsize=12, fontweight="bold")
            ax3.set_title("C. Expected Sarsa only", fontsize=12, fontweight="bold")

        plt.show()

####4.4.2.&nbsp;Running the experiment

**_Between 20 and 40 min. to run in Colab depending on the parameter selection (very small value of $\beta$ result in the use of long floats) and the time when the experiment was performed (shorter during the night)._** The report at the end provides screen shots of our results in order to avoid running the experiment at the time of corrections.

We logged the progress of the environment. To hide it, use `logging.basicConfig(level = logging.WARNING, force=True)`.


In [None]:
# Standard settings
alphas = [0.1, 0.5, 0.9]
betas = [0.001, 0.005, 0.01]
nb_runs = 10

# Mega run settings
# alphas = [0.01, 0.1, 0.3, 0.5, 0.7, 0.9]
# betas = [0.001, 0.005, 0.01]
# nb_runs = 50

exp = SarsaExperiment(alphas, betas, nb_runs, 500, 10, 1)
exp.run_experiment()

####4.4.3.&nbsp;Graphs


##### A.&nbsp;Effect of parameters on training


In [None]:
exp.make_params_plot(
    Q1C.MODE_TRAIN,
    title=f"Effects of hyperparameters on Sarsa and Expected Sarsa performances\n(averaged over last {exp.nb_training_episodes} training episodes and {exp.nb_runs} runs)",
)

##### B.&nbsp;Effect of parameters on testing


In [None]:
exp.make_params_plot(
    Q1C.MODE_TEST,
    title=f"Effects of hyperparameters on Sarsa and Expected Sarsa performances\n(last testing episode averaged over {exp.nb_runs} runs)",
)

##### C.&nbsp;Learning curves


Let us plot the return for each episode, averaged over the 10 runs. We first consider all episodes (training episodes).


In [None]:
exp.make_learning_curve_plot(
    [1, 1],
    [1, 2],
    title=f"Learning rate for Sarsa and Expected Sarsa for best α and β\n(averaged over {exp.nb_runs} runs)",
    draw_individual=True,
)

Training episodes do not allow for easy visualization of the learning curve as they are exploratory w.r.t. to the temperature β. Let us only plot the return of testing episodes.


In [None]:
exp.make_learning_curve_plot(
    [1, 1],
    [1, 2],
    title=f"Learning rate for Sarsa and Expected Sarsa for best α and β\n(averaged over {exp.nb_runs} runs)",
    only_testing=True,
    draw_individual=False,
)

### 4.5.&nbsp;Report

We implemented Sarsa(0) and Expected Sarsa in the same class and tested different values for the learning rate $\alpha$ and the Boltzman softmax temperature $\beta$. We run the experiment with 18 agents, 1 for each combination of (a) the algorithm (Sarsa or Expected Sarsa), (b) the value for $\alpha$ and (c) the value for $\beta$, over 10 runs, each consisting of 500 segment of 10 training episode and 1 testing episode.

#### **Visualizing the policy**

This part was not required. Since the Frozen Lake environement is very stochastic, we first desired to get a visual representation of the learnt policy. In **section 4.3.**, we implemented a "viewer" to represent the grid-world and display values for each action in each state. We ran 1000 training episodes for 2 different algorithms, Sarsa(0) and Expected Sarsa, in 2 different environments, deterministic and non deterministic. Each algorithm was parametrized with $\alpha = 0.1$ and $\beta = 0.01$

<img src="https://i.imgur.com/TzKrNXj.png" alt="Grid world to show Sarsa's and Expected Sarsa's optimal policy in stochastic and non-stochastic environments." width=600/>

**_Figure 1:_** _Grid world showing Sarsa's and Expected Sarsa's optimal policy in stochastic and non-stochastic environments. The optimal policy is represented by the red, bold values._

In the **_non-stochastic environment_**, both Sarsa and Expected Sarsa find their way to the goal in a very straightforward manner. Zero values in cells other than holes and the goal are because the optimal action is so dominant that other action get very small values and round up to zero.

The agents are much more hesitant in the **_stochastic environment_**, which results from its very high stochasticity (1/3 chance of actually doing what the agent intended to do). In some cases, it is obvious what the agent intended to do. For instance, in the first-row second-column cell (0,1), both agents decide to go up and thus avoid falling in the hole of the cell below, while maintaining a probability of ending up in the cell to the right (because of the stochastic environment), and thus of progressing toward the goal. In the second-row third-column cell (1,2), both Sarsa and Expected Sarsa decide to go left, directly into a hole, but there is actually less chance of falling into a hole when picking this action (1/3), compared to the action of going either up or down (2/3), as this cell is surrounded by two holes.

In other cases, it is much more difficult to understand the agents' choices. For instance in the first-row last-column cell (0,3), Expected Sarsa decides to go down directly into a hole, which is definitely not very beneficial, compared to Sarsa's more "safe" choice. We conclude that in such a highly stochastic environment, the agent just do not perform well here given the 1000 training episodes of our experiment.

#### **Selection of hyperparameters**

We used values of $\alpha = [0.1, 0.5, 0.9]$ and $\beta = [0.001, 0.005, 0.01]$ and inspected the impact of hyperparameters by plotting $\alpha$ values against the cumulative (undiscounted) return, averaged over the 10 runs for the last 10 training episodes (**Figure 2**) and for the last testing episode (**Figure 3**). Each graph thus shows 6 curves, each of them representing a combination of the algorithm and $\beta$. We chose this range for $\alpha$ and $\beta$ after several tries in order to show the typical upside-down v-shape curves.

The discount factor $\gamma$ was not specified. While $\gamma=1$ would not be "inappropriate" in such episodic tasks it can lead to less stability in the learning process. On the other hand, if $\gamma$ is too low, it may slow the learning process, particularly in the frozen lake environment, where only +1 rewards are generated when reaching the goal (which can be relatively far from the start state). We chose to use a value very close to 1, that is $\gamma=0.99$. This parameter can be modified in the Q1C constant class (**section 4.1.**).

For the return to be plotted, we allowed to show _cumulative undiscounted return_, _cumulative discounted return_ (using the prespecified `Q1C.DEFAULT_GAMMA`) or _average return_ over the episode. We chose to show _cumulative undiscounted return_ because we thought it would allow for a more meaningful interpretation in the frame of the Frozen Lake environment: since there is only +1 rewards when reaching the goal and zero otherwise, cumulative undiscounted return is analogous to the probability of successful episodes, when averaged over the runs.

<img src="https://i.imgur.com/49cCUJr.png" alt="Effects of hyperparameters on training performance." width=600/>

**_Figure 2:_** _Effects of hyperparameters on training performance._

<img src="https://i.imgur.com/5Soo6Pu.png" alt="Effects of hyperparameters on training performance." width=600/>

**_Figure 3:_** _Effects of hyperparameters on training performance._

We observe the upside-down V-shaped all curves except for when $\beta$ is 0.001. We may not have enough $\alpha$ values to have identified the top of the curve or it may result from the highly stochastic environment and the limited sample size (10 runs). Curiously, with those parameters, Sarsa seem to outperform Expected Sarsa, which is contrary to the theory. Again, this could be that we have not found a particular $\alpha$ value that really maximizes the performance of Expected Sarsa or because of randomness in our limite sample.

When considering training episodes, the best agents showed to find the goal between 55% and 65% of the time (undiscounted reward, analogous to the percent of time the agent gets a +1 reward for having found the goal). When considering testing alone, results were slightly better with Sarsa reaching the goal 70% of times.

Sarsa performed best with $\beta = 0.01$, which is more exploratory, when considering training and with $\beta = 0.005$, which is more greedy w.r.t. the optimal policy, when considering testing. Interestingly, Expected Sarsa showed opposite results with $\beta = 0.005$ when considering training and with $\beta = 0.01$ when considering testing.

The best $\alpha$ value 0.5. $\alpha$ values that are too low or too high can result in a slower learning process by either expanding the time needed to reach convergence or "overshooting" the optimal state-action value at each update. Again, we limited the experiment to 3 values for $\alpha$ and we might have missed a better value, for instance around 0.1.

#### **Learning curves**

We chose the parameters that maximized return when considering only the testing episodes, that is Sarsa with $\alpha = 0.5$ and $\beta = 0.005$ and Expected Sarsa with $\alpha = 0.5$ and $\beta = 0.01$.

<img src="https://i.imgur.com/A07LL1K.png" alt="Learning curves of Sarsa and Expected Sarsa for all episodes of the experiment. A. Sarsa and Expected Sarsa. B. Sarsa only. C. Expected Sarsa only." width=600/>

**Figure 4:** _Learning curves as measured by cumulative (undiscounted) episode return for Sarsa and Expected Sarsa for all episodes (training & testing) in the experiment, averaged over the 10 runs. A. Sarsa and Expected Sarsa. B. Sarsa only. C. Expected Sarsa only._

Because readability is hampered by the number of episodes, we show learning curves considering only testing episodes.

<img src="https://i.imgur.com/wQS0VDQ.png" alt="Learning curves of Sarsa and Expected Sarsa for only testing episodes." width=600/>

**Figure 5:** _Learning curves as measured by cumulative (undiscounted) episode return for Sarsa and Expected Sarsa for only testing episodes, averaged over the 10 runs._

We show return as measured by cumulative undiscounted reward for each episode, over the entire learning process (Figure 4) and only over the testing episodes (Figure 5). We observe the agents learn quickly reaching maximal return of approximately 0.6, after the 1250th episode, approximately. Again, considering we plot cumulative reward, averaged over the runs, and considering our environment only returns +1 rewards when reaching the goal and 0 otherwise, this is analogous to the probability of reaching the goal. Expected Sarsa performs slightly better than Sarsa as expected (Expected Sarsa trades computation time for slightly better results compared with Sarsa). Variance is however relatively high and does not seem to reduce much as the experiment continues. We concluded this being the result of our highly stochastic environment, and of our sample size, limited to 10 runs.


## 5.&nbsp;Question 2: Function approximation in RL

Implement and compare empirically Q-learning and actor-critic with linear function approximation on the cart-pole domain from the Gym environment suite:

https://gymnasium.farama.org/environments/classic_control/cart_pole/

For this experiment, you should use a function approximator in which you discretize the state
variables into 10 bins each; weights start initialized randomly between $−0.001$ and $0.001$. You will need to use the same seed for this initialization for all parameters settings, but will have 10 different seeds (for the different runs). Use 3 settings of the learning rate parameter $\alpha$ = 1/4,1/8,1/16. Perform 10 independent runs, each of 1000 episodes. Each episode should start at a random initial state. The exploration policy should be $\epsilon$-greedy and you should use 3 values of $\epsilon$ of your choice.
Plot for each algorithm 3 graphs, one for each $\epsilon$, containing the average and standard error of the learning curves for each value of $\alpha$ (each graph will have 3 curves). Make sure all graphs are on the same scale. Based on these graphs, pick what you consider the best parameter choice for both $\epsilon$ and $\alpha$ and show the best learning curve for Q-learning and actor-critic on the same graph. Write a small report that describes your experiment, your choices of parameters, and the conclusions you draw from this experimentation.


### 5.1.&nbsp;Q2 constants

Class `Q2C` encapsulating constants used in question 2.

- `DEFAULT_GAMMA` is the discount factor used in Q-learning and Actor-Critic updates. As mentioned in Q1, both Q1 and Q2 are episodic but provide rewards differently (+1 at the very end VS +1 throughout the episode) and we decided to use disctinct constants in each question in order to have the possibility to use distinct values.


In [15]:
class Q2C:
    SAVE_FILES = True
    SHOW_GRAPHS = True

    # Gamma
    DEFAULT_GAMMA = 1.0

### 5.2. Q2 Plot Helper

We defined a function $\text{plot}$ to generalize plotting tasks. The first agument _data_ contains the data and is of shape: (number of figures, number of curves, number of steps, number of runs).


In [16]:
def plot(
    data: np.array,
    title: str,
    main_labels: List[str],
    ax_titles: List[str],
    algs_info: List[Tuple[str, str, int]],
    range_y: List[Tuple[float, float]] = None,
    size: List[int] = (10, 10),
    fill_std: List[int] = None,
    legend_loc: List[str] = None,
    filename: str = None,
    show: bool = False,
):
    """
    Args:
        data (np.array): All that needs to be plotted. Shape: (n_figs, n_curves, n_steps, n_runs).
        title (str): The title of the plot.
        main_labels (List[str]): A list of strings, the first element being the label of the x axis, and the rest being the labels of the y axes.
        ax_titles (List[str]): A list of titles for each subplot.
        algs_info (List[Tuple[str, str, int]]): A list of tuples containing the information of each curve. Each tuple contains the following: (name, color, marker type). The marker type is an integer denoting the type of marker to be used. 0: normal full line, 1: dashed line, 2: scatter plot.
        range_y (List[Tuple[float, float]], optional): A list of tuples of length n_figs, each denoting the range of the y axis of the corresponding subplot. Defaults to None.
        size (List[int], optional): The size of the plot. Defaults to (10, 10).
        fill_std (List[int], optional): A list of integers of length n_figs, each denoting whether and how to fill the standard deviation of the corresponding subplot. If None, no standard deviation will be filled. If 0, the standard deviation will be filled with a transparent color. If 1, the standard deviation will be a dashed line above and bellow the mean. If 2, just plot the standard deviation with a dashed line. Defaults to None.
        legend_loc (List[str], optional): A list of keys indicating for each figure where to place the legend. If None or if list element not existing, use best location. Defaults to None.
        filename (str, optional): The name of the file to save the plot to. If none, the plot will not be saved. Defaults to None.
        show (bool, optional): Whether to show the plot or not. Defaults to False.

    Returns:
        None
    """
    nb_figs, nb_curves, nb_steps, nb_runs = data.shape
    range_x = (0, nb_steps - 1)

    fig, axes = plt.subplots(nrows=nb_figs, ncols=1, figsize=size)
    fig.suptitle(title, fontsize=18, fontweight="bold", y=0.94)

    for i in range(nb_figs):
        if nb_figs == 1:
            ax = axes
        else:
            ax = axes[i]

        ax.set_title(ax_titles[i], fontsize=12, fontweight="bold", loc="left")

        for j in range(nb_curves):
            if algs_info[j][2] == 0:
                mean = np.mean(data[i, j, :, :], axis=1)

                std = np.std(data[i, j, :, :], axis=1)
                if USE_STD is False:
                    std = np.std(data[i, j, :, :], axis=1) / np.sqrt(nb_runs)

                ax.plot(mean, label=algs_info[j][0], color=algs_info[j][1])

                if fill_std is not None:
                    if fill_std[i] == None:
                        pass
                    elif fill_std[i] == 0:
                        ax.fill_between(
                            np.arange(nb_steps),
                            mean - std,
                            mean + std,
                            alpha=0.2,
                            color=algs_info[j][1],
                        )
                    elif fill_std[i] == 1:
                        ax.plot(mean - std, color=algs_info[j][1], linestyle="--")
                        ax.plot(mean + std, color=algs_info[j][1], linestyle="--")
                    elif fill_std[i] == 2:
                        ax.plot(std, color=algs_info[j][1], linestyle="--")
                    else:
                        raise ValueError("Invalid fill_std value.")

            elif algs_info[j][2] == 1:
                ax.axhline(
                    y=np.mean(data[i, j, -1, :]),
                    label=algs_info[j][0],
                    color=algs_info[j][1],
                    linestyle="--",
                )

            elif algs_info[j][2] == 2:
                ax.scatter(
                    np.arange(nb_steps),
                    np.mean(data[i, j, :, :], axis=1),
                    s=10,
                    label=algs_info[j][0],
                    color=algs_info[j][1],
                )

            else:
                raise ValueError("Invalid marker type.")

        if len(algs_info[j][0]) > 0:
            ax.legend(prop={"size": 10})  # loc='upper right',

            if legend_loc is not None and i < len(legend_loc):
                ax.legend(loc=legend_loc[i])

        if i == nb_figs - 1:
            ax.set_xlabel(main_labels[0], fontsize=14, labelpad=10)
        ax.set_ylabel(main_labels[i], fontsize=14, labelpad=10)
        # ax.set_ylabel(main_labels[i+1], fontsize=14, labelpad=10)

        ax.set_xlim(range_x)
        if range_y is not None and range_y[i] is not None:
            ax.set_ylim(range_y[i])

        ax.locator_params(nbins=10, axis="x")
        ax.locator_params(nbins=5, axis="y")
        ax.grid()

    if filename is not None:
        plt.savefig(filename)
    if show:
        plt.show()

### 5.3. Q-Learning And Actor-Critic Agents


#### 5.3.1. Q-Learning Agent


In [17]:
class QLearningAgent:
    def __init__(self, env, alpha=0, epsilon=0, gamma=None, nb_bins=10):
        self.alpha = alpha
        self.epsilon = epsilon
        self.env = env
        self.gamma = Q2C.DEFAULT_GAMMA if gamma is None else gamma
        self.nb_bins = nb_bins  # 10

        self.nb_states = self.env.observation_space.shape[0]  # 4
        self.nb_actions = env.action_space.n  # 2

        self.w = np.random.uniform(
            -0.001, 0.001, size=(self.nb_states * self.nb_bins, self.nb_actions)
        )

    def select_action(self, state):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.nb_actions)
        else:
            # return np.argmax(np.dot(self.w.T, state)) # This is wrong because it is deterministic when there are multiple maxima
            return np.random.choice(
                np.flatnonzero(np.isclose(np.dot(self.w.T, state), np.dot(self.w.T, state).max()))
            )

    def discretize(self, state):
        cart_pos, cart_vel, pole_angle, pole_vel = state
        cart_pos = np.digitize(cart_pos, np.linspace(-2.4, 2.4, self.nb_bins - 1))
        cart_vel = np.digitize(cart_vel, np.linspace(-3.5, 3.5, self.nb_bins - 1))
        pole_angle = np.digitize(pole_angle, np.linspace(-0.4, 0.4, self.nb_bins - 1))
        pole_vel = np.digitize(pole_vel, np.linspace(-3.5, 3.5, self.nb_bins - 1))
        arr = np.zeros(self.nb_bins * self.nb_states)
        arr[cart_pos + self.nb_bins * 0] = 1
        arr[cart_vel + self.nb_bins * 1] = 1
        arr[pole_angle + self.nb_bins * 2] = 1
        arr[pole_vel + self.nb_bins * 3] = 1
        return arr  # (40,)

    def run_episode(self, state=None):
        done = False
        total_reward = 0
        I = 1

        # Initialize S (first state of episode)
        if state is None:
            state = self.env.reset()[0]
            state = self.discretize(state)

        # Loop while S is not terminal (for each time step)
        while not done and total_reward <= 200:
            action = self.select_action(state)

            # first take the current action and obtain the reward, next_state and
            # whether we end up in a terminal state
            next_state, reward, terminated, truncated, _ = self.env.step(action)
            next_state = self.discretize(next_state)
            done = terminated or truncated
            # we now get the next action with the same action selection method
            # next_action = self.select_action(next_state)

            q_value = np.dot(state, self.w[:, action])
            next_q_values = np.dot(next_state, self.w)

            error = reward + I * np.max(next_q_values) - q_value
            self.w[:, action] += self.alpha * error * state

            I *= self.gamma
            state = next_state
            total_reward += reward

        return total_reward

    def run(self, nb_episodes=1000, do_print=True):
        rewards = []
        for i in range(nb_episodes):
            reward = self.run_episode()
            rewards.append(reward)
            if do_print and i % 100 == 0:
                print(
                    f"Episode {i+1}/{nb_episodes} - Avg Reward: {np.array(rewards[-100:]).mean():.2f}"
                )
        return rewards

#### 5.3.2. Actor-Critic Agent


In [18]:
class ActorCriticAgent:
    def __init__(self, env, alpha_theta=0, alpha_w=0, gamma=1, nb_bins=10):
        self.env = env
        self.alpha_theta = alpha_theta
        self.alpha_w = alpha_w
        self.gamma = gamma
        self.nb_bins = nb_bins

        self.nb_states = self.env.observation_space.shape[0]  # 4
        self.nb_actions = self.env.action_space.n  # 2

        self.theta = np.random.uniform(
            -0.001, 0.001, (self.nb_states * self.nb_bins, self.nb_actions)
        )  # (40, 2)
        self.w = np.random.uniform(-0.001, 0.001, (self.nb_states * self.nb_bins,))  # (40,)

    def discretize(self, state):
        cart_pos, cart_vel, pole_angle, pole_vel = state
        cart_pos = np.digitize(cart_pos, np.linspace(-2.4, 2.4, self.nb_bins - 1))
        cart_vel = np.digitize(cart_vel, np.linspace(-3.5, 3.5, self.nb_bins - 1))
        pole_angle = np.digitize(pole_angle, np.linspace(-0.4, 0.4, self.nb_bins - 1))
        pole_vel = np.digitize(pole_vel, np.linspace(-3.5, 3.5, self.nb_bins - 1))
        arr = np.zeros(self.nb_bins * self.nb_states)
        arr[cart_pos + self.nb_bins * 0] = 1
        arr[cart_vel + self.nb_bins * 1] = 1
        arr[pole_angle + self.nb_bins * 2] = 1
        arr[pole_vel + self.nb_bins * 3] = 1
        return arr  # (40,)

    def get_policy(self, state):
        def softmax(x):
            e_x = np.exp(x - np.max(x))
            return e_x / e_x.sum(axis=0)

        return softmax(np.dot(self.theta.T, state).reshape(-1))

    def run_episode(self, state=None):
        done = False
        total_reward = 0
        I = 1

        # Initialize S (first state of episode)
        if state is None:
            state = self.env.reset()[0]
            state = self.discretize(state)

        # Loop while S is not terminal (for each time step)
        while not done and total_reward <= 200:
            # Choose action: A ~ pi(.|s, theta)
            policy = self.get_policy(state)
            action = np.random.choice(self.nb_actions, p=policy)

            # Take action A, observe S', R
            next_state, reward, terminated, truncated, _ = self.env.step(action)
            next_state = self.discretize(next_state)
            done = terminated or truncated

            # Update delta: delta <- R + gamma * v_hat(S', w) - v_hat(S, w)    (if S' is terminal, then v_hat(S', w) = 0)
            delta = (
                reward
                + self.gamma * float(np.dot(next_state, self.w.T))
                - float(np.dot(state, self.w.T))
            )

            # Update w: w <- w + alpha_w * delta * grad v_hat(S, w)
            self.w += self.alpha_w * delta * state

            # Update theta: theta <- theta + alpha_theta * I * delta * grad ln pi(A|S, theta)
            self.theta += (
                self.alpha_theta * I * delta * np.dot(state.reshape(-1, 1), policy.reshape(1, -1))
            )

            I *= self.gamma
            state = next_state
            total_reward += reward
        return total_reward

    def run(self, nb_episodes=1000, do_print=True):
        rewards = []
        for i in range(nb_episodes):
            reward = self.run_episode()
            rewards.append(reward)
            if do_print and i % 100 == 0:
                print(f"Episode {i+1}/{nb_episodes} - Reward: {reward}")
        return rewards

### 5.4. Experiment


#### 5.4.1. Running the experiment


In [None]:
nb_episodes = 1000
nb_runs = 10
alphas = [1 / 16, 1 / 8, 1 / 4]
epsilons = [0.01, 0.04, 0.1]

# Environment
env = gym.make("CartPole-v1")

# Data
data_Q2 = np.zeros((4, len(epsilons), nb_episodes, nb_runs))

for k in range(nb_runs):
    print(f"Run {k+1}/{nb_runs}")
    for i, alpha in enumerate(alphas):
        print(f"\tAlpha: {alpha}")
        for j, epsilon in enumerate(epsilons):
            seed_everything(k, env)

            q_learning = QLearningAgent(env, alpha=alpha, epsilon=epsilon)
            data_Q2[j][i][:, k] = q_learning.run(nb_episodes=nb_episodes, do_print=False)

            print(
                f"\t\tQ-Learning - Epsilon: {epsilon} - Avg Reward Last 100 Episodes: {data_Q2[j][i][-100:, k].mean():.2f}"
            )

        seed_everything(k, env)

        actor_critic = ActorCriticAgent(env, alpha_theta=alpha, alpha_w=alpha)
        data_Q2[3][i][:, k] = actor_critic.run(nb_episodes=nb_episodes, do_print=False)

        print(
            f"\t\tActor-Critic - Avg Reward Last 100 Episodes: {data_Q2[3][i][-100:, k].mean():.2f}"
        )

#### 5.4.2. Graphs


##### 5.4.2.1. Effect of hyperparameters on performance


In [None]:
plot(
    data=data_Q2,
    title="Question 2\nQ-learning & Actor-Critic",
    main_labels=["Returns", "Returns", "Returns", "Returns", "Returns", "Returns"],
    ax_titles=[
        f"A. Q-learning with $\epsilon = {epsilons[0]}$ averaged over ${nb_runs}$ runs",
        f"B. Q-learning with $\epsilon = {epsilons[1]}$ averaged over ${nb_runs}$ runs",
        f"C. Q-learning with $\epsilon = {epsilons[2]}$ averaged over ${nb_runs}$ runs",
        f"F. Actor-Critic with Softmax averaged over {nb_runs} runs",
    ],
    algs_info=[
        (rf"$\alpha = {alphas[0]}$", "tab:blue", 0),
        (rf"$\alpha = {alphas[1]}$", "tab:green", 0),
        (rf"$\alpha = {alphas[2]}$", "tab:orange", 0),
    ],
    size=(10, 30),
    range_y=[(0, 200), (0, 200), (0, 200), (0, 200), (0, 200), (0, 200)],
    fill_std=[0, 0, 0, 0, 0, 0],
    filename="q2.png" if Q2C.SAVE_FILES else None,
    show=Q2C.SHOW_GRAPHS,
)

#### 5.4.2.2. Best Parameters

We can see what the best hyperparameters are to maximize returns. For Q-Learning, $\alpha = 1/8$ and $\epsilon = 0.01$ are optimal; for Actor-Critic, $\alpha = 1/4$ gives the highest average results.


In [None]:
data_Q2_best = np.zeros((1, 2, nb_episodes, nb_runs))

best_alpha_q = 1
best_eps_q = 0
best_alpha_ac = 2

data_Q2_best[0][0] = data_Q2[best_eps_q][best_alpha_q]
data_Q2_best[0][1] = data_Q2[3][best_alpha_ac]

# Plot #2: Q-learning and Actor-Critic best results
SAVE_FILES = True
SHOW_GRAPHS = True

plot(
    data=data_Q2_best,
    title="Question 2",
    main_labels=["Reward"],
    ax_titles=[f"Best Q-learning and best Actor-Critic averaged over {nb_runs} runs"],
    algs_info=[
        (
            rf"Q-learning with $\alpha = {alphas[best_alpha_q]}$ and $\epsilon = {epsilons[best_eps_q]}$",
            "tab:orange",
            0,
        ),
        (rf"Actor-Critic with $\alpha = {alphas[best_alpha_ac]}$ and Softmax", "tab:red", 0),
    ],
    size=(10, 12),
    range_y=[(0, 200)],
    fill_std=[0],
    filename="q2_best.png" if SAVE_FILES else None,
    show=SHOW_GRAPHS,
)

### 5.5. Report

We implemented Q-learning and Actor-Critic with linear function approximation on the cart-pole domain from the Gym environment suite. We used a function approximator in which we discretize the state variables into 10 bins each, with weights initialized randomly using a uniform distribution between $−0.001$ and $0.001$.

#### **Selection of hyperparameters**

We used 3 settings of the learning rate parameter $\alpha$ = 1/4, 1/8, 1/16. For the Q-Learning agents, we used $\epsilon$-greedy exploration with $\epsilon$ = 0.01, 0.04, 0.1; for the Actor-Critic Agents, we used Softmax exploration instead.

We performed 10 independent runs, each of 1000 episodes.

We plot 3 graphs for $\epsilon$-greedy, and one for Actor-Critic, each containing the average and standard error of the learning curves for each value of $\alpha$.

**_Figure 1:_** _Effects of hyperparameters on training performance._
<img src="https://i.imgur.com/TQRO6Vq.png" width=600/>

As we can see, for Q-Learning, lower values for $\epsilon$ did better. Actor-Critic was surprisingly performant, though its performance deteriorated over time after an initial jump in performance. This suggests that decreasing slightly the gamma value might have increased performance ($\gamma$ = 1 for our experiments).

For Q-Learning, $\alpha = 1/8$ and $\epsilon = 0.01$ gave the highest mean returns over the 10 runs and 1000 episodes; for Actor-Critic, $\alpha = 1/4$ gives the highest mean returns. However, for Q-Learnnig with $\epsilon = 0.01$, we can see that though $\alpha = 1/8$ gave better average results, it gave much less stable results, and results that were ultimately roughly equivalent to $\alpha = 1/16$, which could be used to argue that it was a better hyperparameter for our problem. Still, we decided to choose $\alpha = 1/8$.

**_Figure 2:_** _Best hyperparameters for Q-Learning and Actor-Critic on training performance._
<img src="https://i.imgur.com/193ONJS.png" width=600/>

Here, we notice more easily the striking difference of learning speeds between Q-Learning and Actor-Critic. Actor-Critic initially learns quickly to get good returns, but deteriorates after the 50th episode; comparatively, the optimal Q-Learning agent learns slowly, but is caught up to the optimal Actor-Critic agent after 1000 episodes.

Additionally, we observe that the Q-Learning curve is less stable than the Actor-Critic curve. This is expected, and a known strength of Actor-Critic agents over Q-Learning agents.

#### **Conclusion**

In summary, we applied Q-Learning and Actor-Critic algorithms with linear function approximation to the cart-pole domain. We then tested various hyperparameters to assess their performance. The Q-Learning agents explored using $\epsilon$-greedy, while the Actor-Critic agents used Softmax exploration.

Our findings show that the Q-Learning algorithm with $\alpha = 1/8$ and $\epsilon = 0.01$ provided the best overall performance in terms of mean returns, although it was less stable than the Actor-Critic algorithm. In contrast, the Actor-Critic algorithm with $\alpha = 1/4$ achieved the highest mean returns and displayed faster initial learning, but its performance deteriorated over time.

Comparing the two algorithms, we note that Actor-Critic agents learn more quickly at the beginning and have more consistent performance than Q-Learning agents. This is because Actor-Critic agents use a value function and a policy function to learn a better representation of the environment and exploit it more efficiently. In contrast, Q-Learning agents rely solely on a value function, which can lead to instability issues, particularly when using function approximation.
