# RL test run: Tabular methods

## Introduction

This notebook visualizes the results of testing selected Reinforcement Learning (RL) methods. 

Here, we present a comparison of different _tabular learning methods_. These methods store the approximate action-value function in a table data structure. We learn for 100,000 episodes, which we hope is enough to reach a reasonable approximation of the value function.

For _epsilon_, all methods use the same scaled inverse visit count strategy, with _n0_ of 50. The exception is Q-learning, an off-policy method, for which we apply a fully random behavior policy. For _alpha_, an exponential decay schedule is used, starting from 0.1 and reaching 0.05 at 90,000 learning iterations.
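
The two schedules can be sketched as follows. This is a minimal illustration and not the repository's implementation: the functional forms (`n0 / (n0 + visits)` for epsilon, a geometric interpolation for alpha) and the function names are assumptions chosen to match the description above.

```python
def epsilon_for_state(visits: int, n0: float = 50.0) -> float:
    """Scaled inverse visit count (assumed form): 1.0 when unvisited, decaying towards 0."""
    return n0 / (n0 + visits)


def alpha_at(iteration: int, start: float = 0.1, target: float = 0.05,
             target_iteration: int = 90_000) -> float:
    """Exponential decay (assumed form): `start` at iteration 0, `target` at `target_iteration`."""
    return start * (target / start) ** (iteration / target_iteration)
```

Under these assumptions, a state visited 50 times is explored with probability 0.5, and the step size halves over the first 90,000 iterations.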

Skip the intro and check the details of [the methods](#Evaluated-RL-methods), or get straight to the point [here](#Evaluation-of-method-performance).

### Context recap

The same notebook structure is repeated for test runs with different method configurations; those notebooks are stored in this folder.

The implementation of the evaluated RL methods can be found in the repository https://github.com/mmakipaa/rl.

Results have been created by running a test run based on the YAML configuration file `configs\tabular_agents.yaml` for a given number of iterations. In this case:

```sh
python run.py --environment blackjack --iterations 100000 --configfile tabular_agents
```

The results have been saved to the report file `blackjack_tabular_agents_100000.pik`.

In this notebook, this report file is loaded, the reported results are pre-processed, and several visualizations are created in the hope of revealing the hidden behavior of the methods.

### Imports

The results of the test run are stored as [pickle](https://docs.python.org/3/library/pickle.html)-serialized [Pandas](https://pandas.pydata.org/) `DataFrames`, so we import pickle, pandas and numpy. [Matplotlib](https://matplotlib.org/stable/index.html) and [Seaborn](https://seaborn.pydata.org/) are used for visualization.

In addition, preprocessing and plotting utility functions are imported from the `./utils` folder.


In [None]:
from pathlib import Path
import pickle

import pandas as pd
import numpy as np
import itertools

import matplotlib.pyplot as plt
import seaborn as sns

import utils.plotting as plotting
from utils.process_report import process_report

### Figure settings

The aim is to create relatively large PNG images in a 4:3 aspect ratio. An alternative would be to create inline images in SVG format.

To adjust the image size for side-by-side grid plots, see `get_sidebyside_imgsize` in `utils\plotting.py`.

In [None]:
FIG_SIZE=(12, 9)
FIG_DPI=300

%matplotlib inline
%config InlineBackend.figure_format = 'png'
# alternatively - %config InlineBackend.figure_formats = ['svg']

### Report data

Test run results are loaded from `report_filename` and reference result from `reference_filename` defined in the cell below. The naming convention for pickled report filenames is: `<environment>_<configfile>_<iterations>.pik`.
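
As a small illustration of the naming convention, a filename could be assembled like this. The helper name `report_filename_for` is hypothetical and is not part of the repository.

```python
def report_filename_for(environment: str, configfile: str, iterations: int) -> str:
    """Build a report filename following <environment>_<configfile>_<iterations>.pik."""
    return f"{environment}_{configfile}_{iterations}.pik"


example_name = report_filename_for("blackjack", "tabular_agents", 100000)
```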

In [None]:
FOLDER = Path("../testruns/")

report_filename = ( "blackjack_tabular_agents_100000.pik" )
reference_filename = 'blackjack_ref_agent_100000000.pik'

In [None]:
report_file = FOLDER / report_filename

with open(report_file,'rb') as f:
    report_dict = pickle.load(f)
    agents = report_dict['agents']
    df_report = report_dict['report']
    df_ev_revards = report_dict['ev_rewards']

reference_file = FOLDER / reference_filename
    
with open(reference_file,'rb') as f:
    ref_dict = pickle.load(f)
    ref_agents = ref_dict['agents']
    df_reference = ref_dict['report']

### Evaluated RL methods

The report file loaded above contains results for the following RL methods:

In [None]:
df_agents = pd.DataFrame(agents)
df_agents[['name','method']]

Method descriptions are as follows (from [rl/README.md](https://github.com/mmakipaa/rl/blob/main/README.md)):

| Method key| Description |
| --- | --- |
| MonteCarloOn | On-policy Monte Carlo, full episodes and tabular value representation |
| MonteCarloOff | Off-policy Monte Carlo, full episodes and tabular value representation |
| Sarsa | Sarsa TD(0) using tabular value representation |
| SarsaExpected | Expected Sarsa TD(0) using tabular value representation |
| Qlearning | Q-learning TD(0) using tabular value representation |

Learning parameters `epsilon` and `alpha` and corresponding learning schedules are detailed in the tables below:

In [None]:
display(df_agents.filter(regex='name|epsilon'))

In [None]:
display(df_agents.filter(regex='name|alpha'))

### Reporting points

Action-values and visit counts are reported for each state-action pair at log-intervals as indicated in column `iterations`. Reports have been logged at the following points during learning:

In [None]:
iterations = list(df_report['iterations'].unique())
MAX_ITERATIONS = iterations[-1]
display(iterations)

### Reference result

As we do not have access to the true value function of Blackjack, we approximate it with a reference result that we believe to be close to the true value function.

The reference result was obtained by running the off-policy Monte Carlo method with a random behavior policy for 100,000,000 episodes, with the configuration shown below.

Keep in mind that the comparisons that follow are only as good as this reference: if the reference result is wrong, so are the metrics derived from it, such as the mean squared error and the count of wrong policy decisions.


In [None]:
display(ref_agents)
display(df_reference['iterations'].max())

### Pre-processing report data

The report DataFrame, `df_report`, records the estimated action-value function for each state-action pair. The state is represented by the columns `(dealer, player, soft)`, where `dealer` is the sum of the dealer's cards, `player` is the sum of the player's cards, and `soft` indicates whether the player has a soft (usable) ace. Column `action` indicates the action: True corresponds to `hit` and False to `stand`.

The report contains the state-action pairs for each agent included in the test run at each reporting iteration. `df_reference` contains a similar report for the reference method.

In addition to state-action values, state-action visit counts are recorded for tabular methods. Visit counts are not available for approximate semi-gradient or batch methods.

Next, we run the `process_report` utility to pre-process the report files and to create the following DataFrames:

* `df_ref` - augments the reference results to indicate whether an action is the optimal action in that state

* `df_ref_change_points` - indicates the points, for different combinations of dealer's cards and soft ace, where the optimal action changes from `hit` to `stand`.

* `df_sa` - augments the report DataFrame to indicate whether an action is the optimal one in that state. We also add the reference values and optimal action labels, and calculate the error between values of the tested method and our reference.

* `df_states` - rolls the state-action pairs of `df_sa` into states, calculating visit counts for each state, as well as state min and max values and optimal actions. Additionally labels the states for plotting policies on the state grid later in this notebook.
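
To make the roll-up from state-action pairs to states concrete, here is a minimal sketch on toy data. It is not the actual `process_report` implementation: the input column `value`, the output column `s_min_value` and all numbers are assumptions for illustration, while `s_max_value` and `s_best_action` follow the column names used later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy state-action table; values are made up for illustration.
toy_sa = pd.DataFrame({
    "dealer": [10, 10, 6, 6, 2, 2],
    "player": [16, 16, 20, 20, 4, 4],
    "soft":   [False] * 6,
    "action": [True, False, True, False, True, False],  # True = hit, False = stand
    "value":  [-0.5, -0.6, -0.8, 0.7, 0.0, 0.0],
})

# One row per state, one column per action.
wide = (toy_sa.pivot(index=["dealer", "player", "soft"], columns="action", values="value")
              .rename(columns={False: "stand", True: "hit"}))

toy_states = pd.DataFrame({
    "s_max_value": wide[["stand", "hit"]].max(axis=1),
    "s_min_value": wide[["stand", "hit"]].min(axis=1),
})

# Best action is True (hit) or False (stand); tied states get NaN.
best = (wide["hit"] > wide["stand"]).astype(object)
best[wide["hit"] == wide["stand"]] = np.nan
toy_states["s_best_action"] = best
```

Leaving tied states as NaN matches the way ties are counted with `isna()` in the wrong-action metric later in this notebook.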

In [None]:
df_ref, df_ref_change_points, df_sa, df_states = process_report(df_report, df_reference)

## Evaluation of method performance

To allow comparison of the methods, we compute simple performance metrics for each tested method below.

### Evaluation rewards

Ten thousand additional evaluation episodes were run using the value function and corresponding greedy policy reached at the end of learning. The combined rewards received during evaluation are shown below.

Note that training-time rewards are also available as `report_dict['tr_rewards']`. We do not, however, consider training-time rewards as a metric at this point.

In [None]:
df_ev_revards['per_episode'] = df_ev_revards['reward'] / df_ev_revards['episodes']
df_ev_revards['rank'] = df_ev_revards['per_episode'].rank(ascending=False)
df_ev_revards = df_ev_revards.sort_values(by='rank')

In [None]:
display(df_ev_revards)

### Mean-squared error 

We calculate the mean-squared error of the action-value function against the reference and display the sorted order of methods below.

As we are comparing to the reference case, mean squared error of zero would mean that the evaluated method would have learned exactly the same action-value function as the reference method.
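
The metric itself is simple to state in code. The sketch below uses toy data; the column names `value` and `ref_value` are assumptions, while `sq_error` matches the column used in this notebook.

```python
import pandas as pd

# Agent "A" matches the reference exactly, agent "B" does not.
toy = pd.DataFrame({
    "agent":     ["A", "A", "B", "B"],
    "value":     [0.5, -1.0, 0.0, 0.0],
    "ref_value": [0.5, -1.0, 0.5, -1.0],
})

toy["sq_error"] = (toy["value"] - toy["ref_value"]) ** 2
toy_mse = toy.groupby("agent")["sq_error"].mean()
```

Here agent "A" gets an MSE of zero because its values equal the reference values for every state-action pair.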


In [None]:
dg = df_sa.groupby(['agent', 'iterations'])
df_mse = dg['sq_error'].mean().reset_index()

# Select each agent's final reporting point; the index labels must come from df_mse itself
df_mse_last = df_mse.loc[df_mse.groupby(['agent'])['iterations'].idxmax()]
df_mse_last['rank'] = df_mse_last['sq_error'].rank(ascending=True)
df_mse_last = df_mse_last.sort_values(by='rank')

In [None]:
display(df_mse_last)

### Wrong action count

We calculate the number of states where the agent would make a wrong decision, i.e. where the learned policy does not select the same optimal action as the reference method at the end of learning. Zero wrong actions therefore means that the method has learned the same greedy policy as the reference method.

There are 280 states and 560 state-action pairs. Initially, when all action-values (or weights, for approximate methods) have been initialized to the same value, typically zero, no unique best action is available. In this case, the policy decisions in all 280 states are made randomly, so on average we would expect half of them to be right. This explains the initial level from which we hope to improve during learning.
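
The state counts can be checked by enumerating the state space. The sketch assumes the ranges described in this notebook: dealer sums from 2 to 11, player sums from 4 to 21 without a soft ace, and from 12 to 21 with a soft ace.

```python
from itertools import product

dealer_sums = range(2, 12)   # 2..11
hard_sums = range(4, 22)     # 4..21, no usable ace
soft_sums = range(12, 22)    # 12..21, with usable ace

states = [(d, p, False) for d, p in product(dealer_sums, hard_sums)]
states += [(d, p, True) for d, p in product(dealer_sums, soft_sums)]

# Two actions per state: False = stand, True = hit
state_actions = list(product(states, (False, True)))

# With all values tied, a random choice is right about half of the time
expected_initial_wrong = len(states) / 2
```

This gives 10 × (18 + 10) = 280 states and an expected initial level of 140 wrong decisions.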

In [None]:
def get_wrong_action_count(x):
    # Assume we would get half of states with ties right
    compensation = np.floor(x['s_best_action'].isna().sum() / 2)
    return sum(x['s_best_action'] != x['ref_best_action']) - compensation

df_wrong = df_states.groupby(['agent', 'iterations']).apply(get_wrong_action_count).reset_index(name ='wrong_count')

# Select each agent's final reporting point; the index labels must come from df_wrong itself
df_wrong_last = df_wrong.loc[df_wrong.groupby(['agent'])['iterations'].idxmax()]
df_wrong_last['rank'] = df_wrong_last['wrong_count'].rank(ascending=True)
df_wrong_last = df_wrong_last.sort_values('rank')

In [None]:
display(df_wrong_last)

### A simple performance ranking
We combine the above performance metrics by summing the per-metric ranks into a combined rank, column `sumrank` in the DataFrame below. This provides a simple performance ranking for the evaluated methods:
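
The idea of the combined rank can be shown on toy data. The agent names and ranks below are made up; only the approach of concatenating the per-metric tables and summing `rank` per agent mirrors this notebook.

```python
import pandas as pd

# One made-up rank table per metric
rank_wrong = pd.DataFrame({"agent": ["A", "B", "C"], "rank": [1.0, 2.0, 3.0]})
rank_mse = pd.DataFrame({"agent": ["A", "B", "C"], "rank": [2.0, 1.0, 3.0]})
rank_reward = pd.DataFrame({"agent": ["A", "B", "C"], "rank": [1.0, 3.0, 2.0]})

# Stack the tables and sum the ranks per agent; a lower sum is better
sumrank = (pd.concat([rank_wrong, rank_mse, rank_reward])
             .groupby("agent")["rank"].sum()
             .rename("sumrank")
             .sort_values())
```

Note that a rank sum weights all three metrics equally and ignores how large the differences between methods are, so it is only a rough ordering.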

In [None]:
df_agent_ranking = pd.concat([df_wrong_last, df_mse_last, df_ev_revards]).groupby('agent') \
                            .sum().drop(['iterations','episodes'], axis=1) \
                            .rename(columns={"rank": "sumrank"}) \
                            .sort_values(by=['sumrank','per_episode', 'wrong_count','sq_error']) \
                            .reset_index()
display(df_agent_ranking)

## Visualizing convergence over learning iterations

In the following, we show how the mean squared error and the number of states where a wrong decision is made evolve over learning iterations.

Note that the plots use a logarithmic scale on the x-axis, as the reports were collected at log-intervals during learning.

In [None]:
#SCALE = 'LINEAR'
SCALE = 'LOG'

In [None]:
sns.set_style("white")  # set the style before creating the figure so it also applies to this plot

fig = plt.figure(figsize=FIG_SIZE, dpi=FIG_DPI)
ax = sns.lineplot(data=df_mse, x="iterations", y="sq_error", hue='agent', linewidth=1, marker='o', markersize=4)

if SCALE == 'LOG':
    _min_xlim = 0.9 * df_report['iterations'].loc[df_report['iterations'] > 0].min()
else:
    _min_xlim = 0
_max_xlim = MAX_ITERATIONS * 1.05
_min_ylim = 0
_max_ylim = df_mse['sq_error'].max() * 1.05
    
plotting.format_ax(ax, _min_xlim, _max_xlim, _min_ylim, _max_ylim, SCALE)

ax.set_ylabel("MSE")

_ = ax.set_title("Mean Squared Error (compared to reference)", fontdict={'fontsize': 12}, y=0.9)

In [None]:
fig = plt.figure(figsize=FIG_SIZE, dpi=FIG_DPI)
ax = sns.lineplot(data=df_wrong, x="iterations", y="wrong_count", hue='agent',linewidth=1, marker='o', markersize=4)

if SCALE == 'LOG':
    _min_xlim = 0.9 * df_report['iterations'].loc[df_report['iterations'] > 0].min()
else:
    _min_xlim = 0
_max_xlim = MAX_ITERATIONS * 1.05
_min_ylim = 0
_max_ylim = df_wrong['wrong_count'].max() * 1.05
    
plotting.format_ax(ax, _min_xlim, _max_xlim, _min_ylim, _max_ylim, SCALE)

ax.set_ylabel("Number of states with wrong actions selected")

_ = ax.set_title("Wrong actions (compared to reference)", fontdict={'fontsize': 12}, y=0.9)

## Plotting the action-value function

In the following, we visualize the action-value functions for each of the methods at the end of learning.

For clarity, we divide the state-action pairs into four groups, each corresponding to state-actions in pair `(soft, action)`. For instance, `soft=True` and `action=STAND` would mean that the player has a soft (usable) ace valued at 11 and chooses to stand, i.e. not to take any more cards.

For each group, the x-axis shows the sum of the player's cards (from 4 to 21) and the y-axis gives the action-value. Each plotted line corresponds to a different value of the dealer's cards (from 2 to 11, i.e. ten lines with different hues from the palette shown below).

In [None]:
sns.palplot(plotting.get_colormap("dealer")[0]([i / 10 for i in range(0,10)]))


The reference value function is shown in dotted lines and gray hues.

The plots are ordered according to our simple performance ranking calculated above.

In [None]:
#fig = plt.figure(figsize=FIG_SIZE, dpi=FIG_DPI)

ratio = 3 / 4
width = 12
height = ratio * 2 / 2 * width

max_iterations = df_sa.loc[df_sa.groupby(['agent'])['iterations'].idxmax(),['agent','iterations']]

for current_agent in df_agent_ranking['agent'].tolist():
    fig = plt.figure(figsize=(width, height), dpi=FIG_DPI)
    current_set = df_sa.loc[((df_sa['agent'] == current_agent) &
                      (df_sa['iterations'] == max_iterations.loc[max_iterations['agent'] == current_agent,'iterations'].item()))]
    
    for i, pair in enumerate(itertools.product((False, True), repeat=2)):

        _action = pair[0]
        _soft = pair[1]

        _df_vis = current_set.loc[(current_set['action'] == _action) &(current_set['soft'] == _soft)]
        
        title = f"{current_agent}: Action: {('Hit' if _action else 'Stand')}, Soft: {_soft}" 
        plotting.plot_value_subplot(fig, i+1, _df_vis, title)
    
    title_str = current_agent + ": "
    title_str += f"Wrong: {df_agent_ranking.loc[df_agent_ranking['agent'] == current_agent,'wrong_count'].item():.0f}"
    title_str += f", MSE: {df_agent_ranking.loc[df_agent_ranking['agent'] == current_agent,'sq_error'].item():.04f}"
    title_str += f", Reward: {df_agent_ranking.loc[df_agent_ranking['agent'] == current_agent,'reward'].item():.0f}"

    fig.suptitle(title_str)

## Option to conserve energy...

To limit the number of subplots created in the side-by-side visualizations that follow, we can filter the iterations at which we plot intermediate values.

For example, the following discards every other iteration included in the report file, keeping the final result.

In [None]:
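# Keep every other reporting point plus the final one; set() avoids listing the final point twice
sbs_iterations = sorted(set(iterations[1::2] + [iterations[-1]]))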
display(sbs_iterations)

The image size for the following visualizations is set as follows (in inches):

In [None]:
sbs_fig_w, sbs_fig_h = plotting.get_sidebyside_imgsize(len(agents), len(sbs_iterations))
display((sbs_fig_w, sbs_fig_h))

## Action-value function over iterations

To dig deeper, we visualize the development of action-value function over learning iterations. 

In the subplot grid, each column shows results for an RL method. Each row displays the value function at an iteration during learning, with the last row showing the final result. Here one iteration of learning corresponds to one episode: for example, at 100 iterations the agent has completed 100 episodes of play.

We divide the plots in two, first plotting the states `(dealer, player, soft)` with no soft ace, i.e. `soft=False` and then repeating for states with soft ace.

Within a subplot, each square on the grid corresponds to a state and shows the estimate of the _value of the state following the greedy policy_, i.e. the larger of the state's two action-values.

As with the previous action-value plots, the sum of the dealer's cards on the vertical grid axis varies from 2 to 11, and the sum of the player's cards on the horizontal axis from 4 to 21.

Dark green hues correspond to values close to 1 (i.e. likely to win if choosing the optimal action) and dark purple hues to values close to -1 (likely to lose), using the palette shown below.

In [None]:
sns.palplot(plotting.get_colormap("values")[0]([i / 50 for i in range(0,51)]))

#### No soft ace

First, we plot the values for `soft = False`, i.e. the player has no usable soft ace.

In [None]:
soft = False

fig = plt.figure(figsize=(sbs_fig_w, sbs_fig_h ), dpi=FIG_DPI)

plotting.plot_heatmaps_sidebyside(fig, df_states, 's_max_value', agents=agents, iterations=sbs_iterations, 
                         soft=soft, content_type='values')

#### With soft ace

We repeat the plot of optimal value functions for the states with a soft ace. Note that the minimum sum for the player is now 12: the lowest soft hand is two aces, one counted as 11 (usable) and the other counted as 1.

In [None]:
soft = True

fig = plt.figure(figsize=(sbs_fig_w, sbs_fig_h ), dpi=FIG_DPI)

plotting.plot_heatmaps_sidebyside(fig, df_states, 's_max_value', agents=agents, iterations=sbs_iterations, 
                         soft=soft, content_type='values')

## Policy over iterations

Next, we visualize how the greedy policy changes over iterations. Orange squares correspond to states where the optimal action is to `hit`, blue to states where the optimal action is to `stand`. The subplot grid and the axes within each subplot are laid out as in the value heatmaps above.

For tabular methods, we can additionally illustrate states where only one of the actions has been visited in light orange and blue hues. States with tied values for actions are shown in gray and non-visited states in light gray. The colormap is shown below.

In [None]:
sns.palplot(plotting.get_colormap("actions")[0]([i / 50 for i in range(0,51)]))

#### No soft ace

In [None]:
soft = False

fig = plt.figure(figsize=(sbs_fig_w, sbs_fig_h ), dpi=FIG_DPI)

plotting.plot_heatmaps_sidebyside(fig, df_states, 's_label', agents=agents, iterations=sbs_iterations, soft=soft,\
                         content_type='actions')

#### With soft ace

In [None]:
soft = True

fig = plt.figure(figsize=(sbs_fig_w, sbs_fig_h ), dpi=FIG_DPI)

plotting.plot_heatmaps_sidebyside(fig, df_states, 's_label', agents=agents, iterations=sbs_iterations, soft=soft,\
                         content_type='actions')

## Difference in value between actions

Finally, we illustrate the difference in value between actions in a state and how the difference changes over iterations. 

A light hue indicates little difference in value between the actions. This can be the case in the early phases of learning, when the values of actions have not yet converged, or simply because the values are close - there is not much difference in expected returns regardless of what action is chosen.

A dark blue hue represents a state where the difference in the value of actions approaches the maximum of 2: one action has a value close to 1 and the other a value close to -1. This is the case, for example, in states where the player's sum is 21. Taking an additional card is a sure path to losing, resulting in an expected reward of -1. Standing at 21, on the other hand, carries a small probability of a tie with a reward of 0, but winning is quite likely, giving an expected reward close to 1.
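
As a sketch of the quantity being plotted, the difference can be computed per state as the gap between the larger and the smaller action-value. The numbers below are made up for illustration, and treating `s_diff` as an absolute difference is an assumption about how `process_report` defines it.

```python
import pandas as pd

# Toy action-values for two states with a fixed dealer card and no soft ace
toy = pd.DataFrame({
    "player": [21, 21, 12, 12],
    "action": [True, False, True, False],   # True = hit, False = stand
    "value":  [-1.0, 0.9, -0.25, -0.30],
})

# With two actions, |Q(hit) - Q(stand)| equals max minus min
s_diff = toy.groupby("player")["value"].agg(lambda v: v.max() - v.min())
```

In this toy data the state with a sum of 21 has a difference close to the maximum of 2, while at 12 the two actions are nearly equivalent.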

In [None]:
sns.palplot(plotting.get_colormap("value_diffs")[0]([i / 50 for i in range(0, 51)]))

#### No soft ace

In [None]:
soft = False

fig = plt.figure(figsize=(sbs_fig_w, sbs_fig_h ), dpi=FIG_DPI)

plotting.plot_heatmaps_sidebyside(fig, df_states, 's_diff', agents=agents, iterations=sbs_iterations,
                         soft=soft, content_type='value_diffs')

#### With soft ace

In [None]:
soft = True

fig = plt.figure(figsize=(sbs_fig_w, sbs_fig_h ), dpi=FIG_DPI)

plotting.plot_heatmaps_sidebyside(fig, df_states, 's_diff', agents=agents, iterations=sbs_iterations,
                         soft=soft, content_type='value_diffs')