**NOTE: This notebook is written for the Google Colab platform. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook.** 



In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install git+https://github.com/michalgregor/gym_plannable.git
!{sys.executable} -m pip install git+https://github.com/michalgregor/rl_tabular.git

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
%matplotlib inline
from rl_tabular import ActionValueTable, StateValueTable
from rl_tabular.maze_env_plots import (
    Plotter, plot_action_values, plot_state_values)
from rl_tabular import EpsGreedyPolicy, ReplayBuffer, QLearning
from rl_tabular import ExponentialSchedule
from rl_tabular import qtable_control
from rl_tabular import Trainer, seed
from gym_plannable.env import MazeEnv
import matplotlib.pyplot as plt
import numpy as np

## Q-Learning

Next we are going to have a look at a temporal difference (TD) approach known as Q-learning. Q-learning is model-free and it learns the optimal **action-value**  function, so no model is required to control the agent either (unlike TD learning with a state-value function).
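
To illustrate why an action-value function is enough for control, here is a minimal sketch that uses a plain NumPy array as a stand-in Q-table. The array `Q` and the helper `greedy_action` are hypothetical and not part of `rl_tabular`. The greedy action is read directly off the table, without simulating any transitions.

```python
import numpy as np

# Hypothetical Q-table: rows are states, columns are actions.
Q = np.array([
    [0.0, 0.5, 0.1, 0.2],   # state 0
    [0.3, 0.0, 0.9, 0.1],   # state 1
])

def greedy_action(Q, state):
    """Pick the action with the highest estimated value in `state`."""
    return int(np.argmax(Q[state]))

# The value of a state under the greedy policy is its best action value.
state_values = Q.max(axis=1)

print(greedy_action(Q, 0), greedy_action(Q, 1))
print(state_values)
```

With only a state-value function, picking an action would instead require knowing which state each action leads to, which is exactly what a model provides.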

Finally, Q-learning is an off-policy method, so it can learn from experience gathered using any policy – this will allow us to use it together with experience replay.

### Q-learning with a Fixed Action Sequence

As the first step, we are going to focus on Q-learning itself and we are going to disregard exploration. To this end, the cell below defines a fixed action sequence, which gets the agent from the start to the goal state. The sole role of our reinforcement learning agent will be to learn the sequence after experiencing it several times. We are going to see how well that is going to work.

---
### Task 1: Implement the Q-learning Update Rule

**In the cell below, fill in the implementation of the Q-learning update rule.** 

$$
\begin{aligned}
\delta &= r_{t+1} + \gamma \underset{a}{\max} \ Q(s_{t+1}, a)-Q(s_{t},a_{t}) \\
Q(s_{t},a_{t}) &\leftarrow Q(s_{t},a_{t}) + \alpha \delta
\end{aligned}
$$
---


In [None]:
actions = [0, 0, 0, 0, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 3, 3, 3, 3]
alpha = 0.5
gamma = 0.9

# create the environment
env = MazeEnv() # we do not enable rendering here; we'll do it manually

# set up plotting
plotter = Plotter(env, ActionValueTable, StateValueTable, figsize=[8, 4])

# set up the value function
qtable = ActionValueTable(env.action_space.n)

# the training loop
step = 0
for episode in range(20):
    obs = env.reset()
        
    for a in actions:
        # apply the action and observe the effect
        obs_next, reward, done, _ = env.step(a)
        
        
        
        
        # compute td: a, reward, qtable, gamma, obs, obs_next, np.max
        td = # -----
        
        
        
        
        qtable[obs, a] += alpha * td
        
        # book keeping
        obs = obs_next
        step += 1

        # for the first 3 episodes, we also do step-wise plots
        if episode < 3: plotter.plot(qtable, qtable.to_state_values())

        # if the environment is done, conclude the episode
        if done: break

    print(f"Episode {episode} finished after {step} total steps.")

In [None]:
plt.figure()
plot_action_values(
    qtable, plotter.states, env=env, render_agent=False
)

plt.figure()
plot_state_values(
    qtable.to_state_values(), plotter.states, env=env, render_agent=False
)

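What we can observe from the visualization is that in the first episode, vanilla Q-learning only updates the value of the second-to-last state, i.e. the state from which the agent steps into the goal. This makes sense: while the values of the earlier states were being updated, we did not yet know that the agent was going to receive a reward. In each subsequent episode, the value then propagates back by just one more step.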

On the other hand, this is clearly very impractical – it means that we need to repeat the sequence around 20 times (roughly its length) before the agent manages to learn it in its entirety – that is, before the value is propagated all the way from the goal state back to the initial state.

And even this is only true under the assumption that the agent is going to see the same sequence of actions every time – when in reality, the agent will actually need to explore different paths. It is this exploration aspect that we are going to have a look at next.
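
The one-step-per-episode propagation described above can be reproduced on a toy problem. The sketch below does not use `MazeEnv`. It assumes a hypothetical chain of six states where the agent always moves right, the only reward is 1 for entering the last state, and terminal states have value 0. After each episode it counts how many states have a non-zero value.

```python
import numpy as np

n_states = 6            # states 0..5; state 5 is the goal (assumed toy chain)
alpha, gamma = 0.5, 0.9
RIGHT = 1               # the only action used by the fixed sequence
Q = np.zeros((n_states, 2))

learned_per_episode = []
for episode in range(5):
    s = 0
    while s != n_states - 1:
        s_next = s + 1
        done = s_next == n_states - 1
        r = 1.0 if done else 0.0
        # bootstrap from the best next action; a terminal state has value 0
        target = r + (0.0 if done else gamma * Q[s_next].max())
        Q[s, RIGHT] += alpha * (target - Q[s, RIGHT])
        s = s_next
    # number of states whose "move right" value is already non-zero
    learned_per_episode.append(int(np.count_nonzero(Q[:, RIGHT])))

print(learned_per_episode)  # -> [1, 2, 3, 4, 5]
```

Each pass through the sequence extends the learned values by exactly one state, so a path of length $n$ needs about $n$ repetitions.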

### Q-learning with $\varepsilon$-greedy Exploration

In the previous example we were using the same fixed action sequence in every episode. Now we are going to do something a bit more realistic: we are going to use the $\varepsilon$-greedy policy with $\varepsilon = 0.1$ to do some exploration.
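
As a rough illustration of the idea, $\varepsilon$-greedy action selection can be sketched in a few lines. The function `eps_greedy_action` and the example values are hypothetical. This is not the `EpsGreedyPolicy` implementation from `rl_tabular`, only a sketch of the general technique.

```python
import numpy as np

def eps_greedy_action(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.2, 0.0])   # hypothetical values for one state
picks = [eps_greedy_action(q_values, 0.1, rng) for _ in range(10_000)]
greedy_fraction = picks.count(1) / len(picks)
print(f"greedy action chosen in {greedy_fraction:.1%} of steps")
```

Note that the greedy action is chosen slightly more often than $1 - \varepsilon$ of the time, because the random draw can also land on it. With four actions and $\varepsilon = 0.1$, the expected fraction is $0.9 + 0.1 / 4 = 0.925$.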

For this experiment we are going to use off-the-shelf implementations of both the policy and of Q-learning. The visualization will show both the action-value function – with **coloured tiles**  visualizing **the path that the agent took**  in the last episode – and the state-value function.

*Note the part that says `seed(1)`. This fixes the seed of the random number generator so that we get the same results every time the cell runs. Comment that line out if you want to check out other possible ways this can play out.* 



In [None]:
seed(1)

env = MazeEnv(
    show_path=True,
    show_path_kw=dict(show_arrows=False, show_visited=True)
)

# set up plotting
plotter = Plotter(env, ActionValueTable, 
    (StateValueTable, {'render_kwargs': {'skip': {'player_logger'}}}),
     figsize=[8, 4], render_agent=False
)

qtable = ActionValueTable(env.action_space.n)
algo = QLearning(qtable, alpha=0.5, gamma=0.9)
policy = EpsGreedyPolicy(qtable, env.action_space.n, epsilon=0.1)
trainer = Trainer(
    algo, policy, verbose=5, on_end_episode=[
        lambda *args: plotter.plot(qtable, qtable.to_state_values())]
)

trainer.train(env, max_episodes=50, max_episode_steps=1000)

In [None]:
env = MazeEnv(show_path=True)
qtable_control(env, qtable, render=False, max_steps=100)
env.render()

As you can see, the agent is walking around randomly until it discovers the goal state at which point it starts to learn a path leading to it. Thanks to the $\varepsilon$-greedy policy it also does a bit of exploration: it sometimes deviates from what it has already learnt, which allows it to experience something else for a change.

However, with $\varepsilon = 0.1$, even after a number of episodes, it only really learns about a single path to the goal state and about the values of some additional states that lie very close to it.

---
### Task 2:

**As your next task, try re-running the experiment with different values of $\varepsilon$. Observe what changes as $\varepsilon$ gets closer to 1. How much of the state space has the agent explored? What is the total number of steps that the agent took within the allotted 50 episodes?** 

---


In [None]:
env = MazeEnv(show_path=True)
qtable_control(env, qtable, max_steps=100)
env.render()

### Q-learning with Experience Replay

The next problem that we need to address is the low sample-efficiency of our algorithm. Recall how we needed to traverse the same path a number of times, because each time the agent only learned to remember a single step of it. Surely there must be a way to learn about the entire path after just having seen it once.

As it turns out, there are several ways to do just that – we will be using a technique known as experience replay, where we record all that the agent has experienced and then replay bits of that experience multiple times. That way we can actually propagate the value of the goal state transition all the way back to the initial state even if we have only observed the action sequence once.
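
To make this concrete, here is a minimal sketch on a hypothetical six-state chain (not `MazeEnv`, and not the `rl_tabular` replay buffer). A single episode is recorded once and then replayed several times in random order. The chain layout, the reward of 1 at the goal, and the sweep-based replay scheme are all assumptions made for illustration.

```python
import numpy as np

n_states = 6                      # chain 0..5, state 5 is the goal
alpha, gamma = 0.5, 0.9
Q = np.zeros((n_states, 1))       # a single action: "move right"

# One recorded episode: (state, action, reward, next_state, done)
episode = [(s, 0, float(s + 1 == n_states - 1), s + 1, s + 1 == n_states - 1)
           for s in range(n_states - 1)]

rng = np.random.default_rng(0)
for sweep in range(n_states - 1):
    # replay every stored transition once, in random order
    for i in rng.permutation(len(episode)):
        s, a, r, s_next, done = episode[i]
        target = r + (0.0 if done else gamma * Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])

print(Q[:, 0].round(3))
```

Although the agent walked the path only once, every state on it ends up with a non-zero value, because each replay sweep pushes the goal reward back by at least one more step.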

#### The Replay Buffer

In our case we do not need to implement a replay buffer from scratch – we have an implementation ready. However, we are going to have a look at its interface. As we construct the buffer, we specify its maximum size (`max_size`). As the capacity of the buffer is reached, new experience starts to replace the oldest experience recorded in the buffer. Usually, though, if there is enough memory, the maximum size is set to a very large value so that experience is often not discarded at all.

In addition to the maximum size of the buffer, we can also specify the default batch size (`batch_size`). When we call `.sample()`, we are going to get a batch containing `batch_size` randomly sampled transitions.



In [None]:
replay_buffer = ReplayBuffer(max_size=10000, batch_size=100)

To record a transition in the replay buffer, you can call `replay_buffer.add(obs, a, reward, obs_next, done, info)`, where `a` is the action and `obs` is the pre-transition observation, while `obs_next` is the post-transition observation.
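
For intuition, a buffer with this kind of interface could be sketched as follows. The class `TinyReplayBuffer` is hypothetical and is not the `rl_tabular` implementation. In particular, sampling without replacement and returning one tuple per field are assumptions made for this sketch.

```python
import random
from collections import deque

class TinyReplayBuffer:
    """Fixed-capacity buffer; the oldest transitions are dropped first."""

    def __init__(self, max_size, batch_size):
        self.storage = deque(maxlen=max_size)
        self.batch_size = batch_size

    def add(self, obs, a, reward, obs_next, done, info):
        self.storage.append((obs, a, reward, obs_next, done, info))

    def sample(self):
        # sample without replacement, capped at the number of stored items
        k = min(self.batch_size, len(self.storage))
        batch = random.sample(list(self.storage), k)
        # regroup into one tuple per field: obs, a, reward, obs_next, done, info
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.storage)

buf = TinyReplayBuffer(max_size=3, batch_size=2)
for t in range(5):
    buf.add(t, 0, 0.0, t + 1, False, {})

obs, a, reward, obs_next, done, info = buf.sample()
print(len(buf), [tr[0] for tr in buf.storage])  # -> 3 [2, 3, 4]
```

Using a `deque` with `maxlen` gives the "replace the oldest experience" behaviour for free: once the capacity is reached, each append silently drops the oldest item.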

---
#### Task 3: Record Transitions

**Run the environment picking actions randomly. Use `replay_buffer.add` to record 1000 transitions in the replay buffer.** 

Tips:

* Recall that `env.step(action)` returns the `(observation, reward, done, info)` tuple.
* Recall also that once `done` is true, the current episode has ended and you need to start a new episode using `obs = env.reset()`.
* Recall that you can sample a random action from the action space using `env.action_space.sample()`.
---


In [None]:
replay_buffer = ReplayBuffer(max_size=10000, batch_size=100)

env = MazeEnv()
obs = env.reset()
done = False

for step in range(1000):
    
    
    
    # ----
    
    
    
    obs = obs_next
    if done: obs = env.reset()

In [None]:
obs, a, reward, obs_next, done, info = replay_buffer.sample()

print(f"The replay buffer contains {len(replay_buffer)} samples.")
print(f"The batch contains {len(obs)} samples.\n")

print(f"obs: {obs[:3]} ...")
print(f"actions: {a[:3]} ...")
print(f"obs_next: {obs_next[:3]} ...")
print(f"done: {done[:3]} ...")
print(f"info: {info[:3]} ...")

#### Q-learning with a Replay Buffer

Next we are going to experiment with Q-learning using both an $\varepsilon$-greedy policy and a replay buffer. We are again going to be using an off-the-shelf implementation.

What you should see is that values propagate much faster because we are replaying the same experience again and again. We should also see that values are even being updated for states that have not been visited during the last episode. However, with $\varepsilon = 0.1$ the agent still does not always adequately explore all areas of the state space.



In [None]:
seed(1)

env = MazeEnv(
    show_path=True,
    show_path_kw=dict(show_arrows=False, show_visited=True)
)

plotter = Plotter(env, ActionValueTable, 
    (StateValueTable, {'render_kwargs': {'skip': {'player_logger'}}}),
     figsize=[8, 4], render_agent=False
)

qtable = ActionValueTable(env.action_space.n)
algo = QLearning(qtable, alpha=0.5, gamma=0.9)
policy = EpsGreedyPolicy(qtable, env.action_space.n, epsilon=0.1)
replay_buffer = ReplayBuffer(max_size=10000, batch_size=100)

trainer = Trainer(
    algo, policy, verbose=5, replay_buffer=replay_buffer,
    on_end_episode=[lambda *args: plotter.plot(qtable, qtable.to_state_values())],
)

trainer.train(env, max_episodes=20, max_episode_steps=1000)

In [None]:
env = MazeEnv(show_path=True)
qtable_control(env, qtable, max_steps=100)
env.render()

### Annealing the Exploration Rate

As we have seen before, keeping $\varepsilon$ too low results in too little exploration. On the other hand, if we keep $\varepsilon$ at a very high value, the agent will keep behaving erratically even after it has approached the true action-values. As a result, it may sacrifice a lot of reward, take unnecessarily many steps, etc.

A sensible thing to do, then, is to anneal the exploration rate: make the agent explore a lot at the beginning, but gradually decrease the exploration rate so that the agent keeps closer to the optimal policy after a while, exploiting and refining the acquired knowledge.

A common way to anneal the exploration rate is by prescribing an exponential schedule. Here we are going to use the `ExponentialSchedule` class, where we can specify:

* The initial value of the parameter;
* The final value of the parameter;
* The step to start the annealing at;
* The step to stop the annealing at.

In our case, the annealing will be based on the episode number (`method='episode'`) and not the step number. However, using the step number is also possible and probably even more common. Here is what an exponential schedule can look like:



In [None]:
eps_schedule = ExponentialSchedule(
    None,
    init_val=1.0, final_val=0.1,
    first_step=5, final_step=18,
    method='episode'
)

steps = range(eps_schedule.final_step+5)
eps = [eps_schedule.get_value(s) for s in steps]
plt.plot(steps, eps)
plt.xlabel("episode")
plt.ylabel("epsilon")
plt.grid(ls='--')

Having defined the schedule, we are now going to register it as an `on_begin_step` callback so that $\varepsilon$ keeps getting updated according to the schedule. Let's see what effect that is going to have.

Given that we keep $\varepsilon$ at $1$ for the first few episodes, the agent is going to behave in a completely random way then. This should help it acquire a lot of experience over those episodes. Afterwards, $\varepsilon$ is going to decrease gradually.
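
One plausible way to compute such a schedule is geometric interpolation between the initial and the final value. The function `exponential_anneal` below is a hypothetical sketch under that assumption. It is not necessarily the exact formula used by `ExponentialSchedule`.

```python
def exponential_anneal(step, init_val, final_val, first_step, final_step):
    """Geometric interpolation from init_val to final_val between two steps."""
    if step <= first_step:
        return init_val          # hold the initial value before annealing starts
    if step >= final_step:
        return final_val         # hold the final value once annealing is over
    frac = (step - first_step) / (final_step - first_step)
    return init_val * (final_val / init_val) ** frac

for episode in (0, 5, 10, 18, 22):
    print(episode, round(exponential_anneal(episode, 1.0, 0.1, 5, 18), 3))
```

With these settings the value stays at 1.0 up to episode 5, shrinks by a constant factor per episode until episode 18, and then stays at 0.1.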



In [None]:
seed(1)

env = MazeEnv(
    show_path=True,
    show_path_kw=dict(show_arrows=False, show_visited=True)
)

plotter = Plotter(env, ActionValueTable, 
    (StateValueTable, {'render_kwargs': {'skip': {'player_logger'}}}),
     figsize=[8, 4], render_agent=False
)

qtable = ActionValueTable(env.action_space.n)
algo = QLearning(qtable, alpha=0.5, gamma=0.9)
policy = EpsGreedyPolicy(qtable, env.action_space.n)

eps_schedule = ExponentialSchedule(
    policy.set_epsilon,
    init_val=1.0, final_val=0.1,
    first_step=5, final_step=18,
    method='episode'
)

replay_buffer = ReplayBuffer(max_size=10000, batch_size=100)

trainer = Trainer(
    algo, policy, replay_buffer, verbose=5,
    on_begin_step=[eps_schedule],
    on_end_episode=[lambda *args: plotter.plot(qtable, qtable.to_state_values())],
)

trainer.train(env, max_episodes=20, max_episode_steps=1000)

In [None]:
env = MazeEnv(show_path=True)
qtable_control(env, qtable, max_steps=100)
env.render()