**NOTE: This notebook is written for the Google Colab platform. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook.**



In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install git+https://github.com/michalgregor/gym_plannable.git
!{sys.executable} -m pip install git+https://github.com/michalgregor/rl_tabular.git

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
%matplotlib inline
from gym_plannable.env import MazeEnv
from rl_tabular import StateValueTable, ActionValueTable, collect_states
from rl_tabular.maze_env_plots import plot_action_values, plot_state_values
from rl_tabular import vtable_control, qtable_control
import numpy as np

## Dynamic Programming: Value Iteration

As we have already learnt, reinforcement learning is all about discovering a policy that maximizes long-term rewards. We also know that one branch of approaches to that problem relies on value functions. 

A *state-value function* $V^{\pi}(s)$ expresses how much reward we can expect in the long term when starting in state $s$ and following policy $\pi$ thereafter [[sutton1998]](#sutton1998):

$$
V^{\pi}(s) = \mathbb{E}_{\pi}\{ R_{t}|s_{t}=s \}
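$$

where $R_t$ denotes the return, i.e. the discounted sum of rewards received from time step $t$ onwards.

You will also recall that, provided we have a distribution model of the environment and the state-action space is small, we can compute the optimal state-value function using a method known as **value iteration**.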

We initialize the values of all states to zeros:



In [None]:
vtable = StateValueTable()
plot_state_values(vtable);

Then we apply the following rule, iterating over all states $s$ multiple times:

$$
V_{\color{red} k+1}(s) = \max_{a \in A} \sum_{s'} P_{ss'}^{a} \left[r_{ss'}^{a} + \gamma V_{\color{red} k}(s') \right].
$$

Here $P_{ss'}^{a}$ is the probability of going from state $s$ to state $s'$ after taking action $a$; $r^a_{ss'}$ is the reward received for that same transition; $\gamma$ is the discount factor; and $k$ indexes the sweeps over the state space.
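To make the rule concrete outside of the maze, here is a minimal sketch on a hypothetical 3-state, 2-action MDP. The arrays `P` and `R` and their `[action, state, next_state]` layout are assumptions made for this illustration only; they are not part of `gym_plannable` or `rl_tabular`.

```python
import numpy as np

# Hypothetical toy MDP: 2 actions, 3 states; state 2 is terminal.
# P[a, s, s2] is the probability of moving from s to s2 under action a.
P = np.zeros((2, 3, 3))
P[0, 0] = [0.0, 1.0, 0.0]   # action 0 in state 0: move to state 1
P[1, 0] = [1.0, 0.0, 0.0]   # action 1 in state 0: stay in state 0
P[0, 1] = [0.0, 0.0, 1.0]   # action 0 in state 1: reach the terminal state
P[1, 1] = [0.5, 0.0, 0.5]   # action 1 in state 1: state 0 or terminal, 50/50
P[:, 2, 2] = 1.0            # the terminal state loops onto itself

# R[a, s, s2] is the reward for that transition:
# +1 for entering the terminal state from a non-terminal one.
R = np.zeros((2, 3, 3))
R[:, :2, 2] = 1.0

gamma = 0.9
V = np.zeros(3)             # V_0: all values start at zero

for k in range(50):
    # Q[a, s] = sum_{s'} P[a, s, s'] * (R[a, s, s'] + gamma * V_k[s'])
    Q = (P * (R + gamma * V)).sum(axis=2)
    V_new = Q.max(axis=0)   # V_{k+1}(s) = max_a Q[a, s]
    if np.max(np.abs(V_new - V)) < 1e-10:
        break               # values have stopped changing
    V = V_new

print(V)
```

For this toy problem the values settle at $V = (0.9, 1.0, 0.0)$ after a few sweeps. The sketch computes every $V_{k+1}$ from a frozen copy of $V_k$, which matches the indices in the formula; updating a table in place during a sweep is a common variant.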

---
### Task 1: Value Iteration

**In the cell below, fill in the value iteration rule.**

$$
V_{\color{red} k+1}(s) = \max_{a \in A} \sum_{s'} P_{ss'}^{a} \left[r_{ss'}^{a} + \gamma V_{\color{red} k}(s') \right].
$$

---


In [None]:
gamma = 0.9
env = MazeEnv()
env.reset()

# we collect all reachable states
states = collect_states(env.single_plannable_state())

for it in range(15):
    print(f"Iteration {it} started.")
    
    for state in states:
        if state.is_done():
            continue
            
        legals = state.legal_actions()
        maxval = -np.inf

        for a in legals:
            val = 0

            for next_state, prob in state.all_next(a):
                r = next_state.reward()                
                
                
                
                # TODO (Task 1): complete the update below using
                # r, gamma, prob, vtable and next_state
                val +=     # -----

                
                
                
            maxval = max(val, maxval)

        vtable[state] = maxval
        
    plot_state_values(vtable, states, env=env, update_display=True)

### Controlling the Agent using the State-Value Function

If the action space is discrete and relatively small and we know the optimal value function, it is easy to derive the optimal policy from it – we simply iterate over all the actions and, using the model, pick the one with the highest expected reward plus discounted value of the next state.
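In symbols, such a one-step lookahead policy can be written as follows, reusing the notation of the value iteration rule (the symbol $\pi(s)$ for the chosen action is introduced here for illustration):

$$
\pi(s) = \arg\max_{a \in A} \sum_{s'} P_{ss'}^{a} \left[ r_{ss'}^{a} + \gamma V(s') \right].
$$

Note that the transition reward is part of the comparison: an action leading into a terminal state, whose own value is zero, can still be the best choice if the transition itself is rewarded.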



In [None]:
env = MazeEnv(render_mode='human', show_path=True)
vtable_control(env, vtable, max_steps=100)

### Converting Between State and Action-Value Functions

Given a model, a state-value function can be converted to an action-value function by applying essentially the same logic that we applied when controlling the agent using the state-value function.
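The conversion amounts to $Q(s, a) = \sum_{s'} P_{ss'}^{a} \left[r_{ss'}^{a} + \gamma V(s')\right]$. Below is a minimal sketch on a hypothetical toy model; the helper `v_to_q` and the array layout are assumptions made for illustration and do not necessarily reflect how `to_action_values` is implemented in `rl_tabular`.

```python
import numpy as np

def v_to_q(P, R, V, gamma):
    """Q[s, a] = sum over s' of P[a, s, s'] * (R[a, s, s'] + gamma * V[s'])."""
    return (P * (R + gamma * V)).sum(axis=2).T

# Hypothetical model with 2 actions and 3 states; state 2 is terminal.
P = np.array([
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],  # action 0
    [[1.0, 0.0, 0.0], [0.5, 0.0, 0.5], [0.0, 0.0, 1.0]],  # action 1
])
R = np.zeros((2, 3, 3))
R[:, :2, 2] = 1.0                # reward +1 for entering the terminal state
V = np.array([0.9, 1.0, 0.0])    # state values assumed to be known already

Q = v_to_q(P, R, V, gamma=0.9)
print(Q)                         # one row per state, one column per action
```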



In [None]:
qtable = vtable.to_action_values(states)
plot_action_values(qtable, states, action_spec=env.action_spec)

It is also possible to convert the action-value function into a state-value function:
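For a greedy policy this direction needs no model: the value of a state is the value of its best action, $V(s) = \max_{a} Q(s, a)$. Here is a minimal sketch with a hypothetical table of action values; whether `to_state_values` in `rl_tabular` works exactly this way is an assumption.

```python
import numpy as np

# Hypothetical action values: one row per state, one column per action.
Q = np.array([
    [0.90, 0.81],
    [1.00, 0.905],
    [0.00, 0.00],
])

# Under a greedy policy, a state is worth as much as its best action.
V = Q.max(axis=1)
print(V)
```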



In [None]:
vtable2 = qtable.to_state_values()
plot_state_values(vtable2, states)

### Control using the Action-Value Function

Agents are easier to control using the action-value function than using the state-value function. Since the action-value function gives us the value of each action directly, we merely need to pick the action with the maximum value – we no longer need a model to query for possible next states.
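A minimal sketch of this model-free action selection, assuming the action values are stored in a hypothetical array with one row per state (the helper `greedy_action_from_q` is illustrative and not part of `rl_tabular`):

```python
import numpy as np

def greedy_action_from_q(Q, s):
    """Return the index of the highest-valued action in state s."""
    return int(np.argmax(Q[s]))

# Hypothetical action values for 2 states and 3 actions.
Q = np.array([
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.6],
])

actions = [greedy_action_from_q(Q, s) for s in range(len(Q))]
print(actions)
```

Only a table lookup and an `argmax` are involved; no transition probabilities or rewards are consulted.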



In [None]:
env = MazeEnv(render_mode='human', show_path=True)
qtable_control(env, qtable, max_steps=100)

### References

<a id="sutton1998">[sutton1998]</a> SUTTON, R.S. - BARTO, A.G. Reinforcement Learning: An Introduction. [s.l.]: The MIT Press, 1998. ISBN 0262193981.

