<a href="https://colab.research.google.com/github/onlyabhilash/reinforcement_learning_course_materials/blob/main/exercises/templates/ex03/Ex3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 3: Dynamic Programming

## 1) Policy Evaluation

After your master's thesis you decide to keep studying and do your beer bachelor.
To graduate, you have to drink three beers in three different pubs.
There are six pubs in town; you start at home and will (hopefully) end up at home again. The problem is depicted in the following picture:

![](Beer-Bachelor.png)

In our first example we follow the 50/50 policy.
After drinking in a pub, e.g. Auld Triangle, there is a $50 \, \%$ probability to go "up" to the Globetrotter and a $50\, \%$ probability to go "down" to the Black Sheep.
Evaluate the state values using policy evaluation ($v_\mathcal{X} = \mathcal{R}_\mathcal{X} + \gamma \mathcal{P}_{xx'} v_\mathcal{X}$):

\begin{align*}
\begin{bmatrix}
v^{50/50}_{1}\\
.\\
.\\
.\\
v^{50/50}_{n}\\
\end{bmatrix}
=
\begin{bmatrix}
\mathcal{R}^{50/50}_{1}\\
.\\
.\\
.\\
\mathcal{R}^{50/50}_{n}\\
\end{bmatrix}
+
\gamma
\begin{bmatrix}
{p}^{50/50}_{11}&...&{p}^{50/50}_{1n}\\
.& &.\\
.& &.\\
.& &.\\
{p}^{50/50}_{n1}&...&{p}^{50/50}_{nn}\\
\end{bmatrix}
\begin{bmatrix}
v^{50/50}_{1}\\
.\\
.\\
.\\
v^{50/50}_{n}\\
\end{bmatrix}
\end{align*}
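
Because the equation above is linear in $v_\mathcal{X}$, it can be rearranged into a closed form. The following short derivation is a sketch; $I$ denotes the $n \times n$ identity matrix, a symbol introduced here for illustration. For $\gamma < 1$ and a row-stochastic $\mathcal{P}_{xx'}$ the inverse exists.

\begin{align*}
v_\mathcal{X} - \gamma \mathcal{P}_{xx'} v_\mathcal{X} &= \mathcal{R}_\mathcal{X}\\
(I - \gamma \mathcal{P}_{xx'}) v_\mathcal{X} &= \mathcal{R}_\mathcal{X}\\
v_\mathcal{X} &= (I - \gamma \mathcal{P}_{xx'})^{-1} \mathcal{R}_\mathcal{X}
\end{align*}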

The rewards are given as negative numbers next to the arrows and represent the distances between two bars as a penalty.
In this exercise we will set $\gamma = 0.9$.
In the shown problem we have $n = 8$ states (pubs, including start-home and end-home), ordered as given by the state space:

\begin{align*}
\mathcal{X} =
\left\lbrace \begin{matrix}
\text{Start: Home}\\
\text{Auld Triangle}\\
\text{Lötlampe}\\
\text{Globetrotter}\\
\text{Black Sheep}\\
\text{Limericks}\\
\text{Fat Louis}\\
\text{End: Home}\\
\end{matrix}
\right\rbrace
\end{align*}

Use a short Python script to calculate the state values!
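
To illustrate the mechanics only, here is a minimal sketch of policy evaluation as a linear solve. The three-state chain, its transition matrix `P_toy`, and its rewards `R_toy` are made-up assumptions and are not the pub problem; the matrices for the exercise still have to be read off the picture.

```python
import numpy as np

# Hypothetical 3-state chain (NOT the pub problem):
# state 0 moves to state 1 or 2 with probability 0.5 each,
# state 1 moves to state 2, and state 2 is terminal (all-zero row).
P_toy = np.array([[0.0, 0.5, 0.5],
                  [0.0, 0.0, 1.0],
                  [0.0, 0.0, 0.0]])

# Expected immediate reward per state under the policy (made-up numbers).
R_toy = np.array([-2.0, -1.0, 0.0])

gamma_toy = 0.9  # discount factor

# Solve (I - gamma * P) v = R rather than forming the inverse explicitly.
v_toy = np.linalg.solve(np.eye(3) - gamma_toy * P_toy, R_toy)
print(v_toy)
```

Using `np.linalg.solve` instead of `np.linalg.inv` is a common design choice because it avoids computing the inverse explicitly.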

YOUR ANSWER HERE

In [None]:
import numpy as np

# define given parameters
gamma = 0.9 # discount factor

# YOUR CODE HERE
raise NotImplementedError()

print(v_X)


## 2) Exhaustive Policy Search

From now on use $\gamma = 1$.

Since you have prior knowledge from your master's degree, you try to minimize the total distance of your tour in order to have more time in the pubs. To do so, you perform the following exhaustive search:

1. Write down all possible path-permutations and calculate the distances.
2. Which path is best in terms of beer per distance?
3. Derive the formula to calculate the number of necessary path comparisons.
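
The enumeration idea behind the steps above can be sketched as follows. The graph `toy_graph`, its state names, and its rewards are hypothetical and much smaller than the pub problem; the helper `path_return` is likewise an illustrative name.

```python
from itertools import product

# Hypothetical toy graph (NOT the distances from the picture):
# state -> {action: (next_state, reward)}, rewards are negative distances.
toy_graph = {
    "S": {"up": ("A", -1), "down": ("B", -2)},
    "A": {"up": ("E", -4), "down": ("E", -1)},
    "B": {"up": ("E", -1), "down": ("E", -3)},
}

def path_return(graph, start, actions):
    """Follow the action sequence from start and sum the rewards on the way."""
    state, total = start, 0
    for action in actions:
        state, reward = graph[state][action]
        total += reward
    return total

# Enumerate every action sequence of length 2 (two decisions in the toy graph).
all_paths = {actions: path_return(toy_graph, "S", actions)
             for actions in product(["up", "down"], repeat=2)}

# The best path is the one with the largest return (smallest total distance).
best_path = max(all_paths, key=all_paths.get)
print(all_paths)
print(best_path, all_paths[best_path])
```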



YOUR ANSWER HERE

## 3) Dynamic Programming - The Idea

Trying out all combinations might not be best for your liver, so you want to solve the problem above using dynamic programming.

Making use of value iteration, derive the values resulting from the optimal policy: $v_{i+1}^*(x_k) = \text{max}_u (r_{k+1} + v_{i}^*(x_{k+1}))$.

Hint: There is only one policy improvement step needed.

How many value comparisons have to be made?
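
As a rough sketch of the backward sweep on a small example, the code below applies the update $v(x_k) = \max_u (r_{k+1} + v(x_{k+1}))$ with $\gamma = 1$ to a hypothetical toy graph (not the pub problem) and counts the comparisons. The names `toy_graph`, `toy_values`, and `comparisons` are assumptions made for illustration.

```python
# Hypothetical toy graph (NOT the distances from the picture):
# state -> {action: (next_state, reward)}, "E" is the terminal state.
toy_graph = {
    "S": {"up": ("A", -1), "down": ("B", -2)},
    "A": {"up": ("E", -4), "down": ("E", -1)},
    "B": {"up": ("E", -1), "down": ("E", -3)},
}

# Terminal value is zero; all other values are filled in by the sweep.
toy_values = {"S": 0.0, "A": 0.0, "B": 0.0, "E": 0.0}
comparisons = 0

# Process states from the end towards the start, so one sweep is enough.
for state in ["B", "A", "S"]:
    candidates = [reward + toy_values[nxt]
                  for nxt, reward in toy_graph[state].values()]
    toy_values[state] = max(candidates)
    # Picking the maximum of m candidates takes m - 1 comparisons.
    comparisons += len(candidates) - 1

print(toy_values)
print(comparisons)
```

Each state is visited once and only its own actions are compared, which is why the count grows with the number of states rather than with the number of paths.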

YOUR ANSWER HERE



## 4) Value Iteration in Stochastic Environments

All of the pubs have different special offers on some days of the week.
Due to general confusion you have no clue which day of the week it is.
You only know, for example, that Globetrotter has one happy-hour day per week, whereas Black Sheep has four.
So the chance of getting a positive reward is higher in the Black Sheep than in the Globetrotter.

To find the best path we can use the Bellman optimality equation known from the lecture:

$v^*(x_k) = \max_{u_k\in \mathcal{U}} \mathbb{E}\left[R_{k+1} + \gamma v^*(X_{k+1}) \,|\, X_k = x_k, U_k = u_k\right]$

## Comparison to lecture:
In the tree example from the lecture we have deterministic rewards and stochastic state transitions, so the expectation reduces to
$v^*(x_k) = \max_{u_k\in \mathcal{U}} \left( \mathcal{R}_x^u + \gamma \sum_{x_{k+1}\in \mathcal{X}}p_{xx'}^u v^*(x_{k+1}) \right)$

![](TreeExampleVL.PNG)


In our problem the state transitions are deterministic, because we are certain to reach the bar we plan to visit.
The reward, however, has a stochastic offset.
If we arrive on a happy-hour day (which is random, since we do not know the weekday), we get an additional positive reward.
The probability of getting it depends on the number of happy-hour days per week.
For example:

![](Stochastic_Rewards.png)

As can be seen, Globetrotter grants the additional happy-hour reward in 1 of 7 cases (days) and Black Sheep in 4 of 7 cases.

The states are defined in the following order:

$\mathcal{X} = \left\lbrace \begin{matrix}
\text{Start: Home}\\
\text{Auld Triangle}\\
\text{Lötlampe}\\
\text{Globetrotter}\\
\text{Black Sheep}\\
\text{Limericks}\\
\text{Fat Louis}\\
\text{End: Home}\\
\end{matrix}
\right\rbrace$

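The probability $p_{xr_+}$ of getting the positive reward (happy hour) and the size $r_+$ of that reward are given by the following matrices. The first row belongs to the action $\textit{up}$, the second row to $\textit{down}$, and the columns follow the state order above (without End: Home):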

$p_{xr_+} = \left[ \begin{matrix}
\frac{3}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & 0 & 0\\
\frac{6}{7} & \frac{4}{7} & \frac{4}{7} & \frac{5}{7} & \frac{5}{7} & 0 & 0\\
\end{matrix}
\right]$

                    
$r_{+} = \left[ \begin{matrix}
12 & 16 & 16 & 16 & 16 & 0 & 0\\
6 & 10 & 10 & 8 & 8 & 0 & 0\\
\end{matrix}
\right]$  

The probability of getting no extra reward is $p_{xr_-} = 1 - p_{xr_+}$.
In that case the extra reward $r_-$ is zero for every state and action:

$r_{-} = \left[ \begin{matrix}
0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0\\
\end{matrix}
\right]$  


The available actions are $\textit{up}$ and $\textit{down}$ (in the states Limericks (Li) and Fat Louis (F), $\textit{up}$ and $\textit{down}$ mean the same).

So if you are at home and you go $\textit{up}$ you get a fixed reward of $-3$ on the way.
With a probability of $p_{x=0,r_+} = \frac{3}{7}$ there is happy hour in the Auld Triangle and you get an additional reward of $+12$.
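
As a worked instance using only the numbers stated above (and $r_- = 0$), the expected reward for going $\textit{up}$ from home is:

\begin{align*}
\mathbb{E}\left[R_{k+1} \,|\, X_k = \text{Start: Home}, U_k = \textit{up}\right]
= -3 + \frac{3}{7} \cdot 12 + \frac{4}{7} \cdot 0 = \frac{15}{7} \approx 2.14
\end{align*}

This is the quantity that the solution cell stores in `expected_rewards[0, 0]`.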



1. Examine the pseudocode for value iteration as presented in the lecture. Does this pseudocode already cover stochastic rewards, and if so, where?
2. Use value iteration with $\gamma = 1$ to:
    1. Find the state value for each state
    2. Find the optimal policy
3. Check your solution by carrying out the dynamic programming by hand, as in 3)

### Pseudocode:
***
- **input:** Full model of the MDP i.e. $\left\langle\mathcal{X}, \mathcal{U}, \mathcal{P}, \mathcal{R}, \gamma \right\rangle$
- **parameter:** $\delta>0$ as accuracy termination threshold
- **init:** $\hat{v}(x_k)\, \forall \, x_k\in\mathcal{X}$ arbitrary except $\hat{v}(x_k)=0$ if $x_k$ is terminal
- **repeat**
    - $\Delta \leftarrow 0 $
    - **for** all $x_k\in\mathcal{X}$
        - $\tilde{v}\leftarrow \hat{v}(x_k)$
        - $\hat{v}(x_k)\leftarrow  \max_{u_k\in\mathcal{U}}\left(\mathcal{R}^u_x + \gamma\sum_{x_{k+1}\in\mathcal{X}}p_{xx'}^u \hat{v}(x_{k+1})\right)$
        - $\Delta \leftarrow \max\left(\Delta, |\tilde{v}-\hat{v}(x_k) |\right)$
    - **end**
- **until** $\Delta < \delta$
- **output:** Deterministic policy $\pi\approx\pi^*$, such that

$\pi(x_k)\leftarrow  \text{arg} \, \text{max}_{u_k\in\mathcal{U}}\left(\mathcal{R}^u_x + \gamma\sum_{x_{k+1}\in\mathcal{X}}p_{xx'}^u \hat{v}(x_{k+1})\right)$
***
Value iteration (note: compared to policy iteration, value iteration doesn't require an initial policy but only a state-value guess)
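
One possible way to translate the pseudocode into Python is sketched below on a hypothetical three-state MDP with two actions. The function name `value_iteration`, the array layout `P[u, x, x']` and `R[u, x]`, and all numbers are assumptions made for illustration; this is not the pub problem and not necessarily the lecture's reference implementation.

```python
import numpy as np

def value_iteration(P, R, gamma, delta, terminal):
    """In-place value iteration.

    P[u, x, x'] holds transition probabilities, R[u, x] expected rewards,
    terminal is a boolean mask of terminal states (their value stays 0).
    """
    n_states = P.shape[1]
    v = np.zeros(n_states)
    while True:
        Delta = 0.0
        for x in range(n_states):
            if terminal[x]:
                continue
            v_old = v[x]
            # Maximize over actions: expected reward plus discounted next value.
            v[x] = np.max(R[:, x] + gamma * P[:, x, :] @ v)
            Delta = max(Delta, abs(v_old - v[x]))
        if Delta < delta:
            break
    # Greedy policy with respect to the final value estimate.
    policy = np.argmax(R + gamma * P @ v, axis=0)
    return v, policy

# Hypothetical MDP: state 2 is terminal (absorbing, zero reward).
P_mdp = np.zeros((2, 3, 3))
P_mdp[0, 0] = [0.0, 0.5, 0.5]   # action 0 in state 0: stochastic transition
P_mdp[1, 0] = [0.0, 0.0, 1.0]   # action 1 in state 0: straight to the end
P_mdp[:, 1] = [0.0, 0.0, 1.0]   # both actions in state 1 lead to the end
P_mdp[:, 2] = [0.0, 0.0, 1.0]   # terminal state loops onto itself
R_mdp = np.array([[-1.0, -3.0, 0.0],
                  [-2.0, -1.0, 0.0]])

v_mdp, policy_mdp = value_iteration(
    P_mdp, R_mdp, gamma=0.9, delta=1e-6,
    terminal=np.array([False, False, True]))
print(v_mdp)
print(policy_mdp)
```

Note that `R[u, x]` already holds an expected reward, so a stochastic reward offset would enter through this array without changing the loop.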

## 4) Solution


In [None]:
import numpy as np

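# fixed rewards (negative distances) per state; row 0: action up, row 1: action down
r_ways = np.array([[-3, -2, -3, -4, -5, -6, -7],
                   [-1, -4, -5, -5, -6, -6, -7]])

# probability of a happy hour at the next pub; row 0: action up, row 1: action down
p_xr = np.array([[3/7, 1/7, 1/7, 1/7, 1/7, 0, 0],
                 [6/7, 4/7, 4/7, 5/7, 5/7, 0, 0]])

# additional reward in case of a happy hour; row 0: action up, row 1: action down
r_happy = np.array([[12, 16, 16, 16, 16, 0, 0],
                    [ 6, 10, 10,  8,  8, 0, 0]])

# expected reward per action and state; the "no happy hour" case contributes zero
expected_rewards = r_ways + p_xr*r_happy + (1-p_xr)*0
values = np.zeros(8) # one value per state, including the terminal state


delta = 0.1 # termination threshold for the largest value change per sweep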

# YOUR CODE HERE
raise NotImplementedError()
print(values)
print(iteration_idx)

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE