# Value Iteration

Consider the grid world shown below. An agent moves from cell to cell using the displayed actions (arrows). The reward for an action is the number on its arrow.

Assume that it is a **deterministic** MDP. 

Conduct **value iteration** to compute the state values of all states $s_i$, $i \in \{0,\ldots,8\}$. The **discount factor** is $\gamma=0.9$, and the results should be rounded to integers.

Note: The states $s_3$ and $s_8$ are terminal. 

**Recap: Value Iteration**

Value iteration combines the policy evaluation and policy improvement steps into a single update that iteratively refines the state value estimates according to

$$ V_{k+1}(s) = \max_{a \in \mathcal{A}} \left\{r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a) V_{k}(s')\right\} $$
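For a deterministic MDP the sum over $s'$ collapses to a single successor state, so the update can be sketched in a few lines of Python. This is only an illustration: the table `TRANSITIONS` and the function name `sweep` are assumptions of this sketch, and the successors and rewards are read off from the worked equations of this exercise.

```python
GAMMA = 0.9  # discount factor from the exercise

# state -> {action: (next_state, reward)}; assumed encoding of the grid,
# read off from the worked equations. Terminal states have no actions.
TRANSITIONS = {
    0: {"up": (3, -30), "right": (1, 0)},
    1: {"left": (0, 0), "right": (2, 0), "up": (4, -30)},
    2: {"up": (5, 0)},
    3: {},
    4: {"left": (3, -40), "right": (5, 0), "up": (7, 10)},
    5: {"left": (4, 0), "down": (2, 0), "up": (8, 80)},
    6: {"down": (3, -20), "right": (7, -10)},
    7: {"right": (8, 100)},
    8: {},
}


def sweep(values, transitions=TRANSITIONS, gamma=GAMMA):
    """Apply one synchronous value-iteration update to every state."""
    new_values = []
    for state in range(len(values)):
        actions = transitions[state]
        if not actions:
            new_values.append(0.0)  # terminal state: value stays 0
        else:
            # deterministic case: one successor per action
            new_values.append(
                max(reward + gamma * values[nxt] for nxt, reward in actions.values())
            )
    return new_values


v0 = [0.0] * 9
v1 = sweep(v0)
print(v1)  # [0.0, 0.0, 0.0, 0.0, 10.0, 80.0, -10.0, 100.0, 0.0]
```

Calling `sweep` repeatedly on its own output produces $V_2, V_3, \ldots$ in the same way.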

<img src="grid.png"
     alt="Grid World"
     width="600" />

## Method 1:

### Initialization:


$ V_{0}(s_0) =0$  
$ V_{0}(s_1) =0$  
$ V_{0}(s_2) =0$  
$ V_{0}(s_3) =0$  
$ V_{0}(s_4) =0$  
$ V_{0}(s_5) =0$  
$ V_{0}(s_6) =0$  
$ V_{0}(s_7) =0$  
$ V_{0}(s_8) =0$  

$\pi_{greedy} = [?,?,?,?,?,?,?,?,?]$

## First iteration (k=1)

### State $s_0$

$$ V_{1}(s_0) = \max \{r(s_0,a_{up}) + \gamma  V_{0}(s_3),\quad r(s_0,a_{right}) + \gamma  V_{0}(s_1)\}$$

$$ V_{1}(s_0) = \max  \{-30, 0\} = 0 $$

$\pi_{greedy} = [a_{right},?,?,?,?,?,?,?,?]$

### State $s_1$

$$  V_{1}(s_1) = \max \{r(s_1,a_{left}) + \gamma  V_{0}(s_0), \quad r(s_1,a_{right}) + \gamma  V_{0}(s_2),\quad r(s_1,a_{up}) + \gamma  V_{0}(s_4)\} $$

$$ V_{1}(s_1) = \max  \{0, 0, -30\} = 0 $$ 

$\pi_{greedy} = [a_{right},\{a_{left},a_{right}\},?,?,?,?,?,?,?]$

### State $s_2$

$$ V_{1}(s_2) =  r(s_2,a_{up}) + \gamma  V_{0}(s_5)$$

$$ V_{1}(s_2) = 0 $$

$\pi_{greedy} = [a_{right},\{a_{left},a_{right}\},a_{up},?,?,?,?,?,?]$

### State $s_3$

$$ V_{1}(s_3) = 0 $$

$\pi_{greedy} = [a_{right},\{a_{left},a_{right}\},a_{up}, \{\},?,?,?,?,?]$

### State $s_4$

$$ V_{1}(s_4) = \max \{r(s_4,a_{left}) + \gamma  V_{0}(s_3),\quad
  r(s_4,a_{right}) + \gamma  V_{0}(s_5),\quad
  r(s_4,a_{up}) +  \gamma  V_{0}(s_7)\} $$

$$ V_{1}(s_4) = \max  \{-40, 0, 10\} = 10 $$

$\pi_{greedy} = [a_{right},\{a_{left},a_{right}\},a_{up}, \{\},a_{up},?,?,?,?]$

### State $s_5$

$$ V_{1}(s_5) = \max \{r(s_5,a_{left}) +  \gamma  V_0(s_4),\quad
  r(s_5,a_{down}) + \gamma  V_0(s_2),\quad
  r(s_5,a_{up}) + \gamma  V_0(s_8)\} $$

$$ V_{1}(s_5) = \max  \{0, 0, 80\} = 80 $$

$\pi_{greedy} = [a_{right},\{a_{left},a_{right}\},a_{up}, \{\},a_{up},a_{up},?,?,?]$

### State $s_6$

$$ V_{1}(s_6) = \max \{r(s_6,a_{down}) +\gamma  V_{0}(s_3),\quad r(s_6,a_{right}) + \gamma  V_{0}(s_7)\}$$

$$ V_{1}(s_6) = \max  \{-20, -10\} = -10 $$

$\pi_{greedy} = [a_{right},\{a_{left},a_{right}\},a_{up}, \{\},a_{up},a_{up},a_{right},?,?]$

### State $s_7$

$$ V_{1}(s_7) =  r(s_7,a_{right}) + \gamma  V_{0}(s_8)$$

$$ V_{1}(s_7) = 100 $$

$\pi_{greedy} = [a_{right},\{a_{left},a_{right}\},a_{up}, \{\},a_{up},a_{up},a_{right},a_{right} , ?]$

### State $s_8$

$$ V_{1}(s_8) = 0 $$

$\pi_{greedy} = [a_{right},\{a_{left},a_{right}\},a_{up}, \{\},a_{up},a_{up},a_{right},a_{right} , \{\}]$
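The greedy policy collected during this iteration can be reproduced with a small sketch that keeps every action attaining the maximum, so ties show up as sets. The function name `greedy_actions` and the table `TRANSITIONS` are assumptions of this example; the transitions are read off from the equations above.

```python
GAMMA = 0.9

# state -> {action: (next_state, reward)}; assumed encoding of the grid
TRANSITIONS = {
    0: {"up": (3, -30), "right": (1, 0)},
    1: {"left": (0, 0), "right": (2, 0), "up": (4, -30)},
    2: {"up": (5, 0)},
    3: {},
    4: {"left": (3, -40), "right": (5, 0), "up": (7, 10)},
    5: {"left": (4, 0), "down": (2, 0), "up": (8, 80)},
    6: {"down": (3, -20), "right": (7, -10)},
    7: {"right": (8, 100)},
    8: {},
}


def greedy_actions(values, transitions=TRANSITIONS, gamma=GAMMA, tol=1e-9):
    """Return, per state, the set of actions whose lookahead value is maximal."""
    policy = []
    for state in range(len(values)):
        lookahead = {
            action: reward + gamma * values[nxt]
            for action, (nxt, reward) in transitions[state].items()
        }
        if not lookahead:
            policy.append(set())  # terminal state: nothing to choose
            continue
        best = max(lookahead.values())
        policy.append({a for a, q in lookahead.items() if best - q <= tol})
    return policy


# greedy policy with respect to V_0 (all zeros), as used in the first iteration
pi_greedy = greedy_actions([0.0] * 9)
print(pi_greedy)
```

The tie in $s_1$ appears as a set with both `left` and `right`, and the terminal states $s_3$ and $s_8$ get the empty set.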

## Second iteration (k=2)

$V_1 = \{0,0,0,0,\textcolor{red}{10},\textcolor{red}{80},\textcolor{red}{-10},\textcolor{red}{100},0\}$

### State $s_0$

$$ V_{2}(s_0) = \max \{r(s_0,a_{up}) + \gamma  V_{1}(s_3),\quad r(s_0,a_{right}) + \gamma  V_{1}(s_1)\}$$

$$ V_{2}(s_0) = \max  \{-30 + 0.9*0,\quad 0 + 0.9* 0\} = 0 $$

$\pi_{greedy} = [a_{right},\{a_{left},a_{right}\},a_{up}, \{\},a_{up},a_{up},a_{right},a_{right} , \{\}]$

### State $s_1$

$$ V_{2}(s_1) = \max \{r(s_1,a_{left}) +  \gamma  V_{1}(s_0),\quad
  r(s_1,a_{right}) +  \gamma  V_{1}(s_2),\quad
  r(s_1,a_{up}) +  \gamma  V_{1}(s_4)\} $$

$$ V_{2}(s_1) = \max  \{0 + 0.9*0, \quad 0 + 0.9*0, \quad -30 + 0.9*10 \} = 0 $$ 

$\pi_{greedy} = [a_{right},\{a_{left},a_{right}\},a_{up}, \{\},a_{up},a_{up},a_{right},a_{right} , \{\}]$

### State $s_2$

$$ V_{2}(s_2) =  r(s_2,a_{up}) + \gamma  V_{1}(s_5)$$

$$ V_{2}(s_2) = 0 + 0.9*80 = 72 $$

$\pi_{greedy} = [a_{right},\{a_{left},a_{right}\},a_{up}, \{\},a_{up},a_{up},a_{right},a_{right} , \{\}]$

### State $s_4$

$$ V_{2}(s_4) = \max \{r(s_4,a_{left}) +  \gamma  V_{1}(s_3),\quad
  r(s_4,a_{right}) +  \gamma  V_{1}(s_5),\quad
  r(s_4,a_{up}) +  \gamma  V_{1}(s_7)\} $$

$$ V_{2}(s_4) = \max  \{-40 + 0.9 * 0,\quad 0 + 0.9 * 80, \quad 10  + 0.9 * 100\} = 100 $$

$\pi_{greedy} = [a_{right},\{a_{left},a_{right}\},a_{up}, \{\},a_{up},a_{up},a_{right},a_{right} , \{\}]$

### State $s_5$ 

$$ V_{2}(s_5) = \max \{r(s_5,a_{left}) + \gamma  V_{1}(s_4),\quad
  r(s_5,a_{down}) + \gamma  V_{1}(s_2),\quad
  r(s_5,a_{up}) + \gamma  V_{1}(s_8)\} $$

$$ V_{2}(s_5) = \max  \{0 + 0.9*10,\quad 0 + 0.9*0,\quad 80 + 0.9*0\} = 80 $$

$\pi_{greedy} = [a_{right},\{a_{left},a_{right}\},a_{up}, \{\},a_{up},a_{up},a_{right},a_{right} , \{\}]$

### State $s_6$

$$ V_{2}(s_6) = \max \{r(s_6,a_{down}) + \gamma  V_{1}(s_3),\quad r(s_6,a_{right}) + \gamma  V_{1}(s_7)\}$$

$$ V_{2}(s_6) = \max  \{-20 + 0.9*0,\quad -10 + 0.9*100\} = 80 $$

$\pi_{greedy} = [a_{right},\{a_{left},a_{right}\},a_{up}, \{\},a_{up},a_{up},a_{right},a_{right} , \{\}]$

### State $s_7$

$$ V_{2}(s_7) =  r(s_7,a_{right}) + \gamma  V_{1}(s_8)$$

$$ V_{2}(s_7) = 100 $$

$\pi_{greedy} = [a_{right},\{a_{left},a_{right}\},a_{up}, \{\},a_{up},a_{up},a_{right},a_{right} , \{\}]$
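As a quick cross-check of the second iteration, the sketch below applies two synchronous sweeps starting from $V_0 = 0$ and lists the states whose value changed between $V_1$ and $V_2$. The names `sweep` and `TRANSITIONS` are assumptions of this example, with the transitions read off from the equations above.

```python
GAMMA = 0.9

# state -> {action: (next_state, reward)}; assumed encoding of the grid
TRANSITIONS = {
    0: {"up": (3, -30), "right": (1, 0)},
    1: {"left": (0, 0), "right": (2, 0), "up": (4, -30)},
    2: {"up": (5, 0)},
    3: {},
    4: {"left": (3, -40), "right": (5, 0), "up": (7, 10)},
    5: {"left": (4, 0), "down": (2, 0), "up": (8, 80)},
    6: {"down": (3, -20), "right": (7, -10)},
    7: {"right": (8, 100)},
    8: {},
}


def sweep(values):
    """One synchronous update; terminal states fall back to 0 via `default`."""
    return [
        max((r + GAMMA * values[nxt] for nxt, r in TRANSITIONS[s].values()), default=0.0)
        for s in range(len(values))
    ]


v1 = sweep([0.0] * 9)
v2 = sweep(v1)
v2_rounded = [round(v) for v in v2]
changed = [s for s in range(9) if v1[s] != v2[s]]

print(v2_rounded)  # [0, 0, 72, 0, 100, 80, 80, 100, 0]
print(changed)     # [2, 4, 6]
```

The changed states are $s_2$, $s_4$ and $s_6$, the three values that differ from $V_1$.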

## Third iteration (k=3)

$V_2 = \{0,0,\textcolor{red}{72},0,\textcolor{red}{100},80,\textcolor{red}{80},100,0\}$

### State $s_0$


$$ V_{3}(s_0) = \max \{r(s_0,a_{up}) + \gamma  V_{2}(s_3),\quad r(s_0,a_{right}) + \gamma  V_{2}(s_1)\}$$

$$ V_{3}(s_0) = \max  \{-30 + 0.9*0,\quad 0 + 0.9* 0\} = 0 $$

$\pi_{greedy} = [a_{right},\{a_{left},a_{right}\},a_{up}, \{\},a_{up},a_{up},a_{right},a_{right} , \{\}]$

The value of $s_0$ is still 0 here. It depends on $V(s_1)$, which has not settled yet, so $V(s_0)$ will change in a later iteration.

### State $s_1$

$$ V_{3}(s_1) = \max \{r(s_1,a_{left}) +  \gamma  V_{2}(s_0),\quad
  r(s_1,a_{right}) +  \gamma  V_{2}(s_2),\quad
  r(s_1,a_{up}) +  \gamma  V_{2}(s_4)\} $$

$$ V_{3}(s_1) = \max  \{0 + 0.9*0, \quad 0 + 0.9*72, \quad -30 + 0.9*100 \} = \max \{0, \quad 64.8, \quad 60\} = 64.8 \approx 65 $$

$\pi_{greedy} = [a_{right},\textcolor{red}{a_{right}},a_{up}, \{\},a_{up},a_{up},a_{right},a_{right} , \{\}]$

The greedy action in $s_1$ is $a_{right}$, so $V(s_1)$ depends on $V(s_2)$, which will change again. The value of $s_1$ is therefore not final yet.

### State $s_2$

$$ V_{3}(s_2) =  r(s_2,a_{up}) + \gamma  V_{2}(s_5)$$

$$ V_{3}(s_2) = 0 + 0.9*80 = 72 $$

$\pi_{greedy} = [a_{right},a_{right},a_{up}, \{\},a_{up},a_{up},a_{right},a_{right} , \{\}]$

The value of $s_2$ depends on $V(s_5)$, which will change again. The value of $s_2$ is therefore not final yet.

### State $s_4$

Since we can only reach states $s_3$, $s_5$ and $s_7$ from state $s_4$, and $V(s_3)$, $V(s_5)$ and $V(s_7)$ did not change in the last iteration, we know that $V(s_4)$ will not change in this iteration.

### State $s_5$

$$ V_{3}(s_5) = \max \{r(s_5,a_{left}) + \gamma  V_{2}(s_4),\quad
  r(s_5,a_{down}) + \gamma  V_{2}(s_2),\quad
  r(s_5,a_{up}) + \gamma  V_{2}(s_8)\} $$

$$ V_{3}(s_5) = \max  \{0 + 0.9*100,\quad 0 + 0.9*72,\quad 80 + 0.9*0\} = 90 $$

$\pi_{greedy} = [a_{right},a_{right},a_{up}, \{\},a_{up},\textcolor{red}{a_{left}},a_{right},a_{right} , \{\}]$

### States $s_6$ and $s_7$

Since the V-values for the successor states for $s_6$ and $s_7$ have not been changed in the last iteration, we know that $V(s_6)$ and $V(s_7)$ will not change in this iteration.
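The shortcut used in this iteration (a state can only change if one of its successors changed in the previous iteration) can be sketched as a set computation. The dictionary `SUCCESSORS` is an assumed encoding read off from the equations above; `v1` and `v2` are the value lists of the first two iterations.

```python
# state -> list of directly reachable successor states (assumed encoding)
SUCCESSORS = {
    0: [3, 1],
    1: [0, 2, 4],
    2: [5],
    3: [],
    4: [3, 5, 7],
    5: [4, 2, 8],
    6: [3, 7],
    7: [8],
    8: [],
}

v1 = [0, 0, 0, 0, 10, 80, -10, 100, 0]
v2 = [0, 0, 72, 0, 100, 80, 80, 100, 0]

# states whose value changed from V_1 to V_2
changed = {s for s in SUCCESSORS if v1[s] != v2[s]}

# only states with at least one changed successor can change in iteration 3
needs_update = {s for s, succ in SUCCESSORS.items() if changed.intersection(succ)}

print(sorted(changed))       # [2, 4, 6]
print(sorted(needs_update))  # [1, 5]
```

Only $s_1$ and $s_5$ can change in the third iteration. The values of $s_0$ and $s_2$ are recomputed above but stay the same, in line with this check.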

## Fourth iteration (k=4)

$V_3 = \{0,\textcolor{red}{65},72,0,100,\textcolor{red}{90},80,100,0\}$

### State $s_0$

$$ V_{4}(s_0) = \max \{r(s_0,a_{up}) + \gamma  V_{3}(s_3),\quad r(s_0,a_{right}) + \gamma  V_{3}(s_1)\}$$

$$ V_{4}(s_0) = \max  \{-30 + 0.9*0,\quad 0 + 0.9* 65\} = 58.5 \approx 59 $$

$\pi_{greedy} = [a_{right},a_{right},a_{up}, \{\},a_{up},a_{left},a_{right},a_{right} , \{\}]$

### State $s_2$

$$ V_{4}(s_2) =  r(s_2,a_{up}) + \gamma  V_{3}(s_5)$$

$$ V_{4}(s_2) = 0 + 0.9*90 = 81 $$

$\pi_{greedy} = [a_{right},a_{right},a_{up}, \{\},a_{up},a_{left},a_{right},a_{right} , \{\}]$

### State $s_4$

$$ V_{4}(s_4) = \max \{r(s_4,a_{left}) +  \gamma  V_{3}(s_3),\quad
  r(s_4,a_{right}) +  \gamma  V_{3}(s_5),\quad
  r(s_4,a_{up}) +  \gamma  V_{3}(s_7)\} $$

$$ V_{4}(s_4) = \max  \{-40 + 0.9 * 0,\quad 0 + 0.9 * 90,\quad 10  + 0.9 * 100\} = 100 $$

$\pi_{greedy} = [a_{right},a_{right},a_{up}, \{\},a_{up},a_{left},a_{right},a_{right} , \{\}]$

## Fifth iteration (k=5)

$V_4 = \{\textcolor{red}{59},65,\textcolor{red}{81},0,100,90,80,100,0\}$

### State $s_1$

$$ V_{5}(s_1) = \max \{r(s_1,a_{left}) +  \gamma  V_{4}(s_0),\quad
  r(s_1,a_{right}) +  \gamma  V_{4}(s_2),\quad
  r(s_1,a_{up}) +  \gamma  V_{4}(s_4)\} $$

$$ V_{5}(s_1) = \max  \{0 + 0.9*59, \quad 0 + 0.9*81, \quad -30 + 0.9*100 \} = \max \{53.1, \quad 72.9, \quad 60\} = 72.9 \approx 73 $$

$\pi_{greedy} = [a_{right},a_{right},a_{up}, \{\},a_{up},a_{left},a_{right},a_{right} , \{\}]$

### State $s_5$ 

$$ V_{5}(s_5) = \max \{r(s_5,a_{left}) + \gamma  V_{4}(s_4),\quad
  r(s_5,a_{down}) + \gamma  V_{4}(s_2),\quad
  r(s_5,a_{up}) + \gamma  V_{4}(s_8)\} $$

$$ V_{5}(s_5) = \max  \{0 + 0.9*100,\quad 0 + 0.9*81,\quad 80 + 0.9*0\} = 90 $$

$\pi_{greedy} = [a_{right},a_{right},a_{up}, \{\},a_{up},a_{left},a_{right},a_{right} , \{\}]$

## Sixth iteration (k=6)

$V_5 = \{59,\textcolor{red}{73},81,0,100,90,80,100,0\}$

### State $s_0$

$$ V_{6}(s_0) = \max \{r(s_0,a_{up}) + \gamma  V_{5}(s_3),\quad r(s_0,a_{right}) + \gamma  V_{5}(s_1)\}$$

$$ V_{6}(s_0) = \max  \{-30 + 0.9*0,\quad 0 + 0.9* 73\} = 66 $$

$\pi_{greedy} = [a_{right},a_{right},a_{up}, \{\},a_{up},a_{left},a_{right},a_{right} , \{\}]$

## Optimal state value function 


$ V_{6}^*(s_0)=  66$    
$ V_{6}^*(s_1) = 73$     
$ V_{6}^*(s_2) = 81$    
$ V_{6}^*(s_3) = 0$   
$ V_{6}^*(s_4) = 100$   
$ V_{6}^*(s_5) = 90$   
$ V_{6}^*(s_6) = 80$  
$ V_{6}^*(s_7) = 100$      
$ V_{6}^*(s_8) = 0$  
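These values can be cross-checked by running value iteration until no state changes any more. The sketch below keeps unrounded values during the iteration and rounds only at the end, which is an assumption of this example (the hand calculation rounds after every step); the final rounded values agree either way. The names `value_iteration` and `TRANSITIONS` are also assumptions, with the transitions read off from the equations above.

```python
GAMMA = 0.9

# state -> {action: (next_state, reward)}; assumed encoding of the grid
TRANSITIONS = {
    0: {"up": (3, -30), "right": (1, 0)},
    1: {"left": (0, 0), "right": (2, 0), "up": (4, -30)},
    2: {"up": (5, 0)},
    3: {},
    4: {"left": (3, -40), "right": (5, 0), "up": (7, 10)},
    5: {"left": (4, 0), "down": (2, 0), "up": (8, 80)},
    6: {"down": (3, -20), "right": (7, -10)},
    7: {"right": (8, 100)},
    8: {},
}


def value_iteration(transitions=TRANSITIONS, gamma=GAMMA, tol=1e-9):
    """Repeat synchronous sweeps until the largest change is below `tol`."""
    values = [0.0] * len(transitions)
    while True:
        new_values = [
            max((r + gamma * values[nxt] for nxt, r in transitions[s].values()), default=0.0)
            for s in range(len(values))
        ]
        largest_change = max(abs(new - old) for new, old in zip(new_values, values))
        values = new_values
        if largest_change < tol:
            return values


v_star = value_iteration()
v_star_rounded = [round(v) for v in v_star]
print(v_star_rounded)  # [66, 73, 81, 0, 100, 90, 80, 100, 0]
```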

Which path does the optimal policy follow if we start from $s_0$?

$$s_0 \rightarrow s_1 \rightarrow s_2 \rightarrow s_5 \rightarrow s_4\rightarrow s_7 \rightarrow s_8 $$
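This path can be recovered by following the greedy action with respect to the optimal values from $s_0$ until a terminal state is reached. The sketch below assumes the rounded values from the list above; `greedy_path`, `V_STAR` and `TRANSITIONS` are names introduced for this example only.

```python
GAMMA = 0.9

# state -> {action: (next_state, reward)}; assumed encoding of the grid
TRANSITIONS = {
    0: {"up": (3, -30), "right": (1, 0)},
    1: {"left": (0, 0), "right": (2, 0), "up": (4, -30)},
    2: {"up": (5, 0)},
    3: {},
    4: {"left": (3, -40), "right": (5, 0), "up": (7, 10)},
    5: {"left": (4, 0), "down": (2, 0), "up": (8, 80)},
    6: {"down": (3, -20), "right": (7, -10)},
    7: {"right": (8, 100)},
    8: {},
}

# rounded optimal values taken from the hand calculation
V_STAR = [66, 73, 81, 0, 100, 90, 80, 100, 0]


def greedy_path(start, values=V_STAR, max_steps=20):
    """Follow the greedy action until a terminal state (or the step cap) is hit."""
    path = [start]
    state = start
    while TRANSITIONS[state] and len(path) <= max_steps:
        # choose the (next_state, reward) pair with the best one-step lookahead
        nxt, _reward = max(
            TRANSITIONS[state].values(),
            key=lambda t: t[1] + GAMMA * values[t[0]],
        )
        path.append(nxt)
        state = nxt
    return path


path = greedy_path(0)
print(" -> ".join(f"s{s}" for s in path))  # s0 -> s1 -> s2 -> s5 -> s4 -> s7 -> s8
```

The discounted return along this path is $0.9^4 \cdot 10 + 0.9^5 \cdot 100 = 65.61$, which rounds to $V^*(s_0) = 66$.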

## Method 2

<img src="grid.png"
     alt="Grid World"
     width="600" />

We start from the terminal states $s_3$ and $s_8$.

### States $s_3, s_8$

$$V^*(s_3) = r(s_3,a_i) = 0   $$ 
$$V^*(s_8) = r(s_8,a_i) = 0   $$ 

No actions can be performed there. The values of these states will not change.

From which state can you reach the final state with an action?

### State $s_7$

$$V^*(s_7) =  r(s_7,a_{right}) + \gamma V^*(s_8)$$

$$V^*(s_7) = 100 $$

This value will not change.

### State $s_6$

It is clear that we will not use the action in the direction of $s_3$.

$$V^*(s_6) = r(s_6,a_{right}) +  \gamma V^*(s_7)$$

$$V^*(s_6) = -10 + 0.9*100 = 80 $$

This value will not change any more.

From $s_5$ and $s_4$, there are two paths to the destination $s_8$, and each of the two states is a successor of the other. We may therefore need two iterations to calculate the values of these states.


### State $s_5$

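We do not need to consider the action in the direction of $s_2$: it gives a reward of 0, and the only way onward from $s_2$ leads back to $s_5$, so the detour only adds discounting.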

$$ V_{2}(s_5) = \max \{r(s_5,a_{left}) + \gamma  V_{0}(s_4), \quad r(s_5,a_{up}) + \gamma V^*(s_8)\} $$

$$ V_{2}(s_5) = \max  \{0 + 0.9*0, \quad 80 + 0.9*0\} = 80 $$


### State $s_4$

It is clear that we will not use the action in the direction of $s_3$.

$$ V_{2}(s_4) = \max \{r(s_4,a_{right}) + \gamma  V_{2}(s_5),
  \quad r(s_4,a_{up}) +  \gamma V^*(s_7)\} $$

$$ V_{2}(s_4) = \max  \{ 0 + 0.9 * 80,\quad 10  + 0.9 * 100\} = 100 $$

We will recalculate the value of $s_5$ because the state $s_4$ is better than we thought. If the new value of $s_5$ is bigger than $\frac{100}{0.9} \approx 111$, then we need another iteration to calculate $V(s_4)$.


### State $s_5$

$$ V_{3}(s_5) = \max \{r(s_5,a_{left}) +  \gamma  V_{2}(s_4), \quad r(s_5,a_{up}) +  \gamma V^*(s_8)\} $$

$$ V_{3}(s_5) = \max  \{0 + 0.9*100,\quad 80 + 0.9*0\} = 90 $$


The values $V^*(s_5)$ and $V^*(s_4)$  will not change any more.
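The stopping argument can be written out as a short check, using only the rewards and values stated above. In $s_4$, the action $a_{right}$ would overtake $a_{up}$ only if

$$ r(s_4,a_{right}) + \gamma V(s_5) > r(s_4,a_{up}) + \gamma V^*(s_7) \iff 0.9 \, V(s_5) > 100 \iff V(s_5) > \frac{100}{0.9} \approx 111.1 $$

The recomputed value is

$$ V_{3}(s_5) = 90 < 111.1 $$

so $a_{up}$ remains the best action in $s_4$ and $V(s_4) = 100$ stays fixed. Since $V(s_4)$ does not change, $V(s_5) = \max \{0.9*100, \quad 80\} = 90$ does not change either.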

### State $s_2$

The value of $s_2$ depends only on the value of $s_5$

$$V^*(s_2) =  r(s_2,a_{up}) +  \gamma V^*(s_5)$$

$$V^*(s_2) = 0 + 0.9*90 = 81 $$

### State $s_1$


$$V^*(s_1) = \max \{r(s_1,a_{right}) +  \gamma V^*(s_2), \quad r(s_1,a_{up}) +  \gamma V^*(s_4)\} $$

$$V^*(s_1) = \max  \{ 0 + 0.9*81, \quad -30 + 0.9*100 \} = 72.9 \approx 73 $$

### State $s_0$

$$V^*(s_0) = r(s_0,a_{right}) +\gamma V^*(s_1)$$

$$V^*(s_0) =  0 + 0.9* 73 = 65.7 \approx 66 $$
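The order of updates used in Method 2 can be sketched as a sequence of in-place, single-state updates that work backwards from the terminal states. Unlike the hand calculation, this sketch does not prune actions and takes the maximum over all actions in each update, and it rounds only at the end; both are assumptions of this example, as are the names `backup`, `ORDER` and `TRANSITIONS`.

```python
GAMMA = 0.9

# state -> {action: (next_state, reward)}; assumed encoding of the grid
TRANSITIONS = {
    0: {"up": (3, -30), "right": (1, 0)},
    1: {"left": (0, 0), "right": (2, 0), "up": (4, -30)},
    2: {"up": (5, 0)},
    3: {},
    4: {"left": (3, -40), "right": (5, 0), "up": (7, 10)},
    5: {"left": (4, 0), "down": (2, 0), "up": (8, 80)},
    6: {"down": (3, -20), "right": (7, -10)},
    7: {"right": (8, 100)},
    8: {},
}

# update order of Method 2; s_5 appears twice because it is recomputed
# after the value of s_4 is known
ORDER = [7, 6, 5, 4, 5, 2, 1, 0]


def backup(values, state):
    """Update one state in place, using the latest values of its successors."""
    values[state] = max(
        (r + GAMMA * values[nxt] for nxt, r in TRANSITIONS[state].values()),
        default=0.0,
    )


values = [0.0] * 9  # terminal states s_3 and s_8 keep the value 0
for state in ORDER:
    backup(values, state)

rounded = [round(v) for v in values]
print(rounded)  # [66, 73, 81, 0, 100, 90, 80, 100, 0]
```

Eight single-state updates reproduce the values that Method 1 reaches after six full iterations, because each update already uses the final values of the relevant successor states.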