<br/>

$$ \huge{\underline{\textbf{ Iterative Policy Evaluation }}} $$

$$ \color{red}{ \large{\textbf{ WORK IN PROGRESS } }} $$

<br/>

Implementation of Iterative Policy Evaluation from Sutton and Barto (2018), Section 4.1.

<br/>

<img src="assets/0401_iter_policy_eval.png"/>
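
For reference, the update applied to each state during a sweep is the Bellman expectation backup, written here in the notation of Sutton and Barto. In the code, $\pi(a|s)$ corresponds to `policy[s,a]`, and each $(p, s', r)$ triple comes from one entry of `env.model[s][a]`:

$$ V(s) \leftarrow \sum_{a} \pi(a|s) \sum_{s', r} p(s', r \,|\, s, a) \big[ r + \gamma V(s') \big] $$

Sweeps repeat until the largest change in a sweep, $\Delta = \max_s |v - V(s)|$ with $v$ the value before the update, falls below the threshold $\theta$.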

<br/>

In [None]:
# Naive implementation, but it matches the box exactly
def iter_policy_eval(env, policy, gamma, theta):
    """Evaluate `policy` with in-place sweeps; returns V, an array of shape [nb_states]."""
    V = np.zeros(env.nb_states)
    
    while True:
        delta = 0                                   # largest change seen in this sweep
        for s in range(env.nb_states):
            v = V[s]                                # value before the update
            
            tmp = 0
            for a in range(env.nb_actions):
                for p, s_, r, _ in env.model[s][a]:              # see note #1
                    tmp += policy[s,a] * p * (r + gamma * V[s_])
            V[s] = tmp
            
            delta = max(delta, abs(v - V[s]))
    
        if delta < theta: break                     # converged to within theta
    
    return V

### Note #1
__env.model__ is taken directly from the OpenAI Gym FrozenLake environment (where it is called __env.P__, see the experiment setup below). It is a nested structure, indexed as `env.model[state][action]`, that lists every possible transition as a `(probability, next_state, reward, done)` tuple, for example:
```
>>> env.model[6][0]
[(0.3333333333333333, 2, 0.0, False),
 (0.3333333333333333, 5, 0.0, True),
 (0.3333333333333333, 10, 0.0, False)]
```
This has the following meaning:
* from state 6, taking action 0, there is a __0.33__ probability of transitioning to state __2__ with reward __0.0__; the transition is non-terminal
* from state 6, taking action 0, there is a 0.33 probability of transitioning to state 5 with reward 0.0; the transition is terminal and the episode ends
* from state 6, taking action 0, there is a 0.33 probability of transitioning to state 10 with reward 0.0; the transition is non-terminal

See the diagram below:
<img src="assets/0401_model_diagram.png">
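
As a small illustration of how these tuples are consumed, the sketch below computes the expected one-step backup for the `(state 6, action 0)` entry shown above. The helper name `expected_backup` and the numbers placed in `V` are hypothetical, chosen only for this example; they are not results from the notebook.

```python
import numpy as np

# Transition list copied from the example above:
# each tuple is (probability, next_state, reward, done)
transitions = [
    (1/3, 2, 0.0, False),
    (1/3, 5, 0.0, True),
    (1/3, 10, 0.0, False),
]

def expected_backup(transitions, V, gamma):
    """Expected one-step return for a single (state, action) pair."""
    return sum(p * (r + gamma * V[s_next]) for p, s_next, r, _ in transitions)

# Hypothetical state values, for illustration only
V = np.zeros(16)
V[2] = 0.5
V[10] = 1.0

total_prob = sum(p for p, _, _, _ in transitions)   # should be (close to) 1
q = expected_backup(transitions, V, gamma=1.0)      # approximately 0.5
print(total_prob, q)
```

The inner loop of `iter_policy_eval` performs this same sum, additionally weighting each action by `policy[s,a]`.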

### More proper version

TODO!!

In [None]:
# TODO: intended faster version (potentially using more memory);
# currently identical to the naive implementation above
def iter_policy_eval(env, policy, gamma, theta):
    V = np.zeros(env.nb_states)
    
    while True:
        delta = 0
        for s in range(env.nb_states):
            v = V[s]
            
            tmp = 0
            for a in range(env.nb_actions):
                for p_, s_, r_, _ in env.model[s][a]:
                    tmp += policy[s,a] * p_ * (r_ + gamma*V[s_])
            V[s] = tmp
            
            delta = max(delta, abs(v - V[s]))
    
        if delta < theta: break
    
    return V
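
One possible way to trade memory for speed, in the spirit of the comment on the cell above, is to unpack the nested model into dense NumPy arrays once and then vectorise each sweep. The sketch below is an assumption about what such a version could look like, not the author's planned implementation. The names `build_dense_model` and `iter_policy_eval_dense` and the three-state toy model are hypothetical. Note that it updates all states at once from the previous sweep (the two-array variant) rather than in place, so intermediate values differ from the loop version, although both converge to the same value function.

```python
import numpy as np

def build_dense_model(model, nb_states, nb_actions):
    """Unpack model[s][a] -> [(p, s_next, r, done), ...] into dense arrays."""
    P = np.zeros((nb_states, nb_actions, nb_states))   # P[s, a, s_next]
    R = np.zeros((nb_states, nb_actions))              # expected immediate reward
    for s in range(nb_states):
        for a in range(nb_actions):
            for p, s_next, r, _ in model[s][a]:
                P[s, a, s_next] += p                   # += handles duplicate entries
                R[s, a] += p * r
    return P, R

def iter_policy_eval_dense(P, R, policy, gamma, theta):
    """Synchronous (two-array) policy evaluation on dense arrays."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * (P @ V)             # shape [nb_states, nb_actions]
        V_new = np.sum(policy * Q, axis=1)  # average over actions under the policy
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < theta:
            break
    return V

# Hypothetical toy model: 0 -> 1 -> 2, where state 2 is an absorbing terminal state
toy_model = {
    0: {0: [(1.0, 1, 0.0, False)], 1: [(1.0, 0, 0.0, False)]},
    1: {0: [(1.0, 2, 1.0, True)],  1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)],  1: [(1.0, 2, 0.0, True)]},
}
P, R = build_dense_model(toy_model, nb_states=3, nb_actions=2)
toy_policy = np.ones((3, 2)) / 2            # uniform random policy
V_toy = iter_policy_eval_dense(P, R, toy_policy, gamma=0.9, theta=1e-10)
print(V_toy)
```

The dense `P` array needs `nb_states * nb_actions * nb_states` floats, which is negligible for FrozenLake but grows quickly for larger state spaces.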

# Experiment Setup

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import gym

In [4]:
env = gym.make('FrozenLake-v0')
env.reset()
env.render()


SFFF
FHFH
FFFH
HFFG


Add shorter aliases for some environment members. The asserts make sure we do not overwrite any existing attributes.

In [14]:
assert not hasattr(env, 'nb_states')
assert not hasattr(env, 'nb_actions')
assert not hasattr(env, 'model')
env.nb_states = env.env.nS
env.nb_actions = env.env.nA
env.model = env.env.P

In [15]:
policy = np.ones([env.nb_states, env.nb_actions]) / env.nb_actions

In [34]:
V_pi = iter_policy_eval(env, policy, gamma=1.0, theta=0.00001)

In [35]:
print(V_pi.reshape([4, -1]).round(3))

[[0.014 0.012 0.021 0.01 ]
 [0.016 0.    0.041 0.   ]
 [0.035 0.088 0.142 0.   ]
 [0.    0.176 0.439 0.   ]]


In [38]:
env.model[0][0]

[(0.3333333333333333, 0, 0.0, False),
 (0.3333333333333333, 0, 0.0, False),
 (0.3333333333333333, 4, 0.0, False)]