# Example 25: Iterative Policy Evaluation

## Contents
* [Acknowledgements](#ackw)
* [Overview](#overview) 
    * [Iterative policy evaluation](#ekf)
    * [Test case](#test_case)
* [Include files](#include_files)
* [The main function](#m_func)
* [Results](#results)
* [Source Code](#source_code)

## <a name="ackw"></a> Acknowledgements

Most of the content in this notebook is taken from the book ```Reinforcement Learning: An Introduction``` by Sutton and Barto.

## <a name="overview"></a> Overview

In this example we go over another classical reinforcement learning algorithm, namely iterative policy evaluation. This is a method for computing the state value function $V_{\pi}$ of a given policy $\pi$.

### <a name="ekf"></a> Iterative policy evaluation

Iterative policy evaluation is a dynamic programming (DP) algorithm. DP algorithms require a model of the environment to be supplied. One can therefore argue that classical DP algorithms are of limited practical utility in reinforcement learning, where such a model is often unavailable.



Here, we will implement a table-based approach to iterative policy evaluation. Therefore, we have to assume that the state set $S$, the action set $A$ and the reward set $R$ are finite. Furthermore, we will assume that the dynamics can be described by a set of probabilities

$$p(s^{'}, r | s, \alpha), ~~ \forall s  \in S, \alpha \in A, r \in R$$

We can easily obtain optimal policies once we have found the optimal value functions, that is, either $V^{*}$ or $Q^{*}$. These satisfy, respectively, the following Bellman optimality equations:

$$V^{*}(s) = \max_{\alpha} E[R_{t+1} + \gamma V^{*}(S_{t+1}) | S_t = s, A_t = \alpha]$$

$$Q^{*}(s,\alpha)= E[R_{t+1} + \gamma \max_{\alpha^{'}}Q^{*}(S_{t+1}, \alpha^{'}) | S_t = s, A_t = \alpha]$$

DP algorithms can be obtained by turning Bellman equations, such as the ones above, into assignments. As we will see below, these are just update rules for improving approximations of the desired value functions. Let's see how to compute the state value function $V_{\pi}$ for an arbitrary policy $\pi$. This is called policy evaluation. Since the method we use is iterative, it is called iterative policy evaluation.

We can write the following for the value function $V_{\pi}$

$$V_{\pi}(s) = E_{\pi}[G_t| S_t = s] = \sum_{\alpha} \pi(\alpha |s) \sum_{s^{'}, r} p(s^{'}, r | s, \alpha)[r + \gamma V_{\pi}(s^{'})]$$

where $\pi(\alpha |s)$ is the probability of taking action $\alpha$ whilst in state $s$ under the policy $\pi$. If the dynamics of the environment, $p(s^{'}, r | s, \alpha)$, are known, then the equation above is a system of $|S|$ simultaneous linear equations. The unknowns are the $V_{\pi}(s), s \in S$.

Since the system is linear, its solution is conceptually straightforward: all we need to do is invert the system matrix. However, this is not always easy, particularly when the state space is large. We can avoid matrix inversion altogether by using iterative solution methods.
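To make the linear system explicit, the Bellman equation for $V_{\pi}$ can be written in matrix form. The symbols $P_{\pi}$ and $r_{\pi}$ below are notation introduced here for illustration; they are not used elsewhere in this example.

$$P_{\pi}(s, s^{'}) = \sum_{\alpha} \pi(\alpha |s) \sum_{r} p(s^{'}, r | s, \alpha), ~~ r_{\pi}(s) = \sum_{\alpha} \pi(\alpha |s) \sum_{s^{'}, r} p(s^{'}, r | s, \alpha) r$$

$$V_{\pi} = r_{\pi} + \gamma P_{\pi} V_{\pi} ~~ \Rightarrow ~~ (I - \gamma P_{\pi}) V_{\pi} = r_{\pi}$$

Here $V_{\pi}$ and $r_{\pi}$ are vectors with $|S|$ entries and $P_{\pi}$ is an $|S| \times |S|$ matrix, so a direct solve amounts to inverting $I - \gamma P_{\pi}$.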

Consider a sequence of approximate value functions $V_0, V_1, V_2, \cdots$.
The initial approximation, $V_0$, is chosen arbitrarily (except for the terminal state, which is given the value 0). Each successive approximation is obtained by using the Bellman equation for $V_{\pi}$ as an update rule:

$$V_{k+1}(s) = \sum_{\alpha} \pi(\alpha |s) \sum_{s^{'}, r} p(s^{'}, r | s, \alpha)[r + \gamma V_{k}(s^{'})]$$

$V_k = V_{\pi}$ is a fixed point for this update rule. The sequence $\{V_k\}$ can be shown to converge to $V_{\pi}$ as $k\rightarrow \infty$ under the same conditions that guarantee the existence of $V_{\pi}$. This algorithm is called iterative policy evaluation.

To produce each successive approximation, $V_{k+1}$ from $V_k$, iterative policy evaluation
applies the same operation to each state $s$: it replaces the old value of $s$ with a new value
obtained from the old values of the successor states of $s$, and the expected immediate
rewards, along all the one-step transitions possible under the policy being evaluated. We
call this kind of operation an expected update. Each iteration of iterative policy evaluation updates the value of every state once to produce the new approximate value function $V_{k+1}$.

A simple implementation of iterative policy evaluation uses two arrays, one for $V_{k}(s)$ and one for $V_{k+1}(s)$. With two arrays, the new values can be computed one by one from the old values without the old values being changed. In pseudocode, the algorithm looks like the following:

<img src="iterative_policy_evaluation.png"
     alt="Iterative Policy Evaluation Algorithm"
     style="float: left; margin-right: 10px; width: 500px;" />


**Note that the image is from the book ```Reinforcement Learning: An Introduction``` by Sutton and Barto.**
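The two-array scheme can be sketched in plain C++ as follows. This is a minimal, self-contained illustration and not the ```SyncValueFuncItr``` implementation: the namespace ```sketch```, the integer state indices and the helpers ```next_state```, ```sweep``` and ```evaluate``` are assumptions made for this sketch. It hard-codes the $4 \times 4$ grid of the test case in this example (terminal cells 3 and 12, reward $-1$ per transition, equiprobable policy).

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>

namespace sketch {

constexpr int kSide = 4;                   // the grid is kSide x kSide
constexpr int kNumStates = kSide * kSide;  // states are numbered 0..15

enum class Action { NORTH, SOUTH, EAST, WEST };
constexpr std::array<Action, 4> kActions = {
    Action::NORTH, Action::SOUTH, Action::EAST, Action::WEST};

using Values = std::array<double, kNumStates>;

// Assumption: cells 3 and 12 are the terminal (goal) cells.
inline bool is_terminal(int s) { return s == 3 || s == 12; }

// Deterministic successor; a move off the grid leaves the state unchanged.
inline int next_state(int s, Action a) {
    assert(s >= 0 && s < kNumStates);
    const int row = s / kSide;  // row 0 is the bottom row
    const int col = s % kSide;
    switch (a) {
        case Action::NORTH: return row + 1 < kSide ? s + kSide : s;
        case Action::SOUTH: return row > 0 ? s - kSide : s;
        case Action::EAST:  return col + 1 < kSide ? s + 1 : s;
        case Action::WEST:  return col > 0 ? s - 1 : s;
    }
    return s;
}

// One synchronous sweep: reads only v_old and writes only v_new.
// Returns the largest absolute change over all states.
inline double sweep(const Values& v_old, Values& v_new, double gamma) {
    double delta = 0.0;
    for (int s = 0; s < kNumStates; ++s) {
        if (is_terminal(s)) {
            v_new[s] = 0.0;
            continue;
        }
        double value = 0.0;
        for (Action a : kActions) {
            // pi(a|s) = 0.25, reward = -1, deterministic successor
            value += 0.25 * (-1.0 + gamma * v_old[next_state(s, a)]);
        }
        v_new[s] = value;
        delta = std::max(delta, std::abs(v_new[s] - v_old[s]));
    }
    return delta;
}

// Repeats sweeps until the largest change drops below tol
// or max_sweeps is reached, starting from V_0 = 0.
inline Values evaluate(double gamma, double tol, int max_sweeps) {
    Values v_old{};
    Values v_new{};
    for (int k = 0; k < max_sweeps; ++k) {
        const double delta = sweep(v_old, v_new, gamma);
        v_old = v_new;
        if (delta < tol) {
            break;
        }
    }
    return v_old;
}

}  // namespace sketch
```

A call such as ```sketch::evaluate(1.0, 0.001, 160)``` mirrors the tolerance, discount factor and iteration limit used later in this example.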

Another approach is to use one array and update the values in place, that is, with each new value immediately
overwriting the old one. Then, depending on the order in which the states are updated,
sometimes new values are used instead of old ones on the right-hand side of the equation above. This
in-place algorithm also converges to $V_{\pi}$; in fact, it usually converges faster than the
two-array version, as you might expect, because it uses new data as soon as they are
available. 

However, for the in-place algorithm, the order in which states have their values updated during the
sweep has a significant influence on the rate of convergence. For this reason, the class ```SyncValueFuncItr``` uses a two-array implementation.
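For comparison, the in-place variant can be sketched as below. Again, this is only an illustration under stated assumptions and not part of the library: the namespace ```inplace_sketch``` and the helpers ```successors```, ```sweep_in_place``` and ```evaluate_in_place``` are hypothetical names, and the grid layout (terminal cells 3 and 12, reward $-1$, equiprobable policy) is hard-coded. States are swept in increasing index order, which is just one possible ordering.

```cpp
#include <algorithm>
#include <array>
#include <cmath>

namespace inplace_sketch {

constexpr int kSide = 4;
constexpr int kNumStates = kSide * kSide;
using Values = std::array<double, kNumStates>;

// Assumption: cells 3 and 12 are the terminal (goal) cells.
inline bool is_terminal(int s) { return s == 3 || s == 12; }

// Successors of s for NORTH, SOUTH, EAST and WEST (in that order);
// a move off the grid leaves the state unchanged.
inline std::array<int, 4> successors(int s) {
    const int row = s / kSide;  // row 0 is the bottom row
    const int col = s % kSide;
    return {row + 1 < kSide ? s + kSide : s,
            row > 0 ? s - kSide : s,
            col + 1 < kSide ? s + 1 : s,
            col > 0 ? s - 1 : s};
}

// One in-place sweep: each new value overwrites the old one immediately,
// so states visited later in the sweep already see the updated values.
// Returns the largest absolute change over all states.
inline double sweep_in_place(Values& v, double gamma) {
    double delta = 0.0;
    for (int s = 0; s < kNumStates; ++s) {
        if (is_terminal(s)) {
            continue;  // terminal values stay at 0
        }
        double value = 0.0;
        for (int next : successors(s)) {
            value += 0.25 * (-1.0 + gamma * v[next]);
        }
        delta = std::max(delta, std::abs(value - v[s]));
        v[s] = value;
    }
    return delta;
}

// Sweeps until the largest change drops below tol or max_sweeps is reached.
// Returns the number of sweeps performed.
inline int evaluate_in_place(Values& v, double gamma, double tol,
                             int max_sweeps) {
    for (int k = 1; k <= max_sweeps; ++k) {
        if (sweep_in_place(v, gamma) < tol) {
            return k;
        }
    }
    return max_sweeps;
}

}  // namespace inplace_sketch
```

Starting from all zeros, the first sweep already gives cell 1 the value $-1.25$ rather than $-1$, because cell 0 was updated to $-1$ earlier in the same sweep.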

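Regardless of the implementation approach used, iterative policy evaluation only evaluates the given policy; it does not change it. The resulting value function can, however, be used to derive a better policy, for example by acting greedily with respect to it. This is shown schematically in the figure below.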

<img src="grid_world_policy.png"
     alt="Grid World Policy"
     style="float: left; margin-right: 10px; width: 500px;" />

### <a name="test_case"></a> Test case

In this example, we will simulate an episodic MDP. The world is a square $4 \times 4$ grid. The goal states are the top left and bottom right cells. The initial $V_{\pi}$ function is shown in the image below. 

<img src="initial_grid_world.png"
     alt="Initial Grid World"
     style="float: left; margin-right: 10px; width: 500px;" />

Each state has four possible actions: ```UP```, ```DOWN```, ```LEFT``` and ```RIGHT```. The code below uses the ```cengine::rl::worlds::GridWorldAction``` enumeration to describe them, with the values ```NORTH```, ```SOUTH```, ```WEST``` and ```EAST```. If an action would take the agent off the grid, the agent remains in the state it was in. For every transition we assume a reward of $R=-1$. We also use $\gamma=1$.

The dynamics function $p(s^{'}, r | s, \alpha)$ is modeled via the following lambda expression

```
 auto dynamics = [](const state_t& s1, real_t,
                const state_t& s2, const action_t& action){
          // constant value used as a workaround, see the discussion below
          return 0.25;
        };
```

Why does the lambda return this value and not 1? After all, the dynamics are deterministic. The reason is that the code loops over all the transition states of state $s$ regardless of the action chosen, whereas a given action leads to only one of them. A faithful $p(s^{'}, r | s, \alpha)$ would therefore return 1 for the matching successor state and 0 for all the others. In order not to clutter the code with if/else statements, we use the constant value as a workaround.
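For reference, a dynamics function that encodes the deterministic transitions explicitly could look like the sketch below. It is not the function used in this example and it does not use the library's ```state_t``` or ```action_t``` types: states are plain integers, and the namespace ```dynamics_sketch``` together with ```next_state``` and ```dynamics``` are hypothetical names introduced for illustration.

```cpp
#include <cassert>

namespace dynamics_sketch {

constexpr int kSide = 4;                   // the grid is kSide x kSide
constexpr int kNumStates = kSide * kSide;  // states are numbered 0..15

enum class Action { NORTH, SOUTH, EAST, WEST };

// Deterministic successor; a move off the grid leaves the state unchanged.
inline int next_state(int s, Action a) {
    assert(s >= 0 && s < kNumStates);
    const int row = s / kSide;  // row 0 is the bottom row
    const int col = s % kSide;
    switch (a) {
        case Action::NORTH: return row + 1 < kSide ? s + kSide : s;
        case Action::SOUTH: return row > 0 ? s - kSide : s;
        case Action::EAST:  return col + 1 < kSide ? s + 1 : s;
        case Action::WEST:  return col > 0 ? s - 1 : s;
    }
    return s;
}

// p(s', r | s, a): 1 for the single successor reached from s under a
// with reward -1, and 0 for every other (s', r) pair.
inline double dynamics(int s_next, double r, int s, Action a) {
    const bool matches = (s_next == next_state(s, a)) && (r == -1.0);
    return matches ? 1.0 : 0.0;
}

}  // namespace dynamics_sketch
```

With such a function, summing over all successor states for a fixed state and action yields exactly 1, as a probability distribution should.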



Similarly, the function $\pi(\alpha | s)$ is modeled by

```
 auto policy = [](const action_t&, const state_t&){
          return 0.25;
        };
```

That is, under the given policy, the agent has an equal probability of selecting any of the four allowed actions. The value function at selected iterations is shown in the figure below:

<img src="gird_world_value_function.png"
     alt="Grid World Value Function"
     style="float: left; margin-right: 10px; width: 500px;" />

## <a name="include_files"></a> Include files

```
#include "cubic_engine/base/cubic_engine_types.h"
#include "kernel/base/kernel_consts.h"
#include "cubic_engine/rl/worlds/grid_world.h"
#include "cubic_engine/rl/worlds/grid_world_action_space.h"
#include "cubic_engine/rl/synchronous_value_function_learning.h"
#include "cubic_engine/rl/reward_table.h"

#include <exception>
#include <iostream>
#include <vector>
```

## <a name="m_func"></a> The main function

```
namespace example
{

using cengine::uint_t;
using cengine::real_t;
using cengine::rl::worlds::GridWorld;
using cengine::rl::worlds::GridWorldAction;
using cengine::rl::SyncValueFuncItr;
using cengine::rl::SyncValueFuncItrInput;
using cengine::rl::RewardTable;
using kernel::CSVWriter;

class RewardProducer
{
public:

    typedef real_t value_t;

    /// constructor
    RewardProducer();

    /// returns the reward for the goal
    real_t goal_reward()const{return 0.0;}

    /// returns the reward for the action
    /// at  state s when going to state sprime
    template<typename ActionTp, typename StateTp>
    real_t get_reward(const ActionTp& action,
                      const StateTp& s,
                      const StateTp& sprime)const{
        return rewards_.get_reward(s.get_id(), action);
    }

    /// returns the reward for the action
    /// taken at state s
     template<typename ActionTp, typename StateTp>
     real_t get_reward(const ActionTp& action,
                          const StateTp& s)const{
            return rewards_.get_reward(s.get_id(), action);
     }

private:

    /// table that holds the rewards
    RewardTable<GridWorldAction, real_t> rewards_;

    /// setup the rewards
    void setup_rewards();
};

RewardProducer::RewardProducer()
    :
   rewards_()
{
    setup_rewards();
}

void
RewardProducer::setup_rewards(){

    rewards_.set_reward(0, GridWorldAction::EAST, -1.0);
    rewards_.set_reward(0, GridWorldAction::NORTH, -1.0);
    rewards_.set_reward(0, GridWorldAction::SOUTH, -1.0);
    rewards_.set_reward(0, GridWorldAction::WEST, -1.0);

    rewards_.set_reward(1, GridWorldAction::EAST, -1.0);
    rewards_.set_reward(1, GridWorldAction::NORTH, -1.0);
    rewards_.set_reward(1, GridWorldAction::WEST, -1.0);
    rewards_.set_reward(1, GridWorldAction::SOUTH, -1.0);

    rewards_.set_reward(2, GridWorldAction::EAST, -1.0);
    rewards_.set_reward(2, GridWorldAction::NORTH, -1.0);
    rewards_.set_reward(2, GridWorldAction::WEST, -1.0);
    rewards_.set_reward(2, GridWorldAction::SOUTH, -1.0);

    rewards_.set_reward(4, GridWorldAction::EAST, -1.0);
    rewards_.set_reward(4, GridWorldAction::NORTH, -1.0);
    rewards_.set_reward(4, GridWorldAction::SOUTH, -1.0);
    rewards_.set_reward(4, GridWorldAction::WEST, -1.0);

    rewards_.set_reward(5, GridWorldAction::EAST, -1.0);
    rewards_.set_reward(5, GridWorldAction::NORTH, -1.0);
    rewards_.set_reward(5, GridWorldAction::SOUTH, -1.0);
    rewards_.set_reward(5, GridWorldAction::WEST, -1.0);

    rewards_.set_reward(6, GridWorldAction::EAST, -1.0);
    rewards_.set_reward(6, GridWorldAction::NORTH, -1.0);
    rewards_.set_reward(6, GridWorldAction::SOUTH, -1.0);
    rewards_.set_reward(6, GridWorldAction::WEST, -1.0);

    rewards_.set_reward(7, GridWorldAction::NORTH, -1.0);
    rewards_.set_reward(7, GridWorldAction::SOUTH, -1.0);
    rewards_.set_reward(7, GridWorldAction::WEST, -1.0);
    rewards_.set_reward(7, GridWorldAction::EAST, -1.0);

    rewards_.set_reward(8, GridWorldAction::EAST, -1.0);
    rewards_.set_reward(8, GridWorldAction::NORTH, -1.0);
    rewards_.set_reward(8, GridWorldAction::SOUTH, -1.0);
    rewards_.set_reward(8, GridWorldAction::WEST, -1.0);

    rewards_.set_reward(9, GridWorldAction::EAST, -1.0);
    rewards_.set_reward(9, GridWorldAction::NORTH, -1.0);
    rewards_.set_reward(9, GridWorldAction::SOUTH, -1.0);
    rewards_.set_reward(9, GridWorldAction::WEST, -1.0);

    rewards_.set_reward(10, GridWorldAction::EAST, -1.0);
    rewards_.set_reward(10, GridWorldAction::NORTH, -1.0);
    rewards_.set_reward(10, GridWorldAction::SOUTH, -1.0);
    rewards_.set_reward(10, GridWorldAction::WEST, -1.0);

    rewards_.set_reward(11, GridWorldAction::NORTH, -1.0);
    rewards_.set_reward(11, GridWorldAction::SOUTH, -1.0);
    rewards_.set_reward(11, GridWorldAction::WEST, -1.0);
    rewards_.set_reward(11, GridWorldAction::EAST, -1.0);

    rewards_.set_reward(13, GridWorldAction::NORTH, -1.0);
    rewards_.set_reward(13, GridWorldAction::EAST, -1.0);
    rewards_.set_reward(13, GridWorldAction::SOUTH, -1.0);
    rewards_.set_reward(13, GridWorldAction::WEST, -1.0);

    rewards_.set_reward(14, GridWorldAction::NORTH, -1.0);
    rewards_.set_reward(14, GridWorldAction::EAST, -1.0);
    rewards_.set_reward(14, GridWorldAction::SOUTH, -1.0);
    rewards_.set_reward(14, GridWorldAction::WEST, -1.0);

    rewards_.set_reward(15, GridWorldAction::NORTH, -1.0);
    rewards_.set_reward(15, GridWorldAction::EAST, -1.0);
    rewards_.set_reward(15, GridWorldAction::SOUTH, -1.0);
    rewards_.set_reward(15, GridWorldAction::WEST, -1.0);
}

typedef GridWorld<RewardProducer> world_t;
typedef world_t::state_t state_t;

const uint_t N_CELLS = 4;

void
create_wolrd(world_t& w){

   std::vector<state_t> world_states;
   world_states.reserve(N_CELLS*N_CELLS);

   uint_t counter=0;
   for(uint_t i=0; i<N_CELLS; ++i){
       for(uint_t j=0; j<N_CELLS; ++j){
           world_states.push_back(state_t(counter++));
       }
   }

   w.set_states(std::move(world_states));

   counter=0;
   for(uint_t i=0; i<N_CELLS*N_CELLS; ++i){

       auto& state = w.get_state(i);

       /// bottom row
       if(i <4){

           state.set_transition(static_cast<GridWorldAction>(GridWorldAction::SOUTH), &state);

           if(i != 3){
             state.set_transition(GridWorldAction::EAST, &w.get_state(i+1));
           }
           else{
               state.set_transition(GridWorldAction::EAST, &state);
           }

           state.set_transition(GridWorldAction::NORTH, &w.get_state(N_CELLS + i));

           if(i == 0){
                state.set_transition(static_cast<GridWorldAction>(GridWorldAction::WEST), &state);
           }
           else{
               state.set_transition(static_cast<GridWorldAction>(GridWorldAction::WEST), &w.get_state(i-1));
           }
       }
       else if(i >= 12 ){
           /// top row

           state.set_transition(static_cast<GridWorldAction>(GridWorldAction::SOUTH), &w.get_state(i - N_CELLS));

           if(i != 15){
             state.set_transition(GridWorldAction::EAST, &w.get_state(i+1));
           }
           else{
               state.set_transition(GridWorldAction::EAST, &state);
           }

           state.set_transition(GridWorldAction::NORTH, &state);

           if(i == 12){
               state.set_transition(static_cast<GridWorldAction>(GridWorldAction::WEST), &state);
           }
           else{
              state.set_transition(static_cast<GridWorldAction>(GridWorldAction::WEST), &w.get_state(i-1));
           }
       }
       else{

           /// all rows in between
           state.set_transition(static_cast<GridWorldAction>(GridWorldAction::SOUTH), &w.get_state(i - N_CELLS));

           if(i != 11 && i != 7){
               state.set_transition(static_cast<GridWorldAction>(GridWorldAction::EAST), &w.get_state(i +1));
           }
           else{
               state.set_transition(static_cast<GridWorldAction>(GridWorldAction::EAST), &state);
           }

           state.set_transition(static_cast<GridWorldAction>(GridWorldAction::NORTH), &w.get_state(i + N_CELLS));

           if(i != 4 && i != 8 ){
              state.set_transition(static_cast<GridWorldAction>(GridWorldAction::WEST), &w.get_state(i-1));
           }
           else {
              state.set_transition(static_cast<GridWorldAction>(GridWorldAction::WEST), &state);
           }
       }
   }
}

}

int main(){

    using namespace example;

    try{

        typedef GridWorld<RewardProducer> world_t;
        typedef world_t::state_t state_t;
        typedef world_t::action_t action_t;

        auto policy = [](const action_t&, const state_t&){
          return 0.25;
        };

        auto dynamics = [](const state_t& s1, real_t,
                const state_t& s2, const action_t& action){
          return 0.25;
        };

        std::vector<real_t> rewards(1, -1.0);

        /// the world of the agent
        world_t world;
        create_wolrd(world);


        std::cout<<"Number of states: "<<world.n_states()<<std::endl;

        state_t start(15);
        state_t goal1(3);
        state_t goal2(12);

        world.append_goal(goal1);
        world.append_goal(goal2);

        /// simulation parameters
        /// maximum number of iterations, convergence
        /// tolerance and discount factor
        const uint_t N_ITERATIONS = 160;
        const real_t TOL = 0.001;
        const real_t GAMMA = 1.0;

        SyncValueFuncItrInput input={TOL, GAMMA, N_ITERATIONS, true};
        SyncValueFuncItr<world_t> learner(std::move(input));

        std::vector<real_t> row(2);
        learner.initialize(world, 0.0);

        world.restart(start);

        while(learner.continue_iterations()){

            std::cout<<"At iteration: "<<learner.get_current_iteration()<<std::endl;

            learner.step(policy, dynamics);
            auto values = learner.get_values();

            for(uint_t c=0; c<values.size(); ++c){
                std::cout<<"Cell: "<<c<<" value: "<<values[c]<<std::endl;
            }   
        }    
    }
    catch(std::exception& e){

        std::cerr<<e.what()<<std::endl;
    }
    catch(...){

        std::cerr<<"Unknown exception occurred"<<std::endl;
    }

    return 0;
}


```

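## <a name="results"></a> Results

The program prints the value estimate of every cell after each iteration. An excerpt of the output is shown below; the intermediate iterations are elided.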



```
Number of states: 16
At iteration: 1
Cell: 0 value: -1
Cell: 1 value: -1
Cell: 2 value: -1
Cell: 3 value: 0
Cell: 4 value: -1
Cell: 5 value: -1
Cell: 6 value: -1
Cell: 7 value: -1
Cell: 8 value: -1
Cell: 9 value: -1
Cell: 10 value: -1
Cell: 11 value: -1
Cell: 12 value: 0
Cell: 13 value: -1
Cell: 14 value: -1
Cell: 15 value: -1
At iteration: 2
Cell: 0 value: -2
Cell: 1 value: -2
Cell: 2 value: -1.75
Cell: 3 value: 0
Cell: 4 value: -2
Cell: 5 value: -2
Cell: 6 value: -2
Cell: 7 value: -1.75
Cell: 8 value: -1.75
Cell: 9 value: -2
Cell: 10 value: -2
Cell: 11 value: -2
Cell: 12 value: 0
Cell: 13 value: -1.75
Cell: 14 value: -2
Cell: 15 value: -2
At iteration: 3
Cell: 0 value: -3
Cell: 1 value: -2.9375
Cell: 2 value: -2.4375
Cell: 3 value: 0
Cell: 4 value: -2.9375
Cell: 5 value: -3
Cell: 6 value: -2.875
Cell: 7 value: -2.4375
Cell: 8 value: -2.4375
Cell: 9 value: -2.875
Cell: 10 value: -3
Cell: 11 value: -2.9375
Cell: 12 value: 0
Cell: 13 value: -2.4375
Cell: 14 value: -2.9375
Cell: 15 value: -3

...

At iteration: 130
Cell: 0 value: -21.9815
Cell: 1 value: -19.9835
Cell: 2 value: -13.9889
Cell: 3 value: 0
Cell: 4 value: -19.9835
Cell: 5 value: -19.9836
Cell: 6 value: -17.9855
Cell: 7 value: -13.9889
Cell: 8 value: -13.9889
Cell: 9 value: -17.9855
Cell: 10 value: -19.9836
Cell: 11 value: -19.9835
Cell: 12 value: 0
Cell: 13 value: -13.9889
Cell: 14 value: -19.9835
Cell: 15 value: -21.9815
At iteration: 131
Cell: 0 value: -21.9825
Cell: 1 value: -19.9844
Cell: 2 value: -13.9895
Cell: 3 value: 0
Cell: 4 value: -19.9844
Cell: 5 value: -19.9845
Cell: 6 value: -17.9862
Cell: 7 value: -13.9895
Cell: 8 value: -13.9895
Cell: 9 value: -17.9862
Cell: 10 value: -19.9845
Cell: 11 value: -19.9844
Cell: 12 value: 0
Cell: 13 value: -13.9895
Cell: 14 value: -19.9844
Cell: 15 value: -21.9825


```
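As a sanity check, the value of cell 2 at the second and third iterations can be reproduced by hand from the update rule. The calculation assumes what the setup above describes: each action has probability $0.25$, every transition gives a reward of $-1$, $\gamma = 1$, and from cell 2 the actions ```SOUTH```, ```EAST```, ```NORTH``` and ```WEST``` lead to cells 2, 3, 6 and 1 respectively.

$$V_{2}(2) = 0.25 \left[ (-1 + V_1(2)) + (-1 + V_1(3)) + (-1 + V_1(6)) + (-1 + V_1(1)) \right] = 0.25 \left[ -2 - 1 - 2 - 2 \right] = -1.75$$

$$V_{3}(2) = 0.25 \left[ (-1 - 1.75) + (-1 + 0) + (-1 - 2) + (-1 - 2) \right] = 0.25 \times (-9.75) = -2.4375$$

Both numbers agree with the values printed for cell 2 at iterations 2 and 3.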

## <a name="source_code"></a> Source Code



<a href="../exe.cpp">exe.cpp</a>