# Example 26: Policy Iteration

## Contents
* [Acknowledgements](#ackw)
* [Overview](#overview) 
    * [Policy iteration algorithm](#ekf)
    * [Test case](#motion_model)
* [Include files](#include_files)
* [The main function](#m_func)
* [Results](#results)
* [Source Code](#source_code)

## <a name="ackw"></a> Acknowledgements

Most of the content in this notebook is taken from the book ```Reinforcement Learning: An Introduction``` by Sutton and Barto.

## <a name="overview"></a> Overview

In this notebook we will discuss the so-called Policy Iteration algorithm for a finite MDP. 

### <a name="ekf"></a> Policy iteration algorithm

The idea behind policy iteration is rather simple. Once a policy $\pi$ has been improved using $V_{\pi}$ to yield a better policy, say $\pi_1$, we can compute $V_{\pi_1}$ and improve it again to yield an even better policy, say $\pi_2$. This holds as long as the current policy is not already optimal.

This process gives us a sequence of monotonically improving policies and value functions:


$$\pi_0 \rightarrow V_{\pi_0} \rightarrow \pi_1 \rightarrow V_{\pi_1}\rightarrow \pi_2 \cdots \pi_{*} \rightarrow V_{\pi_{*}}$$

The step $\pi_i \rightarrow V_{\pi_i}$ is an evaluation step, for which we can use iterative policy evaluation. The step $V_{\pi_i} \rightarrow \pi_{i+1}$ is an improvement step, in which the new policy acts greedily with respect to $V_{\pi_i}$. Each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal).
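The two steps can be written out for a deterministic policy as follows. This is the standard textbook form; the dynamics function $p(s', r \mid s, a)$ and the sweep index $k$ are notation assumed here and are not defined elsewhere in this notebook.

$$V_{k+1}(s) = \sum_{s', r} p(s', r \mid s, \pi_i(s)) \left[ r + \gamma V_k(s') \right]$$

$$\pi_{i+1}(s) = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V_{\pi_i}(s') \right]$$

The first update is repeated over all states until $V_k$ stops changing, which gives $V_{\pi_i}$. The second is applied once per state to obtain the next policy.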

Since we are assuming a finite MDP, only a finite number of policies exist. Thus, this process must converge to an optimal policy and optimal value function in a finite number of iterations.

This way of finding an optimal policy is called policy iteration. The algorithm is shown in the image below.

<img src="policy_iteration.png"
     alt="Policy Iteration"
     style="float: left; margin-right: 10px; width: 500px;" />

**Note that the image is taken from the book ```Reinforcement Learning: An Introduction``` by Sutton and Barto**

Note that each policy evaluation, itself an iterative computation,
is started with the value function for the previous policy. This typically results in a great
increase in the speed of convergence of policy evaluation (presumably because the value
function changes little from one policy to the next).
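The loop described above can be sketched in plain C++ as follows. This is a minimal illustration and not the `cubic_engine` implementation: the `Transition`, `Model`, `policy_iteration` and `make_corridor` names are hypothetical, and the four-state corridor is a made-up toy problem. The sketch assumes $\gamma < 1$ so that every evaluation phase terminates even for a policy that never reaches the terminal state.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One possible outcome of an action: probability, successor state, reward.
struct Transition {
    double prob;
    std::size_t next;
    double reward;
};

// model[s][a] lists the outcomes of action a in state s.
// A state without actions is treated as terminal (its value stays 0).
using Model = std::vector<std::vector<std::vector<Transition>>>;

struct PolicyIterationResult {
    std::vector<std::size_t> policy;  // chosen action index per state
    std::vector<double> values;       // value estimate per state
};

// Expected one-step return of action a in state s under the estimate v.
double action_value(const Model& model, const std::vector<double>& v,
                    std::size_t s, std::size_t a, double gamma) {
    double q = 0.0;
    for (const Transition& t : model[s][a]) {
        q += t.prob * (t.reward + gamma * v[t.next]);
    }
    return q;
}

PolicyIterationResult policy_iteration(const Model& model, double gamma, double theta) {
    assert(gamma >= 0.0 && gamma < 1.0 && theta > 0.0);
    const std::size_t n = model.size();
    std::vector<std::size_t> policy(n, 0);  // arbitrary initial policy
    std::vector<double> v(n, 0.0);          // reused across iterations (warm start)
    bool stable = false;
    while (!stable) {
        // Policy evaluation: in-place sweeps until the largest change drops below theta.
        double delta = 0.0;
        do {
            delta = 0.0;
            for (std::size_t s = 0; s < n; ++s) {
                if (model[s].empty()) continue;
                const double updated = action_value(model, v, s, policy[s], gamma);
                delta = std::max(delta, std::fabs(updated - v[s]));
                v[s] = updated;
            }
        } while (delta >= theta);

        // Policy improvement: switch only when another action is strictly better.
        stable = true;
        for (std::size_t s = 0; s < n; ++s) {
            if (model[s].empty()) continue;
            std::size_t best = policy[s];
            double best_q = action_value(model, v, s, best, gamma);
            for (std::size_t a = 0; a < model[s].size(); ++a) {
                const double q = action_value(model, v, s, a, gamma);
                if (q > best_q + 1e-12) {
                    best = a;
                    best_q = q;
                }
            }
            if (best != policy[s]) {
                policy[s] = best;
                stable = false;
            }
        }
    }
    return {policy, v};
}

// Hypothetical corridor 0-1-2-3 with terminal state 3 and reward -1 per move.
Model make_corridor() {
    Model model(4);  // model[3] stays empty, i.e. terminal
    for (std::size_t s = 0; s < 3; ++s) {
        const std::size_t left = (s == 0) ? 0 : s - 1;
        model[s].push_back({Transition{1.0, left, -1.0}});   // action 0: left
        model[s].push_back({Transition{1.0, s + 1, -1.0}});  // action 1: right
    }
    return model;
}
```

Calling `policy_iteration(make_corridor(), 0.9, 1e-8)` starts from the "always left" policy and should end with action 1 (right) in every non-terminal state. Note that `v` is not reset between iterations, which is the warm start mentioned above.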

### <a name="motion_model"></a> Test case

A common test case for SARSA is the so-called Cliff world. The world used here is shown in the image below.

<img src="cliff_world.png"
     alt="Cliff World"
     style="float: left; margin-right: 10px; width: 500px;" />

This is a standard undiscounted (i.e. $\gamma = 1$), episodic task, with start and goal states, and the usual actions causing movement up, down, right, and left. The reward is -1 on all transitions except those into the region marked Cliff. Stepping into this region incurs a reward of -100 and sends the agent instantly back to the start.
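The transition rule described above can be sketched as a single step function. This is only an illustration under stated assumptions and not the `CliffWorld` class from `cubic_engine`: it assumes a 4 x 12 grid numbered row by row from the bottom-left corner (so the start is state 0, the goal is state 11, and the cliff cells are states 1 to 10), and it assumes that a move off the grid leaves the agent where it is. The names `Action`, `StepResult` and `step` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names for illustration only.
enum class Action { NORTH, SOUTH, EAST, WEST };

struct StepResult {
    std::size_t next_state;
    double reward;
    bool done;  // true once the goal has been reached
};

constexpr std::size_t N_ROWS = 4;   // assumed grid height
constexpr std::size_t N_COLS = 12;  // assumed grid width
constexpr std::size_t START = 0;
constexpr std::size_t GOAL = 11;

// Cliff cells: bottom-row cells strictly between start and goal (assumed layout).
bool is_cliff(std::size_t state) {
    return state > START && state < GOAL;
}

StepResult step(std::size_t state, Action action) {
    assert(state < N_ROWS * N_COLS);
    std::size_t row = state / N_COLS;
    std::size_t col = state % N_COLS;
    // Moves that would leave the grid keep the agent in place (assumption).
    switch (action) {
        case Action::NORTH: if (row + 1 < N_ROWS) ++row; break;
        case Action::SOUTH: if (row > 0) --row; break;
        case Action::EAST:  if (col + 1 < N_COLS) ++col; break;
        case Action::WEST:  if (col > 0) --col; break;
    }
    const std::size_t next = row * N_COLS + col;
    if (is_cliff(next)) {
        // Falling off the cliff: large penalty and back to the start.
        return {START, -100.0, false};
    }
    return {next, -1.0, next == GOAL};
}
```

For example, `step(15, Action::SOUTH)` lands on cliff cell 3, so under these assumptions it returns the start state with a reward of -100.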

## <a name="include_files"></a> Include files

```
#include "cubic_engine/base/cubic_engine_types.h"
#include "kernel/utilities/csv_file_writer.h"
#include "kernel/base/kernel_consts.h"
#include "cubic_engine/rl/worlds/cliff_world.h"
#include "cubic_engine/rl/worlds/grid_world_action_space.h"
#include "cubic_engine/rl/tabular_sarsa_learning.h"
#include "cubic_engine/rl/reward_table.h"

#include <cmath>
#include <utility>
#include <tuple>
#include <iostream>
#include <random>
#include <algorithm>
#include <exception>
#include <string>
#include <vector>
```

## <a name="m_func"></a> The main function

```
int main(){

    using cengine::uint_t;
    using cengine::real_t;
    using cengine::rl::worlds::CliffWorld;
    using cengine::rl::worlds::GridWorldAction;
    using cengine::rl::SarsaTableLearning;
    using cengine::rl::SarsaLearningInput;
    using cengine::rl::RewardTable;
    using kernel::CSVWriter;

    try{

        typedef CliffWorld world_t;
        typedef world_t::state_t state_t;

        /// the world of the agent
        CliffWorld world;
        world.create_world();

        std::cout<<"Number of states: "<<world.n_states()<<std::endl;

        state_t start(0);
        state_t goal(11);

        /// simulation parameters
        /// number of episodes for the agent to learn.
        const uint_t N_ITERATIONS = 500;
        const real_t ETA = 0.1;
        const real_t EPSILON = 0.1;
        const real_t GAMMA = 1.0;
        const real_t PENALTY = -100.0;

        SarsaLearningInput qinput={ETA, EPSILON, GAMMA, true, true};
        SarsaTableLearning<world_t> sarsalearner(std::move(qinput));

        CSVWriter writer("agent_rewards.csv", ',', true);
        writer.write_column_names({"Episode", "Reward"}, true);

        std::vector<real_t> row(2);
        sarsalearner.initialize(world, PENALTY);

        auto& table = sarsalearner.get_table();
        table.save_to_csv("table_rewards" + std::to_string(0) + ".csv");

        for(uint_t episode=0; episode < N_ITERATIONS; ++episode){

            std::cout<<"At episode: "<<episode<<std::endl;
            world.restart(start, goal);
            auto result = sarsalearner.train(goal);

            /// the total reward the agent obtained
            /// in this episode
            auto reward = result.total_reward;
            writer.write_row(std::make_tuple(episode, reward));
            std::cout<<"At episode: "<<episode<<" total reward: "<<reward<<std::endl;

            /// save the final table; reuse the reference obtained
            /// before the loop instead of shadowing it
            if(episode == N_ITERATIONS - 1){
                table.save_to_csv("table_rewards" + std::to_string(episode) + ".csv");
            }
        }
    }
    catch(const std::exception& e){

        std::cerr<<e.what()<<std::endl;
    }
    catch(...){

        std::cerr<<"Unknown exception occurred"<<std::endl;
    }

    return 0;
}

```

## <a name="results"></a> Results



```
...

Taking action: SOUTH
	Reward received: -1
	Current value for state: 27 and action: SOUTH -1
	Next action: NORTH
	Next state: 15
	Setting for state: 27 and action: SOUTH to value: -1
	At iteration: 215
	Current state: 15
	Taking action: NORTH
	Reward received: -1
	Current value for state: 15 and action: NORTH -1
	Next action: SOUTH
	Next state: 27
	Setting for state: 15 and action: NORTH to value: -1
	At iteration: 216
	Current state: 27
	Taking action: SOUTH
	Reward received: -1
	Current value for state: 27 and action: SOUTH -1
	Next action: NORTH
	Next state: 15
	Setting for state: 27 and action: SOUTH to value: -1
	At iteration: 217
	Current state: 15
	Taking action: NORTH
	Reward received: -1
	Current value for state: 15 and action: NORTH -1
	Next action: SOUTH
	Next state: 27
	Setting for state: 15 and action: NORTH to value: -1
	At iteration: 218
	Current state: 27
	Taking action: SOUTH
	Reward received: -1
	Current value for state: 27 and action: SOUTH -1
	Next action: NORTH
	Next state: 15
	Setting for state: 27 and action: SOUTH to value: -1
	At iteration: 219
	Current state: 15
	Taking action: NORTH
	Reward received: -1
	Current value for state: 15 and action: NORTH -1
	Next action: SOUTH
	Next state: 27
	Setting for state: 15 and action: NORTH to value: -1
	At iteration: 220
	Current state: 27
	Taking action: SOUTH
	Reward received: -1
	Current value for state: 27 and action: SOUTH -1
	Next action: NORTH
	Recalculated next action: SOUTH
	Next state: 15
	Setting for state: 27 and action: SOUTH to value: -1
	At iteration: 221
	Current state: 15
	Taking action: SOUTH
WORLD FINISHED AT STATE: 15 AND ACTION: SOUTH
At episode: 498 total reward: -220
At episode: 499
	At iteration: 1
	Current state: 0
	Taking action: SOUTH
WORLD FINISHED AT STATE: 0 AND ACTION: SOUTH
At episode: 499 total reward: 0

```

The following images show the sum of rewards achieved by the algorithm for various values of $\epsilon$. We see that as $\epsilon$ is increased, more exploration occurs.

<img src="e_0_1.png"
     alt="Total Rewards"
     style="float: left; margin-right: 10px;" />

<img src="e_0_2.png"
     alt="Total Rewards"
     style="float: left; margin-right: 10px;" />

<img src="e_0_3.png"
     alt="Total Rewards"
     style="float: left; margin-right: 10px;" />
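The link between $\epsilon$ and exploration can be illustrated with an $\epsilon$-greedy selection rule. The sketch below assumes this is how the learner picks actions, which the code above does not show; the function name `epsilon_greedy` and its signature are hypothetical and are not part of `cubic_engine`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <random>
#include <vector>

// Returns the index of the chosen action given the action values of one state.
// With probability epsilon a uniformly random action is returned (exploration),
// otherwise the action with the largest value is returned (exploitation).
template <typename Rng>
std::size_t epsilon_greedy(const std::vector<double>& q_values, double epsilon, Rng& rng) {
    assert(!q_values.empty());
    assert(epsilon >= 0.0 && epsilon <= 1.0);

    std::uniform_real_distribution<double> coin(0.0, 1.0);
    if (coin(rng) < epsilon) {
        std::uniform_int_distribution<std::size_t> pick(0, q_values.size() - 1);
        return pick(rng);
    }

    const auto best = std::max_element(q_values.begin(), q_values.end());
    return static_cast<std::size_t>(std::distance(q_values.begin(), best));
}
```

With `epsilon = 0.0` the rule is purely greedy, and with `epsilon = 1.0` every action is drawn at random. Intermediate values, such as the `EPSILON = 0.1` used in the main function, trade some immediate reward for exploration.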

## <a name="source_code"></a> Source Code



<a href="../exe.cpp">exe.cpp</a>