# Reinforcement Learning
Jin Yeom (jinyeom@utexas.edu)

In this notebook, we'll study and experiment with the basics of reinforcement learning (RL). We'll first talk about RL techniques that assume a known MDP, i.e., every state, every transition from one state to another, and every reward is exposed to the agent; then we'll talk about methods for unknown MDPs, where the agent has to explore the environment to approximate the MDP.

Note that this notebook is based on Sutton and Barto's book __Reinforcement Learning: An Introduction__.

## Policy Iteration and Value Iteration

Traditional AI methods based on search algorithms (what we now call Good Old-Fashioned Artificial Intelligence, or GOFAI) are often designed under the assumption that the model of the environment is completely exposed to the agent. In other words, the agent has direct access to all states, so it can often search through them and plan an optimal path that maximizes utility.

In this section, we start from the same assumption. The model is given to the agent, but rather than searching and planning to build its policy, the agent will explore and evaluate each state, so that it can learn which states it should be in.

### The Bellman Equation

One key idea behind the two algorithms in this section is the Bellman Equation for the optimal value function $v_*$, shown below.

$$
v_*(s) = \max_{a}\mathbb{E}[R_{t + 1} + \gamma v_*(S_{t + 1}) \mid S_t = s, A_t = a]
$$

An important thing to notice is that this equation is recursive, i.e., the value of a state is determined by that of its next state, the value of the next state by the one after it, and so on. Intuitively, the equation says that the optimal value of each state is the expected return of the action with the highest expected value (if you're familiar with search-based AI methods, it's exactly like the Expectimax method).
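To make the recursion concrete, here is a minimal sketch of the Bellman backup on a hypothetical three-state chain with deterministic transitions, so the expectation collapses to a single term. The names `next_state`, `reward`, and `bellman_backup` are made up for this illustration and are not part of the GridWorld code used later.

```python
import numpy as np

# Hypothetical 3-state chain; actions are 0 = left, 1 = right.
# next_state[s, a] is the state reached from s under action a.
# State 2 is absorbing: both actions lead back to it.
next_state = np.array([[0, 1],
                       [0, 2],
                       [2, 2]])
# reward[s, a] is the reward for taking action a in state s.
# Only the step from state 1 into state 2 is rewarded.
reward = np.array([[0.0, 0.0],
                   [0.0, 1.0],
                   [0.0, 0.0]])
gamma = 0.9

def bellman_backup(v):
    """Apply the right-hand side of the Bellman Equation to every state."""
    # q[s, a] = r(s, a) + gamma * v(s'), where s' = next_state[s, a].
    q = reward + gamma * v[next_state]
    # The value of a state is that of its best action.
    return q.max(axis=1)

# Start from all zeros and apply the backup until the values settle.
v = np.zeros(3)
for _ in range(50):
    v = bellman_backup(v)
print(v)  # settles at [0.9, 1.0, 0.0]
```

State 0 is worth `gamma * 1.0 = 0.9` because the reward is one step further away than it is from state 1, which is the recursion at work.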

### Policy Iteration

Another observation that the Bellman Equation suggests is that, in order to evaluate each state, the agent must try some sequence of further actions and find out what happens. But rather than searching for the best sequence of actions by trying every possible scenario, we would like the agent to learn the best sequence of actions by learning where to be and where not to be. In other words, the agent will try some sequence of actions, evaluate it, fix the sequence to improve the score, and repeat. This process is called **Policy Iteration**.
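The evaluate-and-improve loop can be sketched on a toy problem before moving to a grid. The example below is a minimal illustration on a hypothetical three-state chain with deterministic transitions; `next_state`, `reward`, `evaluate`, and `improve` are names made up here and are unrelated to the GridWorld code.

```python
import numpy as np

# Hypothetical 3-state chain; actions are 0 = left, 1 = right.
next_state = np.array([[0, 1], [0, 2], [2, 2]])  # next_state[s, a]
reward = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])  # reward[s, a]
gamma = 0.9
states = np.arange(3)

def evaluate(policy, theta=1e-10):
    """Iteratively compute the value of a deterministic policy,
    given as one action index per state."""
    v = np.zeros(3)
    while True:
        # One sweep: reward of the chosen action plus the discounted
        # value of the state that action leads to.
        v_new = reward[states, policy] + gamma * v[next_state[states, policy]]
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new

def improve(v):
    """Return the policy that is greedy with respect to v."""
    return np.argmax(reward + gamma * v[next_state], axis=1)

policy = np.zeros(3, dtype=int)  # start by always moving left
while True:
    v = evaluate(policy)
    new_policy = improve(v)
    if np.array_equal(new_policy, policy):
        break  # the policy is stable, so the loop is done
    policy = new_policy

print(policy, v)
```

On this chain the loop stops after a few rounds with the policy "move right" in states 0 and 1.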

Now, let's try implementing Policy Iteration for a small problem called GridWorld. In this problem, the agent moves from one cell of a grid to another to reach the goal while avoiding fire. The code for GridWorld is included separately in this repository.

In [1]:
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

from scripts import gridworld as gw

Let's start with a common example grid.

In [2]:
# Aliases for easier interpretation of each grid type.
D, W, G, F = gw.GRID, gw.WALL, gw.GOAL, gw.FIRE

grids = gw.GridWorld(np.array([[D, D, D, G],
                               [D, W, D, F],
                               [D, D, D, D]]))

# Visualize the grids.
# Empty grids are ones that are reachable, `#` shows a wall,
# `o` is the good terminal state (goal), and `x` is the bad
# terminal state (fire).
grids.show()

+-+-+-+-+
| | | |o|
+-+-+-+-+
| |#| |x|
+-+-+-+-+
| | | | |
+-+-+-+-+


As mentioned earlier, the steps of the Policy Iteration algorithm are:
1. Initialize the agent's policy ($\forall s \in S, \pi(s) \in A$).
2. Evaluate $\pi$ for each state $s$.
3. Improve $\pi$ for each state $s$; if the policy changed, go back to step 2.
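
Steps 2 and 3 can be written as update rules. This is a sketch in the notation of the Bellman Equation above. It assumes deterministic transitions $s' = T(s, a)$ and a reward $R(s)$ that depends only on the current state, which is how `env.T` and `env.R` are called in the cells below but not something the GridWorld module itself states.

$$
\begin{aligned}
\text{Evaluation:}\quad v_\pi(s) &\leftarrow \sum_{a} \pi(a \mid s)\,\bigl[R(s) + \gamma\, v_\pi(T(s, a))\bigr] \\
\text{Improvement:}\quad \pi(s) &\leftarrow \arg\max_{a}\,\bigl[R(s) + \gamma\, v_\pi(T(s, a))\bigr]
\end{aligned}
$$

The evaluation rule averages over the actions the current policy would take, while the improvement rule compares all actions directly, whatever probability the current policy gives them.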

In [7]:
def init_policy(env):
    """ Initialize the policy with 4 possible actions for each state with the
    same probability (1.0 / 4 actions = 0.25). """
    r, c = env.shape
    return np.full((r, c, 4), 0.25)

pi = init_policy(grids)
grids.show(policy=pi)

+-+-+-+-+
|^|^|^|^|
+-+-+-+-+
|^|^|^|^|
+-+-+-+-+
|^|^|^|^|
+-+-+-+-+


In [8]:
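def eval_policy(policy, env, gamma, theta=1e-8):
    """ Evaluate the argument policy by iterative policy evaluation.
    Sweeps stop once no state value changes by theta or more. """
    V = np.zeros(env.shape)
    while True:
        delta = 0.0
        for s in env.S:
            v = V[s]
            # Expected one-step return under the policy: action a is taken
            # with probability p and leads to the state env.T(s, a).
            V[s] = sum(p * (env.R(s) + gamma * V[env.T(s, a)])
                       for a, p in enumerate(policy[s]))
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V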
        
print(eval_policy(pi, grids, 0.5))

[[ 0.10215867  0.02686511  0.10215867  0.98143746]
 [-0.1422485  -0.03423668 -0.1422485  -1.04911344]
 [-0.04878473 -0.01626157 -0.04878473 -0.18298303]]


In [9]:
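def impr_policy(policy, env, gamma):
    """ Improve the argument policy in place. Return whether the policy was
    already stable, and the values it was improved against. """
    policy_stable = True
    V = eval_policy(policy, env, gamma)
    n_actions = policy.shape[-1]
    for s in env.S:
        old_action = np.argmax(policy[s])
        # Greedy step: score each action by its one-step lookahead value,
        # independent of the probability the current policy assigns to it.
        new_action = np.argmax([env.R(s) + gamma * V[env.T(s, a)]
                                for a in range(n_actions)])
        if new_action != old_action:
            policy_stable = False
        # Make the policy deterministic: all probability on the greedy action.
        policy[s] = np.eye(n_actions)[new_action]
    return policy_stable, V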

Now, we can put these functions together to complete Policy Iteration.

In [13]:
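def policy_iteration(env, gamma):
    """ Alternate evaluation and improvement until the policy is stable. """
    pi = init_policy(env)
    done = False
    while not done:
        # impr_policy evaluates pi, then updates it in place.
        done, V = impr_policy(pi, env, gamma)
    return pi, V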

pi, V = policy_iteration(grids, 0.7)
grids.show(policy=pi)
print(V)

+-+-+-+-+
|<|<|>|>|
+-+-+-+-+
|^|^|^|^|
+-+-+-+-+
|^|^|^|^|
+-+-+-+-+
[[ 2.33333331  1.63333332  2.33333331  3.33333331]
 [ 1.63333332  1.14333332  1.63333332  1.33333332]
 [ 1.14333332  0.          1.14333332  0.93333332]]


### Value Iteration