# Reinforcement Learning
Jin Yeom (jinyeom@utexas.edu)

In this notebook, we'll study and experiment with basics of reinforcement learning (RL). We're going to first talk about RL techniques that are based on known MDPs, in which every state, every transition from one to another, and its reward is exposed to the agent; then we'll talk about methods with unknown MDPs, where the agent has to explore the environment, from one state to another, to approximate the MDP.

Note that this notebook is loosely based on Sutton and Barto's book __Reinforcement Learning: An introduction__.

## Value Iteration and Policy Iteration

In this section, we're going to start with RL techniques based on known MDPs. The model is provided to the agent, so the agent has a direct access to each state. It will explore and evaluate each state, so that the agent can learn to decide whether it should be in a certain state.

### The Bellman Equation

One key idea for the two algorithms in this section is the Bellman Equation (shown below).

$$
V^*(s) = \max_{a}\sum_{s'} T(s, a, s')[R(s, a, s') + \gamma V^*(s')]
$$

One important insight from this equation is that the optimal value of each state is determined by that of the next state. Dynamic programming can be used to efficiently solve the recurrence.

### GridWorld

For this section, we'll experiment a small problem called GridWorld. In this problem, the agent moves from a grid to another to reach the goal, while avoiding fire. The code for GridWorld is included separately in this repository.

In [2]:
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

from scripts import gridworld as gw

Let's start with a common example grid.

In [3]:
# Aliases for easier interpretation of each grid type.
D, W, G, F = gw.GRID, gw.WALL, gw.GOAL, gw.FIRE

grids = gw.GridWorld(np.array([[D, D, D, G],
                               [D, W, D, F],
                               [D, D, D, D]]))

# Visualize the grids.
# Empty grids are ones that are reachable, `#` shows a wall,
# `o` is the good terminal state (goal), and `x` is the bad
# terminal state (fire).
grids.show()

+-+-+-+-+
| | | |o|
+-+-+-+-+
| |#| |x|
+-+-+-+-+
| | | | |
+-+-+-+-+


### Value Iteration
Value Iteration algorithm simply computes the value of each state iteratively, using the recurrence of the Bellman Equation; hence the name.

In [None]:
def value_iteration(env, gamma):
    V = np.zeros(env.shape)
    while True:
        V_ = V
        for s in env.S:
            V_[s] = sum([p * (env.R(s, a) + gamma * V[env.T(s, a)]) 
                         for a, p in enumerate(policy[s])])
            

### Policy Iteration
Policy Iteration algorithm follows the steps below:
1. Initialize the agent's policy ($\forall s \in S, \pi(s) \in A$)
2. Evaluate $\pi$ for each state $s$.
3. Improve $\pi$ for each state $s$; go to 2.

In [4]:
def init_policy(env):
    """ Initialize the policy with 4 possible actions for each state with the
    same probability (1.0 / 4 actions = 0.25). """
    r, c = env.shape
    return np.full((r, c, 4), 0.25)

pi = init_policy(grids)
grids.show(policy=pi)

+-+-+-+-+
|^|^|^|^|
+-+-+-+-+
|^|^|^|^|
+-+-+-+-+
|^|^|^|^|
+-+-+-+-+


In [7]:
def eval_policy(policy, env, gamma, theta=1e-8):
    """ Evaluate the argument policy. """
    V = np.zeros(env.shape)
    while True:
        delta = 0.0
        for s in env.S:
            v = V[s]
            V[s] = sum([p * (env.R(s, a) + gamma * V[env.T(s, a)])
                        for a, p in enumerate(policy[s])])
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V
        
print(eval_policy(pi, grids, 0.5))

[[ 0.20431734  0.05373022  0.20431734 -0.03712508]
 [-0.284497   -0.06847337 -0.284497   -0.09822688]
 [-0.09756946 -0.03252315 -0.09756946 -0.36596606]]


In [29]:
def impr_policy(policy, env, gamma):
    """ Improve the argument policy. """
    policy_stable = True
    V = eval_policy(policy, env, gamma)
    for s in env.S:
        old_action = np.argmax(policy[s])
        new_action = np.argmax([p * (env.R(s) + gamma * V[env.T(s, a)])
                               for a, p in enumerate(policy[s])])
        if new_action != old_action:
            policy_stable = False
        policy[s] = np.eye(4)[new_action]
    return policy_stable, V

Now, we can put these functions together to complete Policy Iteration.

In [30]:
def policy_iteration(env, gamma):
    pi = init_policy(env)
    done = False
    while not done:
        done, V = impr_policy(pi, env, gamma)
    return pi, V

pi, V = policy_iteration(grids, 0.9)
grids.show(policy=pi)
print(V)

+-+-+-+-+
|<|<|>|>|
+-+-+-+-+
|^|^|^|^|
+-+-+-+-+
|^|^|^|^|
+-+-+-+-+
[[ 8.99999991  8.09999992  8.99999991  9.99999991]
 [ 8.09999992  7.28999993  8.09999992  7.99999992]
 [ 7.28999993  0.          7.28999993  7.19999993]]
