# Reinforcement Learning
Jin Yeom (jinyeom@utexas.edu)

In this notebook, we'll study and experiment with the basics of reinforcement learning (RL). We'll first talk about RL techniques that assume a known MDP, i.e., the agent has access to the transition and reward functions, so that it can evaluate each state directly. Then we'll talk about methods for unknown MDPs, where the agent has to explore the environment to learn an optimal behavior.

Note that this notebook is based on Sutton and Barto's book __Reinforcement Learning: An Introduction__.

## Policy Iteration and Value Iteration

If you look at traditional AI methods based on search algorithms (what we now call Good Old-Fashioned Artificial Intelligence, or GOFAI), they are often designed with the assumption that the model of the environment is completely exposed to the agent. In other words, the agent is given both the transition function and the reward function. As a result, the agent can search through different states and plan an optimal path that maximizes utility.

In this section, we're going to start from there. The model is given to the agent, but rather than searching and planning to build its policy, the agent will learn to evaluate each state, so that it can later decide which states are worth moving into.

### The Bellman Equation

One key idea behind the two algorithms in this section is the Bellman Equation, shown below in its optimality form for the state value.

$$
v_*(s) = \max_{a}\mathbb{E}[R_{t + 1} + \gamma v_*(S_{t + 1}) \mid S_t = s, A_t = a]
$$

An important thing to notice is that this equation is recursive, i.e., the value of a state is determined by that of its next state, the value of the next state by the one after it, and so on. Intuitively, the equation says that the optimal value of each state is given by the action with the highest expected return, where the return is the immediate reward plus the discounted value of the next state (if you're familiar with search-based AI methods, it's just like the Expectimax method).
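To make the recursion concrete, here is a minimal sketch of a single application of the right-hand side of the equation. The two-state, two-action MDP, the arrays `P` and `R`, and the function name `bellman_backup` are all made up for illustration; none of them come from the GridWorld code used in this notebook.

```python
import numpy as np

# Hypothetical MDP with 2 states and 2 actions (illustration only).
# P[s, a, t] is the probability of moving from state s to state t under action a.
# R[s, a, t] is the reward received for that transition.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[[0.0, 1.0], [0.0, 2.0]],
              [[0.0, 0.0], [0.0, 0.0]]])
gamma = 0.9


def bellman_backup(v, P, R, gamma):
    """Apply the right-hand side of the Bellman Equation once to every state."""
    # q[s, a] = sum over t of P[s, a, t] * (R[s, a, t] + gamma * v[t])
    q = np.einsum("sat,sat->sa", P, R + gamma * v[None, None, :])
    # The value of each state is the best expected return over actions.
    return q.max(axis=1)


v = np.zeros(2)                       # initial guess: every state is worth 0
v_next = bellman_backup(v, P, R, gamma)
print(v_next)
```

Starting from an all-zero guess, only the immediate rewards matter, so state 0 gets the value of its better action and state 1 stays at zero. Applying the backup again would propagate those values one more step into the future.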

### Policy Iteration

Another observation that the Bellman Equation suggests is that, in order to evaluate each state, the agent must try some sequence of further actions and find out what happens. But rather than searching for the best sequence of actions by trying many different scenarios, we would like the agent to learn the best sequence of actions on its own. In other words, the agent will try some sequence of actions, evaluate the outcome, and then adjust its behavior (its policy) to do better. This process is called **Policy Iteration**.
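The evaluate-then-improve loop can be sketched on a tiny tabular problem. This is only an illustration under stated assumptions: the two-state MDP and the helper names (`policy_evaluation`, `policy_improvement`, `policy_iteration`) are hypothetical and are not part of the `gridworld` module.

```python
import numpy as np

# Hypothetical MDP with 2 states and 2 actions (illustration only).
# P[s, a, t]: transition probability, R[s, a, t]: reward for that transition.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[[0.0, 1.0], [0.0, 2.0]],
              [[0.0, 0.0], [0.0, 0.0]]])


def policy_evaluation(policy, P, R, gamma, tol=1e-8):
    """Iteratively estimate the value of each state under a fixed policy."""
    states = np.arange(P.shape[0])
    v = np.zeros(P.shape[0])
    while True:
        # Pick out the transition and reward rows for the chosen actions.
        P_pi = P[states, policy]
        R_pi = R[states, policy]
        v_new = np.sum(P_pi * (R_pi + gamma * v[None, :]), axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new


def policy_improvement(v, P, R, gamma):
    """Return the policy that acts greedily with respect to v."""
    q = np.einsum("sat,sat->sa", P, R + gamma * v[None, None, :])
    return q.argmax(axis=1)


def policy_iteration(P, R, gamma=0.9):
    """Alternate evaluation and improvement until the policy stops changing."""
    policy = np.zeros(P.shape[0], dtype=int)  # arbitrary starting policy
    while True:
        v = policy_evaluation(policy, P, R, gamma)
        new_policy = policy_improvement(v, P, R, gamma)
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy


policy, v = policy_iteration(P, R, gamma=0.9)
print(policy, v)
```

Here the evaluation step is itself iterative and stops once the values change by less than `tol`. Solving the linear system for the fixed policy directly would be an alternative for small problems like this one.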

Now, let's try implementing Policy Iteration for a small problem called GridWorld. In this problem, the agent has to move from one grid cell to another to reach the goal, while avoiding fire. The code for GridWorld is included separately in this repository.

In [1]:
import numpy as np
from scripts import gridworld as gw

Let's start with a common example grid.

In [4]:
# Aliases for easier interpretation of each grid type.
D, W, G, F = gw.GRID, gw.WALL, gw.GOAL, gw.FIRE

grid = gw.GridWorld(np.array([[D, D, D, G],
                              [D, W, D, F],
                              [D, D, D, D]]))

# Visualize the grid.
# Empty cells are reachable, `#` marks a wall,
# `o` is the good terminal state (goal), and `x` is the bad
# terminal state (fire).
grid.show()

+-+-+-+-+
| | | |o|
+-+-+-+-+
| |#| |x|
+-+-+-+-+
| | | | |
+-+-+-+-+


## Value Iteration