# Grid World Tutorial: JuliaPOMDP for Complete Beginners

In this tutorial, we try to provide a simple example of how to define a Markov decision process (MDP) problem using the [POMDPs.jl](https://github.com/sisl/POMDPs.jl) interface. After defining the problem in this way, you will be able to use the solvers that the interface supports. In this tutorial, we will show you how to use the value iteration and the Monte Carlo Tree Search solvers that the POMDPs.jl interface supports. We assume that you have some knowledge of basic programming, but are not necessarily familiar with all the features that exist in Julia. We try to cover the many language specific features used in POMDPs.jl in this tutorial. We do assume that you know the grid world problem, and are familiar with the formal defintion of the MDP.  Let's get started!

In [1]:
Pkg.add("Distributions")

INFO: Nothing to be done
INFO: METADATA is out-of-date — you may not have the latest version of Distributions
INFO: Use `Pkg.update()` to get the latest versions of your packages


In [2]:
# first import the POMDPs.jl interface
using POMDPs

# import our helper Distributions.jl module
using Distributions

## Problem Overview
In Grid World, we are trying to control an agent who has trouble moving in the desired direction. In our problem, we have a four reward states on a $10\times 10$ grid. Each position on the grid represents a state, and the positive reward states are terminal (the agent stops recieveing reward after reaching them). The agent has four actions to choose from: up, down, left, right. The agent moves in the desired direction with a probability of 0.7, and with a probability of 0.1 in each of the remaining three directions. The problem has the following form:
![example](http://artint.info/figures/ch09/gridworldc.gif)

## MDP Type
An MDP must contain a state space and an action space. In POMDPs.jl, we define an MDP type by parametarizing it (to read more about parametric types look [here](http://docs.julialang.org/en/release-0.4/manual/types/#man-parametric-types)) with states and actions both of which are their own types. For example, if our states and actions are both represented by integers we can define our MDP type in the following way:
```julia
type MyMDP <: MDP{Int64, Int64} # MDP{StateType, ActionType}

end
```
The MyMDP type is inheriting from an abstract MDP type define in POMDPs.jl. If you are interested in in learning more about the type system and inheritance in Jullia check out [this](http://chrisvoncsefalvay.com/2015/02/04/types-and-inheritance-in-julia/) blog post. Let's first define our states and actions, and then we'll go through defining our Grid World MDP type. 

## States
The data container below represents the state of the agent in the grid world.

In [3]:
type NeedleState 
    x::Int64 # x position
    y::Int64 # y position
    psi::Int64 # orientation
    bumped::Bool # did we bump the wall?
    done::Bool # are we in a terminal state?
end

Below are some convenience functions for working with the NeedleState. 

In [4]:
# initial state constructor
NeedleState(x::Int64, y::Int64, psi::Int64) = NeedleState(x,y,psi,false,false)
# checks if the position of two states are the same
posequal(s1::NeedleState, s2::NeedleState) = s1.x == s2.x && s1.y == s2.y && s1.psi == s2.psi
# copies state s2 to s1
function Base.copy!(s1::NeedleState, s2::NeedleState) 
    s1.x = s2.x
    s1.y = s2.y
    s1.psi = s2.psi
    s1.bumped = s2.bumped
    s1.done = s2.done
    s1
end
# if you want to use Monte Carlo Tree Search, you will need to define the functions below
Base.hash(s::NeedleState, h::UInt64 = zero(UInt64)) = hash(s.x, hash(s.y, hash(s.psi, hash(s.bumped, hash(s.done, h)))))
Base.isequal(s1::NeedleState,s2::NeedleState) = s1.x == s2.x && s1.y == s2.y && s1.psi == s2.psi && s1.bumped == s2.bumped && s1.done == s2.done;

## Actions
Since our action is simply the direction the agent chooses to go (i.e. up, down, left, right), we can use a Symbol to represent it. Symbols are special types in Julia that allow for nice represntation of complex data. However, in our case a string could serve the same purpose as the symbol or even and integer, so feel free to use what you're most comfortable with. Note that in this case, we will not define a type for our action, instead we represent it directly with a symbol. So that our action looks like:
```julia
action = :up # can also be :down, :left, :right
```

## MDP
The Needle data container is defined below. It holds all the information we need to define the MDP tuple $$(\mathcal{S}, \mathcal{A}, T, R).$$

In [5]:
# the needle mdp type
type Needle <: MDP{NeedleState, Symbol} # Note that our MDP is parametarized by the state and the action
    size_x::Int64 # x size of the grid
    size_y::Int64 # y size of the grid
    size_psi::Int64 # number of orientation bins
    reward_states::Vector{NeedleState} # target/obstacle states
    reward_values::Vector{Float64} # reward values for those states
    bounds_penalty::Float64 # penalty for bumping the wall
    tprob::Array{Float64} # probability of transitioning to the desired state
    discount_factor::Float64 # disocunt factor
    
end

Before moving on, I want to create a constructor for Needle for convenience. Currently, if I want to create an instance of Needle, I have to pass in all of fields inside the Needle container (size_x, size_y, etc). The function below will return a Needle type with all the fields filled with some default values.

In [6]:
# we use key worded arguments so we can change any of the values we pass in 
function Needle(;sx::Int64 = 10, # size_x
                sy::Int64 = 10, # size_y
                spsi::Int64 = 8, # size_psi
                rs::Vector{NeedleState} = [[NeedleState(9,3,i) for i = 1:8]; [NeedleState(4,6,i) for i = 1:8]], # reward states
                rv::Vector{Float64} = rv = [fill(10.0,8); fill(-10.0,8)], # reward values
                penalty::Float64 = -1.0, # bounds penalty
                tp::Array{Float64} = [0.2, 0.6, 0.15, 0.05], # tprob
                discount_factor::Float64 = 0.9)
    return Needle(sx, sy, spsi, rs, rv, penalty, tp, discount_factor)
end

# we can now create a NeedleState mdp instance like this:
mdp = Needle()
mdp.reward_states # mdp contains all the defualt values from the constructor

16-element Array{NeedleState,1}:
 NeedleState(9,3,1,false,false)
 NeedleState(9,3,2,false,false)
 NeedleState(9,3,3,false,false)
 NeedleState(9,3,4,false,false)
 NeedleState(9,3,5,false,false)
 NeedleState(9,3,6,false,false)
 NeedleState(9,3,7,false,false)
 NeedleState(9,3,8,false,false)
 NeedleState(4,6,1,false,false)
 NeedleState(4,6,2,false,false)
 NeedleState(4,6,3,false,false)
 NeedleState(4,6,4,false,false)
 NeedleState(4,6,5,false,false)
 NeedleState(4,6,6,false,false)
 NeedleState(4,6,7,false,false)
 NeedleState(4,6,8,false,false)

## Spaces
Let's look at how we can define the state and action spaces for our problem.

### State Space
The state space in an MDP represents all the states in the problem. There are two primary functionalities that we want our spaces to support. We want to be able to iterate over the state space (for Value Iteration for example), and sometimes we want to be able to sample form the state space (used in some POMDP solvers). In this notebook, we will only look at iterable state spaces. 

In [7]:
type StateSpace <: AbstractSpace
    states::Vector{NeedleState}
end

Since we can iterate over elements of an array, and our problem is small, we can store all of our states in an array. If your problem is very large (tens of millions of states), it might be worthwhile to create an iterator over your state space. See [this](http://stackoverflow.com/questions/25028539/how-to-implement-an-iterator-in-julia) post on stackoverflow on making simple iterators. 

In [8]:
function POMDPs.states(mdp::Needle)
    s = NeedleState[] # initialize an array of NeedleStates
    # loop over all our states, remeber there are two binary variables:
    # done (d) and bumped(b)
    for d = 0:1, b = 0:1, y = 1:mdp.size_y, x = 1:mdp.size_x, psi = 1:mdp.size_psi
        push!(s, NeedleState(x,y,psi,b,d))
    end
    return StateSpace(s)
end;

Here, the code: ```function POMDPs.states(mdp::GridWorld)``` means that we want to take the function called ```states(...)``` from the POMDPs.jl module and add another method to it. The ```states(...)``` method in POMDPs.jl doesn't know about our GridWorld type. However, now when ```states(...)``` is called with GridWorld it will dispatch the function we defined above! This is the awesome thing about multiple-dispatch, and one of the features that should make working with MDP/POMDPs easier in Julia. 

The solvers that suppor the POMDPs.jl interface know that a function called ```states(...)``` exists in the interface. However, they do not know the behavior of that function for GridWorld. That means in order for the solvers to use this behavior all we have to do is pass an instance of our GridWorld type into the solver. When ```states(...)``` is called in the solver with the GridWorld type, the function above will be called.   

In [9]:
# let's use the constructor for Needle we defined earlier
mdp = Needle()
state_space = states(mdp);
state_space.states[1] # remeber that our state space instance has an array called states in it

NeedleState(1,1,1,false,false)

To finish up we need to define a function that returns the iterator over the state space. Remeber ```states(...)``` returns the state space type, and our iterator is hidden inside of it.

In [10]:
function POMDPs.iterator(space::StateSpace)
    return space.states 
end;

In value iteration, for example, the solver will iterrate over your state space by doing the following:

In [11]:
mdp = Needle()
state_space = states(mdp);
for s in iterator(state_space)
    # value iteration applies the bellman operator to your state s
end

We want to be able to uniformly sample for our state space. A sampling function for doing this is fairly simple to implement if you already have an array of all your states.

In [12]:
function POMDPs.rand(rng::AbstractRNG, space::StateSpace, s::NeedleState)
    sp = space.states[rand(rng, 1:end)]
    copy!(s, sp)
    s
end;

### Action Space

The action space is the set of all actions availiable to the agent. In the grid world problem the action space consists of up, down, left, and right. Let's define the type for our action space.

In [13]:
type ActionSpace <: AbstractSpace
    actions::Vector{Symbol}
end

Let's now write a function called actions that returns our action space.

In [14]:
function POMDPs.actions(mdp::Needle)
#     acts = [:up, :down, :left, :right]
    acts = [:cw, :ccw]
    return ActionSpace(acts)
end;
POMDPs.actions(mdp::Needle, s::NeedleState, as::ActionSpace=actions(mdp)) = as;

Finally, let's make the iterator function for our action space.

In [15]:
function POMDPs.iterator(space::ActionSpace)
    return space.actions 
end;

Let's add a function to sample from the action space.

In [16]:
function POMDPs.rand(rng::AbstractRNG, space::ActionSpace, a::Symbol)
    return space.actions[rand(rng, 1:end)]
end;
function POMDPs.rand(rng::AbstractRNG, space::ActionSpace)
#     a = NeedleAction(:left)
    a = NeedleAction(:cw)
    return rand(rng, space, a)
end;

To finish up, we add a couple of initializer functions for out state and action.

In [17]:
POMDPs.create_state(mdp::Needle) = NeedleState(1,1,1)
POMDPs.create_action(mdp::Needle) = :cw;

Now that we've defined our state and action spaces, we are half-way thorugh our MDP tuple:
$$
(\mathcal{S}, \mathcal{A}, T, R)
$$

## Distributions

Since MDPs are probabilistic models, we have to deal with probability distributions. In this section, we outline how to define probability distriubtions, and what tools are availiable to help you with the task.

### Transition Distribution 

If you are familiar with MDPs, you know that the transition function $T(s' \mid s, a)$ captures the dynamics of the system. Specifically, $T(s' \mid s, a)$ is a real value that defines the probabiltiy of transitioning to state $s'$ given that you took action $a$ in state $s$. The transition distirubtion $T(\cdot \mid s, a)$ is a slightly different construct. This is the actual distribution over the states that our agent can reach given that its in state $s$ and took action $a$. In other words this is the distribution over $s'$. 

There are many ways to implement transition distributions for your problem. Your choice of distribution as well as how you implement it will heavily depend on your problem. [Distributions.jl](https://github.com/JuliaStats/Distributions.jl) provides support for many common univariate and multivarite distributions. Below is how we implement the one for grid world. 

In [18]:
type NeedleDistribution <: AbstractDistribution
    neighbors::Array{NeedleState} # the states s' in the distribution
    probs::Array{Float64} # the probability corresponding to each state s'
    cat::Categorical # this comes from Distributions.jl and is used for sampling
end

To reduce memory allocation, the POMDPs.jl interface defines some initalization functions that return initial types to be filled lated. This function returns the distribution type filled with some values. We don't care what the distribution container has in it, because it will be modified at each call to the transition function.

In [19]:
function POMDPs.create_transition_distribution(mdp::Needle)
    # can have at most five neighbors in grid world
    neighbors =  [NeedleState(i,i,1) for i = 1:5]
    probabilities = zeros(5) + 1.0/5.0
    cat = Categorical(5)
#     neighbors =  [NeedleState(i,j,1) for i = 1:2, j = 1:2] # 2 neighbxoring nodes, 2 neighboring orientations
#     probabilities = zeros(4) + 1.0/4.0
#     cat = Categorical(4)
    return NeedleDistribution(neighbors, probabilities, cat)
end;

The next function we want is ```domain(...)```. For discrete state distributions, domain returns an iterator over the states in that distributions (this is just the neighbors array in our distriubtion type). The function takes on the following form:

In [20]:
function POMDPs.iterator(d::NeedleDistribution)
    return d.neighbors
end;

Let's implement the probability density function (really this is a probability mass function since the distriubtion is discrete, but we overload the pdf function name to serve as both). Below is a fairly inneficient impelemntation of pdf. For the discrete distribution in our problem, the pdf function returns the probability of drawing the state s from the distribution d. 

In [21]:
function POMDPs.pdf(d::NeedleDistribution, s::NeedleState)
    for (i, sp) in enumerate(d.neighbors)
        if s == sp
            return d.probs[i]
        end
    end   
    return 0.0
end;

Finally, we want to implement a smapling function that can draw samples from our distribution. Once again, there are many ways to do this, but we recommend using Distributions.jl. We use POMDPDistributions which mimicks a lot of the behavior of Distributions.jl. 

In [22]:
function POMDPs.rand(rng::AbstractRNG, d::NeedleDistribution, s::NeedleState)
    d.cat = Categorical(d.probs) # init the categorical distribution
    ns = d.neighbors[rand(d.cat)] # sample a neighbor state according to the distribution c
    copy!(s, ns)
    return s # return the pointer to s
end;

There are a few things that might look scary in the rand! function. Don't worry they aren't!

First the ! notation. By Julia convention, when we write a function that changes its imput arguments, we end that function name with a !. This signature tells anyone calling the function that their input arguments will change. In our function we sample a state from our distribution d, and change s to be that state.

Second, one of the arguments into rand! is AbstractRNG. What in the world is that? This is a special type in Julia that sets the internal seed of the random number generator. Let's take a look at an example.

Below we initialize a grid world MDP, a transition distribution and a state. We also use the MeresenneTwister type as our AbstractRNG. MeresenneTwister is a Julia type used for pseudo random number generation. 

In [23]:
mdp = Needle()
# the function below initializes our distriubtion d to have the states at:
# (1,1,1) (2,2,1) (3,3,1) (4,4,1) (5,5,1)
# we should expect to sample only these states from d
d = create_transition_distribution(mdp) 
s = NeedleState(1,1,1)
rng = MersenneTwister(1) # this is an rng type in Julia

for i = 1:5
    rand(rng, d, s)
    println(s)
end

NeedleState(2,2,1,false,false)
NeedleState(4,4,1,false,false)
NeedleState(4,4,1,false,false)
NeedleState(2,2,1,false,false)
NeedleState(1,1,1,false,false)


To recap, there are two functionalities that we require your distirbutions to support. We want to be able to sample from them using the ```rand!(...)``` function, and we want to obtain the probability density using the ```pdf(...)``` function.

## Transition Model

In this section we will define the system dynamics of the gird world MDP. In POMDPs.jl, we work with transition distirbution functions $T(s' \mid s, a)$, so we want to write a function that can generate the transition distributions over $s'$ for us given an $(s, a)$ pair. 

In grid world, the dynamics of the system are fairly simple. We move in the specified direction with some pre-defined probability. This is the tprob parameter in our GridWorld MDP (it is set to 0.7 in the DMU book example).  If we bump against a wall, we recieve a penalty. If we get to state with a positive reward, we've reached a terminal state and can no longer accumulate reward.

In the transition function we want to fill the neighbors in our distribution d with the reachable states from the state, action pair. We want to fill the probs in our distirbution d with the probabilities of reaching that neighbor. 

In [24]:
# transition helpers
function inbounds(mdp::Needle,x::Int64,y::Int64,psi::Int64)
    if 1 <= x <= mdp.size_x && 1 <= y <= mdp.size_y && 1 <= psi <= mdp.size_psi
        return true
    else
        return false
    end
end

function inbounds(mdp::Needle,state::NeedleState)
    x = state.x #point x of state
    y = state.y
    psi = state.psi
    return inbounds(mdp, x, y, psi)
end

function fill_probability!(p::Vector{Float64}, val::Float64, index::Int64)
    for i = 1:length(p)
        if i == index
            p[i] = val
        else
            p[i] = 0.0
        end
    end
end;


In [25]:
function POMDPs.transition(mdp::Needle,
                            state::NeedleState,
                            action::Symbol,
                            d::NeedleDistribution=create_transition_distribution(mdp))
    tp = mdp.tprob
    
    a = action
    x = state.x
    y = state.y
    psi = state.psi
    
    neighbors = d.neighbors
    probability = d.probs
    
    # let's handle the done case first
    if state.done
        # can only transition to the same done state
        fill!(probability, 0.0)
        probability[1] = 1.0
        copy!(neighbors[1], state)
        # when we sample d, we will only get the state in neighbors[1] - our done state
        return d
    end
    
    fill!(probability, 0.0)

    if a == :ccw
        if psi == 1
            neighbors[1].x = x+1; neighbors[1].y = y;   neighbors[1].psi = psi; 
            neighbors[2].x = x+1; neighbors[2].y = y;   neighbors[2].psi = psi+1;
            neighbors[3].x = x+1; neighbors[3].y = y+1; neighbors[3].psi = psi+1;
            neighbors[4].x = x+1; neighbors[4].y = y+1; neighbors[4].psi = psi+2;
        elseif psi == 2
            neighbors[1].x = x+1; neighbors[1].y = y+1; neighbors[1].psi = psi; 
            neighbors[2].x = x+1; neighbors[2].y = y+1; neighbors[2].psi = psi+1;
            neighbors[3].x = x;   neighbors[3].y = y+1; neighbors[3].psi = psi+1;
            neighbors[4].x = x;   neighbors[4].y = y+1; neighbors[4].psi = psi+2;
        elseif psi == 3
            neighbors[1].x = x;   neighbors[1].y = y+1; neighbors[1].psi = psi; 
            neighbors[2].x = x;   neighbors[2].y = y+1; neighbors[2].psi = psi+1;
            neighbors[3].x = x-1; neighbors[3].y = y+1; neighbors[3].psi = psi+1;
            neighbors[4].x = x-1; neighbors[4].y = y+1; neighbors[4].psi = psi+2;
        elseif psi == 4
            neighbors[1].x = x-1; neighbors[1].y = y+1; neighbors[1].psi = psi; 
            neighbors[2].x = x-1; neighbors[2].y = y+1; neighbors[2].psi = psi+1;
            neighbors[3].x = x-1; neighbors[3].y = y;   neighbors[3].psi = psi+1;
            neighbors[4].x = x-1; neighbors[4].y = y;   neighbors[4].psi = psi+2;
        elseif psi == 5
            neighbors[1].x = x-1; neighbors[1].y = y;   neighbors[1].psi = psi; 
            neighbors[2].x = x-1; neighbors[2].y = y;   neighbors[2].psi = psi+1;
            neighbors[3].x = x-1; neighbors[3].y = y-1; neighbors[3].psi = psi+1;
            neighbors[4].x = x-1; neighbors[4].y = y-1; neighbors[4].psi = psi+2;
        elseif psi == 6
            neighbors[1].x = x-1; neighbors[1].y = y-1; neighbors[1].psi = psi; 
            neighbors[2].x = x-1; neighbors[2].y = y-1; neighbors[2].psi = psi+1;
            neighbors[3].x = x;   neighbors[3].y = y-1; neighbors[3].psi = psi+1;
            neighbors[4].x = x;   neighbors[4].y = y-1; neighbors[4].psi = psi+2;
        elseif psi == 7
            neighbors[1].x = x;   neighbors[1].y = y-1; neighbors[1].psi = psi; 
            neighbors[2].x = x;   neighbors[2].y = y-1; neighbors[2].psi = psi+1;
            neighbors[3].x = x+1; neighbors[3].y = y-1; neighbors[3].psi = psi+1;
            neighbors[4].x = x+1; neighbors[4].y = y-1; neighbors[4].psi = psi+2;
        elseif psi == 8
            neighbors[1].x = x+1; neighbors[1].y = y-1; neighbors[1].psi = psi; 
            neighbors[2].x = x+1; neighbors[2].y = y-1; neighbors[2].psi = psi+1;
            neighbors[3].x = x+1; neighbors[3].y = y;   neighbors[3].psi = psi+1;
            neighbors[4].x = x+1; neighbors[4].y = y;   neighbors[4].psi = psi+2;
        end
    elseif a == :cw
        if psi == 1
            neighbors[1].x = x+1; neighbors[1].y = y;   neighbors[1].psi = psi; 
            neighbors[2].x = x+1; neighbors[2].y = y;   neighbors[2].psi = psi+7;
            neighbors[3].x = x+1; neighbors[3].y = y-1; neighbors[3].psi = psi+7;
            neighbors[4].x = x+1; neighbors[4].y = y-1; neighbors[4].psi = psi+6;
        elseif psi == 2
            neighbors[1].x = x+1; neighbors[1].y = y-1; neighbors[1].psi = psi; 
            neighbors[2].x = x+1; neighbors[2].y = y-1; neighbors[2].psi = psi+7;
            neighbors[3].x = x;   neighbors[3].y = y-1; neighbors[3].psi = psi+7;
            neighbors[4].x = x;   neighbors[4].y = y-1; neighbors[4].psi = psi+6;
        elseif psi == 3
            neighbors[1].x = x;   neighbors[1].y = y-1; neighbors[1].psi = psi; 
            neighbors[2].x = x;   neighbors[2].y = y-1; neighbors[2].psi = psi+7;
            neighbors[3].x = x-1; neighbors[3].y = y-1; neighbors[3].psi = psi+7;
            neighbors[4].x = x-1; neighbors[4].y = y-1; neighbors[4].psi = psi+6;
        elseif psi == 4
            neighbors[1].x = x-1; neighbors[1].y = y-1; neighbors[1].psi = psi; 
            neighbors[2].x = x-1; neighbors[2].y = y-1; neighbors[2].psi = psi+7;
            neighbors[3].x = x-1; neighbors[3].y = y;   neighbors[3].psi = psi+7;
            neighbors[4].x = x-1; neighbors[4].y = y;   neighbors[4].psi = psi+6;
        elseif psi == 5
            neighbors[1].x = x-1; neighbors[1].y = y;   neighbors[1].psi = psi; 
            neighbors[2].x = x-1; neighbors[2].y = y;   neighbors[2].psi = psi+7;
            neighbors[3].x = x-1; neighbors[3].y = y+1; neighbors[3].psi = psi+7;
            neighbors[4].x = x-1; neighbors[4].y = y+1; neighbors[4].psi = psi+6;
        elseif psi == 6
            neighbors[1].x = x-1; neighbors[1].y = y+1; neighbors[1].psi = psi; 
            neighbors[2].x = x-1; neighbors[2].y = y+1; neighbors[2].psi = psi+7;
            neighbors[3].x = x;   neighbors[3].y = y+1; neighbors[3].psi = psi+7;
            neighbors[4].x = x;   neighbors[4].y = y+1; neighbors[4].psi = psi+6;
        elseif psi == 7
            neighbors[1].x = x;   neighbors[1].y = y+1; neighbors[1].psi = psi; 
            neighbors[2].x = x;   neighbors[2].y = y+1; neighbors[2].psi = psi+7;
            neighbors[3].x = x+1; neighbors[3].y = y+1; neighbors[3].psi = psi+7;
            neighbors[4].x = x+1; neighbors[4].y = y+1; neighbors[4].psi = psi+6;
        elseif psi == 8
            neighbors[1].x = x+1; neighbors[1].y = y+1; neighbors[1].psi = psi; 
            neighbors[2].x = x+1; neighbors[2].y = y+1; neighbors[2].psi = psi+7;
            neighbors[3].x = x+1; neighbors[3].y = y;   neighbors[3].psi = psi+7;
            neighbors[4].x = x+1; neighbors[4].y = y;   neighbors[4].psi = psi+6;
        end
    end
    neighbors[5].x = x; neighbors[5].y = y; neighbors[5].psi = psi;
    
    for i = 1:5 neighbors[i].bumped = false end
    for i = 1:5 neighbors[i].done = false end
    reward_states = mdp.reward_states
    reward_values = mdp.reward_values
    n = length(reward_states)
    for i = 1:n
        # if state == reward_states[i] || states are at bound and facing outward
        if posequal(state, reward_states[i]) # && reward_values[i] > 0.0
            fill_probability!(probability, 1.0, 5)
            neighbors[5].done = true
            return d
        end
    end
    
    if !inbounds(mdp, neighbors[1]) || !inbounds(mdp, neighbors[2]) ||
        !inbounds(mdp, neighbors[3]) || !inbounds(mdp, neighbors[4])
        fill_probability!(probability, 1.0, 5)
        neighbors[5].bumped = true
    else
        copy!(probability[1:4],tp)
    end
    d
end;

In [26]:
copy!([1,2],[2,3])

2-element Array{Int64,1}:
 2
 3

## Reward Model
The reward model $R(s,a,s')$ is a function that returns the reward of being in state $s$, taking an action $a$ from that state, and ending up in state $s'$. In our problem, we are rewarded for reaching a terimanl reward state (this could be positive or negative), and we are penalized for bumping into a wall.

In [27]:
function POMDPs.reward(mdp::Needle, state::NeedleState, action::Symbol, statep::NeedleState) #deleted action
    if state.done
        return 0.0
    end
    r = 0.0
    reward_states = mdp.reward_states
    reward_values = mdp.reward_values
    n = length(reward_states)
    for i = 1:n
        if posequal(state, reward_states[i])
            r += reward_values[i]
        end
    end
    if state.bumped
        r += mdp.bounds_penalty
    end
    return r
end;


## Miscellaneous Functions
We are almost done! Just a few simple functions left. First let's implement two functions that return the sizes of our state and action spaces.

In [28]:
POMDPs.n_states(mdp::Needle) = 4*mdp.size_x*mdp.size_y*mdp.size_psi
POMDPs.n_actions(mdp::Needle) = 4;

In [29]:
POMDPs.n_states(mdp::Needle)

3200

Now, we implement the discount function.

In [30]:
POMDPs.discount(mdp::Needle) = mdp.discount_factor;

The last thing we need is an indexing function. This allows us to index between the discrete utility array and the states in our problem. We will use the ```sub2ind()``` function from Julia base to help us here. 

In [31]:
function POMDPs.state_index(mdp::Needle, state::NeedleState)
    sb = Int(state.bumped + 1)
    sd = Int(state.done + 1)
    return sub2ind((mdp.size_x, mdp.size_y, mdp.size_psi, 2, 2), state.x, state.y, state.psi, sb, sd)
end;

Finally let's define a function that checks if a state is terminal.

In [32]:
function POMDPs.isterminal(mdp::Needle, s::NeedleState)
    s.done ? (return true) : (return false)
end;

## Value Iteration Solver

Value iteration is a dynamic porgramming apporach for solving MDPs. See the [wikipedia](https://en.wikipedia.org/wiki/Markov_decision_process#Value_iteration) article for a brief explanation. The solver can be found [here](https://github.com/JuliaPOMDP/DiscreteValueIteration.jl). If you haven't isntalled the solver yet, you can run the following from the Julia REPL
```julia
using POMDPs
POMDPs.add("DiscreteValueIteration")
```
to download the module.

Each POMDPs.jl solver provides two data types for you to interface with. The first is the Solver type which contains solver parameters. The second is the Policy type. Let's see hwo we can use them to get an optimal action at a given state.

In [33]:
# first let's load the value iteration module
using DiscreteValueIteration

# initialize the problem
mdp = Needle()

# initialize the solver
# max_iterations: maximum number of iterations value iteration runs for (default is 100)
# belres: the value of Bellman residual used in the solver (defualt is 1e-3)
solver = ValueIterationSolver(max_iterations=100, belres=1e-3)

# initialize the policy by passing in your problem
policy = ValueIterationPolicy(mdp) 

# solve for an optimal policy
# if verbose=false, the text output will be supressed (false by default)
solve(solver, mdp, policy, verbose=true);

Iteration : 1, residual: 11.0, iteration run-time: 0.301121301, total run-time: 0.301121301
Iteration : 2, residual: 9.9, iteration run-time: 0.27606109, total run-time: 0.577182391
Iteration : 3, residual: 6.39, iteration run-time: 0.269754348, total run-time: 0.846936739
Iteration : 4, residual: 4.851, iteration run-time: 0.275431166, total run-time: 1.122367905
Iteration : 5, residual: 1.097379, iteration run-time: 0.270462174, total run-time: 1.3928300789999999
Iteration : 6, residual: 0.5904900000000004, iteration run-time: 0.27678905, total run-time: 1.669619129
Iteration : 7, residual: 0.531441, iteration run-time: 0.265774083, total run-time: 1.935393212
Iteration : 8, residual: 0.47829690000000014, iteration run-time: 0.273771467, total run-time: 2.2091646789999997
Iteration : 9, residual: 0.43046720999999977, iteration run-time: 0.271098308, total run-time: 2.4802629869999997
Iteration : 10, residual: 0.38742048900000015, iteration run-time: 0.275855215, total run-time: 2.756

Now, we can use the policy along with the ```action(...)``` function to get the optimal action in a given state.

In [34]:
# say we are in state (9,2)
s = NeedleState(9,2,1)
a = action(policy, s)

:cw

In [35]:
s = NeedleState(8,3,1)
a = action(policy, s)

:ccw

In [39]:
d.probs

5-element Array{Float64,1}:
 0.2
 0.2
 0.2
 0.2
 0.2

## Monte-Carlo Tree Search Solver
Monte-Carlo Tree Search (MCTS) is another MDP solver. It is an online method that looks for the best action from only the current state by building a search tree. A nice overview of MCTS can be found [here](http://pubs.doc.ic.ac.uk/survey-mcts-methods/survey-mcts-methods.pdf). Run the following command to donwload the module:
```julia
using POMDPs
POMDPs.add("MCTS")
```
Let's quickly run through an example of using the solver:

In [36]:
using MCTS

# initialize the problem
mdp = Needle()

# initialize the solver
# the hyper parameters in MCTS can be tricky to set properly
# n_iterations: the number of iterations that each search runs for
# depth: the depth of the tree (how far away from the current state the algorithm explores)
# exploration constant: this is how much weight to put into exploratory actions. 
# A good rule of thumb is to set the exploration constant to what you expect the upper bound on your average expected reward to be.
solver = MCTSSolver(n_iterations=100, depth=10, exploration_constant=1.0)

# initialize the policy by passing in your problem and the solver
policy = MCTSPolicy(solver, mdp)

# we don't need to call solver for MCTS

# to get the action:
s = NeedleState(9,2,1)
a = action(policy, s)



LoadError: LoadError: ArgumentError: Categorical: the condition isprobvec(p) is not satisfied.
while loading In[36], in expression starting on line 21

Let's simulate the policy.

In [37]:
# we'll use POMDPToolbox for simulation
using POMDPToolbox # if you don't have this module install it by running POMDPs.add("POMDPToolbox")

s = NeedleState(4,1,1) # this is our starting state
hist = HistoryRecorder()

r = simulate(hist, mdp, policy, s)

println("Total discounted reward: $r")

LoadError: LoadError: ArgumentError: Categorical: the condition isprobvec(p) is not satisfied.
while loading In[37], in expression starting on line 7

In [38]:
hist.state_hist # look at the state history

1-element Array{NeedleState,1}:
 NeedleState(4,1,1,false,false)