# Navigating Grid world with DYNA-Q

In this code, I implement the DYNA-Q algorithm on both a simple and a complex grid world.

Unlike the other methods (Monte Carlo, SARSA and Q-Learning), DYNA-Q performs Q-Learning but adds a planning stage between steps.
Planning makes the most of the observations made so far.

During planning, the algorithm iterates through previously observed state-action pairs to update their action-values.

### DYNA-Q Action Value Update
The action-value update is still the same as in the Q-Learning algorithm:
<img src="misc/QLearning_ActionValueUpdate.PNG" width="400"/>
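
For reference, the standard Q-Learning update can be written as

$$Q(S,A) \leftarrow Q(S,A) + \alpha \left[ R + \gamma \max_{a} Q(S',a) - Q(S,A) \right]$$

A minimal Python sketch of this update follows. It is not the code from this repository: the function name `q_update`, the dictionary layout and the step size $\alpha$ = 0.1 are assumptions made for illustration, while $\gamma$ = 0.9 matches the value used in the results.

```python
from collections import defaultdict

ACTIONS = ("up", "down", "left", "right")

def q_update(Q, state, action, reward, next_state,
             alpha=0.1, gamma=0.9, terminal=False):
    """Apply one Q-Learning update to Q[(state, action)] in place."""
    # A terminal next state has no future value to bootstrap from.
    best_next = 0.0 if terminal else max(Q[(next_state, a)] for a in ACTIONS)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q[(state, action)]

# Hypothetical transition: stepping right from (0, 2) into a terminal cell
# that pays a reward of 1 (the reward value is an assumption).
Q = defaultdict(float)
new_value = q_update(Q, (0, 2), "right", 1.0, (0, 3), terminal=True)
```

With all values starting at 0, this single update moves `Q[((0, 2), "right")]` a fraction $\alpha$ of the way towards the target.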


### GRID WORLDS

DYNA-Q is applied to two grid worlds:

The simple grid world is the same one used for Monte Carlo, Q-Learning and SARSA.
Again we implement the 3 x 4 grid world, which contains a WIN state, a LOSE state and a wall.
Each run begins in cell [2,0], and the goal is to reach the WIN state while avoiding the LOSE state.

<img src="misc/simple_Gridworld.png" width="400"/>

There is also a complex grid world, in which the agent starts in cell [2,0] and the WIN state is at [0,8].
This world contains more walls.

<img src="misc/Complex_Gridworld.PNG" width="400"/>

The agent's actions are the same as before:
The agent may move left, right, up or down. If the agent hits a wall, it stays in the state it was just in.
The reward for each step taken is 0. A reward is only provided on reaching the WIN or LOSE state.
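
The movement rules above can be sketched as a single transition function. This is an illustration rather than the repository's code: the cell coordinates of the WIN state, the LOSE state and the wall, the reward values of +1 and -1, and the choice to treat the grid edge like a wall are all assumptions, since the text only fixes the 3 x 4 size and the start cell [2,0].

```python
# (row, column) offsets for the four moves
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action, n_rows=3, n_cols=4,
         walls=frozenset({(1, 1)}),   # assumed wall position
         win=(0, 3), lose=(1, 3)):    # assumed terminal positions
    """Return (next_state, reward, done) for one deterministic move."""
    d_row, d_col = ACTIONS[action]
    candidate = (state[0] + d_row, state[1] + d_col)
    off_grid = not (0 <= candidate[0] < n_rows and 0 <= candidate[1] < n_cols)
    if off_grid or candidate in walls:
        candidate = state             # blocked: stay where we were
    if candidate == win:
        return candidate, 1.0, True   # assumed WIN reward
    if candidate == lose:
        return candidate, -1.0, True  # assumed LOSE reward
    return candidate, 0.0, False      # every ordinary step has reward 0
```

For example, `step((2, 0), "left")` leaves the agent in `(2, 0)` with reward 0, because the move would leave the grid.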



Initially, the policy is random. After each episode, the policy is updated to select, for each state, the action with the highest action-value. There is, however, a probability $\epsilon$ = 0.1 of selecting a random action instead (to allow exploration).
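
As a rough sketch of this selection rule, the function below picks a random action with probability $\epsilon$ and otherwise the action with the highest action-value. The name `epsilon_greedy`, the dictionary keyed by `(state, action)` and the random tie-breaking are assumptions for illustration. The sketch also reads the action-values directly at every step, which is a simplification of updating a stored policy after each episode.

```python
import random

ACTIONS = ("up", "down", "left", "right")

def epsilon_greedy(Q, state, epsilon=0.1, rng=random):
    """Pick an action for `state` from action-values Q[(state, action)]."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)            # explore
    # Unseen state-action pairs are treated as having value 0.
    best = max(Q.get((state, a), 0.0) for a in ACTIONS)
    ties = [a for a in ACTIONS if Q.get((state, a), 0.0) == best]
    return rng.choice(ties)                   # exploit, breaking ties randomly

# Hypothetical action-values for the start cell
Q = {((2, 0), "up"): 0.5}
greedy_choice = epsilon_greedy(Q, (2, 0), epsilon=0.0)
```

Breaking ties at random matters early on, when every action-value is still 0 and a fixed tie-break would always push the agent in the same direction.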

Below is the pseudocode for the method. 
Policy improvement is via the $\epsilon$-soft method shown in the Monte-Carlo article.

<img src="misc/TabularDYNAQ_Psuedocode.PNG" width="700"/>

A dictionary of the action-values for every state-action pair is maintained and consulted when it is time to update the policy.
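
One possible layout for such a dictionary is sketched below, together with a helper that reads the greedy policy back out of it. The names `init_q_table` and `greedy_policy` and the `(state, action)` key format are hypothetical and may differ from the repository's code.

```python
from itertools import product

ACTIONS = ("up", "down", "left", "right")

def init_q_table(n_rows=3, n_cols=4):
    """Create an action-value of 0 for every (state, action) pair."""
    return {((r, c), a): 0.0
            for r, c in product(range(n_rows), range(n_cols))
            for a in ACTIONS}

def greedy_policy(Q):
    """Map each state to the action with the highest action-value."""
    policy = {}
    for (state, action), value in Q.items():
        # Keep the first action seen unless a strictly better one turns up.
        if state not in policy or value > Q[(state, policy[state])]:
            policy[state] = action
    return policy

Q = init_q_table()
Q[((2, 0), "right")] = 0.3        # hypothetical learned value
policy = greedy_policy(Q)
```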

### How it works

DYNA-Q works as follows. Once initialised (a), it first selects an action to take based on the policy (b).
After the action is taken, it observes the reward and the next state (c).

This information is used to apply a Q-Learning update to $Q(S,A)$ (d).

This transition information is also stored in a dictionary that models the environment (e); we assume a deterministic environment.
The stored information says: if I start in state $S$ and take action $A$, I will end up in state $S'$ with reward $R$.
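
A minimal sketch of such a model is a plain dictionary keyed by `(state, action)`. The helper names `record` and `predict` are hypothetical. Because the environment is assumed deterministic, a repeated visit simply overwrites the entry with the same outcome.

```python
def record(model, state, action, reward, next_state):
    """Remember the outcome of taking `action` in `state`."""
    model[(state, action)] = (reward, next_state)

def predict(model, state, action):
    """Return the remembered (reward, next_state), or None if never tried."""
    return model.get((state, action))

model = {}
record(model, (2, 0), "up", 0.0, (1, 0))     # hypothetical observed transition
remembered = predict(model, (2, 0), "up")
unseen = predict(model, (2, 0), "right")
```

Only pairs that were actually visited appear in the model, so it can never be asked about a transition the agent has not experienced.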

After this move in the 'real world', the algorithm uses the model to refine the action-values in a planning stage (much like what we do in our heads). In each planning step, the agent randomly picks a state and an associated action it has seen in the past and, using the model, updates the corresponding action-value. The planning step is repeated $n$ times (f).

If the number of planning steps is set to 0, DYNA-Q reduces to normal Q-Learning.
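
Putting steps (b) to (f) together, the whole loop might look like the sketch below. It is an illustration under stated assumptions, not the code behind the results: the WIN, LOSE and wall positions, the +1 and -1 rewards, the step size `alpha` = 0.1 and all function names are assumed, and planning samples uniformly over stored state-action pairs rather than picking a state first and then one of its actions.

```python
import random
from collections import defaultdict

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
N_ROWS, N_COLS = 3, 4
START, WIN, LOSE = (2, 0), (0, 3), (1, 3)   # WIN and LOSE cells are assumed
WALLS = {(1, 1)}                            # wall cell is assumed

def step(state, action):
    """Deterministic move; blocked moves leave the agent in place."""
    d_row, d_col = ACTIONS[action]
    nxt = (state[0] + d_row, state[1] + d_col)
    if not (0 <= nxt[0] < N_ROWS and 0 <= nxt[1] < N_COLS) or nxt in WALLS:
        nxt = state
    reward = 1.0 if nxt == WIN else -1.0 if nxt == LOSE else 0.0
    return nxt, reward, nxt in (WIN, LOSE)

def choose_action(Q, state, epsilon, rng):
    """Epsilon-greedy selection with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.choice(list(ACTIONS))
    best = max(Q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(state, a)] == best])

def q_update(Q, s, a, r, s2, done, alpha, gamma):
    """One Q-Learning update; terminal states contribute no future value."""
    target = r if done else r + gamma * max(Q[(s2, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def dyna_q(n_planning, episodes=50, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    model = {}
    steps_per_episode = []
    for _ in range(episodes):
        state, done, steps = START, False, 0
        while not done:
            action = choose_action(Q, state, epsilon, rng)               # (b)
            nxt, reward, done = step(state, action)                      # (c)
            q_update(Q, state, action, reward, nxt, done, alpha, gamma)  # (d)
            model[(state, action)] = (reward, nxt, done)                 # (e)
            for _ in range(n_planning):                                  # (f)
                (ps, pa), (pr, pn, pd) = rng.choice(list(model.items()))
                q_update(Q, ps, pa, pr, pn, pd, alpha, gamma)
            state, steps = nxt, steps + 1
        steps_per_episode.append(steps)   # steps until WIN or LOSE was reached
    return Q, steps_per_episode

# n_planning=0 skips the inner loop entirely, i.e. plain Q-Learning.
q_learning_steps = dyna_q(n_planning=0)[1]
dyna_steps = dyna_q(n_planning=50)[1]
```

The two lists hold the episode lengths for each setting and can be plotted against the episode number to compare how quickly each variant settles on a short path.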

## Results

My code returns the following directions for each of the states.
The discount factor $\gamma$ was 0.9.

In these experiments, varying numbers of planning steps were tested.

50 planning steps

5 planning steps

0 planning steps (normal Q-Learning)

It can be observed that with 50 planning steps, the algorithm is able to home in on an optimal path after a few episodes.

Looking at the number of steps required to reach the WIN state over the number of trials, the following is observed:

### Simple Grid World
<img src="misc/DYNA_Q_Simple_performance.png" width="400"/>

### Complex Grid World

<img src="misc/DYNA_Q_Complex_performance.png" width="400"/>