# Navigating a Grid World with the Monte Carlo Sampling Method

In this code, I implement policy evaluation and policy improvement for a grid world using Monte Carlo sampling.
The grid world is a 3 x 4 grid with one wall cell, a WIN state and a LOSE state.
Each episode begins in cell [2,0], and the goal is to reach the WIN state while avoiding the LOSE state.

The agent may move left, right, up or down. If the agent hits a wall, it stays in the state it was just in.
The reward for each step taken is 0. A non-zero reward is only given on entering the WIN or LOSE state.
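
As a rough illustration of the rules above, here is a minimal sketch of a transition function. The positions of the wall, WIN and LOSE cells, the +1/-1 terminal rewards, and the names `step`, `WALL`, `WIN` and `LOSE` are assumptions made for this example (the cell positions are guessed from the states missing in the results below), not taken from the actual code.

```python
ROWS, COLS = 3, 4
START = (2, 0)
WALL = (1, 1)   # assumed position of the wall cell
WIN = (0, 3)    # assumed position of the WIN state
LOSE = (1, 3)   # assumed position of the LOSE state

# Row/column offsets for each of the four moves.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(state, action):
    """Return (next_state, reward, done) for taking `action` in `state`."""
    d_row, d_col = MOVES[action]
    nxt = (state[0] + d_row, state[1] + d_col)
    off_grid = not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS)
    if off_grid or nxt == WALL:
        nxt = state  # blocked: the agent stays where it was
    if nxt == WIN:
        return nxt, 1.0, True    # assumed reward of +1 for winning
    if nxt == LOSE:
        return nxt, -1.0, True   # assumed reward of -1 for losing
    return nxt, 0.0, False       # every other step has reward 0
```

For example, `step(START, "left")` leaves the agent at `(2, 0)` because the move would leave the grid.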

<img src="misc/simple_Gridworld.png" width="400"/>

This code applies Monte Carlo sampling to determine the best action to take in each state to get from the start to the WIN state.

Initially, the policy is random. After each episode, the policy is updated so that, in each state, it selects the action with the highest action value. There is, however, an $\epsilon$ = 0.1 probability of selecting a random action instead (to allow exploration).

Below is the pseudocode for the method. It also shows how the gain for each state is calculated by working backwards from the end of the episode.

<img src="misc/MonteCarlo_GridWorld_Psuedocode.PNG" width="700"/>

A dictionary of state-action values for all possible states is maintained and consulted whenever the policy is updated.
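
One possible layout for such a dictionary is sketched below. The string keys mirror the `'[0, 0]'` format seen in the results, but the nested structure, the wall position and the name `init_action_values` are assumptions for illustration only.

```python
ACTIONS = ["up", "down", "left", "right"]


def init_action_values(rows=3, cols=4, blocked=((1, 1),)):
    """Build {state: {action: value}} with every value starting at 0.

    `blocked` lists cells the agent can never occupy (assumed wall position).
    Terminal cells are included for simplicity; their entries are never used.
    """
    return {
        str([r, c]): {action: 0.0 for action in ACTIONS}
        for r in range(rows)
        for c in range(cols)
        if (r, c) not in blocked
    }
```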

### How it works

Monte Carlo policy evaluation and improvement works by first completing an episode following the initial policy. In the case of my code, this policy picks each direction at random. During the episode, we keep track of all the actions taken from each state that was visited. In addition, we keep track of the reward received from each transition into a new state.

At the end of an episode, we work backwards to calculate the gain (the discounted sum of the rewards that followed) for each state-action pair.
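
The backward pass can be sketched as follows, using the recursion $G_t = r_{t+1} + \gamma G_{t+1}$ with the gain after the final step taken to be 0. The function name `compute_gains` and the convention that `rewards[t]` is the reward received after the action at step `t` are assumptions for this example.

```python
def compute_gains(rewards, gamma=0.9):
    """Return the discounted gain for every step of one episode.

    rewards[t] is the reward received after the action taken at step t.
    """
    gains = [0.0] * len(rewards)
    gain = 0.0  # nothing is earned after the episode ends
    for t in reversed(range(len(rewards))):
        gain = rewards[t] + gamma * gain
        gains[t] = gain
    return gains
```

With rewards `[0, 0, 1]` and $\gamma$ = 0.9, the last step has gain 1, the step before it 0.9, and the first step roughly 0.81.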

This process is repeated for multiple episodes, and the gain for each state-action pair is averaged over all of them.
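
A running average avoids storing every gain. The sketch below assumes an every-visit update (each occurrence of a state-action pair in an episode counts) and a nested `{state: {action: value}}` layout; both are illustrative choices, and `update_action_values` is a hypothetical name.

```python
def update_action_values(q, counts, visited, gains):
    """Fold one episode's gains into the running averages in `q`.

    visited: list of (state, action) pairs in the order they occurred.
    gains:   the gain computed for each of those pairs.
    counts:  how many gains have been averaged so far for each pair.
    """
    for (state, action), gain in zip(visited, gains):
        counts[state][action] += 1
        # Incremental mean: new_avg = old_avg + (sample - old_avg) / n
        q[state][action] += (gain - q[state][action]) / counts[state][action]
```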

After each episode, the updated state-action values are used to update the policy. The policy for the next episode is the one which, for a given state, takes the action with the highest state-action value.
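
This greedy improvement step might look like the sketch below, again assuming a nested `{state: {action: value}}` dictionary. The name `greedy_policy` is hypothetical, and ties are broken in favour of whichever action comes first in the dictionary.

```python
def greedy_policy(q):
    """Map each state to the action with the highest state-action value."""
    return {state: max(values, key=values.get) for state, values in q.items()}
```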

To ensure exploration, there is an $\epsilon$ probability of not following the policy and instead taking a random action.
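
A minimal sketch of this exploration rule is shown below. The function name `choose_action` and its parameters are assumptions for illustration; `rng` is only there so a seeded `random.Random` can be passed in for reproducible runs.

```python
import random


def choose_action(policy, state, actions, epsilon=0.1, rng=random):
    """Follow the policy, except with probability epsilon act at random."""
    if rng.random() < epsilon:
        return rng.choice(actions)  # explore
    return policy[state]            # exploit the current policy
```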

### Results

My code returns the following direction for each of the states.
These results are from a single run of 1000 episodes.
The discount factor $\gamma$ was 0.9.

['[0, 0]', 'right']
['[0, 1]', 'right']
['[0, 2]', 'right']
['[1, 0]', 'up']
['[1, 2]', 'up']
['[2, 0]', 'up']
['[2, 1]', 'left']
['[2, 2]', 'down']
['[2, 3]', 'left']

<img src="misc/simple_Gridworld_1000_iterations.png" width="350"/>

After 10,000 episodes, the policy is:

['[0, 0]', 'right']
['[0, 1]', 'right']
['[0, 2]', 'right']
['[1, 0]', 'up']
['[1, 2]', 'up']
['[2, 0]', 'up']
['[2, 1]', 'left']
['[2, 2]', 'left']
['[2, 3]', 'left']

<img src="misc/simple_Gridworld_10000_iterations.PNG" width="350"/>