# **Value Iteration in GridWorld**

## **Problem Statement:**

Company Robo.ai is building a robot that can traverse its environment unassisted and reach the food counter. Instead of creating their own environment, they plan to use a prebuilt 4x4 grid world. You are a researcher who has to evaluate the policy iteration and value iteration methods for this task. You have decided to go with value iteration.

## **Environment**

This environment has two terminal states, located at:<br>
* Top left corner
* Bottom right corner

<br>
The 4x4 grid looks as follows:<br>
T  o  o  o<br>
o  x  o  o<br>
o  o  o  o<br>
o  o  o  T<br>
Here x marks the position of the agent and T marks the two terminal states.<br>

<br>
The actions allowed are as follows:
* UP = 0
* RIGHT = 1
* DOWN = 2
* LEFT = 3


    Note: If an action would move the agent off the edge of the grid, the agent stays in its current state.

Rewards:
The agent is granted a reward of -1 at each step until it reaches a terminal state.
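
The rules above (edge handling, step reward, terminal corners) can be sketched as a small deterministic transition function. This is a minimal sketch, not the actual `gridworld` module: the names `step`, `SIZE`, `TERMINALS` and `P` are illustrative, states are assumed to be numbered row by row from 0 to 15, and terminal states are assumed to loop onto themselves with reward 0.

```python
UP, RIGHT, DOWN, LEFT = 0, 1, 2, 3
SIZE = 4                            # 4x4 grid
TERMINALS = {0, SIZE * SIZE - 1}    # top-left and bottom-right corners

def step(state, action):
    """Return (next_state, reward, done) for a deterministic move."""
    if state in TERMINALS:
        # Assumption: a terminal state absorbs the agent at no further cost.
        return state, 0.0, True
    row, col = divmod(state, SIZE)
    if action == UP:
        row = max(row - 1, 0)
    elif action == RIGHT:
        col = min(col + 1, SIZE - 1)
    elif action == DOWN:
        row = min(row + 1, SIZE - 1)
    elif action == LEFT:
        col = max(col - 1, 0)
    next_state = row * SIZE + col
    # Every step from a non-terminal state costs -1.
    return next_state, -1.0, next_state in TERMINALS

# Transition table: P[s][a] is a list of (prob, next_state, reward, done) tuples.
P = {s: {a: [(1.0, *step(s, a))] for a in range(4)} for s in range(SIZE * SIZE)}

print(P[5][UP])     # the agent marked x moves up one cell
print(P[3][RIGHT])  # moving off the edge leaves the state unchanged
```

With this numbering, the agent marked x in the grid sits in state 5.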

Environment courtesy: Sutton and Barto's Reinforcement Learning book, chapter 4.


### **Dependencies**
* Discrete
* Gridworld

## **Import libraries and environment**

In [1]:
import numpy as nump
import sys
from gridworld import GridworldEnv

In [2]:
environment = GridworldEnv()



## **Value Iteration**

Arguments:

* environment = Gridworld environment exposing the attributes below
* environment.P = Transition probabilities
* environment.P[s][a] = List of transition tuples (prob, next_state, reward, done)
* environment.nS = Number of states
* environment.nA = Number of actions
* theta = Convergence threshold; iteration stops once the value function changes by less than theta in every state
* discount_factor = Gamma discount factor

Returns:

* Tuple (policy, Val_function): the optimal policy as an [S, A] shaped matrix and the optimal value function as a vector of length environment.nS

The helper one_step_lookahead returns a vector of length environment.nA that contains the expected value of each action in the given state.
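
The update applied to each state can be written as the Bellman optimality backup. The notation here is an assumption chosen for illustration: $P(s' \mid s, a)$ stands for `prob`, $R(s, a, s')$ for `reward`, and $\gamma$ for `discount_factor`.

$$V(s) \leftarrow \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma\, V(s')\bigr]$$

A sweep over all states is repeated until the largest change in any state falls below the threshold:

$$\Delta = \max_{s} \bigl|V_{\text{new}}(s) - V_{\text{old}}(s)\bigr| < \theta$$

The deterministic policy is then read off greedily from the converged values:

$$\pi(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma\, V(s')\bigr]$$

Because the implementation overwrites `Val_function[s]` during the sweep, later states in the same sweep already see the updated values of earlier states.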
        

In [3]:
# Value iteration: takes the environment, a convergence threshold theta, and a discount factor
def value_iteration(environment, theta=0.0001, discount_factor=1.0):
  
    # One-step lookahead: computes the expected value of every action in the given state
    def one_step_lookahead(state, Val_function):
        
        A = nump.zeros(environment.nA)
        for a in range(environment.nA):
            for prob, next_state, reward, done in environment.P[state][a]:
                A[a] += prob * (reward + discount_factor * Val_function[next_state])
        return A

    Val_function = nump.zeros(environment.nS)
    while True:
        # Largest value change seen in this sweep (drives the stopping condition)
        delta = 0
        # Update each state...
        for s in range(environment.nS):
            # Finding the best action with one-step lookahead
            A = one_step_lookahead(s, Val_function)
            best_action_value = nump.max(A)
            # Track the largest change across the states swept so far
            delta = max(delta, nump.abs(best_action_value - Val_function[s]))
            # Value function update
            Val_function[s] = best_action_value        
        # Stop once the largest change in this sweep falls below the threshold
        if delta < theta:
            break
    
    # Using the optimal value function for creating a deterministic policy
    policy = nump.zeros([environment.nS, environment.nA])
    for s in range(environment.nS):
        # Finding the best action for current state using one-step lookahead
        A = one_step_lookahead(s, Val_function)
        best_action = nump.argmax(A)
        # Taking the best action
        policy[s, best_action] = 1.0
    
    return policy, Val_function
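
The policy extraction step relies on `nump.argmax`, which returns the first index when several actions share the maximum value, so ties are resolved toward the lower action number. The sketch below illustrates this with the action values of the top-right corner (state 3) once the values have converged: UP and RIGHT bump into the edge and stay in a state worth -3, while DOWN and LEFT reach states worth -2. The variable names are illustrative only.

```python
import numpy as nump

# Action values for state 3 in the order UP, RIGHT, DOWN, LEFT:
# each is the step reward (-1) plus the value of the resulting state.
action_values = nump.array([-1 + -3, -1 + -3, -1 + -2, -1 + -2], dtype=float)

# argmax keeps only the first maximiser, which is what the policy stores.
first_best = nump.argmax(action_values)

# Listing every maximiser shows which actions are equally good.
all_best = nump.flatnonzero(nump.isclose(action_values, action_values.max()))

print(first_best)  # index of the action the deterministic policy picks
print(all_best)    # indices of all optimal actions in this state
```

A deterministic policy built this way is still optimal; it simply reports one of possibly several optimal actions per state.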

In [5]:
policy, Val_function = value_iteration(environment)

print("Policy Probability Distribution:")
print(policy)
print("")

print("Reshaped Grid Policy (0=up, 1=right, 2=down, 3=left):")
print(nump.reshape(nump.argmax(policy, axis=1), environment.shape))
print("")

print("Value Function:")
print(Val_function)
print("")

print("Reshaped Grid Value Function:")
print(Val_function.reshape(environment.shape))
print("")

Policy Probability Distribution:
[[1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]]

Reshaped Grid Policy (0=up, 1=right, 2=down, 3=left):
[[0 3 3 2]
 [0 0 0 2]
 [0 0 1 2]
 [0 1 1 0]]

Value Function:
[ 0. -1. -2. -3. -1. -2. -3. -2. -2. -3. -2. -1. -3. -2. -1.  0.]

Reshaped Grid Value Function:
[[ 0. -1. -2. -3.]
 [-1. -2. -3. -2.]
 [-2. -3. -2. -1.]
 [-3. -2. -1.  0.]]
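
As a sanity check on these values: with a reward of -1 per step, no discounting, and deterministic moves, the optimal value of a cell should equal minus the number of steps to the nearest terminal corner. The sketch below computes that quantity directly; it is an independent check under those assumptions, not part of the value iteration code, and the variable names are illustrative.

```python
import numpy as nump

shape = (4, 4)
rows, cols = nump.indices(shape)

# Manhattan distance from every cell to each terminal corner.
dist_top_left = rows + cols
dist_bottom_right = (shape[0] - 1 - rows) + (shape[1] - 1 - cols)

# Optimal value = minus the number of steps to the closer terminal.
expected = (-nump.minimum(dist_top_left, dist_bottom_right)).astype(float)

print(expected)
```

Comparing `expected` with `Val_function.reshape(environment.shape)` using `nump.allclose` gives a quick regression test for the implementation.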

