## Value Iteration 

This notebook previews how to find an optimal policy with value iteration. The Agent is on a 4x4 grid and its goal is to reach the terminal state, marked with solid black fill.
![title](images/gridworld.png)

1. The Agent can move in any of four directions (UP=0, RIGHT=1, DOWN=2, LEFT=3).<br>
2. Any action that would take the Agent off the grid leaves it in the same state.<br>
3. The Agent receives a reward of -1 at each step until it reaches the terminal state.<br><br><br>
Let us find a policy that takes the Agent to the terminal state, and compute the corresponding value function, using the value iteration method. We will cover this in detail in subsequent modules; the demo is provided now to illustrate how an RL problem can be solved.
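
The three rules above can be sketched as a transition model in the `(prob, next_state, reward, done)` format that the code below reads from `env.model`. This is only an illustration, not the contents of the `gridWorld` module: `build_model` is a hypothetical helper, states are assumed to be numbered row by row from 0 to 15, and the terminal states are assumed to be the two corner cells (0 and 15), as suggested by the zeros in the value function printed at the end of this notebook.

```python
UP, RIGHT, DOWN, LEFT = 0, 1, 2, 3

def build_model(shape=(4, 4), terminal_states=(0, 15)):
    """Hypothetical helper: build model[state][action] for a deterministic grid."""
    n_rows, n_cols = shape
    moves = {UP: (-1, 0), RIGHT: (0, 1), DOWN: (1, 0), LEFT: (0, -1)}
    model = {}
    for state in range(n_rows * n_cols):
        row, col = divmod(state, n_cols)
        model[state] = {}
        for action, (d_row, d_col) in moves.items():
            if state in terminal_states:
                # Assumption: terminal states are absorbing and give no further reward
                model[state][action] = [(1.0, state, 0.0, True)]
                continue
            # Rule 2: clamping keeps the Agent in place when it would leave the grid
            new_row = min(max(row + d_row, 0), n_rows - 1)
            new_col = min(max(col + d_col, 0), n_cols - 1)
            next_state = new_row * n_cols + new_col
            # Rule 3: every step taken from a non-terminal state costs -1
            done = next_state in terminal_states
            model[state][action] = [(1.0, next_state, -1.0, done)]
    return model

model = build_model()
print(model[1][LEFT])  # moving left from state 1 reaches terminal state 0
print(model[3][UP])    # moving up from the top-right corner stays in state 3
```

Each entry is a list so that stochastic transitions could be expressed as several tuples; in this deterministic sketch every list has exactly one element with probability 1.0.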

In [1]:
import numpy as np
from gridWorld import GridWorld

In [2]:
def value_iteration(env, theta=0.0001, discount_factor=1.0):
    """
    Find the optimal policy and value function with value iteration.

    Arguments:
        env: OpenAI-style env exposing:
            env.numStates is the number of states in the environment
            env.numActions is the number of actions in the environment
            env.model[state][action] is a list of transition tuples (prob, next_state, reward, done)
        theta: We stop iterating once the value function change is less than theta for all states
        discount_factor: Gamma discount factor

    Returns:
        A tuple (policy, value_fn) of the optimal policy and the optimal value function
    """
    # Helper function to calculate the value of every action in a given state
    def compute_value_fn_update(state,value_fn):
        value_fn_update = np.zeros(env.numActions)
        for action in range(env.numActions):
            for prob,next_state,reward,done in env.model[state][action]:
                value_fn_update[action] += prob * (reward + discount_factor * value_fn[next_state])
                
        return value_fn_update 
    
    value_fn = np.zeros(env.numStates)
    while True:
        # Largest value change seen in this sweep (used by the stopping condition)
        delta = 0
        # Update each state
        for state in range(env.numStates):
            # Find the best action
            action_values = compute_value_fn_update(state, value_fn)
            best_action_value = np.max(action_values)
            # Track the largest change across all states seen so far
            delta = max(delta, np.abs(best_action_value - value_fn[state]))
            # Update the value function
            value_fn[state] = best_action_value
        # Stop once no state changed by theta or more
        if delta < theta:
            break
    
    # Create a deterministic policy by using the optimal value function
    policy = np.zeros([env.numStates, env.numActions])
    for state in range(env.numStates):
        # Find the best action for this state
        action_values = compute_value_fn_update(state, value_fn)
        best_action = np.argmax(action_values)
        # Always take the best action
        policy[state, best_action] = 1.0
    
    return policy, value_fn
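
For reference, the update applied to each state inside the `while` loop can be written as the following backup. The notation is an assumption introduced here for illustration: $p(s' \mid s, a)$ is `prob`, $r(s, a, s')$ is `reward`, $\gamma$ is `discount_factor`, and $\theta$ is `theta`.

$$
V(s) \leftarrow \max_{a} \sum_{s'} p(s' \mid s, a)\,\bigl[\, r(s, a, s') + \gamma\, V(s') \,\bigr]
$$

A sweep over all states is repeated until the largest change in a sweep satisfies

$$
\Delta = \max_{s} \bigl|\, V_{\text{new}}(s) - V_{\text{old}}(s) \,\bigr| < \theta ,
$$

after which the policy picks, in each state, the action that attains the maximum:

$$
\pi(s) = \arg\max_{a} \sum_{s'} p(s' \mid s, a)\,\bigl[\, r(s, a, s') + \gamma\, V(s') \,\bigr]
$$

Note that the code updates `value_fn` in place, so states later in a sweep already see the new values of earlier states.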

We will learn about value iteration in the subsequent module, but below you can see that it finds a policy that takes the Agent to the terminal state from any non-terminal state.

In [3]:
env = GridWorld()
policy, value_fn = value_iteration(env)

In [4]:
print("Policy grid (0=up, 1=right, 2=down, 3=left):")
print(np.reshape(np.argmax(policy, axis=1), env.shape))
print("")

Policy grid (0=up, 1=right, 2=down, 3=left):
[[0 3 3 2]
 [0 0 0 2]
 [0 0 1 2]
 [0 1 1 0]]
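
The action codes above can be easier to read as direction symbols. The sketch below copies the printed policy grid by hand and maps each code to a character; `render_policy` and the `T` marker are hypothetical names for illustration, and treating the two corner cells as terminal is an assumption based on the zeros in the value function printed further down.

```python
import numpy as np

# Action codes as used in this notebook: 0=up, 1=right, 2=down, 3=left
SYMBOLS = np.array(["^", ">", "v", "<"])

# Policy grid copied from the output above
policy_grid = np.array([[0, 3, 3, 2],
                        [0, 0, 0, 2],
                        [0, 0, 1, 2],
                        [0, 1, 1, 0]])

def render_policy(policy_grid, terminal_cells=((0, 0), (3, 3))):
    """Hypothetical helper: return the policy as rows of direction symbols."""
    symbols = SYMBOLS[policy_grid]  # fancy indexing returns a new array
    for cell in terminal_cells:
        symbols[cell] = "T"         # assumed terminal cells; the action there is irrelevant
    return "\n".join(" ".join(row) for row in symbols)

print(render_policy(policy_grid))
```

Where several actions are equally good, `np.argmax` returns the first one, so the grid shows one optimal policy rather than all of them.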



We also print the value function for each state. Because the reward is -1 for each step and the discount factor is 1.0, the value of a state is the negative of the number of steps the Agent needs to reach the terminal state from it.

In [5]:
print(" Grid Value Function:")
print(value_fn.reshape(env.shape))
print("")

 Grid Value Function:
[[ 0. -1. -2. -3.]
 [-1. -2. -3. -2.]
 [-2. -3. -2. -1.]
 [-3. -2. -1.  0.]]
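
As a quick sanity check, the printed values can be compared with the negative of the shortest number of moves to a terminal cell. This is a minimal sketch that assumes the terminal cells are the two corners holding the value 0 in the output above; the grid is copied by hand from that output, and `steps_to_goal` is a name introduced here for illustration.

```python
import numpy as np

# Value function copied from the output above
value_grid = np.array([[ 0., -1., -2., -3.],
                       [-1., -2., -3., -2.],
                       [-2., -3., -2., -1.],
                       [-3., -2., -1.,  0.]])

# Assumption: the terminal cells are the two corners with value 0
terminal_cells = [(0, 0), (3, 3)]

# Row and column index of every cell in the grid
rows, cols = np.indices(value_grid.shape)

# Manhattan distance from each cell to its nearest terminal cell
steps_to_goal = np.min(
    [np.abs(rows - r) + np.abs(cols - c) for r, c in terminal_cells],
    axis=0,
)

matches = np.array_equal(value_grid, -steps_to_goal)
print(matches)
```

With only four axis-aligned moves and no obstacles, the shortest path length is the Manhattan distance, which is why it can serve as an independent check here.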

