Write down the Bellman equations for $v_\pi(s)$ for the states (0,0), (0,1) and (2,1) in this Gridworld, using the policy described above. This means plugging in the right numbers and probabilities, not just writing the abstract formulas.

$v_\pi\Big((0,0)\Big) = (0.25)\Big[-1+\gamma v_\pi\Big((0,0)\Big)\Big] + (0.25)\Big[-1+\gamma v_\pi\Big((0,0)\Big)\Big] + (0.25)\Big[0+\gamma v_\pi\Big((0,1)\Big)\Big] + (0.25)\Big[0+\gamma v_\pi\Big((1,0)\Big)\Big]$

$v_\pi\Big((0,0)\Big) = (0.25)\Big[-1+\gamma v_\pi\Big((0,0)\Big) -1+\gamma v_\pi\Big((0,0)\Big) + 0+\gamma v_\pi\Big((0,1)\Big) + 0+\gamma v_\pi\Big((1,0)\Big)\Big]$

$v_\pi\Big((0,0)\Big) = (0.25)\Big[2\gamma v_\pi\Big((0,0)\Big) + \gamma v_\pi\Big((0,1)\Big) + \gamma v_\pi\Big((1,0)\Big) - 2 \Big]$

$v_\pi\Big((0,0)\Big) = \frac{1}{2}\gamma v_\pi\Big((0,0)\Big) + \frac{1}{4}\gamma v_\pi\Big((0,1)\Big) + \frac{1}{4}\gamma v_\pi\Big((1,0)\Big) - \frac{1}{2}$

$v_\pi\Big((0,0)\Big)\Big(1-\gamma\frac{1}{2}\Big) = \frac{1}{4}\gamma v_\pi\Big((0,1)\Big) + \frac{1}{4}\gamma v_\pi\Big((1,0)\Big) - \frac{1}{2}$

$v_\pi\Big((0,0)\Big) = \frac{\frac{1}{4}\gamma v_\pi\Big((0,1)\Big) + \frac{1}{4}\gamma v_\pi\Big((1,0)\Big) - \frac{1}{2}}{\Big(1-\gamma\frac{1}{2}\Big)}$

$v_\pi(0,1) = (0.25)\Big[ 40+\gamma v_\pi(4,1) + \gamma v_\pi(4,1) + \gamma v_\pi(4,1) + \gamma v_\pi(4,1) \Big]$

$v_\pi(0,1) = 10 + \frac{\gamma}{4} 4v_\pi(4,1)$

$v_\pi(0,1) = 10 + \gamma v_\pi(4,1)$



$v_\pi(2,1) = (0.25)\Big[ \gamma v_\pi(2,0) + \gamma v_\pi(2,2) + \gamma v_\pi(1,1) + \gamma v_\pi(3,1) \Big]$

$v_\pi(2,1) = \frac{\gamma}{4} v_\pi(2,0) + \frac{\gamma}{4} v_\pi(2,2) + \frac{\gamma}{4} v_\pi(1,1) + \frac{\gamma}{4} v_\pi(3,1) $

Using a solver for systems of linear equations, solve the complete system of state-value Bellman equations to obtain $v_\pi(s)$ for every state $s$.
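
As a sketch of what "the complete system" looks like in matrix form (the symbols $P_\pi$, $r_\pi$ and $I$ are introduced here for illustration and are not part of the exercise text): stack the 25 state values into a vector $v_\pi$, let $P_\pi$ be the $25 \times 25$ matrix of transition probabilities under the equiprobable random policy, and let $r_\pi$ be the vector of expected immediate rewards. The Bellman equations then read

$v_\pi = r_\pi + \gamma P_\pi v_\pi$

$(\gamma P_\pi - I)\, v_\pi = -r_\pi$

Under this reading, the array `X` built in the code plays the role of $\gamma P_\pi - I$ and `Y` plays the role of $-r_\pi$. For state (0,0), for example, the diagonal entry is $\frac{2}{4}\gamma - 1 = -0.55$ for $\gamma = 0.9$, and the right-hand side is $-\frac{1}{4}(-1-1) = 0.5$.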

In [3]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from numpy.linalg import inv

In [4]:
class gridworld_env:
    def __init__(self, x, y):
        
        # The board of the grid world
        self.world = np.arange(0,25).reshape((5,5))
        
        # Current position of the agent on the board
        self.current_position = (x,y)
        
        # Gamma coefficient
        self.gamma = 0.9
        
        # Possible moves
        self.WEST = 0
        self.NORTH = 1
        self.EAST = 2
        self.SOUTH = 3
        self.action_spaces = [self.WEST, self.NORTH, self.EAST, self.SOUTH]
        
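        # Special states A = (0,1) and B = (0,3): every action taken from
        # them moves the agent to A' = (4,1) or B' = (2,3) respectively.
        # (The attribute names say "prime", but they hold A and B themselves.)
        self.aprime = (0,1)
        self.bprime = (0,3)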
    def take_action(self, action):
        '''
        The agent takes an action to move to a new state; the environment
        returns the reward for the action and the new state.
        '''
        
        # Compute the reward and new position of the agent for the action
        reward = 0
        new_position = self.current_position
        if action == self.NORTH:
            new_position = (max(self.current_position[0]-1,0), self.current_position[1])
            if new_position == self.current_position:
                reward = -1
        elif action == self.WEST:
            new_position = (self.current_position[0], max(self.current_position[1]-1,0))
            if new_position == self.current_position:
                reward = -1
        elif action == self.EAST:
            new_position = (self.current_position[0], (self.current_position[1]+1)%5)
            if new_position[1] == 0:
                reward = -1
                new_position = self.current_position
        elif action == self.SOUTH:
            new_position = ((self.current_position[0]+1)%5, self.current_position[1])
            if new_position[0] == 0:
                reward = -1
                new_position = self.current_position
        
        # Special case: if the agent is in A or B, every action yields the
        # special reward and moves it to A' or B', overriding the reward
        # and position computed above
        if self.current_position == self.aprime:
            reward = 10
            new_position = (4,1)
        elif self.current_position == self.bprime:
            reward = 5
            new_position = (2,3)
            
        # Create a new state of the gridworld
        new_state = gridworld_env(new_position[0], new_position[1])
        
        # Return the reward of the action and the new state of the gridworld
        return reward, new_state

In [50]:
# Build the linear system X v = Y from the Bellman equations.
# Row s encodes: sum_a 0.25 * gamma * v(s') - v(s) = -sum_a 0.25 * r
X = np.zeros((25,25))
Y = np.zeros((25,1))

# Available actions
actions_space = [0, 1, 2, 3]

# Gamma coefficient
gamma = 0.9

for s in range(0, 25):
    # State index s corresponds to grid position (row, col) = (s // 5, s % 5)
    env = gridworld_env(s // 5, s % 5)

    # The -v(s) term on the left-hand side
    X[s][s] -= 1

    for action in actions_space:
        reward, new_state = env.take_action(action)
        row, col = new_state.current_position

        # Each action has probability 1/4 under the random policy
        X[s][5*row + col] += gamma / 4
        Y[s] -= reward / 4

In [51]:
print(X)

[[-0.55   0.225  0.     0.     0.     0.225  0.     0.     0.     0.     0.
   0.     0.     0.     0.     0.     0.     0.     0.     0.     0.     0.
   0.     0.     0.   ]
 [ 0.    -1.     0.     0.     0.     0.     0.     0.     0.     0.     0.
   0.     0.     0.     0.     0.     0.     0.     0.     0.     0.     0.9
   0.     0.     0.   ]
 [ 0.     0.225 -0.775  0.225  0.     0.     0.     0.225  0.     0.     0.
   0.     0.     0.     0.     0.     0.     0.     0.     0.     0.     0.
   0.     0.     0.   ]
 [ 0.     0.     0.    -1.     0.     0.     0.     0.     0.     0.     0.
   0.     0.     0.9    0.     0.     0.     0.     0.     0.     0.     0.
   0.     0.     0.   ]
 [ 0.     0.     0.     0.225 -0.55   0.     0.     0.     0.     0.225  0.
   0.     0.     0.     0.     0.     0.     0.     0.     0.     0.     0.
   0.     0.     0.   ]
 [ 0.225  0.     0.     0.     0.    -0.775  0.225  0.     0.     0.     0.225
   0.     0.     0.     0.     0.     0.

In [52]:
print(Y)

[[  0.5 ]
 [-10.  ]
 [  0.25]
 [ -5.  ]
 [  0.5 ]
 [  0.25]
 [  0.  ]
 [  0.  ]
 [  0.  ]
 [  0.25]
 [  0.25]
 [  0.  ]
 [  0.  ]
 [  0.  ]
 [  0.25]
 [  0.25]
 [  0.  ]
 [  0.  ]
 [  0.  ]
 [  0.25]
 [  0.5 ]
 [  0.25]
 [  0.25]
 [  0.25]
 [  0.5 ]]


In [53]:
v_pi = np.linalg.solve(X,Y)
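
As an independent cross-check on the linear solve, the same values can be approximated with iterative policy evaluation, a different technique from the direct solver used above. The sketch below is self-contained and uses only the standard library; the names `step`, `SPECIAL`, `MOVES` and `evaluate_random_policy` are hypothetical helpers written for this illustration, and the dynamics are assumed to match `gridworld_env.take_action` (reward -1 and no movement when leaving the grid, reward 10 and a jump to (4,1) from (0,1), reward 5 and a jump to (2,3) from (0,3)).

```python
GAMMA = 0.9
N = 5

# (row, col) offsets for WEST, NORTH, EAST, SOUTH
MOVES = [(0, -1), (-1, 0), (0, 1), (1, 0)]

# Special states: any action taken there yields (next_state, reward)
SPECIAL = {(0, 1): ((4, 1), 10), (0, 3): ((2, 3), 5)}


def step(state, move):
    """Return (next_state, reward) for one move (assumed gridworld rules)."""
    if state in SPECIAL:
        return SPECIAL[state]
    row, col = state[0] + move[0], state[1] + move[1]
    if 0 <= row < N and 0 <= col < N:
        return (row, col), 0
    # Moving off the grid leaves the agent in place with reward -1
    return state, -1


def evaluate_random_policy(tol=1e-10):
    """Sweep the Bellman update until the largest change falls below tol."""
    v = {(r, c): 0.0 for r in range(N) for c in range(N)}
    while True:
        new_v = {}
        for s in v:
            total = 0.0
            for m in MOVES:
                nxt, reward = step(s, m)
                # Each action has probability 1/4 under the random policy
                total += 0.25 * (reward + GAMMA * v[nxt])
            new_v[s] = total
        delta = max(abs(new_v[s] - v[s]) for s in v)
        v = new_v
        if delta < tol:
            return v


v_iter = evaluate_random_policy()
for r in range(N):
    print(" ".join(f"{v_iter[(r, c)]:8.4f}" for c in range(N)))
```

If the assumed dynamics are right, the printed grid should agree with `v_pi.reshape((5,5))` to within the chosen tolerance.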

Using the solution you obtained in part 5b, print an array whose entry for each state $s$ is the value $v_\pi(s)$ under the policy $\pi$. In other words, this exercise asks you to replicate Figure 3.2 in the book.

In [54]:
print(v_pi.reshape((5,5)))

[[ 3.30899634  8.78929186  4.42761918  5.32236759  1.49217876]
 [ 1.52158807  2.99231786  2.25013995  1.9075717   0.54740271]
 [ 0.05082249  0.73817059  0.67311326  0.35818621 -0.40314114]
 [-0.9735923  -0.43549543 -0.35488227 -0.58560509 -1.18307508]
 [-1.85770055 -1.34523126 -1.22926726 -1.42291815 -1.97517905]]
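
A quick sanity check, sketched under the assumption that $\gamma = 0.9$ and that states are indexed as (row, column): the printed values can be plugged back into the three hand-derived Bellman equations for (0,0), (0,1) and (2,1). The table `V` is copied from the output above, and the helper `v` and the `residuals` dictionary are hypothetical names used only for this illustration. The equation for (0,1) uses the successor state (4,1), matching `new_position = (4,1)` in the code.

```python
GAMMA = 0.9

# Values copied from the printed solution, indexed as V[row][col]
V = [
    [3.30899634, 8.78929186, 4.42761918, 5.32236759, 1.49217876],
    [1.52158807, 2.99231786, 2.25013995, 1.9075717, 0.54740271],
    [0.05082249, 0.73817059, 0.67311326, 0.35818621, -0.40314114],
    [-0.9735923, -0.43549543, -0.35488227, -0.58560509, -1.18307508],
    [-1.85770055, -1.34523126, -1.22926726, -1.42291815, -1.97517905],
]


def v(row, col):
    """Look up the tabulated value of state (row, col)."""
    return V[row][col]


# Right-hand sides of the three hand-derived equations
rhs_00 = 0.5 * GAMMA * v(0, 0) + 0.25 * GAMMA * v(0, 1) + 0.25 * GAMMA * v(1, 0) - 0.5
rhs_01 = 10 + GAMMA * v(4, 1)
rhs_21 = 0.25 * GAMMA * (v(2, 0) + v(2, 2) + v(1, 1) + v(3, 1))

# Difference between each right-hand side and the tabulated value
residuals = {
    (0, 0): rhs_00 - v(0, 0),
    (0, 1): rhs_01 - v(0, 1),
    (2, 1): rhs_21 - v(2, 1),
}

for state, res in residuals.items():
    print(state, f"residual = {res:.2e}")
```

The residuals should be zero up to the rounding of the printed values (eight decimal places), which suggests the hand-written equations and the solved system describe the same dynamics.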
