### Lab: Value Iteration in a Grid World

### University of Virginia
### Reinforcement Learning
#### Last updated: May 26, 2025

---

#### Instructions:

Implement value iteration for a $4 \times 3$ gridworld environment to compute the value of each state. A robot in this world can make discrete moves: one step up, down, left, or right. These actions are deterministic: the selected action is taken with probability 1. There is a terminal state with reward +1 in the bottom-right corner; all other states have reward 0. The discount factor is $\gamma = 0.9$. Use tolerance $\theta = 0.01$. Show all code and results.

**Note**: Do not use the `networkx`, `gym`, or `gymnasium` libraries when solving this problem.
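
For reference, one sweep of value iteration under deterministic transitions can be written as follows. This is a sketch that assumes the convention used in the solution code further down, which the instructions themselves do not prescribe: the reward $R(s')$ is collected on entering the next state $s'$, a move that would leave the grid keeps the agent in place, and the terminal state's value is held at 0.

$$
V_{k+1}(s) = \max_{a \in \{\text{up},\,\text{down},\,\text{left},\,\text{right}\}} \bigl[ R(s') + \gamma\, V_k(s') \bigr],
\qquad s' = \text{next}(s, a)
$$

$$
\Delta = \max_{s} \bigl| V_{k+1}(s) - V_k(s) \bigr|, \qquad \text{stop when } \Delta < \theta
$$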

#### Total Points: 12

In [7]:
import numpy as np

rows, cols = 4, 3

terminal_state = (3,2)

def reward(state):
  if state == terminal_state:
    return 1.0
  else:
    return 0.0

actions = {"up":(-1, 0), "down":(1, 0), "left":(0, -1), "right":(0, 1)}

gamma = 0.9
theta = 0.01

V = np.zeros((rows, cols))

def is_terminal(state):
  return state == terminal_state

def get_next_state(state, action):
  row, col = state
  dr, dc = actions[action]
  next_row, next_col = row + dr, col + dc

  # if the move stays in bounds, return the new state
  if 0 <= next_row < rows and 0 <= next_col < cols:
    return (next_row, next_col)

  # if the move would leave the grid, stay in the current state
  return state

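def value_iteration(V):
  i = 0  # number of completed sweeps
  while True:
    delta = 0  # largest value change seen in this sweep
    new_V = np.copy(V)  # synchronous update: read from V, write to new_V
    for row in range(rows):
      for col in range(cols):
        state = (row, col)
        # the terminal state's value is held at 0
        if is_terminal(state):
          new_V[state] = 0.0
          continue
        # Bellman optimality backup over the four deterministic actions
        values = []
        for action in actions:
          next_state = get_next_state(state, action)
          values.append(reward(next_state) + gamma * V[next_state])
        new_V[state] = max(values)
        delta = max(delta, abs(new_V[state] - V[state]))
    V = new_V
    i += 1
    # stop once the largest change in a sweep is below theta
    if delta < theta:
      break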
  print(f'Converged after {i} iterations.')
  print("Value Function:")
  for row in range(rows):
    row_vals = " | ".join(f"{V[row, col]:6.3f}" for col in range(cols))
    print(row_vals)

value_iteration(V)

Converged after 6 iterations.
Value Function:
 0.656 |  0.729 |  0.810
 0.729 |  0.810 |  0.900
 0.810 |  0.900 |  1.000
 0.900 |  1.000 |  0.000
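
As a sanity check on the table above: with deterministic moves, no obstacles, and reward +1 only on entering the terminal state, a state at Manhattan distance $d \geq 1$ from the terminal should have value $\gamma^{d-1}$. The sketch below rebuilds the table from that closed form; the name `expected` is introduced here for illustration and is not part of the lab code.

```python
import numpy as np

rows, cols = 4, 3
terminal_state = (3, 2)
gamma = 0.9

# Closed-form values: gamma ** (d - 1), where d is the Manhattan
# distance to the terminal state; the terminal state itself stays at 0.
expected = np.zeros((rows, cols))
for r in range(rows):
    for c in range(cols):
        d = abs(terminal_state[0] - r) + abs(terminal_state[1] - c)
        if d > 0:
            expected[r, c] = gamma ** (d - 1)

print(np.round(expected, 3))
```

For example, the top-left state is five moves from the terminal, so its value is $0.9^4 \approx 0.656$, matching the first entry of the printed value function.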


---

#### 1) **(POINTS: 2)** As part of your solution, create a GridWorld class with these attributes:

- `nrows` : number of rows in the grid
- `ncols` : number of columns in the grid

and these methods:

- `value_iteration()` : behaves as described in 2) below
- `get_reward()` : given the agent row and column, return the reward

The class may include additional attributes and methods as well.

Create an instance of the class, then access `nrows` and `ncols` and call `get_reward()` to verify correctness.

You will not be graded on the implementation of `value_iteration()` for this problem.

#### 2) **(POINTS: 8)** Here, you will be graded on the implementation of `value_iteration()`.
Call `value_iteration()` to calculate and return the value function array. For each sweep over the states, have the function print out the intermediate array.


#### Enter all code here (you may also use multiple cells)

In [9]:
import numpy as np

class GridWorld:
  def __init__(self, nrows, ncols, terminal_state = (3,2), gamma = 0.9, theta = 0.01):
    self.nrows = nrows
    self.ncols = ncols
    self.terminal_state = terminal_state
    self.gamma = gamma
    self.theta = theta

    self.actions = {
        "up": (-1, 0),
        "down": (1, 0),
        "left": (0, -1),
        "right": (0, 1)
    }

    self.V = np.zeros((nrows, ncols))

  def get_reward(self, state):
    if state == self.terminal_state:
      return 1.0
    return 0.0

  def is_terminal(self, state):
    return state == self.terminal_state

  def get_next_state(self, state, action):
    row, col = state
    dr, dc = self.actions[action]
    nr, nc = row + dr, col + dc
    if 0 <= nr < self.nrows and 0 <= nc < self.ncols:
      return (nr, nc)
    return state

  def value_iteration(self):
    i = 0
    while True:
      delta = 0
      new_V = np.copy(self.V)
      for row in range(self.nrows):
        for col in range(self.ncols):
          state = (row, col)
          if self.is_terminal(state):
            new_V[state] = 0.0
            continue
          values = []
          for action in self.actions:
            next_state = self.get_next_state(state, action)
            values.append(self.get_reward(next_state) + self.gamma * self.V[next_state])
          new_V[state] = max(values)
          delta = max(delta, abs(new_V[state] - self.V[state]))
      self.V = new_V
      i += 1
      if delta < self.theta:
        break
    print(f'Converged after {i} iterations.')
    print("Value Function:")
    for row in range(self.nrows):
      row_vals = " | ".join(f"{self.V[row, col]:6.3f}" for col in range(self.ncols))
      print(row_vals)

#### 1) Create and test the class

In [11]:
env = GridWorld(4, 3)
print("Rows:", env.nrows)
print("Cols:", env.ncols)
print("Reward at terminal (3,2):", env.get_reward((3, 2)))
print("Reward at non-terminal (0,0):", env.get_reward((0, 0)))

Rows: 4
Cols: 3
Reward at terminal (3,2): 1.0
Reward at non-terminal (0,0): 0.0


#### 2) Run value iteration

In [12]:
env.value_iteration()

Converged after 6 iterations.
Value Function:
 0.656 |  0.729 |  0.810
 0.729 |  0.810 |  0.900
 0.810 |  0.900 |  1.000
 0.900 |  1.000 |  0.000
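
Question 2 asks for the intermediate array to be printed after each sweep and for the value function to be returned, whereas the `value_iteration()` method above prints only the final array and returns nothing. A minimal sketch of a variant that does both is shown here. The function name `value_iteration_verbose` and the variable `V_final` are hypothetical, and it is written as a standalone function (not a `GridWorld` method) so it can run on its own; it uses the same reward and boundary conventions as the class.

```python
import numpy as np

def value_iteration_verbose(nrows=4, ncols=3, terminal_state=(3, 2),
                            gamma=0.9, theta=0.01):
    """Hypothetical variant: print V after every sweep and return the final array."""
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
    V = np.zeros((nrows, ncols))
    sweep = 0
    while True:
        delta = 0.0
        new_V = V.copy()  # synchronous update: read V, write new_V
        for row in range(nrows):
            for col in range(ncols):
                if (row, col) == terminal_state:
                    continue  # terminal value stays at 0
                candidates = []
                for dr, dc in moves:
                    nr, nc = row + dr, col + dc
                    # a move that would leave the grid keeps the agent in place
                    if not (0 <= nr < nrows and 0 <= nc < ncols):
                        nr, nc = row, col
                    r = 1.0 if (nr, nc) == terminal_state else 0.0
                    candidates.append(r + gamma * V[nr, nc])
                new_V[row, col] = max(candidates)
                delta = max(delta, abs(new_V[row, col] - V[row, col]))
        V = new_V
        sweep += 1
        print(f"Sweep {sweep} (delta = {delta:.3f}):")
        print(np.round(V, 3))
        if delta < theta:
            return V

V_final = value_iteration_verbose()
```

Each sweep pushes nonzero values one step further from the terminal state, so the printed arrays fill in along diagonals until the top-left corner is reached.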


#### 3) **(POINTS: 2)** Based on the value function: After the agent has moved right or down, does it ever make sense for it to backtrack (move up or left)? Explain your reasoning.
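
One way to explore this question numerically is to list, for every non-terminal state, the actions that are greedy with respect to the value function and see whether "up" or "left" ever appears. The sketch below hardcodes the rounded values printed above; the names `q_value`, `greedy`, and `backtracking` are introduced here for illustration only, and the reward and boundary conventions are assumed to match the solution code.

```python
import numpy as np

# Rounded value function copied from the printed output above.
V = np.array([[0.656, 0.729, 0.810],
              [0.729, 0.810, 0.900],
              [0.810, 0.900, 1.000],
              [0.900, 1.000, 0.000]])
nrows, ncols = V.shape
terminal_state = (3, 2)
gamma = 0.9
actions = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def q_value(state, action):
    """One-step lookahead: reward on entering the next state plus discounted value."""
    row, col = state
    dr, dc = actions[action]
    nr, nc = row + dr, col + dc
    if not (0 <= nr < nrows and 0 <= nc < ncols):
        nr, nc = row, col  # bumping into a wall leaves the agent in place
    r = 1.0 if (nr, nc) == terminal_state else 0.0
    return r + gamma * V[nr, nc]

# Collect every action that ties for the best lookahead value in each state.
greedy = {}
for row in range(nrows):
    for col in range(ncols):
        state = (row, col)
        if state == terminal_state:
            continue
        q = {a: q_value(state, a) for a in actions}
        best = max(q.values())
        greedy[state] = [a for a, v in q.items() if np.isclose(v, best)]

# States where a greedy action moves away from the bottom-right corner.
backtracking = [s for s, acts in greedy.items() if "up" in acts or "left" in acts]
print(greedy)
print("States where up/left is greedy:", backtracking)
```

Under these assumptions the list of backtracking states is empty: values grow monotonically toward the bottom-right corner, so moving up or left leads to a state of lower value and only delays the discounted +1 reward.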