# MDP and Gym Maze Exploration
*Created by Jeff Jewett and Will Solow, 2024*

*Oregon State AIGSA Reading Group*

This Google Colab notebook is made to help guide students through the concept of a Markov Decision Process (MDP) using the example of a grid maze. It also covers setup of a Gymnasium Environment and an implementation of Value Iteration.

## Installations and Imports

In [None]:
"""Install requisite packages"""
!pip install gymnasium[classic-control]
!pip install numpy
!pip install matplotlib

In [None]:
"""Import packages"""
import io
import time

import ipywidgets
from IPython.display import display, clear_output
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

"""Utility functions"""
def image_to_bytes(img):
  if isinstance(img, np.ndarray):
    img = Image.fromarray(img, 'RGB')

  # Convert the PIL image to bytes
  with io.BytesIO() as output:
    img.save(output, format="PNG")
    image_data = output.getvalue()
  return image_data

def matrix_to_heatmap_image(matrix, output_size=(512, 512)):
  norm_factor = max(np.abs(np.nanmax(matrix)), np.abs(np.nanmin(matrix)))
  if norm_factor != 0:
    matrix = matrix / norm_factor
  cmap = plt.get_cmap('coolwarm')
  cmap.set_bad(color='black')
  norm = plt.Normalize(vmin=-1, vmax=1)
  rgb_array = cmap(norm(matrix))[:, :, :3]
  rgb_image = Image.fromarray((rgb_array * 255).astype(np.uint8))

  # Resize the image to the specified output size
  resized_image = rgb_image.resize(output_size, Image.NEAREST)

  return np.array(resized_image)

## Markov Decision Process (MDP) for a Maze

We want to represent a maze environment, where there is an agent that can move in the cardinal directions. There are walls which restrict movement. For a given maze of size WIDTHxHEIGHT, our MDP is defined as
- S: The state space is {0, 1, 2, ..., WIDTH*HEIGHT-1}, representing the row-wise flattened indices of the maze. For example, for a 5x5 maze, state 8 represents row 1, column 3 (since 8 = 1*5 + 3), counting from zero.
- A: The action space is {0, 1, 2, 3}, representing the directions UP, DOWN, LEFT, and RIGHT respectively.
- R(S, A, S) -> ℝ: The reward function takes a state, an action, and the resulting next state, and gives a reward. For our simple problem, every action costs a constant -1, which penalizes long paths. The exception is the goal state: actions taken from the goal give a reward of 0.
- P(S, A, S) -> [0,1]: The transition function gives the probability of starting in a given state, taking a certain action, and ending up in a given next state. Our maze environment is deterministic, so all probabilities are either 0 or 1. For any two distinct non-neighbor states S1 and S2, P(S1, ⋅, S2)=0. But if state S3 is directly to the left of S4 and neither is a wall, then P(S4, LEFT, S3)=1. An action that would move the agent into a wall or off the map leaves it where it is.
- H: The horizon is the maximum number of actions our agent is allowed to take.
- γ (gamma): Gamma is a value in [0,1] which acts as a discount factor to make states distant in the future less important. γ=1 means there is no discounting. γ=0.95 means that rewards lose 5% of their value each step into the future.
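
To make the state indexing and the discount factor concrete, here is a small sketch in plain Python. The standalone helpers `state_to_row_col`, `row_col_to_state`, and `discounted_return`, and the 5x5 size, are chosen for illustration only; they mirror the definitions above and are not part of any library.

```python
WIDTH, HEIGHT = 5, 5  # illustrative maze size (assumption)

def state_to_row_col(state, width=WIDTH):
    """Row-wise flattened index -> (row, col)."""
    return state // width, state % width

def row_col_to_state(row, col, width=WIDTH):
    """(row, col) -> row-wise flattened index."""
    return row * width + col

def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(state_to_row_col(8))                          # (1, 3)
print(row_col_to_state(1, 3))                       # 8
print(discounted_return([-1, -1, -1], gamma=1.0))   # -3.0
print(discounted_return([-1, -1, -1], gamma=0.5))   # -1.75
```

With γ=1 the three step penalties simply add up, while with γ=0.5 each later penalty counts half as much as the one before it.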

In [None]:
class MazeMDP():

  MOVES: dict[int,tuple] = {
         0: np.array([-1, 0]), #UP
         1: np.array([1, 0]),  #DOWN
         2: np.array([0, -1]), #LEFT
         3: np.array([0, 1])   #RIGHT
         }
  UP: int = 0
  DOWN: int = 1
  LEFT: int = 2
  RIGHT: int = 3

  def __init__(self, maze, H=100, gamma=.95):
    """Initialize the MDP"""
    self.maze = np.array(maze)
    assert len(self.maze.shape) == 2, "The maze must be a 2-dimensional array"

    self.obstacles = [tuple(loc) for loc in np.argwhere(self.maze == 1)]
    def get_state_index_from_maze(value):
      locations = [tuple(loc) for loc in np.argwhere(self.maze == value)]
      assert len(locations) == 1, f"There should be exactly one {value} in the maze"
      row, col = locations[0]
      return self.row_col_to_state(row, col)
    self.init_state = get_state_index_from_maze(2)
    self.goal_state = get_state_index_from_maze(3)

    """Create the MDP functions"""
    self.S = set(range(self.maze.shape[0] * self.maze.shape[1]))
    self.A = set(range(len(self.MOVES)))
    self.R = self.build_rewards()
    self.P = self.build_transitions()

    self.H = H
    self.gamma = gamma


  def build_rewards(self):
    """Build reward function"""
    # make every state and action and next state give -1 reward, to penalize long paths
    rewards = np.full((len(self.S), len(self.A), len(self.S)), -1, dtype=float)
    # all actions in the goal state give zero reward
    # this is a requirement for all terminating states
    rewards[self.goal_state, :, :] = 0

    return rewards

  def build_transitions(self):
    """Build transitions function"""
    transitions = np.zeros((len(self.S), len(self.A), len(self.S)))

    # Go through all states and all actions
    for s in self.S:
      for a in self.A:
        # By default, the next state is the current state
        s_p = s

        # Convert state to maze index
        row_col = np.array(self.state_to_row_col(s))

        # Get new potential state
        row_col_new = row_col + self.MOVES[a]

        # Check that new state is valid
        is_in_bounds = 0 <= row_col_new[0] < self.maze.shape[0] and 0 <= row_col_new[1] < self.maze.shape[1]
        is_free_space = tuple(row_col_new) not in self.obstacles

        # Compute the new s' if the transition is valid
        if is_in_bounds and is_free_space:
          s_p = self.row_col_to_state(*row_col_new)

        # Assign transition probability to 1
        transitions[s,a,s_p] = 1

    return transitions

  def state_to_row_col(self, state):
    """A utility function to get the row and col from a state index"""
    return state // self.maze.shape[1], state % self.maze.shape[1]

  def row_col_to_state(self, row, col):
    """A utility function to get the state index from row and col"""
    return row * self.maze.shape[1] + col

## Gymnasium/Gym Standard Interface

Now that we've clearly defined the MDP, we want to provide an interface for an agent to interact with the environment. This is quite a simple environment, but you can imagine that they get quite complex. Gymnasium is a package that provides a common Environment interface between many such environments, and other useful utilities. `gymnasium` is often abbreviated as `gym`, because it is a well-maintained fork of OpenAI's Gym. A `gym.Env` has the following interface:
- `observation_space`: The observation space defines what data types the states are represented by. For example, an integer or a vector of floats.
- `action_space`: The action space defines what data types the actions are represented by. It is common to have an integer represent different choices.
- `reset()` -> `observation, info`: The reset function marks the start of a new episode. It resets the environment back to its starting state. It returns the starting observation (and a dictionary of auxiliary info).
- `step(action)` -> `observation, reward, terminated, truncated, info`: The step function executes an action chosen by the agent. The resulting state is returned along with the reward for that action. `terminated` signals that the episode reached a terminal state (such as the goal), and `truncated` signals that it was cut off by an outside condition (such as a time limit). Either one means the current episode is done.
- `render()`: Render is for visualization purposes and is optional.
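
As a rough illustration of this interface, the sketch below defines a toy class with the same `reset`/`step` method shapes. `CorridorEnv` is a hypothetical example written in plain Python; it does not subclass `gym.Env` or define the space attributes, so treat it as a picture of the call pattern rather than a real Gymnasium environment.

```python
class CorridorEnv:
    """Toy 1-D corridor with Gym-style reset/step methods (hypothetical example)."""

    def __init__(self, length=5, max_steps=20):
        self.length = length        # positions are 0 .. length-1; the goal is the last one
        self.max_steps = max_steps  # time limit for one episode
        self.position = 0
        self.steps_taken = 0

    def reset(self, seed=None, options=None):
        """Start a new episode and return (observation, info)."""
        self.position = 0
        self.steps_taken = 0
        return self.position, {}

    def step(self, action):
        """Action 0 moves left, action 1 moves right; the ends act as walls."""
        delta = -1 if action == 0 else 1
        self.position = min(max(self.position + delta, 0), self.length - 1)
        self.steps_taken += 1
        reward = -1.0
        terminated = self.position == self.length - 1  # reached the goal
        truncated = (not terminated) and self.steps_taken >= self.max_steps  # ran out of time
        return self.position, reward, terminated, truncated, {}

env = CorridorEnv()
observation, info = env.reset()
done = False
total_reward = 0.0
while not done:
    action = 1  # a fixed "always move right" policy
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated
print(observation, total_reward, terminated, truncated)  # 4 -4.0 True False
```

The loop at the bottom is the same pattern used with real Gymnasium environments: reset once, then step until either flag is set.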

Let's see `gymnasium` in action with a few of their built-in environments. You can refer to https://gymnasium.farama.org/ for more details.

In [None]:
# Uncomment and run these if you try to use a Box2D environment
# !pip install swig
# !pip install gymnasium[box2d]

In [None]:
# Jupyter notebook stuff to render the output
image_widget = ipywidgets.Image()
display(image_widget)

import gymnasium as gym

# Lots of environments to choose from. Here are some examples
# "CartPole-v1", "MountainCar-v0", "Acrobot-v1", "LunarLander-v2", "CarRacing-v2"
# (the Box2D version suffixes depend on your Gymnasium release; newer releases use "-v3")
environment_id = "MountainCar-v0"
env = gym.make(environment_id, render_mode="rgb_array")

def agent_policy(observation):
  # You can choose actions with any method.
  # This just chooses a random action
  return env.action_space.sample()

print("Observation space:", env.observation_space)
print("Action space:", env.action_space)

observation, info = env.reset()
for _ in range(500):
  action = agent_policy(observation)
  observation, reward, terminated, truncated, info = env.step(action)

  done = terminated or truncated
  if done:
    # restart the episode when the previous finishes
    observation, info = env.reset()

  # Render
  image = env.render()
  image_widget.value = image_to_bytes(image)

env.close()

## Wrapping our MDP with Gym

Now that you are familiar with what Gymnasium does, let's define a gym environment for our previously defined MazeMDP class. In most cases, you would just build the MDP directly into the environment.

### Defining the Gym Environment

In [None]:
"""Define the Maze Gym Environment"""
class MazeEnv(gym.Env):

  def __init__(self, mdp, render_mode = "ansi"):
    """Initialize the maze environment setting all variables and parsing the map"""
    self.mdp = mdp

    self.agent_curr_state = self.mdp.init_state
    self.steps_taken = 0

    # Set action space and observation space
    self.action_space = gym.spaces.Discrete(len(self.mdp.A))
    self.observation_space = gym.spaces.Discrete(len(self.mdp.S))

    # Rendering
    self.render_mode = render_mode

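  def reset(self, seed=None, options=None):
    """Reset the environment to its initial state"""
    # Gymnasium's reset accepts optional seed and options arguments
    super().reset(seed=seed)
    self.agent_curr_state = self.mdp.init_state
    self.steps_taken = 0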

    initial_obs = self.agent_curr_state
    initial_info = {}

    return initial_obs, initial_info

  def step(self, action):
    """Take an action in the environment"""

    assert 0 <= action < len(self.mdp.A), "Action should be one of the MDP's action indices"

    prev_state = self.agent_curr_state
    # Sample next state weighted by transition probabilities
    self.agent_curr_state = np.random.choice(list(self.mdp.S), 1, p=self.mdp.P[self.agent_curr_state, action,:])[0]

    # Get reward according to state and action and the resulting state
    reward = self.mdp.R[prev_state, action, self.agent_curr_state]

    # Make outputs for gym environment
    next_obs = self.agent_curr_state

    self.steps_taken += 1
    has_reached_goal = self.agent_curr_state == self.mdp.goal_state
    has_reached_time_limit = self.steps_taken >= self.mdp.H
    # terminated indicates that the agent reached a terminal state (the goal)
    terminated = bool(has_reached_goal)

    # truncated indicates that the episode was ended due to some external condition,
    # here the horizon H running out
    truncated = bool(has_reached_time_limit)
    # info can contain auxiliary information for logging and debugging
    info = {}

    return next_obs, reward, terminated, truncated, info

  def render(self):
    """ Render the environment """

    if self.render_mode == "ansi":
      # ansi is text mode
      return render_text(self.agent_curr_state, self.mdp.maze)
    else:
      raise ValueError(f"Unsupported rendering mode {self.render_mode}")

def render_text(agent_state, maze):
  """Get a text form of the maze"""

  # These are the characters that will render in your maze.
  # 0 = "-" is empty space
  # 1 = "X" is a wall
  # 2 = "-" is the agent starting location. However, it can be rendered as empty space
  # 3 = "G" is the goal location
  # 4 = "T" is a teleporter (you will implement this)
  # extend this list if you add more features
  character_map = ["-", "X", "-", "G", "T"]
  # maze.shape is (number of rows, number of columns)
  nrows, ncols = maze.shape

  string = ""
  for i in range(nrows):
    for j in range(ncols):
      s = (i * ncols + j)
      if s == agent_state:
        # The agent is displayed as "A"
        string += "A"
      else:
        string += character_map[maze[i, j]]
    string += "\n"

  return string


### Random Agent in the Maze
We can test out this environment with an agent that selects random actions.

In [None]:
STEPS_PER_SECOND = 10
output = ipywidgets.Output()
display(output)

# Create the maze MDP. 0 is free space, 1 is walls,
# 2 is the starting location, 3 is the goal location
mdp = MazeMDP([[2,0,0,0,0],
               [1,1,1,1,0],
               [0,0,0,0,0],
               [0,1,0,1,1],
               [0,1,0,0,3]])

# Create the maze gym environment
maze_env = MazeEnv(mdp)

terminated = truncated = done = False
obs, info = maze_env.reset()
num_steps = 0

# Run the maze environment until the episode terminates
while not done:
  # sample a random action
  action = maze_env.action_space.sample()

  next_obs, reward, terminated, truncated, info = maze_env.step(action)
  done = terminated or truncated
  obs = next_obs

  # Render
  with output:
    clear_output(wait=True)
    print(maze_env.render())
  time.sleep(1/STEPS_PER_SECOND)
  num_steps += 1

print(f"Finished in {num_steps} steps")

### Optimal Agent in the Maze
Below, we provide the code for running Value Iteration, a method for solving an MDP. We can use it to test the effects of different reward functions on the Agent's behavior. We will learn how this works later, but the agent will calculate how good each state is for reaching the goal.

You can think of this as the agent considering every possible action in every state. It does this planning in advance, so that when the agent starts moving, it just follows the path that is calculated.
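
Before reading the full agent, it may help to see the same idea on a much smaller problem. The sketch below runs value iteration in plain Python on a hypothetical 5-state corridor (not the maze), with a -1 reward per step and the goal at the right end. The names `N_STATES`, `GOAL`, `GAMMA`, and `next_state` are illustrative, and γ=0.5 is chosen only so the numbers are easy to check by hand.

```python
# States 0..4 in a corridor; state 4 is the goal. Actions: 0 = left, 1 = right.
N_STATES, GOAL, GAMMA = 5, 4, 0.5

def next_state(s, a):
    """Deterministic transition; the ends of the corridor keep the agent in place."""
    return min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)

values = [0.0] * N_STATES
for sweep in range(50):
    new_values = []
    for s in range(N_STATES):
        if s == GOAL:
            new_values.append(0.0)  # terminal state: no future reward
            continue
        # Bellman backup: best over actions of (reward + gamma * value of next state)
        new_values.append(max(-1.0 + GAMMA * values[next_state(s, a)] for a in (0, 1)))
    if max(abs(n - o) for n, o in zip(new_values, values)) < 1e-9:
        break  # values stopped changing
    values = new_values

# Every step costs the same, so the greedy action is the one leading to the best next state
policy = [max((0, 1), key=lambda a: values[next_state(s, a)]) for s in range(N_STATES)]
print(values)  # [-1.875, -1.75, -1.5, -1.0, 0.0]
print(policy)  # [1, 1, 1, 1, 1]
```

States closer to the goal end up with higher values, and the greedy policy moves right everywhere, which is the shortest route to the goal.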

In [None]:
class ValueIterationAgent():
  """Agent class for solving tabular MDPs"""

  def __init__(self, env):
    """Initialize the agent. env should have a tabular mdp"""

    assert hasattr(env, 'mdp'), "Gymnasium Environment does not have associated MDP"

    self.env = env
    self.mdp = env.mdp
    self.policy = None
    self.v = None

  def value_improvement(self, epsilon=.000001):
    """Perform value iteration to compute the state values and a greedy policy"""
    mdp = self.mdp

    # Initialize v. Any starting values work;
    # all zeros is the simplest choice
    v = np.zeros(len(mdp.S))

    # Loop until the horizon is reached
    for i in range(mdp.H):

      old_v = v

      # Perform Bellman Backup
      v = np.max(
            np.sum(
                mdp.P * (mdp.R + mdp.gamma * v[np.newaxis, np.newaxis, :]), axis=-1
            )
          , axis=-1)
      # Set goal state value to 0
      v[mdp.goal_state] = 0

      # Exit if threshold is reached
      if i > 30 and np.max(np.abs(old_v-v)) < epsilon:
        break

    # Determine the best actions with bellman backups
    action_reward = np.sum(mdp.P * (mdp.R + mdp.gamma * v[np.newaxis, np.newaxis, :]), axis=-1)

    # Compute the optimal policy
    best_actions = (action_reward - np.max(action_reward, axis=1)[:,np.newaxis]) == 0
    policy = best_actions * (1 / np.sum( best_actions, axis=1)[:,np.newaxis])

    self.policy = policy
    self.v = v

    return policy, v

  def get_action(self, state):
    """Get the action given by the policy"""

    # If policy is none, return a random action
    if self.policy is None:
      print("Warning: Value iteration policy has not been computed yet")
      return self.env.action_space.sample()

    # Get the action given by the policy
    action = np.random.choice(list(self.mdp.A), 1, p=self.policy[state,:])[0]

    return action

  def get_value(self, state):
    """Get the value of a given state"""

    return self.v[state]

  def get_value_heatmap(self, agent_state, output_shape=(256,256)):
    """Gets a matrix of all the state values"""
    # Copy so that marking walls as NaN does not overwrite the stored values in self.v
    value_matrix = self.v.reshape(self.mdp.maze.shape).copy()
    for loc in self.mdp.obstacles:
      value_matrix[loc] = np.nan
    image = matrix_to_heatmap_image(value_matrix, output_shape)
    # draw agent
    agent_frac_x = (agent_state // self.mdp.maze.shape[1] + 0.5) / self.mdp.maze.shape[0]
    agent_frac_y = (agent_state % self.mdp.maze.shape[1] + 0.5) / self.mdp.maze.shape[1]
    x_half_width = output_shape[0] / self.mdp.maze.shape[0] * 0.25
    y_half_width = output_shape[1] / self.mdp.maze.shape[1] * 0.25
    min_x = int(output_shape[0] * agent_frac_x - x_half_width)
    max_x = int(output_shape[0] * agent_frac_x + x_half_width)
    min_y = int(output_shape[1] * agent_frac_y - y_half_width)
    max_y = int(output_shape[1] * agent_frac_y + y_half_width)
    image[min_x:max_x, min_y:max_y,:] = 255
    return image


Now we can run the same maze to see what the optimal agent does.

In [None]:
STEPS_PER_SECOND = 2
output = ipywidgets.Output()
image_widget = ipywidgets.Image()
display(image_widget)
display(output)

# Create our MDP to specify the maze
mdp = MazeMDP([[2,0,0,0,0],
               [1,1,1,1,0],
               [0,0,0,0,0],
               [0,1,0,1,1],
               [0,1,0,0,3]])

# Create the maze gym environment
maze_env = MazeEnv(mdp)

# Create and call a value iteration agent
agent = ValueIterationAgent(maze_env)
agent.value_improvement()

terminated = truncated = done = False
obs, info = maze_env.reset()
num_steps = 0

# Run the maze environment until the episode terminates
while not done and num_steps < 100:
  # Get best action from the agent policy
  action = agent.get_action(obs)

  next_obs, reward, terminated, truncated, info = maze_env.step(action)
  done = terminated or truncated
  obs = next_obs

  # Render
  with output:
    clear_output(wait=True)
    print(maze_env.render())
  value_heatmap = agent.get_value_heatmap(obs)
  image_widget.value = image_to_bytes(value_heatmap)

  time.sleep(1/STEPS_PER_SECOND)
  num_steps += 1

print(f"Finished in {num_steps} steps")

## Hands On Exploration

### Reward Function

Now, let's do some guided exploration with a new reward function.

1. Try a reward function with 1 for transitions into the goal state, `R[:, :, Goal] = 1` and zero for everything else. Does this reward function change the agent's behavior as you change gamma between 0 and 1? (try these values of gamma: 0, 0.5, 0.95, 1)
  * Refer to the MazeMDP code for an example of how to implement the reward function.
2. Make a reward function of your own. Try to see if you can make the agent take a different path to the goal or demonstrate interesting behavior. For example, you can try these ideas:
  *   Give a negative reward every time the agent moves up
  *   Give a negative reward every time the agent moves left
  *   Give a negative reward if the agent has a certain x or y coordinate
3. What can happen if you change your reward function to give positive rewards each time a certain action is taken? Why is this?

4. Try moving around the agent starting location and the goal location.



In [None]:
class AltRewardMazeMDP(MazeMDP):
  def __init__(self, maze, H=100, gamma=.95):
    super().__init__(maze, H=H, gamma=gamma)

  def build_rewards(self):
    """Build reward function"""
    # This gives you the original reward function. You can use it or not.
    rewards = super().build_rewards()

    # self.row_col_to_state(row, col) is a handy utility function to get the state index from a position
    # You can refer to the actions as self.UP, self.DOWN, self.LEFT, self.RIGHT
    # example: rewards[self.row_col_to_state(1, 2), self.UP, :] = -10 gives -10 reward when the agent is at row=1, col=2 (x=2, y=1) and moves up (action 0)

    # do stuff here

    return rewards

test_maze = [[2,0,0,0,0],
             [1,1,1,1,0],
             [0,0,0,0,0],
             [0,1,0,1,1],
             [0,1,0,0,3]]

test_maze_2 = [[2,0,0,0,0,1,0,0,0,0],
               [1,1,1,1,0,1,0,1,0,1],
               [0,0,0,0,0,1,0,1,0,0],
               [0,1,1,1,0,1,0,1,0,1],
               [0,0,0,1,0,0,0,1,0,0],
               [0,1,0,0,0,0,1,0,0,0],
               [0,1,1,1,1,1,1,1,0,1],
               [0,0,0,1,0,0,0,0,0,3],
               [0,1,0,0,0,0,1,1,1,0],
               [0,1,0,0,1,0,0,0,0,0],]



Below, we provide testing code to see how the Value Iteration Agent interacts with your environment after finding the optimal actions for the MDP. Play around with it.

In [None]:
STEPS_PER_SECOND = 2
output = ipywidgets.Output()
image_widget = ipywidgets.Image()
display(image_widget)
display(output)

# Create the alternate reward MDP. Select a maze.
gamma = 0.95
maze = test_maze_2
mdp = AltRewardMazeMDP(maze, gamma=gamma)

# Create the maze gym environment
maze_env = MazeEnv(mdp)

# Create and call a value iteration agent
agent = ValueIterationAgent(maze_env)
agent.value_improvement()

terminated = truncated = done = False
obs, info = maze_env.reset()
num_steps = 0
total_reward = 0

# Run the maze environment until the episode terminates
while not done and num_steps < 40:
  # Get best action from the agent policy
  action = agent.get_action(obs)

  next_obs, reward, terminated, truncated, info = maze_env.step(action)
  done = terminated or truncated
  obs = next_obs
  total_reward += reward

  # Render
  with output:
    clear_output(wait=True)
    print(maze_env.render())
    print(f'Action: {["Up","Down","Left","Right"][action]} ({action})')
  value_heatmap = agent.get_value_heatmap(obs)
  image_widget.value = image_to_bytes(value_heatmap)

  time.sleep(1/STEPS_PER_SECOND)
  num_steps += 1

print(f"Finished in {num_steps} steps with {total_reward} reward")

### Transitions
For some fun, we will now modify the environment transitions by adding teleport squares. When the agent takes an action that would move it onto a teleport square, it is instead moved to one of the teleport squares on the map, chosen uniformly at random (possibly the one it stepped onto). This makes the environment **stochastic**: the same action in the same location can produce different results. Fortunately, our Value Iteration agent can already handle this complexity.

We specify a teleport state on the map with the number '4'.
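
To show what a stochastic transition looks like in practice, the sketch below builds one hypothetical row of the transition function, P(s, a, ⋅), for a state-action pair that steps onto a teleporter, and samples next states from it. It uses plain Python with a dictionary instead of the NumPy array used in the notebook. The state indices 9 and 21 are the flattened positions of the two `4` tiles in the 5x5 `tp_test_maze` defined in the next cell.

```python
import random
from collections import Counter

# Hypothetical transition row P(s, a, .) for one state-action pair that steps onto a teleporter.
# Keys are next-state indices; values are probabilities that must sum to 1.
teleport_states = [9, 21]
transition_row = {s_next: 1 / len(teleport_states) for s_next in teleport_states}
assert abs(sum(transition_row.values()) - 1.0) < 1e-12

rng = random.Random(0)  # seeded so the run is repeatable
samples = rng.choices(list(transition_row), weights=list(transition_row.values()), k=1000)
counts = Counter(samples)
print(counts)  # roughly 500 each; the exact counts depend on the seed
```

Each teleport square is reached about half of the time, which is what a uniform transition probability over two squares means.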

In [None]:
class TeleporterMazeMDP(MazeMDP):
  def __init__(self, maze, H=100, gamma=.95):

    self.teleport_states = [tuple(loc) for loc in np.argwhere(np.array(maze) == 4)]
    print(self.teleport_states)
    super().__init__(maze, H=H, gamma=gamma)



  def build_transitions(self):
    """Build transitions function"""
    transitions = np.zeros((len(self.S), len(self.A), len(self.S)))

    # Go through all states and all actions
    for s in self.S:
      for a in self.A:
        # By default, the next state is the current state
        s_p = s

        # Convert state to maze index
        row_col = np.array(self.state_to_row_col(s))

        # Get new potential state
        row_col_new = row_col + self.MOVES[a]

        # Check that new state is valid
        is_in_bounds = 0 <= row_col_new[0] < self.maze.shape[0] and 0 <= row_col_new[1] < self.maze.shape[1]
        is_free_space = tuple(row_col_new) not in self.obstacles

        # Compute the new s' if the transition is valid
        if is_in_bounds and is_free_space:
          s_p = self.row_col_to_state(row_col_new[0],row_col_new[1])

        if tuple(row_col_new) not in self.teleport_states:
          # If we have not moved onto a teleporter, assign transition probability to 1
          transitions[s,a,s_p] = 1

        else:
          # The agent stepped onto a teleporter

          # Get the flat index of the teleport states
          tp_states = np.array([self.row_col_to_state(*loc) for loc in self.teleport_states])

          # Assign a uniform probability of landing on each teleport state
          # (including the one the agent stepped onto)
          transitions[s,a,tp_states] = 1/(len(tp_states))

    return transitions


tp_test_maze = [[2,0,0,0,0],
                [1,1,1,1,4],
                [0,0,0,0,0],
                [0,1,0,1,1],
                [0,4,0,0,3]]

tp_test_maze_2 = [[2,0,0,0,0,1,0,0,0,0],
                  [1,1,1,1,0,1,0,1,0,1],
                  [0,0,0,0,0,1,0,1,0,4],
                  [0,1,1,1,0,1,0,1,0,1],
                  [0,0,4,1,0,0,0,1,0,0],
                  [0,1,0,0,0,0,1,0,0,0],
                  [0,1,1,1,1,1,1,1,0,1],
                  [0,0,0,1,0,0,0,0,0,3],
                  [0,1,0,0,0,0,1,1,1,0],
                  [4,1,0,0,1,0,4,0,0,0],]

Below, we provide code to test your `TeleporterMazeMDP` environment. Notice how the way we wrote the Gym Environment requires *no changes* even though we are using a different underlying MDP. This is the power of the Gym API!

1. How does the optimal policy change with the number and location of teleporting squares in the map? If there is 1 square? 2 squares? 4 squares?

2. What happens if you change the probability of teleporting?

3. Try to make teleporting optional. To do this, add a 5th action (id=4) that will only teleport the agent to another teleporting square if that action is taken. In non-teleporting squares the agent will remain in the same square when taking the teleporting action. How does this change the optimal policy?

4. Does changing the reward function change the utility of teleporting squares? What if there is a cost associated with a teleporting square? To test this, add the `build_rewards()` function to `TeleporterMazeMDP` and implement it with a different reward function.

In [None]:
STEPS_PER_SECOND = 2
output = ipywidgets.Output()
image_widget = ipywidgets.Image()
display(image_widget)
display(output)

# Create the alternate teleportation MDP. Select a maze.
gamma = 0.99
maze = tp_test_maze
test_mdp = TeleporterMazeMDP(maze, gamma=gamma )

# Create the maze gym environment
maze_env = MazeEnv(test_mdp)

# Create and call a value iteration agent
agent = ValueIterationAgent(maze_env)
agent.value_improvement()

terminated = truncated = done = False
obs, info = maze_env.reset()
num_steps = 0
total_reward = 0

# Run the maze environment until the episode terminates
while not done and num_steps < 40:
  # Get best action from the agent policy
  action = agent.get_action(obs)

  next_obs, reward, terminated, truncated, info = maze_env.step(action)
  done = terminated or truncated
  obs = next_obs
  total_reward += reward

  # Render
  with output:
    clear_output(wait=True)
    print(maze_env.render())
    print(f'Action: {["Up","Down","Left","Right","Teleport"][action]} ({action})')
  value_heatmap = agent.get_value_heatmap(obs)
  image_widget.value = image_to_bytes(value_heatmap)

  time.sleep(1/STEPS_PER_SECOND)
  num_steps += 1

print(f"Finished in {num_steps} steps with {total_reward} reward")

Further exploration! A few more ideas if you are interested:

1. Add a lava tile where the agent incurs a large negative reward for touching the lava tile. When does the agent avoid the lava tiles? Or not? How does this depend on the reward function and the discount factor (gamma)?

2. Add more actions. Allow the agent to move along the diagonal. This should be a straightforward extension (add more moves to the `MOVES` dictionary!)
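
As a starting point for the diagonal-move idea, here is one possible shape for an extended move table, sketched in plain Python. `MOVES_8` and `apply_move` are hypothetical names, and the moves are tuples rather than the NumPy arrays used in `MazeMDP`. Note that this sketch lets the agent slip diagonally between two walls that touch at a corner; whether to allow that is a design choice left to you.

```python
# Hypothetical extended move table: the four original moves plus four diagonals.
MOVES_8 = {
    0: (-1, 0),   # UP
    1: (1, 0),    # DOWN
    2: (0, -1),   # LEFT
    3: (0, 1),    # RIGHT
    4: (-1, -1),  # UP-LEFT
    5: (-1, 1),   # UP-RIGHT
    6: (1, -1),   # DOWN-LEFT
    7: (1, 1),    # DOWN-RIGHT
}

def apply_move(row, col, action, n_rows, n_cols, walls=frozenset()):
    """Return the new (row, col), staying in place if the move is blocked."""
    d_row, d_col = MOVES_8[action]
    new_row, new_col = row + d_row, col + d_col
    in_bounds = 0 <= new_row < n_rows and 0 <= new_col < n_cols
    if in_bounds and (new_row, new_col) not in walls:
        return new_row, new_col
    return row, col

print(apply_move(2, 2, 7, 5, 5))                  # (3, 3)
print(apply_move(0, 0, 4, 5, 5))                  # (0, 0): the move leaves the map
print(apply_move(2, 2, 7, 5, 5, walls={(3, 3)}))  # (2, 2): blocked by a wall
```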