In [1]:
#import statements
import numpy as np

In [2]:
#Defining the gridworld class


In [23]:
class GridworldMDP:

  def __init__(self, grid_size=4, gamma=0.9):
    # Grid dimensions (e.g., 4x4 grid)
    self.grid_size = grid_size
    #All states
    self.states = [(i, j) for i in range(grid_size) for j in range(grid_size)]
    #Possible actions that the agent can take
    self.actions = ['up', 'down', 'left', 'right']
    #The discount factor
    self.gamma = gamma
    #Reward function
    self.rewards = self.create_rewards()
    # Transition probabilities
    self.transition_probs = self.create_transition_probs()

  def create_rewards(self):
    """ This function defines the reward structure."""
    rewards = {}
    for state in self.states:
      #Setting a default reward for each step
      rewards[state] = -1
    #Goal state (positive reward)
    rewards[(0, self.grid_size - 1)] = 10
    #Defining the trap state (negative reward)
    rewards[(2, 2)] = -10

    return rewards

  def create_transition_probs(self):
    """Create transition probabilities for all actions."""
    transition_probs = {}
    for state in self.states:
      transition_probs[state] = {}
      for action in self.actions:
        transition_probs[state][action] = self.get_next_state_probs(state, action)
    return transition_probs

  def get_next_state_probs(self, state, action):
    """Get the possible next states and their probabilities."""
    i, j = state
    if state == (0, self.grid_size - 1): #Goal state is absorbing
      return {(0, self.grid_size - 1): 1.0}
    if action == 'up':
      next_state = (max(i - 1, 0), j)
    elif action == 'down':
      next_state = (min(i + 1, self.grid_size - 1), j)
    elif action == 'left':
      next_state = (i, max(j - 1,  0))
    elif action == 'right':
      next_state = (i, min(j + 1, self.grid_size - 1))

    return {next_state: 1.0}


  def value_iteration(self, theta=1e-3):
    """Perform value iteration to find the optimal value function and policy."""
    #Initialize the value function
    V = {state: 0 for state in self.states}
    #Random initial policy
    policy = {state: np.random.choice(self.actions) for state in self.states}

    while True:
      delta = 0
      for state in self.states:
        old_value = V[state]
        new_value = self.compute_value_for_state(state, V)
        V[state] = new_value
        delta = max(delta, abs(old_value - new_value))

      if delta < theta:
        break
    for state in self.states:
      policy[state] = self.compute_optimal_action_for_state(state, V)

    return V, policy

  def compute_value_for_state(self, state, V):
    """Compute the value for a given state by considering all possible actions."""
    action_values = []
    for action in self.actions:
      expected_value = 0
      for next_state, prob in self.transition_probs[state][action].items():
        expected_value += prob * (self.rewards[next_state] + self.gamma * V[next_state])
      #Record the action's total once all next states have been summed
      action_values.append(expected_value)

    #Return the best possible value for this state
    return max(action_values)

  def compute_optimal_action_for_state(self, state, V):
    """Compute the optimal action for a state given the value function."""
    action_values = {}
    for action in self.actions:
      expected_value = 0
      for next_state, prob in self.transition_probs[state][action].items():
        expected_value += prob * (self.rewards[next_state] + self.gamma * V[next_state])
      #Store the total once all next states have been summed
      action_values[action] = expected_value

    #Return the action with the highest expected value
    return max(action_values, key=action_values.get)

  def print_policy(self, policy):
    """Print the optimal policy in a readable format."""
    grid_policy = np.zeros((self.grid_size, self.grid_size), dtype=str)
    for state, action in policy.items():
      i, j = state
      #Showing the first letter of each action
      grid_policy[i, j] = action[0]

    print("Optimal Policy (first letter of action):")
    for row in grid_policy:
      print(' '.join(row))








In [28]:
# Create the Gridworld MDP
gridworld = GridworldMDP(grid_size=4, gamma=0.9)

# Solve the MDP using value iteration
optimal_values, optimal_policy = gridworld.value_iteration()
# Print the results
print("Optimal Value Function:")
for state, value in sorted(optimal_values.items()):
    print(f"State {state}: {value:.2f}")

print("\nOptimal Policy:")
gridworld.print_policy(optimal_policy)

Optimal Value Function:
State (0, 0): 79.09
State (0, 1): 88.99
State (0, 2): 99.99
State (0, 3): 99.99
State (1, 0): 70.18
State (1, 1): 79.09
State (1, 2): 88.99
State (1, 3): 99.99
State (2, 0): 62.16
State (2, 1): 70.18
State (2, 2): 79.09
State (2, 3): 88.99
State (3, 0): 54.95
State (3, 1): 62.16
State (3, 2): 70.18
State (3, 3): 79.09

Optimal Policy:
Optimal Policy (first letter of action):
r r r u
r r r u
r u r u
r r r u


In [25]:
print(gridworld)

<__main__.GridworldMDP object at 0x7ad7f24bffd0>


The code implements a **Gridworld Markov Decision Process (MDP)** and solves it using **Value Iteration**.

---

### **1. What is Gridworld?**
- **Gridworld** is a simple environment used to test reinforcement learning algorithms.
- It consists of a grid of states where an agent starts at some position and moves around the grid by taking actions (`up`, `down`, `left`, `right`).
- The goal is to reach a specific "goal" state while avoiding trap states or penalties, following a policy that maximizes rewards.

---

### **2. Class: `GridworldMDP`**
The class encapsulates the MDP environment, including states, actions, rewards, transitions, and solving the MDP using Value Iteration.

#### **a) Attributes Defined in `__init__`:**
1. **`grid_size`**: The dimension of the grid (e.g., 4x4 grid).
2. **`states`**: All possible grid coordinates, represented as a list of tuples `(i, j)`.
3. **`actions`**: The agent's possible movements: `['up', 'down', 'left', 'right']`.
4. **`gamma`**: The discount factor, controlling how much future rewards are valued.
5. **`rewards`**: A dictionary where each state has an associated reward. Created in the `create_rewards` method.
6. **`transition_probs`**: A dictionary that defines the probability of moving to different states based on actions. Created in the `create_transition_probs` method.

---

#### **b) Reward Function (`create_rewards`):**
Defines the reward structure for the grid. Rewards are keyed by state and collected on arrival: the value update looks up `self.rewards[next_state]`.
- **Default Reward**: `-1` for each step to encourage the agent to finish quickly.
- **Goal State**: Positive reward (`10`) for reaching the goal `(0, grid_size - 1)`, the top-right corner.
- **Trap State**: Negative reward (`-10`) for entering the trap `(2, 2)`.
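
The same layout can be sketched as a free function. This is an illustration rather than the class's method: the name `build_rewards` and its keyword parameters are hypothetical, and only the default values are taken from the code above.

```python
def build_rewards(grid_size=4, step=-1, goal=10, trap=-10):
    """Return a dict mapping each (row, col) state to its reward."""
    # Every cell starts with the per-step penalty.
    rewards = {(i, j): step for i in range(grid_size) for j in range(grid_size)}
    rewards[(0, grid_size - 1)] = goal  # top-right corner
    rewards[(2, 2)] = trap              # fixed trap cell, as in the class
    return rewards

rewards = build_rewards()
print(rewards[(0, 3)], rewards[(2, 2)], rewards[(1, 1)])
```

Because the trap position is hard-coded, a grid smaller than 3x3 would get a `(2, 2)` entry that is not a real state; the class has the same property.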

---

#### **c) Transition Probabilities (`create_transition_probs`):**
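Defines the probability of moving to a new state after taking an action:
- For each state and action, `get_next_state_probs` returns a dictionary `{next_state: probability}`. Transitions here are deterministic, so the dictionary always holds a single entry with probability `1.0`.
- Movement outside the grid boundaries is disallowed (e.g., moving "up" from `(0, 0)` keeps the agent at `(0, 0)`).
- The goal state is absorbing: every action taken there leads back to the goal.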
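
The boundary handling can be sketched with an offset table instead of an `if`/`elif` chain. The names `MOVES` and `next_state` are hypothetical; the behavior is meant to mirror `get_next_state_probs`, with row 0 at the top of the grid.

```python
# Row/column offsets for each action (row 0 is the top of the grid).
MOVES = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}

def next_state(state, action, grid_size=4):
    """Deterministic move, clamped so the agent never leaves the grid."""
    goal = (0, grid_size - 1)
    if state == goal:  # absorbing goal: every action stays put
        return goal
    di, dj = MOVES[action]
    i = min(max(state[0] + di, 0), grid_size - 1)
    j = min(max(state[1] + dj, 0), grid_size - 1)
    return (i, j)

print(next_state((0, 0), 'up'))     # (0, 0): blocked by the top edge
print(next_state((1, 1), 'right'))  # (1, 2)
print(next_state((0, 3), 'down'))   # (0, 3): the goal is absorbing
```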

---

### **3. Solving the MDP with Value Iteration**
Value Iteration computes the **Optimal Value Function** and the **Optimal Policy** by iteratively refining estimates of state values.

#### **a) Value Iteration (`value_iteration`):**
- **Inputs:**
  - `theta`: A small threshold to determine convergence. Defaults to `1e-3`.
- **Outputs:**
  - `V`: The optimal value function, a dictionary mapping states to their values.
  - `policy`: The optimal policy, mapping states to the best action.

- **Algorithm Steps:**
  1. Initialize value estimates (`V`) to `0` for all states.
  2. Repeat until convergence:
     - For each state:
       - Compute the new value by evaluating all actions and taking the maximum expected return (`compute_value_for_state`).
       - Update the value in place and track the maximum change (`delta`).
     - Stop when `delta < theta`, indicating convergence.
  3. Extract the optimal policy by selecting the action with the highest expected return for each state (`compute_optimal_action_for_state`).
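
The steps above can be condensed into a standalone function. This is a sketch, not the class's method: the name `value_iteration_sketch` and the `sweeps` counter are hypothetical additions, and the grid layout (goal in the top-right corner, trap at `(2, 2)`) is copied from the class.

```python
def value_iteration_sketch(grid_size=4, gamma=0.9, theta=1e-3):
    """Run in-place value iteration; return the values and the sweep count."""
    goal = (0, grid_size - 1)
    states = [(i, j) for i in range(grid_size) for j in range(grid_size)]
    reward = {s: -1 for s in states}
    reward[goal] = 10
    reward[(2, 2)] = -10
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def step(s, m):
        # The goal is absorbing; other moves are clamped to the grid.
        if s == goal:
            return goal
        i = min(max(s[0] + m[0], 0), grid_size - 1)
        j = min(max(s[1] + m[1], 0), grid_size - 1)
        return (i, j)

    V = {s: 0.0 for s in states}
    sweeps = 0
    while True:
        delta = 0.0
        for s in states:
            # Best one-step lookahead over the four actions.
            best = max(reward[step(s, m)] + gamma * V[step(s, m)] for m in moves)
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place update
        sweeps += 1
        if delta < theta:
            return V, sweeps

V, sweeps = value_iteration_sketch()
print(sweeps, round(V[(0, 3)], 2))
```

Raising `theta` should end the loop after fewer sweeps at the cost of a looser approximation.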

---

#### **b) Value Computation for a State (`compute_value_for_state`):**
For a given state, calculate its value by:
1. Iterating over all possible actions.
2. Computing the expected value of each action, considering:
   - Transition probabilities to next states.
   - Rewards for those next states.
   - Discounted future rewards (`gamma * V[next_state]`).
3. Returning the maximum value over all actions.
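
Written as an equation, the backup that `compute_value_for_state` performs is the Bellman optimality update below. The symbols $P$, $R$, and the sweep index $k$ are notation introduced here, not names from the code; $R$ depends on the next state $s'$ because the code looks up `self.rewards[next_state]`.

$$
V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s') + \gamma\, V_k(s') \bigr]
$$

Because the loop overwrites `V[state]` in place, states visited later in a sweep already see some updated values, so $V_k$ on the right-hand side is a slight idealization. With the deterministic transitions used here the sum has a single term; for example, taking `right` from `(0, 2)` gives $10 + 0.9\,V\bigl((0,3)\bigr)$.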

---

#### **c) Optimal Policy Extraction (`compute_optimal_action_for_state`):**
For each state, the optimal policy is determined by:
1. Calculating the expected value for each action (similar to the value computation).
2. Returning the action with the highest expected value.
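
The selection step relies on Python's built-in `max` with a `key` function. A small sketch with made-up action values (the numbers are hypothetical and chosen so that two actions tie) shows how ties are resolved:

```python
# Hypothetical action values for one state; 'up' and 'right' tie.
action_values = {'up': 70.18, 'down': 54.95, 'left': 62.16, 'right': 70.18}

# max() iterates over the dict's keys in insertion order and keeps the
# first key whose value is the largest.
best_action = max(action_values, key=action_values.get)
print(best_action)
```

Since `self.actions` lists `'up'` first, a state where all actions are equally good, such as the absorbing goal, is reported as `u` in the printed policy.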

---

### **4. Printing Results**
1. **Optimal Value Function**:
   After value iteration, it shows the value of each state.
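   - States closer to the goal have higher values; each extra step away costs the `-1` step penalty and one more factor of `gamma`.
   - The trap penalty is paid on entering `(2, 2)`, so it lowers the value of actions that lead into the trap rather than the value of `(2, 2)` itself.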

2. **Optimal Policy**:
   A policy is displayed as a grid where each state's action is represented by the first letter of its optimal action (`'u'`, `'d'`, `'l'`, `'r'`).
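
The recorded run shows values near `100` next to the goal even though the goal reward is only `10`. A short arithmetic check suggests why, assuming the goal keeps paying `+10` on every step because it is absorbing (which is how `get_next_state_probs` implements it):

```python
gamma, goal_reward = 0.9, 10

# Closed form of the infinite discounted sum 10 + 0.9*10 + 0.9**2*10 + ...
limit = goal_reward / (1 - gamma)

# Partial sums approach the limit from below as more steps are included.
partial_sums = [sum(goal_reward * gamma**k for k in range(n)) for n in (1, 10, 50)]

print(round(limit, 2))
print([round(p, 2) for p in partial_sums])
```

The stopping threshold `theta` ends the iteration slightly short of the limit, which is consistent with the `99.99` printed for `(0, 3)`.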

---

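### **5. Example Output**
For a 4x4 grid with `gamma=0.9`, the run above produces:
#### Optimal Value Function:
```
State (0, 0): 79.09
State (0, 1): 88.99
State (0, 2): 99.99
State (0, 3): 99.99
...
```

#### Optimal Policy:
```
r r r u
r r r u
r u r u
r r r u
```
- The agent moves right (`r`) and up (`u`) towards the goal at `(0, 3)`; at `(2, 1)` it goes up rather than right, avoiding the trap at `(2, 2)`.
- The goal's value is close to `100`, not `10`, because the goal is absorbing and collects `+10` on every subsequent step: `10 / (1 - 0.9) = 100`.
- The `u` shown at the goal itself is arbitrary: all four actions tie there, and `max` returns the first one.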

---

### **Applications**
This implementation is foundational for studying:
1. **Reinforcement Learning Algorithms**.
2. **Dynamic Programming in AI**.
3. **Planning and Decision Making** in autonomous systems.
