In [2]:
import numpy as np

# 1. THE ENVIRONMENT (4x4 Grid)
class GridWorld:
    def __init__(self):
        self.grid_size = 4
        # Terminal states: Top-Left (0) and Bottom-Right (15)
        self.terminal_states = [0, 15]
        self.actions = ['UP', 'DOWN', 'LEFT', 'RIGHT']

    def step(self, state, action):
        if state in self.terminal_states:
            return state, 0, True

        # Move logic
        row, col = divmod(state, self.grid_size)
        if action == 'UP':    row = max(row - 1, 0)
        elif action == 'DOWN':  row = min(row + 1, self.grid_size - 1)
        elif action == 'LEFT':  col = max(col - 1, 0)
        elif action == 'RIGHT': col = min(col + 1, self.grid_size - 1)

        next_state = row * self.grid_size + col
        reward = -1  # Standard penalty for each step
        done = next_state in self.terminal_states
        return next_state, reward, done

    def reset(self):
        # Start anywhere except the terminal states
        start_state = np.random.randint(0, 16)
        while start_state in self.terminal_states:
            start_state = np.random.randint(0, 16)
        return start_state

# 2. THE POLICY (Random)
def generate_episode(env):
    episode = []
    state = env.reset()
    done = False
    while not done:
        # Random Policy: 25% chance for any direction
        action = np.random.choice(env.actions)
        next_state, reward, done = env.step(state, action)
        episode.append((state, action, reward))
        state = next_state
    return episode

# 3. THE ALGORITHM (First-Visit Monte Carlo Policy Evaluation)
def mc_policy_evaluation(env, num_episodes=5000):
    # Initialize Values to 0
    V = np.zeros(env.grid_size * env.grid_size)

    # Store all returns for every state
    returns = {s: [] for s in range(env.grid_size * env.grid_size)}

    for _ in range(num_episodes):
        episode = generate_episode(env)
        G = 0

        # Work backwards from the end of the episode
        for idx in range(len(episode) - 1, -1, -1):
            state, action, reward = episode[idx]
            G = G + reward # Gamma is 1.0, so G = G + R

            # "First-Visit" Check:
            # Only count the return if this was the first time
            # we visited this state in this specific episode.
            previous_states = [x[0] for x in episode[:idx]]
            if state not in previous_states:
                returns[state].append(G)
                V[state] = np.mean(returns[state]) # Average the returns
    return V

# --- Main Driver ---
if __name__ == "__main__":
    env = GridWorld()
    print("Running Monte Carlo Policy Evaluation (5000 Episodes)...\n")

    # Run the algorithm
    values = mc_policy_evaluation(env)

    print("\nResulting Value Function (V):")
    # Reshape to 4x4 for easy reading
    print(np.round(values.reshape(4, 4), 1))

Running Monte Carlo Policy Evaluation (5000 Episodes)...


Resulting Value Function (V):
[[  0.  -14.4 -20.5 -22.3]
 [-14.  -18.2 -19.8 -19.8]
 [-20.1 -20.3 -18.4 -14.5]
 [-22.4 -20.  -13.8   0. ]]


The code consists of three main parts:  
1.  **The Environment (`GridWorld` class)**: Defines the 4x4 grid, actions, rewards, and how the state changes.  
2.  **The Policy (`generate_episode` function)**: Rolls out one full episode under the agent's policy. Here, that is a uniform random policy.  
3.  **The Algorithm (`mc_policy_evaluation` function)**: Implements the First-Visit Monte Carlo method to estimate the value function of the random policy.

### 1. The Environment (`GridWorld` Class)

In [None]:
class GridWorld:
    def __init__(self):
        self.grid_size = 4
        # Terminal states: Top-Left (0) and Bottom-Right (15)
        self.terminal_states = [0, 15]
        self.actions = ['UP', 'DOWN', 'LEFT', 'RIGHT']

    def step(self, state, action):
        if state in self.terminal_states:
            return state, 0, True

        # Move logic
        row, col = divmod(state, self.grid_size)
        if action == 'UP':    row = max(row - 1, 0)
        elif action == 'DOWN':  row = min(row + 1, self.grid_size - 1)
        elif action == 'LEFT':  col = max(col - 1, 0)
        elif action == 'RIGHT': col = min(col + 1, self.grid_size - 1)

        next_state = row * self.grid_size + col
        reward = -1  # Standard penalty for each step
        done = next_state in self.terminal_states
        return next_state, reward, done

    def reset(self):
        # Start anywhere except the terminal states
        start_state = np.random.randint(0, 16)
        while start_state in self.terminal_states:
            start_state = np.random.randint(0, 16)
        return start_state

This class defines the grid-world environment:
*   It's a 4x4 grid, with states numbered 0 to 15.
*   States 0 (top-left) and 15 (bottom-right) are *terminal states* where the episode ends.
*   `actions` are 'UP', 'DOWN', 'LEFT', 'RIGHT'.
*   `step(state, action)`: Takes a current state and an action, then returns the `next_state`, the `reward` (-1 for each step, 0 in terminal states), and whether the episode is `done`.
*   `reset()`: Starts a new episode from a random non-terminal state.
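
The state numbering and wall behavior described above can be checked with a small sketch. The `move` helper and the `GRID_SIZE` constant are hypothetical names introduced only for this illustration; they mirror the move logic of `GridWorld.step` without rewards or termination.

```python
GRID_SIZE = 4  # assumed to match GridWorld.grid_size


def move(state, action, grid_size=GRID_SIZE):
    """Hypothetical helper mirroring the move logic in GridWorld.step."""
    # A state index encodes a cell as state = row * grid_size + col.
    row, col = divmod(state, grid_size)
    if action == 'UP':
        row = max(row - 1, 0)
    elif action == 'DOWN':
        row = min(row + 1, grid_size - 1)
    elif action == 'LEFT':
        col = max(col - 1, 0)
    elif action == 'RIGHT':
        col = min(col + 1, grid_size - 1)
    return row * grid_size + col


print(divmod(7, GRID_SIZE))  # (1, 3): row 1, last column
print(move(7, 'RIGHT'))      # 7: bumping into a wall leaves the state unchanged
print(move(11, 'DOWN'))      # 15: reaches the bottom-right terminal state
```

Note that a move into a wall still costs a step in `GridWorld.step`, so the agent receives -1 even though the state does not change.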

### 2. The Policy (`generate_episode` Function)

In [None]:
def generate_episode(env):
    episode = []
    state = env.reset()
    done = False
    while not done:
        # Random Policy: 25% chance for any direction
        action = np.random.choice(env.actions)
        next_state, reward, done = env.step(state, action)
        episode.append((state, action, reward))
        state = next_state
    return episode

This function simulates one full episode using a *random policy*:
*   It starts from a `reset()` state.
*   In each step, the agent chooses an action randomly (each action has a 25% chance).
*   It records the sequence of `(state, action, reward)` tuples until a terminal state is reached. This sequence is called an `episode`.
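
Because the policy is random, every run produces different episodes. The following is a minimal sketch of a reproducible rollout, assuming an explicit `np.random.default_rng` generator is acceptable in place of the global `np.random` state used above. The names `step`, `generate_episode_seeded`, `ACTIONS`, and `TERMINALS` are hypothetical stand-ins so the block runs on its own.

```python
import numpy as np

ACTIONS = ['UP', 'DOWN', 'LEFT', 'RIGHT']
TERMINALS = (0, 15)


def step(state, action, n=4):
    """Compact stand-in for GridWorld.step (same rules, hypothetical helper)."""
    row, col = divmod(state, n)
    if action == 'UP':
        row = max(row - 1, 0)
    elif action == 'DOWN':
        row = min(row + 1, n - 1)
    elif action == 'LEFT':
        col = max(col - 1, 0)
    elif action == 'RIGHT':
        col = min(col + 1, n - 1)
    next_state = row * n + col
    return next_state, -1, next_state in TERMINALS


def generate_episode_seeded(rng, start_state):
    """Roll out the uniform random policy using an explicit Generator."""
    episode, state, done = [], start_state, False
    while not done:
        # Each of the four actions is drawn with probability 1/4.
        action = ACTIONS[rng.integers(len(ACTIONS))]
        next_state, reward, done = step(state, action)
        episode.append((state, action, reward))
        state = next_state
    return episode


ep_a = generate_episode_seeded(np.random.default_rng(0), start_state=5)
ep_b = generate_episode_seeded(np.random.default_rng(0), start_state=5)
print(ep_a == ep_b)  # True: the same seed reproduces the same episode
print(len(ep_a), sum(r for _, _, r in ep_a))  # length and return (the negated length)
```

The last line shows that, with a reward of -1 per step and no discounting, the return from the start state is just the negative of the episode length.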

### 3. The Algorithm (`mc_policy_evaluation` Function)

In [None]:
def mc_policy_evaluation(env, num_episodes=5000):
    # Initialize Values to 0
    V = np.zeros(env.grid_size * env.grid_size)

    # Store all returns for every state
    returns = {s: [] for s in range(env.grid_size * env.grid_size)}

    for _ in range(num_episodes):
        episode = generate_episode(env)
        G = 0

        # Work backwards from the end of the episode
        for idx in range(len(episode) - 1, -1, -1):
            state, action, reward = episode[idx]
            G = G + reward # Gamma is 1.0, so G = G + R

            # "First-Visit" Check:
            # Only count the return if this was the first time
            # we visited this state in this specific episode.
            previous_states = [x[0] for x in episode[:idx]]
            if state not in previous_states:
                returns[state].append(G)
                V[state] = np.mean(returns[state]) # Average the returns
    return V

This is the core of the Monte Carlo policy evaluation:
*   **Goal**: To estimate the *value function* (`V`) for the random policy, which means calculating the expected total future reward (return) from each state.
*   **Initialization**: `V` is initialized to zeros, and `returns` is a dictionary to store all observed returns for each state.
*   **Episode Generation**: It runs `num_episodes` (default 5000), generating a full episode for each run.
*   **Calculating Returns (`G`)**: For each episode, it iterates *backwards* from the end to calculate the return `G`. Since the discount factor (gamma) is implicitly 1.0, `G` is simply the sum of future rewards.
*   **First-Visit MC**: For each state encountered in an episode, it only considers the *first time* that state was visited in that specific episode to calculate its return. This is the "first-visit" aspect of the algorithm.
*   **Averaging Returns**: After calculating the return `G` for a state (on its first visit in an episode), it adds `G` to the list of `returns` for that state. The value `V[state]` is then updated by averaging all the returns observed for that state so far.
*   By running many episodes and averaging the returns, `V` converges to the true value function of the random policy.
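
To make the backward pass and the first-visit rule concrete, here is a small worked sketch on a hand-written episode. The function `first_visit_returns` is a hypothetical helper, not part of the code above. It also records each state's first time step in one pass up front, which is one possible way to avoid rebuilding `previous_states` at every step.

```python
def first_visit_returns(episode, gamma=1.0):
    """Return {state: G}, where G is the return following the state's first visit."""
    # Time step at which each state first appears in the episode.
    first_idx = {}
    for t, (state, _, _) in enumerate(episode):
        first_idx.setdefault(state, t)

    G = 0.0
    out = {}
    # Walk backwards so G accumulates the rewards that follow each step.
    for t in range(len(episode) - 1, -1, -1):
        state, _, reward = episode[t]
        G = gamma * G + reward
        if first_idx[state] == t:  # only the first visit contributes
            out[state] = G
    return out


# Hand-written episode: 5 -> 6 -> 5 -> 1 -> terminal state 0.
episode = [(5, 'RIGHT', -1), (6, 'LEFT', -1), (5, 'UP', -1), (1, 'LEFT', -1)]
print(first_visit_returns(episode))  # {1: -1.0, 6: -3.0, 5: -4.0}
```

State 5 appears twice, but only its first visit (four steps before termination) is recorded, giving a return of -4. The second visit, with a return of -2, is ignored.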

### Main Driver

In [None]:
if __name__ == "__main__":
    env = GridWorld()
    print("Running Monte Carlo Policy Evaluation (5000 Episodes)...\n")

    # Run the algorithm
    values = mc_policy_evaluation(env)

    print("\nResulting Value Function (V):")
    # Reshape to 4x4 for easy reading
    print(np.round(values.reshape(4, 4), 1))

This block initializes the `GridWorld` environment, calls the `mc_policy_evaluation` function to get the value function, and then prints the results in a 4x4 grid format for readability.

Each entry of the output `values` array, reshaped into a 4x4 grid, is the **estimated expected return** (total future reward) obtained by starting in that state and following the random policy until a terminal state is reached.

Let's break down what these numbers mean:

*   **Terminal States (0 and 15):**
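    *   `0.0` at the top-left `(0,0)` and bottom-right `(3,3)` positions. These are the terminal states. An episode ends as soon as the agent reaches one, so no further rewards accumulate and the value is 0. In this implementation, terminal states are never appended to an episode, so `V[0]` and `V[15]` simply keep their initial value of 0.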

*   **Negative Values:**
    *   All other states have negative values because every step taken in the environment yields a `reward` of `-1`. With no discounting, the value of a state is therefore the negative of the expected number of steps needed to reach a terminal state from it.

*   **Interpretation of Magnitude:**
    *   **Less Negative Values (e.g., -14.4, -13.8):** States closer to a terminal state (like those adjacent to the top-left `0` or bottom-right `15` cells) tend to have less negative values. This indicates that, on average, it takes fewer steps to reach a terminal state from these positions under a random policy.
    *   **More Negative Values (e.g., -22.4, -22.3):** The states farthest from both terminal states, namely the top-right and bottom-left corners, have the most negative values. The cells in the middle of the grid fall in between (roughly -18 to -20). From the far corners, a purely random walk takes the most steps, on average, to reach a terminal state.
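
As a cross-check on these estimates, the value function of the random policy can also be computed exactly for a grid this small. The sketch below is not part of the Monte Carlo code above. It assumes the same dynamics (uniform random actions, reward -1 per step, no discounting) and solves the Bellman expectation equations as a linear system with `np.linalg.solve`.

```python
import numpy as np

n = 4
num_states = n * n
terminals = {0, 15}
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # UP, DOWN, LEFT, RIGHT

# Build A v = b, where each non-terminal row encodes
#   v(s) - 0.25 * sum over actions of v(next state) = -1
# and each terminal row encodes v(s) = 0.
A = np.eye(num_states)
b = np.zeros(num_states)
for s in range(num_states):
    if s in terminals:
        continue
    b[s] = -1.0  # every step from a non-terminal state costs -1
    row, col = divmod(s, n)
    for dr, dc in moves:
        # Clip at the walls, as GridWorld.step does.
        r2 = min(max(row + dr, 0), n - 1)
        c2 = min(max(col + dc, 0), n - 1)
        A[s, r2 * n + c2] -= 0.25

v_exact = np.linalg.solve(A, b)
print(np.round(v_exact.reshape(n, n), 1))
```

The exact values are -14, -18, -20, and -22, depending on the cell, so the Monte Carlo estimates after 5000 episodes are within a few tenths of them.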