**Monte Carlo Prediction**

Step 1: Calculate Return from First Visit:

In [1]:
def first_visit_return(returns, trajectory, gamma):
  """Compute the return for each state from its first visit in a trajectory.

  Traverses the trajectory backward to accumulate the discounted return and appends
  the result to `returns` only at the first occurrence of each state.
  """

  G = 0
  T = len(trajectory) - 1
  for t, sar in enumerate(reversed(trajectory)):
    s, a, r = sar
    G = r + gamma * G  # discounted return from time step T - t onward
    # The current step has index T - t; it is a first visit only if s
    # does not appear at any earlier index 0 .. T - t - 1.
    first_visit = True
    for j in range(T - t):
      if s == trajectory[j][0]:
        first_visit = False
        break
    if first_visit:
      if s in returns:
        returns[s].append(G)
      else:
        returns[s] = [G]
  return returns

Step 2: Generate trajectory:

In [2]:
def get_trajectory(env, policy):
  """Simulate a single episode in the environment under the given policy and
  return the sequence of [state, action, reward] lists (the trajectory)."""

  trajectory = []
  state = env.reset()
  done = False
  sar = [state]
  while not done:
    action = choose_action(state, policy)
    state, reward, done, info = env.step(action)
    sar.append(action)
    sar.append(reward)
    trajectory.append(sar)
    sar = [state]

  return trajectory

Step 3: Monte Carlo prediction with First-Visit MC

In [3]:
import numpy as np

def first_visit_mc(env, policy, gamma, n_trajectories):
  """Simulate multiple episodes (trajectories), collect first-visit returns, and
  average them to estimate the state-value function."""
  returns = {}
  v = {}
  for i in range(n_trajectories):
    trajectory = get_trajectory(env, policy)
    returns = first_visit_return(returns, trajectory, gamma)
  for s in env.state_space:
    if s in returns:
      v[s] = np.round(np.mean(returns[s]), 1)

  return v

Explanation:

---

### **1. Function: `get_trajectory(env, policy)`**

This function generates a trajectory (sequence of state-action-reward tuples) by following a policy in a given environment.

#### **Details:**
- **Inputs:**
  - `env`: The environment (e.g., a Gym environment).
  - `policy`: The policy, which determines the action to take in each state.

- **Steps:**
  1. **Initialize trajectory tracking:**
     - `trajectory`: List to store the trajectory.
     - `state`: Starting state, initialized by resetting the environment.
     - `done`: Boolean to track when the episode ends, initialized to `False`.
     - `sar`: List that temporarily holds the current state, action, and reward.

  2. **Iterate until the episode ends (`done` becomes `True`):**
     - Use `choose_action(state, policy)` to select an action based on the current state and policy.
     - Execute the action using `env.step(action)`:
       - Updates the `state`.
       - Returns the reward (`reward`) and whether the episode is done (`done`).
     - Append the current state, action, and reward to `sar`, and add it to the trajectory.
     - Prepare `sar` for the next step by including the new `state`.

  3. **Return the trajectory:**
     - The trajectory is a list of all the state-action-reward triples observed in the episode.

#### **Output:**
- `trajectory`: A list of lists, where each inner list is a `[state, action, reward]` triple.
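
To make the shape of the output concrete, here is a minimal sketch that runs the same loop on a toy environment. `CorridorEnv` and the dictionary-based `choose_action` are hypothetical stand-ins invented for this example (the notebook does not define them), and the environment is assumed to return the 4-tuple `(next_state, reward, done, info)` from `step`.

```python
class CorridorEnv:
    """Hypothetical 3-cell corridor: states 0, 1, 2; state 2 is terminal."""
    state_space = [0, 1, 2]

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves one cell to the right; any other action stays put.
        # Every step yields a reward of -1.
        if action == 1:
            self.state += 1
        done = self.state == 2
        return self.state, -1, done, {}


def choose_action(state, policy):
    # Assumed policy format: a dict mapping state -> action (deterministic).
    return policy[state]


def get_trajectory(env, policy):
    trajectory = []
    state = env.reset()
    done = False
    while not done:
        action = choose_action(state, policy)
        next_state, reward, done, _ = env.step(action)
        trajectory.append([state, action, reward])
        state = next_state
    return trajectory


trajectory = get_trajectory(CorridorEnv(), {0: 1, 1: 1})
print(trajectory)  # [[0, 1, -1], [1, 1, -1]]
```

Each entry pairs the state the agent was in with the action it took there and the reward that followed; the terminal state itself never appears as the first element of an entry.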

---

### **2. Function: `first_visit_mc(env, policy, gamma, n_trajectories)`**

This function estimates the state-value function \( V(s) \) using the **First-Visit Monte Carlo Prediction** algorithm.

#### **Details:**
- **Inputs:**
  - `env`: The environment.
  - `policy`: The policy to follow for generating trajectories.
  - `gamma`: Discount factor, which weighs future rewards relative to current rewards.
  - `n_trajectories`: Number of trajectories to simulate.

- **Steps:**
  1. **Initialize data structures:**
     - `returns`: Dictionary to store the list of returns for each state.
     - `v`: Dictionary to store the estimated value of each state.

  2. **Simulate trajectories:**
     - For each trajectory:
       - Generate a trajectory using `get_trajectory(env, policy)`.
       - Update the `returns` dictionary using the `first_visit_return` function (defined in Step 1), which calculates and stores the first-visit return for each state in the trajectory.

  3. **Calculate state values:**
     - For each state \( s \) in the environment's state space:
       - If the state has recorded returns in the `returns` dictionary:
         - Compute the average return for that state, round it to one decimal place, and store it in `v`.

  4. **Return the state-value function \( v \):**
     - Each state in `v` has a value approximated using Monte Carlo sampling.

#### **Output:**
- `v`: A dictionary mapping each visited state to its estimated value \( V(s) \), computed as the average of the returns observed for that state and rounded to one decimal place. States that were never visited have no entry.
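
The averaging step can be illustrated in isolation. The sketch below uses made-up return lists and plain Python arithmetic in place of `np.mean` and `np.round`; the state names are hypothetical.

```python
# Toy data: returns gathered over several episodes (values are made up for illustration).
returns = {"A": [1.0, 3.0], "B": [2.0]}
state_space = ["A", "B", "C"]  # "C" was never visited

v = {}
for s in state_space:
    if s in returns:
        # Average the recorded returns and round to one decimal place.
        v[s] = round(sum(returns[s]) / len(returns[s]), 1)

print(v)  # {'A': 2.0, 'B': 2.0}
```

Because `"C"` has no recorded returns, it is simply left out of `v` rather than being assigned a default value.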

---

### **Key Concepts Used:**

1. **First-Visit Monte Carlo (MC):**
   - Tracks the first occurrence of each state in a trajectory.
   - Uses the total discounted return observed from that point onward to estimate \( V(s) \).

2. **Discount Factor \( \gamma \):**
   - Controls how much future rewards count relative to immediate ones.
   - If \( \gamma = 1 \), future rewards are weighted the same as immediate rewards.
   - If \( \gamma < 1 \), a reward received \( k \) steps later is scaled by \( \gamma^k \).

3. **Policy-Based Sampling:**
   - The `policy` determines the actions taken in each state, influencing the generated trajectories.
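
The first two concepts can be seen together in a small worked example. This is a sketch, not the notebook's implementation: instead of scanning earlier steps to detect a first visit, it walks the episode backward and lets earlier visits overwrite later ones, so the value left for each state is the return from its first visit. The function name and the toy trajectory are assumptions made for illustration.

```python
def first_visit_returns(trajectory, gamma):
    """Map each state to the discounted return from its first visit."""
    G = 0.0
    first = {}
    for state, _, reward in reversed(trajectory):
        G = reward + gamma * G
        # Walking backward, the last write for a state comes from its earliest visit.
        first[state] = G
    return first


# State "A" is visited twice (steps 0 and 2); every step gives reward 1.
episode = [("A", 0, 1), ("B", 0, 1), ("A", 0, 1)]

half = first_visit_returns(episode, gamma=0.5)
full = first_visit_returns(episode, gamma=1.0)
print(half)  # {'A': 1.75, 'B': 1.5}
print(full)  # {'A': 3.0, 'B': 2.0}
```

With \( \gamma = 0.5 \), the return from the first visit to `"A"` is \( 1 + 0.5 \cdot 1 + 0.25 \cdot 1 = 1.75 \); the second visit to `"A"` (return 1.0) is ignored. With \( \gamma = 1 \), the return is just the sum of the remaining rewards.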

---

### **Assumptions and Missing Details:**

- **`choose_action(state, policy)`**:
  - The function selects an action based on the given `state` and `policy`. It must be defined elsewhere.
  
- **`first_visit_return(returns, trajectory, gamma)`**:
  - Updates the `returns` dictionary with first-visit returns for each state in the trajectory.
  - Defined in Step 1; that cell must be run before calling `first_visit_mc`.

- **`env.state_space`:**
  - Assumes the environment provides a property or method to access all possible states (`state_space`).

- **`env.step(action)` Behavior:**
  - Assumes the classic Gym step API, which returns the 4-tuple `(next_state, reward, done, info)`. The newer step API returns `(next_state, reward, terminated, truncated, info)` instead.
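
Two of these assumptions can be filled in with small helpers. Both are hypothetical sketches: the notebook does not say what format `policy` takes, so `choose_action` below assumes a dictionary mapping each state to a list of action probabilities, and `unpack_step` is an invented helper that accepts either a 4-tuple or a 5-tuple step result.

```python
import random


def choose_action(state, policy):
    """Sample an action index from policy[state], a list of action probabilities (assumed format)."""
    probs = policy[state]
    return random.choices(range(len(probs)), weights=probs, k=1)[0]


def unpack_step(result):
    """Normalize a step result to (next_state, reward, done)."""
    if len(result) == 5:
        # Newer format: the episode ends if it terminated or was truncated.
        next_state, reward, terminated, truncated, _ = result
        return next_state, reward, terminated or truncated
    next_state, reward, done, _ = result
    return next_state, reward, done


policy = {"s0": [0.0, 1.0], "s1": [0.5, 0.5]}
a0 = choose_action("s0", policy)  # always 1, since action 0 has probability 0
a1 = choose_action("s1", policy)  # 0 or 1 with equal probability

old_style = unpack_step((1, 0.5, False, {}))
new_style = unpack_step((1, 0.5, False, True, {}))
```

Routing every `env.step(action)` call through a helper like `unpack_step` would let the same trajectory loop run against either step format.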

---



# First-Visit Monte Carlo Prediction implementation

In [6]:
import numpy as np
from collections import defaultdict

class MonteCarloPrediction:
    def __init__(self, env, policy, gamma=0.9):
        self.env = env  # Environment
        self.policy = policy  # Fixed policy
        self.gamma = gamma  # Discount factor

    def generate_episode(self):
        """Generate an episode following the policy."""
        episode = []
        state = tuple(self.env.reset())  # Convert state to tuple
        while True:
            action = self.policy(state)
            next_state, reward, done, _ = self.env.step(action)
            episode.append((state, action, reward))
            if done:
                break
            state = tuple(next_state)  # Convert to tuple
        return episode

    def first_visit_mc_prediction(self, num_episodes):
        """First-Visit Monte Carlo Prediction."""
        V = defaultdict(float)  # Value function
        returns = defaultdict(list)  # List of returns for each state

        for _ in range(num_episodes):
            episode = self.generate_episode()
            visited_states = set()  # Track first visits

            G = 0  # Initialize return
            for t in reversed(range(len(episode))):
                state, _, reward = episode[t]
                G = reward + self.gamma * G

                # Ensure state is a tuple
                state = tuple(state)

                if state not in visited_states:
                    visited_states.add(state)
                    returns[state].append(G)
                    V[state] = np.mean(returns[state])  # Update value as mean of returns

        return V


In [7]:
import gym

env = gym.make('CartPole-v1', new_step_api=True)


In [8]:
import gym

# Define a simple random policy
def random_policy(state):
    return np.random.choice([0, 1])  # Example: 2 actions (0 and 1)

# Create a simple environment
env = gym.make('CartPole-v1')

# Monte Carlo Prediction
mc = MonteCarloPrediction(env, policy=random_policy, gamma=0.9)
value_function = mc.first_visit_mc_prediction(num_episodes=5000)

# Print value function
for state, value in sorted(value_function.items()):
    print(f"State {state}: {value:.2f}")


Streaming output truncated to the last 5000 lines.
State (0.13384412, 0.3986499, -0.09965166, -0.56995666): 7.18
State (0.13386795, 0.7983499, -0.15205172, -1.2055591): 2.71
State (0.13387848, 0.82863194, -0.19011773, -1.4206879): 1.00
State (0.13388383, 0.75465256, -0.11373794, -1.094881): 3.44
State (0.13390788, -0.16936526, -0.03147144, 0.36932886): 8.33
State (0.13391201, 0.95649666, -0.14537077, -1.5195075): 2.71
State (0.13391852, 0.39306858, -0.20049277, -0.9159028): 1.00
State (0.13391861, -0.9318131, -0.10966974, 1.0619614): 10.00
State (0.13393472, -0.03851165, -0.074052066, 0.073452316): 9.66
State (0.1339362, -0.009964, 0.15566032, 0.8401755): 2.71
State (0.1339473, 0.5351117, 0.009444428, -0.2950372): 7.71
State (0.1339595, 0.19596438, -0.14841439, -0.47323814): 7.18
State (0.13397771, -0.0009107997, -0.14564674, -0.13838315): 7.46
State (0.13397907, 0.04180865, -0.11426748, -0.13543847): 6.86
State (0.13398579, 0.3797067, -0.19477639, -0.9106308): 1.00
State

In [10]:
import gym
import numpy as np
from collections import defaultdict

class MonteCarloPrediction:
    def __init__(self, env, policy, gamma=0.9):
        self.env = env  # Environment
        self.policy = policy  # Policy to follow
        self.gamma = gamma  # Discount factor

    def generate_episode(self):
        """Generate an episode following the policy."""
        episode = []
        state = tuple(self.env.reset())  # Convert state to tuple
        while True:
            action = self.policy(state)
            next_state, reward, terminated, truncated, _ = self.env.step(action)  # Updated step format
            episode.append((state, action, reward))
            if terminated or truncated:  # Check both conditions for episode end
                break
            state = tuple(next_state)  # Convert to tuple
        return episode

    def first_visit_mc_prediction(self, num_episodes=5000):
        """First-visit Monte Carlo prediction."""
        value_function = defaultdict(float)  # Initialize value function
        returns = defaultdict(list)  # Track returns for each state

        for _ in range(num_episodes):
            episode = self.generate_episode()
            visited_states = set()
            G = 0  # Return for each state

            for state, _, reward in reversed(episode):  # Iterate in reverse
                G = reward + self.gamma * G  # Calculate return
                if state not in visited_states:
                    visited_states.add(state)
                    returns[state].append(G)  # Save return
                    value_function[state] = np.mean(returns[state])  # Update value function

        return value_function

# Define a simple random policy
def random_policy(state):
    return np.random.choice([0, 1])  # Example: 2 actions (0 and 1)

# Create the environment
env = gym.make('CartPole-v1', new_step_api=True)

# Monte Carlo Prediction
mc = MonteCarloPrediction(env, policy=random_policy, gamma=0.9)
value_function = mc.first_visit_mc_prediction(num_episodes=5000)

# Print value function
print("Value Function:")
for state, value in sorted(value_function.items()):
    print(f"State {state}: {value:.2f}")


Streaming output truncated to the last 5000 lines.
State (0.13037845, 1.176968, -0.14096747, -1.8228403): 1.90
State (0.13039061, 0.9817968, -0.118013546, -1.531115): 2.71
State (0.1303936, 0.94720554, -0.15125403, -1.5153292): 1.90
State (0.13039792, 1.2036198, -0.11331709, -1.5203013): 2.71
State (0.13044901, 0.011552827, -0.16183719, -0.30048883): 5.22
State (0.13044935, 0.63466436, -0.18322423, -1.1589229): 1.90
State (0.13047941, 0.56351906, -0.13617203, -0.93899393): 3.44
State (0.13050568, 0.99712324, -0.09466941, -1.5576538): 3.44
State (0.13053787, 1.1453621, -0.1349135, -1.7202793): 1.90
State (0.13055335, 0.9475735, -0.09105829, -1.1210469): 4.10
State (0.1305621, 1.2054038, -0.1091775, -1.7732229): 2.71
State (0.13056439, -0.011358449, 0.05564978, 0.74373955): 5.22
State (0.13057971, 0.99916416, -0.17895214, -1.7340822): 1.00
State (0.13058828, 0.38187787, -0.17775482, -0.80238044): 1.90
State (0.13061063, 0.99138457, -0.20440462, -1.5692043): 1.00
State (0.13