# Monte Carlo

We have previously explored [Policy Iteration](https://github.com/priyammaz/PyTorch-Adventures/blob/main/PyTorch%20for%20Reinforcement%20Learning/Intro%20to%20Reinforcement%20Learning/Model-Based%20Learning/intro_rl_and_policy_iter.ipynb) and [Value Iteration](https://github.com/priyammaz/PyTorch-Adventures/blob/main/PyTorch%20for%20Reinforcement%20Learning/Intro%20to%20Reinforcement%20Learning/Model-Based%20Learning/value_iteration.ipynb), and I will assume all of that as a prereq to this! We leveraged these iterative techniques to solve the Bellman Equation so we could play the game Frozen Lake:

![image](https://github.com/priyammaz/PyTorch-Adventures/blob/main/src/visuals/frozen_lake.gif?raw=true)

Although we were able to successfully build our policy (which determines the optimal action to take at a state), there is a major limitation to Value/Policy Iteration: you need the MDP! In our Frozen Lake game, we had the following provided to us by the ```gymnasium``` package:

```python
state1: {action1:[(probability, next_state, reward, done),
                  (probability, next_state, reward, done),
                  (probability, next_state, reward, done)],

         action2:[(probability, next_state, reward, done),
                  (probability, next_state, reward, done),
                  (probability, next_state, reward, done)],
        ...}
state2: {action1:[(probability, next_state, reward, done),
                  (probability, next_state, reward, done),
                  (probability, next_state, reward, done)],

         action2:[(probability, next_state, reward, done),
                  (probability, next_state, reward, done),
                  (probability, next_state, reward, done)],
        ...}
...
```

This is the MDP of the game, which describes everything we need to know about the environment: for every state in the game, and for every action I take, what reward will I get and where could I end up? This is a problem, because what if we need to solve an environment where we don't have the MDP? Unfortunately, Policy and Value Iteration no longer work.
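To make that structure concrete, here is a minimal sketch of a hand-written transition table in the same shape. The states, probabilities, and rewards are invented for illustration (they are not Frozen Lake's actual values), and `expected_reward` is a hypothetical helper, not part of `gymnasium`.

```python
# Toy transition table: state -> action -> list of
# (probability, next_state, reward, done) outcomes.
# All numbers are made up for illustration only.
toy_mdp = {
    0: {
        0: [(1.0, 0, 0.0, False)],                        # action 0: stay put
        1: [(0.8, 1, 0.0, False), (0.2, 0, 0.0, False)],  # action 1: usually move right
    },
    1: {
        0: [(1.0, 0, 0.0, False)],
        1: [(0.9, 2, 1.0, True), (0.1, 1, 0.0, False)],   # reaching state 2 ends the game
    },
}

def expected_reward(mdp, state, action):
    """Expected immediate reward for taking `action` in `state`."""
    return sum(prob * reward for prob, _, reward, _ in mdp[state][action])

print(expected_reward(toy_mdp, 1, 1))  # 0.9
```

With a table like this, expectations can be computed directly by summing over outcomes. Without it, the only option is to sample outcomes by actually playing.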

In both Policy and Value Iteration, we estimate the values of the different states, and we could do this by iteratively sweeping over the MDP. Now we have to leverage something else: Interactive Learning.

## Model Free Learning

If you don't have the model of the game, it's called Model Free Learning. This is the more practical case: in most real-world situations, you don't have some nice dictionary describing all the states and actions. Instead, we interact with the environment and learn from experience, without ever writing down its dynamics.

This literally means: send your agent out into the game blind, over and over again, and slowly learn the game (and in the end estimate the values of the states).

The method we will explore today is the Monte Carlo method.

### Monte Carlo Value Estimation

Monte Carlo experiments are a method of randomly sampling many times and then using the information gained from the samples to produce an estimate. For us, this means sending our agent out onto the frozen lake many times and logging the trajectory each time. Initially, the choices the agent makes are basically random, but we can use this information to update our knowledge of the game and improve the decision making.
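As a small illustration of the sampling idea on its own (outside of reinforcement learning), the sketch below estimates the expected value of a fair six-sided die, which is 3.5, purely from random rolls. The function name `estimate_die_mean` is made up for this example.

```python
import random

def estimate_die_mean(num_samples, seed=0):
    """Estimate the expected value of a fair six-sided die by sampling."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rolls = [rng.randint(1, 6) for _ in range(num_samples)]
    return sum(rolls) / num_samples

# More samples should generally land closer to the true mean of 3.5
for n in (10, 1_000, 100_000):
    print(n, estimate_die_mean(n))
```

The same principle applies to returns: we never compute the expectation exactly, we average what we actually observed.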

When doing our Policy and Value iterations, remember that we never did ```env.step()``` when training our models. We only looked at our MDP and then once we had our best policy we applied it to the game. So then what are the steps of Monte Carlo?


#### Steps:

1) The agent will interact with the environment. Every step will produce a tuple $(S_t, A_t, R_{t+1}, S_{t+1})$, which is a fancy way of saying: where am I now ($S_t$), what action did I take ($A_t$), what reward did I get ($R_{t+1}$), and where did I end up ($S_{t+1}$). We will then get a sequence of these until the game ends (the agent reaches the prize or falls in a hole).
2) For every state/action pair we visited, we will estimate the Q function $Q(s,a)$. What we want to estimate is the expected return of taking a specific action at a specific state, but how can we do this without the MDP? Well, we just completed a trajectory, so I know exactly what happened from start to end. This means I can compute the future rewards at every state based on the experience I just had! There is obviously a lag here, since I can only update and learn once I have completed a full trajectory. We will solve that problem later!
3) We can repeat these trajectories a bunch of times, greedily updating our policy as we go!
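The steps above can be sketched end to end on a toy problem before we touch Frozen Lake. This is only an illustration under stated assumptions: the "corridor" environment, its `step` function, and every name below are invented for this example, and the agent acts uniformly at random while collecting data.

```python
import random
from collections import defaultdict

GOAL, ACTIONS = 3, (0, 1)  # toy corridor with states 0..3; 0 = left, 1 = right

def step(state, action):
    """Deterministic toy dynamics: reward 1.0 only on reaching the goal."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

def run_episode(rng, max_steps=50):
    """Interact with a uniformly random policy and log the experience."""
    state, trajectory = 0, []
    for _ in range(max_steps):
        action = rng.choice(ACTIONS)
        next_state, reward, done = step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory

def estimate_q(num_episodes=5000, gamma=0.9, seed=0):
    """Average the first-visit return of every (state, action) pair."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(num_episodes):
        G, first_visit = 0.0, {}
        for state, action, reward in reversed(run_episode(rng)):
            G = reward + gamma * G
            first_visit[(state, action)] = G  # earliest visit is written last
        for key, value in first_visit.items():
            samples[key].append(value)
    return {key: sum(v) / len(v) for key, v in samples.items()}

Q = estimate_q()

# Greedy improvement over the non-terminal states
policy = {s: max(ACTIONS, key=lambda a: Q.get((s, a), 0.0)) for s in range(GOAL)}
print(policy)
```

Even though the data came from random behaviour, the averaged returns are enough to see that moving right is the better action in every state of this corridor.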

### Let's Generate Trajectories

It would be helpful to have a function that can generate trajectories!

In [1]:
import numpy as np
import gymnasium as gym
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm

env = gym.make("FrozenLake-v1", is_slippery=True)

def sample_trajectory(pi, env, max_steps=50, epsilon=0.1):

    """
    In this method we will play the game until 
    its either over or we hit max steps according
    to some policy PI

    Args:
        pi: The current policy
        env: The Game
        max_steps: Truncate trajectories longer than this
        epsilon: Inject some randomness for exploration
    """

    done = False
    trajectory = []
    num_steps = 0

    ### Start New Game ###
    state, _ = env.reset()
    
    while not done:

        ### Select Action According to Policy ###
        if np.random.rand() < epsilon:
            action = env.action_space.sample()  # Explore random action
        else:
            action = pi[state]  # Exploit best known action

        ### Take a step in the environment ###
        ### (gymnasium reports terminated and truncated separately) ###
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        ### Create and Store your Experience Tuple ###
        experience = (state, action, reward, next_state, done)
        trajectory.append(experience)

        ### Iterate ###
        num_steps += 1

        ### Stop Early if the Trajectory is Too Long ###
        if num_steps >= max_steps:
            break

        ### Update current State to the Next State ###
        state = next_state

    return trajectory

### Initialize a Random Policy ###
policy = np.random.choice(env.action_space.n, size=(env.observation_space.n, ))
trajectory = sample_trajectory(policy, env)

print(trajectory)

[(0, 1, 0.0, 4, False), (4, 0, 0.0, 4, False), (4, 1, 0.0, 5, True)]


### Compute Future Returns

Now that we can generate trajectories, we need to compute the discounted future return at every state we visited. A question you may have is: what if we visited a state multiple times in one episode? There are two variants:

- First Visit Monte Carlo: Only use the return from the first time you reach a state in an episode
- Every Visit Monte Carlo: Average the returns from every time you visited a state in an episode

Today we will just do First Visit, but either would work!
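To see how the two variants differ, here is a small sketch on a hand-written episode in which one state is visited twice. For brevity it tracks states only (not state/action pairs), and the episode, the discount of 0.5, and the function name are all made up for illustration.

```python
def first_and_every_visit_returns(rewards_by_step, gamma=0.99):
    """
    rewards_by_step: list of (state, reward) pairs in the order they occurred,
    where reward is what we received after acting in that state.
    Returns (first_visit, every_visit) dictionaries keyed by state.
    """
    G = 0.0
    first_visit, all_returns = {}, {}
    for state, reward in reversed(rewards_by_step):
        G = reward + gamma * G
        first_visit[state] = G  # keeps being overwritten until the earliest visit
        all_returns.setdefault(state, []).append(G)
    every_visit = {s: sum(g) / len(g) for s, g in all_returns.items()}
    return first_visit, every_visit

# Hypothetical episode: state 4 is visited twice before the goal pays out 1.0
episode = [(0, 0.0), (4, 0.0), (4, 0.0), (8, 1.0)]
first, every = first_and_every_visit_returns(episode, gamma=0.5)
print(first[4], every[4])  # 0.25 0.375
```

The first visit to state 4 is three steps from the reward, so it sees the smaller return. Every Visit averages that with the return from the second visit.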

#### Reminder: How to Compute Returns

Remember that $G_t$ is computed as follows:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$$

#### Computing returns $G_t$ (pretend we have 3 steps, $t = 0$ to $t = 2$):

- **At $t = 2$ (last step):**  
  $$
  G_2 = R_3
  $$

- **At $t = 1$ (second-to-last step):**  
  $$
  G_1 = R_2 + \gamma G_2
  $$

- **At $t = 0$ (first step):**  
  $$
  G_0 = R_1 + \gamma G_1
  $$
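The backward recursion above can be checked with concrete numbers. In this sketch the three rewards are made up (the prize only arrives on the last step) and $\gamma = 0.99$:

```python
gamma = 0.99
rewards = [0.0, 0.0, 1.0]  # R_1, R_2, R_3 (invented values)

G = 0.0
returns = []
for reward in reversed(rewards):
    G = reward + gamma * G   # G_t = R_{t+1} + gamma * G_{t+1}
    returns.append(G)
returns.reverse()            # back in time order: [G_0, G_1, G_2]

print(returns)  # roughly [0.9801, 0.99, 1.0]
```

Working backwards means each return reuses the one after it, so the whole trajectory is handled in a single pass.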

In [2]:
def compute_returns(trajectory, gamma=0.99):

    ### Create Empty Dictionary to Store Returns ###
    returns = {}

    ### Initialize Returns ###
    G = 0

    ### Reverse and Compute Returns ###
    for t in reversed(trajectory):

        state, action, reward, _, _ = t

        ### Compute new G ###
        G = reward + gamma * G

        ### First Visit Check ###
        ### We walk backwards, so always overwriting leaves the return ###
        ### from the EARLIEST visit to this (state, action) pair ###
        returns[(state, action)] = G

    return returns

compute_returns(trajectory)

{(4, 1): 0.0, (4, 0): 0.0, (0, 1): 0.0}

### Monte Carlo Estimation of Q

Now that we can generate trajectories, we will do it over and over again to estimate the Q function. For each trajectory we compute its discounted returns (first visit only) and store them. Then we average all the stored returns for every state/action pair and use that as our estimate of Q.

In [3]:
def monte_carlo_estimation(pi,
                           env, 
                           gamma=0.99, 
                           max_steps=50,
                           num_episodes=5000):

    ### Start Empty Q Values ###
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    returns = {(s, a): [] for s in range(env.observation_space.n) for a in range(env.action_space.n)}

    for _ in range(num_episodes):

        ### Sample a Trajectory ###
        trajectory = sample_trajectory(pi, env, max_steps)

        ### Compute Returns for Episode ###
        returns_for_episode = compute_returns(trajectory, gamma)

        ### Update Returns ###
        for (state,action), G in returns_for_episode.items():
            returns[(state,action)].append(G)

    ### Loop Through All the Accumulated Returns, Average Them Up and Update Q ###
    for (state, action), returns_list in returns.items():

        if len(returns_list) > 0:
            Q[state, action] = np.mean(returns_list)

    return Q
    
monte_carlo_estimation(policy, env)

array([[0.00000000e+00, 1.87702338e-04, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 3.81595812e-03, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 6.75409672e-03, 2.63130282e-04, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 5.84146222e-04, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 1.05956757e-03, 0.00000000e+00, 0.00000000e+00],
       [3.66666667e-02, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [2.14667920e-03, 0.00000000e+00, 0.00000000e

### Update Our Policy

Now that we have an estimate of our Q function, we can update our policy according to it. There are a few ways to do this, but the easiest is a greedy selection of the highest-value action in every state.
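As a small illustration of greedy selection, assume Q is stored as a NumPy array with one row per state and one column per action. The values below are made up to show two edge cases: ties, and states that were never visited.

```python
import numpy as np

# Made-up Q table: 3 states (rows) x 4 actions (columns)
Q_toy = np.array([
    [0.0, 0.2, 0.1, 0.0],   # best action is 1
    [0.5, 0.0, 0.0, 0.5],   # tie between 0 and 3: argmax keeps the first
    [0.0, 0.0, 0.0, 0.0],   # never visited: falls back to action 0
])

greedy_policy = np.argmax(Q_toy, axis=-1)
print(greedy_policy)  # [1 0 0]
```

Note that an all-zero row silently maps to action 0, so a 0 in a learned policy can mean either "action 0 is best" or "we never learned anything about this state".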

In [4]:
def policy_improvement(Q):

    """
    Greedily select the action with the highest value at every state
    """
    return np.argmax(Q, axis=-1)

### Toggle Between the Two

Just like in Policy Iteration, where we toggled between estimating the value function and then updating our policy, we will be toggling between estimating our Q function and then updating our policy.

**NOTE**

There are again tons of ways to do this. We could also have done this online, updating the policy at the same time as we estimate Q. It's really up to you! I am going for this toggle approach because it lets me get a good estimate of Q first and then update the policy according to that.
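One building block an online variant would likely need is a running average, which folds each new return into the estimate without storing the full history. This is a minimal sketch, not the approach used in this notebook. The helper name and the sample returns are made up for illustration.

```python
def update_running_mean(mean, count, new_return):
    """Fold one new return into a running average without storing the history."""
    count += 1
    mean += (new_return - mean) / count
    return mean, count

sampled_returns = [0.0, 1.0, 0.0, 0.5]  # made-up returns for one (state, action) pair
mean, count = 0.0, 0
for G in sampled_returns:
    mean, count = update_running_mean(mean, count, G)

# Both numbers should agree up to floating point error
print(mean, sum(sampled_returns) / len(sampled_returns))
```

Because the estimate is always up to date, the policy could in principle be improved after every episode instead of after a full batch.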

In [5]:
def monte_carlo_policy_iteration(env, 
                                 gamma=0.99, 
                                 max_steps=50, 
                                 num_episodes=10000):

    # Start with a random policy
    policy = np.random.choice(env.action_space.n, size=(env.observation_space.n, ))

    while True:
        # Policy Evaluation: Estimate Q(s,a) for the current policy
        values = monte_carlo_estimation(policy, env, gamma, max_steps, num_episodes)

        # Policy Improvement: Generate a new policy based on the estimated Q(s,a)
        new_policy = policy_improvement(values)

        # If the policy stops changing, we have converged
        if np.array_equal(policy, new_policy):
            break

        policy = new_policy  # Update policy for next iteration

    return policy, values

optimal_policy, optimal_values = monte_carlo_policy_iteration(env)

print(optimal_policy)

[0 3 1 3 0 0 2 0 3 1 0 0 0 2 1 0]


### Let's Test Out Our Policy!

In [6]:
def test_policy(policy, env, num_episodes=500):
    success_count = 0

    for _ in range(num_episodes):
        state, _ = env.reset()
        done = False

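        while not done:
            action = policy[state]
            state, reward, terminated, truncated, _ = env.step(action)

            # Stop on termination (goal or hole) or truncation (time limit)
            done = terminated or truncated

            if terminated and reward == 1.0:  # Reached the goal
                success_count += 1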

    success_rate = success_count / num_episodes
    print(f"Policy Success Rate: {success_rate * 100:.2f}%")
    return success_rate

# Test the learned policy
test_policy(optimal_policy, env)

Policy Success Rate: 77.20%


0.772