![banner](https://github.com/priyammaz/PyTorch-Adventures/blob/main/src/visuals/rl_banner.png?raw=true)

# On-Policy TD Learning: SARSA

By now you should understand the [Monte Carlo](https://github.com/priyammaz/PyTorch-Adventures/blob/main/PyTorch%20for%20Reinforcement%20Learning/Intro%20to%20Reinforcement%20Learning/Model-Free%20Learning/monte_carlo.ipynb) method of solving a Model-Free RL problem! But Monte Carlo has one major issue:

**You have to complete a full trajectory (wait for an entire episode to finish) before you can start updating your estimates of the Q values**

This restriction makes Monte Carlo impractical in complex scenarios. In the easy Frozen Lake example we have been working on, the time it takes to get through a full trajectory (either we fall into a hole or reach the goal) is not really a problem. But imagine a significantly larger game board: each episode would take forever, and we need thousands of them before we have a helpful estimate! So what do we do?

### TD Learning

Temporal-Difference Learning is a method that, instead of learning at the end of a full trajectory, learns at every step along each trajectory. This works because TD Learning uses the current estimate of future rewards rather than waiting for the full episode to finish.

##### Reminder: Monte Carlo 

In the Monte Carlo method we completed a full trajectory and then computed the expected future rewards for every state that we touched using the following formula we have always had for our returns:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ...$$

We then used the returns computed from our trajectory to update our $Q$ function:

$$Q(s,a) = Q(s,a) + \alpha_t(G - Q(s,a))$$
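As a minimal sketch of the two formulas above, the returns for a finished episode can be computed by walking backward through its rewards. The function name `discounted_returns`, the toy `episode`, and the values of `gamma` and `alpha` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_t for every step t, where rewards[t] holds R_{t+1}."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Walk backward so each G_t reuses G_{t+1}: G_t = R_{t+1} + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical finished episode: one (state, action, reward) tuple per step
episode = [(0, 2, 0.0), (1, 2, 0.0), (2, 1, 1.0)]
gamma, alpha = 0.9, 0.1

Q = np.zeros((4, 4))
G = discounted_returns([r for _, _, r in episode], gamma)

# Monte Carlo update: only possible once the whole episode is known
for (s, a, _), g in zip(episode, G):
    Q[s, a] += alpha * (g - Q[s, a])
```

Note that nothing in `Q` can change until the last reward of the episode has been observed.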

The problem is that to compute $G$, we need to have fully completed the trajectory. So instead of using the full trajectory as our estimate, can we just use the current $Q$ value instead?

### TD Error

TD Methods update the estimates based on the TD Error, which is the difference between the TD Target and our current estimate $Q(s,a)$. The TD Target is an estimate of the return, and the setup is the same as what we had before:

$$Q(s,a) = Q(s,a) + \alpha_t(\text{TD Target} - Q(s,a))$$

But what is the TD Target? It depends on what you are doing!

**TD Targets**:

- Monte Carlo: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ...$
- SARSA: $r + \gamma Q(s', a')$ (uses the next action from the current policy: On Policy)
- Q-Learning: $r + \gamma \max_{a'} Q(s', a')$ (uses the max Q value of the next state: Off Policy)
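To make the update concrete, here is a small sketch of a single SARSA step on a toy Q table. The transition values and the helper name `td_update` are assumptions chosen for illustration.

```python
import numpy as np

def td_update(Q, state, action, td_target, alpha):
    """Move Q[state, action] a fraction alpha toward the TD target."""
    td_error = td_target - Q[state, action]
    Q[state, action] += alpha * td_error
    return td_error

alpha, gamma = 0.1, 0.99
Q = np.zeros((3, 2))
Q[1] = [0.5, 0.2]  # pretend we already have estimates for state 1

# Hypothetical transition: from state 0, action 1 gives reward 0 and lands in
# state 1, where the policy picks action 0 next
state, action, reward, next_state, next_action = 0, 1, 0.0, 1, 0

sarsa_target = reward + gamma * Q[next_state, next_action]
td_error = td_update(Q, state, action, sarsa_target, alpha)
```

The update happens immediately after one step, using only `Q[next_state, next_action]` in place of the rest of the episode.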

But nothing else changed! We are still doing what we had done in Monte Carlo, but in both Q-Learning and SARSA, we don't have to wait for the trajectory to be done, we can learn as we go. This only works though because of Bootstrapping:

### Bootstrapping

Bootstrapping is a commonly used technique in statistics. It is a procedure to estimate something by using another estimate. For example, if we want to estimate something about a population, it's typically impossible to measure it directly as the population could be very large. Instead we grab small samples from our population and try to learn things about the smaller samples. Hopefully with enough small samples you can then say something about your population!

Similarly, in TD Learning, instead of waiting for the future to unfold (going through the full trajectory like Monte Carlo), we just use the current guess of the $Q$ values to build our TD Target. TD learning therefore uses bootstrapping, because it updates Q-values using estimates (not actual returns like Monte Carlo). This sounds good, but it has its Pros and Cons:

Pros:

- Faster Learning (as we don't need to complete the full trajectory every time)
- Allows us to work with partial episodes
- Memory efficient, we don't care about early parts of the trajectory once we are done with them

Cons: 

- More challenging convergence (as we are using our own, possibly inaccurate, estimates in place of the actual returns)
- Needs more focus on exploration vs exploitation
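The bootstrapping step can be written out explicitly. This is a standard identity rather than something specific to this notebook: the return satisfies a one-step recursion, and TD learning replaces the unknown tail with the current estimate.

$$G_t = R_{t+1} + \gamma \left(R_{t+2} + \gamma R_{t+3} + ...\right) = R_{t+1} + \gamma G_{t+1}$$

$$G_{t+1} \approx Q(s', a') \quad \Rightarrow \quad G_t \approx r + \gamma Q(s', a')$$

The first line is exact; the second is where the approximation enters. While $Q$ is still inaccurate the target is biased, which is the convergence concern listed in the cons.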

### On Policy

We will be implementing SARSA today, but Q-Learning is basically the same with one key difference: On vs Off Policy!

Let's look at our TD Targets again:

- SARSA: $r + \gamma Q(s', a')$
- Q-Learning: $r + \gamma \max_{a'} Q(s', a')$

In SARSA, we will select the action $a'$ that our current policy prescribes for the state $s'$. Because the target follows the policy we are actually running, this is an On-Policy method.

In Q-Learning, we will select the action $a'$ that has the max $Q$ value at that state $s'$ (regardless of the action we actually take!). This action may not be the one our policy picks, therefore this is known as an Off-Policy method! This way the agent can learn the optimal policy independent of the actions it took. (We will do this next time!)
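The difference between the two targets only shows up when the policy explores. The sketch below uses made-up Q values for a single next state $s'$; the function names and numbers are assumptions for illustration.

```python
import numpy as np

gamma = 0.99
reward = 0.0
# Hypothetical Q values for the next state s' (one entry per action)
Q_next = np.array([0.1, 0.7, 0.3, 0.2])

def sarsa_target(reward, Q_next, next_action, gamma):
    # On-policy: bootstrap from the action the agent will actually take
    return reward + gamma * Q_next[next_action]

def q_learning_target(reward, Q_next, gamma):
    # Off-policy: bootstrap from the greedy action, whatever is taken next
    return reward + gamma * np.max(Q_next)

greedy_action = int(np.argmax(Q_next))  # action 1
exploratory_action = 2                  # e.g. picked at random by epsilon-greedy

same = sarsa_target(reward, Q_next, greedy_action, gamma)
different = sarsa_target(reward, Q_next, exploratory_action, gamma)
off_policy = q_learning_target(reward, Q_next, gamma)
```

When the next action happens to be greedy, `same` equals `off_policy`; when it is exploratory, the SARSA target (`different`) is lower because it accounts for the exploratory move.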

### Let's Implement SARSA

An important change here (that we could have also made in our Monte Carlo implementation) is that we won't store an explicit policy. We know that our policy is just the argmax of $Q$, so updating the Q table is enough: we can always recover the policy from it.

In [1]:
import numpy as np
import gymnasium as gym

def epsilon_greedy(Q, state, epsilon, env):

    ### Either sample a random action (explore) or 
    ### Use best action according to Q Table 
    
    if np.random.rand() < epsilon:
        return env.action_space.sample() 
    else:
        return np.argmax(Q[state])
        
def sarsa(env, 
          num_episodes=25000, 
          alpha=0.1, 
          gamma=0.99, 
          epsilon=0.1):

    Q = np.zeros((env.observation_space.n, env.action_space.n))

    for episode in range(num_episodes):
        
        state, _ = env.reset()

        ### Select an Action to Start Out ###
        action = epsilon_greedy(Q, state, epsilon, env)
            
        ### Loop Until Done ###
        done = False
        while not done:

            ### Take the Action ###
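            ### (gymnasium returns separate terminated/truncated flags; stop on either) ###
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated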

            ### Select the Next Action from the New State ###
            next_action = epsilon_greedy(Q, next_state, epsilon, env)

            ### SARSA Update Rule (On policy as we are using the Q value from the actual action taken) ###
            td_target = reward + gamma * Q[next_state, next_action]
            Q[state, action] += alpha * (td_target - Q[state, action])
            
            ### Update Current State/Action ###
            state, action = next_state, next_action

    return Q

env = gym.make("FrozenLake-v1", is_slippery=True)
Q_sarsa = sarsa(env)

policy = np.argmax(Q_sarsa, axis=-1)
print(policy)

[0 3 0 3 0 0 2 0 3 1 0 0 0 2 1 0]
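To read the printed policy more easily, the action indices can be laid out on the board. This sketch assumes the default 4x4 FrozenLake map and its action encoding (0 = left, 1 = down, 2 = right, 3 = up); the helper name `render_policy` is hypothetical. Entries for holes and the goal are replaced by the tile letter, since no action is ever taken from those states and their rows in the Q table stay at zero.

```python
import numpy as np

# Default 4x4 FrozenLake layout: S = start, F = frozen, H = hole, G = goal
DEFAULT_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
ARROWS = {0: "<", 1: "v", 2: ">", 3: "^"}

def render_policy(policy, board=DEFAULT_MAP):
    """Return one string per board row, with an arrow for each non-terminal tile."""
    n_cols = len(board[0])
    rows = []
    for r, tiles in enumerate(board):
        cells = []
        for c, tile in enumerate(tiles):
            if tile in "HG":
                cells.append(tile)  # terminal tile: no action is ever taken here
            else:
                cells.append(ARROWS[int(policy[r * n_cols + c])])
        rows.append(" ".join(cells))
    return rows

# The policy printed above
printed_policy = np.array([0, 3, 0, 3, 0, 0, 2, 0, 3, 1, 0, 0, 0, 2, 1, 0])
print("\n".join(render_policy(printed_policy)))
```

Some arrows point away from the goal or into walls. On the slippery lake the agent moves in the intended direction only part of the time, so steering away from a hole can be safer than heading straight for the goal.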


### Let's Test our Policy!

In [2]:
def test_policy(policy, env, num_episodes=500):
    success_count = 0

    for _ in range(num_episodes):
        state, _ = env.reset()
        done = False

        while not done:
            action = policy[state]
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated  # stop on a terminal state or the step limit

            if done and reward == 1.0:  # Reached the goal
                success_count += 1

    success_rate = success_count / num_episodes
    print(f"Policy Success Rate: {success_rate * 100:.2f}%")

test_policy(policy, env)

Policy Success Rate: 76.20%
