![banner](https://github.com/priyammaz/PyTorch-Adventures/blob/main/src/visuals/rl_banner.png?raw=true)

# Off Policy TD Learning: Q Learning

Now that we understand SARSA (an On-Policy TD Learning method), let's look at Q Learning (an Off-Policy TD Learning method).

**Reminder: SARSA TD Target:** $r + \gamma * Q(s', a')$

Our goal is always to estimate the future rewards for taking an action. This is why in Monte-Carlo we used actual data (an entire trajectory) to compute what to expect in the future, but this leads to obvious inefficiencies, as discussed before!

Therefore, in TD Learning, instead of using real trajectories to estimate the future returns, we use our **CURRENT ESTIMATE**. The returns are discounted, so this always looks like:

$$\text{Current Reward} + \gamma * \text{Future Reward}$$
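To make the contrast concrete, here is a minimal sketch comparing a Monte-Carlo return (which needs the whole trajectory) with a one-step TD target (which only needs the current reward and our current estimate). The function names, the reward list, and the value estimate of `0.5` are all made up for illustration.

```python
def monte_carlo_return(rewards, gamma):
    """Discounted return computed from a full, finished trajectory of rewards."""
    g = 0.0
    # Work backwards so each step adds its reward to the discounted future return
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def td_target(reward, gamma, next_value_estimate):
    """One-step TD target: current reward plus discounted CURRENT ESTIMATE of the future."""
    return reward + gamma * next_value_estimate

gamma = 0.9
rewards = [0.0, 0.0, 1.0]  # hypothetical episode: the reward only arrives at the end

# Monte-Carlo has to wait for the episode to finish
mc_return = monte_carlo_return(rewards, gamma)

# TD can form a target after a single step, using an assumed estimate of 0.5
one_step_target = td_target(rewards[0], gamma, next_value_estimate=0.5)

print(f"Monte-Carlo return: {mc_return:.3f}")
print(f"TD target:          {one_step_target:.3f}")
```

The TD target is only as good as the estimate it bootstraps from, which is why it has to be refined over many updates.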

But there are a few ways we can estimate our Future Rewards.

So the process is this:

1) Start the game at state $s_1$
2) Select an action $a_1$ using Epsilon Greedy policy
3) Take action $a_1$, receive reward $r$, and end up in state $s_2$
4) Select an action $a_2$ using Epsilon Greedy policy
5) **NOW THE CHOICE**
    - **On Policy:** Use the Q Value for the action $a_2$ that was actually selected by the Epsilon Greedy policy
        -  $Q(s_1, a_1) = Q(s_1, a_1) + \alpha * \left[r + \gamma Q(s_2, a_2) - Q(s_1, a_1)\right]$
        -  The key idea here is that both our action selection in step 2 and our Q value selection for future rewards (the action selected in step 4) were done by the same Epsilon Greedy policy
    - **Off-Policy:** Use the maximum Q Value for the next state, regardless of which $a_2$ was selected
        - $Q(s_1, a_1) = Q(s_1, a_1) + \alpha * \left[r + \gamma \max_{a'}Q(s_2,a') - Q(s_1, a_1)\right]$
        - The key idea here is that our action selection in step 2 used an Epsilon Greedy policy, but our Q value selection for future rewards was Greedy only! This means that, because of the randomness in our Epsilon Greedy strategy, we may not have picked the action with the highest Q value in step 4, but our update rule assumes the best possible action is taken from $s_2$, regardless of what actually happened!

6) Repeat until convergence
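The two update rules above can be compared on a tiny example. This is just a sketch with a made-up 2-state, 2-action Q table and made-up values for $\alpha$, $\gamma$, and $r$. We pretend that the Epsilon Greedy policy explored in step 4 and picked the worse action in $s_2$.

```python
import numpy as np

# Hypothetical Q table: rows are states, columns are actions
Q = np.array([
    [0.0, 0.0],  # state 0 plays the role of s_1
    [0.2, 0.8],  # state 1 plays the role of s_2
])

alpha, gamma = 0.1, 0.9
s1, a1, r, s2 = 0, 0, 0.0, 1
a2 = 0  # assume exploration picked the action with the LOWER Q value

# On-Policy (SARSA): bootstrap from the action that was actually selected
sarsa_target = r + gamma * Q[s2, a2]      # 0.0 + 0.9 * 0.2
# Off-Policy (Q Learning): bootstrap from the best action, ignoring a2
q_target = r + gamma * np.max(Q[s2])      # 0.0 + 0.9 * 0.8

sarsa_new = Q[s1, a1] + alpha * (sarsa_target - Q[s1, a1])
q_new = Q[s1, a1] + alpha * (q_target - Q[s1, a1])

print(f"SARSA      target={sarsa_target:.3f}  new Q(s1, a1)={sarsa_new:.3f}")
print(f"Q Learning target={q_target:.3f}  new Q(s1, a1)={q_new:.3f}")
```

When the selected $a_2$ happens to be the greedy action, the two targets are identical. They only differ on the steps where Epsilon Greedy explores.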

This difference is subtle but important!

### Let's Implement It!

The code here is basically identical to our SARSA code; we just updated our learning step to use the max of the Q table for the next state, regardless of which action was taken!

In [3]:
import numpy as np
import gymnasium as gym

def epsilon_greedy(Q, state, epsilon, env):

    ### Either sample a random action (explore) or 
    ### Use best action according to Q Table 
    
    if np.random.rand() < epsilon:
        return env.action_space.sample() 
    else:
        return np.argmax(Q[state])
        
def q_learning(env, 
               num_episodes=25000, 
               alpha=0.1, 
               gamma=0.99, 
               epsilon=0.1):

    Q = np.zeros((env.observation_space.n, env.action_space.n))

    for episode in range(num_episodes):
        
        state, _ = env.reset()

        ### Select an Action to Start Out ###
        action = epsilon_greedy(Q, state, epsilon, env)
            
        ### Loop Until Done ###
        done = False
        while not done:

            ### Take the Action ###
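            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated  # stop on a terminal state or when the time limit is hit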

            ### Select the Next Action from the New State ###
            next_action = epsilon_greedy(Q, next_state, epsilon, env)

            ### Q Update Rule (Off policy as we are using the Q value of the max regardless of action taken) ###
            Q[state, action] = Q[state, action] + alpha * (reward + \
                                   gamma * np.max(Q[next_state]) - Q[state, action])
            
            ### Update Current State/Action ###
            state, action = next_state, next_action

    return Q

env = gym.make("FrozenLake-v1", is_slippery=True)
Q_qlearning = q_learning(env)

policy = np.argmax(Q_qlearning, axis=-1)
print(policy)

[0 3 0 1 0 0 0 0 3 1 0 0 0 2 1 0]
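The printed policy is a flat array of 16 action indices, which is hard to read. Here is a small sketch of a hypothetical helper, `render_policy`, that lays it out on the grid. It assumes the default 4x4 FrozenLake map and Gymnasium's FrozenLake action encoding (0 = left, 1 = down, 2 = right, 3 = up). The example array is the output printed above, and your own run will likely differ because training is random.

```python
import numpy as np

# Assumed action encoding for FrozenLake: 0 = left, 1 = down, 2 = right, 3 = up
ARROWS = {0: "<", 1: "v", 2: ">", 3: "^"}

# Assumed default 4x4 map: S = start, F = frozen, H = hole, G = goal
DEFAULT_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]

def render_policy(policy, desc=DEFAULT_MAP):
    """Return the greedy policy as a text grid, marking holes (H) and the goal (G)."""
    n_cols = len(desc[0])
    lines = []
    for row_idx, row in enumerate(desc):
        cells = []
        for col_idx, tile in enumerate(row):
            if tile in "HG":
                # The episode ends on these tiles, so the action there is meaningless
                cells.append(tile)
            else:
                cells.append(ARROWS[int(policy[row_idx * n_cols + col_idx])])
        lines.append(" ".join(cells))
    return "\n".join(lines)

example_policy = np.array([0, 3, 0, 1, 0, 0, 0, 0, 3, 1, 0, 0, 0, 2, 1, 0])
grid = render_policy(example_policy)
print(grid)
```

Because the lake is slippery, some of these arrows point away from the goal on purpose: moving "into a wall" can be the safest way to avoid sliding into a hole.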


In [4]:
def test_policy(policy, env, num_episodes=500):
    success_count = 0

    for _ in range(num_episodes):
        state, _ = env.reset()
        done = False

        while not done:
            action = policy[state]
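            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated  # stop on a terminal state or when the time limit is hit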

            if done and reward == 1.0:  # Reached the goal
                success_count += 1

    success_rate = success_count / num_episodes
    print(f"Policy Success Rate: {success_rate * 100:.2f}%")

test_policy(policy, env)

Policy Success Rate: 77.00%
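Keep in mind that this success rate is itself a noisy estimate, since both the slippery environment and the 500 test episodes are random. As a rough sketch, we can treat each episode as an independent success/failure trial and compute the binomial standard error of the rate. The helper name below is made up for illustration.

```python
import math

def success_rate_standard_error(success_rate, num_episodes):
    """Binomial standard error of a success rate, assuming independent episodes."""
    return math.sqrt(success_rate * (1.0 - success_rate) / num_episodes)

# Numbers taken from the run above: 77% success over 500 episodes
se = success_rate_standard_error(0.77, 500)
print(f"Standard error: {se * 100:.2f} percentage points")
```

So differences of a couple of percentage points between runs are expected and should not be read as one policy being better than another.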
