## **DOUBLE Q-LEARNING**

### **Q-Learning**
* Estimates the optimal action-value function.
* Tends to overestimate action values, because each update bootstraps from the maximum Q-value of the next state.
* May therefore learn a suboptimal policy, especially in environments with noisy or stochastic rewards.
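
The overestimation described above can be reproduced without any environment. The following is a minimal sketch, assuming five actions whose true value is exactly zero and Gaussian estimation noise; `n_actions`, `n_trials`, and `noise_std` are illustrative choices, not values from this notebook. It compares taking the max of one noisy estimate against selecting the action with one estimate and reading its value from a second, independent one.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 5        # hypothetical number of actions, all with true value 0
n_trials = 10_000    # number of independent repetitions
noise_std = 1.0      # assumed standard deviation of the estimation noise

# Two independent noisy estimates of the same (all-zero) true action values.
est_a = rng.normal(0.0, noise_std, size=(n_trials, n_actions))
est_b = rng.normal(0.0, noise_std, size=(n_trials, n_actions))

# Single estimator: select and evaluate with the same noisy estimates.
single = est_a.max(axis=1).mean()

# Double estimator: select with est_a, evaluate with est_b.
best = est_a.argmax(axis=1)
double = est_b[np.arange(n_trials), best].mean()

print(f"single estimator: {single:.3f}")
print(f"double estimator: {double:.3f}")
```

The single-estimator average should come out well above zero even though every true value is zero, while the double-estimator average should stay close to zero.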


### **Double Q-Learning**
* Maintains two Q-tables, Q0 and Q1.
* Each table is updated using information from the other, which reduces the risk of overestimating Q-values.\
*The key insight behind Double Q-learning is that splitting the maximization step between two tables (one selects the action, the other evaluates it) gives a less biased estimate of the action-value function.*\
**To update the Q-value for the chosen action:**
* Randomly select a table.\
![image](3.png)

Let's say it picks Q0. It then uses Q0 to determine the best next action, but evaluates that action with Q1: the update target is the observed reward plus the discounted Q1 value of the action that Q0 selected.

$$
Q_0(s,a)=(1-\alpha)Q_0(s,a)+\alpha\left[r+\gamma Q_1\left(s',\arg\max_{a'}Q_0(s',a')\right)\right]
$$

If it picks Q1, the roles are swapped: Q1 determines the best next action, and the Q-value is updated from the observed reward and Q0's estimate of that action's value.
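
To make the update concrete, here is a single step worked through by hand. This is only a sketch: the two tables, the transition, and the values of `alpha` and `gamma` are made up for illustration, and the random draw is assumed to have picked Q0.

```python
import numpy as np

alpha = 0.5   # learning rate (illustrative)
gamma = 0.9   # discount factor (illustrative)

# Hypothetical tables for 2 states and 2 actions.
Q0 = np.array([[0.0, 0.0],
               [1.0, 3.0]])
Q1 = np.array([[0.0, 0.0],
               [2.0, 1.0]])

s, a, r, s_next = 0, 1, 1.0, 1   # one observed transition (made up)

# Suppose the random draw picked Q0 for this update.
a_star = int(np.argmax(Q0[s_next]))        # Q0 selects action 1 (value 3.0)
target = r + gamma * Q1[s_next, a_star]    # Q1 evaluates it: 1 + 0.9 * 1.0
Q0[s, a] = (1 - alpha) * Q0[s, a] + alpha * target

print(a_star, target, Q0[s, a])
```

Standard Q-learning on Q0 alone would have used the target 1 + 0.9 * 3.0 = 3.7 here. Because Q1 rates the selected action lower, the Double Q-learning target is 1.9, and Q0's optimistic value for that action does not propagate.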

### **Double Q-Learning: Summary**
* Reduces overestimation bias.
* Randomly chooses between updating Q0 and Q1 at each step.
* Ensures both Q-tables contribute to the learning process.

## **Implementation with the Frozen Lake environment**

In [1]:
import numpy as np
import pandas as pd
import gymnasium as gym

In [2]:
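env = gym.make('FrozenLake-v1', is_slippery=False)  # deterministic transitions
num_states = env.observation_space.n    # 16 states on the default 4x4 map
num_actions = env.action_space.n        # 4 actions: left, down, right, up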

In [3]:
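# Two independent Q-tables (our dual estimators). A list comprehension is
# needed here: [array] * 2 would store the same array twice.
Q = [np.zeros((num_states, num_actions)) for _ in range(2)]
num_episodes = 1000
alpha = 0.5   # learning rate
gamma = 0.99  # discount factor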
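
The two estimators only help if they are separate arrays in memory. As a quick check, the sketch below contrasts list multiplication with a list comprehension; the `(16, 4)` shape assumes FrozenLake's default 4x4 map, and the names `aliased` and `independent` are illustrative.

```python
import numpy as np

shape = (16, 4)  # assumed: 16 states, 4 actions

# List multiplication repeats a reference to one and the same array.
aliased = [np.zeros(shape)] * 2
aliased[0][0, 0] = 1.0
print(aliased[0] is aliased[1])   # both entries are the same object
print(aliased[1][0, 0])           # the write shows up in the "other" table

# A comprehension allocates a separate array per estimator.
independent = [np.zeros(shape) for _ in range(2)]
independent[0][0, 0] = 1.0
print(independent[0] is independent[1])
print(independent[1][0, 0])       # the second table is unaffected
```

With aliased tables, selecting with one table and evaluating with the other amounts to ordinary Q-learning on a single table.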

In [5]:
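def update_q_table(state, action, reward, next_state):
    i = np.random.randint(2)  # randomly pick the table to update
    # Table i selects the best next action; table 1 - i evaluates it.
    best_next_action = np.argmax(Q[i][next_state])
    target = reward + gamma * Q[1 - i][next_state, best_next_action]
    Q[i][state, action] = (1 - alpha) * Q[i][state, action] + alpha * target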

In [6]:
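for episode in range(num_episodes):
    state, info = env.reset()
    done = False
    while not done:
        action = np.random.choice(num_actions)  # purely random exploration
        next_state, reward, terminated, truncated, info = env.step(action)
        update_q_table(state, action, reward, next_state)
        state = next_state
        # Stop on a terminal state or when the time limit truncates the episode.
        done = terminated or truncated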
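
The loop above explores with uniformly random actions. A common variation, which this notebook does not use, is epsilon-greedy selection on the sum of the two tables, so that both estimators shape the behaviour. The following is a minimal sketch of that idea; the function name `epsilon_greedy_action`, the `epsilon` parameter, and the small demo tables are hypothetical.

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one.

    Greedy means the argmax of Q[0] + Q[1] for this state, so both
    estimators influence the choice.
    """
    num_actions = Q[0].shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))
    combined = Q[0][state] + Q[1][state]
    return int(np.argmax(combined))

# Tiny made-up tables: 2 states, 2 actions.
rng = np.random.default_rng(0)
Q_demo = [np.array([[0.0, 1.0], [2.0, 0.0]]),
          np.array([[0.0, 1.0], [0.0, 1.0]])]

# With epsilon = 0 the choice is purely greedy: action 1 in state 0.
print(epsilon_greedy_action(Q_demo, state=0, epsilon=0.0, rng=rng))
```

Using it in the training loop would mean replacing the `np.random.choice(num_actions)` call with a call to this function, for some chosen exploration rate.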

In [7]:
# Average the two tables and act greedily with respect to the result.
final_q = (Q[0] + Q[1]) / 2
policy = {state: np.argmax(final_q[state]) for state in range(num_states)}
print(policy)

{0: np.int64(2), 1: np.int64(2), 2: np.int64(1), 3: np.int64(0), 4: np.int64(1), 5: np.int64(0), 6: np.int64(1), 7: np.int64(0), 8: np.int64(2), 9: np.int64(2), 10: np.int64(1), 11: np.int64(0), 12: np.int64(0), 13: np.int64(2), 14: np.int64(2), 15: np.int64(0)}
