# About Double DQN

Although DQN reduces the memory that tabular Q-learning requires, it does not fix some of Q-learning's other flaws. In DQN, the target Q-value is the reward plus the discounted maximum Q-value of the next state, as given by the Bellman equation. Because the same estimates are used both to choose that maximum and to evaluate it, a Q-value that is estimated too high for some state feeds back into the targets, and the network's output for that state grows with every update. The algorithm therefore becomes overly optimistic about taking that action even though it does not actually provide that much value. For example, if action A happens to do well and receives a high reward in one episode, the network learns to assign action A a high value even though other actions might be more valuable in some cases.
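
The optimism described above can be illustrated without any neural network. The sketch below is a toy experiment, not part of the original notebook: it assumes every action has a true value of exactly 0 and that the estimates are the true values plus Gaussian noise (the function name and parameters are hypothetical). Taking the maximum over such noisy estimates gives, on average, a value well above the true maximum of 0.

```python
import random

def mean_max_of_noisy_estimates(n_actions=4, noise_std=1.0, n_trials=10_000, seed=0):
    """Average of max_a Q_est(a) when every true Q(a) is 0 and Q_est = Q + noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        # Noisy estimates of action values whose true value is 0 (assumption of this toy setup).
        estimates = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        total += max(estimates)
    return total / n_trials

bias = mean_max_of_noisy_estimates()
print(f"true max Q = 0.0, average estimated max Q = {bias:.3f}")
```

With a single action there is no maximum to take and the average stays close to 0, so the bias comes from the `max` itself, and it grows with the number of actions and the noise level.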

Double DQN addresses this problem by using two identical neural network models. One learns during experience replay just as in DQN; the other is a periodically refreshed copy of the first (in the code below, the weights are copied every `copyNetsCycle` learning steps). When the targets are computed, the learning model chooses the best next action and the older copy supplies that action's value. The idea is that if the learning model overestimates a Q-value, the lagging copy keeps that bias in check when the Q-values are updated.
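
As a rough sketch of the difference between the two targets, the snippet below computes both on a tiny hand-made batch. The function names, the arrays, and the choice of `gamma=0.9` are illustrative assumptions, not code from this notebook; `q_next_online` stands for the learning model's output and `q_next_target` for the copied model's output on the next states.

```python
import numpy as np

def dqn_target(rewards, q_next_target, done, gamma=0.99):
    # Standard DQN: one set of estimates both selects and evaluates the next action.
    return rewards + gamma * q_next_target.max(axis=1) * (1.0 - done)

def double_dqn_target(rewards, q_next_online, q_next_target, done, gamma=0.99):
    # Double DQN: the online model selects the action, the target model evaluates it.
    best_actions = np.argmax(q_next_online, axis=1)
    batch_index = np.arange(len(rewards))
    return rewards + gamma * q_next_target[batch_index, best_actions] * (1.0 - done)

# Hypothetical batch of two transitions with two actions each.
rewards = np.array([1.0, 0.0])
done = np.array([0.0, 1.0])          # the second transition is terminal
q_next_online = np.array([[0.5, 2.0], [1.0, 0.0]])
q_next_target = np.array([[3.0, 1.0], [0.2, 0.4]])

print(dqn_target(rewards, q_next_target, done, gamma=0.9))                        # [3.7 0. ]
print(double_dqn_target(rewards, q_next_online, q_next_target, done, gamma=0.9))  # [1.9 0. ]
```

In the first transition the target model's largest value (3.0) belongs to an action the online model does not prefer, so the Double DQN target is lower than the plain DQN one. Terminal transitions reduce to the reward alone in both cases.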

In [None]:
import numpy as np
import gym
from keras.layers import Dense, Activation
from keras.models import Sequential, load_model
import keras
import tensorflow as tf
import matplotlib.pyplot as plt

In [None]:
env = gym.make('LunarLander-v2')

In [None]:
class replayBuffer:
    """Fixed-size circular buffer of (state, action, reward, nextState, done) transitions."""
    def __init__(self,maxSize,stateDim):
        self.state=np.zeros((maxSize,stateDim))
        self.action=np.zeros(maxSize,dtype=np.int8)
        self.reward=np.zeros(maxSize)
        self.done=np.zeros(maxSize)
        self.nextState=np.zeros((maxSize,stateDim))
        self.maxSize=maxSize
        self.cursor=0  # index of the next slot to write
        self.size=0    # number of transitions stored so far

    def save(self,state,action,reward,nextState,done):
        # Write at the cursor; once the buffer is full this overwrites the oldest entry.
        self.state[self.cursor]=state
        self.action[self.cursor]=action
        self.reward[self.cursor]=reward
        self.nextState[self.cursor]=nextState
        self.done[self.cursor]=done
        self.cursor=(self.cursor+1)%self.maxSize
        if self.size<self.maxSize:
            self.size+=1

    def sample(self,batchSize):
        # Draws batchSize-1 transitions uniformly, with replacement, from all stored entries.
        batchSize=min(self.size,batchSize-1)
        indexes=np.random.choice(self.size,batchSize)
        return self.state[indexes],self.action[indexes],self.reward[indexes],self.nextState[indexes],self.done[indexes]

In [None]:
class Agent:
    def __init__(self,stateShape,actionShape,exploreRate,exploreRateDecay,minimumExploreRate,gamma,copyNetsCycle):
        self.gamma=gamma
        self.exploreRate=exploreRate
        self.exploreRateDecay=exploreRateDecay
        self.minimumExploreRate=minimumExploreRate
        self.actionShape=actionShape
        self.memory=replayBuffer(1000000,stateShape)
        # buildModel already compiles each network, so no second compile is needed.
        self.model=self.buildModel(stateShape,actionShape)
        self.tModel=self.buildModel(stateShape,actionShape)
        # Start the target model as an exact copy of the learning model.
        self.tModel.set_weights(self.model.get_weights())
        self.learnThreshold=0
        self.copyNetsCycle=copyNetsCycle

    def buildModel(self,input,output):
        inputLayer=keras.Input(shape=(input,))
        layer=Dense(256,activation='relu')(inputLayer)
        layer=Dense(256,activation='relu')(layer)
        outputLayer=Dense(output)(layer)
        model=keras.Model(inputs=inputLayer,outputs=outputLayer)
        model.compile(optimizer='Adam',loss='mse')
        return model
    
    def getAction(self,state):
        # Epsilon-greedy: explore with probability exploreRate, otherwise act greedily.
        if np.random.random()<=self.exploreRate:
            return np.random.randint(self.actionShape)
        q=self.model.predict(np.expand_dims(state,axis=0), verbose = 0)[0]
        return np.argmax(q)

    def exploreDecay(self):
        self.exploreRate=max(self.exploreRate*self.exploreRateDecay,self.minimumExploreRate)

    # The ".weights.h5" suffix is accepted by both older and newer Keras versions.
    def saveModel(self,modelName="DoubleDQN_LunarLanderV2.weights.h5"):
        self.model.save_weights(modelName)

    def loadModel(self,modelName="DoubleDQN_LunarLanderV2.weights.h5"):
        self.model.load_weights(modelName)
        self.tModel.set_weights(self.model.get_weights())
      
    def learn(self,batchSize=64):
        # Only start learning once the buffer holds more than one batch.
        if self.memory.size>batchSize:
            states,actions,rewards,nextStates,done=self.memory.sample(batchSize)
            qState=self.model.predict(states,verbose = 0)
            qNextState=self.model.predict(nextStates,verbose = 0)
            qNextStateTarget=self.tModel.predict(nextStates, verbose = 0)
            # Double DQN: the learning model picks the next action, the target model evaluates it.
            maxActions=np.argmax(qNextState,axis=1)
            # Index by the number of transitions actually sampled.
            batchIndex=np.arange(len(actions))
            # Terminal transitions (done=1) keep only the immediate reward.
            qState[batchIndex,actions]=rewards+self.gamma*qNextStateTarget[batchIndex,maxActions]*(1-done)
            self.model.fit(x=states,y=qState,verbose=0)
            self.learnThreshold+=1

            # Every copyNetsCycle updates, sync the target model and checkpoint the weights.
            if self.learnThreshold%self.copyNetsCycle==0:
                self.tModel.set_weights(self.model.get_weights())
                self.saveModel()
                self.learnThreshold=0

In [None]:
agent=Agent(stateShape=env.observation_space.shape[0],actionShape=env.action_space.n,
            exploreRate=1.0,exploreRateDecay=0.977,minimumExploreRate=0.01,gamma=0.99,copyNetsCycle=100)

averageRewards=[]
totalRewards=[]
for i in range(1,300):
    done=False
    state=env.reset()
    rewards=0
    while not done:
        action=agent.getAction(state)
        nextState,reward,done,info=env.step(action)
        agent.memory.save(state,action,reward,nextState,int(done))
        rewards+=reward
        state=nextState
        agent.learn(batchSize=64)
        
    agent.exploreDecay()
    totalRewards.append(rewards)    
    averageRewards.append(np.mean(totalRewards[-50:]))
          
    print(f"episode: {i}   reward: {rewards}  avg so far:{averageRewards[-1]} exploreRate:{agent.exploreRate}")

episode: 1   reward: -168.27615399262075  avg so far:-168.27615399262075 exploreRate:0.977
episode: 2   reward: -89.76465335118822  avg so far:-129.0204036719045 exploreRate:0.954529
episode: 3   reward: -77.9104763052607  avg so far:-111.98376121635657 exploreRate:0.932574833
episode: 4   reward: -120.96418987150106  avg so far:-114.22886838014269 exploreRate:0.9111256118409999
episode: 5   reward: -153.8418495798761  avg so far:-122.15146462008938 exploreRate:0.8901697227686569
episode: 6   reward: -176.85571967658854  avg so far:-131.26884046283922 exploreRate:0.8696958191449777
episode: 7   reward: -130.66869539257755  avg so far:-131.18310545280184 exploreRate:0.8496928153046432
episode: 8   reward: -63.529832939746775  avg so far:-122.72644638866996 exploreRate:0.8301498805526364
episode: 9   reward: -77.66841331054219  avg so far:-117.71999826887799 exploreRate:0.8110564332999257
episode: 10   reward: -213.1426998597127  avg so far:-127.26226842796146 exploreRate:0.7924021353340

episode: 81   reward: -31.53557601427778  avg so far:-29.835698665308705 exploreRate:0.15186568773569667
episode: 82   reward: 31.47342310783995  avg so far:-23.98538481693937 exploreRate:0.14837277691777565
episode: 83   reward: -59.592635160419704  avg so far:-25.1958687663333 exploreRate:0.1449602030486668
episode: 84   reward: -292.71908844083646  avg so far:-31.072382908795834 exploreRate:0.14162611837854747
episode: 85   reward: 107.17814113966773  avg so far:-28.03729987416328 exploreRate:0.13836871765584088
episode: 86   reward: -271.8392521447573  avg so far:-33.60460004831193 exploreRate:0.13518623714975653
episode: 87   reward: 183.96497983091388  avg so far:-28.900103273709465 exploreRate:0.13207695369531214
episode: 88   reward: -32.44175346268007  avg so far:-26.035799276868083 exploreRate:0.12903918376031995
episode: 89   reward: -50.4423761879724  avg so far:-26.154144917502837 exploreRate:0.1260712825338326
episode: 90   reward: 118.2925421222753  avg so far:-23.568436

This model could not finish training because of time constraints and the loss of access to the lab PCs; home PCs cannot meet the performance and memory requirements. However, some useful insights can still be gleaned.
- The model trained only up to episode 148, where it reached an average reward of 44 (the mean over the last 50 episodes). DQN took ~400 episodes to reach the same level of reward. This suggests that Double DQN converges considerably faster, plausibly because it estimates the value function more accurately.

- The trend in average rewards also shows that the model is steadily learning to estimate the value function better, and that the average is not very volatile.
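
The "average reward" discussed above is the running mean that the training loop stores in `averageRewards`, taken over at most the last 50 episodes. A minimal sketch of the same calculation follows, using a hypothetical helper name and made-up reward values rather than results from this run:

```python
def moving_average(values, window=50):
    """Mean of the last `window` values at each step (fewer while warming up)."""
    averages = []
    for i in range(len(values)):
        recent = values[max(0, i + 1 - window): i + 1]
        averages.append(sum(recent) / len(recent))
    return averages

# Hypothetical episode rewards, chosen only to make the arithmetic easy to follow.
episode_rewards = [-100.0, -50.0, 0.0, 50.0, 100.0]
print(moving_average(episode_rewards, window=3))  # [-100.0, -75.0, -50.0, 0.0, 50.0]
```

Because each point averages several episodes, this curve moves far less than the per-episode rewards, which is worth keeping in mind when judging how volatile training is.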