# Code Modification


**Contents:**
1. Double DQN code changes

2. Dueling DQN code changes

3. Basic Replay Buffer code changes

4. Prioritized Experience Replay code changes

---

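## 1) Double DQN code changes

According to the paper, Double DQN changes only how the target value is computed. Since we compute the target value only in agent.learn(), a small modification there is all that is needed.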

In [1]:
class agent:
    
    def learn(self):
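        # Assumes each stored transition is a tuple of five tensors;
        # the sampled transitions are regrouped field by field into batches.
        batch = random.sample(self.memory, self.bs)
        states, actions, rewards, next_states, dones = map(torch.stack, zip(*batch))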
        states = states.to(self.device)
        actions = actions.to(self.device)
        rewards = rewards.to(self.device)
        next_states = next_states.to(self.device)
        dones = dones.to(self.device)
    
        Q_values = self.Q_local(states)
        Q_values = torch.gather(input=Q_values, dim=-1, index=actions)
    
        with torch.no_grad():
            Q_targets = self.Q_target(next_states)
            ############################## Change Here ##############################
            if not self.double: # self.double is a boolean flag set when the agent is initialized
                Q_targets, _ = torch.max(input=Q_targets, dim=1, keepdim=True)
            else:
                inner_actions = torch.max(input=self.Q_local(next_states), dim=1, keepdim=True)[1]
                Q_targets = torch.gather(input=Q_targets, dim=1, index=inner_actions)
            ############################## Change Ends ##############################
            Q_targets = rewards + self.gamma * (1 - dones) * Q_targets
    
        deltas = Q_values - Q_targets
        loss = deltas.pow(2).mean()
    
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
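
The difference between the two branches can be seen on a toy batch. The following is a minimal sketch, not the agent's actual code: the helper names `dqn_target` and `double_dqn_target` and all tensor values are made up for illustration.

```python
import torch

def dqn_target(q_target_next, rewards, dones, gamma):
    """Vanilla DQN: the target network both selects and evaluates the next action."""
    max_q, _ = torch.max(q_target_next, dim=1, keepdim=True)
    return rewards + gamma * (1 - dones) * max_q

def double_dqn_target(q_local_next, q_target_next, rewards, dones, gamma):
    """Double DQN: the local network selects the action, the target network evaluates it."""
    best_actions = torch.argmax(q_local_next, dim=1, keepdim=True)
    chosen_q = torch.gather(q_target_next, dim=1, index=best_actions)
    return rewards + gamma * (1 - dones) * chosen_q

# Toy batch: 2 transitions, 3 actions (illustrative values only)
q_local_next = torch.tensor([[1.0, 5.0, 2.0], [0.5, 0.1, 0.2]])
q_target_next = torch.tensor([[4.0, 3.0, 6.0], [1.0, 2.0, 3.0]])
rewards = torch.tensor([[1.0], [1.0]])
dones = torch.tensor([[0.0], [1.0]])  # the second transition is terminal

vanilla = dqn_target(q_target_next, rewards, dones, gamma=0.5)
double = double_dqn_target(q_local_next, q_target_next, rewards, dones, gamma=0.5)
```

In the first transition the target network's own maximum is 6.0, but the local network prefers action 1, whose target-network value is only 3.0, so the Double DQN target is lower. For the terminal transition both targets reduce to the reward.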

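## 2) Dueling DQN code changes

The dueling architecture changes only the network structure, so the only file that needs to change is network.py. In the agent, we only need to add the corresponding parameter.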

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Q_Network(nn.Module):

    def __init__(self, state_size, action_size, hidden=[64, 64], duel=False):
        super(Q_Network, self).__init__()
        self.fc1 = nn.Linear(state_size, hidden[0])
        self.fc2 = nn.Linear(hidden[0], hidden[1])
        self.fc3 = nn.Linear(hidden[1], action_size)
        self.duel = duel
        if self.duel:
            self.fc4 = nn.Linear(hidden[1], 1)

    def forward(self, state):
        x = state
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
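        if self.duel:
            x1 = self.fc3(x) # advantage stream A(s, a)
            x1 = x1 - torch.max(x1, dim=1, keepdim=True)[0] # shift so the max advantage is 0
            x2 = self.fc4(x) # state-value stream V(s)
            return x1 + x2 # Q(s, a) = V(s) + (A(s, a) - max_a A(s, a))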
        else:
            x = self.fc3(x)
            return x
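
To make the aggregation step concrete, here is a small sketch on hand-picked tensors. The function name `dueling_aggregate` and the numbers are assumptions for illustration; only the max-subtraction rule is taken from the network above.

```python
import torch

def dueling_aggregate(value, advantage):
    """Combine V(s) of shape (batch, 1) with A(s, a) of shape (batch, n_actions)."""
    # Subtract the per-row maximum so the best action has advantage 0
    centered = advantage - advantage.max(dim=1, keepdim=True)[0]
    return value + centered

value = torch.tensor([[2.0], [-1.0]])
advantage = torch.tensor([[0.5, 1.5, 1.0], [3.0, 0.0, 1.0]])
q = dueling_aggregate(value, advantage)
```

With this rule the largest Q-value in each row equals V(s), which is what makes the two streams identifiable. The dueling paper also describes subtracting the mean advantage instead of the max; that variant is not what the network above does.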

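## 3) Basic Replay Buffer code changes

In the original code, the agent uses a deque as its replay buffer, so the changes needed here are the initialization of the Replay Buffer and the way transitions are added and sampled. The core code is as follows: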

In [3]:
## In agent.py

class agent:
    def __init__(self, *args, **kwargs):
        # self.bs (the batch size) is assumed to be set earlier in __init__
        self.memory = Replay_Buffer(int(1e5), self.bs)
    
    def learn(self):
        states, actions, rewards, next_states, dones = self.memory.sample(self.bs)
        
# No other changes are required
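
For readers who want to see the interface the agent relies on, the following is a hypothetical uniform buffer with the same constructor arguments and a `sample` method. It is a sketch only, not the `Replay_Buffer` class used above: the class name `SimpleReplayBuffer`, the `add` method, and the `Transition` tuple are assumptions, and it returns plain tuples rather than tensors.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class SimpleReplayBuffer:
    """Hypothetical uniform replay buffer with a fixed capacity."""

    def __init__(self, capacity, batch_size):
        self.memory = deque(maxlen=capacity)  # the oldest transitions are dropped first
        self.batch_size = batch_size

    def __len__(self):
        return len(self.memory)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size=None):
        """Draw transitions uniformly without replacement and group them by field."""
        batch = random.sample(self.memory, batch_size or self.batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

# Usage: capacity 3, so only the last three of five transitions survive
buffer = SimpleReplayBuffer(capacity=3, batch_size=2)
for t in range(5):
    buffer.add(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
states, actions, rewards, next_states, dones = buffer.sample()
```

A real implementation would typically convert each field to a tensor before returning it, so that `learn()` can move the batch to the device directly.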

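## 4) Prioritized Experience Replay code changes

In Prioritized Experience Replay, most of the work can be done inside the Replay Buffer, which we have already implemented. The parts that must be done during training are:
  * Compute the weights $w_i$
  * Update the TD-error $\delta_i$
  * Pass the error back to the Replay Buffer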
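
For reference, the importance-sampling weight in the Prioritized Experience Replay paper is defined as follows, where $N$ is the number of stored transitions, $P(i)$ is the probability of sampling transition $i$, and $\beta$ controls how strongly the sampling bias is corrected:

$$w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^{\beta}, \qquad w_i \leftarrow \frac{w_i}{\max_j w_j}$$

The code in this section corresponds to the special case $\beta = 1$, with the maximum taken over the sampled batch.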

In [4]:
class agent:
    
    def __init__(self, *args, **kwargs):
        ############################## Change Here ##############################
        # self.prioritized and self.bs are assumed to be set earlier in __init__
        if not self.prioritized:
            self.memory = Replay_Buffer(int(1e5), self.bs)
        else:
            #self.memory = Rank_Replay_Buffer(int(1e5), self.bs)
            # or
            self.memory = Proportion_Replay_Buffer(int(1e5), self.bs)
        ############################## Change Ends ##############################

    def learn(self):
        ############################## Change Here ##############################
        if not self.prioritized:
            states, actions, rewards, next_states, dones = self.memory.sample(self.bs)
            w = torch.ones(actions.size())
            w = w.to(self.device)
        else:
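            index_set, states, actions, rewards, next_states, dones, probs = self.memory.sample(self.bs)
            w = 1/len(self.memory)/probs # importance-sampling weights 1 / (N * P(i)), i.e. beta = 1
            w = w/torch.max(w) # normalize so the largest weight in the batch is 1
            w = w.to(self.device)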
        ############################## Change Ends ##############################
            
        states = states.to(self.device)
        actions = actions.to(self.device)
        rewards = rewards.to(self.device)
        next_states = next_states.to(self.device)
        dones = dones.to(self.device)

        Q_values = self.Q_local(states)
        Q_values = torch.gather(input=Q_values, dim=-1, index=actions)

        with torch.no_grad():
            Q_targets = self.Q_target(next_states)
            Q_targets, _ = torch.max(input=Q_targets, dim=1, keepdim=True)
            Q_targets = rewards + self.gamma * (1 - dones) * Q_targets

        deltas = Q_values - Q_targets
        ############################## Change Here ##############################
        loss = (w*deltas.pow(2)).mean()
        ############################## Change Ends ##############################

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        ############################## Change Here ##############################
        if self.prioritized:
            deltas = np.abs(deltas.detach().cpu().numpy().reshape(-1))
            for i in range(self.bs):
                self.memory.insert(deltas[i], index_set[i])
        ############################## Change Ends ##############################
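
The three training-time steps can also be traced by hand on a tiny batch. This is a sketch in plain Python under made-up numbers; `importance_weights` is a hypothetical helper, and the `beta` exponent is an assumption taken from the general form of the weights (the agent above effectively uses `beta = 1`).

```python
def importance_weights(probs, buffer_size, beta=1.0):
    """Return (1 / (N * P(i))) ** beta for each sample, scaled so the max is 1."""
    raw = [(1.0 / (buffer_size * p)) ** beta for p in probs]
    max_w = max(raw)
    return [w / max_w for w in raw]

# Three sampled transitions from a buffer holding 4 transitions (illustrative values)
probs = [0.5, 0.25, 0.25]
deltas = [2.0, 1.0, -1.0]  # TD errors: Q_values - Q_targets

# Step 1: compute the weights
weights = importance_weights(probs, buffer_size=4)

# Step 2: weighted mean squared TD error, as in the loss above
loss = sum(w * d ** 2 for w, d in zip(weights, deltas)) / len(deltas)

# Step 3: the absolute TD errors are what gets written back to the buffer
new_priorities = [abs(d) for d in deltas]
```

The transition that is sampled twice as often receives half the weight, so its larger TD error does not dominate the gradient simply because it is replayed more frequently.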