# DQN Extensions

Since 2015, many improvements to the DQN algorithm have been proposed, along with tweaks to the basic architecture, which significantly improve the convergence, stability, and sample efficiency of the basic DQN invented by DeepMind. In this chapter, we'll take a deeper look at some of those ideas. Very conveniently, in October 2017, DeepMind published a paper called Rainbow: Combining Improvements in Deep Reinforcement Learning ([1](https://arxiv.org/abs/1710.02298) Hessel and others, 2017), which examined six of the most important extensions to DQN, some of which were invented in 2015 and some of which were very recent at the time. In this paper, state-of-the-art results on the Atari games suite were reached just by combining all of those methods together.

The DQN extensions we'll become familiar with are as follows:

* **N-step DQN:** How to improve convergence speed and stability with a simple unrolling of the Bellman equation and why it's not an ultimate solution
* **Double DQN:** How to deal with DQN overestimation of the values of actions
* **Noisy networks:** How to make exploration more efficient by adding noise to the network weights
* **Prioritized replay buffer:** Why uniform sampling of our experience is not the best way to train
* **Dueling DQN:** How to improve convergence speed by making our network's architecture more closely represent the problem we're solving
* **Categorical DQN:** How to go beyond the single expected value of action and work with full distributions

## The PyTorch Agent Net library

The implementation in this chapter is based on the PTAN library.
To be able to focus only on the significant parts, it is useful to have a version of DQN that is as small and concise as possible, preferably built from reusable pieces of code. This is extremely helpful when you're experimenting with methods published in papers or with your own ideas. In that case, you don't need to reimplement the same functionality again and again, fighting with the inevitable bugs.
With this in mind, some time ago I started to implement my own toolkit for the deep RL domain. I called it PTAN, which stands for PyTorch Agent Net, as it was inspired by another open-source library called [AgentNet](https://github.com/yandexdataschool/AgentNet). The basic design principles I tried to follow in PTAN are as follows:
* Being as simple and clean as possible 
* PyTorch-nativeness 
* Containing small, reusable pieces of functionality 
* Extensibility and flexibility

The library is available on GitHub: https://github.com/Shmuma/ptan. All the subsequent examples were implemented using version 0.3 of PTAN, which can be installed in your virtual environment by running the following:

pip install ptan==0.3 

Let's look at the basic building blocks that PTAN provides.

### Agent

The agent entity provides a unified way of **bridging observations from the environment and the actions** that we want to execute. So far, we've seen only a simple, stateless DQN agent that uses a neural network to obtain action values from the current observation and behaves greedily with respect to those values. We've used epsilon-greedy behavior to explore the environment, but this doesn't change the picture much.

In the RL field, this could be more complicated. For example, instead of predicting the values of the actions, our agent can predict a probability distribution over actions. Such agents are called policy agents, and we'll talk about those methods in part three of the book. Another requirement could be some kind of memory in the agent. For example, very often one observation (or even the last k observations) is not enough to make a decision about the action, and we want to keep some memory in the agent to capture the necessary information. There is a whole subdomain of RL that tries to address this complication with the **Partially Observable Markov Decision Process (POMDP)** formalism. We'll briefly touch on this case in the last part of the book.

To capture all those variants and make the code flexible, the agent in PTAN is implemented as an extensible hierarchy of classes with the `ptan.agent.BaseAgent` abstract class at the top. At a high level, the agent needs to accept a batch of observations (in the form of a NumPy array) and return the batch of actions that it wants to take. Batching makes the processing more efficient, as processing several observations in one pass on a GPU is frequently much faster than processing them individually. The abstract base class doesn't define the type of input and output, which makes it very flexible and easy to extend. For example, in the continuous domain, our actions will no longer be indices of discrete actions, but float values.

The agent that corresponds to our current DQN requirements is `ptan.agent.DQNAgent`, which uses the provided PyTorch `nn.Module` to convert a batch of observations into action values. To convert the network's output into actual actions to be taken, the `DQNAgent` class needs a second object to be passed on creation: an action selector.

The purpose of the action selector is to convert the output of the network (usually a vector of numbers) into some action. In the discrete action space case, the action will be one or several action indices to be taken. There are two action selectors in PTAN that we'll need: `ptan.actions.ArgmaxActionSelector` and `ptan.actions.EpsilonGreedyActionSelector`. As you may guess from the names, the first one (`ArgmaxActionSelector`) applies **argmax** to the provided values, which corresponds to greedy actions over Q-values.

The second action selector supports **epsilon-greedy** behavior: it takes **epsilon** as a parameter and, with this probability, takes a random action instead of the greedy one. Putting all this together, to create an agent for **CartPole** with epsilon-greedy action selection, we can write the following code:

In [2]:
import gym 
import ptan
import numpy as np 
import torch.nn as nn

env = gym.make("CartPole-v0")
net = nn.Sequential(
    nn.Linear(env.observation_space.shape[0], 256), 
    nn.ReLU(),
    nn.Linear(256, env.action_space.n) 
) 

action_selector = ptan.actions.EpsilonGreedyActionSelector(epsilon=0.1) 
agent = ptan.agent.DQNAgent(net, action_selector)

obs = np.array([env.reset()], dtype=np.float32)
agent(obs)


This results in:
(array([1]), [None])
The result is a tuple: the first item is a batch of actions to take, while the second is related to stateful agents and should be ignored here.
During training, we can change the `epsilon` attribute of our action selector to adjust the probability of taking a random action.
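
To make the two selection rules and the agent's batch interface concrete, here is a minimal pure-Python sketch. It is not PTAN's implementation: the names `argmax_selector`, `EpsilonGreedySelector`, `TinyAgent`, and `toy_q` are hypothetical, and plain lists stand in for NumPy arrays and network outputs.

```python
import random


def argmax_selector(q_values_batch):
    """Greedy rule: for each row of Q-values, return the index of the largest."""
    return [max(range(len(q)), key=q.__getitem__) for q in q_values_batch]


class EpsilonGreedySelector:
    """With probability epsilon pick a uniformly random action, else the greedy one."""

    def __init__(self, epsilon, rng=None):
        self.epsilon = epsilon
        self.rng = rng or random.Random()

    def __call__(self, q_values_batch):
        actions = argmax_selector(q_values_batch)
        for i, q in enumerate(q_values_batch):
            if self.rng.random() < self.epsilon:
                actions[i] = self.rng.randrange(len(q))
        return actions


class TinyAgent:
    """Maps a batch of observations to a batch of actions via a Q-function."""

    def __init__(self, q_function, selector):
        self.q_function = q_function
        self.selector = selector

    def __call__(self, observations):
        q_values = [self.q_function(obs) for obs in observations]
        actions = self.selector(q_values)
        # The second item stands in for per-observation agent state;
        # a stateless agent has nothing to keep, hence None.
        return actions, [None] * len(observations)


def toy_q(obs):
    """Hypothetical Q-function: prefers action 1 for positive observations."""
    return [0.0, 1.0] if obs > 0 else [1.0, 0.0]


selector = EpsilonGreedySelector(epsilon=0.0, rng=random.Random(0))
agent = TinyAgent(toy_q, selector)
actions, agent_states = agent([0.5, -0.5])
print(actions, agent_states)
```

With `epsilon=0.0` the selector is purely greedy; raising `selector.epsilon` at run time makes the same agent explore more, which is how an epsilon schedule can be applied without rebuilding the agent.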

### Agent's experience
The second important abstraction in PTAN is the so-called experience source. 
In our DQN example in the previous chapter, we worked with one-step experience pieces, which include four things:
* The **observed state** of the environment at some time step: $s_t$ 
* The **action** the agent has taken: $a_t$
* The **reward** the agent has obtained: $r_t$ 
* The observation of the **next state**: $s_{t+1}$

We used those values ($s_t$, $a_t$, $r_t$, $s_{t+1}$) to update our $Q$ approximation using the **Bellman equation**. However, in the general case, we may be interested in longer chains of experience that include more time steps of the agent's interaction with the environment.
The Bellman equation can also be unrolled over longer experience chains:

$$Q(s_t , a_t) = \mathbb{E} [r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} +  ... + \gamma^k \max_a Q(s_{t+k}, a)]$$

One of the methods to improve DQN stability and convergence discussed in this chapter does just this: by unrolling the Bellman equation k steps forward (where k is usually 2 to 5), we significantly improve the speed of training convergence.
To support this situation in a generic way, PTAN has the `ptan.experience.ExperienceSourceFirstLast` class, which takes the environment and the agent and provides us with a stream of experience tuples:

($s_t$, $a_t$, $R_t$, $s_{t+k}$), where $R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{k-1} r_{t+k-1}$

When $k=1$, $R_t = r_t$.

This class automatically handles end-of-episode situations, letting us know about them by setting the last tuple entry to `None`. In such cases, a reset of the environment is performed automatically. The `ExperienceSourceFirstLast` class exposes the iterator interface, generating an experience tuple on every iteration. An example of using this class is as follows:


In [7]:

exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma=0.99, steps_count=1)
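it = iter(exp_source)
next(it)

RuntimeError: Expected object of type torch.FloatTensor but found type torch.DoubleTensor for argument #4 'mat1'

The error message indicates that the observations reach the network as double-precision (float64) tensors, while the network's weights are single-precision (float32). Converting the observations to float32 before they are passed to the network avoids this mismatch.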
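
Since the cell above does not produce a transition, the following pure-Python sketch shows what the ($s_t$, $a_t$, $R_t$, $s_{t+k}$) tuples look like for one finished episode. It is only an illustration of the idea, not PTAN's code: `FirstLast` and `first_last_transitions` are hypothetical names, and the way PTAN treats the tail of an episode may differ in detail.

```python
from collections import namedtuple

FirstLast = namedtuple("FirstLast", ["state", "action", "reward", "last_state"])


def first_last_transitions(states, actions, rewards, gamma, steps_count):
    """Build (s_t, a_t, R_t, s_{t+k}) tuples from one finished episode.

    `states` has len(rewards) + 1 entries; the final entry is the terminal state.
    """
    result = []
    episode_len = len(rewards)
    for t in range(episode_len):
        end = min(t + steps_count, episode_len)
        discounted = 0.0
        for offset, r in enumerate(rewards[t:end]):
            discounted += (gamma ** offset) * r
        # If the chain reached the end of the episode, there is no state
        # to bootstrap from, which is signalled with None.
        last_state = None if end == episode_len else states[end]
        result.append(FirstLast(states[t], actions[t], discounted, last_state))
    return result


episode_states = ["s0", "s1", "s2", "s3"]
episode_actions = [0, 1, 0]
episode_rewards = [1.0, 1.0, 1.0]

one_step = first_last_transitions(episode_states, episode_actions,
                                  episode_rewards, gamma=0.5, steps_count=1)
two_step = first_last_transitions(episode_states, episode_actions,
                                  episode_rewards, gamma=0.5, steps_count=2)
for transition in two_step:
    print(transition)
```

With `steps_count=1` the reward field is just the immediate reward; with `steps_count=2` the first tuple carries $r_t + \gamma r_{t+1}$ and points two states ahead.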

### Gym env wrappers
To avoid implementing (or copy-pasting) common Atari wrappers over and over again, I put them in the `ptan.common.wrappers` module. They are mostly the same (with minor PyTorch-specific modifications) as the wrappers available in the OpenAI Baselines project: https://github.com/openai/baselines. To wrap an Atari environment in one line, it's enough to call `ptan.common.wrappers.wrap_dqn(env)`. That's basically it! As I've said before, PTAN wasn't supposed to be the ultimate RL framework; it's just a collection of entities designed to be used together, but not to depend much on each other.

### Basic DQN
By combining all of the above, we can reimplement the same DQN agent in a much shorter, but still flexible, way. This will come in handy later, when we start to modify various parts of the DQN to make it better.

In the basic DQN implementation we have three modules:
* Chapter07/lib/dqn_model.py: The DQN neural network, which is the same as we've seen in the previous chapter
* Chapter07/lib/common.py: Common functions used in this chapter's examples, but too specialized to be moved to PTAN
* Chapter07/01_dqn_basic.py: The creation of all used pieces and the training loop

Let's start with the contents of `lib/common.py`. First of all, it contains the hyperparameters for our Pong environment, which was introduced in the previous chapter. The hyperparameters are stored in a dict whose keys are configuration names and whose values are dicts of parameters. This makes it easy to add another configuration set for more complicated Atari games.


HYPERPARAMS = {
    'pong': {
        'env_name':         "PongNoFrameskip-v4",
        'stop_reward':      18.0,
        'run_name':         'pong',
        'replay_size':      100000,
        'replay_initial':   10000,
        'target_net_sync':  1000,
        'epsilon_frames':   10**5,
        'epsilon_start':    1.0,
        'epsilon_final':    0.02,
        'learning_rate':    0.0001,
        'gamma':            0.99,
        'batch_size':       32
    }
}

In addition, `common.py` has a function that takes a batch of transitions and packs it into a set of NumPy arrays. Every transition from `ExperienceSourceFirstLast` is a namedtuple with the following fields:
* **state**: Observation from the environment.
* **action**: Integer action taken by the agent.
* **reward**: If we've created `ExperienceSourceFirstLast` with `steps_count=1`, it's just the immediate reward. For larger step counts, it contains the discounted sum of rewards for this number of steps.
* **last_state**: If the transition corresponds to the final step in the environment, then this field is `None`; otherwise, it contains the last observation in the experience chain.

The function also has to handle the final transitions in the batch. To avoid special handling of such cases, for terminal transitions we store the initial state in the last_states array. To keep our calculation of the Bellman update correct, we mask such batch entries during the loss calculation using the dones array. Another solution would be to calculate the value of last states only for non-terminal transitions, but it would make our loss function logic a bit more complicated.

The **loss function** is exactly the same as we had in the previous chapter. We calculate the values of the actions taken from the first state, then calculate the values of the same actions using the Bellman equation. The resulting loss is the mean squared error between those two quantities: `nn.MSELoss()(state_action_values, expected_state_action_values)`
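
The batch unpacking and the masked Bellman target described above can be sketched in pure Python as follows. This is an illustrative stand-in rather than the code in `common.py`: the real functions work on NumPy arrays and PyTorch tensors, and the names `Transition`, `bellman_targets`, and `mse` are assumptions made for this example.

```python
from collections import namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "last_state"])


def unpack_batch(batch):
    """Split a list of transitions into parallel lists plus a done mask."""
    states, actions, rewards, dones, last_states = [], [], [], [], []
    for tr in batch:
        states.append(tr.state)
        actions.append(tr.action)
        rewards.append(tr.reward)
        dones.append(tr.last_state is None)
        # For terminal transitions we store the initial state as a placeholder;
        # its value is masked out when the targets are computed.
        last_states.append(tr.state if tr.last_state is None else tr.last_state)
    return states, actions, rewards, dones, last_states


def bellman_targets(rewards, dones, next_q_values, gamma):
    """Target r + gamma * max_a Q(s', a), with the bootstrap term zeroed when done."""
    targets = []
    for r, done, q_next in zip(rewards, dones, next_q_values):
        best_next = 0.0 if done else max(q_next)
        targets.append(r + gamma * best_next)
    return targets


def mse(predicted, targets):
    """Mean squared error between two equally long lists of numbers."""
    return sum((p - t) ** 2 for p, t in zip(predicted, targets)) / len(targets)


batch = [
    Transition("s0", 1, 1.0, "s1"),   # ordinary transition
    Transition("s5", 0, -1.0, None),  # terminal transition
]
states, actions, rewards, dones, last_states = unpack_batch(batch)
# Hypothetical target-network outputs for the two entries of last_states
next_q = [[0.5, 2.0], [9.0, 9.0]]
targets = bellman_targets(rewards, dones, next_q, gamma=0.5)
loss = mse([2.0, 0.0], targets)
print(targets, loss)
```

Note that the second row of `next_q` is ignored entirely: the placeholder state still passes through the network, but the done mask removes its contribution from the target.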

Also, in common.py, we have two utility classes to help us to simplify the training loop:

The `EpsilonTracker` class takes the instance of `EpsilonGreedyActionSelector` and our hyperparameters for a specific configuration. Its only method, `frame()`, updates the value of epsilon according to the standard DQN epsilon decay schedule: linearly decreasing it for the first epsilon_frames steps and then keeping it constant.
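
The decay schedule can be written as a one-line function. The sketch below is an assumption about its exact form (start value minus `frame / epsilon_frames`, clipped at the final value), chosen because it is consistent with the epsilon values printed in this chapter's training logs; `epsilon_at` is a hypothetical helper, not part of `common.py`.

```python
def epsilon_at(frame, epsilon_start, epsilon_final, epsilon_frames):
    """Linearly decayed epsilon, never dropping below epsilon_final."""
    return max(epsilon_final, epsilon_start - frame / epsilon_frames)


# Values of the schedule for the settings used in the basic DQN run
for frame in (0, 911, 25000, 10**6):
    print(frame, round(epsilon_at(frame, 1.0, 0.02, 50000), 2))
```

A tracker class then only needs to call such a function on every frame and assign the result to the selector's `epsilon` attribute.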
The second class, `RewardTracker`, is informed about the total reward at the end of every episode. It tracks the mean reward over the last episodes, reports the current values to TensorBoard and the console, and, finally, checks whether the game has been successfully solved. It also measures the speed in frames per second, which is useful to know, as performance is an important metric of the training.
The class is implemented as a context manager, automatically closing the TensorBoard writer on exit. The main logic is performed in the `reward()` method, which is called every time an episode finishes. It's mostly the same code as in the previous chapter's training loop.
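
A stripped-down version of such a tracker might look like the sketch below. It is a hypothetical stand-in (`SimpleRewardTracker`), not the class from `common.py`: it omits TensorBoard and speed measurement, and the window size of the running mean is an assumed parameter.

```python
class SimpleRewardTracker:
    """Tracks episode rewards and reports whether the stop threshold is reached."""

    def __init__(self, stop_reward, window=100):
        self.stop_reward = stop_reward
        self.window = window
        self.total_rewards = []
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # A full tracker would close its TensorBoard writer here.
        self.closed = True
        return False  # do not swallow exceptions

    def reward(self, episode_reward, frame_idx, epsilon=None):
        """Record one finished episode; return True once the game counts as solved."""
        self.total_rewards.append(episode_reward)
        recent = self.total_rewards[-self.window:]
        mean_reward = sum(recent) / len(recent)
        print(f"{frame_idx}: done {len(self.total_rewards)} games, "
              f"mean reward {mean_reward:.3f}")
        return mean_reward > self.stop_reward


results = []
with SimpleRewardTracker(stop_reward=18.0, window=2) as tracker:
    for frame_idx, episode_reward in [(900, 10.0), (1800, 20.0), (2700, 21.0)]:
        results.append(tracker.reward(episode_reward, frame_idx))
```

Returning a boolean from `reward()` lets the training loop stop with a simple `if ...: break`, which is the pattern used in the scripts of this chapter.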
At the beginning of the training loop, we create the reward tracker, which reports the mean reward for every completed episode. On every iteration, we increment the frame counter and ask our experience replay buffer to pull one transition from the experience source. This call to `buffer.populate(1)` starts the following chain of actions inside the PTAN library:

* `ExperienceReplayBuffer` will ask the experience source to get the next transition.
* The experience source will feed the current observation to the agent to obtain the action.
* The agent will apply the NN to the observation to calculate Q-values, then ask the action selector to choose the action to take.
* The action selector (which is an epsilon-greedy selector) will generate the random number to check how to act: greedily or randomly. In both cases, it will decide which action to take.
* The action will be returned to the experience source, which will feed it into the environment to obtain the reward and the next observation. All this data (the current observation, action, reward, and next observation) will be returned to the buffer.
* The buffer will store the transition, pushing out old observations to keep its length constant.
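
The last step of this chain, a fixed-size buffer that pulls transitions from an iterator and serves random batches, can be sketched with the standard library alone. `TinyReplayBuffer` is a hypothetical illustration, not PTAN's `ExperienceReplayBuffer`; in particular, it samples without replacement, which may differ from what PTAN does.

```python
import itertools
import random
from collections import deque


class TinyReplayBuffer:
    """Fixed-capacity buffer fed from an iterator of transitions."""

    def __init__(self, experience_iter, buffer_size):
        self.experience_iter = experience_iter
        # deque with maxlen drops the oldest entry automatically when full
        self.buffer = deque(maxlen=buffer_size)

    def __len__(self):
        return len(self.buffer)

    def populate(self, count):
        """Pull `count` new transitions from the experience source."""
        for _ in range(count):
            self.buffer.append(next(self.experience_iter))

    def sample(self, batch_size, rng=random):
        """Return up to `batch_size` distinct stored transitions, chosen at random."""
        batch_size = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), batch_size)


# Integers stand in for transitions produced by an experience source
replay = TinyReplayBuffer(itertools.count(), buffer_size=3)
replay.populate(5)
stored = list(replay.buffer)
batch = replay.sample(2, rng=random.Random(0))
print(stored, batch)
```

After five calls with a capacity of three, only the three most recent items remain, which mirrors how old transitions are pushed out during training.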


#### 01_dqn_basic.py


In [12]:
#!/usr/bin/env python3
import gym
import ptan
import argparse

import torch
import torch.optim as optim

from tensorboardX import SummaryWriter

from lib import dqn_model, common  # requires the lib folder at the top level

NSTEP_DQN = 2  # number of steps to unroll the Bellman equation (n-step DQN)

if __name__ == "__main__":
    params = common.HYPERPARAMS['pong']
    params['epsilon_frames'] = 50000
    params['learning_rate'] = 0.001
    print(params)

    device = 'cpu'

    env = gym.make(params['env_name'])
    env = ptan.common.wrappers.wrap_dqn(env)

    writer = SummaryWriter(comment="-" + params['run_name'] + "-basic")
    net = dqn_model.DQN(env.observation_space.shape, env.action_space.n).to(device)

    # Wrapper that keeps a copy of net; its weights are refreshed by calling sync()
    tgt_net = ptan.agent.TargetNet(net)
    # Epsilon-greedy action selection, with epsilon decayed according to the schedule in params
    action_selector = ptan.actions.EpsilonGreedyActionSelector(epsilon=params['epsilon_start'])
    epsilon_tracker = common.EpsilonTracker(action_selector, params)
    # The agent uses net to compute Q-values and action_selector to pick actions
    agent = ptan.agent.DQNAgent(net, action_selector, device=device)

    exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma=params['gamma'], steps_count=NSTEP_DQN)
    buffer = ptan.experience.ExperienceReplayBuffer(exp_source, buffer_size=params['replay_size'])
    optimizer = optim.Adam(net.parameters(), lr=params['learning_rate'])

    frame_idx = 0

    with common.RewardTracker(writer, params['stop_reward']) as reward_tracker:
        while True:
            frame_idx += 1
            buffer.populate(1)
            epsilon_tracker.frame(frame_idx)

            new_rewards = exp_source.pop_total_rewards()
            if new_rewards:
                if reward_tracker.reward(new_rewards[0], frame_idx, action_selector.epsilon):
                    break

            if len(buffer) < params['replay_initial']:
                continue

            optimizer.zero_grad()
            batch = buffer.sample(params['batch_size'])
            loss_v = common.calc_loss_dqn(batch, net, tgt_net.target_model, gamma=params['gamma']**NSTEP_DQN, device=device)
            loss_v.backward()
            optimizer.step()

            if frame_idx % params['target_net_sync'] == 0:
                tgt_net.sync()  # copy the current weights of net into the target network


{'env_name': 'PongNoFrameskip-v4', 'stop_reward': 18.0, 'run_name': 'pong', 'replay_size': 100000, 'replay_initial': 10000, 'target_net_sync': 1000, 'epsilon_frames': 50000, 'epsilon_start': 1.0, 'epsilon_final': 0.02, 'learning_rate': 0.001, 'gamma': 0.99, 'batch_size': 32}
911: done 1 games, mean reward -20.000, speed 160.13 f/s, eps 0.98
1794: done 2 games, mean reward -20.000, speed 171.42 f/s, eps 0.96
2640: done 3 games, mean reward -20.333, speed 124.56 f/s, eps 0.95
3503: done 4 games, mean reward -20.250, speed 180.54 f/s, eps 0.93
4278: done 5 games, mean reward -20.400, speed 143.87 f/s, eps 0.91
5192: done 6 games, mean reward -20.500, speed 98.03 f/s, eps 0.90
6014: done 7 games, mean reward -20.571, speed 96.17 f/s, eps 0.88
7074: done 8 games, mean reward -20.500, speed 77.92 f/s, eps 0.86
7888: done 9 games, mean reward -20.556, speed 200.90 f/s, eps 0.84
8785: done 10 games, mean reward -20.500, speed 154.71 f/s, eps 0.82
9604: done 11 games, mean reward -20.545, speed

KeyboardInterrupt: 

In [9]:
params


{'env_name': 'PongNoFrameskip-v4',
 'stop_reward': 18.0,
 'run_name': 'pong',
 'replay_size': 100000,
 'replay_initial': 10000,
 'target_net_sync': 1000,
 'epsilon_frames': 50000,
 'epsilon_start': 1.0,
 'epsilon_final': 0.02,
 'learning_rate': 0.0001,
 'gamma': 0.99,
 'batch_size': 32}

### N-step DQN

The first improvement that we'll implement and evaluate is quite an old one. It was first introduced in the paper by Richard Sutton ([2] Sutton, 1988). To get the idea, let's look at the **Bellman update** used in **Q-learning** once again.

$$Q(s_t , a_t) = r_t + \gamma \max_a Q(s_{t+1}, a)$$

This equation is recursive, which means that we can express $Q(s_{t+1}, a)$ in terms of itself, which gives us this result:

$$Q(s_t , a_t) = r_t + \gamma \max_a [r_{a,t+1} + \gamma \max_{a'}Q(s_{t+2}, a')]$$

The value $r_{a,t+1}$ is the local reward at time $t+1$, after issuing action $a$. However, if we assume that the action $a$ at step $t+1$ was chosen optimally, or close to optimally, we can omit the $\max_a$ operation and obtain this:

$$Q(s_t , a_t) = r_t + \gamma r_{t+1} + \gamma^2 \max_{a'}Q(s_{t+2}, a')$$

This value could be unrolled again and again any number of times. As you may guess, this unrolling can be easily applied to our DQN update by replacing one-step transition sampling with longer transition sequences of n steps. To understand why this unrolling helps to speed up training, let's consider the example illustrated below. Here we have a simple environment of four states, $s_1$, $s_2$, $s_3$, and $s_4$, with a single action available in every state except $s_4$, which is a terminal state.

![](img/fig7-2.png)

So, what happens in a **one-step** case? We have three total updates possible (we don't use max, as there is only one action available):

1. $Q(s_1 , a) \leftarrow r_1 + \gamma Q(s_2, a)$
2. $Q(s_2 , a) \leftarrow r_2 + \gamma Q(s_3, a)$
3. $Q(s_3 , a) \leftarrow r_3$

Let's imagine that, at the beginning of the training, we complete the updates above in this order. The first two updates will be useless, as our current $Q(s_2, a)$ and $Q(s_3, a)$ are incorrect and contain initial random data. The only useful update will be update three, which correctly assigns reward $r_3$ to the state $s_3$, prior to the terminal state.
Now let's complete the updates above over and over again. On the second iteration, the correct value will be assigned to $Q(s_2, a)$, but the update of $Q(s_1, a)$ will still be noisy. Only on the third iteration will we get valid values for all $Q$. So, in the one-step case, **it takes three iterations to propagate** the correct values to all the states.

Now let's consider a **two-step** case. This situation again has three updates:

1. $Q(s_1 , a) \leftarrow r_1 + \gamma r_2 + \gamma^2 Q(s_3, a)$
2. $Q(s_2 , a) \leftarrow r_2 + \gamma r_3$
3. $Q(s_3 , a) \leftarrow r_3$


In this case, on the first loop over the updates, the correct values will be assigned to both $Q(s_2,a)$ and $Q(s_3,a)$. On the second iteration, the value of $Q(s_1 ,a)$ will be also properly updated. So, multiple steps improve the propagation speed of values, which improves convergence. Okay, you may be thinking that if it's so helpful, let's unroll the Bellman equation, say, 100 steps ahead. Will it speed up our convergence 100 times? Unfortunately, the answer is no.
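
The propagation argument can be checked with a small simulation of the four-state chain. The sketch below applies the updates in the order $s_1$, $s_2$, $s_3$ and counts how many sweeps are needed before every $Q$ matches its exact value. The reward values, the initial estimates, and the function names are hypothetical choices for illustration.

```python
def true_values(rewards, gamma):
    """Exact Q-values of the single-action chain, computed backwards from the end."""
    values = [0.0] * len(rewards)
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        values[i] = running
    return values


def sweep(q, rewards, gamma, n_steps):
    """Apply the n-step update to s_1, s_2, ... in order, modifying q in place."""
    num_states = len(rewards)  # non-terminal states; the terminal state has value 0
    for i in range(num_states):
        end = min(i + n_steps, num_states)
        target = sum(gamma ** k * rewards[i + k] for k in range(end - i))
        if i + n_steps < num_states:
            # The chain stopped before the terminal state, so we bootstrap
            # from the current (possibly wrong) estimate.
            target += gamma ** n_steps * q[i + n_steps]
        q[i] = target


def sweeps_until_correct(initial_q, rewards, gamma, n_steps, max_sweeps=100):
    """Number of full sweeps needed until all estimates equal the exact values."""
    q = list(initial_q)
    exact = true_values(rewards, gamma)
    for count in range(1, max_sweeps + 1):
        sweep(q, rewards, gamma, n_steps)
        if all(abs(a - b) < 1e-9 for a, b in zip(q, exact)):
            return count
    return None


rewards = [1.0, 2.0, 3.0]        # hypothetical r_1, r_2, r_3
initial_q = [10.0, -7.0, 5.0]    # arbitrary "random" initial estimates
one_step = sweeps_until_correct(initial_q, rewards, gamma=0.9, n_steps=1)
two_step = sweeps_until_correct(initial_q, rewards, gamma=0.9, n_steps=2)
print(one_step, two_step)  # prints: 3 2
```

Because this toy chain has a single action, there is no max operation to drop, so longer unrolling only helps here. The caveats discussed next arise once several actions are available and the data comes from an imperfect policy.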

Despite our expectations, our DQN will fail to converge at all. To understand why, let's return to our unrolling process, especially the point where we dropped the $\max_a$. Was it correct? Strictly speaking, no. We omitted the max operation at the intermediate step, assuming that our action selection during experience gathering (our policy) was optimal. What if it wasn't, for example, at the beginning of the training, when our agent acted randomly? In that case, our calculated value for $Q(s_t,a_t)$ may be smaller than the optimal value of the state (as some steps were taken randomly rather than by following the most promising paths that maximize the Q-value). The more steps we unroll the Bellman equation over, the more incorrect our update could be.

Our large experience replay buffer makes the situation even worse, as it increases the chance of getting transitions obtained from an old, bad policy (dictated by old, bad approximations of $Q$). This leads to a wrong update of the current $Q$ approximation, so it can easily break our training progress. This problem is a fundamental characteristic of RL methods, as was briefly mentioned in Chapter 4, The Cross-Entropy Method, when we talked about the taxonomy of RL methods. There are two large classes: the **off-policy** and **on-policy** methods.

The first class, **off-policy** methods, doesn't depend on the **"freshness of data"**. For example, a simple DQN is **off-policy**, which means that we can use very old data sampled from the environment several million steps ago, and this data will still be useful for learning. That's because we're just updating the value of the action $Q(s_t, a_t)$ with the immediate reward plus the discounted current approximation of the best action's value. Even if the action $a_t$ was sampled randomly, it doesn't matter, because for this particular action $a_t$ the update is still correct. That's why, in off-policy methods, we can use a **very large experience buffer** to make our data closer to being **independent and identically distributed (i.i.d.)**.

On the other hand, on-policy methods heavily depend on the training data being sampled according to the current policy we're updating. That happens because on-policy methods try to improve the current policy indirectly (as in the n-step DQN above) or directly (the whole of part three of the book is devoted to such methods).

So, which class of methods is better? Well, it depends. **Off-policy** methods allow you to train on a **large history of data** or even on human demonstrations, but they are usually slower to converge. **On-policy** methods are usually **faster**, but require much more **fresh data** from the environment, which can be costly. Just imagine a self-driving car trained with an on-policy method. It will cost you lots of crashed cars before the system learns that walls and trees are things that it should avoid.

You may have a question: why are we talking about an **n-step DQN** if this "n-stepness" turns it into an on-policy method, which makes our large experience buffer useless? In practice, this is usually not black and white. You may still use an n-step DQN if it helps to speed up training, but you need to be **modest with the selection of n**. **Small values of two or three usually work well**, because our trajectories in the experience buffer are not that different from one-step transitions. In such cases, convergence speed usually improves proportionally, but large values of n can break the training process. So, the number of steps should be tuned, but the speedup in convergence usually makes it worth doing.

#### Implementation
As the `ExperienceSourceFirstLast` class already supports the multi-step Bellman unroll, our n-step version of a DQN is extremely simple. There are only two modifications that we need to make in the basic DQN to turn it into an n-step version:
* Pass the count of steps that we want to unroll on `ExperienceSourceFirstLast` creation in the `steps_count` parameter.
* Pass the correct `gamma` to the `calc_loss_dqn` function. This modification is really easy to overlook, but omitting it can be harmful to convergence. As our Bellman update is now n-step, the discount coefficient for the last state in the experience chain will no longer be just $\gamma$, but $\gamma^n$.

Both changes appear in the training script above. The first is in the line that creates the experience source:
```python
exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma=params['gamma'], steps_count=NSTEP_DQN)
```

The `NSTEP_DQN` constant is the number of steps to unroll; the script above sets it to 2. The second change is in the loss calculation:
```python
loss_v = common.calc_loss_dqn(batch, net, tgt_net.target_model, gamma=params['gamma']**NSTEP_DQN, device=device)
```
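
To see why the discount passed to the loss must be $\gamma^n$, the following sketch computes a two-step target by hand and compares it with the value obtained when $\gamma$ is mistakenly left unchanged. The function `n_step_target` and the numbers are hypothetical and serve only to illustrate the arithmetic.

```python
def n_step_target(discounted_reward, next_q_values, gamma, n_steps, done):
    """Target R_t + gamma^n * max_a Q(s_{t+n}, a); no bootstrap term if done."""
    if done:
        return discounted_reward
    return discounted_reward + (gamma ** n_steps) * max(next_q_values)


gamma = 0.9
r_t, r_t1 = 1.0, 2.0
# R_t as delivered by the experience source for a two-step chain
discounted_reward = r_t + gamma * r_t1
next_q = [0.0, 10.0]  # hypothetical target-network output for s_{t+2}

correct = n_step_target(discounted_reward, next_q, gamma, n_steps=2, done=False)
# Forgetting to raise gamma to the n-th power over-weights the bootstrap term
mistaken = n_step_target(discounted_reward, next_q, gamma, n_steps=1, done=False)
print(correct, mistaken)
```

The two results differ by $(\gamma - \gamma^2) \max_a Q(s_{t+2}, a)$, an error that grows with the magnitude of the Q-values and with n.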

### Double DQN
The next fruitful idea on how to improve a basic DQN came from DeepMind researchers in a paper titled *Deep Reinforcement Learning with Double Q-Learning ([3] van Hasselt, Guez, and Silver, 2015)*. In the paper, the authors demonstrated that the basic DQN has a tendency to overestimate values for Q, which may be harmful to training performance and sometimes can lead to suboptimal policies. The root cause of this is the max operation in the Bellman equation, but the strict proof is too complicated to write down here. As a solution to this problem, the authors proposed **modifying the Bellman update a bit**.
In the basic DQN, our target value for Q looked like this:

$$Q(s_t , a_t) = r_t + \gamma \max_a Q'(s_{t+1}, a)$$

Here, $Q'(s_{t+1}, a)$ denotes the Q-values calculated using our target network, which we sync with the trained network every $n$ steps. The authors of the paper proposed choosing actions for the next state using the trained network but taking values of Q from the target net. So, the new expression for target Q-values will look like this:

$$Q(s_t , a_t) = r_t + \gamma Q'(s_{t+1}, \arg\max_a Q(s_{t+1}, a))$$

The authors showed that this simple tweak substantially reduces overestimation, and they called the new architecture **double DQN**.
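
The role of the max operation in overestimation can be illustrated with a small Monte Carlo experiment. In the sketch below, every action has a true value of zero and both "networks" see independent Gaussian noise. This is an idealization: in real training the target network is a lagged copy of the trained one, so their errors are correlated and the reduction is only partial. The function name and parameters are hypothetical.

```python
import random


def estimate_bias(num_actions=4, noise_std=1.0, trials=20000, seed=0):
    """Average next-state value under the basic and the double DQN rule.

    The true value of every action is 0, so any non-zero average is pure bias.
    """
    rng = random.Random(seed)
    single_total = 0.0
    double_total = 0.0
    for _ in range(trials):
        online = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        target = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        # Basic DQN: maximum over the target network's own noisy estimates
        single_total += max(target)
        # Double DQN: action chosen by the trained (online) network,
        # value read from the target network
        best_action = max(range(num_actions), key=online.__getitem__)
        double_total += target[best_action]
    return single_total / trials, double_total / trials


single_bias, double_bias = estimate_bias()
print(f"basic DQN bias: {single_bias:.3f}, double DQN bias: {double_bias:.3f}")
```

Taking the maximum of noisy estimates is biased upward even when all true values are equal, while selecting the action with one set of estimates and evaluating it with another removes that bias in this idealized setting.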

#### Implementation
The core implementation is very simple. What we need to do is to slightly modify our loss function. Let's go a step further and compare action values produced by the basic DQN and double DQN. To do this, we store a random held-out set of states and periodically calculate the mean value of the best action for every state in the evaluation set.


In [None]:
#!/usr/bin/env python3
import gym
import ptan
import argparse
import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim

from tensorboardX import SummaryWriter

from lib import dqn_model, common

STATES_TO_EVALUATE = 1000  # size of the held-out set of states
EVAL_EVERY_FRAME = 100     # how often (in frames) to evaluate the held-out states


def calc_loss(batch, net, tgt_net, gamma, device="cpu", double=True):
    states, actions, rewards, dones, next_states = common.unpack_batch(batch)

    states_v = torch.tensor(states).to(device)
    next_states_v = torch.tensor(next_states).to(device)
    actions_v = torch.tensor(actions).to(device)
    rewards_v = torch.tensor(rewards).to(device)
    done_mask = torch.ByteTensor(dones).to(device)

    # Q-values of the actions that were actually taken
    state_action_values = net(states_v).gather(1, actions_v.unsqueeze(-1)).squeeze(-1)
    if double:
        # Double DQN: choose the next action with the trained network,
        # but take its value from the target network
        next_state_actions = net(next_states_v).max(1)[1]
        next_state_values = tgt_net(next_states_v).gather(1, next_state_actions.unsqueeze(-1)).squeeze(-1)
    else:
        # Basic DQN: take the maximum value from the target network
        next_state_values = tgt_net(next_states_v).max(1)[0]
    # Terminal transitions have no next state, so their value is zero
    next_state_values[done_mask] = 0.0

    expected_state_action_values = next_state_values.detach() * gamma + rewards_v
    return nn.MSELoss()(state_action_values, expected_state_action_values)


def calc_values_of_states(states, net, device="cpu"):
    # Mean of the best action value over the held-out states, computed in chunks
    mean_vals = []
    for batch in np.array_split(states, 64):
        states_v = torch.tensor(batch).to(device)
        action_values_v = net(states_v)
        best_action_values_v = action_values_v.max(1)[0]
        mean_vals.append(best_action_values_v.mean().item())
    return np.mean(mean_vals)


if __name__ == "__main__":
    params = common.HYPERPARAMS['pong']
    print(params)

    args = {'cuda': False, 'double': True}
    device = torch.device("cuda" if args['cuda'] else "cpu")

    env = gym.make(params['env_name'])
    env = ptan.common.wrappers.wrap_dqn(env)

    writer = SummaryWriter(comment="-" + params['run_name'] + "-double=" + str(args['double']))
    net = dqn_model.DQN(env.observation_space.shape, env.action_space.n).to(device)

    tgt_net = ptan.agent.TargetNet(net)
    selector = ptan.actions.EpsilonGreedyActionSelector(epsilon=params['epsilon_start'])
    epsilon_tracker = common.EpsilonTracker(selector, params)
    agent = ptan.agent.DQNAgent(net, selector, device=device)

    exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma=params['gamma'], steps_count=1)
    buffer = ptan.experience.ExperienceReplayBuffer(exp_source, buffer_size=params['replay_size'])
    optimizer = optim.Adam(net.parameters(), lr=params['learning_rate'])

    frame_idx = 0
    eval_states = None

    with common.RewardTracker(writer, params['stop_reward']) as reward_tracker:
        while True:
            frame_idx += 1
            buffer.populate(1)
            epsilon_tracker.frame(frame_idx)

            new_rewards = exp_source.pop_total_rewards()
            if new_rewards:
                if reward_tracker.reward(new_rewards[0], frame_idx, selector.epsilon):
                    break

            if len(buffer) < params['replay_initial']:
                continue
            if eval_states is None:
                eval_states = buffer.sample(STATES_TO_EVALUATE)
                eval_states = [np.array(transition.state, copy=False) for transition in eval_states]
                eval_states = np.array(eval_states, copy=False)

            optimizer.zero_grad()
            batch = buffer.sample(params['batch_size'])
            loss_v = calc_loss(batch, net, tgt_net.target_model, gamma=params['gamma'], device=device,
                               double=args['double'])
            loss_v.backward()
            optimizer.step()

            if frame_idx % params['target_net_sync'] == 0:
                tgt_net.sync()
            if frame_idx % EVAL_EVERY_FRAME == 0:
                mean_val = calc_values_of_states(eval_states, net, device=device)
                writer.add_scalar("values_mean", mean_val, frame_idx)



{'env_name': 'PongNoFrameskip-v4', 'stop_reward': 18.0, 'run_name': 'pong', 'replay_size': 100000, 'replay_initial': 10000, 'target_net_sync': 1000, 'epsilon_frames': 100000, 'epsilon_start': 1.0, 'epsilon_final': 0.02, 'learning_rate': 0.0001, 'gamma': 0.99, 'batch_size': 32}
863: done 1 games, mean reward -20.000, speed 142.41 f/s, eps 0.99
1741: done 2 games, mean reward -20.500, speed 176.48 f/s, eps 0.98
2571: done 3 games, mean reward -20.333, speed 176.74 f/s, eps 0.97
3526: done 4 games, mean reward -20.250, speed 67.96 f/s, eps 0.96
4543: done 5 games, mean reward -20.400, speed 69.93 f/s, eps 0.95
5407: done 6 games, mean reward -20.333, speed 114.97 f/s, eps 0.95
6166: done 7 games, mean reward -20.429, speed 115.23 f/s, eps 0.94
7041: done 8 games, mean reward -20.500, speed 160.89 f/s, eps 0.93
8105: done 9 games, mean reward -20.444, speed 152.66 f/s, eps 0.9

In [12]:
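class DotDict(dict):
    """Dictionary whose items can also be accessed as attributes."""
    __getattr__ = dict.get  # note: missing keys return None instead of raising
    __setattr__ = dict.__setitem__

args = DotDict()
args['cuda'] = False
args.cuda

False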