# The PTAN library

At a high level, PTAN provides the following entities:
- Agent: a class that knows how to convert a batch of observations to a batch
of actions to be executed. It can contain an optional state, in case you need
to track some information between consecutive actions in one episode.
(We will use this approach in Chapter 17, Continuous Action Space, in the
deep deterministic policy gradient (DDPG) method, which includes the
Ornstein–Uhlenbeck random process for exploration.) The library provides
several agents for the most common RL cases, but you can always write your
own subclass of BaseAgent.
- ActionSelector: a small piece of logic that knows how to choose the action
from some output of the network. It works in tandem with Agent.
- ExperienceSource and variations: using the Agent instance and a Gym
environment object, these classes provide information about the trajectory
from episodes. In its simplest form, this is a single (s, a, r, s') transition at a
time, but the functionality goes beyond this.
- ExperienceSourceBuffer and friends: replay buffers with various
characteristics. They include a simple replay buffer and two versions of
prioritized replay buffers.
- Various utility classes, like TargetNet and wrappers for time series
preprocessing (used for tracking training progress in TensorBoard).
- PyTorch Ignite helpers to integrate PTAN into the Ignite framework.
- Wrappers for Gym environments, for example, wrappers for Atari games
(copied and pasted from OpenAI Baselines with some tweaks).

## Action selectors
In PTAN terminology, an action selector is an object that helps with going from
network output to concrete action values. The most common cases include:
- Argmax: commonly used by Q-value methods when the network predicts
Q-values for a set of actions and the desired action is the action with the
largest Q(s, a).
- Policy-based: the network outputs a probability distribution (in the form
of logits or a normalized distribution), and an action needs to be sampled from
this distribution. You have already seen this case in Chapter 4, The
Cross-Entropy Method.

An action selector is used by the Agent and rarely needs to be customized (but you
have this option). The concrete classes provided by the library are:
- ArgmaxActionSelector: applies argmax on the second axis of a passed
tensor. (It assumes a matrix with batch dimension along the first axis.)
- ProbabilityActionSelector: samples from the probability distribution
of a discrete set of actions.
- EpsilonGreedyActionSelector: has the parameter epsilon, which
specifies the probability of a random action to be taken.

In [1]:
import ptan
import numpy as np


q_vals = np.array([[1, 2, 3], [1, -1, 0]])
print("q_vals")
print(q_vals)

selector = ptan.actions.ArgmaxActionSelector()
print("argmax:", selector(q_vals))

selector = ptan.actions.EpsilonGreedyActionSelector(epsilon=0.0)
print("epsilon=0.0:", selector(q_vals))

selector.epsilon = 1.0
print("epsilon=1.0:", selector(q_vals))

selector.epsilon = 0.5
print("epsilon=0.5:", selector(q_vals))
selector.epsilon = 0.1
print("epsilon=0.1:", selector(q_vals))

selector = ptan.actions.ProbabilityActionSelector()
print("Actions sampled from three prob distributions:")
for _ in range(10):
    acts = selector(np.array([
        [0.1, 0.8, 0.1],
        [0.0, 0.0, 1.0],
        [0.5, 0.5, 0.0]
    ]))
    print(acts)

q_vals
[[ 1  2  3]
 [ 1 -1  0]]
argmax: [2 0]
epsilon=0.0: [2 0]
epsilon=1.0: [0 2]
epsilon=0.5: [2 0]
epsilon=0.1: [2 0]
Actions sampled from three prob distributions:
[1 2 0]
[1 2 0]
[0 2 0]
[2 2 0]
[2 2 1]
[1 2 1]
[2 2 1]
[1 2 1]
[1 2 1]
[1 2 1]


All the classes assume that NumPy arrays will be passed to them.
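
To make the three behaviours concrete, here is a minimal NumPy sketch of what such selectors might do internally. The function names (`argmax_select`, `epsilon_greedy_select`, `probability_select`) and the explicit `rng` argument are assumptions made for this illustration. They are not part of the PTAN API, and the library's own implementation may differ in details.

```python
import numpy as np


def argmax_select(scores: np.ndarray) -> np.ndarray:
    """Greedy choice: index of the largest score in every row."""
    return np.argmax(scores, axis=1)


def epsilon_greedy_select(scores: np.ndarray, epsilon: float,
                          rng: np.random.Generator) -> np.ndarray:
    """Greedy choice, replaced by a uniformly random action with probability epsilon."""
    batch_size, n_actions = scores.shape
    actions = argmax_select(scores)
    # rows where the random draw falls below epsilon get a random action
    mask = rng.random(batch_size) < epsilon
    actions[mask] = rng.integers(n_actions, size=int(mask.sum()))
    return actions


def probability_select(probs: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Sample one action per row; every row must be a normalized distribution."""
    return np.array([rng.choice(len(row), p=row) for row in probs])


rng = np.random.default_rng(0)  # hypothetical fixed seed, only for repeatability
q_vals = np.array([[1, 2, 3], [1, -1, 0]])
greedy = argmax_select(q_vals)                            # largest Q-value per row
same_as_greedy = epsilon_greedy_select(q_vals, 0.0, rng)  # epsilon=0 never explores
all_random = epsilon_greedy_select(q_vals, 1.0, rng)      # epsilon=1 always explores
sampled = probability_select(np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]), rng)
```

With epsilon set to 0.0 the result coincides with the argmax choice, and with degenerate distributions (all probability mass on one action) the sampled action is fully determined, which mirrors the second row of the sampling example above.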

## The agent

The agent entity provides a unified way of bridging observations from the
environment and the actions that we want to execute. So far, you have seen only
a simple, stateless DQN agent that uses a neural network (NN) to obtain actions'
values from the current observation and behaves greedily on those values. We have
used epsilon-greedy behavior to explore the environment, but this doesn't change
the picture much.

In the RL field, this could be more complicated. For example, instead of predicting
the values of the actions, our agent could predict a probability distribution over the
actions. Such agents are called policy agents, and we will talk about those methods
in part three of the book.

The other requirement could be the necessity for the agent to keep a state
between observations. For example, very often one observation (or even the last k
observations) is not enough to make a decision about the action, and we want to keep
some memory in the agent to capture the necessary information. There is a whole
subdomain of RL that tries to address this complication with the partially observable
Markov decision process (POMDP) formalism, which is not covered in the book.

The third variant of the agent is very common in continuous control problems, which
will be discussed in part four of the book. For now, it will be enough to say that in
such cases, actions are no longer discrete but continuous values, and the
agent needs to predict them from the observations.

To capture all those variants and make the code flexible, the agent in PTAN is
implemented as an extensible hierarchy of classes with the ptan.agent.BaseAgent
abstract class at the top. From a high level, the agent needs to accept the batch of
observations (in the form of a NumPy array) and return the batch of actions that it
wants to take. The batch is used to make the processing more efficient, as processing
several observations in one pass on a graphics processing unit (GPU) is frequently
much faster than processing them individually.

The abstract base class doesn't define the types of input and output, which makes
it very flexible and easy to extend. For example, in the continuous domain, our
actions will no longer be indices of discrete actions, but float values.

In any case, the agent can be seen as something that knows how to convert
observations into actions, and it's up to the agent how to do this. In general,
there are no assumptions made on observation and action types, but the concrete
implementation of agents is more limiting. PTAN provides two of the most common
ways to convert observations into actions: DQNAgent and PolicyAgent.

In real problems, a custom agent is often needed. These are some of the reasons:
- The architecture of the NN is fancy—its action space is a mixture of
continuous and discrete, it has multimodal observations (text and pixels,
for example), or something like that.
- You want to use non-standard exploration strategies, for example, the
Ornstein–Uhlenbeck process (a very popular exploration strategy in the
continuous control domain).
- You have a POMDP environment, and the agent's decision is not fully
defined by observations, but by some internal agent state (which is also
the case for Ornstein–Uhlenbeck exploration).
All those cases are easily supported by subclassing the BaseAgent class, and in the
rest of the book, several examples of such redefinition will be given.
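
As an illustration of an agent that carries internal state between calls, the sketch below keeps one Ornstein-Uhlenbeck noise value per environment and adds it to the action. It is written as a plain class so that it runs without PTAN. The class name `OUNoiseAgent`, the `policy` callable, and the parameter defaults are assumptions for this example only, and a real PTAN agent would subclass BaseAgent instead.

```python
import numpy as np


class OUNoiseAgent:
    """Hypothetical stateful agent; it follows the (observations, states)
    call convention but deliberately does not depend on PTAN."""

    def __init__(self, policy, theta=0.15, mu=0.0, sigma=0.2, seed=0):
        self.policy = policy      # maps one observation to a float action
        self.theta = theta        # how fast the noise returns to mu
        self.mu = mu              # long-term mean of the noise
        self.sigma = sigma        # scale of the random part
        self.rng = np.random.default_rng(seed)

    def initial_state(self):
        # noise level at the start of an episode
        return 0.0

    def __call__(self, observations, agent_states):
        actions, new_states = [], []
        for obs, noise in zip(observations, agent_states):
            # one discrete-time Ornstein-Uhlenbeck step
            noise = (noise + self.theta * (self.mu - noise)
                     + self.sigma * self.rng.normal())
            actions.append(self.policy(obs) + noise)
            new_states.append(noise)
        return actions, new_states


# sigma=0 removes the randomness so that the state hand-over is easy to follow
agent = OUNoiseAgent(policy=lambda obs: 0.0, theta=0.5, mu=1.0, sigma=0.0)
states = [agent.initial_state()]
first_actions, states = agent([0], states)
second_actions, states = agent([1], states)
```

The caller passes the returned states back in on the next call, so the agent object itself stays free of per-episode bookkeeping and can serve several environments at once.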

### DQNAgent and PolicyAgent
DQNAgent is applicable in Q-learning when the action space is not very large,
which covers Atari games and lots of classical problems. This representation (one
Q-value per discrete action) is not universal, and later in the book, you will see
ways of dealing with that. DQNAgent takes a batch of observations on input (as a
NumPy array), applies the network on them to get Q-values, and then uses the
provided ActionSelector to convert Q-values to indices of actions.

PolicyAgent expects the network to produce a policy distribution over a discrete set of actions. The policy distribution could be either logits (unnormalized) or a normalized distribution. In practice, you should always use logits to improve the numeric stability of the training process.

Let's consider a small example of both agents. For simplicity, our networks always
produce the same output for the input batch. Note that an agent call returns a tuple
of actions and agent states, which is why most calls in the example take element [0].

In [2]:
import torch
import torch.nn as nn


class DQNNet(nn.Module):
    def __init__(self, actions: int):
        super(DQNNet, self).__init__()
        self.actions = actions

    def forward(self, x):
        # we always produce diagonal tensor of shape (batch_size, actions)
        return torch.eye(x.size()[0], self.actions)


class PolicyNet(nn.Module):
    def __init__(self, actions: int):
        super(PolicyNet, self).__init__()
        self.actions = actions

    def forward(self, x):
        # Now we produce the tensor with first two actions
        # having the same logit scores
        shape = (x.size()[0], self.actions)
        res = torch.zeros(shape, dtype=torch.float32)
        res[:, 0] = 1
        res[:, 1] = 1
        return res


net = DQNNet(actions=3)
net_out = net(torch.zeros(2, 10))
print("dqn_net:")
print(net_out)

selector = ptan.actions.ArgmaxActionSelector()
agent = ptan.agent.DQNAgent(dqn_model=net, action_selector=selector)
ag_out = agent(torch.zeros(2, 5))
print("Argmax:", ag_out)

selector = ptan.actions.EpsilonGreedyActionSelector(epsilon=1.0)
agent = ptan.agent.DQNAgent(dqn_model=net, action_selector=selector)
ag_out = agent(torch.zeros(10, 5))[0]
print("eps=1.0:", ag_out)

selector.epsilon = 0.5
ag_out = agent(torch.zeros(10, 5))[0]
print("eps=0.5:", ag_out)

selector.epsilon = 0.1
ag_out = agent(torch.zeros(10, 5))[0]
print("eps=0.1:", ag_out)

net = PolicyNet(actions=5)
net_out = net(torch.zeros(6, 10))
print("policy_net:")
print(net_out)

selector = ptan.actions.ProbabilityActionSelector()
agent = ptan.agent.PolicyAgent(model=net, action_selector=selector, apply_softmax=True)
ag_out = agent(torch.zeros(6, 5))[0]
print(ag_out)

dqn_net:
tensor([[1., 0., 0.],
        [0., 1., 0.]])
Argmax: (array([0, 1]), [None, None])
eps=1.0: [1 2 2 0 2 0 1 1 2 0]
eps=0.5: [0 2 2 2 2 0 1 0 1 0]
eps=0.1: [0 1 2 0 0 0 0 1 0 0]
policy_net:
tensor([[1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.]])
[1 4 0 1 2 1]
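
The apply_softmax=True argument in the PolicyAgent example asks the agent to normalize the logits before sampling. The following NumPy sketch shows one way such a normalization and sampling step can be written. The `softmax` helper and the fixed seed are assumptions for illustration and are not taken from PTAN.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp()."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# the same logits that PolicyNet produces for one observation
logits = np.array([[1.0, 1.0, 0.0, 0.0, 0.0]])
probs = softmax(logits)

# sample one action per row from the normalized distribution
rng = np.random.default_rng(0)
actions = np.array([rng.choice(len(p), p=p) for p in probs])

# very large logits would overflow a naive exp(), but not the shifted version
stable = softmax(np.array([[1000.0, 1000.0]]))
```

With these logits, actions 0 and 1 each get probability e / (2e + 3), roughly 0.32, and the remaining three actions get roughly 0.12 each, so all five indices can appear in the sampled output.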


## Experience source

The agent abstraction described in the previous section allows us to implement
environment communications in a generic way. These communications happen in the
form of trajectories, produced by applying the agent's actions to the Gym environment.
At a high level, experience source classes take the agent instance and environment
and provide you with step-by-step data from the trajectories. The functionality
of those classes includes:
- Support for communicating with multiple environments at the same time.
This allows efficient GPU utilization, as a batch of observations is
processed by the agent at once.
- A trajectory can be preprocessed and presented in a convenient form for
further training. For example, there is an implementation of subtrajectory
rollouts with accumulation of the reward. That preprocessing is convenient
for DQN and n-step DQN, when we are not interested in individual
intermediate steps in subtrajectories, so they can be dropped. This
saves memory and reduces the amount of code we need to write.
- Support for vectorized environments from OpenAI Universe. We will
cover this in Chapter 16, Web Navigation, for web automation and
MiniWoB environments.
So, the experience source classes act as a "magic black box" to hide the environment
interaction and trajectory handling complexities from the library user. But the overall
PTAN philosophy is to be flexible and extensible, so if you want, you can subclass
one of the existing classes or implement your own version as needed.

There are three classes provided by the system:
- ExperienceSource: using the agent and the set of environments, it produces
n-step subtrajectories with all intermediate steps.
- ExperienceSourceFirstLast: this is the same as ExperienceSource, but
instead of a full subtrajectory (with all steps), it keeps only the first and last
steps, with proper reward accumulation in between. This can save a lot of
memory in the case of n-step DQN or advantage actor-critic (A2C) rollouts.
- ExperienceSourceRollouts: this follows the asynchronous advantage
actor-critic (A3C) rollouts scheme described in Mnih's paper about Atari
games (referenced in Chapter 12, The Actor-Critic Method).

All the classes are written to be efficient both in terms of central processing unit
(CPU) and memory, which is not very important for toy problems, but might become
an issue when you want to solve Atari games and need to keep 10M samples in the
replay buffer using commodity hardware.

For demonstration, we will implement a very simple Gym environment with
a small predictable observation state to show how experience source classes work.
This environment has an integer observation that cycles through the values 0 to 4, an
integer action, and a reward equal to the action given.

In [3]:
import gym
import ptan
from typing import List, Optional, Tuple, Any


class ToyEnv(gym.Env):
    """
    Environment with observations 0..4 and actions 0..2.
    Observations are rotated sequentially mod 5, reward is equal to the given action.
    Episodes have a fixed length of 10.
    """

    def __init__(self):
        super(ToyEnv, self).__init__()
        self.observation_space = gym.spaces.Discrete(n=5)
        self.action_space = gym.spaces.Discrete(n=3)
        self.step_index = 0

    def reset(self):
        self.step_index = 0
        return self.step_index

    def step(self, action):
        is_done = self.step_index == 10
        if is_done:
            return self.step_index % self.observation_space.n, \
                   0.0, is_done, {}
        self.step_index += 1
        return self.step_index % self.observation_space.n, \
               float(action), self.step_index == 10, {}

In addition to this environment, we will use an agent that always generates fixed
actions regardless of observations:

In [4]:
class DullAgent(ptan.agent.BaseAgent):
    """
    Agent always returns the fixed action
    """
    def __init__(self, action: int):
        self.action = action

    def __call__(self, observations: List[Any],
                 state: Optional[List] = None) \
            -> Tuple[List[int], Optional[List]]:
        return [self.action for _ in observations], state

In [5]:
env = ToyEnv()
s = env.reset()
print("env.reset() -> %s" % s)
s = env.step(1)
print("env.step(1) -> %s" % str(s))
s = env.step(2)
print("env.step(2) -> %s" % str(s))

for _ in range(10):
    r = env.step(0)
    print(r)

agent = DullAgent(action=1)
print("agent:", agent([1, 2])[0])

env.reset() -> 0
env.step(1) -> (1, 1.0, False, {})
env.step(2) -> (2, 2.0, False, {})
(3, 0.0, False, {})
(4, 0.0, False, {})
(0, 0.0, False, {})
(1, 0.0, False, {})
(2, 0.0, False, {})
(3, 0.0, False, {})
(4, 0.0, False, {})
(0, 0.0, True, {})
(0, 0.0, True, {})
(0, 0.0, True, {})
agent: [1, 1]


The first class is ptan.experience.ExperienceSource, which generates chunks
of agent trajectories of the given length. The implementation automatically handles
the end of episode situation (when the step() method in the environment returns
is_done=True) and resets the environment.
The constructor accepts several arguments:
- The Gym environment to be used. Alternatively, it could be the list of
environments.
- The agent instance.
- steps_count=2: the length of subtrajectories to be generated.
- vectorized=False: if set to True, the environment needs to be an OpenAI
Universe vectorized environment. We will discuss such environments in
detail in Chapter 16, Web Navigation.

The class instance provides the standard Python iterator interface, so you can just
iterate over this to get subtrajectories:

In [6]:
env = ToyEnv()
agent = DullAgent(action=1)
exp_source = ptan.experience.ExperienceSource(env=env, agent=agent, steps_count=2)
for idx, exp in enumerate(exp_source):
    if idx > 15:
        break
    print(exp)

(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))
(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))
(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))
(Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=False))
(Experience(state=4, action=1, reward=1.0, done=False), Experience(state=0, action=1, reward=1.0, done=False))
(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))
(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))
(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))
(Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=True))
(E

On every iteration, ExperienceSource returns a piece of the agent's trajectory
in environment communication. It might look simple, but there are several things
happening under the hood of our example:
1. reset() was called in the environment to get the initial state
2. The agent was asked to select the action to execute from the state returned
3. The step() method was executed to get the reward and the next state
4. This next state was passed to the agent for the next action
5. Information about the transition from one state to the next state was returned
6. The process repeated (from step 3) for as long as we kept iterating over the
experience source

If the agent changes the way it generates actions (we can get this by updating the
network weights, decreasing epsilon, or by some other means), it will immediately
affect the experience trajectories that we get.

The ExperienceSource instance returns tuples of length equal to or less than the
steps_count argument passed on construction. In our case, we asked for two-step
subtrajectories, so tuples will be of length 2 or 1 (at the end of episodes). Every object
in a tuple is an instance of the ptan.experience.Experience class, which is a
namedtuple with the following fields:
- state: the state we observed before taking the action
- action: the action we took
- reward: the immediate reward we got from env
- done: whether the episode was done

If the episode reaches the end, the subtrajectory will be shorter and the underlying
environment will be reset automatically, so we don't need to bother with this and
can just keep iterating.
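
The sliding-window behaviour described here, including the shorter tuples at the end of an episode and the automatic reset, can be sketched in plain Python. This is a simplified single-environment illustration and not PTAN's code. `CountingEnv` and `n_step_windows` are hypothetical names, and the policy is a plain callable instead of an agent object.

```python
from collections import deque, namedtuple
from itertools import islice

Experience = namedtuple("Experience", ["state", "action", "reward", "done"])


class CountingEnv:
    """Hypothetical stand-in for ToyEnv: observations cycle 0..4 and
    an episode ends after `length` steps."""

    def __init__(self, length):
        self.length = length
        self.step_index = 0

    def reset(self):
        self.step_index = 0
        return 0

    def step(self, action):
        self.step_index += 1
        done = self.step_index == self.length
        return self.step_index % 5, float(action), done, {}


def n_step_windows(env, policy, steps_count):
    """Yield tuples of at most steps_count consecutive Experience items."""
    history = deque(maxlen=steps_count)
    state = env.reset()
    while True:
        action = policy(state)
        next_state, reward, done, _ = env.step(action)
        history.append(Experience(state, action, reward, done))
        if len(history) == steps_count:
            yield tuple(history)
        state = next_state
        if done:
            # episode shorter than the window: emit what we have
            if len(history) < steps_count:
                yield tuple(history)
            # emit the shrinking tail of the episode
            while len(history) > 1:
                history.popleft()
                yield tuple(history)
            history.clear()
            state = env.reset()


# three-step episodes, two-step windows, a policy that always picks action 1
source = n_step_windows(CountingEnv(length=3), lambda state: 1, steps_count=2)
windows = list(islice(source, 4))
lengths = [len(w) for w in windows]
```

In this run the third window contains only the final step of the first episode, and the fourth window already starts from state 0 of the next episode, without any explicit reset call by the consumer.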

We can ask ExperienceSource for subtrajectories of any length:

In [7]:
exp_source = ptan.experience.ExperienceSource(env=env, agent=agent, steps_count=4)
print(next(iter(exp_source)))

(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))


We can pass it several instances of gym.Env. In that case, they will be used in
round-robin fashion.

In [8]:
exp_source = ptan.experience.ExperienceSource(env=[ToyEnv(), ToyEnv()], agent=agent, steps_count=2)
for idx, exp in enumerate(exp_source):
    if idx > 4:
        break
    print(exp)

(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))
(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))
(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))
(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))
(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))


The ExperienceSource class provides us with full subtrajectories of the given length
as a tuple of (s, a, r) objects. The next state, s', is returned in the next tuple, which is
not always convenient. For example, in DQN training, we want to have tuples (s, a,
r, s') at once to do a one-step Bellman approximation during the training.

In addition, some extensions of DQN, like n-step DQN, might want to collapse longer sequences
of observations into (first-state, action, total-reward-for-n-steps, state-after-step-n).
To support this in a generic way, a simple subclass of ExperienceSource is
implemented: ExperienceSourceFirstLast. It accepts almost the same arguments
in the constructor, but returns different data.

In [9]:
exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma=1.0, steps_count=1)
for idx, exp in enumerate(exp_source):
    print(exp)
    if idx > 10:
        break

ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)
ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)
ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)
ExperienceFirstLast(state=3, action=1, reward=1.0, last_state=4)
ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=0)
ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)
ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)
ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)
ExperienceFirstLast(state=3, action=1, reward=1.0, last_state=4)
ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=None)
ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)
ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)


Now it returns a single object on every iteration, which is again a namedtuple
with the following fields:
- state: the state we used to decide on the action to take
- action: the action we took at this step
- reward: the partial accumulated reward for steps_count (in our case,
steps_count=1, so it is equal to the immediate reward)
- last_state: the state we got after executing the action. If our episode ends,
we have None here

This data is much more convenient for DQN training, as we can apply Bellman
approximation directly to it.
Let's check the result with a larger number of steps:

In [10]:
exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma=1.0, steps_count=4)
for idx, exp in enumerate(exp_source):
    print(exp)
    if idx > 10:
        break

ExperienceFirstLast(state=0, action=1, reward=4.0, last_state=4)
ExperienceFirstLast(state=1, action=1, reward=4.0, last_state=0)
ExperienceFirstLast(state=2, action=1, reward=4.0, last_state=1)
ExperienceFirstLast(state=3, action=1, reward=4.0, last_state=2)
ExperienceFirstLast(state=4, action=1, reward=4.0, last_state=3)
ExperienceFirstLast(state=0, action=1, reward=4.0, last_state=4)
ExperienceFirstLast(state=1, action=1, reward=4.0, last_state=None)
ExperienceFirstLast(state=2, action=1, reward=3.0, last_state=None)
ExperienceFirstLast(state=3, action=1, reward=2.0, last_state=None)
ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=None)
ExperienceFirstLast(state=0, action=1, reward=4.0, last_state=4)
ExperienceFirstLast(state=1, action=1, reward=4.0, last_state=0)


So, now we are collapsing four steps on every iteration and calculating the total
reward accumulated over them (with gamma=1.0, this is a plain sum). As the episode
ends, we have last_state=None in those samples, but additionally, we calculate the
reward for the tail of the episode. Those tiny details are very easy
to implement wrongly if you are doing all the trajectory handling yourself.
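
The collapsing step itself can be sketched as follows. This is an illustration of the idea under simple assumptions, not the library's implementation: `collapse` and `make_window` are hypothetical helpers, and the window is assumed to hold up to steps_count + 1 items so that the extra item supplies the state after step n.

```python
from collections import namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "done"])
ExperienceFirstLast = namedtuple(
    "ExperienceFirstLast", ["state", "action", "reward", "last_state"])


def collapse(window, steps_count, gamma):
    """Collapse a window of up to steps_count + 1 Experience items."""
    if window[-1].done and len(window) <= steps_count:
        # the episode ended inside the window: no state to bootstrap from
        last_state = None
        steps = window
    else:
        # the extra trailing item only supplies the state after step n
        last_state = window[-1].state
        steps = window[:-1]
    total = 0.0
    for exp in reversed(steps):
        # fold from the end: r_t + gamma * (reward of everything after)
        total = exp.reward + gamma * total
    return ExperienceFirstLast(steps[0].state, steps[0].action, total, last_state)


def make_window(states, last_done):
    """Helper for the demo: reward 1.0 and action 1 on every step."""
    n = len(states)
    return tuple(Experience(s, 1, 1.0, last_done and i == n - 1)
                 for i, s in enumerate(states))


full = collapse(make_window([0, 1, 2, 3, 4], last_done=False), 4, gamma=1.0)
tail = collapse(make_window([2, 3, 4], last_done=True), 4, gamma=1.0)
discounted = collapse(make_window([0, 1, 2, 3, 4], last_done=False), 4, gamma=0.5)
```

Here `full` reproduces the first line of the output above (reward 4.0, last_state=4) and `tail` reproduces the episode-end sample with reward 3.0 and last_state=None. With gamma=0.5 the same four unit rewards accumulate to 1 + 0.5 + 0.25 + 0.125 = 1.875.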

## Experience replay buffers

In DQN, we rarely deal with immediate experience samples, as they are heavily
correlated, which leads to instability in the training. Normally, we have large replay
buffers, which are populated with experience pieces. Then the buffer is sampled
(randomly or with priority weights) to get the training batch. The replay buffer
normally has a maximum capacity, so old samples are pushed out when the replay
buffer reaches the limit.

There are several implementation tricks here, which become extremely important
when you need to deal with large problems:
- How to efficiently sample from a large buffer
- How to push old samples from the buffer
- In the case of a prioritized buffer, how priorities need to be maintained and
handled in the most efficient way

All this becomes quite a non-trivial task if you want to solve Atari, keeping 10-100M
samples, where every sample is an image from the game. A small mistake can lead
to a 10-100x memory increase and major slowdowns of the training process.

PTAN provides several variants of replay buffers, which integrate simply with the
ExperienceSource and Agent machinery. Normally, what you need to do is ask
the buffer to pull a new sample from the source and then sample the training batch. The
provided classes are:
- ExperienceReplayBuffer: a simple replay buffer of predefined size with
uniform sampling.
- PrioReplayBufferNaive: a simple, but not very efficient, prioritized replay
buffer implementation. The complexity of sampling is O(n), which might
become an issue with large buffers. Its advantage over the optimized class
is much simpler code.
- PrioritizedReplayBuffer: uses segment trees for sampling, which makes
the code cryptic, but gives O(log(n)) sampling complexity.

In [11]:
env = ToyEnv()
agent = DullAgent(action=1)
exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma=1.0, steps_count=1)
buffer = ptan.experience.ExperienceReplayBuffer(exp_source, buffer_size=100)
len(buffer)

0

All replay buffers provide the following interface:
- A Python iterator interface to walk over all the samples in the buffer
- The method populate(N) to get N samples from the experience source
and put them into the buffer
- The method sample(N) to get the batch of N experience objects

So, the normal training loop for DQN looks like an infinite repetition of the following
steps:
1. Call buffer.populate(1) to get a fresh sample from the environment
2. batch = buffer.sample(BATCH_SIZE) to get the batch from the buffer
3. Calculate the loss on the sampled batch
4. Backpropagate
5. Repeat until convergence (hopefully)

All the rest happens automatically: resetting the environment, handling
subtrajectories, buffer size maintenance, and so on.

In [12]:
for step in range(6):
    buffer.populate(1)
    # if the buffer is still too small, skip training
    if len(buffer) < 5:
        continue
    batch = buffer.sample(4)
    print("Train time, %d batch samples:" % len(batch))
    for s in batch:
        print(s)

Train time, 4 batch samples:
ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)
ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)
ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=0)
ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)
Train time, 4 batch samples:
ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)
ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)
ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)
ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)
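
The populate/sample interface and the way old samples are pushed out can be illustrated with a small standalone sketch. `SimpleReplayBuffer` is a hypothetical class written for this example, and sampling with replacement is an assumption chosen because it is consistent with the duplicated entries in the batches above. The real PTAN buffers may be implemented differently.

```python
import random
from itertools import count


class SimpleReplayBuffer:
    """Hypothetical uniform replay buffer that overwrites its oldest entry."""

    def __init__(self, source, capacity):
        self.source = iter(source)   # any iterator of experience items
        self.capacity = capacity
        self.items = []
        self.pos = 0                 # slot that the next item will occupy

    def __len__(self):
        return len(self.items)

    def populate(self, n):
        """Pull n fresh items from the source into the buffer."""
        for _ in range(n):
            item = next(self.source)
            if len(self.items) < self.capacity:
                self.items.append(item)
            else:
                self.items[self.pos] = item   # overwrite the oldest item
            self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        """Uniform sampling with replacement, so duplicates are possible."""
        return random.choices(self.items, k=batch_size)


# the "experience" here is just the integers 0, 1, 2, ...
buffer = SimpleReplayBuffer(count(), capacity=3)
buffer.populate(5)
batch = buffer.sample(4)
```

Overwriting a fixed-size list in circular order keeps both insertion and eviction at constant cost, which is one simple answer to the "how to push old samples" question raised earlier.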


## The TargetNet class

TargetNet is a small but useful class that allows us to synchronize two NNs
of the same architecture. The purpose of this was described in the previous
chapter: improving training stability. TargetNet supports two modes of such
synchronization:
- sync(): weights from the source network are copied into the target network.
- alpha_sync(): the source network's weights are blended into the target
network with some alpha weight (between 0 and 1).

The first mode is the standard way to perform a target network sync in discrete
action space problems, like Atari and CartPole. We did this in Chapter 6, Deep
Q-Networks.

The latter mode is used in continuous control problems, which will
be described in several chapters in part four of the book. In such problems, the
transition between two networks' parameters should be smooth, so alpha blending is
used, given by the formula $w_i = w_i \alpha + s_i (1 - \alpha)$, where $w_i$ is the target
network's ith parameter and $s_i$ is the source network's weight.

The following is a small example of how TargetNet should be used in code. Assume we
have the following network:

In [13]:
class DQNNet(nn.Module):
    def __init__(self):
        super(DQNNet, self).__init__()
        self.ff = nn.Linear(5, 3)

    def forward(self, x):
        return self.ff(x)

In [14]:
net = DQNNet()
print(net)
tgt_net = ptan.agent.TargetNet(net)

DQNNet(
  (ff): Linear(in_features=5, out_features=3, bias=True)
)


The target network contains two fields: model, which is the reference to the original
network, and target_model, which is a deep copy of it. If we examine both
networks' weights, they will be the same:

In [15]:
print("Main net:", net.ff.weight)

Main net: Parameter containing:
tensor([[ 0.1682, -0.4226,  0.1878,  0.3536,  0.2065],
        [-0.1958, -0.1072, -0.3952, -0.0881, -0.1878],
        [ 0.3368,  0.2458,  0.2772, -0.2961, -0.0219]], requires_grad=True)


In [16]:
tgt_net.target_model.ff.weight

Parameter containing:
tensor([[ 0.1682, -0.4226,  0.1878,  0.3536,  0.2065],
        [-0.1958, -0.1072, -0.3952, -0.0881, -0.1878],
        [ 0.3368,  0.2458,  0.2772, -0.2961, -0.0219]], requires_grad=True)

However, they are independent of each other and merely share the same architecture:

In [17]:
net.ff.weight.data += 1.0

In [18]:
print("After update")
print("Main net:", net.ff.weight)
print("Target net:", tgt_net.target_model.ff.weight)

After update
Main net: Parameter containing:
tensor([[1.1682, 0.5774, 1.1878, 1.3536, 1.2065],
        [0.8042, 0.8928, 0.6048, 0.9119, 0.8122],
        [1.3368, 1.2458, 1.2772, 0.7039, 0.9781]], requires_grad=True)
Target net: Parameter containing:
tensor([[ 0.1682, -0.4226,  0.1878,  0.3536,  0.2065],
        [-0.1958, -0.1072, -0.3952, -0.0881, -0.1878],
        [ 0.3368,  0.2458,  0.2772, -0.2961, -0.0219]], requires_grad=True)


To synchronize them again, the sync() method can be used:

In [19]:
tgt_net.sync()
print("After sync")
print("Main net:", net.ff.weight)
print("Target net:", tgt_net.target_model.ff.weight)

After sync
Main net: Parameter containing:
tensor([[1.1682, 0.5774, 1.1878, 1.3536, 1.2065],
        [0.8042, 0.8928, 0.6048, 0.9119, 0.8122],
        [1.3368, 1.2458, 1.2772, 0.7039, 0.9781]], requires_grad=True)
Target net: Parameter containing:
tensor([[1.1682, 0.5774, 1.1878, 1.3536, 1.2065],
        [0.8042, 0.8928, 0.6048, 0.9119, 0.8122],
        [1.3368, 1.2458, 1.2772, 0.7039, 0.9781]], requires_grad=True)
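
The alpha_sync() mode is not exercised in the cells above. As a rough numeric illustration of the blending formula given earlier (target weight times alpha plus source weight times one minus alpha), here is a NumPy sketch. The `alpha_blend` function and the use of plain dictionaries in place of network parameters are assumptions for this example, not PTAN code.

```python
import numpy as np


def alpha_blend(target: dict, source: dict, alpha: float) -> dict:
    """Return new target parameters: w_i * alpha + s_i * (1 - alpha)."""
    return {name: target[name] * alpha + source[name] * (1.0 - alpha)
            for name in target}


source = {"ff.weight": np.array([1.0, 2.0])}
target = {"ff.weight": np.array([0.0, 0.0])}

# a single blend with alpha=0.9 moves the target 10% of the way to the source
one_step = alpha_blend(target, source, alpha=0.9)

# repeated blending moves the target smoothly towards the source
blended = target
for _ in range(200):
    blended = alpha_blend(blended, source, alpha=0.9)
```

An alpha close to 1 keeps the target network almost unchanged on every call, so the target follows the source slowly and smoothly instead of jumping to it as sync() does.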
