# High-Level RL Libraries
* [**PTAN**](https://github.com/Shmuma/ptan) - based on PyTorch, described below
* [Keras-RL](https://github.com/keras-rl/keras-rl) - based on Keras, includes basic RL methods
* [TF-Agents](https://www.tensorflow.org/agents) - based on TensorFlow, made by Google in 2018
* [Dopamine](https://github.com/google/dopamine) - made by Google in 2018, TensorFlow specific
* [Ray](https://ray.io/) - library for distributed execution of ML code, including an RL library
* [ReAgent](https://reagent.ai/) - made by Facebook, based on PyTorch and JSON problem description
* [CatalystRL](https://github.com/catalyst-team/catalyst-rl) - uses PyTorch backend
* [SLM Lab](https://github.com/kengz/SLM-Lab) - another RL library

The following sections describe the API and basic components of the *PTAN* library.

## Action Selectors
An `ActionSelector` is a class that, given a network output (e.g. Q values) or a batch thereof, selects the actions to play. There are three basic types:
* *Argmax* is the standard selector used in Q-Learning
* *Policy-based* samples actions from a given normalized distribution
* *Epsilon Greedy* implements the epsilon-greedy strategy given an epsilon hyperparameter
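
To make the three strategies concrete, here is a minimal NumPy sketch of the underlying logic. It is an illustration under stated assumptions, not PTAN's actual implementation: the function names `epsilon_greedy` and `sample_from_probs` and the explicit `rng` argument are hypothetical.

```python
import numpy as np


def epsilon_greedy(q_vals, epsilon, rng):
    """Pick argmax actions, then replace each with a random one with prob. epsilon."""
    q_vals = np.asarray(q_vals)
    batch_size, n_actions = q_vals.shape
    actions = q_vals.argmax(axis=1)  # greedy (argmax) choice per batch row
    explore = rng.random(batch_size) < epsilon  # rows that explore this time
    actions[explore] = rng.integers(n_actions, size=int(explore.sum()))
    return actions


def sample_from_probs(probs, rng):
    """Sample one action per row from a normalized distribution."""
    return np.array([rng.choice(len(p), p=p) for p in probs])


rng = np.random.default_rng(0)
q_vals = np.array([[1, 2, 3], [1, -1, 0]])

# With epsilon=0.0 the selector never explores, so it reduces to argmax
print(epsilon_greedy(q_vals, epsilon=0.0, rng=rng))

# With epsilon=1.0 every action is drawn uniformly at random
print(epsilon_greedy(q_vals, epsilon=1.0, rng=rng))
```

Note that with `epsilon=0.0` the epsilon-greedy rule is exactly the argmax rule, so the greedy selector is a special case of it.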

In [1]:
import numpy as np
import ptan

# Let's assume we got a batch of Q values from a NN
q_vals = np.array([[1, 2, 3], [1, -1, 0]])
q_vals

array([[ 1,  2,  3],
       [ 1, -1,  0]])

### Argmax Selector

In [2]:
selector = ptan.actions.ArgmaxActionSelector()
selector(q_vals)

array([2, 0])

### Epsilon-greedy Selector

In [3]:
selector = ptan.actions.EpsilonGreedyActionSelector(epsilon=0.0)
selector(q_vals)

array([2, 0])

In [4]:
selector.epsilon = 1.0
selector(q_vals)

array([0, 0])

In [5]:
selector.epsilon = 0.5
selector(q_vals)

array([1, 0])

In [6]:
selector.epsilon = 0.1
selector(q_vals)

array([2, 0])

### Probability Selector

In [7]:
# A batch of probability distributions over actions
distributions = np.array(
    [
        [0.1, 0.8, 0.1],
        [0.0, 0.0, 1.0],
        [0.5, 0.5, 0.0],
    ]
)

selector = ptan.actions.ProbabilityActionSelector()

for _ in range(10):
    # Each time sample actions from each distribution
    actions = selector(distributions)
    print(actions)

[1 2 0]
[1 2 1]
[1 2 1]
[1 2 1]
[1 2 0]
[1 2 1]
[1 2 0]
[2 2 1]
[1 2 0]
[1 2 0]


## The Agent
The main purpose of the agent class is to convert observations to actions. In practice, both the input and the output are batches (of observations and actions, respectively) for better GPU utilization by the underlying NN.

There are two main classes of agents:
* `DQNAgent` is an agent based on a DQN that makes its decisions based on action values (Q values)
* `PolicyAgent` samples actions from a policy over a discrete set of actions, given either as a normalized probability distribution or as logits (a non-normalized distribution; preferred for better numerical stability)

Finally, a typical scenario is to implement a custom agent by extending one of these classes (or rather the base class). Possible reasons include:
* The agent's decision making is not fully determined by the current observation but by a history, so we need to keep an internal state (e.g. beliefs in POMDPs)
* The NN has multi-modal inputs, e.g. both text and image pixels
* The agent uses a non-standard exploration strategy
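
As an illustration of the first reason, the sketch below shows an agent that threads a per-environment history through its calls. The class `HistoryAgent` and its toy decision rule are hypothetical; it deliberately does not subclass anything from PTAN so that it runs standalone. It only mirrors the `(observations, states) -> (actions, new_states)` calling convention used by the agents in this notebook.

```python
from typing import Any, List, Optional, Tuple


class HistoryAgent:
    """Hypothetical agent whose action depends on the observations seen so far."""

    def __init__(self, n_actions: int) -> None:
        self.n_actions = n_actions

    def __call__(
        self,
        observations: List[Any],
        states: Optional[List[Any]] = None,
    ) -> Tuple[List[int], List[List[Any]]]:
        if states is None:
            states = [None] * len(observations)
        actions, new_states = [], []
        for obs, state in zip(observations, states):
            # Extend this environment's history with the new observation
            history = (state or []) + [obs]
            # Toy rule: the action is the history length modulo n_actions
            actions.append(len(history) % self.n_actions)
            new_states.append(history)
        return actions, new_states


agent = HistoryAgent(n_actions=3)
actions, states = agent([10, 20])          # first call, no state yet
actions, states = agent([11, 21], states)  # second call reuses the state
print(actions, states)
```

The caller is responsible for passing the returned states back in on the next call, one state per environment.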

### DQNAgent

In [8]:
import torch  # noqa
import torch.nn as nn  # noqa


class DQN(nn.Module):
    def __init__(self, n_actions: int) -> None:
        super().__init__()
        self.n_actions = n_actions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Produces diagonal tensor of shape (batch_size, n_actions)"""
        batch_size = x.size()[0]
        return torch.eye(batch_size, self.n_actions)


dqn = DQN(n_actions=3)

# Try out the DQN
dqn(torch.zeros(2, 10))

tensor([[1., 0., 0.],
        [0., 1., 0.]])

In [9]:
# Create a DQN agent using a greedy policy
dqn_agent = ptan.agent.DQNAgent(
    dqn_model=dqn,
    action_selector=ptan.actions.ArgmaxActionSelector(),
)

# Test the greedy DQN agent
dqn_agent(torch.zeros(2, 5))

(array([0, 1]), [None, None])

In [10]:
selector = ptan.actions.EpsilonGreedyActionSelector(epsilon=1.0)

# Create a DQN agent using an epsilon-greedy policy
dqn_agent = ptan.agent.DQNAgent(dqn_model=dqn, action_selector=selector)

# Test the epsilon-greedy DQN agent
dqn_agent(torch.zeros(10, 5))

(array([0, 1, 2, 1, 1, 0, 2, 1, 2, 0]),
 [None, None, None, None, None, None, None, None, None, None])

In [11]:
selector.epsilon = 0.5
dqn_agent(torch.zeros(10, 5))

(array([2, 1, 2, 0, 2, 1, 0, 0, 0, 0]),
 [None, None, None, None, None, None, None, None, None, None])

In [12]:
selector.epsilon = 0.1
dqn_agent(torch.zeros(10, 5))

(array([0, 1, 2, 0, 0, 0, 0, 0, 0, 0]),
 [None, None, None, None, None, None, None, None, None, None])

### PolicyAgent

In [13]:
class PolicyNet(nn.Module):
    def __init__(self, n_actions: int) -> None:
        super().__init__()
        self.n_actions = n_actions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Produces a tensor with first two actions having the same logit scores
        """
        batch_size = x.size()[0]
        output = torch.zeros(
            (batch_size, self.n_actions),
            dtype=torch.float32,
        )
        output[:, 0] = 1
        output[:, 1] = 1
        return output


policy_net = PolicyNet(n_actions=5)

# Try the policy network
policy_net(torch.zeros(6, 10))

tensor([[1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.]])

In [14]:
# Create a policy-based agent
#  - Note: The agent uses softmax to convert the NN output to probabilities
policy_agent = ptan.agent.PolicyAgent(
    model=policy_net,
    action_selector=ptan.actions.ProbabilityActionSelector(),
    apply_softmax=True,
)

# Test the policy agent
policy_agent(torch.zeros(6, 5))

(array([4, 1, 1, 4, 3, 3]), [None, None, None, None, None, None])
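
The sampled actions above include 3 and 4 even though their logits are 0, because softmax still assigns them non-zero probability. The following sketch computes those probabilities with a numerically stable softmax (subtracting the maximum logit before exponentiating). The helper name `stable_softmax` is an assumption for illustration; this is not necessarily how the agent computes it internally.

```python
import math
from typing import List


def stable_softmax(logits: List[float]) -> List[float]:
    """Softmax that subtracts the max logit first to avoid overflow in exp()."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits produced by the toy policy network for every observation
probs = stable_softmax([1.0, 1.0, 0.0, 0.0, 0.0])
print([round(p, 3) for p in probs])  # roughly [0.322, 0.322, 0.119, 0.119, 0.119]

# A naive math.exp(1000.0) would overflow; the shifted version does not
print(stable_softmax([1000.0, 1000.0]))
```

So the first two actions are each chosen about 32% of the time and each of the remaining three about 12% of the time.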

## Experience Sources
The philosophy behind *experience sources* is to provide an easy-to-use wrapper for handling trajectories (basically n steps of experience). The responsibility of an experience source is to provide trajectories given an agent and one or more environments.

There are three basic classes:
* `ExperienceSource` can operate over multiple environments at once (to create batches for GPU processing) and returns n-step subtrajectories with all intermediate steps
* `ExperienceSourceFirstLast` returns subtrajectories with just the first and last step (with accumulated reward) - saves a lot of memory for n-step DQN or in A2C rollouts
* `ExperienceSourceRollouts` follows the A3C rollouts scheme
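
The first-last aggregation can be pictured with the simplified sketch below. It collapses each n-step window of a finished episode into a single record holding the first state and action, the discounted reward sum, and the state reached after the window (or `None` if the episode ended first). The names `FirstLast` and `aggregate` are hypothetical, and the handling of episode tails is an assumption made for illustration rather than a description of PTAN's internals.

```python
from typing import List, NamedTuple, Optional, Tuple


class FirstLast(NamedTuple):
    state: int
    action: int
    reward: float
    last_state: Optional[int]


def aggregate(
    steps: List[Tuple[int, int, float]],
    gamma: float,
    n: int,
) -> List[FirstLast]:
    """Aggregate (state, action, reward) steps of one finished episode."""
    result = []
    for i, (state, action, _) in enumerate(steps):
        window = steps[i : i + n]
        # Discounted sum of the rewards inside the window
        total = sum(gamma**k * r for k, (_, _, r) in enumerate(window))
        # State observed right after the window, if the episode got that far
        last_state = steps[i + n][0] if i + n < len(steps) else None
        result.append(FirstLast(state, action, total, last_state))
    return result


episode = [(0, 1, 1.0), (1, 1, 1.0), (2, 1, 1.0)]
for record in aggregate(episode, gamma=0.5, n=2):
    print(record)
```

Only one record per starting step is stored, instead of all n intermediate steps, which is where the memory saving comes from.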

In [15]:
from typing import Any, Dict, List, Optional, Tuple  # noqa

import gym  # noqa


class ToyEnv(gym.Env):
    """
    Environment with observations 0..4 and actions 0..2
      * Observations rotate sequentially mod 5
      * Reward is equal to the given action
      * Episodes have a fixed length of 10
    """

    def __init__(self) -> None:
        super().__init__()
        self.observation_space = gym.spaces.Discrete(n=5)
        self.action_space = gym.spaces.Discrete(n=3)
        self.step_index = 0

    def reset(self) -> int:
        self.step_index = 0
        return self.step_index

    def step(self, action: int) -> Tuple[int, float, bool, Dict]:
        observation = self.step_index % self.observation_space.n
        done = self.step_index == 10
        reward = 0.0 if done else float(action)
        if not done:
            self.step_index += 1
        return observation, reward, done, {}


# Experiment with the toy environment
env = ToyEnv()

state = env.reset()
print(f"env.reset() -> {state}")

state = env.step(1)
print(f"env.step(1) -> {state}")

state = env.step(2)
print(f"env.step(2) -> {state}")

for _ in range(10):
    print(env.step(0))

env.reset() -> 0
env.step(1) -> (0, 1.0, False, {})
env.step(2) -> (1, 2.0, False, {})
(2, 0.0, False, {})
(3, 0.0, False, {})
(4, 0.0, False, {})
(0, 0.0, False, {})
(1, 0.0, False, {})
(2, 0.0, False, {})
(3, 0.0, False, {})
(4, 0.0, False, {})
(0, 0.0, True, {})
(0, 0.0, True, {})


In [16]:
class DummyAgent(ptan.agent.BaseAgent):
    """Agent that always returns a fixed action"""

    def __init__(self, action: int) -> None:
        self.action = action

    def __call__(
        self,
        observations: List[Any],
        state: Optional[List[Any]] = None,
    ) -> Tuple[List[int], Optional[List[Any]]]:
        return [self.action for _ in observations], state


# Create a dummy agent
agent = DummyAgent(action=1)
agent([1, 2])

([1, 1], None)

In [17]:
from typing import Iterable, TypeVar  # noqa

T = TypeVar("T")


def take(n: int, xs: Iterable[T]) -> Iterable[T]:
    assert n >= 0
    for x in xs:
        if n == 0:
            break
        yield x
        n -= 1


# Create a new experience source with a single environment
#  - There will be 2 experiences in each subtrajectory
exp_source = ptan.experience.ExperienceSource(
    env=ToyEnv(),
    agent=agent,
    steps_count=2,
)

# Print first 15 subtrajectories from the source
for exp in take(15, exp_source):
    print(exp)

(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=0, action=1, reward=1.0, done=False))
(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))
(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))
(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))
(Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=False))
(Experience(state=4, action=1, reward=1.0, done=False), Experience(state=0, action=1, reward=1.0, done=False))
(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))
(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))
(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))
(

In [18]:
# Show a subtrajectory containing four experiences
exp_source = ptan.experience.ExperienceSource(
    env=ToyEnv(),
    agent=agent,
    steps_count=4,
)
next(iter(exp_source))

(Experience(state=0, action=1, reward=1.0, done=False),
 Experience(state=0, action=1, reward=1.0, done=False),
 Experience(state=1, action=1, reward=1.0, done=False),
 Experience(state=2, action=1, reward=1.0, done=False))

In [19]:
# Create a source that interacts with two environments at once
exp_source = ptan.experience.ExperienceSource(
    env=[ToyEnv(), ToyEnv()],
    agent=agent,
    steps_count=2,
)

# Take and show first 4 subtrajectories
for exp in take(4, exp_source):
    print(exp)

(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=0, action=1, reward=1.0, done=False))
(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=0, action=1, reward=1.0, done=False))
(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))
(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))


In [20]:
# Create an aggregating source with a single environment
exp_source = ptan.experience.ExperienceSourceFirstLast(
    env=ToyEnv(),
    agent=agent,
    gamma=1.0,
    steps_count=1,
)

# Show first 10 subtrajectories from this source
for exp in take(10, exp_source):
    print(exp)

ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=0)
ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)
ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)
ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)
ExperienceFirstLast(state=3, action=1, reward=1.0, last_state=4)
ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=0)
ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)
ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)
ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)
ExperienceFirstLast(state=3, action=1, reward=1.0, last_state=4)


## Experience Replay Buffer

In [21]:
# Create an experience source with a single environment
#  - Uses the dummy agent and produces one experience at a time
exp_source = ptan.experience.ExperienceSourceFirstLast(
    env=ToyEnv(),
    agent=DummyAgent(action=1),
    gamma=1.0,
    steps_count=1,
)

# Create an ER buffer linked to the source, with a capacity of 100 experiences
replay_buffer = ptan.experience.ExperienceReplayBuffer(
    experience_source=exp_source,
    buffer_size=100,
)

# Sketch the usage of the ER buffer in a training loop

min_buffer_size = 5
batch_size = 4

# Run 6 steps of a training loop
for _ in range(6):

    # Add single sample from the experience source to the ER buffer
    replay_buffer.populate(1)

    # Let the buffer fill up
    if len(replay_buffer) < min_buffer_size:
        continue

    # Sample a batch from the ER buffer
    batch = replay_buffer.sample(batch_size)

    # Log batch info
    print(f"Train time, {len(batch)}\tbatch samples:")
    for sample in batch:
        print(sample)

Train time, 4	batch samples:
ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)
ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=0)
ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)
ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)
Train time, 4	batch samples:
ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=0)
ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=0)
ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)
ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)
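
For intuition, a replay buffer of this kind can be sketched with a bounded `deque` that pulls items from an iterator. The class `RingReplayBuffer` is a hypothetical, simplified stand-in and not PTAN's implementation. It samples with replacement, which would be consistent with the duplicate sample visible in the first batch above, but that choice is an assumption.

```python
import random
from collections import deque
from typing import Any, Iterable, List


class RingReplayBuffer:
    """Hypothetical fixed-capacity buffer fed from an experience iterator."""

    def __init__(self, source: Iterable[Any], capacity: int) -> None:
        self.source = iter(source)
        # A deque with maxlen silently drops the oldest item when full
        self.buffer: deque = deque(maxlen=capacity)

    def __len__(self) -> int:
        return len(self.buffer)

    def populate(self, n: int) -> None:
        """Pull n new experiences from the source into the buffer."""
        for _ in range(n):
            self.buffer.append(next(self.source))

    def sample(self, batch_size: int) -> List[Any]:
        """Draw a batch uniformly at random, with replacement."""
        return random.choices(list(self.buffer), k=batch_size)


buffer = RingReplayBuffer(source=range(1000), capacity=3)
buffer.populate(5)          # items 0 and 1 are evicted
print(list(buffer.buffer))  # the three most recent items
print(buffer.sample(4))
```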


## Target Network

In [22]:
class LinearDQN(nn.Module):
    """DQN with single linear layer"""

    def __init__(self) -> None:
        super().__init__()
        self.lin = nn.Linear(5, 3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.lin(x)


# Create a DQN and a corresponding target network
main_net = LinearDQN()
target_net = ptan.agent.TargetNet(main_net)

# Compare parameters of both networks
print("Main net:", main_net.lin.weight)
print("Target net:", target_net.target_model.lin.weight)

Main net: Parameter containing:
tensor([[ 0.3377, -0.3519, -0.3948, -0.3051, -0.1529],
        [ 0.3760, -0.1221, -0.4275, -0.3246, -0.2839],
        [-0.3563, -0.2029, -0.0848, -0.2980, -0.1327]], requires_grad=True)
Target net: Parameter containing:
tensor([[ 0.3377, -0.3519, -0.3948, -0.3051, -0.1529],
        [ 0.3760, -0.1221, -0.4275, -0.3246, -0.2839],
        [-0.3563, -0.2029, -0.0848, -0.2980, -0.1327]], requires_grad=True)


Simulate an update to the weights of the main network.

In [23]:
main_net.lin.weight.data += 1.0
print("Main net:", main_net.lin.weight)
print("Target net:", target_net.target_model.lin.weight)

Main net: Parameter containing:
tensor([[1.3377, 0.6481, 0.6052, 0.6949, 0.8471],
        [1.3760, 0.8779, 0.5725, 0.6754, 0.7161],
        [0.6437, 0.7971, 0.9152, 0.7020, 0.8673]], requires_grad=True)
Target net: Parameter containing:
tensor([[ 0.3377, -0.3519, -0.3948, -0.3051, -0.1529],
        [ 0.3760, -0.1221, -0.4275, -0.3246, -0.2839],
        [-0.3563, -0.2029, -0.0848, -0.2980, -0.1327]], requires_grad=True)


Synchronize parameters of both models by copying weights from the main to the target network.

In [24]:
target_net.sync()
print("Main net:", main_net.lin.weight)
print("Target net:", target_net.target_model.lin.weight)

Main net: Parameter containing:
tensor([[1.3377, 0.6481, 0.6052, 0.6949, 0.8471],
        [1.3760, 0.8779, 0.5725, 0.6754, 0.7161],
        [0.6437, 0.7971, 0.9152, 0.7020, 0.8673]], requires_grad=True)
Target net: Parameter containing:
tensor([[1.3377, 0.6481, 0.6052, 0.6949, 0.8471],
        [1.3760, 0.8779, 0.5725, 0.6754, 0.7161],
        [0.6437, 0.7971, 0.9152, 0.7020, 0.8673]], requires_grad=True)
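
The behaviour shown in the last three cells boils down to keeping a frozen deep copy of the parameters and overwriting it on demand. The sketch below reproduces that idea with plain Python lists standing in for tensors. The class `ParamSnapshot` is hypothetical; with real PyTorch modules the usual way to copy weights is `target.load_state_dict(main.state_dict())`.

```python
import copy
from typing import Dict, List


class ParamSnapshot:
    """Hypothetical stand-in for a target network (lists instead of tensors)."""

    def __init__(self, params: Dict[str, List[float]]) -> None:
        self.params = params                        # live parameters (main net)
        self.target_params = copy.deepcopy(params)  # frozen copy (target net)

    def sync(self) -> None:
        """Overwrite the frozen copy with the current live parameters."""
        self.target_params = copy.deepcopy(self.params)


main_params = {"weight": [0.3, -0.2]}
snapshot = ParamSnapshot(main_params)

main_params["weight"][0] += 1.0  # simulated training update
print(snapshot.target_params)    # unchanged: still the old values

snapshot.sync()
print(snapshot.target_params)    # now equal to the main parameters
```

The deep copy matters: a shallow reference would make the "target" follow every update of the main parameters, defeating its purpose.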


## PTAN CartPole Example

In [25]:
class DQN(nn.Module):
    def __init__(
        self,
        n_states: int,
        n_actions: int,
        n_hidden_units: int,
    ) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_states, n_hidden_units),
            nn.ReLU(),
            nn.Linear(n_hidden_units, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x.float())


@torch.no_grad()
def unpack_batch(
    batch: Iterable[ptan.experience.ExperienceFirstLast],
    net: nn.Module,
    gamma: float,
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    states = []
    actions = []
    rewards = []
    done_masks = []
    last_states = []

    # Unzip experiences from the batch
    for exp in batch:
        done = exp.last_state is None
        states.append(exp.state)
        actions.append(exp.action)
        rewards.append(exp.reward)
        done_masks.append(done)
        last_states.append(exp.state if done else exp.last_state)

    # Convert all to tensors
    states = torch.tensor(states)
    actions = torch.tensor(actions)
    rewards = torch.tensor(rewards)
    last_states = torch.tensor(last_states)
    done_masks = torch.tensor(done_masks, dtype=torch.bool)

    # Compute Q values of all actions for all the last states
    q_values = net(last_states)

    # max_a' Q(s_last, a')
    future_values = torch.max(q_values, dim=1)[0]
    future_values[done_masks] = 0.0

    # Returns states, actions and TD targets for the whole batch
    return states, actions, rewards + gamma * future_values
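
Written as a formula, the value returned as the third element by `unpack_batch` is the TD target for each sample $i$. This assumes one-step transitions, as the single factor of `gamma` in the code suggests. The symbols $d_i$ and $Q_{\text{target}}$ are notation introduced here, not names from the code:

$$
y_i = r_i + \gamma \, (1 - d_i) \, \max_{a'} Q_{\text{target}}(s'_i, a')
$$

Here $r_i$ is the reward, $s'_i$ is the last state, $Q_{\text{target}}$ is the network passed as `net`, and $d_i = 1$ when `exp.last_state is None` (the episode ended), otherwise $d_i = 0$. For example, with $r_i = 1$, $\gamma = 0.9$ and $\max_{a'} Q_{\text{target}}(s'_i, a') = 2$, the target is $y_i = 1 + 0.9 \cdot 2 = 2.8$ for a non-terminal transition and $y_i = 1$ for a terminal one.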

In [26]:
# Hyperparameters
n_hidden_units = 128
target_sync_period = 10
batch_size = 16
buffer_capacity = 1000
min_buffer_size = 2 * batch_size
gamma = 0.9
learning_rate = 1e-3
epsilon_decay = 0.99
solution_reward_bound = 150

# Create the CartPole environment
env = gym.make("CartPole-v0")
n_states = env.observation_space.shape[0]
n_actions = env.action_space.n

# Create main and target DQNs
main_net = DQN(n_states, n_actions, n_hidden_units)
target_net = ptan.agent.TargetNet(main_net)

# Create an optimizer linked to the parameters of the main DQN
optimizer = torch.optim.Adam(main_net.parameters(), learning_rate)

# Create DQN agent that uses epsilon-greedy exploration policy
#  - Note: Initial epsilon is 1 but it will be decayed during training

selector = ptan.actions.EpsilonGreedyActionSelector(
    epsilon=1,
    selector=ptan.actions.ArgmaxActionSelector(),
)

agent = ptan.agent.DQNAgent(main_net, selector)

# Create new aggregating experience source
exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma=gamma)

# Set up an ER buffer linked to the exp. source
replay_buffer = ptan.experience.ExperienceReplayBuffer(
    experience_source=exp_source,
    buffer_size=buffer_capacity,
)
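
With these hyperparameters, epsilon is multiplied by `epsilon_decay` once per training step, and training only starts when the buffer holds `min_buffer_size` samples. The sketch below estimates the epsilon in effect at a given loop iteration. The helper `epsilon_at_step` is hypothetical and assumes each `populate(1)` call adds exactly one sample, so that the buffer size equals the iteration number.

```python
def epsilon_at_step(
    step: int,
    epsilon_decay: float = 0.99,
    min_buffer_size: int = 32,
    initial_epsilon: float = 1.0,
) -> float:
    """Epsilon in effect when loop iteration `step` logs a finished episode.

    Assumes one new sample per iteration, so decay has been applied once for
    every earlier iteration at which the buffer was already large enough.
    """
    n_decays = max(0, step - min_buffer_size)
    return initial_epsilon * epsilon_decay**n_decays


for step in (1, 32, 47, 65, 97):
    print(step, round(epsilon_at_step(step), 2))
```

Under these assumptions the values for steps 47, 65 and 97 come out as 0.86, 0.72 and 0.52, which agrees with the epsilon values printed in the training log further down.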

In [27]:
step = 0
episode = 0
reward = 0.0
solved = False

while True:

    # Sample new experiences from the environment into the ER buffer
    step += 1
    replay_buffer.populate(1)

    # Look for a completed episode since the last call
    for reward, steps in exp_source.pop_rewards_steps():
        episode += 1
        print(
            f"{step}: episode {episode} done, reward={reward:.3f}, "
            f"epsilon={selector.epsilon:.2f}"
        )
        solved = reward > solution_reward_bound

    # Termination condition
    if solved:
        print("Solved")
        break

    # Let the ER buffer fill up
    if len(replay_buffer) < min_buffer_size:
        continue

    # Sample a batch from the ER buffer
    batch = replay_buffer.sample(batch_size)

    # Extract all the states, actions and TD targets from the batch
    #  - Note: TD targets are computed using the TargetNet, hence fixed
    states, actions, td_targets = unpack_batch(
        batch=batch,
        net=target_net.target_model,
        gamma=gamma,
    )

    # Reset gradients before the next backpropagation step
    optimizer.zero_grad()

    # Let the main DQN compute Q values of the actions played in the batch
    q_values = main_net(states)
    q_values = q_values.gather(1, actions.unsqueeze(-1)).squeeze(-1)

    # Compute the gradient of the MSE between Q values and TD targets
    loss = torch.nn.functional.mse_loss(q_values, td_targets)
    loss.backward()

    # Make a single gradient descent step and update weights of the main DQN
    optimizer.step()

    # Decay epsilon based on a fixed rate schedule
    selector.epsilon *= epsilon_decay

    # Periodically synchronize weights of the main and target DQNs
    if step % target_sync_period == 0:
        target_net.sync()

47: episode 1 done, reward=46.000, epsilon=0.86
65: episode 2 done, reward=18.000, epsilon=0.72
97: episode 3 done, reward=32.000, epsilon=0.52
116: episode 4 done, reward=19.000, epsilon=0.43
127: episode 5 done, reward=11.000, epsilon=0.38
139: episode 6 done, reward=12.000, epsilon=0.34
155: episode 7 done, reward=16.000, epsilon=0.29
166: episode 8 done, reward=11.000, epsilon=0.26
177: episode 9 done, reward=11.000, epsilon=0.23
186: episode 10 done, reward=9.000, epsilon=0.21
195: episode 11 done, reward=9.000, epsilon=0.19
204: episode 12 done, reward=9.000, epsilon=0.18
216: episode 13 done, reward=12.000, epsilon=0.16
225: episode 14 done, reward=9.000, epsilon=0.14
238: episode 15 done, reward=13.000, epsilon=0.13
248: episode 16 done, reward=10.000, epsilon=0.11
258: episode 17 done, reward=10.000, epsilon=0.10
268: episode 18 done, reward=10.000, epsilon=0.09
278: episode 19 done, reward=10.000, epsilon=0.08
286: episode 20 done, reward=8.000, epsilon=0.08
296: episode 21 d