DRL Toolbox

Overview

This repository contains reinforcement learning code for solving the Udacity Deep Reinforcement Learning projects. It has been refactored so that the implemented algorithms can be applied to new environments with minimal setup.

Prerequisites

  • Anaconda
  • Linux with GPU
  • CUDA 10.2

All models are developed in PyTorch.

Install

Recreate the Anaconda environment with:
conda env create -f environment.yml

Activate the conda environment with:
conda activate drl_toolbox

Repository Structure

The code is organized as follows:

  • /agents

    • Agents are responsible for interacting with the environment -- given a state provided by the environment, the agent chooses an action. At the next time step, the agent receives a reward for that action. The goal of the agent is to choose actions that maximize the long-term (cumulative) reward.

    • /agents/models:

      • Models are interchangeable components representing an agent's brain. The current implementations of models are parametric and are built from neural networks.

      • The input to a model is a state, and the output is a representation $Q(s_t, a_t; \theta)$ of the value of the $(s, a)$ pair

      • /agents/models/components:

        • Model components are torch.nn.Modules which are combined to form models. In the context of the DQN model, these components form the featurizer and output heads of the model.
    • /agents/policies:

      • Policies are responsible for choosing the next action given the current state. Policies utilize the parametric models from /agents/models to generate actions given a state.
      • The Policy classes are responsible for taking $Q(s_t, a_t)$, generated from a Model, and outputting an Action
    • /agents/memory:

      • Implements memory buffers for the agent, used for storing experiences sampled from the environment and recalling experiences during learning
  • /simulation

    • The simulation modules are responsible for emulating the environment, and provide helper functions for training and evaluating an agent. The simulator provides the environmental state to the agent, receives an action, and generates the next environmental state along with a scalar reward signal for the agent.
  • /tasks

    • Tasks are the learning tasks (or problems) which we develop DRL agents to solve.
    • /tasks/<task_name>
      • The solution and discussion for a particular task can be found in these folders
  • /tools

    • Contains common tools for solving tasks

Task solutions and reports

Quick navigation

Agent Implementations and explanation

Currently, the Deep Q-Network (DQN) algorithm is implemented, along with a few extensions to the original algorithm outlined in the Rainbow DQN paper. Below, we discuss each algorithm at a high level, along with the implemented extensions.

  • Overview: Deep Q-Network (DQN)

    Q-Learning is a value-based reinforcement learning algorithm which can be used to learn an optimal policy (strategy) for selecting actions in an environment. Being value-based, the Q-learning algorithm aims to learn the expected return (the discounted cumulative sum of future rewards):

    $Q : S \times A \to \mathbb{R}$, for all $s \in S$, $a \in A$

    $Q^\pi(s_t, a_t) = \mathbb{E}[R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots \mid s_t, a_t]$

    Where $\pi$ is the policy being followed, $t$ is the current time step, and $R_t$ is the reward (provided by the environment) at time $t$.

    The goal in Q-learning is to find the optimal action-value function, or:

    $Q^*(s_t, a_t) = \max_\pi \mathbb{E}[R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots \mid s_t, a_t]$

    The optimal action-value function is thus the maximum expected return which can be achieved by taking action $a$ in state $s$.

    The optimal action-value function obeys the Bellman equation, which allows $Q^*(s_t, a_t)$ to be written as a recurrence relation:

    $Q^*(s_t, a_t) = \mathbb{E}[R_t + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}) \mid s_t, a_t]$

    Which can then be cast as an iterative update rule:

    $Q_{i+1}(s_t, a_t) = \mathbb{E}_{s_{t+1}}[R_t + \gamma \max_{a_{t+1}} Q_i(s_{t+1}, a_{t+1}) \mid s_t, a_t]$

    Where $i$ indicates the update step. The Deep Q-Network algorithm uses parametric (neural network) models as function approximators to estimate the value of $Q(s, a)$, resulting in the iterative loss function:

    $L_i(\theta_i) = \mathbb{E}_{s_t, a_t, r_t}\left[\left(\mathbb{E}_{s_{t+1}}\left[R_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta^-)\right] - Q(s_t, a_t; \theta_i)\right)^2\right]$

    Where the target network's weights $\theta^-$ represent a set of parameters which are periodically copied (synced) from the online network's weights $\theta$ every $\tau$ time steps, and held fixed otherwise. For convenience, we can write the target value as:

    $Y_t^{DQN} \equiv R_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta_t^-)$

    The Double DQN algorithm is a minor but significant adaptation of the DQN algorithm which helps reduce over-estimation of the action values $Q(s, a)$. The traditional DQN algorithm has:

    $Y_t^{DQN} \equiv R_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta_t^-) = R_t + \gamma Q(s_{t+1}, \operatorname{argmax}_a Q(s_{t+1}, a; \theta_t^-); \theta_t^-)$

    where the same parameters $\theta_t^-$ are used both to select the action (via the $\operatorname{argmax}$) and to evaluate it, using the fact that the $\max$ over $a_{t+1}$ is equivalent to evaluating $\operatorname{argmax}_a Q(s_{t+1}, a; \theta_t^-)$. Because the same parametric model is used both to select and to evaluate an action, the Double DQN authors show that over-estimation of the value estimates can occur.

    The Double DQN algorithm was shown to significantly reduce this over-estimation bias in the action-value estimates by using the target network (parametrized by $\theta^-$) to evaluate the action, while using the online network (parametrized by $\theta$) for the action selection.

    $Y_t^{Double\text{-}DQN} \equiv R_t + \gamma Q(s_{t+1}, \operatorname{argmax}_a Q(s_{t+1}, a; \theta_t); \theta_t^-)$

    See the compute_errors method of the Base Policy class for code implementation
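    As a rough illustration of the difference between the two targets, the sketch below computes both for a batch of transitions. It is a minimal sketch rather than the repository's implementation; `online_net`, `target_net` and the terminal-state mask `dones` are hypothetical placeholder names.

    ```python
    import torch

    def dqn_targets(rewards, next_states, dones, gamma, online_net, target_net):
        """Compute vanilla DQN and Double-DQN targets for a batch of transitions.

        rewards:     (B,) tensor of rewards R_t
        next_states: (B, state_dim) tensor of next states s_{t+1}
        dones:       (B,) tensor of 0/1 episode-termination flags
        """
        with torch.no_grad():
            next_q_target = target_net(next_states)              # (B, n_actions)

            # Vanilla DQN: the target network both selects and evaluates a_{t+1}.
            y_dqn = rewards + gamma * (1 - dones) * next_q_target.max(dim=1).values

            # Double DQN: the online network selects a_{t+1},
            # while the target network evaluates it.
            best_actions = online_net(next_states).argmax(dim=1, keepdim=True)
            y_double = rewards + gamma * (1 - dones) * next_q_target.gather(1, best_actions).squeeze(1)
        return y_dqn, y_double
    ```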

    Rather than performing learning updates on experiences as they are sampled from the environment (i.e. sequentially through time), the DQN algorithm uses an experience buffer to store experiences, which are then sampled in a probabilistic way to break correlations between sequential experiences. In this implementation, experiences are stored as tuples $(s_t, a_t, r_t, s_{t+1})$ corresponding to the state, action, reward and next state, respectively.

    During learning, experience tuples $(s_t, a_t, r_t, s_{t+1})$ are drawn non-uniformly from the replay buffer according to the error between the predicted and expected action-value for the given experience. Experience samples which result in greater error are given a higher probability of being sampled.

    Samples are added to the replay buffer such that the probability of being sampled is:

    $P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$

    Where $p_i > 0$ is the priority of sample $i$, and $\alpha$ controls how much the prioritization is used, with $\alpha = 0$ corresponding to uniform sampling.

    To determine the priority of an experience tuple, the TD error is used (and can be applied to either the DQN or Double-DQN case). We show the Double-DQN case, since that is what's implemented in the code:

    $\text{TD error} = R_t + \gamma Q(s_{t+1}, \operatorname{argmax}_a Q(s_{t+1}, a; \theta_t); \theta_t^-) - Q(s_t, a_t; \theta_t)$

    In the code, we use the L2 norm of the TD error as the un-normalized priority value $p_i$.

    Due to the introduced bias from non-uniform sampling of experiences, the gradients are weighted by importance-sampling weights as a correction:

    $w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^\beta$

    A SumTree data structure is implemented to perform weighted sampling efficiently. See the implementation of the PER buffer, and the SumTree.

    The compute_errors method of the Base Policy class shows where the importance weights are applied to scale the gradients, and the step method of the DQNAgent contains the implementation of updating the priorities.
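    As an illustration of the idea behind the SumTree (a minimal sketch, not the repository's implementation), each leaf stores a priority $p_i^\alpha$ and each internal node stores the sum of its children, so both updating a priority and sampling proportionally to $P(i)$ take $O(\log N)$ time:

    ```python
    import random

    class SumTree:
        """Leaves hold priorities; internal nodes hold the sum of their children."""

        def __init__(self, capacity):
            self.capacity = capacity
            self.tree = [0.0] * (2 * capacity)   # index 1 is the root; leaves start at `capacity`
            self.next_slot = 0                   # position of the next (oldest) sample to overwrite

        def update(self, index, priority):
            """Set the priority of leaf `index` and propagate the change up to the root."""
            node = index + self.capacity
            delta = priority - self.tree[node]
            while node >= 1:
                self.tree[node] += delta
                node //= 2

        def add(self, priority):
            """Insert a new sample, overwriting the oldest one when the buffer is full."""
            index = self.next_slot
            self.update(index, priority)
            self.next_slot = (self.next_slot + 1) % self.capacity
            return index

        def sample(self):
            """Draw a leaf index with probability proportional to its priority."""
            mass = random.uniform(0.0, self.tree[1])   # tree[1] holds the total priority mass
            node = 1
            while node < self.capacity:                # descend until a leaf is reached
                left = 2 * node
                if mass <= self.tree[left]:
                    node = left
                else:
                    mass -= self.tree[left]
                    node = left + 1
            index = node - self.capacity
            return index, self.tree[node] / self.tree[1]   # leaf index and its P(i)
    ```

    Given a sampled probability $P(i)$, the importance-sampling weight is $w_i = (N \cdot P(i))^{-\beta}$, which in practice is normalized by $\max_i w_i$ before scaling the loss.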

    The dueling DQN network architecture attempts to decompose $Q(s, a)$ as the sum of the value $V(s)$ of a state and the advantage $A(s, a)$ of taking an action in that state (the advantage over all other possible actions from that state).

    $Q(s, a) = A(s, a) + V(s)$

    To accomplish this, the output component of the DQN is separated into two streams, one for $V(s)$ and one for $A(s, a)$. These two streams are then aggregated to obtain the final $Q(s, a)$ values.

    By decoupling the two streams, the Dueling DQN learns the value of a state independently from the value of the actions available in that state, which is particularly useful if actions do not affect the value of a state in a relevant way.

    For implementation details, see the _get_dueling_output method of the dqn.
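    A minimal sketch of the dueling aggregation (not the repository's _get_dueling_output) is shown below; it uses the mean-advantage form from the Dueling DQN paper, which subtracts the mean advantage so that $V$ and $A$ remain identifiable:

    ```python
    import torch
    import torch.nn as nn

    class DuelingHead(nn.Module):
        """Splits a shared feature vector into a state-value stream V(s) and an
        advantage stream A(s, a), then recombines them into Q(s, a)."""

        def __init__(self, feature_dim, n_actions, hidden=64):
            super().__init__()
            self.value = nn.Sequential(
                nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            self.advantage = nn.Sequential(
                nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions))

        def forward(self, features):
            v = self.value(features)                       # (B, 1)
            a = self.advantage(features)                   # (B, n_actions)
            # Subtracting the mean advantage keeps V and A identifiable.
            return v + a - a.mean(dim=1, keepdim=True)     # (B, n_actions) of Q-values
    ```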

    The Noisy DQN output replaces the traditional ε-greedy exploration technique with a neural-network layer whose weights and biases are perturbed by a parametric function that is learned along with the rest of the network. Noisy networks have the advantage over ε-greedy that the amount of noise injected into the network is learned rather than annealed according to a heuristic schedule.

    For implementation details, see the get_output method of the dqn.
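    The sketch below shows a factorized-Gaussian noisy linear layer in the spirit of the NoisyNet paper; it is illustrative rather than the layer used in this repository, and the initialization constants are assumptions:

    ```python
    import math
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class NoisyLinear(nn.Module):
        """Linear layer whose weights and biases are perturbed by factorized Gaussian
        noise; the noise scales (sigma) are learned along with the other parameters."""

        def __init__(self, in_features, out_features, sigma_init=0.5):
            super().__init__()
            self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
            self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
            self.bias_mu = nn.Parameter(torch.empty(out_features))
            self.bias_sigma = nn.Parameter(torch.empty(out_features))
            self.register_buffer("eps_in", torch.zeros(in_features))
            self.register_buffer("eps_out", torch.zeros(out_features))
            bound = 1.0 / math.sqrt(in_features)
            nn.init.uniform_(self.weight_mu, -bound, bound)
            nn.init.uniform_(self.bias_mu, -bound, bound)
            nn.init.constant_(self.weight_sigma, sigma_init * bound)
            nn.init.constant_(self.bias_sigma, sigma_init * bound)
            self.reset_noise()

        @staticmethod
        def _scaled_noise(size):
            x = torch.randn(size)
            return x.sign() * x.abs().sqrt()

        def reset_noise(self):
            """Resample the factorized noise (typically once per learning step)."""
            self.eps_in.copy_(self._scaled_noise(self.eps_in.shape[0]))
            self.eps_out.copy_(self._scaled_noise(self.eps_out.shape[0]))

        def forward(self, x):
            weight = self.weight_mu + self.weight_sigma * self.eps_out.unsqueeze(1) * self.eps_in.unsqueeze(0)
            bias = self.bias_mu + self.bias_sigma * self.eps_out
            return F.linear(x, weight, bias)
    ```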

    The categorical DQN algorithm models the return distribution of an action, rather than only its expected return, thus modelling the distribution of $Q(s, a)$. The categorical DQN is implemented in the get_output method of dqn, with a corresponding categorical policy which is responsible for computing the errors between the target and online network distributions. Please refer to the paper for theoretical details and to the reference implementation from which the code is adapted.
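    A minimal sketch of a categorical (C51-style) output head is shown below; the full projection of the target distribution onto the support (which the categorical policy performs) is omitted, and the support bounds are assumptions:

    ```python
    import torch
    import torch.nn as nn

    class CategoricalHead(nn.Module):
        """For each action, outputs a probability distribution over a fixed support of
        `n_atoms` return values in [v_min, v_max]; Q(s, a) is the expectation under it."""

        def __init__(self, feature_dim, n_actions, n_atoms=51, v_min=-10.0, v_max=10.0):
            super().__init__()
            self.n_actions, self.n_atoms = n_actions, n_atoms
            self.register_buffer("support", torch.linspace(v_min, v_max, n_atoms))
            self.logits = nn.Linear(feature_dim, n_actions * n_atoms)

        def forward(self, features):
            logits = self.logits(features).view(-1, self.n_actions, self.n_atoms)
            probs = torch.softmax(logits, dim=-1)          # distribution over atoms per action
            q_values = (probs * self.support).sum(dim=-1)  # expected return per action
            return probs, q_values
    ```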

  • Overview: Deep Deterministic Policy Gradient (DDPG) and TD3

    DDPG shares many commonalities with DQN, but is designed to accommodate continuous action spaces. DDPG employs an actor-critic architecture, where the critic samples mini-batches of experience from a replay buffer to calculate a bootstrapped TD target for training the Q-function off-policy, and the actor is trained on-policy to learn the optimal action to take in a given state.

    The major difference between DQN and DDPG can be explained as follows. The DQN loss function at each iteration can be described as:

    $L_t(\theta_t) = \mathbb{E}_{(s,a,r,s')}\left[\left(r + \gamma Q(s', \operatorname{argmax}_{a'} Q(s', a'; \theta^-); \theta^-) - Q(s, a; \theta_t)\right)^2\right]$

    which uses the argmax of the q-function of the next state to obtain the greedy action.

    In contrast, the DDPG loss at time step t is given by:

    $L_t(\theta_t) = \mathbb{E}_{(s,a,r,s')}\left[\left(r + \gamma Q(s', \mu(s'; \varphi^-); \theta^-) - Q(s, a; \theta_t)\right)^2\right]$

    Where the argmax has been replaced by a learned policy function $\mu$, parametrized by $\varphi$, which learns the deterministic greedy action for the current state. Note that both the actor and critic networks use the target-network approach, as in DQN.

    The actor/policy network $\mu(\cdot; \varphi_t)$ is trained to find the action which maximizes the expected q-value, with objective:

    $J_t(\varphi_t) = \mathbb{E}_s\left[Q(s, \mu(s; \varphi_t); \theta)\right]$

    Where the states s are sampled from the replay buffer, and in practice are the same states used to update the critic network.

    To allow for exploration with deterministic policies, we inject Gaussian noise into the actions selected by the policy.
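    A compact sketch of one DDPG learning step is given below. It assumes plain torch modules and optimizers with hypothetical names (`actor`, `critic`, `target_actor`, `target_critic`), and is not the repository's implementation:

    ```python
    import torch
    import torch.nn.functional as F

    def ddpg_update(batch, actor, critic, target_actor, target_critic,
                    actor_opt, critic_opt, gamma=0.99):
        """One DDPG update on a mini-batch sampled from the replay buffer
        (rewards and dones assumed to be column vectors of shape (B, 1))."""
        states, actions, rewards, next_states, dones = batch

        # Critic: regress Q(s, a; theta) towards the bootstrapped TD target,
        # which uses the *target* actor and *target* critic.
        with torch.no_grad():
            next_actions = target_actor(next_states)
            y = rewards + gamma * (1 - dones) * target_critic(next_states, next_actions)
        critic_loss = F.mse_loss(critic(states, actions), y)
        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()

        # Actor: ascend the critic's estimate of Q(s, mu(s; phi)),
        # implemented as minimizing the negative Q-value.
        actor_loss = -critic(states, actor(states)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()
    ```

    During rollouts, exploration is obtained by perturbing the deterministic action `mu = actor(state)`, e.g. adding `noise_std * torch.randn_like(mu)` and clipping the result to the valid action range.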

    Prioritized experience replay

    As in DQN, we can replace the uniformly-sampled replay buffer with a prioritized replay buffer.

    The TD3 paper introduces a number of improvements to the classic DDPG algorithm which improve performance, given below (see the sketch after this list):

    • Double Learning/ twin loss function
      • This improvement splits the critic model into two separate and independent networks (unless weight sharing is used), where only the optimizer is shared, creating two separate estimates of the value of the state-action pair. The joint loss function is computed as the sum of the MSEs of the two streams:

        $J_{twin,t} = J_t(\theta_t^a) + J_t(\theta_t^b)$

        for the two streams a and b, with:

        $J_t(\theta_t^a) = \mathbb{E}_{s, a, r, s'}\left[\left(Y_t^{twin} - Q(s, a; \theta_t^a)\right)^2\right]$

        $J_t(\theta_t^b) = \mathbb{E}_{s, a, r, s'}\left[\left(Y_t^{twin} - Q(s, a; \theta_t^b)\right)^2\right]$

        $Y_t^{twin} = r + \gamma \min_{n \in \{a, b\}} Q(s', \mu(s'; \varphi^-); \theta^{n,-})$

        where the target q-value is the minimum of the q-values obtained from the two streams, using target networks for both the policy and the value networks.

    • Smoothing targets for policy updates
      • In DDPG, Gaussian noise was added to the actions used for exploration. In TD3, this is extended by also adding noise to the actions used to calculate the targets, which can be seen as a regularizer: the network is forced to learn similarities between nearby actions, which helps especially early in training, when the network is most likely to converge to incorrect actions.

      $a'_{smooth} = \operatorname{clamp}\left(\mu(s'; \varphi^-) + \operatorname{clamp}(\varepsilon, \varepsilon_{low}, \varepsilon_{high}),\ a_{low},\ a_{high}\right)$

      where $\operatorname{clamp}(x, l, h) = \max(\min(x, h), l)$ is the clamping function, $\varepsilon$ is an array of Gaussian noise, and $\varepsilon_{low/high}$ and $a_{low/high}$ represent the minimum/maximum of the sampled noise and actions, respectively.

      The smooth TD target is then:

      $Y_t^{TD3} = r + \gamma \min_n Q(s', a'_{smooth}; \theta^{n,-})$

    • Delaying policy updates
      • The last TD3 improvement is to update the online Q-function at a higher frequency than both the policy network and the soft-copy of the online networks into the target networks. Delaying updates in this manner allows the value function to stabilize to more accurate values before it is used to update the policy network. A typical delay of 2 is used, meaning that the policy and target networks are updated once for every two updates of the online Q-function.
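    The sketch below combines the TD3 ingredients for the critic target: twin target critics with a minimum, and target-policy smoothing with clipped noise. All names are hypothetical placeholders rather than the classes in this repository, and the noise constants are the defaults suggested in the TD3 paper:

    ```python
    import torch

    def td3_critic_target(rewards, next_states, dones, target_actor,
                          target_critic_a, target_critic_b, gamma=0.99,
                          noise_std=0.2, noise_clip=0.5,
                          action_low=-1.0, action_high=1.0):
        """Smoothed, twin-minimum TD target shared by both critic streams."""
        with torch.no_grad():
            # Target-policy smoothing: add clipped Gaussian noise to the target action.
            mu_next = target_actor(next_states)
            noise = (torch.randn_like(mu_next) * noise_std).clamp(-noise_clip, noise_clip)
            next_actions = (mu_next + noise).clamp(action_low, action_high)

            # Twin critics: take the element-wise minimum of the two target estimates.
            q_next = torch.min(target_critic_a(next_states, next_actions),
                               target_critic_b(next_states, next_actions))
            return rewards + gamma * (1 - dones) * q_next
    ```

    Both critic streams are regressed towards this single target, while the actor and target networks are only updated once every couple of critic updates (the delayed policy updates described above).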
  • Overview: Multi-Agent DDPG (MADDPG)

    The MADDPG algorithm attempts to address the challenges of applying single-agent algorithms in a multi-agent setting, which generally results in poor performance due to 1) the non-stationarity of the environment caused by the presence of other agents, and 2) the exponential increase in state/action spaces that results from modelling multiple interacting agents (the curse of dimensionality). The MADDPG algorithm is an example of a class of algorithms referred to as "centralized planning with decentralized execution".

    Centralized Planning

    The concept of centralized planning with decentralized execution involves agents which only have access to local (i.e. their own) observations. During training, all agents' policy updates are guided by the same (centralized) critic, while during inference the centralized critic is removed and agents are guided only by their independently learned policies. The centralized critic used during training incorporates information from all agents, alleviating the non-stationarity of the environment.

    Decentralized execution

    During inference the centralized critic module is removed, forcing each agent to make decisions guided by local observations and their individual policies.

    MADDPG Architecture

    In the MADDPG framework, there are multiple agents, where each agent has a discrete observation space and a continuous action space. Each agent has an online actor, a target actor, and a critic which uses global information (states and actions from all agents). If the agents are homogeneous, the actor/target networks can be shared amongst agents.

    Learning algorithm

    As in DDPG, MADDPG learns off-policy using an experience replay buffer, where each experience is stored as a tuple containing all agents' actions $a_1, \dots, a_N$, the state information $x, x'$ required for the update of the critic network, and the agents' rewards $r_1, \dots, r_N$:

    $(x, x', a_1, \dots, a_N, r_1, \dots, r_N)$

    Critic update

    In general, there may be $N$ independent policies $\mu_{\theta_i}$, yielding the gradient:

    $\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{x, a \sim D}\left[\nabla_{\theta_i} \mu_i(a_i \mid o_i)\, \nabla_{a_i} Q_i^\mu(x, a_1, \dots, a_N)\big|_{a_i = \mu_i(o_i)}\right]$

    Where $x, a$ are sampled from the replay buffer $D$. The centralized action-value function $Q_i^\mu$ is then updated as:

    $L(\theta_i) = \mathbb{E}_{x, a, r, x'}\left[\left(Q_i^\mu(x, a_1, \dots, a_N) - y\right)^2\right]$,

    where $y = r_i + \gamma\, Q_i^{\mu'}(x', a'_1, \dots, a'_N)\big|_{a'_j = \mu'_j(o_j)}$

    Actor update

    The actor update for MADDPG is analogous to DDPG, where each agent's actor $\mu_i$ is updated as:

    $\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{x, a \sim D}\left[\nabla_{\theta_i} \mu_i(a_i \mid o_i)\, \nabla_{a_i} Q_i^\mu(x, a_1, \dots, a_N)\big|_{a_i = \mu_i(o_i)}\right]$

    The actor's gradient, having access only to the agent's local observations, is scaled by the gradient of the critic, which has access to global information, reducing the effects of non-stationarity.
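    A minimal sketch of the centralized critic update for agent $i$ is shown below; the batch layout and all names (`critics`, `target_critics`, `target_actors`, `critic_opts`) are hypothetical, not the repository's classes:

    ```python
    import torch
    import torch.nn.functional as F

    def maddpg_critic_update(i, batch, critics, target_critics, target_actors,
                             critic_opts, gamma=0.99):
        """Update agent i's centralized critic Q_i(x, a_1, ..., a_N)."""
        # next_obs is a list of per-agent observations o_j; x / next_x are the joint
        # (global) states; actions is the concatenated joint action tensor.
        next_obs, x, next_x, actions, rewards, dones = batch

        with torch.no_grad():
            # Each agent's *target* actor proposes its next action from its own observation.
            next_actions = torch.cat(
                [target_actors[j](next_obs[j]) for j in range(len(target_actors))], dim=-1)
            y = rewards[i] + gamma * (1 - dones) * target_critics[i](next_x, next_actions)

        critic_loss = F.mse_loss(critics[i](x, actions), y)
        critic_opts[i].zero_grad()
        critic_loss.backward()
        critic_opts[i].step()
    ```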

    Policy inference

    Rather than explicitly passing in the

    MATD3 Extensions

    We can apply the same extensions which take DDPG to TD3 to the multi-agent case, resulting in better algorithm convergence, as shown in the MATD3 paper.

  • Overview: Proximal Policy Optimization (PPO)

    The PPO method was introduced in 2017 and is an on-policy learning algorithm which alternates between sampling data from the environment and optimizing a surrogate objective function by gradient descent, in contrast to other policy-gradient methods which perform a single gradient update per sample before discarding it. The PPO algorithm has experimentally been shown to offer better sample complexity than other policy-gradient algorithms, while being more robust and simpler than other families of algorithms such as Q-learning.

    Clipped Surrogate Objective Function

    The PPO algorithm borrows its surrogate objective function from the Trust Region Policy Optimization (TRPO) algorithm:

    $L^{CPI}(\theta) = \mathbb{E}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, A_t\right] = \mathbb{E}_t\left[r_t(\theta)\, A_t\right]$

    The clipped surrogate objective function that is proposed in PPO is:

    $L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) A_t,\ \operatorname{clip}(r_t(\theta), 1 - \varepsilon, 1 + \varepsilon) A_t\right)\right]$

    where $\varepsilon$ is a small constant, such as $\varepsilon = 0.2$. The first term is the same as $L^{CPI}$, while the second term clips $r_t(\theta)$ to lie within the interval $[1 - \varepsilon, 1 + \varepsilon]$. By taking the minimum of the unclipped and clipped objectives, we obtain a lower bound on the final objective, which improves the stability of the algorithm.

    During training, the policy function is typically guided by a state-value function $V(s)$, which is included in the optimization objective as

    $L_t^{VF} = \left(V_\theta(s_t) - V_t^{targ}\right)^2$

    and an additional entropy term of the policy distribution which helps to ensure sufficient exploration,

    $L_t^{entropy} = S[\pi_\theta](s_t)$

    Resulting in a combined objective of $L_t = L_t^{CLIP}(\theta) - c_1 L_t^{VF} + c_2 L_t^{entropy}$, which is maximized with respect to $\theta$,

    where $c_1$ and $c_2$ are constants.
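    A minimal sketch of the combined PPO objective, written as a loss to minimize (so the surrogate and entropy terms appear with a negative sign), is given below; the tensor names and coefficient values are assumptions:

    ```python
    import torch

    def ppo_loss(new_log_probs, old_log_probs, advantages, values, value_targets,
                 entropy, clip_eps=0.2, c1=0.5, c2=0.01):
        """Clipped surrogate objective combined with value-function and entropy terms."""
        ratio = torch.exp(new_log_probs - old_log_probs)               # r_t(theta)
        surrogate = torch.min(ratio * advantages,
                              torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages)
        value_loss = (values - value_targets).pow(2)                   # L_t^{VF}
        return (-surrogate + c1 * value_loss - c2 * entropy).mean()    # minimized by the optimizer
    ```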

    MAPPO

    The PPO algorithm above can be extended to the multi-agent scenario in a way analogous to the extension of DDPG to MADDPG. This involves passing the states and actions of all agents in the environment (joint_state, joint_actions) to the critic of each agent during training, whose value estimate assists in guiding the learning of the policy (actor) network. During evaluation, only the policy network is used, and the agents are not provided any external information about other agents.
