74 changes: 62 additions & 12 deletions docs/source/Examples/a2c_three_columns.rst
@@ -1,38 +1,68 @@
A2C algorithm on mock data set
==============================


Overview
--------

Both the Q-learning algorithm we used in `Q-learning on a three columns dataset <qlearning_three_columns.html>`_ and the SARSA algorithm in
`Semi-gradient SARSA on a three columns data set <semi_gradient_sarsa_three_columns.html>`_ are value-based methods; that is, they directly estimate value functions, specifically the state-action function
:math:`Q`. Knowing :math:`Q`, we can construct a policy to follow, for example by choosing the action that maximizes the state-action function at the given state,
i.e. :math:`argmax_{\alpha}Q(s_t, \alpha)`, which is a greedy policy. In other words, such methods learn the policy indirectly, through a value function.

However, the true objective of reinforcement learning is to directly learn a policy :math:`\pi`. One class of algorithms in this direction is policy gradient algorithms
such as REINFORCE and Advantage Actor-Critic or A2C algorithms. A review of actor-critic methods can be found in [1].

A2C algorithm
-------------

Typically, with policy gradient methods, and A2C in particular, we approximate the policy directly by a parametrized model.
Thereafter, we train the model, i.e. learn its parameters, by taking samples from the environment.
The main advantage of learning a parametrized policy is that it can be any learnable function, e.g. a linear model or a deep neural network.
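
For illustration only, the class below, its names and its dimensions are hypothetical and not part of the library: a parametrized policy over a discrete action space can be as simple as a linear model followed by a softmax.

.. code-block:: python

    import torch
    import torch.nn as nn


    class LinearSoftmaxPolicy(nn.Module):
        """A minimal parametrized policy: a linear map from state features
        to a probability distribution over a discrete action space."""

        def __init__(self, state_dim: int, n_actions: int):
            super().__init__()
            self.linear = nn.Linear(state_dim, n_actions)

        def forward(self, state: torch.Tensor) -> torch.Tensor:
            # Returns a tensor of action probabilities, i.e. pi_theta(. | s)
            return torch.softmax(self.linear(state), dim=-1)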

The A2C algorithm is the synchronous version of A3C [REF]. Both algorithms fall under the umbrella of actor-critic methods [REF]. In these methods, we estimate a parametrized policy, the actor,
and a parametrized value function, the critic. The role of the policy or actor network is to indicate which action to take in a given state. In our implementation below,
the policy network returns a probability distribution over the action space, specifically a tensor of probabilities. The role of the critic model is to evaluate how good
the selected action is.
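
As a minimal sketch, and not the actual implementation, such an actor-critic pair can share a common body with two heads; all names and sizes below are illustrative assumptions.

.. code-block:: python

    import torch
    import torch.nn as nn


    class ActorCritic(nn.Module):
        """A weight-sharing actor-critic sketch: a common body, a policy
        head (the actor) and a value head (the critic)."""

        def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
            self.actor = nn.Linear(hidden, n_actions)  # logits over actions
            self.critic = nn.Linear(hidden, 1)         # state value V(s)

        def forward(self, state: torch.Tensor):
            features = self.body(state)
            action_probs = torch.softmax(self.actor(features), dim=-1)
            value = self.critic(features)
            return action_probs, value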

In A2C there is a single agent that interacts with multiple instances of the environment. In other words, we create a number of workers where each worker loads its own instance of the data set to anonymize. A shared model is then optimized by each worker.
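
The snippet below is a simplified, single-process sketch of this setup. The ``DummyEnv`` class, the dimensions and the reuse of the ``ActorCritic`` sketch above are illustrative assumptions rather than the library's actual classes.

.. code-block:: python

    from typing import Tuple

    import torch


    class DummyEnv:
        """Stand-in environment used only to keep the sketch self-contained;
        the real environment wraps the data set to anonymize."""

        def __init__(self, state_dim: int = 3):
            self.state_dim = state_dim

        def reset(self) -> torch.Tensor:
            return torch.zeros(self.state_dim)

        def step(self, action: int) -> Tuple[torch.Tensor, float, bool]:
            # next state, reward, done flag
            return torch.randn(self.state_dim), 0.0, False


    n_workers = 4
    envs = [DummyEnv() for _ in range(n_workers)]  # one environment instance per worker
    states = [env.reset() for env in envs]

    # A single shared model (here the ActorCritic sketch above) and one optimizer
    model = ActorCritic(state_dim=3, n_actions=5)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Every worker steps its own environment copy, but all workers query
    # (and later update) the same shared model.
    for env, state in zip(envs, states):
        probs, value = model(state)
        action = torch.distributions.Categorical(probs=probs).sample()
        next_state, reward, done = env.step(int(action.item()))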

The objective of the agent is to maximize the expected discounted return [2]:

.. math::

J(\pi_{\theta}) = E_{\tau \sim \rho_{\theta}}\left[\sum_{t=0}^T\gamma^t R(s_t, \alpha_t)\right]

where :math:`\tau` is the trajectory the agent observes with probability distribution :math:`\rho_{\theta}`, :math:`\gamma` is the
discount factor and :math:`R(s_t, \alpha_t)` represents a reward function that is unknown to the agent. We can rewrite the expression above as

.. math::

J(\pi_{\theta}) = E_{\tau \sim \rho_{\theta}}\left[\sum_{t=0}^T\gamma^t R(s_t, \alpha_t)\right] = \int \rho_{\theta} (\tau) \sum_{t=0}^T\gamma^t R(s_t, \alpha_t) d\tau


Let's condense the notation by using :math:`G(\tau)` to denote the sum in the expression above, i.e.

.. math::

G(\tau) = \sum_{t=0}^T\gamma^t R(s_t, \alpha_t)
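
As a quick illustration, with made-up rewards, :math:`G(\tau)` for a finite trajectory can be computed as follows.

.. code-block:: python

    def discounted_return(rewards, gamma: float = 0.99) -> float:
        """Compute G(tau) = sum_t gamma^t * r_t for a finite list of rewards."""
        return sum((gamma ** t) * r for t, r in enumerate(rewards))


    # made-up rewards from a short trajectory
    print(discounted_return([1.0, 0.5, -0.2], gamma=0.9))  # 1.0 + 0.9*0.5 - 0.81*0.2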

The probability distribution :math:`\rho_{\theta}` is a function of the policy :math:`\pi_{\theta}`, as the policy dictates which action is taken. Indeed, we can write [2],

.. math::

\rho_{\theta}(\tau) = p(s_0) \prod_{t=0}^{\infty} \pi_{\theta}(\alpha_t | s_t)P(s_{t+1}| s_t, \alpha_t)


where :math:`P(s_{t+1}| s_t, \alpha_t)` denotes the state transition probabilities.
Policy gradient methods use the gradient of :math:`J(\pi_{\theta})` in order to make progress. It turns out, see for example [2, 3], that we can write

.. math::

\nabla_{\theta} J(\pi_{\theta}) = \int \rho_{\theta}(\tau) \nabla_{\theta} \log (\rho_{\theta}(\tau)) G(\tau) d\tau

However, we cannot fully evaluate the integral above as we don't know the transition probabilities. Thus, we resort to taking samples from the
environment in order to obtain an estimate.
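
Since the transition probabilities do not depend on :math:`\theta`, only the :math:`\log \pi_{\theta}(\alpha_t | s_t)` terms contribute to :math:`\nabla_{\theta} \log \rho_{\theta}(\tau)`. A REINFORCE-style Monte Carlo sketch of such an estimate on a single trajectory is given below; the tiny policy and the sampled data are made-up assumptions, not the library's implementation.

.. code-block:: python

    import torch
    import torch.nn as nn
    from torch.distributions import Categorical

    torch.manual_seed(0)

    # A tiny policy used only for this illustration: 2 state features, 3 actions
    policy = nn.Sequential(nn.Linear(2, 3), nn.Softmax(dim=-1))

    # A single made-up sampled trajectory: visited states, sampled actions, rewards
    states = torch.randn(4, 2)
    actions = torch.tensor([0, 2, 1, 1])
    rewards = [1.0, 0.0, 0.5, -0.2]
    gamma = 0.99

    # G(tau): the discounted return of the sampled trajectory
    G = sum((gamma ** t) * r for t, r in enumerate(rewards))

    # Sample-based policy gradient: minimizing this loss follows
    # -G(tau) * sum_t grad log pi(a_t | s_t), i.e. it ascends J(pi_theta)
    log_probs = Categorical(probs=policy(states)).log_prob(actions)
    loss = -log_probs.sum() * G
    loss.backward()  # gradients are now stored, e.g. in policy[0].weight.grad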


Specifically, we will use a weight-sharing model. Moreover, the environment is a multi-process class that gathers samples from multiple
@@ -44,7 +74,19 @@ The advantage :math:`A(s_t, \alpha_t)` is defined as [REF]

A(s_t, \alpha_t) = Q_{\pi}(s_t, \alpha_t) - V_{\pi}(s_t)

It represents how good an action is at a given state relative to the state's value. We can use the critic model for :math:`V_{\pi}(s_t)` and the following approximation
for :math:`Q_{\pi}(s_t, \alpha_t)`

.. math::

Q_{\pi}(s_t, \alpha_t) = r_t + \gamma V_{\pi}(s_{t+1})

leading to


.. math::

A(s_t, \alpha_t) = r_t + \gamma V_{\pi}(s_{t+1}) - V_{\pi}(s_t)
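
As a small numerical illustration, with made-up values, the one-step advantage estimate is computed as follows.

.. code-block:: python

    gamma = 0.99

    # made-up quantities for a single transition (s_t, a_t, r_t, s_{t+1})
    reward = 0.5        # r_t
    value_s = 1.2       # critic's estimate V(s_t)
    value_next_s = 1.0  # critic's estimate V(s_{t+1})

    # A(s_t, a_t) ~ r_t + gamma * V(s_{t+1}) - V(s_t)
    advantage = reward + gamma * value_next_s - value_s
    print(advantage)  # approximately 0.29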



@@ -68,3 +110,11 @@ Overall, the A2C algorithm is described below
- Compute Critic gradients
Code
----


References
----------

1. Ivo Grondman, Lucian Busoniu, Gabriel A. D. Lopes, Robert Babuska, A survey of Actor-Critic reinforcement learning: Standard and natural policy gradients, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, 2012.
2. Enes Bilgin, Mastering Reinforcement Learning with Python, Packt Publishing.
3. Miguel Morales, Grokking Deep Reinforcement Learning, Manning Publications.
6 changes: 6 additions & 0 deletions docs/source/Examples/qlearning_all_columns.rst
@@ -160,3 +160,9 @@ The following images show the performance of the learning process
.. figure:: images/qlearn_distortion_multi_cols.png

Running average total distortion.


References
----------

1. Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, 2nd Edition, MIT Press.