From 6a7ea7648152b172a863356b0c3a274df9202dd0 Mon Sep 17 00:00:00 2001
From: Alex
Date: Mon, 25 Apr 2022 11:03:52 +0100
Subject: [PATCH] Fix documentation

---
 docs/source/Examples/a2c_three_columns.rst    | 74 ++++++++++++++++---
 .../source/Examples/qlearning_all_columns.rst |  6 ++
 2 files changed, 68 insertions(+), 12 deletions(-)

diff --git a/docs/source/Examples/a2c_three_columns.rst b/docs/source/Examples/a2c_three_columns.rst
index 38bc8f3..a4934b6 100644
--- a/docs/source/Examples/a2c_three_columns.rst
+++ b/docs/source/Examples/a2c_three_columns.rst
@@ -1,9 +1,8 @@
-A2C algorithm on three columns data set
-=======================================
+A2C algorithm on mock data set
+==============================
 
-
-A2C algorithm
--------------
+Overview
+--------
 
 Both the Q-learning algorithm we used in `Q-learning on a three columns dataset `_ and the SARSA algorithm in
 `Semi-gradient SARSA on a three columns data set `_ are value-based methods; that is, they estimate value functions directly. Specifically, the state-action function
@@ -11,28 +10,59 @@ Both the Q-learning algorithm we used in `Q-learning on a three columns dataset
 maximizes the state-action function, i.e. :math:`argmax_{\alpha}Q(s_t, \alpha)`, which is a greedy policy. These methods are called off-policy methods.
 However, the true objective of reinforcement learning is to learn a policy :math:`\pi` directly. One class of algorithms in this direction is policy gradient algorithms
-like REINFORCE and Advantage Actor-Critic of A2C algorithms.
+like REINFORCE and Advantage Actor-Critic or A2C algorithms. A review of A2C methods can be found in [1].
 
-Typically, with these methods we approximate directly the policy by a parametrized model.
+
+A2C algorithm
+-------------
+
+Typically, with policy gradient methods, and with A2C in particular, we approximate the policy directly by a parametrized model.
 Thereafter, we train the model, i.e. learn its parameters, by taking samples from the environment.
 The main advantage of learning a parametrized policy is that it can be any learnable function, e.g. a linear model or a deep neural network.
 
-The A2C algorithm is a a synchronous version of A3C. Both algorithms, fall under the umbrella of actor-critic methods [REF]. In these methods, we estimate a parametrized policy; the actor
+The A2C algorithm is the synchronous version of A3C [REF]. Both algorithms fall under the umbrella of actor-critic methods [REF]. In these methods, we estimate a parametrized policy (the actor)
 and a parametrized value function (the critic). The role of the policy or actor network is to indicate which action to take in a given state.
 In our implementation below, the policy network returns a probability distribution over the action space, specifically a tensor
 of probabilities. The role of the critic model is to evaluate how good the selected action is.
 
 In A2C there is a single agent that interacts with multiple instances of the environment. In other words, we create a number
 of workers where each worker loads its own instance of the data set to anonymize. A shared model is then optimized by each worker.
 
-The objective of the agent is to maximize the expected discounted return:
+The objective of the agent is to maximize the expected discounted return [2]:
 
 .. math::
 
    J(\pi_{\theta}) = E_{\tau \sim \rho_{\theta}}\left[\sum_{t=0}^T\gamma^t R(s_t, \alpha_t)\right]
 
 where :math:`\tau` is the trajectory the agent observes, with probability distribution :math:`\rho_{\theta}`, :math:`\gamma` is the
-discount factor and :math:`R(s_t, \alpha_t)` represents some unknown to the agent reward function.
-We can use neural networks to approximate both models
+discount factor and :math:`R(s_t, \alpha_t)` represents a reward function that is unknown to the agent. We can rewrite the expression above as
+
+.. math::
+
+   J(\pi_{\theta}) = E_{\tau \sim \rho_{\theta}}\left[\sum_{t=0}^T\gamma^t R(s_t, \alpha_t)\right] = \int \rho_{\theta}(\tau) \sum_{t=0}^T\gamma^t R(s_t, \alpha_t) d\tau
+
+Let's condense the notation by using :math:`G(\tau)` to denote the sum in the expression above, i.e.
+
+.. math::
+
+   G(\tau) = \sum_{t=0}^T\gamma^t R(s_t, \alpha_t)
+
+The probability distribution :math:`\rho_{\theta}` is a function of the policy :math:`\pi_{\theta}` that is being followed, since the policy dictates which action is taken. Indeed, we can write [2]
+
+.. math::
+
+   \rho_{\theta}(\tau) = p(s_0) \prod_{t=0}^{\infty} \pi_{\theta}(\alpha_t | s_t) P(s_{t+1} | s_t, \alpha_t)
+
+where :math:`P(s_{t+1} | s_t, \alpha_t)` denotes the state transition probabilities.
+Policy gradient methods use the gradient of :math:`J(\pi_{\theta})` in order to make progress. It turns out, see for example [2, 3], that we can write
+
+.. math::
+
+   \nabla_{\theta} J(\pi_{\theta}) = \int \rho_{\theta}(\tau) \nabla_{\theta} \log \rho_{\theta}(\tau) G(\tau) d\tau
+
+However, we cannot evaluate the integral above exactly, since we do not know the transition probabilities. Thus, we resort to taking samples from the
+environment in order to obtain an estimate.
 Specifically, we will use a weight-sharing model. Moreover, the environment is a multi-process class that gathers samples from multiple
@@ -44,7 +74,19 @@ The advantage :math:`A(s_t, \alpha_t)` is defined as [REF]
 
 .. math::
 
    A(s_t, \alpha_t) = Q_{\pi}(s_t, \alpha_t) - V_{\pi}(s_t)
 
-It represents a goodness fit for an action at a given state. where ...
+It represents how good an action is relative to the average action at a given state. We can use the critic to estimate :math:`V_{\pi}(s_t)` and use the following approximation
+for :math:`Q_{\pi}(s_t, \alpha_t)`
+
+.. math::
+
+   Q_{\pi}(s_t, \alpha_t) = r_t + \gamma V_{\pi}(s_{t+1})
+
+leading to
+
+.. math::
+
+   A(s_t, \alpha_t) = r_t + \gamma V_{\pi}(s_{t+1}) - V_{\pi}(s_t)
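+
+The following is a minimal sketch of how the pieces above fit together, assuming PyTorch and a discrete action space.
+The names ``ActorCritic`` and ``one_step_losses`` (and the sizes ``n_states``, ``n_actions``, ``n_hidden``) are
+illustrative only and are not part of this package's API. The actor head returns a tensor of action probabilities,
+the critic head returns :math:`V_{\pi}(s_t)`, and the two losses are built from the one-step advantage estimate above
+(episode termination is omitted for brevity).
+
+.. code-block:: python
+
+   import torch
+   import torch.nn as nn
+   import torch.nn.functional as F
+
+   class ActorCritic(nn.Module):
+       """Shared-body network with an actor head (policy) and a critic head (value)."""
+
+       def __init__(self, n_states: int, n_actions: int, n_hidden: int = 64):
+           super().__init__()
+           self.body = nn.Sequential(nn.Linear(n_states, n_hidden), nn.ReLU())
+           self.actor = nn.Linear(n_hidden, n_actions)   # logits over the action space
+           self.critic = nn.Linear(n_hidden, 1)          # state value V(s)
+
+       def forward(self, state):
+           z = self.body(state)
+           probs = F.softmax(self.actor(z), dim=-1)      # tensor of action probabilities
+           value = self.critic(z).squeeze(-1)            # V(s)
+           return probs, value
+
+   def one_step_losses(model, state, action, reward, next_state, gamma=0.99):
+       """Actor and critic losses from A(s_t, a_t) = r_t + gamma * V(s_{t+1}) - V(s_t)."""
+       probs, value = model(state)
+       with torch.no_grad():
+           _, next_value = model(next_state)               # V(s_{t+1}), no gradient
+       advantage = reward + gamma * next_value - value     # one-step advantage estimate
+       log_prob = torch.log(probs.gather(-1, action.unsqueeze(-1)).squeeze(-1))
+       actor_loss = -(log_prob * advantage.detach()).mean()   # policy-gradient term
+       critic_loss = advantage.pow(2).mean()                  # squared TD error for the critic
+       return actor_loss, critic_loss
+
+In a training loop, each worker would collect transitions from its own environment instance, compute these two losses
+on the shared model and step a common optimizer, for example ``torch.optim.Adam(model.parameters(), lr=1e-3)``.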
 
 
@@ -68,3 +110,11 @@ Overall, the A2C algorithm is described below
 - Compute Critic gradients
 
 Code
 ----
+
+
+References
+----------
+
+1. Ivo Grondman, Lucian Busoniu, Gabriel A. D. Lopes, and Robert Babuska, "A survey of actor-critic reinforcement learning: Standard and natural policy gradients," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, 2012.
+2. Enes Bilgin, Mastering Reinforcement Learning with Python, Packt Publishing, 2020.
+3. Miguel Morales, Grokking Deep Reinforcement Learning, Manning Publications, 2020.
diff --git a/docs/source/Examples/qlearning_all_columns.rst b/docs/source/Examples/qlearning_all_columns.rst
index 4f40124..4be87c3 100644
--- a/docs/source/Examples/qlearning_all_columns.rst
+++ b/docs/source/Examples/qlearning_all_columns.rst
@@ -160,3 +160,9 @@ The following images show the performance of the learning process
 .. figure:: images/qlearn_distortion_multi_cols.png
 
    Running average total distortion.
+
+
+References
+----------
+
+1. Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, 2nd Edition, MIT Press, 2018.