# Trust Region Policy Optimization

* Decision-Making RL: Part 4 - Deep Reinforcement Learning [1]
* 김무성

# Contents
* Abstract
* 1 Introduction
* 2 Preliminaries
* 3 Monotonic Improvement Guarantee for General Stochastic Policies
* 4 Optimization of Parameterized Policies
* 5 Sample-Based Estimation of the Objective and Constraint
    - 5.1 Single Path
    - 5.2 Vine
* 6 Practical Algorithm
* 7 Connections with Prior Work
* 8 Experiments
    - 8.1 Simulated Robotic Locomotion
    - 8.2 Playing Games from Images
* 9 Discussion

# Abstract

#### References
* [4] Deep Reinforcement Learning - https://learning.mpi-sws.org/mlss2016/slides/2016-MLSS-RL.pdf
* [3] Reinforcement Learning Tutorial: Let's Train a Neural Network to Play 'Pong' (Korean translation of Andrej Karpathy's post) - http://keunwoochoi.blogspot.kr/2016/06/andrej-karpathy.html
* [8] Deep Reinforcement Learning / David Silver - http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf
* [9] Policy Gradient Methods - http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching_files/pg.pdf
* [5] rllab - https://github.com/rllab/rllab

# 1 Introduction

* Most algorithms for policy optimization can be classified into three broad categories: 
    - (1) policy iteration methods, which alternate between estimating the value function under the current policy and improving the policy (Bertsekas, 2005); 
    - (2) policy gradient methods, which use an estimator of the gradient of the expected return (total reward) obtained from sample trajectories (Peters & Schaal, 2008a) (and which, as we later discuss, have a close connection to policy iteration); and
    - (3) derivative-free optimization methods, such as the cross-entropy method (CEM) and covariance matrix adaptation (CMA), which treat the return as a black box function to be optimized in terms of the policy parameters.
* In this article, 
    - we first prove that minimizing a certain surrogate objective function guarantees policy improvement with non-trivial step sizes.
    - Then we make a series of approximations to the theoretically-justified algorithm, yielding a practical algorithm, 
        - which we call trust region policy optimization (TRPO).

# 2 Preliminaries

<img src="figures/cap1.png" width=600 />

<img src="figures/cap2.png" width=600 />

<img src="figures/cap3.png" width=600 />

<img src="figures/cap4.png" width=600 />

# 3 Monotonic Improvement Guarantee for General Stochastic Policies

<img src="figures/cap5.png" width=600 />

<img src="figures/cap6.png" width=600 />

<img src="figures/cap7.png" width=600 />

<img src="figures/cap8.png" width=600 />

<font color="red">Trust region policy optimization, which we propose in the following section, is an approximation to Algorithm 1, which uses a constraint on the KL divergence rather than a penalty to robustly allow large updates</font>.
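
The difference between a penalty and a constraint can be written out as a short sketch. The notation below follows the paper's convention as I understand it: $L_{\theta_{old}}$ is the surrogate objective, $C$ is the penalty coefficient from the theoretical bound, and $\delta$ is the trust-region size. Treat it as a summary under those assumptions, not a replacement for the equations in the figures.

$$
\text{Penalty form:}\quad \max_{\theta}\; \Big[ L_{\theta_{old}}(\theta) - C \, D_{KL}^{\max}(\theta_{old}, \theta) \Big]
$$

$$
\text{Constraint form:}\quad \max_{\theta}\; L_{\theta_{old}}(\theta) \quad \text{subject to} \quad D_{KL}^{\max}(\theta_{old}, \theta) \le \delta
$$

With the penalty form, the value of $C$ suggested by the theory tends to make the steps very small. Fixing $\delta$ directly bounds how far the policy may move per update while still allowing larger steps.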

# 4 Optimization of Parameterized Policies

* In the previous section, we considered the policy optimization problem independently of the parameterization of π and under the assumption that the policy can be evaluated at all states. 
* We now describe how to derive a practical algorithm from these theoretical foundations, under finite sample counts and arbitrary parameterizations.

<img src="figures/cap9.png" width=600 />

<img src="figures/cap10.png" width=600 />

<img src="figures/cap11.png" width=600 />

# 5 Sample-Based Estimation of the Objective and Constraint
* 5.1 Single Path
* 5.2 Vine

The previous section proposed a constrained optimization problem on the policy parameters (Equation (12)), which optimizes an estimate of the expected total reward η subject to a constraint on the change in the policy at each update. 

This section describes how the objective and constraint functions can be approximated using Monte Carlo simulation.
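
As a minimal sketch of the constraint side, the snippet below estimates the average KL divergence between the old and new policies from a batch of sampled states. It assumes a discrete action space with explicit probability tables. The function name `mean_kl_categorical`, the toy probabilities, and the threshold `delta` are all hypothetical and chosen only for illustration.

```python
import numpy as np

def mean_kl_categorical(old_probs, new_probs, eps=1e-12):
    """Average KL(old || new) over sampled states.

    old_probs, new_probs: arrays of shape (num_states, num_actions),
    where each row is the action distribution at one sampled state.
    """
    old_probs = np.asarray(old_probs, dtype=float)
    new_probs = np.asarray(new_probs, dtype=float)
    # Per-state KL divergence; eps guards against log(0).
    per_state_kl = np.sum(
        old_probs * (np.log(old_probs + eps) - np.log(new_probs + eps)),
        axis=1,
    )
    # The Monte Carlo estimate is the mean over the sampled states.
    return float(per_state_kl.mean())

# Hypothetical action distributions at two sampled states.
old = np.array([[0.5, 0.5], [0.9, 0.1]])
new = np.array([[0.6, 0.4], [0.8, 0.2]])

kl = mean_kl_categorical(old, new)
delta = 0.01  # hypothetical trust-region size
print("mean KL:", kl, "within trust region:", kl <= delta)
```

A candidate update would be accepted only if the estimated mean KL stays at or below `delta`; otherwise the step would have to be shrunk.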

<img src="figures/cap12.png" width=600 />

<img src="figures/cap13.png" width=600 />

#### sampling

* All that remains is to replace the expectations by sample averages and replace the Q value by an empirical estimate. 
* The following sections describe two different schemes for performing this estimation.
    - single path
        - The first sampling scheme, which we call single path, is the one that is typically used for policy gradient estimation (Bartlett & Baxter, 2011), and is based on sampling individual trajectories. 
    - vine
        - The second scheme, which we call vine, involves constructing a rollout set and then performing multiple actions from each state in the rollout set. This method has mostly been explored in the context of policy iteration methods.
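
Replacing the expectation with a sample average can be sketched as follows. The snippet assumes that, for each sampled state-action pair, we already have the probability under the new policy, the probability under the sampling distribution $q$, and an empirical Q estimate. The function name `surrogate_estimate` and the toy numbers are hypothetical.

```python
import numpy as np

def surrogate_estimate(new_probs, q_probs, q_values):
    """Sample average of (pi_theta(a|s) / q(a|s)) * Q(s, a)."""
    new_probs = np.asarray(new_probs, dtype=float)
    q_probs = np.asarray(q_probs, dtype=float)
    q_values = np.asarray(q_values, dtype=float)
    # Importance weights correct for sampling actions from q
    # instead of from the new policy.
    ratios = new_probs / q_probs
    return float(np.mean(ratios * q_values))

# Hypothetical batch of three sampled (s, a) pairs.
q_probs = np.array([0.5, 0.25, 0.8])    # q(a|s) used for sampling
new_probs = np.array([0.6, 0.2, 0.8])   # pi_theta(a|s) being evaluated
q_values = np.array([1.0, -2.0, 0.5])   # empirical Q estimates

estimate = surrogate_estimate(new_probs, q_probs, q_values)
print("surrogate estimate:", estimate)
```

When the new policy equals the sampling distribution, every ratio is 1 and the estimate reduces to the plain average of the Q estimates.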

<img src="figures/cap14.png" width=600 />

## 5.1 Single Path

* In this estimation procedure, we collect a sequence of states by sampling $s_0 \sim \rho_0$ and then simulating the policy $\pi_{\theta_{old}}$ for some number of timesteps to generate a trajectory $s_0, a_0, s_1, a_1, \ldots, s_{T-1}, a_{T-1}, s_T$.
* Hence, $q(a|s) = \pi_{\theta_{old}}(a|s)$.
* $Q_{\theta_{old}}(s,a)$ is computed at each state-action pair $(s_t, a_t)$ by taking the discounted sum of future rewards along the trajectory.
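
The discounted sum of future rewards along one trajectory can be computed in a single backward pass. This is a minimal sketch: the function name `discounted_returns`, the reward list, and the discount factor are hypothetical, and the trajectory is simply truncated at its last step.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical rewards observed along one sampled trajectory.
q_estimates = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
print(q_estimates)
```

For rewards 1, 0, 2 and a discount of 0.5, the returns are 1.5, 1.0, and 2.0. Each value serves as the empirical estimate of $Q_{\theta_{old}}(s_t, a_t)$ at the corresponding timestep.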


## 5.2 Vine

<img src="figures/cap15.png" width=600 />

# 6 Practical Algorithm

<img src="figures/cap16.png" width=600 />

<img src="figures/cap17.png" width=600 />
<img src="figures/cap18.png" width=600 />

# 7 Connections with Prior Work

<img src="figures/fig2.png" width=600 />

<img src="figures/cap19.png" width=600 />

<img src="figures/cap20.png" width=600 />

# 8 Experiments
* 8.1 Simulated Robotic Locomotion
* 8.2 Playing Games from Images

<img src="figures/cap21.png" width=600 />

## 8.1 Simulated Robotic Locomotion

<img src="figures/cap22.png" width=600 />
<img src="figures/cap23.png" width=600 />

<img src="figures/cap24.png" width=600 />

<img src="figures/cap25.png" width=600 />

## 8.2 Playing Games from Images

<img src="figures/cap26.png" width=600 />

# 9 Discussion

# References

* [1] Schulman, Levine, Moritz, Jordan, Abbeel: Trust Region Policy Optimization - http://arxiv.org/abs/1502.05477
* [2] Deep Reinforcement Learning: Pong from Pixels - http://karpathy.github.io/2016/05/31/rl/
* [3] Reinforcement Learning Tutorial: Let's Train a Neural Network to Play 'Pong' (Korean translation of Andrej Karpathy's post) - http://keunwoochoi.blogspot.kr/2016/06/andrej-karpathy.html
* [4] Deep Reinforcement Learning - https://learning.mpi-sws.org/mlss2016/slides/2016-MLSS-RL.pdf
* [5] rllab - https://github.com/rllab/rllab
* [6] Trust Region Policy Optimization(video) - https://sites.google.com/site/trpopaper/
* [7] ICML 2016 Tutorials - http://icml.cc/2016/?page_id=97
* [8] Deep Reinforcement Learning / David Silver - http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf
* [9] Policy Gradient Methods - http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching_files/pg.pdf