# Reinforcement learning of motor skills with policy gradients

* 싸이지먼트 / Decision-Making RL / Part 3 - Deep Reinforcement Learning [1]
* 김무성

# Contents
* abstract
* Appendix. Motor primitive equations
* 1. Introduction
    - 1.1. General assumptions and problem statement
    - 1.2. Motor primitive policies
* 2. Policy gradient approaches for parameterized motor primitives
    - 2.1. Finite-difference methods
    - 2.2. Likelihood ratio methods and REINFORCE
* 3. ‘Vanilla’ policy gradient approaches
    - 3.1. Policy gradient theorem and G(PO)MDP
    - 3.2. Optimal baselines
    - 3.3. Compatible function approximation
* 4. Natural Actor-Critic
    - 4.1. Motivation
    - 4.2. Connection to the compatible function approximation
    - 4.3. Natural actor-critic algorithms
* 5. Empirical evaluations
    - 5.1. Comparing policy gradient methods on motor primitives
    - 5.2. Robot application: Motor primitive learning for baseball
* 6. Conclusion & discussion

#### References
* [2] Lecture 7: Policy Gradient Methods - http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching_files/pg.pdf
* [4] Reinforcement Learning for Motor Control - http://www.igi.tugraz.at/pfeiffer/documents/MotorControlAndRL.ppt

# abstract

# Appendix. Motor primitive equations

#### References
* [3] Learning Attractor Landscapes for Learning Motor Primitives - https://papers.nips.cc/paper/2140-learning-attractor-landscapes-for-learning-motor-primitives.pdf

<img src="http://www.nlpu.com/Articles/art23-1.gif" />

<img src="http://image.slidesharecdn.com/imagesofcomplexity-150420110543-conversion-gate01/95/images-of-complexity-9-638.jpg?cb=1429528305" width=600 />

<img src="figures/ap.png" width=600 />

# 1. Introduction
* 1.1. General assumptions and problem statement
* 1.2. Motor primitive policies

## 1.1. General assumptions and problem statement

Most robotics domains require the state space and the action space to be continuous and high-dimensional, so learning methods based on discretization are not applicable to higher-dimensional systems. However, as the policy is usually implemented on a digital computer, we assume that we can model the control system in discrete time, and we denote the current time step by $k$. To account for possible stochasticity of the plant, we describe it by a probability distribution

<img src="figures/cap1.png" width=600 />

where $u_k \in \mathbb{R}^M$ denotes the current action, and $x_k, x_{k+1} \in \mathbb{R}^N$ denote the current and the next state, respectively. We furthermore assume that actions are generated by a policy

<img src="figures/cap2.png" width=600 />

which is modeled as a probability distribution in order to incorporate exploratory actions; for some special problems, the optimal solution to a control problem is actually a stochastic controller.

The sequence of states and actions forms a trajectory (also called a history or roll-out) denoted by $\tau = [x_{0:H}, u_{0:H}]$, where $H$ denotes the horizon, which can be infinite.

At each instant of time, the learning system receives a reward denoted by $r(x_k, u_k) \in \mathbb{R}$.
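The following is a minimal sketch of how one roll-out could be sampled under these assumptions. The one-dimensional plant, the quadratic reward, the linear-Gaussian policy with gain `theta`, and the function name `rollout` are all hypothetical choices made for illustration; none of them come from the paper.

```python
import numpy as np

def rollout(theta, sigma, horizon, rng):
    """Sample one trajectory tau = [x_{0:H}, u_{0:H}] from a toy 1-D system."""
    x = 1.0  # initial state x_0 (arbitrary illustrative choice)
    states, actions, rewards = [], [], []
    for k in range(horizon + 1):
        # Stochastic policy (assumed form): u_k ~ N(theta * x_k, sigma^2)
        u = theta * x + sigma * rng.standard_normal()
        # Reward r(x_k, u_k) (assumed form): penalize state and action magnitude
        r = -(x ** 2) - 0.1 * (u ** 2)
        states.append(x)
        actions.append(u)
        rewards.append(r)
        # Stochastic plant (assumed form): x_{k+1} ~ p(x_{k+1} | x_k, u_k)
        x = x + 0.1 * u + 0.01 * rng.standard_normal()
    return np.array(states), np.array(actions), np.array(rewards)

rng = np.random.default_rng(0)
xs, us, rs = rollout(theta=-1.0, sigma=0.1, horizon=50, rng=rng)
print(xs.shape, us.shape, rs.shape)
```

Each call returns the states, actions, and rewards of a single trajectory with `horizon + 1` time steps.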

The general goal of policy gradient reinforcement learning is to optimize the policy parameters $\theta \in \mathbb{R}^K$ so that the expected return

<img src="figures/cap3.png" width=600 />

is optimized, where $a_k$ denote time-step-dependent weighting factors and $a_\Sigma$ is a normalization factor that ensures the normalized weights $a_k / a_\Sigma$ sum to one.
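As a rough illustration of the weighting, the sketch below computes the normalized return of a single trajectory for two common weight choices: discounted weights $a_k = \gamma^k$ and uniform weights $a_k = 1$. It assumes a finite horizon and takes $a_\Sigma$ to be the sum of the weights actually used; the function name `normalized_return` and the reward values are made up for the example. The expected return would be the average of this quantity over many sampled trajectories.

```python
import numpy as np

def normalized_return(rewards, weights):
    """Return (1 / a_sigma) * sum_k a_k * r_k for one trajectory."""
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(weights, dtype=float)
    a_sigma = weights.sum()  # assumed normalization: weights / a_sigma sum to one
    return float(np.dot(weights, rewards) / a_sigma)

rewards = np.array([1.0, 0.0, 2.0, 1.0])  # illustrative rewards r_0..r_3
H = len(rewards)
gamma = 0.9

# Discounted weighting: a_k = gamma^k
discounted = normalized_return(rewards, gamma ** np.arange(H))
# Average-reward weighting: a_k = 1
average = normalized_return(rewards, np.ones(H))
print(discounted, average)
```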

We can rewrite the normalized expected return in the form

<img src="figures/cap4.png" width=600 />

<img src="figures/cap5.png" width=600 />

## 1.2. Motor primitive policies

#### References
* Cartesian Trajectory Planning - http://www.slideshare.net/DamianGordon1/cartesian-trajectory-planning
* Introduction to Computer Vision and Robotics: Motion Generation - http://www.dpi.physik.uni-goettingen.de/cns/uploads/downloads/lecture_coputervision_and_robotics/slides/Chapter_08_Movement_generation_2014.ppt

In this section, we first discuss how motor plans can be represented and then how we can bring these into the standard reinforcement learning framework. For this purpose, we consider two forms of motor plans, i.e.,
* (1) spline-based trajectory plans and 
* (2) nonlinear dynamic motor primitives introduced in Ijspeert et al. (2002).

#### spline-based trajectory plans

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e2/Cubic_splines_three_points.svg/600px-Cubic_splines_three_points.svg.png" width=600 />

A desired trajectory is represented as connected pieces of simple polynomials, e.g., for third-order splines, we have

<img src="figures/cap6.png" width=600 />

A given tracking controller, e.g., a PD control law or an inverse dynamics controller, ensures that the trajectory is realized accurately. Thus, a desired movement is parameterized by its spline nodes and the duration of each spline node.

These parameters can be learned from fitting a given trajectory with a spline approximation algorithm (Wada & Kawato, 1994), or by means of optimization or reinforcement learning (Miyamoto et al., 1996). We call such parameterized movement plans motor primitives.
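To make the spline parameterization concrete, here is a minimal sketch of one third-order segment between two spline nodes. It assumes each node specifies a position and a velocity and that the segment has a given duration; the helper names `cubic_coefficients` and `evaluate` are illustrative and not taken from the paper.

```python
import numpy as np

def cubic_coefficients(q0, v0, q1, v1, duration):
    """Coefficients theta_0..theta_3 of q(t) = sum_i theta_i * t^i on [0, duration],
    matching position and velocity at both spline nodes."""
    T = duration
    theta0 = q0
    theta1 = v0
    theta2 = (3.0 * (q1 - q0) - (2.0 * v0 + v1) * T) / T ** 2
    theta3 = (-2.0 * (q1 - q0) + (v0 + v1) * T) / T ** 3
    return np.array([theta0, theta1, theta2, theta3])

def evaluate(theta, t):
    """Desired position, velocity, and acceleration at time t."""
    q = theta[0] + theta[1] * t + theta[2] * t ** 2 + theta[3] * t ** 3
    qd = theta[1] + 2.0 * theta[2] * t + 3.0 * theta[3] * t ** 2
    qdd = 2.0 * theta[2] + 6.0 * theta[3] * t
    return q, qd, qdd

# One segment from rest at q = 0 to rest at q = 1 in 2 seconds
theta = cubic_coefficients(q0=0.0, v0=0.0, q1=1.0, v1=0.0, duration=2.0)
q_end, qd_end, qdd_end = evaluate(theta, 2.0)
print(theta, q_end, qd_end)
```

The desired position, velocity, and acceleration returned by `evaluate` are what a tracking controller such as a PD law would be asked to follow.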

#### nonlinear dynamic motor primitives

For nonlinear dynamic motor primitives, we use the approach developed in Ijspeert et al. (2002). These dynamic motor primitives can be seen as a type of central pattern generator that is particularly well suited for learning, as it is linear in its parameters and invariant under rescaling.

<img src="figures/at.png" width=600 />

In this approach, movement plans $(q_d, \dot{q}_d)$ for each degree of freedom (DOF) of the robot are represented in terms of the time evolution of the nonlinear dynamical systems

<img src="figures/cap7.png" width=600 />

where $(q_d, \dot{q}_d)$ denote the desired position and velocity of a joint, $z$ the internal state of the dynamic system, which evolves in accordance with a canonical system $\ddot{z} = f_c(z, \tau)$, $g$ the goal (or point attractor) state of each DOF, $\tau$ the movement duration shared by all DOFs, and $\theta$ the open parameters of the function $f$.
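The sketch below integrates one degree of freedom of a dynamic motor primitive to show the point-attractor behavior and the linear dependence on the parameters. It is only an illustration: it uses a widely used simplified discrete form with a first-order canonical (phase) system, which is an assumption and differs from the second-order canonical system mentioned above. The gains, the Gaussian basis layout, and the name `integrate_dmp` are likewise illustrative choices.

```python
import numpy as np

def integrate_dmp(weights, q0, goal, tau=1.0, dt=0.001, alpha=25.0, alpha_x=8.0):
    """Euler-integrate a simplified 1-DOF dynamic motor primitive.

    Assumed form (not the paper's exact equations):
        tau * v_dot = alpha * (beta * (goal - q) - v) + f(x)
        tau * q_dot = v
        tau * x_dot = -alpha_x * x        (first-order phase variable)
    where f(x) is linear in `weights`.
    """
    weights = np.asarray(weights, dtype=float)
    beta = alpha / 4.0                      # critically damped spring
    n = len(weights)
    centers = np.exp(-alpha_x * np.linspace(0.0, 1.0, n))  # basis centers in phase
    widths = n ** 1.5 / centers             # heuristic basis widths
    q, v, x = q0, 0.0, 1.0
    trajectory = [q]
    for _ in range(int(2.0 * tau / dt)):
        psi = np.exp(-widths * (x - centers) ** 2)
        # Forcing term: linear in the weights, vanishes as the phase x decays
        forcing = x * (goal - q0) * np.dot(psi, weights) / (psi.sum() + 1e-10)
        v += dt * (alpha * (beta * (goal - q) - v) + forcing) / tau
        q += dt * v / tau
        x += dt * (-alpha_x * x) / tau
        trajectory.append(q)
    return np.array(trajectory)

plain = integrate_dmp(np.zeros(10), q0=0.0, goal=1.0)       # pure point attractor
shaped = integrate_dmp(np.full(10, 50.0), q0=0.0, goal=1.0)  # shaped by the weights
print(plain[-1], shaped[-1])
```

Because the forcing term fades with the phase variable, both trajectories end near the goal; the weights only change the shape of the path taken to get there.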

The systems in Eqs. (8) and (9) describe point-to-point movements, so such tasks are well suited to the introduced episodic reinforcement learning methods. In both systems, we have access to at least second derivatives in time, i.e., desired accelerations, which are needed for model-based feedforward controllers. In order to make the reinforcement framework feasible for learning with motor primitives,

<img src="figures/cap8.png" width=600 />

# 2. Policy gradient approaches for parameterized motor primitives
* 2.1. Finite-difference methods
* 2.2. Likelihood ratio methods and REINFORCE

<img src="figures/cap9.png" width=600 />

<img src="figures/cap10.png" width=600 />

## 2.1. Finite-difference methods

<img src="figures/cap11.png" width=600 />

<img src="figures/cap12.png" width=600 />
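As a rough companion to this section, the sketch below shows one common form of the finite-difference estimator: perturb the parameters several times, record the change in return, and regress the return changes on the perturbations by least squares. The toy objective `J`, the perturbation scale, and the function name `finite_difference_gradient` are assumptions made for the example.

```python
import numpy as np

def finite_difference_gradient(J, theta, n_perturbations=50, scale=0.05, rng=None):
    """Estimate grad J(theta) by least-squares regression of return changes
    on random parameter perturbations."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta, dtype=float)
    # Each row is one perturbation delta_theta_i
    d_theta = scale * rng.standard_normal((n_perturbations, theta.size))
    J_ref = J(theta)
    # Change in return for each perturbed parameter vector
    d_J = np.array([J(theta + d) - J_ref for d in d_theta])
    # Solve d_theta @ grad ~= d_J in the least-squares sense
    grad, *_ = np.linalg.lstsq(d_theta, d_J, rcond=None)
    return grad

def J(theta):
    """Toy deterministic 'return' with its maximum at (1, -2)."""
    return -np.sum((theta - np.array([1.0, -2.0])) ** 2)

g = finite_difference_gradient(J, np.zeros(2), rng=np.random.default_rng(0))
print(g)  # should be close to the true gradient (2, -4) at theta = 0
```

In a real setting `J` would be a noisy roll-out return, so more perturbations or repeated evaluations would likely be needed.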

## 2.2. Likelihood ratio methods and REINFORCE

<img src="figures/cap13.png" width=600 />

<img src="figures/cap15.png" width=600 />

<img src="figures/cap16.png" width=600 />

<img src="figures/cap14.png" width=600 />
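The likelihood-ratio idea can be sketched as follows: the gradient estimate is the average over roll-outs of the summed score $\sum_k \nabla_\theta \log \pi_\theta(u_k | x_k)$ multiplied by the return minus a baseline. The example uses a hypothetical one-step problem with a Gaussian policy $u \sim \mathcal{N}(\theta, \sigma^2)$ and return $-(u - 3)^2$, for which the true gradient at $\theta = 0$ is 6. The problem, the baseline choice (the sample mean of the returns), and the function name `reinforce_gradient` are illustrative assumptions.

```python
import numpy as np

def reinforce_gradient(score_sums, returns, baseline=0.0):
    """Likelihood-ratio estimate: mean over roll-outs of
    (sum_k grad log pi(u_k | x_k)) * (R(tau) - baseline)."""
    score_sums = np.asarray(score_sums, dtype=float)  # shape (n_rollouts, K)
    returns = np.asarray(returns, dtype=float)        # shape (n_rollouts,)
    return np.mean(score_sums * (returns - baseline)[:, None], axis=0)

rng = np.random.default_rng(0)
theta, sigma, n_rollouts = 0.0, 0.5, 20000

# One-step roll-outs: sample actions from the Gaussian policy
u = theta + sigma * rng.standard_normal(n_rollouts)
returns = -(u - 3.0) ** 2
# Score of a Gaussian policy with respect to its mean: (u - theta) / sigma^2
scores = ((u - theta) / sigma ** 2)[:, None]

grad = reinforce_gradient(scores, returns, baseline=returns.mean())
print(grad)  # a noisy estimate of the true gradient, which is 6 here
```

Subtracting a baseline leaves the expected estimate essentially unchanged but typically reduces its variance, which is why it appears in the estimator.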


# 3. ‘Vanilla’ policy gradient approaches
* 3.1. Policy gradient theorem and G(PO)MDP
* 3.2. Optimal baselines
* 3.3. Compatible function approximation

<img src="figures/cap17.png" width=600 />

## 3.1. Policy gradient theorem and G(PO)MDP

<img src="figures/cap18.png" width=600 />

<img src="figures/cap19.png" width=600 />

<img src="figures/cap20.png" width=600 />

<img src="figures/cap21.png" width=600 />

<img src="figures/cap22.png" width=600 />


## 3.2. Optimal baselines

<img src="figures/cap23.png" width=600 />

## 3.3. Compatible function approximation

<img src="figures/cap24.png" width=600 />

# 4. Natural Actor-Critic
* 4.1. Motivation
* 4.2. Connection to the compatible function approximation
* 4.3. Natural actor-critic algorithms

<img src="figures/cap25.png" width=600 />

## 4.1. Motivation

<img src="figures/cap26.png" width=600 />

<img src="figures/cap27.png" width=600 />

## 4.2. Connection to the compatible function approximation

<img src="figures/cap28.png" width=600 />

## 4.3. Natural actor-critic algorithms
* 4.3.1. Episodic natural actor-critic
* 4.3.2. Episodic natural actor-critic with a time-variant baseline

<img src="figures/cap29.png" width=600 />

### 4.3.1. Episodic natural actor-critic

<img src="figures/cap30.png" width=600 />

<img src="figures/cap31.png" width=600 />

<img src="figures/cap32.png" width=600 />

<img src="figures/cap33.png" width=600 />

### 4.3.2. Episodic natural actor-critic with a time-variant baseline

<img src="figures/cap34.png" width=600 />

# 5. Empirical evaluations
* 5.1. Comparing policy gradient methods on motor primitives
* 5.2. Robot application: Motor primitive learning for baseball

## 5.1. Comparing policy gradient methods on motor primitives

<img src="figures/cap35.png" width=600 />

<img src="figures/cap36.png" width=600 />

<img src="figures/cap37.png" width=600 />

## 5.2. Robot application: Motor primitive learning for baseball


<img src="figures/cap38.png" width=600 />
<img src="figures/cap39.png" width=600 />
<img src="figures/cap40.png" width=600 />
<img src="figures/cap41.png" width=600 />
<img src="figures/cap42.png" width=600 />
<img src="figures/cap43.png" width=600 />
<img src="figures/cap44.png" width=600 />

# 6. Conclusion & discussion

# References

* [1] Reinforcement learning of motor skills with policy gradients - http://www.keck.ucsf.edu/~houde/sensorimotor_jc/possible_papers/JPeters08a.pdf
* [2] Lecture 7: Policy Gradient Methods - http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching_files/pg.pdf
* [3] Learning Attractor Landscapes for Learning Motor Primitives - https://papers.nips.cc/paper/2140-learning-attractor-landscapes-for-learning-motor-primitives.pdf
* [4] Reinforcement Learning for Motor Control - http://www.igi.tugraz.at/pfeiffer/documents/MotorControlAndRL.ppt