### Key Concepts 

$s$   state: complete description of the state of the world 

$o$   observation: partial or full observed state 

$a$   action: drawn from an action space that can be discrete or continuous 


$\tau$ trajectory: sequence of states and actions visited, $\tau = (s_0, a_0, s_1, a_1, \ldots)$

The first state $s_0$ is sampled from the start-state distribution.

State transitions are Markov: $s_{t+1}$ depends only on the current state $s_t$ and the most recent action $a_t$.
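
The two rules above (sample the first state, then step with Markov transitions) can be sketched as a rollout loop. This is a minimal illustration on a made-up two-state problem: the names `START_DIST`, `TRANSITIONS`, `rollout`, and `uniform_policy`, and all the probabilities, are assumptions for the example and do not come from any library.

```python
import random

# Hypothetical two-state problem, used only for illustration.
START_DIST = {"A": 0.5, "B": 0.5}
# TRANSITIONS[state][action] is a distribution over next states.
TRANSITIONS = {
    "A": {"stay": {"A": 0.9, "B": 0.1}, "move": {"A": 0.2, "B": 0.8}},
    "B": {"stay": {"B": 0.9, "A": 0.1}, "move": {"B": 0.2, "A": 0.8}},
}


def sample(dist, rng):
    """Draw one key from a {outcome: probability} dict."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def uniform_policy(state, rng):
    """Stochastic policy that ignores the state and picks an action uniformly."""
    return rng.choice(["stay", "move"])


def rollout(policy, horizon, rng):
    """Return a trajectory [(s_0, a_0), ..., (s_{T-1}, a_{T-1})]."""
    state = sample(START_DIST, rng)  # s_0 from the start-state distribution
    trajectory = []
    for _ in range(horizon):
        action = policy(state, rng)
        trajectory.append((state, action))
        # Markov step: the next state depends only on (state, action).
        state = sample(TRANSITIONS[state][action], rng)
    return trajectory


rng = random.Random(0)
tau = rollout(uniform_policy, horizon=5, rng=rng)
```
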


**Policy**: rule to determine what action to take at $t$ given state at $t$

$a_t = \mu(s_t)$ deterministic policy

$a_t \sim \pi(\cdot|s_t)$ stochastic policy 

Policies can be parameterized, typically by $\theta$: $\mu_{\theta}(s_t)$ or $\pi_{\theta}(\cdot|s_t)$ 

Stochastic policies: sample actions from policy, calculate log likelihood of actions 

* Categorical distribution for discrete action spaces 
* Diagonal Gaussian for continuous action spaces (multivariate Gaussian whose covariance matrix has nonzero entries only along the diagonal) 
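
Both operations, sampling an action and computing its log likelihood, are short to write out for each distribution. The sketch below uses only the standard library. The function names are made up for the example, and parameterizing the Gaussian by a per-dimension log standard deviation is an assumption (a common choice, but not the only one).

```python
import math
import random


def categorical_sample(probs, rng):
    """Sample an action index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def categorical_log_prob(probs, action):
    """Log likelihood of a discrete action."""
    return math.log(probs[action])


def diag_gaussian_sample(mean, log_std, rng):
    """Sample each action dimension independently: a_i = mu_i + sigma_i * z_i."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mean, log_std)]


def diag_gaussian_log_prob(mean, log_std, action):
    """Log likelihood under a diagonal Gaussian: a sum of 1-D Gaussian log densities."""
    total = 0.0
    for m, ls, a in zip(mean, log_std, action):
        z = (a - m) / math.exp(ls)
        total += -0.5 * z * z - ls - 0.5 * math.log(2.0 * math.pi)
    return total


rng = random.Random(0)

probs = [0.2, 0.5, 0.3]
a_discrete = categorical_sample(probs, rng)
logp_discrete = categorical_log_prob(probs, a_discrete)

mean, log_std = [0.0, 1.0], [-0.5, -0.5]
a_continuous = diag_gaussian_sample(mean, log_std, rng)
logp_continuous = diag_gaussian_log_prob(mean, log_std, a_continuous)
```

Because the covariance is diagonal, the joint log likelihood is just the sum over action dimensions, which is why no matrix inverse or determinant appears.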


$r_t = R(s_t, a_t, s_{t+1})$  reward at time $t$, given by the reward function $R$

$R(\tau)$  return: sum of rewards collected along a trajectory; can be discounted at rate $\gamma$ to help convergence (an infinite sum of bounded, discounted rewards stays finite)
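
As a small worked example, the return of a finite trajectory can be computed by accumulating rewards from the end backwards. The helper names `discounted_return` and `rewards_to_go` are made up for this sketch, and time is indexed from 0 here, so the first reward is weighted by $\gamma^0 = 1$.

```python
def discounted_return(rewards, gamma=1.0):
    """R(tau) = sum_t gamma^t * r_t, with t starting at 0."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def rewards_to_go(rewards, gamma=1.0):
    """Discounted return from each time step to the end of the trajectory."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


rewards = [1.0, 1.0, 1.0]
undiscounted = discounted_return(rewards)          # 1 + 1 + 1 = 3.0
discounted = discounted_return(rewards, gamma=0.5)  # 1 + 0.5 + 0.25 = 1.75
```
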


**Goal: Select policy which maximizes expected return** 


### Policy Gradient Theorem  / RL with Function Approximation

[Link to paper](https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf) 

Optimizes for long term rewards 

The objective can be formulated in two ways: 

$\rho$ := performance of the corresponding policy $\pi$

**1.** Long term average reward per step (averaged across starting states $s_0$)

$\rho(\pi) = \lim_{n \rightarrow \infty} \frac{1}{n} \mathbb{E}[r_1 + r_2 + \cdots + r_n | \pi]$

Under the average-reward formulation, the value of a state-action pair $(s,a)$ is:

$Q^{\pi}(s,a) = \sum_{t=1}^{\infty} \mathbb{E}[r_t - \rho(\pi) | s_0=s, a_0=a, \pi] \quad \forall s \in S, a \in A$


**2.** Designate a specific start state and optimize for long term performance (discounted return) starting at that state

$\rho(\pi) = \mathbb{E}[\sum_{t=1}^{\infty} \gamma^{t-1}r_t | s_0, \pi]$


The value of a state-action pair is then written as:

$Q^{\pi}(s,a) = \mathbb{E}[\sum_{k=1}^{\infty} \gamma^{k-1}r_{t+k} | s_t = s, a_t = a, \pi]$

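**Theorem 1 (policy gradient): under either formulation, the gradient of $\rho$ with respect to the policy parameters $\theta$ is**

$\frac{\partial{\rho}}{\partial \theta} = \sum_s d^{\pi}(s) \sum_a \frac{\partial \pi(s,a)}{ \partial \theta}Q^{\pi}(s,a)$

where $d^{\pi}(s)$ is the stationary state distribution under $\pi$ (average-reward formulation) or the discounted weighting of states encountered when starting from $s_0$ (start-state formulation).

Note that we do not know the true state-action value function $Q^{\pi}$, so it has to be estimated.

**Theorem 2: $Q^{\pi}$ can be replaced by a learned approximation $f_w$ without changing the gradient**, provided $f_w$ is compatible with the policy parameterization and has converged to a local minimum of its squared error against $Q^{\pi}$.

Substituting $f_w$ for $Q^{\pi}$ in the derivative: 

$\frac{\partial{\rho}}{\partial \theta} = \sum_s d^{\pi}(s) \sum_a \frac{\partial \pi(s,a)}{ \partial \theta}f_w(s,a)$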
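
The gradient expression above can be sanity-checked numerically in the simplest possible setting. The sketch below assumes a one-state problem with a softmax policy, so the state weighting is 1 and performance is $\rho(\theta) = \sum_a \pi_\theta(a)\, q(a)$. The action values `q` are made-up numbers that stand in for $f_w$, and all function names are assumptions for the example.

```python
import math


def softmax(theta):
    """Softmax policy over actions, one logit per action."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def performance(theta, q):
    """rho(theta) = sum_a pi(a) * q(a) for a one-state problem."""
    return sum(p * qa for p, qa in zip(softmax(theta), q))


def policy_gradient(theta, q):
    """sum_a dpi(a)/dtheta_k * q(a), with dpi(a)/dtheta_k = pi(a) * (1[a == k] - pi(k))."""
    pi = softmax(theta)
    n = len(theta)
    return [
        sum(pi[a] * ((1.0 if a == k else 0.0) - pi[k]) * q[a] for a in range(n))
        for k in range(n)
    ]


def finite_difference(theta, q, eps=1e-6):
    """Central-difference estimate of d rho / d theta, used as a reference."""
    grad = []
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += eps
        down[k] -= eps
        grad.append((performance(up, q) - performance(down, q)) / (2.0 * eps))
    return grad


theta = [0.1, -0.3, 0.5]
q = [1.0, 0.0, 2.0]  # hypothetical action values
analytic = policy_gradient(theta, q)
numeric = finite_difference(theta, q)
```

The two gradients should agree to within finite-difference error. Their components also sum to zero, since adding a constant to every logit leaves a softmax policy unchanged.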

The paper gives an example of how to derive $f_w$ when the policy $\pi$ is a Gibbs (softmax) distribution. 
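
For reference, the Gibbs example from the paper can be restated as follows, using the paper's notation $\phi_{sa}$ for the feature vector of a state-action pair. The policy is a Gibbs distribution in a linear combination of features:

$\pi(s,a) = \frac{e^{\theta^T \phi_{sa}}}{\sum_b e^{\theta^T \phi_{sb}}}$

The compatibility condition requires the gradient of $f_w$ to match the gradient of $\log \pi$:

$\frac{\partial f_w(s,a)}{\partial w} = \frac{\partial \pi(s,a)}{\partial \theta} \frac{1}{\pi(s,a)} = \phi_{sa} - \sum_b \pi(s,b) \phi_{sb}$

so the natural choice is

$f_w(s,a) = w^T \left[ \phi_{sa} - \sum_b \pi(s,b) \phi_{sb} \right]$

$f_w$ is linear in the same features as the policy, but normalized so that $\sum_a \pi(s,a) f_w(s,a) = 0$ in every state. For that reason it is better read as an approximation of the advantage function than of $Q^{\pi}$ itself.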



### Vanilla Policy Gradient

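```python
from spinup import vpg_pytorch as vpg
import torch
import gym

# Environment factory: vpg calls this to build the environment
env_fn = lambda: gym.make('CartPole-v0')
# Write logs and saved models to ./outputs
logger_kwargs = dict(output_dir='outputs', exp_name='vpg_test')
vpg(env_fn, logger_kwargs=logger_kwargs)
```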


Logging data to outputs/progress.txt
Saving config:

{
    "ac_kwargs":	{},
    "actor_critic":	"MLPActorCritic",
    "env_fn":	"<function <lambda> at 0x7f7b98977488>",
    "epochs":	50,
    "exp_name":	"vpg_test",
    "gamma":	0.99,
    "lam":	0.97,
    "logger":	{
        "<spinup.utils.logx.EpochLogger object at 0x7f7beb3540f0>":	{
            "epoch_dict":	{},
            "exp_name":	"vpg_test",
            "first_row":	true,
            "log_current_row":	{},
            "log_headers":	[],
            "output_dir":	"outputs",
            "output_file":	{
                "<_io.TextIOWrapper name='outputs/progress.txt' mode='w' encoding='UTF-8'>":	{
                    "mode":	"w"
                }
            }
        }
    },
    "logger_kwargs":	{
        "exp_name":	"vpg_test",
        "output_dir":	"outputs"
    },
    "max_ep_len":	1000,
    "pi_lr":	0.0003,
    "save_freq":	10,
    "seed":	0,
    "steps_per_epoch":	4000,
    "train_v_iters":	80,
    "vf



---------------------------------------
|             Epoch |               0 |
|      AverageEpRet |              21 |
|          StdEpRet |            10.1 |
|          MaxEpRet |              59 |
|          MinEpRet |               9 |
|             EpLen |              21 |
|      AverageVVals |           -0.26 |
|          StdVVals |          0.0665 |
|          MaxVVals |         -0.0234 |
|          MinVVals |           -0.43 |
| TotalEnvInteracts |           4e+03 |
|            LossPi |         0.00527 |
|             LossV |             226 |
|       DeltaLossPi |               0 |
|        DeltaLossV |            -138 |
|           Entropy |           0.689 |
|                KL |       -2.98e-11 |
|              Time |            1.14 |
---------------------------------------
---------------------------------------
|             Epoch |               1 |
|      AverageEpRet |            22.2 |
|          StdEpRet |            11.2 |
|          MaxEpRet |              76 |


---------------------------------------
|             Epoch |              10 |
|      AverageEpRet |            26.1 |
|          StdEpRet |            15.8 |
|          MaxEpRet |             135 |
|          MinEpRet |              10 |
|             EpLen |            26.1 |
|      AverageVVals |            14.5 |
|          StdVVals |            5.41 |
|          MaxVVals |              20 |
|          MinVVals |          -0.795 |
| TotalEnvInteracts |         4.4e+04 |
|            LossPi |         -0.0146 |
|             LossV |             106 |
|       DeltaLossPi |               0 |
|        DeltaLossV |           -9.38 |
|           Entropy |           0.689 |
|                KL |       -7.53e-10 |
|              Time |            12.7 |
---------------------------------------
---------------------------------------
|             Epoch |              11 |
|      AverageEpRet |            25.5 |
|          StdEpRet |            14.5 |
|          MaxEpRet |              80 |


---------------------------------------
|             Epoch |              20 |
|      AverageEpRet |            28.8 |
|          StdEpRet |            14.1 |
|          MaxEpRet |              80 |
|          MinEpRet |               9 |
|             EpLen |            28.8 |
|      AverageVVals |              16 |
|          StdVVals |            7.37 |
|          MaxVVals |            24.1 |
|          MinVVals |          -0.277 |
| TotalEnvInteracts |         8.4e+04 |
|            LossPi |         -0.0297 |
|             LossV |            70.6 |
|       DeltaLossPi |               0 |
|        DeltaLossV |            -1.7 |
|           Entropy |           0.681 |
|                KL |       -6.03e-10 |
|              Time |              24 |
---------------------------------------
---------------------------------------
|             Epoch |              21 |
|      AverageEpRet |            30.3 |
|          StdEpRet |            17.8 |
|          MaxEpRet |              96 |


---------------------------------------
|             Epoch |              30 |
|      AverageEpRet |            34.2 |
|          StdEpRet |            21.2 |
|          MaxEpRet |             116 |
|          MinEpRet |              10 |
|             EpLen |            34.2 |
|      AverageVVals |            20.6 |
|          StdVVals |            8.46 |
|          MaxVVals |            29.8 |
|          MinVVals |          -0.986 |
| TotalEnvInteracts |        1.24e+05 |
|            LossPi |         -0.0434 |
|             LossV |             129 |
|       DeltaLossPi |               0 |
|        DeltaLossV |           -6.33 |
|           Entropy |           0.667 |
|                KL |        1.42e-10 |
|              Time |            35.1 |
---------------------------------------
---------------------------------------
|             Epoch |              31 |
|      AverageEpRet |            39.6 |
|          StdEpRet |            25.8 |
|          MaxEpRet |             169 |


---------------------------------------
|             Epoch |              40 |
|      AverageEpRet |            40.1 |
|          StdEpRet |            22.1 |
|          MaxEpRet |             163 |
|          MinEpRet |              10 |
|             EpLen |            40.1 |
|      AverageVVals |            22.7 |
|          StdVVals |            10.1 |
|          MaxVVals |              34 |
|          MinVVals |           0.599 |
| TotalEnvInteracts |        1.64e+05 |
|            LossPi |         -0.0467 |
|             LossV |             126 |
|       DeltaLossPi |               0 |
|        DeltaLossV |           -5.68 |
|           Entropy |           0.656 |
|                KL |       -3.65e-10 |
|              Time |              46 |
---------------------------------------
---------------------------------------
|             Epoch |              41 |
|      AverageEpRet |            44.6 |
|          StdEpRet |            23.7 |
|          MaxEpRet |             146 |


<img src="outputs/img/results_plot.png">

<img src="outputs/img/cartpole_test.gif">