# Goal

This notebook reviews key points and formulas from [berkeley-CS294-fa17](http://rail.eecs.berkeley.edu/deeprlcourse-fa17/), along with some key points from related papers.

# 1. Basic concepts

## 1.1 Decision Sequence and Stochastic Environment 

**Decision Sequence** - Reinforcement learning studies a sequence of interactions between an autonomous agent and its environment. In each cycle, the agent receives a new observation signal from the environment, processes it, and returns an action signal to the environment.

**Stochastic Environment** - In the traditional reinforcement learning model, the transition of the environment from one state to the next is not deterministic. The same pair of current state and action can lead to several different next states.

## 1.2 Trajectory

In a stochastic environment, the probability of a trajectory can be calculated as below:

$p_{\theta}(s_{1}, a_{1}, \dots, s_{T}, a_{T}) = p(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t|s_t)p(s_{t+1}|s_t, a_t) $

**Notation**

1. trajectory - a sequence of observed environment states $s_t$ and actions $a_t$
2. $s_t$ - the state of the environment at time $t$
3. $a_t$ - the action taken at time $t$
4. $p(s_1)$ - the probability of the initial state
5. $p(s_{t+1}|s_t, a_t)$ - the probability of the next state $s_{t+1}$ given the current state $s_t$ and action $a_t$
6. $\pi_{\theta}(a_t|s_t)$ - the policy: the probability of taking action $a_t$ in state $s_t$
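
As a rough illustration of the trajectory probability above, here is a minimal sketch for a toy tabular problem. The two-state, two-action model and all of its numbers are hypothetical, and the final transition out of $s_T$ is left out because the trajectory is taken to end at $a_T$.

```python
import numpy as np

# Hypothetical toy model: 2 states, 2 actions (all numbers are made up).
p_s1 = np.array([0.6, 0.4])                      # p(s_1)
policy = np.array([[0.7, 0.3],                   # pi(a | s=0)
                   [0.2, 0.8]])                  # pi(a | s=1)
# transition[s, a, s_next] = p(s_next | s, a)
transition = np.array([[[0.9, 0.1], [0.5, 0.5]],
                       [[0.3, 0.7], [0.0, 1.0]]])

def trajectory_probability(states, actions):
    """Probability of (s_1, a_1, ..., s_T, a_T) under the toy model."""
    prob = p_s1[states[0]]
    for t, (s, a) in enumerate(zip(states, actions)):
        prob *= policy[s, a]                     # pi(a_t | s_t)
        if t + 1 < len(states):                  # skip the transition after a_T
            prob *= transition[s, a, states[t + 1]]
    return prob

prob = trajectory_probability(states=[0, 1, 1], actions=[1, 0, 1])
```

Each factor in the loop corresponds to one term of the product in the formula, so a long trajectory quickly gets a very small probability.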

## 1.3 Markovian Environment
In a Markovian environment, the probability of the next state depends only on the current state and action.

However, in reality, we often need to look back into the history to figure out the actual state of the environment.

## 1.4 Goal function

In RL, the goal function describes the fundamental problem we try to solve: find the policy (model) that produces trajectories with the maximum expected reward.

$ \theta^* = \underset{\theta}{\operatorname{argmax}} E_{\tau \sim p_\theta(\tau)}[ \sum_{t}^{} r(s_t, a_t)] $

**Notation**

1. $\theta^*$ - the best parameters of the agent's model, i.e. the parameters of the best policy
2. $E_{\tau \sim p_\theta(\tau)}$ - the expectation over trajectories $\tau$ drawn from the trajectory distribution $p_\theta(\tau)$ induced by the policy with parameters $\theta$
3. $\sum_{t}^{} r(s_t, a_t)$ - the reward accumulated along trajectory $\tau$

## 1.5  Q-function

$Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi_\theta} [r(s_{t'}, a_{t'})|s_t, a_t] $

The $Q$ function represents the total expected reward (up to step $T$) obtained by taking action $a_t$ in state $s_t$ and following the policy afterwards.

&nbsp;


**Text Explanation:**

The Q-function tells you how good a particular action taken in the current state is, in terms of future reward.

## 1.6  Value function

$V^\pi(s_t) = \sum_{t'=t}^{T} E_{\pi_\theta} [r(s_{t'}, a_{t'})|s_t] $

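The $V$ function represents the total expected reward (up to step $T$) starting from state $s_t$. It is the expectation of $Q^\pi(s_t, a_t)$ over actions drawn from the policy, $V^\pi(s_t) = E_{a_t \sim \pi_\theta(a_t|s_t)}[Q^\pi(s_t, a_t)]$, which reduces to a plain average of the $Q$ values when the policy is uniform over the action space.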

&nbsp;

**Text Explanation:**

The value function tells you how good your current state is (on average) in terms of future reward.

If $Q^\pi(s_t, a_t) > V^\pi(s_t)$, then $a_t$ is better than the average action.
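
The relation between $Q$ and $V$ can be checked on a tiny example. This is only a sketch: the Q-values and policy probabilities below are hypothetical numbers for a single state, and $V$ is computed as the policy-weighted average of $Q$.

```python
import numpy as np

# Hypothetical Q-values for one state with three actions.
q_values = np.array([1.0, 2.0, 4.0])
# Hypothetical policy probabilities pi(a | s) for the same state.
pi = np.array([0.5, 0.25, 0.25])

v_value = float(np.dot(pi, q_values))   # V(s) = E_{a ~ pi}[Q(s, a)]
advantage = q_values - v_value          # positive entries are better than average
better_than_average = advantage > 0
```

Here only the third action has $Q > V$, so it is the only action that is better than what the policy achieves on average from this state.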

## 1.7 Tradeoffs

**Off-Policy vs On-Policy**
1. Off-Policy: The algorithm can improve the current policy using samples that were not generated by that policy. It does not require the policy to be rerun for every single update.
2. On-Policy: The algorithm requires a fresh rollout of the new policy to get its next update.

# 2. Supervised Learning

Supervised learning is used here as a batch learning approach to reinforcement learning: the agent tries to learn the human policy by training on human expert data.

## 2.1 DAgger

**Algorithm**

1. Train the model $\pi(a_t|o_t)$ on the expert trajectory $D = \{o_1,a_1,...,o_n,a_n\}$
2. Run $\pi(a_t|o_t)$ to generate a new trajectory $D_\pi = \{o_1,a_1,...,o_n,a_n\}$
3. Ask an expert to relabel the observations in $D_\pi$ with the correct actions $a_t$
4. Aggregate $D \leftarrow D \cup D_\pi$
5. Iterate from step 1 until the policy's actions look good
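
The loop above can be sketched on a toy problem. Everything in this example is a hypothetical stand-in: the 1-D environment, the scripted `expert_action` that plays the role of the human labeler, and the nearest-neighbour `fit_policy` that plays the role of the trained model. It is meant only to show the data flow of the algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_action(obs):
    """Hypothetical expert: push the scalar state back toward zero."""
    return 0 if obs > 0 else 1          # action 0 = step left, 1 = step right

def fit_policy(observations, actions):
    """Hypothetical learner: 1-nearest-neighbour lookup over the dataset."""
    obs_arr = np.asarray(observations, dtype=float)
    act_arr = np.asarray(actions)
    return lambda obs: int(act_arr[np.argmin(np.abs(obs_arr - obs))])

def rollout(policy, steps=20):
    """Run a policy in a toy 1-D environment and return the visited observations."""
    obs, visited = rng.normal(), []
    for _ in range(steps):
        visited.append(obs)
        step = -1.0 if policy(obs) == 0 else 1.0
        obs = obs + 0.5 * step + 0.1 * rng.normal()
    return visited

# Initial expert demonstrations D.
data_obs = rollout(expert_action)
data_act = [expert_action(o) for o in data_obs]

for _ in range(5):
    policy = fit_policy(data_obs, data_act)            # step 1: train on D
    new_obs = rollout(policy)                          # step 2: run the learner
    new_act = [expert_action(o) for o in new_obs]      # step 3: expert relabels
    data_obs += new_obs                                # step 4: aggregate
    data_act += new_act
```

The key point is that the new observations come from the learner's own rollouts while the labels always come from the expert, so the dataset gradually covers the states the learner actually visits.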

**Pros**
1. No need to know the transition dynamics
2. Algorithm is relatively simple

**Cons**
1. Needs tedious human labeling
2. The trained model is biased (toward both the expert's behavior and the training setting)

# 3. Policy Gradient

## 3.1 Goal function

Recall the goal function from 1.4:

$ \theta^* = \underset{\theta}{\operatorname{argmax}} E_{\tau \sim p_\theta(\tau)}[ \sum_{t}^{} r(s_t, a_t)] $

Our goal is to find the policy that maximizes the accumulated reward until the end of the trajectory.

$ J(\theta) = E_{\tau \sim p_\theta(\tau)}[ \sum_{t}^{} r(s_t, a_t)] \approx \frac{1}{N} \sum_{i} \sum_{t} r(s_{i,t},a_{i,t})$

$J(\theta)$ above is the expected accumulated reward we try to maximize. One way to approximate this expectation is Monte Carlo sampling: if we sample $N$ trajectories from the policy with parameters $\theta$, we can average their total rewards to get an approximate expectation.
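
A minimal sketch of this Monte Carlo estimate, assuming a hypothetical toy problem in which the policy picks action 1 with a fixed probability at every step and earns reward 1 each time it does. The exact value of $J$ is then horizon times that probability, which gives something to compare the estimate against.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_return(p_action1, horizon):
    """Hypothetical rollout: reward 1 whenever action 1 is chosen, else 0."""
    actions = rng.random(horizon) < p_action1
    return actions.sum()                 # sum_t r(s_t, a_t) for one trajectory

def estimate_objective(p_action1, horizon=10, num_samples=5000):
    """J(theta) ~ (1/N) * sum_i sum_t r(s_{i,t}, a_{i,t})."""
    returns = [sample_return(p_action1, horizon) for _ in range(num_samples)]
    return float(np.mean(returns))

j_hat = estimate_objective(p_action1=0.3)
exact_j = 10 * 0.3                       # closed form for this toy problem only
```

The estimate gets closer to the exact value as the number of sampled trajectories grows, at the usual $1/\sqrt{N}$ Monte Carlo rate.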

## 3.2 Gradient 

Use the shorthand $r(\tau)$ for the total reward of a trajectory:

$r(\tau) = \sum_{t} r(s_{t},a_{t})$

Now we take the gradient of $J(\theta)$ and get

$ \nabla_{\theta} J(\theta) = \nabla_{\theta} E_{\tau \sim \pi_\theta(\tau)}[ r(\tau)] 
            = \nabla_{\theta} \int \pi_{\theta}(\tau)r(\tau)\,d\tau
            = \int \nabla_{\theta} \pi_{\theta}(\tau)r(\tau)\,d\tau$

Here, $\pi_{\theta}(\tau)$ denotes the probability of trajectory $\tau$ under the policy we care about.


**Notice that**

No state information is shown in the above equation because of our abbreviation $\tau$. In practice, the policy is learned based on state.

Because taking the gradient of the trajectory probability directly can be hard (it is a product of many probabilities), we use a trick to transform the above expression into a log-likelihood form, so that we can work with a sum of log probabilities instead.

Recall $ x\,\nabla \log{x} = x\frac{\nabla x}{x} = \nabla x$

Now take the previous expression for $\nabla_{\theta} J(\theta)$ and replace $\nabla_{\theta} \pi_{\theta} (\tau)$ using our trick.

$ \nabla_{\theta} J(\theta) = \int \nabla_{\theta} \pi_{\theta}(\tau)r(\tau)\,d\tau
            = \int \pi_{\theta}(\tau) \nabla_{\theta} \log \pi_{\theta}(\tau) r(\tau)\,d\tau $

One more step and we are back to our favorite expectation.

$ \nabla_{\theta} J(\theta) = E_{\tau \sim \pi_\theta(\tau)}[ \nabla_{\theta} \log \pi_{\theta}(\tau)r(\tau)] $
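
The log-derivative trick can be checked numerically on a small discrete distribution. This is a sketch under stated assumptions: the three-outcome softmax distribution, the parameters `theta`, and the payoffs `f` (standing in for $r(\tau)$) are all hypothetical. The expectation form of the gradient is compared against a finite-difference gradient of the expected payoff.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

theta = np.array([0.2, -0.5, 1.0])     # hypothetical parameters (logits)
f = np.array([1.0, 3.0, -2.0])         # hypothetical payoff for each outcome

def expected_f(params):
    return float(softmax(params) @ f)

# Score-function form: E_x[ grad log p(x) * f(x) ], computed exactly.
p = softmax(theta)
grad_log_p = np.eye(3) - p             # row i holds grad_theta log p_i
score_grad = (p * f) @ grad_log_p

# Central finite differences on E[f] as an independent check.
eps = 1e-6
fd_grad = np.array([
    (expected_f(theta + eps * e) - expected_f(theta - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
```

Both vectors agree up to finite-difference error, which is the identity $\nabla_\theta E[f] = E[\nabla_\theta \log p \cdot f]$ in miniature.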

## 3.3 Derivative

After our trick in 3.2, $ \nabla_{\theta} J(\theta) $ looks like this:

$ \nabla_{\theta} J(\theta) = E_{\tau \sim \pi_\theta(\tau)}[ \nabla_{\theta} \log \pi_{\theta}(\tau)r(\tau)] $

We know that $\pi_{\theta}(\tau)$ is actually the product of the probabilities involved in the trajectory

$\pi_{\theta}(\tau) = \pi_{\theta}(s_{1}, a_{1}, \dots, s_{T}, a_{T}) 
                    = p(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t|s_t)p(s_{t+1}|s_t, a_t) $

Let's take the logarithm of this long formula:

$\log \pi_{\theta}(\tau)
                    = \log p(s_1) + \sum_{t=1}^{T} \log \pi_{\theta}(a_t | s_t) + \sum_{t=1}^{T} \log p(s_{t+1}|s_t, a_t) $
                    
See what we get here:

1. $\log p(s_1)$ - the log probability of the initial state

2. $\sum_{t=1}^{T} \log \pi_{\theta}(a_t | s_t)$ - the sum of the log probabilities of the chosen action given the state at each step $t$

3. $\sum_{t=1}^{T} \log p(s_{t+1}|s_t, a_t) $ - the sum of the log transition probabilities $p(s_{t+1}|s_t, a_t)$

Neither the initial state distribution nor the transition function depends on $\theta$, so both are constants with respect to $\theta$ and their gradients are 0. Therefore, the only term that survives the gradient is the second one.

$\nabla_{\theta} \log \pi_{\theta}(\tau) = \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t)$

The gradient of the goal function can then be simplified to the expectation of a product of two sums:

$ \nabla_{\theta} J(\theta) = E_{\tau \sim \pi_\theta(\tau)}[ \nabla_{\theta} \log \pi_{\theta}(\tau)r(\tau)] = E_{\tau \sim \pi_\theta(\tau)} \left[\left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t)\right) \left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right]$


&nbsp;

**Notice that** 

In policy gradient, there's no need to know the dynamics. But we are assuming full observability and a strict Markovian condition here, in which case the next timestep is determined solely by a single probability distribution based on the current state and action.

## 3.4 Evaluation

Just like our evaluation of $ J(\theta) $ in 3.1, we can use Monte Carlo to estimate the policy gradient simply by sampling a number of trajectories. All we need to do is sum up the rewards and sum up the gradients of the log probabilities of the actions we have taken.

$ \nabla_{\theta} J(\theta) = E_{\tau \sim \pi_\theta(\tau)} \left[\left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t)\right) \left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right]$

If we sample $N$ trajectories, then the gradient estimate looks like

$ \nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_{i,t} | s_{i,t})\right) \left(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\right) $

**cross entropy + softmax**


Since we try to maximize $J(\theta)$, we do gradient ascent in this case

$\theta = \theta + \alpha \nabla_{\theta} J(\theta)$
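
The sampled gradient and the ascent step can be sketched in NumPy. The setup is hypothetical and deliberately simple: a state-independent softmax policy over three actions, a fixed reward per action, and a short horizon, chosen so that the exact gradient has a closed form to compare against. The learning rate `alpha` is also an arbitrary illustrative value.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

theta = np.array([0.0, 0.5, -0.5])      # hypothetical logits of a state-independent policy
rewards = np.array([1.0, 0.0, 2.0])     # hypothetical reward for each action
horizon, num_traj = 3, 20000

probs = softmax(theta)
actions = rng.choice(3, size=(num_traj, horizon), p=probs)    # a_{i,t}

# For a softmax policy, grad_theta log pi(a) = one_hot(a) - probs.
one_hot = np.eye(3)[actions]                                  # shape (N, T, 3)
sum_grad_log_pi = (one_hot - probs).sum(axis=1)               # shape (N, 3)
total_reward = rewards[actions].sum(axis=1)                   # shape (N,)

# (1/N) sum_i (sum_t grad log pi) * (sum_t r)
grad_estimate = (sum_grad_log_pi * total_reward[:, None]).mean(axis=0)

# Closed-form gradient for this toy problem, used only as a sanity check.
exact_grad = horizon * probs * (rewards - probs @ rewards)

alpha = 0.1                                                   # hypothetical learning rate
theta_updated = theta + alpha * grad_estimate                 # gradient ascent step
```

After the update, the logit of the highest-reward action increases, which is the expected direction. In a real problem the closed form is not available and only the sampled estimate is used.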

### In Practice

When we apply this formula in TensorFlow, the output of the NN is a vector of unnormalized weights, one per action.

Suppose that in each iteration we sample $j$ states (timesteps) in total (the number of sampled trajectories can vary), and that the observation space has dimension $m$.

1. observation - $\left [\left [o_{11},...,o_{1m}\right ],...,\left [o_{j1},...,o_{jm}\right ] \right ]$

After we feed the observations into the model, we get the policy output, which has one column for each of the $k$ actions.

2. policy - $\left [\left [a_{11},...,a_{1k}\right ],...,\left [a_{j1},...,a_{jk}\right ] \right ]$

We apply a softmax to the policy output so that each row $\left [a_{j1},...,a_{jk}\right ]$ sums to 1 and represents the probabilities we want.

3. softmax[ policy ] - $\left [\left [p(a_{11}),...,p(a_{1k})\right ],...,\left [p(a_{j1}),...,p(a_{jk})\right ] \right ]$

Then we multiply the log of the policy distribution by the one-hot encoding of the action actually taken in our sampling to get $\log \pi_{\theta}(a_t | s_t)$

4. $\log \pi_{\theta}(a_t | s_t) = \text{onehot}(a_t) \cdot \log p(a | s_t)$ = - cross_entropy (label = $a_t$, probability = $p(a | s_t )$)  

**Notice**

The negative sign in front of cross_entropy cancels the negative sign embedded in the definition of cross_entropy.
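
The relation between the log probability of the taken action and the cross entropy can be verified with plain NumPy instead of TensorFlow. This is a sketch with hypothetical numbers: a batch of three sampled timesteps, four actions, and made-up network outputs.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log of the softmax, computed in a numerically stable way."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

# Hypothetical batch: j = 3 sampled timesteps, k = 4 actions.
logits = np.array([[1.0, 0.0, -1.0, 0.5],
                   [0.2, 0.2, 0.2, 0.2],
                   [2.0, -2.0, 0.0, 1.0]])
actions_taken = np.array([0, 3, 2])            # a_t for each sampled timestep

one_hot = np.eye(logits.shape[1])[actions_taken]
log_probs = log_softmax(logits)

cross_entropy = -(one_hot * log_probs).sum(axis=1)      # per-timestep cross entropy
log_pi_taken = log_probs[np.arange(3), actions_taken]   # log pi(a_t | s_t)
```

Since `log_pi_taken` equals `-cross_entropy`, minimizing a reward-weighted cross entropy is the same as maximizing the reward-weighted log probability used in the policy gradient.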