# Goal

This notebook reviews key points and formulas from [berkeley-CS294-fa17](http://rail.eecs.berkeley.edu/deeprlcourse-fa17/), along with some key points from other related papers.

# 1. Basic concepts

## 1.1 Decision Sequence and Stochastic Environment 

**Decision Sequence** - Reinforcement learning studies a sequence of interactions between an autonomous agent and its environment. In each cycle, the agent receives a new observation signal from the environment, processes it, and returns an action signal to the environment.

**Stochastic Environment** - In the traditional reinforcement learning model, the transition of the environment from one state to the next is not deterministic. A single pair of current state and action can lead to several possible next states.

## 1.2 Trajectory

In a stochastic environment, the probability of a trajectory can be calculated as below:

$p_{\theta}(s_{1}, a_{1}, \ldots, s_{T}, a_{T}) = p(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t|s_t)p(s_{t+1}|s_t, a_t) $

**Notation**

1. trajectory - a sequence of observed environment states $s_t$ and actions $a_t$
2. $s_t$ - the state of the environment at time $t$
3. $a_t$ - the action taken at time $t$
4. $p(s_1)$ - the probability of the initial state
5. $p(s_{t+1}|s_t, a_t)$ - the probability of the next state $s_{t+1}$ given the previous state $s_t$ and action $a_t$
6. $\pi_{\theta}(a_t|s_t)$ - the policy: the probability of taking action $a_t$ in state $s_t$, parameterized by $\theta$
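To make the product formula concrete, here is a minimal sketch that multiplies out the terms for one trajectory. The two-state MDP, its numbers and the function name `trajectory_probability` are made up for illustration only.

```python
# Hypothetical two-state, two-action MDP used only for illustration.
p_init = {"s0": 1.0, "s1": 0.0}                          # p(s_1)
policy = {"s0": {"left": 0.5, "right": 0.5},             # pi_theta(a_t | s_t)
          "s1": {"left": 0.1, "right": 0.9}}
transition = {("s0", "left"): {"s0": 0.8, "s1": 0.2},    # p(s_{t+1} | s_t, a_t)
              ("s0", "right"): {"s0": 0.3, "s1": 0.7},
              ("s1", "left"): {"s0": 0.6, "s1": 0.4},
              ("s1", "right"): {"s0": 0.0, "s1": 1.0}}


def trajectory_probability(states, actions):
    """states holds one more entry than actions: s_1, ..., s_{T+1}."""
    prob = p_init[states[0]]
    for t, action in enumerate(actions):
        prob *= policy[states[t]][action]                        # policy term
        prob *= transition[(states[t], action)][states[t + 1]]   # dynamics term
    return prob


prob = trajectory_probability(["s0", "s1", "s1"], ["right", "right"])
```

Only the `policy` factors depend on $\theta$; the `p_init` and `transition` factors belong to the environment.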

## 1.3 Markovian Environment
In a Markovian environment, the probability of the next state depends only on the current state and action.

In reality, however, we often need to look back into the history to figure out the actual state of the environment.

## 1.4 Goal function

In RL, the goal function describes the fundamental problem we try to solve - find the policy (model) that produces trajectories with the maximum expected reward.

$ \theta^* = \underset{\theta}{\operatorname{argmax}} E_{\tau \sim p_\theta(\tau)}[ \sum_{t}^{} r(s_t, a_t)] $

**Notation**

1. $\theta^*$ - the best parameters of the agent's policy
2. $E_{\tau \sim p_\theta(\tau)}$ - the expectation with respect to trajectories $\tau$ drawn from the distribution $p_\theta(\tau)$ induced by the policy with parameters $\theta$
3. $\sum_{t}^{} r(s_t, a_t)$ - the reward accumulated along trajectory $\tau$

## 1.5  Q-function

$Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi_\theta} [r(s_{t'}, a_{t'})|s_t, a_t] $

The $Q$ function represents the total expected reward (up to step $T$) obtained by taking action $a_t$ in state $s_t$ and following the policy afterwards.

&nbsp;


**Text Explanation:**

The Q-function tells you how good a particular action taken in the current state is, in terms of future reward.

## 1.6  Value function

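$V^\pi(s_t) = \sum_{t'=t}^{T} E_{\pi_\theta} [r(s_{t'}, a_{t'})|s_t] $

The $V$ function represents the total expected reward (up to step $T$) from state $s_t$. It is the average of $Q^\pi(s_t, a_t)$ over actions, weighted by the policy: $V^\pi(s_t) = E_{a_t \sim \pi_\theta(a_t|s_t)}[Q^\pi(s_t, a_t)]$. This reduces to a plain average only when the policy is uniform over actions.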

&nbsp;

**Text Explanation:**

The value function tells you how good the current state is on average, in terms of future reward.

If $Q^\pi(s_t, a_t) > V^\pi(s_t)$, then $a_t$ is better than the average action.
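As a small worked example of the relation between $Q$ and $V$, the sketch below averages made-up $Q$ values under a made-up policy for a single state. All numbers are hypothetical.

```python
# Hypothetical Q values and policy probabilities for one state s_t.
q_values = {"left": 1.0, "right": 3.0}     # Q(s_t, a)
policy = {"left": 0.25, "right": 0.75}     # pi(a | s_t)

# V(s_t) is the policy-weighted average of Q(s_t, a).
value = sum(policy[a] * q_values[a] for a in q_values)

# An action is better than average exactly when Q(s_t, a) - V(s_t) > 0.
advantage = {a: q_values[a] - value for a in q_values}
better_than_average = [a for a in q_values if advantage[a] > 0]
```

Here `value` is 2.5, so only `"right"` is better than average.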

## 1.7 Tradeoffs

**Off-Policy vs On-Policy**
1. Off-Policy: The algorithm can improve the current policy using samples that were not generated by that policy. It does not need to rerun the policy for every single update.
2. On-Policy: The algorithm needs fresh samples from the new policy for each update.

**Thoughts**

If the policy update relies on samples from the same policy that is currently in use, the algorithm is on-policy. If the agent instead improves its policy using samples from other sources (let's say a bucket of pre-sampled data), the algorithm is off-policy. It is like the difference between learning by reading a book and learning by trying things yourself (without any given knowledge).

# 2. Imitation Learning

Imitation learning is a batch learning approach in reinforcement learning, in which the agent tries to learn a human expert's policy by training on human expert data.

## 2.1 DAgger

**Algorithm**

1. Train model $\pi(a_t|o_t)$ on expert data $D = \{o_1,a_1,...,o_n,a_n\}$
2. Run $\pi(a_t|o_t)$ to collect a new trajectory $D_\pi = \{o_1,a_1,...,o_n,a_n\}$
3. Ask an expert to relabel the observations in $D_\pi$ with the actions $a_t$ the expert would have taken
4. Aggregate $D \leftarrow D \bigcup D_\pi$
5. Iterate from step 1 until the policy's actions look good
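The loop above can be sketched with toy stand-ins. Everything here is hypothetical: the "expert" is a simple rule, the "model" is a majority-vote lookup table, and the "environment" is a one-dimensional random drift. The point is only the train / run / relabel / aggregate structure.

```python
import random


def expert_action(obs):
    # Hypothetical expert: action 1 for non-negative observations, else action 0.
    return 1 if obs >= 0 else 0


def train(dataset):
    # Toy "model": majority vote per rounded observation.
    votes = {}
    for obs, act in dataset:
        votes.setdefault(round(obs), [0, 0])[act] += 1

    def policy(obs):
        counts = votes.get(round(obs), [1, 0])  # unseen observations default to action 0
        return 1 if counts[1] > counts[0] else 0

    return policy


def rollout(policy, rng, steps=20):
    # Toy environment: the observation drifts with the chosen action plus noise.
    obs, visited = rng.uniform(-3, 3), []
    for _ in range(steps):
        visited.append(obs)
        obs += (0.5 if policy(obs) == 1 else -0.5) + rng.uniform(-1, 1)
    return visited


rng = random.Random(0)
dataset = [(o, expert_action(o)) for o in [-2.0, -1.0, 1.0, 2.0]]  # initial expert data D
for _ in range(5):
    policy = train(dataset)                                     # 1. train on D
    observations = rollout(policy, rng)                         # 2. run the policy
    relabelled = [(o, expert_action(o)) for o in observations]  # 3. expert labels
    dataset.extend(relabelled)                                  # 4. aggregate
```

Note that the observations come from the learner's own rollouts, while the labels always come from the expert.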

**Pros**
1. No need to understand transition
2. Algorithm is relatively simple

**Cons**
1. Need tedious human labeling
2. The trained model is biased (biased to both expert behavior and training setting)

# 3. Policy Gradient

## 3.1 Goal function

Recall the goal function from 1.4:

$ \theta^* = \underset{\theta}{\operatorname{argmax}} E_{\tau \sim p_\theta(\tau)}[ \sum_{t}^{} r(s_t, a_t)] $

Our goal is to find the policy that maximizes the accumulated reward until the end of the trajectory.

$ J(\theta) = E_{\tau \sim p_\theta(\tau)}[ \sum_{t}^{} r(s_t, a_t)] \approx \frac{1}{N} \sum_{i} \sum_{t} r(s_{i,t},a_{i,t})$

$J(\theta)$ above is the expected accumulated reward we try to maximize. One way to estimate this expectation is Monte Carlo sampling: if we sample $N$ trajectories from the policy with parameters $\theta$, we can sum up the reward of each trajectory and average over the samples to get an approximate expectation.
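A minimal sketch of this Monte Carlo estimate, assuming a made-up environment where the action "right" pays reward 1 and the policy picks "right" with probability `p_right` (so the exact answer is `horizon * p_right`):

```python
import random


def sample_return(p_right, horizon, rng):
    # One trajectory: reward 1 whenever the sampled action is "right", else 0.
    return sum(1.0 for _ in range(horizon) if rng.random() < p_right)


def estimate_objective(p_right, horizon=5, num_samples=5000, seed=0):
    # J(theta) ~ (1/N) * sum_i sum_t r(s_{i,t}, a_{i,t})
    rng = random.Random(seed)
    returns = [sample_return(p_right, horizon, rng) for _ in range(num_samples)]
    return sum(returns) / num_samples


j_hat = estimate_objective(p_right=0.7)   # close to the exact value 5 * 0.7 = 3.5
```

The estimate gets closer to the exact value as `num_samples` grows, which is the usual Monte Carlo trade-off between cost and accuracy.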

## 3.2 Gradient 

Use the short notation $r(\tau)$ for the total reward of a trajectory

$r(\tau) = \sum_{t} r(s_{t},a_{t})$

Now we take the gradient of $J(\theta)$ and get

$ \nabla_{\theta} J(\theta) = \nabla_{\theta} E_{\tau \sim \pi_\theta(\tau)}[ r(\tau)] 
            = \nabla_{\theta} \int \pi_{\theta}(\tau)r(\tau)\,d\tau
            = \int \nabla_{\theta} \pi_{\theta}(\tau)r(\tau)\,d\tau$

Here, $\pi_{\theta}(\tau)$ denotes the probability of trajectory $\tau$ under the policy we care about (written $p_\theta(\tau)$ in **1.4**).


**Notice that**

No state information appears in the above equation because of the abbreviation $\tau$. In practice, the policy is learned based on state.

Because taking the gradient of the trajectory probability directly can be hard (it is a product of many probabilities), we use a trick to transform the above $\nabla_{\theta} J(\theta)$ into a log-likelihood form, so that we can work with a sum of log probabilities instead.

Recall $ x\,\nabla \log{x} = x\frac{\nabla x}{x} = \nabla x$

Now recall the previous $\nabla_{\theta} J(\theta)$ and replace $\nabla_{\theta} \pi_{\theta} (\tau)$ using our trick.

$ \nabla_{\theta} J(\theta) = \int \nabla_{\theta} \pi_{\theta}(\tau)r(\tau)\,d\tau
            = \int \pi_{\theta}(\tau) \nabla_{\theta} \log \pi_{\theta}(\tau) r(\tau)\,d\tau $

One more step and we are back to our favorite expectation.

$ \nabla_{\theta} J(\theta) = E_{\tau \sim \pi_\theta(\tau)}[ \nabla_{\theta} \log \pi_{\theta}(\tau)r(\tau)] $

## 3.3 Derivative

After our trick in 3.2, $ \nabla_{\theta} J(\theta) $ looks like below

$ \nabla_{\theta} J(\theta) = E_{\tau \sim \pi_\theta(\tau)}[ \nabla_{\theta} \log \pi_{\theta}(\tau)r(\tau)] $

We know that $\pi_{\theta}(\tau)$ is actually the product of the probabilities involved in the trajectory

$\pi_{\theta}(\tau) = \pi_{\theta}(s_{1}, a_{1}, \ldots, s_{T}, a_{T}) 
                    = p(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t|s_t)p(s_{t+1}|s_t, a_t) $

Let's take the log of this long formula:

$\log \pi_{\theta}(\tau) = \log p(s_1) + \sum_{t=1}^{T} \log \pi_{\theta}(a_t | s_t) + \sum_{t=1}^{T} \log p(s_{t+1}|s_t, a_t) $

See what we get here:

1. $\log p(s_1)$ - the log probability of the initial state

2. $\sum_{t=1}^{T} \log \pi_{\theta}(a_t | s_t)$ - Sum of log probabilities of the chosen action given the state at step $t$

3. $\sum_{t=1}^{T} \log p(s_{t+1}|s_t, a_t) $ - Sum of log transition probabilities $p(s_{t+1}|s_t, a_t)$

Neither the initial state distribution nor the transition function depends on $\theta$, so the gradient of these two terms with respect to $\theta$ is 0. Therefore, the only term that contributes is the second one.

$\nabla_{\theta} \log \pi_{\theta}(\tau) = \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t)$

The gradient of the goal function can then be written as the expectation of the product of two sums:

$ \nabla_{\theta} J(\theta) = E_{\tau \sim \pi_\theta(\tau)}[ \nabla_{\theta} \log \pi_{\theta}(\tau)r(\tau)] = E_{\tau \sim \pi_\theta(\tau)} [\big(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t)\big) *  \big(\sum_{t=1}^{T} r(s_t, a_t)\big)]$


&nbsp;

**Notice that** 

In policy gradient, there is no need to know the dynamics. However, we are assuming full observability and a strict Markovian condition here, in which case the next state is determined by a single probability distribution conditioned on the current state and action. 

## Evaluation

Just like our evaluation of $ J(\theta) $ in **3.1**, we can use Monte Carlo to estimate the policy gradient simply by sampling a number of trajectories. All we need to do is sum up the rewards and sum up the gradients of the log probabilities of the actions we have taken.

$ \nabla_{\theta} J(\theta) = E_{\tau \sim \pi_\theta(\tau)} [\big(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t)\big) *  \big(\sum_{t=1}^{T} r(s_t, a_t)\big)]$

If we sample $N$ trajectories, then the estimate of the gradient looks like

$ \nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \big(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_{i,t} | s_{i,t})\big) *  \big(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\big) $

**cross entropy + softmax**


Since we try to maximize $J(\theta)$, we do gradient ascent in this case

$\theta = \theta + \alpha \nabla_{\theta} J(\theta)$
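Here is a minimal sketch of the sampled gradient plus the ascent step, on a made-up two-armed bandit (one timestep per trajectory, softmax policy over two logits). The bandit, the hyperparameters and the function names are assumptions for illustration; for a softmax policy the gradient of $\log \pi(a)$ with respect to logit $k$ is $1[k = a] - \pi(k)$.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards=(0.0, 1.0), alpha=0.1, iterations=200, batch=20, seed=0):
    rng = random.Random(seed)
    theta = [0.0 for _ in rewards]            # one logit per action
    for _ in range(iterations):
        probs = softmax(theta)
        grad = [0.0 for _ in rewards]
        for _ in range(batch):                # N sampled one-step trajectories
            a = rng.choices(range(len(rewards)), weights=probs)[0]
            for k in range(len(theta)):
                grad_log_pi = (1.0 if k == a else 0.0) - probs[k]
                grad[k] += grad_log_pi * rewards[a] / batch
        # Gradient ascent: theta <- theta + alpha * grad J(theta)
        theta = [th + alpha * g for th, g in zip(theta, grad)]
    return softmax(theta)


final_probs = reinforce_bandit()
```

After training, `final_probs` puts most of its mass on the arm with reward 1.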

### In Practice

When we evaluate $\sum^{N} \sum_{t=1}^{T} \log \pi_{\theta}(a_t | s_t)$ in tensorflow, we need to keep in mind that the output of the NN is a vector of unnormalized weights (logits), one per action.

Suppose in each iteration we sample $j$ states in terms of timesteps (the number of sampled trajectories varies), and the observation space has dimension $m$.

1. observation - $\left [\left [o_{11},...,o_{1m}\right ],...,\left [o_{j1},...,o_{jm}\right ] \right ]$

After we feed the observations into the model, we get the policy output, which has one entry per action (dimension $k$).

2. policy - $\left [\left [a_{11},...,a_{1k}\right ],...,\left [a_{j1},...,a_{jk}\right ] \right ]$

We apply a softmax to the policy output so that each row $\left [a_{j1},...,a_{jk}\right ]$ sums to 1 and represents the probabilities we want.

3. softmax[ policy ] - $\left [\left [p(a_{11}),...,p(a_{1k})\right ],...,\left [p(a_{j1}),...,p(a_{jk})\right ] \right ]$

Then we multiply the log of this distribution by the one-hot encoding of the action actually taken in our sampling to get $\log \pi_{\theta}(a_t | s_t)$

4. $\log \pi_{\theta}(a_t | s_t) = a_t \cdot \log p(a | s_t)$ = - cross_entropy (label = $a_t$, probability = $p(a | s_t )$)

**Notice**

The negative sign in front of cross_entropy cancels the negative sign embedded in the definition of cross entropy.
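The four steps can be checked by hand with a small sketch. This is plain Python rather than tensorflow, and the logits and sampled actions are made-up numbers; `log_softmax` and `cross_entropy` are illustrative helpers, not library calls.

```python
import math


def log_softmax(logits):
    # Numerically stable log of the softmax probabilities.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(one_hot_label, logits):
    # -sum_k label_k * log p_k, with p = softmax(logits)
    log_probs = log_softmax(logits)
    return -sum(y * lp for y, lp in zip(one_hot_label, log_probs))


logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]   # policy output: j = 2 states, k = 3 actions
actions = [0, 2]                               # actions actually taken while sampling
one_hot = [[1.0 if k == a else 0.0 for k in range(3)] for a in actions]

# log pi(a_t | s_t) = -cross_entropy(label = a_t, logits)
log_pi = [-cross_entropy(y, z) for y, z in zip(one_hot, logits)]
```

For the second row all logits are equal, so `log_pi[1]` is $\log \frac{1}{3}$.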

## 3.4 Compare with Imitation learning

The two are very similar. The major difference is that in policy gradient, the gradient of the log probability is multiplied by the sum of rewards. In other words, each trajectory is weighted by its total reward.

One way to think about it is that in imitation learning, every sample is weighted equally, as if it were a successful trajectory (or as if all trajectories had the same total reward). Because of this equal weighting, actions from trajectories that occur more frequently are weighted higher.

In policy gradient, we introduce a preference for trajectories with higher reward. Some sampled trajectories are therefore more important than others because of their higher total reward, which makes sense - a policy that prefers trajectories with higher reward is the policy we want for our objective.

## 3.5 Continuous actions

In some cases, the action is not one of a collection of discrete values; instead, it may be chosen from an interval of values. In that case, the policy network outputs the parameters (for example the mean $\mu$) of a random variable for each action dimension, rather than one logit per discrete action.


Recall the PDF of the normal distribution

1. $f(x | \mu, \sigma^{2}) = \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{- \frac{(x-\mu)^2}{2 \sigma^{2}} }$

and plug it into our policy

2. $\pi_{\theta}(a_t | s_t) =  \frac{1}{constant} e^{- \frac{(a_t-\mu)^2}{2 \sigma^{2}} }$

Taking the log of the policy is straightforward. If $\sigma$ is fixed, the normalizing constant does not depend on $\theta$, so it contributes nothing to the gradient and we can drop it. In the end, we arrive at the square of the z-score, which is easy to compute and understand

3. $\log \pi_{\theta}(a_t | s_t) = \log e^{- \frac{(a_t-\mu)^2}{2 \sigma^{2}} } + const
                                = - \frac{(a_t-\mu)^2}{2 \sigma^{2}} + const = - \frac{1}{2} z^2 + const$  

Now look back at our goal

4. $\sum^{N} \log \pi_{\theta}(a_t | s_t) = - \sum^{N} \frac{1}{2} z_{\mu, \sigma} (a_t)^2 + const$
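A small sketch of the Gaussian log probability, comparing the full log density with the simplified z-score form. The function names are made up; the check shows that the two differ only by a term that does not depend on the action, which is why the constant can be dropped when $\sigma$ is fixed.

```python
import math


def gaussian_log_prob(action, mean, std):
    # Full log density of N(mean, std^2) evaluated at the action.
    z = (action - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def simplified_log_prob(action, mean, std):
    # Same quantity with the normalizing constant dropped: -1/2 * z^2.
    z = (action - mean) / std
    return -0.5 * z * z


# The gap between the two is the same for every action (given a fixed std).
gap_a = gaussian_log_prob(1.0, 0.0, 2.0) - simplified_log_prob(1.0, 0.0, 2.0)
gap_b = gaussian_log_prob(3.0, 0.0, 2.0) - simplified_log_prob(3.0, 0.0, 2.0)
```

If the network also learns $\sigma$, the $-\log \sigma$ term does depend on $\theta$ and should be kept.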

## 3.6 Problems with Policy Gradient

The absolute scale of the reward varies from problem to problem, which makes convergence inconsistent. In addition, sampling whole trajectories and then averaging the reward is costly, since trajectories can take a varying number of steps.

## 3.7 Reward-to-go

Recall our old formula for **policy gradient**:

$ \nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \big(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_{i,t} | s_{i,t})\big) *  \big(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\big) $

In the above formula, the gradient of the log probability of the whole trajectory, $\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_{i,t} | s_{i,t})$, is multiplied by the sum of rewards until the end of the trajectory, $\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})$. $N$ is the number of trajectories we have sampled.

Why does this work? Think about the example below for a single trajectory with three steps, writing $r_t$ for $r(s_t, a_t)$: 

$\nabla J = \big(\nabla\log\pi(a_1) + \nabla\log\pi(a_2) + \nabla\log\pi(a_3)\big) * (r_1 + r_2 + r_3)$

Distributing the reward sum over each $\nabla\log\pi(a_t)$ term, we get

$\nabla J = \nabla\log\pi(a_1) * (r_1 + r_2 + r_3) + \nabla\log\pi(a_2) * (r_1 + r_2 + r_3) + \nabla\log\pi(a_3) * (r_1 + r_2 + r_3)$

Because of causality, a term $\nabla\log\pi(a_{x})r_{y}$ has zero expectation when $ x > y$. Basically, an action taken in the future cannot affect a reward we already gained in the past.

OK, let's rewrite our formula above again.

$\nabla J = \nabla\log\pi(a_1) * (r_1 + r_2 + r_3) + \nabla\log\pi(a_2) * (r_2 + r_3) + \nabla\log\pi(a_3) * r_3$

Now we can modify the formula to get the true **reward_to_go**:

$ \nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \big( \nabla_{\theta} \log \pi_{\theta}(a_{i,t} | s_{i,t}) * \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})\big) $

*Notice that*
The $N$ here is still the number of trajectories
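The inner sum $\sum_{t'=t}^{T} r_{t'}$ is cheap to compute with one backward pass over the rewards of a trajectory. A minimal sketch (the function name `reward_to_go` is just for illustration):

```python
def reward_to_go(rewards):
    # out[t] = rewards[t] + rewards[t+1] + ... + rewards[T-1]
    out, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out


# Matches the three-step example: (r1 + r2 + r3), (r2 + r3), r3
rtg = reward_to_go([1.0, 2.0, 3.0])   # [6.0, 5.0, 3.0]
```

Each entry of `rtg` is the weight applied to the log-probability gradient of the action at the same timestep.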

## 3.8 Baseline

In addition to the reward-to-go trick, another frequently used method is to apply a baseline. Instead of using the raw accumulated reward, we only use how much better the reward is than a selected baseline.

Recall our gradient estimate with **Reward-to-go** from the last section

$ \nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \big( \nabla_{\theta} \log \pi_{\theta}(a_{i,t} | s_{i,t}) * \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})\big) $

Now subtract a baseline $b$ from the reward

$ \nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \big( \nabla_{\theta} \log \pi_{\theta}(a_{i,t} | s_{i,t}) * (\sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) - b)\big) $

One good guess for $b$ is simply the average of the sum of rewards over the sampled trajectories.

$b = \frac{1}{N} \sum_{i=1}^{N} r(\tau_i)$

A state-dependent baseline, such as the value function $V^\pi(s_t)$ used in **4.1**, is better but requires another NN!
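As a quick illustration, assuming the baseline is the average return over the sampled trajectories (one common choice), subtracting it centers the weights around zero. The returns below are made-up numbers.

```python
# Hypothetical total rewards of N = 4 sampled trajectories.
returns = [10.0, 12.0, 8.0, 14.0]

# Baseline: average return over the batch.
baseline = sum(returns) / len(returns)

# Trajectories better than average get a positive weight, worse ones a negative weight.
centered = [r - baseline for r in returns]
```

Without the baseline every trajectory here would be pushed up (all returns are positive); with it, only the better-than-average ones are.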

## 3.9 Importance Sampling

https://en.wikipedia.org/wiki/Importance_sampling

# 4. Actor-Critic

## 4.1 Goal function

Recall our definitions of the Q function and V function from 1.5 and 1.6:

$Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi_\theta} [r(s_{t'}, a_{t'})|s_t, a_t] $

The $Q$ function represents the total expected reward (up to step $T$) obtained by taking action $a_t$ in state $s_t$.

$V^\pi(s_t) = \sum_{t'=t}^{T} E_{\pi_\theta} [r(s_{t'}, a_{t'})|s_t] $

The $V$ function represents the total expected reward (up to step $T$) from state $s_t$. It is the average of $Q^\pi(s_t, a_t)$ over actions drawn from the policy.

Then let's take a look at the gradient estimate we learned in **3.8**. It is easy to find some similarities 

$ \nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \big( \nabla_{\theta} \log \pi_{\theta}(a_{i,t} | s_{i,t}) * (\sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) - b)\big) $

$\sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) \approx  Q^\pi(s_{i,t}, a_{i,t})$

The reward-to-go of one sampled trajectory is a single-sample estimate of $Q^\pi$. Now we use the true expected accumulated reward $Q^\pi$ in place of this single-sample estimate.

$b = V^\pi(s_t)$

Here, we use the value of the current state as the baseline, in place of a constant average reward.


And the advantage is basically the difference between these two terms 

$A^{\pi} (s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$

Now we can rewrite our policy gradient estimate as:

$ \nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \big( \nabla_{\theta} \log \pi_{\theta}(a_{i,t} | s_{i,t}) * A^{\pi} (s_{i,t}, a_{i,t}) \big) $



**Notice**

The function described here is a little bit different from Q-learning, which will be described later on.

## 4.2 Derivative

We have multiple terms here: $A^{\pi} (s_t, a_t)$, $V^\pi(s_t)$, $Q^\pi(s_t, a_t)$

**Question**

Which one should we fit?

The answer is $V^\pi(s_t)$

While the advantage seems the most straightforward choice, we should actually fit $V^\pi(s_t)$, because it depends only on the state, and $Q^\pi(s_t, a_t)$ can be approximately recovered from $V^\pi$ plus a single-step reward

$Q^\pi(s_t, a_t) \approx r(s_t, a_t) + V^\pi(s_{t+1})$

Now let's change our advantage function to reflect this

$A^{\pi} (s_t, a_t) \approx r(s_t, a_t) + V^\pi(s_{t+1}) - V^\pi(s_t)$
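Given value estimates along a trajectory, the one-step advantage is a single line of arithmetic. A minimal sketch with made-up numbers; the `gamma` argument is an extra assumption (a discount factor), left at 1.0 to match the undiscounted formula above.

```python
def one_step_advantages(rewards, values, gamma=1.0):
    # values holds one more entry than rewards; the last entry is the value
    # of the state after the final step (0.0 for a terminal state).
    return [rewards[t] + gamma * values[t + 1] - values[t]
            for t in range(len(rewards))]


rewards = [1.0, 0.0, 2.0]
values = [2.5, 2.0, 1.5, 0.0]                 # V(s_1), V(s_2), V(s_3), V(terminal)
adv = one_step_advantages(rewards, values)    # [0.5, -0.5, 0.5]
```

A positive entry means the step turned out better than the value function expected.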


## 4.3 Evaluation

A simple way to represent $V^\pi(s_t)$ is to fit a NN to the summed rewards. In other words, we feed the state as input and use the reward sum from Monte Carlo sampling as the target output of our NN.

$V^\pi(s_t) \approx \sum_{t'=t}^{T} r(s_{t'}, a_{t'}) $

**Notice that**

So what's the difference between this and our previous approach? We are summing up the reward again, right? 

Not quite. The major difference between this and the raw summation is that, previously, we could only use the reward accumulated from a sample once. Now we fit these data into a model and reuse them to make guesses. If a state $s_t$ comes along that looks really similar to one we have seen before, the NN can give out a number with reasonable accuracy. 

Take the words from the beginning of this section

"In other words, we feed the state as input and use the reward sum from Monte Carlo sampling as the target output of our NN"

This tells us that we are after a regression model which minimizes $L(\phi)$

$L(\phi) = \frac{1}{2} \big| V_{\phi}^\pi(s_t) - y_i \big |^2  $

1. $V_{\phi}^\pi(s_t)$ - the value predicted by the model with parameters $\phi$
2. $y_i$ - the empirical accumulated reward we collected, usually through Monte Carlo

Because we still want to do less sampling, we can perform the same trick on $y_i$ as well - estimate $y_{i}$ using $V_{\phi}^\pi(s_{t+1})$, in which

$y_i = r(s_t, a_t) + V_{\phi}^\pi(s_{t+1})$

## 4.4 Bootstrap

Look at our $y_i$ above: because we are fitting $V_{\phi}$ using targets built from $V_{\phi}^\pi(s_{t+1})$ of the previous iteration, we call this method bootstrapping (updating an estimate using its own value from the previous iteration)

$y_i = r(s_t, a_t) + V_{\phi}^\pi(s_{t+1})$
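A minimal sketch of bootstrapped value fitting, using a table instead of a NN and a made-up three-state chain (reward 1 per step, then termination). Each update is a gradient step on $\frac{1}{2}(V(s) - y)^2$ with the target $y$ treated as a constant.

```python
def fit_values_bootstrap(transitions, num_states, lr=0.5, sweeps=200):
    # One table entry per state, plus an extra terminal slot that stays at 0.
    values = [0.0] * (num_states + 1)
    for _ in range(sweeps):
        for s, r, s_next in transitions:
            target = r + values[s_next]               # y = r + V(s'), uses the current estimate
            values[s] += lr * (target - values[s])    # step on 1/2 * (V(s) - y)^2
    return values


# Hypothetical chain 0 -> 1 -> 2 -> terminal (index 3), reward 1 on every step.
transitions = [(0, 1.0, 1), (1, 1.0, 2), (2, 1.0, 3)]
values = fit_values_bootstrap(transitions, num_states=3)   # approaches [3, 2, 1, 0]
```

No full-trajectory reward sums are needed here; each target only uses a single reward and the current estimate of the next state's value.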

## 4.5 Algorithm

Though not directly, we have covered almost everything needed to build an actor-critic algorithm.

The key idea of actor-critic is this - we build two separate NNs to represent the policy and the value function. One NN serves as the model for the policy (actor) $\pi(a_t | s_t)$, and the other NN serves as the **Value function** we mentioned in *4.3* and *4.4*. The name actor-critic comes from the fact that the policy model acts like a player who is playing the game, while the **Value function** acts like a critic who scores how good the last step was.

Batch actor-critic algorithm:
1. Sample $\{s_i, a_i\}$ from $\pi(a|s)$ - (run the policy on the robot)
2. Fit $V_{\phi}^\pi(s)$ to sampled reward sums - (observation as input, reward sum as output)
3. Evaluate $A^{\pi} (s_t, a_t) = r(s_t, a_t) + V^\pi(s_{t+1}) - V^\pi(s_t)$
4. $\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta)$ (update $\theta$ through gradient ascent)

## 4.6 Discounted factor

In some cases, we want to evaluate the reward over an infinite number of timesteps. To keep the sum finite and make our agent focus on rewards close to the current timestep, we apply a discount factor $\gamma$ to future rewards.

Two options:

1. Discount rewards starting from the current timestep of the reward-to-go:

$ \nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \big( \nabla_{\theta} \log \pi_{\theta}(a_{i,t} | s_{i,t}) * \sum_{t'=t}^{T} \gamma^{t'-t} r(s_{i,t'}, a_{i,t'})\big) $

2. Discount rewards starting from the first timestep of the whole trajectory: 

$ \nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \big( \nabla_{\theta} \log \pi_{\theta}(a_{i,t} | s_{i,t}) * \sum_{t'=t}^{T} \gamma^{t'-1} r(s_{i,t'}, a_{i,t'})\big)
                            = \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \gamma^{t-1} \big( \nabla_{\theta} \log \pi_{\theta}(a_{i,t} | s_{i,t}) * \sum_{t'=t}^{T} \gamma^{t'-t} r(s_{i,t'}, a_{i,t'})\big)$

**Notice that**
With option 2, the gradient contribution of later timesteps is itself discounted by $\gamma^{t-1}$.

Typically, we choose option 1 over option 2, since we don't want to discount decisions that happen later in the trajectory. Although a decision may be sampled late in the trajectory, it is worth the same to our model regardless of the timestep.
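The two options differ only by an extra factor per timestep, which a short sketch makes visible. The rewards and the discount value are made-up numbers, and timesteps are 0-indexed in the code (so the extra factor is `gamma ** t`).

```python
def discounted_reward_to_go(rewards, gamma):
    # Option 1: out[t] = sum over t' >= t of gamma^(t' - t) * rewards[t']
    out, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


rewards, gamma = [1.0, 1.0, 1.0], 0.5
option_1 = discounted_reward_to_go(rewards, gamma)               # [1.75, 1.5, 1.0]

# Option 2: the same weights, additionally scaled by gamma^t for step t.
option_2 = [gamma ** t * g for t, g in enumerate(option_1)]      # [1.75, 0.75, 0.25]
```

With option 2 the last decision is weighted by 0.25 instead of 1.0, even though it earned the same immediate reward as the first one.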

## 4.7 Online Actor-Critic algorithm

Online actor-critic algorithm:
1. Take a single action $a \sim \pi(a|s)$ and observe the transition $(s, a, s', r)$ - (run the policy on the robot)
2. Update $V_{\phi}^{\pi}$ using target $r + \gamma V_{\phi}^{\pi}(s')$ (a single update to the NN using the observation as input and $r + \gamma V_{\phi}^{\pi}(s')$ as the target output)
3. Evaluate $A^{\pi} (s, a) = r(s, a) + \gamma V_{\phi}^{\pi}(s') - V_{\phi}^{\pi}(s)$
4. $\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta)$ (update $\theta$ through gradient ascent)

## 4.8 Bias variance tradeoff

Recall the advantage function we use in actor-critic

$A^{\pi} (s_t, a_t) = r(s_t, a_t) + V^\pi(s_{t+1}) - V^\pi(s_t)$

Because this estimate relies on the fitted value function at both $s_{t+1}$ and $s_t$, it has low variance, but it can be heavily biased when the value model is inaccurate, for example when it was fitted on only the initial samples.

Another way to address this issue is to go back to the original approach from policy gradient, in which only the baseline $V^\pi(s_t)$ comes from the model, and the actually observed accumulated reward replaces $r(s_t, a_t) + V^\pi(s_{t+1})$. This estimate is unbiased but has higher variance.

$A^{\pi} (s_t, a_t) = \sum_{t'=t}^{T} \gamma^{t'-t} r(s_{t'}, a_{t'}) - V^\pi(s_t)$


# 5. Q-learning

## 5.1 Goal function

Recall our favorite policy gradient from **3.2**

$ \nabla_{\theta} J(\theta) = E_{\tau \sim \pi_\theta(\tau)}[ \nabla_{\theta} \log \pi_{\theta}(\tau)r(\tau)] 
= E_{\tau \sim \pi_\theta(\tau)} [\big(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t)\big) *  \big(\sum_{t=1}^{T} r(s_t, a_t)\big)]$

In the previous sections, both vanilla policy gradient and actor-critic require us to compute a gradient through the policy $\log \pi_{\theta}(a_t | s_t)$, which can get ugly if the policy is hard to differentiate or its gradient is noisy or sparse.

So, can we get the action without explicitly having a policy distribution $\pi_{\theta}$? Yes!

Recall the definition of advantage:

$A^{\pi} (s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$

If we know the advantage of each action, then we can pick the action that gives us the best advantage.

$a_t = \underset{a_t}{\operatorname{argmax}}  A^{\pi} (s_t, a_t)$

$
\begin{equation}
  \pi(a_t | s_t) =
    \begin{cases}
      1 & a_t = \underset{a_t}{\operatorname{argmax}}  A^{\pi} (s_t, a_t)\\
      0 & \text{otherwise}
    \end{cases}       
\end{equation}$

Substituting a value function estimate for the explicit policy is the key point of Q-learning.

## 5.2 Algorithm

The simplest version looks like the following. Suppose we have a big table of values:

1. evaluate $A^{\pi} (s, a)$
2. set $\pi \leftarrow \pi '$, where $\pi '$ is the greedy policy from **5.1**
    
This is what we call the policy iteration algorithm.

Recall our equation from bootstrap

$y_i = r(s_t, a_t) + V_{\phi}^\pi(s_{t+1})$

If we know the dynamics of the environment, then we can compute these targets exactly and fit them using recursive iterations.

## 5.3 Value Iteration

We know that our action is calculated through:

$a_t = \underset{a_t}{\operatorname{argmax}} A^{\pi} (s_t, a_t)$

Recall that $A^{\pi} (s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$.

Since $V^\pi(s_t)$ is the value of the current state and does not depend on $a_t$, subtracting it does not change the $\operatorname{argmax}$, so we can simplify our action selection to:

$a_t = \underset{a_t}{\operatorname{argmax}} Q^{\pi} (s_t, a_t)$

What's good about using $Q$ here is that we can compute the value function of the greedy policy by simply choosing the maximum $Q$, in which case we don't need an explicit policy function at all.

$V^\pi(s_t) = \sum_{t'=t}^{T} E_{\pi_\theta} [r(s_{t'}, a_{t'})|s_t] = \underset{a_t}{\operatorname{max}} \, Q^{\pi} (s_t, a_t)$

Now, applying what we have derived above, we can write down the algorithm for value iteration.

1. set $Q(s, a) \leftarrow r(s, a) + \gamma E[V(s')]$
2. set $V(s) \leftarrow \underset{a}{\operatorname{max}} \, Q (s, a)$

**Notice that**

No policy function is involved in our algorithm. All we need to do is update the $Q$ and $V$ functions through maximization.
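The two steps can be sketched on a tiny tabular problem. The two-state MDP, its rewards and the discount factor below are made up for illustration, and the dynamics are assumed to be known.

```python
def value_iteration(rewards, transitions, gamma, sweeps=100):
    values = {s: 0.0 for s in transitions}
    q = {}
    for _ in range(sweeps):
        # 1. Q(s, a) <- r(s, a) + gamma * E[V(s')]
        q = {
            s: {
                a: rewards[s][a] + gamma * sum(p * values[s2] for s2, p in next_probs.items())
                for a, next_probs in actions.items()
            }
            for s, actions in transitions.items()
        }
        # 2. V(s) <- max_a Q(s, a)
        values = {s: max(q[s].values()) for s in q}
    return q, values


# Hypothetical deterministic MDP: transitions[s][a] maps next states to probabilities.
transitions = {
    "A": {"stay": {"A": 1.0}, "go": {"B": 1.0}},
    "B": {"stay": {"B": 1.0}, "go": {"A": 1.0}},
}
rewards = {"A": {"stay": 0.0, "go": 1.0}, "B": {"stay": 2.0, "go": 0.0}}

q, values = value_iteration(rewards, transitions, gamma=0.5)
greedy = {s: max(q[s], key=q[s].get) for s in q}   # read the policy off Q at the end
```

The policy only shows up in the last line, as an argmax over the final $Q$ table.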

**Thought**

We have seen different algorithms built with different focuses on $r(a_t, s_t)$, $Q(s, a)$, $V^\pi(s_t)$. The reason behind this variety of algorithms can be traced back to our original objective function

$ \theta^* = \underset{\theta}{\operatorname{argmax}} E_{\tau \sim p_\theta(\tau)}[ \sum_{t}^{} r(s_t, a_t)] $

If we view our model as a complicated function, then its inputs are the reward space $R$ and the action space $A$. Its output is our policy $\pi$, or a way to choose our actions.

The different algorithms here can be seen as different ways of balancing these terms.

## 5.4 Fitting Value Iteration
How can we represent $V(s')$ in $Q(s, a) \leftarrow r(s, a) + \gamma E[V(s')]$ when the state space is too large for a table? We need a regression model.

$L(\phi) = \frac{1}{2} \big | \big | V_\phi(s) - \underset{a}{\operatorname{max}} \, Q^\pi (s,a) \big | \big |^2$

1. $\phi$ - the parameters of the model representing $V$
2. $\pi$ - the actions we have tried

Now we can rewrite our value iteration algorithm. We call this fitted value iteration

1. set $y_i \leftarrow \underset{a_i}{\operatorname{max}} \big[r(s_i, a_i) + \gamma E[V_\phi(s'_i)]\big]$
2. set $\phi \leftarrow \underset{\phi}{\operatorname{argmin}} \frac{1}{2} \sum_i \big | \big | V_\phi(s_i) - y_i \big | \big |^2$

Fitted value iteration lets us operate in continuous, complicated environments since we don't need a table of $Q$ in this case. However, we still need to know the dynamics of the system, because the $\operatorname{max}$ in step 1 requires trying every action from the same state.

In addition, we can apply the trick of replacing $V$ with $\operatorname{max} Q$ from the previous discussion. Now, we only need to fit $Q$ instead of both $Q$ and $V$.

1. set $y_i \leftarrow r(s_i, a_i) + \gamma \, \underset{a'_i}{\operatorname{max}} \, Q_\phi (s'_i, a'_i)$
2. set $\phi \leftarrow \underset{\phi}{\operatorname{argmin}} \frac{1}{2} \sum_i \big | \big | Q_\phi(s_i, a_i) - y_i \big | \big |^2$

Now we don't even need two networks: we use the same $Q$ network to represent the reward-to-go and to choose actions. Q-learning is just an online version of this algorithm.

**Notice that**

This is an off-policy algorithm since we don't need to sample new data every time we update the policy. Also, since we use only one network both to produce the targets and to take the gradient step, there is no guarantee of convergence. 

Another problem with this approach is the bias introduced by the model $Q_\phi (s, a)$ itself. Since we compute the target $y_i$ using the predictor $Q_\phi (s, a)$ but do not differentiate through it, step 2 is not a true gradient descent step. Also, although a NN with enough capacity can be a universal function approximator, in reality our NN has a limited number of layers and neurons, so the bias introduced by the model is often non-negligible.

## 5.5 Bellman Error

Interestingly, we can plug the definition of $y_i$ into the second step and get

$\epsilon = \frac{1}{2} E_{(s,a) \sim \beta} \big [ \big( Q_\phi(s, a) - [r(s, a) + \gamma \, \underset{a'}{\operatorname{max}} \, Q_\phi (s', a') ] \big)^2 \big] $ 

Here $\beta$ is the distribution the $(s, a)$ samples are drawn from. This $\epsilon$ is what we call the Bellman error from dynamic programming. It measures the difference between the current $Q$ estimate and its one-step bootstrapped target built from a single reward; when the error is zero everywhere, $Q_\phi$ is the optimal $Q^*$.

**Thoughts**

Throughout history, many people have studied the problem of a system interacting optimally with its environment. Dynamic programming, optimal control and reinforcement learning basically study the same topic with minor differences in their assumptions.

## 5.6 Greedy policy

To improve exploration, we usually modify our policy a little to give less likely actions a chance.

$
\begin{equation}
  \pi(a_t | s_t) =
    \begin{cases}
      1 - \epsilon & a_t = \underset{a_t}{\operatorname{argmax}}  A^{\pi} (s_t, a_t)\\
      \frac{\epsilon}{|A| - 1} & \text{otherwise}
    \end{cases}       
\end{equation}$

The above is a method called $\epsilon$-greedy, which gives the non-greedy actions a total probability of $\epsilon$.
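A minimal sketch of the $\epsilon$-greedy distribution over a made-up vector of action values (the function name and numbers are for illustration; it assumes at least two actions):

```python
def epsilon_greedy_probs(action_values, epsilon):
    # Greedy action gets 1 - epsilon; the remaining |A| - 1 actions share epsilon equally.
    n = len(action_values)
    best = max(range(n), key=lambda a: action_values[a])
    other = epsilon / (n - 1)
    return [1.0 - epsilon if a == best else other for a in range(n)]


probs = epsilon_greedy_probs([0.1, 0.7, 0.3], epsilon=0.2)   # about [0.1, 0.8, 0.1]
```

Since the advantage and $Q$ differ only by a term that does not depend on the action, either can be passed in as `action_values`.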

# 6. DQN

## 6.1 Replay buffer

Recall from 5.4 that Q-learning is an off-policy algorithm, because the data used in the update step does not need to come from the policy we are currently using.

Therefore, we can build a big database of transitions between states and just pull samples from it (like learning from video tapes). This method decouples the sampling procedure from the update procedure, so we can put the two parts into different processes.
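A replay buffer can be as simple as a bounded queue with uniform sampling. The class below is a minimal sketch; the name `ReplayBuffer`, its methods and the capacity are assumptions for illustration, not a specific library API.

```python
import random
from collections import deque


class ReplayBuffer:
    def __init__(self, capacity, seed=0):
        # Once full, the oldest transitions are dropped automatically.
        self.storage = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.storage.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=5)
for i in range(10):
    buffer.add(state=i, action=0, reward=1.0, next_state=i + 1)
batch = buffer.sample(3)
```

The collecting side only calls `add` and the learning side only calls `sample`, which is what allows the two to run in separate processes.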

## 6.2 Compare Q-learning with Regression

Recall our algorithm for batch Q-learning

1. set $y_i \leftarrow r(s_i, a_i) + \gamma \, \underset{a'_i}{\operatorname{max}} \, Q_\phi (s'_i, a'_i)$
2. set $\phi \leftarrow \underset{\phi}{\operatorname{argmin}} \frac{1}{2} \sum_i \big | \big | Q_\phi(s_i, a_i) - y_i \big | \big |^2$

The major convergence problem comes from the moving target $y_i$, so it is better to fix or at least slow down the change of $y_i$ to make our algorithm behave better.

Combining this with the replay buffer from **6.1**, we get our algorithm for Q-learning with replay buffer and target network.

1. Save target parameters: $\phi' \leftarrow \phi$ 
2. Collect data using some policy, add it to $B$ (some database)
3. Sample a batch from $B$
4. $\phi \leftarrow \phi - \alpha \sum_i \frac{dQ_\phi}{d\phi}(s_i, a_i) \big( Q_\phi(s_i, a_i) - [r(s_i, a_i) + \gamma \, \underset{a'_i}{\operatorname{max}} \, Q_{\phi'} (s'_i, a'_i) ]\big)$

Steps 3 and 4 form the inner loop, which is nested inside the data collection loop that starts at step 2. $\alpha$ is the learning rate.

## 6.3 Algorithm

A more detailed DQN algorithm is described below:

1. Take some action $a_i$ and observe $(s_i, a_i,s'_i,r_i)$, add it to database $B$
2. Sample a mini-batch $(s_j, a_j,s'_j,r_j)$ from $B$ uniformly
3. Compute $y_j = r_j + \gamma \, \underset{a'_j}{\operatorname{max}} \, Q_{\phi '} (s'_j,a'_j)$
4. $\phi \leftarrow \phi - \alpha \sum_j \frac{dQ_\phi}{d\phi}(s_j, a_j) \big( Q_\phi(s_j, a_j) - y_j \big)$
5. Update the target: $\phi ' \leftarrow \phi$ every $N$ steps
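To see the five steps working together, here is a tabular stand-in for DQN: the "network" $Q_\phi$ is a table, the target network $Q_{\phi'}$ is a periodically copied table, and the environment is a made-up three-state chain. All names and hyperparameters are assumptions for illustration.

```python
import random


def step(state, action, n_states=3):
    # Hypothetical chain: action 0 stays put, action 1 moves right.
    # Moving right from the last state pays reward 1 and resets to state 0.
    if action == 0:
        return state, 0.0
    if state == n_states - 1:
        return 0, 1.0
    return state + 1, 0.0


def tabular_dqn(num_steps=3000, gamma=0.9, alpha=0.1, sync_every=50, batch_size=8, seed=0):
    rng = random.Random(seed)
    n_states, n_actions = 3, 2
    q = [[0.0] * n_actions for _ in range(n_states)]   # Q_phi
    target_q = [row[:] for row in q]                   # Q_phi'
    buffer, state = [], 0
    for i in range(num_steps):
        action = rng.randrange(n_actions)              # 1. act (random behaviour policy)
        next_state, reward = step(state, action)
        buffer.append((state, action, next_state, reward))
        state = next_state
        batch = rng.sample(buffer, min(batch_size, len(buffer)))   # 2. uniform mini-batch
        for s, a, s2, r in batch:
            y = r + gamma * max(target_q[s2])          # 3. target from Q_phi'
            q[s][a] -= alpha * (q[s][a] - y)           # 4. gradient step on 1/2 * (Q - y)^2
        if (i + 1) % sync_every == 0:
            target_q = [row[:] for row in q]           # 5. phi' <- phi every N steps
    return q


q_table = tabular_dqn()
```

Even though the data comes from a purely random behaviour policy, the learned table prefers moving right in every state, which is the off-policy property at work.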

## 6.4 Alternative Algorithm

1. Polyak averaging

Some problems can be seen in this algorithm. Let's see what heuristic solutions we have.

One problem lies in the updating of $\phi '$: since the update is delayed for $N$ steps, the sudden jump of the target at the end of each period seems odd. We can use an update ratio $\tau$ (close to 1) to make this procedure smoother.

$\phi \leftarrow \phi - \alpha \sum_j \frac{dQ_\phi}{d\phi}(s_j, a_j) \big( Q_\phi(s_j, a_j) - [r_j + \gamma \, \underset{a'_j}{\operatorname{max}} \, Q_{\phi '} (s'_j, a'_j) ]\big)$

$\phi ' \leftarrow \tau\phi ' + (1-\tau)\phi$
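The soft update is one line per parameter. A minimal sketch with parameters stored as flat lists (the function name and the default `tau` are assumptions):

```python
def polyak_update(target_params, online_params, tau=0.999):
    # phi' <- tau * phi' + (1 - tau) * phi, applied element-wise
    return [tau * t + (1.0 - tau) * p for t, p in zip(target_params, online_params)]


# With tau = 0.9 the target moves 10% of the way towards the online parameters.
new_target = polyak_update([0.0, 1.0], [1.0, 1.0], tau=0.9)   # about [0.1, 1.0]
```

Applied after every gradient step, this replaces the hard copy every $N$ steps with a slowly moving average.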


2. Double Q-learning

Recall our $y_i$ from the previous section

$y_i \leftarrow r(s_i, a_i) + \gamma \, \underset{a'_i}{\operatorname{max}} \, Q_\phi (s'_i, a'_i)$

# 7. Model-based RL

# 8. Advanced Policy Gradient

## 8.2 Kullback-Leibler divergence

A measure of the "distance" between two probability distributions

# 9. Exploration vs Exploitation