# Step-1 What is Deep Reinforcement Learning?
Link = https://medium.freecodecamp.org/an-introduction-to-reinforcement-learning-4339519de419

## The Reinforcement Learning Process
![](https://cdn-images-1.medium.com/max/1116/1*aKYFRoEmmKkybqJOvLt2JQ.png?raw=true)

This RL loop outputs a sequence of **states**, **actions** and **rewards**.

### The central idea of the Reward Hypothesis

Cumulative reward at each time step $t$:

$$G_t=\sum_{k=0}^{T}R_{t+k+1}\tag{1.1}$$

Rewards that come sooner are more likely to be obtained, since they are more predictable than long-term future rewards. We therefore discount future rewards with a discount rate $\gamma$:

$$G_t=\sum_{k=0}^{\infty}\gamma^k R_{t+k+1}\text{, where } 0\leq\gamma<1\tag{1.2}$$

- the larger $\gamma$ the smaller the discount
- the smaller $\gamma$ the bigger the discount
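
As a rough illustration of the discounted return in (1.2), here is a minimal sketch that sums a finite list of rewards. The function name `discounted_return` and the sample rewards are made up for this example.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of rewards [R_1, R_2, ...] (Eq. 1.2, truncated to a finite list)."""
    g = 0.0
    # Walk backwards so that each earlier step applies one more factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]  # illustrative rewards only
print(discounted_return(rewards, 0.9))  # 1 + 0.9 + 0.81
print(discounted_return(rewards, 0.0))  # only the immediate reward counts
```

With a small $\gamma$ the later rewards contribute almost nothing, which is the "bigger discount" described in the bullets.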

### Episodic or Continuing tasks

An **episodic task** has a starting point and an ending point (a terminal state).<br>
A **continuing task** has no ending point. The agent keeps running until we decide to stop it.

### Monte Carlo vs TD Learning methods

In **Monte Carlo Approach** we collect the rewards **at the end of the episode** and then *calculate* the **maximum expected future reward**.

$$V(S_t)\leftarrow V(S_t) + \alpha[G_t-V(S_t)]\tag{1.3}$$

In **Temporal difference learning** we estimate the **reward at *each step***

$$\displaystyle V(S_t) \leftarrow \underbrace{V(S_t)}_{\text{previous estimate}} + \alpha\left[\overbrace{\underbrace{R_{t+1}}_{\text{reward at t+1}}+\underbrace{\gamma V(S_{t+1})}_{\text{discounted value of next step}}}^{\text{TD target}} - V(S_t)\right]\tag{1.4}$$

By running more and more episodes, **the agent will learn to play better and better**.
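
The two update rules can be compared side by side in a small sketch. It assumes a tabular value function stored in a dict, and it uses the standard TD(0) form in which the previous estimate is subtracted from the TD target inside the bracket. All names and numbers are illustrative.

```python
def mc_update(v, state, g, alpha):
    """Monte Carlo: move V(state) toward the full return g, known only at episode end."""
    v[state] += alpha * (g - v[state])


def td0_update(v, state, reward, next_state, alpha, gamma):
    """TD(0): move V(state) toward the one-step TD target r + gamma * V(next_state)."""
    td_target = reward + gamma * v[next_state]
    v[state] += alpha * (td_target - v[state])


v = {"A": 0.0, "B": 1.0}  # hypothetical two-state value table
td0_update(v, "A", reward=1.0, next_state="B", alpha=0.5, gamma=0.9)
print(v["A"])  # halfway between 0.0 and the TD target 1.9
```

The Monte Carlo version needs the whole episode before it can update, while the TD version can update after every single step.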

### Exploration/Exploitation trade-off

- **Exploration** is finding more information about the environment
- **Exploitation** is exploiting known information to maximize the reward.

### Three approaches to Reinforcement Learning
#### Value-based
Value-based RL optimizes the value function
<br>***Value function** is a function that tells us the maximum expected future reward the agent will get at each step*.
$$v_\pi(s) =\mathbb{E}\left[ \sum_{k=0}^\infty \gamma^k R_{t+k+1}\mid S_t=s \right]
\tag{1.5}$$

#### Policy based
Policy based RL directly optimizes the policy function $\pi(s)$ without using a value function.

$$\underbrace{a\quad=\quad\pi(s)}_{\text{action = policy(state)}}
\tag{1.6}$$

- *Deterministic* policy will always return the same action at a given state.
- *Stochastic* policy outputs a probability distribution over actions

$$\pi(a|s) =\mathbb{P}[A_t=a|S_t=s]
\tag{1.7}$$

#### Model Based
Model based RL models the environment.

## Introducing Deep Reinforcement Learning
Deep reinforcement learning introduces deep neural networks to solve Reinforcement Learning problems--hence the name "deep".
![](https://cdn-images-1.medium.com/max/1395/1*w5GuxedZ9ivRYqM_MLUxOQ.png?raw=true)

# Step-2 Diving deeper into Reinforcement Learning with Q-Learning
Link = https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe

## Introducing the Q-table
In a **Q-table**, the columns are the actions $a$, the rows are the states $s$, and the value of each cell is the maximum expected future reward for that given state and action, $Q^*(s,a)$.

## Q-Learning algorithm: learning the Action Value Function
$$Q^\pi(s_t,a_t)=\mathbb{E}\left[\sum_{k=0}^\infty \gamma^k R_{t+k+1}\mid s_t,a_t\right]\tag{2.1}$$
![](https://cdn-images-1.medium.com/max/1116/1*yklmxNRdXleiDbv6aSZUIg.png?raw=true)

### The Q-learning algorithm Process
![](https://cdn-images-1.medium.com/max/1116/1*QeoQEqWYYPs1P8yUwyaJVQ.png?raw=true)

1. $\ $ Initialize Q-values $Q(s,a)$ arbitrarily for all state-action pairs
2. $\ $For life or until learning is stopped 
3. $\quad$ Choose an action $(a)$ in the current world state $(s)$ based on current Q-value estimates $Q(s,\cdot)$
4. $\quad$ Take the action $(a)$ and observe the outcome state $(s')$ and reward $(r)$
5. $\quad$ Update $Q(s,a):=Q(s,a)+\alpha[r+\gamma\max_{a'}Q(s',a')-Q(s,a)]$
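
Step 5 above can be sketched for a tabular Q-function as follows. This is only an illustration: the table is a `defaultdict` keyed by `(state, action)`, and the state and action names are hypothetical.

```python
from collections import defaultdict


def q_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q[(s_next, a2)] for a2 in actions)
    q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
    return q[(s, a)]


q = defaultdict(float)          # every Q-value starts at 0 (an arbitrary choice)
actions = ["left", "right"]     # hypothetical action set
q[("s1", "right")] = 2.0        # pretend we already learned something about s1
new_value = q_update(q, "s0", "right", r=1.0, s_next="s1",
                     actions=actions, alpha=0.5, gamma=0.9)
print(new_value)
```

Note that the update uses the best next action regardless of which action the agent actually takes next, which is what makes Q-learning off-policy.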

#### How to choose an action at step 3?
The idea is that in the beginning, we'll use the **epsilon greedy strategy**:
- We specify an exploration rate "epsilon", which we set to 1 in the beginning. This is the rate of steps that we'll do randomly. In the beginning, this rate must be at its highest value, because we don't know anything about the values in the Q-table. This means we need to do a lot of exploration, by randomly choosing our actions.
- We generate a random number. If this number is larger than "epsilon", then we will do "exploitation" (this means we use what we already know to select the best action at each step). Else, we'll do exploration.
- The idea is that we must have a big epsilon at the beginning of the training of the Q-function. Then, reduce it progressively as the agent becomes more confident at estimating Q-values.
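
The epsilon greedy strategy described above might look like this in code. The multiplicative decay schedule and its constants (`decay`, `min_epsilon`) are assumptions chosen for illustration; other schedules are just as valid.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of Q-values for the current state."""
    if rng.random() > epsilon:
        # Exploitation: take the action with the highest estimated Q-value.
        return max(range(len(q_values)), key=lambda a: q_values[a])
    # Exploration: take a uniformly random action.
    return rng.randrange(len(q_values))


def decay_epsilon(epsilon, decay=0.99, min_epsilon=0.01):
    """Shrink epsilon a little after each episode, but never below min_epsilon."""
    return max(min_epsilon, epsilon * decay)


epsilon = 1.0  # start fully exploratory
action = epsilon_greedy([0.1, 0.9, 0.3], epsilon)
epsilon = decay_epsilon(epsilon)
print(action, epsilon)
```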


![](https://cdn-images-1.medium.com/max/1116/1*9StLEbor62FUDSoRwxyJrg.png?raw=true)

#### Recap ...
- Q-learning is a value-based Reinforcement Learning algorithm that is used to find the optimal action-selection policy using a q function.
- It evaluates which action to take based on an action-value function that determines the value of being in a certain state and taking a certain action at that state.
- Goal: maximize the value function Q (expected future reward given a state and action).
- Q table helps us to find the best action for each state.
- To maximize the expected reward by selecting the best of all possible actions.
- The Q comes from the *quality* of a certain action in a certain state.
- Function Q(state, action) → returns expected future reward of that action at that state.
- This function can be estimated using Q-learning, which iteratively updates Q(s,a) using the Bellman Equation
- Before we explore the environment: Q table gives the same arbitrary fixed value → but as we explore the environment → Q gives us a better and better approximation.

# Step-3 Deep Q-Learning
# An introduction to Deep Q-Learning: let’s play Doom
Link = https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8


## Adding ‘Deep’ to Q-Learning
Q-learning updates a Q-table. This is a good strategy, but it does not scale when the state and action spaces are **giant**.

Instead, we create a neural network that approximates, given a state, the Q-value of each action.

## How does Deep Q-Learning work?
![](https://cdn-images-1.medium.com/max/1395/1*LglEewHrVsuEGpBun8_KTg.png?raw=true)

Our Deep Q Neural Network takes a stack of four frames as input. These pass through the network, which outputs a vector of Q-values, one for each action possible in the given state. We take the biggest Q-value of this vector to find our best action (move left/right or shoot).

### Preprocessing part
![](https://cdn-images-1.medium.com/max/1116/1*QgGnC_0BkQEtPqMUftRC6A.png?raw=true)

### The problem of temporal limitation
Link = https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2

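The first question you might ask is: why do we stack frames together?

We stack frames together because it helps us handle the problem of temporal limitation: a single frame gives no sense of motion (the direction or speed of objects), while several consecutive frames do.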

### Using convolution networks
#### ELU function
Link = https://arxiv.org/pdf/1511.07289.pdf
$$
f(x)=
\left\{\begin{matrix}
 x & \text{if } x> 0 \\  
 \alpha(\exp(x)-1)& \text{otherwise} 
\end{matrix}\right.
\tag{3.1}$$

$$
f'(x)=
\left\{\begin{matrix}
 1 & \text{if } x> 0 \\  
 f(x)+\alpha& \text{otherwise} 
\end{matrix}\right.
\tag{3.2}$$
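
Equations (3.1) and (3.2) translate directly into code. The sketch below works on single floats and uses `alpha=1.0` as an assumed default.

```python
import math


def elu(x, alpha=1.0):
    """ELU activation (Eq. 3.1)."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def elu_grad(x, alpha=1.0):
    """Derivative of ELU (Eq. 3.2): 1 for x > 0, otherwise f(x) + alpha."""
    return 1.0 if x > 0 else elu(x, alpha) + alpha


for x in (-2.0, 0.0, 2.0):
    print(x, elu(x), elu_grad(x))
```

For negative inputs, `f(x) + alpha` simplifies to `alpha * exp(x)`, so the gradient shrinks smoothly toward zero instead of being cut off as it is with ReLU.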

### Experience Replay: making more efficient use of observed experience
### Avoid forgetting previous experiences
![](https://cdn-images-1.medium.com/max/2000/1*RFt8MBBkUSPZdolp_WfZFA.png?raw=true)
Think of the **replay buffer** as a folder where every sheet is an experience tuple (state, action, reward, next state). We feed it by interacting with the environment. And then we take some random sheets to feed the neural network.
### Reducing correlation between experiences
First, we must stop learning while interacting with the environment. We should try different things and play a little randomly to explore the state space. We can save these experiences in the replay buffer.

Then, we can recall these experiences and learn from them. After that, go back to play with updated value function.
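
A replay buffer along these lines could be sketched with a bounded `deque`. The class name, the `done` flag in each tuple and the capacity are assumptions made for this example, not the course's implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of experience tuples, sampled uniformly at random."""

    def __init__(self, capacity):
        # A bounded deque silently drops the oldest experience when full.
        self.memory = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive experiences.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=3)
for i in range(5):
    buffer.add(state=i, action=0, reward=1.0, next_state=i + 1, done=False)
print(len(buffer), buffer.sample(2))
```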

## Deep Q-Learning algorithm
The error is calculated by taking the difference between our Q_target (maximum possible value from the next state) and Q_value (our current prediction of the Q-value)

$$\underbrace{\Delta w}_{\text{change in weights}}=\alpha\left[\overbrace{\underbrace{R+\gamma\max_a\hat{Q}(s',a,w)}_{\text{Q-target}}-\underbrace{\hat{Q}(s,a,w)}_{\text{Q-value}}}^{\text{Temporal difference error}}\right]\underbrace{\nabla_w\hat{Q}(s,a,w)}_{\text{gradient of current predicted Q-value}} \tag{3.3}$$

There are two processes that are happening in this algorithm:
- We sample the environment: we perform actions and store the observed experience tuples in a replay memory.
- We select a small batch of tuples at random and learn from it using a gradient descent update step.

# Step 3+ Improvements in Deep Q Learning: 
Link = https://medium.freecodecamp.org/improvements-in-deep-q-learning-dueling-double-dqn-prioritized-experience-replay-and-fixed-58b130cc5682

Four strategies that improve the training and the results of DQN agents:
- **Double DQNs**
- **Dueling DQN (aka *DDQN*)**
- **Prioritized Experience Replay (aka *PER*)**
- **Fixed Q-targets**

## 3+.1. Fixed Q-targets

In $(3.3)$ **we don't have any idea of the real TD target** (maximum possible Q-value for the next_state), so we have to **estimate** it, using *Bellman equation*:
$$\mathcal{Q}(s,a)=r(s,a) +\gamma\max_a\mathcal{Q}(s',a)\tag{3.4}$$
However, the problem is that we are *using the same parameters (weights)* for estimating both the **Q-target** and the **Q-value**. So at every step of training, our **Q-target shifts simultaneously with the Q-values**. That means we get closer to our target, but the target is also moving. **This leads to big oscillations in training**.

We can use the idea of fixed **Q-targets** introduced by *DeepMind*:
- Using a separate network with fixed parameters ($w^-$) for estimating the **TD target**.
- Every $\tau$ steps, we copy the parameters from **our DQN** to update the target network.
$$\underbrace{\Delta w}_{\text{change in weights}} = \alpha\underbrace{\left[\overbrace{R+\gamma\max_a\hat{\mathcal{Q}}(s',a,w^-)}^{\begin{matrix}\text{maximum possible Q-value}\\ \text{for the next state (= Q-target)}\end{matrix}}-\overbrace{\hat{\mathcal{Q}}(s,a,w)}^{\begin{matrix}\text{current predicted}\\ \text{Q-value}\end{matrix}}\right]}_{\text{temporal difference error}}\underbrace{\nabla_w\hat{\mathcal{Q}}(s,a,w)}_{\begin{matrix}\text{gradient of current} \\ \text{predicted Q-value}\end{matrix}}\tag{3.3'}$$
Every $\tau$ steps:
<br>$\quad\quad\quad w^-\leftarrow w$
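
The bookkeeping for the target network can be sketched as below. The weights are represented as a plain dict, and the sync period `tau` is a hypothetical hyperparameter introduced for this example (`tau = 1` would copy at every step).

```python
import copy


def maybe_sync_target(step, tau, online_weights, target_weights):
    """Return the target weights to use after this step.

    The target network keeps its frozen parameters w^- until the step count
    reaches a multiple of tau; then it receives a copy of the online weights w.
    """
    if step % tau == 0:
        return copy.deepcopy(online_weights)
    return target_weights


online = {"w": [1.0, 2.0]}   # stand-in for the DQN parameters
target = {"w": [0.0, 0.0]}   # stand-in for the target network parameters
for step in range(1, 11):
    target = maybe_sync_target(step, tau=10, online_weights=online,
                               target_weights=target)
print(target)
```

Between two syncs the Q-target is computed with parameters that do not move, which is what removes the "chasing a moving target" effect.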

## 3+.2. Double DQNs
Link = https://papers.nips.cc/paper/3964-double-q-learning

**Double DQNs** handles the **problem of overestimation of Q-values**:

$$\underbrace{\mathcal{Q}(s,a)}_{\text{Q-target}} = \underbrace{r(s,a)}_{\begin{matrix}\text{reward of taking} \\ \text{action } a \text{ at state }s\end{matrix}} +\underbrace{\gamma\max_a\mathcal{Q}(s',a)}_{\begin{matrix}\text{discounted max Q-value among all}\\\text{possible actions from next state}\end{matrix}}\tag{3.4}$$

Calculating **TD target** we face a simple problem: how to be sure that **the best action for the next state is the action with the highest Q-value**?

The **accuracy of Q-values** depends on which actions we tried and which neighboring states we explored.

As a consequence, at the beginning of the training we don't have enough information about the best action to take. Therefore, taking the maximum Q-value as the best action to take can lead to false positives. If non-optimal actions are regularly **given a higher Q-value than the optimal best action, the learning will be complicated**.

The solution is: when we compute the Q-target, we use two networks to decouple the action selection from the target Q-value generation. We:
- use our DQN network to select the best action to take for the next state (the action with the highest Q-value)
- use our target network to calculate the target Q-value of taking that action at the next state.

$$
\underbrace{\mathcal{Q}(s,a)}_{\text{TD target}}=r(s,a) +\underbrace{\gamma \mathcal{Q}\left(s',\underbrace{\arg\max_a\mathcal{Q}(s',a)}_{\begin{matrix}\text{DQN chooses action}\\\text{for the next state}\end{matrix}}\right)}_{\begin{matrix}\text{target network calculates the Q-}\\ \text{value of taking that action at state }s'\end{matrix}}\tag{3.5}
$$

Therefore, **Double DQN** helps us reduce the overestimation of **Q-values** and, as a consequence, helps train faster and have more stable learning.
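
To make the decoupling concrete, the sketch below computes both targets from two lists of next-state Q-values, one from the online DQN and one from the target network. The numbers are invented, and the `done` flag for terminal states is an added assumption.

```python
def dqn_target(reward, gamma, q_target_next, done=False):
    """Plain DQN target: the same network both selects and evaluates the action."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, gamma, q_online_next, q_target_next, done=False):
    """Double DQN target: the online network selects, the target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


q_online_next = [1.0, 3.0, 2.0]   # online DQN prefers action 1
q_target_next = [0.5, 1.5, 4.0]   # target network overrates action 2
print(dqn_target(1.0, 0.9, q_target_next))
print(double_dqn_target(1.0, 0.9, q_online_next, q_target_next))
```

In this made-up case the plain target follows the overestimated value of action 2, while the Double DQN target uses the more modest estimate for the action the online network actually prefers.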

## 3+.3. Dueling DQN (aka DDQN)
Link = https://arxiv.org/pdf/1511.06581.pdf

**Q-value** corresponds to **how good it is to be at state $s$ and taking action $a$**
<br>So we can decompose $Q(s,a)$ as the sum of:
$$
\mathcal{Q}(s,a)=V(s)+A(s,a)\tag{3.6}
$$
where:
- $V(s)$ is the value of being at that state
- $A(s,a)$ is the advantage of taking that action $a$ at state $s$

With **DDQN**, we want to separate the estimators of these two elements, using two new streams:
- one that estimates the **state value** $V(s)$
- one that estimates the **advantage for each action** $A(s,a)$

And then combine these two streams **through a special aggregation layer to get an estimate of** $Q(s,a)$
![](https://cdn-images-1.medium.com/max/1395/1*FkHqwA2eSGixdS-3dvVoMA.png)

By decoupling the estimation, intuitively our **DDQN** can learn which states are (or are not) valuable **without** having to learn the effect of each action at each state (since it's also calculating $V(s)$)

Calculating $V(s)$ is particularly **useful for states where the actions do not affect the environment in a relevant way**.

![](https://cdn-images-1.medium.com/max/1116/0*qor_kPiSwiWt8uQF)

Concerning the aggregation layer, we want to generate the Q-values for each action at state $s$. To avoid the **issue of identifiability** (we cannot recover $A(s,a)$ and $V(s)$ uniquely from a given $Q(s,a)$, which is a problem for back-propagation), we subtract the average advantage over all possible actions $a'$ at state $s$, as follows:
$$
\mathcal{Q}(s,a;\theta,\alpha,\beta)=V(s;\theta,\beta)+\left[A(s,a;\theta,\alpha)-\underbrace{\frac{1}{|\mathcal{A}|}\sum_{a'}A(s,a';\theta,\alpha)}_{\text{average advantage}}\right]
\tag{3.7}$$
where:
- $\theta$ - common network parameters
- $\alpha$ - advantage stream parameters
- $\beta$ - value stream parameters

Therefore, this architecture helps us accelerate the training, find much more reliable Q-values for each action by decoupling the estimation between two streams.
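
The aggregation in (3.7) is easy to check by hand. The following sketch takes the outputs of the two streams as plain numbers (a scalar value and a list of advantages, both invented here) and combines them.

```python
def dueling_q_values(value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) by subtracting the mean advantage (Eq. 3.7)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + (a - mean_adv) for a in advantages]


value = 2.0                     # output of the value stream for state s
advantages = [1.0, 3.0, -1.0]   # output of the advantage stream, one entry per action
q_values = dueling_q_values(value, advantages)
print(q_values)
print(sum(q_values) / len(q_values))  # the mean of the Q-values equals V(s)
```

Because the mean advantage is removed, adding a constant to every advantage no longer changes the Q-values, which is how the subtraction addresses identifiability.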

## 3+.4. Prioritized Experience Replay (PER)
Link = https://arxiv.org/search?searchtype=author&query=Schaul%2C+T

The idea of PER is that **some experiences** may be **more important** than others for our training, but might **occur less frequently**. We try to change the sampling distribution by using a criterion to define the priority of each tuple of experience. We want to prioritize **experiences where there is a big difference between our prediction and the TD-target, since it means that we have a lot to learn about them**.

We use the magnitude (absolute value) of our TD error:

$$p_t\quad=\underbrace{|\delta_t|}_{\begin{matrix}\text{magnitude of}\\\text{our TD-error}\end{matrix}}+\underbrace{e}_{\begin{matrix}\text{constant assures that no experience}\\ \text{has prob. = 0 to be taken}\end{matrix}}\tag{3.8}$$

And **store that priority with each experience in the replay buffer**.

**Probability of being chosen for a replay**:
![](https://cdn-images-1.medium.com/max/1116/0*iCkLY7L3R3mWEh_O)

Using priority sampling can lead to a bias toward high-priority samples and a risk of over-fitting when updating the weights normally. To correct this bias, we use importance sampling (IS) weights that adjust the update by reducing the weights of the often-seen samples:

![](https://cdn-images-1.medium.com/max/1116/0*Lf3KBrOdyBYcOVqB)

Link = http://pemami4911.github.io/paper-summaries/deep-rl/2016/01/26/prioritizing-experience-replay.html

The role of **b** is to control how much these importance sampling weights affect learning.
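
Since the two formulas above are only shown as images, the sketch below assumes the form given in the Prioritized Experience Replay paper: $P(i)=p_i^a/\sum_k p_k^a$ for sampling and $w_i=(\frac{1}{N}\cdot\frac{1}{P(i)})^b$ for the importance sampling weights, normalized by the largest weight. The default values of `a`, `b` and `e` are illustrative only.

```python
def per_probabilities(td_errors, a=0.6, e=0.01):
    """Sampling probability of each experience from its TD error (priority p = |delta| + e)."""
    priorities = [(abs(d) + e) ** a for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def importance_weights(probs, b=0.4):
    """Importance sampling weights, scaled so that the largest weight is 1."""
    n = len(probs)
    weights = [(1.0 / (n * p)) ** b for p in probs]
    max_w = max(weights)
    return [w / max_w for w in weights]


td_errors = [0.1, 2.0, 0.5]  # made-up TD errors for three stored experiences
probs = per_probabilities(td_errors)
print(probs)
print(importance_weights(probs))
```

With `a = 0` the sampling becomes uniform, and with `b = 0` the correction is switched off (all weights equal 1); larger values of `b` correct the bias more strongly.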

## 3+.5. Implementation example
Link = https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Dueling%20Double%20DQN%20with%20PER%20and%20fixed-q%20targets/Dueling%20Deep%20Q%20Learning%20with%20Doom%20%28%2B%20double%20DQNs%20and%20Prioritized%20Experience%20Replay%29.ipynb

# Step 4: An introduction to Policy Gradients with Cartpole
Link = https://medium.freecodecamp.org/an-introduction-to-policy-gradients-with-cartpole-and-doom-495b5ef2207f

In policy-based methods, instead of learning a value function that tells us the expected sum of rewards given a state and an action, we directly learn the policy function $\pi$ that maps states to actions.

We can use a value function to optimize the policy parameters, but the value function $Q(s,a)$ will not be used to select an action.

## 4.1. Why use policy-based methods?
### Two types of policy
A *deterministic policy* maps each state to an action, and is used in *deterministic environments*.

A *stochastic policy* outputs a probability distribution over actions and is used when the environment is uncertain. We call this process a *Partially Observable Markov Decision Process* (POMDP).

### Advantages
#### Convergence
**Policy-based methods have better convergence properties.**
<br>Value-based methods can oscillate a lot while training, because the choice of action may change dramatically for an arbitrarily small change in the estimated action values.

With policy gradient, we just follow the gradient to find the best parameters and we are guaranteed to converge on a local extremum.

#### Policy gradients are more effective in high dimensional action spaces
**DQN**'s predictions assign a score to each possible action, which can fail if the action space is infinite.

In policy-based methods, we just adjust the parameters directly rather than computing (estimating) the maximum at every step.

![](https://cdn-images-1.medium.com/max/800/1*_hAkM4RIxjKjKqAYFR_9CQ.png)

#### Policy gradients can learn stochastic policies
We don't need to implement an exploration/exploitation trade-off. It's automatically handled by policy, without hard-coding.

### Disadvantages
- Policy gradients often converge on a local maximum rather than on the global optimum.
- Policy gradients converge more slowly and take a long time to train.

## 4.2. Policy search

Probability of taking action $a$ given a state $s$ with parameter $\theta$
$$
\pi_\theta(a\mid s)\quad=\quad \mathbb{P}[a \mid s;\theta]
\tag{4.1}
$$
outputs a probability distribution of actions.

Treating the policy as an optimization problem, we can make sure that our policy is good by finding the best parameters $(\theta)$ to maximize a score function $J(\theta)$:
$$
J(\theta)\quad=\quad\mathbb{E}_{\pi_\theta}\left[\sum\gamma r\right]
\tag{4.2}
$$
$J(\theta)$ tells us how good our $\pi$ is, and gradient ascent helps us find the policy parameters that maximize the sampling of good actions.

### First step: Policy score function $J(\theta)$

In an **episodic environment**, we can use the start value. Calculate the mean of the return from the first time step ($G_1$):
$$
J_1(\theta)=\mathbb{E}_{\pi}\left[\underbrace{G_1=R_1+\gamma R_2 + \gamma^2R_3+...}_{\begin{matrix}\text{cumulative discounted rewards}\\ \text{starting at start state}\end{matrix}}\right]\approx\underbrace{\mathbb{E}_{\pi}\left[V(s_1)\right]}_{\text{value of state 1}}
\tag{4.3}$$
The idea is simple. If I always start in some state $s_1$, what's the total reward I'll get from that start state until the end?

Link = https://medium.freecodecamp.org/an-introduction-to-reinforcement-learning-4339519de419

In a **continuous environment**, we can use the average value, because we can't rely on a specific start state:
$$
J_{\text{avgv}}(\theta)=\mathbb{E}_\pi[V(s)]=\sum_s\frac{N(s)}{\sum_{s'}N(s')}V(s)
\tag{4.4}$$
where:
- $N(s)$: number of occurrences of the state $s$
- $\sum_{s'}N(s')$: total number of occurrences of all states

Moreover, we can use the **average reward per time step**:
$$
J_{avR}(\theta)=\mathbb{E}_\pi[r]=\sum_s{\underbrace{d(s)}_{\begin{matrix}\text{probability}\\\text{in state }s\end{matrix}}}\sum_a{\underbrace{\pi_\theta(s,a)}_{\begin{matrix}\\\text{probability taking action }a\\\text{from state }s\text{ under policy }\pi \end{matrix}}R^a_s}
\tag{4.5}$$
The idea here is that we want to get the most reward per time step


### Second step: Policy gradient ascent

- Gradient descent: go opposite gradient, find minimum
- Gradient ascent: go toward gradient, find maximum

The idea is to find the gradient to the current policy $\pi$ that updates the parameters in the direction of the greatest increase, and iterate:
- Policy: $\pi$
- Objective function: $J(\theta)$
- Gradient: $\nabla_\theta J(\theta)$
- Update: $\theta\leftarrow\theta+\alpha\nabla_\theta J(\theta)$

The best parameter $\theta^*$:
$$
\theta^*=\arg\max_\theta\underbrace{\mathbb{E}_{\pi_\theta}\left[\sum_t{R(s_t,a_t)}\right]}_{J(\theta)}
\tag{4.6}$$
where: 
$$
J(\theta)=\underbrace{\mathbb{E}_{\pi_\theta}\left[\underbrace{R(\tau)}_{\text{expected future reward}}\right]}_{\text{expected given policy}}
\tag{4.7}$$

The score function $J(\theta)$ can be defined as:
$$
J_1(\theta)=V_{\pi_\theta}(s_1)=\mathbb{E}_{\pi_\theta}[v_1]=\underbrace{\sum_{s\in S}{d(s)}}_{\text{state distribution}}\underbrace{\sum_{a\in A}{\pi_\theta(s,a)R^a_s}}_{\text{action distribution}}
\tag{4.8}
$$

**Some problems** here:
- Policy parameters change how actions are chosen, and therefore which rewards we get, which states we will see, and how often.

So, it can be challenging to find the changes of policy in a way that ensures improvement. This is because the performance depends on action selections and the distribution of states in which those selections are made.

Both of these are affected by policy parameters. *The effect of policy parameters on the actions is simple to find*, but how do we find **the effect of policy on the state distribution** when **the function of the environment is unknown**?
- As a consequence, how do we **estimate the gradient with respect to policy parameters**, when the gradient depends on the unknown effect of policy changes on the state distribution?

The solution will be to use the **Policy Gradient Theorem**. This provides an analytic expression for the gradient of the score function with respect to policy parameters, $\nabla_\theta J(\theta)$, that does not involve the differentiation of the state distribution.

Since $\nabla\log x = \nabla x /x$:

$$
\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta\left(\mathbb{E}_{\pi_\theta}\left[R(\tau)\right]\right) = \nabla_\theta \sum_{\tau}{\pi(\tau;\theta)R(\tau)} = \sum_{\tau}{\nabla_\theta \pi(\tau;\theta)R(\tau)} \\
&= \sum_{\tau}{\pi(\tau;\theta) \nabla_\theta (\log \pi(\tau;\theta))R(\tau)}\\
&= \mathbb{E}_{\pi_\theta} \left[\nabla_\theta (\log \pi(\tau;\theta))R(\tau)\right]
\end{aligned}
\tag{4.9}$$

And **conclude**:
- Policy gradient: $\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta} \left[\nabla_\theta (\log \underbrace{\pi(s,a,\theta)}_{\text{policy function}})\underbrace{R(\tau)}_{\text{score function}}\right]$
- Update rule: $\Delta\theta=\alpha*\nabla_\theta (\log \pi(s,a,\theta))R(\tau)$

$R(\tau)$ is like a scalar value score:
- if $R(\tau)$ is high, it means that on average we took actions that led to high rewards. We want to push up the probabilities of the actions seen.
- if $R(\tau)$ is low, we want to push down the probabilities of the actions seen.

## 4.3. Monte Carlo Policy Gradients

$1.\ \textbf{Initialize }\theta$<br>
$2.\ \textbf{for} \text{ each episode } \tau = s_0, a_0, r_1; s_1, a_1, r_2; ...; s_T:$<br>
$\quad\quad \textbf{for } t\leftarrow 0 \text{ to } T-1:$<br>
$\quad\quad\quad\Delta\theta=\alpha\nabla_\theta\log(\pi(s_t, a_t,\theta))G_t$<br>
$\quad\quad\quad\theta=\theta+\Delta\theta$

$3.\ \textbf{for} \text{ each episode }:$<br>
$\quad\quad\text{At each time step within that episode:}$<br>
$\quad\quad\quad\text{3.1. Compute the log probabilities produced by our policy function}\\\quad\quad\quad\quad\text{Multiply it by the score function}$<br>
$\quad\quad\quad\text{3.2. Update the weights}$
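
The loop above can be sketched for a very small case: a softmax policy whose parameters $\theta$ are one logit per action and do not depend on the state. That simplification, and all names and numbers below, are assumptions made to keep the example short. For such a policy, $\nabla_{\theta_j}\log\pi(a)=\mathbb{1}[j=a]-\pi_j$.

```python
import math


def softmax(logits):
    """Turn logits into action probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def returns_to_go(rewards, gamma):
    """G_t for every time step: discounted sum of the rewards from t onwards."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


def reinforce_update(theta, actions, rewards, alpha=0.1, gamma=0.99):
    """One pass over an episode: theta += alpha * grad log pi(a_t) * G_t at each step."""
    theta = list(theta)
    for a_t, g_t in zip(actions, returns_to_go(rewards, gamma)):
        probs = softmax(theta)
        for j in range(len(theta)):
            grad_log = (1.0 if j == a_t else 0.0) - probs[j]
            theta[j] += alpha * grad_log * g_t
    return theta


theta = reinforce_update([0.0, 0.0], actions=[1, 1, 0], rewards=[1.0, 1.0, 0.0])
print(theta, softmax(theta))
```

Actions followed by a high return have their logit pushed up, so they become more probable the next time the policy is sampled.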

### Problem:
$R$ is only calculated at the end of each episode, **all actions are averaged**. Even if **some taken actions are very bad**, if $R$ is quite high, these actions will **be averaged as good**.

# Step 5: An intro to Advantage Actor Critic (A2C) methods

Link = https://medium.freecodecamp.org/an-intro-to-advantage-actor-critic-methods-lets-play-sonic-the-hedgehog-86d6240171d

Both **value-based** and **policy-based** methods have big drawbacks, so we will apply a hybrid method called **Actor Critic**, using two neural networks:
- a **Critic** that measures how good the action taken is (value-based)
- an Actor that controls how our agent behaves (policy-based)

A2C is the basis of state-of-the-art algorithms such as **Proximal Policy Optimization (PPO)**.

## 5.1. The quest for a better learning model

Concerning the Policy Gradient method, in a **Monte Carlo** setting, we have to wait until the end of the episode to calculate the reward. We may conclude that if we have a high reward (**R(t)**), **all actions** that we took were **good**, **even if some were really bad**.

As a result, we need a lot of samples, which leads to slow learning.

## 5.2. Introducing Actor Critic
Instead of waiting until the end of the episode, **we make an update at each step (TD learning)**

**Policy update**:
$$\Delta\theta=\alpha*\nabla_\theta (\log \pi(S_t,A_t,\theta))R(t)$$
**Actor-Critic update**:
$$\Delta\theta=\alpha*\nabla_\theta (\log \pi(S_t,A_t,\theta))*Q(S_t,A_t)\tag{5.1}$$
Because we update at each time step, we can't use the total reward $R(t)$. Instead, we need to train a Critic model that **approximates the value function**, and this value function replaces the reward function in the policy gradient.

### How Actor-Critic works
Learning from **feedback**, the Actor will update the policy and get better at future actions, while the Critic will also update its own way of providing feedback so it can be better next time.

We update both **ACTOR** $\pi(s,a,\theta)$ and **CRITIC** $\hat{q}(s,a,w)$ in **parallel** and **optimize separately**:

**Policy update**:
$$
\Delta\theta = \alpha\nabla_\theta(\log\pi_\theta(s,a))*\underbrace{\hat{q}_w(s,a)}_{\begin{matrix}q \text{ learning function}\\\text{approximation}\end{matrix}}
\tag{5.1'}$$

**Value update**:
$$
\Delta w=\beta* \underbrace{\left(R(s,a)+\gamma\hat{q}_w(s_{t+1}, a_{t+1})-\hat{q}_w(s_t,a_t)\right)}_{\text{TD error}}* \underbrace{\nabla_w\hat{q}_w(s_t,a_t)}_{\text{value function gradient}}
\tag{5.2}$$

### The Actor Critic Process
- At each time step $t$, we take the **current state ($s_t$)** from the environment and pass it **as an input** through our **Actor** and **Critic**,
- Our **Policy** takes the state, **outputs an action ($a_t$)**, and receives a **new state** ($s_{t+1}$) and a **reward** ($r_{t+1}$),
- Due to that:
    - the **Critic** computes the **value of taking that action** at that state ($q(s,a)$),
    - the **Actor** *updates* its policy parameters ($\theta$) by (5.1') and using this $q$ value 
- The **Actor** then *produces* **next action** ($a_{t+1}$) to take at new state ($s_{t+1}$)
- The **Critic** then *updates* its **value parameters** ($w$) by (5.2)

## A2C and A3C
### Introducing the Advantage function to stabilize learning
In the Dueling DQN section (3+.3) we split the value function $\mathcal{Q}(s,a)$ into a state value $V(s)$ and an advantage $A(s,a)$. Using the advantage here reduces the **high variability** of the learning process. The advantage function tells us **how much better the action taken at a state is compared to the average action at that state** (the extra reward). If $A(s,a)>0$, the gradient is pushed in that direction; if $A(s,a)<0$, it is pushed in the opposite direction.
 
The problem with implementing $A(s,a)$ is that it requires two value functions, $Q(s,a)$ and $V(s)$. We can **use the TD error as a good estimator of $A(s,a)$**:
 
$$
 \mathcal{A}(s,a) = \mathcal{Q}(s,a)-\mathcal{V}(s) = r +\gamma\mathcal{V}(s')-\mathcal{V}(s)=\text{ TD error.}
 \tag{5.3}$$
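
Equation (5.3) means the Critic only has to provide state values. A minimal sketch, where the `done` flag for terminal transitions is an added assumption and the numbers are invented:

```python
def td_advantage(reward, v_s, v_next, gamma=0.99, done=False):
    """Estimate A(s, a) by the TD error r + gamma * V(s') - V(s)."""
    # At a terminal transition there is no next state to bootstrap from.
    target = reward if done else reward + gamma * v_next
    return target - v_s


print(td_advantage(reward=1.0, v_s=2.0, v_next=3.0, gamma=0.5))  # better than expected
print(td_advantage(reward=0.0, v_s=2.0, v_next=1.0, gamma=0.5))  # worse than expected
```

A positive result says the action turned out better than the Critic's estimate for that state, so the Actor should make it more likely.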
 
### Two different strategies: Asynchronous / Synchronous
- A2C (Advantage Actor Critic)
- A3C (Asynchronous Advantage Actor Critic)
 
In A3C we don't use experience replay, as this requires a lot of memory. Instead, we asynchronously **execute different agents in parallel on multiple instances of the environment**. Each worker updates the global network asynchronously.

In A2C we update the global network synchronously, waiting until all workers have finished their training and calculated their gradients, and then averaging them.
 
### A2C or A3C?
Because of the asynchronous nature of A3C, some workers will be playing with an older version of the parameters, so the aggregated update will not be optimal. That's why A2C waits for each actor to finish its segment before updating the global parameters.

As a consequence, the training will be more cohesive and faster.

### A2C in practice
![](https://cdn-images-1.medium.com/max/873/1*bNw9TH5700_x3X64YXHPdQ.png)
- Creates a vector of $n$ environments using the multi-processing library
- Creates a runner object that handles the different environments, executing in parallel
- Two versions of the network:
    1. **step_model**: generates experiences from environments
    2. **train_model**: trains on the experiences
- The runner takes a step and outputs a batch of experiences
- train_model computes the gradients all at once
- Finally, step_model is updated with the new weights.

**An implementation of A2C**: https://github.com/simoninithomas/Deep_reinforcement_learning_Course/tree/master/A2C%20with%20Sonic%20the%20Hedgehog

# Step 6. Proximal Policy Optimization (PPO)
The central idea of Proximal Policy Optimization is to avoid too large a policy update by clipping the ratio between the new and old policy to a small range such as $[0.8, 1.2]$. PPO trains the agent by running $K$ epochs of gradient descent over sampled mini-batches.

## Problem with Policy Gradient Objective function

$$
\underbrace{L^{PG}(\theta)}_{\text{policy loss}}=\mathbb{E}_t\left[\log\pi_\theta(a_t\mid s_t)\cdot\mathcal{A}_t\right]
\tag{6.1}$$

The problems come from the step size:
- If step size is too small, **training process is too slow**
- If step size is too high, **there's too much variability in training**.

## Clipped surrogate objective function
This function constrains the policy change to a small range using a clip.

Instead of using the log probability to trace the impact of the actions, we can use the **ratio between the probability of the action under the current policy and the probability of the action under the previous policy**:

$$
r_t(\theta)=\frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta_\text{previous}}(a_t\mid s_t)}\text{, so }r(\theta_\text{previous})=1
\tag{6.2}$$

- If $r_t(\theta)>1$, the action is more probable under the current policy than under the old one
- If $0<r_t(\theta)<1$, the action is less probable under the current policy.

As a consequence:
$$
L^{CPI}(\theta)=\hat{\mathbb{E}}_t\left[\frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta_\text{previous}}(a_t\mid s_t)}\cdot\hat{\mathcal{A}}_t\right]=\hat{\mathbb{E}}_t\left[r_t(\theta)\cdot\hat{\mathcal{A}}_t\right]
\tag{6.3}$$

However, if the action taken is much more probable under our current policy than under our former one, this leads to a large policy gradient step and consequently an excessive policy update. Thus we need to constrain this objective function by penalizing changes that move the ratio away from 1, which ensures that the policy update is not too large.

Two **solutions**:
- TRPO (Trust Region Policy Optimization) uses KL divergence constraints outside of the objective function to constrain the policy update. This method is much more complicated to implement and takes more computation time.
- PPO clips the probability ratio directly in the objective function with its clipped surrogate objective function.

$$
L^{CLIP}(\theta)=\hat{\mathbb{E}}_t\left[\min\left(\underbrace{r_t(\theta)\hat{A}_t}_{L^{CPI}},\ \underbrace{\text{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\cdot\hat{A}_t}_{\text{modified surrogate objective}}\right) \right]
\tag{6.4}$$
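
Equation (6.4) can be sketched in plain Python as below. This is a minimal illustration under assumptions: the expectation is replaced by a mean over a small batch, $\epsilon=0.2$ is chosen to match the 0.8 to 1.2 range, and the ratios and advantages are invented numbers.

```python
def clip(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(value, high))

def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Empirical mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    terms = []
    for r, adv in zip(ratios, advantages):
        unclipped = r * adv
        clipped = clip(r, 1.0 - epsilon, 1.0 + epsilon) * adv
        terms.append(min(unclipped, clipped))
    return sum(terms) / len(terms)

# Hypothetical batch of probability ratios and advantage estimates
ratios = [1.5, 0.5, 1.0]
advantages = [2.0, -2.0, 1.0]
l_clip = clipped_surrogate(ratios, advantages)
print(l_clip)
```

Taking the minimum makes the objective a pessimistic bound: a sample can never contribute more than its clipped value, but it can contribute less.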

### Case 1: when the advantage > 0

$\hat{A}_t>0$ means that the action is better than the average of all the actions in that state. Therefore, we should encourage the new policy to increase the probability of taking that action in that state, which means increasing $r_t$. However, we don't want to update the policy too much, because that can lead to a bad policy, so the increase stops being rewarded once $r_t$ exceeds $1+\epsilon$.

### Case 2: when the advantage < 0

$\hat{A}_t<0$ means the action should be discouraged because its outcome was worse than average. $r_t$ will be decreased, but not too much, because clipping removes the incentive to push it below $1-\epsilon$.
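
As a worked illustration of the two cases, assume $\epsilon=0.2$ (matching the 0.8 to 1.2 range); the advantage and ratio values are made up for the example:

$$
\hat{A}_t=2,\ r_t=1.5:\quad \min(1.5\cdot 2,\ 1.2\cdot 2)=\min(3,\ 2.4)=2.4
$$

$$
\hat{A}_t=-2,\ r_t=0.5:\quad \min(0.5\cdot(-2),\ 0.8\cdot(-2))=\min(-1,\ -1.6)=-1.6
$$

In both cases the clipped term is selected, so the objective no longer depends on $r_t$ and there is no gradient pushing the ratio further away from 1.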

The **final** Clipped Surrogate Objective Loss:

$$
L_t^{CLIP+VF+S}(\theta)=\mathbb{\hat{E}}_t\left[L_t^{CLIP}(\theta)-c_1L_t^{VF}(\theta)+c_2S[\pi_\theta](s_t) \right]
\tag{6.5}$$
where:
- $c_1, c_2$ - coefficients
- $L_t^{VF}(\theta)$ - squared-error value loss: $(V_\theta(s_t)-V_t^\text{target})^2$
- $S[\pi_\theta](s_t)$ - additional entropy bonus to ensure sufficient exploration.
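
The three terms of (6.5) can be combined as in the sketch below for a single time step with discrete actions. The coefficient values `c1=0.5` and `c2=0.01`, the function names, and all inputs are assumptions chosen for illustration, not values taken from these notes.

```python
import math

def categorical_entropy(probs):
    """Entropy -sum(p * log p) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def ppo_objective(l_clip, value_pred, value_target, action_probs, c1=0.5, c2=0.01):
    """Per-step objective L^CLIP - c1 * L^VF + c2 * S, to be maximized."""
    value_loss = (value_pred - value_target) ** 2  # squared-error value loss
    entropy = categorical_entropy(action_probs)    # entropy bonus S
    return l_clip - c1 * value_loss + c2 * entropy

# Same clipped term and value error, two hypothetical action distributions
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
obj_uniform = ppo_objective(1.0, 2.0, 1.0, uniform)
obj_peaked = ppo_objective(1.0, 2.0, 1.0, peaked)
print(obj_uniform, obj_peaked)
```

With everything else equal, the more uniform (higher-entropy) policy scores higher, which is how the bonus keeps the agent from collapsing onto one action too early.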

An implementation: https://github.com/simoninithomas/Deep_reinforcement_learning_Course/tree/master/PPO%20with%20Sonic%20the%20Hedgehog

# Step 7: Curiosity-Driven Learning - Part 1

Curiosity-Driven Learning is one of the **most exciting and promising strategies in Deep Reinforcement Learning**.

A current problem in RL is that **the reward function is hard-coded by a human, which is not scalable**.

The idea of Curiosity-Driven Learning is to build a reward function that is intrinsic to the agent. The agent becomes a self-learner, since it is both the student and the one giving feedback.

![](https://cdn-images-1.medium.com/max/1371/1*SI8itmr1PZPkXCBtgIh2Sw.png)

The idea was first introduced in the 2017 paper [Curiosity-driven Exploration by Self-supervised Prediction](https://pathak22.github.io/noreward-rl/). The results were then improved in a second paper, [Large-Scale Study of Curiosity-Driven Learning](https://pathak22.github.io/large-scale-curiosity/).

## Two main problems in Reinforcement Learning

- The **sparse rewards** problem concerns the time gap between an action and its feedback. An agent learns fast if each of its actions has a reward, so that it gets rapid feedback.

We can face sparse rewards in real-time games, where the agent does not get a direct reward for each action. Therefore, a **bad decision may not receive feedback until hours later.**
- *Extrinsic rewards are not scalable*. That means we can't always carry a reward function designed for a specific environment over to other, more complex ones.

## A new reward function: Curiosity
This function is intrinsic to the agent (generated by the agent itself). Curiosity is equal to the **error of our agent in predicting the consequence of its own actions given its current state**.

The idea of curiosity is to encourage the agent to perform actions that reduce the uncertainty in its ability to predict the consequences of its own actions (uncertainty is high in areas where the agent has spent less time, or in areas with complex dynamics).

Measuring this error requires **building a model of the environment dynamics that predicts the next state $s'$ given the current state $s$ and the action $a$**.

## Introducing the Intrinsic Curiosity Module
### The need for a good feature space
**How can the agent predict the next state given current state and action?**

We can define the curiosity as the **error between** the **predicted** new state $s_{t+1}$ given our state $s_t$ and action $a_t$ and the **real** *new state*.

In games, our state is a stack of 4 frames, so prediction is really hard because:
- it is hard to predict pixels directly (there are many of them, each with its own properties)
- predicting pixels is not the right thing to do, according to the paper's authors

![](https://media.giphy.com/media/keavcHqJnWkzS/giphy.gif)Trying to predict the movement of each pixel at each timeframe is really hard

So instead of making predictions in the raw sensory space, we **transform the raw sensory input into a feature space with only relevant information**. To do this we need to:
- model things that can be controlled by the agent
- model things that can't be controlled by the agent but can affect the agent
- not model things that are not in the agent's control and have no effect on the agent.

The desired embedding space should:
- be compact in terms of dimensionality
- preserve sufficient information about the observation
- be stable, because non-stationary rewards make it difficult for reinforcement learning agents to learn.

## Intrinsic Curiosity Module (ICM)
From the paper
![](https://cdn-images-1.medium.com/max/1371/1*JHhacgi6jzpzKtReLgNE2w.png)
The ICM is the system that **helps us generate curiosity**; it is composed of two neural networks.

To learn the feature space we **use self-supervision**, training a neural network on a proxy inverse dynamics task: predicting the agent's action $(\hat{a}_t)$ given its current and next states ($s_t$ and $s_{t+1}$).

Since the neural network is only required to predict the action, **it has no incentive to represent, within its feature embedding space, the factors of variation in the environment that do not affect the agent itself**.

Then we use this feature space to train a forward dynamics model that predicts the feature representation of the next state $\phi(s_{t+1})$, **given the feature representation of the current state $\phi(s_{t})$ and the action $a_t$**.

$$\textbf{Curiosity = predicted}[\phi(s_{t+1})]-\textbf{actual}[\phi(s_{t+1})]
\tag{7.1}$$

So, **we have two models in the ICM**:
- **Inverse Model**: Encodes the states $s_t$ and $s_{t+1}$ into the feature vectors $\phi(s_t)$ and $\phi(s_{t+1})$, which are trained to predict the action $\hat{a}_t$:

$$
\underbrace{\hat{a}_t}_{\text{predicted action}}=\underbrace{g}_{\text{learning function}}\left(s_t,s_{t+1};\underbrace{\theta_I}_{\text{inverse model parameters}}\right)
\tag{7.2}$$

The inverse loss function measures the difference between the real and predicted actions:

$$\min_{\theta_I}L_I(\hat{a}_t,a_t)\tag{7.3}$$

- **Forward Model**: Takes as input $\phi(s_t)$ and $a_t$ and predicts the feature representation $\phi(s_{t+1})$ of $s_{t+1}$:

$$
\underbrace{\hat{\phi}(s_{t+1})}_{\begin{matrix}\text{predicted feature represen-}\\\text{tation of the next state}\end{matrix}} =  \overbrace{f}^{\begin{matrix}\text{forward}\\\text{model}\end{matrix}}\left(\underbrace{\phi(s_t)}_{\begin{matrix}\text{feature vector}\\ \text{of state}\end{matrix}},a_t;\theta_F\right)
\tag{7.4}$$

Forward model loss function:

$$
L_F\left(\phi(s_t),\hat{\phi}(s_{t+1})\right)=\frac{1}{2}\left\| \hat{\phi}(s_{t+1})-\phi(s_{t+1})\right\|_2^2
\tag{7.5}$$
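
The two ICM losses can be sketched as below. The forward loss follows (7.5) directly. For the inverse loss (7.3) the notes do not fix a form, so this sketch assumes discrete actions and uses a softmax cross-entropy; the function names, logits and feature vectors are all hypothetical.

```python
import math

def softmax(logits):
    """Turn raw action scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def inverse_loss(action_logits, true_action):
    """Assumed form of L_I: cross-entropy between predicted and real action."""
    return -math.log(softmax(action_logits)[true_action])

def forward_loss(predicted_features, actual_features):
    """L_F from (7.5): half the squared L2 distance between feature vectors."""
    return 0.5 * sum((p - a) ** 2 for p, a in zip(predicted_features, actual_features))

# Hypothetical outputs of the two networks for one transition
l_inverse_unsure = inverse_loss([0.0, 0.0, 0.0], true_action=1)     # no preference yet
l_inverse_confident = inverse_loss([0.0, 5.0, 0.0], true_action=1)  # favors the real action
l_forward = forward_loss([1.0, 2.0], [0.0, 0.0])
print(l_inverse_unsure, l_inverse_confident, l_forward)
```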

Then mathematically speaking, **curiosity** will **be** the **difference between** ***predicted feature vector*** of the next state and its ***real feature***.

$$
\underbrace{r^i_t}_{\text{curiosity}}=\frac{1}{2}\cdot\overbrace{\eta}^{\begin{matrix}\text{scaling}\\\text{factor}\end{matrix}}\cdot\left\|\underbrace{\hat{\phi}(s_{t+1})}_{\begin{matrix}\text{predicted feature vector} \\ \text{of next state}\end{matrix}} -\underbrace{\phi(s_{t+1})}_{\begin{matrix}\text{real feature vector} \\ \text{of next state}\end{matrix}} \right\|_2^2
\tag{7.6}$$
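
A minimal sketch of the intrinsic reward in (7.6). The feature vectors and the default `eta=1.0` are invented for illustration; in practice both feature vectors would come from the learned encoder and forward model.

```python
def curiosity_reward(predicted_features, actual_features, eta=1.0):
    """r^i_t = (eta / 2) * ||predicted phi(s_{t+1}) - actual phi(s_{t+1})||^2."""
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted_features, actual_features))
    return 0.5 * eta * squared_error

# Hypothetical transitions: one the forward model predicts well, one it does not
familiar = curiosity_reward([0.9, 0.1], [1.0, 0.0])
novel = curiosity_reward([0.0, 1.0], [1.0, 0.0])
print(familiar, novel)
```

The poorly predicted transition earns the larger reward, which is what steers the agent toward states it does not yet understand.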

### The overall optimization problem

$$
\min_{\theta_P, \theta_I, \theta_F}\left[-\lambda\mathbb{E}_{\pi(s_t;\theta_P)}\left[\sum_t {r_t}\right] + (1-\beta)L_I+\beta L_F\right]
\tag{7.7}
$$

where:
- $\theta_P, \theta_I, \theta_F$ - parameters of policy, inverse & forward models
- $\lambda>0$ is a scalar that weighs the importance of the policy gradient loss against the importance of learning the intrinsic reward
- $0<\beta<1$ - scalar that weighs the inverse model loss against the forward model loss
- $L_I$ - inverse model loss
- $L_F$ - forward model loss
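
The weighting in (7.7) can be sketched as a single scalar loss. This is an illustration under assumptions: the expectation over the policy is replaced by one sampled sum of rewards, and the values `lam=0.1` and `beta=0.2` as well as the inputs are chosen only for the example.

```python
def icm_total_loss(sum_of_rewards, inverse_loss, forward_loss, lam=0.1, beta=0.2):
    """-lambda * (sum of rewards) + (1 - beta) * L_I + beta * L_F, to be minimized."""
    return -lam * sum_of_rewards + (1.0 - beta) * inverse_loss + beta * forward_loss

# Hypothetical values for one sampled trajectory
loss = icm_total_loss(sum_of_rewards=10.0, inverse_loss=1.0, forward_loss=2.0)
loss_more_reward = icm_total_loss(sum_of_rewards=20.0, inverse_loss=1.0, forward_loss=2.0)
print(loss, loss_more_reward)
```

Because the reward term enters with a minus sign, minimizing this loss maximizes the expected reward while also fitting the inverse and forward models.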

## Recap for step 7. Curiosity-Driven Learning
- Because extrinsic rewards are hard to implement at scale and rewards are often sparse, **we want to create a reward that is intrinsic to the agent**.
- To do that we created curiosity, **which is the agent's error in predicting the consequence of its action given its current state**.
- Using curiosity will push our agent to **favor transitions with high prediction error** (which will be higher **in areas where the agent has spent less time**, or in areas with complex dynamics) and consequently better explore our environment.
- But because we can't predict the next state by predicting the next frame (too complicated), we use a **better feature representation that keeps only the elements that can be controlled by our agent or that affect our agent**.
- To generate curiosity, we use the **Intrinsic Curiosity Module (ICM)**, which is composed of two models: the **Inverse model**, used to learn the feature representation of the state and next state, and the **Forward Dynamics model**, used to generate the predicted feature representation of the next state.
- Curiosity will be equal to **the difference between the predicted $\hat{\phi}(s_{t+1})$ (Forward Dynamics Model) and the real $\phi(s_{t+1})$ (Inverse Dynamics Model)**.
