# Reinforcement Learning Part 2
<center><img src='https://cdn-media-1.freecodecamp.org/images/1*rvGVriKT_aLeLKvAP16S0A.gif' width=1000>

# Terminology and Notation

<center><img src="https://cdn-images-1.medium.com/max/1000/1*vz3AN1mBUR2cr_jEG8s7Mg.png" width=500>

* $s_t$ - State: pixels on the screen (what Mario sees)

* $a_t$ - Action: for Mario, move right, left, up, or down

* $V_\pi(s)$ - State value: how good it is to be in state $s$ when following policy $\pi$

* $Q_\pi(s,a)$ - Action value: how good it is to be in state $s$ and take action $a$ when following policy $\pi$

* $\tau = s_1, a_1, ..., s_T, a_T$ - Finite Trajectory: sequence of states & actions

# Return

* $G_t = R_{t+1} + \gamma R_{t+2} + ... = \sum_{k=0}^{\infty} \gamma^kR_{t+k+1}$
* Sum of discounted rewards going forward 

* $\gamma \in [0,1]$ - discount factor
    * Discounts (down-weights) future rewards
    * Rewards now are better than rewards in the future
    * Provides mathematical convenience: with $\gamma < 1$ and bounded rewards, the infinite sum stays finite
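
A minimal sketch of the return for a finite episode, assuming the rewards $R_{t+1}, R_{t+2}, \dots$ are already collected in a list. The function name `discounted_return` and the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Return G_t for a finite list of rewards [R_{t+1}, R_{t+2}, ...]."""
    g = 0.0
    # Work backwards so each step applies G_k = R_{k+1} + gamma * G_{k+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]  # made-up rewards for illustration
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

Working backwards uses the same recursion $G_t = R_{t+1} + \gamma G_{t+1}$ that the value function derivation relies on.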

# State Value Function

* $V_\pi(s) = \mathbb{E}_\pi[G_t | S_t = s]$
* Expected return from state $s$, at time $t$, following policy $\pi$

\begin{aligned}
V(s) &= \mathbb{E}[G_t \vert S_t = s] \\
&= \mathbb{E} [R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \vert S_t = s] \\
&= \mathbb{E} [R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \dots) \vert S_t = s] \\
&= \mathbb{E} [R_{t+1} + \gamma G_{t+1} \vert S_t = s] \\
&= \mathbb{E} [R_{t+1} + \gamma V(S_{t+1}) \vert S_t = s]
\end{aligned}
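
The last line of the derivation can be turned into an iterative update. The sketch below assumes a hypothetical 2-state process whose transitions under some fixed policy are known. The `transitions` table and the function `bellman_backup` are illustrative names, not part of the original material.

```python
# Hypothetical 2-state process induced by some fixed policy.
# transitions[s] is a list of (probability, reward, next_state) tuples.
transitions = {
    0: [(1.0, 1.0, 1)],
    1: [(0.5, 0.0, 0), (0.5, 2.0, 1)],
}
gamma = 0.9


def bellman_backup(V, s):
    """Compute E[R_{t+1} + gamma * V(S_{t+1}) | S_t = s]."""
    return sum(p * (r + gamma * V[s_next]) for p, r, s_next in transitions[s])


# Start from zero and apply the backup repeatedly until the values settle.
V = {0: 0.0, 1: 0.0}
for _ in range(500):
    V = {s: bellman_backup(V, s) for s in transitions}

print(V)
```

For this toy process both values settle at 10, which is the point where $V(s)$ equals its own one-step backup.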

# Action Value Function

* $Q_\pi(s, a) = \mathbb{E}_\pi[G_t | S_t = s, A_t=a]$
* Expected return from state $s$, at time $t$, taking action $a$, following policy $\pi$

<center><img src="https://lilianweng.github.io/lil-log/assets/images/bellman_equation.png">

# Recovering the State Value Function

* How can I write the value function, $V_\pi(s)$, using $Q_\pi(s,a)$?

* $V_\pi(s) = \sum_{a\in A}Q_\pi(s,a)\pi(a|s)$

\begin{aligned}
Q(s, a) 
&= \mathbb{E} [R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s, A_t = a] \\
&= \mathbb{E} [R_{t+1} + \gamma \mathbb{E}_{a\sim\pi} Q(S_{t+1}, a) \mid S_t = s, A_t = a]
\end{aligned}
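
A small sketch of the identity $V_\pi(s) = \sum_{a\in A}Q_\pi(s,a)\pi(a|s)$ for a single state. The Q-values and policy probabilities are made-up numbers, and `state_value` is a hypothetical helper.

```python
def state_value(q_values, policy_probs):
    """V(s) as the policy-weighted average of Q(s, a) over actions."""
    return sum(q * p for q, p in zip(q_values, policy_probs))


q_s = [1.0, 3.0, 0.0]      # made-up Q(s, a) for three actions
pi_s = [0.5, 0.25, 0.25]   # made-up pi(a | s), sums to 1
print(state_value(q_s, pi_s))  # 0.5*1 + 0.25*3 + 0.25*0 = 1.25
```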

# $V_\pi(s)$ vs $Q_\pi(s,a)$

* $Q_\pi(s,a)$ is typically more useful
    * Tells us which actions to take

* $V_\pi(s)$ is only useful if we know the transition dynamics $P(S_{t+1}|S_t, A_t)$
    * If we know how to get from one state to another

* We'll focus on $Q_\pi(s,a)$

# Value of Goals
<center><img src="https://i.gadgets360cdn.com/large/google_deepmind_lab_1480999487929.gif" width=1000>

# Optimal Policies

* The best (optimal) policy $\pi_*$ achieves the highest value for every state-action pair

* $\pi_* = \text{argmax}_\pi Q_\pi(s,a)$

# Parameterized Action Value Function

* Use neural network with parameters $\theta$
* $Q_\theta(s,a)$

* Example: 
    * $s$ is an image (pixels of an Atari game)
    * $a$ is an action in the game (left, right, shoot)

* Output a Q-value for every action in a single forward pass
    * More efficient than passing each action through the network separately
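
A minimal sketch of the "one output per action" idea. It assumes a single linear layer as a stand-in for the neural network and a small state vector instead of an image. The class `LinearQ` and all sizes are hypothetical.

```python
import random


class LinearQ:
    """Toy stand-in for Q_theta: maps a state vector to one Q-value per action."""

    def __init__(self, state_dim, n_actions, seed=0):
        rng = random.Random(seed)
        # theta: one weight row and one bias per action.
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(state_dim)]
                        for _ in range(n_actions)]
        self.biases = [0.0] * n_actions

    def __call__(self, state):
        # A single forward pass returns Q(s, a) for every action a.
        return [sum(w * x for w, x in zip(row, state)) + b
                for row, b in zip(self.weights, self.biases)]


q_net = LinearQ(state_dim=4, n_actions=3)
state = [0.1, -0.2, 0.3, 0.0]  # made-up state features
q_values = q_net(state)
greedy_action = max(range(len(q_values)), key=lambda a: q_values[a])
print(q_values, greedy_action)
```

With all Q-values available from one call, picking the greedy action is just an argmax over the output list.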

# Temporal Difference (TD) Learning

* Build update targets from existing estimates instead of waiting for actual results
* This is known as **bootstrapping**

* For the optimal action-value function, these are equal in expectation
* $Q_*(S_t, A_t) = \mathbb{E}[R_{t+1} + \gamma \max_{a\in A}Q_*(S_{t+1}, a)]$

* Because $Q_\theta$ is only an approximation, the two sides will not match exactly
* We'll update $Q_\theta(S_t, A_t)$ to be closer to $R_{t+1} + \gamma \max_{a\in A}Q_\theta(S_{t+1}, a)$
* $Q_\theta(S_t, A_t) \approx R_{t+1} + \gamma \max_{a\in A}Q_\theta(S_{t+1}, a)$
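
A short sketch of the bootstrapped target $R_{t+1} + \gamma \max_{a\in A}Q_\theta(S_{t+1}, a)$. The `done` flag is an assumption added here: at the end of an episode there is no next state to bootstrap from, so the target is just the reward. The function name and numbers are illustrative.

```python
def td_target(reward, next_q_values, gamma, done=False):
    """Bootstrapped target built from the current estimates for S_{t+1}."""
    if done:
        # Assumption: no bootstrapping past the end of an episode.
        return reward
    return reward + gamma * max(next_q_values)


# Made-up reward and Q_theta(S_{t+1}, a) estimates for three actions.
target = td_target(1.0, [0.5, 2.0, 1.0], gamma=0.9)
print(target)  # 1 + 0.9 * 2.0
```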

# TD vs Monte Carlo

<center><img src="https://lilianweng.github.io/lil-log/assets/images/TD_MC_DP_backups.png" width=1200>

* $R_{t+1}$ is the reward we observed after our network took action $a$
* We want to make the value of $Q_\theta(S_{t}, a)$ be more accurate so we'll bootstrap a target using the next state we see $S_{t+1}$
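* The target contains a reward we actually observed, so it carries more information than the current estimate alone, and unlike Monte Carlo we do not have to wait for the episode to finish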

# Q-Learning: Off Policy TD Control

* Start from state $S_t$, pick $A_t = \text{argmax}_{a\in A}Q_\theta(S_t, a)$
    * Use $\epsilon$-greedy: with probability $\epsilon$ pick a random action instead

* Take action $A_t$, observe reward $R_{t+1}$, and move to state $S_{t+1}$

* Update action-value function:  
    $Q_\theta(S_t, A_t) \leftarrow Q_\theta(S_t, A_t) + \alpha(R_{t+1} + \gamma \max_{a\in A}Q_\theta(S_{t+1}, a) - Q_\theta(S_t, A_t))$

* Repeat

The quantity $R_{t+1} + \gamma \max_{a\in A}Q_\theta(S_{t+1}, a) - Q_\theta(S_t, A_t)$ is known as the TD error.
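
The loop above can be sketched in tabular form, with a dictionary in place of $Q_\theta$. Everything about the environment is an assumption for illustration: a hypothetical 4-state corridor where action 1 moves right, action 0 moves left, and reaching state 3 pays +1 and ends the episode. The hyperparameters are also made up.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, n_actions, epsilon, rng):
    """With probability epsilon explore, otherwise take the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])


def q_learning_update(Q, s, a, r, s_next, n_actions, alpha, gamma, done):
    """Move Q(s, a) toward the bootstrapped target; return the TD error."""
    best_next = 0.0 if done else max(Q[(s_next, b)] for b in range(n_actions))
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


def step(state, action):
    """Hypothetical corridor environment (not from the original slides)."""
    next_state = min(state + 1, 3) if action == 1 else max(state - 1, 0)
    done = next_state == 3
    return next_state, (1.0 if done else 0.0), done


rng = random.Random(0)
Q = defaultdict(float)  # every Q(s, a) starts at 0
for episode in range(500):
    s, done = 0, False
    while not done:
        a = epsilon_greedy(Q, s, 2, epsilon=0.2, rng=rng)
        s_next, r, done = step(s, a)
        q_learning_update(Q, s, a, r, s_next, 2, alpha=0.5, gamma=0.9, done=done)
        s = s_next

print({k: round(v, 3) for k, v in sorted(Q.items())})
```

The update always bootstraps from the greedy action in $S_{t+1}$, even when the behaviour policy explored, which is what makes this off-policy.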

# Problems with Q-Learning

* Targets are constantly changing
    * Because we're bootstrapping
    * Every time we update $Q_\theta$ the bootstrapped targets change

* Samples are sequentially correlated
    * Consecutive samples $S_t$ and $S_{t+1}$ come from the same trajectory, so they are not independent

# Experience Replay

* Store transitions $e_t=(S_t,A_t,R_{t+1},S_{t+1})$ in replay memory $D_t=\{e_1,\dots,e_t\}$
* During Q-learning updates, samples are drawn at random from the replay memory

* Improves data efficiency
* Removes correlations in the observation sequences
* Smooths over changes in the data distribution
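
A minimal sketch of a replay memory built on the standard library. The fixed `capacity` (oldest transitions are dropped first) and the extra `done` field are assumptions added for illustration, as is the class name `ReplayMemory`.

```python
import random
from collections import deque


class ReplayMemory:
    """Stores transitions and returns uniformly sampled mini-batches."""

    def __init__(self, capacity, seed=None):
        # Assumption: bounded memory, oldest transition evicted when full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up sequential order.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=3, seed=0)
for t in range(5):
    memory.push(state=t, action=0, reward=0.0, next_state=t + 1, done=False)
batch = memory.sample(2)
print(len(memory), batch)
```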

# Periodically Updated Targets

* Two neural networks
    * $Q_\theta$: Updated every time step
    * $Q_{\theta^{(T)}}$: Target network, updated every $C$ steps

* $Q_{\theta^{(T)}}$ provides stable targets to learn from

* Every $C$ steps $Q_{\theta^{(T)}} \leftarrow Q_\theta$
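
A sketch of the periodic copy, assuming the parameters live in a plain dictionary. The "training step" is a placeholder increment rather than a real gradient update, and the value of `C` is made up.

```python
import copy

C = 4  # hypothetical sync period

theta = {"w": [0.0, 0.0]}            # stand-in for the online parameters
theta_target = copy.deepcopy(theta)  # target network starts as a copy

sync_steps = []
for step_count in range(1, 11):
    # Placeholder for one gradient update on the online network.
    theta["w"] = [w + 0.1 for w in theta["w"]]

    # Every C steps, overwrite the target parameters with the online ones.
    if step_count % C == 0:
        theta_target = copy.deepcopy(theta)
        sync_steps.append(step_count)

print(sync_steps, theta["w"], theta_target["w"])
```

Between syncs the target parameters stay frozen, so the bootstrapped targets do not shift with every update to $Q_\theta$.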

# Q-Learning with Experience Replay and Target Network

\begin{equation}
\mathcal{L}(\theta) = \mathbb{E}_{(s, a, r, s_{t+1}) \sim U(D)} \Big[ \big( r + \gamma \max_{a_{t+1}} Q_{\theta^{(T)}}(s_{t+1}, a_{t+1}) - Q_\theta(s, a) \big)^2 \Big]
\end{equation}
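
A sketch of this loss on one sampled mini-batch, with the expectation replaced by a batch average. It assumes `q_online` and `q_target` are callables returning a list of Q-values per action, and it adds a `done` flag (not in the equation) to stop bootstrapping at episode ends. The lookup tables and batch are made-up numbers.

```python
def dqn_loss(batch, q_online, q_target, gamma):
    """Mean squared TD error over (s, a, r, s_next, done) tuples."""
    total = 0.0
    for s, a, r, s_next, done in batch:
        # Targets come from the periodically updated target network.
        bootstrap = 0.0 if done else max(q_target(s_next))
        target = r + gamma * bootstrap
        total += (target - q_online(s)[a]) ** 2
    return total / len(batch)


# Made-up lookup tables standing in for the two networks.
online_table = {0: [0.0, 1.0], 1: [2.0, 0.0]}
target_table = {0: [0.0, 0.5], 1: [1.0, 3.0]}

batch = [(0, 1, 1.0, 1, False), (1, 0, 0.0, 0, True)]
loss = dqn_loss(batch, online_table.__getitem__, target_table.__getitem__, gamma=0.5)
print(loss)  # ((2.5 - 1.0)**2 + (0.0 - 2.0)**2) / 2 = 3.125
```

In a real implementation only $Q_\theta$ would receive gradients from this loss, while $Q_{\theta^{(T)}}$ is treated as a constant.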

<center><img src="https://lilianweng.github.io/lil-log/assets/images/DQN_algorithm.png" width=1000>