<a href="https://colab.research.google.com/github/lblogan14/reinforcement_learning_with_tensorflow/blob/master/ch6_asynchronous.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Review:
![alt text](https://github.com/lblogan14/reinforcement_learning_with_tensorflow/blob/master/note_images/ch4/actor-critic.PNG?raw=true)

Recall the structure of the actor-critic algorithm from Chapter 4. \\
The **Actor** takes the current environment state and determines the best action to take; \\
the **Critic** plays a policy-evaluation role: it takes the environment state and the chosen action and returns a score indicating how good that action is in that state.

Thus, the actor-critic algorithm learns both the policy and the state-action value function.
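
As a reminder of how the two roles interact, here is a minimal tabular sketch. It assumes a softmax actor over action preferences and a critic stored as a Q-table; the names `actor_critic_step`, `prefs`, `q_table` and the step sizes are illustrative and do not come from Chapter 4.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def actor_critic_step(prefs, q_table, s, a, r, s_next, a_next, done,
                      gamma=0.99, actor_lr=0.1, critic_lr=0.1):
    """One actor-critic update on tabular parameters (both arrays are modified in place)."""
    # Critic: one-step TD error for the state-action value Q(s, a)
    target = r if done else r + gamma * q_table[s_next, a_next]
    td_error = target - q_table[s, a]
    # Actor: gradient of log softmax policy, weighted by the critic's score Q(s, a)
    probs = softmax(prefs[s])
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    prefs[s] += actor_lr * q_table[s, a] * grad_log_pi
    # Critic update happens after the actor has used the old score
    q_table[s, a] += critic_lr * td_error
    return td_error
```

A positive score from the critic raises the preference for the action that was taken, and a negative score lowers it.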

#Asynchronous Methods
A Deep Q-Network (DQN) uses experience replay to train the deep neural network that estimates Q-values, from which the most favorable action (the one with the maximum Q-value) is selected. However, the replay memory consumes a lot of memory and training becomes computationally heavy over time. Asynchronous methods are designed to overcome this issue. Instead of using experience replay, multiple instances of the environment are created and multiple agents asynchronously execute actions in parallel:
![alt text](https://github.com/lblogan14/reinforcement_learning_with_tensorflow/blob/master/note_images/ch6/asy_method.PNG?raw=true)

Each thread runs a learner, that is, an agent network that interacts with its own copy of the environment. Multiple learners run in parallel, each exploring its own environment. This parallelism lets the agents experience a variety of different states at any given time step, and it works with both off-policy and on-policy learning algorithms. The learners use different exploration policies, which maximizes this diversity. Because different learners explore differently, the parameter updates they produce are unlikely to be correlated in time. Therefore, experience replay memory is not required.
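
The setup described above can be sketched with Python threads. This is only an illustration: `ChainEnv` is a hypothetical toy environment, the placeholder policy stands in for an agent network, and the exploration rates are arbitrary values chosen for the example.

```python
import random
import threading

class ChainEnv:
    """Hypothetical toy environment: walk left/right on a line of n states."""
    def __init__(self, n=5):
        self.n = n
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):                      # 0 = left, 1 = right
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.n - 1)
        done = self.state == self.n - 1
        return self.state, (1.0 if done else 0.0), done

def learner(worker_id, epsilon, steps, visited):
    env = ChainEnv()                             # each learner owns its own environment copy
    rng = random.Random(worker_id)               # and its own random stream
    s = env.reset()
    for _ in range(steps):
        # Placeholder policy: the "greedy" action is right; explore with probability epsilon
        a = rng.randrange(2) if rng.random() < epsilon else 1
        s, _, done = env.step(a)
        visited[worker_id].add(s)
        if done:
            s = env.reset()

epsilons = [0.1, 0.3, 0.5, 0.9]                  # one exploration rate per thread
visited = [set() for _ in epsilons]
threads = [threading.Thread(target=learner, args=(i, eps, 200, visited))
           for i, eps in enumerate(epsilons)]
for th in threads:
    th.start()
for th in threads:
    th.join()
```

Each thread writes only to its own entry of `visited`, so no locking is needed in this sketch.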

Examples of asynchronous methods:
* Asynchronous one-step Q-learning
* Asynchronous one-step SARSA
* Asynchronous n-step Q-learning
* Asynchronous advantage actor critic (A3C)

#Asynchronous one-step Q-learning
As in DQN, each agent is represented by a primary network and a target network. The **one-step loss** is the square of the difference between the state-action value of the current state $s$ predicted by the primary network and the target state-action value of the current state calculated with the target network.

(New for asynchronous) Multiple learning agents run in parallel, each calculating the one-step loss. The gradient calculation therefore occurs in parallel in different threads, where each learning agent interacts with its own copy of the environment. Each thread accumulates its gradients over multiple time steps and uses them to update the policy network parameters after a fixed number of time steps or when an episode is over. Accumulating gradients is preferred over updating the policy network parameters at every step because it reduces the chance of learner agents overwriting each other's changes.
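
The accumulate-then-apply pattern can be sketched in isolation. The names `SharedParams` and `worker` are hypothetical, the per-step gradients are dummy arrays, and the lock is an assumption made here to keep the example deterministic; the text above does not say whether updates are locked.

```python
import threading
import numpy as np

class SharedParams:
    """Shared policy parameters; the lock is an assumption made for this sketch."""
    def __init__(self, size):
        self.theta = np.zeros(size)
        self.lock = threading.Lock()

    def apply(self, grad, lr):
        with self.lock:
            self.theta -= lr * grad

def worker(shared, grads_per_step, async_update, lr):
    d_theta = np.zeros_like(shared.theta)        # thread-local accumulated gradient
    for t, g in enumerate(grads_per_step, start=1):
        d_theta += g                             # accumulate instead of updating every step
        if t % async_update == 0 or t == len(grads_per_step):
            shared.apply(d_theta, lr)            # one asynchronous update of theta
            d_theta[:] = 0.0                     # clear gradients

shared = SharedParams(3)
grads = [np.ones(3) for _ in range(10)]          # dummy per-step gradients
threads = [threading.Thread(target=worker, args=(shared, grads, 5, 0.1))
           for _ in range(2)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(shared.theta)
```

Two workers each apply two accumulated updates of five unit gradients, so every component of `theta` ends at -2.0 regardless of how the threads interleave.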

Giving each thread a different exploration policy makes the learning more diverse and robust. Performance improves owing to better exploration, because the learning agents in different threads explore different parts of the environment.

Pseudocode for Asynchronous one-step Q-learning: \\
where \\
* $\theta$: parameters of the policy network
* $\theta^t$: parameters of the target network
* $T$: overall time step counter

`// Globally shared parameters ` $\theta, \theta^t$ and $T$ \\
`// ` $\theta$ `is initialized arbitrarily` \\
`// T is initialized to 0` \\
 \\
`Initialize thread level time step counter t=0` \\
`Initialize ` $\theta^t=\theta$ \\
`Initialize network gradients` $d\theta=0$ \\
`Start with the initial state s` \\
`repeat until ` $T>T_{\max}:$ \\
$\quad$ `Choose action a with ` $\epsilon$ `-greedy policy such that:` \\
$\quad\quad a=\{\begin{array}{l}a\,random\, action\quad ,with\,probability\,\epsilon \\ \arg\max_{a'}Q(\phi(s),a';\theta) \quad otherwise \end{array}$ \\
$\quad$ `Perform action a` \\
$\quad$ `Receive new state s' and reward r` \\
$\quad$ `Compute target y: ` $y = \{ \begin{array}{l}
                                                 r \quad , for\,terminal\,s'\\
                                                 r+\gamma\max_{a'}Q(s',a';\theta^t) \quad ,otherwise
                                                 \end{array}$ \\
$\quad$ `Compute the loss, ` $L(\theta)=(y-Q(s,a;\theta))^2$ \\
$\quad$ `Accumulate the gradient w.r.t. ` $\theta : d\theta=d\theta+\frac{\partial L(\theta)}{\partial\theta}$ \\
$\quad$ `s = s'` \\
$\quad$ `T = T + 1` \\
$\quad$ `t = t + 1` \\
 \\
$\quad$ `if T mod ` $I_{target}==0:$ \\
$\quad\quad$ `Update the parameters of target network: ` $\theta^t=\theta$ \\
$\quad\quad$ `# After every ` $I_{target}$ `time steps the parameters of the target network are updated` \\
 \\
$\quad$ `if t mod ` $I_{AsyncUpdate}==0$ `or s = terminal state:` \\
$\quad\quad$ `Asynchronous update of ` $\theta$ `using ` $d\theta$ \\
$\quad\quad$ `Clear gradients: ` $d\theta=0$ \\
$\quad\quad$ `#at every ` $I_{AsyncUpdate}$ `time step in the thread or if s is the terminal state` \\
$\quad\quad$ `# update ` $\theta$ `using accumulated gradients ` $d\theta$
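
The pseudocode above can be turned into a small runnable sketch. Several things here are assumptions made for illustration: a Q-table stands in for the policy and target networks, `env_step` is a hypothetical five-state chain environment, the state is reset to 0 after a terminal state, ties in the greedy action are broken at random, a lock guards the shared updates, and all hyperparameter values are arbitrary.

```python
import threading
import numpy as np

N_STATES, N_ACTIONS = 5, 2                  # toy chain: action 1 = right, action 0 = left
GAMMA, LR = 0.9, 0.05
T_MAX, I_TARGET, I_ASYNC = 5000, 100, 5

def env_step(s, a):
    """Hypothetical chain environment: reward 1 on reaching the right end."""
    s_next = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done

class Shared:
    """Globally shared theta, theta^t and T."""
    def __init__(self):
        self.theta = np.zeros((N_STATES, N_ACTIONS))
        self.theta_target = self.theta.copy()
        self.T = 0
        self.lock = threading.Lock()

def learner(shared, epsilon, seed):
    rng = np.random.default_rng(seed)
    d_theta = np.zeros_like(shared.theta)   # thread-local accumulated gradients
    t, s = 0, 0
    while shared.T <= T_MAX:                # repeat until T > T_max
        q = shared.theta[s].copy()
        if rng.random() < epsilon:          # epsilon-greedy action selection
            a = int(rng.integers(N_ACTIONS))
        else:                               # greedy, ties broken at random
            a = int(rng.choice(np.flatnonzero(q == q.max())))
        s_next, r, done = env_step(s, a)
        # One-step target computed with the target parameters
        y = r if done else r + GAMMA * np.max(shared.theta_target[s_next])
        # Gradient of (y - Q(s,a))^2 with respect to the table entry Q(s,a)
        d_theta[s, a] += -2.0 * (y - q[a])
        s = 0 if done else s_next           # start a new episode after a terminal state
        t += 1
        with shared.lock:
            shared.T += 1
            if shared.T % I_TARGET == 0:    # periodic target update
                shared.theta_target = shared.theta.copy()
            if t % I_ASYNC == 0 or done:    # asynchronous update of theta
                shared.theta -= LR * d_theta
                d_theta[:] = 0.0

shared = Shared()
threads = [threading.Thread(target=learner, args=(shared, eps, i))
           for i, eps in enumerate([0.1, 0.3, 0.5, 0.9])]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(np.round(shared.theta, 2))
```

Each learner gets its own exploration rate and random seed, so the threads visit the chain differently while writing to the same `theta`.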

#Asynchronous one-step SARSA
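Asynchronous one-step SARSA follows the same scheme as asynchronous one-step Q-learning, except for the target: it uses the $\epsilon$-greedy policy to choose the action $a'$ for the next state $s'$, and the Q-value of that next state-action pair, $Q(s',a'; \theta^t)$, is used to calculate the target state-action value of the current state.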
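
To make the difference from Q-learning concrete, the two targets can be written side by side. This is a minimal sketch in which a NumPy array `q_target` of shape `[n_states, n_actions]` stands in for the target network; the function names are illustrative.

```python
import numpy as np

def q_learning_target(q_target, r, s_next, done, gamma=0.99):
    """Off-policy target: bootstrap from the best action in s_next."""
    return r if done else r + gamma * np.max(q_target[s_next])

def sarsa_target(q_target, r, s_next, a_next, done, gamma=0.99):
    """On-policy target: bootstrap from the action a_next actually chosen in s_next."""
    return r if done else r + gamma * q_target[s_next, a_next]

q_target = np.array([[0.0, 0.0],
                     [1.0, 3.0]])
print(q_learning_target(q_target, r=1.0, s_next=1, done=False, gamma=0.5))        # 2.5
print(sarsa_target(q_target, r=1.0, s_next=1, a_next=0, done=False, gamma=0.5))   # 1.5
```

The two targets coincide only when the $\epsilon$-greedy choice $a'$ happens to be the greedy action.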

Pseudocode for Asynchronous one-step SARSA: \\
where \\
* $\theta$: parameters of the policy network
* $\theta^t$: parameters of the target network
* $T$: overall time step counter

`// Globally shared parameters ` $\theta, \theta^t$ and $T$ \\
`// ` $\theta$ `is initialized arbitrarily` \\
`// T is initialized to 0` \\
 \\
`Initialize thread level time step counter t=0` \\
`Initialize ` $\theta^t=\theta$ \\
`Initialize network gradients` $d\theta=0$ \\
`Start with the initial state s` \\
`Choose action a with ` $\epsilon$ `-greedy policy such that:` \\
$\quad a=\{\begin{array}{l}a\,random\, action\quad ,with\,probability\,\epsilon \\ \arg\max_{a'}Q(\phi(s),a';\theta) \quad otherwise \end{array}$ \\
`repeat until ` $T>T_{\max}$ : \\
$\quad$ `Perform action a` \\
$\quad$ `Receive new state s' and reward r` \\
$\quad$ `Choose action a' with ` $\epsilon$ `-greedy policy such that:` \\
$\quad\quad a'=\{\begin{array}{l}a\,random\, action\quad ,with\,probability\,\epsilon \\ \arg\max_{a''}Q(\phi(s'),a'';\theta) \quad otherwise \end{array}$ \\
$\quad$ `Compute target y: ` $y = \{ \begin{array}{l}
                                                 r \quad , for\,terminal\,s'\\
                                                 r+\gamma Q(s',a';\theta^t) \quad ,otherwise
                                                 \end{array}$ \\
$\quad$ `Compute the loss, ` $L(\theta)=(y-Q(s,a;\theta))^2$ \\
$\quad$ `Accumulate the gradient w.r.t. ` $\theta : d\theta=d\theta+\frac{\partial L(\theta)}{\partial\theta}$ \\
$\quad$ `s = s'` \\
$\quad$ `T = T + 1` \\
$\quad$ `t = t + 1` \\
$\quad$ `a = a'` \\
 \\
$\quad$ `if T mod ` $I_{target}==0:$ \\
$\quad\quad$ `Update the parameters of target network: ` $\theta^t=\theta$ \\
$\quad\quad$ `# After every ` $I_{target}$ `time steps the parameters of the target network are updated` \\
 \\
$\quad$ `if t mod ` $I_{AsyncUpdate}==0$ `or s = terminal state:` \\
$\quad\quad$ `Asynchronous update of ` $\theta$ `using ` $d\theta$ \\
$\quad\quad$ `Clear gradients: ` $d\theta=0$ \\
$\quad\quad$ `#at every ` $I_{AsyncUpdate}$ `time step in the thread or if s is the terminal state` \\
$\quad\quad$ `# update ` $\theta$ `using accumulated gradients ` $d\theta$

#Asynchronous n-step Q-learning
Asynchronous n-step Q-learning is similar to asynchronous one-step Q-learning, except that the learning agent selects actions using its exploration policy for up to $t_{\max}$ steps, or until a terminal state is reached, before computing a single update of the policy network parameters. The loss for each time step is calculated as the square of the difference between the discounted future rewards at that time step and the estimated Q-value.

The loss gradient with respect to thread-specific network parameters for each time step is calculated and accumulated. There are multiple such
learning agents running and accumulating the gradients in parallel. These accumulated
gradients are used to perform asynchronous updates of policy network parameters.
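
The discounted future rewards for every step of a rollout can be computed with a single backward pass. The sketch below assumes the rollout's rewards are already collected in a list and that `bootstrap_value` is 0 for a terminal state or $\max_a Q(s_t,a;\theta^t)$ otherwise; the function names are illustrative.

```python
import numpy as np

def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Return the discounted return for each step, oldest step first."""
    R = bootstrap_value
    returns = []
    for r in reversed(rewards):              # iterate from the last step back to the first
        R = r + gamma * R
        returns.append(R)
    returns.reverse()
    return np.array(returns)

def n_step_loss(returns, q_values):
    """Sum over the rollout of (R_i - Q(s_i, a_i))^2."""
    return float(np.sum((returns - np.asarray(q_values)) ** 2))

returns = n_step_returns([1.0, 0.0, 2.0], bootstrap_value=4.0, gamma=0.5)
print(returns)                                # [2. 2. 4.]
print(n_step_loss(returns, [1.0, 2.0, 3.0]))  # 2.0
```

Working backwards means each return reuses the one after it, so the whole rollout costs one pass over the rewards.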

Pseudocode for asynchronous n-step Q-learning: \\
where \\
* $\theta$: parameters of the policy network
* $\theta^t$: parameters of the target network
* $T$: overall time step counter
* $t$: thread level time step counter
* $T_{\max}$: maximum number of overall time steps
* $t_{\max}$: maximum number of time steps in a thread

`// Globally shared parameters ` $\theta, \theta^t$ and $T$ \\
`// ` $\theta$ `is initialized arbitrarily` \\
`// T is initialized to 0` \\
 \\
`Initialize thread level time step counter t=0` \\
`Initialize ` $\theta^t=\theta$ \\
`Initialize ` $\theta'=\theta$ \\
`Initialize network gradients` $d\theta=0$ \\
 \\
`repeat until ` $T>T_{\max}$: \\
$\quad$ `Clear gradient: ` $d\theta=0$ \\
$\quad$ `Synchronize thread-specific parameters: ` $\theta'=\theta$ \\
$\quad t_{start}=t$ \\
$\quad$ `Get state ` $s_t$ \\
$\quad$ `r = [] // list of rewards` \\
$\quad$  `a = [] // list of actions` \\
$\quad$ `s = [] // list of states` \\
$\quad$ `repeat until ` $s_t$ `is a terminal state or ` $t-t_{start}==t_{\max}$: \\
$\quad\quad$ `Choose action` $a_t$  `with ` $\epsilon$ `-greedy policy such that:` \\
$\quad\quad\quad a_t=\{\begin{array}{l}a\,random\, action\quad ,with\,probability\,\epsilon \\ \arg\max_{a''}Q(\phi(s_t),a'';\theta') \quad otherwise \end{array}$ \\
$\quad\quad$ `Perform action` $a_t$ \\
$\quad\quad$ `Receive new state ` $s_{t+1}$ `and reward ` $r_t$ \\
$\quad\quad$ `Accumulate rewards by appending ` $r_t$ `to r` \\
$\quad\quad$ `Accumulate actions by appending ` $a_t$ `to a` \\
$\quad\quad$ `Accumulate states by appending ` $s_t$ `to s` \\
$\quad\quad$ `t = t + 1` \\
$\quad\quad$ `T = T + 1` \\
$\quad\quad s_t=s_{t+1}$ \\
$\quad$ `Compute returns, R:` $R=\{\begin{array}{l}0\quad ,\,for\,terminal\,s_t \\ \max_{a}Q(s_t,a;\theta^t) \quad otherwise \end{array}$ \\
$\quad$ `for ` $i\in\{t-1,\ldots, t_{start}\}$ `do:` \\
$\quad\quad R=r_i+\gamma R$ \\
$\quad\quad$ `Compute loss, ` $L(\theta')=(R-Q(s_i,a_i;\theta'))^2$ \\
$\quad\quad$ `Accumulate gradients w.r.t. ` $\theta': d\theta=d\theta+\frac{\partial L(\theta')}{\partial\theta'}$ \\
$\quad$ `Asynchronous update of ` $\theta$ `using ` $d\theta$ \\
$\quad$ `if T mod ` $I_{target}==0:$ \\
$\quad\quad$ `Update the parameters of target network: ` $\theta^t=\theta$ \\
$\quad\quad$ `# After every ` $I_{target}$ `time steps the parameters of the target network are updated`