# Actor Critic with Experience Replay


## Objective

### 1. Actor

$$\hat{g}_t^{acer} = \bar{\rho}_t \nabla_\theta \log \pi_\theta (a_t|s_t) \left[ Q^{ret} (s_t, a_t) - V_{\theta_v}(s_t) \right] + \mathbb{E}_{a \sim \pi} \left( \left[\frac{\rho_t(a) - c}{\rho_t(a)}\right]_{+} \nabla_\theta \log \pi_\theta (a|s_t) \left[ Q_{\theta_v}(s_t, a) - V_{\theta_v}(s_t) \right] \right)$$

#### Notations
+ $\rho_t = \frac{\pi(a_t|s_t)}{\mu(a_t|s_t)}$: importance sampling term. *(B, 1)*

+ $\bar{\rho}_t = \min\{c, \rho_t\}$: importance weight truncated at $c$ to reduce variance. *(B, 1)*
+ $\pi_\theta(a_t|s_t)$: probability of $a_t$ under $s_t$ by current policy. *(B, 1)*
+ $\log \pi_\theta (a|s_t)$: log-probabilities of all actions under $s_t$ by current policy. *(B, action_size)*

+ $Q^{ret}(s_t, a_t)$: target for the state-action value function, calculated with the Retrace algorithm. *(B, 1)*

+ $Q_{\theta_v}(s_t, a)$: expected values of all actions under $s_t$ given by current value network. *(B, action_size)*

+ $V_{\theta_v}(s_t)$: state value function, the expected return of $s_t$ under the current policy and value network: $V_{\theta_v}(s_t) = \sum_{a} \pi_\theta(a|s_t) Q_{\theta_v}(s_t, a)$. *(B, 1)*. Therefore, the shape of $Q_{\theta_v}(s_t, a) - V_{\theta_v}(s_t)$ is *(B, action_size)*

+ $\rho_t(a) = \frac{\pi(a|s_t)}{\mu(a|s_t)}$: importance ratio for every action under $s_t$, not only the sampled one. Different from $\rho_t$, which is the ratio for the taken action $a_t$. *(B, action_size)*

+ $[\frac{\rho_t(a) - c}{\rho_t(a)}]_{+} = \max\{ 0, \frac{\rho_t(a) - c}{\rho_t(a)} \}$: bias-correction coefficient, non-zero only where $\rho_t(a) > c$. *(B, action_size)*

+ $\mathbb{E}_{a \sim \pi}$: expectation over actions drawn from the current policy. Therefore, the second term needs to be **multiplied by the probability distribution of predicted actions** and **summed over actions**.
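
The two weights in the actor objective can be checked by hand for a single state. The snippet below is a minimal pure-Python sketch, mirroring the `clamp` calls used later in `fit`; the helper name `truncation_weights` and the toy probabilities are assumptions made for illustration.

```python
def truncation_weights(pi, mu, action, c=1.0):
    """Return (rho_bar, correction) for one state.

    pi, mu: action probabilities under the current and the behaviour policy.
    action: index of the action that was actually taken.
    """
    rho = [p / m for p, m in zip(pi, mu)]               # rho_t(a) for every action
    rho_bar = min(c, rho[action])                       # truncated weight of the taken action
    correction = [max(0.0, 1.0 - c / r) for r in rho]   # [(rho_t(a) - c) / rho_t(a)]_+
    return rho_bar, correction


pi = [0.9, 0.1]  # current policy (toy values)
mu = [0.5, 0.5]  # behaviour policy that generated the sample (toy values)
rho_bar, correction = truncation_weights(pi, mu, action=0, c=1.0)
print(rho_bar, correction)
```

Here the taken action has ratio 1.8, so its weight is cut down to `c = 1.0`, and only that action receives a non-zero correction coefficient. The second action has ratio 0.2, below `c`, so it needs no correction.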




### 2. Critic
$$MSE(Q^{ret}(s_t, a_t), Q_{\theta_v}(s_t, a_t))$$

#### Notations
+ $Q_{\theta_v}(s_t, a_t)$: state-action value function *(B, 1)*
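
The formula above is written with MSE, while the `fit` method in this notebook calls `F.smooth_l1_loss`. The following is a small pure-Python sketch of how the two differ, assuming the standard smooth L1 definition with threshold `beta = 1`; the helper names and toy values are illustrative only.

```python
def mse(pred, target):
    """Mean squared error over a batch."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def smooth_l1(pred, target, beta=1.0):
    """Quadratic for small errors, linear for errors of at least beta."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff < beta:
            total += 0.5 * diff ** 2 / beta
        else:
            total += diff - 0.5 * beta
    return total / len(pred)


q_a = [0.5, 2.0]     # toy predictions Q(s_t, a_t)
q_ret = [0.0, -1.0]  # toy Retrace targets
print(mse(q_a, q_ret), smooth_l1(q_a, q_ret))
```

The large error in the second sample dominates the MSE, whereas smooth L1 grows only linearly there, which makes the critic less sensitive to outlier targets.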

### 3. Retrace Algorithm
$$Q^{ret}(s_t, a_t) = r_t + \gamma \bar{\rho}_{t+1}[  Q^{ret}(s_{t+1}, a_{t+1}) - Q_{\theta_v}(s_{t+1}, a_{t+1})   ] + \gamma V_{\theta_v}(s_{t+1})$$

#### Notations
+ $r_t$: reward from environment
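
The recursion is evaluated backwards in time, starting from a bootstrap value after the last step. Below is a minimal pure-Python sketch for a single trajectory; the function name `retrace_targets`, its argument layout, and the toy numbers are assumptions, not the notebook's code.

```python
def retrace_targets(rewards, q_taken, values, rho_bar, bootstrap, gamma=0.98):
    """Compute Q^ret(s_t, a_t) for every step of one trajectory.

    rewards[t]  : r_t
    q_taken[t]  : Q(s_t, a_t) from the value network
    values[t]   : V(s_t)
    rho_bar[t]  : truncated importance weight of a_t
    bootstrap   : value of the state after the last step (0.0 if terminal)
    """
    targets = [0.0] * len(rewards)
    # carry holds rho_bar_{t+1} * (Q^ret_{t+1} - Q_{t+1}) + V_{t+1}
    carry = bootstrap
    for t in reversed(range(len(rewards))):
        targets[t] = rewards[t] + gamma * carry
        carry = rho_bar[t] * (targets[t] - q_taken[t]) + values[t]
    return targets


targets = retrace_targets(
    rewards=[1.0, 1.0],
    q_taken=[0.5, 0.2],
    values=[0.4, 0.3],
    rho_bar=[1.0, 0.5],
    bootstrap=0.0,
    gamma=0.9,
)
print(targets)
```

For this toy trajectory the last step gives a target of about 1.0, and the first step gives about 1.63, which matches expanding the formula by hand: 1 + 0.9 * 0.5 * (1.0 - 0.2) + 0.9 * 0.3.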

## Importance Sampling
*With importance sampling, we can reuse samples collected from old policy to calculate the policy gradient.*
### 1. Derivation
Problem: we want to estimate the expected value of $f(x)$ where $x$ follows the distribution $p$. However, the samples we have were drawn from a different distribution $q$ instead of $p$.

Derivation:
$$\mathbb{E}_{x \sim p(x)}[f(x)] = \int p(x) f(x) \,\mathrm{d}x = \int q(x) \frac{p(x)}{q(x)} f(x) \,\mathrm{d}x = \mathbb{E}_{x \sim q(x)}\left[ \frac{p(x)}{q(x)} f(x) \right]$$
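
The identity can be checked numerically on a small discrete example. This is a sketch using only the standard library; the distributions `p`, `q` and the function `f` are made-up toy values.

```python
import random


def f(x):
    return x * x


support = [0, 1, 2]
p = [0.1, 0.2, 0.7]  # target distribution (toy values)
q = [0.4, 0.4, 0.2]  # distribution we can actually sample from (toy values)

# Exact expectation under p.
exact = sum(px * f(x) for x, px in zip(support, p))

# Monte Carlo estimate: sample from q and reweight each sample by p(x) / q(x).
rng = random.Random(0)
n = 100_000
samples = rng.choices(support, weights=q, k=n)
estimate = sum(p[x] / q[x] * f(x) for x in samples) / n

print(exact, estimate)
```

The reweighted average over samples from `q` lands close to the exact expectation under `p`, even though no sample was drawn from `p`.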

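### 2. Properties
+ The estimator is unbiased, provided $q(x) > 0$ wherever $p(x) f(x) \neq 0$.
+ The ratio $\frac{p(x)}{q(x)}$ must not become too large, otherwise the variance of the estimator blows up. This is why ACER truncates the ratio at $c$.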
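
Both properties can be made concrete by computing the exact mean and variance of the weighted estimator for a discrete toy problem. The sketch below assumes made-up distributions; `is_moments` is a hypothetical helper, and the optional argument `c` truncates the ratio the way $\bar{\rho}_t$ does.

```python
def is_moments(p, q, f_values, c=None):
    """Exact mean and variance of w(x) * f(x) for x ~ q, with w = p / q.

    If c is given, the weight is truncated to min(c, w).
    """
    mean = 0.0
    second_moment = 0.0
    for px, qx, fx in zip(p, q, f_values):
        w = px / qx
        if c is not None:
            w = min(c, w)
        mean += qx * w * fx
        second_moment += qx * (w * fx) ** 2
    return mean, second_moment - mean ** 2


p = [0.1, 0.2, 0.7]         # target distribution (toy values)
f_values = [0.0, 1.0, 4.0]  # f(x) on the three outcomes
close_q = [0.2, 0.2, 0.6]   # similar to p: ratios stay near 1
far_q = [0.6, 0.3, 0.1]     # very different from p: one ratio is 7

mean_close, var_close = is_moments(p, close_q, f_values)
mean_far, var_far = is_moments(p, far_q, f_values)
mean_trunc, var_trunc = is_moments(p, far_q, f_values, c=1.0)

print(mean_close, var_close)
print(mean_far, var_far)
print(mean_trunc, var_trunc)
```

Both untruncated estimators have the same mean, but the mismatched proposal has a far larger variance. Truncating the ratio removes most of that variance at the cost of a biased mean, which is the gap the second term of the ACER gradient is meant to correct.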


## Points
+ rollout, replay buffer
+ Retrace Algorithm to calculate return value
+ clip importance sampling term
+ on policy + off policy
+ The replay buffer stores a short sequence of consecutive transitions (one rollout) as a unit. `is_first` marks the first transition of each sequence
+ `r`, `done_mask` and `is_first` can be plain lists rather than Tensors
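
The last two points can be sketched with a toy buffer. The snippet assumes simplified transitions of the form `(state, action, reward)` and a hypothetical helper `flatten`; the notebook's own `ReplayBuffer` below stores more fields per transition.

```python
import collections
import random

buffer = collections.deque(maxlen=3)  # oldest rollouts are dropped automatically

# Each item is one rollout: a list of consecutive transitions.
buffer.append([("s0", 0, 1.0), ("s1", 1, 1.0)])
buffer.append([("s2", 0, 1.0)])


def flatten(rollouts):
    """Concatenate rollouts and flag where each one starts."""
    rewards, is_first = [], []
    for rollout in rollouts:
        for i, (_state, _action, reward) in enumerate(rollout):
            rewards.append(reward)
            is_first.append(i == 0)
    return rewards, is_first


on_policy_batch = [buffer[-1]]                     # newest rollout only
off_policy_batch = random.sample(list(buffer), 2)  # random stored rollouts

rewards, is_first = flatten(list(buffer))
print(rewards, is_first)
```

Because the flattened batch mixes several rollouts, the `is_first` flags are what tell a backward pass such as Retrace where one sequence ends and the previous one begins.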


## Reference
+ Paper: [Sample Efficient Actor-Critic with Experience Replay](https://arxiv.org/pdf/1611.01224.pdf)
+ Code: [acer.py](https://github.com/seungeunrho/minimalRL/blob/master/acer.py)

## 1. Import packages

In [0]:
import gym
import numpy as np
import collections
import random
import torch
import torch.nn as nn
from torch.distributions import Categorical
import torch.nn.functional as F


## 2. Define constants

In [0]:
gamma = 0.98
num_epochs = 5000
num_rollouts = 10
reward_div = 100
c = 1.0
max_buffer = 6000
batch_size = 4 # 4 rollouts * num_rollouts steps = at most 40 transitions per off-policy batch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## 3. Prepare data

In [0]:
env = gym.make("CartPole-v0")


class ReplayBuffer(object):
    def __init__(self):
        self.buffer = collections.deque(maxlen=max_buffer)
    
    def append(self, sample): # I store one sample in the ReplayBuffer, while the minimalRL stores 1 transition
        self.buffer.append(sample)
    
    def sample(self, n, on_policy):
        '''
            Return the most recent rollout (on policy: it was just collected with the current policy)
              or a random batch of n stored rollouts (off policy), flattened into tensors
        '''
        if on_policy:
            transitions_batch = [self.buffer[-1]]
        else:
            transitions_batch = random.sample(self.buffer, n)
        
        ss, aa, rr, s_primes, probs, done_masks, is_firsts = [], [], [], [], [], [], []
        for transitions in transitions_batch:
            is_first = True
            for s, a, r, s_prime, done_mask, prob in transitions:
                ss.append(s)
                aa.append(a)
                rr.append(r)
                s_primes.append(s_prime)
                probs.append(prob)
                done_masks.append(done_mask)
                is_firsts.append(is_first)
                is_first = False
                
      
        return torch.FloatTensor(ss).to(device), torch.LongTensor(aa).to(device), torch.FloatTensor(rr).to(device), \
               torch.FloatTensor(s_primes).to(device), done_masks, torch.FloatTensor(probs).to(device),  is_firsts
      
    def __len__(self):
        return len(self.buffer)
      
buffer = ReplayBuffer()


def get_sample(env, policy, buffer):
    '''
        Save transitions to buffer and return rewards for computing scores
    '''
    done = False
    s = env.reset() # (state_size, )
    is_first = True
    while not done:
        rr = []
        transitions = []
        for t in range(num_rollouts):
            a, probs = policy.sample_action(torch.Tensor(s).to(device)) # probs.shape: (action_size,)
            s_prime, r, done, _ = env.step(a) # a is 0 or 1
            rr.append(r)
            done_mask = 0.0 if done else 1.0
            
            transitions.append((s, a, r, s_prime, done_mask, probs.detach().cpu().numpy()))
            s = s_prime
            if done:
                break
        buffer.append(transitions)
        yield rr

## 4. Build model

In [0]:
def GAE(advantages, gamma, lmbda):
    gae_advantages = torch.zeros_like(advantages)
    gae = 0

    for ri in reversed(range(len(advantages))):
        gae = gae * gamma * lmbda + advantages[ri]
        gae_advantages[ri] = gae
    return gae_advantages


class ActorCritic(nn.Module):
    def __init__(self):
        super(ActorCritic, self).__init__()
        
        self.fc1 = nn.Linear(4, 256)
        self.fc_pi = nn.Linear(256, 2)
        self.fc_v = nn.Linear(256, 2) # !!! value network in ACER returns value for each action
        
        self.optimizer = torch.optim.Adam(self.parameters(), lr=0.0005, betas=(0.9, 0.99))

    def policy(self, state, softmax_dim=0):
        net = F.relu(self.fc1(state)) # (B, 256) # !!! Do not forget ReLU
        net = self.fc_pi(net) # (B, 2)
        probs = F.softmax(net, dim=softmax_dim)
        return probs
        
    def sample_action(self, state, softmax_dim=0): # state: (4,); nn.Linear accepts inputs without a batch dimension
        probs = self.policy(state, softmax_dim)
        m = Categorical(probs) # !!! Sampling depends on the device: with the same seed, CPU and GPU tensors may produce different samples
        a_pred = m.sample().item()
        return a_pred, probs # (sampled action: 0 or 1, action probabilities under the current policy with shape (action_size,))

    def value(self, state):
        net = F.relu(self.fc1(state)) # !!! Do not forget ReLU
        return self.fc_v(net)
      
    def fit(self, buffer, on_policy=False):
        s, a, r, _, done_mask, probs, is_firsts = buffer.sample(batch_size, on_policy)

        # -------------------- Preprocess sample ------------------------------------
        r /= reward_div 
        a = a.view(-1, 1) # (B, 1)
        
        
        # Note: 
        # `var` represents results on all actions with the shape of (B, action_size)
        # `var_a` indicates the result on the taken action with the shape of (B, 1)
        
        pi = self.policy(s, softmax_dim=1) # (B, 2) $\pi_\theta(a|s_t)$ !!! softmax_dim is important
        pi_a = pi.gather(1, a) # (B, 1)  $\pi_\theta(a_t|s_t)$
        
        q = self.value(s) # (B, 2) $Q_{\theta_v}(s_t, a)$
        q_a = q.gather(1, a) # (B, 1) $Q_{\theta_v}(s_{t}, a_{t})$

        v = (pi * q).sum(1).unsqueeze(1).detach() # (B, 1) $V_{\theta_v}(s_t)$ !!! detach
        
        rho = pi.detach() / probs
        rho_a = rho.gather(1, a) # (B, 1), $\rho_t$
        rho_bar = rho_a.clamp(max=c) # (B, 1) $\bar{\rho}_t$
        correction_coeff = (1 - c / rho).clamp(min=0)

        
        # --------------------- Compute return using Retrace ----------------------------
        q_rets = list() # a list of $Q^{ret}(s_t, a_t)$
        q_ret = v[-1] * done_mask[-1]
        for i in reversed(range(len(r))):
            q_ret = r[i] + gamma * q_ret
            q_rets.append(q_ret.item())
            q_ret = rho_bar[i] * (q_ret - q_a[i]) + v[i]
            
            if is_firsts[i] and i != 0: # step i starts a rollout, so step i - 1 is the last step of the previous rollout in the batch: restart the recursion from its bootstrap value
                q_ret = v[i - 1] * done_mask[i - 1]
        
        q_rets.reverse() # !!! reverse
        q_rets = torch.Tensor(q_rets).to(device) # (B)
        q_rets = q_rets.unsqueeze(1) # (B, 1)
        
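        # Truncated importance-weighted term for the taken action: (B, 1)
        loss1 = -rho_bar * torch.log(pi_a) * (q_rets - v)
        # Bias-correction term: an expectation over actions, so weight by pi and sum over the action axis.
        # pi (as a weight) and q are detached so that only log(pi) carries the policy gradient.
        loss2 = -(correction_coeff * pi.detach() * torch.log(pi) * (q.detach() - v)).sum(1, keepdim=True) # (B, 1)
        loss = (loss1 + loss2).mean() + F.smooth_l1_loss(q_a, q_rets)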
        
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        
ac = ActorCritic().to(device)

## 5. Train

In [5]:
score = 0.0

for epoch in range(num_epochs):
    # ------------------------- Sampling ---------------------------------------------
    for rewards in get_sample(env, ac, buffer):
        score += sum(rewards)

    # ------------------------- Train ---------------------------------------------
    if len(buffer) > 500:
        ac.fit(buffer, on_policy=True)
        ac.fit(buffer, on_policy=False)

    if epoch % 100 == 0:
        print('Epoch %d || Average Score: %.6f'%(epoch, score / (epoch + 1)))


Epoch 0 || Average Score: 26.000000
Epoch 100 || Average Score: 25.227723
Epoch 200 || Average Score: 24.208955
Epoch 300 || Average Score: 25.275748
Epoch 400 || Average Score: 29.561097
Epoch 500 || Average Score: 33.333333
Epoch 600 || Average Score: 37.392679
Epoch 700 || Average Score: 42.335235
Epoch 800 || Average Score: 50.314607
Epoch 900 || Average Score: 60.534961
Epoch 1000 || Average Score: 69.889111
Epoch 1100 || Average Score: 76.549500
Epoch 1200 || Average Score: 81.293922
Epoch 1300 || Average Score: 87.966180
Epoch 1400 || Average Score: 91.714490
Epoch 1500 || Average Score: 94.067288
Epoch 1600 || Average Score: 96.288570
Epoch 1700 || Average Score: 100.193416
Epoch 1800 || Average Score: 102.495836
Epoch 1900 || Average Score: 104.726460
Epoch 2000 || Average Score: 106.287356
Epoch 2100 || Average Score: 109.752975
Epoch 2200 || Average Score: 112.570195
Epoch 2300 || Average Score: 114.990439
Epoch 2400 || Average Score: 117.872553
Epoch 2500 || Average Score: 