# Advantage Actor-Critic

## Loss
### 1. Policy Network
$$\nabla_{\theta} J(\theta) = \frac{1}{N}\sum_{i=1}^N \sum_{t=0}^T \nabla_{\theta} \log \pi_{\theta}(a_{i, t}|s_{i, t}) \left( r(s_{i, t}, a_{i, t}) + \gamma v(s_{i, t+1}) - v(s_{i, t}) \right)$$

If the episode length is $\infty$, then $v(s)$ can get infinitely large in many cases.

A simple fix is to prefer rewards received sooner over rewards received later, by applying a discount factor $\gamma$ to the bootstrapped value:

$$y_{i, t} = r(s_{i, t}, a_{i, t}) + \gamma v(s_{i, t+1})$$

where $\gamma \in [0, 1]$ (0.99 works well; the code in this notebook uses 0.98).
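
To see why discounting keeps the value bounded, consider a constant reward of 1 per step, as in CartPole. The following is a minimal sketch, not part of the original notebook; the helper `discounted_return` is a hypothetical name. It shows the discounted sum approaching the geometric-series limit $1/(1-\gamma)$ as the horizon grows, instead of growing without bound:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t, accumulated backwards from the last step."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

gamma = 0.99
for horizon in (10, 100, 1000, 10000):
    # The undiscounted sum would equal `horizon`; the discounted one saturates.
    print(horizon, discounted_return([1.0] * horizon, gamma))

# Geometric-series limit for a constant reward of 1 (valid for gamma < 1)
bound = 1.0 / (1.0 - gamma)
print("limit:", bound)
```

With $\gamma = 1$ the bound no longer holds, which is why a value strictly below 1 is used in practice.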

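Note that the target uses $\gamma v(s_{i, t+1})$ rather than $\gamma q(s_{i, t+1}, a_{i, t+1})$: the state value is the expectation of $q$ over actions drawn from the current policy, $v(s) = \mathbb{E}_{a \sim \pi_{\theta}}[q(s, a)]$, so a single value network is enough to bootstrap and no separate Q-network is needed. It is not the maximum over actions, as it would be in Q-learning.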

### 2. Value Network
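$$L_v(\theta) = \frac{1}{N}\sum_{i=1}^N \sum_{t=0}^T \text{smooth_l1_loss}\left(v(s_{i, t}),\; r(s_{i, t}, a_{i, t}) + \gamma v(s_{i, t+1})\right)$$

This is a loss to be minimized, not a gradient. The target $r(s_{i, t}, a_{i, t}) + \gamma v(s_{i, t+1})$ is treated as a constant, so no gradient flows through it.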
 
 
## Points
+ In this code, the reward is divided by 100 (`reward_div`) before the TD target is computed

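+ The ReLU after the shared hidden layer is very important: without a nonlinearity, both the policy logits and the value would be purely linear (affine) functions of the state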
 
+ $\gamma v(s_{i, t+1})$ is multiplied by $(1 - \text{done})$. In other words, if the pair $(s, a)$ terminates the episode, there is no next state to bootstrap from, so $\gamma v(s_{i, t+1})$ is dropped.

+ The parameters are updated during the episode (once every `num_rollouts` steps), whereas the plain policy gradient (REINFORCE) updates them only after a complete episode

+ Fitting the value network is a regression problem, so both MSE loss and Smooth L1 loss (see details [here](https://stats.stackexchange.com/questions/351874/how-to-interpret-smooth-l1-loss)) work well. In the form used by PyTorch's `F.smooth_l1_loss`, with threshold $\beta$ (default 1): $$L_{1;smooth}(x) = \begin{cases}|x| - \frac{\beta}{2} & \text{if $|x| \geq \beta$;} \\
\frac{1}{2\beta}x^2 & \text{if $|x| < \beta$}\end{cases}$$
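
To make the difference between the two regression losses concrete, here is a small pure-Python sketch. It assumes the piecewise form that PyTorch documents for `F.smooth_l1_loss` with its `beta` parameter (default 1.0); the helper names `smooth_l1` and `squared_error` are hypothetical and operate on a single residual rather than on tensors:

```python
def smooth_l1(x, beta=1.0):
    """Smooth L1 on one residual: quadratic near zero, linear in the tails."""
    ax = abs(x)
    if ax < beta:
        return 0.5 * ax * ax / beta
    return ax - 0.5 * beta

def squared_error(x):
    """Squared error on one residual, for comparison."""
    return x * x

for residual in (0.1, 0.5, 1.0, 2.0, 10.0):
    # For large residuals the squared error grows much faster than Smooth L1.
    print(residual, smooth_l1(residual), squared_error(residual))
```

The linear tails mean that a single badly wrong value estimate contributes a bounded gradient, which is one common reason to prefer Smooth L1 when the TD targets are noisy.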

## Questions

+ Why should the reward be divided by 100?

+ What role does the rollout play?

## 1. Import packages

In [0]:
import gym
import torch
import torch.nn as nn
from torch.distributions import Categorical
import torch.nn.functional as F

## 2. Define constants

In [0]:
gamma = 0.98
num_epochs = 3000
num_rollouts = 5
reward_div = 100

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## 3. Prepare data

In [0]:
env = gym.make("CartPole-v0")

def get_sample(env, policy):
    done = False
    s = env.reset() # (state_size, )
    while not done:
        ss, aa, rr, s_primes, done_masks = list(), list(), list(), list(), list()
        for t in range(num_rollouts):
            a = policy.sample_action(s)
            s_prime, r, done, _ = env.step(a) # a is 0 or 1
            ss.append(s)
            aa.append(a)
            rr.append(r)
            s_primes.append(s_prime)
            done_mask = 0.0 if done else 1.0
            done_masks.append(done_mask)
            s = s_prime
            if done:
                break
                
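        # n = number of transitions collected in this rollout (n <= num_rollouts)
        sample = (
            torch.Tensor(ss).to(device),          # states:      (n, 4)
            torch.LongTensor(aa).to(device),      # actions:     (n,)
            torch.Tensor(rr).to(device),          # rewards:     (n,)
            torch.Tensor(s_primes).to(device),    # next states: (n, 4)
            torch.Tensor(done_masks).to(device),  # 0.0 if terminal else 1.0: (n,)
        )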
        yield sample
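
The control flow of `get_sample` splits each episode into chunks of at most `num_rollouts` transitions. The sketch below mirrors that loop structure without an environment; `chunk_episode` is a hypothetical helper, and the episode length is supplied by hand instead of coming from `env.step`:

```python
def chunk_episode(episode_length, num_rollouts):
    """Return the sizes of the chunks an episode is split into.

    Same loop shape as get_sample: collect up to num_rollouts transitions,
    emit them, and stop once the episode has terminated.
    """
    sizes = []
    t = 0
    done = False
    while not done:
        size = 0
        for _ in range(num_rollouts):
            t += 1
            size += 1
            done = t >= episode_length  # stand-in for the done flag from env.step
            if done:
                break
        sizes.append(size)
    return sizes

print(chunk_episode(12, 5))  # [5, 5, 2]
print(chunk_episode(10, 5))  # [5, 5]
```

Since the training loop calls `fit` once per yielded chunk, an episode of 12 steps leads to three parameter updates, and the last chunk can be shorter than `num_rollouts`.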

## 4. Build model

In [0]:
class ActorCritic(nn.Module):
    def __init__(self):
        super(ActorCritic, self).__init__()
        
        self.fc1 = nn.Linear(4, 256)
        self.fc_pi = nn.Linear(256, 2)
        self.fc_v = nn.Linear(256, 1)
        
        self.optimizer = torch.optim.Adam(self.parameters(), lr=0.0002, betas=(0.9, 0.99))

    def policy(self, state, softmax_dim=0):
        net = F.relu(self.fc1(state)) # (B, 256) # !!! Do not forget ReLU
        net = self.fc_pi(net) # (B, 2)
        probs = F.softmax(net, dim=softmax_dim)
        return probs
        
    def sample_action(self, state, softmax_dim=0): # state: (4,); nn.Linear accepts inputs without a batch dimension
        state = torch.Tensor(state).to(device)
        probs = self.policy(state, softmax_dim=softmax_dim) # (2,)
        m = Categorical(probs) # !!! The device matters for reproducibility: with the same seed, sampling on CPU and on GPU may give different actions
        a_pred = m.sample().item()
        return a_pred # sampled action: 0 or 1

    def value(self, state):
        net = F.relu(self.fc1(state)) # !!! Do not forget ReLU
        return self.fc_v(net)
      
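    def fit(self, sample): # sample: (s, a, r, s_prime, done_mask), one tensor per field, each with up to num_rollouts rows
        (s, a, r, ns, done_mask) = sample
        
        r /= reward_div # !!! dividing by 100 is very important; in-place, so the caller's reward tensor is rescaled too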
        td_target = (r + gamma * self.value(ns).squeeze() * done_mask).unsqueeze(1) # (num_rollouts, 1)
        vs = self.value(s) # (num_rollouts, 1)
        delta = td_target - vs # (num_rollouts, 1)
        
        probs = self.policy(s, softmax_dim=1) # (num_rollouts, action_size=2)
        probs = probs.gather(1, a.unsqueeze(1)) # (num_rollouts, 1)
        loss = torch.mean(-torch.log(probs) * delta.detach() +  F.smooth_l1_loss(vs, td_target.detach()))
#         loss = torch.mean(-torch.log(probs) * delta.detach() +  F.mse_loss(vs, td_target.detach())) # Work well, too
        
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
ac = ActorCritic().to(device)
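
The `td_target` and `delta` computed in `fit` can be checked by hand on a toy rollout. The following pure-Python sketch uses made-up numbers in place of the critic's outputs; the helpers `td_targets` and `advantages` are hypothetical names, not part of the class above:

```python
def td_targets(rewards, next_values, done_masks, gamma):
    """One-step targets r + gamma * v(s') * mask; mask is 0.0 on terminal transitions."""
    return [r + gamma * v * m for r, v, m in zip(rewards, next_values, done_masks)]

def advantages(targets, values):
    """delta = td_target - v(s), the weight applied to -log pi(a|s) in the loss."""
    return [y - v for y, v in zip(targets, values)]

# Toy 3-step rollout; all values are invented for illustration.
gamma = 0.98
rewards = [0.01, 0.01, 0.01]    # raw reward 1.0 divided by reward_div = 100
values = [0.5, 0.4, 0.3]        # v(s_t), hypothetical critic outputs
next_values = [0.4, 0.3, 0.9]   # v(s_{t+1}); the last one is masked out
done_masks = [1.0, 1.0, 0.0]    # the episode ends on the last transition

targets = td_targets(rewards, next_values, done_masks, gamma)
deltas = advantages(targets, values)
print(targets)
print(deltas)
```

On the terminal transition the target reduces to the scaled reward alone, regardless of what the critic predicts for the next state.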

## 5. Train

In [5]:
# mse(vs, td_target)
score = 0.0

for epoch in range(num_epochs):
    sample_iter = get_sample(env, ac)
    for sample in sample_iter:
        ac.fit(sample)
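        rewards = sample[2] * reward_div # undo the in-place scaling applied inside fit
        score += rewards.sum().item() # accumulate as a Python float rather than a tensor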
        
    if epoch % 100 == 0:
        print('Epoch %d || Average Score: %.6f'%(epoch, score / (epoch + 1)))

Epoch 0 || Average Score: 13.000000
Epoch 100 || Average Score: 25.623762
Epoch 200 || Average Score: 36.805969
Epoch 300 || Average Score: 64.820595
Epoch 400 || Average Score: 81.882797
Epoch 500 || Average Score: 96.031937
Epoch 600 || Average Score: 108.652245
Epoch 700 || Average Score: 118.460777
Epoch 800 || Average Score: 126.640450
Epoch 900 || Average Score: 134.659271
Epoch 1000 || Average Score: 139.136856
Epoch 1100 || Average Score: 142.532242
Epoch 1200 || Average Score: 146.371353
Epoch 1300 || Average Score: 149.510376
Epoch 1400 || Average Score: 153.055679
Epoch 1500 || Average Score: 154.393738
Epoch 1600 || Average Score: 152.937546
Epoch 1700 || Average Score: 155.192230
Epoch 1800 || Average Score: 154.067184
Epoch 1900 || Average Score: 155.566544
Epoch 2000 || Average Score: 157.206894
Epoch 2100 || Average Score: 156.545456
Epoch 2200 || Average Score: 155.346207
Epoch 2300 || Average Score: 154.759674
Epoch 2400 || Average Score: 154.353180
Epoch 2500 || Aver