# Actor-Critic

## Loss
### Policy Network
$$\nabla_{\theta} J(\theta) = \frac{1}{N}\sum_{i=1}^N \sum_{t=0}^T \nabla_{\theta} \log \pi_{\theta}(a_{i, t}|s_{i, t}) \left( r(s_{i, t}, a_{i, t}) + \gamma v(s_{i, t+1}) - v(s_{i, t}) \right)$$
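
In an autograd framework, this gradient is usually obtained by differentiating a surrogate loss rather than coded by hand. The following is a minimal sketch, assuming PyTorch and a toy batch of three transitions; `logits`, `v_s`, and `v_next` are made-up stand-ins for network outputs, not values from this notebook.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy batch: 3 transitions, 2 possible actions.
logits = torch.randn(3, 2, requires_grad=True)  # stand-in for the policy head output
actions = torch.tensor([0, 1, 1])               # actions that were taken
rewards = torch.tensor([1.0, 1.0, 1.0])
v_s = torch.tensor([0.5, 0.4, 0.3])             # stand-in for v(s_t)
v_next = torch.tensor([0.4, 0.3, 0.0])          # stand-in for v(s_{t+1})
gamma = 0.99

# Advantage estimate: r + gamma * v(s') - v(s), treated as a constant weight.
advantage = rewards + gamma * v_next - v_s

# log pi(a_t | s_t) for the actions that were actually taken.
log_probs = F.log_softmax(logits, dim=1).gather(1, actions.unsqueeze(1)).squeeze(1)

# Minimizing this surrogate ascends the policy-gradient objective.
policy_loss = -(log_probs * advantage.detach()).mean()
policy_loss.backward()
```

After `backward()`, `logits.grad` holds the (negated, batch-averaged) policy-gradient estimate with respect to the logits.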

If the episode length is $\infty$, the sum of future rewards, and therefore $v(s)$, can grow without bound in many cases.

A simple trick: it is better to get rewards sooner than later, so future values are scaled by a discount factor $\gamma$:

$$y_{i, t} = r(s_{i, t}, a_{i, t}) + \gamma v(s_{i, t+1})$$

where $\gamma \in [0, 1]$ (0.99 works well; the code below uses 0.98).
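
To see why discounting keeps the value bounded, here is a small sketch in plain Python. The helper `discounted_return` and the constant reward of 1 per step are assumptions made for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t, accumulated backwards from the last step."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0] * 2000  # hypothetical long episode with reward 1 per step

undiscounted = discounted_return(rewards, gamma=1.0)  # grows with episode length
discounted = discounted_return(rewards, gamma=0.99)   # stays close to 1 / (1 - gamma)
```

With a reward of 1 per step, the discounted sum approaches $1 / (1 - \gamma) = 100$ no matter how long the episode runs, while the undiscounted sum keeps growing with the episode length.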

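Note that we use $\gamma v(s_{i, t+1})$ instead of $\gamma q(s_{i, t+1},  a_{i, t+1})$ because $v(s)$ is already the expectation of $q(s, a)$ over the actions the policy would take, $v(s) = \sum_a \pi_{\theta}(a|s) q(s, a)$, so no next action has to be sampled. In other words, the Value network estimates the expected discounted future reward under the current policy, not the maximum over actions.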

### Value Network
$$L_v(\theta) = \frac{1}{N}\sum_{i=1}^N \sum_{t=0}^T \text{smooth_l1_loss}\left( v(s_{i, t}),\ r(s_{i, t}, a_{i, t}) + \gamma v(s_{i, t+1}) \right)$$
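
As a quick illustration of what this loss computes, the sketch below evaluates `F.smooth_l1_loss` on two made-up value predictions. With PyTorch's default `beta=1.0`, the loss is quadratic for errors below 1 and linear above, which limits the influence of large TD errors. The numbers are hypothetical.

```python
import torch
import torch.nn.functional as F

# Hypothetical predictions v(s_t) and fixed TD targets r + gamma * v(s_{t+1}).
v_s = torch.tensor([0.2, 3.0], requires_grad=True)
td_target = torch.tensor([0.7, 0.5])

# Arguments are (input, target); the target is detached so it acts as a constant.
value_loss = F.smooth_l1_loss(v_s, td_target.detach())
value_loss.backward()

# Error -0.5 -> 0.5 * 0.5**2 = 0.125 (quadratic region)
# Error  2.5 -> 2.5 - 0.5    = 2.0   (linear region)
# Mean of the two terms = 1.0625
```

The gradient for the large error is capped at a constant magnitude, whereas a squared-error loss would scale it with the error.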
 
 
## Coding Notes
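+ In this code, the reward is divided by 100 (`reward_div`) before the TD target is computed

+ The ReLU after the shared hidden layer is very important; do not forget it
 
+ $\gamma v(s_{i, t+1})$ is multiplied by `(1 - done)`. In other words, if this $(s, a)$ pair terminates the episode, the bootstrap term $\gamma v(s_{i, t+1})$ is dropped and the target is just the reward.

+ The parameters are updated during the episode, once per rollout of at most `num_rollouts` steps. In contrast, the Policy Gradient method updates the parameters only after a full episode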
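
The `(1 - done)` masking from the notes above can be sketched in plain Python. The helper `td_targets` and the sample values are hypothetical and only illustrate the arithmetic.

```python
def td_targets(rewards, next_values, dones, gamma):
    """One-step TD targets; the bootstrap term is zeroed on terminal transitions."""
    return [
        r + gamma * v_next * (1.0 - float(done))
        for r, v_next, done in zip(rewards, next_values, dones)
    ]

# Two transitions with identical reward and next-state value;
# only the second one ends the episode.
targets = td_targets(
    rewards=[1.0, 1.0],
    next_values=[5.0, 5.0],
    dones=[False, True],
    gamma=0.98,
)
```

The first target includes the discounted next-state value, while the terminal transition keeps only its immediate reward.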

## Questions

+ Why should the reward be divided by 100?

+ What role does the rollout play?

## 1. Import packages

In [0]:
import gym
import torch
import torch.nn as nn
from torch.distributions import Categorical
import torch.nn.functional as F

## 2. Define constants

In [0]:
gamma = 0.98
num_epochs = 10000
num_rollouts = 5
reward_div = 100

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## 3. Prepare data

In [0]:
env = gym.make("CartPole-v0")

def get_sample(env, policy):
    done = False
    s = env.reset() # (state_size, )
    while not done:
        ss, aa, rr, nss, dones, log_probs = list(), list(), list(), list(), list(), list()
        for t in range(num_rollouts):
            a, log_prob = policy.predict(torch.Tensor(s).to(device))
            ns, r, done, _ = env.step(a) # a is 0 or 1
            ss.append(s)
            aa.append(a)
            rr.append(r)
            nss.append(ns)
            dones.append(done)
            log_probs.append(log_prob) # log probability of the current action
            s = ns
            if done:
                break
                
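        # Pack the rollout into tensors: states, actions, rewards, next states, done flags
        sample = (
            torch.Tensor(ss).to(device),
            torch.LongTensor(aa).to(device),
            torch.Tensor(rr).to(device),
            torch.Tensor(nss).to(device),
            torch.Tensor(dones).to(device),
        )
        log_probs = torch.Tensor(log_probs).to(device) # copy of the log probabilities, detached from the graph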
        yield sample, log_probs
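
The chunking pattern used by `get_sample` can be seen in isolation in the sketch below. It replaces the environment with a hypothetical step counter that ends the episode after `num_steps` steps, so only the control flow is shown; `chunked_rollouts` is an illustrative name, not part of the notebook.

```python
def chunked_rollouts(num_steps, rollout_len):
    """Yield lists of step indices, each at most rollout_len long, until the episode ends."""
    t = 0
    done = False
    while not done:
        chunk = []
        for _ in range(rollout_len):
            chunk.append(t)        # stand-in for storing one (s, a, r, ns, done) transition
            t += 1
            done = t >= num_steps  # stand-in for the `done` flag returned by the environment
            if done:
                break
        yield chunk

# A 12-step episode split into rollouts of at most 5 steps.
chunks = list(chunked_rollouts(num_steps=12, rollout_len=5))
```

Each yielded chunk corresponds to one parameter update, and the last chunk is shorter when the episode ends in the middle of a rollout.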

## 4. Build model

In [0]:

class ActorCritic(nn.Module):
    def __init__(self):
        super(ActorCritic, self).__init__()
        
        self.fc1 = nn.Linear(4, 256)
        self.fc_pi = nn.Linear(256, 2)
        self.fc_v = nn.Linear(256, 1)
        
        self.optimizer = torch.optim.Adam(self.parameters(), lr=0.0002, betas=(0.9, 0.99))

    def policy(self, state, softmax_dim=0):
        net = F.relu(self.fc1(state)) # (B, 256) # !!! Do not forget ReLU
        net = self.fc_pi(net) # (B, 2)
        probs = F.softmax(net, dim=softmax_dim)
        return probs
        
    def predict(self, state, softmax_dim=0): # state: (4,) => indicates that the fully-connected layer in PyTorch can receive inputs without batch_size
        probs = self.policy(state)
        m = Categorical(probs) # !!! CPU and GPU runs can differ: even with the same seed (e.g. 2), sampling from `probs` on different devices might produce different results
        a_pred = m.sample().item()
        return a_pred, torch.log(probs[a_pred]) # (predicted action: 0 or 1, log of probability of current action)

    def value(self, state):
        net = F.relu(self.fc1(state)) # !!! Do not forget ReLU
        return self.fc_v(net)
      
    def fit(self, sample, log_probs): # sample: tuple of tensors (s, a, r, ns, done); log_probs is unused because the log probabilities are recomputed below
        (s, a, r, ns, done) = sample
        
        r /= reward_div # !!! dividing by 100 is very important; this modifies the reward tensor inside `sample` in place
        td_target = (r + gamma * self.value(ns).squeeze() * (1 - done)).unsqueeze(1) # (num_rollouts, 1)
        vs = self.value(s) # (num_rollouts, 1)
        delta = td_target - vs # (num_rollouts, 1)
        
        probs = self.policy(s, softmax_dim=1) # (num_rollouts, action_size=2)
        probs = probs.gather(1, a.unsqueeze(1)) # (num_rollouts, 1)
        loss = torch.mean(-torch.log(probs) * delta.detach() + F.smooth_l1_loss(vs, td_target.detach())) # policy term + value loss; the TD target is treated as a constant
        
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
ac = ActorCritic().to(device)

## 5. Train

In [5]:
score = 0.0

for epoch in range(num_epochs):
    sample_iter = get_sample(env, ac)
    for sample, log_probs in sample_iter:
        ac.fit(sample, log_probs)
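        rewards = sample[2] * reward_div # fit() scaled the rewards in place, so undo the scaling here
        score += rewards.sum().item() # accumulate as a Python float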
        
    if epoch % 100 == 0:
        print('Epoch %d || Average Score: %.6f'%(epoch, score / (epoch + 1)))

Epoch 0 || Average Score: 50.000000
Epoch 100 || Average Score: 15.366337
Epoch 200 || Average Score: 26.328358
Epoch 300 || Average Score: 49.136211
Epoch 400 || Average Score: 78.668335
Epoch 500 || Average Score: 99.351295
Epoch 600 || Average Score: 114.339432
Epoch 700 || Average Score: 123.582031
Epoch 800 || Average Score: 130.641693
Epoch 900 || Average Score: 134.974472
Epoch 1000 || Average Score: 140.289703
Epoch 1100 || Average Score: 145.574936
Epoch 1200 || Average Score: 149.524567
Epoch 1300 || Average Score: 152.730972
Epoch 1400 || Average Score: 155.738754
Epoch 1500 || Average Score: 158.640244
Epoch 1600 || Average Score: 159.883835
Epoch 1700 || Average Score: 160.951202
Epoch 1800 || Average Score: 161.444199
Epoch 1900 || Average Score: 162.806946
Epoch 2000 || Average Score: 162.570221
Epoch 2100 || Average Score: 163.102814
Epoch 2200 || Average Score: 162.592911
Epoch 2300 || Average Score: 162.478043
Epoch 2400 || Average Score: 162.519363
Epoch 2500 || Aver

KeyboardInterrupt: ignored
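
Note that `score / (epoch + 1)` in the training loop is a cumulative average over all epochs so far, so the printed number lags behind the current policy's performance. One possible alternative, sketched here with a hypothetical helper `moving_average` and made-up episode scores, is to average only the most recent episodes.

```python
from collections import deque

def moving_average(scores, window):
    """Average of the last `window` scores after each episode."""
    recent = deque(maxlen=window)  # automatically drops the oldest score
    averages = []
    for s in scores:
        recent.append(s)
        averages.append(sum(recent) / len(recent))
    return averages

# Hypothetical per-episode scores: poor at first, then much better.
episode_scores = [10.0, 20.0, 200.0, 200.0]

cumulative = sum(episode_scores) / len(episode_scores)   # weighed down by early episodes
windowed = moving_average(episode_scores, window=2)[-1]  # reflects recent episodes only
```

In this toy example the cumulative average is 107.5 while the two-episode window reports 200.0.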