## Notes

*Note: This paper goes through A3C, later when I do the implementation I'll make some tweaks to make it A2C as that's what's more used apparently*

### High Level Quick Read Through

What problem is this paper solving? We saw in the DQN paper that one of their innovations was experience replay. This let us store many transitions and sample them randomly, breaking the temporal correlations in the data and stabilizing training. However, in this paper the authors point out several drawbacks of experience replay:

1. It's memory intensive
2. It's computationally heavy, each transition is replayed multiple times
3. It's off-policy only. You can't apply it to on-policy methods, which learn from data generated by the current policy.

They propose a new way to stabilize training that replaces experience replay with asynchronicity. There's one global network that we want to train. We then launch multiple threads that each have a copy of the network parameters and their own separate copy of the environment. Each of these workers runs a small number of steps and collects experience from its own environment. Each worker then computes the gradients of the loss and pushes them to the global network.

In A3C the workers don't wait for each other: once a worker has calculated its gradients it pushes them to the global network. None of them hold a lock on the global parameters. This works because the experiences of the different agents are not super correlated with each other, which has a stabilizing effect. They apply this framework to four different RL algorithms, but the one I'll focus on is the application to Advantage Actor-Critic.
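To make the lock-free idea concrete, here is a toy sketch in plain Python. It is not the paper's implementation: instead of a network there is a single shared weight, and the quadratic loss, `TARGET`, `LEARNING_RATE`, and the `worker` function are all made up for illustration. Each thread reads a local copy of the weight, computes a gradient from it, and applies that gradient to the shared weight without any locking.

```python
import threading

TARGET = 3.0           # hypothetical optimum of the toy loss (w - TARGET)^2
LEARNING_RATE = 0.1    # hypothetical step size
shared_params = [0.0]  # stands in for the global network: one shared weight


def worker(n_updates):
    """Compute gradients from a local copy and push them to the shared weight, lock-free."""
    for _ in range(n_updates):
        local_w = shared_params[0]                # local copy (may be slightly stale)
        grad = 2.0 * (local_w - TARGET)           # gradient of (w - TARGET)^2
        shared_params[0] -= LEARNING_RATE * grad  # unsynchronized update


threads = [threading.Thread(target=worker, args=(500,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared_params[0])  # ends up very close to TARGET
```

Even though updates can be based on slightly stale reads, every step still pulls the weight towards the optimum, which is the intuition behind skipping the lock.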

Some things I'm thinking about:
Do all of the workers initialize their environments to the same state, or do they start from different parts of the environment? What if there aren't many choices at the start and they all do the same thing? When each of them explores, are they acting randomly, or how do they decide what to do?

The actor network outputs a probability distribution which the agent samples from. They also add an entropy bonus (kind of like a temperature) that smooths out the distribution. So they explore different areas.

In A2C workers wait for each other?

They wait for each other. They all go into their environments and collect some data and then wait. Then it updates all of them in one batch. This can leverage GPUs better.

##### Quick dive into Actor-Critic and Advantage Function

First off DQN is not an actor-critic algorithm. It's a purely value-based algorithm, the policy is implicitly derived from the Q-values predicted for each action in a state. The policy is to pick whichever action has the highest Q value. There's no network that outputs actions or anything that represents the policy. Note: one drawback of these value based methods is that they're not good for continuous actions (e.g. how hard should I do action x).

In actor critic we have an actor and a critic. The actor has a policy and its job is to take an action based on that policy. The critic's job is to learn a value function that predicts how good a state is. The learning loop is as follows:

1. The actor takes an action based on its policy.
2. The actor then gets a reward and a new state.
3. The critic then calculates how surprising that outcome was. The critic had an expectation for the state at time t, $V(s_t)$. The outcome we compare it to is the reward plus the discounted value of the new state at t+1 (kinda similar to the DQN target). The difference between the two is the TD error (this is the surprise): $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. The TD error is an estimate of the *advantage*. It tells the actor how much better or worse the action was than expected.
4. Then there's the learning: both the actor and the critic learn. The actor adjusts its weights. If the TD error was positive (the action did better than the critic expected) then it adjusts its policy to make that action more likely, and the inverse if the TD error is negative. The critic also learns. It adjusts the value function to be closer to the observed reality.
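Step 3 of the loop can be sketched in a few lines. This is a minimal illustration, not code from the paper; the function name `td_error` and the example numbers are made up.

```python
def td_error(reward, value_state, value_next_state, gamma=0.99, done=False):
    """One-step TD error: (r + gamma * V(s')) - V(s), used as an advantage estimate."""
    # If the episode ended there is no next state to bootstrap from.
    bootstrap = 0.0 if done else gamma * value_next_state
    return reward + bootstrap - value_state


# The critic expected 0.5, but we got a reward of 1.0 and landed in a state worth 0.6.
delta = td_error(reward=1.0, value_state=0.5, value_next_state=0.6)
print(delta)  # positive, so the action did better than expected
```

A positive `delta` would push the actor to make the action more likely, and a negative one less likely.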

Why is this better though? It's better because we learn a policy directly while also using a value function to make the learning process more stable and efficient. This has a few advantages. 

The first is that we get an explicit stochastic policy. There are a few reasons this is good. The first is that it naturally helps with the exploration-exploitation problem by giving us a distribution to sample from. The second is that it helps with partial observability. In a partially observable world two or more different world states can look identical to the agent, and a different action could be optimal in each of them. A deterministic policy has to commit to one action for that observation, so it's wrong in one of the states and can flip back and forth as it learns. A stochastic policy can instead split its probability across both actions; it's not perfect but it's stable and robust. The last is that optimal policies are sometimes inherently stochastic (think about how you have to sometimes play that bad hand in poker lol).

The second is that it can handle continuous actions. Say for example steering a car. In DQN you might need to discretize the steering wheel angle into slots of maybe every 10 degrees. In A3C you can output a continuous probability distribution. So for the steering wheel you could have two output neurons, one that outputs the mean of the distribution and one that outputs its standard deviation. For example, mean angle = 30 degrees with an STD of 5 in a given state s. Then we sample from it and apply the result to the steering wheel. Note that if we have some combo of things to control, some discrete and some continuous, we'd just have a different output head for each component of the action. The setup might be one big shared main network that outputs a vector (maybe somehow representing the world state) that feeds into a separate head for each component of the action. We backprop through all of them and sum the gradients.
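As a quick illustration of the steering example, here is a sketch of sampling an action from a Gaussian policy. The mean and standard deviation are hard-coded to the numbers above; in a real agent they would come from the network's two output neurons. The function name `sample_steering_angle` is made up.

```python
import random


def sample_steering_angle(mean_deg, std_deg, rng):
    """Sample a continuous action from a Gaussian policy N(mean, std^2)."""
    return rng.gauss(mean_deg, std_deg)


rng = random.Random(0)  # seeded so the sketch is reproducible
samples = [sample_steering_angle(30.0, 5.0, rng) for _ in range(10_000)]
sample_mean = sum(samples) / len(samples)
print(samples[:3], sample_mean)
```

Individual samples are spread around 30 degrees, which is where the exploration comes from, while the average stays close to the mean the network asked for.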

Aside: What if the optimal action is a distribution but one that's not gaussian however? A bit advanced but we can output other values to make up for this, for example mixture of gaussians or skewed distribution.

How is the actor's policy updated? (Here for Gaussian-output actions.)

If we see a positive advantage (the action was better than expected) then we want to update our parameters to make that action more likely in the future. This is different to supervised learning where we have a ground truth label or where the loss is based on a difference that we calculate.

The "loss function" or objective function we're maximizing is not based on an action error. It's based on the log-probability of the action we took, scaled in magnitude and direction by how good the outcome was (the advantage). For this we need to look at the policy gradient:

$\text{Policy Gradient} = \nabla_{\theta} \log \pi(a | s; \theta) \times \text{Advantage}$

Translated to English, this formula says that the policy gradient is the gradient (with respect to the parameters theta) of the log probability of taking action a in state s under the policy pi, multiplied by the advantage. For a Gaussian policy, the pi(a|s) inside the log is the Gaussian probability density, which is:

$\pi(a|\mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(a - \mu)^2}{2\sigma^2}}$

For context this is what that graph looks like:

<img src="attachment:f5bc0d2f-04ee-4122-9be4-c25aa65069cf.png" width="400">

This is kind of scary looking but if we take the natural log of it it simplifies down to this

$\log \pi(a|\mu, \sigma) = -\log(\sigma) - \frac{1}{2}\log(2\pi) - \frac{(a - \mu)^2}{2\sigma^2}$
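A small numeric check that the simplified log formula matches the log of the density. The helper names `gaussian_pdf` and `gaussian_log_prob` and the test values are just for illustration.

```python
import math


def gaussian_pdf(a, mu, sigma):
    """Gaussian probability density of action a."""
    return math.exp(-((a - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def gaussian_log_prob(a, mu, sigma):
    """The simplified form: -log(sigma) - 0.5*log(2*pi) - (a - mu)^2 / (2*sigma^2)."""
    return -math.log(sigma) - 0.5 * math.log(2 * math.pi) - ((a - mu) ** 2) / (2 * sigma ** 2)


a, mu, sigma = 32.0, 30.0, 5.0
print(math.log(gaussian_pdf(a, mu, sigma)), gaussian_log_prob(a, mu, sigma))
```

Working with the log form directly is also numerically safer than computing the density and then taking its log.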

From here we calculate the partial derivatives with respect to the mean or the standard deviation. If the advantage is positive then the optimizer moves the mean towards the action that was taken. But what happens with the standard deviation? Taking the negative log-probability as the loss (the positive-advantage case, dropping the constant term) and differentiating with respect to the standard deviation gives

$\frac{\partial}{\partial\sigma} \left( \log(\sigma) + \frac{(a - \mu)^2}{2\sigma^2} \right) = \frac{1}{\sigma} - \frac{(a - \mu)^2}{\sigma^3}$

Note: with gradient descent, if a gradient is positive the parameter value will decrease; if it's negative the parameter value will increase.

The gradient is interesting because it has two parts. It has the 1/STD term and the other term. That first term will always be positive. If a good action is close to the mean then the second term will be small, making the expression positive. This will serve to decrease the STD. If the good action was far from the mean then the second term will be large, making the expression negative, so the optimizer will increase the STD. The crossover is at |a - mean| = STD.

So overall the mean is nudged towards and away from the action that was taken depending on the advantage. The STD is also modulated by similar things but it depends on how far away the action was from the current mean.
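Here is a sketch that evaluates the partial derivatives of the negative log-probability (the loss for a positive advantage of 1) for a good action near the mean and one far from it. The function name `neg_log_prob_grads` and the numbers are made up for illustration.

```python
def neg_log_prob_grads(a, mu, sigma):
    """Gradients of -log pi(a | mu, sigma) with respect to mu and sigma."""
    d_mu = -(a - mu) / sigma ** 2
    d_sigma = 1.0 / sigma - (a - mu) ** 2 / sigma ** 3
    return d_mu, d_sigma


mu, sigma = 30.0, 5.0

# Good action close to the mean: d_sigma > 0, so gradient descent shrinks sigma.
near_d_mu, near_d_sigma = neg_log_prob_grads(31.0, mu, sigma)

# Good action far from the mean: d_sigma < 0, so gradient descent grows sigma.
far_d_mu, far_d_sigma = neg_log_prob_grads(45.0, mu, sigma)

print(near_d_mu, near_d_sigma)
print(far_d_mu, far_d_sigma)
```

In both cases `d_mu` is negative because the action was above the mean, so a descent step moves the mean up towards the action.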

Now, how is the critic network set up? The critic is actually tightly coupled with the actor network. Just as each separate action component might have its own head on top of the shared body (whose output can be thought of as a vector representing the world state), the critic also gets its own head. It, however, just outputs a single value V(s). The value head is trying to predict the total cumulative discounted future reward, or the return. The critic's setup is a bit simpler: we just compute the MSE loss between V(s) and the target, which is the reward plus the discounted value of the next state.

<blockquote>
    An aside on the gradients of a few different action types and how they flow backwards:
    Earlier we looked quickly at Gaussian actions, which are cool. However, there could be many types of actions.

    If we need a categorical action the network just outputs the logits, one for each of the action categories. We then softmax them to get the probability distribution and sample an action from it. For the update, the loss is -log(action probability) * advantage. This causes backprop to adjust the logits to increase / decrease the log probability of that action based on the sign and magnitude of the advantage.
</blockquote>
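The categorical case from the aside can be worked out by hand. This is a plain-Python sketch with made-up function names. It uses the standard result that the gradient of -log softmax(logits)[action] with respect to logit i is p_i minus 1 for the chosen action and p_i otherwise.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_loss_and_logit_grads(logits, action, advantage):
    """Loss -log(p[action]) * advantage and its gradient with respect to each logit."""
    probs = softmax(logits)
    loss = -math.log(probs[action]) * advantage
    grads = [advantage * (p - (1.0 if i == action else 0.0)) for i, p in enumerate(probs)]
    return loss, grads


loss, grads = policy_loss_and_logit_grads([0.0, 0.0], action=0, advantage=2.0)
print(loss, grads)
```

With a positive advantage the chosen action's logit gets a negative gradient, so a descent step raises it, and the other logits are pushed down.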

Sidenote: why does the actor critic setup make the learning more stable?

To understand why, it helps to compare it to the two "pure" methods that it's derived from: actor-only and critic-only methods.

1. Compared to actor only methods.

In actor only methods the agent plays an episode and then at the end gets a total discounted return. It then goes back and reinforces every action using that return. The problem here is that it's an inefficient way to assign credit. You reinforce everything (even things that might not actually have helped) or you suppress everything (even good actions). Actor critic solves this by introducing a critic, which provides a baseline. Instead of using the raw total return we use the advantage function. This tells us how much better we did compared to the average / baseline. This helps with stability because you can better assign credit. Imagine you take action A and you get a score of +1000, and if you take action B you get a score of +990. Those are both going to get reinforced hard. But now imagine that the critic learns that the average value of this state is around 995. That becomes our baseline for this state. Now we know that action A is good and gives us +5 and action B is actually bad and gives us -5.
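The +1000 / +990 example as arithmetic, using the same numbers as above (the variable names are just for illustration):

```python
returns = {"A": 1000.0, "B": 990.0}
baseline = 995.0  # the critic's estimate of the state's average value

# Without a baseline both actions get a large positive learning signal.
raw_signal = dict(returns)

# With the baseline the signal is the advantage, which has the right sign.
advantages = {action: ret - baseline for action, ret in returns.items()}

print(raw_signal)
print(advantages)
```

The learning signals are also much smaller in magnitude, which is where the variance reduction comes from.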

Yet what if the critic is wrong and is giving a bad signal? Hopefully what happens is that the actor and critic learn together. The critic has its own loss function which forces it to constantly improve its estimate.


2. Compared to critic only methods like DQN

Critic only methods like DQN create their policy by taking the action that has the highest estimated Q value. This can lead to instability and a problem called policy chattering. If action 1 has a Q value of 0.51 and action 2 has 0.50, the agent will always do action 1. After one learning update the estimates might flip so that it favors action 2 and always does that instead. So a small change in the values causes a complete reversal of the agent's behavior. Actor critic solves this by learning a policy that is a distribution. A small update only shifts the probabilities a little, so it's smoother and won't lead to the jittering.
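A sketch of the chattering effect with the 0.51 / 0.50 numbers. The softmax over the same two numbers is only a stand-in for an actor's smooth output distribution (an assumption for illustration, since a real actor has its own logits), and the function names are made up.

```python
import math


def greedy_policy(values):
    """Deterministic policy: probability 1 on the argmax."""
    best = max(range(len(values)), key=lambda i: values[i])
    return [1.0 if i == best else 0.0 for i in range(len(values))]


def softmax_policy(preferences):
    """Smooth stochastic policy over the same numbers."""
    m = max(preferences)
    exps = [math.exp(x - m) for x in preferences]
    total = sum(exps)
    return [e / total for e in exps]


before = [0.51, 0.50]  # estimates before an update
after = [0.50, 0.51]   # estimates after a tiny update

greedy_change = abs(greedy_policy(after)[0] - greedy_policy(before)[0])
smooth_change = abs(softmax_policy(after)[0] - softmax_policy(before)[0])
print(greedy_change, smooth_change)
```

The greedy policy flips completely while the smooth policy's probabilities barely move.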

One thing I think about the jittering is that it can kind of scramble up the policy. Imagine you keep doing action 1 in the above scenario and now you start doing action 2. Well, maybe the policy now goes down a completely different trajectory. The Q values it learned from going down the action 1 path may now end up getting kind of overwritten / forgotten. This can lead to oscillations or just more inefficient learning.

But DQN is not actor critic, so how is it somewhat stable? It's stable through a more brute force mechanism. It's an off-policy method that avoids "true" online updates. Its stability comes from the experience replay and the target network. So it's stable, but it pays the price of having a huge memory buffer and being computationally heavy.

##### N-step returns

To understand the n-step returns part we should first take a look at the two extreme ends of estimating the value of a state. The first is Monte Carlo estimation. You start at a state and play until the end of the episode. Then you loop back and add up all the discounted rewards you received. This is unbiased but quite high variance. The other end is TD estimation / 1-step return. You start at a state and then take one step. You then get the reward and ask the critic for the value of the next state. Your total estimate is the real reward plus that discounted estimate. This is quite biased (it leans on the critic's guess) but is low variance.

Now to get the best of both worlds we can use n-step returns. It's a bit of a slider between the two extremes. Instead of one step or all the way to the end we look n steps ahead. We use the real observed rewards for those n steps and then finally add the critic's value estimate after n steps. This is the formula:

$ R_t^{(n)} = r_t + \gamma r_{t+1} + \dots + \gamma^{n-1}r_{t+n-1} + \gamma^n V(s_{t+n}) $

The n then becomes a hyperparameter. In the A3C paper they chose n=5. Implementation-wise we calculate it by working backwards from the last step, reusing the return of step t+1 to compute the return of step t so that we don't repeat work (a dynamic programming trick). Setting n lets you control the bias-variance trade-off. A higher n gives you high variance but low bias (like Monte Carlo). A lower n gives you low variance but high bias.
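The backwards calculation can be sketched like this. It's a minimal version that ignores episode boundaries inside the rollout; the function name `n_step_returns` and the example numbers are made up.

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Compute a return for every step of a rollout by working backwards.

    bootstrap_value is the critic's estimate V(s) for the state after the last step.
    """
    returns = []
    running = bootstrap_value
    for reward in reversed(rewards):
        # Reuse the return of the following step instead of re-summing rewards.
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


print(n_step_returns([1.0, 1.0, 1.0], bootstrap_value=4.0, gamma=0.5))
# [2.25, 2.5, 3.0]
```

Note that within one rollout the first step looks a full n steps ahead, while the last step is effectively a 1-step return.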

How do the loss functions work? We mostly covered this earlier already.

The actor's loss is the negative log probability of the action that was taken, multiplied by the advantage. We take the gradient of this loss with respect to the weights to make that action more or less likely depending on the advantage. One note about the advantage: in the actor / policy loss we don't backprop through it into the critic. It's treated as a fixed constant.

The critic's loss is the squared error between its prediction and the n-step return that was calculated.
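Both losses on a toy batch, written in plain Python so the arithmetic is visible. There's no autograd here, so "treating the advantage as a constant" only shows up as a comment. The function name `actor_critic_losses` and the batch values are made up.

```python
import math


def actor_critic_losses(log_probs, values, returns):
    """Mean policy loss and mean value loss over a batch of transitions."""
    n = len(returns)
    # Advantage = n-step return minus the critic's prediction.
    # In a real implementation this is detached so the policy loss doesn't train the critic.
    advantages = [ret - v for ret, v in zip(returns, values)]
    policy_loss = -sum(lp * adv for lp, adv in zip(log_probs, advantages)) / n
    value_loss = sum((ret - v) ** 2 for ret, v in zip(returns, values)) / n
    return policy_loss, value_loss


log_probs = [math.log(0.5), math.log(0.5)]  # log pi(a|s) of the actions taken
values = [1.0, 2.0]                         # critic predictions V(s)
returns = [2.0, 1.0]                        # n-step returns
policy_loss, value_loss = actor_critic_losses(log_probs, values, returns)
print(policy_loss, value_loss)
```

In this batch the two advantages are +1 and -1 with equal log probabilities, so the policy loss terms cancel while the value loss is 1.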

The authors also added an entropy bonus to the loss to improve exploration. This adds a small bonus to the objective function that rewards it for having a high-entropy policy. To do this they modify the loss function. The loss function has two parts, the policy loss and the value loss. We now add in an extra entropy term

$ L_{total} = L_{policy} + c \cdot L_{value} - \beta \cdot H(\pi(s; \theta)) $

The beta is a hyperparameter: the entropy coefficient. The c is the weight on the value loss. The H is the entropy of the policy's action distribution, calculated as

$ H = -\sum_i p_i \log(p_i) $.

Note we subtract because we want to minimize the loss, therefore to make the loss smaller we need to make the entropy larger.
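A sketch of the entropy term and the combined loss. The coefficient defaults (`value_coef=0.5`, `entropy_coef=0.01`) are assumptions chosen for illustration, and the function names are made up.

```python
import math


def entropy(probs):
    """H = -sum_i p_i * log(p_i), skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def total_loss(policy_loss, value_loss, probs, value_coef=0.5, entropy_coef=0.01):
    """L_total = L_policy + c * L_value - beta * H(pi)."""
    return policy_loss + value_coef * value_loss - entropy_coef * entropy(probs)


uniform = [0.5, 0.5]  # maximum entropy for two actions
peaked = [0.99, 0.01]  # nearly deterministic

print(entropy(uniform), entropy(peaked))
print(total_loss(1.0, 2.0, uniform), total_loss(1.0, 2.0, peaked))
```

For the same policy and value losses, the more spread-out distribution gets the lower total loss, which is how the bonus discourages collapsing to a deterministic policy too early.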

##### Full algorithm walkthrough
Now, it might be helpful to do a full algorithm walkthrough.

1. First we initialize our, say, 8 parallel threads / workers; each one has its own copy of the environment
2. Each of them will copy the weights from the master copy
3. Then they all run n steps simultaneously

Now let's say we examine a single worker and imagine this is happening simultaneously

1. The worker starts in a state
2. It feeds the state into the actor / policy network and gets a distribution of the actions
3. It then samples from that distribution and takes that action
4. This lands it in a new state with perhaps a reward
5. It stores the transition from the state 0 to state 1, the action it took, the distribution and the new reward in a small buffer
6. It continues this n times
7. It then waits until all workers have completed their n steps

Back to the main master thread

4. We now have 8 of these buffers and 40 transitions (8 workers × n = 5 steps)
5. For each of those transitions we loop backwards and calculate the advantage and the return for all the steps
6. To calculate the return we add the reward for this state + gamma times the value / critic's estimate of the next state
7. NOTE: we only use the critic's estimate of the next state the first time we do this (for the last step). Per the n-step return, we then use the return we just computed as the "next value" for the state at t-1, "backpropagating" these numbers with more reality grounding
8. To calculate the advantage we take the return we just calculated and subtract the critic's value for this state
9. Note: to not repeat work we can save the forward pass through the critic's network for the value at each state
10. Then we will have a batch of 40 transitions and for each we have the state, action taken, the n-step return and the calculated advantage

Now we do the network update

11. We take those 40 transitions and do a forward pass through the full network. Note that we group them up into one big batch to run on the GPU, that's why it's more efficient, in A2C we don't run mini batches (but heads up in PPO we might)
12. We then get our policy loss and the value loss, we also calculate the entropy for each of them to get our total loss
13. Then we call .backward() on the total loss value and then we run the optimizer one step.

NEXT: and this is important!

We don't reset the simulators after an update. What we do now is continue from where each one left off and repeat this process in chunks of n steps. Only if the episode ends or a terminal state is reached do we reset that simulator.

### Readthrough / Other Misc Notes
1. The sequence of observed data encountered by an online RL agent, and the resulting online RL updates, are strongly correlated.
2. One can explicitly use different exploration policies in each actor-learner to maximize this diversity
3. Adding the entropy of the policy pi to the objective function improved exploration by discouraging premature convergence to suboptimal deterministic policies.
4. The proposed framework scales well with the number of parallel workers
5. Interestingly certain algorithms exhibit superlinear speedups that can't be explained by purely computational gains. Certain methods require less data to achieve a particular score. This could be because having multiple threads helps reduce the bias.
6. Method is quite robust to the choice of learning rate.
7. Network architecture was copied from Mnih -- convolutional layer followed by fully connected layer with Relu

### Implementation Plan A2C actor + critic

I should try to code one up from scratch now. I'm going to run it on Gymnasium CartPole.

Steps

1. Create the actor critic class with shared body with actor and critic head
2. Worker class that runs in a separate thread
3. Main training script that coordinates everything


In [1]:
import sys
!{sys.executable} -m pip install "gymnasium[classic-control]"
import gymnasium as gym


In [2]:
import threading

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Use Apple's MPS backend when it is available, otherwise fall back to the CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

class ActorCritic(nn.Module):
    def __init__(self, n_actions, hidden_size):
        super().__init__()
        # Shared body: maps the 4-dimensional CartPole observation to a feature vector.
        self.shared = nn.Sequential(
            nn.Linear(4, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU()
        )
        # Actor head: one logit per action. CartPole has 2 actions, so the size is
        # hard-coded here and the n_actions argument is currently unused.
        self.actor_head = nn.Linear(hidden_size, 2)
        # Critic head: a single state value V(s).
        self.critic_head = nn.Linear(hidden_size, 1)
        self.to(device)

    def forward(self, x):
        x = torch.as_tensor(x, dtype=torch.float32, device=device)
        shared_features = self.shared(x)
        # Unnormalized action logits; the caller applies softmax / log_softmax.
        action_logits = self.actor_head(shared_features)
        critic_value = self.critic_head(shared_features)
        return action_logits, critic_value

class Worker(threading.Thread):
    def __init__(self, worker_id, task_queue, results_queue, actor_critic):
        super().__init__()
        self.worker_id = worker_id
        self.task_queue = task_queue

        # environment and agent attributes
        self.gym_environment = gym.make("CartPole-v1")
        self.environment_state = self.gym_environment.reset()[0]
        self.actor_critic = actor_critic
        self.results_queue = results_queue
    
    def run(self):
        while True:
            command, data = self.task_queue.get()
            if command == 'collect':
                experience = self.collect_experience(data['n_steps'])
                self.results_queue.put((self.worker_id, experience))
            elif command == 'stop':
                break
            
    def collect_experience(self, n_steps):
        states = []
        actions = []
        rewards = []
        dones = []

        for _ in range(n_steps):
            # Sample an action from the current policy; no gradients are needed while acting.
            with torch.no_grad():
                action_logits, _ = self.actor_critic(self.environment_state)
            probs = F.softmax(action_logits, dim=-1)
            action = torch.multinomial(probs, num_samples=1).item()
            next_state, reward, terminated, truncated, info = self.gym_environment.step(action)

            # Treat time-limit truncation like termination so the environment is always
            # reset when an episode ends. This skips the bootstrap value at truncation,
            # which slightly biases those returns.
            done = terminated or truncated

            states.append(self.environment_state)
            actions.append(action)
            rewards.append(reward)
            dones.append(done)

            self.environment_state = next_state
            if done:
                self.environment_state = self.gym_environment.reset()[0]

        # State to bootstrap from, and whether the rollout ended exactly at an episode end.
        last_state = self.environment_state
        last_done = done

        return {
            "states": states,
            "actions": actions,
            "rewards": rewards,
            "n_step_returns": self.calculate_n_step_returns(rewards, dones, last_state, last_done),
        }

    def calculate_n_step_returns(self, rewards, dones, last_state, last_done):
        discount_factor = 0.99
        if last_done:
            next_return = 0.0
        else:
            with torch.no_grad():
                next_return = self.actor_critic.forward(last_state)[1].item()

        returns = []
        for i in reversed(range(len(rewards))):
            next_return = rewards[i] + discount_factor * next_return * (1 - dones[i])
            returns.append(next_return)

        returns.reverse()
        return returns

In [10]:
# main training loop
import queue
print("starting")

# hyper params
NUM_WORKERS = 6
EPISODES = 10000
UPDATE_STEPS = 5
LEARNING_RATE = 0.0005

# setup the global network and optimizer
global_network = ActorCritic(2, 128)  # n_actions=2 for CartPole, hidden size 128
optimizer = optim.Adam(global_network.parameters(), lr=LEARNING_RATE)

workers = []
task_queues = []
results_queue = queue.Queue()
episode_rewards = [] 

for i in range(NUM_WORKERS):
    task_q = queue.Queue()
    worker = Worker(i, task_q, results_queue, global_network)
    workers.append(worker)
    task_queues.append(task_q)
    worker.start()

for episode in range(EPISODES):
    if (episode + 1) % 200 == 0:
        print(f"Episode {episode+1}")

    for q in task_queues:
        q.put(('collect', {"n_steps": UPDATE_STEPS}))

    all_states = []
    all_actions = []
    all_n_step_returns = []

    for _ in range(NUM_WORKERS):
        worker_id, experience = results_queue.get()
        all_states.extend(experience['states'])
        all_actions.extend(experience['actions'])
        all_n_step_returns.extend(experience['n_step_returns'])
        episode_rewards.append(sum(experience['rewards']))

    all_states = torch.tensor(all_states, dtype=torch.float32).to(device)
    all_actions = torch.tensor(all_actions, dtype=torch.int64).to(device)
    all_n_step_returns = torch.tensor(all_n_step_returns, dtype=torch.float32).to(device)

    action_logits, critic_values = global_network(all_states)
    all_critic_estimates = critic_values.squeeze()

    critic_loss = F.mse_loss(all_critic_estimates, all_n_step_returns)

    all_advantages = all_n_step_returns - all_critic_estimates.detach()

    log_probs = F.log_softmax(action_logits, dim=-1)
    action_log_probs = log_probs.gather(1, all_actions.unsqueeze(1)).squeeze()

    policy_loss = -(action_log_probs * all_advantages).mean()

    total_loss = policy_loss + 0.5 * critic_loss

    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()


for q in task_queues:
    q.put(('stop', None))
for w in workers:
    w.join()


print("done")

starting
Episode 200
Episode 400
Episode 600
Episode 800
Episode 1000
Episode 1200
Episode 1400
Episode 1600
Episode 1800
Episode 2000
Episode 2200
Episode 2400
Episode 2600
Episode 2800
Episode 3000
Episode 3200
Episode 3400
Episode 3600
Episode 3800
Episode 4000
Episode 4200
Episode 4400
Episode 4600
Episode 4800
Episode 5000
Episode 5200
Episode 5400
Episode 5600
Episode 5800
Episode 6000
Episode 6200
Episode 6400
Episode 6600
Episode 6800
Episode 7000
Episode 7200
Episode 7400
Episode 7600
Episode 7800
Episode 8000
Episode 8200
Episode 8400
Episode 8600
Episode 8800
Episode 9000
Episode 9200
Episode 9400
Episode 9600
Episode 9800
Episode 10000
done


In [None]:
import torch
import time
import torch.nn.functional as F

eval_env = gym.make("CartPole-v1", render_mode="human")

state, info = eval_env.reset()
done = False
total_reward = 0

with torch.no_grad():
    while not done:
        eval_env.render()

        action_logits, _ = global_network.forward(state)
        
        action_probs = F.softmax(action_logits, dim=-1)
        
        action = torch.multinomial(action_probs, num_samples=1).item()
        
        next_state, reward, terminated, truncated, info = eval_env.step(action)
        
        done = terminated or truncated
        
        state = next_state
        total_reward += reward
        

eval_env.close()

print("\n" + "="*40)
print(f"Evaluation episode finished!")
print(f"Total Reward: {total_reward}")
print("="*40)


Evaluation episode finished!
Total Reward: 500.0

