Actor-critic methods are a way of reducing the variance of the policy gradient. The idea is that by using two networks, or at least two different function approximators, the agent learns two more easily digestible curricula, which together can help it master whatever environment it is currently in.

Making a discrete actor critic now. 

To motivate actor critic algorithms mathematically, recall the policy gradient is  

$$\nabla_{\theta} J(\theta) = \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta}(a_{i,t} | s_{i,t}) G_{i,t}$$

where $G_{i,t}$ is the weight given to action $a_{i,t}$: some measure, such as an advantage or the reward-to-go, of how good that action was compared to the other possible actions in that state. In the regular old policy gradient, $G$ is simply the total remaining reward to go, so the action is treated as either really good or really bad depending on the value of $G$.
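As a rough illustration of the reward-to-go, here is a minimal sketch in plain Python. The function name `rewards_to_go` and the optional discount `gamma` are assumptions for the example (the formula above is undiscounted, which corresponds to `gamma=1.0`).

```python
def rewards_to_go(rewards, gamma=1.0):
    """Return G_t = sum over k >= t of gamma^(k - t) * r_k, for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # walk the episode backwards so each step reuses the sum of the later steps
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


episode_rewards = [1.0, 1.0, 1.0, 1.0]  # e.g. +1 per step, as in CartPole
print(rewards_to_go(episode_rewards))  # [4.0, 3.0, 2.0, 1.0]
```

Early actions are weighted by everything that happens after them, which is part of why this estimate is so noisy.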

Unfortunately, there are many caveats associated with $G$ simply being the total reward to go. For one, this value is sensitive to the absolute level of the rewards rather than to relative reward differences, meaning that if you were to simply add $40$ to every reward the agent receives, you could end up with a very different policy! Good policies should not be sensitive to constant shifts. Clearly, we have some work to do...
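To make the shift problem concrete, here is a small sketch under simplifying assumptions: a one-step, two-action problem with a softmax policy, where we compute the exact mean and variance of the single-sample gradient estimate with respect to the first logit. The function name `grad_stats` and all the numbers are made up for illustration.

```python
def grad_stats(rewards, probs):
    """Exact mean and variance of g = d/d(theta_0) log pi(a) * r(a),
    where a is sampled from a softmax policy with action probabilities `probs`."""
    samples = []
    for action, (p, r) in enumerate(zip(probs, rewards)):
        # for a softmax policy, d/d(theta_0) log pi(a) = 1[a == 0] - pi(0)
        score = (1.0 if action == 0 else 0.0) - probs[0]
        samples.append((p, score * r))
    mean = sum(p * g for p, g in samples)
    var = sum(p * (g - mean) ** 2 for p, g in samples)
    return mean, var


probs = [0.5, 0.5]
mean_raw, var_raw = grad_stats([1.0, 0.0], probs)            # original rewards
mean_shifted, var_shifted = grad_stats([41.0, 40.0], probs)  # every reward + 40
print(mean_raw, var_raw)
print(mean_shifted, var_shifted)
```

In this toy case the expected gradient is the same before and after the shift, but the variance of a single sample grows by several orders of magnitude, so with a finite batch the update (and hence the policy) can come out very differently.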

There are a bunch of ways to adjust $G_{i,t}$ for variance reduction. One of the most popular (and already implemented) is to subtract a baseline equal to the mean reward. Additionally, something like an expected reward to go would be a better estimate and a further variance reduction technique. We'll see that shortly.

With a baseline, the policy gradient is 

$$\nabla_{\theta} J(\theta) = \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta}(a_{i,t} | s_{i,t}) \left( G_{i,t} - b(s_{i,t}) \right)$$
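A minimal sketch of the simplest case, a constant baseline equal to the mean return, might look like the following. The helper name `subtract_mean_baseline` is hypothetical, and the numbers are made up.

```python
def subtract_mean_baseline(returns):
    """Center a batch of returns by subtracting their mean (a constant baseline)."""
    baseline = sum(returns) / len(returns)
    return [g - baseline for g in returns]


returns = [4.0, 3.0, 2.0, 1.0]
shifted = [g + 40.0 for g in returns]  # the same returns, all moved up by 40

print(subtract_mean_baseline(returns))  # [1.5, 0.5, -0.5, -1.5]
print(subtract_mean_baseline(shifted))  # [1.5, 0.5, -0.5, -1.5]
```

After centering, a constant offset in the returns no longer changes the weights that multiply $\nabla_{\theta} \log \pi_{\theta}$.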

However, there is no reason for this baseline to be a constant number! In fact, it would be a great idea if this baseline could also be a function of the state observation $s_t$. At the same time, this value is starting to look a lot like the reinforcement learning concepts we considered at the beginning of this course, namely the $Q$ function, the $V$ function, and the $A$ function.

The $Q$ value is the expected reward to go, given a state and action. 

$Q(s_t,a_t) = E[G_t | (s_t, a_t)]$

The $V$ value is the average of the $Q$ values in a given state, weighted by the policy's probability of taking each action.

$V(s_t) = \sum_a \pi_{\theta}(a | s_t) Q(s_t,a)$

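and $A$ represents the advantage a particular $Q$ value has over the average reward to go in a state. It is simply

$A = Q - V \approx r + \gamma V' - V$

where $V'$ is the value at the next state, $\gamma$ is the discount factor, and the last expression is a one-step estimate of $Q - V$.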
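As a small sanity check of these definitions, the sketch below uses made-up numbers for a single state with two actions (the action names, $Q$ values, and policy probabilities are all hypothetical). It takes $V$ to be the policy-weighted average of $Q$ and shows that the advantages then average to zero under the policy.

```python
# hypothetical Q values and policy probabilities for one state
q_values = {"left": 1.0, "right": 3.0}
policy = {"left": 0.25, "right": 0.75}

# V(s) = sum over actions of pi(a | s) * Q(s, a)
v = sum(policy[a] * q_values[a] for a in q_values)

# A(s, a) = Q(s, a) - V(s)
advantages = {a: q_values[a] - v for a in q_values}

# the policy-weighted average advantage is zero by construction
mean_advantage = sum(policy[a] * advantages[a] for a in q_values)

print(v)               # 2.5
print(advantages)      # {'left': -1.5, 'right': 0.5}
print(mean_advantage)  # 0.0
```

A positive advantage means the action is better than what the policy does on average in that state, and a negative one means it is worse.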

Any one of these three functions ($Q$, $V$, $A$) can be expressed in terms of the others, and thus any of them could serve as the approximator we plug into the AC algorithm.

In any case, we can leverage the Bellman equation, which relates the value of a state to the reward plus the value of the next state, and use it as a regression target to fit the critic network.
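One way to picture fitting a critic like this is a tabular TD(0) update, sketched below. The names `td_update`, `values`, and `lr` are illustrative assumptions; a network-based critic replaces the table with a function approximator and the manual update with a gradient step on the squared TD error.

```python
def td_update(values, state, reward, next_state, done, gamma=0.99, lr=0.1):
    """Move V(state) a small step toward the target r + gamma * V(next_state)."""
    # no bootstrapping from the next state once the episode has ended
    target = reward + (0.0 if done else gamma * values[next_state])
    delta = target - values[state]  # TD error, also a one-step advantage estimate
    values[state] += lr * delta
    return delta


values = {"s0": 0.0, "s1": 1.0}  # made-up value table with two states
delta = td_update(values, "s0", reward=1.0, next_state="s1", done=False, gamma=0.5)
print(delta, values["s0"])
```

The same `delta` can then be used to weight the actor's log-probability, which is the role it plays in an actor-critic update.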

The full implementation follows: an actor-critic algorithm that trains an agent in a discrete action space.

Spoiler alert: This algorithm won't work! But that just gives me more motivation to write my own implementation.

In [5]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import gym 
#from gym import wrappers
import matplotlib.pyplot as plt
%matplotlib inline

In [6]:
# Two networks, with two hidden layers. 

#we'll have an actor network, and a critic network.

class GenericNetwork(nn.Module):
    # To use for both networks 
    def __init__(self,lr,input_dims,fc1_dims,fc2_dims,
                n_actions):
        super(GenericNetwork,self).__init__()
        
        self.lr = lr
        self.input_dims = input_dims
        self.fc1_dims = fc1_dims
        self.fc2_dims = fc2_dims
        self.n_actions = n_actions
    
        self.fc1 = nn.Linear(*self.input_dims, self.fc1_dims)
        self.fc2 = nn.Linear(self.fc1_dims,self.fc2_dims)
        
        self.fc3 = nn.Linear(self.fc2_dims,self.n_actions)
        
        #basis network for both A and C, # of output will change!
        
        self.optimizer = torch.optim.Adam(self.parameters(),lr = self.lr)
        self.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    
        self.to(self.device)
    #now the forward pass!
    
    def forward(self,state): 
        #state and obs are interchangeable, not a POMDP setting...
        state = torch.Tensor(state).to(self.device)        
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        x = self.fc3(x) #no activation, handled later 
        return x

In [7]:
class Agent(object):
    def __init__(self,alpha,beta,input_dims,gamma=0.99, n_actions = 4, 
                layer1_size=64,layer2_size = 64):
        self.gamma = gamma
        self.log_probs = None # log probability of selecting an action 
        self.actor = GenericNetwork(alpha,input_dims,layer1_size,layer2_size,n_actions = n_actions)
        self.critic = GenericNetwork(beta,input_dims,layer1_size,layer2_size,n_actions = 1) 
          
    def choose_action(self,observation):
        # softmax over the action dimension turns the actor's raw outputs into probabilities
        probabilities = F.softmax(self.actor.forward(observation), dim=-1)
        # categorical distribution over the discrete actions, dictated by the policy network
        action_probs = torch.distributions.Categorical(probabilities)
        #draw sample from it 
        action = action_probs.sample()
        self.log_probs = action_probs.log_prob(action).to(self.actor.device)

        return action.item() #not a tensor!
    
    def learn(self,state,reward,new_state,done):
        # Temporal-difference style update, using the current state and the next state.
        # Plain policy gradient is Monte Carlo style: it accumulates rewards over the whole episode.
        # Here we update after every single transition instead.
        self.actor.optimizer.zero_grad()
        self.critic.optimizer.zero_grad()
        
        critic_value_ = self.critic.forward(new_state) # V(s') for the TD target
        critic_value = self.critic.forward(state) # V(s) at the current state
        
        reward = torch.Tensor([reward]).to(self.actor.device)
        
        # TD error: delta = r + gamma * V(s') - V(s). It serves both as the critic's
        # regression error and as a one-step estimate of the advantage A = Q - V.
        delta = reward + self.gamma * critic_value_ * (1 - int(done)) - critic_value
        
        # detach delta so the actor loss does not push gradients into the critic
        actor_loss = -self.log_probs * delta.detach()
        
        critic_loss = delta ** 2 # minimize the squared TD error
        
        # sum the two losses and backpropagate once; with delta detached in the actor
        # loss, each network only receives gradients from its own loss
        (actor_loss + critic_loss).backward()
        self.actor.optimizer.step()
        self.critic.optimizer.step()
        
        

In [None]:
# now the main

agent = Agent(alpha = 0.001, beta = .001, input_dims = [4], gamma=0.99,
             layer1_size = 64, layer2_size=64,n_actions=2)

env = gym.make('CartPole-v1')
score_history = []
num_episodes = 2000
for i in range(num_episodes):
    done = False
    score = 0
    observation = env.reset()
    #now the game!
    
    while not done:
        action = agent.choose_action(observation)
        env.render()
        observation_,reward, done, _ = env.step(action)
        # NOTE: the learning step is commented out, so the agent never updates its networks
        # agent.learn(observation, reward, observation_, done)
        
        observation = observation_        
        score += reward
    score_history.append(score)
    
    print("Episode ", i, 'score ', score)

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Episode  0 score  9.0
Episode  1 score  17.0
Episode  2 score  14.0
Episode  3 score  37.0
Episode  4 score  40.0
Episode  5 score  18.0
Episode  6 score  46.0
Episode  7 score  33.0
Episode  8 score  12.0
Episode  9 score  28.0
Episode  10 score  9.0
Episode  11 score  74.0
Episode  12 score  15.0
Episode  13 score  13.0
Episode  14 score  9.0
Episode  15 score  29.0
Episode  16 score  12.0
Episode  17 score  32.0
Episode  18 score  22.0
Episode  19 score  14.0
Episode  20 score  14.0
Episode  21 score  11.0
Episode  22 score  17.0
Episode  23 score  13.0
Episode  24 score  38.0
Episode  25 score  14.0
Episode  26 score  20.0
Episode  27 score  16.0
Episode  28 score  12.0
Episode  29 score  22.0
Episode  30 score  15.0
Episode  31 score  26.0
Episode  32 score  19.0
Episode  33 score  13.0
Episode  34 score  18.0
Episode  35 score  16.0
Episode  36 score  11.0
Episode  37 score  25.0
Episode  38 score  14.0
Episode  39 score  30.0
Episode  40 score  19.0
Episode  41 score  10.0
Episode  42 score  12.0
Episode  43 score  12.0
Episode  44 score  9.0
Episode  45 score  11.0
Episode  46 score  21.0


If this works, I should feel good. I was able to very quickly adjust an architecture on the fly to switch from a continuous to a discrete action space!

So why doesn't this win? It only ever learns to minimize the negative reward.


It learns, quite hilariously, that if it stays at the bottom, the negative reward is minimized!

This is just one example of why AC, and RL in general, is a shaky tool!

