# Actor-Critic Method

This notebook is an example of the Advantage Actor-Critic method (a.k.a. A2C). The code is modified from [this PyTorch example](https://github.com/pytorch/examples/blob/main/reinforcement_learning/actor_critic.py). [This article](https://medium.com/towards-data-science/understanding-actor-critic-methods-931b97b6df3f) might also be useful as a future reference, although its code may be outdated.

There are multiple Actor-Critic methods, but here we focus on the Advantage Actor-Critic (A2C) method.


In [52]:
import sys
import gym
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

gamma     = 0.99
is_render = False 


# Actor-Critic Neural Network.

We merge the actor and critic networks into a single class, although they could be kept separate. Because they are merged, the two networks share the same state input and first hidden layer, and they are updated together by a single optimizer step.
A variant in which several parallel workers update the shared network asynchronously is called the "Asynchronous Advantage Actor-Critic" method (a.k.a. A3C); the details are in [this article](https://medium.com/@shagunm1210/implementing-the-a3c-algorithm-to-train-an-agent-to-play-breakout-c0b5ce3b3405).

### Actor Network
The actor network maps a state to a probability distribution over actions, so its input size is the state dimension $n_s$ and its output size is the number of actions $n_a$.
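
As a minimal sketch of what the actor's output is used for, the snippet below turns a vector of raw scores (logits) into action probabilities with a softmax, samples an action, and records its log-probability. It uses only the Python standard library; the names `softmax` and `sample_action` and the example logits are illustrative assumptions, not part of this notebook's code.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(logits, rng=random):
    """Sample an action index from softmax(logits); return it with its log-probability."""
    probs = softmax(logits)
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, math.log(probs[action])

# Hypothetical logits for a 2-action problem (n_a = 2)
probs = softmax([0.5, -0.5])
action, log_prob = sample_action([0.5, -0.5], rng=random.Random(0))
```

The log-probability of the sampled action is the quantity the actor loss is built from.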

### Critic Network
The critic network maps a state to a real scalar, which is an estimate of the value function $V^{\pi}(s)$.
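
The critic is trained by regressing its output toward the observed return. The training code in this notebook uses `F.smooth_l1_loss` for that; the sketch below reimplements the same loss in plain Python (assuming PyTorch's default threshold of 1 and mean reduction) so the arithmetic is visible. The helper name `smooth_l1` and the numbers are made up for illustration.

```python
def smooth_l1(predictions, targets):
    """Mean smooth L1 loss with threshold 1."""
    total = 0.0
    for v, g in zip(predictions, targets):
        d = abs(v - g)
        # quadratic for small errors, linear for large ones
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total / len(predictions)

# Hypothetical critic outputs V(s_t) and target returns G_t
values  = [0.5, 2.0, -1.0]
returns = [1.0, 0.0, -1.0]
loss = smooth_l1(values, returns)   # (0.125 + 1.5 + 0.0) / 3
```

Compared with a plain squared error, the linear branch limits the influence of a single badly predicted state.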

### Advantage Value 
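
The code in this notebook uses the Monte Carlo return of an episode as the critic's target. A reading of the advantage that is consistent with `adv_arr = Gt_arr - values` in `update_network` (a sketch inferred from the code, not a definition taken from the original text) is:

$$
G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t}\, r_k, \qquad A_t = G_t - V^{\pi}(s_t)
$$

Here $T$ is the episode length, $r_k$ the reward at step $k$, and $\gamma$ the discount factor. A positive $A_t$ means the action taken at step $t$ did better than the critic expected from state $s_t$. The actor loss then weights each log-probability by its advantage:

$$
L_{\text{actor}} = -\sum_{t=0}^{T-1} \log \pi(a_t \mid s_t)\, A_t
$$

Note that the code additionally normalizes $G_t$ before using it.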



In [57]:
def Gt_calc( rewards: np.ndarray, gamma: float = 1, is_normalize = False ) :
    """
        Computes the discounted return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
        for every time step, accumulating backwards from the end of the episode.

        Args:
            [1] rewards: an array of rewards for each time step. 

            [2] gamma: discount factor, valued between 0 and 1. 
                       If gamma = 1, then there is no discount applied

            [3] is_normalize: if True, shift and scale G_t to zero mean and unit standard deviation

        Return:
            G_t: an array of discounted (or if gamma = 1, simple sum of) rewards
    """

    N = len( rewards )

    Gt_arr = np.zeros( N )
    G = 0.0

    # G_t = r_t + gamma * G_{t+1}, starting from G = 0 after the last step
    for t in reversed( range( N ) ):
        G = rewards[ t ] + gamma * G
        Gt_arr[ t ] = G

    if is_normalize: Gt_arr = ( Gt_arr - Gt_arr.mean( ) ) / ( Gt_arr.std( ) + 1e-9 ) 

    return Gt_arr


class ActorCritic( nn.Module ):
    """
        Implements both actor and critic in one model
    """
    def __init__( self, num_input, num_output, num_hidden  ):

        super( ActorCritic, self ).__init__(  )

        # Saving this for choosing the action.
        self.n_action = num_output

        # First layer : from input to hidden variable
        self.affine1 = nn.Linear( num_input, num_hidden)

        # Second Actor's layer
        self.actor_layer = nn.Linear( num_hidden, num_output )

        # Second Critic's layer
        self.critic_layer = nn.Linear( num_hidden, 1 )

        # Optimizer 
        self.optimizer = optim.Adam( self.parameters( ), lr = 0.03 )

    def forward( self, x ):
        """
            Forward of both actor and critic.

            Args:
                - Current State Vector

            Outputs:
                - [1] action_prob: The pi( . | s ) itself
                - [2] state_value: The value function V(s)
        """

        # Forward the 1st layer
        x = F.relu( self.affine1( x ) )

        # Actor: Chooses action to take from state s_t 
        action_prob = F.softmax( self.actor_layer( x ), dim = -1 )

        # Critic: evaluates being in the state s_t
        state_value = self.critic_layer( x )

        # return values for both actor and critic as a tuple of 2 values:
        # 1. a list with the probability of each action over the action space
        # 2. the value from state s_t 
        return action_prob, state_value

    def get_action_and_values( self, state: np.ndarray ):
        """
            Args:
                [1] State vector

            Outputs:
                [1]   action: The action chosen from the distribution
                [2]    value: Value from forwarding the network.
                [3] log_prob: The log probability of the chosen action
        """
        state = torch.from_numpy( state ).float( )

        # Forwarding the network.
        prob_dist, value = self.forward( state )

        # The action to take (left or right)
        action = np.random.choice( self.n_action, p = np.squeeze( prob_dist.detach( ).numpy( ) ) )

        # Log-probability
        log_prob = torch.log( prob_dist.squeeze( 0 )[ action ] )

        return action, value.squeeze( 0 ), log_prob

    def update_network( self, rewards, values, log_probs ):
        """
            Training code. Calculates actor and critic loss and performs backprop.

            Args:
                -   rewards: Given the trajectory, a list of rewards 
                -    values: Given the trajectory, a list of the critic's estimates V(s_t), one per visited state
                - log_probs: Given the trajectory, a list of log pi(a_t|s_t), one per action taken
        """
        
        # Note: the discount is hard-coded to 0.9 here; the global gamma is not used.
        Gt_arr    = torch.tensor( Gt_calc( rewards, gamma = 0.9, is_normalize = True ), dtype = torch.float32 ) 
        values    = torch.stack( values )
        log_probs = torch.stack( log_probs )

        # Advantage, detached so that the policy loss does not backpropagate into the critic
        adv_arr   = Gt_arr - values.detach( )

        policy_losses = -torch.dot( log_probs , adv_arr )
        value_losses  = F.smooth_l1_loss( values, Gt_arr )

        # reset gradients
        self.optimizer.zero_grad()

        # sum up the policy loss and the value loss
        loss = policy_losses + value_losses 

        # perform backprop
        loss.backward( )
        self.optimizer.step()


# The Advantage Actor-Critic (A2C) Method
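
To make one A2C update concrete, here is a small worked sketch on a made-up three-step trajectory, using plain Python instead of PyTorch. The rewards, value estimates, and log-probabilities are invented numbers, and `discounted_returns` is a hypothetical helper; the point is only to show how returns, advantages, and the actor loss relate.

```python
def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed backwards from the last step."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

rewards   = [1.0, 1.0, 1.0]       # one reward per step, as in CartPole
values    = [2.0, 1.5, 0.5]       # hypothetical critic estimates V(s_t)
log_probs = [-0.7, -0.6, -0.8]    # hypothetical log pi(a_t | s_t)

returns    = discounted_returns(rewards, gamma=0.5)      # [1.75, 1.5, 1.0]
advantages = [g - v for g, v in zip(returns, values)]    # [-0.25, 0.0, 0.5]

# Actor loss: negative sum of log-probabilities weighted by the advantages
policy_loss = -sum(lp * a for lp, a in zip(log_probs, advantages))   # about 0.225
```

In this example the last action has a positive advantage, so minimizing the loss raises its probability, while the first action, with a negative advantage, is made less likely.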


In [58]:
# Generate the gym of Cart-and-Pole
env = gym.make( 'CartPole-v1' ) 

# The state dimension and the number of actions are 4 and 2
ns  = env.observation_space.shape[ 0 ]
na  = env.action_space.n

# Create the actor-critic network (the Adam optimizer is created inside the class)
a2c_nn    = ActorCritic( ns, na, 128 )
eps       = np.finfo( np.float32 ).eps

# The number of episodes
N_eps = 1000 

num_steps   = np.zeros( N_eps )
all_rewards = np.zeros( N_eps )

for i in range( N_eps ):

    # reset environment and episode reward
    # (gym < 0.26 API: reset returns only the observation, step returns 4 values)
    state = env.reset()
    ep_reward = 0

    # per-episode buffers
    rewards   = []
    log_probs = [] 
    values    = []

    # for each episode, run at most 1000 steps so that we don't loop forever
    for t in range( 1000 ):

        # select action from policy
        if is_render: env.render( )

        action, value, log_prob  = a2c_nn.get_action_and_values( state )

        # Run one step
        new_state, reward, done, _ = env.step( action )

        rewards.append( reward )
        values.append( value )
        log_probs.append( log_prob )

        if done: 

            a2c_nn.update_network( rewards, values, log_probs )
            sum_rewards = sum( rewards )

            num_steps[   i ] = t + 1
            all_rewards[ i ] = sum_rewards

            # average over the (up to) 10 most recent episodes, including this one
            avg_reward = np.mean( all_rewards[ max( 0, i - 9 ) : i + 1 ] )
            
            sys.stdout.write( "episode: {}, total reward: {}, average_reward: {}, length: {}\n".format( i , np.round( sum_rewards, decimals = 3 ),  np.round( avg_reward, decimals = 3 ), t + 1 ) )

            # Leave the episode. Without this break the loop keeps stepping the finished
            # environment and calls update_network again on tensors recorded before
            # optimizer.step( ), and backward( ) then raises
            # "RuntimeError: one of the variables needed for gradient computation has been
            # modified by an inplace operation".
            break

        state = new_state

# References 

[1] [Great post](https://danieltakeshi.github.io/2018/06/28/a2c-a3c/)

[2] [Google Co-lab example](https://colab.research.google.com/github/yfletberliac/rlss-2019/blob/master/labs/DRL.01.REINFORCE%2BA2C.ipynb#scrollTo=xDifFS9I4X7A)