# A2C Tutorial

In this tutorial we will train an agent using Advantage Actor-Critic (A2C) for the Pendulum-v0 task from [OpenAI Gym](https://gym.openai.com/).

**Task**

The inverted pendulum swing-up problem is a classic problem in the control literature. In this version of the problem, the pendulum starts in a random position, and the goal is to swing it up so that it stays upright. You can find an
official leaderboard with various algorithms and visualizations at the
[Gym website](https://gym.openai.com/envs/Pendulum-v0).

<img src="https://user-images.githubusercontent.com/8510097/31701471-726f54c0-b385-11e7-9f05-5c50f2affbb4.PNG" alt="Pendulum" style="width: 400px;"/>

This is a continuous control task in which the action is a single continuous variable, the joint effort. The reward is the negative of a cost function of the observation and action: the lowest reward is -16.2736044 and the highest reward is 0, so the reward is never positive. In essence, the goal is to maximize the reward by remaining at zero angle (vertical) with the least rotational velocity and the least effort. More details can be found [here](https://github.com/openai/gym/wiki/Pendulum-v0).
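
To see where the lower bound of -16.2736044 comes from, here is a minimal sketch of the reward as described on the Gym wiki page linked above. The constants `MAX_SPEED` and `MAX_TORQUE` and the helper names are assumptions made for illustration, not code taken from the environment itself.

```python
import math

# Assumed bounds, as listed on the Gym wiki page for Pendulum-v0.
MAX_SPEED = 8.0    # maximum angular velocity
MAX_TORQUE = 2.0   # maximum joint effort

def angle_normalize(theta):
    """Wrap an angle into the interval [-pi, pi)."""
    return ((theta + math.pi) % (2 * math.pi)) - math.pi

def pendulum_reward(theta, theta_dot, torque):
    """Reward = -(angle^2 + 0.1 * velocity^2 + 0.001 * torque^2)."""
    torque = max(-MAX_TORQUE, min(MAX_TORQUE, torque))  # clip the effort
    cost = angle_normalize(theta) ** 2 + 0.1 * theta_dot ** 2 + 0.001 * torque ** 2
    return -cost

# Upright, motionless, no effort: the best case.
best = pendulum_reward(0.0, 0.0, 0.0)
# Hanging straight down at maximum speed with maximum effort: the worst case.
worst = pendulum_reward(math.pi, MAX_SPEED, MAX_TORQUE)
print(best, worst)
```

Under these assumptions the worst case is -(pi^2 + 0.1 * 8^2 + 0.001 * 2^2), which matches the bound quoted above.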

**Algorithm**

We will implement an A2C algorithm for this task. 

In [None]:
%matplotlib inline

In [None]:
import sys
import math
import gym
import numpy as np

import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.nn.utils as utils
import torchvision.transforms as T
from torch.autograd import Variable
from torch.distributions import Categorical

In [None]:
class NormalizedActions(gym.ActionWrapper):

    def _action(self, action):
        action = (action + 1) / 2  # [-1, 1] => [0, 1]
        action *= (self.action_space.high - self.action_space.low)
        action += self.action_space.low
        return action

    def _reverse_action(self, action):
        action -= self.action_space.low
        action /= (self.action_space.high - self.action_space.low)
        action = action * 2 - 1
        return action
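
The rescaling done by the wrapper can be checked by hand on scalars. The sketch below uses hypothetical helper names (`to_env_range`, `to_unit_range`) and assumes action bounds of [-2, 2], which is what Pendulum-v0 is commonly documented to use.

```python
def to_env_range(action, low, high):
    """Map an action from [-1, 1] to [low, high]."""
    return low + (action + 1.0) / 2.0 * (high - low)

def to_unit_range(action, low, high):
    """Map an action from [low, high] back to [-1, 1]."""
    return (action - low) / (high - low) * 2.0 - 1.0

low, high = -2.0, 2.0  # assumed bounds for the joint effort

scaled = to_env_range(0.5, low, high)        # policy output in [-1, 1]
recovered = to_unit_range(scaled, low, high)  # round trip back to [-1, 1]
print(scaled, recovered)
```

The two functions are inverses of each other, so a round trip should return the original value up to floating-point error.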

**Actor and Critic networks**

For continuous control tasks, the input to the policy (actor) is the state observation, and we assume the policy $\pi(a|s)$ follows a Gaussian distribution $N(\mu(s), \sigma^2(s))$. The mean $\mu(s)$ and variance $\sigma^2(s)$ of the Gaussian are estimated by a policy network. Actions can then be sampled from this distribution.

The critic network will take a state as input, and output a value function estimate of the input state. 
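
As a rough illustration of what "sampling from the policy" means, the sketch below draws an action from a Gaussian and evaluates its log density. It uses plain Python floats rather than tensors, and the values of `mu` and `sigma_sq` are hypothetical stand-ins for the actor's outputs for a single state.

```python
import math
import random

def sample_action(mu, sigma_sq, rng):
    """Draw a ~ N(mu, sigma_sq) as a = mu + sigma * eps, with eps ~ N(0, 1)."""
    eps = rng.gauss(0.0, 1.0)
    return mu + math.sqrt(sigma_sq) * eps

def gaussian_log_prob(a, mu, sigma_sq):
    """Log density of N(mu, sigma_sq) evaluated at a."""
    return -0.5 * math.log(2.0 * math.pi * sigma_sq) - (a - mu) ** 2 / (2.0 * sigma_sq)

rng = random.Random(498)
mu, sigma_sq = 0.3, 0.25  # hypothetical actor outputs (mean and variance)

action = sample_action(mu, sigma_sq, rng)
log_prob = gaussian_log_prob(action, mu, sigma_sq)
print(action, log_prob)
```

The log density is what enters the policy gradient later on. It is largest at the mean and decreases as the sampled action moves away from it.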

In [None]:
### TODO ###
## Implement the Actor and Critic networks

class Actor(nn.Module):
    def __init__(self, hidden_size, num_inputs, action_space):
        super(Actor, self).__init__()
        self.action_space = action_space
        num_outputs = action_space.shape[0]

        ### TODO ###
        ## Define layers

    def forward(self, inputs):
        
        ### TODO ###
        ## Implement forward pass
        
        return mu, sigma_sq
    
class Critic(nn.Module):
    def __init__(self, hidden_size, num_inputs):
        super(Critic, self).__init__()

        ### TODO ###
        ## Define layers

    def forward(self, inputs):
        
        ### TODO ###
        ## Implement forward pass

        return value

**Actor and Critic network losses**

You will implement the loss functions for the actor and critic networks below.

Recall policy gradients,
$$
J(\theta) = \mathbb{E}_{\tau \sim p(\tau;\theta)}[r(\tau)]
$$

$$
\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \text{log} \pi_\theta (a_t^i|s_t^i) A^\pi (s_t^i, a_t^i)
$$
where $N, T$ represent the number of agent trajectories and episode length, respectively.

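We will use n-step rewards to estimate the Q-function in the advantage, as follows. Note that in the formulas below, $N$ denotes the number of lookahead steps (`Nsteps` in the code), not the number of trajectories.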

Let 
$$
y_t^i = \left( \sum_{t'=t}^{t+N-1} \gamma^{t'-t}r(s_{t'}^i, a_{t'}^i) \right) + \gamma^N V_\phi^\pi(s_{t+N}^i) 
$$

We estimate the advantage as
$$
A^\pi (s_{t}^i, a_{t}^i) =  y_t^i - V_\phi^\pi(s_t^i)
$$
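
The targets and advantages above can be worked through on a toy episode. This is a minimal sketch on plain Python lists, not the solution expected in `compute_loss`; the function name `n_step_targets` is hypothetical, and it assumes that the value of any state beyond the end of the collected episode is zero, so the sum is truncated there.

```python
def n_step_targets(rewards, values, gamma, n_steps):
    """Compute y_t = sum_{t'=t}^{t+N-1} gamma^(t'-t) r_t' + gamma^N V(s_{t+N}).

    Assumption: states past the end of the episode have value 0, so the
    reward sum is truncated and the bootstrap term is dropped there.
    """
    T = len(rewards)
    targets = []
    for t in range(T):
        end = min(t + n_steps, T)
        # Discounted sum of the next (up to) n_steps rewards.
        y = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
        # Bootstrap from the critic only if s_{t+N} was actually visited.
        if t + n_steps < T:
            y += gamma ** n_steps * values[t + n_steps]
        targets.append(y)
    return targets

rewards = [1.0, 1.0, 1.0]  # toy rewards
values = [0.5, 0.5, 0.5]   # toy critic estimates V(s_t)
targets = n_step_targets(rewards, values, gamma=0.5, n_steps=2)
advantages = [y - v for y, v in zip(targets, values)]
print(targets)     # [1.625, 1.5, 1.0]
print(advantages)  # [1.125, 1.0, 0.5]
```

For example, $y_0 = 1 + 0.5 \cdot 1 + 0.5^2 \cdot 0.5 = 1.625$, while the last two targets have no bootstrap term because $s_{t+N}$ lies past the end of the episode.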

The loss function $\mathcal{L}_\theta$ for the actor network is given by
$$
\mathcal{L}_\theta = -\sum_{t=1}^T \text{log} \pi_\theta(a_t|s_t) A^\pi (s_t, a_t)
$$

The critic network is trained to regress to the targets $y_t^i$. The loss function $\mathcal{L}_\phi$ for the critic network is given by
$$
\mathcal{L}_\phi = \sum_{t=1}^T (V_\phi^\pi(s_t^i) - y_t^i)^2
$$
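
Both losses reduce to simple sums once the targets are known. The arithmetic sketch below uses hypothetical numbers for the log probabilities, values, and targets. In an autograd framework the advantage is usually treated as a constant in the actor loss, so that the actor's gradient does not flow into the critic, but that detail is an assumption about the implementation and not something shown here.

```python
log_probs = [-1.2, -0.7, -0.9]  # hypothetical log pi(a_t | s_t)
values = [0.5, 0.5, 0.5]        # hypothetical critic estimates V(s_t)
targets = [1.625, 1.5, 1.0]     # hypothetical n-step targets y_t

# Advantage A(s_t, a_t) = y_t - V(s_t)
advantages = [y - v for y, v in zip(targets, values)]

# Actor loss: negative sum of log-probabilities weighted by the advantages.
actor_loss = -sum(lp * adv for lp, adv in zip(log_probs, advantages))

# Critic loss: sum of squared errors between V(s_t) and the targets.
critic_loss = sum((v - y) ** 2 for v, y in zip(values, targets))

print(actor_loss, critic_loss)
```

Minimizing the actor loss raises the log probability of actions with positive advantage, and minimizing the critic loss pulls $V_\phi^\pi(s_t)$ toward $y_t$.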

In [None]:
class A2C:
    def __init__(self, hidden_size, num_inputs, action_space, device):
        self.action_space = action_space
        self.actor = Actor(hidden_size, num_inputs, action_space)
        self.critic = Critic(hidden_size, num_inputs)
        self.actor = self.actor.to(device)
        self.critic = self.critic.to(device)
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=1e-3)
        self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=1e-3)
        self.device = device

    def normal(self, x, mu, sigma_sq):
        pi = Variable(torch.FloatTensor([math.pi])).to(self.device)
        a = (-1*(Variable(x)-mu).pow(2)/(2*sigma_sq)).exp()
        b = 1/(2*sigma_sq*pi.expand_as(sigma_sq)).sqrt()
        return a*b

   
    def select_action(self, state):
        mu, sigma_sq = self.actor(Variable(state).to(self.device))
        state_value = self.critic(Variable(state).to(self.device))
        # softplus is a smooth approximation of ReLU; it constrains the variance to be positive
        sigma_sq = F.softplus(sigma_sq)

        # sample an action: a = mu + sigma * eps, with eps ~ N(0, 1)
        eps = torch.randn(mu.size(), device=self.device)
        action = mu + sigma_sq.sqrt()*Variable(eps)
        action = action.to(self.device).data
        # probability density of the sampled action under the policy
        prob = self.normal(action, mu, sigma_sq)
        pi = Variable(torch.FloatTensor([math.pi])).to(self.device)
        # differential entropy of a Gaussian: 0.5 * (log(2 * pi * sigma^2) + 1)
        entropy = 0.5*((2*pi.expand_as(sigma_sq)*sigma_sq).log()+1)

        log_prob = prob.log()
        return action, log_prob, state_value, entropy

    def compute_loss(self, rewards, log_probs, state_values, entropies, gamma, Nsteps):

        ### TODO ###
        ## Implement the Actor and Critic losses

        self.actor_loss = None
        self.critic_loss = None
        
    def update_parameters(self):
        for loss, optimizer in [(self.actor_loss, self.actor_optimizer), 
                                (self.critic_loss, self.critic_optimizer)]:
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

**Training**

In each episode, we interact with the environment using actions sampled from the policy network, and collect the rewards, log probabilities, state values, and entropies needed to update the actor and critic networks at the end of the episode.

In [None]:
env = gym.make('Pendulum-v0')
env = NormalizedActions(env)
env.seed(498)
torch.manual_seed(498)
np.random.seed(498)

device = torch.device('cpu')
agent = A2C(hidden_size=128, num_inputs=env.observation_space.shape[0], action_space=env.action_space, device=device)
reward_list = []
for i_episode in range(3000):
    state = torch.Tensor([env.reset()])
    entropies = []
    log_probs = []
    rewards = []
    state_values = []
    for t in range(1000):
        ## Given the state, get the action from the policy network,
        ## take the action in the environment, append the entropy, log probability,
        ## reward, and state value to the corresponding lists, and update the state
        action, log_prob, state_value, entropy = agent.select_action(state)
        action = action.cpu()

        next_state, reward, done, _ = env.step(action.numpy()[0])

        entropies.append(entropy)
        log_probs.append(log_prob)
        rewards.append(reward)
        state_values.append(state_value)
        state = torch.Tensor([next_state])
        if done:
            break
        #if i_episode % 100 == 0:
        #    env.render()
    agent.compute_loss(rewards, log_probs, state_values, entropies, gamma=0.99, Nsteps=10)
    agent.update_parameters()
    reward_list.append(np.sum(rewards))
    print("Episode: {}, reward: {}".format(i_episode, np.sum(rewards)))
    
env.close()

import matplotlib.pyplot as plt
reward_list = torch.FloatTensor(reward_list)
plt.figure(2)
plt.plot(reward_list.numpy())
plt.title('Training...')
plt.xlabel('Episode')
plt.ylabel('Reward')
if len(reward_list) >= 100:
    means = reward_list.unfold(0, 100, 1).mean(1).view(-1)
    means = torch.cat((torch.ones(99)*means[0], means))
    plt.plot(means.numpy())
plt.show()

# Other exercises

1. Compare A2C with REINFORCE
2. How does the N above in N-step rewards affect the gradient variance?
3. Compare DDPG with A2C