In [18]:
import torch
from torch import nn
from torch.nn import functional as F
import gym

## Steps to Implement VPG for cartpole:

1. Create a policy network for CartPole. This requires the input size and ranges, the output size and ranges, and a network architecture.
2. Create a value network. This can be a separate head of the policy network, so that the two share a representation: one head outputs the value and the other outputs the policy.
3. Initialize the policy network to some sensible parameters.
4. Create the training loop:
    1. Run the current policy for $k$ episodes, and compute the reward-to-go $\hat{R_t} = \sum_{t'=t}^{T-1} r_{t'}$ at each time step $t$. (Why do we compute the reward-to-go and not the total return?)
    2. Compute the advantage estimates $\tilde{A_t}$, where $\tilde{A_t} = \hat{R_t} - V_{\phi_k}(s_t)$.
    3. Estimate the policy gradient as 
    $$
    \hat{g} = \sum_{t=0}^{T-1} \nabla_{\theta} \log \pi(a_t | s_t, \theta) \left(\sum_{t' = t}^{T-1} r_{t'} - V_{\phi_k}(s_t)\right)
    $$
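
The reward-to-go and advantage computations in the training loop above can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: the function names `rewards_to_go` and `advantages` are hypothetical, the rewards are undiscounted (matching the formula above), and `critic_values` is made-up data standing in for the value head's outputs $V_{\phi_k}(s_t)$.

```python
def rewards_to_go(rewards):
    """Return [R_0, ..., R_{T-1}] with R_t = r_t + r_{t+1} + ... + r_{T-1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the sum of everything after it.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        returns[t] = running
    return returns


def advantages(rewards, values):
    """A_t = R_t - V(s_t); `values` stands in for the critic's estimates."""
    return [r - v for r, v in zip(rewards_to_go(rewards), values)]


episode_rewards = [1.0, 1.0, 1.0, 1.0]  # CartPole gives +1 per step
critic_values = [3.5, 3.5, 1.0, 1.0]    # made-up value estimates, illustration only

print(rewards_to_go(episode_rewards))              # [4.0, 3.0, 2.0, 1.0]
print(advantages(episode_rewards, critic_values))  # [0.5, -0.5, 1.0, 0.0]
```

Note that $\hat{R_t}$ only sums rewards from step $t$ onward: an action taken at time $t$ cannot influence rewards that were already received before it.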

### Create and inspect cartpole env

The specification of the cartpole environment is: 

    Box(4):
        Num	Observation                 Min         Max
        0	Cart Position             -4.8            4.8
        1	Cart Velocity             -Inf            Inf
        2	Pole Angle                 -24 deg        24 deg
        3	Pole Velocity At Tip      -Inf            Inf
        
    Actions:
        Type: Discrete(2)
        Num	Action
        0	Push cart to the left
        1	Push cart to the right
        
    Reward:
        Reward is 1 for every step taken, including the termination step.
        
    Starting State:
        All observations are assigned a uniform random value in [-0.05..0.05].
        
    Episode Termination:
        Pole Angle is more than 12 degrees
        Cart Position is more than 2.4 (center of the cart reaches the edge of the display)
        Episode length is greater than 200
        
    Solved Requirements:
        Considered solved when the average reward is greater than or equal to 195.0 over 100 consecutive trials.
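
The termination rules in the specification can be written as a small predicate, which may help when reasoning about episode lengths. This is a sketch under stated assumptions rather than the environment's own code: `episode_done` is a hypothetical helper, the pole angle is assumed to be reported in radians, and the episode is assumed to be cut off once 200 steps have been taken.

```python
import math

# Thresholds copied from the specification above.
ANGLE_LIMIT_RAD = math.radians(12)  # assumption: the observation's angle is in radians
POSITION_LIMIT = 2.4
MAX_EPISODE_STEPS = 200             # assumption: the cap is hit at exactly 200 steps


def episode_done(observation, steps_taken):
    """Return True if any termination condition from the spec holds."""
    cart_position, _cart_velocity, pole_angle, _pole_velocity = observation
    return (
        abs(pole_angle) > ANGLE_LIMIT_RAD
        or abs(cart_position) > POSITION_LIMIT
        or steps_taken >= MAX_EPISODE_STEPS
    )


print(episode_done((0.0, 0.0, 0.0, 0.0), steps_taken=10))  # False
print(episode_done((2.5, 0.0, 0.0, 0.0), steps_taken=10))  # True
```

Since the reward is 1 per step, these rules bound the return of a single episode at 200.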

In [3]:
env = gym.make('CartPole-v0')

In [9]:
env.observation_space

Box(4,)

In [13]:
env.observation_space.shape

(4,)

In [12]:
env.action_space

Discrete(2)

### Create Actor Critic Network from Environment Definition

In [17]:
in_size = env.observation_space.shape[0]
out_size = env.action_space.n
print(f'in={in_size}, out={out_size}')

in=4, out=2


Create a neural network that takes an observation as input and outputs (1) the estimated value of the state and (2) one policy output per action available in that state.

The neural network has two output heads, one for the policy and one for the value function. It has one hidden layer to encode the representation of the state:

in -> hidden(16) \
-> value(16) -> value-out\
-> $\pi$(16) -> $\pi$-out
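
To make the role of the policy head concrete, here is a minimal sketch of turning its two outputs into action probabilities and sampling an action. It assumes the policy head emits unnormalized scores (logits), which the text above does not state; `softmax` and `sample_action` are hypothetical helpers written with the standard library only, independent of the network class defined below.

```python
import math
import random


def softmax(logits):
    """Convert unnormalized scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng=random):
    """Draw an action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


policy_out = [0.0, 0.0]        # made-up policy-head output for one state
probs = softmax(policy_out)
print(probs)                   # [0.5, 0.5]
action = sample_action(probs)  # 0 (push left) or 1 (push right)
```

Sampling from the distribution, rather than always taking the most likely action, keeps the policy stochastic, which is what the $\log \pi(a_t | s_t, \theta)$ term in the policy gradient assumes.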

In [None]:
class ActorCritic(nn.Module):
    
    def __init__(self, in_size, out_size):
        super(ActorCritic, self).__init__()
        