# Exercise 9.1 Policy Gradient on Continuous CartPole

## Goal

- understand policy gradient and implement it
- understand how each hyperparameter contributes to the learning process

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import gym
import numpy as np
import chula_rl as rl
from chula_rl.env.cartpolecont import ContinuousCartPoleEnv

# Step 1: Env

In [3]:
def make_env():
    env = ContinuousCartPoleEnv()
    env = rl.env.wrapper.EpisodeSummary(env)
    return env

## 1.1 Parallel Env (VecEnv)

This kind of env takes a vector of actions and returns a vector of states. Running several envs in parallel greatly stabilizes (and also speeds up) training, especially in on-policy learning.

Example of 2 parallel envs (you could use any number):

In [4]:
env = rl.env.DummyVecEnv([make_env] * 2)
s = env.reset()
print('s.shape:', s.shape)

s.shape: (2, 4)


In [5]:
s[0]

array([ 0.00973941,  0.01420194, -0.03185375, -0.03023926], dtype=float32)

You see (2, 4), which means 2 envs with 4 features each (the usual CartPole state).

An interesting part of the parallel env is that it "resets" an underlying env automatically when that env is done. This means we can always take an action and do not need to care about the state of the underlying environments.
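The auto-reset behaviour can be illustrated with a toy wrapper. This is only a minimal sketch and not the implementation of `rl.env.DummyVecEnv`: `CountdownEnv` and `AutoResetVecEnv` are hypothetical names, and the toy env simply ends after a fixed number of steps.

```python
import numpy as np


class CountdownEnv:
    """Hypothetical toy env: the episode ends after `horizon` steps."""

    def __init__(self, horizon):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return np.array([self.t], dtype=np.float32)

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return np.array([self.t], dtype=np.float32), 1.0, done, {}


class AutoResetVecEnv:
    """Steps several envs in lockstep and resets each one as soon as it is done."""

    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        return np.stack([env.reset() for env in self.envs])

    def step(self, actions):
        states, rewards, dones = [], [], []
        for env, action in zip(self.envs, actions):
            s, r, done, _ = env.step(action)
            if done:
                # The caller gets the first state of the next episode instead
                # of the terminal state, so it can keep stepping.
                s = env.reset()
            states.append(s)
            rewards.append(r)
            dones.append(done)
        return np.stack(states), np.array(rewards, dtype=np.float32), np.array(dones)


vec_env = AutoResetVecEnv([lambda: CountdownEnv(2), lambda: CountdownEnv(3)])
s = vec_env.reset()
done_history = []
for _ in range(4):
    s, r, done = vec_env.step(np.zeros((2, 1)))
    done_history.append(done.tolist())
```

The `done` flags still tell the learner where an episode ended, even though the returned state already belongs to the next episode.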

## 1.2 Continuous CartPole

This is the same as a normal CartPole. The only difference is that the action space is continuous: each action is a single float within (-1, 1).

Example of taking an action in a parallel env:

Each action has 1 dimension, so a batch of parallel actions has 2 dimensions: (number of envs, 1).

In [6]:
ss, r, done, info = env.step(np.array([[-0.8], [1.0]]))
print('ss.shape:', ss.shape)
print('r.shape:', r.shape)

ss.shape: (2, 4)
r.shape: (2,)
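As a small illustration of this shape convention, the sketch below stacks one scalar action per env into an `(n_env, 1)` array. `to_action_batch` is a hypothetical helper, and clipping to [-1, 1] is an assumption about what the env accepts.

```python
import numpy as np


def to_action_batch(per_env_actions, low=-1.0, high=1.0):
    """Stack one scalar action per env into an array of shape (n_env, 1)."""
    actions = np.asarray(per_env_actions, dtype=np.float32).reshape(-1, 1)
    # Assumed bounds: keep every action inside the valid range.
    return np.clip(actions, low, high)


batch = to_action_batch([-0.8, 1.0, 1.7])
```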


# Step 2: Vec n-step Explorer

In a parallel environment setting, we also need a compatible parallel explorer. The code is straightforward, so we have already implemented it for you, but you are welcome to read it.

Go see `chula_rl.explorer.vec_many_step_explorer`

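In policy gradient, we usually use an n-step return of some kind because it trades off the high variance of full Monte Carlo returns against the bias of one-step bootstrapping, which makes training more stable!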
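A minimal sketch of a bootstrapped n-step return, assuming rewards and done flags are arrays of shape `(n_step, n_env)` and that `final_v` is a value estimate for the state after the last step. `n_step_returns` is a hypothetical helper, not part of `chula_rl`.

```python
import numpy as np


def n_step_returns(r, done, final_v, discount):
    """Return G_t = r_t + discount * G_{t+1}, bootstrapped from final_v.

    r, done: arrays of shape (n_step, n_env); final_v: shape (n_env,).
    The recursion is cut wherever an episode ended (done is True).
    """
    r = np.asarray(r, dtype=np.float64)
    not_done = 1.0 - np.asarray(done, dtype=np.float64)
    returns = np.zeros_like(r)
    running = np.asarray(final_v, dtype=np.float64)
    for t in reversed(range(r.shape[0])):
        running = r[t] + discount * not_done[t] * running
        returns[t] = running
    return returns


r = np.ones((3, 2))
done = np.array([[False, False], [False, True], [False, False]])
final_v = np.array([10.0, 10.0])
G = n_step_returns(r, done, final_v, discount=0.5)
```

In the second env the episode ends at the middle step, so the return of that step is just its reward and nothing from the next episode leaks backwards.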

In [7]:
n_step = 3
n_max_interaction = 10

In [8]:
exp = rl.explorer.VecManyStepExplorer(n_step, n_max_interaction, env)

In [9]:
exp

<chula_rl.explorer.vec_many_step_explorer.VecManyStepExplorer at 0x1b63359a080>

In [10]:
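from chula_rl.policy.base_policy import BasePolicy


class RandomPolicy(BasePolicy):
    """Ignores the state and returns one uniformly sampled action per env."""

    def __init__(self, n_action):
        self.n_action = n_action

    def step(self, state):
        # one row per env (assuming `state` is batched as (n_env, 4)) instead of a hard-coded pair;
        # samples from [0, 1), i.e. only the positive half of the (-1, 1) action range
        return np.random.uniform(0, 1, size=(len(state), self.n_action))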

In [11]:
policy = RandomPolicy(1)

In [12]:
exp.step(policy)

{'s': array([[[-0.01544014, -0.00956089,  0.02332065,  0.0059557 ],
         [-0.03308207, -0.04741054,  0.0026366 , -0.02749023]],
 
        [[-0.01563136,  0.24221022,  0.02343976, -0.36474264],
         [-0.03403028,  0.2759392 ,  0.0020868 , -0.511738  ]],
 
        [[-0.01078716,  0.4577368 ,  0.01614491, -0.68105304],
         [-0.0285115 ,  0.7742144 , -0.00814796, -1.2585356 ]]],
       dtype=float32), 'a': array([[[0.43069724],
         [0.55245401]],
 
        [[0.36877491],
         [0.85127064]],
 
        [[0.18122867],
         [0.1421889 ]]]), 'r': array([[1., 1.],
        [1., 1.],
        [1., 1.]], dtype=float32), 'done': array([[False, False],
        [False, False],
        [False, False]]), 'final_s': array([[-0.00163242,  0.56359565,  0.00252385, -0.83507425],
        [-0.01302721,  0.8575508 , -0.03331867, -1.3859316 ]],
       dtype=float32)}

# Step 3: Advantage Actor-Critic (A2C) policy + n-step TD residual advantage

A2C requires two components: 
- Actor (policy)
- Critic (value function) 

Both are implemented as neural nets. We leave this section to you. 

Your A2C should subclass `chula_rl.policy.base_policy.BasePolicy`. 
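To make the roles of the two components concrete, here is a NumPy sketch of the quantities involved: the critic is regressed toward the n-step return, and the actor's log-probabilities are weighted by the advantage (return minus value). It assumes a Gaussian policy, which is one common choice for continuous actions; `gaussian_log_prob` and `a2c_losses` are hypothetical names, and the gradients themselves would come from an autodiff framework.

```python
import numpy as np


def gaussian_log_prob(action, mean, log_std):
    """Log density of `action` under N(mean, exp(log_std)^2), elementwise."""
    std = np.exp(log_std)
    return -0.5 * ((action - mean) / std) ** 2 - log_std - 0.5 * np.log(2.0 * np.pi)


def a2c_losses(returns, values, action, mean, log_std):
    """All inputs are arrays of the same shape, one entry per sampled step."""
    advantage = returns - values            # n-step TD residual
    critic_loss = np.mean(advantage ** 2)   # regress V(s) toward the n-step return
    # In the actor loss the advantage acts as a constant weight: no gradient
    # should flow through it into the critic.
    actor_loss = -np.mean(gaussian_log_prob(action, mean, log_std) * advantage)
    return advantage, critic_loss, actor_loss


adv, critic_loss, actor_loss = a2c_losses(
    returns=np.array([3.0, 1.0]),
    values=np.array([2.0, 2.0]),
    action=np.array([0.5, -0.5]),
    mean=np.array([0.5, 0.5]),
    log_std=np.array([0.0, 0.0]),
)
```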

## Words of advice: 

- Your code will surely contain bugs! Developing in a Jupyter notebook might not be a good idea.
- There are a ton of hyperparameters, and finding the right values is no easy task.
- Finding the right values may require analyzing how the code performs, which is hard if you don't "log" enough.
- So, log EVERYTHING, and use TensorBoard to your advantage.
- For example, log the std of the policy and the current output of the value function. These will be invaluable in debugging.
- "Why is it so fragile ~" is a sentence that describes this section.
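As a sketch of what "log everything" could include, the helper below collects a few scalars per update that can then be passed to whichever logger you use. The metric names and both functions are hypothetical; explained variance is a commonly used check of how well the critic tracks the returns.

```python
import numpy as np


def explained_variance(values, returns):
    """1 - Var[returns - values] / Var[returns]; 1 means a perfect critic."""
    var_returns = np.var(returns)
    if var_returns == 0:
        return float('nan')
    return float(1.0 - np.var(returns - values) / var_returns)


def training_diagnostics(log_std, values, returns, advantages):
    """Scalars worth logging after every update."""
    return {
        'policy/std': float(np.mean(np.exp(log_std))),
        'value/mean': float(np.mean(values)),
        'value/explained_variance': explained_variance(values, returns),
        'advantage/mean': float(np.mean(advantages)),
        'advantage/std': float(np.std(advantages)),
    }


returns = np.array([1.0, 2.0, 3.0, 4.0])
values = np.array([1.0, 2.0, 3.0, 5.0])
stats = training_diagnostics(np.array([0.0]), values, returns, returns - values)
```

A policy std that collapses toward zero early, or an explained variance that stays near or below zero, are typical signs that something is off.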

In [44]:
from chula_rl.policy.base_policy import BasePolicy
from tensorflow.keras import models, layers, optimizers
import tensorflow as tf


class DenseNetwork(models.Model):
    """MLP with ReLU hidden layers and a linear output layer."""

    def __init__(self, output_size, hidden_sizes):
        super(DenseNetwork, self).__init__()
        # build the layers without mutating the caller's `hidden_sizes` list
        self.hidden = [layers.Dense(n, activation='relu') for n in hidden_sizes]
        # no activation on the output, so it can also produce negative values
        self.out = layers.Dense(output_size)

    def call(self, x):
        for layer in self.hidden:
            x = layer(x)
        return self.out(x)

In [65]:
def make_env():
    env = ContinuousCartPoleEnv()
    env = rl.env.wrapper.EpisodeSummary(env)
    return env

In [337]:
class A2C(BasePolicy):
    """Advantage actor-critic with a Gaussian policy and n-step returns.

    Assumes TF 2.x eager execution. The state-independent log-std is one
    common parameterization, chosen here for simplicity.
    """

    def __init__(self, n_env, n_step, discount, lr=1e-3):
        self.n_env = n_env
        self.n_step = n_step
        self.discount = discount
        self.env = rl.env.DummyVecEnv([make_env] * n_env)
        self.state = self.env.reset()
        self.pi = DenseNetwork(1, [128, 64])  # actor: mean of the Gaussian policy
        self.v = DenseNetwork(1, [128, 64])  # critic: state value V(s)
        self.log_std = tf.Variable(tf.zeros(1))  # learned log-std of the policy
        # one optimizer per network so their Adam statistics stay separate
        self.pi_optimizer = optimizers.Adam(lr)
        self.v_optimizer = optimizers.Adam(lr)

    def _sample(self, state):
        # unclipped sample a ~ N(mean(s), std^2); kept unclipped for the log-prob
        mean = self.pi(state).numpy()
        return mean + np.exp(self.log_std.numpy()) * np.random.randn(*mean.shape)

    def step(self, state):
        # the env expects actions within (-1, 1)
        return np.clip(self._sample(state), -1.0, 1.0)

    def value(self, state):
        return self.v(state).numpy()

    def learn(self):
        """Roll out n_step steps in every env, then update critic and actor once."""
        states, actions, rewards, dones = [], [], [], []
        state = self.state
        for _ in range(self.n_step):
            action = self._sample(state)
            next_state, reward, done, info = self.env.step(np.clip(action, -1.0, 1.0))
            states.append(state)
            actions.append(action)
            rewards.append(reward)
            dones.append(np.asarray(done, dtype=np.float32))
            state = next_state
        self.state = state  # the vec env resets itself, so continue from here next time

        # n-step targets: bootstrap from V(s_n), cut the return at episode ends
        ret = self.value(state).reshape(-1)
        targets = []
        for r, d in zip(reversed(rewards), reversed(dones)):
            ret = r + self.discount * (1.0 - d) * ret
            targets.insert(0, ret)
        s = tf.constant(np.concatenate(states), dtype=tf.float32)
        a = tf.constant(np.concatenate(actions), dtype=tf.float32)
        q = tf.constant(np.concatenate(targets).reshape(-1, 1), dtype=tf.float32)

        # critic: regress V(s) toward the n-step target
        with tf.GradientTape() as tape:
            v = self.v(s)
            loss_v = tf.reduce_mean(tf.square(q - v))
        grads = tape.gradient(loss_v, self.v.trainable_weights)
        self.v_optimizer.apply_gradients(zip(grads, self.v.trainable_weights))

        # actor: raise the log-probability of actions with positive advantage
        adv = tf.stop_gradient(q - v)
        with tf.GradientTape() as tape:
            z = (a - self.pi(s)) / tf.exp(self.log_std)
            log_prob = -0.5 * tf.square(z) - self.log_std - 0.5 * float(np.log(2 * np.pi))
            loss_pi = -tf.reduce_mean(log_prob * adv)
        pi_vars = self.pi.trainable_weights + [self.log_std]
        grads = tape.gradient(loss_pi, pi_vars)
        self.pi_optimizer.apply_gradients(zip(grads, pi_vars))
        return float(loss_v), float(loss_pi)

In [338]:
# a2c = A2C(3,5,0.9)

In [339]:
# a2c.learn()

## Run it

If you have forgotten how to run it, here is how:

```
while True:
    data = exp.step(policy)
    policy.optimize_step(data)
```

## Extra: A2C + n-step Generalized Advantage

You are invited to implement the same A2C but using the generalized advantage instead. Legend has it this is a better advantage estimate! 😎
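As a starting point, here is a minimal NumPy sketch of generalized advantage estimation over an n-step rollout. It assumes rewards, done flags and value estimates of shape `(n_step, n_env)` plus a value estimate `final_v` for the state after the last step; `generalized_advantage` and the parameter name `lam` are hypothetical.

```python
import numpy as np


def generalized_advantage(r, done, values, final_v, discount, lam):
    """Exponentially weighted sum of one-step TD residuals.

    r, done, values: arrays of shape (n_step, n_env); final_v: shape (n_env,).
    lam = 0 gives the one-step TD residual; lam = 1 gives the n-step return
    minus the value estimate.
    """
    r = np.asarray(r, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    not_done = 1.0 - np.asarray(done, dtype=np.float64)
    final_v = np.asarray(final_v, dtype=np.float64)
    next_values = np.concatenate([values[1:], final_v[None]], axis=0)
    advantages = np.zeros_like(r)
    running = np.zeros(r.shape[1])
    for t in reversed(range(r.shape[0])):
        # one-step TD residual, with the bootstrap cut at episode ends
        delta = r[t] + discount * not_done[t] * next_values[t] - values[t]
        running = delta + discount * lam * not_done[t] * running
        advantages[t] = running
    return advantages


r = np.ones((3, 1))
done = np.zeros((3, 1), dtype=bool)
values = np.array([[1.0], [2.0], [3.0]])
final_v = np.array([4.0])
adv_td = generalized_advantage(r, done, values, final_v, discount=0.5, lam=0.0)
adv_nstep = generalized_advantage(r, done, values, final_v, discount=0.5, lam=1.0)
```

Values of `lam` between 0 and 1 interpolate between the low-variance, more biased one-step estimate and the higher-variance n-step estimate.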