<a href="https://colab.research.google.com/github/aarsanjani/meansquares/blob/master/temp/Copy_of_RLAlgo_6_DDPG_Code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

> This notebook on Deep Deterministic Policy Gradient (DDPG) is one installment in a series of notebooks covering the technical implementation of modern RL algorithms. See [this notebook](https://colab.research.google.com/drive/1cDq73ac6N67IPGBqonCH2dyjrmAkclkM) for a description of the DDPG algorithm.

The main references for this notebook are: 

[1]  [Deep Deterministic Policy Gradient](https://spinningup.openai.com/en/latest/algorithms/ddpg.html#deep-deterministic-policy-gradient)

[2] 

# CoLab Preambles

Most of the required Python packages are already installed on CoLab. To run Gym, you also need to install prerequisites such as xvfb, OpenGL, and other python-dev packages with the commands below.

[](To be done next time: )
[](https://becominghuman.ai/lets-build-an-atari-ai-part-1-dqn-df57e8ff3b26)

In [None]:
!pip install gym
!apt-get install python-opengl -y
!apt install xvfb -y

# Special gym environment
!pip install gym[atari]

# For rendering environment, you can use pyvirtualdisplay.
!pip install pyvirtualdisplay
!pip install pyglet

# Install spinningup on CoLab
!git clone https://github.com/openai/spinningup.git
# Note: `!cd spinningup` has no lasting effect because each `!` command runs in its own
# subshell, so `!pip install -e .` fails with: File "setup.py" not found.
# Point pip at the cloned directory instead.
!pip install -e spinningup

Reading package lists... Done
Building dependency tree       
Reading state information... Done
python-opengl is already the newest version (3.1.0+dfsg-1).
0 upgraded, 0 newly installed, 0 to remove and 14 not upgraded.
Reading package lists... Done
Building dependency tree       
Reading state information... Done
xvfb is already the newest version (2:1.19.6-1ubuntu4.7).
0 upgraded, 0 newly installed, 0 to remove and 14 not upgraded.
fatal: destination path 'spinningup' already exists and is not an empty directory.
Obtaining file:///content/spinningup
Installing collected packages: spinup
  Found existing installation: spinup 0.2.0
    Can't uninstall 'spinup'. No files were found to uninstall.
  Running setup.py develop for spinup
Successfully installed spinup


In [None]:
# Start a virtual display so the environment can render on a headless machine.
# This only needs to run once per session, before training an agent.
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

# This code creates a virtual display to draw game images on. 
# If you are running locally, just ignore it
import os
if type(os.environ.get("DISPLAY")) is not str or len(os.environ.get("DISPLAY"))==0:
    !bash ../xvfb start
    %env DISPLAY=:1

#
# Import libraries
#
import gym
from gym import logger as gymlogger
from gym.wrappers import Monitor
gymlogger.set_level(40) # error only
import tensorflow as tf
import numpy as np
import random
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import math
import glob
import io
import base64
from IPython.display import HTML

from IPython import display as ipythondisplay

"""
Utility functions to enable video recording of a gym environment and display it.
To enable video, just do "env = wrap_env(env)".
"""

def show_video():
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")
    

def wrap_env(env):
  env = Monitor(env, './video', force=True)
  return env

# Deep Deterministic Policy Gradient

[DDPG@SpinningUp](https://spinningup.openai.com/en/latest/algorithms/ddpg.html)

### Pseudocode

![DDPG-Pseudocode](https://spinningup.openai.com/en/latest/_images/math/66446594f760581068c5684f053e75f3de9b1404.svg)
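
As a rough companion to the pseudocode, the NumPy sketch below spells out three of its steps: noisy action selection with clipping, the Bellman target, and the Polyak (soft) target update. The function names, the seed, and the toy numbers are illustrative assumptions, and plain arrays stand in for the networks; this is not the TensorFlow code used later in the notebook.

```python
import numpy as np

def select_action(mu, noise_scale, act_limit, rng):
    """Add Gaussian exploration noise to the deterministic action, then clip."""
    noisy = mu + noise_scale * rng.standard_normal(mu.shape)
    return np.clip(noisy, -act_limit, act_limit)

def bellman_target(rews, done, q_next, gamma=0.99):
    """y = r + gamma * (1 - d) * Q_targ(s', mu_targ(s'))."""
    return rews + gamma * (1.0 - done) * q_next

def polyak_update(target_params, running_params, polyak=0.995):
    """theta_targ <- polyak * theta_targ + (1 - polyak) * theta."""
    return [polyak * t + (1.0 - polyak) * p
            for t, p in zip(target_params, running_params)]

rng = np.random.default_rng(0)  # hypothetical seed
action = select_action(np.array([0.95, -0.2]), noise_scale=0.1,
                       act_limit=1.0, rng=rng)

# Toy batch of two transitions; the second one is terminal (done = 1).
y = bellman_target(rews=np.array([1.0, 0.5]),
                   done=np.array([0.0, 1.0]),
                   q_next=np.array([2.0, 3.0]),
                   gamma=0.9)

# One soft update of a single target parameter vector.
new_target = polyak_update([np.zeros(2)], [np.ones(2)], polyak=0.9)
print(action, y, new_target)
```

For the terminal transition the `(1 - done)` factor removes the bootstrap term, so the target is just the reward (0.5 in this toy batch).
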



[](https://github.com/gabrielgarza/openai-gym-policy-gradient)
[](https://github.com/yukezhu/tensorflow-reinforce/tree/master/rl)

### Define MLP Architecture
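
The actor built in this section ends in a `tanh` layer scaled by `act_limit`, which keeps every action component within `[-act_limit, act_limit]`. The NumPy forward pass below is a minimal sketch of that idea. The names `mlp_forward` and `init_weights`, the random weights, and the layer sizes are assumptions for illustration; this is not the `tf.layers.dense` graph defined in the cell that follows.

```python
import numpy as np

def mlp_forward(x, weights, activation=np.tanh, output_activation=None):
    """Apply hidden layers with `activation`, then a final layer with `output_activation`."""
    for w, b in weights[:-1]:
        x = activation(x @ w + b)
    w, b = weights[-1]
    out = x @ w + b
    return out if output_activation is None else output_activation(out)

def init_weights(sizes, rng):
    """Random (W, b) pairs for consecutive layer sizes, e.g. [obs_dim, 64, 64, act_dim]."""
    return [(rng.standard_normal((m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
obs_dim, act_dim, act_limit = 8, 2, 2.0   # hypothetical dimensions and bound
weights = init_weights([obs_dim, 64, 64, act_dim], rng)

obs = rng.standard_normal((5, obs_dim))   # batch of 5 observations
pi = act_limit * mlp_forward(obs, weights, activation=relu,
                             output_activation=np.tanh)
print(pi.shape, np.abs(pi).max())
```

Because `tanh` maps into `[-1, 1]`, multiplying by `act_limit` is enough to respect the bound, provided all action dimensions share the same symmetric limit.
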

In [None]:
# mlp.py
# define MLP

def mlp(x, hidden_sizes=(32,), 
           activation=tf.tanh,
           output_activation=None):
  """
    Multi-Layer Perceptron (MLP) FC
    Construct a feedforward neural network
    INPUT: input x
    OUTPUT: MLP computation graphs
  """
  for hs in hidden_sizes[:-1]:
    x = tf.layers.dense(x, units=hs, activation=activation)
  return tf.layers.dense(x, units=hidden_sizes[-1], activation=output_activation)

def mlp_actor_critic(x, a,
           hidden_sizes=(256, 256), 
           activation=tf.nn.relu,
           output_activation=tf.tanh,
           action_space=None):
  """
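    Actor-critic for DDPG.
    Each call builds one policy network and one Q-network and returns:
    1. pi:   policy output, scaled to [-act_limit, act_limit]
    2. q:    Q(x, a) for the action placeholder a
    3. q_pi: Q(x, pi(x)), reusing the Q-network variables
    Calling it once under the 'running' scope and once under the 'target'
    scope yields the 4 networks DDPG needs (running/target policy and Q).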
  """
  act_dim = a.shape.as_list()[-1]
  print("before with, act_dim= ",act_dim)
  act_limit = action_space.high[0]
  print("act_limit ", act_limit)
  with tf.variable_scope('pi'):
    print("calling pi in actor_critic")
    print("hidden_size= ",hidden_sizes)
    print("act_dim= ",act_dim)
    pi = act_limit * mlp(x, list(hidden_sizes)+[act_dim], activation, output_activation)
  with tf.variable_scope('q'):
    print("calling q in actor_critic")
    print("hidden_size= ",hidden_sizes)
    q = tf.squeeze(mlp(tf.concat([x,a], axis=-1), list(hidden_sizes)+[1], activation, None), axis=1)
  with tf.variable_scope('q', reuse=True):
    print("calling q 2nd time in actor_critic")
    print("hidden_size= ",hidden_sizes)
    q_pi = tf.squeeze(mlp(tf.concat([x,pi], axis=-1), list(hidden_sizes)+[1], activation, None), axis=1)
  return pi, q, q_pi

### Define DDPG

- INPUT

- OUTPUT

[](https://spinningup.openai.com/en/latest/_modules/spinup/algos/ddpg/ddpg.html#ddpg)
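
The replay buffer used by DDPG is a fixed-size ring: once it is full, new transitions overwrite the oldest ones, and minibatches are drawn uniformly from the filled part. The stripped-down sketch below shows only that pointer arithmetic, with integers in place of transitions. `RingBuffer` is a hypothetical name used for illustration, not the `ReplayBuffer` class defined later in this section.

```python
import random

class RingBuffer:
    """Fixed-size buffer that overwrites its oldest entry when full."""

    def __init__(self, size):
        self.data = [None] * size
        self.ptr, self.cur_size, self.max_size = 0, 0, size

    def store(self, item):
        self.data[self.ptr] = item
        self.ptr = (self.ptr + 1) % self.max_size            # wrap around
        self.cur_size = min(self.cur_size + 1, self.max_size)

    def sample(self, batch_size, rng):
        # Sample uniformly, with replacement, from the filled part only.
        return [self.data[rng.randrange(self.cur_size)]
                for _ in range(batch_size)]

buf = RingBuffer(size=3)
for step in range(5):        # store 0, 1, 2, 3, 4 into a buffer of size 3
    buf.store(step)

print(buf.data)              # -> [3, 4, 2]
batch = buf.sample(batch_size=4, rng=random.Random(0))
```

Items 0 and 1 were overwritten by 3 and 4, so only the three most recent items can be sampled.
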

In [None]:
# utils.py
# utils functions

def get_vars(scope):
    return [x for x in tf.global_variables() if scope in x.name]

In [None]:
# ddpg.py
# reference: OpenAI Spinning Up DDPG implementation

import numpy as np
import tensorflow as tf
import time

# use spinup EpochLogger 
from spinup.utils.logx import EpochLogger

# self defined functions
# or just run the previous 
# mlp.py
# utils.py
# from mlp import mlp_actor_critic


class ReplayBuffer:
  """
  Experience replay buffer for DDPG agents: 
  - continuous obs space
  - continuous act space
  """
  def __init__(self, obs_dim, act_dim, size):
    self.obs1_buff = np.zeros([size, obs_dim], dtype=np.float32)
    self.acts_buff = np.zeros([size, act_dim], dtype=np.float32)
    self.obs2_buff = np.zeros([size, obs_dim], dtype=np.float32)
    self.rews_buff = np.zeros(size, dtype=np.float32)
    self.done_buff = np.zeros(size, dtype=np.float32)
    self.cur_ptr, self.cur_size, self.max_size = 0, 0, size
    
  def store(self, obs, act, rew, next_obs, done):
    print("--- storing to ReplayBuffer --- ")
    self.obs1_buff[self.cur_ptr] = obs
    self.acts_buff[self.cur_ptr] = act
    self.obs2_buff[self.cur_ptr] = next_obs
    self.rews_buff[self.cur_ptr] = rew
    self.done_buff[self.cur_ptr] = done
    self.cur_ptr = (self.cur_ptr + 1) % self.max_size
    self.cur_size = min(self.cur_size + 1, self.max_size)
    
  def sample_batch(self, batch_size=32):
    idxs = np.random.randint(0, self.cur_size, size=batch_size)
    samples = dict(obs1 = self.obs1_buff[idxs],
                   acts = self.acts_buff[idxs],
                   obs2 = self.obs2_buff[idxs],
                   rews = self.rews_buff[idxs],
                   done = self.done_buff[idxs])
    return samples
  
"""
Deep Deterministic Policy Gradient (DDPG)
"""  
def ddpg(env_fn, 
         actor_critic = mlp_actor_critic,
         ac_kwargs = dict(),
         seed = 0, 
         steps_per_epoch = 5000,
         epochs = 100,
         replay_size = int(1e6),
         gamma = 0.99,
         polyak = 0.995,
         pi_lr = 1e-3,
         q_lr = 1e-3,
         batch_size = 100,
         start_steps = 10000,
         act_noise = 0.1,
         max_ep_len = 1000,
         logger_kwargs = dict(),
         save_freq = 1):
  """
  DDPG
  """
  
  # use spinup util logger
  # (the logger is required below, so fall back to default settings if logger_kwargs is None)
  logger = EpochLogger(**(logger_kwargs or {}))
  logger.save_config(locals())
  
  # random seed
  tf.set_random_seed(seed)
  np.random.seed(seed)
  
  env, test_env = env_fn(), env_fn()
  obs_dim = env.observation_space.shape[0] # continuous
  act_dim = env.action_space.shape[0] # does not work for discrete() space
  print("ddpg obs_dim ", obs_dim)
  print("ddpg act_dim ", act_dim)

  # action limit: assume all dimensions share the same bound
  act_limit = env.action_space.high[0]
  print("ddpg act_limit ", act_limit)

  # share information about action space with policy architecture
  ac_kwargs['action_space'] = env.action_space
  print("printing env.action_space ", env.action_space)

  # inputs to the computation graph
  x_ph = tf.placeholder(dtype=tf.float32, shape=(None, obs_dim))
  a_ph = tf.placeholder(dtype=tf.float32, shape=(None, act_dim))
  x2_ph = tf.placeholder(dtype=tf.float32, shape=(None, obs_dim))
  r_ph = tf.placeholder(dtype=tf.float32, shape=(None, ))
  d_ph = tf.placeholder(dtype=tf.float32, shape=(None, ))
  print("****** X_ph ****")
  print(x_ph)
  print("****** a_ph ****")
  print(a_ph)
  
  # outputs from the computation graph
  # curr policy and Q-function
  with tf.variable_scope('running'):
    
    print("**********running**********")
    pi, q, q_pi = actor_critic(x_ph, a_ph, **ac_kwargs)
  
  # target policy and Q-function
  with tf.variable_scope('target'):
    print('******target*********')
    # Note that the action placeholder a_ph here is irrelevant
    # because we only need q_targ(s, pi_targ(s))
    pi_targ, _, q_pi_targ = actor_critic(x2_ph, a_ph, **ac_kwargs)
  
  # experience buffer 
  replay_buffer = ReplayBuffer(obs_dim=obs_dim, act_dim=act_dim, size=replay_size)
  
  # count variables and print information
  # To be done ...
  
  #################################################
  ##
  ## Define components of the computation graph
  ##
  
  # Bellman backup for the Q function
  # do not take gradients through this part
  # use target q and pi to stabilize computation
  q_bell = tf.stop_gradient(r_ph + gamma*(1-d_ph)*q_pi_targ)
  
  # DDPG losses for policy and q function, respectively
  pi_loss = -tf.reduce_mean(q_pi)
  q_loss = tf.reduce_mean((q - q_bell)**2)
  print("-- updated loss and q values -- ")
  # separately train ops for pi, q
  pi_optimizer = tf.train.AdamOptimizer(learning_rate=pi_lr)
  q_optimizer = tf.train.AdamOptimizer(learning_rate=q_lr)
  train_pi_opt = pi_optimizer.minimize(pi_loss, var_list=get_vars('running/pi'))
  train_q_opt = q_optimizer.minimize(q_loss, var_list=get_vars('running/q'))
  print("-- updated loss and q values with optimizer -- ")
  # polyak average update target pi and q function variables
  target_update = tf.group([tf.assign(v_targ, polyak*v_targ + (1-polyak)*v_running) 
                            for v_running, v_targ in zip(get_vars('running'), get_vars('target'))])
  print("polyak average update target pi and q function variables")
  # initialize targets to be the same as the running variable
  target_init = tf.group([tf.assign(v_targ, v_running)
                          for v_running, v_targ in zip(get_vars('running'), get_vars('target'))])
  print("initialize targets to be the same as the running variable")
  #################################################
  ##
  ## Run session to train the computation graph
  ##
  sess = tf.Session()
  sess.run(tf.global_variables_initializer())
  sess.run(target_init)
  
  # setup model saving 
  logger.setup_tf_saver(sess, inputs={'x': x_ph, 'a': a_ph}, outputs={'pi': pi, 'q': q})
  
  # action exploration or exploitation
  # later: eps-greedy or UCB
  def get_action(obs, noise_scale):
    a = sess.run(pi, feed_dict={x_ph: obs.reshape(1,-1)})[0]
    a += noise_scale * np.random.randn(act_dim)
    return np.clip(a, -act_limit, act_limit)
  
  start_time = time.time()
  o, r, d, ep_ret, ep_len = env.reset(), 0, False, 0, 0
  total_steps = steps_per_epoch * epochs
  print("start_steps : ",start_steps)
  # Main loop: collect experience in env and update/log each epoch
  for t in range(total_steps):
    
    # Get action
    if t > start_steps:
      a = get_action(o, act_noise)
    else:
      a = env.action_space.sample()

    # Step the env
    o2, r, d, _ = env.step(a)
    ep_ret += r
    ep_len += 1
    d = False if ep_len==max_ep_len else d
    replay_buffer.store(o, a, r, o2, d)
    o = o2
      
    # update DDPG at the end of the trajectory
    if d or (ep_len == max_ep_len):
      for _ in range(ep_len):
        batch = replay_buffer.sample_batch(batch_size)
        feed_dict = {x_ph: batch['obs1'],
                     x2_ph: batch['obs2'],
                     a_ph: batch['acts'],
                     r_ph: batch['rews'],
                     d_ph: batch['done']
                    }
          
        # Q-learning update
        outs = sess.run([q_loss, q, train_q_opt], feed_dict)
        logger.store(LossQ=outs[0], QVals=outs[1])
          
        # Policy update
        outs = sess.run([pi_loss, train_pi_opt, target_update], feed_dict)
        logger.store(LossPi=outs[0])
          
      logger.store(EpRet=ep_ret, EpLen=ep_len)
      o, r, d, ep_ret, ep_len = env.reset(), 0, False, 0, 0
        
    # End of epoch wrap-up
    # (t+1 so that the final epoch, which ends at t == total_steps-1, is also logged;
    #  with `t > 0 and t % steps_per_epoch == 0` a run with epochs=1 never logs)
    if (t+1) % steps_per_epoch == 0:
      epoch = (t+1) // steps_per_epoch
      
      # Save model
      if (epoch % save_freq == 0) or (epoch == epochs):
        logger.save_state({'env': env}, None)

      # Log info about epoch (every epoch, independent of save_freq)
      logger.log_tabular('Epoch', epoch)
      logger.log_tabular('EpRet', with_min_and_max=True)
      #logger.log_tabular('TestEpRet', with_min_and_max=True)
      logger.log_tabular('EpLen', average_only=True)
      #logger.log_tabular('TestEpLen', average_only=True)
      logger.log_tabular('TotalEnvInteracts', t+1)
      logger.log_tabular('QVals', with_min_and_max=True)
      logger.log_tabular('LossPi', average_only=True)
      logger.log_tabular('LossQ', average_only=True)
      logger.log_tabular('Time', time.time()-start_time)
      logger.dump_tabular()

### Run DDPG with LunarLanderContinuous Env.
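
The DDPG code above reads `env.action_space.shape[0]` and `env.action_space.high[0]`, so it only works with continuous, Box-like action spaces. The guard below is a sketch of a pre-flight check. `check_continuous` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that mimic the `shape` and `high` attributes of gym spaces (assuming that a discrete space has an empty shape and no `high`), so the example runs without gym installed.

```python
from types import SimpleNamespace

def check_continuous(action_space):
    """Return (act_dim, act_limit), or raise if the space is not Box-like."""
    shape = getattr(action_space, "shape", None)
    high = getattr(action_space, "high", None)
    if not shape or high is None:
        raise ValueError(
            "DDPG needs a continuous action space with `shape` and `high`")
    return shape[0], float(high[0])

# Stand-ins for gym spaces (assumptions, not real gym objects).
box_like = SimpleNamespace(shape=(2,), high=[1.0, 1.0])   # 2-D continuous space
discrete_like = SimpleNamespace(shape=(), n=4)            # 4-way discrete space

print(check_continuous(box_like))
try:
    check_continuous(discrete_like)
except ValueError as err:
    print("rejected:", err)
```

With a real environment the same call would be `check_continuous(env.action_space)` before building the graph.
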

In [None]:
# load RL packages
import gym
import tensorflow as tf

# after training, load policy and show results in video
#from spinup.utils.test_policy import load_policy, run_policy

from spinup.utils.test_policy import load_policy_and_env, run_policy


# global parameters
env_name = 'LunarLanderContinuous-v2'
output_dir = '/content/DDPG/LunarLanderContinuous-v2-1'

# prepare ddpg parameters
env_fn = lambda : gym.make(env_name)

ac_kwargs = dict(hidden_sizes=[64, 64], activation=tf.nn.relu) # [128, 128] or [256, 256]

logger_kwargs = dict(output_dir=output_dir, exp_name=env_name)

##### Exp log
# first attempt: 'LunarLander-v2'
# DDPG does not work there because the action space is discrete
#env = gym.make('LunarLander-v2') 
#> action_space is Discrete(4), so action_space.shape is ()
#env.action_space.shape
#print(env.action_space.shape[0])  # fails: empty shape has no index 0
#act_dim = env.action_space.shape[0]


In [None]:
ddpg(env_fn=env_fn, ac_kwargs=ac_kwargs, steps_per_epoch=5000, epochs=1, logger_kwargs=logger_kwargs)

Logging data to /content/DDPG/LunarLanderContinuous-v2-1/progress.txt
Saving config:

{
    "ac_kwargs":	{
        "activation":	"relu",
        "hidden_sizes":	[
            64,
            64
        ]
    },
    "act_noise":	0.1,
    "actor_critic":	"mlp_actor_critic",
    "batch_size":	100,
    "env_fn":	"<function <lambda> at 0x7f0d202fc488>",
    "epochs":	1,
    "exp_name":	"LunarLanderContinuous-v2",
    "gamma":	0.99,
    "logger":	{
        "<spinup.utils.logx.EpochLogger object at 0x7f0d9e0a9c88>":	{
            "epoch_dict":	{},
            "exp_name":	"LunarLanderContinuous-v2",
            "first_row":	true,
            "log_current_row":	{},
            "log_headers":	[],
            "output_dir":	"/content/DDPG/LunarLanderContinuous-v2-1",
            "output_file":	{
                "<_io.TextIOWrapper name='/content/DDPG/LunarLanderContinuous-v2-1/progress.txt' mode='w' encoding='UTF-8'>":	{
                    "mode":	"w"
                }
     