# DDPG

In this Jupyter notebook I show the differences between a standard actor-critic algorithm and DDPG, and how to use the `DDPG` class from `stable_baselines`. Finally, I will use `Optuna` to optimize the hyperparameters.

## The Differences in the Algorithms:

### Actor Critic
```python
def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
    for h in hidden_sizes[:-1]:
        x = tf.layers.dense(x, units=h, activation=activation)
    return tf.layers.dense(x, units=hidden_sizes[-1], activation=output_activation)

def mlp_gaussian_policy(x, a, hidden_sizes, activation, output_activation, action_space):
    act_dim = a.shape.as_list()[-1]
    mu = mlp(x, list(hidden_sizes)+[act_dim], activation, output_activation)
    log_std = tf.get_variable(name='log_std', initializer=-0.5*np.ones(act_dim, dtype=np.float32))
    std = tf.exp(log_std)
    pi = mu + tf.random_normal(tf.shape(mu)) * std
    logp = gaussian_likelihood(a, mu, log_std)
    logp_pi = gaussian_likelihood(pi, mu, log_std)
    return pi, logp, logp_pi

def mlp_actor_critic(x, a, hidden_sizes=(64,64), activation=tf.tanh, 
                     output_activation=None, policy=None, action_space=None):
    # Actor
    policy = mlp_gaussian_policy
    with tf.variable_scope('pi'):
        pi, logp, logp_pi = policy(x, a, hidden_sizes, activation, output_activation, action_space)
    # Critic
    with tf.variable_scope('v'):
        v = tf.squeeze(mlp(x, list(hidden_sizes)+[1], activation, None), axis=1)
    return pi, logp, logp_pi, v
```
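
`gaussian_likelihood` is called in the policy above but is not defined in this notebook. Here is a minimal plain-Python sketch, assuming the usual log-density of a diagonal Gaussian summed over the action dimensions. The `EPS` constant and the list-based signature are my assumptions for illustration, not the original TensorFlow helper.

```python
import math

EPS = 1e-8  # assumed small constant to avoid division by zero


def gaussian_likelihood(x, mu, log_std):
    """Log-density of x under a diagonal Gaussian N(mu, exp(log_std)^2),
    summed over the action dimensions."""
    total = 0.0
    for x_i, mu_i, log_std_i in zip(x, mu, log_std):
        z = (x_i - mu_i) / (math.exp(log_std_i) + EPS)
        total += -0.5 * (z * z + 2.0 * log_std_i + math.log(2.0 * math.pi))
    return total


# Log-probability of the mean action for a 2-D action with log_std = -0.5,
# the same initial value used for `log_std` in the policy above.
logp_at_mean = gaussian_likelihood([0.0, 0.0], [0.0, 0.0], [-0.5, -0.5])
```

Actions further away from `mu` get a lower log-probability, which is what the policy-gradient loss weights by the advantage.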

### DDPG
```python
def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
    for h in hidden_sizes[:-1]:
        x = tf.layers.dense(x, units=h, activation=activation)
    return tf.layers.dense(x, units=hidden_sizes[-1], activation=output_activation)

def mlp_actor_critic(x, a, hidden_sizes=(400,300), activation=tf.nn.relu, 
                     output_activation=tf.tanh, action_space=None):
    act_dim = a.shape.as_list()[-1]
    act_limit = action_space.high[0]
    # Actor
    with tf.variable_scope('pi'):
        pi = act_limit * mlp(x, list(hidden_sizes)+[act_dim], activation, output_activation)
    # Critic evaluated on the state and the action actually taken
    with tf.variable_scope('q'):
        q = tf.squeeze(mlp(tf.concat([x,a], axis=-1), list(hidden_sizes)+[1], activation, None), axis=1)
    # Same critic (shared weights) evaluated on the state and the policy's action
    with tf.variable_scope('q', reuse=True):
        q_pi = tf.squeeze(mlp(tf.concat([x,pi], axis=-1), list(hidden_sizes)+[1], activation, None), axis=1)
    return pi, q, q_pi
```

## The Differences in the Cost Function:

### Actor Critic objectives:
```python
# VPG objectives:
# Use the policy gradient method to train the policy
pi_loss = -tf.reduce_mean(logp * adv_ph)
# Train the critic: it should output
# the sum of rewards (the return)
v_loss = tf.reduce_mean((ret_ph - v)**2)
```
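
The critic target `ret_ph` above is the return observed from each step onwards. A small sketch of how such returns could be computed from one trajectory, assuming a discounted reward-to-go (the function names and the toy numbers are mine, for illustration only):

```python
def discounted_returns(rewards, gamma=0.99):
    """Reward-to-go: returns[t] = rewards[t] + gamma * returns[t + 1]."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def value_loss(returns, values):
    """Mean squared error between the returns and the critic predictions."""
    return sum((r - v) ** 2 for r, v in zip(returns, values)) / len(returns)


rets = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
loss = value_loss(rets, [1.75, 1.5, 0.0])  # only the last prediction is off
```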

### DDPG objectives:
```python
## DDPG losses
# Train the policy: maximize Q(s, pi(s))
pi_loss = -tf.reduce_mean(q_pi)

# Train the critic on the actions actually taken in the trajectory.
# Bellman backup for the Q function: the target policy supplies the next action for the target value
backup = tf.stop_gradient(r_ph + gamma*(1-d_ph)*q_pi_targ)
q_loss = tf.reduce_mean((q-backup)**2)
```
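
To make the Bellman backup concrete, here is a tiny worked example in plain Python with made-up numbers. It mirrors `r_ph + gamma*(1-d_ph)*q_pi_targ` from the loss above; the helper names are hypothetical.

```python
def bellman_backup(rewards, dones, q_pi_targ, gamma=0.99):
    """Target for Q(s, a): r + gamma * (1 - done) * Q_targ(s', pi_targ(s'))."""
    return [r + gamma * (1.0 - d) * q
            for r, d, q in zip(rewards, dones, q_pi_targ)]


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Two transitions: the first is non-terminal, the second ends the episode,
# so the bootstrap term is dropped for the second one.
backup = bellman_backup(rewards=[1.0, 0.0], dones=[0.0, 1.0],
                        q_pi_targ=[10.0, 10.0], gamma=0.9)
q_loss = mse([9.0, 1.0], backup)
```

The `(1 - done)` factor is what stops the critic from bootstrapping past the end of an episode.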

## DDPG Example with stable_baselines

In [None]:
import gym
import numpy as np
import yaml
import os
import random
from collections import OrderedDict
import tensorflow as tf
from stable_baselines.ddpg.policies import MlpPolicy
from stable_baselines.ddpg.policies import FeedForwardPolicy
from stable_baselines.common.vec_env import DummyVecEnv, VecFrameStack
from stable_baselines.ddpg.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise, AdaptiveParamNoiseSpec
from stable_baselines import DDPG

### Arguments

In [None]:
env_id = "MountainCarContinuous-v0"
# Algorithm name
algo = "ddpg"
# TensorBoard log directory
tensorboard_log = 'logs'
# Path to a pretrained agent to continue training
trained_agent = 'trained_agents'
# Log folder
log_folder = 'logs'
# Seed
seed = 0
# Verbose mode (0: no output, 1: INFO)
verbose = 1
# Override log interval (default: -1, no change)
log_interval = 6
# Whether to use a separate evaluation environment during training
evaluation = True

### Seed
Set the seed for Python's `random`, TensorFlow and NumPy. The Gym environment itself is seeded separately when it is created.

In [None]:
def set_global_seeds(seed):
    tf.set_random_seed(seed)
    np.random.seed(seed)
    random.seed(seed)

In [None]:
set_global_seeds(seed)

### Hyperparams

Some parameters that we may need to define:
* **policy** – (DDPGPolicy or str) The policy model to use (MlpPolicy, CnnPolicy, LnMlpPolicy, …)
* **env** – (Gym environment or str) The environment to learn from (if registered in Gym, can be str)
* **gamma** – (float) the discount factor
* **eval_env** – (Gym Environment) the evaluation environment (can be None)
* **nb_train_steps** – (int) the number of training steps (how many times we sample batches from Replay Buffer)
* **nb_rollout_steps** – (int) the number of rollout (episode) steps
* **nb_eval_steps** – (int) the number of evaluation steps
* **param_noise** – (AdaptiveParamNoiseSpec) the parameter noise type (can be None)
* **action_noise** – (ActionNoise) the action noise type (can be None)
* **param_noise_adaption_interval** – (int) apply param noise every N steps
* **tau** – (float) the soft update coefficient (keep old values, between 0 and 1)
* **normalize_returns** – (bool) should the critic output be normalized
* **normalize_observations** – (bool) should the observation be normalized
* **batch_size** – (int) the size of the batch for learning the policy
* **observation_range** – (tuple) the bounding values for the observation
* **return_range** – (tuple) the bounding values for the critic output
* **critic_l2_reg** – (float) l2 regularizer coefficient
* **actor_lr** – (float) the actor learning rate
* **critic_lr** – (float) the critic learning rate
* **clip_norm** – (float) clip the gradients (disabled if None)
* **reward_scale** – (float) the value the reward should be scaled by
* **render** – (bool) enable rendering of the environment
* **render_eval** – (bool) enable rendering of the evaluation environment
* **buffer_size** – (int) the max number of transitions to store, size of the replay buffer
* **random_exploration** – (float) probability of taking a random action (as in an epsilon-greedy strategy). This is normally not needed for DDPG but can help exploration when using HER + DDPG. This hack was present in the original OpenAI Baselines repo (DDPG + HER).
* **verbose** – (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug
* **tensorboard_log** – (str) the log location for tensorboard (if None, no logging)
* **policy_kwargs** – (dict) additional arguments to be passed to the policy on creation
* **full_tensorboard_log** – (bool) enable additional logging when using tensorboard. WARNING: this logging can take a lot of space quickly.
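
The role of **tau** is easiest to see in code. Below is a minimal sketch of a soft (Polyak) target update, assuming the common convention `target = (1 - tau) * target + tau * online`. The plain lists stand in for network weights and the function name is hypothetical.

```python
def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau towards the online one."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]


target = [0.0, 0.0]
online = [1.0, 2.0]
# With a large tau the target catches up quickly; DDPG typically uses a
# very small tau so that the target networks change slowly.
for _ in range(3):
    target = soft_update(target, online, tau=0.5)
```

With `tau=1.0` the target would simply be a copy of the online parameters; with `tau=0.0` it would never change.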


In [None]:
# Load hyperparameters from yaml file
# (safe_load avoids the missing-Loader problem of a bare yaml.load in recent PyYAML versions)
with open('hyperparams/{}.yml'.format(algo), 'r') as f:
    hyperparams_dict = yaml.safe_load(f)
    hyperparams = hyperparams_dict[env_id]


# Sort hyperparams that will be saved
saved_hyperparams = OrderedDict([(key, hyperparams[key]) for key in sorted(hyperparams.keys())])
print(saved_hyperparams)

total_timesteps = hyperparams['total_timesteps']
del hyperparams['total_timesteps']



### Environment
The reward is 100 for reaching the target at the top of the hill on the right-hand side, minus the sum of the squared actions from start to goal.

This reward function raises an exploration challenge, because if the agent does not reach the target soon enough, it will figure out that it is better not to move, and won't find the target anymore.

Note that this reward is unusual with respect to most published work, where the goal was to reach the target as fast as possible, hence favouring a bang-bang strategy.
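
A toy calculation shows why standing still looks attractive to the agent. This is only a sketch: the per-step cost factor `action_cost=0.1` and the episode lengths are my assumptions for illustration, and the penalty is taken as the square of each action accumulated over the episode.

```python
def episode_return(actions, reached_goal, action_cost=0.1, goal_bonus=100.0):
    """Return of one episode: goal bonus (if reached) minus the action penalty."""
    total = -action_cost * sum(a * a for a in actions)
    if reached_goal:
        total += goal_bonus
    return total


idle = episode_return([0.0] * 999, reached_goal=False)      # never moves
flailing = episode_return([1.0] * 999, reached_goal=False)  # full throttle, no goal
success = episode_return([1.0] * 100, reached_goal=True)    # reaches the goal
```

As long as the agent has never seen the goal bonus, every action only costs reward, so `idle` beats `flailing` and the policy drifts towards doing nothing.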

In [None]:
def create_env():
    """
    Create the environment and wrap it if necessary
    :return: (gym.Env)
    """
    global hyperparams

    env = gym.make(env_id)
    env.seed(seed)
    #if env_wrapper is not None:
    #    env = env_wrapper(env)
    
    env = DummyVecEnv([lambda:env])
    # Optional Frame-stacking
    if hyperparams.get('frame_stack', False):
        n_stack = hyperparams['frame_stack']
        env = VecFrameStack(env, n_stack)
        print("Stacking {} frames".format(n_stack))
        del hyperparams['frame_stack']
    return env

In [None]:
env = create_env()

### Exploration
DDPG trains a deterministic policy in an off-policy way. Because the policy is deterministic, if the agent were to explore on-policy, in the beginning it would probably not try a wide enough variety of actions to find useful learning signals. To make DDPG policies explore better, we add noise to their actions at training time. The authors of the original DDPG paper recommended time-correlated OU noise. To facilitate getting higher-quality training data, you may reduce the scale of the noise over the course of training.
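
To illustrate what "time-correlated" means here, this is a minimal plain-Python sketch of an Ornstein-Uhlenbeck process. The class name `OUNoise`, the defaults `theta=0.15` and `dt=1e-2`, and the zero initial state are assumptions for illustration; in the notebook itself the `OrnsteinUhlenbeckActionNoise` class from `stable_baselines` is used.

```python
import math
import random


class OUNoise:
    """x <- x + theta * (mean - x) * dt + sigma * sqrt(dt) * N(0, 1)"""

    def __init__(self, mean, sigma, theta=0.15, dt=1e-2, seed=None):
        self.mean = list(mean)
        self.sigma = list(sigma)
        self.theta = theta
        self.dt = dt
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Start every episode from zero noise (assumed convention)
        self.state = [0.0] * len(self.mean)

    def __call__(self):
        # Each sample depends on the previous one, so the noise is correlated in time
        self.state = [
            x + self.theta * (m - x) * self.dt
            + s * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for x, m, s in zip(self.state, self.mean, self.sigma)
        ]
        return list(self.state)


noise = OUNoise(mean=[0.0], sigma=[0.5], seed=0)
samples = [noise()[0] for _ in range(5)]  # would be added to the policy's action
```

Because consecutive samples are correlated, the agent pushes in roughly the same direction for several steps, which helps in environments with momentum such as the mountain car.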

In [None]:
# Exploration noise is optional: only configure it when the YAML file asks for it
if hyperparams.get('noise_type') is not None:
    noise_type = hyperparams['noise_type'].strip()
    noise_std = hyperparams['noise_std']
    n_actions = env.action_space.shape[0]

    if 'adaptive-param' in noise_type:
        hyperparams['param_noise'] = AdaptiveParamNoiseSpec(initial_stddev=noise_std,
                                                            desired_action_stddev=noise_std)
    elif 'normal' in noise_type:
        if 'lin' in noise_type:
            # NOTE: LinearNormalActionNoise is not imported in this notebook;
            # it has to be defined or imported separately before using this branch
            hyperparams['action_noise'] = LinearNormalActionNoise(mean=np.zeros(n_actions),
                                                                  sigma=noise_std * np.ones(n_actions),
                                                                  final_sigma=hyperparams.get('noise_std_final', 0.0) * np.ones(n_actions),
                                                                  max_steps=total_timesteps)
        else:
            hyperparams['action_noise'] = NormalActionNoise(mean=np.zeros(n_actions),
                                                            sigma=noise_std * np.ones(n_actions))
    elif 'ornstein-uhlenbeck' in noise_type:
        hyperparams['action_noise'] = OrnsteinUhlenbeckActionNoise(mean=np.zeros(n_actions),
                                                                   sigma=noise_std * np.ones(n_actions))
    else:
        raise ValueError("Unknown noise type: {}".format(noise_type))

    # Remove the noise settings so they are not passed on to DDPG(**hyperparams)
    del hyperparams['noise_type']
    del hyperparams['noise_std']
    hyperparams.pop('noise_std_final', None)
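
The linear variant in the cell above takes `sigma`, `final_sigma` and `max_steps`. One plausible reading of those arguments is a noise scale that is interpolated linearly over training. The sketch below is my assumption about that behaviour, not the actual `LinearNormalActionNoise` code.

```python
def linear_sigma(step, sigma, final_sigma, max_steps):
    """Interpolate the noise scale from sigma to final_sigma over max_steps."""
    frac = min(step / max_steps, 1.0)  # stays at final_sigma after max_steps
    return sigma + frac * (final_sigma - sigma)


start = linear_sigma(0, sigma=0.5, final_sigma=0.0, max_steps=100)
middle = linear_sigma(50, sigma=0.5, final_sigma=0.0, max_steps=100)
end = linear_sigma(200, sigma=0.5, final_sigma=0.0, max_steps=100)
```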

### Evaluation:

In [None]:
if evaluation:
    # Separate environment used only for evaluation
    eval_env = gym.make(env_id)
    hyperparams['nb_eval_steps'] = 2
    hyperparams['render_eval'] = False

### Create Custom Policy

In [None]:
# Custom MLP policy with three hidden layers of 32 units each
class CustomPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                           layers=[32, 32, 32],
                                           layer_norm=False,
                                           feature_extraction="mlp")

### Define a Model:

In [None]:
# Train an agent from scratch
tensorboard_log = os.path.join(tensorboard_log, env_id)

if evaluation:
    # Pass the evaluation environment to the model as well
    model = DDPG(env=env, eval_env=eval_env, tensorboard_log=tensorboard_log, verbose=verbose, **hyperparams)
else:
    model = DDPG(env=env, tensorboard_log=tensorboard_log, verbose=verbose, **hyperparams)

#### Train the Model
* **total_timesteps** – (int) The total number of samples to train on
* **callback** – (function (dict, dict) -> bool) function called at every step with the state of the algorithm. It takes the local and global variables. If it returns False, training is aborted.
* **seed** – (int) The initial seed for training, if None: keep current seed
* **log_interval** – (int) The number of timesteps before logging.
* **tb_log_name** – (str) the name of the run for tensorboard log
* **reset_num_timesteps** – (bool) whether or not to reset the current timestep number (used in logging)
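
As an illustration of the **callback** argument, here is a small sketch of a callback that stops training after a fixed number of calls. It assumes the `(locals, globals) -> bool` signature described in the list above; the factory name `make_step_limit_callback` is hypothetical.

```python
def make_step_limit_callback(max_calls):
    """Build a callback that returns False (abort) on the max_calls-th call."""
    calls = {'n': 0}

    def callback(_locals, _globals):
        calls['n'] += 1
        # Returning False tells the training loop to stop
        return calls['n'] < max_calls

    return callback


# Stand-alone check with empty dicts in place of the real local/global variables
cb = make_step_limit_callback(3)
results = [cb({}, {}) for _ in range(3)]
```

It could then be passed as `model.learn(total_timesteps=total_timesteps, callback=cb)`.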

In [None]:
kwargs = {}
if log_interval > -1:
    kwargs = {'log_interval': log_interval}

model.learn(total_timesteps=total_timesteps, **kwargs)

### Save trained model

In [None]:
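log_path = "{}/{}/".format(log_folder, algo)
# The run id is fixed to 0 here; get_latest_run_id(log_path, env_id) + 1 could be used instead
save_path = os.path.join(log_path, "{}_{}".format(env_id, 0))
# Create the target folder first (assumption: model.save does not create missing directories)
os.makedirs(save_path, exist_ok=True)
print("Saving to {}".format(save_path))
model.save("{}/{}".format(save_path, env_id))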

### Test trained Model

In [None]:
for j in range(4):
    obs = env.reset()
    for i in range(1000):
        action, _states = model.predict(obs)
        obs, rewards, dones, info = env.step(action)
        env.render()
        if dones:
            break
env.close()

### Loading Pretrained Agent

In [None]:
del model

trained_agent = "logs/ddpg/MountainCarContinuous-v0_0/MountainCarContinuous-v0.pkl"

if trained_agent.endswith('.pkl') and os.path.isfile(trained_agent):
    # Continue training
    print("Loading pretrained agent")
    # The policy is stored with the saved model and should not be changed,
    # so it is left out of the keyword arguments passed to load()
    load_kwargs = {key: value for key, value in hyperparams.items() if key != 'policy'}

    model = DDPG.load(trained_agent, env=env,
                      tensorboard_log=tensorboard_log, verbose=verbose, **load_kwargs)

## Hyperparameters Optimization

In [None]:
import optuna
from optuna.pruners import SuccessiveHalvingPruner, MedianPruner
from optuna.samplers import RandomSampler, TPESampler
from optuna.integration.skopt import SkoptSampler

TODO...