### Quick theory
Just like the Actor-Critic method, we have two networks:

* Actor - It proposes an action given a state.
* Critic - It predicts if the action is good (positive value) or bad (negative value) given a state and an action.

DDPG also borrows two stabilizing techniques from DQN:

1. Uses two Target networks.
   Why? Because they add stability to training. In short, we are learning from estimated targets, and the Target networks are updated slowly, which keeps those estimated targets stable. Conceptually, this is like saying, "I have an idea of how to play this well, I'm going to try it out for a bit until I find something better", as opposed to saying, "I'm going to re-learn how to play this entire game after every move". See this StackOverflow answer.

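2. Uses Experience Replay.
   We store (state, action, reward, next state) tuples in a buffer and train on random samples from it, rather than only on the most recent transition.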
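
The slow Target network update from the first item can be sketched with plain NumPy. This is only an illustration under assumed toy values, not the notebook's implementation: `soft_update` and the one-array "networks" are hypothetical, and `tau` is the blending rate (much smaller than 1).

```python
import numpy as np

def soft_update(target_weights, online_weights, tau):
    """Move each target array a small step (tau) toward its online counterpart."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target_weights, online_weights)]

# Toy weights (hypothetical): the target starts at 0, the online network sits at 1.
target = [np.zeros(3)]
online = [np.ones(3)]

for _ in range(100):
    target = soft_update(target, online, tau=0.005)

# After 100 updates the target has covered 1 - 0.995**100 of the gap (about 0.39),
# so it trails the online weights instead of jumping to them.
print(target[0])
```

A larger `tau` makes the target follow the online network more closely, at the cost of less stable training targets.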

### Losses:

**Critic loss** - Mean squared error between y and Q(s, a), where y is the expected return as seen by the Target networks (the reward plus the discounted Target critic value of the next state) and Q(s, a) is the action value predicted by the Critic network. y is a moving target that the Critic model tries to match; we keep this target stable by updating the Target models slowly.

**Actor loss** - This is computed using the mean of the value given by the Critic network for the actions taken by the Actor network. We seek to maximize this quantity.

Hence we update the Actor network so that it produces actions that get the maximum predicted value as seen by the Critic, for a given state.
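
To make the two losses concrete, here is a small worked example in NumPy. All numbers are made up for illustration, and the arrays stand in for network outputs; the formulas mirror the `update` method used later in this notebook.

```python
import numpy as np

gamma = 0.99  # discount factor (same value the notebook uses later)

# Hypothetical batch of two transitions; the arrays stand in for network outputs.
rewards = np.array([[-1.0], [-0.5]])
target_q_next = np.array([[-10.0], [-8.0]])  # Target critic value for (s', Target actor action)
q_pred = np.array([[-11.0], [-8.0]])         # Critic value Q(s, a) for the stored actions
q_actor = np.array([[-9.0], [-7.0]])         # Critic value for the Actor's current actions

# Critic: y = r + gamma * Q_target(s', a'), then mean squared error against Q(s, a).
y = rewards + gamma * target_q_next
critic_loss = np.mean((y - q_pred) ** 2)

# Actor: minimizing the negative mean Critic value maximizes the predicted value.
actor_loss = -np.mean(q_actor)

print(y.ravel(), critic_loss, actor_loss)
```

Here `y` works out to -10.9 and -8.42, so the critic loss is the mean of 0.1² and 0.42², and the actor loss is simply 8.0.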

### Initialization:
The weights of the Actor's last layer must be initialized between -0.003 and 0.003. This keeps the initial outputs away from 1 and -1, where the tanh activation saturates and would squash our gradients to zero.
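
The effect can be checked numerically. The sketch below is an assumption-laden illustration rather than part of the model: it uses random values in [0, 1] as a stand-in for the 256 hidden activations and compares the narrow initialization against a much wider one chosen only for contrast.

```python
import numpy as np

def tanh_grad(z):
    """Derivative of tanh at z: 1 - tanh(z)^2."""
    return 1.0 - np.tanh(z) ** 2

rng = np.random.default_rng(0)
# Stand-in for the activations of the last hidden layer (assumed to lie in [0, 1]).
hidden = rng.uniform(0.0, 1.0, size=256)

small_w = rng.uniform(-0.003, 0.003, size=256)  # range used in this notebook
large_w = rng.uniform(-1.0, 1.0, size=256)      # wider range, for contrast only

z_small = hidden @ small_w  # pre-activation of the tanh output unit
z_large = hidden @ large_w

print("small init: pre-activation", z_small, "tanh gradient", tanh_grad(z_small))
print("large init: pre-activation", z_large, "tanh gradient", tanh_grad(z_large))
```

With the narrow range the pre-activation cannot exceed 256 × 0.003 = 0.768 in magnitude under this assumption, so the tanh gradient stays well above zero. The wider range allows far larger pre-activations, where the gradient can collapse.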

In [1]:
import shutup
shutup.please()

In [2]:
import gym
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import tensorflow.keras as keras
import tensorflow_probability as tfp
from tensorflow.keras import layers
from tensorflow.keras.layers import Input, Dense, Concatenate

from src.agents.agent import Agent
from src.utils.buffer import Buffer

In [3]:
problem = "Pendulum-v1"
env = gym.make(problem)

num_states = env.observation_space.shape[0]
print("Size of State Space ->  {}".format(num_states))
num_actions = env.action_space.shape[0]
print("Size of Action Space ->  {}".format(num_actions))

upper_bound = env.action_space.high[0]
lower_bound = env.action_space.low[0]

print("Max Value of Action ->  {}".format(upper_bound))
print("Min Value of Action ->  {}".format(lower_bound))

Size of State Space ->  3
Size of Action Space ->  1
Max Value of Action ->  2.0
Min Value of Action ->  -2.0


In [4]:
"""
To implement better exploration by the Actor network, we use noisy perturbations, 
specifically an Ornstein-Uhlenbeck process for generating noise, as described in the paper. 
It samples noise from a correlated normal distribution.
"""

import numpy as np

class OUActionNoise:
    def __init__(self, mean, std_deviation, theta=0.15, dt=1e-2, x_initial=None):
        self.theta = theta
        self.mean = mean
        self.std_dev = std_deviation
        self.dt = dt
        self.x_initial = x_initial
        self.reset()

    def __call__(self):
        # Formula taken from https://www.wikipedia.org/wiki/Ornstein-Uhlenbeck_process.
        x = (
            self.x_prev
            + self.theta * (self.mean - self.x_prev) * self.dt
            + self.std_dev * np.sqrt(self.dt) * np.random.normal(size=self.mean.shape)
        )
        # Store x into x_prev
        # Makes next noise dependent on current one
        self.x_prev = x
        return x

    def reset(self):
        if self.x_initial is not None:
            self.x_prev = self.x_initial
        else:
            self.x_prev = np.zeros_like(self.mean)

In [5]:
class Buffer:
    def __init__(self, buffer_capacity=100000, batch_size=64):
        # Number of "experiences" to store at max
        self.buffer_capacity = buffer_capacity
        # Num of tuples to train on.
        self.batch_size = batch_size

        # Counts how many times record() has been called.
        self.buffer_counter = 0

        # Instead of a list of tuples, as experience replay is usually described,
        # we use a separate np.array for each tuple element
        self.state_buffer = np.zeros((self.buffer_capacity, num_states))
        self.action_buffer = np.zeros((self.buffer_capacity, num_actions))
        self.reward_buffer = np.zeros((self.buffer_capacity, 1))
        self.next_state_buffer = np.zeros((self.buffer_capacity, num_states))

    # Takes an (s,a,r,s') observation tuple as input
    def record(self, obs_tuple):
        # Set index to zero if buffer_capacity is exceeded,
        # replacing old records
        index = self.buffer_counter % self.buffer_capacity

        self.state_buffer[index] = obs_tuple[0]
        self.action_buffer[index] = obs_tuple[1]
        self.reward_buffer[index] = obs_tuple[2]
        self.next_state_buffer[index] = obs_tuple[3]

        self.buffer_counter += 1

    # Eager execution is turned on by default in TensorFlow 2. Decorating with tf.function allows
    # TensorFlow to build a static graph out of the logic and computations in our function.
    # This provides a large speed up for blocks of code that contain many small TensorFlow operations such as this one.
    @tf.function
    def update(
        self, state_batch, action_batch, reward_batch, next_state_batch,
    ):
        # Training and updating Actor & Critic networks.
        # See Pseudo Code.
        with tf.GradientTape() as tape:
            target_actions = target_actor(next_state_batch, training=True)
            y = reward_batch + gamma * target_critic(
                [next_state_batch, target_actions], training=True
            )
            critic_value = critic_model([state_batch, action_batch], training=True)
            critic_loss = tf.math.reduce_mean(tf.math.square(y - critic_value))

        critic_grad = tape.gradient(critic_loss, critic_model.trainable_variables)
        critic_optimizer.apply_gradients(
            zip(critic_grad, critic_model.trainable_variables)
        )

        with tf.GradientTape() as tape:
            actions = actor_model(state_batch, training=True)
            critic_value = critic_model([state_batch, actions], training=True)
            # Used `-value` as we want to maximize the value given
            # by the critic for our actions
            actor_loss = -tf.math.reduce_mean(critic_value)

        actor_grad = tape.gradient(actor_loss, actor_model.trainable_variables)
        actor_optimizer.apply_gradients(
            zip(actor_grad, actor_model.trainable_variables)
        )

    # We compute the loss and update parameters
    def learn(self):
        # Get sampling range
        record_range = min(self.buffer_counter, self.buffer_capacity)
        # Randomly sample indices
        batch_indices = np.random.choice(record_range, self.batch_size)

        # Convert to tensors
        state_batch = tf.convert_to_tensor(self.state_buffer[batch_indices])
        action_batch = tf.convert_to_tensor(self.action_buffer[batch_indices])
        reward_batch = tf.convert_to_tensor(self.reward_buffer[batch_indices])
        reward_batch = tf.cast(reward_batch, dtype=tf.float32)
        next_state_batch = tf.convert_to_tensor(self.next_state_buffer[batch_indices])

        self.update(state_batch, action_batch, reward_batch, next_state_batch)


# This updates the target parameters slowly,
# based on rate `tau`, which is much less than one.
@tf.function
def update_target(target_weights, weights, tau):
    for (a, b) in zip(target_weights, weights):
        a.assign(b * tau + a * (1 - tau))

In [6]:
def get_actor():
    # Initialize weights between -3e-3 and 3e-3
    last_init = tf.random_uniform_initializer(minval=-0.003, maxval=0.003)

    inputs = layers.Input(shape=(num_states,))
    out = layers.Dense(256, activation="relu")(inputs)
    out = layers.Dense(256, activation="relu")(out)
    outputs = layers.Dense(1, activation="tanh", kernel_initializer=last_init)(out)

    # Our upper bound is 2.0 for Pendulum.
    outputs = outputs * upper_bound
    model = tf.keras.Model(inputs, outputs)
    return model


def get_critic():
    # State as input
    state_input = layers.Input(shape=(num_states,))
    state_out = layers.Dense(16, activation="relu")(state_input)
    state_out = layers.Dense(32, activation="relu")(state_out)

    # Action as input
    action_input = layers.Input(shape=(num_actions,))
    action_out = layers.Dense(32, activation="relu")(action_input)

    # Both are passed through separate layers before concatenating
    concat = layers.Concatenate()([state_out, action_out])

    out = layers.Dense(256, activation="relu")(concat)
    out = layers.Dense(256, activation="relu")(out)
    outputs = layers.Dense(1)(out)

    # Outputs a single value for a given state-action pair
    model = tf.keras.Model([state_input, action_input], outputs)

    return model

In [7]:
def policy(state, noise_object):
    sampled_actions = tf.squeeze(actor_model(state))
    noise = noise_object()
    # Adding noise to action
    sampled_actions = sampled_actions.numpy() + noise

    # We make sure action is within bounds
    legal_action = np.clip(sampled_actions, lower_bound, upper_bound)

    return [np.squeeze(legal_action)]

In [8]:
std_dev = 0.2
ou_noise = OUActionNoise(mean=np.zeros(1), std_deviation=float(std_dev) * np.ones(1))

actor_model = get_actor()
critic_model = get_critic()

target_actor = get_actor()
target_critic = get_critic()

# Making the weights equal initially
target_actor.set_weights(actor_model.get_weights())
target_critic.set_weights(critic_model.get_weights())

# Learning rate for actor-critic models
critic_lr = 0.002
actor_lr = 0.001

critic_optimizer = tf.keras.optimizers.Adam(critic_lr)
actor_optimizer = tf.keras.optimizers.Adam(actor_lr)

total_episodes = 100
# Discount factor for future rewards
gamma = 0.99
# Used to update target networks
tau = 0.005

buffer = Buffer(50000, 64)

In [9]:
# Train loop

# To store reward history of each episode
ep_reward_list = []
# To store average reward history of last few episodes
avg_reward_list = []

# Takes about 4 min to train
for ep in range(total_episodes):

    prev_state = env.reset()
    episodic_reward = 0

    while True:


        tf_prev_state = tf.expand_dims(tf.convert_to_tensor(prev_state), 0)

        action = policy(tf_prev_state, ou_noise)
        # Receive state and reward from environment.
        state, reward, done, info = env.step(action)

        buffer.record((prev_state, action, reward, state))
        episodic_reward += reward

        buffer.learn()
        update_target(target_actor.variables, actor_model.variables, tau)
        update_target(target_critic.variables, critic_model.variables, tau)

        # End this episode when `done` is True
        if done:
            break

        prev_state = state

    ep_reward_list.append(episodic_reward)

    # Mean of last 40 episodes
    avg_reward = np.mean(ep_reward_list[-40:])
    print("Episode * {} * Avg Reward is ==> {}".format(ep, avg_reward))
    avg_reward_list.append(avg_reward)

# Plotting graph
# Episodes versus Avg. Rewards
plt.plot(avg_reward_list)
plt.xlabel("Episode")
plt.ylabel("Avg. Episodic Reward")
plt.show()

Episode * 0 * Avg Reward is ==> -1564.4454315284745
Episode * 1 * Avg Reward is ==> -1446.4550418149388
Episode * 2 * Avg Reward is ==> -1476.503785816549
Episode * 3 * Avg Reward is ==> -1469.2357564089987
Episode * 4 * Avg Reward is ==> -1481.987497880444
Episode * 5 * Avg Reward is ==> -1505.7760280675047
Episode * 6 * Avg Reward is ==> -1516.9246416399858
Episode * 7 * Avg Reward is ==> -1524.3587146887714
Episode * 8 * Avg Reward is ==> -1512.7817398272505
Episode * 9 * Avg Reward is ==> -1468.7281507299328
Episode * 10 * Avg Reward is ==> -1432.1537208446389
Episode * 11 * Avg Reward is ==> -1367.6303803084938
Episode * 12 * Avg Reward is ==> -1339.5193071569543
Episode * 13 * Avg Reward is ==> -1299.1608695755629
Episode * 14 * Avg Reward is ==> -1265.967403740705
Episode * 15 * Avg Reward is ==> -1238.3186918592755
Episode * 16 * Avg Reward is ==> -1206.043906395525


KeyboardInterrupt: 

In [None]:
env.action_space

In [None]:
class ActorCriticAgent(Agent):
    def __init__(self, 
                environment, 
                alpha = 0.01,
                gamma = 0.99,
                eps = np.finfo(np.float32).eps.item(),
                optimizer = tf.keras.optimizers.Adam(learning_rate=0.01),
                critic_loss= tf.keras.losses.Huber()):
        
        super(ActorCriticAgent, self).__init__(environment)
        
        # Args
        self.alpha = alpha
        self.gamma = gamma 
        self.eps = eps 
        self.optimizer=optimizer
        self.critic_loss = critic_loss
        #type(tf.keras.optimizers.Adam(learning_rate=0.01)).__name__

        self.__init_networks()
        self.__init_buffers()
        
    def __init_buffers(self):
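        # The DDPG `Buffer` class defined earlier in this notebook shadows the
        # imported one, so import the project buffer under a distinct name here.
        from src.utils.buffer import Buffer as EpisodeBuffer
        self.buffer = EpisodeBuffer(['action_log_probs','critic_values','rewards'])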
            
    def __init_networks(self):
        num_inputs = self.observation_shape[0]
        num_hidden = 128

        inputs = Input(shape=(num_inputs,),name="actor_critic_inputs")
        common_layer = Dense(num_hidden, activation="relu", name="actor_critic_common_layer")(inputs)
        
        if self.action_space_mode == "discrete":
            action = Dense(self.n_actions, activation="softmax")(common_layer)
        elif self.action_space_mode == "continuous":
            sigma = Dense(self.n_actions, activation="softplus", name="sigma")(common_layer)
            mu = Dense(self.n_actions, activation="tanh" , name='mu')(common_layer)
            action = Concatenate(axis=-1, name="actor_output")([mu,sigma])

        critic = Dense(1)(common_layer)

        self.model = keras.Model(inputs=inputs, outputs=[action, critic])

    def choose_action(self, state, deterministic=True):
        action_probs, critic_value = self.model(state)
        
        if self.action_space_mode == "discrete":
            # DISCRETE SAMPLING
            if deterministic:
                action = np.argmax(np.squeeze(action_probs))
                action_log_prob = action
            else:
                # Sample action from action probability distribution
                action = np.random.choice(self.n_actions, p=np.squeeze(action_probs))
                action_log_prob = tf.math.log(action_probs[0, action])


        elif self.action_space_mode == "continuous":
            # CONTINUOUS SAMPLING
            mu = action_probs[:,0:self.n_actions]
            sigma = action_probs[:,self.n_actions:]
            
            if deterministic:
                action = mu
                action_log_prob = action
            else:
                norm_dist = tfp.distributions.Normal(mu, sigma)
                action = tf.squeeze(norm_dist.sample(self.n_actions), axis=0)
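                # Keep the plain log-probability: the actor loss in learn() already
                # applies the minus sign (-log_prob * diff).
                action_log_prob = norm_dist.log_prob(action)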
                action = tf.clip_by_value(
                    action, self.env.action_space.low[0], 
                    self.env.action_space.high[0])

                action = np.array(action[0],dtype=np.float32)
        
        return action, action_log_prob , critic_value
    
    def test(self, episodes=10, render=True):

        for episode in range(episodes):
            state = self.env.reset()
            done = False
            score = 0
            while not done:
                if render:
                    self.env.render()
                
                state = tf.convert_to_tensor(state)
                state = tf.expand_dims(state, 0)
                
                # Sample action, probs and critic
                action, action_log_prob, critic_value = self.choose_action(state)

                # Step
                state,reward,done, info = self.env.step(action)

                # Get next state
                score += reward
            
            if render:
                self.env.close()

            print("Test episode: {}, score: {:.2f}".format(episode,score))
    
    def learn(self, timesteps=-1, plot_results=True, reset=False, log_each_n_episodes=100, success_threshold=False):
        
        self.validate_learn(timesteps,success_threshold,reset)
        success_threshold = success_threshold if success_threshold else self.env.success_threshold
 
        self.buffer.reset()
        score = 0
        timestep = 0
        episode = 0
        
        while self.learning_condition(timesteps,timestep):  # Run until solved
            state = self.env.reset()
            score = 0
            done = False
            with tf.GradientTape() as tape:
                while not done:

                    state = tf.convert_to_tensor(state)
                    state = tf.expand_dims(state, 0)

                    # Predict action probabilities and estimated future rewards
                    # from environment state
                    action, action_log_prob, critic_value = self.choose_action(state, deterministic=False)

                    self.buffer.store('critic_values',critic_value[0, 0])

                    # Sample action from action probability distribution
                    self.buffer.store('action_log_probs',action_log_prob)

                    # Apply the sampled action in our environment
                    state, reward, done, _ = self.env.step(action)
                    self.buffer.store('rewards',reward)

                    score += reward
                    timestep+=1
                # Update running reward to check condition for solving
                self.running_reward.step(score)

                # Time discounted rewards
                returns = []
                discounted_sum = 0
                
                for r in self.buffer.get('rewards')[::-1]:
                    discounted_sum = r + self.gamma * discounted_sum
                    returns.insert(0, discounted_sum)

                # Normalize
                returns = np.array(returns)
                returns = (returns - np.mean(returns)) / (np.std(returns) + self.eps)
                returns = returns.tolist()

                # Calculating loss values to update our network
                history = zip(self.buffer.get('action_log_probs'), self.buffer.get('critic_values'), returns)
                actor_losses = []
                critic_losses = []
                for log_prob, value, ret in history:
                    # At this point in history, the critic estimated that we would get a
                    # total reward = `value` in the future. We took an action with log probability
                    # of `log_prob` and ended up receiving a total reward = `ret`.
                    # The actor must be updated so that it predicts an action that leads to
                    # high rewards (compared to the critic's estimate) with high probability.
                    diff = ret - value
                    actor_losses.append(-log_prob * diff)  # actor loss

                    # The critic must be updated so that it predicts a better estimate of
                    # the future rewards.
                    critic_losses.append(
                        self.critic_loss(tf.expand_dims(value, 0), tf.expand_dims(ret, 0))
                    )

                # Backpropagation
                loss_value = sum(actor_losses) + sum(critic_losses)
                #print(loss_value)
                grads = tape.gradient(loss_value, self.model.trainable_variables)
                self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))

                # Clear the loss and reward history
                self.buffer.reset()

            # Log details
            episode += 1
            if episode % log_each_n_episodes == 0 and episode > 0:
                print('episode {}, running reward: {:.2f}'.format(episode,self.running_reward.reward))

            if self.did_finnish_learning(self,success_threshold,episode):
                break

        if plot_results:
            self.plot_learning_results()





In [None]:
from src.environments.discrete.cartpole import environment
agent_discrete = ActorCriticAgent(environment)
agent_discrete.learn(log_each_n_episodes=1)

#{'action': 0, 'action_log_prob': <tf.Tensor: shape=(), dtype=float32, numpy=-0.6900733>}

In [None]:
agent_discrete.test(render=False)

In [None]:
from src.environments.continuous.inverted_pendulum import environment

agent_continuous = ActorCriticAgent(environment)
agent_continuous.learn()

#{'action': array([0.6679693], dtype=float32), 'action_log_prob': array([1.0068668], dtype=float32)}

In [None]:
agent_continuous.test()