# Actor Critic Method

**Author:** [Apoorv Nandan](https://twitter.com/NandanApoorv)<br>
**Date created:** 2020/05/13<br>
**Last modified:** 2020/05/13<br>
**Description:** Implement Actor Critic Method in CartPole environment.

## Introduction

This script shows an implementation of Actor Critic method on CartPole-V0 environment.

### Actor Critic Method

As an agent takes actions and moves through an environment, it learns to map
the observed state of the environment to two possible outputs:

1. Recommended action: A probability value for each action in the action space.
   The part of the agent responsible for this output is called the **actor**.
2. Estimated rewards in the future: Sum of all rewards it expects to receive in the
   future. The part of the agent responsible for this output is the **critic**.

Agent and Critic learn to perform their tasks, such that the recommended actions
from the actor maximize the rewards.

### CartPole-V0

A pole is attached to a cart placed on a frictionless track. The agent has to apply
force to move the cart. It is rewarded for every time step the pole
remains upright. The agent, therefore, must learn to keep the pole from falling over.

### References

- [CartPole](http://www.derongliu.org/adp/adp-cdrom/Barto1983.pdf)
- [Actor Critic Method](https://hal.inria.fr/hal-00840470/document)


## Setup


In [1]:
import gymnasium as gym
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Configuration parameters for the whole setup
seed = 42
gamma = 0.99  # Discount factor for past rewards
max_steps_per_episode = 10000
env = gym.make("CartPole-v1")  # Create the environment

# env.seed(seed)
env.reset(seed=seed)
env.action_space.seed(seed)
np.random.seed(seed)
tf.random.set_seed(seed)

eps = np.finfo(np.float32).eps.item()  # Smallest number such that 1.0 + eps != 1.0


## Implement Actor Critic network

This network learns two functions:

1. Actor: This takes as input the state of our environment and returns a
probability value for each action in its action space.
2. Critic: This takes as input the state of our environment and returns
an estimate of total rewards in the future.

In our implementation, they share the initial layer.


In [2]:
num_inputs = 4
num_actions = 2
num_hidden = 128

inputs = layers.Input(shape=(num_inputs,))
common = layers.Dense(num_hidden, activation="relu")(inputs)
action = layers.Dense(num_actions, activation="softmax")(common)
critic = layers.Dense(1)(common)

model = keras.Model(inputs=inputs, outputs=[action, critic])


## Train


In [3]:
optimizer = keras.optimizers.Adam(learning_rate=0.001)
huber_loss = keras.losses.Huber()
# action_probs_history = []
# critic_value_history = []
# rewards_history = []
running_reward = 0
episode_count = 0

while True:  # Run until solved
    action_probs_history = []
    critic_value_history = []
    rewards_history = []

    # Corrected state initialization for gymnasium
    state, _ = env.reset()
    episode_reward = 0

    with tf.GradientTape() as tape:
        for timestep in range(1, max_steps_per_episode):
            # env.render(); # Adding this line would show the attempts

            # Simplified and corrected state conversion
            state_tensor = tf.convert_to_tensor(state[None, :], dtype=tf.float32)

            # Predict action probabilities and estimated future rewards
            action_probs, critic_value = model(state_tensor, training=True) # training=True is fine

            critic_value_history.append(critic_value[0, 0])

            # Sample action from action probability distribution
            # Ensure action_probs is squeezed correctly for np.random.choice
            action = np.random.choice(num_actions, p=np.squeeze(action_probs))

            # CRITICAL: Clip action probabilities to prevent log(0)
            action_probs_history.append(tf.math.log(tf.clip_by_value(action_probs[0, action], 1e-8, 1.0)))

            # Apply the sampled action in our environment
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            rewards_history.append(reward)
            episode_reward += reward

            # Update state for next iteration
            state = next_state

            if done:
                break

        # Update running reward to check condition for solving
        running_reward = 0.05 * episode_reward + (1 - 0.05) * running_reward

        # Calculate expected value from rewards
        returns = []
        discounted_sum = 0
        for r in rewards_history[::-1]:
            discounted_sum = r + gamma * discounted_sum
            returns.insert(0, discounted_sum)

        # Normalize returns
        returns = np.array(returns)
        returns = (returns - np.mean(returns)) / (np.std(returns) + eps)
        returns = returns.tolist()

        # Calculating loss values to update our network
        actor_losses = []
        critic_losses = []
        for log_prob, value, ret in zip(action_probs_history, critic_value_history, returns):
            diff = ret - value
            actor_losses.append(-log_prob * diff)  # actor loss
            critic_losses.append(huber_loss(tf.expand_dims(value, 0), tf.expand_dims(ret, 0)))

        # Backpropagation
        # Sum actor and critic losses
        total_loss = tf.reduce_sum(actor_losses) + tf.reduce_sum(critic_losses)

        # Add gradient clipping for stability
        grads = tape.gradient(total_loss, model.trainable_variables)
        grads, _ = tf.clip_by_global_norm(grads, 5.0) # Clip gradients by global norm
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

    # Log details
    episode_count += 1
    if episode_count % 10 == 0:
        template = "running reward: {:.2f} at episode {}"
        print(template.format(running_reward, episode_count))
    # if episode_count % 50 == 0:
    #     print(f"Episode {episode_count} | Reward: {episode_reward} | Action probs: {action_probs.numpy()} | Value: {critic_value.numpy()}")

    if running_reward > 195:  # Condition to consider the task solved
        print("Solved at episode {}!".format(episode_count))
        break


running reward: 9.55 at episode 10
running reward: 15.87 at episode 20
running reward: 19.15 at episode 30
running reward: 20.20 at episode 40
running reward: 20.47 at episode 50
running reward: 20.63 at episode 60
running reward: 20.22 at episode 70
running reward: 21.37 at episode 80
running reward: 22.30 at episode 90
running reward: 24.03 at episode 100
running reward: 24.25 at episode 110
running reward: 23.38 at episode 120
running reward: 20.28 at episode 130
running reward: 19.85 at episode 140
running reward: 23.19 at episode 150
running reward: 21.46 at episode 160
running reward: 21.78 at episode 170
running reward: 20.52 at episode 180
running reward: 24.66 at episode 190
running reward: 25.58 at episode 200
running reward: 24.94 at episode 210
running reward: 30.03 at episode 220
running reward: 35.45 at episode 230
running reward: 40.28 at episode 240
running reward: 42.18 at episode 250
running reward: 41.46 at episode 260
running reward: 38.36 at episode 270
running rew

## Visualizations
In early stages of training:
![Imgur](https://i.imgur.com/5gCs5kH.gif)

In later stages of training:
![Imgur](https://i.imgur.com/5ziiZUD.gif)


# Reduce Neurons 128 -> 20

In [6]:
num_inputs = 4
num_actions = 2
# num_hidden = 128
num_hidden = 20

inputs = layers.Input(shape=(num_inputs,))
common = layers.Dense(num_hidden, activation="relu")(inputs)
action = layers.Dense(num_actions, activation="softmax")(common)
critic = layers.Dense(1)(common)

optimizer = keras.optimizers.Adam(learning_rate=0.001)
huber_loss = keras.losses.Huber()
# action_probs_history = []
# critic_value_history = []
# rewards_history = []
running_reward = 0
episode_count = 0


while True:  # Run until solved
    action_probs_history = []
    critic_value_history = []
    rewards_history = []

    # Corrected state initialization for gymnasium
    state, _ = env.reset()
    episode_reward = 0

    with tf.GradientTape() as tape:
        for timestep in range(1, max_steps_per_episode):
            # env.render(); # Adding this line would show the attempts

            # Simplified and corrected state conversion
            state_tensor = tf.convert_to_tensor(state[None, :], dtype=tf.float32)

            # Predict action probabilities and estimated future rewards
            action_probs, critic_value = model(state_tensor, training=True) # training=True is fine

            critic_value_history.append(critic_value[0, 0])

            # Sample action from action probability distribution
            # Ensure action_probs is squeezed correctly for np.random.choice
            action = np.random.choice(num_actions, p=np.squeeze(action_probs))

            # CRITICAL: Clip action probabilities to prevent log(0)
            action_probs_history.append(tf.math.log(tf.clip_by_value(action_probs[0, action], 1e-8, 1.0)))

            # Apply the sampled action in our environment
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            rewards_history.append(reward)
            episode_reward += reward

            # Update state for next iteration
            state = next_state

            if done:
                break

        # Update running reward to check condition for solving
        running_reward = 0.05 * episode_reward + (1 - 0.05) * running_reward

        # Calculate expected value from rewards
        returns = []
        discounted_sum = 0
        for r in rewards_history[::-1]:
            discounted_sum = r + gamma * discounted_sum
            returns.insert(0, discounted_sum)

        # Normalize returns
        returns = np.array(returns)
        returns = (returns - np.mean(returns)) / (np.std(returns) + eps)
        returns = returns.tolist()

        # Calculating loss values to update our network
        actor_losses = []
        critic_losses = []
        for log_prob, value, ret in zip(action_probs_history, critic_value_history, returns):
            diff = ret - value
            actor_losses.append(-log_prob * diff)  # actor loss
            critic_losses.append(huber_loss(tf.expand_dims(value, 0), tf.expand_dims(ret, 0)))

        # Backpropagation
        # Sum actor and critic losses
        total_loss = tf.reduce_sum(actor_losses) + tf.reduce_sum(critic_losses)

        # Add gradient clipping for stability
        grads = tape.gradient(total_loss, model.trainable_variables)
        grads, _ = tf.clip_by_global_norm(grads, 5.0) # Clip gradients by global norm
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

    # Log details
    episode_count += 1
    if episode_count % 10 == 0:
        template = "running reward: {:.2f} at episode {}"
        print(template.format(running_reward, episode_count))
    # if episode_count % 50 == 0:
    #     print(f"Episode {episode_count} | Reward: {episode_reward} | Action probs: {action_probs.numpy()} | Value: {critic_value.numpy()}")

    if running_reward > 195:  # Condition to consider the task solved
        print("Solved at episode {}!".format(episode_count))
        break


running reward: 72.67 at episode 10
running reward: 131.25 at episode 20
running reward: 181.07 at episode 30
running reward: 174.21 at episode 40
Solved at episode 49!
