# Project 2: Continuous Control

## Overview

The project solves the multi-agent Reacher environment using the Proximal Policy Optimization (PPO) algorithm [1].

The PPO code draws heavily from Costa Huang's continuous action PPO implementation for the Gymnasium Half-Cheetah Environment [2].

My original work on the PPO code includes the following:
- Modified to support the `unityagents` multi-agent Reacher environment
- Hyperparameter tuning
- Updated network architecture
- Removed non-essential code
- Simplified score recording
- Added model checkpointing

## Learning Algorithm

### Background

The project uses Proximal Policy Optimization [1], which maximizes the surrogate objective function
$$\sum_tE_{s_t \sim p_{\theta}(s_t)}[E_{a_t \sim \pi_{\theta}(a_t|s_t)}[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\gamma^tA^{\pi_{\theta}}(s_t,a_t)]]$$
which is roughly equivalent to maximizing the reinforcement learning objective [3]
$$E_{\tau \sim p_\theta(\tau)}[\sum_{t}\gamma^tr(s_t,a_t)]$$
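provided that the new policy $\pi_{\theta'}$ stays close to the old policy $\pi_{\theta}$, for example $|\pi_{\theta'}(a|s) - \pi_{\theta}(a|s)| \le \epsilon$ for all states $s$ [4]. Here $A^{\pi_{\theta}}$ is the advantage function of the old policy and $\gamma$ is the discount factor.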

PPO keeps the new policy close to the old one by clipping the probability ratio $\frac{\pi_{\theta'}(a|s)}{\pi_{\theta}(a|s)}$ whenever moving it further from 1 would increase the objective:
$$\min\left(\frac{\pi_{\theta'}(a|s)}{\pi_{\theta}(a|s)}A^{\pi_{\theta}}(s,a),\ \mathrm{clip}\left(\frac{\pi_{\theta'}(a|s)}{\pi_{\theta}(a|s)},1-\epsilon, 1+\epsilon\right)A^{\pi_{\theta}}(s,a)\right)$$

Stochastic gradient ascent (here with the Adam optimizer) is performed on the above objective with respect to the new policy's parameters $\theta'$ to maximize the agent's performance.
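
As a rough illustration of the clipped objective above, the following minimal sketch evaluates it for a single state-action pair. The function name `clipped_surrogate` and its arguments are hypothetical and are not taken from the project code; it assumes log probabilities are available for both policies.

```python
import math

def clipped_surrogate(new_logp, old_logp, advantage, clip_coef=0.2):
    """Per-sample PPO clipped objective (a quantity to be maximized)."""
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log probabilities.
    ratio = math.exp(new_logp - old_logp)
    # Restrict the ratio to the interval [1 - clip_coef, 1 + clip_coef].
    clipped_ratio = max(1.0 - clip_coef, min(ratio, 1.0 + clip_coef))
    # Taking the minimum makes the objective a pessimistic bound.
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage, ratio 1.5: the gain is capped at (1 + 0.2) * advantage.
capped = clipped_surrogate(math.log(1.5), 0.0, advantage=2.0)
# Negative advantage, ratio 1.5: the unclipped (more pessimistic) term is kept.
uncapped = clipped_surrogate(math.log(1.5), 0.0, advantage=-2.0)
```

In a training loop this value would be averaged over a mini-batch and its negative minimized by the optimizer.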

### Implementation details

This particular PPO implementation [2] contains extensions that improve the performance of the vanilla PPO algorithm:
- Orthogonal initialization of weights [5]
- Generalized Advantage Estimation [6]
- Mini-batch updates (the noisier gradient estimates help avoid settling into poor local optima)
- Global gradient clipping [6]
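
Of the extensions listed above, Generalized Advantage Estimation is the least obvious from its name alone. The sketch below shows one common way to compute it for a single agent's rollout. The function name, argument names, and the convention that `dones[t]` marks an episode ending after step `t` are assumptions made for illustration; the project code may index terminations differently.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one rollout of a single agent.

    rewards[t], values[t]: reward received and critic estimate at step t.
    dones[t]: True if the episode ended after step t (assumed convention).
    last_value: critic estimate for the state following the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    # Work backwards so each step can reuse the accumulated advantage.
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of future TD errors.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    # Value targets for the critic are advantages plus the baseline values.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

With `lam=0` this reduces to the one-step TD error, and with `lam=1` it becomes the discounted return minus the value baseline.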

## Hyperparameters

```
LEARNING_RATE = 3e-4
ADAM_EPS = 1e-5
TOTAL_TIMESTEPS = 4000000
GAMMA = .99
LAMBDA = .95
UPDATE_EPOCHS = 10
N_MINIBATCHES = 32
CLIP_COEF = .2
MAX_GRAD_NORM = 5
GAE_LAMBDA = .95
V_COEF = .5
ENT_COEF = .01
HIDDEN_LAYER_SIZE = 512
ANNEAL_LR = False
ROLLOUT_LEN = 2048
```
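
To make `MAX_GRAD_NORM` concrete, here is a minimal pure-Python sketch of clipping by global norm: all gradients are rescaled by one shared factor so that their joint L2 norm does not exceed the limit. The function name and the small `eps` term are assumptions for illustration; in PyTorch code this step is typically handled by `torch.nn.utils.clip_grad_norm_`.

```python
import math

def clip_grad_global_norm(grads, max_norm=5.0, eps=1e-6):
    """Rescale gradients so their joint L2 norm is at most max_norm.

    grads: list of flattened gradients, one inner list per parameter tensor.
    Returns (clipped_grads, total_norm_before_clipping).
    """
    # Norm over every gradient entry of every parameter, not per tensor.
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    # Never scale up; only shrink when the norm exceeds the limit.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm

# Two parameter tensors with joint norm 10 are shrunk to joint norm of about 5.
clipped, norm_before = clip_grad_global_norm([[6.0], [8.0]], max_norm=5.0)
```

Because a single scale factor is applied everywhere, the direction of the overall update is preserved and only its length changes.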

Differences from the original PPO implementation [2]:
- Total timesteps increased: 2e6 -> 4e6
- Max grad norm increased: .5 -> 5
- Entropy coefficient increased: 0 -> .01
- Hidden layer size increased: 64 -> 512
- Learning rate annealing disabled (it had no positive effect on performance)
- Batch size increased: 2048 -> 2048 * 20 (2048 rollout steps for each of the 20 agents)
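
The sizes implied by these settings can be worked out as follows. This is a sketch under two assumptions: that the environment runs 20 agents in parallel (read off the `2048 * 20` batch size above), and that `TOTAL_TIMESTEPS` counts transitions summed over all agents, as in the reference implementation [2].

```python
ROLLOUT_LEN = 2048
N_AGENTS = 20            # assumption, inferred from the "2048 * 20" batch size
N_MINIBATCHES = 32
UPDATE_EPOCHS = 10
TOTAL_TIMESTEPS = 4_000_000

# Transitions collected before each policy update.
batch_size = ROLLOUT_LEN * N_AGENTS                      # 40960
# Transitions used for each gradient step.
minibatch_size = batch_size // N_MINIBATCHES             # 1280
# Number of collect-then-update cycles over the whole run.
n_updates = TOTAL_TIMESTEPS // batch_size                # 97
# Gradient steps taken on each collected batch.
grad_steps_per_update = UPDATE_EPOCHS * N_MINIBATCHES    # 320
```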

## Model architecture

The model consists of two networks, an actor and a critic, each with three fully connected layers and hidden layers of size 512.

The actor network takes inputs of shape (n_batch, n_observations) and outputs a tensor of shape (n_batch, n_actions) with each value between -1 and 1. These values are the means of the four action components and parameterize the normal distributions from which actions are sampled. The actor uses ReLU activations on its hidden layers, chosen for learning stability, and a tanh output activation to bound the means between -1 and 1.

The critic network also takes inputs of shape (n_batch, n_observations), but outputs a tensor of shape (n_batch, 1) holding the predicted value of each observation. The critic also uses ReLU activations between layers and has no final tanh, since value estimates are not bounded to [-1, 1].

For convenience in the training code, the network module provides a helper function `get_action_and_value` that returns the sampled actions, their log probabilities, the entropies of the action distributions, and the critic's predicted values.
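
The quantities returned for the actor can be illustrated without any deep learning library. The sketch below samples one action from independent normal distributions around the actor's means and computes the matching log probability and entropy. The function name `sample_action` and the single fixed `std` are assumptions for illustration; the reference implementation [2] learns the standard deviation as a parameter, and the project's helper operates on batched tensors.

```python
import math
import random

def sample_action(means, std, rng=random):
    """Sample from independent Normal(mean_i, std) per action component.

    Returns (action, log_prob, entropy); log_prob and entropy are summed
    over the action components, which treats the components as independent.
    """
    action = [rng.gauss(m, std) for m in means]
    # Log density of a normal distribution, summed over components.
    log_prob = sum(
        -((a - m) ** 2) / (2 * std ** 2) - math.log(std) - 0.5 * math.log(2 * math.pi)
        for a, m in zip(action, means)
    )
    # Entropy of a normal distribution depends only on its standard deviation.
    entropy = len(means) * (0.5 + 0.5 * math.log(2 * math.pi) + math.log(std))
    return action, log_prob, entropy

# Four action components, matching the actor's four output means.
action, log_prob, entropy = sample_action([0.0, 0.1, -0.2, 0.3], std=1.0, rng=random.Random(0))
```

The log probability feeds the probability ratio in the PPO objective, while the entropy is weighted by `ENT_COEF` as an exploration bonus.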

## Learning curve

The agent achieved an average score of 30 in fewer than 1500 total episodes.

![Learning curve](learning_curve.png)

## Future work

- Experiment with different network architectures and sizes to find the best fit for this problem. Batch normalization may also help stabilize learning.
- The agents' scores peak just below 38. Annealing the entropy coefficient to 0 could reduce exploration in the later stages of learning and thereby raise peak performance.
- Observation and reward normalization may also help stabilize training.
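
As one possible shape for the entropy annealing idea above, the following sketch linearly decays the coefficient from its current value of 0.01 to 0 over the course of training. The function name and the linear schedule are assumptions; this has not been tried in the project.

```python
def annealed_ent_coef(update, n_updates, start=0.01, end=0.0):
    """Linearly interpolate the entropy coefficient from start to end.

    update: zero-based index of the current policy update.
    n_updates: total number of policy updates in the run.
    """
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(update / max(n_updates - 1, 1), 0.0), 1.0)
    return start + frac * (end - start)

# Coefficient at the first, middle, and last of 97 updates.
schedule = [annealed_ent_coef(u, 97) for u in (0, 48, 96)]
```

The returned value would replace the constant `ENT_COEF` when the loss is assembled for each update.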

## References

- [1] https://arxiv.org/pdf/1707.06347.pdf Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
- [2] https://github.com/vwxyzjn/ppo-implementation-details/blob/main/ppo_continuous_action.py
- [3] https://www.youtube.com/watch?v=ySenCHPsKJU&list=PL_iWQOsE6TfX7MaC6C3HcdOf1g337dlC9&index=38&ab_channel=RAIL Berkeley CS285 Lecture 9, Part 1
- [4] https://www.youtube.com/watch?v=ySenCHPsKJU&list=PL_iWQOsE6TfX7MaC6C3HcdOf1g337dlC9&index=38&ab_channel=RAIL Berkeley CS285 Lecture 9, Part 2
- [5] https://openreview.net/forum?id=r1etN1rtPB Engstrom, Logan, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. "Implementation matters in deep RL: A case study on PPO and TRPO." In International Conference on Learning Representations. 2019.
- [6] https://openreview.net/forum?id=nIAxjsniDzg Andrychowicz, Marcin, Anton Raichuk, Piotr Stańczyk, Manu Orsini, Sertan Girgin, Raphaël Marinier, Leonard Hussenot et al. "What matters for on-policy deep actor-critic methods? A large-scale study." In International Conference on Learning Representations. 2021.