# Project 3: Collaboration and Competition

## Overview

This project solves the multi-agent Tennis environment using the PPO algorithm [1].

## Learning Algorithm

### Background

The project uses Proximal Policy Optimization [1], which maximizes the surrogate objective function

$$\sum_tE_{s_t \sim p_{\theta}(s_t)}[E_{a_t \sim \pi_{\theta}(a_t|s_t)}[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\gamma^tA^{\pi_{\theta}}(s_t,a_t)]]$$

which is roughly equivalent to maximizing the reinforcement learning objective [3]

$$E_{\tau \sim p_\theta(\tau)}[\sum_{t}\gamma^tr(s_t,a_t)]$$

provided that the new policy $\pi_{\theta'}$ stays close to the old policy $\pi_{\theta}$, i.e. $|\pi_{\theta'}(a|s) - \pi_{\theta}(a|s)| \le \epsilon$ for all $s$ [4].

PPO keeps the new policy close to the old one by conditionally clipping the ratio $\frac{\pi_{\theta'}(a|s)}{\pi_{\theta}(a|s)}$, which gives the clipped objective

$$\min\left(\frac{\pi_{\theta'}(a|s)}{\pi_{\theta}(a|s)}A^{\pi_{\theta}}(s,a),\ \text{clip}\left(\frac{\pi_{\theta'}(a|s)}{\pi_{\theta}(a|s)},1-\epsilon, 1+\epsilon\right)A^{\pi_{\theta}}(s,a)\right)$$

where $\epsilon$ is the clip coefficient (`CLIP_COEF` in the hyperparameters). Stochastic gradient ascent is performed on this objective with respect to the new policy's parameters $\theta'$ to improve the agent's performance.
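As a rough illustration of the clipped objective above, here is a minimal sketch in dependency-free Python. The function name `clipped_surrogate` and its list-based arguments are illustrative assumptions, not the project's actual code, which presumably operates on batched tensors.

```python
import math

def clipped_surrogate(new_log_probs, old_log_probs, advantages, clip_coef=0.2):
    """Mean PPO clipped objective over a batch (a quantity to maximize)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the new and old policy, from log-probs.
        ratio = math.exp(new_lp - old_lp)
        # Restrict the ratio to [1 - clip_coef, 1 + clip_coef].
        clipped_ratio = max(1.0 - clip_coef, min(ratio, 1.0 + clip_coef))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)

# A ratio of 1.5 with a positive advantage is capped at 1 + clip_coef,
# while the same ratio with a negative advantage is left unclipped.
print(clipped_surrogate([math.log(1.5)], [0.0], [1.0]))
print(clipped_surrogate([math.log(1.5)], [0.0], [-1.0]))
```

The `min` is what makes the clipping conditional: the objective stops rewarding a move away from the old policy once the ratio leaves the clip range, but it still penalizes moves that make the objective worse.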

### Implementation details

Orthogonal initialization of weights [5] was adopted for stable DNN training. Entropy regularization [6] was tried first because it had improved performance on the Reacher environment, but it proved catastrophic here: with sparse initial rewards, the agents kept increasing their standard deviation parameters without bound.
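One plausible way to see why an entropy bonus can behave this way, assuming a Gaussian policy with a learnable standard deviation: the entropy of a normal distribution grows with the logarithm of its standard deviation and has no upper bound. The helper name `gaussian_entropy` below is hypothetical and only meant to illustrate the formula.

```python
import math

def gaussian_entropy(std):
    """Differential entropy (in nats) of a 1-D normal distribution."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std ** 2)

# Entropy keeps increasing as the standard deviation grows.
for std in (0.5, 1.0, 2.0, 4.0):
    print(f"std={std:<4} entropy={gaussian_entropy(std):.3f}")
```

Each doubling of the standard deviation adds ln 2 nats, so the bonus never saturates. If sparse rewards leave the advantage estimates near zero, little in the objective opposes that push, which is consistent with the runaway behaviour described above.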

## Hyperparameters

```
LEARNING_RATE = 3e-4
ADAM_EPS = 1e-5
GAMMA = .99
LAMBDA = .95
UPDATE_EPOCHS = 3
N_MINIBATCHES = 10
CLIP_COEF = .2
MAX_GRAD_NORM = 5
GAE_LAMBDA = .95
V_COEF = .5
HIDDEN_LAYER_SIZE = 32
ROLLOUT_LEN = 1024
N_ROLLOUTS = 50000
ENTROPY_COEF = 0
```
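The `GAMMA` and `GAE_LAMBDA` values suggest that advantages were computed with generalized advantage estimation, although the exact implementation is not shown here. A minimal sketch under that assumption follows. The function name `compute_gae`, its arguments, and the list-based layout are illustrative only.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one rollout.

    rewards[t], values[t] and dones[t] describe step t; dones[t] is True when
    the episode ended at step t. last_value is the critic's estimate for the
    state that follows the final step of the rollout.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    next_advantage = 0.0
    # Walk the rollout backwards so each step can reuse the next step's result.
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; bootstrapping is cut off at episode boundaries.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        next_advantage = delta + gamma * lam * not_done * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    # Value targets for the critic are the advantages plus the value estimates.
    returns = [adv + val for adv, val in zip(advantages, values)]
    return advantages, returns

advantages, returns = compute_gae(
    rewards=[0.0, 0.0, 1.0],
    values=[0.0, 0.0, 0.0],
    last_value=0.0,
    dones=[False, False, True],
)
print(advantages)
```

With `lam=1` and a zero value function this reduces to the discounted return, while smaller values of `lam` lean more on the critic and reduce variance at the cost of some bias.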

## Model architecture

The model consists of two networks, an actor and a critic. Each has three fully connected layers, with hidden layers of size 32.

The actor network takes in inputs of size (n_batch, n_observations) and outputs values of size (n_batch, n_actions) with each value between -1 and 1. The values represent the means of each of the two action components and are used to sample action values from a normal distribution. The actor network uses Tanh activations for both the initial layers and the output to scale values between -1 and 1.

The critic network also takes inputs of size (n_batch, n_observations), but outputs values of size (n_batch, 1) that represent the predicted value of each observation. It uses tanh activations between layers but none on the output, since value estimates are not restricted to the range -1 to 1.
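The layer layout described above can be sketched as follows. This is a dependency-free illustration rather than the project's model code: it processes a single observation instead of a batch, uses plain uniform random weights instead of orthogonal initialization, and assumes an observation size of 24, which is not stated in this document. The names `make_mlp`, `forward`, and `action_std` are hypothetical.

```python
import math
import random

N_OBSERVATIONS = 24  # assumption: per-agent observation size, not stated above
N_ACTIONS = 2        # the two action components
HIDDEN = 32          # HIDDEN_LAYER_SIZE

def make_mlp(sizes, rng):
    """Build a list of (weights, biases) layers with small random weights."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        bound = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers

def forward(layers, x, tanh_on_output):
    """Run one observation through the MLP, with tanh between layers."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        # Hidden layers always use tanh; the output layer only if requested.
        if i < len(layers) - 1 or tanh_on_output:
            x = [math.tanh(v) for v in x]
    return x

rng = random.Random(0)
actor = make_mlp([N_OBSERVATIONS, HIDDEN, HIDDEN, N_ACTIONS], rng)
critic = make_mlp([N_OBSERVATIONS, HIDDEN, HIDDEN, 1], rng)

observation = [rng.uniform(-1.0, 1.0) for _ in range(N_OBSERVATIONS)]
action_means = forward(actor, observation, tanh_on_output=True)   # in [-1, 1]
state_value = forward(critic, observation, tanh_on_output=False)  # unbounded

# Sample each action component from a normal distribution around its mean.
action_std = 0.5  # fixed, hypothetical value used only for this sketch
action = [rng.gauss(mean, action_std) for mean in action_means]
print(action_means, state_value, action)
```

The only structural difference between the two networks is the output layer: the actor squashes its two outputs with tanh so they can serve as action means, while the critic emits a single unsquashed value.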

## Learning curve

The agents reached an average score of 0.5 (over 100 episodes) after 45,217 total episodes. Because training remained stable after this point, I continued training until the average score reached 1.0.

![Learning curve](learning_curve.png?1)
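For reference, the 100-episode average used as the success criterion can be computed along these lines. This is a minimal sketch: the function name `first_solved_episode` is hypothetical, and `scores` is assumed to be a list with one score per episode, however the project defines that per-episode score.

```python
from collections import deque

def first_solved_episode(scores, target=0.5, window=100):
    """Return the 1-based episode at which the mean of the last `window`
    scores first reaches `target`, or None if it never does."""
    recent = deque(maxlen=window)  # keeps only the most recent scores
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None

# Toy data: 100 episodes scoring 0.0 followed by 100 episodes scoring 1.0.
toy_scores = [0.0] * 100 + [1.0] * 100
print(first_solved_episode(toy_scores))
```

The average is only checked once a full window of episodes is available, so the earliest possible answer is episode 100.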

## Future work

- Different network architectures and network sizes can be experimented with to find the best fit for the current problem. Batch normalization may also help stabilize learning.
- Experimenting with a small nonzero entropy coefficient may improve training performance.
- Other training-related hyperparameters (update epochs, number of minibatches, rollout length, etc.) can be tuned until a good combination for this environment is found.

## References

- [1] https://arxiv.org/pdf/1707.06347.pdf Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
- [3] https://www.youtube.com/watch?v=ySenCHPsKJU&list=PL_iWQOsE6TfX7MaC6C3HcdOf1g337dlC9&index=38&ab_channel=RAIL Berkeley CS285 Lecture 9, Part 1
- [4] https://www.youtube.com/watch?v=ySenCHPsKJU&list=PL_iWQOsE6TfX7MaC6C3HcdOf1g337dlC9&index=38&ab_channel=RAIL Berkeley CS285 Lecture 9, Part 2
- [5] https://openreview.net/forum?id=r1etN1rtPB Engstrom, Logan, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. "Implementation matters in deep RL: A case study on PPO and TRPO." In International Conference on Learning Representations. 2019.
- [6] http://proceedings.mlr.press/v48/mniha16.html Mnih, Volodymyr, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. "Asynchronous methods for deep reinforcement learning." In International conference on machine learning, pp. 1928-1937. PMLR, 2016.