# Multi-Agent Collaboration/Competition Project Report

This project builds on the DDPG Pendulum solution from Udacity's `deep-reinforcement-learning` GitHub repo and extends it to use multiple agents:
* [DDPG Pendulum](https://github.com/udacity/deep-reinforcement-learning/tree/master/ddpg-pendulum)

While the pendulum solution interacted with an OpenAI Gym environment, this project replaced that environment with Unity's Tennis environment.

# Learning Algorithm

A PyTorch implementation of Deep Deterministic Policy Gradient (DDPG), an actor/critic method, was trained to solve the environment with two agents.

In PyTorch, a network's layers are defined in `__init__` and the forward pass is defined in `forward`, which is invoked automatically when the module instance is called.

Two networks were defined, namely an Actor and a Critic.

The Actor maps a state to a deterministic action. It has two fully connected hidden layers of 128 nodes each. The output activation function is tanh because the action values range from -1 to 1.

The Critic takes a state together with the Actor's action and estimates the action-value function, which is used during training. It has two fully connected hidden layers of 128 nodes each. The output activation function is ReLU because the output values need to be positive.

My original implementation was unstable and ineffective early in training and eventually converged in 1551 episodes. I then added Batch Normalization, a method that makes neural network training faster and more stable by re-centering and re-scaling each layer's inputs. With Batch Normalization applied after the layer activations, the environment was solved in just 700 episodes.
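As a rough sketch of the two networks described above (not the project's actual code), the following assumes ReLU hidden activations, batch normalization after each hidden activation, and the action concatenated into the Critic's second layer as in the ddpg-pendulum baseline. The parameter names and defaults are illustrative; `state_size=8` and `action_size=2` follow the figures given in this report.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Maps a batch of states to deterministic actions in [-1, 1]."""

    def __init__(self, state_size=8, action_size=2, hidden=128):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden)
        self.bn1 = nn.BatchNorm1d(hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.bn2 = nn.BatchNorm1d(hidden)
        self.out = nn.Linear(hidden, action_size)

    def forward(self, state):
        # Batch norm is applied after each hidden activation.
        x = self.bn1(F.relu(self.fc1(state)))
        x = self.bn2(F.relu(self.fc2(x)))
        return torch.tanh(self.out(x))


class Critic(nn.Module):
    """Maps (state, action) batches to one action-value estimate each."""

    def __init__(self, state_size=8, action_size=2, hidden=128):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden)
        self.bn1 = nn.BatchNorm1d(hidden)
        # Assumption: the action joins the network at the second layer.
        self.fc2 = nn.Linear(hidden + action_size, hidden)
        self.bn2 = nn.BatchNorm1d(hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, state, action):
        x = self.bn1(F.relu(self.fc1(state)))
        x = torch.cat((x, action), dim=1)
        x = self.bn2(F.relu(self.fc2(x)))
        # ReLU output keeps the estimate non-negative, as the report states.
        return F.relu(self.out(x))
```

Calling `actor(state)` runs `forward` automatically. Note that `nn.BatchNorm1d` needs more than one sample per batch in training mode, so a network like this would be switched to `eval()` when choosing an action for a single state.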

The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. Each agent receives its own local observation. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.

An agent class is defined that interacts with the environment and updates both networks. This environment includes 2 agents.

Noise was added to each agent's action values to balance exploration against exploitation. A discount factor of 0.9 was used, and both networks used a learning rate of 1e-3.
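The report does not say which noise process was used. The ddpg-pendulum baseline uses an Ornstein-Uhlenbeck process, so the sketch below assumes something similar; the class, the `noisy_action` helper, and the `theta` and `sigma` values are hypothetical and not taken from the project code.

```python
import random


class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated, mean-reverting noise."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = [mu] * size
        self.theta = theta      # pull strength toward the mean
        self.sigma = sigma      # scale of the random perturbation
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Return the internal state to the mean (e.g. at episode start)."""
        self.state = list(self.mu)

    def sample(self):
        """Advance the process one step and return the new noise vector."""
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)


def noisy_action(action, noise):
    """Add exploration noise, then clip to the valid action range [-1, 1]."""
    return [max(-1.0, min(1.0, a + n)) for a, n in zip(action, noise.sample())]
```

Clipping after adding noise keeps the perturbed action inside the same [-1, 1] range that the Actor's tanh output produces.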

Experiences from both agents are shared and added to the replay memory as actions are taken. At every time step, once the memory holds at least 256 samples, a batch of experiences is sampled from it and used to update the networks. A soft update strategy (controlled by the hyperparameter TAU) slowly blends the weights of the regular network (the one being trained) into the target network (the one used for prediction, to stabilize training).
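The shared replay memory and the soft update can be sketched in plain Python as follows. This is a minimal illustration rather than the project's code: the buffer capacity, the TAU value, and the use of 256 as both the readiness threshold and the batch size are assumptions, and weights are shown as flat lists of floats instead of network parameters.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size memory shared by both agents."""

    def __init__(self, buffer_size=100_000, batch_size=256, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest entries drop off
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def ready(self):
        """True once enough experiences exist to draw a full batch."""
        return len(self.memory) >= self.batch_size

    def sample(self):
        """Draw a batch of distinct experiences uniformly at random."""
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


def soft_update(local_weights, target_weights, tau=1e-3):
    """Blend weights: target <- tau * local + (1 - tau) * target."""
    return [
        tau * w_local + (1.0 - tau) * w_target
        for w_local, w_target in zip(local_weights, target_weights)
    ]
```

With a small `tau`, the target weights move only a small fraction of the way toward the trained weights at each step, which is what keeps the prediction targets stable.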

# Plot of Rewards

## Without Batch Normalization
![Scores Plot](./Scores_Per_Episode_No_Batch_Norm.png)

## With Batch Normalization
![Scores Plot](./Scores_Per_Episode.png)

The task is episodic. To solve the environment, the agents must achieve an average score of +0.5 over 100 consecutive episodes, after taking the maximum over both agents.

- After each episode, we add up the rewards that each agent received (without discounting), to get a score for each agent. This yields 2 (potentially different) scores. We then take the maximum of these 2 scores.
- This yields a single **score** for each episode.

The environment is considered solved when the average of those scores over 100 episodes is at least +0.5. It was solved in 1551 episodes without Batch Normalization and in 700 episodes with Batch Normalization.
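The scoring rule above can be expressed in a few lines. The function names here are hypothetical and the sketch only illustrates the rule; it is not the code used to produce the plots.

```python
from collections import deque


def episode_score(agent_rewards):
    """Sum each agent's undiscounted rewards and keep the larger total.

    agent_rewards holds one list of per-step rewards for each agent.
    """
    return max(sum(rewards) for rewards in agent_rewards)


def first_solved_episode(scores, target=0.5, window=100):
    """Return the 1-based episode at which the mean of the last `window`
    scores first reaches `target`, or None if that never happens."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None
```

For example, `episode_score([[0.1, 0.0], [0.0, -0.01]])` compares the two agents' totals and returns the higher one.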

# Ideas For Future Work

Future work could explore other methods that combine value-based and policy-based learning, such as A2C or A3C, for this continuous control task. Hyperparameter sweeps could also improve the stability and speed of training the standard DDPG used here.