# Report on Project 3: Tennis collaboration and competition

### Implementing Multi-Agent Deep Deterministic Policy Gradient (MADDPG)

This report describes the method used to train two agents to play tennis in a Unity environment.

## Inputs

- state space: 24 (Continuous)
- action space: 2 (Continuous)
- number of agents: 2
- reward structure: 
    - Agent hits the ball over the net --> +0.1. 
    - Agent lets a ball hit the ground or hits the ball out of bounds --> -0.01.

## Method

- The inputs are given to an agent, which follows a policy to carry out the task.
- Because the state space is continuous, a tabular Q function is not feasible. Hence, neural networks are used for function approximation.
- Because the action space is also continuous, methods such as DQN cannot be applied directly.
- A Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm (i.e. an Actor-Critic method) was used.
- The structure of the Actor for each agent is as follows:
    - F1 = ReLU (input_state (states = 24) x 256 neurons)
    - F2 = ReLU (F1 x 256 neurons)
    - F3 = ReLU (F2 x output_state (actions = 2))
- The structure of the Critic for each agent is as follows:
    - F1 = ReLU (input_state (states = 24) x 256 neurons)
    - F2 = ReLU (F1 x 256 neurons)
    - F3 = ReLU (F2 x 1)
- For each agent, the actor and the critic each use two neural networks of the same architecture: a local network (θ_local) and a target network (θ_target).
- The target network is soft-updated from the local network: θ_target = τ*θ_local + (1 - τ)*θ_target.
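
The soft update above can be illustrated with a minimal sketch. It uses plain Python lists rather than framework tensors, and the function name `soft_update` and the dict-of-lists parameter layout are assumptions made for illustration, not the project's actual code.

```python
TAU = 1e-3  # soft-update rate, taken from the Hyperparameters section


def soft_update(local_params, target_params, tau=TAU):
    """Blend local parameters into the target parameters in place.

    Both arguments are hypothetical dicts mapping a parameter name to a
    flat list of floats; a real implementation would loop over the
    tensors of the two networks instead.
    """
    for name, local_values in local_params.items():
        target_values = target_params[name]
        for i, local_value in enumerate(local_values):
            # theta_target = tau * theta_local + (1 - tau) * theta_target
            target_values[i] = tau * local_value + (1.0 - tau) * target_values[i]
    return target_params


# Example: one update step moves the target a small fraction toward the local values.
local = {"fc1.weight": [1.0, 2.0]}
target = {"fc1.weight": [0.0, 0.0]}
soft_update(local, target)
```

With a small τ such as 1e-3, the target network changes slowly, which tends to keep the learning targets stable between updates.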

## Hyperparameters

- BUFFER_SIZE = 1e5      # replay buffer size
- BATCH_SIZE = 128       # minibatch size
- GAMMA = 0.99           # discount factor
- TAU = 1e-3             # for soft update of target parameters
- LR_ACTOR = 1e-4        # Actor Learning Rate
- LR_CRITIC = 5e-4       # Critic Learning Rate
- maximum number of timesteps per episode = 1000
- WEIGHT_DECAY = 0       # L2 weight decay
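
To show how `BUFFER_SIZE` and `BATCH_SIZE` typically interact, here is a minimal sketch of a uniform replay buffer. The class name `ReplayBuffer`, the `Experience` tuple, and the `seed` argument are assumptions for illustration; the project's own buffer may differ.

```python
import random
from collections import deque, namedtuple

BUFFER_SIZE = int(1e5)  # replay buffer size (from the list above)
BATCH_SIZE = 128        # minibatch size (from the list above)

# Hypothetical container for a single transition.
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size buffer that stores transitions and samples them uniformly."""

    def __init__(self, buffer_size=BUFFER_SIZE, batch_size=BATCH_SIZE, seed=0):
        # A deque with maxlen drops the oldest transition once the buffer is full.
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniform sampling without replacement within one minibatch.
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)
```

In a training loop, learning would usually start only once `len(buffer) >= BATCH_SIZE`, so that a full minibatch can be drawn.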

## Rewards plot
* Number of episodes needed to solve the environment = 1555

* The plot of the average rewards received is shown below:

![Plot of average rewards](./plot_p3.png "plot")

The agents receive higher rewards as their experience (i.e. the number of episodes) increases.
Some episodes score as high as 2.5, while others still score 0.
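
The episode count above depends on how "solved" is defined. The sketch below assumes the criterion commonly used for this Unity Tennis task: an average score of at least 0.5 over 100 consecutive episodes, where each episode's score is the higher of the two agents' scores. The window, the threshold, and the function name `first_solved_episode` are assumptions, since this report does not state them.

```python
from collections import deque


def first_solved_episode(episode_scores, window=100, threshold=0.5):
    """Return the 1-based episode at which the moving average first reaches
    the threshold, or None if it never does.

    episode_scores is assumed to hold one score per episode (for example,
    the maximum over both agents).
    """
    recent = deque(maxlen=window)  # scores of the most recent episodes
    for episode, score in enumerate(episode_scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= threshold:
            return episode
    return None
```

Because the check averages over a window, individual episodes with a score of 0 can still occur after the environment counts as solved.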



## Future ideas for improving the agents' performance
- Implement a different neural network architecture for the actor and critic.
- Implement prioritized experience replay so that more informative transitions are sampled more often. Optimizing the sampling could improve the results.
- Implement Proximal Policy Optimization (PPO), which might yield better results because the state space is not very high-dimensional.