# Report on Project 3: Collaboration and Competition
This report describes the method used to train two agents to play tennis in a Unity environment.

## Inputs

- state space: 24 (Continuous)
- action space: 2 (Continuous)
- number of agents: 2
- reward structure: 
    - Agent hits the ball over the net --> +0.1. 
    - Agent lets a ball hit the ground or hits the ball out of bounds --> -0.01.

## Method

- Each agent receives the inputs above and follows a policy to choose its actions.
- Because the state space is continuous, a Q-table is not feasible. Hence, neural networks are used for function approximation.
- Because the action space is also continuous, value-based methods such as DQN cannot be applied directly.
- Moreover, two agents learn the same task here. To let both agents benefit from each other's experience, the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm, an actor-critic method, was used.
    - The agents share the same memory (i.e., the replay buffer) for training.
    - The actor of each agent (i.e., its policy) takes the state and outputs actions.
    - The critic, on the other hand, estimates the expected return of a (state, action) pair, i.e., it is a Q-value estimator.
- The structure of the Actor for each agent is as follows:
    - F1 = ReLU (input_state (states = 24) x 256 neurons)
    - F2 = ReLU (F1 x 256 neurons)
    - F3 = ReLU (F2 x output_state (actions = 2))
- The structure of the Critic for each agent is as follows:
    - F1 = ReLU (input_state (states = 24) x 256 neurons)
    - F2 = ReLU (F1 x 256 neurons)
    - F3 = ReLU (F2 x 1)
- Both the actor and the critic use two neural networks with the same architecture: a local network (θ_local) and a target network (θ_target).
- The target network is soft-updated from the local network: `θ_target = τ*θ_local + (1 - τ)*θ_target`.
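
The soft update rule above can be illustrated with a minimal, framework-free sketch. It is not the project's implementation: the function name `soft_update` and the flat lists of floats standing in for network weights are assumptions made for illustration, and only the value of `TAU` is taken from this report.

```python
def soft_update(local_params, target_params, tau):
    """Blend target parameters toward the local ones.

    Applies theta_target = tau * theta_local + (1 - tau) * theta_target
    element-wise to two equally long lists of floats.
    """
    if len(local_params) != len(target_params):
        raise ValueError("parameter lists must have the same length")
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]


TAU = 1e-3  # value listed under Hyperparameters

local = [1.0, -2.0, 0.5]   # hypothetical local-network weights
target = [0.0, 0.0, 0.0]   # hypothetical target-network weights

# Repeated soft updates move the target slowly toward the local weights.
for _ in range(1000):
    target = soft_update(local, target, TAU)
```

With a small τ the target network changes slowly, which is the usual motivation for this kind of update: the targets used to train the critic stay relatively stable between steps.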

## Hyperparameters

- BUFFER_SIZE = 1e5      # replay buffer size
- BATCH_SIZE = 128       # minibatch size
- GAMMA = 0.99           # discount factor
- TAU = 1e-3             # for soft update of target parameters
- LR_ACTOR = 1e-4        # Actor Learning Rate
- LR_CRITIC = 1e-3       # Critic Learning Rate
- maximum number of timesteps per episode = 1000
- WEIGHT_DECAY = 0       # L2 weight decay
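
To show how `BUFFER_SIZE` and `BATCH_SIZE` might be used by a replay buffer shared between the two agents, here is a minimal sketch using only the Python standard library. The class name `ReplayBuffer`, the `Experience` tuple, and the dummy transitions are assumptions for illustration and may differ from the code used in this project.

```python
import random
from collections import deque, namedtuple

# Hypothetical container for one transition.
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size buffer; the oldest experiences are dropped when it is full."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=int(buffer_size))
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniform sampling without replacement from the stored experiences.
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


BUFFER_SIZE = int(1e5)
BATCH_SIZE = 128
shared_buffer = ReplayBuffer(BUFFER_SIZE, BATCH_SIZE)

# Both agents write into the same buffer (dummy 24-dim states, 2-dim actions).
for step in range(200):
    for agent_id in range(2):
        state = [0.0] * 24
        action = [0.0, 0.0]
        shared_buffer.add(state, action, 0.0, state, False)

# Only sample once enough experiences have been collected.
batch = shared_buffer.sample() if len(shared_buffer) >= BATCH_SIZE else []
```

Because both agents add to one buffer, each sampled minibatch can mix experiences from either agent.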

## Rewards plot
A plot of the average rewards received is shown below:

![Plot of average rewards](images/reward_plot.png "Average rewards")

The plot shows that the agents receive higher rewards as their experience (i.e., the number of episodes) increases.

Number of episodes needed to solve the environment = 1408
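
This report does not state the criterion used to decide that the environment is solved. The sketch below assumes the criterion commonly quoted for this task: the average score over 100 consecutive episodes reaches at least 0.5, where an episode's score is the larger of the two agents' returns. The function names, the window, the threshold, and the example scores are all assumptions for illustration, and the sketch does not reproduce the number reported above.

```python
from collections import deque


def episode_score(agent_returns):
    """Score of one episode, assumed to be the maximum over the agents."""
    return max(agent_returns)


def first_solved_episode(episode_scores, window=100, threshold=0.5):
    """Return the first 1-based episode at which the moving average over
    the last `window` episodes reaches `threshold`, or None if it never does."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(episode_scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= threshold:
            return episode
    return None


# Hypothetical per-agent returns: 150 weak episodes, then 150 strong ones.
returns_per_episode = [(0.0, 0.0)] * 150 + [(1.0, 0.9)] * 150
scores = [episode_score(r) for r in returns_per_episode]
solved_at = first_solved_episode(scores)
```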

## Future ideas for improving agent performance
- Use different neural network architectures for the actor and the critic.
- Try other methods such as A3C, PPO, or D4PG for faster learning and better agent performance.