### Project Collaboration and Competition - Report

#### 1. Overview Of The Project

> In this project, the goal is to train two agents that each control a racket to bounce a ball over a net. An agent receives a reward of +0.1 each time it hits the ball over the net, which encourages it to keep the ball in play; otherwise it receives a reward of -0.01.

> Each agent's state has 8 variables corresponding to the position and velocity of the ball and racket. Each agent receives its own local observation. Two continuous actions are available, corresponding to movement toward (or away from) the net and jumping. The task is episodic, and to solve the environment the agents must reach an average score of +0.5 over 100 consecutive episodes, where each episode's score is the maximum over both agents.
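
The solve criterion can be checked with a short helper. This is a minimal sketch, not the code used for this report; the function name `episodes_to_solve` and its arguments are assumptions made for illustration.

```python
from collections import deque


def episodes_to_solve(episode_scores, target=0.5, window=100):
    """Return the 1-based episode at which the moving average first
    reaches `target`, or None if it never does.

    episode_scores: iterable of per-episode score pairs (one score per agent).
    """
    recent = deque(maxlen=window)  # keeps only the last `window` episode scores
    for episode, agent_scores in enumerate(episode_scores, start=1):
        recent.append(max(agent_scores))  # episode score = best of the two agents
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None
```

The average is only evaluated once a full window of 100 episodes is available, so the earliest possible answer is episode 100.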

#### 2. Algorithm Explanation

#### Multi-Agent Reinforcement Learning (MADDPG Algorithm):
> This environment is more involved than a single-agent environment. It requires training two separate agents that must collaborate in some situations (for example, keeping the ball from hitting the ground) and compete in others (for example, gathering as many points as possible). Simply extending single-agent RL by training the two agents independently does not work well: each agent updates its policy as learning progresses, so the environment appears non-stationary from the viewpoint of any one agent. Non-stationary Markov processes exist, but the convergence guarantees of many RL algorithms, such as Q-learning, require a stationary environment. Of the many RL algorithms for multi-agent settings, I chose Multi-Agent Deep Deterministic Policy Gradient (MADDPG) for this project.

> In MADDPG, each agent’s critic is trained using the observations and actions from all the agents, whereas each agent’s actor is trained using just its own observations. This allows the agents to be effectively trained without requiring other agents’ observations during inference (because the actor is only dependent on its own observations).

<img src='M1.png'>
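
The split between a centralized critic and decentralized actors comes down to what each network receives as input. The sketch below only builds those inputs; the helper names `actor_input` and `critic_input` are hypothetical, and the sizes (8 observation values, 2 actions per agent) are taken from the overview.

```python
def actor_input(observations, agent_index):
    """Decentralized actor: sees only its own local observation."""
    return list(observations[agent_index])


def critic_input(observations, actions):
    """Centralized critic: sees every agent's observation and action."""
    flat = []
    for obs in observations:
        flat.extend(obs)
    for act in actions:
        flat.extend(act)
    return flat


# Two agents, 8 observation values and 2 continuous actions each (illustrative values).
observations = [[0.1] * 8, [0.2] * 8]
actions = [[0.5, -0.3], [0.0, 1.0]]

own_view = actor_input(observations, agent_index=0)  # 8 values
joint_view = critic_input(observations, actions)     # 2 * 8 + 2 * 2 = 20 values
```

Because the critic is only needed during training, the larger joint input costs nothing at inference time.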

##### Deep Deterministic Policy Gradients:
> DDPG uses four neural networks: a Q network, a deterministic policy network, a target Q network, and a target policy network.

<img src='1.png'>

> The Q network and policy network work much like a simple Advantage Actor-Critic, but in DDPG the actor maps states directly to actions (the network's output is the action itself) instead of outputting a probability distribution over a discrete action space.

> The target networks are time-delayed copies of the original networks that slowly track the learned networks. Using target networks greatly improves learning stability. In methods without target networks, the update targets depend on values computed by the very network being updated, which makes training prone to divergence.

<img src='2.png'>
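
The "slow tracking" is usually implemented as a soft update, where the target parameters move a small fraction `TAU` toward the learned parameters at each step. A minimal sketch on plain lists of numbers, assuming this standard form (the function name `soft_update` is illustrative):

```python
TAU = 2e-3  # interpolation factor from the hyperparameter list


def soft_update(local_params, target_params, tau=TAU):
    """Return new target parameters: tau * local + (1 - tau) * target."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]


local = [1.0, -2.0]
target = [0.0, 0.0]
target = soft_update(local, target)  # target moves 0.2% of the way toward local
```

With a small `TAU`, the targets used in the Bellman update change slowly even when the learned network changes quickly.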

##### DDPG Algorithm:

<img src='3.png'>

##### Replay Buffer:
> Like Deep Q-learning (and many other RL algorithms), DDPG uses a replay buffer to sample experience for updating the neural network parameters. During each trajectory roll-out, we save the experience tuples (state, action, reward, next_state) in a finite-sized cache, the “replay buffer.” We then sample random mini-batches of experience from the replay buffer when updating the value and policy networks.
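
A replay buffer of this kind can be sketched with the standard library alone. The class below is an illustration under the tuple layout described above, not the exact buffer used in this project; the `seed` argument is an assumption added for reproducibility.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])


class ReplayBuffer:
    """Fixed-size buffer that stores experience tuples and samples mini-batches."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest entries are dropped when full
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.memory.append(Experience(state, action, reward, next_state))

    def sample(self):
        """Draw a uniformly random mini-batch without replacement."""
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)
```

Sampling uniformly at random breaks the correlation between consecutive steps of a trajectory, which is the main reason for using the buffer.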

##### Actor (Policy) & Critic (Value) Network:
> The value network is updated in much the same way as in Q-learning. The updated Q value is obtained from the Bellman equation:

<img src='4.png'>

> However, in DDPG, the next-state Q values are calculated with the target value network and target policy network. Then, we minimize the mean-squared loss between the updated Q value and the original Q value:
<img src='5.png'>
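
The two steps above (Bellman target, then mean-squared loss) can be written out on plain numbers. This is a sketch only: `next_q_values` stands in for the output of the target critic evaluated at the target actor's action, and the `dones` flag is an assumption that is not part of the experience tuple listed earlier.

```python
GAMMA = 0.99  # discount factor from the hyperparameter list


def td_targets(rewards, next_q_values, dones, gamma=GAMMA):
    """Bellman targets: r + gamma * Q_target(s', mu_target(s')) for non-terminal steps."""
    return [r + gamma * q * (1.0 - d) for r, q, d in zip(rewards, next_q_values, dones)]


def mse_loss(predicted, targets):
    """Mean-squared error between the critic's predictions and the targets."""
    return sum((p - t) ** 2 for p, t in zip(predicted, targets)) / len(targets)


targets = td_targets(rewards=[0.1, -0.01], next_q_values=[1.0, 2.0], dones=[0.0, 1.0])
loss = mse_loss(predicted=[1.09, 0.99], targets=targets)
```

For the terminal transition the bootstrap term is dropped, so its target is just the reward.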

> For the policy function, our objective is to maximize the expected return:
<img src='6.png'>

> To compute the policy gradient, we take the derivative of the objective function with respect to the policy parameters. Because the actor (policy) function is differentiable, we can apply the chain rule:
<img src='7.png'>

> Since we update the policy off-policy with batches of experience, we take the mean of the gradients computed over the mini-batch:
<img src='8.png'>
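
For reference, the sampled policy gradient in its standard DDPG form is written out below. The notation is an assumption and may differ from the figures: $\mu$ is the actor with parameters $\theta^\mu$, $Q$ is the critic with parameters $\theta^Q$, and $N$ is the mini-batch size.

```latex
\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_{i=1}^{N}
  \nabla_a Q(s, a \mid \theta^Q)\big|_{s = s_i,\, a = \mu(s_i)}
  \; \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\big|_{s = s_i}
```

The first factor says how the critic's value changes with the action, and the second says how the action changes with the actor's parameters; their product is the chain rule applied per sample.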

#### Hyperparameters Used:
<p>
BUFFER_SIZE = int(1e6) 
    
EPSILON = 1.0 

EPSILON_DECAY = 1e-6 

WEIGHT_DECAY = 0     

BATCH_SIZE = 256  

OU_SIGMA = 0.1

OU_THETA = 0.15

GAMMA = 0.99      

TAU = 2e-3         

LR_ACTOR = 1e-3     

LR_CRITIC = 1e-3     

LEARN_EVERY = 1       

LEARN_NUM = 10         

GRAD_CLIPPING = 1.0 

</p>
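
`OU_THETA` and `OU_SIGMA` parameterize Ornstein-Uhlenbeck exploration noise. The sketch below shows one common discretization of that process. How `EPSILON` is applied is not stated in this report, so scaling the noise by it, and clipping actions to [-1, 1], are both assumptions made for illustration.

```python
import random

OU_THETA = 0.15  # pull strength toward the mean
OU_SIGMA = 0.1   # scale of the random perturbation


class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated noise."""

    def __init__(self, size, mu=0.0, theta=OU_THETA, sigma=OU_SIGMA, seed=0):
        self.mu = [mu] * size
        self.theta = theta
        self.sigma = sigma
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Reset the internal state to the mean."""
        self.state = list(self.mu)

    def sample(self):
        """Advance the process one step and return the new state."""
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)


def noisy_action(action, noise, epsilon):
    """Add epsilon-scaled noise, then clip each component to [-1, 1] (assumed range)."""
    return [max(-1.0, min(1.0, a + epsilon * n)) for a, n in zip(action, noise)]
```

Unlike independent Gaussian noise, consecutive samples here are correlated, which tends to produce smoother exploratory movements in continuous control.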

#### Network Architecture:
> The actor and critic use two different networks. Their architectures are listed below.

> Actor: fc1 (399 units) -- batch_normalization -- fc2 (299 units) -- fc3 (2 units)

> Critic: fc1 (399 units) -- batch_normalization -- fc2 (299 units + 2 units) -- fc3 (1 unit)

> Both networks use ReLU as the activation function and Adam as the optimizer.

#### Result:
> The environment was solved in 337 episodes, with an average score of 0.505.
<img src='9.png'>

#### Ideas For Further Improvement:
> Algorithms such as MAPPO could also be considered for this kind of problem.

> Experiment with different values for hyperparameters such as fc1 units, fc2 units, batch_size etc.

> Add a few additional layers to the actor and critic networks.