# Project Report: Continuous Control

## Overview

In this project, I solved an environment by training 2 agents which control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1. If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01. Hence, the goal of each agent is to keep the ball in play.

For this task, I have implemented a Multi-agent Deep Deterministic Policy Gradient Method to solve the environment.


## Learning Algorithm

### Background
When I solved the environment in the first project of the nanodegree, I implemented a Deep Q-Network (DQN) to train the banana hunter. A key characteristic of DQN is that it marries reinforcement learning with advances in deep learning: a deep neural network serves as a function approximator for the action-value function in place of a Q-table, since state spaces in real life are mostly continuous. This approach has achieved excellent results, but it soon proved unsuitable for tasks with high-dimensional action spaces. One could extend the technique of discretizing the state space and apply it to the action space; however, the number of discrete actions grows exponentially with the number of degrees of freedom, which impairs training.

Therefore, I chose a more direct approach: a policy-based method. It is direct in the sense that, instead of first estimating the optimal action-value function and then deriving the optimal policy from it, a policy-based method searches for the optimal policy directly. In this project, I adapted some of my DDPG code from the Continuous Control project. **What I have implemented is a model-free, off-policy actor-critic algorithm that uses deep neural networks as function approximators and can learn policies in high-dimensional, continuous action spaces.**

#### Non-stationarity
In fact, I did not create a meta-agent class that accounts for the existence of multiple agents; instead, each agent treats the other agent as part of the environment. I trained my 2 agents independently: each takes its own partial observation of the environment as its state and learns its own policy (which means that the environment is seen differently depending on the perspective of the agent). Because both policies keep changing during training, the environment is non-stationary from each agent's point of view, so the usual convergence guarantees do not necessarily hold. Even so, I was able to solve the environment in around 1200 episodes. An implementation of a meta-agent class is worth looking into in the future.
![](images/non.PNG) 


### The Network itself
#### Local Networks and Target Networks
The DDPG algorithm makes use of 4 networks: a local and a target network for each of the actor and the critic. Each of the 4 networks has its own set of parameters, shown below. Since there are 2 agents in this environment, there are 8 networks in total. Similar to DQN, a copy of the local network is used as the target network to address the moving-target issue. The target network is updated more slowly than the local network and gradually tracks it. The implementation can be found in the `model.py` file.

![](images/n.png)


#### Actor and Critic Networks
I randomly initialize a local and a target copy of the critic and actor networks for each agent, giving 8 networks in total. I then initialize the target weights as a copy of the local weights using the `hard_update` process in the `maddpg_agent_v2.py` file. Both the actor and the critic use 2 fully connected hidden layers with 256 units each and Leaky ReLU activations, as shown in the `model.py` file. A batch normalization layer is also added to both the critic and actor networks.


### Experience Replay
Analogous to the DQN method, I also use experience replay in DDPG: I construct a replay buffer of size (100000) to store experience tuples (s, a, r, s_next) and randomly sample mini-batches of tuples for the agents to learn from. Random sampling breaks the correlation between sequential experience tuples, which leads to better results. In this case, the experience replay buffer is shared between the 2 agents.
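
The following is a minimal sketch of a shared replay buffer, not the code from this repository. The class name `ReplayBuffer`, the `Experience` field names, and the extra `done` flag are assumptions made for illustration; only the ideas of a fixed-size buffer and uniform random sampling come from the description above.

```python
import random
from collections import deque, namedtuple

# Field names are illustrative; "done" is an assumption that is not part of
# the (s, a, r, s_next) tuple described in the report.
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size experience buffer that several agents can share (sketch)."""

    def __init__(self, buffer_size, batch_size, seed=0):
        # A deque with maxlen silently drops the oldest experience when full.
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        """Store one transition."""
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Draw a uniformly random mini-batch without replacement."""
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)
```

In a setup like this one, both agents would call `add` on the same buffer instance and learning would only start once `len(buffer)` is at least the batch size.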

### Actor and Critic Networks Updates
#### Updating Critic Networks (Value)
The critic network is updated in a fashion similar to DQN. First, the next-state Q-value is calculated using the target critic network and the action chosen by the target policy. The Bellman equation then gives the target Q-values, `y_i`, for the current state.

![](images/1.png)

Then, I minimize the loss between the target Q-value and the local (actual) Q-value, which is calculated using the local network, as shown below.

![](images/2.png)
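
As a rough illustration of the two steps above, the sketch below computes the targets `y_i` and a mean squared error loss on plain Python lists. It assumes the standard DDPG form of the target and an MSE loss; the function names and the `dones` flag (which stops bootstrapping at terminal steps) are assumptions, and the real implementation operates on network outputs rather than lists.

```python
def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """Bellman targets: y_i = r_i + gamma * Q_target(s'_i, mu_target(s'_i)).

    next_q_values are assumed to come from the target critic evaluated at
    the target actor's action. Terminal steps (done=True) do not bootstrap.
    """
    return [
        r + gamma * q_next * (1.0 - float(d))
        for r, q_next, d in zip(rewards, next_q_values, dones)
    ]


def mse_loss(targets, q_values):
    """Mean squared error between targets y_i and local critic values."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / len(targets)
```

For example, with a reward of 0.1, a next-state value of 1.0, and `gamma = 0.99`, the target for a non-terminal step is 0.1 + 0.99 * 1.0 = 1.09.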

#### Updating Actor Networks (Policy)
Keep in mind that our goal is to maximize the expected return:
![](images/3.png)

To do that, I use gradient ascent, taking the derivative of the expected return with respect to the policy parameters. Because the policy is updated in an off-policy way from mini-batches of experience tuples, I take the mean of the gradients over each mini-batch.
![](images/4.png)

For both networks, I chose the Adam optimizer with a learning rate of 0.001. The above process can be found in the `maddpg_agent_v2.py` file.
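
To make the chain rule behind the actor update concrete, here is a toy sketch that is deliberately much simpler than the project code. It assumes a hypothetical one-parameter actor `a = w * s` and a fixed, hand-written critic `Q(s, a) = -(a - 2s)^2`, and it uses plain gradient ascent instead of Adam. None of these names or functions come from the repository.

```python
def actor_gradient(w, states):
    """Mini-batch mean of dQ/da * da/dw for the toy actor a = w * s."""
    grads = []
    for s in states:
        a = w * s                      # deterministic action of the toy actor
        dq_da = -2.0 * (a - 2.0 * s)   # gradient of the toy critic w.r.t. a
        da_dw = s                      # gradient of the action w.r.t. w
        grads.append(dq_da * da_dw)
    return sum(grads) / len(grads)


def train_actor(w=0.0, states=(0.5, -1.0, 1.5), lr=0.1, steps=100):
    """Plain gradient ascent on the mean critic value (the report uses Adam)."""
    for _ in range(steps):
        w += lr * actor_gradient(w, states)
    return w
```

The toy critic is maximized by `a = 2s`, so gradient ascent drives `w` toward 2. In the real algorithm the critic is itself a learned network, and the same gradient is obtained by backpropagating through it.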


### Target Networks Updates (Soft Updates)
Lastly, I update the target networks using the formula below:
![](images/5.png)

The above process can be found in the `maddpg_agent_v2.py` file.
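
A minimal sketch of the soft update is given here, assuming the standard DDPG rule `target = tau * local + (1 - tau) * target`. It works on plain lists of floats for clarity, whereas the project updates network parameters; the function signatures are illustrative assumptions, and `hard_update` is shown only as the full-copy counterpart used at initialization.

```python
def soft_update(local_params, target_params, tau=8e-3):
    """Blend the target parameters a small step toward the local ones."""
    return [
        tau * lp + (1.0 - tau) * tp
        for lp, tp in zip(local_params, target_params)
    ]


def hard_update(local_params):
    """Full copy of the local parameters, as done once at initialization."""
    return list(local_params)
```

With `tau = 0.008`, each call moves the target 0.8% of the way toward the local parameters, which is why the target network changes slowly.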


### Exploration
#### Ornstein-Uhlenbeck Process
Additionally, I have included an implementation of the Ornstein-Uhlenbeck Process for exploration purposes. I did this by adding noise to the actions, as suggested by the authors of the DDPG paper. 

![](images/6.png)

The motivation behind this is that exploration in continuous action spaces is difficult. However, with an off-policy method like DDPG, I can treat the exploration problem independently of the learning algorithm. The implementation can be found in the `utils.py` file.
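
The sketch below shows one common discrete-time form of the Ornstein-Uhlenbeck process, `x <- x + theta * (mu - x) + sigma * N(0, 1)`. It is an assumption about the general shape of such a class, not a copy of `utils.py`; the class name, constructor arguments, and the use of Gaussian samples are illustrative.

```python
import random


class OUNoise:
    """Temporally correlated noise for action exploration (sketch)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.5, seed=0):
        self.mu = [mu] * size        # long-run mean of each noise dimension
        self.theta = theta           # speed of mean reversion
        self.sigma = sigma           # volatility of the random term
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Reset the internal state to the mean, e.g. at episode start."""
        self.state = list(self.mu)

    def sample(self):
        """Advance the process one step and return the new noise vector."""
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)
```

The returned vector would be added to the actor's output before the action is clipped to the valid range. Because each sample depends on the previous state, consecutive noise values are correlated.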


### Changes and Updates
Though much of the implementation is similar to the code in Project 2, I made some specific changes to make the algorithm perform better:
- I increased the soft-update parameter TAU to 0.008.
- I increased the number of hidden units in the model.
- I used the Leaky ReLU activation function instead of a regular ReLU.
- For the OU noise process, I increased sigma from 0.2 to 0.5 and changed the scale factor from 0.1 to 1 so that exploration is stronger.
- I added batch normalization layers to the neural networks.


## Choice of Hyperparameters

The network architecture consists of 2 hidden layers, each with 256 hidden units. The critic network outputs a single value; the actor network outputs actions directly.

The following hyperparameters were used:
``` python
BUFFER_SIZE = int(1e6)        # replay buffer size
BATCH_SIZE = 256              # minibatch size
GAMMA = 0.99                  # discount factor
TAU = 8e-3                    # soft update parameters
LR_ACTOR = 1e-3               # learning rate of the actor 
LR_CRITIC = 1e-3              # learning rate of the critic
WEIGHT_DECAY = 0              # L2 weight decay
fc1_units = 256               # number of hidden units in the first layer of NN
fc2_units = 256               # number of hidden units in the second layer of NN
OU_SIGMA = 0.5                # the volatility parameter
OU_THETA = 0.15               # the speed of mean reversion
```

## Results

The environment was solved in around 1200 episodes, reaching an average score of 0.5.

![](images/R.png)


## Ideas for Future Work

There are a few more details and improvements that I can make:

- Implement a meta-agent approach to solve the environment and address the non-stationarity issue.
- Add more hidden layers.
- Use a different weight initialization method, such as Kaiming or Xavier initialization, instead of the normal distribution when resetting the parameters of my critic and actor networks.
- Implement other algorithms such as PPO, A3C, and D4PG for this task and compare their performance.


