# Project Report: Continuous Control

## Overview

In this project, I solved an environment by training 20 agents, each a double-jointed arm that can move to a target location. A reward of +0.1 is provided for each step that an agent's hand is in the goal location, so the goal of each agent is to keep its hand at the target location for as many time steps as possible.

For this task, I implemented the Deep Deterministic Policy Gradient (DDPG) method to solve the environment.


## Learning Algorithm

### Background

When I solved the environment in the first project of the nanodegree, I implemented a Deep Q-Network (DQN) to train the banana hunter. A key characteristic of DQN is that it marries reinforcement learning with advances in deep learning: it uses a deep neural network as a function approximator to estimate the action-value function instead of storing it in a Q-table, which is impractical because state spaces in real-world problems are mostly continuous. DQN has achieved excellent results, but it proved unsuitable for tasks with high-dimensional action spaces. One could extend the technique of discretizing the state space and apply it to the action space; however, the number of discrete actions grows exponentially with the number of degrees of freedom, and this explosion impairs training.

Therefore, I employed a more direct approach: a policy-based method. It is direct in the sense that, instead of first estimating the optimal action-value function and then deriving the optimal policy from it, a policy-based method searches for the optimal policy directly. **In this project, the DDPG algorithm I implemented is a model-free, off-policy actor-critic algorithm that uses deep neural networks as function approximators and can learn policies in high-dimensional, continuous action spaces.**



### The Network itself
#### Local Networks and Target Networks
The DDPG algorithm uses 4 networks: a local and a target network for each of the actor and the critic. Each of the parameter sets below parameterizes one of the 4 networks. As in DQN, we create a copy of each local network and use it as the target network to address the moving-target problem. The target network is updated more slowly than the local network, so it slowly tracks the learned local network.

![](images/n.png)


#### Actor and Critic Networks
We randomly initialize a local and a target copy of both the critic and the actor network, giving 4 networks in total. We then initialize the target weights as a copy of the local weights using the `hard_update` process in the `Agent.py` file. Both the actor and the critic network use 2 linear hidden layers with 128 hidden units each and ReLU activation functions, as shown in the `Model.py` file.


### Experience Replay
As in the DQN method, DDPG uses experience replay: we construct a replay buffer (its capacity is the `BUFFER_SIZE` hyperparameter listed under Choice of Hyperparameters) to store experience tuples (s, a, r, s_next) and randomly sample mini-batches of tuples for the agents to learn from. Random sampling breaks the correlation between sequential experience tuples, which stabilizes learning and leads to better results.
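
The buffer described above can be sketched with the standard library alone. This is a minimal illustration, not the code in `Agent.py`: the names `ReplayBuffer`, `Experience`, `add`, and `sample` are hypothetical, and the tiny sizes in the usage lines are chosen only to make the behaviour visible.

```python
import random
from collections import deque, namedtuple

# Field names follow the (s, a, r, s_next) tuple described above.
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])


class ReplayBuffer:
    """Fixed-size buffer that stores experience tuples and samples them uniformly."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest entries are dropped first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.memory.append(Experience(state, action, reward, next_state))

    def sample(self):
        # Uniform sampling without replacement breaks temporal correlation.
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


# Usage: a buffer of capacity 5 keeps only the 5 most recent of 8 transitions.
buffer = ReplayBuffer(buffer_size=5, batch_size=3)
for t in range(8):
    buffer.add(state=t, action=0.0, reward=0.1, next_state=t + 1)
batch = buffer.sample()
```

Because all 20 agents can write into one shared buffer, each environment step can contribute 20 tuples; whether the buffer is shared in this project is an assumption of this note, not something the sketch enforces.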

### Actor and Critic Networks Updates
#### Updating Critic Networks (Value)
The critic network is updated in a similar fashion to DQN. First, the next-state Q-value is calculated using the target critic network and the target policy (the target actor network). We then use the Bellman equation to compute the target Q-values, `y_i`, for the current state.

![](images/1.png)

Then, we minimize the loss between the target Q-value and the local Q-value, which is calculated using the local critic network, as shown below.

![](images/2.png)
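
As a small worked example of the two steps above, the sketch below computes Bellman targets and a mean-squared-error loss on a toy mini-batch. It assumes the standard DDPG form of the target, `y_i = r_i + gamma * Q_target(s_next_i, mu_target(s_next_i))`, and an MSE loss; the function names and the `dones` flag (which zeroes the bootstrap term on terminal steps) are illustrative assumptions and are not taken from `Agent.py`.

```python
def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """Bellman targets: y_i = r_i + gamma * Q_target(s_next_i, a_next_i) * (1 - done_i)."""
    return [r + gamma * q_next * (1.0 - d)
            for r, q_next, d in zip(rewards, next_q_values, dones)]


def mse_loss(targets, predictions):
    """Mean squared error between target and local Q-values over a mini-batch."""
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / len(targets)


# Toy mini-batch of two transitions; the second one is terminal.
rewards = [0.1, 0.0]
next_q_values = [1.0, 2.0]   # Q_target(s_next, mu_target(s_next)), assumed given
dones = [0.0, 1.0]
local_q_values = [1.0, 0.5]  # Q_local(s, a), assumed given

targets = td_targets(rewards, next_q_values, dones)
loss = mse_loss(targets, local_q_values)
```

Here the first target is 0.1 + 0.99 * 1.0 = 1.09 and the second is just the reward, 0.0, because the episode ended.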

#### Updating Actor Networks (Policy)
Keep in mind that our goal is to maximize the expected return:
![](images/3.png)

To do that, we use gradient ascent, taking the derivative of J with respect to the policy parameters. We average the gradients over the mini-batch, since we are updating the policy in an off-policy way from sampled experience tuples.
![](images/4.png)
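
To make the chain rule behind this update concrete, here is a deliberately tiny sketch with a scalar policy and a known critic. Everything in it is a toy assumption: the policy `mu(s) = theta * s`, the critic `Q(s, a) = -(a - 2s)^2`, and the function name `policy_gradient` are hypothetical and stand in for the neural networks and autograd used in the real agent.

```python
def policy_gradient(theta, states):
    """Mini-batch estimate of dJ/dtheta = mean(dQ/da * dmu/dtheta) at a = mu(s)."""
    grads = []
    for s in states:
        a = theta * s                 # deterministic policy mu(s) = theta * s
        dq_da = -2.0 * (a - 2.0 * s)  # critic Q(s, a) = -(a - 2s)^2
        dmu_dtheta = s
        grads.append(dq_da * dmu_dtheta)
    return sum(grads) / len(grads)


theta = 0.0
learning_rate = 0.1
states = [1.0, -0.5, 2.0]
for _ in range(50):
    # Gradient ascent: move theta in the direction that increases Q.
    theta += learning_rate * policy_gradient(theta, states)
```

The toy critic is maximized by `a = 2s`, so `theta` converges to 2. In an autograd framework the same ascent step is commonly written as minimizing the negative mean Q-value of the actor's actions.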

For both networks, I chose the Adam optimizer with a learning rate of 0.001. The above process can be found in the `Agent.py` file.


### Target Networks Updates (Soft Updates)
Lastly, we update the target networks using the formula below:
![](images/5.png)

The above process can be found in the `Agent.py` file.
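
A minimal sketch of the soft update follows, assuming the standard DDPG rule `theta_target = tau * theta_local + (1 - tau) * theta_target`. Parameters are represented as plain lists of floats for illustration; the list-based signatures here are hypothetical and differ from whatever `Agent.py` uses on network objects.

```python
TAU = 1e-3  # value from the hyperparameter list in this report


def soft_update(local_params, target_params, tau=TAU):
    """Blend a small fraction `tau` of the local weights into the target weights."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]


def hard_update(local_params):
    """Copy the local weights outright (equivalent to a soft update with tau = 1)."""
    return list(local_params)


local = [1.0, 2.0]
target = [0.0, 0.0]
target = soft_update(local, target)  # target moves 0.1% of the way toward local
```

With `TAU = 1e-3`, the target weights lag far behind the local weights, which is what keeps the critic's learning targets stable.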


### Exploration
#### Ornstein-Uhlenbeck Process
Additionally, I have included an implementation of the Ornstein-Uhlenbeck Process for exploration purposes. I did this by adding noise to the actions, as suggested by the authors of the DDPG paper. 

![](images/6.png)

The motivation behind this is that exploration in continuous action spaces is difficult. However, with an off-policy method like DDPG, we can treat the exploration problem independently of the learning algorithm. The implementation can be found in the `Utils.py` file.
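
For reference, a discretized Ornstein-Uhlenbeck process can be sketched in a few lines. This is an illustration under stated assumptions rather than the code in `Utils.py`: the class name `OUNoise`, the unit time step, and the defaults `theta=0.15` and `sigma=0.2` (the values reported in the DDPG paper) are assumptions and may differ from the values used in this project.

```python
import random


class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated, mean-reverting noise."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = [mu] * size
        self.theta = theta    # speed of reversion toward the mean
        self.sigma = sigma    # scale of the random perturbation
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Reset the internal state to the mean (typically at the start of an episode)."""
        self.state = list(self.mu)

    def sample(self):
        """Advance one step: x <- x + theta * (mu - x) + sigma * N(0, 1)."""
        self.state = [x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
                      for x, m in zip(self.state, self.mu)]
        return list(self.state)


noise = OUNoise(size=4)
first = noise.sample()
second = noise.sample()  # correlated with `first`, unlike independent Gaussian noise
```

Because each sample starts from the previous one, consecutive perturbations push the arm in a similar direction for several steps, which tends to explore better in physical control tasks than uncorrelated noise.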

#### Epsilon-Greedy Policy
Besides the OU noise, I made use of our old friend, the epsilon-greedy policy. I want the agents to explore as much as possible at the beginning of training and to exploit the learned policy as much as possible towards the later phase, which I achieve by setting an `epsilon_decay` parameter.
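
One common way to combine the two ideas is to scale the exploration noise by epsilon and shrink epsilon after every learning step. The sketch below assumes that scheme, a linear decay, and actions bounded in [-1, 1]; none of these details are confirmed by this report, and the function names are hypothetical.

```python
EPSILON = 1.0         # initial noise scale, from the hyperparameter list
EPSILON_DECAY = 1e-6  # amount subtracted per learning step (assumed linear decay)


def noisy_action(action, noise, epsilon, low=-1.0, high=1.0):
    """Add epsilon-scaled noise to each action component and clip to the valid range."""
    return [min(high, max(low, a + epsilon * n)) for a, n in zip(action, noise)]


def decay(epsilon, epsilon_decay=EPSILON_DECAY, floor=0.0):
    """Reduce epsilon by a fixed amount, never going below `floor`."""
    return max(floor, epsilon - epsilon_decay)


epsilon = EPSILON
action = noisy_action([0.5, -0.9], noise=[1.0, -1.0], epsilon=0.5)
epsilon = decay(epsilon)
```

Early in training epsilon is close to 1 and the noise is applied at full strength; as epsilon shrinks, the agents act closer and closer to the deterministic policy.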



## Choice of Hyperparameters

The network architecture consists of 2 hidden layers, each with 128 hidden units. The critic network outputs a single value; the actor network outputs actions directly.

The following hyperparameters were used:
``` python
BUFFER_SIZE = int(1e6)  # replay buffer size
BATCH_SIZE = 256        # minibatch size
GAMMA = 0.99            # discount factor
TAU = 1e-3              # for soft update of target parameters
LR_ACTOR = 1e-3         # learning rate of the actor
LR_CRITIC = 1e-3        # learning rate of the critic
WEIGHT_DECAY = 0        # L2 weight decay
EPSILON = 1.0           # epsilon noise parameter
EPSILON_DECAY = 1e-6    # decay parameter of epsilon
LEARNING_PERIOD = 20    # learning frequency  
UPDATE_FACTOR   = 10    # how much to learn
fc1_units = 128         # number of hidden units in the first layer of NN
fc2_units = 128         # number of hidden units in the second layer of NN
```
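
`LEARNING_PERIOD` and `UPDATE_FACTOR` together set how often the networks learn. A plausible reading, assumed here rather than confirmed by `Agent.py`, is that every `LEARNING_PERIOD` environment steps the agent performs `UPDATE_FACTOR` consecutive mini-batch updates. The helper below is hypothetical and only counts updates under that reading.

```python
LEARNING_PERIOD = 20  # learn once every 20 environment steps
UPDATE_FACTOR = 10    # number of mini-batch updates per learning event


def learning_updates(total_steps, period=LEARNING_PERIOD, updates=UPDATE_FACTOR):
    """Count gradient updates if `updates` passes run every `period` environment steps."""
    count = 0
    for t in range(1, total_steps + 1):
        if t % period == 0:
            count += updates
    return count


# Over a hypothetical run of 1000 environment steps:
n_updates = learning_updates(1000)
```

Under this reading there is on average one gradient update for every two environment steps, which trades some sample efficiency for more stable learning when 20 agents fill the buffer in parallel.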

## Results

Though I trained the agents for 2000 episodes, the environment was solved in only 175 episodes.

![](images/R.png)


## Ideas for Future Work

Although the environment was solved in only 175 episodes, there are a few more improvements that I could make:

- Trying out different activation functions, such as Leaky ReLU, and tweaking the negative slope to see whether it performs better than a regular ReLU activation function.
- Adding more hidden layers, increasing the number of hidden units, and adding batch normalization.
- Using a distribution other than the uniform distribution when resetting the parameters of my critic and actor networks.
- Implementing other algorithms like PPO, A3C, and D4PG in this task and comparing the performance of each algorithm.


