## Project Report
This project solves Unity's Reacher environment (the second version, with 20 agents). The environment was solved with the Deep Deterministic Policy Gradients (DDPG) algorithm after 103 episodes.

### Environment
![alt text](https://video.udacity-data.com/topher/2018/June/5b1ea778_reacher/reacher.gif "Reacher")

In this environment, a double-jointed arm can move to target locations. A reward of +0.1 is provided for each step that the agent's hand is in the goal location. The goal of the agent is to maintain its position at the target location for as many time steps as possible.

The observation space consists of 33 variables corresponding to the position, rotation, velocity, and angular velocities of the arm. Each action is a vector of four numbers, corresponding to the torques applied to the two joints. Every entry in the action vector should be a number between -1 and 1.

### Experience replay buffer
An experience replay buffer of size `1e6` stores the experience tuples. Each tuple contains the input state, the chosen action, the resulting reward from the environment, and the next state. At each training step, 256 tuples are randomly sampled from the buffer to train the actor and critic networks.
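
The buffer described above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual implementation: the names `Experience`, `ReplayBuffer`, `add`, and `sample` are hypothetical, and the tuple holds only the four fields listed in the text.

```python
import random
from collections import deque, namedtuple

# Hypothetical tuple layout: the four fields named in the text above.
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])


class ReplayBuffer:
    """Fixed-size buffer that discards the oldest experience when full."""

    def __init__(self, buffer_size=int(1e6), batch_size=256, seed=0):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        """Store one experience tuple."""
        self.memory.append(Experience(state, action, reward, next_state))

    def sample(self):
        """Draw batch_size experiences at random, without replacement."""
        return self.rng.sample(self.memory, self.batch_size)

    def __len__(self):
        return len(self.memory)


# Usage with dummy data: with 20 agents, every environment step adds 20 tuples.
buffer = ReplayBuffer(buffer_size=1000, batch_size=256)
for step in range(300):
    buffer.add(state=[0.0] * 33, action=[0.0] * 4, reward=0.1, next_state=[0.0] * 33)
batch = buffer.sample()
```

Training should only start once `len(buffer)` is at least the batch size, since sampling without replacement needs that many stored tuples.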

### Deep Deterministic Policy Gradients (DDPG)
Deep Deterministic Policy Gradients (DDPG) is an algorithm for solving environments with continuous action values. It combines the actor-critic approach with the Deep Q-Network (DQN) method. With the actor-critic method, the agent uses two neural network models called the actor and the critic. The actor _&micro;(s|&theta;<sup>&micro;</sup>)_ takes the input state and outputs the action, while the critic _Q(s,a|&theta;<sup>Q</sup>)_ acts as a nonlinear action-value approximator: it takes the input state and the chosen action and outputs the value of taking that action in that state. Following the DQN approach, both the actor and the critic are created in two separate versions, local _Q(s,a|&theta;<sup>Q</sup>)_, _&micro;(s|&theta;<sup>&micro;</sup>)_ and target _Q'(s,a|&theta;<sup>Q'</sup>)_, _&micro;'(s|&theta;<sup>&micro;'</sup>)_, initialized with identical parameters. Once the replay buffer has collected enough experiences and training starts, the local networks are trained on target values generated by the target networks, which reduces the risk of divergence. The weights of the target networks &theta;' are then updated to slowly track their local versions using the soft update method: &theta;' &larr; &tau;&theta; + (1-&tau;)&theta;' with &tau;=`1e-3`.  
The environment was solved using the network architectures, exploration noise, and hyperparameters described in the following sections.
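
To make the soft update rule concrete, here is a small sketch that applies it to plain lists of numbers standing in for network weights. The function name `soft_update` and the example values are assumptions for illustration; in a real agent the same arithmetic would be applied to each parameter tensor of the actor and the critic.

```python
def soft_update(local_params, target_params, tau=1e-3):
    """Return updated target parameters: tau * local + (1 - tau) * target."""
    return [tau * local + (1.0 - tau) * target
            for local, target in zip(local_params, target_params)]


# Illustrative values only: the target starts at zero and tracks fixed local weights.
local = [1.0, -2.0]
target = [0.0, 0.0]
for _ in range(1000):
    target = soft_update(local, target)
```

With &tau;=`1e-3` the target moves only 0.1% of the remaining distance per update, so after 1000 updates against fixed local weights it has covered roughly 63% of the gap. This slow tracking is what keeps the training targets stable.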

### Actor
The actor function _&micro;(s|&theta;<sup>&micro;</sup>)_ specifies the current policy by directly mapping the input state vector _s &isin; &real;<sup>33</sup>_ to the output action _a &isin; &real;<sup>4</sup>_. To approximate this function, we use a neural network with 3 layers: the input layer, 1 hidden layer with 256 nodes, and the output layer. The first 2 layers use the ReLU activation function, while the output layer uses tanh to squash the action values between -1 and 1. Batch normalization is also applied to the hidden layer.
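
One possible reading of this architecture in PyTorch is sketched below. The class name `Actor`, the layer names, and the placement of batch normalization before the ReLU are assumptions; the text only fixes the sizes (33 inputs, 256 hidden nodes, 4 outputs) and the activations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_SIZE = 33   # observation size from the environment description
ACTION_SIZE = 4   # action size from the environment description


class Actor(nn.Module):
    """Hypothetical actor: state -> 256 hidden units -> action in [-1, 1]."""

    def __init__(self, state_size=STATE_SIZE, action_size=ACTION_SIZE, hidden_size=256):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.bn1 = nn.BatchNorm1d(hidden_size)  # exact placement is an assumption
        self.fc2 = nn.Linear(hidden_size, action_size)

    def forward(self, state):
        x = F.relu(self.bn1(self.fc1(state)))
        return torch.tanh(self.fc2(x))  # squash each action entry into [-1, 1]


actor = Actor()
actor.eval()  # use running batch-norm statistics when acting
with torch.no_grad():
    actions = actor(torch.randn(20, STATE_SIZE))  # one row per agent
```

Switching to `eval()` before acting matters with batch normalization, because in training mode the layer normalizes with the statistics of the current batch.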


### Critic
The critic function _Q(s,a|&theta;<sup>Q</sup>)_ approximates the expected return when action _a_ is taken in state _s_. Here we use a neural network with 4 layers: the input layer, 2 hidden layers with sizes of 512 and 384 respectively, and the output layer. Leaky ReLU is used as the activation function for every layer except the output. The output layer has no activation function and outputs the value of action _a_ in state _s_ as a real number.

**Note**: The weights of both the actor and critic networks need to be initialized manually from a uniform random distribution; otherwise the networks will not learn effectively. In my experience, an agent initialized with PyTorch's default settings cannot exceed a score of 4.
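
The sketch below shows what a critic with manual uniform initialization could look like in PyTorch. Several details are assumptions, since the text does not state them: the action is concatenated after the first layer, the hidden-layer ranges are ±1/sqrt(fan_in) and the output-layer range is ±3e-3 (the values suggested in the DDPG paper), and the names `Critic` and `reset_parameters` are hypothetical.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class Critic(nn.Module):
    """Hypothetical critic: (state, action) -> scalar Q-value."""

    def __init__(self, state_size=33, action_size=4, h1=512, h2=384):
        super().__init__()
        self.fc1 = nn.Linear(state_size, h1)
        # Assumption: the action joins the network after the first layer.
        self.fc2 = nn.Linear(h1 + action_size, h2)
        self.fc3 = nn.Linear(h2, 1)
        self.reset_parameters()

    def reset_parameters(self):
        """Uniform weight initialization; biases keep PyTorch's defaults."""
        for layer in (self.fc1, self.fc2):
            limit = 1.0 / math.sqrt(layer.in_features)
            nn.init.uniform_(layer.weight, -limit, limit)
        nn.init.uniform_(self.fc3.weight, -3e-3, 3e-3)

    def forward(self, state, action):
        x = F.leaky_relu(self.fc1(state))
        x = F.leaky_relu(self.fc2(torch.cat([x, action], dim=1)))
        return self.fc3(x)  # no activation on the output


critic = Critic()
with torch.no_grad():
    q_values = critic(torch.randn(20, 33), torch.rand(20, 4) * 2 - 1)
```

The small range on the output layer keeps the initial Q-value estimates close to zero, which is one common motivation for initializing the last layer separately.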

### Exploration noise for continuous action values
Following the DDPG paper, we constructed an exploration policy &micro;' by adding noise sampled from a noise process _N_ to our actor policy &micro;:
&micro;'(s<sub>t</sub>) = &micro;(s<sub>t</sub>|&theta;<sup>&micro;</sup>) + _N_. _N_ is sampled from an Ornstein-Uhlenbeck process to generate temporally correlated noise with mean &mu;=0, &theta;=0.15 and &sigma;=0.4. &sigma; is then gradually reduced to 0.2 over a decay period of 1000 steps to reduce exploration and increase exploitation as the policy gets better.
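
A minimal sketch of such a noise process is shown below, using only the standard library. The class name `OUNoise`, the linear shape of the &sigma; decay, and the discretization with a unit time step are assumptions; the text only gives the start and end values of &sigma; and the length of the decay period.

```python
import random


class OUNoise:
    """Ornstein-Uhlenbeck noise with a decaying sigma (hypothetical sketch)."""

    def __init__(self, size=4, mu=0.0, theta=0.15, sigma_start=0.4,
                 sigma_end=0.2, decay_steps=1000, seed=0):
        self.mu = mu
        self.theta = theta
        self.sigma_start = sigma_start
        self.sigma_end = sigma_end
        self.decay_steps = decay_steps
        self.rng = random.Random(seed)
        self.state = [mu] * size
        self.step_count = 0

    def current_sigma(self):
        """Assumed linear decay from sigma_start to sigma_end."""
        frac = min(self.step_count / self.decay_steps, 1.0)
        return self.sigma_start + frac * (self.sigma_end - self.sigma_start)

    def sample(self):
        """Advance the process one step: x += theta * (mu - x) + sigma * N(0, 1)."""
        sigma = self.current_sigma()
        self.state = [x + self.theta * (self.mu - x) + sigma * self.rng.gauss(0.0, 1.0)
                      for x in self.state]
        self.step_count += 1
        return list(self.state)


def noisy_action(action, noise):
    """Add noise to an action and clip each entry to the valid range [-1, 1]."""
    return [max(-1.0, min(1.0, a + n)) for a, n in zip(action, noise)]


ou = OUNoise()
explored = noisy_action([0.5, -0.5, 0.9, -0.9], ou.sample())
```

Because each sample starts from the previous state, consecutive noise values are correlated, which tends to push the arm in a consistent direction for several steps instead of jittering around the policy's action.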

### Training algorithm

### Hyper parameters
```python
ACTOR_LEARNING_RATE = 1e-3
CRITIC_LEARNING_RATE = 1e-3
BUFFER_SIZE = int(1e6)          # replay buffer size
BATCH_SIZE = 256                # minibatch size
UPDATE_EVERY = 2                # how often to update the network
GAMMA = 0.99                    # discount factor
TAU = 1e-3                      # for soft update of target parameters
LEARN_TIMES = 1
CRITIC_GRADIENT_CLIPPING_VALUE = 1
ACTOR_GRADIENT_CLIPPING_VALUE = 0
```
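
How `UPDATE_EVERY` and `LEARN_TIMES` interact is not spelled out here. One plausible reading, sketched below as an assumption, is that learning runs on every `UPDATE_EVERY`-th environment step, with `LEARN_TIMES` passes each time, once the buffer holds at least one batch. The function name `learning_passes` is hypothetical.

```python
BATCH_SIZE = 256
UPDATE_EVERY = 2
LEARN_TIMES = 1


def learning_passes(step, buffer_len):
    """Number of learning passes to run at a given (1-based) environment step."""
    if buffer_len < BATCH_SIZE:
        return 0  # not enough experience to draw a full minibatch
    if step % UPDATE_EVERY != 0:
        return 0  # skip steps between scheduled updates
    return LEARN_TIMES


# Total passes over 10 steps, assuming the buffer is already full enough.
total = sum(learning_passes(step, buffer_len=1000) for step in range(1, 11))
```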

## Results
The agent solved the environment (average score >= 30) after 115 episodes:  
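
The solve check can be sketched as a rolling average over recent episode scores. The window of 100 episodes is an assumption (the text only states the threshold of 30), and the function name `first_solved_episode` is hypothetical; each episode score is taken to be the mean over the 20 agents.

```python
from collections import deque


def first_solved_episode(episode_scores, target=30.0, window=100):
    """Return the first 1-based episode whose rolling average reaches target."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(episode_scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None  # threshold never reached


# Synthetic scores for illustration only, not the project's results.
solved_at = first_solved_episode([0.0] * 50 + [40.0] * 200)
```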
![alt text](https://i.ibb.co/GvZ3QPL/agent-score.png "Average agent score")

## Future improvement
Use a Q network with more layers, and apply Double DQN, Prioritized Experience Replay, and Dueling DQN.