# Continuous Control Project Report

Date: 5/12/2019

Author: Josh Lo

## Introduction

This report describes the implementation of the learning algorithm used to solve a continuous control problem for a double-joint robot. A deep deterministic policy gradient (DDPG) method was used in this project. The actor and the critic each have a local and a target neural network. The agent uses experience replay to break the correlations between samples during training. An Ornstein-Uhlenbeck process adds randomness to the generated actions. A soft-update scheme based on an exponentially weighted average gradually blends the local network parameters into the target network parameters. The details are given in the following sections.

### DDPG Agent

The DDPG agent includes an actor and a critic component. The actor takes the state as input and outputs the action. The output action is then fed, together with the state, into the critic network to produce the state-action Q-value. The neural network models of the actor and critic components are listed below:

#### Actor Model

- Local network: two fully connected hidden layers with 400 and 300 units, respectively
- Target network: two fully connected hidden layers with 400 and 300 units, respectively

In each network, a ReLU activation follows each hidden linear layer, and a tanh function at the output bounds each action component to the range -1 to 1.
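The actor's forward pass can be sketched as follows. This is a minimal NumPy illustration, not the project's code: the state and action sizes (33 and 4), the weight initialization, and the helper names `init_actor` and `actor_forward` are all assumptions made for the example.

```python
import numpy as np

STATE_SIZE = 33   # assumption: not stated in this report
ACTION_SIZE = 4   # assumption: not stated in this report


def init_actor(state_size, action_size, rng, hidden=(400, 300)):
    """Create (weight, bias) pairs for state -> 400 -> 300 -> action."""
    sizes = [state_size, *hidden, action_size]
    return [
        (rng.uniform(-0.1, 0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def actor_forward(params, states):
    """Map a batch of states to actions bounded to [-1, 1]."""
    x = states
    for w, b in params[:-1]:
        x = np.maximum(x @ w + b, 0.0)   # hidden linear layer followed by ReLU
    w, b = params[-1]
    return np.tanh(x @ w + b)            # tanh bounds every action component


rng = np.random.default_rng(0)
actor_params = init_actor(STATE_SIZE, ACTION_SIZE, rng)
actions = actor_forward(actor_params, rng.normal(size=(5, STATE_SIZE)))
print(actions.shape)
```

The local and target actors would share this structure and differ only in their parameter values.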


#### Critic Model

- Local network: two fully connected hidden layers with 400 and 300 units, respectively
- Target network: two fully connected hidden layers with 400 and 300 units, respectively

In each network, a ReLU activation follows each hidden linear layer. The action from the actor network is concatenated with the first-layer output and fed into the second layer.
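The point at which the action enters the critic is easier to see in code. The sketch below is a NumPy illustration under assumptions: the state and action sizes, the initialization, the single-unit linear output layer, and the names `init_critic` and `critic_forward` are not taken from the project.

```python
import numpy as np


def init_critic(state_size, action_size, rng, hidden=(400, 300)):
    """Create critic parameters; the second layer also receives the action."""
    h1, h2 = hidden

    def layer(n_in, n_out):
        return rng.uniform(-0.1, 0.1, size=(n_in, n_out)), np.zeros(n_out)

    return {
        "fc1": layer(state_size, h1),
        "fc2": layer(h1 + action_size, h2),  # input: first-layer output + action
        "out": layer(h2, 1),                 # assumption: one scalar Q-value
    }


def critic_forward(params, states, actions):
    """Return Q(s, a) for a batch of states and actions."""
    w, b = params["fc1"]
    x = np.maximum(states @ w + b, 0.0)        # first layer sees the state only
    x = np.concatenate([x, actions], axis=1)   # action joins after the first layer
    w, b = params["fc2"]
    x = np.maximum(x @ w + b, 0.0)
    w, b = params["out"]
    return x @ w + b                           # no activation: Q is unbounded


rng = np.random.default_rng(0)
critic_params = init_critic(state_size=33, action_size=4, rng=rng)
q_values = critic_forward(
    critic_params,
    rng.normal(size=(5, 33)),
    rng.uniform(-1.0, 1.0, size=(5, 4)),
)
print(q_values.shape)
```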

#### Actor Model Update

The loss of the actor model is computed as:

- $actions_{pred} = actor_{local}(states)$
- $actor_{loss} = -mean(critic_{local}(states, actions_{pred}))$

The negative sign is used because the optimizer minimizes the loss: minimizing the negative mean Q-value maximizes the Q-value.
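As a small illustration of the sign convention, the sketch below evaluates the actor loss with toy scalar stand-ins for the two networks. The functions `critic_local`, `poor_actor`, and `good_actor` are hypothetical; in a real implementation the gradient of this loss would be obtained by automatic differentiation and only the actor's parameters would be stepped.

```python
def actor_loss(actor_local, critic_local, states):
    """Negative mean Q-value of the actions the actor proposes."""
    actions_pred = [actor_local(s) for s in states]
    q_values = [critic_local(s, a) for s, a in zip(states, actions_pred)]
    return -sum(q_values) / len(q_values)


def critic_local(state, action):
    """Toy critic (hypothetical): Q is highest when the action equals the state."""
    return -(action - state) ** 2


def poor_actor(state):
    return 0.5 * state   # misses the best action


def good_actor(state):
    return state         # picks the best action under the toy critic


states = [1.0, 2.0, 3.0]
loss_poor = actor_loss(poor_actor, critic_local, states)
loss_good = actor_loss(good_actor, critic_local, states)
print(loss_poor, loss_good)
```

The actor that earns higher Q-values gets the lower loss, which is what gradient descent on this loss pushes toward.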


#### Critic Model Update

The loss of the critic model is computed as:

- $Q_{targets} = rewards + \gamma \, critic_{target}(s_{next}, actor_{target}(s_{next}))$
- $Q_{locals} = critic_{local}(s, a)$
- $loss_{critic} = mean((Q_{targets} - Q_{locals})^2)$
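The critic update in standard DDPG can be sketched as below: the next action comes from the target actor, the bootstrap value comes from the target critic, and the squared error is averaged over the batch. The constant stand-in networks are hypothetical, and the `done` mask that zeroes the bootstrap term at episode ends is an assumption that this report does not mention.

```python
def critic_loss(batch, actor_target, critic_target, critic_local, gamma=0.99):
    """Mean squared TD error over (state, action, reward, next_state, done) tuples."""
    squared_errors = []
    for state, action, reward, next_state, done in batch:
        next_action = actor_target(next_state)            # action from target actor
        q_next = critic_target(next_state, next_action)   # value from target critic
        # Assumption: no bootstrapping past the end of an episode.
        q_target = reward + gamma * q_next * (1.0 - done)
        q_local = critic_local(state, action)
        squared_errors.append((q_target - q_local) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Toy constant networks (hypothetical) so the numbers are easy to follow.
def toy_actor_target(state):
    return 0.0


def toy_critic_target(state, action):
    return 1.0


def toy_critic_local(state, action):
    return 0.5


batch = [
    (0.0, 0.0, 1.0, 0.0, 0.0),  # non-terminal: target = 1 + 0.99 * 1 = 1.99
    (0.0, 0.0, 1.0, 0.0, 1.0),  # terminal:     target = 1
]
loss = critic_loss(batch, toy_actor_target, toy_critic_target, toy_critic_local)
print(loss)
```

Note that the square is taken per sample before averaging; squaring the mean error instead would let positive and negative errors cancel.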

#### Target Model Soft Update

The target model is updated using a soft-update scheme: $\theta_{target} \leftarrow \tau \theta_{local} + (1 - \tau) \theta_{target}$
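A minimal sketch of the soft update on plain lists of numbers is given below. The function name `soft_update` and the flat parameter lists are assumptions for illustration; a real implementation would loop over the tensors of both networks.

```python
def soft_update(local_params, target_params, tau):
    """In place: target <- tau * local + (1 - tau) * target."""
    for i, (local, target) in enumerate(zip(local_params, target_params)):
        target_params[i] = tau * local + (1.0 - tau) * target


TAU = 1e-4  # value from the hyperparameter list in this report

local = [1.0, -2.0]
target = [0.0, 0.0]
for _ in range(10_000):
    soft_update(local, target, TAU)
print(target)
```

With the local parameters held fixed, the remaining gap shrinks by a factor of $(1 - \tau)$ per update, so after $1/\tau$ updates the target has covered roughly 63% of the distance to the local values. This is why a small $\tau$ makes the target network change slowly.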

#### Noise of Action

An Ornstein-Uhlenbeck process was used to add noise to the action to facilitate exploration. The noise parameters are:

- $\mu = 0.0$
- $\theta= 0.15$
- $\sigma= 0.2$
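One common discretization of the process is $x_{t+1} = x_t + \theta(\mu - x_t) + \sigma \epsilon_t$. The sketch below uses the parameters listed above; the unit time step, the Gaussian increments $\epsilon_t$, the seed handling, and the class name `OUNoise` are assumptions and may differ from the project's implementation.

```python
import random


class OUNoise:
    """Temporally correlated noise that is pulled back toward mu."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = [mu] * size
        self.theta = theta
        self.sigma = sigma
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean, e.g. at the start of an episode."""
        self.state = list(self.mu)

    def sample(self):
        """Advance one step and return the new noise vector."""
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)


noise = OUNoise(size=4)
first = noise.sample()
second = noise.sample()
print(first, second)
```

The noise vector would typically be added to the actor's output and the sum clipped back to the valid action range of -1 to 1.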

#### Replay Buffer

A replay buffer with a random sample selection scheme was used to break the correlation between adjacent samples.
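A replay buffer of this kind can be sketched with the standard library alone. The class and field names below are assumptions, and sampling is taken to be uniform without replacement within a batch.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size buffer that returns uniformly sampled minibatches."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest entries drop out when full
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Draw batch_size distinct experiences uniformly at random."""
        return self.rng.sample(self.memory, k=self.batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(buffer_size=3, batch_size=2)
for t in range(5):
    buffer.add(state=t, action=0.0, reward=1.0, next_state=t + 1, done=False)
batch = buffer.sample()
print(len(buffer), [e.state for e in batch])
```

Because the samples in a batch come from unrelated time steps, consecutive and highly correlated transitions rarely appear together in one gradient step.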


## Hyperparameters

The hyperparameters used to train the agent are listed below:

```
BUFFER_SIZE = int(1e6)  # replay buffer size
BATCH_SIZE = 1024       # minibatch size
GAMMA = 0.99            # discount factor
TAU = 1e-4              # for soft update of target parameters
LR_ACTOR = 1e-4         # learning rate of the actor
LR_CRITIC = 1e-3        # learning rate of the critic
WEIGHT_DECAY = 0        # L2 weight decay
UPDATE_EVERY = 2        # update every x time steps
```
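One plausible way for `UPDATE_EVERY` and `BATCH_SIZE` to interact is to run a learning step only on every second environment step, and only once the buffer holds at least one full minibatch. The helper `should_learn` below is hypothetical and sketches that gating logic; the project's actual scheduling may differ.

```python
BATCH_SIZE = 1024   # minibatch size (from the list above)
UPDATE_EVERY = 2    # update every x time steps (from the list above)


def should_learn(step, buffer_len, update_every=UPDATE_EVERY, batch_size=BATCH_SIZE):
    """Return True when a learning update should run at this time step."""
    on_schedule = step % update_every == 0
    enough_data = buffer_len >= batch_size
    return on_schedule and enough_data


print(should_learn(step=2, buffer_len=2048))
print(should_learn(step=3, buffer_len=2048))
print(should_learn(step=2, buffer_len=100))
```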

## Plot of Rewards

The agent solved the environment in 425 episodes. The training history is shown below:

```
Episode 50      Average Score: 0.83
Episode 100     Average Score: 1.60
Episode 150     Average Score: 2.76
Episode 200     Average Score: 5.60
Episode 250     Average Score: 10.91
Episode 300     Average Score: 17.61
Episode 350     Average Score: 23.11
Episode 400     Average Score: 28.22
Episode 425     Average Score: 30.04
Environment solved in 425 episodes!     Average Score: 30.04
```

The resulting rewards plot is shown below:

![History of Rewards](./images/ScoreHistory.png)
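The "Average Score" values in a log like this are typically a rolling mean over recent episodes, with the environment counted as solved once that mean reaches a target. The sketch below assumes this interpretation. The window length, the target of 30, the function names, and the synthetic scores are all illustrative and are not taken from the training run.

```python
from collections import deque


def rolling_averages(scores, window=100):
    """Mean of the most recent `window` scores after each episode."""
    recent = deque(maxlen=window)
    averages = []
    for score in scores:
        recent.append(score)
        averages.append(sum(recent) / len(recent))
    return averages


def first_solved_episode(scores, target=30.0, window=100):
    """1-based episode at which a full-window mean first reaches the target."""
    averages = rolling_averages(scores, window)
    for episode, average in enumerate(averages, start=1):
        if episode >= window and average >= target:
            return episode
    return None


# Synthetic scores for illustration only: 1.0, 2.0, ..., 60.0
synthetic_scores = [float(i) for i in range(1, 61)]
solved_at = first_solved_episode(synthetic_scores, target=30.0, window=10)
print(solved_at)
```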

## Ideas for Future Work

Possible directions for improving the model performance include:

- Prioritized sampling from the replay buffer: using the TD error to determine the selection priority may improve the efficiency of sample usage.
- Incorporating generalized advantage estimation: adding multi-step estimation using a TD($\lambda$) scheme may improve the accuracy of the target estimate.
- Fine-tuning the hyperparameters: since the hyperparameters used to train the model are largely the default values from the course repository, a more systematic tuning of the parameters may improve the performance.
- Trying other deep RL algorithms, such as A2C, A3C, and D4PG.