# Project report: Continuous control

## 1: Learning algorithm

### 1.1: Description of algorithm

Reinforcement learning historically boils down to two kinds of algorithms:

* **Value based methods (Q-learning):** These algorithms try to find the underlying Q-values, represented by $q^*$. In Deep Q-learning, we use neural networks to approximate the true $q^*$, and the objective is then to find the weights, $w$, of the Q-network which best approximates $q^*$. For a finite set of actions, these algorithms often work well.

* **Policy based methods (REINFORCE):** These algorithms directly optimize the policy, without the detour through finding Q-values.

Both of these algorithms have their strengths and weaknesses. A natural question then is, can we combine the strengths and escape the weaknesses? Well, yeah! Make way for *Actor-critic methods!* This best-of-both-worlds approach will

* Create a *Critic* which measures how good the action taken is
* Create an *Actor* which controls the behaviour of the agent

Loosely speaking, we use the actor to take actions, and the critic gives feedback on how good each action was. The actor then learns from the feedback provided by the critic, and at the same time the critic learns to give better feedback next time, based on what happened when the agent took the action.

Okay, enough with this hand-waving stuff. Let's explore the mathematical stuff:

We will have two networks (local and target) for both the actor and the critic, denoted by $\pi(s,\theta)$ and $\pi(s,\theta')$ for the actor and $\hat{q}(s,a,w)$ and $\hat{q}(s,a,w')$ for the critic. Unprimed weights ($\theta$, $w$) belong to the local networks and primed weights ($\theta'$, $w'$) to the target networks.

### Learning step
The learning procedure goes as follows:

1. Sample a batch of BATCH_SIZE experiences from the ReplayBuffer (see [Project 1](https://github.com/oelvetun/Deep-Reinforcement-Learning/tree/master/Project%201%20-%20Navigation) for more details regarding the replay buffer)

2. Get the next actions $a_{t+1} = \pi(s_{t+1},\theta')$

3. Compute the target $Q_{TARGET} = R_t + \gamma \, \hat{q}(s_{t+1},a_{t+1},w')$

4. Compute the expected Q-function $Q_{EXPECTED} = \hat{q}(s_t,a_t,w)$

5. Measure the error $\|Q_{TARGET} - Q_{EXPECTED}\|_{L^2}$ and run backprop to update $w$.

6. Get the predicted actions $a_{PRED} = \pi(s_t,\theta)$.

7. Calculate the actor loss by $-\frac{1}{n}\sum{\hat{q}(s_t,a_{PRED},w)}$, where the sum runs over the $n$ samples in the batch, and perform a backprop step to update $\theta$.

8. Perform a soft update step of the target networks of the actor and the critic:

\begin{eqnarray} w' &=& \tau w + (1-\tau) w' \nonumber \\ \theta' &=& \tau \theta + (1-\tau) \theta' \nonumber \end{eqnarray}
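
The arithmetic in steps 3, 5, 7 and 8 can be sketched in plain Python. This is only an illustration on lists of floats, not the code used in the project: the function names are hypothetical, the networks themselves are left out, and the error in step 5 is assumed to be implemented as a mean squared error over the batch.

```python
def td_target(reward, next_q, gamma=0.99):
    """Step 3: Q_TARGET = R_t + gamma * q_hat(s_{t+1}, a_{t+1}, w')."""
    return reward + gamma * next_q


def critic_loss(q_targets, q_expected):
    """Step 5 (assumed form): mean squared error over the batch."""
    n = len(q_targets)
    return sum((t - e) ** 2 for t, e in zip(q_targets, q_expected)) / n


def actor_loss(q_of_predicted_actions):
    """Step 7: negative mean critic value of the actor's predicted actions."""
    return -sum(q_of_predicted_actions) / len(q_of_predicted_actions)


def soft_update(local_weights, target_weights, tau=1e-3):
    """Step 8: w' <- tau * w + (1 - tau) * w', element by element."""
    return [tau * w + (1.0 - tau) * w_target
            for w, w_target in zip(local_weights, target_weights)]


# Tiny made-up batch of two transitions.
rewards = [1.0, 0.0]
next_qs = [2.0, 4.0]          # critic target values for the next states
targets = [td_target(r, q) for r, q in zip(rewards, next_qs)]
print(critic_loss(targets, [2.5, 3.5]))
print(soft_update([1.0, 1.0], [0.0, 0.0]))
```

With `tau = 1e-3`, each soft update moves the target weights only 0.1% of the way towards the local weights, which is why the targets change slowly.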

### 1.2: Chosen Hyperparameters

* ``BUFFER_SIZE = int(5e5)``:  Chosen size of replay buffer
* ``BATCH_SIZE = 128``:        Chosen batch size of learning examples 
* ``GAMMA = 0.99``:            Discount factor
* ``TAU = 1e-3``:              Interpolation factor for the soft update of the target network weights
* ``LR_ACTOR = 2e-4``:         Learning rate of actor in optimization algorithm
* ``LR_CRITIC = 3e-4``:        Learning rate of critic in optimization algorithm
* ``UPDATE_EVERY = 20``:       Number of actions chosen between each learning step
* ``TIMES_UPDATE = 10``:       Number of batches to run each time we update   
* ``EPSILON = 1``:             Starting value of the noise scale, which declines over the episodes
* ``EPSILON_DECAY = 0.005``:   Noise decay for each episode
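
To make the interplay between ``UPDATE_EVERY`` and ``TIMES_UPDATE`` concrete, here is a minimal sketch that counts how many batch updates such a schedule produces. It assumes steps are counted from 1 and that the replay buffer always holds enough samples. The function name and the 1000-step horizon are hypothetical.

```python
UPDATE_EVERY = 20   # environment steps between learning phases
TIMES_UPDATE = 10   # batches sampled in each learning phase


def learning_passes(total_steps, update_every=UPDATE_EVERY,
                    times_update=TIMES_UPDATE):
    """Count the batch updates performed over `total_steps` environment steps."""
    passes = 0
    for step in range(1, total_steps + 1):
        if step % update_every == 0:
            # A learning phase: sample and learn from `times_update` batches.
            passes += times_update
    return passes


print(learning_passes(1000))  # 50 learning phases with 10 batches each
```

Under these assumptions the agent learns from one batch per two environment steps on average, but in bursts rather than after every step.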

Parameters of the Ornstein-Uhlenbeck noise process:

* ``MU = 0.0``
* ``THETA = 0.15`` 
* ``SIGMA = 0.2``
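
As a rough illustration of how these parameters shape the exploration noise, here is a standard-library sketch of a discrete Ornstein-Uhlenbeck process. Several details are assumptions rather than facts about the project code: the Gaussian increments, the class and variable names, the clipping of the noisy action to [-1, 1], and the subtractive use of ``EPSILON_DECAY``.

```python
import random

MU, THETA, SIGMA = 0.0, 0.15, 0.2


class OUNoise:
    """Discrete OU process: x <- x + theta * (mu - x) + sigma * N(0, 1)."""

    def __init__(self, size, mu=MU, theta=THETA, sigma=SIGMA, seed=0):
        self.size = size
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean, e.g. at the start of an episode."""
        self.state = [self.mu] * self.size

    def sample(self):
        """Advance the process one step and return the new noise vector."""
        self.state = [
            x + self.theta * (self.mu - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)


noise = OUNoise(size=4)
epsilon = 1.0
action = [0.5, -0.5, 0.0, 0.9]                      # hypothetical actor output
noisy = [a + epsilon * n for a, n in zip(action, noise.sample())]
clipped = [max(-1.0, min(1.0, a)) for a in noisy]   # assumed clip to [-1, 1]
epsilon = max(0.0, epsilon - 0.005)                 # assumed subtractive decay
```

``THETA`` pulls the noise back towards ``MU`` at every step, while ``SIGMA`` sets the size of the random kicks, so consecutive samples are correlated rather than independent.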


### 1.3: Neural network

##### Actor

The neural network we use for the *Actor* is a simple feed-forward network with the following layers

* BatchNorm 1
* Layer 1: (state_size, 128)
* ReLU 1
* BatchNorm 2
* Layer 2: (128, 128)
* ReLU 2
* BatchNorm 3
* Layer 3: (128, action_size)
* Tanh

where state_size = 33 and action_size = 4.
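
To get a feeling for the size of this network, the layer list above can be turned into a parameter count. The sketch below assumes that every linear layer has a bias term and that every BatchNorm layer has one learnable scale and one learnable shift per feature. ReLU and Tanh have no parameters. The helper names are hypothetical.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def batchnorm_params(n_features):
    """One learnable scale and one learnable shift per feature."""
    return 2 * n_features


state_size, action_size, hidden = 33, 4, 128

actor_layers = [
    ("BatchNorm 1", batchnorm_params(state_size)),
    ("Layer 1", linear_params(state_size, hidden)),
    ("BatchNorm 2", batchnorm_params(hidden)),
    ("Layer 2", linear_params(hidden, hidden)),
    ("BatchNorm 3", batchnorm_params(hidden)),
    ("Layer 3", linear_params(hidden, action_size)),
]

for name, count in actor_layers:
    print(f"{name}: {count}")

total_actor_params = sum(count for _, count in actor_layers)
print("total:", total_actor_params)
```

Under these assumptions the actor has 21958 trainable parameters, most of them in the 128 x 128 middle layer.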

##### Critic

The neural network we use for the *Critic* is a simple feed-forward network with the following layers

* Layer 1: (state_size, 256)
* ReLU 1
* BatchNorm
* Layer 2: (cat(128, action_size), 128)
* ReLU 2
* Layer 3: (128, action_size)

## 3: Plot of Rewards

The algorithm needed 39 episodes to solve the problem. The plot below shows the reward received in each episode.

<img src="scores.png" width="500">

## 4: Ideas for Future Work

There are several ways to improve the performance of the agent. Specifically, one could 

* Try to add noise directly to the parameters, instead of to the action. This has been shown to often give superior performance. The algorithm is described [here](https://arxiv.org/abs/1706.01905).
* Spend much more effort on tuning the hyperparameters. What would be the optimal choice of network architecture? Could we change the learning parameter, or the importance sampling parameters, to improve the learning? With more time on our hands, we could spend a lot of time on this. That being said, I have already spent quite some time on this, so we have already come a long way. Be aware that reinforcement learning algorithms can easily diverge with the wrong hyperparameters. In particular, the DDPG method applied here is relatively unstable.
* Consequently, it would be interesting to implement more stable and advanced algorithms, such as Trust Region Policy Optimization (TRPO) or Truncated Natural Policy Gradient (TNPG).
* Make the algorithm more realistic by using raw pixel data as input, instead of the sensor readings of velocity, rotation etc. This would bring its observations closer to human perception.