# Continuous Control

Project #2 for the Udacity Reinforcement Learning Nanodegree.  
Eduardo Peynetti
__________

## Introduction

For this project, we will work with the [Reacher](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Learning-Environment-Examples.md#reacher) environment.

In this environment, a double-jointed arm can move to target locations. A reward of +0.1 is provided for each step that the agent's hand is in the goal location. Thus, the goal of the agent is to maintain its position at the target location for as many time steps as possible.

The observation space consists of 33 variables corresponding to position, rotation, velocity, and angular velocities of the arm. Each action is a vector with four numbers, corresponding to torque applicable to two joints. Every entry in the action vector should be a number between -1 and 1.

The task is episodic, and in order to solve the environment, the agent must get an average score of +30 over 100 consecutive episodes.
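The solving criterion can be checked with a rolling window over episode scores. The following is a minimal sketch, not the project's code: the function name `episodes_to_solve` and its parameters are hypothetical, and it assumes the reported episode is the first one at which the 100-episode mean reaches the target.

```python
from collections import deque


def episodes_to_solve(scores, target=30.0, window=100):
    """Return the 1-based episode at which the rolling mean over the last
    `window` episodes first reaches `target`, or None if it never does.

    Hypothetical helper for illustration only.
    """
    recent = deque(maxlen=window)  # keeps only the most recent `window` scores
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        # Only evaluate once a full window of episodes is available.
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None
```

With fewer than 100 episodes recorded the function returns `None`, since the criterion is defined over 100 consecutive episodes.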

There are two versions of the environment, one with a single agent and one with 20 agents. A solution for both versions is provided.


## Model

The problem is solved with a Deep Deterministic Policy Gradient (DDPG) architecture, based on the paper [Continuous control with deep reinforcement learning](https://arxiv.org/abs/1509.02971).

The code for the agents can be found in [drltools/agent/agent.py](https://github.com/lalopey/drl/blob/master/drltools/agent/agent.py), under the DDPGAgent class.  
The code for the PyTorch models for the Critic and the Actor can be found in [drltools/model/model.py](https://github.com/lalopey/drl/blob/master/drltools/model/model.py).


### Actor-Critic Method 

DDPG is an Actor-Critic method: it learns both a policy (the Actor) and a value function (the Critic).

Policy-based approaches, which estimate the optimal policy directly, have low or no bias, but they have high variance and require many samples to train. For example, a sampled trajectory might include an action that is good in general but received a low reward by chance. This lowers the odds of picking that action in the future, and the agent only figures out that it was a good action after many more samples.

Value-based approaches estimate the value function. They have lower variance than policy-based methods because they bootstrap, updating each estimate from another estimate at every step, but they can be biased, since those estimates may be incorrect, particularly early in training.


### Q-learning in continuous action space

DDPG brings Q-learning to continuous action domains. In Q-learning, we update the Q-function using the action that maximizes the Q-function in the next state:

$$Q^{new}_{\pi}(s_t,a_t) = Q^{old}_{\pi}(s_t,a_t) + \alpha \left(r_t + \gamma \max_a Q^{old}_{\pi}(s_{t+1}, a) - Q^{old}_{\pi}(s_t,a_t)\right)$$
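For a small, discrete problem, the update rule above can be sketched in a few lines. This is an illustrative tabular version, not part of the project's code: the dictionary layout of `q` and the function name are assumptions made for the example.

```python
def q_learning_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the new value.

    `q` maps each state to a list of action values (hypothetical layout).
    """
    # With a finite set of actions, the max is a simple scan over the row.
    best_next = max(q[next_state])
    td_target = reward + gamma * best_next
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


# Toy table with two states and two actions.
q = {"s0": [0.0, 0.0], "s1": [1.0, 2.0]}
new_value = q_learning_update(q, "s0", 0, reward=0.1, next_state="s1", alpha=0.5, gamma=0.5)
```

The call to `max` is the step that has no direct equivalent when actions are continuous.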

The maximum of the Q-function can't be easily computed over a continuous action space. To work around this issue, DDPG approximates the maximizing action with a deterministic policy.

The Actor network approximates a deterministic policy. This gives us the action that the policy considers best for a given state. We can use this action to estimate the maximum in the Q-function update. The Q-function approximation is the Critic.
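In symbols, writing $\mu$ for the Actor (notation borrowed from the DDPG paper, not used elsewhere in this document) and averaging over a batch of $N$ transitions, a sketch of the resulting updates is as follows. The Critic is regressed toward a TD target in which the Actor's action replaces the maximization, and the Actor is moved toward actions that the Critic scores higher:

$$y_t = r_t + \gamma Q(s_{t+1}, \mu(s_{t+1}))$$

$$L = \frac{1}{N}\sum_t \left(y_t - Q(s_t, a_t)\right)^2$$

$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_t \nabla_a Q(s_t, a)\big|_{a=\mu(s_t)} \nabla_{\theta^{\mu}} \mu(s_t)$$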

### Experience replay

As in Q-learning, we can profit from using experience replay. When the agent interacts with the environment, consecutive observations can be highly correlated. A naive Q-learning algorithm that learns from these experience tuples in sequential order runs the risk of being swayed by this correlation. By instead storing experiences in a replay buffer and later sampling from it at random, a method known as experience replay, we can prevent action values from oscillating or diverging catastrophically. Experience replay also allows us to learn from individual observations multiple times, recall rare events, and in general make better use of our experience.
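A replay buffer of this kind can be sketched with the standard library alone. The class and field names below are hypothetical and chosen for illustration; the project's own buffer may differ, for example by returning PyTorch tensors.

```python
import random
from collections import deque, namedtuple

# One stored transition; the field names are an assumption for this sketch.
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    """Fixed-size buffer that stores experiences and samples them uniformly."""

    def __init__(self, buffer_size, batch_size, seed=None):
        # A bounded deque silently drops the oldest experience when full.
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniform sampling without replacement breaks the temporal correlation
        # between consecutive experiences.
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)
```

Learning would typically start only once `len(buffer)` is at least the batch size, since `sample` cannot draw more experiences than are stored.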

### Fixed Q-targets

As in deep Q-learning, we keep a local and a target network for both the Actor and the Critic. The local network is the one being trained, while the target network is used to compute the prediction targets, which stabilizes training. The target network's weights are updated slowly toward those of the local network.
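One common way to update the target weights slowly is a soft update, θ_target = τ·θ_local + (1 - τ)·θ_target, applied after each learning step. The sketch below uses plain lists of floats to stay self-contained; the function name is hypothetical, and in a PyTorch implementation the same rule would be applied to each parameter tensor.

```python
def soft_update(local_params, target_params, tau=1e-3):
    """Blend local weights into target weights and return the new target weights.

    Plain lists of floats stand in for network parameters in this sketch.
    """
    return [
        tau * local + (1.0 - tau) * target
        for local, target in zip(local_params, target_params)
    ]
```

With a small τ, the target network trails the local network, so the targets used in the Critic update change gradually; τ = 1 would copy the local weights outright.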

### Noise

### Batch Normalization

### Training multiple agents



## Results

### Agent Training and Model Architecture

The model architecture used to solve the problem is as follows:

- A target and local neural network for the Actor, both with the same architecture:
    - One layer with states as input, with 256 nodes. The layer is activated with ReLU and batch normalized.
    - A second layer with 256 nodes, ReLU activated.
    - A final layer that maps into action space, activated with tanh so that each output is between -1 and 1, as required for the actions.
    
- A target and local neural network for the Critic, both with the same architecture:
    - One layer with states as input, with 256 nodes. The layer is activated with ReLU and batch normalized.
    - A second layer with 256 nodes, ReLU activated.
    - A final layer with a single output, the estimated Q-value.

- A batch size of 64 is used.
- The local networks are optimized with Adam, with a learning rate of 1e-4 and 4e-4 for the actor and critic, respectively.
- The $\gamma$ in the TD update is 0.99
- The weights of the target network are updated through a soft update $\theta_{target} = \tau \theta_{local} + (1 - \tau) \theta_{target}$ with $\tau$ = 1e-3.
- A buffer size of 1e6 for experience replay.
- An Ornstein-Uhlenbeck noise process with parameters $\mu=0$, $\theta=0.15$, $\sigma=0.1$ is used
- For the single-agent environment, the buffer is sampled 10 times every 20 steps, while for the 20-agent environment it is sampled twice every four steps.
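The Ornstein-Uhlenbeck process listed above produces temporally correlated noise that is added to the Actor's actions for exploration. The following is a minimal sketch under stated assumptions: the class name `OUNoise` is hypothetical, the time step is taken as 1, and the random increments are assumed to be standard Gaussian, which the text above does not specify.

```python
import random


class OUNoise:
    """Discrete-time Ornstein-Uhlenbeck process (sketch, unit time step)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.1, seed=None):
        self.mu = [mu] * size
        self.theta = theta
        self.sigma = sigma
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Restart the process at its long-run mean, e.g. at episode start.
        self.state = list(self.mu)

    def sample(self):
        # Each component is pulled back toward mu at rate theta and perturbed
        # by Gaussian noise scaled by sigma (assumed distribution).
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)


noise = OUNoise(size=4, seed=0)
action = [0.5, -0.5, 0.9, -0.9]  # placeholder Actor output
# Add noise, then clip so every entry stays in the valid range [-1, 1].
noisy_action = [max(-1.0, min(1.0, a + n)) for a, n in zip(action, noise.sample())]
```

Because each sample depends on the previous state, consecutive noise values are correlated, unlike independent Gaussian noise drawn at every step.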


### DDPG

The single-agent environment is solved with DDPG in 172 episodes.

Episode 100	Average Score: 6.97  
Episode 172	Average Score: 30.11	Episode score (max over agents): 38.92  
Environment solved in 172 episodes with an Average Score of 30.11  

<img src="images/ddpg_reacher.png" width="450"  />

### Multi Agent DDPG

The 20-agent environment is solved in 100 episodes. It converges much faster than the single-agent version, as all 20 arms learn at the same time.

Episode 100	Average Score: 31.53   

<img src="images/ddpg_reacher20.png" width="450"  />

## Future improvements


The DDPG agent can still be improved significantly. Some possible ideas, several of them originally proposed for DQN, are:

- [Dueling DQNs](https://arxiv.org/abs/1511.06581)
- [Learning from multi-step bootstrap agents](https://arxiv.org/abs/1602.01783)
- [Distributional DQNs](https://arxiv.org/abs/1707.06887)
- [Noisy DQNs](https://arxiv.org/abs/1706.10295)


