# Project 2: Continuous Control - Report

### Algorithm

The algorithm is DDPG (Deep Deterministic Policy Gradient). 
DDPG is 

- model-free : there is no model of the environment; the agent learns a policy by trial and error
- off-policy : the policy is learnt from samples of past interactions stored in a replay buffer, not only from the current interaction with the environment

The DDPG agent has an actor network and a critic network, each with a local and a target version. The actor network learns a deterministic policy that maps any given state to the action currently believed to be best for that state :
$$ s \rightarrow \mu(s ; \theta_{\mu}) $$
The critic network approximates the Q-value function, which maps a state and an action to a value. Evaluated at the action chosen by the actor, it gives :
$$ s \rightarrow Q(s, \mu(s ; \theta_{\mu}); \theta_{Q}) $$

The agent runs episodes of interactions with the environment and learns periodically from its past interactions.
At each step, the agent :
> determines the action a from the current state $ a = \mu_{local}(s; \theta_{\mu_{local}}) $ <br>
> interacts with the environment <br>
> stores the interaction $<s, a, r, s'>$ in its replay buffer <br>
> after a given number of steps, the agent samples a batch of interactions and updates its actor and critic networks along with a soft update of the target networks <br><br>

> updates the critic network (for a sample of interactions $<s, a, r, s'>$):
>>$a' = \mu_{target}(s'; \theta_{\mu_{target}})$<br>
>>$\hat{q} = r + \gamma * Q_{target}(s', a'; \theta_{Q_{target}})$ <br>
>>$q = Q_{local}(s, a; \theta_{Q_{local}})$ <br>
>> Minimize $Loss(\hat{q}, q)$ <br>
>> Update $\theta_{Q_{local}}$ <br>

> updates the actor network (for a sample of interactions $<s, a, r, s'>$):
>> $\tilde{a} = \mu_{local}(s; \theta_{\mu_{local}})$<br>
>> $\tilde{q} = Q_{local}(s, \tilde{a}; \theta_{Q_{local}})$<br>
>> Maximize $Mean(\tilde{q})$<br>
>> Update $\theta_{\mu_{local}}$ <br>

> performs a soft update of the target networks :
>> $ \theta_{\mu_{target}} = \tau * \theta_{\mu_{local}} + (1 - \tau) * \theta_{\mu_{target}}$ <br>
>> $ \theta_{Q_{target}} = \tau * \theta_{Q_{local}} + (1 - \tau) * \theta_{Q_{target}}$<br>

The target networks are used to compute the critic's learning target $\hat{q}$, while the local networks produce the estimate $q$ of the Q-value of the current state and action, the estimate $\tilde{a}$ of the best action, and its Q-value $\tilde{q}$. Only the parameters of the local networks ($\theta_{\mu_{local}}$, $\theta_{Q_{local}}$) are learnt by gradient descent. The parameters of the target networks slowly track those of the local networks through soft updates.
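
As a rough numerical illustration of the critic target and the soft update, here is a minimal sketch in plain Python. The function names `td_target` and `soft_update` and the toy values are hypothetical and chosen only for this example; the real agent applies the same formulas to network parameters rather than to small lists of floats.

```python
def td_target(reward, next_q, gamma=0.99):
    """Critic target: q_hat = r + gamma * Q_target(s', a')."""
    return reward + gamma * next_q


def soft_update(local_params, target_params, tau=1e-3):
    """Return new target parameters: tau * local + (1 - tau) * target."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]


# Toy values (hypothetical, for illustration only).
q_hat = td_target(reward=0.1, next_q=2.0)

local = [1.0, -2.0]   # stands in for the local network parameters
target = [0.0, 0.0]   # stands in for the target network parameters
for _ in range(1000):
    target = soft_update(local, target)

print(q_hat, target)
```

With a small `tau`, each call moves the target parameters only a tiny fraction of the way toward the local ones, so the target used by the critic changes slowly even when the local networks change quickly.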
   

### Implementation
- Environment : Unity environment provided (Reacher.app)
- Agent : DDPG, equipped with actor networks (local and target), critic networks (local and target) and a replay buffer that stores the interactions
- The `ddpg` function runs the agent through episodes of interaction with the environment and triggers learning.
- A checkpoint of the agent's local actor network is saved and can be used to test the agent after training.


#### Hyperparameters

```
BUFFER_SIZE = int(1e5) # Replay buffer size
BATCH_SIZE = 512       # mini batch size
GAMMA = 0.99           # discount factor
TAU = 1e-3             # for soft update of the target parameters
LR_ACTOR = 1e-4        # Learning rate of actor
LR_CRITIC = 1e-3       # Learning rate of critic
UPDATE_EVERY = 4       # Periodicity : learn & soft updates every 4 interactions
```
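
To show how `BUFFER_SIZE`, `BATCH_SIZE` and `UPDATE_EVERY` could interact, here is a minimal sketch of a replay buffer and a learning schedule in plain Python. The `ReplayBuffer` class, the `should_learn` helper and the dummy interactions are hypothetical. Waiting until the buffer holds a full batch before learning is an assumption of this sketch, not something stated in this report.

```python
import random
from collections import deque, namedtuple

BUFFER_SIZE = int(1e5)
BATCH_SIZE = 512
UPDATE_EVERY = 4

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])


class ReplayBuffer:
    """Fixed-size buffer; the oldest interactions are dropped when it is full."""

    def __init__(self, buffer_size=BUFFER_SIZE, batch_size=BATCH_SIZE, seed=0):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.memory.append(Experience(state, action, reward, next_state))

    def sample(self):
        # Uniform sampling without replacement from the stored interactions.
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


def should_learn(step, buffer):
    # Assumption: learn every UPDATE_EVERY steps, once a full batch is available.
    return step % UPDATE_EVERY == 0 and len(buffer) >= buffer.batch_size


buffer = ReplayBuffer()
learn_steps = 0
for step in range(1, 1001):
    # Dummy interaction standing in for <s, a, r, s'>.
    buffer.add(state=step, action=0.0, reward=0.0, next_state=step + 1)
    if should_learn(step, buffer):
        batch = buffer.sample()  # the agent would update its networks here
        learn_steps += 1

print(learn_steps, len(batch))
```

In this sketch, a larger `UPDATE_EVERY` means fewer network updates per interaction, which is one of the trade-offs to explore when tuning the periodicity of learning.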

### Results
![](results.png)



### Further Research
- Tuning hyperparameters : in particular the periodicity of learning and the batch size
- Improving the architectures of the networks : adding an extra layer to the actor networks
- Trying another algorithm : PPO adapted to continuous action spaces; the network would output the statistics ($\mu$, $\sigma$) of a distribution over actions, and the actions would then be sampled from this distribution.
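
The sampling step mentioned for PPO can be sketched as follows, assuming an independent Gaussian per action dimension. The `sample_action` function and the values of `means` and `stds` are hypothetical; in a real agent they would come from the policy network. The log-probability is included because PPO-style objectives use it, but nothing here is taken from an actual implementation in this project.

```python
import math
import random


def sample_action(means, stds, rng):
    """Sample one action per dimension from N(mean, std) and return its log-probability."""
    actions = [rng.gauss(m, s) for m, s in zip(means, stds)]
    log_prob = sum(
        -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for a, m, s in zip(actions, means, stds)
    )
    return actions, log_prob


rng = random.Random(0)
means = [0.0, 0.5, -0.5, 0.1]  # hypothetical network output (mu)
stds = [0.2, 0.2, 0.2, 0.2]    # hypothetical network output (sigma)
actions, log_prob = sample_action(means, stds, rng)
print(actions, log_prob)
```

Unlike the deterministic DDPG actor, this kind of policy explores through the sampling itself, with $\sigma$ controlling how far the actions spread around $\mu$.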
