## Project 2 - Continuous Control 


**Controlling the two-link arm reacher**

Project Goals:

* Implement the TD3 algorithm to control the two-link arm Reacher (Unity environment).
* The agent achieves an average reward of +13 (or higher) over a window of 100 consecutive episodes.
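
The second goal can be checked with a rolling average over the episode scores. The following is a minimal sketch, not the code used in this project: the function name `first_solved_episode` and its parameters are hypothetical, and it assumes that `scores` holds one score per episode.

```python
from collections import deque


def first_solved_episode(scores, target=13.0, window=100):
    """Return the 1-based episode at which the rolling mean first reaches target.

    Hypothetical helper: `scores` is assumed to be a sequence with one
    score per training episode. Returns None if the goal is never met.
    """
    recent = deque(maxlen=window)  # keeps only the last `window` scores
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        # Only evaluate once a full window of episodes is available
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None
```

For example, `first_solved_episode([0.0] * 50 + [13.0] * 100)` returns `150`, because the last 100 scores first average +13 at episode 150.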

[//]: # (Image References)

[image1]: ./data_files/actor_network.png "Actor_Network_Arch"
[image2]: ./data_files/critic_network.png "Critic_Network_Arch"

[single-agent]: ./data_files/exp0_single_agent.png "EXP0_Single_EXP"
[image3]: ./data_files/exp0_single_agent_results.png "EXP0_Single_Agent_Graph"

[update-critic-loss]: ./data_files/critic_loss_function.png "Critic_Loss_Function"
[multi-agent]: ./data_files/exp0_mutli_agent_training.png "EXP0_Multi_EXP"
[image4]: ./data_files/exp0_multi_agent_score_graph.png "EXP0_Multi_Agent_Graph"
[image5]: ./data_files/exp0_multi_agent_eval_results.png "EXP0_Multi_Agent_Eval"

[image6]: ./data_files/exp1_multi_agent_scores_graph.png "EXP1_Multi_Agent_Graph"
[image7]: ./data_files/exp1_multi_agent_eval_result.png "EXP1_Multi_Agent_Eval"


---

### Report


### Learning Algorithm
#### 1. Description
I've implemented the Twin Delayed DDPG (TD3) algorithm to solve the task described above. TD3 is an extension of the vanilla DDPG algorithm that was introduced in this Nanodegree.

TD3 differs from the original DDPG algorithm in three distinct ways:

1. TD3 learns two critic networks, each with its own target network (Q1_target, Q2_target), and uses the smaller of the two target Q-values when forming the learning target. Hence the "Twin" in Twin Delayed DDPG.

2. The local critic networks are updated more frequently than the local actor network and the target networks. A 2:1 update frequency ratio is recommended, i.e. two critic updates for every actor / target network update. Hence the "Delayed" in Twin Delayed DDPG.

3. Noise is added to the target actions with the intent of stabilizing the local critic networks.

 

For further information please refer to OpenAI's Spinning Up documentation here: [OpenAI Spinning Up](https://spinningup.openai.com/en/latest/algorithms/td3.html)
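
As an illustration of points 1 and 3, the sketch below computes a TD3 learning target for a single transition in plain Python. It is not the project's PyTorch code: the helper names (`clip`, `smoothed_target_action`, `td3_target`) and the parameters `noise_std` and `noise_clip` are hypothetical, and the action bounds of [-1, 1] are an assumption about the Reacher environment.

```python
import random


def clip(x, low, high):
    """Clamp x to the closed interval [low, high]."""
    return max(low, min(high, x))


def smoothed_target_action(target_action, noise_std, noise_clip, rng=random):
    """Target policy smoothing: add clipped Gaussian noise to each component.

    Assumption: valid actions lie in [-1, 1].
    """
    noisy = []
    for a in target_action:
        noise = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
        noisy.append(clip(a + noise, -1.0, 1.0))
    return noisy


def td3_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q target: bootstrap from the smaller target-critic value.

    q1_next and q2_next stand for Q1_target and Q2_target evaluated at the
    next state and the smoothed target action.
    """
    return reward + gamma * (1.0 - float(done)) * min(q1_next, q2_next)
```

Taking the minimum of the two critics counteracts the overestimation of Q-values that a single critic tends to produce. In a PyTorch implementation the same operations would be applied to batched tensors.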

<u><b>Modules:</b></u>
- <u>Replay Buffer</u>: Used to store and collect experience tuples (state, action, reward, next_state, done). 
- <u>Agent</u>: Agent class containing act, step, learn, and soft_update functions.  
- <u>ModelsQ</u>: Definition of the deep neural network architectures using PyTorch.
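
To make the role of the replay buffer concrete, here is a minimal sketch using only the standard library. It is an assumption about the general shape of such a module, not the project's actual class: the method names `add` and `sample` and the `seed` parameter are hypothetical.

```python
import random
from collections import deque, namedtuple

# One stored transition, matching the tuple described above
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size buffer of experience tuples with uniform random sampling."""

    def __init__(self, buffer_size, batch_size, seed=0):
        # A deque with maxlen drops the oldest experience once it is full
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        """Store one transition."""
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Return batch_size experiences drawn uniformly without replacement."""
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)
```

A learning step would typically start only once `len(buffer)` is at least the batch size, since sampling more items than the buffer holds raises a `ValueError`.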


#### 2. Final Set of Hyper-parameters for EXP 1:
```python
# Module variables
# Replay buffer
BUFFER_SIZE = int(1e5)  # replay buffer capacity (number of experiences)
BATCH_SIZE = 64  # mini-batch size
# Q-network hyper-parameters
GAMMA = 0.99  # discount factor
TAU = 1e-3  # soft-update rate from local network to target network
LR_ACTOR = 5e-4  # actor learning rate
LR_CRITIC = 5e-4  # critic learning rate
# Update frequencies
UPDATE_CRITIC_EVERY = 10  # number of frames between local critic updates
UPDATE_ACTOR_TARGET = 2 * UPDATE_CRITIC_EVERY  # frames between actor / target updates
NN_NUM_UPDATES = 10

# Noise parameters
NOISE_FACTOR = 0.8
NOISE_MIN_MAX = 0.5
```
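
The sketch below shows one way the update-frequency constants and `TAU` could interact. It is an assumption about how these values are used, not an excerpt from the Agent class: `updates_due` is a hypothetical helper, `soft_update` here operates on plain lists of floats rather than PyTorch parameters, and `NN_NUM_UPDATES` is not modeled.

```python
TAU = 1e-3
UPDATE_CRITIC_EVERY = 10
UPDATE_ACTOR_TARGET = 2 * UPDATE_CRITIC_EVERY


def soft_update(local_params, target_params, tau=TAU):
    """Polyak averaging: target <- tau * local + (1 - tau) * target."""
    return [
        tau * local + (1.0 - tau) * target
        for local, target in zip(local_params, target_params)
    ]


def updates_due(step):
    """Return (update_critic, update_actor_and_targets) for a 1-based step.

    Assumption: the critic is updated every UPDATE_CRITIC_EVERY steps and
    the actor and target networks every UPDATE_ACTOR_TARGET steps.
    """
    update_critic = step % UPDATE_CRITIC_EVERY == 0
    update_actor_and_targets = step % UPDATE_ACTOR_TARGET == 0
    return update_critic, update_actor_and_targets
```

Under this reading, steps 10, 20, 30, and 40 trigger a critic update, while only steps 20 and 40 also trigger an actor and target update, which gives the 2:1 ratio described earlier. With `TAU = 1e-3`, each soft update moves the target weights 0.1% of the way toward the local weights.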

#### 3. Final Model Architectures


![alt text][image1]<center>**Figure 1**</center>


![alt text][image2]<center>**Figure 2**</center>



### Plot of Rewards
<p></p>

<center><b>In this report, I show the results of the very first and the last sets of experiments:</b></center>


Please refer to `./data_files/EXPS_ARCHIVE` to view the initial set of experiments. These experiments show the learning trend, but fail to correctly illustrate the score function.

### Single Agent Experiments

This experiment was a first attempt at solving the task. As shown below, the agent learned very little from the training environment. The actor-to-critic update frequency ratio was set to 1:2, i.e. one actor update for every two critic updates.

![alt text][single-agent]<center>**Figure 3**</center>
![alt text][image3]<center>**Figure 4**</center>


### Multi Agent Experiments

After further experiments with the hyper-parameters (network update frequencies, noise factor values, and different model architectures), the following results were obtained:

Note: the critic loss function was changed relative to the previous experiments (those in EXPS_ARCHIVE used the loss function shown in red):

![alt text][update-critic-loss]<center>**Figure 5**</center>
![alt text][multi-agent]<center>**Figure 6**</center>

#### EXP 0:
![alt text][image4]<center>**Figure 7**</center>
![alt text][image5]<center>**Figure 8**</center>


In the final experiment, I ran the training for a larger number of episodes. The agent learned much faster than in the previous experiment.

#### EXP 1:
![alt text][image6]<center>**Figure 9**</center>
![alt text][image7]<center>**Figure 10**</center>





### Ideas for Future Work

#### TD3:
- Further explore the different combinations of hyper-parameters.
- Implement a PPO algorithm, and compare it to this implementation.
- Run the last experiment (EXP 1) multiple times to check whether the Q-function converges on every run.
- Implement more complex Q-function (critic) architectures.

 