## Project 3 - Collaboration and Competition 


**Learning to play table tennis**

Project Goals:

* Implement the TD3 algorithm to train the agent to play table tennis.
* The agent reaches a maximum score of 2.5 or greater.

[//]: # (Image References)

[image1]: ./data_files/actor_network.png "Actor_Network_Arch"
[image2]: ./data_files/critic_network.png "Critic_Network_Arch"


[image3]: ./data_files/multi_agent_exp0.png "EXP0_Multi_Agent_Graph"
[image4]: ./data_files/multi_agent_exp1.png "EXP1_Multi_Agent_Graph"


---

### Report


### Learning Algorithm

I've modified my project 2 implementation to solve this multi-agent learning task. The algorithm description is as follows:

#### 1. Description
I've implemented the Twin Delayed DDPG (TD3) algorithm to solve the task described above. TD3 is an extension of the vanilla DDPG algorithm that was introduced in this Nanodegree.

TD3 differs from the original DDPG algorithm in three distinct ways:

1. TD3 learns two critic (Q-function) networks, each with its own target network (Q1_target, Q2_target), and uses the smaller of the two target values when forming the learning target. Hence the "Twin" in Twin Delayed DDPG.

2. The local critic networks are updated more frequently than the local actor network and the target networks. The recommended update frequency ratio is 2:1, i.e. two critic updates for every actor / target network update. Hence the "Delayed" in Twin Delayed DDPG.

3. Noise is added to the target actions with the intent of stabilizing the local critic networks: smoothing the target makes it harder for the actor to exploit sharp errors in the Q-value estimates.
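
The three differences above can be sketched with plain scalars. This is a minimal illustration, not the code in agent.py: the function names are hypothetical, the action range of [-1, 1] is an assumption, and the noise standard deviation (0.2), noise clip (0.5), and policy delay (2) are the defaults given in the Spinning Up description rather than values taken from this project.

```python
import random

def smoothed_target_action(target_action, noise_std=0.2, noise_clip=0.5,
                           low=-1.0, high=1.0, rng=random):
    """Target policy smoothing: add clipped Gaussian noise to a target action."""
    noise = rng.gauss(0.0, noise_std)
    noise = max(-noise_clip, min(noise_clip, noise))    # clip the noise itself
    return max(low, min(high, target_action + noise))   # keep the action in range

def td3_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q target: bootstrap from the smaller target-critic value."""
    return reward + gamma * (1.0 - done) * min(q1_next, q2_next)

def should_update_actor(step, policy_delay=2):
    """Delayed updates: actor and targets update once every policy_delay critic updates."""
    return step % policy_delay == 0
```

Both local critics would be regressed toward the same `td3_target` value, which is what keeps either one from overestimating on its own.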

 

For further information please refer to OpenAI's Spinning Up documentation here: [OpenAI Spinning Up](https://spinningup.openai.com/en/latest/algorithms/td3.html)

<u><b>Modules:</b></u>
- <u>Replay Buffer</u>: Used to store and collect experience tuples (state, action, reward, next_state, done). 
- <u>Agent</u>: Agent class containing act, step, learn, and soft_update functions.  
- <u>ModelsQ</u>: Definition of the deep neural network architectures using PyTorch.
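
As a rough picture of the Replay Buffer module, here is a minimal standard-library sketch. The class layout, method names, and `seed` argument are assumptions for illustration; the project's own buffer may differ (for example, it likely converts sampled batches to PyTorch tensors).

```python
import random
from collections import deque, namedtuple

# One stored transition, matching the tuple described above.
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)

class ReplayBuffer:
    """Fixed-size buffer that stores experience tuples and samples them uniformly."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest entries are dropped first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniform sampling without replacement from the stored experiences.
        return self.rng.sample(list(self.memory), k=self.batch_size)

    def __len__(self):
        return len(self.memory)
```

Learning would typically start only once `len(buffer)` is at least `BATCH_SIZE`, so that a full batch can be sampled.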


#### 2. Final Set of Hyper-parameters for EXP 1:
```python
#Module Variables
#Replay Buffer
BUFFER_SIZE = int(1e5) # memory replay buffer size
BATCH_SIZE = 64 # batch size
#Q Network Hyper-parameters
GAMMA = 0.99 # discount factor
TAU = 1e-3 # for soft update from local network to target network
LR_ACTOR = 5e-4 # actor learning rate
LR_CRITIC = 5e-4 # critic learning rate
#Update frequencies
UPDATE_CRITIC_EVERY = 1 # update the local critic every N time steps
UPDATE_ACTOR_TARGET = 2 * UPDATE_CRITIC_EVERY
NN_NUM_UPDATES = 2

#Noise Parameters
NOISE_SCALE = 1.0
```
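
To make the roles of `TAU` and the update frequencies concrete, here is a small sketch that operates on plain lists of floats instead of PyTorch parameters. The helper names `soft_update` and `update_schedule` are hypothetical, and the scheduling logic is an assumption about how the constants are used, not a copy of agent.py.

```python
TAU = 1e-3
UPDATE_CRITIC_EVERY = 1
UPDATE_ACTOR_TARGET = 2 * UPDATE_CRITIC_EVERY

def soft_update(local_params, target_params, tau=TAU):
    """Return tau * local + (1 - tau) * target, element-wise."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]

def update_schedule(t):
    """Return (update_critic, update_actor_and_targets) for time step t."""
    update_critic = t % UPDATE_CRITIC_EVERY == 0
    update_actor = t % UPDATE_ACTOR_TARGET == 0
    return update_critic, update_actor
```

With these values the critic is updated at every time step and the actor and target networks at every second step, which gives the 2:1 ratio described earlier. A small `TAU` means the target networks trail the local networks slowly.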

#### 3. Final Model Architectures


![alt text][image1]<center>**Figure 1**</center>


![alt text][image2]<center>**Figure 2**</center>



### Plot of Rewards


#### Multi Agent Experiments


#### EXP 0:

This first experiment used the following noise parameter values:

```python
#Please refer to line 139 inside agent.py
#n_factor=0.4 - target actions noise factor
actions_next = self.act(next_states, n_factor=0.4, use_target=True, add_noise=True)

#Please refer to the Tennis jupyter notebook
#n_start=0.8 - start of local actions noise factor
train_agentTD3(agent, exp_name='EXP0', n_episodes=3000, print_every=50, max_t=1000,
               n_start=0.8, n_end=0.0001, n_decay=0.995)
```


![alt text][image3]<center>**Figure 3**</center>

As Figure 3 shows, the agent only starts to learn an effective policy once the noise factor has decayed to smaller values.
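
The decay behind this observation can be sketched as follows, assuming the noise factor is multiplied by `n_decay` once per episode and floored at `n_end`. The notebook may implement the schedule differently, and `noise_factor` is a hypothetical helper.

```python
def noise_factor(episode, n_start=0.8, n_end=0.0001, n_decay=0.995):
    """Noise factor after `episode` episodes of multiplicative decay (assumed schedule)."""
    return max(n_end, n_start * n_decay ** episode)

# Print the assumed noise factor at a few points during training.
for episode in (0, 500, 1000, 2000):
    print(episode, round(noise_factor(episode), 4))
```

Under this assumption, a lower `n_start` shifts the whole curve down, so the low-noise regime is reached in fewer episodes.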

#### EXP 1:

After evaluating the results from experiment EXP0, both noise factor values were decreased, on the intuition that less noise would let the agent start learning sooner.

```python
#Please refer to line 139 inside agent.py
#n_factor=0.25 - target actions noise factor
actions_next = self.act(next_states, n_factor=0.25, use_target=True, add_noise=True)

#Please refer to the Tennis jupyter notebook
#n_start=0.4 - start of local actions noise factor
train_agentTD3(agent, exp_name='EXP1', n_episodes=3000, print_every=50, max_t=1000,
               n_start=0.4, n_end=0.0001, n_decay=0.995)
```

![alt text][image4]<center>**Figure 4**</center>

With lower noise parameter values, the agent learns an effective policy faster and more stably (see Figure 4).



### Ideas for Future Work

#### TD3: (Same basic ideas from the last project)
- Further explore the different combinations of hyper-parameters.
- Implement a PPO algorithm, and compare it to this implementation.
- Run the last experiment (EXP1) multiple times to check whether the Q-function converges every time.
- Implement more complex Q-functions.

 