## Algorithm

The algorithm used to solve the environment is Double Deep Q-learning (Double DQN), as presented in [Deep Reinforcement Learning with Double Q-learning](https://arxiv.org/pdf/1509.06461.pdf).

[Deep Q-learning](https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf) uses a multi-layered neural network to approximate the state-action value function $\mathcal{Q}$. The network takes a state $s$ as input and outputs an approximation of $\mathcal{Q}(s,a)$ for every discrete action $a$.   
To stabilize training, it introduces two important ideas:
- **Replay Memory**: Every one-step interaction (or multi-step interaction, if multi-step returns are used) is stored in a memory called the replay buffer. Once the buffer holds enough transitions, the network is updated on mini-batches sampled uniformly from it. Sampling at random breaks the correlations present in sequentially gathered data.
- **Target Network**: The target network has the same architecture as the online deep Q-network, but its parameters are copied from the online network only at fixed intervals and stay frozen in between. This reduces the correlation between the online estimate and the target value. (The implementation described below uses a soft update with coefficient $\tau$ instead of a periodic hard copy.)
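
The replay memory described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the project's actual code: the names `Transition` and `ReplayBuffer`, the `seed` argument, and the toy integer states are assumptions made for the example.

```python
import random
from collections import deque, namedtuple

# Hypothetical container for one-step interactions (s, a, r, s').
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])


class ReplayBuffer:
    """Fixed-size memory with uniform random sampling."""

    def __init__(self, buffer_size=100_000, batch_size=64, seed=0):
        # A deque with maxlen silently drops the oldest transition when full.
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.memory.append(Transition(state, action, reward, next_state))

    def sample(self):
        # Uniform sampling without replacement breaks temporal correlation.
        return self.rng.sample(self.memory, self.batch_size)

    def __len__(self):
        return len(self.memory)


# Toy usage: states are just integers here.
buffer = ReplayBuffer(buffer_size=50, batch_size=8)
for i in range(100):
    buffer.add(state=i, action=i % 4, reward=0.0, next_state=i + 1)
batch = buffer.sample()
```

Because the buffer is bounded, only the 50 most recent transitions survive in this toy run; the sampled batch is drawn from those.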

If $\theta$ denotes the parameters of the local (online) network and $\theta^{'}$ those of the target network, the loss function at iteration $i$ is:   
$\mathcal{L}_{i}(\theta_{i}) = \mathbb{E}_{(s,a,r,s^{'})\sim\mathcal{U}(\mathcal{D})}\lbrack(r + \gamma \max_{a^{'}}\mathcal{Q}(s^{'},a^{'};\theta^{'}_{i}) - \mathcal{Q}(s,a;\theta_{i}))^{2}\rbrack$   
where $\mathcal{U}(\mathcal{D})$ denotes uniform sampling from the replay buffer $\mathcal{D}$.

The max operator in Deep Q-learning is used both for selecting an action and for evaluating its value. It has been [shown](https://arxiv.org/pdf/1509.06461.pdf) that this leads to overestimation of the state-action values $\mathcal{Q}$, which can result in sub-optimal learning and performance. The overestimation can be reduced by separating the selection of an action from the evaluation of its state-action value. This is the idea behind Double Deep Q-learning.   

In Double Deep Q-learning, the local online Q-network is used for selecting the action and the target network is used for evaluating the corresponding state-action value. This small change greatly reduces the overestimation of state-action values.   
The loss for Double Deep Q-learning is:   
$\mathcal{L}_{i}(\theta_{i}) = \mathbb{E}_{(s,a,r,s^{'})\sim\mathcal{U}(\mathcal{D})}\lbrack(r + \gamma\mathcal{Q}(s^{'},a_{max};\theta^{'}_{i}) - \mathcal{Q}(s,a;\theta_{i}))^{2}\rbrack$   
where $a_{max} = \arg\max_{a^{'}} \mathcal{Q}(s^{'},a^{'};\theta_{i})$
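
To make the difference between the two targets concrete, the sketch below computes both for a single transition. The helper names (`argmax`, `dqn_target`, `double_dqn_target`) and the Q-values are made up for illustration; they are not taken from the project.

```python
def argmax(values):
    """Index of the largest entry in a list of floats."""
    return max(range(len(values)), key=lambda i: values[i])


def dqn_target(reward, gamma, q_target_next):
    # Standard DQN: the target network both selects and evaluates the action.
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, gamma, q_online_next, q_target_next):
    # Double DQN: the online network selects the action,
    # the target network evaluates it.
    a_max = argmax(q_online_next)
    return reward + gamma * q_target_next[a_max]


# Illustrative Q-values for the next state s' over four actions.
q_online_next = [1.0, 2.5, 0.3, 2.0]   # Q(s', . ; theta)
q_target_next = [1.2, 1.8, 0.1, 3.0]   # Q(s', . ; theta')

y_dqn = dqn_target(1.0, 0.99, q_target_next)                        # 1 + 0.99 * 3.0
y_ddqn = double_dqn_target(1.0, 0.99, q_online_next, q_target_next)  # 1 + 0.99 * 1.8
```

In this example the online network prefers action 1, so the Double DQN target uses the target network's value for action 1 (1.8) rather than the target network's own maximum (3.0), giving a smaller target.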

### Algorithm Implemented

1. `Initialize a local Q-network and a target Q-network`.
  - `local Q-network parameters` $\theta$ `and target Q-network parameters` $\theta^{'}$. `Initially` $\theta^{'} = \theta$
2. `Initialize a Replay Buffer` $\mathcal{D}$.
3. `Initialize the start value of epsilon` $\epsilon = 1$
4. `for episode = 1 to M:`
  - `Reset environment and get initial state` $s$
  - `set value of step` $step = 0$
  - `Until the episode is complete do:`
    - `An agent chooses an action as follows:`
      - $a_{max} = argmax_{a^{'}} \mathcal{Q}(s,a^{'};\theta)$   
      `with probability` $1-\epsilon, a = a_{max}$    
      `with probability` $\epsilon, a \sim \mathcal{U}(0,1,2,3)$
    - `Perform action` $a$ `and obtain reward` $r$ `and next state` $s^{'}$
    - `Store `$(s,a,r,s^{'})$ `in` $\mathcal{D}$
    - $s = s^{'}$
    - `update step value` $step = (step + 1) \%$ `UPDATE_EVERY`
    - `if len`$(\mathcal{D})$ ` > UPDATE_AFTER and` $step == 0$   
      - `Sample a mini-batch ` $\mathcal{B}$ `from` $\mathcal{D}$.   
      $\mathcal{B}= (s^{j},a^{j},r^{j},s^{'j})$    
      `where,` $s^{j}$: `states`    
      $a^{j}$: `actions`    
      $r^{j}$: `rewards`    
      $s^{'j}$: `next states`    
      - `Set` $target^{j}= r^{j} + \gamma\mathcal{Q}(s^{'j},a_{max}^{j};\theta^{'})$   
      `where, ` $a_{max}^{j} = \arg\max_{a^{'}} \mathcal{Q}(s^{'j},a^{'};\theta)$
      - `Update local Q-network by minimizing the following loss:` $\frac{1}{|\mathcal{B}|} \sum_{j}(target^{j} - \mathcal{Q}(s^{j},a^{j};\theta))^{2}$
      - `Update target network parameters:`
      - $\theta^{'} = \tau \theta + (1-\tau)\theta^{'}$
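
The control flow of the loop above (epsilon-greedy action selection, the `UPDATE_EVERY` step counter, and the soft target update) can be sketched without any deep-learning framework. The function names, the flat parameter lists, and the fixed Q-values below are assumptions made for illustration; the real networks and the gradient step are omitted.

```python
import random

UPDATE_EVERY = 4
UPDATE_AFTER = 64
TAU = 1e-3


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


def should_learn(step, buffer_len, update_after=UPDATE_AFTER):
    """Learn only once the buffer is large enough and the step counter wraps to 0."""
    return buffer_len > update_after and step == 0


def soft_update(target_params, online_params, tau=TAU):
    """theta' <- tau * theta + (1 - tau) * theta', applied element-wise."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]


# Toy run: 100 environment steps with stand-in parameters and Q-values.
rng = random.Random(0)
step, buffer_len, n_updates = 0, 0, 0
target, online = [0.0, 0.0], [1.0, -1.0]
for _ in range(100):
    action = epsilon_greedy([0.1, 0.7, 0.2, 0.0], epsilon=0.1, rng=rng)
    buffer_len += 1                      # one transition stored per step
    step = (step + 1) % UPDATE_EVERY
    if should_learn(step, buffer_len):
        n_updates += 1                   # the gradient step would happen here
        target = soft_update(target, online)
```

With these settings, learning is triggered on every fourth environment step once more than 64 transitions are stored, and each trigger moves the target parameters a fraction $\tau$ of the way toward the online parameters.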


## Hyper-parameters

- BUFFER_SIZE = $10^{5}$
   - Size of the replay buffer
- BATCH_SIZE = 64 
   - Number of transitions sampled in a mini-batch
- GAMMA = 0.99
   - Discount factor
- TAU = $10^{-3}$ 
   - Parameter for soft update of the target network parameters
- LR = $5\times10^{-4}$
   - Learning rate for critic network
- UPDATE_AFTER = 64
   - Minimum number of samples in the buffer after which learning can start
- UPDATE_EVERY = 4
   - Number of environment steps between network updates
- $\epsilon_{start}$ = 1.0
   - Initial value of $\epsilon$
- $\epsilon_{end}$ = 0.01
   - Final value of $\epsilon$
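
One way to keep these values together is a small immutable configuration object. The sketch below is an assumption about how they could be organised (the `Hyperparameters` class and its field names are hypothetical), and it also derives two rough rules of thumb: the effective planning horizon $1/(1-\gamma)$ and the time constant $1/\tau$ of the soft target update.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperparameters:
    buffer_size: int = 100_000   # BUFFER_SIZE
    batch_size: int = 64         # BATCH_SIZE
    gamma: float = 0.99          # GAMMA
    tau: float = 1e-3            # TAU
    lr: float = 5e-4             # LR
    update_after: int = 64       # UPDATE_AFTER
    update_every: int = 4        # UPDATE_EVERY
    eps_start: float = 1.0
    eps_end: float = 0.01


hp = Hyperparameters()

# Rewards further than roughly 1 / (1 - gamma) steps ahead contribute little.
effective_horizon = 1.0 / (1.0 - hp.gamma)

# The target network lags the online network by roughly 1 / tau updates,
# i.e. 1 / tau * UPDATE_EVERY environment steps.
target_time_constant = 1.0 / hp.tau
env_steps_per_time_constant = target_time_constant * hp.update_every
```

With the values listed above this gives a horizon of about 100 steps and a target-network lag of about 1000 updates, or about 4000 environment steps.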

### Critic Architecture

- The critic network takes the state of an agent and the predicted action as its input.
- An individual state has size 33, so a tensor of shape (33,) is the input to the first linear layer, which has 256 output units.
- The input to the second linear layer is a concatenated tensor: the action tensor from the input is concatenated with the output of the `linear1 + Leaky_ReLU` block. The second layer has 256 output units and is followed by a leaky ReLU layer.
- The third layer has 128 output units and is followed by a leaky ReLU layer.
- The final layer has a single output unit.

The detailed architecture is given below:

1. **linear1**: Linear(in_features=33, out_features=256, bias=True)
2. **Leaky ReLU** layer
3. **Concatenation**: Concatenate the output of the `linear1 + Leaky_ReLU` block with the action tensor of size (4,).
4. **linear2**: Linear(in_features=260, out_features=256, bias=True)
5. **Leaky ReLU** layer
6. **linear3**: Linear(in_features=256, out_features=128, bias=True)
7. **Leaky ReLU** layer
8. **linear4**: Linear(in_features=128, out_features=1, bias=True)

![CriticArchitecture](./critic-architecture.png)
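
The layer listing above maps onto a PyTorch module roughly as follows. This is a sketch based only on the listed layer sizes: the class name `Critic`, the concatenation order (hidden features first, then action), and the default leaky ReLU slope are assumptions, since the source does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Critic(nn.Module):
    """Maps a (state, action) pair to a single scalar value."""

    def __init__(self, state_size=33, action_size=4):
        super().__init__()
        self.linear1 = nn.Linear(state_size, 256)
        # The action is injected after the first hidden layer: 256 + 4 = 260.
        self.linear2 = nn.Linear(256 + action_size, 256)
        self.linear3 = nn.Linear(256, 128)
        self.linear4 = nn.Linear(128, 1)

    def forward(self, state, action):
        x = F.leaky_relu(self.linear1(state))   # default slope is an assumption
        x = torch.cat((x, action), dim=1)       # shape: (batch, 260)
        x = F.leaky_relu(self.linear2(x))
        x = F.leaky_relu(self.linear3(x))
        return self.linear4(x)                  # shape: (batch, 1)


critic = Critic()
q_value = critic(torch.zeros(5, 33), torch.zeros(5, 4))
n_params = sum(p.numel() for p in critic.parameters())
```

The forward pass expects batched inputs, so a single state of shape (33,) would need a leading batch dimension before being passed in.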

### Actor Architecture

The actor takes a state as input (a tensor of size 33) and outputs an action tensor of size (4,).

Detailed architecture is given below:

1. **linear1**: Linear(in_features=33, out_features=256, bias=True)
2. **ReLU** layer
3. **linear2**: Linear(in_features=256, out_features=4, bias=True)
4. **activation**: Tanh()

![ActorArchitecture](./actor-architecture.png)
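
A PyTorch sketch of the actor, following the four listed layers, might look like this. The class name `Actor` and the constructor arguments are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Maps a state to an action with every component in [-1, 1]."""

    def __init__(self, state_size=33, action_size=4):
        super().__init__()
        self.linear1 = nn.Linear(state_size, 256)
        self.linear2 = nn.Linear(256, action_size)

    def forward(self, state):
        x = F.relu(self.linear1(state))
        # Tanh squashes each action component into [-1, 1].
        return torch.tanh(self.linear2(x))


actor = Actor()
actions = actor(torch.randn(5, 33))
```

The final `tanh` keeps every action component bounded, which suits environments whose actions are clipped to that range.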

### Plot of the Rewards

**The environment is solved in 1300 episodes**

![Result](./result.png)

### Future Work

- **Tuning Hyperparameters**: Try different hyper-parameter settings and report how they affect learning time, and whether the model diverges for some settings.
- **Multi-Agent Environment**: Apply the same algorithm to the multi-agent Reacher environment, using a common replay memory to train multiple agents simultaneously.
- **Other Algorithms**: Try PPO and A3C on the environment and compare the results.

### Trained agents playing
![AgentPlaying](./sample_play.gif)

[//]: # (Image References)

[image1]: ./sample_play.gif "Trained Agent"

