# Training report

---

## Agents

5 types of agents were trained:
- DQN: Vanilla Deep-Q-Network
- DDQN: Double DQN, which uses the target Q-network to evaluate the Q-value during training
- Prioritized: DQN trained with prioritized experience replay
- Dueling: Dueling Networks DQN
- DDQN+Prioritized+Dueling: combines Double DQN, prioritized replay, and dueling networks
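
To make the DQN/DDQN distinction in the list above concrete, here is a minimal sketch of the two target computations for a single transition. It is plain Python rather than the training code used for this report; the function names, the discount factor `gamma=0.99`, and the Q-values are illustrative assumptions.

```python
def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """Vanilla DQN: the target network both selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN: the online network selects the action, the target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


# Made-up Q-values for the 4 actions in the next state (illustration only).
next_q_online = [1.0, 2.5, 0.3, 2.0]
next_q_target = [1.2, 1.8, 0.1, 3.0]

y_dqn = dqn_target(1.0, next_q_target)                          # bootstraps from max(target) = 3.0
y_ddqn = double_dqn_target(1.0, next_q_online, next_q_target)   # bootstraps from target[1] = 1.8
```

In this made-up case the Double DQN target is lower than the vanilla one, because the action is chosen by one network and scored by the other. This is the mechanism intended to reduce Q-value overestimation.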

They all share the same underlying deep learning model, except for the dueling network DQN and the combined agent.
However, the dueling networks are designed so that their total number of weights is similar to that of the single network used in the DQN, DDQN, and Prioritized agents.
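
For reference, the step that distinguishes a dueling head from a single-stream network is how the state value and the per-action advantages are recombined into Q-values. The sketch below uses the mean-subtracted aggregation commonly associated with dueling networks. Whether the agents in this report use exactly this variant is an assumption, and the function name and numbers are illustrative.

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) = V(s) + A(s, a) - mean(A(s, .))."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]


# Made-up stream outputs for one state with 4 actions (illustration only).
q_values = dueling_q_values(2.0, [0.5, -0.5, 1.0, -1.0])
```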

The DQN network is simple, with 2 fully connected hidden layers of 64 nodes each. I didn't try different networks, as this configuration worked well and its limited size appears adequate for the limited dimensions of the environment (a 37-dimensional state, 4 actions).
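
As a rough sizing check for the network described above, the sketch below counts the trainable parameters of a fully connected stack with layer sizes 37, 64, 64, 4. It assumes every layer has a bias term, which is not stated in this report. The helper name is hypothetical.

```python
def dense_param_count(layer_sizes):
    """Count weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# 37 state inputs -> 64 -> 64 -> 4 action values.
n_params = dense_param_count([37, 64, 64, 4])
```

Under that assumption the network has a few thousand parameters, which fits the description of a small model.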

All agents generally share the same hyperparameters. An epsilon-greedy policy is used during training with a per-episode decay of 0.995, consistent with the target of 1000 training episodes (0.995 ^ 100 ≈ 0.6, 0.995 ^ 1000 < 0.01).
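
The decay arithmetic above can be checked with a short sketch. It assumes epsilon starts at 1.0 and is multiplied by the decay factor once per episode. The starting value, the optional floor `eps_min`, and the function name are assumptions, not values taken from the training code.

```python
def epsilon_at(episode, eps_start=1.0, eps_decay=0.995, eps_min=0.0):
    """Exploration rate after `episode` multiplicative decay steps."""
    return max(eps_min, eps_start * eps_decay ** episode)


eps_100 = epsilon_at(100)     # still mostly exploring early in training
eps_1000 = epsilon_at(1000)   # almost fully greedy by the last episode
```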

The agents are compared, score-wise, during training and in testing.
A higher-than-required early termination (success) threshold of 17 is used. All agents reach the minimum required score of 13.
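
One way to implement such an early termination check is to track a rolling average of recent episode scores and stop once it reaches the threshold. The sketch below assumes a 100-episode averaging window, which this report does not specify, and uses a hypothetical helper name.

```python
from collections import deque


def first_solved_episode(episode_scores, threshold=17.0, window=100):
    """Return the 1-based episode at which the rolling mean first reaches the threshold."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(episode_scores, start=1):
        recent.append(score)
        # Only test once a full window of scores is available.
        if len(recent) == window and sum(recent) / window >= threshold:
            return episode
    return None  # threshold never reached within the given episodes


# Tiny illustration with a 3-episode window and made-up scores.
solved_at = first_solved_episode([10, 17, 18, 19], threshold=17.0, window=3)
```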

## Results

#### Training

<table>
<tr>
<td>Agent</td>
<td>Plot</td>
<td>Notes</td>
</tr>
<tr>
<td>DQN</td>
<td><img src="./Report_images/dqn.png" alt="DQN score v/s episode" /></td>
<td>Cleared a score of 13 around episodes 250-300. Did not reach a score of 17 (across several training runs) and seems to plateau around 15.</td>
</tr>   
<tr>
<td>DDQN</td>
<td><img src="./Report_images/double-dqn.png" alt="DDQN score v/s episode" /></td>
<td>Reaches higher overall scores than DQN but did not reach 17 (across several training runs); it also seems to plateau around 16.</td>
</tr>   
<tr>
<td>Prioritized Replay</td>
<td><img src="./Report_images/prioritized.png" alt="Prioritized replay score v/s episode" /></td>
<td>Consistently reached 17 across several training runs (typically around episodes 800-900). Could benefit from more episodes, as the score curve is still rising. In general, the best of the single-improvement networks.</td>
</tr>   
<tr>
<td>Dueling Networks</td>
<td><img src="./Report_images/dueling.png" alt="Dueling DQN score v/s episode" /></td>
<td>Reaches scores >= 16 but typically did not hit 17 in 1000 episodes. The score curve still seems to be rising slightly, so increasing the number of episodes could allow this agent to reach 17.</td>
</tr>   
<tr>
<td>DDQN+Prioritized+Dueling </td>
<td><img src="./Report_images/combined.png" alt="DDQN+Prioritized+Dueling score v/s episode" /></td>
<td>Typically reaches a score of 17 within 1000 episodes. While it doesn't reach higher scores than the simple prioritized replay agent, the combined agent shows the fastest score growth in the initial phase of training, and its trend suggests the training score could increase further.</td>
</tr>   
</table>

#### Testing
<table>
<tr>
<td>Agent</td>
<td>Mean score</td>
<td>Median score</td>
</tr>
<tr>
<td>DQN </td>
<td>15.28</td>
<td>16.5</td>
</tr>   
<tr>
<td>DDQN </td>
<td>16.81</td>
<td>17.0</td>
</tr>    
<tr>
<td>Prioritized replay </td>
<td>17.5</td>
<td>18.0</td>
</tr>    
<tr>
<td>Dueling networks </td>
<td>16.75</td>
<td>17.0</td>
</tr>    
<tr>
<td>DDQN+Prioritized+Dueling </td>
<td>16.84</td>
<td>16.0</td>
</tr>    
</table>

The testing scores generally match the observations from training, with prioritized replay yielding the best mean and median scores.

The combined agent (DDQN+Prioritized+Dueling) doesn't improve much on any of its individual components. Its median score (16.0) is actually lower than that of the simple DQN network (16.5). The findings from the 'rainbow' paper don't quite apply to this environment.

## Future work ideas

* Try more agent variants including the options outlined in the 'rainbow' paper (https://arxiv.org/abs/1710.02298).
* Sweep (shmoo) hyperparameters and explore the trade-off between complexity (network size) and highest score.
* Define and train agents whose state observation is the 2D image, as seen during the game, rather than the processed data provided by the game environment (rays...). The network would include convolutional layers.