## Introduction

The goal of this project is to train an agent to navigate a large rectangular world and pick up as many yellow bananas as possible while avoiding blue bananas. The state space has 37 dimensions, which include the agent's velocity and ray-based perception. The action space has 4 discrete actions (forward, back, left, right).

## Problems

Traditional Q-learning relies on a fixed-size table of states and actions, S x A, where S is the total number of states and A is the total number of actions. In a continuous environment this table would become too large to be usable in practice. To overcome this problem we use a DQN (<a href="https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf">Deep Q Network</a>). It uses a neural network as a Q-value function approximator: the current environment observation is the input, and the output is a Q-value for each possible action.
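
To make the input and output shapes concrete, here is a minimal sketch of a Q-value approximator written in plain Python. It is not the network used in this project: the two-layer layout, the random weights, and the helper names (`init_layer`, `dense`, `q_values`) are assumptions for illustration, and a real implementation would use a deep learning library and train the weights.

```python
import random

STATE_SIZE = 37   # observation size stated in the introduction
ACTION_SIZE = 4   # forward, back, left, right
HIDDEN_SIZE = 64  # assumed width; the real architecture is not shown here

def init_layer(n_in, n_out, rng):
    """Return (weights, biases) with small random weights (untrained)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(x, layer, relu=False):
    """Apply one fully connected layer, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out

def q_values(state, layers):
    """Map one observation to one Q-value estimate per action."""
    hidden = dense(state, layers[0], relu=True)
    return dense(hidden, layers[1])

rng = random.Random(0)
layers = [init_layer(STATE_SIZE, HIDDEN_SIZE, rng),
          init_layer(HIDDEN_SIZE, ACTION_SIZE, rng)]
state = [rng.random() for _ in range(STATE_SIZE)]  # stand-in observation
q = q_values(state, layers)
greedy_action = max(range(ACTION_SIZE), key=lambda a: q[a])
```

A single forward pass yields all 4 action values at once, so picking the greedy action is just an argmax over the output.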



### Correlated data and catastrophic forgetting
Training a neural network on experiences in the order they occur raises two related problems. Consecutive transitions are highly correlated, and deep networks suffer from catastrophic forgetting: when something new is learned it tends to replace what was previously learned rather than add to it. To overcome this we use a replay buffer. We store tuples of (state, action, reward, next observation) in a buffer and sample them out of order, which breaks up the sequence and helps prevent this data correlation.
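
The replay buffer described above can be sketched with the standard library alone. The class and field names here (`ReplayBuffer`, `Experience`, the `done` flag) are illustrative assumptions, not the project's actual code.

```python
import random
from collections import deque, namedtuple

# One stored transition; `done` marks the end of an episode (assumed field).
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-size buffer that returns uniformly sampled, out-of-order batches."""

    def __init__(self, capacity, seed=None):
        self.memory = deque(maxlen=capacity)  # oldest entries are dropped when full
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sampling without replacement breaks the temporal ordering of transitions.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(capacity=100_000, seed=0)
for t in range(200):
    buffer.add([float(t)], t % 4, 0.0, [float(t + 1)], False)
batch = buffer.sample(64)
```

The capacity of 100,000 and batch size of 64 match the hyperparameter table; the dummy transitions only exist to exercise the buffer.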

### Q-learning and continuous space
Since our state space is continuous, it quickly becomes untenable to use a tabular method like Q-learning. To overcome that we use a <a href="https://en.wikipedia.org/wiki/Universal_approximation_theorem">function approximator</a>.
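
A rough back-of-the-envelope calculation shows why a table does not scale. The sketch below assumes, purely for illustration, that each of the 37 state dimensions is discretized into a fixed number of bins; the project itself does not discretize the state.

```python
STATE_DIMS = 37   # state dimensions from the introduction
NUM_ACTIONS = 4   # forward, back, left, right

def q_table_entries(bins_per_dim):
    """Number of Q-table cells if every dimension is split into `bins_per_dim` bins."""
    return (bins_per_dim ** STATE_DIMS) * NUM_ACTIONS

for bins in (2, 10):
    print(f"{bins} bins per dimension -> {q_table_entries(bins):.3e} table entries")
```

Even the coarsest split of 2 bins per dimension needs roughly 5.5e11 entries, and 10 bins per dimension needs 4e37, which is why a function approximator is used instead.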

## Implementation

The algorithm uses two neural networks, Qlocal and Qtarget, which are identical in architecture. The target network lags behind the local network. Once we hit our replay buffer size we synchronize the two networks. The lag is there so that the local network is not chasing a moving target of Q-values. The overall DQN algorithm is shown below.



<img src="dqnalgo.png" align="left">
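
The role of the two networks can be sketched as follows. The target network supplies the bootstrap value in the learning target, and a synchronization step copies the local weights into it. Function names and the flat-list representation of weights are assumptions for illustration; when the synchronization is triggered is left to the caller.

```python
GAMMA = 0.99  # discount factor from the hyperparameter table

def td_targets(rewards, next_q_target, dones, gamma=GAMMA):
    """Compute y = r + gamma * max_a Qtarget(s', a); no bootstrap on terminal steps."""
    targets = []
    for r, q_next, done in zip(rewards, next_q_target, dones):
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(r + bootstrap)
    return targets

def hard_update(local_params, target_params):
    """Synchronize the networks by copying local weights into the target in place."""
    target_params[:] = list(local_params)

# Two transitions: one ongoing, one terminal. Q-values come from the target network.
rewards = [1.0, 0.0]
next_q_target = [[0.5, 2.0, 1.0, 0.0], [3.0, 1.0, 0.0, 0.0]]
dones = [False, True]
targets = td_targets(rewards, next_q_target, dones)

local_weights = [0.1, -0.2, 0.3]
target_weights = [0.0, 0.0, 0.0]
hard_update(local_weights, target_weights)
```

Because the targets depend only on Qtarget, they stay fixed between synchronizations while Qlocal is being trained toward them.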



### Hyperparameters
<table border="3" align="left">
    <tr>
    <th>Hyperparameter</th>
    <th>Value</th>
    </tr>
    <tr>
    <td>Replay buffer size</td>
    <td>100,000</td>
    </tr>
    <tr>
    <td>gamma</td>
    <td>0.99</td>
    </tr>
    <tr>
    <td>learning rate</td>
    <td>0.02</td>
    </tr>
    <tr>
    <td>hidden layer</td>
    <td>64</td>
    </tr>
    <tr>
    <td>batch size</td>
    <td>64</td>
    </tr>
    <tr>
    <td>update network</td>
    <td>4</td>
    </tr>
</table>


## Results

Below are the results for solving the environment, which requires an average score of at least 13 over 100 consecutive episodes.

<img src="results.png" align="left">

## Ideas for improvement

Double DQN -> DQN has been shown to overestimate action values. Double DQN has been shown to reduce this overestimation.
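
As a hedged sketch of how Double DQN changes the learning target: the local network selects the next action and the target network evaluates it, instead of taking the maximum over the target network's own values. The function name and the list-based inputs are assumptions for illustration and are not part of this project's code.

```python
def double_dqn_targets(rewards, next_q_local, next_q_target, dones, gamma=0.99):
    """y = r + gamma * Qtarget(s', argmax_a Qlocal(s', a)) for non-terminal steps."""
    targets = []
    for r, q_loc, q_tgt, done in zip(rewards, next_q_local, next_q_target, dones):
        if done:
            targets.append(r)  # terminal step: no bootstrap term
            continue
        best_action = max(range(len(q_loc)), key=lambda a: q_loc[a])  # select with local
        targets.append(r + gamma * q_tgt[best_action])               # evaluate with target
    return targets

# The target network's largest value (9.0) belongs to an action the local network
# does not prefer, so Double DQN bootstraps from 1.0 instead.
dd_targets = double_dqn_targets(
    rewards=[0.0],
    next_q_local=[[1.0, 5.0, 2.0, 0.0]],
    next_q_target=[[3.0, 1.0, 9.0, 0.0]],
    dones=[False],
)
```

In this example the plain DQN target would be 0.99 * 9.0, while the Double DQN target is 0.99 * 1.0, which illustrates how decoupling selection from evaluation dampens overestimates.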

Prioritized experience replay -> This method samples more important transitions with a higher probability, which makes the agent learn more efficiently.
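
One common way to define "importance" is the magnitude of a transition's TD error. The sketch below assumes that proportional scheme; the exponent `alpha`, the small constant `eps`, and the function names are illustrative choices rather than values used in this project. It also omits the importance-sampling correction that a full implementation would apply to the loss.

```python
import random

def sampling_probabilities(td_errors, alpha=0.6, eps=1e-5):
    """Turn TD errors into probabilities; alpha=0 would recover uniform sampling."""
    priorities = [(abs(e) + eps) ** alpha for e in td_errors]  # eps keeps every p > 0
    total = sum(priorities)
    return [p / total for p in priorities]

def sample_indices(td_errors, batch_size, rng, alpha=0.6):
    """Draw buffer indices (with replacement) in proportion to their priority."""
    probs = sampling_probabilities(td_errors, alpha)
    return rng.choices(range(len(td_errors)), weights=probs, k=batch_size)

td_errors = [0.01, 0.5, 2.0, 0.0]  # stand-in errors for four stored transitions
probs = sampling_probabilities(td_errors)
indices = sample_indices(td_errors, batch_size=8, rng=random.Random(0))
```

Transitions with larger errors are revisited more often, while the `eps` term ensures that even a transition with zero error can still be sampled.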