## Solution description
### Deep Q Network
To find the optimal policy for this project, I used a deep neural network to approximate the Q value function. Every layer except the output layer uses a ReLU activation, and dropout with a probability of 0.1 is applied.
To solve the Unity environment, I used the deep Q network with the following hyperparameters:
```python
NETWORK_LINEAR_SIZES = "1024,512,256" # dimension for every layer in Q network
```
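As a rough illustration of how the size string maps to the network's linear layers, the sketch below expands it into `(inputs, outputs)` pairs. The helper `layer_shapes` and the values of `STATE_SIZE` and `ACTION_SIZE` are assumptions made for this example; they are not stated in this document and should be replaced with the actual dimensions of the environment.

```python
NETWORK_LINEAR_SIZES = "1024,512,256"  # hidden layer sizes, as configured above
STATE_SIZE = 37   # assumption: size of the observation vector
ACTION_SIZE = 4   # assumption: number of discrete actions


def layer_shapes(sizes, state_size, action_size):
    """Return (in_features, out_features) for each linear layer, output layer last."""
    hidden = [int(s) for s in sizes.split(",")]
    dims = [state_size] + hidden + [action_size]
    return list(zip(dims[:-1], dims[1:]))


shapes = layer_shapes(NETWORK_LINEAR_SIZES, STATE_SIZE, ACTION_SIZE)
# Weights plus biases for every linear layer
n_params = sum(n_in * n_out + n_out for n_in, n_out in shapes)

for n_in, n_out in shapes:
    print(f"Linear({n_in} -> {n_out})")
print("trainable parameters:", n_params)
```

ReLU and dropout would sit between these layers, with no activation after the last one, so that the outputs can be read directly as Q values.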

```python
BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 64         # minibatch size
GAMMA = 0.99            # discount factor
TAU = 1e-3              # for soft update of target parameters
LR = 5e-4               # learning rate 
UPDATE_EVERY = 4        # how often to update the network
```

### Experience replay buffer
An experience replay buffer with a capacity of `1e5` stores the experience tuples. Each tuple contains the input state, the chosen action, the corresponding reward from the environment and the next state. At each training step, 64 tuples are randomly sampled from the buffer to train the Q network.
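The buffer described above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual class: the names `Experience` and `ReplayBuffer` and the `seed` argument are assumptions, and the tuple holds only the four fields listed in the text.

```python
import random
from collections import deque, namedtuple

# One stored transition, with the four fields described above
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])


class ReplayBuffer:
    """Fixed-size buffer; the oldest tuples are dropped once it is full."""

    def __init__(self, buffer_size=int(1e5), batch_size=64, seed=0):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.memory.append(Experience(state, action, reward, next_state))

    def sample(self):
        # Uniform sampling without replacement
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(buffer_size=100, batch_size=64)
for step in range(150):
    buffer.add(state=step, action=step % 4, reward=0.0, next_state=step + 1)
batch = buffer.sample()
```

Sampling uniformly at random breaks the correlation between consecutive transitions, which is the main reason for using a replay buffer. Training should only start once the buffer holds at least one full batch.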

### Epsilon-greedy action selection
At every timestep, the agent selects a random action from the action space (explore) with probability epsilon; otherwise it selects the action with the highest Q value (exploit). Epsilon-greedy action selection encourages the agent to explore the environment more at the start of the training session. Epsilon is initialized at `1.0` and is reduced after every episode, down to a minimum value of `0.01`.
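A minimal sketch of this selection rule and the per-episode decay follows. The start and minimum values come from the text; the multiplicative schedule and the value of `EPS_DECAY` are assumptions, since the document does not say how epsilon is reduced. The function names are also illustrative.

```python
import random

EPS_START = 1.0    # initial epsilon, from the text
EPS_END = 0.01     # minimum epsilon, from the text
EPS_DECAY = 0.995  # assumption: multiplicative decay applied once per episode


def select_action(q_values, eps, rng=random):
    """Return a random action with probability eps, else the greedy action."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


def decay_epsilon(eps, eps_end=EPS_END, eps_decay=EPS_DECAY):
    """Shrink epsilon after an episode without going below the minimum."""
    return max(eps_end, eps * eps_decay)


eps = EPS_START
for episode in range(3):
    action = select_action([0.1, 0.9, 0.3, 0.2], eps)
    eps = decay_epsilon(eps)
```

With `eps = 0.0` the agent is fully greedy, which is the usual setting when evaluating a trained agent.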

### Temporal difference with fixed Q target
The agent is trained with temporal-difference (TD) learning using the fixed Q-target technique, which means two Q networks are maintained: a local network and a target network. At every training step, a batch of experience tuples is randomly sampled from the replay buffer. The target value for each tuple is computed by feeding the next state to the target network, taking the highest output value, multiplying it by the discount factor `GAMMA` and adding the reward. The input state is fed into the local network, and the Q value it predicts for the action stored in the tuple is moved toward the target value. The local network is updated at every training step, while the target network is only updated every few steps (4 in this project) using a soft update controlled by the hyperparameter `τ`. Simply put, the parameters of the target network are updated by the following formula every 4 training steps: `θ− = τ×θ + (1−τ)×θ−`, where `θ−` is the target network parameter and `θ` is the local network parameter. For this project, I set `τ` to `1e-3`, so the target network tracks the local network slowly.
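The two calculations in this section, the TD target and the soft update, can be sketched on plain Python lists. This is an illustration under simplifying assumptions rather than the project's training code: the function names are hypothetical, network parameters are represented as flat lists of floats, and terminal states are not treated specially because the experience tuple described here has no "done" flag.

```python
GAMMA = 0.99  # discount factor
TAU = 1e-3    # soft update rate


def td_target(reward, next_q_target, gamma=GAMMA):
    """reward + gamma * max over actions of Q_target(next_state, action)."""
    return reward + gamma * max(next_q_target)


def soft_update(local_params, target_params, tau=TAU):
    """Blend the target parameters toward the local ones:
    new_target = tau * local + (1 - tau) * target
    """
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]


# Target for one tuple, given the target network's outputs for the next state
target = td_target(reward=1.0, next_q_target=[0.5, 2.0, 1.0, 0.0])

# One soft update step on toy parameters
new_target_params = soft_update(local_params=[1.0, -1.0], target_params=[0.0, 0.0])
```

Because `TAU` is small, each soft update moves the target parameters only 0.1% of the way toward the local parameters, which keeps the TD targets stable while the local network is being trained.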

## Results
The agent solved the environment (average score >= 13) after 450 episodes:
![alt text](https://i.ibb.co/31Vnt38/Figure-1.png "Average agent score")

## Future improvement
Use a Q network with more layers, and apply Double DQN, Prioritized Experience Replay and Dueling DQN.