# Report on Project 1: Navigation

This report describes the method used to train an agent to collect yellow bananas while avoiding blue bananas in a Unity environment.

## Inputs

- state space: 37 dimensions (real-valued)
- action space: 4 actions
- reward structure: +1 for picking a yellow banana; -1 for picking a blue banana; 0 otherwise

## Method

- The agent receives the state as input and follows an epsilon-greedy policy.
- Because the state space is real-valued, a Q-table is not practical. The Q-function is therefore approximated with a neural network (NN).
- The structure of the NN is as follows:
    - F1 = ReLU (input_state (states = 37) x 64 neurons)
    - F2 = ReLU (F1 x 64 neurons)
    - logits = ReLU (F2 x output_state (actions = 4))
- The NN outputs an estimated action value (expected discounted return) for every action in a given state. An action is then chosen from these action values in an epsilon-greedy manner.
- Two NNs with the same architecture are used: a local network (θ_local) and a target network (θ_target).
- The weights of the local network are updated as follows:
<img src="images/loss_function.png" alt="Drawing" style="width: 500px;"/>
    - where the TD error is the difference between the TD target, computed with the target network (θ_target), and the current Q-value estimate of the local network (θ_local).

- The target network is not trained directly. Its weights are soft-updated toward the local network: θ_target = τ * θ_local + (1 - τ) * θ_target
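
The action selection, TD error, and soft update described above can be sketched without a deep learning framework. This is a minimal illustration on plain Python lists, not the project's training code: the function names are hypothetical, the Q-values are passed in as lists instead of coming from a network, and a squared TD error is assumed as the loss.

```python
import random

GAMMA = 0.99  # discount factor, as listed under Hyperparameters
TAU = 1e-3    # soft-update rate, as listed under Hyperparameters


def epsilon_greedy(q_values, eps, rng=random):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def td_target(reward, next_q_target, done, gamma=GAMMA):
    """r + gamma * max_a' Q_target(s', a'); no bootstrapping at episode end."""
    bootstrap = 0.0 if done else gamma * max(next_q_target)
    return reward + bootstrap


def td_error(q_local, action, target):
    """TD target minus the local network's estimate Q_local(s, a)."""
    return target - q_local[action]


def soft_update(local_params, target_params, tau=TAU):
    """Return tau * theta_local + (1 - tau) * theta_target, element by element."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]


# Example with made-up numbers for a single transition.
q_local_s = [0.3, 0.9, 0.0, 0.1]   # Q_local(s, .) for the 4 actions
next_q = [0.2, 0.5, 0.1, 0.0]      # Q_target(s', .) for the 4 actions
action = epsilon_greedy(q_local_s, eps=0.0)    # eps = 0 means purely greedy
target = td_target(1.0, next_q, done=False)    # 1.0 + 0.99 * 0.5
error = td_error(q_local_s, action, target)
loss = error ** 2                              # assumed squared-error loss
new_target_params = soft_update([1.0, 2.0], [0.0, 0.0])
```

With τ = 1e-3 the target weights move only 0.1% of the way toward the local weights per update, which keeps the TD target slowly changing while the local network learns.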


## Hyperparameters

- BUFFER_SIZE = 100000    # replay buffer size
- BATCH_SIZE = 64         # minibatch size
- GAMMA = 0.99            # discount factor
- TAU = 1e-3              # for soft update of target parameters
- LR = 5e-4               # learning rate 
- UPDATE_EVERY = 4        # how often to update the network
- maximum number of timesteps per episode = 1000
- eps_start=1.0           # Starting epsilon value
- eps_end=0.01            # Minimum value of epsilon
- eps_decay=0.995         # Epsilon decay rate
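
To show how these values could interact, here is a small sketch of an epsilon schedule and a replay buffer. It assumes that epsilon is multiplied by eps_decay once per episode and clipped at eps_end, and that learning happens every UPDATE_EVERY steps once the buffer holds at least BATCH_SIZE transitions. The report does not spell out either detail, and the class and function names are illustrative.

```python
import random
from collections import deque

BUFFER_SIZE = 100000  # replay buffer size
BATCH_SIZE = 64       # minibatch size
UPDATE_EVERY = 4      # how often to update the network


def epsilon_schedule(n_episodes, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Epsilon per episode, assuming multiplicative decay with a floor."""
    eps, schedule = eps_start, []
    for _ in range(n_episodes):
        schedule.append(eps)
        eps = max(eps_end, eps_decay * eps)
    return schedule


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, buffer_size=BUFFER_SIZE, batch_size=BATCH_SIZE):
        self.memory = deque(maxlen=buffer_size)  # oldest entries drop out first
        self.batch_size = batch_size

    def add(self, transition):
        self.memory.append(transition)

    def sample(self):
        """Draw a uniform random minibatch without replacement."""
        return random.sample(self.memory, self.batch_size)

    def __len__(self):
        return len(self.memory)


def should_learn(step, buffer):
    """Learn every UPDATE_EVERY steps once a full minibatch is available."""
    return step % UPDATE_EVERY == 0 and len(buffer) >= buffer.batch_size
```

Under these assumptions epsilon is about 0.05 after 600 episodes and reaches the 0.01 floor after roughly 920 episodes.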

## Rewards plot
The plot below shows the average reward received during training:
![Average reward during training](images/training_graph.png)
The agent receives higher rewards as its experience, i.e. the number of training episodes, increases.

Number of episodes needed to solve the environment = 600
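
The report does not state the criterion behind this number. Assuming the usual benchmark for this environment, an average score of at least +13 over 100 consecutive episodes, the check could be sketched as follows. The function name and its defaults are illustrative.

```python
from collections import deque


def first_solved_episode(scores, window=100, threshold=13.0):
    """Return the first episode (1-based) whose trailing average over
    `window` episodes reaches `threshold`, or None if it never does."""
    recent = deque(maxlen=window)  # keeps only the last `window` scores
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= threshold:
            return episode
    return None
```

Passing the list of per-episode scores collected during training returns the episode at which the moving average first crosses the threshold.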

## Future ideas for improving the agent's performance
- Use a different neural network architecture for approximating the Q-values.
- Implement a double DQN, a dueling DQN, or prioritized experience replay for faster learning and better agent performance.
- Implement hierarchical reinforcement learning with sub-tasks such as:
    - Pick up yellow banana
    - Avoid blue banana
    - Give location of a yellow banana
- Implement policy gradient methods.
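
As an illustration of the double DQN idea listed above, the sketch below shows how its TD target differs from the standard one: the local network selects the next action and the target network evaluates it. This is not part of the current implementation; the function name is hypothetical and the Q-values are passed in as plain lists.

```python
def double_dqn_target(reward, next_q_local, next_q_target, done, gamma=0.99):
    """Double DQN target: choose a' with the local network,
    evaluate Q(s', a') with the target network."""
    if done:
        return reward  # no bootstrapping at the end of an episode
    best_action = max(range(len(next_q_local)), key=next_q_local.__getitem__)
    return reward + gamma * next_q_target[best_action]


# Made-up Q-values for the next state s'.
next_q_local = [0.1, 0.9, 0.2, 0.0]
next_q_target = [0.5, 0.3, 0.8, 0.1]
double_target = double_dqn_target(1.0, next_q_local, next_q_target, done=False)
standard_target = 1.0 + 0.99 * max(next_q_target)
```

In this example the double DQN target is lower than the standard target, because the target network's largest value belongs to an action that the local network does not prefer. Decoupling selection from evaluation in this way is meant to reduce overestimation of the action values.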