## Description

In this project, an agent moves through a square world, collecting yellow bananas and avoiding blue bananas by adjusting its navigation. The state space consists of the agent's velocity and ray-based perception of objects around the agent's forward direction. The task is episodic: using the state information together with the rewards given below, the agent is trained to choose the best actions, with the goal of reaching an average score of +13 over 100 consecutive episodes.

* Reward of yellow banana: +1 
* Reward of blue banana: -1
* Dimension of the state space: 37

Available actions (discrete): 

* 0 - move forward.
* 1 - move backward.
* 2 - turn left.
* 3 - turn right.




## Model

We use Deep Q-learning to approximate the Q-function. The network is a 2-layer DNN trained with the Adam optimizer:
* First layer (37, 64)
* Second layer (64, 4)

where the input dimension is the dimension of the state space (37) and the output dimension of the last layer is the number of actions (4).
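
The layer shapes above can be illustrated with a framework-free forward pass. This is a minimal sketch, not the project's code: the ReLU activation between the two layers, the uniform weight initialization, and the names `init_layer`, `linear`, and `q_values` are assumptions made for illustration.

```python
import math
import random

STATE_SIZE = 37   # dimension of the state space
HIDDEN_SIZE = 64  # width of the first layer
ACTION_SIZE = 4   # number of discrete actions


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for a dense layer; the init scheme is an assumption."""
    bound = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def linear(x, layer):
    """Compute W x + b for one dense layer."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def q_values(state, layers):
    """Map a 37-dimensional state to 4 action values (the ReLU in between is assumed)."""
    first, second = layers
    hidden = [max(0.0, h) for h in linear(state, first)]
    return linear(hidden, second)


rng = random.Random(0)
layers = (init_layer(STATE_SIZE, HIDDEN_SIZE, rng),
          init_layer(HIDDEN_SIZE, ACTION_SIZE, rng))
state = [rng.uniform(-1.0, 1.0) for _ in range(STATE_SIZE)]  # placeholder state
values = q_values(state, layers)
# The greedy action is the index of the largest estimated action value.
greedy_action = max(range(ACTION_SIZE), key=lambda a: values[a])
```

The output has one entry per action, so picking the greedy action reduces to an argmax over four numbers.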


To improve the performance of Q-learning, we use the Experience Replay and Fixed Q-value methods. With Experience Replay, a buffer stores past experiences, and the learning step randomly samples from them, so each experience can be used multiple times. Random sampling helps to avoid harmful correlations between sequential experiences.
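
The replay mechanism described here could look roughly like the following. It is a sketch under assumptions: the class name `ReplayBuffer`, the `Experience` tuple fields, and uniform sampling without replacement are illustrative choices, not necessarily what the project implements.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    """Fixed-size buffer; the oldest experiences are dropped once it is full."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Draw a minibatch uniformly at random, without replacement."""
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(buffer_size=int(1e5), batch_size=64)
for step in range(200):
    # Placeholder transitions; real ones would come from the environment.
    buffer.add(state=[0.0] * 37, action=step % 4, reward=0.0,
               next_state=[0.0] * 37, done=False)
batch = buffer.sample()
```

Because the minibatch is drawn from the whole buffer, consecutive transitions from one episode rarely end up in the same update.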

In the Fixed Q-value method, we aim to avoid the instability caused by the TD target depending on the same weights w that are being updated, so we compute the target with a separate target network that has an identical architecture.
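
As a rough illustration of this idea, the sketch below computes a TD target from the target network's outputs and then nudges the target weights toward the online weights. The interpolation form `tau * local + (1 - tau) * target` is the common soft-update rule and is an assumption here; the function names and the toy weight lists are hypothetical.

```python
GAMMA = 0.99  # discount factor
TAU = 1e-3    # interpolation factor for the soft update


def td_target(reward, next_q_target, done, gamma=GAMMA):
    """TD target r + gamma * max_a Q_target(s', a); no bootstrapping at episode end."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def soft_update(local_weights, target_weights, tau=TAU):
    """Blend target weights toward local ones: tau * local + (1 - tau) * target."""
    return [tau * w_local + (1.0 - tau) * w_target
            for w_local, w_target in zip(local_weights, target_weights)]


local_weights = [1.0, -2.0, 0.5]   # toy stand-in for the online network's weights
target_weights = [0.0, 0.0, 0.0]   # toy stand-in for the target network's weights
target_weights = soft_update(local_weights, target_weights)

# Action values for the next state, as the target network might report them.
y = td_target(reward=1.0, next_q_target=[0.2, 0.5, -0.1, 0.0], done=False)
```

With a small `tau`, the target network changes slowly, so the TD target stays nearly fixed between consecutive updates of the online network.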



### Hyperparameters

    BUFFER_SIZE = int(1e5)  # replay buffer size
    BATCH_SIZE = 64         # minibatch size
    GAMMA = 0.99            # discount factor
    TAU = 1e-3              # for soft update of target parameters
    LR = 5e-4               # learning rate 
    UPDATE_EVERY = 4        # how often to update the network

    n_episodes=1800         # maximum number of training episodes
    max_t=1000              # maximum number of timesteps per episode
    eps_start=1.0           # starting value of epsilon, for epsilon-greedy action selection
    eps_end=0.01            # minimum value of epsilon
    eps_decay=0.995         # multiplicative factor (per episode) for decreasing epsilon
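
To show how the epsilon values above interact, here is a minimal sketch of the per-episode decay together with epsilon-greedy action selection. The function names `epsilon_schedule` and `select_action` are hypothetical, and the action values in the example are made up.

```python
import random


def epsilon_schedule(n_episodes=1800, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Return the epsilon used in each episode under multiplicative decay with a floor."""
    schedule = []
    eps = eps_start
    for _ in range(n_episodes):
        schedule.append(eps)
        eps = max(eps_end, eps_decay * eps)
    return schedule


def select_action(action_values, eps, rng=random):
    """Epsilon-greedy: a random action with probability eps, else the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(action_values))
    return max(range(len(action_values)), key=lambda a: action_values[a])


schedule = epsilon_schedule()
# With eps=0.0 the random branch is never taken, so the greedy action is returned.
action = select_action([0.1, 0.7, -0.3, 0.2], eps=0.0)
```

With these settings, epsilon reaches its floor of 0.01 after a little over 900 episodes, so the agent explores heavily early on and acts almost greedily afterwards.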

## Evaluation

![p1_score](p1_scores.png)

An average score of 13.06 over the last 100 episodes was achieved in 515 episodes.
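
The stopping criterion can be checked with a rolling window over the episode scores. The sketch below is illustrative only: `first_solved_episode` is a hypothetical helper, and the scores are synthetic, not the project's results.

```python
from collections import deque

TARGET_SCORE = 13.0  # average score that counts as solved
WINDOW = 100         # number of consecutive episodes averaged


def first_solved_episode(scores, target=TARGET_SCORE, window=WINDOW):
    """Return the 1-based episode at which the rolling mean first reaches target, else None."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None


# Synthetic scores for illustration only; these are not the project's results.
scores = [0.0] * 50 + [14.0] * 150
solved_at = first_solved_episode(scores)
```

The check only fires once a full window of 100 episodes is available, so early high scores cannot trigger it on their own.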

## Future Work

As future work, I would like to experiment with variations of DQN, e.g. double DQN and dueling DQN, as well as prioritized experience replay, to improve the agent's performance.