# Navigation project report - Course 2 “Value-Based Methods” - Nanodegree program “Deep Reinforcement Learning” - Udacity

## The purpose

In the provided Unity “Banana” environment, the aim is to train an agent to collect yellow bananas while avoiding blue bananas, with the goal of scoring at least 13 in an episode. Each episode lasts at most 300 time steps (a limit imposed by the environment). The objective is to obtain a rolling average score over 100 episodes greater than or equal to 13. An optional objective is to reach this target in fewer than 1800 episodes.
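
The solving criterion can be sketched as a rolling mean over the last 100 episode scores. This is a minimal illustration, not the notebook's code; the function name `first_solved_episode` and its parameters are hypothetical.

```python
from collections import deque

TARGET_SCORE = 13.0  # threshold stated in the project objective
WINDOW = 100         # number of episodes in the rolling average


def first_solved_episode(scores, target=TARGET_SCORE, window=WINDOW):
    """Return the 1-based episode at which the rolling mean first reaches
    `target`, or None if that never happens (hypothetical helper)."""
    recent = deque(maxlen=window)  # keeps only the last `window` scores
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        # The average only counts once a full window of episodes is available.
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None
```

With this convention, the environment cannot be declared solved before episode 100, since the window must be full.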

For this, the agent relies on Deep Q-Network (DQN), a value-based reinforcement learning method in which the action-value function is estimated by a neural network.

## The method

The DQN method is implemented in Python with the [PyTorch](https://pytorch.org/) library. It uses a neural network to approximate the action-value function. The agent learns from the rewards it receives by interacting with the environment, adjusting its actions to maximize the long-term cumulative reward.

The neural network used as an estimator introduces instability into the learning process. To reduce it, we use a replay memory (**experience replay**) to store past experiences and a target network (**fixed Q-targets**) to compute the target values. The target network is updated less frequently than the main network, which reduces the variance of updates.
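
A minimal sketch of how a target network can be kept in sync with the main network is given here. The report does not say whether the target is refreshed by a periodic full copy or by a gradual blend, so both variants are shown. The stand-in network, the function names, and the value of `tau` are assumptions for illustration.

```python
import copy

import torch
import torch.nn as nn

# Hypothetical stand-in for the main (online) Q-network: 37 inputs, 4 actions.
online_net = nn.Sequential(nn.Linear(37, 64), nn.ReLU(), nn.Linear(64, 4))

# The target network starts as an exact copy and is never trained directly.
target_net = copy.deepcopy(online_net)
target_net.eval()


def hard_update(online, target):
    """Copy all weights from the online network into the target network."""
    target.load_state_dict(online.state_dict())


def soft_update(online, target, tau=1e-3):
    """Blend weights: target <- tau * online + (1 - tau) * target.
    `tau` is an assumed value, not taken from the report."""
    with torch.no_grad():
        for t_param, o_param in zip(target.parameters(), online.parameters()):
            t_param.copy_(tau * o_param + (1.0 - tau) * t_param)
```

Either way, the target values change more slowly than the network being trained, which is the point of the technique.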

Since using the DQN algorithm requires a large number of iterations to converge on an optimal policy, it is essential to set up exploration and exploitation mechanisms. The agent must explore different actions to discover those that lead to high rewards, while exploiting the knowledge acquired to maximize the reward at each stage.

The learning loop relies on regular updates of the neural network weights, which minimize a loss between the estimated action value and the target action value. The loss function used is the mean squared error (MSE), which measures the difference between the values predicted by the network and the target values.
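
The loss described above can be sketched as follows for one batch of transitions. This is an illustrative version of the standard DQN target, not the notebook's code. The function name, the tensor shapes (column vectors of shape `(batch, 1)`), and the default `gamma` are assumptions.

```python
import torch
import torch.nn.functional as F


def dqn_loss(online_net, target_net, states, actions, rewards,
             next_states, dones, gamma=0.99):
    """MSE between Q(s, a) and the target r + gamma * max_a' Q_target(s', a').

    Assumed shapes: states/next_states (B, state_size); actions (B, 1) int64;
    rewards and dones (B, 1) float, with dones equal to 1.0 at episode end.
    """
    # Estimated value of the action actually taken in each state.
    q_pred = online_net(states).gather(1, actions)
    with torch.no_grad():
        # Best next-state value according to the target network.
        q_next = target_net(next_states).max(dim=1, keepdim=True)[0]
        # No bootstrapping from terminal states.
        q_target = rewards + gamma * q_next * (1.0 - dones)
    return F.mse_loss(q_pred, q_target)
```

The returned scalar would then be passed through `loss.backward()` and an optimizer step on the main network only.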

Between updates, the agent interacts a set number of times with the environment, choosing actions with an $\epsilon$-greedy strategy: it chooses a random action with probability $\epsilon$, and the best known action with probability $1 - \epsilon$. This strategy allows the agent to explore new actions while exploiting the knowledge it has acquired.
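
The $\epsilon$-greedy rule can be sketched in a few lines. The function name and the plain-list representation of the action values are assumptions made for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of action values.

    With probability `epsilon` a uniformly random action is returned
    (exploration); otherwise the action with the highest value is
    returned (exploitation).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

For example, `epsilon_greedy([0.1, 0.9, 0.3, 0.2], epsilon=0.0)` always selects action 1, the greedy choice.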

These interactions are stored in the replay memory, which is a circular buffer of fixed size. At each update, a random sample of experiences is drawn from the replay memory to train the neural network on a batch of interactions. This breaks the correlation between successive experiences and improves learning stability.
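
A fixed-size circular buffer with uniform random sampling can be sketched with the standard library alone. The class and field names are hypothetical and are not taken from the project's source.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Circular buffer: once full, the oldest experience is dropped."""

    def __init__(self, capacity, seed=None):
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation.
        return self.rng.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```

In a training loop, sampling would typically start only once `len(buffer)` is at least the batch size.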

The neural network used in the DQN algorithm is a fully connected network (linear layers with ReLU activations) which takes the state of the environment as input and outputs one action value per possible action. The input size corresponds to the size of the environment's state space, and the output size corresponds to the number of possible actions.

In [1]:
from unityagents import UnityEnvironment
env = UnityEnvironment(file_name="Banana.app")
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: BananaBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 37
        Number of stacked Vector Observation: 1
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Here's an overview of the neural network architecture used in the DQN algorithm, in relation to the “Banana” environment:

In [2]:
from models.linear_v1 import QNetworkLinear

model = QNetworkLinear(
    state_size=brain.vector_observation_space_size,
    action_size=brain.vector_action_space_size,
)
print(model)

QNetworkLinear(
  (fc1): Linear(in_features=37, out_features=64, bias=True)
  (relu1): ReLU()
  (fc2): Linear(in_features=64, out_features=64, bias=True)
  (relu2): ReLU()
  (fc3): Linear(in_features=64, out_features=4, bias=True)
)
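
A module that would print the same summary as above can be sketched as follows. This is a reconstruction from the printed layer names and sizes only. The actual `models/linear_v1.py` source is not shown in this report, so the class name `QNetworkSketch`, the `hidden_size` parameter, and the body of `forward` are assumptions.

```python
import torch
import torch.nn as nn


class QNetworkSketch(nn.Module):
    """Three linear layers with ReLU activations between them."""

    def __init__(self, state_size, action_size, hidden_size=64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.relu2 = nn.ReLU()
        self.fc3 = nn.Linear(hidden_size, action_size)

    def forward(self, state):
        x = self.relu1(self.fc1(state))
        x = self.relu2(self.fc2(x))
        # No activation on the output: action values are unbounded.
        return self.fc3(x)
```

With `state_size=37` and `action_size=4`, a batch of states of shape `(B, 37)` yields action values of shape `(B, 4)`.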


## Results

### Training

We obtain the following results when running the training part of the `Training DQN Agent.ipynb` notebook on a MacBook Air M2:

<img src="./images/model_weights_275_solved training stats.png" alt="Stats from training with Training DQN Agent.ipynb notebook" width="600"/>

- Training the agent enables it to solve the environment in 521 episodes (well below the 1800 episodes indicated for the challenge), with a constraint on the number of time steps per episode of 250 (instead of the 300 imposed by the environment).
- We can see that the agent solves the environment more and more quickly as training progresses, as indicated by the average time-step curve for successfully solving the environment (i.e. reaching the score of 13).

The model weights associated with the agent are saved in the file `model_weights_275_solved.pth` and can be used to test the agent in the environment.



### Evaluation

We obtain the following results running the evaluation part of the `Training DQN Agent.ipynb` notebook:

<img src="./images/model_weights_275_solved evaluation stats.png" alt="Stats from evaluation with Training DQN Agent.ipynb notebook" width="600"/>

We note that the objectives set were achieved, even with tighter constraints on the number of episodes to solve the problem, the maximum number of time steps available per episode, and the stability of the rolling average over 100 episodes. We also note that the rolling average remains stable even after 100 episodes. In the end, only one successful episode fails to meet the imposed limit on the number of time steps.

On the other hand, a non-negligible proportion of episodes (~29%) are not solved, but these are offset by the high scores of the successful ones.

A video of the agent's interaction with the environment is available on [YouTube](https://youtu.be/G3rj4Yoc8bQ).

<img src="./images/youtube video.png" alt="Youtube video" width="400"/>

## Ideas for future work

Some ideas for improving the DQN method:
- optimize the hyperparameters related to the neural network structure (number of layers, layer size, etc.) and to the DQN algorithm (learning rate, batch size, discount factor $\gamma$, exploration rate $\epsilon$, number of time steps per episode, etc.);
- study the importance of network initialization;
- use other DQN extensions, such as Double DQN, Dueling DQN, Prioritized Experience Replay, etc.;
- study the effect of negative rewards:
    - adding more significant penalties?
    - how to increase potential medium-term gains from actions that produce negative rewards in the short term (by adjusting $\gamma$ and the length of a sequence);
- train from the pixel state space to see if the agent can perform better, and what cost this entails compared to learning with the “discrete” state space.
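
As an illustration of the Double DQN extension mentioned in the list above, the target computation can be sketched as follows. This variant was not implemented in this project. The function name, the tensor shapes, and the default `gamma` are assumptions.

```python
import torch


def double_dqn_targets(online_net, target_net, rewards, next_states,
                       dones, gamma=0.99):
    """Double DQN target: the online network selects the next action,
    the target network evaluates it.

    Assumed shapes: next_states (B, state_size); rewards and dones (B, 1).
    """
    with torch.no_grad():
        # Action selection with the online network.
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # Action evaluation with the target network.
        q_next = target_net(next_states).gather(1, best_actions)
        return rewards + gamma * q_next * (1.0 - dones)
```

Decoupling selection from evaluation is intended to reduce the overestimation of action values that the plain `max` operator can cause.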

Other ideas for improvements:
- study the failure rate of episodes when the average reward reaches the target;
- how many training episodes are needed to increase the rate of successful episodes in evaluation?
- what is the maximum average score that can be achieved?