## Deep Reinforcement Learning Project 2: Continuous Control

This report describes the environment and the algorithm we used to train an agent that moves a double-jointed arm to a target location and keeps it there for as long as possible.

A reward of +0.1 is provided for each step that the agent's hand is in the goal location. Thus, the goal of the agent is to maintain its position at the target location for as many time steps as possible.

The observation space consists of 33 variables corresponding to the position, rotation, velocity, and angular velocities of the arm. Each action is a vector of four numbers, corresponding to the torques applied to the two joints. Every entry in the action vector should be a number between -1 and 1.
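As a small illustration of the action bound described above, the sketch below clamps a 4-dimensional action vector to the range [-1, 1] before it is sent to the environment. The helper name `clip_action` and the sample values are hypothetical and are not taken from the project code.

```python
import numpy as np

def clip_action(raw_action, low=-1.0, high=1.0):
    """Clamp every entry of an action vector to the interval [low, high]."""
    return np.clip(np.asarray(raw_action, dtype=np.float64), low, high)

# Hypothetical action, e.g. a policy output after exploration noise was added.
noisy_action = [0.3, -1.7, 1.2, -0.05]
print(clip_action(noisy_action))
```

Clipping matters mainly when exploration noise is added to the actor output, since the tanh layer alone already keeps the noise-free action inside the bound.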

### 1. Environment

For this project we used the [Reacher](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Learning-Environment-Examples.md#reacher) environment to design, train, and evaluate our deep reinforcement learning algorithm.

#### Characteristics of the environment

+ Unity Academy name: Academy
    - Number of Brains: 1

+ Unity brain name: ReacherBrain
    - Number of Visual Observations (per agent): 0
    - Vector Observation space type: continuous
    - Vector Observation space size (per agent): 33
    - Number of stacked Vector Observation: 1
    - Vector Action space type: continuous
    - Vector Action space size (per agent): 4
    - Vector Action descriptions: , , ,

### 2. Learning Algorithm

To solve this problem we use the DDPG (Deep Deterministic Policy Gradient) algorithm. The environment provides 33 continuous observation variables and expects a 4-dimensional continuous action. Both networks use the rectified non-linearity (ReLU) for all hidden layers. The final output layer of the actor is a tanh layer, which bounds the actions to [-1, 1]. The actor and the critic each have 2 hidden layers with 400 and 300 units, respectively; in the critic, the action is concatenated with the output of the first hidden layer. Below is the PyTorch implementation of the model.

#### PyTorch implementation

```python
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F

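def hidden_init(layer):
    """Return the symmetric uniform range used to initialize a hidden layer."""
    # Note: nn.Linear stores its weight as (out_features, in_features), so
    # size()[0] is the layer's output width rather than its true fan-in.
    fan_in = layer.weight.data.size()[0]
    lim = 1. / np.sqrt(fan_in)
    return (-lim, lim)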

class Actor(nn.Module):
    """Actor (Policy) Model."""

    def __init__(self, state_size, action_size, seed, fc1_units=400, fc2_units=300):
        """Initialize parameters and build model.
        Params
        ======
            state_size (int): Dimension of each state
            action_size (int): Dimension of each action
            seed (int): Random seed
            fc1_units (int): Number of nodes in first hidden layer
            fc2_units (int): Number of nodes in second hidden layer
        """
        super(Actor, self).__init__()
        self.seed = torch.manual_seed(seed)
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.fc3 = nn.Linear(fc2_units, action_size)
        self.reset_parameters()

    def reset_parameters(self):
        self.fc1.weight.data.uniform_(*hidden_init(self.fc1))
        self.fc2.weight.data.uniform_(*hidden_init(self.fc2))
        self.fc3.weight.data.uniform_(-3e-3, 3e-3)

    def forward(self, state):
        """Build an actor (policy) network that maps states -> actions."""
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        # torch.tanh replaces the deprecated F.tanh; the output is identical.
        return torch.tanh(self.fc3(x))


class Critic(nn.Module):
    """Critic (Value) Model."""

    def __init__(self, state_size, action_size, seed, fcs1_units=400, fc2_units=300):
        """Initialize parameters and build model.
        Params
        ======
            state_size (int): Dimension of each state
            action_size (int): Dimension of each action
            seed (int): Random seed
            fcs1_units (int): Number of nodes in the first hidden layer
            fc2_units (int): Number of nodes in the second hidden layer
        """
        super(Critic, self).__init__()
        self.seed = torch.manual_seed(seed)
        self.fcs1 = nn.Linear(state_size, fcs1_units)
        self.fc2 = nn.Linear(fcs1_units+action_size, fc2_units)
        self.fc3 = nn.Linear(fc2_units, 1)
        self.reset_parameters()

    def reset_parameters(self):
        self.fcs1.weight.data.uniform_(*hidden_init(self.fcs1))
        self.fc2.weight.data.uniform_(*hidden_init(self.fc2))
        self.fc3.weight.data.uniform_(-3e-3, 3e-3)

    def forward(self, state, action):
        """Build a critic (value) network that maps (state, action) pairs -> Q-values."""
        xs = F.relu(self.fcs1(state))
        x = torch.cat((xs, action), dim=1)
        x = F.relu(self.fc2(x))
        return self.fc3(x)
```
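To get a feel for the size of these networks, the following sketch counts the trainable parameters implied by the layer sizes above (33 inputs, hidden layers of 400 and 300 units, 4 actions). The helper `linear_params` is a hypothetical name introduced only for this calculation.

```python
def linear_params(in_features, out_features):
    """Number of weights plus biases in one fully connected layer."""
    return in_features * out_features + out_features

STATE_SIZE, ACTION_SIZE = 33, 4
FC1_UNITS, FC2_UNITS = 400, 300

# Actor: state -> 400 -> 300 -> action
actor_params = (linear_params(STATE_SIZE, FC1_UNITS)
                + linear_params(FC1_UNITS, FC2_UNITS)
                + linear_params(FC2_UNITS, ACTION_SIZE))

# Critic: state -> 400, then (400 + action) -> 300 -> 1
critic_params = (linear_params(STATE_SIZE, FC1_UNITS)
                 + linear_params(FC1_UNITS + ACTION_SIZE, FC2_UNITS)
                 + linear_params(FC2_UNITS, 1))

print(actor_params, critic_params)
```

The critic is slightly larger than the actor because its second layer also receives the 4 action values.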

### 3. Hyperparameters

We used Adam to train both networks, with learning rates of 1e-3 for the actor and 1e-4 for the critic. The critic's Q-value target uses a discount factor of gamma = 0.99, and the soft target updates use tau = 1e-3.

```python
BUFFER_SIZE = int(1e6)  # Replay buffer size
BATCH_SIZE = 128        # Minibatch size
GAMMA = 0.99            # Discount factor
TAU = 1e-3              # For soft update of target parameters
LR_ACTOR = 1e-3         # Learning rate of the actor
LR_CRITIC = 1e-4        # Learning rate of the critic
WEIGHT_DECAY = 0        # L2 weight decay
UPDATE_EVERY = 20       # How often to update the network
EPSILON = 1.0           # To control noise
EPSILON_DECAY = 1e-6    # To gradually decrease noise
```
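As a rough illustration of where `GAMMA` and `TAU` enter the training step, the sketch below computes a one-step TD target for the critic and applies a soft update to target parameters. It is a simplified stand-in, not the project's training code: NumPy arrays play the role of network parameters, and the function names `td_target` and `soft_update` are assumptions made for this example.

```python
import numpy as np

GAMMA = 0.99  # Discount factor
TAU = 1e-3    # Soft update interpolation factor

def td_target(rewards, next_q_values, dones, gamma=GAMMA):
    """One-step target y = r + gamma * Q_target(s', a') * (1 - done)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    next_q_values = np.asarray(next_q_values, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    return rewards + gamma * next_q_values * (1.0 - dones)

def soft_update(local_params, target_params, tau=TAU):
    """In place: theta_target <- tau * theta_local + (1 - tau) * theta_target."""
    for local, target in zip(local_params, target_params):
        target *= (1.0 - tau)
        target += tau * local

# Toy usage: two transitions, the second one terminal.
print(td_target([0.1, 0.0], [2.0, 5.0], [0, 1]))

local = [np.ones(3)]
target = [np.zeros(3)]
soft_update(local, target)
print(target[0])
```

With tau = 1e-3 the target networks move only 0.1% of the way toward the local networks at each update, which keeps the critic's targets slowly changing.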

### 4. DDPG Algorithm

<p align="center"> 
    <img src="Assets/DDPG_Algorithm.png" align="left" alt="drawing" width="700px">
</p>
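DDPG learns from past transitions stored in a replay buffer. The sketch below shows one minimal way to implement such a buffer using only the standard library, together with a check that triggers learning every `UPDATE_EVERY` steps. The class and function names (`ReplayBuffer`, `should_learn`) and the toy sizes are assumptions for illustration and may differ from the project's actual code.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-size buffer that stores transitions and samples them uniformly."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest entries drop out first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Return batch_size distinct transitions chosen uniformly at random."""
        return self.rng.sample(self.memory, self.batch_size)

    def __len__(self):
        return len(self.memory)

def should_learn(step, buffer, update_every):
    """Learn only every update_every steps, once a full batch is available."""
    return step % update_every == 0 and len(buffer) >= buffer.batch_size

# Toy usage with small sizes (the report uses BUFFER_SIZE=1e6, BATCH_SIZE=128).
buffer = ReplayBuffer(buffer_size=5, batch_size=2)
for t in range(7):
    buffer.add(state=t, action=0.0, reward=0.1, next_state=t + 1, done=False)
print(len(buffer), buffer.sample())
```

Sampling uniformly from a large buffer breaks the correlation between consecutive transitions, which is the main reason off-policy methods such as DDPG use one.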

### 5. Plotted Rewards

The environment was solved in 366 episodes, reaching an average score of 30.05. The training log and the plot below show how the score evolved over the episodes.

```text
Episode 100	Average Score: 3.03
Episode 200	Average Score: 10.52
Episode 300	Average Score: 20.27
Episode 366	Average Score: 30.05
Environment solved in 366 episodes!	 Average score: 30.05
```

<p align="center"> 
    <img src="Assets/Training_plot.png" align="left" alt="drawing" width="700px">
</p>
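The "Average Score" in the log can be read as a moving average over recent episodes. The sketch below assumes the common convention for this environment: it counts as solved once the mean score over the last 100 episodes reaches 30. Both the threshold and the window size are assumptions here, as is the function name `first_solved_episode`.

```python
from collections import deque

def first_solved_episode(scores, target=30.0, window=100):
    """Return the first episode (1-indexed) at which the mean of the last
    `window` scores is at least `target`, or None if that never happens."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None

# Synthetic scores for illustration only (not the project's training data).
synthetic_scores = [0] * 50 + [40] * 200
print(first_solved_episode(synthetic_scores))
```

Using a moving average rather than a single episode score smooths out the large episode-to-episode variance that is typical of this kind of training run.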

### 6. Future work

>[Solve the second version with twenty (20) agents](https://github.com/nalbert9/Deep_Reinforcement_Learning/blob/master/P2_Continuous-control/Twenty_Agents.md): The second version is useful for algorithms like [PPO](https://arxiv.org/pdf/1707.06347.pdf), [A3C](https://arxiv.org/pdf/1602.01783.pdf), and [D4PG](https://openreview.net/pdf?id=SyZipzbCb) that use multiple (non-interacting, parallel) copies of the same agent to distribute the task of gathering experience.

>[Solve the Crawler Environment](https://github.com/nalbert9/Deep_Reinforcement_Learning/blob/master/P2_Continuous-control/Continuous_Control.ipynb): The goal is to teach a creature with four legs to walk forward without falling.