## Deep Reinforcement Learning Project 3 : Tennis

This report describes the environment and the algorithms used to train two agents, each controlling a racket, to bounce a ball over a net.

If an agent hits the ball over the net, it receives a reward of +0.1. If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01. Thus, the goal of each agent is to keep the ball in play.

The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. Each agent receives its own, local observation. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.

The task is episodic. To solve the environment, the agents must reach an average score of +0.5 over 100 consecutive episodes, after taking the maximum over both agents. Specifically:

+ After each episode, we add up the rewards that each agent received (without discounting) to get a score for each agent. This yields 2 (potentially different) scores.
+ We then take the maximum of these 2 scores, which gives a single score for each episode.
+ The environment is considered solved when the average of those scores over 100 episodes is at least +0.5.
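
The scoring rule above can be sketched in a few lines of plain Python. This is an illustration rather than the project's training loop: the names `episode_score`, `is_solved`, `TARGET_SCORE`, and `WINDOW` are hypothetical, and rewards are assumed to be logged as one `(agent_0, agent_1)` pair per time step.

```python
from collections import deque

TARGET_SCORE = 0.5  # threshold from the task description
WINDOW = 100        # number of consecutive episodes averaged


def episode_score(rewards_per_step):
    """Return the episode score: the larger of the two undiscounted reward sums.

    rewards_per_step is a non-empty list of (reward_agent_0, reward_agent_1) pairs.
    """
    per_agent_totals = [sum(agent_rewards) for agent_rewards in zip(*rewards_per_step)]
    return max(per_agent_totals)


def is_solved(scores_window):
    """True once a full window of episode scores averages at least TARGET_SCORE."""
    if len(scores_window) < WINDOW:
        return False
    return sum(scores_window) / len(scores_window) >= TARGET_SCORE


# Usage: keep only the most recent WINDOW episode scores.
scores_window = deque(maxlen=WINDOW)
scores_window.append(episode_score([(0.1, 0.0), (0.0, 0.1), (0.1, -0.01)]))
print(is_solved(scores_window))  # a single episode never fills the window
```

A bounded `deque` is used so that the average always covers the most recent 100 episodes only.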

### 1. Environment

For this project we used the [Tennis](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Learning-Environment-Examples.md#tennis) environment to design, train, and evaluate our deep reinforcement learning algorithms.

#### Characteristics of the environment

+ Unity Academy name: Academy
        - Number of Brains: 1
        - Number of External Brains : 1
		
+ Unity brain name: TennisBrain
        - Number of Visual Observations (per agent): 0
        - Vector Observation space type: continuous
        - Vector Observation space size (per agent): 8
        - Number of stacked Vector Observation: 3
        - Vector Action space type: continuous
        - Vector Action space size (per agent): 2
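
With 8 values per observation and 3 stacked observations, each agent presumably receives 8 × 3 = 24 values per step, which would be the `state_size` passed to the networks below. Unity performs the stacking internally; the sketch that follows only illustrates the idea. The `ObservationStacker` class, the oldest-first ordering, and the zero padding at the start of an episode are all assumptions made for this example.

```python
from collections import deque

OBS_SIZE = 8      # vector observation size per agent (from the list above)
NUM_STACKED = 3   # number of stacked vector observations (from the list above)


class ObservationStacker:
    """Hypothetical helper: concatenates the last NUM_STACKED observations."""

    def __init__(self, obs_size=OBS_SIZE, num_stacked=NUM_STACKED):
        self.obs_size = obs_size
        # Assumption: the history starts out filled with zeros.
        self.frames = deque(
            [[0.0] * obs_size for _ in range(num_stacked)], maxlen=num_stacked
        )

    def push(self, obs):
        """Add the newest observation and return the flat stacked state."""
        if len(obs) != self.obs_size:
            raise ValueError(f"expected {self.obs_size} values, got {len(obs)}")
        self.frames.append(list(obs))
        # Assumption: oldest observation first, newest last.
        return [value for frame in self.frames for value in frame]


stacker = ObservationStacker()
state = stacker.push([1.0] * OBS_SIZE)
print(len(state))  # OBS_SIZE * NUM_STACKED values per agent
```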

### 2. Learning Algorithm

To solve this problem we use the MADDPG (multi-agent deep deterministic policy gradient) and DDPG algorithms to train multiple agents. For more details about the agent implementation, see the `ddpg_agent.py` and `multi_agent.py` files.

### 3. Model Architecture

#### 3.1. Actor Network

```python
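import torch
import torch.nn as nn
import torch.nn.functional as F

# Note: hidden_init (not shown in this report) returns the (low, high)
# bounds used for the uniform initialization of the hidden-layer weights.


class Actor(nn.Module):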
    """Actor (Policy) Model."""

    def __init__(self, state_size, action_size, seed, fc1_units=256, fc2_units=128):
        """Initialize parameters and build model.
        Params
        ======
            state_size (int): Dimension of each state
            action_size (int): Dimension of each action
            seed (int): Random seed
            fc1_units (int): Number of nodes in first hidden layer
            fc2_units (int): Number of nodes in second hidden layer
        """
        super(Actor, self).__init__()
        self.seed = torch.manual_seed(seed)
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.fc3 = nn.Linear(fc2_units, action_size)
        self.reset_parameters()

    def reset_parameters(self):
        self.fc1.weight.data.uniform_(*hidden_init(self.fc1))
        self.fc2.weight.data.uniform_(*hidden_init(self.fc2))
        self.fc3.weight.data.uniform_(-3e-3, 3e-3)

    def forward(self, state):
        """Build an actor (policy) network that maps states -> actions."""
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))
```
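
The `hidden_init` helper called in `reset_parameters` is not shown in this report. One plausible definition, assumed here, sets the bound to 1/sqrt(fan_in), where fan_in is the number of inputs to the layer. The layer size of 24 inputs in the usage lines is also an assumption (8 observation values × 3 stacked observations).

```python
import math

import torch.nn as nn


def hidden_init(layer):
    """Return (low, high) bounds for uniform initialization of a linear layer.

    Assumed definition: the bound is 1 / sqrt(fan_in).
    """
    fan_in = layer.in_features  # number of inputs to the layer
    lim = 1.0 / math.sqrt(fan_in)
    return (-lim, lim)


# Usage: initialize a hidden layer the same way reset_parameters() does.
fc1 = nn.Linear(24, 256)  # 24 inputs is an assumption, see the text above
low, high = hidden_init(fc1)
fc1.weight.data.uniform_(low, high)
```

Keeping the initial weights in a small range that shrinks with layer width helps keep early activations, and therefore early actions, from saturating the final `tanh`.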

#### 3.2. Critic Network

```python
class Critic(nn.Module):
    """Critic (Value) Model."""

    def __init__(self, state_size, action_size, seed, fcs1_units=256, fc2_units=128):
        """Initialize parameters and build model.
        Params
        ======
            state_size (int): Dimension of each state
            action_size (int): Dimension of each action
            seed (int): Random seed
            fcs1_units (int): Number of nodes in the first hidden layer
            fc2_units (int): Number of nodes in the second hidden layer
        """
        super(Critic, self).__init__()
        self.seed = torch.manual_seed(seed)
        self.fcs1 = nn.Linear(state_size, fcs1_units)
        self.fc2 = nn.Linear(fcs1_units + action_size, fc2_units)
        self.fc3 = nn.Linear(fc2_units, 1)
        self.reset_parameters()

    def reset_parameters(self):
        self.fcs1.weight.data.uniform_(*hidden_init(self.fcs1))
        self.fc2.weight.data.uniform_(*hidden_init(self.fc2))
        self.fc3.weight.data.uniform_(-3e-3, 3e-3)

    def forward(self, state, action):
        """Build a critic (value) network that maps (state, action) pairs -> Q-values."""
        xs = F.relu(self.fcs1(state))
        x = torch.cat((xs, action), dim=1)
        x = F.relu(self.fc2(x))
        return self.fc3(x)
```

### 4. Hyperparameters

We used Adam to learn the network parameters, with a learning rate of 1e-3 for the actor and 1e-4 for the critic. The Q-value targets use a discount factor of gamma = 0.99, and the soft target updates use tau = 1e-3.

```python
BUFFER_SIZE = int(1e5)  # Replay buffer size
BATCH_SIZE = 128        # Minibatch size
GAMMA = 0.99            # Discount factor
TAU = 1e-3              # For soft update of target parameters
LR_ACTOR = 1e-3         # Learning rate of the actor
LR_CRITIC = 1e-4        # Learning rate of the critic
```
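
To show where `GAMMA` and `TAU` enter the updates, here is a minimal PyTorch sketch of the standard DDPG target computation and soft target update. It is not taken from `ddpg_agent.py`; the function names `td_target` and `soft_update` and their signatures are assumptions made for this example.

```python
import torch
import torch.nn as nn

GAMMA = 0.99  # discount factor (value from the table above)
TAU = 1e-3    # soft-update rate (value from the table above)


def td_target(rewards, next_q_values, dones, gamma=GAMMA):
    """Critic target: y = r + gamma * Q_target(s', a') * (1 - done)."""
    return rewards + gamma * next_q_values * (1.0 - dones)


def soft_update(local_model, target_model, tau=TAU):
    """Blend weights: theta_target <- tau * theta_local + (1 - tau) * theta_target."""
    with torch.no_grad():
        for target_param, local_param in zip(
            target_model.parameters(), local_model.parameters()
        ):
            target_param.copy_(tau * local_param + (1.0 - tau) * target_param)


# Usage with two small stand-in networks (the layer sizes are arbitrary).
local_net = nn.Linear(4, 2)
target_net = nn.Linear(4, 2)
soft_update(local_net, target_net)
```

With `TAU = 1e-3`, the target networks move only 0.1% of the way toward the local networks at each update, which keeps the critic's regression targets changing slowly.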

### 5. DDPG Algorithm

<p align="center"> 
    <img src="Assets/DDPG_Algorithm.png" align="left" alt="drawing" width="700px">
</p>

### 6. Plotted Rewards

The environment was solved in 366 episodes. The plot below shows the reward per episode.

```python
Episode 100	Average Score: 3.03
Episode 200	Average Score: 10.52
Episode 300	Average Score: 20.27
Episode 366	Average Score: 30.05
Environment solved in 366 episodes!	 Average score: 30.05
```

<p align="center"> 
    <img src="Assets/Training_plot.png" align="left" alt="drawing" width="700px">
</p>

### 7. Future work


>[Solve the Soccer Environment](https://github.com/nalbert9/Deep_Reinforcement_Learning/blob/master/P3_Collab-compet/Soccer.md)