# REPORT: DRL for Collaborative and Competitive Agents

## Description and Implementation

The code is split across four files:

1. `model.py` defines two classes that inherit from torch's `nn.Module`: `Actor` and `Critic`. Both are multilayer networks, and the number of nodes in each layer is configurable.
1. `ddpg_agent.py` defines an `Agent` class that handles learning with an actor-critic structure. It also contains two more classes, `ReplayBuffer` and `OUNoise`, which handle experience replay and the exploration-exploitation dilemma, respectively.
1. `ddpg_interact.py` defines a general function, `maddpg()`, that handles the interaction between the `Agent` and the `UnityEnvironment`.
1. `learn_and_prove.py` contains the main program. It reads the parameters in `config.ini` that the classes need, as well as the options passed on the command line.
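
To illustrate the exploration side, here is a minimal sketch of an Ornstein-Uhlenbeck noise process in plain Python. It is not the project's `OUNoise` class: the constructor signature, the `theta=0.15` default, and the use of the standard-library `random` module are assumptions made for illustration. Only the default `sigma=0.20` is taken from the hyperparameters listed in this report.

```python
import random


class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated exploration noise."""

    def __init__(self, size, seed=0, mu=0.0, theta=0.15, sigma=0.20):
        self.mu = [mu] * size
        self.theta = theta  # assumed mean-reversion rate (not from the report)
        self.sigma = sigma  # noise scale; 0.20 matches SIGMA in the report
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Return the internal state to the mean, e.g. at the start of an episode."""
        self.state = list(self.mu)

    def sample(self):
        """Advance the process one step and return the new noise vector."""
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)


noise = OUNoise(size=2)          # one noise value per action dimension
noisy_step = noise.sample()      # would be added to the actor's output
```

Because each sample depends on the previous state, consecutive noise values are correlated, which tends to give smoother exploration than independent Gaussian noise.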

## Learning Algorithm

The __DDPG__ algorithm is adapted to the multi-agent case (__MADDPG__). Each agent has its own experience.
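
As a rough sketch of what per-agent experience could look like, the snippet below keeps one replay buffer per agent. It assumes that "its own experience" means a separate buffer for each agent. The class layout, the `Experience` tuple, and the method names are hypothetical and are not taken from `ddpg_agent.py`.

```python
import random
from collections import deque, namedtuple

# Hypothetical record layout for one transition.
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size buffer that stores transitions and samples them uniformly."""

    def __init__(self, buffer_size=100_000, batch_size=128, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest entries are dropped
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Return batch_size transitions drawn without replacement."""
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


# One buffer per agent (two agents assumed here).
buffers = [ReplayBuffer(batch_size=4) for _ in range(2)]
for step in range(10):
    for agent_id, buf in enumerate(buffers):
        buf.add([step], [0.0, 0.0], 0.1 * agent_id, [step + 1], False)

batch = buffers[0].sample()
```

Sampling uniformly from a buffer breaks the correlation between consecutive transitions, which is the usual motivation for experience replay.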

### About Hyperparameters
```python
config = {
    "n_episodes":      2000,
    "max_t":           1000,
    "print_every":       50,
    "SEED":               0,
    "BUFFER_SIZE":      1e5,
    "BATCH_SIZE":       128,
    "UPDATE_EVERY":       1,
    "GAMMA":           0.99,
    "SIGMA":           0.20,
    "TAU":             6e-2,
    "LR_ACTOR":        1e-4,
    "LR_CRITIC":       1e-3,
    "WEIGHT_DECAY":       0,
    "FC1_ACTOR":        400,
    "FC2_ACTOR":        300,
    "FC1_CRITIC":       400,
    "FC2_CRITIC":       300,
}
```
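
To make the roles of `GAMMA` and `TAU` concrete, the sketch below shows the standard DDPG update rules they normally parameterize, assuming this project follows them. The function names are hypothetical, and plain Python lists stand in for network weights.

```python
def td_target(reward, next_q, done, gamma=0.99):
    """One-step bootstrap target for the critic: r + gamma * Q'(s', a')."""
    # The bootstrap term is dropped when the episode has ended.
    return reward + gamma * next_q * (1.0 - float(done))


def soft_update(local_params, target_params, tau=6e-2):
    """Move target weights toward local weights: tau*local + (1 - tau)*target."""
    return [
        tau * local + (1.0 - tau) * target
        for local, target in zip(local_params, target_params)
    ]


local = [1.0, -1.0]
target = [0.0, 0.0]
target = soft_update(local, target)  # each weight moves 6% of the way to local
```

A larger `TAU` makes the target networks track the local networks more quickly, at the cost of less stable learning targets.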

### About Model Architectures
```
Actor(
  (fc1): Linear(in_features=48, out_features=400, bias=True)
  (fc2): Linear(in_features=400, out_features=300, bias=True)
  (fc3): Linear(in_features=300, out_features=2, bias=True)
)
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Linear-1                  [-1, 400]          19,600
            Linear-2                  [-1, 300]         120,300
            Linear-3                    [-1, 2]             602
================================================================
Total params: 140,502
Trainable params: 140,502
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.01
Params size (MB): 0.54
Estimated Total Size (MB): 0.54
----------------------------------------------------------------
Critic(
  (fcs1): Linear(in_features=48, out_features=400, bias=True)
  (fc2): Linear(in_features=404, out_features=300, bias=True)
  (fc3): Linear(in_features=300, out_features=1, bias=True)
)
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Linear-1                  [-1, 400]          19,600
            Linear-2                  [-1, 300]         121,500
            Linear-3                    [-1, 1]             301
================================================================
Total params: 141,401
Trainable params: 141,401
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.01
Params size (MB): 0.54
Estimated Total Size (MB): 0.55

```
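
The parameter counts in the summaries can be checked by hand: a `Linear` layer with bias has `in_features * out_features + out_features` parameters. The short script below redoes that arithmetic with the layer shapes printed above. The helper name `linear_params` is only illustrative.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


# (in_features, out_features) pairs copied from the printed architectures.
actor_layers = [(48, 400), (400, 300), (300, 2)]
critic_layers = [(48, 400), (404, 300), (300, 1)]

actor_total = sum(linear_params(i, o) for i, o in actor_layers)
critic_total = sum(linear_params(i, o) for i, o in critic_layers)

print(actor_total)   # 140502
print(critic_total)  # 141401
```

The critic's second layer takes 404 inputs rather than 400. This is consistent with a 4-dimensional action vector being concatenated to the 400 outputs of the first layer, although the report does not state this explicitly.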

## Plot of Rewards
![learning curve](images/drl001_learning_curve.png)

Console output:
```
Episode   50/2000 || Score 0.10000 || Last avg. scores 0.00580 || Best avg. score 0.00580 
Episode  100/2000 || Score 0.00000 || Last avg. scores 0.01130 || Best avg. score 0.01284 
Episode  150/2000 || Score 0.00000 || Last avg. scores 0.02390 || Best avg. score 0.02490 
Episode  200/2000 || Score 0.10000 || Last avg. scores 0.03390 || Best avg. score 0.03390 
Episode  250/2000 || Score 0.09000 || Last avg. scores 0.06670 || Best avg. score 0.06670 
Episode  300/2000 || Score 0.19000 || Last avg. scores 0.08430 || Best avg. score 0.08740 
Episode  350/2000 || Score 0.10000 || Last avg. scores 0.08140 || Best avg. score 0.08810 
Episode  400/2000 || Score 0.20000 || Last avg. scores 0.09620 || Best avg. score 0.09620 
Episode  450/2000 || Score 0.00000 || Last avg. scores 0.11530 || Best avg. score 0.11640 
Episode  500/2000 || Score 0.10000 || Last avg. scores 0.12570 || Best avg. score 0.12770 
Episode  550/2000 || Score 0.10000 || Last avg. scores 0.14890 || Best avg. score 0.14890 
Episode  600/2000 || Score 0.10000 || Last avg. scores 0.14450 || Best avg. score 0.15400 
Episode  650/2000 || Score 0.10000 || Last avg. scores 0.15160 || Best avg. score 0.16440 
Episode  700/2000 || Score 0.50000 || Last avg. scores 0.23530 || Best avg. score 0.23530 
Episode  750/2000 || Score 0.40000 || Last avg. scores 0.32340 || Best avg. score 0.32340 
Episode  800/2000 || Score 0.60000 || Last avg. scores 0.36760 || Best avg. score 0.36760 
Episode  850/2000 || Score 0.10000 || Last avg. scores 0.45130 || Best avg. score 0.45630 
Episode  888/2000 || Score 2.10000 || Last avg. scores 0.50820 || Best avg. score 0.50820 
Environment solved in 888 episodes!	Average Score: 0.51	in 1975.80 secs
```
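
The stopping rule behind the "Environment solved" message can be sketched as a moving average over recent episode scores. The window of 100 episodes and the threshold of 0.5 are assumptions. The log only shows that training stopped once the average reached 0.51, and the function below is not taken from `ddpg_interact.py`.

```python
from collections import deque


def first_solved_episode(scores, window=100, threshold=0.5):
    """Return the first episode whose trailing average reaches the threshold."""
    recent = deque(maxlen=window)  # keeps only the last `window` scores
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        # Assumption: only a full window counts as "solved".
        if len(recent) == window and sum(recent) / window >= threshold:
            return episode
    return None


# Toy data: 100 episodes scoring 0.0 followed by 100 episodes scoring 1.0.
print(first_solved_episode([0.0] * 100 + [1.0] * 100))  # 150
```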

## Ideas for Future Work

* It would be interesting to apply this MADDPG implementation to a more complex environment where several agents can be analyzed.
* Explore competition and competitive experiences to improve the algorithm.