# REPORT: DRL for continuous control

## Description of the implementation
The code is organized in four files:
1. `model.py` has two classes that inherit from `nn.Module` in `torch`: `Actor` and `Critic`. Both are multilayer networks, and the number of nodes in each layer can be changed.
1. `ddpg_agent.py` has an `Agent` class that handles the learning mechanism over an Actor-Critic structure. This file also contains two more classes, `ReplayBuffer` and `OUNoise`, which handle experience replay and the exploration-exploitation dilemma respectively.
1. `ddpg_interact.py` has a general function `ddpg()` that handles the interaction between the `Agent` and the `UnityEnvironment`.
1. `learn_and_prove.py` has the main program. It handles all the parameters in the `params.ini` file needed to run the classes, as well as the command-line options.
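
To make the role of experience replay concrete, here is a minimal sketch of a fixed-size buffer with uniform sampling. It is not the project's `ReplayBuffer`: the field names, method names, and the use of plain Python lists instead of tensors are assumptions made for illustration.

```python
import random
from collections import deque, namedtuple

# Field names are illustrative; the project's own tuple layout may differ.
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size buffer that stores transitions and samples them uniformly."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest transitions drop out first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Return a random minibatch, grouped field by field."""
        batch = self.rng.sample(list(self.memory), self.batch_size)
        return Experience(*zip(*batch))

    def __len__(self):
        return len(self.memory)


# Toy usage: push 8 transitions into a buffer that only holds 5.
buffer = ReplayBuffer(buffer_size=5, batch_size=3)
for t in range(8):
    buffer.add([float(t)], [0.0], float(t), [float(t + 1)], False)

states, actions, rewards, next_states, dones = buffer.sample()
```

Sampling uniformly from past transitions breaks the correlation between consecutive steps, which is why learning usually starts only once the buffer holds at least one full minibatch.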

## Learning Algorithm
The task is a Markov decision process (MDP) with continuous state and action spaces, the same setting found in practical __Robotic Optimal Control__ and __Behavior Character Animation__ problems. One well-known algorithm for this setting, and the one used here, is __Deep Deterministic Policy Gradient [(DDPG)](https://arxiv.org/abs/1509.02971)__.

DDPG uses two neural networks as function approximators. The Actor is the policy, or controller, of the agent: it chooses an action based on its perception, represented by a state vector. The Critic estimates the value of the chosen action, and its signal is used to improve the Actor. Unlike pure policy-based methods, where the gradient is estimated from a set of experience rollouts and has high variance, the Actor-Critic structure of DDPG gives a lower-variance estimate of the gradient.
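
For reference, the updates can be summarized in the notation of the DDPG paper linked above. This is a summary of the paper's equations, not a transcription of this project's code. Here $\mu(s \mid \theta^\mu)$ is the Actor, $Q(s, a \mid \theta^Q)$ is the Critic, primed symbols denote slowly updated target networks, and $N$ is the minibatch size.

The Critic is trained toward a bootstrapped target by minimizing a squared error:

$$y_i = r_i + \gamma \, Q'\big(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big)$$

$$L(\theta^Q) = \frac{1}{N} \sum_i \big(y_i - Q(s_i, a_i \mid \theta^Q)\big)^2$$

The Actor follows the gradient of the Critic with respect to the action:

$$\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q(s, a \mid \theta^Q)\big|_{s=s_i,\, a=\mu(s_i)} \, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\big|_{s=s_i}$$

Both target networks track the learned networks through a soft update:

$$\theta^{Q'} \leftarrow \tau \theta^Q + (1-\tau)\,\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau \theta^\mu + (1-\tau)\,\theta^{\mu'}$$

In terms of the hyperparameters of this report, $\gamma$ corresponds to `GAMMA`, $N$ to `BATCH_SIZE`, and $\tau$ to `TAU`.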

### About Hyperparameters
This algorithm has many hyperparameters that are hard to tune by hand because each run takes a long time. For lack of time, not all of the following parameters were explored. Nonetheless, the configuration below performs well and takes about 40 minutes to train:

```python
config = {
    "n_episodes":       200, # max. number of episodes to train the agent
    "max_t":           1000, # max. number of steps per episode
    "SEED":               0, # random seed
    "BUFFER_SIZE":      1e5, # ALWAYS FIXED: replay buffer size
    "BATCH_SIZE":       256, # minibatch size
    "UPDATE_EVERY":       1, # how often to update the network
    "GAMMA":           0.99, # ALWAYS FIXED: discount factor
    "SIGMA":           0.20, # std of the action noise used for exploration
    "TAU":             1e-3, # for soft update of target parameters
    "LR_ACTOR":        1e-4, # learning rate of the actor
    "LR_CRITIC":       1e-4, # learning rate of the critic
    "WEIGHT_DECAY":     0.0, # L2 weight decay for the critic
    "FC1_ACTOR":        400, # number of neurons in actor first layer
    "FC2_ACTOR":        300, # number of neurons in actor second layer
    "FC1_CRITIC":       400, # number of neurons in critic first layer
    "FC2_CRITIC":       300, # number of neurons in critic second layer
}
```
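
To illustrate what `TAU` controls, the sketch below applies the soft update from the DDPG paper to plain lists of numbers. The function name and the list-based weights are assumptions for illustration; in the project the same blend would be applied to the parameters of the target networks.

```python
def soft_update(local_params, target_params, tau):
    """Blend local weights into target weights.

    target <- tau * local + (1 - tau) * target
    """
    return [
        tau * w_local + (1.0 - tau) * w_target
        for w_local, w_target in zip(local_params, target_params)
    ]


TAU = 1e-3
local = [1.0, -2.0, 0.5]   # stand-in for the learned network weights
target = [0.0, 0.0, 0.0]   # stand-in for the target network weights

# With TAU = 1e-3 the target moves only slowly toward the local weights.
for _ in range(1000):
    target = soft_update(local, target, TAU)
```

After 1000 updates with fixed local weights, the target has covered a fraction $1 - (1 - \tau)^{1000} \approx 0.63$ of the distance, which shows how a small `TAU` keeps the learning targets stable.
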
As an example of how hard it is to explore the parameter space, the following picture shows the same algorithm with only the `SEED` and the noise amplitude `SIGMA` changed.

![image](images/drl002_sigma_and_seed.png)

The fixed values were taken from the DDPG paper.
* `n_episodes` was explored between 100 and 1000; most of the time it was set to 100 to save hours of waiting.
* `max_t` was set to 1000 so that the replay buffer fills quickly.
* `BATCH_SIZE` was explored over $[16, 32, 64, 128, 256]$; the best value depends on the architecture.
* `UPDATE_EVERY` was explored between 1 and 20; the value 1 shown in the configuration above was kept to speed up results.
* `SIGMA` was explored over $[0.05, 0.10, 0.20]$; 0.20 was the best choice among them.
* `WEIGHT_DECAY` was explored over $[0.0, 1e-2, 1e-4]$; in the experiments the best choice was zero.
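
To show where `SIGMA` enters, here is a minimal sketch of an Ornstein-Uhlenbeck noise process added to a 4-dimensional action. It is not the project's `OUNoise` class: the `theta` value of 0.15, the zero mean, and the clipping of actions to $[-1, 1]$ are assumptions for illustration.

```python
import random


class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated exploration noise."""

    def __init__(self, size, seed=0, mu=0.0, theta=0.15, sigma=0.20):
        self.mu = [mu] * size     # long-run mean of each noise component
        self.theta = theta        # pull toward the mean (assumed value)
        self.sigma = sigma        # noise amplitude, SIGMA in the config
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean, e.g. at the start of an episode."""
        self.state = list(self.mu)

    def sample(self):
        """Advance the process one step and return the new noise vector."""
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)


def clip(value, low=-1.0, high=1.0):
    return max(low, min(high, value))


noise = OUNoise(size=4, sigma=0.20)
action = [0.3, -0.9, 0.95, 0.0]  # placeholder for the Actor output
noisy_action = [clip(a + n) for a, n in zip(action, noise.sample())]
```

A larger `sigma` makes the agent explore more widely around the Actor's action, while a smaller one keeps it closer to the current policy.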

### About Model Architecture
Several architectures were tried. The following figure shows the first search over these parameters, which did not succeed; it used architectures with one hidden layer for the Actor and two hidden layers for the Critic:

![image](images/drl001_architecture_and_batch.png)

As you can see, the plot shows the score over the 20 arms together with its running average over 100 episodes. These results suggest that a more complex architecture is needed.

Without success under 100 episodes:
* Actor `(33->16->4)`, Critic `([33+4]->32->16->1)`
* Actor `(33->32->4)`, Critic `([33+4]->32->16->1)`
* Actor `(33->64->4)`, Critic `([33+4]->32->16->1)`
* Actor `(33->128->4)`, Critic `([33+4]->32->16->1)`
* Actor `(33->32->4)`, Critic `([33+4]->32->32->1)`
* Actor `(33->16->4)`, Critic `([33+4]->32->32->1)`
* Actor `(33->16->4)`, Critic `([33+4]->64->64->1)`
* Actor `(33->128->4)`, Critic `(33->[128+4]->128->1)`, with the action concatenated at the input of the second layer

With success:
* Actor `(33->128->128->4)`, Critic `(33->[128+4]->128->1)`
* Actor `(33->400->300->4)`, Critic `(33->[400+4]->300->1)`

The most successful architecture found was:
```
Actor(
  (fc1): Linear(in_features=33, out_features=400, bias=True)
  (fc2): Linear(in_features=400, out_features=300, bias=True)
  (fc3): Linear(in_features=300, out_features=4, bias=True)
)
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Linear-1                  [-1, 400]          13,600
            Linear-2                  [-1, 300]         120,300
            Linear-3                    [-1, 4]           1,204
================================================================
Total params: 135,104
Trainable params: 135,104
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.01
Params size (MB): 0.52
Estimated Total Size (MB): 0.52
----------------------------------------------------------------
Critic(
  (fcs1): Linear(in_features=33, out_features=400, bias=True)
  (fc2): Linear(in_features=404, out_features=300, bias=True)
  (fc3): Linear(in_features=300, out_features=1, bias=True)
)
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Linear-1                  [-1, 400]          13,600
            Linear-2                  [-1, 300]         121,500
            Linear-3                    [-1, 1]             301
================================================================
Total params: 135,401
Trainable params: 135,401
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.01
Params size (MB): 0.52
Estimated Total Size (MB): 0.52
----------------------------------------------------------------
```
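
The parameter counts in the summaries above can be checked by hand: a fully connected layer has one weight per input-output pair plus one bias per output. The short sketch below reproduces the totals. The helper name is hypothetical; the sizes come from the printed architecture.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


STATE_SIZE, ACTION_SIZE = 33, 4
FC1, FC2 = 400, 300

actor = [
    linear_params(STATE_SIZE, FC1),
    linear_params(FC1, FC2),
    linear_params(FC2, ACTION_SIZE),
]
# In the Critic the action is concatenated to the output of the first layer,
# so the second layer sees FC1 + ACTION_SIZE = 404 inputs.
critic = [
    linear_params(STATE_SIZE, FC1),
    linear_params(FC1 + ACTION_SIZE, FC2),
    linear_params(FC2, 1),
]

print(actor, sum(actor))    # [13600, 120300, 1204] 135104
print(critic, sum(critic))  # [13600, 121500, 301] 135401
```
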
## Plot of Rewards

Our first attempts did not reach the target score,

![image](images/drl003_first_attempts.png)

but after many hours of work we finally found several working configurations:

![image](images/drl004_successful_configs.png)

The final output of run `id:6`, shown in the last picture, was:
```
Episode    1/ 100 || Score 0.54700 || Last avg. scores 0.54700 || Best avg. score 0.54700 
Episode    2/ 100 || Score 0.72250 || Last avg. scores 0.63475 || Best avg. score 0.63475 
Episode    3/ 100 || Score 0.59700 || Last avg. scores 0.62217 || Best avg. score 0.63475 
Episode    4/ 100 || Score 0.80800 || Last avg. scores 0.66862 || Best avg. score 0.66862 
Episode    5/ 100 || Score 0.82350 || Last avg. scores 0.69960 || Best avg. score 0.69960 
Episode    6/ 100 || Score 0.65200 || Last avg. scores 0.69167 || Best avg. score 0.69960 
Episode    7/ 100 || Score 0.82200 || Last avg. scores 0.71029 || Best avg. score 0.71029 
Episode    8/ 100 || Score 0.84000 || Last avg. scores 0.72650 || Best avg. score 0.72650 
Episode    9/ 100 || Score 1.35300 || Last avg. scores 0.79611 || Best avg. score 0.79611 
Episode   10/ 100 || Score 1.72700 || Last avg. scores 0.88920 || Best avg. score 0.88920 
Episode   11/ 100 || Score 2.01200 || Last avg. scores 0.99127 || Best avg. score 0.99127 
Episode   12/ 100 || Score 2.30550 || Last avg. scores 1.10079 || Best avg. score 1.10079 
Episode   13/ 100 || Score 3.49050 || Last avg. scores 1.28462 || Best avg. score 1.28462 
Episode   14/ 100 || Score 3.62700 || Last avg. scores 1.45193 || Best avg. score 1.45193 
Episode   15/ 100 || Score 3.78300 || Last avg. scores 1.60733 || Best avg. score 1.60733 
Episode   16/ 100 || Score 4.87550 || Last avg. scores 1.81159 || Best avg. score 1.81159 
Episode   17/ 100 || Score 6.69900 || Last avg. scores 2.09909 || Best avg. score 2.09909 
Episode   18/ 100 || Score 10.33500 || Last avg. scores 2.55664 || Best avg. score 2.55664 
Episode   19/ 100 || Score 15.44400 || Last avg. scores 3.23492 || Best avg. score 3.23492 
Episode   20/ 100 || Score 15.52100 || Last avg. scores 3.84922 || Best avg. score 3.84922 
Episode   21/ 100 || Score 20.29500 || Last avg. scores 4.63236 || Best avg. score 4.63236 
Episode   22/ 100 || Score 24.49900 || Last avg. scores 5.53539 || Best avg. score 5.53539 
Episode   23/ 100 || Score 29.07100 || Last avg. scores 6.55867 || Best avg. score 6.55867 
Episode   24/ 100 || Score 29.46050 || Last avg. scores 7.51292 || Best avg. score 7.51292 
Episode   25/ 100 || Score 30.62200 || Last avg. scores 8.43728 || Best avg. score 8.43728 
Episode   26/ 100 || Score 33.18850 || Last avg. scores 9.38925 || Best avg. score 9.38925 
Episode   27/ 100 || Score 36.16950 || Last avg. scores 10.38111 || Best avg. score 10.38111 
Episode   28/ 100 || Score 37.08200 || Last avg. scores 11.33471 || Best avg. score 11.33471 
Episode   29/ 100 || Score 37.22200 || Last avg. scores 12.22738 || Best avg. score 12.22738 
Episode   30/ 100 || Score 37.85550 || Last avg. scores 13.08165 || Best avg. score 13.08165 
Episode   31/ 100 || Score 37.99450 || Last avg. scores 13.88529 || Best avg. score 13.88529 
Episode   32/ 100 || Score 38.06650 || Last avg. scores 14.64095 || Best avg. score 14.64095 
Episode   33/ 100 || Score 38.20800 || Last avg. scores 15.35511 || Best avg. score 15.35511 
Episode   34/ 100 || Score 37.88650 || Last avg. scores 16.01779 || Best avg. score 16.01779 
Episode   35/ 100 || Score 38.07600 || Last avg. scores 16.64803 || Best avg. score 16.64803 
Episode   36/ 100 || Score 38.31700 || Last avg. scores 17.24994 || Best avg. score 17.24994 
Episode   37/ 100 || Score 37.75150 || Last avg. scores 17.80404 || Best avg. score 17.80404 
Episode   38/ 100 || Score 38.45000 || Last avg. scores 18.34735 || Best avg. score 18.34735 
Episode   39/ 100 || Score 37.42550 || Last avg. scores 18.83654 || Best avg. score 18.83654 
Episode   40/ 100 || Score 37.65250 || Last avg. scores 19.30694 || Best avg. score 19.30694 
Episode   41/ 100 || Score 37.13500 || Last avg. scores 19.74177 || Best avg. score 19.74177 
Episode   42/ 100 || Score 37.38450 || Last avg. scores 20.16183 || Best avg. score 20.16183 
Episode   43/ 100 || Score 37.50550 || Last avg. scores 20.56517 || Best avg. score 20.56517 
Episode   44/ 100 || Score 37.23350 || Last avg. scores 20.94400 || Best avg. score 20.94400 
Episode   45/ 100 || Score 37.16050 || Last avg. scores 21.30437 || Best avg. score 21.30437 
Episode   46/ 100 || Score 37.33950 || Last avg. scores 21.65296 || Best avg. score 21.65296 
Episode   47/ 100 || Score 37.18300 || Last avg. scores 21.98338 || Best avg. score 21.98338 
Episode   48/ 100 || Score 37.31950 || Last avg. scores 22.30288 || Best avg. score 22.30288 
Episode   49/ 100 || Score 36.98350 || Last avg. scores 22.60249 || Best avg. score 22.60249 
Episode   50/ 100 || Score 37.35400 || Last avg. scores 22.89752 || Best avg. score 22.89752 
Episode   51/ 100 || Score 37.62650 || Last avg. scores 23.18632 || Best avg. score 23.18632 
Episode   52/ 100 || Score 36.89600 || Last avg. scores 23.44997 || Best avg. score 23.44997 
Episode   53/ 100 || Score 36.54250 || Last avg. scores 23.69700 || Best avg. score 23.69700 
Episode   54/ 100 || Score 37.56450 || Last avg. scores 23.95381 || Best avg. score 23.95381 
Episode   55/ 100 || Score 36.87550 || Last avg. scores 24.18874 || Best avg. score 24.18874 
Episode   56/ 100 || Score 36.95350 || Last avg. scores 24.41669 || Best avg. score 24.41669 
Episode   57/ 100 || Score 36.99350 || Last avg. scores 24.63733 || Best avg. score 24.63733 
Episode   58/ 100 || Score 37.12550 || Last avg. scores 24.85265 || Best avg. score 24.85265 
Episode   59/ 100 || Score 37.45600 || Last avg. scores 25.06626 || Best avg. score 25.06626 
Episode   60/ 100 || Score 36.93250 || Last avg. scores 25.26403 || Best avg. score 25.26403 
Episode   61/ 100 || Score 37.74850 || Last avg. scores 25.46870 || Best avg. score 25.46870 
Episode   62/ 100 || Score 37.20750 || Last avg. scores 25.65803 || Best avg. score 25.65803 
Episode   63/ 100 || Score 37.52650 || Last avg. scores 25.84642 || Best avg. score 25.84642 
Episode   64/ 100 || Score 37.35300 || Last avg. scores 26.02621 || Best avg. score 26.02621 
Episode   65/ 100 || Score 36.68300 || Last avg. scores 26.19016 || Best avg. score 26.19016 
Episode   66/ 100 || Score 36.78250 || Last avg. scores 26.35065 || Best avg. score 26.35065 
Episode   67/ 100 || Score 37.37400 || Last avg. scores 26.51518 || Best avg. score 26.51518 
Episode   68/ 100 || Score 36.76900 || Last avg. scores 26.66597 || Best avg. score 26.66597 
Episode   69/ 100 || Score 36.49850 || Last avg. scores 26.80847 || Best avg. score 26.80847 
Episode   70/ 100 || Score 37.17000 || Last avg. scores 26.95649 || Best avg. score 26.95649 
Episode   71/ 100 || Score 37.13600 || Last avg. scores 27.09987 || Best avg. score 27.09987 
Episode   72/ 100 || Score 37.40600 || Last avg. scores 27.24301 || Best avg. score 27.24301 
Episode   73/ 100 || Score 37.32300 || Last avg. scores 27.38109 || Best avg. score 27.38109 
Episode   74/ 100 || Score 37.63900 || Last avg. scores 27.51971 || Best avg. score 27.51971 
Episode   75/ 100 || Score 37.15750 || Last avg. scores 27.64821 || Best avg. score 27.64821 
Episode   76/ 100 || Score 37.47650 || Last avg. scores 27.77753 || Best avg. score 27.77753 
Episode   77/ 100 || Score 37.98500 || Last avg. scores 27.91010 || Best avg. score 27.91010 
Episode   78/ 100 || Score 37.33800 || Last avg. scores 28.03097 || Best avg. score 28.03097 
Episode   79/ 100 || Score 37.21800 || Last avg. scores 28.14726 || Best avg. score 28.14726 
Episode   80/ 100 || Score 37.75500 || Last avg. scores 28.26736 || Best avg. score 28.26736 
Episode   81/ 100 || Score 36.77550 || Last avg. scores 28.37239 || Best avg. score 28.37239 
Episode   82/ 100 || Score 37.33350 || Last avg. scores 28.48168 || Best avg. score 28.48168 
Episode   83/ 100 || Score 37.09650 || Last avg. scores 28.58547 || Best avg. score 28.58547 
Episode   84/ 100 || Score 37.08300 || Last avg. scores 28.68663 || Best avg. score 28.68663 
Episode   85/ 100 || Score 37.45600 || Last avg. scores 28.78980 || Best avg. score 28.78980 
Episode   86/ 100 || Score 37.51300 || Last avg. scores 28.89123 || Best avg. score 28.89123 
Episode   87/ 100 || Score 37.58750 || Last avg. scores 28.99119 || Best avg. score 28.99119 
Episode   88/ 100 || Score 37.73100 || Last avg. scores 29.09051 || Best avg. score 29.09051 
Episode   89/ 100 || Score 37.67250 || Last avg. scores 29.18693 || Best avg. score 29.18693 
Episode   90/ 100 || Score 37.38900 || Last avg. scores 29.27807 || Best avg. score 29.27807 
Episode   91/ 100 || Score 37.08600 || Last avg. scores 29.36387 || Best avg. score 29.36387 
Episode   92/ 100 || Score 37.55700 || Last avg. scores 29.45292 || Best avg. score 29.45292 
Episode   93/ 100 || Score 37.33900 || Last avg. scores 29.53772 || Best avg. score 29.53772 
Episode   94/ 100 || Score 37.42150 || Last avg. scores 29.62159 || Best avg. score 29.62159 
Episode   95/ 100 || Score 37.96800 || Last avg. scores 29.70945 || Best avg. score 29.70945 
Episode   96/ 100 || Score 37.13450 || Last avg. scores 29.78679 || Best avg. score 29.78679 
Episode   97/ 100 || Score 37.66350 || Last avg. scores 29.86799 || Best avg. score 29.86799 
Episode   98/ 100 || Score 37.63100 || Last avg. scores 29.94721 || Best avg. score 29.94721 
Episode   99/ 100 || Score 38.14400 || Last avg. scores 30.03000 || Best avg. score 30.03000 

Environment solved in 99 episodes!	Average Score: 30.03	in 1990.58 secs
Episode  100/ 100 || Score 38.02700 || Last avg. scores 30.10997 || Best avg. score 30.10997 
no proves on controller

real	38m29.935s
user	41m8.096s
sys	6m19.636s
```
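
The `Last avg. scores` column in the log is consistent with a running mean over the most recent episodes (up to 100). The sketch below recomputes it for the first five scores of the log. The function name and the target of 30 are assumptions inferred from the log output, not taken from the project's code.

```python
from collections import deque


def running_averages(scores, window=100, target=30.0):
    """Return the running mean after each episode and the first episode
    at which that mean reaches the target (None if it never does)."""
    recent = deque(maxlen=window)  # keeps only the last `window` scores
    averages, solved_at = [], None
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        avg = sum(recent) / len(recent)
        averages.append(avg)
        if solved_at is None and avg >= target:
            solved_at = episode
    return averages, solved_at


# First five episode scores copied from the log above.
first_scores = [0.54700, 0.72250, 0.59700, 0.80800, 0.82350]
averages, solved_at = running_averages(first_scores)
for episode, avg in enumerate(averages, start=1):
    print(f"Episode {episode}: running average {avg:.5f}")
```

Up to rounding in the last digit, the printed values match the `Last avg. scores` column for episodes 1 to 5. Note that before episode 100 the mean covers fewer than 100 episodes, so in this run the threshold is crossed by an average over 99 episodes.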

## Ideas for Future Work

* Continue exploring the parameter space.
* To reduce the training time, it is important to explore algorithms such as PPO, D4PG or GPS, which may solve the problem in fewer attempts.
* Explore minimal neural network structures for this problem.
* Extract features that can be related to the inverse kinematics and dynamics of the robot.