## Problem

This project trains an agent in a Unity environment to move a robotic arm so that its hand reaches a green ball circling around it. There are **4 continuous actions**, which apply torques to the joints to control the position of the hand over time.
    
A reward of +0.1 is given if the hand of the robot is close to the objective.

The full environment state is complex, but the agent observes only **33 different signals**, which it takes as inputs. From these inputs the agent must learn a policy that maximizes reward.

There are 2 versions of the environment. Version 1 consists of a single agent. Version 2 consists of 20 parallel agents, but the problem is otherwise the same. An environment is considered solved when:

 - [version 1] the agent receives an average reward (over 100 episodes) of at least +30, or
 - [version 2] the agent is able to receive an average reward (over 100 episodes, and over all 20 agents) of at least +30.
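
The version 1 criterion can be checked with a rolling window over episode scores. The following is a minimal sketch rather than the project's own code: the names `is_solved`, `WINDOW` and `TARGET_SCORE` are assumptions made for this example, and the scores fed in are made up.

```python
from collections import deque

WINDOW = 100          # number of episodes averaged, as stated in the criterion
TARGET_SCORE = 30.0   # required average reward


def is_solved(scores_window):
    """Return True once a full window of episode scores averages >= TARGET_SCORE."""
    if len(scores_window) < WINDOW:
        return False
    return sum(scores_window) / len(scores_window) >= TARGET_SCORE


# Usage: keep only the most recent WINDOW episode scores.
scores_window = deque(maxlen=WINDOW)
for episode_score in [31.0] * 100:   # made-up scores for illustration
    scores_window.append(episode_score)
print(is_solved(scores_window))
```

Requiring a full window avoids declaring the environment solved after a few lucky early episodes.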
 
This project solves **version 1**.

## Learning Algorithm

The implemented algorithm is a Deep Deterministic Policy Gradient (DDPG) agent. Like Vanilla Policy Gradient (or REINFORCE), it optimizes a policy directly, but it differs in that it is an off-policy actor-critic method that learns from a replay buffer. The Actor network produces a deterministic policy (hence the name of the algorithm), and its actions are fed into the Critic network as an input, together with the state, in order to evaluate the quality of that state/action pair. This extends the Actor-Critic architecture to the continuous action domain.

![DDPG actor-critic architecture](imgs/ddpg.png)

source: https://www.researchgate.net/figure/Diagram-of-the-actor-critic-architecture-for-DDPG_fig1_333652544
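
For reference, the standard DDPG formulation (taken from the general algorithm, not from this project's code) trains the actor $\mu$ to maximize the critic's estimate $Q$ of the actions it proposes. The symbols $\theta^\mu$, $\theta^Q$ and $\mathcal{D}$ (the replay buffer) are notation introduced here for illustration:

$$
J(\theta^\mu) = \mathbb{E}_{s \sim \mathcal{D}}\left[ Q\big(s, \mu(s \mid \theta^\mu) \mid \theta^Q\big) \right],
\qquad
\nabla_{\theta^\mu} J \approx \mathbb{E}_{s \sim \mathcal{D}}\left[ \nabla_a Q(s, a \mid \theta^Q)\big|_{a=\mu(s)} \, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu) \right]
$$

Because the policy is deterministic, the gradient flows through the action into the actor's weights via the chain rule, which is why the critic takes the action as an input.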

It uses two neural networks with the following architectures:
    
**MU NETWORK**

    Input shape: (33, )
    Dense_1 : 256 neurons, ReLU activation
    Dense_2 : 128 neurons, ReLU activation
    Output: 4 actions (TanH activation)

**Q NETWORK**

    Input shape: (33, )
    Dense_1 : 256 neurons, ReLU activation
    Dense_2 : 128 neurons, ReLU activation [input = previous layer + actions]
    Output: 1 Q-value (no activation)
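
To make the layer shapes concrete, here is a framework-free sketch of one forward pass through an actor and a critic with these sizes. It is an illustration only: the project presumably uses a deep learning framework (the weights are saved as `.pth` files), the weight initialization is arbitrary, all function names are made up for this example, and the critic is assumed to end in a single scalar Q-value, as in standard DDPG.

```python
import math
import random

STATE_SIZE, ACTION_SIZE = 33, 4   # sizes stated in this document
H1, H2 = 256, 128                 # hidden layer widths


def init_dense(n_in, n_out, rng):
    """Random weights (n_out rows of n_in values) and zero biases."""
    limit = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer):
    """Affine map: one output per weight row."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def actor_forward(state, layers):
    """State (33) -> 256 -> 128 -> 4 actions squashed into [-1, 1] by tanh."""
    h = relu(dense(state, layers[0]))
    h = relu(dense(h, layers[1]))
    return [math.tanh(v) for v in dense(h, layers[2])]


def critic_forward(state, action, layers):
    """State (33) -> 256, then [256 + 4 actions] -> 128 -> scalar Q-value (assumed)."""
    h = relu(dense(state, layers[0]))
    h = relu(dense(h + list(action), layers[1]))  # actions join at the second layer
    return dense(h, layers[2])[0]


rng = random.Random(0)
actor = [init_dense(STATE_SIZE, H1, rng), init_dense(H1, H2, rng),
         init_dense(H2, ACTION_SIZE, rng)]
critic = [init_dense(STATE_SIZE, H1, rng), init_dense(H1 + ACTION_SIZE, H2, rng),
          init_dense(H2, 1, rng)]

state = [rng.uniform(-1.0, 1.0) for _ in range(STATE_SIZE)]
action = actor_forward(state, actor)
q_value = critic_forward(state, action, critic)
print(len(action), q_value)
```

Note that the second critic layer takes 260 inputs (256 features plus 4 actions), which is what "input = previous layer + actions" amounts to.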


The target Mu and Q networks receive a soft weight update with a blending factor TAU=0.001 at every timestep.
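
The soft update blends each target weight toward its local counterpart. A minimal sketch on plain lists of floats follows; the function name `soft_update` and the flat-list representation are assumptions for illustration, whereas a real implementation would iterate over the network parameters.

```python
TAU = 0.001  # blending factor stated above


def soft_update(target_params, local_params, tau=TAU):
    """Return target <- tau * local + (1 - tau) * target, element by element."""
    return [tau * local + (1.0 - tau) * target
            for target, local in zip(target_params, local_params)]


target = [0.0, 1.0]
local = [1.0, 1.0]
target = soft_update(target, local)
print(target)
```

With such a small TAU the target networks change slowly, which keeps the bootstrapped targets used to train the critic stable.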

Training uses a batch size of 128, along with a Replay Buffer of capacity 100000. The optimizer is Adam with a LR of 10e-3 for both networks.
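
A replay buffer with these settings can be sketched with the standard library alone. The class and field names below are assumptions for this example and may differ from the project's code; sampling here is uniform.

```python
import random
from collections import deque, namedtuple

BUFFER_SIZE = 100_000  # capacity stated above
BATCH_SIZE = 128       # batch size stated above

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    """Fixed-capacity FIFO store of transitions with uniform random sampling."""

    def __init__(self, capacity=BUFFER_SIZE, seed=None):
        self.memory = deque(maxlen=capacity)  # oldest entries drop out automatically
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self, batch_size=BATCH_SIZE):
        """Draw batch_size distinct transitions uniformly at random."""
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


# Usage with dummy transitions: only sample once enough data has been collected.
buffer = ReplayBuffer(seed=0)
for t in range(200):
    buffer.add([0.0] * 33, [0.0] * 4, 0.1, [0.0] * 33, False)
if len(buffer) >= BATCH_SIZE:
    batch = buffer.sample()
    print(len(batch))
```

Sampling at random from past transitions breaks the correlation between consecutive timesteps, which is what allows DDPG to learn off-policy.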

A discount factor GAMMA of 0.99 is used.
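
In standard DDPG the discount factor enters through the bootstrapped target that the critic is regressed toward. The helper below is a hedged sketch of that target for a single transition; `td_target` is a made-up name, and `next_q` stands for the target critic's value at the next state and the target actor's action there.

```python
GAMMA = 0.99  # discount factor stated above


def td_target(reward, next_q, done, gamma=GAMMA):
    """Bootstrapped target: r + gamma * next_q, with no bootstrap at episode end."""
    return reward + gamma * next_q * (0.0 if done else 1.0)


print(td_target(reward=0.1, next_q=2.0, done=False))
print(td_target(reward=0.1, next_q=2.0, done=True))
```

When `done` is true the future term is dropped, so the target reduces to the immediate reward.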


The number of episodes is set to 1000. Each episode runs until the environment reports that it is done.
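
The outer loop can be sketched as follows. `ToyEnv` is a hypothetical stand-in with a simple `reset`/`step` interface, since the real Unity environment exposes a different API, and the learning step and replay storage are left out to keep the control flow visible.

```python
class ToyEnv:
    """Hypothetical stand-in environment: ends after a fixed number of steps."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0] * 33  # dummy 33-value observation

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_length
        return [0.0] * 33, 0.1, done  # next_state, reward, done


def run_episode(env, policy):
    """Step the environment until it reports done; return the episode score."""
    state = env.reset()
    score, done = 0.0, False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        score += reward
    return score


N_EPISODES = 1000  # episode budget stated above
scores = []
for _ in range(N_EPISODES):
    scores.append(run_episode(ToyEnv(), lambda state: [0.0] * 4))
print(len(scores))
```

In the actual training loop each transition would also be stored in the replay buffer and followed by a learning step.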

The model weights are saved in '*mu.pth*' and '*q.pth*'.

## Plot of reward

This is the plot of reward over time. Training finishes when the average reward exceeds 30 (red line).

![reward plot](imgs/rewards.png)

## Ideas for Future Work

To improve performance or speed up training, several ideas are proposed:

 - Try different NN architectures; this is a problem of hyperparameter tuning.
 - Implement Prioritized Experience Replay.
 - Implement a distributed version.
 - Experiment with more advanced algorithms such as D4PG.
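
As a starting point for the Prioritized Experience Replay idea, the sketch below shows proportional prioritized sampling over a list of priorities. It is not part of this project: the function name is made up, the `alpha` and `beta` defaults are commonly used values chosen as assumptions, and a practical version would use a sum tree instead of recomputing probabilities on every call.

```python
import random


def sample_prioritized(priorities, batch_size, alpha=0.6, beta=0.4, rng=random):
    """Sample indices with probability p_i**alpha / sum(p**alpha).

    Returns (indices, weights). The weights are importance-sampling corrections
    (N * P(i))**(-beta), scaled so the largest weight in the batch is 1.
    Priorities must be strictly positive.
    """
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    probs = [s / total for s in scaled]
    indices = rng.choices(range(len(priorities)), weights=probs, k=batch_size)
    n = len(priorities)
    weights = [(n * probs[i]) ** (-beta) for i in indices]
    max_w = max(weights)
    return indices, [w / max_w for w in weights]


# Usage: transitions with larger priority (e.g. larger TD error) are drawn more often.
indices, weights = sample_prioritized([0.1, 0.5, 2.0, 1.0], batch_size=8,
                                      rng=random.Random(0))
print(indices, weights)
```

The importance-sampling weights would multiply each sample's critic loss to compensate for the non-uniform sampling.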