# Policy Based Methods Continuous Control Project Report

This project builds on the DDPG Pendulum solution provided as part of Udacity's "deep-reinforcement-learning-master" GitHub repo:
* [DDPG Pendulum](https://github.com/udacity/deep-reinforcement-learning/tree/master/ddpg-pendulum)

While the pendulum solution interacted with an OpenAI Gym environment, this project replaced that environment with Unity's Reacher environment.

# Learning Algorithm

A PyTorch Deep Deterministic Policy Gradient (DDPG) actor/critic approach was trained to solve the environment.

In PyTorch, the neural network layers are defined in the `__init__` method and the forward pass is defined in the `forward` method, which is invoked automatically when an instance of the class is called.

Two networks were defined, namely an Actor and a Critic.

The Actor maps each state to a deterministic action. There are two hidden layers (128 fully connected nodes). The output activation function is tanh, as the action values must lie between -1 and 1.
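A minimal sketch of an Actor along these lines is shown here. It assumes that both hidden layers have 128 units and use ReLU activations; the hidden activation and the names `fc1`, `fc2`, `fc3`, and `hidden_units` are illustrative assumptions, not taken from the project code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_SIZE = 33   # from the report: 33 observation variables
ACTION_SIZE = 4   # from the report: four numbers per action


class Actor(nn.Module):
    """Deterministic policy: maps a state to an action in [-1, 1]."""

    def __init__(self, state_size=STATE_SIZE, action_size=ACTION_SIZE, hidden_units=128):
        super().__init__()
        # Layers are declared in __init__ ...
        self.fc1 = nn.Linear(state_size, hidden_units)
        self.fc2 = nn.Linear(hidden_units, hidden_units)
        self.fc3 = nn.Linear(hidden_units, action_size)

    def forward(self, state):
        # ... and wired together in forward().
        x = F.relu(self.fc1(state))   # assumed hidden activation
        x = F.relu(self.fc2(x))       # assumed hidden activation
        return torch.tanh(self.fc3(x))  # tanh keeps every entry in [-1, 1]


actor = Actor()
batch_of_states = torch.randn(8, STATE_SIZE)
actions = actor(batch_of_states)  # calling the module runs forward()
```

Because of the tanh output, the actions already satisfy the environment's range constraint before any exploration noise is added.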

The Critic takes the state together with the Actor's action and estimates the action value (Q-value), which is used during training. There are two hidden layers (128 fully connected nodes). The output activation function is ReLU, as the output values need to be positive.
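A Critic of this shape could be sketched as follows. The point at which the action is concatenated (after the first hidden layer), the ReLU hidden activations, and the layer names are assumptions made for illustration; only the 128-unit hidden layers and the ReLU output come from the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Critic(nn.Module):
    """Action-value estimator: maps a (state, action) pair to one Q-value."""

    def __init__(self, state_size=33, action_size=4, hidden_units=128):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_units)
        # Assumption: the action joins the network after the first hidden layer.
        self.fc2 = nn.Linear(hidden_units + action_size, hidden_units)
        self.fc3 = nn.Linear(hidden_units, 1)

    def forward(self, state, action):
        x = F.relu(self.fc1(state))
        x = torch.cat((x, action), dim=1)  # combine state features with the action
        x = F.relu(self.fc2(x))
        return F.relu(self.fc3(x))  # ReLU output, as described in the report


critic = Critic()
states = torch.randn(8, 33)
actions = torch.rand(8, 4) * 2 - 1  # random actions in [-1, 1]
q_values = critic(states, actions)
```

The output has one non-negative value per (state, action) pair in the batch.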

The observation space consists of 33 variables corresponding to position, rotation, velocity, and angular velocities of the arm. Each action is a vector with four numbers, corresponding to torque applicable to two joints. Every entry in the action vector must be a number between -1 and 1.

An agent is defined that interacts with the environment and updates both networks.

Noise was added to the selected action values to balance exploration against exploitation. A discount factor of 0.9 and a learning rate of 1e-3 (for both networks) were used.
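The report does not name the noise process. As one possible illustration, the sketch below assumes an Ornstein-Uhlenbeck style process with Gaussian increments; the class name `OUNoise`, the helper `noisy_action`, and the values of `theta` and `sigma` are hypothetical. It also clips the noisy action back into the valid range of -1 to 1.

```python
import random


class OUNoise:
    """Ornstein-Uhlenbeck style noise (assumed; not confirmed by the report)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.1, seed=0):
        self.mu = [mu] * size
        self.theta = theta    # pull strength towards the mean (illustrative value)
        self.sigma = sigma    # noise scale (illustrative value)
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Restart the process at its mean, e.g. at the start of an episode.
        self.state = list(self.mu)

    def sample(self):
        # Drift towards the mean plus a Gaussian perturbation.
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)


def noisy_action(action, noise):
    """Add exploration noise, then clip every entry to [-1, 1]."""
    return [max(-1.0, min(1.0, a + n)) for a, n in zip(action, noise.sample())]


noise = OUNoise(size=4)
action = noisy_action([0.9, -0.9, 0.0, 0.5], noise)
```

A smaller `sigma` produces actions that stay closer to the Actor's output, which trades exploration for stability.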

At every time step, once the replay memory holds at least 256 samples, the agent samples a batch of stored experiences from it and uses them to update the networks. A soft update strategy (controlled by the hyperparameter TAU) slowly blends the weights of the regular (training) networks into the target networks, which are the ones used to compute the prediction targets and thereby stabilize training.
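The gating on memory size and the soft update can be sketched as follows. The value of `TAU`, the memory capacity, and the helper names `ready_to_learn`, `sample_batch`, and `soft_update` are assumptions for illustration; the report only states the 256-sample threshold.

```python
import random
from collections import deque

import torch
import torch.nn as nn

BATCH_SIZE = 256   # from the report: learning starts once 256 samples are stored
TAU = 1e-3         # hypothetical value; the report does not state TAU

memory = deque(maxlen=100_000)  # hypothetical capacity


def ready_to_learn(memory, batch_size=BATCH_SIZE):
    """Learning is skipped until enough experiences are stored."""
    return len(memory) >= batch_size


def sample_batch(memory, batch_size=BATCH_SIZE):
    """Draw a random batch of stored experiences without replacement."""
    return random.sample(memory, batch_size)


def soft_update(local_model, target_model, tau=TAU):
    """target <- tau * local + (1 - tau) * target, parameter by parameter."""
    with torch.no_grad():
        for target_param, local_param in zip(
            target_model.parameters(), local_model.parameters()
        ):
            target_param.copy_(tau * local_param + (1.0 - tau) * target_param)


# Tiny demonstration with two stand-in networks.
local_net = nn.Linear(2, 1)
target_net = nn.Linear(2, 1)
with torch.no_grad():
    for p in local_net.parameters():
        p.fill_(1.0)
    for p in target_net.parameters():
        p.fill_(0.0)

soft_update(local_net, target_net, tau=0.5)  # large tau only to make the blend visible
```

With a small `tau`, the target network trails the training network slowly, so the targets used in the Critic's loss change gradually between updates.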

# Plot of Rewards

![Scores Plot](./Scores_Per_Episode.png)

The environment is considered solved when the average score over 100 consecutive episodes reaches 30. My environment was solved in 182 episodes with an average score of 30.20. This was very difficult to achieve: hours were spent tweaking network structures and hyperparameters. Several hyperparameter combinations led to very promising training, only to have it break down after hitting a peak, as shown below.

![Training Breakdown](./Training_Breakdown.png)

In these runs, the scores per episode varied wildly and the networks were generally unstable. Tweaking the gamma value, making the noisy updates less noisy (reducing sigma), and using fewer nodes in the neural networks finally made this solution converge.
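The solved criterion described in this section can be checked with a rolling window over the episode scores. The sketch below is only illustrative: the function name `first_solved_episode` is hypothetical, and it reports the episode at which the 100-episode average first reaches the target, which may differ from how the episode count above was reported.

```python
from collections import deque

TARGET_SCORE = 30.0  # from the report: average score of 30
WINDOW = 100         # from the report: averaged over 100 episodes


def first_solved_episode(scores, target=TARGET_SCORE, window=WINDOW):
    """Return the 1-based episode at which the rolling average first
    reaches the target, or None if it never does."""
    recent = deque(maxlen=window)  # keeps only the latest `window` scores
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None


# Synthetic scores, for illustration only (not the project's results).
example_scores = [0.0] * 50 + [40.0] * 100
solved_at = first_solved_episode(example_scores)
```

Tracking the rolling average rather than single-episode scores also makes breakdowns like the one pictured above easier to spot, since a sustained drop pulls the average down over many episodes.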

# Ideas For Future Work

Future work could consider other policy-based methods for continuous control tasks, such as REINFORCE, TNPG, RWR, REPS, TRPO, CEM, and CMA-ES. Hyperparameter sweeps could also be used to improve the stability and speed of training the standard DDPG.