# Report on Project 2: Continuous Control
This report describes the method used to train an agent to solve a continuous control task: controlling a double-jointed arm in a Unity environment.

## Inputs

- state space: 33 (Continuous)
- action space: 4 (Continuous)
- reward structure: +0.1 for each timestep the arm is at the goal position

## Method

- The state is passed to an agent, which follows a policy to select actions.
- Because the state space is continuous, a tabular Q function is not feasible, so neural networks are used for function approximation.
- Because the action space is also continuous, methods such as DQN cannot be applied directly.
- An initial attempt with the REINFORCE method showed unstable training, which suggested that other methods would be better suited to this task.
- Therefore, the Deep Deterministic Policy Gradient (DDPG) algorithm, an Actor-Critic method, was used.
    - The Actor is the policy: it takes the state as input and outputs actions.
    - The Critic is a Q-value estimator: it evaluates the expected return of a (state, action) pair.
- The structure of the Actor is as follows:
    - F1 = ReLU (input_state (states = 33) x 128 neurons)
    - F2 = ReLU (F1 x 128 neurons)
    - F3 = ReLU (F2 x output_state (actions = 4))
- The structure of the Critic is as follows:
    - F1 = ReLU (input_state (states = 33) x 128 neurons)
    - F2 = ReLU ((F1 + action_size (= 4)) x 128 neurons), i.e. F1 is concatenated with the action before this layer
    - F3 = ReLU (F2 x 1) 
- The Actor and the Critic each use two networks with the same architecture: a local network (θ_local) and a target network (θ_target).
- The target network is soft-updated from the local network: θ_target = τ*θ_local + (1 - τ)*θ_target.
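
The layer shapes and the soft update described above can be sketched in a few lines. This is only an illustration, assuming NumPy and plain weight arrays rather than the deep learning framework actually used for training; the helper names (`init_layer`, `init_actor`, `init_critic`, `actor_forward`, `critic_forward`, `soft_update`) and the weight initialisation are hypothetical and not taken from the project code.

```python
import numpy as np

STATE_SIZE, ACTION_SIZE, HIDDEN = 33, 4, 128
rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def init_layer(n_in, n_out):
    """Dense layer parameters: small random weights and zero bias (assumed init)."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)

def init_actor():
    # 33 -> 128 -> 128 -> 4
    return [init_layer(STATE_SIZE, HIDDEN),
            init_layer(HIDDEN, HIDDEN),
            init_layer(HIDDEN, ACTION_SIZE)]

def init_critic():
    # 33 -> 128, then (128 + 4) -> 128 -> 1
    return [init_layer(STATE_SIZE, HIDDEN),
            init_layer(HIDDEN + ACTION_SIZE, HIDDEN),
            init_layer(HIDDEN, 1)]

def actor_forward(params, states):
    (w1, b1), (w2, b2), (w3, b3) = params
    f1 = relu(states @ w1 + b1)
    f2 = relu(f1 @ w2 + b2)
    return relu(f2 @ w3 + b3)  # output activation as listed in the report

def critic_forward(params, states, actions):
    (w1, b1), (w2, b2), (w3, b3) = params
    f1 = relu(states @ w1 + b1)
    # The action is concatenated with the first hidden layer's output.
    f2 = relu(np.concatenate([f1, actions], axis=1) @ w2 + b2)
    return relu(f2 @ w3 + b3)  # output activation as listed in the report

def soft_update(local_params, target_params, tau):
    """In place: theta_target = tau * theta_local + (1 - tau) * theta_target."""
    for (w_local, b_local), (w_target, b_target) in zip(local_params, target_params):
        w_target[...] = tau * w_local + (1.0 - tau) * w_target
        b_target[...] = tau * b_local + (1.0 - tau) * b_target

actor_local, actor_target = init_actor(), init_actor()
critic_local = init_critic()
states = rng.normal(size=(2, STATE_SIZE))               # batch of 2 states
actions = actor_forward(actor_local, states)            # shape (2, 4)
q_values = critic_forward(critic_local, states, actions)  # shape (2, 1)
soft_update(actor_local, actor_target, tau=1e-3)
```

With a small τ such as 1e-3, the target weights move only slightly toward the local weights on each update, which keeps the targets used for training the Critic slowly varying.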

## Hyperparameters

- BUFFER_SIZE = 1e5       # replay buffer size
- BATCH_SIZE = 128         # minibatch size
- GAMMA = 0.99            # discount factor
- TAU = 1e-3              # for soft update of target parameters
- LR_ACTOR = 2e-4       # Actor Learning Rate
- LR_CRITIC = 2e-4       # Critic Learning Rate
- Maximum number of timesteps per episode = 1000
- WEIGHT_DECAY = 0 # L2 weight decay
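
To show where `BUFFER_SIZE`, `BATCH_SIZE`, and `GAMMA` enter the algorithm, here is a minimal sketch of a replay buffer and of the Critic's one-step target. It uses only the Python standard library; the `ReplayBuffer` class, the `Experience` tuple, and the `td_target` helper are illustrative names and are not taken from the project code.

```python
import random
from collections import deque, namedtuple

BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 128        # minibatch size
GAMMA = 0.99            # discount factor

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)

class ReplayBuffer:
    """Fixed-size buffer; the oldest experiences are dropped when it is full."""

    def __init__(self, buffer_size=BUFFER_SIZE, batch_size=BATCH_SIZE, seed=0):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniform random minibatch without replacement.
        return self.rng.sample(self.memory, self.batch_size)

    def __len__(self):
        return len(self.memory)

def td_target(reward, next_q, done, gamma=GAMMA):
    """Critic target: r + gamma * Q_target(s', a'), with no bootstrap at episode end."""
    return reward + gamma * next_q * (1.0 - float(done))
```

In DDPG, `next_q` would come from the target Critic evaluated on the next state and the target Actor's action for that state; learning typically starts only once the buffer holds at least `BATCH_SIZE` experiences.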

## Rewards plot
The plot below shows the average rewards received:

![Plot of average rewards](images/plot.png)

The agent receives higher rewards as its experience, i.e. the number of episodes, increases.

Number of episodes needed to solve the environment = 235
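
One common way to derive such an episode count is to track a moving average of the per-episode scores and report the first episode at which it reaches a target. The sketch below illustrates that idea only; the function name `first_solved_episode` is hypothetical, and `target` and `window` are placeholders because this report does not state the exact criterion used.

```python
from collections import deque

def first_solved_episode(scores, target, window):
    """Return the first 1-based episode at which the mean of the last
    `window` scores is at least `target`, or None if that never happens."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None

# Toy scores, not results from this project.
toy_scores = [0, 0, 0, 0, 0, 10, 10, 10, 10, 10]
solved_at = first_solved_episode(toy_scores, target=10, window=3)
```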

## Future ideas for improving the agent's performance
- Use different neural network architectures for the Actor and the Critic.
- Implement distributed learning with 20 agents.
- Try other methods such as A3C, PPO, or D4PG for faster training and improved agent performance.
- Implement hierarchical reinforcement learning:
    - First move roughly toward the goal.
    - Then reach the goal with fine control.