# Continuous Control

---
The goal of this project is to train an agent to move a double-jointed arm to target locations in the Reacher environment from the Unity ML-Agents toolkit.

### Environment:

The environment consists of 20 double-jointed arms, each of which is given a goal location at every timestep. For every timestep that an arm stays within the goal bounds, its agent receives a +1 reward. The agent's objective is to track the goal location at every timestep and collect as much reward as possible.

The observation space consists of 33 variables corresponding to the position, rotation, velocity, and angular velocity of the arm. Each action is a vector of four numbers, corresponding to the torques applied to the two joints. Every entry in the action vector must be a number between -1 and 1. The task is considered solved when the agents reach an average score of +30 (over 100 consecutive episodes, and over all agents).
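
The completion criterion above can be sketched as a rolling check over per-episode scores. This is only an illustration: `first_solved_episode` is a hypothetical helper that is not part of this project, and it assumes each score has already been averaged over the 20 agents.

```python
from collections import deque


def first_solved_episode(episode_scores, target=30.0, window=100):
    """Return the 1-based episode at which the mean of the last `window`
    scores first reaches `target`, or None if that never happens."""
    recent = deque(maxlen=window)  # keeps only the most recent `window` scores
    for episode, score in enumerate(episode_scores, start=1):
        recent.append(score)
        # Only test once a full window of consecutive episodes is available.
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None
```

Passing the list of scores collected during training or evaluation returns the episode at which the environment counts as solved under this definition.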

In [1]:
import sys

import numpy as np
from unityagents import UnityEnvironment

sys.path.append('./dist_ppo')  # make the local agent module importable
from agent import TrainedAgent

In [2]:
env = UnityEnvironment(file_name='dist_ppo/Reacher.app')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 20
Size of each action: 4
There are 20 agents. Each observes a state with length: 33
The state for the first agent looks like: [ 0.00000000e+00 -4.00000000e+00  0.00000000e+00  1.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.00000000e+01  0.00000000e+00
  1.00000000e+00 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  5.75471878e+00 -1.00000000e+00
  5.55726624e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
 -1.68164849e-01]


In [5]:
def get_actions(agent, states):
    return zip(*[agent.act(obs) for obs in states])

def run_agent(env, agent):
    env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
    states = env_info.vector_observations                  # get the current state (for each agent)
    scores = np.zeros(num_agents)                          # initialize the score (for each agent)
    while True:
        actions = agent.get_action(states)
        env_info = env.step(np.array(actions))[brain_name]  # send all actions to the environment
        next_states = env_info.vector_observations         # get next state (for each agent)
        rewards = env_info.rewards                         # get reward (for each agent)
        dones = env_info.local_done                        # see if episode finished
        scores += rewards                                  # update the score (for each agent)
        states = next_states                               # roll over states to next time step
        if np.any(dones):                                  # exit loop if episode finished
            break
    return np.mean(scores)

### Approach
The agent is trained using the Proximal Policy Optimization (PPO) algorithm. A policy network and a value network are trained in parallel, in actor-critic style.

#### Proximal Policy Optimization (PPO)
***TRPO surrogate function***  

PPO is an improvement on Trust Region Policy Optimization (TRPO). TRPO maximizes a “surrogate” objective

\begin{equation}
     L^{CPI}(\theta)= \hat{\mathop{\mathbb{E}}}_{t}\bigg[\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})}\hat{A}_{t}\bigg]
\label{lcpi}
\end{equation}
The same equation can be written as 

\begin{equation}
L^{CPI}(\theta)= \hat{\mathop{\mathbb{E}}}_{t}\big[ratio_{t}(\theta) \hat{A}_{t}\big]
\label{lcpi_2}
\end{equation}

where $ratio_{t}(\theta) = \frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})}$
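
As a minimal sketch of the two quantities above, the probability ratio can be computed from log-probabilities and averaged against the advantages. The function names are hypothetical and the inputs are plain Python lists; the project's own implementation may compute this differently.

```python
import math


def probability_ratios(new_log_probs, old_log_probs):
    """ratio_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), computed as
    exp(log pi_theta - log pi_theta_old) for numerical stability."""
    return [math.exp(new - old) for new, old in zip(new_log_probs, old_log_probs)]


def cpi_objective(ratios, advantages):
    """Empirical mean over timesteps of ratio_t * A_t (the L^CPI objective)."""
    return sum(r * a for r, a in zip(ratios, advantages)) / len(ratios)
```

When the new and old policies are identical, every ratio equals 1 and the objective reduces to the mean advantage.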


***Clipped surrogate function***  

PPO uses a modified version of the TRPO objective that penalizes changes to the policy that move $ratio_{t}(\theta)$ away from 1, by clipping the ratio to the interval $[1-\epsilon, 1+\epsilon]$.

\begin{equation}
     L^{CLIP}(\theta)= \hat{\mathop{\mathbb{E}}}_{t}\big[\min(ratio_{t}(\theta)\hat{A}_{t}, clip(ratio_{t}(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_{t})\big]
\label{lclip}
\end{equation}
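
The clipped objective can be sketched in a few lines of plain Python. The function name is hypothetical, the default `epsilon` is taken from the hyperparameter table in this document, and the ratios and advantages are assumed to be precomputed lists.

```python
def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Empirical mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for ratio, advantage in zip(ratios, advantages):
        # Restrict the ratio to the interval [1 - epsilon, 1 + epsilon].
        clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        # Taking the minimum gives a pessimistic bound on the unclipped term.
        total += min(ratio * advantage, clipped_ratio * advantage)
    return total / len(ratios)
```

With a positive advantage, a ratio of 1.5 contributes only 1.2 times the advantage, so there is no gain from pushing the ratio further above $1+\epsilon$. With a negative advantage the same ratio is not clipped, so the objective still penalizes the move.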

***Generalized Advantage Estimate (GAE)***

\begin{equation}
\hat{A}_{t} = \delta_{t} + (\gamma\lambda)\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t-1}\delta_{T-1}
\label{ae}
\end{equation}

where $\delta_{t} = r_{t} + \gamma V(s_{t+1}) - V(s_{t})$


Where  
    $\theta$ is the parameter vector of the policy function,  
    $\gamma$ is the discount factor,  
    $\epsilon$ is the clipping parameter of the probability ratio,  
    $s_{t}$ is the current state,  
    $s_{t+1}$ is the next state,  
    $a_{t}$ is the current action,  
    $\lambda$ is the generalized advantage estimate parameter,  
    $\hat{A}_{t}$ is the generalized advantage estimate,  
    $V$ is the value function,  
    $L$ is the surrogate function,  
    $r_{t}$ is the reward at timestep $t$
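
The advantage estimate can be computed with a single backward pass over a trajectory, since $\hat{A}_{t} = \delta_{t} + \gamma\lambda\hat{A}_{t+1}$. The sketch below is an illustration under stated assumptions: the function name is hypothetical, `values` holds one extra entry for the state after the last step, the defaults come from the hyperparameter table in this document, and episode terminations inside the trajectory are ignored for brevity.

```python
def generalized_advantages(rewards, values, gamma=0.99, lam=0.6):
    """Compute GAE for one trajectory.

    rewards: r_0 ... r_{T-1}
    values:  V(s_0) ... V(s_T), i.e. one more entry than rewards
    """
    advantages = [0.0] * len(rewards)
    running = 0.0  # holds A_{t+1} while iterating backwards
    for t in reversed(range(len(rewards))):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=0` reduces every advantage to its one-step TD error, while `lam=1` sums all discounted TD errors to the end of the trajectory.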

#### Algorithm
<img src="docs/algorithm.png" style="width:600px;height:200px;"> 

#### Model Architecture
For the parameterized policy and value functions I used the following architectures, where FC is a fully connected layer and ReLu denotes a rectified linear unit (ReLU).

*Policy Network*

\begin{equation}
    [State]_{(1\times33)}  \longrightarrow [FC]_{(33\times32)} \longrightarrow [ReLu] \longrightarrow [FC]_{(32\times32)} \longrightarrow [ReLu] \longrightarrow [FC]_{(32\times4)} \longrightarrow [Actions]_{(1\times4)}
\label{model}
\end{equation}

*Value Network*

\begin{equation}
    [State]_{(1\times33)}  \longrightarrow [FC]_{(33\times32)} \longrightarrow [ReLu] \longrightarrow [FC]_{(32\times32)} \longrightarrow [ReLu] \longrightarrow [FC]_{(32\times1)} \longrightarrow [StateValue]_{(1\times1)}
\label{valuemodel}
\end{equation}
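
To make the layer shapes concrete, here is a NumPy sketch of a forward pass through both networks with randomly initialized weights. It only mirrors the shapes in the diagrams: `init_mlp` and `forward` are hypothetical helpers, the weight initialization is an arbitrary choice, and how the policy output is turned into an action in [-1, 1] is not specified here.

```python
import numpy as np


def init_mlp(sizes, rng):
    """Create one (weight, bias) pair per pair of consecutive layer sizes."""
    return [(rng.standard_normal((n_in, n_out)) * 0.1, np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def forward(params, states):
    """FC -> ReLU -> FC -> ReLU -> FC, with no activation on the last layer."""
    x = states
    for weight, bias in params[:-1]:
        x = np.maximum(x @ weight + bias, 0.0)  # fully connected layer + ReLU
    weight, bias = params[-1]
    return x @ weight + bias


rng = np.random.default_rng(0)
policy_params = init_mlp([33, 32, 32, 4], rng)  # state -> action vector
value_params = init_mlp([33, 32, 32, 1], rng)   # state -> state value

states = rng.standard_normal((20, 33))          # one observation per arm
print(forward(policy_params, states).shape)     # (20, 4)
print(forward(value_params, states).shape)      # (20, 1)
```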


#### Hyperparameters

|Parameter|Description|Value|
|:---:|:---:|:---:|
|PPO Epochs| Number of times a trajectory is used to train | 4|
|$\epsilon_{clip}$| Epsilon value of the probability ratio clipping function|0.2|
|Trajectory size|Maximum length of a trajectory| 33|
|$\gamma$|Discount factor|0.99|
|$\alpha_{policy}$| Policy network optimizer learning rate| 0.0001|
|$\alpha_{value}$| Value network optimizer learning rate| 0.0005|
|$\lambda$| Generalized Advantage Estimation (GAE) parameter| 0.6|
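
One way to keep these values together in code is a small immutable configuration object. This is a sketch only: `PPOConfig` and its field names are hypothetical and do not come from the project's code, while the values are copied from the table above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PPOConfig:
    ppo_epochs: int = 4        # times each trajectory is reused for training
    clip_epsilon: float = 0.2  # clipping range of the probability ratio
    trajectory_size: int = 33  # maximum length of a trajectory
    gamma: float = 0.99        # discount factor
    policy_lr: float = 1e-4    # policy network optimizer learning rate
    value_lr: float = 5e-4     # value network optimizer learning rate
    gae_lambda: float = 0.6    # GAE parameter


config = PPOConfig()
# Derive a variant for a tuning run without mutating the original.
tuned = replace(config, gae_lambda=0.95)
```

Freezing the dataclass avoids accidental changes during a run, and `replace` makes each tuning variant explicit.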


In [6]:
agent = TrainedAgent(checkpoint_path='dist_ppo/checkpoints/dist_ppo.pth.tar')

In [7]:
scores = []
for _ in range(200):
    score = run_agent(env, agent)
    scores.append(score)

print('Mean score over {} episodes (averaged over agents): {}'.format(len(scores), np.mean(scores)))

Mean score over 200 episodes (averaged over agents): 35.910741697332355


In [8]:
env.close()

In [9]:
from bokeh.plotting import figure 
from bokeh.models import Legend
from bokeh.layouts import column
from bokeh.io import output_notebook, show
output_notebook()

### Results

In [34]:
PLOT_WIDTH = 900
PLOT_HEIGHT = 300
LINE_WIDTH = 2


def get_figure(data, x_axis_label, y_axis_label):
    """Return a Bokeh line plot of `data` against its index."""
    fig = figure(
        plot_width=PLOT_WIDTH,
        plot_height=PLOT_HEIGHT,
        y_axis_label=y_axis_label,
        x_axis_label=x_axis_label
    )
    fig.line(range(len(data)), data, line_width=LINE_WIDTH)
    return fig

plots = []

plots.append(get_figure(scores, 'Episodes', 'Avg Episodic Reward over 20 arms'))
plots.append(get_figure([np.mean(scores[:i+1][-100:]) for i in range(len(scores))],
                        'Episodes', 'Moving Avg Episodic Reward over 20 arms'))


main_row = column(*plots)
show(main_row)


Plot 1 shows the reward received by the agent per episode, averaged over the 20 arms. Plot 2 shows the moving average of that reward over the most recent 100 episodes. The end goal of the project was to train an agent that receives an average reward (over 100 episodes, and over all 20 agents) of at least +30. Plot 2 shows that the PPO agent presented in this project achieves an average reward above 30. The agent took about 2500 training episodes to learn.

#### Future work
- I would like to extend this to work with new environments.
- The agent performs well with the current hyperparameters, but it took around 2500 episodes to reach this level. I would like to tune the hyperparameters further to see whether the agent can learn faster.

#### References
- [Proximal Policy Optimization Algorithms](https://arxiv.org/pdf/1707.06347.pdf)
- [60 Days RL Challenge](https://github.com/andri27-ts/60_Days_RL_Challenge/tree/master/Week5)
