## Architecture

My Hierarchical Reinforcement Learning architecture consists of a lower-level PPO policy (Actions), which chooses actions for the environment, and a higher-level PPO policy (Goaly), which sets goals for the Actions policy. An inverse-dynamics model is also trained to predict both the action and the goal taken at each step.

<img src="goaly_model.png" alt="" width="600"/>


## Intrinsic Reward

Both policies are trained separately on extrinsic and intrinsic rewards. The extrinsic reward is simply the reward from the environment, while the intrinsic reward is calculated from the predictions made by the inverse dynamics model. The inverse dynamics model is a neural network that predicts actions and goals. Training this network amounts to learning a function *m* defined as follows:

\begin{equation*}
\hat{a}_t, \hat{g}_t = m(s_t, s_{t+1}; \phi_I)
\end{equation*}

where $\hat{a}_t$ is the predicted estimate of the action $a_t$ and $\hat{g}_t$ is the predicted estimate of the goal $g_t$. The neural network parameters $\phi_I$ are trained to optimize

\begin{equation*}
\min_{\phi_I} L_I(\hat{a}_t, \hat{g}_t, a_t, g_t)
\end{equation*}

where $L_I$ is the loss function that measures the difference between the predicted actions and goals and the actions and goals observed during the agent's interaction with the environment.
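
The text does not say which form $L_I$ takes, so the following is only a minimal sketch of one possible choice: mean squared error over the action components plus squared error on the scalar goal. The function name `inverse_dynamics_loss` and the `goal_weight` parameter are hypothetical and only serve the illustration.

```python
from typing import Sequence


def inverse_dynamics_loss(
    a_hat: Sequence[float],
    a: Sequence[float],
    g_hat: float,
    g: float,
    goal_weight: float = 1.0,  # hypothetical weighting, not taken from the text
) -> float:
    """One possible L_I: mean squared action error plus squared goal error.

    Assumes continuous actions given as equal-length sequences and a goal
    represented by a single scalar.
    """
    if len(a_hat) != len(a) or len(a) == 0:
        raise ValueError("actions must be non-empty and of equal length")
    # Mean squared difference between predicted and observed action components.
    action_term = sum((p - o) ** 2 for p, o in zip(a_hat, a)) / len(a)
    # Squared difference between predicted and observed scalar goal.
    goal_term = (g_hat - g) ** 2
    return action_term + goal_weight * goal_term


# A perfect prediction gives zero loss; errors in either head increase it.
print(inverse_dynamics_loss([1.0, 2.0], [1.0, 2.0], g_hat=3.0, g=3.0))
print(inverse_dynamics_loss([1.0, 0.0], [0.0, 0.0], g_hat=2.0, g=1.0))
```

In practice this quantity would be minimized with respect to $\phi_I$ by whatever gradient-based optimizer trains the network; the sketch only covers evaluating the loss.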

### Stability
The main purpose of the intrinsic reward is to keep the Actions policy's interpretation of goals stable. Its second purpose is to encourage the assignment of new goals when exploring new areas of the state space. The stability reward uses two values computed from the predictions of the inverse model *m*:

\begin{equation*}
\bar{a} = \frac{\hat{a}_t - a_t}{\hat{a}_t + a_t}
\end{equation*}
where the denominator ensures that the reward does not favor low-amplitude actions. The denominator is capped to avoid division by zero and to keep the normalized difference near the $[0, 1]$ range.

\begin{equation*}
\bar{g} = \frac{\hat{g}_t - g_t}{g_{count}}
\end{equation*}
where goals are represented by a scalar (not a one-hot vector). This penalizes small goal differences less than large ones, so the goal space is treated as a continuum rather than as a set of completely independent values.

Finally, the stability reward is defined as

\begin{equation*}
S = 1 + \bar{g}(\bar{a} - 1) + \bar{a}(\bar{g} - 1)
\end{equation*}
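
To make the behavior of $S$ concrete, here is a small sketch of the three formulas for a single scalar action. Several details are assumptions rather than part of the definitions above: absolute values are used so that $\bar{a}$ and $\bar{g}$ fall in $[0, 1]$, the denominator cap is implemented as a lower bound `eps`, and `goal_count` is taken to be the total number of goals ($g_{count}$). All function names are hypothetical.

```python
def normalized_action_error(a_hat: float, a: float, eps: float = 1e-6) -> float:
    """a_bar: action error scaled by the combined amplitude.

    Assumption: absolute values keep the result in [0, 1]; `eps` bounds the
    denominator from below to avoid division by zero.
    """
    return abs(a_hat - a) / max(abs(a_hat) + abs(a), eps)


def normalized_goal_error(g_hat: float, g: float, goal_count: int) -> float:
    """g_bar: scalar goal error scaled by the number of goals (assumed g_count)."""
    return abs(g_hat - g) / goal_count


def stability_reward(a_bar: float, g_bar: float) -> float:
    """S = 1 + g_bar * (a_bar - 1) + a_bar * (g_bar - 1)."""
    return 1.0 + g_bar * (a_bar - 1.0) + a_bar * (g_bar - 1.0)


# Evaluate S at the four corners of the [0, 1] x [0, 1] error square.
for a_bar, g_bar in [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]:
    print(a_bar, g_bar, stability_reward(a_bar, g_bar))
```

At the corners, $S$ equals 1 when both errors are 0 or both are 1, and 0 when exactly one of them is 1. Under the assumptions above, the reward is therefore highest when the action and goal predictions are either both accurate or both inaccurate, and lowest when only one of them is off.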




## Exploration

