

## Agent architecture

### PPOBuffer
* PPO
 * ``obs_buf (float) (size, obs_dim)``: Environment observations.
 * ``act_buf (float) (size, act_dim)``: Actions taken in the environment.
 * ``adv_buf (float) (size)``: GAE-Lambda advantage estimates.
  * ``deltas = rews[:-1] + self.gamma * vals[1:] - vals[:-1]``
  * ``self.adv_buf[path_slice] = core.discount_cumsum(deltas, self.gamma * self.lam)``
 * ``rew_buf (float) (size)``: Reward received at each step.
 * ``ret_buf (float) (size)``: Rewards-to-go: the discounted sum of rewards from this step onward, used as the
   target for training the value function.
  * ``self.ret_buf[path_slice] = core.discount_cumsum(rews, self.gamma)[:-1]``
 * ``val_buf (float) (size)``: Value function estimate at each step.
 * ``logp_buf (float) (size)``: Log probability of the action taken at each step.
 * ``gamma (float)``: Discount factor for future rewards. (Always between 0 and 1.)
 * ``lam (float)``: Lambda for GAE-Lambda. (Always between 0 and 1, close to 1.)
 * ``path_start_idx``: Index where the current trajectory through the environment started.
  * When a trajectory ends, the buffer looks back to this index and uses the rewards and value estimates from
    the whole trajectory to compute advantage estimates with GAE-Lambda, and to compute the
    rewards-to-go for each state, which serve as the targets for the value function.
 * ``ptr``: Current index into the buffer.
 * ``self.max_size``: Buffer size limit.
* Inverse Model
 * ``goal_buf (float) (size, goals_dim)``
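
The two buffer formulas above can be sketched in plain NumPy. This is a minimal illustration, not the project's implementation: `discount_cumsum` here is a loop-based stand-in for `core.discount_cumsum`, and the `finish_path` signature, the `last_val` bootstrap argument, and the default `gamma` and `lam` values are assumptions made for the example.

```python
import numpy as np

def discount_cumsum(x, discount):
    """Return y with y[t] = sum over k of discount**k * x[t + k]."""
    out = np.zeros(len(x), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(x))):
        running = x[t] + discount * running
        out[t] = running
    return out

def finish_path(rews, vals, last_val, gamma=0.99, lam=0.95):
    """Compute GAE-Lambda advantages and rewards-to-go for one trajectory.

    last_val (assumed argument) is the value estimate used to bootstrap
    past the final step; use 0 if the trajectory ended in a terminal state.
    """
    rews = np.append(np.asarray(rews, dtype=np.float64), last_val)
    vals = np.append(np.asarray(vals, dtype=np.float64), last_val)
    # One-step TD residuals, as in the adv_buf formula.
    deltas = rews[:-1] + gamma * vals[1:] - vals[:-1]
    adv = discount_cumsum(deltas, gamma * lam)
    # Discounted rewards-to-go, as in the ret_buf formula.
    ret = discount_cumsum(rews, gamma)[:-1]
    return adv, ret

# With zero value estimates and lam = 1, advantages equal rewards-to-go.
adv, ret = finish_path(rews=[1.0, 1.0, 1.0], vals=[0.0, 0.0, 0.0],
                       last_val=0.0, gamma=0.5, lam=1.0)
print(adv, ret)
```

In the buffer itself, the two results would be written into ``adv_buf[path_slice]`` and ``ret_buf[path_slice]``, where ``path_slice`` runs from ``path_start_idx`` to ``ptr``.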

## actor_critic
* ``pi`` (batch, act_dim): Samples actions from the policy given states.
* ``logp`` (batch,): Gives the log probability, according to the policy, of taking actions ``a_ph``
in states ``x_ph``.
* ``logp_pi`` (batch,): Gives the log probability, according to the policy, of the action
sampled by ``pi``.
* ``v`` (batch,): Gives the value estimate for the states in ``x_ph``. (Critical: make sure
to flatten this!)
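
The output shapes listed above can be checked with a small NumPy sketch. It assumes a diagonal Gaussian policy, which this document does not specify; `gaussian_logp`, the zero-valued `mu`, and the constant `log_std` are hypothetical stand-ins for the network outputs.

```python
import numpy as np

def gaussian_logp(a, mu, log_std):
    """Log-density of a under a diagonal Gaussian, summed over act_dim."""
    pre_sum = -0.5 * (((a - mu) / np.exp(log_std)) ** 2
                      + 2 * log_std + np.log(2 * np.pi))
    return pre_sum.sum(axis=1)

rng = np.random.default_rng(0)
batch, act_dim = 4, 2
mu = np.zeros((batch, act_dim))        # stand-in for the policy network output
log_std = np.full(act_dim, -0.5)       # stand-in for a learned log std
a = rng.normal(size=(batch, act_dim))  # stand-in for a_ph

pi = mu + rng.normal(size=mu.shape) * np.exp(log_std)  # (batch, act_dim)
logp = gaussian_logp(a, mu, log_std)                   # (batch,)
logp_pi = gaussian_logp(pi, mu, log_std)               # (batch,)

# A value head usually emits (batch, 1); squeeze it to (batch,) so it
# lines up with the (batch,) return and advantage vectors.
v = np.zeros((batch, 1)).squeeze(axis=1)
print(pi.shape, logp.shape, logp_pi.shape, v.shape)
```

Leaving ``v`` as ``(batch, 1)`` would make a subtraction against a ``(batch,)`` target broadcast to ``(batch, batch)``, which is why the flattening step matters.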


## Inverse Model
* inputs: 
 * ``obs (float) (2, obs_dim)``, flattened
* predict: 
 * ``action_predicted (float) (act_dim)``
 * ``action_probability = exp(logp_pi)``
 * ``goal_predicted (float) (goals_dim)``
* loss: 
 * ``action_error = (action - action_predicted)``
 * ``inverse_model_goal_error = (goal - goal_predicted) * (action_error * action_probability + inverse_model_goal_error_base)``
 * ``inverse_model_goal_error_base``: Parameter in (0, 1) that sets the minimum correction for goals, even when the action prediction is very accurate (``action_error ~= 0``).
 * ``inverse_model_loss = mean_squared_error(action_error) + mean_squared_error(inverse_model_goal_error)``
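
The loss terms above can be sketched in NumPy as follows. The formulas do not say how an ``act_dim``-sized action error multiplies a ``goals_dim``-sized goal error, so this sketch assumes the action error is reduced to one mean absolute value per sample first. That reduction, the batch dimension, and the default value of the base parameter are all assumptions made for the example.

```python
import numpy as np

def inverse_model_loss(action, action_predicted, goal, goal_predicted,
                       action_probability, inverse_model_goal_error_base=0.1):
    """Sketch of the inverse model loss for a batch of samples.

    action, action_predicted: (batch, act_dim)
    goal, goal_predicted:     (batch, goals_dim)
    action_probability:       (batch,), i.e. exp(logp_pi)
    """
    action_error = action - action_predicted
    # Assumption: collapse the action error to one magnitude per sample.
    action_error_scale = np.abs(action_error).mean(axis=1, keepdims=True)
    weight = (action_error_scale * action_probability[:, None]
              + inverse_model_goal_error_base)
    goal_error = (goal - goal_predicted) * weight
    return np.mean(action_error ** 2) + np.mean(goal_error ** 2)

loss = inverse_model_loss(
    action=np.array([[1.0, 0.0]]),
    action_predicted=np.array([[0.0, 0.0]]),
    goal=np.array([[1.0, 0.0]]),
    goal_predicted=np.array([[0.0, 0.0]]),
    action_probability=np.array([0.5]),
)
print(loss)
```

When the action is predicted exactly, the goal error is scaled by the base parameter alone, so the goal head still receives a small correction.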


## Action Policy Model

* action policy reward: The action selection policy is rewarded for stability, meaning that it is rewarded for selecting actions consistently when observations repeat. The action policy is parametrized by goals, and different goals should produce different trajectories through the environment. The stability reward penalizes well-known actions (as judged by action error) if they are taken in the context of the wrong goal. This should steer the action policy to choose new actions in the context of new goals.
 * ``reward_goal_error = (goal - goal_predicted)``
 * ``reward_action_error = (action - action_predicted) * action_probability``
 * ``action_error_discount``: How much the action error should reduce the stability bonus. A large
   error in predicting the action in the inverse model indicates that the action policy is exploring
   a new region of the environment. In this case, guessing the wrong goal should not penalize the policy.
 * ``action_error_factor``: How much the action error contributes to the stability loss.  
 * ``goal_error_factor``: How much the goal error contributes to the stability loss.  
 * ``stability_loss = goal_error_factor * reward_goal_error * (1 - reward_action_error) + action_error_factor * reward_action_error * (1 - reward_goal_error)``
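
To see how the two terms of ``stability_loss`` interact, here is a minimal sketch that evaluates the formula at its corner cases. It assumes both errors have already been reduced to scalars in the range 0 to 1, which the text above does not state, and the default factor values of 1.0 are placeholders.

```python
def stability_loss(reward_goal_error, reward_action_error,
                   goal_error_factor=1.0, action_error_factor=1.0):
    """Stability loss for one step; both errors assumed to lie in [0, 1]."""
    return (goal_error_factor * reward_goal_error * (1 - reward_action_error)
            + action_error_factor * reward_action_error * (1 - reward_goal_error))

# Well-known action (low action error) under the wrong goal: full penalty.
known_action_wrong_goal = stability_loss(1.0, 0.0)
# Well-known action under the right goal: no penalty.
known_action_right_goal = stability_loss(0.0, 0.0)
# New action (high action error) under a wrong goal: no penalty.
new_action_wrong_goal = stability_loss(1.0, 1.0)
print(known_action_wrong_goal, known_action_right_goal, new_action_wrong_goal)
```

Under these assumptions a wrong goal costs nothing while the policy is exploring, and is penalized only once the action itself has become predictable.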


## Goal Policy Reward
* goal policy: Selects the goals that are given to the Action Policy as input. The goal policy is a PPO policy that maximizes long-term rewards. The rewards are both external and internal: the external reward is given by the environment, and the internal reward is the same stability reward used for the action policy. Both rewards are attenuated over time through habituation. The internal reward is habituated by a factor that is separate for each goal. When a goal is selected, its factor is decreased by ``stability_habituation_factor``; when the goal is not selected, its factor is increased by ``stability_restoration_factor``. The external rewards are similarly controlled by ``reward_habituation_factor`` and ``reward_restoration_factor``, depending on the reward received on a given step: if the reward is above the mean, the reward is habituated; otherwise it is restored.
 * ``stability_habituation`` (float)(num_goals): vector of current habituation factors.  Initially set to 1 for each goal.
 * ``reward_habituation`` (float)(num_rewards): scalar indicating current reward habituation, initially set to 1.
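
One possible reading of the habituation updates is sketched below. The text does not say whether "decreased by" is additive or multiplicative, nor how the factors are bounded or how the mean reward is tracked. This sketch assumes additive steps, clipping to the range 0 to 1, and a `mean_reward` passed in by the caller; the function names and default step sizes are hypothetical.

```python
import numpy as np

def update_stability_habituation(habituation, selected_goal,
                                 stability_habituation_factor=0.1,
                                 stability_restoration_factor=0.01):
    """Lower the factor of the selected goal and restore all other goals."""
    selected = np.zeros(habituation.shape, dtype=bool)
    selected[selected_goal] = True
    updated = np.where(selected,
                       habituation - stability_habituation_factor,
                       habituation + stability_restoration_factor)
    # Assumption: factors stay within [0, 1].
    return np.clip(updated, 0.0, 1.0)

def update_reward_habituation(habituation, reward, mean_reward,
                              reward_habituation_factor=0.1,
                              reward_restoration_factor=0.01):
    """Habituate when the reward is above the mean, otherwise restore."""
    if reward > mean_reward:
        step = -reward_habituation_factor
    else:
        step = reward_restoration_factor
    return float(np.clip(habituation + step, 0.0, 1.0))

h = np.ones(3)                          # initially 1 for each goal
h = update_stability_habituation(h, 1)  # goal 1 selected
h = update_stability_habituation(h, 0)  # goal 0 selected, goal 1 recovers
r = update_reward_habituation(1.0, reward=2.0, mean_reward=1.0)
print(h, r)
```

The habituated rewards would then be the raw stability and external rewards multiplied by the corresponding factors, again under the assumptions stated above.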

## Inverse Dynamics

## Lower Level Policy

## Higher Level Policy



